diff --git a/.dockerignore b/.dockerignore index 05ec59a..ab6a1f8 100644 --- a/.dockerignore +++ b/.dockerignore @@ -2,4 +2,3 @@ !Dockerfile !docker/entrypoint.sh !target/release/omnigraph-server -!target/release/omnigraph diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS new file mode 100644 index 0000000..d4ecfa5 --- /dev/null +++ b/.github/CODEOWNERS @@ -0,0 +1,18 @@ +# AUTOGENERATED from .github/codeowners-roles.yml. Do not edit by hand. +# +# To change role membership or path assignments: +# 1. Edit .github/codeowners-roles.yml +# 2. Run `python3 .github/scripts/render-codeowners.py` +# 3. Commit both files together +# +# CI fails if this file drifts from its source, and rejects PRs that +# edit this file directly without also editing the yml. + +* @ragnorc + +crates/** @ragnorc +docs/** @ragnorc +README.md @ragnorc +AGENTS.md @ragnorc +CLAUDE.md @ragnorc +SECURITY.md @ragnorc diff --git a/.github/DISCUSSION_TEMPLATE/rfc.yml b/.github/DISCUSSION_TEMPLATE/rfc.yml deleted file mode 100644 index 2a63525..0000000 --- a/.github/DISCUSSION_TEMPLATE/rfc.yml +++ /dev/null @@ -1,34 +0,0 @@ -labels: ["rfc"] -body: - - type: markdown - attributes: - value: | - Use this to **incubate an RFC** β€” socialize a design and reach rough - consensus before writing the formal document. When it's ready, graduate - it into a pull request that adds `docs/rfcs/NNNN-title.md` - (see [docs/rfcs/README.md](../blob/main/docs/rfcs/README.md)); a - maintainer merging that PR is acceptance. - - For a plain feature request or open-ended idea, use the **Ideas** - category instead. For bugs, open an [Issue](../../issues/new/choose). - - type: textarea - id: problem - attributes: - label: Problem / motivation - description: What needs solving, and why is it worth the long-run cost? - validations: - required: true - - type: textarea - id: sketch - attributes: - label: Proposed direction (sketch) - description: A rough shape of the design. Detail comes later in the RFC document. 
- validations: - required: true - - type: textarea - id: invariants - attributes: - label: Invariants touched - description: Which items in docs/dev/invariants.md does this affect or risk? Any deny-list brush? - validations: - required: false diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml deleted file mode 100644 index 8e19465..0000000 --- a/.github/ISSUE_TEMPLATE/bug_report.yml +++ /dev/null @@ -1,55 +0,0 @@ -name: Bug report -description: Report a reproducible problem or wrong behavior in OmniGraph. -title: "bug: " -labels: ["bug", "needs-triage"] -body: - - type: markdown - attributes: - value: | - Issues are for **reporting problems** — concrete, reproducible bugs. - For ideas, feature requests, or questions, please use - [Discussions](../../discussions) instead. - For a security vulnerability, follow [SECURITY.md](../../blob/main/SECURITY.md) — do **not** file it here. - - A maintainer will triage this; once labelled **`accepted`** it's open for a pull request - (see [GOVERNANCE.md](../../blob/main/GOVERNANCE.md)). - - type: textarea - id: what-happened - attributes: - label: What happened - description: What went wrong, and what you expected instead. - validations: - required: true - - type: textarea - id: repro - attributes: - label: Steps to reproduce - description: Minimal steps, commands, schema/query, or a failing snippet. - placeholder: | - 1. omnigraph init ... - 2. omnigraph ... - 3. observed: ... / expected: ... - validations: - required: true - - type: input - id: version - attributes: - label: Version - description: Output of `omnigraph --version` (or the engine/crate version) and how you installed it. - validations: - required: true - - type: input - id: environment - attributes: - label: Environment - description: OS, architecture, and storage backend (local FS / S3 / RustFS / MinIO). 
- validations: - required: false - - type: textarea - id: logs - attributes: - label: Logs / output - description: Relevant error text or logs. Will be rendered as code. - render: shell - validations: - required: false diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml deleted file mode 100644 index 50720b8..0000000 --- a/.github/ISSUE_TEMPLATE/config.yml +++ /dev/null @@ -1,13 +0,0 @@ -# Issues are for problem reports only. Disable blank issues so everything is -# routed: bugs through the form, everything else to Discussions / SECURITY.md. -blank_issues_enabled: false -contact_links: - - name: 💡 Idea, feature request, or RFC - url: https://github.com/ModernRelay/omnigraph/discussions - about: Propose features and designs in Discussions. RFCs graduate from there into a docs/rfcs/ pull request. - - name: ❓ Question or help - url: https://github.com/ModernRelay/omnigraph/discussions - about: Ask in Discussions — questions are not tracked as Issues. - - name: 🔒 Security vulnerability - url: https://github.com/ModernRelay/omnigraph/blob/main/SECURITY.md - about: Report security issues privately per SECURITY.md — never as a public Issue. 
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md deleted file mode 100644 index 2a548c7..0000000 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ /dev/null @@ -1,29 +0,0 @@ - - -## What & why - - - -## Backing issue / RFC - - - -- [ ] Fixes an **accepted** issue: Closes # -- [ ] Implements / is an **accepted** RFC: -- [ ] **Trivial fast-lane** (typo / docs / dependency bump / comment / one-line CI) — no issue/RFC required - -## Checklist - -- [ ] Change is focused (one logical change) -- [ ] Tests added/updated for behavior changes (or N/A) -- [ ] Public docs updated if user-facing surface changed (or N/A) -- [ ] Reviewed against [docs/dev/invariants.md](../blob/main/docs/dev/invariants.md) — no Hard Invariant weakened, no deny-list item hit (or justified) - -## Notes for reviewers - - diff --git a/.github/branch-protection.json b/.github/branch-protection.json index fa1d57f..61b7d33 100644 --- a/.github/branch-protection.json +++ b/.github/branch-protection.json @@ -1,18 +1,22 @@ { - "_comment": "Branch protection policy for main. Applied via scripts/apply-branch-protection.sh. See docs/dev/branch-protection.md for rationale. CODEOWNERS was removed (2-person team where both maintainers own everything, so code-owner review added friction without value). Review is no longer code-owner-scoped and no approvals are required; the required CI status checks are the gate. Maintainers merge their own PRs once checks pass.", + "_comment": "Branch protection policy for main. Applied via scripts/apply-branch-protection.sh. 
See docs/branch-protection.md for rationale.", "required_status_checks": { "strict": true, "contexts": [ "Classify Changes", "Check AGENTS.md Links", - "Test omnigraph-server --features aws" + "Test Workspace", + "Test omnigraph-server --features aws", + "CODEOWNERS / drift", + "CODEOWNERS / noedit" ] }, "enforce_admins": false, "required_pull_request_reviews": { - "dismiss_stale_reviews": false, - "require_code_owner_reviews": false, - "required_approving_review_count": 0, + "dismissal_restrictions": {}, + "dismiss_stale_reviews": true, + "require_code_owner_reviews": true, + "required_approving_review_count": 1, "require_last_push_approval": false }, "restrictions": null, diff --git a/.github/codeowners-roles.yml b/.github/codeowners-roles.yml new file mode 100644 index 0000000..c5e36a9 --- /dev/null +++ b/.github/codeowners-roles.yml @@ -0,0 +1,54 @@ +# Source of truth for .github/CODEOWNERS. +# +# How to change role membership or path assignments: +# 1. Edit this file. +# 2. Run `python3 .github/scripts/render-codeowners.py` to regenerate +# .github/CODEOWNERS. +# 3. Commit both files in the same PR. +# +# CI fails on drift between this source and the generated CODEOWNERS +# (see .github/workflows/codeowners.yml). CI also rejects direct edits +# to .github/CODEOWNERS that don't accompany a change here. +# +# Why a generator instead of editing CODEOWNERS directly? +# The yml is the audit trail: `git log .github/codeowners-roles.yml` +# shows every role change with a reviewable diff and a merge commit. +# The rendered CODEOWNERS is what GitHub reads at PR time. + +roles: + engineering: + description: > + All production code under crates/**. Engine, CLI, server, + compiler. + members: + - ragnorc + + docs: + description: > + Documentation under docs/**, plus repo-level docs (README.md, + AGENTS.md, CLAUDE.md symlink, SECURITY.md). + members: + - ragnorc + +# Path β†’ role mapping. 
GitHub CODEOWNERS uses "last match wins" +# semantics β€” when multiple patterns match a file, only the last +# matching pattern's owners apply. The generator handles this by +# emitting `default` as the first `*` line and the specific patterns +# below afterward, so specific paths override the catch-all. +# +# Within this list, order matters only between overlapping specific +# patterns (the later one wins). Today nothing overlaps; future +# additions should keep more-specific patterns later. +paths: + "crates/**": [engineering] + "docs/**": [docs] + "README.md": [docs] + "AGENTS.md": [docs] + "CLAUDE.md": [docs] + "SECURITY.md": [docs] + +# Catch-all for paths not explicitly mapped (.github/, scripts/, +# Cargo.toml, Cargo.lock, openapi.json, LICENSE, etc.). Defaults to +# engineering β€” every change to repo infrastructure needs the +# engineering owner's review. +default: [engineering] diff --git a/.github/scripts/render-codeowners.py b/.github/scripts/render-codeowners.py new file mode 100755 index 0000000..f243d0c --- /dev/null +++ b/.github/scripts/render-codeowners.py @@ -0,0 +1,134 @@ +#!/usr/bin/env python3 +"""Render .github/CODEOWNERS from .github/codeowners-roles.yml. + +The yml is the source of truth β€” editing CODEOWNERS directly is +rejected by CI (see .github/workflows/codeowners.yml). This script +expands the role-based yml into the flat pathβ†’owners format GitHub +expects. + +Usage: + python3 .github/scripts/render-codeowners.py + +Exits non-zero on: + - Missing PyYAML. + - Unknown role referenced in `paths` or `default`. + - Role with no members (a role must always resolve to at least + one owner; otherwise CODEOWNERS would assign nobody and GitHub + would silently fall back to "no required reviewer", which + defeats the purpose). +""" + +from __future__ import annotations + +import sys +from pathlib import Path + +try: + import yaml +except ImportError: + sys.exit( + "error: PyYAML is required. 
Install with `pip install pyyaml` " + "or `python3 -m pip install pyyaml`." + ) + +REPO_ROOT = Path(__file__).resolve().parents[2] +SOURCE = REPO_ROOT / ".github" / "codeowners-roles.yml" +OUTPUT = REPO_ROOT / ".github" / "CODEOWNERS" + +BANNER = """\ +# AUTOGENERATED from .github/codeowners-roles.yml. Do not edit by hand. +# +# To change role membership or path assignments: +# 1. Edit .github/codeowners-roles.yml +# 2. Run `python3 .github/scripts/render-codeowners.py` +# 3. Commit both files together +# +# CI fails if this file drifts from its source, and rejects PRs that +# edit this file directly without also editing the yml. +""" + + +def resolve(role_name: str, roles: dict) -> list[str]: + role = roles.get(role_name) + if role is None: + sys.exit( + f"error: unknown role '{role_name}'. " + f"Known roles: {sorted(roles.keys())}" + ) + members = role.get("members") or [] + if not members: + sys.exit( + f"error: role '{role_name}' has no members. " + f"A role must resolve to at least one owner." + ) + return members + + +def owners_for(role_names: list[str], roles: dict) -> list[str]: + """Return @-prefixed GitHub handles, deduped, preserving order.""" + seen: list[str] = [] + for role_name in role_names: + for member in resolve(role_name, roles): + handle = f"@{member}" + if handle not in seen: + seen.append(handle) + return seen + + +def main() -> int: + if not SOURCE.exists(): + sys.exit(f"error: source file not found: {SOURCE}") + spec = yaml.safe_load(SOURCE.read_text()) + + roles = spec.get("roles") or {} + if not roles: + sys.exit("error: codeowners-roles.yml declares no roles") + + paths = spec.get("paths") or {} + if not paths: + sys.exit("error: codeowners-roles.yml declares no paths") + + lines: list[str] = [BANNER] + + # Pad the path column for alignment. Width is the longest pattern + # plus a small margin. + width = max(len(p) for p in paths) + 2 + + # GitHub CODEOWNERS uses "last match wins" semantics. 
Emit the + # default catch-all `*` FIRST so specific patterns below override + # it for the paths they cover. If we emitted `*` last, every file + # would resolve to the default owners regardless of more-specific + # rules β€” which would silently nullify any role distinction. + if "default" in spec: + default_owners = owners_for(spec["default"], roles) + lines.append(f"{'*':<{width}} {' '.join(default_owners)}") + lines.append("") + + for pattern, role_names in paths.items(): + owners = owners_for(role_names, roles) + lines.append(f"{pattern:<{width}} {' '.join(owners)}") + + lines.append("") # trailing newline so the file ends cleanly + + rendered = "\n".join(lines) + + # Regression check: the catch-all `*` line (if any) must precede + # every specific-path line. Failure here means the generator is + # silently nullifying specific rules. + if "default" in spec: + non_comment = [ln for ln in rendered.splitlines() if ln and not ln.startswith("#")] + first_pattern = non_comment[0].split()[0] if non_comment else None + if first_pattern != "*": + sys.exit( + f"error: generator invariant violated β€” first emitted pattern is " + f"{first_pattern!r}, expected '*'. CODEOWNERS uses last-match-wins; " + f"the catch-all must come first." 
+ ) + + OUTPUT.write_text(rendered) + print(f"wrote {OUTPUT.relative_to(REPO_ROOT)}") + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 1e9249f..3dc2e80 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -88,11 +88,8 @@ jobs: .github/workflows/ci.yml|Cargo.toml|Cargo.lock|crates/*/Cargo.toml) run_rustfs_ci=true ;; crates/omnigraph/src/storage.rs) run_rustfs_ci=true ;; crates/omnigraph/src/db/manifest.rs|crates/omnigraph/src/db/manifest/*) run_rustfs_ci=true ;; - crates/omnigraph/tests/s3_storage.rs|crates/omnigraph/tests/write_cost_s3.rs|crates/omnigraph/tests/helpers/*) run_rustfs_ci=true ;; - crates/omnigraph/src/table_store.rs|crates/omnigraph/src/instrumentation.rs) run_rustfs_ci=true ;; - crates/omnigraph-cluster/src/store.rs|crates/omnigraph-cluster/src/serve.rs) run_rustfs_ci=true ;; - crates/omnigraph-cluster/tests/s3_cluster.rs) run_rustfs_ci=true ;; - crates/omnigraph-server/tests/s3.rs|crates/omnigraph-server/tests/support/*) run_rustfs_ci=true ;; + crates/omnigraph/tests/s3_storage.rs|crates/omnigraph/tests/helpers/*) run_rustfs_ci=true ;; + crates/omnigraph-server/tests/server.rs) run_rustfs_ci=true ;; crates/omnigraph-cli/tests/system_local.rs) run_rustfs_ci=true ;; esac done @@ -114,46 +111,11 @@ jobs: - name: Verify AGENTS.md ↔ docs/ cross-links run: bash scripts/check-agents-md.sh - entrypoint_test: - name: Container Entrypoint - runs-on: ubuntu-latest - permissions: - contents: read - steps: - - name: Checkout source - uses: actions/checkout@v5.0.1 - - - name: Verify omnigraph-server entrypoint arg composition - run: sh docker/entrypoint_test.sh - test: name: Test Workspace needs: classify_changes - # PR latency: the full workspace + failpoints build/test is the slowest - # gate (~15min warm, up to the 75min ceiling cold) and dominated PR - # turnaround. 
It now runs only on push to `main` (post-merge), on tags, - # and on manual `workflow_dispatch` — NOT on pull_request. Trade-off - # accepted deliberately: a regression is caught on the `main` run after - # merge rather than before it, so `main` can briefly go red. Mitigations: - # (1) `Test Workspace` is removed from required PR checks in - # `.github/branch-protection.json` (a required check that never - # reports would leave every PR permanently pending); - # (2) run the full suite locally before merging risky changes - # (`cargo test --workspace --locked`), or trigger this workflow via - # the Actions "Run workflow" button (workflow_dispatch) on your branch; - # (3) openapi.json is no longer auto-regenerated on PRs (that step lived - # here) — regenerate it locally for server/API changes - # (`OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi`) - # or the strict drift check fails the post-merge `main` run. - if: github.event_name != 'pull_request' runs-on: ubuntu-latest - # 75, not 45: a cold rust-cache (every Cargo.lock change) costs a full - # workspace + failpoints-feature build on a 2-core runner, which now - # exceeds 45 minutes on slow runner days. A timed-out run never SAVES - # its cache, so an undersized budget self-perpetuates: every retry - # starts cold and dies the same way (observed 2026-06-11, four runs). - # Warm-cache runs stay ~15 minutes; this is headroom, not a target. 
- timeout-minutes: 75 + timeout-minutes: 45 permissions: contents: write env: @@ -199,18 +161,15 @@ jobs: OMNIGRAPH_UPDATE_OPENAPI: ${{ (github.event_name == 'pull_request' && github.event.pull_request.head.repo.full_name == github.repository) && '1' || '' }} run: cargo test --workspace --locked - - name: Run failpoints feature tests + - name: Run failpoints feature test if: needs.classify_changes.outputs.run_full_ci == 'true' # Run after the workspace test so the build cache is warm β€” # enabling --features failpoints is just an incremental rebuild - # of the target crate + the small `fail` crate, not the full + # of omnigraph-engine + the small `fail` crate, not the full # dep tree (lance, datafusion). A separate job with its own # cache key would be a fresh ~20min build on first run; this - # is ~30s on a warm cache. The cluster feature does not enable - # omnigraph/failpoints, so each line rebuilds only its crate. - run: | - cargo test --locked -p omnigraph-engine --features failpoints --test failpoints - cargo test --locked -p omnigraph-cluster --features failpoints --test failpoints + # is ~30s on a warm cache. + run: cargo test --locked -p omnigraph-engine --features failpoints --test failpoints - name: Commit regenerated openapi.json to PR branch if: | @@ -292,9 +251,6 @@ jobs: rustfs_integration: name: RustFS S3 Integration - # `needs: test` means this is push-/dispatch-only too: on pull_request the - # `test` job is skipped, so this dependent is skipped with it. S3 - # integration runs post-merge on `main`, alongside the workspace suite. needs: - classify_changes - test @@ -335,12 +291,14 @@ jobs: . -> target - name: Start RustFS - # Pinned to 1.0.0-beta.8 (2026-06-10). beta.4+ refuses "default" - # credentials (rustfsadmin/rustfsadmin) unless - # RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true is set β€” fine for - # an ephemeral CI container. The three S3 suites were validated - # against the beta.8 binary locally before this bump. 
Keep the pin - # explicit (never `latest`) so upgrades are deliberate. + # Pinned to 1.0.0-beta.3 (2026-05-14) β€” the last known-good tag. + # `rustfs/rustfs:latest` (1.0.0-beta.4, 2026-05-21) added a + # credentials-policy check that refuses to start when + # AWS_ACCESS_KEY_ID/SECRET_ACCESS_KEY are values it considers + # "default" (rustfsadmin/rustfsadmin in our case). Bumping to + # beta.4+ requires either rotating those creds to less-default + # values or setting RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true + # β€” deliberate work, not an emergency. Pin first; upgrade later. run: | docker rm -f rustfs >/dev/null 2>&1 || true docker run -d \ @@ -349,8 +307,7 @@ jobs: -p 9001:9001 \ -e RUSTFS_ACCESS_KEY="${AWS_ACCESS_KEY_ID}" \ -e RUSTFS_SECRET_KEY="${AWS_SECRET_ACCESS_KEY}" \ - -e RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true \ - rustfs/rustfs:1.0.0-beta.8 \ + rustfs/rustfs:1.0.0-beta.3 \ /data - name: Install AWS CLI @@ -373,36 +330,12 @@ jobs: - name: Run RustFS storage tests run: cargo test --locked -p omnigraph-engine --test s3_storage -- --nocapture - - name: Run RustFS write-path cost gate (RFC-013 step 3a opener) - run: cargo test --locked -p omnigraph-engine --test write_cost_s3 -- --nocapture - - name: Run RustFS server smoke - # No name filter: every test in the s3 target is bucket-gated, and a - # filter matching nothing passes vacuously (which silently ran zero - # tests here for a while β€” the old filter said s3_repo, the test - # said s3_graph). 
- run: cargo test --locked -p omnigraph-server --test s3 -- --nocapture - - - name: Run RustFS cluster e2e - run: cargo test --locked -p omnigraph-cluster --test s3_cluster -- --nocapture + run: cargo test --locked -p omnigraph-server --test server server_opens_s3_repo_directly_and_serves_snapshot_and_read -- --nocapture - name: Run RustFS CLI smoke run: cargo test --locked -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow -- --nocapture - - name: Run RustFS recovery-sidecar lifecycle - # Sidecar put/list/delete through the S3 storage backend on a - # real bucket (the failpoint only wedges the publisher; the - # sidecar I/O is exercised for real). Name filter `s3_` matches - # the bucket-gated tests in the failpoints target only; the - # grep guards against the filter going vacuous (cargo passes - # with 0 tests matched) if those tests are ever renamed. - run: | - output=$(cargo test --locked -p omnigraph-engine --features failpoints --test failpoints s3_ -- --nocapture 2>&1); status=$? - echo "$output" - [ "$status" -eq 0 ] || exit "$status" - echo "$output" | grep -Eq "test result: ok\. [1-9][0-9]* passed" \ - || { echo "::error::filter 's3_' matched no tests β€” vacuous pass"; exit 1; } - - name: Dump RustFS logs on failure if: failure() run: docker logs rustfs diff --git a/.github/workflows/codeowners.yml b/.github/workflows/codeowners.yml new file mode 100644 index 0000000..19d5835 --- /dev/null +++ b/.github/workflows/codeowners.yml @@ -0,0 +1,66 @@ +name: CODEOWNERS + +on: + pull_request: + paths: + - '.github/codeowners-roles.yml' + - '.github/CODEOWNERS' + - '.github/scripts/render-codeowners.py' + - '.github/workflows/codeowners.yml' + workflow_dispatch: + +# Read-only; we never push from this workflow. 
+permissions: + contents: read + +jobs: + drift: + name: CODEOWNERS matches source + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v5.0.1 + + - name: Set up Python + uses: actions/setup-python@v5.4.0 + with: + python-version: '3.13' + + - name: Install PyYAML + run: pip install pyyaml + + - name: Re-render CODEOWNERS + run: python3 .github/scripts/render-codeowners.py + + - name: Reject drift + run: | + if ! git diff --quiet .github/CODEOWNERS; then + echo "::error::.github/CODEOWNERS is out of sync with .github/codeowners-roles.yml." + echo "::error::Run \`python3 .github/scripts/render-codeowners.py\` locally and commit the result." + echo "--- diff ---" + git --no-pager diff .github/CODEOWNERS + exit 1 + fi + echo "CODEOWNERS is in sync with its source." + + noedit: + name: CODEOWNERS not hand-edited + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v5.0.1 + with: + # Need history so we can diff against the PR base. + fetch-depth: 0 + + - name: Reject hand-edits to generated file + run: | + base="origin/${{ github.base_ref }}" + git fetch origin "${{ github.base_ref }}" --quiet + changed=$(git diff --name-only "$base" HEAD) + edited_generated=$(echo "$changed" | grep -E '^\.github/CODEOWNERS$' || true) + edited_source=$(echo "$changed" | grep -E '^\.github/codeowners-roles\.yml$' || true) + if [ -n "$edited_generated" ] && [ -z "$edited_source" ]; then + echo "::error::This PR edits .github/CODEOWNERS but not its source .github/codeowners-roles.yml." + echo "::error::Edit the yml and regenerate via \`python3 .github/scripts/render-codeowners.py\`." + exit 1 + fi + echo "CODEOWNERS edits accompany source edits (or no CODEOWNERS edits in this PR)." 
diff --git a/.github/workflows/publish-crates.yml b/.github/workflows/publish-crates.yml index 4fac941..9484b98 100644 --- a/.github/workflows/publish-crates.yml +++ b/.github/workflows/publish-crates.yml @@ -1,6 +1,6 @@ name: Publish to crates.io -# Publishes the publishable workspace crates to crates.io in dependency order. +# Publishes the four workspace crates to crates.io in dependency order. # # Triggers: # - push of any v* tag (future releases auto-publish alongside release.yml) @@ -115,14 +115,10 @@ jobs: # Order matters: each crate must precede anything that depends on it. # omnigraph-compiler and omnigraph-policy have no internal deps; - # omnigraph-engine depends on both; omnigraph-api-types and - # omnigraph-cluster depend on engine (+ compiler); server depends on - # engine + api-types + cluster + the two leaf crates; cli depends on - # everything. + # omnigraph-engine depends on both; server depends on engine + the + # two leaf crates; cli depends on everything. publish_if_new omnigraph-compiler publish_if_new omnigraph-policy publish_if_new omnigraph-engine - publish_if_new omnigraph-api-types - publish_if_new omnigraph-cluster publish_if_new omnigraph-server publish_if_new omnigraph-cli diff --git a/.github/workflows/release-edge.yml b/.github/workflows/release-edge.yml index 3996e65..6147646 100644 --- a/.github/workflows/release-edge.yml +++ b/.github/workflows/release-edge.yml @@ -43,8 +43,6 @@ jobs: asset_name: omnigraph-linux-x86_64 - runner: macos-14 asset_name: omnigraph-macos-arm64 - - runner: windows-latest - asset_name: omnigraph-windows-x86_64 env: CARGO_TERM_COLOR: always steps: @@ -61,10 +59,6 @@ jobs: if: runner.os == 'macOS' run: brew install protobuf - - name: Install Windows dependencies - if: runner.os == 'Windows' - run: choco install protoc -y - - name: Install Rust stable uses: dtolnay/rust-toolchain@stable with: @@ -79,8 +73,7 @@ jobs: - name: Build release binaries run: cargo build --release --locked -p omnigraph-cli -p 
omnigraph-server - - name: Package Unix release archive - if: runner.os != 'Windows' + - name: Package release archive run: | mkdir -p release install -m 0755 target/release/omnigraph release/omnigraph @@ -88,22 +81,6 @@ jobs: tar -C release -czf "${{ matrix.asset_name }}.tar.gz" omnigraph omnigraph-server shasum -a 256 "${{ matrix.asset_name }}.tar.gz" > "${{ matrix.asset_name }}.sha256" - - name: Package Windows release archive - if: runner.os == 'Windows' - run: | - New-Item -ItemType Directory -Force -Path release | Out-Null - Copy-Item target/release/omnigraph.exe release/omnigraph.exe - Copy-Item target/release/omnigraph-server.exe release/omnigraph-server.exe - Compress-Archive -Path release/omnigraph.exe, release/omnigraph-server.exe -DestinationPath "${{ matrix.asset_name }}.zip" -Force - $hash = (Get-FileHash "${{ matrix.asset_name }}.zip" -Algorithm SHA256).Hash.ToLowerInvariant() - "$hash ${{ matrix.asset_name }}.zip" | Out-File -FilePath "${{ matrix.asset_name }}.sha256" -Encoding ascii - New-Item -ItemType Directory -Force -Path verify | Out-Null - Expand-Archive -Path "${{ matrix.asset_name }}.zip" -DestinationPath verify -Force - $items = Get-ChildItem -Path verify -File - if ($items.Count -ne 2 -or !(Test-Path verify/omnigraph.exe) -or !(Test-Path verify/omnigraph-server.exe)) { - throw "Windows release archive is missing expected binaries" - } - - name: Publish edge release assets uses: softprops/action-gh-release@v2.5.0 with: @@ -114,22 +91,5 @@ jobs: body: | Rolling prerelease from `${{ github.sha }}`. 
files: | - ${{ matrix.asset_name }}.* - - smoke_windows_installer: - name: Smoke Windows installer - needs: build_release - runs-on: windows-latest - permissions: - contents: read - steps: - - name: Checkout source - uses: actions/checkout@v5.0.1 - - - name: Install from edge release - run: ./scripts/install.ps1 -ReleaseChannel edge -InstallDir "$env:RUNNER_TEMP/omnigraph-bin" - - - name: Smoke installed binaries - run: | - & "$env:RUNNER_TEMP/omnigraph-bin/omnigraph.exe" version - & "$env:RUNNER_TEMP/omnigraph-bin/omnigraph-server.exe" --help + ${{ matrix.asset_name }}.tar.gz + ${{ matrix.asset_name }}.sha256 diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 4b9456a..e7fc75f 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -1,34 +1,17 @@ name: Release -# Build per-platform binaries in a matrix, then publish the GitHub release ONCE -# from a single job. The matrix used to call `softprops/action-gh-release` -# concurrently β€” three jobs racing to create/finalize the same release, which -# exhausted the action's finalize retries and dropped whole platforms' assets. -# The matrix now only uploads workflow artifacts; `publish_release` is the sole -# writer of the release (no race). -# -# Triggers: -# - push of a v* tag (normal release) -# - workflow_dispatch with an explicit `tag` (re-publish a past tag without -# re-cutting it; resolves the same `${{ inputs.tag || github.ref_name }}`) - on: push: tags: - "v*" workflow_dispatch: - inputs: - tag: - description: "Tag to (re)publish (e.g. v0.7.0). Required for manual dispatches." 
- required: true - type: string jobs: build_release: name: Build ${{ matrix.asset_name }} runs-on: ${{ matrix.runner }} permissions: - contents: read + contents: write strategy: fail-fast: false matrix: @@ -37,15 +20,11 @@ jobs: asset_name: omnigraph-linux-x86_64 - runner: macos-14 asset_name: omnigraph-macos-arm64 - - runner: windows-latest - asset_name: omnigraph-windows-x86_64 env: CARGO_TERM_COLOR: always steps: - name: Checkout source uses: actions/checkout@v5.0.1 - with: - ref: ${{ inputs.tag || github.ref_name }} - name: Install Linux dependencies if: runner.os == 'Linux' @@ -57,10 +36,6 @@ jobs: if: runner.os == 'macOS' run: brew install protobuf - - name: Install Windows dependencies - if: runner.os == 'Windows' - run: choco install protoc -y - - name: Install Rust stable uses: dtolnay/rust-toolchain@stable with: @@ -75,8 +50,7 @@ jobs: - name: Build release binaries run: cargo build --release --locked -p omnigraph-cli -p omnigraph-server - - name: Package Unix release archive - if: runner.os != 'Windows' + - name: Package release archive run: | mkdir -p release install -m 0755 target/release/omnigraph release/omnigraph @@ -84,62 +58,21 @@ jobs: tar -C release -czf "${{ matrix.asset_name }}.tar.gz" omnigraph omnigraph-server shasum -a 256 "${{ matrix.asset_name }}.tar.gz" > "${{ matrix.asset_name }}.sha256" - - name: Package Windows release archive - if: runner.os == 'Windows' - run: | - New-Item -ItemType Directory -Force -Path release | Out-Null - Copy-Item target/release/omnigraph.exe release/omnigraph.exe - Copy-Item target/release/omnigraph-server.exe release/omnigraph-server.exe - Compress-Archive -Path release/omnigraph.exe, release/omnigraph-server.exe -DestinationPath "${{ matrix.asset_name }}.zip" -Force - $hash = (Get-FileHash "${{ matrix.asset_name }}.zip" -Algorithm SHA256).Hash.ToLowerInvariant() - "$hash ${{ matrix.asset_name }}.zip" | Out-File -FilePath "${{ matrix.asset_name }}.sha256" -Encoding ascii - New-Item -ItemType Directory -Force 
-Path verify | Out-Null - Expand-Archive -Path "${{ matrix.asset_name }}.zip" -DestinationPath verify -Force - $items = Get-ChildItem -Path verify -File - if ($items.Count -ne 2 -or !(Test-Path verify/omnigraph.exe) -or !(Test-Path verify/omnigraph-server.exe)) { - throw "Windows release archive is missing expected binaries" - } - - # Upload artifacts only β€” the single `publish_release` job attaches them to - # the release, so no two jobs ever write the release concurrently. - - name: Upload build artifact - uses: actions/upload-artifact@v4 - with: - name: ${{ matrix.asset_name }} - path: | - ${{ matrix.asset_name }}.* - if-no-files-found: error - retention-days: 1 - - publish_release: - name: Publish GitHub release - needs: build_release - runs-on: ubuntu-latest - permissions: - contents: write - steps: - - name: Download all build artifacts - uses: actions/download-artifact@v4 - with: - path: dist - merge-multiple: true - - - name: Publish release (single writer β€” no matrix race) + - name: Publish GitHub release assets uses: softprops/action-gh-release@v2.5.0 with: - tag_name: ${{ inputs.tag || github.ref_name }} - files: dist/** - overwrite_files: true + files: | + ${{ matrix.asset_name }}.tar.gz + ${{ matrix.asset_name }}.sha256 update_homebrew_tap: name: Update Homebrew tap - needs: publish_release + needs: build_release runs-on: ubuntu-latest permissions: contents: read env: HOMEBREW_TAP_TOKEN: ${{ secrets.HOMEBREW_TAP_TOKEN }} - RELEASE_TAG: ${{ inputs.tag || github.ref_name }} steps: - name: Skip if HOMEBREW_TAP_TOKEN is not configured if: env.HOMEBREW_TAP_TOKEN == '' @@ -150,8 +83,6 @@ jobs: - name: Checkout source if: env.HOMEBREW_TAP_SKIP != '1' uses: actions/checkout@v5.0.1 - with: - ref: ${{ env.RELEASE_TAG }} - name: Checkout Homebrew tap if: env.HOMEBREW_TAP_SKIP != '1' @@ -166,32 +97,7 @@ jobs: env: GH_TOKEN: ${{ github.token }} run: | - ./scripts/update-homebrew-formula.sh "${RELEASE_TAG}" homebrew-tap/Formula/omnigraph.rb - - # Diagnostic 
only: brew is not on PATH on the ubuntu runner by default, so - # set it up explicitly. Both this setup and the audit below are best-effort - # canaries, not gates — continue-on-error on each keeps a failed/flaky brew - # (the action is pinned to a moving @master ref) from skipping the actual - # tap publish below. The formula is correct by construction - # (update-homebrew-formula.sh), so brew tooling must never block the push. - - name: Set up Homebrew - if: env.HOMEBREW_TAP_SKIP != '1' - continue-on-error: true - uses: Homebrew/actions/setup-homebrew@master - - - name: Audit generated formula - if: env.HOMEBREW_TAP_SKIP != '1' - continue-on-error: true - run: | - # Audit the checked-out tap by name (brew audit rejects bare paths - # and needs tap context). Symlink the checkout into Homebrew's Taps - # tree so `modernrelay/tap/omnigraph` resolves to it. Offline audit - # (no --online) keeps it deterministic; it still catches the - # ComponentsOrder/structure class of problems. - tap_dir="$(brew --repository)/Library/Taps/modernrelay/homebrew-tap" - mkdir -p "$(dirname "$tap_dir")" - ln -sfn "$PWD/homebrew-tap" "$tap_dir" - brew audit --strict modernrelay/tap/omnigraph + ./scripts/update-homebrew-formula.sh "${GITHUB_REF_NAME}" homebrew-tap/Formula/omnigraph.rb - name: Commit and push formula update if: env.HOMEBREW_TAP_SKIP != '1' @@ -205,28 +111,5 @@ jobs: git config user.name "github-actions[bot]" git config user.email "41898282+github-actions[bot]@users.noreply.github.com" git add Formula/omnigraph.rb - git commit -m "Update Omnigraph formula to ${RELEASE_TAG}" + git commit -m "Update Omnigraph formula to ${GITHUB_REF_NAME}" git push origin HEAD:main - - smoke_windows_installer: - name: Smoke Windows installer - needs: publish_release - if: ${{ inputs.tag != '' || startsWith(github.ref, 'refs/tags/v') }} - runs-on: windows-latest - permissions: - contents: read - env: - RELEASE_TAG: ${{ inputs.tag || github.ref_name }} - steps: - - name: Checkout source -
uses: actions/checkout@v5.0.1 - with: - ref: ${{ env.RELEASE_TAG }} - - - name: Install from tagged release - run: ./scripts/install.ps1 -Version "$env:RELEASE_TAG" -InstallDir "$env:RUNNER_TEMP/omnigraph-bin" - - - name: Smoke installed binaries - run: | - & "$env:RUNNER_TEMP/omnigraph-bin/omnigraph.exe" version - & "$env:RUNNER_TEMP/omnigraph-bin/omnigraph-server.exe" --help diff --git a/AGENTS.md b/AGENTS.md index d6d242d..27d1b7b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -16,9 +16,9 @@ Tools that support `@`-imports (Claude Code) auto-include all three files via th `CLAUDE.md` is a symlink to this file — there is exactly one source of truth. Edit `AGENTS.md`. -**Version surveyed:** 0.7.2 -**Workspace crates:** `omnigraph-compiler`, `omnigraph` (engine), `omnigraph-policy`, `omnigraph-api-types` (shared HTTP wire DTOs), `omnigraph-cluster`, `omnigraph-cli`, `omnigraph-server` -**Storage substrate:** Lance 7.x (columnar, versioned, branchable) +**Version surveyed:** 0.6.0 +**Workspace crates:** `omnigraph-compiler`, `omnigraph` (engine), `omnigraph-policy`, `omnigraph-cli`, `omnigraph-server` +**Storage substrate:** Lance 6.x (columnar, versioned, branchable) **License:** MIT **Toolchain:** Rust stable, edition 2024 @@ -33,8 +33,8 @@ OmniGraph is a typed property-graph engine built as a coordination layer over ma - **Multi-modal querying**: vector ANN (`nearest`), full-text (`search`/`fuzzy`/`match_text`/`bm25`), Reciprocal Rank Fusion (`rrf`), and graph traversal (`Expand`, anti-join `not { … }`) in one runtime. - **Branches and commits across the whole graph**: Git-style — every successful publish appends to a commit DAG; merges are three-way at the row level.
- **Atomic per-query writes**: `mutate_as` and `load` accumulate insert/update batches into an in-memory `MutationStaging.pending` per touched table; one `stage_*` + `commit_staged` per table runs at end-of-query, then `ManifestBatchPublisher::publish` commits the manifest atomically with per-table `expected_table_versions` CAS. A mid-query failure leaves Lance HEAD untouched on staged tables — no drift, no run state machine, no staging branches. Deletes still inline-commit; D₂ at parse time prevents inserts/updates and deletes from coexisting in one query. -- **HTTP server**: Axum + utoipa OpenAPI, bearer auth (SHA-256 hashed, optional AWS Secrets Manager). Cedar policy enforcement is engine-wide — every `_as` writer calls `Omnigraph::enforce(action, scope, actor)`, so HTTP, CLI, and embedded SDK consumers all hit the same gate. **Cluster-only boot** (RFC-011): the server always boots from a cluster directory (`--cluster `, RFC-005) and serves N graphs (N ≥ 1) under multi-graph routes (`/graphs/{graph_id}/...` + read-only `GET /graphs` enumeration); there are no single-graph flat routes and no positional-URI boot. Per-graph + server-level Cedar policies. Runtime add/remove (`POST /graphs`, `DELETE /graphs/{id}`) is not exposed — operators run `cluster apply` and restart. -- **CLI** with two-surface config (RFC-007/008): the team-owned cluster directory (`cluster.yaml`) plus the per-operator `~/.omnigraph/config.yaml` (servers, clusters, credentials, actor, profiles, aliases, defaults). Graphs are addressed via `--store`/`--server`/`--cluster`/`--profile`/operator defaults (RFC-011). Multi-format output (json/jsonl/csv/kv/table). +- **HTTP server**: Axum + utoipa OpenAPI, bearer auth (SHA-256 hashed, optional AWS Secrets Manager). Cedar policy enforcement is engine-wide — every `_as` writer calls `Omnigraph::enforce(action, scope, actor)`, so HTTP, CLI, and embedded SDK consumers all hit the same gate.
**Two modes** (v0.6.0+): single-graph (legacy flat routes) and multi-graph (`/graphs/{graph_id}/...` cluster routes + read-only `GET /graphs` enumeration). Per-graph + server-level Cedar policies. Runtime add/remove (`POST /graphs`, `DELETE /graphs/{id}`) is not exposed — operators edit `omnigraph.yaml` and restart. +- **CLI** driven by a single `omnigraph.yaml`; multi-format output (json/jsonl/csv/kv/table). Throughout the docs, capabilities are split into **L1 — Inherited from Lance** vs **L2 — Added by OmniGraph**. @@ -53,7 +53,7 @@ CLI (omnigraph) HTTP Server (omnigraph-server, Axum) omnigraph (engine) ── ManifestCoordinator, CommitGraph, RunRegistry, GraphIndex (CSR/CSC), exec │ ▼ - Lance 7.x ── columnar Arrow, fragments, per-dataset versions/branches, indexes + Lance 6.x ── columnar Arrow, fragments, per-dataset versions/branches, indexes │ ▼ Object store (file / s3 / RustFS / MinIO / S3-compat) @@ -73,37 +73,31 @@ Full diagram and concurrency model: [docs/dev/architecture.md](docs/dev/architec | **Lance docs index — fetch upstream Lance docs by problem domain** | **[docs/dev/lance.md](docs/dev/lance.md)** | | **Test coverage map — what's covered, what helpers to reuse, before-every-task checklist** | **[docs/dev/testing.md](docs/dev/testing.md)** | | Architecture, L1/L2 framing, concurrency model | [docs/dev/architecture.md](docs/dev/architecture.md) | -| Storage layout, `__manifest` schema, URI schemes, S3 env vars | [docs/user/concepts/storage.md](docs/user/concepts/storage.md) | -| `.pg` schema language, types, constraints, annotations, migration planning | [docs/user/schema/index.md](docs/user/schema/index.md) | -| Schema-lint codes (`OG-XXX-NNN`), families, severity, suppression | [docs/user/schema/lint.md](docs/user/schema/lint.md) | -| `.gq` query language, MATCH/RETURN/ORDER, IR ops, lint codes | [docs/user/queries/index.md](docs/user/queries/index.md) | -| Mutations — insert/update/delete, D2, atomicity |
[docs/user/mutations/index.md](docs/user/mutations/index.md) | -| Search funcs (`nearest`/`bm25`/`rrf`), hybrid ranking | [docs/user/search/index.md](docs/user/search/index.md) | -| Indexes (BTREE / inverted / vector / graph topology) | [docs/user/search/indexes.md](docs/user/search/indexes.md) | -| Embeddings (engine client, env vars, `@embed`) | [docs/user/search/embeddings.md](docs/user/search/embeddings.md) | -| Concepts — what OmniGraph is, L1/L2 framing | [docs/user/concepts/index.md](docs/user/concepts/index.md) | -| Quickstart — init → load → query → branch | [docs/user/quickstart.md](docs/user/quickstart.md) | -| Branches, commit graph, system branches | [docs/user/branching/index.md](docs/user/branching/index.md) | -| Snapshots & time travel | [docs/user/branching/time-travel.md](docs/user/branching/time-travel.md) | -| Three-way merge and conflict kinds (user-facing) | [docs/user/branching/merge.md](docs/user/branching/merge.md) | -| Transactions and atomicity (per-query atomic; branches as multi-query transactions) | [docs/user/branching/transactions.md](docs/user/branching/transactions.md) | -| Direct-publish write path (staging, D2, recovery sidecars; the former Run state machine) | [docs/dev/writes.md](docs/dev/writes.md) | +| Storage layout, `__manifest` schema, URI schemes, S3 env vars | [docs/user/storage.md](docs/user/storage.md) | +| `.pg` schema language, types, constraints, annotations, migration planning | [docs/user/schema-language.md](docs/user/schema-language.md) | +| Schema-lint codes (`OG-XXX-NNN`), families, severity, suppression | [docs/user/schema-lint.md](docs/user/schema-lint.md) | +| `.gq` query language, MATCH/RETURN/ORDER, search funcs, mutations, IR ops, lint codes | [docs/user/query-language.md](docs/user/query-language.md) | +| Indexes (BTREE / inverted / vector / graph topology) | [docs/user/indexes.md](docs/user/indexes.md) | +| Embeddings (compiler + engine clients, env vars, `@embed`) |
[docs/user/embeddings.md](docs/user/embeddings.md) | +| Branches, commit graph, snapshots, system branches | [docs/user/branches-commits.md](docs/user/branches-commits.md) | +| Transactions and atomicity (per-query atomic; branches as multi-query transactions) | [docs/user/transactions.md](docs/user/transactions.md) | +| Direct-publish writes (the former Run state machine, now demoted to publisher CAS) | [docs/dev/runs.md](docs/dev/runs.md) | | Three-way merge and conflict kinds | [docs/dev/merge.md](docs/dev/merge.md) | -| Diff / change feed (`diff_between`, `diff_commits`) | [docs/user/branching/changes.md](docs/user/branching/changes.md) | +| Diff / change feed (`diff_between`, `diff_commits`) | [docs/user/changes.md](docs/user/changes.md) | | Query execution, mutation execution, bulk loader, `load` vs `ingest` | [docs/dev/execution.md](docs/dev/execution.md) | -| `optimize` (compaction) and `cleanup` (version GC) | [docs/user/operations/maintenance.md](docs/user/operations/maintenance.md) | -| Cluster operator guide (deploy/manage clusters, approvals, recovery, serving) | [docs/user/clusters/index.md](docs/user/clusters/index.md) | -| Cedar policy actions, scopes, CLI | [docs/user/operations/policy.md](docs/user/operations/policy.md) | -| HTTP server endpoints, auth, error model, body limits | [docs/user/operations/server.md](docs/user/operations/server.md) | -| CLI quick-start | [docs/user/cli/index.md](docs/user/cli/index.md) | -| CLI command surface and config schema (`~/.omnigraph/config.yaml`) | [docs/user/cli/reference.md](docs/user/cli/reference.md) | -| Audit / actor tracking | [docs/user/operations/audit.md](docs/user/operations/audit.md) | -| Error taxonomy and result serialization | [docs/user/operations/errors.md](docs/user/operations/errors.md) | +| `optimize` (compaction) and `cleanup` (version GC) | [docs/user/maintenance.md](docs/user/maintenance.md) | +| Cedar policy actions, scopes, CLI | [docs/user/policy.md](docs/user/policy.md) | +| HTTP 
server endpoints, auth, error model, body limits | [docs/user/server.md](docs/user/server.md) | +| CLI quick-start | [docs/user/cli.md](docs/user/cli.md) | +| CLI command surface and `omnigraph.yaml` schema | [docs/user/cli-reference.md](docs/user/cli-reference.md) | +| Audit / actor tracking | [docs/user/audit.md](docs/user/audit.md) | +| Error taxonomy and result serialization | [docs/user/errors.md](docs/user/errors.md) | | Install (binary / Homebrew / source / channels) | [docs/user/install.md](docs/user/install.md) | -| Deployment (binary / container / S3-local testing / auth / build variants) | [docs/user/deployment.md](docs/user/deployment.md) | +| Deployment (binary / container / RustFS bootstrap / auth / build variants) | [docs/user/deployment.md](docs/user/deployment.md) | | CI / release workflows | [docs/dev/ci.md](docs/dev/ci.md) | +| Code ownership (CODEOWNERS source of truth, roles, regeneration) | [docs/dev/codeowners.md](docs/dev/codeowners.md) | | Branch protection policy (declarative, applied via `scripts/apply-branch-protection.sh`) | [docs/dev/branch-protection.md](docs/dev/branch-protection.md) | -| Constants & tunables cheat sheet | [docs/user/reference/constants.md](docs/user/reference/constants.md) | +| Constants & tunables cheat sheet | [docs/user/constants.md](docs/user/constants.md) | | Per-version release notes | [docs/releases/](docs/releases/) | --- @@ -124,8 +118,6 @@ This is a decision lens, not a code-size rule. It cuts both ways. Sometimes the When evaluating a design, ask: *"what does this look like after 5 more changes like it?"* If the answer is "this converges to one shape", cost is bounded. If it's "this forks every time", the option is mortgaging the future for present convenience — pick differently. -The same lens has a structural corollary: **one source of truth, cheaply derived.** Lance and the manifest are the source of truth; everything else is a derived view.
Maintaining a parallel copy invites drift that compounds over time, and re-deriving a view from the full source on every call makes its cost grow with history. Both are liabilities integrated over time, so both are ruled out the same way: hold a warm derived view and refresh it with a cheap probe, never shadow the source or rebuild from it cold. Invariant 15 in [docs/dev/invariants.md](docs/dev/invariants.md) states this; invariants 1 (respect the substrate) and 7 (indexes are derived state) are instances. - ### Tiebreakers when liability alone is silent - **Correctness > simplicity > performance.** Lexicographic — give up performance for simpler code; give up simplicity for correct code; never give up correctness. The deny-list ("no silent failures," "no acks before durable persistence," "no reads of partial commits") is this rule's hard floor. @@ -145,8 +137,6 @@ These are architectural rules that need to be in scope on every change. They're 4. **Bearer-token plaintext never persists in process memory.** Tokens are hashed at startup; auth uses constant-time comparison; the actor id is server-resolved from the hash match and must not be settable by the client. 5. **Reads always see the current index state for the branch they're reading.** Indexes track the branch head, not historical snapshots. If you change index lifecycle, preserve this guarantee. 6. **Stable type IDs survive renames.** Schema migration relies on identity that's stable across rename — don't mint new IDs on rename. -7. **Logical contract over physical state.** Physical state (index coverage, fragment layout, compaction versions, staged writes) is derived and rebuildable; it must never fail a logical operation. Check preconditions against logical state and let reconciliation converge the physical state idempotently — genuine logical conflicts still fail loudly. This is the rule rules 1–6 instantiate; full statement and applications in [docs/dev/invariants.md](docs/dev/invariants.md). -8.
**One source of truth, cheaply derived.** Lance and the manifest are the source of truth; runtime state is a derived view of them. Don't maintain a parallel copy that can drift, and don't re-derive a view from cold storage on every call (that makes cost grow with history). Hold it warm, refresh with a cheap probe. ### Deny-list (fast-pass review filter — full reasoning in [docs/dev/invariants.md](docs/dev/invariants.md)) @@ -168,39 +158,12 @@ If a proposal fits one of these, the burden is on the proposer to justify why th - Cloud-only correctness fixes — correctness is always OSS. - Forking the codebase for Cloud — trait-extension only. - Hand-rolling something Lance already does — check the spec first. -- Shadowing the source of truth with a maintained parallel copy, or re-deriving a derived view from cold storage per call (cost then scales with history). Hold it warm and refresh cheaply. - Mutating in place state that should be immutable (Lance fragments, index segments) — new segments instead. - Silent failures — OOM, timeout, partial result must all be surfaced and bounded. - Shipping observable behavior as if it weren't part of the contract — output ordering, error-message text, timestamp precision, default-flag values, latency profile. Per Hyrum's Law, every observable behavior gets depended on once shipped; don't expose what you don't want to commit to. --- -## Build, test, lint - -Rust stable workspace (edition 2024). `protoc` is a build dependency (`brew install protobuf` / `apt-get install protobuf-compiler libprotobuf-dev`). **Crate dir ≠ package name** for the engine: the directory is `crates/omnigraph` but its Cargo package is `omnigraph-engine` (use that in `-p`). The CLI binary built from `omnigraph-cli` is named `omnigraph`.
- -```bash -cargo build --workspace --locked # build everything -cargo test --workspace --locked # the canonical CI gate (matches CI exactly) -cargo run -p omnigraph-cli -- # run the `omnigraph` CLI from source -cargo run -p omnigraph-server -- --cluster --bind 0.0.0.0:8080 # run the server from source - -# Run one crate / one test file / one test fn -cargo test -p omnigraph-engine --test traversal # one integration-test file (see docs/dev/testing.md) -cargo test -p omnigraph-engine --test writes concurrent # one test fn by name substring -cargo test -p omnigraph-engine some_inline_test -- --nocapture # show stdout - -# Feature-gated suites (each is its own job in CI, not part of the default run) -cargo test -p omnigraph-engine --features failpoints --test failpoints # fault injection -cargo build -p omnigraph-server --features aws # AWS Secrets Manager bearer-token source -``` - -S3-backed tests (`s3_storage`, and the S3 paths in server/CLI system tests) **skip** unless `OMNIGRAPH_S3_TEST_BUCKET` + `AWS_*` (incl. `AWS_ENDPOINT_URL_S3` for non-AWS) are set; CI runs them against containerized RustFS. To run RustFS/MinIO yourself, see [docs/user/deployment.md](docs/user/deployment.md) → *Testing against S3 locally*. - -CI does **not** run `clippy` or `rustfmt` as gates — but `cargo test --workspace --locked` is the exact gate, so run it before pushing. Two non-test CI checks: `scripts/check-agents-md.sh` (doc cross-link integrity — run it after moving/renaming docs) and OpenAPI drift (`crates/omnigraph-server/tests/openapi.rs` regenerates `openapi.json`; set `OMNIGRAPH_UPDATE_OPENAPI=1` to update the checked-in copy when a server/API change is intentional).
--- - ## Quick-reference flows ```bash @@ -210,12 +173,13 @@ omnigraph init --schema ./schema.pg s3://my-bucket/graph.omni # Bulk load omnigraph load --data ./seed.jsonl --mode overwrite s3://my-bucket/graph.omni -# Load a review batch onto its own branch (--from forks it if missing) -omnigraph load --branch review/2026-04-25 --from main --mode merge --data ./batch.jsonl s3://my-bucket/graph.omni +# Branch + ingest a review batch +omnigraph branch create --from main review/2026-04-25 s3://my-bucket/graph.omni +omnigraph ingest --branch review/2026-04-25 --data ./batch.jsonl s3://my-bucket/graph.omni -# Run a hybrid (vector + BM25) query — ad-hoc .gq against a store (positional = query name) -omnigraph query --query ./queries.gq find_similar \ - --params '{"q":"trends in AI safety"}' --format table --store s3://my-bucket/graph.omni +# Run a hybrid (vector + BM25) query +omnigraph read --query ./queries.gq --name find_similar \ + --params '{"q":"trends in AI safety"}' --format table s3://my-bucket/graph.omni # Plan + apply schema migration omnigraph schema plan --schema ./next.pg s3://my-bucket/graph.omni @@ -224,21 +188,17 @@ omnigraph schema apply --schema ./next.pg s3://my-bucket/graph.omni --json # Merge review branch back omnigraph branch merge review/2026-04-25 --into main s3://my-bucket/graph.omni -# Compact, preview any uncovered drift, then repair/GC after review +# Compact + GC (preview, then confirm) omnigraph optimize s3://my-bucket/graph.omni -omnigraph repair s3://my-bucket/graph.omni -omnigraph repair --confirm s3://my-bucket/graph.omni -# For suspicious/unverifiable drift only after deliberate review: -# omnigraph repair --force --confirm s3://my-bucket/graph.omni omnigraph cleanup --keep 10 --older-than 7d s3://my-bucket/graph.omni omnigraph cleanup --keep 10 --older-than 7d --confirm s3://my-bucket/graph.omni # Stand up the HTTP server (token from env) OMNIGRAPH_SERVER_BEARER_TOKEN=xxxx \ - omnigraph-server --cluster s3://my-bucket/cluster
--bind 0.0.0.0:8080 + omnigraph-server s3://my-bucket/graph.omni --bind 0.0.0.0:8080 # Cedar policy explain -omnigraph policy explain --cluster ./company-brain --graph knowledge --actor act-alice --action change --branch main +omnigraph policy explain --actor act-alice --action change --branch main ``` --- @@ -250,11 +210,10 @@ omnigraph policy explain --cluster ./company-brain --graph knowledge --actor act | Columnar storage on object store | βœ… Arrow/Lance | URI normalization, S3 env-var plumbing | | Per-dataset versioning + time travel | βœ… | `snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables | | Per-dataset branches | βœ… | **Graph-level** branches (atomic across all sub-tables), lazy fork, system branch filtering | -| Atomic single-dataset commits | βœ… | **Multi-table publish via three layers**, NOT a single Lance primitive: (1) per-table Lance `commit_staged` for the data write, (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering, (3) the open-time recovery sweep for the residual gap between (1) and (2). All three layers ship; the five migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`, `optimize_all_tables`) write a `__recovery/{ulid}.json` sidecar before Phase B and delete it after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the sweep in `db/manifest/recovery.rs`: classify, decide all-or-nothing per sidecar, roll forward via single `ManifestBatchPublisher::publish` or roll back via `Dataset::restore` followed by a manifest publish of the restored version (so both directions converge to `manifest == HEAD` β€” no residual drift), and record an audit row in `_graph_commit_recoveries.lance` (queryable via `omnigraph commit list --filter actor=omnigraph:recovery`). 
The write entry points (`load_as`, `mutate_as`, `apply_schema_as`, `branch_merge_as`) and `refresh` additionally run an in-process roll-forward-only heal (serialized against live writers via the per-table write queues), so a long-lived server converges on its next write without restart; only rollback-eligible sidecars still defer to the next read-write open (a future background reconciler's goal). Engine writes route through a sealed `TableStorage` trait (`db.storage()`) exposing only `stage_*` + `commit_staged` + reads; the inline-commit residuals (`delete_where`, `create_vector_index`) are split onto a separate sealed `InlineCommitResidual` trait reached via `db.storage_inline_residual()` (MR-854), so the default surface cannot couple a write with a HEAD advance — §1 holds by construction. `delete_where` and `create_vector_index` stay inline until upstream Lance ships a public two-phase API ([#6658](https://github.com/lance-format/lance/issues/6658), [#6666](https://github.com/lance-format/lance/issues/6666)); `LoadMode::Overwrite` uses Lance `Overwrite` staged transactions.
| -| Compaction (`compact_files`) + reindex (`optimize_indices`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency; per table runs `compact_files` **then Lance `optimize_indices`** (folds appended/rewritten fragments back into existing indexes — incremental merge, not retrain) and **publishes the resulting version to `__manifest`** (so the manifest tracks the Lance HEAD — required for reads to observe the work and for schema apply / strict writes to pass their HEAD-vs-manifest precondition), under the per-`(table, main)` write queue with `SidecarKind::Optimize` recovery coverage spanning both ops; **commits even with no compaction work if index coverage is stale**; **refuses on an unrecovered graph**; **skips uncovered HEAD > manifest drift** with `DriftNeedsRepair`; **skips blob-bearing tables** (reported via `TableOptimizeStats.skipped`, not silent; reindex is skipped for them too today), gated on `LANCE_SUPPORTS_BLOB_COMPACTION` until the upstream blob-v2 compaction-decode bug is fixed (see [docs/dev/invariants.md](docs/dev/invariants.md) Known Gaps) | -| Repair uncovered drift | — | `omnigraph repair` explicitly classifies uncovered table `HEAD > manifest` drift: verified maintenance drift (`ReserveFragments`/`Rewrite`) can be published with `--confirm`; suspicious or unverifiable drift requires `--force --confirm`. Sidecar-covered crash residuals still recover automatically on open. | +| Atomic single-dataset commits | ✅ | **Multi-table publish via three layers**, NOT a single Lance primitive: (1) per-table Lance `commit_staged` for the data write, (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering, (3) the open-time recovery sweep for the residual gap between (1) and (2).
All three layers ship; the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) write a `__recovery/{ulid}.json` sidecar before Phase B and delete it after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the sweep in `db/manifest/recovery.rs`: classify, decide all-or-nothing per sidecar, roll forward via single `ManifestBatchPublisher::publish` or roll back via `Dataset::restore`, and record an audit row in `_graph_commit_recoveries.lance` (queryable via `omnigraph commit list --filter actor=omnigraph:recovery`). Continuous in-process recovery (no restart needed between Phase B failure and recovery) is the goal of a future background reconciler. Engine writes route through a sealed `TableStorage` trait exposing `stage_*` + `commit_staged` as the canonical staged-write surface; documented inline-commit residuals (`delete_where`, `create_vector_index`, plus legacy `append_batch` / `merge_insert_batches` / `overwrite_batch` / `create_*_index`) remain on the trait until upstream Lance ships a public two-phase API ([#6658](https://github.com/lance-format/lance/issues/6658), [#6666](https://github.com/lance-format/lance/issues/6666)) and the migration of every call site completes. | +| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency | | Cleanup (`cleanup_old_versions`) | ✅ | `omnigraph cleanup` with `--keep` / `--older-than` policy | -| BTREE / inverted (FTS) / vector indexes | ✅ | `@index`/`@key` declares intent; the physical index is derived state that never fails a logical op. Built per column through one chokepoint (`build_indices_on_dataset_for_catalog`, type-dispatched by `node_prop_index_kind`: enum + orderable scalar → BTREE, free-text String → FTS, Vector → vector); idempotent; lazy across branches.
**Schema apply builds nothing** (records intent only); `load`/`mutate` build inline but **defer an untrainable Vector column** (no trainable vectors yet) as *pending* rather than aborting. `ensure_indices`/`optimize` is the reconciler that materializes declared-but-missing indexes and restores coverage of appended/rewritten fragments (`optimize_indices`), reporting still-pending columns (see Compaction row). | +| BTREE / inverted (FTS) / vector indexes | ✅ | `ensure_indices` builds them on every relevant column; idempotent; lazy across branches | | `merge_insert` upsert | ✅ | `LoadMode::Merge`, mutation `update`/`insert`/`delete` lowering | | Vector search | ✅ | `nearest()` query op; embedding pipeline (Gemini / OpenAI clients); `@embed` in schema | | Full-text search | ✅ | `search/fuzzy/match_text/bm25` query ops | @@ -263,16 +222,15 @@ omnigraph policy explain --cluster ./company-brain --graph knowledge --actor act | Schema language | — | `.pg` + Pest grammar + catalog + interfaces + constraints + annotations | | Query language | — | `.gq` + Pest grammar + IR + lowering + linter | | Schema migration planning | — | `plan_schema_migration` + `apply_schema` step types + `__schema_apply_lock__` | -| Commit graph (DAG) across whole graph | — | Lineage (linear + merge parents, ULID ids, actor) stored as `graph_commit`/`graph_head` rows in `__manifest`, written in the same publish CAS as the table-version rows (RFC-013 Phase 7 — no separate `_graph_commits.lance` write; manifest→commit-graph atomicity gap closed); the in-memory commit graph is a projection of those rows | +| Commit graph (DAG) across whole graph | — | `_graph_commits.lance` with linear + merge parents, ULID ids, actor map | | Per-query atomic writes | — | In-memory `MutationStaging.pending` accumulator + `stage_*` / `commit_staged` per touched table at end-of-query + publisher CAS via `commit_with_expected` (single manifest commit per `mutate_as` / `load`); D₂ parse-time rule
keeps inserts/updates and deletes from mixing | | Three-way row-level merge | — | `OrderedTableCursor` + `StagedTableWriter`, structured `MergeConflictKind` | | Change feeds | — | `diff_between` / `diff_commits` with manifest fast path + ID streaming | -| Cedar policy | — | Per-graph actions plus server-scoped actions (see [docs/user/operations/policy.md](docs/user/operations/policy.md) for the current list), branch / target_branch / protected scopes, validate/test/explain CLI. **Engine-wide enforcement** (MR-722): every `_as` writer (`apply_schema_as`, `mutate_as`, `load_as` — the deprecated `ingest_as` shims route through it — `branch_create_as` / `branch_create_from_as`, `branch_delete_as`, `branch_merge_as`) calls `Omnigraph::enforce(action, scope, actor)` — HTTP, CLI, embedded SDK all hit the same gate. | -| HTTP server | — | Axum, OpenAPI via utoipa, bearer auth (SHA-256, AWS Secrets Manager option), `authorize_request` at the HTTP boundary (resolves bearer→actor, applies admission control), NDJSON streaming export, **cluster-only boot (RFC-011): always `--cluster `, serving N graphs (N ≥ 1) under multi-graph routes + read-only `GET /graphs` enumeration + per-graph + server-level Cedar policies. Add/remove graphs via `cluster apply` and restart.** | -| CLI with config | — | two-surface config (team `cluster.yaml` dir + per-operator `~/.omnigraph/config.yaml`), scope addressing (`--store`/`--server`/`--cluster`/`--profile`/defaults, RFC-011), aliases, multi-format output (json/jsonl/csv/kv/table) | +| Cedar policy | — | Per-graph actions plus server-scoped actions (see [docs/user/policy.md](docs/user/policy.md) for the current list), branch / target_branch / protected scopes, validate/test/explain CLI.
**Engine-wide enforcement** (MR-722): every `_as` writer (`apply_schema_as`, `mutate_as`, `load_as`, `ingest_as`, `branch_create_as` / `branch_create_from_as`, `branch_delete_as`, `branch_merge_as`) calls `Omnigraph::enforce(action, scope, actor)` — HTTP, CLI, embedded SDK all hit the same gate. | +| HTTP server | — | Axum, OpenAPI via utoipa, bearer auth (SHA-256, AWS Secrets Manager option), `authorize_request` at the HTTP boundary (resolves bearer→actor, applies admission control), NDJSON streaming export, **multi-graph mode (v0.6.0+) with cluster routes + read-only `GET /graphs` enumeration + per-graph + server-level Cedar policies. Add/remove graphs by editing `omnigraph.yaml` and restarting.** | +| CLI with config | — | `omnigraph.yaml`, aliases, multi-format output (json/jsonl/csv/kv/table) | | Audit / actor tracking | — | `_as` write APIs + actor map in commit graph | -| Local S3 testing | — | run RustFS/MinIO + the `AWS_*` env; see [docs/user/deployment.md](docs/user/deployment.md) → *Testing against S3 locally* | -| Agent skill | — | `skills/omnigraph` — operational playbook for driving Omnigraph; install with `npx skills add ModernRelay/omnigraph@omnigraph` | +| Local RustFS bootstrap | — | `scripts/local-rustfs-bootstrap.sh` one-shot S3-backed dev environment | --- @@ -293,7 +251,7 @@ Rules: 7. **Re-verify before recommending.** If you cite a flag, env var, endpoint, or constant to the user or in code, grep for it in source first. Memory and docs go stale; the code is authoritative. 8. **Keep AGENTS.md short.** This file is always loaded into agent context, so every added line has a recurring context-window cost. Prefer pointers and terse invariants here; put detail in `docs/`. 9. **Keep AGENTS.md a map, not an encyclopedia.** New deep content goes into `docs/`. Add an entry to "Where to find each topic" instead of pasting prose into this file.
The "Always-on rules" section is the exception — it's for invariants that should always be in scope. -10. **Re-read on schema/query/IR changes.** Edits to `schema.pest`, `query.pest`, `ir/lower.rs`, `query/typecheck.rs`, or `query/lint.rs` should trigger a re-read of [docs/user/schema/index.md](docs/user/schema/index.md), [docs/user/queries/index.md](docs/user/queries/index.md), and [docs/dev/execution.md](docs/dev/execution.md) to confirm they still describe reality. +10. **Re-read on schema/query/IR changes.** Edits to `schema.pest`, `query.pest`, `ir/lower.rs`, `query/typecheck.rs`, or `query/lint.rs` should trigger a re-read of [docs/user/schema-language.md](docs/user/schema-language.md), [docs/user/query-language.md](docs/user/query-language.md), and [docs/dev/execution.md](docs/dev/execution.md) to confirm they still describe reality. 11. **Always make smaller commits.** Each commit does one thing, compiles, and passes tests; mechanical refactors land separately from the behavior changes they enable. 12. **Test-first for bug fixes.** When fixing an identified bug, write a regression test that reproduces the failure first. Confirm it fails against the current code with the predicted symptom (not an unrelated error). Then land the fix in a separate commit and confirm the test turns green. The test commit lands just before the fix commit so the red → green pair is visible in `git log` and a reviewer can check out the test commit alone and reproduce the failure. 13. **Correct by design over symptomatic patches.** When a bug surfaces, identify the root cause and make the fix correct by construction. Don't patch the symptom. If the design admits the bug class, the fix is to close the class, not to add a guard around the latest instance. A symptomatic patch is acceptable only as a stop-gap, with an explicit note in the commit message and a follow-up issue tracking the design fix.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 1029b2f..8d9c687 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,29 +1,10 @@ # Contributing -Thanks for your interest in OmniGraph. This page is the practical how-to; the -rules and decision authority behind it live in [GOVERNANCE.md](GOVERNANCE.md). +Small bug fixes and documentation improvements are welcome directly through pull +requests. -## Start in the right place - -| I want to… | Go to | Notes | -|---|---|---| -| **Report a bug** or wrong behavior | **[Open an Issue](../../issues/new/choose)** | Concrete and reproducible. A maintainer triages it; once labelled **`accepted`** it's open for a PR. | -| **Suggest a feature / share an idea / ask** | **[Start a Discussion](../../discussions)** | Ideas and questions live here, not in Issues. | -| **Propose a design / RFC** | **An RFC pull request** | Anyone can author one — see [docs/rfcs/README.md](docs/rfcs/README.md). A maintainer merging it is acceptance. | -| **Fix something / implement a change** | **A pull request** | Must link an `accepted` issue or an accepted RFC — unless it's trivial (below). | -| **Report a security vulnerability** | **[SECURITY.md](SECURITY.md)** | Do **not** open a public Issue. | - -### When can I just open a PR? -The **trivial fast-lane** — open directly, no prior issue/RFC needed: typo and -wording fixes, doc corrections, dependency bumps, comment fixes, obvious -one-line CI tweaks. Anything more substantial needs a backing `accepted` issue -or accepted RFC first, so the *why* is agreed before the *how* is reviewed. A PR -that turns out to be non-trivial will be redirected — that's about process, not -the merit of the change. - -> **Maintainers (ModernRelay team)** follow a separate internal process and are -> not bound by the intake rules above. Everyone is bound by review, branch -> protection, and CI.
+For larger changes, please open an issue or design discussion first so the +proposed direction is clear before implementation starts. ## Development @@ -68,11 +49,6 @@ CI runs both. ## Pull Requests -- **Link the backing issue or RFC** (`Closes #123`, or reference the RFC) — or - mark the PR as trivial per the fast-lane. -- Keep changes focused; one logical change per PR. -- Include tests for behavior changes when practical. -- Update public docs when the user-facing surface changes. - -New to the codebase? Read [AGENTS.md](AGENTS.md) — the architecture map and the -always-on invariants every change is reviewed against. +- keep changes focused +- include tests for behavior changes when practical +- update public docs when the user-facing surface changes diff --git a/Cargo.lock b/Cargo.lock index 16a1827..a3d6d62 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -23,9 +23,9 @@ version = "0.8.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b169f7a6d4742236a0a00c541b845991d0ac43e546831af1249753ab4c3aa3a0" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "cipher", - "cpufeatures 0.2.17", + "cpufeatures", ] [[package]] @@ -34,7 +34,7 @@ version = "0.8.12" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5a15f179cd60c4584b8a8c596927aadc462e27f2ca70c04e0071964a73ba7a75" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "const-random", "getrandom 0.3.4", "once_cell", @@ -83,9 +83,9 @@ dependencies = [ [[package]] name = "anstream" -version = "1.0.0" +version = "0.6.21" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "824a212faf96e9acacdbd09febd34438f8f711fb84e09a8916013cd7815ca28d" +checksum = "43d5b281e737544384e969a5ccad3f1cdd24b48086a0fc1b2a5262a26b8f4f4a" dependencies = [ "anstyle", "anstyle-parse", @@ -104,9 +104,9 @@ checksum = "5192cca8006f1fd4f7237516f40fa183bb07f8fbdfedaa0036de5ea9b0b45e78" [[package]] name = "anstyle-parse" -version = "1.0.0" +version = "0.2.7" source =
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "52ce7f38b242319f7cabaa6813055467063ecdc9d355bbb4ce0c68908cd8130e" +checksum = "4e7644824f0aa2c7b9384579234ef10eb7efb6a0deb83f9630a49594dd9c15c2" dependencies = [ "utf8parse", ] @@ -137,15 +137,6 @@ version = "1.0.102" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7f202df86484c868dbad7eaa557ef785d5c66295e41b460ef922eca0723b842c" -[[package]] -name = "approx" -version = "0.5.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cab112f0a86d568ea0e627cc1d6be74a1e9cd55214684db5561995f6dad897c6" -dependencies = [ - "num-traits", -] - [[package]] name = "ar_archive_writer" version = "0.5.1" @@ -717,11 +708,11 @@ dependencies = [ "bytes", "form_urlencoded", "hex", - "hmac 0.12.1", + "hmac", "http 0.2.12", "http 1.4.0", "percent-encoding", - "sha2 0.10.9", + "sha2", "time", "tracing", ] @@ -989,7 +980,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "bb531853791a215d7c62a30daf0dde835f381ab5de4589cfe7c649d2cbe92bd6" dependencies = [ "addr2line", - "cfg-if 1.0.4", + "cfg-if", "libc", "miniz_oxide", "object", @@ -1080,7 +1071,7 @@ version = "0.10.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "46502ad458c9a52b69d4d4d32775c788b7a1b85e8bc9d482d92250fc0e3f8efe" dependencies = [ - "digest 0.10.7", + "digest", ] [[package]] @@ -1092,9 +1083,9 @@ dependencies = [ "arrayref", "arrayvec 0.7.6", "cc", - "cfg-if 1.0.4", + "cfg-if", "constant_time_eq", - "cpufeatures 0.2.17", + "cpufeatures", ] [[package]] @@ -1106,15 +1097,6 @@ dependencies = [ "generic-array", ] -[[package]] -name = "block-buffer" -version = "0.12.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d2f6c7dbe95a6ed67ad9f18e57daf93a2f034c524b99fd2b76d18fdfeb6660aa" -dependencies = [ - "hybrid-array", -] - [[package]] name = "block-padding" version = "0.3.3" @@ -1284,12 +1266,6 @@ dependencies = [ 
"smol_str", ] -[[package]] -name = "cfg-if" -version = "0.1.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4785bdd1c96b2a846b2bd7cc02e86b6b3dbf14e7e53446c4f54c92a361040822" - [[package]] name = "cfg-if" version = "1.0.4" @@ -1302,17 +1278,6 @@ version = "0.2.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "613afe47fcd5fac7ccf1db93babcb082c5994d996f20b8b159f2ad1658eb5724" -[[package]] -name = "chacha20" -version = "0.10.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6f8d983286843e49675a4b7a2d174efe136dc93a18d69130dd18198a6c167601" -dependencies = [ - "cfg-if 1.0.4", - "cpufeatures 0.3.0", - "rand_core 0.10.1", -] - [[package]] name = "chrono" version = "0.4.44" @@ -1343,15 +1308,15 @@ version = "0.4.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "773f3b9af64447d2ce9850330c473515014aa235e6a783b02db81ff39e4a3dad" dependencies = [ - "crypto-common 0.1.7", + "crypto-common", "inout", ] [[package]] name = "clap" -version = "4.6.1" +version = "4.5.58" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1ddb117e43bbf7dacf0a4190fef4d345b9bad68dfc649cb349e7d17d28428e51" +checksum = "63be97961acde393029492ce0be7a1af7e323e6bae9511ebfac33751be5e6806" dependencies = [ "clap_builder", "clap_derive", @@ -1359,9 +1324,9 @@ dependencies = [ [[package]] name = "clap_builder" -version = "4.6.0" +version = "4.5.58" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "714a53001bf66416adb0e2ef5ac857140e7dc3a0c48fb28b2f10762fc4b5069f" +checksum = "7f13174bda5dfd69d7e947827e5af4b0f2f94a4a3ee92912fba07a66150f21e2" dependencies = [ "anstream", "anstyle", @@ -1371,9 +1336,9 @@ dependencies = [ [[package]] name = "clap_derive" -version = "4.6.1" +version = "4.5.55" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f2ce8604710f6733aa641a2b3731eaa1e8b3d9973d5e3565da11800813f997a9" 
+checksum = "a92793da1a46a5f2a02a6f4c46c6496b28c43638adea8306fcb0caa1634f24e5" dependencies = [ "heck", "proc-macro2", @@ -1396,12 +1361,6 @@ dependencies = [ "cc", ] -[[package]] -name = "cmov" -version = "0.5.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0c9ea0ac24bc397ab3c98583a3c9ba74fa56b09a4449bbe172b9b1ddb016027a" - [[package]] name = "color-eyre" version = "0.6.5" @@ -1435,25 +1394,6 @@ version = "1.0.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b05b61dc5112cbb17e4b6cd61790d9845d13888356391624cbe7e41efeac1e75" -[[package]] -name = "colored" -version = "3.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "faf9468729b8cbcea668e36183cb69d317348c2e08e994829fb56ebfdfbaac34" -dependencies = [ - "windows-sys 0.61.2", -] - -[[package]] -name = "combine" -version = "4.6.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ba5a308b75df32fe02788e748662718f03fde005016435c444eea572398219fd" -dependencies = [ - "bytes", - "memchr", -] - [[package]] name = "comfy-table" version = "7.2.2" @@ -1496,12 +1436,6 @@ version = "0.9.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c2459377285ad874054d797f3ccebf984978aa39129f6eafde5cdc8315b612f8" -[[package]] -name = "const-oid" -version = "0.10.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a6ef517f0926dd24a1582492c791b6a4818a4d94e789a334894aa15b0d12f55c" - [[package]] name = "const-random" version = "0.1.18" @@ -1522,37 +1456,12 @@ dependencies = [ "tiny-keccak", ] -[[package]] -name = "const-str" -version = "1.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "18f12cc9948ed9604230cdddc7c86e270f9401ccbe3c2e98a4378c5e7632212f" - -[[package]] -name = "const_panic" -version = "0.2.15" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"e262cdaac42494e3ae34c43969f9cdeb7da178bdb4b66fa6a1ea2edb4c8ae652" -dependencies = [ - "typewit", -] - [[package]] name = "constant_time_eq" version = "0.4.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3d52eff69cd5e647efe296129160853a42795992097e8af39800e1060caeea9b" -[[package]] -name = "core-foundation" -version = "0.9.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "91e195e091a93c46f7102ec7818a2aa394e1e1771c3ab4825963fa03e45afb8f" -dependencies = [ - "core-foundation-sys", - "libc", -] - [[package]] name = "core-foundation" version = "0.10.1" @@ -1569,15 +1478,6 @@ version = "0.8.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "773648b94d0e5d620f64f280777445740e61fe701025087ec8b57f45c791888b" -[[package]] -name = "countio" -version = "0.3.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b9702aee5d1d744c01d82f6915644f950f898e014903385464c773b96fefdecb" -dependencies = [ - "futures-io", -] - [[package]] name = "cpufeatures" version = "0.2.17" @@ -1587,15 +1487,6 @@ dependencies = [ "libc", ] -[[package]] -name = "cpufeatures" -version = "0.3.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8b2a41393f66f16b0823bb79094d54ac5fbd34ab292ddafb9a0456ac9f87d201" -dependencies = [ - "libc", -] - [[package]] name = "crc32c" version = "0.6.8" @@ -1611,7 +1502,7 @@ version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9481c1c90cbf2ac953f07c8d4a58aa3945c425b7185c9154d67a65e4230da511" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", ] [[package]] @@ -1683,15 +1574,6 @@ dependencies = [ "typenum", ] -[[package]] -name = "crypto-common" -version = "0.2.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ce6e4c961d6cd6c9a86db418387425e8bdeaf05b3c8bc1411e6dca4c252f1453" -dependencies = [ - "hybrid-array", -] - [[package]] name = "csv" version = "1.4.0" 
@@ -1713,31 +1595,6 @@ dependencies = [ "memchr", ] -[[package]] -name = "ctor" -version = "0.6.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "424e0138278faeb2b401f174ad17e715c829512d74f3d1e81eb43365c2e0590e" -dependencies = [ - "ctor-proc-macro", - "dtor", -] - -[[package]] -name = "ctor-proc-macro" -version = "0.0.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "52560adf09603e58c9a7ee1fe1dcb95a16927b17c127f0ac02d6e768a0e25bc1" - -[[package]] -name = "ctutils" -version = "0.4.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7d5515a3834141de9eafb9717ad39eea8247b5674e6066c404e8c4b365d2a29e" -dependencies = [ - "cmov", -] - [[package]] name = "darling" version = "0.23.0" @@ -1778,7 +1635,7 @@ version = "6.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5041cc499144891f3790297212f32a74fb938e5136a14943f338ef9e0ae276cf" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "crossbeam-utils", "hashbrown 0.14.5", "lock_api", @@ -1824,7 +1681,7 @@ dependencies = [ "futures", "itertools 0.14.0", "log", - "object_store", + "object_store 0.13.2", "parking_lot", "rand 0.9.2", "regex", @@ -1855,7 +1712,7 @@ dependencies = [ "futures", "itertools 0.14.0", "log", - "object_store", + "object_store 0.13.2", "parking_lot", "tokio", ] @@ -1880,7 +1737,7 @@ dependencies = [ "futures", "itertools 0.14.0", "log", - "object_store", + "object_store 0.13.2", ] [[package]] @@ -1899,7 +1756,7 @@ dependencies = [ "itertools 0.14.0", "libc", "log", - "object_store", + "object_store 0.13.2", "paste", "sqlparser", "tokio", @@ -1940,7 +1797,7 @@ dependencies = [ "glob", "itertools 0.14.0", "log", - "object_store", + "object_store 0.13.2", "rand 0.9.2", "tokio", "url", @@ -1966,7 +1823,7 @@ dependencies = [ "datafusion-session", "futures", "itertools 0.14.0", - "object_store", + "object_store 0.13.2", "tokio", ] @@ -1988,7 +1845,7 @@ dependencies = [ 
"datafusion-physical-plan", "datafusion-session", "futures", - "object_store", + "object_store 0.13.2", "regex", "tokio", ] @@ -2011,7 +1868,7 @@ dependencies = [ "datafusion-physical-plan", "datafusion-session", "futures", - "object_store", + "object_store 0.13.2", "serde_json", "tokio", "tokio-stream", @@ -2039,7 +1896,7 @@ dependencies = [ "datafusion-physical-expr-common", "futures", "log", - "object_store", + "object_store 0.13.2", "parking_lot", "rand 0.9.2", "tempfile", @@ -2108,7 +1965,7 @@ dependencies = [ "num-traits", "rand 0.9.2", "regex", - "sha2 0.10.9", + "sha2", "unicode-segmentation", "uuid", ] @@ -2427,7 +2284,7 @@ version = "0.7.10" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e7c1832837b905bbfb5101e07cc24c8deddf52f93225eee6ead5f4d63d53ddcb" dependencies = [ - "const-oid 0.9.6", + "const-oid", "pem-rfc7468", "zeroize", ] @@ -2454,24 +2311,12 @@ version = "0.10.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9ed9a281f7bc9b7576e61468ba615a66a5c8cfdff42420a70aa82701a3b1e292" dependencies = [ - "block-buffer 0.10.4", - "const-oid 0.9.6", - "crypto-common 0.1.7", + "block-buffer", + "const-oid", + "crypto-common", "subtle", ] -[[package]] -name = "digest" -version = "0.11.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f1dd6dbb5841937940781866fa1281a1ff7bd3bf827091440879f9994983d5c2" -dependencies = [ - "block-buffer 0.12.1", - "const-oid 0.10.2", - "crypto-common 0.2.2", - "ctutils", -] - [[package]] name = "dirs" version = "6.0.0" @@ -2513,21 +2358,6 @@ dependencies = [ "const-random", ] -[[package]] -name = "dtor" -version = "0.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "404d02eeb088a82cfd873006cb713fe411306c7d182c344905e101fb1167d301" -dependencies = [ - "dtor-proc-macro", -] - -[[package]] -name = "dtor-proc-macro" -version = "0.0.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"f678cf4a922c215c63e0de95eb1ff08a958a81d47e485cf9da1e27bf6305cfa5" - [[package]] name = "dunce" version = "1.0.5" @@ -2573,7 +2403,7 @@ version = "0.8.35" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "75030f3c4f45dafd7586dd6780965a8c7e8e285a5ecb86713e63a79c5b2766f3" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", ] [[package]] @@ -2748,9 +2578,9 @@ checksum = "42703706b716c37f96a77aea830392ad231f44c9e9a67872fa5548707e11b11c" [[package]] name = "fsst" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bcd0ce0249ac12fd44fcde62d435c36d881952c2f0df4d1de24b45e1dbba5ddb" +checksum = "83cf860f6a6bf0a6a60fdfe5a36c75121fad5ea4332d1d12deee3e65b6047727" dependencies = [ "arrow-array", "rand 0.9.2", @@ -2859,15 +2689,6 @@ dependencies = [ "slab", ] -[[package]] -name = "gearhash" -version = "0.1.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c8cf82cf76cd16485e56295a1377c775ce708c9f1a0be6b029076d60a245d213" -dependencies = [ - "cfg-if 0.1.10", -] - [[package]] name = "generator" version = "0.8.8" @@ -2875,7 +2696,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "52f04ae4152da20c76fe800fa48659201d5cf627c5149ca0b707b69d7eef6cf9" dependencies = [ "cc", - "cfg-if 1.0.4", + "cfg-if", "libc", "log", "rustversion", @@ -2899,10 +2720,10 @@ version = "0.2.17" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ff2abc00be7fca6ebc474524697ae276ad847ad0a6b3faa4bcb027e9a4614ad0" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "js-sys", "libc", - "wasi 0.11.1+wasi-snapshot-preview1", + "wasi", "wasm-bindgen", ] @@ -2912,7 +2733,7 @@ version = "0.3.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "899def5c37c4fd7b2664648c28120ecec138e4d395b459e5ca34f9cce2dd77fd" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "js-sys", "libc", "r-efi 5.3.0", @@ -2926,14 +2747,11 @@ version = 
"0.4.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0de51e6874e94e7bf76d726fc5d13ba782deca734ff60d5bb2fb2607c7406555" dependencies = [ - "cfg-if 1.0.4", - "js-sys", + "cfg-if", "libc", "r-efi 6.0.0", - "rand_core 0.10.1", "wasip2", "wasip3", - "wasm-bindgen", ] [[package]] @@ -2942,26 +2760,6 @@ version = "0.32.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e629b9b98ef3dd8afe6ca2bd0f89306cec16d43d907889945bc5d6687f2f13c7" -[[package]] -name = "git-version" -version = "0.3.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1ad568aa3db0fcbc81f2f116137f263d7304f512a1209b35b85150d3ef88ad19" -dependencies = [ - "git-version-macro", -] - -[[package]] -name = "git-version-macro" -version = "0.3.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "53010ccb100b96a67bc32c0175f0ed1426b31b655d562898e57325f81c023ac0" -dependencies = [ - "proc-macro2", - "quote", - "syn 2.0.117", -] - [[package]] name = "glob" version = "0.3.3" @@ -3024,7 +2822,7 @@ version = "2.7.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6ea2d84b969582b4b1864a92dc5d27cd2b77b622a8d79306834f1be5ba20d84b" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "crunchy", "num-traits", "zerocopy", @@ -3068,12 +2866,6 @@ version = "0.17.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ed5909b6e89a2db4456e54cd5f673791d7eca6732202bbf2a9cc504fe2f9b84a" -[[package]] -name = "heapify" -version = "0.2.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0049b265b7f201ca9ab25475b22b47fe444060126a51abe00f77d986fc5cc52e" - [[package]] name = "heck" version = "0.5.0" @@ -3092,44 +2884,22 @@ version = "0.4.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7f24254aa9a54b5c858eaee2f5bccdb46aaf0e486a595ed5fd8f86ba55232a70" -[[package]] -name = "hf-xet" -version = "1.5.2" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "430b33fa84f92796d4d263070b6c0d3ca219df7b9a0e1853ee431029b1612bcd" -dependencies = [ - "async-trait", - "bytes", - "http 1.4.0", - "more-asserts", - "serde", - "thiserror", - "tokio", - "tokio-util", - "tracing", - "uuid", - "xet-client", - "xet-core-structures", - "xet-data", - "xet-runtime", -] - [[package]] name = "hmac" version = "0.12.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6c49c37c09c17a53d937dfbb742eb3a961d65a994e6bcdcf37e7399d0cc8ab5e" dependencies = [ - "digest 0.10.7", + "digest", ] [[package]] -name = "hmac" -version = "0.13.0" +name = "home" +version = "0.5.12" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6303bc9732ae41b04cb554b844a762b4115a61bfaa81e3e83050991eeb56863f" +checksum = "cc627f471c528ff0c4a49e1d5e60450c8f6461dd6d10ba9dcd3a61d3dff7728d" dependencies = [ - "digest 0.11.3", + "windows-sys 0.61.2", ] [[package]] @@ -3205,15 +2975,6 @@ version = "2.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "135b12329e5e3ce057a9f972339ea52bc954fe1e9358ef27f95e89716fbc5424" -[[package]] -name = "hybrid-array" -version = "0.4.12" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9155a582abd142abc056962c29e3ce5ff2ad5469f4246b537ed42c5deba857da" -dependencies = [ - "typenum", -] - [[package]] name = "hyper" version = "0.14.32" @@ -3312,11 +3073,9 @@ dependencies = [ "percent-encoding", "pin-project-lite", "socket2 0.6.2", - "system-configuration", "tokio", "tower-service", "tracing", - "windows-registry", ] [[package]] @@ -3512,7 +3271,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "4d09b98f7eace8982db770e4408e7470b028ce513ac28fecdc6bf4c30fe92b62" dependencies = [ "bitflags", - "cfg-if 1.0.4", + "cfg-if", "libc", ] @@ -3570,12 +3329,10 @@ checksum = "1a3546dc96b6d42c5f24902af9e2538e82e39ad350b0c766eb3fbf2d8f3d8359" dependencies = 
[ "jiff-static", "jiff-tzdb-platform", - "js-sys", "log", "portable-atomic", "portable-atomic-util", "serde_core", - "wasm-bindgen", "windows-sys 0.61.2", ] @@ -3605,55 +3362,6 @@ dependencies = [ "jiff-tzdb", ] -[[package]] -name = "jni" -version = "0.22.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5efd9a482cf3a427f00d6b35f14332adc7902ce91efb778580e180ff90fa3498" -dependencies = [ - "cfg-if 1.0.4", - "combine", - "jni-macros", - "jni-sys", - "log", - "simd_cesu8", - "thiserror", - "walkdir", - "windows-link", -] - -[[package]] -name = "jni-macros" -version = "0.22.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a00109accc170f0bdb141fed3e393c565b6f5e072365c3bd58f5b062591560a3" -dependencies = [ - "proc-macro2", - "quote", - "rustc_version", - "simd_cesu8", - "syn 2.0.117", -] - -[[package]] -name = "jni-sys" -version = "0.4.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c6377a88cb3910bee9b0fa88d4f42e1d2da8e79915598f65fb0c7ee14c878af2" -dependencies = [ - "jni-sys-macros", -] - -[[package]] -name = "jni-sys-macros" -version = "0.4.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "38c0b942f458fe50cdac086d2f946512305e5631e720728f2a61aabcd47a6264" -dependencies = [ - "quote", - "syn 2.0.117", -] - [[package]] name = "jobserver" version = "0.1.34" @@ -3694,32 +3402,30 @@ dependencies = [ "serde_json", ] +[[package]] +name = "jsonwebtoken" +version = "9.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5a87cc7a48537badeae96744432de36f4be2b4a34a05a5ef32e9dd8a1c169dde" +dependencies = [ + "base64", + "js-sys", + "pem", + "ring", + "serde", + "serde_json", + "simple_asn1", +] + [[package]] name = "keccak" version = "0.1.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "cb26cec98cce3a3d96cbb7bced3c4b16e3d13f27ec56dbd62cbc8f39cfb9d653" dependencies = [ - "cpufeatures 0.2.17", + 
"cpufeatures", ] -[[package]] -name = "konst" -version = "0.4.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f660d5f887e3562f9ab6f4a14988795b694099d66b4f5dedc02d197ba9becb1d" -dependencies = [ - "const_panic", - "konst_proc_macros", - "typewit", -] - -[[package]] -name = "konst_proc_macros" -version = "0.4.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e037a2e1d8d5fdbd49b16a4ea09d5d6401c1f29eca5ff29d03d3824dba16256a" - [[package]] name = "lalrpop" version = "0.22.2" @@ -3754,11 +3460,10 @@ dependencies = [ [[package]] name = "lance" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3944aca86f4c78f4da04af1c2bf33e664a2826b7af72972ad200d6b9de59019f" +checksum = "d34e854994e84d043897f5ec9fb609221e9e69e3fd52996cd715d979fcd349f6" dependencies = [ - "arc-swap", "arrow", "arrow-arith", "arrow-array", @@ -3773,11 +3478,9 @@ dependencies = [ "async-trait", "async_cell", "aws-credential-types", - "bitpacking", "byteorder", "bytes", "chrono", - "crossbeam-queue", "crossbeam-skiplist", "dashmap", "datafusion", @@ -3804,14 +3507,13 @@ dependencies = [ "lance-tokenizer", "log", "moka", - "object_store", + "object_store 0.12.5", "permutation", "pin-project", "prost", "prost-build", "prost-types", "rand 0.9.2", - "rayon", "roaring", "semver", "serde", @@ -3827,12 +3529,13 @@ dependencies = [ [[package]] name = "lance-arrow" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "253f4a0a70580c985b91e65e9ca6cad644825a4078de28d8efbacf3ffbd7ecdc" +checksum = "7827fe404358c27d120ee8ea8ef7b9415c2911d54072bec83dd689d750ae65da" dependencies = [ "arrow-array", "arrow-buffer", + "arrow-cast", "arrow-data", "arrow-ipc", "arrow-ord", @@ -3849,9 +3552,9 @@ dependencies = [ [[package]] name = "lance-bitpacking" -version = "7.0.0" +version = "6.0.1" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "80c4d12521b1945041dd515a56aa0854973138e7ac12111c92572e33e4ecb593" +checksum = "2cd0b31570d50fe13c7e4e36b03e1f1c99c3d8e5a34845b24b0665b51b40570d" dependencies = [ "arrayref", "paste", @@ -3860,9 +3563,9 @@ dependencies = [ [[package]] name = "lance-core" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "13f84020da5a484e2f07dd1796e09785ed7cd889857ebc4cb77e32ef214ee594" +checksum = "b128c213c676cb8e03c62a68670642770825171e64097cc2da97cbb19fe35d29" dependencies = [ "arrow-array", "arrow-buffer", @@ -3870,6 +3573,7 @@ dependencies = [ "async-trait", "byteorder", "bytes", + "chrono", "datafusion-common", "datafusion-sql", "deepsize", @@ -3878,9 +3582,10 @@ dependencies = [ "lance-arrow", "libc", "log", + "mock_instant", "moka", "num_cpus", - "object_store", + "object_store 0.12.5", "pin-project", "prost", "rand 0.9.2", @@ -3897,9 +3602,9 @@ dependencies = [ [[package]] name = "lance-datafusion" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7460597a66534a75987993d4dac5bc330586d99c5b79ae73367dbcbd4e29e576" +checksum = "e03b2de71cbcd09b10bf1a17c83cacbc0176ecd97203fb72b9e59d9b8f9a3743" dependencies = [ "arrow", "arrow-array", @@ -3923,15 +3628,16 @@ dependencies = [ "pin-project", "prost", "prost-build", + "snafu", "tokio", "tracing", ] [[package]] name = "lance-datagen" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "046f5506ed2271cd941a050de7bf535dd3aedc291aadec836a63fa56c5926e3b" +checksum = "2fe7c7ea7fd397e495a1646fec360e46ee0cbd75718f1c0e887aad657c5f2944" dependencies = [ "arrow", "arrow-array", @@ -3949,9 +3655,9 @@ dependencies = [ [[package]] name = "lance-encoding" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"7af54edf43dcf9d6a56cc636eb35d457e68373c6448dca3f0891b3325b4a24e6" +checksum = "fe3f8070835b407d8db9ea8728386bc3207ba23c66a9c22d344e231ef12b77ca" dependencies = [ "arrow-arith", "arrow-array", @@ -3976,7 +3682,9 @@ dependencies = [ "num-traits", "prost", "prost-build", + "prost-types", "rand 0.9.2", + "snafu", "strum", "tokio", "tracing", @@ -3986,9 +3694,9 @@ dependencies = [ [[package]] name = "lance-file" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0772ae2d6207995dc1eb28aff9507f78e90b3362b58f311da001e9dc25f3d736" +checksum = "a6dfcf654549330df3aef708cd7c12e170feecddd34d6c19dd005b4153213268" dependencies = [ "arrow-arith", "arrow-array", @@ -4009,21 +3717,21 @@ dependencies = [ "lance-io", "log", "num-traits", - "object_store", + "object_store 0.12.5", "prost", "prost-build", "prost-types", + "snafu", "tokio", "tracing", ] [[package]] name = "lance-index" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e71fbfb51096a903cb524fe0da716f5f15fbc4a6b6f84cd6dec21abf319c5e84" +checksum = "4fb8ad0bd10efa2608634a2518b7dd501231e76c56a65fbd6519e23914cc425a" dependencies = [ - "arc-swap", "arrow", "arrow-arith", "arrow-array", @@ -4042,6 +3750,7 @@ dependencies = [ "datafusion-common", "datafusion-expr", "datafusion-physical-expr", + "datafusion-sql", "deepsize", "dirs", "fst", @@ -4063,7 +3772,7 @@ dependencies = [ "log", "ndarray", "num-traits", - "object_store", + "object_store 0.12.5", "prost", "prost-build", "prost-types", @@ -4075,6 +3784,7 @@ dependencies = [ "serde", "serde_json", "smallvec", + "snafu", "tempfile", "tokio", "tracing", @@ -4084,9 +3794,9 @@ dependencies = [ [[package]] name = "lance-io" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bab8c98ef1b870b20541d27f3ca4efdf7c9f5c25214233be07d231ba88900219" +checksum = 
"ef5314703fa8c8baed04193cc669da80ab42521c6319d3cc921a4a997690dcc0" dependencies = [ "arrow", "arrow-arith", @@ -4110,9 +3820,10 @@ dependencies = [ "lance-arrow", "lance-core", "lance-namespace", + "libc", "log", "moka", - "object_store", + "object_store 0.12.5", "object_store_opendal", "opendal", "path_abs", @@ -4120,6 +3831,7 @@ dependencies = [ "prost", "rand 0.9.2", "serde", + "snafu", "tempfile", "tokio", "tracing", @@ -4128,9 +3840,9 @@ dependencies = [ [[package]] name = "lance-linalg" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6b4c51cad0ac780b02dc4da48528479e7693c03e8d05390510bbc69ca2a9a1f1" +checksum = "51aa9b73279f505b2bec0f194c7a2390ca74ad3260131e631a7bef8d97d54b2e" dependencies = [ "arrow-array", "arrow-buffer", @@ -4146,29 +3858,31 @@ dependencies = [ [[package]] name = "lance-namespace" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "014e8332ca0615506342e0d3af608639864b68396973be14239f09c9f21f1fc2" +checksum = "39cd01581f55ce45c49cbe494ee86c7ba7ca4ca3654690fd820941cd9105a46e" dependencies = [ "arrow", "async-trait", "bytes", "lance-core", "lance-namespace-reqwest-client", + "serde", "snafu", ] [[package]] name = "lance-namespace-impls" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e8d1231906a3cf92dd3dcda7d14a09c4835af6cd2bcd76dfd2481e87f20a282d" +checksum = "c2cb89f3933060f01350ad05a5a3fbda952e8ba638799bf8ac4cd2368416ee46" dependencies = [ "arrow", "arrow-ipc", "arrow-schema", "async-trait", "bytes", + "chrono", "futures", "lance", "lance-core", @@ -4178,9 +3892,10 @@ dependencies = [ "lance-namespace", "lance-table", "log", - "object_store", + "object_store 0.12.5", "rand 0.9.2", "serde_json", + "snafu", "tokio", "url", ] @@ -4191,7 +3906,7 @@ version = "0.7.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"6369eee4682fb11edf538388b43c61ce288b8302fe89bb40944d7daa7faaae99" dependencies = [ - "reqwest 0.12.28", + "reqwest", "serde", "serde_json", "serde_repr", @@ -4201,9 +3916,9 @@ dependencies = [ [[package]] name = "lance-table" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b16f1355904aea4ebb04ffc70c58c97901e10bde44452b4b021de4a1f329250d" +checksum = "5db70650465a1af174b7dfe6948ec91a3d466ada12e11274eb66e51132173aa0" dependencies = [ "arrow", "arrow-array", @@ -4221,7 +3936,7 @@ dependencies = [ "lance-file", "lance-io", "log", - "object_store", + "object_store 0.12.5", "prost", "prost-build", "prost-types", @@ -4240,9 +3955,9 @@ dependencies = [ [[package]] name = "lance-tokenizer" -version = "7.0.0" +version = "6.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b39b7f5ed9d0c0b716bf599b559d888267ed1dfe4c4e29d3648b51d2a28940cf" +checksum = "eb08ef9382c9d58036c323db2c19cc097e02d1d0d87714fc7176b5d3b36a31aa" dependencies = [ "rust-stemmers", "serde", @@ -4425,7 +4140,7 @@ version = "0.7.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "419e0dc8046cb947daa77eb95ae174acfbddb7673b4151f56d1eed8e93fbfaca" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "generator", "scoped-tls", "tracing", @@ -4500,17 +4215,8 @@ version = "0.10.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d89e7ee0cfbedfc4da3340218492196241d89eefb6dab27de5df917a6d2e78cf" dependencies = [ - "cfg-if 1.0.4", - "digest 0.10.7", -] - -[[package]] -name = "mea" -version = "0.6.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2640d335e7273dacdcf51044026139b2e269c3bb0dfc3f8cb3496b85e3f6a42c" -dependencies = [ - "slab", + "cfg-if", + "digest", ] [[package]] @@ -4525,7 +4231,7 @@ version = "7.6.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"5f98efec8807c63c752b5bd61f862c165c115b0a35685bdcfd9238c7aeb592b7" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "miette-derive", "serde", "unicode-width 0.1.14", @@ -4575,10 +4281,16 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a69bcab0ad47271a0234d9422b131806bf3968021e5dc9328caf2d4cd58557fc" dependencies = [ "libc", - "wasi 0.11.1+wasi-snapshot-preview1", + "wasi", "windows-sys 0.61.2", ] +[[package]] +name = "mock_instant" +version = "0.6.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dce6dd36094cac388f119d2e9dc82dc730ef91c32a6222170d630e5414b956e6" + [[package]] name = "moka" version = "0.12.15" @@ -4599,12 +4311,6 @@ dependencies = [ "uuid", ] -[[package]] -name = "more-asserts" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1fafa6961cabd9c63bcd77a45d7e3b7f3b552b70417831fb0f56db717e72407e" - [[package]] name = "multimap" version = "0.10.1" @@ -4656,15 +4362,6 @@ version = "0.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "61807f77802ff30975e01f4f071c8ba10c022052f98b3294119f3e615d13e5be" -[[package]] -name = "ntapi" -version = "0.4.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c3b335231dfd352ffb0f8017f3b6027a4917f7df785ea2143d8af2adc66980ae" -dependencies = [ - "winapi", -] - [[package]] name = "nu-ansi-term" version = "0.50.3" @@ -4755,34 +4452,6 @@ dependencies = [ "libc", ] -[[package]] -name = "objc2-core-foundation" -version = "0.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2a180dd8642fa45cdb7dd721cd4c11b1cadd4929ce112ebd8b9f5803cc79d536" -dependencies = [ - "bitflags", -] - -[[package]] -name = "objc2-io-kit" -version = "0.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "33fafba39597d6dc1fb709123dfa8289d39406734be322956a69f0931c73bb15" -dependencies = [ - "libc", - "objc2-core-foundation", -] - 
-[[package]] -name = "objc2-system-configuration" -version = "0.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7216bd11cbda54ccabcab84d523dc93b858ec75ecfb3a7d89513fa22464da396" -dependencies = [ - "objc2-core-foundation", -] - [[package]] name = "object" version = "0.37.3" @@ -4794,18 +4463,16 @@ dependencies = [ [[package]] name = "object_store" -version = "0.13.2" +version = "0.12.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "622acbc9100d3c10e2ee15804b0caa40e55c933d5aa53814cd520805b7958a49" +checksum = "fbfbfff40aeccab00ec8a910b57ca8ecf4319b335c542f2edcd19dd25a1e2a00" dependencies = [ "async-trait", "base64", "bytes", "chrono", "form_urlencoded", - "futures-channel", - "futures-core", - "futures-util", + "futures", "http 1.4.0", "http-body-util", "httparse", @@ -4815,11 +4482,11 @@ dependencies = [ "md-5", "parking_lot", "percent-encoding", - "quick-xml 0.39.4", - "rand 0.10.1", - "reqwest 0.12.28", + "quick-xml 0.38.4", + "rand 0.9.2", + "reqwest", "ring", - "rustls-pki-types", + "rustls-pemfile", "serde", "serde_json", "serde_urlencoded", @@ -4833,50 +4500,62 @@ dependencies = [ ] [[package]] -name = "object_store_opendal" -version = "0.56.0" +name = "object_store" +version = "0.13.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "08298874eee5935c95bcaa393148834f9c53d904461ca15584a041d8a1c907c2" +checksum = "622acbc9100d3c10e2ee15804b0caa40e55c933d5aa53814cd520805b7958a49" +dependencies = [ + "async-trait", + "bytes", + "chrono", + "futures-channel", + "futures-core", + "futures-util", + "http 1.4.0", + "humantime", + "itertools 0.14.0", + "parking_lot", + "percent-encoding", + "thiserror", + "tokio", + "tracing", + "url", + "walkdir", + "wasm-bindgen-futures", + "web-time", +] + +[[package]] +name = "object_store_opendal" +version = "0.55.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"113ab0769e972eee585e57407b98de08bda5354fa28e8ba4d89038d6cb6a8991" dependencies = [ "async-trait", "bytes", "chrono", "futures", - "mea", - "object_store", + "object_store 0.12.5", "opendal", "pin-project", "tokio", ] -[[package]] -name = "omnigraph-api-types" -version = "0.7.2" -dependencies = [ - "omnigraph-compiler", - "omnigraph-engine", - "serde", - "serde_json", - "utoipa", -] - [[package]] name = "omnigraph-cli" -version = "0.7.2" +version = "0.6.0" dependencies = [ "assert_cmd", "clap", "color-eyre", "lance", "lance-index", - "omnigraph-api-types", - "omnigraph-cluster", "omnigraph-compiler", "omnigraph-engine", "omnigraph-policy", "omnigraph-server", "predicates", - "reqwest 0.12.28", + "reqwest", "serde", "serde_json", "serde_yaml", @@ -4884,28 +4563,9 @@ dependencies = [ "tokio", ] -[[package]] -name = "omnigraph-cluster" -version = "0.7.2" -dependencies = [ - "fail", - "omnigraph-compiler", - "omnigraph-engine", - "serde", - "serde_json", - "serde_yaml", - "serial_test", - "sha2 0.10.9", - "tempfile", - "thiserror", - "time", - "tokio", - "ulid", -] - [[package]] name = "omnigraph-compiler" -version = "0.7.2" +version = "0.6.0" dependencies = [ "ahash", "arrow-array", @@ -4916,15 +4576,17 @@ dependencies = [ "arrow-select", "pest", "pest_derive", + "reqwest", "serde", "serde_json", - "sha2 0.10.9", + "sha2", "thiserror", + "tokio", ] [[package]] name = "omnigraph-engine" -version = "0.7.2" +version = "0.6.0" dependencies = [ "arc-swap", "arrow-array", @@ -4942,21 +4604,18 @@ dependencies = [ "lance-datafusion", "lance-file", "lance-index", - "lance-io", "lance-linalg", "lance-namespace", "lance-namespace-impls", "lance-table", - "object_store", + "object_store 0.12.5", "omnigraph-compiler", "omnigraph-policy", - "proptest", "regex", - "reqwest 0.12.28", + "reqwest", "serde", "serde_json", "serial_test", - "sha2 0.10.9", "tempfile", "thiserror", "time", @@ -4968,7 +4627,7 @@ dependencies = [ [[package]] name = "omnigraph-policy" -version = "0.7.2" 
+version = "0.6.0" dependencies = [ "cedar-policy", "clap", @@ -4981,7 +4640,7 @@ dependencies = [ [[package]] name = "omnigraph-server" -version = "0.7.2" +version = "0.6.0" dependencies = [ "arc-swap", "async-trait", @@ -4994,8 +4653,6 @@ dependencies = [ "futures", "lance", "lance-index", - "omnigraph-api-types", - "omnigraph-cluster", "omnigraph-compiler", "omnigraph-engine", "omnigraph-policy", @@ -5004,7 +4661,7 @@ dependencies = [ "serde_json", "serde_yaml", "serial_test", - "sha2 0.10.9", + "sha2", "subtle", "tempfile", "thiserror", @@ -5028,227 +4685,34 @@ version = "1.70.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "384b8ab6d37215f3c5301a95a4accb5d64aa607f1fcb26a11b5303878451b4fe" -[[package]] -name = "oneshot" -version = "0.1.13" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "269bca4c2591a28585d6bf10d9ed0332b7d76900a1b02bec41bdc3a2cdcda107" - [[package]] name = "opendal" -version = "0.56.0" +version = "0.55.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "97b31d3d8e99a85d83b73ec26647f5607b80578ed9375810b6e44ffa3590a236" -dependencies = [ - "ctor", - "opendal-core", - "opendal-layer-concurrent-limit", - "opendal-layer-logging", - "opendal-layer-retry", - "opendal-layer-timeout", - "opendal-service-azblob", - "opendal-service-azdls", - "opendal-service-gcs", - "opendal-service-hf", - "opendal-service-oss", - "opendal-service-s3", -] - -[[package]] -name = "opendal-core" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1849dd2687e173e776d3af5fce1ba3ae47b9dd37a09d1c4deba850ef45fe00ca" +checksum = "d075ab8a203a6ab4bc1bce0a4b9fe486a72bf8b939037f4b78d95386384bc80a" dependencies = [ "anyhow", + "backon", "base64", "bytes", + "crc32c", "futures", + "getrandom 0.2.17", "http 1.4.0", "http-body 1.0.1", "jiff", "log", "md-5", - "mea", "percent-encoding", "quick-xml 0.38.4", - "reqsign-core", - "reqwest 0.13.4", + 
"reqsign", + "reqwest", "serde", "serde_json", + "sha2", "tokio", "url", "uuid", - "web-time", -] - -[[package]] -name = "opendal-layer-concurrent-limit" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "048b1b29c503263bdd80a9afe46a68cd02ea9bd361185b1feab4b151078998e9" -dependencies = [ - "futures", - "http 1.4.0", - "mea", - "opendal-core", -] - -[[package]] -name = "opendal-layer-logging" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d2645adc988b12eda106e2679ae529facfbbaa868ceb706f6f8125c6af15c47b" -dependencies = [ - "log", - "opendal-core", -] - -[[package]] -name = "opendal-layer-retry" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4eac134ffa4ddda6131a640a84a5315996424b9416c85052f8c64c1a33b70ad4" -dependencies = [ - "backon", - "log", - "opendal-core", -] - -[[package]] -name = "opendal-layer-timeout" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "619586ab7480c2e3009f6d18eabab18957bc094778fd130bcc38924970a90f4c" -dependencies = [ - "opendal-core", - "tokio", -] - -[[package]] -name = "opendal-service-azblob" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7452bf3ec61cfd81ac9ad9ada17825931e9e371d44a045c6bfab9596c0a2ac3b" -dependencies = [ - "base64", - "bytes", - "http 1.4.0", - "log", - "opendal-core", - "opendal-service-azure-common", - "quick-xml 0.38.4", - "reqsign-azure-storage", - "reqsign-core", - "reqsign-file-read-tokio", - "serde", - "sha2 0.10.9", - "uuid", -] - -[[package]] -name = "opendal-service-azdls" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8f9884c2d8cf8ba2bb077d79c877dac5863ba3bab9e2c9c1e41a2e0491404772" -dependencies = [ - "bytes", - "http 1.4.0", - "log", - "opendal-core", - "opendal-service-azure-common", - "quick-xml 0.38.4", - 
"reqsign-azure-storage", - "reqsign-core", - "reqsign-file-read-tokio", - "serde", - "serde_json", -] - -[[package]] -name = "opendal-service-azure-common" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ffb0e45d6c8dcf66ce2da20e241bcb80e6e540e109a4ff20f318f6c9b4c54e0c" -dependencies = [ - "http 1.4.0", - "opendal-core", -] - -[[package]] -name = "opendal-service-gcs" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "70a49477a10163431896d106136117f5670717f9c9e49cf6f710528800c6633a" -dependencies = [ - "async-trait", - "bytes", - "http 1.4.0", - "log", - "opendal-core", - "percent-encoding", - "quick-xml 0.38.4", - "reqsign-core", - "reqsign-file-read-tokio", - "reqsign-google", - "serde", - "serde_json", - "tokio", -] - -[[package]] -name = "opendal-service-hf" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7b2ab7a2a8a11dfe257ef4db5c0de798acbcd0d6429c37382dad2154bc06a388" -dependencies = [ - "bytes", - "hf-xet", - "http 1.4.0", - "log", - "opendal-core", - "percent-encoding", - "reqwest 0.13.4", - "serde", - "serde_json", -] - -[[package]] -name = "opendal-service-oss" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "29c8a917829ad06d21b639558532cb0101fe49b040d946d673a73018683fac05" -dependencies = [ - "bytes", - "http 1.4.0", - "log", - "opendal-core", - "quick-xml 0.38.4", - "reqsign-aliyun-oss", - "reqsign-core", - "reqsign-file-read-tokio", - "serde", -] - -[[package]] -name = "opendal-service-s3" -version = "0.56.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9dadddeb9bb50b0d30927dd914c298c4ddca47e4c1cfa7674d311f0cf9b051c8" -dependencies = [ - "base64", - "bytes", - "crc32c", - "http 1.4.0", - "log", - "md-5", - "opendal-core", - "quick-xml 0.38.4", - "reqsign-aws-v4", - "reqsign-core", - "reqsign-file-read-tokio", - "serde", - 
"url", ] [[package]] @@ -5282,15 +4746,6 @@ dependencies = [ "hashbrown 0.14.5", ] -[[package]] -name = "os_str_bytes" -version = "6.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e2355d85b9a3786f481747ced0e0ff2ba35213a1f9bd406ed906554d7af805a1" -dependencies = [ - "memchr", -] - [[package]] name = "outref" version = "0.5.2" @@ -5325,7 +4780,7 @@ version = "0.9.12" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2621685985a2ebf1c516881c026032ac7deafcda1a2c9b7850dc81e3dfcb64c1" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "libc", "redox_syscall", "smallvec", @@ -5356,8 +4811,8 @@ version = "0.12.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f8ed6a7761f76e3b9f92dfb0a60a6a6477c61024b775147ff0973a02653abaf2" dependencies = [ - "digest 0.10.7", - "hmac 0.12.1", + "digest", + "hmac", ] [[package]] @@ -5431,7 +4886,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "89815c69d36021a140146f26659a81d6c2afa33d216d736dd4be5381a7362220" dependencies = [ "pest", - "sha2 0.10.9", + "sha2", ] [[package]] @@ -5543,7 +4998,7 @@ dependencies = [ "der", "pbkdf2", "scrypt", - "sha2 0.10.9", + "sha2", "spki", ] @@ -5670,25 +5125,6 @@ dependencies = [ "unicode-ident", ] -[[package]] -name = "proptest" -version = "1.11.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4b45fcc2344c680f5025fe57779faef368840d0bd1f42f216291f0dc4ace4744" -dependencies = [ - "bit-set", - "bit-vec", - "bitflags", - "num-traits", - "rand 0.9.2", - "rand_chacha 0.9.0", - "rand_xorshift", - "regex-syntax", - "rusty-fork", - "tempfile", - "unarray", -] - [[package]] name = "prost" version = "0.14.3" @@ -5751,10 +5187,14 @@ dependencies = [ ] [[package]] -name = "quick-error" -version = "1.2.3" +name = "quick-xml" +version = "0.37.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"a1d01941d82fa2ab50be1e79e6714289dd7cde78eba4c074bc5a4374f650dfe0" +checksum = "331e97a1af0bf59823e6eadffe373d7b27f485be8748f71471c662c1f269b7fb" +dependencies = [ + "memchr", + "serde", +] [[package]] name = "quick-xml" @@ -5766,26 +5206,6 @@ dependencies = [ "serde", ] -[[package]] -name = "quick-xml" -version = "0.39.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cdcc8dd4e2f670d309a5f0e83fe36dfdc05af317008fea29144da1a2ac858e5e" -dependencies = [ - "memchr", - "serde", -] - -[[package]] -name = "quick-xml" -version = "0.40.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2474bd2e5029e7ccb6abb2ba48cf2383a333851dedf495901544281590c7da7f" -dependencies = [ - "memchr", - "serde", -] - [[package]] name = "quinn" version = "0.11.9" @@ -5812,7 +5232,6 @@ version = "0.11.13" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f1906b49b0c3bc04b5fe5d86a77925ae6524a19b816ae38ce1e426255f1d8a31" dependencies = [ - "aws-lc-rs", "bytes", "getrandom 0.3.4", "lru-slab", @@ -5844,9 +5263,9 @@ dependencies = [ [[package]] name = "quote" -version = "1.0.45" +version = "1.0.44" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924" +checksum = "21b2ebcf727b7760c461f091f9f0f539b77b8e87f2fd88131e7f1b433b3cece4" dependencies = [ "proc-macro2", ] @@ -5890,17 +5309,6 @@ dependencies = [ "rand_core 0.9.5", ] -[[package]] -name = "rand" -version = "0.10.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d2e8e8bcc7961af1fdac401278c6a831614941f6164ee3bf4ce61b7edb162207" -dependencies = [ - "chacha20", - "getrandom 0.4.2", - "rand_core 0.10.1", -] - [[package]] name = "rand_chacha" version = "0.3.1" @@ -5939,12 +5347,6 @@ dependencies = [ "getrandom 0.3.4", ] -[[package]] -name = "rand_core" -version = "0.10.1" -source = "registry+https://github.com/rust-lang/crates.io-index" 
-checksum = "63b8176103e19a2643978565ca18b50549f6101881c443590420e4dc998a3c69" - [[package]] name = "rand_distr" version = "0.5.1" @@ -5955,15 +5357,6 @@ dependencies = [ "rand 0.9.2", ] -[[package]] -name = "rand_xorshift" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "513962919efc330f829edb2535844d1b912b0fbe2ca165d613e4e8788bb05a5a" -dependencies = [ - "rand_core 0.9.5", -] - [[package]] name = "rand_xoshiro" version = "0.7.0" @@ -6018,15 +5411,6 @@ dependencies = [ "crossbeam-utils", ] -[[package]] -name = "redb" -version = "3.1.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4ba239c1c1693315d3cc0e601db3b3965543afbf48c41730fdca2f069f510f4a" -dependencies = [ - "libc", -] - [[package]] name = "redox_syscall" version = "0.5.18" @@ -6103,116 +5487,34 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "dc897dd8d9e8bd1ed8cdad82b5966c3e0ecae09fb1907d58efaa013543185d0a" [[package]] -name = "reqsign-aliyun-oss" -version = "3.1.0" +name = "reqsign" +version = "0.16.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "372266b4733756738eeb199a98188037d27a0989980e2600ae7ce1faf00a867d" +checksum = "43451dbf3590a7590684c25fb8d12ecdcc90ed3ac123433e500447c7d77ed701" dependencies = [ "anyhow", + "async-trait", + "base64", + "chrono", "form_urlencoded", + "getrandom 0.2.17", + "hex", + "hmac", + "home", "http 1.4.0", + "jsonwebtoken", "log", + "once_cell", "percent-encoding", - "reqsign-core", + "quick-xml 0.37.5", + "rand 0.8.5", + "reqwest", + "rsa", "rust-ini", "serde", "serde_json", -] - -[[package]] -name = "reqsign-aws-v4" -version = "3.0.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7b75624bd8a466e37ddc0a7b6c33ac859a85347c153a916e1dd9d0b68338f74a" -dependencies = [ - "anyhow", - "bytes", - "form_urlencoded", - "hex", - "http 1.4.0", - "log", - "percent-encoding", - "quick-xml 0.40.1", - 
"reqsign-core", - "rust-ini", - "serde", - "serde_json", - "serde_urlencoded", "sha1", -] - -[[package]] -name = "reqsign-azure-storage" -version = "3.0.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "62b96928e73ad984de1d99e382749d09e5dab7dd707b767974f7e40aa926b82f" -dependencies = [ - "anyhow", - "base64", - "bytes", - "form_urlencoded", - "http 1.4.0", - "log", - "pem", - "percent-encoding", - "reqsign-core", - "rsa", - "serde", - "serde_json", - "sha1", -] - -[[package]] -name = "reqsign-core" -version = "3.0.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a5fa5cb48808693614d1701fcd3db0b30fa292e0f18e122ae068b6d32eaeed3f" -dependencies = [ - "anyhow", - "base64", - "bytes", - "form_urlencoded", - "futures", - "hex", - "hmac 0.13.0", - "http 1.4.0", - "jiff", - "log", - "percent-encoding", - "rsa", - "serde", - "serde_json", - "sha1", - "sha2 0.11.0", - "windows-sys 0.61.2", -] - -[[package]] -name = "reqsign-file-read-tokio" -version = "3.0.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6a4b6f3a3fd29ffcc99a90aec585a65217783badfd73acddf847b63ae683bda9" -dependencies = [ - "anyhow", - "reqsign-core", - "tokio", -] - -[[package]] -name = "reqsign-google" -version = "3.0.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "eb215d0876a18b6bd9cdd380b589e5292aaa638ca15266de794b1122d898b6b2" -dependencies = [ - "form_urlencoded", - "http 1.4.0", - "log", - "percent-encoding", - "reqsign-aws-v4", - "reqsign-core", - "rsa", - "serde", - "serde_json", + "sha2", "tokio", ] @@ -6258,65 +5560,11 @@ dependencies = [ "url", "wasm-bindgen", "wasm-bindgen-futures", - "wasm-streams 0.4.2", + "wasm-streams", "web-sys", "webpki-roots", ] -[[package]] -name = "reqwest" -version = "0.13.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "219c5811de6525e5416c7d5d53bb656d3afdbc6c5af816e0802bcfa42dbdc1c3" -dependencies = [ - 
"base64", - "bytes", - "futures-core", - "futures-util", - "http 1.4.0", - "http-body 1.0.1", - "http-body-util", - "hyper 1.8.1", - "hyper-rustls 0.27.7", - "hyper-util", - "js-sys", - "log", - "percent-encoding", - "pin-project-lite", - "quinn", - "rustls 0.23.36", - "rustls-pki-types", - "rustls-platform-verifier", - "serde", - "serde_json", - "sync_wrapper", - "tokio", - "tokio-rustls 0.26.4", - "tokio-util", - "tower", - "tower-http", - "tower-service", - "url", - "wasm-bindgen", - "wasm-bindgen-futures", - "wasm-streams 0.5.0", - "web-sys", -] - -[[package]] -name = "reqwest-middleware" -version = "0.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "07bc3f1384cffa4f274dad2d4ddd73aed32fed8f786d96c6be8aa4e5fd3c3b58" -dependencies = [ - "anyhow", - "async-trait", - "http 1.4.0", - "reqwest 0.13.4", - "thiserror", - "tower-service", -] - [[package]] name = "ring" version = "0.17.14" @@ -6324,7 +5572,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a4689e6c2294d81e88dc6261c768b63bc4fcdb852be6d1352498b114f61383b7" dependencies = [ "cc", - "cfg-if 1.0.4", + "cfg-if", "getrandom 0.2.17", "libc", "untrusted", @@ -6333,9 +5581,9 @@ dependencies = [ [[package]] name = "roaring" -version = "0.11.4" +version = "0.11.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1dedc5658c6ecb3bdb5ef5f3295bb9253f42dcf3fd1402c03f6b1f7659c3c4a9" +checksum = "8ba9ce64a8f45d7fc86358410bb1a82e8c987504c0d4900e9141d69a9f26c885" dependencies = [ "bytemuck", "byteorder", @@ -6347,15 +5595,15 @@ version = "0.9.10" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b8573f03f5883dcaebdfcf4725caa1ecb9c15b2ef50c43a07b816e06799bb12d" dependencies = [ - "const-oid 0.9.6", - "digest 0.10.7", + "const-oid", + "digest", "num-bigint-dig", "num-integer", "num-traits", "pkcs1", "pkcs8", "rand_core 0.6.4", - "sha2 0.10.9", + "sha2", "signature", "spki", "subtle", @@ -6368,7 +5616,7 @@ 
version = "0.21.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "796e8d2b6696392a43bea58116b667fb4c29727dc5abd27d6acf338bb4f688c7" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "ordered-multimap", ] @@ -6461,6 +5709,15 @@ dependencies = [ "security-framework", ] +[[package]] +name = "rustls-pemfile" +version = "2.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dce314e5fee3f39953d46bb63bb8a46d40c2f8fb7cc5a3b6cab2bde9721d6e50" +dependencies = [ + "rustls-pki-types", +] + [[package]] name = "rustls-pki-types" version = "1.14.0" @@ -6471,33 +5728,6 @@ dependencies = [ "zeroize", ] -[[package]] -name = "rustls-platform-verifier" -version = "0.7.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "26d1e2536ce4f35f4846aa13bff16bd0ff40157cdb14cc056c7b14ba41233ba0" -dependencies = [ - "core-foundation 0.10.1", - "core-foundation-sys", - "jni", - "log", - "once_cell", - "rustls 0.23.36", - "rustls-native-certs", - "rustls-platform-verifier-android", - "rustls-webpki 0.103.9", - "security-framework", - "security-framework-sys", - "webpki-root-certs", - "windows-sys 0.61.2", -] - -[[package]] -name = "rustls-platform-verifier-android" -version = "0.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f87165f0995f63a9fbeea62b64d10b4d9d8e78ec6d7d51fb2125fda7bb36788f" - [[package]] name = "rustls-webpki" version = "0.101.7" @@ -6526,30 +5756,12 @@ version = "1.0.22" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b39cdef0fa800fc44525c84ccb54a029961a8215f9619753635a9c0d2538d46d" -[[package]] -name = "rusty-fork" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cc6bf79ff24e648f6da1f8d1f011e9cac26491b619e6b9280f2b47f1774e6ee2" -dependencies = [ - "fnv", - "quick-error", - "tempfile", - "wait-timeout", -] - [[package]] name = "ryu" version = "1.0.23" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "9774ba4a74de5f7b1c1451ed6cd5285a32eddb5cccb8cc655a4e50009e06477f" -[[package]] -name = "safe-transmute" -version = "0.11.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3944826ff8fa8093089aba3acb4ef44b9446a99a16f3bf4e74af3f77d340ab7d" - [[package]] name = "salsa20" version = "0.10.2" @@ -6630,7 +5842,7 @@ checksum = "0516a385866c09368f0b5bcd1caff3366aace790fcd46e2bb032697bb172fd1f" dependencies = [ "pbkdf2", "salsa20", - "sha2 0.10.9", + "sha2", ] [[package]] @@ -6656,7 +5868,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b7f4bc775c73d9a02cde8bf7b2ec4c9d12743edf609006c7facc23998404cd1d" dependencies = [ "bitflags", - "core-foundation 0.10.1", + "core-foundation", "core-foundation-sys", "libc", "security-framework-sys", @@ -6834,13 +6046,13 @@ dependencies = [ [[package]] name = "sha1" -version = "0.11.0" +version = "0.10.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "aacc4cc499359472b4abe1bf11d0b12e688af9a805fa5e3016f9a386dc2d0214" +checksum = "e3bf829a2d51ab4a5ddf1352d8470c140cadc8301b2ae1789db023f01cedd6ba" dependencies = [ - "cfg-if 1.0.4", - "cpufeatures 0.3.0", - "digest 0.11.3", + "cfg-if", + "cpufeatures", + "digest", ] [[package]] @@ -6849,30 +6061,9 @@ version = "0.10.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a7507d819769d01a365ab707794a4084392c824f54a7a6a7862f8c3d0892b283" dependencies = [ - "cfg-if 1.0.4", - "cpufeatures 0.2.17", - "digest 0.10.7", - "sha2-asm", -] - -[[package]] -name = "sha2" -version = "0.11.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "446ba717509524cb3f22f17ecc096f10f4822d76ab5c0b9822c5f9c284e825f4" -dependencies = [ - "cfg-if 1.0.4", - "cpufeatures 0.3.0", - "digest 0.11.3", -] - -[[package]] -name = "sha2-asm" -version = "0.6.4" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "b845214d6175804686b2bd482bcffe96651bb2d1200742b712003504a2dac1ab" -dependencies = [ - "cc", + "cfg-if", + "cpufeatures", + "digest", ] [[package]] @@ -6881,7 +6072,7 @@ version = "0.10.8" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "75872d278a8f37ef87fa0ddbda7802605cb18344497949862c0d4dcb291eba60" dependencies = [ - "digest 0.10.7", + "digest", "keccak", ] @@ -6894,17 +6085,6 @@ dependencies = [ "lazy_static", ] -[[package]] -name = "shellexpand" -version = "3.1.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "32824fab5e16e6c4d86dc1ba84489390419a39f97699852b66480bb87d297ed8" -dependencies = [ - "bstr", - "dirs", - "os_str_bytes", -] - [[package]] name = "shlex" version = "1.3.0" @@ -6927,7 +6107,7 @@ version = "2.2.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "77549399552de45a898a580c1b41d445bf730df867cc44e6c0233bbc4b8329de" dependencies = [ - "digest 0.10.7", + "digest", "rand_core 0.6.4", ] @@ -6937,22 +6117,24 @@ version = "0.3.8" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e320a6c5ad31d271ad523dcf3ad13e2767ad8b1cb8f047f75a8aeaf8da139da2" -[[package]] -name = "simd_cesu8" -version = "1.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "94f90157bb87cddf702797c5dadfa0be7d266cdf49e22da2fcaa32eff75b2c33" -dependencies = [ - "rustc_version", - "simdutf8", -] - [[package]] name = "simdutf8" version = "0.1.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e3a9fe34e3e7a50316060351f37187a3f546bce95496156754b601a5fa71b76e" +[[package]] +name = "simple_asn1" +version = "0.6.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0d585997b0ac10be3c5ee635f1bab02d512760d14b7c468801ac8a01d9ae5f1d" +dependencies = [ + "num-bigint", + "num-traits", + "thiserror", + "time", +] + [[package]] name 
= "siphasher" version = "1.0.2" @@ -7072,28 +6254,12 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "08d74a23609d509411d10e2176dc2a4346e3b4aea2e7b1869f19fdedbc71c013" dependencies = [ "cc", - "cfg-if 1.0.4", + "cfg-if", "libc", "psm", "windows-sys 0.59.0", ] -[[package]] -name = "static_assertions" -version = "1.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a2eb9349b6444b326872e140eb1cf5e7c522154d69e7a0ffb0fb81c06b37543f" - -[[package]] -name = "statrs" -version = "0.18.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2a3fe7c28c6512e766b0874335db33c94ad7b8f9054228ae1c2abd47ce7d335e" -dependencies = [ - "approx", - "num-traits", -] - [[package]] name = "std_prelude" version = "0.2.12" @@ -7152,12 +6318,6 @@ version = "2.6.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "13c2bddecc57b384dee18652358fb23172facb8a2c51ccc10d74c157bdea3292" -[[package]] -name = "symlink" -version = "0.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a7973cce6668464ea31f176d85b13c7ab3bba2cb3b77a2ed26abd7801688010a" - [[package]] name = "syn" version = "1.0.109" @@ -7200,41 +6360,6 @@ dependencies = [ "syn 2.0.117", ] -[[package]] -name = "sysinfo" -version = "0.38.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "92ab6a2f8bfe508deb3c6406578252e491d299cbbf3bc0529ecc3313aee4a52f" -dependencies = [ - "libc", - "memchr", - "ntapi", - "objc2-core-foundation", - "objc2-io-kit", - "windows", -] - -[[package]] -name = "system-configuration" -version = "0.7.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a13f3d0daba03132c0aa9767f98351b3488edc2c100cda2d2ec2b04f3d8d3c8b" -dependencies = [ - "bitflags", - "core-foundation 0.9.4", - "system-configuration-sys", -] - -[[package]] -name = "system-configuration-sys" -version = "0.6.0" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "8e1d1b10ced5ca923a1fcb8d03e96b8d3268065d724548c0211415ff6ac6bac4" -dependencies = [ - "core-foundation-sys", - "libc", -] - [[package]] name = "tagptr" version = "0.2.0" @@ -7310,7 +6435,7 @@ version = "1.1.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f60246a4944f24f6e018aa17cdeffb7818b76356965d03b07d6a9886e8962185" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", ] [[package]] @@ -7406,17 +6531,6 @@ dependencies = [ "syn 2.0.117", ] -[[package]] -name = "tokio-retry" -version = "0.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4a129d95275ebf4c493ec53bf0f8cd95f5ac161bc4f381700809a54f595d4470" -dependencies = [ - "pin-project-lite", - "rand 0.10.1", - "tokio", -] - [[package]] name = "tokio-rustls" version = "0.24.1" @@ -7526,19 +6640,6 @@ dependencies = [ "tracing-core", ] -[[package]] -name = "tracing-appender" -version = "0.2.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "050686193eb999b4bb3bc2acfa891a13da00f79734704c4b8b4ef1a10b368a3c" -dependencies = [ - "crossbeam-channel", - "symlink", - "thiserror", - "time", - "tracing-subscriber", -] - [[package]] name = "tracing-attributes" version = "0.1.31" @@ -7581,16 +6682,6 @@ dependencies = [ "tracing-core", ] -[[package]] -name = "tracing-serde" -version = "0.2.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "704b1aeb7be0d0a84fc9828cae51dab5970fee5088f83d1dd7ee6f6246fc6ff1" -dependencies = [ - "serde", - "tracing-core", -] - [[package]] name = "tracing-subscriber" version = "0.3.22" @@ -7601,15 +6692,12 @@ dependencies = [ "nu-ansi-term", "once_cell", "regex-automata", - "serde", - "serde_json", "sharded-slab", "smallvec", "thread_local", "tracing", "tracing-core", "tracing-log", - "tracing-serde", ] [[package]] @@ -7635,15 +6723,9 @@ checksum = "6af6ae20167a9ece4bcb41af5b80f8a1f1df981f6391189ce00fd257af04126a" 
[[package]] name = "typenum" -version = "1.20.1" +version = "1.19.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b6f5e870be6c3b371b77fe0ee0bafb859fa4964b4404c27de1d380043c4dda20" - -[[package]] -name = "typewit" -version = "1.15.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "214ca0b2191785cbc06209b9ca1861e048e39b5ba33574b3cedd58363d5bb5f6" +checksum = "562d481066bde0658276a35467c4af00bdc6ee726305698a55b86e61d7ad82bb" [[package]] name = "ucd-trie" @@ -7661,12 +6743,6 @@ dependencies = [ "web-time", ] -[[package]] -name = "unarray" -version = "0.1.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "eaea85b334db583fe3274d12b4cd1880032beab409c0d774be044d4480ab9a94" - [[package]] name = "unicase" version = "2.9.0" @@ -7864,15 +6940,6 @@ version = "0.11.1+wasi-snapshot-preview1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ccf3ec651a847eb01de73ccad15eb7d99f80485de043efb2f370cd654f4ea44b" -[[package]] -name = "wasi" -version = "0.14.7+wasi-0.2.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "883478de20367e224c0090af9cf5f9fa85bed63a95c1abf3afc5c083ebc06e8c" -dependencies = [ - "wasip2", -] - [[package]] name = "wasip2" version = "1.0.2+wasi-0.2.9" @@ -7891,22 +6958,13 @@ dependencies = [ "wit-bindgen", ] -[[package]] -name = "wasite" -version = "1.0.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "66fe902b4a6b8028a753d5424909b764ccf79b7a209eac9bf97e59cda9f71a42" -dependencies = [ - "wasi 0.14.7+wasi-0.2.4", -] - [[package]] name = "wasm-bindgen" version = "0.2.108" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "64024a30ec1e37399cf85a7ffefebdb72205ca1c972291c51512360d90bd8566" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "once_cell", "rustversion", "wasm-bindgen-macro", @@ -7919,7 +6977,7 @@ version = "0.4.58" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "70a6e77fd0ae8029c9ea0063f87c46fde723e7d887703d74ad2616d792e51e6f" dependencies = [ - "cfg-if 1.0.4", + "cfg-if", "futures-util", "js-sys", "once_cell", @@ -7994,19 +7052,6 @@ dependencies = [ "web-sys", ] -[[package]] -name = "wasm-streams" -version = "0.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9d1ec4f6517c9e11ae630e200b2b65d193279042e28edd4a2cda233e46670bbb" -dependencies = [ - "futures-util", - "js-sys", - "wasm-bindgen", - "wasm-bindgen-futures", - "web-sys", -] - [[package]] name = "wasmparser" version = "0.244.0" @@ -8039,15 +7084,6 @@ dependencies = [ "wasm-bindgen", ] -[[package]] -name = "webpki-root-certs" -version = "1.0.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f31141ce3fc3e300ae89b78c0dd67f9708061d1d2eda54b8209346fd6be9a92c" -dependencies = [ - "rustls-pki-types", -] - [[package]] name = "webpki-roots" version = "1.0.6" @@ -8057,35 +7093,6 @@ dependencies = [ "rustls-pki-types", ] -[[package]] -name = "whoami" -version = "2.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d6a5b12f9df4f978d2cfdb1bd3bac52433f44393342d7ee9c25f5a1c14c0f45d" -dependencies = [ - "libc", - "libredox", - "objc2-system-configuration", - "wasite", - "web-sys", -] - -[[package]] -name = "winapi" -version = "0.3.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419" -dependencies = [ - "winapi-i686-pc-windows-gnu", - "winapi-x86_64-pc-windows-gnu", -] - -[[package]] -name = "winapi-i686-pc-windows-gnu" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6" - [[package]] name = "winapi-util" version = "0.1.11" @@ -8095,33 +7102,6 @@ dependencies = [ "windows-sys 0.61.2", ] -[[package]] -name = 
"winapi-x86_64-pc-windows-gnu" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" - -[[package]] -name = "windows" -version = "0.62.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "527fadee13e0c05939a6a05d5bd6eec6cd2e3dbd648b9f8e447c6518133d8580" -dependencies = [ - "windows-collections", - "windows-core", - "windows-future", - "windows-numerics", -] - -[[package]] -name = "windows-collections" -version = "0.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "23b2d95af1a8a14a3c7367e1ed4fc9c20e0a26e79551b1454d72583c97cc6610" -dependencies = [ - "windows-core", -] - [[package]] name = "windows-core" version = "0.62.2" @@ -8135,17 +7115,6 @@ dependencies = [ "windows-strings", ] -[[package]] -name = "windows-future" -version = "0.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e1d6f90251fe18a279739e78025bd6ddc52a7e22f921070ccdc67dde84c605cb" -dependencies = [ - "windows-core", - "windows-link", - "windows-threading", -] - [[package]] name = "windows-implement" version = "0.60.2" @@ -8174,27 +7143,6 @@ version = "0.2.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f0805222e57f7521d6a62e36fa9163bc891acd422f971defe97d64e70d0a4fe5" -[[package]] -name = "windows-numerics" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6e2e40844ac143cdb44aead537bbf727de9b044e107a0f1220392177d15b0f26" -dependencies = [ - "windows-core", - "windows-link", -] - -[[package]] -name = "windows-registry" -version = "0.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "02752bf7fbdcce7f2a27a742f798510f3e5ad88dbe84871e5168e2120c3d5720" -dependencies = [ - "windows-link", - "windows-result", - "windows-strings", -] - [[package]] name = "windows-result" version = "0.4.1" @@ 
-8282,15 +7230,6 @@ dependencies = [ "windows_x86_64_msvc 0.53.1", ] -[[package]] -name = "windows-threading" -version = "0.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3949bd5b99cafdf1c7ca86b43ca564028dfe27d66958f2470940f73d86d75b37" -dependencies = [ - "windows-link", -] - [[package]] name = "windows_aarch64_gnullvm" version = "0.52.6" @@ -8490,153 +7429,6 @@ dependencies = [ "tap", ] -[[package]] -name = "xet-client" -version = "1.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3e1e496dcbe6a09017acdfaf48e1a646735e7ff5b2a49e2c7e081cca77a59bc8" -dependencies = [ - "anyhow", - "async-trait", - "base64", - "bytes", - "clap", - "crc32fast", - "futures", - "http 1.4.0", - "hyper 1.8.1", - "lazy_static", - "more-asserts", - "rand 0.10.1", - "redb", - "reqwest 0.13.4", - "reqwest-middleware", - "serde", - "serde_json", - "serde_repr", - "statrs", - "tempfile", - "thiserror", - "tokio", - "tokio-retry", - "tracing", - "tracing-subscriber", - "url", - "urlencoding", - "web-time", - "xet-core-structures", - "xet-runtime", -] - -[[package]] -name = "xet-core-structures" -version = "1.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cb838aa8eb67d730af301584cf003caad407487606058292a6750711b603fbee" -dependencies = [ - "async-trait", - "base64", - "blake3", - "bytemuck", - "bytes", - "clap", - "countio", - "csv", - "futures", - "futures-util", - "getrandom 0.4.2", - "heapify", - "itertools 0.14.0", - "lazy_static", - "lz4_flex", - "more-asserts", - "rand 0.10.1", - "regex", - "safe-transmute", - "serde", - "static_assertions", - "tempfile", - "thiserror", - "tokio", - "tokio-util", - "tracing", - "uuid", - "web-time", - "xet-runtime", -] - -[[package]] -name = "xet-data" -version = "1.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "67fd409bef621411a9d9013798540bb8036cb2678f03ab39af89a5e88034ed8c" -dependencies = [ - "anyhow", - 
"async-trait", - "bytes", - "chrono", - "clap", - "gearhash", - "http 1.4.0", - "itertools 0.14.0", - "lazy_static", - "more-asserts", - "rand 0.10.1", - "serde", - "serde_json", - "sha2 0.10.9", - "tempfile", - "thiserror", - "tokio", - "tokio-util", - "tracing", - "url", - "uuid", - "walkdir", - "xet-client", - "xet-core-structures", - "xet-runtime", -] - -[[package]] -name = "xet-runtime" -version = "1.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "15d8f121c33866f7648b737abe70d0e2dd9c0af4ffdd7219207531d0283aa63d" -dependencies = [ - "anyhow", - "async-trait", - "bytes", - "chrono", - "colored", - "const-str", - "ctor", - "dirs", - "futures", - "git-version", - "humantime", - "konst", - "lazy_static", - "libc", - "more-asserts", - "oneshot", - "pin-project", - "rand 0.10.1", - "reqwest 0.13.4", - "serde", - "serde_json", - "shellexpand", - "sysinfo", - "thiserror", - "tokio", - "tokio-util", - "tracing", - "tracing-appender", - "tracing-subscriber", - "whoami", - "winapi", -] - [[package]] name = "xmlparser" version = "0.13.6" diff --git a/Cargo.toml b/Cargo.toml index c442242..66bfc01 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -4,8 +4,6 @@ members = [ "crates/omnigraph-compiler", "crates/omnigraph", "crates/omnigraph-cli", - "crates/omnigraph-api-types", - "crates/omnigraph-cluster", "crates/omnigraph-policy", "crates/omnigraph-server", ] @@ -31,14 +29,14 @@ datafusion-common = "53" datafusion-expr = "53" datafusion-functions-aggregate = "53" -lance = { version = "7.0.0", default-features = false, features = ["aws"] } -lance-datafusion = "7.0.0" -lance-file = "7.0.0" -lance-index = "7.0.0" -lance-linalg = "7.0.0" -lance-namespace = "7.0.0" -lance-namespace-impls = "7.0.0" -lance-table = "7.0.0" +lance = { version = "6.0.1", default-features = false, features = ["aws"] } +lance-datafusion = "6.0.1" +lance-file = "6.0.1" +lance-index = "6.0.1" +lance-linalg = "6.0.1" +lance-namespace = "6.0.1" +lance-namespace-impls = "6.0.1" 
+lance-table = "6.0.1" ulid = "1" futures = "0.3" @@ -48,7 +46,7 @@ pest = "2" pest_derive = "2" thiserror = "2" tokio = { version = "1", features = ["rt-multi-thread", "macros", "time", "net", "signal", "sync"] } -clap = { version = "4.6", features = ["derive"] } +clap = { version = "4", features = ["derive"] } serde = { version = "1", features = ["derive"] } serde_json = "1" serde_yaml = "0.9" @@ -64,7 +62,7 @@ base64 = "0.22" ariadne = "0.4" regex = "1" reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] } -object_store = { version = "0.13.2", default-features = false, features = ["aws", "fs"] } +object_store = { version = "0.12.5", default-features = false, features = ["aws"] } fail = "0.5" time = { version = "0.3", features = ["formatting"] } axum = { version = "0.8", features = ["json", "macros"] } diff --git a/Dockerfile b/Dockerfile index ca22a93..e49a6c7 100644 --- a/Dockerfile +++ b/Dockerfile @@ -11,13 +11,9 @@ RUN groupadd --system omnigraph \ && useradd --system --gid omnigraph --create-home --home-dir /var/lib/omnigraph omnigraph COPY target/release/omnigraph-server /usr/local/bin/omnigraph-server -# The CLI ships in the image so the cluster day-2 loop (cluster -# apply/approve/status, data loads by explicit URI) runs in-container via -# `docker exec` / ECS exec / `railway shell` β€” no omnigraph.yaml required. -COPY target/release/omnigraph /usr/local/bin/omnigraph COPY docker/entrypoint.sh /usr/local/bin/omnigraph-entrypoint -RUN chmod 0755 /usr/local/bin/omnigraph-server /usr/local/bin/omnigraph /usr/local/bin/omnigraph-entrypoint +RUN chmod 0755 /usr/local/bin/omnigraph-server /usr/local/bin/omnigraph-entrypoint ENV OMNIGRAPH_BIND=0.0.0.0:8080 diff --git a/GOVERNANCE.md b/GOVERNANCE.md deleted file mode 100644 index 2768e5b..0000000 --- a/GOVERNANCE.md +++ /dev/null @@ -1,105 +0,0 @@ -# Governance - -This document describes how **external contributions** to OmniGraph are -proposed, accepted, and merged. 
It exists so an outside contributor can answer, -without asking: *where does my report/idea/change go, who decides, and what has -to happen before code lands?* - -> **Scope.** This governs the public contribution surface — Issues, -> Discussions, RFCs, and pull requests from people outside the ModernRelay -> team. **Maintainers operate under a separate internal process** and are not -> bound by the intake gates below. Everyone, maintainer or not, is still bound -> by the universal gates: branch protection on `main` and CI -> (see [docs/dev/branch-protection.md](docs/dev/branch-protection.md)). - -## Roles - -| Role | Who | Authority | -|---|---|---| -| **Maintainer** | The ModernRelay team (repository admins) | Validate issues, accept/reject RFCs, review and merge PRs, set direction. Final decision authority. | -| **Contributor** | Anyone else | Report problems (Issues), propose ideas (Discussions), author RFCs, and open pull requests. | - -Decision authority rests with the maintainers (the ModernRelay team holding -repository-admin access). - -## The three channels - -Each channel has one job. Using the right one is the first thing we ask of a -contribution. - -| Channel | Purpose | Not for | -|---|---|---| -| **[Issues](../../issues)** | **Report a problem** — a bug, a regression, a documented behavior that's wrong. Something concrete and reproducible. | Feature requests, ideas, questions, or design proposals (→ Discussions). | -| **[Discussions](../../discussions)** | **Propose and explore** — new ideas, feature requests, questions, and the incubation of RFCs. | Bug reports (→ Issues). | -| **Pull requests** | **Land a sanctioned change** — a fix for a *validated* issue, an *accepted* RFC, or a trivial change (see fast-lane). | Substantive change with no backing issue/RFC — it will be redirected.
| - -## How a change becomes mergeable - -``` - ┌─────────── bug ───────────┐ ┌──────── idea / feature ────────┐ - ▼ │ ▼ │ - Issue (problem report) │ Discussion (idea / RFC incubation) │ - │ │ │ │ - maintainer triage │ rough consensus │ - │ │ │ graduate │ - ▼ │ ▼ │ - label: accepted ──────────┐ │ RFC PR (docs/rfcs/NNNN-*.md) │ - │ │ │ │ │ - │ │ │ maintainer review │ - ▼ ▼ │ ▼ │ - Pull request ◀──────────┴──────────│── merged == accepted │ - (links the issue or the accepted RFC) ◀───────┘ (implementation PRs reference it) │ - │ - review + branch protection + CI - ▼ - merged -``` - -### Issues → validated -A new issue starts unlabeled. A maintainer triages it and, if it's a real, -in-scope problem, applies the **`accepted`** label. **Only `accepted` issues are -open for a contributor PR.** This prevents the "I fixed an issue you hadn't -agreed was a problem" rejection. Want to fix something? Get the issue accepted -first, or pick one already labelled `accepted` / `help wanted`. - -### Discussions → RFCs → accepted -Ideas and feature requests start in **Discussions**. Anyone — including external -contributors — may then **author an RFC** by opening a pull request that adds -`docs/rfcs/NNNN-title.md` (see [docs/rfcs/README.md](docs/rfcs/README.md)). The -RFC is reviewed as code; **a maintainer merging it is the act of acceptance** -(it becomes the durable decision record). Implementation PRs then reference the -accepted RFC. - -Authoring an RFC is open to everyone; **accepting one is a maintainer -decision.** Maintainers may also decline an RFC, with rationale, by closing it. - -### Pull requests → sanctioned -A contributor PR must do one of: -1. link a maintainer-**`accepted`** issue it fixes, or -2. be (or reference) an **accepted RFC**, or -3. qualify for the **trivial fast-lane**.
- -**Trivial fast-lane** — these may be opened directly, no prior issue/RFC: -typo and wording fixes, documentation corrections, dependency bumps, comment -fixes, and obviously-correct one-line CI tweaks. When in doubt, open an Issue or -Discussion first; a PR that turns out to be non-trivial will be asked to. - -A substantive PR with no backing issue/RFC will be closed with a pointer to the -right channel — not as a judgment of the idea, but to keep design discussion -where it's reviewable. - -## What maintainers do *not* gate -Maintainers' own changes do not pass through the intake gates above — the team -runs a separate internal process. The universal gates (review, branch -protection, CI) apply to everyone. Enforcement of the intake rules is, to -start, **by convention and review** (PR template + labels); an automated check -keyed to author association may be added later if volume warrants. - -## Code of conduct & security -- Conduct: [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md). -- Security issues are **not** public Issues — see [SECURITY.md](SECURITY.md). - -## Changing this document -Governance changes the same way code does: a pull request, reviewed by -maintainers. This file describes the external surface; the internal maintainer -process is intentionally out of scope here. diff --git a/README.md b/README.md index deaea8b..ae3234b 100644 --- a/README.md +++ b/README.md @@ -1,233 +1,105 @@ -

- - - OMNIGRAPH - -

+# Omnigraph -

- Lakehouse graph database for context assembly & multi-agent coordination
- Multimodal retrieval · Git-style branching · object-storage native -

+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) +[![Rust](https://img.shields.io/badge/rust-stable-orange.svg)](rust-toolchain.toml) +[![Crates.io](https://img.shields.io/crates/v/omnigraph-cli.svg)](https://crates.io/crates/omnigraph-cli) +[![CI](https://github.com/ModernRelay/omnigraph/actions/workflows/ci.yml/badge.svg)](https://github.com/ModernRelay/omnigraph/actions/workflows/ci.yml) -

- Quickstart · - Docs · - Cookbooks · - CLI -

+**Object-storage native knowledge graph with git-style workflows. Designed for agents and humans to collaborate on shared structured knowledge.** -

- License: MIT - crates.io - Rust -

+Turns fragmented context into a live graph, lets humans and agents coordinate through that graph, and uses branches so agent-generated changes can be reviewed and merged safely. -
+Built on Rust, Arrow, DataFusion and Lance. -Omnigraph is the operational state and coordination layer for fleets of agents.\ -Run it as a server, declared as code; hundreds of agents operate and enrich the graph on parallel isolated branches, and every change is reviewed and merged safely. +Join the [Omnigraph Slack community](https://join.slack.com/t/omnigraphworkspace/shared_invite/zt-3wfpglyxj-lHvJGhuySPfqLtN35uJZNw) -## Key capabilities +## Use Cases -| Capability | What it gives you | -|---|---| -| **Declared as code** | A `cluster.yaml` declares graphs, schemas, stored queries, embedding providers, and policies; `cluster apply` converges it and `omnigraph-server` brings every graph online at `/graphs/{id}/…`. | -| **Built for fleets of agents** | Hundreds of agents enrich the graph on **parallel isolated branches**; changes are reviewed and merged safely, Git-style, across the whole graph. | -| **Multimodal retrieval** | Graph traversal + vector ANN + full-text + Reciprocal Rank Fusion in **one** query runtime, for context assembly. | -| **Security as code** | Cedar policy enforced **server-side on every mutation**, per-graph and server-wide; bearer auth; actor/audit tracking. | -| **Runs on your infrastructure** | Any S3-compatible object store: **on-prem via RustFS / MinIO**, or AWS S3 / R2 / GCS. VPC, on-prem, hybrid; your data never leaves your store. | -| **Open, versioned storage** | [`Lance`](https://github.com/lance-format/lance) columnar format: branchable, time-travelable, with native blob-as-data (docs, images, video). 
| +- Company brain / [Second brain](https://github.com/ModernRelay/omnigraph-cookbooks/tree/main/second-brain) +- Context graph +- Knowledge base for multi-agent research +- Incident response graph +- Compliance & audit graph -## What you can build -| Use case | What it's for | -|---|---| -| **Company brain** | Org knowledge unified into one graph every agent can query | -| **Agentic memory** | Durable, versioned memory: a branch per agent or per task, merged on review | -| **Context graph** | Decision traces and codified tribal knowledge for retrieval | -| **Dev graph** | Issues & dependency model that coding agents read and write | -| **R&D / ML data layer** | Experiments and trials written into branches, versioned for training & eval | +## Capabilities -## Install +- Typed schema, typed queries, and typed mutations +- Native blob-as-data support (docs, images, videos, etc) +- Schema-as-code, query validation and linting +- Git-style graph workflows: branches, commits, merges, and transactional runs +- Local, on-prem & cloud S3-native storage with snapshot-pinned reads +- Graph traversal + text, fuzzy, BM25, vector, and RRF search in one runtime +- Policy-as-code for server-side access control +- Single CLI for multiple deployments + +## Quick Install ```bash curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | bash ``` -This installs `omnigraph` (CLI) and `omnigraph-server` into `~/.local/bin` from -published release binaries. Or with Homebrew: +This installs `omnigraph` and `omnigraph-server` into `~/.local/bin` from +published release binaries. + +Or install with Homebrew: ```bash brew tap ModernRelay/tap brew install ModernRelay/tap/omnigraph ``` -## Set it up with an AI agent +For starter graphs and agent skills to bootstrap and operate Omnigraph, see [`ModernRelay/omnigraph-cookbooks`](https://github.com/ModernRelay/omnigraph-cookbooks). -Omnigraph is built to be run by coding agents. 
Two ways in: - -**Teach your agent the playbook.** This repo ships the -[**`omnigraph` agent skill**](skills/omnigraph): the operational playbook -covering cluster mode, the two config surfaces, schema evolution, query linting, -data writes, branches, Cedar policy, and the common gotchas. +## One-Command Local RustFS Bootstrap ```bash -npx skills add ModernRelay/omnigraph@omnigraph +curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/local-rustfs-bootstrap.sh | bash ``` -**Or have an agent set it up from scratch.** Paste this into Claude Code, -Codex, or any agent that can read a URL and run a shell command: +That bootstrap: -```text -Help me set up Omnigraph +- starts RustFS on `127.0.0.1:9000` +- creates a bucket and S3-backed graph +- loads the checked-in context fixture +- launches `omnigraph-server` on `127.0.0.1:8080` -1. Read the docs at https://github.com/ModernRelay/omnigraph, starting with - docs/user/clusters/index.md, then docs/user/deployment.md. -2. Skim the starter graphs and seed data in the cookbooks: - https://github.com/ModernRelay/omnigraph-cookbooks -3. Ask me what I want to build (company brain, agent memory, dev graph, - research / R&D layer, …). Then stand up a cluster for it, load a little - data, and run a query so I can see it working. -``` +Docker must be installed and running first. -For ready-to-run graphs with real seed data (company brain, VC operating system, -pharma & industry intel), -[`ModernRelay/omnigraph-cookbooks`](https://github.com/ModernRelay/omnigraph-cookbooks) -is the fastest way to see Omnigraph shaped to a real domain. +The RustFS bootstrap prefers the rolling `edge` binaries and only falls back to +source builds when release assets are unavailable. -## Deploy +If a previous run left objects under the same graph prefix but did not finish +initializing the graph, rerun with `RESET_REPO=1` or set `PREFIX` to a new +value. 
-A deployment is a **cluster**: a **multigraph** config directory that declares -its graphs, schemas, stored queries, and policies as code. You manage it -**Terraform-style**: `cluster plan` previews the diff, `cluster apply` converges -it. `omnigraph-server` then boots from the cluster and brings every graph online -at `/graphs/{id}/…`, each behind its own policy. +## Common Commands -**1. Declare the cluster.** - -``` -company-brain/ -├── cluster.yaml -├── people.pg # schema for the "knowledge" graph -├── queries/ # stored queries: the .gq files ARE the declaration -│ └── people.gq -└── base.policy.yaml # a Cedar policy bundle -``` - -```yaml -# cluster.yaml -version: 1 -metadata: - name: company-brain -storage: s3://company/clusters/company-brain # ledger, catalog, and graph data live here -graphs: - knowledge: - schema: people.pg - queries: queries/ # every `query ` in queries/*.gq registers -policies: - base: - file: base.policy.yaml - applies_to: [knowledge] # graph-bound; use [cluster] for server-level -``` - -**2. Stand up your object store.** On-prem, run RustFS (or MinIO); Omnigraph -writes [Lance](https://github.com/lance-format/lance) to it over the standard S3 -API. In the cloud, point the same `AWS_*` env at S3 / R2 / GCS instead. - -**3. Converge and run.** `apply` creates each graph, applies its schema, and -publishes queries and policies into the content-addressed catalog. It is -idempotent; re-running is always safe. +The same URI works for local paths, `s3://…`, or `http://host:port`.
```bash -omnigraph cluster validate # parse + typecheck everything -omnigraph cluster plan # preview what apply would do -omnigraph cluster apply # converge - -# Boot the server from the cluster dir; storage resolves through cluster.yaml -omnigraph-server --cluster company-brain --bind 0.0.0.0:8080 +omnigraph init --schema ./schema.pg ./graph.omni +omnigraph load --data ./data.jsonl ./graph.omni +omnigraph read --query ./queries.gq --name get_person --params '{"name":"Alice"}' ./graph.omni +omnigraph change --query ./queries.gq --name insert_person --params '{"name":"Mina"}' ./graph.omni +omnigraph branch create --from main feature-x ./graph.omni +omnigraph branch merge feature-x --into main ./graph.omni ``` -See the [cluster guide](docs/user/clusters/index.md) for the day-2 loop -(edit → plan → apply → restart), approval gates for destructive changes, drift -inspection, and recovery; the [deployment guide](docs/user/deployment.md) for -containers, AWS/Railway, auth, and the full `AWS_*` contract. - -## Query and mutate - -Set a default server and graph once in `~/.omnigraph/config.yaml`, and the -everyday commands stay short. Stored queries and mutations run **by name**: - -```bash -omnigraph query search_docs --params '{"q":"AI safety"}' -omnigraph mutate add_person --params '{"name":"Mina"}' - -# Branch, review, merge across the whole graph; agents write in isolation -omnigraph branch create --from main agent/ingest-42 -omnigraph branch merge agent/ingest-42 --into main -``` - -An **alias** is shorter still: bind a server, graph, and stored query to one -name, then `omnigraph alias triage` runs it. For an ad-hoc target, any command -still takes `--server --graph ` (or `--store ` for a local -graph). See the [CLI reference](docs/user/cli/reference.md). - -## Security & governance - -- **Engine-wide enforcement:** every write path goes through the same Cedar gate, so the HTTP server, the CLI, and the embedded SDK obey identical rules.
-- **Declared in the cluster:** a policy bundle is bound to graphs (or the whole server) via `policies:` → `applies_to`. -- **Scoped:** rules apply per graph, per branch, or server-wide. -- **No plaintext tokens:** bearer tokens are hashed at startup and compared in constant time. -- **Forge-proof identity:** the actor is resolved server-side from the token; clients can't set it. - -See the [policy guide](docs/user/operations/policy.md). - -## Clients & SDKs - -| Client | Use it for | Where | -|---|---|---| -| **TypeScript SDK** | typed access from Node / TS | [`@modernrelay/omnigraph`](https://www.npmjs.com/package/@modernrelay/omnigraph) · [source](https://github.com/ModernRelay/omnigraph-ts) | -| **MCP server** | bridge Omnigraph to LLM hosts (Claude, Codex, …) | [`@modernrelay/omnigraph-mcp`](https://www.npmjs.com/package/@modernrelay/omnigraph-mcp) | -| **HTTP / OpenAPI** | any language, the wire contract | the server's OpenAPI spec | -| **Python SDK** | typed access from Python | *coming soon* | - -Both npm packages are versioned in lockstep with `omnigraph-server`. - -## Local quick test (no server) - -1-min setup to try it: an **embedded, local file-backed graph** (no server, no -object store). For dev and experiments; production is the deployed cluster above. - -```bash -cat > schema.pg <<'PG' -node Signal { slug: String @key, title: String } -node Pattern { slug: String @key, name: String } -edge Indicates: Signal -> Pattern -PG -printf '%s\n' \ - '{"type":"Signal","data":{"slug":"s1","title":"OSS model adoption surging"}}' \ - '{"type":"Pattern","data":{"slug":"p1","name":"adoption"}}' \ - '{"edge":"Indicates","from":"s1","to":"p1"}' > data.jsonl - -omnigraph init --schema schema.pg ./graph.omni -omnigraph load --data data.jsonl --mode overwrite --store ./graph.omni - -# "What pattern does signal s1 indicate?"
-omnigraph query --store ./graph.omni \ - -e 'query indicates() { match { $s: Signal { slug: "s1" } $s indicates $p } return { $p.name } }' -# → adoption -``` +See [docs/user/cli.md](docs/user/cli.md) for schema apply, snapshots, ingest, runs, and policy commands. ## Docs -- [Cluster guide](docs/user/clusters/index.md) · [Deployment guide](docs/user/deployment.md) · [CLI reference](docs/user/cli/reference.md) -- [Schema](docs/user/schema/index.md) · [Queries](docs/user/queries/index.md) · [Search](docs/user/search/index.md) · [Policy](docs/user/operations/policy.md) +- [Install guide](docs/user/install.md) +- [CLI guide](docs/user/cli.md) +- [Deployment guide](docs/user/deployment.md) ## Build And Test ```bash cargo build --workspace -cargo test --workspace +cargo check --workspace +cargo test --workspace ``` Notes: @@ -239,13 +111,10 @@ Notes: ## Workspace Crates -- `crates/omnigraph-compiler`: shared schema/query parser, typechecker, catalog, and IR lowering (zero Lance dependency) -- `crates/omnigraph` (package `omnigraph-engine`): storage/runtime, branching, merge, change detection, query execution, and embeddings -- `crates/omnigraph-policy`: Cedar policy compilation and enforcement -- `crates/omnigraph-api-types`: shared HTTP wire DTOs used by both the server and the CLI -- `crates/omnigraph-cluster`: cluster config validation, planning, and apply (the control plane) -- `crates/omnigraph-server`: Axum HTTP server, cluster-first, runs N graphs under `/graphs/{id}/…` -- `crates/omnigraph-cli`: CLI for graph lifecycle, query/mutate, branch/commit/merge, schema/lint, snapshot/export, cluster control, policy/queries, profiles, and maintenance +- `crates/omnigraph-compiler`: shared schema/query parser, typechecker, catalog, and IR lowering +- `crates/omnigraph`: storage/runtime, branching, merge, change detection, and query execution +- `crates/omnigraph-cli`: CLI for init/load/ingest/read/change/branch/snapshot/export/policy operations +-
`crates/omnigraph-server`: Axum HTTP server for remote reads, changes, ingest, export, branches, commits, and runs ## Contributing diff --git a/assets/omnigraph-wordmark-dark.svg b/assets/omnigraph-wordmark-dark.svg deleted file mode 100644 index 47b2033..0000000 --- a/assets/omnigraph-wordmark-dark.svg +++ /dev/null @@ -1,152 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - diff --git a/assets/omnigraph-wordmark.svg b/assets/omnigraph-wordmark.svg deleted file mode 100644 index 45778dc..0000000 --- a/assets/omnigraph-wordmark.svg +++ /dev/null @@ -1,152 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - diff --git a/crates/omnigraph-api-types/Cargo.toml b/crates/omnigraph-api-types/Cargo.toml deleted file mode 100644 index 96677d1..0000000 --- a/crates/omnigraph-api-types/Cargo.toml +++ /dev/null @@ -1,16 +0,0 @@ -[package] -name = "omnigraph-api-types" -version = "0.7.2" -edition = "2024" -description = "Shared HTTP wire DTOs for Omnigraph β€” request/response types and engine-result β†’ DTO mappings used by both omnigraph-server and omnigraph-cli (RFC-009). Plain serde/utoipa types; no transport or server internals." 
-license = "MIT" -repository = "https://github.com/ModernRelay/omnigraph" -homepage = "https://github.com/ModernRelay/omnigraph" -documentation = "https://docs.rs/omnigraph-api-types" - -[dependencies] -omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.7.2" } -omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.7.2" } -serde = { workspace = true } -serde_json = { workspace = true } -utoipa = { workspace = true } diff --git a/crates/omnigraph-api-types/src/lib.rs b/crates/omnigraph-api-types/src/lib.rs deleted file mode 100644 index 32bc753..0000000 --- a/crates/omnigraph-api-types/src/lib.rs +++ /dev/null @@ -1,704 +0,0 @@ -//! Shared HTTP wire DTOs (RFC-009 Phase 2) — moved from -//! omnigraph-server's api module so server and CLI share one definition -//! and one engine-result -> DTO mapping per verb. Plain serde/utoipa -//! types; no transport, no server internals. - -use omnigraph::db::{GraphCommit, MergeOutcome, ReadTarget, SchemaApplyResult, Snapshot}; -use omnigraph::error::{MergeConflict, MergeConflictKind}; -use omnigraph::loader::{LoadMode, LoadResult}; -use omnigraph_compiler::SchemaMigrationStep; -use omnigraph_compiler::query::ast::Param; -use omnigraph_compiler::result::QueryResult; -use omnigraph_compiler::types::{PropType, ScalarType}; -use serde::{Deserialize, Serialize}; -use serde_json::Value; -use utoipa::{IntoParams, ToSchema}; - -/// Shadow enum for documenting [`LoadMode`] in the OpenAPI schema. -#[derive(ToSchema)] -#[schema(as = LoadMode)] -#[allow(dead_code)] -enum LoadModeSchema { - /// Overwrite existing data. - #[schema(rename = "overwrite")] - Overwrite, - /// Append to existing data. - #[schema(rename = "append")] - Append, - /// Merge by id key (upsert).
- #[schema(rename = "merge")] - Merge, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct SnapshotTableOutput { - pub table_key: String, - pub table_path: String, - pub table_version: u64, - pub table_branch: Option, - pub row_count: u64, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct SnapshotOutput { - pub branch: String, - pub manifest_version: u64, - pub tables: Vec, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct BranchCreateRequest { - /// Parent branch to fork from. Defaults to `main`. - pub from: Option, - /// Name of the new branch. Must not already exist. - pub name: String, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct BranchCreateOutput { - pub uri: String, - pub from: String, - pub name: String, - pub actor_id: Option, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct BranchListOutput { - pub branches: Vec, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct BranchDeleteOutput { - pub uri: String, - pub name: String, - pub actor_id: Option, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct BranchMergeRequest { - /// Source branch whose commits will be merged. - pub source: String, - /// Target branch that will receive the merge. Defaults to `main`. 
- pub target: Option, -} - -#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, ToSchema)] -#[serde(rename_all = "snake_case")] -pub enum BranchMergeOutcome { - AlreadyUpToDate, - FastForward, - Merged, -} - -impl From for BranchMergeOutcome { - fn from(value: MergeOutcome) -> Self { - match value { - MergeOutcome::AlreadyUpToDate => Self::AlreadyUpToDate, - MergeOutcome::FastForward => Self::FastForward, - MergeOutcome::Merged => Self::Merged, - } - } -} - -impl BranchMergeOutcome { - pub fn as_str(self) -> &'static str { - match self { - Self::AlreadyUpToDate => "already_up_to_date", - Self::FastForward => "fast_forward", - Self::Merged => "merged", - } - } -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct BranchMergeOutput { - pub source: String, - pub target: String, - pub outcome: BranchMergeOutcome, - pub actor_id: Option, -} - -#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, ToSchema)] -#[serde(rename_all = "snake_case")] -pub enum MergeConflictKindOutput { - DivergentInsert, - DivergentUpdate, - DeleteVsUpdate, - OrphanEdge, - UniqueViolation, - CardinalityViolation, - ValueConstraintViolation, -} - -impl MergeConflictKindOutput { - pub fn as_str(self) -> &'static str { - match self { - Self::DivergentInsert => "divergent_insert", - Self::DivergentUpdate => "divergent_update", - Self::DeleteVsUpdate => "delete_vs_update", - Self::OrphanEdge => "orphan_edge", - Self::UniqueViolation => "unique_violation", - Self::CardinalityViolation => "cardinality_violation", - Self::ValueConstraintViolation => "value_constraint_violation", - } - } -} - -impl From for MergeConflictKindOutput { - fn from(value: MergeConflictKind) -> Self { - match value { - MergeConflictKind::DivergentInsert => Self::DivergentInsert, - MergeConflictKind::DivergentUpdate => Self::DivergentUpdate, - MergeConflictKind::DeleteVsUpdate => Self::DeleteVsUpdate, - MergeConflictKind::OrphanEdge => Self::OrphanEdge, - 
MergeConflictKind::UniqueViolation => Self::UniqueViolation, - MergeConflictKind::CardinalityViolation => Self::CardinalityViolation, - MergeConflictKind::ValueConstraintViolation => Self::ValueConstraintViolation, - } - } -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct MergeConflictOutput { - pub table_key: String, - pub row_id: Option, - pub kind: MergeConflictKindOutput, - pub message: String, -} - -impl From<&MergeConflict> for MergeConflictOutput { - fn from(value: &MergeConflict) -> Self { - Self { - table_key: value.table_key.clone(), - row_id: value.row_id.clone(), - kind: value.kind.into(), - message: value.message.clone(), - } - } -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct ReadTargetOutput { - pub branch: Option, - pub snapshot: Option, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct ReadOutput { - pub query_name: String, - pub target: ReadTargetOutput, - pub row_count: usize, - #[serde(default, skip_serializing_if = "Vec::is_empty")] - pub columns: Vec, - pub rows: Value, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct ChangeOutput { - pub branch: String, - pub query_name: String, - pub affected_nodes: usize, - pub affected_edges: usize, - pub actor_id: Option, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct IngestTableOutput { - pub table_key: String, - pub rows_loaded: usize, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct IngestOutput { - pub uri: String, - pub branch: String, - /// Base branch a fork was requested from (the request's `from`), echoed - /// even when the branch already existed. `null` when `from` was absent. 
- pub base_branch: Option, - pub branch_created: bool, - #[schema(value_type = LoadModeSchema)] - pub mode: LoadMode, - pub tables: Vec, - pub actor_id: Option, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct CommitOutput { - pub graph_commit_id: String, - pub manifest_branch: Option, - pub manifest_version: u64, - pub parent_commit_id: Option, - pub merged_parent_commit_id: Option, - pub actor_id: Option, - /// Commit creation time as Unix epoch microseconds. - #[schema(example = 1714000000000000i64)] - pub created_at: i64, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct CommitListOutput { - pub commits: Vec, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct ReadRequest { - /// GQ query source. May declare one or more named queries; pick one with - /// `query_name` if there is more than one. - #[schema( - example = "query get_person($name: String) {\n match {\n $p: Person { name: $name }\n }\n return { $p.name, $p.age }\n}" - )] - pub query_source: String, - /// Name of the query to run when `query_source` declares multiple. Optional - /// when only one query is declared. - pub query_name: Option, - /// JSON object whose keys match the query's declared parameters. - pub params: Option, - /// Branch to read from. Mutually exclusive with `snapshot`. Defaults to `main`. - pub branch: Option, - /// Snapshot id to read from. Mutually exclusive with `branch`. - pub snapshot: Option, -} - -/// Inline read-query request for `POST /query`. -/// -/// Friendlier-named alternative to [`ReadRequest`] for ad-hoc reads and -/// AI-agent integration. Mutations are rejected with 400 β€” use `POST -/// /mutate` (or its deprecated alias `POST /change`) for write queries. -/// Field names are deliberately short (`query`, `name`) to match the GQ -/// keyword and the CLI `-e` flag. -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct QueryRequest { - /// GQ read-query source. 
May declare one or more named queries; pick one - /// with `name` when more than one is declared. Mutations - /// (`insert`/`update`/`delete`) get 400 β€” use `POST /mutate` (or its - /// deprecated alias `POST /change`) instead. - #[schema(example = "query get_person($name: String) {\n match {\n $p: Person { name: $name }\n }\n return { $p.name, $p.age }\n}")] - pub query: String, - /// Name of the query to run when `query` declares multiple. Optional when - /// only one query is declared. - pub name: Option, - /// JSON object whose keys match the query's declared parameters. - pub params: Option, - /// Branch to read from. Mutually exclusive with `snapshot`. Defaults to `main`. - pub branch: Option, - /// Snapshot id to read from. Mutually exclusive with `branch`. - pub snapshot: Option, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct ChangeRequest { - /// GQ mutation source containing `insert`, `update`, or `delete` statements. - /// May declare multiple named mutations; pick one with `name`. - /// - /// Accepts the legacy field name `query_source` as a deserialization alias. - #[schema( - example = "query insert_person($name: String, $age: I32) {\n insert Person { name: $name, age: $age }\n}" - )] - #[serde(alias = "query_source")] - pub query: String, - /// Name of the mutation to run when `query` declares multiple. - /// - /// Accepts the legacy field name `query_name` as a deserialization alias. - #[serde(default, alias = "query_name")] - pub name: Option, - /// JSON object whose keys match the mutation's declared parameters. - #[serde(default)] - pub params: Option, - /// Target branch. Defaults to `main`. - #[serde(default)] - pub branch: Option, -} - -/// Body for `POST /queries/{name}` β€” invokes the server-side stored query -/// named in the path. The query source and name come from the registry, -/// never the body; only the runtime inputs are supplied here. 
-#[derive(Debug, Clone, Default, Serialize, Deserialize, ToSchema)] -pub struct InvokeStoredQueryRequest { - /// JSON object whose keys match the stored query's declared parameters. - #[serde(default)] - pub params: Option, - /// Branch to run against. Defaults to `main`; for a stored mutation the - /// write targets this branch. - #[serde(default)] - pub branch: Option, - /// Snapshot id to read from (read queries only β€” rejected for a stored - /// mutation). Mutually exclusive with `branch`. - #[serde(default)] - pub snapshot: Option, - /// The kind the caller expects (RFC-011 Decision 3): `Some(false)` for - /// `omnigraph query `, `Some(true)` for `omnigraph mutate `. - /// When set and it disagrees with the stored query's actual kind, the - /// server rejects the call (400) so the verb asserts the kind. `None` - /// (the default) skips the check β€” preserving older clients and aliases. - #[serde(default)] - pub expect_mutation: Option, -} - -/// Response for `POST /queries/{name}`: the read envelope for a stored -/// read, or the mutation envelope for a stored mutation. Serialized -/// **untagged**, so the wire shape is exactly [`ReadOutput`] or -/// [`ChangeOutput`] β€” classification follows the stored query, not a -/// wrapper field. -#[derive(Debug, Serialize, ToSchema)] -#[serde(untagged)] -pub enum InvokeStoredQueryResponse { - Read(ReadOutput), - Change(ChangeOutput), -} - -/// The kind of a stored-query parameter, decomposed so a client (e.g. an -/// MCP server) can build a typed input schema with a closed `match` and -/// never re-parse omnigraph's type spelling. `bigint`/`date`/`datetime`/ -/// `blob` are carried as JSON strings on the wire: a 64-bit integer past -/// 2^53 loses precision as a JSON number, and Date/DateTime are ISO -/// strings, Blob a blob-URI string. 
-#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, ToSchema)] -#[serde(rename_all = "snake_case")] -pub enum ParamKind { - String, - Bool, - Int, - #[serde(rename = "bigint")] - BigInt, - Float, - Date, - #[serde(rename = "datetime")] - DateTime, - Blob, - Vector, - List, -} - -/// One declared parameter of a stored query, projected for the catalog. -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct ParamDescriptor { - pub name: String, - pub kind: ParamKind, - /// Element kind when `kind == list` (always a scalar — the grammar - /// forbids lists of vectors or nested lists). - #[serde(skip_serializing_if = "Option::is_none")] - pub item_kind: Option, - /// Dimension when `kind == vector`. - #[serde(skip_serializing_if = "Option::is_none")] - pub vector_dim: Option, - /// `false` → the caller must supply it; `true` → optional. - pub nullable: bool, -} - -/// One entry in the stored-query catalog (`GET /queries`). -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct QueryCatalogEntry { - /// Registry key / invoke path segment (`POST /queries/{name}`). - pub name: String, - /// MCP tool id (the `tool_name` override, else `name`). - pub tool_name: String, - #[serde(skip_serializing_if = "Option::is_none")] - pub description: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub instruction: Option, - /// `true` for a stored mutation → an MCP read-only hint of `false`. - pub mutation: bool, - pub params: Vec, -} - -/// Response for `GET /queries`: every stored query in a graph's -/// registry, each with typed parameters. -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct QueriesCatalogOutput { - pub queries: Vec, -} - -/// Total map from a resolved scalar to its catalog kind. Exhaustive on -/// purpose: a new `ScalarType` is a compile error here until catalogued.
-fn scalar_kind(scalar: ScalarType) -> ParamKind { - match scalar { - ScalarType::String => ParamKind::String, - ScalarType::Bool => ParamKind::Bool, - ScalarType::I32 | ScalarType::U32 => ParamKind::Int, - ScalarType::I64 | ScalarType::U64 => ParamKind::BigInt, - ScalarType::F32 | ScalarType::F64 => ParamKind::Float, - ScalarType::Date => ParamKind::Date, - ScalarType::DateTime => ParamKind::DateTime, - ScalarType::Blob => ParamKind::Blob, - ScalarType::Vector(_) => ParamKind::Vector, - } -} - -pub fn param_descriptor(param: &Param) -> ParamDescriptor { - match PropType::from_param_type_name(&param.type_name, param.nullable) { - Some(pt) if pt.list => ParamDescriptor { - name: param.name.clone(), - kind: ParamKind::List, - item_kind: Some(scalar_kind(pt.scalar)), - vector_dim: None, - nullable: param.nullable, - }, - Some(pt) => { - let (kind, vector_dim) = match pt.scalar { - ScalarType::Vector(dim) => (ParamKind::Vector, Some(dim)), - other => (scalar_kind(other), None), - }; - ParamDescriptor { - name: param.name.clone(), - kind, - item_kind: None, - vector_dim, - nullable: param.nullable, - } - } - // Unreachable for a parsed query (every declared param type is - // grammatical); fall back to an opaque string so the field is still - // usable rather than dropped. - None => ParamDescriptor { - name: param.name.clone(), - kind: ParamKind::String, - item_kind: None, - vector_dim: None, - nullable: param.nullable, - }, - } -} - - -#[derive(Debug, Clone, Default, Serialize, Deserialize, ToSchema)] -pub struct SchemaApplyRequest { - /// Project schema in `.pg` source form. The diff against the current - /// schema produces the migration steps that will be applied. - #[schema( - example = "node Person {\n name: String @key\n age: I32?\n}\n\nedge Knows: Person -> Person" - )] - pub schema_source: String, - /// When true, promote every `DropMode::Soft` step in the plan to - /// `DropMode::Hard`, making the prior column data unreachable - /// after the apply.
Matches the CLI's `--allow-data-loss` flag. - /// Defaults to `false` (drops remain reversible via time travel). - #[serde(default)] - pub allow_data_loss: bool, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct SchemaApplyOutput { - pub uri: String, - pub supported: bool, - pub applied: bool, - pub step_count: usize, - pub manifest_version: u64, - #[schema(value_type = Vec)] - pub steps: Vec, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct SchemaOutput { - pub schema_source: String, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct IngestRequest { - /// Target branch. Defaults to `main`. Without `from`, the branch must - /// already exist β€” a missing branch is a 404, never an implicit fork. - pub branch: Option, - /// Parent branch used to create `branch` if it does not exist. Branch - /// creation is opt-in by presence of this field; omit it to require an - /// existing branch. - pub from: Option, - /// How existing rows are handled. Defaults to `merge`. - #[schema(value_type = Option)] - pub mode: Option, - /// NDJSON payload: one record per line, each shaped - /// `{"type": "", "data": {...}}`. - #[schema( - example = "{\"type\": \"Person\", \"data\": {\"name\": \"Alice\", \"age\": 30}}\n{\"type\": \"Person\", \"data\": {\"name\": \"Bob\", \"age\": 25}}" - )] - pub data: String, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct ExportRequest { - /// Branch to export. Defaults to `main`. - pub branch: Option, - /// Restrict the export to these node/edge type names. Empty exports all types. - #[serde(default)] - pub type_names: Vec, - /// Restrict the export to these table keys. Empty exports all tables. 
- #[serde(default)] - pub table_keys: Vec, -} - -#[derive(Debug, Clone, Deserialize, IntoParams)] -pub struct SnapshotQuery { - pub branch: Option, -} - -#[derive(Debug, Clone, Deserialize, IntoParams)] -pub struct CommitListQuery { - pub branch: Option, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct HealthOutput { - pub status: String, - pub version: String, - #[serde(skip_serializing_if = "Option::is_none")] - pub source_version: Option, -} - -#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, ToSchema)] -#[serde(rename_all = "snake_case")] -pub enum ErrorCode { - Unauthorized, - Forbidden, - BadRequest, - NotFound, - /// 405 Method Not Allowed β€” the route exists but the active server - /// mode doesn't serve this method (e.g. `GET /graphs` in single-graph - /// mode). Distinct from 404 so clients can tell "wrong context" from - /// "no such resource." - MethodNotAllowed, - Conflict, - /// 429 Too Many Requests β€” per-actor admission cap exceeded. - /// Clients should respect the `Retry-After` header. - TooManyRequests, - Internal, -} - -/// Structured details for a publisher-level OCC failure. Surfaces alongside -/// HTTP 409 when a write was rejected because the caller's pre-write view of -/// one table's manifest version was stale relative to the current head. The -/// expected/actual fields tell the client which table to refresh. -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct ManifestConflictOutput { - pub table_key: String, - pub expected: u64, - pub actual: u64, -} - -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct ErrorOutput { - pub error: String, - #[serde(skip_serializing_if = "Option::is_none")] - pub code: Option, - #[serde(default, skip_serializing_if = "Vec::is_empty")] - pub merge_conflicts: Vec, - /// Set when the conflict is a publisher CAS rejection - /// (`ManifestConflictDetails::ExpectedVersionMismatch`). 
The caller's - /// pre-write view of `table_key` was at version `expected` but the - /// manifest is now at `actual`. Refresh and retry. - #[serde(skip_serializing_if = "Option::is_none")] - pub manifest_conflict: Option, -} - -pub fn snapshot_payload(branch: &str, snapshot: &Snapshot) -> SnapshotOutput { - let mut entries: Vec<_> = snapshot.entries().cloned().collect(); - entries.sort_by(|a, b| a.table_key.cmp(&b.table_key)); - let tables = entries - .iter() - .map(|entry| SnapshotTableOutput { - table_key: entry.table_key.clone(), - table_path: entry.table_path.clone(), - table_version: entry.table_version, - table_branch: entry.table_branch.clone(), - row_count: entry.row_count, - }) - .collect::>(); - SnapshotOutput { - branch: branch.to_string(), - manifest_version: snapshot.version(), - tables, - } -} - -pub fn schema_apply_output(uri: &str, result: SchemaApplyResult) -> SchemaApplyOutput { - SchemaApplyOutput { - uri: uri.to_string(), - supported: result.supported, - applied: result.applied, - step_count: result.steps.len(), - manifest_version: result.manifest_version, - steps: result.steps, - } -} - -pub fn commit_output(commit: &GraphCommit) -> CommitOutput { - CommitOutput { - graph_commit_id: commit.graph_commit_id.clone(), - manifest_branch: commit.manifest_branch.clone(), - manifest_version: commit.manifest_version, - parent_commit_id: commit.parent_commit_id.clone(), - merged_parent_commit_id: commit.merged_parent_commit_id.clone(), - actor_id: commit.actor_id.clone(), - created_at: commit.created_at, - } -} - -pub fn read_output(query_name: String, target: &ReadTarget, result: QueryResult) -> ReadOutput { - let columns = result - .schema() - .fields() - .iter() - .map(|field| field.name().clone()) - .collect(); - ReadOutput { - query_name, - target: read_target_output(target), - row_count: result.num_rows(), - columns, - rows: result.to_rust_json(), - } -} - -pub fn ingest_output( - uri: &str, - result: &LoadResult, - mode: LoadMode, - actor_id: 
Option, -) -> IngestOutput { - IngestOutput { - uri: uri.to_string(), - branch: result.branch.clone(), - base_branch: result.base_branch.clone(), - branch_created: result.branch_created, - mode, - tables: result - .to_ingest_tables() - .into_iter() - .map(|table| IngestTableOutput { - table_key: table.table_key, - rows_loaded: table.rows_loaded, - }) - .collect(), - actor_id, - } -} - -pub fn read_target_output(target: &ReadTarget) -> ReadTargetOutput { - match target { - ReadTarget::Branch(branch) => ReadTargetOutput { - branch: Some(branch.clone()), - snapshot: None, - }, - ReadTarget::Snapshot(snapshot) => ReadTargetOutput { - branch: None, - snapshot: Some(snapshot.as_str().to_string()), - }, - } -} - -// ─── MR-668 β€” management endpoint shapes ────────────────────────────────── - -/// One entry in the response from `GET /graphs`. Cluster operators -/// consume this list to discover which graphs the server is currently -/// serving. The shape is intentionally minimal β€” `graph_id` and `uri` -/// are the only fields a routing client needs. -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct GraphInfo { - pub graph_id: String, - pub uri: String, -} - -/// Response from `GET /graphs`. Lists every graph registered with the -/// server in alphabetical order by `graph_id` (sorted server-side so -/// clients get deterministic output across requests). -#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] -pub struct GraphListResponse { - pub graphs: Vec, -} diff --git a/crates/omnigraph-cli/Cargo.toml b/crates/omnigraph-cli/Cargo.toml index df4ac8d..0d35ed8 100644 --- a/crates/omnigraph-cli/Cargo.toml +++ b/crates/omnigraph-cli/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "omnigraph-cli" -version = "0.7.2" +version = "0.6.0" edition = "2024" description = "CLI for the Omnigraph graph database." 
license = "MIT" @@ -13,12 +13,10 @@ name = "omnigraph" path = "src/main.rs" [dependencies] -omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.7.2" } -omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.7.2" } -omnigraph-api-types = { path = "../omnigraph-api-types", version = "0.7.2" } -omnigraph-cluster = { path = "../omnigraph-cluster", version = "0.7.2" } -omnigraph-policy = { path = "../omnigraph-policy", version = "0.7.2" } -omnigraph-server = { path = "../omnigraph-server", version = "0.7.2" } +omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.6.0" } +omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" } +omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.0" } +omnigraph-server = { path = "../omnigraph-server", version = "0.6.0" } clap = { workspace = true } color-eyre = { workspace = true } serde = { workspace = true } diff --git a/crates/omnigraph-cli/src/cli.rs b/crates/omnigraph-cli/src/cli.rs deleted file mode 100644 index 2b1a861..0000000 --- a/crates/omnigraph-cli/src/cli.rs +++ /dev/null @@ -1,678 +0,0 @@ -//! The clap surface: every command, subcommand, and argument struct -//! (moved verbatim from main.rs in the modularization). - -use super::*; - -pub(crate) const DEFAULT_BEARER_TOKEN_ENV: &str = "OMNIGRAPH_BEARER_TOKEN"; - -#[derive(Debug, Parser)] -#[command(name = "omnigraph")] -#[command(about = "Omnigraph graph database CLI")] -#[command(version = env!("CARGO_PKG_VERSION"), disable_version_flag = true)] -// Subcommands render in declaration order (clap can't print labeled headings -// between groups), so this legend names the capability each command needs β€” -// the user-facing vocabulary (RFC-011). `Plane` stays the internal classifier. 
-#[command(after_help = "\ -COMMANDS BY CAPABILITY:\n \ -any β€” run against a graph, served (--server / --profile) or embedded (--store / a \ -URI): query, mutate, load, branch, snapshot, export, commit, schema show/apply.\n \ -served β€” require a server: graphs.\n \ -direct β€” direct storage access; reject --server (init, optimize, repair, cleanup, \ -schema plan, lint).\n \ -control β€” manage or inspect a cluster (cluster via --config; policy & queries via \ ---cluster).\n \ -local β€” no explicit graph scope; local config & tooling: alias, embed, login, logout, profile, version.\n\ -See the 'Command capabilities' section of the CLI reference for which flags apply where.")] -pub(crate) struct Cli { - /// Actor id for direct-engine writes; overrides `cli.actor`. No effect on - /// remote writes (the server resolves the actor from the bearer token). - /// With a policy configured but no actor set, the write is denied β€” see - /// docs/user/operations/policy.md. - #[arg(long = "as", global = true, value_name = "ACTOR")] - pub(crate) as_actor: Option, - - /// Address a server by name (resolves to its `url` from `servers:` in - /// ~/.omnigraph/config.yaml) or by a literal `http(s)://` URL. Exclusive - /// with a positional URI. - #[arg(long, global = true, value_name = "NAME|URL")] - pub(crate) server: Option, - - /// Select a graph within a multi-graph scope: on a `--server` it appends - /// `/graphs/` to the server url; on a `--cluster` it picks which - /// cluster graph to maintain. Rejected on a single-graph address (a - /// positional URI / `--store`). - #[arg(long, global = true, value_name = "GRAPH_ID")] - pub(crate) graph: Option, - - /// Select a named scope bundle (RFC-011) from `profiles:` in - /// ~/.omnigraph/config.yaml: fills in this command's omitted addressing - /// (server/cluster/store + default graph). Falls back to - /// $OMNIGRAPH_PROFILE. Config data, not state β€” every command resolves - /// scope fresh. 
- #[arg(long, global = true, value_name = "NAME")] - pub(crate) profile: Option, - - /// Address a single graph's storage directly (RFC-011): a `file://` / - /// `s3://` store URI. Explicit, ad-hoc direct access β€” bypasses any - /// server. Exclusive with a positional URI / `--server`. - #[arg(long, global = true, value_name = "URI")] - pub(crate) store: Option, - - /// Address a cluster-managed graph's storage for maintenance (RFC-011): - /// a cluster directory or storage-root URI β€” named via `clusters:` in - /// ~/.omnigraph/config.yaml, or a literal `file://`/`s3://` root. Pair - /// with `--graph ` to select the graph. Used by optimize / repair / - /// cleanup; exclusive with a positional URI / `--store` / `--server`. - #[arg(long, global = true, value_name = "DIR|URI")] - pub(crate) cluster: Option, - - /// Skip the confirmation prompt for a destructive write (`cleanup`, - /// overwrite `load`, `branch delete`) against a non-local scope (RFC-011 - /// Decision 9). Without it, a non-local destructive write prompts on a TTY - /// and refuses (errors) when there is no TTY or `--json` is set. - #[arg(long, global = true)] - pub(crate) yes: bool, - - /// Suppress the one-line resolved-write-target diagnostic that write - /// commands echo to stderr (RFC-011 Decision 9). - #[arg(long, global = true)] - pub(crate) quiet: bool, - - #[command(subcommand)] - pub(crate) command: Command, -} - -#[derive(Debug, Subcommand)] -pub(crate) enum Command { - // ── Data plane ── run against a graph (embedded or via --server). - /// Execute a read query against a branch or snapshot. - /// - /// Canonical read endpoint. The previous name `omnigraph read` is - /// kept as a visible alias and prints a one-line deprecation warning - /// when used. Pairs with `omnigraph mutate` on the write side. - #[command(visible_alias = "read")] - Query { - /// Query name. With no `--query`/`-e`, the stored query to invoke from - /// the catalog (served β€” addressed via --server/--profile). 
With - /// `--query`/`-e`, selects which query in that ad-hoc source to run. - name: Option, - /// Ad-hoc query file (a `.gq` you're authoring / break-glass). - #[arg(long, conflicts_with = "query_string")] - query: Option, - /// Inline ad-hoc GQ source β€” alternative to `--query `. - #[arg(short = 'e', long = "query-string", value_name = "GQ", conflicts_with = "query")] - query_string: Option, - #[command(flatten)] - params: ParamsArgs, - #[arg(long, conflicts_with = "snapshot")] - branch: Option, - #[arg(long, conflicts_with = "branch")] - snapshot: Option, - #[arg(long, conflicts_with = "json")] - format: Option, - #[arg(long, conflicts_with = "format")] - json: bool, - }, - /// Execute a graph mutation query against a branch. - /// - /// Canonical mutation endpoint. The previous name `omnigraph change` - /// is kept as a visible alias and prints a one-line deprecation - /// warning when used. Pairs with `omnigraph query` on the read side. - #[command(visible_alias = "change")] - Mutate { - /// Query name. With no `--query`/`-e`, the stored mutation to invoke - /// from the catalog (served β€” addressed via --server/--profile). With - /// `--query`/`-e`, selects which query in that ad-hoc source to run. - name: Option, - /// Ad-hoc mutation file (a `.gq` you're authoring / break-glass). - #[arg(long, conflicts_with = "query_string")] - query: Option, - /// Inline ad-hoc GQ source β€” alternative to `--query `. - #[arg(short = 'e', long = "query-string", value_name = "GQ", conflicts_with = "query")] - query_string: Option, - #[command(flatten)] - params: ParamsArgs, - #[arg(long)] - branch: Option, - #[arg(long)] - json: bool, - }, - /// Invoke an operator alias (RFC-011 Decision 4). - /// - /// An alias is a personal binding under `aliases:` in - /// ~/.omnigraph/config.yaml β€” name β†’ (server, graph, stored-query name, - /// default params). `omnigraph alias [args]` invokes the bound - /// stored query on its server. 
Living in its own namespace, an alias can - /// never shadow or be shadowed by a built-in verb. Replaces the removed - /// `--alias` flag on `query`/`mutate`. - Alias { - /// Alias name (a key under `aliases:` in ~/.omnigraph/config.yaml). - name: String, - /// Positional args bound to the alias's declared `args` params, in order. - args: Vec, - #[command(flatten)] - params: ParamsArgs, - #[arg(long, conflicts_with = "json")] - format: Option, - #[arg(long, conflicts_with = "format")] - json: bool, - }, - /// Load data into a graph (local or remote) - Load { - /// Graph URI - uri: Option, - #[arg(long)] - data: PathBuf, - /// Target branch (defaults to main). Without --from it must exist. - #[arg(long)] - branch: Option, - /// Base branch to fork --branch from when it doesn't exist yet. - /// Without this flag a missing branch is an error, never a fork. - #[arg(long)] - from: Option, - /// How existing rows are handled: overwrite | append | merge. - /// Required β€” overwrite is destructive, so there is no default. 
- #[arg(long)] - mode: CliLoadMode, - #[arg(long)] - json: bool, - }, - /// Deprecated alias of `load --from ` (defaults: --mode merge, --from main) - #[command(hide = true)] - Ingest { - /// Graph URI - uri: Option, - #[arg(long)] - data: PathBuf, - #[arg(long)] - branch: Option, - #[arg(long)] - from: Option, - #[arg(long, default_value = "merge")] - mode: CliLoadMode, - #[arg(long)] - json: bool, - }, - /// Branch operations - Branch { - #[command(subcommand)] - command: BranchCommand, - }, - /// Show graph snapshot - Snapshot { - /// Graph URI - uri: Option, - #[arg(long)] - branch: Option, - #[arg(long)] - json: bool, - }, - /// Export a full graph snapshot as JSONL - Export { - /// Graph URI - uri: Option, - #[arg(long)] - branch: Option, - #[arg(long, hide = true)] - jsonl: bool, - #[arg(long = "type")] - type_names: Vec, - #[arg(long = "table")] - table_keys: Vec, - }, - /// Commit history operations - Commit { - #[command(subcommand)] - command: CommitCommand, - }, - /// Schema planning operations - Schema { - #[command(subcommand)] - command: SchemaCommand, - }, - /// Manage graphs on a multi-graph server (MR-668) - Graphs { - #[command(subcommand)] - command: GraphsCommand, - }, - - // ── Storage / local graph ops ── direct storage or local files; reject --server. - /// Initialize a new graph from a schema - Init { - #[arg(long)] - schema: PathBuf, - /// Graph URI (local path or s3://) - uri: String, - /// Overwrite existing schema artifacts at the URI. Without - /// this flag, init refuses to touch a URI that already holds - /// `_schema.pg`, `_schema.ir.json`, or `__schema_state.json` - /// β€” closes the re-init footgun (MR-668 follow-up). With the - /// flag, the operator opts in to destructive semantics. 
- #[arg(long)] - force: bool, - }, - /// Compact small Lance fragments in every table of the graph - Optimize { - /// Graph URI - uri: Option, - #[arg(long)] - json: bool, - }, - /// Classify and explicitly repair manifest/head drift - Repair { - /// Graph URI - uri: Option, - /// Publish verified maintenance drift. Without this flag, repair only - /// previews what it would do. - #[arg(long)] - confirm: bool, - /// Also publish suspicious or unverifiable drift. Requires - /// `--confirm`; use only after operator review. - #[arg(long, requires = "confirm")] - force: bool, - #[arg(long)] - json: bool, - }, - /// Remove old Lance versions from every table of the graph (destructive) - Cleanup { - /// Graph URI - uri: Option, - /// Number of recent versions to keep per table. Either `--keep` or - /// `--older-than` (or both) must be set. - #[arg(long)] - keep: Option, - /// Only remove versions older than this duration. Accepts Go-style - /// durations: `7d`, `24h`, `90m`. At least one of --keep / --older-than. - #[arg(long)] - older_than: Option, - /// Required to actually run; without it, prints what would be removed - #[arg(long)] - confirm: bool, - #[arg(long)] - json: bool, - }, - /// Validate queries against a schema (offline) or repo (repo-backed). - /// - /// Canonical name is `lint` (matches the `omnigraph_compiler::lint` - /// module and the `OG-XXX-NNN` lint-code vocabulary). Replaces the - /// deprecated `omnigraph query lint` / `omnigraph query check` / - /// `omnigraph check` invocations β€” each is kept as an argv-level - /// shim that prints a one-line stderr warning and rewrites to - /// `omnigraph lint`. Aliases are deliberately *not* exposed via - /// clap's `visible_alias` because that would advertise two - /// equivalent canonical names, which agents emit interchangeably - /// (see MR-981). 
- Lint { - /// Graph URI - uri: Option, - #[arg(long)] - query: PathBuf, - #[arg(long)] - schema: Option, - #[arg(long)] - json: bool, - }, - /// Operate on the server-side stored-query registry (`queries:`). - Queries { - #[command(subcommand)] - command: QueriesCommand, - }, - - // ── Control plane ── manage a cluster directory (--config ). - /// Validate and plan read-only cluster configuration. - Cluster { - #[command(subcommand)] - command: ClusterCommand, - }, - - /// Policy administration and diagnostics against a cluster's applied bundles - Policy { - #[command(subcommand)] - command: PolicyCommand, - }, - /// Generate, clean, or refresh explicit seed embeddings - Embed(EmbedArgs), - /// Store a bearer token for a named server (0600 credentials file). Token - /// via --token or piped on stdin; see the CLI reference for token resolution. - Login { - /// Server name (keys the credential; declare its url under - /// `servers:` in ~/.omnigraph/config.yaml) - name: String, - /// The token. Prefer piping via stdin over this flag (shell - /// history). - #[arg(long)] - token: Option, - #[arg(long)] - json: bool, - }, - /// Remove a named server's stored credential. Idempotent. - Logout { - name: String, - #[arg(long)] - json: bool, - }, - /// Inspect the scope profiles in ~/.omnigraph/config.yaml (read-only). - Profile { - #[command(subcommand)] - command: ProfileCommand, - }, - /// Print the CLI version - Version, -} - -#[derive(Debug, Subcommand)] -pub(crate) enum ProfileCommand { - /// List the profiles defined in ~/.omnigraph/config.yaml. - List { - #[arg(long)] - json: bool, - }, - /// Show a profile's resolved scope. With no name, shows the active - /// (`$OMNIGRAPH_PROFILE`) profile, else the flat operator defaults. - Show { - /// Profile name (optional). - name: Option, - #[arg(long)] - json: bool, - }, -} - -#[derive(Debug, Subcommand)] -pub(crate) enum ClusterCommand { - /// Validate cluster.yaml and referenced schemas, queries, and policy files. 
- Validate { - /// Cluster config directory containing cluster.yaml. - #[arg(long, default_value = ".")] - config: PathBuf, - /// Emit JSON instead of human text. - #[arg(long)] - json: bool, - }, - /// Produce a read-only plan by diffing cluster.yaml against __cluster/state.json. - Plan { - /// Cluster config directory containing cluster.yaml. - #[arg(long, default_value = ".")] - config: PathBuf, - /// Emit JSON instead of human text. - #[arg(long)] - json: bool, - }, - /// Converge the cluster to its config: create graphs, apply schema updates - /// (soft drops), write stored-query/policy catalog resources, and execute - /// approved graph deletes, in one ordered run. Serving picks up the applied - /// revision after an `omnigraph-server --cluster` restart. - Apply { - /// Cluster config directory containing cluster.yaml. - #[arg(long, default_value = ".")] - config: PathBuf, - /// Emit JSON instead of human text. - #[arg(long)] - json: bool, - }, - /// Record a digest-bound approval for a gated (irreversible) change, - /// e.g. a graph delete. Requires the global --as actor. - Approve { - /// Typed resource address of the gated change (e.g. graph.scratch). - resource: String, - /// Cluster config directory containing cluster.yaml. - #[arg(long, default_value = ".")] - config: PathBuf, - /// Emit JSON instead of human text. - #[arg(long)] - json: bool, - }, - /// Read the local JSON state ledger without scanning live graph resources. - Status { - /// Cluster config directory containing cluster.yaml. - #[arg(long, default_value = ".")] - config: PathBuf, - /// Emit JSON instead of human text. - #[arg(long)] - json: bool, - }, - /// Refresh existing local JSON state from declared graph observations. - Refresh { - /// Cluster config directory containing cluster.yaml. - #[arg(long, default_value = ".")] - config: PathBuf, - /// Emit JSON instead of human text. - #[arg(long)] - json: bool, - }, - /// Import initial local JSON state from declared graph observations. 
- Import { - /// Cluster config directory containing cluster.yaml. - #[arg(long, default_value = ".")] - config: PathBuf, - /// Emit JSON instead of human text. - #[arg(long)] - json: bool, - }, - /// Remove a held local JSON state lock after operator confirmation. - ForceUnlock { - /// Exact lock id from cluster status or a state_lock_held diagnostic. - lock_id: String, - /// Cluster config directory containing cluster.yaml. - #[arg(long, default_value = ".")] - config: PathBuf, - /// Emit JSON instead of human text. - #[arg(long)] - json: bool, - }, -} - -/// Operations on the graph registry of a multi-graph server (MR-668). -/// -/// All operations target a remote multi-graph server URL (http:// or -/// https://). Local-URI invocations return a clear error. To add or -/// remove graphs, operators edit `omnigraph.yaml` directly and restart -/// the server — runtime mutation is not exposed in v0.6.0. -#[derive(Debug, Subcommand)] -pub(crate) enum GraphsCommand { - /// List every graph registered with the multi-graph server. - List { - /// Remote server URL (e.g. `https://server.example.com`).
- #[arg(long)] - uri: Option, - #[arg(long)] - json: bool, - }, -} - -#[derive(Debug, Subcommand)] -pub(crate) enum BranchCommand { - /// Create a new branch - Create { - /// Graph URI - #[arg(long)] - uri: Option, - #[arg(long)] - from: Option, - name: String, - #[arg(long)] - json: bool, - }, - /// List branches - List { - /// Graph URI - #[arg(long)] - uri: Option, - #[arg(long)] - json: bool, - }, - /// Delete a branch - Delete { - /// Graph URI - #[arg(long)] - uri: Option, - name: String, - #[arg(long)] - json: bool, - }, - /// Merge a source branch into a target branch - Merge { - /// Graph URI - #[arg(long)] - uri: Option, - source: String, - #[arg(long)] - into: Option, - #[arg(long)] - json: bool, - }, -} - -#[derive(Debug, Subcommand)] -pub(crate) enum SchemaCommand { - /// Plan a schema migration against the accepted persisted schema - Plan { - /// Graph URI - uri: Option, - #[arg(long)] - schema: PathBuf, - #[arg(long)] - json: bool, - /// Show the plan as it would execute with `--allow-data-loss`. - /// Promotes every `DropMode::Soft` step to `DropMode::Hard` - /// so the plan output reflects the destructive intent. - #[arg(long, default_value_t = false)] - allow_data_loss: bool, - }, - /// Apply a supported schema migration - Apply { - /// Graph URI - uri: Option, - #[arg(long)] - schema: PathBuf, - #[arg(long)] - json: bool, - /// Allow destructive (data-loss) schema changes. - /// - /// Without this flag, drops are "soft": the column or table - /// is removed from the current manifest version but prior - /// versions are retained, so `snapshot_at_version(pre_drop)` - /// can still read the dropped data until `omnigraph cleanup` - /// runs. With this flag, drops are "hard": `cleanup_old_versions` - /// runs on the affected datasets immediately after the apply, - /// making the prior data unreachable. 
- #[arg(long, default_value_t = false)] - allow_data_loss: bool, - }, - /// Show the current accepted schema source - #[command(alias = "get")] - Show { - /// Graph URI - uri: Option, - #[arg(long)] - json: bool, - }, -} - -#[derive(Debug, Subcommand)] - -pub(crate) enum CommitCommand { - /// List graph commits - List { - /// Graph URI - uri: Option, - #[arg(long)] - branch: Option, - #[arg(long)] - json: bool, - }, - /// Show a graph commit - Show { - /// Graph URI - #[arg(long)] - uri: Option, - commit_id: String, - #[arg(long)] - json: bool, - }, -} - -#[derive(Debug, Subcommand)] -pub(crate) enum PolicyCommand { - /// Compile and validate the Cedar policy bundle(s) applied in a cluster. - /// - /// Sources the bundle(s) from the cluster's applied policies - /// (`--cluster `); pass the global `--graph ` to pick one - /// graph's bundle when several apply. - Validate {}, - /// Run declarative policy tests against a cluster's applied bundle. - /// - /// The cluster model has no per-bundle tests file, so the cases are - /// supplied explicitly with `--tests ` and checked against the - /// bundle selected by `--cluster` (+ optional `--graph`). - Test { - /// Path to a policy.tests.yaml file. - #[arg(long)] - tests: PathBuf, - }, - /// Explain one policy decision against a cluster's applied bundle. - Explain { - #[arg(long)] - actor: String, - #[arg(long)] - action: PolicyAction, - #[arg(long)] - branch: Option, - #[arg(long = "target-branch")] - target_branch: Option, - }, -} - -#[derive(Debug, Subcommand)] -pub(crate) enum QueriesCommand { - /// Type-check a cluster's stored-query registry against its schemas. - /// - /// Distinct from `omnigraph lint` (which lints one `.gq` file): this - /// validates the whole `queries:` registry of a cluster (`--cluster - /// `, optional `--graph `) by reading each graph's applied - /// schema and confirming every stored query still type-checks. Exits - /// non-zero on any breakage. 
- Validate { - #[arg(long)] - json: bool, - }, - /// List a cluster's registered stored queries (name, params). - List { - #[arg(long)] - json: bool, - }, -} - -#[derive(Debug, Args, Clone)] -pub(crate) struct ParamsArgs { - #[arg(long, conflicts_with = "params_file")] - pub(crate) params: Option, - #[arg(long, conflicts_with = "params")] - pub(crate) params_file: Option, -} - -#[derive(Clone, Copy, Debug, Eq, PartialEq, Serialize, ValueEnum)] -#[serde(rename_all = "snake_case")] -pub(crate) enum CliLoadMode { - Overwrite, - Append, - Merge, -} - -impl From for LoadMode { - fn from(value: CliLoadMode) -> Self { - match value { - CliLoadMode::Overwrite => LoadMode::Overwrite, - CliLoadMode::Append => LoadMode::Append, - CliLoadMode::Merge => LoadMode::Merge, - } - } -} -impl CliLoadMode { - pub(crate) fn as_str(self) -> &'static str { - match self { - CliLoadMode::Overwrite => "overwrite", - CliLoadMode::Append => "append", - CliLoadMode::Merge => "merge", - } - } -} diff --git a/crates/omnigraph-cli/src/client.rs b/crates/omnigraph-cli/src/client.rs deleted file mode 100644 index 7151f5e..0000000 --- a/crates/omnigraph-cli/src/client.rs +++ /dev/null @@ -1,821 +0,0 @@ -//! `GraphClient` β€” the one place the embedded-vs-remote split lives -//! (RFC-009 Phase 3). A CLI command body calls a verb method; the -//! enum routes to the engine (local URI) or HTTP (remote URI). The -//! 15 per-command `if graph.is_remote { … } else { … }` forks collapse -//! into two arms here. -//! -//! Phase 3a put the factory + the uniform read verbs in place. Phase 3b -//! adds the data-plane writes (`load`/`ingest`/`mutate`/`branch_*`/ -//! `apply_schema`) and `query`. The wrinkle 3a deferred: writes open the -//! local engine WITH policy (`open_local_db_with_policy`) and carry a -//! resolved actor, while reads/`query` open WITHOUT policy. So the -//! `Embedded` variant grows an optional policy context (`graph`/`actor`) -//! 
and a second factory (`resolve_with_policy`) fills it; `resolve()` -//! leaves it empty. The open path picks itself from whether `graph` is -//! set, preserving today's two behaviors exactly. Export + graphs-list -//! land in 3c. Behavior is unchanged per verb β€” the Phase-1 parity matrix -//! is the referee and stays textually unchanged. -//! -//! Enum, not a trait (RFC sketch said "trait"): only two variants ever, -//! and inherent async methods sidestep `async_trait` boxing plus the -//! `apply_schema` catalog-validator closure that is not object-safe. -//! Same one-body-two-impls collapse, less ceremony. - -use std::io::Write; - -use color_eyre::Result; -use color_eyre::eyre::bail; -use omnigraph::db::{Omnigraph, ReadTarget}; -use omnigraph_api_types::{ - BranchCreateOutput, BranchCreateRequest, BranchDeleteOutput, BranchListOutput, - BranchMergeOutput, BranchMergeRequest, ChangeOutput, CommitListOutput, CommitOutput, - ErrorOutput, ExportRequest, GraphListResponse, IngestOutput, IngestRequest, - InvokeStoredQueryRequest, ReadOutput, - ReadRequest, SchemaApplyOutput, SchemaApplyRequest, SchemaOutput, SnapshotOutput, commit_output, - ingest_output, read_output, schema_apply_output, snapshot_payload, -}; -use omnigraph_compiler::catalog::Catalog; -use reqwest::Method; -use serde_json::Value; - -use crate::cli::CliLoadMode; -use crate::helpers::{ - apply_bearer_token, apply_server_flag, build_http_client, is_remote_uri, - legacy_change_request_body, query_params_from_json, - remote_json, remote_url, resolve_cli_actor, resolve_cli_graph, resolve_remote_bearer_token, - resolve_server_flag, select_named_query, -}; -use crate::output::{LoadOutput, load_output_from_result, load_output_from_tables}; - -pub(crate) enum GraphClient { - /// Local engine at `uri`. Reads (`resolve()`) leave `actor` empty; - /// writes (`resolve_with_policy()`) attribute the resolved actor. 
- /// Direct-store access carries no Cedar policy (RFC-011: policy lives - /// in the cluster/server, not in per-operator addressing). - Embedded { - uri: String, - actor: Option, - }, - /// Remote HTTP server. The actor is resolved server-side from the - /// token; the client never sets identity. - Remote { - http: reqwest::Client, - base_url: String, - token: Option, - }, -} - -/// RFC-011 Decision 7: a server scope that selects no graph (no `--graph`, no -/// `default_graph`) must not silently fall through to the bare server URL when -/// the server is multi-graph. Best-effort probe `GET /graphs`: a populated list -/// forces `--graph` (listing the candidates); a single-graph/flat server (405), -/// a policy-gated `/graphs`, or an unreachable server all proceed β€” the bare URL -/// is then correct, or the real request surfaces the failure. Only fires on the -/// no-graph path, so a `--graph`/`default_graph` happy path does no extra I/O. -async fn require_graph_for_multi_graph_server( - scope: &crate::scope::ResolvedScope, -) -> Result<()> { - let (Some(server), None) = (scope.server.as_deref(), scope.graph.as_deref()) else { - return Ok(()); - }; - let Some(base) = resolve_server_flag(Some(server), None)? else { - return Ok(()); - }; - let token = resolve_remote_bearer_token(Some(&base))?; - let probe = GraphClient::Remote { - http: build_http_client()?, - base_url: base, - token, - }; - if let Ok(resp) = probe.list_graphs().await { - if !resp.graphs.is_empty() { - let ids: Vec<&str> = resp.graphs.iter().map(|g| g.graph_id.as_str()).collect(); - bail!( - "server scope '{server}' has {} {}: [{}]; pass --graph to select one \ - (or set `default_graph` in your operator config)", - ids.len(), - if ids.len() == 1 { "graph" } else { "graphs" }, - ids.join(", ") - ); - } - } - Ok(()) -} - -/// A remote graph must be addressed with `--server` (RFC-011): a positional or -/// `--uri` `http(s)://` URL no longer auto-dispatches to a server. 
A remote URL -/// produced by a server scope (`via_server`) is fine. -fn reject_positional_remote(via_server: bool, uri: &str) -> Result<()> { - if !via_server && is_remote_uri(uri) { - bail!( - "a remote graph must be addressed with `--server ` β€” a positional \ - (or `--uri`) http(s):// URL no longer dispatches to a server" - ); - } - Ok(()) -} - -impl GraphClient { - /// Resolve the addressing (positional URI / `--target` / `--server`) - /// and credential once, then pick the variant by URI scheme β€” the - /// single branch point that replaces every per-command `is_remote` - /// fork. Mirrors the read verbs' current preamble (`resolve_uri` - /// path, not the policy-bearing `resolve_cli_graph`). Used by reads - /// and `query` (which opens without policy, like the reads). - pub(crate) async fn resolve( - server: Option<&str>, - graph: Option<&str>, - uri: Option, - profile: Option<&str>, - store: Option<&str>, - ) -> Result { - // RFC-011: a scope (profile / --store / operator defaults) may stand in - // for omitted addressing. The explicit branch passes server/graph/uri - // straight through, so existing invocations are unchanged. 
- let scope = crate::scope::resolve_scope( - &crate::operator::load_operator_config()?, - crate::planes::Capability::Any, - crate::scope::ScopeFlags { profile, store, server, cluster: None, graph, uri }, - )?; - require_graph_for_multi_graph_server(&scope).await?; - let (server, graph, uri) = ( - scope.server.as_deref(), - scope.graph.as_deref(), - scope.uri, - ); - let via_server = server.is_some(); - let uri = apply_server_flag(server, graph, uri)?; - let token = resolve_remote_bearer_token(uri.as_deref())?; - let uri = crate::helpers::resolve_uri(uri)?; - reject_positional_remote(via_server, &uri)?; - if is_remote_uri(&uri) { - Ok(GraphClient::Remote { - http: build_http_client()?, - base_url: uri, - token, - }) - } else { - Ok(GraphClient::Embedded { uri, actor: None }) - } - } - - /// Write-path factory: the same addressing/credential resolution as - /// `resolve()`, but through the stricter `resolve_cli_graph` (which - /// carries `policy_file`/`graph_id`/`selected`), and with the actor - /// resolved up front. The embedded arm then opens WITH policy. The - /// resolution order matches the write arms exactly: server flag β†’ - /// bearer token β†’ graph. - pub(crate) async fn resolve_with_policy( - server: Option<&str>, - graph: Option<&str>, - uri: Option, - cli_as: Option<&str>, - profile: Option<&str>, - store: Option<&str>, - ) -> Result { - // RFC-011 scope translation (see `resolve`); explicit addressing passes - // through unchanged. 
- let scope = crate::scope::resolve_scope( - &crate::operator::load_operator_config()?, - crate::planes::Capability::Any, - crate::scope::ScopeFlags { profile, store, server, cluster: None, graph, uri }, - )?; - require_graph_for_multi_graph_server(&scope).await?; - let (server, graph, uri) = ( - scope.server.as_deref(), - scope.graph.as_deref(), - scope.uri, - ); - let via_server = server.is_some(); - let uri = apply_server_flag(server, graph, uri)?; - let token = resolve_remote_bearer_token(uri.as_deref())?; - let resolved = resolve_cli_graph(uri)?; - reject_positional_remote(via_server, &resolved.uri)?; - if resolved.is_remote { - // A served write resolves the actor server-side from the bearer - // token; `--as` cannot set identity here and is rejected. - if cli_as.is_some() { - bail!( - "`--as` is not allowed on a served write β€” the server resolves the actor \ - from the bearer token. Remove `--as`, or run the write directly against \ - storage with `--store `." - ); - } - Ok(GraphClient::Remote { - http: build_http_client()?, - base_url: resolved.uri, - token, - }) - } else { - let actor = resolve_cli_actor(cli_as)?; - Ok(GraphClient::Embedded { - uri: resolved.uri, - actor, - }) - } - } - - /// The graph URI (local path / remote base URL) this client addresses. - pub(crate) fn uri(&self) -> &str { - match self { - GraphClient::Embedded { uri, .. } => uri, - GraphClient::Remote { base_url, .. } => base_url, - } - } - - pub(crate) fn is_remote(&self) -> bool { - matches!(self, GraphClient::Remote { .. }) - } - - /// Open the local engine. Direct-store access carries no Cedar policy - /// (RFC-011), so both read and write paths open bare; the actor is still - /// attributed on the write via the `_as` engine APIs. - async fn open_embedded(uri: &str) -> Result { - Ok(Omnigraph::open(uri).await?) 
- } - - pub(crate) async fn branch_list(&self) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - remote_json( - http, - Method::GET, - remote_url(base_url, &["branches"], &[])?, - None, - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, .. } => { - let db = Omnigraph::open(uri).await?; - let mut branches = db.branch_list().await?; - branches.sort(); - Ok(BranchListOutput { branches }) - } - } - } - - pub(crate) async fn snapshot(&self, branch: &str) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - remote_json( - http, - Method::GET, - remote_url(base_url, &["snapshot"], &[("branch", branch)])?, - None, - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, .. } => { - let db = Omnigraph::open(uri).await?; - let snapshot = db.snapshot_of(ReadTarget::branch(branch)).await?; - Ok(snapshot_payload(branch, &snapshot)) - } - } - } - - pub(crate) async fn schema_source(&self) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - remote_json( - http, - Method::GET, - remote_url(base_url, &["schema"], &[])?, - None, - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, .. } => { - let db = Omnigraph::open(uri).await?; - Ok(SchemaOutput { - schema_source: db.schema_source().to_string(), - }) - } - } - } - - pub(crate) async fn list_commits(&self, branch: Option<&str>) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - let url = match branch { - Some(branch) => remote_url(base_url, &["commits"], &[("branch", branch)])?, - None => remote_url(base_url, &["commits"], &[])?, - }; - remote_json(http, Method::GET, url, None, token.as_deref()).await - } - GraphClient::Embedded { uri, .. } => { - let db = Omnigraph::open(uri).await?; - let commits = db - .list_commits(branch) - .await? 
- .iter() - .map(commit_output) - .collect::>(); - Ok(CommitListOutput { commits }) - } - } - } - - pub(crate) async fn get_commit(&self, commit_id: &str) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - remote_json( - http, - Method::GET, - remote_url(base_url, &["commits", commit_id], &[])?, - None, - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, .. } => { - let db = Omnigraph::open(uri).await?; - Ok(commit_output(&db.get_commit(commit_id).await?)) - } - } - } - - /// `load` β€” bulk-load `data` (a file path) onto `branch`, forking from - /// `from` if missing. Returns the CLI `LoadOutput`; each arm keeps its - /// own mapping (remote sums the wire `IngestOutput.tables`, embedded - /// reads the richer `LoadResult` directly) β€” preserved exactly. - pub(crate) async fn load( - &self, - branch: &str, - from: Option<&str>, - data: &str, - mode: CliLoadMode, - ) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - let data = std::fs::read_to_string(data)?; - // RFC-009 Phase 5: the canonical `load` verb targets the - // canonical `/load` route (the deprecated `ingest` verb below - // still rides `/ingest`). - let output = remote_json::( - http, - Method::POST, - remote_url(base_url, &["load"], &[])?, - Some(serde_json::to_value(IngestRequest { - branch: Some(branch.to_string()), - from: from.map(ToOwned::to_owned), - mode: Some(mode.into()), - data, - })?), - token.as_deref(), - ) - .await?; - Ok(load_output_from_tables(base_url, branch, mode.as_str(), &output)) - } - GraphClient::Embedded { uri, actor } => { - let db = Self::open_embedded(uri).await?; - let result = db - .load_file_as(branch, from, data, mode.into(), actor.as_deref()) - .await?; - Ok(load_output_from_result(uri, branch, mode.as_str(), &result)) - } - } - } - - /// `ingest` β€” the deprecated alias of `load`. 
Same operation, but the - /// surfaced shape is the wire `IngestOutput` (printed by - /// `print_ingest_human`), so it is its own method. The embedded arm - /// echoes `actor_id: None` in the output exactly as the legacy arm did - /// (the actor is still attributed on the commit via `load_file_as`). - pub(crate) async fn ingest( - &self, - branch: &str, - from: &str, - data: &str, - mode: CliLoadMode, - ) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - let data = std::fs::read_to_string(data)?; - remote_json( - http, - Method::POST, - remote_url(base_url, &["ingest"], &[])?, - Some(serde_json::to_value(IngestRequest { - branch: Some(branch.to_string()), - from: Some(from.to_string()), - mode: Some(mode.into()), - data, - })?), - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, actor } => { - let db = Self::open_embedded(uri).await?; - let result = db - .load_file_as(branch, Some(from), data, mode.into(), actor.as_deref()) - .await?; - Ok(ingest_output(uri, &result, mode.into(), None)) - } - } - } - - /// `mutate` β€” run a change query against `branch`. Folds - /// `execute_change` / `execute_change_remote` + the legacy request body. 
- pub(crate) async fn mutate( - &self, - branch: &str, - query_source: &str, - query_name: Option<&str>, - params_json: Option<&Value>, - ) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - remote_json( - http, - Method::POST, - remote_url(base_url, &["change"], &[])?, - Some(legacy_change_request_body( - query_source, - query_name, - branch, - params_json, - )), - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, actor } => { - let (selected_name, query_params) = select_named_query(query_source, query_name)?; - let params = query_params_from_json(&query_params, params_json)?; - let db = Self::open_embedded(uri).await?; - let actor = actor.as_deref(); - let result = db - .mutate_as(branch, query_source, &selected_name, &params, actor) - .await?; - Ok(ChangeOutput { - branch: branch.to_string(), - query_name: selected_name, - affected_nodes: result.affected_nodes, - affected_edges: result.affected_edges, - actor_id: actor.map(String::from), - }) - } - } - } - - /// `query` — run a read query against `target`. Folds `execute_read` / - /// `execute_read_remote`; the embedded arm opens WITHOUT policy (reads - /// never attach one), so this verb resolves via `resolve()`.
- pub(crate) async fn query( - &self, - target: ReadTarget, - query_source: &str, - query_name: Option<&str>, - params_json: Option<&Value>, - ) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - let (branch, snapshot) = match &target { - ReadTarget::Branch(branch) => (Some(branch.clone()), None), - ReadTarget::Snapshot(snapshot) => (None, Some(snapshot.as_str().to_string())), - }; - remote_json( - http, - Method::POST, - remote_url(base_url, &["read"], &[])?, - Some(serde_json::to_value(ReadRequest { - query_source: query_source.to_string(), - query_name: query_name.map(ToOwned::to_owned), - params: params_json.cloned(), - branch, - snapshot, - })?), - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, .. } => { - let (selected_name, query_params) = select_named_query(query_source, query_name)?; - let params = query_params_from_json(&query_params, params_json)?; - let db = Self::open_embedded(uri).await?; - let result = db - .query(target.clone(), query_source, &selected_name, &params) - .await?; - Ok(read_output(selected_name, &target, result)) - } - } - } - - /// `invoke_named` — run a stored query **by catalog name** (RFC-011 D3). - /// Served-only: the catalog is server-owned, so a `--store` (embedded) - /// scope has nothing to resolve the name against. `expect_mutation` carries - /// the verb's asserted kind; the server rejects a mismatch (400) before - /// running, so the response is exactly the expected envelope — the caller - /// deserializes it as the concrete `T` (`ReadOutput` for `query`, - /// `ChangeOutput` for `mutate`), sidestepping the untagged wire enum.
- pub(crate) async fn invoke_named( - &self, - name: &str, - expect_mutation: bool, - params_json: Option<&Value>, - branch: Option, - snapshot: Option, - ) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - let body = InvokeStoredQueryRequest { - params: params_json.cloned(), - branch, - snapshot, - expect_mutation: Some(expect_mutation), - }; - remote_json( - http, - Method::POST, - remote_url(base_url, &["queries", name], &[])?, - Some(serde_json::to_value(body)?), - token.as_deref(), - ) - .await - } - GraphClient::Embedded { .. } => bail!( - "by-name invocation needs a server (the stored-query catalog is \ - server-owned); use -e '' or --query for an ad-hoc query \ - against --store, or address a server with --server / --profile" - ), - } - } - - pub(crate) async fn branch_create_from( - &self, - from: &str, - name: &str, - ) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - remote_json( - http, - Method::POST, - remote_url(base_url, &["branches"], &[])?, - Some(serde_json::to_value(BranchCreateRequest { - from: Some(from.to_string()), - name: name.to_string(), - })?), - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, actor } => { - let db = Self::open_embedded(uri).await?; - let actor = actor.as_deref(); - db.branch_create_from_as(ReadTarget::branch(from), name, actor) - .await?; - Ok(BranchCreateOutput { - uri: uri.clone(), - from: from.to_string(), - name: name.to_string(), - actor_id: actor.map(String::from), - }) - } - } - } - - pub(crate) async fn branch_delete(&self, name: &str) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - remote_json( - http, - Method::DELETE, - remote_url(base_url, &["branches", name], &[])?, - None, - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, actor } => { - let db = Self::open_embedded(uri).await?; - let actor = actor.as_deref(); - db.branch_delete_as(name, 
actor).await?; - Ok(BranchDeleteOutput { - uri: uri.clone(), - name: name.to_string(), - actor_id: actor.map(String::from), - }) - } - } - } - - pub(crate) async fn branch_merge(&self, source: &str, into: &str) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - remote_json( - http, - Method::POST, - remote_url(base_url, &["branches", "merge"], &[])?, - Some(serde_json::to_value(BranchMergeRequest { - source: source.to_string(), - target: Some(into.to_string()), - })?), - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, actor } => { - let db = Self::open_embedded(uri).await?; - let actor = actor.as_deref(); - let outcome = db.branch_merge_as(source, into, actor).await?; - Ok(BranchMergeOutput { - source: source.to_string(), - target: into.to_string(), - outcome: outcome.into(), - actor_id: actor.map(String::from), - }) - } - } - } - - /// `apply_schema` — apply `schema_source`. The embedded arm runs the - /// caller's catalog validator (stored-query registry check) inside the - /// engine's `apply_schema_as_with_catalog_check`; the remote arm runs - /// the server's own check and IGNORES `validate`. The `impl FnOnce` - /// validator is exactly why this is an enum, not a trait (non-object- - /// safe). - pub(crate) async fn apply_schema( - &self, - schema_source: &str, - allow_data_loss: bool, - validate: F, - ) -> Result - where - F: FnOnce(&Catalog) -> omnigraph::error::Result<()>, - { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - // MR-694 PR B: SchemaApplyRequest carries allow_data_loss so - // Hard-mode drops are no longer CLI-only; the server's - // `server_schema_apply` honors it (and runs its own catalog - // check, so `validate` does not apply here). 
- remote_json::( - http, - Method::POST, - remote_url(base_url, &["schema", "apply"], &[])?, - Some(serde_json::to_value(SchemaApplyRequest { - schema_source: schema_source.to_string(), - allow_data_loss, - })?), - token.as_deref(), - ) - .await - } - GraphClient::Embedded { uri, actor } => { - let db = Self::open_embedded(uri).await?; - let result = db - .apply_schema_as_with_catalog_check( - schema_source, - omnigraph::db::SchemaApplyOptions { allow_data_loss }, - actor.as_deref(), - validate, - ) - .await?; - Ok(schema_apply_output(uri, result)) - } - } - } - - /// `export` — stream the branch as JSONL into `writer`. The streaming - /// shape (a `W: Write`, not a returned DTO) is why this lands in 3c - /// rather than 3b. Opens WITHOUT policy (like reads), so it is reached - /// via `resolve()`; the Embedded arm opens bare. The Remote arm streams - /// the chunked response body straight through (no buffering the whole - /// export in memory). - pub(crate) async fn export( - &self, - branch: &str, - type_names: &[String], - table_keys: &[String], - writer: &mut W, - ) -> Result<()> { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - let request = apply_bearer_token( - http.request(Method::POST, remote_url(base_url, &["export"], &[])?), - token.as_deref(), - ) - .json(&ExportRequest { - branch: Some(branch.to_string()), - type_names: type_names.to_vec(), - table_keys: table_keys.to_vec(), - }); - let mut response = request.send().await?; - let status = response.status(); - if !status.is_success() { - let text = response.text().await?; - if let Ok(error) = serde_json::from_str::(&text) { - bail!(error.error); - } - bail!("server returned {}: {}", status, text); - } - while let Some(chunk) = response.chunk().await? { - writer.write_all(&chunk)?; - } - writer.flush()?; - Ok(()) - } - GraphClient::Embedded { uri, .. 
} => { - let db = Omnigraph::open(uri).await?; - db.export_jsonl_to_writer(branch, type_names, table_keys, writer) - .await?; - writer.flush()?; - Ok(()) - } - } - } - - /// `graphs list` — enumerate the graphs a remote multi-graph server - /// serves (`GET /graphs`). Remote-only by design: there is no local - /// enumeration endpoint, so the Embedded arm fails loudly. Routing it - /// through the enum still buys the shared `resolve()` addressing/token - /// preamble. - pub(crate) async fn list_graphs(&self) -> Result { - match self { - GraphClient::Remote { - http, - base_url, - token, - } => { - remote_json( - http, - Method::GET, - remote_url(base_url, &["graphs"], &[])?, - None, - token.as_deref(), - ) - .await - } - GraphClient::Embedded { .. } => bail!( - "`omnigraph graphs list` requires a remote multi-graph server \ - (--server ). To enumerate the graphs in a cluster, run \ - `omnigraph cluster status --config `." - ), - } - } -} diff --git a/crates/omnigraph-cli/src/embed.rs b/crates/omnigraph-cli/src/embed.rs index a0603b7..2e1c6d9 100644 --- a/crates/omnigraph-cli/src/embed.rs +++ b/crates/omnigraph-cli/src/embed.rs @@ -9,6 +9,8 @@ use omnigraph::embedding::EmbeddingClient; use serde::{Deserialize, Serialize}; use serde_json::{Map, Value, json}; +const DEFAULT_EMBED_MODEL: &str = "gemini-embedding-2-preview"; + #[derive(Debug, Args, Clone)] pub(crate) struct EmbedArgs { /// Seed manifest path @@ -83,6 +85,8 @@ impl EmbedMode { #[derive(Debug, Clone, Deserialize)] struct EmbedSpec { + #[serde(default = "default_embed_model")] + model: String, dimension: usize, types: BTreeMap, } @@ -176,6 +180,13 @@ pub(crate) fn resolve_embed_job(args: &EmbedArgs) -> Result { (input, output, spec) }; + if spec.model != DEFAULT_EMBED_MODEL { + bail!( + "only {} is supported for explicit seed embeddings right now", + DEFAULT_EMBED_MODEL + ); + } + Ok(EmbedJob { input, output, @@ -294,14 +305,7 @@ pub(crate) async fn run_embed_job(job: &EmbedJob) -> Result { cleaned_rows, 
mode: job.mode.as_str(!job.selectors.is_empty()), dimension: job.spec.dimension, - // The embedding model is resolved solely from the provider config; the - // spec carries no model field, so there is no second source of truth to - // silently disagree with the API. Report what was actually used (empty - // for `--clean`, which builds no client). - model: client - .as_ref() - .map(|c| c.config().model.clone()) - .unwrap_or_default(), + model: job.spec.model.clone(), }) } @@ -311,6 +315,10 @@ fn temp_output_path(output: &Path) -> PathBuf { PathBuf::from(temp) } +fn default_embed_model() -> String { + DEFAULT_EMBED_MODEL.to_string() +} + fn load_embed_spec(path: &Path) -> Result { Ok(serde_json::from_str(&fs::read_to_string(path)?)?) } diff --git a/crates/omnigraph-cli/src/helpers.rs b/crates/omnigraph-cli/src/helpers.rs deleted file mode 100644 index be00808..0000000 --- a/crates/omnigraph-cli/src/helpers.rs +++ /dev/null @@ -1,1159 +0,0 @@ -//! Resolution helpers: config/actor/graph/branch/query resolution, -//! remote HTTP, env/token handling, scaffolding (moved verbatim from -//! main.rs in the modularization). - -use std::io::IsTerminal; - -use super::*; -use crate::operator; - -pub(crate) fn ensure_local_graph_parent(uri: &str) -> Result<()> { - if !uri.contains("://") { - fs::create_dir_all(uri)?; - } - Ok(()) -} - -pub(crate) fn is_remote_uri(uri: &str) -> bool { - uri.starts_with("http://") || uri.starts_with("https://") -} - -/// Whether a resolved write target is *local* for the purposes of the RFC-011 -/// Decision 9 destructive-confirm gate: a bare path or a `file://` URI. Anything -/// else carrying a scheme β€” `http(s)://` (served), `s3://` / `gs://` / … (object -/// store) β€” is non-local and a destructive write against it requires explicit -/// consent. Generalizes `is_remote_uri` (which only catches http(s)). 
-pub(crate) fn uri_is_local(uri: &str) -> bool { - !uri.contains("://") || uri.starts_with("file://") -} - -/// Echo the resolved write target + access path to stderr (RFC-011 Decision 9), -/// unless `--quiet`. One line, e.g. `omnigraph load → file://g.omni (direct, -/// local)`. stderr so `--json` consumers reading stdout are unaffected; the line -/// legitimately differs embedded-vs-served (that visibility is the point). -pub(crate) fn echo_write_target(quiet: bool, label: &str, uri: &str, served: bool) { - if quiet { - return; - } - let access = if served { - "served" - } else if uri_is_local(uri) { - "direct, local" - } else { - "direct, remote" - }; - eprintln!("omnigraph {label} → {uri} ({access})"); -} - -/// Gate a destructive write (`cleanup`, overwrite `load`, `branch delete`) -/// against a non-local scope (RFC-011 Decision 9). A local target needs no -/// confirmation; otherwise `--yes` consents, an interactive TTY is prompted, and -/// a non-TTY / `--json` run refuses rather than silently proceeding. -pub(crate) fn confirm_destructive(label: &str, uri: &str, yes: bool, json: bool) -> Result<()> { - if uri_is_local(uri) || yes { - return Ok(()); - } - if json || !std::io::stdin().is_terminal() { - bail!( - "refusing destructive `{label}` against non-local target {uri} without confirmation; \ - pass --yes to confirm (an interactive TTY would be prompted instead)" - ); - } - eprint!( - "About to run a destructive `{label}` against {uri} (not local). Type 'yes' to continue: " - ); - io::stderr().flush()?; - let mut answer = String::new(); - io::stdin().read_line(&mut answer)?; - match answer.trim().to_ascii_lowercase().as_str() { - "yes" | "y" => Ok(()), - _ => bail!("aborted: destructive `{label}` not confirmed"), - } -} - -/// THE one way the CLI composes a remote request URL. Every remote call -/// routes through here so URL assembly has a single mechanism instead of -/// per-callsite string interpolation. 
-/// -/// - `base` is the resolved server root (single-graph) or `…/graphs/{id}` -/// (multi-graph). -/// - `segments` are appended as individual percent-encoded path segments, so -/// a dynamic component (branch name, commit id, query name) is always one -/// safe segment — e.g. a branch `etl/zendesk/run-1` becomes `%2F`-escaped. -/// - `query` pairs are percent-encoded values. -/// -/// Trailing-slash normalization happens exactly once via `pop_if_empty`: -/// `Url::parse` normalizes a path-less base (`http://host`) to a single empty -/// trailing segment, and a `…/graphs/{id}/` base keeps its own. `extend` -/// appends *after* the last segment, so without dropping a trailing empty one -/// the join emits `…/graphs/{id}//branches/{name}` — the empty `//` segment -/// misses the route and 404s. Because callers pass structured segments rather -/// than a pre-joined string, neither a stray `//` nor an un-encoded dynamic -/// component is representable here. -pub(crate) fn remote_url( - base: &str, - segments: &[&str], - query: &[(&str, &str)], -) -> Result { - let mut url = reqwest::Url::parse(base.trim_end_matches('/'))?; - url.path_segments_mut() - .map_err(|_| color_eyre::eyre::eyre!("invalid remote base url"))? - .pop_if_empty() - .extend(segments); - if !query.is_empty() { - let mut pairs = url.query_pairs_mut(); - for (key, value) in query { - pairs.append_pair(key, value); - } - } - Ok(url.to_string()) -} - -pub(crate) fn normalize_bearer_token(value: Option) -> Option { - value - .map(|value| value.trim().to_string()) - .filter(|value| !value.is_empty()) -} - -pub(crate) fn bearer_token_from_env(var_name: &str) -> Option { - normalize_bearer_token(std::env::var(var_name).ok()) -} - -/// The Cedar resource id for a graph selection: the explicit graph name when one -/// is given, else the normalized URI (the anonymous fallback). Used by the -/// `policy` tooling to address a graph's bundle. 
-pub(crate) fn graph_resource_id_for_selection( - selected_graph: Option<&str>, - normalized_uri: &str, -) -> String { - selected_graph.unwrap_or(normalized_uri).to_string() -} - -#[derive(Debug, Clone)] -pub(crate) struct ResolvedCliGraph { - pub(crate) uri: String, - pub(crate) is_remote: bool, -} - -/// Resolve the cluster for a control-plane tooling command (`policy`, -/// `queries`) from `--cluster`. A configured name (`clusters:` in operator -/// config) is rewritten to its root; a literal dir / `s3://`/`file://` root is -/// passed through. A `--profile`/`OMNIGRAPH_PROFILE` cluster binding also -/// resolves here when `--cluster` is absent. No omnigraph.yaml. -pub(crate) fn require_cluster_scope( - cluster: Option<&str>, - profile: Option<&str>, - command: &str, -) -> Result { - let op = operator::load_operator_config()?; - let resolve_name = |name: &str| { - op.cluster_root(name) - .map(str::to_string) - .unwrap_or_else(|| name.to_string()) - }; - if let Some(cluster) = cluster { - return Ok(resolve_name(cluster)); - } - // A cluster profile (flag, else OMNIGRAPH_PROFILE) binds the cluster too. - let profile_name = profile - .map(str::to_string) - .or_else(|| std::env::var(scope::PROFILE_ENV).ok().filter(|s| !s.is_empty())); - if let Some(name) = profile_name { - let profile = op.profile(&name).ok_or_else(|| { - color_eyre::eyre::eyre!("unknown profile '{name}' (not defined under `profiles:`)") - })?; - if let crate::operator::ScopeBinding::Cluster(cluster) = profile.binding(&name)? { - return Ok(resolve_name(&cluster)); - } - } - bail!( - "`{command}` needs a cluster β€” pass --cluster (or a name from `clusters:` \ - in ~/.omnigraph/config.yaml), or select a cluster profile" - ) -} - -/// Read a cluster's serving snapshot for a control-plane tooling command, -/// flattening the readiness `Diagnostic` list into one loud error. The single -/// snapshot entry point for `policy`/`queries` so the not-servable message stays -/// identical across them. 
-async fn read_serving_snapshot_or_report( - cluster: &str, -) -> Result { - omnigraph_cluster::read_serving_snapshot(cluster) - .await - .map_err(|diagnostics| { - color_eyre::eyre::eyre!( - "cluster `{cluster}` is not servable:\n {}", - diagnostics - .iter() - .map(|d| d.message.clone()) - .collect::>() - .join("\n ") - ) - }) -} - -/// Resolve the Cedar policy bundle(s) for a `--cluster` policy-tooling command -/// (RFC-011). Sources the applied policies from the cluster's serving snapshot; -/// each `ServingPolicy` carries its `source` (digest-verified content) and the -/// scopes it `applies_to` (`cluster` | `graph.`). The optional `graph` -/// selects a graph's bundle when several apply. -pub(crate) async fn read_cluster_policies( - cluster: &str, -) -> Result> { - Ok(read_serving_snapshot_or_report(cluster).await?.policies) -} - -/// Pick the single policy bundle that applies to the selection. With `--graph`, -/// the bundle bound to `graph.` (or the cluster-wide one); without it, the -/// sole bundle if there's exactly one. Ambiguity or absence is a loud error. 
-pub(crate) fn select_cluster_policy<'p>( - cluster: &str, - policies: &'p [omnigraph_cluster::ServingPolicy], - graph: Option<&str>, -) -> Result<&'p omnigraph_cluster::ServingPolicy> { - if let Some(graph_id) = graph { - let graph_ref = format!("graph.{graph_id}"); - let matching: Vec<&omnigraph_cluster::ServingPolicy> = policies - .iter() - .filter(|p| { - p.applies_to - .iter() - .any(|s| s == &graph_ref || s == "cluster") - }) - .collect(); - return match matching.as_slice() { - [only] => Ok(only), - [] => bail!( - "cluster `{cluster}` has no policy bundle bound to graph `{graph_id}` \ - (or to the cluster scope)" - ), - many => bail!( - "graph `{graph_id}` in cluster `{cluster}` matches {} policy bundles ([{}]); \ - the cluster model expects one bundle per graph scope", - many.len(), - many.iter().map(|p| p.name.as_str()).collect::>().join(", ") - ), - }; - } - match policies { - [only] => Ok(only), - [] => bail!("cluster `{cluster}` has no applied policy bundles"), - many => bail!( - "cluster `{cluster}` has {} policy bundles ([{}]); pass --graph to select one", - many.len(), - many.iter().map(|p| p.name.as_str()).collect::>().join(", ") - ), - } -} - -/// THE actor chain (RFC-011) β€” every command that needs an identity -/// resolves through this one function (one path per concern): -/// `--as` > `operator.actor` in ~/.omnigraph/config.yaml > none. -pub(crate) fn resolve_actor(cli_as: Option<&str>) -> Result> { - if let Some(actor) = cli_as { - return Ok(Some(actor.to_string())); - } - Ok(operator::load_operator_config()? 
- .actor() - .map(str::to_string)) -} - -pub(crate) fn resolve_cluster_actor(cli_as: Option<&str>) -> Result> { - resolve_actor(cli_as) -} - -pub(crate) fn resolve_cli_actor(cli_as: Option<&str>) -> Result> { - resolve_actor(cli_as) -} - -/// The bearer token for a remote request (RFC-011): the operator keyed chain -/// for the matching server (`OMNIGRAPH_TOKEN_` env → 0600 credentials -/// file), then the default `OMNIGRAPH_BEARER_TOKEN` env. No omnigraph.yaml -/// chain. -pub(crate) fn resolve_remote_bearer_token(explicit_uri: Option<&str>) -> Result> { - // The keyed hop (RFC-007 §D4, gh-host model): when the effective remote - // URL belongs to an operator-defined server, that server's keyed chain - // applies first — OMNIGRAPH_TOKEN_ env, then the 0600 credentials - // file. The keyed token is structurally scoped to its own server: a URL - // matching no operator server never sees it. - if let Some(remote_url) = explicit_uri.filter(|uri| is_remote_uri(uri)) { - let operator_config = operator::load_operator_config()?; - if let Some(server) = operator_config.find_server_for_url(remote_url) { - if let Some(token) = operator::resolve_keyed_token(server)? { - return Ok(Some(token)); - } - } - } - - Ok(bearer_token_from_env(DEFAULT_BEARER_TOKEN_ENV)) -} - -/// `--server ` (RFC-007 PR 3): resolve an operator-defined server -/// name (+ optional `--graph` for multi-graph servers) to the effective -/// remote URI. The result feeds the ordinary `uri` slot, so graph -/// resolution and the keyed-token URL match work unchanged — the flag is -/// sugar for a URI the operator already owns. Unknown names fail loudly, -/// listing what IS defined. -pub(crate) fn resolve_server_flag( - server: Option<&str>, - graph: Option<&str>, -) -> Result> { - let Some(server) = server else { - return Ok(None); - }; - // RFC-011 Decision 2: a value containing `://` is a literal base URL - // (bypasses the operator-config registry); otherwise it is a config name. 
- let base_url = if server.contains("://") { - server.to_string() - } else { - let operator_config = operator::load_operator_config()?; - let Some(entry) = operator_config.servers.get(server) else { - let known = operator_config - .servers - .keys() - .map(String::as_str) - .collect::>() - .join(", "); - color_eyre::eyre::bail!( - "unknown server '{server}' β€” servers defined in the operator config: [{known}] (add it under servers: in ~/.omnigraph/config.yaml)" - ); - }; - entry.url.clone() - }; - let base = base_url.trim_end_matches('/'); - Ok(Some(match graph { - Some(graph) => format!("{base}/graphs/{graph}"), - None => base.to_string(), - })) -} - -/// Execute an OPERATOR alias (RFC-007 PR 3): a pure binding invoking a -/// stored query by name on a named server β€” POST {base}/queries/{name}. -/// Param precedence: --params > positional args > the alias's fixed -/// params. The keyed token applies via the ordinary URL match. -pub(crate) async fn execute_operator_alias( - client: &reqwest::Client, - alias_name: &str, - alias: &crate::operator::OperatorAlias, - alias_args: &[String], - explicit_params: Option, -) -> Result { - let uri = resolve_server_flag(Some(&alias.server), alias.graph.as_deref())? 
- .expect("server name is present"); - let bearer_token = resolve_remote_bearer_token(Some(&uri))?; - - let mut params = serde_json::Map::new(); - for (key, value) in &alias.params { - let Some(key) = key.as_str() else { - bail!("alias '{alias_name}': params keys must be strings"); - }; - params.insert(key.to_string(), serde_json::to_value(value)?); - } - if alias_args.len() > alias.args.len() { - bail!( - "alias '{alias_name}' takes {} positional arg(s) ({}), got {}", - alias.args.len(), - alias.args.join(", "), - alias_args.len() - ); - } - for (name, value) in alias.args.iter().zip(alias_args) { - params.insert(name.clone(), parse_alias_value(value)); - } - if let Some(Value::Object(explicit)) = explicit_params { - for (key, value) in explicit { - params.insert(key, value); - } - } - - let mut body = serde_json::Map::new(); - body.insert("expect_mutation".to_string(), Value::Bool(false)); - if !params.is_empty() { - body.insert("params".to_string(), Value::Object(params)); - } - remote_json( - client, - Method::POST, - remote_url(&uri, &["queries", &alias.query], &[])?, - Some(Value::Object(body)), - bearer_token.as_deref(), - ) - .await -} - -/// Apply `--server`/`--graph` to a command's uri/target slots: exclusive -/// with both (loud error, not silent precedence), no-op when absent. 
-pub(crate) fn apply_server_flag( - server: Option<&str>, - graph: Option<&str>, - uri: Option, -) -> Result> { - if server.is_none() { - return Ok(uri); - } - if uri.is_some() { - color_eyre::eyre::bail!( - "--server is exclusive with a positional URI β€” pick one way to address the graph" - ); - } - resolve_server_flag(server, graph) -} - -pub(crate) fn build_http_client() -> Result { - Ok(reqwest::Client::new()) -} - -pub(crate) fn apply_bearer_token( - request: reqwest::RequestBuilder, - token: Option<&str>, -) -> reqwest::RequestBuilder { - if let Some(token) = token { - request.header(AUTHORIZATION, format!("Bearer {}", token)) - } else { - request - } -} - -pub(crate) async fn remote_json( - client: &reqwest::Client, - method: Method, - url: String, - body: Option, - bearer_token: Option<&str>, -) -> Result { - let request = apply_bearer_token(client.request(method, url), bearer_token); - let request = if let Some(body) = body { - request.json(&body) - } else { - request - }; - let response = request.send().await?; - let status = response.status(); - let text = response.text().await?; - if !status.is_success() { - if let Ok(error) = serde_json::from_str::(&text) { - bail!(error.error); - } - bail!("server returned {}: {}", status, text); - } - Ok(serde_json::from_str(&text)?) -} - -/// The graph URI a command addresses (RFC-011): the scope-resolved URI string -/// (positional URI / `--store` / `--profile` / `defaults.store`). No -/// omnigraph.yaml `cli.graph` fallback β€” an absent address is a loud error. 
-pub(crate) fn resolve_uri(cli_uri: Option) -> Result { - cli_uri.ok_or_else(|| { - color_eyre::eyre::eyre!( - "no graph addressed β€” pass a positional URI, --store , --server , \ - --profile , or set a default scope in ~/.omnigraph/config.yaml" - ) - }) -} - -pub(crate) fn resolve_cli_graph(cli_uri: Option) -> Result { - let uri = resolve_uri(cli_uri)?; - Ok(ResolvedCliGraph { - is_remote: is_remote_uri(&uri), - uri, - }) -} - -pub(crate) fn resolve_local_graph( - cli_uri: Option, - operation: &str, -) -> Result { - let graph = resolve_cli_graph(cli_uri)?; - if graph.is_remote { - bail!( - "`{}` is a direct (storage-native) command and needs direct storage \ - access; the resolved target is a remote server ({}). Pass the \ - graph's file:// or s3:// URI.", - operation, - graph.uri - ); - } - Ok(graph) -} - -pub(crate) fn parse_duration_arg(s: &str) -> Result { - let s = s.trim(); - if s.is_empty() { - bail!("duration is empty"); - } - let (num_part, unit) = match s - .char_indices() - .rev() - .find(|(_, c)| c.is_ascii_alphabetic()) - { - Some((i, _)) => ( - &s[..i + 1 - s[i..].chars().next().unwrap().len_utf8()], - &s[i..], - ), - None => (s, ""), - }; - let n: u64 = num_part - .parse() - .map_err(|e| color_eyre::eyre::eyre!("invalid duration '{}': {}", s, e))?; - let secs = match unit { - "" | "s" => n, - "m" => n * 60, - "h" => n * 60 * 60, - "d" => n * 60 * 60 * 24, - "w" => n * 60 * 60 * 24 * 7, - _ => bail!("unknown duration unit '{}'. Supported: s, m, h, d, w", unit), - }; - Ok(std::time::Duration::from_secs(secs)) -} - -pub(crate) fn resolve_local_uri(cli_uri: Option, operation: &str) -> Result { - Ok(resolve_local_graph(cli_uri, operation)?.uri) -} - -/// Resolve a direct (storage-native) verb's address to a storage URI through the -/// one RFC-011 scope path β€” the maintenance verbs (optimize/repair/cleanup) plus -/// `schema plan` and `lint`'s graph-target path. 
Every primitive funnels here: a -/// positional URI, `--store`, `--cluster --graph `, a `--profile` -/// cluster binding, or operator defaults β€” all resolved at the `Direct` -/// capability (so a server scope is rejected, a cluster scope is allowed when the -/// verb opts into cluster addressing), then mapped to a storage URI by -/// `resolve_storage_uri`. -pub(crate) async fn resolve_maintenance_uri( - profile: Option<&str>, - store: Option<&str>, - cluster: Option<&str>, - graph: Option<&str>, - cli_uri: Option, - operation: &str, -) -> Result { - let scope = scope::resolve_scope( - &operator::load_operator_config()?, - planes::Capability::Direct, - scope::ScopeFlags { - profile, - store, - server: None, - cluster, - graph, - uri: cli_uri, - }, - )?; - resolve_storage_uri( - scope.uri, - scope.cluster.as_deref(), - scope.cluster_graph.as_deref(), - operation, - ) - .await -} - -/// Map a resolved direct address to a storage URI: a cluster scope -/// (`--cluster --graph `, or a `--profile` cluster binding) resolves -/// the graph's storage URI from the **served cluster state**; otherwise the -/// ordinary positional-URI path. When a cluster scope carries no graph -/// selection (RFC-011 D7), enumerate the catalog: a sole graph is used -/// automatically, otherwise error and list the candidates so the operator can -/// pass `--graph `. 
-pub(crate) async fn resolve_storage_uri( - cli_uri: Option, - cluster: Option<&str>, - cluster_graph: Option<&str>, - operation: &str, -) -> Result { - match (cluster, cluster_graph) { - (Some(cluster), Some(graph_id)) => resolve_cluster_graph_uri(cluster, graph_id).await, - (Some(cluster), None) => { - let graph_id = resolve_sole_cluster_graph(cluster).await?; - resolve_cluster_graph_uri(cluster, &graph_id).await - } - (None, None) => resolve_local_uri(cli_uri, operation), - (None, Some(_)) => { - bail!("internal error: a graph was selected without a cluster scope") - } - } -} - -/// Pick the graph for a cluster scope that has no `--graph`/`default_graph` -/// (RFC-011 D7): exactly one applied graph → use it; zero → error; more than -/// one → error and list the candidates. Never auto-picks among several. -async fn resolve_sole_cluster_graph(cluster: &str) -> Result { - let ids = omnigraph_cluster::cluster_graph_ids(cluster) - .await - .map_err(|diagnostic| color_eyre::eyre::eyre!("{}", diagnostic.message))?; - match ids.as_slice() { - [only] => Ok(only.clone()), - [] => bail!("cluster `{cluster}` has no applied graphs; run `cluster apply` first"), - many => bail!( - "cluster `{cluster}` has {} graphs: [{}]; pass --graph to select one", - many.len(), - many.join(", ") - ), - } -} - -/// Look up a graph's storage URI from a cluster's applied state ledger. Uses -/// the lightweight `resolve_graph_storage_uri` (NOT the full serving-snapshot -/// validation), so maintenance — especially `repair` — works even when an -/// unrelated catalog payload is corrupt or a recovery sweep is pending. 
-async fn resolve_cluster_graph_uri(cluster: &str, graph_id: &str) -> Result { - omnigraph_cluster::resolve_graph_storage_uri(cluster, graph_id) - .await - .map_err(|diagnostic| color_eyre::eyre::eyre!("{}", diagnostic.message)) -} - -pub(crate) fn resolve_branch( - cli_branch: Option, - alias_branch: Option, - default_branch: &str, -) -> String { - cli_branch - .or(alias_branch) - .unwrap_or_else(|| default_branch.to_string()) -} - -pub(crate) fn resolve_read_target( - cli_branch: Option, - cli_snapshot: Option, - alias_branch: Option, -) -> Result { - if cli_branch.is_some() && cli_snapshot.is_some() { - bail!("read target may specify branch or snapshot, not both"); - } - Ok(read_target_from_cli(cli_branch.or(alias_branch), cli_snapshot)) -} - -pub(crate) fn resolve_query_path( - explicit_query: Option<&PathBuf>, - alias_query: Option<&str>, -) -> Result { - // The `.gq` path is resolved plainly (cwd-relative) β€” no omnigraph.yaml - // `query.roots` search. - explicit_query - .map(PathBuf::from) - .or_else(|| alias_query.map(PathBuf::from)) - .ok_or_else(|| { - color_eyre::eyre::eyre!( - "exactly one of --query, --query-string, or --alias must be provided" - ) - }) -} - -pub(crate) fn resolve_query_source( - explicit_query: Option<&PathBuf>, - inline_query: Option<&str>, - alias_query: Option<&str>, -) -> Result { - if let Some(inline) = inline_query { - if inline.trim().is_empty() { - bail!("--query-string must not be empty"); - } - return Ok(inline.to_string()); - } - Ok(fs::read_to_string(resolve_query_path( - explicit_query, - alias_query, - )?)?) -} - -pub(crate) fn parse_alias_value(value: &str) -> Value { - serde_json::from_str(value).unwrap_or_else(|_| Value::String(value.to_string())) -} - -/// The format cascade (RFC-011): `--json` > `--format` > alias format > -/// operator `defaults.output` > table. 
-pub(crate) fn resolve_read_format( - cli_format: Option, - json: bool, - alias_format: Option, -) -> ReadOutputFormat { - if json { - return ReadOutputFormat::Json; - } - cli_format - .or(alias_format) - .or_else(|| { - operator::load_operator_config() - .ok() - .and_then(|operator| operator.output()) - }) - .unwrap_or_default() -} - - - -pub(crate) fn read_target_from_cli(branch: Option, snapshot: Option) -> ReadTarget { - if let Some(snapshot) = snapshot { - ReadTarget::snapshot(SnapshotId::new(snapshot)) - } else { - ReadTarget::branch(branch.unwrap_or_else(|| "main".to_string())) - } -} - -pub(crate) fn load_params_json(params: &ParamsArgs) -> Result> { - match (&params.params, &params.params_file) { - (Some(inline), None) => Ok(Some(serde_json::from_str(inline)?)), - (None, Some(path)) => Ok(Some(serde_json::from_str(&fs::read_to_string(path)?)?)), - (None, None) => Ok(None), - (Some(_), Some(_)) => bail!("only one of --params or --params-file may be provided"), - } -} - -pub(crate) fn select_named_query( - query_source: &str, - requested_name: Option<&str>, -) -> Result<(String, Vec)> { - let parsed = parse_query(query_source)?; - let query = if let Some(name) = requested_name { - parsed - .queries - .into_iter() - .find(|query| query.name == name) - .ok_or_else(|| color_eyre::eyre::eyre!("query '{}' not found", name))? 
- } else if parsed.queries.len() == 1 { - parsed.queries.into_iter().next().unwrap() - } else { - bail!("query file contains multiple queries; pass --name"); - }; - - Ok((query.name, query.params)) -} - -pub(crate) fn query_params_from_json( - query_params: &[omnigraph_compiler::query::ast::Param], - params_json: Option<&Value>, -) -> Result { - json_params_to_param_map(params_json, query_params, JsonParamMode::Standard) - .map_err(|err| color_eyre::eyre::eyre!(err.to_string())) -} - -pub(crate) async fn execute_query_lint( - cli_uri: Option, - schema_path: Option<&PathBuf>, - query_path: &PathBuf, -) -> Result { - let resolved_query_path = resolve_query_path(Some(query_path), None)?; - let query_source = fs::read_to_string(&resolved_query_path)?; - let query_path = resolved_query_path.to_string_lossy().into_owned(); - - if let Some(schema_path) = schema_path { - let schema_source = fs::read_to_string(schema_path)?; - let schema = - parse_schema(&schema_source).map_err(|err| color_eyre::eyre::eyre!(err.to_string()))?; - let catalog = - build_catalog(&schema).map_err(|err| color_eyre::eyre::eyre!(err.to_string()))?; - return Ok(lint_query_file( - &catalog, - &query_source, - query_path, - QueryLintSchemaSource::file(schema_path.to_string_lossy().into_owned()), - )); - } - - if cli_uri.is_none() { - bail!( - "lint requires --schema (offline) or a graph target \ - (--store / --cluster --graph )" - ); - } - - let uri = resolve_local_uri(cli_uri, "lint")?; - let db = Omnigraph::open(&uri).await?; - Ok(lint_query_file( - &db.catalog(), - &query_source, - query_path, - QueryLintSchemaSource::graph(uri), - )) -} - -/// Build a `QueryRegistry` from a cluster serving snapshot's stored queries, -/// optionally scoped to one graph. The `ServingQuery.source` is the -/// digest-verified `.gq` content, so no file I/O or omnigraph.yaml is involved. 
-fn registry_from_serving_queries( - queries: &[omnigraph_cluster::ServingQuery], - graph: Option<&str>, -) -> Result { - let specs: Vec = queries - .iter() - .filter(|q| graph.is_none_or(|g| q.graph_id == g)) - .map(|q| omnigraph_server::queries::RegistrySpec { - name: q.name.clone(), - source: q.source.clone(), - expose: false, - tool_name: None, - }) - .collect(); - QueryRegistry::from_specs(specs).map_err(|errors| { - color_eyre::eyre::eyre!( - "stored-query registry failed to load:\n {}", - errors - .iter() - .map(|e| e.to_string()) - .collect::>() - .join("\n ") - ) - }) -} - - -/// `queries validate --cluster ` (RFC-011): type-check every stored query -/// in the cluster catalog against its graph's applied schema. Both the registry -/// and the schemas come from the cluster serving snapshot — no omnigraph.yaml. -/// With `--graph`, scope to a single graph. -pub(crate) async fn execute_queries_validate( - cluster: &str, - graph: Option<&str>, - json: bool, -) -> Result<()> { - let snapshot = read_serving_snapshot_or_report(cluster).await?; - - // Type-check per graph: each graph's stored queries against its own schema - // (read from the graph's applied storage root). A `--graph` filter scopes to - // exactly one graph; an unknown id is a loud error.
- let mut breakages = Vec::new(); - let mut warnings = Vec::new(); - let mut total = 0usize; - let mut matched_any = false; - for serving_graph in &snapshot.graphs { - if graph.is_some_and(|g| g != serving_graph.graph_id) { - continue; - } - matched_any = true; - let registry = registry_from_serving_queries(&snapshot.queries, Some(&serving_graph.graph_id))?; - let db = Omnigraph::open(&serving_graph.root.to_string_lossy()).await?; - let report = check(&registry, &db.catalog()); - total += registry.len(); - for b in &report.breakages { - breakages.push(QueriesIssue { - query: b.query.clone(), - message: b.message.clone(), - }); - } - for w in &report.warnings { - warnings.push(QueriesIssue { - query: w.query.clone(), - message: w.message.clone(), - }); - } - } - if let Some(graph_id) = graph { - if !matched_any { - bail!("graph `{graph_id}` is not applied in cluster `{cluster}`"); - } - } - - let has_breakages = !breakages.is_empty(); - let output = QueriesValidateOutput { - ok: !has_breakages, - breakages, - warnings, - }; - - if json { - print_json(&output)?; - } else { - if output.breakages.is_empty() { - println!( - "OK {} stored quer{} type-check against the schema", - total, - if total == 1 { "y" } else { "ies" } - ); - } - for issue in &output.breakages { - println!("ERROR query '{}': {}", issue.query, issue.message); - } - for issue in &output.warnings { - println!("WARN query '{}': {}", issue.query, issue.message); - } - } - - if has_breakages { - io::stdout().flush()?; - std::process::exit(1); - } - Ok(()) -} - -/// Print a stored-query annotation under its `queries list` entry. A -/// `@description`/`@instruction` value may be multiline (GQ string literals -/// admit newlines); continuation lines are indented to align under the first -/// so the catalog stays readable instead of breaking the left margin.
-fn print_query_annotation(label: &str, value: &str) { - let prefix = format!(" {label}: "); - let continuation = " ".repeat(prefix.len()); - let mut lines = value.split('\n'); - match lines.next() { - Some(first) => { - println!("{prefix}{first}"); - for line in lines { - println!("{continuation}{line}"); - } - } - None => println!("{prefix}"), - } -} - -/// `queries list --cluster ` (RFC-011): list the catalog's stored queries. -/// With `--graph`, scope to one graph. -pub(crate) async fn execute_queries_list( - cluster: &str, - graph: Option<&str>, - json: bool, -) -> Result<()> { - let snapshot = read_serving_snapshot_or_report(cluster).await?; - let registry = registry_from_serving_queries(&snapshot.queries, graph)?; - - let output = QueriesListOutput { - queries: registry - .iter() - .map(|q| QueriesListItem { - name: q.name.clone(), - mcp_expose: q.expose, - tool_name: q.tool_name.clone(), - mutation: q.is_mutation(), - description: q.decl.description.clone(), - instruction: q.decl.instruction.clone(), - params: q - .decl - .params - .iter() - .map(|p| QueriesParam { - name: p.name.clone(), - type_name: p.type_name.clone(), - nullable: p.nullable, - }) - .collect(), - }) - .collect(), - }; - - if json { - print_json(&output)?; - } else if output.queries.is_empty() { - println!("(no stored queries registered)"); - } else { - for q in &output.queries { - let kind = if q.mutation { "mutation" } else { "read" }; - let params = q - .params - .iter() - .map(|p| { - format!( - "${}: {}{}", - p.name, - p.type_name, - if p.nullable { "?" 
} else { "" } - ) - }) - .collect::>() - .join(", "); - let mcp = if q.mcp_expose { - format!(" [mcp: {}]", q.tool_name.as_deref().unwrap_or(&q.name)) - } else { - String::new() - }; - println!("{kind} {}({params}){mcp}", q.name); - if let Some(description) = &q.description { - print_query_annotation("description", description); - } - if let Some(instruction) = &q.instruction { - print_query_annotation("instruction", instruction); - } - } - } - Ok(()) -} - -pub(crate) fn legacy_change_request_body( - query_source: &str, - query_name: Option<&str>, - branch: &str, - params_json: Option<&Value>, -) -> Value { - let mut body = serde_json::json!({ - "query_source": query_source, - "branch": branch, - }); - if let Some(name) = query_name { - body["query_name"] = Value::String(name.to_string()); - } - if let Some(params) = params_json { - body["params"] = params.clone(); - } - body -} - -pub(crate) fn rewrite_deprecated_argv(args: Vec) -> Vec { - if args.len() >= 3 { - let sub = args[1].to_str(); - let sub2 = args[2].to_str(); - if sub == Some("query") && matches!(sub2, Some("lint") | Some("check")) { - let suffix = sub2.unwrap(); - eprintln!( - "warning: `omnigraph query {suffix}` is deprecated; use `omnigraph lint` instead" - ); - // Drop the leading `query` token AND normalize `check` -> `lint`. - // `check` is no longer a clap visible_alias (MR-981 §6), so the - // rewritten argv must reach the canonical `lint` subcommand - // directly. Result for `omnigraph query check --query foo.gq`: - // `omnigraph lint --query foo.gq`.
- let mut out = Vec::with_capacity(args.len() - 1); - out.push(args[0].clone()); - out.push(OsString::from("lint")); - out.extend(args[3..].iter().cloned()); - return out; - } - } - if let Some(sub) = args.get(1).and_then(|s| s.to_str()) { - match sub { - "read" => { - eprintln!("warning: `omnigraph read` is deprecated; use `omnigraph query` instead") - } - "change" => eprintln!( - "warning: `omnigraph change` is deprecated; use `omnigraph mutate` instead" - ), - "check" => { - eprintln!("warning: `omnigraph check` is deprecated; use `omnigraph lint` instead"); - // Rewrite the top-level subcommand to `lint`; pass through the rest. - let mut out = Vec::with_capacity(args.len()); - out.push(args[0].clone()); - out.push(OsString::from("lint")); - out.extend(args[2..].iter().cloned()); - return out; - } - _ => {} - } - } - args -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn graph_resource_id_for_selection_uses_name_or_anonymous_uri() { - assert_eq!( - graph_resource_id_for_selection(Some("local"), "/tmp/graph.omni"), - "local" - ); - assert_eq!( - graph_resource_id_for_selection(None, "/tmp/graph.omni"), - "/tmp/graph.omni" - ); - } - - // RFC-011 Decision 9: locality classifier for the destructive-confirm gate. - #[test] - fn uri_is_local_truth_table() { - // Local: bare path or file://. - assert!(uri_is_local("graph.omni")); - assert!(uri_is_local("/abs/path/graph.omni")); - assert!(uri_is_local("file:///tmp/graph.omni")); - // Non-local: served or object-store schemes. - assert!(!uri_is_local("http://host/graphs/g")); - assert!(!uri_is_local("https://host/graphs/g")); - assert!(!uri_is_local("s3://bucket/graph.omni")); - assert!(!uri_is_local("gs://bucket/graph.omni")); - } - - // RFC-011 Decision 9: a non-local destructive write with `--json` (the CI - // shape — also covers the no-TTY case, since tests run without a terminal) - // refuses rather than proceeding; a local one and an explicit `--yes` pass.
- #[test] - fn confirm_destructive_refuses_non_local_without_consent() { - let err = confirm_destructive("cleanup", "s3://b/g.omni", false, true) - .unwrap_err() - .to_string(); - assert!(err.contains("--yes"), "{err}"); - } - - #[test] - fn confirm_destructive_allows_local_and_explicit_yes() { - // Local needs no confirmation, even with --json. - assert!(confirm_destructive("cleanup", "file:///tmp/g.omni", false, true).is_ok()); - assert!(confirm_destructive("branch delete", "graph.omni", false, true).is_ok()); - // --yes consents to a non-local target. - assert!(confirm_destructive("cleanup", "s3://b/g.omni", true, true).is_ok()); - } - - // RFC-011 Decision 2: `--server` accepts a literal URL (value with `://`), - // bypassing the operator-config registry — so no config / OMNIGRAPH_HOME is - // read on this path (hermetic). - #[test] - fn server_flag_accepts_a_literal_url() { - assert_eq!( - resolve_server_flag(Some("https://graph.example.com"), None).unwrap(), - Some("https://graph.example.com".to_string()) - ); - // trailing slash trimmed; `--graph` appends the multi-graph path. - assert_eq!( - resolve_server_flag(Some("https://graph.example.com/"), Some("knowledge")).unwrap(), - Some("https://graph.example.com/graphs/knowledge".to_string()) - ); - } - - // `branch delete` interpolates the branch into the URL path. The composed - // path must be exactly `/branches/` with no empty `//` - // segment — an empty segment misses the - // `/graphs/{graph_id}/branches/{branch}` route and 404s.
- #[test] - fn remote_url_multi_graph_base_has_no_double_slash() { - let url = remote_url("http://host/graphs/p9-os", &["branches", "tmpbranch"], &[]).unwrap(); - assert_eq!(url, "http://host/graphs/p9-os/branches/tmpbranch"); - assert!( - !url.contains("//branches"), - "double slash before branches: {url}" - ); - } - - #[test] - fn remote_url_single_graph_base_has_no_double_slash() { - let url = remote_url("http://host", &["branches", "tmpbranch"], &[]).unwrap(); - assert_eq!(url, "http://host/branches/tmpbranch"); - } - - #[test] - fn remote_url_tolerates_trailing_slash_on_base() { - let url = remote_url("http://host/graphs/p9-os/", &["branches", "tmpbranch"], &[]).unwrap(); - assert_eq!(url, "http://host/graphs/p9-os/branches/tmpbranch"); - } - - #[test] - fn remote_url_encodes_slashes_in_path_segment() { - let url = remote_url( - "http://host/graphs/p9-os", - &["branches", "etl/zendesk/run-1"], - &[], - ) - .unwrap(); - assert_eq!( - url, - "http://host/graphs/p9-os/branches/etl%2Fzendesk%2Frun-1" - ); - } - - // Sibling cases the unified builder closes by construction: a dynamic - // commit id in the path, and a branch name carried as a query value, are - // both percent-encoded instead of interpolated raw. 
- #[test] - fn remote_url_encodes_dynamic_path_segment_for_commits() { - let url = remote_url("http://host/graphs/p9-os", &["commits", "a/b c"], &[]).unwrap(); - assert_eq!(url, "http://host/graphs/p9-os/commits/a%2Fb%20c"); - } - - #[test] - fn remote_url_encodes_query_values() { - let url = remote_url( - "http://host/graphs/p9-os", - &["snapshot"], - &[("branch", "feature&x=1")], - ) - .unwrap(); - assert_eq!( - url, - "http://host/graphs/p9-os/snapshot?branch=feature%26x%3D1" - ); - } -} diff --git a/crates/omnigraph-cli/src/main.rs b/crates/omnigraph-cli/src/main.rs index fa6f4db..b7e3041 100644 --- a/crates/omnigraph-cli/src/main.rs +++ b/crates/omnigraph-cli/src/main.rs @@ -1,16 +1,14 @@ use std::ffi::OsString; use std::fs; use std::io::{self, Write}; +use std::path::Path; use std::path::PathBuf; +use std::sync::Arc; + use clap::{Arg, ArgAction, Args, CommandFactory, FromArgMatches, Parser, Subcommand, ValueEnum}; use color_eyre::eyre::{Result, bail}; use omnigraph::db::{Omnigraph, ReadTarget, SnapshotId}; use omnigraph::loader::LoadMode; -use omnigraph_cluster::{ - ApplyOptions, ApplyOutput, ApproveOutput, DiagnosticSeverity, ForceUnlockOutput, PlanOutput, StateSyncOutput, StatusOutput, - ValidateOutput, apply_config_dir_with_options, approve_config_dir, force_unlock_config_dir, import_config_dir, plan_config_dir, - refresh_config_dir, status_config_dir, validate_config_dir, -}; use omnigraph_compiler::query::parser::parse_query; use omnigraph_compiler::schema::parser::parse_schema; use omnigraph_compiler::{ @@ -18,13 +16,17 @@ use omnigraph_compiler::{ QueryLintSeverity, QueryLintStatus, SchemaMigrationPlan, SchemaMigrationStep, build_catalog, json_params_to_param_map, lint_query_file, }; -use omnigraph_api_types::{ - ChangeOutput, CommitOutput, ErrorOutput, IngestOutput, ReadOutput, SchemaApplyOutput, - SnapshotTableOutput, +use omnigraph_server::api::{ + BranchCreateOutput, BranchCreateRequest, BranchDeleteOutput, BranchListOutput, + BranchMergeOutput, 
BranchMergeRequest, ChangeOutput, CommitListOutput, CommitOutput, + ErrorOutput, ExportRequest, GraphListResponse, IngestOutput, IngestRequest, ReadOutput, + ReadRequest, SchemaApplyOutput, SchemaApplyRequest, SchemaOutput, SnapshotOutput, + SnapshotTableOutput, commit_output, ingest_output, read_output, schema_apply_output, + snapshot_payload, }; -use omnigraph_server::queries::{QueryRegistry, check}; use omnigraph_server::{ - PolicyAction, PolicyDecision, PolicyEngine, PolicyRequest, PolicyTestConfig, + AliasCommand, OmnigraphConfig, PolicyAction, PolicyDecision, PolicyEngine, PolicyRequest, + PolicyTestConfig, ReadOutputFormat, load_config, }; use reqwest::Method; use reqwest::header::AUTHORIZATION; @@ -33,21 +35,1812 @@ use serde::de::DeserializeOwned; use serde_json::Value; mod embed; -mod operator; mod read_format; use embed::{EmbedArgs, EmbedOutput, execute_embed}; -use read_format::{ReadOutputFormat, ReadRenderOptions, render_read}; +use read_format::{ReadRenderOptions, render_read}; -mod cli; -mod client; -mod helpers; -mod output; -mod scope; -mod planes; -use cli::*; -use helpers::*; -use output::*; +const DEFAULT_BEARER_TOKEN_ENV: &str = "OMNIGRAPH_BEARER_TOKEN"; + +#[derive(Debug, Parser)] +#[command(name = "omnigraph")] +#[command(about = "Omnigraph graph database CLI")] +#[command(version = env!("CARGO_PKG_VERSION"), disable_version_flag = true)] +struct Cli { + /// Actor identity for direct-engine writes (MR-722). Overrides + /// `cli.actor` from `omnigraph.yaml`. When the configured policy + /// is in effect, Cedar evaluates this actor against the requested + /// action and scope; with policy configured but neither this flag + /// nor `cli.actor` set, the engine-layer footgun guard fires and + /// the write is denied (no silent bypass). Has no effect on remote + /// HTTP writes — those resolve their actor server-side from the + /// bearer token.
+ #[arg(long = "as", global = true, value_name = "ACTOR")] + as_actor: Option, + + #[command(subcommand)] + command: Command, +} + +#[derive(Debug, Subcommand)] +enum Command { + /// Print the CLI version + Version, + /// Generate, clean, or refresh explicit seed embeddings + Embed(EmbedArgs), + /// Initialize a new graph from a schema + Init { + #[arg(long)] + schema: PathBuf, + /// Graph URI (local path or s3://) + uri: String, + /// Overwrite existing schema artifacts at the URI. Without + /// this flag, init refuses to touch a URI that already holds + /// `_schema.pg`, `_schema.ir.json`, or `__schema_state.json` + /// — closes the re-init footgun (MR-668 follow-up). With the + /// flag, the operator opts in to destructive semantics. + #[arg(long)] + force: bool, + }, + /// Load data into a graph + Load { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + data: PathBuf, + #[arg(long)] + branch: Option, + #[arg(long, default_value = "overwrite")] + mode: CliLoadMode, + #[arg(long)] + json: bool, + }, + /// Ingest data into a reviewable named branch + Ingest { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + data: PathBuf, + #[arg(long)] + branch: Option, + #[arg(long)] + from: Option, + #[arg(long, default_value = "merge")] + mode: CliLoadMode, + #[arg(long)] + json: bool, + }, + /// Branch operations + Branch { + #[command(subcommand)] + command: BranchCommand, + }, + /// Schema planning operations + Schema { + #[command(subcommand)] + command: SchemaCommand, + }, + /// Validate queries against a schema (offline) or repo (repo-backed). + /// + /// Canonical name is `lint` (matches the `omnigraph_compiler::lint` + /// module and the `OG-XXX-NNN` lint-code vocabulary).
Replaces the + /// deprecated `omnigraph query lint` / `omnigraph query check` / + /// `omnigraph check` invocations — each is kept as an argv-level + /// shim that prints a one-line stderr warning and rewrites to + /// `omnigraph lint`. Aliases are deliberately *not* exposed via + /// clap's `visible_alias` because that would advertise two + /// equivalent canonical names, which agents emit interchangeably + /// (see MR-981). + Lint { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + query: PathBuf, + #[arg(long)] + schema: Option, + #[arg(long)] + json: bool, + }, + /// Show graph snapshot + Snapshot { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + branch: Option, + #[arg(long)] + json: bool, + }, + /// Export a full graph snapshot as JSONL + Export { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + branch: Option, + #[arg(long, hide = true)] + jsonl: bool, + #[arg(long = "type")] + type_names: Vec, + #[arg(long = "table")] + table_keys: Vec, + }, + /// Commit history operations + Commit { + #[command(subcommand)] + command: CommitCommand, + }, + /// Execute a read query against a branch or snapshot. + /// + /// Canonical read endpoint. The previous name `omnigraph read` is + /// kept as a visible alias and prints a one-line deprecation warning + /// when used. Pairs with `omnigraph mutate` on the write side. + #[command(visible_alias = "read")] + Query { + /// Graph URI + #[arg(long)] + uri: Option, + #[arg(hide = true)] + legacy_uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long, conflicts_with_all = ["query", "query_string"])] + alias: Option, + #[arg(long, conflicts_with_all = ["alias", "query_string"])] + query: Option, + /// Inline GQ source — alternative to `--query ` and `--alias `.
+ #[arg(short = 'e', long = "query-string", value_name = "GQ", conflicts_with_all = ["query", "alias"])] + query_string: Option, + #[arg(long)] + name: Option, + #[command(flatten)] + params: ParamsArgs, + #[arg(long, conflicts_with = "snapshot")] + branch: Option, + #[arg(long, conflicts_with = "branch")] + snapshot: Option, + #[arg(long, conflicts_with = "json")] + format: Option, + #[arg(long, conflicts_with = "format")] + json: bool, + #[arg()] + alias_args: Vec, + }, + /// Execute a graph mutation query against a branch. + /// + /// Canonical mutation endpoint. The previous name `omnigraph change` + /// is kept as a visible alias and prints a one-line deprecation + /// warning when used. Pairs with `omnigraph query` on the read side. + #[command(visible_alias = "change")] + Mutate { + /// Graph URI + #[arg(long)] + uri: Option, + #[arg(hide = true)] + legacy_uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long, conflicts_with_all = ["query", "query_string"])] + alias: Option, + #[arg(long, conflicts_with_all = ["alias", "query_string"])] + query: Option, + /// Inline GQ source — alternative to `--query ` and `--alias `.
+ #[arg(short = 'e', long = "query-string", value_name = "GQ", conflicts_with_all = ["query", "alias"])] + query_string: Option, + #[arg(long)] + name: Option, + #[command(flatten)] + params: ParamsArgs, + #[arg(long)] + branch: Option, + #[arg(long)] + json: bool, + #[arg()] + alias_args: Vec, + }, + /// Policy administration and diagnostics + Policy { + #[command(subcommand)] + command: PolicyCommand, + }, + /// Compact small Lance fragments in every table of the graph + Optimize { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + json: bool, + }, + /// Remove old Lance versions from every table of the graph (destructive) + Cleanup { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + /// Number of recent versions to keep per table. Either `--keep` or + /// `--older-than` (or both) must be set. + #[arg(long)] + keep: Option, + /// Only remove versions older than this duration. Accepts Go-style + /// durations: `7d`, `24h`, `90m`. At least one of --keep / --older-than. + #[arg(long)] + older_than: Option, + /// Required to actually run; without it, prints what would be removed + #[arg(long)] + confirm: bool, + #[arg(long)] + json: bool, + }, + /// Manage graphs on a multi-graph server (MR-668) + Graphs { + #[command(subcommand)] + command: GraphsCommand, + }, +} + +/// Operations on the graph registry of a multi-graph server (MR-668). +/// +/// All operations target a remote multi-graph server URL (http:// or +/// https://). Local-URI invocations return a clear error. To add or +/// remove graphs, operators edit `omnigraph.yaml` directly and restart +/// the server — runtime mutation is not exposed in v0.6.0. +#[derive(Debug, Subcommand)] +enum GraphsCommand { + /// List every graph registered with the multi-graph server. + List { + /// Remote server URL (e.g. `https://server.example.com`).
+ #[arg(long)] + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + json: bool, + }, +} + +#[derive(Debug, Subcommand)] +enum BranchCommand { + /// Create a new branch + Create { + /// Graph URI + #[arg(long)] + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + from: Option, + name: String, + #[arg(long)] + json: bool, + }, + /// List branches + List { + /// Graph URI + #[arg(long)] + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + json: bool, + }, + /// Delete a branch + Delete { + /// Graph URI + #[arg(long)] + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + name: String, + #[arg(long)] + json: bool, + }, + /// Merge a source branch into a target branch + Merge { + /// Graph URI + #[arg(long)] + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + source: String, + #[arg(long)] + into: Option, + #[arg(long)] + json: bool, + }, +} + +#[derive(Debug, Subcommand)] +enum SchemaCommand { + /// Plan a schema migration against the accepted persisted schema + Plan { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + schema: PathBuf, + #[arg(long)] + json: bool, + /// Show the plan as it would execute with `--allow-data-loss`. + /// Promotes every `DropMode::Soft` step to `DropMode::Hard` + /// so the plan output reflects the destructive intent. + #[arg(long, default_value_t = false)] + allow_data_loss: bool, + }, + /// Apply a supported schema migration + Apply { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + schema: PathBuf, + #[arg(long)] + json: bool, + /// Allow destructive (data-loss) schema changes. 
+ /// + /// Without this flag, drops are "soft": the column or table + /// is removed from the current manifest version but prior + /// versions are retained, so `snapshot_at_version(pre_drop)` + /// can still read the dropped data until `omnigraph cleanup` + /// runs. With this flag, drops are "hard": `cleanup_old_versions` + /// runs on the affected datasets immediately after the apply, + /// making the prior data unreachable. + #[arg(long, default_value_t = false)] + allow_data_loss: bool, + }, + /// Show the current accepted schema source + #[command(alias = "get")] + Show { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + json: bool, + }, +} + +#[derive(Debug, Subcommand)] + +enum CommitCommand { + /// List graph commits + List { + /// Graph URI + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + #[arg(long)] + branch: Option, + #[arg(long)] + json: bool, + }, + /// Show a graph commit + Show { + /// Graph URI + #[arg(long)] + uri: Option, + #[arg(long)] + target: Option, + #[arg(long)] + config: Option, + commit_id: String, + #[arg(long)] + json: bool, + }, +} + +#[derive(Debug, Subcommand)] +enum PolicyCommand { + /// Validate policy YAML and compiled Cedar policy state + Validate { + #[arg(long)] + config: Option, + }, + /// Run declarative policy tests from policy.tests.yaml + Test { + #[arg(long)] + config: Option, + }, + /// Explain one policy decision locally + Explain { + #[arg(long)] + config: Option, + #[arg(long)] + actor: String, + #[arg(long)] + action: PolicyAction, + #[arg(long)] + branch: Option, + #[arg(long = "target-branch")] + target_branch: Option, + }, +} + +#[derive(Debug, Args, Clone)] +struct ParamsArgs { + #[arg(long, conflicts_with = "params_file")] + params: Option, + #[arg(long, conflicts_with = "params")] + params_file: Option, +} + +#[derive(Clone, Copy, Debug, Eq, PartialEq, Serialize, ValueEnum)] +#[serde(rename_all = "snake_case")] 
+enum CliLoadMode { + Overwrite, + Append, + Merge, +} + +impl From for LoadMode { + fn from(value: CliLoadMode) -> Self { + match value { + CliLoadMode::Overwrite => LoadMode::Overwrite, + CliLoadMode::Append => LoadMode::Append, + CliLoadMode::Merge => LoadMode::Merge, + } + } +} + +impl CliLoadMode { + fn as_str(self) -> &'static str { + match self { + CliLoadMode::Overwrite => "overwrite", + CliLoadMode::Append => "append", + CliLoadMode::Merge => "merge", + } + } +} + +#[derive(Debug, Serialize)] +struct LoadOutput<'a> { + uri: &'a str, + branch: &'a str, + mode: &'a str, + nodes_loaded: usize, + edges_loaded: usize, + node_types_loaded: usize, + edge_types_loaded: usize, +} + +#[derive(Debug, Serialize)] +struct SchemaPlanOutput<'a> { + uri: &'a str, + supported: bool, + step_count: usize, + steps: &'a [SchemaMigrationStep], +} + +fn print_schema_apply_human(output: &SchemaApplyOutput) { + println!("schema apply for {}", output.uri); + println!("supported: {}", if output.supported { "yes" } else { "no" }); + println!("applied: {}", if output.applied { "yes" } else { "no" }); + println!("manifest_version: {}", output.manifest_version); + if output.steps.is_empty() { + println!("no schema changes"); + return; + } + for step in &output.steps { + println!("- {}", render_schema_plan_step(step)); + } +} + +fn query_kind_label(kind: QueryLintQueryKind) -> &'static str { + match kind { + QueryLintQueryKind::Read => "read", + QueryLintQueryKind::Mutation => "mutation", + } +} + +fn severity_label(severity: QueryLintSeverity) -> &'static str { + match severity { + QueryLintSeverity::Error => "ERROR", + QueryLintSeverity::Warning => "WARN ", + QueryLintSeverity::Info => "INFO ", + } +} + +fn print_query_lint_human(output: &QueryLintOutput) { + for result in &output.results { + match result.status { + QueryLintStatus::Ok => { + println!( + "OK query `{}` ({})", + result.name, + query_kind_label(result.kind) + ); + } + QueryLintStatus::Error => { + println!( + "ERROR 
query `{}`: {}", + result.name, + result.error.as_deref().unwrap_or("unknown error") + ); + } + } + + for warning in &result.warnings { + println!("WARN query `{}`: {}", result.name, warning); + } + } + + for finding in &output.findings { + println!("{} {}", severity_label(finding.severity), finding.message); + } + + println!( + "INFO Lint complete: {} queries processed ({} error(s), {} warning(s), {} info item(s))", + output.queries_processed, output.errors, output.warnings, output.infos + ); +} + +fn finish_query_lint(output: &QueryLintOutput, json: bool) -> Result<()> { + if json { + print_json(output)?; + } else { + print_query_lint_human(output); + } + + if output.status == QueryLintStatus::Error { + io::stdout().flush()?; + std::process::exit(1); + } + + Ok(()) +} + +fn ensure_local_graph_parent(uri: &str) -> Result<()> { + if !uri.contains("://") { + fs::create_dir_all(uri)?; + } + Ok(()) +} + +fn print_json(value: &T) -> Result<()> { + println!("{}", serde_json::to_string_pretty(value)?); + Ok(()) +} + +fn is_remote_uri(uri: &str) -> bool { + uri.starts_with("http://") || uri.starts_with("https://") +} + +fn remote_url(base: &str, path: &str) -> String { + format!("{}{}", base.trim_end_matches('/'), path) +} + +fn remote_branch_url(base: &str, branch: &str) -> Result { + let mut url = reqwest::Url::parse(&format!("{}/", base.trim_end_matches('/')))?; + url.path_segments_mut() + .map_err(|_| color_eyre::eyre::eyre!("invalid remote base url"))? 
+ .extend(["branches", branch]); + Ok(url.to_string()) +} + +fn normalize_bearer_token(value: Option) -> Option { + value + .map(|value| value.trim().to_string()) + .filter(|value| !value.is_empty()) +} + +fn bearer_token_from_env(var_name: &str) -> Option { + normalize_bearer_token(std::env::var(var_name).ok()) +} + +fn parse_env_assignment(line: &str) -> Option<(String, String)> { + let line = line.trim(); + if line.is_empty() || line.starts_with('#') { + return None; + } + + let line = line.strip_prefix("export ").unwrap_or(line).trim(); + let (name, value) = line.split_once('=')?; + let name = name.trim(); + if name.is_empty() { + return None; + } + + let value = value.trim(); + let value = if value.len() >= 2 + && ((value.starts_with('"') && value.ends_with('"')) + || (value.starts_with('\'') && value.ends_with('\''))) + { + &value[1..value.len() - 1] + } else { + value + }; + + Some((name.to_string(), value.to_string())) +} + +fn bearer_token_from_env_file(path: &Path, var_name: &str) -> Result> { + if !path.exists() { + return Ok(None); + } + + for line in fs::read_to_string(path)?.lines() { + let Some((name, value)) = parse_env_assignment(line) else { + continue; + }; + if name == var_name { + return Ok(normalize_bearer_token(Some(value))); + } + } + + Ok(None) +} + +fn load_env_file_into_process(path: &Path) -> Result<()> { + if !path.exists() { + return Ok(()); + } + + for line in fs::read_to_string(path)?.lines() { + let Some((name, value)) = parse_env_assignment(line) else { + continue; + }; + if std::env::var_os(&name).is_none() { + unsafe { + std::env::set_var(name, value); + } + } + } + + Ok(()) +} + +fn load_cli_config(config_path: Option<&PathBuf>) -> Result { + let config = load_config(config_path)?; + if let Some(path) = config.resolve_auth_env_file() { + load_env_file_into_process(&path)?; + } + Ok(config) +} + +fn resolve_policy_engine(config: &OmnigraphConfig) -> Result { + let policy_file = config + .resolve_policy_file() + .ok_or_else(|| 
color_eyre::eyre::eyre!("policy.file must be set in omnigraph.yaml"))?; + PolicyEngine::load_graph(&policy_file, &policy_graph_id(config)) +} + +/// Open a local-URI graph and, when `policy.file` is configured in +/// `omnigraph.yaml`, install the resolved `PolicyEngine` on the engine +/// handle so every direct-engine write goes through +/// `Omnigraph::enforce(...)` (MR-722). Without a configured policy this +/// is identical to a bare `Omnigraph::open`. +/// +/// Returns owned `Omnigraph`; chained on top of `Omnigraph::open(...)`'s +/// existing future to keep call sites narrow. +async fn open_local_db_with_policy(uri: &str, config: &OmnigraphConfig) -> Result { + let db = Omnigraph::open(uri).await?; + if config.resolve_policy_file().is_some() { + let engine = Arc::new(resolve_policy_engine(config)?); + Ok(db.with_policy(engine as Arc)) + } else { + Ok(db) + } +} + +/// Resolve the CLI's effective actor identity for engine-layer policy +/// (MR-722). Precedence: `--as ` (top-level flag) overrides +/// `cli.actor` from `omnigraph.yaml`; both unset returns `None`. When +/// policy is configured and this returns `None`, the engine-layer +/// footgun guard intentionally denies — silent bypass via "I forgot the +/// actor" is what the guard prevents.
+fn resolve_cli_actor<'a>(cli_as: Option<&'a str>, config: &'a OmnigraphConfig) -> Option<&'a str> { + cli_as.or(config.cli.actor.as_deref()) +} + +fn resolve_policy_tests_path(config: &OmnigraphConfig) -> Result { + config.resolve_policy_tests_file().ok_or_else(|| { + color_eyre::eyre::eyre!( + "policy.tests.yaml requires policy.file to be set in omnigraph.yaml" + ) + }) +} + +fn policy_graph_id(config: &OmnigraphConfig) -> String { + if let Some(name) = &config.project.name { + return name.clone(); + } + config + .resolve_target_uri(None, None, config.server_graph_name()) + .or_else(|_| config.resolve_target_uri(None, None, config.cli_graph_name())) + .unwrap_or_else(|_| "default".to_string()) +} + +fn resolve_remote_bearer_token( + config: &OmnigraphConfig, + explicit_uri: Option<&str>, + explicit_target: Option<&str>, +) -> Result> { + let scoped_env = + config.graph_bearer_token_env(explicit_uri, explicit_target, config.cli_graph_name()); + let mut env_names = Vec::new(); + if let Some(name) = scoped_env { + env_names.push(name.to_string()); + } + if env_names + .iter() + .all(|name| name != DEFAULT_BEARER_TOKEN_ENV) + { + env_names.push(DEFAULT_BEARER_TOKEN_ENV.to_string()); + } + + let env_file = config.resolve_auth_env_file(); + for env_name in env_names { + if let Some(token) = bearer_token_from_env(&env_name) { + return Ok(Some(token)); + } + if let Some(path) = env_file.as_ref() { + if let Some(token) = bearer_token_from_env_file(path, &env_name)? 
{ + return Ok(Some(token)); + } + } + } + + Ok(None) +} + +fn build_http_client() -> Result { + Ok(reqwest::Client::new()) +} + +fn apply_bearer_token( + request: reqwest::RequestBuilder, + token: Option<&str>, +) -> reqwest::RequestBuilder { + if let Some(token) = token { + request.header(AUTHORIZATION, format!("Bearer {}", token)) + } else { + request + } +} + +async fn remote_json( + client: &reqwest::Client, + method: Method, + url: String, + body: Option, + bearer_token: Option<&str>, +) -> Result { + let request = apply_bearer_token(client.request(method, url), bearer_token); + let request = if let Some(body) = body { + request.json(&body) + } else { + request + }; + let response = request.send().await?; + let status = response.status(); + let text = response.text().await?; + if !status.is_success() { + if let Ok(error) = serde_json::from_str::(&text) { + bail!(error.error); + } + bail!("server returned {}: {}", status, text); + } + Ok(serde_json::from_str(&text)?) +} + +fn resolve_uri( + config: &OmnigraphConfig, + cli_uri: Option, + cli_target: Option<&str>, +) -> Result { + config.resolve_target_uri(cli_uri, cli_target, config.cli_graph_name()) +} + +/// Parse a Go-style compact duration: `7d`, `24h`, `30m`, `90s`, or a plain +/// integer as seconds. Used by the `cleanup --older-than` flag. +fn parse_duration_arg(s: &str) -> Result { + let s = s.trim(); + if s.is_empty() { + bail!("duration is empty"); + } + let (num_part, unit) = match s + .char_indices() + .rev() + .find(|(_, c)| c.is_ascii_alphabetic()) + { + Some((i, _)) => ( + &s[..i + 1 - s[i..].chars().next().unwrap().len_utf8()], + &s[i..], + ), + None => (s, ""), + }; + let n: u64 = num_part + .parse() + .map_err(|e| color_eyre::eyre::eyre!("invalid duration '{}': {}", s, e))?; + let secs = match unit { + "" | "s" => n, + "m" => n * 60, + "h" => n * 60 * 60, + "d" => n * 60 * 60 * 24, + "w" => n * 60 * 60 * 24 * 7, + _ => bail!("unknown duration unit '{}'. 
Supported: s, m, h, d, w", unit), + }; + Ok(std::time::Duration::from_secs(secs)) +} + +fn resolve_local_uri( + config: &OmnigraphConfig, + cli_uri: Option, + cli_target: Option<&str>, + operation: &str, +) -> Result { + let uri = resolve_uri(config, cli_uri, cli_target)?; + if is_remote_uri(&uri) { + bail!( + "{} is only supported against local graph URIs in this milestone", + operation + ); + } + Ok(uri) +} + +fn resolve_branch( + config: &OmnigraphConfig, + cli_branch: Option, + alias_branch: Option, + default_branch: &str, +) -> String { + cli_branch + .or(alias_branch) + .or_else(|| config.cli.branch.clone()) + .unwrap_or_else(|| default_branch.to_string()) +} + +fn resolve_read_target( + config: &OmnigraphConfig, + cli_branch: Option, + cli_snapshot: Option, + alias_branch: Option, +) -> Result { + if cli_branch.is_some() && cli_snapshot.is_some() { + bail!("read target may specify branch or snapshot, not both"); + } + Ok(read_target_from_cli( + cli_branch + .or(alias_branch) + .or_else(|| config.cli.branch.clone()), + cli_snapshot, + )) +} + +fn resolve_query_path( + config: &OmnigraphConfig, + explicit_query: Option<&PathBuf>, + alias_query: Option<&str>, +) -> Result { + explicit_query + .map(PathBuf::from) + .or_else(|| alias_query.map(PathBuf::from)) + .ok_or_else(|| { + color_eyre::eyre::eyre!( + "exactly one of --query, --query-string, or --alias must be provided" + ) + }) + .and_then(|query_path| config.resolve_query_path(&query_path)) +} + +fn resolve_query_source( + config: &OmnigraphConfig, + explicit_query: Option<&PathBuf>, + inline_query: Option<&str>, + alias_query: Option<&str>, +) -> Result { + if let Some(inline) = inline_query { + if inline.trim().is_empty() { + bail!("--query-string must not be empty"); + } + return Ok(inline.to_string()); + } + Ok(fs::read_to_string(resolve_query_path( + config, + explicit_query, + alias_query, + )?)?) 
+} + +fn parse_alias_value(value: &str) -> Value { + serde_json::from_str(value).unwrap_or_else(|_| Value::String(value.to_string())) +} + +fn merged_params_json( + alias_name: Option<&str>, + alias_arg_names: &[String], + alias_arg_values: &[String], + explicit: Option, +) -> Result> { + if alias_arg_values.len() > alias_arg_names.len() { + let alias = alias_name.unwrap_or(""); + bail!( + "alias '{}' expects at most {} args but got {}", + alias, + alias_arg_names.len(), + alias_arg_values.len() + ); + } + + let mut merged = serde_json::Map::new(); + for (arg_name, arg_value) in alias_arg_names.iter().zip(alias_arg_values.iter()) { + merged.insert(arg_name.clone(), parse_alias_value(arg_value)); + } + + match explicit { + Some(Value::Object(object)) => { + for (key, value) in object { + merged.insert(key, value); + } + } + Some(_) => bail!("params JSON must be an object"), + None => {} + } + + if merged.is_empty() { + Ok(None) + } else { + Ok(Some(Value::Object(merged))) + } +} + +fn print_load_human( + uri: &str, + branch: &str, + mode: CliLoadMode, + nodes_loaded: usize, + edges_loaded: usize, + node_types_loaded: usize, + edge_types_loaded: usize, +) { + println!( + "loaded {} on branch {} with {}: {} nodes across {} node types, {} edges across {} edge types", + uri, + branch, + mode.as_str(), + nodes_loaded, + node_types_loaded, + edges_loaded, + edge_types_loaded + ); +} + +fn print_ingest_human(output: &IngestOutput) { + println!( + "ingested {} into branch {} from {} with {} ({})", + output.uri, + output.branch, + output.base_branch, + output.mode.as_str(), + if output.branch_created { + "branch created" + } else { + "branch exists" + } + ); + for table in &output.tables { + println!("{} rows_loaded={}", table.table_key, table.rows_loaded); + } + if let Some(actor_id) = &output.actor_id { + println!("actor_id: {}", actor_id); + } +} + +fn print_schema_plan_human(uri: &str, plan: &SchemaMigrationPlan) { + println!("schema plan for {}", uri); + 
println!("supported: {}", if plan.supported { "yes" } else { "no" }); + if plan.steps.is_empty() { + println!("no schema changes"); + return; + } + for step in &plan.steps { + println!("- {}", render_schema_plan_step(step)); + } +} + +fn render_schema_plan_step(step: &SchemaMigrationStep) -> String { + match step { + SchemaMigrationStep::AddType { type_kind, name } => { + format!("add {} type '{}'", schema_type_kind_label(*type_kind), name) + } + SchemaMigrationStep::RenameType { + type_kind, + from, + to, + } => format!( + "rename {} type '{}' -> '{}'", + schema_type_kind_label(*type_kind), + from, + to + ), + SchemaMigrationStep::AddProperty { + type_kind, + type_name, + property_name, + property_type, + } => format!( + "add property '{}.{}' ({}) on {} '{}'", + type_name, + property_name, + render_prop_type(property_type), + schema_type_kind_label(*type_kind), + type_name + ), + SchemaMigrationStep::RenameProperty { + type_kind, + type_name, + from, + to, + } => format!( + "rename property '{}.{}' -> '{}.{}' on {} '{}'", + type_name, + from, + type_name, + to, + schema_type_kind_label(*type_kind), + type_name + ), + SchemaMigrationStep::AddConstraint { + type_kind, + type_name, + constraint, + } => format!( + "add constraint {} on {} '{}'", + render_constraint(constraint), + schema_type_kind_label(*type_kind), + type_name + ), + SchemaMigrationStep::UpdateTypeMetadata { + type_kind, + name, + annotations, + } => format!( + "update metadata on {} '{}' ({})", + schema_type_kind_label(*type_kind), + name, + render_annotations(annotations) + ), + SchemaMigrationStep::UpdatePropertyMetadata { + type_kind, + type_name, + property_name, + annotations, + } => format!( + "update metadata on property '{}.{}' of {} '{}' ({})", + type_name, + property_name, + schema_type_kind_label(*type_kind), + type_name, + render_annotations(annotations) + ), + SchemaMigrationStep::DropType { + type_kind, + name, + mode, + } => format!( + "drop {} type '{}' ({} mode)", + 
schema_type_kind_label(*type_kind), + name, + drop_mode_label(*mode), + ), + SchemaMigrationStep::DropProperty { + type_kind, + type_name, + property_name, + mode, + } => format!( + "drop property '{}.{}' of {} '{}' ({} mode)", + type_name, + property_name, + schema_type_kind_label(*type_kind), + type_name, + drop_mode_label(*mode), + ), + SchemaMigrationStep::UnsupportedChange { entity, reason, .. } => { + // When a schema-lint code is attached, render code + tier + // so operators see at-a-glance the kind of risk (destructive + // / validated / safe) -- not just the rule identifier. + // Reach the diagnostic via the `diagnostic()` helper so the + // CLI doesn't need to know how the lookup works. + match step.diagnostic() { + Some(diag) => format!( + "unsupported change on {} [{}, {}]: {}", + entity, + diag.code, + schema_lint_tier_label(diag.tier), + reason, + ), + None => format!("unsupported change on {}: {}", entity, reason), + } + } + } +} + +fn schema_type_kind_label(kind: omnigraph_compiler::SchemaTypeKind) -> &'static str { + match kind { + omnigraph_compiler::SchemaTypeKind::Interface => "interface", + omnigraph_compiler::SchemaTypeKind::Node => "node", + omnigraph_compiler::SchemaTypeKind::Edge => "edge", + } +} + +fn schema_lint_tier_label(tier: omnigraph_compiler::SafetyTier) -> &'static str { + match tier { + omnigraph_compiler::SafetyTier::Safe => "safe", + omnigraph_compiler::SafetyTier::Validated => "validated", + omnigraph_compiler::SafetyTier::Destructive => "destructive", + } +} + +fn drop_mode_label(mode: omnigraph_compiler::DropMode) -> &'static str { + match mode { + omnigraph_compiler::DropMode::Soft => "soft", + omnigraph_compiler::DropMode::Hard => "hard", + } +} + +fn render_prop_type(prop_type: &omnigraph_compiler::PropType) -> String { + let base = if let Some(values) = &prop_type.enum_values { + format!("Enum({})", values.join("|")) + } else { + prop_type.scalar.to_string() + }; + let base = if prop_type.list { + format!("[{}]", base) 
+ } else { + base + }; + if prop_type.nullable { + format!("{}?", base) + } else { + base + } +} + +fn render_constraint(constraint: &omnigraph_compiler::schema::ast::Constraint) -> String { + match constraint { + omnigraph_compiler::schema::ast::Constraint::Key(columns) => { + format!("@key({})", columns.join(", ")) + } + omnigraph_compiler::schema::ast::Constraint::Unique(columns) => { + format!("@unique({})", columns.join(", ")) + } + omnigraph_compiler::schema::ast::Constraint::Index(columns) => { + format!("@index({})", columns.join(", ")) + } + omnigraph_compiler::schema::ast::Constraint::Range { property, min, max } => { + format!("@range({}, {:?}, {:?})", property, min, max) + } + omnigraph_compiler::schema::ast::Constraint::Check { property, pattern } => { + format!("@check({}, {:?})", property, pattern) + } + } +} + +fn render_annotations(annotations: &[omnigraph_compiler::schema::ast::Annotation]) -> String { + annotations + .iter() + .map(|annotation| match &annotation.value { + Some(value) => format!("@{}({})", annotation.name, value), + None => format!("@{}", annotation.name), + }) + .collect::>() + .join(", ") +} + +fn print_embed_human(output: &EmbedOutput) { + println!( + "embedded {} rows (selected {}, cleaned {}) from {} -> {} [{} {}d]", + output.embedded_rows, + output.selected_rows, + output.cleaned_rows, + output.input, + output.output, + output.mode, + output.dimension + ); +} + +fn print_snapshot_human(branch: &str, manifest_version: u64, entries: &[SnapshotTableOutput]) { + println!("branch: {}", branch); + println!("manifest_version: {}", manifest_version); + for entry in entries { + println!( + "{} v{} branch={} rows={}", + entry.table_key, + entry.table_version, + entry.table_branch.as_deref().unwrap_or("main"), + entry.row_count + ); + } +} + +fn print_read_output( + output: &ReadOutput, + format: ReadOutputFormat, + config: &OmnigraphConfig, +) -> Result<()> { + println!( + "{}", + render_read( + output, + format, + &ReadRenderOptions 
{ + max_column_width: config.table_max_column_width(), + cell_layout: config.table_cell_layout(), + }, + )? + ); + Ok(()) +} + +fn print_change_human(output: &ChangeOutput) { + println!( + "changed {} via {}: {} nodes, {} edges", + output.branch, output.query_name, output.affected_nodes, output.affected_edges + ); + if let Some(actor_id) = &output.actor_id { + println!("actor_id: {}", actor_id); + } +} + +fn print_commit_list_human(commits: &[CommitOutput]) { + for commit in commits { + let branch = commit.manifest_branch.as_deref().unwrap_or("main"); + println!( + "{} branch={} version={}{}", + commit.graph_commit_id, + branch, + commit.manifest_version, + commit + .actor_id + .as_deref() + .map(|actor| format!(" actor={}", actor)) + .unwrap_or_default() + ); + } +} + +fn print_commit_human(commit: &CommitOutput) { + println!("graph_commit_id: {}", commit.graph_commit_id); + println!( + "manifest_branch: {}", + commit.manifest_branch.as_deref().unwrap_or("main") + ); + println!("manifest_version: {}", commit.manifest_version); + if let Some(parent_commit_id) = &commit.parent_commit_id { + println!("parent_commit_id: {}", parent_commit_id); + } + if let Some(merged_parent_commit_id) = &commit.merged_parent_commit_id { + println!("merged_parent_commit_id: {}", merged_parent_commit_id); + } + if let Some(actor_id) = &commit.actor_id { + println!("actor_id: {}", actor_id); + } + println!("created_at: {}", commit.created_at); +} + +fn print_policy_explain(decision: &PolicyDecision, actor_id: &str, request: &PolicyRequest) { + println!( + "decision: {}", + if decision.allowed { "allow" } else { "deny" } + ); + println!("actor: {}", actor_id); + println!("action: {}", request.action); + if let Some(branch) = &request.branch { + println!("branch: {}", branch); + } + if let Some(target_branch) = &request.target_branch { + println!("target_branch: {}", target_branch); + } + if let Some(rule_id) = &decision.matched_rule_id { + println!("matched_rule: {}", rule_id); + } + 
println!("message: {}", decision.message); +} + +fn resolve_read_format( + config: &OmnigraphConfig, + cli_format: Option, + json: bool, + alias_format: Option, +) -> ReadOutputFormat { + if json { + ReadOutputFormat::Json + } else { + cli_format + .or(alias_format) + .unwrap_or_else(|| config.cli_output_format()) + } +} + +fn resolve_alias<'a>( + config: &'a OmnigraphConfig, + alias_name: Option<&'a str>, + expected: AliasCommand, +) -> Result> { + let Some(alias_name) = alias_name else { + return Ok(None); + }; + let alias = config.alias(alias_name)?; + if alias.command != expected { + bail!( + "alias '{}' is a {:?} alias, not a {:?} alias", + alias_name, + alias.command, + expected + ); + } + Ok(Some((alias_name, alias))) +} + +fn normalize_legacy_alias_uri( + uri: Option, + target_available: bool, + alias_name: Option<&str>, + mut alias_args: Vec, +) -> (Option, Vec) { + let Some(candidate) = uri else { + return (None, alias_args); + }; + + if alias_name.is_some() && target_available { + alias_args.insert(0, candidate); + return (None, alias_args); + } + + (Some(candidate), alias_args) +} + +fn scaffold_config_if_missing(uri: &str) -> Result<()> { + let path = inferred_config_path(uri)?; + if path.exists() { + return Ok(()); + } + + fs::write( + path, + format!( + "\ +project: + name: Omnigraph Project + +graphs: + local: + uri: {} + # bearer_token_env: OMNIGRAPH_BEARER_TOKEN + +server: + graph: local + bind: 127.0.0.1:8080 + +cli: + graph: local + branch: main + output_format: table + table_max_column_width: 80 + table_cell_layout: truncate + +query: + roots: + - queries + - . 
+ +aliases: + # owner: + # command: read + # query: context.gq + # name: decision_owner + # args: [slug] + # graph: local + # branch: main + # format: kv + # + # attach_trace: + # command: change + # query: mutations.gq + # name: attach_trace + # args: [decision_slug, trace_slug] + # graph: local + # branch: main + +# auth: +# env_file: ./.env.omni +# +# policy: +# file: ./policy.yaml +", + yaml_string(uri), + ), + )?; + Ok(()) +} + +fn yaml_string(value: &str) -> String { + format!("'{}'", value.replace('\'', "''")) +} + +fn inferred_config_path(uri: &str) -> Result { + if uri.contains("://") { + return Ok(omnigraph_server::config::default_config_path()); + } + + let path = Path::new(uri); + let base = if path.is_absolute() { + path.parent() + .map(Path::to_path_buf) + .unwrap_or(std::env::current_dir()?) + } else { + std::env::current_dir()?.join(path.parent().unwrap_or_else(|| Path::new("."))) + }; + Ok(base.join(omnigraph_server::config::DEFAULT_CONFIG_FILE)) +} + +fn read_target_from_cli(branch: Option, snapshot: Option) -> ReadTarget { + if let Some(snapshot) = snapshot { + ReadTarget::snapshot(SnapshotId::new(snapshot)) + } else { + ReadTarget::branch(branch.unwrap_or_else(|| "main".to_string())) + } +} + +fn load_params_json(params: &ParamsArgs) -> Result> { + match (&params.params, &params.params_file) { + (Some(inline), None) => Ok(Some(serde_json::from_str(inline)?)), + (None, Some(path)) => Ok(Some(serde_json::from_str(&fs::read_to_string(path)?)?)), + (None, None) => Ok(None), + (Some(_), Some(_)) => bail!("only one of --params or --params-file may be provided"), + } +} + +fn select_named_query( + query_source: &str, + requested_name: Option<&str>, +) -> Result<(String, Vec)> { + let parsed = parse_query(query_source)?; + let query = if let Some(name) = requested_name { + parsed + .queries + .into_iter() + .find(|query| query.name == name) + .ok_or_else(|| color_eyre::eyre::eyre!("query '{}' not found", name))? 
+ } else if parsed.queries.len() == 1 { + parsed.queries.into_iter().next().unwrap() + } else { + bail!("query file contains multiple queries; pass --name"); + }; + + Ok((query.name, query.params)) +} + +fn query_params_from_json( + query_params: &[omnigraph_compiler::query::ast::Param], + params_json: Option<&Value>, +) -> Result { + json_params_to_param_map(params_json, query_params, JsonParamMode::Standard) + .map_err(|err| color_eyre::eyre::eyre!(err.to_string())) +} + +async fn execute_query_lint( + config: &OmnigraphConfig, + cli_uri: Option, + cli_target: Option<&str>, + schema_path: Option<&PathBuf>, + query_path: &PathBuf, +) -> Result { + let resolved_query_path = resolve_query_path(config, Some(query_path), None)?; + let query_source = fs::read_to_string(&resolved_query_path)?; + let query_path = resolved_query_path.to_string_lossy().into_owned(); + + if let Some(schema_path) = schema_path { + let schema_source = fs::read_to_string(schema_path)?; + let schema = + parse_schema(&schema_source).map_err(|err| color_eyre::eyre::eyre!(err.to_string()))?; + let catalog = + build_catalog(&schema).map_err(|err| color_eyre::eyre::eyre!(err.to_string()))?; + return Ok(lint_query_file( + &catalog, + &query_source, + query_path, + QueryLintSchemaSource::file(schema_path.to_string_lossy().into_owned()), + )); + } + + let has_graph_target = + cli_uri.is_some() || cli_target.is_some() || config.cli_graph_name().is_some(); + if !has_graph_target { + bail!("query lint requires --schema or a resolvable graph target"); + } + + let uri = resolve_local_uri(config, cli_uri, cli_target, "query lint")?; + let db = Omnigraph::open(&uri).await?; + Ok(lint_query_file( + &db.catalog(), + &query_source, + query_path, + QueryLintSchemaSource::graph(uri), + )) +} + +async fn execute_read( + uri: &str, + query_source: &str, + query_name: Option<&str>, + target: ReadTarget, + params_json: Option<&Value>, +) -> Result { + let (selected_name, query_params) = 
select_named_query(query_source, query_name)?; + let params = query_params_from_json(&query_params, params_json)?; + let db = Omnigraph::open(uri).await?; + let result = db + .query(target.clone(), query_source, &selected_name, &params) + .await?; + Ok(read_output(selected_name, &target, result)) +} + +async fn execute_read_remote( + client: &reqwest::Client, + uri: &str, + query_source: &str, + query_name: Option<&str>, + target: ReadTarget, + params_json: Option<&Value>, + bearer_token: Option<&str>, +) -> Result { + let (branch, snapshot) = match &target { + ReadTarget::Branch(branch) => (Some(branch.clone()), None), + ReadTarget::Snapshot(snapshot) => (None, Some(snapshot.as_str().to_string())), + }; + remote_json( + client, + Method::POST, + remote_url(uri, "/read"), + Some(serde_json::to_value(ReadRequest { + query_source: query_source.to_string(), + query_name: query_name.map(ToOwned::to_owned), + params: params_json.cloned(), + branch, + snapshot, + })?), + bearer_token, + ) + .await +} + +async fn execute_change( + uri: &str, + query_source: &str, + query_name: Option<&str>, + branch: &str, + params_json: Option<&Value>, + config: &OmnigraphConfig, + cli_as_actor: Option<&str>, +) -> Result { + let (selected_name, query_params) = select_named_query(query_source, query_name)?; + let params = query_params_from_json(&query_params, params_json)?; + let db = open_local_db_with_policy(uri, config).await?; + let actor = resolve_cli_actor(cli_as_actor, config); + let result = db + .mutate_as(branch, query_source, &selected_name, &params, actor) + .await?; + Ok(ChangeOutput { + branch: branch.to_string(), + query_name: selected_name, + affected_nodes: result.affected_nodes, + affected_edges: result.affected_edges, + actor_id: actor.map(String::from), + }) +} + +/// Build the JSON body for `POST /change` using the legacy wire shape. 
+/// +/// `ChangeRequest`'s Rust field names are now `query` / `name` (the canonical +/// wire shape going forward), but old `omnigraph-server` builds still require +/// the legacy `query_source` / `query_name` keys on `/change`. Hand-rolling +/// the JSON with the legacy names keeps a newer CLI talking to an older +/// server intact -- the same byte-stability contract we apply to +/// `execute_read_remote` against `/read`. +fn legacy_change_request_body( + query_source: &str, + query_name: Option<&str>, + branch: &str, + params_json: Option<&Value>, +) -> Value { + let mut body = serde_json::json!({ + "query_source": query_source, + "branch": branch, + }); + if let Some(name) = query_name { + body["query_name"] = Value::String(name.to_string()); + } + if let Some(params) = params_json { + body["params"] = params.clone(); + } + body +} + +async fn execute_change_remote( + client: &reqwest::Client, + uri: &str, + query_source: &str, + query_name: Option<&str>, + branch: &str, + params_json: Option<&Value>, + bearer_token: Option<&str>, +) -> Result { + remote_json( + client, + Method::POST, + remote_url(uri, "/change"), + Some(legacy_change_request_body( + query_source, + query_name, + branch, + params_json, + )), + bearer_token, + ) + .await +} + +async fn execute_export_to_writer( + uri: &str, + branch: &str, + type_names: &[String], + table_keys: &[String], + writer: &mut W, +) -> Result<()> { + let db = Omnigraph::open(uri).await?; + db.export_jsonl_to_writer(branch, type_names, table_keys, writer) + .await?; + writer.flush()?; + Ok(()) +} + +async fn execute_export_remote_to_writer( + client: &reqwest::Client, + uri: &str, + branch: &str, + type_names: &[String], + table_keys: &[String], + bearer_token: Option<&str>, + writer: &mut W, +) -> Result<()> { + let request = apply_bearer_token( + client.request(Method::POST, remote_url(uri, "/export")), + bearer_token, + ) + .json(&ExportRequest { + branch: Some(branch.to_string()), + type_names: type_names.to_vec(), 
+ table_keys: table_keys.to_vec(), + }); + let mut response = request.send().await?; + let status = response.status(); + if !status.is_success() { + let text = response.text().await?; + if let Ok(error) = serde_json::from_str::(&text) { + bail!(error.error); + } + bail!("server returned {}: {}", status, text); + } + + while let Some(chunk) = response.chunk().await? { + writer.write_all(&chunk)?; + } + writer.flush()?; + Ok(()) +} + +/// Rewrite deprecated CLI invocations into their canonical form. +/// +/// The current rename pass moves four subcommands: +/// - `omnigraph read` -> `omnigraph query` (clap `visible_alias` handles parsing; we warn) +/// - `omnigraph change` -> `omnigraph mutate` (clap `visible_alias` handles parsing; we warn) +/// - `omnigraph check` -> `omnigraph lint` (rewrite required; no visible_alias by design) +/// - `omnigraph query lint` -> `omnigraph lint` (rewrite required; `query` is now the read-runner) +/// - `omnigraph query check` -> `omnigraph lint` (rewrite required) +/// +/// `check` is *not* a clap visible_alias on `lint` even though they're +/// semantically equivalent. Visible aliases create two canonical names +/// that agents emit interchangeably depending on training-data drift +/// (see MR-981 §6 for the policy). The argv-shim + stderr warning +/// pattern preserves back-compat for human users while pointing every +/// caller at the single canonical name in `--help`. +/// +/// Returns the (possibly rewritten) argv that clap should parse. +fn rewrite_deprecated_argv(args: Vec) -> Vec { + if args.len() >= 3 { + let sub = args[1].to_str(); + let sub2 = args[2].to_str(); + if sub == Some("query") && matches!(sub2, Some("lint") | Some("check")) { + let suffix = sub2.unwrap(); + eprintln!( + "warning: `omnigraph query {suffix}` is deprecated; use `omnigraph lint` instead" + ); + // Drop the leading `query` token AND normalize `check` -> `lint`. 
+ // `check` is no longer a clap visible_alias (MR-981 §6), so the + // rewritten argv must reach the canonical `lint` subcommand + // directly. Result for `omnigraph query check --query foo.gq`: + // `omnigraph lint --query foo.gq`. + let mut out = Vec::with_capacity(args.len() - 1); + out.push(args[0].clone()); + out.push(OsString::from("lint")); + out.extend(args[3..].iter().cloned()); + return out; + } + } + if let Some(sub) = args.get(1).and_then(|s| s.to_str()) { + match sub { + "read" => eprintln!( + "warning: `omnigraph read` is deprecated; use `omnigraph query` instead" + ), + "change" => eprintln!( + "warning: `omnigraph change` is deprecated; use `omnigraph mutate` instead" + ), + "check" => { + eprintln!( + "warning: `omnigraph check` is deprecated; use `omnigraph lint` instead" + ); + // Rewrite the top-level subcommand to `lint`; pass through the rest. + let mut out = Vec::with_capacity(args.len()); + out.push(args[0].clone()); + out.push(OsString::from("lint")); + out.extend(args[2..].iter().cloned()); + return out; + } + _ => {} + } + } + args +} #[tokio::main] async fn main() -> Result<()> { @@ -66,152 +1859,7 @@ async fn main() -> Result<()> { Cli::from_arg_matches(&matches)? }; let http_client = build_http_client()?; - // RFC-010 Slice 1: reject data-plane addressing flags (--server/--graph) on - // a verb that doesn't live on the data plane, from one declared table -- - // before any per-command dispatch. 
- planes::guard_addressing(&cli)?; match cli.command { - Command::Login { name, token, json } => { - let token = match token { - Some(token) => token, - None => { - let mut line = String::new(); - std::io::stdin().read_line(&mut line)?; - line - } - }; - let Some(token) = normalize_bearer_token(Some(token)) else { - color_eyre::eyre::bail!( - "no token provided: pass --token or pipe it on stdin (echo $TOKEN | omnigraph login {name})" - ); - }; - let operator_config = crate::operator::load_operator_config()?; - let declared = operator_config.servers.contains_key(&name); - let path = crate::operator::write_credential(&name, &token)?; - finish_login(&name, &path, declared, json)?; - } - Command::Logout { name, json } => { - let path = crate::operator::remove_credential(&name)?; - finish_logout(&name, &path, json)?; - } - Command::Profile { command } => { - use crate::operator::ScopeBinding; - let op = crate::operator::load_operator_config()?; - let active = std::env::var(scope::PROFILE_ENV) - .ok() - .filter(|s| !s.is_empty()); - match command { - ProfileCommand::List { json } => { - let items: Vec = op - .profiles - .iter() - .map(|(name, profile)| { - let (binding, scope_kind, target, valid, error) = - match profile.binding(name) { - Ok(ScopeBinding::Server(s)) => ( - format!("server: {s}"), - "server".to_string(), - Some(s), - true, - None, - ), - Ok(ScopeBinding::Cluster(c)) => ( - format!("cluster: {c}"), - "cluster".to_string(), - Some(c), - true, - None, - ), - Ok(ScopeBinding::Store(u)) => ( - format!("store: {u}"), - "store".to_string(), - Some(u), - true, - None, - ), - Err(e) => ( - format!("invalid: {e}"), - "invalid".to_string(), - None, - false, - Some(e.to_string()), - ), - }; - ProfileListItem { - name: name.clone(), - binding, - scope_kind, - target, - valid, - error, - default_graph: profile.default_graph.clone(), - active: active.as_deref() == Some(name.as_str()), - } - }) - .collect(); - print_profile_list(&items, json)?; - } - ProfileCommand::Show 
{ name, json } => { - let detail = match name.or(active) { - Some(name) => { - let profile = op.profile(&name).ok_or_else(|| { - color_eyre::eyre::eyre!( - "unknown profile '{name}' (not defined under `profiles:`)" - ) - })?; - let (kind, target, endpoint) = match profile.binding(&name)? { - ScopeBinding::Server(s) => { - let endpoint = op.servers.get(&s).map(|sv| sv.url.clone()); - ("server", Some(s), endpoint) - } - ScopeBinding::Cluster(c) => { - let endpoint = op.cluster_root(&c).map(str::to_string); - ("cluster", Some(c), endpoint) - } - ScopeBinding::Store(u) => ("store", Some(u.clone()), Some(u)), - }; - ProfileDetail { - name, - scope_kind: kind.to_string(), - target, - endpoint, - default_graph: profile - .default_graph - .clone() - .or_else(|| op.default_graph().map(str::to_string)), - output_format: op - .output() - .and_then(|f| f.to_possible_value()) - .map(|v| v.get_name().to_string()), - } - } - // No name and no active profile: the flat operator defaults. - None => { - let (kind, target, endpoint) = if let Some(s) = op.default_server() { - let endpoint = op.servers.get(s).map(|sv| sv.url.clone()); - ("server", Some(s.to_string()), endpoint) - } else if let Some(u) = op.default_store() { - ("store", Some(u.to_string()), Some(u.to_string())) - } else { - ("none", None, None) - }; - ProfileDetail { - name: "(defaults)".to_string(), - scope_kind: kind.to_string(), - target, - endpoint, - default_graph: op.default_graph().map(str::to_string), - output_format: op - .output() - .and_then(|f| f.to_possible_value()) - .map(|v| v.get_name().to_string()), - } - } - }; - print_profile_detail(&detail, json)?; - } - } - } Command::Version => { println!("omnigraph {}", env!("CARGO_PKG_VERSION")); } @@ -224,16 +1872,6 @@ async fn main() -> Result<()> { } } Command::Init { schema, uri, force } => { - // RFC-010 Slice 3: graphs inside an established cluster are created - // by `cluster apply` (which records ledger/recovery/approvals), not - // by hand-running `init` 
into the cluster's storage layout. - if let Some(root) = omnigraph_cluster::cluster_root_for_graph_uri(&uri).await { - bail!( - "`{uri}` is inside cluster `{root}`. Graphs in a cluster are created by \ - `cluster apply` (which records ledger, recovery, and approvals), not `init`. \ - Declare the graph in cluster.yaml and run `cluster apply`." - ); - } let schema_source = fs::read_to_string(&schema)?; ensure_local_graph_parent(&uri)?; Omnigraph::init_with_options( @@ -242,67 +1880,94 @@ async fn main() -> Result<()> { omnigraph::db::InitOptions { force }, ) .await?; + scaffold_config_if_missing(&uri)?; println!("initialized {}", uri); } Command::Load { uri, + target, + config, data, branch, - from, mode, json, } => { - let client = client::GraphClient::resolve_with_policy( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.as_actor.as_deref(), - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let branch = resolve_branch(branch, None, "main"); - if matches!(mode, CliLoadMode::Overwrite) { - confirm_destructive("load --mode overwrite", client.uri(), cli.yes, json)?; - } - echo_write_target(cli.quiet, "load", client.uri(), client.is_remote()); - let payload = client - .load(&branch, from.as_deref(), &data.to_string_lossy(), mode) + let config = load_cli_config(config.as_ref())?; + let uri = resolve_local_uri(&config, uri, target.as_deref(), "load")?; + let branch = resolve_branch(&config, branch, None, "main"); + let db = open_local_db_with_policy(&uri, &config).await?; + let actor = resolve_cli_actor(cli.as_actor.as_deref(), &config); + let result = db + .load_file_as(&branch, &data.to_string_lossy(), mode.into(), actor) .await?; + let payload = LoadOutput { + uri: &uri, + branch: &branch, + mode: mode.as_str(), + nodes_loaded: result.nodes_loaded.values().sum(), + edges_loaded: result.edges_loaded.values().sum(), + node_types_loaded: result.nodes_loaded.len(), + edge_types_loaded: result.edges_loaded.len(), + }; if json { 
print_json(&payload)?; } else { - print_load_human(&payload); + print_load_human( + &uri, + &branch, + mode, + payload.nodes_loaded, + payload.edges_loaded, + payload.node_types_loaded, + payload.edge_types_loaded, + ); } } Command::Ingest { uri, + target, + config, data, branch, from, mode, json, } => { - // stderr so `--json` consumers reading stdout are unaffected. - eprintln!( - "warning: `omnigraph ingest` is deprecated and will be removed in a future release; \ - use `omnigraph load --from --mode ` (ingest defaults: --from main --mode merge)" - ); - let client = client::GraphClient::resolve_with_policy( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.as_actor.as_deref(), - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let branch = resolve_branch(branch, None, "main"); - let from = resolve_branch(from, None, "main"); - echo_write_target(cli.quiet, "ingest", client.uri(), client.is_remote()); - let payload = client - .ingest(&branch, &from, &data.to_string_lossy(), mode) - .await?; + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + let branch = resolve_branch(&config, branch, None, "main"); + let from = resolve_branch(&config, from, None, "main"); + let payload = if is_remote_uri(&uri) { + let data = fs::read_to_string(&data)?; + remote_json::( + &http_client, + Method::POST, + remote_url(&uri, "/ingest"), + Some(serde_json::to_value(IngestRequest { + branch: Some(branch.clone()), + from: Some(from.clone()), + mode: Some(mode.into()), + data, + })?), + bearer_token.as_deref(), + ) + .await? 
+ } else { + let db = open_local_db_with_policy(&uri, &config).await?; + let actor = resolve_cli_actor(cli.as_actor.as_deref(), &config); + let result = db + .ingest_file_as( + &branch, + Some(&from), + &data.to_string_lossy(), + mode.into(), + actor, + ) + .await?; + ingest_output(&uri, &result, None) + }; if json { print_json(&payload)?; } else { @@ -312,22 +1977,41 @@ async fn main() -> Result<()> { Command::Branch { command } => match command { BranchCommand::Create { uri, + target, + config, from, name, json, } => { - let client = client::GraphClient::resolve_with_policy( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.as_actor.as_deref(), - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let from = resolve_branch(from, None, "main"); - echo_write_target(cli.quiet, "branch create", client.uri(), client.is_remote()); - let payload = client.branch_create_from(&from, &name).await?; + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + let from = resolve_branch(&config, from, None, "main"); + let payload = if is_remote_uri(&uri) { + remote_json::( + &http_client, + Method::POST, + remote_url(&uri, "/branches"), + Some(serde_json::to_value(BranchCreateRequest { + from: Some(from.clone()), + name: name.clone(), + })?), + bearer_token.as_deref(), + ) + .await? 
+ } else { + let db = open_local_db_with_policy(&uri, &config).await?; + let actor = resolve_cli_actor(cli.as_actor.as_deref(), &config); + db.branch_create_from_as(ReadTarget::branch(&from), &name, actor) + .await?; + BranchCreateOutput { + uri: uri.clone(), + from: from.clone(), + name: name.clone(), + actor_id: actor.map(String::from), + } + }; if json { print_json(&payload)?; } else { @@ -336,17 +2020,29 @@ async fn main() -> Result<()> { } BranchCommand::List { uri, + target, + config, json, } => { - let client = client::GraphClient::resolve( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let payload = client.branch_list().await?; + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + let payload = if is_remote_uri(&uri) { + remote_json::( + &http_client, + Method::GET, + remote_url(&uri, "/branches"), + None, + bearer_token.as_deref(), + ) + .await? 
+ } else { + let db = Omnigraph::open(&uri).await?; + let mut branches = db.branch_list().await?; + branches.sort(); + BranchListOutput { branches } + }; if json { print_json(&payload)?; } else { @@ -357,21 +2053,34 @@ async fn main() -> Result<()> { } BranchCommand::Delete { uri, + target, + config, name, json, } => { - let client = client::GraphClient::resolve_with_policy( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.as_actor.as_deref(), - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - confirm_destructive("branch delete", client.uri(), cli.yes, json)?; - echo_write_target(cli.quiet, "branch delete", client.uri(), client.is_remote()); - let payload = client.branch_delete(&name).await?; + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + let payload = if is_remote_uri(&uri) { + remote_json::( + &http_client, + Method::DELETE, + remote_branch_url(&uri, &name)?, + None, + bearer_token.as_deref(), + ) + .await? 
+ } else { + let db = open_local_db_with_policy(&uri, &config).await?; + let actor = resolve_cli_actor(cli.as_actor.as_deref(), &config); + db.branch_delete_as(&name, actor).await?; + BranchDeleteOutput { + uri: uri.clone(), + name: name.clone(), + actor_id: actor.map(String::from), + } + }; if json { print_json(&payload)?; } else { @@ -380,22 +2089,40 @@ async fn main() -> Result<()> { } BranchCommand::Merge { uri, + target, + config, source, into, json, } => { - let client = client::GraphClient::resolve_with_policy( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.as_actor.as_deref(), - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let into = resolve_branch(into, None, "main"); - echo_write_target(cli.quiet, "branch merge", client.uri(), client.is_remote()); - let payload = client.branch_merge(&source, &into).await?; + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + let into = resolve_branch(&config, into, None, "main"); + let payload = if is_remote_uri(&uri) { + remote_json::( + &http_client, + Method::POST, + remote_url(&uri, "/branches/merge"), + Some(serde_json::to_value(BranchMergeRequest { + source: source.clone(), + target: Some(into.clone()), + })?), + bearer_token.as_deref(), + ) + .await? 
+ } else { + let db = open_local_db_with_policy(&uri, &config).await?; + let actor = resolve_cli_actor(cli.as_actor.as_deref(), &config); + let outcome = db.branch_merge_as(&source, &into, actor).await?; + BranchMergeOutput { + source: source.clone(), + target: into.clone(), + outcome: outcome.into(), + actor_id: actor.map(String::from), + } + }; if json { print_json(&payload)?; } else { @@ -411,38 +2138,67 @@ async fn main() -> Result<()> { Command::Commit { command } => match command { CommitCommand::List { uri, + target, + config, branch, json, } => { - let client = client::GraphClient::resolve( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let payload = client.list_commits(branch.as_deref()).await?; - if json { - print_json(&payload)?; + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + let commits = if is_remote_uri(&uri) { + remote_json::( + &http_client, + Method::GET, + if let Some(branch) = branch.as_deref() { + format!("{}?branch={}", remote_url(&uri, "/commits"), branch) + } else { + remote_url(&uri, "/commits") + }, + None, + bearer_token.as_deref(), + ) + .await? + .commits } else { - print_commit_list_human(&payload.commits); + let db = Omnigraph::open(&uri).await?; + db.list_commits(branch.as_deref()) + .await? 
+ .iter() + .map(commit_output) + .collect::>() + }; + if json { + print_json(&CommitListOutput { commits })?; + } else { + print_commit_list_human(&commits); } } CommitCommand::Show { uri, + target, + config, commit_id, json, } => { - let client = client::GraphClient::resolve( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let commit = client.get_commit(&commit_id).await?; + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + let commit = if is_remote_uri(&uri) { + remote_json::( + &http_client, + Method::GET, + remote_url(&uri, &format!("/commits/{}", commit_id)), + None, + bearer_token.as_deref(), + ) + .await? + } else { + let db = Omnigraph::open(&uri).await?; + commit_output(&db.get_commit(&commit_id).await?) + }; if json { print_json(&commit)?; } else { @@ -453,19 +2209,14 @@ async fn main() -> Result<()> { Command::Schema { command } => match command { SchemaCommand::Plan { uri, + target, + config, schema, json, allow_data_loss, } => { - let uri = resolve_maintenance_uri( - cli.profile.as_deref(), - cli.store.as_deref(), - cli.cluster.as_deref(), - cli.graph.as_deref(), - uri, - "schema plan", - ) - .await?; + let config = load_cli_config(config.as_ref())?; + let uri = resolve_local_uri(&config, uri, target.as_deref(), "schema plan")?; let schema_source = fs::read_to_string(&schema)?; let db = Omnigraph::open(&uri).await?; let plan = db @@ -488,47 +2239,46 @@ async fn main() -> Result<()> { } SchemaCommand::Apply { uri, + target, + config, schema, json, allow_data_loss, } => { - let client = client::GraphClient::resolve_with_policy( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.as_actor.as_deref(), - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - // RFC-011 Decision 10: a graph managed by a cluster 
evolves via - // `cluster apply` (ledger/recovery/approvals), not a direct - // `schema apply` against its storage root — that would bypass the - // ledger. Mirrors `init`'s refusal. Only the embedded path can - // address a storage root; a served apply (`--server`) is the - // server's concern. - if !client.is_remote() { - if let Some(root) = - omnigraph_cluster::cluster_root_for_graph_uri(client.uri()).await - { - bail!( - "`{}` is inside cluster `{root}`. A graph in a cluster evolves via \ - `cluster apply` (which records ledger, recovery, and approvals), not \ - `schema apply`. Update the schema in cluster.yaml and run `cluster apply`.", - client.uri() - ); - } - } + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; let schema_source = fs::read_to_string(&schema)?; - // The embedded (direct-store) arm carries no stored-query - // registry — the registry is cluster-owned (RFC-011), so a - // direct apply has nothing to validate against. The served arm - // runs the server's own catalog check. So the validator is a - // no-op here on both arms. - echo_write_target(cli.quiet, "schema apply", client.uri(), client.is_remote()); - let output = client - .apply_schema(&schema_source, allow_data_loss, |_catalog| Ok(())) - .await?; + let output = if is_remote_uri(&uri) { + // MR-694 PR B: SchemaApplyRequest gained an + // allow_data_loss field so Hard-mode drops are no + // longer CLI-only. The previous bail is gone; the + // field is forwarded into the JSON payload, and + // the server's `server_schema_apply` honors it. + remote_json::( + &http_client, + Method::POST, + remote_url(&uri, "/schema/apply"), + Some(serde_json::to_value(SchemaApplyRequest { + schema_source: schema_source.clone(), + allow_data_loss, + })?), + bearer_token.as_deref(), + ) + .await? 
+ } else { + let db = open_local_db_with_policy(&uri, &config).await?; + let actor = resolve_cli_actor(cli.as_actor.as_deref(), &config); + let result = db + .apply_schema_as( + &schema_source, + omnigraph::db::SchemaApplyOptions { allow_data_loss }, + actor, + ) + .await?; + schema_apply_output(&uri, result) + }; if json { print_json(&output)?; } else { @@ -537,17 +2287,29 @@ async fn main() -> Result<()> { } SchemaCommand::Show { uri, + target, + config, json, } => { - let client = client::GraphClient::resolve( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let output = client.schema_source().await?; + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + let output = if is_remote_uri(&uri) { + remote_json::( + &http_client, + Method::GET, + remote_url(&uri, "/schema"), + None, + bearer_token.as_deref(), + ) + .await? + } else { + let db = Omnigraph::open(&uri).await?; + SchemaOutput { + schema_source: db.schema_source().to_string(), + } + }; if json { print_json(&output)?; } else { @@ -557,59 +2319,45 @@ async fn main() -> Result<()> { }, Command::Lint { uri, + target, + config, query, schema, json, } => { - // A graph target (when `--schema` is absent) resolves through the - // direct scope path (positional URI / --store / --profile / - // defaults.store). Offline (`--schema`) needs no graph, so leave - // the uri unresolved in that case. 
- let graph_uri = if schema.is_some() { - uri - } else { - Some( - resolve_maintenance_uri( - cli.profile.as_deref(), - cli.store.as_deref(), - cli.cluster.as_deref(), - cli.graph.as_deref(), - uri, - "lint", - ) - .await?, - ) - }; - let output = execute_query_lint(graph_uri, schema.as_ref(), &query).await?; + let config = load_cli_config(config.as_ref())?; + let output = + execute_query_lint(&config, uri, target.as_deref(), schema.as_ref(), &query) + .await?; finish_query_lint(&output, json)?; } - Command::Queries { command } => { - let cluster = - require_cluster_scope(cli.cluster.as_deref(), cli.profile.as_deref(), "queries")?; - match command { - QueriesCommand::Validate { json } => { - execute_queries_validate(&cluster, cli.graph.as_deref(), json).await?; - } - QueriesCommand::List { json } => { - execute_queries_list(&cluster, cli.graph.as_deref(), json).await?; - } - } - } Command::Snapshot { uri, + target, + config, branch, json, } => { - let client = client::GraphClient::resolve( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let branch = resolve_branch(branch, None, "main"); - let payload = client.snapshot(&branch).await?; + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + let branch = resolve_branch(&config, branch, None, "main"); + let payload = if is_remote_uri(&uri) { + remote_json::( + &http_client, + Method::GET, + format!("{}?branch={}", remote_url(&uri, "/snapshot"), branch), + None, + bearer_token.as_deref(), + ) + .await? 
+ } else { + let db = Omnigraph::open(&uri).await?; + let snapshot = db.snapshot_of(ReadTarget::branch(branch.as_str())).await?; + snapshot_payload(&branch, &snapshot) + }; + if json { print_json(&payload)?; } else { @@ -618,114 +2366,205 @@ async fn main() -> Result<()> { } Command::Export { uri, + target, + config, branch, jsonl, type_names, table_keys, } => { - let client = client::GraphClient::resolve( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let branch = resolve_branch(branch, None, "main"); + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + let branch = resolve_branch(&config, branch, None, "main"); if jsonl { eprintln!("warning: --jsonl is deprecated; `omnigraph export` always emits JSONL"); } let stdout = io::stdout(); let mut stdout = stdout.lock(); - client - .export(&branch, &type_names, &table_keys, &mut stdout) + if is_remote_uri(&uri) { + execute_export_remote_to_writer( + &http_client, + &uri, + &branch, + &type_names, + &table_keys, + bearer_token.as_deref(), + &mut stdout, + ) .await?; + } else { + execute_export_to_writer(&uri, &branch, &type_names, &table_keys, &mut stdout) + .await?; + } } Command::Query { - name, + uri, + legacy_uri, + target, + config, + alias, query, query_string, + name, params, branch, snapshot, format, json, + alias_args, } => { - let client = client::GraphClient::resolve( - cli.server.as_deref(), - cli.graph.as_deref(), - None, - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let params_json = load_params_json(&params)?; - let target = resolve_read_target(branch, snapshot, None)?; - let output: ReadOutput = if query.is_some() || query_string.is_some() { - // Ad-hoc lane: run the source; the positional `name` selects - // within it when it holds more than one query. 
- let query_source = - resolve_query_source(query.as_ref(), query_string.as_deref(), None)?; - client - .query(target, &query_source, name.as_deref(), params_json.as_ref()) - .await? + if alias.is_none() && query.is_none() && query_string.is_none() { + bail!("exactly one of --query, --query-string, or --alias must be provided"); + } + + let config = load_cli_config(config.as_ref())?; + let alias = resolve_alias(&config, alias.as_deref(), AliasCommand::Read)?; + let alias_name = alias.as_ref().map(|(name, _)| *name); + let alias_config = alias.as_ref().map(|(_, alias)| *alias); + let target_available = target.is_some() + || alias_config + .and_then(|alias| alias.graph.as_deref()) + .is_some() + || config.cli_graph_name().is_some(); + let (legacy_uri, alias_args) = + normalize_legacy_alias_uri(legacy_uri, target_available, alias_name, alias_args); + let uri = uri.or(legacy_uri); + let target_name = target + .as_deref() + .or_else(|| alias_config.and_then(|alias| alias.graph.as_deref())); + let bearer_token = resolve_remote_bearer_token(&config, uri.as_deref(), target_name)?; + let uri = resolve_uri(&config, uri, target_name)?; + let query_source = resolve_query_source( + &config, + query.as_ref(), + query_string.as_deref(), + alias_config.map(|a| a.query.as_str()), + )?; + let params_json = merged_params_json( + alias_name, + alias_config + .map(|alias| alias.args.as_slice()) + .unwrap_or(&[]), + &alias_args, + load_params_json(&params)?, + )?; + let target = resolve_read_target( + &config, + branch, + snapshot, + alias_config.and_then(|alias| alias.branch.clone()), + )?; + let query_name = name.or_else(|| alias_config.and_then(|alias| alias.name.clone())); + let output = if is_remote_uri(&uri) { + execute_read_remote( + &http_client, + &uri, + &query_source, + query_name.as_deref(), + target, + params_json.as_ref(), + bearer_token.as_deref(), + ) + .await? } else { - // Catalog lane (served-only): invoke the stored query by name. 
- let Some(name) = name else { - bail!( - "provide a query name to invoke from the catalog, or -e '' / \ - --query for an ad-hoc query" - ); - }; - let (branch, snapshot) = match &target { - ReadTarget::Branch(b) => (Some(b.clone()), None), - ReadTarget::Snapshot(s) => (None, Some(s.as_str().to_string())), - }; - client - .invoke_named(&name, false, params_json.as_ref(), branch, snapshot) - .await? + execute_read( + &uri, + &query_source, + query_name.as_deref(), + target, + params_json.as_ref(), + ) + .await? }; - let format = resolve_read_format(format, json, None); - print_read_output(&output, format)?; + let format = resolve_read_format( + &config, + format, + json, + alias_config.and_then(|alias| alias.format), + ); + print_read_output(&output, format, &config)?; } Command::Mutate { - name, + uri, + legacy_uri, + target, + config, + alias, query, query_string, + name, params, branch, json, + alias_args, } => { - let client = client::GraphClient::resolve_with_policy( - cli.server.as_deref(), - cli.graph.as_deref(), - None, - cli.as_actor.as_deref(), - cli.profile.as_deref(), - cli.store.as_deref(), - ) - .await?; - let params_json = load_params_json(&params)?; - let branch = resolve_branch(branch, None, "main"); - let output: ChangeOutput = if query.is_some() || query_string.is_some() { - // Ad-hoc lane: run the source; positional `name` selects within it. - let query_source = - resolve_query_source(query.as_ref(), query_string.as_deref(), None)?; - client - .mutate(&branch, &query_source, name.as_deref(), params_json.as_ref()) - .await? 
+ if alias.is_none() && query.is_none() && query_string.is_none() { + bail!("exactly one of --query, --query-string, or --alias must be provided"); + } + + let config = load_cli_config(config.as_ref())?; + let alias = resolve_alias(&config, alias.as_deref(), AliasCommand::Change)?; + let alias_name = alias.as_ref().map(|(name, _)| *name); + let alias_config = alias.as_ref().map(|(_, alias)| *alias); + let target_available = target.is_some() + || alias_config + .and_then(|alias| alias.graph.as_deref()) + .is_some() + || config.cli_graph_name().is_some(); + let (legacy_uri, alias_args) = + normalize_legacy_alias_uri(legacy_uri, target_available, alias_name, alias_args); + let uri = uri.or(legacy_uri); + let target_name = target + .as_deref() + .or_else(|| alias_config.and_then(|alias| alias.graph.as_deref())); + let bearer_token = resolve_remote_bearer_token(&config, uri.as_deref(), target_name)?; + let uri = resolve_uri(&config, uri, target_name)?; + let query_source = resolve_query_source( + &config, + query.as_ref(), + query_string.as_deref(), + alias_config.map(|a| a.query.as_str()), + )?; + let params_json = merged_params_json( + alias_name, + alias_config + .map(|alias| alias.args.as_slice()) + .unwrap_or(&[]), + &alias_args, + load_params_json(&params)?, + )?; + let branch = resolve_branch( + &config, + branch, + alias_config.and_then(|alias| alias.branch.clone()), + "main", + ); + let query_name = name.or_else(|| alias_config.and_then(|alias| alias.name.clone())); + let output = if is_remote_uri(&uri) { + execute_change_remote( + &http_client, + &uri, + &query_source, + query_name.as_deref(), + &branch, + params_json.as_ref(), + bearer_token.as_deref(), + ) + .await? } else { - // Catalog lane (served-only): invoke the stored mutation by name. 
- let Some(name) = name else { - bail!( - "provide a mutation name to invoke from the catalog, or -e '' / \ - --query for an ad-hoc mutation" - ); - }; - client - .invoke_named(&name, true, params_json.as_ref(), Some(branch), None) - .await? + execute_change( + &uri, + &query_source, + query_name.as_deref(), + &branch, + params_json.as_ref(), + &config, + cli.as_actor.as_deref(), + ) + .await? }; if json { print_json(&output)?; @@ -733,92 +2572,53 @@ async fn main() -> Result<()> { print_change_human(&output); } } - Command::Alias { - name, - args, - params, - format, - json, - } => { - let operator_config = crate::operator::load_operator_config()?; - let Some(operator_alias) = operator_config.aliases.get(&name) else { - let defined: Vec<&str> = - operator_config.aliases.keys().map(String::as_str).collect(); - bail!( - "unknown alias '{name}'; defined aliases: [{}] \ - (add it under `aliases:` in ~/.omnigraph/config.yaml)", - defined.join(", ") + Command::Policy { command } => match command { + PolicyCommand::Validate { config } => { + let config = load_cli_config(config.as_ref())?; + let engine = resolve_policy_engine(&config)?; + let policy_file = config + .resolve_policy_file() + .expect("policy file should exist after resolve_policy_engine"); + println!( + "policy valid: {} [{} actors]", + policy_file.display(), + engine.known_actor_count() ); - }; - let output = execute_operator_alias( - &http_client, - &name, - operator_alias, - &args, - load_params_json(&params)?, - ) - .await?; - let format = resolve_read_format(format, json, operator_alias.format); - print_read_output(&output, format)?; - } - Command::Policy { command } => { - // Policy tooling sources the Cedar bundle(s) from the cluster's - // applied policies (RFC-011): --cluster , + the global --graph - // to pick a graph's bundle when several apply. 
- let cluster = - require_cluster_scope(cli.cluster.as_deref(), cli.profile.as_deref(), "policy")?; - let graph = cli.graph.as_deref(); - let graph_id = match graph { - Some(id) => graph_resource_id_for_selection(Some(id), ""), - None => graph_resource_id_for_selection(None, "default"), - }; - let policies = read_cluster_policies(&cluster).await?; - match command { - PolicyCommand::Validate {} => { - let bundle = select_cluster_policy(&cluster, &policies, graph)?; - let engine = PolicyEngine::load_graph_from_source(&bundle.source, &graph_id)?; - println!( - "policy valid: bundle '{}' [{} actors]", - bundle.name, - engine.known_actor_count() - ); - } - PolicyCommand::Test { tests } => { - let bundle = select_cluster_policy(&cluster, &policies, graph)?; - let engine = PolicyEngine::load_graph_from_source(&bundle.source, &graph_id)?; - let tests = PolicyTestConfig::load(&tests)?; - engine.run_tests(&tests)?; - println!("policy tests passed: {} cases", tests.cases.len()); - } - PolicyCommand::Explain { - actor, + } + PolicyCommand::Test { config } => { + let config = load_cli_config(config.as_ref())?; + let engine = resolve_policy_engine(&config)?; + let tests_path = resolve_policy_tests_path(&config)?; + let tests = PolicyTestConfig::load(&tests_path)?; + engine.run_tests(&tests)?; + println!("policy tests passed: {} cases", tests.cases.len()); + } + PolicyCommand::Explain { + config, + actor, + action, + branch, + target_branch, + } => { + let config = load_cli_config(config.as_ref())?; + let engine = resolve_policy_engine(&config)?; + let request = PolicyRequest { action, branch, target_branch, - } => { - let bundle = select_cluster_policy(&cluster, &policies, graph)?; - let engine = PolicyEngine::load_graph_from_source(&bundle.source, &graph_id)?; - let request = PolicyRequest { - action, - branch, - target_branch, - }; - let decision = engine.authorize(&actor, &request)?; - print_policy_explain(&decision, &actor, &request); - } + }; + let decision = 
engine.authorize(&actor, &request)?; + print_policy_explain(&decision, &actor, &request); } - } - Command::Optimize { uri, json } => { - let uri = resolve_maintenance_uri( - cli.profile.as_deref(), - cli.store.as_deref(), - cli.cluster.as_deref(), - cli.graph.as_deref(), - uri, - "optimize", - ) - .await?; - echo_write_target(cli.quiet, "optimize", &uri, false); + }, + Command::Optimize { + uri, + target, + config, + json, + } => { + let config = load_cli_config(config.as_ref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; let db = Omnigraph::open(&uri).await?; let stats = db.optimize().await?; if json { @@ -829,140 +2629,36 @@ async fn main() -> Result<()> { "fragments_removed": s.fragments_removed, "fragments_added": s.fragments_added, "committed": s.committed, - "skipped": s.skipped.map(|r| r.as_str()), - "manifest_version": s.manifest_version, - "lance_head_version": s.lance_head_version, - "pending_indexes": s.pending_indexes.iter().map(|p| serde_json::json!({ - "column": p.column, - "reason": p.reason, - })).collect::>(), })).collect::>(), }); print_json(&value)?; } else { println!("optimize {} — {} tables", uri, stats.len()); for s in &stats { - if let Some(reason) = s.skipped { - println!(" {:<40} skipped ({reason})", s.table_key); - } else if s.committed { + if s.committed { println!( " {:<40} frags {} → {} ✓", - s.table_key, s.fragments_removed, s.fragments_added + s.table_key, + s.fragments_removed + s.fragments_added - s.fragments_added, + s.fragments_added ); } else { println!(" {:<40} no-op", s.table_key); } - for p in &s.pending_indexes { - println!(" ↳ index pending on '{}': {}", p.column, p.reason); - } } } } - Command::Repair { - uri, - confirm, - force, - json, - } => { - let uri = resolve_maintenance_uri( - cli.profile.as_deref(), - cli.store.as_deref(), - cli.cluster.as_deref(), - cli.graph.as_deref(), - uri, - "repair", - ) - .await?; - echo_write_target(cli.quiet, "repair", &uri, false); - let db = 
Omnigraph::open(&uri).await?; - let stats = db - .repair(omnigraph::db::RepairOptions { confirm, force }) - .await?; - let refused_count = stats - .tables - .iter() - .filter(|s| matches!(s.action, omnigraph::db::RepairAction::Refused)) - .count(); - if json { - let value = serde_json::json!({ - "uri": uri, - "confirm": confirm, - "force": force, - "manifest_version": stats.manifest_version, - "tables": stats.tables.iter().map(|s| serde_json::json!({ - "table_key": s.table_key, - "manifest_version": s.manifest_version, - "lance_head_version": s.lance_head_version, - "classification": s.classification.as_str(), - "action": s.action.as_str(), - "operations": s.operations, - "error": s.error, - })).collect::>(), - }); - print_json(&value)?; - } else { - let mode = if confirm { "confirm" } else { "preview" }; - println!( - "repair {} — {} mode, {} tables", - uri, - mode, - stats.tables.len() - ); - for s in &stats.tables { - let drift = if s.manifest_version == s.lance_head_version { - format!("{}", s.manifest_version) - } else { - format!("{} → {}", s.manifest_version, s.lance_head_version) - }; - let ops = if s.operations.is_empty() { - String::new() - } else { - format!(" [{}]", s.operations.join(", ")) - }; - let err = s - .error - .as_ref() - .map(|err| format!(" ({err})")) - .unwrap_or_default(); - println!( - " {:<40} {:<12} {:<22} {}{}{}", - s.table_key, - s.action.as_str(), - s.classification.as_str(), - drift, - ops, - err - ); - } - if !confirm { - println!("rerun with --confirm to publish verified maintenance drift"); - } - } - if refused_count > 0 { - bail!( - "repair refused {} suspicious or unverifiable table(s); review the preview \ - output and rerun with --force --confirm only if publishing that drift is \ - intentional", - refused_count - ); - } - } Command::Cleanup { uri, + target, + config, keep, older_than, confirm, json, } => { - let uri = resolve_maintenance_uri( - cli.profile.as_deref(), - cli.store.as_deref(), - cli.cluster.as_deref(), - 
cli.graph.as_deref(), - uri, - "cleanup", - ) - .await?; + let config = load_cli_config(config.as_ref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; let older_than_dur = older_than.as_deref().map(parse_duration_arg).transpose()?; @@ -986,11 +2682,6 @@ async fn main() -> Result<()> { ); return Ok(()); } - // Past the preview gate: a real destructive run. Against a non-local - // scope this additionally requires --yes (or an interactive yes), so - // `cleanup --confirm s3://…` in CI refuses rather than destroying. - confirm_destructive("cleanup", &uri, cli.yes, json)?; - echo_write_target(cli.quiet, "cleanup", &uri, false); let options = omnigraph::db::CleanupPolicyOptions { keep_versions: keep, @@ -1008,101 +2699,48 @@ async fn main() -> Result<()> { "table_key": s.table_key, "bytes_removed": s.bytes_removed, "old_versions_removed": s.old_versions_removed, - "error": s.error, })).collect::>(), }); print_json(&value)?; } else { let total_bytes: u64 = stats.iter().map(|s| s.bytes_removed).sum(); let total_versions: u64 = stats.iter().map(|s| s.old_versions_removed).sum(); - let failed: Vec<&str> = stats - .iter() - .filter(|s| s.error.is_some()) - .map(|s| s.table_key.as_str()) - .collect(); println!( "cleanup {} ({}) — removed {} versions ({} bytes) across {} tables", uri, policy_desc, total_versions, total_bytes, - stats.len() - failed.len() + stats.len() ); - if !failed.is_empty() { - println!( - " {} table(s) failed and will be retried on the next cleanup: {}", - failed.len(), - failed.join(", ") - ); - } } } - Command::Cluster { command } => match command { - ClusterCommand::Validate { config, json } => { - let output = validate_config_dir(config); - finish_cluster_validate(&output, json)?; - } - ClusterCommand::Plan { config, json } => { - let output = plan_config_dir(config).await; - finish_cluster_plan(&output, json)?; - } - ClusterCommand::Apply { config, json } => { - // The actor attributes graph-moving operations (sidecars, - // audit
entries, engine schema-apply commits). Cluster FACTS - // stay unlayered; the operator's identity resolves --as flag - // first, then per-operator config `operator.actor`. - let actor = resolve_cluster_actor(cli.as_actor.as_deref())?; - let output = apply_config_dir_with_options(config, ApplyOptions { actor }).await; - finish_cluster_apply(&output, json)?; - } - ClusterCommand::Approve { - resource, - config, - json, - } => { - let Some(approver) = resolve_cluster_actor(cli.as_actor.as_deref())? else { - bail!( - "`cluster approve` requires an approver: pass the global --as flag or set `operator.actor` in ~/.omnigraph/config.yaml — an approval without an approver is meaningless" - ); - }; - let output = approve_config_dir(config, &resource, &approver).await; - finish_cluster_approve(&output, json)?; - } - ClusterCommand::Status { config, json } => { - let output = status_config_dir(config).await; - finish_cluster_status(&output, json)?; - } - ClusterCommand::Refresh { config, json } => { - let output = refresh_config_dir(config).await; - finish_cluster_state_sync(&output, json)?; - } - ClusterCommand::Import { config, json } => { - let output = import_config_dir(config).await; - finish_cluster_state_sync(&output, json)?; - } - ClusterCommand::ForceUnlock { - lock_id, - config, - json, - } => { - let output = force_unlock_config_dir(config, lock_id).await; - finish_cluster_force_unlock(&output, json)?; - } - }, Command::Graphs { command } => match command { GraphsCommand::List { uri, + target, + config, json, } => { - let client = client::GraphClient::resolve( - cli.server.as_deref(), - cli.graph.as_deref(), - uri, - cli.profile.as_deref(), - cli.store.as_deref(), + let config = load_cli_config(config.as_ref())?; + let bearer_token = + resolve_remote_bearer_token(&config, uri.as_deref(), target.as_deref())?; + let uri = resolve_uri(&config, uri, target.as_deref())?; + if !is_remote_uri(&uri) { + bail!( + "`omnigraph graphs list` requires a remote multi-graph
server URL \ + (http:// or https://). To enumerate local graphs, read `omnigraph.yaml` \ + directly." + ); + } + let payload = remote_json::( + &http_client, + Method::GET, + remote_url(&uri, "/graphs"), + None, + bearer_token.as_deref(), ) .await?; - let payload = client.list_graphs().await?; if json { print_json(&payload)?; } else { @@ -1116,7 +2754,271 @@ async fn main() -> Result<()> { Ok(()) } - #[cfg(test)] -#[path = "main_tests.rs"] -mod tests; +mod tests { + use std::fs; + + use super::{ + DEFAULT_BEARER_TOKEN_ENV, apply_bearer_token, bearer_token_from_env_file, + legacy_change_request_body, load_cli_config, load_env_file_into_process, + normalize_bearer_token, parse_env_assignment, resolve_remote_bearer_token, + }; + use omnigraph_server::load_config; + use reqwest::header::AUTHORIZATION; + use serde_json::json; + use tempfile::tempdir; + + #[test] + fn legacy_change_request_body_uses_legacy_field_names() { + // `execute_change_remote` hits `POST /change`, which old + // `omnigraph-server` builds deserialize as `ChangeRequest` with + // **required** `query_source` and optional `query_name` keys. + // Newer servers accept both spellings via serde alias, but a + // newer CLI must still emit the legacy keys on the wire so it + // can talk to an old server during a rolling upgrade. + let body = legacy_change_request_body( + "query insert_person($n: String) { insert Person { name: $n } }", + Some("insert_person"), + "main", + Some(&json!({ "n": "Alice" })), + ); + assert_eq!( + body["query_source"].as_str(), + Some("query insert_person($n: String) { insert Person { name: $n } }"), + ); + assert_eq!(body["query_name"].as_str(), Some("insert_person")); + assert_eq!(body["branch"].as_str(), Some("main")); + assert_eq!(body["params"]["n"].as_str(), Some("Alice")); + // Crucially, the **new** field names must NOT appear -- old + // servers would silently treat them as unknown fields and then + // fail on missing required `query_source`. 
+ assert!( + body.get("query").is_none(), + "legacy /change body must not carry the renamed `query` key; got {body}" + ); + assert!( + body.get("name").is_none(), + "legacy /change body must not carry the renamed `name` key; got {body}" + ); + } + + #[test] + fn legacy_change_request_body_omits_optional_fields_when_unset() { + let body = legacy_change_request_body( + "query find() { match { $p: Person } return { $p.name } }", + None, + "main", + None, + ); + assert_eq!(body["branch"].as_str(), Some("main")); + assert!(body.get("query_name").is_none()); + assert!(body.get("params").is_none()); + } + + #[test] + fn apply_bearer_token_adds_header_when_configured() { + let client = reqwest::Client::new(); + let request = apply_bearer_token(client.get("http://example.com"), Some("demo-token")) + .build() + .unwrap(); + assert_eq!( + request + .headers() + .get(AUTHORIZATION) + .and_then(|value| value.to_str().ok()), + Some("Bearer demo-token") + ); + } + + #[test] + fn apply_bearer_token_leaves_request_unchanged_when_not_configured() { + let client = reqwest::Client::new(); + let request = apply_bearer_token(client.get("http://example.com"), None) + .build() + .unwrap(); + assert!(request.headers().get(AUTHORIZATION).is_none()); + } + + #[test] + fn normalize_bearer_token_trims_and_filters_blank_values() { + assert_eq!(normalize_bearer_token(None), None); + assert_eq!(normalize_bearer_token(Some(" ".to_string())), None); + assert_eq!( + normalize_bearer_token(Some(" demo-token ".to_string())).as_deref(), + Some("demo-token") + ); + } + + #[test] + fn parse_env_assignment_supports_plain_and_exported_values() { + assert_eq!( + parse_env_assignment("DEMO_TOKEN=demo-token"), + Some(("DEMO_TOKEN".to_string(), "demo-token".to_string())) + ); + assert_eq!( + parse_env_assignment("export DEMO_TOKEN=\"quoted-token\""), + Some(("DEMO_TOKEN".to_string(), "quoted-token".to_string())) + ); + assert_eq!(parse_env_assignment("# comment"), None); + assert_eq!(parse_env_assignment(" "), 
None); + } + + #[test] + fn bearer_token_from_env_file_reads_named_value() { + let temp = tempdir().unwrap(); + let env_file = temp.path().join(".env.omni"); + fs::write( + &env_file, + "FIRST=ignore\nexport DEMO_TOKEN=\" demo-token \"\n", + ) + .unwrap(); + + assert_eq!( + bearer_token_from_env_file(&env_file, "DEMO_TOKEN") + .unwrap() + .as_deref(), + Some("demo-token") + ); + assert_eq!( + bearer_token_from_env_file(&env_file, "MISSING").unwrap(), + None + ); + } + + #[test] + fn load_env_file_into_process_sets_missing_values_without_overriding_existing_ones() { + let temp = tempdir().unwrap(); + let env_file = temp.path().join(".env.omni"); + fs::write( + &env_file, + "AUTOLOAD_ONLY=from-file\nAUTOLOAD_PRESET=from-file\n", + ) + .unwrap(); + + let missing_key = "AUTOLOAD_ONLY"; + let preset_key = "AUTOLOAD_PRESET"; + let previous_missing = std::env::var_os(missing_key); + let previous_preset = std::env::var_os(preset_key); + + unsafe { + std::env::remove_var(missing_key); + std::env::set_var(preset_key, "from-env"); + } + + load_env_file_into_process(&env_file).unwrap(); + + assert_eq!(std::env::var(missing_key).unwrap(), "from-file"); + assert_eq!(std::env::var(preset_key).unwrap(), "from-env"); + + unsafe { + if let Some(value) = previous_missing { + std::env::set_var(missing_key, value); + } else { + std::env::remove_var(missing_key); + } + + if let Some(value) = previous_preset { + std::env::set_var(preset_key, value); + } else { + std::env::remove_var(preset_key); + } + } + } + + #[test] + fn resolve_remote_bearer_token_uses_scoped_env_file_with_global_fallback() { + let temp = tempdir().unwrap(); + fs::write( + temp.path().join("omnigraph.yaml"), + r#" +graphs: + demo: + uri: https://example.com + bearer_token_env: DEMO_TOKEN +auth: + env_file: .env.omni +cli: + graph: demo +"#, + ) + .unwrap(); + fs::write( + temp.path().join(".env.omni"), + "DEMO_TOKEN=scoped-token\nOMNIGRAPH_BEARER_TOKEN=global-token\n", + ) + .unwrap(); + + let previous = 
std::env::var_os(DEFAULT_BEARER_TOKEN_ENV); + unsafe { + std::env::remove_var(DEFAULT_BEARER_TOKEN_ENV); + } + + let config_path = temp.path().join("omnigraph.yaml"); + let config = load_config(Some(&config_path)).unwrap(); + + assert_eq!( + resolve_remote_bearer_token(&config, None, Some("demo")) + .unwrap() + .as_deref(), + Some("scoped-token") + ); + assert_eq!( + resolve_remote_bearer_token(&config, Some("https://override.example.com"), None) + .unwrap() + .as_deref(), + Some("global-token") + ); + + unsafe { + if let Some(value) = previous { + std::env::set_var(DEFAULT_BEARER_TOKEN_ENV, value); + } else { + std::env::remove_var(DEFAULT_BEARER_TOKEN_ENV); + } + } + } + + #[test] + fn load_cli_config_autoloads_env_file_into_process() { + let temp = tempdir().unwrap(); + fs::write( + temp.path().join("omnigraph.yaml"), + r#" +auth: + env_file: .env.omni +graphs: + demo: + uri: s3://bucket/prefix +"#, + ) + .unwrap(); + fs::write( + temp.path().join(".env.omni"), + "AUTOLOAD_FROM_CONFIG=loaded\n", + ) + .unwrap(); + + let key = "AUTOLOAD_FROM_CONFIG"; + let previous = std::env::var_os(key); + unsafe { + std::env::remove_var(key); + } + + let config_path = temp.path().join("omnigraph.yaml"); + let config = load_cli_config(Some(&config_path)).unwrap(); + + assert_eq!( + config.resolve_target_uri(None, Some("demo"), None).unwrap(), + "s3://bucket/prefix" + ); + assert_eq!(std::env::var(key).unwrap(), "loaded"); + + unsafe { + if let Some(value) = previous { + std::env::set_var(key, value); + } else { + std::env::remove_var(key); + } + } + } +} diff --git a/crates/omnigraph-cli/src/main_tests.rs b/crates/omnigraph-cli/src/main_tests.rs deleted file mode 100644 index 4f93277..0000000 --- a/crates/omnigraph-cli/src/main_tests.rs +++ /dev/null @@ -1,124 +0,0 @@ -//! In-source test suite for the CLI binary (moved verbatim from -//! main.rs; `use super::*` resolves through the #[path] declaration). 
- - use super::{ - DEFAULT_BEARER_TOKEN_ENV, apply_bearer_token, legacy_change_request_body, - normalize_bearer_token, resolve_remote_bearer_token, - }; - use reqwest::header::AUTHORIZATION; - use serde_json::json; - - #[test] - fn legacy_change_request_body_uses_legacy_field_names() { - // `mutate`'s remote arm hits `POST /change`, which old - // `omnigraph-server` builds deserialize as `ChangeRequest` with - // **required** `query_source` and optional `query_name` keys. - // Newer servers accept both spellings via serde alias, but a - // newer CLI must still emit the legacy keys on the wire so it - // can talk to an old server during a rolling upgrade. - let body = legacy_change_request_body( - "query insert_person($n: String) { insert Person { name: $n } }", - Some("insert_person"), - "main", - Some(&json!({ "n": "Alice" })), - ); - assert_eq!( - body["query_source"].as_str(), - Some("query insert_person($n: String) { insert Person { name: $n } }"), - ); - assert_eq!(body["query_name"].as_str(), Some("insert_person")); - assert_eq!(body["branch"].as_str(), Some("main")); - assert_eq!(body["params"]["n"].as_str(), Some("Alice")); - // Crucially, the **new** field names must NOT appear -- old - // servers would silently treat them as unknown fields and then - // fail on missing required `query_source`. 
- assert!( - body.get("query").is_none(), - "legacy /change body must not carry the renamed `query` key; got {body}" - ); - assert!( - body.get("name").is_none(), - "legacy /change body must not carry the renamed `name` key; got {body}" - ); - } - - #[test] - fn legacy_change_request_body_omits_optional_fields_when_unset() { - let body = legacy_change_request_body( - "query find() { match { $p: Person } return { $p.name } }", - None, - "main", - None, - ); - assert_eq!(body["branch"].as_str(), Some("main")); - assert!(body.get("query_name").is_none()); - assert!(body.get("params").is_none()); - } - - #[test] - fn apply_bearer_token_adds_header_when_configured() { - let client = reqwest::Client::new(); - let request = apply_bearer_token(client.get("http://example.com"), Some("demo-token")) - .build() - .unwrap(); - assert_eq!( - request - .headers() - .get(AUTHORIZATION) - .and_then(|value| value.to_str().ok()), - Some("Bearer demo-token") - ); - } - - #[test] - fn apply_bearer_token_leaves_request_unchanged_when_not_configured() { - let client = reqwest::Client::new(); - let request = apply_bearer_token(client.get("http://example.com"), None) - .build() - .unwrap(); - assert!(request.headers().get(AUTHORIZATION).is_none()); - } - - #[test] - fn normalize_bearer_token_trims_and_filters_blank_values() { - assert_eq!(normalize_bearer_token(None), None); - assert_eq!(normalize_bearer_token(Some(" ".to_string())), None); - assert_eq!( - normalize_bearer_token(Some(" demo-token ".to_string())).as_deref(), - Some("demo-token") - ); - } - - #[test] - fn resolve_remote_bearer_token_falls_back_to_default_env() { - // RFC-011: with no operator server matching the URL, the only chain - // left is the default `OMNIGRAPH_BEARER_TOKEN` env (no omnigraph.yaml - // scoped chain). Hermetic: no operator config is read for a literal URL - // that matches no `servers:` entry. 
- let previous = std::env::var_os(DEFAULT_BEARER_TOKEN_ENV); - let previous_home = std::env::var_os("OMNIGRAPH_HOME"); - unsafe { - std::env::set_var(DEFAULT_BEARER_TOKEN_ENV, "global-token"); - std::env::set_var("OMNIGRAPH_HOME", "/nonexistent/omnigraph-test-home"); - } - - assert_eq!( - resolve_remote_bearer_token(Some("https://override.example.com")) - .unwrap() - .as_deref(), - Some("global-token") - ); - - unsafe { - if let Some(value) = previous { - std::env::set_var(DEFAULT_BEARER_TOKEN_ENV, value); - } else { - std::env::remove_var(DEFAULT_BEARER_TOKEN_ENV); - } - if let Some(value) = previous_home { - std::env::set_var("OMNIGRAPH_HOME", value); - } else { - std::env::remove_var("OMNIGRAPH_HOME"); - } - } - } diff --git a/crates/omnigraph-cli/src/operator.rs b/crates/omnigraph-cli/src/operator.rs deleted file mode 100644 index dbfe781..0000000 --- a/crates/omnigraph-cli/src/operator.rs +++ /dev/null @@ -1,841 +0,0 @@ -//! The operator config surface (RFC-007): `~/.omnigraph/config.yaml` — who -//! the operator IS (identity, ergonomics), never what the system is (that's -//! cluster config) and never a project file (nothing here arrives with a -//! repo checkout). -//! -//! PR-1 scope: `operator.actor` + `defaults.output`. Unknown keys WARN and -//! are preserved-by-ignoring — a file written for a newer CLI (servers, -//! aliases, credentials keys from later slices) must load cleanly on this -//! one. Contrast with `cluster.yaml`, where unknown keys are fatal because -//! they change what a plan means. -//! -//! This module is CLI-only by design: the server never reads operator -//! config (server-side identity comes from bearer auth — invariant 11 -//! holds by construction).
- -use std::collections::BTreeMap; -use std::env; -use std::path::{Path, PathBuf}; - -use color_eyre::Result; -use color_eyre::eyre::{bail, eyre}; -use serde::Deserialize; - -use crate::read_format::{ReadOutputFormat, TableCellLayout}; - -pub(crate) const OPERATOR_HOME_ENV: &str = "OMNIGRAPH_HOME"; -pub(crate) const OPERATOR_DIR: &str = ".omnigraph"; -pub(crate) const OPERATOR_CONFIG_FILE: &str = "config.yaml"; - -#[derive(Debug, Default, Deserialize)] -pub(crate) struct OperatorConfig { - #[serde(default)] - pub(crate) operator: OperatorIdentity, - #[serde(default)] - pub(crate) defaults: OperatorDefaults, - /// Operator-owned endpoint definitions (RFC-007 §D2/§D4): name → url. - /// The name keys the credential chain; nothing a repo checkout supplies - /// can redefine an entry here. No tokens in this file, ever. - #[serde(default)] - pub(crate) servers: BTreeMap, - /// Personal alias bindings (RFC-007 PR 3); see OperatorAlias. - #[serde(default)] - pub(crate) aliases: BTreeMap, - /// Named scope bundles (RFC-011): each binds exactly one of - /// {server, cluster, store} plus an optional default graph. Config data, - /// not state — selecting one (`--profile`/`OMNIGRAPH_PROFILE`) fills in a - /// command's omitted addressing; it never puts you "in" a mode. - #[serde(default)] - pub(crate) profiles: BTreeMap, - /// Managed-cluster storage roots (RFC-011): name → root URI. The ONLY - /// place a storage root appears in operator config — admin-only and - /// opt-in; a normal operator's file has none. - #[serde(default)] - pub(crate) clusters: BTreeMap, - /// Everything this CLI version doesn't know. Warned once at load, - /// otherwise ignored (forward compatibility within the operator layer). - #[serde(flatten)] - unknown: serde_yaml::Mapping, -} - -/// A personal alias: a pure BINDING to a stored query on a named server — -/// never content, never a file (RFC-007 §D2 "Aliases are bindings, not -/// content").
The stored query is the team's contract; the alias, its -/// defaults, and its name are the operator's. -#[derive(Debug, Deserialize)] -pub(crate) struct OperatorAlias { - /// Names an entry under `servers:`. - pub(crate) server: String, - /// Graph id for multi-graph servers (appends `/graphs/`). - pub(crate) graph: Option, - /// The STORED query's name on that server. - pub(crate) query: String, - /// Positional CLI args bind to these param names, in order. - #[serde(default)] - pub(crate) args: Vec, - /// Fixed default params; positionals and `--params` override per key. - #[serde(default)] - pub(crate) params: serde_yaml::Mapping, - pub(crate) format: Option, - #[serde(flatten)] - unknown: serde_yaml::Mapping, -} - -#[derive(Debug, Deserialize)] -pub(crate) struct OperatorServer { - pub(crate) url: String, - #[serde(flatten)] - unknown: serde_yaml::Mapping, -} - -#[derive(Debug, Default, Deserialize)] -pub(crate) struct OperatorIdentity { - /// Default actor for every `--as` cascade (CLI direct-engine writes and - /// cluster commands alike): `--as` > this > none. - pub(crate) actor: Option, - #[serde(flatten)] - unknown: serde_yaml::Mapping, -} - -#[derive(Debug, Default, Deserialize)] -pub(crate) struct OperatorDefaults { - /// Default read output format, below every more-specific source. - pub(crate) output: Option, - /// Table rendering preferences for `--format table`. - pub(crate) table_max_column_width: Option, - pub(crate) table_cell_layout: Option, - /// Default server scope (RFC-011): the everyday addressing when no - /// `--profile` / primitive / legacy address is given. Names an entry - /// under `servers:`. Mutually exclusive with `store` — a scope binds one - /// entity. - pub(crate) server: Option, - /// Default **store** scope (RFC-011): a `file://` / `s3://` graph storage - /// URI used as the zero-flag local default for graph commands when no - /// `--profile` / primitive address is given.
The local-dev counterpart of - /// `server`; mutually exclusive with it. - pub(crate) store: Option, - /// Default graph selected within a server/cluster scope when no - /// `--graph` is passed (RFC-011). - pub(crate) default_graph: Option, - #[serde(flatten)] - unknown: serde_yaml::Mapping, -} - -/// A named scope bundle (RFC-011): exactly one of {server, cluster, store} -/// plus an optional default graph. Validated on use (`binding()`), not at -/// parse time, so an unknown CLI's profile still loads. -#[derive(Debug, Default, Deserialize)] -pub(crate) struct OperatorProfile { - /// Names an entry under `servers:` — a served scope. - pub(crate) server: Option, - /// Names an entry under `clusters:` — a privileged direct cluster scope. - pub(crate) cluster: Option, - /// A single graph's storage URI — a direct store scope. - pub(crate) store: Option, - /// Default graph within a server/cluster scope (ignored for a store, - /// which is already one graph). - pub(crate) default_graph: Option, - #[serde(flatten)] - unknown: serde_yaml::Mapping, -} - -/// A managed-cluster storage root (RFC-011). -#[derive(Debug, Default, Deserialize)] -pub(crate) struct OperatorCluster { - /// The cluster's storage-root URI (`file://` / `s3://`). - pub(crate) root: String, - #[serde(flatten)] - unknown: serde_yaml::Mapping, -} - -/// The one entity a profile (or flat default) binds. Exactly one variant — -/// the scope resolver consumes this; "exactly one of server/cluster/store" -/// is enforced when producing it. -#[derive(Debug, Clone, PartialEq, Eq)] -pub(crate) enum ScopeBinding { - /// Served scope: a server name (resolved against `servers:`) or a literal URL. - Server(String), - /// Direct cluster scope: a cluster name (resolved against `clusters:`) or a - /// literal root URI. - Cluster(String), - /// Direct store scope: a single graph's storage URI.
- Store(String), -} - -impl OperatorConfig { - pub(crate) fn actor(&self) -> Option<&str> { - self.operator.actor.as_deref() - } - - pub(crate) fn output(&self) -> Option { - self.defaults.output - } - - /// The gh-host model: which operator server (if any) does this request - /// URL belong to? Longest-prefix match after trailing-slash - /// normalization, so `url: http://h:8080` matches - /// `http://h:8080/graphs/spike` but never `http://h:8080-evil`. - pub(crate) fn find_server_for_url(&self, request_url: &str) -> Option<&str> { - let request = request_url.trim_end_matches('/'); - let mut best: Option<(&str, usize)> = None; - for (name, server) in &self.servers { - let base = server.url.trim_end_matches('/'); - let matches = request == base - || request - .strip_prefix(base) - .is_some_and(|rest| rest.starts_with('/')); - if matches && best.is_none_or(|(_, len)| base.len() > len) { - best = Some((name, base.len())); - } - } - best.map(|(name, _)| name) - } - - /// A named profile, if defined (RFC-011). - pub(crate) fn profile(&self, name: &str) -> Option<&OperatorProfile> { - self.profiles.get(name) - } - - /// The storage root of a named cluster, if defined (RFC-011). - pub(crate) fn cluster_root(&self, name: &str) -> Option<&str> { - self.clusters.get(name).map(|c| c.root.as_str()) - } - - /// The flat-default server scope name, if set (RFC-011). - pub(crate) fn default_server(&self) -> Option<&str> { - self.defaults.server.as_deref() - } - - /// The flat-default store scope URI, if set (RFC-011) — the zero-flag - /// local-dev default. - pub(crate) fn default_store(&self) -> Option<&str> { - self.defaults.store.as_deref() - } - - /// The flat-default graph within a server/cluster scope, if set (RFC-011).
- pub(crate) fn default_graph(&self) -> Option<&str> { - self.defaults.default_graph.as_deref() - } - - /// A scope binds one entity (Decision 6): `defaults.server` and - /// `defaults.store` are mutually exclusive, and a `store` (already a single - /// graph) cannot carry a `default_graph`. Both are refused loudly rather - /// than silently dropped. - fn validate_defaults(&self) -> Result<()> { - if self.defaults.server.is_some() && self.defaults.store.is_some() { - bail!( - "operator config `defaults` sets both `server` and `store` — a default scope \ - binds one entity; keep one (use a `profile` if you need both)" - ); - } - if self.defaults.store.is_some() && self.defaults.default_graph.is_some() { - bail!( - "operator config `defaults` sets both `store` and `default_graph` — a store is \ - already a single graph; drop `default_graph` (it applies only to a server/cluster scope)" - ); - } - Ok(()) - } -} - -impl OperatorProfile { - /// The single entity this profile binds, or a loud error if it binds zero - /// or more than one of {server, cluster, store} (Decision 6: a scope binds - /// exactly one entity). Validated here, on use, rather than at parse time.
- pub(crate) fn binding(&self, profile_name: &str) -> Result { - let set: Vec<&str> = [ - self.server.as_ref().map(|_| "server"), - self.cluster.as_ref().map(|_| "cluster"), - self.store.as_ref().map(|_| "store"), - ] - .into_iter() - .flatten() - .collect(); - match set.as_slice() { - ["server"] => Ok(ScopeBinding::Server(self.server.clone().unwrap())), - ["cluster"] => Ok(ScopeBinding::Cluster(self.cluster.clone().unwrap())), - ["store"] => Ok(ScopeBinding::Store(self.store.clone().unwrap())), - [] => Err(eyre!( - "profile '{profile_name}' binds no scope; set exactly one of \ - `server`, `cluster`, or `store`" - )), - many => Err(eyre!( - "profile '{profile_name}' binds {} scopes ({}); a profile must \ - bind exactly one of `server`, `cluster`, or `store`", - many.len(), - many.join(", ") - )), - } - } -} - -/// The operator dir: `$OMNIGRAPH_HOME` if set (tilde-expanded), else -/// `~/.omnigraph`. Returns None when no home directory is resolvable -/// (degenerate environments — the layer is simply absent). -pub(crate) fn operator_dir() -> Option { - if let Some(home_override) = env::var_os(OPERATOR_HOME_ENV) { - let raw = home_override.to_string_lossy().into_owned(); - return Some(expand_tilde(&raw)); - } - env::home_dir().map(|home| home.join(OPERATOR_DIR)) -} - -/// Load the operator layer. Absent file (or unresolvable home) is an empty -/// layer, never an error; a present-but-malformed file is a loud error (the -/// operator owns it and can fix it); unknown keys warn to stderr once.
-pub(crate) fn load_operator_config() -> Result { - let Some(dir) = operator_dir() else { - return Ok(OperatorConfig::default()); - }; - load_operator_config_at(&dir.join(OPERATOR_CONFIG_FILE)) -} - -pub(crate) fn load_operator_config_at(path: &Path) -> Result { - let text = match std::fs::read_to_string(path) { - Ok(text) => text, - Err(err) if err.kind() == std::io::ErrorKind::NotFound => { - return Ok(OperatorConfig::default()); - } - Err(err) => { - return Err(eyre!( - "could not read operator config '{}': {err}", - path.display() - )); - } - }; - let config: OperatorConfig = serde_yaml::from_str(&text).map_err(|err| { - eyre!( - "could not parse operator config '{}': {err}", - path.display() - ) - })?; - for warning in config.unknown_key_warnings() { - eprintln!("warning: {warning} in operator config '{}'", path.display()); - } - config.validate_defaults()?; - Ok(config) -} - -impl OperatorConfig { - fn unknown_key_warnings(&self) -> Vec { - let mut warnings = Vec::new(); - let mut collect = |mapping: &serde_yaml::Mapping, prefix: &str| { - for key in mapping.keys() { - if let Some(name) = key.as_str() { - warnings.push(format!( - "unknown key `{prefix}{name}` (newer CLI feature or typo); ignored" - )); - } - } - }; - collect(&self.unknown, ""); - collect(&self.operator.unknown, "operator."); - collect(&self.defaults.unknown, "defaults."); - for (name, server) in &self.servers { - collect(&server.unknown, &format!("servers.{name}.")); - } - for (name, alias) in &self.aliases { - collect(&alias.unknown, &format!("aliases.{name}.")); - } - for (name, profile) in &self.profiles { - collect(&profile.unknown, &format!("profiles.{name}.")); - } - for (name, cluster) in &self.clusters { - collect(&cluster.unknown, &format!("clusters.{name}.")); - } - warnings - } -} - -// ---- keyed credentials (RFC-007 §D4) ---- - -pub(crate) const CREDENTIALS_FILE: &str = "credentials"; -const TOKEN_ENV_PREFIX: &str = "OMNIGRAPH_TOKEN_"; - -pub(crate) fn credentials_path() ->
Option { - operator_dir().map(|dir| dir.join(CREDENTIALS_FILE)) -} - -/// `intel-dev` → `OMNIGRAPH_TOKEN_INTEL_DEV`. -pub(crate) fn token_env_name(server: &str) -> String { - let mut name = String::from(TOKEN_ENV_PREFIX); - for c in server.chars() { - name.push(match c { - '-' => '_', - other => other.to_ascii_uppercase(), - }); - } - name -} - -/// The keyed token chain for a named server (§D4 steps 1–2): -/// `OMNIGRAPH_TOKEN_` env → `[]` in the credentials file. -/// `Ok(None)` means "no keyed token" — callers fall through to the legacy -/// chain; a present-but-unreadable/over-permissive credentials file is a -/// loud error, never a silent skip. -pub(crate) fn resolve_keyed_token(server: &str) -> Result> { - if let Ok(token) = env::var(token_env_name(server)) { - let token = token.trim(); - if !token.is_empty() { - return Ok(Some(token.to_string())); - } - } - let Some(path) = credentials_path() else { - return Ok(None); - }; - read_credential_at(&path, server) -} - -pub(crate) fn read_credential_at(path: &Path, server: &str) -> Result> { - let text = match std::fs::read_to_string(path) { - Ok(text) => text, - Err(err) if err.kind() == std::io::ErrorKind::NotFound => return Ok(None), - Err(err) => { - return Err(eyre!( - "could not read credentials file '{}': {err}", - path.display() - )); - } - }; - refuse_over_permissive(path)?; - let mut in_section = false; - for line in text.lines() { - let line = line.trim(); - if line.is_empty() || line.starts_with('#') { - continue; - } - if let Some(section) = line.strip_prefix('[').and_then(|l| l.strip_suffix(']')) { - in_section = section.trim() == server; - continue; - } - if in_section { - if let Some((key, value)) = line.split_once('=') { - if key.trim() == "token" { - let value = unquote(value.trim()); - if value.is_empty() { - return Ok(None); - } - return Ok(Some(value.to_string())); - } - } - } - } - Ok(None) -} - -/// Write (or rotate) one server's token, preserving every other section.
-/// Temp file + rename (#139 finding 7), created 0600. -pub(crate) fn write_credential(server: &str, token: &str) -> Result { - let path = credentials_path() - .ok_or_else(|| eyre!("no home directory resolvable for the credentials file"))?; - rewrite_credentials_at(&path, server, Some(token))?; - Ok(path) -} - -/// Remove one server's section. Idempotent: absent file or section is fine. -pub(crate) fn remove_credential(server: &str) -> Result { - let path = credentials_path() - .ok_or_else(|| eyre!("no home directory resolvable for the credentials file"))?; - rewrite_credentials_at(&path, server, None)?; - Ok(path) -} - -pub(crate) fn rewrite_credentials_at( - path: &Path, - server: &str, - token: Option<&str>, -) -> Result<()> { - let existing = match std::fs::read_to_string(path) { - Ok(text) => { - refuse_over_permissive(path)?; - text - } - Err(err) if err.kind() == std::io::ErrorKind::NotFound => String::new(), - Err(err) => { - return Err(eyre!( - "could not read credentials file '{}': {err}", - path.display() - )); - } - }; - - // Drop the target section (if present), keep everything else verbatim. 
- let mut out = String::new(); - let mut in_target = false; - for line in existing.lines() { - let trimmed = line.trim(); - if let Some(section) = trimmed.strip_prefix('[').and_then(|l| l.strip_suffix(']')) { - in_target = section.trim() == server; - if in_target { - continue; - } - } - if !in_target { - out.push_str(line); - out.push('\n'); - } - } - if let Some(token) = token { - if !out.is_empty() && !out.ends_with("\n\n") { - out.push('\n'); - } - out.push_str(&format!("[{server}]\ntoken = {token}\n")); - } - - if let Some(parent) = path.parent() { - std::fs::create_dir_all(parent)?; - } - let tmp = path.with_extension(format!("tmp.{}", std::process::id())); - write_owner_only(&tmp, &out)?; - std::fs::rename(&tmp, path).map_err(|err| { - let _ = std::fs::remove_file(&tmp); - eyre!( - "could not move credentials file into place '{}': {err}", - path.display() - ) - })?; - Ok(()) -} - -#[cfg(unix)] -fn write_owner_only(path: &Path, content: &str) -> Result<()> { - use std::io::Write; - use std::os::unix::fs::OpenOptionsExt; - let mut file = std::fs::OpenOptions::new() - .write(true) - .create(true) - .truncate(true) - .mode(0o600) - .open(path)?; - file.write_all(content.as_bytes())?; - Ok(()) -} - -#[cfg(not(unix))] -fn write_owner_only(path: &Path, content: &str) -> Result<()> { - std::fs::write(path, content)?; - Ok(()) -} - -/// Secrets are operator-private: refuse a credentials file other accounts -/// can read (the chain errs loudly rather than using a leaked secret). 
-#[cfg(unix)] -fn refuse_over_permissive(path: &Path) -> Result<()> { - use std::os::unix::fs::PermissionsExt; - let mode = std::fs::metadata(path)?.permissions().mode(); - if mode & 0o077 != 0 { - return Err(eyre!( - "credentials file '{}' is group/world-accessible (mode {:o}); run `chmod 600 {}`", - path.display(), - mode & 0o777, - path.display() - )); - } - Ok(()) -} - -#[cfg(not(unix))] -fn refuse_over_permissive(_path: &Path) -> Result<()> { - Ok(()) -} - -fn unquote(value: &str) -> &str { - if value.len() >= 2 - && ((value.starts_with('"') && value.ends_with('"')) - || (value.starts_with('\'') && value.ends_with('\''))) - { - &value[1..value.len() - 1] - } else { - value - } -} - -/// Expand a leading `~` / `~/` to the home directory (PR #139 finding 9: -/// a literal `./~/…` path silently created a directory named `~`). -pub(crate) fn expand_tilde(raw: &str) -> PathBuf { - if raw == "~" { - return env::home_dir().unwrap_or_else(|| PathBuf::from(raw)); - } - if let Some(rest) = raw.strip_prefix("~/") { - if let Some(home) = env::home_dir() { - return home.join(rest); - } - } - PathBuf::from(raw) -} - -#[cfg(test)] -mod tests { - use super::*; - use std::fs; - - #[test] - fn absent_file_is_an_empty_layer() { - let dir = tempfile::tempdir().unwrap(); - let config = load_operator_config_at(&dir.path().join("config.yaml")).unwrap(); - assert!(config.actor().is_none()); - assert!(config.output().is_none()); - } - - #[test] - fn parses_identity_and_defaults() { - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("config.yaml"); - fs::write( - &path, - "operator:\n actor: act-andrew\ndefaults:\n output: json\n", - ) - .unwrap(); - let config = load_operator_config_at(&path).unwrap(); - assert_eq!(config.actor(), Some("act-andrew")); - assert_eq!(config.output(), Some(ReadOutputFormat::Json)); - } - - #[test] - fn defaults_store_parses_and_is_accessible() { - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("config.yaml"); - 
fs::write(&path, "defaults:\n store: file:///tmp/dev.omni\n").unwrap(); - let config = load_operator_config_at(&path).unwrap(); - assert_eq!(config.default_store(), Some("file:///tmp/dev.omni")); - assert_eq!(config.default_server(), None); - } - - #[test] - fn defaults_server_and_store_together_is_a_loud_error() { - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("config.yaml"); - fs::write( - &path, - "defaults:\n server: prod\n store: file:///tmp/dev.omni\n", - ) - .unwrap(); - let err = load_operator_config_at(&path).unwrap_err().to_string(); - assert!(err.contains("binds one entity"), "{err}"); - } - - #[test] - fn defaults_store_with_default_graph_is_a_loud_error() { - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("config.yaml"); - fs::write( - &path, - "defaults:\n store: file:///tmp/dev.omni\n default_graph: knowledge\n", - ) - .unwrap(); - let err = load_operator_config_at(&path).unwrap_err().to_string(); - assert!(err.contains("already a single graph"), "{err}"); - } - - #[test] - fn unknown_keys_warn_but_load() { - // A file written for a later slice (servers/aliases) must load - // cleanly today: warn-only forward compatibility. - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("config.yaml"); - fs::write( - &path, - "operator:\n actor: act-a\n color: green\nservers:\n prod:\n url: https://example.com\naliases: {}\n", - ) - .unwrap(); - let config = load_operator_config_at(&path).unwrap(); - assert_eq!(config.actor(), Some("act-a")); - let warnings = config.unknown_key_warnings(); - // `servers` (PR 2) and `aliases` (PR 3) are known keys now. 
- assert_eq!(warnings.len(), 1, "{warnings:?}"); - assert!(warnings.iter().any(|w| w.contains("`operator.color`"))); - assert_eq!(config.servers["prod"].url, "https://example.com"); - } - - #[test] - fn parses_profiles_clusters_and_scope_defaults() { - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("config.yaml"); - let yaml = "\ -defaults: - server: prod - default_graph: knowledge -servers: - prod: - url: https://example.com -clusters: - brain: - root: s3://acme/clusters/brain -profiles: - staging: - server: staging - default_graph: knowledge - brain-admin: - cluster: brain - default_graph: knowledge -"; - fs::write(&path, yaml).unwrap(); - let config = load_operator_config_at(&path).unwrap(); - assert_eq!(config.default_server(), Some("prod")); - assert_eq!(config.default_graph(), Some("knowledge")); - assert_eq!(config.cluster_root("brain"), Some("s3://acme/clusters/brain")); - assert_eq!( - config.profile("staging").unwrap().binding("staging").unwrap(), - ScopeBinding::Server("staging".into()) - ); - assert_eq!( - config - .profile("brain-admin") - .unwrap() - .binding("brain-admin") - .unwrap(), - ScopeBinding::Cluster("brain".into()) - ); - // No unknown-key warnings for the new blocks. 
- assert!(config.unknown_key_warnings().is_empty(), "{:?}", config.unknown_key_warnings()); - } - - #[test] - fn profile_binding_rejects_zero_or_multiple_entities() { - let none = OperatorProfile::default(); - let err = none.binding("p").unwrap_err().to_string(); - assert!(err.contains("binds no scope"), "{err}"); - - let two = OperatorProfile { - server: Some("prod".into()), - store: Some("graph.omni".into()), - ..Default::default() - }; - let err = two.binding("p").unwrap_err().to_string(); - assert!(err.contains("binds 2 scopes"), "{err}"); - assert!(err.contains("server") && err.contains("store"), "{err}"); - } - - #[test] - fn unknown_keys_in_a_profile_warn() { - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("config.yaml"); - fs::write( - &path, - "profiles:\n p:\n server: prod\n flavour: spicy\n", - ) - .unwrap(); - let config = load_operator_config_at(&path).unwrap(); - let warnings = config.unknown_key_warnings(); - assert!( - warnings.iter().any(|w| w.contains("`profiles.p.flavour`")), - "{warnings:?}" - ); - } - - #[test] - fn malformed_yaml_is_a_loud_error() { - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("config.yaml"); - fs::write(&path, "operator: [not, a, mapping\n").unwrap(); - let err = load_operator_config_at(&path).unwrap_err(); - assert!(err.to_string().contains("could not parse operator config")); - } - - #[test] - fn find_server_for_url_longest_prefix_no_substring_traps() { - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("config.yaml"); - fs::write( - &path, - "servers:\n dev:\n url: http://h:8080\n dev-spike:\n url: http://h:8080/graphs/spike\n", - ) - .unwrap(); - let config = load_operator_config_at(&path).unwrap(); - assert_eq!(config.find_server_for_url("http://h:8080"), Some("dev")); - assert_eq!( - config.find_server_for_url("http://h:8080/graphs/other"), - Some("dev") - ); - // longest prefix wins - assert_eq!( - 
config.find_server_for_url("http://h:8080/graphs/spike/queries/q"), - Some("dev-spike") - ); - // no substring trap: a different port/host must not match - assert_eq!(config.find_server_for_url("http://h:8080-evil/x"), None); - assert_eq!(config.find_server_for_url("http://other:9999"), None); - } - - #[test] - fn server_lookup_supports_targeting() { - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("config.yaml"); - fs::write( - &path, - "servers:\n intel-dev:\n url: http://127.0.0.1:8080/\n", - ) - .unwrap(); - let config = load_operator_config_at(&path).unwrap(); - // the --server resolution shape: bare url and graph-scoped url - let base = config.servers["intel-dev"].url.trim_end_matches('/'); - assert_eq!(base, "http://127.0.0.1:8080"); - assert_eq!( - format!("{base}/graphs/spike"), - "http://127.0.0.1:8080/graphs/spike" - ); - } - - #[test] - fn token_env_name_uppercases_and_underscores() { - assert_eq!(token_env_name("intel-dev"), "OMNIGRAPH_TOKEN_INTEL_DEV"); - assert_eq!(token_env_name("prod"), "OMNIGRAPH_TOKEN_PROD"); - } - - #[test] - fn credentials_roundtrip_rotate_remove_preserving_other_sections() { - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("credentials"); - - rewrite_credentials_at(&path, "prod", Some("tok-1")).unwrap(); - rewrite_credentials_at(&path, "dev", Some("tok-dev")).unwrap(); - assert_eq!( - read_credential_at(&path, "prod").unwrap().as_deref(), - Some("tok-1") - ); - - // rotate prod; dev preserved - rewrite_credentials_at(&path, "prod", Some("tok-2")).unwrap(); - assert_eq!( - read_credential_at(&path, "prod").unwrap().as_deref(), - Some("tok-2") - ); - assert_eq!( - read_credential_at(&path, "dev").unwrap().as_deref(), - Some("tok-dev") - ); - - // remove prod; dev preserved; removal is idempotent - rewrite_credentials_at(&path, "prod", None).unwrap(); - rewrite_credentials_at(&path, "prod", None).unwrap(); - assert_eq!(read_credential_at(&path, "prod").unwrap(), None); - assert_eq!( - 
read_credential_at(&path, "dev").unwrap().as_deref(), - Some("tok-dev") - ); - } - - #[cfg(unix)] - #[test] - fn credentials_written_0600_and_over_permissive_refused() { - use std::os::unix::fs::PermissionsExt; - let dir = tempfile::tempdir().unwrap(); - let path = dir.path().join("credentials"); - rewrite_credentials_at(&path, "prod", Some("tok")).unwrap(); - let mode = fs::metadata(&path).unwrap().permissions().mode(); - assert_eq!(mode & 0o777, 0o600, "written {:o}", mode & 0o777); - - fs::set_permissions(&path, fs::Permissions::from_mode(0o644)).unwrap(); - let err = read_credential_at(&path, "prod").unwrap_err(); - assert!(err.to_string().contains("chmod 600"), "{err}"); - } - - #[test] - fn expand_tilde_resolves_home_prefix() { - let home = env::home_dir().unwrap(); - assert_eq!(expand_tilde("~"), home); - assert_eq!(expand_tilde("~/x/y"), home.join("x/y")); - assert_eq!(expand_tilde("/abs/path"), PathBuf::from("/abs/path")); - assert_eq!(expand_tilde("rel/path"), PathBuf::from("rel/path")); - } -} diff --git a/crates/omnigraph-cli/src/output.rs b/crates/omnigraph-cli/src/output.rs deleted file mode 100644 index 80de625..0000000 --- a/crates/omnigraph-cli/src/output.rs +++ /dev/null @@ -1,1034 +0,0 @@ -//! Human/JSON output formatting for every command (moved verbatim from -//! main.rs in the modularization). - -use super::*; - -#[derive(Debug, Serialize)] -pub(crate) struct LoadOutput { - pub(crate) uri: String, - pub(crate) branch: String, - pub(crate) mode: &'static str, - /// Present only when `--from` was given; echoes the requested base. 
-#[serde(skip_serializing_if = "Option::is_none")] - pub(crate) base_branch: Option<String>, - pub(crate) branch_created: bool, - pub(crate) nodes_loaded: usize, - pub(crate) edges_loaded: usize, - pub(crate) node_types_loaded: usize, - pub(crate) edge_types_loaded: usize, -} - -pub(crate) fn load_output_from_tables( - uri: &str, - branch: &str, - mode: &'static str, - output: &IngestOutput, -) -> LoadOutput { - let mut nodes_loaded = 0; - let mut edges_loaded = 0; - let mut node_types_loaded = 0; - let mut edge_types_loaded = 0; - for table in &output.tables { - if table.table_key.starts_with("node:") { - nodes_loaded += table.rows_loaded; - node_types_loaded += 1; - } else if table.table_key.starts_with("edge:") { - edges_loaded += table.rows_loaded; - edge_types_loaded += 1; - } - } - LoadOutput { - uri: uri.to_string(), - branch: branch.to_string(), - mode, - base_branch: output.base_branch.clone(), - branch_created: output.branch_created, - nodes_loaded, - edges_loaded, - node_types_loaded, - edge_types_loaded, - } -} - -/// The local arm's twin of `load_output_from_tables`: build the same -/// `LoadOutput` from the engine `LoadResult` directly (the remote arm only -/// has the wire `IngestOutput`'s table list; the local arm has the full -/// result). Both load mappings live here, next to the struct; this is RFC-009 -/// Phase 2's "one place" for the `-> LoadOutput` mapping that used to fork -/// between this file and main.rs's inline construction. 
-pub(crate) fn load_output_from_result( - uri: &str, - branch: &str, - mode: &'static str, - result: &omnigraph::loader::LoadResult, -) -> LoadOutput { - LoadOutput { - uri: uri.to_string(), - branch: branch.to_string(), - mode, - base_branch: result.base_branch.clone(), - branch_created: result.branch_created, - nodes_loaded: result.nodes_loaded.values().sum(), - edges_loaded: result.edges_loaded.values().sum(), - node_types_loaded: result.nodes_loaded.len(), - edge_types_loaded: result.edges_loaded.len(), - } -} - -#[derive(Debug, Serialize)] -pub(crate) struct SchemaPlanOutput<'a> { - pub(crate) uri: &'a str, - pub(crate) supported: bool, - pub(crate) step_count: usize, - pub(crate) steps: &'a [SchemaMigrationStep], -} - -pub(crate) fn print_schema_apply_human(output: &SchemaApplyOutput) { - println!("schema apply for {}", output.uri); - println!("supported: {}", if output.supported { "yes" } else { "no" }); - println!("applied: {}", if output.applied { "yes" } else { "no" }); - println!("manifest_version: {}", output.manifest_version); - if output.steps.is_empty() { - println!("no schema changes"); - return; - } - for step in &output.steps { - println!("- {}", render_schema_plan_step(step)); - } -} - -pub(crate) fn query_kind_label(kind: QueryLintQueryKind) -> &'static str { - match kind { - QueryLintQueryKind::Read => "read", - QueryLintQueryKind::Mutation => "mutation", - } -} - -pub(crate) fn severity_label(severity: QueryLintSeverity) -> &'static str { - match severity { - QueryLintSeverity::Error => "ERROR", - QueryLintSeverity::Warning => "WARN ", - QueryLintSeverity::Info => "INFO ", - } -} - -pub(crate) fn print_query_lint_human(output: &QueryLintOutput) { - for result in &output.results { - match result.status { - QueryLintStatus::Ok => { - println!( - "OK query `{}` ({})", - result.name, - query_kind_label(result.kind) - ); - } - QueryLintStatus::Error => { - println!( - "ERROR query `{}`: {}", - result.name, - 
result.error.as_deref().unwrap_or("unknown error") - ); - } - } - - for warning in &result.warnings { - println!("WARN query `{}`: {}", result.name, warning); - } - } - - for finding in &output.findings { - println!("{} {}", severity_label(finding.severity), finding.message); - } - - println!( - "INFO Lint complete: {} queries processed ({} error(s), {} warning(s), {} info item(s))", - output.queries_processed, output.errors, output.warnings, output.infos - ); -} - -pub(crate) fn finish_query_lint(output: &QueryLintOutput, json: bool) -> Result<()> { - if json { - print_json(output)?; - } else { - print_query_lint_human(output); - } - - if output.status == QueryLintStatus::Error { - io::stdout().flush()?; - std::process::exit(1); - } - - Ok(()) -} - -pub(crate) fn print_json(value: &T) -> Result<()> { - println!("{}", serde_json::to_string_pretty(value)?); - Ok(()) -} - -pub(crate) fn print_cluster_validate_human(output: &ValidateOutput) { - if output.ok { - println!( - "cluster config valid: {} resource(s), {} dependency edge(s)", - output.resources.len(), - output.dependencies.len() - ); - } else { - println!("cluster config invalid"); - } - print_cluster_diagnostics(&output.diagnostics); -} - -pub(crate) fn print_cluster_plan_human(output: &PlanOutput) { - if output.ok { - println!( - "cluster plan: {} change(s), {} approval gate(s)", - output.changes.len(), - output.approvals_required.len() - ); - for change in &output.changes { - let bindings = if change.binding_change { " [bindings]" } else { "" }; - println!(" {:?} {}{bindings}", change.operation, change.resource); - if let Some(migration) = &change.migration { - if !migration.supported { - println!(" migration UNSUPPORTED:"); - } - for step in &migration.steps { - println!( - " {}", - serde_json::to_string(step).unwrap_or_else(|_| format!("{step:?}")) - ); - } - } - } - if output.changes.is_empty() { - println!(" no changes"); - } - } else { - println!("cluster plan failed"); - } - 
print_cluster_diagnostics(&output.diagnostics); -} - -pub(crate) fn print_cluster_apply_human(output: &ApplyOutput) { - if output.ok { - println!( - "cluster apply: {} applied, {} deferred/blocked", - output.applied_count, output.deferred_count - ); - } else { - println!("cluster apply failed"); - } - // The change list prints on failure too: an operator debugging a partial - // apply (payload or state-write error) needs to see what was attempted. - print_cluster_apply_changes(&output.changes); - if output.ok { - let state = &output.state_observations; - println!( - " state: revision {}, converged: {}, written: {}", - state.state_revision, output.converged, output.state_written - ); - println!(" note: cluster-booted servers (--cluster) serve this on their next restart; omnigraph.yaml deployments are unaffected"); - } - print_cluster_diagnostics(&output.diagnostics); -} - -pub(crate) fn print_cluster_apply_changes(changes: &[omnigraph_cluster::PlanChange]) { - for change in changes { - let bindings = if change.binding_change { " [bindings]" } else { "" }; - match (&change.disposition, change.reason.as_deref()) { - (Some(disposition), Some(reason)) => println!( - " {:?} {}{bindings} [{disposition:?}: {reason}]", - change.operation, change.resource - ), - (Some(disposition), None) => println!( - " {:?} {}{bindings} [{disposition:?}]", - change.operation, change.resource - ), - _ => println!(" {:?} {}{bindings}", change.operation, change.resource), - } - } - if changes.is_empty() { - println!(" no changes"); - } -} - -pub(crate) fn print_cluster_status_human(output: &StatusOutput) { - if output.ok { - let state = &output.state_observations; - if state.state_found { - println!( - "cluster state: revision {}, {} resource(s)", - state.state_revision, state.resource_count - ); - if let Some(digest) = state.applied_config_digest.as_deref() { - println!(" applied config: {digest}"); - } - if state.locked { - println!(" lock: held{}", cluster_lock_summary(state)); - } else { 
- println!(" lock: not held"); - } - } else { - println!("cluster state missing"); - } - } else { - println!("cluster status failed"); - } - print_cluster_diagnostics(&output.diagnostics); -} - -pub(crate) fn print_cluster_state_sync_human(output: &StateSyncOutput) { - let operation = match output.operation { - omnigraph_cluster::StateSyncOperation::Refresh => "refresh", - omnigraph_cluster::StateSyncOperation::Import => "import", - }; - if output.ok { - let state = &output.state_observations; - println!( - "cluster {operation}: revision {}, {} resource(s)", - state.state_revision, state.resource_count - ); - if let Some(cas) = state.state_cas.as_deref() { - println!(" state_cas: {cas}"); - } - if state.locked { - println!(" lock: acquired{}", cluster_lock_summary(state)); - } else { - println!(" lock: not acquired"); - } - } else { - println!("cluster {operation} failed"); - } - print_cluster_diagnostics(&output.diagnostics); -} - -pub(crate) fn print_cluster_force_unlock_human(output: &ForceUnlockOutput) { - if output.ok { - if output.lock_removed { - println!( - "cluster force-unlock: removed lock{}", - cluster_lock_summary(&output.state_observations) - ); - } else { - println!("cluster force-unlock: no lock removed"); - } - } else { - println!("cluster force-unlock failed"); - if output.state_observations.locked { - println!( - " lock: held{}", - cluster_lock_summary(&output.state_observations) - ); - } - } - print_cluster_diagnostics(&output.diagnostics); -} - -pub(crate) fn cluster_lock_summary(state: &omnigraph_cluster::StateObservations) -> String { - let Some(lock_id) = state.lock_id.as_deref() else { - return String::new(); - }; - let mut parts = vec![format!("id={lock_id}")]; - if let Some(operation) = state.lock_operation.as_deref() { - parts.push(format!("operation={operation}")); - } - if let Some(pid) = state.lock_pid { - parts.push(format!("pid={pid}")); - } - if let Some(created_at) = state.lock_created_at.as_deref() { - 
parts.push(format!("created_at={created_at}")); - } - if let Some(age_seconds) = state.lock_age_seconds { - parts.push(format!("age_seconds={age_seconds}")); - } - format!(" ({})", parts.join(", ")) -} - -pub(crate) fn print_cluster_diagnostics(diagnostics: &[omnigraph_cluster::Diagnostic]) { - for diagnostic in diagnostics { - let label = match diagnostic.severity { - DiagnosticSeverity::Error => "ERROR", - DiagnosticSeverity::Warning => "WARN ", - }; - println!( - "{label} {} {}: {}", - diagnostic.code, diagnostic.path, diagnostic.message - ); - } -} - -pub(crate) fn finish_cluster_validate(output: &ValidateOutput, json: bool) -> Result<()> { - if json { - print_json(output)?; - } else { - print_cluster_validate_human(output); - } - if !output.ok { - io::stdout().flush()?; - std::process::exit(1); - } - Ok(()) -} - -pub(crate) fn finish_cluster_plan(output: &PlanOutput, json: bool) -> Result<()> { - if json { - print_json(output)?; - } else { - print_cluster_plan_human(output); - } - if !output.ok { - io::stdout().flush()?; - std::process::exit(1); - } - Ok(()) -} - -pub(crate) fn finish_cluster_apply(output: &ApplyOutput, json: bool) -> Result<()> { - if json { - print_json(output)?; - } else { - print_cluster_apply_human(output); - } - if !output.ok { - io::stdout().flush()?; - std::process::exit(1); - } - Ok(()) -} - -pub(crate) fn finish_cluster_approve(output: &ApproveOutput, json: bool) -> Result<()> { - if json { - print_json(output)?; - } else if output.ok { - println!( - "cluster approve: {} {} approved by {} (approval {})", - output - .operation - .as_ref() - .map(|operation| format!("{operation:?}").to_lowercase()) - .unwrap_or_default(), - output.resource.as_deref().unwrap_or("?"), - output.approved_by.as_deref().unwrap_or("?"), - output.approval_id.as_deref().unwrap_or("?"), - ); - print_cluster_diagnostics(&output.diagnostics); - } else { - println!("cluster approve failed"); - print_cluster_diagnostics(&output.diagnostics); - } - if !output.ok { - 
io::stdout().flush()?; - std::process::exit(1); - } - Ok(()) -} - -pub(crate) fn finish_cluster_status(output: &StatusOutput, json: bool) -> Result<()> { - if json { - print_json(output)?; - } else { - print_cluster_status_human(output); - } - if !output.ok { - io::stdout().flush()?; - std::process::exit(1); - } - Ok(()) -} - -pub(crate) fn finish_cluster_state_sync(output: &StateSyncOutput, json: bool) -> Result<()> { - if json { - print_json(output)?; - } else { - print_cluster_state_sync_human(output); - } - if !output.ok { - io::stdout().flush()?; - std::process::exit(1); - } - Ok(()) -} - -pub(crate) fn finish_cluster_force_unlock(output: &ForceUnlockOutput, json: bool) -> Result<()> { - if json { - print_json(output)?; - } else { - print_cluster_force_unlock_human(output); - } - if !output.ok { - io::stdout().flush()?; - std::process::exit(1); - } - Ok(()) -} - -pub(crate) fn print_load_human(payload: &LoadOutput) { - println!( - "loaded {} on branch {} with {}: {} nodes across {} node types, {} edges across {} edge types", - payload.uri, - payload.branch, - payload.mode, - payload.nodes_loaded, - payload.node_types_loaded, - payload.edges_loaded, - payload.edge_types_loaded - ); - if payload.branch_created { - if let Some(base) = &payload.base_branch { - println!("branch {} created from {}", payload.branch, base); - } - } -} - -pub(crate) fn print_ingest_human(output: &IngestOutput) { - println!( - "ingested {} into branch {} from {} with {} ({})", - output.uri, - output.branch, - output.base_branch.as_deref().unwrap_or("main"), - output.mode.as_str(), - if output.branch_created { - "branch created" - } else { - "branch exists" - } - ); - for table in &output.tables { - println!("{} rows_loaded={}", table.table_key, table.rows_loaded); - } - if let Some(actor_id) = &output.actor_id { - println!("actor_id: {}", actor_id); - } -} - -pub(crate) fn print_schema_plan_human(uri: &str, plan: &SchemaMigrationPlan) { - println!("schema plan for {}", uri); - 
println!("supported: {}", if plan.supported { "yes" } else { "no" }); - if plan.steps.is_empty() { - println!("no schema changes"); - return; - } - for step in &plan.steps { - println!("- {}", render_schema_plan_step(step)); - } -} - -pub(crate) fn render_schema_plan_step(step: &SchemaMigrationStep) -> String { - match step { - SchemaMigrationStep::AddType { type_kind, name } => { - format!("add {} type '{}'", schema_type_kind_label(*type_kind), name) - } - SchemaMigrationStep::RenameType { - type_kind, - from, - to, - } => format!( - "rename {} type '{}' -> '{}'", - schema_type_kind_label(*type_kind), - from, - to - ), - SchemaMigrationStep::AddProperty { - type_kind, - type_name, - property_name, - property_type, - } => format!( - "add property '{}.{}' ({}) on {} '{}'", - type_name, - property_name, - render_prop_type(property_type), - schema_type_kind_label(*type_kind), - type_name - ), - SchemaMigrationStep::RenameProperty { - type_kind, - type_name, - from, - to, - } => format!( - "rename property '{}.{}' -> '{}.{}' on {} '{}'", - type_name, - from, - type_name, - to, - schema_type_kind_label(*type_kind), - type_name - ), - SchemaMigrationStep::AddConstraint { - type_kind, - type_name, - constraint, - } => format!( - "add constraint {} on {} '{}'", - render_constraint(constraint), - schema_type_kind_label(*type_kind), - type_name - ), - SchemaMigrationStep::UpdateTypeMetadata { - type_kind, - name, - annotations, - } => format!( - "update metadata on {} '{}' ({})", - schema_type_kind_label(*type_kind), - name, - render_annotations(annotations) - ), - SchemaMigrationStep::UpdatePropertyMetadata { - type_kind, - type_name, - property_name, - annotations, - } => format!( - "update metadata on property '{}.{}' of {} '{}' ({})", - type_name, - property_name, - schema_type_kind_label(*type_kind), - type_name, - render_annotations(annotations) - ), - SchemaMigrationStep::DropType { - type_kind, - name, - mode, - } => format!( - "drop {} type '{}' ({} mode)", - 
schema_type_kind_label(*type_kind), - name, - drop_mode_label(*mode), - ), - SchemaMigrationStep::DropProperty { - type_kind, - type_name, - property_name, - mode, - } => format!( - "drop property '{}.{}' of {} '{}' ({} mode)", - type_name, - property_name, - schema_type_kind_label(*type_kind), - type_name, - drop_mode_label(*mode), - ), - SchemaMigrationStep::UnsupportedChange { entity, reason, .. } => { - // When a schema-lint code is attached, render code + tier - // so operators see at-a-glance the kind of risk (destructive - // / validated / safe), not just the rule identifier. - // Reach the diagnostic via the `diagnostic()` helper so the - // CLI doesn't need to know how the lookup works. - match step.diagnostic() { - Some(diag) => format!( - "unsupported change on {} [{}, {}]: {}", - entity, - diag.code, - schema_lint_tier_label(diag.tier), - reason, - ), - None => format!("unsupported change on {}: {}", entity, reason), - } - } - } -} - -pub(crate) fn schema_type_kind_label(kind: omnigraph_compiler::SchemaTypeKind) -> &'static str { - match kind { - omnigraph_compiler::SchemaTypeKind::Interface => "interface", - omnigraph_compiler::SchemaTypeKind::Node => "node", - omnigraph_compiler::SchemaTypeKind::Edge => "edge", - } -} - -pub(crate) fn schema_lint_tier_label(tier: omnigraph_compiler::SafetyTier) -> &'static str { - match tier { - omnigraph_compiler::SafetyTier::Safe => "safe", - omnigraph_compiler::SafetyTier::Validated => "validated", - omnigraph_compiler::SafetyTier::Destructive => "destructive", - } -} - -pub(crate) fn drop_mode_label(mode: omnigraph_compiler::DropMode) -> &'static str { - match mode { - omnigraph_compiler::DropMode::Soft => "soft", - omnigraph_compiler::DropMode::Hard => "hard", - } -} - -pub(crate) fn render_prop_type(prop_type: &omnigraph_compiler::PropType) -> String { - let base = if let Some(values) = &prop_type.enum_values { - format!("Enum({})", values.join("|")) - } else { - prop_type.scalar.to_string() - }; - let base = 
if prop_type.list { - format!("[{}]", base) - } else { - base - }; - if prop_type.nullable { - format!("{}?", base) - } else { - base - } -} - -pub(crate) fn render_constraint(constraint: &omnigraph_compiler::schema::ast::Constraint) -> String { - match constraint { - omnigraph_compiler::schema::ast::Constraint::Key(columns) => { - format!("@key({})", columns.join(", ")) - } - omnigraph_compiler::schema::ast::Constraint::Unique(columns) => { - format!("@unique({})", columns.join(", ")) - } - omnigraph_compiler::schema::ast::Constraint::Index(columns) => { - format!("@index({})", columns.join(", ")) - } - omnigraph_compiler::schema::ast::Constraint::Range { property, min, max } => { - format!("@range({}, {:?}, {:?})", property, min, max) - } - omnigraph_compiler::schema::ast::Constraint::Check { property, pattern } => { - format!("@check({}, {:?})", property, pattern) - } - } -} - -pub(crate) fn render_annotations(annotations: &[omnigraph_compiler::schema::ast::Annotation]) -> String { - annotations - .iter() - .map(|annotation| { - let mut args: Vec<String> = Vec::new(); - // Values are parsed via `decode_string_literal` (quotes stripped), so - // re-quote them as string literals on render; otherwise a value with - // non-ident chars (e.g. `model=openai/text-embedding-3-large`) fails to - // round-trip back through the schema parser (`annotation_kwarg` wants a - // quoted `literal`, not a bare token). 
- if let Some(value) = &annotation.value { - args.push(format!("\"{}\"", value)); - } - for (key, val) in &annotation.kwargs { - args.push(format!("{}=\"{}\"", key, val)); - } - if args.is_empty() { - format!("@{}", annotation.name) - } else { - format!("@{}({})", annotation.name, args.join(", ")) - } - }) - .collect::>() - .join(", ") -} - -pub(crate) fn print_embed_human(output: &EmbedOutput) { - println!( - "embedded {} rows (selected {}, cleaned {}) from {} -> {} [{} {}d]", - output.embedded_rows, - output.selected_rows, - output.cleaned_rows, - output.input, - output.output, - output.mode, - output.dimension - ); -} - -pub(crate) fn print_snapshot_human(branch: &str, manifest_version: u64, entries: &[SnapshotTableOutput]) { - println!("branch: {}", branch); - println!("manifest_version: {}", manifest_version); - for entry in entries { - println!( - "{} v{} branch={} rows={}", - entry.table_key, - entry.table_version, - entry.table_branch.as_deref().unwrap_or("main"), - entry.row_count - ); - } -} - -pub(crate) fn print_read_output( - output: &ReadOutput, - format: ReadOutputFormat, -) -> Result<()> { - println!( - "{}", - render_read(output, format, &resolve_table_render_options())? 
- ); - Ok(()) -} - -pub(crate) fn print_change_human(output: &ChangeOutput) { - println!( - "changed {} via {}: {} nodes, {} edges", - output.branch, output.query_name, output.affected_nodes, output.affected_edges - ); - if let Some(actor_id) = &output.actor_id { - println!("actor_id: {}", actor_id); - } -} - -pub(crate) fn print_commit_list_human(commits: &[CommitOutput]) { - for commit in commits { - let branch = commit.manifest_branch.as_deref().unwrap_or("main"); - println!( - "{} branch={} version={}{}", - commit.graph_commit_id, - branch, - commit.manifest_version, - commit - .actor_id - .as_deref() - .map(|actor| format!(" actor={}", actor)) - .unwrap_or_default() - ); - } -} - -pub(crate) fn print_commit_human(commit: &CommitOutput) { - println!("graph_commit_id: {}", commit.graph_commit_id); - println!( - "manifest_branch: {}", - commit.manifest_branch.as_deref().unwrap_or("main") - ); - println!("manifest_version: {}", commit.manifest_version); - if let Some(parent_commit_id) = &commit.parent_commit_id { - println!("parent_commit_id: {}", parent_commit_id); - } - if let Some(merged_parent_commit_id) = &commit.merged_parent_commit_id { - println!("merged_parent_commit_id: {}", merged_parent_commit_id); - } - if let Some(actor_id) = &commit.actor_id { - println!("actor_id: {}", actor_id); - } - println!("created_at: {}", commit.created_at); -} - -pub(crate) fn print_policy_explain(decision: &PolicyDecision, actor_id: &str, request: &PolicyRequest) { - println!( - "decision: {}", - if decision.allowed { "allow" } else { "deny" } - ); - println!("actor: {}", actor_id); - println!("action: {}", request.action); - if let Some(branch) = &request.branch { - println!("branch: {}", branch); - } - if let Some(target_branch) = &request.target_branch { - println!("target_branch: {}", target_branch); - } - if let Some(rule_id) = &decision.matched_rule_id { - println!("matched_rule: {}", rule_id); - } - println!("message: {}", decision.message); -} - 
-#[derive(serde::Serialize)] -pub(crate) struct QueriesIssue { - pub(crate) query: String, - pub(crate) message: String, -} - -#[derive(serde::Serialize)] -pub(crate) struct QueriesValidateOutput { - pub(crate) ok: bool, - pub(crate) breakages: Vec, - pub(crate) warnings: Vec, -} - -#[derive(serde::Serialize)] -pub(crate) struct QueriesParam { - pub(crate) name: String, - #[serde(rename = "type")] - pub(crate) type_name: String, - pub(crate) nullable: bool, -} - -#[derive(serde::Serialize)] -pub(crate) struct QueriesListItem { - pub(crate) name: String, - pub(crate) mcp_expose: bool, - pub(crate) tool_name: Option, - pub(crate) mutation: bool, - /// `@description` from the query declaration — what the query is for. - /// Carried so the CLI catalog matches the HTTP `GET /queries` surface. - #[serde(skip_serializing_if = "Option::is_none")] - pub(crate) description: Option, - /// `@instruction` from the query declaration — how/when to invoke it. - #[serde(skip_serializing_if = "Option::is_none")] - pub(crate) instruction: Option, - pub(crate) params: Vec, -} - -#[derive(serde::Serialize)] -pub(crate) struct QueriesListOutput { - pub(crate) queries: Vec, -} - -pub(crate) fn finish_login( - server: &str, - credentials_path: &std::path::Path, - declared: bool, - json: bool, -) -> Result<()> { - if json { - print_json(&serde_json::json!({ - "server": server, - "credentials_path": credentials_path.display().to_string(), - "declared": declared, - }))?; - } else { - println!( - "stored credential for '{server}' in {}", - credentials_path.display() - ); - } - if !declared { - eprintln!( - "note: '{server}' is not declared under servers: in the operator config; the token applies once you add `servers:\n {server}:\n url: ` to ~/.omnigraph/config.yaml" - ); - } - Ok(()) -} - -pub(crate) fn finish_logout( - server: &str, - credentials_path: &std::path::Path, - json: bool, -) -> Result<()> { - if json { - print_json(&serde_json::json!({ - "server": server, - 
"credentials_path": credentials_path.display().to_string(), - }))?; - } else { - println!( - "removed credential for '{server}' from {}", - credentials_path.display() - ); - } - Ok(()) -} - -#[derive(Debug, Serialize)] -pub(crate) struct ProfileListItem { - pub(crate) name: String, - /// `server: ` / `cluster: ` / `store: ` / `invalid: `. - pub(crate) binding: String, - /// `server` | `cluster` | `store` | `invalid`. - pub(crate) scope_kind: String, - /// The bound server/cluster name, or the store URI. `None` when invalid. - pub(crate) target: Option, - pub(crate) valid: bool, - pub(crate) error: Option, - pub(crate) default_graph: Option, - pub(crate) active: bool, -} - -#[derive(Debug, Serialize)] -pub(crate) struct ProfileDetail { - /// Profile name, or `(defaults)` for the no-name flat-defaults view. - pub(crate) name: String, - /// `server` | `cluster` | `store` | `none`. - pub(crate) scope_kind: String, - /// The bound server/cluster name, or the store URI. - pub(crate) target: Option, - /// Resolved endpoint: a server's URL / a cluster's root / the store URI; - /// `None` if a named server/cluster isn't defined in this config. 
- pub(crate) endpoint: Option, - pub(crate) default_graph: Option, - pub(crate) output_format: Option, -} - -pub(crate) fn print_profile_list(items: &[ProfileListItem], json: bool) -> Result<()> { - if json { - return print_json(&items); - } - if items.is_empty() { - println!("no profiles defined in the operator config"); - return Ok(()); - } - for item in items { - let active = if item.active { " (active)" } else { "" }; - let graph = item - .default_graph - .as_deref() - .map(|g| format!(" · graph: {g}")) - .unwrap_or_default(); - println!("{}{active} {}{graph}", item.name, item.binding); - } - Ok(()) -} - -pub(crate) fn print_profile_detail(detail: &ProfileDetail, json: bool) -> Result<()> { - if json { - return print_json(detail); - } - println!("profile: {}", detail.name); - let target = detail - .target - .as_deref() - .map(|t| format!(" {t}")) - .unwrap_or_default(); - println!(" scope: {}{target}", detail.scope_kind); - if let Some(endpoint) = &detail.endpoint { - println!(" endpoint: {endpoint}"); - } else if matches!(detail.scope_kind.as_str(), "server" | "cluster") { - println!(" endpoint: (undefined — name not in this config)"); - } - if let Some(graph) = &detail.default_graph { - println!(" default graph: {graph}"); - } - if let Some(format) = &detail.output_format { - println!(" output: {format}"); - } - Ok(()) -} - -/// Table prefs cascade (RFC-011): operator defaults.table_* > built-in. 
-pub(crate) fn resolve_table_render_options() -> ReadRenderOptions { - let operator = crate::operator::load_operator_config().unwrap_or_default(); - ReadRenderOptions { - max_column_width: operator.defaults.table_max_column_width.unwrap_or(80), - cell_layout: operator.defaults.table_cell_layout.unwrap_or_default(), - } -} - -#[cfg(test)] -mod tests { - use omnigraph_compiler::schema::ast::Annotation; - use omnigraph_compiler::schema::parser::parse_schema; - use std::collections::BTreeMap; - - use super::render_annotations; - - #[test] - fn render_annotations_quotes_values_so_embed_round_trips() { - let mut kwargs = BTreeMap::new(); - kwargs.insert( - "model".to_string(), - "openai/text-embedding-3-large".to_string(), - ); - let embed = Annotation { - name: "embed".to_string(), - value: Some("title".to_string()), - kwargs, - }; - - let rendered = render_annotations(std::slice::from_ref(&embed)); - assert_eq!( - rendered, - r#"@embed("title", model="openai/text-embedding-3-large")"# - ); - - // The bug: an unquoted `model=openai/text-embedding-3-large` is not a - // valid `annotation_kwarg` literal, so `schema show` output did not - // re-parse. The rendered form must round-trip through the grammar. - let schema = format!("node Doc {{\ntitle: String\nembedding: Vector(3) {rendered}\n}}\n"); - let parsed = parse_schema(&schema); - assert!( - parsed.is_ok(), - "rendered @embed must re-parse: {:?}", - parsed.err() - ); - } -} diff --git a/crates/omnigraph-cli/src/planes.rs b/crates/omnigraph-cli/src/planes.rs deleted file mode 100644 index b599076..0000000 --- a/crates/omnigraph-cli/src/planes.rs +++ /dev/null @@ -1,357 +0,0 @@ -//! Declared CLI "planes" (RFC-010 Slice 1). -//! -//! Every subcommand belongs to exactly one plane. This classification is the -//! single source of truth the wrong-plane guard consumes — and that later -//! RFC-010 slices (the capability surface, plane-grouped help) will consume -//! too. 
The `command_plane` match is **exhaustive on purpose**: adding a -//! `Command` variant is a compile error until its plane is declared, so the -//! surface cannot silently drift from the command set. -//! -//! See [docs/dev/rfc-010-cli-planes-restructure.md]. - -use color_eyre::Result; -use color_eyre::eyre::bail; - -use crate::cli::{Cli, Command, QueriesCommand, SchemaCommand}; - -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -pub(crate) enum Plane { - /// Runs against a graph, embedded **or** via `--server` (the `GraphClient` - /// axis). The only plane on which the data-plane addressing flags - /// (`--server`/`--graph`) apply. - Data, - /// Direct storage access; no server. Maintenance + local-only inspection - /// that must work with the server down. - Storage, - /// Operates on a cluster directory, not a graph URI. - Control, - /// Touches no graph at all — session / config / local tooling. - Session, -} - -impl std::fmt::Display for Plane { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - f.write_str(match self { - Plane::Data => "data", - Plane::Storage => "storage", - Plane::Control => "control", - Plane::Session => "session", - }) - } -} - -/// What a command *needs*, in the user-facing vocabulary (RFC-011). This is the -/// language CLI errors and `--help` speak; `Plane` stays the internal classifier -/// (`Capability` is derived from it, so the two cannot drift). -/// -/// - `any` — graph-scoped data; served via a server scope, or direct against a -/// store scope. Accepts `--server`/`--graph`. -/// - `served` — requires a server. Accepts `--server`/`--graph`. -/// - `direct` — storage-native; opens storage directly, never through a server. -/// - `control` — operates on a cluster (control plane). -/// - `local` — addresses no graph at all. 
-#[derive(Debug, Clone, Copy, PartialEq, Eq)] -pub(crate) enum Capability { - Any, - Served, - Direct, - Control, - Local, -} - -impl Capability { - /// A human phrase for error messages (`` `optimize` is a {…} command ``). - pub(crate) fn describe(self) -> &'static str { - match self { - Capability::Any => "data", - Capability::Served => "served", - Capability::Direct => "direct (storage-native)", - Capability::Control => "cluster control", - Capability::Local => "local", - } - } - - /// `--server`/`--graph` are served-graph addressing: they apply only to the - /// capabilities that reach a graph through a server. - fn accepts_server_addressing(self) -> bool { - matches!(self, Capability::Any | Capability::Served) - } -} - -/// The capability a subcommand needs, derived from its `Plane` (the exhaustive -/// classifier) plus the one Data→Served refinement: `graphs` is remote-only. -/// -/// This reflects *current enforced behavior*, so messages stay truthful: -/// `queries`/`policy` read a cluster's applied state (`Control`). -pub(crate) fn command_capability(cmd: &Command) -> Capability { - if let Command::Graphs { .. } = cmd { - return Capability::Served; - } - match command_plane(cmd) { - Plane::Data => Capability::Any, - Plane::Storage => Capability::Direct, - Plane::Control => Capability::Control, - Plane::Session => Capability::Local, - } -} - -/// The plane a subcommand belongs to. Exhaustive — a new `Command` variant -/// will not compile until classified. Descends into the nested enums where -/// the plane differs per subcommand (`schema plan` is storage while `schema -/// show`/`apply` are data; `queries`/`policy` read cluster applied state). -pub(crate) fn command_plane(cmd: &Command) -> Plane { - match cmd { - Command::Query { .. } - | Command::Mutate { .. } - | Command::Load { .. } - | Command::Ingest { .. } - | Command::Branch { .. } - | Command::Snapshot { .. } - | Command::Export { .. } - | Command::Commit { .. } - | Command::Graphs { .. 
} => Plane::Data, - Command::Schema { - command: SchemaCommand::Show { .. } | SchemaCommand::Apply { .. }, - } => Plane::Data, - Command::Schema { - command: SchemaCommand::Plan { .. }, - } => Plane::Storage, - // `queries` and `policy` tooling now source their inputs from a - // cluster's applied state (`--cluster`), so they live on the control - // plane (RFC-011 — omnigraph.yaml excised from the CLI). - Command::Queries { .. } => Plane::Control, - Command::Policy { .. } => Plane::Control, - Command::Init { .. } - | Command::Optimize { .. } - | Command::Repair { .. } - | Command::Cleanup { .. } - | Command::Lint { .. } => Plane::Storage, - Command::Cluster { .. } => Plane::Control, - Command::Alias { .. } - | Command::Embed(_) - | Command::Login { .. } - | Command::Logout { .. } - | Command::Profile { .. } - | Command::Version => Plane::Session, - } -} - -/// User-facing label for a subcommand (descends one level for the nested -/// families so messages read `schema plan`, `queries validate`, etc.). -pub(crate) fn command_label(cmd: &Command) -> &'static str { - match cmd { - Command::Version => "version", - Command::Login { .. } => "login", - Command::Logout { .. } => "logout", - Command::Profile { .. } => "profile", - Command::Embed(_) => "embed", - Command::Init { .. } => "init", - Command::Load { .. } => "load", - Command::Ingest { .. } => "ingest", - Command::Branch { .. } => "branch", - Command::Schema { command } => match command { - SchemaCommand::Plan { .. } => "schema plan", - SchemaCommand::Apply { .. } => "schema apply", - SchemaCommand::Show { .. } => "schema show", - }, - Command::Lint { .. } => "lint", - Command::Queries { command } => match command { - QueriesCommand::Validate { .. } => "queries validate", - QueriesCommand::List { .. } => "queries list", - }, - Command::Snapshot { .. } => "snapshot", - Command::Export { .. } => "export", - Command::Commit { .. } => "commit", - Command::Query { .. } => "query", - Command::Mutate { .. 
} => "mutate", - Command::Alias { .. } => "alias", - Command::Policy { .. } => "policy", - Command::Optimize { .. } => "optimize", - Command::Repair { .. } => "repair", - Command::Cleanup { .. } => "cleanup", - Command::Cluster { .. } => "cluster", - Command::Graphs { .. } => "graphs", - } -} - -/// The verbs that consume a cluster scope. Maintenance/lint select a graph with -/// `--cluster --graph `; policy/queries inspect the cluster's -/// applied control-plane state and may optionally use `--graph` to select one -/// bundle/registry. `init` is storage-plane too but *creates* a graph (cluster -/// graphs are born from `cluster apply`, not `init`), and `schema plan` takes a -/// positional URI, so the guard rejects `--cluster`/`--graph` there rather than -/// silently dropping the flag. -pub(crate) fn accepts_cluster_addressing(cmd: &Command) -> bool { - matches!( - cmd, - Command::Optimize { .. } - | Command::Repair { .. } - | Command::Cleanup { .. } - // `lint` can type-check a `.gq` against a cluster graph's schema - // (RFC-011): `--cluster --graph `. - | Command::Lint { .. } - // The policy/queries tooling addresses a cluster's applied state - // (RFC-011): `--cluster ` selects the cluster, `--graph ` - // picks a graph's bundle/registry within it. - | Command::Policy { .. } - | Command::Queries { .. } - ) -} - -/// Reject a scope-addressing flag (`--server`/`--cluster`/`--graph`) on a verb -/// that cannot consume it, rather than silently dropping it (the old behavior: -/// e.g. `optimize --server prod` dropped `--server` and failed later with an -/// unrelated message). `alias` gets an extra guard because its binding owns all -/// addressing and several ignored globals sit outside this three-flag guard. -/// Each flag has a distinct valid surface: -/// - `--server` → served-graph scopes (`any`/`served`); -/// - `--cluster` → cluster-scoped direct/control verbs; -/// - `--graph` → any multi-graph scope: a served scope *or* a cluster one. 
-/// RFC-010 Slice 1, generalized for RFC-011 cluster addressing. -pub(crate) fn guard_addressing(cli: &Cli) -> Result<()> { - if let Command::Alias { .. } = &cli.command { - let mut flags = Vec::new(); - if cli.server.is_some() { - flags.push("--server"); - } - if cli.graph.is_some() { - flags.push("--graph"); - } - if cli.store.is_some() { - flags.push("--store"); - } - if cli.cluster.is_some() { - flags.push("--cluster"); - } - if cli.profile.is_some() { - flags.push("--profile"); - } - if cli.as_actor.is_some() { - flags.push("--as"); - } - if !flags.is_empty() { - bail!( - "`alias` uses the server, graph, and stored query declared in \ - `aliases.` in ~/.omnigraph/config.yaml; remove global scope \ - flag(s): {}", - flags.join(", ") - ); - } - } - if cli.server.is_none() && cli.cluster.is_none() && cli.graph.is_none() { - return Ok(()); - } - let capability = command_capability(&cli.command); - let label = command_label(&cli.command); - let cluster_ok = accepts_cluster_addressing(&cli.command); - - if cli.server.is_some() && !capability.accepts_server_addressing() { - bail!( - "`{label}` is a {} command; --server addresses a served graph and does not apply.{}", - capability.describe(), - remediation(capability, &cli.command), - ); - } - if cli.cluster.is_some() && !cluster_ok { - bail!( - "`{label}` is a {} command; --cluster addresses a cluster-scoped command \ - and does not apply.{}", - capability.describe(), - remediation(capability, &cli.command), - ); - } - if cli.graph.is_some() && !(capability.accepts_server_addressing() || cluster_ok) { - bail!( - "`{label}` is a {} command; --graph selects a graph within a server or cluster \ - scope and does not apply.{}", - capability.describe(), - remediation(capability, &cli.command), - ); - } - Ok(()) -} - -/// The "what to do instead" tail for a wrong-address error, by capability. 
-/// Includes its own leading space when non-empty so the caller appends it -/// directly — an empty tail (the served-addressing capabilities, which only -/// reach this fn for a misplaced `--cluster`/`--graph`) leaves no trailing space. -fn remediation(capability: Capability, cmd: &Command) -> &'static str { - match capability { - Capability::Direct => match cmd { - Command::Init { .. } => " Pass a storage URI.", - Command::Optimize { .. } | Command::Repair { .. } | Command::Cleanup { .. } => { - " Pass a storage URI, or --cluster --graph ." - } - _ => " Pass a storage URI.", - }, - Capability::Control => match cmd { - Command::Cluster { .. } => { - " It operates on a cluster config directory (pass --config )." - } - Command::Policy { .. } | Command::Queries { .. } => { - " It operates on a cluster (pass --cluster , or select a cluster profile)." - } - _ => " It operates on a cluster.", - }, - Capability::Local => " It does not address a graph.", - Capability::Any | Capability::Served => "", - } -} - -#[cfg(test)] -mod tests { - use clap::Parser; - - use super::*; - - #[test] - fn server_addressing_allowed_exactly_on_any_and_served() { - // The behavior-preservation contract: `--server`/`--graph` apply to the - // served-graph capabilities (`any`, `served`) and nothing else. This is - // the old "Data plane only" allow set, re-expressed — graphs (the one - // Data→Served verb) was already allowed. 
- assert!(Capability::Any.accepts_server_addressing()); - assert!(Capability::Served.accepts_server_addressing()); - assert!(!Capability::Direct.accepts_server_addressing()); - assert!(!Capability::Control.accepts_server_addressing()); - assert!(!Capability::Local.accepts_server_addressing()); - } - - #[test] - fn command_capability_classifies_representative_verbs() { - let cap = |args: &[&str]| { - command_capability(&Cli::try_parse_from(args).unwrap().command) - }; - // The one Data→Served refinement — if the `graphs` guard were deleted, - // every other assertion here would still pass. - assert_eq!(cap(&["omnigraph", "graphs", "list"]), Capability::Served); - assert_eq!(cap(&["omnigraph", "alias", "who"]), Capability::Local); - assert_eq!(cap(&["omnigraph", "optimize", "graph.omni"]), Capability::Direct); - assert_eq!(cap(&["omnigraph", "schema", "plan", "--schema", "s.pg", "graph.omni"]), Capability::Direct); - assert_eq!(cap(&["omnigraph", "cluster", "status", "--config", "."]), Capability::Control); - assert_eq!(cap(&["omnigraph", "version"]), Capability::Local); - // `queries`/`policy` tooling reads cluster state now (control plane). - assert_eq!(cap(&["omnigraph", "queries", "list"]), Capability::Control); - assert_eq!( - cap(&["omnigraph", "policy", "validate"]), - Capability::Control - ); - } - - #[test] - fn every_capability_describes_distinctly() { - let phrases = [ - Capability::Any.describe(), - Capability::Served.describe(), - Capability::Direct.describe(), - Capability::Control.describe(), - Capability::Local.describe(), - ]; - for (i, a) in phrases.iter().enumerate() { - assert!(!a.is_empty()); - for b in &phrases[i + 1..] 
{ - assert_ne!(a, b); - } - } - } -} diff --git a/crates/omnigraph-cli/src/read_format.rs b/crates/omnigraph-cli/src/read_format.rs index 3ffa9e6..b205b19 100644 --- a/crates/omnigraph-cli/src/read_format.rs +++ b/crates/omnigraph-cli/src/read_format.rs @@ -1,31 +1,9 @@ -use clap::ValueEnum; use color_eyre::eyre::Result; +use omnigraph_server::ReadOutputFormat; use omnigraph_server::api::ReadOutput; -use serde::{Deserialize, Serialize}; +use omnigraph_server::config::TableCellLayout; use serde_json::{Map, Value}; -/// Output rendering format for read-shaped commands (`read`/`query`/`alias`). -/// A CLI presentation concern — lives here, not in the server. -#[derive(Debug, Clone, Copy, Default, Eq, PartialEq, Serialize, Deserialize, ValueEnum)] -#[serde(rename_all = "snake_case")] -pub enum ReadOutputFormat { - #[default] - Table, - Kv, - Csv, - Jsonl, - Json, -} - -/// How an over-wide table cell is laid out when rendering `--format table`. -#[derive(Debug, Clone, Copy, Default, Eq, PartialEq, Serialize, Deserialize, ValueEnum)] -#[serde(rename_all = "snake_case")] -pub enum TableCellLayout { - #[default] - Truncate, - Wrap, -} - pub struct ReadRenderOptions { pub max_column_width: usize, pub cell_layout: TableCellLayout, diff --git a/crates/omnigraph-cli/src/scope.rs b/crates/omnigraph-cli/src/scope.rs deleted file mode 100644 index 257907d..0000000 --- a/crates/omnigraph-cli/src/scope.rs +++ /dev/null @@ -1,529 +0,0 @@ -//! RFC-011 Slice A scope resolution. -//! -//! Translates the new scope inputs (`--profile` / `--store` / operator-config -//! `profiles`/`clusters`/`defaults`) into the SAME effective addressing tuple -//! the existing `GraphClient` factories (`client.rs`) and the maintenance -//! resolver (`helpers::resolve_storage_uri`) already consume. This is a -//! translation layer that sits *in front* of those resolvers — it is purely -//! additive: an explicit legacy address (`--uri`/`--target`/`--server`/ -//! 
`--store`) wins and reproduces today's behavior exactly, so existing -//! invocations are unaffected. -//! -//! The access path (served vs direct) is never chosen here; it falls out of the -//! scope's binding × the verb's capability. The capability→scope check rejects -//! mismatches (e.g. a server scope on a maintenance verb) only on the *new* -//! resolution paths. - -use std::env; - -use color_eyre::Result; -use color_eyre::eyre::{bail, eyre}; - -use crate::operator::{OperatorConfig, ScopeBinding}; -use crate::planes::Capability; - -pub(crate) const PROFILE_ENV: &str = "OMNIGRAPH_PROFILE"; - -/// The effective addressing a command should use, in the terms the existing -/// resolvers consume. Data/served verbs read `server`/`graph`/`uri`/`target`; -/// maintenance verbs read `cluster`/`cluster_graph`. -#[derive(Debug, Default, PartialEq, Eq)] -pub(crate) struct ResolvedScope { - pub(crate) server: Option, - pub(crate) graph: Option, - pub(crate) uri: Option, - pub(crate) cluster: Option, - pub(crate) cluster_graph: Option, -} - -/// The raw addressing inputs for one command: the global scope flags plus the -/// command's own positional URI. -pub(crate) struct ScopeFlags<'a> { - pub(crate) profile: Option<&'a str>, - pub(crate) store: Option<&'a str>, - pub(crate) server: Option<&'a str>, - pub(crate) cluster: Option<&'a str>, - pub(crate) graph: Option<&'a str>, - pub(crate) uri: Option, -} - -/// Resolve the scope for a command with `capability`. Precedence (RFC-011): -/// 1. explicit primitive address (`uri`/`--server`/`--store`) → passthrough; -/// 2. `--profile` / `OMNIGRAPH_PROFILE`; -/// 3. flat `defaults.server` + `defaults.default_graph`; -/// 4. nothing — downstream behaves as today. 
-pub(crate) fn resolve_scope( - op: &OperatorConfig, - capability: Capability, - flags: ScopeFlags<'_>, -) -> Result { - // At most one explicit scope primitive may address a command — a positional - // URI, `--store`, `--server`, or `--cluster` are mutually exclusive ways to - // name the graph. Combining them is a contradiction, not a silent precedence. - let primitives: Vec<&str> = [ - flags.uri.as_deref().map(|_| "a positional URI"), - flags.store.map(|_| "--store"), - flags.server.map(|_| "--server"), - flags.cluster.map(|_| "--cluster"), - ] - .into_iter() - .flatten() - .collect(); - if primitives.len() > 1 { - bail!( - "{} are mutually exclusive — pick one way to address the graph", - primitives.join(" and ") - ); - } - - // 1a. `--cluster` is the cluster scope primitive (maintenance): resolve its - // root + select the graph with `--graph`. - if let Some(cluster) = flags.cluster { - return scope_from_binding( - op, - capability, - ScopeBinding::Cluster(cluster.to_string()), - flags.graph.map(str::to_string), - "--cluster", - ); - } - - // 1b. Any other explicit address wins; reproduce today's behavior untouched. - // `--store` is an explicit store URI — fold it into `uri`. - if flags.uri.is_some() || flags.server.is_some() || flags.store.is_some() { - // `--graph` selects within a multi-graph scope; a bare positional URI / - // `--store` is already a single graph, so a stray `--graph` is an error - // rather than a silently-dropped flag. - if flags.graph.is_some() && flags.server.is_none() { - bail!( - "--graph selects a graph within a server or cluster scope; a positional \ - URI / --store is already a single graph" - ); - } - return Ok(ResolvedScope { - server: flags.server.map(str::to_string), - graph: flags.graph.map(str::to_string), - uri: flags.store.map(str::to_string).or(flags.uri), - ..Default::default() - }); - } - - // 2. A named profile (flag, else env). 
- let profile_name = flags - .profile - .map(str::to_string) - .or_else(|| env::var(PROFILE_ENV).ok().filter(|s| !s.is_empty())); - if let Some(name) = profile_name { - let profile = op.profile(&name).ok_or_else(|| { - eyre!("unknown profile '{name}' (not defined under `profiles:` in operator config)") - })?; - let binding = profile.binding(&name)?; - let graph = flags - .graph - .map(str::to_string) - .or_else(|| profile.default_graph.clone()); - return scope_from_binding(op, capability, binding, graph, &format!("profile '{name}'")); - } - - // 3. Flat default server scope. - if let Some(server) = op.default_server() { - let graph = flags - .graph - .map(str::to_string) - .or_else(|| op.default_graph().map(str::to_string)); - return scope_from_binding( - op, - capability, - ScopeBinding::Server(server.to_string()), - graph, - "operator defaults", - ); - } - - // 3b. Flat default store scope — the zero-flag local-dev default (RFC-011). - // Mutually exclusive with `defaults.server` (enforced at config load). - if let Some(store) = op.default_store() { - return scope_from_binding( - op, - capability, - ScopeBinding::Store(store.to_string()), - flags.graph.map(str::to_string), - "operator defaults", - ); - } - - // 4. Nothing resolved — leave the tuple empty; downstream falls through to - // today's behavior (legacy `cli.graph` default or a no-address error). - Ok(ResolvedScope::default()) -} - -/// Map a resolved binding to the effective tuple, enforcing scope × capability -/// capability (RFC-011): a server scope is served (data only); a cluster scope -/// is privileged direct (maintenance/control only); a store scope is direct -/// (either). 
-fn scope_from_binding( - op: &OperatorConfig, - capability: Capability, - binding: ScopeBinding, - graph: Option, - source: &str, -) -> Result { - match binding { - ScopeBinding::Server(server) => { - if capability == Capability::Direct { - bail!( - "this command needs direct storage access, but {source} resolves a \ - server scope; name storage explicitly with --store (or \ - --cluster --graph for a managed graph)" - ); - } - Ok(ResolvedScope { - server: Some(server), - graph, - ..Default::default() - }) - } - ScopeBinding::Cluster(cluster) => { - if capability == Capability::Any { - bail!( - "{source} resolves a cluster scope, which is not valid for graph data \ - commands; run data commands through a server, or use --store \ - for ad-hoc direct access" - ); - } - // A cluster value is a config name (resolved against `clusters:`) - // or a literal root: an `s3://`/`file://` URI or a local cluster - // directory. Only a configured name is rewritten; anything else is - // passed through to the cluster-state resolver verbatim, so a bare - // directory path keeps working as it did for per-command `--cluster`. - let root = op - .cluster_root(&cluster) - .map(str::to_string) - .unwrap_or(cluster); - // A cluster holds many graphs; maintenance addresses one at a time. - // When no `--graph`/`default_graph` is given, leave `cluster_graph` - // empty and defer to the async storage-URI resolver (RFC-011 D7), - // which enumerates the catalog: auto-use a sole graph, else error - // and list the candidates. 
- Ok(ResolvedScope { - cluster: Some(root), - cluster_graph: graph, - ..Default::default() - }) - } - ScopeBinding::Store(uri) => { - if graph.is_some() { - bail!( - "--graph does not apply to a store scope ({source}): a store is already \ - a single graph" - ); - } - Ok(ResolvedScope { - uri: Some(uri), - ..Default::default() - }) - } - } -} - -#[cfg(test)] -mod tests { - use super::*; - - fn cfg(yaml: &str) -> OperatorConfig { - serde_yaml::from_str(yaml).unwrap() - } - - fn flags<'a>() -> ScopeFlags<'a> { - ScopeFlags { - profile: None, - store: None, - server: None, - cluster: None, - graph: None, - uri: None, - } - } - - #[test] - fn explicit_legacy_address_wins_unchanged() { - let op = cfg("defaults:\n server: prod\nservers:\n prod:\n url: https://x\n"); - // A positional URI given → profile/defaults are ignored entirely. - let scope = resolve_scope( - &op, - Capability::Any, - ScopeFlags { - uri: Some("graph.omni".into()), - ..flags() - }, - ) - .unwrap(); - assert_eq!(scope.uri.as_deref(), Some("graph.omni")); - assert_eq!(scope.server, None); - } - - #[test] - fn store_flag_folds_into_uri_and_rejects_graph() { - let op = OperatorConfig::default(); - let scope = resolve_scope( - &op, - Capability::Any, - ScopeFlags { - store: Some("s3://b/g.omni"), - ..flags() - }, - ) - .unwrap(); - assert_eq!(scope.uri.as_deref(), Some("s3://b/g.omni")); - } - - #[test] - fn scope_primitives_are_mutually_exclusive() { - let op = OperatorConfig::default(); - for flags in [ - ScopeFlags { - store: Some("s3://b/g.omni"), - uri: Some("file://other.omni".into()), - ..flags() - }, - ScopeFlags { - store: Some("s3://b/g.omni"), - server: Some("prod"), - ..flags() - }, - ScopeFlags { - cluster: Some("./brain"), - uri: Some("file://other.omni".into()), - ..flags() - }, - ScopeFlags { - cluster: Some("./brain"), - server: Some("prod"), - ..flags() - }, - ] { - let err = resolve_scope(&op, Capability::Direct, flags) - .unwrap_err() - .to_string(); - assert!(err.contains("mutually 
exclusive"), "{err}"); - } - } - - #[test] - fn cluster_flag_resolves_root_and_graph_for_maintenance() { - let op = cfg("clusters:\n brain:\n root: s3://acme/brain\n"); - let scope = resolve_scope( - &op, - Capability::Direct, - ScopeFlags { - cluster: Some("brain"), - graph: Some("knowledge"), - ..flags() - }, - ) - .unwrap(); - assert_eq!(scope.cluster.as_deref(), Some("s3://acme/brain")); - assert_eq!(scope.cluster_graph.as_deref(), Some("knowledge")); - } - - #[test] - fn cluster_flag_accepts_a_literal_root_uri() { - let op = OperatorConfig::default(); - let scope = resolve_scope( - &op, - Capability::Direct, - ScopeFlags { - cluster: Some("s3://bucket/clusters/brain"), - graph: Some("knowledge"), - ..flags() - }, - ) - .unwrap(); - assert_eq!(scope.cluster.as_deref(), Some("s3://bucket/clusters/brain")); - assert_eq!(scope.cluster_graph.as_deref(), Some("knowledge")); - } - - #[test] - fn cluster_scope_without_a_graph_defers_to_catalog_enumeration() { - // RFC-011 D7: with no `--graph`/`default_graph`, resolution no longer - // bails here — it resolves the cluster root and leaves `cluster_graph` - // empty, deferring to the async storage-URI resolver (which enumerates - // the catalog: auto-use a sole graph, else error listing candidates). 
- let op = cfg("clusters:\n brain:\n root: s3://acme/brain\n"); - let scope = resolve_scope( - &op, - Capability::Direct, - ScopeFlags { - cluster: Some("brain"), - ..flags() - }, - ) - .unwrap(); - assert_eq!(scope.cluster.as_deref(), Some("s3://acme/brain")); - assert_eq!(scope.cluster_graph, None); - } - - #[test] - fn graph_on_a_bare_store_or_uri_is_rejected() { - let op = OperatorConfig::default(); - for flags in [ - ScopeFlags { - uri: Some("graph.omni".into()), - graph: Some("knowledge"), - ..flags() - }, - ScopeFlags { - store: Some("s3://b/g.omni"), - graph: Some("knowledge"), - ..flags() - }, - ] { - let err = resolve_scope(&op, Capability::Any, flags) - .unwrap_err() - .to_string(); - assert!(err.contains("already a single graph"), "{err}"); - } - } - - #[test] - fn flat_default_store_drives_local_verbs() { - // RFC-011: `defaults.store` is the zero-flag local default — no flags, - // no profile → the store URI resolves as the (single-graph) store scope. - let op = cfg("defaults:\n store: file:///tmp/dev.omni\n"); - let scope = resolve_scope(&op, Capability::Any, flags()).unwrap(); - assert_eq!(scope.uri.as_deref(), Some("file:///tmp/dev.omni")); - assert_eq!(scope.server, None); - } - - #[test] - fn flat_default_store_rejects_graph() { - // A store is already a single graph, so `--graph` against a default - // store is a loud error. 
- let op = cfg("defaults:\n store: file:///tmp/dev.omni\n"); - let err = resolve_scope( - &op, - Capability::Any, - ScopeFlags { - graph: Some("knowledge"), - ..flags() - }, - ) - .unwrap_err() - .to_string(); - assert!(err.contains("does not apply to a store scope"), "{err}"); - } - - #[test] - fn flat_default_server_drives_data_verbs() { - let op = cfg("defaults:\n server: prod\n default_graph: knowledge\nservers:\n prod:\n url: https://x\n"); - let scope = resolve_scope(&op, Capability::Any, flags()).unwrap(); - assert_eq!(scope.server.as_deref(), Some("prod")); - assert_eq!(scope.graph.as_deref(), Some("knowledge")); - } - - #[test] - fn profile_server_scope_with_graph_override() { - let op = cfg( - "servers:\n staging:\n url: https://s\nprofiles:\n staging:\n server: staging\n default_graph: knowledge\n", - ); - let scope = resolve_scope( - &op, - Capability::Any, - ScopeFlags { - profile: Some("staging"), - graph: Some("archive"), - ..flags() - }, - ) - .unwrap(); - assert_eq!(scope.server.as_deref(), Some("staging")); - assert_eq!(scope.graph.as_deref(), Some("archive")); // flag beats profile default - } - - #[test] - fn profile_cluster_scope_resolves_root_for_maintenance() { - let op = cfg( - "clusters:\n brain:\n root: s3://acme/brain\nprofiles:\n admin:\n cluster: brain\n default_graph: knowledge\n", - ); - let scope = resolve_scope( - &op, - Capability::Direct, - ScopeFlags { - profile: Some("admin"), - ..flags() - }, - ) - .unwrap(); - assert_eq!(scope.cluster.as_deref(), Some("s3://acme/brain")); - assert_eq!(scope.cluster_graph.as_deref(), Some("knowledge")); - } - - #[test] - fn profile_cluster_scope_with_graph_override() { - // The deferral closed by this slice: a `--graph` flag overrides a - // profile cluster's default_graph, exactly as it does for a server scope. 
- let op = cfg( - "clusters:\n brain:\n root: s3://acme/brain\nprofiles:\n admin:\n cluster: brain\n default_graph: knowledge\n", - ); - let scope = resolve_scope( - &op, - Capability::Direct, - ScopeFlags { - profile: Some("admin"), - graph: Some("archive"), - ..flags() - }, - ) - .unwrap(); - assert_eq!(scope.cluster.as_deref(), Some("s3://acme/brain")); - assert_eq!(scope.cluster_graph.as_deref(), Some("archive")); // flag beats profile default - } - - #[test] - fn server_scope_on_maintenance_verb_errors() { - let op = cfg("defaults:\n server: prod\nservers:\n prod:\n url: https://x\n"); - let err = resolve_scope(&op, Capability::Direct, flags()).unwrap_err().to_string(); - assert!(err.contains("direct storage access"), "{err}"); - } - - #[test] - fn cluster_scope_on_data_verb_errors() { - let op = cfg( - "clusters:\n brain:\n root: s3://acme/brain\nprofiles:\n admin:\n cluster: brain\n", - ); - let err = resolve_scope( - &op, - Capability::Any, - ScopeFlags { - profile: Some("admin"), - ..flags() - }, - ) - .unwrap_err() - .to_string(); - assert!(err.contains("not valid for graph data commands"), "{err}"); - } - - #[test] - fn unknown_profile_is_a_loud_error() { - let op = OperatorConfig::default(); - let err = resolve_scope( - &op, - Capability::Any, - ScopeFlags { - profile: Some("nope"), - ..flags() - }, - ) - .unwrap_err() - .to_string(); - assert!(err.contains("unknown profile 'nope'"), "{err}"); - } - - #[test] - fn no_address_resolves_empty_for_legacy_fallthrough() { - let op = OperatorConfig::default(); - let scope = resolve_scope(&op, Capability::Any, flags()).unwrap(); - assert_eq!(scope, ResolvedScope::default()); - } -} diff --git a/crates/omnigraph-cli/tests/cli_data.rs b/crates/omnigraph-cli/tests/cli.rs similarity index 63% rename from crates/omnigraph-cli/tests/cli_data.rs rename to crates/omnigraph-cli/tests/cli.rs index 81e1aab..6e5de37 100644 --- a/crates/omnigraph-cli/tests/cli_data.rs +++ b/crates/omnigraph-cli/tests/cli.rs @@ -1,9 +1,7 @@ 
-//! Data commands: load/read/change/branch/commit/export/snapshot/policy/embed/maintenance. -//! Moved verbatim from tests/cli.rs in the modularization. - use std::fs; -use assert_cmd::Command; +use lance::index::DatasetIndexExt; +use omnigraph::db::{Omnigraph, ReadTarget}; use serde_json::Value; use tempfile::tempdir; @@ -11,6 +9,85 @@ mod support; use support::*; +const POLICY_YAML: &str = r#" +version: 1 +groups: + team: [act-andrew, act-bruno] + admins: [act-andrew] +protected_branches: [main] +rules: + - id: team-read + allow: + actors: { group: team } + actions: [read] + branch_scope: any + - id: team-write + allow: + actors: { group: team } + actions: [change] + branch_scope: unprotected + - id: admins-promote + allow: + actors: { group: admins } + actions: [branch_merge] + target_branch_scope: protected +"#; + +const POLICY_TESTS_YAML: &str = r#" +version: 1 +cases: + - id: allow-feature-write + actor: act-andrew + action: change + branch: feature + expect: allow + - id: deny-main-write + actor: act-bruno + action: change + branch: main + expect: deny +"#; + +fn manifest_dataset_version(graph: &std::path::Path) -> u64 { + tokio::runtime::Runtime::new().unwrap().block_on(async { + Omnigraph::open(graph.to_string_lossy().as_ref()) + .await + .unwrap() + .snapshot_of(ReadTarget::branch("main")) + .await + .unwrap() + .version() + }) +} + +fn write_policy_config_fixture(root: &std::path::Path) -> (std::path::PathBuf, std::path::PathBuf) { + let config = root.join("omnigraph.yaml"); + let policy = root.join("policy.yaml"); + fs::write( + &config, + r#" +project: + name: policy-test-graph +policy: + file: ./policy.yaml +"#, + ) + .unwrap(); + fs::write(&policy, POLICY_YAML).unwrap(); + fs::write(root.join("policy.tests.yaml"), POLICY_TESTS_YAML).unwrap(); + (config, policy) +} + +#[test] +fn version_command_prints_current_cli_version() { + let output = output_success(cli().arg("version")); + let stdout = stdout_string(&output); + + assert_eq!( + stdout.trim(), + 
format!("omnigraph {}", env!("CARGO_PKG_VERSION")) + ); +} #[test] fn short_version_flag_prints_current_cli_version() { @@ -144,196 +221,320 @@ fn embed_seed_preserves_non_entity_rows() { } #[test] -fn optimize_json_succeeds_on_local_graph() { - // Happy path for the resolve_local_uri swap (RFC-010 Slice 1): a positional - // local path still resolves and runs embedded. +fn init_creates_graph_successfully_on_missing_local_directory() { let temp = tempdir().unwrap(); let graph = graph_path(temp.path()); - init_graph(&graph); - load_fixture(&graph); + let schema = fixture("test.pg"); - let output = output_success(cli().arg("optimize").arg("--json").arg(&graph)); - let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - assert!(payload["tables"].as_array().is_some()); + let output = output_success(cli().arg("init").arg("--schema").arg(&schema).arg(&graph)); + let stdout = stdout_string(&output); + + assert!(stdout.contains("initialized")); + assert!(graph.join("_schema.pg").exists()); + assert!(graph.join("__manifest").exists()); + assert!(temp.path().join("omnigraph.yaml").exists()); } #[test] -fn optimize_with_server_flag_errors_wrong_plane() { - // RFC-010 Slice 1: --server is a data-plane addressing flag; on a - // storage-plane verb the guard rejects it loudly (was: silently ignored). - let output = output_failure(cli().arg("optimize").arg("--server").arg("prod")); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("`optimize` is a direct (storage-native) command") - && stderr.contains("--server addresses a served graph and does not apply") - && stderr.contains("Pass a storage URI, or --cluster --graph ."), - "wrong-capability guard message not found; got: {stderr}" - ); -} - -#[test] -fn wrong_address_guard_message_has_no_trailing_space() { - // The remediation tail is empty for served-addressing capabilities, so a - // misplaced --cluster on a data verb must not leave "… does not apply. 
" - // with a dangling space (error text is observable contract). NO_COLOR keeps - // the assertion off ANSI styling. - let output = output_failure( - cli() - .env("NO_COLOR", "1") - .arg("query") - .arg("--cluster") - .arg("./brain") - .arg("-e") - .arg("query q { Person { id } }"), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("and does not apply."), - "expected the wrong-address message; got: {stderr}" - ); - assert!( - !stderr.contains("and does not apply. "), - "trailing space after the message; got: {stderr}" - ); -} - -#[test] -fn graph_flag_on_a_positional_uri_errors() { - // RFC-011: `--graph` selects within a multi-graph scope (a server or - // cluster). An explicit `--store ` is already a single graph, so - // pairing it with `--graph` is a loud error, not a silently-dropped flag. - // (The guard lets `--graph` reach a data verb; the scope resolver rejects - // it.) +fn schema_plan_json_reports_supported_additive_change() { let temp = tempdir().unwrap(); let graph = graph_path(temp.path()); + let schema_path = temp.path().join("next.pg"); init_graph(&graph); - let output = output_failure( - cli() - .arg("query") - .arg("--store") - .arg(&graph) - .arg("--graph") - .arg("knowledge") - .arg("-e") - .arg("query q { Person { id } }"), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("already a single graph"), - "expected --graph-on-explicit-store rejection; got: {stderr}" - ); -} -#[test] -fn query_by_name_against_a_store_needs_a_server() { - // RFC-011 D3: by-name (catalog) invocation is served-only — the catalog is - // server-owned, so a bare `--store` has nothing to resolve the name - // against. The ad-hoc lane (`-e`/`--query`) is the local alternative. 
- let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - init_graph(&graph); - let output = output_failure( + let next_schema = fs::read_to_string(fixture("test.pg")).unwrap().replace( + " age: I32?\n}", + " age: I32?\n nickname: String?\n}", + ); + fs::write(&schema_path, next_schema).unwrap(); + + let output = output_success( cli() - .arg("query") - .arg("find_people") - .arg("--store") + .arg("schema") + .arg("plan") + .arg("--schema") + .arg(&schema_path) + .arg("--json") .arg(&graph), ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("needs a server"), - "expected a served-only by-name error; got: {stderr}" - ); -} - -#[test] -fn optimize_with_remote_target_errors_storage_plane() { - // RFC-010 Slice 1: a maintenance verb pointed at a remote URI fails loudly - // and declaratively (was: whatever Omnigraph::open said about an https URI). - let output = output_failure(cli().arg("optimize").arg("https://graph.example.invalid")); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("`optimize` is a direct (storage-native) command and needs direct storage access") - && stderr.contains("remote server"), - "direct remote-target message not found; got: {stderr}" - ); -} - -#[test] -fn repair_json_reports_noop_on_clean_graph() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - init_graph(&graph); - load_fixture(&graph); - - let output = output_success(cli().arg("repair").arg("--json").arg(&graph)); let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - assert_eq!(payload["confirm"], false); - assert_eq!(payload["force"], false); - assert_eq!(payload["manifest_version"], Value::Null); - let tables = payload["tables"].as_array().unwrap(); - assert_eq!(tables.len(), 4); - assert!(tables.iter().all(|table| { - table["classification"] == "no_drift" && table["action"] == "no_op" + assert_eq!(payload["supported"], true); + 
assert_eq!(payload["step_count"], 1); + assert_eq!(payload["steps"][0]["kind"], "add_property"); + assert_eq!(payload["steps"][0]["type_kind"], "node"); + assert_eq!(payload["steps"][0]["type_name"], "Person"); + assert_eq!(payload["steps"][0]["property_name"], "nickname"); +} + +#[test] +fn schema_plan_json_reports_unsupported_type_change() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let schema_path = temp.path().join("breaking.pg"); + init_graph(&graph); + + let breaking_schema = fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace("age: I32?", "age: I64?"); + fs::write(&schema_path, breaking_schema).unwrap(); + + let output = output_success( + cli() + .arg("schema") + .arg("plan") + .arg("--schema") + .arg(&schema_path) + .arg("--json") + .arg(&graph), + ); + let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); + + assert_eq!(payload["supported"], false); + assert!(payload["steps"].as_array().unwrap().iter().any(|step| { + step["kind"] == "unsupported_change" + && step["entity"] + .as_str() + .unwrap_or_default() + .contains("Person.age") })); } #[test] -fn repair_confirm_json_refuses_suspicious_drift_with_nonzero_exit_then_force_succeeds() { +fn schema_apply_json_applies_supported_migration() { let temp = tempdir().unwrap(); let graph = graph_path(temp.path()); + let schema_path = temp.path().join("next.pg"); init_graph(&graph); - load_fixture(&graph); - let graph_manifest_before = manifest_dataset_version(&graph); - let (table_manifest_before, table_head_before) = forge_person_delete_drift(&graph); - let refused = output_failure( + let next_schema = fs::read_to_string(fixture("test.pg")).unwrap().replace( + " age: I32?\n}", + " age: I32?\n nickname: String?\n}", + ); + fs::write(&schema_path, next_schema).unwrap(); + + let output = output_success( cli() - .arg("repair") - .arg("--confirm") + .arg("schema") + .arg("apply") + .arg("--schema") + .arg(&schema_path) .arg("--json") .arg(&graph), ); - let 
refused_payload: Value = serde_json::from_slice(&refused.stdout).unwrap(); - assert_eq!(refused_payload["manifest_version"], Value::Null); - let person = refused_payload["tables"] - .as_array() + let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); + + assert_eq!(payload["supported"], true); + assert_eq!(payload["applied"], true); + assert_eq!(payload["step_count"], 1); + + let db = tokio::runtime::Runtime::new() .unwrap() - .iter() - .find(|table| table["table_key"] == "node:Person") + .block_on(Omnigraph::open(graph.to_string_lossy().as_ref())) .unwrap(); - assert_eq!(person["classification"], "suspicious"); - assert_eq!(person["action"], "refused"); assert!( - String::from_utf8_lossy(&refused.stderr).contains("repair refused"), - "stderr should explain the non-zero exit; got: {}", - String::from_utf8_lossy(&refused.stderr) + db.catalog().node_types["Person"] + .properties + .contains_key("nickname") ); - assert_eq!(manifest_dataset_version(&graph), graph_manifest_before); +} - let forced = output_success( +#[test] +fn schema_apply_human_reports_noop() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let schema_path = fixture("test.pg"); + init_graph(&graph); + + let output = output_success( cli() - .arg("repair") - .arg("--force") - .arg("--confirm") + .arg("schema") + .arg("apply") + .arg("--schema") + .arg(&schema_path) + .arg(&graph), + ); + let stdout = stdout_string(&output); + + assert!(stdout.contains("applied: no")); + assert!(stdout.contains("no schema changes")); +} + +#[test] +fn schema_apply_json_renames_type_and_updates_snapshot() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let schema_path = temp.path().join("rename.pg"); + init_graph(&graph); + + let renamed_schema = fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace("node Person {\n", "node Human @rename_from(\"Person\") {\n") + .replace("edge Knows: Person -> Person", "edge Knows: Human -> Human") + .replace( + 
"edge WorksAt: Person -> Company", + "edge WorksAt: Human -> Company", + ); + fs::write(&schema_path, renamed_schema).unwrap(); + + let output = output_success( + cli() + .arg("schema") + .arg("apply") + .arg("--schema") + .arg(&schema_path) .arg("--json") .arg(&graph), ); - let forced_payload: Value = serde_json::from_slice(&forced.stdout).unwrap(); - let forced_manifest = forced_payload["manifest_version"].as_u64().unwrap(); - assert!(forced_manifest > graph_manifest_before); - let person = forced_payload["tables"] - .as_array() + let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); + assert_eq!(payload["applied"], true); + + let db = tokio::runtime::Runtime::new() .unwrap() - .iter() - .find(|table| table["table_key"] == "node:Person") + .block_on(Omnigraph::open(graph.to_string_lossy().as_ref())) .unwrap(); - assert_eq!(person["classification"], "suspicious"); - assert_eq!(person["action"], "forced"); - assert_eq!(person["manifest_version"], table_manifest_before); - assert_eq!(person["lance_head_version"], table_head_before); - assert_eq!(manifest_dataset_version(&graph), forced_manifest); + let snapshot = tokio::runtime::Runtime::new() + .unwrap() + .block_on(db.snapshot_of(ReadTarget::branch("main"))) + .unwrap(); + assert!(snapshot.entry("node:Human").is_some()); + assert!(snapshot.entry("node:Person").is_none()); +} + +#[test] +fn schema_apply_json_renames_property_and_updates_catalog() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let schema_path = temp.path().join("rename-property.pg"); + init_graph(&graph); + + let renamed_schema = fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace("age: I32?", "years: I32? 
@rename_from(\"age\")"); + fs::write(&schema_path, renamed_schema).unwrap(); + + let output = output_success( + cli() + .arg("schema") + .arg("apply") + .arg("--schema") + .arg(&schema_path) + .arg("--json") + .arg(&graph), + ); + let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); + assert_eq!(payload["applied"], true); + + let db = tokio::runtime::Runtime::new() + .unwrap() + .block_on(Omnigraph::open(graph.to_string_lossy().as_ref())) + .unwrap(); + let person = &db.catalog().node_types["Person"]; + assert!(person.properties.contains_key("years")); + assert!(!person.properties.contains_key("age")); +} + +#[test] +fn schema_apply_json_adds_index_for_existing_property() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let schema_path = temp.path().join("index.pg"); + init_graph(&graph); + + let before_index_count = tokio::runtime::Runtime::new().unwrap().block_on(async { + let db = Omnigraph::open(graph.to_string_lossy().as_ref()) + .await + .unwrap(); + let snapshot = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); + let dataset = snapshot.open("node:Person").await.unwrap(); + dataset.load_indices().await.unwrap().len() + }); + + let indexed_schema = fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace("name: String @key", "name: String @key @index"); + fs::write(&schema_path, indexed_schema).unwrap(); + + let output = output_success( + cli() + .arg("schema") + .arg("apply") + .arg("--schema") + .arg(&schema_path) + .arg("--json") + .arg(&graph), + ); + let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); + assert_eq!(payload["applied"], true); + + let after_index_count = tokio::runtime::Runtime::new().unwrap().block_on(async { + let db = Omnigraph::open(graph.to_string_lossy().as_ref()) + .await + .unwrap(); + let snapshot = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); + let dataset = snapshot.open("node:Person").await.unwrap(); + 
dataset.load_indices().await.unwrap().len() + }); + assert!(after_index_count > before_index_count); +} + +#[test] +fn schema_apply_rejects_unsupported_plan() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let schema_path = temp.path().join("breaking.pg"); + init_graph(&graph); + + let breaking_schema = fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace("age: I32?", "age: I64?"); + fs::write(&schema_path, breaking_schema).unwrap(); + + let output = output_failure( + cli() + .arg("schema") + .arg("apply") + .arg("--schema") + .arg(&schema_path) + .arg(&graph), + ); + let stderr = String::from_utf8_lossy(&output.stderr); + assert!(stderr.contains("changing property type")); +} + +#[test] +fn schema_apply_rejects_when_non_main_branch_exists() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let schema_path = temp.path().join("next.pg"); + init_graph(&graph); + output_success( + cli() + .arg("branch") + .arg("create") + .arg("--from") + .arg("main") + .arg("--uri") + .arg(&graph) + .arg("feature"), + ); + + let next_schema = fs::read_to_string(fixture("test.pg")).unwrap().replace( + " age: I32?\n}", + " age: I32?\n nickname: String?\n}", + ); + fs::write(&schema_path, next_schema).unwrap(); + + let output = output_failure( + cli() + .arg("schema") + .arg("apply") + .arg("--schema") + .arg(&schema_path) + .arg(&graph), + ); + let stderr = String::from_utf8_lossy(&output.stderr); + assert!(stderr.contains("schema apply requires a graph with only main")); } #[test] @@ -383,6 +584,57 @@ query update_policy($slug: String, $name: String) { ); } +#[test] +fn query_check_alias_matches_lint_output() { + let temp = tempdir().unwrap(); + let schema_path = temp.path().join("schema.pg"); + let query_path = temp.path().join("queries.gq"); + write_file( + &schema_path, + r#" +node Person { + name: String +} +"#, + ); + write_query_file( + &query_path, + r#" +query list_people() { + match { $p: Person } + return { 
$p.name } +} +"#, + ); + + let lint_output = output_success( + cli() + .arg("query") + .arg("lint") + .arg("--query") + .arg(&query_path) + .arg("--schema") + .arg(&schema_path) + .arg("--json"), + ); + let check_output = output_success( + cli() + .arg("query") + .arg("check") + .arg("--query") + .arg(&query_path) + .arg("--schema") + .arg(&schema_path) + .arg("--json"), + ); + + assert_eq!(stdout_string(&lint_output), stdout_string(&check_output)); +} + +/// `omnigraph lint` is the canonical top-level lint command after the +/// query/mutate rename. `omnigraph query lint` and `omnigraph query check` +/// are kept as deprecated argv shims (warning + rewrite). All three must +/// produce identical stdout output. #[test] fn lint_top_level_matches_deprecated_query_lint_output() { let temp = tempdir().unwrap(); @@ -462,6 +714,12 @@ query list_people() { ); } +/// Bare `omnigraph check` is NOT a clap `visible_alias` on `lint` (MR-981 §6: +/// visible aliases give agents two canonical names to emit interchangeably). +/// It's an argv-level shim: rewrites to `omnigraph lint`, prints a one-line +/// stderr deprecation warning, and produces identical stdout to the canonical +/// invocation. Cargo/Go users typing `check` keep working; help text shows +/// only `lint`. #[test] fn deprecated_check_top_level_rewrites_to_lint() { let temp = tempdir().unwrap(); @@ -527,15 +785,21 @@ query list_people() { ); } +/// `omnigraph read` and `omnigraph change` are kept as visible clap +/// aliases for the new canonical `query` / `mutate` subcommands, plus an +/// argv-level deprecation warning. The warning is emitted to stderr; the +/// command otherwise behaves identically to the canonical form. #[test] fn deprecated_read_and_change_subcommands_emit_warnings() { - // Both subcommands require `--query`/`--query-string`, so invoking them - // with no args will exit non-zero. That's fine -- we only care that the - // deprecation warning is printed before the argument-required error. 
+ // Both subcommands require `--query`/`--query-string`/`--alias`, so + // invoking them with no args will exit non-zero. That's fine -- + // we only care that the deprecation warning is printed before the + // argument-required error. let output = cli().arg("read").output().unwrap(); let stderr = String::from_utf8(output.stderr).unwrap(); assert!( - stderr.contains("`omnigraph read` is deprecated") && stderr.contains("`omnigraph query`"), + stderr.contains("`omnigraph read` is deprecated") + && stderr.contains("`omnigraph query`"), "expected `omnigraph read` deprecation warning; got: {stderr}" ); @@ -599,15 +863,13 @@ query list_people() { } #[test] -fn query_lint_can_resolve_graph_from_store_scope() { - // RFC-011: lint resolves its graph target through `--store` (the direct - // scope), not omnigraph.yaml's cli.graph; the .gq path is plain cwd-relative. +fn query_lint_can_resolve_graph_and_query_from_config() { let temp = tempdir().unwrap(); let graph = graph_path(temp.path()); + let config_path = temp.path().join("omnigraph.yaml"); init_graph(&graph); - let query_path = temp.path().join("queries.gq"); write_query_file( - &query_path, + &temp.path().join("queries.gq"), r#" query list_people() { match { $p: Person } @@ -615,15 +877,16 @@ query list_people() { } "#, ); + write_config(&config_path, &local_yaml_config(&graph)); let output = output_success( cli() .arg("query") .arg("lint") .arg("--query") - .arg(&query_path) - .arg("--store") - .arg(&graph) + .arg("queries.gq") + .arg("--config") + .arg(&config_path) .arg("--json"), ); let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); @@ -659,12 +922,8 @@ query list_people() { .arg("http://127.0.0.1:8080"), ); let stderr = String::from_utf8_lossy(&output.stderr); - // RFC-010/011: the direct (storage-native) verbs share one declared message - // (was: "query lint is only supported against local graph URIs …"). 
assert!( - stderr.contains("`lint` is a direct (storage-native) command and needs direct storage access") - && stderr.contains("remote server"), - "direct remote-target message not found; got: {stderr}" + stderr.contains("query lint is only supported against local graph URIs in this milestone") ); } @@ -691,9 +950,7 @@ query list_people() { ); let stderr = String::from_utf8_lossy(&output.stderr); assert!( - stderr.contains("lint requires --schema ") - || stderr.contains("no graph addressed"), - "expected a schema-or-graph-target requirement; got: {stderr}" + stderr.contains("query lint requires --schema or a resolvable graph target") ); } @@ -792,8 +1049,6 @@ fn load_json_outputs_summary_for_main_branch() { let output = output_success( cli() .arg("load") - .arg("--mode") - .arg("overwrite") .arg("--data") .arg(&data) .arg("--json") @@ -862,10 +1117,10 @@ fn read_json_outputs_rows_for_named_query() { let output = output_success( cli() .arg("read") - .arg("--store") .arg(&graph) .arg("--query") .arg(&queries) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -879,58 +1134,6 @@ fn read_json_outputs_rows_for_named_query() { assert_eq!(payload["rows"][0]["p.name"], "Alice"); } -#[test] -fn read_via_store_flag_and_profile_match_positional_uri() { - // RFC-011 Slice A: the new scope addressing (--store, and a --profile that - // binds a store) drives a read identically to the legacy positional URI — - // the scope layer is additive, not a behavior change. - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - init_graph(&graph); - load_fixture(&graph); - let queries = fixture("test.gq"); - - let read_rows = |cmd: &mut Command| -> Value { - let output = output_success( - cmd.arg("--query") - .arg(&queries) - .arg("get_person") - .arg("--params") - .arg(r#"{"name":"Alice"}"#) - .arg("--json"), - ); - serde_json::from_slice(&output.stdout).unwrap() - }; - - // Baseline: --store names the graph. 
- let baseline = read_rows(cli().arg("query").arg("--store").arg(&graph)); - assert_eq!(baseline["rows"][0]["p.name"], "Alice"); - - // --store names the same graph directly. - let via_store = read_rows(cli().arg("query").arg("--store").arg(&graph)); - assert_eq!(via_store["rows"], baseline["rows"]); - - // A profile binding that store, selected with --profile (no positional). - let home = temp.path().join("op-home"); - std::fs::create_dir_all(&home).unwrap(); - std::fs::write( - home.join("config.yaml"), - format!( - "profiles:\n local:\n store: '{}'\n", - graph.to_string_lossy() - ), - ) - .unwrap(); - let via_profile = read_rows( - cli() - .env("OMNIGRAPH_HOME", &home) - .arg("query") - .arg("--profile") - .arg("local"), - ); - assert_eq!(via_profile["rows"], baseline["rows"]); -} - #[test] fn export_jsonl_outputs_source_rows_for_selected_branch_and_type() { let temp = tempdir().unwrap(); @@ -990,38 +1193,43 @@ fn export_jsonl_outputs_source_rows_for_selected_branch_and_type() { ); } -// RFC-011: `policy validate|test|explain` source the Cedar bundle from a -// converged cluster's applied policies (`--cluster ` + `--graph `), -// not omnigraph.yaml's policy.file. 
- #[test] -fn policy_validate_accepts_cluster_bundle() { - let cluster = converged_loaded_cluster("knowledge", Some(POLICY_YAML)); +fn policy_validate_accepts_valid_policy_file() { + let temp = tempdir().unwrap(); + let (config, _) = write_policy_config_fixture(temp.path()); let output = output_success( cli() .arg("policy") .arg("validate") - .arg("--cluster") - .arg(cluster.path()) - .arg("--graph") - .arg("knowledge"), + .arg("--config") + .arg(&config), ); let stdout = stdout_string(&output); assert!(stdout.contains("policy valid:")); + assert!(stdout.contains("policy.yaml")); assert!(stdout.contains("[2 actors]")); } #[test] -fn policy_validate_fails_for_invalid_cluster_bundle() { - // The cluster does not validate a policy bundle's internal rules, so an - // applied-but-malformed bundle reaches `policy validate`, which compiles it - // and surfaces the error (here: a duplicate rule id). - let cluster = converged_loaded_cluster( - "knowledge", - Some( - r#" +fn policy_validate_fails_for_invalid_policy_file() { + let temp = tempdir().unwrap(); + let config = temp.path().join("omnigraph.yaml"); + let policy = temp.path().join("policy.yaml"); + fs::write( + &config, + r#" +project: + name: policy-test-graph +policy: + file: ./policy.yaml +"#, + ) + .unwrap(); + fs::write( + &policy, + r#" version: 1 groups: team: [act-andrew] @@ -1037,42 +1245,26 @@ rules: actions: [export] branch_scope: any "#, - ), - ); + ) + .unwrap(); let output = output_failure( cli() .arg("policy") .arg("validate") - .arg("--cluster") - .arg(cluster.path()) - .arg("--graph") - .arg("knowledge"), + .arg("--config") + .arg(&config), ); let stderr = String::from_utf8(output.stderr).unwrap(); - assert!( - stderr.contains("duplicate policy rule id"), - "expected a duplicate-rule error; got: {stderr}" - ); + assert!(stderr.contains("duplicate policy rule id")); } #[test] -fn policy_test_runs_declarative_cases_against_cluster_bundle() { - let cluster = converged_loaded_cluster("knowledge", 
Some(POLICY_YAML)); - let tests = cluster.path().join("policy.tests.yaml"); - fs::write(&tests, POLICY_TESTS_YAML).unwrap(); +fn policy_test_runs_declarative_cases() { + let temp = tempdir().unwrap(); + let (config, _) = write_policy_config_fixture(temp.path()); - let output = output_success( - cli() - .arg("policy") - .arg("test") - .arg("--cluster") - .arg(cluster.path()) - .arg("--graph") - .arg("knowledge") - .arg("--tests") - .arg(&tests), - ); + let output = output_success(cli().arg("policy").arg("test").arg("--config").arg(&config)); let stdout = stdout_string(&output); assert!(stdout.contains("policy tests passed: 2 cases")); @@ -1080,16 +1272,15 @@ fn policy_test_runs_declarative_cases_against_cluster_bundle() { #[test] fn policy_explain_reports_decision_and_matched_rule() { - let cluster = converged_loaded_cluster("knowledge", Some(POLICY_YAML)); + let temp = tempdir().unwrap(); + let (config, _) = write_policy_config_fixture(temp.path()); let allow = output_success( cli() .arg("policy") .arg("explain") - .arg("--cluster") - .arg(cluster.path()) - .arg("--graph") - .arg("knowledge") + .arg("--config") + .arg(&config) .arg("--actor") .arg("act-andrew") .arg("--action") @@ -1105,10 +1296,8 @@ fn policy_explain_reports_decision_and_matched_rule() { cli() .arg("policy") .arg("explain") - .arg("--cluster") - .arg(cluster.path()) - .arg("--graph") - .arg("knowledge") + .arg("--config") + .arg(&config) .arg("--actor") .arg("act-bruno") .arg("--action") @@ -1122,26 +1311,22 @@ fn policy_explain_reports_decision_and_matched_rule() { } #[test] -fn read_resolves_uri_from_default_store_scope() { - // RFC-011: a zero-flag read resolves its graph from `defaults.store` in the - // operator config (the local-dev default scope) — no omnigraph.yaml. 
+fn read_can_resolve_uri_from_config() { let temp = tempdir().unwrap(); let graph = graph_path(temp.path()); + let config = temp.path().join("omnigraph.yaml"); init_graph(&graph); load_fixture(&graph); - let home = tempdir().unwrap(); - std::fs::write( - home.path().join("config.yaml"), - format!("defaults:\n store: {}\n", graph.to_string_lossy()), - ) - .unwrap(); + write_config(&config, &local_yaml_config(&graph)); let output = output_success( cli() - .env("OMNIGRAPH_HOME", home.path()) .arg("read") + .arg("--config") + .arg(&config) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -1151,6 +1336,135 @@ fn read_resolves_uri_from_default_store_scope() { assert_eq!(payload["row_count"], 1); } +#[test] +fn read_alias_from_yaml_config_runs_with_kv_output() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let config = temp.path().join("omnigraph.yaml"); + let query = temp.path().join("aliases.gq"); + init_graph(&graph); + load_fixture(&graph); + write_query_file( + &query, + &std::fs::read_to_string(fixture("test.gq")).unwrap(), + ); + write_config( + &config, + &format!( + "{}aliases:\n owner:\n command: read\n query: aliases.gq\n name: get_person\n args: [name]\n format: kv\n", + local_yaml_config(&graph) + ), + ); + + let output = output_success( + cli() + .arg("read") + .arg("--config") + .arg(&config) + .arg("--alias") + .arg("owner") + .arg("Alice"), + ); + let stdout = stdout_string(&output); + + assert!(stdout.contains("row 1")); + assert!(stdout.contains("p.name: Alice")); +} + +#[test] +fn read_alias_uses_alias_target_without_cli_default_and_accepts_url_like_arg() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let config = temp.path().join("omnigraph.yaml"); + let query = temp.path().join("aliases.gq"); + let data = temp.path().join("url-like.jsonl"); + init_graph(&graph); + write_jsonl( + &data, + 
r#"{"type":"Person","data":{"name":"https://example.com","age":30}}"#, + ); + output_success(cli().arg("load").arg("--data").arg(&data).arg(&graph)); + write_query_file( + &query, + &std::fs::read_to_string(fixture("test.gq")).unwrap(), + ); + write_config( + &config, + &format!( + "graphs:\n local:\n uri: '{}'\nquery:\n roots:\n - .\npolicy: {{}}\naliases:\n owner:\n command: read\n query: aliases.gq\n name: get_person\n args: [name]\n graph: local\n format: kv\n", + graph.to_string_lossy() + ), + ); + + let output = output_success( + cli() + .arg("read") + .arg("--config") + .arg(&config) + .arg("--alias") + .arg("owner") + .arg("https://example.com"), + ); + let stdout = stdout_string(&output); + + assert!(stdout.contains("row 1")); + assert!(stdout.contains("p.name: https://example.com")); +} + +#[test] +fn change_alias_from_yaml_config_persists_changes() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let config = temp.path().join("omnigraph.yaml"); + let query = temp.path().join("mutations.gq"); + init_graph(&graph); + load_fixture(&graph); + write_query_file( + &query, + r#" +query insert_person($name: String, $age: I32) { + insert Person { name: $name, age: $age } +} +"#, + ); + write_config( + &config, + &format!( + "{}aliases:\n add_person:\n command: change\n query: mutations.gq\n name: insert_person\n args: [name, age]\n", + local_yaml_config(&graph) + ), + ); + + let output = output_success( + cli() + .arg("change") + .arg("--config") + .arg(&config) + .arg("--alias") + .arg("add_person") + .arg("Eve") + .arg("29") + .arg("--json"), + ); + let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); + assert_eq!(payload["affected_nodes"], 1); + + let verify = output_success( + cli() + .arg("read") + .arg(&graph) + .arg("--query") + .arg(fixture("test.gq")) + .arg("--name") + .arg("get_person") + .arg("--params") + .arg(r#"{"name":"Eve"}"#) + .arg("--json"), + ); + let verify_payload: Value = 
serde_json::from_slice(&verify.stdout).unwrap(); + assert_eq!(verify_payload["row_count"], 1); +} + #[test] fn read_csv_format_outputs_header_and_row_values() { let temp = tempdir().unwrap(); @@ -1161,10 +1475,10 @@ fn read_csv_format_outputs_header_and_row_values() { let output = output_success( cli() .arg("read") - .arg("--store") .arg(&graph) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -1177,52 +1491,6 @@ fn read_csv_format_outputs_header_and_row_values() { assert!(stdout.contains("Alice")); } -/// RFC-007 PR 1: the format cascade's operator hop — `defaults.output` in -/// ~/.omnigraph/config.yaml applies when nothing more specific is given, -/// and `--format` still wins over it. -#[test] -fn read_uses_operator_default_output_format() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - init_graph(&graph); - load_fixture(&graph); - let operator_home = tempdir().unwrap(); - fs::write( - operator_home.path().join("config.yaml"), - "defaults:\n output: csv\n", - ) - .unwrap(); - - let read = |extra: &[&str]| { - let mut command = cli(); - command - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("read") - .arg("--store") - .arg(&graph) - .arg("--query") - .arg(fixture("test.gq")) - .arg("get_person") - .arg("--params") - .arg(r#"{"name":"Alice"}"#); - for arg in extra { - command.arg(arg); - } - stdout_string(&output_success(&mut command)) - }; - - let stdout = read(&[]); - assert!( - stdout.lines().next().unwrap().contains("p.name") && stdout.contains("Alice"), - "operator defaults.output: csv applies with no --format: {stdout}" - ); - let stdout = read(&["--format", "jsonl"]); - assert!( - stdout.starts_with('{'), - "--format wins over the operator default: {stdout}" - ); -} - #[test] fn read_jsonl_format_outputs_metadata_header_first() { let temp = tempdir().unwrap(); @@ -1233,10 +1501,10 @@ fn read_jsonl_format_outputs_metadata_header_first() { let
output = output_success( cli() .arg("read") - .arg("--store") .arg(&graph) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -1268,7 +1536,6 @@ query insert_person($name: String, $age: I32) { let output = output_success( cli() .arg("change") - .arg("--store") .arg(&graph) .arg("--query") .arg(&mutation_file) @@ -1285,10 +1552,10 @@ query insert_person($name: String, $age: I32) { let verify = output_success( cli() .arg("read") - .arg("--store") .arg(&graph) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Eve"}"#) @@ -1300,13 +1567,13 @@ query insert_person($name: String, $age: I32) { } #[test] -fn change_resolves_uri_and_default_branch_from_store_scope() { - // RFC-011: a mutate resolves its graph from `--store` and defaults the - // branch to main (no omnigraph.yaml cli.graph / cli.branch). +fn change_can_resolve_uri_and_branch_from_config() { let temp = tempdir().unwrap(); let graph = graph_path(temp.path()); + let config = temp.path().join("omnigraph.yaml"); init_graph(&graph); load_fixture(&graph); + write_config(&config, &local_yaml_config(&graph)); let mutation_file = temp.path().join("config-mutations.gq"); write_query_file( &mutation_file, @@ -1320,8 +1587,8 @@ query insert_person($name: String, $age: I32) { let output = output_success( cli() .arg("change") - .arg("--store") - .arg(&graph) + .arg("--config") + .arg(&config) .arg("--query") .arg(&mutation_file) .arg("--params") @@ -1343,7 +1610,6 @@ fn read_requires_name_for_multi_query_files() { let output = output_failure( cli() .arg("read") - .arg("--store") .arg(&graph) .arg("--query") .arg(fixture("test.gq")), @@ -1362,7 +1628,6 @@ fn read_supports_inline_query_string() { let output = output_success( cli() .arg("read") - .arg("--store") .arg(&repo) .arg("-e") .arg("query find($name: String) { match { $p: Person { name: $name } } return { $p.name, $p.age } }") @@ 
-1376,49 +1641,6 @@ fn read_supports_inline_query_string() { assert_eq!(payload["rows"][0]["p.name"], "Alice"); } -#[test] -fn positional_http_uri_on_a_data_verb_is_rejected() { - // RFC-011: a `--store` http(s):// URL no longer dispatches to a remote - // server — that requires `--server `. - let output = output_failure( - cli() - .arg("query") - .arg("--store") - .arg("http://127.0.0.1:1") - .arg("-e") - .arg("query q() { match { $p: Person { } } return { $p } }"), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("must be addressed with `--server `"), - "expected store-remote rejection; got: {stderr}" - ); -} - -#[test] -fn as_on_a_served_write_is_rejected() { - // RFC-011: a served write resolves the actor from the bearer token, so --as - // cannot set identity. It errors while building the remote client — before - // any HTTP call, so no server is needed. - let output = output_failure( - cli() - .arg("mutate") - .arg("--server") - .arg("http://127.0.0.1:1") - .arg("--as") - .arg("act-nope") - .arg("-e") - .arg("query add($name: String) { insert Person { name: $name } }") - .arg("--params") - .arg(r#"{"name":"X"}"#), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("`--as` is not allowed on a served write"), - "expected --as-served rejection; got: {stderr}" - ); -} - #[test] fn change_supports_inline_query_string() { let temp = tempdir().unwrap(); @@ -1429,7 +1651,6 @@ fn change_supports_inline_query_string() { let output = output_success( cli() .arg("change") - .arg("--store") .arg(&repo) .arg("--query-string") .arg("query add($name: String, $age: I32) { insert Person { name: $name, age: $age } }") @@ -1444,7 +1665,6 @@ fn change_supports_inline_query_string() { let verify = output_success( cli() .arg("read") - .arg("--store") .arg(&repo) .arg("-e") .arg("query find($name: String) { match { $p: Person { name: $name } } return { $p.name } }") @@ -1466,7 +1686,6 @@ fn
read_rejects_query_string_combined_with_query() { let output = output_failure( cli() .arg("read") - .arg("--store") .arg(&repo) .arg("--query") .arg(fixture("test.gq")) @@ -1487,7 +1706,7 @@ fn read_rejects_empty_query_string() { init_graph(&repo); load_fixture(&repo); - let output = output_failure(cli().arg("read").arg("--store").arg(&repo).arg("-e").arg("")); + let output = output_failure(cli().arg("read").arg(&repo).arg("-e").arg("")); let stderr = String::from_utf8(output.stderr).unwrap(); assert!( stderr.contains("must not be empty"), @@ -1615,160 +1834,6 @@ fn branch_delete_rejects_main() { assert!(stderr.contains("cannot delete branch 'main'")); } -// ── RFC-011 Decision 9: write diagnostics + non-local destructive-confirm ── - -#[test] -fn write_echoes_resolved_target_to_stderr() { - // Every write echoes its resolved target + access path to stderr; --json - // (stdout) is unaffected. A local load → "(direct, local)". - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - init_graph(&graph); - let data = fixture("test.jsonl"); - let output = output_success( - cli() - .arg("load") - .arg("--mode") - .arg("append") - .arg("--data") - .arg(&data) - .arg(&graph) - .arg("--json"), - ); - let stderr = String::from_utf8(output.stderr).unwrap(); - assert!( - stderr.contains("omnigraph load →") && stderr.contains("(direct, local)"), - "missing write-target echo; stderr: {stderr}" - ); - // stdout still parses as JSON — the echo went to stderr.
- let _: Value = serde_json::from_slice(&output.stdout).unwrap(); -} - -#[test] -fn quiet_suppresses_the_write_target_echo() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - init_graph(&graph); - let data = fixture("test.jsonl"); - let output = output_success( - cli() - .arg("--quiet") - .arg("load") - .arg("--mode") - .arg("append") - .arg("--data") - .arg(&data) - .arg(&graph), - ); - let stderr = String::from_utf8(output.stderr).unwrap(); - assert!( - !stderr.contains("omnigraph load →"), - "--quiet should suppress the echo; stderr: {stderr}" - ); -} - -#[test] -fn branch_delete_against_non_local_scope_refuses_without_yes() { - // No bucket needed: the confirm gate fires before the graph is opened. - let output = output_failure( - cli() - .arg("branch") - .arg("delete") - .arg("--store") - .arg("s3://fake-bucket/g.omni") - .arg("feature") - .arg("--json"), - ); - let stderr = String::from_utf8(output.stderr).unwrap(); - assert!( - stderr.contains("refusing destructive `branch delete`") && stderr.contains("--yes"), - "expected a non-local destructive refusal; stderr: {stderr}" - ); -} - -#[test] -fn branch_delete_against_non_local_scope_passes_gate_with_yes() { - // With --yes the gate is bypassed; the command then fails for an unrelated - // reason (the fake bucket can't be opened), so the refusal must be ABSENT.
- let output = output_failure( - cli() - .arg("branch") - .arg("delete") - .arg("--store") - .arg("s3://fake-bucket/g.omni") - .arg("feature") - .arg("--yes") - .arg("--json"), - ); - let stderr = String::from_utf8(output.stderr).unwrap(); - assert!( - !stderr.contains("refusing destructive"), - "--yes should bypass the confirm gate; stderr: {stderr}" - ); -} - -#[test] -fn overwrite_load_against_non_local_scope_refuses_without_yes() { - let output = output_failure( - cli() - .arg("load") - .arg("--mode") - .arg("overwrite") - .arg("--data") - .arg(fixture("test.jsonl")) - .arg("--store") - .arg("s3://fake-bucket/g.omni") - .arg("--json"), - ); - let stderr = String::from_utf8(output.stderr).unwrap(); - assert!( - stderr.contains("refusing destructive `load --mode overwrite`"), - "expected a non-local overwrite refusal; stderr: {stderr}" - ); -} - -#[test] -fn cleanup_against_non_local_scope_refuses_without_yes() { - // Past the --confirm preview gate, a non-local cleanup still needs --yes. - let output = output_failure( - cli() - .arg("cleanup") - .arg("--store") - .arg("s3://fake-bucket/g.omni") - .arg("--keep") - .arg("5") - .arg("--confirm") - .arg("--json"), - ); - let stderr = String::from_utf8(output.stderr).unwrap(); - assert!( - stderr.contains("refusing destructive `cleanup`"), - "expected a non-local cleanup refusal; stderr: {stderr}" - ); -} - -#[test] -fn cleanup_against_local_scope_executes_with_confirm() { - // Local cleanup needs no --yes; --confirm alone executes (and echoes). 
- let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - init_graph(&graph); - load_fixture(&graph); - let output = output_success( - cli() - .arg("cleanup") - .arg("--keep") - .arg("1") - .arg("--confirm") - .arg(&graph) - .arg("--json"), - ); - let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - assert!(payload["tables"].as_array().is_some(), "{payload}"); - let stderr = String::from_utf8(output.stderr).unwrap(); - assert!(stderr.contains("omnigraph cleanup →"), "stderr: {stderr}"); -} - #[test] fn branch_merge_defaults_target_to_main() { let temp = tempdir().unwrap(); @@ -1917,18 +1982,82 @@ fn snapshot_json_returns_manifest_version_and_tables() { assert!(payload["tables"].as_array().unwrap().len() >= 4); } +fn write_seed_fixture(root: &std::path::Path) -> std::path::PathBuf { + fs::create_dir_all(root.join("data")).unwrap(); + fs::create_dir_all(root.join("build")).unwrap(); + let raw_seed = root.join("data/seed.jsonl"); + let seed = root.join("seed.yaml"); + + fs::write( + &raw_seed, + concat!( + "{\"type\":\"Decision\",\"data\":{\"slug\":\"dec-alpha\",\"intent\":\"Alpha ship\"}}\n", + "{\"type\":\"Decision\",\"data\":{\"slug\":\"dec-beta\",\"intent\":\"Beta ship\",\"embedding\":[0.1,0.2]}}\n" + ), + ) + .unwrap(); + + fs::write( + &seed, + concat!( + "graph:\n", + " slug: mr-context-graph\n", + "sources:\n", + " raw_seed: ./data/seed.jsonl\n", + "artifacts:\n", + " embedded_seed: ./build/seed.embedded.jsonl\n", + "embeddings:\n", + " model: gemini-embedding-2-preview\n", + " dimension: 4\n", + " types:\n", + " Decision:\n", + " target: embedding\n", + " fields: [slug, intent]\n" + ), + ) + .unwrap(); + + seed +} + +fn write_seed_fixture_with_edge(root: &std::path::Path) -> std::path::PathBuf { + let seed = write_seed_fixture(root); + let raw_seed = root.join("data/seed.jsonl"); + fs::write( + &raw_seed, + concat!( + "{\"type\":\"Decision\",\"data\":{\"slug\":\"dec-alpha\",\"intent\":\"Alpha ship\"}}\n", +
"{\"type\":\"Decision\",\"data\":{\"slug\":\"dec-beta\",\"intent\":\"Beta ship\",\"embedding\":[0.1,0.2]}}\n", + "{\"edge\":\"Triggered\",\"from\":\"sig-alpha\",\"to\":\"dec-alpha\"}\n" + ), + ) + .unwrap(); + seed +} + +fn read_embedded_rows(path: std::path::PathBuf) -> Vec { + fs::read_to_string(path) + .unwrap() + .lines() + .filter(|line| !line.trim().is_empty()) + .map(|line| serde_json::from_str(line).unwrap()) + .collect() +} + #[test] -fn snapshot_resolves_uri_from_store_scope() { +fn snapshot_can_resolve_uri_from_config() { let temp = tempdir().unwrap(); let graph = graph_path(temp.path()); + let config = temp.path().join("omnigraph.yaml"); init_graph(&graph); load_fixture(&graph); + write_config(&config, &local_yaml_config(&graph)); let output = output_success( cli() .arg("snapshot") - .arg("--store") - .arg(&graph) + .arg("--config") + .arg(&config) .arg("--json"), ); let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); @@ -2018,8 +2147,6 @@ fn cli_fails_for_missing_schema_or_data_file() { let load_output = output_failure( cli() .arg("load") - .arg("--mode") - .arg("overwrite") .arg("--data") .arg(&missing_data) .arg(&graph), @@ -2070,161 +2197,182 @@ fn cli_fails_for_invalid_merge_requests() { ); } -/// RFC-011 Decision 8: `profile list` / `profile show` inspect the operator -/// config's profiles read-only. Hermetic via OMNIGRAPH_HOME. 
-fn profile_home() -> tempfile::TempDir { - let home = tempdir().unwrap(); - std::fs::write( - home.path().join("config.yaml"), - "operator:\n actor: act-andrew\n\ - defaults:\n output: json\n server: prod\n default_graph: knowledge\n\ - servers:\n prod:\n url: https://graph.example.com\n\ - clusters:\n brain:\n root: s3://acme/clusters/brain\n\ - profiles:\n\ - \x20 staging:\n server: prod\n default_graph: kb\n\ - \x20 brain-admin:\n cluster: brain\n\ - \x20 localdev:\n store: file:///data/dev.omni\n\ - \x20 broken:\n server: a\n store: b\n", - ) - .unwrap(); - home -} +// `omnigraph run list/show/publish/abort` subcommands removed +// alongside the run state machine. Direct-to-target writes leave nothing +// for these CLIs to manage. Audit history is now visible via +// `omnigraph commit list` reading the commit graph. + +// ─── MR-694 PR B: --allow-data-loss flag end-to-end ────────────────────── +// +// The schema-lint chassis v1.2 (PR #100) shipped the `--allow-data-loss` +// flag at the CLI layer; the SDK suite verifies promotion to Hard mode +// via `apply_schema_with_options(.., SchemaApplyOptions { allow_data_loss })`. +// These CLI tests close the integration gap so a future change that +// drops the flag wiring in `main.rs` turns red. #[test] -fn profile_list_names_each_profile_with_its_binding_and_marks_active() { - let home = profile_home(); - let out = output_success( - cli() - .env("OMNIGRAPH_HOME", home.path()) - .env("OMNIGRAPH_PROFILE", "staging") - .arg("profile") - .arg("list"), - ); - let stdout = stdout_string(&out); - assert!(stdout.contains("staging (active)"), "{stdout}"); - assert!(stdout.contains("server: prod"), "{stdout}"); - assert!(stdout.contains("cluster: brain"), "{stdout}"); - assert!(stdout.contains("store: file:///data/dev.omni"), "{stdout}"); - // A malformed (two-scope) profile is reported, not a hard failure. 
- assert!(stdout.contains("broken") && stdout.contains("invalid:"), "{stdout}"); -} +fn schema_apply_allow_data_loss_flag_promotes_drops_to_hard() { + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let schema_path = temp.path().join("drop-age.pg"); + init_graph(&graph); -#[test] -fn profile_list_json_shape() { - let home = profile_home(); - let out = output_success( + // Drop the nullable `age` column. + let next_schema = fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace(" age: I32?\n", ""); + fs::write(&schema_path, next_schema).unwrap(); + + let output = output_success( cli() - .env("OMNIGRAPH_HOME", home.path()) - .arg("profile") - .arg("list") - .arg("--json"), + .arg("schema") + .arg("apply") + .arg("--schema") + .arg(&schema_path) + .arg("--allow-data-loss") + .arg("--json") + .arg(&graph), ); - let items: Value = serde_json::from_slice(&out.stdout).unwrap(); - let brain = items + let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); + assert_eq!(payload["applied"], true); + + let drop_step = payload["steps"] .as_array() .unwrap() .iter() - .find(|p| p["name"] == "brain-admin") - .unwrap(); - assert_eq!(brain["binding"], "cluster: brain"); - assert_eq!(brain["scope_kind"], "cluster"); - assert_eq!(brain["target"], "brain"); - assert_eq!(brain["valid"], true); - assert!(brain["error"].is_null()); - assert_eq!(brain["active"], false); - let broken = items + .find(|s| s["kind"] == "drop_property") + .expect("plan should include a drop_property step"); + assert_eq!( + drop_step["mode"], "hard", + "--allow-data-loss should promote Soft → Hard; full step: {drop_step}", + ); +} + +#[test] +fn schema_apply_without_allow_data_loss_keeps_soft_drops() { + // Symmetric to the above: same schema change without the flag → + // drops stay Soft. Pins default semantics against accidental Hard + // promotion if a future refactor changes the option threading.
+ let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + let schema_path = temp.path().join("drop-age-soft.pg"); + init_graph(&graph); + + let next_schema = fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace(" age: I32?\n", ""); + fs::write(&schema_path, next_schema).unwrap(); + + let output = output_success( + cli() + .arg("schema") + .arg("apply") + .arg("--schema") + .arg(&schema_path) + .arg("--json") + .arg(&graph), + ); + let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); + assert_eq!(payload["applied"], true); + + let drop_step = payload["steps"] .as_array() .unwrap() .iter() - .find(|p| p["name"] == "broken") - .unwrap(); - assert_eq!(broken["scope_kind"], "invalid"); - assert_eq!(broken["valid"], false); - assert!(broken["target"].is_null()); + .find(|s| s["kind"] == "drop_property") + .expect("plan should include a drop_property step"); + assert_eq!( + drop_step["mode"], "soft", + "no flag should leave drops Soft; full step: {drop_step}", + ); +} + +#[test] +fn schema_plan_parity_cli_and_sdk() { + // Same .pg through `Omnigraph::plan_schema_with_options` (SDK) and + // `omnigraph schema plan --json` (CLI). Asserts the steps array is + // byte-identical after JSON round-trip. HTTP doesn't expose a + // separate /schema/plan route — that side of parity is covered by + // the HTTP soft/hard drop tests, which exercise apply with + // identical fixtures. + let temp = tempdir().unwrap(); + let graph = graph_path(temp.path()); + init_graph(&graph); + let schema_path = temp.path().join("plan-parity.pg"); + let next_schema = fs::read_to_string(fixture("test.pg")).unwrap().replace( + " age: I32?\n}", + " age: I32?\n nickname: String?\n}", + ); + fs::write(&schema_path, &next_schema).unwrap(); + + // CLI side.
+ let cli_output = output_success( + cli() + .arg("schema") + .arg("plan") + .arg("--schema") + .arg(&schema_path) + .arg("--json") + .arg(&graph), + ); + let cli_payload: Value = serde_json::from_slice(&cli_output.stdout).unwrap(); + + // SDK side: open graph, call plan_schema. + let plan = tokio::runtime::Runtime::new().unwrap().block_on(async { + let db = Omnigraph::open(graph.to_string_lossy().as_ref()) + .await + .unwrap(); + db.plan_schema(&next_schema).await.unwrap() + }); + let sdk_steps = serde_json::to_value(&plan.steps).unwrap(); + + assert_eq!( + cli_payload["steps"], sdk_steps, + "CLI plan steps must match SDK plan steps for identical input", + ); + assert_eq!(cli_payload["supported"], plan.supported); +} + +// ─── MR-668 PR 8 — omnigraph graphs subcommand ───────────────────────────── + +/// `omnigraph graphs --help` lists only the read-only `list` +/// subcommand. Runtime add (`create`) and remove (`delete`) are +/// deferred — operators add/remove graphs by editing `omnigraph.yaml` +/// and restarting. This test pins the deferral against accidental +/// re-introduction. +#[test] +fn graphs_subcommand_help_lists_list_only() { + let output = output_success(cli().arg("graphs").arg("--help")); + let stdout = stdout_string(&output); assert!( - broken["error"] - .as_str() - .unwrap() - .contains("profile 'broken'") + stdout.contains("list"), + "expected `list` subcommand in help output:\n{stdout}" + ); + let lowered = stdout.to_lowercase(); + assert!( + !lowered.contains("create a new graph"), + "graph create should not be in v0.6.0 help; got:\n{stdout}" + ); + assert!( + !lowered.contains("delete a graph"), + "graph delete should not be in v0.6.0 help; got:\n{stdout}" ); } +/// `omnigraph graphs list` against a local URI errors with a clear +/// message — the CLI only operates against remote multi-graph servers. +#[test] -fn profile_show_resolves_named_scope_endpoints() { - let home = profile_home(); - // A cluster profile resolves its root.
- let cluster = output_success( +fn graphs_list_against_local_uri_errors_with_remote_only_message() { + let output = output_failure( cli() - .env("OMNIGRAPH_HOME", home.path()) - .arg("profile") - .arg("show") - .arg("brain-admin"), + .arg("graphs") + .arg("list") + .arg("--uri") + .arg("/tmp/local"), ); - let cs = stdout_string(&cluster); - assert!(cs.contains("scope: cluster brain"), "{cs}"); - assert!(cs.contains("endpoint: s3://acme/clusters/brain"), "{cs}"); - - // A store profile shows its URI as the endpoint. - let store = output_success( - cli() - .env("OMNIGRAPH_HOME", home.path()) - .arg("profile") - .arg("show") - .arg("localdev") - .arg("--json"), + let stderr = String::from_utf8_lossy(&output.stderr).into_owned(); + assert!( + stderr.contains("remote multi-graph server URL"), + "expected 'remote multi-graph server URL' rejection in stderr; got:\n{stderr}" ); - let detail: Value = serde_json::from_slice(&store.stdout).unwrap(); - assert_eq!(detail["scope_kind"], "store"); - assert_eq!(detail["endpoint"], "file:///data/dev.omni"); -} - -#[test] -fn profile_show_without_name_falls_back_to_flat_defaults() { - let home = profile_home(); - let out = output_success( - cli() - .env("OMNIGRAPH_HOME", home.path()) - .arg("profile") - .arg("show") - .arg("--json"), - ); - let detail: Value = serde_json::from_slice(&out.stdout).unwrap(); - assert_eq!(detail["name"], "(defaults)"); - assert_eq!(detail["scope_kind"], "server"); - assert_eq!(detail["endpoint"], "https://graph.example.com"); - assert_eq!(detail["default_graph"], "knowledge"); -} - -#[test] -fn profile_show_without_name_uses_active_env_profile() { - let home = profile_home(); - let out = output_success( - cli() - .env("OMNIGRAPH_HOME", home.path()) - .env("OMNIGRAPH_PROFILE", "brain-admin") - .arg("profile") - .arg("show") - .arg("--json"), - ); - let detail: Value = serde_json::from_slice(&out.stdout).unwrap(); - // No name arg, but $OMNIGRAPH_PROFILE selects brain-admin (not the flat defaults). 
- assert_eq!(detail["name"], "brain-admin"); - assert_eq!(detail["scope_kind"], "cluster"); - assert_eq!(detail["endpoint"], "s3://acme/clusters/brain"); - // output_format renders as the canonical lowercase value name. - assert_eq!(detail["output_format"], "json"); -} - -#[test] -fn profile_show_unknown_name_errors() { - let home = profile_home(); - let out = output_failure( - cli() - .env("OMNIGRAPH_HOME", home.path()) - .arg("profile") - .arg("show") - .arg("nope"), - ); - let stderr = String::from_utf8_lossy(&out.stderr); - assert!(stderr.contains("unknown profile 'nope'"), "{stderr}"); } diff --git a/crates/omnigraph-cli/tests/cli_cluster.rs b/crates/omnigraph-cli/tests/cli_cluster.rs deleted file mode 100644 index d2b6d13..0000000 --- a/crates/omnigraph-cli/tests/cli_cluster.rs +++ /dev/null @@ -1,1129 +0,0 @@ -//! Cluster command surface: validate/plan/apply/approve/status/sync/force-unlock. -//! Moved verbatim from tests/cli.rs in the modularization. - -use std::fs; - -use tempfile::tempdir; - -mod support; - -use support::*; - - -#[test] -fn cluster_validate_config_success() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - - let output = output_success( - cli() - .arg("cluster") - .arg("validate") - .arg("--config") - .arg(temp.path()), - ); - let stdout = stdout_string(&output); - assert!(stdout.contains("cluster config valid"), "{stdout}"); -} - -#[test] -fn cluster_validate_json_is_stable() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - - let json = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("validate") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], true); - assert!(json["resource_digests"]["graph.knowledge"].is_string()); - assert!(json["resource_digests"]["query.knowledge.find_person"].is_string()); - assert_eq!(json["dependencies"][0]["from"], "policy.base"); - assert_eq!(json["dependencies"][0]["to"], 
"graph.knowledge"); -} - -#[test] -fn cluster_plan_json_reads_inferred_local_state() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - let state_dir = temp.path().join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#" -{ - "version": 1, - "applied_revision": { - "config_digest": "old", - "resources": { - "graph.knowledge": { "digest": "old-graph" }, - "policy.old": { "digest": "old-policy" } - } - } -} -"#, - ) - .unwrap(); - - let json = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("plan") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], true); - assert_eq!(json["state_observations"]["state_found"], true); - assert!( - json["changes"] - .as_array() - .unwrap() - .iter() - .any(|change| change["resource"] == "policy.old" && change["operation"] == "delete"), - "plan should read state and delete stale resources: {json}" - ); -} - -#[test] -fn cluster_status_json_reports_missing_state() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - - let json = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("status") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], true); - assert_eq!(json["state_observations"]["state_found"], false); - assert!( - json["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| diagnostic["code"] == "state_missing"), - "missing state should be a warning diagnostic: {json}" - ); -} - -#[test] -fn cluster_status_json_reports_lock_metadata() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - write_cluster_lock(temp.path(), "held-lock", "refresh"); - - let json = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("status") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], true); - 
assert_eq!(json["state_observations"]["locked"], true); - assert_eq!(json["state_observations"]["lock_id"], "held-lock"); - assert_eq!(json["state_observations"]["lock_operation"], "refresh"); - assert_eq!(json["state_observations"]["lock_pid"], 123); - assert_eq!( - json["state_observations"]["lock_created_at"], - "1970-01-01T00:00:00Z" - ); - assert!(json["state_observations"]["lock_age_seconds"].is_number()); -} - -#[test] -fn cluster_status_json_reports_extended_state() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - let state_dir = temp.path().join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#" -{ - "version": 1, - "state_revision": 5, - "applied_revision": { - "config_digest": "applied", - "resources": { - "graph.knowledge": { "digest": "graph-digest" } - } - }, - "resource_statuses": { - "graph.knowledge": { "status": "applied", "conditions": ["healthy"] } - }, - "approval_records": {}, - "recovery_records": {}, - "observations": {} -} -"#, - ) - .unwrap(); - - let json = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("status") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], true); - assert_eq!(json["state_observations"]["state_revision"], 5); - assert!( - json["state_observations"]["state_cas"] - .as_str() - .unwrap() - .starts_with("sha256:") - ); - assert_eq!(json["resource_digests"]["graph.knowledge"], "graph-digest"); - assert_eq!( - json["resource_statuses"]["graph.knowledge"]["status"], - "applied" - ); -} - -#[test] -fn cluster_plan_json_includes_state_cas_revision_and_lock_observation() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - let state_dir = temp.path().join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#" -{ - "version": 1, - "state_revision": 9, - "applied_revision": { - "config_digest": "old", 
- "resources": { - "graph.knowledge": { "digest": "old-graph" } - } - } -} -"#, - ) - .unwrap(); - - let json = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("plan") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], true); - assert_eq!(json["state_observations"]["state_revision"], 9); - assert!( - json["state_observations"]["state_cas"] - .as_str() - .unwrap() - .starts_with("sha256:") - ); - assert_eq!(json["state_observations"]["locked"], false); - assert_eq!(json["state_observations"]["lock_acquired"], true); - assert!(json["state_observations"]["acquired_lock_id"].is_string()); - assert!(!state_dir.join("lock.json").exists()); -} - -#[test] -fn cluster_plan_locked_state_exits_nonzero() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - write_cluster_lock(temp.path(), "held-lock", "plan"); - - let output = output_failure( - cli() - .arg("cluster") - .arg("plan") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - ); - let json = parse_stdout_json(&output); - assert_eq!(json["ok"], false); - assert_eq!(json["state_observations"]["locked"], true); - assert_eq!(json["state_observations"]["lock_acquired"], false); - assert_eq!(json["state_observations"]["lock_id"], "held-lock"); - assert_eq!(json["state_observations"]["lock_operation"], "plan"); - assert_eq!(json["state_observations"]["lock_pid"], 123); - assert_eq!( - json["state_observations"]["lock_created_at"], - "1970-01-01T00:00:00Z" - ); - assert!(json["state_observations"]["lock_age_seconds"].is_number()); - assert!( - json["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| diagnostic["code"] == "state_lock_held" - && diagnostic["message"] - .as_str() - .unwrap() - .contains("force-unlock held-lock")), - "locked state should produce a useful diagnostic: {json}" - ); -} - -#[test] -fn cluster_force_unlock_json_removes_lock() { - let temp = tempdir().unwrap(); - 
write_cluster_config_fixture(temp.path()); - write_cluster_lock(temp.path(), "held-lock", "plan"); - - let json = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("force-unlock") - .arg("held-lock") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], true); - assert_eq!(json["lock_removed"], true); - assert_eq!(json["state_observations"]["lock_id"], "held-lock"); - assert_eq!(json["state_observations"]["lock_operation"], "plan"); - assert!(!temp.path().join("__cluster/lock.json").exists()); -} - -#[test] -fn cluster_force_unlock_wrong_id_exits_nonzero() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - write_cluster_lock(temp.path(), "held-lock", "plan"); - - let json = parse_stdout_json(&output_failure( - cli() - .arg("cluster") - .arg("force-unlock") - .arg("other-lock") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], false); - assert_eq!(json["lock_removed"], false); - assert!( - json["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| diagnostic["code"] == "state_lock_id_mismatch") - ); - assert!(temp.path().join("__cluster/lock.json").exists()); -} - -#[test] -fn cluster_locked_plan_then_force_unlock_then_plan_succeeds() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - write_cluster_lock(temp.path(), "held-lock", "plan"); - - let locked = parse_stdout_json(&output_failure( - cli() - .arg("cluster") - .arg("plan") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(locked["ok"], false); - assert_eq!(locked["state_observations"]["lock_id"], "held-lock"); - - let unlocked = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("force-unlock") - .arg("held-lock") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(unlocked["lock_removed"], true); - - let planned = parse_stdout_json(&output_success( - cli() - .arg("cluster") - 
.arg("plan") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(planned["ok"], true); -} - -#[test] -fn cluster_import_json_bootstraps_missing_state() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - init_cluster_derived_graph(temp.path()); - - let json = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("import") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], true); - assert_eq!(json["operation"], "import"); - assert_eq!(json["state_observations"]["state_revision"], 1); - assert!( - json["state_observations"]["state_cas"] - .as_str() - .unwrap() - .starts_with("sha256:") - ); - assert_eq!(json["state_observations"]["locked"], false); - assert_eq!(json["state_observations"]["lock_acquired"], true); - assert!(json["state_observations"]["acquired_lock_id"].is_string()); - assert!(json["observations"]["graph.knowledge"]["manifest_version"].is_number()); - assert_eq!( - json["resource_statuses"]["graph.knowledge"]["status"], - "applied" - ); - assert!(temp.path().join("__cluster/state.json").exists()); - assert!(!temp.path().join("__cluster/lock.json").exists()); -} - -#[test] -fn cluster_refresh_json_updates_revision_cas_and_removes_lock() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - init_cluster_derived_graph(temp.path()); - let state_dir = temp.path().join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#" -{ - "version": 1, - "state_revision": 2, - "applied_revision": { "resources": {} } -} -"#, - ) - .unwrap(); - - let json = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("refresh") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], true); - assert_eq!(json["operation"], "refresh"); - assert_eq!(json["state_observations"]["state_revision"], 3); - assert!( - json["state_observations"]["state_cas"] - 
.as_str() - .unwrap() - .starts_with("sha256:") - ); - assert_eq!(json["state_observations"]["locked"], false); - assert_eq!(json["state_observations"]["lock_acquired"], true); - assert!(json["state_observations"]["acquired_lock_id"].is_string()); - assert!(!state_dir.join("lock.json").exists()); -} - -#[test] -fn cluster_refresh_missing_state_exits_nonzero() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - - let output = output_failure( - cli() - .arg("cluster") - .arg("refresh") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - ); - let json = parse_stdout_json(&output); - assert_eq!(json["ok"], false); - assert!( - json["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| diagnostic["code"] == "state_missing"), - "missing state should produce a useful diagnostic: {json}" - ); -} - -#[test] -fn cluster_import_existing_state_exits_nonzero() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - let state_dir = temp.path().join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{"version":1,"applied_revision":{"resources":{}}}"#, - ) - .unwrap(); - - let output = output_failure( - cli() - .arg("cluster") - .arg("import") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - ); - let json = parse_stdout_json(&output); - assert_eq!(json["ok"], false); - assert!( - json["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| diagnostic["code"] == "state_already_exists"), - "existing state should produce a useful diagnostic: {json}" - ); -} - -#[test] -fn cluster_refresh_and_import_locked_state_exit_nonzero() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - let state_dir = temp.path().join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{"version":1,"applied_revision":{"resources":{}}}"#, - ) - .unwrap(); - fs::write( 
- state_dir.join("lock.json"), - r#"{"version":1,"lock_id":"held-lock","operation":"refresh","created_at":"2026-06-08T00:00:00Z","pid":123}"#, - ) - .unwrap(); - - let refresh = parse_stdout_json(&output_failure( - cli() - .arg("cluster") - .arg("refresh") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(refresh["state_observations"]["locked"], true); - assert_eq!(refresh["state_observations"]["lock_id"], "held-lock"); - assert_eq!(refresh["state_observations"]["lock_acquired"], false); - assert!( - refresh["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| diagnostic["code"] == "state_lock_held") - ); - - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - let state_dir = temp.path().join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("lock.json"), - r#"{"version":1,"lock_id":"held-lock","operation":"import","created_at":"2026-06-08T00:00:00Z","pid":123}"#, - ) - .unwrap(); - - let imported = parse_stdout_json(&output_failure( - cli() - .arg("cluster") - .arg("import") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(imported["state_observations"]["locked"], true); - assert_eq!(imported["state_observations"]["lock_id"], "held-lock"); - assert_eq!(imported["state_observations"]["lock_acquired"], false); - assert!( - imported["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| diagnostic["code"] == "state_lock_held") - ); -} - -#[test] -fn cluster_validate_invalid_config_exits_nonzero() { - let temp = tempdir().unwrap(); - fs::write( - temp.path().join("cluster.yaml"), - "version: 1\ngraphs: {}\npipelines: {}\n", - ) - .unwrap(); - - let output = output_failure( - cli() - .arg("cluster") - .arg("validate") - .arg("--config") - .arg(temp.path()), - ); - let stdout = stdout_string(&output); - assert!(stdout.contains("future_phase_field"), "{stdout}"); -} - -#[test] -fn 
cluster_apply_json_applies_query_and_policy() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - let validate = write_cluster_applyable_state(temp.path()); - - let json = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("apply") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(json["ok"], true, "{json}"); - assert_eq!(json["applied_count"], 2, "{json}"); - assert_eq!(json["converged"], true, "{json}"); - assert_eq!(json["state_written"], true, "{json}"); - assert_eq!( - json["resource_statuses"]["query.knowledge.find_person"]["status"], - "applied" - ); - - let query_digest = validate["resource_digests"]["query.knowledge.find_person"] - .as_str() - .unwrap(); - let payload = temp - .path() - .join("__cluster/resources/query/knowledge/find_person") - .join(format!("{query_digest}.gq")); - assert!(payload.exists(), "missing payload {}", payload.display()); - - let state: serde_json::Value = serde_json::from_str( - &fs::read_to_string(temp.path().join("__cluster/state.json")).unwrap(), - ) - .unwrap(); - assert_eq!(state["state_revision"], 2); - assert_eq!( - state["applied_revision"]["resources"]["query.knowledge.find_person"]["digest"], - *query_digest - ); -} - -#[test] -fn cluster_apply_missing_state_exits_nonzero() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - - let output = output_failure( - cli() - .arg("cluster") - .arg("apply") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - ); - let json = parse_stdout_json(&output); - assert_eq!(json["ok"], false); - assert!( - json["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| diagnostic["code"] == "state_missing"), - "{json}" - ); - assert!(!temp.path().join("__cluster/resources").exists()); -} - -#[test] -fn cluster_apply_locked_exits_nonzero() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - write_cluster_applyable_state(temp.path()); - 
write_cluster_lock(temp.path(), "held-lock", "plan"); - - let output = output_failure( - cli() - .arg("cluster") - .arg("apply") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - ); - let json = parse_stdout_json(&output); - assert_eq!(json["ok"], false); - assert!( - json["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| diagnostic["code"] == "state_lock_held"), - "{json}" - ); - assert!(temp.path().join("__cluster/lock.json").exists()); - assert!(!temp.path().join("__cluster/resources").exists()); -} - -/// RFC-011: the actor chain is `--as` > `operator.actor` > none. The CLI no -/// longer reads omnigraph.yaml `cli.actor`. -#[test] -fn cluster_apply_uses_operator_actor_from_omnigraph_home() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - let operator_home = tempdir().unwrap(); - fs::write( - operator_home.path().join("config.yaml"), - "operator:\n actor: act-operator\n", - ) - .unwrap(); - - let output = cli() - .current_dir(temp.path()) - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("cluster") - .arg("import") - .arg("--config") - .arg(temp.path()) - .output() - .unwrap(); - assert!(output.status.success(), "{output:?}"); - - let apply = |extra: &[&str]| { - let mut command = cli(); - command - .current_dir(temp.path()) - .env("OMNIGRAPH_HOME", operator_home.path()); - for arg in extra { - command.arg(arg); - } - let output = command - .arg("cluster") - .arg("apply") - .arg("--config") - .arg(temp.path()) - .arg("--json") - .output() - .unwrap(); - let json: serde_json::Value = - serde_json::from_str(String::from_utf8_lossy(&output.stdout).trim()).unwrap(); - json["actor"].clone() - }; - - // No --as: the operator identity applies. - assert_eq!( - apply(&[]), - "act-operator", - "operator.actor is the no-flag default" - ); - // --as still wins over the operator layer. 
- assert_eq!(apply(&["--as", "andrew"]), "andrew"); -} - -#[test] -fn cluster_approve_uses_operator_actor_fallback() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - let operator_home = tempdir().unwrap(); - fs::write( - operator_home.path().join("config.yaml"), - "operator:\n actor: act-operator\n", - ) - .unwrap(); - // Converge, then remove the graph so a gated delete is pending. - for command in ["import", "apply"] { - let output = cli() - .current_dir(temp.path()) - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("cluster") - .arg(command) - .arg("--config") - .arg(temp.path()) - .output() - .unwrap(); - assert!(output.status.success(), "cluster {command} failed"); - } - fs::write(temp.path().join("cluster.yaml"), "version: 1\ngraphs: {}\n").unwrap(); - - let output = cli() - .current_dir(temp.path()) - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("cluster") - .arg("approve") - .arg("graph.knowledge") - .arg("--config") - .arg(temp.path()) - .arg("--json") - .output() - .unwrap(); - assert!(output.status.success(), "{output:?}"); - let json: serde_json::Value = - serde_json::from_str(String::from_utf8_lossy(&output.stdout).trim()).unwrap(); - assert_eq!(json["approved_by"], "act-operator"); - - // With neither flag nor operator config: refused with the actionable - // message (an approval without an approver is meaningless). 
- let bare = tempdir().unwrap(); - write_cluster_config_fixture(bare.path()); - let bare_home = tempdir().unwrap(); - let output = output_failure( - cli() - .current_dir(bare.path()) - .env("OMNIGRAPH_HOME", bare_home.path()) - .arg("cluster") - .arg("approve") - .arg("graph.knowledge") - .arg("--config") - .arg(bare.path()), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!(stderr.contains("--as"), "{stderr}"); - assert!(stderr.contains("operator.actor"), "{stderr}"); - assert!(stderr.contains("config.yaml"), "{stderr}"); - assert!(!stderr.contains("cli.actor"), "{stderr}"); - assert!(!stderr.contains("omnigraph.yaml"), "{stderr}"); -} - -#[test] -fn cluster_commands_ignore_legacy_omnigraph_yaml() { - // RFC-011: the CLI never reads omnigraph.yaml for cluster commands: a - // present (even malformed) legacy file is inert. The actor falls back to - // `operator.actor`, then to none (no loud failure on absence). - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - fs::write(temp.path().join("omnigraph.yaml"), "{{{{ not yaml").unwrap(); - - for command in ["validate", "plan", "status"] { - let output = cli() - .current_dir(temp.path()) - .arg("cluster") - .arg(command) - .arg("--config") - .arg(temp.path()) - .arg("--json") - .output() - .unwrap(); - assert!( - output.status.success() || command == "plan", // plan warns state-missing pre-import; still must not config-error - "cluster {command} affected by malformed omnigraph.yaml: {output:?}" - ); - assert!( - !String::from_utf8_lossy(&output.stderr).contains("omnigraph.yaml"), - "cluster {command} touched omnigraph.yaml" - ); - } - // import + apply (no --as, no operator config): the legacy file is never - // loaded and the no-actor apply succeeds (actor defaults to none).
- for command in ["import", "apply"] { - let output = cli() - .current_dir(temp.path()) - .arg("cluster") - .arg(command) - .arg("--config") - .arg(temp.path()) - .output() - .unwrap(); - assert!( - output.status.success(), - "cluster {command} affected by malformed omnigraph.yaml: {}", - String::from_utf8_lossy(&output.stderr) - ); - } -} - -#[test] -fn cluster_commands_ignore_conflicting_local_config() { - let baseline = tempdir().unwrap(); - write_cluster_config_fixture(baseline.path()); - let with_config = tempdir().unwrap(); - write_cluster_config_fixture(with_config.path()); - fs::write( - with_config.path().join("omnigraph.yaml"), - r#" -server: - bind: 0.0.0.0:9999 -graphs: - phantom: - uri: ./phantom.omni -"#, - ) - .unwrap(); - - let validate = |dir: &std::path::Path| { - let output = cli() - .current_dir(dir) - .arg("cluster") - .arg("validate") - .arg("--config") - .arg(dir) - .arg("--json") - .output() - .unwrap(); - assert!(output.status.success(), "{output:?}"); - serde_json::from_str::<serde_json::Value>(String::from_utf8_lossy(&output.stdout).trim()) - .unwrap() - }; - let (a, b) = (validate(baseline.path()), validate(with_config.path())); - // Compare the path-free invariants (paths embed each tempdir). - for key in ["ok", "diagnostics", "resource_digests", "dependencies"] { - assert_eq!(a[key], b[key], "conflicting omnigraph.yaml leaked into cluster validate ({key})"); - } - let leaked = b.to_string(); - assert!(!leaked.contains("phantom") && !leaked.contains("9999"), "{leaked}"); -} - - -// ── RFC-010 Slice 3: cluster-managed maintenance addressing + init signpost ── - -/// Stand up an applied, served cluster with the `knowledge` graph and return -/// its directory guard. Mirrors the e2e setup (fixture → init → import → apply).
-fn applied_knowledge_cluster() -> tempfile::TempDir { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - init_cluster_derived_graph(temp.path()); - let import = cluster_json(temp.path(), "import"); - assert_eq!(import["ok"], true, "{import}"); - let apply = cluster_json(temp.path(), "apply"); - assert_eq!(apply["converged"], true, "{apply}"); - temp -} - -#[test] -fn optimize_resolves_a_cluster_graph_by_id() { - let temp = applied_knowledge_cluster(); - // No hand-typed storage path: address the graph by cluster dir + id. - let out = output_success( - cli() - .arg("optimize") - .arg("--cluster") - .arg(temp.path()) - .arg("--graph") - .arg("knowledge") - .arg("--json"), - ); - let payload = parse_stdout_json(&out); - assert!( - payload["tables"].as_array().is_some(), - "optimize did not run against the resolved cluster graph: {payload}" - ); -} - -#[test] -fn optimize_unknown_cluster_graph_id_errors() { - let temp = applied_knowledge_cluster(); - let out = output_failure( - cli() - .arg("optimize") - .arg("--cluster") - .arg(temp.path()) - .arg("--graph") - .arg("does-not-exist") - .arg("--json"), - ); - let stderr = String::from_utf8_lossy(&out.stderr); - assert!( - stderr.contains("is not applied in cluster") && stderr.contains("cluster apply"), - "expected an unapplied-graph error pointing at cluster apply; got: {stderr}" - ); -} - -#[test] -fn optimize_auto_uses_the_sole_cluster_graph() { - // RFC-011 D7: a cluster with exactly one applied graph needs no --graph: - // the resolver enumerates the catalog and uses the only candidate. - let temp = applied_knowledge_cluster(); - let out = output_success( - cli() - .arg("optimize") - .arg("--cluster") - .arg(temp.path()) - .arg("--json"), - ); - assert!( - parse_stdout_json(&out)["tables"].as_array().is_some(), - "optimize should auto-resolve the sole cluster graph" - ); -} - -/// Stand up an applied cluster with two graphs (`knowledge`, `archive`).
-fn applied_two_graph_cluster() -> tempfile::TempDir { - let temp = tempdir().unwrap(); - let root = temp.path(); - fs::write( - root.join("people.pg"), - "node Person {\n name: String @key\n age: I32?\n}\n", - ) - .unwrap(); - fs::write(root.join("base.policy.yaml"), "rules: []\n").unwrap(); - fs::write( - root.join("cluster.yaml"), - r#" -version: 1 -metadata: - name: two-graph -state: - backend: cluster - lock: true -graphs: - knowledge: - schema: ./people.pg - archive: - schema: ./people.pg -policies: - base: - file: ./base.policy.yaml - applies_to: [knowledge, archive] -"#, - ) - .unwrap(); - init_named_cluster_graph(root, "knowledge", "people.pg"); - init_named_cluster_graph(root, "archive", "people.pg"); - assert_eq!(cluster_json(root, "import")["ok"], true); - assert_eq!(cluster_json(root, "apply")["converged"], true); - temp -} - -#[test] -fn optimize_on_multi_graph_cluster_without_graph_lists_candidates() { - // RFC-011 D7: >1 graph and no --graph → error naming every candidate, - // never an auto-pick. - let temp = applied_two_graph_cluster(); - let out = output_failure( - cli() - .arg("optimize") - .arg("--cluster") - .arg(temp.path()) - .arg("--json"), - ); - let stderr = String::from_utf8_lossy(&out.stderr); - assert!( - stderr.contains("2 graphs") - && stderr.contains("archive") - && stderr.contains("knowledge") - && stderr.contains("--graph "), - "expected a candidate-listing error; got: {stderr}" - ); -} - -#[test] -fn init_refuses_a_cluster_managed_path_and_signposts_cluster_apply() { - let temp = applied_knowledge_cluster(); - // Hand-init a NEW graph into the established cluster's storage layout.
- let out = output_failure( - cli() - .arg("init") - .arg("--schema") - .arg(temp.path().join("people.pg")) - .arg(temp.path().join("graphs").join("sneaky.omni")), - ); - let stderr = String::from_utf8_lossy(&out.stderr); - assert!( - stderr.contains("cluster apply"), - "init into a cluster-managed path should signpost `cluster apply`; got: {stderr}" - ); - // And it did not create the graph. - assert!(!temp.path().join("graphs").join("sneaky.omni").exists()); -} - -#[test] -fn schema_apply_refuses_a_cluster_managed_graph_and_signposts_cluster_apply() { - // RFC-011 Decision 10: a direct `schema apply` against a cluster-managed - // graph's storage root would bypass the ledger/recovery/approvals, so it is - // refused and points at `cluster apply` (mirrors `init`'s refusal). - let temp = applied_knowledge_cluster(); - // A schema that WOULD change the graph (adds `bio`), so the no-mutation - // assertion below is meaningful, not a no-op re-apply. - fs::write( - temp.path().join("people_v2.pg"), - "node Person {\n name: String @key\n age: I32?\n bio: String?\n}\n", - ) - .unwrap(); - let out = output_failure( - cli() - .arg("schema") - .arg("apply") - .arg("--schema") - .arg(temp.path().join("people_v2.pg")) - .arg("--store") - .arg(temp.path().join("graphs").join("knowledge.omni")), - ); - let stderr = String::from_utf8_lossy(&out.stderr); - assert!( - stderr.contains("cluster apply"), - "schema apply against a cluster-managed graph should signpost `cluster apply`; got: {stderr}" - ); - // And it bailed BEFORE mutating: the live schema still lacks `bio`. - let show = output_success( - cli() - .arg("schema") - .arg("show") - .arg(temp.path().join("graphs").join("knowledge.omni")), - ); - assert!( - !stdout_string(&show).contains("bio"), - "the refused apply must not have changed the live schema; got: {}", - stdout_string(&show) - ); -} - -#[test] -fn init_outside_a_cluster_still_works() { - // Regression guard: ordinary init (no cluster layout) is unaffected.
- let temp = tempdir().unwrap(); - let schema = fixture("test.pg"); - let out = output_success( - cli() - .arg("init") - .arg("--schema") - .arg(&schema) - .arg(temp.path().join("plain.omni")), - ); - assert!(stdout_string(&out).contains("initialized")); -} - -#[test] -fn optimize_by_cluster_works_when_catalog_payloads_are_degraded() { - // Robustness (Greptile, #221): maintenance resolves the graph URI from the - // state ledger alone, so an unrelated corrupt/missing catalog payload (or a - // pending recovery sweep) does NOT block it, unlike the full serving-snapshot - // read. This is what keeps `repair --cluster` usable on a degraded cluster. - let temp = applied_knowledge_cluster(); - // Remove the verified catalog payloads (queries/policies): a serving read - // would refuse with a catalog-payload diagnostic; the ledger-only resolve - // must not care. - let resources = temp.path().join("__cluster").join("resources"); - if resources.exists() { - fs::remove_dir_all(&resources).unwrap(); - } - let out = output_success( - cli() - .arg("optimize") - .arg("--cluster") - .arg(temp.path()) - .arg("--graph") - .arg("knowledge") - .arg("--json"), - ); - assert!( - parse_stdout_json(&out)["tables"].as_array().is_some(), - "optimize should resolve via the ledger despite degraded catalog payloads" - ); -} diff --git a/crates/omnigraph-cli/tests/cli_cluster_e2e.rs b/crates/omnigraph-cli/tests/cli_cluster_e2e.rs deleted file mode 100644 index 35ded58..0000000 --- a/crates/omnigraph-cli/tests/cli_cluster_e2e.rs +++ /dev/null @@ -1,623 +0,0 @@ -//! Cluster lifecycle compositions over the spawned binary (recovery, drift, convergence). -//! Moved verbatim from tests/cli.rs in the modularization.
- -use std::fs; - -use omnigraph::db::Omnigraph; -use tempfile::tempdir; - -mod support; - -use support::*; - - -#[test] -fn cluster_e2e_lifecycle_import_apply_status_refresh_converges() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - init_cluster_derived_graph(temp.path()); - - let import = cluster_json(temp.path(), "import"); - assert_eq!(import["ok"], true, "{import}"); - assert_eq!(import["state_observations"]["state_revision"], 1); - - let plan = cluster_json(temp.path(), "plan"); - let changes = plan["changes"].as_array().unwrap(); - assert_eq!(changes.len(), 3, "{plan}"); - let disposition_of = |resource: &str| { - changes - .iter() - .find(|change| change["resource"] == resource) - .unwrap_or_else(|| panic!("missing change for {resource}: {plan}"))["disposition"] - .clone() - }; - assert_eq!(disposition_of("graph.knowledge"), "derived"); - assert_eq!(disposition_of("query.knowledge.find_person"), "applied"); - assert_eq!(disposition_of("policy.base"), "applied"); - - let apply = cluster_json(temp.path(), "apply"); - assert_eq!(apply["ok"], true, "{apply}"); - assert_eq!(apply["applied_count"], 2, "{apply}"); - assert_eq!(apply["converged"], true, "{apply}"); - - let status = cluster_json(temp.path(), "status"); - assert_eq!( - status["resource_statuses"]["query.knowledge.find_person"]["status"], - "applied" - ); - assert_eq!(status["resource_statuses"]["policy.base"]["status"], "applied"); - assert!( - status["state_observations"]["applied_config_digest"].is_string(), - "converged apply must record the applied config digest: {status}" - ); - - // Refresh re-observes the live graph; it must not undo apply's work. 
- let refresh = cluster_json(temp.path(), "refresh"); - assert_eq!(refresh["ok"], true, "{refresh}"); - let replan = cluster_json(temp.path(), "plan"); - assert!( - replan["changes"].as_array().unwrap().is_empty(), - "refresh after a converged apply must not re-open the plan: {replan}" - ); - - // A query edit round-trips: plan update -> apply -> converged again. - fs::write( - temp.path().join("people.gq"), - r#" -query find_person($name: String) { - match { $p: Person { name: $name } } - return { $p.name } -} -"#, - ) - .unwrap(); - let apply_edit = cluster_json(temp.path(), "apply"); - assert_eq!(apply_edit["applied_count"], 1, "{apply_edit}"); - assert_eq!(apply_edit["converged"], true, "{apply_edit}"); - - let final_apply = cluster_json(temp.path(), "apply"); - assert_eq!(final_apply["state_written"], false, "{final_apply}"); - assert!(final_apply["changes"].as_array().unwrap().is_empty()); -} - -#[test] -fn cluster_e2e_schema_change_applied_by_cluster() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - init_cluster_derived_graph(temp.path()); - let import = cluster_json(temp.path(), "import"); - assert_eq!(import["ok"], true, "{import}"); - let apply = cluster_json(temp.path(), "apply"); - assert_eq!(apply["converged"], true, "{apply}"); - - // Additive schema change: Stage 4B applies it from the cluster: no - // manual schema apply, no refresh round-trip. - fs::write( - temp.path().join("people.pg"), - r#" -node Person { - name: String @key - age: I32? - bio: String? -} -"#, - ) - .unwrap(); - - // Plan previews the real migration steps (RFC-004 §D7).
- let plan = cluster_json(temp.path(), "plan"); - let schema_change = change_for(&plan, "schema.knowledge"); - assert_eq!(schema_change["disposition"], "applied", "{plan}"); - let migration = &schema_change["migration"]; - assert_eq!(migration["supported"], true, "{plan}"); - assert!( - migration["steps"] - .as_array() - .unwrap() - .iter() - .any(|step| step["kind"] == "add_property"), - "{plan}" - ); - - let evolve = cluster_json(temp.path(), "apply"); - assert_eq!(evolve["ok"], true, "{evolve}"); - assert_eq!(evolve["converged"], true, "{evolve}"); - assert_eq!(change_for(&evolve, "schema.knowledge")["disposition"], "applied"); - - // The live graph carries the new schema; the plan is empty. - let schema_show = output_success( - cli() - .arg("schema") - .arg("show") - .arg(temp.path().join("graphs/knowledge.omni")), - ); - assert!(stdout_string(&schema_show).contains("bio"), "live schema updated"); - let replan = cluster_json(temp.path(), "plan"); - assert!( - replan["changes"].as_array().unwrap().is_empty(), - "one cluster apply converges a schema change: {replan}" - ); -} - -#[test] -fn cluster_e2e_force_unlock_unblocks_apply() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - write_cluster_applyable_state(temp.path()); - write_cluster_lock(temp.path(), "stuck-lock", "apply"); - - let refused = parse_stdout_json(&output_failure( - cli() - .arg("cluster") - .arg("apply") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(refused["ok"], false); - - let unlocked = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("force-unlock") - .arg("stuck-lock") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(unlocked["lock_removed"], true, "{unlocked}"); - - let retried = cluster_json(temp.path(), "apply"); - assert_eq!(retried["ok"], true, "{retried}"); - assert_eq!(retried["converged"], true, "{retried}"); -} - -#[test] -fn 
cluster_e2e_lost_state_reimport_recovers_catalog() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - init_cluster_derived_graph(temp.path()); - let import = cluster_json(temp.path(), "import"); - assert_eq!(import["ok"], true, "{import}"); - let apply = cluster_json(temp.path(), "apply"); - assert_eq!(apply["converged"], true, "{apply}"); - - let query_digest = change_for(&apply, "query.knowledge.find_person")["after_digest"] - .as_str() - .unwrap() - .to_string(); - let blob = temp - .path() - .join("__cluster/resources/query/knowledge/find_person") - .join(format!("{query_digest}.gq")); - let blob_content = fs::read_to_string(&blob).unwrap(); - - // Disaster: the state ledger is lost. - fs::remove_file(temp.path().join("__cluster/state.json")).unwrap(); - - let reimport = cluster_json(temp.path(), "import"); - assert_eq!(reimport["ok"], true, "{reimport}"); - assert_eq!(reimport["state_observations"]["state_revision"], 1); - // Import observes graph/schema only; query/policy digests are not invented. - assert!( - reimport["resource_digests"] - .get("query.knowledge.find_person") - .is_none(), - "{reimport}" - ); - - let plan = cluster_json(temp.path(), "plan"); - assert_eq!( - change_for(&plan, "query.knowledge.find_person")["disposition"], - "applied" - ); - assert_eq!(change_for(&plan, "policy.base")["disposition"], "applied"); - - let reapply = cluster_json(temp.path(), "apply"); - assert_eq!(reapply["ok"], true, "{reapply}"); - assert_eq!(reapply["converged"], true, "{reapply}"); - assert!( - reapply["state_observations"]["applied_config_digest"].is_string(), - "{reapply}" - ); - // The catalog blob was reused, not rewritten with different content. 
- assert_eq!(fs::read_to_string(&blob).unwrap(), blob_content); - - let replan = cluster_json(temp.path(), "plan"); - assert!(replan["changes"].as_array().unwrap().is_empty(), "{replan}"); -} - -#[test] -fn cluster_e2e_out_of_band_schema_drift_then_apply_converges_it() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - init_cluster_derived_graph(temp.path()); - let import = cluster_json(temp.path(), "import"); - assert_eq!(import["ok"], true, "{import}"); - let apply = cluster_json(temp.path(), "apply"); - assert_eq!(apply["converged"], true, "{apply}"); - - // Out-of-band: the live graph evolves while cluster.yaml stays put. RFC-011 - // D10 makes the CLI `schema apply` refuse a cluster-managed graph, so this - // simulates a true bypass — a direct engine apply against the storage root, - // exactly the drift the control plane must still detect and converge. - let people_v2 = r#" -node Person { - name: String @key - age: I32? - bio: String? -} -"#; - tokio::runtime::Runtime::new().unwrap().block_on(async { - let db = Omnigraph::open( - temp.path() - .join("graphs/knowledge.omni") - .to_string_lossy() - .as_ref(), - ) - .await - .unwrap(); - db.apply_schema(people_v2).await.unwrap(); - }); - - // Drift is visible... - let refresh = cluster_json(temp.path(), "refresh"); - assert_eq!( - refresh["resource_statuses"]["schema.knowledge"]["status"], - "drifted" - ); - // ...the plan proposes converging back to desired, with a migration - // preview (a soft drop of the out-of-band field)... 
- let plan = cluster_json(temp.path(), "plan"); - let schema_change = change_for(&plan, "schema.knowledge"); - assert_eq!(schema_change["disposition"], "applied", "{plan}"); - assert!( - schema_change["migration"]["steps"] - .as_array() - .unwrap() - .iter() - .any(|step| step["kind"] == "drop_property" && step["mode"] == "soft"), - "{plan}" - ); - // ...and apply converges the live schema back (axiom 8: drift correction - // is gated like any change; a soft migration is the recoverable tier). - let converge = cluster_json(temp.path(), "apply"); - assert_eq!(converge["ok"], true, "{converge}"); - assert_eq!(converge["converged"], true, "{converge}"); - let schema_show = output_success( - cli() - .arg("schema") - .arg("show") - .arg(temp.path().join("graphs/knowledge.omni")), - ); - assert!( - !stdout_string(&schema_show).contains("bio"), - "out-of-band field soft-dropped back to desired" - ); - let replan = cluster_json(temp.path(), "plan"); - assert!(replan["changes"].as_array().unwrap().is_empty(), "{replan}"); -} - -#[test] -fn cluster_e2e_graph_root_destruction_drifts_then_apply_recreates_empty_graph() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - init_cluster_derived_graph(temp.path()); - let import = cluster_json(temp.path(), "import"); - assert_eq!(import["ok"], true, "{import}"); - let apply = cluster_json(temp.path(), "apply"); - assert_eq!(apply["converged"], true, "{apply}"); - let query_digest = change_for(&apply, "query.knowledge.find_person")["after_digest"] - .as_str() - .unwrap() - .to_string(); - - fs::remove_dir_all(temp.path().join("graphs/knowledge.omni")).unwrap(); - - // Missing root is drift, not an error. 
- let refresh = cluster_json(temp.path(), "refresh"); - assert_eq!(refresh["ok"], true, "{refresh}"); - assert_eq!( - refresh["resource_statuses"]["graph.knowledge"]["status"], - "drifted" - ); - assert!( - refresh["resource_statuses"]["graph.knowledge"]["conditions"] - .as_array() - .unwrap() - .iter() - .any(|condition| condition == "graph_missing"), - "{refresh}" - ); - // Graph/schema digests removed; query/policy digests preserved. - assert!(refresh["resource_digests"].get("graph.knowledge").is_none()); - assert!(refresh["resource_digests"].get("schema.knowledge").is_none()); - assert!( - refresh["resource_digests"] - .get("query.knowledge.find_person") - .is_some(), - "{refresh}" - ); - - let plan = cluster_json(temp.path(), "plan"); - assert_eq!(change_for(&plan, "graph.knowledge")["operation"], "create"); - // Stage 4A: the re-create is executable and the plan says so — nothing - // hidden about converging a destroyed root back to an EMPTY graph (the - // data was already lost; this is declarative convergence, RFC-004 §D1). - assert_eq!(change_for(&plan, "graph.knowledge")["disposition"], "applied"); - assert_eq!(change_for(&plan, "schema.knowledge")["disposition"], "applied"); - // Converged-then-destroyed: query/policy are already in state at the - // desired digests, so they are not changes at all. - assert_eq!(plan["changes"].as_array().unwrap().len(), 2, "{plan}"); - - let recreate = cluster_json(temp.path(), "apply"); - assert_eq!(recreate["ok"], true, "{recreate}"); - assert_eq!(recreate["converged"], true, "{recreate}"); - // The empty graph is back on disk; catalog state survived throughout. 
- assert!(temp.path().join("graphs/knowledge.omni").exists()); - let state: serde_json::Value = serde_json::from_str( - &fs::read_to_string(temp.path().join("__cluster/state.json")).unwrap(), - ) - .unwrap(); - assert_eq!( - state["applied_revision"]["resources"]["query.knowledge.find_person"]["digest"], - query_digest - ); - assert!( - temp.path() - .join("__cluster/resources/query/knowledge/find_person") - .join(format!("{query_digest}.gq")) - .exists() - ); -} - -#[test] -fn cluster_e2e_multi_graph_mixed_dispositions_then_approve_and_converge() { - let temp = tempdir().unwrap(); - write_multi_graph_cluster_fixture(temp.path()); - // No manual init: Stage 4A creates both graphs. - - let import = cluster_json(temp.path(), "import"); - assert_eq!(import["ok"], true, "{import}"); - - let apply = cluster_json(temp.path(), "apply"); - assert_eq!(apply["ok"], true, "{apply}"); - assert_eq!(apply["converged"], true, "{apply}"); - assert_eq!(change_for(&apply, "graph.knowledge")["disposition"], "applied"); - assert_eq!( - change_for(&apply, "graph.engineering")["disposition"], - "applied" - ); - assert_eq!( - change_for(&apply, "query.engineering.find_service")["disposition"], - "applied" - ); - // The graph-spanning and cluster-scoped policies ride the same run. - assert_eq!(change_for(&apply, "policy.shared")["disposition"], "applied"); - assert_eq!( - change_for(&apply, "policy.cluster_wide")["disposition"], - "applied" - ); - assert!(temp.path().join("graphs/knowledge.omni").exists()); - assert!(temp.path().join("graphs/engineering.omni").exists()); - - // Mixed run: a graph REMOVAL (4C territory — deferred) gates its query - // delete (blocked), while a knowledge query update is independent - // (applied) and re-derives its composite. All four dispositions at once. 
- fs::write( - temp.path().join("cluster.yaml"), - r#" -version: 1 -metadata: - name: company-brain -state: - backend: cluster - lock: true -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -policies: - shared: - file: ./shared.policy.yaml - applies_to: [knowledge] - cluster_wide: - file: ./cluster_wide.policy.yaml - applies_to: [cluster] -"#, - ) - .unwrap(); - fs::write( - temp.path().join("people.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n", - ) - .unwrap(); - - let mixed = cluster_json(temp.path(), "apply"); - assert_eq!(mixed["ok"], true, "{mixed}"); - assert_eq!(mixed["converged"], false, "{mixed}"); - // Stage 4C: deletes are gated on a digest-bound approval, one gate per - // subtree (the graph-level approval carries schema + queries). - assert_eq!( - change_for(&mixed, "graph.engineering")["disposition"], - "blocked" - ); - assert_eq!( - change_for(&mixed, "graph.engineering")["reason"], - "approval_required" - ); - assert_eq!( - change_for(&mixed, "schema.engineering")["reason"], - "approval_required" - ); - assert_eq!( - change_for(&mixed, "query.engineering.find_service")["reason"], - "approval_required" - ); - let gate_plan = cluster_json(temp.path(), "plan"); - let gates = gate_plan["approvals_required"].as_array().unwrap(); - assert_eq!(gates.len(), 1, "{gate_plan}"); - assert_eq!(gates[0]["resource"], "graph.engineering"); - assert_eq!(gates[0]["satisfied"], false); - assert_eq!( - change_for(&mixed, "query.knowledge.find_person")["disposition"], - "applied" - ); - // 5A: policy.shared's applies_to narrowed with an unchanged file digest - // — now a first-class binding change, applied in the same run. 
- assert_eq!(change_for(&mixed, "policy.shared")["binding_change"], true); - assert_eq!(change_for(&mixed, "policy.shared")["disposition"], "applied"); - assert_eq!( - change_for(&mixed, "graph.knowledge")["disposition"], - "derived" - ); - // Deterministic ordering: changes sorted by resource address. - let order: Vec<&str> = mixed["changes"] - .as_array() - .unwrap() - .iter() - .map(|change| change["resource"].as_str().unwrap()) - .collect(); - let mut sorted = order.clone(); - sorted.sort_unstable(); - assert_eq!(order, sorted, "{mixed}"); - // The conclusion: an apply without approval stays blocked; the approved - // delete converges the cluster, tombstoning the removed graph. - let still_blocked = cluster_json(temp.path(), "apply"); - assert_eq!(still_blocked["converged"], false, "{still_blocked}"); - - let approve = parse_stdout_json(&output_success( - cli() - .arg("--as") - .arg("andrew") - .arg("cluster") - .arg("approve") - .arg("graph.engineering") - .arg("--config") - .arg(temp.path()) - .arg("--json"), - )); - assert_eq!(approve["ok"], true, "{approve}"); - assert_eq!(approve["approved_by"], "andrew"); - - let converge = cluster_json(temp.path(), "apply"); - assert_eq!(converge["ok"], true, "{converge}"); - assert_eq!(converge["converged"], true, "{converge}"); - assert!(!temp.path().join("graphs/engineering.omni").exists()); - - let status = cluster_json(temp.path(), "status"); - assert_eq!(status["observations"]["graph.engineering"]["kind"], "tombstone"); - let final_plan = cluster_json(temp.path(), "plan"); - assert!( - final_plan["changes"].as_array().unwrap().is_empty(), - "{final_plan}" - ); -} - -#[test] -fn cluster_e2e_approve_requires_actor() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - - let output = output_failure( - cli() - .arg("cluster") - .arg("approve") - .arg("graph.knowledge") - .arg("--config") - .arg(temp.path()), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - 
assert!(stderr.contains("--as"), "{stderr}"); -} - -#[test] -fn cluster_e2e_declared_graph_created_by_apply() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - - let import = cluster_json(temp.path(), "import"); - assert_eq!(import["ok"], true, "{import}"); - - let apply = cluster_json(temp.path(), "apply"); - assert_eq!(apply["ok"], true, "{apply}"); - assert_eq!(apply["converged"], true, "{apply}"); - assert_eq!(change_for(&apply, "graph.knowledge")["disposition"], "applied"); - assert!(temp.path().join("graphs/knowledge.omni").exists()); - - // The created graph is a real graph: the per-graph CLI can open it. - let snapshot = output_success( - cli() - .arg("snapshot") - .arg(temp.path().join("graphs/knowledge.omni")), - ); - assert!(!stdout_string(&snapshot).is_empty()); - - let plan = cluster_json(temp.path(), "plan"); - assert!(plan["changes"].as_array().unwrap().is_empty(), "{plan}"); - let status = cluster_json(temp.path(), "status"); - assert_eq!( - status["resource_statuses"]["graph.knowledge"]["status"], - "applied" - ); -} - -#[test] -fn cluster_e2e_payload_drift_self_heals() { - let temp = tempdir().unwrap(); - write_cluster_config_fixture(temp.path()); - init_cluster_derived_graph(temp.path()); - let import = cluster_json(temp.path(), "import"); - assert_eq!(import["ok"], true, "{import}"); - let apply = cluster_json(temp.path(), "apply"); - assert_eq!(apply["converged"], true, "{apply}"); - - let query_digest = change_for(&apply, "query.knowledge.find_person")["after_digest"] - .as_str() - .unwrap() - .to_string(); - let blob = temp - .path() - .join("__cluster/resources/query/knowledge/find_person") - .join(format!("{query_digest}.gq")); - fs::remove_file(&blob).unwrap(); - - let status = cluster_json(temp.path(), "status"); - assert_eq!(status["ok"], true, "{status}"); - assert!( - status["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| diagnostic["code"] == "catalog_payload_missing"), - 
"{status}" - ); - - let refresh = cluster_json(temp.path(), "refresh"); - assert_eq!(refresh["ok"], true, "{refresh}"); - assert_eq!( - refresh["resource_statuses"]["query.knowledge.find_person"]["status"], - "drifted" - ); - - let heal = cluster_json(temp.path(), "apply"); - assert_eq!(heal["ok"], true, "{heal}"); - assert_eq!(heal["converged"], true, "{heal}"); - assert!(blob.exists(), "blob republished"); - - let clean = cluster_json(temp.path(), "status"); - assert!( - !clean["diagnostics"] - .as_array() - .unwrap() - .iter() - .any(|diagnostic| { - diagnostic["code"] - .as_str() - .is_some_and(|code| code.starts_with("catalog_payload")) - }), - "{clean}" - ); -} diff --git a/crates/omnigraph-cli/tests/cli_queries.rs b/crates/omnigraph-cli/tests/cli_queries.rs deleted file mode 100644 index b51018e..0000000 --- a/crates/omnigraph-cli/tests/cli_queries.rs +++ /dev/null @@ -1,384 +0,0 @@ -//! Stored-query commands and alias resolution. -//! Moved verbatim from tests/cli.rs in the modularization. 
- - -use tempfile::tempdir; - -mod support; - -use support::*; - - -#[test] -fn query_check_alias_matches_lint_output() { - let temp = tempdir().unwrap(); - let schema_path = temp.path().join("schema.pg"); - let query_path = temp.path().join("queries.gq"); - write_file( - &schema_path, - r#" -node Person { - name: String -} -"#, - ); - write_query_file( - &query_path, - r#" -query list_people() { - match { $p: Person } - return { $p.name } -} -"#, - ); - - let lint_output = output_success( - cli() - .arg("query") - .arg("lint") - .arg("--query") - .arg(&query_path) - .arg("--schema") - .arg(&schema_path) - .arg("--json"), - ); - let check_output = output_success( - cli() - .arg("query") - .arg("check") - .arg("--query") - .arg(&query_path) - .arg("--schema") - .arg(&schema_path) - .arg("--json"), - ); - - assert_eq!(stdout_string(&lint_output), stdout_string(&check_output)); -} - -// Legacy `omnigraph.yaml` `aliases:` invoked via the `--alias` flag were -// removed in RFC-011 D4 — operator aliases now live under `omnigraph alias -// ` (the happy path is covered by system_local's operator-alias e2e). -// The legacy file-alias path has no CLI entry point. - -#[test] -fn alias_flag_is_removed_from_query() { - // RFC-011 D4: `--alias` no longer exists on query/mutate; use `alias `. - let output = output_failure(cli().arg("query").arg("--alias").arg("who")); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("unexpected argument") && stderr.contains("--alias"), - "expected clap to reject --alias on query; got: {stderr}" - ); -} - -#[test] -fn alias_unknown_name_errors_listing_defined() { - // Hermetic: an unknown alias fails before any network, listing defined ones. 
- let home = tempdir().unwrap(); - std::fs::write( - home.path().join("config.yaml"), - "servers:\n dev:\n url: https://x\naliases:\n who:\n server: dev\n query: find_person\n", - ) - .unwrap(); - let output = output_failure( - cli() - .env("OMNIGRAPH_HOME", home.path()) - .arg("alias") - .arg("nope"), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("unknown alias 'nope'") && stderr.contains("who"), - "expected an unknown-alias error listing defined aliases; got: {stderr}" - ); -} - -#[test] -fn alias_rejects_global_scope_flags_that_the_binding_owns() { - for (flag, value) in [ - ("--server", "dev"), - ("--graph", "local"), - ("--store", "file:///tmp/graph.omni"), - ("--cluster", "."), - ("--profile", "prod"), - ("--as", "act-op"), - ] { - let output = output_failure(cli().arg(flag).arg(value).arg("alias").arg("who")); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("`alias` uses the server, graph, and stored query") - && stderr.contains(flag), - "expected {flag} to be rejected by the alias binding guard; got: {stderr}" - ); - } -} - -#[test] -fn queries_and_policy_wrong_server_scope_points_at_cluster_scope() { - let output = output_failure(cli().arg("--server").arg("prod").arg("queries").arg("list")); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("pass --cluster ") && !stderr.contains("pass --config "), - "queries should point at --cluster, not --config; got: {stderr}" - ); - - let output = output_failure( - cli() - .arg("--server") - .arg("prod") - .arg("policy") - .arg("validate"), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("pass --cluster ") && !stderr.contains("pass --config "), - "policy should point at --cluster, not --config; got: {stderr}" - ); -} - -// RFC-011: `queries validate`/`list` source the registry + schemas from a -// converged cluster's applied state (`--cluster `), not 
omnigraph.yaml. - -/// Build a converged single-graph cluster (id `knowledge`) with one stored -/// query. `query_block` is the YAML under the graph's `queries:` key. -fn converged_cluster_with_query(query_file: &str, query_src: &str, query_block: &str) -> tempfile::TempDir { - let temp = tempdir().unwrap(); - let dir = temp.path(); - std::fs::copy(fixture("test.pg"), dir.join("graph.pg")).unwrap(); - write_query_file(&dir.join(query_file), query_src); - std::fs::write( - dir.join("cluster.yaml"), - format!( - "version: 1\nmetadata:\n name: sys\nstate:\n backend: cluster\n lock: true\n\ - graphs:\n knowledge:\n schema: ./graph.pg\n queries:\n{query_block}" - ), - ) - .unwrap(); - output_success(cli().arg("cluster").arg("import").arg("--config").arg(dir)); - output_success(cli().arg("cluster").arg("apply").arg("--config").arg(dir)); - temp -} - -#[test] -fn queries_validate_exits_zero_on_clean_registry() { - let cluster = converged_cluster_with_query( - "find_person.gq", - "query find_person($name: String) { match { $p: Person { name: $name } } return { $p.age } }", - " find_person:\n file: ./find_person.gq\n", - ); - let output = output_success( - cli() - .arg("queries") - .arg("validate") - .arg("--cluster") - .arg(cluster.path()), - ); - let stdout = stdout_string(&output); - assert!(stdout.contains("OK"), "stdout:\n{stdout}"); -} - -#[test] -fn cluster_import_rejects_a_type_broken_query() { - // In the cluster model a stored query is type-checked at the cluster - // boundary (import/apply), so a broken query can never reach the applied - // state `queries validate` reads — the gate is upstream. `Widget` is not in - // the fixture schema, so import must reject it, naming the query. 
- let temp = tempdir().unwrap(); - let dir = temp.path(); - std::fs::copy(fixture("test.pg"), dir.join("graph.pg")).unwrap(); - write_query_file( - &dir.join("ghost.gq"), - "query ghost() { match { $w: Widget } return { $w.name } }", - ); - std::fs::write( - dir.join("cluster.yaml"), - "version: 1\nmetadata:\n name: sys\nstate:\n backend: cluster\n lock: true\n\ - graphs:\n knowledge:\n schema: ./graph.pg\n queries:\n ghost:\n file: ./ghost.gq\n", - ) - .unwrap(); - let output = output_failure(cli().arg("cluster").arg("import").arg("--config").arg(dir)); - let combined = format!( - "{}{}", - stdout_string(&output), - String::from_utf8_lossy(&output.stderr) - ); - assert!( - combined.contains("ghost"), - "cluster import must reject the broken query, naming it; got:\n{combined}" - ); -} - -#[test] -fn queries_list_prints_registered_query() { - let cluster = converged_cluster_with_query( - "find_person.gq", - "query find_person($name: String) { match { $p: Person { name: $name } } return { $p.age } }", - " find_person:\n file: ./find_person.gq\n", - ); - let output = output_success( - cli() - .arg("queries") - .arg("list") - .arg("--cluster") - .arg(cluster.path()), - ); - let stdout = stdout_string(&output); - assert!(stdout.contains("find_person"), "stdout:\n{stdout}"); - assert!( - stdout.contains("$name: String"), - "list should show typed params; stdout:\n{stdout}" - ); -} - -#[test] -fn queries_list_surfaces_description_and_instruction() { - // `@description`/`@instruction` are the whole point of a stored query in a - // catalog — they tell an agent/operator what it does and how to invoke it. - // The CLI catalog must surface them in both human and --json output, to - // match the HTTP `GET /queries` surface. 
- let cluster = converged_cluster_with_query( - "described.gq", - "query described($name: String) \ - @description(\"Find a person by exact name.\") \ - @instruction(\"Use for exact lookups; prefer search for fuzzy matches.\") \ - { match { $p: Person { name: $name } } return { $p.age } }", - " described:\n file: ./described.gq\n", - ); - - // Human output. - let output = output_success( - cli().arg("queries").arg("list").arg("--cluster").arg(cluster.path()), - ); - let stdout = stdout_string(&output); - assert!( - stdout.contains("description: Find a person by exact name."), - "human list must show @description; stdout:\n{stdout}" - ); - assert!( - stdout.contains("instruction: Use for exact lookups; prefer search for fuzzy matches."), - "human list must show @instruction; stdout:\n{stdout}" - ); - - // --json output. - let output = output_success( - cli() - .arg("queries") - .arg("list") - .arg("--cluster") - .arg(cluster.path()) - .arg("--json"), - ); - let body: serde_json::Value = serde_json::from_slice(&output.stdout).unwrap(); - let entry = body["queries"] - .as_array() - .unwrap() - .iter() - .find(|q| q["name"] == "described") - .unwrap(); - assert_eq!(entry["description"], "Find a person by exact name."); - assert_eq!( - entry["instruction"], - "Use for exact lookups; prefer search for fuzzy matches." - ); -} - -#[test] -fn queries_list_indents_multiline_annotation_continuation() { - // GQ string literals admit newlines, so a `@description`/`@instruction` - // can be multiline. Human output must indent continuation lines to align - // under the first rather than breaking back to the left margin. 
- let cluster = converged_cluster_with_query( - "multi.gq", - "query multi($name: String) \ - @description(\"line one\\nline two\") \ - { match { $p: Person { name: $name } } return { $p.age } }", - " multi:\n file: ./multi.gq\n", - ); - let output = output_success( - cli().arg("queries").arg("list").arg("--cluster").arg(cluster.path()), - ); - let stdout = stdout_string(&output); - // " description: " is 17 chars wide; the continuation aligns under it. - assert!( - stdout.contains(" description: line one\n line two"), - "multiline annotation must indent the continuation; stdout:\n{stdout}" - ); -} - -#[test] -fn queries_list_omits_annotations_when_absent() { - // The other half of the contract: a query that declares neither annotation - // prints no extra lines and omits both JSON fields entirely. This keeps the - // catalog clean rather than echoing empty `description:`/`instruction:`. - let cluster = converged_cluster_with_query( - "bare.gq", - "query bare() { match { $p: Person } return { $p.name } }", - " bare:\n file: ./bare.gq\n", - ); - - // Human output: the query is listed, but no annotation lines. - let output = output_success( - cli().arg("queries").arg("list").arg("--cluster").arg(cluster.path()), - ); - let stdout = stdout_string(&output); - assert!(stdout.contains("bare()"), "stdout:\n{stdout}"); - assert!( - !stdout.contains("description:") && !stdout.contains("instruction:"), - "a query without annotations prints no annotation lines; stdout:\n{stdout}" - ); - - // --json output: both fields omitted (not present as null). 
- let output = output_success( - cli() - .arg("queries") - .arg("list") - .arg("--cluster") - .arg(cluster.path()) - .arg("--json"), - ); - let body: serde_json::Value = serde_json::from_slice(&output.stdout).unwrap(); - let entry = body["queries"] - .as_array() - .unwrap() - .iter() - .find(|q| q["name"] == "bare") - .unwrap(); - assert!( - entry.get("description").is_none() && entry.get("instruction").is_none(), - "a query without annotations omits both JSON fields: {entry}" - ); -} - -#[test] -fn queries_validate_requires_a_cluster() { - // RFC-011: with no --cluster (and no cluster profile), the command errors - // loudly rather than reading any omnigraph.yaml. - let output = output_failure(cli().arg("queries").arg("validate")); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("needs a cluster") || stderr.contains("--cluster"), - "queries validate must require a cluster; stderr:\n{stderr}" - ); -} - -#[test] -fn queries_validate_graph_filter_selects_one_graph() { - // A multi-graph cluster: validate scoped to `knowledge` type-checks only - // that graph's registry, ignoring `engineering`'s. - let temp = tempdir().unwrap(); - let dir = temp.path(); - write_multi_graph_cluster_fixture(dir); - output_success(cli().arg("cluster").arg("import").arg("--config").arg(dir)); - output_success(cli().arg("cluster").arg("apply").arg("--config").arg(dir)); - let output = output_success( - cli() - .arg("queries") - .arg("validate") - .arg("--cluster") - .arg(dir) - .arg("--graph") - .arg("knowledge"), - ); - assert!(stdout_string(&output).contains("OK")); -} diff --git a/crates/omnigraph-cli/tests/cli_schema_config.rs b/crates/omnigraph-cli/tests/cli_schema_config.rs deleted file mode 100644 index 5577aa8..0000000 --- a/crates/omnigraph-cli/tests/cli_schema_config.rs +++ /dev/null @@ -1,563 +0,0 @@ -//! init/config scaffolding, schema plan/apply, graphs listing, version. -//! Moved verbatim from tests/cli.rs in the modularization. 
- -use std::fs; - -use lance::index::DatasetIndexExt; -use omnigraph::db::{Omnigraph, ReadTarget}; -use serde_json::Value; -use tempfile::tempdir; - -mod support; - -use support::*; - - -#[test] -fn version_command_prints_current_cli_version() { - let output = output_success(cli().arg("version")); - let stdout = stdout_string(&output); - - assert_eq!( - stdout.trim(), - format!("omnigraph {}", env!("CARGO_PKG_VERSION")) - ); -} - -#[test] -fn help_groups_commands_by_capability() { - // RFC-010 Slice 2 / RFC-011 Slice B: `--help` clusters commands (declaration - // order in the Command enum) and explains the capability each needs in an - // after_help legend. Pinned lightly — the legend phrase + the cluster - // ordering — to avoid brittle full-text assertions on clap's help body. - let output = output_success(cli().arg("--help")); - let stdout = stdout_string(&output); - - assert!( - stdout.contains("COMMANDS BY CAPABILITY"), - "capability legend (after_help) missing from --help:\n{stdout}" - ); - - // The Commands list precedes the legend, so first occurrences sit in the - // list and must appear in order: an `any` data verb, then a `direct` verb, - // then the `control` verb. 
- let pos = |needle: &str| { - stdout - .find(needle) - .unwrap_or_else(|| panic!("'{needle}' not found in --help:\n{stdout}")) - }; - assert!( - pos("query") < pos("optimize"), - "data (any) commands should be listed before direct commands" - ); - assert!( - pos("optimize") < pos("cluster"), - "direct commands should be listed before the control command" - ); -} - -#[test] -fn init_creates_graph_successfully_on_missing_local_directory() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema = fixture("test.pg"); - - let output = output_success(cli().arg("init").arg("--schema").arg(&schema).arg(&graph)); - let stdout = stdout_string(&output); - - assert!(stdout.contains("initialized")); - assert!(graph.join("_schema.pg").exists()); - assert!(graph.join("__manifest").exists()); - // RFC-008 stage 3: init no longer scaffolds the legacy config file. - assert!(!temp.path().join("omnigraph.yaml").exists()); -} - -#[test] -fn schema_plan_json_reports_supported_additive_change() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = temp.path().join("next.pg"); - init_graph(&graph); - - let next_schema = fs::read_to_string(fixture("test.pg")).unwrap().replace( - " age: I32?\n}", - " age: I32?\n nickname: String?\n}", - ); - fs::write(&schema_path, next_schema).unwrap(); - - let output = output_success( - cli() - .arg("schema") - .arg("plan") - .arg("--schema") - .arg(&schema_path) - .arg("--json") - .arg(&graph), - ); - let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - - assert_eq!(payload["supported"], true); - assert_eq!(payload["step_count"], 1); - assert_eq!(payload["steps"][0]["kind"], "add_property"); - assert_eq!(payload["steps"][0]["type_kind"], "node"); - assert_eq!(payload["steps"][0]["type_name"], "Person"); - assert_eq!(payload["steps"][0]["property_name"], "nickname"); -} - -#[test] -fn schema_plan_with_server_flag_errors_wrong_plane() { - // RFC-010 Slice 1: `schema 
plan` is storage-plane while `schema show/apply` - // are data-plane — the guard rejects --server on plan with the per-subcommand - // label (proving command_plane/command_label descend into the nested enum). - let output = output_failure( - cli() - .arg("schema") - .arg("plan") - .arg("--schema") - .arg(fixture("test.pg")) - .arg("--server") - .arg("prod"), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("`schema plan` is a direct (storage-native) command") - && stderr.contains("Pass a storage URI."), - "schema plan wrong-capability message not found; got: {stderr}" - ); -} - -#[test] -fn schema_plan_json_reports_unsupported_type_change() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = temp.path().join("breaking.pg"); - init_graph(&graph); - - let breaking_schema = fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace("age: I32?", "age: I64?"); - fs::write(&schema_path, breaking_schema).unwrap(); - - let output = output_success( - cli() - .arg("schema") - .arg("plan") - .arg("--schema") - .arg(&schema_path) - .arg("--json") - .arg(&graph), - ); - let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - - assert_eq!(payload["supported"], false); - assert!(payload["steps"].as_array().unwrap().iter().any(|step| { - step["kind"] == "unsupported_change" - && step["entity"] - .as_str() - .unwrap_or_default() - .contains("Person.age") - })); -} - -#[test] -fn schema_apply_json_applies_supported_migration() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = temp.path().join("next.pg"); - init_graph(&graph); - - let next_schema = fs::read_to_string(fixture("test.pg")).unwrap().replace( - " age: I32?\n}", - " age: I32?\n nickname: String?\n}", - ); - fs::write(&schema_path, next_schema).unwrap(); - - let output = output_success( - cli() - .arg("schema") - .arg("apply") - .arg("--schema") - .arg(&schema_path) - 
.arg("--json") - .arg(&graph), - ); - let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - - assert_eq!(payload["supported"], true); - assert_eq!(payload["applied"], true); - assert_eq!(payload["step_count"], 1); - - let db = tokio::runtime::Runtime::new() - .unwrap() - .block_on(Omnigraph::open(graph.to_string_lossy().as_ref())) - .unwrap(); - assert!( - db.catalog().node_types["Person"] - .properties - .contains_key("nickname") - ); -} - -#[test] -fn schema_apply_human_reports_noop() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = fixture("test.pg"); - init_graph(&graph); - - let output = output_success( - cli() - .arg("schema") - .arg("apply") - .arg("--schema") - .arg(&schema_path) - .arg(&graph), - ); - let stdout = stdout_string(&output); - - assert!(stdout.contains("applied: no")); - assert!(stdout.contains("no schema changes")); -} - -#[test] -fn schema_apply_json_renames_type_and_updates_snapshot() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = temp.path().join("rename.pg"); - init_graph(&graph); - - let renamed_schema = fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace("node Person {\n", "node Human @rename_from(\"Person\") {\n") - .replace("edge Knows: Person -> Person", "edge Knows: Human -> Human") - .replace( - "edge WorksAt: Person -> Company", - "edge WorksAt: Human -> Company", - ); - fs::write(&schema_path, renamed_schema).unwrap(); - - let output = output_success( - cli() - .arg("schema") - .arg("apply") - .arg("--schema") - .arg(&schema_path) - .arg("--json") - .arg(&graph), - ); - let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - assert_eq!(payload["applied"], true); - - let db = tokio::runtime::Runtime::new() - .unwrap() - .block_on(Omnigraph::open(graph.to_string_lossy().as_ref())) - .unwrap(); - let snapshot = tokio::runtime::Runtime::new() - .unwrap() - 
.block_on(db.snapshot_of(ReadTarget::branch("main"))) - .unwrap(); - assert!(snapshot.entry("node:Human").is_some()); - assert!(snapshot.entry("node:Person").is_none()); -} - -#[test] -fn schema_apply_json_renames_property_and_updates_catalog() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = temp.path().join("rename-property.pg"); - init_graph(&graph); - - let renamed_schema = fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace("age: I32?", "years: I32? @rename_from(\"age\")"); - fs::write(&schema_path, renamed_schema).unwrap(); - - let output = output_success( - cli() - .arg("schema") - .arg("apply") - .arg("--schema") - .arg(&schema_path) - .arg("--json") - .arg(&graph), - ); - let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - assert_eq!(payload["applied"], true); - - let db = tokio::runtime::Runtime::new() - .unwrap() - .block_on(Omnigraph::open(graph.to_string_lossy().as_ref())) - .unwrap(); - let person = &db.catalog().node_types["Person"]; - assert!(person.properties.contains_key("years")); - assert!(!person.properties.contains_key("age")); -} - -#[test] -fn schema_apply_json_adds_index_for_existing_property() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = temp.path().join("index.pg"); - init_graph(&graph); - - let before_index_count = tokio::runtime::Runtime::new().unwrap().block_on(async { - let db = Omnigraph::open(graph.to_string_lossy().as_ref()) - .await - .unwrap(); - let snapshot = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); - let dataset = snapshot.open("node:Person").await.unwrap(); - dataset.load_indices().await.unwrap().len() - }); - - let indexed_schema = fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace("name: String @key", "name: String @key @index"); - fs::write(&schema_path, indexed_schema).unwrap(); - - let output = output_success( - cli() - .arg("schema") - .arg("apply") - .arg("--schema") - 
.arg(&schema_path) - .arg("--json") - .arg(&graph), - ); - let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - assert_eq!(payload["applied"], true); - - let after_index_count = tokio::runtime::Runtime::new().unwrap().block_on(async { - let db = Omnigraph::open(graph.to_string_lossy().as_ref()) - .await - .unwrap(); - let snapshot = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); - let dataset = snapshot.open("node:Person").await.unwrap(); - dataset.load_indices().await.unwrap().len() - }); - // iss-848: `schema apply` records the `@index` intent but defers the physical - // index build (materialized later by ensure_indices/optimize; on this empty - // table nothing builds anyway). So the physical index count is unchanged. - assert_eq!( - after_index_count, before_index_count, - "schema apply records @index intent but defers the physical build (iss-848)" - ); -} - -#[test] -fn schema_apply_rejects_unsupported_plan() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = temp.path().join("breaking.pg"); - init_graph(&graph); - - let breaking_schema = fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace("age: I32?", "age: I64?"); - fs::write(&schema_path, breaking_schema).unwrap(); - - let output = output_failure( - cli() - .arg("schema") - .arg("apply") - .arg("--schema") - .arg(&schema_path) - .arg(&graph), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!(stderr.contains("changing property type")); -} - -#[test] -fn schema_apply_rejects_when_non_main_branch_exists() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = temp.path().join("next.pg"); - init_graph(&graph); - output_success( - cli() - .arg("branch") - .arg("create") - .arg("--from") - .arg("main") - .arg("--uri") - .arg(&graph) - .arg("feature"), - ); - - let next_schema = fs::read_to_string(fixture("test.pg")).unwrap().replace( - " age: I32?\n}", - " age: I32?\n 
nickname: String?\n}", - ); - fs::write(&schema_path, next_schema).unwrap(); - - let output = output_failure( - cli() - .arg("schema") - .arg("apply") - .arg("--schema") - .arg(&schema_path) - .arg(&graph), - ); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!(stderr.contains("schema apply requires a graph with only main")); -} - -#[test] -fn schema_apply_allow_data_loss_flag_promotes_drops_to_hard() { - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = temp.path().join("drop-age.pg"); - init_graph(&graph); - - // Drop the nullable `age` column. - let next_schema = fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace(" age: I32?\n", ""); - fs::write(&schema_path, next_schema).unwrap(); - - let output = output_success( - cli() - .arg("schema") - .arg("apply") - .arg("--schema") - .arg(&schema_path) - .arg("--allow-data-loss") - .arg("--json") - .arg(&graph), - ); - let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - assert_eq!(payload["applied"], true); - - let drop_step = payload["steps"] - .as_array() - .unwrap() - .iter() - .find(|s| s["kind"] == "drop_property") - .expect("plan should include a drop_property step"); - assert_eq!( - drop_step["mode"], "hard", - "--allow-data-loss should promote Soft → Hard; full step: {drop_step}", - ); -} - -#[test] -fn schema_apply_without_allow_data_loss_keeps_soft_drops() { - // Symmetric to the above: same schema change without the flag → - // drops stay Soft. Pins default semantics against accidental Hard - // promotion if a future refactor changes the option threading. 
- let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - let schema_path = temp.path().join("drop-age-soft.pg"); - init_graph(&graph); - - let next_schema = fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace(" age: I32?\n", ""); - fs::write(&schema_path, next_schema).unwrap(); - - let output = output_success( - cli() - .arg("schema") - .arg("apply") - .arg("--schema") - .arg(&schema_path) - .arg("--json") - .arg(&graph), - ); - let payload: Value = serde_json::from_slice(&output.stdout).unwrap(); - assert_eq!(payload["applied"], true); - - let drop_step = payload["steps"] - .as_array() - .unwrap() - .iter() - .find(|s| s["kind"] == "drop_property") - .expect("plan should include a drop_property step"); - assert_eq!( - drop_step["mode"], "soft", - "no flag should leave drops Soft; full step: {drop_step}", - ); -} - -#[test] -fn schema_plan_parity_cli_and_sdk() { - // Same .pg through `Omnigraph::plan_schema_with_options` (SDK) and - // `omnigraph schema plan --json` (CLI). Asserts the steps array is - // byte-identical after JSON round-trip. HTTP doesn't expose a - // separate /schema/plan route — that side of parity is covered by - // the HTTP soft/hard drop tests, which exercise apply with - // identical fixtures. - let temp = tempdir().unwrap(); - let graph = graph_path(temp.path()); - init_graph(&graph); - let schema_path = temp.path().join("plan-parity.pg"); - let next_schema = fs::read_to_string(fixture("test.pg")).unwrap().replace( - " age: I32?\n}", - " age: I32?\n nickname: String?\n}", - ); - fs::write(&schema_path, &next_schema).unwrap(); - - // CLI side. - let cli_output = output_success( - cli() - .arg("schema") - .arg("plan") - .arg("--schema") - .arg(&schema_path) - .arg("--json") - .arg(&graph), - ); - let cli_payload: Value = serde_json::from_slice(&cli_output.stdout).unwrap(); - - // SDK side: open graph, call plan_schema. 
- let plan = tokio::runtime::Runtime::new().unwrap().block_on(async { - let db = Omnigraph::open(graph.to_string_lossy().as_ref()) - .await - .unwrap(); - db.plan_schema(&next_schema).await.unwrap() - }); - let sdk_steps = serde_json::to_value(&plan.steps).unwrap(); - - assert_eq!( - cli_payload["steps"], sdk_steps, - "CLI plan steps must match SDK plan steps for identical input", - ); - assert_eq!(cli_payload["supported"], plan.supported); -} - -#[test] -fn graphs_subcommand_help_lists_list_only() { - let output = output_success(cli().arg("graphs").arg("--help")); - let stdout = stdout_string(&output); - assert!( - stdout.contains("list"), - "expected `list` subcommand in help output:\n{stdout}" - ); - let lowered = stdout.to_lowercase(); - assert!( - !lowered.contains("create a new graph"), - "graph create should not be in v0.6.0 help; got:\n{stdout}" - ); - assert!( - !lowered.contains("delete a graph"), - "graph delete should not be in v0.6.0 help; got:\n{stdout}" - ); -} - -#[test] -fn graphs_list_against_local_uri_errors_with_remote_only_message() { - // RFC-011: `graphs list` is served-only; a `--store` (local) address has no - // enumeration endpoint, so it fails loudly pointing at a server / cluster. - let output = output_failure( - cli() - .arg("graphs") - .arg("list") - .arg("--store") - .arg("/tmp/local"), - ); - let stderr = String::from_utf8_lossy(&output.stderr).into_owned(); - assert!( - stderr.contains("remote multi-graph server"), - "expected a remote-server rejection in stderr; got:\n{stderr}" - ); -} diff --git a/crates/omnigraph-cli/tests/parity_matrix.rs b/crates/omnigraph-cli/tests/parity_matrix.rs deleted file mode 100644 index e46f064..0000000 --- a/crates/omnigraph-cli/tests/parity_matrix.rs +++ /dev/null @@ -1,285 +0,0 @@ -//! RFC-009 Phase 1 — the embedded/remote parity referee. -//! -//! For every CLI verb with an `is_remote` fork, run the identical -//! invocation against (a) the local graph directly and (b) a spawned -//! 
server on a twin copy of the same graph, with the SAME actor on both -//! arms (local `--as act-parity`; remote bearer token resolving to -//! `act-parity`). Scrub the declared-volatile allowlist -//! (`support::scrub_volatile` — ids, wall-clock, transport locations); -//! everything else must match exactly. -//! -//! This test PINS behavior; it does not idealize it. Genuine divergences -//! discovered here are recorded in `KNOWN_DIVERGENCES` below (and filed), -//! never silently repaired — repairs are Phase 3's job, gated by this -//! referee staying green through the refactor. - -use tempfile::TempDir; - -mod support; -use support::*; - -/// Divergences between the arms that exist today, pinned as expectations. -/// Removing an entry requires the corresponding behavior change to be a -/// deliberate, release-noted decision (RFC-009 Compatibility). -const KNOWN_DIVERGENCES: &[&str] = &[ - // populated by the rows below as they are written -]; - -/// One matched setup per row: twin graphs + the parity Cedar bundle on the -/// served arm. The local (`--store`) arm carries no policy (RFC-011); the -/// bundle is permissive for `act-parity`, so the arms still agree. -struct Parity { - _temp: TempDir, - local: std::path::PathBuf, - server: TestServer, -} - -fn parity() -> Parity { - let (temp, local, remote) = twin_graphs(); - // RFC-011 cluster-only: the remote arm is served from a converged - // cluster directory (one graph, id `parity`), seeded with the same - // fixture data as the local twin. 
- let cluster_dir = parity_configs(temp.path(), &local, &remote); - let server = spawn_server_with_cluster_env( - &cluster_dir, - &[( - "OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", - r#"{"act-parity":"parity-tok"}"#, - )], - ); - Parity { - _temp: temp, - local, - server, - } -} - -impl Parity { - fn run(&self, args: &[&str]) -> (std::process::Output, std::process::Output) { - run_both(&self.local, &self.server.base_url, args) - } -} - -fn assert_parity(verb: &str, local: &std::process::Output, remote: &std::process::Output) { - assert_eq!( - local.status.code(), - remote.status.code(), - "{verb}: exit codes diverge\nlocal: {local:?}\nremote: {remote:?}" - ); - if local.status.success() { - let local_json = scrubbed_json(local); - let remote_json = scrubbed_json(remote); - assert_eq!( - local_json, remote_json, - "{verb}: scrubbed JSON diverges (left=local, right=remote)" - ); - } -} - -#[test] -fn parity_query() { - let p = parity(); - let query = fixture("test.gq"); - let (l, r) = p.run(&[ - "query", - "--query", - query.to_str().unwrap(), - "get_person", - "--params", - r#"{"name":"Alice"}"#, - "--json", - ], - ); - assert_parity("query", &l, &r); -} - -#[test] -fn parity_schema_show() { - let p = parity(); - let (l, r) = p.run(&["schema", "show", "--json"]); - assert_parity("schema show", &l, &r); -} - -#[test] -fn parity_snapshot() { - let p = parity(); - let (l, r) = p.run(&["snapshot", "--json"]); - assert_parity("snapshot", &l, &r); -} - -#[test] -fn parity_branch_list() { - let p = parity(); - let (l, r) = p.run(&["branch", "list", "--json"]); - assert_parity("branch list", &l, &r); -} - -#[test] -fn parity_commit_list() { - let p = parity(); - let (l, r) = p.run(&["commit", "list", "--json"]); - assert_parity("commit list", &l, &r); -} - -#[test] -fn parity_mutate() { - let p = parity(); - let (l, r) = p.run(&[ - "mutate", - "-e", - "query add($name: String, $age: I32) { insert Person { name: $name, age: $age } }", - "--params", - 
r#"{"name":"Parity","age":7}"#, - "--json", - ], - ); - assert_parity("mutate", &l, &r); -} - -#[test] -fn parity_branch_create_delete() { - let p = parity(); - let (l, r) = p.run(&["branch", "create", "--from", "main", "parity-branch", "--json"], - ); - assert_parity("branch create", &l, &r); - // `branch delete` is destructive: the served (remote) arm is non-local and - // requires consent (RFC-011 Decision 9), so the row passes `--yes` to test - // the operation itself, not the safety gate. The local arm ignores `--yes`. - let (l, r) = p.run(&["branch", "delete", "parity-branch", "--yes", "--json"], - ); - assert_parity("branch delete", &l, &r); -} - -#[test] -fn parity_branch_merge() { - let p = parity(); - let (l, r) = p.run(&["branch", "create", "--from", "main", "feature", "--json"], - ); - assert_parity("branch create (merge setup)", &l, &r); - let (l, r) = p.run(&["branch", "merge", "feature", "--into", "main", "--json"], - ); - assert_parity("branch merge", &l, &r); -} - -#[test] -fn parity_load() { - let p = parity(); - let data = p.local.parent().unwrap().join("rows.jsonl"); - std::fs::write( - &data, - "{\"type\":\"Person\",\"data\":{\"name\":\"Loaded\",\"age\":1}}\n", - ) - .unwrap(); - let (l, r) = p.run(&[ - "load", - "--mode", - "merge", - "--data", - data.to_str().unwrap(), - "--json", - ], - ); - assert_parity("load", &l, &r); -} - -#[test] -fn parity_export() { - let p = parity(); - let (l, r) = p.run(&["export"]); - // export emits a JSONL STREAM, not a single `--json` document, so the - // scrubbed-single-doc `assert_parity` doesn't apply — compare line-wise. - // The twin graphs are byte-copies of one loaded fixture, so rows carry - // identical ids/versions and need no scrubbing; sort the lines so any - // cross-arm row-ordering difference doesn't masquerade as a divergence. 
- assert_eq!( - l.status.code(), - r.status.code(), - "export: exit codes diverge\nlocal {l:?}\nremote {r:?}" - ); - assert!(l.status.success(), "export local arm failed: {l:?}"); - let mut local_lines: Vec<&str> = std::str::from_utf8(&l.stdout).unwrap().lines().collect(); - let mut remote_lines: Vec<&str> = std::str::from_utf8(&r.stdout).unwrap().lines().collect(); - assert!( - !local_lines.is_empty(), - "export produced no rows — the parity check would be vacuous" - ); - local_lines.sort_unstable(); - remote_lines.sort_unstable(); - assert_eq!( - local_lines, remote_lines, - "export: JSONL streams diverge (left=local, right=remote)" - ); -} - -// ---- error parity: exit codes must match for shared failure cases ---- - -#[test] -fn parity_errors_share_exit_codes() { - let p = parity(); - - // unknown branch on merge - let (l, r) = p.run(&["branch", "merge", "no-such-branch", "--into", "main", "--json"], - ); - assert_eq!( - (l.status.success(), r.status.success()), - (false, false), - "merge of unknown branch must fail on both arms\nlocal {l:?}\nremote {r:?}" - ); - - // unknown query name in the source - let query = fixture("test.gq"); - let (l, r) = p.run(&[ - "query", - "--query", - query.to_str().unwrap(), - "no_such_query", - "--json", - ], - ); - assert_eq!( - (l.status.success(), r.status.success()), - (false, false), - "unknown query name must fail on both arms\nlocal {l:?}\nremote {r:?}" - ); - - // Discovery (parity HOLDS, behavior surprising): an inline query run - // with a declared-but-unbound param does NOT error on either arm — it - // returns every row (the filter drops), while the stored-query invoke - // path hard-errors 'parameter not provided'. Pinned here as agreeing - // behavior; the cross-path asymmetry is filed separately. 
- let (l, r) = p.run(&[ - "query", - "--query", - query.to_str().unwrap(), - "get_person", - "--json", - ], - ); - assert_eq!( - (l.status.success(), r.status.success()), - (true, true), - "unbound-param inline query currently SUCCEEDS on both arms (matches-all)" - ); -} - -// ---- documented exclusions (not bugs; the Phase 4 capability table) ---- -// -// - `graphs list`: server-only today; becomes Both-capability when the -// embedded arm enumerates the cluster catalog (RFC-009 open Q3, answered). -// - `ingest`: deprecated alias of load; its remote arm rides the deprecated -// /ingest route. The canonical `load` verb targets `/load` (RFC-009 Phase 5, -// landed) — `parity_load` exercises it on the remote arm. -// - `init`, `optimize`, `repair`, `cleanup`, `cluster *`: storage-plane by -// design (must work with the server down); Phase 4 declares this. -#[allow(dead_code)] -const EXCLUSIONS_DOCUMENTED: () = (); - -#[test] -fn known_divergences_ledger_is_current() { - // The ledger exists so removals are deliberate: an empty list with all - // rows green means the arms agree everywhere the matrix looks. - assert!( - KNOWN_DIVERGENCES.is_empty(), - "divergences are pinned: {KNOWN_DIVERGENCES:?}" - ); -} diff --git a/crates/omnigraph-cli/tests/support/mod.rs b/crates/omnigraph-cli/tests/support/mod.rs index ff6a5d4..b62d861 100644 --- a/crates/omnigraph-cli/tests/support/mod.rs +++ b/crates/omnigraph-cli/tests/support/mod.rs @@ -12,24 +12,12 @@ use reqwest::blocking::Client; use serde_json::Value; use tempfile::{TempDir, tempdir}; -/// Hermetic default: point OMNIGRAPH_HOME at a path that exists on no -/// machine, so spawned binaries never read the developer's real -/// ~/.omnigraph/ (an absent operator config is an empty layer). Tests -/// exercising the operator layer override the var explicitly. 
-pub const HERMETIC_OPERATOR_HOME: &str = "/nonexistent/omnigraph-test-home"; - pub fn cli() -> Command { - let mut command = Command::cargo_bin("omnigraph").unwrap(); - command.env("OMNIGRAPH_HOME", HERMETIC_OPERATOR_HOME); - command.env_remove("OMNIGRAPH_CONFIG"); - command + Command::cargo_bin("omnigraph").unwrap() } pub fn cli_process() -> StdCommand { - let mut command = StdCommand::new(assert_cmd::cargo::cargo_bin("omnigraph")); - command.env("OMNIGRAPH_HOME", HERMETIC_OPERATOR_HOME); - command.env_remove("OMNIGRAPH_CONFIG"); - command + StdCommand::new(assert_cmd::cargo::cargo_bin("omnigraph")) } fn server_process() -> StdCommand { @@ -105,15 +93,7 @@ pub fn init_graph(graph: &Path) { pub fn load_fixture(graph: &Path) { let data = fixture("test.jsonl"); - output_success( - cli() - .arg("load") - .arg("--mode") - .arg("overwrite") - .arg("--data") - .arg(&data) - .arg(graph), - ); + output_success(cli().arg("load").arg("--data").arg(&data).arg(graph)); } pub fn write_jsonl(path: &Path, rows: &str) { @@ -232,42 +212,6 @@ pub fn spawn_server_with_config(config: &Path) -> TestServer { spawn_server_process(command) } -pub fn spawn_server_with_cluster(cluster_dir: &Path) -> TestServer { - let mut command = server_process(); - command.arg("--cluster").arg(cluster_dir).arg("--unauthenticated"); - spawn_server_process(command) -} - -/// Cluster boot with the server process's cwd set explicitly — used to prove -/// rule 0 never touches the cwd omnigraph.yaml search. 
-pub fn spawn_server_with_cluster_in(cluster_dir: &Path, cwd: &Path) -> TestServer { - let mut command = server_process(); - command - .arg("--cluster") - .arg(cluster_dir) - .arg("--unauthenticated") - .current_dir(cwd); - spawn_server_process(command) -} - -pub fn spawn_server_with_cluster_env(cluster_dir: &Path, envs: &[(&str, &str)]) -> TestServer { - let mut command = server_process(); - command.arg("--cluster").arg(cluster_dir); - for (name, value) in envs { - command.env(name, value); - } - spawn_server_process(command) -} - -pub fn spawn_server_with_env(graph: &Path, envs: &[(&str, &str)]) -> TestServer { - let mut command = server_process(); - command.arg(graph); - for (name, value) in envs { - command.env(name, value); - } - spawn_server_process(command) -} - pub fn spawn_server_with_config_env(config: &Path, envs: &[(&str, &str)]) -> TestServer { let mut command = server_process(); command.arg("--config").arg(config); @@ -338,646 +282,3 @@ impl SystemGraph { spawn_server_with_config_env(config, envs) } } - -/// A converged cluster directory the server can boot from (`--cluster`), -/// serving one graph seeded with the standard fixture. Holds the temp dir -/// alive for the test's lifetime. -pub struct ClusterFixture { - _temp: TempDir, - dir: PathBuf, -} - -impl ClusterFixture { - pub fn path(&self) -> &Path { - &self.dir - } -} - -/// Build a converged cluster (RFC-011 cluster-only serving) with a single -/// graph `graph_id`, seeded with the `test.jsonl` fixture so reads return -/// data. When `policy_yaml` is `Some`, the bundle is bound to the graph -/// scope. The server boots from the returned path via `--cluster`. 
-pub fn converged_loaded_cluster(graph_id: &str, policy_yaml: Option<&str>) -> ClusterFixture { - let temp = tempdir().unwrap(); - let dir = temp.path().to_path_buf(); - fs::copy(fixture("test.pg"), dir.join("graph.pg")).unwrap(); - - let policy_block = match policy_yaml { - Some(source) => { - fs::write(dir.join("graph.policy.yaml"), source).unwrap(); - format!( - "policies:\n graph:\n file: ./graph.policy.yaml\n applies_to: [{graph_id}]\n" - ) - } - None => String::new(), - }; - fs::write( - dir.join("cluster.yaml"), - format!( - "version: 1\nmetadata:\n name: sys\nstate:\n backend: cluster\n lock: true\ngraphs:\n {graph_id}:\n schema: ./graph.pg\n{policy_block}" - ), - ) - .unwrap(); - - output_success(cli().arg("cluster").arg("import").arg("--config").arg(&dir)); - output_success(cli().arg("cluster").arg("apply").arg("--config").arg(&dir)); - - let served_root = dir.join("graphs").join(format!("{graph_id}.omni")); - output_success( - cli() - .arg("load") - .arg("--data") - .arg(fixture("test.jsonl")) - .arg("--mode") - .arg("overwrite") - .arg(&served_root), - ); - - ClusterFixture { _temp: temp, dir } -} - -// ---- helpers moved from the monolithic tests/cli.rs ---- -#[allow(unused_imports)] -use lance::Dataset; -#[allow(unused_imports)] -use lance::index::DatasetIndexExt; -#[allow(unused_imports)] -use omnigraph::db::{Omnigraph, ReadTarget}; - -pub const POLICY_YAML: &str = r#" -version: 1 -groups: - team: [act-andrew, act-bruno] - admins: [act-andrew] -protected_branches: [main] -rules: - - id: team-read - allow: - actors: { group: team } - actions: [read] - branch_scope: any - - id: team-write - allow: - actors: { group: team } - actions: [change] - branch_scope: unprotected - - id: admins-promote - allow: - actors: { group: admins } - actions: [branch_merge] - target_branch_scope: protected -"#; - -pub const POLICY_TESTS_YAML: &str = r#" -version: 1 -cases: - - id: allow-feature-write - actor: act-andrew - action: change - branch: feature - expect: allow - 
- id: deny-main-write - actor: act-bruno - action: change - branch: main - expect: deny -"#; - -pub fn manifest_dataset_version(graph: &std::path::Path) -> u64 { - tokio::runtime::Runtime::new().unwrap().block_on(async { - Omnigraph::open(graph.to_string_lossy().as_ref()) - .await - .unwrap() - .snapshot_of(ReadTarget::branch("main")) - .await - .unwrap() - .version() - }) -} - -pub fn forge_person_delete_drift(graph: &std::path::Path) -> (u64, u64) { - tokio::runtime::Runtime::new().unwrap().block_on(async { - let uri = graph.to_string_lossy(); - let db = Omnigraph::open(uri.as_ref()).await.unwrap(); - let snap = db - .snapshot_of(ReadTarget::branch("main")) - .await - .unwrap(); - let entry = snap.entry("node:Person").unwrap(); - let full_path = format!("{}/{}", uri.trim_end_matches('/'), entry.table_path); - let mut ds = Dataset::open(&full_path).await.unwrap(); - let deleted = ds.delete("name = 'Alice'").await.unwrap(); - assert_eq!(deleted.num_deleted_rows, 1); - let head = deleted.new_dataset.version().version; - assert!(head > entry.table_version); - (entry.table_version, head) - }) -} - -pub fn write_policy_config_fixture(root: &std::path::Path) -> (std::path::PathBuf, std::path::PathBuf) { - let config = root.join("omnigraph.yaml"); - let policy = root.join("policy.yaml"); - fs::write( - &config, - r#" -project: - name: policy-test-graph -policy: - file: ./policy.yaml -"#, - ) - .unwrap(); - fs::write(&policy, POLICY_YAML).unwrap(); - fs::write(root.join("policy.tests.yaml"), POLICY_TESTS_YAML).unwrap(); - (config, policy) -} - -pub fn write_cluster_config_fixture(root: &std::path::Path) { - fs::write( - root.join("people.pg"), - r#" -node Person { - name: String @key - age: I32? 
-} -"#, - ) - .unwrap(); - fs::write( - root.join("people.gq"), - r#" -query find_person($name: String) { - match { $p: Person { name: $name } } - return { $p.name, $p.age } -} -"#, - ) - .unwrap(); - fs::write(root.join("base.policy.yaml"), "rules: []\n").unwrap(); - fs::write( - root.join("cluster.yaml"), - r#" -version: 1 -metadata: - name: company-brain -state: - backend: cluster - lock: true -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -policies: - base: - file: ./base.policy.yaml - applies_to: [knowledge] -"#, - ) - .unwrap(); -} - -pub fn init_cluster_derived_graph(root: &std::path::Path) { - init_named_cluster_graph(root, "knowledge", "people.pg"); -} - -pub fn init_named_cluster_graph(root: &std::path::Path, graph_id: &str, schema_file: &str) { - let graph_dir = root.join("graphs"); - fs::create_dir_all(&graph_dir).unwrap(); - output_success( - cli() - .arg("init") - .arg("--schema") - .arg(root.join(schema_file)) - .arg(graph_dir.join(format!("{graph_id}.omni"))), - ); -} - -pub fn write_cluster_lock(root: &std::path::Path, lock_id: &str, operation: &str) { - let state_dir = root.join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("lock.json"), - format!( - r#"{{"version":1,"lock_id":"{lock_id}","operation":"{operation}","created_at":"1970-01-01T00:00:00Z","pid":123}}"# - ), - ) - .unwrap(); -} - -pub fn write_cluster_applyable_state(root: &std::path::Path) -> serde_json::Value { - let validate = parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg("validate") - .arg("--config") - .arg(root) - .arg("--json"), - )); - let schema_digest = validate["resource_digests"]["schema.knowledge"] - .as_str() - .unwrap() - .to_string(); - let state_dir = root.join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - format!( - r#"{{ - "version": 1, - "state_revision": 1, - "applied_revision": {{ - "resources": {{ - 
"graph.knowledge": {{ "digest": "seed" }}, - "schema.knowledge": {{ "digest": "{schema_digest}" }} - }} - }} -}} -"# - ), - ) - .unwrap(); - validate -} - -pub fn cluster_json(root: &std::path::Path, command: &str) -> serde_json::Value { - parse_stdout_json(&output_success( - cli() - .arg("cluster") - .arg(command) - .arg("--config") - .arg(root) - .arg("--json"), - )) -} - -pub fn write_multi_graph_cluster_fixture(root: &std::path::Path) { - write_cluster_config_fixture(root); - fs::write( - root.join("services.pg"), - r#" -node Service { - name: String @key -} -"#, - ) - .unwrap(); - fs::write( - root.join("services.gq"), - r#" -query find_service($name: String) { - match { $s: Service { name: $name } } - return { $s.name } -} -"#, - ) - .unwrap(); - fs::write(root.join("cluster_wide.policy.yaml"), "rules: []\n").unwrap(); - fs::write(root.join("shared.policy.yaml"), "rules: []\n").unwrap(); - fs::write( - root.join("cluster.yaml"), - r#" -version: 1 -metadata: - name: company-brain -state: - backend: cluster - lock: true -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq - engineering: - schema: ./services.pg - queries: - find_service: - file: ./services.gq -policies: - shared: - file: ./shared.policy.yaml - applies_to: [knowledge, engineering] - cluster_wide: - file: ./cluster_wide.policy.yaml - applies_to: [cluster] -"#, - ) - .unwrap(); -} - -pub fn change_for<'j>(json: &'j serde_json::Value, resource: &str) -> &'j serde_json::Value { - json["changes"] - .as_array() - .unwrap() - .iter() - .find(|change| change["resource"] == resource) - .unwrap_or_else(|| panic!("missing change for {resource}: {json}")) -} - -pub fn write_seed_fixture(root: &std::path::Path) -> std::path::PathBuf { - fs::create_dir_all(root.join("data")).unwrap(); - fs::create_dir_all(root.join("build")).unwrap(); - let raw_seed = root.join("data/seed.jsonl"); - let seed = root.join("seed.yaml"); - - fs::write( - &raw_seed, - concat!( - 
"{\"type\":\"Decision\",\"data\":{\"slug\":\"dec-alpha\",\"intent\":\"Alpha ship\"}}\n", - "{\"type\":\"Decision\",\"data\":{\"slug\":\"dec-beta\",\"intent\":\"Beta ship\",\"embedding\":[0.1,0.2]}}\n" - ), - ) - .unwrap(); - - fs::write( - &seed, - concat!( - "graph:\n", - " slug: mr-context-graph\n", - "sources:\n", - " raw_seed: ./data/seed.jsonl\n", - "artifacts:\n", - " embedded_seed: ./build/seed.embedded.jsonl\n", - "embeddings:\n", - " model: gemini-embedding-2-preview\n", - " dimension: 4\n", - " types:\n", - " Decision:\n", - " target: embedding\n", - " fields: [slug, intent]\n" - ), - ) - .unwrap(); - - seed -} - -pub fn write_seed_fixture_with_edge(root: &std::path::Path) -> std::path::PathBuf { - let seed = write_seed_fixture(root); - let raw_seed = root.join("data/seed.jsonl"); - fs::write( - &raw_seed, - concat!( - "{\"type\":\"Decision\",\"data\":{\"slug\":\"dec-alpha\",\"intent\":\"Alpha ship\"}}\n", - "{\"type\":\"Decision\",\"data\":{\"slug\":\"dec-beta\",\"intent\":\"Beta ship\",\"embedding\":[0.1,0.2]}}\n", - "{\"edge\":\"Triggered\",\"from\":\"sig-alpha\",\"to\":\"dec-alpha\"}\n" - ), - ) - .unwrap(); - seed -} - -pub fn read_embedded_rows(path: std::path::PathBuf) -> Vec { - fs::read_to_string(path) - .unwrap() - .lines() - .filter(|line| !line.trim().is_empty()) - .map(|line| serde_json::from_str(line).unwrap()) - .collect() -} - -pub fn queries_test_config(graph_uri: &str, entry: &str, gq_file: &str) -> String { - format!( - "graphs:\n local:\n uri: '{}'\n queries:\n {entry}:\n file: ./{gq_file}\n\ - cli:\n graph: local\npolicy: {{}}\n", - graph_uri.replace('\'', "''") - ) -} - -// ---- RFC-009 Phase 1: parity-matrix harness ---- - -/// Twin graphs for embedded-vs-remote comparison: the same loaded fixture -/// copied to two roots, so write verbs can run once per arm on identical -/// state. Returns (tempdir-guard, local_graph, remote_graph). 
-pub fn twin_graphs() -> (TempDir, PathBuf, PathBuf) { - let temp = tempdir().unwrap(); - let seed = temp.path().join("seed"); - fs::create_dir_all(&seed).unwrap(); - let graph = seed.join("server.omni"); - init_graph(&graph); - load_fixture(&graph); - let local = temp.path().join("local.omni"); - let remote = temp.path().join("remote.omni"); - copy_dir(&graph, &local); - copy_dir(&graph, &remote); - (temp, local, remote) -} - -pub fn copy_dir(from: &Path, to: &Path) { - fs::create_dir_all(to).unwrap(); - for entry in fs::read_dir(from).unwrap() { - let entry = entry.unwrap(); - let target = to.join(entry.file_name()); - if entry.file_type().unwrap().is_dir() { - copy_dir(&entry.path(), &target); - } else { - fs::copy(entry.path(), &target).unwrap(); - } - } -} - -/// Scrub declared-volatile fields (RFC-009 Phase 1 allowlist) so the rest -/// of the JSON must match exactly. Key-based, recursive; both arms get the -/// same placeholders. Everything NOT listed here is contract. -pub fn scrub_volatile(value: &mut serde_json::Value) { - const VOLATILE_KEYS: &[&str] = &[ - // identity-bearing per-instance values - "commit_id", "id", "parent_id", "merge_parent_id", "snapshot", - // wall-clock - "committed_at", "created_at", "timestamp", - // transport / location - "uri", "path", - ]; - match value { - serde_json::Value::Object(map) => { - for (key, val) in map.iter_mut() { - if VOLATILE_KEYS.contains(&key.as_str()) && !val.is_null() { - *val = serde_json::Value::String(format!("")); - } else { - scrub_volatile(val); - } - } - } - serde_json::Value::Array(items) => { - for item in items { - scrub_volatile(item); - } - } - _ => {} - } -} - -pub const PARITY_ACTOR: &str = "act-parity"; -pub const PARITY_TOKEN: &str = "parity-tok"; - -/// Identical Cedar bundle for BOTH arms — like-for-like enforcement is part -/// of the parity contract (a bare local arm is permissive while a -/// tokens-only server is default-deny; comparing those would measure -/// configuration, not 
the fork). -pub fn parity_policy_yaml() -> String { - r#"version: 1 -groups: - parity: ["act-parity"] -protected_branches: [] -rules: - - id: reads - allow: - actors: { group: parity } - actions: [read, export, invoke_query] - - id: read-scope - allow: - actors: { group: parity } - actions: [read, export] - branch_scope: any - - id: writes - allow: - actors: { group: parity } - actions: [change] - branch_scope: any - - id: branching - allow: - actors: { group: parity } - actions: [schema_apply, branch_create, branch_delete, branch_merge] - target_branch_scope: any -"# - .to_string() -} - -/// The graph id the parity cluster serves the remote arm under. The -/// remote arm addresses it with `--graph PARITY_GRAPH_ID` (RFC-011: the -/// server is cluster-only, so a graph selector is required). -pub const PARITY_GRAPH_ID: &str = "parity"; - -/// Build the remote arm's configuration (RFC-011 cluster-only server). -/// -/// The remote arm is served from a converged cluster directory whose single -/// graph (id `parity`) carries the parity Cedar bundle (bound to the graph -/// scope). The cluster's derived graph root (`/graphs/parity.omni`) is -/// seeded with the SAME fixture data as the local twin so the two arms compare -/// like-for-like. The local (`--store`) arm carries no Cedar policy (RFC-011), -/// which is fine because the parity bundle is permissive for `act-parity`. -/// -/// `local_graph` is overwritten with a byte-for-byte copy of the cluster's -/// seeded served graph so identity-bearing values that are NOT scrubbed -/// (e.g. `graph_commit_id`, edge `id`s in export) match across the arms — -/// the served graph is the source of truth and the local twin mirrors it. -/// -/// Returns the `cluster_dir`. The caller spawns the server with `--cluster`.
-pub fn parity_configs(root: &Path, local_graph: &Path, _remote_graph: &Path) -> PathBuf { - let policy = root.join("parity.policy.yaml"); - fs::write(&policy, parity_policy_yaml()).unwrap(); - - // Remote arm: a cluster directory the server boots from. One graph - // (`parity`), schema = the shared fixture, policy bound to the graph. - let cluster_dir = root.join("parity-cluster"); - fs::create_dir_all(&cluster_dir).unwrap(); - fs::copy(fixture("test.pg"), cluster_dir.join("parity.pg")).unwrap(); - fs::copy(&policy, cluster_dir.join("parity.policy.yaml")).unwrap(); - fs::write( - cluster_dir.join("cluster.yaml"), - format!( - r#"version: 1 -metadata: - name: parity -state: - backend: cluster - lock: true -graphs: - {PARITY_GRAPH_ID}: - schema: ./parity.pg -policies: - parity: - file: ./parity.policy.yaml - applies_to: [{PARITY_GRAPH_ID}] -"# - ), - ) - .unwrap(); - - // Converge the cluster (creates the empty graph at the derived root), - // then seed it with the same fixture data the local twin holds. - output_success( - cli() - .arg("cluster") - .arg("import") - .arg("--config") - .arg(&cluster_dir), - ); - output_success( - cli() - .arg("cluster") - .arg("apply") - .arg("--config") - .arg(&cluster_dir), - ); - let served_root = cluster_dir - .join("graphs") - .join(format!("{PARITY_GRAPH_ID}.omni")); - output_success( - cli() - .arg("load") - .arg("--data") - .arg(fixture("test.jsonl")) - .arg("--mode") - .arg("overwrite") - .arg(&served_root), - ); - - // Mirror the seeded served graph into the local twin so both arms hold - // identical ULIDs / commit ids (the served graph is authoritative). - if local_graph.exists() { - fs::remove_dir_all(local_graph).unwrap(); - } - copy_dir(&served_root, local_graph); - - cluster_dir -} - -/// Run one CLI invocation per arm with identical verb args: locally against -/// `local_graph` (--as actor) and remotely against a server URL whose token -/// resolves to the same actor. 
Returns raw Outputs for exit-code + JSON -/// comparison by the caller. -pub fn run_both( - local_graph: &Path, - server_url: &str, - args: &[&str], -) -> (std::process::Output, std::process::Output) { - // Address both arms with GLOBAL flags (`--store` / `--server`) appended after - // the verb + its args, so the address is placed correctly regardless of - // subcommand nesting (a positional graph only works for top-level verbs; - // `schema show ` etc. need the global flag). Local = embedded store, - // remote = served. RFC-011: a direct (`--store`) write carries no Cedar - // policy — the parity policy is permissive for `act-parity` on the served - // arm, so the two arms still agree. - let mut local = cli(); - local - .args(args) - .arg("--store") - .arg(local_graph) - .arg("--as") - .arg(PARITY_ACTOR); - let local_out = local.output().unwrap(); - - let mut remote = cli(); - remote - .env("OMNIGRAPH_BEARER_TOKEN", PARITY_TOKEN) - .args(args) - .arg("--server") - .arg(server_url) - // RFC-011: the parity server is cluster-only (multi-graph), so the - // remote arm must name the graph it addresses. - .arg("--graph") - .arg(PARITY_GRAPH_ID); - let remote_out = remote.output().unwrap(); - (local_out, remote_out) -} - -/// Parse, scrub, and pretty-print for diffable assertion messages.
-pub fn scrubbed_json(output: &std::process::Output) -> String { - let mut value: serde_json::Value = serde_json::from_slice(&output.stdout) - .unwrap_or_else(|e| panic!("non-JSON stdout ({e}): {output:?}")); - scrub_volatile(&mut value); - serde_json::to_string_pretty(&value).unwrap() -} diff --git a/crates/omnigraph-cli/tests/system_local.rs b/crates/omnigraph-cli/tests/system_local.rs index 9b3701e..074b203 100644 --- a/crates/omnigraph-cli/tests/system_local.rs +++ b/crates/omnigraph-cli/tests/system_local.rs @@ -3,7 +3,6 @@ mod support; use std::env; use std::fs; -use omnigraph::db::Omnigraph; use reqwest::blocking::Client; use serde_json::Value; @@ -63,6 +62,31 @@ cases: expect: allow "#; +fn yaml_string(value: &str) -> String { + format!("'{}'", value.replace('\'', "''")) +} + +fn local_policy_config(graph: &SystemGraph) -> String { + format!( + "\ +project: + name: policy-e2e-local +graphs: + local: + uri: {} +cli: + graph: local + branch: main +query: + roots: + - . +policy: + file: ./policy.yaml +", + yaml_string(&graph.path().to_string_lossy()) + ) +} + fn insert_person_query(graph: &SystemGraph, name: &str) -> std::path::PathBuf { graph.write_query( name, @@ -175,8 +199,6 @@ fn local_cli_end_to_end_init_load_read_change_read_flow() { output_success( cli() .arg("load") - .arg("--mode") - .arg("overwrite") .arg("--data") .arg(fixture("test.jsonl")) .arg(graph.path()), @@ -185,10 +207,10 @@ fn local_cli_end_to_end_init_load_read_change_read_flow() { let read_before = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(graph.path()) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -200,7 +222,6 @@ fn local_cli_end_to_end_init_load_read_change_read_flow() { let change_payload = parse_stdout_json(&output_success( cli() .arg("change") - .arg("--store") .arg(graph.path()) .arg("--query") .arg(&mutation_file) @@ -214,10 +235,10 @@ fn 
local_cli_end_to_end_init_load_read_change_read_flow() { let read_after = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(graph.path()) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Eve"}"#) @@ -232,7 +253,6 @@ fn local_cli_end_to_end_init_load_read_change_read_flow() { let inline_change = parse_stdout_json(&output_success( cli() .arg("change") - .arg("--store") .arg(graph.path()) .arg("-e") .arg("query add($name: String, $age: I32) { insert Person { name: $name, age: $age } }") @@ -247,7 +267,6 @@ fn local_cli_end_to_end_init_load_read_change_read_flow() { let inline_read = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(graph.path()) .arg("--query-string") .arg("query find($name: String) { match { $p: Person { name: $name } } return { $p.name, $p.age } }") @@ -279,7 +298,6 @@ fn local_cli_end_to_end_branch_change_merge_flow() { let change_payload = parse_stdout_json(&output_success( cli() .arg("change") - .arg("--store") .arg(graph.path()) .arg("--query") .arg(&mutation_file) @@ -295,10 +313,10 @@ fn local_cli_end_to_end_branch_change_merge_flow() { let feature_read = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(graph.path()) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--branch") .arg("feature") @@ -323,10 +341,10 @@ fn local_cli_end_to_end_branch_change_merge_flow() { let main_read = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(graph.path()) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Zoe"}"#) @@ -357,7 +375,7 @@ fn local_cli_ingest_creates_review_branch_and_keeps_it_readable() { {"type":"Person","data":{"name":"Bob","age":26}}"#, ); - let ingest_output = output_success( + let ingest_payload = parse_stdout_json(&output_success( cli() .arg("ingest") .arg("--data") @@ -366,13 
+384,7 @@ fn local_cli_ingest_creates_review_branch_and_keeps_it_readable() { .arg("feature-ingest") .arg(graph.path()) .arg("--json"), - ); - // The deprecation warning goes to stderr so --json stdout stays clean. - assert!( - String::from_utf8_lossy(&ingest_output.stderr).contains("deprecated"), - "ingest must warn about its deprecation on stderr" - ); - let ingest_payload = parse_stdout_json(&ingest_output); + )); assert_eq!(ingest_payload["branch"], "feature-ingest"); assert_eq!(ingest_payload["base_branch"], "main"); assert_eq!(ingest_payload["branch_created"], true); @@ -393,10 +405,10 @@ fn local_cli_ingest_creates_review_branch_and_keeps_it_readable() { let zoe = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(graph.path()) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--branch") .arg("feature-ingest") @@ -410,10 +422,10 @@ fn local_cli_ingest_creates_review_branch_and_keeps_it_readable() { let bob = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(graph.path()) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--branch") .arg("feature-ingest") @@ -425,88 +437,6 @@ fn local_cli_ingest_creates_review_branch_and_keeps_it_readable() { assert_eq!(bob["rows"][0]["p.age"], 26); } -/// The unified `load` subsumes ingest: `--from` opts into fork-if-missing, -/// while without it a missing branch is an error — never an implicit fork. -#[test] -fn local_cli_load_from_forks_branch_and_missing_branch_errors_without_from() { - let graph = SystemGraph::loaded(); - let extra = graph.write_jsonl( - "system-local-load-from.jsonl", - r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#, - ); - - // Without --from, a missing branch must fail and create nothing.
- let failure = output_failure( - cli() - .arg("load") - .arg("--mode") - .arg("merge") - .arg("--data") - .arg(&extra) - .arg("--branch") - .arg("feature-load") - .arg(graph.path()), - ); - assert!( - String::from_utf8_lossy(&failure.stderr).contains("feature-load"), - "error should name the missing branch" - ); - - // With --from, the branch is forked and the load lands on it. - let payload = parse_stdout_json(&output_success( - cli() - .arg("load") - .arg("--mode") - .arg("merge") - .arg("--data") - .arg(&extra) - .arg("--branch") - .arg("feature-load") - .arg("--from") - .arg("main") - .arg(graph.path()) - .arg("--json"), - )); - assert_eq!(payload["branch"], "feature-load"); - assert_eq!(payload["base_branch"], "main"); - assert_eq!(payload["branch_created"], true); - assert_eq!(payload["mode"], "merge"); - assert_eq!(payload["nodes_loaded"], 1); - - let snapshot = parse_stdout_json(&output_success( - cli() - .arg("snapshot") - .arg(graph.path()) - .arg("--branch") - .arg("feature-load") - .arg("--json"), - )); - assert_eq!(snapshot["branch"], "feature-load"); -} - -/// `--mode` is required: overwrite is destructive, so the unified `load` -/// has no implicit default. 
-#[test] -fn local_cli_load_requires_mode_flag() { - let graph = SystemGraph::loaded(); - let extra = graph.write_jsonl( - "system-local-load-no-mode.jsonl", - r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#, - ); - - let failure = output_failure( - cli() - .arg("load") - .arg("--data") - .arg(&extra) - .arg(graph.path()), - ); - assert!( - String::from_utf8_lossy(&failure.stderr).contains("--mode"), - "clap should demand the missing --mode flag" - ); -} - #[test] fn local_cli_export_round_trips_full_branch_graph() { let graph = SystemGraph::loaded(); @@ -560,8 +490,6 @@ fn local_cli_export_round_trips_full_branch_graph() { output_success( cli() .arg("load") - .arg("--mode") - .arg("overwrite") .arg("--data") .arg(&export_path) .arg(&imported_graph), @@ -587,10 +515,10 @@ fn local_cli_export_round_trips_full_branch_graph() { let eve = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(&imported_graph) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Eve"}"#) @@ -602,10 +530,10 @@ fn local_cli_export_round_trips_full_branch_graph() { let friends = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(&imported_graph) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("friends_of") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -623,12 +551,30 @@ fn local_cli_s3_end_to_end_init_load_read_flow() { let temp = tempfile::tempdir().unwrap(); let query_root = temp.path(); + let config = query_root.join("omnigraph.yaml"); let query = query_root.join("test.gq"); fs::copy(fixture("test.gq"), &query).unwrap(); + write_config( + &config, + &format!( + "\ +graphs: + rustfs: + uri: '{}' +cli: + graph: rustfs + branch: main +query: + roots: + - . 
+policy: {{}} +", + graph_uri + ), + ); output_success( cli() - .current_dir(query_root) .arg("init") .arg("--schema") .arg(fixture("test.pg")) @@ -636,29 +582,21 @@ fn local_cli_s3_end_to_end_init_load_read_flow() { ); output_success( cli() - .current_dir(query_root) .arg("load") - .arg("--mode") - .arg("overwrite") - // `--yes` clears the RFC-011 Decision 9 destructive-write - // confirmation: `--mode overwrite` against a non-local (s3://) - // target is refused without it. - .arg("--yes") .arg("--data") .arg(fixture("test.jsonl")) .arg(&graph_uri), ); - // RFC-011: the graph is addressed by `--store `; the `.gq` path is - // resolved cwd-relative (no omnigraph.yaml `query.roots`). let read = parse_stdout_json(&output_success( cli() .current_dir(query_root) .arg("read") - .arg("--store") - .arg(&graph_uri) + .arg("--config") + .arg(&config) .arg("--query") .arg("test.gq") + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -671,8 +609,8 @@ fn local_cli_s3_end_to_end_init_load_read_flow() { cli() .current_dir(query_root) .arg("snapshot") - .arg("--store") - .arg(&graph_uri) + .arg("--config") + .arg(&config) .arg("--json"), )); assert!(snapshot["tables"].is_array()); @@ -720,7 +658,6 @@ fn local_cli_failed_change_keeps_target_state_unchanged() { let output = output_failure( cli() .arg("change") - .arg("--store") .arg(graph.path()) .arg("--query") .arg(&mutation_file) @@ -733,10 +670,10 @@ fn local_cli_failed_change_keeps_target_state_unchanged() { let friends_payload = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(graph.path()) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("friends_of") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -748,22 +685,36 @@ fn local_cli_failed_change_keeps_target_state_unchanged() { } #[test] -fn local_cli_resolves_relative_query_cwd_relative() { - // RFC-011: omnigraph.yaml `query.roots` search is gone — a `--query` - // path is resolved plainly relative
to the process cwd. This pins that - // a bare relative `.gq` filename resolves against `.current_dir`, and - // that the file actually read is the cwd-local one (a same-named query - // elsewhere with different columns is never picked up). +fn local_cli_resolves_relative_query_against_config_base_dir() { let graph = SystemGraph::loaded(); let root = graph.path().parent().unwrap(); - let cwd_dir = root.join("cwd"); - let other_dir = root.join("other"); - fs::create_dir_all(&cwd_dir).unwrap(); - fs::create_dir_all(&other_dir).unwrap(); + let config_dir = root.join("config"); + let query_dir = config_dir.join("queries"); + let ambient_dir = root.join("ambient"); + fs::create_dir_all(&query_dir).unwrap(); + fs::create_dir_all(&ambient_dir).unwrap(); - // The query in the cwd projects (age, name). + let config = config_dir.join("omnigraph.yaml"); + write_config( + &config, + &format!( + "\ +graphs: + local: + uri: '{}' +cli: + graph: local + branch: main +query: + roots: + - queries +policy: {{}} +", + graph.path().display() + ), + ); write_query_file( - &cwd_dir.join("local.gq"), + &query_dir.join("local.gq"), r#" query get_person($name: String) { match { @@ -773,10 +724,8 @@ query get_person($name: String) { } "#, ); - // A same-named query elsewhere projects only (name): if cwd-relative - // resolution regressed and picked this up, the columns assert fails. 
write_query_file( - &other_dir.join("local.gq"), + &ambient_dir.join("local.gq"), r#" query get_person($name: String) { match { @@ -789,12 +738,13 @@ query get_person($name: String) { let payload = parse_stdout_json(&output_success( cli() - .current_dir(&cwd_dir) + .current_dir(&ambient_dir) .arg("read") - .arg("--store") - .arg(graph.path()) + .arg("--config") + .arg(&config) .arg("--query") .arg("local.gq") + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -890,23 +840,15 @@ query get_task($slug: String) { ); output_success(cli().arg("init").arg("--schema").arg(&schema).arg(&graph)); - output_success( - cli() - .arg("load") - .arg("--mode") - .arg("overwrite") - .arg("--data") - .arg(&data) - .arg(&graph), - ); + output_success(cli().arg("load").arg("--data").arg(&data).arg(&graph)); let filtered = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(&graph) .arg("--query") .arg(&queries) + .arg("--name") .arg("due_with_tag") .arg("--params") .arg(r#"{"deadline":"2026-04-02T00:00:00Z","tag":"launch"}"#) @@ -928,10 +870,10 @@ query get_task($slug: String) { let insert_payload = parse_stdout_json(&output_success( cli() .arg("change") - .arg("--store") .arg(&graph) .arg("--query") .arg(&queries) + .arg("--name") .arg("insert_task") .arg("--params") .arg( @@ -944,10 +886,10 @@ query get_task($slug: String) { let update_payload = parse_stdout_json(&output_success( cli() .arg("change") - .arg("--store") .arg(&graph) .arg("--query") .arg(&queries) + .arg("--name") .arg("update_task") .arg("--params") .arg(r#"{"slug":"gamma","due_at":"2026-04-04T10:45:00Z","tags":["embed","released"],"scores":[13,21],"active_days":["2026-04-04","2026-04-05"]}"#) @@ -958,10 +900,10 @@ query get_task($slug: String) { let gamma = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(&graph) .arg("--query") .arg(&queries) + .arg("--name") .arg("get_task") .arg("--params") .arg(r#"{"slug":"gamma"}"#) @@ -1028,28 
+970,15 @@ query vector_search($q: String) { ); output_success(cli().arg("init").arg("--schema").arg(&schema).arg(&graph)); - output_success( - cli() - .arg("load") - .arg("--mode") - .arg("overwrite") - .arg("--data") - .arg(&data) - .arg(&graph), - ); + output_success(cli().arg("load").arg("--data").arg(&data).arg(&graph)); let result = parse_stdout_json(&output_success( cli() - // Stored vectors above were produced with gemini-embedding-2-preview; - // pin the query-time embedder to the same provider/model so the - // auto-embedded `$q` lands in the same vector space. - .env("OMNIGRAPH_EMBED_PROVIDER", "gemini") - .env("OMNIGRAPH_EMBED_MODEL", "gemini-embedding-2-preview") .arg("read") - .arg("--store") .arg(&graph) .arg("--query") .arg(&queries) + .arg("--name") .arg("vector_search") .arg("--params") .arg(r#"{"q":"alpha"}"#) @@ -1062,7 +991,7 @@ query vector_search($q: String) { // The publisher CAS conflict shape is verified end-to-end at the engine // level in -// `crates/omnigraph/tests/writes.rs::concurrent_writers_one_succeeds_one_gets_expected_version_mismatch` +// `crates/omnigraph/tests/runs.rs::concurrent_writers_one_succeeds_one_gets_expected_version_mismatch` // and at the HTTP boundary in // `crates/omnigraph-server/tests/server.rs::change_conflict_returns_manifest_conflict_409`. // A CLI-level race would be timing-dependent; with direct-publish the @@ -1070,48 +999,32 @@ query vector_search($q: String) { #[test] fn local_cli_policy_tooling_is_end_to_end() { - // RFC-011: the read-only policy CLI surfaces source the bundle from a - // cluster's applied policies (`--cluster ` + `--graph `), not - // from an omnigraph.yaml `graphs:` map. These don't mutate the graph; - // they parse and evaluate the effective bundle bound to the graph. - let cluster = converged_loaded_cluster("knowledge", Some(POLICY_E2E_YAML)); - // `policy test` has no per-bundle tests file in the cluster model, so - // the cases are supplied explicitly via `--tests`.
- let tests_file = cluster.path().join("policy.tests.yaml"); - fs::write(&tests_file, POLICY_E2E_TESTS_YAML).unwrap(); + // Sanity check for the read-only policy CLI surfaces. These don't + // mutate the graph — they just parse and evaluate the policy file — + // so they don't depend on PR #4's engine-side enforcement. + let graph = SystemGraph::loaded(); + let config = graph.write_config("omnigraph-policy.yaml", &local_policy_config(&graph)); + graph.write_config("policy.yaml", POLICY_E2E_YAML); + graph.write_config("policy.tests.yaml", POLICY_E2E_TESTS_YAML); let validate = output_success( cli() .arg("policy") .arg("validate") - .arg("--cluster") - .arg(cluster.path()) - .arg("--graph") - .arg("knowledge"), + .arg("--config") + .arg(&config), ); assert!(stdout_string(&validate).contains("policy valid:")); - let tests = output_success( - cli() - .arg("policy") - .arg("test") - .arg("--cluster") - .arg(cluster.path()) - .arg("--graph") - .arg("knowledge") - .arg("--tests") - .arg(&tests_file), - ); + let tests = output_success(cli().arg("policy").arg("test").arg("--config").arg(&config)); assert!(stdout_string(&tests).contains("policy tests passed: 2 cases")); let explain = output_success( cli() .arg("policy") .arg("explain") - .arg("--cluster") - .arg(cluster.path()) - .arg("--graph") - .arg("knowledge") + .arg("--config") + .arg(&config) .arg("--actor") .arg("act-bruno") .arg("--action") @@ -1124,91 +1037,78 @@ fn local_cli_policy_tooling_is_end_to_end() { assert!(explain_stdout.contains("branch: main")); } -/// Token→actor map for the served-policy tests: the bearer tokens the -/// cluster server resolves to `act-bruno` / `act-ragnor`. -const POLICY_TOKENS_JSON: &str = r#"{"act-bruno":"bruno-tok","act-ragnor":"ragnor-tok"}"#; - #[test] fn local_cli_change_enforces_engine_layer_policy() { - // RFC-011: a CLI direct-store write carries NO policy — policy lives in - // the cluster/server.
So engine-layer policy on a direct write no longer - // exists; this test asserts the faithful migration: the SERVER enforces - // the bundle bound to the served graph, addressed via `--server --graph` - // with a bearer token that resolves to the actor. + // Asserts MR-722 PR #4: when `policy.file` is configured in + // `omnigraph.yaml`, the CLI loads PolicyEngine into Omnigraph and + // every direct-engine write hits `enforce(action, scope, actor)` — + // identical to what the HTTP server gets, regardless of transport. // // Three cases, each discriminating: // - // 1. No token → the server refuses (401, unauthenticated). The old - // embedded "no actor" footgun does not apply to the served path - // (the actor comes from the token), so this replaces it. - // 2. bruno token, change on protected main → Cedar denies (bruno can - // change unprotected branches; main is protected). Non-zero exit, - // "denied" surfaced from the server error body. - // 3. ragnor token, change on main → Cedar permits (admins-write). Write - // succeeds and the inserted row is readable. - if skip_system_e2e("local_cli_change_enforces_engine_layer_policy") { - return; - } - let cluster = converged_loaded_cluster("knowledge", Some(POLICY_E2E_YAML)); - let server = spawn_server_with_cluster_env( - cluster.path(), - &[("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", POLICY_TOKENS_JSON)], - ); - let insert = - "query add($name: String, $age: I32) { insert Person { name: $name, age: $age } }"; + // 1. Policy installed, no actor source (no `cli.actor` in config, + // no `--as` flag) → engine-layer footgun guard fires; CLI exits + // non-zero with a "no actor" message. Silent bypass is the bug + // PR #4 prevents. + // 2. Policy installed, `--as act-bruno`, change on main → Cedar + // denies (bruno can change unprotected branches; main is + // protected). CLI exits non-zero with a "denied" message. + // 3.
Policy installed, `--as act-ragnor`, change on main → + // Cedar permits (admins-write rule). Write succeeds and the + // inserted row is readable. + let graph = SystemGraph::loaded(); + let config = graph.write_config("omnigraph-policy.yaml", &local_policy_config(&graph)); + graph.write_config("policy.yaml", POLICY_E2E_YAML); + let mutation_file = insert_person_query(&graph, "system-local-policy-change.gq"); - // Case 1: no token → the server refuses before any policy check. - let no_token = cli() - .arg("change") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") - .arg("-e") - .arg(insert) - .arg("--params") - .arg(r#"{"name":"NoTokenPerson","age":1}"#) - .arg("--json") - .output() - .unwrap(); + // Case 1: policy configured, no actor threaded → footgun guard. + let no_actor = output_failure( + cli() + .arg("change") + .arg("--config") + .arg(&config) + .arg("--query") + .arg(&mutation_file) + .arg("--params") + .arg(r#"{"name":"NoActorPerson","age":1}"#) + .arg("--json"), + ); + let no_actor_stderr = String::from_utf8_lossy(&no_actor.stderr); assert!( - !no_token.status.success(), - "unauthenticated served write must be refused: {no_token:?}" + no_actor_stderr.contains("no actor"), + "expected 'no actor' footgun message, got stderr: {no_actor_stderr}" ); - // Case 2: bruno token against protected main → denied by the server. - let denied = cli() - .env("OMNIGRAPH_BEARER_TOKEN", "bruno-tok") - .arg("change") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") - .arg("-e") - .arg(insert) - .arg("--params") - .arg(r#"{"name":"BrunoOnMain","age":2}"#) - .arg("--json") - .output() - .unwrap(); - assert!(!denied.status.success(), "bruno/main must be denied"); + // Case 2: `--as act-bruno` against protected main → denied.
+ let denied = output_failure( + cli() + .arg("--as") + .arg("act-bruno") + .arg("change") + .arg("--config") + .arg(&config) + .arg("--query") + .arg(&mutation_file) + .arg("--params") + .arg(r#"{"name":"BrunoOnMain","age":2}"#) + .arg("--json"), + ); let denied_stderr = String::from_utf8_lossy(&denied.stderr); assert!( denied_stderr.contains("denied"), "expected 'denied' message for bruno/main, got stderr: {denied_stderr}" ); - // Case 3: ragnor token against main → permitted by admins-write. + // Case 3: `--as act-ragnor` against main → permitted by admins-write. let allowed = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "ragnor-tok") + .arg("--as") + .arg("act-ragnor") .arg("change") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") - .arg("-e") - .arg(insert) + .arg("--config") + .arg(&config) + .arg("--query") + .arg(&mutation_file) .arg("--params") .arg(r#"{"name":"RagnorOnMain","age":3}"#) .arg("--json"), @@ -1218,19 +1118,14 @@ fn local_cli_change_enforces_engine_layer_policy() { assert_eq!(allowed["actor_id"], "act-ragnor"); // Verify the row landed — proves the write actually committed, not - // just that enforce returned Ok and silently dropped the work. The read - // uses the bruno token: POLICY_E2E_YAML grants `read` to the `team` - // group (bruno), while admins (ragnor) get write-only rules. + // just that enforce returned Ok and silently dropped the work.
let verify = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "bruno-tok") .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") + .arg(graph.path()) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"RagnorOnMain"}"#) @@ -1240,35 +1135,6 @@ fn local_cli_change_enforces_engine_layer_policy() { assert_eq!(verify["rows"][0]["p.name"], "RagnorOnMain"); } -#[test] -fn local_cli_direct_store_write_is_unpoliced_regardless_of_actor() { - // RFC-011: a direct (`--store`) write carries no Cedar policy at all — - // policy lives in the cluster/server. So a write that the SERVED path - // would deny (bruno changing protected main) succeeds on the direct - // path, regardless of the actor. This is the faithful replacement for - // the obsolete `..._positional_uri_does_not_inherit_default_graph_policy` - // premise: a positional/`--store` address has no policy to inherit. - let graph = SystemGraph::loaded(); - let mutation_file = insert_person_query(&graph, "system-local-policy-direct.gq"); - - let allowed = parse_stdout_json(&output_success( - cli() - .arg("--as") - .arg("act-bruno") - .arg("change") - .arg("--store") - .arg(graph.path()) - .arg("--query") - .arg(&mutation_file) - .arg("--params") - .arg(r#"{"name":"DirectStoreBruno","age":4}"#) - .arg("--json"), - )); - assert_eq!(allowed["branch"], "main"); - assert_eq!(allowed["affected_nodes"], 1); - assert_eq!(allowed["actor_id"], "act-bruno"); -} - // ─── MR-722 PR A: CLI×writer matrix ─────────────────────────────────────── // // The change writer is covered above by `local_cli_change_enforces_engine_layer_policy`. @@ -1282,44 +1148,26 @@ fn local_cli_direct_store_write_is_unpoliced_regardless_of_actor() { #[test] fn local_cli_load_enforces_engine_layer_policy() { - // RFC-011 served re-point: the server enforces the graph-bound bundle on - // a remote load.
A load into protected main is a `change`: bruno - // (team-write-unprotected) is denied, ragnor (admins-write) is allowed. - if skip_system_e2e("local_cli_load_enforces_engine_layer_policy") { - return; - } - let cluster = converged_loaded_cluster("knowledge", Some(POLICY_E2E_YAML)); - let server = spawn_server_with_cluster_env( - cluster.path(), - &[("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", POLICY_TOKENS_JSON)], - ); - let temp = tempfile::tempdir().unwrap(); - let data = temp.path().join("policy-load.jsonl"); - fs::write( - &data, + let graph = SystemGraph::loaded(); + let config = graph.write_config("omnigraph-policy.yaml", &local_policy_config(&graph)); + graph.write_config("policy.yaml", POLICY_E2E_YAML); + let data = graph.write_jsonl( + "system-local-policy-load.jsonl", r#"{"type":"Person","data":{"name":"LoadPolicy","age":11}}"#, - ) - .unwrap(); + ); // act-bruno: change-on-protected is denied (team-write-unprotected only). - let denied = cli() - .env("OMNIGRAPH_BEARER_TOKEN", "bruno-tok") - .arg("load") - .arg("--mode") - .arg("overwrite") - // `--yes` clears the RFC-011 Decision 9 destructive-write confirmation - // so the policy check (not the confirmation refusal) is what denies. - .arg("--yes") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") - .arg("--data") - .arg(&data) - .arg("--json") - .output() - .unwrap(); - assert!(!denied.status.success(), "bruno/main load must be denied"); + let denied = output_failure( + cli() + .arg("--as") + .arg("act-bruno") + .arg("load") + .arg("--config") + .arg(&config) + .arg("--data") + .arg(&data) + .arg("--json"), + ); let stderr = String::from_utf8_lossy(&denied.stderr); assert!( stderr.contains("denied"), @@ -1329,15 +1177,11 @@ fn local_cli_load_enforces_engine_layer_policy() { // act-ragnor: admins-write rule permits change anywhere. 
let allowed = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "ragnor-tok") + .arg("--as") + .arg("act-ragnor") .arg("load") - .arg("--mode") - .arg("overwrite") - .arg("--yes") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") + .arg("--config") + .arg(&config) .arg("--data") .arg(&data) .arg("--json"), @@ -1348,55 +1192,47 @@ fn local_cli_load_enforces_engine_layer_policy() { #[test] fn local_cli_ingest_enforces_engine_layer_policy() { - // RFC-011 served re-point: ingest into a new branch requires both - // BranchCreate and Change. Bruno has change-unprotected only (no - // branch-ops) — either gate denies. Ragnor has admins-write + - // admins-branch-ops — both fire as ingest creates the branch + loads. - if skip_system_e2e("local_cli_ingest_enforces_engine_layer_policy") { - return; - } - let cluster = converged_loaded_cluster("knowledge", Some(POLICY_E2E_YAML)); - let server = spawn_server_with_cluster_env( - cluster.path(), - &[("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", POLICY_TOKENS_JSON)], - ); - let temp = tempfile::tempdir().unwrap(); - let data = temp.path().join("policy-ingest.jsonl"); - fs::write( - &data, + let graph = SystemGraph::loaded(); + let config = graph.write_config("omnigraph-policy.yaml", &local_policy_config(&graph)); + graph.write_config("policy.yaml", POLICY_E2E_YAML); + let data = graph.write_jsonl( + "system-local-policy-ingest.jsonl", r#"{"type":"Person","data":{"name":"IngestPolicy","age":12}}"#, - ) - .unwrap(); + ); - let denied = cli() - .env("OMNIGRAPH_BEARER_TOKEN", "bruno-tok") - .arg("ingest") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") - .arg("--data") - .arg(&data) - .arg("--branch") - .arg("policy-ingest-feature") - .arg("--json") - .output() - .unwrap(); - assert!(!denied.status.success(), "bruno ingest must be denied"); + // act-bruno: ingest into a new branch requires both BranchCreate and + // Change.
Bruno has change-unprotected only, and the implicit + // branch_create fires first when the target branch doesn't exist. + // Either gate is enough to deny — assert denial without pinning + // which one fires first. + let denied = output_failure( + cli() + .arg("--as") + .arg("act-bruno") + .arg("ingest") + .arg("--config") + .arg(&config) + .arg("--data") + .arg(&data) + .arg("--branch") + .arg("policy-ingest-feature") + .arg("--json"), + ); let stderr = String::from_utf8_lossy(&denied.stderr); assert!( stderr.contains("denied"), "expected 'denied' for bruno ingest, got: {stderr}" ); + // act-ragnor: admins-write covers Change, admins-branch-ops covers + // BranchCreate. Both fire as ingest creates the branch + loads. let allowed = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "ragnor-tok") + .arg("--as") + .arg("act-ragnor") .arg("ingest") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") + .arg("--config") + .arg(&config) .arg("--data") .arg(&data) .arg("--branch") @@ -1408,32 +1244,73 @@ fn local_cli_ingest_enforces_engine_layer_policy() { } #[test] -fn local_cli_branch_create_enforces_engine_layer_policy() { - // RFC-011 served re-point: bruno has no branch-ops rule → denied; - // ragnor has admins-branch-ops → allowed. - if skip_system_e2e("local_cli_branch_create_enforces_engine_layer_policy") { - return; - } - let cluster = converged_loaded_cluster("knowledge", Some(POLICY_E2E_YAML)); - let server = spawn_server_with_cluster_env( - cluster.path(), - &[("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", POLICY_TOKENS_JSON)], +fn local_cli_schema_apply_enforces_engine_layer_policy() { + let graph = SystemGraph::loaded(); + let config = graph.write_config("omnigraph-policy.yaml", &local_policy_config(&graph)); + graph.write_config("policy.yaml", POLICY_E2E_YAML); + + // Additive: add a nullable property; SDK-compatible with the fixture + // schema. Uses the schema-apply scope (TargetBranch("main")).
+ let new_schema = std::fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace( + " age: I32?\n}", + " age: I32?\n nickname: String?\n}", + ); + let schema_path = graph.path().join("policy-additive.pg"); + std::fs::write(&schema_path, &new_schema).unwrap(); + + let denied = output_failure( + cli() + .arg("--as") + .arg("act-bruno") + .arg("schema") + .arg("apply") + .arg("--config") + .arg(&config) + .arg("--schema") + .arg(&schema_path) + .arg("--json"), + ); + let stderr = String::from_utf8_lossy(&denied.stderr); + assert!( + stderr.contains("denied"), + "expected 'denied' for bruno schema apply, got: {stderr}" ); - let denied = cli() - .env("OMNIGRAPH_BEARER_TOKEN", "bruno-tok") - .arg("branch") - .arg("create") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") - .arg("--from") - .arg("main") - .arg("bruno-feature") - .output() - .unwrap(); - assert!(!denied.status.success(), "bruno branch create must be denied"); + let allowed = parse_stdout_json(&output_success( + cli() + .arg("--as") + .arg("act-ragnor") + .arg("schema") + .arg("apply") + .arg("--config") + .arg(&config) + .arg("--schema") + .arg(&schema_path) + .arg("--json"), + )); + assert_eq!(allowed["applied"], true); +} + +#[test] +fn local_cli_branch_create_enforces_engine_layer_policy() { + let graph = SystemGraph::loaded(); + let config = graph.write_config("omnigraph-policy.yaml", &local_policy_config(&graph)); + graph.write_config("policy.yaml", POLICY_E2E_YAML); + + let denied = output_failure( + cli() + .arg("--as") + .arg("act-bruno") + .arg("branch") + .arg("create") + .arg("--config") + .arg(&config) + .arg("--from") + .arg("main") + .arg("bruno-feature"), + ); let stderr = String::from_utf8_lossy(&denied.stderr); assert!( stderr.contains("denied"), @@ -1442,13 +1319,12 @@ fn local_cli_branch_create_enforces_engine_layer_policy() { output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "ragnor-tok") + .arg("--as") + .arg("act-ragnor") .arg("branch") 
.arg("create") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") + .arg("--config") + .arg(&config) .arg("--from") .arg("main") .arg("ragnor-feature"), @@ -1457,47 +1333,34 @@ fn local_cli_branch_create_enforces_engine_layer_policy() { #[test] fn local_cli_branch_delete_enforces_engine_layer_policy() { - // RFC-011 served re-point: bruno has no branch-ops rule → denied; - // ragnor has admins-branch-ops → allowed. - if skip_system_e2e("local_cli_branch_delete_enforces_engine_layer_policy") { - return; - } - let cluster = converged_loaded_cluster("knowledge", Some(POLICY_E2E_YAML)); - let server = spawn_server_with_cluster_env( - cluster.path(), - &[("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", POLICY_TOKENS_JSON)], - ); + let graph = SystemGraph::loaded(); + let config = graph.write_config("omnigraph-policy.yaml", &local_policy_config(&graph)); + graph.write_config("policy.yaml", POLICY_E2E_YAML); // Pre-create the branch as ragnor so there's something to delete. output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "ragnor-tok") + .arg("--as") + .arg("act-ragnor") .arg("branch") .arg("create") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") + .arg("--config") + .arg(&config) .arg("--from") .arg("main") .arg("doomed"), ); - // `--yes` clears the RFC-011 Decision 9 destructive-write confirmation so - // the policy check (not the confirmation refusal) is what denies.
- let denied = cli() - .env("OMNIGRAPH_BEARER_TOKEN", "bruno-tok") - .arg("branch") - .arg("delete") - .arg("--yes") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") - .arg("doomed") - .output() - .unwrap(); - assert!(!denied.status.success(), "bruno branch delete must be denied"); + let denied = output_failure( + cli() + .arg("--as") + .arg("act-bruno") + .arg("branch") + .arg("delete") + .arg("--config") + .arg(&config) + .arg("doomed"), + ); let stderr = String::from_utf8_lossy(&denied.stderr); assert!( stderr.contains("denied"), @@ -1506,61 +1369,48 @@ fn local_cli_branch_delete_enforces_engine_layer_policy() { output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "ragnor-tok") + .arg("--as") + .arg("act-ragnor") .arg("branch") .arg("delete") - .arg("--yes") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") + .arg("--config") + .arg(&config) .arg("doomed"), ); } #[test] fn local_cli_branch_merge_enforces_engine_layer_policy() { - // RFC-011 served re-point: merging into protected main needs - // branch_merge with target_branch_scope protected. bruno has no such - // rule → denied; ragnor has admins-promote → allowed. - if skip_system_e2e("local_cli_branch_merge_enforces_engine_layer_policy") { - return; - } - let cluster = converged_loaded_cluster("knowledge", Some(POLICY_E2E_YAML)); - let server = spawn_server_with_cluster_env( - cluster.path(), - &[("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", POLICY_TOKENS_JSON)], - ); + let graph = SystemGraph::loaded(); + let config = graph.write_config("omnigraph-policy.yaml", &local_policy_config(&graph)); + graph.write_config("policy.yaml", POLICY_E2E_YAML); // Pre-create a feature branch as ragnor (admins-branch-ops covers it).
output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "ragnor-tok") + .arg("--as") + .arg("act-ragnor") .arg("branch") .arg("create") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") + .arg("--config") + .arg(&config) .arg("--from") .arg("main") .arg("merge-feature"), ); - let denied = cli() - .env("OMNIGRAPH_BEARER_TOKEN", "bruno-tok") - .arg("branch") - .arg("merge") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") - .arg("merge-feature") - .arg("--into") - .arg("main") - .output() - .unwrap(); - assert!(!denied.status.success(), "bruno branch merge must be denied"); + let denied = output_failure( + cli() + .arg("--as") + .arg("act-bruno") + .arg("branch") + .arg("merge") + .arg("--config") + .arg(&config) + .arg("merge-feature") + .arg("--into") + .arg("main"), + ); let stderr = String::from_utf8_lossy(&denied.stderr); assert!( stderr.contains("denied"), @@ -1569,56 +1419,68 @@ fn local_cli_branch_merge_enforces_engine_layer_policy() { output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "ragnor-tok") + .arg("--as") + .arg("act-ragnor") .arg("branch") .arg("merge") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("knowledge") + .arg("--config") + .arg(&config) .arg("merge-feature") .arg("--into") .arg("main"), ); } -// ─── RFC-011: operator.actor cascade ────────────────────────────────────── +// ─── MR-722 PR A: cli.actor config-only precedence ──────────────────────── // -// The CLI actor chain is `--as` > `operator.actor` (in the operator config -// at $OMNIGRAPH_HOME/config.yaml) > none. These two tests pin that order on -// a direct (`--store`) write. RFC-011 makes direct-store writes unpoliced, -// so the assertion is on which `actor_id` the write records, not on a Cedar -// allow/deny — the actor still has to be resolved correctly and stamped onto -// the commit. +// The change-writer test above uses `--as` directly.
These two tests +// pin the precedence rule that `main.rs::resolve_cli_actor` implements: +// `--as` flag > `cli.actor` from `omnigraph.yaml` > None. -/// An operator config (`$OMNIGRAPH_HOME/config.yaml`) carrying just -/// `operator.actor`. Pointing OMNIGRAPH_HOME at the holding dir makes the -/// CLI read it as the operator layer. -fn operator_home_with_actor(actor: &str) -> tempfile::TempDir { - let home = tempfile::tempdir().unwrap(); - fs::write( - home.path().join("config.yaml"), - format!("operator:\n actor: {actor}\n"), +fn local_policy_config_with_actor(graph: &SystemGraph, actor: &str) -> String { + // Mirrors `local_policy_config` but adds `cli.actor` so the + // config-only precedence path is exercised. The `cli:` block + // already has `graph` and `branch`; appending `actor` here. + format!( + "\ +project: + name: policy-e2e-local +graphs: + local: + uri: {} +cli: + graph: local + branch: main + actor: {} +query: + roots: + - . +policy: + file: ./policy.yaml +", + yaml_string(&graph.path().to_string_lossy()), + actor, ) - .unwrap(); - home } #[test] fn local_cli_actor_from_config_used_when_no_flag() { - // operator.actor: act-ragnor in the operator config, no --as flag → - // the write records act-ragnor. Proves the operator-layer actor source - // is consulted when `--as` is absent. + // cli.actor: act-ragnor in omnigraph.yaml, no --as flag → change + // permitted via admins-write rule. Proves the config-only path + // works; previously the only proof was structural.
let graph = SystemGraph::loaded(); - let home = operator_home_with_actor("act-ragnor"); + let config = graph.write_config( + "omnigraph-policy.yaml", + &local_policy_config_with_actor(&graph, "act-ragnor"), + ); + graph.write_config("policy.yaml", POLICY_E2E_YAML); let mutation_file = insert_person_query(&graph, "system-local-cli-actor.gq"); let allowed = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_HOME", home.path()) .arg("change") - .arg("--store") - .arg(graph.path()) + .arg("--config") + .arg(&config) .arg("--query") .arg(&mutation_file) .arg("--params") @@ -1631,921 +1493,33 @@ fn local_cli_actor_from_config_used_when_no_flag() { #[test] fn local_cli_actor_flag_overrides_config_actor() { - // operator.actor: act-ragnor in the config + --as act-bruno on the CLI → - // the write records act-bruno. The flag wins per the precedence rule. - // Without this test, a future change that reverses precedence would ride - // through silently. + // cli.actor: act-ragnor in config + --as act-bruno on CLI → change + // denied. Flag wins per the precedence rule. Without this test, a + // future change that reverses precedence would ride through silently.
let graph = SystemGraph::loaded(); - let home = operator_home_with_actor("act-ragnor"); + let config = graph.write_config( + "omnigraph-policy.yaml", + &local_policy_config_with_actor(&graph, "act-ragnor"), + ); + graph.write_config("policy.yaml", POLICY_E2E_YAML); let mutation_file = insert_person_query(&graph, "system-local-cli-actor-override.gq"); - let overridden = parse_stdout_json(&output_success( + let denied = output_failure( cli() - .env("OMNIGRAPH_HOME", home.path()) .arg("--as") .arg("act-bruno") .arg("change") - .arg("--store") - .arg(graph.path()) + .arg("--config") + .arg(&config) .arg("--query") .arg(&mutation_file) .arg("--params") .arg(r#"{"name":"OverrideEve","age":19}"#) .arg("--json"), - )); - assert_eq!(overridden["affected_nodes"], 1); - assert_eq!(overridden["actor_id"], "act-bruno"); -} - -/// Phase 5 (RFC-005): "applied means serving" β€” converge a cluster with the -/// CLI, boot the real omnigraph-server binary with --cluster, and serve the -/// applied stored query over HTTP with zero omnigraph.yaml involvement. -#[test] -fn local_cluster_apply_then_server_boots_from_cluster_state() { - let temp = tempfile::tempdir().unwrap(); - std::fs::write( - temp.path().join("people.pg"), - "\nnode Person {\n name: String @key\n}\n", - ) - .unwrap(); - std::fs::write( - temp.path().join("people.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n", - ) - .unwrap(); - std::fs::write( - temp.path().join("cluster.yaml"), - r#" -version: 1 -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -"#, - ) - .unwrap(); - for command in ["import", "apply"] { - let output = cli() - .arg("cluster") - .arg(command) - .arg("--config") - .arg(temp.path()) - .arg("--json") - .output() - .unwrap(); - assert!(output.status.success(), "cluster {command} failed"); - } - // Seed a row through the graph plane so the stored query has data. 
- let data = temp.path().join("seed.jsonl"); - std::fs::write(&data, "{\"type\":\"Person\",\"data\":{\"name\":\"Ada\"}}\n").unwrap(); - let output = cli() - .arg("load") - .arg("--mode") - .arg("overwrite") - .arg("--data") - .arg(&data) - .arg(temp.path().join("graphs/knowledge.omni")) - .output() - .unwrap(); - assert!(output.status.success(), "graph load failed"); - - let server = spawn_server_with_cluster(temp.path()); - let client = reqwest::blocking::Client::new(); - let queries: serde_json::Value = client - .get(format!("{}/graphs/knowledge/queries", server.base_url)) - .send() - .unwrap() - .json() - .unwrap(); - assert!( - queries["queries"] - .as_array() - .unwrap() - .iter() - .any(|q| q["name"] == "find_person"), - "{queries}" ); - let response = client - .post(format!( - "{}/graphs/knowledge/queries/find_person", - server.base_url - )) - .json(&serde_json::json!({"params": {"name": "Ada"}})) - .send() - .unwrap(); - assert!(response.status().is_success(), "{:?}", response.status()); - let body: serde_json::Value = response.json().unwrap(); - assert!(body.to_string().contains("Ada"), "{body}"); -} - -// ---- Comprehensive full-cycle cluster e2e (Phases 1-5 composed) ---- - -/// Run a `cluster` subcommand and return its JSON output. Deliberately does -/// NOT assert a zero exit code: blocked/unconverged runs (e.g. an `apply` -/// awaiting an approval) exit non-zero by contract while still emitting the -/// structured output the caller asserts on (`ok`/`converged`/dispositions). -/// Commands where failure is never expected must assert on those fields -/// (every call here checks `ok` or `converged`) or use `cli()` directly with -/// `status.success()`. 
-fn cluster_cli(dir: &std::path::Path, args: &[&str]) -> serde_json::Value { - let mut command = cli(); - command.arg("cluster"); - for arg in args { - command.arg(arg); - } - let output = command - .arg("--config") - .arg(dir) - .arg("--json") - .output() - .unwrap(); - let stdout = String::from_utf8_lossy(&output.stdout); - serde_json::from_str(stdout.trim()).unwrap_or_else(|err| { - panic!( - "cluster {args:?} produced unparseable output ({err}): stdout={stdout} stderr={}", - String::from_utf8_lossy(&output.stderr) - ) - }) -} - -fn write_two_graph_cluster(dir: &std::path::Path) { - std::fs::write( - dir.join("people.pg"), - "\nnode Person {\n name: String @key\n}\n", - ) - .unwrap(); - std::fs::write( - dir.join("services.pg"), - "\nnode Service {\n name: String @key\n}\n", - ) - .unwrap(); - std::fs::write( - dir.join("people.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n", - ) - .unwrap(); - std::fs::write( - dir.join("services.gq"), - "\nquery find_service($name: String) {\n match { $s: Service { name: $name } }\n return { $s.name }\n}\n", - ) - .unwrap(); - std::fs::write( - dir.join("cluster.yaml"), - r#" -version: 1 -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq - engineering: - schema: ./services.pg - queries: - find_service: - file: ./services.gq -"#, - ) - .unwrap(); -} - -fn seed_graph(dir: &std::path::Path, graph: &str, row: &str) { - let data = dir.join(format!("{graph}-seed.jsonl")); - std::fs::write(&data, row).unwrap(); - let output = cli() - .arg("load") - .arg("--mode") - .arg("overwrite") - .arg("--data") - .arg(&data) - .arg(dir.join(format!("graphs/{graph}.omni"))) - .output() - .unwrap(); + let stderr = String::from_utf8_lossy(&denied.stderr); assert!( - output.status.success(), - "seed {graph} failed: {}", - String::from_utf8_lossy(&output.stderr) - ); -} - -fn invoke_query( - client: &Client, - base_url: &str, - graph: &str, - query: 
&str, - params: serde_json::Value, -) -> (u16, serde_json::Value) { - let response = client - .post(format!("{base_url}/graphs/{graph}/queries/{query}")) - .json(&serde_json::json!({ "params": params })) - .send() - .unwrap(); - let status = response.status().as_u16(); - let body = response.json().unwrap_or(serde_json::Value::Null); - (status, body) -} - -/// Opt-out for the comprehensive system e2es below. They need no external -/// services β€” only the workspace-built `omnigraph`/`omnigraph-server` -/// binaries (cargo provides them via `CARGO_BIN_EXE_*`), ephemeral localhost -/// ports, and local-FS temp dirs β€” but they spawn real server processes and -/// run multi-stage lifecycles, so constrained sandboxes can suppress them: -/// `OMNIGRAPH_SKIP_SYSTEM_E2E=1 cargo test ...` (same skip-with-message -/// pattern as the S3 tests' `OMNIGRAPH_S3_TEST_BUCKET` gate). -fn skip_system_e2e(test_name: &str) -> bool { - if std::env::var("OMNIGRAPH_SKIP_SYSTEM_E2E").is_ok_and(|v| !v.is_empty() && v != "0") { - eprintln!("skipping {test_name}: OMNIGRAPH_SKIP_SYSTEM_E2E is set"); - return true; - } - false -} - -/// The whole control-plane story in one test: declare two graphs β†’ converge -/// (apply creates them) β†’ serve β†’ evolve schema+query in one apply β†’ restart -/// serves the new shape β†’ out-of-band drift converged back β†’ approved graph -/// delete β†’ restart serves the survivor only β†’ plan empty. -#[test] -fn local_cluster_full_lifecycle_declare_serve_evolve_delete() { - if skip_system_e2e("local_cluster_full_lifecycle_declare_serve_evolve_delete") { - return; - } - let temp = tempfile::tempdir().unwrap(); - let dir = temp.path(); - write_two_graph_cluster(dir); - - // Phase 1-2: declare + record. - assert_eq!(cluster_cli(dir, &["import"])["ok"], true); - // Phase 3-4: one apply creates both graphs and publishes the catalog. 
- let converge = cluster_cli(dir, &["apply"]); - assert_eq!(converge["converged"], true, "{converge}"); - seed_graph(dir, "knowledge", "{\"type\":\"Person\",\"data\":{\"name\":\"Ada\"}}\n"); - seed_graph(dir, "engineering", "{\"type\":\"Service\",\"data\":{\"name\":\"billing\"}}\n"); - - // Phase 5: serve the applied revision. - let client = Client::new(); - { - let server = spawn_server_with_cluster(dir); - let (status, body) = invoke_query( - &client, - &server.base_url, - "knowledge", - "find_person", - serde_json::json!({"name": "Ada"}), - ); - assert_eq!(status, 200, "{body}"); - assert_eq!(body["rows"][0]["p.name"], "Ada", "{body}"); - let (status, body) = invoke_query( - &client, - &server.base_url, - "engineering", - "find_service", - serde_json::json!({"name": "billing"}), - ); - assert_eq!(status, 200, "{body}"); - assert_eq!(body["rows"][0]["s.name"], "billing", "{body}"); - } - - // Evolve: schema gains a field, the query returns it β€” one apply, with - // the migration previewed in plan. - std::fs::write( - dir.join("people.pg"), - "\nnode Person {\n name: String @key\n bio: String?\n}\n", - ) - .unwrap(); - std::fs::write( - dir.join("people.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name, $p.bio }\n}\n", - ) - .unwrap(); - let plan = cluster_cli(dir, &["plan"]); - let schema_change = plan["changes"] - .as_array() - .unwrap() - .iter() - .find(|change| change["resource"] == "schema.knowledge") - .unwrap(); - assert_eq!(schema_change["migration"]["supported"], true, "{plan}"); - let evolve = cluster_cli(dir, &["apply"]); - assert_eq!(evolve["converged"], true, "{evolve}"); - - // Restart: the server serves the evolved shape. 
- { - let server = spawn_server_with_cluster(dir); - let (status, body) = invoke_query( - &client, - &server.base_url, - "knowledge", - "find_person", - serde_json::json!({"name": "Ada"}), - ); - assert_eq!(status, 200, "{body}"); - assert!( - body["columns"] - .as_array() - .unwrap() - .iter() - .any(|column| column == "p.bio"), - "evolved query must project the new field: {body}" - ); - } - - // Out-of-band drift: the live graph evolves behind the cluster's back; - // refresh observes it, apply converges it back to the declared schema. RFC-011 - // D10 makes the CLI `schema apply` refuse a cluster-managed graph, so a true - // bypass is a direct engine apply against the storage root. - let rogue_pg = "\nnode Person {\n name: String @key\n bio: String?\n rogue: String?\n}\n"; - tokio::runtime::Runtime::new().unwrap().block_on(async { - let db = Omnigraph::open(dir.join("graphs/knowledge.omni").to_string_lossy().as_ref()) - .await - .unwrap(); - db.apply_schema(rogue_pg).await.unwrap(); - }); - let refresh = cluster_cli(dir, &["refresh"]); - assert_eq!( - refresh["resource_statuses"]["schema.knowledge"]["status"], - "drifted", - "{refresh}" - ); - let heal = cluster_cli(dir, &["apply"]); - assert_eq!(heal["converged"], true, "{heal}"); - let schema_show = cli() - .arg("schema") - .arg("show") - .arg(dir.join("graphs/knowledge.omni")) - .output() - .unwrap(); - assert!( - schema_show.status.success(), - "schema show failed: {}", - String::from_utf8_lossy(&schema_show.stderr) - ); - let shown = String::from_utf8_lossy(&schema_show.stdout); - assert!(shown.contains("Person"), "schema show produced no schema: {shown}"); - assert!( - !shown.contains("rogue"), - "drift must be soft-dropped back to the declared schema: {shown}" - ); - - // Retire engineering: gated delete, then the server serves the survivor. 
- std::fs::write( - dir.join("cluster.yaml"), - r#" -version: 1 -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -"#, - ) - .unwrap(); - let blocked = cluster_cli(dir, &["apply"]); - assert_eq!(blocked["converged"], false, "{blocked}"); - let approve_output = cli() - .arg("--as") - .arg("andrew") - .arg("cluster") - .arg("approve") - .arg("graph.engineering") - .arg("--config") - .arg(dir) - .arg("--json") - .output() - .unwrap(); - assert!(approve_output.status.success(), "approve failed"); - let delete = cluster_cli(dir, &["apply"]); - assert_eq!(delete["converged"], true, "{delete}"); - assert!(!dir.join("graphs/engineering.omni").exists()); - - { - let server = spawn_server_with_cluster(dir); - let (status, body) = invoke_query( - &client, - &server.base_url, - "knowledge", - "find_person", - serde_json::json!({"name": "Ada"}), - ); - assert_eq!(status, 200, "{body}"); - let response = client - .post(format!( - "{}/graphs/engineering/queries/find_service", - server.base_url - )) - .json(&serde_json::json!({"params": {"name": "billing"}})) - .send() - .unwrap(); - assert_eq!( - response.status().as_u16(), - 404, - "a deleted graph must vanish from the serving surface" - ); - } - - // The story ends converged: nothing left to do. - let final_plan = cluster_cli(dir, &["plan"]); - assert!( - final_plan["changes"].as_array().unwrap().is_empty(), - "{final_plan}" - ); -} - -/// Applied policy bundles gate serving per their bindings: the cluster-bound -/// bundle owns the management surface (graph_list), the graph-bound bundle -/// owns query invocation β€” enforced over HTTP with bearer-resolved actors. 
-#[test] -fn local_cluster_serving_enforces_applied_policy_bindings() { - if skip_system_e2e("local_cluster_serving_enforces_applied_policy_bindings") { - return; - } - let temp = tempfile::tempdir().unwrap(); - let dir = temp.path(); - std::fs::write( - dir.join("people.pg"), - "\nnode Person {\n name: String @key\n}\n", - ) - .unwrap(); - std::fs::write( - dir.join("people.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n", - ) - .unwrap(); - std::fs::write( - dir.join("graph.policy.yaml"), - r#" -version: 1 -groups: - readers: ["act-reader"] -protected_branches: [main] -rules: - - id: allow-invoke - allow: - actors: { group: readers } - actions: [invoke_query] - - id: allow-read - allow: - actors: { group: readers } - actions: [read] - branch_scope: any -"#, - ) - .unwrap(); - std::fs::write( - dir.join("server.policy.yaml"), - r#" -version: 1 -kind: server -groups: - admins: ["act-admin"] -rules: - - id: allow-list - allow: - actors: { group: admins } - actions: [graph_list] -"#, - ) - .unwrap(); - std::fs::write( - dir.join("cluster.yaml"), - r#" -version: 1 -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -policies: - graph_rules: - file: ./graph.policy.yaml - applies_to: [knowledge] - server_rules: - file: ./server.policy.yaml - applies_to: [cluster] -"#, - ) - .unwrap(); - assert_eq!(cluster_cli(dir, &["import"])["ok"], true); - let converge = cluster_cli(dir, &["apply"]); - assert_eq!(converge["converged"], true, "{converge}"); - seed_graph(dir, "knowledge", "{\"type\":\"Person\",\"data\":{\"name\":\"Ada\"}}\n"); - - let server = spawn_server_with_cluster_env( - dir, - &[( - "OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", - r#"{"act-admin":"admin-token","act-reader":"reader-token"}"#, - )], - ); - let client = Client::new(); - let get_graphs = |token: Option<&str>| { - let mut request = client.get(format!("{}/graphs", server.base_url)); - if let Some(token) = token 
{ - request = request.bearer_auth(token); - } - request.send().unwrap().status().as_u16() - }; - // Management surface: cluster-bound bundle, admins only. - assert_eq!(get_graphs(Some("admin-token")), 200); - assert_eq!(get_graphs(Some("reader-token")), 403); - assert_eq!(get_graphs(None), 401); - - // Query invocation: graph-bound bundle, readers only. - let invoke = |token: &str| { - client - .post(format!( - "{}/graphs/knowledge/queries/find_person", - server.base_url - )) - .bearer_auth(token) - .json(&serde_json::json!({"params": {"name": "Ada"}})) - .send() - .unwrap() - }; - let response = invoke("reader-token"); - assert_eq!(response.status().as_u16(), 200); - let body: serde_json::Value = response.json().unwrap(); - assert_eq!(body["rows"][0]["p.name"], "Ada", "{body}"); - // Denied invocation is deliberately 404, indistinguishable from an - // unknown query — the server's anti-probing contract. - assert_eq!(invoke("admin-token").status().as_u16(), 404); -} - -/// Rule 0 (axiom 15): a --cluster server never reads omnigraph.yaml — not -/// even the implicit cwd search. A MALFORMED config in the process cwd must -/// not affect boot or serving.
-#[test] -fn cluster_server_boot_ignores_local_config_in_cwd() { - let cluster = tempfile::tempdir().unwrap(); - std::fs::write( - cluster.path().join("people.pg"), - "\nnode Person {\n name: String @key\n}\n", - ) - .unwrap(); - std::fs::write( - cluster.path().join("cluster.yaml"), - "version: 1\ngraphs:\n knowledge:\n schema: ./people.pg\n", - ) - .unwrap(); - for command in ["import", "apply"] { - let output = cli() - .arg("cluster") - .arg(command) - .arg("--config") - .arg(cluster.path()) - .output() - .unwrap(); - assert!(output.status.success(), "cluster {command} failed"); - } - let cwd = tempfile::tempdir().unwrap(); - std::fs::write(cwd.path().join("omnigraph.yaml"), "{{{{ not yaml").unwrap(); - - let server = spawn_server_with_cluster_in(cluster.path(), cwd.path()); - let response = reqwest::blocking::get(format!("{}/healthz", server.base_url)).unwrap(); - assert!(response.status().is_success()); -} - -/// RFC-007 PR 2: keyed credentials end to end — `login` stores a 0600 -/// credential, the URL-matched server's token chain authenticates remote -/// reads (env > file), a non-matching URL never sees the token (§D5 rule -/// 3), and `logout` revokes. -#[test] -fn local_cli_keyed_credentials_authenticate_url_matched_server() { - // RFC-011 cluster-only: the server boots from a converged cluster - // serving the fixture graph under id `local`; tokens-only boot is - // default-deny, which still permits `read`.
- let cluster = converged_loaded_cluster("local", None); - let server = spawn_server_with_cluster_env( - cluster.path(), - &[("OMNIGRAPH_SERVER_BEARER_TOKEN", "secret-tok")], - ); - let operator_home = tempfile::tempdir().unwrap(); - let write_server_url = |url: &str| { - fs::write( - operator_home.path().join("config.yaml"), - format!("servers:\n test-srv:\n url: {url}\n"), - ) - .unwrap(); - }; - write_server_url(&server.base_url); - - let remote_read = |envs: &[(&str, &str)]| { - let mut command = cli(); - command.env("OMNIGRAPH_HOME", operator_home.path()); - for (name, value) in envs { - command.env(name, value); - } - command - .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg("local") - .arg("--query") - .arg(fixture("test.gq")) - .arg("get_person") - .arg("--params") - .arg(r#"{"name":"Alice"}"#) - .arg("--json") - .output() - .unwrap() - }; - - // No credential anywhere: the server refuses. - let output = remote_read(&[]); - assert!(!output.status.success(), "{output:?}"); - - // login with a WRONG token (via stdin, the documented pipe flow). - let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("login") - .arg("test-srv") - .write_stdin("wrong-tok\n") - .output() - .unwrap(); - assert!(output.status.success(), "{output:?}"); - let output = remote_read(&[]); - assert!(!output.status.success(), "wrong token must not authenticate"); - - // Re-login rotates to the right token (via --token); 0600 on disk. 
- let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("login") - .arg("test-srv") - .arg("--token") - .arg("secret-tok") - .output() - .unwrap(); - assert!(output.status.success(), "{output:?}"); - let credentials = operator_home.path().join("credentials"); - let text = fs::read_to_string(&credentials).unwrap(); - assert!(text.contains("[test-srv]"), "{text}"); - #[cfg(unix)] - { - use std::os::unix::fs::PermissionsExt; - let mode = fs::metadata(&credentials).unwrap().permissions().mode(); - assert_eq!(mode & 0o777, 0o600, "{:o}", mode & 0o777); - } - let output = remote_read(&[]); - assert!( - output.status.success(), - "keyed credential must authenticate the URL-matched server: {output:?}" - ); - let payload: serde_json::Value = - serde_json::from_slice(&output.stdout).unwrap(); - assert_eq!(payload["rows"][0]["p.name"], "Alice"); - - // OMNIGRAPH_TOKEN_ env outranks the credentials file. - let output = remote_read(&[("OMNIGRAPH_TOKEN_TEST_SRV", "env-wrong")]); - assert!( - !output.status.success(), - "keyed env token must outrank the credentials file" - ); - - // §D5 rule 3: a URL matching no operator server never sees the token. - write_server_url("http://127.0.0.1:1"); - let output = remote_read(&[]); - assert!( - !output.status.success(), - "token keyed to another url must not be sent here" - ); - write_server_url(&server.base_url); - - // logout revokes; idempotent. - for _ in 0..2 { - let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("logout") - .arg("test-srv") - .output() - .unwrap(); - assert!(output.status.success(), "{output:?}"); - } - let output = remote_read(&[]); - assert!(!output.status.success(), "logout must revoke access"); -} - -/// RFC-007 PR 3: --server targeting and operator aliases (pure bindings to -/// stored queries) end to end, with the keyed credential from PR 2.
-#[test] -fn local_cli_operator_alias_and_server_flag_invoke_stored_query() { - // RFC-011 cluster-only: build a converged cluster serving graph `local` - // with a stored query `find_person` and a per-graph policy granting the - // operator invoke_query + read (invoke_query is policy-gated — anti-probing - // 404 without the grant). - let cluster = tempfile::tempdir().unwrap(); - fs::copy(fixture("test.pg"), cluster.path().join("local.pg")).unwrap(); - fs::write( - cluster.path().join("find-person.gq"), - "query find_person($name: String) { match { $p: Person { name: $name } } return { $p.name } }", - ) - .unwrap(); - fs::write( - cluster.path().join("insert-person.gq"), - "query insert_person($name: String) { insert Person { name: $name, age: 41 } }", - ) - .unwrap(); - fs::write( - cluster.path().join("graph.policy.yaml"), - "version: 1\ngroups:\n ops: [\"act-op\"]\nprotected_branches: [main]\nrules:\n - id: allow-invoke\n allow:\n actors: { group: ops }\n actions: [invoke_query]\n - id: allow-read\n allow:\n actors: { group: ops }\n actions: [read]\n branch_scope: any\n - id: allow-change\n allow:\n actors: { group: ops }\n actions: [change]\n branch_scope: any\n", - ) - .unwrap(); - fs::write( - cluster.path().join("cluster.yaml"), - "version: 1\nmetadata:\n name: alias-sys\nstate:\n backend: cluster\n lock: true\ngraphs:\n local:\n schema: ./local.pg\n queries:\n find_person:\n file: ./find-person.gq\n insert_person:\n file: ./insert-person.gq\npolicies:\n graph:\n file: ./graph.policy.yaml\n applies_to: [local]\n", - ) - .unwrap(); - output_success(cli().arg("cluster").arg("import").arg("--config").arg(cluster.path())); - output_success(cli().arg("cluster").arg("apply").arg("--config").arg(cluster.path())); - output_success( - cli() - .arg("load") - .arg("--data") - .arg(fixture("test.jsonl")) - .arg("--mode") - .arg("overwrite") - .arg(cluster.path().join("graphs").join("local.omni")), - ); - let server = spawn_server_with_cluster_env( - cluster.path(), -
&[( - "OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", - r#"{"act-op":"srv-tok"}"#, - )], - ); - - let operator_home = tempfile::tempdir().unwrap(); - fs::write( - operator_home.path().join("config.yaml"), - format!( - "servers:\n dev:\n url: {}\naliases:\n who:\n server: dev\n graph: local\n query: find_person\n args: [name]\n create_person:\n server: dev\n graph: local\n query: insert_person\n args: [name]\n", - server.base_url - ), - ) - .unwrap(); - fs::write( - operator_home.path().join("credentials"), - "[dev]\ntoken = srv-tok\n", - ) - .unwrap(); - #[cfg(unix)] - { - use std::os::unix::fs::PermissionsExt; - fs::set_permissions( - operator_home.path().join("credentials"), - fs::Permissions::from_mode(0o600), - ) - .unwrap(); - } - - // The operator alias (RFC-011 D4): `alias [args]` — server, - // graph, stored query, and token all resolve from the operator layer. - let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("alias") - .arg("who") - .arg("Alice") - .arg("--json") - .output() - .unwrap(); - assert!( - output.status.success(), - "operator alias must invoke the stored query: {output:?}" - ); - let payload: serde_json::Value = serde_json::from_slice(&output.stdout).unwrap(); - assert_eq!(payload["rows"][0]["p.name"], "Alice", "{payload}"); - - // Operator aliases are read-only conveniences: a binding to a stored - // mutation must be rejected before the server executes it.
- let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("alias") - .arg("create_person") - .arg("AliasGuardPerson") - .output() - .unwrap(); - assert!(!output.status.success(), "mutation alias must fail"); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("'insert_person' is a mutation") - && stderr.contains("omnigraph mutate insert_person"), - "expected mutation-kind mismatch; got: {stderr}" - ); - let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("query") - .arg("find_person") - .arg("--server") - .arg("dev") - .arg("--graph") - .arg("local") - .arg("--params") - .arg(r#"{"name":"AliasGuardPerson"}"#) - .arg("--json") - .output() - .unwrap(); - assert!( - output.status.success(), - "post-alias read should succeed: {output:?}" - ); - let payload: serde_json::Value = serde_json::from_slice(&output.stdout).unwrap(); - assert_eq!( - payload["rows"].as_array().unwrap().len(), - 0, - "mutation alias must not insert AliasGuardPerson: {payload}" - ); - - // --server/--graph: the same stored query via explicit targeting. - let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("query") - .arg("--server") - .arg("dev") - .arg("--graph") - .arg("local") - .arg("--query-string") - .arg("query q($name: String) { match { $p: Person { name: $name } } return { $p.name } }") - .arg("--params") - .arg(r#"{"name":"Alice"}"#) - .arg("--json") - .output() - .unwrap(); - assert!(output.status.success(), "{output:?}"); - - // RFC-011 D3: invoke the STORED query by name (catalog lane, served-only). - // No `-e`/`--query` — the positional `find_person` is the catalog name.
- let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("query") - .arg("find_person") - .arg("--server") - .arg("dev") - .arg("--graph") - .arg("local") - .arg("--params") - .arg(r#"{"name":"Alice"}"#) - .arg("--json") - .output() - .unwrap(); - assert!(output.status.success(), "by-name catalog invocation: {output:?}"); - let payload: serde_json::Value = serde_json::from_slice(&output.stdout).unwrap(); - assert_eq!(payload["rows"][0]["p.name"], "Alice", "{payload}"); - - // The verb asserts kind: `mutate ` is rejected by the server. - let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("mutate") - .arg("find_person") - .arg("--server") - .arg("dev") - .arg("--graph") - .arg("local") - .arg("--params") - .arg(r#"{"name":"Alice"}"#) - .output() - .unwrap(); - assert!(!output.status.success(), "mutate on a read query must fail"); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("'find_person' is a read — use omnigraph query find_person"), - "expected a kind-mismatch error; got: {stderr}" - ); - - // Unknown --server errors listing what IS defined. - let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("query") - .arg("--server") - .arg("nope") - .arg("--query-string") - .arg("query q() { match { $p: Person } return { $p.name } }") - .output() - .unwrap(); - assert!(!output.status.success()); - let stderr = String::from_utf8_lossy(&output.stderr); - assert!(stderr.contains("unknown server 'nope'") && stderr.contains("dev"), "{stderr}"); - - // --server is exclusive with --store (two ways to address the graph). - // (RFC-011 D3: there is no positional URI anymore — the positional is a - // query name — so the double-addressing contradiction now surfaces between - // the two scope primitives.)
- let output = cli() - .env("OMNIGRAPH_HOME", operator_home.path()) - .arg("query") - .arg("--store") - .arg(&server.base_url) - .arg("--server") - .arg("dev") - .arg("--query-string") - .arg("query q() { match { $p: Person } return { $p.name } }") - .output() - .unwrap(); - assert!(!output.status.success()); - assert!( - String::from_utf8_lossy(&output.stderr).contains("exclusive"), - "{output:?}" + stderr.contains("denied"), + "expected 'denied' when --as overrides config to bruno, got: {stderr}" ); } diff --git a/crates/omnigraph-cli/tests/system_remote.rs b/crates/omnigraph-cli/tests/system_remote.rs index 19f460e..c86e32e 100644 --- a/crates/omnigraph-cli/tests/system_remote.rs +++ b/crates/omnigraph-cli/tests/system_remote.rs @@ -8,14 +8,6 @@ use serde_json::json; use support::*; -/// Graph id every served test addresses (`--server --graph GRAPH_ID`). -/// RFC-011: the server is cluster-only, so a graph selector is always required -/// — even for a single-graph cluster. -const GRAPH_ID: &str = "knowledge"; - -/// Graph-bound Cedar bundle for the policy-flavored remote tests. `act-bruno` -/// (team) reads + writes unprotected branches; `act-ragnor` (admins) merges -/// into protected `main`. const REMOTE_POLICY_E2E_YAML: &str = r#" version: 1 groups: @@ -45,8 +37,6 @@ rules: target_branch_scope: protected "#; -/// Server-scoped bundle granting `act-admin` the `graph_list` action so -/// `GET /graphs` succeeds.
const GRAPH_LIST_SERVER_POLICY_YAML: &str = r#" version: 1 groups: @@ -58,24 +48,61 @@ rules: actions: [graph_list] "#; +fn yaml_string(value: &str) -> String { + format!("'{}'", value.replace('\'', "''")) +} + +fn remote_policy_server_config(graph: &SystemGraph) -> String { + format!( + "\ +project: + name: remote-policy-e2e +graphs: + local: + uri: {} +server: + graph: local +policy: + file: ./policy.yaml +", + yaml_string(&graph.path().to_string_lossy()) + ) +} + +fn remote_policy_client_config(url: &str) -> String { + format!( + "\ +graphs: + dev: + uri: {} + bearer_token_env: POLICY_TEST_TOKEN +cli: + graph: dev + branch: main +query: + roots: + - . +auth: + env_file: ./.env.omni +", + yaml_string(url) + ) +} + #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_server_and_cli_end_to_end_flow() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = spawn_server_with_cluster(cluster.path()); - // The served graph's storage root — used for embedded-side cross checks.
- let served_root = cluster.path().join("graphs").join(format!("{GRAPH_ID}.omni")); - let temp = tempfile::tempdir().unwrap(); - let mutation_file = temp.path().join("system-remote-change.gq"); - fs::write( - &mutation_file, + let graph = SystemGraph::loaded(); + let server = graph.spawn_server(); + let config = graph.write_config("omnigraph.yaml", &remote_yaml_config(&server.base_url)); + let mutation_file = graph.write_query( + "system-remote-change.gq", r#" query insert_person($name: String, $age: I32) { insert Person { name: $name, age: $age } } "#, - ) - .unwrap(); + ); let client = Client::new(); let health = client @@ -89,15 +116,13 @@ query insert_person($name: String, $age: I32) { assert_eq!(health["status"], "ok"); let local_snapshot = parse_stdout_json(&output_success( - cli().arg("snapshot").arg(&served_root).arg("--json"), + cli().arg("snapshot").arg(graph.path()).arg("--json"), )); let snapshot = parse_stdout_json(&output_success( cli() .arg("snapshot") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--json"), )); assert_eq!(snapshot["branch"], "main"); @@ -106,10 +131,10 @@ query insert_person($name: String, $age: I32) { let local_read = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") - .arg(&served_root) + .arg(graph.path()) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -118,12 +143,11 @@ query insert_person($name: String, $age: I32) { let read_payload = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -133,15 +157,11 @@ query insert_person($name: String, $age: I32) { assert_eq!(read_payload["row_count"], 1); 
assert_eq!(read_payload["rows"][0]["p.name"], "Alice"); - // Served write: no `--as` (the server resolves the actor; here the server - // is `--unauthenticated`, so the actor is the server default). let change_payload = parse_stdout_json(&output_success( cli() .arg("change") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query") .arg(&mutation_file) .arg("--params") @@ -152,7 +172,7 @@ query insert_person($name: String, $age: I32) { let query_source = fs::read_to_string(fixture("test.gq")).unwrap(); let http_read = client - .post(format!("{}/graphs/{GRAPH_ID}/read", server.base_url)) + .post(format!("{}/read", server.base_url)) .json(&json!({ "branch": "main", "query_source": query_source, @@ -171,10 +191,10 @@ query insert_person($name: String, $age: I32) { let local_verify = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") - .arg(&served_root) + .arg(graph.path()) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Mina"}"#) @@ -183,16 +203,15 @@ query insert_person($name: String, $age: I32) { assert_eq!(local_verify["row_count"], 1); assert_eq!(local_verify["rows"][0]["p.name"], "Mina"); - // CLI inline source over the HTTP transport (--server). Confirms inline - // source survives the remote-execution path identically to file-based - // queries. + // CLI `-e` over the HTTP transport (--config points at remote server). + // Confirms inline source survives the remote-execution path identically + // to file-based queries, and exercises `POST /query` end-to-end via the + // change-then-read round trip we just established. 
let inline_remote_read = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("-e") .arg("query find($name: String) { match { $p: Person { name: $name } } return { $p.name, $p.age } }") .arg("--params") @@ -205,10 +224,8 @@ query insert_person($name: String, $age: I32) { let inline_remote_change = parse_stdout_json(&output_success( cli() .arg("change") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query-string") .arg("query add($name: String, $age: I32) { insert Person { name: $name, age: $age } }") .arg("--params") @@ -217,9 +234,10 @@ query insert_person($name: String, $age: I32) { )); assert_eq!(inline_remote_change["affected_nodes"], 1); - // `POST /graphs/{id}/query` happy path directly. + // `POST /query` happy path directly: a hand-rolled HTTP body using the + // new clean field names. let http_query = client - .post(format!("{}/graphs/{GRAPH_ID}/query", server.base_url)) + .post(format!("{}/query", server.base_url)) .json(&json!({ "branch": "main", "query": "query find($name: String) { match { $p: Person { name: $name } } return { $p.name } }", @@ -234,9 +252,9 @@ query insert_person($name: String, $age: I32) { assert_eq!(http_query["row_count"], 1); assert_eq!(http_query["rows"][0]["p.name"], "Inline"); - // `POST /graphs/{id}/query` rejects mutations with 400. + // `POST /query` rejects mutations with 400. 
let http_query_mutation = client - .post(format!("{}/graphs/{GRAPH_ID}/query", server.base_url)) + .post(format!("{}/query", server.base_url)) .json(&json!({ "branch": "main", "query": "query bad($name: String, $age: I32) { insert Person { name: $name, age: $age } }", @@ -245,33 +263,32 @@ query insert_person($name: String, $age: I32) { .send() .unwrap(); assert_eq!(http_query_mutation.status(), reqwest::StatusCode::BAD_REQUEST); + + // `run publish` / `run list` removed. Direct-to-target writes + // already landed via the change call above; the commit graph is now + // the audit surface (verified separately by `commit list`). } #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_schema_apply_via_cli_updates_graph() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = spawn_server_with_cluster(cluster.path()); - let served_root = cluster.path().join("graphs").join(format!("{GRAPH_ID}.omni")); - let temp = tempfile::tempdir().unwrap(); - let next_schema = temp.path().join("next.pg"); - fs::write( - &next_schema, - fs::read_to_string(fixture("test.pg")).unwrap().replace( + let graph = SystemGraph::initialized(); + let server = graph.spawn_server(); + let config = graph.write_config("omnigraph.yaml", &remote_yaml_config(&server.base_url)); + let next_schema = graph.write_file( + "next.pg", + &fs::read_to_string(fixture("test.pg")).unwrap().replace( " age: I32?\n}", " age: I32?\n nickname: String?\n}", ), - ) - .unwrap(); + ); let payload = parse_stdout_json(&output_success( cli() .arg("schema") .arg("apply") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--schema") .arg(&next_schema) .arg("--json"), @@ -280,7 +297,7 @@ fn remote_schema_apply_via_cli_updates_graph() { let db = tokio::runtime::Runtime::new() .unwrap() - .block_on(Omnigraph::open(served_root.to_string_lossy().as_ref())) + 
.block_on(Omnigraph::open(graph.path().to_string_lossy().as_ref())) .unwrap(); assert!( db.catalog().node_types["Person"] @@ -292,95 +309,74 @@ fn remote_schema_apply_via_cli_updates_graph() { #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_schema_apply_rejects_unsupported_plan() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = spawn_server_with_cluster(cluster.path()); - let temp = tempfile::tempdir().unwrap(); - let breaking_schema = temp.path().join("breaking.pg"); - fs::write( - &breaking_schema, - fs::read_to_string(fixture("test.pg")) + let graph = SystemGraph::initialized(); + let server = graph.spawn_server(); + let config = graph.write_config("omnigraph.yaml", &remote_yaml_config(&server.base_url)); + let breaking_schema = graph.write_file( + "breaking.pg", + &fs::read_to_string(fixture("test.pg")) .unwrap() .replace("age: I32?", "age: I64?"), - ) - .unwrap(); + ); let output = output_failure( cli() .arg("schema") .arg("apply") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--schema") .arg(&breaking_schema), ); let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("changing property type"), - "expected unsupported-plan error, got: {stderr}" - ); + assert!(stderr.contains("changing property type")); } #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_schema_apply_rejects_when_non_main_branch_exists() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = spawn_server_with_cluster(cluster.path()); - - // Create a non-main branch over the served path so the schema-apply - // single-branch precondition fails. 
+ let graph = SystemGraph::initialized(); output_success( cli() .arg("branch") .arg("create") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) .arg("--from") .arg("main") + .arg("--uri") + .arg(graph.path()) .arg("feature"), ); - - let temp = tempfile::tempdir().unwrap(); - let next_schema = temp.path().join("next.pg"); - fs::write( - &next_schema, - fs::read_to_string(fixture("test.pg")).unwrap().replace( + let server = graph.spawn_server(); + let config = graph.write_config("omnigraph.yaml", &remote_yaml_config(&server.base_url)); + let next_schema = graph.write_file( + "next.pg", + &fs::read_to_string(fixture("test.pg")).unwrap().replace( " age: I32?\n}", " age: I32?\n nickname: String?\n}", ), - ) - .unwrap(); + ); let output = output_failure( cli() .arg("schema") .arg("apply") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--schema") .arg(&next_schema), ); let stderr = String::from_utf8_lossy(&output.stderr); - assert!( - stderr.contains("schema apply requires a graph with only main"), - "expected single-branch precondition error, got: {stderr}" - ); + assert!(stderr.contains("schema apply requires a graph with only main")); } #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_read_preserves_projection_order_in_json_and_csv() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = spawn_server_with_cluster(cluster.path()); - let temp = tempfile::tempdir().unwrap(); - let ordered_query = temp.path().join("ordered-remote.gq"); - fs::write( - &ordered_query, + let graph = SystemGraph::loaded(); + let server = graph.spawn_server(); + let config = graph.write_config("omnigraph.yaml", &remote_yaml_config(&server.base_url)); + let ordered_query = graph.write_query( + "ordered-remote.gq", r#" query ordered_person($name: String) { match { @@ -389,18 +385,16 @@ query ordered_person($name: String) { return { 
$p.age, $p.name } } "#, - ) - .unwrap(); + ); let json_payload = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query") .arg(&ordered_query) + .arg("--name") .arg("ordered_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -417,12 +411,11 @@ query ordered_person($name: String) { let csv = stdout_string(&output_success( cli() .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query") .arg(&ordered_query) + .arg("--name") .arg("ordered_person") .arg("--params") .arg(r#"{"name":"Alice"}"#) @@ -437,28 +430,24 @@ query ordered_person($name: String) { #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_branch_create_list_merge_flow() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = spawn_server_with_cluster(cluster.path()); - let temp = tempfile::tempdir().unwrap(); - let mutation_file = temp.path().join("system-remote-branch-change.gq"); - fs::write( - &mutation_file, + let graph = SystemGraph::loaded(); + let server = graph.spawn_server(); + let config = graph.write_config("omnigraph.yaml", &remote_yaml_config(&server.base_url)); + let mutation_file = graph.write_query( + "system-remote-branch-change.gq", r#" query insert_person($name: String, $age: I32) { insert Person { name: $name, age: $age } } "#, - ) - .unwrap(); + ); let initial = parse_stdout_json(&output_success( cli() .arg("branch") .arg("list") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--json"), )); assert_eq!(initial["branches"], json!(["main"])); @@ -467,10 +456,8 @@ query insert_person($name: String, $age: I32) { cli() .arg("branch") .arg("create") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + 
.arg(&config) .arg("--from") .arg("main") .arg("feature") @@ -483,10 +470,8 @@ query insert_person($name: String, $age: I32) { cli() .arg("branch") .arg("list") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--json"), )); assert_eq!(listed["branches"], json!(["feature", "main"])); @@ -494,10 +479,8 @@ query insert_person($name: String, $age: I32) { let changed = parse_stdout_json(&output_success( cli() .arg("change") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query") .arg(&mutation_file) .arg("--branch") @@ -513,10 +496,8 @@ query insert_person($name: String, $age: I32) { cli() .arg("branch") .arg("merge") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("feature") .arg("--into") .arg("main") @@ -529,12 +510,11 @@ query insert_person($name: String, $age: I32) { let verify = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Zoe"}"#) @@ -547,17 +527,16 @@ query insert_person($name: String, $age: I32) { #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_branch_delete_removes_branch() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = spawn_server_with_cluster(cluster.path()); + let graph = SystemGraph::loaded(); + let server = graph.spawn_server(); + let config = graph.write_config("omnigraph.yaml", &remote_yaml_config(&server.base_url)); parse_stdout_json(&output_success( cli() .arg("branch") .arg("create") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--from") .arg("main") .arg("feature") 
@@ -568,13 +547,9 @@ fn remote_branch_delete_removes_branch() { cli() .arg("branch") .arg("delete") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("feature") - // Served target is non-local → destructive-confirm gate (RFC-011 D9). - .arg("--yes") .arg("--json"), )); assert_eq!(deleted["name"], "feature"); @@ -583,10 +558,8 @@ fn remote_branch_delete_removes_branch() { cli() .arg("branch") .arg("list") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--json"), )); assert_eq!(listed["branches"], json!(["main"])); @@ -595,12 +568,11 @@ fn remote_branch_delete_removes_branch() { #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_export_round_trips_full_branch_graph() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = spawn_server_with_cluster(cluster.path()); - let temp = tempfile::tempdir().unwrap(); - let mutation_file = temp.path().join("system-remote-export-change.gq"); - fs::write( - &mutation_file, + let graph = SystemGraph::loaded(); + let server = graph.spawn_server(); + let config = graph.write_config("omnigraph.yaml", &remote_yaml_config(&server.base_url)); + let mutation_file = graph.write_query( + "system-remote-export-change.gq", r#" query insert_person($name: String, $age: I32) { insert Person { name: $name, age: $age } @@ -610,17 +582,14 @@ query add_friend($from: String, $to: String) { insert Knows { from: $from, to: $to } } "#, - ) - .unwrap(); + ); output_success( cli() .arg("branch") .arg("create") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--from") .arg("main") .arg("feature"), @@ -629,12 +598,11 @@ query add_friend($from: String, $to: String) { output_success( cli() .arg("change") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) +
.arg("--config") + .arg(&config) .arg("--query") .arg(&mutation_file) + .arg("--name") .arg("insert_person") .arg("--branch") .arg("feature") @@ -645,12 +613,11 @@ query add_friend($from: String, $to: String) { output_success( cli() .arg("change") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query") .arg(&mutation_file) + .arg("--name") .arg("add_friend") .arg("--branch") .arg("feature") @@ -662,17 +629,18 @@ query add_friend($from: String, $to: String) { let exported = stdout_string(&output_success( cli() .arg("export") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--branch") .arg("feature") .arg("--jsonl"), )); - let export_path = temp.path().join("system-remote-exported.jsonl"); - fs::write(&export_path, &exported).unwrap(); - let imported_graph = temp.path().join("imported-remote-export.omni"); + let export_path = graph.write_jsonl("system-remote-exported.jsonl", &exported); + let imported_graph = graph + .path() + .parent() + .unwrap() + .join("imported-remote-export.omni"); output_success( cli() @@ -684,8 +652,6 @@ query add_friend($from: String, $to: String) { output_success( cli() .arg("load") - .arg("--mode") - .arg("overwrite") .arg("--data") .arg(&export_path) .arg(&imported_graph), @@ -716,10 +682,10 @@ query add_friend($from: String, $to: String) { let eve = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--store") .arg(&imported_graph) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"Eve"}"#) @@ -732,24 +698,20 @@ query add_friend($from: String, $to: String) { #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_ingest_creates_review_branch_and_keeps_it_readable() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = spawn_server_with_cluster(cluster.path()); - let temp = 
tempfile::tempdir().unwrap(); - let ingest_data = temp.path().join("system-remote-ingest.jsonl"); - fs::write( - &ingest_data, + let graph = SystemGraph::loaded(); + let server = graph.spawn_server(); + let config = graph.write_config("omnigraph.yaml", &remote_yaml_config(&server.base_url)); + let ingest_data = graph.write_jsonl( + "system-remote-ingest.jsonl", r#"{"type":"Person","data":{"name":"Zoe","age":33}} {"type":"Person","data":{"name":"Bob","age":26}}"#, - ) - .unwrap(); + ); let ingest_payload = parse_stdout_json(&output_success( cli() .arg("ingest") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--data") .arg(&ingest_data) .arg("--branch") @@ -766,10 +728,8 @@ fn remote_ingest_creates_review_branch_and_keeps_it_readable() { let feature_snapshot = parse_stdout_json(&output_success( cli() .arg("snapshot") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--branch") .arg("feature-ingest") .arg("--json"), @@ -779,12 +739,11 @@ fn remote_ingest_creates_review_branch_and_keeps_it_readable() { let zoe = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--branch") .arg("feature-ingest") @@ -796,114 +755,35 @@ fn remote_ingest_creates_review_branch_and_keeps_it_readable() { assert_eq!(zoe["rows"][0]["p.name"], "Zoe"); } -/// The unified `load` works against remote graphs through the server's -/// `/ingest` endpoint: without `--from` a missing branch is a hard error -/// (no implicit fork), with `--from` it forks like ingest did. 
-#[test] -#[ignore = "requires loopback socket permissions in sandboxed runners"] -fn remote_load_round_trips_and_requires_from_for_new_branches() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = spawn_server_with_cluster(cluster.path()); - let temp = tempfile::tempdir().unwrap(); - let extra = temp.path().join("system-remote-load.jsonl"); - fs::write( - &extra, - r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#, - ) - .unwrap(); - - // Missing branch without --from: refused remotely, nothing created. - let failure = output_failure( - cli() - .arg("load") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) - .arg("--mode") - .arg("merge") - .arg("--data") - .arg(&extra) - .arg("--branch") - .arg("feature-load"), - ); - assert!( - String::from_utf8_lossy(&failure.stderr).contains("feature-load"), - "error should name the missing branch" - ); - - // With --from, the remote load forks and lands the rows. - let payload = parse_stdout_json(&output_success( - cli() - .arg("load") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) - .arg("--mode") - .arg("merge") - .arg("--data") - .arg(&extra) - .arg("--branch") - .arg("feature-load") - .arg("--from") - .arg("main") - .arg("--json"), - )); - assert_eq!(payload["branch"], "feature-load"); - assert_eq!(payload["base_branch"], "main"); - assert_eq!(payload["branch_created"], true); - assert_eq!(payload["nodes_loaded"], 1); - - let snapshot = parse_stdout_json(&output_success( - cli() - .arg("snapshot") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) - .arg("--branch") - .arg("feature-load") - .arg("--json"), - )); - assert_eq!(snapshot["branch"], "feature-load"); -} - #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_ingest_reuses_existing_branch_and_merges_updates() { - let cluster = converged_loaded_cluster(GRAPH_ID, None); - let server = 
spawn_server_with_cluster(cluster.path()); + let graph = SystemGraph::loaded(); + let server = graph.spawn_server(); + let config = graph.write_config("omnigraph.yaml", &remote_yaml_config(&server.base_url)); output_success( cli() .arg("branch") .arg("create") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--from") .arg("main") .arg("feature-ingest"), ); - let temp = tempfile::tempdir().unwrap(); - let ingest_data = temp.path().join("system-remote-ingest-merge.jsonl"); - fs::write( - &ingest_data, + let ingest_data = graph.write_jsonl( + "system-remote-ingest-merge.jsonl", r#"{"type":"Person","data":{"name":"Bob","age":26}} {"type":"Person","data":{"name":"Zoe","age":33}}"#, - ) - .unwrap(); + ); let ingest_payload = parse_stdout_json(&output_success( cli() .arg("ingest") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--data") .arg(&ingest_data) .arg("--branch") @@ -922,12 +802,11 @@ fn remote_ingest_reuses_existing_branch_and_merges_updates() { let bob = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--branch") .arg("feature-ingest") @@ -941,12 +820,11 @@ fn remote_ingest_reuses_existing_branch_and_merges_updates() { let zoe = parse_stdout_json(&output_success( cli() .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&config) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--branch") .arg("feature-ingest") @@ -961,51 +839,45 @@ fn remote_ingest_reuses_existing_branch_and_merges_updates() { #[test] #[ignore = "requires loopback socket permissions in sandboxed runners"] fn remote_policy_enforces_branch_first_cli_workflow() 
{ - // Served policy enforcement: the cluster binds REMOTE_POLICY_E2E_YAML to the - // graph, and the server maps bearer tokens to actors. The actor is resolved - // from the token (no `--as` on served writes). - let cluster = converged_loaded_cluster(GRAPH_ID, Some(REMOTE_POLICY_E2E_YAML)); - let server = spawn_server_with_cluster_env( - cluster.path(), + let graph = SystemGraph::loaded(); + let server_config = + graph.write_config("server-policy.yaml", &remote_policy_server_config(&graph)); + graph.write_config("policy.yaml", REMOTE_POLICY_E2E_YAML); + let server = graph.spawn_server_with_config_env( + &server_config, &[( "OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", r#"{"act-bruno":"team-token","act-ragnor":"admin-token"}"#, )], ); - let temp = tempfile::tempdir().unwrap(); - let mutation_file = temp.path().join("system-remote-policy-change.gq"); - fs::write( - &mutation_file, + let client_config = graph.write_config( + "omnigraph-policy.yaml", + &remote_policy_client_config(&server.base_url), + ); + graph.write_config(".env.omni", "POLICY_TEST_TOKEN=team-token\n"); + let mutation_file = graph.write_query( + "system-remote-policy-change.gq", r#" query insert_person($name: String, $age: I32) { insert Person { name: $name, age: $age } } "#, - ) - .unwrap(); + ); - // Reads are granted to the team group (bruno). let snapshot = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "team-token") .arg("snapshot") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&client_config) .arg("--json"), )); assert_eq!(snapshot["branch"], "main"); - // bruno cannot change protected main (team-write-unprotected only). 
let denied_main_change = output_failure( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "team-token") .arg("change") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&client_config) .arg("--query") .arg(&mutation_file) .arg("--params") @@ -1013,23 +885,14 @@ query insert_person($name: String, $age: I32) { .arg("--json"), ); let denied_main_stderr = String::from_utf8(denied_main_change.stderr).unwrap(); - assert!( - denied_main_stderr.contains("denied") - && denied_main_stderr.contains("change") - && denied_main_stderr.contains("main"), - "expected change-on-main denial, got: {denied_main_stderr}" - ); + assert!(denied_main_stderr.contains("policy denied action 'change' on branch 'main'")); - // bruno can create an unprotected branch. let created = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "team-token") .arg("branch") .arg("create") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&client_config) .arg("--from") .arg("main") .arg("feature") @@ -1037,15 +900,11 @@ query insert_person($name: String, $age: I32) { )); assert_eq!(created["name"], "feature"); - // bruno can change the unprotected branch; actor resolves from the token. let changed = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "team-token") .arg("change") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&client_config) .arg("--query") .arg(&mutation_file) .arg("--branch") @@ -1056,39 +915,28 @@ query insert_person($name: String, $age: I32) { )); assert_eq!(changed["branch"], "feature"); assert_eq!(changed["affected_nodes"], 1); - assert_eq!(changed["actor_id"], "act-bruno"); - // bruno cannot merge into protected main (admins-promote only). 
let denied_merge = output_failure( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "team-token") .arg("branch") .arg("merge") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&client_config) .arg("feature") .arg("--into") .arg("main") .arg("--json"), ); let denied_merge_stderr = String::from_utf8(denied_merge.stderr).unwrap(); - assert!( - denied_merge_stderr.contains("denied") && denied_merge_stderr.contains("branch_merge"), - "expected branch_merge denial, got: {denied_merge_stderr}" - ); + assert!(denied_merge_stderr.contains("policy denied action 'branch_merge'")); - // ragnor (admins) can promote into protected main. let merged = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "admin-token") + .env("POLICY_TEST_TOKEN", "admin-token") .arg("branch") .arg("merge") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&client_config) .arg("feature") .arg("--into") .arg("main") @@ -1098,14 +946,12 @@ query insert_person($name: String, $age: I32) { let verify = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "team-token") .arg("read") - .arg("--server") - .arg(&server.base_url) - .arg("--graph") - .arg(GRAPH_ID) + .arg("--config") + .arg(&client_config) .arg("--query") .arg(fixture("test.gq")) + .arg("--name") .arg("get_person") .arg("--params") .arg(r#"{"name":"PolicyRemote"}"#) @@ -1117,16 +963,13 @@ query insert_person($name: String, $age: I32) { // ─── MR-668 PR 8 — omnigraph graphs list end-to-end ──────────────────────── -/// Multi-graph server + CLI `omnigraph graphs list` end-to-end (RFC-011 -/// cluster-only serving). +/// Multi-graph server + CLI `omnigraph graphs list` end-to-end. /// /// Steps: -/// 1. Build a converged cluster serving one graph `alpha` with a -/// server-scoped policy granting `act-admin` the `graph_list` action. -/// 2. Spawn the server with `--cluster` + a bearer-token map. -/// 3.
`omnigraph graphs list --server ` (admin token) — expect `alpha`. -/// 4. Addressing the server via `--server ` with NO `--graph` errors and -/// lists the candidate graphs (RFC-011 D7). +/// 1. Init a graph `alpha` on disk and write an `omnigraph.yaml` +/// whose `graphs:` map references it. +/// 2. Spawn the server with `--config `. +/// 3. `omnigraph graphs list` — expect to see `alpha`. /// /// Ignored by default — spawning servers needs loopback socket /// permissions some sandboxes lack. @@ -1134,33 +977,86 @@ query insert_person($name: String, $age: I32) { #[ignore = "requires loopback socket permissions in sandboxed runners"] fn graphs_list_against_multi_graph_server() { let cfg_dir = tempfile::tempdir().unwrap(); - let dir = cfg_dir.path(); - fs::copy(fixture("test.pg"), dir.join("alpha.pg")).unwrap(); - fs::write(dir.join("server.policy.yaml"), GRAPH_LIST_SERVER_POLICY_YAML).unwrap(); + let schema_path = fixture("test.pg"); + + // Init `alpha` on disk. + let alpha_uri = cfg_dir.path().join("alpha.omni"); + tokio::runtime::Runtime::new().unwrap().block_on(async { + Omnigraph::init( + alpha_uri.to_str().unwrap(), + &fs::read_to_string(&schema_path).unwrap(), + ) + .await + .unwrap(); + }); + fs::write( - dir.join("cluster.yaml"), - "version: 1\nmetadata:\n name: sys\nstate:\n backend: cluster\n lock: true\ngraphs:\n alpha:\n schema: ./alpha.pg\npolicies:\n server:\n file: ./server.policy.yaml\n applies_to: [cluster]\n", + cfg_dir.path().join("server-policy.yaml"), + GRAPH_LIST_SERVER_POLICY_YAML, ) .unwrap(); - output_success(cli().arg("cluster").arg("import").arg("--config").arg(dir)); - output_success(cli().arg("cluster").arg("apply").arg("--config").arg(dir)); - let server = spawn_server_with_cluster_env( - dir, + // Server config with `graphs:` map and no `server.graph` selector + // — multi mode (rule 4 of the inference matrix).
`GET /graphs` is a + // server-scoped action, so the success path needs an explicit server + // policy and bearer token. + let server_config_path = cfg_dir.path().join("omnigraph.yaml"); + fs::write( + &server_config_path, + format!( + "\ +server: + policy: + file: ./server-policy.yaml +graphs: + alpha: + uri: {} +", + yaml_string(&alpha_uri.to_string_lossy()) + ), + ) + .unwrap(); + + let server = spawn_server_with_config_env( + &server_config_path, &[( "OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", r#"{"act-admin":"admin-token"}"#, )], ); + // Client config — the CLI's `--target dev` resolves to `server.base_url`. + let client_config_path = cfg_dir.path().join("client.yaml"); + fs::write( + &client_config_path, + format!( + "\ +graphs: + dev: + uri: {} + bearer_token_env: GRAPH_LIST_TOKEN +cli: + graph: dev +auth: + env_file: ./.env.omni +", + yaml_string(&server.base_url) + ), + ) + .unwrap(); + fs::write( + cfg_dir.path().join(".env.omni"), + "GRAPH_LIST_TOKEN=admin-token\n", + ) + .unwrap(); + // `graphs list` lists `alpha`. let payload = parse_stdout_json(&output_success( cli() - .env("OMNIGRAPH_BEARER_TOKEN", "admin-token") .arg("graphs") .arg("list") - .arg("--server") - .arg(&server.base_url) + .arg("--config") + .arg(&client_config_path) .arg("--json"), )); let ids: Vec<&str> = payload["graphs"] @@ -1171,27 +1067,5 @@ fn graphs_list_against_multi_graph_server() { .collect(); assert_eq!(ids, vec!["alpha"]); - // RFC-011 D7: addressing the multi-graph server via `--server ` with no - // `--graph` errors and lists the candidate graphs (the resolver probes - // GET /graphs; the default-env token authorizes it).
- let no_graph = cli() - .env("OMNIGRAPH_BEARER_TOKEN", "admin-token") - .arg("query") - .arg("--server") - .arg(&server.base_url) - .arg("-e") - .arg("query q { match { $p: Person { name: \"x\" } } return { $p.name } }") - .output() - .unwrap(); - assert!( - !no_graph.status.success(), - "multi-graph server with no --graph must error" - ); - let stderr = String::from_utf8_lossy(&no_graph.stderr); - assert!( - stderr.contains("alpha") && stderr.contains("--graph "), - "expected a candidate-listing error naming alpha; got: {stderr}" - ); - drop(server); } diff --git a/crates/omnigraph-cluster/Cargo.toml b/crates/omnigraph-cluster/Cargo.toml deleted file mode 100644 index ad3cf24..0000000 --- a/crates/omnigraph-cluster/Cargo.toml +++ /dev/null @@ -1,35 +0,0 @@ -[package] -name = "omnigraph-cluster" -version = "0.7.2" -edition = "2024" -description = "Cluster configuration validation, planning, and config-only apply for Omnigraph." -license = "MIT" -repository = "https://github.com/ModernRelay/omnigraph" -homepage = "https://github.com/ModernRelay/omnigraph" -documentation = "https://docs.rs/omnigraph-cluster" - -[features] -# Fault-injection hooks for the apply protocol (crash-mid-apply, CAS-race -# tests), including cluster/engine boundary failures. -failpoints = ["dep:fail", "fail/failpoints", "omnigraph/failpoints"] - -[dependencies] -omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.7.2" } -omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.7.2" } -fail = { workspace = true, optional = true } -serde = { workspace = true } -serde_json = { workspace = true } -serde_yaml = { workspace = true } -sha2 = { workspace = true } -thiserror = { workspace = true } -time = { workspace = true } -# Runtime handle only — best-effort async lock release in -# StateLockGuard::drop on object-store backends (cluster commands always -# run inside the caller's tokio runtime).
-tokio = { workspace = true } -ulid = { workspace = true } - -[dev-dependencies] -serial_test = "3" -tempfile = { workspace = true } -tokio = { workspace = true } diff --git a/crates/omnigraph-cluster/src/config.rs b/crates/omnigraph-cluster/src/config.rs deleted file mode 100644 index 10621da..0000000 --- a/crates/omnigraph-cluster/src/config.rs +++ /dev/null @@ -1,1011 +0,0 @@ -//! Declared-configuration loading: cluster.yaml parsing, query -//! discovery, source digesting, validation (moved verbatim from lib.rs -//! in the modularization). Reads the operator's WORKING TREE — stored -//! state never lives here (see store.rs). - -use super::*; - -/// How a graph declares its stored queries. Terraform-style: the `.gq` -/// files ARE the declaration — point at them (or a directory) and every -/// `query ` they contain is discovered. The explicit name->file map -/// remains for fine-grained control. -#[derive(Debug, Serialize, Deserialize)] -#[serde(untagged)] -pub(crate) enum QueriesDecl { - /// `queries: ./queries/` — a directory (top-level `*.gq`, sorted) or a - /// single `.gq` file; every declaration inside is registered. - Discover(PathBuf), - /// `queries: [./queries/, ./extra.gq]` — several directories/files. - DiscoverMany(Vec), - /// `queries: { name: { file: ... } }` — explicit registry. - Explicit(BTreeMap), -} - -impl Default for QueriesDecl { - fn default() -> Self { - QueriesDecl::Explicit(BTreeMap::new()) - } -} - -/// Expand a graph's query declaration into the canonical name->file map. -/// Discovery reads and parses each `.gq`; unreadable or unparseable files -/// and duplicate query names are loud validation errors — a declaration the -/// tool cannot enumerate is broken, not partially usable.
-pub(crate) fn resolve_query_decls( - config_dir: &Path, - graph_id: &str, - decl: &QueriesDecl, - diagnostics: &mut Vec, -) -> (BTreeMap, BTreeMap) { - let paths: Vec = match decl { - QueriesDecl::Explicit(map) => { - return ( - map.iter() - .map(|(name, config)| { - ( - name.clone(), - QueryConfig { - file: config.file.clone(), - }, - ) - }) - .collect(), - BTreeMap::new(), - ); - } - QueriesDecl::Discover(path) => vec![path.clone()], - QueriesDecl::DiscoverMany(paths) => paths.clone(), - }; - - let mut files: Vec<(PathBuf, PathBuf)> = Vec::new(); // (declared-relative, resolved) - for declared in &paths { - let resolved = resolve_config_path(config_dir, declared); - if resolved.is_dir() { - let mut entries: Vec = match fs::read_dir(&resolved) { - Ok(read) => read - .flatten() - .map(|entry| entry.path()) - .filter(|path| path.extension().is_some_and(|ext| ext == "gq")) - .collect(), - Err(err) => { - diagnostics.push(Diagnostic::error( - "query_dir_unreadable", - format!("graphs.{graph_id}.queries"), - format!( - "could not list query directory '{}': {err}", - resolved.display() - ), - )); - continue; - } - }; - entries.sort(); - if entries.is_empty() { - diagnostics.push(Diagnostic::warning( - "query_dir_empty", - format!("graphs.{graph_id}.queries"), - format!( - "query directory '{}' contains no .gq files", - resolved.display() - ), - )); - } - for path in entries { - let relative = declared.join(path.file_name().expect("dir entries have names")); - files.push((relative, path)); - } - } else { - files.push((declared.clone(), resolved)); - } - } - - let mut registry: BTreeMap = BTreeMap::new(); - let mut origin: BTreeMap = BTreeMap::new(); - // Content read once at discovery and handed to the caller — the per-query - // digest/typecheck pass reuses it instead of re-reading (no N+1 reads, no - // window for the file to change between enumeration and validation).
- let mut contents: BTreeMap = BTreeMap::new(); - for (declared, resolved) in files { - let source = match fs::read_to_string(&resolved) { - Ok(source) => source, - Err(err) => { - diagnostics.push(Diagnostic::error( - "query_file_missing", - format!("graphs.{graph_id}.queries"), - format!("could not read query file '{}': {err}", resolved.display()), - )); - continue; - } - }; - let parsed = match parse_query(&source) { - Ok(parsed) => parsed, - Err(err) => { - diagnostics.push(Diagnostic::error( - "query_parse_error", - format!("graphs.{graph_id}.queries"), - format!("'{}' does not parse: {err}", resolved.display()), - )); - continue; - } - }; - for query_decl in &parsed.queries { - let name = query_decl.name.clone(); - if let Some(previous) = origin.get(&name) { - diagnostics.push(Diagnostic::error( - "duplicate_query_name", - format!("graphs.{graph_id}.queries.{name}"), - format!( - "query '{name}' is declared in both '{}' and '{}'", - previous.display(), - declared.display() - ), - )); - continue; - } - origin.insert(name.clone(), declared.clone()); - registry.insert( - name, - QueryConfig { - file: declared.clone(), - }, - ); - } - contents.insert(declared, source); - } - (registry, contents) -} - -pub(crate) fn parse_cluster_config(config_dir: &Path) -> ParsedConfig { - let config_dir = config_dir.to_path_buf(); - let config_file = config_dir.join(CLUSTER_CONFIG_FILE); - let mut diagnostics = Vec::new(); - - if !config_dir.is_dir() { - diagnostics.push(Diagnostic::error( - "config_dir_not_found", - display_path(&config_dir), - "`--config` must point at a directory containing cluster.yaml", - )); - return ParsedConfig { - raw: None, - diagnostics, - config_dir, - config_file, - }; - } - - let text = match fs::read_to_string(&config_file) { - Ok(text) => text, - Err(err) => { - diagnostics.push(Diagnostic::error( - "cluster_config_read_error", - CLUSTER_CONFIG_FILE, - format!("could not read cluster.yaml: {err}"), - )); - return ParsedConfig { - raw: None, - 
diagnostics, - config_dir, - config_file, - }; - } - }; - - diagnostics.extend(duplicate_key_diagnostics(&text)); - diagnostics.extend(future_field_diagnostics(&text)); - if has_errors(&diagnostics) { - return ParsedConfig { - raw: None, - diagnostics, - config_dir, - config_file, - }; - } - - let raw = match serde_yaml::from_str::(&text) { - Ok(raw) => Some(raw), - Err(err) => { - diagnostics.push(Diagnostic::error( - "invalid_cluster_yaml", - CLUSTER_CONFIG_FILE, - format!("could not parse cluster.yaml: {err}"), - )); - None - } - }; - - ParsedConfig { - raw, - diagnostics, - config_dir, - config_file, - } -} - -pub(crate) fn validate_cluster_header( - raw: &RawClusterConfig, - diagnostics: &mut Vec, -) -> ClusterSettings { - if raw.version != 1 { - diagnostics.push(Diagnostic::error( - "unsupported_cluster_config_version", - "version", - format!( - "unsupported cluster config version {}; this build supports version 1", - raw.version - ), - )); - } - if let Some(name) = raw.metadata.name.as_deref() { - if name.trim().is_empty() { - diagnostics.push(Diagnostic::error( - "empty_metadata_name", - "metadata.name", - "metadata.name must not be empty when provided", - )); - } - } - if let Some(backend) = raw.state.backend.as_deref() { - if backend != "cluster" { - diagnostics.push(Diagnostic::error( - "unsupported_state_backend", - "state.backend", - "Stage 2C supports only omitted state.backend or `cluster`", - )); - } - } - - if let Some(storage) = raw.storage.as_deref() { - let trimmed = storage.trim(); - if trimmed.is_empty() { - diagnostics.push(Diagnostic::error( - "invalid_storage_root", - "storage", - "storage must be a non-empty URI (e.g. 
s3://bucket/prefix) when provided", - )); - } else if let Some(rest) = trimmed.strip_prefix("s3://") { - if rest.trim_start_matches('/').is_empty() { - diagnostics.push(Diagnostic::error( - "invalid_storage_root", - "storage", - "storage s3:// URI must name a bucket", - )); - } - } - } - - ClusterSettings { - state_lock: raw.state.lock.unwrap_or(true), - storage_root: raw - .storage - .as_deref() - .map(str::trim) - .filter(|storage| !storage.is_empty()) - .map(|storage| storage.trim_end_matches('/').to_string()), - } -} - -pub(crate) fn state_resource_digests(state: &ClusterState) -> BTreeMap { - state - .applied_revision - .resources - .iter() - .map(|(address, resource)| (address.clone(), resource.digest.clone())) - .collect() -} - -pub(crate) fn initial_import_state(desired: &DesiredCluster) -> ClusterState { - ClusterState { - version: 1, - state_revision: 0, - applied_revision: AppliedRevisionState { - config_digest: Some(desired.config_digest.clone()), - resources: BTreeMap::new(), - }, - resource_statuses: BTreeMap::new(), - approval_records: BTreeMap::new(), - recovery_records: BTreeMap::new(), - observations: BTreeMap::new(), - } -} - -pub(crate) async fn observe_declared_graphs( - desired: &DesiredCluster, - backend: &ClusterStore, - state: &mut ClusterState, -) -> usize { - let mut graph_error_count = 0; - for graph in &desired.graphs { - let graph_address = graph_address(&graph.id); - let schema_address = schema_address(&graph.id); - let graph_uri = backend.graph_root(&graph.id); - let observed_at = now_rfc3339(); - - if !backend.graph_root_exists(&graph_uri).await { - state.applied_revision.resources.remove(&graph_address); - state.applied_revision.resources.remove(&schema_address); - state.observations.insert( - graph_address.clone(), - graph_observation_json(GraphObservationJson { - address: &graph_address, - graph_uri: &graph_uri, - observed_at: &observed_at, - exists: false, - manifest_version: None, - schema_digest: None, - desired_schema_digest: 
&graph.schema_digest, - schema_matches_desired: Some(false), - error: Some("derived graph root is missing"), - }), - ); - set_resource_status( - state, - &graph_address, - ResourceLifecycleStatus::Drifted, - "graph_missing", - "derived graph root is missing", - ); - set_resource_status( - state, - &schema_address, - ResourceLifecycleStatus::Drifted, - "graph_missing", - "derived graph root is missing", - ); - continue; - } - - match observe_live_graph(&graph_uri).await { - Ok(observation) => { - let schema_matches = observation.schema_digest == graph.schema_digest; - state.applied_revision.resources.insert( - schema_address.clone(), - StateResource { - digest: observation.schema_digest.clone(), - applies_to: None, - embedding_provider: None, - embedding_profile: None, - }, - ); - let query_digests = state_query_digests_for_graph(state, &graph.id); - let embedding_provider = state_graph_embedding_provider(state, &graph.id); - let embedding_provider_digest = - state_embedding_provider_digest(state, embedding_provider.as_deref()); - let graph_digest_value = graph_digest( - &graph.id, - Some(&observation.schema_digest), - Some(&query_digests), - embedding_provider.as_deref(), - embedding_provider_digest.as_ref(), - ); - state.applied_revision.resources.insert( - graph_address.clone(), - StateResource { - digest: graph_digest_value, - applies_to: None, - embedding_provider, - embedding_profile: None, - }, - ); - state.observations.insert( - graph_address.clone(), - graph_observation_json(GraphObservationJson { - address: &graph_address, - graph_uri: &graph_uri, - observed_at: &observed_at, - exists: true, - manifest_version: Some(observation.manifest_version), - schema_digest: Some(observation.schema_digest.as_str()), - desired_schema_digest: &graph.schema_digest, - schema_matches_desired: Some(schema_matches), - error: None, - }), - ); - if schema_matches { - set_resource_status_applied(state, &graph_address); - set_resource_status_applied(state, &schema_address); - } 
else { - set_resource_status( - state, - &graph_address, - ResourceLifecycleStatus::Drifted, - "schema_mismatch", - "live schema digest differs from desired schema digest", - ); - set_resource_status( - state, - &schema_address, - ResourceLifecycleStatus::Drifted, - "schema_mismatch", - "live schema digest differs from desired schema digest", - ); - } - } - Err(error) => { - graph_error_count += 1; - state.observations.insert( - graph_address.clone(), - graph_observation_json(GraphObservationJson { - address: &graph_address, - graph_uri: &graph_uri, - observed_at: &observed_at, - exists: true, - manifest_version: None, - schema_digest: None, - desired_schema_digest: &graph.schema_digest, - schema_matches_desired: None, - error: Some(error.as_str()), - }), - ); - set_resource_status( - state, - &graph_address, - ResourceLifecycleStatus::Error, - "graph_observation_error", - error.as_str(), - ); - set_resource_status( - state, - &schema_address, - ResourceLifecycleStatus::Error, - "graph_observation_error", - error.as_str(), - ); - } - } - } - graph_error_count -} - -/// RFC-004 §D7: the data-aware preview — the engine's migration plan for a -/// desired schema against the live graph, computed read-only (no lock).
-pub(crate) async fn preview_schema_migration( - graph_uri: &str, - schema_path: &str, -) -> Result { - let source = fs::read_to_string(schema_path).map_err(|err| err.to_string())?; - let db = Omnigraph::open_read_only(graph_uri) - .await - .map_err(|err| err.to_string())?; - let preview = db - .preview_schema_apply_with_options(&source, SchemaApplyOptions::default()) - .await - .map_err(|err| err.to_string())?; - Ok(preview.plan) -} - -pub(crate) struct LiveGraphObservation { - manifest_version: u64, - schema_digest: String, -} - -pub(crate) async fn observe_live_graph(graph_uri: &str) -> Result { - let db = Omnigraph::open_read_only(graph_uri) - .await - .map_err(|err| err.to_string())?; - let snapshot = db - .snapshot_of(ReadTarget::branch("main")) - .await - .map_err(|err| err.to_string())?; - let schema_source = db.schema_source(); - Ok(LiveGraphObservation { - manifest_version: snapshot.version(), - schema_digest: sha256_hex(schema_source.as_bytes()), - }) -} - -pub(crate) struct GraphObservationJson<'a> { - address: &'a str, - graph_uri: &'a str, - observed_at: &'a str, - exists: bool, - manifest_version: Option, - schema_digest: Option<&'a str>, - desired_schema_digest: &'a str, - schema_matches_desired: Option, - error: Option<&'a str>, -} - -pub(crate) fn graph_observation_json(observation: GraphObservationJson<'_>) -> serde_json::Value { - json!({ - "kind": "graph", - "address": observation.address, - "graph_uri": observation.graph_uri, - "observed_at": observation.observed_at, - "exists": observation.exists, - "manifest_version": observation.manifest_version, - "schema_digest": observation.schema_digest, - "desired_schema_digest": observation.desired_schema_digest, - "schema_matches_desired": observation.schema_matches_desired, - "error": observation.error, - }) -} - -pub(crate) fn load_desired(config_dir: &Path) -> LoadOutcome { - let parsed = parse_cluster_config(config_dir); - let config_dir = parsed.config_dir; - let config_file = 
parsed.config_file; - let mut diagnostics = parsed.diagnostics; - let Some(raw) = parsed.raw else { - return LoadOutcome { - desired: None, - diagnostics, - config_dir, - config_file, - }; - }; - let settings = validate_cluster_header(&raw, &mut diagnostics); - - let mut resources = BTreeMap::new(); - let mut dependencies = BTreeSet::new(); - let mut graph_query_digests: BTreeMap> = BTreeMap::new(); - let mut graph_schema_digests: BTreeMap = BTreeMap::new(); - let mut graph_embedding_providers: BTreeMap = BTreeMap::new(); - let mut embedding_provider_digests: BTreeMap = BTreeMap::new(); - let mut embedding_providers: BTreeMap = BTreeMap::new(); - - for (provider_name, profile) in &raw.providers.embedding { - validate_id( - "embedding provider name", - &format!("providers.embedding.{provider_name}"), - provider_name, - &mut diagnostics, - ); - let address = embedding_provider_address(provider_name); - profile.validate( - format!("providers.embedding.{provider_name}"), - &mut diagnostics, - ); - let digest = embedding_provider_digest(profile); - embedding_provider_digests.insert(address.clone(), digest.clone()); - embedding_providers.insert(address.clone(), profile.clone()); - resources.insert( - address.clone(), - ResourceSummary { - address, - kind: "embedding_provider".to_string(), - digest, - path: None, - }, - ); - } - - for (graph_id, graph) in &raw.graphs { - validate_id( - "graph id", - &format!("graphs.{graph_id}"), - graph_id, - &mut diagnostics, - ); - let graph_address = graph_address(graph_id); - let schema_address = schema_address(graph_id); - dependencies.insert(Dependency { - from: schema_address.clone(), - to: graph_address.clone(), - }); - if let Some(provider_ref) = graph.embedding_provider.as_deref() { - match normalize_embedding_provider_target(provider_ref) { - EmbeddingProviderTarget::Provider(provider_name) => { - let provider_address = embedding_provider_address(&provider_name); - if raw.providers.embedding.contains_key(&provider_name) { - 
dependencies.insert(Dependency { - from: graph_address.clone(), - to: provider_address.clone(), - }); - graph_embedding_providers.insert(graph_id.clone(), provider_address); - } else { - diagnostics.push(Diagnostic::error( - "dangling_embedding_provider_reference", - format!("graphs.{graph_id}.embedding_provider"), - format!( - "graph references embedding provider `{provider_name}`, but no providers.embedding.{provider_name} profile is declared" - ), - )); - } - } - EmbeddingProviderTarget::WrongKind(kind) => diagnostics.push(Diagnostic::error( - "wrong_kind_reference", - format!("graphs.{graph_id}.embedding_provider"), - format!( - "embedding_provider expects a providers.embedding ref or bare provider name, got `{kind}`" - ), - )), - } - } - - let schema_path = resolve_config_path(&config_dir, &graph.schema); - let schema_source = match fs::read_to_string(&schema_path) { - Ok(source) => { - let digest = sha256_hex(source.as_bytes()); - graph_schema_digests.insert(graph_id.clone(), digest.clone()); - resources.insert( - schema_address.clone(), - ResourceSummary { - address: schema_address.clone(), - kind: "schema".to_string(), - digest, - path: Some(display_path(&schema_path)), - }, - ); - Some(source) - } - Err(err) => { - diagnostics.push(Diagnostic::error( - "schema_file_missing", - format!("graphs.{graph_id}.schema"), - format!( - "could not read schema file '{}': {err}", - schema_path.display() - ), - )); - None - } - }; - - let catalog = schema_source.and_then(|source| match parse_schema(&source) { - Ok(schema) => match build_catalog(&schema) { - Ok(catalog) => Some(catalog), - Err(err) => { - diagnostics.push(Diagnostic::error( - "schema_catalog_error", - format!("graphs.{graph_id}.schema"), - err.to_string(), - )); - None - } - }, - Err(err) => { - diagnostics.push(Diagnostic::error( - "schema_parse_error", - format!("graphs.{graph_id}.schema"), - err.to_string(), - )); - None - } - }); - - let (graph_queries, query_contents) = - 
resolve_query_decls(&config_dir, graph_id, &graph.queries, &mut diagnostics); - for (query_name, query) in &graph_queries { - validate_id( - "query name", - &format!("graphs.{graph_id}.queries.{query_name}"), - query_name, - &mut diagnostics, - ); - let query_address = query_address(graph_id, query_name); - dependencies.insert(Dependency { - from: query_address.clone(), - to: graph_address.clone(), - }); - dependencies.insert(Dependency { - from: query_address.clone(), - to: schema_address.clone(), - }); - - let query_path = resolve_config_path(&config_dir, &query.file); - let source = match query_contents.get(&query.file) { - Some(cached) => Ok(cached.clone()), - None => fs::read_to_string(&query_path), - }; - match source { - Ok(source) => { - let digest = sha256_hex(source.as_bytes()); - graph_query_digests - .entry(graph_id.clone()) - .or_default() - .insert(query_name.clone(), digest.clone()); - resources.insert( - query_address.clone(), - ResourceSummary { - address: query_address, - kind: "query".to_string(), - digest, - path: Some(display_path(&query_path)), - }, - ); - validate_query_source( - graph_id, - query_name, - &source, - catalog.as_ref(), - &mut diagnostics, - ); - } - Err(err) => diagnostics.push(Diagnostic::error( - "query_file_missing", - format!("graphs.{graph_id}.queries.{query_name}.file"), - format!( - "could not read query file '{}': {err}", - query_path.display() - ), - )), - } - } - } - - for graph_id in raw.graphs.keys() { - let embedding_provider = graph_embedding_providers.get(graph_id); - let embedding_provider_digest = - embedding_provider.and_then(|address| embedding_provider_digests.get(address)); - let digest = graph_digest( - graph_id, - graph_schema_digests.get(graph_id), - graph_query_digests.get(graph_id), - embedding_provider.map(String::as_str), - embedding_provider_digest, - ); - resources.insert( - graph_address(graph_id), - ResourceSummary { - address: graph_address(graph_id), - kind: "graph".to_string(), - digest, - 
path: None, - }, - ); - } - - let mut policy_bindings: BTreeMap> = BTreeMap::new(); - for (policy_name, policy) in &raw.policies { - validate_id( - "policy name", - &format!("policies.{policy_name}"), - policy_name, - &mut diagnostics, - ); - if policy.applies_to.is_empty() { - diagnostics.push(Diagnostic::error( - "policy_missing_applies_to", - format!("policies.{policy_name}.applies_to"), - "policy.applies_to must name `cluster` or at least one graph", - )); - } - - let policy_address = policy_address(policy_name); - let mut normalized_bindings: Vec = Vec::new(); - for (idx, target) in policy.applies_to.iter().enumerate() { - match normalize_policy_target(target) { - PolicyTarget::Cluster => { - normalized_bindings.push("cluster".to_string()); - } - PolicyTarget::Graph(graph_id) => { - normalized_bindings.push(graph_address(&graph_id)); - if raw.graphs.contains_key(&graph_id) { - dependencies.insert(Dependency { - from: policy_address.clone(), - to: graph_address(&graph_id), - }); - } else { - diagnostics.push(Diagnostic::error( - "dangling_graph_reference", - format!("policies.{policy_name}.applies_to[{idx}]"), - format!( - "policy references graph `{graph_id}`, but no graph with that id is declared" - ), - )); - } - } - PolicyTarget::WrongKind(kind) => diagnostics.push(Diagnostic::error( - "wrong_kind_reference", - format!("policies.{policy_name}.applies_to[{idx}]"), - format!("policy applies_to expects graph refs or `cluster`, got `{kind}`"), - )), - } - } - - normalized_bindings.sort(); - normalized_bindings.dedup(); - policy_bindings.insert(policy_address.clone(), normalized_bindings); - - let policy_path = resolve_config_path(&config_dir, &policy.file); - match fs::read(&policy_path) { - Ok(bytes) => { - resources.insert( - policy_address.clone(), - ResourceSummary { - address: policy_address, - kind: "policy".to_string(), - digest: sha256_hex(&bytes), - path: Some(display_path(&policy_path)), - }, - ); - } - Err(err) => diagnostics.push(Diagnostic::error( 
- "policy_file_missing", - format!("policies.{policy_name}.file"), - format!( - "could not read policy file '{}': {err}", - policy_path.display() - ), - )), - } - } - - let mut resource_digests = BTreeMap::new(); - let mut resource_list = Vec::new(); - for (address, resource) in resources { - resource_digests.insert(address, resource.digest.clone()); - resource_list.push(resource); - } - let dependencies: Vec<_> = dependencies.into_iter().collect(); - let graphs = raw - .graphs - .keys() - .map(|graph_id| DesiredGraph { - id: graph_id.clone(), - schema_digest: graph_schema_digests - .get(graph_id) - .cloned() - .unwrap_or_default(), - embedding_provider: graph_embedding_providers.get(graph_id).cloned(), - }) - .collect(); - let config_digest = desired_config_digest(&raw, &resource_digests); - - LoadOutcome { - desired: Some(DesiredCluster { - config_dir: config_dir.clone(), - config_digest, - storage_root: settings.storage_root.clone(), - state_lock: settings.state_lock, - graphs, - resource_digests, - resources: resource_list, - dependencies, - policy_bindings, - embedding_providers, - }), - diagnostics, - config_dir, - config_file, - } -} - -pub(crate) fn validate_query_source( - graph_id: &str, - query_name: &str, - source: &str, - catalog: Option<&omnigraph_compiler::catalog::Catalog>, - diagnostics: &mut Vec, -) { - let path = format!("graphs.{graph_id}.queries.{query_name}"); - match parse_query(source) { - Ok(query_file) => { - let Some(query_decl) = query_file.queries.iter().find(|q| q.name == query_name) else { - diagnostics.push(Diagnostic::error( - "query_key_mismatch", - path, - format!("no `query {query_name}` declaration found in the referenced .gq file"), - )); - return; - }; - if let Some(catalog) = catalog { - if let Err(err) = typecheck_query_decl(catalog, query_decl) { - diagnostics.push(Diagnostic::error( - "query_typecheck_error", - format!("graphs.{graph_id}.queries.{query_name}"), - err.to_string(), - )); - } - } else { - 
diagnostics.push(Diagnostic::warning( - "query_typecheck_skipped", - format!("graphs.{graph_id}.queries.{query_name}"), - "query parsed, but type-check was skipped because the graph schema is invalid", - )); - } - } - Err(err) => diagnostics.push(Diagnostic::error( - "query_parse_error", - path, - err.to_string(), - )), - } -} - -pub(crate) fn future_field_diagnostics(text: &str) -> Vec { - let Ok(value) = serde_yaml::from_str::(text) else { - return Vec::new(); - }; - let Some(mapping) = value.as_mapping() else { - return Vec::new(); - }; - let future_fields = [ - "apply", - "env_file", - "pipelines", - "embeddings", - "ui", - "aliases", - "bindings", - ]; - mapping - .keys() - .filter_map(|key| key.as_str()) - .filter(|key| future_fields.contains(key)) - .map(|key| { - Diagnostic::error( - "future_phase_field", - key, - format!("`{key}` is reserved for a later cluster-control phase"), - ) - }) - .collect() -} - -pub(crate) fn validate_id(kind: &str, path: &str, value: &str, diagnostics: &mut Vec) { - let mut chars = value.chars(); - let valid = chars - .next() - .is_some_and(|ch| ch.is_ascii_alphabetic() || ch == '_') - && chars.all(|ch| ch.is_ascii_alphanumeric() || ch == '_' || ch == '-'); - if !valid { - diagnostics.push(Diagnostic::error( - "invalid_resource_id", - path, - format!("{kind} `{value}` must start with a letter or `_` and contain only ASCII letters, digits, `_`, or `-`"), - )); - } -} - -pub(crate) enum PolicyTarget { - Cluster, - Graph(String), - WrongKind(String), -} - -pub(crate) fn normalize_policy_target(value: &str) -> PolicyTarget { - if value == "cluster" { - PolicyTarget::Cluster - } else if let Some(graph_id) = value.strip_prefix("graph.") { - PolicyTarget::Graph(graph_id.to_string()) - } else if value.contains('.') { - PolicyTarget::WrongKind(value.to_string()) - } else { - PolicyTarget::Graph(value.to_string()) - } -} - -enum EmbeddingProviderTarget { - Provider(String), - WrongKind(String), -} - -fn 
normalize_embedding_provider_target(value: &str) -> EmbeddingProviderTarget { - if let Some(name) = value.strip_prefix("provider.embedding.") { - EmbeddingProviderTarget::Provider(name.to_string()) - } else if value.contains('.') { - EmbeddingProviderTarget::WrongKind(value.to_string()) - } else { - EmbeddingProviderTarget::Provider(value.to_string()) - } -} - -pub(crate) fn graph_address(graph_id: &str) -> String { - format!("graph.{graph_id}") -} - -pub(crate) fn schema_address(graph_id: &str) -> String { - format!("schema.{graph_id}") -} - -pub(crate) fn query_address(graph_id: &str, query_name: &str) -> String { - format!("query.{graph_id}.{query_name}") -} - -pub(crate) fn policy_address(policy_name: &str) -> String { - format!("policy.{policy_name}") -} - -pub(crate) fn embedding_provider_address(provider_name: &str) -> String { - format!("provider.embedding.{provider_name}") -} - -pub(crate) fn resolve_config_path(config_dir: &Path, path: &Path) -> PathBuf { - if path.is_absolute() { - path.to_path_buf() - } else { - config_dir.join(path) - } -} diff --git a/crates/omnigraph-cluster/src/diff.rs b/crates/omnigraph-cluster/src/diff.rs deleted file mode 100644 index ce29a45..0000000 --- a/crates/omnigraph-cluster/src/diff.rs +++ /dev/null @@ -1,464 +0,0 @@ -//! Plan/apply classification: resource diffing, dispositions, approval -//! gating, demotion (moved verbatim from lib.rs in the modularization). 
- -use super::*; - -pub(crate) fn diff_resources( - prior: &BTreeMap, - desired: &BTreeMap, -) -> Vec { - let mut changes = Vec::new(); - for (address, after) in desired { - match prior.get(address) { - None => changes.push(PlanChange { - resource: address.clone(), - operation: PlanOperation::Create, - before_digest: None, - after_digest: Some(after.clone()), - disposition: None, - reason: None, - binding_change: false, - metadata_change: None, - migration: None, - }), - Some(before) if before != after => changes.push(PlanChange { - resource: address.clone(), - operation: PlanOperation::Update, - before_digest: Some(before.clone()), - after_digest: Some(after.clone()), - disposition: None, - reason: None, - binding_change: false, - metadata_change: None, - migration: None, - }), - Some(_) => {} - } - } - for (address, before) in prior { - if !desired.contains_key(address) { - changes.push(PlanChange { - resource: address.clone(), - operation: PlanOperation::Delete, - before_digest: Some(before.clone()), - after_digest: None, - disposition: None, - reason: None, - binding_change: false, - metadata_change: None, - migration: None, - }); - } - } - changes.sort_by(|a, b| a.resource.cmp(&b.resource)); - changes -} - -/// Binding-only policy changes: the file digest is unchanged (so -/// `diff_resources` saw nothing) but the applied `applies_to` differs from -/// the desired bindings — including the pre-5A case where the state entry -/// has no bindings recorded yet. These are first-class plan changes: without -/// this pass a binding edit would silently rot or silently converge. 
-pub(crate) fn append_policy_binding_changes( - changes: &mut Vec, - prior_state: Option<&ClusterState>, - desired: &DesiredCluster, -) { - let Some(state) = prior_state else { - return; // no state: everything is already a Create carrying bindings - }; - for (address, desired_bindings) in &desired.policy_bindings { - if changes.iter().any(|change| &change.resource == address) { - continue; // content change already covers it - } - let Some(entry) = state.applied_revision.resources.get(address) else { - continue; // not applied yet: the Create covers it - }; - if entry.applies_to.as_ref() == Some(desired_bindings) { - continue; - } - changes.push(PlanChange { - resource: address.clone(), - operation: PlanOperation::Update, - before_digest: Some(entry.digest.clone()), - after_digest: Some(entry.digest.clone()), - disposition: None, - reason: None, - binding_change: true, - metadata_change: Some(PlanMetadataChange::PolicyBindings), - migration: None, - }); - } - changes.sort_by(|a, b| a.resource.cmp(&b.resource)); -} - -/// Metadata-only embedding provider changes: the provider digest is unchanged -/// but the applied state predates storing the profile body needed by -/// config-free serving. This mirrors policy binding backfill instead of -/// hiding a serving-time failure behind a no-op plan. 
-pub(crate) fn append_embedding_profile_changes( - changes: &mut Vec, - prior_state: Option<&ClusterState>, - desired: &DesiredCluster, -) { - let Some(state) = prior_state else { - return; // no state: provider Creates carry profiles already - }; - for (address, desired_profile) in &desired.embedding_providers { - if changes - .iter() - .any(|change| change.resource.as_str() == address.as_str()) - { - continue; // content change already covers it - } - let Some(entry) = state.applied_revision.resources.get(address) else { - continue; // not applied yet: the Create covers it - }; - if entry.embedding_profile.as_ref() == Some(desired_profile) { - continue; - } - changes.push(PlanChange { - resource: address.clone(), - operation: PlanOperation::Update, - before_digest: Some(entry.digest.clone()), - after_digest: Some(entry.digest.clone()), - disposition: None, - reason: None, - binding_change: false, - metadata_change: Some(PlanMetadataChange::EmbeddingProfile), - migration: None, - }); - } - changes.sort_by(|a, b| a.resource.cmp(&b.resource)); -} - -pub(crate) fn compute_blast_radius( - changes: &[PlanChange], - dependencies: &[Dependency], -) -> Vec { - changes - .iter() - .filter_map(|change| { - let affected: Vec<_> = dependencies - .iter() - .filter_map(|dep| (dep.to == change.resource).then_some(dep.from.clone())) - .collect(); - (!affected.is_empty()).then(|| BlastRadius { - resource: change.resource.clone(), - affected, - }) - }) - .collect() -} - -pub(crate) fn compute_approvals( - changes: &[PlanChange], - approved: &BTreeSet, -) -> Vec { - // One gate per subtree: the graph. delete carries its schema and - // queries, so a schema delete whose graph is also deleted is not listed. 
- let graph_deletes: BTreeSet = changes - .iter() - .filter(|change| change.operation == PlanOperation::Delete) - .filter_map(|change| change.resource.strip_prefix("graph.").map(str::to_string)) - .collect(); - changes - .iter() - .filter_map(|change| { - if change.operation != PlanOperation::Delete { - return None; - } - let gated = match resource_kind(&change.resource) { - ResourceKind::Graph(_) => true, - ResourceKind::Schema(graph) => !graph_deletes.contains(&graph), - _ => false, - }; - gated.then(|| ApprovalRequirement { - resource: change.resource.clone(), - reason: "delete may remove deployed graph or schema definition".to_string(), - satisfied: approved.contains(&change.resource), - }) - }) - .collect() -} - -/// Resources with a valid (digest-matching, unconsumed) pending approval. -/// Near-misses — an artifact for the same resource whose bound digests no -/// longer match — warn as `approval_stale` and never authorize anything. -pub(crate) fn approved_resources( - artifacts: &[(String, ApprovalArtifact)], - changes: &[PlanChange], - config_digest: &str, - diagnostics: &mut Vec, -) -> BTreeSet { - let mut approved = BTreeSet::new(); - for change in changes { - let candidates: Vec<&ApprovalArtifact> = artifacts - .iter() - .map(|(_, artifact)| artifact) - .filter(|artifact| { - artifact.consumed_at.is_none() && artifact.resource == change.resource - }) - .collect(); - if candidates.is_empty() { - continue; - } - let matched = candidates.iter().any(|artifact| { - artifact.bound_config_digest == config_digest - && artifact.bound_before_digest == change.before_digest - && artifact.bound_after_digest == change.after_digest - }); - if matched { - approved.insert(change.resource.clone()); - } else { - diagnostics.push(Diagnostic::warning( - "approval_stale", - change.resource.clone(), - "an approval artifact exists but its bound digests no longer match the plan; re-run `cluster approve`", - )); - } - } - approved -} - -#[derive(Debug, PartialEq, Eq)] 
-pub(crate) enum ResourceKind { - Graph(String), - Schema(String), - Query { graph: String, name: String }, - Policy(String), - EmbeddingProvider(String), - Unknown, -} - -pub(crate) fn resource_kind(address: &str) -> ResourceKind { - if let Some(graph) = address.strip_prefix("graph.") { - ResourceKind::Graph(graph.to_string()) - } else if let Some(graph) = address.strip_prefix("schema.") { - ResourceKind::Schema(graph.to_string()) - } else if let Some(rest) = address.strip_prefix("query.") { - match rest.split_once('.') { - Some((graph, name)) => ResourceKind::Query { - graph: graph.to_string(), - name: name.to_string(), - }, - None => ResourceKind::Unknown, - } - } else if let Some(name) = address.strip_prefix("policy.") { - ResourceKind::Policy(name.to_string()) - } else if let Some(name) = address.strip_prefix("provider.embedding.") { - ResourceKind::EmbeddingProvider(name.to_string()) - } else { - ResourceKind::Unknown - } -} - -/// Classify every planned change with the disposition config-only apply gives -/// it. Stage 3A executes only query/policy catalog writes; graph/schema -/// movement is a later phase, and `graph.` composite updates whose schema -/// component is unchanged converge automatically once query digests land. -pub(crate) fn classify_changes( - changes: &mut [PlanChange], - dependencies: &[Dependency], - pending_recovery: &BTreeSet, - approved: &BTreeSet, -) { - let mut schema_creates = BTreeSet::new(); - let mut schema_pending = BTreeSet::new(); - let mut graph_creates = BTreeSet::new(); - let mut graph_deletes = BTreeSet::new(); - for change in changes.iter() { - match resource_kind(&change.resource) { - ResourceKind::Schema(graph) => match change.operation { - PlanOperation::Create => { - schema_creates.insert(graph); - } - // Schema updates execute in-run before catalog writes (4B) - // and no longer block dependents; deletes (4C) still do. 
- PlanOperation::Update => {} - PlanOperation::Delete => { - schema_pending.insert(graph); - } - }, - ResourceKind::Graph(graph) => match change.operation { - PlanOperation::Create => { - graph_creates.insert(graph); - } - PlanOperation::Delete => { - graph_deletes.insert(graph); - } - PlanOperation::Update => {} - }, - _ => {} - } - } - // A schema Create is satisfied by its paired graph create (the init - // carries the schema); a standalone schema Create stays pending. - for graph in &schema_creates { - if !graph_creates.contains(graph) { - schema_pending.insert(graph.clone()); - } - } - // Subtree deletes ride the approved graph delete. - let rides_approved_delete = |graph: &str| { - graph_deletes.contains(graph) - && approved.contains(&graph_address(graph)) - && !pending_recovery.contains(graph) - }; - - for change in changes.iter_mut() { - let (disposition, reason) = match resource_kind(&change.resource) { - ResourceKind::Schema(graph) => match change.operation { - PlanOperation::Create - if graph_creates.contains(&graph) && !pending_recovery.contains(&graph) => - { - // Applied with the graph create — the init carries it. - (ApplyDisposition::Applied, None) - } - PlanOperation::Update if !pending_recovery.contains(&graph) => { - // Stage 4B: schema updates execute via the engine's - // schema apply (soft drops only; allow_data_loss is 4C). 
- (ApplyDisposition::Applied, None) - } - PlanOperation::Create | PlanOperation::Update => { - (ApplyDisposition::Blocked, Some("cluster_recovery_pending")) - } - PlanOperation::Delete if graph_deletes.contains(&graph) => { - if rides_approved_delete(&graph) { - (ApplyDisposition::Applied, None) - } else if pending_recovery.contains(&graph) { - (ApplyDisposition::Blocked, Some("cluster_recovery_pending")) - } else { - (ApplyDisposition::Blocked, Some("approval_required")) - } - } - _ => (ApplyDisposition::Deferred, Some("apply_unsupported_kind")), - }, - ResourceKind::Graph(graph) => match change.operation { - PlanOperation::Create => { - if pending_recovery.contains(&graph) { - (ApplyDisposition::Blocked, Some("cluster_recovery_pending")) - } else { - (ApplyDisposition::Applied, None) - } - } - PlanOperation::Update if !schema_pending.contains(&graph) => { - (ApplyDisposition::Derived, None) - } - // Stage 4C: an approved graph delete executes (the - // irreversible tier — gated by a digest-bound artifact). - PlanOperation::Delete => { - if pending_recovery.contains(&graph) { - (ApplyDisposition::Blocked, Some("cluster_recovery_pending")) - } else if rides_approved_delete(&graph) { - (ApplyDisposition::Applied, None) - } else { - (ApplyDisposition::Blocked, Some("approval_required")) - } - } - _ => (ApplyDisposition::Deferred, Some("apply_unsupported_kind")), - }, - ResourceKind::Query { graph, .. } => match change.operation { - PlanOperation::Delete => { - if rides_approved_delete(&graph) { - // Tombstoned with the approved graph delete. 
- (ApplyDisposition::Applied, None) - } else if graph_deletes.contains(&graph) { - (ApplyDisposition::Blocked, Some("approval_required")) - } else { - (ApplyDisposition::Applied, None) - } - } - PlanOperation::Create | PlanOperation::Update => { - if pending_recovery.contains(&graph) { - (ApplyDisposition::Blocked, Some("cluster_recovery_pending")) - } else if schema_pending.contains(&graph) { - (ApplyDisposition::Blocked, Some("dependency_not_applied")) - } else { - // A graph create in the same plan no longer blocks: - // creates execute first in the same apply run. - (ApplyDisposition::Applied, None) - } - } - }, - ResourceKind::Policy(_) => match change.operation { - PlanOperation::Delete => (ApplyDisposition::Applied, None), - PlanOperation::Create | PlanOperation::Update => { - let blocked_pending = dependencies.iter().any(|dep| { - dep.from == change.resource - && dep - .to - .strip_prefix("graph.") - .is_some_and(|graph| pending_recovery.contains(graph)) - }); - if blocked_pending { - (ApplyDisposition::Blocked, Some("cluster_recovery_pending")) - } else { - (ApplyDisposition::Applied, None) - } - } - }, - ResourceKind::EmbeddingProvider(_) => (ApplyDisposition::Applied, None), - ResourceKind::Unknown => (ApplyDisposition::Deferred, Some("apply_unsupported_kind")), - }; - change.disposition = Some(disposition); - change.reason = reason.map(str::to_string); - } -} - -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -pub(crate) enum FailedGraphOrigin { - GraphCreate, - SchemaApply, - GraphDelete, -} - -/// After a graph-moving operation fails mid-run, every change that depended -/// on that graph flips from Applied to Blocked so the output and the -/// persisted statuses tell the truth about what this run actually executed. -/// The originating change carries the failure code; dependents carry -/// `dependency_not_applied`. 
-pub(crate) fn demote_dependents_of_failed_graphs( - changes: &mut [PlanChange], - failed: &BTreeMap, - dependencies: &[Dependency], -) { - for change in changes.iter_mut() { - if change.disposition != Some(ApplyDisposition::Applied) { - continue; - } - let demote_reason = match resource_kind(&change.resource) { - ResourceKind::Graph(graph) => match failed.get(&graph) { - Some(FailedGraphOrigin::GraphCreate) => Some("graph_create_failed"), - Some(FailedGraphOrigin::GraphDelete) => Some("graph_delete_failed"), - Some(FailedGraphOrigin::SchemaApply) => Some("dependency_not_applied"), - None => None, - }, - ResourceKind::Schema(graph) => match failed.get(&graph) { - Some(FailedGraphOrigin::SchemaApply) => Some("schema_apply_failed"), - Some(FailedGraphOrigin::GraphCreate) | Some(FailedGraphOrigin::GraphDelete) => { - Some("dependency_not_applied") - } - None => None, - }, - ResourceKind::Query { graph, .. } if failed.contains_key(&graph) => { - Some("dependency_not_applied") - } - ResourceKind::Policy(_) => { - let blocked = dependencies.iter().any(|dep| { - dep.from == change.resource - && dep - .to - .strip_prefix("graph.") - .is_some_and(|graph| failed.contains_key(graph)) - }); - blocked.then_some("dependency_not_applied") - } - _ => None, - }; - if let Some(reason) = demote_reason { - change.disposition = Some(ApplyDisposition::Blocked); - change.reason = Some(reason.to_string()); - } - } -} diff --git a/crates/omnigraph-cluster/src/failpoints.rs b/crates/omnigraph-cluster/src/failpoints.rs deleted file mode 100644 index f5b2023..0000000 --- a/crates/omnigraph-cluster/src/failpoints.rs +++ /dev/null @@ -1,41 +0,0 @@ -//! Fault-injection hooks for the cluster apply protocol, mirroring the -//! engine's `omnigraph::failpoints` pattern. With the `failpoints` feature -//! off, every call site compiles to `Ok(())`. -//! -//! Only `maybe_fail` lives here — it returns the cluster's [`Diagnostic`] -//! error type. The test-side configuration guard is shared: use -//! 
[`omnigraph::failpoints::ScopedFailPoint`], which is registry-only -//! (error-type agnostic) and reachable because the cluster's `failpoints` -//! feature enables `omnigraph/failpoints`. One `ScopedFailPoint`, in the -//! lowest crate, avoids a drifting duplicate. - -use crate::Diagnostic; - -pub(crate) fn maybe_fail(_name: &str) -> Result<(), Diagnostic> { - #[cfg(feature = "failpoints")] - { - let name = _name; - fail::fail_point!(name, |_| { - return Err(Diagnostic::error( - "injected_failpoint", - name, - format!("injected failpoint triggered: {name}"), - )); - }); - } - Ok(()) -} - -/// Compile-checked catalog of this crate's apply-protocol failpoint names. -/// Engine-scoped names referenced from cluster tests live in -/// [`omnigraph::failpoints::names`]. -pub mod names { - pub const CLUSTER_APPLY_AFTER_GRAPH_CREATE: &str = "cluster_apply.after_graph_create"; - pub const CLUSTER_APPLY_AFTER_GRAPH_DELETE: &str = "cluster_apply.after_graph_delete"; - pub const CLUSTER_APPLY_AFTER_PAYLOAD_PHASE: &str = "cluster_apply.after_payload_phase"; - pub const CLUSTER_APPLY_AFTER_SCHEMA_APPLY: &str = "cluster_apply.after_schema_apply"; - pub const CLUSTER_APPLY_BEFORE_GRAPH_CREATE: &str = "cluster_apply.before_graph_create"; - pub const CLUSTER_APPLY_BEFORE_GRAPH_DELETE: &str = "cluster_apply.before_graph_delete"; - pub const CLUSTER_APPLY_BEFORE_SCHEMA_APPLY: &str = "cluster_apply.before_schema_apply"; - pub const CLUSTER_APPLY_BEFORE_STATE_WRITE: &str = "cluster_apply.before_state_write"; -} diff --git a/crates/omnigraph-cluster/src/lib.rs b/crates/omnigraph-cluster/src/lib.rs deleted file mode 100644 index 42735ae..0000000 --- a/crates/omnigraph-cluster/src/lib.rs +++ /dev/null @@ -1,2073 +0,0 @@ -use std::collections::{BTreeMap, BTreeSet}; -use std::fs::{self}; -use std::path::{Path, PathBuf}; - -use omnigraph::db::{Omnigraph, ReadTarget, SchemaApplyOptions}; -use omnigraph_compiler::SchemaMigrationPlan; -use omnigraph_compiler::build_catalog; -use 
omnigraph_compiler::query::parser::parse_query; -use omnigraph_compiler::query::typecheck::typecheck_query_decl; -use omnigraph_compiler::schema::parser::parse_schema; -use serde::{Deserialize, Serialize}; -use serde_json::json; -use sha2::{Digest, Sha256}; -use time::OffsetDateTime; -use time::format_description::well_known::Rfc3339; -use ulid::Ulid; - -pub mod failpoints; - -mod config; -mod diff; -mod serve; -mod store; -mod sweep; -mod types; -use config::{ - QueriesDecl, graph_address, initial_import_state, load_desired, observe_declared_graphs, parse_cluster_config, preview_schema_migration, schema_address, state_resource_digests, validate_cluster_header, -}; -use diff::{ - FailedGraphOrigin, ResourceKind, append_embedding_profile_changes, - append_policy_binding_changes, approved_resources, classify_changes, compute_approvals, - compute_blast_radius, demote_dependents_of_failed_graphs, diff_resources, resource_kind, -}; -pub use serve::{ - ServingGraph, ServingPolicy, ServingQuery, ServingSnapshot, cluster_graph_ids, - cluster_root_for_graph_uri, read_serving_snapshot, read_serving_snapshot_from_storage, - resolve_graph_storage_uri, -}; -use store::ClusterStore; -use sweep::{ - mark_approvals_consumed, record_approval_consumed, sweep_recovery_sidecars, - tombstone_graph_subtree, warn_pending_recovery_sidecars, -}; -pub use types::*; - -pub const CLUSTER_CONFIG_FILE: &str = "cluster.yaml"; -pub const CLUSTER_GRAPHS_DIR: &str = "graphs"; -pub const CLUSTER_STATE_DIR: &str = "__cluster"; -pub const CLUSTER_STATE_FILE: &str = "__cluster/state.json"; -pub const CLUSTER_LOCK_FILE: &str = "__cluster/lock.json"; -pub const CLUSTER_RESOURCES_DIR: &str = "__cluster/resources"; -pub const CLUSTER_RECOVERIES_DIR: &str = "__cluster/recoveries"; -pub const CLUSTER_APPROVALS_DIR: &str = "__cluster/approvals"; - -/// The store for a load outcome: the declared `storage:` root when present, -/// the config directory itself otherwise. A bad root is a loud error. 
-fn store_for(config_dir: &Path, storage_root: Option<&str>) -> Result { - match storage_root { - Some(root) => ClusterStore::for_storage_root(root), - None => Ok(ClusterStore::for_config_dir(config_dir)), - } -} - -pub fn validate_config_dir(config_dir: impl AsRef) -> ValidateOutput { - let outcome = load_desired(config_dir.as_ref()); - let (resource_digests, resources, dependencies) = match outcome.desired { - Some(desired) => ( - desired.resource_digests, - desired.resources, - desired.dependencies, - ), - None => (BTreeMap::new(), Vec::new(), Vec::new()), - }; - let ok = !has_errors(&outcome.diagnostics); - - ValidateOutput { - ok, - config_dir: display_path(&outcome.config_dir), - config_file: display_path(&outcome.config_file), - resource_digests, - resources, - dependencies, - diagnostics: outcome.diagnostics, - } -} - -pub async fn plan_config_dir(config_dir: impl AsRef) -> PlanOutput { - let outcome = load_desired(config_dir.as_ref()); - let mut diagnostics = outcome.diagnostics; - let storage_root = outcome - .desired - .as_ref() - .and_then(|desired| desired.storage_root.clone()); - let backend = match store_for(&outcome.config_dir, storage_root.as_deref()) { - Ok(backend) => backend, - Err(diagnostic) => { - diagnostics.push(diagnostic); - ClusterStore::for_config_dir(&outcome.config_dir) - } - }; - let mut observations = backend.observations(); - - let Some(desired) = outcome.desired else { - return PlanOutput { - ok: false, - config_dir: display_path(&outcome.config_dir), - desired_revision: DesiredRevision { - config_digest: None, - }, - resource_digests: BTreeMap::new(), - dependencies: Vec::new(), - state_observations: observations, - changes: Vec::new(), - blast_radius: Vec::new(), - approvals_required: Vec::new(), - diagnostics, - }; - }; - - if has_errors(&diagnostics) { - return PlanOutput { - ok: false, - config_dir: display_path(&desired.config_dir), - desired_revision: DesiredRevision { - config_digest: Some(desired.config_digest), - }, - 
resource_digests: desired.resource_digests, - dependencies: desired.dependencies, - state_observations: observations, - changes: Vec::new(), - blast_radius: Vec::new(), - approvals_required: Vec::new(), - diagnostics, - }; - } - - let _lock_guard = if desired.state_lock { - match backend.acquire_lock("plan", &mut observations).await { - Ok(guard) => Some(guard), - Err(diagnostic) => { - diagnostics.push(diagnostic); - None - } - } - } else { - diagnostics.push(Diagnostic::warning( - "state_lock_disabled", - "state.lock", - "state.lock is false; plan read state without acquiring the cluster state lock", - )); - None - }; - - // Plan is read-only: pending sidecars are reported, never acted on - // (RFC-004 open question 3 keeps read-only commands warn-only). - warn_pending_recovery_sidecars(&backend, &mut diagnostics).await; - - let mut prior_resources = BTreeMap::new(); - let mut prior_state: Option = None; - if !has_errors(&diagnostics) { - match backend.read_state(&mut observations).await { - Ok(snapshot) => { - if let Some(state) = snapshot.state { - prior_resources = state_resource_digests(&state); - prior_state = Some(state); - } - } - Err(diagnostic) => diagnostics.push(diagnostic), - } - } - - let mut changes = if has_errors(&diagnostics) { - Vec::new() - } else { - diff_resources(&prior_resources, &desired.resource_digests) - }; - if !has_errors(&diagnostics) { - append_policy_binding_changes(&mut changes, prior_state.as_ref(), &desired); - append_embedding_profile_changes(&mut changes, prior_state.as_ref(), &desired); - } - // Plan previews dispositions without sweeping; a pending recovery is - // surfaced as the cluster_recovery_pending warning above instead. 
- let artifacts = backend.list_approval_artifacts(&mut diagnostics).await; - let approved = approved_resources( - &artifacts, - &changes, - &desired.config_digest, - &mut diagnostics, - ); - classify_changes( - &mut changes, - &desired.dependencies, - &BTreeSet::new(), - &approved, - ); - - // Embed real migration steps for schema updates so plan is a data-aware - // preview; failures degrade to the digest diff with a warning. - for change in &mut changes { - if change.operation != PlanOperation::Update { - continue; - } - let ResourceKind::Schema(graph_id) = resource_kind(&change.resource) else { - continue; - }; - let graph_uri = backend.graph_root(&graph_id); - let source_path = desired - .resources - .iter() - .find(|resource| resource.address == change.resource) - .and_then(|resource| resource.path.clone()); - let preview = match source_path { - Some(path) => preview_schema_migration(&graph_uri, &path).await, - None => Err("no schema source recorded".to_string()), - }; - match preview { - Ok(migration) => change.migration = Some(migration), - Err(err) => diagnostics.push(Diagnostic::warning( - "schema_preview_unavailable", - change.resource.clone(), - format!("could not preview the schema migration: {err}"), - )), - } - } - let blast_radius = compute_blast_radius(&changes, &desired.dependencies); - let approvals_required = compute_approvals(&changes, &approved); - let ok = !has_errors(&diagnostics); - - PlanOutput { - ok, - config_dir: display_path(&desired.config_dir), - desired_revision: DesiredRevision { - config_digest: Some(desired.config_digest), - }, - resource_digests: desired.resource_digests, - dependencies: desired.dependencies, - state_observations: observations, - changes, - blast_radius, - approvals_required, - diagnostics, - } -} - -/// Config-only `cluster apply` (Stage 3A): execute the query/policy subset of -/// the plan against the local cluster catalog. 
The plan is recomputed under -/// the state lock, so freshness is structural; the state CAS inside -/// `write_state` is the second fence. Graph/schema changes are never executed -/// here; they are deferred to the graph-lifecycle phase and reported loudly. -/// -/// Payloads are content-addressed and written BEFORE the state CAS because -/// state is the publish point: a failure after payload writes leaves inert -/// digest-named blobs and no success acknowledgement; re-running apply is the -/// repair. -/// Options for `cluster apply`. `actor` attributes graph-moving operations -/// (recorded in sidecars and audit entries, threaded to the engine's -/// `apply_schema_as` so Cedar enforcement fires wherever a policy checker is -/// installed). -#[derive(Debug, Clone, Default)] -pub struct ApplyOptions { - pub actor: Option, -} - -pub async fn apply_config_dir(config_dir: impl AsRef) -> ApplyOutput { - apply_config_dir_with_options(config_dir, ApplyOptions::default()).await -} - -pub async fn apply_config_dir_with_options( - config_dir: impl AsRef, - options: ApplyOptions, -) -> ApplyOutput { - let outcome = load_desired(config_dir.as_ref()); - let mut diagnostics = outcome.diagnostics; - let storage_root = outcome - .desired - .as_ref() - .and_then(|desired| desired.storage_root.clone()); - let backend = match store_for(&outcome.config_dir, storage_root.as_deref()) { - Ok(backend) => backend, - Err(diagnostic) => { - diagnostics.push(diagnostic); - ClusterStore::for_config_dir(&outcome.config_dir) - } - }; - let mut observations = backend.observations(); - - let actor_for_output = options.actor.clone(); - let early_return = |config_dir: String, - config_digest: Option, - observations: StateObservations, - changes: Vec, - resource_statuses: BTreeMap, - diagnostics: Vec| { - ApplyOutput { - ok: !has_errors(&diagnostics), - config_dir, - actor: actor_for_output.clone(), - desired_revision: DesiredRevision { config_digest }, - state_observations: observations, -
changes, - applied_count: 0, - deferred_count: 0, - converged: false, - state_written: false, - resource_statuses, - diagnostics, - } - }; - - let Some(desired) = outcome.desired else { - return early_return( - display_path(&outcome.config_dir), - None, - observations, - Vec::new(), - BTreeMap::new(), - diagnostics, - ); - }; - - if has_errors(&diagnostics) { - return early_return( - display_path(&desired.config_dir), - Some(desired.config_digest), - observations, - Vec::new(), - BTreeMap::new(), - diagnostics, - ); - } - - // Named guard: the lock must be held until the state outcome is recorded. - let _lock_guard = if desired.state_lock { - match backend.acquire_lock("apply", &mut observations).await { - Ok(guard) => Some(guard), - Err(diagnostic) => { - diagnostics.push(diagnostic); - None - } - } - } else { - diagnostics.push(Diagnostic::warning( - "state_lock_disabled", - "state.lock", - "state.lock is false; apply wrote state without acquiring the cluster state lock", - )); - None - }; - - if has_errors(&diagnostics) { - return early_return( - display_path(&desired.config_dir), - Some(desired.config_digest), - observations, - Vec::new(), - BTreeMap::new(), - diagnostics, - ); - } - - let snapshot = match backend.read_state(&mut observations).await { - Ok(snapshot) => snapshot, - Err(diagnostic) => { - diagnostics.push(diagnostic); - return early_return( - display_path(&desired.config_dir), - Some(desired.config_digest), - observations, - Vec::new(), - BTreeMap::new(), - diagnostics, - ); - } - }; - let expected_cas = snapshot.state_cas; - let Some(mut state) = snapshot.state else { - diagnostics.push(Diagnostic::error( - "state_missing", - CLUSTER_STATE_FILE, - "apply requires an existing state.json; run `cluster import` to bootstrap state", - )); - return early_return( - display_path(&desired.config_dir), - Some(desired.config_digest), - observations, - Vec::new(), - BTreeMap::new(), - diagnostics, - ); - }; - - // Snapshot the as-read state BEFORE the sweep 
so sweep mutations count as - // changes for the final dirty check and get persisted by the state CAS. - let before_value = - serde_json::to_value(&state).expect("cluster state must serialize deterministically"); - let sweep = sweep_recovery_sidecars(&backend, &mut state, &mut diagnostics).await; - - let prior_resources = state_resource_digests(&state); - let mut changes = diff_resources(&prior_resources, &desired.resource_digests); - append_policy_binding_changes(&mut changes, Some(&state), &desired); - append_embedding_profile_changes(&mut changes, Some(&state), &desired); - let approval_artifacts = backend.list_approval_artifacts(&mut diagnostics).await; - let approved = approved_resources( - &approval_artifacts, - &changes, - &desired.config_digest, - &mut diagnostics, - ); - classify_changes( - &mut changes, - &desired.dependencies, - &sweep.pending_graphs, - &approved, - ); - - // Defensive invariant: nothing the approval gate covers may be executable - // WITHOUT a matching approval. Gated changes with a valid artifact are the - // sanctioned exception (stage 4C). - let approvals = compute_approvals(&changes, &approved); - let approval_violation = changes.iter().any(|change| { - change.disposition == Some(ApplyDisposition::Applied) - && approvals - .iter() - .any(|approval| approval.resource == change.resource && !approval.satisfied) - }); - if approval_violation { - diagnostics.push(Diagnostic::error( - "apply_approval_invariant_violation", - "changes", - "an executable change requires approval; refusing to apply", - )); - return early_return( - display_path(&desired.config_dir), - Some(desired.config_digest), - observations, - changes, - state.resource_statuses, - diagnostics, - ); - } - - // Graph creates execute first (RFC-004 §D5), sequentially, sidecar-fenced: - // sidecar written before the init, rewritten with the post-init manifest - // version, deleted only after the final state CAS lands.
A failure stops - // further graph-moving work and demotes that graph's dependents. - let source_paths: BTreeMap<&str, &str> = desired - .resources - .iter() - .filter_map(|resource| { - resource - .path - .as_deref() - .map(|path| (resource.address.as_str(), path)) - }) - .collect(); - let graph_creates_to_run: Vec = changes - .iter() - .filter(|change| { - change.disposition == Some(ApplyDisposition::Applied) - && change.operation == PlanOperation::Create - && matches!(resource_kind(&change.resource), ResourceKind::Graph(_)) - }) - .filter_map(|change| change.resource.strip_prefix("graph.").map(str::to_string)) - .collect(); - let mut completed_op_sidecars: Vec = Vec::new(); - let mut failed_graphs: BTreeMap = BTreeMap::new(); - let mut graph_moving_aborted = false; - for graph_id in &graph_creates_to_run { - if graph_moving_aborted { - // A prior create failed: stop graph-moving work (loud partials). - diagnostics.push(Diagnostic::warning( - "graph_create_skipped", - graph_address(graph_id), - "skipped after an earlier graph create failed in this run", - )); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::GraphCreate); - continue; - } - let Some(desired_graph) = desired.graphs.iter().find(|graph| &graph.id == graph_id) else { - continue; - }; - let graph_uri = backend.graph_root(graph_id); - let mut sidecar = RecoverySidecar { - schema_version: 1, - operation_id: Ulid::new().to_string(), - started_at: now_rfc3339(), - actor: options.actor.clone(), - kind: RecoverySidecarKind::GraphCreate, - graph_id: graph_id.clone(), - graph_uri: graph_uri.clone(), - observed_manifest_version: None, - expected_manifest_version: None, - desired_schema_digest: desired_graph.schema_digest.clone(), - state_cas_base: expected_cas.clone(), - approval_id: None, - }; - let sidecar_path = match backend.write_recovery_sidecar(&sidecar).await { - Ok(path) => path, - Err(diagnostic) => { - diagnostics.push(diagnostic); - failed_graphs.insert(graph_id.clone(), 
FailedGraphOrigin::GraphCreate); - graph_moving_aborted = true; - continue; - } - }; - if let Err(diagnostic) = failpoints::maybe_fail(crate::failpoints::names::CLUSTER_APPLY_BEFORE_GRAPH_CREATE) { - // Simulated crash before the init: the sidecar stays for the - // sweep (row 1: root absent -> intent removed next run). - diagnostics.push(diagnostic); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::GraphCreate); - graph_moving_aborted = true; - continue; - } - // Re-read + re-verify the schema source under the lock; the same - // TOCTOU posture as write_resource_payload. - let schema_source = source_paths - .get(schema_address(graph_id).as_str()) - .ok_or_else(|| { - Diagnostic::error( - "graph_create_failed", - graph_address(graph_id), - "no schema source recorded for graph", - ) - }) - .and_then(|path| { - fs::read_to_string(Path::new(path)).map_err(|err| { - Diagnostic::error( - "graph_create_failed", - graph_address(graph_id), - format!("could not read schema source '{path}': {err}"), - ) - }) - }) - .and_then(|source| { - if sha256_hex(source.as_bytes()) == desired_graph.schema_digest { - Ok(source) - } else { - Err(Diagnostic::error( - "resource_content_changed", - schema_address(graph_id), - "schema source changed while apply was running; re-run `cluster apply`", - )) - } - }); - let schema_source = match schema_source { - Ok(source) => source, - Err(diagnostic) => { - diagnostics.push(diagnostic); - backend.delete_object(&sidecar_path).await; // nothing moved - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::GraphCreate); - graph_moving_aborted = true; - continue; - } - }; - match Omnigraph::init(&graph_uri, &schema_source).await { - Ok(_) => {} - Err(err) => { - diagnostics.push(Diagnostic::error( - "graph_create_failed", - graph_address(graph_id), - format!("could not initialize graph at '{graph_uri}': {err}"), - )); - // The sidecar stays: the sweep classifies whether the failed - // init left a partial root (row 5) or nothing
(row 1). - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::GraphCreate); - graph_moving_aborted = true; - continue; - } - } - // Record the post-init pin in the sidecar (best effort; a failure - // here leaves expected = null and the sweep classifies by digest). - if let Ok(db) = Omnigraph::open_read_only(&graph_uri).await { - if let Ok(snapshot) = db.snapshot_of(ReadTarget::branch("main")).await { - sidecar.expected_manifest_version = Some(snapshot.version()); - if let Err(diagnostic) = backend.write_recovery_sidecar(&sidecar).await { - diagnostics.push(diagnostic); - } - } - } - // Crash point: the graph exists, the cluster state does not record it - // yet. A failure here must acknowledge nothing; the next run's sweep - // rolls the ledger forward (row 4). - if let Err(diagnostic) = failpoints::maybe_fail(crate::failpoints::names::CLUSTER_APPLY_AFTER_GRAPH_CREATE) { - diagnostics.push(diagnostic); - return early_return( - display_path(&desired.config_dir), - Some(desired.config_digest), - observations, - changes, - state.resource_statuses, - diagnostics, - ); - } - completed_op_sidecars.push(sidecar_path); - } - - // Schema applies execute next (RFC-004 §D5): the first cluster operation - // that moves an EXISTING graph manifest, sidecar-fenced the same way.
- let schema_updates_to_run: Vec = changes - .iter() - .filter(|change| { - change.disposition == Some(ApplyDisposition::Applied) - && change.operation == PlanOperation::Update - && matches!(resource_kind(&change.resource), ResourceKind::Schema(_)) - }) - .filter_map(|change| change.resource.strip_prefix("schema.").map(str::to_string)) - .collect(); - for graph_id in &schema_updates_to_run { - if graph_moving_aborted { - diagnostics.push(Diagnostic::warning( - "schema_apply_skipped", - schema_address(graph_id), - "skipped after an earlier graph-moving operation failed in this run", - )); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::SchemaApply); - continue; - } - let Some(desired_graph) = desired.graphs.iter().find(|graph| &graph.id == graph_id) else { - continue; - }; - let graph_uri = backend.graph_root(graph_id); - // Read-write open: the engine's own recovery sweep runs here, which - // is exactly what we want before moving its manifest. - let db = match Omnigraph::open(&graph_uri).await { - Ok(db) => db, - Err(err) => { - diagnostics.push(Diagnostic::error( - "schema_apply_failed", - schema_address(graph_id), - format!("could not open graph at '{graph_uri}': {err}"), - )); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::SchemaApply); - graph_moving_aborted = true; - continue; - } - }; - // Re-read + digest-verify the desired schema source before the - // cluster sidecar exists. Parser/planner rejections cannot have - // moved graph state, so they must not leave recovery work behind. 
- let schema_source = source_paths - .get(schema_address(graph_id).as_str()) - .ok_or_else(|| { - Diagnostic::error( - "schema_apply_failed", - schema_address(graph_id), - "no schema source recorded for graph", - ) - }) - .and_then(|path| { - fs::read_to_string(Path::new(path)).map_err(|err| { - Diagnostic::error( - "schema_apply_failed", - schema_address(graph_id), - format!("could not read schema source '{path}': {err}"), - ) - }) - }) - .and_then(|source| { - if sha256_hex(source.as_bytes()) == desired_graph.schema_digest { - Ok(source) - } else { - Err(Diagnostic::error( - "resource_content_changed", - schema_address(graph_id), - "schema source changed while apply was running; re-run `cluster apply`", - )) - } - }); - let schema_source = match schema_source { - Ok(source) => source, - Err(diagnostic) => { - diagnostics.push(diagnostic); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::SchemaApply); - graph_moving_aborted = true; - continue; - } - }; - if let Err(err) = db - .preview_schema_apply_with_options(&schema_source, SchemaApplyOptions::default()) - .await - { - diagnostics.push(Diagnostic::error( - "schema_apply_failed", - schema_address(graph_id), - format!("schema apply is not supported on '{graph_uri}': {err}"), - )); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::SchemaApply); - graph_moving_aborted = true; - continue; - } - let observed_manifest_version = match db.snapshot_of(ReadTarget::branch("main")).await { - Ok(snapshot) => Some(snapshot.version()), - Err(_) => None, - }; - let recorded_schema_digest = state - .applied_revision - .resources - .get(&schema_address(graph_id)) - .map(|entry| entry.digest.clone()); - let mut sidecar = RecoverySidecar { - schema_version: 1, - operation_id: Ulid::new().to_string(), - started_at: now_rfc3339(), - actor: options.actor.clone(), - kind: RecoverySidecarKind::SchemaApply, - graph_id: graph_id.clone(), - graph_uri: graph_uri.clone(), - observed_manifest_version, - 
expected_manifest_version: None, - desired_schema_digest: desired_graph.schema_digest.clone(), - state_cas_base: expected_cas.clone(), - approval_id: None, - }; - let sidecar_path = match backend.write_recovery_sidecar(&sidecar).await { - Ok(path) => path, - Err(diagnostic) => { - diagnostics.push(diagnostic); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::SchemaApply); - graph_moving_aborted = true; - continue; - } - }; - if let Err(diagnostic) = failpoints::maybe_fail(crate::failpoints::names::CLUSTER_APPLY_BEFORE_SCHEMA_APPLY) { - // Simulated crash before the engine call: the sidecar stays; the - // sweep retires it next run (ledger still consistent with live). - diagnostics.push(diagnostic); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::SchemaApply); - graph_moving_aborted = true; - continue; - } - // Soft drops only: allow_data_loss stays false until the approval - // artifacts of stage 4C exist (RFC-004 §D4). - match db - .apply_schema_as( - &schema_source, - SchemaApplyOptions::default(), - options.actor.as_deref(), - ) - .await - { - Ok(result) => { - sidecar.expected_manifest_version = Some(result.manifest_version); - if let Err(diagnostic) = backend.write_recovery_sidecar(&sidecar).await { - diagnostics.push(diagnostic); - } - } - Err(err) => { - diagnostics.push(Diagnostic::error( - "schema_apply_failed", - schema_address(graph_id), - format!("schema apply failed on '{graph_uri}': {err}"), - )); - if live_schema_matches_recorded_digest( - &graph_uri, - recorded_schema_digest.as_deref(), - observed_manifest_version, - ) - .await - { - // Pre-movement rejection: nothing moved, so retire the - // sidecar eagerly. A delete failure leaves it safe (the - // graph is quarantined until the next sweep), but surface - // it so an operator isn't left debugging a silent stick.
- if let Err(err) = backend.try_delete_object(&sidecar_path).await { - diagnostics.push(Diagnostic::warning( - "recovery_sidecar_cleanup_failed", - sidecar_path.clone(), - format!( - "could not delete the stale recovery sidecar after a pre-movement \ - schema-apply rejection; graph `{graph_id}` stays quarantined until \ - a state-mutating cluster command sweeps it: {err}" - ), - )); - } - } - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::SchemaApply); - graph_moving_aborted = true; - continue; - } - } - // Crash point: the manifest moved, the ledger does not record it yet. - // A failure here acknowledges nothing; the sweep rolls forward. - if let Err(diagnostic) = failpoints::maybe_fail(crate::failpoints::names::CLUSTER_APPLY_AFTER_SCHEMA_APPLY) { - diagnostics.push(diagnostic); - return early_return( - display_path(&desired.config_dir), - Some(desired.config_digest), - observations, - changes, - state.resource_statuses, - diagnostics, - ); - } - completed_op_sidecars.push(sidecar_path); - } - - if !failed_graphs.is_empty() { - demote_dependents_of_failed_graphs(&mut changes, &failed_graphs, &desired.dependencies); - } - - for change in &changes { - match change.disposition { - Some(ApplyDisposition::Deferred) => diagnostics.push(Diagnostic::warning( - "apply_unsupported_change", - change.resource.clone(), - "graph/schema changes are not applied in this stage; they are deferred to the graph-lifecycle phase", - )), - Some(ApplyDisposition::Blocked) => diagnostics.push(Diagnostic::warning( - "apply_dependency_blocked", - change.resource.clone(), - format!( - "blocked by an unapplied or missing dependency ({})", - change.reason.as_deref().unwrap_or("dependency") - ), - )), - _ => {} - } - } - - // Payload phase: content-addressed writes before the state CAS. Any - // failure aborts before state moves; blobs already written are inert. - // Gate on payload-phase errors only; sweep errors (e.g.
a kept row-5 - // sidecar) must not abort the run, or their statuses would never persist. - let errors_before_payloads = count_errors(&diagnostics); - for change in &changes { - if change.disposition != Some(ApplyDisposition::Applied) - || change.operation == PlanOperation::Delete - { - continue; - } - let kind = resource_kind(&change.resource); - let digest = change - .after_digest - .as_deref() - .expect("create/update always carries an after digest"); - if ClusterStore::payload_relative(&kind, digest).is_none() { - continue; - } - let Some(source) = source_paths.get(change.resource.as_str()) else { - diagnostics.push(Diagnostic::error( - "resource_payload_write_error", - change.resource.clone(), - "no source file recorded for resource", - )); - continue; - }; - if let Err(diagnostic) = - write_resource_payload(&backend, &kind, Path::new(source), digest, &change.resource) - .await - { - diagnostics.push(diagnostic); - } - } - if count_errors(&diagnostics) > errors_before_payloads { - return early_return( - display_path(&desired.config_dir), - Some(desired.config_digest), - observations, - changes, - state.resource_statuses, - diagnostics, - ); - } - - // Crash point: payloads are on disk, state has not moved. A failure here - // must leave state.json byte-identical and acknowledge nothing; re-running - // apply repairs via the skip-if-exists blob reuse. - if let Err(diagnostic) = failpoints::maybe_fail(crate::failpoints::names::CLUSTER_APPLY_AFTER_PAYLOAD_PHASE) { - diagnostics.push(diagnostic); - return early_return( - display_path(&desired.config_dir), - Some(desired.config_digest), - observations, - changes, - state.resource_statuses, - diagnostics, - ); - } - - // Approved graph deletes execute LAST (RFC-004 §D5): catalog writes for - // surviving resources land first, then the irreversible work.
- let graph_deletes_to_run: Vec = changes - .iter() - .filter(|change| { - change.disposition == Some(ApplyDisposition::Applied) - && change.operation == PlanOperation::Delete - && matches!(resource_kind(&change.resource), ResourceKind::Graph(_)) - }) - .filter_map(|change| change.resource.strip_prefix("graph.").map(str::to_string)) - .collect(); - let mut executed_deletes: Vec<(String, Option)> = Vec::new(); // (graph_id, approval_id) - let mut consumed_approval_ids: Vec = Vec::new(); - for graph_id in &graph_deletes_to_run { - if graph_moving_aborted { - diagnostics.push(Diagnostic::warning( - "graph_delete_skipped", - graph_address(graph_id), - "skipped after an earlier graph-moving operation failed in this run", - )); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::GraphDelete); - continue; - } - let graph_addr = graph_address(graph_id); - // Re-locate the consumable approval (classification verified one exists). - let approval_id = approval_artifacts - .iter() - .map(|(_, artifact)| artifact) - .find(|artifact| { - artifact.consumed_at.is_none() - && artifact.resource == graph_addr - && artifact.bound_config_digest == desired.config_digest - }) - .map(|artifact| artifact.approval_id.clone()); - let graph_uri = backend.graph_root(graph_id); - let observed_manifest_version = match Omnigraph::open_read_only(&graph_uri).await { - Ok(db) => match db.snapshot_of(ReadTarget::branch("main")).await { - Ok(snapshot) => Some(snapshot.version()), - Err(_) => None, - }, - Err(_) => None, // partial/unopenable roots still get deleted - }; - let sidecar = RecoverySidecar { - schema_version: 1, - operation_id: Ulid::new().to_string(), - started_at: now_rfc3339(), - actor: options.actor.clone(), - kind: RecoverySidecarKind::GraphDelete, - graph_id: graph_id.clone(), - graph_uri: graph_uri.clone(), - observed_manifest_version, - expected_manifest_version: None, // no post-op manifest exists - desired_schema_digest: String::new(), - state_cas_base: 
expected_cas.clone(), - approval_id: approval_id.clone(), - }; - let sidecar_path = match backend.write_recovery_sidecar(&sidecar).await { - Ok(path) => path, - Err(diagnostic) => { - diagnostics.push(diagnostic); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::GraphDelete); - graph_moving_aborted = true; - continue; - } - }; - if let Err(diagnostic) = failpoints::maybe_fail(crate::failpoints::names::CLUSTER_APPLY_BEFORE_GRAPH_DELETE) { - // Simulated crash before removal: row 8 retires the intent and - // the still-valid approval lets a later run retry. - diagnostics.push(diagnostic); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::GraphDelete); - graph_moving_aborted = true; - continue; - } - // Prefix delete through the storage layer: remove_dir_all locally, - // list+delete on object stores (idempotent; already-gone is fine). - match backend.delete_graph_root(&graph_uri).await { - Ok(()) => {} - Err(err) => { - diagnostics.push(Diagnostic::error( - "graph_delete_failed", - graph_addr.clone(), - format!("could not remove graph root '{graph_uri}': {err}"), - )); - failed_graphs.insert(graph_id.clone(), FailedGraphOrigin::GraphDelete); - graph_moving_aborted = true; - continue; - } - } - // Crash point: the root is gone, the ledger does not record it yet. - // The sweep rolls forward (row 7b) and consumes the approval. 
- if let Err(diagnostic) = failpoints::maybe_fail(crate::failpoints::names::CLUSTER_APPLY_AFTER_GRAPH_DELETE) { - diagnostics.push(diagnostic); - return early_return( - display_path(&desired.config_dir), - Some(desired.config_digest), - observations, - changes, - state.resource_statuses, - diagnostics, - ); - } - executed_deletes.push((graph_id.clone(), approval_id.clone())); - if let Some(approval_id) = approval_id { - consumed_approval_ids.push(approval_id); - } - completed_op_sidecars.push(sidecar_path); - } - if !failed_graphs.is_empty() { - demote_dependents_of_failed_graphs(&mut changes, &failed_graphs, &desired.dependencies); - } - - // State mutation. Apply owns query/policy statuses only; graph/schema - // statuses belong to refresh/import observation and must not be clobbered - // (the sweep above is the one exception: it owns recovery statuses). - let mut new_state = state.clone(); - for change in &changes { - match change.disposition { - Some(ApplyDisposition::Applied) => match change.operation { - PlanOperation::Create | PlanOperation::Update => { - new_state.applied_revision.resources.insert( - change.resource.clone(), - StateResource { - digest: change - .after_digest - .clone() - .expect("create/update always carries an after digest"), - // Policies record their applied bindings so the - // ledger is serving-sufficient (RFC-005 §D3). - applies_to: desired.policy_bindings.get(&change.resource).cloned(), - embedding_provider: None, - embedding_profile: desired - .embedding_providers - .get(&change.resource) - .cloned(), - }, - ); - set_resource_status_applied(&mut new_state, &change.resource); - } - PlanOperation::Delete => { - new_state - .applied_revision - .resources - .remove(&change.resource); - new_state.resource_statuses.remove(&change.resource); - } - }, - Some(ApplyDisposition::Blocked) => { - // The sweep owns recovery statuses (Drifted/Error with their - // conditions); a generic Blocked must not clobber them.
- if change.reason.as_deref() != Some("cluster_recovery_pending") { - set_resource_status( - &mut new_state, - &change.resource, - ResourceLifecycleStatus::Blocked, - change.reason.as_deref().unwrap_or("dependency_not_applied"), - "waiting on an unapplied or missing dependency", - ); - } - } - _ => {} - } - } - for (graph_id, approval_id) in &executed_deletes { - tombstone_graph_subtree( - &mut new_state, - graph_id, - approval_id.as_deref(), - options.actor.as_deref(), - ); - if let Some(approval_id) = approval_id { - record_approval_consumed(&mut new_state, approval_id, "apply"); - } - } - recompute_state_graph_digests(&mut new_state, &desired); - - let mut residual = diff_resources( - &state_resource_digests(&new_state), - &desired.resource_digests, - ); - append_policy_binding_changes(&mut residual, Some(&new_state), &desired); - append_embedding_profile_changes(&mut residual, Some(&new_state), &desired); - let converged = residual.is_empty(); - if converged { - new_state.applied_revision.config_digest = Some(desired.config_digest.clone()); - } - - let after_value = - serde_json::to_value(&new_state).expect("cluster state must serialize deterministically"); - let mut state_written = false; - let mut state_write_failed = false; - if after_value != before_value { - new_state.state_revision = new_state.state_revision.saturating_add(1); - // The failpoint error routes through state_write_failed so the - // persisted-statuses revert contract below is exercised; a cfg_callback - // on this point can mutate state.json to simulate a concurrent writer, - // making write_state's CAS check fail organically. 
- let write_result = match failpoints::maybe_fail(crate::failpoints::names::CLUSTER_APPLY_BEFORE_STATE_WRITE) { - Ok(()) => { - backend - .write_state(&new_state, expected_cas.as_deref(), &mut observations) - .await - } - Err(diagnostic) => Err(diagnostic), - }; - match write_result { - Ok(()) => state_written = true, - Err(diagnostic) => { - diagnostics.push(diagnostic); - state_write_failed = true; - } - } - } - // Completed (rows 2/4) sweep sidecars are deleted only once their outcome - // is durably recorded; on a failed write they stay and re-sweep next run. - if !state_write_failed { - for sidecar_uri in sweep - .completed_sidecars - .iter() - .chain(completed_op_sidecars.iter()) - { - backend.delete_object(sidecar_uri).await; - } - let mut all_consumed = sweep.consumed_approvals.clone(); - all_consumed.extend(consumed_approval_ids.iter().cloned()); - mark_approvals_consumed(&backend, &all_consumed).await; - } - // On a failed state write, report the statuses that are actually on disk - // (the pre-apply snapshot), not the in-memory mutations that were never - // persisted; automation reading `resource_statuses` independently of `ok` - // must not see phantom status updates.
- let resource_statuses = if state_write_failed { - state.resource_statuses - } else { - new_state.resource_statuses - }; - - let applied_count = changes - .iter() - .filter(|change| change.disposition == Some(ApplyDisposition::Applied)) - .count(); - let deferred_count = changes - .iter() - .filter(|change| { - matches!( - change.disposition, - Some(ApplyDisposition::Deferred) | Some(ApplyDisposition::Blocked) - ) - }) - .count(); - - ApplyOutput { - ok: !has_errors(&diagnostics), - config_dir: display_path(&desired.config_dir), - actor: options.actor.clone(), - desired_revision: DesiredRevision { - config_digest: Some(desired.config_digest), - }, - state_observations: observations, - changes, - applied_count, - deferred_count, - converged, - state_written, - resource_statuses, - diagnostics, - } -} - -/// Record a digest-bound human approval for a gated (irreversible) change, -/// today: graph deletes. The artifact binds to the exact desired config -/// digest and the change's before/after digests, so config or state drift -/// invalidates it automatically (a stale approval can never authorize a -/// different change).
-pub async fn approve_config_dir( - config_dir: impl AsRef, - resource: &str, - approved_by: &str, -) -> ApproveOutput { - let outcome = load_desired(config_dir.as_ref()); - let mut diagnostics = outcome.diagnostics; - let storage_root = outcome - .desired - .as_ref() - .and_then(|desired| desired.storage_root.clone()); - let backend = match store_for(&outcome.config_dir, storage_root.as_deref()) { - Ok(backend) => backend, - Err(diagnostic) => { - diagnostics.push(diagnostic); - ClusterStore::for_config_dir(&outcome.config_dir) - } - }; - let mut observations = backend.observations(); - - let fail = |config_dir: String, diagnostics: Vec| ApproveOutput { - ok: false, - config_dir, - approval_id: None, - resource: None, - operation: None, - approved_by: None, - diagnostics, - }; - - let Some(desired) = outcome.desired else { - return fail(display_path(&outcome.config_dir), diagnostics); - }; - if has_errors(&diagnostics) { - return fail(display_path(&desired.config_dir), diagnostics); - } - - let _lock_guard = if desired.state_lock { - match backend.acquire_lock("approve", &mut observations).await { - Ok(guard) => Some(guard), - Err(diagnostic) => { - diagnostics.push(diagnostic); - return fail(display_path(&desired.config_dir), diagnostics); - } - } - } else { - diagnostics.push(Diagnostic::warning( - "state_lock_disabled", - "state.lock", - "state.lock is false; approve ran without acquiring the cluster state lock", - )); - None - }; - - let state = match backend.read_state(&mut observations).await { - Ok(snapshot) => match snapshot.state { - Some(state) => state, - None => { - diagnostics.push(Diagnostic::error( - "state_missing", - CLUSTER_STATE_FILE, - "approve requires an existing state.json; run `cluster import` first", - )); - return fail(display_path(&desired.config_dir), diagnostics); - } - }, - Err(diagnostic) => { - diagnostics.push(diagnostic); - return fail(display_path(&desired.config_dir), diagnostics); - } - }; - - let prior_resources = 
state_resource_digests(&state); - let changes = diff_resources(&prior_resources, &desired.resource_digests); - let gates = compute_approvals(&changes, &BTreeSet::new()); - let Some(change) = changes.iter().find(|change| { - change.resource == resource && gates.iter().any(|gate| gate.resource == resource) - }) else { - diagnostics.push(Diagnostic::error( - "approval_not_required", - resource, - "no pending change for this resource requires approval (check `cluster plan`)", - )); - return fail(display_path(&desired.config_dir), diagnostics); - }; - - let artifact = ApprovalArtifact { - schema_version: 1, - approval_id: Ulid::new().to_string(), - resource: change.resource.clone(), - operation: match change.operation { - PlanOperation::Create => "create", - PlanOperation::Update => "update", - PlanOperation::Delete => "delete", - } - .to_string(), - reason: gates - .iter() - .find(|gate| gate.resource == resource) - .map(|gate| gate.reason.clone()) - .unwrap_or_default(), - bound_config_digest: desired.config_digest.clone(), - bound_before_digest: change.before_digest.clone(), - bound_after_digest: change.after_digest.clone(), - approved_by: approved_by.to_string(), - created_at: now_rfc3339(), - consumed_at: None, - consumed_by_operation: None, - }; - if let Err(diagnostic) = backend.write_approval_artifact(&artifact).await { - diagnostics.push(diagnostic); - return fail(display_path(&desired.config_dir), diagnostics); - } - - ApproveOutput { - ok: !has_errors(&diagnostics), - config_dir: display_path(&desired.config_dir), - approval_id: Some(artifact.approval_id), - resource: Some(artifact.resource), - operation: Some(change.operation.clone()), - approved_by: Some(artifact.approved_by), - diagnostics, - } -} - -pub async fn status_config_dir(config_dir: impl AsRef) -> StatusOutput { - let parsed = parse_cluster_config(config_dir.as_ref()); - let mut diagnostics = parsed.diagnostics; - let storage_root = parsed.raw.as_ref().and_then(|raw| { - raw.storage - .as_deref() 
- .map(str::trim) - .filter(|root| !root.is_empty()) - .map(|root| root.trim_end_matches('/').to_string()) - }); - let backend = match store_for(&parsed.config_dir, storage_root.as_deref()) { - Ok(backend) => backend, - Err(diagnostic) => { - diagnostics.push(diagnostic); - ClusterStore::for_config_dir(&parsed.config_dir) - } - }; - let mut observations = backend.observations(); - backend - .observe_lock(&mut observations, &mut diagnostics) - .await; - warn_pending_recovery_sidecars(&backend, &mut diagnostics).await; - - let mut resource_digests = BTreeMap::new(); - let mut resource_statuses = BTreeMap::new(); - let mut state_observation_records = BTreeMap::new(); - - if let Some(raw) = parsed.raw.as_ref() { - let _settings = validate_cluster_header(raw, &mut diagnostics); - if !has_errors(&diagnostics) { - match backend.read_state(&mut observations).await { - Ok(snapshot) => { - if let Some(state) = snapshot.state { - // Read-only point-in-time catalog check: report the - // findings as diagnostics; persisting Drifted statuses - // is refresh's job. Status never writes state. 
- for (address, finding) in verify_catalog_payloads(&backend, &state).await { - diagnostics.push(payload_finding_diagnostic(&address, &finding)); - } - resource_digests = state_resource_digests(&state); - resource_statuses = state.resource_statuses; - state_observation_records = state.observations; - } else { - diagnostics.push(Diagnostic::warning( - "state_missing", - CLUSTER_STATE_FILE, - "state.json is missing; no applied cluster revision has been recorded", - )); - } - } - Err(diagnostic) => diagnostics.push(diagnostic), - } - } - } - - StatusOutput { - ok: !has_errors(&diagnostics), - config_dir: display_path(&parsed.config_dir), - state_observations: observations, - resource_digests, - resource_statuses, - observations: state_observation_records, - diagnostics, - } -} - -pub async fn force_unlock_config_dir( - config_dir: impl AsRef, - lock_id: impl AsRef, -) -> ForceUnlockOutput { - let parsed = parse_cluster_config(config_dir.as_ref()); - let mut diagnostics = parsed.diagnostics; - let storage_root = parsed.raw.as_ref().and_then(|raw| { - raw.storage - .as_deref() - .map(str::trim) - .filter(|root| !root.is_empty()) - .map(|root| root.trim_end_matches('/').to_string()) - }); - let backend = match store_for(&parsed.config_dir, storage_root.as_deref()) { - Ok(backend) => backend, - Err(diagnostic) => { - diagnostics.push(diagnostic); - ClusterStore::for_config_dir(&parsed.config_dir) - } - }; - let mut observations = backend.observations(); - let mut lock_removed = false; - - if let Some(raw) = parsed.raw.as_ref() { - let _settings = validate_cluster_header(raw, &mut diagnostics); - if !has_errors(&diagnostics) { - match backend - .force_unlock(lock_id.as_ref(), &mut observations) - .await - { - Ok(()) => lock_removed = true, - Err(diagnostic) => diagnostics.push(diagnostic), - } - } - } - - ForceUnlockOutput { - ok: !has_errors(&diagnostics), - config_dir: display_path(&parsed.config_dir), - state_observations: observations, - lock_removed, - diagnostics, - 
} -} - -pub async fn refresh_config_dir(config_dir: impl AsRef) -> StateSyncOutput { - sync_config_dir(config_dir.as_ref(), StateSyncOperation::Refresh).await -} - -pub async fn import_config_dir(config_dir: impl AsRef) -> StateSyncOutput { - sync_config_dir(config_dir.as_ref(), StateSyncOperation::Import).await -} - -async fn sync_config_dir(config_dir: &Path, operation: StateSyncOperation) -> StateSyncOutput { - let outcome = load_desired(config_dir); - let mut diagnostics = outcome.diagnostics; - let storage_root = outcome - .desired - .as_ref() - .and_then(|desired| desired.storage_root.clone()); - let backend = match store_for(&outcome.config_dir, storage_root.as_deref()) { - Ok(backend) => backend, - Err(diagnostic) => { - diagnostics.push(diagnostic); - ClusterStore::for_config_dir(&outcome.config_dir) - } - }; - let mut observations = backend.observations(); - - let Some(desired) = outcome.desired else { - return StateSyncOutput { - ok: false, - operation, - config_dir: display_path(&outcome.config_dir), - state_observations: observations, - resource_digests: BTreeMap::new(), - resource_statuses: BTreeMap::new(), - observations: BTreeMap::new(), - diagnostics, - }; - }; - - if has_errors(&diagnostics) { - return StateSyncOutput { - ok: false, - operation, - config_dir: display_path(&desired.config_dir), - state_observations: observations, - resource_digests: desired.resource_digests, - resource_statuses: BTreeMap::new(), - observations: BTreeMap::new(), - diagnostics, - }; - } - - let operation_label = state_sync_operation_label(operation); - let _lock_guard = if desired.state_lock { - match backend - .acquire_lock(operation_label, &mut observations) - .await - { - Ok(guard) => Some(guard), - Err(diagnostic) => { - diagnostics.push(diagnostic); - None - } - } - } else { - diagnostics.push(Diagnostic::warning( - "state_lock_disabled", - "state.lock", - format!( - "state.lock is false; {operation_label} wrote state without acquiring the cluster state lock" - 
), - )); - None - }; - - if has_errors(&diagnostics) { - return StateSyncOutput { - ok: false, - operation, - config_dir: display_path(&desired.config_dir), - state_observations: observations, - resource_digests: desired.resource_digests, - resource_statuses: BTreeMap::new(), - observations: BTreeMap::new(), - diagnostics, - }; - } - - let snapshot = match backend.read_state(&mut observations).await { - Ok(snapshot) => snapshot, - Err(diagnostic) => { - diagnostics.push(diagnostic); - return StateSyncOutput { - ok: false, - operation, - config_dir: display_path(&desired.config_dir), - state_observations: observations, - resource_digests: desired.resource_digests, - resource_statuses: BTreeMap::new(), - observations: BTreeMap::new(), - diagnostics, - }; - } - }; - - let expected_cas = snapshot.state_cas; - let mut state = match (operation, snapshot.state) { - (StateSyncOperation::Refresh, Some(state)) => state, - (StateSyncOperation::Refresh, None) => { - diagnostics.push(Diagnostic::error( - "state_missing", - CLUSTER_STATE_FILE, - "refresh requires an existing state.json; run `cluster import` to bootstrap state", - )); - return StateSyncOutput { - ok: false, - operation, - config_dir: display_path(&desired.config_dir), - state_observations: observations, - resource_digests: BTreeMap::new(), - resource_statuses: BTreeMap::new(), - observations: BTreeMap::new(), - diagnostics, - }; - } - (StateSyncOperation::Import, Some(state)) => { - diagnostics.push(Diagnostic::error( - "state_already_exists", - CLUSTER_STATE_FILE, - "import creates initial state only when state.json is missing; use `cluster refresh` for an existing state ledger", - )); - return StateSyncOutput { - ok: false, - operation, - config_dir: display_path(&desired.config_dir), - state_observations: observations, - resource_digests: state_resource_digests(&state), - resource_statuses: state.resource_statuses, - observations: state.observations, - diagnostics, - }; - } - (StateSyncOperation::Import, None) 
=> initial_import_state(&desired), - }; - - // Recovery sweep first (RFC-004 §D3): classify any interrupted graph - // operation before observation/verification so a rolled-forward outcome - // is what those passes see. - let sweep = sweep_recovery_sidecars(&backend, &mut state, &mut diagnostics).await; - - // Catalog payload verification must run BEFORE graph observation: removing - // a drifted query digest first means the live-graph composite recompute - // below already excludes it, so the persisted graph. composite stays - // consistent and the next plan shows exactly the create + derived update. - for (address, finding) in verify_catalog_payloads(&backend, &state).await { - diagnostics.push(payload_finding_diagnostic(&address, &finding)); - match finding { - PayloadFinding::Missing => { - state.applied_revision.resources.remove(&address); - set_resource_status( - &mut state, - &address, - ResourceLifecycleStatus::Drifted, - "payload_missing", - "catalog payload blob is missing; re-run `cluster apply` to republish", - ); - } - PayloadFinding::Mismatch { .. } => { - state.applied_revision.resources.remove(&address); - set_resource_status( - &mut state, - &address, - ResourceLifecycleStatus::Drifted, - "payload_mismatch", - "catalog payload blob does not match the recorded digest; re-run `cluster apply` to republish", - ); - } - // Transient IO must not trigger a spurious republish: keep the - // digest, surface the error, let a later clean refresh converge.
- PayloadFinding::ReadError(error) => { - set_resource_status( - &mut state, - &address, - ResourceLifecycleStatus::Error, - "payload_read_error", - &error, - ); - } - } - } - - let graph_error_count = observe_declared_graphs(&desired, &backend, &mut state).await; - if graph_error_count > 0 { - diagnostics.push(Diagnostic::error( - "graph_observation_error", - CLUSTER_GRAPHS_DIR, - format!("{graph_error_count} graph observation(s) failed"), - )); - } - - if operation == StateSyncOperation::Import && has_errors(&diagnostics) { - return StateSyncOutput { - ok: false, - operation, - config_dir: display_path(&desired.config_dir), - state_observations: observations, - resource_digests: state_resource_digests(&state), - resource_statuses: state.resource_statuses, - observations: state.observations, - diagnostics, - }; - } - - if operation == StateSyncOperation::Import { - state.state_revision = 1; - } else { - state.state_revision = state.state_revision.saturating_add(1); - } - - match backend - .write_state(&state, expected_cas.as_deref(), &mut observations) - .await - { - Ok(()) => { - // Completed sweep sidecars are deleted only after their outcome - // is durably recorded; on failure they stay and re-sweep. 
- for sidecar_uri in &sweep.completed_sidecars { - backend.delete_object(sidecar_uri).await; - } - mark_approvals_consumed(&backend, &sweep.consumed_approvals).await; - } - Err(diagnostic) => diagnostics.push(diagnostic), - } - - let resource_digests = state_resource_digests(&state); - let ok = !has_errors(&diagnostics); - - StateSyncOutput { - ok, - operation, - config_dir: display_path(&desired.config_dir), - state_observations: observations, - resource_digests, - resource_statuses: state.resource_statuses, - observations: state.observations, - diagnostics, - } -} - -#[derive(Debug, PartialEq, Eq)] -enum PayloadFinding { - Missing, - Mismatch { actual_digest: String }, - ReadError(String), -} - -/// Verify every catalog-backed resource digest in state against its -/// content-addressed blob under `__cluster/resources/`. Graph, schema, and -/// unknown addresses have no payloads and are skipped. Read-only; findings -/// are deterministic (BTreeMap order). Payloads are small (queries, policy -/// bundles), so a full digest re-hash is cheap. 
-async fn verify_catalog_payloads( - backend: &ClusterStore, - state: &ClusterState, -) -> Vec<(String, PayloadFinding)> { - let mut findings = Vec::new(); - for (address, resource) in &state.applied_revision.resources { - let kind = resource_kind(address); - if ClusterStore::payload_relative(&kind, &resource.digest).is_none() { - continue; - } - match backend.read_payload(&kind, &resource.digest).await { - Ok(Some(text)) => { - let actual_digest = sha256_hex(text.as_bytes()); - if actual_digest != resource.digest { - findings.push((address.clone(), PayloadFinding::Mismatch { actual_digest })); - } - } - Ok(None) => findings.push((address.clone(), PayloadFinding::Missing)), - Err(err) => { - findings.push((address.clone(), PayloadFinding::ReadError(err))); - } - } - } - findings -} - -fn payload_finding_diagnostic(address: &str, finding: &PayloadFinding) -> Diagnostic { - match finding { - PayloadFinding::Missing => Diagnostic::warning( - "catalog_payload_missing", - address, - "catalog payload blob is missing; re-run `cluster apply` to republish", - ), - PayloadFinding::Mismatch { actual_digest } => Diagnostic::warning( - "catalog_payload_mismatch", - address, - format!( - "catalog payload blob does not match the recorded digest (actual sha256:{actual_digest}); re-run `cluster apply` to republish" - ), - ), - // An unverifiable blob must not report healthy. - PayloadFinding::ReadError(error) => { - Diagnostic::error("catalog_payload_read_error", address, error.clone()) - } - } -} - -/// Write one content-addressed payload blob. Idempotent: an existing -/// digest-named file is trusted as-is. The digest re-check is the apply-side -/// TOCTOU detector — the source file changing between `load_desired` and the -/// payload write must fail loudly, never publish mismatched content.
-async fn write_resource_payload( - backend: &ClusterStore, - kind: &ResourceKind, - source: &Path, - expected_digest: &str, - resource: &str, -) -> Result<(), Diagnostic> { - if backend.payload_exists(kind, expected_digest).await { - // Content-addressed: an existing digest-named object is identical. - return Ok(()); - } - let bytes = fs::read(source).map_err(|err| { - Diagnostic::error( - "resource_payload_write_error", - resource, - format!( - "could not read resource source '{}': {err}", - source.display() - ), - ) - })?; - if sha256_hex(&bytes) != expected_digest { - // The apply-side TOCTOU detector: the source changing between - // load_desired and this write must fail loudly, never publish - // mismatched content. - return Err(Diagnostic::error( - "resource_content_changed", - resource, - format!( - "resource source '{}' changed while apply was running; re-run `cluster apply`", - source.display() - ), - )); - } - let content = String::from_utf8(bytes).map_err(|err| { - Diagnostic::error( - "resource_payload_write_error", - resource, - format!("resource source is not valid UTF-8: {err}"), - ) - })?; - backend - .write_payload(kind, expected_digest, &content) - .await - .map_err(|err| { - Diagnostic::error( - "resource_payload_write_error", - resource, - format!("could not write payload: {err}"), - ) - }) -} - -/// Recompute the composite `graph.` digests for state-resident graphs from -/// state's own schema/query components. Without this, an applied query change -/// would leave the prior composite digest in state and `graph.` would show -/// a phantom update in every later plan — apply could never converge.
-fn recompute_state_graph_digests(state: &mut ClusterState, desired: &DesiredCluster) { - for graph in &desired.graphs { - let graph_address = graph_address(&graph.id); - if !state - .applied_revision - .resources - .contains_key(&graph_address) - { - continue; - } - let schema_digest = state - .applied_revision - .resources - .get(&schema_address(&graph.id)) - .map(|resource| resource.digest.clone()); - let query_digests = state_query_digests_for_graph(state, &graph.id); - let embedding_provider = graph.embedding_provider.as_deref(); - let embedding_provider_digest = embedding_provider - .and_then(|address| state.applied_revision.resources.get(address)) - .map(|resource| resource.digest.clone()); - let digest = graph_digest( - &graph.id, - schema_digest.as_ref(), - Some(&query_digests), - embedding_provider, - embedding_provider_digest.as_ref(), - ); - state.applied_revision.resources.insert( - graph_address, - StateResource { - digest, - applies_to: None, - embedding_provider: graph.embedding_provider.clone(), - embedding_profile: None, - }, - ); - } -} - -fn duplicate_key_diagnostics(text: &str) -> Vec { - #[derive(Debug)] - struct Frame { - indent: isize, - path: String, - keys: BTreeSet, - } - - let mut diagnostics = Vec::new(); - let mut stack = vec![Frame { - indent: -1, - path: String::new(), - keys: BTreeSet::new(), - }]; - - for (line_idx, line) in text.lines().enumerate() { - let line_without_comment = strip_comment(line); - if line_without_comment.trim().is_empty() { - continue; - } - let indent = line_without_comment - .chars() - .take_while(|ch| *ch == ' ') - .count() as isize; - let trimmed = line_without_comment.trim_start(); - if trimmed.starts_with('-') { - continue; - } - let Some((raw_key, raw_value)) = trimmed.split_once(':') else { - continue; - }; - let key = raw_key.trim(); - if key.is_empty() || key.starts_with('{') || key.starts_with('[') { - continue; - } - - while stack.last().is_some_and(|frame| indent <= frame.indent) { - stack.pop(); 
- } - let parent = stack.last_mut().expect("root frame is always present"); - let full_path = if parent.path.is_empty() { - key.to_string() - } else { - format!("{}.{}", parent.path, key) - }; - if !parent.keys.insert(key.to_string()) { - diagnostics.push(Diagnostic::error( - "duplicate_yaml_key", - full_path.clone(), - format!("duplicate YAML key `{key}` on line {}", line_idx + 1), - )); - } - if raw_value.trim().is_empty() { - stack.push(Frame { - indent, - path: full_path, - keys: BTreeSet::new(), - }); - } - } - - diagnostics -} - -fn strip_comment(line: &str) -> String { - let mut in_single_quote = false; - let mut in_double_quote = false; - let mut escaped = false; - - for (idx, ch) in line.char_indices() { - if escaped { - escaped = false; - continue; - } - match ch { - '\\' if in_double_quote => escaped = true, - '\'' if !in_double_quote => in_single_quote = !in_single_quote, - '"' if !in_single_quote => in_double_quote = !in_double_quote, - '#' if !in_single_quote && !in_double_quote => return line[..idx].to_string(), - _ => {} - } - } - - line.to_string() -} - -fn state_query_digests_for_graph(state: &ClusterState, graph_id: &str) -> BTreeMap { - let prefix = format!("query.{graph_id}."); - state - .applied_revision - .resources - .iter() - .filter_map(|(address, resource)| { - address - .strip_prefix(&prefix) - .map(|name| (name.to_string(), resource.digest.clone())) - }) - .collect() -} - -fn state_graph_embedding_provider(state: &ClusterState, graph_id: &str) -> Option { - state - .applied_revision - .resources - .get(&graph_address(graph_id)) - .and_then(|resource| resource.embedding_provider.clone()) -} - -fn state_embedding_provider_digest( - state: &ClusterState, - embedding_provider: Option<&str>, -) -> Option { - embedding_provider - .and_then(|address| state.applied_revision.resources.get(address)) - .map(|resource| resource.digest.clone()) -} - -fn set_resource_status_applied(state: &mut ClusterState, address: &str) { - 
state.resource_statuses.insert( - address.to_string(), - ResourceStatusRecord { - status: ResourceLifecycleStatus::Applied, - conditions: Vec::new(), - message: None, - }, - ); -} - -fn set_resource_status( - state: &mut ClusterState, - address: &str, - status: ResourceLifecycleStatus, - condition: &str, - message: &str, -) { - state.resource_statuses.insert( - address.to_string(), - ResourceStatusRecord { - status, - conditions: vec![condition.to_string()], - message: Some(message.to_string()), - }, - ); -} - -fn graph_digest( - graph_id: &str, - schema_digest: Option<&String>, - query_digests: Option<&BTreeMap>, - embedding_provider: Option<&str>, - embedding_provider_digest: Option<&String>, -) -> String { - let mut input = format!( - "graph\0{graph_id}\0schema\0{}\0", - schema_digest.map_or("", String::as_str) - ); - if let Some(query_digests) = query_digests { - for (name, digest) in query_digests { - input.push_str("query\0"); - input.push_str(name); - input.push('\0'); - input.push_str(digest); - input.push('\0'); - } - } - if let Some(provider) = embedding_provider { - input.push_str("embedding_provider\0"); - input.push_str(provider); - input.push('\0'); - input.push_str(embedding_provider_digest.map_or("", String::as_str)); - input.push('\0'); - } - sha256_hex(input.as_bytes()) -} - -fn embedding_provider_digest(profile: &EmbeddingProviderConfig) -> String { - let mut input = String::from("embedding-provider\0"); - let config_semantics = - serde_json::to_string(profile).expect("embedding provider config must serialize"); - input.push_str(&config_semantics); - sha256_hex(input.as_bytes()) -} - -async fn live_schema_matches_recorded_digest( - graph_uri: &str, - recorded_schema_digest: Option<&str>, - observed_manifest_version: Option, -) -> bool { - let Some(recorded_schema_digest) = recorded_schema_digest else { - return false; - }; - let Some(observed_manifest_version) = observed_manifest_version else { - return false; - }; - let Ok(db) = 
Omnigraph::open_read_only(graph_uri).await else { - return false; - }; - let Ok(snapshot) = db.snapshot_of(ReadTarget::branch("main")).await else { - return false; - }; - if snapshot.version() != observed_manifest_version { - return false; - } - sha256_hex(db.schema_source().as_bytes()) == recorded_schema_digest -} - -fn desired_config_digest( - raw: &RawClusterConfig, - resource_digests: &BTreeMap, -) -> String { - let mut input = String::from("cluster-config\0"); - // Hash parsed semantics, not raw YAML bytes, so comments and formatting do - // not create a new desired revision and the digest cannot drift from parse. - let config_semantics = - serde_json::to_string(raw).expect("raw cluster config must serialize deterministically"); - input.push_str(&config_semantics); - input.push('\0'); - for (address, digest) in resource_digests { - input.push_str(address); - input.push('\0'); - input.push_str(digest); - input.push('\0'); - } - sha256_hex(input.as_bytes()) -} - -fn sha256_hex(bytes: &[u8]) -> String { - let digest = Sha256::digest(bytes); - const HEX: &[u8; 16] = b"0123456789abcdef"; - let mut out = String::with_capacity(digest.len() * 2); - for byte in digest { - out.push(HEX[(byte >> 4) as usize] as char); - out.push(HEX[(byte & 0x0f) as usize] as char); - } - out -} - -fn now_rfc3339() -> String { - OffsetDateTime::now_utc() - .format(&Rfc3339) - .unwrap_or_else(|_| "1970-01-01T00:00:00Z".to_string()) -} - -fn lock_age_seconds(created_at: &str) -> Option { - let created_at = OffsetDateTime::parse(created_at, &Rfc3339).ok()?; - Some( - (OffsetDateTime::now_utc() - created_at) - .whole_seconds() - .max(0) as u64, - ) -} - -fn state_sync_operation_label(operation: StateSyncOperation) -> &'static str { - match operation { - StateSyncOperation::Refresh => "refresh", - StateSyncOperation::Import => "import", - } -} - -fn has_errors(diagnostics: &[Diagnostic]) -> bool { - diagnostics - .iter() - .any(|diagnostic| diagnostic.severity == DiagnosticSeverity::Error) -} 
- -fn count_errors(diagnostics: &[Diagnostic]) -> usize { - diagnostics - .iter() - .filter(|diagnostic| diagnostic.severity == DiagnosticSeverity::Error) - .count() -} - -fn display_path(path: &Path) -> String { - path.display().to_string() -} - -#[cfg(test)] -#[path = "tests.rs"] -mod tests; diff --git a/crates/omnigraph-cluster/src/serve.rs b/crates/omnigraph-cluster/src/serve.rs deleted file mode 100644 index 54d3017..0000000 --- a/crates/omnigraph-cluster/src/serve.rs +++ /dev/null @@ -1,478 +0,0 @@ -//! Phase-5 serving snapshot: the read-only loader a `--cluster` server -//! boots from (moved verbatim from lib.rs in the modularization). - -use super::*; - -/// One graph in a serving snapshot: its id and on-disk root. -#[derive(Debug, Clone)] -pub struct ServingGraph { - pub graph_id: String, - pub root: PathBuf, - pub embedding: Option, -} - -/// One stored query: its graph binding, registry name, and verified source. -#[derive(Debug, Clone)] -pub struct ServingQuery { - pub graph_id: String, - pub name: String, - pub source: String, -} - -/// One policy bundle: its verified catalog blob path and applied bindings -/// (normalized typed refs: `cluster` | `graph.`). -#[derive(Debug, Clone)] -pub struct ServingPolicy { - pub name: String, - /// The policy bundle CONTENT, digest-verified against the applied - /// revision at read time. Content, not a path: the catalog may live on - /// object storage, and the server must not re-read mutable state. - pub source: String, - pub applies_to: Vec, -} - -/// Everything a server needs to boot from the cluster catalog (RFC-005 §D2). -#[derive(Debug, Clone)] -pub struct ServingSnapshot { - pub graphs: Vec, - pub queries: Vec, - pub policies: Vec, - pub diagnostics: Vec, -} - -/// Read the applied revision as a serving snapshot — the read-only loader for -/// the Phase-5 server boot.
Cluster-global readiness failures are still -/// all-or-nothing, but graph-attributed pending recovery sidecars quarantine -/// only that graph so healthy graphs can continue serving. This loader never -/// runs a recovery sweep. -/// Takes no lock: the state file is replaced atomically, so this reads a -/// consistent point-in-time ledger. -pub async fn read_serving_snapshot( - config_dir: impl AsRef, -) -> Result> { - let config_dir = config_dir.as_ref().to_path_buf(); - // The declared storage: root decides where the ledger/catalog/graphs - // live; config parse errors surface through the normal validation path. - let parsed = parse_cluster_config(&config_dir); - let storage_root = parsed.raw.as_ref().and_then(|raw| { - raw.storage - .as_deref() - .map(str::trim) - .filter(|root| !root.is_empty()) - .map(|root| root.trim_end_matches('/').to_string()) - }); - let backend = match storage_root.as_deref() { - Some(root) => match ClusterStore::for_storage_root(root) { - Ok(backend) => backend, - Err(diagnostic) => return Err(vec![diagnostic]), - }, - None => ClusterStore::for_config_dir(&config_dir), - }; - read_snapshot_with_store(backend).await -} - -/// Read the applied revision directly from a storage root URI — config-free -/// serving: a `--cluster s3://bucket/prefix` server needs no local files at -/// all, only the bucket and credentials. The ledger and catalog ARE the -/// deployment artifact. -pub async fn read_serving_snapshot_from_storage( - storage_root: &str, -) -> Result> { - let backend = - ClusterStore::for_storage_root(storage_root).map_err(|diagnostic| vec![diagnostic])?; - read_snapshot_with_store(backend).await -} - -/// Cluster root for a graph **storage URI** of the cluster layout -/// (`/graphs/.omni`), if `` is actually a cluster (holds -/// `__cluster/state.json`); otherwise `None`. Used by the CLI to refuse -/// `init` into a cluster-managed location — graphs there are created by -/// `cluster apply`, not `init`.
-/// -/// Cheap by construction: a URI that does not match the `/graphs/.omni` -/// shape returns `None` without any I/O, so ordinary `init` targets -/// (`./kb.omni`, `s3://bucket/kb.omni`) never probe storage. Works for -/// `file://` and `s3://` via the storage adapter. -pub async fn cluster_root_for_graph_uri(graph_uri: &str) -> Option { - let root = cluster_root_of_graph_layout(graph_uri)?; - let store = ClusterStore::for_storage_root(&root).ok()?; - store - .has_state() - .await - .then(|| store.display_root().to_string()) -} - -/// Resolve a graph's **storage URI** (`/graphs/.omni`) from a cluster's -/// applied state ledger — the lightweight path for storage-plane maintenance -/// (`optimize`/`repair`/`cleanup`). -/// -/// Unlike [`read_serving_snapshot`], this deliberately does NOT validate catalog -/// payloads or recovery readiness: maintenance only needs the derivable graph -/// root, and must not be blocked by an unrelated corrupt policy/query blob or a -/// pending recovery sweep — a degraded cluster is exactly when an operator -/// reaches for `repair`. It reads the state ledger, confirms the graph is in the -/// applied revision, and returns `graph_root(id)`. -/// -/// `cluster` is a config directory or a storage-root URI (`s3://…`, config-free), -/// mirroring the server's `--cluster` dispatch.
-pub async fn resolve_graph_storage_uri(cluster: &str, graph_id: &str) -> Result { - let backend = open_cluster_backend(cluster)?; - let mut observations = backend.observations(); - let snapshot = backend.read_state(&mut observations).await?; - let state = snapshot.state.ok_or_else(|| missing_state_diagnostic(cluster))?; - let address = format!("graph.{graph_id}"); - if !state.applied_revision.resources.contains_key(&address) { - let applied = applied_graph_ids(&state); - return Err(Diagnostic::error( - "graph_not_applied", - address, - format!( - "graph `{graph_id}` is not applied in cluster `{cluster}` (applied graphs: [{}]); \ - declare it in cluster.yaml and run `cluster apply`, or check the id", - applied.join(", ") - ), - )); - } - Ok(backend.graph_root(graph_id)) -} - -/// List the graph ids applied in a cluster's served state (sorted). Reads the -/// ledger only — no catalog validation — like `resolve_graph_storage_uri`, so -/// it works on a degraded cluster. Used to enumerate candidates when no -/// `--graph` is selected (RFC-011 Decision 7).
-pub async fn cluster_graph_ids(cluster: &str) -> Result, Diagnostic> { - let backend = open_cluster_backend(cluster)?; - let mut observations = backend.observations(); - let snapshot = backend.read_state(&mut observations).await?; - let state = snapshot.state.ok_or_else(|| missing_state_diagnostic(cluster))?; - Ok(applied_graph_ids(&state)) -} - -fn open_cluster_backend(cluster: &str) -> Result { - if cluster.contains("://") { - ClusterStore::for_storage_root(cluster) - } else { - Ok(ClusterStore::for_config_dir(Path::new(cluster))) - } -} - -fn missing_state_diagnostic(cluster: &str) -> Diagnostic { - Diagnostic::error( - "cluster_state_missing", - CLUSTER_STATE_FILE, - format!("cluster `{cluster}` has no applied state; run `cluster apply` first"), - ) -} - -fn applied_graph_ids(state: &crate::types::ClusterState) -> Vec { - let mut ids: Vec = state - .applied_revision - .resources - .keys() - .filter_map(|a| a.strip_prefix("graph.")) - .map(str::to_string) - .collect(); - ids.sort(); - ids -} - -/// Split `/graphs/.omni` → ``, gating on the exact cluster -/// graph-layout shape (a single `` segment, no nested path). `None` for -/// anything else — no I/O is done for non-cluster-shaped URIs. -fn cluster_root_of_graph_layout(graph_uri: &str) -> Option { - let trimmed = graph_uri.trim_end_matches('/'); - let rest = trimmed.strip_suffix(".omni")?; - let (root, id) = rest.rsplit_once("/graphs/")?; - if root.is_empty() || id.is_empty() || id.contains('/') { - return None; - } - Some(root.to_string()) -} - -async fn read_snapshot_with_store( - backend: ClusterStore, -) -> Result> { - let mut diagnostics: Vec = Vec::new(); - let mut startup_diagnostics: Vec = Vec::new(); - let mut quarantined_graphs: BTreeSet = BTreeSet::new(); - - // Do not sweep at serve time. Valid graph-attributed sidecars quarantine - // that graph; malformed/unattributable sidecars remain cluster-fatal - // because serving cannot prove their blast radius.
- let sidecar_diag_start = diagnostics.len(); - let sidecars = backend.list_recovery_sidecars(&mut diagnostics).await; - // Every diagnostic `list_recovery_sidecars` appends is a genuine - // read/parse/version failure (emitted as a warning by `store::list_json_dir`) - // whose blast radius serving cannot prove — promote each to a cluster-fatal - // error. This depends on that listing only ever emitting failure diagnostics; - // if it grows a benign/informational one, promote by code instead. - for diagnostic in diagnostics.iter_mut().skip(sidecar_diag_start) { - diagnostic.severity = DiagnosticSeverity::Error; - } - for (path, sidecar) in sidecars { - if sidecar.graph_id.trim().is_empty() { - diagnostics.push(Diagnostic::error( - "cluster_recovery_unattributed", - path, - "recovery sidecar has no graph id; run a state-mutating cluster command to sweep it before serving", - )); - continue; - } - quarantined_graphs.insert(sidecar.graph_id.clone()); - startup_diagnostics.push(Diagnostic::warning( - "cluster_recovery_pending", - graph_address(&sidecar.graph_id), - format!( - "graph `{}` is quarantined because interrupted operation `{}` awaits recovery; run any state-mutating cluster command (e.g.
`cluster apply`) to sweep", - sidecar.graph_id, sidecar.operation_id - ), - )); - } - if has_errors(&diagnostics) { - return Err(diagnostics); - } - - let mut observations = backend.observations(); - let state = match backend.read_state(&mut observations).await { - Ok(snapshot) => match snapshot.state { - Some(state) => Some(state), - None => { - diagnostics.push(Diagnostic::error( - "cluster_state_missing", - CLUSTER_STATE_FILE, - "no cluster state ledger; run `cluster import` and `cluster apply` first", - )); - None - } - }, - Err(diagnostic) => { - diagnostics.push(diagnostic); - None - } - }; - let Some(state) = state else { - diagnostics.extend(startup_diagnostics); - return Err(diagnostics); - }; - - let required_embedding_providers: BTreeSet = state - .applied_revision - .resources - .iter() - .filter_map(|(address, entry)| match resource_kind(address) { - ResourceKind::Graph(graph_id) if !quarantined_graphs.contains(&graph_id) => { - entry.embedding_provider.clone() - } - _ => None, - }) - .collect(); - let mut embedding_profiles: BTreeMap = BTreeMap::new(); - for (address, entry) in &state.applied_revision.resources { - if !matches!(resource_kind(address), ResourceKind::EmbeddingProvider(_)) { - continue; - } - if !required_embedding_providers.contains(address) { - continue; - } - let Some(profile) = entry.embedding_profile.clone() else { - diagnostics.push(Diagnostic::error( - "embedding_provider_profile_missing", - address.clone(), - "no applied embedding provider profile recorded; re-run `cluster apply` to backfill", - )); - continue; - }; - let actual_digest = embedding_provider_digest(&profile); - if actual_digest != entry.digest { - diagnostics.push(Diagnostic::error( - "embedding_provider_digest_mismatch", - address.clone(), - format!( - "applied embedding provider profile does not match its recorded digest (actual sha256:{actual_digest}); run `cluster refresh` then `cluster apply`, and restart" - ), - )); - continue; - } - 
embedding_profiles.insert(address.clone(), profile); - } - - let mut graphs = Vec::new(); - let mut queries = Vec::new(); - let mut policies = Vec::new(); - let mut saw_applied_graph = false; - for (address, entry) in &state.applied_revision.resources { - match resource_kind(address) { - ResourceKind::Graph(graph_id) => { - saw_applied_graph = true; - if quarantined_graphs.contains(&graph_id) { - continue; - } - let embedding = match entry.embedding_provider.as_deref() { - Some(provider_address) => match resource_kind(provider_address) { - ResourceKind::EmbeddingProvider(_) => { - match embedding_profiles.get(provider_address) { - Some(profile) => Some(profile.clone()), - None => { - diagnostics.push(Diagnostic::error( - "embedding_provider_missing", - address.clone(), - format!( - "graph references `{provider_address}`, but no applied embedding provider profile is available; re-run `cluster apply`" - ), - )); - None - } - } - } - _ => { - diagnostics.push(Diagnostic::error( - "wrong_kind_reference", - address.clone(), - format!( - "graph embedding_provider expects `provider.embedding.`, got `{provider_address}`" - ), - )); - None - } - }, - None => None, - }; - graphs.push(ServingGraph { - root: PathBuf::from(backend.graph_root(&graph_id)), - graph_id, - embedding, - }); - } - ResourceKind::Schema(_) => {} - kind @ ResourceKind::Query { .. 
} => { - let ResourceKind::Query { graph, name } = &kind else { - unreachable!() - }; - if quarantined_graphs.contains(graph) { - continue; - } - match backend - .read_verified_payload(&kind, &entry.digest, address) - .await - { - Ok(source) => queries.push(ServingQuery { - graph_id: graph.clone(), - name: name.clone(), - source, - }), - Err(diagnostic) => diagnostics.push(diagnostic), - } - } - kind @ ResourceKind::Policy(_) => { - let ResourceKind::Policy(name) = &kind else { - unreachable!() - }; - let Some(applies_to) = entry.applies_to.clone() else { - diagnostics.push(Diagnostic::error( - "policy_bindings_missing", - address.clone(), - "no applied applies_to bindings recorded (ledger predates binding metadata); re-run `cluster apply` to backfill", - )); - continue; - }; - let applies_to: Vec = applies_to - .into_iter() - .filter(|binding| { - binding - .strip_prefix("graph.") - .is_none_or(|graph| !quarantined_graphs.contains(graph)) - }) - .collect(); - if applies_to.is_empty() { - continue; - } - match backend - .read_verified_payload(&kind, &entry.digest, address) - .await - { - Ok(source) => policies.push(ServingPolicy { - name: name.clone(), - source, - applies_to, - }), - Err(diagnostic) => diagnostics.push(diagnostic), - } - } - ResourceKind::EmbeddingProvider(_) => {} - ResourceKind::Unknown => {} - } - } - - if graphs.is_empty() { - if saw_applied_graph && !quarantined_graphs.is_empty() { - diagnostics.push(Diagnostic::error( - "cluster_no_healthy_graphs", - CLUSTER_RECOVERIES_DIR, - "all applied graphs are quarantined by pending recovery sidecars; run any state-mutating cluster command (e.g. 
`cluster apply`) to sweep, then retry", - )); - } else { - diagnostics.push(Diagnostic::error( - "cluster_empty", - CLUSTER_STATE_FILE, - "the applied revision records no graphs; apply a cluster with at least one graph before serving from it", - )); - } - } - if has_errors(&diagnostics) { - diagnostics.extend(startup_diagnostics); - return Err(diagnostics); - } - Ok(ServingSnapshot { - graphs, - queries, - policies, - diagnostics: startup_diagnostics, - }) -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn graph_layout_gating_does_no_io_for_non_cluster_shapes() { - // Only `/graphs/.omni` matches; everything else is None. - assert_eq!( - cluster_root_of_graph_layout("/data/cluster/graphs/kb.omni").as_deref(), - Some("/data/cluster") - ); - assert_eq!( - cluster_root_of_graph_layout("s3://bucket/prefix/graphs/kb.omni").as_deref(), - Some("s3://bucket/prefix") - ); - assert_eq!(cluster_root_of_graph_layout("./kb.omni"), None); - assert_eq!(cluster_root_of_graph_layout("s3://bucket/kb.omni"), None); - // nested id under graphs/ is not the cluster layout - assert_eq!(cluster_root_of_graph_layout("/c/graphs/a/b.omni"), None); - // not a .omni graph - assert_eq!(cluster_root_of_graph_layout("/c/graphs/kb"), None); - } - - #[tokio::test] - async fn cluster_root_detected_only_when_state_ledger_present() { - let temp = tempfile::tempdir().unwrap(); - let root = temp.path(); - std::fs::create_dir_all(root.join("graphs")).unwrap(); - let graph_uri = format!("{}/graphs/kb.omni", root.to_string_lossy()); - - // No __cluster/state.json yet → not a cluster. - assert_eq!(cluster_root_for_graph_uri(&graph_uri).await, None); - - // Lay down the state ledger → now it's a cluster-managed location. 
- std::fs::create_dir_all(root.join("__cluster")).unwrap(); - std::fs::write(root.join(CLUSTER_STATE_FILE), "{}").unwrap(); - let detected = cluster_root_for_graph_uri(&graph_uri).await; - assert!(detected.is_some(), "expected cluster root to be detected"); - - // A non-cluster-shaped target never probes and is always None. - assert_eq!( - cluster_root_for_graph_uri(&format!("{}/plain.omni", root.to_string_lossy())).await, - None - ); - } -} diff --git a/crates/omnigraph-cluster/src/store.rs b/crates/omnigraph-cluster/src/store.rs deleted file mode 100644 index 9136f5e..0000000 --- a/crates/omnigraph-cluster/src/store.rs +++ /dev/null @@ -1,842 +0,0 @@ -//! The cluster's storage layer: every stored byte (state ledger, lock, -//! recovery sidecars, approval artifacts, catalog payloads) goes through the -//! engine's `StorageAdapter`, so `file://` and `s3://` are one code path -//! (RFC-006). Declared configuration — `cluster.yaml` and the schema/query/ -//! policy sources it references — deliberately does NOT live here: config is -//! read from the operator's working tree (Terraform's config-local / -//! state-remote split). -//! -//! Raw `fs::*` for cluster state outside this module is a deny-list entry. - -use std::path::Path; -use std::process; -use std::sync::Arc; - -use omnigraph::storage::{StorageAdapter, StorageKind, storage_for_uri, storage_kind_for_uri}; -use time::OffsetDateTime; -use time::format_description::well_known::Rfc3339; -use ulid::Ulid; - -use crate::{ - ApprovalArtifact, CLUSTER_APPROVALS_DIR, CLUSTER_LOCK_FILE, CLUSTER_RECOVERIES_DIR, - CLUSTER_RESOURCES_DIR, CLUSTER_STATE_FILE, ClusterState, Diagnostic, RecoverySidecar, - ResourceKind, StateLockFile, StateObservations, sha256_hex, -}; - -#[derive(Debug, Clone)] -pub(crate) struct ClusterStore { - adapter: Arc, - /// Normalized storage-root URI, no trailing slash: `file:///abs/dir` - /// (the default config-dir layout) or `s3://bucket/prefix`. 
- root: String, - /// What observations/diagnostics display for stored locations: the plain - /// local path for `file://` roots (byte-compatible with the pre-store - /// outputs), the URI otherwise. - display_root: String, -} - -#[derive(Debug)] -pub(crate) struct StateSnapshot { - pub(crate) state: Option, - /// Content identity (`sha256:`) — the public CAS vocabulary. - pub(crate) state_cas: Option, -} - -#[derive(Debug)] -pub(crate) struct StateLockGuard { - adapter: Arc, - uri: String, - kind: StorageKind, -} - -impl Drop for StateLockGuard { - fn drop(&mut self) { - match self.kind { - // Deterministic release on the file backend (tests assert the - // lock is gone the moment a command returns). - StorageKind::Local => { - let path = self.uri.trim_start_matches("file://"); - let _ = std::fs::remove_file(path); - } - // Object stores need an async delete, and it must COMPLETE - // before a short-lived CLI process exits — a spawned task dies - // with the runtime and leaks the lock (caught by the s3 smoke - // test: import's lock survived into the next command). On the - // multi-thread runtime (the CLI and the gated s3 tests), - // block_in_place waits for the delete; on a current-thread - // runtime that's not allowed, so fall back to a spawn — - // best-effort, with `force-unlock` as the documented recovery, - // same as a crash. 
- StorageKind::S3 => { - let adapter = Arc::clone(&self.adapter); - let uri = self.uri.clone(); - if let Ok(handle) = tokio::runtime::Handle::try_current() { - if handle.runtime_flavor() == tokio::runtime::RuntimeFlavor::MultiThread { - tokio::task::block_in_place(move || { - handle.block_on(async move { - let _ = adapter.delete(&uri).await; - }); - }); - } else { - handle.spawn(async move { - let _ = adapter.delete(&uri).await; - }); - } - } - } - } - } -} - -impl ClusterStore { - /// The default layout: storage root = the config directory itself - /// (`file://`), byte-compatible with every pre-existing - /// cluster on disk. - pub(crate) fn for_config_dir(config_dir: &Path) -> Self { - let absolute = - std::path::absolute(config_dir).unwrap_or_else(|_| config_dir.to_path_buf()); - let display_root = absolute - .to_string_lossy() - .trim_end_matches('/') - .to_string(); - let root = format!("file://{display_root}"); - let adapter = storage_for_uri(&root) - .expect("local storage adapter construction is infallible for file:// roots"); - Self { - adapter, - root, - display_root, - } - } - - /// An explicit `storage:` root. `file://` URIs and plain paths normalize - /// to the local backend; `s3://bucket/prefix` to the S3 backend (env- - /// driven credentials/endpoint — the same contract as graph storage). 
- pub(crate) fn for_storage_root(root_uri: &str) -> Result { - let trimmed = root_uri.trim_end_matches('/'); - if storage_kind_for_uri(trimmed) == StorageKind::Local { - let path = trimmed.trim_start_matches("file://"); - return Ok(Self::for_config_dir(Path::new(path))); - } - let adapter = storage_for_uri(trimmed).map_err(|err| { - Diagnostic::error( - "storage_root_invalid", - "storage", - format!("could not initialize storage for '{root_uri}': {err}"), - ) - })?; - Ok(Self { - adapter, - root: trimmed.to_string(), - display_root: trimmed.to_string(), - }) - } - - pub(crate) fn kind(&self) -> StorageKind { - storage_kind_for_uri(&self.root) - } - - fn uri(&self, relative: &str) -> String { - format!("{}/{}", self.root, relative) - } - - fn display(&self, relative: &str) -> String { - format!("{}/{}", self.display_root, relative) - } - - /// Derived graph root for ``: `/graphs/.omni`. A plain - /// local path for `file://` roots (byte-compatible, directly usable by - /// the engine); the S3 URI the engine opens natively otherwise. - pub(crate) fn graph_root(&self, graph_id: &str) -> String { - match self.kind() { - StorageKind::Local => format!("{}/graphs/{graph_id}.omni", self.display_root), - StorageKind::S3 => format!("{}/graphs/{graph_id}.omni", self.root), - } - } - - /// Display-form storage root (plain local path for `file://`, URI for S3). - pub(crate) fn display_root(&self) -> &str { - &self.display_root - } - - /// Whether this root holds the cluster state ledger (`__cluster/state.json`) - /// — i.e. is an actual cluster, not just any directory. Probed via the - /// adapter (`file://` or `s3://`), failures read as "not a cluster". - pub(crate) async fn has_state(&self) -> bool { - self.adapter - .exists(&self.uri(CLUSTER_STATE_FILE)) - .await - .unwrap_or(false) - } - - /// `read_text_versioned`, returning None for a missing object (probed - /// via `exists` — the engine error type doesn't discriminate NotFound). 
- async fn read_versioned_opt(&self, uri: &str) -> Result, String> { - match self.adapter.exists(uri).await { - Ok(false) => return Ok(None), - Ok(true) => {} - Err(err) => return Err(err.to_string()), - } - self.adapter - .read_text_versioned(uri) - .await - .map(Some) - .map_err(|err| err.to_string()) - } - - /// JSON object write. Atomic visibility is the storage adapter's - /// contract on every backend (staged temp + rename on the filesystem, - /// a single atomic PUT on object stores) — no torn JSON after a crash, - /// no per-backend branch needed here. - async fn put_json(&self, relative: &str, payload: &str) -> Result<(), String> { - let target = self.uri(relative); - self.adapter - .write_text(&target, payload) - .await - .map_err(|err| err.to_string()) - } - - /// Shared list-and-parse for the sidecar/approval directories: id - /// (filename) order; unparseable objects warn and stay for the operator. - async fn list_json_dir( - &self, - dir: &str, - diagnostics: &mut Vec, - list_error_code: &'static str, - parse_error_code: &'static str, - version_ok: impl Fn(&T) -> bool, - version_error_code: &'static str, - ) -> Vec<(String, T)> { - let dir_uri = self.uri(dir); - let mut uris = match self.adapter.list_dir(&dir_uri).await { - Ok(uris) => uris, - Err(err) => { - diagnostics.push(Diagnostic::warning( - list_error_code, - dir, - format!("could not list '{dir}': {err}"), - )); - return Vec::new(); - } - }; - uris.retain(|uri| uri.ends_with(".json")); - uris.sort(); - let mut out = Vec::new(); - for uri in uris { - match self.adapter.read_text(&uri).await { - Ok(text) => match serde_json::from_str::(&text) { - Ok(value) if version_ok(&value) => out.push((uri, value)), - Ok(_) => diagnostics.push(Diagnostic::warning( - version_error_code, - uri.clone(), - "unsupported schema version; leaving it in place".to_string(), - )), - Err(err) => diagnostics.push(Diagnostic::warning( - parse_error_code, - uri.clone(), - format!("could not parse ({err}); leaving it in 
place"), - )), - }, - Err(err) => diagnostics.push(Diagnostic::warning( - parse_error_code, - uri.clone(), - format!("could not read ({err}); leaving it in place"), - )), - } - } - out - } - - /// Best-effort object removal (sidecar retirement after a CAS lands, - /// lock cleanup) — failures are recoverable by the next sweep. - pub(crate) async fn delete_object(&self, uri: &str) { - let _ = self.try_delete_object(uri).await; - } - - /// Like `delete_object` but surfaces the failure, so a caller that depends - /// on the deletion (e.g. the pre-movement sidecar cleanup fast-path) can - /// report it as a diagnostic instead of silently leaving stale state. - pub(crate) async fn try_delete_object(&self, uri: &str) -> Result<(), String> { - self.adapter.delete(uri).await.map_err(|err| err.to_string()) - } - - /// Recursive prefix delete for graph roots (approved deletes). Idempotent; - /// S3 non-atomicity is tolerated by the delete protocol's retry shape. - pub(crate) async fn delete_graph_root(&self, graph_uri: &str) -> Result<(), String> { - self.adapter - .delete_prefix(graph_uri) - .await - .map_err(|err| err.to_string()) - } - - /// Existence probe for graph roots in sweep classification. A bare local - /// path or any URI works — resolved through the same adapter machinery - /// the engine uses. 
- pub(crate) async fn graph_root_exists(&self, graph_uri: &str) -> bool { - match storage_kind_for_uri(graph_uri) { - StorageKind::Local => Path::new(graph_uri.trim_start_matches("file://")).exists(), - StorageKind::S3 => match storage_for_uri(graph_uri) { - Ok(adapter) => !adapter - .list_dir(graph_uri) - .await - .map(|entries| entries.is_empty()) - .unwrap_or(true), - Err(_) => false, - }, - } - } - - // ---- approvals ---- - - pub(crate) async fn list_approval_artifacts( - &self, - diagnostics: &mut Vec, - ) -> Vec<(String, ApprovalArtifact)> { - self.list_json_dir( - CLUSTER_APPROVALS_DIR, - diagnostics, - "approval_read_error", - "invalid_approval_artifact", - |artifact: &ApprovalArtifact| artifact.schema_version == 1, - "unsupported_approval_version", - ) - .await - } - - pub(crate) async fn write_approval_artifact( - &self, - artifact: &ApprovalArtifact, - ) -> Result { - let relative = format!("{CLUSTER_APPROVALS_DIR}/{}.json", artifact.approval_id); - let mut payload = serde_json::to_string_pretty(artifact).map_err(|err| { - Diagnostic::error( - "approval_write_error", - self.display(&relative), - format!("could not encode approval artifact: {err}"), - ) - })?; - payload.push('\n'); - self.put_json(&relative, &payload).await.map_err(|err| { - Diagnostic::error( - "approval_write_error", - self.display(&relative), - format!("could not write approval artifact: {err}"), - ) - })?; - Ok(self.uri(&relative)) - } - - // ---- recovery sidecars ---- - - pub(crate) async fn list_recovery_sidecar_locations( - &self, - diagnostics: &mut Vec, - ) -> Vec { - let dir_uri = self.uri(CLUSTER_RECOVERIES_DIR); - let mut uris = match self.adapter.list_dir(&dir_uri).await { - Ok(uris) => uris, - Err(err) => { - diagnostics.push(Diagnostic::warning( - "recovery_sidecar_read_error", - CLUSTER_RECOVERIES_DIR, - format!("could not list '{CLUSTER_RECOVERIES_DIR}': {err}"), - )); - return Vec::new(); - } - }; - uris.retain(|uri| uri.ends_with(".json")); - uris.sort(); - 
uris.into_iter() - .map(|uri| { - let name = uri.rsplit_once('/').map_or(uri.as_str(), |(_, name)| name); - format!("{}/{name}", self.display(CLUSTER_RECOVERIES_DIR)) - }) - .collect() - } - - pub(crate) async fn list_recovery_sidecars( - &self, - diagnostics: &mut Vec, - ) -> Vec<(String, RecoverySidecar)> { - self.list_json_dir( - CLUSTER_RECOVERIES_DIR, - diagnostics, - "recovery_sidecar_read_error", - "invalid_recovery_sidecar", - |sidecar: &RecoverySidecar| sidecar.schema_version == 1, - "unsupported_recovery_sidecar_version", - ) - .await - } - - pub(crate) async fn write_recovery_sidecar( - &self, - sidecar: &RecoverySidecar, - ) -> Result { - let relative = format!("{CLUSTER_RECOVERIES_DIR}/{}.json", sidecar.operation_id); - let mut payload = serde_json::to_string_pretty(sidecar).map_err(|err| { - Diagnostic::error( - "recovery_sidecar_write_error", - self.display(&relative), - format!("could not encode recovery sidecar: {err}"), - ) - })?; - payload.push('\n'); - self.put_json(&relative, &payload).await.map_err(|err| { - Diagnostic::error( - "recovery_sidecar_write_error", - self.display(&relative), - format!("could not write recovery sidecar: {err}"), - ) - })?; - Ok(self.uri(&relative)) - } - - // ---- catalog payloads ---- - - /// Content-addressed catalog location for a query/policy payload - /// (extensions fixed per kind, same as the pre-port layout). 
- pub(crate) fn payload_relative(kind: &ResourceKind, digest: &str) -> Option { - match kind { - ResourceKind::Query { graph, name } => Some(format!( - "{CLUSTER_RESOURCES_DIR}/query/{graph}/{name}/{digest}.gq" - )), - ResourceKind::Policy(name) => Some(format!( - "{CLUSTER_RESOURCES_DIR}/policy/{name}/{digest}.yaml" - )), - _ => None, - } - } - - pub(crate) async fn payload_exists(&self, kind: &ResourceKind, digest: &str) -> bool { - let Some(relative) = Self::payload_relative(kind, digest) else { - return false; - }; - self.adapter - .exists(&self.uri(&relative)) - .await - .unwrap_or(false) - } - - /// Raw payload read: `Ok(None)` for a missing blob, `Err` for transport - /// failures — callers classify (verify loops need the three-way split). - pub(crate) async fn read_payload( - &self, - kind: &ResourceKind, - digest: &str, - ) -> Result, String> { - let Some(relative) = Self::payload_relative(kind, digest) else { - return Ok(None); - }; - let uri = self.uri(&relative); - match self.adapter.exists(&uri).await { - Ok(false) => return Ok(None), - Ok(true) => {} - Err(err) => return Err(err.to_string()), - } - self.adapter - .read_text(&uri) - .await - .map(Some) - .map_err(|err| { - format!( - "could not read catalog payload '{}': {err}", - self.display(&relative) - ) - }) - } - - /// Idempotent content-addressed write: a payload already present at its - /// digest is by definition identical. - pub(crate) async fn write_payload( - &self, - kind: &ResourceKind, - digest: &str, - content: &str, - ) -> Result<(), String> { - let Some(relative) = Self::payload_relative(kind, digest) else { - return Err("resource kind has no payload".to_string()); - }; - if self - .adapter - .exists(&self.uri(&relative)) - .await - .map_err(|err| err.to_string())? - { - return Ok(()); - } - self.put_json(&relative, content).await - } - - /// Read a catalog payload and verify it against its recorded digest. 
- pub(crate) async fn read_verified_payload( - &self, - kind: &ResourceKind, - digest: &str, - address: &str, - ) -> Result { - let Some(relative) = Self::payload_relative(kind, digest) else { - return Err(Diagnostic::error( - "catalog_payload_missing", - address, - "resource kind has no payload", - )); - }; - let uri = self.uri(&relative); - let text = self.adapter.read_text(&uri).await.map_err(|err| { - Diagnostic::error( - "catalog_payload_missing", - address, - format!( - "catalog blob '{}' unreadable ({err}); run `cluster refresh` then `cluster apply`, and restart", - self.display(&relative) - ), - ) - })?; - if sha256_hex(text.as_bytes()) != digest { - return Err(Diagnostic::error( - "catalog_payload_digest_mismatch", - address, - format!( - "catalog blob '{}' does not match its recorded digest; run `cluster refresh` then `cluster apply`, and restart", - self.display(&relative) - ), - )); - } - Ok(text) - } - - // ---- observations ---- - - pub(crate) fn observations(&self) -> StateObservations { - StateObservations { - state_path: self.display(CLUSTER_STATE_FILE), - lock_path: self.display(CLUSTER_LOCK_FILE), - state_found: false, - applied_config_digest: None, - state_revision: 0, - state_cas: None, - resource_count: 0, - locked: false, - lock_id: None, - lock_acquired: false, - acquired_lock_id: None, - lock_operation: None, - lock_created_at: None, - lock_pid: None, - lock_age_seconds: None, - } - } - - // ---- state ledger ---- - - pub(crate) async fn read_state( - &self, - observations: &mut StateObservations, - ) -> Result { - let state_uri = self.uri(CLUSTER_STATE_FILE); - let (text, _version) = match self.read_versioned_opt(&state_uri).await { - Ok(Some(read)) => read, - Ok(None) => { - return Ok(StateSnapshot { - state: None, - state_cas: None, - }); - } - Err(err) => { - return Err(Diagnostic::error( - "state_read_error", - CLUSTER_STATE_FILE, - format!("could not read state file: {err}"), - )); - } - }; - - observations.state_found = true; - let 
state_cas = format!("sha256:{}", sha256_hex(text.as_bytes())); - observations.state_cas = Some(state_cas.clone()); - - let state = serde_json::from_str::(&text).map_err(|err| { - Diagnostic::error( - "invalid_state_json", - CLUSTER_STATE_FILE, - format!("could not parse state JSON: {err}"), - ) - })?; - - if state.version != 1 { - return Err(Diagnostic::error( - "unsupported_state_version", - "state.version", - format!( - "unsupported cluster state version {}; this build supports version 1", - state.version - ), - )); - } - - observations.applied_config_digest = state.applied_revision.config_digest.clone(); - observations.state_revision = state.state_revision; - observations.resource_count = state.applied_revision.resources.len(); - - Ok(StateSnapshot { - state: Some(state), - state_cas: Some(state_cas), - }) - } - - /// CAS-guarded ledger replace. The public contract stays content-level - /// (`expected_cas` = `sha256:` from the snapshot the command read); - /// the physical swap is token-conditioned on a fresh read, so a writer - /// that raced us between the fresh read and the put loses with - /// `state_cas_mismatch` — never a silent overwrite. On S3 the token is - /// the object's ETag and the put is conditional (If-Match); locally it - /// is a content token over the same temp+rename flow as before the port. 
- pub(crate) async fn write_state( - &self, - state: &ClusterState, - expected_cas: Option<&str>, - observations: &mut StateObservations, - ) -> Result<(), Diagnostic> { - let state_uri = self.uri(CLUSTER_STATE_FILE); - let current = self.read_versioned_opt(&state_uri).await.map_err(|err| { - Diagnostic::error( - "state_write_error", - CLUSTER_STATE_FILE, - format!("could not read state file before write: {err}"), - ) - })?; - let current_cas = current - .as_ref() - .map(|(text, _)| format!("sha256:{}", sha256_hex(text.as_bytes()))); - if current_cas.as_deref() != expected_cas { - return Err(state_cas_mismatch()); - } - - let mut payload = serde_json::to_string_pretty(state).map_err(|err| { - Diagnostic::error( - "state_write_error", - CLUSTER_STATE_FILE, - format!("could not encode state JSON: {err}"), - ) - })?; - payload.push('\n'); - - let written = match current { - None => self - .adapter - .write_text_if_absent(&state_uri, &payload) - .await - .map_err(|err| { - Diagnostic::error( - "state_write_error", - CLUSTER_STATE_FILE, - format!("could not create state.json: {err}"), - ) - })?, - Some((_, version)) => self - .adapter - .write_text_if_match(&state_uri, &payload, &version) - .await - .map_err(|err| { - Diagnostic::error( - "state_write_error", - CLUSTER_STATE_FILE, - format!("could not replace state.json: {err}"), - ) - })? 
- .is_some(), - }; - if !written { - return Err(state_cas_mismatch()); - } - - observations.state_found = true; - observations.applied_config_digest = state.applied_revision.config_digest.clone(); - observations.state_revision = state.state_revision; - observations.state_cas = Some(format!("sha256:{}", sha256_hex(payload.as_bytes()))); - observations.resource_count = state.applied_revision.resources.len(); - Ok(()) - } - - // ---- lock ---- - - pub(crate) async fn acquire_lock( - &self, - operation: &str, - observations: &mut StateObservations, - ) -> Result { - let lock_uri = self.uri(CLUSTER_LOCK_FILE); - let lock_id = Ulid::new().to_string(); - let lock = StateLockFile { - version: 1, - lock_id: lock_id.clone(), - operation: operation.to_string(), - created_at: OffsetDateTime::now_utc() - .format(&Rfc3339) - .unwrap_or_else(|_| "1970-01-01T00:00:00Z".to_string()), - pid: process::id(), - }; - let payload = serde_json::to_string_pretty(&lock).map_err(|err| { - Diagnostic::error( - "state_lock_error", - CLUSTER_LOCK_FILE, - format!("could not encode state lock: {err}"), - ) - })?; - - match self.adapter.write_text_if_absent(&lock_uri, &payload).await { - Ok(true) => { - observations.lock_acquired = true; - observations.acquired_lock_id = Some(lock_id); - Ok(StateLockGuard { - adapter: Arc::clone(&self.adapter), - uri: lock_uri, - kind: self.kind(), - }) - } - Ok(false) => { - self.observe_lock_metadata_lossy(observations).await; - Err(Diagnostic::error( - "state_lock_held", - CLUSTER_LOCK_FILE, - state_lock_held_message(observations), - )) - } - Err(err) => Err(Diagnostic::error( - "state_lock_error", - CLUSTER_LOCK_FILE, - format!("could not write state lock: {err}"), - )), - } - } - - pub(crate) async fn force_unlock( - &self, - lock_id: &str, - observations: &mut StateObservations, - ) -> Result<(), Diagnostic> { - let lock_uri = self.uri(CLUSTER_LOCK_FILE); - let text = match self.read_versioned_opt(&lock_uri).await { - Ok(Some((text, _))) => text, - Ok(None) 
=> { - return Err(Diagnostic::error( - "state_lock_missing", - CLUSTER_LOCK_FILE, - "no cluster state lock is present", - )); - } - Err(err) => { - return Err(Diagnostic::error( - "state_lock_read_error", - CLUSTER_LOCK_FILE, - format!("could not read state lock: {err}"), - )); - } - }; - let lock = parse_lock_file_for_unlock(&text)?; - observations.observe_lock_metadata(&lock); - observations.locked = true; - if lock.lock_id != lock_id { - return Err(Diagnostic::error( - "state_lock_id_mismatch", - CLUSTER_LOCK_FILE, - format!( - "lock id mismatch: held lock is {}, refusing to remove (pass the exact id from `cluster status`)", - lock.lock_id - ), - )); - } - self.adapter.delete(&lock_uri).await.map_err(|err| { - Diagnostic::error( - "state_lock_error", - CLUSTER_LOCK_FILE, - format!("could not remove state lock: {err}"), - ) - })?; - observations.locked = false; - Ok(()) - } - - pub(crate) async fn observe_lock( - &self, - observations: &mut StateObservations, - diagnostics: &mut Vec, - ) { - let lock_uri = self.uri(CLUSTER_LOCK_FILE); - match self.read_versioned_opt(&lock_uri).await { - Ok(Some((text, _))) => { - observations.locked = true; - match serde_json::from_str::(&text) { - Ok(lock) if lock.version == 1 => observations.observe_lock_metadata(&lock), - Ok(lock) => diagnostics.push(Diagnostic::warning( - "unsupported_state_lock_version", - CLUSTER_LOCK_FILE, - format!("unsupported cluster state lock version {}", lock.version), - )), - Err(err) => diagnostics.push(Diagnostic::warning( - "invalid_state_lock", - CLUSTER_LOCK_FILE, - format!("could not parse state lock: {err}"), - )), - } - } - Ok(None) => {} - Err(err) => diagnostics.push(Diagnostic::warning( - "state_lock_read_error", - CLUSTER_LOCK_FILE, - format!("could not read state lock: {err}"), - )), - } - } - - pub(crate) async fn observe_lock_metadata_lossy( - &self, - observations: &mut StateObservations, - ) { - observations.locked = true; - let lock_uri = self.uri(CLUSTER_LOCK_FILE); - if let 
Ok(Some((text, _))) = self.read_versioned_opt(&lock_uri).await { - if let Ok(lock) = serde_json::from_str::(&text) { - if lock.version == 1 { - observations.observe_lock_metadata(&lock); - } - } - } - } -} - -fn state_cas_mismatch() -> Diagnostic { - Diagnostic::error( - "state_cas_mismatch", - CLUSTER_STATE_FILE, - "state.json changed while the command was running; re-run the command against the latest state", - ) -} - -pub(crate) fn parse_lock_file_for_unlock(text: &str) -> Result { - let lock = serde_json::from_str::(text).map_err(|err| { - Diagnostic::error( - "invalid_state_lock", - CLUSTER_LOCK_FILE, - format!("could not parse state lock: {err}"), - ) - })?; - if lock.version != 1 { - return Err(Diagnostic::error( - "unsupported_state_lock_version", - CLUSTER_LOCK_FILE, - format!("unsupported cluster state lock version {}", lock.version), - )); - } - Ok(lock) -} - -pub(crate) fn state_lock_held_message(observations: &StateObservations) -> String { - match observations.lock_id.as_deref() { - Some(lock_id) => format!( - "cluster state lock already exists (lock id {lock_id}); run `omnigraph cluster force-unlock {lock_id}` only after confirming no cluster operation is active" - ), - None => "cluster state lock already exists; remove it only after confirming no cluster operation is active".to_string(), - } -} diff --git a/crates/omnigraph-cluster/src/sweep.rs b/crates/omnigraph-cluster/src/sweep.rs deleted file mode 100644 index 27e6e9c..0000000 --- a/crates/omnigraph-cluster/src/sweep.rs +++ /dev/null @@ -1,441 +0,0 @@ -//! The recovery sweep: RFC-004's roll-forward-only sidecar -//! classification (moved verbatim from lib.rs in the modularization). - -use super::*; - -/// Recovery sweep (RFC-004 §D3): runs at the start of every state-mutating -/// cluster command, under the state lock, before the command's own work. 
-/// Roll-forward-only — the engine's own sidecars make each graph-level -/// operation atomic within the graph, so the cluster never rolls a graph -/// back; it converges the ledger to observable reality or refuses loudly. -/// Mutations ride the calling command's CAS-checked state write; completed -/// sidecars are deleted only after that write lands. -pub(crate) async fn sweep_recovery_sidecars( - backend: &ClusterStore, - state: &mut ClusterState, - diagnostics: &mut Vec, -) -> SweepOutcome { - let mut outcome = SweepOutcome::default(); - for (path, sidecar) in backend.list_recovery_sidecars(diagnostics).await { - match sidecar.kind { - RecoverySidecarKind::GraphCreate => { - sweep_graph_create_sidecar( - backend, - path, - sidecar, - state, - diagnostics, - &mut outcome, - ) - .await; - } - RecoverySidecarKind::SchemaApply => { - sweep_schema_apply_sidecar(path, sidecar, state, diagnostics, &mut outcome).await; - } - RecoverySidecarKind::GraphDelete => { - sweep_graph_delete_sidecar( - backend, - path, - sidecar, - state, - diagnostics, - &mut outcome, - ) - .await; - } - } - } - outcome -} - -pub(crate) async fn sweep_graph_create_sidecar( - backend: &ClusterStore, - path: String, - sidecar: RecoverySidecar, - state: &mut ClusterState, - diagnostics: &mut Vec, - outcome: &mut SweepOutcome, -) { - let graph_address = graph_address(&sidecar.graph_id); - let schema_addr = schema_address(&sidecar.graph_id); - - // Row 1: nothing moved — the init never landed. The sidecar is pure - // intent; retire it (deferred to the command's post-CAS cleanup, like - // every other completed sidecar — a failed CAS simply re-sweeps it) and - // let the command's own plan re-propose the create. 
- if !backend.graph_root_exists(&sidecar.graph_uri).await { - outcome.completed_sidecars.push(path); - return; - } - - match Omnigraph::open_read_only(&sidecar.graph_uri).await { - Ok(db) => { - let live_digest = sha256_hex(db.schema_source().as_bytes()); - let recorded = state - .applied_revision - .resources - .get(&schema_addr) - .map(|resource| resource.digest.clone()); - if recorded.as_deref() == Some(live_digest.as_str()) { - // Row 2: crash fell between the state CAS and sidecar delete. - outcome.completed_sidecars.push(path); - } else if live_digest == sidecar.desired_schema_digest { - // Row 4: the create completed on the graph; roll the cluster - // state forward to observable reality. - state.applied_revision.resources.insert( - schema_addr.clone(), - StateResource { - digest: live_digest.clone(), - applies_to: None, - embedding_provider: None, - embedding_profile: None, - }, - ); - let query_digests = state_query_digests_for_graph(state, &sidecar.graph_id); - let embedding_provider = state_graph_embedding_provider(state, &sidecar.graph_id); - let embedding_provider_digest = - state_embedding_provider_digest(state, embedding_provider.as_deref()); - let composite = graph_digest( - &sidecar.graph_id, - Some(&live_digest), - Some(&query_digests), - embedding_provider.as_deref(), - embedding_provider_digest.as_ref(), - ); - state.applied_revision.resources.insert( - graph_address.clone(), - StateResource { - digest: composite, - applies_to: None, - embedding_provider, - embedding_profile: None, - }, - ); - set_resource_status_applied(state, &graph_address); - set_resource_status_applied(state, &schema_addr); - state.recovery_records.insert( - sidecar.operation_id.clone(), - json!({ - "kind": "graph_create", - "graph_id": sidecar.graph_id, - "outcome": "rolled_forward", - "recovered_at": now_rfc3339(), - "actor": sidecar.actor, - }), - ); - diagnostics.push(Diagnostic::warning( - "cluster_recovery_rolled_forward", - graph_address.clone(), - "an interrupted 
graph create had completed on the graph; cluster state was rolled forward to match", - )); - outcome.completed_sidecars.push(path); - } else { - // Row 6: the graph moved to something the sidecar did not - // intend. Refuse to guess; require refresh + operator re-plan. - set_resource_status( - state, - &graph_address, - ResourceLifecycleStatus::Drifted, - "actual_applied_state_pending", - "graph state does not match the interrupted operation; run `cluster refresh` and re-plan", - ); - set_resource_status( - state, - &schema_addr, - ResourceLifecycleStatus::Drifted, - "actual_applied_state_pending", - "graph state does not match the interrupted operation; run `cluster refresh` and re-plan", - ); - diagnostics.push(Diagnostic::warning( - "cluster_recovery_pending", - graph_address.clone(), - "an interrupted graph create left unexpected graph state; graph-moving work is blocked until repaired", - )); - outcome.pending_graphs.insert(sidecar.graph_id.clone()); - } - } - Err(err) => { - // Row 5: partial root (the engine's documented init gap). Never - // auto-delete — reconciler deletes are the same data-loss class - // as human deletes; the operator removes the root explicitly.
- set_resource_status( - state, - &graph_address, - ResourceLifecycleStatus::Error, - "graph_create_incomplete", - "graph root exists but cannot be opened; remove the graph root and re-run `cluster apply`", - ); - set_resource_status( - state, - &schema_addr, - ResourceLifecycleStatus::Error, - "graph_create_incomplete", - "graph root exists but cannot be opened; remove the graph root and re-run `cluster apply`", - ); - diagnostics.push(Diagnostic::error( - "graph_create_incomplete", - graph_address.clone(), - format!( - "graph root '{}' exists but cannot be opened ({err}); remove the graph root and re-run `cluster apply`", - sidecar.graph_uri - ), - )); - outcome.pending_graphs.insert(sidecar.graph_id.clone()); - } - } -} - -pub(crate) async fn sweep_schema_apply_sidecar( - path: String, - sidecar: RecoverySidecar, - state: &mut ClusterState, - diagnostics: &mut Vec<Diagnostic>, - outcome: &mut SweepOutcome, -) { - let graph_address = graph_address(&sidecar.graph_id); - let schema_addr = schema_address(&sidecar.graph_id); - - // Digest-based classification: robust to unrelated manifest movement; - // the sidecar's version pins stay forensic. - let live_digest = match Omnigraph::open_read_only(&sidecar.graph_uri).await { - Ok(db) => sha256_hex(db.schema_source().as_bytes()), - Err(err) => { - // Cannot verify the interrupted operation — refuse to guess.
- diagnostics.push(Diagnostic::warning( - "cluster_recovery_pending", - graph_address.clone(), - format!( - "an interrupted schema apply cannot be verified (graph '{}' did not open: {err}); graph-moving work is blocked until repaired", - sidecar.graph_uri - ), - )); - outcome.pending_graphs.insert(sidecar.graph_id.clone()); - return; - } - }; - - let recorded = state - .applied_revision - .resources - .get(&schema_addr) - .map(|resource| resource.digest.clone()); - if recorded.as_deref() == Some(live_digest.as_str()) { - // Ledger consistent with the live graph (the apply never landed, or - // landed and was recorded): the sidecar is stale intent — retire it. - outcome.completed_sidecars.push(path); - } else if live_digest == sidecar.desired_schema_digest { - // RFC-004 §D3 row 3: the schema apply completed on the graph; roll - // the cluster state forward to observable reality. - state.applied_revision.resources.insert( - schema_addr.clone(), - StateResource { - digest: live_digest.clone(), - applies_to: None, - embedding_provider: None, - embedding_profile: None, - }, - ); - let query_digests = state_query_digests_for_graph(state, &sidecar.graph_id); - let embedding_provider = state_graph_embedding_provider(state, &sidecar.graph_id); - let embedding_provider_digest = - state_embedding_provider_digest(state, embedding_provider.as_deref()); - let composite = graph_digest( - &sidecar.graph_id, - Some(&live_digest), - Some(&query_digests), - embedding_provider.as_deref(), - embedding_provider_digest.as_ref(), - ); - state.applied_revision.resources.insert( - graph_address.clone(), - StateResource { - digest: composite, - applies_to: None, - embedding_provider, - embedding_profile: None, - }, - ); - set_resource_status_applied(state, &graph_address); - set_resource_status_applied(state, &schema_addr); - state.recovery_records.insert( - sidecar.operation_id.clone(), - json!({ - "kind": "schema_apply", - "graph_id": sidecar.graph_id, - "outcome": "rolled_forward", -
"recovered_at": now_rfc3339(), - "actor": sidecar.actor, - }), - ); - diagnostics.push(Diagnostic::warning( - "cluster_recovery_rolled_forward", - graph_address.clone(), - "an interrupted schema apply had completed on the graph; cluster state was rolled forward to match", - )); - outcome.completed_sidecars.push(path); - } else { - // Row 6: live schema is neither the recorded nor the desired digest. - set_resource_status( - state, - &graph_address, - ResourceLifecycleStatus::Drifted, - "actual_applied_state_pending", - "graph state does not match the interrupted operation; run `cluster refresh` and re-plan", - ); - set_resource_status( - state, - &schema_addr, - ResourceLifecycleStatus::Drifted, - "actual_applied_state_pending", - "graph state does not match the interrupted operation; run `cluster refresh` and re-plan", - ); - diagnostics.push(Diagnostic::warning( - "cluster_recovery_pending", - graph_address.clone(), - "an interrupted schema apply left unexpected graph state; graph-moving work is blocked until repaired", - )); - outcome.pending_graphs.insert(sidecar.graph_id.clone()); - } -} - -pub(crate) async fn sweep_graph_delete_sidecar( - backend: &ClusterStore, - path: String, - sidecar: RecoverySidecar, - state: &mut ClusterState, - diagnostics: &mut Vec<Diagnostic>, - outcome: &mut SweepOutcome, -) { - let graph_address = graph_address(&sidecar.graph_id); - - if backend.graph_root_exists(&sidecar.graph_uri).await { - // Row 8: the delete never completed. Prefix removal is idempotent and - // works on partial roots, so the repair is simply the re-proposed, - // still-approved delete on a later run — retire the stale intent.
- diagnostics.push(Diagnostic::warning( - "graph_delete_incomplete", - graph_address, - "a previous graph delete did not complete; it will be re-proposed by plan and can be retried under its approval", - )); - outcome.completed_sidecars.push(path); - return; - } - - if !state - .applied_revision - .resources - .contains_key(&graph_address) - { - // Row 7: already tombstoned (or never recorded); crash fell between - // the state CAS and sidecar delete. - outcome.completed_sidecars.push(path); - return; - } - - // Row 7b: the root is gone, the ledger is stale — roll forward the - // tombstone, consume the approval the sidecar carries, audit. - tombstone_graph_subtree( - state, - &sidecar.graph_id, - sidecar.approval_id.as_deref(), - sidecar.actor.as_deref(), - ); - state.recovery_records.insert( - sidecar.operation_id.clone(), - json!({ - "kind": "graph_delete", - "graph_id": sidecar.graph_id, - "outcome": "rolled_forward", - "recovered_at": now_rfc3339(), - "actor": sidecar.actor, - }), - ); - if let Some(approval_id) = &sidecar.approval_id { - record_approval_consumed(state, approval_id, &sidecar.operation_id); - outcome.consumed_approvals.push(approval_id.clone()); - } - diagnostics.push(Diagnostic::warning( - "cluster_recovery_rolled_forward", - graph_address, - "an interrupted graph delete had completed on disk; cluster state was rolled forward to match", - )); - outcome.completed_sidecars.push(path); -} - -/// Remove a graph's subtree (graph, schema, queries) from the ledger and -/// leave a tombstone observation. Idempotent.
-pub(crate) fn tombstone_graph_subtree( - state: &mut ClusterState, - graph_id: &str, - approval_id: Option<&str>, - actor: Option<&str>, -) { - let graph_addr = graph_address(graph_id); - let schema_addr = schema_address(graph_id); - let query_prefix = format!("query.{graph_id}."); - state.applied_revision.resources.remove(&graph_addr); - state.applied_revision.resources.remove(&schema_addr); - state - .applied_revision - .resources - .retain(|address, _| !address.starts_with(&query_prefix)); - state.resource_statuses.remove(&graph_addr); - state.resource_statuses.remove(&schema_addr); - state - .resource_statuses - .retain(|address, _| !address.starts_with(&query_prefix)); - state.observations.insert( - graph_addr, - json!({ - "kind": "tombstone", - "deleted_at": now_rfc3339(), - "approval_id": approval_id, - "actor": actor, - }), - ); -} - -/// Record approval consumption in the state ledger. The artifact FILE is -/// rewritten with consumed_at only after the state write lands, so a failed -/// CAS leaves the approval valid for the retry. -pub(crate) fn record_approval_consumed( - state: &mut ClusterState, - approval_id: &str, - operation_id: &str, -) { - state.approval_records.insert( - approval_id.to_string(), - json!({ - "consumed_at": now_rfc3339(), - "consumed_by_operation": operation_id, - }), - ); -} - -/// Mark approval artifact files consumed on disk (post-CAS). -pub(crate) async fn mark_approvals_consumed(backend: &ClusterStore, approval_ids: &[String]) { - if approval_ids.is_empty() { - return; - } - let mut sink = Vec::new(); - for (_, mut artifact) in backend.list_approval_artifacts(&mut sink).await { - if approval_ids.contains(&artifact.approval_id) && artifact.consumed_at.is_none() { - artifact.consumed_at = Some(now_rfc3339()); - let _ = backend.write_approval_artifact(&artifact).await; - } - } -} - -/// Read-only commands report pending sidecars without acting on them. 
-pub(crate) async fn warn_pending_recovery_sidecars( - backend: &ClusterStore, - diagnostics: &mut Vec<Diagnostic>, -) { - for location in backend.list_recovery_sidecar_locations(diagnostics).await { - diagnostics.push(Diagnostic::warning( - "cluster_recovery_pending", - location, - "a recovery sidecar from an interrupted apply is pending; the next state-mutating command will classify it", - )); - } -} diff --git a/crates/omnigraph-cluster/src/tests.rs b/crates/omnigraph-cluster/src/tests.rs deleted file mode 100644 index 7eae69f..0000000 --- a/crates/omnigraph-cluster/src/tests.rs +++ /dev/null @@ -1,3690 +0,0 @@ -//! In-source test suite, moved verbatim from lib.rs (modularization). -//! Indentation is preserved exactly — embedded raw-string fixtures -//! (cluster.yaml/JSON bodies) are content, not formatting. -#![allow(clippy::all)] - - use std::fs; - use std::path::Path; - - use omnigraph::db::Omnigraph; - use serde_json::json; - use tempfile::tempdir; - - use super::*; - - const SCHEMA: &str = r#" -node Person { - name: String @key - age: I32?
-} -"#; - - const QUERY: &str = r#" -query find_person($name: String) { - match { $p: Person { name: $name } } - return { $p.name, $p.age } -} -"#; - - fn fixture() -> tempfile::TempDir { - let dir = tempdir().unwrap(); - fs::write(dir.path().join("people.pg"), SCHEMA).unwrap(); - fs::write(dir.path().join("people.gq"), QUERY).unwrap(); - fs::write(dir.path().join("base.policy.yaml"), "rules: []\n").unwrap(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -metadata: - name: test -state: - backend: cluster - lock: true -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -policies: - base: - file: ./base.policy.yaml - applies_to: [knowledge] -"#, - ) - .unwrap(); - dir - } - - fn write_mock_embedding_cluster(config_dir: &Path, model: &str) { - fs::write( - config_dir.join(CLUSTER_CONFIG_FILE), - format!( - r#" -version: 1 -metadata: - name: test -state: - backend: cluster - lock: true -providers: - embedding: - default: - kind: mock - model: {model} -graphs: - knowledge: - schema: ./people.pg - embedding_provider: default - queries: - find_person: - file: ./people.gq -policies: - base: - file: ./base.policy.yaml - applies_to: [knowledge] -"# - ), - ) - .unwrap(); - } - - async fn init_derived_graph(root: &Path) { - let graph_dir = root.join(CLUSTER_GRAPHS_DIR); - fs::create_dir_all(&graph_dir).unwrap(); - let graph = graph_dir.join("knowledge.omni"); - Omnigraph::init(graph.to_string_lossy().as_ref(), SCHEMA) - .await - .unwrap(); - } - - fn write_lock_file(config_dir: &Path, lock_id: &str, operation: &str) { - let state_dir = config_dir.join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("lock.json"), - json!({ - "version": 1, - "lock_id": lock_id, - "operation": operation, - "created_at": "1970-01-01T00:00:00Z", - "pid": 123 - }) - .to_string(), - ) - .unwrap(); - } - - #[test] - fn valid_minimal_config() { - let dir = fixture(); - let out = 
validate_config_dir(dir.path()); - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.resource_digests.contains_key("graph.knowledge")); - assert!(out.resource_digests.contains_key("schema.knowledge")); - assert!( - out.dependencies - .iter() - .any(|dep| dep.from == "policy.base" && dep.to == "graph.knowledge") - ); - } - - #[test] - fn unknown_field_rejection() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - "version: 1\ngraphs: {}\nwat: true\n", - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - assert!(out.diagnostics[0].message.contains("unknown field")); - } - - #[test] - fn future_phase_field_rejection() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - "version: 1\ngraphs: {}\npipelines: {}\n", - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - assert_eq!(out.diagnostics[0].code, "future_phase_field"); - } - - #[test] - fn duplicate_yaml_key_rejection() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - "version: 1\ngraphs: {}\ngraphs: {}\n", - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - assert_eq!(out.diagnostics[0].code, "duplicate_yaml_key"); - } - - #[test] - fn duplicate_yaml_key_rejection_keeps_quoted_hashes() { - let diagnostics = - duplicate_key_diagnostics("\"name#display\": one\n\"name#display\": two\n"); - assert_eq!(diagnostics.len(), 1); - assert_eq!(diagnostics[0].code, "duplicate_yaml_key"); - } - - #[test] - fn missing_schema_query_and_policy_files() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -graphs: - knowledge: - schema: ./missing.pg - queries: - find_person: { file: ./missing.gq } -policies: - base: - file: ./missing.policy.yaml - applies_to: [knowledge] -"#, - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - let codes: BTreeSet<_> = 
out.diagnostics.iter().map(|d| d.code.as_str()).collect(); - assert!(codes.contains("schema_file_missing")); - assert!(codes.contains("query_file_missing")); - assert!(codes.contains("policy_file_missing")); - } - - #[test] - fn wrong_kind_and_dangling_refs_fail() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -graphs: - knowledge: - schema: ./people.pg -policies: - base: - file: ./base.policy.yaml - applies_to: [query.knowledge.find_person, missing] -"#, - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - let codes: BTreeSet<_> = out.diagnostics.iter().map(|d| d.code.as_str()).collect(); - assert!(codes.contains("wrong_kind_reference")); - assert!(codes.contains("dangling_graph_reference")); - } - - #[test] - fn embedding_provider_config_accepts_provider_resources_and_graph_refs() { - let dir = fixture(); - write_mock_embedding_cluster(dir.path(), "recorded-x"); - - let out = validate_config_dir(dir.path()); - assert!(out.ok, "{:?}", out.diagnostics); - let provider_digest = out - .resource_digests - .get("provider.embedding.default") - .expect("provider resource digest"); - assert!( - out.resources - .iter() - .any(|resource| resource.address == "provider.embedding.default" - && resource.kind == "embedding_provider" - && resource.path.is_none()) - ); - assert!( - out.dependencies - .iter() - .any(|dep| dep.from == "graph.knowledge" && dep.to == "provider.embedding.default"), - "{:?}", - out.dependencies - ); - let schema_digest = out.resource_digests.get("schema.knowledge").unwrap(); - let query_digest = out - .resource_digests - .get("query.knowledge.find_person") - .unwrap(); - let expected_graph_digest = graph_digest( - "knowledge", - Some(schema_digest), - Some( - &[("find_person".to_string(), query_digest.clone())] - .into_iter() - .collect(), - ), - Some("provider.embedding.default"), - Some(provider_digest), - ); - assert_eq!( - out.resource_digests["graph.knowledge"], - 
expected_graph_digest - ); - } - - #[test] - fn embedding_provider_config_rejects_bad_refs_and_inline_secrets() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -providers: - embedding: - default: - kind: openai-compatible - api_key: sk-inline -graphs: - knowledge: - schema: ./people.pg - embedding_provider: provider.policy.default - missing_provider: - schema: ./people.pg - embedding_provider: absent -"#, - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - let codes: BTreeSet<_> = out.diagnostics.iter().map(|d| d.code.as_str()).collect(); - assert!( - codes.contains("embedding_api_key_inline"), - "{:?}", - out.diagnostics - ); - assert!( - codes.contains("wrong_kind_reference"), - "{:?}", - out.diagnostics - ); - assert!( - codes.contains("dangling_embedding_provider_reference"), - "{:?}", - out.diagnostics - ); - } - - #[test] - fn query_key_mismatch_fails() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -graphs: - knowledge: - schema: ./people.pg - queries: - different: { file: ./people.gq } -"#, - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - assert_eq!(out.diagnostics[0].code, "query_key_mismatch"); - } - - #[test] - fn query_typecheck_failure_fails() { - let dir = fixture(); - fs::write( - dir.path().join("people.gq"), - "query find_person() { match { $d: DoesNotExist } return { $d.name } }\n", - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "query_typecheck_error") - ); - } - - #[tokio::test] - async fn missing_state_plans_creates() { - let dir = fixture(); - let out = plan_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(!out.state_observations.state_found); - assert!(!out.state_observations.locked); - assert!(out.state_observations.lock_acquired); - 
assert!( - out.changes - .iter() - .all(|c| c.operation == PlanOperation::Create) - ); - assert!(out.changes.iter().any(|c| c.resource == "graph.knowledge")); - assert!(!dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[tokio::test] - async fn config_digest_ignores_yaml_comments_and_formatting() { - let dir = fixture(); - let first = plan_config_dir(dir.path()).await; - assert!(first.ok, "{:?}", first.diagnostics); - - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -# Same semantic config as the fixture, intentionally rendered differently. -version: 1 -metadata: { name: test } -state: { backend: cluster, lock: true } -graphs: - knowledge: - schema: ./people.pg - queries: { find_person: { file: ./people.gq } } -policies: - base: - file: ./base.policy.yaml - applies_to: - - knowledge -"#, - ) - .unwrap(); - - let second = plan_config_dir(dir.path()).await; - assert!(second.ok, "{:?}", second.diagnostics); - assert_eq!( - first.desired_revision.config_digest, - second.desired_revision.config_digest - ); - } - - #[tokio::test] - async fn existing_state_plans_update_and_delete_deterministically() { - let dir = fixture(); - let first = plan_config_dir(dir.path()).await; - let state_dir = dir.path().join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - serde_json::to_string_pretty(&json!({ - "version": 1, - "applied_revision": { - "config_digest": "old", - "resources": { - "graph.knowledge": { "digest": first.resource_digests["graph.knowledge"] }, - "policy.old": { "digest": "abc" }, - "schema.knowledge": { "digest": "old-schema" } - } - } - })) - .unwrap(), - ) - .unwrap(); - - let out = plan_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - let rendered: Vec<_> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), &change.operation)) - .collect(); - assert_eq!( - rendered, - vec![ - ("policy.base", &PlanOperation::Create), - ("policy.old", 
&PlanOperation::Delete), - ("query.knowledge.find_person", &PlanOperation::Create), - ("schema.knowledge", &PlanOperation::Update), - ] - ); - } - - #[tokio::test] - async fn old_minimal_state_json_still_plans_with_default_revision() { - let dir = fixture(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{ - "version": 1, - "applied_revision": { - "config_digest": "old", - "resources": { - "graph.knowledge": { "digest": "old-graph" } - } - } -}"#, - ) - .unwrap(); - - let out = plan_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert_eq!(out.state_observations.state_revision, 0); - assert!(out.state_observations.state_cas.is_some()); - assert!(out.changes.iter().any(|change| { - change.resource == "graph.knowledge" && change.operation == PlanOperation::Update - })); - } - - #[tokio::test] - async fn extended_state_json_status_surfaces_statuses() { - let dir = fixture(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - let state = r#"{ - "version": 1, - "state_revision": 42, - "applied_revision": { - "config_digest": "applied-config", - "resources": { - "graph.knowledge": { "digest": "graph-digest" } - } - }, - "resource_statuses": { - "graph.knowledge": { - "status": "applied", - "conditions": ["healthy"], - "message": "ready" - } - }, - "approval_records": {}, - "recovery_records": {}, - "observations": { - "graph.knowledge": { "manifest_version": 12 } - } -}"#; - fs::write(state_dir.join("state.json"), state).unwrap(); - - let out = status_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.state_observations.state_found); - assert_eq!(out.state_observations.state_revision, 42); - assert_eq!( - out.state_observations.state_cas.as_deref(), - Some(format!("sha256:{}", sha256_hex(state.as_bytes())).as_str()) - ); - assert_eq!( - out.resource_digests - 
.get("graph.knowledge") - .map(String::as_str), - Some("graph-digest") - ); - assert_eq!( - out.resource_statuses["graph.knowledge"].status, - ResourceLifecycleStatus::Applied - ); - } - - #[tokio::test] - async fn missing_state_status_succeeds_with_warning() { - let dir = fixture(); - let out = status_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(!out.state_observations.state_found); - assert_eq!(out.state_observations.state_revision, 0); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_missing") - ); - } - - #[tokio::test] - async fn invalid_state_status_fails() { - let dir = fixture(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write(state_dir.join("state.json"), "{").unwrap(); - - let out = status_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(out.state_observations.state_found); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "invalid_state_json") - ); - } - - #[tokio::test] - async fn status_surfaces_full_lock_metadata() { - let dir = fixture(); - write_lock_file(dir.path(), "held-lock", "refresh"); - - let out = status_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.state_observations.locked); - assert_eq!(out.state_observations.lock_id.as_deref(), Some("held-lock")); - assert_eq!( - out.state_observations.lock_operation.as_deref(), - Some("refresh") - ); - assert_eq!( - out.state_observations.lock_created_at.as_deref(), - Some("1970-01-01T00:00:00Z") - ); - assert_eq!(out.state_observations.lock_pid, Some(123)); - assert!(out.state_observations.lock_age_seconds.is_some()); - } - - #[tokio::test] - async fn force_unlock_matching_id_removes_lock() { - let dir = fixture(); - write_lock_file(dir.path(), "held-lock", "plan"); - - let out = force_unlock_config_dir(dir.path(), "held-lock").await; - assert!(out.ok, "{:?}", out.diagnostics); - 
assert!(out.lock_removed); - assert_eq!(out.state_observations.lock_id.as_deref(), Some("held-lock")); - assert_eq!( - out.state_observations.lock_operation.as_deref(), - Some("plan") - ); - assert!(!dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[tokio::test] - async fn force_unlock_wrong_id_fails_and_preserves_lock() { - let dir = fixture(); - write_lock_file(dir.path(), "held-lock", "plan"); - - let out = force_unlock_config_dir(dir.path(), "other-lock").await; - assert!(!out.ok); - assert!(!out.lock_removed); - assert_eq!(out.state_observations.lock_id.as_deref(), Some("held-lock")); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_lock_id_mismatch") - ); - assert!(dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[tokio::test] - async fn force_unlock_missing_lock_fails() { - let dir = fixture(); - - let out = force_unlock_config_dir(dir.path(), "held-lock").await; - assert!(!out.ok); - assert!(!out.lock_removed); - assert!(!out.state_observations.locked); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_lock_missing") - ); - } - - #[tokio::test] - async fn force_unlock_invalid_lock_json_fails_and_preserves_lock() { - let dir = fixture(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write(state_dir.join("lock.json"), "{").unwrap(); - - let out = force_unlock_config_dir(dir.path(), "held-lock").await; - assert!(!out.ok); - assert!(!out.lock_removed); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "invalid_state_lock") - ); - assert!(dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[tokio::test] - async fn force_unlock_unsupported_lock_version_fails_and_preserves_lock() { - let dir = fixture(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("lock.json"), - 
r#"{"version":2,"lock_id":"held-lock","operation":"plan","created_at":"1970-01-01T00:00:00Z","pid":123}"#, - ) - .unwrap(); - - let out = force_unlock_config_dir(dir.path(), "held-lock").await; - assert!(!out.ok); - assert!(!out.lock_removed); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "unsupported_state_lock_version") - ); - assert!(dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[tokio::test] - async fn force_unlock_external_state_backend_rejected() { - let dir = fixture(); - write_lock_file(dir.path(), "held-lock", "plan"); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -state: - backend: s3://state-bucket/cluster -graphs: - knowledge: - schema: ./people.pg -"#, - ) - .unwrap(); - - let out = force_unlock_config_dir(dir.path(), "held-lock").await; - assert!(!out.ok); - assert!(!out.lock_removed); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "unsupported_state_backend") - ); - assert!(dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[tokio::test] - async fn plan_succeeds_after_force_unlock() { - let dir = fixture(); - write_lock_file(dir.path(), "held-lock", "plan"); - - let locked = plan_config_dir(dir.path()).await; - assert!(!locked.ok); - assert!( - locked - .diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_lock_held") - ); - - let unlocked = force_unlock_config_dir(dir.path(), "held-lock").await; - assert!(unlocked.ok, "{:?}", unlocked.diagnostics); - - let out = plan_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - } - - #[tokio::test] - async fn plan_reports_state_cas_revision_and_removes_lock() { - let dir = fixture(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - let state = r#"{ - "version": 1, - "state_revision": 7, - "applied_revision": { - "config_digest": "old", - "resources": { - "graph.knowledge": { "digest": "old-graph" } - } - } -}"#; - 
fs::write(state_dir.join("state.json"), state).unwrap(); - - let out = plan_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert_eq!(out.state_observations.state_revision, 7); - assert_eq!( - out.state_observations.state_cas.as_deref(), - Some(format!("sha256:{}", sha256_hex(state.as_bytes())).as_str()) - ); - assert!(!out.state_observations.locked); - assert!(out.state_observations.lock_id.is_none()); - assert!(out.state_observations.lock_acquired); - assert!(out.state_observations.acquired_lock_id.is_some()); - assert!( - !dir.path().join(CLUSTER_LOCK_FILE).exists(), - "plan must release lock before returning" - ); - } - - #[tokio::test] - async fn existing_lock_makes_plan_fail() { - let dir = fixture(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("lock.json"), - r#"{ - "version": 1, - "lock_id": "held-lock", - "operation": "plan", - "created_at": "2026-06-08T00:00:00Z", - "pid": 123 -}"#, - ) - .unwrap(); - - let out = plan_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(out.state_observations.locked); - assert_eq!(out.state_observations.lock_id.as_deref(), Some("held-lock")); - assert!(!out.state_observations.lock_acquired); - assert!(out.state_observations.acquired_lock_id.is_none()); - assert_eq!( - out.state_observations.lock_operation.as_deref(), - Some("plan") - ); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_lock_held") - ); - assert!(out.diagnostics.iter().any(|diagnostic| { - diagnostic.code == "state_lock_held" - && diagnostic.message.contains("force-unlock held-lock") - })); - } - - #[tokio::test] - async fn state_lock_false_bypasses_lock_with_warning() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -state: - backend: cluster - lock: false -graphs: - knowledge: - schema: ./people.pg -"#, - ) - .unwrap(); - - let out = 
plan_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(!out.state_observations.locked); - assert!(!out.state_observations.lock_acquired); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_lock_disabled") - ); - assert!(!dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[test] - fn external_state_backend_rejected() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - "version: 1\nstate:\n backend: s3://bucket/state\ngraphs: {}\n", - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - assert_eq!(out.diagnostics[0].code, "unsupported_state_backend"); - } - - #[tokio::test] - async fn external_state_backend_plan_rejected() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - "version: 1\nstate:\n backend: s3://bucket/state\ngraphs: {}\n", - ) - .unwrap(); - let out = plan_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "unsupported_state_backend") - ); - } - - #[tokio::test] - async fn import_missing_state_creates_state_with_graph_observation() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - - let out = import_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert_eq!(out.state_observations.state_revision, 1); - assert!(out.state_observations.state_cas.is_some()); - assert!(!out.state_observations.locked); - assert!(out.state_observations.lock_acquired); - assert!(out.state_observations.acquired_lock_id.is_some()); - assert!(!dir.path().join(CLUSTER_LOCK_FILE).exists()); - assert_eq!( - out.resource_digests - .get("schema.knowledge") - .map(String::as_str), - Some(sha256_hex(SCHEMA.as_bytes()).as_str()) - ); - assert!(out.observations["graph.knowledge"]["manifest_version"].is_number()); - assert_eq!( - out.observations["graph.knowledge"]["schema_matches_desired"], - true - ); - - let 
state: serde_json::Value = - serde_json::from_str(&fs::read_to_string(dir.path().join(CLUSTER_STATE_FILE)).unwrap()) - .unwrap(); - assert_eq!(state["state_revision"], 1); - assert_eq!( - state["resource_statuses"]["graph.knowledge"]["status"], - "applied" - ); - } - - #[tokio::test] - async fn import_existing_state_fails() { - let dir = fixture(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{"version":1,"applied_revision":{"resources":{}}}"#, - ) - .unwrap(); - - let out = import_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_already_exists") - ); - } - - #[tokio::test] - async fn refresh_missing_state_fails() { - let dir = fixture(); - let out = refresh_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_missing") - ); - } - - #[tokio::test] - async fn refresh_existing_minimal_state_increments_revision_and_updates_cas() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{"version":1,"applied_revision":{"config_digest":"old","resources":{"graph.knowledge":{"digest":"old"}}}}"#, - ) - .unwrap(); - - let out = refresh_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert_eq!(out.state_observations.state_revision, 1); - assert!(out.state_observations.state_cas.is_some()); - assert!(!out.state_observations.locked); - assert!(out.state_observations.lock_acquired); - assert_eq!( - out.resource_statuses["graph.knowledge"].status, - ResourceLifecycleStatus::Applied - ); - assert!(!dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[tokio::test] - async fn refresh_records_live_schema_digest_and_manifest_version() 
{ - let dir = fixture(); - init_derived_graph(dir.path()).await; - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{"version":1,"state_revision":4,"applied_revision":{"resources":{}}}"#, - ) - .unwrap(); - - let out = refresh_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert_eq!(out.state_observations.state_revision, 5); - assert_eq!( - out.observations["graph.knowledge"]["schema_digest"], - sha256_hex(SCHEMA.as_bytes()) - ); - assert!(out.observations["graph.knowledge"]["manifest_version"].is_u64()); - } - - #[tokio::test] - async fn missing_derived_graph_root_marks_drifted_and_plans_creates() { - let dir = fixture(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{"version":1,"applied_revision":{"resources":{"graph.knowledge":{"digest":"old-graph"},"schema.knowledge":{"digest":"old-schema"}}}}"#, - ) - .unwrap(); - - let out = refresh_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert_eq!( - out.resource_statuses["graph.knowledge"].status, - ResourceLifecycleStatus::Drifted - ); - assert!(!out.resource_digests.contains_key("graph.knowledge")); - assert_eq!(out.observations["graph.knowledge"]["exists"], false); - - let plan = plan_config_dir(dir.path()).await; - assert!(plan.ok, "{:?}", plan.diagnostics); - assert!(plan.changes.iter().any(|change| { - change.resource == "graph.knowledge" && change.operation == PlanOperation::Create - })); - assert!(plan.changes.iter().any(|change| { - change.resource == "schema.knowledge" && change.operation == PlanOperation::Create - })); - } - - #[tokio::test] - async fn live_schema_mismatch_marks_drifted_and_causes_plan_update() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - fs::write( - dir.path().join("people.pg"), - SCHEMA.replace("age: I32?", 
"age: I32?\n nickname: String?"), - ) - .unwrap(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{"version":1,"applied_revision":{"resources":{"graph.knowledge":{"digest":"old-graph"},"schema.knowledge":{"digest":"old-schema"}}}}"#, - ) - .unwrap(); - - let out = refresh_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert_eq!( - out.resource_statuses["schema.knowledge"].status, - ResourceLifecycleStatus::Drifted - ); - assert_eq!( - out.observations["graph.knowledge"]["schema_matches_desired"], - false - ); - - let plan = plan_config_dir(dir.path()).await; - assert!(plan.ok, "{:?}", plan.diagnostics); - assert!(plan.changes.iter().any(|change| { - change.resource == "schema.knowledge" && change.operation == PlanOperation::Update - })); - } - - #[tokio::test] - async fn existing_lock_makes_refresh_fail() { - let dir = fixture(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{"version":1,"applied_revision":{"resources":{}}}"#, - ) - .unwrap(); - fs::write( - state_dir.join("lock.json"), - r#"{"version":1,"lock_id":"held-lock","operation":"refresh","created_at":"2026-06-08T00:00:00Z","pid":123}"#, - ) - .unwrap(); - - let out = refresh_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(out.state_observations.locked); - assert_eq!(out.state_observations.lock_id.as_deref(), Some("held-lock")); - assert!(!out.state_observations.lock_acquired); - assert_eq!( - out.state_observations.lock_operation.as_deref(), - Some("refresh") - ); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_lock_held") - ); - assert!(out.diagnostics.iter().any(|diagnostic| { - diagnostic.code == "state_lock_held" - && diagnostic.message.contains("force-unlock held-lock") - })); - } - - #[tokio::test] - async fn 
state_lock_false_bypasses_refresh_lock_with_warning() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -state: - backend: cluster - lock: false -graphs: - knowledge: - schema: ./people.pg -"#, - ) - .unwrap(); - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{"version":1,"applied_revision":{"resources":{}}}"#, - ) - .unwrap(); - - let out = refresh_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(!out.state_observations.locked); - assert!(!out.state_observations.lock_acquired); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_lock_disabled") - ); - } - - #[tokio::test] - async fn external_state_backend_refresh_rejected() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - "version: 1\nstate:\n backend: s3://bucket/state\ngraphs: {}\n", - ) - .unwrap(); - - let out = refresh_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "unsupported_state_backend") - ); - } - - #[tokio::test] - async fn import_graph_open_error_does_not_create_state() { - let dir = fixture(); - fs::create_dir_all(dir.path().join(CLUSTER_GRAPHS_DIR).join("knowledge.omni")).unwrap(); - - let out = import_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "graph_observation_error") - ); - assert!(!dir.path().join(CLUSTER_STATE_FILE).exists()); - } - - // ---- config-only apply (Stage 3A) ---- - - /// Seed a state.json that simulates "graph exists with the desired schema, - /// queries/policies not yet applied" by borrowing the desired digests. 
- fn write_applyable_state(config_dir: &Path) { - let out = validate_config_dir(config_dir); - assert!(out.ok, "{:?}", out.diagnostics); - let schema_digest = out.resource_digests.get("schema.knowledge").unwrap().clone(); - let graph_composite = graph_digest( - "knowledge", - Some(&schema_digest), - Some(&BTreeMap::new()), - None, - None, - ); - write_state_resources( - config_dir, - &[ - ("graph.knowledge", graph_composite.as_str()), - ("schema.knowledge", schema_digest.as_str()), - ], - ); - } - - fn write_state_resources(config_dir: &Path, resources: &[(&str, &str)]) { - let resource_map: serde_json::Map<String, serde_json::Value> = resources - .iter() - .map(|(address, digest)| ((*address).to_string(), json!({ "digest": digest }))) - .collect(); - let state_dir = config_dir.join(CLUSTER_STATE_DIR); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - serde_json::to_string_pretty(&json!({ - "version": 1, - "state_revision": 1, - "applied_revision": { "resources": resource_map } - })) - .unwrap(), - ) - .unwrap(); - } - - fn read_state_json(config_dir: &Path) -> serde_json::Value { - serde_json::from_str(&fs::read_to_string(config_dir.join(CLUSTER_STATE_FILE)).unwrap()) - .unwrap() - } - - fn recovery_sidecars(config_dir: &Path) -> Vec<std::path::PathBuf> { - let dir = config_dir.join(CLUSTER_RECOVERIES_DIR); - if !dir.exists() { - return Vec::new(); - } - let mut sidecars: Vec<_> = fs::read_dir(dir) - .unwrap() - .map(|entry| entry.unwrap().path()) - .collect(); - sidecars.sort(); - sidecars - } - - fn query_payload_path(config_dir: &Path, digest: &str) -> std::path::PathBuf { - config_dir - .join(CLUSTER_RESOURCES_DIR) - .join("query/knowledge/find_person") - .join(format!("{digest}.gq")) - } - - fn policy_payload_path(config_dir: &Path, digest: &str) -> std::path::PathBuf { - config_dir - .join(CLUSTER_RESOURCES_DIR) - .join("policy/base") - .join(format!("{digest}.yaml")) - } - - #[tokio::test] - async fn apply_without_state_fails_with_state_missing() { - let dir = 
fixture(); - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_missing" - && diagnostic.message.contains("cluster import")) - ); - assert!(!dir.path().join(CLUSTER_STATE_FILE).exists()); - assert!(!dir.path().join(CLUSTER_RESOURCES_DIR).exists()); - assert!(!dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[tokio::test] - async fn apply_writes_payloads_state_and_statuses() { - let dir = fixture(); - write_applyable_state(dir.path()); - let desired = validate_config_dir(dir.path()); - let query_digest = desired - .resource_digests - .get("query.knowledge.find_person") - .unwrap() - .clone(); - let policy_digest = desired.resource_digests.get("policy.base").unwrap().clone(); - let schema_digest = desired - .resource_digests - .get("schema.knowledge") - .unwrap() - .clone(); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert_eq!(out.applied_count, 2); - assert_eq!(out.deferred_count, 0); - assert!(out.converged); - assert!(out.state_written); - - let query_blob = query_payload_path(dir.path(), &query_digest); - assert_eq!(fs::read_to_string(&query_blob).unwrap(), QUERY); - let policy_blob = policy_payload_path(dir.path(), &policy_digest); - assert_eq!(fs::read_to_string(&policy_blob).unwrap(), "rules: []\n"); - - let state = read_state_json(dir.path()); - assert_eq!(state["state_revision"], 2); - let resources = &state["applied_revision"]["resources"]; - assert_eq!( - resources["query.knowledge.find_person"]["digest"], - query_digest - ); - assert_eq!(resources["policy.base"]["digest"], policy_digest); - let expected_composite = graph_digest( - "knowledge", - Some(&schema_digest), - Some( - &[("find_person".to_string(), query_digest.clone())] - .into_iter() - .collect(), - ), - None, - None, - ); - assert_eq!(resources["graph.knowledge"]["digest"], expected_composite); - assert_eq!( - 
state["applied_revision"]["config_digest"], - desired_revision_digest(&out) - ); - assert_eq!( - state["resource_statuses"]["query.knowledge.find_person"]["status"], - "applied" - ); - assert_eq!(state["resource_statuses"]["policy.base"]["status"], "applied"); - assert!(!dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[tokio::test] - async fn apply_records_embedding_provider_profile_and_graph_binding() { - let dir = fixture(); - write_mock_embedding_cluster(dir.path(), "recorded-x"); - write_applyable_state(dir.path()); - let desired = validate_config_dir(dir.path()); - let query_digest = desired - .resource_digests - .get("query.knowledge.find_person") - .unwrap() - .clone(); - let schema_digest = desired - .resource_digests - .get("schema.knowledge") - .unwrap() - .clone(); - let provider_digest = desired - .resource_digests - .get("provider.embedding.default") - .unwrap() - .clone(); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.converged, "{out:?}"); - - let state = read_state_json(dir.path()); - let resources = &state["applied_revision"]["resources"]; - let provider = resources["provider.embedding.default"] - .as_object() - .expect("provider resource"); - assert_eq!(provider["digest"], provider_digest); - assert_eq!(provider["embedding_profile"]["kind"], "mock"); - assert_eq!(provider["embedding_profile"]["model"], "recorded-x"); - assert!(provider["embedding_profile"].get("api_key").is_none()); - assert_eq!( - resources["graph.knowledge"]["embedding_provider"], - "provider.embedding.default" - ); - let expected_graph_digest = graph_digest( - "knowledge", - Some(&schema_digest), - Some( - &[("find_person".to_string(), query_digest)] - .into_iter() - .collect(), - ), - Some("provider.embedding.default"), - Some(&provider_digest), - ); - assert_eq!(resources["graph.knowledge"]["digest"], expected_graph_digest); - } - - #[tokio::test] - async fn 
embedding_provider_changes_update_provider_and_graph_plan() { - let dir = fixture(); - write_mock_embedding_cluster(dir.path(), "recorded-x"); - write_applyable_state(dir.path()); - let first = apply_config_dir(dir.path()).await; - assert!(first.ok && first.converged, "{first:?}"); - - write_mock_embedding_cluster(dir.path(), "recorded-y"); - let plan = plan_config_dir(dir.path()).await; - assert!(plan.ok, "{:?}", plan.diagnostics); - let by_resource: BTreeMap<&str, &PlanChange> = plan - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - assert_eq!( - by_resource["provider.embedding.default"].operation, - PlanOperation::Update - ); - assert_eq!( - by_resource["provider.embedding.default"].disposition, - Some(ApplyDisposition::Applied) - ); - assert_eq!( - by_resource["graph.knowledge"].operation, - PlanOperation::Update - ); - assert_eq!( - by_resource["graph.knowledge"].disposition, - Some(ApplyDisposition::Derived) - ); - } - - #[tokio::test] - async fn embedding_binding_survives_refresh() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_mock_embedding_cluster(dir.path(), "recorded-x"); - write_applyable_state(dir.path()); - let apply = apply_config_dir(dir.path()).await; - assert!(apply.ok && apply.converged, "{apply:?}"); - - let refresh = refresh_config_dir(dir.path()).await; - assert!(refresh.ok, "{:?}", refresh.diagnostics); - - let state = read_state_json(dir.path()); - let resources = &state["applied_revision"]["resources"]; - assert_eq!( - resources["graph.knowledge"]["embedding_provider"], - "provider.embedding.default" - ); - assert_eq!( - resources["provider.embedding.default"]["embedding_profile"]["model"], - "recorded-x" - ); - } - - fn desired_revision_digest(out: &ApplyOutput) -> String { - out.desired_revision.config_digest.clone().unwrap() - } - - #[tokio::test] - async fn apply_update_changes_query_digest_and_keeps_old_blob() { - let dir = fixture(); - let desired = 
validate_config_dir(dir.path()); - let schema_digest = desired - .resource_digests - .get("schema.knowledge") - .unwrap() - .clone(); - let old_digest = "0".repeat(64); - let graph_composite = graph_digest( - "knowledge", - Some(&schema_digest), - Some(&BTreeMap::new()), - None, - None, - ); - write_state_resources( - dir.path(), - &[ - ("graph.knowledge", graph_composite.as_str()), - ("schema.knowledge", schema_digest.as_str()), - ("query.knowledge.find_person", old_digest.as_str()), - ], - ); - let old_blob = query_payload_path(dir.path(), &old_digest); - fs::create_dir_all(old_blob.parent().unwrap()).unwrap(); - fs::write(&old_blob, "old query source").unwrap(); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - let new_digest = desired - .resource_digests - .get("query.knowledge.find_person") - .unwrap(); - let state = read_state_json(dir.path()); - assert_eq!( - state["applied_revision"]["resources"]["query.knowledge.find_person"]["digest"], - *new_digest - ); - assert_eq!(fs::read_to_string(&old_blob).unwrap(), "old query source"); - assert!(query_payload_path(dir.path(), new_digest).exists()); - } - - #[tokio::test] - async fn apply_deletes_removed_resources_but_keeps_blobs() { - let dir = fixture(); - let desired = validate_config_dir(dir.path()); - let schema_digest = desired - .resource_digests - .get("schema.knowledge") - .unwrap() - .clone(); - let stale_query_digest = "1".repeat(64); - let stale_policy_digest = "2".repeat(64); - let graph_composite = graph_digest( - "knowledge", - Some(&schema_digest), - Some(&BTreeMap::new()), - None, - None, - ); - write_state_resources( - dir.path(), - &[ - ("graph.knowledge", graph_composite.as_str()), - ("schema.knowledge", schema_digest.as_str()), - ("query.knowledge.orphan", stale_query_digest.as_str()), - ("policy.old", stale_policy_digest.as_str()), - ], - ); - let stale_blob = dir - .path() - .join(CLUSTER_RESOURCES_DIR) - .join("policy/old") - 
.join(format!("{stale_policy_digest}.yaml")); - fs::create_dir_all(stale_blob.parent().unwrap()).unwrap(); - fs::write(&stale_blob, "old policy").unwrap(); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.converged); - let state = read_state_json(dir.path()); - let resources = &state["applied_revision"]["resources"]; - assert!(resources.get("query.knowledge.orphan").is_none()); - assert!(resources.get("policy.old").is_none()); - assert!( - state["resource_statuses"] - .get("query.knowledge.orphan") - .is_none() - ); - // Deleted resources leave their content-addressed blobs in place; GC is - // a later stage. - assert_eq!(fs::read_to_string(&stale_blob).unwrap(), "old policy"); - // The composite no longer includes the orphan query. - let query_digest = desired - .resource_digests - .get("query.knowledge.find_person") - .unwrap() - .clone(); - let expected_composite = graph_digest( - "knowledge", - Some(&schema_digest), - Some(&[("find_person".to_string(), query_digest)].into_iter().collect()), - None, - None, - ); - assert_eq!(resources["graph.knowledge"]["digest"], expected_composite); - } - - #[tokio::test] - async fn apply_schema_update_and_dependent_query_in_one_run() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); - // Schema update + a query update that depends on the new field: one - // apply executes the schema migration first, then the catalog write. 
- fs::write(dir.path().join("people.pg"), SCHEMA_V2).unwrap(); - fs::write( - dir.path().join("people.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name, $p.bio }\n}\n", - ) - .unwrap(); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.converged, "{out:?}"); - let by_resource: BTreeMap<&str, &PlanChange> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - assert_eq!( - by_resource["schema.knowledge"].disposition, - Some(ApplyDisposition::Applied) - ); - assert_eq!( - by_resource["query.knowledge.find_person"].disposition, - Some(ApplyDisposition::Applied) - ); - assert_eq!( - by_resource["graph.knowledge"].disposition, - Some(ApplyDisposition::Derived) - ); - // The live graph carries the new schema. - let db = Omnigraph::open_read_only(&derived_graph_uri(dir.path(), "knowledge")) - .await - .unwrap(); - let desired = validate_config_dir(dir.path()); - assert_eq!( - sha256_hex(db.schema_source().as_bytes()), - desired.resource_digests["schema.knowledge"] - ); - let state = read_state_json(dir.path()); - assert_eq!( - state["applied_revision"]["resources"]["schema.knowledge"]["digest"], - desired.resource_digests["schema.knowledge"] - ); - // Sidecar retired after the CAS landed. - assert!( - !dir.path().join(CLUSTER_RECOVERIES_DIR).exists() - || fs::read_dir(dir.path().join(CLUSTER_RECOVERIES_DIR)) - .unwrap() - .next() - .is_none() - ); - } - - #[tokio::test] - async fn apply_unsupported_schema_change_fails_loudly() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); - // Property type changes are unsupported by the engine planner. 
- fs::write( - dir.path().join("people.pg"), - "\nnode Person {\n name: String @key\n age: I64?\n}\n", - ) - .unwrap(); - - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(out.diagnostics.iter().any(|diagnostic| { - diagnostic.code == "schema_apply_failed" - && diagnostic.message.contains("changing property type") - })); - let by_resource: BTreeMap<&str, &PlanChange> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - assert_eq!( - by_resource["schema.knowledge"].disposition, - Some(ApplyDisposition::Blocked) - ); - assert_eq!( - by_resource["schema.knowledge"].reason.as_deref(), - Some("schema_apply_failed") - ); - // The live schema and the ledger are unchanged. - let state = read_state_json(dir.path()); - let desired = validate_config_dir(dir.path()); - assert_ne!( - state["applied_revision"]["resources"]["schema.knowledge"]["digest"], - desired.resource_digests["schema.knowledge"] - ); - let db = Omnigraph::open_read_only(&derived_graph_uri(dir.path(), "knowledge")) - .await - .unwrap(); - assert_eq!(db.schema_source().as_str(), SCHEMA); - assert!( - recovery_sidecars(dir.path()).is_empty(), - "{:?}", - recovery_sidecars(dir.path()) - ); - // Second run fails just as loudly and still leaves no sidecar because - // the engine preview rejects before graph state can move. 
- let second = apply_config_dir(dir.path()).await; - assert!(!second.ok); - assert!( - second - .diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "schema_apply_failed") - ); - assert!( - recovery_sidecars(dir.path()).is_empty(), - "{:?}", - recovery_sidecars(dir.path()) - ); - } - - #[tokio::test] - async fn apply_schema_update_blocked_by_non_main_branch_leaves_no_sidecar() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); - let graph_uri = derived_graph_uri(dir.path(), "knowledge"); - let db = Omnigraph::open(&graph_uri).await.unwrap(); - db.branch_create("feature").await.unwrap(); - drop(db); - let before_state = read_state_json(dir.path()); - fs::write(dir.path().join("people.pg"), SCHEMA_V2).unwrap(); - - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(out.diagnostics.iter().any(|diagnostic| { - diagnostic.code == "schema_apply_failed" - && diagnostic - .message - .contains("schema apply requires a graph with only main") - })); - assert!( - recovery_sidecars(dir.path()).is_empty(), - "{:?}", - recovery_sidecars(dir.path()) - ); - let after_state = read_state_json(dir.path()); - assert_eq!( - after_state["applied_revision"]["resources"], - before_state["applied_revision"]["resources"] - ); - let reopened = Omnigraph::open_read_only(&graph_uri).await.unwrap(); - assert_eq!(reopened.schema_source().as_str(), SCHEMA); - } - - #[tokio::test] - async fn apply_blocks_schema_update_while_recovery_pending() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_state_resources(dir.path(), &[("schema.knowledge", "stale-digest")]); - fs::write(dir.path().join("people.pg"), SCHEMA_V2).unwrap(); - // A pending sidecar whose intent matches neither live nor recorded. 
- write_schema_apply_sidecar(dir.path(), "knowledge", "intended-digest", "01PENDS"); - - let out = apply_config_dir(dir.path()).await; - let by_resource: BTreeMap<&str, &PlanChange> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - assert_eq!( - by_resource["schema.knowledge"].disposition, - Some(ApplyDisposition::Blocked) - ); - assert_eq!( - by_resource["schema.knowledge"].reason.as_deref(), - Some("cluster_recovery_pending") - ); - } - - #[tokio::test] - async fn apply_creates_graph_and_unblocks_dependents() { - let dir = fixture(); - write_state_resources(dir.path(), &[]); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.converged, "{out:?}"); - let by_resource: BTreeMap<&str, &PlanChange> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - // Stage 4A: the create executes, and its dependents apply in-run. - assert_eq!( - by_resource["graph.knowledge"].disposition, - Some(ApplyDisposition::Applied) - ); - assert_eq!( - by_resource["schema.knowledge"].disposition, - Some(ApplyDisposition::Applied) - ); - assert_eq!( - by_resource["query.knowledge.find_person"].disposition, - Some(ApplyDisposition::Applied) - ); - assert_eq!( - by_resource["policy.base"].disposition, - Some(ApplyDisposition::Applied) - ); - // The graph exists on disk and opens; state records everything. 
- let graph_uri = derived_graph_uri(dir.path(), "knowledge"); - let db = Omnigraph::open_read_only(&graph_uri).await.unwrap(); - let desired = validate_config_dir(dir.path()); - assert_eq!( - sha256_hex(db.schema_source().as_bytes()), - desired.resource_digests["schema.knowledge"] - ); - let state = read_state_json(dir.path()); - assert_eq!( - state["applied_revision"]["resources"]["schema.knowledge"]["digest"], - desired.resource_digests["schema.knowledge"] - ); - assert_eq!( - state["resource_statuses"]["graph.knowledge"]["status"], - "applied" - ); - // The create's sidecar was retired after the state CAS landed. - assert!( - !dir.path().join(CLUSTER_RECOVERIES_DIR).exists() - || fs::read_dir(dir.path().join(CLUSTER_RECOVERIES_DIR)) - .unwrap() - .next() - .is_none() - ); - } - - #[tokio::test] - async fn apply_create_failure_blocks_dependents_and_keeps_sidecar() { - let dir = fixture(); - write_state_resources(dir.path(), &[]); - // Make the init fail its strict preflight: a junk _schema.pg already - // sits at the derived root (the engine refuses to overwrite it). - let root = dir.path().join(CLUSTER_GRAPHS_DIR).join("knowledge.omni"); - fs::create_dir_all(&root).unwrap(); - fs::write(root.join("_schema.pg"), "junk").unwrap(); - - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "graph_create_failed") - ); - let by_resource: BTreeMap<&str, &PlanChange> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - // Dependents are demoted: the run tells the truth about what executed. 
- assert_eq!( - by_resource["graph.knowledge"].disposition, - Some(ApplyDisposition::Blocked) - ); - assert_eq!( - by_resource["query.knowledge.find_person"].disposition, - Some(ApplyDisposition::Blocked) - ); - assert_eq!( - by_resource["query.knowledge.find_person"].reason.as_deref(), - Some("dependency_not_applied") - ); - assert_eq!( - by_resource["policy.base"].disposition, - Some(ApplyDisposition::Blocked) - ); - assert!(!out.converged); - // The sidecar stays for the sweep to classify next run. - assert!( - fs::read_dir(dir.path().join(CLUSTER_RECOVERIES_DIR)) - .unwrap() - .next() - .is_some() - ); - // No graph digests moved. - let state = read_state_json(dir.path()); - assert!( - state["applied_revision"]["resources"] - .as_object() - .unwrap() - .is_empty() - ); - } - - #[tokio::test] - async fn apply_blocks_graph_delete_without_approval() { - let dir = fixture(); - let desired = validate_config_dir(dir.path()); - let schema_digest = desired - .resource_digests - .get("schema.knowledge") - .unwrap() - .clone(); - let graph_composite = graph_digest( - "knowledge", - Some(&schema_digest), - Some(&BTreeMap::new()), - None, - None, - ); - write_state_resources( - dir.path(), - &[ - ("graph.knowledge", graph_composite.as_str()), - ("schema.knowledge", schema_digest.as_str()), - ("graph.old", "3333"), - ("schema.old", "4444"), - ("query.old.q", "5555"), - ], - ); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(!out.converged); - let by_resource: BTreeMap<&str, &PlanChange> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - // Stage 4C: deletes are gated, not deferred; every subtree change - // blocks on the single graph-level approval. 
- assert_eq!( - by_resource["graph.old"].disposition, - Some(ApplyDisposition::Blocked) - ); - assert_eq!( - by_resource["graph.old"].reason.as_deref(), - Some("approval_required") - ); - assert_eq!( - by_resource["schema.old"].reason.as_deref(), - Some("approval_required") - ); - assert_eq!( - by_resource["query.old.q"].reason.as_deref(), - Some("approval_required") - ); - // State intact; nothing destroyed without the artifact. - let state = read_state_json(dir.path()); - let resources = &state["applied_revision"]["resources"]; - assert_eq!(resources["graph.old"]["digest"], "3333"); - assert_eq!(resources["schema.old"]["digest"], "4444"); - assert_eq!(resources["query.old.q"]["digest"], "5555"); - } - - #[tokio::test] - async fn approve_writes_digest_bound_artifact() { - let dir = fixture(); - write_applyable_state(dir.path()); - // Seed a deletable subtree. - let state = read_state_json(dir.path()); - let graph_digest_str = state["applied_revision"]["resources"]["graph.knowledge"]["digest"] - .as_str() - .unwrap() - .to_string(); - let schema_digest_str = state["applied_revision"]["resources"]["schema.knowledge"] - ["digest"] - .as_str() - .unwrap() - .to_string(); - write_state_resources( - dir.path(), - &[ - ("graph.knowledge", graph_digest_str.as_str()), - ("schema.knowledge", schema_digest_str.as_str()), - ("graph.old", "3333"), - ("schema.old", "4444"), - ], - ); - - let out = approve_config_dir(dir.path(), "graph.old", "andrew").await; - assert!(out.ok, "{:?}", out.diagnostics); - let approval_id = out.approval_id.clone().unwrap(); - let artifact: serde_json::Value = serde_json::from_str( - &fs::read_to_string( - dir.path() - .join(CLUSTER_APPROVALS_DIR) - .join(format!("{approval_id}.json")), - ) - .unwrap(), - ) - .unwrap(); - assert_eq!(artifact["resource"], "graph.old"); - assert_eq!(artifact["operation"], "delete"); - assert_eq!(artifact["approved_by"], "andrew"); - assert_eq!(artifact["bound_before_digest"], "3333"); - 
assert!(artifact["bound_after_digest"].is_null()); - assert!(artifact["bound_config_digest"].is_string()); - assert!(artifact["consumed_at"].is_null()); - - // A non-gated address is refused. - let not_gated = approve_config_dir(dir.path(), "query.knowledge.find_person", "andrew").await; - assert!(!not_gated.ok); - assert!( - not_gated - .diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "approval_not_required") - ); - } - - #[tokio::test] - async fn stale_approval_is_ignored() { - let dir = fixture(); - write_applyable_state(dir.path()); - let state = read_state_json(dir.path()); - let graph_digest_str = state["applied_revision"]["resources"]["graph.knowledge"]["digest"] - .as_str() - .unwrap() - .to_string(); - let schema_digest_str = state["applied_revision"]["resources"]["schema.knowledge"] - ["digest"] - .as_str() - .unwrap() - .to_string(); - write_state_resources( - dir.path(), - &[ - ("graph.knowledge", graph_digest_str.as_str()), - ("schema.knowledge", schema_digest_str.as_str()), - ("graph.old", "3333"), - ], - ); - let approved = approve_config_dir(dir.path(), "graph.old", "andrew").await; - assert!(approved.ok, "{:?}", approved.diagnostics); - // The config moves after approval: the bound config digest no longer - // matches and the artifact authorizes nothing. 
- fs::write(dir.path().join("base.policy.yaml"), "rules: [] # moved\n").unwrap(); - - let out = apply_config_dir(dir.path()).await; - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "approval_stale"), - "{:?}", - out.diagnostics - ); - let by_resource: BTreeMap<&str, &PlanChange> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - assert_eq!( - by_resource["graph.old"].reason.as_deref(), - Some("approval_required") - ); - let state = read_state_json(dir.path()); - assert_eq!( - state["applied_revision"]["resources"]["graph.old"]["digest"], - "3333" - ); - } - - #[tokio::test] - async fn compute_approvals_one_gate_per_subtree() { - let dir = fixture(); - write_applyable_state(dir.path()); - let state = read_state_json(dir.path()); - let g = state["applied_revision"]["resources"]["graph.knowledge"]["digest"] - .as_str() - .unwrap() - .to_string(); - let sc = state["applied_revision"]["resources"]["schema.knowledge"]["digest"] - .as_str() - .unwrap() - .to_string(); - write_state_resources( - dir.path(), - &[ - ("graph.knowledge", g.as_str()), - ("schema.knowledge", sc.as_str()), - ("graph.old", "3333"), - ("schema.old", "4444"), - ("query.old.q", "5555"), - ], - ); - let plan = plan_config_dir(dir.path()).await; - let gated: Vec<&str> = plan - .approvals_required - .iter() - .map(|gate| gate.resource.as_str()) - .collect(); - assert_eq!(gated, vec!["graph.old"], "{plan:?}"); - assert!(!plan.approvals_required[0].satisfied); - } - - #[tokio::test] - async fn apply_is_idempotent() { - let dir = fixture(); - write_applyable_state(dir.path()); - - let first = apply_config_dir(dir.path()).await; - assert!(first.ok, "{:?}", first.diagnostics); - assert!(first.state_written); - let state_after_first = fs::read_to_string(dir.path().join(CLUSTER_STATE_FILE)).unwrap(); - - let second = apply_config_dir(dir.path()).await; - assert!(second.ok, "{:?}", second.diagnostics); - assert!(second.changes.is_empty()); 
- assert_eq!(second.applied_count, 0); - assert!(second.converged); - assert!(!second.state_written); - let state_after_second = fs::read_to_string(dir.path().join(CLUSTER_STATE_FILE)).unwrap(); - assert_eq!(state_after_first, state_after_second); - assert_eq!(second.state_observations.state_revision, 2); - } - - #[tokio::test] - async fn apply_respects_held_lock() { - let dir = fixture(); - write_applyable_state(dir.path()); - write_lock_file(dir.path(), "held-lock", "plan"); - - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_lock_held") - ); - // The held lock survives a refused apply, and nothing was written. - assert!(dir.path().join(CLUSTER_LOCK_FILE).exists()); - assert!(!dir.path().join(CLUSTER_RESOURCES_DIR).exists()); - let state = read_state_json(dir.path()); - assert_eq!(state["state_revision"], 1); - } - - #[tokio::test] - async fn apply_state_lock_false_bypasses_with_warning() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -state: - backend: cluster - lock: false -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -"#, - ) - .unwrap(); - write_applyable_state(dir.path()); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.state_written); - assert!(!out.state_observations.lock_acquired); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_lock_disabled") - ); - assert!(!dir.path().join(CLUSTER_LOCK_FILE).exists()); - } - - #[tokio::test] - async fn apply_skips_existing_payload_blob() { - let dir = fixture(); - write_applyable_state(dir.path()); - let desired = validate_config_dir(dir.path()); - let query_digest = desired - .resource_digests - .get("query.knowledge.find_person") - .unwrap() - .clone(); - // Content-addressed blobs are trusted by name: an existing file is - // 
never rewritten. - let blob = query_payload_path(dir.path(), &query_digest); - fs::create_dir_all(blob.parent().unwrap()).unwrap(); - fs::write(&blob, "pre-existing").unwrap(); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert_eq!(fs::read_to_string(&blob).unwrap(), "pre-existing"); - } - - #[tokio::test] - async fn apply_invalid_config_fails_before_lock() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - "version: 1\nnot_a_field: true\n", - ) - .unwrap(); - - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - // Config errors bail before the lock or any state directory exists. - assert!(!dir.path().join(CLUSTER_STATE_DIR).exists()); - } - - /// When the state write fails after payloads landed, the output must - /// report the statuses actually on disk, not the unpersisted in-memory - /// mutations (phantom `applied` entries would mislead automation that - /// reads `resource_statuses` independently of `ok`). - #[cfg(unix)] - #[tokio::test] - async fn apply_state_write_failure_reports_persisted_statuses() { - use std::os::unix::fs::PermissionsExt; - - let dir = fixture(); - // lock: false so the only write into __cluster/ is state.json itself. - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -state: - backend: cluster - lock: false -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -"#, - ) - .unwrap(); - write_applyable_state(dir.path()); - // Pre-create the payload blob so the payload phase is a no-op and the - // failure lands exactly at the state write. 
- let desired = validate_config_dir(dir.path()); - let query_digest = desired - .resource_digests - .get("query.knowledge.find_person") - .unwrap(); - let blob = query_payload_path(dir.path(), query_digest); - fs::create_dir_all(blob.parent().unwrap()).unwrap(); - fs::write(&blob, QUERY).unwrap(); - - let state_dir = dir.path().join(CLUSTER_STATE_DIR); - fs::set_permissions(&state_dir, fs::Permissions::from_mode(0o555)).unwrap(); - // Running as root ignores permission bits; skip rather than flake. - if fs::write(state_dir.join("probe"), b"x").is_ok() { - let _ = fs::remove_file(state_dir.join("probe")); - fs::set_permissions(&state_dir, fs::Permissions::from_mode(0o755)).unwrap(); - eprintln!("skipping: permissions are not enforced (running as root)"); - return; - } - - let out = apply_config_dir(dir.path()).await; - fs::set_permissions(&state_dir, fs::Permissions::from_mode(0o755)).unwrap(); - - assert!(!out.ok); - assert!(!out.state_written); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_write_error"), - "{:?}", - out.diagnostics - ); - // The seeded state has no statuses; the failed apply must not invent - // the in-memory `applied` ones it failed to persist. - assert!( - out.resource_statuses.is_empty(), - "unpersisted statuses leaked into output: {:?}", - out.resource_statuses - ); - } - - // ---- catalog payload verification (Stage 3B) ---- - - /// Converge a fixture dir and return the query blob path. 
- async fn converge_fixture(config_dir: &Path) -> std::path::PathBuf { - write_applyable_state(config_dir); - let out = apply_config_dir(config_dir).await; - assert!(out.ok && out.converged, "{:?}", out.diagnostics); - let desired = validate_config_dir(config_dir); - query_payload_path( - config_dir, - desired - .resource_digests - .get("query.knowledge.find_person") - .unwrap(), - ) - } - - #[tokio::test] - async fn status_reports_missing_payload_read_only() { - let dir = fixture(); - let blob = converge_fixture(dir.path()).await; - let state_before = fs::read_to_string(dir.path().join(CLUSTER_STATE_FILE)).unwrap(); - fs::remove_file(&blob).unwrap(); - - let out = status_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.diagnostics.iter().any(|diagnostic| { - diagnostic.code == "catalog_payload_missing" - && diagnostic.path == "query.knowledge.find_person" - })); - // Read-only: persisted statuses and state bytes untouched. - assert_eq!( - out.resource_statuses["query.knowledge.find_person"].status, - ResourceLifecycleStatus::Applied - ); - assert_eq!( - fs::read_to_string(dir.path().join(CLUSTER_STATE_FILE)).unwrap(), - state_before - ); - } - - #[tokio::test] - async fn refresh_removes_digest_and_drifts_on_missing_payload() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - let blob = converge_fixture(dir.path()).await; - fs::remove_file(&blob).unwrap(); - - let out = refresh_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "catalog_payload_missing") - ); - let status = &out.resource_statuses["query.knowledge.find_person"]; - assert_eq!(status.status, ResourceLifecycleStatus::Drifted); - assert!(status.conditions.contains(&"payload_missing".to_string())); - let state = read_state_json(dir.path()); - assert!( - state["applied_revision"]["resources"] - .get("query.knowledge.find_person") - .is_none(), - 
"{state}" - ); - } - - #[tokio::test] - async fn refresh_drifts_on_corrupted_payload() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - let blob = converge_fixture(dir.path()).await; - fs::write(&blob, "corrupted content").unwrap(); - - let out = refresh_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - let status = &out.resource_statuses["query.knowledge.find_person"]; - assert_eq!(status.status, ResourceLifecycleStatus::Drifted); - assert!(status.conditions.contains(&"payload_mismatch".to_string())); - let state = read_state_json(dir.path()); - assert!( - state["applied_revision"]["resources"] - .get("query.knowledge.find_person") - .is_none() - ); - } - - #[tokio::test] - #[cfg(unix)] - async fn refresh_flags_unreadable_payload_as_error() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - let blob = converge_fixture(dir.path()).await; - // Make the payload unreadable without removing it: permission - // denied is a genuine non-NotFound IO error. (A same-named - // directory no longer triggers this path: object-store semantics - // classify a directory at an object path as NotFound β€” "only - // objects exist" β€” which is the missing-payload case, not the - // unreadable one.) - let mut perms = fs::metadata(&blob).unwrap().permissions(); - std::os::unix::fs::PermissionsExt::set_mode(&mut perms, 0o000); - fs::set_permissions(&blob, perms).unwrap(); - // Root reads straight through mode 000 (container dev runners - // commonly run as root): skip rather than fail β€” the contract - // under test needs a genuine permission error. 
- if fs::read(&blob).is_ok() { - eprintln!( - "skipping refresh_flags_unreadable_payload_as_error: running as root (mode 000 is still readable)" - ); - return; - } - - let out = refresh_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "catalog_payload_read_error") - ); - let status = &out.resource_statuses["query.knowledge.find_person"]; - assert_eq!(status.status, ResourceLifecycleStatus::Error); - assert!(status.conditions.contains(&"payload_read_error".to_string())); - // Transient IO keeps the digest: no spurious republish. - let state = read_state_json(dir.path()); - assert!( - state["applied_revision"]["resources"] - .get("query.knowledge.find_person") - .is_some() - ); - } - - #[tokio::test] - async fn payload_drift_self_heals_through_refresh_plan_apply() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - let blob = converge_fixture(dir.path()).await; - let original = fs::read_to_string(&blob).unwrap(); - fs::remove_file(&blob).unwrap(); - - let refresh = refresh_config_dir(dir.path()).await; - assert!(refresh.ok, "{:?}", refresh.diagnostics); - - let plan = plan_config_dir(dir.path()).await; - let query_change = plan - .changes - .iter() - .find(|change| change.resource == "query.knowledge.find_person") - .expect("plan must propose recreating the query"); - assert_eq!(query_change.operation, PlanOperation::Create); - assert_eq!(query_change.disposition, Some(ApplyDisposition::Applied)); - - let apply = apply_config_dir(dir.path()).await; - assert!(apply.ok && apply.converged, "{:?}", apply.diagnostics); - assert_eq!(fs::read_to_string(&blob).unwrap(), original); - - let status = status_config_dir(dir.path()).await; - assert!( - !status - .diagnostics - .iter() - .any(|diagnostic| diagnostic.code.starts_with("catalog_payload")), - "{:?}", - status.diagnostics - ); - } - - #[tokio::test] - async fn verification_skips_graph_and_schema_resources() { - let dir = 
fixture(); - write_applyable_state(dir.path()); // graph + schema digests only, no blobs - - let out = status_config_dir(dir.path()).await; - assert!( - !out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code.starts_with("catalog_payload")), - "{:?}", - out.diagnostics - ); - } - - // ---- recovery sidecars + sweep (Stage 4A) ---- - - fn derived_graph_uri(config_dir: &Path, graph_id: &str) -> String { - display_path( - &config_dir - .join(CLUSTER_GRAPHS_DIR) - .join(format!("{graph_id}.omni")), - ) - } - - fn write_create_sidecar( - config_dir: &Path, - graph_id: &str, - desired_schema_digest: &str, - operation_id: &str, - ) -> PathBuf { - let dir = config_dir.join(CLUSTER_RECOVERIES_DIR); - fs::create_dir_all(&dir).unwrap(); - let path = dir.join(format!("{operation_id}.json")); - fs::write( - &path, - serde_json::to_string_pretty(&json!({ - "schema_version": 1, - "operation_id": operation_id, - "started_at": "1970-01-01T00:00:00Z", - "kind": "graph_create", - "graph_id": graph_id, - "graph_uri": derived_graph_uri(config_dir, graph_id), - "desired_schema_digest": desired_schema_digest, - })) - .unwrap(), - ) - .unwrap(); - path - } - - #[tokio::test] - async fn sweep_removes_sidecar_when_root_absent() { - let dir = fixture(); - write_applyable_state(dir.path()); - let sidecar = write_create_sidecar(dir.path(), "knowledge", "irrelevant", "01ROW1"); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - // Row 1: nothing moved; intent removed, run proceeds normally. 
- assert!(!sidecar.exists()); - assert!(out.converged); - } - - #[tokio::test] - async fn sweep_rolls_forward_completed_create() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_state_resources(dir.path(), &[]); // state predates the create - let desired = validate_config_dir(dir.path()); - let schema_digest = desired.resource_digests["schema.knowledge"].clone(); - let sidecar = write_create_sidecar(dir.path(), "knowledge", &schema_digest, "01ROW4"); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_rolled_forward") - ); - // Row 4: ledger converged to observable reality, audit recorded, - // sidecar retired after the CAS landed. - let state = read_state_json(dir.path()); - assert_eq!( - state["applied_revision"]["resources"]["schema.knowledge"]["digest"], - schema_digest - ); - assert!( - state["recovery_records"] - .as_object() - .unwrap() - .values() - .any(|record| record["outcome"] == "rolled_forward" - && record["graph_id"] == "knowledge") - ); - assert!(!sidecar.exists()); - // With the graph rolled forward, the same run converges the catalog. - assert!(out.converged, "{out:?}"); - } - - #[tokio::test] - async fn sweep_completes_already_recorded_create() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); // state already records graph+schema - let desired = validate_config_dir(dir.path()); - let sidecar = write_create_sidecar( - dir.path(), - "knowledge", - &desired.resource_digests["schema.knowledge"], - "01ROW2", - ); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - // Row 2: outcome was already durable; no audit entry, sidecar retired. 
- assert!(!sidecar.exists()); - let state = read_state_json(dir.path()); - assert!( - state["recovery_records"] - .as_object() - .is_none_or(|records| records.is_empty()), - "{state}" - ); - } - - #[tokio::test] - async fn sweep_keeps_sidecar_for_incomplete_root() { - let dir = fixture(); - write_applyable_state(dir.path()); - // A root that exists but cannot be opened: the engine's partial-init gap. - let root = dir.path().join(CLUSTER_GRAPHS_DIR).join("knowledge.omni"); - fs::create_dir_all(&root).unwrap(); - fs::write(root.join("_schema.pg"), "junk").unwrap(); - let sidecar = write_create_sidecar(dir.path(), "knowledge", "whatever", "01ROW5"); - - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "graph_create_incomplete") - ); - // Row 5: never auto-delete; sidecar and root stay for the operator, - // and the Error status is persisted by the run's state write. - assert!(sidecar.exists()); - assert!(root.exists()); - let state = read_state_json(dir.path()); - assert_eq!(state["resource_statuses"]["graph.knowledge"]["status"], "error"); - assert!( - state["resource_statuses"]["graph.knowledge"]["conditions"] - .as_array() - .unwrap() - .iter() - .any(|condition| condition == "graph_create_incomplete") - ); - } - - #[tokio::test] - async fn sweep_flags_unexpected_schema_as_pending() { - let dir = fixture(); - write_state_resources(dir.path(), &[]); - // Live graph exists with a schema the sidecar never intended. 
- let graph_dir = dir.path().join(CLUSTER_GRAPHS_DIR); - fs::create_dir_all(&graph_dir).unwrap(); - Omnigraph::init( - &derived_graph_uri(dir.path(), "knowledge"), - "\nnode Other {\n name: String @key\n}\n", - ) - .await - .unwrap(); - let desired = validate_config_dir(dir.path()); - let sidecar = write_create_sidecar( - dir.path(), - "knowledge", - &desired.resource_digests["schema.knowledge"], - "01ROW6", - ); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); // warning, not error - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_pending") - ); - // Row 6: refuse to guess; sidecar kept, Drifted persisted. - assert!(sidecar.exists()); - let state = read_state_json(dir.path()); - assert_eq!( - state["resource_statuses"]["graph.knowledge"]["status"], - "drifted" - ); - assert!( - state["resource_statuses"]["graph.knowledge"]["conditions"] - .as_array() - .unwrap() - .iter() - .any(|condition| condition == "actual_applied_state_pending") - ); - } - - #[tokio::test] - async fn apply_blocks_create_while_recovery_pending() { - let dir = fixture(); - write_state_resources(dir.path(), &[]); - // A kept (row 5) sidecar: partial root that cannot be opened. - let root = dir.path().join(CLUSTER_GRAPHS_DIR).join("knowledge.omni"); - fs::create_dir_all(&root).unwrap(); - fs::write(root.join("_schema.pg"), "junk").unwrap(); - let sidecar = write_create_sidecar(dir.path(), "knowledge", "whatever", "01PEND"); - - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); // row 5 is an error condition - let by_resource: BTreeMap<&str, &PlanChange> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - // The pending recovery blocks the create and its dependents; the - // executor never attempts the init. 
- assert_eq!( - by_resource["graph.knowledge"].disposition, - Some(ApplyDisposition::Blocked) - ); - assert_eq!( - by_resource["graph.knowledge"].reason.as_deref(), - Some("cluster_recovery_pending") - ); - assert_eq!( - by_resource["query.knowledge.find_person"].reason.as_deref(), - Some("cluster_recovery_pending") - ); - assert_eq!( - by_resource["policy.base"].reason.as_deref(), - Some("cluster_recovery_pending") - ); - assert!(sidecar.exists()); - // The sweep's Error status is what persists, not a generic Blocked. - let state = read_state_json(dir.path()); - assert_eq!(state["resource_statuses"]["graph.knowledge"]["status"], "error"); - } - - #[tokio::test] - async fn plan_embeds_migration_preview_for_schema_update() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); - fs::write( - dir.path().join("people.pg"), - "\nnode Person {\n name: String @key\n age: I32?\n bio: String?\n}\n", - ) - .unwrap(); - - let out = plan_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - let schema_change = out - .changes - .iter() - .find(|change| change.resource == "schema.knowledge") - .unwrap(); - let migration = schema_change.migration.as_ref().expect("preview embedded"); - assert!(migration.supported); - assert!( - serde_json::to_string(&migration.steps) - .unwrap() - .contains("add_property"), - "{migration:?}" - ); - } - - #[tokio::test] - async fn plan_warns_when_preview_unavailable() { - let dir = fixture(); - write_applyable_state(dir.path()); // digests recorded, but no live root - fs::write( - dir.path().join("people.pg"), - "\nnode Person {\n name: String @key\n age: I32?\n bio: String?\n}\n", - ) - .unwrap(); - - let out = plan_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - let schema_change = out - .changes - .iter() - .find(|change| change.resource == "schema.knowledge") - .unwrap(); - assert!(schema_change.migration.is_none()); - assert!( - out.diagnostics 
- .iter() - .any(|diagnostic| diagnostic.code == "schema_preview_unavailable") - ); - } - - fn write_schema_apply_sidecar( - config_dir: &Path, - graph_id: &str, - desired_schema_digest: &str, - operation_id: &str, - ) -> PathBuf { - let dir = config_dir.join(CLUSTER_RECOVERIES_DIR); - fs::create_dir_all(&dir).unwrap(); - let path = dir.join(format!("{operation_id}.json")); - fs::write( - &path, - serde_json::to_string_pretty(&json!({ - "schema_version": 1, - "operation_id": operation_id, - "started_at": "1970-01-01T00:00:00Z", - "kind": "schema_apply", - "graph_id": graph_id, - "graph_uri": derived_graph_uri(config_dir, graph_id), - "desired_schema_digest": desired_schema_digest, - })) - .unwrap(), - ) - .unwrap(); - path - } - - const SCHEMA_V2: &str = "\nnode Person {\n name: String @key\n age: I32?\n bio: String?\n}\n"; - - #[tokio::test] - async fn sweep_retires_schema_sidecar_when_ledger_consistent() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); // state digest == live digest - let sidecar = - write_schema_apply_sidecar(dir.path(), "knowledge", "never-applied", "01SROW1"); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(!sidecar.exists()); - let state = read_state_json(dir.path()); - assert!( - state["recovery_records"] - .as_object() - .is_none_or(|records| records.is_empty()) - ); - } - - #[tokio::test] - async fn sweep_rolls_forward_completed_schema_apply() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); - // The schema apply completed on the graph out-of-process... - let graph_uri = derived_graph_uri(dir.path(), "knowledge"); - let db = Omnigraph::open(&graph_uri).await.unwrap(); - db.apply_schema(SCHEMA_V2).await.unwrap(); - // ...the desired config matches it, and the sidecar records the intent. 
- fs::write(dir.path().join("people.pg"), SCHEMA_V2).unwrap(); - let desired = validate_config_dir(dir.path()); - let v2_digest = desired.resource_digests["schema.knowledge"].clone(); - let sidecar = write_schema_apply_sidecar(dir.path(), "knowledge", &v2_digest, "01SROW3"); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_rolled_forward") - ); - assert!(!sidecar.exists()); - let state = read_state_json(dir.path()); - assert_eq!( - state["applied_revision"]["resources"]["schema.knowledge"]["digest"], - v2_digest - ); - assert!( - state["recovery_records"] - .as_object() - .unwrap() - .values() - .any(|record| record["kind"] == "schema_apply" - && record["outcome"] == "rolled_forward") - ); - assert!(out.converged, "{out:?}"); - } - - #[tokio::test] - async fn sweep_flags_unexpected_schema_apply_state_as_pending() { - let dir = fixture(); - init_derived_graph(dir.path()).await; // live = v1 - write_state_resources(dir.path(), &[("schema.knowledge", "stale-digest")]); - // Sidecar intended a digest that is neither live nor recorded. 
- let sidecar = - write_schema_apply_sidecar(dir.path(), "knowledge", "intended-digest", "01SROW6"); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); // warnings only - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_pending") - ); - assert!(sidecar.exists()); - let state = read_state_json(dir.path()); - assert_eq!( - state["resource_statuses"]["schema.knowledge"]["status"], - "drifted" - ); - } - - #[tokio::test] - async fn sweep_keeps_schema_sidecar_for_unopenable_root() { - let dir = fixture(); - write_applyable_state(dir.path()); - let root = dir.path().join(CLUSTER_GRAPHS_DIR).join("knowledge.omni"); - fs::create_dir_all(&root).unwrap(); // exists, won't open - let sidecar = - write_schema_apply_sidecar(dir.path(), "knowledge", "whatever", "01SROWX"); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); // warning: cannot verify - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_pending") - ); - assert!(sidecar.exists()); - } - - /// Seed: converged knowledge subtree + a stale `old` graph subtree with a - /// real directory on disk. 
- fn seed_deletable_state(config_dir: &Path) { - write_applyable_state(config_dir); - let state = read_state_json(config_dir); - let g = state["applied_revision"]["resources"]["graph.knowledge"]["digest"] - .as_str() - .unwrap() - .to_string(); - let sc = state["applied_revision"]["resources"]["schema.knowledge"]["digest"] - .as_str() - .unwrap() - .to_string(); - write_state_resources( - config_dir, - &[ - ("graph.knowledge", g.as_str()), - ("schema.knowledge", sc.as_str()), - ("graph.old", "3333"), - ("schema.old", "4444"), - ("query.old.q", "5555"), - ], - ); - let root = config_dir.join(CLUSTER_GRAPHS_DIR).join("old.omni"); - fs::create_dir_all(&root).unwrap(); - fs::write(root.join("_schema.pg"), "stale").unwrap(); - } - - #[tokio::test] - async fn apply_executes_approved_graph_delete() { - let dir = fixture(); - seed_deletable_state(dir.path()); - let approved = approve_config_dir(dir.path(), "graph.old", "andrew").await; - assert!(approved.ok, "{:?}", approved.diagnostics); - let approval_id = approved.approval_id.clone().unwrap(); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.converged, "{out:?}"); - let by_resource: BTreeMap<&str, &PlanChange> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - assert_eq!(by_resource["graph.old"].disposition, Some(ApplyDisposition::Applied)); - assert_eq!(by_resource["schema.old"].disposition, Some(ApplyDisposition::Applied)); - assert_eq!(by_resource["query.old.q"].disposition, Some(ApplyDisposition::Applied)); - // The root is gone; the subtree is tombstoned out of the ledger. 
- assert!(!dir.path().join(CLUSTER_GRAPHS_DIR).join("old.omni").exists()); - let state = read_state_json(dir.path()); - let resources = state["applied_revision"]["resources"].as_object().unwrap(); - assert!(!resources.contains_key("graph.old")); - assert!(!resources.contains_key("schema.old")); - assert!(!resources.contains_key("query.old.q")); - assert_eq!(state["observations"]["graph.old"]["kind"], "tombstone"); - assert_eq!(state["observations"]["graph.old"]["approval_id"], approval_id); - // Approval consumed in BOTH stores: ledger summary + artifact file. - assert!(state["approval_records"][&approval_id]["consumed_at"].is_string()); - let artifact: serde_json::Value = serde_json::from_str( - &fs::read_to_string( - dir.path() - .join(CLUSTER_APPROVALS_DIR) - .join(format!("{approval_id}.json")), - ) - .unwrap(), - ) - .unwrap(); - assert!(artifact["consumed_at"].is_string(), "{artifact}"); - // Sidecar retired. - assert!( - fs::read_dir(dir.path().join(CLUSTER_RECOVERIES_DIR)) - .map(|mut entries| entries.next().is_none()) - .unwrap_or(true) - ); - // A consumed approval authorizes nothing further (idempotent re-apply). 
- let again = apply_config_dir(dir.path()).await; - assert!(again.ok && again.converged && !again.state_written, "{again:?}"); - } - - fn write_delete_sidecar( - config_dir: &Path, - graph_id: &str, - approval_id: Option<&str>, - operation_id: &str, - ) -> PathBuf { - let dir = config_dir.join(CLUSTER_RECOVERIES_DIR); - fs::create_dir_all(&dir).unwrap(); - let path = dir.join(format!("{operation_id}.json")); - fs::write( - &path, - serde_json::to_string_pretty(&json!({ - "schema_version": 1, - "operation_id": operation_id, - "started_at": "1970-01-01T00:00:00Z", - "kind": "graph_delete", - "graph_id": graph_id, - "graph_uri": derived_graph_uri(config_dir, graph_id), - "desired_schema_digest": "", - "approval_id": approval_id, - })) - .unwrap(), - ) - .unwrap(); - path - } - - #[tokio::test] - async fn sweep_retires_delete_sidecar_when_tombstoned() { - let dir = fixture(); - write_applyable_state(dir.path()); // no graph.old in state, no root - let sidecar = write_delete_sidecar(dir.path(), "old", None, "01DROW7"); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!(!sidecar.exists()); - let state = read_state_json(dir.path()); - assert!( - state["recovery_records"] - .as_object() - .is_none_or(|records| records.is_empty()) - ); - } - - #[tokio::test] - async fn sweep_rolls_forward_completed_delete() { - let dir = fixture(); - seed_deletable_state(dir.path()); - // Approve, then simulate: root removed, state stale, sidecar present. 
- let approved = approve_config_dir(dir.path(), "graph.old", "andrew").await; - let approval_id = approved.approval_id.unwrap(); - fs::remove_dir_all(dir.path().join(CLUSTER_GRAPHS_DIR).join("old.omni")).unwrap(); - let sidecar = write_delete_sidecar(dir.path(), "old", Some(&approval_id), "01DROW7B"); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_rolled_forward") - ); - assert!(!sidecar.exists()); - let state = read_state_json(dir.path()); - assert!( - !state["applied_revision"]["resources"] - .as_object() - .unwrap() - .contains_key("graph.old") - ); - assert_eq!(state["observations"]["graph.old"]["kind"], "tombstone"); - assert!(state["approval_records"][&approval_id]["consumed_at"].is_string()); - assert!( - state["recovery_records"] - .as_object() - .unwrap() - .values() - .any(|record| record["kind"] == "graph_delete" - && record["outcome"] == "rolled_forward") - ); - // The artifact file is marked consumed post-CAS. - let artifact: serde_json::Value = serde_json::from_str( - &fs::read_to_string( - dir.path() - .join(CLUSTER_APPROVALS_DIR) - .join(format!("{approval_id}.json")), - ) - .unwrap(), - ) - .unwrap(); - assert!(artifact["consumed_at"].is_string()); - assert!(out.converged, "{out:?}"); - } - - #[tokio::test] - async fn sweep_reproposes_incomplete_delete() { - let dir = fixture(); - seed_deletable_state(dir.path()); // root present - let approved = approve_config_dir(dir.path(), "graph.old", "andrew").await; - assert!(approved.ok); - let sidecar = write_delete_sidecar(dir.path(), "old", approved.approval_id.as_deref(), "01DROW8"); - - // Row 8: the stale intent is retired with a warning, and the same run - // re-executes the still-approved delete to completion. 
- let out = apply_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "graph_delete_incomplete") - ); - assert!(!sidecar.exists()); - assert!(!dir.path().join(CLUSTER_GRAPHS_DIR).join("old.omni").exists()); - assert!(out.converged, "{out:?}"); - } - - // ---- policy bindings in the applied revision (5A) ---- - - #[tokio::test] - async fn apply_records_policy_bindings() { - let dir = fixture(); - write_applyable_state(dir.path()); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok && out.converged, "{:?}", out.diagnostics); - let state = read_state_json(dir.path()); - assert_eq!( - state["applied_revision"]["resources"]["policy.base"]["applies_to"], - serde_json::json!(["graph.knowledge"]), - "{state}" - ); - // Non-policy entries carry no bindings field at all. - assert!( - state["applied_revision"]["resources"]["query.knowledge.find_person"] - .get("applies_to") - .is_none() - ); - } - - #[tokio::test] - async fn binding_change_is_a_visible_plan_change() { - let dir = fixture(); - write_applyable_state(dir.path()); - let converge = apply_config_dir(dir.path()).await; - assert!(converge.converged, "{converge:?}"); - // Edit ONLY applies_to: the policy file digest is unchanged. 
- fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -metadata: - name: test -state: - backend: cluster - lock: true -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -policies: - base: - file: ./base.policy.yaml - applies_to: [cluster, knowledge] -"#, - ) - .unwrap(); - - let plan = plan_config_dir(dir.path()).await; - let change = plan - .changes - .iter() - .find(|change| change.resource == "policy.base") - .expect("binding change must be visible in plan"); - assert!(change.binding_change); - assert_eq!( - change.metadata_change, - Some(PlanMetadataChange::PolicyBindings) - ); - assert_eq!(change.operation, PlanOperation::Update); - assert_eq!(change.before_digest, change.after_digest); - - let out = apply_config_dir(dir.path()).await; - assert!(out.ok && out.converged, "{out:?}"); - let state = read_state_json(dir.path()); - assert_eq!( - state["applied_revision"]["resources"]["policy.base"]["applies_to"], - serde_json::json!(["cluster", "graph.knowledge"]) - ); - // Idempotent: a second run sees no changes. - let again = apply_config_dir(dir.path()).await; - assert!(again.changes.is_empty() && !again.state_written, "{again:?}"); - } - - #[tokio::test] - async fn pre_5a_state_backfills_bindings() { - let dir = fixture(); - write_applyable_state(dir.path()); - let converge = apply_config_dir(dir.path()).await; - assert!(converge.converged, "{converge:?}"); - // Strip the bindings from the state entry (a pre-5A ledger). 
- let mut state: serde_json::Value = serde_json::from_str( - &fs::read_to_string(dir.path().join(CLUSTER_STATE_FILE)).unwrap(), - ) - .unwrap(); - state["applied_revision"]["resources"]["policy.base"] - .as_object_mut() - .unwrap() - .remove("applies_to"); - fs::write( - dir.path().join(CLUSTER_STATE_FILE), - serde_json::to_string_pretty(&state).unwrap(), - ) - .unwrap(); - - let plan = plan_config_dir(dir.path()).await; - assert!( - plan.changes.iter().any(|change| change.resource == "policy.base" - && change.binding_change - && change.metadata_change == Some(PlanMetadataChange::PolicyBindings)), - "{plan:?}" - ); - let out = apply_config_dir(dir.path()).await; - assert!(out.ok && out.converged, "{out:?}"); - let healed = read_state_json(dir.path()); - assert_eq!( - healed["applied_revision"]["resources"]["policy.base"]["applies_to"], - serde_json::json!(["graph.knowledge"]) - ); - } - - #[tokio::test] - async fn pre_5a_state_backfills_embedding_profile() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_mock_embedding_cluster(dir.path(), "recorded-x"); - write_applyable_state(dir.path()); - let converge = apply_config_dir(dir.path()).await; - assert!(converge.converged, "{converge:?}"); - - let mut state = read_state_json(dir.path()); - state["applied_revision"]["resources"]["provider.embedding.default"] - .as_object_mut() - .unwrap() - .remove("embedding_profile"); - fs::write( - dir.path().join(CLUSTER_STATE_FILE), - serde_json::to_string_pretty(&state).unwrap(), - ) - .unwrap(); - - let plan = plan_config_dir(dir.path()).await; - let change = plan - .changes - .iter() - .find(|change| change.resource == "provider.embedding.default") - .expect("embedding profile backfill must be visible in plan"); - assert_eq!(change.operation, PlanOperation::Update); - assert_eq!(change.before_digest, change.after_digest); - assert_eq!( - change.metadata_change, - Some(PlanMetadataChange::EmbeddingProfile) - ); - - let out = 
apply_config_dir(dir.path()).await; - assert!(out.ok && out.converged, "{out:?}"); - let healed = read_state_json(dir.path()); - assert_eq!( - healed["applied_revision"]["resources"]["provider.embedding.default"] - ["embedding_profile"]["model"], - serde_json::json!("recorded-x") - ); - let snapshot = read_serving_snapshot(dir.path()).await.unwrap(); - let profile = snapshot.graphs[0].embedding.as_ref().unwrap(); - assert_eq!(profile.model.as_deref(), Some("recorded-x")); - } - - #[tokio::test] - async fn bindings_survive_refresh() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); - let converge = apply_config_dir(dir.path()).await; - assert!(converge.converged, "{converge:?}"); - - let refresh = refresh_config_dir(dir.path()).await; - assert!(refresh.ok, "{:?}", refresh.diagnostics); - let state = read_state_json(dir.path()); - assert_eq!( - state["applied_revision"]["resources"]["policy.base"]["applies_to"], - serde_json::json!(["graph.knowledge"]) - ); - } - - // ---- serving snapshot (5B read-only loader) ---- - - // ---- storage: root (RFC-006) ---- - - #[tokio::test] - async fn storage_root_defaults_to_config_dir_layout() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); - let out = apply_config_dir(dir.path()).await; - assert!(out.converged, "{out:?}"); - // No storage: key -- the original on-disk layout, byte-compatible. 
- assert!(dir.path().join(CLUSTER_STATE_FILE).exists()); - assert!(dir.path().join(CLUSTER_RESOURCES_DIR).exists()); - assert!(dir.path().join("graphs/knowledge.omni").exists()); - } - - #[tokio::test] - async fn storage_root_file_uri_relocates_the_cluster() { - let dir = fixture(); - let storage = tempfile::tempdir().unwrap(); - let storage_path = storage.path().to_string_lossy().to_string(); - let mut config = fs::read_to_string(dir.path().join("cluster.yaml")).unwrap(); - config = config.replace("version: 1\n", &format!("version: 1\nstorage: {storage_path}\n")); - fs::write(dir.path().join("cluster.yaml"), config).unwrap(); - - let import = import_config_dir(dir.path()).await; - assert!(import.ok, "{:?}", import.diagnostics); - let out = apply_config_dir(dir.path()).await; - assert!(out.ok && out.converged, "{:?}", out.diagnostics); - - // Everything lives under the declared root; nothing under config dir. - assert!(storage.path().join("__cluster/state.json").exists()); - assert!(storage.path().join("graphs/knowledge.omni").exists()); - assert!(storage.path().join(CLUSTER_RESOURCES_DIR).exists()); - assert!(!dir.path().join(CLUSTER_STATE_FILE).exists()); - assert!(!dir.path().join("graphs").exists()); - - // The serving snapshot follows the root. 
- let snapshot = read_serving_snapshot(dir.path()).await.unwrap(); - assert!( - snapshot.graphs[0] - .root - .starts_with(storage.path()), - "{:?}", - snapshot.graphs[0].root - ); - } - - #[test] - fn storage_root_invalid_uri_fails_validation() { - let dir = fixture(); - let mut config = fs::read_to_string(dir.path().join("cluster.yaml")).unwrap(); - config = config.replace("version: 1\n", "version: 1\nstorage: \"s3://\"\n"); - fs::write(dir.path().join("cluster.yaml"), config).unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "invalid_storage_root"), - "{:?}", - out.diagnostics - ); - } - - #[tokio::test] - async fn serving_snapshot_reads_converged_cluster() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); - let converge = apply_config_dir(dir.path()).await; - assert!(converge.converged, "{converge:?}"); - - let snapshot = read_serving_snapshot(dir.path()).await.expect("converged cluster must serve"); - assert_eq!(snapshot.graphs.len(), 1); - assert_eq!(snapshot.graphs[0].graph_id, "knowledge"); - assert!(snapshot.graphs[0].root.ends_with("graphs/knowledge.omni")); - assert_eq!(snapshot.queries.len(), 1); - assert_eq!(snapshot.queries[0].name, "find_person"); - assert!(snapshot.queries[0].source.contains("query find_person")); - assert_eq!(snapshot.policies.len(), 1); - assert_eq!(snapshot.policies[0].applies_to, vec!["graph.knowledge"]); - // Content, not a path: the catalog may live on object storage. - // The fixture bundle is `rules: []` -- assert the verified text. 
- assert!(snapshot.policies[0].source.contains("rules:")); - } - - #[tokio::test] - async fn serving_snapshot_uses_applied_embedding_provider_profile() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_mock_embedding_cluster(dir.path(), "recorded-x"); - write_applyable_state(dir.path()); - let converge = apply_config_dir(dir.path()).await; - assert!(converge.converged, "{converge:?}"); - - let snapshot = read_serving_snapshot(dir.path()).await.unwrap(); - let profile = snapshot.graphs[0].embedding.as_ref().unwrap(); - assert_eq!(profile.kind.as_deref(), Some("mock")); - assert_eq!(profile.model.as_deref(), Some("recorded-x")); - } - - #[tokio::test] - async fn serving_snapshot_refuses_missing_embedding_provider_metadata() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_mock_embedding_cluster(dir.path(), "recorded-x"); - write_applyable_state(dir.path()); - let converge = apply_config_dir(dir.path()).await; - assert!(converge.converged, "{converge:?}"); - - let mut state = read_state_json(dir.path()); - state["applied_revision"]["resources"]["provider.embedding.default"] - .as_object_mut() - .unwrap() - .remove("embedding_profile"); - fs::write( - dir.path().join(CLUSTER_STATE_FILE), - serde_json::to_string_pretty(&state).unwrap(), - ) - .unwrap(); - - let err = read_serving_snapshot(dir.path()).await.unwrap_err(); - assert!( - err.iter() - .any(|diagnostic| diagnostic.code == "embedding_provider_profile_missing"), - "{err:?}" - ); - assert!( - err.iter() - .any(|diagnostic| diagnostic.code == "embedding_provider_missing"), - "{err:?}" - ); - } - - #[tokio::test] - async fn serving_snapshot_refuses_missing_state() { - let dir = fixture(); - let err = read_serving_snapshot(dir.path()).await.unwrap_err(); - assert!( - err.iter().any(|diagnostic| diagnostic.code == "cluster_state_missing"), - "{err:?}" - ); - } - - #[tokio::test] - async fn serving_snapshot_refuses_pending_recovery() { - let dir = fixture(); - 
init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); - apply_config_dir(dir.path()).await; - write_schema_apply_sidecar(dir.path(), "knowledge", "whatever", "01SERVE"); - - let err = read_serving_snapshot(dir.path()).await.unwrap_err(); - assert!( - err.iter() - .any(|diagnostic| diagnostic.code == "cluster_no_healthy_graphs"), - "{err:?}" - ); - assert!( - err.iter().any(|diagnostic| { - diagnostic.code == "cluster_recovery_pending" - && diagnostic.path == "graph.knowledge" - }), - "{err:?}" - ); - } - - #[tokio::test] - async fn serving_snapshot_quarantines_one_graph_with_pending_recovery() { - let dir = fixture(); - fs::write( - dir.path().join(CLUSTER_CONFIG_FILE), - r#" -version: 1 -metadata: - name: test -state: - backend: cluster - lock: true -graphs: - knowledge: - schema: ./people.pg - archive: - schema: ./people.pg -"#, - ) - .unwrap(); - let graph_dir = dir.path().join(CLUSTER_GRAPHS_DIR); - fs::create_dir_all(&graph_dir).unwrap(); - Omnigraph::init( - graph_dir.join("knowledge.omni").to_string_lossy().as_ref(), - SCHEMA, - ) - .await - .unwrap(); - Omnigraph::init( - graph_dir.join("archive.omni").to_string_lossy().as_ref(), - SCHEMA, - ) - .await - .unwrap(); - let desired = validate_config_dir(dir.path()); - assert!(desired.ok, "{:?}", desired.diagnostics); - let schema_digest = desired.resource_digests["schema.knowledge"].clone(); - let empty_queries = BTreeMap::new(); - let knowledge_digest = graph_digest( - "knowledge", - Some(&schema_digest), - Some(&empty_queries), - None, - None, - ); - let archive_digest = graph_digest( - "archive", - Some(&schema_digest), - Some(&empty_queries), - None, - None, - ); - write_state_resources( - dir.path(), - &[ - ("graph.knowledge", knowledge_digest.as_str()), - ("schema.knowledge", schema_digest.as_str()), - ("graph.archive", archive_digest.as_str()), - ("schema.archive", schema_digest.as_str()), - ], - ); - write_schema_apply_sidecar(dir.path(), "knowledge", "whatever", "01SERVE2"); - - 
let snapshot = read_serving_snapshot(dir.path()).await.unwrap(); - assert_eq!(snapshot.graphs.len(), 1); - assert_eq!(snapshot.graphs[0].graph_id, "archive"); - assert!(snapshot.queries.is_empty()); - assert!(snapshot.policies.is_empty()); - assert!(snapshot.diagnostics.iter().any(|diagnostic| { - diagnostic.code == "cluster_recovery_pending" - && diagnostic.path == "graph.knowledge" - && diagnostic.severity == DiagnosticSeverity::Warning - })); - } - - #[tokio::test] - async fn serving_snapshot_refuses_tampered_blob_and_stripped_bindings() { - let dir = fixture(); - init_derived_graph(dir.path()).await; - write_applyable_state(dir.path()); - apply_config_dir(dir.path()).await; - // Tamper with the query blob... - let snapshot = read_serving_snapshot(dir.path()).await.unwrap(); - let desired = validate_config_dir(dir.path()); - let query_digest = &desired.resource_digests["query.knowledge.find_person"]; - let blob = dir - .path() - .join(CLUSTER_RESOURCES_DIR) - .join("query/knowledge/find_person") - .join(format!("{query_digest}.gq")); - fs::write(&blob, "tampered").unwrap(); - // ...and strip the policy bindings (pre-5A ledger). 
- let mut state: serde_json::Value = serde_json::from_str( - &fs::read_to_string(dir.path().join(CLUSTER_STATE_FILE)).unwrap(), - ) - .unwrap(); - state["applied_revision"]["resources"]["policy.base"] - .as_object_mut() - .unwrap() - .remove("applies_to"); - fs::write( - dir.path().join(CLUSTER_STATE_FILE), - serde_json::to_string_pretty(&state).unwrap(), - ) - .unwrap(); - - let err = read_serving_snapshot(dir.path()).await.unwrap_err(); - assert!( - err.iter() - .any(|diagnostic| diagnostic.code == "catalog_payload_digest_mismatch"), - "{err:?}" - ); - assert!( - err.iter().any(|diagnostic| diagnostic.code == "policy_bindings_missing"), - "{err:?}" - ); - let _ = snapshot; // the pre-tamper read succeeded - } - - #[tokio::test] - async fn serving_snapshot_refuses_empty_cluster() { - let dir = fixture(); - write_state_resources(dir.path(), &[]); // state exists, no graphs - - let err = read_serving_snapshot(dir.path()).await.unwrap_err(); - assert!( - err.iter().any(|diagnostic| diagnostic.code == "cluster_empty"), - "{err:?}" - ); - } - - // ---- query discovery (Terraform-style declaration) ---- - - #[test] - fn queries_directory_discovers_every_declaration() { - let dir = tempfile::tempdir().unwrap(); - fs::write(dir.path().join("people.pg"), "\nnode Person {\n name: String @key\n}\n").unwrap(); - fs::create_dir(dir.path().join("queries")).unwrap(); - fs::write( - dir.path().join("queries/people.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n\nquery all_people() {\n match { $p: Person }\n return { $p.name }\n}\n", - ) - .unwrap(); - fs::write( - dir.path().join("queries/extra.gq"), - "\nquery count_people() {\n match { $p: Person }\n return { count($p) }\n}\n", - ) - .unwrap(); - fs::write(dir.path().join("queries/notes.txt"), "ignored").unwrap(); - fs::write( - dir.path().join("cluster.yaml"), - "version: 1\ngraphs:\n knowledge:\n schema: ./people.pg\n queries: ./queries/\n", - ) - .unwrap(); - - 
let out = validate_config_dir(dir.path()); - assert!(out.ok, "{:?}", out.diagnostics); - let names: Vec<&str> = out - .resource_digests - .keys() - .filter_map(|address| address.strip_prefix("query.knowledge.")) - .collect(); - assert_eq!(names, vec!["all_people", "count_people", "find_person"]); - } - - #[test] - fn queries_list_and_single_file_forms_discover() { - let dir = tempfile::tempdir().unwrap(); - fs::write(dir.path().join("people.pg"), "\nnode Person {\n name: String @key\n}\n").unwrap(); - fs::write( - dir.path().join("a.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n", - ) - .unwrap(); - fs::write( - dir.path().join("b.gq"), - "\nquery all_people() {\n match { $p: Person }\n return { $p.name }\n}\n", - ) - .unwrap(); - fs::write( - dir.path().join("cluster.yaml"), - "version: 1\ngraphs:\n knowledge:\n schema: ./people.pg\n queries: [./a.gq, ./b.gq]\n", - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.resource_digests.contains_key("query.knowledge.find_person")); - assert!(out.resource_digests.contains_key("query.knowledge.all_people")); - - // Single-file string form - fs::write( - dir.path().join("cluster.yaml"), - "version: 1\ngraphs:\n knowledge:\n schema: ./people.pg\n queries: ./a.gq\n", - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(out.ok, "{:?}", out.diagnostics); - assert!(out.resource_digests.contains_key("query.knowledge.find_person")); - assert!(!out.resource_digests.contains_key("query.knowledge.all_people")); - } - - #[test] - fn query_discovery_rejects_duplicates_and_parse_errors() { - let dir = tempfile::tempdir().unwrap(); - fs::write(dir.path().join("people.pg"), "\nnode Person {\n name: String @key\n}\n").unwrap(); - let decl = "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n"; - fs::write(dir.path().join("a.gq"), 
decl).unwrap(); - fs::write(dir.path().join("b.gq"), decl).unwrap(); - fs::write( - dir.path().join("cluster.yaml"), - "version: 1\ngraphs:\n knowledge:\n schema: ./people.pg\n queries: [./a.gq, ./b.gq]\n", - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "duplicate_query_name"), - "{:?}", - out.diagnostics - ); - - fs::write(dir.path().join("broken.gq"), "query {{{ nope").unwrap(); - fs::write( - dir.path().join("cluster.yaml"), - "version: 1\ngraphs:\n knowledge:\n schema: ./people.pg\n queries: ./broken.gq\n", - ) - .unwrap(); - let out = validate_config_dir(dir.path()); - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "query_parse_error"), - "{:?}", - out.diagnostics - ); - } - - #[tokio::test] - async fn status_warns_on_pending_recovery_sidecar() { - let dir = fixture(); - write_applyable_state(dir.path()); - write_create_sidecar(dir.path(), "knowledge", "irrelevant", "01STATUS"); - - let out = status_config_dir(dir.path()).await; - assert!(out.ok, "{:?}", out.diagnostics); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_pending" - && diagnostic.severity == DiagnosticSeverity::Warning) - ); - } - - #[tokio::test] - async fn read_only_commands_ignore_missing_recovery_sidecar_dir() { - let dir = fixture(); - write_applyable_state(dir.path()); - assert!(!dir.path().join(CLUSTER_RECOVERIES_DIR).exists()); - - let status = status_config_dir(dir.path()).await; - assert!(status.ok, "{:?}", status.diagnostics); - assert!( - !status.diagnostics.iter().any(|diagnostic| matches!( - diagnostic.code.as_str(), - "recovery_sidecar_read_error" | "cluster_recovery_pending" - )), - "{:?}", - status.diagnostics - ); - - let plan = plan_config_dir(dir.path()).await; - assert!(plan.ok, "{:?}", plan.diagnostics); - assert!( - !plan.diagnostics.iter().any(|diagnostic| matches!( 
- diagnostic.code.as_str(), - "recovery_sidecar_read_error" | "cluster_recovery_pending" - )), - "{:?}", - plan.diagnostics - ); - } - - #[tokio::test] - async fn read_only_commands_warn_on_pending_recovery_sidecar_in_storage_root() { - let dir = fixture(); - let storage = tempfile::tempdir().unwrap(); - let storage_path = storage.path().to_string_lossy().to_string(); - let mut config = fs::read_to_string(dir.path().join(CLUSTER_CONFIG_FILE)).unwrap(); - config = config.replace( - "version: 1\n", - &format!("version: 1\nstorage: {storage_path}\n"), - ); - fs::write(dir.path().join(CLUSTER_CONFIG_FILE), config).unwrap(); - - let desired = validate_config_dir(dir.path()); - assert!(desired.ok, "{:?}", desired.diagnostics); - let schema_digest = desired - .resource_digests - .get("schema.knowledge") - .unwrap() - .clone(); - let graph_composite = graph_digest( - "knowledge", - Some(&schema_digest), - Some(&BTreeMap::new()), - None, - None, - ); - write_state_resources( - storage.path(), - &[ - ("graph.knowledge", graph_composite.as_str()), - ("schema.knowledge", schema_digest.as_str()), - ], - ); - write_create_sidecar(storage.path(), "knowledge", "irrelevant", "01STORAGE"); - - let status = status_config_dir(dir.path()).await; - assert!(status.ok, "{:?}", status.diagnostics); - assert!( - status - .diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_pending" - && diagnostic.path.contains("01STORAGE.json")), - "{:?}", - status.diagnostics - ); - - let plan = plan_config_dir(dir.path()).await; - assert!(plan.ok, "{:?}", plan.diagnostics); - assert!( - plan.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_pending" - && diagnostic.path.contains("01STORAGE.json")), - "{:?}", - plan.diagnostics - ); - - assert!(!dir.path().join(CLUSTER_RECOVERIES_DIR).exists()); - } - - #[tokio::test] - async fn plan_annotates_apply_dispositions() { - let dir = fixture(); - let out = plan_config_dir(dir.path()).await; - 
assert!(out.ok, "{:?}", out.diagnostics); - let by_resource: BTreeMap<&str, &PlanChange> = out - .changes - .iter() - .map(|change| (change.resource.as_str(), change)) - .collect(); - // Stage 4A: graph/schema creates are executable, and dependents ride - // the same run -- plan previews exactly that. - assert_eq!( - by_resource["graph.knowledge"].disposition, - Some(ApplyDisposition::Applied) - ); - assert_eq!( - by_resource["schema.knowledge"].disposition, - Some(ApplyDisposition::Applied) - ); - assert_eq!( - by_resource["query.knowledge.find_person"].disposition, - Some(ApplyDisposition::Applied) - ); - assert_eq!( - by_resource["policy.base"].disposition, - Some(ApplyDisposition::Applied) - ); - } diff --git a/crates/omnigraph-cluster/src/types.rs b/crates/omnigraph-cluster/src/types.rs deleted file mode 100644 index 7687575..0000000 --- a/crates/omnigraph-cluster/src/types.rs +++ /dev/null @@ -1,727 +0,0 @@ -//! Public output/diagnostic types and internal state/sidecar/approval -//! models (moved verbatim from lib.rs in the modularization). 
- -use super::*; - -#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)] -#[serde(rename_all = "snake_case")] -pub enum DiagnosticSeverity { - Error, - Warning, -} - -#[derive(Debug, Clone, Serialize, PartialEq, Eq)] -pub struct Diagnostic { - pub code: String, - pub severity: DiagnosticSeverity, - pub path: String, - pub message: String, -} - -impl Diagnostic { - pub(crate) fn error(code: impl Into, path: impl Into, message: impl Into) -> Self { - Self { - code: code.into(), - severity: DiagnosticSeverity::Error, - path: path.into(), - message: message.into(), - } - } - - pub(crate) fn warning( - code: impl Into, - path: impl Into, - message: impl Into, - ) -> Self { - Self { - code: code.into(), - severity: DiagnosticSeverity::Warning, - path: path.into(), - message: message.into(), - } - } -} - -#[derive(Debug, Clone, Serialize, PartialEq, Eq)] -pub struct ResourceSummary { - pub address: String, - pub kind: String, - pub digest: String, - #[serde(skip_serializing_if = "Option::is_none")] - pub path: Option, -} - -#[derive(Debug, Clone, Serialize, PartialEq, Eq, PartialOrd, Ord)] -pub struct Dependency { - pub from: String, - pub to: String, -} - -#[derive(Debug, Clone, Serialize)] -pub struct ValidateOutput { - pub ok: bool, - pub config_dir: String, - pub config_file: String, - pub resource_digests: BTreeMap, - pub resources: Vec, - pub dependencies: Vec, - pub diagnostics: Vec, -} - -#[derive(Debug, Clone, Serialize)] -pub struct DesiredRevision { - #[serde(skip_serializing_if = "Option::is_none")] - pub config_digest: Option, -} - -#[derive(Debug, Clone, Serialize)] -pub struct StateObservations { - pub state_path: String, - pub lock_path: String, - pub state_found: bool, - #[serde(skip_serializing_if = "Option::is_none")] - pub applied_config_digest: Option, - pub state_revision: u64, - #[serde(skip_serializing_if = "Option::is_none")] - pub state_cas: Option, - pub resource_count: usize, - pub locked: bool, - #[serde(skip_serializing_if = 
"Option::is_none")] - pub lock_id: Option, - pub lock_acquired: bool, - #[serde(skip_serializing_if = "Option::is_none")] - pub acquired_lock_id: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub lock_operation: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub lock_created_at: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub lock_pid: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub lock_age_seconds: Option, -} - -impl StateObservations { - pub(crate) fn observe_lock_metadata(&mut self, lock: &StateLockFile) { - self.locked = true; - self.lock_id = Some(lock.lock_id.clone()); - self.lock_operation = Some(lock.operation.clone()); - self.lock_created_at = Some(lock.created_at.clone()); - self.lock_pid = Some(lock.pid); - self.lock_age_seconds = lock_age_seconds(&lock.created_at); - } -} - -#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)] -#[serde(rename_all = "snake_case")] -pub enum ResourceLifecycleStatus { - Pending, - Planned, - Applying, - Applied, - Drifted, - Blocked, - Error, -} - -#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)] -#[serde(deny_unknown_fields)] -pub struct ResourceStatusRecord { - pub status: ResourceLifecycleStatus, - #[serde(default, skip_serializing_if = "Vec::is_empty")] - pub conditions: Vec, - #[serde(default, skip_serializing_if = "Option::is_none")] - pub message: Option, -} - -#[derive(Debug, Clone, Serialize, PartialEq, Eq)] -#[serde(rename_all = "snake_case")] -pub enum PlanOperation { - Create, - Update, - Delete, -} - -/// How `cluster apply` treats a planned change in the current stage. -/// -/// `Applied` changes execute (config-only query/policy catalog writes). -/// `Derived` marks a `graph.` composite-digest update that converges -/// automatically once its applied query digests land in state. `Deferred` -/// changes need a later phase (graph/schema lifecycle or schema content). 
-/// `Blocked` query/policy changes are gated by an unapplied or missing -/// dependency. -#[derive(Debug, Clone, Copy, Serialize, PartialEq, Eq)] -#[serde(rename_all = "snake_case")] -pub enum ApplyDisposition { - Applied, - Derived, - Deferred, - Blocked, -} - -#[derive(Debug, Clone, Serialize, PartialEq)] -pub struct PlanChange { - pub resource: String, - pub operation: PlanOperation, - #[serde(skip_serializing_if = "Option::is_none")] - pub before_digest: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub after_digest: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub disposition: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub reason: Option, - /// True for a policy change whose file digest is unchanged but whose - /// `applies_to` bindings differ from the applied revision (including the - /// pre-5A backfill case). - #[serde(default, skip_serializing_if = "std::ops::Not::not")] - pub binding_change: bool, - /// Metadata-only updates whose resource content digest is unchanged but - /// whose applied ledger metadata needs to converge. - #[serde(skip_serializing_if = "Option::is_none")] - pub metadata_change: Option, - /// For schema updates: the engine's migration plan against the live - /// graph (RFC-004 §D7's data-aware preview). Absent when the preview is - /// unavailable (warning `schema_preview_unavailable`). 
- #[serde(skip_serializing_if = "Option::is_none")] - pub migration: Option, -} - -#[derive(Debug, Clone, Copy, Serialize, PartialEq, Eq)] -#[serde(rename_all = "snake_case")] -pub enum PlanMetadataChange { - PolicyBindings, - EmbeddingProfile, -} - -#[derive(Debug, Clone, Serialize, PartialEq, Eq)] -pub struct BlastRadius { - pub resource: String, - pub affected: Vec, -} - -#[derive(Debug, Clone, Serialize, PartialEq, Eq)] -pub struct ApprovalRequirement { - pub resource: String, - pub reason: String, - /// True when a valid (digest-matching, unconsumed) approval artifact is - /// pending for this change. - pub satisfied: bool, -} - -#[derive(Debug, Clone, Serialize)] -pub struct PlanOutput { - pub ok: bool, - pub config_dir: String, - pub desired_revision: DesiredRevision, - pub resource_digests: BTreeMap, - pub dependencies: Vec, - pub state_observations: StateObservations, - pub changes: Vec, - pub blast_radius: Vec, - pub approvals_required: Vec, - pub diagnostics: Vec, -} - -#[derive(Debug, Clone, Serialize)] -pub struct StatusOutput { - pub ok: bool, - pub config_dir: String, - pub state_observations: StateObservations, - pub resource_digests: BTreeMap, - pub resource_statuses: BTreeMap, - pub observations: BTreeMap, - pub diagnostics: Vec, -} - -#[derive(Debug, Clone, Copy, Serialize, PartialEq, Eq)] -#[serde(rename_all = "snake_case")] -pub enum StateSyncOperation { - Refresh, - Import, -} - -#[derive(Debug, Clone, Serialize)] -pub struct StateSyncOutput { - pub ok: bool, - pub operation: StateSyncOperation, - pub config_dir: String, - pub state_observations: StateObservations, - pub resource_digests: BTreeMap, - pub resource_statuses: BTreeMap, - pub observations: BTreeMap, - pub diagnostics: Vec, -} - -#[derive(Debug, Clone, Serialize)] -pub struct ForceUnlockOutput { - pub ok: bool, - pub config_dir: String, - pub state_observations: StateObservations, - pub lock_removed: bool, - pub diagnostics: Vec, -} - -/// Output of config-only `cluster apply`. 
"Applied" means recorded in the -/// local cluster catalog (`__cluster/`); nothing applied here serves traffic β€” -/// the server still boots from `omnigraph.yaml` until the server-boot stage. -#[derive(Debug, Clone, Serialize)] -pub struct ApplyOutput { - pub ok: bool, - pub config_dir: String, - #[serde(skip_serializing_if = "Option::is_none")] - pub actor: Option, - pub desired_revision: DesiredRevision, - pub state_observations: StateObservations, - /// Every planned change, with `disposition`/`reason` always populated. - pub changes: Vec, - pub applied_count: usize, - /// Deferred + Blocked changes (Derived composite updates count as neither). - pub deferred_count: usize, - /// True when state matches the desired revision after this apply. - pub converged: bool, - /// False for a no-op re-apply: state bytes (and revision) were left untouched. - pub state_written: bool, - /// The statuses as persisted: post-apply on success, the pre-apply on-disk - /// snapshot when the state write fails (never unpersisted in-memory state). - pub resource_statuses: BTreeMap, - pub diagnostics: Vec, -} - -/// A digest-bound human approval for an irreversible operation (RFC-004 -/// Β§D4). Written by `cluster approve`, consumed by apply. The file is never -/// deleted on consumption β€” it is rewritten with `consumed_at` and also -/// summarized into the state ledger's `approval_records`, so the audit fact -/// survives the loss of either store (axiom 11). 
-#[derive(Debug, Clone, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct ApprovalArtifact { - pub(crate) schema_version: u32, - pub(crate) approval_id: String, - pub(crate) resource: String, - pub(crate) operation: String, - pub(crate) reason: String, - pub(crate) bound_config_digest: String, - #[serde(default)] - pub(crate) bound_before_digest: Option, - #[serde(default)] - pub(crate) bound_after_digest: Option, - pub(crate) approved_by: String, - pub(crate) created_at: String, - #[serde(default)] - pub(crate) consumed_at: Option, - #[serde(default)] - pub(crate) consumed_by_operation: Option, -} - -#[derive(Debug, Clone, Serialize)] -pub struct ApproveOutput { - pub ok: bool, - pub config_dir: String, - #[serde(skip_serializing_if = "Option::is_none")] - pub approval_id: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub resource: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub operation: Option, - #[serde(skip_serializing_if = "Option::is_none")] - pub approved_by: Option, - pub diagnostics: Vec, -} - -#[derive(Debug, Clone)] -pub(crate) struct DesiredCluster { - pub(crate) config_dir: PathBuf, - pub(crate) config_digest: String, - /// The declared `storage:` root, if any (None ⇒ the config dir itself). - pub(crate) storage_root: Option, - pub(crate) state_lock: bool, - pub(crate) embedding_providers: BTreeMap, - pub(crate) graphs: Vec, - pub(crate) resource_digests: BTreeMap, - pub(crate) resources: Vec, - pub(crate) dependencies: Vec, - /// `policy.` address -> normalized applies_to refs. 
- pub(crate) policy_bindings: BTreeMap>, -} - -#[derive(Debug, Clone)] -pub(crate) struct DesiredGraph { - pub(crate) id: String, - pub(crate) schema_digest: String, - pub(crate) embedding_provider: Option, -} - -#[derive(Debug)] -pub(crate) struct ParsedConfig { - pub(crate) raw: Option, - pub(crate) diagnostics: Vec, - pub(crate) config_dir: PathBuf, - pub(crate) config_file: PathBuf, -} - -#[derive(Debug, Clone)] -pub(crate) struct ClusterSettings { - pub(crate) state_lock: bool, - pub(crate) storage_root: Option, -} - -#[derive(Debug)] -pub(crate) struct LoadOutcome { - pub(crate) desired: Option, - pub(crate) diagnostics: Vec, - pub(crate) config_dir: PathBuf, - pub(crate) config_file: PathBuf, -} - -#[derive(Debug, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct RawClusterConfig { - pub(crate) version: u32, - #[serde(default)] - pub(crate) metadata: Metadata, - /// Storage root URI for everything the cluster stores: the state - /// ledger, catalog, sidecars, approvals, and derived graph roots. - /// Absent ⇒ `file://` (the original layout, byte-compatible). - /// `s3://bucket/prefix` puts the whole cluster on object storage. 
- #[serde(default)] - pub(crate) storage: Option, - #[serde(default)] - pub(crate) state: StateConfig, - #[serde(default)] - pub(crate) providers: ProvidersConfig, - #[serde(default)] - pub(crate) graphs: BTreeMap, - #[serde(default)] - pub(crate) policies: BTreeMap, -} - -#[derive(Debug, Default, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct Metadata { - pub(crate) name: Option, -} - -#[derive(Debug, Default, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct StateConfig { - pub(crate) backend: Option, - pub(crate) lock: Option, -} - -#[derive(Debug, Default, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct ProvidersConfig { - #[serde(default)] - pub(crate) embedding: BTreeMap, -} - -#[derive(Debug, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct GraphConfig { - pub(crate) schema: PathBuf, - #[serde(default)] - pub(crate) queries: QueriesDecl, - /// Optional reference to a top-level `providers.embedding.` profile. - #[serde(default)] - pub(crate) embedding_provider: Option, -} - -/// A named cluster embedding provider profile (RFC-012 Phase 5). `kind`/`base_url`/ -/// `model` default exactly as the engine's `EmbeddingConfig::from_env` does. -/// `api_key`, when required, must be a `${NAME}` env reference resolved at -/// serving boot, never an inline secret. 
-#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)] -#[serde(deny_unknown_fields)] -pub struct EmbeddingProviderConfig { - #[serde(default, alias = "provider", skip_serializing_if = "Option::is_none")] - pub kind: Option, - #[serde(default, skip_serializing_if = "Option::is_none")] - pub base_url: Option, - #[serde(default, skip_serializing_if = "Option::is_none")] - pub model: Option, - #[serde(default, skip_serializing_if = "Option::is_none")] - pub api_key: Option, -} - -impl EmbeddingProviderConfig { - pub(crate) fn validate(&self, path: String, diagnostics: &mut Vec) { - if let Err(error) = omnigraph::embedding::EmbeddingConfig::from_parts( - self.kind.as_deref(), - self.base_url.clone(), - self.model.clone(), - "validation-placeholder".to_string(), - ) { - diagnostics.push(Diagnostic::error( - "invalid_embedding_provider", - path.clone(), - error.to_string(), - )); - } - - if self.kind.as_deref() == Some("mock") { - if let Some(api_key) = self.api_key.as_deref() { - if secret_ref_name(api_key).is_err() { - diagnostics.push(Diagnostic::error( - "embedding_api_key_inline", - format!("{path}.api_key"), - "embedding api_key must be a ${NAME} env reference, not an inline secret", - )); - } - } - return; - } - - match self.api_key.as_deref() { - Some(api_key) if secret_ref_name(api_key).is_err() => diagnostics.push( - Diagnostic::error( - "embedding_api_key_inline", - format!("{path}.api_key"), - "embedding api_key must be a ${NAME} env reference, not an inline secret", - ), - ), - Some(_) => {} - None => diagnostics.push(Diagnostic::error( - "embedding_api_key_required", - format!("{path}.api_key"), - "non-mock embedding providers must set api_key to a ${NAME} env reference", - )), - } - } - - /// Resolve into an engine `EmbeddingConfig`, reading the `${NAME}` api-key - /// reference from process env. Mock profiles do not read env and may omit - /// `api_key`; real providers error if the reference is missing or unset. 
- pub fn resolve(&self) -> Result { - let api_key = if self.kind.as_deref() == Some("mock") { - String::new() - } else { - resolve_secret_ref(self.api_key.as_deref().ok_or_else(|| { - "embedding api_key is required for non-mock providers".to_string() - })?)? - }; - omnigraph::embedding::EmbeddingConfig::from_parts( - self.kind.as_deref(), - self.base_url.clone(), - self.model.clone(), - api_key, - ) - .map_err(|e| e.to_string()) - } -} - -fn secret_ref_name(value: &str) -> Result<&str, String> { - value - .trim() - .strip_prefix("${") - .and_then(|s| s.strip_suffix('}')) - .filter(|name| !name.trim().is_empty()) - .ok_or_else(|| { - format!("embedding api_key must be a ${{NAME}} env reference, got '{}'", value.trim()) - }) -} - -/// Resolve a `${NAME}` secret reference from process env. Rejects an inline value -/// (anything not wrapped in `${…}`) so secrets never sit in the cluster config. -fn resolve_secret_ref(value: &str) -> Result { - let name = secret_ref_name(value)?; - std::env::var(name).map_err(|_| format!("embedding api_key env var '{name}' is not set")) -} - -/// How a graph declares its stored queries. Terraform-style: the `.gq` -/// files ARE the declaration; point at them (or a directory) and every -#[derive(Debug, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct QueryConfig { - pub(crate) file: PathBuf, -} - -#[derive(Debug, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct PolicyConfig { - pub(crate) file: PathBuf, - pub(crate) applies_to: Vec, -} - -// Stage 2A/2B accept these forward-compatible state sections so existing -// ledgers won't churn while approval/recovery semantics are staged later. 
-#[allow(dead_code)] -#[derive(Debug, Clone, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct ClusterState { - pub(crate) version: u32, - #[serde(default)] - pub(crate) state_revision: u64, - pub(crate) applied_revision: AppliedRevisionState, - #[serde(default)] - pub(crate) resource_statuses: BTreeMap, - #[serde(default)] - pub(crate) approval_records: BTreeMap, - #[serde(default)] - pub(crate) recovery_records: BTreeMap, - #[serde(default)] - pub(crate) observations: BTreeMap, -} - -#[derive(Debug, Clone, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct AppliedRevisionState { - #[serde(default)] - pub(crate) config_digest: Option, - #[serde(default)] - pub(crate) resources: BTreeMap, -} - -#[derive(Debug, Clone, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct StateResource { - pub(crate) digest: String, - /// Policy resources only: the applied `applies_to` bindings, normalized - /// to typed refs (`cluster` | `graph.`). Recorded so the state - /// ledger is serving-sufficient for the Phase-5 server boot (RFC-005 - /// §D3). Absent on pre-5A entries (backfilled by the next apply) and on - /// non-policy resources. - #[serde(default, skip_serializing_if = "Option::is_none")] - pub(crate) applies_to: Option>, - /// Graph resources only: the applied `provider.embedding.` binding. - /// The provider profile itself is stored on the provider resource so - /// serving can boot without re-reading mutable desired config. - #[serde(default, skip_serializing_if = "Option::is_none")] - pub(crate) embedding_provider: Option, - /// Embedding provider resources only: the applied profile with unresolved - /// `${ENV}` references. The server resolves the referenced env var exactly - /// once at boot and injects the resulting engine config into the graph. 
- #[serde(default, skip_serializing_if = "Option::is_none")] - pub(crate) embedding_profile: Option, -} - -#[derive(Debug, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct StateLockFile { - pub(crate) version: u32, - pub(crate) lock_id: String, - pub(crate) operation: String, - pub(crate) created_at: String, - pub(crate) pid: u32, -} - -/// Recovery-intent record for a graph-moving apply operation (RFC-004 §D2). -/// Written under the state lock before the engine call that can create or -/// move a graph manifest; deleted only after the cluster state CAS that -/// records the outcome lands. The sweep (§D3) classifies survivors. -#[derive(Debug, Clone, Serialize, Deserialize)] -#[serde(deny_unknown_fields)] -pub(crate) struct RecoverySidecar { - pub(crate) schema_version: u32, - pub(crate) operation_id: String, - pub(crate) started_at: String, - #[serde(default)] - pub(crate) actor: Option, - pub(crate) kind: RecoverySidecarKind, - pub(crate) graph_id: String, - pub(crate) graph_uri: String, - #[serde(default)] - pub(crate) observed_manifest_version: Option, - #[serde(default)] - pub(crate) expected_manifest_version: Option, - pub(crate) desired_schema_digest: String, - #[serde(default)] - pub(crate) state_cas_base: Option, - /// For graph_delete: the approval this operation consumes; lets a sweep - /// roll-forward consume it too. - #[serde(default)] - pub(crate) approval_id: Option, -} - -#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq, Eq)] -#[serde(rename_all = "snake_case")] -pub(crate) enum RecoverySidecarKind { - GraphCreate, - SchemaApply, - GraphDelete, -} - -#[derive(Debug, Default)] -pub(crate) struct SweepOutcome { - /// Graphs whose sidecar was kept (rows 5/6): graph-moving work for them - /// is blocked until the operator repairs and re-observes. 
- pub(crate) pending_graphs: BTreeSet, - /// Sidecars whose outcome is recorded (rows 2/4): deleted only after the - /// command's state write lands, so a CAS failure re-sweeps them. - /// Store URIs (the storage layer addresses everything by URI). - pub(crate) completed_sidecars: Vec, - /// Approval artifacts consumed by a roll-forward (delete row 7b): their - /// files are rewritten with consumed_at only after the state write lands. - pub(crate) consumed_approvals: Vec, -} - -#[cfg(test)] -mod embedding_provider_config_tests { - use super::EmbeddingProviderConfig; - - #[test] - fn resolves_secret_from_env_and_applies_defaults() { - // SAFETY: a unique var name, no concurrent reader. - unsafe { std::env::set_var("OG_TEST_EMBED_KEY_A", "secret-x") }; - let profile = EmbeddingProviderConfig { - kind: Some("openai-compatible".to_string()), - base_url: None, - model: Some("m".to_string()), - api_key: Some("${OG_TEST_EMBED_KEY_A}".to_string()), - }; - let config = profile.resolve().unwrap(); - assert_eq!(config.api_key, "secret-x"); - assert_eq!(config.model, "m"); - unsafe { std::env::remove_var("OG_TEST_EMBED_KEY_A") }; - } - - #[test] - fn rejects_inline_api_key() { - let profile = EmbeddingProviderConfig { - kind: None, - base_url: None, - model: None, - api_key: Some("sk-inline".to_string()), - }; - let err = profile.resolve().unwrap_err(); - assert!(err.contains("${NAME}"), "got: {err}"); - } - - #[test] - fn errors_on_unset_secret() { - let profile = EmbeddingProviderConfig { - kind: None, - base_url: None, - model: None, - api_key: Some("${OG_TEST_DEFINITELY_UNSET_VAR}".to_string()), - }; - let err = profile.resolve().unwrap_err(); - assert!(err.contains("not set"), "got: {err}"); - } - - #[test] - fn rejects_unknown_provider() { - unsafe { std::env::set_var("OG_TEST_EMBED_KEY_B", "x") }; - let profile = EmbeddingProviderConfig { - kind: Some("cohere".to_string()), - base_url: None, - model: None, - api_key: Some("${OG_TEST_EMBED_KEY_B}".to_string()), - }; - 
 let err = profile.resolve().unwrap_err(); - assert!(err.contains("unknown embedding provider"), "got: {err}"); - unsafe { std::env::remove_var("OG_TEST_EMBED_KEY_B") }; - } - - #[test] - fn mock_does_not_require_secret_env() { - let profile = EmbeddingProviderConfig { - kind: Some("mock".to_string()), - base_url: None, - model: Some("cluster-mock".to_string()), - api_key: None, - }; - let config = profile.resolve().unwrap(); - assert_eq!(config.model, "cluster-mock"); - } -} diff --git a/crates/omnigraph-cluster/tests/failpoints.rs b/crates/omnigraph-cluster/tests/failpoints.rs deleted file mode 100644 index 6b6d339..0000000 --- a/crates/omnigraph-cluster/tests/failpoints.rs +++ /dev/null @@ -1,707 +0,0 @@ -//! Fault-injection tests for the cluster apply protocol. -//! -//! These live in an integration binary (not in-source) deliberately: the fail -//! crate's registry is process-global, so a configured `cluster_apply.*` -//! action would fire inside any concurrently running normal apply test in the -//! lib-test process. A separate binary isolates the registry by construction; -//! same reason the engine keeps its failpoint suite in `tests/failpoints.rs`. - -#![cfg(feature = "failpoints")] - -use std::collections::BTreeMap; -use std::fs; -use std::path::{Path, PathBuf}; - -use fail::FailScenario; -use serial_test::serial; -use omnigraph::db::Omnigraph; -// One ScopedFailPoint for both engine- and cluster-scoped failpoint names: -// it is registry-only (error-type agnostic) and lives in the lowest crate. -use omnigraph::failpoints::ScopedFailPoint; -use omnigraph_cluster::{ - ApplyOptions, apply_config_dir, apply_config_dir_with_options, approve_config_dir, - validate_config_dir, -}; -use tempfile::tempdir; - -const SCHEMA: &str = r#" -node Person { - name: String @key - age: I32? 
-} -"#; - -const QUERY: &str = r#" -query find_person($name: String) { - match { $p: Person { name: $name } } - return { $p.name, $p.age } -} -"#; - -fn fixture() -> tempfile::TempDir { - let dir = tempdir().unwrap(); - fs::write(dir.path().join("people.pg"), SCHEMA).unwrap(); - fs::write(dir.path().join("people.gq"), QUERY).unwrap(); - fs::write(dir.path().join("base.policy.yaml"), "rules: []\n").unwrap(); - fs::write( - dir.path().join("cluster.yaml"), - r#" -version: 1 -state: - backend: cluster - lock: true -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -policies: - base: - file: ./base.policy.yaml - applies_to: [knowledge] -"#, - ) - .unwrap(); - dir -} - -/// Seed a state.json where the graph/schema digests match desired, so query -/// and policy changes are applicable. Digests are borrowed from the public -/// validate output; the graph composite is a placeholder that apply converges -/// as a Derived update. -fn seed_applyable_state(config_dir: &Path) -> BTreeMap { - let validate = validate_config_dir(config_dir); - assert!(validate.ok, "{:?}", validate.diagnostics); - let schema_digest = validate.resource_digests["schema.knowledge"].clone(); - let state_dir = config_dir.join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - format!( - r#"{{ - "version": 1, - "state_revision": 1, - "applied_revision": {{ - "resources": {{ - "graph.knowledge": {{ "digest": "seed" }}, - "schema.knowledge": {{ "digest": "{schema_digest}" }} - }} - }} -}} -"# - ), - ) - .unwrap(); - validate.resource_digests -} - -fn state_path(config_dir: &Path) -> PathBuf { - config_dir.join("__cluster/state.json") -} - -fn query_blob(config_dir: &Path, digests: &BTreeMap) -> PathBuf { - config_dir - .join("__cluster/resources/query/knowledge/find_person") - .join(format!("{}.gq", digests["query.knowledge.find_person"])) -} - -#[tokio::test] -#[serial] -async fn 
failpoint_wiring_returns_injected_diagnostic() { - let scenario = FailScenario::setup(); - let dir = fixture(); - seed_applyable_state(dir.path()); - - let _failpoint = ScopedFailPoint::new(omnigraph_cluster::failpoints::names::CLUSTER_APPLY_AFTER_PAYLOAD_PHASE, "return"); - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(out.diagnostics.iter().any(|diagnostic| { - diagnostic.code == "injected_failpoint" - && diagnostic - .message - .contains("cluster_apply.after_payload_phase") - })); - drop(_failpoint); - scenario.teardown(); -} - -/// Crash between the payload phase and the state write: blobs are on disk, -/// state.json is byte-identical, nothing is acknowledged, and a plain re-run -/// repairs by trusting the existing content-addressed blobs. -#[tokio::test] -#[serial] -async fn apply_crash_after_payload_phase_leaves_state_unmoved_then_recovers() { - let scenario = FailScenario::setup(); - let dir = fixture(); - let digests = seed_applyable_state(dir.path()); - let state_before = fs::read(state_path(dir.path())).unwrap(); - - { - let _failpoint = ScopedFailPoint::new(omnigraph_cluster::failpoints::names::CLUSTER_APPLY_AFTER_PAYLOAD_PHASE, "return"); - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(!out.state_written); - assert!(!out.converged); - assert_eq!(out.applied_count, 0); - // Persisted pre-apply snapshot: no phantom Applied statuses. - assert!( - !out.resource_statuses - .contains_key("query.knowledge.find_person"), - "{:?}", - out.resource_statuses - ); - // State has not moved; payloads are inert on disk; the lock released. - assert_eq!(fs::read(state_path(dir.path())).unwrap(), state_before); - assert!(query_blob(dir.path(), &digests).exists()); - assert!(!dir.path().join("__cluster/lock.json").exists()); - } - - // The repair is a plain re-run: existing blobs are trusted by digest. 
- let recovered = apply_config_dir(dir.path()).await; - assert!(recovered.ok, "{:?}", recovered.diagnostics); - assert!(recovered.converged); - assert!(recovered.state_written); - assert_eq!( - recovered.resource_statuses["query.knowledge.find_person"].status, - omnigraph_cluster::ResourceLifecycleStatus::Applied - ); - scenario.teardown(); -} - -/// A concurrent writer mutating state.json between apply's read and its write -/// (possible under `state.lock: false`) must surface `state_cas_mismatch`, -/// acknowledge nothing, and leave the concurrent writer's state on disk. -#[tokio::test] -#[serial] -async fn apply_cas_race_surfaces_state_cas_mismatch() { - let scenario = FailScenario::setup(); - let dir = fixture(); - let digests = seed_applyable_state(dir.path()); - - // Simulate the concurrent writer at the exact race window: rewrite - // state.json (valid JSON, graph/schema digests preserved, revision 99) - // after apply read it but before apply writes. RAII-guarded so a panic - // inside apply cannot leak the callback into the global registry. - let race_path = state_path(dir.path()); - let failpoint = ScopedFailPoint::with_callback(omnigraph_cluster::failpoints::names::CLUSTER_APPLY_BEFORE_STATE_WRITE, move || { - let mut state: serde_json::Value = - serde_json::from_str(&fs::read_to_string(&race_path).unwrap()).unwrap(); - state["state_revision"] = serde_json::json!(99); - fs::write(&race_path, serde_json::to_string_pretty(&state).unwrap()).unwrap(); - }); - - let out = apply_config_dir(dir.path()).await; - drop(failpoint); - - assert!(!out.ok); - assert!(!out.state_written); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "state_cas_mismatch"), - "{:?}", - out.diagnostics - ); - // Persisted snapshot, not the unwritten in-memory mutations. - assert!( - !out.resource_statuses - .contains_key("query.knowledge.find_person") - ); - // The concurrent writer's state is what's on disk; apply's mutation never landed. 
- let state: serde_json::Value = - serde_json::from_str(&fs::read_to_string(state_path(dir.path())).unwrap()).unwrap(); - assert_eq!(state["state_revision"], 99); - assert!( - state["applied_revision"]["resources"] - .get("query.knowledge.find_person") - .is_none() - ); - // Blobs written before the race are inert. - assert!(query_blob(dir.path(), &digests).exists()); - - // Recovery is a plain re-run against the rewritten state. - let recovered = apply_config_dir(dir.path()).await; - assert!(recovered.ok, "{:?}", recovered.diagnostics); - assert!(recovered.converged); - scenario.teardown(); -} - -fn seed_empty_state(config_dir: &Path) { - let state_dir = config_dir.join("__cluster"); - fs::create_dir_all(&state_dir).unwrap(); - fs::write( - state_dir.join("state.json"), - r#"{ - "version": 1, - "state_revision": 1, - "applied_revision": { "resources": {} } -} -"#, - ) - .unwrap(); -} - -fn recovery_sidecars(config_dir: &Path) -> Vec { - match fs::read_dir(config_dir.join("__cluster/recoveries")) { - Ok(entries) => { - let mut paths: Vec = entries - .flatten() - .map(|entry| entry.path()) - .filter(|path| path.extension().is_some_and(|ext| ext == "json")) - .collect(); - paths.sort(); - paths - } - Err(_) => Vec::new(), - } -} - -/// Crash before the init: the create-intent sidecar survives, nothing moved. -/// The next run's sweep removes the intent (row 1) and the same run creates -/// the graph and converges. 
-#[tokio::test] -#[serial] -async fn create_crash_before_init_recovers_via_sweep() { - let scenario = FailScenario::setup(); - let dir = fixture(); - seed_empty_state(dir.path()); - - { - let _failpoint = ScopedFailPoint::new(omnigraph_cluster::failpoints::names::CLUSTER_APPLY_BEFORE_GRAPH_CREATE, "return"); - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(out.diagnostics.iter().any(|diagnostic| { - diagnostic.code == "injected_failpoint" - && diagnostic - .message - .contains("cluster_apply.before_graph_create") - })); - assert_eq!(recovery_sidecars(dir.path()).len(), 1); - assert!(!dir.path().join("graphs/knowledge.omni").exists()); - // No resource digest moved. - let state: serde_json::Value = serde_json::from_str( - &fs::read_to_string(dir.path().join("__cluster/state.json")).unwrap(), - ) - .unwrap(); - assert!( - state["applied_revision"]["resources"] - .as_object() - .unwrap() - .is_empty() - ); - } - - let recovered = apply_config_dir(dir.path()).await; - assert!(recovered.ok, "{:?}", recovered.diagnostics); - assert!(recovered.converged); - assert!(dir.path().join("graphs/knowledge.omni").exists()); - assert!(recovery_sidecars(dir.path()).is_empty()); - scenario.teardown(); -} - -/// Crash after the init but before the state CAS: the graph exists, the -/// ledger is stale, nothing was acknowledged. The next run's sweep rolls the -/// ledger forward (row 4) with an audit entry, and the run converges. 
-#[tokio::test] -#[serial] -async fn create_crash_after_init_rolls_state_forward() { - let scenario = FailScenario::setup(); - let dir = fixture(); - seed_empty_state(dir.path()); - let state_before = fs::read(dir.path().join("__cluster/state.json")).unwrap(); - - { - let _failpoint = ScopedFailPoint::new(omnigraph_cluster::failpoints::names::CLUSTER_APPLY_AFTER_GRAPH_CREATE, "return"); - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(!out.state_written); - // The graph exists; the cluster state is byte-identical (no ack). - assert!(dir.path().join("graphs/knowledge.omni").exists()); - assert_eq!( - fs::read(dir.path().join("__cluster/state.json")).unwrap(), - state_before - ); - // The sidecar carries the post-init manifest pin. - let sidecars = recovery_sidecars(dir.path()); - assert_eq!(sidecars.len(), 1); - let sidecar: serde_json::Value = - serde_json::from_str(&fs::read_to_string(&sidecars[0]).unwrap()).unwrap(); - assert!( - sidecar["expected_manifest_version"].is_number(), - "{sidecar}" - ); - } - - let recovered = apply_config_dir(dir.path()).await; - assert!(recovered.ok, "{:?}", recovered.diagnostics); - assert!( - recovered - .diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_rolled_forward") - ); - assert!(recovered.converged); - assert!(recovery_sidecars(dir.path()).is_empty()); - let state: serde_json::Value = - serde_json::from_str(&fs::read_to_string(dir.path().join("__cluster/state.json")).unwrap()) - .unwrap(); - assert!( - state["recovery_records"] - .as_object() - .unwrap() - .values() - .any(|record| record["outcome"] == "rolled_forward") - ); - scenario.teardown(); -} - -const SCHEMA_V2: &str = r#" -node Person { - name: String @key - age: I32? - bio: String? 
-} -"#; - -async fn converge_with_live_graph(dir: &Path) { - let graph_dir = dir.join("graphs"); - fs::create_dir_all(&graph_dir).unwrap(); - Omnigraph::init( - graph_dir.join("knowledge.omni").to_string_lossy().as_ref(), - SCHEMA, - ) - .await - .unwrap(); - seed_applyable_state(dir); - let out = apply_config_dir(dir).await; - assert!(out.ok && out.converged, "{:?}", out.diagnostics); -} - -async fn live_schema_digest(dir: &Path) -> String { - let uri = dir.join("graphs/knowledge.omni"); - let db = Omnigraph::open_read_only(uri.to_string_lossy().as_ref()) - .await - .unwrap(); - use sha2::{Digest, Sha256}; - let digest = Sha256::digest(db.schema_source().as_bytes()); - digest.iter().map(|byte| format!("{byte:02x}")).collect() -} - -/// Crash before the engine schema apply: sidecar (with actor) survives, the -/// live schema and ledger are untouched; the next run's sweep retires the -/// stale intent and the same run applies and converges. -#[tokio::test] -#[serial] -async fn schema_crash_before_apply_recovers_via_sweep() { - let scenario = FailScenario::setup(); - let dir = fixture(); - converge_with_live_graph(dir.path()).await; - let pre_digest = live_schema_digest(dir.path()).await; - fs::write(dir.path().join("people.pg"), SCHEMA_V2).unwrap(); - - { - let _failpoint = ScopedFailPoint::new(omnigraph_cluster::failpoints::names::CLUSTER_APPLY_BEFORE_SCHEMA_APPLY, "return"); - let out = apply_config_dir_with_options( - dir.path(), - ApplyOptions { - actor: Some("test-actor".to_string()), - }, - ) - .await; - assert!(!out.ok); - assert_eq!(out.actor.as_deref(), Some("test-actor")); - let sidecars = recovery_sidecars(dir.path()); - assert_eq!(sidecars.len(), 1); - let sidecar: serde_json::Value = - serde_json::from_str(&fs::read_to_string(&sidecars[0]).unwrap()).unwrap(); - assert_eq!(sidecar["kind"], "schema_apply"); - assert_eq!(sidecar["actor"], "test-actor"); - // Nothing moved. 
- assert_eq!(live_schema_digest(dir.path()).await, pre_digest); - } - - let recovered = apply_config_dir(dir.path()).await; - assert!(recovered.ok, "{:?}", recovered.diagnostics); - assert!(recovered.converged); - assert!(recovery_sidecars(dir.path()).is_empty()); - assert_ne!(live_schema_digest(dir.path()).await, pre_digest); - scenario.teardown(); -} - -/// Engine apply fails after cluster preview and sidecar creation, but before -/// the graph manifest moves. The defensive cleanup proof should remove the -/// cluster sidecar immediately so a pre-movement error cannot brick boot. -#[tokio::test] -#[serial] -async fn schema_apply_error_before_graph_movement_removes_sidecar() { - let scenario = FailScenario::setup(); - let dir = fixture(); - converge_with_live_graph(dir.path()).await; - let pre_digest = live_schema_digest(dir.path()).await; - fs::write(dir.path().join("people.pg"), SCHEMA_V2).unwrap(); - - { - let _failpoint = ScopedFailPoint::new(omnigraph::failpoints::names::SCHEMA_APPLY_BEFORE_STAGING_WRITE, "return"); - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "schema_apply_failed"), - "{:?}", - out.diagnostics - ); - assert_eq!(live_schema_digest(dir.path()).await, pre_digest); - assert!( - recovery_sidecars(dir.path()).is_empty(), - "{:?}", - recovery_sidecars(dir.path()) - ); - } - - let recovered = apply_config_dir(dir.path()).await; - assert!(recovered.ok && recovered.converged, "{recovered:?}"); - assert!(recovery_sidecars(dir.path()).is_empty()); - assert_ne!(live_schema_digest(dir.path()).await, pre_digest); - scenario.teardown(); -} - -/// Engine apply fails after the graph manifest moved. The cluster cannot -/// prove this is a pre-movement failure, so the sidecar must survive for -/// explicit recovery/quarantine instead of being cleaned up defensively. 
-#[tokio::test] -#[serial] -async fn schema_apply_error_after_graph_movement_keeps_sidecar() { - let scenario = FailScenario::setup(); - let dir = fixture(); - converge_with_live_graph(dir.path()).await; - let pre_digest = live_schema_digest(dir.path()).await; - fs::write(dir.path().join("people.pg"), SCHEMA_V2).unwrap(); - let desired = validate_config_dir(dir.path()); - let v2_digest = desired.resource_digests["schema.knowledge"].clone(); - - { - let _failpoint = ScopedFailPoint::new(omnigraph::failpoints::names::SCHEMA_APPLY_AFTER_MANIFEST_COMMIT, "return"); - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!( - out.diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "schema_apply_failed"), - "{:?}", - out.diagnostics - ); - // Read-only opens do not run engine schema-state recovery, so the - // schema file still reads as the old digest even though the manifest - // has moved. The cluster sidecar must remain because movement was - // detected by the fallback manifest-version proof. 
- assert_eq!(live_schema_digest(dir.path()).await, pre_digest); - let sidecars = recovery_sidecars(dir.path()); - assert_eq!(sidecars.len(), 1, "{sidecars:?}"); - let sidecar: serde_json::Value = - serde_json::from_str(&fs::read_to_string(&sidecars[0]).unwrap()).unwrap(); - assert_eq!(sidecar["kind"], "schema_apply"); - assert!(sidecar["expected_manifest_version"].is_null(), "{sidecar}"); - } - - let uri = dir.path().join("graphs/knowledge.omni"); - let db = Omnigraph::open(uri.to_string_lossy().as_ref()) - .await - .unwrap(); - assert_eq!( - db.schema_source().as_str(), - SCHEMA_V2, - "read-write open should complete engine schema-state recovery" - ); - drop(db); - assert_eq!(live_schema_digest(dir.path()).await, v2_digest); - - let recovered = apply_config_dir(dir.path()).await; - assert!(recovered.ok, "{:?}", recovered.diagnostics); - assert!( - recovered - .diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_rolled_forward") - ); - assert!(recovered.converged); - assert!(recovery_sidecars(dir.path()).is_empty()); - scenario.teardown(); -} - -/// Crash after the engine schema apply, before the state CAS: the manifest -/// moved, the ledger is stale, nothing acknowledged; the next run's sweep -/// rolls the ledger forward with an audit entry and the run converges. 
-#[tokio::test] -#[serial] -async fn schema_crash_after_apply_rolls_state_forward() { - let scenario = FailScenario::setup(); - let dir = fixture(); - converge_with_live_graph(dir.path()).await; - fs::write(dir.path().join("people.pg"), SCHEMA_V2).unwrap(); - let state_before = fs::read(state_path(dir.path())).unwrap(); - let desired = validate_config_dir(dir.path()); - let v2_digest = desired.resource_digests["schema.knowledge"].clone(); - - { - let _failpoint = ScopedFailPoint::new(omnigraph_cluster::failpoints::names::CLUSTER_APPLY_AFTER_SCHEMA_APPLY, "return"); - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(!out.state_written); - // The live schema moved; the ledger is byte-identical (no ack). - assert_eq!(live_schema_digest(dir.path()).await, v2_digest); - assert_eq!(fs::read(state_path(dir.path())).unwrap(), state_before); - let sidecars = recovery_sidecars(dir.path()); - assert_eq!(sidecars.len(), 1); - let sidecar: serde_json::Value = - serde_json::from_str(&fs::read_to_string(&sidecars[0]).unwrap()).unwrap(); - assert!( - sidecar["expected_manifest_version"].is_number(), - "{sidecar}" - ); - } - - let recovered = apply_config_dir(dir.path()).await; - assert!(recovered.ok, "{:?}", recovered.diagnostics); - assert!( - recovered - .diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_rolled_forward") - ); - assert!(recovered.converged); - assert!(recovery_sidecars(dir.path()).is_empty()); - let state: serde_json::Value = - serde_json::from_str(&fs::read_to_string(state_path(dir.path())).unwrap()).unwrap(); - assert_eq!( - state["applied_revision"]["resources"]["schema.knowledge"]["digest"], - v2_digest - ); - scenario.teardown(); -} - -/// Seed: converged state + a stale `old` graph subtree with a real root and -/// a valid approval for its delete. Returns the approval id. 
-async fn seed_approved_delete(dir: &Path) -> String { - let digests = seed_applyable_state(dir); - let graph_digest = digests["graph.knowledge"].clone(); - let schema_digest = digests["schema.knowledge"].clone(); - let state_dir = dir.join("__cluster"); - fs::write( - state_dir.join("state.json"), - format!( - r#"{{ - "version": 1, - "state_revision": 1, - "applied_revision": {{ - "resources": {{ - "graph.knowledge": {{ "digest": "{graph_digest}" }}, - "schema.knowledge": {{ "digest": "{schema_digest}" }}, - "graph.old": {{ "digest": "3333" }}, - "schema.old": {{ "digest": "4444" }} - }} - }} -}} -"# - ), - ) - .unwrap(); - let root = dir.join("graphs/old.omni"); - fs::create_dir_all(&root).unwrap(); - fs::write(root.join("_schema.pg"), "stale").unwrap(); - let approved = approve_config_dir(dir, "graph.old", "test-actor").await; - assert!(approved.ok, "{:?}", approved.diagnostics); - approved.approval_id.unwrap() -} - -/// Crash before the removal: root intact, approval unconsumed, no ack; the -/// next run retires the stale intent (row 8) and the still-approved delete -/// completes in the same run. -#[tokio::test] -#[serial] -async fn delete_crash_before_removal_reproposes() { - let scenario = FailScenario::setup(); - let dir = fixture(); - let approval_id = seed_approved_delete(dir.path()).await; - - { - let _failpoint = ScopedFailPoint::new(omnigraph_cluster::failpoints::names::CLUSTER_APPLY_BEFORE_GRAPH_DELETE, "return"); - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(dir.path().join("graphs/old.omni").exists()); - assert_eq!(recovery_sidecars(dir.path()).len(), 1); - // The approval is untouched (file unconsumed). 
- let artifact: serde_json::Value = serde_json::from_str( - &fs::read_to_string( - dir.path() - .join("__cluster/approvals") - .join(format!("{approval_id}.json")), - ) - .unwrap(), - ) - .unwrap(); - assert!(artifact["consumed_at"].is_null()); - } - - let recovered = apply_config_dir(dir.path()).await; - assert!(recovered.ok, "{:?}", recovered.diagnostics); - assert!( - recovered - .diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "graph_delete_incomplete") - ); - assert!(recovered.converged); - assert!(!dir.path().join("graphs/old.omni").exists()); - assert!(recovery_sidecars(dir.path()).is_empty()); - scenario.teardown(); -} - -/// Crash after the removal, before the state CAS: root gone, ledger stale, -/// nothing acknowledged; the next run's sweep rolls the tombstone forward, -/// consumes the approval the sidecar carries, and audits the recovery. -#[tokio::test] -#[serial] -async fn delete_crash_after_removal_rolls_forward() { - let scenario = FailScenario::setup(); - let dir = fixture(); - let approval_id = seed_approved_delete(dir.path()).await; - let state_before = fs::read(state_path(dir.path())).unwrap(); - - { - let _failpoint = ScopedFailPoint::new(omnigraph_cluster::failpoints::names::CLUSTER_APPLY_AFTER_GRAPH_DELETE, "return"); - let out = apply_config_dir(dir.path()).await; - assert!(!out.ok); - assert!(!out.state_written); - assert!(!dir.path().join("graphs/old.omni").exists()); - assert_eq!(fs::read(state_path(dir.path())).unwrap(), state_before); - let sidecars = recovery_sidecars(dir.path()); - assert_eq!(sidecars.len(), 1); - let sidecar: serde_json::Value = - serde_json::from_str(&fs::read_to_string(&sidecars[0]).unwrap()).unwrap(); - assert_eq!(sidecar["approval_id"], approval_id.as_str()); - } - - let recovered = apply_config_dir(dir.path()).await; - assert!(recovered.ok, "{:?}", recovered.diagnostics); - assert!( - recovered - .diagnostics - .iter() - .any(|diagnostic| diagnostic.code == "cluster_recovery_rolled_forward") - ); - 
assert!(recovered.converged); - let state: serde_json::Value = - serde_json::from_str(&fs::read_to_string(state_path(dir.path())).unwrap()).unwrap(); - assert_eq!(state["observations"]["graph.old"]["kind"], "tombstone"); - assert!(state["approval_records"][&approval_id]["consumed_at"].is_string()); - assert!( - state["recovery_records"] - .as_object() - .unwrap() - .values() - .any(|record| record["kind"] == "graph_delete") - ); - scenario.teardown(); -} diff --git a/crates/omnigraph-cluster/tests/s3_cluster.rs b/crates/omnigraph-cluster/tests/s3_cluster.rs deleted file mode 100644 index 3c7cef3..0000000 --- a/crates/omnigraph-cluster/tests/s3_cluster.rs +++ /dev/null @@ -1,162 +0,0 @@ -//! Cluster-on-object-storage end-to-end (RFC-006): the full control-plane -//! lifecycle with `storage: s3://…` — import, apply (graph roots + catalog -//! on the bucket), serving snapshots from both the config dir and the bare -//! storage URI, schema evolution, and the approved delete (prefix removal). -//! -//! Gated like every S3 suite: skips unless `OMNIGRAPH_S3_TEST_BUCKET` is -//! set (CI runs it against containerized RustFS; locally use the RustFS -//! binary + `AWS_*` env, see docs/dev/testing.md). -//! -//! Runtime flavor is multi_thread on purpose: the state-lock guard's -//! drop-time release uses block_in_place on object stores, which is the -//! production (CLI) runtime shape — and the lock-release regression this -//! suite pins (a spawned delete dying with a short-lived runtime) only -//! reproduces realistically under it. 
- -use std::env; -use std::fs; - -use omnigraph_cluster::{ - apply_config_dir, import_config_dir, read_serving_snapshot, - read_serving_snapshot_from_storage, status_config_dir, validate_config_dir, -}; -use ulid::Ulid; - -const SCHEMA_V1: &str = "node Person {\n name: String @key\n}\n"; -const SCHEMA_V2: &str = "node Person {\n name: String @key\n title: String?\n}\n"; -const FIND_PERSON_GQ: &str = "query find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n"; -const POLICY_YAML: &str = r#" -version: 1 -actors: - - id: act-admin - roles: [admin] -rules: - - effect: permit - actions: [read, change, schema_apply, branch_create, branch_delete, branch_merge] - roles: [admin] -"#; - -/// Unique per-run storage root under the test bucket, or None to skip. -fn s3_storage_root(suite: &str) -> Option<String> { - let bucket = env::var("OMNIGRAPH_S3_TEST_BUCKET").ok()?; - Some(format!("s3://{bucket}/cluster-e2e/{suite}-{}", Ulid::new())) -} - -fn write_cluster_fixture(dir: &std::path::Path, storage_root: &str, schema: &str) { - fs::write(dir.join("people.pg"), schema).unwrap(); - fs::create_dir_all(dir.join("queries")).unwrap(); - fs::write(dir.join("queries/people.gq"), FIND_PERSON_GQ).unwrap(); - fs::write(dir.join("intel.policy.yaml"), POLICY_YAML).unwrap(); - fs::write( - dir.join("cluster.yaml"), - format!( - r#"version: 1 -storage: {storage_root} -graphs: - knowledge: - schema: people.pg - queries: queries/ -policies: - intel: - file: intel.policy.yaml - applies_to: [graph.knowledge] -"# - ), - ) - .unwrap(); -} - -#[tokio::test(flavor = "multi_thread")] -async fn s3_cluster_full_lifecycle_import_apply_serve_evolve_delete() { - let Some(root) = s3_storage_root("lifecycle") else { - eprintln!("skipping s3 cluster e2e: OMNIGRAPH_S3_TEST_BUCKET is not set"); - return; - }; - let dir = tempfile::tempdir().unwrap(); - write_cluster_fixture(dir.path(), &root, SCHEMA_V1); - - // validate is config-only and must pass before any bucket I/O. 
- let validate = validate_config_dir(dir.path()); - assert!(validate.ok, "{:?}", validate.diagnostics); - - let import = import_config_dir(dir.path()).await; - assert!(import.ok, "{:?}", import.diagnostics); - - // The lock-release regression (caught live on the first smoke): the - // guard's drop must COMPLETE its bucket delete before the command - // returns — a follow-up command finding `state_lock_held` means the - // release was spawned into a dying runtime. - let status = status_config_dir(dir.path()).await; - assert!(status.ok, "{:?}", status.diagnostics); - assert!( - !status.state_observations.locked, - "import leaked the state lock on the bucket: {:?}", - status.state_observations - ); - - let apply = apply_config_dir(dir.path()).await; - assert!(apply.ok && apply.converged, "{:?}", apply.diagnostics); - - // Nothing stored locally: the config dir holds only declared sources. - assert!(!dir.path().join("__cluster").exists()); - assert!(!dir.path().join("graphs").exists()); - - // Serving snapshot resolves through cluster.yaml's storage: key… - let via_config = read_serving_snapshot(dir.path()).await.unwrap(); - assert_eq!(via_config.graphs.len(), 1); - let graph_root = via_config.graphs[0].root.to_string_lossy().to_string(); - assert!( - graph_root.starts_with("s3://") && graph_root.ends_with("graphs/knowledge.omni"), - "{graph_root}" - ); - assert_eq!(via_config.queries.len(), 1); - assert_eq!(via_config.policies.len(), 1); - assert!( - via_config.policies[0].source.contains("act-admin"), - "policy must carry verified content, not a path" - ); - - // …and config-free, straight from the bucket URI (the deployment - // payoff: a server needs only the URI and credentials). 
- let via_uri = read_serving_snapshot_from_storage(&root).await.unwrap(); - assert_eq!(via_uri.graphs.len(), 1); - assert_eq!( - via_uri.graphs[0].root.to_string_lossy(), - via_config.graphs[0].root.to_string_lossy() - ); - assert_eq!(via_uri.policies.len(), 1); - - // Schema evolution converges on the bucket. - write_cluster_fixture(dir.path(), &root, SCHEMA_V2); - let evolve = apply_config_dir(dir.path()).await; - assert!(evolve.ok && evolve.converged, "{:?}", evolve.diagnostics); - - // Approved delete: drop the graph from the config; the plan demands an - // approval, the approved apply prefix-deletes the bucket root. - fs::write( - dir.path().join("cluster.yaml"), - format!("version: 1\nstorage: {root}\ngraphs: {{}}\n"), - ) - .unwrap(); - let plan = omnigraph_cluster::plan_config_dir(dir.path()).await; - assert!(plan.ok, "{:?}", plan.diagnostics); - let approval = plan - .approvals_required - .first() - .expect("graph delete requires approval"); - let approve = omnigraph_cluster::approve_config_dir( - dir.path(), - &approval.resource, - "e2e-operator", - ) - .await; - assert!(approve.ok, "{:?}", approve.diagnostics); - let delete = apply_config_dir(dir.path()).await; - assert!(delete.ok && delete.converged, "{:?}", delete.diagnostics); - - let after = read_serving_snapshot_from_storage(&root).await; - assert!( - after.is_err(), - "an empty cluster must refuse to serve, got {after:?}" - ); -} diff --git a/crates/omnigraph-compiler/Cargo.toml b/crates/omnigraph-compiler/Cargo.toml index f885a9f..229b862 100644 --- a/crates/omnigraph-compiler/Cargo.toml +++ b/crates/omnigraph-compiler/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "omnigraph-compiler" -version = "0.7.2" +version = "0.6.0" edition = "2024" description = "Schema/query compiler for Omnigraph. Zero Lance dependency." 
license = "MIT" @@ -20,5 +20,10 @@ pest_derive = { workspace = true } thiserror = { workspace = true } serde = { workspace = true } serde_json = { workspace = true } +reqwest = { workspace = true } ahash = { workspace = true } +tokio = { workspace = true } sha2 = { workspace = true } + +[dev-dependencies] +tokio = { workspace = true } diff --git a/crates/omnigraph-compiler/src/catalog/mod.rs b/crates/omnigraph-compiler/src/catalog/mod.rs index 2287c3b..0bb536d 100644 --- a/crates/omnigraph-compiler/src/catalog/mod.rs +++ b/crates/omnigraph-compiler/src/catalog/mod.rs @@ -6,7 +6,7 @@ use std::sync::Arc; use arrow_schema::{DataType, Field, Schema, SchemaRef}; -use crate::error::{CompilerError, Result}; +use crate::error::{NanoError, Result}; use crate::schema::ast::{Cardinality, Constraint, ConstraintBound, SchemaDecl, SchemaFile}; use crate::types::{PropType, ScalarType}; @@ -26,15 +26,6 @@ pub struct InterfaceType { pub properties: HashMap, } -/// The `@embed` binding for a vector property: its source text property and, -/// optionally, the embedding model recorded by `@embed("source", model="…")`. -/// The model is what the query-time same-space check validates against. -#[derive(Debug, Clone, PartialEq, Eq)] -pub struct EmbedSource { - pub source: String, - pub model: Option, -} - #[derive(Debug, Clone)] pub struct NodeType { pub name: String, @@ -51,8 +42,8 @@ pub struct NodeType { pub range_constraints: Vec, /// Regex check constraints pub check_constraints: Vec, - /// Maps @embed target property -> its source text property + recorded model. 
- pub embed_sources: HashMap<String, EmbedSource>, + /// Maps @embed target property -> source text property + pub embed_sources: HashMap<String, String>, pub blob_properties: HashSet<String>, pub arrow_schema: SchemaRef, } @@ -151,7 +142,7 @@ pub fn build_catalog(schema: &SchemaFile) -> Result<Catalog> { for decl in &schema.declarations { if let SchemaDecl::Node(node) = decl { if node_types.contains_key(&node.name) { - return Err(CompilerError::Catalog(format!( + return Err(NanoError::Catalog(format!( "duplicate node type: {}", node.name ))); @@ -165,18 +156,14 @@ pub fn build_catalog(schema: &SchemaFile) -> Result<Catalog> { if matches!(prop.prop_type.scalar, ScalarType::Blob) { blob_properties.insert(prop.name.clone()); } - // Extract @embed: the source text property (positional) and the - // optional recorded model (the `model` kwarg). - if let Some(ann) = prop.annotations.iter().find(|ann| ann.name == "embed") { - if let Some(source) = ann.value.clone() { - embed_sources.insert( - prop.name.clone(), - EmbedSource { - source, - model: ann.kwargs.get("model").cloned(), - }, - ); - } + // Extract @embed from property annotations (stays as annotation) + if let Some(source_prop) = prop + .annotations + .iter() + .find(|ann| ann.name == "embed") + .and_then(|ann| ann.value.clone()) + { + embed_sources.insert(prop.name.clone(), source_prop); } } @@ -250,19 +237,19 @@ pub fn build_catalog(schema: &SchemaFile) -> Result<Catalog> { for decl in &schema.declarations { if let SchemaDecl::Edge(edge) = decl { if edge_types.contains_key(&edge.name) { - return Err(CompilerError::Catalog(format!( + return Err(NanoError::Catalog(format!( "duplicate edge type: {}", edge.name ))); } if !node_types.contains_key(&edge.from_type) { - return Err(CompilerError::Catalog(format!( + return Err(NanoError::Catalog(format!( "edge {} references unknown source type: {}", edge.name, edge.from_type ))); } if !node_types.contains_key(&edge.to_type) { - return Err(CompilerError::Catalog(format!( + return Err(NanoError::Catalog(format!( "edge {} references unknown 
target type: {}", edge.name, edge.to_type ))); @@ -302,7 +289,7 @@ pub fn build_catalog(schema: &SchemaFile) -> Result { if let Some(existing) = edge_name_index.get(&normalized_name) && existing != &edge.name { - return Err(CompilerError::Catalog(format!( + return Err(NanoError::Catalog(format!( "edge name collision after case folding: '{}' conflicts with '{}'", edge.name, existing ))); diff --git a/crates/omnigraph-compiler/src/catalog/schema_ir.rs b/crates/omnigraph-compiler/src/catalog/schema_ir.rs index 4a56ffa..d90539e 100644 --- a/crates/omnigraph-compiler/src/catalog/schema_ir.rs +++ b/crates/omnigraph-compiler/src/catalog/schema_ir.rs @@ -4,7 +4,7 @@ use serde::{Deserialize, Serialize}; use sha2::{Digest, Sha256}; use crate::catalog::{Catalog, build_catalog}; -use crate::error::{CompilerError, Result}; +use crate::error::{NanoError, Result}; use crate::schema::ast::{Annotation, Cardinality, Constraint, PropDecl, SchemaDecl, SchemaFile}; use crate::types::PropType; @@ -119,7 +119,7 @@ pub fn build_schema_ir(schema: &SchemaFile) -> Result { pub fn build_catalog_from_ir(ir: &SchemaIR) -> Result { if ir.ir_version != SCHEMA_IR_VERSION { - return Err(CompilerError::Catalog(format!( + return Err(NanoError::Catalog(format!( "unsupported schema ir_version {} (expected {})", ir.ir_version, SCHEMA_IR_VERSION ))); @@ -167,12 +167,12 @@ pub fn build_catalog_from_ir(ir: &SchemaIR) -> Result { pub fn schema_ir_json(ir: &SchemaIR) -> Result { serde_json::to_string(ir) - .map_err(|err| CompilerError::Catalog(format!("serialize schema ir error: {}", err))) + .map_err(|err| NanoError::Catalog(format!("serialize schema ir error: {}", err))) } pub fn schema_ir_pretty_json(ir: &SchemaIR) -> Result { serde_json::to_string_pretty(ir) - .map_err(|err| CompilerError::Catalog(format!("serialize schema ir error: {}", err))) + .map_err(|err| NanoError::Catalog(format!("serialize schema ir error: {}", err))) } pub fn schema_ir_hash(ir: &SchemaIR) -> Result { @@ -228,7 +228,7 @@ fn 
canonical_properties( .map(|property| { let prop_id = stable_prop_id(&owner_key, &property.name); if let Some(previous) = seen_prop_ids.insert(prop_id, property.name.clone()) { - return Err(CompilerError::Catalog(format!( + return Err(NanoError::Catalog(format!( "property id collision on {}: '{}' and '{}' both hash to {}", owner_name, previous, property.name, prop_id ))); @@ -308,7 +308,7 @@ fn check_type_id_collision( name: &str, ) -> Result<()> { if let Some(previous) = seen_type_ids.insert(type_id, name.to_string()) { - return Err(CompilerError::Catalog(format!( + return Err(NanoError::Catalog(format!( "type id collision: '{}' and '{}' both hash to {}", previous, name, type_id ))); diff --git a/crates/omnigraph-compiler/src/catalog/schema_plan.rs b/crates/omnigraph-compiler/src/catalog/schema_plan.rs index dc9d466..a9e26b2 100644 --- a/crates/omnigraph-compiler/src/catalog/schema_plan.rs +++ b/crates/omnigraph-compiler/src/catalog/schema_plan.rs @@ -1137,7 +1137,6 @@ node Person @description("new") { annotations: vec![Annotation { name: "description".to_string(), value: Some("new".to_string()), - kwargs: Default::default(), }], })); } diff --git a/crates/omnigraph-compiler/src/catalog/tests.rs b/crates/omnigraph-compiler/src/catalog/tests.rs index 4ab3956..883b4a9 100644 --- a/crates/omnigraph-compiler/src/catalog/tests.rs +++ b/crates/omnigraph-compiler/src/catalog/tests.rs @@ -31,33 +31,6 @@ fn test_build_catalog() { assert!(catalog.node_types.contains_key("Company")); } -#[test] -fn test_embed_source_records_model_kwarg() { - let schema = parse_schema( - r#" -node Doc { -title: String -embedding: Vector(3) @embed("title", model="openai/text-embedding-3-large") -plain: Vector(3) @embed("title") -} -"#, - ) - .unwrap(); - let catalog = build_catalog(&schema).unwrap(); - let doc = catalog.node_types.get("Doc").unwrap(); - - let embedding = doc.embed_sources.get("embedding").unwrap(); - assert_eq!(embedding.source, "title"); - assert_eq!( - 
embedding.model.as_deref(), - Some("openai/text-embedding-3-large") - ); - - let plain = doc.embed_sources.get("plain").unwrap(); - assert_eq!(plain.source, "title"); - assert_eq!(plain.model, None); -} - #[test] fn test_edge_lookup() { let schema = parse_schema(test_schema()).unwrap(); diff --git a/crates/omnigraph-compiler/src/embedding.rs b/crates/omnigraph-compiler/src/embedding.rs new file mode 100644 index 0000000..6c9e6f3 --- /dev/null +++ b/crates/omnigraph-compiler/src/embedding.rs @@ -0,0 +1,379 @@ +#![allow(dead_code)] + +use std::time::Duration; + +use reqwest::Client; +use serde::Deserialize; +use tokio::time::sleep; + +use crate::error::{NanoError, Result}; + +const DEFAULT_EMBED_MODEL: &str = "text-embedding-3-small"; +const DEFAULT_OPENAI_BASE_URL: &str = "https://api.openai.com/v1"; +const DEFAULT_TIMEOUT_MS: u64 = 30_000; +const DEFAULT_RETRY_ATTEMPTS: usize = 4; +const DEFAULT_RETRY_BACKOFF_MS: u64 = 200; + +#[derive(Clone)] +enum EmbeddingTransport { + Mock, + OpenAi { + api_key: String, + base_url: String, + http: Client, + }, +} + +#[derive(Clone)] +pub(crate) struct EmbeddingClient { + model: String, + retry_attempts: usize, + retry_backoff_ms: u64, + transport: EmbeddingTransport, +} + +struct EmbedCallError { + message: String, + retryable: bool, +} + +#[derive(Debug, Deserialize)] +struct OpenAiEmbeddingResponse { + data: Vec<OpenAiEmbeddingDatum>, +} + +#[derive(Debug, Deserialize)] +struct OpenAiEmbeddingDatum { + index: usize, + embedding: Vec<f32>, +} + +#[derive(Debug, Deserialize)] +struct OpenAiErrorEnvelope { + error: OpenAiErrorBody, +} + +#[derive(Debug, Deserialize)] +struct OpenAiErrorBody { + message: String, +} + +impl EmbeddingClient { + pub(crate) fn from_env() -> Result<Self> { + let model = std::env::var("NANOGRAPH_EMBED_MODEL") + .ok() + .map(|v| v.trim().to_string()) + .filter(|v| !v.is_empty()) + .unwrap_or_else(|| DEFAULT_EMBED_MODEL.to_string()); + let retry_attempts = + parse_env_usize("NANOGRAPH_EMBED_RETRY_ATTEMPTS", DEFAULT_RETRY_ATTEMPTS); + 
let retry_backoff_ms = + parse_env_u64("NANOGRAPH_EMBED_RETRY_BACKOFF_MS", DEFAULT_RETRY_BACKOFF_MS); + + if env_flag("NANOGRAPH_EMBEDDINGS_MOCK") { + return Ok(Self { + model, + retry_attempts, + retry_backoff_ms, + transport: EmbeddingTransport::Mock, + }); + } + + let api_key = std::env::var("OPENAI_API_KEY") + .ok() + .map(|v| v.trim().to_string()) + .filter(|v| !v.is_empty()) + .ok_or_else(|| { + NanoError::Execution( + "OPENAI_API_KEY is required when an embedding call is needed".to_string(), + ) + })?; + let base_url = std::env::var("OPENAI_BASE_URL") + .ok() + .map(|v| v.trim_end_matches('/').to_string()) + .filter(|v| !v.is_empty()) + .unwrap_or_else(|| DEFAULT_OPENAI_BASE_URL.to_string()); + let timeout_ms = parse_env_u64("NANOGRAPH_EMBED_TIMEOUT_MS", DEFAULT_TIMEOUT_MS); + let http = Client::builder() + .timeout(Duration::from_millis(timeout_ms)) + .build() + .map_err(|e| { + NanoError::Execution(format!("failed to initialize HTTP client: {}", e)) + })?; + + Ok(Self { + model, + retry_attempts, + retry_backoff_ms, + transport: EmbeddingTransport::OpenAi { + api_key, + base_url, + http, + }, + }) + } + + #[cfg(test)] + pub(crate) fn mock_for_tests() -> Self { + Self { + model: DEFAULT_EMBED_MODEL.to_string(), + retry_attempts: DEFAULT_RETRY_ATTEMPTS, + retry_backoff_ms: DEFAULT_RETRY_BACKOFF_MS, + transport: EmbeddingTransport::Mock, + } + } + + pub(crate) fn model(&self) -> &str { + &self.model + } + + pub(crate) async fn embed_text(&self, input: &str, expected_dim: usize) -> Result<Vec<f32>> { + let mut vectors = self.embed_texts(&[input.to_string()], expected_dim).await?; + vectors.pop().ok_or_else(|| { + NanoError::Execution("embedding provider returned no vector".to_string()) + }) + } + + pub(crate) async fn embed_texts( + &self, + inputs: &[String], + expected_dim: usize, + ) -> Result<Vec<Vec<f32>>> { + if expected_dim == 0 { + return Err(NanoError::Execution( + "embedding dimension must be greater than zero".to_string(), + )); + } + if inputs.is_empty() { + return 
Ok(Vec::new()); + } + + match &self.transport { + EmbeddingTransport::Mock => Ok(inputs + .iter() + .map(|input| mock_embedding(input, expected_dim)) + .collect()), + EmbeddingTransport::OpenAi { .. } => { + self.embed_texts_openai_with_retry(inputs, expected_dim) + .await + } + } + } + + async fn embed_texts_openai_with_retry( + &self, + inputs: &[String], + expected_dim: usize, + ) -> Result<Vec<Vec<f32>>> { + let max_attempt = self.retry_attempts.max(1); + let mut attempt = 0usize; + loop { + attempt += 1; + match self.embed_texts_openai_once(inputs, expected_dim).await { + Ok(vectors) => return Ok(vectors), + Err(err) => { + if !err.retryable || attempt >= max_attempt { + return Err(NanoError::Execution(err.message)); + } + let shift = (attempt - 1).min(10) as u32; + let delay = self.retry_backoff_ms.saturating_mul(1u64 << shift); + sleep(Duration::from_millis(delay)).await; + } + } + } + } + + async fn embed_texts_openai_once( + &self, + inputs: &[String], + expected_dim: usize, + ) -> std::result::Result<Vec<Vec<f32>>, EmbedCallError> { + let (api_key, base_url, http) = match &self.transport { + EmbeddingTransport::OpenAi { + api_key, + base_url, + http, + } => (api_key, base_url, http), + EmbeddingTransport::Mock => unreachable!("mock transport should not call OpenAI"), + }; + + let request = serde_json::json!({ + "model": self.model, + "input": inputs, + "dimensions": expected_dim, + }); + let url = format!("{}/embeddings", base_url); + let response = http + .post(&url) + .bearer_auth(api_key) + .json(&request) + .send() + .await; + + let response = match response { + Ok(resp) => resp, + Err(err) => { + let retryable = err.is_timeout() || err.is_connect() || err.is_request(); + return Err(EmbedCallError { + message: format!("embedding request failed: {}", err), + retryable, + }); + } + }; + + let status = response.status(); + let body = match response.text().await { + Ok(body) => body, + Err(err) => { + return Err(EmbedCallError { + message: format!( + "embedding response read 
failed (status {}): {}", + status, err + ), + retryable: status.is_server_error() || status.as_u16() == 429, + }); + } + }; + + if !status.is_success() { + let message = parse_openai_error_message(&body).unwrap_or_else(|| body.clone()); + return Err(EmbedCallError { + message: format!( + "embedding request failed with status {}: {}", + status, message + ), + retryable: status.is_server_error() || status.as_u16() == 429, + }); + } + + let mut parsed: OpenAiEmbeddingResponse = + serde_json::from_str(&body).map_err(|err| EmbedCallError { + message: format!("embedding response decode failed: {}", err), + retryable: false, + })?; + + if parsed.data.len() != inputs.len() { + return Err(EmbedCallError { + message: format!( + "embedding response size mismatch: expected {}, got {}", + inputs.len(), + parsed.data.len() + ), + retryable: false, + }); + } + + parsed.data.sort_by_key(|item| item.index); + let mut vectors = Vec::with_capacity(parsed.data.len()); + for (idx, item) in parsed.data.into_iter().enumerate() { + if item.index != idx { + return Err(EmbedCallError { + message: format!( + "embedding response index mismatch at position {}: got {}", + idx, item.index + ), + retryable: false, + }); + } + if item.embedding.len() != expected_dim { + return Err(EmbedCallError { + message: format!( + "embedding dimension mismatch: expected {}, got {}", + expected_dim, + item.embedding.len() + ), + retryable: false, + }); + } + vectors.push(item.embedding); + } + Ok(vectors) + } +} + +fn parse_openai_error_message(body: &str) -> Option<String> { + serde_json::from_str::<OpenAiErrorEnvelope>(body) + .ok() + .map(|e| e.error.message) + .filter(|msg| !msg.trim().is_empty()) +} + +fn parse_env_usize(name: &str, default: usize) -> usize { + std::env::var(name) + .ok() + .and_then(|v| v.parse::<usize>().ok()) + .filter(|v| *v > 0) + .unwrap_or(default) +} + +fn parse_env_u64(name: &str, default: u64) -> u64 { + std::env::var(name) + .ok() + .and_then(|v| v.parse::<u64>().ok()) + .filter(|v| *v > 0) + .unwrap_or(default) +} + 
+fn env_flag(name: &str) -> bool { + std::env::var(name) + .ok() + .map(|v| { + let s = v.trim().to_ascii_lowercase(); + s == "1" || s == "true" || s == "yes" || s == "on" + }) + .unwrap_or(false) +} + +fn mock_embedding(input: &str, dim: usize) -> Vec<f32> { + let mut seed = fnv1a64(input.as_bytes()); + let mut out = Vec::with_capacity(dim); + for _ in 0..dim { + seed = xorshift64(seed); + let ratio = (seed as f64 / u64::MAX as f64) as f32; + out.push((ratio * 2.0) - 1.0); + } + + let norm = out + .iter() + .map(|v| (*v as f64) * (*v as f64)) + .sum::<f64>() + .sqrt() as f32; + if norm > f32::EPSILON { + for value in &mut out { + *value /= norm; + } + } + out +} + +fn fnv1a64(bytes: &[u8]) -> u64 { + let mut hash = 14695981039346656037u64; + for byte in bytes { + hash ^= *byte as u64; + hash = hash.wrapping_mul(1099511628211u64); + } + hash +} + +fn xorshift64(mut x: u64) -> u64 { + x ^= x << 13; + x ^= x >> 7; + x ^= x << 17; + x +} + +#[cfg(test)] +mod tests { + use super::*; + + #[tokio::test] + async fn mock_embeddings_are_deterministic() { + let client = EmbeddingClient::mock_for_tests(); + let a = client.embed_text("alpha", 8).await.unwrap(); + let b = client.embed_text("alpha", 8).await.unwrap(); + let c = client.embed_text("beta", 8).await.unwrap(); + assert_eq!(a, b); + assert_ne!(a, c); + assert_eq!(a.len(), 8); + } +} diff --git a/crates/omnigraph-compiler/src/error.rs b/crates/omnigraph-compiler/src/error.rs index 0c642c2..ea48759 100644 --- a/crates/omnigraph-compiler/src/error.rs +++ b/crates/omnigraph-compiler/src/error.rs @@ -55,7 +55,7 @@ pub fn decode_string_literal(raw: &str) -> Result<String> { let escaped = chars .next() - .ok_or_else(|| CompilerError::Parse("unterminated escape sequence".to_string()))?; + .ok_or_else(|| NanoError::Parse("unterminated escape sequence".to_string()))?; match escaped { '"' => decoded.push('"'), '\\' => decoded.push('\\'), @@ -63,7 +63,7 @@ pub fn decode_string_literal(raw: &str) -> Result<String> { 'r' => decoded.push('\r'), 't' => 
decoded.push('\t'), other => { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "unsupported escape sequence: \\{}", other ))); @@ -75,7 +75,7 @@ pub fn decode_string_literal(raw: &str) -> Result<String> { } #[derive(Debug, Error)] -pub enum CompilerError { +pub enum NanoError { #[error("parse error: {0}")] Parse(String), @@ -118,16 +118,11 @@ pub enum CompilerError { Manifest(String), } -#[deprecated(note = "use CompilerError")] -pub type NanoError = CompilerError; - -pub type Result<T> = std::result::Result<T, CompilerError>; +pub type Result<T> = std::result::Result<T, NanoError>; #[cfg(test)] mod tests { - use std::path::Path; - - use super::{CompilerError, SourceSpan, decode_string_literal, render_span}; + use super::{SourceSpan, decode_string_literal, render_span}; #[test] fn source_span_preserves_zero_width() { @@ -148,77 +143,4 @@ mod tests { let decoded = decode_string_literal("\"a\\n\\r\\t\\\\\\\"b\"").unwrap(); assert_eq!(decoded, "a\n\r\t\\\"b"); } - - #[test] - fn compiler_error_parse_display_is_stable() { - let err = CompilerError::Parse("bad token".to_string()); - assert_eq!(err.to_string(), "parse error: bad token"); - } - - #[allow(deprecated)] - #[test] - fn legacy_nano_error_alias_constructs_variants() { - let err = super::NanoError::Parse("bad token".to_string()); - assert_eq!(err.to_string(), "parse error: bad token"); - } - - #[test] - fn legacy_name_is_confined_to_alias_and_compatibility_test() { - let legacy_name = ["Nano", "Error"].concat(); - let workspace_root = Path::new(env!("CARGO_MANIFEST_DIR")) - .parent() - .and_then(Path::parent) - .expect("compiler crate should live under crates/"); - let allowed_file = workspace_root.join("crates/omnigraph-compiler/src/error.rs"); - let mut offenders = Vec::new(); - - visit_rs_files(workspace_root, &mut |path| { - let text = std::fs::read_to_string(path).expect("source file should be readable"); - let count = text.matches(&legacy_name).count(); - if path == allowed_file { - if count != 2 { - 
offenders.push(format!( - "{} contains {count} legacy-name occurrences; expected exactly 2", - display_path(workspace_root, path) - )); - } - } else if count > 0 { - offenders.push(format!( - "{} contains {count} legacy-name occurrence(s)", - display_path(workspace_root, path) - )); - } - }); - - assert!( - offenders.is_empty(), - "legacy compiler error name should stay compatibility-only:\n{}", - offenders.join("\n") - ); - } - - fn visit_rs_files(dir: &Path, visit: &mut impl FnMut(&Path)) { - for entry in std::fs::read_dir(dir).expect("source directory should be readable") { - let entry = entry.expect("source entry should be readable"); - let path = entry.path(); - if path.is_dir() { - if matches!( - path.file_name().and_then(|name| name.to_str()), - Some(".git" | "target") - ) { - continue; - } - visit_rs_files(&path, visit); - } else if path.extension().and_then(|ext| ext.to_str()) == Some("rs") { - visit(&path); - } - } - } - - fn display_path(root: &Path, path: &Path) -> String { - path.strip_prefix(root) - .unwrap_or(path) - .to_string_lossy() - .into_owned() - } } diff --git a/crates/omnigraph-compiler/src/ir/lower.rs b/crates/omnigraph-compiler/src/ir/lower.rs index 9427e27..6999d69 100644 --- a/crates/omnigraph-compiler/src/ir/lower.rs +++ b/crates/omnigraph-compiler/src/ir/lower.rs @@ -14,7 +14,7 @@ pub fn lower_query( type_ctx: &TypeContext, ) -> Result { if !query.mutations.is_empty() { - return Err(crate::error::CompilerError::Plan( + return Err(crate::error::NanoError::Plan( "cannot lower mutation query with read-query lowerer".to_string(), )); } @@ -62,7 +62,7 @@ pub fn lower_query( pub fn lower_mutation_query(query: &QueryDecl) -> Result { if query.mutations.is_empty() { - return Err(crate::error::CompilerError::Plan( + return Err(crate::error::NanoError::Plan( "query does not contain a mutation body".to_string(), )); } @@ -261,7 +261,7 @@ fn lower_clauses( let edge = catalog .lookup_edge_by_name(&traversal.edge_name) .ok_or_else(|| { - 
crate::error::CompilerError::Plan(format!( + crate::error::NanoError::Plan(format!( "lowering traversal referenced missing edge '{}' after typecheck", traversal.edge_name )) diff --git a/crates/omnigraph-compiler/src/lib.rs b/crates/omnigraph-compiler/src/lib.rs index 4f85c08..ba1aba2 100644 --- a/crates/omnigraph-compiler/src/lib.rs +++ b/crates/omnigraph-compiler/src/lib.rs @@ -1,4 +1,5 @@ pub mod catalog; +pub mod embedding; pub mod error; pub mod ir; pub mod json_output; diff --git a/crates/omnigraph-compiler/src/query/parser.rs b/crates/omnigraph-compiler/src/query/parser.rs index 3284876..4ba8476 100644 --- a/crates/omnigraph-compiler/src/query/parser.rs +++ b/crates/omnigraph-compiler/src/query/parser.rs @@ -3,7 +3,7 @@ use pest::error::InputLocation; use pest_derive::Parser; use crate::error::{ - CompilerError, ParseDiagnostic, Result, SourceSpan, decode_string_literal, render_span, + NanoError, ParseDiagnostic, Result, SourceSpan, decode_string_literal, render_span, }; use super::ast::*; @@ -13,7 +13,7 @@ use super::ast::*; struct QueryParser; pub fn parse_query(input: &str) -> Result { - parse_query_diagnostic(input).map_err(|e| CompilerError::Parse(e.to_string())) + parse_query_diagnostic(input).map_err(|e| NanoError::Parse(e.to_string())) } pub fn parse_query_diagnostic(input: &str) -> std::result::Result { @@ -24,7 +24,7 @@ pub fn parse_query_diagnostic(input: &str) -> std::result::Result) -> ParseDiagnostic { ParseDiagnostic::new(err.to_string(), span) } -fn compiler_error_to_diagnostic(err: CompilerError) -> ParseDiagnostic { +fn nano_error_to_diagnostic(err: NanoError) -> ParseDiagnostic { ParseDiagnostic::new(err.to_string(), None) } @@ -71,7 +71,7 @@ fn parse_query_decl(pair: pest::iterators::Pair) -> Result { match annotation_name { "description" => { if description.replace(value).is_some() { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "query `{}` cannot include duplicate @description annotations", name ))); 
@@ -79,14 +79,14 @@ fn parse_query_decl(pair: pest::iterators::Pair) -> Result { } "instruction" => { if instruction.replace(value).is_some() { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "query `{}` cannot include duplicate @instruction annotations", name ))); } } other => { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "unsupported query annotation: @{}", other ))); @@ -94,9 +94,10 @@ fn parse_query_decl(pair: pest::iterators::Pair) -> Result { } } Rule::query_body => { - let body = item.into_inner().next().ok_or_else(|| { - CompilerError::Parse("query body cannot be empty".to_string()) - })?; + let body = item + .into_inner() + .next() + .ok_or_else(|| NanoError::Parse("query body cannot be empty".to_string()))?; match body.as_rule() { Rule::read_query_body => { for section in body.into_inner() { @@ -126,7 +127,7 @@ fn parse_query_decl(pair: pest::iterators::Pair) -> Result { let int_pair = section.into_inner().next().unwrap(); limit = Some(int_pair.as_str().parse::().map_err(|e| { - CompilerError::Parse(format!("invalid limit: {}", e)) + NanoError::Parse(format!("invalid limit: {}", e)) })?); } _ => {} @@ -137,7 +138,7 @@ fn parse_query_decl(pair: pest::iterators::Pair) -> Result { for mutation_pair in body.into_inner() { if let Rule::mutation_stmt = mutation_pair.as_rule() { let stmt = mutation_pair.into_inner().next().ok_or_else(|| { - CompilerError::Parse( + NanoError::Parse( "mutation statement cannot be empty".to_string(), ) })?; @@ -169,14 +170,14 @@ fn parse_query_annotation(pair: pest::iterators::Pair) -> Result<(&'static let inner = pair .into_inner() .next() - .ok_or_else(|| CompilerError::Parse("query annotation cannot be empty".to_string()))?; + .ok_or_else(|| NanoError::Parse("query annotation cannot be empty".to_string()))?; match inner.as_rule() { Rule::description_annotation => { let value = inner .into_inner() .next() .ok_or_else(|| { - CompilerError::Parse("@description 
requires a string literal".to_string()) + NanoError::Parse("@description requires a string literal".to_string()) }) .map(|value| parse_string_lit(value.as_str()))??; Ok(("description", value)) @@ -186,12 +187,12 @@ fn parse_query_annotation(pair: pest::iterators::Pair) -> Result<(&'static .into_inner() .next() .ok_or_else(|| { - CompilerError::Parse("@instruction requires a string literal".to_string()) + NanoError::Parse("@instruction requires a string literal".to_string()) }) .map(|value| parse_string_lit(value.as_str()))??; Ok(("instruction", value)) } - other => Err(CompilerError::Parse(format!( + other => Err(NanoError::Parse(format!( "unexpected query annotation rule: {:?}", other ))), @@ -207,29 +208,30 @@ fn parse_param(pair: pest::iterators::Pair) -> Result { let mut type_inner = type_ref.into_inner(); let core = type_inner .next() - .ok_or_else(|| CompilerError::Parse("parameter type is missing".to_string()))?; - let base = - match core.as_rule() { - Rule::base_type => core.as_str().to_string(), - Rule::list_type => { - let inner = core.into_inner().next().ok_or_else(|| { - CompilerError::Parse("list type missing item type".to_string()) - })?; - format!("[{}]", inner.as_str().trim()) - } - Rule::vector_type => { - let vector = core.into_inner().next().ok_or_else(|| { - CompilerError::Parse("Vector type missing dimension".to_string()) - })?; - format!("Vector({})", vector.as_str().trim()) - } - other => { - return Err(CompilerError::Parse(format!( - "unexpected param type rule: {:?}", - other - ))); - } - }; + .ok_or_else(|| NanoError::Parse("parameter type is missing".to_string()))?; + let base = match core.as_rule() { + Rule::base_type => core.as_str().to_string(), + Rule::list_type => { + let inner = core + .into_inner() + .next() + .ok_or_else(|| NanoError::Parse("list type missing item type".to_string()))?; + format!("[{}]", inner.as_str().trim()) + } + Rule::vector_type => { + let vector = core + .into_inner() + .next() + .ok_or_else(|| 
NanoError::Parse("Vector type missing dimension".to_string()))?; + format!("Vector({})", vector.as_str().trim()) + } + other => { + return Err(NanoError::Parse(format!( + "unexpected param type rule: {:?}", + other + ))); + } + }; Ok(Param { name, @@ -254,7 +256,7 @@ fn parse_clause(pair: pest::iterators::Pair) -> Result { } Ok(Clause::Negation(clauses)) } - _ => Err(CompilerError::Parse(format!( + _ => Err(NanoError::Parse(format!( "unexpected clause rule: {:?}", inner.as_rule() ))), @@ -265,13 +267,13 @@ fn parse_text_search_clause(pair: pest::iterators::Pair) -> Result let inner = pair .into_inner() .next() - .ok_or_else(|| CompilerError::Parse("text search clause cannot be empty".to_string()))?; + .ok_or_else(|| NanoError::Parse("text search clause cannot be empty".to_string()))?; let expr = match inner.as_rule() { Rule::search_call => parse_search_call(inner)?, Rule::fuzzy_call => parse_fuzzy_call(inner)?, Rule::match_text_call => parse_match_text_call(inner)?, other => { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "unexpected text search clause rule: {:?}", other ))); @@ -323,7 +325,7 @@ fn parse_mutation_stmt(pair: pest::iterators::Pair) -> Result { Rule::insert_stmt => parse_insert_mutation(pair).map(Mutation::Insert), Rule::update_stmt => parse_update_mutation(pair).map(Mutation::Update), Rule::delete_stmt => parse_delete_mutation(pair).map(Mutation::Delete), - other => Err(CompilerError::Parse(format!( + other => Err(NanoError::Parse(format!( "unexpected mutation statement rule: {:?}", other ))), @@ -361,7 +363,7 @@ fn parse_update_mutation(pair: pest::iterators::Pair) -> Result) -> Result) -> Result { } Rule::now_call => Ok(MatchValue::Now), Rule::literal => Ok(MatchValue::Literal(parse_literal(value_inner)?)), - _ => Err(CompilerError::Parse(format!( + _ => Err(NanoError::Parse(format!( "unexpected match value: {:?}", value_inner.as_rule() ))), @@ -436,9 +436,9 @@ fn parse_traversal(pair: pest::iterators::Pair) -> 
Result { let (min, max) = parse_traversal_bounds(next)?; min_hops = min; max_hops = max; - inner.next().ok_or_else(|| { - CompilerError::Parse("traversal missing destination variable".to_string()) - })? + inner + .next() + .ok_or_else(|| NanoError::Parse("traversal missing destination variable".to_string()))? } else { next }; @@ -459,16 +459,16 @@ fn parse_traversal_bounds(pair: pest::iterators::Pair) -> Result<(u32, Opt let mut inner = pair.into_inner(); let min = inner .next() - .ok_or_else(|| CompilerError::Parse("traversal bound missing min hop".to_string()))? + .ok_or_else(|| NanoError::Parse("traversal bound missing min hop".to_string()))? .as_str() .parse::() - .map_err(|e| CompilerError::Parse(format!("invalid traversal min bound: {}", e)))?; + .map_err(|e| NanoError::Parse(format!("invalid traversal min bound: {}", e)))?; let max = inner .next() .map(|p| { p.as_str() .parse::() - .map_err(|e| CompilerError::Parse(format!("invalid traversal max bound: {}", e))) + .map_err(|e| NanoError::Parse(format!("invalid traversal max bound: {}", e))) }) .transpose()?; Ok((min, max)) @@ -507,12 +507,7 @@ fn parse_expr(pair: pest::iterators::Pair) -> Result { "avg" => AggFunc::Avg, "min" => AggFunc::Min, "max" => AggFunc::Max, - other => { - return Err(CompilerError::Parse(format!( - "unknown aggregate: {}", - other - ))); - } + other => return Err(NanoError::Parse(format!("unknown aggregate: {}", other))), }; let arg = parse_expr(parts.next().unwrap())?; Ok(Expr::Aggregate { @@ -527,7 +522,7 @@ fn parse_expr(pair: pest::iterators::Pair) -> Result { Rule::bm25_call => parse_bm25_call(inner), Rule::rrf_call => parse_rrf_call(inner), Rule::ident => Ok(Expr::AliasRef(inner.as_str().to_string())), - _ => Err(CompilerError::Parse(format!( + _ => Err(NanoError::Parse(format!( "unexpected expr rule: {:?}", inner.as_rule() ))), @@ -538,12 +533,12 @@ fn parse_search_call(pair: pest::iterators::Pair) -> Result { let mut args = pair.into_inner(); let field = args .next() - 
.ok_or_else(|| CompilerError::Parse("search() missing field argument".to_string()))?; + .ok_or_else(|| NanoError::Parse("search() missing field argument".to_string()))?; let query = args .next() - .ok_or_else(|| CompilerError::Parse("search() missing query argument".to_string()))?; + .ok_or_else(|| NanoError::Parse("search() missing query argument".to_string()))?; if args.next().is_some() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "search() accepts exactly 2 arguments".to_string(), )); } @@ -557,13 +552,13 @@ fn parse_fuzzy_call(pair: pest::iterators::Pair) -> Result { let mut args = pair.into_inner(); let field = args .next() - .ok_or_else(|| CompilerError::Parse("fuzzy() missing field argument".to_string()))?; + .ok_or_else(|| NanoError::Parse("fuzzy() missing field argument".to_string()))?; let query = args .next() - .ok_or_else(|| CompilerError::Parse("fuzzy() missing query argument".to_string()))?; + .ok_or_else(|| NanoError::Parse("fuzzy() missing query argument".to_string()))?; let max_edits = args.next().map(parse_expr).transpose()?.map(Box::new); if args.next().is_some() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "fuzzy() accepts at most 3 arguments".to_string(), )); } @@ -578,12 +573,12 @@ fn parse_match_text_call(pair: pest::iterators::Pair) -> Result { let mut args = pair.into_inner(); let field = args .next() - .ok_or_else(|| CompilerError::Parse("match_text() missing field argument".to_string()))?; + .ok_or_else(|| NanoError::Parse("match_text() missing field argument".to_string()))?; let query = args .next() - .ok_or_else(|| CompilerError::Parse("match_text() missing query argument".to_string()))?; + .ok_or_else(|| NanoError::Parse("match_text() missing query argument".to_string()))?; if args.next().is_some() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "match_text() accepts exactly 2 arguments".to_string(), )); } @@ -597,12 +592,12 @@ fn parse_bm25_call(pair: 
pest::iterators::Pair) -> Result { let mut args = pair.into_inner(); let field = args .next() - .ok_or_else(|| CompilerError::Parse("bm25() missing field argument".to_string()))?; + .ok_or_else(|| NanoError::Parse("bm25() missing field argument".to_string()))?; let query = args .next() - .ok_or_else(|| CompilerError::Parse("bm25() missing query argument".to_string()))?; + .ok_or_else(|| NanoError::Parse("bm25() missing query argument".to_string()))?; if args.next().is_some() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "bm25() accepts exactly 2 arguments".to_string(), )); } @@ -616,14 +611,14 @@ fn parse_rank_expr(pair: pest::iterators::Pair) -> Result { let inner = if pair.as_rule() == Rule::rank_expr { pair.into_inner() .next() - .ok_or_else(|| CompilerError::Parse("rank expression cannot be empty".to_string()))? + .ok_or_else(|| NanoError::Parse("rank expression cannot be empty".to_string()))? } else { pair }; match inner.as_rule() { Rule::nearest_ordering => parse_nearest_ordering(inner), Rule::bm25_call => parse_bm25_call(inner), - other => Err(CompilerError::Parse(format!( + other => Err(NanoError::Parse(format!( "rrf() rank expression must be nearest(...) 
or bm25(...), got {:?}", other ))), @@ -634,13 +629,13 @@ fn parse_rrf_call(pair: pest::iterators::Pair) -> Result { let mut args = pair.into_inner(); let primary = args .next() - .ok_or_else(|| CompilerError::Parse("rrf() missing primary rank expression".to_string()))?; - let secondary = args.next().ok_or_else(|| { - CompilerError::Parse("rrf() missing secondary rank expression".to_string()) - })?; + .ok_or_else(|| NanoError::Parse("rrf() missing primary rank expression".to_string()))?; + let secondary = args + .next() + .ok_or_else(|| NanoError::Parse("rrf() missing secondary rank expression".to_string()))?; let k = args.next().map(parse_expr).transpose()?.map(Box::new); if args.next().is_some() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "rrf() accepts at most 3 arguments".to_string(), )); } @@ -659,7 +654,7 @@ fn parse_comp_op(pair: pest::iterators::Pair) -> Result { "<" => Ok(CompOp::Lt), ">=" => Ok(CompOp::Ge), "<=" => Ok(CompOp::Le), - other => Err(CompilerError::Parse(format!("unknown operator: {}", other))), + other => Err(NanoError::Parse(format!("unknown operator: {}", other))), } } @@ -678,14 +673,14 @@ fn parse_literal(pair: pest::iterators::Pair) -> Result { let n: i64 = inner .as_str() .parse() - .map_err(|e| CompilerError::Parse(format!("invalid integer: {}", e)))?; + .map_err(|e| NanoError::Parse(format!("invalid integer: {}", e)))?; Ok(Literal::Integer(n)) } Rule::float_lit => { let f: f64 = inner .as_str() .parse() - .map_err(|e| CompilerError::Parse(format!("invalid float: {}", e)))?; + .map_err(|e| NanoError::Parse(format!("invalid float: {}", e)))?; Ok(Literal::Float(f)) } Rule::bool_lit => { @@ -693,7 +688,7 @@ fn parse_literal(pair: pest::iterators::Pair) -> Result { "true" => true, "false" => false, other => { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "invalid boolean literal: {}", other ))); @@ -706,9 +701,7 @@ fn parse_literal(pair: pest::iterators::Pair) -> Result { 
.into_inner() .next() .map(|s| parse_string_lit(s.as_str())) - .ok_or_else(|| { - CompilerError::Parse("date literal requires a string".to_string()) - })?; + .ok_or_else(|| NanoError::Parse("date literal requires a string".to_string()))?; Ok(Literal::Date(date_str?)) } Rule::datetime_lit => { @@ -717,7 +710,7 @@ fn parse_literal(pair: pest::iterators::Pair) -> Result { .next() .map(|s| parse_string_lit(s.as_str())) .ok_or_else(|| { - CompilerError::Parse("datetime literal requires a string".to_string()) + NanoError::Parse("datetime literal requires a string".to_string()) })?; Ok(Literal::DateTime(dt_str?)) } @@ -730,7 +723,7 @@ fn parse_literal(pair: pest::iterators::Pair) -> Result { } Ok(Literal::List(items)) } - _ => Err(CompilerError::Parse(format!( + _ => Err(NanoError::Parse(format!( "unexpected literal: {:?}", inner.as_rule() ))), @@ -753,14 +746,14 @@ fn parse_ordering(pair: pest::iterators::Pair) -> Result { let mut inner = pair.into_inner(); let first = inner .next() - .ok_or_else(|| CompilerError::Parse("ordering cannot be empty".to_string()))?; + .ok_or_else(|| NanoError::Parse("ordering cannot be empty".to_string()))?; let (expr, descending) = match first.as_rule() { Rule::nearest_ordering => (parse_nearest_ordering(first)?, false), Rule::expr => { let expr = parse_expr(first)?; let direction = inner.next().map(|p| p.as_str().to_string()); if matches!(expr, Expr::Nearest { .. 
}) && direction.is_some() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "nearest() ordering does not accept asc/desc modifiers".to_string(), )); } @@ -768,7 +761,7 @@ fn parse_ordering(pair: pest::iterators::Pair) -> Result { (expr, descending) } other => { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "unexpected ordering rule: {:?}", other ))); @@ -782,22 +775,22 @@ fn parse_nearest_ordering(pair: pest::iterators::Pair) -> Result { let mut inner = pair.into_inner(); let prop = inner .next() - .ok_or_else(|| CompilerError::Parse("nearest() missing property".to_string()))?; + .ok_or_else(|| NanoError::Parse("nearest() missing property".to_string()))?; let mut prop_parts = prop.into_inner(); let var = prop_parts .next() - .ok_or_else(|| CompilerError::Parse("nearest() missing variable".to_string()))? + .ok_or_else(|| NanoError::Parse("nearest() missing variable".to_string()))? .as_str(); let variable = var.strip_prefix('$').unwrap_or(var).to_string(); let property = prop_parts .next() - .ok_or_else(|| CompilerError::Parse("nearest() missing property name".to_string()))? + .ok_or_else(|| NanoError::Parse("nearest() missing property name".to_string()))? 
.as_str() .to_string(); let query = inner .next() - .ok_or_else(|| CompilerError::Parse("nearest() missing query expression".to_string()))?; + .ok_or_else(|| NanoError::Parse("nearest() missing query expression".to_string()))?; Ok(Expr::Nearest { variable, property, diff --git a/crates/omnigraph-compiler/src/query/typecheck.rs b/crates/omnigraph-compiler/src/query/typecheck.rs index 2ac1604..658f083 100644 --- a/crates/omnigraph-compiler/src/query/typecheck.rs +++ b/crates/omnigraph-compiler/src/query/typecheck.rs @@ -4,7 +4,7 @@ use std::sync::Arc; use arrow_schema::{DataType, Field, Schema, SchemaRef}; use crate::catalog::Catalog; -use crate::error::{CompilerError, Result}; +use crate::error::{NanoError, Result}; use crate::types::{Direction, PropType, ScalarType}; use super::ast::*; @@ -82,7 +82,7 @@ pub fn typecheck_query_decl(catalog: &Catalog, query: &QueryDecl) -> Result Result { if !query.mutations.is_empty() { - return Err(CompilerError::Type( + return Err(NanoError::Type( "mutation query cannot be typechecked with read-query API".to_string(), )); } @@ -115,14 +115,14 @@ fn parse_declared_param_types(params: &[Param]) -> Result Result Result Result {} _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T9: non-aggregate expressions in an aggregate query must be \ property accesses or variables" .to_string(), @@ -221,7 +221,7 @@ fn typecheck_mutation(catalog: &Catalog, mutation: &Mutation, params: &[Param]) match mutation { Mutation::Insert(insert) => { if insert.assignments.is_empty() { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T10: insert mutation requires at least one assignment".to_string(), )); } @@ -235,7 +235,7 @@ fn typecheck_mutation(catalog: &Catalog, mutation: &Mutation, params: &[Param]) .properties .get(&assignment.property) .ok_or_else(|| { - CompilerError::Type(format!( + NanoError::Type(format!( "T11: type `{}` has no property `{}`", insert.type_name, assignment.property )) @@ -261,17 +261,17 @@ fn 
typecheck_mutation(catalog: &Catalog, mutation: &Mutation, params: &[Param]) continue; } - if let Some(embed) = node_type.embed_sources.get(prop_name) { - if assigned_props.contains(embed.source.as_str()) { + if let Some(source_prop) = node_type.embed_sources.get(prop_name) { + if assigned_props.contains(source_prop.as_str()) { continue; } - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T12: insert for `{}` must provide non-nullable property `{}` or @embed source `{}`", - insert.type_name, prop_name, embed.source + insert.type_name, prop_name, source_prop ))); } - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T12: insert for `{}` must provide non-nullable property `{}`", insert.type_name, prop_name ))); @@ -308,7 +308,7 @@ fn typecheck_mutation(catalog: &Catalog, mutation: &Mutation, params: &[Param]) .properties .get(&assignment.property) .ok_or_else(|| { - CompilerError::Type(format!( + NanoError::Type(format!( "T11: type `{}` has no property `{}`", insert.type_name, assignment.property )) @@ -324,13 +324,13 @@ fn typecheck_mutation(catalog: &Catalog, mutation: &Mutation, params: &[Param]) } if !has_from { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T12: insert for `{}` must provide required endpoint `from`", insert.type_name ))); } if !has_to { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T12: insert for `{}` must provide required endpoint `to`", insert.type_name ))); @@ -341,7 +341,7 @@ fn typecheck_mutation(catalog: &Catalog, mutation: &Mutation, params: &[Param]) continue; } if !insert.assignments.iter().any(|a| &a.property == prop_name) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T12: insert for `{}` must provide non-nullable property `{}`", insert.type_name, prop_name ))); @@ -350,7 +350,7 @@ fn typecheck_mutation(catalog: &Catalog, mutation: &Mutation, params: &[Param]) return 
Ok(insert.type_name.clone()); } - Err(CompilerError::Type(format!( + Err(NanoError::Type(format!( "T10: unknown node/edge type `{}`", insert.type_name ))) @@ -359,19 +359,19 @@ fn typecheck_mutation(catalog: &Catalog, mutation: &Mutation, params: &[Param]) let node_type = if let Some(node_type) = catalog.node_types.get(&update.type_name) { node_type } else if catalog.edge_types.contains_key(&update.type_name) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T16: update mutation for edge type `{}` is not supported", update.type_name ))); } else { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T10: unknown node/edge type `{}`", update.type_name ))); }; if update.assignments.is_empty() { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T10: update mutation requires at least one assignment".to_string(), )); } @@ -383,7 +383,7 @@ fn typecheck_mutation(catalog: &Catalog, mutation: &Mutation, params: &[Param]) .properties .get(&assignment.property) .ok_or_else(|| { - CompilerError::Type(format!( + NanoError::Type(format!( "T11: type `{}` has no property `{}`", update.type_name, assignment.property )) @@ -422,7 +422,7 @@ fn typecheck_mutation(catalog: &Catalog, mutation: &Mutation, params: &[Param]) )?; Ok(delete.type_name.clone()) } else { - Err(CompilerError::Type(format!( + Err(NanoError::Type(format!( "T10: unknown node/edge type `{}`", delete.type_name ))) @@ -435,7 +435,7 @@ fn ensure_no_duplicate_assignment_names(assignments: &[MutationAssignment]) -> R let mut seen = std::collections::HashSet::new(); for assignment in assignments { if !seen.insert(&assignment.property) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T13: duplicate assignment for property `{}`", assignment.property ))); @@ -454,13 +454,13 @@ fn typecheck_mutation_predicate( .properties .get(&predicate.property) .ok_or_else(|| { - CompilerError::Type(format!( + 
NanoError::Type(format!( "T11: type `{}` has no property `{}`", type_name, predicate.property )) })?; if matches!(prop_type.scalar, ScalarType::Blob) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T11: blob property `{}` cannot be used in WHERE predicates", predicate.property ))); @@ -493,7 +493,7 @@ fn typecheck_edge_mutation_predicate( .properties .get(&predicate.property) .ok_or_else(|| { - CompilerError::Type(format!( + NanoError::Type(format!( "T11: type `{}` has no property `{}`", type_name, predicate.property )) @@ -517,7 +517,7 @@ fn check_match_value_type( MatchValue::Literal(lit) => check_literal_type(lit, expected, property), MatchValue::Variable(v) => { let Some(actual) = params.get(v) else { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T14: mutation variable `${}` must be a declared query parameter", v ))); @@ -528,7 +528,7 @@ fn check_match_value_type( && matches!(actual.scalar, ScalarType::String) && !actual.list); if !compatible { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T7: cannot assign/compare {} with {} for property `{}`", actual.display_name(), expected.display_name(), @@ -543,7 +543,7 @@ fn check_match_value_type( fn check_now_match_value_type(expected: &PropType, property: &str) -> Result<()> { if expected.list || expected.scalar != ScalarType::DateTime { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T7: cannot assign/compare DateTime with {} for property `{}`", expected.display_name(), property @@ -597,7 +597,7 @@ fn typecheck_clauses( } } if !has_outer { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T9: negation block must reference at least one outer-bound variable" .to_string(), )); @@ -616,7 +616,7 @@ fn typecheck_binding( ) -> Result<()> { // T1: binding type must exist in catalog if !catalog.node_types.contains_key(&binding.type_name) { - return 
Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T1: unknown node type `{}`", binding.type_name ))); @@ -627,14 +627,14 @@ fn typecheck_binding( // T2 + T3: property match fields must exist and have correct types for pm in &binding.prop_matches { let prop = node_type.properties.get(&pm.prop_name).ok_or_else(|| { - CompilerError::Type(format!( + NanoError::Type(format!( "T2: type `{}` has no property `{}`", binding.type_name, pm.prop_name )) })?; if matches!(prop.scalar, ScalarType::Blob) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T3: blob property `{}.{}` cannot be used in match patterns", binding.type_name, pm.prop_name ))); @@ -658,7 +658,7 @@ fn typecheck_binding( if let Some(existing) = ctx.bindings.get(&binding.variable) && existing.type_name != binding.type_name { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "variable `${}` already bound to type `{}`, cannot rebind to `{}`", binding.variable, existing.type_name, binding.type_name ))); @@ -680,7 +680,7 @@ fn check_binding_literal_type(lit: &Literal, expected: &PropType, property: &str if expected.list { let lit_type = literal_type(lit)?; if lit_type.list { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T3: list equality is not supported for property `{}`; use a scalar value to match list membership", property ))); @@ -688,7 +688,7 @@ fn check_binding_literal_type(lit: &Literal, expected: &PropType, property: &str let expected_member = PropType::scalar(expected.scalar, expected.nullable); if !types_compatible(&lit_type, &expected_member) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T3: property `{}` has type {} but membership match got {}", property, expected.display_name(), @@ -708,7 +708,7 @@ fn check_binding_variable_type( ) -> Result<()> { if expected.list { if actual.list { - return Err(CompilerError::Type(format!( + return 
Err(NanoError::Type(format!( "T7: list equality is not supported for property `{}`; use a scalar parameter for membership matching", property ))); @@ -716,7 +716,7 @@ fn check_binding_variable_type( let expected_member = PropType::scalar(expected.scalar, expected.nullable); if !types_compatible(actual, &expected_member) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T7: cannot compare {} membership against {} for property `{}`", actual.display_name(), expected.display_name(), @@ -727,7 +727,7 @@ fn check_binding_variable_type( } if !types_compatible(actual, expected) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T7: cannot assign/compare {} with {} for property `{}`", actual.display_name(), expected.display_name(), @@ -746,23 +746,23 @@ fn typecheck_traversal( let edge = catalog .lookup_edge_by_name(&traversal.edge_name) .ok_or_else(|| { - CompilerError::Type(format!("T4: unknown edge type `{}`", traversal.edge_name)) + NanoError::Type(format!("T4: unknown edge type `{}`", traversal.edge_name)) })?; if traversal.min_hops == 0 { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T15: traversal min hop bound must be >= 1".to_string(), )); } if let Some(max_hops) = traversal.max_hops { if max_hops < traversal.min_hops { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T15: invalid traversal bounds {{{},{}}}; max must be >= min", traversal.min_hops, max_hops ))); } } else { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T15: unbounded traversal is disabled; use bounded traversal {min,max}".to_string(), )); } @@ -784,7 +784,7 @@ fn typecheck_traversal( // dst should be edge.from_type bind_traversal_endpoint(ctx, &traversal.dst, &edge.from_type, edge)?; } else { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T5: variable `${}` has type `{}`, which is not an endpoint of edge `{}: {} -> {}`", 
traversal.src, src_bv.type_name, edge.name, edge.from_type, edge.to_type ))); @@ -798,7 +798,7 @@ fn typecheck_traversal( direction = Direction::In; bind_traversal_endpoint(ctx, &traversal.src, &edge.to_type, edge)?; } else { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T5: variable `${}` has type `{}`, which is not an endpoint of edge `{}: {} -> {}`", traversal.dst, dst_bv.type_name, edge.name, edge.from_type, edge.to_type ))); @@ -833,7 +833,7 @@ fn bind_traversal_endpoint( } if let Some(existing) = ctx.bindings.get(var) { if existing.type_name != expected_type { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T5: variable `${}` has type `{}` but edge `{}` expects `{}`", var, existing.type_name, edge.name, expected_type ))); @@ -863,27 +863,27 @@ fn typecheck_filter( if let (ResolvedType::Scalar(l), ResolvedType::Scalar(r)) = (&left_type, &right_type) { if filter.op == CompOp::Contains { if !l.list { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T7: contains requires a list property on the left, got {}", l.display_name() ))); } if r.list { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T7: contains requires a scalar right operand".to_string(), )); } if matches!(l.scalar, ScalarType::Vector(_)) || matches!(r.scalar, ScalarType::Vector(_)) { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T7: vector membership filters are not supported".to_string(), )); } let expected_member = PropType::scalar(l.scalar, l.nullable); if !types_compatible(&expected_member, r) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T7: cannot test membership of {} in {}", r.display_name(), l.display_name() @@ -894,29 +894,29 @@ fn typecheck_filter( // T7: check type compatibility if l.list || r.list { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T7: list comparisons in filters are not supported; use 
`contains` for list membership".to_string(), )); } if matches!(l.scalar, ScalarType::Vector(_)) || matches!(r.scalar, ScalarType::Vector(_)) { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T7: vector comparisons in filters are not supported".to_string(), )); } if matches!(l.scalar, ScalarType::Blob) || matches!(r.scalar, ScalarType::Blob) { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T7: blob comparisons in filters are not supported".to_string(), )); } if !types_compatible(l, r) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T7: cannot compare {} with {}", l.display_name(), r.display_name() ))); } } else { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T7: filter comparisons require scalar operands, got {} and {}", left_type.display_name(), right_type.display_name() @@ -940,15 +940,15 @@ fn resolve_expr_type( Expr::PropAccess { variable, property } => { // T6: variable must be bound and property must exist let bv = ctx.bindings.get(variable).ok_or_else(|| { - CompilerError::Type(format!("T6: variable `${}` is not bound", variable)) + NanoError::Type(format!("T6: variable `${}` is not bound", variable)) })?; let node_type = catalog.node_types.get(&bv.type_name).ok_or_else(|| { - CompilerError::Type(format!("T6: type `{}` not found in catalog", bv.type_name)) + NanoError::Type(format!("T6: type `{}` not found in catalog", bv.type_name)) })?; let prop = node_type.properties.get(property).ok_or_else(|| { - CompilerError::Type(format!( + NanoError::Type(format!( "T6: type `{}` has no property `{}`", bv.type_name, property )) @@ -962,19 +962,19 @@ fn resolve_expr_type( query, } => { let node_binding = ctx.bindings.get(variable).ok_or_else(|| { - CompilerError::Type(format!("T15: variable `${}` is not bound", variable)) + NanoError::Type(format!("T15: variable `${}` is not bound", variable)) })?; let node_type = catalog .node_types .get(&node_binding.type_name) 
.ok_or_else(|| { - CompilerError::Type(format!( + NanoError::Type(format!( "T15: type `{}` not found in catalog", node_binding.type_name )) })?; let prop_type = node_type.properties.get(property).ok_or_else(|| { - CompilerError::Type(format!( + NanoError::Type(format!( "T15: type `{}` has no property `{}`", node_binding.type_name, property )) @@ -982,7 +982,7 @@ fn resolve_expr_type( let vector_dim = match prop_type.scalar { ScalarType::Vector(dim) => dim, _ => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T15: nearest requires a Vector property, got {}.{}: {}", node_binding.type_name, property, @@ -991,7 +991,7 @@ fn resolve_expr_type( } }; if prop_type.list { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T15: nearest does not support list-wrapped vectors".to_string(), )); } @@ -1000,7 +1000,7 @@ fn resolve_expr_type( && let Some(dim) = numeric_vector_literal_dim(lit) { if dim != vector_dim { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T15: nearest vector dimension mismatch: property is Vector({}), query literal has {} elements", vector_dim, dim ))); @@ -1019,7 +1019,7 @@ fn resolve_expr_type( _ => unreachable!(), }; if qdim != vector_dim { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T15: nearest vector dimension mismatch: property is Vector({}), query is Vector({})", vector_dim, qdim ))); @@ -1029,14 +1029,14 @@ fn resolve_expr_type( // query-time string embedding is supported by the runtime executor } ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T15: nearest query must be Vector({}) or String, got {}", vector_dim, s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T15: nearest query must be a scalar expression".to_string(), )); } @@ -1052,13 +1052,13 @@ fn resolve_expr_type( match field_type { ResolvedType::Scalar(s) if 
s.scalar == ScalarType::String && !s.list => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T19: search field must be String, got {}", s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T19: search field must be a scalar String expression".to_string(), )); } @@ -1068,13 +1068,13 @@ fn resolve_expr_type( match query_type { ResolvedType::Scalar(s) if s.scalar == ScalarType::String && !s.list => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T19: search query must be String, got {}", s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T19: search query must be a scalar String expression".to_string(), )); } @@ -1094,13 +1094,13 @@ fn resolve_expr_type( match field_type { ResolvedType::Scalar(s) if s.scalar == ScalarType::String && !s.list => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T19: fuzzy field must be String, got {}", s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T19: fuzzy field must be a scalar String expression".to_string(), )); } @@ -1110,13 +1110,13 @@ fn resolve_expr_type( match query_type { ResolvedType::Scalar(s) if s.scalar == ScalarType::String && !s.list => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T19: fuzzy query must be String, got {}", s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T19: fuzzy query must be a scalar String expression".to_string(), )); } @@ -1135,13 +1135,13 @@ fn resolve_expr_type( | ScalarType::U64 ) => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T19: fuzzy max_edits must be an integer scalar, got {}", 
s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T19: fuzzy max_edits must be an integer scalar expression".to_string(), )); } @@ -1158,13 +1158,13 @@ fn resolve_expr_type( match field_type { ResolvedType::Scalar(s) if s.scalar == ScalarType::String && !s.list => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T20: match_text field must be String, got {}", s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T20: match_text field must be a scalar String expression".to_string(), )); } @@ -1174,13 +1174,13 @@ fn resolve_expr_type( match query_type { ResolvedType::Scalar(s) if s.scalar == ScalarType::String && !s.list => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T20: match_text query must be String, got {}", s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T20: match_text query must be a scalar String expression".to_string(), )); } @@ -1196,13 +1196,13 @@ fn resolve_expr_type( match field_type { ResolvedType::Scalar(s) if s.scalar == ScalarType::String && !s.list => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T20: bm25 field must be String, got {}", s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T20: bm25 field must be a scalar String expression".to_string(), )); } @@ -1212,13 +1212,13 @@ fn resolve_expr_type( match query_type { ResolvedType::Scalar(s) if s.scalar == ScalarType::String && !s.list => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T20: bm25 query must be String, got {}", s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T20: bm25 query must be a scalar String 
expression".to_string(), )); } @@ -1235,12 +1235,12 @@ fn resolve_expr_type( k, } => { if !matches!(primary.as_ref(), Expr::Nearest { .. } | Expr::Bm25 { .. }) { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T21: rrf primary expression must be nearest(...) or bm25(...)".to_string(), )); } if !matches!(secondary.as_ref(), Expr::Nearest { .. } | Expr::Bm25 { .. }) { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T21: rrf secondary expression must be nearest(...) or bm25(...)".to_string(), )); } @@ -1252,13 +1252,13 @@ fn resolve_expr_type( match ty { ResolvedType::Scalar(s) if s.scalar == ScalarType::F64 && !s.list => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T21: rrf rank expressions must evaluate to F64, got {}", s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T21: rrf rank expressions must be scalar numeric expressions" .to_string(), )); @@ -1279,13 +1279,13 @@ fn resolve_expr_type( | ScalarType::U64 ) => {} ResolvedType::Scalar(s) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T21: rrf k must be an integer scalar, got {}", s.display_name() ))); } _ => { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T21: rrf k must be an integer scalar expression".to_string(), )); } @@ -1293,7 +1293,7 @@ fn resolve_expr_type( if let Expr::Literal(Literal::Integer(v)) = k_expr.as_ref() && *v <= 0 { - return Err(CompilerError::Type( + return Err(NanoError::Type( "T21: rrf k must be greater than 0".to_string(), )); } @@ -1311,7 +1311,7 @@ fn resolve_expr_type( } else if let Some(bv) = ctx.bindings.get(name) { Ok(ResolvedType::Node(bv.type_name.clone())) } else { - Err(CompilerError::Type(format!( + Err(NanoError::Type(format!( "variable `${}` is not bound", name ))) @@ -1327,7 +1327,7 @@ fn resolve_expr_type( if let ResolvedType::Scalar(s) = &arg_type && (s.list || 
!s.scalar.is_numeric()) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T8: {} requires numeric type, got {}", func, s.display_name() @@ -1338,7 +1338,7 @@ fn resolve_expr_type( if let ResolvedType::Scalar(s) = &arg_type && (s.list || (!s.scalar.is_numeric() && s.scalar != ScalarType::String)) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T8: {} requires numeric or string type, got {}", func, s.display_name() @@ -1420,7 +1420,7 @@ fn resolved_type_to_field_shape( ResolvedType::Scalar(prop_type) => Ok((prop_type.to_arrow(), prop_type.nullable)), ResolvedType::Node(type_name) => { let node_type = catalog.node_types.get(type_name).ok_or_else(|| { - CompilerError::Type(format!("type `{}` not found in catalog", type_name)) + NanoError::Type(format!("type `{}` not found in catalog", type_name)) })?; let fields: Vec = node_type .arrow_schema @@ -1450,14 +1450,14 @@ fn literal_type(lit: &Literal) -> Result { } let first = literal_type(&items[0])?; if first.list { - return Err(CompilerError::Type( + return Err(NanoError::Type( "nested list literals are not supported".to_string(), )); } for item in items.iter().skip(1) { let item_type = literal_type(item)?; if item_type.list || !types_compatible(&first, &item_type) { - return Err(CompilerError::Type( + return Err(NanoError::Type( "list literal elements must share a compatible scalar type".to_string(), )); } @@ -1473,7 +1473,7 @@ fn check_literal_type(lit: &Literal, expected: &PropType, prop_name: &str) -> Re return if expected.nullable { Ok(()) } else { - Err(CompilerError::Type(format!( + Err(NanoError::Type(format!( "T3: property `{}` is non-nullable but got null", prop_name ))) @@ -1487,7 +1487,7 @@ fn check_literal_type(lit: &Literal, expected: &PropType, prop_name: &str) -> Re if actual_dim == expected_dim { return Ok(()); } - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T3: property `{}` has type Vector({}) but got 
vector literal with {} elements", prop_name, expected_dim, actual_dim ))); @@ -1495,7 +1495,7 @@ fn check_literal_type(lit: &Literal, expected: &PropType, prop_name: &str) -> Re let lit_type = literal_type(lit)?; if !types_compatible(&lit_type, expected) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T3: property `{}` has type {} but got {}", prop_name, expected.display_name(), @@ -1507,7 +1507,7 @@ fn check_literal_type(lit: &Literal, expected: &PropType, prop_name: &str) -> Re match lit { Literal::String(v) => { if !allowed.contains(v) { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T3: property `{}` expects one of [{}], got '{}'", prop_name, allowed.join(", "), @@ -1520,7 +1520,7 @@ fn check_literal_type(lit: &Literal, expected: &PropType, prop_name: &str) -> Re match item { Literal::String(v) if allowed.contains(v) => {} Literal::String(v) => { - return Err(CompilerError::Type(format!( + return Err(NanoError::Type(format!( "T3: property `{}` expects one of [{}], got '{}'", prop_name, allowed.join(", "), diff --git a/crates/omnigraph-compiler/src/query_input.rs b/crates/omnigraph-compiler/src/query_input.rs index b641f3e..b85decf 100644 --- a/crates/omnigraph-compiler/src/query_input.rs +++ b/crates/omnigraph-compiler/src/query_input.rs @@ -3,7 +3,7 @@ use std::fmt; use serde_json::Value; -use crate::error::CompilerError; +use crate::error::NanoError; use crate::ir::ParamMap; use crate::json_output::{JS_MAX_SAFE_INTEGER_U64, is_js_safe_integer_i64}; use crate::query::ast::{Literal, Param, QueryDecl}; @@ -17,7 +17,7 @@ pub enum JsonParamMode { #[derive(Debug)] pub enum RunInputError { - Core(CompilerError), + Core(NanoError), Message(String), } @@ -45,8 +45,8 @@ impl Error for RunInputError { } } -impl From for RunInputError { - fn from(value: CompilerError) -> Self { +impl From for RunInputError { + fn from(value: NanoError) -> Self { Self::Core(value) } } @@ -120,7 +120,7 @@ impl ToParam for 
i64 { impl ToParam for isize { fn to_param(self) -> crate::error::Result { let value = i64::try_from(self).map_err(|_| { - CompilerError::Execution(format!( + NanoError::Execution(format!( "param value {} exceeds current engine range for numeric literals (max {})", self, i64::MAX @@ -151,7 +151,7 @@ impl ToParam for u32 { impl ToParam for u64 { fn to_param(self) -> crate::error::Result { let value = i64::try_from(self).map_err(|_| { - CompilerError::Execution(format!( + NanoError::Execution(format!( "param value {} exceeds current engine range for numeric literals (max {})", self, i64::MAX @@ -164,7 +164,7 @@ impl ToParam for u64 { impl ToParam for usize { fn to_param(self) -> crate::error::Result { let value = i64::try_from(self).map_err(|_| { - CompilerError::Execution(format!( + NanoError::Execution(format!( "param value {} exceeds current engine range for numeric literals (max {})", self, i64::MAX @@ -177,7 +177,7 @@ impl ToParam for usize { impl ToParam for f32 { fn to_param(self) -> crate::error::Result { if !self.is_finite() { - return Err(CompilerError::Execution(format!( + return Err(NanoError::Execution(format!( "invalid float parameter {}", self ))); @@ -189,7 +189,7 @@ impl ToParam for f32 { impl ToParam for f64 { fn to_param(self) -> crate::error::Result { if !self.is_finite() { - return Err(CompilerError::Execution(format!( + return Err(NanoError::Execution(format!( "invalid float parameter {}", self ))); diff --git a/crates/omnigraph-compiler/src/result.rs b/crates/omnigraph-compiler/src/result.rs index d92dd1e..7de77ac 100644 --- a/crates/omnigraph-compiler/src/result.rs +++ b/crates/omnigraph-compiler/src/result.rs @@ -5,7 +5,7 @@ use arrow_ipc::writer::StreamWriter; use arrow_schema::{DataType, Field, Schema, SchemaRef}; use serde::de::DeserializeOwned; -use crate::error::{CompilerError, Result}; +use crate::error::{NanoError, Result}; use crate::json_output::{record_batches_to_json_rows, record_batches_to_rust_json_rows}; #[derive(Debug, Clone, 
Copy, Default)] @@ -47,7 +47,7 @@ impl QueryResult { } arrow_select::concat::concat_batches(&self.schema, &self.batches) - .map_err(|err| CompilerError::Execution(err.to_string())) + .map_err(|err| NanoError::Execution(err.to_string())) } pub fn to_sdk_json(&self) -> serde_json::Value { @@ -60,7 +60,7 @@ impl QueryResult { pub fn deserialize(&self) -> Result { serde_json::from_value(self.to_rust_json()).map_err(|err| { - CompilerError::Execution(format!("failed to deserialize query result: {}", err)) + NanoError::Execution(format!("failed to deserialize query result: {}", err)) }) } diff --git a/crates/omnigraph-compiler/src/schema/ast.rs b/crates/omnigraph-compiler/src/schema/ast.rs index 9be0e56..f8ed18a 100644 --- a/crates/omnigraph-compiler/src/schema/ast.rs +++ b/crates/omnigraph-compiler/src/schema/ast.rs @@ -1,5 +1,3 @@ -use std::collections::BTreeMap; - use crate::types::PropType; use serde::{Deserialize, Serialize}; @@ -52,11 +50,6 @@ pub struct PropDecl { pub struct Annotation { pub name: String, pub value: Option, - /// Keyword arguments, e.g. `model="…"` on `@embed("source", model="…")`. - /// Empty is skipped in serialization so existing schemas' IR JSON (and - /// hash) stay byte-identical; `BTreeMap` keeps the order deterministic. - #[serde(default, skip_serializing_if = "BTreeMap::is_empty")] - pub kwargs: BTreeMap, } /// A typed constraint declared in a node or edge body. 
diff --git a/crates/omnigraph-compiler/src/schema/parser.rs b/crates/omnigraph-compiler/src/schema/parser.rs index 6e34e53..43e11ed 100644 --- a/crates/omnigraph-compiler/src/schema/parser.rs +++ b/crates/omnigraph-compiler/src/schema/parser.rs @@ -5,7 +5,7 @@ use pest::error::InputLocation; use pest_derive::Parser; use crate::error::{ - CompilerError, ParseDiagnostic, Result, SourceSpan, decode_string_literal, render_span, + NanoError, ParseDiagnostic, Result, SourceSpan, decode_string_literal, render_span, }; use crate::types::{PropType, ScalarType}; @@ -16,7 +16,7 @@ use super::ast::*; struct SchemaParser; pub fn parse_schema(input: &str) -> Result { - parse_schema_diagnostic(input).map_err(|e| CompilerError::Parse(e.to_string())) + parse_schema_diagnostic(input).map_err(|e| NanoError::Parse(e.to_string())) } pub fn parse_schema_diagnostic(input: &str) -> std::result::Result { @@ -27,8 +27,7 @@ pub fn parse_schema_diagnostic(input: &str) -> std::result::Result std::result::Result = interfaces.iter().collect(); for decl in &mut declarations { if let SchemaDecl::Node(node) = decl { - resolve_interfaces(node, &iface_refs).map_err(compiler_error_to_diagnostic)?; + resolve_interfaces(node, &iface_refs).map_err(nano_error_to_diagnostic)?; } } let schema = SchemaFile { declarations }; - validate_schema_annotations(&schema).map_err(compiler_error_to_diagnostic)?; - validate_constraints(&schema).map_err(compiler_error_to_diagnostic)?; + validate_schema_annotations(&schema).map_err(nano_error_to_diagnostic)?; + validate_constraints(&schema).map_err(nano_error_to_diagnostic)?; Ok(schema) } @@ -65,7 +64,7 @@ fn pest_error_to_diagnostic(err: pest::error::Error) -> ParseDiagnostic { ParseDiagnostic::new(err.to_string(), span) } -fn compiler_error_to_diagnostic(err: CompilerError) -> ParseDiagnostic { +fn nano_error_to_diagnostic(err: NanoError) -> ParseDiagnostic { ParseDiagnostic::new(err.to_string(), None) } @@ -75,7 +74,7 @@ fn parse_schema_decl(pair: 
pest::iterators::Pair) -> Result { Rule::interface_decl => Ok(SchemaDecl::Interface(parse_interface_decl(inner)?)), Rule::node_decl => Ok(SchemaDecl::Node(parse_node_decl(inner)?)), Rule::edge_decl => Ok(SchemaDecl::Edge(parse_edge_decl(inner)?)), - _ => Err(CompilerError::Parse(format!( + _ => Err(NanoError::Parse(format!( "unexpected rule: {:?}", inner.as_rule() ))), @@ -181,20 +180,21 @@ fn parse_cardinality(pair: pest::iterators::Pair) -> Result { let min_str = inner.next().unwrap().as_str(); let min = min_str .parse::() - .map_err(|_| CompilerError::Parse(format!("invalid cardinality min: {}", min_str)))?; - let max = - if let Some(max_pair) = inner.next() { - let max_str = max_pair.as_str(); - Some(max_str.parse::().map_err(|_| { - CompilerError::Parse(format!("invalid cardinality max: {}", max_str)) - })?) - } else { - None - }; + .map_err(|_| NanoError::Parse(format!("invalid cardinality min: {}", min_str)))?; + let max = if let Some(max_pair) = inner.next() { + let max_str = max_pair.as_str(); + Some( + max_str + .parse::() + .map_err(|_| NanoError::Parse(format!("invalid cardinality max: {}", max_str)))?, + ) + } else { + None + }; if let Some(max_val) = max { if min > max_val { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "cardinality min ({}) exceeds max ({})", min, max_val ))); @@ -219,7 +219,7 @@ fn parse_body_constraint(pair: pest::iterators::Pair) -> Result>>()?; if names.is_empty() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "@key constraint requires at least one property name".to_string(), )); } @@ -228,7 +228,7 @@ fn parse_body_constraint(pair: pest::iterators::Pair) -> Result { let names = extract_ident_list_from_args(args)?; if names.is_empty() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "@unique constraint requires at least one property name".to_string(), )); } @@ -237,7 +237,7 @@ fn parse_body_constraint(pair: pest::iterators::Pair) -> Result { let names = 
extract_ident_list_from_args(args)?; if names.is_empty() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "@index constraint requires at least one property name".to_string(), )); } @@ -246,7 +246,7 @@ fn parse_body_constraint(pair: pest::iterators::Pair) -> Result { // @range(prop, min..max) if args.len() < 2 { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "@range requires property name and bounds: @range(prop, min..max)".to_string(), )); } @@ -258,7 +258,7 @@ fn parse_body_constraint(pair: pest::iterators::Pair) -> Result { // @check(prop, "regex") if args.len() < 2 { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "@check requires property name and pattern: @check(prop, \"regex\")" .to_string(), )); @@ -267,10 +267,7 @@ fn parse_body_constraint(pair: pest::iterators::Pair) -> Result Err(CompilerError::Parse(format!( - "unknown constraint: @{}", - other - ))), + other => Err(NanoError::Parse(format!("unknown constraint: @{}", other))), } } @@ -284,7 +281,7 @@ fn extract_ident_from_constraint_arg(pair: pest::iterators::Pair) -> Resul return Ok(inner.as_str().to_string()); } } - Err(CompilerError::Parse( + Err(NanoError::Parse( "expected property name in constraint".to_string(), )) } @@ -312,7 +309,7 @@ fn extract_string_from_constraint_arg(pair: &pest::iterators::Pair) -> Res } find_string(pair)? - .ok_or_else(|| CompilerError::Parse("expected string argument in constraint".to_string())) + .ok_or_else(|| NanoError::Parse("expected string argument in constraint".to_string())) } fn extract_range_bounds( @@ -330,9 +327,7 @@ fn extract_range_bounds( } } found.ok_or_else(|| { - CompilerError::Parse( - "expected range bounds (min..max) in @range constraint".to_string(), - ) + NanoError::Parse("expected range bounds (min..max) in @range constraint".to_string()) })? 
}; @@ -383,7 +378,7 @@ fn parse_constraint_bound(pair: &pest::iterators::Pair) -> Result Res for iface_name in &node.implements { let iface = interface_map.get(iface_name.as_str()).ok_or_else(|| { - CompilerError::Parse(format!( + NanoError::Parse(format!( "node {} implements unknown interface '{}'", node.name, iface_name )) @@ -426,7 +421,7 @@ fn resolve_interfaces(node: &mut NodeDecl, interfaces: &[&InterfaceDecl]) -> Res if let Some(existing) = node.properties.iter().find(|p| p.name == iface_prop.name) { // Property exists — verify type compatibility if existing.prop_type != iface_prop.prop_type { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "node {} property '{}' has type {} but interface {} declares it as {}", node.name, iface_prop.name, @@ -477,35 +472,36 @@ fn parse_type_ref(pair: pest::iterators::Pair) -> Result { let mut inner = pair .into_inner() .next() - .ok_or_else(|| CompilerError::Parse("type reference is missing core type".to_string()))?; + .ok_or_else(|| NanoError::Parse("type reference is missing core type".to_string()))?; if inner.as_rule() == Rule::core_type { - inner = inner.into_inner().next().ok_or_else(|| { - CompilerError::Parse("type reference is missing core type".to_string()) - })?; + inner = inner + .into_inner() + .next() + .ok_or_else(|| NanoError::Parse("type reference is missing core type".to_string()))?; } match inner.as_rule() { Rule::base_type => { let scalar = ScalarType::from_str_name(inner.as_str()) - .ok_or_else(|| CompilerError::Parse(format!("unknown type: {}", inner.as_str())))?; + .ok_or_else(|| NanoError::Parse(format!("unknown type: {}", inner.as_str())))?; Ok(PropType::scalar(scalar, nullable)) } Rule::vector_type => { let dim_text = inner .into_inner() .next() - .ok_or_else(|| CompilerError::Parse("Vector type missing dimension".to_string()))? + .ok_or_else(|| NanoError::Parse("Vector type missing dimension".to_string()))?
.as_str(); let dim = dim_text .parse::() - .map_err(|e| CompilerError::Parse(format!("invalid Vector dimension: {}", e)))?; + .map_err(|e| NanoError::Parse(format!("invalid Vector dimension: {}", e)))?; if dim == 0 { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "Vector dimension must be greater than zero".to_string(), )); } if dim > i32::MAX as u32 { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "Vector dimension {} exceeds maximum supported {}", dim, i32::MAX @@ -514,14 +510,15 @@ fn parse_type_ref(pair: pest::iterators::Pair) -> Result { Ok(PropType::scalar(ScalarType::Vector(dim), nullable)) } Rule::list_type => { - let element = inner.into_inner().next().ok_or_else(|| { - CompilerError::Parse("list type missing element type".to_string()) - })?; + let element = inner + .into_inner() + .next() + .ok_or_else(|| NanoError::Parse("list type missing element type".to_string()))?; let scalar = ScalarType::from_str_name(element.as_str()).ok_or_else(|| { - CompilerError::Parse(format!("unknown list element type: {}", element.as_str())) + NanoError::Parse(format!("unknown list element type: {}", element.as_str())) })?; if matches!(scalar, ScalarType::Blob) { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "list of Blob is not supported".to_string(), )); } @@ -535,7 +532,7 @@ fn parse_type_ref(pair: pest::iterators::Pair) -> Result { } } if values.is_empty() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "enum type must include at least one value".to_string(), )); } @@ -543,13 +540,13 @@ fn parse_type_ref(pair: pest::iterators::Pair) -> Result { dedup.sort(); dedup.dedup(); if dedup.len() != values.len() { - return Err(CompilerError::Parse( + return Err(NanoError::Parse( "enum type cannot include duplicate values".to_string(), )); } Ok(PropType::enum_type(values, nullable)) } - other => Err(CompilerError::Parse(format!( + other => Err(NanoError::Parse(format!( "unexpected type 
rule: {:?}", other ))), @@ -559,32 +556,12 @@ fn parse_type_ref(pair: pest::iterators::Pair) -> Result { fn parse_annotation(pair: pest::iterators::Pair) -> Result { let mut inner = pair.into_inner(); let name = inner.next().unwrap().as_str().to_string(); - let mut value = None; - let mut kwargs = std::collections::BTreeMap::new(); - if let Some(args) = inner.next() { - // `annotation_args`: one positional arg followed by zero or more - // `key = literal` kwargs (e.g. `@embed("source", model="…")`). - for arg in args.into_inner() { - match arg.as_rule() { - Rule::annotation_arg => { - value = Some(decode_string_literal(arg.as_str())?); - } - Rule::annotation_kwarg => { - let mut kw = arg.into_inner(); - let key = kw.next().unwrap().as_str().to_string(); - let raw = kw.next().unwrap().as_str(); - kwargs.insert(key, decode_string_literal(raw)?); - } - _ => {} - } - } - } + let value = inner + .next() + .map(|p| decode_string_literal(p.as_str())) + .transpose()?; - Ok(Annotation { - name, - value, - kwargs, - }) + Ok(Annotation { name, value }) } fn validate_string_annotation( @@ -598,19 +575,19 @@ fn validate_string_annotation( continue; } if seen { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "{} declares @{} multiple times", target, annotation ))); } let value = ann.value.as_deref().ok_or_else(|| { - CompilerError::Parse(format!( + NanoError::Parse(format!( "@{} on {} requires a non-empty value", annotation, target )) })?; if value.trim().is_empty() { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@{} on {} requires a non-empty value", annotation, target ))); @@ -634,7 +611,7 @@ fn validate_schema_annotations(schema: &SchemaFile) -> Result<()> { || ann.name == "index" || ann.name == "embed" { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@{} is only supported on node properties or as body constraint (node {})", ann.name, node.name ))); @@ -663,7 +640,7 @@ 
fn validate_schema_annotations(schema: &SchemaFile) -> Result<()> { || ann.name == "index" || ann.name == "embed" { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@{} is not supported on edges (edge {})", ann.name, edge.name ))); @@ -717,13 +694,13 @@ fn validate_property_annotations( || ann.name == "index" || ann.name == "embed") { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@{} is not supported on list property {}.{}", ann.name, type_name, prop.name ))); } if is_vector && (ann.name == "key" || ann.name == "unique") { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@{} is not supported on vector property {}.{}", ann.name, type_name, prop.name ))); @@ -734,13 +711,13 @@ fn validate_property_annotations( || ann.name == "index" || ann.name == "embed") { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@{} is not supported on blob property {}.{}", ann.name, type_name, prop.name ))); } if ann.name == "instruction" { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@instruction is only supported on node and edge types (property {}.{})", type_name, prop.name ))); @@ -748,7 +725,7 @@ fn validate_property_annotations( // Edge-specific restrictions if is_edge && (ann.name == "key" || ann.name == "embed") { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@{} is not supported on edge properties (edge {}.{})", ann.name, type_name, prop.name ))); @@ -758,13 +735,13 @@ fn validate_property_annotations( match ann.name.as_str() { "key" => { if ann.value.is_some() { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@key on {}.{} does not accept a value", type_name, prop.name ))); } if key_seen { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "property {}.{} declares @key multiple times", type_name, 
prop.name ))); @@ -773,13 +750,13 @@ fn validate_property_annotations( } "unique" => { if ann.value.is_some() { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@unique on {}.{} does not accept a value", type_name, prop.name ))); } if unique_seen { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "property {}.{} declares @unique multiple times", type_name, prop.name ))); @@ -788,13 +765,13 @@ fn validate_property_annotations( } "index" => { if ann.value.is_some() { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@index on {}.{} does not accept a value", type_name, prop.name ))); } if index_seen { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "property {}.{} declares @index multiple times", type_name, prop.name ))); @@ -803,7 +780,7 @@ fn validate_property_annotations( } "embed" => { if embed_seen { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "property {}.{} declares @embed multiple times", type_name, prop.name ))); @@ -811,20 +788,20 @@ fn validate_property_annotations( embed_seen = true; if !is_vector { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@embed is only supported on vector properties ({}.{})", type_name, prop.name ))); } let source_prop = ann.value.as_deref().ok_or_else(|| { - CompilerError::Parse(format!( + NanoError::Parse(format!( "@embed on {}.{} requires a source property name", type_name, prop.name )) })?; if source_prop.trim().is_empty() { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@embed on {}.{} requires a non-empty source property name", type_name, prop.name ))); @@ -834,29 +811,18 @@ fn validate_property_annotations( .iter() .find(|p| p.name == source_prop) .ok_or_else(|| { - CompilerError::Parse(format!( + NanoError::Parse(format!( "@embed on {}.{} references unknown source property {}", 
type_name, prop.name, source_prop )) })?; if source_decl.prop_type.list || source_decl.prop_type.scalar != ScalarType::String { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@embed source property {}.{} must be String", type_name, source_prop ))); } - - // `model` is the only supported kwarg; reject the rest loudly so - // a typo can't be silently ignored (it would never validate). - for key in ann.kwargs.keys() { - if key != "model" { - return Err(CompilerError::Parse(format!( - "@embed on {}.{} has unknown argument '{}=' (only 'model' is supported)", - type_name, prop.name, key - ))); - } - } } _ => {} } @@ -896,45 +862,45 @@ fn validate_type_constraints( match constraint { Constraint::Key(cols) => { if is_edge { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@key constraint is not supported on edges (edge {})", type_name ))); } key_count += 1; if key_count > 1 { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "node type {} has multiple @key constraints; only one is supported", type_name ))); } for col in cols { let prop = prop_names.get(col.as_str()).ok_or_else(|| { - CompilerError::Parse(format!( + NanoError::Parse(format!( "@key on {} references unknown property '{}'", type_name, col )) })?; if prop.prop_type.nullable { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@key property {}.{} cannot be nullable", type_name, col ))); } if prop.prop_type.list { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@key is not supported on list property {}.{}", type_name, col ))); } if matches!(prop.prop_type.scalar, ScalarType::Vector(_)) { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@key is not supported on vector property {}.{}", type_name, col ))); } if matches!(prop.prop_type.scalar, ScalarType::Blob) { - return Err(CompilerError::Parse(format!( + return 
Err(NanoError::Parse(format!( "@key is not supported on blob property {}.{}", type_name, col ))); @@ -948,7 +914,7 @@ fn validate_type_constraints( continue; } if !prop_names.contains_key(col.as_str()) { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@unique on {} references unknown property '{}'", type_name, col ))); @@ -961,13 +927,13 @@ fn validate_type_constraints( continue; } let prop = prop_names.get(col.as_str()).ok_or_else(|| { - CompilerError::Parse(format!( + NanoError::Parse(format!( "@index on {} references unknown property '{}'", type_name, col )) })?; if matches!(prop.prop_type.scalar, ScalarType::Blob) { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@index is not supported on blob property {}.{}", type_name, col ))); @@ -976,19 +942,19 @@ fn validate_type_constraints( } Constraint::Range { property, .. } => { if is_edge { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@range constraint is not supported on edges (edge {})", type_name ))); } let prop = prop_names.get(property.as_str()).ok_or_else(|| { - CompilerError::Parse(format!( + NanoError::Parse(format!( "@range on {} references unknown property '{}'", type_name, property )) })?; if !prop.prop_type.scalar.is_numeric() { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@range on {}.{} requires a numeric type, got {}", type_name, property, @@ -998,19 +964,19 @@ fn validate_type_constraints( } Constraint::Check { property, .. 
} => { if is_edge { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@check constraint is not supported on edges (edge {})", type_name ))); } let prop = prop_names.get(property.as_str()).ok_or_else(|| { - CompilerError::Parse(format!( + NanoError::Parse(format!( "@check on {} references unknown property '{}'", type_name, property )) })?; if prop.prop_type.scalar != ScalarType::String { - return Err(CompilerError::Parse(format!( + return Err(NanoError::Parse(format!( "@check on {}.{} requires String type, got {}", type_name, property, diff --git a/crates/omnigraph-compiler/src/schema/parser_tests.rs b/crates/omnigraph-compiler/src/schema/parser_tests.rs index 9a2e1ba..2302cfb 100644 --- a/crates/omnigraph-compiler/src/schema/parser_tests.rs +++ b/crates/omnigraph-compiler/src/schema/parser_tests.rs @@ -508,66 +508,6 @@ embedding: Vector(3) @embed(title) } } -#[test] -fn test_parse_embed_annotation_with_model_kwarg() { - let input = r#" -node Doc { -title: String -embedding: Vector(3) @embed("title", model="openai/text-embedding-3-large") -} -"#; - let schema = parse_schema(input).unwrap(); - match &schema.declarations[0] { - SchemaDecl::Node(n) => { - let ann = &n.properties[1].annotations[0]; - assert_eq!(ann.name, "embed"); - assert_eq!(ann.value.as_deref(), Some("title")); - assert_eq!( - ann.kwargs.get("model").map(String::as_str), - Some("openai/text-embedding-3-large") - ); - } - _ => panic!("expected Node"), - } -} - -#[test] -fn test_parse_embed_annotation_without_model_has_empty_kwargs() { - let input = r#" -node Doc { -title: String -embedding: Vector(3) @embed("title") -} -"#; - let schema = parse_schema(input).unwrap(); - match &schema.declarations[0] { - SchemaDecl::Node(n) => { - let ann = &n.properties[1].annotations[0]; - assert!(ann.kwargs.is_empty()); - // Empty kwargs must NOT serialize, so existing schemas' IR JSON (and - // thus the schema hash) stay byte-identical after this field is added. 
- let json = serde_json::to_string(ann).unwrap(); - assert!(!json.contains("kwargs"), "unexpected kwargs in {json}"); - } - _ => panic!("expected Node"), - } -} - -#[test] -fn test_parse_embed_annotation_rejects_unknown_kwarg() { - let input = r#" -node Doc { -title: String -embedding: Vector(3) @embed("title", provider="openai") -} -"#; - let err = parse_schema(input).unwrap_err(); - assert!( - err.to_string().contains("only 'model' is supported"), - "got: {err}" - ); -} - #[test] fn test_parse_edge_no_body() { let input = "edge WorksAt: Person -> Company\n"; diff --git a/crates/omnigraph-compiler/src/schema/schema.pest b/crates/omnigraph-compiler/src/schema/schema.pest index b02724e..395c516 100644 --- a/crates/omnigraph-compiler/src/schema/schema.pest +++ b/crates/omnigraph-compiler/src/schema/schema.pest @@ -42,10 +42,8 @@ enum_value = @{ (ASCII_ALPHANUMERIC | "_" | "-")+ } base_type = { "String" | "Blob" | "Bool" | "I32" | "I64" | "U32" | "U64" | "F32" | "F64" | "DateTime" | "Date" } // Annotation rule excludes constraint keywords followed by "(" -- those are body_constraints -annotation = { "@" ~ !(constraint_name ~ "(") ~ ident ~ ("(" ~ annotation_args ~ ")")? } -annotation_args = { annotation_arg ~ ("," ~ annotation_kwarg)* } +annotation = { "@" ~ !(constraint_name ~ "(") ~ ident ~ ("(" ~ annotation_arg ~ ")")? } annotation_arg = { literal | ident } -annotation_kwarg = { ident ~ "=" ~ literal } literal = { string_lit | float_lit | integer | bool_lit } diff --git a/crates/omnigraph-policy/Cargo.toml b/crates/omnigraph-policy/Cargo.toml index 136df84..dacda35 100644 --- a/crates/omnigraph-policy/Cargo.toml +++ b/crates/omnigraph-policy/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "omnigraph-policy" -version = "0.7.2" +version = "0.6.0" edition = "2024" description = "Policy / authorization layer for Omnigraph -- Cedar-backed PolicyEngine, PolicyChecker trait, ResourceScope enum." 
license = "MIT" diff --git a/crates/omnigraph-policy/src/lib.rs b/crates/omnigraph-policy/src/lib.rs index 46b380a..6459fcd 100644 --- a/crates/omnigraph-policy/src/lib.rs +++ b/crates/omnigraph-policy/src/lib.rs @@ -56,21 +56,6 @@ pub enum PolicyAction { /// from v0.6.0; operators add and remove graphs by editing /// `omnigraph.yaml` and restarting. GraphList, - /// Gates invoking a server-side stored query by name. Per-graph and - /// **graph-scoped** (no branch dimension, like `Admin`): the per-branch - /// access of the query body is enforced by the inner `Read`/`Change` - /// gate, so branch-scoping this outer gate would be redundant (and was - /// wrong for snapshot reads). A rule that sets `branch_scope` on - /// `invoke_query` is rejected by `validate()`. In this release it is - /// **coarse**: an `invoke_query` allow rule permits *any* stored query - /// on the graph (no per-query dimension yet); a future, additive - /// refinement adds an optional query-name scope. - /// - /// This gate sits at the HTTP boundary. The engine `_as` writers still - /// enforce `Read`/`Change` per the query body, so a stored *mutation* - /// is double-gated: `invoke_query` to reach the tool, plus `change` for - /// the write itself. 
- InvokeQuery, } impl PolicyAction { @@ -85,7 +70,6 @@ impl PolicyAction { Self::BranchMerge => "branch_merge", Self::Admin => "admin", Self::GraphList => "graph_list", - Self::InvokeQuery => "invoke_query", } } @@ -115,8 +99,7 @@ impl PolicyAction { | Self::BranchCreate | Self::BranchDelete | Self::BranchMerge - | Self::Admin - | Self::InvokeQuery => PolicyResourceKind::Graph, + | Self::Admin => PolicyResourceKind::Graph, } } } @@ -172,7 +155,6 @@ impl FromStr for PolicyAction { "branch_merge" => Ok(Self::BranchMerge), "admin" => Ok(Self::Admin), "graph_list" => Ok(Self::GraphList), - "invoke_query" => Ok(Self::InvokeQuery), other => bail!("unknown policy action '{other}'"), } } @@ -277,14 +259,7 @@ pub struct PolicyEngine { impl PolicyConfig { pub fn load(path: &Path) -> Result { - Self::from_source(&fs::read_to_string(path)?) - } - - /// Parse + validate a policy from YAML source. The from-content twin of - /// `load` for callers whose policies don't live on the local filesystem - /// (e.g. a cluster catalog on object storage). - pub fn from_source(source: &str) -> Result { - let config: Self = serde_yaml::from_str(source)?; + let config: Self = serde_yaml::from_str(&fs::read_to_string(path)?)?; config.validate()?; Ok(config) } @@ -472,26 +447,13 @@ impl PolicyEngine { PolicyCompiler::compile(&config, graph_id) } - /// `load_graph` from YAML content instead of a file path -- for policies - /// that live in a non-filesystem catalog (cluster object storage). - pub fn load_graph_from_source(source: &str, graph_id: &str) -> Result { - let config = PolicyConfig::from_source(source)?; - validate_kind_alignment(&config, PolicyEngineKind::Graph)?; - PolicyCompiler::compile(&config, graph_id) - } - /// Load a server-level policy file. Rejects rules whose actions /// are per-graph (e.g. `read`, `change`) -- those belong in a /// per-graph policy file, not the server one. 
Takes no `graph_id`: /// server-scoped actions resolve against the singleton /// `Omnigraph::Server::"root"` entity, never a Graph. pub fn load_server(path: &Path) -> Result { - Self::load_server_from_source(&fs::read_to_string(path)?) - } - - /// `load_server` from YAML content instead of a file path. - pub fn load_server_from_source(source: &str) -> Result { - let config = PolicyConfig::from_source(source)?; + let config = PolicyConfig::load(path)?; validate_kind_alignment(&config, PolicyEngineKind::Server)?; // The Graph entity created by the compiler is never referenced // by a server-scoped rule, so the label below is purely a @@ -844,7 +806,6 @@ namespace Omnigraph { action "branch_delete" appliesTo { principal: Actor, resource: Graph, context: RequestContext }; action "branch_merge" appliesTo { principal: Actor, resource: Graph, context: RequestContext }; action "admin" appliesTo { principal: Actor, resource: Graph, context: RequestContext }; - action "invoke_query" appliesTo { principal: Actor, resource: Graph, context: RequestContext }; action "graph_list" appliesTo { principal: Actor, resource: Server, context: RequestContext }; } @@ -1022,42 +983,6 @@ impl PolicyChecker for PolicyEngine { #[cfg(test)] mod tests { - - #[test] - fn from_source_twins_match_path_loaders() { - let yaml = r#" -version: 1 -groups: - readers: ["act-r"] -protected_branches: [main] -rules: - - id: r1 - allow: - actors: { group: readers } - actions: [read] - branch_scope: any -"#; - let config = PolicyConfig::from_source(yaml).unwrap(); - assert_eq!(config.version, 1); - let engine = PolicyEngine::load_graph_from_source(yaml, "g1").unwrap(); - drop(engine); - - let server_yaml = r#" -version: 1 -kind: server -groups: - admins: ["act-a"] -rules: - - id: s1 - allow: - actors: { group: admins } - actions: [graph_list] -"#; - PolicyEngine::load_server_from_source(server_yaml).unwrap(); - // Kind misalignment stays loud through the from-source path. 
- assert!(PolicyEngine::load_graph_from_source(server_yaml, "g1").is_err()); - assert!(PolicyEngine::load_server_from_source(yaml).is_err()); - } use super::{ PolicyAction, PolicyCompiler, PolicyConfig, PolicyEngine, PolicyExpectation, PolicyRequest, PolicyTestCase, PolicyTestConfig, @@ -1339,80 +1264,6 @@ rules: assert!(!deny.allowed); } - #[test] - fn invoke_query_authorizes_per_graph() { - let policy: PolicyConfig = serde_yaml::from_str( - r#" -version: 1 -groups: - team: [act-alice] - others: [act-bruno] -rules: - - id: team-invoke-queries - allow: - actors: { group: team } - actions: [invoke_query] -"#, - ) - .unwrap(); - let engine = PolicyCompiler::compile(&policy, "graph").unwrap(); - - let allow = engine - .authorize( - "act-alice", - &PolicyRequest { - action: PolicyAction::InvokeQuery, - branch: None, - target_branch: None, - }, - ) - .unwrap(); - assert!(allow.allowed); - assert_eq!( - allow.matched_rule_id.as_deref(), - Some("team-invoke-queries") - ); - - // Actor outside the group → deny. - let deny = engine - .authorize( - "act-bruno", - &PolicyRequest { - action: PolicyAction::InvokeQuery, - branch: None, - target_branch: None, - }, - ) - .unwrap(); - assert!(!deny.allowed); - } - - #[test] - fn invoke_query_rejects_branch_scope() { - // invoke_query is graph-scoped (like admin) -- per-branch access is - // enforced by the inner read/change gate -- so a rule that puts a - // `branch_scope` qualifier on it is rejected at validate(). 
- let policy: PolicyConfig = serde_yaml::from_str( - r#" -version: 1 -groups: - team: [act-alice] -rules: - - id: team-invoke-any-branch - allow: - actors: { group: team } - actions: [invoke_query] - branch_scope: any -"#, - ) - .unwrap(); - let err = policy.validate().unwrap_err().to_string(); - assert!( - err.contains("branch_scope") && err.contains("invoke_query"), - "branch_scope on invoke_query must be rejected: {err}" - ); - } - #[test] fn server_scoped_rule_cannot_use_branch_scope() { let policy: PolicyConfig = serde_yaml::from_str( diff --git a/crates/omnigraph-server/Cargo.toml b/crates/omnigraph-server/Cargo.toml index fe349e9..e9a0e46 100644 --- a/crates/omnigraph-server/Cargo.toml +++ b/crates/omnigraph-server/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "omnigraph-server" -version = "0.7.2" +version = "0.6.0" edition = "2024" description = "HTTP server for the Omnigraph graph database." license = "MIT" @@ -19,11 +19,9 @@ default = [] aws = ["dep:aws-config", "dep:aws-sdk-secretsmanager"] [dependencies] -omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.7.2" } -omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.7.2" } -omnigraph-policy = { path = "../omnigraph-policy", version = "0.7.2" } -omnigraph-api-types = { path = "../omnigraph-api-types", version = "0.7.2" } -omnigraph-cluster = { path = "../omnigraph-cluster", version = "0.7.2" } +omnigraph = { package = "omnigraph-engine", path = "../omnigraph", version = "0.6.0" } +omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" } +omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.0" } axum = { workspace = true } clap = { workspace = true } color-eyre = { workspace = true } diff --git a/crates/omnigraph-server/examples/bench_concurrent_http.rs b/crates/omnigraph-server/examples/bench_concurrent_http.rs index 044b2ce..6a8411a 100644 --- a/crates/omnigraph-server/examples/bench_concurrent_http.rs +++ 
b/crates/omnigraph-server/examples/bench_concurrent_http.rs @@ -1,15 +1,14 @@ //! Server-level concurrent HTTP benchmark for MR-686 (PR 0 baseline). //! //! Drives concurrent `/change` requests against an in-process Omnigraph HTTP -//! server. Originally written to measure the global `Arc<RwLock<Omnigraph>>` -//! lock penalty as an MR-686 baseline; that lock has since been removed -//! (engine write APIs are `&self`, the server holds a lockless -//! `Arc<Omnigraph>`), so this now measures the concurrent write path itself -//! (per-`(table, branch)` queue contention + Lance I/O). +//! server. Measures the global `Arc<RwLock<Omnigraph>>` lock penalty on +//! current `main` so PR 1 + PR 2 can be evaluated against a real baseline. //! -//! Driving the HTTP server is still the right level: an engine-level bench on -//! a single handle measures Lance contention, not the server's request-path -//! concurrency. +//! Per the MR-686 plan: this is the load-bearing bench. `Omnigraph::mutate_as` +//! is `&mut self`, so an engine-level concurrent bench either serializes on the +//! borrow checker (measures nothing) or drives multiple handles (measures Lance +//! contention, not the server bottleneck). Driving the HTTP server is the only +//! way to measure the actual `RwLock` contention this work removes. //! //! Usage: //! ```sh diff --git a/crates/omnigraph-server/src/api.rs b/crates/omnigraph-server/src/api.rs index cf0d604..2c818ae 100644 --- a/crates/omnigraph-server/src/api.rs +++ b/crates/omnigraph-server/src/api.rs @@ -1,24 +1,536 @@ -//! HTTP wire DTOs. The types and their engine-result -> DTO mappings live -//! in the shared `omnigraph-api-types` crate (RFC-009 Phase 2) so the CLI -//! and server share one definition; re-exported here so every -//! `omnigraph_server::api::*` path (handlers, the OpenApi schema list, -//! CLI imports) keeps resolving unchanged. Only `query_catalog_entry` -//! stays -- it maps the server's runtime `StoredQuery` (not a wire type) -//! 
into the shared `QueryCatalogEntry` DTO. 
+use omnigraph::db::{GraphCommit, MergeOutcome, ReadTarget, SchemaApplyResult, Snapshot}; +use omnigraph::error::{MergeConflict, MergeConflictKind}; +use omnigraph::loader::{IngestResult, LoadMode}; +use omnigraph_compiler::SchemaMigrationStep; +use omnigraph_compiler::result::QueryResult; +use serde::{Deserialize, Serialize}; +use serde_json::Value; +use utoipa::{IntoParams, ToSchema}; -pub use omnigraph_api_types::*; +/// Shadow enum for documenting [`LoadMode`] in the OpenAPI schema. +#[derive(ToSchema)] +#[schema(as = LoadMode)] +#[allow(dead_code)] +enum LoadModeSchema { + /// Overwrite existing data. + #[schema(rename = "overwrite")] + Overwrite, + /// Append to existing data. + #[schema(rename = "append")] + Append, + /// Merge by id key (upsert). + #[schema(rename = "merge")] + Merge, +} -use crate::queries::StoredQuery; +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct SnapshotTableOutput { + pub table_key: String, + pub table_path: String, + pub table_version: u64, + pub table_branch: Option, + pub row_count: u64, +} -/// Project a loaded stored query into its catalog entry (typed params, -/// MCP tool name, read/mutate flag, description/instruction). -pub fn query_catalog_entry(query: &StoredQuery) -> QueryCatalogEntry { - QueryCatalogEntry { - name: query.name.clone(), - tool_name: query.effective_tool_name().to_string(), - description: query.decl.description.clone(), - instruction: query.decl.instruction.clone(), - mutation: query.is_mutation(), - params: query.decl.params.iter().map(param_descriptor).collect(), +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct SnapshotOutput { + pub branch: String, + pub manifest_version: u64, + pub tables: Vec, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct BranchCreateRequest { + /// Parent branch to fork from. Defaults to `main`. + pub from: Option, + /// Name of the new branch. Must not already exist. 
+ pub name: String, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct BranchCreateOutput { + pub uri: String, + pub from: String, + pub name: String, + pub actor_id: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct BranchListOutput { + pub branches: Vec, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct BranchDeleteOutput { + pub uri: String, + pub name: String, + pub actor_id: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct BranchMergeRequest { + /// Source branch whose commits will be merged. + pub source: String, + /// Target branch that will receive the merge. Defaults to `main`. + pub target: Option, +} + +#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, ToSchema)] +#[serde(rename_all = "snake_case")] +pub enum BranchMergeOutcome { + AlreadyUpToDate, + FastForward, + Merged, +} + +impl From for BranchMergeOutcome { + fn from(value: MergeOutcome) -> Self { + match value { + MergeOutcome::AlreadyUpToDate => Self::AlreadyUpToDate, + MergeOutcome::FastForward => Self::FastForward, + MergeOutcome::Merged => Self::Merged, + } } } + +impl BranchMergeOutcome { + pub fn as_str(self) -> &'static str { + match self { + Self::AlreadyUpToDate => "already_up_to_date", + Self::FastForward => "fast_forward", + Self::Merged => "merged", + } + } +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct BranchMergeOutput { + pub source: String, + pub target: String, + pub outcome: BranchMergeOutcome, + pub actor_id: Option, +} + +#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, ToSchema)] +#[serde(rename_all = "snake_case")] +pub enum MergeConflictKindOutput { + DivergentInsert, + DivergentUpdate, + DeleteVsUpdate, + OrphanEdge, + UniqueViolation, + CardinalityViolation, + ValueConstraintViolation, +} + +impl MergeConflictKindOutput { + pub fn as_str(self) -> &'static str { + match self { + 
Self::DivergentInsert => "divergent_insert", + Self::DivergentUpdate => "divergent_update", + Self::DeleteVsUpdate => "delete_vs_update", + Self::OrphanEdge => "orphan_edge", + Self::UniqueViolation => "unique_violation", + Self::CardinalityViolation => "cardinality_violation", + Self::ValueConstraintViolation => "value_constraint_violation", + } + } +} + +impl From for MergeConflictKindOutput { + fn from(value: MergeConflictKind) -> Self { + match value { + MergeConflictKind::DivergentInsert => Self::DivergentInsert, + MergeConflictKind::DivergentUpdate => Self::DivergentUpdate, + MergeConflictKind::DeleteVsUpdate => Self::DeleteVsUpdate, + MergeConflictKind::OrphanEdge => Self::OrphanEdge, + MergeConflictKind::UniqueViolation => Self::UniqueViolation, + MergeConflictKind::CardinalityViolation => Self::CardinalityViolation, + MergeConflictKind::ValueConstraintViolation => Self::ValueConstraintViolation, + } + } +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct MergeConflictOutput { + pub table_key: String, + pub row_id: Option, + pub kind: MergeConflictKindOutput, + pub message: String, +} + +impl From<&MergeConflict> for MergeConflictOutput { + fn from(value: &MergeConflict) -> Self { + Self { + table_key: value.table_key.clone(), + row_id: value.row_id.clone(), + kind: value.kind.into(), + message: value.message.clone(), + } + } +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct ReadTargetOutput { + pub branch: Option, + pub snapshot: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct ReadOutput { + pub query_name: String, + pub target: ReadTargetOutput, + pub row_count: usize, + #[serde(default, skip_serializing_if = "Vec::is_empty")] + pub columns: Vec, + pub rows: Value, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct ChangeOutput { + pub branch: String, + pub query_name: String, + pub affected_nodes: usize, + pub affected_edges: usize, + pub 
actor_id: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct IngestTableOutput { + pub table_key: String, + pub rows_loaded: usize, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct IngestOutput { + pub uri: String, + pub branch: String, + pub base_branch: String, + pub branch_created: bool, + #[schema(value_type = LoadModeSchema)] + pub mode: LoadMode, + pub tables: Vec, + pub actor_id: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct CommitOutput { + pub graph_commit_id: String, + pub manifest_branch: Option, + pub manifest_version: u64, + pub parent_commit_id: Option, + pub merged_parent_commit_id: Option, + pub actor_id: Option, + /// Commit creation time as Unix epoch microseconds. + #[schema(example = 1714000000000000i64)] + pub created_at: i64, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct CommitListOutput { + pub commits: Vec, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct ReadRequest { + /// GQ query source. May declare one or more named queries; pick one with + /// `query_name` if there is more than one. + #[schema( + example = "query get_person($name: String) {\n match {\n $p: Person { name: $name }\n }\n return { $p.name, $p.age }\n}" + )] + pub query_source: String, + /// Name of the query to run when `query_source` declares multiple. Optional + /// when only one query is declared. + pub query_name: Option, + /// JSON object whose keys match the query's declared parameters. + pub params: Option, + /// Branch to read from. Mutually exclusive with `snapshot`. Defaults to `main`. + pub branch: Option, + /// Snapshot id to read from. Mutually exclusive with `branch`. + pub snapshot: Option, +} + +/// Inline read-query request for `POST /query`. +/// +/// Friendlier-named alternative to [`ReadRequest`] for ad-hoc reads and +/// AI-agent integration. 
Mutations are rejected with 400 -- use `POST +/// /mutate` (or its deprecated alias `POST /change`) for write queries. +/// Field names are deliberately short (`query`, `name`) to match the GQ +/// keyword and the CLI `-e` flag. +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct QueryRequest { + /// GQ read-query source. May declare one or more named queries; pick one + /// with `name` when more than one is declared. Mutations + /// (`insert`/`update`/`delete`) get 400 -- use `POST /mutate` (or its + /// deprecated alias `POST /change`) instead. + #[schema(example = "query get_person($name: String) {\n match {\n $p: Person { name: $name }\n }\n return { $p.name, $p.age }\n}")] + pub query: String, + /// Name of the query to run when `query` declares multiple. Optional when + /// only one query is declared. + pub name: Option, + /// JSON object whose keys match the query's declared parameters. + pub params: Option, + /// Branch to read from. Mutually exclusive with `snapshot`. Defaults to `main`. + pub branch: Option, + /// Snapshot id to read from. Mutually exclusive with `branch`. + pub snapshot: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct ChangeRequest { + /// GQ mutation source containing `insert`, `update`, or `delete` statements. + /// May declare multiple named mutations; pick one with `name`. + /// + /// Accepts the legacy field name `query_source` as a deserialization alias. + #[schema( + example = "query insert_person($name: String, $age: I32) {\n insert Person { name: $name, age: $age }\n}" + )] + #[serde(alias = "query_source")] + pub query: String, + /// Name of the mutation to run when `query` declares multiple. + /// + /// Accepts the legacy field name `query_name` as a deserialization alias. + #[serde(default, alias = "query_name")] + pub name: Option, + /// JSON object whose keys match the mutation's declared parameters. 
+ #[serde(default)] + pub params: Option, + /// Target branch. 
Defaults to `main`. + #[serde(default)] + pub branch: Option, +} + +#[derive(Debug, Clone, Default, Serialize, Deserialize, ToSchema)] +pub struct SchemaApplyRequest { + /// Project schema in `.pg` source form. The diff against the current + /// schema produces the migration steps that will be applied. + #[schema( + example = "node Person {\n name: String @key\n age: I32?\n}\n\nedge Knows: Person -> Person" + )] + pub schema_source: String, + /// When true, promote every `DropMode::Soft` step in the plan to + /// `DropMode::Hard`, making the prior column data unreachable + /// after the apply. Matches the CLI's `--allow-data-loss` flag. + /// Defaults to `false` (drops remain reversible via time travel). + #[serde(default)] + pub allow_data_loss: bool, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct SchemaApplyOutput { + pub uri: String, + pub supported: bool, + pub applied: bool, + pub step_count: usize, + pub manifest_version: u64, + #[schema(value_type = Vec)] + pub steps: Vec, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct SchemaOutput { + pub schema_source: String, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct IngestRequest { + /// Target branch. Created from `from` if it does not yet exist. Defaults to `main`. + pub branch: Option, + /// Parent branch used to create `branch` if it does not exist. Defaults to `main`. + pub from: Option, + /// How existing rows are handled. Defaults to `merge`. + #[schema(value_type = Option)] + pub mode: Option, + /// NDJSON payload: one record per line, each shaped + /// `{"type": "", "data": {...}}`. + #[schema( + example = "{\"type\": \"Person\", \"data\": {\"name\": \"Alice\", \"age\": 30}}\n{\"type\": \"Person\", \"data\": {\"name\": \"Bob\", \"age\": 25}}" + )] + pub data: String, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct ExportRequest { + /// Branch to export. Defaults to `main`. 
+ pub branch: Option, + /// Restrict the export to these node/edge type names. Empty exports all types. + #[serde(default)] + pub type_names: Vec, + /// Restrict the export to these table keys. Empty exports all tables. + #[serde(default)] + pub table_keys: Vec, +} + +#[derive(Debug, Clone, Deserialize, IntoParams)] +pub struct SnapshotQuery { + pub branch: Option, +} + +#[derive(Debug, Clone, Deserialize, IntoParams)] +pub struct CommitListQuery { + pub branch: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct HealthOutput { + pub status: String, + pub version: String, + #[serde(skip_serializing_if = "Option::is_none")] + pub source_version: Option, +} + +#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, ToSchema)] +#[serde(rename_all = "snake_case")] +pub enum ErrorCode { + Unauthorized, + Forbidden, + BadRequest, + NotFound, + /// 405 Method Not Allowed -- the route exists but the active server + /// mode doesn't serve this method (e.g. `GET /graphs` in single-graph + /// mode). Distinct from 404 so clients can tell "wrong context" from + /// "no such resource." + MethodNotAllowed, + Conflict, + /// 429 Too Many Requests -- per-actor admission cap exceeded. + /// Clients should respect the `Retry-After` header. + TooManyRequests, + Internal, +} + +/// Structured details for a publisher-level OCC failure. Surfaces alongside +/// HTTP 409 when a write was rejected because the caller's pre-write view of +/// one table's manifest version was stale relative to the current head. The +/// expected/actual fields tell the client which table to refresh. 
+#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct ManifestConflictOutput { + pub table_key: String, + pub expected: u64, + pub actual: u64, +} + +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct ErrorOutput { + pub error: String, + #[serde(skip_serializing_if = "Option::is_none")] + pub code: Option, + #[serde(default, skip_serializing_if = "Vec::is_empty")] + pub merge_conflicts: Vec, + /// Set when the conflict is a publisher CAS rejection + /// (`ManifestConflictDetails::ExpectedVersionMismatch`). The caller's + /// pre-write view of `table_key` was at version `expected` but the + /// manifest is now at `actual`. Refresh and retry. + #[serde(skip_serializing_if = "Option::is_none")] + pub manifest_conflict: Option, +} + +pub fn snapshot_payload(branch: &str, snapshot: &Snapshot) -> SnapshotOutput { + let mut entries: Vec<_> = snapshot.entries().cloned().collect(); + entries.sort_by(|a, b| a.table_key.cmp(&b.table_key)); + let tables = entries + .iter() + .map(|entry| SnapshotTableOutput { + table_key: entry.table_key.clone(), + table_path: entry.table_path.clone(), + table_version: entry.table_version, + table_branch: entry.table_branch.clone(), + row_count: entry.row_count, + }) + .collect::>(); + SnapshotOutput { + branch: branch.to_string(), + manifest_version: snapshot.version(), + tables, + } +} + +pub fn schema_apply_output(uri: &str, result: SchemaApplyResult) -> SchemaApplyOutput { + SchemaApplyOutput { + uri: uri.to_string(), + supported: result.supported, + applied: result.applied, + step_count: result.steps.len(), + manifest_version: result.manifest_version, + steps: result.steps, + } +} + +pub fn commit_output(commit: &GraphCommit) -> CommitOutput { + CommitOutput { + graph_commit_id: commit.graph_commit_id.clone(), + manifest_branch: commit.manifest_branch.clone(), + manifest_version: commit.manifest_version, + parent_commit_id: commit.parent_commit_id.clone(), + merged_parent_commit_id: 
commit.merged_parent_commit_id.clone(), + actor_id: commit.actor_id.clone(), + created_at: commit.created_at, + } +} + +pub fn read_output(query_name: String, target: &ReadTarget, result: QueryResult) -> ReadOutput { + let columns = result + .schema() + .fields() + .iter() + .map(|field| field.name().clone()) + .collect(); + ReadOutput { + query_name, + target: read_target_output(target), + row_count: result.num_rows(), + columns, + rows: result.to_rust_json(), + } +} + +pub fn ingest_output(uri: &str, result: &IngestResult, actor_id: Option) -> IngestOutput { + IngestOutput { + uri: uri.to_string(), + branch: result.branch.clone(), + base_branch: result.base_branch.clone(), + branch_created: result.branch_created, + mode: result.mode, + tables: result + .tables + .iter() + .map(|table| IngestTableOutput { + table_key: table.table_key.clone(), + rows_loaded: table.rows_loaded, + }) + .collect(), + actor_id, + } +} + +pub fn read_target_output(target: &ReadTarget) -> ReadTargetOutput { + match target { + ReadTarget::Branch(branch) => ReadTargetOutput { + branch: Some(branch.clone()), + snapshot: None, + }, + ReadTarget::Snapshot(snapshot) => ReadTargetOutput { + branch: None, + snapshot: Some(snapshot.as_str().to_string()), + }, + } +} + +// ─── MR-668 -- management endpoint shapes ────────────────────────────────── + +/// One entry in the response from `GET /graphs`. Cluster operators +/// consume this list to discover which graphs the server is currently +/// serving. The shape is intentionally minimal -- `graph_id` and `uri` +/// are the only fields a routing client needs. +#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct GraphInfo { + pub graph_id: String, + pub uri: String, +} + +/// Response from `GET /graphs`. Lists every graph registered with the +/// server in alphabetical order by `graph_id` (sorted server-side so +/// clients get deterministic output across requests). 
+#[derive(Debug, Clone, Serialize, Deserialize, ToSchema)] +pub struct GraphListResponse { + pub graphs: Vec, +} diff --git a/crates/omnigraph-server/src/config.rs b/crates/omnigraph-server/src/config.rs new file mode 100644 index 0000000..87737d0 --- /dev/null +++ b/crates/omnigraph-server/src/config.rs @@ -0,0 +1,542 @@ +use std::collections::BTreeMap; +use std::env; +use std::fs; +use std::path::{Path, PathBuf}; + +use clap::ValueEnum; +use color_eyre::eyre::{Result, bail}; +use serde::{Deserialize, Serialize}; + +pub const DEFAULT_CONFIG_FILE: &str = "omnigraph.yaml"; + +#[derive(Debug, Clone, Default, Serialize, Deserialize)] +pub struct ProjectConfig { + pub name: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct TargetConfig { + pub uri: String, + pub bearer_token_env: Option, + /// Per-graph Cedar policy file (MR-668). In single-graph mode this + /// field is unused — the top-level `policy.file` applies. In + /// multi-graph mode, each `graphs..policy.file` governs that + /// graph's HTTP-layer Cedar enforcement. + #[serde(default)] + pub policy: PolicySettings, +} + +#[derive(Debug, Clone, Copy, Default, Eq, PartialEq, Serialize, Deserialize, ValueEnum)] +#[serde(rename_all = "snake_case")] +pub enum ReadOutputFormat { + #[default] + Table, + Kv, + Csv, + Jsonl, + Json, +} + +#[derive(Debug, Clone, Copy, Default, Eq, PartialEq, Serialize, Deserialize, ValueEnum)] +#[serde(rename_all = "snake_case")] +pub enum TableCellLayout { + #[default] + Truncate, + Wrap, +} + +#[derive(Debug, Clone, Default, Serialize, Deserialize)] +pub struct CliDefaults { + #[serde(rename = "graph")] + pub graph: Option, + pub branch: Option, + pub output_format: Option, + pub table_max_column_width: Option, + pub table_cell_layout: Option, + /// Default actor identity for CLI direct-engine writes (MR-722). + /// Used when `policy.file` is configured and the operator hasn't + /// passed `--as ` on the command line.
With policy configured + /// and neither this nor `--as` set, the engine-layer footgun guard + /// fires (no silent bypass). + pub actor: Option, +} + +#[derive(Debug, Clone, Default, Serialize, Deserialize)] +pub struct ServerDefaults { + #[serde(rename = "graph")] + pub graph: Option, + pub bind: Option, + /// Server-level Cedar policy (MR-668). Governs management endpoints + /// — currently `GET /graphs`; future runtime add/remove endpoints + /// will plug in here too. In single-graph mode this is unused — the + /// top-level `policy.file` covers the single graph. + #[serde(default)] + pub policy: PolicySettings, +} + +#[derive(Debug, Clone, Default, Serialize, Deserialize)] +pub struct AuthDefaults { + pub env_file: Option, +} + +#[derive(Debug, Clone, Default, Serialize, Deserialize)] +pub struct QueryDefaults { + #[serde(default)] + pub roots: Vec, +} + +#[derive(Debug, Clone, Default, Serialize, Deserialize)] +pub struct PolicySettings { + pub file: Option, +} + +#[derive(Debug, Clone, Copy, Eq, PartialEq, Serialize, Deserialize)] +#[serde(rename_all = "snake_case")] +pub enum AliasCommand { + /// Read alias (canonical: `query`). The legacy spelling `read` is + /// kept as the variant name for back-compat with serialized configs + /// and external SDK callers; `query` is accepted on the wire via the + /// serde alias. + #[serde(alias = "query")] + Read, + /// Mutation alias (canonical: `mutate`). The legacy spelling `change` + /// is kept as the variant name for back-compat; `mutate` is accepted + /// on the wire via the serde alias.
+ #[serde(alias = "mutate")] + Change, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct AliasConfig { + pub command: AliasCommand, + pub query: String, + pub name: Option, + #[serde(default)] + pub args: Vec, + #[serde(rename = "graph")] + pub graph: Option, + pub branch: Option, + pub format: Option, +} + +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct OmnigraphConfig { + #[serde(default)] + pub project: ProjectConfig, + #[serde(default, rename = "graphs")] + pub graphs: BTreeMap, + #[serde(default)] + pub server: ServerDefaults, + #[serde(default)] + pub auth: AuthDefaults, + #[serde(default)] + pub cli: CliDefaults, + #[serde(default)] + pub query: QueryDefaults, + #[serde(default)] + pub aliases: BTreeMap, + #[serde(default)] + pub policy: PolicySettings, + #[serde(skip)] + base_dir: PathBuf, +} + +impl Default for OmnigraphConfig { + fn default() -> Self { + Self { + project: ProjectConfig::default(), + graphs: BTreeMap::new(), + server: ServerDefaults::default(), + auth: AuthDefaults::default(), + cli: CliDefaults::default(), + query: QueryDefaults::default(), + aliases: BTreeMap::new(), + policy: PolicySettings::default(), + base_dir: PathBuf::new(), + } + } +} + +impl OmnigraphConfig { + pub fn base_dir(&self) -> &Path { + &self.base_dir + } + + pub fn cli_branch(&self) -> &str { + self.cli.branch.as_deref().unwrap_or("main") + } + + pub fn cli_output_format(&self) -> ReadOutputFormat { + self.cli.output_format.unwrap_or_default() + } + + pub fn table_max_column_width(&self) -> usize { + self.cli.table_max_column_width.unwrap_or(80) + } + + pub fn table_cell_layout(&self) -> TableCellLayout { + self.cli.table_cell_layout.unwrap_or_default() + } + + pub fn cli_graph_name(&self) -> Option<&str> { + self.cli.graph.as_deref() + } + + pub fn server_graph_name(&self) -> Option<&str> { + self.server.graph.as_deref() + } + + pub fn server_bind(&self) -> &str { + self.server.bind.as_deref().unwrap_or("127.0.0.1:8080") + } + + pub fn 
resolve_target_name<'a>( + &self, + explicit_uri: Option<&str>, + explicit_target: Option<&'a str>, + default_target: Option<&'a str>, + ) -> Option<&'a str> { + explicit_target.or_else(|| { + if explicit_uri.is_some() { + None + } else { + default_target + } + }) + } + + pub fn graph_bearer_token_env( + &self, + explicit_uri: Option<&str>, + explicit_target: Option<&str>, + default_target: Option<&str>, + ) -> Option<&str> { + let target_name = + self.resolve_target_name(explicit_uri, explicit_target, default_target)?; + self.graphs + .get(target_name) + .and_then(|target| target.bearer_token_env.as_deref()) + } + + pub fn resolve_auth_env_file(&self) -> Option { + self.auth + .env_file + .as_deref() + .map(|path| self.resolve_config_path(path)) + } + + pub fn resolve_policy_file(&self) -> Option { + self.policy + .file + .as_deref() + .map(|path| self.resolve_config_path(path)) + } + + /// Resolve the per-graph policy file path for the named target, + /// relative to the config file's `base_dir`. Returns `None` if the + /// target is unknown or no per-graph `policy.file` is set. + pub fn resolve_target_policy_file(&self, target_name: &str) -> Option { + let target = self.graphs.get(target_name)?; + target + .policy + .file + .as_deref() + .map(|path| self.resolve_config_path(path)) + } + + /// Resolve the server-level policy file path (used by management + /// endpoints). Returns `None` if `server.policy.file` is not set. + pub fn resolve_server_policy_file(&self) -> Option { + self.server + .policy + .file + .as_deref() + .map(|path| self.resolve_config_path(path)) + } + + /// Resolve a raw config-supplied URI (which may be relative) to its + /// absolute form. URIs containing `://` are passed through as-is; + /// relative paths are joined with the config file's `base_dir`. 
+ pub fn resolve_uri_value(&self, value: &str) -> String { + self.resolve_config_uri(value) + } + + pub fn resolve_policy_tests_file(&self) -> Option { + let policy_file = self.resolve_policy_file()?; + Some(policy_file.with_file_name("policy.tests.yaml")) + } + + pub fn alias(&self, name: &str) -> Result<&AliasConfig> { + self.aliases + .get(name) + .ok_or_else(|| color_eyre::eyre::eyre!("alias '{}' not found", name)) + } + + pub fn resolve_target_uri( + &self, + explicit_uri: Option, + explicit_target: Option<&str>, + default_target: Option<&str>, + ) -> Result { + if let Some(uri) = explicit_uri { + return Ok(uri); + } + + let target_name = explicit_target.or(default_target).ok_or_else(|| { + color_eyre::eyre::eyre!("URI must be provided via , --target, or config") + })?; + let target = self.graphs.get(target_name).ok_or_else(|| { + color_eyre::eyre::eyre!( + "graph '{}' not found in {}", + target_name, + DEFAULT_CONFIG_FILE + ) + })?; + Ok(self.resolve_config_uri(&target.uri)) + } + + pub fn resolve_query_path(&self, query: &Path) -> Result { + if query.is_absolute() { + return Ok(query.to_path_buf()); + } + + let direct = self.base_dir.join(query); + if direct.exists() { + return Ok(direct); + } + + for root in &self.query.roots { + let candidate = self.base_dir.join(root).join(query); + if candidate.exists() { + return Ok(candidate); + } + } + + bail!("query file '{}' not found", query.display()); + } + + fn resolve_config_uri(&self, value: &str) -> String { + if value.contains("://") { + return value.to_string(); + } + + let path = Path::new(value); + if path.is_absolute() { + value.to_string() + } else { + self.base_dir.join(path).to_string_lossy().to_string() + } + } + + fn resolve_config_path(&self, value: &str) -> PathBuf { + let path = Path::new(value); + if path.is_absolute() { + path.to_path_buf() + } else { + self.base_dir.join(path) + } + } +} + +pub fn default_config_path() -> PathBuf { + PathBuf::from(DEFAULT_CONFIG_FILE) +} + +pub fn 
load_config(config_path: Option<&PathBuf>) -> Result { + load_config_in(&env::current_dir()?, config_path) +} + +fn load_config_in(cwd: &Path, config_path: Option<&PathBuf>) -> Result { + let explicit_path = config_path.cloned(); + let config_path = explicit_path.or_else(|| { + let default_path = cwd.join(DEFAULT_CONFIG_FILE); + default_path.exists().then_some(default_path) + }); + + let mut config = if let Some(path) = &config_path { + serde_yaml::from_str::(&fs::read_to_string(path)?)? + } else { + OmnigraphConfig::default() + }; + + config.base_dir = if let Some(path) = config_path { + absolute_base_dir(cwd, &path)? + } else { + cwd.to_path_buf() + }; + + Ok(config) +} + +fn absolute_base_dir(cwd: &Path, path: &Path) -> Result { + let path = if path.is_absolute() { + path.to_path_buf() + } else { + cwd.join(path) + }; + Ok(path + .parent() + .map(Path::to_path_buf) + .unwrap_or_else(|| cwd.to_path_buf())) +} + +#[cfg(test)] +mod tests { + use std::fs; + use std::path::{Path, PathBuf}; + + use tempfile::tempdir; + + use super::{ReadOutputFormat, TableCellLayout, load_config_in}; + + #[test] + fn load_config_reads_yaml_defaults_from_current_dir() { + let temp = tempdir().unwrap(); + fs::write( + temp.path().join("omnigraph.yaml"), + r#" +graphs: + local: + uri: ./demo.omni + bearer_token_env: DEMO_TOKEN +auth: + env_file: .env.omni +cli: + graph: local + branch: main + output_format: kv + table_max_column_width: 40 + table_cell_layout: wrap +policy: {} +"#, + ) + .unwrap(); + + let config = load_config_in(temp.path(), None).unwrap(); + assert_eq!(config.cli_graph_name(), Some("local")); + assert_eq!(config.cli_branch(), "main"); + assert_eq!(config.cli_output_format(), ReadOutputFormat::Kv); + assert_eq!(config.table_max_column_width(), 40); + assert_eq!(config.table_cell_layout(), TableCellLayout::Wrap); + assert_eq!( + config.graph_bearer_token_env(None, None, config.cli_graph_name()), + Some("DEMO_TOKEN") + ); + assert_eq!( + 
config.resolve_auth_env_file().unwrap(), + temp.path().join(".env.omni") + ); + assert_eq!( + PathBuf::from( + config + .resolve_target_uri(None, None, config.cli_graph_name()) + .unwrap() + ), + temp.path().join("./demo.omni") + ); + } + + #[test] + fn load_config_does_not_walk_parent_directories() { + let temp = tempdir().unwrap(); + let child = temp.path().join("child"); + fs::create_dir_all(&child).unwrap(); + fs::write( + temp.path().join("omnigraph.yaml"), + "graphs:\n local:\n uri: ./demo.omni\n", + ) + .unwrap(); + + let config = load_config_in(&child, None).unwrap(); + assert!(config.graphs.is_empty()); + } + + #[test] + fn resolve_query_path_searches_config_roots() { + let temp = tempdir().unwrap(); + fs::create_dir_all(temp.path().join("queries")).unwrap(); + fs::write( + temp.path().join("omnigraph.yaml"), + "query:\n roots:\n - queries\npolicy: {}\n", + ) + .unwrap(); + fs::write( + temp.path().join("queries").join("test.gq"), + "query q { return {} }", + ) + .unwrap(); + + let config = load_config_in(temp.path(), None).unwrap(); + let resolved = config.resolve_query_path(Path::new("test.gq")).unwrap(); + assert_eq!(resolved, temp.path().join("queries").join("test.gq")); + } + + #[test] + fn resolve_query_path_prefers_config_base_dir_over_ambient_cwd() { + let workspace = tempdir().unwrap(); + let config_dir = workspace.path().join("config"); + let ambient_dir = workspace.path().join("ambient"); + fs::create_dir_all(&config_dir).unwrap(); + fs::create_dir_all(&ambient_dir).unwrap(); + fs::write(config_dir.join("omnigraph.yaml"), "policy: {}\n").unwrap(); + fs::write(config_dir.join("local.gq"), "query local { return {} }").unwrap(); + fs::write(ambient_dir.join("local.gq"), "query ambient { return {} }").unwrap(); + + let config = + load_config_in(&ambient_dir, Some(&config_dir.join("omnigraph.yaml"))).unwrap(); + let resolved = config.resolve_query_path(Path::new("local.gq")).unwrap(); + + assert_eq!(resolved, config_dir.join("local.gq")); + } + + 
#[test] + fn policy_block_accepts_non_empty_mapping() { + let temp = tempdir().unwrap(); + fs::write( + temp.path().join("omnigraph.yaml"), + "policy:\n file: ./policy.yaml\n", + ) + .unwrap(); + + let config = load_config_in(temp.path(), None).unwrap(); + assert_eq!( + config.resolve_policy_file().unwrap(), + temp.path().join("policy.yaml") + ); + } + + #[test] + fn scoped_auth_env_ignores_default_target_when_uri_is_explicit() { + let temp = tempdir().unwrap(); + fs::write( + temp.path().join("omnigraph.yaml"), + r#" +graphs: + demo: + uri: https://example.com + bearer_token_env: DEMO_TOKEN +cli: + graph: demo +"#, + ) + .unwrap(); + + let config = load_config_in(temp.path(), None).unwrap(); + assert_eq!( + config.graph_bearer_token_env( + Some("https://override.example.com"), + None, + config.cli_graph_name() + ), + None + ); + assert_eq!( + config.graph_bearer_token_env( + Some("https://override.example.com"), + Some("demo"), + config.cli_graph_name() + ), + Some("DEMO_TOKEN") + ); + } +} diff --git a/crates/omnigraph-server/src/handlers.rs b/crates/omnigraph-server/src/handlers.rs deleted file mode 100644 index 1571164..0000000 --- a/crates/omnigraph-server/src/handlers.rs +++ /dev/null @@ -1,1758 +0,0 @@ -//! HTTP route handlers, the bearer-auth middleware, per-request -//! authorization, and the cluster-prefix OpenAPI rewrite (moved -//! verbatim from lib.rs in the modularization). - -use super::*; - -/// Liveness probe. -/// -/// Returns server status and version. Unauthenticated; safe to call from any -/// caller. Use this to confirm the server is reachable before invoking other -/// endpoints. 
-#[utoipa::path( - get, - path = "/healthz", - tag = "health", - operation_id = "health", - responses( - (status = 200, description = "Server is healthy", body = HealthOutput), - ), -)] -pub(crate) async fn server_health() -> Json { - Json(HealthOutput { - status: "ok".to_string(), - version: SERVER_VERSION.to_string(), - source_version: SERVER_SOURCE_VERSION.map(str::to_string), - }) -} - -#[utoipa::path( - get, - path = "/graphs", - tag = "management", - operation_id = "listGraphs", - responses( - (status = 200, description = "List of registered graphs", body = GraphListResponse), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - (status = 405, description = "Method not allowed (single-graph mode)", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// List every graph currently registered with this server (MR-668). -/// -/// Multi-graph mode only. In single mode, the route returns 405 — there's -/// no registry to enumerate. Cedar-gated by the server-level policy via -/// the `graph_list` action against `Omnigraph::Server::"root"`. -/// -/// Order: alphabetical by `graph_id` (server-sorted so clients see -/// deterministic output across requests). -pub(crate) async fn server_graphs_list( - State(state): State, - actor: Option>, -) -> std::result::Result, ApiError> { - let registry = &state.routing().registry; - - // Server-level Cedar gate. `state.server_policy` is loaded from the - // cluster-scoped policy bundle at startup. When no server policy is - // configured, `authorize_request_server` falls through to the MR-723 - // default-deny semantics (every non-Read action denied for an - // authenticated actor). `GraphList` is not `Read`, so without a server - // policy the request gets 403 — which is the right default (don't leak - // the registry until the operator explicitly authorizes it).
- authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - state.server_policy.as_deref(), - PolicyRequest { - action: PolicyAction::GraphList, - branch: None, - target_branch: None, - }, - )?; - - let mut graphs: Vec = registry - .list() - .into_iter() - .map(|handle| GraphInfo { - graph_id: handle.key.graph_id.as_str().to_string(), - uri: handle.uri.clone(), - }) - .collect(); - graphs.sort_by(|a, b| a.graph_id.cmp(&b.graph_id)); - Ok(Json(GraphListResponse { graphs })) -} - -pub(crate) async fn server_openapi(State(state): State) -> Json { - // `served_openapi` is the single nesting source — the protected - // routes always live under `/graphs/{graph_id}/...` (public/management - // paths `/healthz`, `/graphs` stay flat). Building from it here means - // the runtime spec and the committed `openapi.json` share one nesting - // pass and can't drift. - let mut doc = crate::served_openapi(); - if !state.requires_bearer_auth() { - strip_security(&mut doc); - } - Json(doc) -} - -/// Path prefix used to namespace per-graph routes in multi mode. -/// Kept in sync with the `Router::nest(...)` invocation in `build_app`. -const CLUSTER_PATH_PREFIX: &str = "/graphs/{graph_id}"; - -/// Operation-id prefix applied to every cloned cluster operation. -/// Decision 7 in the implementation plan — keeps operation IDs unique -/// across the spec when both flat and nested variants ever appear in -/// the same generation pass. -const CLUSTER_OPERATION_ID_PREFIX: &str = "cluster_"; - -/// Paths that stay flat in every server mode (public or server-level, -/// no per-graph dependency). Update this list when adding new -/// always-flat endpoints. `/graphs` is the management enumeration — -/// it lives at the root in both single mode (405) and multi mode, and -/// must never be rewritten to `/graphs/{graph_id}/graphs`.
-const ALWAYS_FLAT_PATHS: &[&str] = &["/healthz", "/graphs"]; - -/// In multi-mode `server_openapi`, every protected path-item is -/// reattached under the cluster prefix. Operation IDs gain the -/// `cluster_` prefix so SDK generators don't collide if/when both -/// surfaces are merged. Every rewritten operation also declares the -/// required `{graph_id}` path parameter so the served OpenAPI document -/// remains internally valid. -/// -/// Removing the flat protected paths matches the runtime router — -/// in multi mode, requests to `/snapshot` etc. return 404, so the -/// spec must agree. -pub(crate) fn nest_paths_under_cluster_prefix(doc: &mut utoipa::openapi::OpenApi) { - let original = std::mem::take(&mut doc.paths.paths); - let mut rewritten = std::collections::BTreeMap::new(); - for (path, mut item) in original { - if ALWAYS_FLAT_PATHS.contains(&path.as_str()) { - rewritten.insert(path, item); - continue; - } - rename_operation_ids(&mut item, CLUSTER_OPERATION_ID_PREFIX); - add_cluster_graph_id_parameter(&mut item); - let new_path = format!("{CLUSTER_PATH_PREFIX}{path}"); - rewritten.insert(new_path, item); - } - doc.paths.paths = rewritten; -} - -pub(crate) fn add_cluster_graph_id_parameter(item: &mut utoipa::openapi::PathItem) { - for op in path_item_operations_mut(item) { - let parameters = op.parameters.get_or_insert_with(Vec::new); - let has_graph_id = parameters - .iter() - .any(|param| param.name == "graph_id" && param.parameter_in == ParameterIn::Path); - if !has_graph_id { - parameters.insert(0, graph_id_path_parameter()); - } - } -} - -pub(crate) fn graph_id_path_parameter() -> Parameter { - let mut parameter = Parameter::new("graph_id"); - parameter.parameter_in = ParameterIn::Path; - parameter.description = Some("Graph id to route the request to.".to_string()); - parameter.schema = Some(Object::with_type(Type::String).into()); - parameter -} - -/// Prefix every operation_id in this PathItem with `prefix`.
-pub(crate) fn rename_operation_ids(item: &mut utoipa::openapi::PathItem, prefix: &str) { - for op in path_item_operations_mut(item) { - if let Some(id) = op.operation_id.as_deref() { - op.operation_id = Some(format!("{prefix}{id}")); - } - } -} - -pub(crate) fn path_item_operations_mut( - item: &mut utoipa::openapi::PathItem, -) -> impl Iterator { - [ - item.get.as_mut(), - item.post.as_mut(), - item.put.as_mut(), - item.delete.as_mut(), - item.options.as_mut(), - item.head.as_mut(), - item.patch.as_mut(), - item.trace.as_mut(), - ] - .into_iter() - .flatten() -} - -pub(crate) fn strip_security(doc: &mut utoipa::openapi::OpenApi) { - if let Some(components) = doc.components.as_mut() { - components.security_schemes.clear(); - } - for path_item in doc.paths.paths.values_mut() { - for op in [ - path_item.get.as_mut(), - path_item.post.as_mut(), - path_item.put.as_mut(), - path_item.delete.as_mut(), - path_item.options.as_mut(), - path_item.head.as_mut(), - path_item.patch.as_mut(), - path_item.trace.as_mut(), - ] - .into_iter() - .flatten() - { - op.security = None; - } - } -} - -pub(crate) async fn require_bearer_auth( - State(state): State, - mut request: Request, - next: Next, -) -> std::result::Result { - if !state.requires_bearer_auth() { - return Ok(next.run(request).await); - } - - let Some(header) = request - .headers() - .get(AUTHORIZATION) - .and_then(|value| value.to_str().ok()) - else { - return Err(ApiError::unauthorized("missing bearer token")); - }; - - let Some(provided_token) = header.strip_prefix("Bearer ") else { - return Err(ApiError::unauthorized("missing bearer token")); - }; - - let Some(actor) = state.authenticate_bearer_token(provided_token) else { - return Err(ApiError::unauthorized("invalid bearer token")); - }; - request.extensions_mut().insert(actor); - - Ok(next.run(request).await) -} - -/// Routing middleware (RFC-011 cluster-only). 
Resolves the active graph -/// for the request and injects `Arc` as an extension so -/// handlers can extract it via `Extension>`. -/// -/// Routes are always nested under `/graphs/{graph_id}/...`. The -/// middleware extracts `{graph_id}` from the URI path and looks it up in -/// the registry. Returns 404 if the graph is not registered. -/// -/// The middleware fires AFTER `require_bearer_auth`, so the actor is -/// already in the request extensions (or auth was off entirely). -pub(crate) async fn resolve_graph_handle( - State(state): State, - mut request: Request, - next: Next, -) -> std::result::Result { - let registry = &state.routing.registry; - // `Router::nest("/graphs/{graph_id}", inner)` rewrites - // `request.uri().path()` to the inner suffix (e.g. `/snapshot`). - // The pre-rewrite URI is preserved in the `OriginalUri` - // request extension by axum's router; we read from there to - // extract `{graph_id}`. Fall back to the current URI only if - // the extension is missing, which shouldn't happen for - // nested routes but is safe defensive code. - let original_path: String = request - .extensions() - .get::() - .map(|OriginalUri(uri)| uri.path().to_string()) - .unwrap_or_else(|| request.uri().path().to_string()); - let graph_id_str = original_path - .strip_prefix("/graphs/") - .and_then(|rest| rest.split('/').next()) - .filter(|s| !s.is_empty()) - .ok_or_else(|| { - ApiError::bad_request("cluster route missing /graphs/{graph_id} prefix".to_string()) - })?; - let graph_id = GraphId::try_from(graph_id_str.to_string()) - .map_err(|err| ApiError::bad_request(err.to_string()))?; - let key = GraphKey::cluster(graph_id.clone()); - let handle = match registry.get(&key) { - RegistryLookup::Ready(handle) => handle, - RegistryLookup::Gone => { - return Err(ApiError::not_found(format!("graph '{graph_id}' not found"))); - } - }; - - // Per-request observability. 
`Span::current().record` would silently - // no-op here because no upstream `#[tracing::instrument(...)]` macro - // declares a `graph_id` field; emit an explicit event instead so the - // routing decision actually lands in logs. - info!(graph_id = %handle.key.graph_id, "graph routed"); - - request.extensions_mut().insert(handle); - Ok(next.run(request).await) -} - -pub(crate) fn log_policy_decision(actor_id: &str, request: &PolicyRequest, decision: &PolicyDecision) { - info!( - actor_id = actor_id, - action = %request.action, - branch = request.branch.as_deref().unwrap_or(""), - target_branch = request.target_branch.as_deref().unwrap_or(""), - allowed = decision.allowed, - matched_rule_id = decision.matched_rule_id.as_deref().unwrap_or(""), - "policy decision" - ); -} - -/// The allow/deny **decision** an authorization check produces, kept -/// separate from the operational failures (`Err`) that can occur while -/// computing it. [`authorize_request`] collapses `Denied` to a 403; a caller -/// that needs to remap a denial without also remapping operational failures -/// (the stored-query invoke handler hides a denial as a 404) matches on this -/// directly, so a real 401 (missing bearer) or 500 (policy-evaluation error) -/// keeps its true status instead of being masked as the denial's response. -pub(crate) enum Authz { - Allowed, - Denied(String), -} - -/// HTTP-layer Cedar policy gate, returning the allow/deny [`Authz`] decision -/// and reserving `Err` for operational failures (401 missing bearer, 500 -/// policy-evaluation error). Two sources of the policy engine: -/// * Per-graph handler — passes `handle.policy.as_deref()` so the -/// graph's Cedar rules govern read/change/branch_*/schema_apply. -/// * Management handler — passes `state.server_policy.as_deref()` so -/// server-level Cedar rules govern `graph_list` (the only shipped -/// server-scoped action; runtime `graph_create` / `graph_delete` -/// are deferred until a managed cluster catalog lands).
-/// -/// The MR-731 invariant lives inside this function: actor identity is -/// supplied as a separate argument from the resolved bearer match. The -/// `PolicyRequest` struct itself does not carry identity (the field was -/// dropped from the type), so handlers cannot smuggle it through the -/// request. See `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers` -/// at `tests/server.rs`. -pub(crate) fn authorize( - actor: Option<&ResolvedActor>, - policy: Option<&PolicyEngine>, - request: PolicyRequest, -) -> std::result::Result { - let Some(engine) = policy else { - // No PolicyEngine installed. Three runtime states can reach this: - // - // * **Open mode** (`--unauthenticated`): no tokens, no policy. - // Per-graph operations are open by operator opt-in (they - // accepted "trust the network" for graph data). - // * **DefaultDeny mode**: tokens configured but no policy. The - // request went through bearer auth, so `actor` is Some. Only - // per-graph `Read` is permitted; other per-graph actions - // return 403. Closes the "configured auth but forgot the - // policy file" trap from MR-723. - // * Either of the above with a **server-scoped** action - // (`graph_list`, future `graph_create`/`graph_delete`). - // - // Server-scoped actions are always denied here, regardless of - // mode or actor presence. The management surface leaks server - // topology (graph IDs + URIs that may contain S3 bucket paths - // or internal hostnames) — operators who opted into Open mode - // accepted exposure of graph DATA, not exposure of server - // topology. Closing the management surface by default in every - // runtime state means the docstring contract on - // `server_graphs_list` ("don't leak the registry until the - // operator explicitly authorizes it") holds uniformly; the - // operator's only path to enabling it is configuring a - // cluster-scoped policy bundle, applying the cluster, and - // restarting the server.
- if request.action.resource_kind() == PolicyResourceKind::Server { - return Ok(Authz::Denied( - "server-scoped actions require an explicit cluster policy bundle \ - applied with `omnigraph cluster apply` and served after restart — \ - the management surface is closed by default in every runtime state, \ - including --unauthenticated, so that server topology is never exposed \ - without operator opt-in." - .to_string(), - )); - } - if actor.is_some() && request.action != PolicyAction::Read { - return Ok(Authz::Denied( - "server runs in default-deny mode (bearer tokens configured but no \ - applied policy bundle). Only `read` actions are permitted; configure \ - a graph or cluster policy bundle in the cluster config, run \ - `omnigraph cluster apply`, and restart the server to enable other actions." - .to_string(), - )); - } - return Ok(Authz::Allowed); - }; - let Some(actor) = actor else { - return Err(ApiError::unauthorized("missing bearer token")); - }; - // SECURITY INVARIANT (MR-731): actor identity is supplied to the - // policy engine here as a separate argument, sourced from the - // bearer-token match resolved by `require_bearer_auth`. The - // `PolicyRequest` struct itself no longer carries `actor_id` (it - // was dropped from the type), so handlers cannot smuggle identity - // through the request body and there is no overwrite step that - // could be skipped. The principle is codified in - // `docs/dev/invariants.md` Hard Invariant 11 ("clients cannot set - // actor identity directly") and pinned by the regression test - // `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers` - // in `crates/omnigraph-server/tests/server.rs`.
- let actor_id = actor.actor_id.as_ref(); - let decision = engine - .authorize(actor_id, &request) - .map_err(|err| ApiError::internal(format!("policy: {err}")))?; - log_policy_decision(actor_id, &request, &decision); - if decision.allowed { - Ok(Authz::Allowed) - } else { - Ok(Authz::Denied(decision.message)) - } -} - -/// Thin wrapper over [`authorize`] for the handlers that treat any denial as a -/// 403: a denial becomes `ApiError::forbidden`, and operational failures -/// (401 missing bearer, 500 policy-evaluation error) propagate unchanged. The -/// stored-query invoke handler does **not** use this — it consumes the -/// [`Authz`] decision directly to hide a denial as a 404 while letting an -/// operational failure keep its true status. -pub(crate) fn authorize_request( - actor: Option<&ResolvedActor>, - policy: Option<&PolicyEngine>, - request: PolicyRequest, -) -> std::result::Result<(), ApiError> { - match authorize(actor, policy, request)? { - Authz::Allowed => Ok(()), - Authz::Denied(message) => Err(ApiError::forbidden(message)), - } -} - -#[utoipa::path( - get, - path = "/snapshot", - tag = "snapshots", - operation_id = "getSnapshot", - params(SnapshotQuery), - responses( - (status = 200, description = "Database snapshot", body = api::SnapshotOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Read the current snapshot of a branch. -/// -/// Returns the manifest version plus per-table metadata (path, version, row -/// count) for every table on the branch. Defaults to `main` when `branch` is -/// omitted. Read-only.
-pub(crate) async fn server_snapshot( - Extension(handle): Extension>, - actor: Option>, - Query(query): Query, -) -> std::result::Result, ApiError> { - let branch = query.branch.unwrap_or_else(|| "main".to_string()); - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::Read, - branch: Some(branch.clone()), - target_branch: None, - }, - )?; - let snapshot = { - let db = &handle.engine; - db.snapshot_of(ReadTarget::branch(branch.as_str())) - .await - .map_err(ApiError::from_omni)? - }; - Ok(Json(snapshot_payload(&branch, &snapshot))) -} - -/// Header values that flag a response as coming from a deprecated route -/// (RFC 9745 / RFC 8288) and point at the canonical successor. -pub(crate) fn deprecation_headers(successor_link: &'static str) -> [(HeaderName, HeaderValue); 2] { - [ - ( - HeaderName::from_static("deprecation"), - HeaderValue::from_static("true"), - ), - ( - HeaderName::from_static("link"), - HeaderValue::from_static(successor_link), - ), - ] -} - -#[utoipa::path( - post, - path = "/read", - tag = "queries", - operation_id = "read", - request_body = ReadRequest, - responses( - (status = 200, description = "Query results (response includes `Deprecation: true` + `Link: ; rel=\"successor-version\"`)", body = ReadOutput), - (status = 400, description = "Bad request", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -#[deprecated(note = "use POST /query instead; /read is kept indefinitely for byte-stable back-compat")] -/// **Deprecated** — use [`POST /query`](#tag/queries/operation/query) instead. -/// -/// Execute a GQ read query. Behavior is unchanged from prior releases; the -/// route is kept indefinitely for byte-stable back-compat.
New integrations -/// should target `POST /query`, which has clean field names (`query` / -/// `name`) and a 400-on-mutation guard. Responses from this route include -/// `Deprecation: true` and `Link: ; rel="successor-version"` -/// headers per RFC 9745 / RFC 8288 so SDKs and proxies can surface the -/// signal. -pub(crate) async fn server_read( - Extension(handle): Extension>, - actor: Option>, - Json(request): Json, -) -> std::result::Result<([(HeaderName, HeaderValue); 2], Json), ApiError> { - let (selected_name, target, result) = run_query( - handle, - actor.as_ref().map(|Extension(actor)| actor), - &request.query_source, - request.query_name.as_deref(), - request.params.as_ref(), - request.branch, - request.snapshot, - false, // /read predates the D2 rule; legacy callers may submit mutating queries here - ) - .await?; - Ok(( - deprecation_headers("; rel=\"successor-version\""), - Json(api::read_output(selected_name, &target, result)), - )) -} - -#[utoipa::path( - post, - path = "/query", - tag = "queries", - operation_id = "query", - request_body = QueryRequest, - responses( - (status = 200, description = "Query results", body = ReadOutput), - (status = 400, description = "Bad request - also returned when the query body contains mutations; use POST /mutate (or its deprecated alias POST /change) for write queries", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Execute an inline read query (friendlier-named alternative to `POST /read`). -/// -/// Designed for ad-hoc exploration and AI-agent tool-use: short field -/// names (`query`, `name`) match the CLI `-e` flag and the GQ `query` -/// keyword. Mutations (`insert`/`update`/`delete`) are rejected with 400 -/// -- use `POST /mutate` (or its deprecated alias `POST /change`) for -/// write queries. 
Otherwise behaves identically to `POST /read`: same -/// target semantics (branch xor snapshot), same Cedar action (Read), -/// same response shape. -pub(crate) async fn server_query( - Extension(handle): Extension>, - actor: Option>, - Json(request): Json, -) -> std::result::Result, ApiError> { - let (selected_name, target, result) = run_query( - handle, - actor.as_ref().map(|Extension(actor)| actor), - &request.query, - request.name.as_deref(), - request.params.as_ref(), - request.branch, - request.snapshot, - true, // /query is read-only; reject mutations - ) - .await?; - Ok(Json(api::read_output(selected_name, &target, result))) -} - -#[utoipa::path( - post, - path = "/export", - tag = "queries", - operation_id = "export", - request_body = ExportRequest, - responses( - (status = 200, description = "Exported data as NDJSON", content_type = "application/x-ndjson"), - (status = 400, description = "Bad request", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Stream the contents of a branch as NDJSON. -/// -/// Emits one JSON object per line (`application/x-ndjson`). Filter with -/// `type_names` (node/edge type names) and/or `table_keys`; both empty -/// streams the entire branch. Suitable for large exports -- the response is -/// streamed, not buffered. Read-only.
-pub(crate) async fn server_export( - Extension(handle): Extension>, - actor: Option>, - Json(request): Json, -) -> std::result::Result { - let branch = request.branch.unwrap_or_else(|| "main".to_string()); - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::Export, - branch: Some(branch.clone()), - target_branch: None, - }, - )?; - let engine = Arc::clone(&handle.engine); - let type_names = request.type_names.clone(); - let table_keys = request.table_keys.clone(); - let (tx, rx) = mpsc::unbounded_channel::>(); - tokio::spawn(async move { - let result = { - let mut writer = ExportStreamWriter { sender: tx.clone() }; - engine - .export_jsonl_to_writer(&branch, &type_names, &table_keys, &mut writer) - .await - }; - if let Err(err) = result { - let _ = tx.send(Err(io::Error::other(err.to_string()))); - } - }); - let body = Body::from_stream(stream::unfold(rx, |mut rx| async move { - rx.recv().await.map(|item| (item, rx)) - })); - Ok(( - StatusCode::OK, - [(CONTENT_TYPE, "application/x-ndjson; charset=utf-8")], - body, - ) - .into_response()) -} - -/// Shared implementation behind `POST /mutate` (canonical) and -/// `POST /change` (deprecated alias). Returns the bare `ChangeOutput`; -/// each route handler wraps it (the alias also attaches Deprecation -/// headers). -/// Shared backend for `/mutate` (canonical) and `/change` (deprecated alias). -/// -/// Decoupled from `ChangeRequest` so MR-969's `/queries/{name}` stored-query -/// handler can call this directly with registry-supplied fields without -/// rebuilding the request body. Today's HTTP handlers unpack the request and -/// call here; the registry would do the same. 
-pub(crate) async fn run_mutate( - state: AppState, - handle: Arc, - actor: Option<&ResolvedActor>, - query: &str, - name: Option<&str>, - params_json: Option<&Value>, - branch: String, -) -> std::result::Result { - let actor_arc = actor - .map(|a| Arc::clone(&a.actor_id)) - .unwrap_or_else(|| Arc::::from("anonymous")); - let actor_id = actor.map(|a| a.actor_id.as_ref()); - authorize_request( - actor, - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::Change, - branch: Some(branch.clone()), - target_branch: None, - }, - )?; - // Per-actor admission: bound concurrent in-flight mutations and - // estimated bytes per actor. Cedar runs FIRST so denied requests - // don't consume admission slots. Estimate uses the request body - // size as a coarse proxy; engine memory pressure can run higher. - let est_bytes = query.len() as u64 - + params_json - .map(|p| p.to_string().len() as u64) - .unwrap_or(0); - let _admission = state - .workload - .try_admit(&actor_arc, est_bytes) - .map_err(ApiError::from_workload_reject)?; - let (selected_name, query_params) = - select_named_query(query, name).map_err(|err| ApiError::bad_request(err.to_string()))?; - let params = query_params_from_json(&query_params, params_json) - .map_err(|err| ApiError::bad_request(err.to_string()))?; - - let result = { - let db = &handle.engine; - db.mutate_as(&branch, query, &selected_name, &params, actor_id) - .await - .map_err(ApiError::from_omni)? - }; - Ok(ChangeOutput { - branch, - query_name: selected_name, - affected_nodes: result.affected_nodes, - affected_edges: result.affected_edges, - actor_id: actor_id.map(str::to_string), - }) -} - -/// Shared backend for `/query` (canonical) and `/read` (deprecated alias). -/// -/// Mirrors [`run_mutate`]'s decoupled shape so MR-969's stored-query handler -/// can call here with registry-supplied fields. Rejects inline source that -/// contains mutations (D2 rule); callers wanting writes go through -/// [`run_mutate`] instead.
-/// -/// Intentionally does **not** take [`AppState`] (unlike [`run_mutate`]): -/// reads are not admission-gated today, so there is no `state.workload` -/// consumer. The signature grows the parameter when Phase 1 (MR-976) adds -/// the request envelope's `expect: { max_rows_scanned: N }` budget, or -/// MR-969 extends per-actor admission to stored-read invocations. -pub(crate) async fn run_query( - handle: Arc, - actor: Option<&ResolvedActor>, - query: &str, - name: Option<&str>, - params_json: Option<&Value>, - branch: Option, - snapshot: Option, - reject_mutations: bool, -) -> std::result::Result<(String, ReadTarget, omnigraph_compiler::result::QueryResult), ApiError> { - if branch.is_some() && snapshot.is_some() { - return Err(ApiError::bad_request( - "request may specify branch or snapshot, not both", - )); - } - - let target = read_target_from_request(branch, snapshot); - let policy_branch = match &target { - ReadTarget::Branch(branch) => Some(branch.clone()), - ReadTarget::Snapshot(_) if handle.policy.is_some() && actor.is_some() => { - let db = &handle.engine; - db.resolved_branch_of(target.clone()) - .await - .map(|branch| branch.or_else(|| Some("main".to_string()))) - .map_err(ApiError::from_omni)? 
- } - ReadTarget::Snapshot(_) => None, - }; - authorize_request( - actor, - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::Read, - branch: policy_branch, - target_branch: None, - }, - )?; - let query_decl = - select_named_query_decl(query, name).map_err(|err| ApiError::bad_request(err.to_string()))?; - if reject_mutations && !query_decl.mutations.is_empty() { - return Err(ApiError::bad_request(format!( - "query '{}' contains mutations (insert/update/delete); use POST /mutate for write queries", - query_decl.name - ))); - } - let selected_name = query_decl.name.clone(); - let params = query_params_from_json(&query_decl.params, params_json) - .map_err(|err| ApiError::bad_request(err.to_string()))?; - - let result = { - let db = &handle.engine; - db.query(target.clone(), query, &selected_name, &params) - .await - .map_err(ApiError::from_omni)? - }; - Ok((selected_name, target, result)) -} - -#[utoipa::path( - post, - path = "/change", - tag = "mutations", - operation_id = "change", - request_body = ChangeRequest, - responses( - (status = 200, description = "Mutation results (response includes `Deprecation: true` + `Link: ; rel=\"successor-version\"`)", body = ChangeOutput), - (status = 400, description = "Bad request", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - (status = 409, description = "Merge conflict", body = ErrorOutput), - (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -#[deprecated(note = "use POST /mutate instead; /change is kept indefinitely for back-compat")] -/// **Deprecated** -- use [`POST /mutate`](#tag/mutations/operation/mutate) instead. -/// -/// Apply a GQ mutation to a branch. Behavior is unchanged; the route is -/// kept indefinitely for back-compat.
New integrations should target -/// `POST /mutate`, which has identical semantics and a name that pairs -/// cleanly with `POST /query`. Responses from this route include -/// `Deprecation: true` and `Link: ; rel="successor-version"` -/// headers per RFC 9745 / RFC 8288 so SDKs and proxies can surface the -/// signal. -pub(crate) async fn server_change( - State(state): State, - Extension(handle): Extension>, - actor: Option>, - Json(request): Json, -) -> std::result::Result<([(HeaderName, HeaderValue); 2], Json), ApiError> { - let branch = request.branch.unwrap_or_else(|| "main".to_string()); - let output = run_mutate( - state, - handle, - actor.as_ref().map(|Extension(actor)| actor), - &request.query, - request.name.as_deref(), - request.params.as_ref(), - branch, - ) - .await?; - Ok(( - deprecation_headers("; rel=\"successor-version\""), - Json(output), - )) -} - -#[utoipa::path( - post, - path = "/mutate", - tag = "mutations", - operation_id = "mutate", - request_body = ChangeRequest, - responses( - (status = 200, description = "Mutation results", body = ChangeOutput), - (status = 400, description = "Bad request", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - (status = 409, description = "Merge conflict", body = ErrorOutput), - (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Apply a GQ mutation to a branch (canonical mutation endpoint). -/// -/// Writes to the named `branch` (defaults to `main`). Mutations are atomic -/// per call and produce a new commit. Returns counts of nodes and edges -/// affected. **Destructive**: on success the branch is updated; rejected -/// mutations may still acquire locks briefly. Returns 409 on merge conflict. -/// -/// Pairs with `POST /query` (read-only). 
The legacy `POST /change` route -/// has identical semantics and is kept as a deprecated alias. -pub(crate) async fn server_mutate( - State(state): State, - Extension(handle): Extension>, - actor: Option>, - Json(request): Json, -) -> std::result::Result, ApiError> { - let branch = request.branch.unwrap_or_else(|| "main".to_string()); - Ok(Json( - run_mutate( - state, - handle, - actor.as_ref().map(|Extension(actor)| actor), - &request.query, - request.name.as_deref(), - request.params.as_ref(), - branch, - ) - .await?, - )) -} - -/// Path parameter for `POST /queries/{name}`. -#[derive(Deserialize)] -pub(crate) struct QueryNamePath { - name: String, -} - -pub(crate) fn parse_optional_invoke_body( - body: Bytes, -) -> std::result::Result { - if body.is_empty() { - return Ok(InvokeStoredQueryRequest::default()); - } - serde_json::from_slice::>(&body) - .map(|request| request.unwrap_or_default()) - .map_err(|err| { - ApiError::bad_request(format!("invalid stored-query invocation body: {err}")) - }) -} - -#[utoipa::path( - post, - path = "/queries/{name}", - tag = "queries", - operation_id = "invoke_query", - params(("name" = String, Path, description = "Stored query name (the registry key)")), - request_body = Option, - responses( - (status = 200, description = "Read envelope (ReadOutput) or mutation envelope (ChangeOutput), serialized untagged", body = InvokeStoredQueryResponse), - (status = 400, description = "Bad request (param type error; snapshot on a stored mutation)", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden (the inner `change` gate for a stored mutation)", body = ErrorOutput), - (status = 404, description = "Unknown stored query, or `invoke_query` denied -- indistinguishable to a caller without the grant", body = ErrorOutput), - (status = 409, description = "Merge conflict", body = ErrorOutput), - (status = 429, description = "Per-actor admission cap exceeded; honor
`Retry-After` header", body = ErrorOutput), - (status = 500, description = "Policy evaluation error (a denial is reported as 404, not 500)", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Invoke a curated, server-side stored query by name. -/// -/// The query source comes from the graph's `queries:` registry, not the -/// request body -- callers send only runtime inputs (`params`, `branch`, -/// `snapshot`). Gated by the `invoke_query` Cedar action at the boundary; -/// a stored *mutation* additionally passes the engine's `change` gate -/// (double-gated). An actor **without** `invoke_query` cannot tell a denied -/// query from a missing one -- both return the same 404, so the catalog -/// can't be probed without the grant. Once `invoke_query` is held, the -/// inner `read`/`change` gate may surface a 403 for an existing query the -/// actor can't run (the intended double-gate signal). -pub(crate) async fn server_invoke_query( - State(state): State, - Extension(handle): Extension>, - actor: Option>, - Path(QueryNamePath { name }): Path, - body: Bytes, -) -> std::result::Result, ApiError> { - let req = parse_optional_invoke_body(body)?; - // A caller without `invoke_query` can't tell a denial from a missing - // query: both 404 with this exact message, so the catalog can't be - // probed without the grant. (A caller that holds invoke_query may still - // see the inner gate's 403 for an existing query it can't run -- intended.) - const NOT_FOUND: &str = "stored query not found"; - let actor_ref = actor.as_ref().map(|Extension(actor)| actor); - - // Boundary gate (authentication already ran in `require_bearer_auth`). - // A denial is hidden as 404 (deny == missing, so the catalog can't be - // probed without the grant), but operational failures (401 missing bearer, - // 500 policy-evaluation error) propagate with their true status via `?` - // rather than being masked as a missing query.
- match authorize( - actor_ref, - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::InvokeQuery, - // Graph-scoped: no branch dimension. The per-branch/snapshot - // access is enforced by the inner read/change gate in the - // runner, so the outer gate must not resolve a branch (doing so - // was wrong for snapshot reads). - branch: None, - target_branch: None, - }, - )? { - Authz::Allowed => {} - Authz::Denied(_) => return Err(ApiError::not_found(NOT_FOUND)), - } - - // Resolve against the per-graph registry (same 404 on a miss). - let stored = handle - .queries - .as_ref() - .and_then(|registry| registry.lookup(&name)) - .ok_or_else(|| ApiError::not_found(NOT_FOUND))?; - - // Detach what we need before `handle` moves into the runner -- the - // registry borrow lives inside `handle`. - let source = Arc::clone(&stored.source); - let query_name = stored.name.clone(); - let is_mutation = stored.is_mutation(); - - // RFC-011 D3: the CLI verb asserts the stored query's kind. `query ` - // sends `expect_mutation: false`, `mutate ` sends `true`; a mismatch - // is rejected here so the wrong verb errors instead of silently running.
- if let Some(expected) = req.expect_mutation { - if expected != is_mutation { - let (actual, verb) = if is_mutation { - ("mutation", "mutate") - } else { - ("read", "query") - }; - return Err(ApiError::bad_request(format!( - "'{query_name}' is a {actual} -- use omnigraph {verb} {query_name}" - ))); - } - } - - info!( - graph = %handle.uri, - actor = ?actor_ref.map(|a| a.actor_id.as_ref()), - query = %query_name, - kind = if is_mutation { "mutate" } else { "read" }, - "stored query invoked" - ); - - if is_mutation { - if req.snapshot.is_some() { - return Err(ApiError::bad_request( - "stored mutation cannot target a snapshot", - )); - } - let branch = req.branch.unwrap_or_else(|| "main".to_string()); - let output = run_mutate( - state, - handle, - actor_ref, - &source, - Some(&query_name), - req.params.as_ref(), - branch, - ) - .await?; - Ok(Json(InvokeStoredQueryResponse::Change(output))) - } else { - let (selected, target, result) = run_query( - handle, - actor_ref, - &source, - Some(&query_name), - req.params.as_ref(), - req.branch, - req.snapshot, - true, - ) - .await?; - Ok(Json(InvokeStoredQueryResponse::Read(api::read_output( - selected, &target, result, - )))) - } -} - -#[utoipa::path( - get, - path = "/queries", - tag = "queries", - operation_id = "list_queries", - responses( - (status = 200, description = "Stored-query catalog (every stored query, with typed params)", body = QueriesCatalogOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// List the graph's exposed stored queries as a typed tool catalog. -/// -/// Returns every stored query in the `queries:` registry, each -/// with its MCP tool name, read/mutate flag, description/instruction, and -/// typed parameters -- enough for a client to register them as tools without -/// fetching `.gq` source.
Cluster-served graphs have no per-query expose flag, -/// so the catalog lists them all. Read-gated; the catalog is graph-wide (branch -/// independent -- `read` is authorized against `main`). **Not** Cedar-filtered -/// per query yet, so it can list a query whose `invoke_query` the caller -/// lacks (a known gap until per-query authorization lands). -pub(crate) async fn server_list_queries( - Extension(handle): Extension>, - actor: Option>, -) -> std::result::Result, ApiError> { - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::Read, - branch: Some("main".to_string()), - target_branch: None, - }, - )?; - let queries = match handle.queries.as_ref() { - Some(registry) => registry - .iter() - .filter(|q| q.expose) - .map(api::query_catalog_entry) - .collect(), - None => Vec::new(), - }; - Ok(Json(QueriesCatalogOutput { queries })) -} - -#[utoipa::path( - get, - path = "/schema", - tag = "schema", - operation_id = "getSchema", - responses( - (status = 200, description = "Current schema source", body = SchemaOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Read the current schema source. -/// -/// Returns the project's schema as a single string in `.pg` source form. -/// Useful for clients that want to introspect available types and tables -/// before constructing GQ queries. Read-only.
-pub(crate) async fn server_schema_get( - Extension(handle): Extension>, - actor: Option>, -) -> std::result::Result, ApiError> { - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::Read, - branch: None, - target_branch: None, - }, - )?; - let schema_source = { - let db = &handle.engine; - db.schema_source().to_string() - }; - Ok(Json(SchemaOutput { schema_source })) -} - -#[utoipa::path( - post, - path = "/schema/apply", - tag = "mutations", - operation_id = "applySchema", - request_body = SchemaApplyRequest, - responses( - (status = 200, description = "Schema apply results", body = SchemaApplyOutput), - (status = 400, description = "Bad request", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - (status = 409, description = "Schema apply is disabled for cluster-backed serving; use `omnigraph cluster apply` and restart", body = ErrorOutput), - (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Apply a schema migration. -/// -/// Cluster-backed servers reject this route with `409 Conflict`; operators -/// must apply schema changes through `omnigraph cluster apply` and restart. -/// -/// Diffs `schema_source` against the current schema and applies the resulting -/// migration steps (add/drop type, add/drop column, etc.). **Destructive**: -/// some steps drop data. Returns the list of steps applied; if `applied` is -/// false the diff was unsupported and no changes were made. 
-pub(crate) async fn server_schema_apply( - State(state): State, - Extension(handle): Extension>, - actor: Option>, - Json(request): Json, -) -> std::result::Result, ApiError> { - let actor_arc = actor - .as_ref() - .map(|Extension(actor)| Arc::clone(&actor.actor_id)) - .unwrap_or_else(|| Arc::::from("anonymous")); - let actor_id = actor - .as_ref() - .map(|Extension(actor)| actor.actor_id.as_ref()); - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::SchemaApply, - branch: None, - target_branch: Some("main".to_string()), - }, - )?; - // Disable HTTP schema apply on cluster-backed serving AFTER the Cedar gate, - // so an unauthorized actor gets a 403 (not a 409 that would disclose the - // server is cluster-backed): 401 → 403 → 409, never leak topology before - // authorization. An authorized actor gets the actionable 409 signpost. - if state.routing().config_path.is_some() { - return Err(ApiError::conflict( - "server-side schema apply is disabled for cluster-backed serving; \ - update the cluster config, run `omnigraph cluster apply`, and restart \ - the server.", - )); - } - let est_bytes = request.schema_source.len() as u64; - let _admission = state - .workload - .try_admit(&actor_arc, est_bytes) - .map_err(ApiError::from_workload_reject)?; - let result = { - let db = &handle.engine; - let registry = handle.queries.as_deref(); - let label = handle.key.graph_id.as_str().to_string(); - // Engine-layer policy enforcement (MR-722): pass the resolved - // actor through so apply_schema_as can call enforce() with the - // authoritative identity. With a policy installed in AppState, - // engine-side enforcement re-checks the same decision the - // HTTP-layer authorize_request just made above. PR #3 collapses - // the redundancy.
- db.apply_schema_as_with_catalog_check( - &request.schema_source, - omnigraph::db::SchemaApplyOptions { - allow_data_loss: request.allow_data_loss, - }, - actor_id, - |catalog| { - if let Some(registry) = registry { - validate_registry_against_catalog(registry, catalog, &label)?; - } - Ok(()) - }, - ) - .await - .map_err(ApiError::from_omni)? - }; - // Prompt index convergence (iss-848): schema apply records `@index` intent - // but defers the physical build. On a long-lived server, materialize it - // promptly rather than waiting for the next `optimize` cron -- spawned - // detached so it never blocks or fails the apply response. Best-effort: a - // failure is logged and the index still converges on the next optimize. - // The CLI is one-shot, so it has no equivalent; its convergence path is the - // operator's optimize cadence. - if result.applied { - let engine = Arc::clone(&handle.engine); - tokio::spawn(async move { - if let Err(err) = engine.ensure_indices().await { - tracing::warn!( - target: "omnigraph::server", - error = %err, - "post-apply ensure_indices failed; indexes will converge on the next optimize", - ); - } - }); - } - Ok(Json(schema_apply_output(handle.uri.as_str(), result))) -} - -/// Shared body for `POST /load` (canonical) and `POST /ingest` (deprecated): -/// branch-exists / fork-if-`from` check, Cedar authorization, admission, the -/// bulk `load_as`, and the `IngestOutput` mapping.
-async fn run_ingest( - state: AppState, - handle: Arc, - actor: Option<&ResolvedActor>, - request: IngestRequest, -) -> std::result::Result { - let branch = request.branch.unwrap_or_else(|| "main".to_string()); - let from = request.from; - let mode = request.mode.unwrap_or(omnigraph::loader::LoadMode::Merge); - let actor_arc = actor - .map(|actor| Arc::clone(&actor.actor_id)) - .unwrap_or_else(|| Arc::::from("anonymous")); - let actor_id = actor.map(|actor| actor.actor_id.as_ref()); - - let branch_exists = { - let db = &handle.engine; - db.branch_list() - .await - .map_err(ApiError::from_omni)? - .into_iter() - .any(|name| name == branch) - }; - - if !branch_exists { - match from.as_deref() { - // Fork-if-missing is opt-in by presence of `from`; without it a - // typo'd branch name must surface as an error, not silently - // create a fork and land the data there. - None => { - return Err(ApiError::not_found(format!( - "branch '{branch}' not found; pass `from` to create it" - ))); - } - Some(from) => authorize_request( - actor, - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::BranchCreate, - branch: Some(from.to_string()), - target_branch: Some(branch.clone()), - }, - )?, - } - } - authorize_request( - actor, - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::Change, - branch: Some(branch.clone()), - target_branch: None, - }, - )?; - let est_bytes = request.data.len() as u64; - let _admission = state - .workload - .try_admit(&actor_arc, est_bytes) - .map_err(ApiError::from_workload_reject)?; - - let result = { - let db = &handle.engine; - db.load_as(&branch, from.as_deref(), &request.data, mode, actor_id) - .await - .map_err(ApiError::from_omni)? 
- }; - - Ok(ingest_output( - handle.uri.as_str(), - &result, - mode, - actor_id.map(str::to_string), - )) -} - -#[utoipa::path( - post, - path = "/load", - tag = "mutations", - operation_id = "load", - request_body = IngestRequest, - responses( - (status = 200, description = "Load results", body = IngestOutput), - (status = 400, description = "Bad request", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Bulk-load NDJSON data into a branch (canonical load endpoint). -/// -/// `data` is NDJSON with one record per line. `mode` controls behavior on -/// existing rows: `merge` upserts by id (default), `append` blindly inserts, -/// `overwrite` replaces table contents. Branch creation is opt-in by -/// presence of `from`: with `from` set, a missing `branch` is created from -/// it; without `from`, `branch` must already exist -- a missing branch is a -/// 404, never an implicit fork. **Destructive** when `mode` is `overwrite` -/// or when the load produces conflicting writes. -/// -/// The legacy `POST /ingest` route has identical semantics and is kept as a -/// deprecated alias.
-pub(crate) async fn server_load( - State(state): State, - Extension(handle): Extension>, - actor: Option>, - Json(request): Json, -) -> std::result::Result, ApiError> { - Ok(Json( - run_ingest( - state, - handle, - actor.as_ref().map(|Extension(actor)| actor), - request, - ) - .await?, - )) -} - -#[utoipa::path( - post, - path = "/ingest", - tag = "mutations", - operation_id = "ingest", - request_body = IngestRequest, - responses( - (status = 200, description = "Load results (response includes `Deprecation: true` + `Link: ; rel=\"successor-version\"`)", body = IngestOutput), - (status = 400, description = "Bad request", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -#[deprecated(note = "use POST /load instead; /ingest is kept indefinitely for back-compat")] -/// **Deprecated** -- use [`POST /load`](#tag/mutations/operation/load) instead. -/// -/// Bulk-load NDJSON data into a branch. Behavior is unchanged; the route is -/// kept indefinitely for back-compat. New integrations should target -/// `POST /load`, which has identical semantics. Responses from this route -/// include `Deprecation: true` and `Link: ; rel="successor-version"` -/// headers per RFC 9745 / RFC 8288 so SDKs and proxies can surface the signal.
-pub(crate) async fn server_ingest( - State(state): State, - Extension(handle): Extension>, - actor: Option>, - Json(request): Json, -) -> std::result::Result<([(HeaderName, HeaderValue); 2], Json), ApiError> { - let output = run_ingest( - state, - handle, - actor.as_ref().map(|Extension(actor)| actor), - request, - ) - .await?; - Ok(( - deprecation_headers("; rel=\"successor-version\""), - Json(output), - )) -} - -#[utoipa::path( - get, - path = "/branches", - tag = "branches", - operation_id = "listBranches", - responses( - (status = 200, description = "List of branches", body = BranchListOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// List all branches. -/// -/// Returns branch names sorted alphabetically. Read-only. -pub(crate) async fn server_branch_list( - Extension(handle): Extension>, - actor: Option>, -) -> std::result::Result, ApiError> { - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::Read, - branch: None, - target_branch: None, - }, - )?; - let mut branches = { - let db = &handle.engine; - db.branch_list().await.map_err(ApiError::from_omni)? 
- }; - branches.sort(); - Ok(Json(BranchListOutput { branches })) -} - -#[utoipa::path( - post, - path = "/branches", - tag = "branches", - operation_id = "createBranch", - request_body = BranchCreateRequest, - responses( - (status = 200, description = "Branch created", body = BranchCreateOutput), - (status = 400, description = "Bad request", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - (status = 409, description = "Branch already exists", body = ErrorOutput), - (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Create a new branch. -/// -/// Forks `name` off of `from` (defaults to `main`). The new branch shares -/// table data with its parent until it is mutated. Returns 409 if `name` -/// already exists. -pub(crate) async fn server_branch_create( - State(state): State, - Extension(handle): Extension>, - actor: Option>, - Json(request): Json, -) -> std::result::Result, ApiError> { - let from = request.from.unwrap_or_else(|| "main".to_string()); - let actor_arc = actor - .as_ref() - .map(|Extension(actor)| Arc::clone(&actor.actor_id)) - .unwrap_or_else(|| Arc::::from("anonymous")); - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::BranchCreate, - branch: Some(from.clone()), - target_branch: Some(request.name.clone()), - }, - )?; - // Branch metadata only -- small constant bytes estimate. The Lance - // shallow-clone work is bounded by the parent's manifest size, not - // the request body.
- let _admission = state - .workload - .try_admit(&actor_arc, 256) - .map_err(ApiError::from_workload_reject)?; - { - let db = &handle.engine; - db.branch_create_from_as( - ReadTarget::branch(&from), - &request.name, - actor.as_ref().map(|Extension(a)| a.actor_id.as_ref()), - ) - .await - .map_err(ApiError::from_omni)?; - } - Ok(Json(BranchCreateOutput { - uri: handle.uri.clone(), - from, - name: request.name, - actor_id: actor.map(|Extension(actor)| actor.actor_id.as_ref().to_string()), - })) -} - -/// Path-param shape for [`server_branch_delete`]. Named-field -/// deserialization (rather than `Path` or `Path<(String,)>`) -/// keeps the extractor stable across single-mode flat routes and -/// multi-mode nested routes: the `{branch}` capture is picked by -/// name and any other captures in scope (e.g. `{graph_id}` in -/// multi-mode) are ignored without breaking deserialization. -/// -/// Closes the "handler path-extractor type is positional and breaks -/// when route nesting changes" class. -#[derive(Deserialize)] -pub(crate) struct BranchPath { - branch: String, -} - -#[utoipa::path( - delete, - path = "/branches/{branch}", - tag = "branches", - operation_id = "deleteBranch", - params( - ("branch" = String, Path, description = "Branch name to delete"), - ), - responses( - (status = 200, description = "Branch deleted", body = BranchDeleteOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - (status = 404, description = "Branch not found", body = ErrorOutput), - (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Delete a branch. -/// -/// **Irreversible.** Removes the branch pointer; commits remain reachable -/// only if referenced by another branch. Returns 404 if the branch does not -/// exist. 
-pub(crate) async fn server_branch_delete( - State(state): State, - Extension(handle): Extension>, - actor: Option>, - Path(BranchPath { branch }): Path, -) -> std::result::Result, ApiError> { - let actor_arc = actor - .as_ref() - .map(|Extension(actor)| Arc::clone(&actor.actor_id)) - .unwrap_or_else(|| Arc::::from("anonymous")); - let actor_id = actor - .as_ref() - .map(|Extension(actor)| actor.actor_id.as_ref()); - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::BranchDelete, - branch: None, - target_branch: Some(branch.clone()), - }, - )?; - // Metadata-only manifest tombstone — small constant estimate. - let _admission = state - .workload - .try_admit(&actor_arc, 256) - .map_err(ApiError::from_workload_reject)?; - { - let db = &handle.engine; - db.branch_delete_as(&branch, actor_id) - .await - .map_err(ApiError::from_omni)?; - } - Ok(Json(BranchDeleteOutput { - uri: handle.uri.clone(), - name: branch, - actor_id: actor_id.map(str::to_string), - })) -} - -#[utoipa::path( - post, - path = "/branches/merge", - tag = "branches", - operation_id = "mergeBranches", - request_body = BranchMergeRequest, - responses( - (status = 200, description = "Branches merged", body = BranchMergeOutput), - (status = 400, description = "Bad request", body = ErrorOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - (status = 409, description = "Merge conflict", body = ErrorOutput), - (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// Merge one branch into another. -/// -/// Merges `source` into `target` (defaults to `main`). Outcome is one of -/// `already_up_to_date`, `fast_forward`, or `merged`. 
Returns 409 with the -/// list of conflicts if the merge cannot be completed; the target is left -/// unchanged in that case. **Destructive** to `target` on success. -pub(crate) async fn server_branch_merge( - State(state): State, - Extension(handle): Extension>, - actor: Option>, - Json(request): Json, -) -> std::result::Result, ApiError> { - let target = request.target.unwrap_or_else(|| "main".to_string()); - let actor_arc = actor - .as_ref() - .map(|Extension(actor)| Arc::clone(&actor.actor_id)) - .unwrap_or_else(|| Arc::::from("anonymous")); - let actor_id = actor - .as_ref() - .map(|Extension(actor)| actor.actor_id.as_ref()); - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::BranchMerge, - branch: Some(request.source.clone()), - target_branch: Some(target.clone()), - }, - )?; - // Merge body is small JSON; the heavy work is in the engine but is - // bounded per-(table, branch) by the writer queue. Small constant - // estimate suffices for the actor in-flight count. - let _admission = state - .workload - .try_admit(&actor_arc, 256) - .map_err(ApiError::from_workload_reject)?; - let outcome = { - let db = &handle.engine; - db.branch_merge_as(&request.source, &target, actor_id) - .await - .map_err(ApiError::from_omni)? - }; - Ok(Json(BranchMergeOutput { - source: request.source, - target, - outcome: outcome.into(), - actor_id: actor_id.map(str::to_string), - })) -} - -#[utoipa::path( - get, - path = "/commits", - tag = "commits", - operation_id = "listCommits", - params(CommitListQuery), - responses( - (status = 200, description = "List of commits", body = CommitListOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] -/// List commits. 
-/// -/// Filter by `branch` to get the commits on a single branch (most recent -/// first); omit to list across all branches. Read-only. -pub(crate) async fn server_commit_list( - Extension(handle): Extension>, - actor: Option>, - Query(query): Query, -) -> std::result::Result, ApiError> { - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::Read, - branch: query.branch.clone(), - target_branch: None, - }, - )?; - let commits = { - let db = &handle.engine; - db.list_commits(query.branch.as_deref()) - .await - .map_err(ApiError::from_omni)? - }; - Ok(Json(CommitListOutput { - commits: commits.iter().map(api::commit_output).collect(), - })) -} - -/// Path-param shape for [`server_commit_show`]. See [`BranchPath`] -/// for the design rationale — same pattern, different field name. -#[derive(Deserialize)] -pub(crate) struct CommitPath { - commit_id: String, -} - -#[utoipa::path( - get, - path = "/commits/{commit_id}", - tag = "commits", - operation_id = "getCommit", - params( - ("commit_id" = String, Path, description = "Commit identifier"), - ), - responses( - (status = 200, description = "Commit details", body = api::CommitOutput), - (status = 401, description = "Unauthorized", body = ErrorOutput), - (status = 403, description = "Forbidden", body = ErrorOutput), - (status = 404, description = "Commit not found", body = ErrorOutput), - ), - security(("bearer_token" = [])), -)] - -/// Get a single commit. -/// -/// Returns the commit's manifest version, parent commit(s), and creation -/// metadata. Read-only. 
-pub(crate) async fn server_commit_show( - Extension(handle): Extension>, - actor: Option>, - Path(CommitPath { commit_id }): Path, -) -> std::result::Result, ApiError> { - authorize_request( - actor.as_ref().map(|Extension(actor)| actor), - handle.policy.as_deref(), - PolicyRequest { - action: PolicyAction::Read, - branch: None, - target_branch: None, - }, - )?; - let commit = { - let db = &handle.engine; - db.get_commit(&commit_id) - .await - .map_err(ApiError::from_omni)? - }; - Ok(Json(api::commit_output(&commit))) -} - -pub(crate) fn read_target_from_request(branch: Option, snapshot: Option) -> ReadTarget { - if let Some(snapshot) = snapshot { - ReadTarget::snapshot(omnigraph::db::SnapshotId::new(snapshot)) - } else { - ReadTarget::branch(branch.unwrap_or_else(|| "main".to_string())) - } -} - -pub(crate) fn select_named_query_decl( - query_source: &str, - requested_name: Option<&str>, -) -> Result { - let parsed = parse_query(query_source)?; - let query = if let Some(name) = requested_name { - parsed - .queries - .into_iter() - .find(|query| query.name == name) - .ok_or_else(|| color_eyre::eyre::eyre!("query '{}' not found", name))? 
- } else if parsed.queries.len() == 1 { - parsed.queries.into_iter().next().unwrap() - } else { - bail!("query file contains multiple queries; pass --name"); - }; - Ok(query) -} - -pub(crate) fn select_named_query( - query_source: &str, - requested_name: Option<&str>, -) -> Result<(String, Vec)> { - let query = select_named_query_decl(query_source, requested_name)?; - Ok((query.name, query.params)) -} - -pub(crate) fn query_params_from_json( - query_params: &[omnigraph_compiler::query::ast::Param], - params_json: Option<&Value>, -) -> Result { - json_params_to_param_map(params_json, query_params, JsonParamMode::Standard) - .map_err(|err| color_eyre::eyre::eyre!(err.to_string())) -} diff --git a/crates/omnigraph-server/src/lib.rs b/crates/omnigraph-server/src/lib.rs index fbc37d2..ad41f9d 100644 --- a/crates/omnigraph-server/src/lib.rs +++ b/crates/omnigraph-server/src/lib.rs @@ -1,14 +1,9 @@ pub mod api; -mod handlers; -mod settings; -use handlers::*; -use settings::*; -pub use settings::{ServerRuntimeState, classify_server_runtime_state, load_server_settings}; pub mod auth; +pub mod config; pub mod graph_id; pub mod identity; pub mod policy; -pub mod queries; pub mod registry; pub mod workload; @@ -16,9 +11,7 @@ pub use graph_id::GraphId; pub use identity::{AuthSource, GraphKey, ResolvedActor, Scope, TenantId}; pub use registry::{GraphHandle, GraphRegistry, InsertError, RegistryLookup, RegistrySnapshot}; -use crate::queries::{QueryRegistry, check, format_check_breakages}; - -use std::collections::{BTreeMap, HashMap, HashSet}; +use std::collections::{HashMap, HashSet}; use std::fs; use std::io; use std::io::Write; @@ -29,10 +22,9 @@ use api::{ BranchCreateOutput, BranchCreateRequest, BranchDeleteOutput, BranchListOutput, BranchMergeOutput, BranchMergeRequest, ChangeOutput, ChangeRequest, CommitListOutput, CommitListQuery, ErrorCode, ErrorOutput, ExportRequest, GraphInfo, GraphListResponse, - HealthOutput, IngestOutput, IngestRequest, InvokeStoredQueryRequest, 
InvokeStoredQueryResponse, - QueriesCatalogOutput, QueryRequest, ReadOutput, ReadRequest, SchemaApplyOutput, - SchemaApplyRequest, SchemaOutput, SnapshotQuery, ingest_output, schema_apply_output, - snapshot_payload, + HealthOutput, IngestOutput, IngestRequest, QueryRequest, ReadOutput, ReadRequest, + SchemaApplyOutput, SchemaApplyRequest, SchemaOutput, SnapshotQuery, ingest_output, + schema_apply_output, snapshot_payload, }; pub use auth::{AWS_SECRET_ENV, EnvOrFileTokenSource, TokenSource, resolve_token_source}; use axum::body::{Body, Bytes}; @@ -44,12 +36,16 @@ use axum::middleware::{self, Next}; use axum::response::{IntoResponse, Response}; use axum::routing::{delete, get, post}; use axum::{Json, Router}; -use color_eyre::eyre::{Result, WrapErr, bail, eyre}; +use color_eyre::eyre::{Result, WrapErr, bail}; +pub use config::{ + AliasCommand, AliasConfig, CliDefaults, DEFAULT_CONFIG_FILE, OmnigraphConfig, PolicySettings, + ProjectConfig, QueryDefaults, ReadOutputFormat, ServerDefaults, TableCellLayout, TargetConfig, + load_config, +}; use futures::stream; use omnigraph::db::{Omnigraph, ReadTarget}; use omnigraph::error::{ManifestConflictDetails, ManifestErrorKind, OmniError}; use omnigraph::storage::normalize_root_uri; -use omnigraph_compiler::catalog::Catalog; use omnigraph_compiler::json_params_to_param_map; use omnigraph_compiler::query::parser::parse_query; use omnigraph_compiler::{JsonParamMode, ParamMap}; @@ -87,49 +83,30 @@ fn hash_bearer_token(token: &str) -> BearerTokenHash { description = "HTTP API for the Omnigraph graph database", ), paths( - handlers::server_health, - handlers::server_graphs_list, - handlers::server_snapshot, + server_health, + server_graphs_list, + server_snapshot, // deprecated; the #[deprecated] attribute on the handler // surfaces as `deprecated: true` on the OpenAPI operation. 
- #[allow(deprecated)] handlers::server_read, - handlers::server_query, - handlers::server_export, - #[allow(deprecated)] handlers::server_change, - handlers::server_mutate, - handlers::server_list_queries, - handlers::server_invoke_query, - handlers::server_schema_apply, - handlers::server_schema_get, - handlers::server_load, - // deprecated; the #[deprecated] attribute on the handler surfaces as - // `deprecated: true` on the OpenAPI operation. - #[allow(deprecated)] handlers::server_ingest, - handlers::server_branch_list, - handlers::server_branch_create, - handlers::server_branch_delete, - handlers::server_branch_merge, - handlers::server_commit_list, - handlers::server_commit_show, + #[allow(deprecated)] server_read, + server_query, + server_export, + #[allow(deprecated)] server_change, + server_mutate, + server_schema_apply, + server_schema_get, + server_ingest, + server_branch_list, + server_branch_create, + server_branch_delete, + server_branch_merge, + server_commit_list, + server_commit_show, ), modifiers(&SecurityAddon), )] pub struct ApiDoc; -/// The canonical served OpenAPI shape (RFC-011 cluster-only): the static -/// `ApiDoc` with every protected path nested under `/graphs/{graph_id}/…` -/// and `cluster_`-prefixed operation ids. `/healthz` and `/graphs` stay -/// flat. This is the single source of nesting — both the runtime -/// `server_openapi` handler and the committed `openapi.json` derive from -/// it, so the published spec can never describe routes the server does -/// not serve. The handler additionally strips security in open mode; the -/// committed spec retains it. 
-pub fn served_openapi() -> utoipa::openapi::OpenApi { - let mut doc = ApiDoc::openapi(); - handlers::nest_paths_under_cluster_prefix(&mut doc); - doc -} - struct SecurityAddon; impl utoipa::Modify for SecurityAddon { @@ -151,10 +128,11 @@ const SERVER_SOURCE_VERSION: Option<&str> = option_env!("OMNIGRAPH_SOURCE_VERSIO #[derive(Debug, Clone)] pub struct ServerConfig { - /// Server topology + the graphs to open at startup. RFC-011 - /// cluster-only: the server always boots from a cluster - /// (`--cluster `) and serves N graphs under cluster - /// routes. + /// Server topology + the graphs to open at startup. Single-mode + /// invocations (`omnigraph-server ` or `--target `) + /// produce `ServerConfigMode::Single`; multi-mode invocations + /// (`--config omnigraph.yaml` with a non-empty `graphs:` map and + /// no single-mode selector) produce `ServerConfigMode::Multi`. pub mode: ServerConfigMode, pub bind: String, /// Operator opt-in for fully-unauthenticated dev mode (MR-723). @@ -166,81 +144,79 @@ pub struct ServerConfig { /// who set up auth and forgot the policy file would otherwise ship /// the illusion of protection. pub allow_unauthenticated: bool, - /// Operator opt-in for fail-fast cluster boot. By default, graph-local - /// startup failures quarantine that graph and healthy graphs still serve. - /// When true, any quarantined or failed graph aborts startup. - pub require_all_graphs: bool, } -/// What `load_server_settings` produces. RFC-011 cluster-only: the -/// server always boots from a cluster's applied revision into a -/// multi-graph deployment (N ≥ 1 graphs). +/// What `load_server_settings` produces after applying the four-rule +/// mode inference matrix (MR-668 decision 2). #[derive(Debug, Clone)] pub enum ServerConfigMode { - /// Cluster boot — `--cluster ` resolves the applied - /// revision into per-graph startup configs plus an optional - /// server-level policy. + /// Legacy invocation — one graph at the given URI. 
Either: + /// * `omnigraph-server ` (CLI positional), or + /// * `omnigraph-server --target --config omnigraph.yaml`, or + /// * `omnigraph-server --config omnigraph.yaml` with `server.graph` + /// set to a named target. + Single { + uri: String, + /// Top-level `policy.file` (single-graph Cedar policy). + policy_file: Option, + }, + /// Multi-graph invocation — `--config omnigraph.yaml` with a + /// non-empty `graphs:` map and no single-mode selector. Multi { /// Per-graph startup configs, sorted by graph id (BTreeMap /// iteration order). The parallel-open loop iterates this. graphs: Vec, - /// The cluster boot source (config directory or storage root). - /// Kept on the mode so future runtime mutation (deferred — see - /// release notes) can locate the source of truth without - /// re-parsing CLI args. + /// Path to the config file the server was started from. Kept on + /// the mode so future runtime mutation (deferred — see release + /// notes) can locate the source of truth without re-parsing CLI + /// args. config_path: PathBuf, - /// Server-level Cedar policy for the management endpoints - /// (`GET /graphs`). Wired into `GET /graphs` authorization. - server_policy: Option, + /// `server.policy.file` (server-level Cedar policy for the + /// management endpoints). Wired into `GET /graphs` authorization. + server_policy_file: Option, }, } -/// Where a Cedar policy bundle comes from at startup. Cluster-local files are -/// used during config application; inline digest-verified catalog content is -/// used for serving, where the catalog may live on object storage and the -/// server must not re-read mutable state after the snapshot. -#[derive(Debug, Clone)] -pub enum PolicySource { - File(PathBuf), - Inline(String), -} - /// One graph's startup-time configuration: id, opened URI, optional -/// per-graph policy source. Constructed by `load_server_settings` +/// per-graph policy file path. 
Constructed by `load_server_settings` /// in multi mode; consumed by `serve`'s parallel open loop. #[derive(Debug, Clone)] pub struct GraphStartupConfig { pub graph_id: String, pub uri: String, - pub policy: Option, - /// Pre-resolved embedding config from an applied cluster provider profile. - /// Legacy config paths leave this unset and continue to use env resolution. - pub embedding: Option, - /// Per-graph stored-query registry, loaded and identity-checked at - /// settings-build time; type-checked against the schema when this - /// graph's engine opens. - pub queries: QueryRegistry, + pub policy_file: Option, } -/// Runtime routing for the server (RFC-011 cluster-only). Every -/// deployment serves cluster routes (`/graphs/{graph_id}/...`) backed by -/// a registry of N graphs (N ≥ 1). The single-graph convenience -/// constructors build a one-graph registry keyed by `default`; the -/// cluster boot path builds an N-graph registry. There is no longer a -/// flat-route mode. +/// Runtime routing for the server. Single mode = legacy +/// `omnigraph-server ` invocation, one graph, flat HTTP routes. +/// Multi mode = `--config omnigraph.yaml` with a non-empty `graphs:` +/// map, N graphs, cluster routes (`/graphs/{graph_id}/...`). Mode is +/// determined at startup by `load_server_settings`. /// -/// `config_path` is the boot source (the cluster directory or storage -/// root); preserved here so future runtime mutation (deferred) can find -/// the source of truth without re-parsing CLI args. The server treats -/// the source as operator-owned and never writes it. +/// In single mode the handle lives here directly — there is no +/// registry, no sentinel key, no walk-and-assert. In multi mode the +/// registry carries N handles and the middleware dispatches on the +/// URL's `{graph_id}` segment. 
/// -/// All handler bodies are mode-agnostic — the routing middleware +/// Both modes share the same handler bodies — the routing middleware /// (`resolve_graph_handle`) injects `Arc` as a request -/// extension by looking up the `{graph_id}` URL segment in the registry. +/// extension so handlers never see the routing discriminator. #[derive(Clone)] -pub struct GraphRouting { - pub registry: Arc, - pub config_path: Option, +pub enum GraphRouting { + /// Single-graph deployment: one handle, flat routes (`/snapshot`, + /// `/read`, …). The `handle.uri` field carries the URI the engine + /// was opened from. Backward compatible with v0.6.0 deployments. + Single { handle: Arc }, + /// Multi-graph deployment: many handles, cluster routes + /// (`/graphs/{graph_id}/...`). `config_path` is the `omnigraph.yaml` + /// the server reads at startup; preserved here so future runtime + /// mutation (deferred) can find the source of truth without + /// re-parsing CLI args. The server treats the file as + /// operator-owned and never writes it. + Multi { + registry: Arc, + config_path: Option, + }, } #[derive(Clone)] @@ -256,10 +232,12 @@ pub struct AppState { /// see MR-668 decision Q6. workload: Arc, bearer_tokens: Arc<[(BearerTokenHash, Arc)]>, - /// Server-level Cedar policy. Used by management endpoints (`GET - /// /graphs`) which act on the registry resource, not on a per-graph - /// resource. Loaded from the cluster-scoped policy binding when - /// configured. Per-graph policies live on each `GraphHandle.policy`. + /// Server-level Cedar policy. Used by management endpoints (`POST + /// /graphs`, `GET /graphs`) which act on the registry resource, + /// not on a per-graph resource. Loaded from `server.policy.file` + /// in `omnigraph.yaml`. `None` outside multi mode and when no + /// server policy is configured. Per-graph policies live on each + /// `GraphHandle.policy`. 
server_policy: Option>, } @@ -307,38 +285,7 @@ impl AppState { ) -> Self { let bearer_tokens = hash_bearer_tokens(bearer_tokens); let per_graph_policy = policy_engine.map(Arc::new); - Self::build_single_mode( - uri, - db, - bearer_tokens, - per_graph_policy, - Arc::new(workload), - None, - ) - } - - /// Like `new_single`, but attaches a pre-validated stored-query - /// registry. Private — the production single-mode boot path - /// (`open_single_with_queries`) is the only caller; every public - /// `new_*` constructor builds with no stored queries. - fn new_single_with_queries( - uri: String, - db: Omnigraph, - bearer_tokens: Vec<(String, String)>, - policy_engine: Option, - workload: workload::WorkloadController, - queries: Option>, - ) -> Self { - let bearer_tokens = hash_bearer_tokens(bearer_tokens); - let per_graph_policy = policy_engine.map(Arc::new); - Self::build_single_mode( - uri, - db, - bearer_tokens, - per_graph_policy, - Arc::new(workload), - queries, - ) + Self::build_single_mode(uri, db, bearer_tokens, per_graph_policy, Arc::new(workload)) } pub fn new(uri: String, db: Omnigraph) -> Self { @@ -430,34 +377,6 @@ impl AppState { uri: impl Into, bearer_tokens: Vec<(String, String)>, policy_file: Option<&PathBuf>, - ) -> Result { - Self::open_single_with_queries(uri, bearer_tokens, policy_file, QueryRegistry::default()) - .await - } - - /// Single-mode boot with a stored-query registry: open the engine, - /// **type-check the registry against the live schema and refuse to - /// start on a breakage** (same posture as bad policy YAML), log - /// non-blocking warnings, then attach the registry to the handle. - /// With an empty registry the check is a no-op and no registry is - /// attached — that is the path `open_with_bearer_tokens_and_policy` - /// (no stored queries) takes. 
- pub async fn open_single_with_queries( - uri: impl Into, - bearer_tokens: Vec<(String, String)>, - policy_file: Option<&PathBuf>, - queries: QueryRegistry, - ) -> Result { - Self::open_single_with_queries_for_graph_id(uri, bearer_tokens, policy_file, queries, None) - .await - } - - async fn open_single_with_queries_for_graph_id( - uri: impl Into, - bearer_tokens: Vec<(String, String)>, - policy_file: Option<&PathBuf>, - queries: QueryRegistry, - graph_id: Option, ) -> Result { // The "policy requires tokens" invariant is enforced once by // `classify_server_runtime_state` in `serve()`, before either @@ -465,41 +384,30 @@ impl AppState { // time we get here, the (policy, no-tokens) combination has // already been rejected — no second bail needed. let uri = normalize_root_uri(&uri.into()).wrap_err("normalize graph URI")?; - let graph_id = graph_id.unwrap_or_else(|| uri.clone()); let db = Omnigraph::open(&uri).await?; - - // Validate the registry against the live schema and resolve it to - // an attachable handle (refuse boot on breakage). - let registry = validate_and_attach(queries, &db.catalog(), &graph_id)?; - let policy_engine = match policy_file { - Some(path) => Some(PolicyEngine::load_graph(path, &graph_id)?), + Some(path) => Some(PolicyEngine::load_graph(path, &uri)?), None => None, }; - Ok(Self::new_single_with_queries( + Ok(Self::new_with_bearer_tokens_and_policy( uri, db, bearer_tokens, policy_engine, - workload::WorkloadController::from_env(), - registry, )) } - /// Single-graph convenience construction (RFC-011 cluster-only): - /// wraps the bare engine + per-graph policy in a `GraphHandle` keyed - /// by `default`, then builds a one-graph registry so the deployment - /// serves the same `/graphs/{graph_id}/...` cluster routes as any - /// other. Per-graph policy enforcement on the engine (MR-722) is - /// re-applied via `Omnigraph::with_policy` so HTTP and engine layers - /// can never diverge. 
+ /// Single-mode shared construction: wraps the bare engine + per-graph + /// policy in a `GraphHandle` carried directly by `GraphRouting::Single`. + /// Per-graph policy enforcement on the engine (MR-722) is re-applied + /// via `Omnigraph::with_policy` so HTTP and engine layers can never + /// diverge. fn build_single_mode( uri: String, db: Omnigraph, bearer_tokens: Arc<[(BearerTokenHash, Arc)]>, policy_engine: Option>, workload: Arc, - queries: Option>, ) -> Self { // Engine-layer policy gate (MR-722). With a per-graph policy // installed, every `_as` writer on `Omnigraph` calls into the @@ -511,28 +419,26 @@ impl AppState { } else { db }; - // The convenience constructors address the single graph by the - // reserved id `default` — both the registry key and the URL - // segment (`/graphs/default/...`). + // `GraphHandle.key` is required by the struct, but in single + // mode it is never a registry key (there's no registry) and + // never compared against user input (routes are flat, no + // `{graph_id}` parameter). The label appears only in tracing + // output from `resolve_graph_handle`. The literal below is a + // log label, not a routing key — when the future cluster + // catalog ships, single mode may carry the catalog-assigned + // id here instead. 
let uri = normalize_root_uri(&uri).unwrap_or(uri); - let graph_id = GraphId::try_from("default").expect("'default' is a valid GraphId"); - let key = GraphKey::cluster(graph_id); + let key = GraphKey::cluster( + GraphId::try_from("default").expect("'default' is a valid GraphId log label"), + ); let handle = Arc::new(GraphHandle { key, uri, engine: Arc::new(db), policy: policy_engine, - queries, }); - let registry = Arc::new( - GraphRegistry::from_handles(vec![handle]) - .expect("a single handle never collides on graph id"), - ); Self { - routing: GraphRouting { - registry, - config_path: None, - }, + routing: GraphRouting::Single { handle }, workload, bearer_tokens, server_policy: None, @@ -540,11 +446,12 @@ impl AppState { } /// Multi-mode constructor — used by the startup loop. Operators - /// reach this by invoking `omnigraph-server --cluster `. + /// reach this by invoking `omnigraph-server --config omnigraph.yaml` + /// with a non-empty `graphs:` map. /// /// Caller supplies the already-opened `GraphHandle`s and (optionally) - /// the path to the source cluster. `server_policy` is loaded from the - /// cluster-scoped policy binding if configured. + /// the path to the source config file. `server_policy` is loaded + /// from `server.policy.file` if configured. pub fn new_multi( handles: Vec>, bearer_tokens: Vec<(String, String)>, @@ -555,7 +462,7 @@ impl AppState { let bearer_tokens = hash_bearer_tokens(bearer_tokens); let registry = Arc::new(GraphRegistry::from_handles(handles)?); Ok(Self { - routing: GraphRouting { + routing: GraphRouting::Multi { registry, config_path, }, @@ -567,7 +474,9 @@ impl AppState { /// Runtime routing accessor. Handlers don't typically inspect this — /// they extract `Arc` via the routing middleware — but - /// `server_graphs_list` reads the registry through it. 
+ /// `build_app` matches on it to decide flat vs nested route + /// mounting, and a handful of management endpoints (`GET /graphs`, + /// the OpenAPI cluster rewrite) match on the discriminant. pub fn routing(&self) -> &GraphRouting { &self.routing } @@ -581,9 +490,13 @@ impl AppState { } // Any per-graph policy also requires auth — otherwise the // policy gate would receive unauthenticated requests. Reading - // the cached `any_per_graph_policy` flag off the registry - // snapshot is O(1). - self.routing.registry.snapshot_ref().any_per_graph_policy + // from `routing` is O(1) in both arms: single mode is a direct + // `handle.policy.is_some()` check, multi mode reads the + // cached `any_per_graph_policy` flag on the registry snapshot. + match &self.routing { + GraphRouting::Single { handle } => handle.policy.is_some(), + GraphRouting::Multi { registry, .. } => registry.snapshot_ref().any_per_graph_policy, + } } fn authenticate_bearer_token(&self, provided_token: &str) -> Option { @@ -837,47 +750,179 @@ pub fn init_tracing() { let _ = tracing_subscriber::fmt().with_env_filter(filter).try_init(); } -/// Log each non-blocking advisory from a registry check report. -fn log_registry_warnings(label: &str, report: &queries::CheckReport) { - for warning in &report.warnings { - warn!(graph = label, query = %warning.query, "stored query: {}", warning.message); - } -} +pub fn load_server_settings( + config_path: Option<&PathBuf>, + cli_uri: Option, + cli_target: Option, + cli_bind: Option, + cli_allow_unauthenticated: bool, +) -> Result { + let config = load_config(config_path)?; + let bind = cli_bind.unwrap_or_else(|| config.server_bind().to_string()); + // Either `--unauthenticated` or `OMNIGRAPH_UNAUTHENTICATED=1` flips + // this. Treat any non-empty, non-"0"/"false" string as truthy — + // standard 12-factor "any value is true" reading of the env var. 
+ let env_unauth = std::env::var("OMNIGRAPH_UNAUTHENTICATED") + .ok() + .map(|v| { + let trimmed = v.trim(); + !trimmed.is_empty() && trimmed != "0" && !trimmed.eq_ignore_ascii_case("false") + }) + .unwrap_or(false); + let allow_unauthenticated = cli_allow_unauthenticated || env_unauth; -fn validate_registry_against_catalog( - registry: &QueryRegistry, - catalog: &Catalog, - label: &str, -) -> omnigraph::error::Result<()> { - let report = check(registry, catalog); - if report.has_breakages() { - return Err(OmniError::manifest(format_check_breakages(label, &report))); - } - log_registry_warnings(label, &report); - Ok(()) -} + // MR-668 decision 2 — four-rule mode inference matrix. + // + // 1. CLI `` positional → Single (URI = the value) + // 2. CLI `--target ` → Single (URI = graphs..uri) + // 3. `server.graph` in config → Single (URI = graphs..uri) + // 4. `--config` + non-empty `graphs:` + no single-mode selector + // → Multi (every entry in `graphs:`) + // 5. otherwise → error with migration hint + // + // Rules 1-3 are mutually compatible (CLI URI wins over `--target` + // wins over `server.graph`), reusing the existing + // `resolve_target_uri` precedence. + let has_cli_uri = cli_uri.is_some(); + let has_cli_target = cli_target.is_some(); + let has_server_graph = config.server_graph_name().is_some(); + let has_graphs_map = !config.graphs.is_empty(); + let has_explicit_config = config_path.is_some(); -/// Validate a loaded stored-query registry against the live schema and -/// resolve it to an attachable handle. Refuses boot on any breakage -/// (same posture as bad policy YAML), logs the non-blocking warnings, -/// and collapses an empty registry to `None` (nothing attached). This is -/// the single gate every open path funnels through, so no opener can -/// attach a registry that has not been schema-checked. `label` names the -/// graph in messages. 
-fn validate_and_attach( - queries: QueryRegistry, - catalog: &Catalog, - label: &str, -) -> Result>> { - validate_registry_against_catalog(&queries, catalog, label) - .map_err(|err| color_eyre::eyre::eyre!(err.to_string()))?; - Ok(if queries.is_empty() { - None + let mode = if has_cli_uri || has_cli_target || has_server_graph { + // Rules 1, 2, or 3 → Single mode. + let raw_uri = config.resolve_target_uri( + cli_uri, + cli_target.as_deref(), + config.server_graph_name(), + )?; + let uri = normalize_root_uri(&raw_uri).wrap_err_with(|| { + format!("normalize single-graph URI '{raw_uri}' from server settings") + })?; + let policy_file = config.resolve_policy_file(); + ServerConfigMode::Single { uri, policy_file } + } else if has_explicit_config && has_graphs_map { + if config.resolve_policy_file().is_some() { + bail!( + "top-level `policy.file` is single-graph/CLI-local policy only; \ + in multi-graph mode move per-graph rules to \ + `graphs..policy.file` and move `graph_list` rules to \ + `server.policy.file`." + ); + } + // Rule 4 → Multi mode. Build a startup config per graph. + let mut graphs = Vec::with_capacity(config.graphs.len()); + for (name, target) in &config.graphs { + // Validate the graph id can construct a `GraphId` newtype. + // Doing this here (not at registry insert) so a malformed + // omnigraph.yaml fails at startup with a clear error.
+ GraphId::try_from(name.clone()).map_err(|err| { + color_eyre::eyre::eyre!("invalid graph id '{name}' in omnigraph.yaml: {err}") + })?; + let raw_uri = config.resolve_uri_value(&target.uri); + let uri = normalize_root_uri(&raw_uri).wrap_err_with(|| { + format!("normalize URI '{raw_uri}' for graph '{name}' in omnigraph.yaml") + })?; + graphs.push(GraphStartupConfig { + graph_id: name.clone(), + uri, + policy_file: config.resolve_target_policy_file(name), + }); + } + let config_path = config_path + .cloned() + .expect("has_explicit_config implies config_path is Some"); + let server_policy_file = config.resolve_server_policy_file(); + ServerConfigMode::Multi { + graphs, + config_path, + server_policy_file, + } } else { - Some(Arc::new(queries)) + // Rule 5 → error with migration hint. + bail!( + "no graph to serve: pass a URI (`omnigraph-server `), select a target \ + (`--target --config omnigraph.yaml`), set `server.graph: ` in \ + omnigraph.yaml, or for multi-graph mode add a `graphs:` map to the config \ + file referenced by `--config`." + ); + }; + + Ok(ServerConfig { + mode, + bind, + allow_unauthenticated, }) } +/// Whether the loaded config will run the server in multi-graph mode. +/// Useful for the test that constructs `ServerConfig` directly. +pub fn server_config_is_multi(config: &ServerConfig) -> bool { + matches!(config.mode, ServerConfigMode::Multi { .. }) +} + +/// MR-723 server runtime state, classified from the three-state matrix +/// of (bearer tokens configured) × (policy file configured) at startup. +/// +/// * **Open** — neither tokens nor policy; requires explicit +/// `allow_unauthenticated`. Effectively a "trust the network" dev +/// mode. `serve()` refuses to start in this shape without the flag, +/// so the only way to reach this state at runtime is via deliberate +/// operator opt-in. +/// * **DefaultDeny** — tokens configured but no policy file.
The +/// server requires a valid bearer token; once authenticated, every +/// action except `Read` is denied with 403. Closes the "tokens but +/// forgot the policy file" trap. +/// * **PolicyEnabled** — policy file configured and at least one +/// bearer token configured. Cedar evaluates every authenticated +/// request. Policy without tokens is rejected at startup — +/// such a server would 401 every request, which is bug-shaped +/// rather than feature-shaped (operators wanting "deny all +/// unauthenticated traffic" should configure tokens plus a +/// deny-all policy to get meaningful 403s with policy-decision +/// logging instead). +#[derive(Debug, Clone, Copy, Eq, PartialEq)] +pub enum ServerRuntimeState { + Open, + DefaultDeny, + PolicyEnabled, +} + +/// Compute the [`ServerRuntimeState`] from the configured inputs. +/// Pulled out as a pure function so the matrix is unit-testable +/// without standing up the full server. +/// +/// The classifier is the **single source of truth** for "should we +/// start?" — both `serve()`'s single-mode and multi-mode branches +/// call this before constructing their `AppState`. Adding a startup +/// invariant here means both modes enforce it automatically; the +/// alternative (per-constructor `bail!`) drifts the moment a third +/// mode is added. +pub fn classify_server_runtime_state( + has_tokens: bool, + has_policy: bool, + allow_unauthenticated: bool, +) -> Result { + match (has_tokens, has_policy, allow_unauthenticated) { + (false, false, false) => bail!( + "server has no bearer tokens and no policy file configured. This is a fully \ + open server — pass `--unauthenticated` (or set OMNIGRAPH_UNAUTHENTICATED=1) \ + if you actually want that, otherwise configure bearer tokens (see \ + docs/user/server.md) and/or `policy.file` in omnigraph.yaml."
+ ), + (false, false, true) => Ok(ServerRuntimeState::Open), + (true, false, _) => Ok(ServerRuntimeState::DefaultDeny), + (false, true, _) => bail!( + "policy file is configured but no bearer tokens — every request would 401 \ + because no token can ever match. Configure at least one bearer token (see \ + docs/user/server.md), or remove the policy file. To deny all unauthenticated \ + traffic deliberately, configure tokens plus a deny-all Cedar rule — that \ + produces meaningful 403s with policy-decision logging instead of silent 401s." + ), + (true, true, _) => Ok(ServerRuntimeState::PolicyEnabled), + } +} + pub fn build_app(state: AppState) -> Router { // The per-graph protected routes, identical in single + multi mode. // Two middleware layers wrap them (outer first, inner last): @@ -894,40 +939,21 @@ pub fn build_app(state: AppState) -> Router { // flagged and their responses include RFC 9745 Deprecation + // RFC 8288 Link headers. Suppress the call-site warning for the // route registration itself. - .route( - "/read", - post({ - #[allow(deprecated)] - server_read - }), - ) + .route("/read", post({ + #[allow(deprecated)] + server_read + })) .route("/query", post(server_query)) - .route( - "/change", - post({ - #[allow(deprecated)] - server_change - }), - ) + .route("/change", post({ + #[allow(deprecated)] + server_change + })) .route("/mutate", post(server_mutate)) - .route("/queries", get(server_list_queries)) - .route("/queries/{name}", post(server_invoke_query)) .route("/schema", get(server_schema_get)) .route("/schema/apply", post(server_schema_apply)) - .route( - "/load", - post(server_load).layer(DefaultBodyLimit::max(INGEST_REQUEST_BODY_LIMIT_BYTES)), - ) - // /ingest is the deprecated alias of /load; its handler carries - // #[deprecated] (OpenAPI operation flagged) and emits RFC 9745 - // Deprecation + RFC 8288 Link headers. Suppress the call-site warning.
.route( "/ingest", - post({ - #[allow(deprecated)] - server_ingest - }) - .layer(DefaultBodyLimit::max(INGEST_REQUEST_BODY_LIMIT_BYTES)), + post(server_ingest).layer(DefaultBodyLimit::max(INGEST_REQUEST_BODY_LIMIT_BYTES)), ) .route( "/branches", @@ -949,9 +975,13 @@ pub fn build_app(state: AppState) -> Router { // Management endpoints (`GET /graphs`) live alongside the per-graph // router. They go through bearer auth but NOT through // `resolve_graph_handle` — they operate on the registry directly. + // The endpoint is mounted in both modes; in single mode the handler + // returns 405 so clients see "resource exists, wrong context" + // rather than 404 "no such resource." // // Runtime add/remove (`POST /graphs`, `DELETE /graphs/{id}`) is not - // exposed — operators run `cluster apply` and restart. + // exposed in v0.6.0 — operators add graphs by editing + // `omnigraph.yaml` and restarting. let management = Router::new() .route("/graphs", get(server_graphs_list)) .route_layer(middleware::from_fn_with_state( @@ -959,11 +989,15 @@ pub fn build_app(state: AppState) -> Router { require_bearer_auth, )); - // RFC-011 cluster-only: per-graph routes always nest under - // `/graphs/{graph_id}/...`; there are no flat single-graph routes. - let protected: Router = Router::new() - .nest("/graphs/{graph_id}", per_graph_protected) - .merge(management); + // Mount the protected routes differently per mode: + // * Single → flat routes (legacy: `/snapshot`, `/read`, etc.) + // * Multi → nested under `/graphs/{graph_id}/...` + let protected: Router = match state.routing() { + GraphRouting::Single { .. } => per_graph_protected.merge(management), + GraphRouting::Multi { .. } => Router::new() + .nest("/graphs/{graph_id}", per_graph_protected) + .merge(management), + }; Router::new() .route("/healthz", get(server_health)) @@ -984,11 +1018,12 @@ pub async fn serve(config: ServerConfig) -> Result<()> { // policy OR any per-graph policy file.
Mirrors the // `requires_bearer_auth` semantics on AppState. let has_policy_configured = match &config.mode { + ServerConfigMode::Single { policy_file, .. } => policy_file.is_some(), ServerConfigMode::Multi { graphs, - server_policy, + server_policy_file, .. - } => server_policy.is_some() || graphs.iter().any(|g| g.policy.is_some()), + } => server_policy_file.is_some() || graphs.iter().any(|g| g.policy_file.is_some()), }; let runtime_state = classify_server_runtime_state( !tokens.is_empty(), @@ -1004,34 +1039,31 @@ pub async fn serve(config: ServerConfig) -> Result<()> { ServerRuntimeState::DefaultDeny => warn!( "bearer tokens are configured but no policy file is set — running in \ default-deny mode (only `read` actions are permitted for authenticated \ - actors). Configure a graph or cluster policy bundle in the cluster config, \ - run `omnigraph cluster apply`, and restart to enable Cedar rules." + actors). Configure `policy.file` in omnigraph.yaml to enable Cedar rules." ), ServerRuntimeState::PolicyEnabled => {} } let bind = config.bind.clone(); let state = match config.mode { + ServerConfigMode::Single { uri, policy_file } => { + let uri_for_log = uri.clone(); + info!(uri = %uri_for_log, bind = %bind, mode = "single", "serving omnigraph"); + AppState::open_with_bearer_tokens_and_policy(uri, tokens, policy_file.as_ref()).await? + } ServerConfigMode::Multi { graphs, config_path, - server_policy, + server_policy_file, } => { info!( bind = %bind, - mode = "cluster", + mode = "multi", graph_count = graphs.len(), config = %config_path.display(), "serving omnigraph" ); - open_multi_graph_state( - graphs, - tokens, - server_policy.as_ref(), - config_path, - config.require_all_graphs, - ) - .await? + open_multi_graph_state(graphs, tokens, server_policy_file.as_ref(), config_path).await? } }; @@ -1042,30 +1074,21 @@ pub async fn serve(config: ServerConfig) -> Result<()> { Ok(()) } -/// Load a graph-scoped policy bundle from either source kind.
-fn load_graph_policy(source: &PolicySource, graph_id: &str) -> Result { - match source { - PolicySource::File(path) => Ok(PolicyEngine::load_graph(path, graph_id)?), - PolicySource::Inline(text) => Ok(PolicyEngine::load_graph_from_source(text, graph_id)?), - } -} - /// Parallel open of every graph in the startup config, with bounded -/// concurrency (`buffer_unordered(4)`). Graph-specific open failures -/// quarantine that graph; startup succeeds as long as at least one graph -/// opens. +/// concurrency (`buffer_unordered(4)`). Fail-fast — the first open error +/// aborts startup; other in-flight opens are dropped (their `Omnigraph` +/// instances close cleanly via Arc drop). /// /// The bound 4 is a rule-of-thumb for I/O-bound work. At N ≤ 10 this /// trades startup latency for a small amount of concurrent S3 / Lance /// open pressure. -pub async fn open_multi_graph_state( +async fn open_multi_graph_state( graphs: Vec, tokens: Vec<(String, String)>, - server_policy_source: Option<&PolicySource>, + server_policy_file: Option<&PathBuf>, config_path: PathBuf, - require_all_graphs: bool, ) -> Result { - use futures::StreamExt; + use futures::{StreamExt, TryStreamExt}; if graphs.is_empty() { bail!("multi-graph mode requires at least one graph in the `graphs:` map"); @@ -1075,50 +1098,20 @@ pub async fn open_multi_graph_state( // The placeholder graph_id `"server"` is the sentinel the Cedar // resource-model refactor maps to the singleton // `Omnigraph::Server::"root"` entity at evaluation time.
- let server_policy = match server_policy_source { - Some(PolicySource::File(path)) => Some(PolicyEngine::load_server(path)?), - Some(PolicySource::Inline(source)) => Some(PolicyEngine::load_server_from_source(source)?), + let server_policy = match server_policy_file { + Some(path) => Some(PolicyEngine::load_server(path)?), None => None, }; - let configured_graphs = graphs.len(); - let results = futures::stream::iter(graphs.into_iter()) - .map(|cfg| async move { - let graph_id = cfg.graph_id.clone(); - open_single_graph(cfg).await.map_err(|err| (graph_id, err)) - }) + // `try_collect` propagates the first error eagerly, dropping every + // in-flight open. `buffer_unordered + collect::>` would drain + // the stream before checking errors — incorrect for the docstring's + // "fail-fast" claim and wasteful on S3-backed graphs. + let handles: Vec> = futures::stream::iter(graphs.into_iter()) + .map(|cfg| async move { open_single_graph(cfg).await }) .buffer_unordered(4) - .collect::>() - .await; - let mut handles = Vec::new(); - let mut failed = 0usize; - for result in results { - match result { - Ok(handle) => handles.push(handle), - Err((graph_id, err)) => { - failed += 1; - warn!( - graph_id = %graph_id, - error = %err, - "graph quarantined during startup" - ); - } - } - } - if require_all_graphs && failed > 0 { - bail!( - "strict multi-graph startup requires every graph to open ({} configured, {} failed)", - configured_graphs, - failed - ); - } - if handles.is_empty() { - bail!( - "no healthy graphs opened from multi-graph startup config ({} configured, {} failed)", - configured_graphs, - failed - ); - } + .try_collect() + .await?; let workload = workload::WorkloadController::from_env(); let state = AppState::new_multi(handles, tokens, server_policy, workload, Some(config_path)) @@ -1137,21 +1130,10 @@ async fn open_single_graph(cfg: GraphStartupConfig) -> Result> let db = Omnigraph::open(&uri) .await .map_err(|err| color_eyre::eyre::eyre!("open graph '{}' at {}:
{err}", graph_id, uri))?; - let db = if let Some(embedding) = cfg.embedding { - db.with_embedding_config(Arc::new(embedding)) - } else { - db - }; - // Validate this graph's stored queries against the live schema and - // resolve them to an attachable handle (refuse boot on breakage). - // Done before the policy match rebinds `db`; the catalog handle is an - // owned `Arc`, so no borrow of `db` survives into the match. - let queries = validate_and_attach(cfg.queries, &db.catalog(), graph_id.as_str())?; - - let (policy_arc, db) = match &cfg.policy { - Some(source) => { - let policy = load_graph_policy(source, graph_id.as_str())?; + let (policy_arc, db) = match &cfg.policy_file { + Some(path) => { + let policy = PolicyEngine::load_graph(path, graph_id.as_str())?; let policy_arc: Arc = Arc::new(policy); let checker = Arc::clone(&policy_arc) as Arc; (Some(policy_arc), db.with_policy(checker)) @@ -1164,7 +1146,6 @@ async fn open_single_graph(cfg: GraphStartupConfig) -> Result> uri, engine: Arc::new(db), policy: policy_arc, - queries, })) } @@ -1175,3 +1156,1948 @@ async fn shutdown_signal() { } info!("shutdown signal received"); } + +#[utoipa::path( + get, + path = "/healthz", + tag = "health", + operation_id = "health", + responses( + (status = 200, description = "Server is healthy", body = HealthOutput), + ), +)] +/// Liveness probe. +/// +/// Returns server status and version. Unauthenticated; safe to call from any +/// caller. Use this to confirm the server is reachable before invoking other +/// endpoints. 
+async fn server_health() -> Json { + Json(HealthOutput { + status: "ok".to_string(), + version: SERVER_VERSION.to_string(), + source_version: SERVER_SOURCE_VERSION.map(str::to_string), + }) +} + +#[utoipa::path( + get, + path = "/graphs", + tag = "management", + operation_id = "listGraphs", + responses( + (status = 200, description = "List of registered graphs", body = GraphListResponse), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + (status = 405, description = "Method not allowed (single-graph mode)", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// List every graph currently registered with this server (MR-668). +/// +/// Multi-graph mode only. In single mode, the route returns 405 — there's +/// no registry to enumerate. Cedar-gated by the server-level policy via +/// the `graph_list` action against `Omnigraph::Server::"root"`. +/// +/// Order: alphabetical by `graph_id` (server-sorted so clients see +/// deterministic output across requests). +async fn server_graphs_list( + State(state): State, + actor: Option>, +) -> std::result::Result, ApiError> { + // 405 in single mode — there's no registry to enumerate, and the + // legacy URL surface didn't expose this endpoint. + let registry = match state.routing() { + GraphRouting::Single { .. } => { + return Err(ApiError::method_not_allowed( + "GET /graphs is only available in multi-graph mode", + )); + } + GraphRouting::Multi { registry, .. } => registry, + }; + + // Server-level Cedar gate. `state.server_policy` is loaded from + // `server.policy.file` in `omnigraph.yaml` at startup. When no + // server policy is configured, `authorize_request_server` falls + // through to the MR-723 default-deny semantics (every non-Read + // action denied for an authenticated actor).
`GraphList` is not + // `Read`, so without a server policy the request gets 403 — which + // is the right default (don't leak the registry until the operator + // explicitly authorizes it). + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + state.server_policy.as_deref(), + PolicyRequest { + action: PolicyAction::GraphList, + branch: None, + target_branch: None, + }, + )?; + + let mut graphs: Vec = registry + .list() + .into_iter() + .map(|handle| GraphInfo { + graph_id: handle.key.graph_id.as_str().to_string(), + uri: handle.uri.clone(), + }) + .collect(); + graphs.sort_by(|a, b| a.graph_id.cmp(&b.graph_id)); + Ok(Json(GraphListResponse { graphs })) +} + +async fn server_openapi(State(state): State) -> Json { + let mut doc = ApiDoc::openapi(); + if !state.requires_bearer_auth() { + strip_security(&mut doc); + } + // MR-668: in multi mode, the protected routes live under + // `/graphs/{graph_id}/...`. Rewrite the doc so the spec matches + // the routes the router actually serves. Public paths (`/healthz`) + // stay flat in both modes. + if matches!(state.routing(), GraphRouting::Multi { .. }) { + nest_paths_under_cluster_prefix(&mut doc); + } + Json(doc) +} + +/// Path prefix used to namespace per-graph routes in multi mode. +/// Kept in sync with the `Router::nest(...)` invocation in `build_app`. +const CLUSTER_PATH_PREFIX: &str = "/graphs/{graph_id}"; + +/// Operation-id prefix applied to every cloned cluster operation. +/// Decision 7 in the implementation plan — keeps operation IDs unique +/// across the spec when both flat and nested variants ever appear in +/// the same generation pass. +const CLUSTER_OPERATION_ID_PREFIX: &str = "cluster_"; + +/// Paths that stay flat in every server mode (public or server-level, +/// no per-graph dependency). Update this list when adding new +/// always-flat endpoints.
`/graphs` is the management enumeration — +/// it lives at the root in both single mode (405) and multi mode, and +/// must never be rewritten to `/graphs/{graph_id}/graphs`. +const ALWAYS_FLAT_PATHS: &[&str] = &["/healthz", "/graphs"]; + +/// In multi-mode `server_openapi`, every protected path-item is +/// reattached under the cluster prefix. Operation IDs gain the +/// `cluster_` prefix so SDK generators don't collide if/when both +/// surfaces are merged. Every rewritten operation also declares the +/// required `{graph_id}` path parameter so the served OpenAPI document +/// remains internally valid. +/// +/// Removing the flat protected paths matches the runtime router — +/// in multi mode, requests to `/snapshot` etc. return 404, so the +/// spec must agree. +fn nest_paths_under_cluster_prefix(doc: &mut utoipa::openapi::OpenApi) { + let original = std::mem::take(&mut doc.paths.paths); + let mut rewritten = std::collections::BTreeMap::new(); + for (path, mut item) in original { + if ALWAYS_FLAT_PATHS.contains(&path.as_str()) { + rewritten.insert(path, item); + continue; + } + rename_operation_ids(&mut item, CLUSTER_OPERATION_ID_PREFIX); + add_cluster_graph_id_parameter(&mut item); + let new_path = format!("{CLUSTER_PATH_PREFIX}{path}"); + rewritten.insert(new_path, item); + } + doc.paths.paths = rewritten; +} + +fn add_cluster_graph_id_parameter(item: &mut utoipa::openapi::PathItem) { + for op in path_item_operations_mut(item) { + let parameters = op.parameters.get_or_insert_with(Vec::new); + let has_graph_id = parameters + .iter() + .any(|param| param.name == "graph_id" && param.parameter_in == ParameterIn::Path); + if !has_graph_id { + parameters.insert(0, graph_id_path_parameter()); + } + } +} + +fn graph_id_path_parameter() -> Parameter { + let mut parameter = Parameter::new("graph_id"); + parameter.parameter_in = ParameterIn::Path; + parameter.description = Some("Graph id to route the request to.".to_string()); + parameter.schema =
Some(Object::with_type(Type::String).into()); + parameter +} + +/// Prefix every operation_id in this PathItem with `prefix`. +fn rename_operation_ids(item: &mut utoipa::openapi::PathItem, prefix: &str) { + for op in path_item_operations_mut(item) { + if let Some(id) = op.operation_id.as_deref() { + op.operation_id = Some(format!("{prefix}{id}")); + } + } +} + +fn path_item_operations_mut( + item: &mut utoipa::openapi::PathItem, +) -> impl Iterator { + [ + item.get.as_mut(), + item.post.as_mut(), + item.put.as_mut(), + item.delete.as_mut(), + item.options.as_mut(), + item.head.as_mut(), + item.patch.as_mut(), + item.trace.as_mut(), + ] + .into_iter() + .flatten() +} + +fn strip_security(doc: &mut utoipa::openapi::OpenApi) { + if let Some(components) = doc.components.as_mut() { + components.security_schemes.clear(); + } + for path_item in doc.paths.paths.values_mut() { + for op in [ + path_item.get.as_mut(), + path_item.post.as_mut(), + path_item.put.as_mut(), + path_item.delete.as_mut(), + path_item.options.as_mut(), + path_item.head.as_mut(), + path_item.patch.as_mut(), + path_item.trace.as_mut(), + ] + .into_iter() + .flatten() + { + op.security = None; + } + } +} + +async fn require_bearer_auth( + State(state): State, + mut request: Request, + next: Next, +) -> std::result::Result { + if !state.requires_bearer_auth() { + return Ok(next.run(request).await); + } + + let Some(header) = request + .headers() + .get(AUTHORIZATION) + .and_then(|value| value.to_str().ok()) + else { + return Err(ApiError::unauthorized("missing bearer token")); + }; + + let Some(provided_token) = header.strip_prefix("Bearer ") else { + return Err(ApiError::unauthorized("missing bearer token")); + }; + + let Some(actor) = state.authenticate_bearer_token(provided_token) else { + return Err(ApiError::unauthorized("invalid bearer token")); + }; + request.extensions_mut().insert(actor); + + Ok(next.run(request).await) +} + +/// Routing middleware (MR-668). 
Resolves the active graph for the +/// request and injects `Arc` as an extension so handlers can +/// extract it via `Extension>`. +/// +/// **Single mode**: the routing field holds the single handle directly. +/// Routes are flat; every request resolves to that handle, regardless +/// of the URI path. No registry walk, no sentinel key, no +/// programmer-error guard. +/// +/// **Multi mode**: routes are nested under `/graphs/{graph_id}/...`. The +/// middleware extracts `{graph_id}` from the URI path and looks it up in +/// the registry. Returns 404 if the graph is not registered. +/// +/// The middleware fires AFTER `require_bearer_auth`, so the actor is +/// already in the request extensions (or auth was off entirely). +async fn resolve_graph_handle( + State(state): State, + mut request: Request, + next: Next, +) -> std::result::Result { + let handle = match &state.routing { + GraphRouting::Single { handle } => Arc::clone(handle), + GraphRouting::Multi { registry, .. } => { + // `Router::nest("/graphs/{graph_id}", inner)` rewrites + // `request.uri().path()` to the inner suffix (e.g. `/snapshot`). + // The pre-rewrite URI is preserved in the `OriginalUri` + // request extension by axum's router; we read from there to + // extract `{graph_id}`. Fall back to the current URI only if + // the extension is missing, which shouldn't happen for + // nested routes but is safe defensive code. 
+ let original_path: String = request + .extensions() + .get::() + .map(|OriginalUri(uri)| uri.path().to_string()) + .unwrap_or_else(|| request.uri().path().to_string()); + let graph_id_str = original_path + .strip_prefix("/graphs/") + .and_then(|rest| rest.split('/').next()) + .filter(|s| !s.is_empty()) + .ok_or_else(|| { + ApiError::bad_request( + "cluster route missing /graphs/{graph_id} prefix".to_string(), + ) + })?; + let graph_id = GraphId::try_from(graph_id_str.to_string()) + .map_err(|err| ApiError::bad_request(err.to_string()))?; + let key = GraphKey::cluster(graph_id.clone()); + match registry.get(&key) { + RegistryLookup::Ready(handle) => handle, + RegistryLookup::Gone => { + return Err(ApiError::not_found(format!("graph '{graph_id}' not found"))); + } + } + } + }; + + // Per-request observability. `Span::current().record` would silently + // no-op here because no upstream `#[tracing::instrument(...)]` macro + // declares a `graph_id` field; emit an explicit event instead so the + // routing decision actually lands in logs. + info!(graph_id = %handle.key.graph_id, "graph routed"); + + request.extensions_mut().insert(handle); + Ok(next.run(request).await) +} + +fn log_policy_decision(actor_id: &str, request: &PolicyRequest, decision: &PolicyDecision) { + info!( + actor_id = actor_id, + action = %request.action, + branch = request.branch.as_deref().unwrap_or(""), + target_branch = request.target_branch.as_deref().unwrap_or(""), + allowed = decision.allowed, + matched_rule_id = decision.matched_rule_id.as_deref().unwrap_or(""), + "policy decision" + ); +} + +/// HTTP-layer Cedar policy gate. Two sources of the policy engine: +/// * Per-graph handler — passes `handle.policy.as_deref()` so the +/// graph's Cedar rules govern read/change/branch_*/schema_apply.
+/// * Management handler — passes `state.server_policy.as_deref()` so +/// server-level Cedar rules govern `graph_list` (the only shipped +/// server-scoped action; runtime `graph_create` / `graph_delete` +/// are deferred until a managed cluster catalog lands). +/// +/// The MR-731 invariant lives inside this function: actor identity is +/// supplied as a separate argument from the resolved bearer match. The +/// `PolicyRequest` struct itself does not carry identity (the field was +/// dropped from the type), so handlers cannot smuggle it through the +/// request. See `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers` +/// at `tests/server.rs`. +fn authorize_request( + actor: Option<&ResolvedActor>, + policy: Option<&PolicyEngine>, + request: PolicyRequest, +) -> std::result::Result<(), ApiError> { + let Some(engine) = policy else { + // No PolicyEngine installed. Three runtime states can reach this: + // + // * **Open mode** (`--unauthenticated`): no tokens, no policy. + // Per-graph operations are open by operator opt-in (they + // accepted "trust the network" for graph data). + // * **DefaultDeny mode**: tokens configured but no policy. The + // request went through bearer auth, so `actor` is Some. Only + // per-graph `Read` is permitted; other per-graph actions + // return 403. Closes the "configured auth but forgot the + // policy file" trap from MR-723. + // * Either of the above with a **server-scoped** action + // (`graph_list`, future `graph_create`/`graph_delete`). + // + // Server-scoped actions are always denied here, regardless of + // mode or actor presence. The management surface leaks server + // topology (graph IDs + URIs that may contain S3 bucket paths + // or internal hostnames) — operators who opted into Open mode + // accepted exposure of graph DATA, not exposure of server + // topology.
Closing the management surface by default in every + // runtime state means the docstring contract on + // `server_graphs_list` ("don't leak the registry until the + // operator explicitly authorizes it") holds uniformly; the + // operator's only path to enabling it is configuring an + // explicit `server.policy.file` in omnigraph.yaml. + if request.action.resource_kind() == PolicyResourceKind::Server { + return Err(ApiError::forbidden( + "server-scoped actions require an explicit `server.policy.file` \ + configured in omnigraph.yaml — the management surface is closed \ + by default in every runtime state, including --unauthenticated, \ + so that server topology is never exposed without operator opt-in.", + )); + } + if actor.is_some() && request.action != PolicyAction::Read { + return Err(ApiError::forbidden( + "server runs in default-deny mode (bearer tokens configured but no \ + policy file). Only `read` actions are permitted; configure \ + `policy.file` in omnigraph.yaml to enable other actions.", + )); + } + return Ok(()); + }; + let Some(actor) = actor else { + return Err(ApiError::unauthorized("missing bearer token")); + }; + // SECURITY INVARIANT (MR-731): actor identity is supplied to the + // policy engine here as a separate argument, sourced from the + // bearer-token match resolved by `require_bearer_auth`. The + // `PolicyRequest` struct itself no longer carries `actor_id` (it + // was dropped from the type), so handlers cannot smuggle identity + // through the request body and there is no overwrite step that + // could be skipped. The principle is codified in + // `docs/dev/invariants.md` Hard Invariant 11 ("clients cannot set + // actor identity directly") and pinned by the regression test + // `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers` + // in `crates/omnigraph-server/tests/server.rs`.
+ let actor_id = actor.actor_id.as_ref(); + let decision = engine + .authorize(actor_id, &request) + .map_err(|err| ApiError::internal(format!("policy: {err}")))?; + log_policy_decision(actor_id, &request, &decision); + if decision.allowed { + Ok(()) + } else { + Err(ApiError::forbidden(decision.message)) + } +} + +#[utoipa::path( + get, + path = "/snapshot", + tag = "snapshots", + operation_id = "getSnapshot", + params(SnapshotQuery), + responses( + (status = 200, description = "Database snapshot", body = api::SnapshotOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// Read the current snapshot of a branch. +/// +/// Returns the manifest version plus per-table metadata (path, version, row +/// count) for every table on the branch. Defaults to `main` when `branch` is +/// omitted. Read-only. +async fn server_snapshot( + Extension(handle): Extension>, + actor: Option>, + Query(query): Query, +) -> std::result::Result, ApiError> { + let branch = query.branch.unwrap_or_else(|| "main".to_string()); + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::Read, + branch: Some(branch.clone()), + target_branch: None, + }, + )?; + let snapshot = { + let db = &handle.engine; + db.snapshot_of(ReadTarget::branch(branch.as_str())) + .await + .map_err(ApiError::from_omni)? + }; + Ok(Json(snapshot_payload(&branch, &snapshot))) +} + +/// Header values that flag a response as coming from a deprecated route +/// (RFC 9745 / RFC 8288) and point at the canonical successor. 
+fn deprecation_headers(successor_link: &'static str) -> [(HeaderName, HeaderValue); 2] { + [ + ( + HeaderName::from_static("deprecation"), + HeaderValue::from_static("true"), + ), + ( + HeaderName::from_static("link"), + HeaderValue::from_static(successor_link), + ), + ] +} + +#[utoipa::path( + post, + path = "/read", + tag = "queries", + operation_id = "read", + request_body = ReadRequest, + responses( + (status = 200, description = "Query results (response includes `Deprecation: true` + `Link: ; rel=\"successor-version\"`)", body = ReadOutput), + (status = 400, description = "Bad request", body = ErrorOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +#[deprecated(note = "use POST /query instead; /read is kept indefinitely for byte-stable back-compat")] +/// **Deprecated** — use [`POST /query`](#tag/queries/operation/query) instead. +/// +/// Execute a GQ read query. Behavior is unchanged from prior releases; the +/// route is kept indefinitely for byte-stable back-compat. New integrations +/// should target `POST /query`, which has clean field names (`query` / +/// `name`) and a 400-on-mutation guard. Responses from this route include +/// `Deprecation: true` and `Link: ; rel="successor-version"` +/// headers per RFC 9745 / RFC 8288 so SDKs and proxies can surface the +/// signal.
+async fn server_read( + Extension(handle): Extension>, + actor: Option>, + Json(request): Json, +) -> std::result::Result<([(HeaderName, HeaderValue); 2], Json), ApiError> { + let (selected_name, target, result) = run_query( + handle, + actor.as_ref().map(|Extension(actor)| actor), + &request.query_source, + request.query_name.as_deref(), + request.params.as_ref(), + request.branch, + request.snapshot, + false, // /read predates the D2 rule; legacy callers may submit mutating queries here + ) + .await?; + Ok(( + deprecation_headers("; rel=\"successor-version\""), + Json(api::read_output(selected_name, &target, result)), + )) +} + +#[utoipa::path( + post, + path = "/query", + tag = "queries", + operation_id = "query", + request_body = QueryRequest, + responses( + (status = 200, description = "Query results", body = ReadOutput), + (status = 400, description = "Bad request - also returned when the query body contains mutations; use POST /mutate (or its deprecated alias POST /change) for write queries", body = ErrorOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// Execute an inline read query (friendlier-named alternative to `POST /read`). +/// +/// Designed for ad-hoc exploration and AI-agent tool-use: short field +/// names (`query`, `name`) match the CLI `-e` flag and the GQ `query` +/// keyword. Mutations (`insert`/`update`/`delete`) are rejected with 400 +/// -- use `POST /mutate` (or its deprecated alias `POST /change`) for +/// write queries. Otherwise behaves identically to `POST /read`: same +/// target semantics (branch xor snapshot), same Cedar action (Read), +/// same response shape. 
+async fn server_query( + Extension(handle): Extension>, + actor: Option>, + Json(request): Json, +) -> std::result::Result, ApiError> { + let (selected_name, target, result) = run_query( + handle, + actor.as_ref().map(|Extension(actor)| actor), + &request.query, + request.name.as_deref(), + request.params.as_ref(), + request.branch, + request.snapshot, + true, // /query is read-only; reject mutations + ) + .await?; + Ok(Json(api::read_output(selected_name, &target, result))) +} + +#[utoipa::path( + post, + path = "/export", + tag = "queries", + operation_id = "export", + request_body = ExportRequest, + responses( + (status = 200, description = "Exported data as NDJSON", content_type = "application/x-ndjson"), + (status = 400, description = "Bad request", body = ErrorOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// Stream the contents of a branch as NDJSON. +/// +/// Emits one JSON object per line (`application/x-ndjson`). Filter with +/// `type_names` (node/edge type names) and/or `table_keys`; both empty +/// streams the entire branch. Suitable for large exports — the response is +/// streamed, not buffered. Read-only.
+async fn server_export( + Extension(handle): Extension>, + actor: Option>, + Json(request): Json, +) -> std::result::Result { + let branch = request.branch.unwrap_or_else(|| "main".to_string()); + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::Export, + branch: Some(branch.clone()), + target_branch: None, + }, + )?; + let engine = Arc::clone(&handle.engine); + let type_names = request.type_names.clone(); + let table_keys = request.table_keys.clone(); + let (tx, rx) = mpsc::unbounded_channel::>(); + tokio::spawn(async move { + let result = { + let mut writer = ExportStreamWriter { sender: tx.clone() }; + engine + .export_jsonl_to_writer(&branch, &type_names, &table_keys, &mut writer) + .await + }; + if let Err(err) = result { + let _ = tx.send(Err(io::Error::other(err.to_string()))); + } + }); + let body = Body::from_stream(stream::unfold(rx, |mut rx| async move { + rx.recv().await.map(|item| (item, rx)) + })); + Ok(( + StatusCode::OK, + [(CONTENT_TYPE, "application/x-ndjson; charset=utf-8")], + body, + ) + .into_response()) +} + +/// Shared implementation behind `POST /mutate` (canonical) and +/// `POST /change` (deprecated alias). Returns the bare `ChangeOutput`; +/// each route handler wraps it (the alias also attaches Deprecation +/// headers). +/// Shared backend for `/mutate` (canonical) and `/change` (deprecated alias). +/// +/// Decoupled from `ChangeRequest` so MR-969's `/queries/{name}` stored-query +/// handler can call this directly with registry-supplied fields without +/// rebuilding the request body. Today's HTTP handlers unpack the request and +/// call here; the registry would do the same. 
+async fn run_mutate( + state: AppState, + handle: Arc, + actor: Option<&ResolvedActor>, + query: &str, + name: Option<&str>, + params_json: Option<&Value>, + branch: String, +) -> std::result::Result { + let actor_arc = actor + .map(|a| Arc::clone(&a.actor_id)) + .unwrap_or_else(|| Arc::::from("anonymous")); + let actor_id = actor.map(|a| a.actor_id.as_ref()); + authorize_request( + actor, + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::Change, + branch: Some(branch.clone()), + target_branch: None, + }, + )?; + // Per-actor admission: bound concurrent in-flight mutations and + // estimated bytes per actor. Cedar runs FIRST so denied requests + // don't consume admission slots. Estimate uses the request body + // size as a coarse proxy; engine memory pressure can run higher. + let est_bytes = query.len() as u64 + + params_json + .map(|p| p.to_string().len() as u64) + .unwrap_or(0); + let _admission = state + .workload + .try_admit(&actor_arc, est_bytes) + .map_err(ApiError::from_workload_reject)?; + let (selected_name, query_params) = + select_named_query(query, name).map_err(|err| ApiError::bad_request(err.to_string()))?; + let params = query_params_from_json(&query_params, params_json) + .map_err(|err| ApiError::bad_request(err.to_string()))?; + + let result = { + let db = &handle.engine; + db.mutate_as(&branch, query, &selected_name, &params, actor_id) + .await + .map_err(ApiError::from_omni)? + }; + Ok(ChangeOutput { + branch, + query_name: selected_name, + affected_nodes: result.affected_nodes, + affected_edges: result.affected_edges, + actor_id: actor_id.map(str::to_string), + }) +} + +/// Shared backend for `/query` (canonical) and `/read` (deprecated alias). +/// +/// Mirrors [`run_mutate`]'s decoupled shape so MR-969's stored-query handler +/// can call here with registry-supplied fields. Rejects inline source that +/// contains mutations (D2 rule); callers wanting writes go through +/// [`run_mutate`] instead.
+/// +/// Intentionally does **not** take [`AppState`] (unlike [`run_mutate`]): +/// reads are not admission-gated today, so there is no `state.workload` +/// consumer. The signature grows the parameter when Phase 1 (MR-976) adds +/// the request envelope's `expect: { max_rows_scanned: N }` budget, or +/// MR-969 extends per-actor admission to stored-read invocations. +async fn run_query( + handle: Arc, + actor: Option<&ResolvedActor>, + query: &str, + name: Option<&str>, + params_json: Option<&Value>, + branch: Option, + snapshot: Option, + reject_mutations: bool, +) -> std::result::Result<(String, ReadTarget, omnigraph_compiler::result::QueryResult), ApiError> { + if branch.is_some() && snapshot.is_some() { + return Err(ApiError::bad_request( + "request may specify branch or snapshot, not both", + )); + } + + let target = read_target_from_request(branch, snapshot); + let policy_branch = match &target { + ReadTarget::Branch(branch) => Some(branch.clone()), + ReadTarget::Snapshot(_) if handle.policy.is_some() && actor.is_some() => { + let db = &handle.engine; + db.resolved_branch_of(target.clone()) + .await + .map(|branch| branch.or_else(|| Some("main".to_string()))) + .map_err(ApiError::from_omni)? 
+ } + ReadTarget::Snapshot(_) => None, + }; + authorize_request( + actor, + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::Read, + branch: policy_branch, + target_branch: None, + }, + )?; + let query_decl = + select_named_query_decl(query, name).map_err(|err| ApiError::bad_request(err.to_string()))?; + if reject_mutations && !query_decl.mutations.is_empty() { + return Err(ApiError::bad_request(format!( + "query '{}' contains mutations (insert/update/delete); use POST /mutate for write queries", + query_decl.name + ))); + } + let selected_name = query_decl.name.clone(); + let params = query_params_from_json(&query_decl.params, params_json) + .map_err(|err| ApiError::bad_request(err.to_string()))?; + + let result = { + let db = &handle.engine; + db.query(target.clone(), query, &selected_name, &params) + .await + .map_err(ApiError::from_omni)? + }; + Ok((selected_name, target, result)) +} + +#[utoipa::path( + post, + path = "/change", + tag = "mutations", + operation_id = "change", + request_body = ChangeRequest, + responses( + (status = 200, description = "Mutation results (response includes `Deprecation: true` + `Link: ; rel=\"successor-version\"`)", body = ChangeOutput), + (status = 400, description = "Bad request", body = ErrorOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + (status = 409, description = "Merge conflict", body = ErrorOutput), + (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +#[deprecated(note = "use POST /mutate instead; /change is kept indefinitely for back-compat")] +/// **Deprecated** — use [`POST /mutate`](#tag/mutations/operation/mutate) instead. +/// +/// Apply a GQ mutation to a branch. Behavior is unchanged; the route is +/// kept indefinitely for back-compat.
New integrations should target +/// `POST /mutate`, which has identical semantics and a name that pairs +/// cleanly with `POST /query`. Responses from this route include +/// `Deprecation: true` and `Link: ; rel="successor-version"` +/// headers per RFC 9745 / RFC 8288 so SDKs and proxies can surface the +/// signal. +async fn server_change( + State(state): State, + Extension(handle): Extension>, + actor: Option>, + Json(request): Json, +) -> std::result::Result<([(HeaderName, HeaderValue); 2], Json), ApiError> { + let branch = request.branch.unwrap_or_else(|| "main".to_string()); + let output = run_mutate( + state, + handle, + actor.as_ref().map(|Extension(actor)| actor), + &request.query, + request.name.as_deref(), + request.params.as_ref(), + branch, + ) + .await?; + Ok(( + deprecation_headers("; rel=\"successor-version\""), + Json(output), + )) +} + +#[utoipa::path( + post, + path = "/mutate", + tag = "mutations", + operation_id = "mutate", + request_body = ChangeRequest, + responses( + (status = 200, description = "Mutation results", body = ChangeOutput), + (status = 400, description = "Bad request", body = ErrorOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + (status = 409, description = "Merge conflict", body = ErrorOutput), + (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// Apply a GQ mutation to a branch (canonical mutation endpoint). +/// +/// Writes to the named `branch` (defaults to `main`). Mutations are atomic +/// per call and produce a new commit. Returns counts of nodes and edges +/// affected. **Destructive**: on success the branch is updated; rejected +/// mutations may still acquire locks briefly. Returns 409 on merge conflict. +/// +/// Pairs with `POST /query` (read-only). 
The legacy `POST /change` route +/// has identical semantics and is kept as a deprecated alias. +async fn server_mutate( + State(state): State, + Extension(handle): Extension>, + actor: Option>, + Json(request): Json, +) -> std::result::Result, ApiError> { + let branch = request.branch.unwrap_or_else(|| "main".to_string()); + Ok(Json( + run_mutate( + state, + handle, + actor.as_ref().map(|Extension(actor)| actor), + &request.query, + request.name.as_deref(), + request.params.as_ref(), + branch, + ) + .await?, + )) +} + +#[utoipa::path( + get, + path = "/schema", + tag = "schema", + operation_id = "getSchema", + responses( + (status = 200, description = "Current schema source", body = SchemaOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// Read the current schema source. +/// +/// Returns the project's schema as a single string in `.pg` source form. +/// Useful for clients that want to introspect available types and tables +/// before constructing GQ queries. Read-only. 
+async fn server_schema_get( + Extension(handle): Extension>, + actor: Option>, +) -> std::result::Result, ApiError> { + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::Read, + branch: None, + target_branch: None, + }, + )?; + let schema_source = { + let db = &handle.engine; + db.schema_source().to_string() + }; + Ok(Json(SchemaOutput { schema_source })) +} + +#[utoipa::path( + post, + path = "/schema/apply", + tag = "mutations", + operation_id = "applySchema", + request_body = SchemaApplyRequest, + responses( + (status = 200, description = "Schema apply results", body = SchemaApplyOutput), + (status = 400, description = "Bad request", body = ErrorOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// Apply a schema migration. +/// +/// Diffs `schema_source` against the current schema and applies the resulting +/// migration steps (add/drop type, add/drop column, etc.). **Destructive**: +/// some steps drop data. Returns the list of steps applied; if `applied` is +/// false the diff was unsupported and no changes were made. 
+async fn server_schema_apply( + State(state): State, + Extension(handle): Extension>, + actor: Option>, + Json(request): Json, +) -> std::result::Result, ApiError> { + let actor_arc = actor + .as_ref() + .map(|Extension(actor)| Arc::clone(&actor.actor_id)) + .unwrap_or_else(|| Arc::::from("anonymous")); + let actor_id = actor + .as_ref() + .map(|Extension(actor)| actor.actor_id.as_ref()); + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::SchemaApply, + branch: None, + target_branch: Some("main".to_string()), + }, + )?; + let est_bytes = request.schema_source.len() as u64; + let _admission = state + .workload + .try_admit(&actor_arc, est_bytes) + .map_err(ApiError::from_workload_reject)?; + let result = { + let db = &handle.engine; + // Engine-layer policy enforcement (MR-722): pass the resolved + // actor through so apply_schema_as can call enforce() with the + // authoritative identity. With a policy installed in AppState, + // engine-side enforcement re-checks the same decision the + // HTTP-layer authorize_request just made above. PR #3 collapses + // the redundancy. + db.apply_schema_as( + &request.schema_source, + omnigraph::db::SchemaApplyOptions { + allow_data_loss: request.allow_data_loss, + }, + actor_id, + ) + .await + .map_err(ApiError::from_omni)? 
+ }; + Ok(Json(schema_apply_output(handle.uri.as_str(), result))) +} + +#[utoipa::path( + post, + path = "/ingest", + tag = "mutations", + operation_id = "ingest", + request_body = IngestRequest, + responses( + (status = 200, description = "Ingest results", body = IngestOutput), + (status = 400, description = "Bad request", body = ErrorOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// Bulk-ingest NDJSON data into a branch. +/// +/// `data` is NDJSON with one record per line. `mode` controls behavior on +/// existing rows: `merge` upserts by id (default), `append` blindly inserts, +/// `overwrite` replaces table contents. If `branch` does not exist it is +/// created from `from` (defaults to `main`). **Destructive** when `mode` is +/// `overwrite` or when ingest produces conflicting writes. +async fn server_ingest( + State(state): State, + Extension(handle): Extension>, + actor: Option>, + Json(request): Json, +) -> std::result::Result, ApiError> { + let branch = request.branch.unwrap_or_else(|| "main".to_string()); + let from = request.from.unwrap_or_else(|| "main".to_string()); + let mode = request.mode.unwrap_or(omnigraph::loader::LoadMode::Merge); + let actor_arc = actor + .as_ref() + .map(|Extension(actor)| Arc::clone(&actor.actor_id)) + .unwrap_or_else(|| Arc::::from("anonymous")); + let actor_id = actor + .as_ref() + .map(|Extension(actor)| actor.actor_id.as_ref()); + + let branch_exists = { + let db = &handle.engine; + db.branch_list() + .await + .map_err(ApiError::from_omni)? 
+ .into_iter() + .any(|name| name == branch) + }; + + if !branch_exists { + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::BranchCreate, + branch: Some(from.clone()), + target_branch: Some(branch.clone()), + }, + )?; + } + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::Change, + branch: Some(branch.clone()), + target_branch: None, + }, + )?; + let est_bytes = request.data.len() as u64; + let _admission = state + .workload + .try_admit(&actor_arc, est_bytes) + .map_err(ApiError::from_workload_reject)?; + + let result = { + let db = &handle.engine; + db.ingest_as(&branch, Some(&from), &request.data, mode, actor_id) + .await + .map_err(ApiError::from_omni)? + }; + + Ok(Json(ingest_output( + handle.uri.as_str(), + &result, + actor_id.map(str::to_string), + ))) +} + +#[utoipa::path( + get, + path = "/branches", + tag = "branches", + operation_id = "listBranches", + responses( + (status = 200, description = "List of branches", body = BranchListOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// List all branches. +/// +/// Returns branch names sorted alphabetically. Read-only. +async fn server_branch_list( + Extension(handle): Extension>, + actor: Option>, +) -> std::result::Result, ApiError> { + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::Read, + branch: None, + target_branch: None, + }, + )?; + let mut branches = { + let db = &handle.engine; + db.branch_list().await.map_err(ApiError::from_omni)? 
+ }; + branches.sort(); + Ok(Json(BranchListOutput { branches })) +} + +#[utoipa::path( + post, + path = "/branches", + tag = "branches", + operation_id = "createBranch", + request_body = BranchCreateRequest, + responses( + (status = 200, description = "Branch created", body = BranchCreateOutput), + (status = 400, description = "Bad request", body = ErrorOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + (status = 409, description = "Branch already exists", body = ErrorOutput), + (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// Create a new branch. +/// +/// Forks `name` off of `from` (defaults to `main`). The new branch shares +/// table data with its parent until it is mutated. Returns 409 if `name` +/// already exists. +async fn server_branch_create( + State(state): State, + Extension(handle): Extension>, + actor: Option>, + Json(request): Json, +) -> std::result::Result, ApiError> { + let from = request.from.unwrap_or_else(|| "main".to_string()); + let actor_arc = actor + .as_ref() + .map(|Extension(actor)| Arc::clone(&actor.actor_id)) + .unwrap_or_else(|| Arc::::from("anonymous")); + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::BranchCreate, + branch: Some(from.clone()), + target_branch: Some(request.name.clone()), + }, + )?; + // Branch metadata only — small constant bytes estimate. The Lance + // shallow-clone work is bounded by the parent's manifest size, not + // the request body.
+ let _admission = state + .workload + .try_admit(&actor_arc, 256) + .map_err(ApiError::from_workload_reject)?; + { + let db = &handle.engine; + db.branch_create_from_as( + ReadTarget::branch(&from), + &request.name, + actor.as_ref().map(|Extension(a)| a.actor_id.as_ref()), + ) + .await + .map_err(ApiError::from_omni)?; + } + Ok(Json(BranchCreateOutput { + uri: handle.uri.clone(), + from, + name: request.name, + actor_id: actor.map(|Extension(actor)| actor.actor_id.as_ref().to_string()), + })) +} + +/// Path-param shape for [`server_branch_delete`]. Named-field +/// deserialization (rather than `Path` or `Path<(String,)>`) +/// keeps the extractor stable across single-mode flat routes and +/// multi-mode nested routes: the `{branch}` capture is picked by +/// name and any other captures in scope (e.g. `{graph_id}` in +/// multi-mode) are ignored without breaking deserialization. +/// +/// Closes the "handler path-extractor type is positional and breaks +/// when route nesting changes" class. +#[derive(Deserialize)] +struct BranchPath { + branch: String, +} + +#[utoipa::path( + delete, + path = "/branches/{branch}", + tag = "branches", + operation_id = "deleteBranch", + params( + ("branch" = String, Path, description = "Branch name to delete"), + ), + responses( + (status = 200, description = "Branch deleted", body = BranchDeleteOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + (status = 404, description = "Branch not found", body = ErrorOutput), + (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// Delete a branch. +/// +/// **Irreversible.** Removes the branch pointer; commits remain reachable +/// only if referenced by another branch. Returns 404 if the branch does not +/// exist. 
+async fn server_branch_delete( + State(state): State, + Extension(handle): Extension>, + actor: Option>, + Path(BranchPath { branch }): Path, +) -> std::result::Result, ApiError> { + let actor_arc = actor + .as_ref() + .map(|Extension(actor)| Arc::clone(&actor.actor_id)) + .unwrap_or_else(|| Arc::::from("anonymous")); + let actor_id = actor + .as_ref() + .map(|Extension(actor)| actor.actor_id.as_ref()); + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::BranchDelete, + branch: None, + target_branch: Some(branch.clone()), + }, + )?; + // Metadata-only manifest tombstone — small constant estimate. + let _admission = state + .workload + .try_admit(&actor_arc, 256) + .map_err(ApiError::from_workload_reject)?; + { + let db = &handle.engine; + db.branch_delete_as(&branch, actor_id) + .await + .map_err(ApiError::from_omni)?; + } + Ok(Json(BranchDeleteOutput { + uri: handle.uri.clone(), + name: branch, + actor_id: actor_id.map(str::to_string), + })) +} + +#[utoipa::path( + post, + path = "/branches/merge", + tag = "branches", + operation_id = "mergeBranches", + request_body = BranchMergeRequest, + responses( + (status = 200, description = "Branches merged", body = BranchMergeOutput), + (status = 400, description = "Bad request", body = ErrorOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + (status = 409, description = "Merge conflict", body = ErrorOutput), + (status = 429, description = "Per-actor admission cap exceeded; honor `Retry-After` header", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// Merge one branch into another. +/// +/// Merges `source` into `target` (defaults to `main`). Outcome is one of +/// `already_up_to_date`, `fast_forward`, or `merged`.
Returns 409 with the +/// list of conflicts if the merge cannot be completed; the target is left +/// unchanged in that case. **Destructive** to `target` on success. +async fn server_branch_merge( + State(state): State, + Extension(handle): Extension>, + actor: Option>, + Json(request): Json, +) -> std::result::Result, ApiError> { + let target = request.target.unwrap_or_else(|| "main".to_string()); + let actor_arc = actor + .as_ref() + .map(|Extension(actor)| Arc::clone(&actor.actor_id)) + .unwrap_or_else(|| Arc::::from("anonymous")); + let actor_id = actor + .as_ref() + .map(|Extension(actor)| actor.actor_id.as_ref()); + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::BranchMerge, + branch: Some(request.source.clone()), + target_branch: Some(target.clone()), + }, + )?; + // Merge body is small JSON; the heavy work is in the engine but is + // bounded per-(table, branch) by the writer queue. Small constant + // estimate suffices for the actor in-flight count. + let _admission = state + .workload + .try_admit(&actor_arc, 256) + .map_err(ApiError::from_workload_reject)?; + let outcome = { + let db = &handle.engine; + db.branch_merge_as(&request.source, &target, actor_id) + .await + .map_err(ApiError::from_omni)? + }; + Ok(Json(BranchMergeOutput { + source: request.source, + target, + outcome: outcome.into(), + actor_id: actor_id.map(str::to_string), + })) +} + +#[utoipa::path( + get, + path = "/commits", + tag = "commits", + operation_id = "listCommits", + params(CommitListQuery), + responses( + (status = 200, description = "List of commits", body = CommitListOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] +/// List commits. 
+/// +/// Filter by `branch` to get the commits on a single branch (most recent +/// first); omit to list across all branches. Read-only. +async fn server_commit_list( + Extension(handle): Extension>, + actor: Option>, + Query(query): Query, +) -> std::result::Result, ApiError> { + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::Read, + branch: query.branch.clone(), + target_branch: None, + }, + )?; + let commits = { + let db = &handle.engine; + db.list_commits(query.branch.as_deref()) + .await + .map_err(ApiError::from_omni)? + }; + Ok(Json(CommitListOutput { + commits: commits.iter().map(api::commit_output).collect(), + })) +} + +/// Path-param shape for [`server_commit_show`]. See [`BranchPath`] +/// for the design rationale — same pattern, different field name. +#[derive(Deserialize)] +struct CommitPath { + commit_id: String, +} + +#[utoipa::path( + get, + path = "/commits/{commit_id}", + tag = "commits", + operation_id = "getCommit", + params( + ("commit_id" = String, Path, description = "Commit identifier"), + ), + responses( + (status = 200, description = "Commit details", body = api::CommitOutput), + (status = 401, description = "Unauthorized", body = ErrorOutput), + (status = 403, description = "Forbidden", body = ErrorOutput), + (status = 404, description = "Commit not found", body = ErrorOutput), + ), + security(("bearer_token" = [])), +)] + +/// Get a single commit. +/// +/// Returns the commit's manifest version, parent commit(s), and creation +/// metadata. Read-only.
+async fn server_commit_show( + Extension(handle): Extension>, + actor: Option>, + Path(CommitPath { commit_id }): Path, +) -> std::result::Result, ApiError> { + authorize_request( + actor.as_ref().map(|Extension(actor)| actor), + handle.policy.as_deref(), + PolicyRequest { + action: PolicyAction::Read, + branch: None, + target_branch: None, + }, + )?; + let commit = { + let db = &handle.engine; + db.get_commit(&commit_id) + .await + .map_err(ApiError::from_omni)? + }; + Ok(Json(api::commit_output(&commit))) +} + +fn read_target_from_request(branch: Option, snapshot: Option) -> ReadTarget { + if let Some(snapshot) = snapshot { + ReadTarget::snapshot(omnigraph::db::SnapshotId::new(snapshot)) + } else { + ReadTarget::branch(branch.unwrap_or_else(|| "main".to_string())) + } +} + +fn select_named_query_decl( + query_source: &str, + requested_name: Option<&str>, +) -> Result { + let parsed = parse_query(query_source)?; + let query = if let Some(name) = requested_name { + parsed + .queries + .into_iter() + .find(|query| query.name == name) + .ok_or_else(|| color_eyre::eyre::eyre!("query '{}' not found", name))? 
+ } else if parsed.queries.len() == 1 { + parsed.queries.into_iter().next().unwrap() + } else { + bail!("query file contains multiple queries; pass --name"); + }; + Ok(query) +} + +fn select_named_query( + query_source: &str, + requested_name: Option<&str>, +) -> Result<(String, Vec)> { + let query = select_named_query_decl(query_source, requested_name)?; + Ok((query.name, query.params)) +} + +fn query_params_from_json( + query_params: &[omnigraph_compiler::query::ast::Param], + params_json: Option<&Value>, +) -> Result { + json_params_to_param_map(params_json, query_params, JsonParamMode::Standard) + .map_err(|err| color_eyre::eyre::eyre!(err.to_string())) +} + +fn normalize_bearer_token(value: Option) -> Option { + value + .map(|value| value.trim().to_string()) + .filter(|value| !value.is_empty()) +} + +fn normalize_bearer_actor(value: String) -> Result { + let value = value.trim().to_string(); + if value.is_empty() { + bail!("bearer token actor names must not be blank"); + } + Ok(value) +} + +fn parse_bearer_tokens_json(value: &str) -> Result> { + let entries: HashMap = serde_json::from_str(value) + .wrap_err("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON must be a JSON object of actor->token")?; + Ok(entries.into_iter().collect()) +} + +fn read_bearer_tokens_file(path: &str) -> Result> { + let contents = fs::read_to_string(path) + .wrap_err_with(|| format!("failed to read bearer tokens file at {path}"))?; + parse_bearer_tokens_json(&contents) + .wrap_err_with(|| format!("failed to parse bearer tokens file at {path}")) +} + +fn validate_bearer_tokens(entries: Vec<(String, String)>) -> Result> { + let mut seen_actors = HashSet::new(); + let mut seen_tokens = HashSet::new(); + let mut normalized = Vec::with_capacity(entries.len()); + + for (actor, token) in entries { + let actor = normalize_bearer_actor(actor)?; + let Some(token) = normalize_bearer_token(Some(token)) else { + bail!("bearer token for actor '{actor}' must not be blank"); + }; + if 
!seen_actors.insert(actor.clone()) { + bail!("duplicate bearer token actor '{actor}'"); + } + if !seen_tokens.insert(token.clone()) { + bail!("duplicate bearer token value configured"); + } + normalized.push((actor, token)); + } + + normalized.sort_by(|(left, _), (right, _)| left.cmp(right)); + Ok(normalized) +} + +fn server_bearer_tokens_from_env() -> Result> { + let mut entries = Vec::new(); + + if let Some(token) = normalize_bearer_token(std::env::var("OMNIGRAPH_SERVER_BEARER_TOKEN").ok()) + { + entries.push(("default".to_string(), token)); + } + + if let Some(path) = + normalize_bearer_token(std::env::var("OMNIGRAPH_SERVER_BEARER_TOKENS_FILE").ok()) + { + entries.extend(read_bearer_tokens_file(&path)?); + } else if let Some(json) = + normalize_bearer_token(std::env::var("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON").ok()) + { + entries.extend(parse_bearer_tokens_json(&json)?); + } + + validate_bearer_tokens(entries) +} + +#[cfg(test)] +mod tests { + use super::{ + GraphStartupConfig, ServerConfig, ServerConfigMode, ServerRuntimeState, + classify_server_runtime_state, hash_bearer_token, load_server_settings, + normalize_bearer_token, parse_bearer_tokens_json, serve, server_bearer_tokens_from_env, + }; + use serial_test::serial; + use std::env; + use std::fs; + use tempfile::tempdir; + + #[test] + fn hash_bearer_token_produces_32_byte_output() { + let hash = hash_bearer_token("any-token"); + assert_eq!(hash.len(), 32); + } + + #[test] + fn hash_bearer_token_is_deterministic() { + assert_eq!( + hash_bearer_token("stable-input"), + hash_bearer_token("stable-input"), + ); + } + + #[test] + fn hash_bearer_token_differs_for_different_inputs() { + assert_ne!(hash_bearer_token("token-a"), hash_bearer_token("token-b")); + } + + #[test] + fn hash_bearer_token_matches_known_sha256_vector() { + // SHA-256("abc"). If this ever fails, the hash function was swapped. 
+ let hash = hash_bearer_token("abc"); + let hex: String = hash.iter().map(|b| format!("{:02x}", b)).collect(); + assert_eq!( + hex, + "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad" + ); + } + + #[test] + fn server_settings_load_from_yaml_config() { + let temp = tempdir().unwrap(); + let config = temp.path().join("omnigraph.yaml"); + fs::write( + &config, + r#" +graphs: + local: + uri: /tmp/demo.omni +server: + graph: local + bind: 0.0.0.0:9090 +"#, + ) + .unwrap(); + + let settings = load_server_settings(Some(&config), None, None, None, false).unwrap(); + match &settings.mode { + ServerConfigMode::Single { uri, .. } => assert_eq!(uri, "/tmp/demo.omni"), + ServerConfigMode::Multi { .. } => panic!("expected Single mode, got Multi"), + } + assert_eq!(settings.bind, "0.0.0.0:9090"); + } + + #[test] + fn server_settings_cli_flags_override_yaml_config() { + let temp = tempdir().unwrap(); + let config = temp.path().join("omnigraph.yaml"); + fs::write( + &config, + r#" +graphs: + local: + uri: /tmp/demo.omni +server: + graph: local + bind: 127.0.0.1:8080 +"#, + ) + .unwrap(); + + let settings = load_server_settings( + Some(&config), + Some("/tmp/override.omni".to_string()), + None, + Some("0.0.0.0:9999".to_string()), + false, + ) + .unwrap(); + match &settings.mode { + ServerConfigMode::Single { uri, .. } => assert_eq!(uri, "/tmp/override.omni"), + ServerConfigMode::Multi { .. 
} => panic!("expected Single mode, got Multi"), + } + assert_eq!(settings.bind, "0.0.0.0:9999"); + } + + #[test] + fn server_settings_can_resolve_named_target() { + let temp = tempdir().unwrap(); + let config = temp.path().join("omnigraph.yaml"); + fs::write( + &config, + r#" +graphs: + local: + uri: ./demo.omni + dev: + uri: http://127.0.0.1:8080 +server: + graph: local + bind: 127.0.0.1:8080 +"#, + ) + .unwrap(); + + let settings = + load_server_settings(Some(&config), None, Some("dev".to_string()), None, false) + .unwrap(); + match &settings.mode { + ServerConfigMode::Single { uri, .. } => assert_eq!(uri, "http://127.0.0.1:8080"), + ServerConfigMode::Multi { .. } => panic!("expected Single mode, got Multi"), + } + } + + #[test] + fn server_settings_require_uri_from_cli_or_config() { + let error = load_server_settings(None, None, None, None, false).unwrap_err(); + assert!( + error.to_string().contains("no graph to serve"), + "expected mode-inference error, got: {error}", + ); + } + + #[test] + fn classify_open_requires_explicit_unauthenticated_flag() { + // State 1: no tokens, no policy, no flag → refuse to start. + let error = classify_server_runtime_state(false, false, false).unwrap_err(); + let msg = error.to_string(); + assert!( + msg.contains("--unauthenticated"), + "expected refusal message mentioning --unauthenticated, got: {msg}" + ); + + // Same matrix cell but with the flag set → Open mode permitted. + assert_eq!( + classify_server_runtime_state(false, false, true).unwrap(), + ServerRuntimeState::Open + ); + } + + #[test] + fn classify_tokens_without_policy_is_default_deny() { + // State 2: tokens configured, no policy → DefaultDeny regardless + // of the flag (the flag opts into the fully-open dev mode; it + // doesn't downgrade default-deny back to open).
+ assert_eq!( + classify_server_runtime_state(true, false, false).unwrap(), + ServerRuntimeState::DefaultDeny + ); + assert_eq!( + classify_server_runtime_state(true, false, true).unwrap(), + ServerRuntimeState::DefaultDeny + ); + } + + #[tokio::test] + #[serial] + async fn serve_refuses_to_start_with_policy_but_no_tokens_multi_mode() { + // Bug 2 from the bot-review pass: multi-mode startup was missing + // the "policy requires tokens" check that single-mode enforces. + // After centralizing the check in `classify_server_runtime_state`, + // both modes get the same enforcement. This test guards the + // multi-mode propagation path. + // + // Sibling test below pins single mode. Together they pin that + // the classifier is called from both branches of `serve()`. + let _guard = EnvGuard::set(&[ + ("OMNIGRAPH_SERVER_BEARER_TOKEN", None), + ("OMNIGRAPH_SERVER_BEARER_TOKENS_FILE", None), + ("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", None), + ("OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET", None), + ("OMNIGRAPH_UNAUTHENTICATED", None), + ]); + let temp = tempdir().unwrap(); + // The classifier reads `has_policy_configured` from the config + // shape (does the Option contain a path?), not from file + // existence, so we can hand it a path without writing a real + // policy file - the bail fires before policy load.
+ let policy_path = temp.path().join("server-policy.yaml"); + let config = ServerConfig { + mode: ServerConfigMode::Multi { + graphs: vec![GraphStartupConfig { + graph_id: "alpha".to_string(), + uri: temp + .path() + .join("alpha.omni") + .to_string_lossy() + .into_owned(), + policy_file: None, + }], + config_path: temp.path().join("omnigraph.yaml"), + server_policy_file: Some(policy_path), + }, + bind: "127.0.0.1:0".to_string(), + allow_unauthenticated: false, + }; + let result = serve(config).await; + let err = result + .expect_err("serve should refuse to start in multi mode with policy but no tokens"); + let msg = format!("{:?}", err); + assert!( + msg.contains("policy file is configured but no bearer tokens"), + "expected policy-without-tokens rejection in multi mode, got: {msg}", + ); + } + + #[tokio::test] + #[serial] + async fn serve_refuses_to_start_in_state_1_without_unauthenticated() { + // MR-723 PR A: pin the integration boundary that the classifier + // is actually called by `serve()` before any side-effecting + // work (Lance dataset open, TcpListener::bind). The classifier + // itself is unit-tested above; this test guards the propagation + // path from `classify_server_runtime_state` through serve's + // `?` so a future refactor that drops the call returns red. + // + // Marked `#[serial]` because we have to clear all bearer-token + // env vars, and another test in this module setting any of them + // concurrently would corrupt the read inside `resolve_token_source`. + let _guard = EnvGuard::set(&[ + ("OMNIGRAPH_SERVER_BEARER_TOKEN", None), + ("OMNIGRAPH_SERVER_BEARER_TOKENS_FILE", None), + ("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", None), + ("OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET", None), + ("OMNIGRAPH_UNAUTHENTICATED", None), + ]); + let temp = tempdir().unwrap(); + // Graph path doesn't need to exist - classifier fires before + // `AppState::open_with_bearer_tokens_and_policy`.
+ let config = ServerConfig { + mode: ServerConfigMode::Single { + uri: temp + .path() + .join("graph.omni") + .to_string_lossy() + .into_owned(), + policy_file: None, + }, + bind: "127.0.0.1:0".to_string(), + allow_unauthenticated: false, + }; + let result = serve(config).await; + let err = + result.expect_err("serve should refuse to start in State 1 without --unauthenticated"); + let msg = format!("{:?}", err); + assert!( + msg.contains("no bearer tokens") || msg.contains("policy file"), + "expected refusal message naming the misconfiguration, got: {msg}", + ); + } + + #[test] + #[serial] + fn unauthenticated_env_var_classification() { + // MR-723 PR A: closes the gap where the env-var read path inside + // `load_server_settings` was structurally implemented but not + // exercised by any test. Three properties to pin, all in one + // sequential test because `cargo test` runs the mod test suite + // in parallel and `OMNIGRAPH_UNAUTHENTICATED` is process-global + // - interleaving with another test that sets the same env var + // (concurrent classifier tests, even the bearer-token suite + // sharing `EnvGuard`) corrupts the read. Sequential within one + // test fn is the simplest race-free shape. + let temp = tempdir().unwrap(); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write( + &config_path, + r#" +graphs: + local: + uri: /tmp/demo-unauth.omni +server: + graph: local +"#, + ) + .unwrap(); + + // Truthy values flip Open mode on, even with CLI flag off. + for value in ["1", "true", "yes", "TRUE", "anything"] { + let _guard = EnvGuard::set(&[("OMNIGRAPH_UNAUTHENTICATED", Some(value))]); + let settings = load_server_settings(Some(&config_path), None, None, None, false) + .expect("settings load should succeed"); + assert!( + settings.allow_unauthenticated, + "OMNIGRAPH_UNAUTHENTICATED={value:?} should enable Open mode", + ); + } + + // Falsy values keep refusal behavior, even with CLI flag off.
+ for value in ["0", "false", "FALSE", ""] { + let _guard = EnvGuard::set(&[("OMNIGRAPH_UNAUTHENTICATED", Some(value))]); + let settings = load_server_settings(Some(&config_path), None, None, None, false) + .expect("settings load should succeed"); + assert!( + !settings.allow_unauthenticated, + "OMNIGRAPH_UNAUTHENTICATED={value:?} should NOT enable Open mode", + ); + } + + // Unset env var: also false. + let _guard = EnvGuard::set(&[("OMNIGRAPH_UNAUTHENTICATED", None)]); + let settings = load_server_settings(Some(&config_path), None, None, None, false) + .expect("settings load should succeed"); + assert!( + !settings.allow_unauthenticated, + "OMNIGRAPH_UNAUTHENTICATED unset should NOT enable Open mode", + ); + drop(_guard); + + // CLI flag wins even when env is falsy - `serve()` honors the + // OR of both inputs. + let _guard = EnvGuard::set(&[("OMNIGRAPH_UNAUTHENTICATED", Some("0"))]); + let settings = load_server_settings(Some(&config_path), None, None, None, true) + .expect("settings load should succeed"); + assert!( + settings.allow_unauthenticated, + "--unauthenticated CLI flag should win even when env is falsy", + ); + } + + #[test] + fn classify_policy_enabled_requires_tokens() { + // State 3: tokens + policy → PolicyEnabled, regardless of the + // `allow_unauthenticated` flag (Cedar evaluates the bearer, + // the flag is moot once tokens exist). + assert_eq!( + classify_server_runtime_state(true, true, false).unwrap(), + ServerRuntimeState::PolicyEnabled + ); + assert_eq!( + classify_server_runtime_state(true, true, true).unwrap(), + ServerRuntimeState::PolicyEnabled + ); + } + + #[test] + fn classify_policy_without_tokens_is_rejected() { + // Closes the "policy installed but no tokens → silent 401 on + // every request" footgun.
The same shape that single-mode + // `open_with_bearer_tokens_and_policy` used to bail on + // privately is now rejected by the classifier so both single + // and multi mode get the same enforcement from one source of + // truth. + for allow_unauthenticated in [false, true] { + let err = + classify_server_runtime_state(false, true, allow_unauthenticated).unwrap_err(); + let msg = err.to_string(); + assert!( + msg.contains("policy file is configured but no bearer tokens"), + "expected policy-without-tokens rejection message; got: {msg}" + ); + assert!( + msg.contains("every request would 401"), + "rejection message must name the failure mode; got: {msg}" + ); + } + } + + #[test] + fn normalize_bearer_token_trims_and_filters_blank_values() { + assert_eq!(normalize_bearer_token(None), None); + assert_eq!(normalize_bearer_token(Some(" ".to_string())), None); + assert_eq!( + normalize_bearer_token(Some(" demo-token ".to_string())).as_deref(), + Some("demo-token") + ); + } + + struct EnvGuard { + saved: Vec<(&'static str, Option)>, + } + + impl EnvGuard { + fn set(vars: &[(&'static str, Option<&str>)]) -> Self { + let saved = vars + .iter() + .map(|(name, _)| (*name, env::var(name).ok())) + .collect::>(); + for (name, value) in vars { + unsafe { + match value { + Some(value) => env::set_var(name, value), + None => env::remove_var(name), + } + } + } + Self { saved } + } + } + + impl Drop for EnvGuard { + fn drop(&mut self) { + for (name, value) in self.saved.drain(..) 
{ + unsafe { + match value { + Some(value) => env::set_var(name, value), + None => env::remove_var(name), + } + } + } + } + } + + #[test] + fn parse_bearer_tokens_json_reads_actor_token_map() { + let tokens = parse_bearer_tokens_json(r#"{"alice":" token-a ","bob":"token-b"}"#).unwrap(); + assert_eq!(tokens.len(), 2); + assert!(tokens.contains(&("alice".to_string(), " token-a ".to_string()))); + assert!(tokens.contains(&("bob".to_string(), "token-b".to_string()))); + } + + #[test] + #[serial] + fn server_bearer_tokens_from_env_reads_legacy_token_and_token_file() { + let temp = tempdir().unwrap(); + let tokens_path = temp.path().join("tokens.json"); + fs::write( + &tokens_path, + r#"{"team-01":"token-one","team-02":"token-two"}"#, + ) + .unwrap(); + + let _guard = EnvGuard::set(&[ + ("OMNIGRAPH_SERVER_BEARER_TOKEN", Some(" legacy-token ")), + ( + "OMNIGRAPH_SERVER_BEARER_TOKENS_FILE", + Some(tokens_path.to_str().unwrap()), + ), + ("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", None), + ]); + + let tokens = server_bearer_tokens_from_env().unwrap(); + assert_eq!( + tokens, + vec![ + ("default".to_string(), "legacy-token".to_string()), + ("team-01".to_string(), "token-one".to_string()), + ("team-02".to_string(), "token-two".to_string()), + ] + ); + } +} diff --git a/crates/omnigraph-server/src/main.rs b/crates/omnigraph-server/src/main.rs index c45b77f..4e1c256 100644 --- a/crates/omnigraph-server/src/main.rs +++ b/crates/omnigraph-server/src/main.rs @@ -8,12 +8,12 @@ use omnigraph_server::{ServerConfig, init_tracing, load_server_settings, serve}; #[command(name = "omnigraph-server")] #[command(about = "HTTP server for the Omnigraph graph database")] struct Cli { - /// Boot from a cluster: either a config directory (storage resolved - /// through cluster.yaml) or a storage-root URI directly - /// (s3://bucket/prefix - config-free serving from the bucket). - /// The server's only boot source (RFC-011 cluster-only).
+ /// Graph URI + uri: Option, #[arg(long)] - cluster: Option, + target: Option, + #[arg(long)] + config: Option, #[arg(long)] bind: Option, /// Run without bearer tokens and without a policy file (MR-723). @@ -22,11 +22,6 @@ struct Cli { /// Equivalent to setting `OMNIGRAPH_UNAUTHENTICATED=1`. #[arg(long)] unauthenticated: bool, - /// Fail startup if any applied graph is quarantined or fails to open. - /// By default, graph-local failures are logged and healthy graphs still - /// serve. Equivalent to setting `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`. - #[arg(long)] - require_all_graphs: bool, } #[tokio::main] @@ -36,11 +31,11 @@ async fn main() -> Result<()> { let cli = Cli::parse(); let settings: ServerConfig = load_server_settings( - cli.cluster.as_ref(), + cli.config.as_ref(), + cli.uri, + cli.target, cli.bind, cli.unauthenticated, - cli.require_all_graphs, - ) - .await?; + )?; serve(settings).await } diff --git a/crates/omnigraph-server/src/queries.rs b/crates/omnigraph-server/src/queries.rs deleted file mode 100644 index 09d2491..0000000 --- a/crates/omnigraph-server/src/queries.rs +++ /dev/null @@ -1,613 +0,0 @@ -//! Stored-query registry. -//! -//! A server-side registry of named, parameter-typed `.gq` queries that -//! operators declare in `omnigraph.yaml` (per-graph, or top-level in -//! single mode) and the server loads at startup. Each entry is parsed -//! and its identity asserted here (`load`); type-checking against the -//! live schema happens separately (a `check` pass) so the loader stays -//! callable without an open engine (the CLI's offline `queries check`). -//! -//! Identity is the query **name**: the manifest key must equal the -//! `query ` symbol declared in the referenced `.gq` file. The two -//! are asserted equal at load - one name, two places that must agree. -//! Renaming either is a breaking change to callers, by design.
- -use std::collections::BTreeMap; -use std::sync::Arc; - -use omnigraph_compiler::catalog::Catalog; -use omnigraph_compiler::query::ast::QueryDecl; -use omnigraph_compiler::query::parser::parse_query; -use omnigraph_compiler::query::typecheck::typecheck_query_decl; -use omnigraph_compiler::types::{PropType, ScalarType}; - -/// One loaded stored query. `source` is the full `.gq` file text - the -/// invocation handler hands it to `run_query` / `run_mutate` verbatim, -/// which reuse the same parse/IR/exec path as the inline routes (no -/// parallel implementation). -#[derive(Debug, Clone)] -pub struct StoredQuery { - /// Identity: manifest key == `query ` symbol. - pub name: String, - /// Full `.gq` source text the query was selected from. - pub source: Arc, - /// Parsed declaration (params, mutations, description, …). - pub decl: QueryDecl, - /// Whether this query is listed in the MCP tool catalog (`GET /queries`). - /// Default `true` (the manifest entry is the opt-in); `expose: false` - /// keeps it HTTP/service-callable but hidden from the agent tool list. - /// Catalog membership only - not an authorization gate. - pub expose: bool, - /// Optional MCP tool-name override; defaults to `name`. - pub tool_name: Option, -} - -impl StoredQuery { - /// `true` if the selected declaration contains insert/update/delete - /// statements - drives read-vs-mutate routing at invocation time. - pub fn is_mutation(&self) -> bool { - !self.decl.mutations.is_empty() - } - - /// The MCP tool name this query is catalogued under: the explicit - /// `tool_name` override, else the query `name`. The catalog key - - /// enforced unique across exposed queries at load. Server-side - /// consumers (the uniqueness check, the future catalog projection) read - /// this; the CLI `queries list` resolves the same rule on its own DTO.
- pub fn effective_tool_name(&self) -> &str { - self.tool_name.as_deref().unwrap_or(&self.name) - } -} - -/// A loaded, identity-checked stored-query registry for one graph. -#[derive(Debug, Clone, Default)] -pub struct QueryRegistry { - by_name: BTreeMap, -} - -/// In-memory registry spec: a query's name + already-read `.gq` source. The -/// input to [`QueryRegistry::from_specs`] - built by the server's cluster boot -/// and by the CLI's `queries` tooling from a cluster serving snapshot. -#[derive(Debug, Clone)] -pub struct RegistrySpec { - pub name: String, - pub source: String, - pub expose: bool, - pub tool_name: Option, -} - -/// A single registry load failure. Collected (not fail-fast) so a bad -/// `omnigraph.yaml` surfaces every broken entry at once, matching the -/// bad-policy-YAML posture. -#[derive(Debug, Clone)] -pub struct LoadError { - /// The offending query name, when the failure is entry-scoped. - pub query: Option, - pub message: String, -} - -impl std::fmt::Display for LoadError { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - match &self.query { - Some(name) => write!(f, "stored query '{name}': {}", self.message), - None => write!(f, "stored query registry: {}", self.message), - } - } -} - -impl QueryRegistry { - /// Build a registry from in-memory specs: parse each source, select - /// the declaration whose symbol equals the manifest key, and assert - /// they agree. Collects every failure. No schema type-checking here - /// - that is [`check`].
- pub fn from_specs(specs: Vec) -> Result> { - let mut by_name = BTreeMap::new(); - let mut errors = Vec::new(); - - for spec in specs { - match parse_query(&spec.source) { - Ok(file) => { - match file.queries.into_iter().find(|q| q.name == spec.name) { - Some(decl) => { - by_name.insert( - spec.name.clone(), - StoredQuery { - name: spec.name, - source: Arc::from(spec.source), - decl, - expose: spec.expose, - tool_name: spec.tool_name, - }, - ); - } - None => errors.push(LoadError { - query: Some(spec.name.clone()), - message: format!( - "no `query {}` declaration found in its `.gq` file \ - (the registry key must match the query symbol)", - spec.name - ), - }), - } - } - Err(err) => errors.push(LoadError { - query: Some(spec.name), - message: format!("parse error: {err}"), - }), - } - } - - // Exposed queries are catalogued under their effective tool name; - // two claiming one name is an MCP-namespace collision. Refuse it at - // load (collected, not fail-fast), naming the loser and the winner. - // Iterating the `BTreeMap` makes the winner deterministic (the - // lexicographically-first query name; config is a map, so YAML - // declaration order isn't preserved anyway) and the error order - // stable. Scoped to a block so these borrows of `by_name` end - // before it is moved into `Self`. 
- { - let mut claimed: BTreeMap<&str, &str> = BTreeMap::new(); - for query in by_name.values().filter(|q| q.expose) { - let tool = query.effective_tool_name(); - if let Some(winner) = claimed.insert(tool, &query.name) { - errors.push(LoadError { - query: Some(query.name.clone()), - message: format!( - "MCP tool name '{tool}' already claimed by exposed query '{winner}'" - ), - }); - } - } - } - - if errors.is_empty() { - Ok(Self { by_name }) - } else { - Err(errors) - } - } - - pub fn lookup(&self, name: &str) -> Option<&StoredQuery> { - self.by_name.get(name) - } - - pub fn iter(&self) -> impl Iterator { - self.by_name.values() - } - - pub fn is_empty(&self) -> bool { - self.by_name.is_empty() - } - - pub fn len(&self) -> usize { - self.by_name.len() - } -} - -/// A stored query that fails to type-check against the live schema - -/// e.g. it references a node/edge type or property that was renamed or -/// removed by a migration. Breakages **block server boot** (same posture -/// as bad policy YAML), surfacing schema drift at the deploy boundary -/// rather than silently at invocation time. -#[derive(Debug, Clone)] -pub struct Breakage { - pub query: String, - pub message: String, -} - -/// A non-blocking advisory found during validation. Logged at boot; -/// never blocks startup. Currently: an MCP-exposed query that declares a -/// parameter an agent cannot realistically supply. -#[derive(Debug, Clone)] -pub struct Warning { - pub query: String, - pub message: String, -} - -/// Outcome of validating a registry against a schema. Breakages are -/// fatal (boot refuses); warnings are advisory. -#[derive(Debug, Clone, Default)] -pub struct CheckReport { - pub breakages: Vec, - pub warnings: Vec, -} - -impl CheckReport { - pub fn has_breakages(&self) -> bool { - !self.breakages.is_empty() - } - - pub fn is_clean(&self) -> bool { - self.breakages.is_empty() && self.warnings.is_empty() - } -} - -/// Validate a loaded registry against the live schema.
-/// -/// Pure over `(registry, catalog)` - takes an already-parsed registry and -/// a catalog, so it is callable both at server boot (with the engine's -/// `catalog()`) and offline from the CLI (`omnigraph queries check`), -/// without coupling to server config or an open engine connection. -/// -/// Every query is type-checked via the same `typecheck_query_decl` the -/// engine runs for inline queries - no parallel implementation. Failures -/// are **collected, not fail-fast**, so an operator sees every broken -/// query in one pass. -/// -/// Advisory lint (warn, never block): an `mcp.expose: true` query that -/// declares a `Vector(N)` parameter. An LLM cannot supply a raw embedding -/// vector; such a query should take a `String` parameter and let the -/// engine embed it server-side at query time. Service-to-service callers -/// may legitimately pass vectors, so this warns rather than rejects. -pub fn check(registry: &QueryRegistry, catalog: &Catalog) -> CheckReport { - let mut report = CheckReport::default(); - for query in registry.iter() { - if let Err(err) = typecheck_query_decl(catalog, &query.decl) { - report.breakages.push(Breakage { - query: query.name.clone(), - message: err.to_string(), - }); - } - if query.expose { - for param in &query.decl.params { - // Resolve to the structured type via the compiler's own - // resolver rather than string-matching `Vector(` - one - // canonical definition of "is a vector", so this lint can't - // drift from how the parser/type system spells the type.
- let is_vector = PropType::from_param_type_name(&param.type_name, param.nullable) - .is_some_and(|pt| matches!(pt.scalar, ScalarType::Vector(_))); - if is_vector { - report.warnings.push(Warning { - query: query.name.clone(), - message: format!( - "MCP-exposed query declares a `{}` parameter `${}` that agents \ - cannot supply; use a `String` parameter for server-side embedding", - param.type_name, param.name - ), - }); - } - } - } - } - report -} - -/// Format every breakage in a registry check report into a multi-line -/// operator-facing message, naming each offending query. -pub fn format_check_breakages(label: &str, report: &CheckReport) -> String { - let joined = report - .breakages - .iter() - .map(|b| format!("query '{}': {}", b.query, b.message)) - .collect::>() - .join("\n "); - format!( - "graph '{label}': {} stored quer{} failed the schema check:\n {joined}", - report.breakages.len(), - if report.breakages.len() == 1 { - "y" - } else { - "ies" - } - ) -} - -#[cfg(test)] -mod tests { - use super::*; - - fn spec(name: &str, source: &str, expose: bool) -> RegistrySpec { - RegistrySpec { - name: name.to_string(), - source: source.to_string(), - expose, - tool_name: None, - } - } - - fn spec_tool(name: &str, source: &str, expose: bool, tool_name: &str) -> RegistrySpec { - RegistrySpec { - name: name.to_string(), - source: source.to_string(), - expose, - tool_name: Some(tool_name.to_string()), - } - } - - #[test] - fn key_equal_symbol_loads() { - let reg = QueryRegistry::from_specs(vec![spec( - "find_user", - "query find_user($id: String) { match { $u: User } return { $u.name } }", - true, - )]) - .unwrap(); - let q = reg.lookup("find_user").unwrap(); - assert_eq!(q.name, "find_user"); - assert!(q.expose); - assert_eq!(q.decl.params.len(), 1); - assert!(!q.is_mutation()); - // No override → the effective tool name is the query name. - assert_eq!(q.effective_tool_name(), "find_user"); - - // An explicit override is what the catalog keys on.
- let with_tool = QueryRegistry::from_specs(vec![spec_tool( - "find_user", - "query find_user($id: String) { match { $u: User } return { $u.name } }", - true, - "lookup_user", - )]) - .unwrap(); - assert_eq!( - with_tool.lookup("find_user").unwrap().effective_tool_name(), - "lookup_user" - ); - } - - #[test] - fn key_mismatch_is_an_identity_error() { - let errors = QueryRegistry::from_specs(vec![spec( - "find_user", - // symbol is `lookup`, key is `find_user` - must be rejected. - "query lookup($id: String) { match { $u: User } return { $u.name } }", - false, - )]) - .unwrap_err(); - assert_eq!(errors.len(), 1); - assert_eq!(errors[0].query.as_deref(), Some("find_user")); - assert!(errors[0].message.contains("must match the query symbol")); - } - - #[test] - fn multi_query_file_selects_the_matching_symbol() { - let source = "query a($x: I64) { match { $u: User } return { $u.name } }\n\ - query b($y: String) { match { $u: User } return { $u.name } }"; - let reg = QueryRegistry::from_specs(vec![spec("b", source, false)]).unwrap(); - let q = reg.lookup("b").unwrap(); - assert_eq!(q.name, "b"); - assert_eq!(q.decl.params[0].name, "y"); - assert!(reg.lookup("a").is_none(), "only the selected symbol is registered"); - } - - #[test] - fn duplicate_exposed_tool_name_is_a_load_error() { - // Two MCP-exposed queries claiming one tool name is an ambiguity in - // the catalog key space - refused at load, naming both queries and - // the contested tool.
- let errors = QueryRegistry::from_specs(vec![ - spec_tool("a", "query a() { match { $u: User } return { $u.name } }", true, "dup"), - spec_tool("b", "query b() { match { $u: User } return { $u.name } }", true, "dup"), - ]) - .unwrap_err(); - assert_eq!(errors.len(), 1); - let msg = errors[0].to_string(); - assert!(msg.contains("'dup'"), "names the contested tool: {msg}"); - assert!(msg.contains("'a'"), "names the winning query: {msg}"); - assert!(msg.contains("'b'"), "names the losing query: {msg}"); - } - - #[test] - fn duplicate_tool_name_among_unexposed_is_allowed() { - // Unexposed queries have no MCP tool, so a shared effective tool - // name is inert - must not error (pins the exposed-only scope). - let reg = QueryRegistry::from_specs(vec![ - spec_tool("a", "query a() { match { $u: User } return { $u.name } }", false, "dup"), - spec_tool("b", "query b() { match { $u: User } return { $u.name } }", false, "dup"), - ]) - .unwrap(); - assert_eq!(reg.len(), 2); - } - - #[test] - fn parse_error_surfaces_per_entry() { - let errors = - QueryRegistry::from_specs(vec![spec("broken", "query broken( {{ not valid", false)]) - .unwrap_err(); - assert_eq!(errors[0].query.as_deref(), Some("broken")); - assert!(errors[0].message.contains("parse error")); - } - - #[test] - fn errors_collect_rather_than_fail_fast() { - let errors = QueryRegistry::from_specs(vec![ - spec("good", "query good() { match { $u: User } return { $u.name } }", false), - spec("mismatch", "query other() { match { $u: User } return { $u.name } }", false), - spec("broken", "query broken(", false), - ]) - .unwrap_err(); - // `good` loads cleanly; only the mismatch and the parse error are - // reported, and both surface in one pass (not fail-fast).
- assert_eq!(errors.len(), 2); - } - - #[test] - fn mutation_body_classifies_as_mutation() { - let reg = QueryRegistry::from_specs(vec![spec( - "add_user", - "query add_user($name: String) { insert User { name: $name } }", - false, - )]) - .unwrap(); - assert!(reg.lookup("add_user").unwrap().is_mutation()); - } - - // --- check(registry, catalog) --- - - use omnigraph_compiler::catalog::build_catalog; - use omnigraph_compiler::schema::parser::parse_schema; - - fn test_catalog() -> Catalog { - let schema = parse_schema( - r#" -node User { -name: String -age: I32? -embedding: Vector(4) -} -"#, - ) - .unwrap(); - build_catalog(&schema).unwrap() - } - - #[test] - fn check_passes_for_valid_query() { - let reg = QueryRegistry::from_specs(vec![spec( - "find_user", - "query find_user($name: String) { match { $u: User { name: $name } } return { $u.age } }", - false, - )]) - .unwrap(); - let report = check(&reg, &test_catalog()); - assert!(report.is_clean(), "unexpected: {:?}", report); - } - - #[test] - fn check_reports_unknown_type_as_breakage() { - let reg = QueryRegistry::from_specs(vec![spec( - "ghost", - // `Widget` is not in the schema. - "query ghost() { match { $w: Widget } return { $w.name } }", - false, - )]) - .unwrap(); - let report = check(&reg, &test_catalog()); - assert!(report.has_breakages()); - assert_eq!(report.breakages[0].query, "ghost"); - } - - #[test] - fn check_reports_unknown_property_as_breakage() { - let reg = QueryRegistry::from_specs(vec![spec( - "bad_prop", - // `User` exists but has no `nickname`.
- "query bad_prop() { match { $u: User } return { $u.nickname } }", - false, - )]) - .unwrap(); - let report = check(&reg, &test_catalog()); - assert!(report.has_breakages()); - assert_eq!(report.breakages[0].query, "bad_prop"); - } - - #[test] - fn check_collects_every_breakage_not_fail_fast() { - let reg = QueryRegistry::from_specs(vec![ - spec("a", "query a() { match { $w: Widget } return { $w.x } }", false), - spec("b", "query b() { match { $g: Gadget } return { $g.y } }", false), - spec( - "ok", - "query ok() { match { $u: User } return { $u.name } }", - false, - ), - ]) - .unwrap(); - let report = check(&reg, &test_catalog()); - assert_eq!(report.breakages.len(), 2, "both bad queries reported: {:?}", report); - } - - #[test] - fn vector_param_on_exposed_query_warns() { - let reg = QueryRegistry::from_specs(vec![spec( - "vec_search", - "query vec_search($q: Vector(4)) { match { $u: User } return { $u.name } \ - order { nearest($u.embedding, $q) } limit 3 }", - true, // mcp.expose - )]) - .unwrap(); - let report = check(&reg, &test_catalog()); - assert!(!report.has_breakages(), "valid query: {:?}", report); - assert_eq!(report.warnings.len(), 1); - assert_eq!(report.warnings[0].query, "vec_search"); - } - - #[test] - fn vector_param_on_unexposed_query_is_silent() { - let reg = QueryRegistry::from_specs(vec![spec( - "vec_search", - "query vec_search($q: Vector(4)) { match { $u: User } return { $u.name } \ - order { nearest($u.embedding, $q) } limit 3 }", - false, // not exposed — vector param is fine for service-to-service callers - )]) - .unwrap(); - let report = check(&reg, &test_catalog()); - assert!(report.is_clean(), "unexpected: {:?}", report); - } - - #[test] - fn non_vector_param_on_exposed_query_does_not_warn() { - // The recommended `String` alternative on an exposed query does not - // resolve to a Vector, so the embedding advisory stays silent.
Guards - // the structured type check against a false positive (and pins that - // only `Vector(_)` triggers the warning). - let reg = QueryRegistry::from_specs(vec![spec( - "search", - "query search($name: String) { match { $u: User { name: $name } } return { $u.name } }", - true, - )]) - .unwrap(); - let report = check(&reg, &test_catalog()); - assert!(report.is_clean(), "no breakage or warning expected: {:?}", report); - } - - // --- catalog projection (api::query_catalog_entry) --- - - #[test] - fn catalog_entry_projects_every_param_kind() { - use crate::api::{self, ParamKind}; - let reg = QueryRegistry::from_specs(vec![spec_tool( - "all_types", - "query all_types($s: String, $i: I32, $big: I64, $u: U64, $f: F64, $b: Bool, \ - $d: Date, $dt: DateTime, $blob: Blob, $opt: String?, $list: [I32], $vec: Vector(4)) \ - { match { $x: User } return { $x.name } }", - true, - "all", - )]) - .unwrap(); - let entry = api::query_catalog_entry(reg.lookup("all_types").unwrap()); - assert_eq!(entry.name, "all_types"); - assert_eq!(entry.tool_name, "all"); - assert!(!entry.mutation); - - let by: std::collections::HashMap<_, _> = - entry.params.iter().map(|p| (p.name.as_str(), p)).collect(); - assert_eq!(by["s"].kind, ParamKind::String); - assert_eq!(by["i"].kind, ParamKind::Int); - assert_eq!(by["big"].kind, ParamKind::BigInt, "I64 → bigint (string on the wire)"); - assert_eq!(by["u"].kind, ParamKind::BigInt, "U64 → bigint"); - assert_eq!(by["f"].kind, ParamKind::Float); - assert_eq!(by["b"].kind, ParamKind::Bool); - assert_eq!(by["d"].kind, ParamKind::Date); - assert_eq!(by["dt"].kind, ParamKind::DateTime); - assert_eq!(by["blob"].kind, ParamKind::Blob); - assert!(!by["s"].nullable); - assert!(by["opt"].nullable, "String?
β†’ nullable"); - assert_eq!(by["list"].kind, ParamKind::List); - assert_eq!(by["list"].item_kind, Some(ParamKind::Int), "[I32] β†’ list of int"); - assert_eq!(by["vec"].kind, ParamKind::Vector); - assert_eq!(by["vec"].vector_dim, Some(4)); - } - - #[test] - fn catalog_entry_flags_mutation_and_empty_params() { - use crate::api; - let reg = QueryRegistry::from_specs(vec![spec( - "add_user", - "query add_user($name: String) { insert User { name: $name } }", - true, - )]) - .unwrap(); - let entry = api::query_catalog_entry(reg.lookup("add_user").unwrap()); - assert!(entry.mutation, "insert body β†’ mutation flag"); - - let reg2 = QueryRegistry::from_specs(vec![spec( - "no_params", - "query no_params() { match { $u: User } return { $u.name } }", - true, - )]) - .unwrap(); - let entry2 = api::query_catalog_entry(reg2.lookup("no_params").unwrap()); - assert!(entry2.params.is_empty(), "no declared params β†’ empty list"); - } - -} diff --git a/crates/omnigraph-server/src/registry.rs b/crates/omnigraph-server/src/registry.rs index 54115e4..5897ad1 100644 --- a/crates/omnigraph-server/src/registry.rs +++ b/crates/omnigraph-server/src/registry.rs @@ -29,7 +29,6 @@ use tokio::sync::Mutex; use crate::identity::GraphKey; use crate::policy::PolicyEngine; -use crate::queries::QueryRegistry; /// Open handle for a single graph in the registry. Cheap to clone (`Arc`-wrapped /// engine + policy). Cluster-mode handlers extract this via @@ -48,11 +47,6 @@ pub struct GraphHandle { /// `_as` writers"; the HTTP-layer `require_bearer_auth` middleware still /// runs regardless. pub policy: Option>, - /// Per-graph stored-query registry, loaded and validated at - /// startup. `None` means the operator declared no stored queries for - /// this graph β€” `POST /queries/{name}` then 404s. Mirrors the - /// optional `policy` shape. - pub queries: Option>, } /// Immutable snapshot of the registry's current state. 
Replaced atomically @@ -251,7 +245,6 @@ fn canonicalize_handle_uri( uri: canonical_uri.clone(), engine: Arc::clone(&handle.engine), policy: handle.policy.clone(), - queries: handle.queries.clone(), }); Ok((canonical_uri, canonical_handle)) } @@ -283,7 +276,6 @@ mod tests { uri: graph_uri, engine: Arc::new(engine), policy: None, - queries: None, }) } @@ -348,14 +340,12 @@ mod tests { uri: shared_uri.clone(), engine: Arc::clone(&engine), policy: None, - queries: None, }); let h2 = Arc::new(GraphHandle { key: GraphKey::cluster(GraphId::try_from("beta").unwrap()), uri: shared_uri, engine, policy: None, - queries: None, }); let registry = GraphRegistry::new(); @@ -421,14 +411,12 @@ mod tests { uri: shared_uri.clone(), engine: Arc::clone(&engine), policy: None, - queries: None, }); let h2 = Arc::new(GraphHandle { key: GraphKey::cluster(GraphId::try_from("beta").unwrap()), uri: shared_uri, engine, policy: None, - queries: None, }); let err = match GraphRegistry::from_handles(vec![h1, h2]) { Ok(_) => panic!("expected DuplicateUri, got Ok"), diff --git a/crates/omnigraph-server/src/settings.rs b/crates/omnigraph-server/src/settings.rs deleted file mode 100644 index ae28205..0000000 --- a/crates/omnigraph-server/src/settings.rs +++ /dev/null @@ -1,837 +0,0 @@ -//! Server settings: cluster/CLI/env resolution, bearer-token sources, and -//! runtime-state classification (moved verbatim from lib.rs in the -//! modularization). - -use super::*; - -/// Build serving settings from a cluster directory's applied revision -/// (RFC-005 Β§D2): graphs at derived roots, stored queries from verified -/// catalog blob content, policy bundles from blob paths with their applied -/// bindings. Always multi-graph routing. 
-pub(crate) async fn load_cluster_settings( - cluster_dir: &PathBuf, - cli_bind: Option, - cli_allow_unauthenticated: bool, - cli_require_all_graphs: bool, -) -> Result { - // `--cluster` accepts either a config directory (the ledger location is - // resolved through cluster.yaml's `storage:` key) or a storage-root URI - // directly (`s3://bucket/prefix`) β€” config-free serving: the ledger and - // catalog on the bucket ARE the deployment artifact. - // Any scheme-qualified argument (s3://, file://) is a storage root; a - // bare path is a config directory. - let cluster_arg = cluster_dir.to_string_lossy(); - let snapshot = if cluster_arg.contains("://") { - omnigraph_cluster::read_serving_snapshot_from_storage(cluster_arg.as_ref()).await - } else { - omnigraph_cluster::read_serving_snapshot(cluster_dir).await - } - .map_err(|diagnostics| { - let details = diagnostics - .iter() - .map(|diagnostic| { - format!( - "[{}] {}: {}", - diagnostic.code, diagnostic.path, diagnostic.message - ) - }) - .collect::>() - .join("\n "); - eyre!( - "the cluster at '{}' is not ready to serve:\n {details}", - cluster_dir.display() - ) - })?; - for diagnostic in &snapshot.diagnostics { - warn!( - code = %diagnostic.code, - path = %diagnostic.path, - message = %diagnostic.message, - "cluster startup diagnostic" - ); - } - let env_require_all_graphs = env_flag("OMNIGRAPH_REQUIRE_ALL_GRAPHS"); - let require_all_graphs = cli_require_all_graphs || env_require_all_graphs; - if require_all_graphs && !snapshot.diagnostics.is_empty() { - let details = snapshot - .diagnostics - .iter() - .map(|diagnostic| { - format!( - "[{}] {}: {}", - diagnostic.code, diagnostic.path, diagnostic.message - ) - }) - .collect::>() - .join("\n "); - bail!( - "strict cluster boot requires every applied graph to be ready; startup diagnostics:\n {details}" - ); - } - - // Bindings -> Cedar slots. 
The serving pipeline loads one bundle per - // graph plus one server-level bundle; stacked bundles per scope are a - // later slice β€” refuse loudly rather than silently merging policy. - let mut server_policy: Option = None; - let mut graph_policies: BTreeMap = BTreeMap::new(); - for policy in &snapshot.policies { - for binding in &policy.applies_to { - if binding == "cluster" { - if server_policy - .replace(PolicySource::Inline(policy.source.clone())) - .is_some() - { - bail!( - "multiple policy bundles bind the cluster scope; cluster-mode serving supports one bundle per scope β€” split or merge bundles (multi-bundle scopes are a later slice)" - ); - } - } else if let Some(graph_id) = binding.strip_prefix("graph.") { - if graph_policies - .insert( - graph_id.to_string(), - PolicySource::Inline(policy.source.clone()), - ) - .is_some() - { - bail!( - "multiple policy bundles bind graph '{graph_id}'; cluster-mode serving supports one bundle per scope β€” split or merge bundles (multi-bundle scopes are a later slice)" - ); - } - } else { - bail!("unrecognized policy binding '{binding}' in the applied revision"); - } - } - } - - let mut graphs = Vec::new(); - let mut skipped_graphs = Vec::new(); - for graph in &snapshot.graphs { - let specs: Vec = snapshot - .queries - .iter() - .filter(|query| query.graph_id == graph.graph_id) - .map(|query| queries::RegistrySpec { - name: query.name.clone(), - source: query.source.clone(), - // The Β§D5 bridge: the cluster registry has no expose flag - // (exposure becomes a policy decision in Phase 6) β€” cluster - // mode lists every stored query. 
- expose: true, - tool_name: None, - }) - .collect(); - let registry = match QueryRegistry::from_specs(specs) { - Ok(registry) => registry, - Err(errors) => { - let details = errors - .iter() - .map(|error| error.to_string()) - .collect::>() - .join("\n "); - warn!( - graph_id = %graph.graph_id, - errors = %details, - "graph quarantined because stored queries failed to parse" - ); - skipped_graphs.push(format!( - "{}: stored queries failed to parse: {details}", - graph.graph_id - )); - continue; - } - }; - let embedding = match graph - .embedding - .as_ref() - .map(|profile| { - profile.resolve().map_err(|err| { - eyre!("embedding provider for graph '{}': {err}", graph.graph_id) - }) - }) - .transpose() - { - Ok(embedding) => embedding, - Err(err) => { - warn!( - graph_id = %graph.graph_id, - error = %err, - "graph quarantined because embedding provider configuration failed" - ); - skipped_graphs.push(format!("{}: {err}", graph.graph_id)); - continue; - } - }; - graphs.push(GraphStartupConfig { - graph_id: graph.graph_id.clone(), - uri: graph.root.to_string_lossy().to_string(), - policy: graph_policies.get(&graph.graph_id).cloned(), - embedding, - queries: registry, - }); - } - if graphs.is_empty() { - let skipped = skipped_graphs.join(", "); - bail!( - "the cluster at '{}' has no healthy graphs to serve{}", - cluster_dir.display(), - if skipped.is_empty() { - String::new() - } else { - format!(" (quarantined: {skipped})") - } - ); - } - if require_all_graphs && !skipped_graphs.is_empty() { - bail!( - "strict cluster boot requires every graph to build startup settings (quarantined: {})", - skipped_graphs.join(", ") - ); - } - - let env_unauth = env_flag("OMNIGRAPH_UNAUTHENTICATED"); - - Ok(ServerConfig { - mode: ServerConfigMode::Multi { - graphs, - config_path: cluster_dir.clone(), - server_policy, - }, - bind: cli_bind.unwrap_or_else(|| "127.0.0.1:8080".to_string()), - allow_unauthenticated: cli_allow_unauthenticated || env_unauth, - require_all_graphs, - }) -} - 
-/// RFC-011 cluster-only boot: the server serves exclusively from a -/// cluster's applied revision (`--cluster `). The legacy -/// omnigraph.yaml / `--target` / positional-URI single-graph boot paths -/// were removed β€” a deployment serves from exactly one source. -pub async fn load_server_settings( - cli_cluster: Option<&PathBuf>, - cli_bind: Option, - cli_allow_unauthenticated: bool, - cli_require_all_graphs: bool, -) -> Result { - let Some(cluster_dir) = cli_cluster else { - bail!( - "omnigraph-server boots from a cluster: pass --cluster \ - (the cluster's applied revision is the deployment artifact). The legacy \ - single-graph boot (positional , --target, --config omnigraph.yaml) \ - was removed in RFC-011." - ); - }; - load_cluster_settings( - cluster_dir, - cli_bind, - cli_allow_unauthenticated, - cli_require_all_graphs, - ) - .await -} - -fn env_flag(name: &str) -> bool { - std::env::var(name) - .ok() - .map(|v| { - let trimmed = v.trim(); - !trimmed.is_empty() && trimmed != "0" && !trimmed.eq_ignore_ascii_case("false") - }) - .unwrap_or(false) -} - -/// MR-723 server runtime state, classified from the three-state matrix -/// of (bearer tokens configured) Γ— (policy file configured) at startup. -/// -/// * **Open** β€” neither tokens nor policy; requires explicit -/// `allow_unauthenticated`. Effectively a "trust the network" dev -/// mode. `serve()` refuses to start in this shape without the flag, -/// so the only way to reach this state at runtime is via deliberate -/// operator opt-in. -/// * **DefaultDeny** β€” tokens configured but no policy file. The -/// server requires a valid bearer token; once authenticated, every -/// action except `Read` is denied with 403. Closes the "tokens but -/// forgot the policy file" trap. -/// * **PolicyEnabled** β€” policy file configured and at least one -/// bearer token configured. Cedar evaluates every authenticated -/// request. 
Policy without tokens is rejected at startup β€” -/// such a server would 401 every request, which is bug-shaped -/// rather than feature-shaped (operators wanting "deny all -/// unauthenticated traffic" should configure tokens plus a -/// deny-all policy to get meaningful 403s with policy-decision -/// logging instead). -#[derive(Debug, Clone, Copy, Eq, PartialEq)] -pub enum ServerRuntimeState { - Open, - DefaultDeny, - PolicyEnabled, -} - -/// Compute the [`ServerRuntimeState`] from the configured inputs. -/// Pulled out as a pure function so the matrix is unit-testable -/// without standing up the full server. -/// -/// The classifier is the **single source of truth** for "should we -/// start?" β€” both `serve()`'s single-mode and multi-mode branches -/// call this before constructing their `AppState`. Adding a startup -/// invariant here means both modes enforce it automatically; the -/// alternative (per-constructor `bail!`) drifts the moment a third -/// mode is added. -pub fn classify_server_runtime_state( - has_tokens: bool, - has_policy: bool, - allow_unauthenticated: bool, -) -> Result { - match (has_tokens, has_policy, allow_unauthenticated) { - (false, false, false) => bail!( - "server has no bearer tokens and no policy file configured. This is a fully \ - open server β€” pass `--unauthenticated` (or set OMNIGRAPH_UNAUTHENTICATED=1) \ - if you actually want that, otherwise configure bearer tokens (see \ - docs/user/operations/server.md) and a graph or cluster policy bundle in \ - the cluster config, then run `omnigraph cluster apply` and restart." - ), - (false, false, true) => Ok(ServerRuntimeState::Open), - (true, false, _) => Ok(ServerRuntimeState::DefaultDeny), - (false, true, _) => bail!( - "policy file is configured but no bearer tokens β€” every request would 401 \ - because no token can ever match. Configure at least one bearer token (see \ - docs/user/operations/server.md), or remove the policy file. 
To deny all unauthenticated \ - traffic deliberately, configure tokens plus a deny-all Cedar rule β€” that \ - produces meaningful 403s with policy-decision logging instead of silent 401s." - ), - (true, true, _) => Ok(ServerRuntimeState::PolicyEnabled), - } -} - -pub(crate) fn normalize_bearer_token(value: Option) -> Option { - value - .map(|value| value.trim().to_string()) - .filter(|value| !value.is_empty()) -} - -pub(crate) fn normalize_bearer_actor(value: String) -> Result { - let value = value.trim().to_string(); - if value.is_empty() { - bail!("bearer token actor names must not be blank"); - } - Ok(value) -} - -pub(crate) fn parse_bearer_tokens_json(value: &str) -> Result> { - let entries: HashMap = serde_json::from_str(value) - .wrap_err("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON must be a JSON object of actor->token")?; - Ok(entries.into_iter().collect()) -} - -pub(crate) fn read_bearer_tokens_file(path: &str) -> Result> { - let contents = fs::read_to_string(path) - .wrap_err_with(|| format!("failed to read bearer tokens file at {path}"))?; - parse_bearer_tokens_json(&contents) - .wrap_err_with(|| format!("failed to parse bearer tokens file at {path}")) -} - -pub(crate) fn validate_bearer_tokens( - entries: Vec<(String, String)>, -) -> Result> { - let mut seen_actors = HashSet::new(); - let mut seen_tokens = HashSet::new(); - let mut normalized = Vec::with_capacity(entries.len()); - - for (actor, token) in entries { - let actor = normalize_bearer_actor(actor)?; - let Some(token) = normalize_bearer_token(Some(token)) else { - bail!("bearer token for actor '{actor}' must not be blank"); - }; - if !seen_actors.insert(actor.clone()) { - bail!("duplicate bearer token actor '{actor}'"); - } - if !seen_tokens.insert(token.clone()) { - bail!("duplicate bearer token value configured"); - } - normalized.push((actor, token)); - } - - normalized.sort_by(|(left, _), (right, _)| left.cmp(right)); - Ok(normalized) -} - -pub(crate) fn server_bearer_tokens_from_env() -> Result> 
{ - let mut entries = Vec::new(); - - if let Some(token) = normalize_bearer_token(std::env::var("OMNIGRAPH_SERVER_BEARER_TOKEN").ok()) - { - entries.push(("default".to_string(), token)); - } - - if let Some(path) = - normalize_bearer_token(std::env::var("OMNIGRAPH_SERVER_BEARER_TOKENS_FILE").ok()) - { - entries.extend(read_bearer_tokens_file(&path)?); - } else if let Some(json) = - normalize_bearer_token(std::env::var("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON").ok()) - { - entries.extend(parse_bearer_tokens_json(&json)?); - } - - validate_bearer_tokens(entries) -} - -#[cfg(test)] -mod tests { - use super::{ - GraphStartupConfig, ServerConfig, ServerConfigMode, ServerRuntimeState, - classify_server_runtime_state, hash_bearer_token, normalize_bearer_token, - parse_bearer_tokens_json, serve, server_bearer_tokens_from_env, - }; - use serial_test::serial; - use std::env; - use std::fs; - use tempfile::tempdir; - - /// `authorize` returns the allow/deny **decision** (`Authz`) and reserves - /// `Err` for operational failures, so the invoke handler can hide a denial - /// as 404 without also masking a 401/500. Pins each outcome. - #[test] - fn authorize_splits_decision_from_operational_error() { - use super::{ - Authz, PolicyAction, PolicyCompiler, PolicyConfig, PolicyRequest, ResolvedActor, - authorize, - }; - use std::sync::Arc; - - fn req(action: PolicyAction) -> PolicyRequest { - PolicyRequest { - action, - branch: None, - target_branch: None, - } - } - let actor = ResolvedActor::cluster_static(Arc::from("act-alice")); - - // --- No policy engine installed (open / default-deny modes) --- - // A server-scoped action is denied in every no-policy state. - assert!(matches!( - authorize(Some(&actor), None, req(PolicyAction::GraphList)).unwrap(), - Authz::Denied(_) - )); - // Authenticated actor + a non-read per-graph action β†’ default-deny. 
- assert!(matches!( - authorize(Some(&actor), None, req(PolicyAction::Change)).unwrap(), - Authz::Denied(_) - )); - // `read` is the one per-graph action permitted without a policy. - assert!(matches!( - authorize(Some(&actor), None, req(PolicyAction::Read)).unwrap(), - Authz::Allowed - )); - // Open mode (no actor, no policy) β†’ allowed. - assert!(matches!( - authorize(None, None, req(PolicyAction::Read)).unwrap(), - Authz::Allowed - )); - - // --- Policy engine installed --- - let policy: PolicyConfig = serde_yaml::from_str( - "version: 1\n\ - groups:\n team: [act-alice]\n\ - rules:\n - id: team-read\n allow:\n actors: { group: team }\n actions: [read]\n branch_scope: any\n", - ) - .unwrap(); - let engine = PolicyCompiler::compile(&policy, "graph").unwrap(); - - // A matched allow rule β†’ Allowed. - assert!(matches!( - authorize( - Some(&actor), - Some(&engine), - PolicyRequest { - action: PolicyAction::Read, - branch: Some("main".to_string()), - target_branch: None - }, - ) - .unwrap(), - Authz::Allowed - )); - // Known actor, no matching allow rule β†’ Denied, carrying the decision message. - match authorize( - Some(&actor), - Some(&engine), - PolicyRequest { - action: PolicyAction::Change, - branch: Some("main".to_string()), - target_branch: None, - }, - ) - .unwrap() - { - Authz::Denied(message) => { - assert!(!message.is_empty(), "a deny carries its decision message") - } - Authz::Allowed => panic!("change must be denied: only read is allowed"), - } - // Policy installed but no actor β†’ operational failure (`Err`), NOT a - // decision. This is the split that keeps a 401/500 from being masked - // as the denial's response in the invoke handler. 
- assert!( - authorize(None, Some(&engine), req(PolicyAction::Read)).is_err(), - "a missing actor with a policy installed is an operational error, not a deny" - ); - } - - #[test] - fn hash_bearer_token_produces_32_byte_output() { - let hash = hash_bearer_token("any-token"); - assert_eq!(hash.len(), 32); - } - - /// The single gate both open paths funnel through: it refuses a - /// schema breakage (naming the graph label + query), attaches a clean - /// registry, and collapses an empty one to `None`. Pure over its args - /// (no engine), so it covers the multi-graph path's logic too β€” the - /// only per-path difference is the `label`, asserted here. - #[test] - fn validate_and_attach_gates_on_schema_and_collapses_empty() { - use crate::queries::{QueryRegistry, RegistrySpec}; - use omnigraph_compiler::catalog::build_catalog; - use omnigraph_compiler::schema::parser::parse_schema; - - let schema = parse_schema("node User {\nname: String\n}\n").unwrap(); - let catalog = build_catalog(&schema).unwrap(); - let spec = |name: &str, source: &str| RegistrySpec { - name: name.to_string(), - source: source.to_string(), - expose: false, - tool_name: None, - }; - - // Empty registry β†’ nothing attached, no error. - let empty = super::validate_and_attach(QueryRegistry::default(), &catalog, "g").unwrap(); - assert!(empty.is_none()); - - // A query that type-checks β†’ attached. - let ok = QueryRegistry::from_specs(vec![spec( - "find_user", - "query find_user() { match { $u: User } return { $u.name } }", - )]) - .unwrap(); - assert!( - super::validate_and_attach(ok, &catalog, "g") - .unwrap() - .is_some() - ); - - // A query referencing a type the schema lacks β†’ boot refusal that - // names both the graph label and the offending query. 
- let broken = QueryRegistry::from_specs(vec![spec( - "ghost", - "query ghost() { match { $w: Widget } return { $w.name } }", - )]) - .unwrap(); - let err = super::validate_and_attach(broken, &catalog, "graph-x").unwrap_err(); - let msg = err.to_string(); - assert!(msg.contains("graph-x"), "labels the graph: {msg}"); - assert!(msg.contains("ghost"), "names the query: {msg}"); - assert!( - msg.contains("schema check"), - "mentions the schema check: {msg}" - ); - } - - #[test] - fn hash_bearer_token_is_deterministic() { - assert_eq!( - hash_bearer_token("stable-input"), - hash_bearer_token("stable-input"), - ); - } - - #[test] - fn hash_bearer_token_differs_for_different_inputs() { - assert_ne!(hash_bearer_token("token-a"), hash_bearer_token("token-b")); - } - - #[test] - fn hash_bearer_token_matches_known_sha256_vector() { - // SHA-256("abc"). If this ever fails, the hash function was swapped. - let hash = hash_bearer_token("abc"); - let hex: String = hash.iter().map(|b| format!("{:02x}", b)).collect(); - assert_eq!( - hex, - "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad" - ); - } - - #[tokio::test] - async fn server_settings_require_cluster_boot_source() { - // RFC-011 cluster-only: with no --cluster the server refuses to - // start and names the cluster-required remedy. - let error = super::load_server_settings(None, None, false, false) - .await - .unwrap_err(); - assert!( - error.to_string().contains("boots from a cluster"), - "expected cluster-required error, got: {error}", - ); - } - - #[test] - fn classify_open_requires_explicit_unauthenticated_flag() { - // State 1: no tokens, no policy, no flag β†’ refuse to start. - let error = classify_server_runtime_state(false, false, false).unwrap_err(); - let msg = error.to_string(); - assert!( - msg.contains("--unauthenticated"), - "expected refusal message mentioning --unauthenticated, got: {msg}" - ); - - // Same matrix cell but with the flag set β†’ Open mode permitted. 
- assert_eq!( - classify_server_runtime_state(false, false, true).unwrap(), - ServerRuntimeState::Open - ); - } - - #[test] - fn classify_tokens_without_policy_is_default_deny() { - // State 2: tokens configured, no policy β†’ DefaultDeny regardless - // of the flag (the flag opts into the fully-open dev mode; it - // doesn't downgrade default-deny back to open). - assert_eq!( - classify_server_runtime_state(true, false, false).unwrap(), - ServerRuntimeState::DefaultDeny - ); - assert_eq!( - classify_server_runtime_state(true, false, true).unwrap(), - ServerRuntimeState::DefaultDeny - ); - } - - #[tokio::test] - #[serial] - async fn serve_refuses_to_start_with_policy_but_no_tokens_multi_mode() { - // Bug 2 from the bot-review pass: multi-mode startup was missing - // the "policy requires tokens" check that single-mode enforces. - // After centralizing the check in `classify_server_runtime_state`, - // both modes get the same enforcement. This test guards the - // multi-mode propagation path. - // - // Sibling test below pins single mode. Together they pin that - // the classifier is called from both branches of `serve()`. - let _guard = EnvGuard::set(&[ - ("OMNIGRAPH_SERVER_BEARER_TOKEN", None), - ("OMNIGRAPH_SERVER_BEARER_TOKENS_FILE", None), - ("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", None), - ("OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET", None), - ("OMNIGRAPH_UNAUTHENTICATED", None), - ]); - let temp = tempdir().unwrap(); - // The classifier reads `has_policy_configured` from the config - // shape (does the Option contain a path?), not from file - // existence, so we can hand it a path without writing a real - // policy file β€” the bail fires before policy load. 
- let policy_path = temp.path().join("server-policy.yaml"); - let config = ServerConfig { - mode: ServerConfigMode::Multi { - graphs: vec![GraphStartupConfig { - graph_id: "alpha".to_string(), - uri: temp - .path() - .join("alpha.omni") - .to_string_lossy() - .into_owned(), - policy: None, - embedding: None, - queries: crate::queries::QueryRegistry::default(), - }], - config_path: temp.path().join("omnigraph.yaml"), - server_policy: Some(crate::PolicySource::File(policy_path)), - }, - bind: "127.0.0.1:0".to_string(), - allow_unauthenticated: false, - require_all_graphs: false, - }; - let result = serve(config).await; - let err = result - .expect_err("serve should refuse to start in multi mode with policy but no tokens"); - let msg = format!("{:?}", err); - assert!( - msg.contains("policy file is configured but no bearer tokens"), - "expected policy-without-tokens rejection in multi mode, got: {msg}", - ); - } - - #[tokio::test] - #[serial] - async fn serve_refuses_to_start_in_state_1_without_unauthenticated() { - // MR-723 PR A: pin the integration boundary that the classifier - // is actually called by `serve()` before any side-effecting - // work (Lance dataset open, TcpListener::bind). The classifier - // itself is unit-tested above; this test guards the propagation - // path from `classify_server_runtime_state` through serve's - // `?` so a future refactor that drops the call returns red. - // - // Marked `#[serial]` because we have to clear all bearer-token - // env vars, and another test in this module setting any of them - // concurrently would corrupt the read inside `resolve_token_source`. 
- let _guard = EnvGuard::set(&[ - ("OMNIGRAPH_SERVER_BEARER_TOKEN", None), - ("OMNIGRAPH_SERVER_BEARER_TOKENS_FILE", None), - ("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", None), - ("OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET", None), - ("OMNIGRAPH_UNAUTHENTICATED", None), - ]); - let temp = tempdir().unwrap(); - // Graph path doesn't need to exist β€” classifier fires before - // any engine open. - let config = ServerConfig { - mode: ServerConfigMode::Multi { - graphs: vec![GraphStartupConfig { - graph_id: "default".to_string(), - uri: temp - .path() - .join("graph.omni") - .to_string_lossy() - .into_owned(), - policy: None, - embedding: None, - queries: crate::queries::QueryRegistry::default(), - }], - config_path: temp.path().join("cluster"), - server_policy: None, - }, - bind: "127.0.0.1:0".to_string(), - allow_unauthenticated: false, - require_all_graphs: false, - }; - let result = serve(config).await; - let err = - result.expect_err("serve should refuse to start in State 1 without --unauthenticated"); - let msg = format!("{:?}", err); - assert!( - msg.contains("no bearer tokens") || msg.contains("policy file"), - "expected refusal message naming the misconfiguration, got: {msg}", - ); - } - - #[test] - fn classify_policy_enabled_requires_tokens() { - // State 3: tokens + policy β†’ PolicyEnabled, regardless of the - // `allow_unauthenticated` flag (Cedar evaluates the bearer, - // the flag is moot once tokens exist). - assert_eq!( - classify_server_runtime_state(true, true, false).unwrap(), - ServerRuntimeState::PolicyEnabled - ); - assert_eq!( - classify_server_runtime_state(true, true, true).unwrap(), - ServerRuntimeState::PolicyEnabled - ); - } - - #[test] - fn classify_policy_without_tokens_is_rejected() { - // Closes the "policy installed but no tokens β†’ silent 401 on - // every request" footgun. 
The same shape that single-mode - // `open_with_bearer_tokens_and_policy` used to bail on - // privately is now rejected by the classifier so both single - // and multi mode get the same enforcement from one source of - // truth. - for allow_unauthenticated in [false, true] { - let err = - classify_server_runtime_state(false, true, allow_unauthenticated).unwrap_err(); - let msg = err.to_string(); - assert!( - msg.contains("policy file is configured but no bearer tokens"), - "expected policy-without-tokens rejection message; got: {msg}" - ); - assert!( - msg.contains("every request would 401"), - "rejection message must name the failure mode; got: {msg}" - ); - } - } - - #[test] - fn normalize_bearer_token_trims_and_filters_blank_values() { - assert_eq!(normalize_bearer_token(None), None); - assert_eq!(normalize_bearer_token(Some(" ".to_string())), None); - assert_eq!( - normalize_bearer_token(Some(" demo-token ".to_string())).as_deref(), - Some("demo-token") - ); - } - - struct EnvGuard { - saved: Vec<(&'static str, Option)>, - } - - impl EnvGuard { - fn set(vars: &[(&'static str, Option<&str>)]) -> Self { - let saved = vars - .iter() - .map(|(name, _)| (*name, env::var(name).ok())) - .collect::>(); - for (name, value) in vars { - unsafe { - match value { - Some(value) => env::set_var(name, value), - None => env::remove_var(name), - } - } - } - Self { saved } - } - } - - impl Drop for EnvGuard { - fn drop(&mut self) { - for (name, value) in self.saved.drain(..) 
{ - unsafe { - match value { - Some(value) => env::set_var(name, value), - None => env::remove_var(name), - } - } - } - } - } - - #[test] - fn parse_bearer_tokens_json_reads_actor_token_map() { - let tokens = parse_bearer_tokens_json(r#"{"alice":" token-a ","bob":"token-b"}"#).unwrap(); - assert_eq!(tokens.len(), 2); - assert!(tokens.contains(&("alice".to_string(), " token-a ".to_string()))); - assert!(tokens.contains(&("bob".to_string(), "token-b".to_string()))); - } - - #[test] - #[serial] - fn server_bearer_tokens_from_env_reads_legacy_token_and_token_file() { - let temp = tempdir().unwrap(); - let tokens_path = temp.path().join("tokens.json"); - fs::write( - &tokens_path, - r#"{"team-01":"token-one","team-02":"token-two"}"#, - ) - .unwrap(); - - let _guard = EnvGuard::set(&[ - ("OMNIGRAPH_SERVER_BEARER_TOKEN", Some(" legacy-token ")), - ( - "OMNIGRAPH_SERVER_BEARER_TOKENS_FILE", - Some(tokens_path.to_str().unwrap()), - ), - ("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON", None), - ]); - - let tokens = server_bearer_tokens_from_env().unwrap(); - assert_eq!( - tokens, - vec![ - ("default".to_string(), "legacy-token".to_string()), - ("team-01".to_string(), "token-one".to_string()), - ("team-02".to_string(), "token-two".to_string()), - ] - ); - } -} diff --git a/crates/omnigraph-server/tests/auth_policy.rs b/crates/omnigraph-server/tests/auth_policy.rs deleted file mode 100644 index 5cbbb97..0000000 --- a/crates/omnigraph-server/tests/auth_policy.rs +++ /dev/null @@ -1,919 +0,0 @@ -//! Bearer auth, actor resolution, Cedar policy decisions, admission. -//! Moved verbatim from tests/server.rs in the modularization. 
- -use std::env; -use std::fs; -use std::sync::Arc; - -use axum::body::Body; -use axum::http::header::AUTHORIZATION; -use axum::http::{Method, Request, StatusCode}; -use omnigraph::db::{Omnigraph, ReadTarget}; -use omnigraph::error::OmniError; -use omnigraph::loader::LoadMode; -use omnigraph_server::api::{ - BranchCreateRequest, BranchMergeRequest, ChangeRequest, ErrorOutput, ExportRequest, ReadRequest, SchemaApplyRequest, -}; -use omnigraph_server::{AppState, build_app}; -use serde_json::{Value, json}; -use tower::ServiceExt; - - -mod support; -use support::*; - -#[tokio::test(flavor = "multi_thread")] -async fn healthz_succeeds_after_startup() { - let (_temp, app) = app_for_loaded_graph().await; - let (status, body) = json_response( - &app, - Request::builder() - .uri("/healthz") - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - - assert_eq!(status, StatusCode::OK); - assert_eq!(body["status"], "ok"); - assert_eq!(body["version"], env!("CARGO_PKG_VERSION")); - match option_env!("OMNIGRAPH_SOURCE_VERSION") { - Some(source_version) => assert_eq!(body["source_version"], source_version), - None => assert!(body.get("source_version").is_none()), - } -} - -#[tokio::test(flavor = "multi_thread")] -async fn protected_routes_require_bearer_token() { - let (_temp, app) = app_for_loaded_graph_with_auth("demo-token").await; - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/branches")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - - let error: ErrorOutput = serde_json::from_value(body).unwrap(); - assert_eq!(status, StatusCode::UNAUTHORIZED); - assert_eq!( - error.code, - Some(omnigraph_server::api::ErrorCode::Unauthorized) - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn protected_routes_accept_valid_bearer_token_while_healthz_stays_open() { - let (_temp, app) = app_for_loaded_graph_with_auth("demo-token").await; - - let health = app - .clone() - .oneshot( - Request::builder() - 
.uri("/healthz") - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(health.status(), StatusCode::OK); - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/branches")) - .method(Method::GET) - .header("authorization", "Bearer demo-token") - .body(Body::empty()) - .unwrap(), - ) - .await; - - assert_eq!(status, StatusCode::OK); - assert!(body["branches"].is_array()); -} - -#[tokio::test(flavor = "multi_thread")] -async fn protected_routes_accept_any_configured_team_bearer_token() { - let (_temp, app) = app_for_loaded_graph_with_auth_tokens(&[ - ("team-01", "token-one"), - ("team-02", "token-two"), - ]) - .await; - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/branches")) - .method(Method::GET) - .header("authorization", "Bearer token-two") - .body(Body::empty()) - .unwrap(), - ) - .await; - - assert_eq!(status, StatusCode::OK); - assert!(body["branches"].is_array()); -} - -#[tokio::test(flavor = "multi_thread")] -async fn bearer_token_resolves_to_correct_actor_for_policy_decisions() { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let policy_path = temp.path().join("policy.yaml"); - fs::write( - &policy_path, - r#" -version: 1 -groups: - readers: [act-a] - writers: [act-b] -protected_branches: [main] -rules: - - id: readers-only - allow: - actors: { group: readers } - actions: [read] - branch_scope: any -"#, - ) - .unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![ - ("act-a".to_string(), "token-a".to_string()), - ("act-b".to_string(), "token-b".to_string()), - ], - Some(&policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - - // act-a is authenticated AND authorized. 
- let (ok_status, _) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .header("authorization", "Bearer token-a") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(ok_status, StatusCode::OK); - - // act-b is authenticated but policy rejects — proves the resolved actor - // (not some default) was the policy subject. - let (denied_status, denied_body) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .header("authorization", "Bearer token-b") - .body(Body::empty()) - .unwrap(), - ) - .await; - let denied_error: ErrorOutput = serde_json::from_value(denied_body).unwrap(); - assert_eq!(denied_status, StatusCode::FORBIDDEN); - assert_eq!( - denied_error.code, - Some(omnigraph_server::api::ErrorCode::Forbidden) - ); - - // Unknown token: 401, never reaches the policy engine. - let (bad_status, _) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .header("authorization", "Bearer wrong-token") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(bad_status, StatusCode::UNAUTHORIZED); -} - -#[tokio::test(flavor = "multi_thread")] -async fn actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers() { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let policy_path = temp.path().join("policy.yaml"); - // Same readers/writers split as - // `bearer_token_resolves_to_correct_actor_for_policy_decisions` — - // `act-a` can read main, `act-b` cannot. The asymmetry is what - // makes the spoof-up/spoof-down distinction observable.
- fs::write( - &policy_path, - r#" -version: 1 -groups: - readers: [act-a] - writers: [act-b] -protected_branches: [main] -rules: - - id: readers-only - allow: - actors: { group: readers } - actions: [read] - branch_scope: any -"#, - ) - .unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![ - ("act-a".to_string(), "token-a".to_string()), - ("act-b".to_string(), "token-b".to_string()), - ], - Some(&policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - - // (1) Spoof-up: bearer for act-b (denied) + X-Actor-Id: act-a (allowed). - // If the server were trusting the header, this would succeed as - // act-a. The contract is: the bearer wins. Expect 403 because - // act-b can't read. - let (spoof_up_status, spoof_up_body) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .header("authorization", "Bearer token-b") - .header("x-actor-id", "act-a") - .body(Body::empty()) - .unwrap(), - ) - .await; - let spoof_up_error: ErrorOutput = serde_json::from_value(spoof_up_body).unwrap(); - assert_eq!( - spoof_up_status, - StatusCode::FORBIDDEN, - "X-Actor-Id must not promote a denied bearer to an allowed actor", - ); - assert_eq!( - spoof_up_error.code, - Some(omnigraph_server::api::ErrorCode::Forbidden), - ); - - // (2) Spoof-down: bearer for act-a (allowed) + X-Actor-Id: act-b (denied). - // If the server were trusting the header, this would fail as act-b. - // The contract is: the bearer wins. Expect 200 because act-a can read. 
- let (spoof_down_status, _) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .header("authorization", "Bearer token-a") - .header("x-actor-id", "act-b") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!( - spoof_down_status, - StatusCode::OK, - "X-Actor-Id must not demote an allowed bearer to a denied actor", - ); - - // (3) Empty-string spoof attempt: an X-Actor-Id of "" must not - // leak through as the policy subject. Same expectation as (1): - // bearer for act-b is denied regardless of what the header tries. - let (empty_spoof_status, _) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .header("authorization", "Bearer token-b") - .header("x-actor-id", "") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!( - empty_spoof_status, - StatusCode::FORBIDDEN, - "empty X-Actor-Id must not clear the resolved actor", - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn policy_allows_read_but_distinguishes_401_from_403() { - let (_temp, app) = app_for_loaded_graph_with_auth_tokens_and_policy( - &[("act-bruno", "team-token"), ("act-ragnor", "admin-token")], - POLICY_YAML, - ) - .await; - - let (missing_status, missing_body) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - let missing_error: ErrorOutput = serde_json::from_value(missing_body).unwrap(); - assert_eq!(missing_status, StatusCode::UNAUTHORIZED); - assert_eq!( - missing_error.code, - Some(omnigraph_server::api::ErrorCode::Unauthorized) - ); - - let (snapshot_status, snapshot_body) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .header("authorization", "Bearer team-token") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(snapshot_status, StatusCode::OK); - assert_eq!(snapshot_body["branch"], 
"main"); - - let export_request = ExportRequest { - branch: Some("main".to_string()), - type_names: Vec::new(), - table_keys: Vec::new(), - }; - let (forbidden_status, forbidden_body) = json_response( - &app, - Request::builder() - .uri(g("/export")) - .method(Method::POST) - .header("authorization", "Bearer team-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&export_request).unwrap())) - .unwrap(), - ) - .await; - let forbidden_error: ErrorOutput = serde_json::from_value(forbidden_body).unwrap(); - assert_eq!(forbidden_status, StatusCode::FORBIDDEN); - assert_eq!( - forbidden_error.code, - Some(omnigraph_server::api::ErrorCode::Forbidden) - ); - - let response = app - .clone() - .oneshot( - Request::builder() - .uri(g("/export")) - .method(Method::POST) - .header("authorization", "Bearer admin-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&export_request).unwrap())) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(response.status(), StatusCode::OK); -} - -#[tokio::test(flavor = "multi_thread")] -async fn policy_uses_resolved_branch_for_snapshot_reads() { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let snapshot_id = { - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.resolve_snapshot("main").await.unwrap().to_string() - }; - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, POLICY_PROTECTED_READ_YAML).unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![("act-bruno".to_string(), "team-token".to_string())], - Some(&policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - - let read = ReadRequest { - query_source: fs::read_to_string(fixture("test.gq")).unwrap(), - query_name: Some("get_person".to_string()), - params: Some(json!({ "name": "Alice" })), - branch: None, - snapshot: Some(snapshot_id), - }; - let 
(status, body) = json_response( - &app, - Request::builder() - .uri(g("/read")) - .method(Method::POST) - .header("authorization", "Bearer team-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&read).unwrap())) - .unwrap(), - ) - .await; - - assert_eq!(status, StatusCode::OK); - assert_eq!(body["target"]["branch"], Value::Null); - assert_eq!( - body["target"]["snapshot"].as_str(), - read.snapshot.as_deref() - ); - assert_eq!(body["row_count"], 1); -} - -#[tokio::test(flavor = "multi_thread")] -async fn policy_blocks_change_on_protected_main_but_allows_unprotected_branch() { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.branch_create_from(ReadTarget::branch("main"), "feature") - .await - .unwrap(); - drop(db); - - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, POLICY_YAML).unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![("act-bruno".to_string(), "team-token".to_string())], - Some(&policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - - let main_change = ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": "Mina", "age": 28 })), - branch: Some("main".to_string()), - }; - let (main_status, main_body) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("authorization", "Bearer team-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&main_change).unwrap())) - .unwrap(), - ) - .await; - let main_error: ErrorOutput = serde_json::from_value(main_body).unwrap(); - assert_eq!(main_status, StatusCode::FORBIDDEN); - assert_eq!( - main_error.code, - Some(omnigraph_server::api::ErrorCode::Forbidden) - ); - - let feature_change = ChangeRequest { - query: 
MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": "Mina", "age": 28 })), - branch: Some("feature".to_string()), - }; - let (feature_status, feature_body) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("authorization", "Bearer team-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&feature_change).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(feature_status, StatusCode::OK); - assert_eq!(feature_body["branch"], "feature"); - assert_eq!(feature_body["affected_nodes"], 1); -} - -#[tokio::test(flavor = "multi_thread")] -async fn policy_blocks_non_admin_merge_to_main_and_allows_admin() { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.branch_create_from(ReadTarget::branch("main"), "feature") - .await - .unwrap(); - db.load( - "feature", - r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#, - LoadMode::Append, - ) - .await - .unwrap(); - drop(db); - - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, POLICY_YAML).unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![ - ("act-bruno".to_string(), "team-token".to_string()), - ("act-ragnor".to_string(), "admin-token".to_string()), - ], - Some(&policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - - let merge = BranchMergeRequest { - source: "feature".to_string(), - target: Some("main".to_string()), - }; - let (deny_status, deny_body) = json_response( - &app, - Request::builder() - .uri(g("/branches/merge")) - .method(Method::POST) - .header("authorization", "Bearer team-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&merge).unwrap())) - .unwrap(), - ) - .await; - let deny_error: ErrorOutput = 
serde_json::from_value(deny_body).unwrap(); - assert_eq!(deny_status, StatusCode::FORBIDDEN); - assert_eq!( - deny_error.code, - Some(omnigraph_server::api::ErrorCode::Forbidden) - ); - - let (allow_status, allow_body) = json_response( - &app, - Request::builder() - .uri(g("/branches/merge")) - .method(Method::POST) - .header("authorization", "Bearer admin-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&merge).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(allow_status, StatusCode::OK); - assert_eq!(allow_body["actor_id"], "act-ragnor"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn authenticated_change_stamps_actor_on_commits() { - // With the Run state machine removed, actor_id is recorded - // directly on the commit graph (no intermediate run record). - let (_temp, app) = app_for_loaded_graph_with_auth_tokens(&[("act-andrew", "token-one")]).await; - - let change = ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": "Mina", "age": 28 })), - branch: Some("main".to_string()), - }; - let (change_status, change_body) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("authorization", "Bearer token-one") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&change).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(change_status, StatusCode::OK); - assert_eq!(change_body["actor_id"], "act-andrew"); - - let (commits_status, commits_body) = json_response( - &app, - Request::builder() - .uri(g("/commits?branch=main")) - .method(Method::GET) - .header("authorization", "Bearer token-one") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(commits_status, StatusCode::OK); - let head = commits_body["commits"] - .as_array() - .unwrap() - .last() - .expect("head commit should exist"); - assert_eq!(head["actor_id"], "act-andrew"); -} - 
-#[tokio::test(flavor = "multi_thread")] -async fn authenticated_branch_merge_stamps_merge_actor_on_head_commit() { - let (_temp, app) = app_for_loaded_graph_with_auth_tokens(&[ - ("act-andrew", "token-one"), - ("act-ragnor", "token-two"), - ]) - .await; - - let create = BranchCreateRequest { - from: Some("main".to_string()), - name: "feature".to_string(), - }; - let (create_status, _) = json_response( - &app, - Request::builder() - .uri(g("/branches")) - .method(Method::POST) - .header("authorization", "Bearer token-one") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&create).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(create_status, StatusCode::OK); - - let change = ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": "Zoe", "age": 33 })), - branch: Some("feature".to_string()), - }; - let (change_status, _) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("authorization", "Bearer token-one") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&change).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(change_status, StatusCode::OK); - - let merge = BranchMergeRequest { - source: "feature".to_string(), - target: Some("main".to_string()), - }; - let (merge_status, merge_body) = json_response( - &app, - Request::builder() - .uri(g("/branches/merge")) - .method(Method::POST) - .header("authorization", "Bearer token-two") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&merge).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(merge_status, StatusCode::OK); - assert_eq!(merge_body["actor_id"], "act-ragnor"); - - let (commit_status, commit_body) = json_response( - &app, - Request::builder() - .uri(g("/commits?branch=main")) - .method(Method::GET) - .header("authorization", "Bearer token-two") - .body(Body::empty()) - .unwrap(), - ) 
- .await; - assert_eq!(commit_status, StatusCode::OK); - let head = commit_body["commits"] - .as_array() - .unwrap() - .last() - .expect("head commit should exist"); - assert_eq!(head["actor_id"], "act-ragnor"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn engine_layer_policy_fires_via_direct_arc_omnigraph_from_new_single() { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - - // Permit `act-allowed` for change actions; `act-blocked` is not in - // any allowed group — every change request from them must deny. - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, permit_all_policy_yaml(&["act-allowed"])).unwrap(); - let policy_engine = - omnigraph_server::PolicyEngine::load_graph(&policy_path, graph.to_string_lossy().as_ref()) - .unwrap(); - - let workload = omnigraph_server::workload::WorkloadController::new(100, 1_000_000_000); - let state = AppState::new_single( - graph.to_string_lossy().to_string(), - db, - vec![("act-blocked".to_string(), "block-token".to_string())], - Some(policy_engine), - workload, - ); - - // Reach into the routing and pull the engine the same way an - // embedded consumer holding `Arc<Omnigraph>` would. If `new_single` - // failed to apply `with_policy` to the engine, this `mutate_as` - // would succeed — the HTTP-layer is bypassed entirely. - // RFC-011 cluster-only: the single-graph convenience constructor - // registers the graph under the reserved id `default`.
- let key = omnigraph_server::GraphKey::cluster( - omnigraph_server::GraphId::try_from("default").unwrap(), - ); - let handle = match state.routing().registry.get(&key) { - omnigraph_server::RegistryLookup::Ready(handle) => handle, - omnigraph_server::RegistryLookup::Gone => panic!("default graph must be registered"), - }; - let engine = Arc::clone(&handle.engine); - - let mut params: omnigraph_compiler::ParamMap = Default::default(); - params.insert( - "name".to_string(), - omnigraph_compiler::Literal::String("EngineLayerBlocked".to_string()), - ); - params.insert("age".to_string(), omnigraph_compiler::Literal::Integer(30)); - let result = engine - .mutate_as( - "main", - MUTATION_QUERIES, - "insert_person", - &params, - Some("act-blocked"), - ) - .await; - match result { - Err(OmniError::Policy(_)) => { /* expected — engine-layer gate fired */ } - Ok(_) => panic!( - "engine-layer policy did NOT fire — act-blocked successfully ran mutate_as via \ - the engine pulled from the registry handle. AppState::new_single failed to apply \ - with_policy to the underlying Omnigraph engine. This is the B2 footgun the \ - with_policy_engine deletion was supposed to close." 
- ), - Err(other) => panic!("expected OmniError::Policy, got: {other:?}"), - } -} - -#[tokio::test(flavor = "multi_thread")] -async fn oversized_request_body_returns_payload_too_large() { - let (_temp, app) = app_for_loaded_graph().await; - let oversized = "x".repeat(1_100_000); - let response = app - .clone() - .oneshot( - Request::builder() - .uri(g("/read")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(oversized)) - .unwrap(), - ) - .await - .unwrap(); - - assert_eq!(response.status(), StatusCode::PAYLOAD_TOO_LARGE); -} - -#[tokio::test(flavor = "multi_thread")] -async fn default_deny_mode_allows_read_for_authenticated_actor() { - let (_temp, app) = app_for_graph_with_auth_tokens_only( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-andrew", "demo-token")], - ) - .await; - - let (status, _body) = json_response( - &app, - Request::builder() - .uri(g("/snapshot")) - .method(Method::GET) - .header(AUTHORIZATION, "Bearer demo-token") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); -} - -#[tokio::test(flavor = "multi_thread")] -async fn default_deny_mode_rejects_change_with_forbidden() { - let (_temp, app) = app_for_graph_with_auth_tokens_only( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-andrew", "demo-token")], - ) - .await; - - let change = ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": "DefaultDeny", "age": 1 })), - branch: Some("main".to_string()), - }; - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header(AUTHORIZATION, "Bearer demo-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&change).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::FORBIDDEN); - let error: ErrorOutput = serde_json::from_value(body).unwrap(); - assert!( - 
error.error.contains("default-deny"), - "expected default-deny in error message, got: {}", - error.error - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn default_deny_mode_rejects_schema_apply_with_forbidden() { - let (_temp, app) = app_for_graph_with_auth_tokens_only( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-andrew", "demo-token")], - ) - .await; - - let req = SchemaApplyRequest { - schema_source: additive_schema_with_nickname(), - ..Default::default() - }; - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/schema/apply")) - .method(Method::POST) - .header(AUTHORIZATION, "Bearer demo-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&req).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::FORBIDDEN); - let error: ErrorOutput = serde_json::from_value(body).unwrap(); - assert!( - error.error.contains("default-deny"), - "expected default-deny in error message, got: {}", - error.error - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn policy_decision_parity_change_admin_on_main_allowed() { - // (act-ragnor, change, main) β€” admins-change-anywhere rule applies. - // Both SDK and HTTP must allow. Each path uses its own fresh graph - // because allowβ†’side-effects. - let (_t1, graph1, policy1) = build_parity_graph().await; - let sdk = sdk_change_decision(&graph1, &policy1, "act-ragnor").await; - let (_t2, graph2, policy2) = build_parity_graph().await; - let http = http_change_decision(&graph2, &policy2, "act-ragnor", "ragnor-token").await; - assert!( - matches!(sdk, ParityDecision::Allow) && matches!(http, ParityDecision::Allow), - "SDK={sdk:?} HTTP={http:?} β€” should both Allow", - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn policy_decision_parity_change_team_on_main_denied() { - // (act-bruno, change, main) β€” no rule grants bruno change on - // protected. Both SDK and HTTP must deny. 
Same graph is reusable - // because deny→no side-effects. - let (_temp, graph, policy) = build_parity_graph().await; - let sdk = sdk_change_decision(&graph, &policy, "act-bruno").await; - let http = http_change_decision(&graph, &policy, "act-bruno", "bruno-token").await; - assert!( - matches!(sdk, ParityDecision::Deny) && matches!(http, ParityDecision::Deny), - "SDK={sdk:?} HTTP={http:?} — should both Deny", - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn policy_decision_parity_branch_merge_admin_allowed() { - // (act-ragnor, branch_merge, feature→main) — admins-merge-to-protected - // rule applies. Both Allow. Each path uses its own fresh graph — - // a successful merge consumes the feature branch's commit on main. - let (_t1, graph1, policy1) = build_parity_graph().await; - let sdk = sdk_merge_decision(&graph1, &policy1, "act-ragnor").await; - let (_t2, graph2, policy2) = build_parity_graph().await; - let http = http_merge_decision(&graph2, &policy2, "act-ragnor", "ragnor-token").await; - assert!( - matches!(sdk, ParityDecision::Allow) && matches!(http, ParityDecision::Allow), - "SDK={sdk:?} HTTP={http:?} — should both Allow", - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn policy_decision_parity_branch_merge_team_denied() { - // (act-bruno, branch_merge, feature→main) — no rule grants bruno - // branch_merge. Both Deny. - let (_temp, graph, policy) = build_parity_graph().await; - let sdk = sdk_merge_decision(&graph, &policy, "act-bruno").await; - let http = http_merge_decision(&graph, &policy, "act-bruno", "bruno-token").await; - assert!( - matches!(sdk, ParityDecision::Deny) && matches!(http, ParityDecision::Deny), - "SDK={sdk:?} HTTP={http:?} — should both Deny", - ); -} diff --git a/crates/omnigraph-server/tests/boot_settings.rs b/crates/omnigraph-server/tests/boot_settings.rs deleted file mode 100644 index 4ccc8da..0000000 --- a/crates/omnigraph-server/tests/boot_settings.rs +++ /dev/null @@ -1,562 +0,0 @@ -//! 
Server settings loading and mode inference (single vs multi). -//! Moved verbatim from tests/server.rs in the modularization. - -use std::fs; - -use axum::Router; -use axum::body::{Body, to_bytes}; -use axum::http::{Method, Request, StatusCode}; -use omnigraph::db::Omnigraph; -use omnigraph_server::{AppState, build_app}; -use serde_json::Value; -use tower::ServiceExt; - - -mod support; -use support::*; - -mod multi_graph_startup { - use super::*; - use omnigraph::storage::normalize_root_uri; - use omnigraph_server::{GraphHandle, GraphId, GraphKey, GraphRegistry, InsertError}; - use std::sync::Arc; - - async fn build_multi_mode_app(graph_ids: &[&str]) -> (Vec<tempfile::TempDir>, Router) { - let mut dirs = Vec::with_capacity(graph_ids.len()); - let mut handles = Vec::with_capacity(graph_ids.len()); - for id in graph_ids { - let dir = tempfile::tempdir().unwrap(); - let graph_uri = dir.path().join(id).to_str().unwrap().to_string(); - let schema = fs::read_to_string(fixture("test.pg")).unwrap(); - let engine = Omnigraph::init(&graph_uri, &schema).await.unwrap(); - handles.push(Arc::new(GraphHandle { - key: GraphKey::cluster(GraphId::try_from(*id).unwrap()), - uri: graph_uri, - engine: Arc::new(engine), - policy: None, - queries: None, - })); - dirs.push(dir); - } - let workload = omnigraph_server::workload::WorkloadController::from_env(); - let state = AppState::new_multi(handles, Vec::new(), None, workload, None).unwrap(); - let app = build_app(state); - (dirs, app) - } - - /// Cluster route `/graphs/{graph_id}/snapshot` resolves to the right - /// engine. Two graphs side by side; assert each responds to its own - /// id and does NOT respond to the other's URL.
- #[tokio::test(flavor = "multi_thread")] - async fn cluster_routes_dispatch_per_graph_handle() { - let (_dirs, app) = build_multi_mode_app(&["alpha", "beta"]).await; - for id in ["alpha", "beta"] { - let resp = app - .clone() - .oneshot( - Request::builder() - .method(Method::GET) - .uri(format!("/graphs/{id}/snapshot?branch=main")) - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!( - resp.status(), - StatusCode::OK, - "graph '{id}' must respond OK on its cluster snapshot route" - ); - } - } - - /// Unknown graph id under the cluster prefix yields 404 (not 500, - /// not 410 β€” `Gone` is reserved for the future DELETE flow). - #[tokio::test(flavor = "multi_thread")] - async fn cluster_route_for_unknown_graph_returns_404() { - let (_dirs, app) = build_multi_mode_app(&["alpha"]).await; - let resp = app - .oneshot( - Request::builder() - .method(Method::GET) - .uri("/graphs/nonexistent/snapshot?branch=main") - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(resp.status(), StatusCode::NOT_FOUND); - } - - /// Coverage net for cluster-route regressions across every - /// protected handler β€” not just the few that have inner path - /// params. Bug-1 surfaced because only `/snapshot` was being - /// exercised in cluster mode, leaving the other six protected - /// routes implicitly untested. This sweep hits each one and - /// asserts the response shows the handler was reached: no 404 - /// (router didn't match), no 500 with "Wrong number of path - /// arguments" (path extractor broke), no 500 with "missing - /// extension" (routing middleware didn't inject the handle). - /// - /// Status codes are negative assertions because each handler's - /// happy-path inputs differ β€” what matters is "the request - /// reached the handler," not "the handler returned 200." The - /// individual handlers' logic is already tested in single mode. 
- #[tokio::test(flavor = "multi_thread")] - async fn all_protected_cluster_routes_resolve_to_their_handler() { - let (_dirs, app) = build_multi_mode_app(&["alpha"]).await; - - // (method, path, body) β€” one minimal request per protected - // cluster route. Bodies are valid enough that the router and - // extractors succeed; whether the engine ultimately returns - // 200 or 4xx is per-handler and not what this test pins. - let cases: &[(Method, &str, Option<&str>)] = &[ - (Method::GET, "/graphs/alpha/snapshot?branch=main", None), - (Method::GET, "/graphs/alpha/schema", None), - (Method::GET, "/graphs/alpha/branches", None), - (Method::GET, "/graphs/alpha/commits", None), - ( - Method::POST, - "/graphs/alpha/read", - Some(r#"{"query_source":"query q() { return {} }"}"#), - ), - ( - Method::POST, - "/graphs/alpha/change", - Some(r#"{"query_source":"query q() { return {} }"}"#), - ), - ( - Method::POST, - "/graphs/alpha/export", - Some(r#"{"branch":"main"}"#), - ), - ( - Method::POST, - "/graphs/alpha/schema/apply", - Some(r#"{"schema_source":"","allow_data_loss":false}"#), - ), - (Method::POST, "/graphs/alpha/ingest", Some(r#"{"data":""}"#)), - ( - Method::POST, - "/graphs/alpha/branches/merge", - Some(r#"{"source":"main","target":"main"}"#), - ), - ]; - - for (method, path, body) in cases { - let req_body = body - .map(|s| Body::from(s.to_string())) - .unwrap_or_else(Body::empty); - let req = Request::builder() - .method(method.clone()) - .uri(*path) - .header("content-type", "application/json") - .body(req_body) - .unwrap(); - let resp = app.clone().oneshot(req).await.unwrap(); - let status = resp.status(); - let bytes = to_bytes(resp.into_body(), usize::MAX).await.unwrap(); - let body_str = String::from_utf8_lossy(&bytes); - - assert_ne!( - status, - StatusCode::NOT_FOUND, - "{} {} β€” router didn't match (cluster-route mounting regression). 
Body: {}", - method, - path, - body_str, - ); - assert!( - !(status == StatusCode::INTERNAL_SERVER_ERROR - && body_str.contains("Wrong number of path arguments")), - "{} {} — path extractor broke (Bug-1 class regression). Body: {}", - method, - path, - body_str, - ); - assert!( - !(status == StatusCode::INTERNAL_SERVER_ERROR - && body_str.to_lowercase().contains("missing extension")), - "{} {} — routing middleware didn't inject GraphHandle. Body: {}", - method, - path, - body_str, - ); - } - } - - /// Regression for the bot-surfaced path-extractor bug: cluster - /// routes whose inner path also captures a parameter - /// (`/graphs/{graph_id}/branches/{branch}`, - /// `/graphs/{graph_id}/commits/{commit_id}`) must extract the - /// inner param cleanly. Axum 0.8 propagates the outer `{graph_id}` - /// capture into nested handlers, so a `Path` extractor - /// would see two values and fail with "Wrong number of path - /// arguments. Expected 1 but got 2." Today both DELETE branch and - /// GET commit-by-id break in multi-mode because their handlers - /// use bare `Path` — this test pins the fix. - /// - /// The broader `all_protected_cluster_routes_resolve_to_their_handler` - /// test sweeps the full route surface; this one stays narrowly - /// targeted at the inner-path-param shape because that's the - /// specific regression class. 
- #[tokio::test(flavor = "multi_thread")] - async fn cluster_routes_with_inner_path_params_deserialize_correctly() { - let (_dirs, app) = build_multi_mode_app(&["alpha"]).await; - - // Create a branch we can then delete β€” DELETE /graphs/alpha/branches/feature - let create_resp = app - .clone() - .oneshot( - Request::builder() - .method(Method::POST) - .uri("/graphs/alpha/branches") - .header("content-type", "application/json") - .body(Body::from(r#"{"name":"feature"}"#)) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!( - create_resp.status(), - StatusCode::OK, - "branch create on the cluster route must succeed before delete can be tested" - ); - - // DELETE /graphs/{graph_id}/branches/{branch} β€” exercises a handler - // whose only Path extractor (`branch`) is inside a nested route - // that also captures `graph_id`. The handler must pick `branch` - // by name, not by position. - let delete_resp = app - .clone() - .oneshot( - Request::builder() - .method(Method::DELETE) - .uri("/graphs/alpha/branches/feature") - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - let delete_status = delete_resp.status(); - let delete_body = to_bytes(delete_resp.into_body(), usize::MAX).await.unwrap(); - assert_eq!( - delete_status, - StatusCode::OK, - "DELETE /graphs/{{id}}/branches/{{branch}} must extract `branch` cleanly. \ - Body: {}", - String::from_utf8_lossy(&delete_body), - ); - - // GET /graphs/{graph_id}/commits/{commit_id} β€” same shape: the - // handler's only Path extractor is the inner `commit_id`, which - // must deserialize by name even though `graph_id` is also in scope. - // We don't know a real commit_id, but the failure mode under test - // is path extraction, not commit lookup β€” a 404 from the engine - // is fine; a 500 with "Wrong number of path arguments" is the bug. 
- let commit_resp = app - .oneshot( - Request::builder() - .method(Method::GET) - .uri("/graphs/alpha/commits/0000000000000000") - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - let commit_status = commit_resp.status(); - let commit_body = to_bytes(commit_resp.into_body(), usize::MAX).await.unwrap(); - let body_str = String::from_utf8_lossy(&commit_body); - assert!( - commit_status != StatusCode::INTERNAL_SERVER_ERROR - || !body_str.contains("Wrong number of path arguments"), - "GET /graphs/{{id}}/commits/{{commit_id}} must extract `commit_id` cleanly. \ - Got: {} | {}", - commit_status, - body_str, - ); - } - - /// RFC-011 cluster-only: flat per-graph routes never resolve β€” the - /// router only mounts under `/graphs/{graph_id}/...` so a root - /// `/snapshot` returns 404. - #[tokio::test(flavor = "multi_thread")] - async fn flat_routes_404_at_root() { - let (_dirs, app) = build_multi_mode_app(&["alpha"]).await; - let resp = app - .oneshot( - Request::builder() - .method(Method::GET) - .uri("/snapshot?branch=main") - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(resp.status(), StatusCode::NOT_FOUND); - } - - - #[tokio::test(flavor = "multi_thread")] - async fn registry_rejects_duplicate_normalized_graph_uris() { - let dir = tempfile::tempdir().unwrap(); - let graph_uri = dir.path().join("same").to_str().unwrap().to_string(); - let schema = fs::read_to_string(fixture("test.pg")).unwrap(); - let engine = Arc::new(Omnigraph::init(&graph_uri, &schema).await.unwrap()); - - let alpha = Arc::new(GraphHandle { - key: GraphKey::cluster(GraphId::try_from("alpha").unwrap()), - uri: graph_uri.clone(), - engine: Arc::clone(&engine), - policy: None, - queries: None, - }); - let beta = Arc::new(GraphHandle { - key: GraphKey::cluster(GraphId::try_from("beta").unwrap()), - uri: format!("file://{graph_uri}/"), - engine, - policy: None, - queries: None, - }); - - match GraphRegistry::from_handles(vec![alpha, beta]) { - 
Err(InsertError::DuplicateUri(uri)) => { - assert!( - normalize_root_uri(&uri).is_ok(), - "duplicate URI should still be parseable, got {uri}" - ); - } - Err(err) => panic!("expected DuplicateUri for normalized aliases, got {err:?}"), - Ok(_) => panic!("expected DuplicateUri for normalized aliases, got Ok"), - } - } - - #[tokio::test(flavor = "multi_thread")] - async fn registry_stores_canonical_graph_uri() { - let dir = tempfile::tempdir().unwrap(); - let graph_uri = dir.path().join("canonical").to_str().unwrap().to_string(); - let schema = fs::read_to_string(fixture("test.pg")).unwrap(); - let engine = Omnigraph::init(&graph_uri, &schema).await.unwrap(); - let handle = Arc::new(GraphHandle { - key: GraphKey::cluster(GraphId::try_from("alpha").unwrap()), - uri: format!("file://{graph_uri}/"), - engine: Arc::new(engine), - policy: None, - queries: None, - }); - - let registry = GraphRegistry::from_handles(vec![handle]).unwrap(); - let listed = registry.list(); - assert_eq!(listed.len(), 1); - assert_eq!(listed[0].uri, graph_uri); - } - - /// `GET /graphs` must NOT leak the registry in Open mode without - /// an explicit server policy. Operators who pass `--unauthenticated` - /// opted into trusting the network for graph DATA, not for leaking - /// server topology (graph IDs + URIs, which may contain S3 bucket - /// paths or internal hostnames). Cedar gating the management - /// surface is the documented contract for `server_graphs_list` - /// ("don't leak the registry until the operator explicitly - /// authorizes it"); enforcing that contract in every runtime - /// state β€” not just `PolicyEnabled` β€” is the correct-by-design - /// closure of the open-mode hole the bot-review pass surfaced. - /// - /// Today (pre-fix) this returns 200 because `authorize_request`'s - /// no-policy fallback only denies when `actor.is_some()`, so Open - /// mode (`actor: None`) falls through to `Ok(())`. 
The fix in the - /// next commit tightens the fallback so server-scoped actions - /// always require explicit policy. - /// - /// Sort-order coverage previously lived here; it has moved to - /// `get_graphs_with_server_policy_authorizes_per_cedar` where - /// the response body is now non-empty and operator-authorized. - #[tokio::test(flavor = "multi_thread")] - async fn get_graphs_denied_in_open_mode_without_server_policy() { - let (_dirs, app) = build_multi_mode_app(&["beta", "alpha"]).await; - let resp = app - .oneshot( - Request::builder() - .method(Method::GET) - .uri("/graphs") - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - let status = resp.status(); - let body = to_bytes(resp.into_body(), usize::MAX).await.unwrap(); - let body_str = String::from_utf8_lossy(&body); - assert_eq!( - status, - StatusCode::FORBIDDEN, - "GET /graphs must require an explicit server policy in every \ - runtime state; Open-mode bypass would leak server topology. \ - Body: {body_str}", - ); - } - - - /// `GET /graphs` requires bearer auth when tokens are configured. - #[tokio::test(flavor = "multi_thread")] - async fn get_graphs_requires_bearer_auth_when_configured() { - use omnigraph_server::{GraphHandle, GraphId, GraphKey}; - // Build a multi-mode app with bearer tokens configured. 
- let dir = tempfile::tempdir().unwrap(); - let graph_uri = dir.path().join("alpha").to_str().unwrap().to_string(); - let schema = fs::read_to_string(fixture("test.pg")).unwrap(); - let engine = Omnigraph::init(&graph_uri, &schema).await.unwrap(); - let handle = Arc::new(GraphHandle { - key: GraphKey::cluster(GraphId::try_from("alpha").unwrap()), - uri: graph_uri, - engine: Arc::new(engine), - policy: None, - queries: None, - }); - let tokens = vec![("act-andrew".to_string(), "secret-token".to_string())]; - let workload = omnigraph_server::workload::WorkloadController::from_env(); - let state = AppState::new_multi(vec![handle], tokens, None, workload, None).unwrap(); - let app = build_app(state); - - // No Authorization header β†’ 401. - let resp_no_auth = app - .clone() - .oneshot( - Request::builder() - .method(Method::GET) - .uri("/graphs") - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(resp_no_auth.status(), StatusCode::UNAUTHORIZED); - - // With auth but no server policy β†’ 403 (default-deny, since - // GraphList is not Read). - let resp_authed = app - .oneshot( - Request::builder() - .method(Method::GET) - .uri("/graphs") - .header("authorization", "Bearer secret-token") - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(resp_authed.status(), StatusCode::FORBIDDEN); - } - - /// `GET /graphs` with a server policy that allows `graph_list` β†’ 200 - /// and returns the registry sorted alphabetically by `graph_id`. - /// `GET /graphs` with a server policy that does NOT allow - /// `graph_list` (viewer group) β†’ 403. - /// - /// This test owns the alphabetical-sort coverage that previously - /// lived in `get_graphs_lists_registered_graphs_in_multi_mode`. - /// That test now asserts denial in Open mode (server-scoped actions - /// require explicit policy in every runtime state), so the positive - /// body-shape assertions need a home where the response is - /// operator-authorized β€” here. 
- #[tokio::test(flavor = "multi_thread")] - async fn get_graphs_with_server_policy_authorizes_per_cedar() { - use omnigraph_policy::PolicyEngine; - use omnigraph_server::{GraphHandle, GraphId, GraphKey}; - - let dir = tempfile::tempdir().unwrap(); - - // Two graphs deliberately registered in non-alphabetical order - // so the test would fail if the handler relied on insertion - // order instead of server-side sorting. - let schema = fs::read_to_string(fixture("test.pg")).unwrap(); - let mut handles = Vec::new(); - for id in ["beta", "alpha"] { - let graph_uri = dir.path().join(id).to_str().unwrap().to_string(); - let engine = Omnigraph::init(&graph_uri, &schema).await.unwrap(); - handles.push(Arc::new(GraphHandle { - key: GraphKey::cluster(GraphId::try_from(id).unwrap()), - uri: graph_uri, - engine: Arc::new(engine), - policy: None, - queries: None, - })); - } - - // Server policy: admins can graph_list, viewers cannot. - let policy_path = dir.path().join("server-policy.yaml"); - fs::write( - &policy_path, - r#" -version: 1 -groups: - admins: [act-andrew] - viewers: [act-bruno] -rules: - - id: admins-list-graphs - allow: - actors: { group: admins } - actions: [graph_list] -"#, - ) - .unwrap(); - let server_policy = PolicyEngine::load_server(&policy_path).unwrap(); - - let tokens = vec![ - ("act-andrew".to_string(), "andrew-token".to_string()), - ("act-bruno".to_string(), "bruno-token".to_string()), - ]; - let workload = omnigraph_server::workload::WorkloadController::from_env(); - let state = - AppState::new_multi(handles, tokens, Some(server_policy), workload, None).unwrap(); - let app = build_app(state); - - // Admin → 200, body returns both graphs alphabetically sorted. 
- let resp_admin = app - .clone() - .oneshot( - Request::builder() - .method(Method::GET) - .uri("/graphs") - .header("authorization", "Bearer andrew-token") - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!( - resp_admin.status(), - StatusCode::OK, - "admin must be allowed graph_list" - ); - let body = to_bytes(resp_admin.into_body(), usize::MAX).await.unwrap(); - let json: Value = serde_json::from_slice(&body).unwrap(); - let graphs = json["graphs"].as_array().unwrap(); - assert_eq!(graphs.len(), 2, "response must list both registered graphs"); - assert_eq!( - graphs[0]["graph_id"].as_str().unwrap(), - "alpha", - "server must sort graphs alphabetically by graph_id (insertion order was 'beta', 'alpha')" - ); - assert_eq!(graphs[1]["graph_id"].as_str().unwrap(), "beta"); - - // Viewer → 403 - let resp_viewer = app - .oneshot( - Request::builder() - .method(Method::GET) - .uri("/graphs") - .header("authorization", "Bearer bruno-token") - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!( - resp_viewer.status(), - StatusCode::FORBIDDEN, - "viewer must be denied graph_list (Cedar gate)" - ); - } - -} diff --git a/crates/omnigraph-server/tests/data_routes.rs b/crates/omnigraph-server/tests/data_routes.rs deleted file mode 100644 index 65af2c6..0000000 --- a/crates/omnigraph-server/tests/data_routes.rs +++ /dev/null @@ -1,1649 +0,0 @@ -//! Data-plane routes: read/query/change/ingest/branches/snapshot/export. -//! Moved verbatim from tests/server.rs in the modularization. 
- -use std::fs; -use std::sync::Arc; - -use axum::body::{Body, to_bytes}; -use axum::http::{Method, Request, StatusCode}; -use omnigraph::db::{Omnigraph, ReadTarget}; -use omnigraph::loader::LoadMode; -use omnigraph_server::api::{ - BranchCreateRequest, BranchMergeRequest, ChangeRequest, ErrorOutput, ExportRequest, - IngestRequest, QueryRequest, ReadRequest, -}; -use omnigraph_server::{AppState, build_app}; -use serde_json::{Value, json}; -use serial_test::serial; -use tower::ServiceExt; - - -mod support; -use support::*; - -#[tokio::test(flavor = "multi_thread")] -async fn export_route_returns_jsonl_for_branch_snapshot() { - let token = "demo-token"; - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.branch_create_from(ReadTarget::branch("main"), "feature") - .await - .unwrap(); - db.load( - "feature", - r#"{"type":"Person","data":{"name":"Eve","age":29}}"#, - LoadMode::Append, - ) - .await - .unwrap(); - let expected = db - .export_jsonl("feature", &["Person".to_string()], &[]) - .await - .unwrap(); - drop(db); - - // MR-723: tokens-without-policy is now default-deny. Install a - // permit-all policy alongside the bearer token so /export - // (action=Export) passes Cedar evaluation. The test is exercising - // export semantics, not policy β€” the policy is just enough to clear - // the State 3 path. 
- let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, permit_all_policy_yaml(&["default"])).unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![("default".to_string(), token.to_string())], - Some(&policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - - let response = app - .clone() - .oneshot( - Request::builder() - .uri(g("/export")) - .method(Method::POST) - .header("content-type", "application/json") - .header("authorization", format!("Bearer {}", token)) - .body(Body::from( - serde_json::to_vec(&ExportRequest { - branch: Some("feature".to_string()), - type_names: vec!["Person".to_string()], - table_keys: Vec::new(), - }) - .unwrap(), - )) - .unwrap(), - ) - .await - .unwrap(); - - assert_eq!(response.status(), StatusCode::OK); - assert_eq!( - response.headers().get("content-type").unwrap(), - "application/x-ndjson; charset=utf-8" - ); - let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); - let text = String::from_utf8(body.to_vec()).unwrap(); - assert_eq!(text, expected); -} - -#[tokio::test(flavor = "multi_thread")] -async fn snapshot_route_returns_manifest_dataset_version() { - let (temp, app) = app_for_loaded_graph().await; - let graph = graph_path(temp.path()); - let expected_manifest_version = manifest_dataset_version(&graph).await; - - let (snapshot_status, snapshot_body) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - - assert_eq!(snapshot_status, StatusCode::OK); - assert_eq!(snapshot_body["branch"], "main"); - assert_eq!( - snapshot_body["manifest_version"].as_u64().unwrap(), - expected_manifest_version - ); - assert!(snapshot_body["tables"].is_array()); -} - -#[tokio::test(flavor = "multi_thread")] -async fn ingest_creates_branch_returns_metadata_and_stamps_actor() { - let (temp, app) = 
app_for_loaded_graph_with_auth_tokens(&[("act-andrew", "token-one")]).await; - let graph = graph_path(temp.path()); - let ingest = IngestRequest { - branch: Some("feature-ingest".to_string()), - from: Some("main".to_string()), - mode: Some(LoadMode::Merge), - data: r#"{"type":"Person","data":{"name":"Zoe","age":33}} -{"type":"Person","data":{"name":"Bob","age":26}}"# - .to_string(), - }; - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/ingest")) - .method(Method::POST) - .header("authorization", "Bearer token-one") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&ingest).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(body["branch"], "feature-ingest"); - assert_eq!(body["base_branch"], "main"); - assert_eq!(body["branch_created"], true); - assert_eq!(body["mode"], "merge"); - assert_eq!(body["actor_id"], "act-andrew"); - assert_eq!(body["tables"][0]["table_key"], "node:Person"); - assert_eq!(body["tables"][0]["rows_loaded"], 2); - - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - let snapshot = db - .snapshot_of(ReadTarget::branch("feature-ingest")) - .await - .unwrap(); - let person_ds = snapshot.open("node:Person").await.unwrap(); - assert_eq!(person_ds.count_rows(None).await.unwrap(), 5); - let head = db - .list_commits(Some("feature-ingest")) - .await - .unwrap() - .into_iter() - .last() - .unwrap(); - assert_eq!(head.actor_id.as_deref(), Some("act-andrew")); -} - -#[tokio::test(flavor = "multi_thread")] -async fn ingest_existing_branch_skips_branch_create_policy_check() { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - { - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.branch_create_from(ReadTarget::branch("main"), "feature") - .await - .unwrap(); - } - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, POLICY_YAML).unwrap(); - let state = 
AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![("act-bruno".to_string(), "team-token".to_string())], - Some(&policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - let ingest = IngestRequest { - branch: Some("feature".to_string()), - from: Some("other-base".to_string()), - mode: Some(LoadMode::Merge), - data: r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#.to_string(), - }; - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/ingest")) - .method(Method::POST) - .header("authorization", "Bearer team-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&ingest).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(body["branch"], "feature"); - assert_eq!(body["branch_created"], false); - assert_eq!(body["base_branch"], "other-base"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn ingest_without_from_returns_404_for_missing_branch_and_creates_nothing() { - let (temp, app) = app_for_loaded_graph().await; - let graph = graph_path(temp.path()); - let ingest = IngestRequest { - branch: Some("feature-typo".to_string()), - from: None, - mode: Some(LoadMode::Merge), - data: r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#.to_string(), - }; - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/ingest")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&ingest).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::NOT_FOUND); - let error: ErrorOutput = serde_json::from_value(body).unwrap(); - assert_eq!(error.code, Some(omnigraph_server::api::ErrorCode::NotFound)); - - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - assert!( - !db.branch_list() - .await - .unwrap() - .contains(&"feature-typo".to_string()), - "a 404'd ingest must not create the branch" - ); -} - 
-#[tokio::test(flavor = "multi_thread")] -async fn ingest_without_from_loads_into_existing_branch() { - let (temp, app) = app_for_loaded_graph().await; - let graph = graph_path(temp.path()); - { - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.branch_create_from(ReadTarget::branch("main"), "feature") - .await - .unwrap(); - } - let ingest = IngestRequest { - branch: Some("feature".to_string()), - from: None, - mode: Some(LoadMode::Merge), - data: r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#.to_string(), - }; - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/ingest")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&ingest).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(body["branch"], "feature"); - assert_eq!(body["branch_created"], false); - assert_eq!(body["base_branch"], serde_json::Value::Null); -} - -#[tokio::test(flavor = "multi_thread")] -async fn ingest_denies_missing_branch_without_branch_create_permission() { - let (_temp, app) = app_for_loaded_graph_with_auth_tokens_and_policy( - &[("act-bruno", "team-token")], - POLICY_YAML, - ) - .await; - let ingest = IngestRequest { - branch: Some("feature".to_string()), - from: Some("main".to_string()), - mode: Some(LoadMode::Merge), - data: r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#.to_string(), - }; - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/ingest")) - .method(Method::POST) - .header("authorization", "Bearer team-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&ingest).unwrap())) - .unwrap(), - ) - .await; - let error: ErrorOutput = serde_json::from_value(body).unwrap(); - assert_eq!(status, StatusCode::FORBIDDEN); - assert_eq!( - error.code, - Some(omnigraph_server::api::ErrorCode::Forbidden) - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn 
ingest_denies_when_actor_lacks_change_permission() { - let (_temp, app) = app_for_loaded_graph_with_auth_tokens_and_policy( - &[("act-bruno", "team-token")], - INGEST_CREATE_ONLY_POLICY_YAML, - ) - .await; - let ingest = IngestRequest { - branch: Some("feature".to_string()), - from: Some("main".to_string()), - mode: Some(LoadMode::Merge), - data: r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#.to_string(), - }; - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/ingest")) - .method(Method::POST) - .header("authorization", "Bearer team-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&ingest).unwrap())) - .unwrap(), - ) - .await; - let error: ErrorOutput = serde_json::from_value(body).unwrap(); - assert_eq!(status, StatusCode::FORBIDDEN); - assert_eq!( - error.code, - Some(omnigraph_server::api::ErrorCode::Forbidden) - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn ingest_rejects_payloads_over_32_mib() { - let (_temp, app) = app_for_loaded_graph().await; - let oversize = IngestRequest { - branch: Some("feature".to_string()), - from: Some("main".to_string()), - mode: Some(LoadMode::Merge), - data: "x".repeat(33 * 1024 * 1024), - }; - - let response = app - .clone() - .oneshot( - Request::builder() - .uri(g("/ingest")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&oversize).unwrap())) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(response.status(), StatusCode::PAYLOAD_TOO_LARGE); -} - -#[tokio::test(flavor = "multi_thread")] -async fn branch_merge_conflict_response_includes_structured_conflicts() { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.branch_create_from(ReadTarget::branch("main"), "feature") - .await - .unwrap(); - db.mutate( - "main", - MUTATION_QUERIES, - "set_age", - 
&omnigraph_compiler::json_params_to_param_map( - Some(&json!({"name": "Alice", "age": 31 })), - &omnigraph_compiler::find_named_query(MUTATION_QUERIES, "set_age") - .unwrap() - .params, - omnigraph_compiler::JsonParamMode::Standard, - ) - .unwrap(), - ) - .await - .unwrap(); - db.mutate( - "feature", - MUTATION_QUERIES, - "set_age", - &omnigraph_compiler::json_params_to_param_map( - Some(&json!({"name": "Alice", "age": 32 })), - &omnigraph_compiler::find_named_query(MUTATION_QUERIES, "set_age") - .unwrap() - .params, - omnigraph_compiler::JsonParamMode::Standard, - ) - .unwrap(), - ) - .await - .unwrap(); - drop(db); - - let state = AppState::open(graph.to_string_lossy().to_string()) - .await - .unwrap(); - let app = build_app(state); - let merge = BranchMergeRequest { - source: "feature".to_string(), - target: Some("main".to_string()), - }; - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/branches/merge")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&merge).unwrap())) - .unwrap(), - ) - .await; - - let error: ErrorOutput = serde_json::from_value(body).unwrap(); - assert_eq!(status, StatusCode::CONFLICT); - assert_eq!(error.code, Some(omnigraph_server::api::ErrorCode::Conflict)); - assert!(error.error.contains("merge conflict")); - assert!(error.merge_conflicts.iter().any(|conflict| { - conflict.table_key == "node:Person" - && conflict.row_id.as_deref() == Some("Alice") - && conflict.kind == omnigraph_server::api::MergeConflictKindOutput::DivergentUpdate - })); -} - -#[tokio::test(flavor = "multi_thread")] -async fn repeated_read_after_change_sees_updated_state_from_same_app() { - let (_temp, app) = app_for_loaded_graph().await; - - let change = ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": "Mina", "age": 28 })), - branch: Some("main".to_string()), - }; - let (change_status, change_body) = 
json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&change).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(change_status, StatusCode::OK); - assert_eq!(change_body["affected_nodes"], 1); - - let read = ReadRequest { - query_source: fs::read_to_string(fixture("test.gq")).unwrap(), - query_name: Some("get_person".to_string()), - params: Some(json!({ "name": "Mina" })), - branch: Some("main".to_string()), - snapshot: None, - }; - let (read_status, read_body) = json_response( - &app, - Request::builder() - .uri(g("/read")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&read).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(read_status, StatusCode::OK); - assert_eq!(read_body["row_count"], 1); - assert_eq!(read_body["rows"][0]["p.name"], "Mina"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn query_endpoint_runs_inline_read() { - let (_temp, app) = app_for_loaded_graph().await; - - let query = QueryRequest { - query: fs::read_to_string(fixture("test.gq")).unwrap(), - name: Some("get_person".to_string()), - params: Some(json!({ "name": "Alice" })), - branch: Some("main".to_string()), - snapshot: None, - }; - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/query")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&query).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(body["query_name"], "get_person"); - assert_eq!(body["row_count"], 1); - assert_eq!(body["rows"][0]["p.name"], "Alice"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn query_endpoint_rejects_mutation_with_400() { - let (_temp, app) = app_for_loaded_graph().await; - - let query = QueryRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - 
params: Some(json!({ "name": "Should", "age": 1 })), - branch: Some("main".to_string()), - snapshot: None, - }; - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/query")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&query).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::BAD_REQUEST); - let err = body["error"].as_str().unwrap_or_default(); - assert!( - err.contains("contains mutations") && err.contains("POST /mutate"), - "expected mutation-rejection message pointing at canonical /mutate, got: {err}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn mutate_endpoint_runs_inline_mutation() { - // Canonical mutation endpoint. Pairs with `/query` on the read side. - // Same wire shape as `/change`, no deprecation signal. - let (_temp, app) = app_for_loaded_graph().await; - - let request = json!({ - "query": MUTATION_QUERIES, - "name": "insert_person", - "params": { "name": "Mutie", "age": 30 }, - "branch": "main", - }); - let response = app - .clone() - .oneshot( - Request::builder() - .uri(g("/mutate")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&request).unwrap())) - .unwrap(), - ) - .await - .unwrap(); - - assert_eq!(response.status(), StatusCode::OK); - // Canonical route is NOT deprecated; no Deprecation header expected. 
- assert!( - response.headers().get("deprecation").is_none(), - "POST /mutate must not advertise itself as deprecated" - ); - let body_bytes = to_bytes(response.into_body(), usize::MAX).await.unwrap(); - let body: Value = serde_json::from_slice(&body_bytes).unwrap(); - assert_eq!(body["affected_nodes"], 1); - assert_eq!(body["query_name"], "insert_person"); - assert_eq!(body["branch"], "main"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn change_endpoint_emits_deprecation_headers() { - // `/change` is kept indefinitely for back-compat but flagged at runtime - // per RFC 9745 (`Deprecation: true`) + RFC 8288 (`Link: ; - // rel="successor-version"`). The OpenAPI side is covered by - // `openapi_change_is_deprecated` in tests/openapi.rs. - let (_temp, app) = app_for_loaded_graph().await; - - let request = json!({ - "query": MUTATION_QUERIES, - "name": "insert_person", - "params": { "name": "Legacyer", "age": 33 }, - "branch": "main", - }); - let response = app - .clone() - .oneshot( - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&request).unwrap())) - .unwrap(), - ) - .await - .unwrap(); - - assert_eq!(response.status(), StatusCode::OK); - assert_eq!( - response - .headers() - .get("deprecation") - .and_then(|v| v.to_str().ok()), - Some("true"), - "POST /change must advertise `Deprecation: true` (RFC 9745)" - ); - assert_eq!( - response.headers().get("link").and_then(|v| v.to_str().ok()), - Some("; rel=\"successor-version\""), - "POST /change must point at /mutate via `Link` rel=successor-version (RFC 8288)" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn load_endpoint_loads_into_existing_branch() { - // Canonical bulk-load endpoint (RFC-009 Phase 5). Same wire shape as - // /ingest, no deprecation signal. 
- let (_temp, app) = app_for_loaded_graph().await; - let request = IngestRequest { - branch: Some("main".to_string()), - from: None, - mode: Some(LoadMode::Merge), - data: r#"{"type":"Person","data":{"name":"Loaded","age":7}}"#.to_string(), - }; - let response = app - .clone() - .oneshot( - Request::builder() - .uri(g("/load")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&request).unwrap())) - .unwrap(), - ) - .await - .unwrap(); - - assert_eq!(response.status(), StatusCode::OK); - assert!( - response.headers().get("deprecation").is_none(), - "POST /load must not advertise itself as deprecated" - ); - let body_bytes = to_bytes(response.into_body(), usize::MAX).await.unwrap(); - let body: Value = serde_json::from_slice(&body_bytes).unwrap(); - assert_eq!(body["branch"], "main"); - assert_eq!(body["tables"][0]["table_key"], "node:Person"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn ingest_endpoint_emits_deprecation_headers() { - // `/ingest` is the deprecated alias of `/load` (RFC-009 Phase 5): flagged - // at runtime per RFC 9745 (`Deprecation: true`) + RFC 8288 (`Link: ; - // rel="successor-version"`). The OpenAPI side is covered by - // `openapi_ingest_is_deprecated` in tests/openapi.rs. 
- let (_temp, app) = app_for_loaded_graph().await; - let request = IngestRequest { - branch: Some("main".to_string()), - from: None, - mode: Some(LoadMode::Merge), - data: r#"{"type":"Person","data":{"name":"Legacyer","age":33}}"#.to_string(), - }; - let response = app - .clone() - .oneshot( - Request::builder() - .uri(g("/ingest")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&request).unwrap())) - .unwrap(), - ) - .await - .unwrap(); - - assert_eq!(response.status(), StatusCode::OK); - assert_eq!( - response - .headers() - .get("deprecation") - .and_then(|v| v.to_str().ok()), - Some("true"), - "POST /ingest must advertise `Deprecation: true` (RFC 9745)" - ); - assert_eq!( - response.headers().get("link").and_then(|v| v.to_str().ok()), - Some("; rel=\"successor-version\""), - "POST /ingest must point at /load via `Link` rel=successor-version (RFC 8288)" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn read_endpoint_emits_deprecation_headers() { - // `/read` is kept indefinitely for byte-stable back-compat but flagged - // at runtime per RFC 9745 + RFC 8288. Successor is `/query`. 
- let (_temp, app) = app_for_loaded_graph().await; - - let request = ReadRequest { - query_source: fs::read_to_string(fixture("test.gq")).unwrap(), - query_name: Some("get_person".to_string()), - params: Some(json!({ "name": "Alice" })), - branch: Some("main".to_string()), - snapshot: None, - }; - let response = app - .clone() - .oneshot( - Request::builder() - .uri(g("/read")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&request).unwrap())) - .unwrap(), - ) - .await - .unwrap(); - - assert_eq!(response.status(), StatusCode::OK); - assert_eq!( - response - .headers() - .get("deprecation") - .and_then(|v| v.to_str().ok()), - Some("true"), - "POST /read must advertise `Deprecation: true` (RFC 9745)" - ); - assert_eq!( - response.headers().get("link").and_then(|v| v.to_str().ok()), - Some("; rel=\"successor-version\""), - "POST /read must point at /query via `Link` rel=successor-version (RFC 8288)" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn query_endpoint_does_not_emit_deprecation_headers() { - // Sanity check the inverse: the canonical `/query` endpoint must not - // carry deprecation signaling, so SDK codegens don't propagate a - // bogus `@deprecated` marker. 
- let (_temp, app) = app_for_loaded_graph().await; - - let request = QueryRequest { - query: fs::read_to_string(fixture("test.gq")).unwrap(), - name: Some("get_person".to_string()), - params: Some(json!({ "name": "Alice" })), - branch: Some("main".to_string()), - snapshot: None, - }; - let response = app - .clone() - .oneshot( - Request::builder() - .uri(g("/query")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&request).unwrap())) - .unwrap(), - ) - .await - .unwrap(); - - assert_eq!(response.status(), StatusCode::OK); - assert!( - response.headers().get("deprecation").is_none(), - "POST /query is canonical and must not advertise itself as deprecated" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn change_endpoint_accepts_legacy_field_names() { - // The canonical wire field names on /change are `query` and `name`, but - // serde aliases keep the legacy `query_source`/`query_name` payload - // shape working for clients that haven't migrated yet. Pin both shapes. 
- let (_temp, app) = app_for_loaded_graph().await; - - let legacy_body = json!({ - "query_source": MUTATION_QUERIES, - "query_name": "insert_person", - "params": { "name": "Legacy", "age": 21 }, - "branch": "main", - }); - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&legacy_body).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(body["affected_nodes"], 1); - - let canonical_body = json!({ - "query": MUTATION_QUERIES, - "name": "insert_person", - "params": { "name": "Canonical", "age": 22 }, - "branch": "main", - }); - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&canonical_body).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(body["affected_nodes"], 1); -} - -#[tokio::test(flavor = "multi_thread")] -async fn remote_branch_list_create_merge_flow_works() { - let (_temp, app) = app_for_loaded_graph().await; - - let (list_status, list_body) = json_response( - &app, - Request::builder() - .uri(g("/branches")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(list_status, StatusCode::OK); - assert_eq!(list_body["branches"], json!(["main"])); - - let create = BranchCreateRequest { - from: Some("main".to_string()), - name: "feature".to_string(), - }; - let (create_status, create_body) = json_response( - &app, - Request::builder() - .uri(g("/branches")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&create).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(create_status, StatusCode::OK); - assert_eq!(create_body["from"], "main"); - assert_eq!(create_body["name"], "feature"); - - let (list_status, 
list_body) = json_response( - &app, - Request::builder() - .uri(g("/branches")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(list_status, StatusCode::OK); - assert_eq!(list_body["branches"], json!(["feature", "main"])); - - let change = ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": "Zoe", "age": 33 })), - branch: Some("feature".to_string()), - }; - let (change_status, change_body) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&change).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(change_status, StatusCode::OK); - assert_eq!(change_body["branch"], "feature"); - assert_eq!(change_body["affected_nodes"], 1); - - let read_main_before = ReadRequest { - query_source: fs::read_to_string(fixture("test.gq")).unwrap(), - query_name: Some("get_person".to_string()), - params: Some(json!({ "name": "Zoe" })), - branch: Some("main".to_string()), - snapshot: None, - }; - let (read_status, read_body) = json_response( - &app, - Request::builder() - .uri(g("/read")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&read_main_before).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(read_status, StatusCode::OK); - assert_eq!(read_body["row_count"], 0); - - let merge = BranchMergeRequest { - source: "feature".to_string(), - target: Some("main".to_string()), - }; - let (merge_status, merge_body) = json_response( - &app, - Request::builder() - .uri(g("/branches/merge")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&merge).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(merge_status, StatusCode::OK); - assert_eq!(merge_body["source"], "feature"); - assert_eq!(merge_body["target"], "main"); - 
assert_eq!(merge_body["outcome"], "fast_forward"); - - let read_main_after = ReadRequest { - query_source: fs::read_to_string(fixture("test.gq")).unwrap(), - query_name: Some("get_person".to_string()), - params: Some(json!({ "name": "Zoe" })), - branch: Some("main".to_string()), - snapshot: None, - }; - let (read_status, read_body) = json_response( - &app, - Request::builder() - .uri(g("/read")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&read_main_after).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(read_status, StatusCode::OK); - assert_eq!(read_body["row_count"], 1); - assert_eq!(read_body["rows"][0]["p.name"], "Zoe"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn remote_branch_delete_flow_works() { - let (_temp, app) = app_for_loaded_graph().await; - - let create = BranchCreateRequest { - from: Some("main".to_string()), - name: "feature".to_string(), - }; - let (create_status, _) = json_response( - &app, - Request::builder() - .uri(g("/branches")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&create).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(create_status, StatusCode::OK); - - let (delete_status, delete_body) = json_response( - &app, - Request::builder() - .uri(g("/branches/feature")) - .method(Method::DELETE) - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(delete_status, StatusCode::OK); - assert_eq!(delete_body["name"], "feature"); - - let (list_status, list_body) = json_response( - &app, - Request::builder() - .uri(g("/branches")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(list_status, StatusCode::OK); - assert_eq!(list_body["branches"], json!(["main"])); -} - -#[tokio::test(flavor = "multi_thread")] -async fn branch_delete_denies_without_policy_permission() { - let (temp, app) = app_for_loaded_graph_with_auth_tokens_and_policy( - &[("act-andrew", 
"token-admin"), ("act-bruno", "token-team")], - POLICY_YAML, - ) - .await; - let graph = graph_path(temp.path()); - - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.branch_create_from(ReadTarget::branch("main"), "feature") - .await - .unwrap(); - drop(db); - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/branches/feature")) - .method(Method::DELETE) - .header("authorization", "Bearer token-team") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::FORBIDDEN); - assert!( - body["error"] - .as_str() - .unwrap() - .contains("policy denied action 'branch_delete'") - ); -} - -#[tokio::test(flavor = "multi_thread")] -#[serial] -async fn remote_read_embeds_string_nearest_queries_with_mock_runtime() { - const EMBED_SCHEMA: &str = r#" -node Doc { - slug: String @key - title: String @index - embedding: Vector(4) @index -} -"#; - const EMBED_QUERY: &str = r#" -query vector_search_string($q: String) { - match { $d: Doc } - return { $d.slug, $d.title } - order { nearest($d.embedding, $q) } - limit 3 -} -"#; - - let alpha = mock_embedding("alpha", 4); - let beta = mock_embedding("beta", 4); - let gamma = mock_embedding("gamma", 4); - let data = format!( - concat!( - r#"{{"type":"Doc","data":{{"slug":"alpha-doc","title":"alpha guide","embedding":[{}]}}}}"#, - "\n", - r#"{{"type":"Doc","data":{{"slug":"beta-doc","title":"beta guide","embedding":[{}]}}}}"#, - "\n", - r#"{{"type":"Doc","data":{{"slug":"gamma-doc","title":"gamma handbook","embedding":[{}]}}}}"# - ), - format_vector(&alpha), - format_vector(&beta), - format_vector(&gamma), - ); - - let _guard = EnvGuard::set(&[ - ("OMNIGRAPH_EMBEDDINGS_MOCK", Some("1")), - ("GEMINI_API_KEY", None), - ]); - let temp = init_graph_with_schema_and_data(EMBED_SCHEMA, &data).await; - let graph = graph_path(temp.path()); - let state = AppState::open(graph.to_string_lossy().to_string()) - .await - .unwrap(); - let app = build_app(state); - - let read = 
ReadRequest { - query_source: EMBED_QUERY.to_string(), - query_name: Some("vector_search_string".to_string()), - params: Some(json!({ "q": "alpha" })), - branch: Some("main".to_string()), - snapshot: None, - }; - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/read")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&read).unwrap())) - .unwrap(), - ) - .await; - - assert_eq!(status, StatusCode::OK); - assert_eq!(body["row_count"], 3); - assert_eq!(body["rows"][0]["d.slug"], "alpha-doc"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn change_conflict_returns_manifest_conflict_409() { - // A write that races with another writer surfaces as HTTP 409 with - // a structured `manifest_conflict` body (`table_key`, `expected`, - // and `actual`) so clients can detect-and-retry without parsing - // the message. - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - - // Build the server first so its handle pins the pre-mutation manifest - // version. Then advance the manifest from outside the server. The - // server's next /change call will capture stale `expected_versions` - // (from its still-pinned snapshot) and the publisher's CAS rejects. 
- let state = AppState::open(graph.to_string_lossy().to_string()) - .await - .unwrap(); - let app = build_app(state); - - { - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.mutate( - "main", - MUTATION_QUERIES, - "set_age", - &omnigraph_compiler::json_params_to_param_map( - Some(&json!({"name": "Alice", "age": 31 })), - &omnigraph_compiler::find_named_query(MUTATION_QUERIES, "set_age") - .unwrap() - .params, - omnigraph_compiler::JsonParamMode::Standard, - ) - .unwrap(), - ) - .await - .unwrap(); - } - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from( - serde_json::to_vec(&ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("set_age".to_string()), - params: Some(json!({ "name": "Alice", "age": 33 })), - branch: Some("main".to_string()), - }) - .unwrap(), - )) - .unwrap(), - ) - .await; - - assert_eq!(status, StatusCode::CONFLICT); - let error: ErrorOutput = serde_json::from_value(body).unwrap(); - assert_eq!(error.code, Some(omnigraph_server::api::ErrorCode::Conflict)); - let conflict = error - .manifest_conflict - .expect("publisher CAS rejection must populate manifest_conflict body"); - assert_eq!(conflict.table_key, "node:Person"); - assert!( - conflict.actual > conflict.expected, - "actual ({}) should be ahead of expected ({})", - conflict.actual, - conflict.expected, - ); -} - -#[tokio::test(flavor = "multi_thread", worker_threads = 4)] -async fn change_concurrent_inserts_same_key_serialize_without_409() { - // PR 2 Phase 2 (MR-686): pin the design fix for the same-key - // concurrency hazard. Pre-fix, in-process concurrent inserts on - // the same `(table, branch)` rejected with 409 manifest_conflict - // because `ensure_expected_version` fired before the per-table - // queue was acquired and saw Lance HEAD already advanced by a - // peer writer. 
Post-fix, Insert/Merge skip the strict pre-stage - // check (see `MutationOpKind::strict_pre_stage_version_check`); - // the queue serializes commit_staged; Lance's natural rebase - // handles the in-flight stage; the publisher's CAS on a fresh - // per-branch snapshot under the queue catches genuine cross- - // process drift. - // - // This test spawns N concurrent /change inserts on a single - // node type and asserts: every request returns 200 (no 409), - // and the final row count equals the seed count + N (every - // staged batch actually committed). - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let state = AppState::open(graph.to_string_lossy().to_string()) - .await - .unwrap(); - let app = build_app(state); - - // test.jsonl seeds 4 Persons (Alice, Bob, Charlie, Diana). - const SEED_PERSON_ROWS: u64 = 4; - const N: usize = 12; - - let mut handles = Vec::with_capacity(N); - for i in 0..N { - let app = app.clone(); - handles.push(tokio::spawn(async move { - let body = serde_json::to_vec(&ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": format!("racer-{i}"), "age": i as i32 })), - branch: Some("main".to_string()), - }) - .unwrap(); - let req = Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(); - let response = app.oneshot(req).await.unwrap(); - response.status() - })); - } - - let mut statuses = Vec::with_capacity(N); - for h in handles { - statuses.push(h.await.unwrap()); - } - - let bad: Vec<_> = statuses - .iter() - .enumerate() - .filter(|(_, s)| **s != StatusCode::OK) - .collect(); - assert!( - bad.is_empty(), - "expected every concurrent insert to return 200, got non-200 for: {:?}", - bad - ); - - // Verify the inserts actually landed. 
The status check above only proves - // the publisher CAS didn't reject; the row count proves none of the - // concurrent commits silently overwrote a peer. - let (snapshot_status, snapshot_body) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(snapshot_status, StatusCode::OK); - let person_rows = snapshot_body["tables"] - .as_array() - .and_then(|tables| { - tables - .iter() - .find(|t| t["table_key"].as_str() == Some("node:Person")) - }) - .and_then(|t| t["row_count"].as_u64()) - .expect("snapshot must include node:Person row_count"); - assert_eq!( - person_rows, - SEED_PERSON_ROWS + N as u64, - "expected {} seeded + {} concurrent inserts = {} Person rows; got {}", - SEED_PERSON_ROWS, - N, - SEED_PERSON_ROWS + N as u64, - person_rows, - ); -} - -#[tokio::test(flavor = "multi_thread", worker_threads = 4)] -async fn change_concurrent_updates_same_key_serialize_via_publisher_cas() { - // Pin Update RYW semantics under in-process concurrency on the same - // `(table, branch)`. With per-table queue serialization and op-kind-aware - // drift detection at commit time, exactly one of N concurrent UPDATEs - // on the same row commits; the rest are rejected as 409 manifest_conflict. - // - // Pre-fix bug class: in `MutationStaging::commit_all`, after queue - // acquisition, the staged Lance transaction is handed straight to - // `commit_staged`. For a writer whose staged dataset is at V0 but - // Lance HEAD has advanced to V1 (because the queue's prior winner - // already published), Lance's transaction conflict resolver fires - // `RetryableCommitConflict` on Update vs Update on the same row. - // That error gets wrapped as `OmniError::Lance()` and the - // API surfaces it as **500 internal**, not 409. Users see "internal - // server error" instead of a retryable conflict, breaking the - // documented 409 contract for in-process drift. 
- // - // Post-fix invariant: `commit_all` does an op-kind-aware drift check - // before each `commit_staged`. For tables whose tracked op_kind has - // `strict_pre_stage_version_check() == true` (Update / Delete / - // SchemaRewrite), if the staged dataset's version doesn't match the - // fresh manifest pin, return `OmniError::manifest_expected_version_mismatch` - // → 409 ExpectedVersionMismatch. The N-1 losers see a clean 409 - // before Lance's commit_staged ever runs. - // - // Why correct-by-design: closing the class "Lance internal conflict - // surfaces as 500 instead of 409" rather than mapping the specific - // Lance error variant. The drift check fires at the right architectural - // layer (engine boundary, under the queue) and respects the existing - // `MutationOpKind` policy. - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let state = AppState::open(graph.to_string_lossy().to_string()) - .await - .unwrap(); - let app = build_app(state); - - // Spawn N=8 concurrent UPDATEs on Alice (from test.jsonl, age=30 at V0) - // writing distinct ages. 
- const N: usize = 8; - let mut handles = Vec::with_capacity(N); - for i in 0..N { - let app = app.clone(); - let target_age = 100 + i as i32; - handles.push(tokio::spawn(async move { - let body = serde_json::to_vec(&ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("set_age".to_string()), - params: Some(json!({ "name": "Alice", "age": target_age })), - branch: Some("main".to_string()), - }) - .unwrap(); - let req = Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(); - let response = app.oneshot(req).await.unwrap(); - let status = response.status(); - let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); - (status, body.to_vec()) - })); - } - - let mut results = Vec::with_capacity(N); - for h in handles { - results.push(h.await.unwrap()); - } - let statuses: Vec<StatusCode> = results.iter().map(|(s, _)| *s).collect(); - - let ok_count = statuses.iter().filter(|s| **s == StatusCode::OK).count(); - let conflict_count = statuses - .iter() - .filter(|s| **s == StatusCode::CONFLICT) - .count(); - let other: Vec<_> = statuses - .iter() - .enumerate() - .filter(|(_, s)| **s != StatusCode::OK && **s != StatusCode::CONFLICT) - .collect(); - - let other_bodies: Vec<(usize, StatusCode, String)> = other - .iter() - .map(|(i, s)| { - let body_str = String::from_utf8_lossy(&results[*i].1).to_string(); - (*i, **s, body_str) - }) - .collect(); - assert!( - other.is_empty(), - "expected only 200 or 409 statuses, got non-200/409 entries: {:?}", - other_bodies - ); - assert_eq!( - ok_count + conflict_count, - N, - "all responses must be 200 or 409 to satisfy the RYW invariant; statuses: {:?}", - statuses - ); - assert_eq!( - ok_count, - 1, - "expected exactly one update to commit and N-1 to receive 409 manifest_conflict \ - (op-kind-aware drift check rejects stale-V0 staged datasets at commit_all entry). \ - Got {} OK + {} 409 + {} other. 
\ - Pre-fix symptom: 1 OK + (N-1) x 500 because Lance's RetryableCommitConflict for \ - Update vs Update on the same row bubbles up as `OmniError::Lance()` and \ - the API maps it to 500 internal, not 409. Statuses: {:?}", - ok_count, - conflict_count, - statuses.len() - ok_count - conflict_count, - statuses, - ); -} - -#[tokio::test(flavor = "multi_thread", worker_threads = 4)] -async fn change_disjoint_table_concurrency_succeeds_at_http_level() { - // HTTP-level pin for MR-686's disjoint-table promise: concurrent /change - // requests touching different node types must coexist without admission - // rejection or publisher-CAS conflict. The bench harness measures - // throughput; this test is the regression sentinel that catches a - // future change which accidentally re-introduces graph-wide - // serialization on the disjoint path. - // - // Setup: test.jsonl seeds 4 Persons + 2 Companies. Spawn N=4 concurrent - // /change inserts on `node:Person` and N=4 concurrent inserts on - // `node:Company`. All 8 must return 200, and the post-test row counts - // must reflect every insert. 
- const PERSON_QUERY: &str = r#" -query insert_p($name: String, $age: I32) { - insert Person { name: $name, age: $age } -} -"#; - const COMPANY_QUERY: &str = r#" -query insert_c($name: String) { - insert Company { name: $name } -} -"#; - const SEED_PERSONS: u64 = 4; - const SEED_COMPANIES: u64 = 2; - const PER_TYPE: usize = 4; - - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let state = AppState::open(graph.to_string_lossy().to_string()) - .await - .unwrap(); - let app = build_app(state); - - let mut handles = Vec::with_capacity(PER_TYPE * 2); - for i in 0..PER_TYPE { - let app_p = app.clone(); - handles.push(tokio::spawn(async move { - let body = serde_json::to_vec(&ChangeRequest { - query: PERSON_QUERY.to_string(), - name: Some("insert_p".to_string()), - params: Some(json!({ "name": format!("p-{i}"), "age": i as i32 })), - branch: Some("main".to_string()), - }) - .unwrap(); - let req = Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(); - app_p.oneshot(req).await.unwrap().status() - })); - let app_c = app.clone(); - handles.push(tokio::spawn(async move { - let body = serde_json::to_vec(&ChangeRequest { - query: COMPANY_QUERY.to_string(), - name: Some("insert_c".to_string()), - params: Some(json!({ "name": format!("c-{i}") })), - branch: Some("main".to_string()), - }) - .unwrap(); - let req = Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(); - app_c.oneshot(req).await.unwrap().status() - })); - } - - let mut statuses = Vec::with_capacity(PER_TYPE * 2); - for h in handles { - statuses.push(h.await.unwrap()); - } - - let bad: Vec<_> = statuses - .iter() - .enumerate() - .filter(|(_, s)| **s != StatusCode::OK) - .collect(); - assert!( - bad.is_empty(), - "expected every disjoint /change insert to return 200, got non-200 for: {:?}", - bad, - 
); - - // Verify both tables landed every insert. - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - let lookup_count = |table_key: &str| -> u64 { - body["tables"] - .as_array() - .and_then(|tables| { - tables - .iter() - .find(|t| t["table_key"].as_str() == Some(table_key)) - }) - .and_then(|t| t["row_count"].as_u64()) - .unwrap_or_else(|| panic!("snapshot missing {}", table_key)) - }; - assert_eq!( - lookup_count("node:Person"), - SEED_PERSONS + PER_TYPE as u64, - "Person row count after concurrent inserts", - ); - assert_eq!( - lookup_count("node:Company"), - SEED_COMPANIES + PER_TYPE as u64, - "Company row count after concurrent inserts", - ); -} - -#[tokio::test(flavor = "multi_thread", worker_threads = 4)] -async fn ingest_per_actor_admission_cap_returns_429() { - // Pin the admission gate on `/ingest`. With per-actor in-flight cap of 1 - // and 8 concurrent requests from the same actor, at least one request - // must be rejected with HTTP 429 and `code: too_many_requests`. - // - // Pre-fix bug class: the admission pattern at `server_change` - // (`crates/omnigraph-server/src/lib.rs:932`) was the only handler - // that called `WorkloadController::try_admit`. A heavy actor sending - // bulk-ingest traffic would exhaust shared engine capacity (Lance I/O - // threads, manifest churn) without ever hitting an admission cap. - // Pinned at the HTTP boundary so future refactors that drop the - // try_admit call from a mutating handler turn this red. - // - // Post-fix invariant: `/ingest`, `/branches/create`, `/branches/delete`, - // `/branches/merge`, and `/schema/apply` all gate on - // `state.workload.try_admit(&actor_arc, est_bytes)` after Cedar - // authorization and before the engine call. Cap exhaustion surfaces as - // 429 with `code: too_many_requests`. 
- // - // Construct the WorkloadController directly with cap=1 instead of - // mutating `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX` via EnvGuard. Process-wide - // env vars are visible to concurrently-running tests; the previous - // `EnvGuard + #[serial]` pair leaked the override into any other test - // that called `AppState::open` during the guard's window - // (matrix CI failure on commit 99b0941). Using the explicit - // `AppState::new_with_workload` constructor closes that bug class: - // this test no longer mutates global state and no longer needs - // `#[serial]`. - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - let workload = omnigraph_server::workload::WorkloadController::new( - 1, // per-actor in-flight cap (the fixture under test) - 1_000_000_000, // per-actor byte budget, large so it never bottlenecks - ); - // MR-723: install a permit-all policy alongside the bearer token so - // /ingest (action=Change) passes Cedar evaluation. The test is - // exercising the admission cap, not policy; the policy is just - // enough to clear the State 3 path so the test reaches workload. - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, permit_all_policy_yaml(&["act-flooder"])).unwrap(); - let policy_engine = - omnigraph_server::PolicyEngine::load_graph(&policy_path, graph.to_string_lossy().as_ref()) - .unwrap(); - let state = AppState::new_single( - graph.to_string_lossy().to_string(), - db, - vec![("act-flooder".to_string(), "flooder-token".to_string())], - Some(policy_engine), - workload, - ); - let app = build_app(state); - let _temp = temp; - - // Eight concurrent ingests, all from act-flooder. Only one fits in a - // cap=1 in-flight semaphore; the others must 429. 
- const N: usize = 8; - let barrier = Arc::new(tokio::sync::Barrier::new(N)); - let mut handles = Vec::with_capacity(N); - for i in 0..N { - let app = app.clone(); - let barrier = Arc::clone(&barrier); - handles.push(tokio::spawn(async move { - // Align the 8 tasks at the barrier so they all attempt - // try_admit close in time. - barrier.wait().await; - - let body = serde_json::to_vec(&IngestRequest { - data: format!( - "{{\"type\":\"Person\",\"data\":{{\"name\":\"flooder-{i}\",\"age\":{i}}}}}\n" - ), - branch: Some("main".to_string()), - from: Some("main".to_string()), - mode: Some(omnigraph::loader::LoadMode::Merge), - }) - .unwrap(); - let req = Request::builder() - .uri(g("/ingest")) - .method(Method::POST) - .header("authorization", "Bearer flooder-token") - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(); - let response = app.oneshot(req).await.unwrap(); - let status = response.status(); - let headers = response.headers().clone(); - let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); - (status, headers, body.to_vec()) - })); - } - - let mut results = Vec::with_capacity(N); - for h in handles { - results.push(h.await.unwrap()); - } - let statuses: Vec<StatusCode> = results.iter().map(|(s, _, _)| *s).collect(); - - let too_many: Vec<usize> = statuses - .iter() - .enumerate() - .filter(|(_, s)| **s == StatusCode::TOO_MANY_REQUESTS) - .map(|(i, _)| i) - .collect(); - assert!( - !too_many.is_empty(), - "expected at least one /ingest under cap=1 to return 429; got statuses: {:?}", - statuses, - ); - - // Validate the structured error body for each 429 (body must carry - // the `too_many_requests` code so clients can distinguish it from - // generic conflicts). 
- for i in &too_many { - let body_value: Value = serde_json::from_slice(&results[*i].2).unwrap(); - let error: ErrorOutput = serde_json::from_value(body_value).unwrap(); - assert_eq!( - error.code, - Some(omnigraph_server::api::ErrorCode::TooManyRequests), - "429 body must carry code=too_many_requests; idx {} got {:?}", - i, - error.code, - ); - } - - // Validate the `Retry-After` header is set on every 429. Pinned by - // the same test so a future refactor that drops the header from - // `IntoResponse for ApiError` turns this red. The constant - // matches `crates/omnigraph-server/src/lib.rs::ApiError::into_response`. - for i in &too_many { - let retry_after = results[*i] - .1 - .get(axum::http::header::RETRY_AFTER) - .and_then(|v| v.to_str().ok()) - .map(str::to_string); - assert!( - retry_after.is_some(), - "429 response must include a Retry-After header; idx {} headers were: {:?}", - i, - results[*i].1, - ); - } -} diff --git a/crates/omnigraph-server/tests/multi_graph.rs b/crates/omnigraph-server/tests/multi_graph.rs deleted file mode 100644 index 5679aef..0000000 --- a/crates/omnigraph-server/tests/multi_graph.rs +++ /dev/null @@ -1,865 +0,0 @@ -//! Cluster-mode boot and the concurrent branch-ops matrix. -//! Moved verbatim from tests/server.rs in the modularization. - -use std::fs; - -use axum::body::{Body, to_bytes}; -use axum::http::{Method, Request, StatusCode}; -use omnigraph::db::Omnigraph; -use omnigraph::loader::{LoadMode, load_jsonl}; -use omnigraph_server::api::{ErrorOutput, ReadRequest}; -use omnigraph_server::{AppState, build_app}; -use serde_json::Value; -use serial_test::serial; -use tower::ServiceExt; - -mod support; -use support::*; - -#[tokio::test(flavor = "multi_thread", worker_threads = 4)] -async fn concurrent_branch_ops_morphological_matrix() { - // Cell a: Merge Γ— Merge, distinct targets. 
- // Pre-fix on b09a097/22d76db: branch_merge_impl's swap-restore race - // landed feature_a's content in target_b instead of target_a (and - // vice versa β€” symmetric swap). Identity asserts catch both - // asymmetric and symmetric variants. - { - let cell = "a:mergeΓ—merge:distinct-targets"; - let h = matrix::Harness::new().await; - h.create_branch("main", "feature-a-cella").await; - h.insert_person("feature-a-cella", "EveA-cella", 22).await; - h.create_branch("main", "feature-b-cella").await; - h.insert_person("feature-b-cella", "FrankB-cella", 33).await; - h.create_branch("main", "target-a-cella").await; - h.create_branch("main", "target-b-cella").await; - - let (sa, sb) = h - .run_pair( - matrix::op_merge("feature-a-cella".to_string(), "target-a-cella".to_string()), - matrix::op_merge("feature-b-cella".to_string(), "target-b-cella".to_string()), - ) - .await; - assert_eq!(sa.status, StatusCode::OK, "[{}] merge a", cell); - assert_eq!(sb.status, StatusCode::OK, "[{}] merge b", cell); - h.assert_persons("target-a-cella", cell, &["EveA-cella"], &["FrankB-cella"]) - .await; - h.assert_persons("target-b-cella", cell, &["FrankB-cella"], &["EveA-cella"]) - .await; - h.assert_post_op_sentinel(cell, "sentinel-cella").await; - } - - // Cell b: Merge Γ— Merge, same target / distinct sources. - // Both want to land in main. merge_exclusive serializes; both should - // succeed and main should contain BOTH sources' contributions. 
- { - let cell = "b:merge×merge:same-target-distinct-sources"; - let h = matrix::Harness::new().await; - h.create_branch("main", "src-x-cellb").await; - h.insert_person("src-x-cellb", "Xavier-cellb", 41).await; - h.create_branch("main", "src-y-cellb").await; - h.insert_person("src-y-cellb", "Yvonne-cellb", 42).await; - - let (sa, sb) = h - .run_pair( - matrix::op_merge("src-x-cellb".to_string(), "main".to_string()), - matrix::op_merge("src-y-cellb".to_string(), "main".to_string()), - ) - .await; - assert_eq!(sa.status, StatusCode::OK, "[{}] merge x", cell); - assert_eq!(sb.status, StatusCode::OK, "[{}] merge y", cell); - h.assert_persons("main", cell, &["Xavier-cellb", "Yvonne-cellb"], &[]) - .await; - h.assert_post_op_sentinel(cell, "sentinel-cellb").await; - } - - // Cell c: Merge × Merge, same source / distinct targets (fanout). - // One source merged into two targets simultaneously. merge_exclusive - // serializes; both targets should reflect the source's content. - { - let cell = "c:merge×merge:same-source-distinct-targets"; - let h = matrix::Harness::new().await; - h.create_branch("main", "src-shared-cellc").await; - h.insert_person("src-shared-cellc", "Sharon-cellc", 50) - .await; - h.create_branch("main", "tgt-1-cellc").await; - h.create_branch("main", "tgt-2-cellc").await; - - let (sa, sb) = h - .run_pair( - matrix::op_merge("src-shared-cellc".to_string(), "tgt-1-cellc".to_string()), - matrix::op_merge("src-shared-cellc".to_string(), "tgt-2-cellc".to_string()), - ) - .await; - assert_eq!(sa.status, StatusCode::OK, "[{}] merge into tgt-1", cell); - assert_eq!(sb.status, StatusCode::OK, "[{}] merge into tgt-2", cell); - h.assert_persons("tgt-1-cellc", cell, &["Sharon-cellc"], &[]) - .await; - h.assert_persons("tgt-2-cellc", cell, &["Sharon-cellc"], &[]) - .await; - h.assert_post_op_sentinel(cell, "sentinel-cellc").await; - } - - // Cell d: Merge × Change, both touching main. 
C2 permits both - // succeed, or exactly one clean 409 if the merge detects target - // movement after planning but before acquiring the queue. - { - let cell = "d:merge×change:into-target"; - let h = matrix::Harness::new().await; - h.create_branch("main", "feature-celld").await; - h.insert_person("feature-celld", "EveD-celld", 22).await; - - let (sa, sb) = h - .run_pair( - matrix::op_merge("feature-celld".to_string(), "main".to_string()), - matrix::op_change_insert("main".to_string(), "FrankD-celld".to_string(), 33), - ) - .await; - assert_eq!(sb.status, StatusCode::OK, "[{}] change", cell); - assert!( - sa.status == StatusCode::OK || sa.status == StatusCode::CONFLICT, - "[{}] merge must be 200 or clean 409, got {}", - cell, - sa.status - ); - if sa.status == StatusCode::OK { - h.assert_persons("main", cell, &["EveD-celld", "FrankD-celld"], &[]) - .await; - } else { - let error: ErrorOutput = serde_json::from_slice(&sa.body).unwrap(); - let conflict = error - .manifest_conflict - .expect("merge 409 must include manifest_conflict"); - assert_eq!( - conflict.table_key, "node:Person", - "[{}] conflict table", - cell - ); - h.assert_persons("main", cell, &["FrankD-celld"], &["EveD-celld"]) - .await; - } - h.assert_post_op_sentinel(cell, "sentinel-celld").await; - } - - // Cell e: Merge × BranchCreateFrom-target. Concurrent fork off the - // merge target while the merge runs. Both should succeed; the new - // branch should have a coherent view (either pre- or post-merge, - // both valid). After both, target = main has the merged content. 
- { - let cell = "e:merge×branch_create_from:target"; - let h = matrix::Harness::new().await; - h.create_branch("main", "src-celle").await; - h.insert_person("src-celle", "Eve-celle", 22).await; - - let (sa, sb) = h - .run_pair( - matrix::op_merge("src-celle".to_string(), "main".to_string()), - matrix::op_branch_create("main".to_string(), "fork-celle".to_string()), - ) - .await; - assert_eq!(sa.status, StatusCode::OK, "[{}] merge", cell); - assert_eq!(sb.status, StatusCode::OK, "[{}] branch_create_from", cell); - // Main definitely has Eve. - h.assert_persons("main", cell, &["Eve-celle"], &[]).await; - // fork-celle was forked off main at SOME version; main's current - // count is 5 (4 seeded + Eve). fork-celle has either 4 (pre-merge - // snapshot) or 5 (post-merge snapshot); both are valid timings. - let fork_count = h.person_count("fork-celle").await; - assert!( - fork_count == 4 || fork_count == 5, - "[{}] fork-celle row count must be pre- or post-merge view (4 or 5), got {}", - cell, - fork_count - ); - h.assert_post_op_sentinel(cell, "sentinel-celle").await; - } - - // Cell f: BranchCreateFrom × BranchCreateFrom, distinct parents. - // Pre-fix on f925ad1: swap-restore race in branch_create_from_impl - // forked the new branch off the wrong parent. Identity asserts pin - // that fork-from-A inherits A's content, fork-from-B inherits B's. 
- { - let cell = "f:branch_create_from×branch_create_from:distinct-parents"; - let h = matrix::Harness::new().await; - h.create_branch("main", "alpha-cellf").await; - h.insert_person("alpha-cellf", "Eve-cellf", 22).await; - h.create_branch("main", "beta-cellf").await; - - let (sa, sb) = h - .run_pair( - matrix::op_branch_create("alpha-cellf".to_string(), "gamma-cellf".to_string()), - matrix::op_branch_create("beta-cellf".to_string(), "delta-cellf".to_string()), - ) - .await; - assert_eq!(sa.status, StatusCode::OK, "[{}] gamma create", cell); - assert_eq!(sb.status, StatusCode::OK, "[{}] delta create", cell); - // gamma forks off alpha → must contain Eve. - h.assert_persons("gamma-cellf", cell, &["Eve-cellf"], &[]) - .await; - // delta forks off beta → must NOT contain Eve. - h.assert_persons("delta-cellf", cell, &[], &["Eve-cellf"]) - .await; - h.assert_post_op_sentinel(cell, "sentinel-cellf").await; - } - - // Cell g: BranchCreateFrom × BranchDelete, unrelated branches. - // Disjoint branches; both should complete cleanly without - // interference. - { - let cell = "g:branch_create_from×branch_delete:unrelated"; - let h = matrix::Harness::new().await; - h.create_branch("main", "doomed-cellg").await; - - let (sa, sb) = h - .run_pair( - matrix::op_branch_create("main".to_string(), "newborn-cellg".to_string()), - matrix::op_branch_delete("doomed-cellg".to_string()), - ) - .await; - assert_eq!(sa.status, StatusCode::OK, "[{}] create newborn", cell); - assert_eq!(sb.status, StatusCode::OK, "[{}] delete doomed", cell); - // newborn-cellg exists with main's content. - h.assert_persons("newborn-cellg", cell, &["Alice"], &[]) - .await; - h.assert_post_op_sentinel(cell, "sentinel-cellg").await; - } - - // Cell h: BranchDelete × BranchDelete, distinct branches. Both call - // refresh() internally; verify no deadlock and both deletes land. 
- { - let cell = "h:branch_delete×branch_delete:distinct"; - let h = matrix::Harness::new().await; - h.create_branch("main", "doomed1-cellh").await; - h.create_branch("main", "doomed2-cellh").await; - - let (sa, sb) = h - .run_pair( - matrix::op_branch_delete("doomed1-cellh".to_string()), - matrix::op_branch_delete("doomed2-cellh".to_string()), - ) - .await; - assert_eq!(sa.status, StatusCode::OK, "[{}] delete 1", cell); - assert_eq!(sb.status, StatusCode::OK, "[{}] delete 2", cell); - // Verify both gone via /branches list (snapshot would still work - // for a deleted branch via parent fallback in some paths, so we - // use the explicit list). - let r = h - .app - .clone() - .oneshot( - Request::builder() - .uri(g("/branches")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(r.status(), StatusCode::OK); - let body = to_bytes(r.into_body(), usize::MAX).await.unwrap(); - let list_body: Value = serde_json::from_slice(&body).unwrap(); - let branches: Vec<&str> = list_body["branches"] - .as_array() - .unwrap() - .iter() - .filter_map(|v| v.as_str()) - .collect(); - assert!( - !branches.contains(&"doomed1-cellh"), - "[{}] doomed1 still in branch list: {:?}", - cell, - branches - ); - assert!( - !branches.contains(&"doomed2-cellh"), - "[{}] doomed2 still in branch list: {:?}", - cell, - branches - ); - h.assert_post_op_sentinel(cell, "sentinel-cellh").await; - } - - // Cell i: BranchDelete × Change, on a different branch. Delete one - // branch while a /change runs on main. Both should succeed. 
- { - let cell = "i:branch_deleteΓ—change:distinct-branch"; - let h = matrix::Harness::new().await; - h.create_branch("main", "doomed-celli").await; - - let (sa, sb) = h - .run_pair( - matrix::op_branch_delete("doomed-celli".to_string()), - matrix::op_change_insert("main".to_string(), "Pat-celli".to_string(), 44), - ) - .await; - assert_eq!(sa.status, StatusCode::OK, "[{}] delete", cell); - assert_eq!(sb.status, StatusCode::OK, "[{}] change", cell); - h.assert_persons("main", cell, &["Pat-celli"], &[]).await; - h.assert_post_op_sentinel(cell, "sentinel-celli").await; - } - - // Cell j: BranchCreateFrom Γ— Change, both on main. The fork timing - // determines whether the new branch sees the change (pre or post). - // Both valid. Main must contain the inserted row. - { - let cell = "j:branch_create_fromΓ—change:on-source"; - let h = matrix::Harness::new().await; - - let (sa, sb) = h - .run_pair( - matrix::op_branch_create("main".to_string(), "twin-cellj".to_string()), - matrix::op_change_insert("main".to_string(), "Quincy-cellj".to_string(), 55), - ) - .await; - assert_eq!(sa.status, StatusCode::OK, "[{}] branch_create", cell); - assert_eq!(sb.status, StatusCode::OK, "[{}] change", cell); - h.assert_persons("main", cell, &["Quincy-cellj"], &[]).await; - // twin-cellj has either pre-change view (no Quincy) or - // post-change view (with Quincy); either is valid. - let twin_has_quincy = h.person_exists("twin-cellj", "Quincy-cellj").await; - let _ = twin_has_quincy; // either valid timing β€” just ensure no panic - h.assert_post_op_sentinel(cell, "sentinel-cellj").await; - } - - // Cell k: reopen consistency. Run a representative concurrent pair, - // drop the engine, reopen on a separate handle, verify state matches. 
- { - let cell = "k:reopen-after-pair"; - let h = matrix::Harness::new().await; - h.create_branch("main", "src-cellk").await; - h.insert_person("src-cellk", "Rita-cellk", 36).await; - - let (sa, sb) = h - .run_pair( - matrix::op_merge("src-cellk".to_string(), "main".to_string()), - matrix::op_change_insert("main".to_string(), "Steve-cellk".to_string(), 37), - ) - .await; - assert_eq!(sb.status, StatusCode::OK, "[{}] change", cell); - assert!( - sa.status == StatusCode::OK || sa.status == StatusCode::CONFLICT, - "[{}] merge must be 200 or clean 409, got {}", - cell, - sa.status - ); - if sa.status == StatusCode::OK { - h.assert_persons("main", cell, &["Rita-cellk", "Steve-cellk"], &[]) - .await; - } else { - let error: ErrorOutput = serde_json::from_slice(&sa.body).unwrap(); - let conflict = error - .manifest_conflict - .expect("merge 409 must include manifest_conflict"); - assert_eq!( - conflict.table_key, "node:Person", - "[{}] conflict table", - cell - ); - h.assert_persons("main", cell, &["Steve-cellk"], &["Rita-cellk"]) - .await; - } - - // Reopen via a fresh AppState on the same graph. - let graph_uri = format!("{}/server.omni", h._temp.path().display()); - let reopened = AppState::open(graph_uri.clone()).await.unwrap(); - let app2 = build_app(reopened); - // Sanity: the same identity check via the new app must see - // Rita and Steve. 
- let r = app2 - .clone() - .oneshot( - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(r.status(), StatusCode::OK, "[{}] reopen snapshot", cell); - let body = to_bytes(r.into_body(), usize::MAX).await.unwrap(); - let v: Value = serde_json::from_slice(&body).unwrap(); - let person_rows = v["tables"] - .as_array() - .and_then(|tables| { - tables - .iter() - .find(|t| t["table_key"].as_str() == Some("node:Person")) - }) - .and_then(|t| t["row_count"].as_u64()) - .expect("reopen snapshot must include node:Person row_count"); - let expected_rows = if sa.status == StatusCode::OK { 6 } else { 5 }; - assert_eq!( - person_rows, expected_rows, - "[{}] reopened main should include seed (4) + committed concurrent writes", - cell, - ); - } -} - -#[tokio::test] -async fn cluster_boot_serves_applied_state() { - let temp = converged_cluster_dir("").await; - let settings = cluster_settings(temp.path()).await.unwrap(); - let omnigraph_server::ServerConfigMode::Multi { - graphs, - config_path, - server_policy, - } = settings.mode - else { - panic!("cluster boot must select multi-graph routing"); - }; - assert_eq!(graphs.len(), 1); - assert_eq!(graphs[0].graph_id, "knowledge"); - assert!(server_policy.is_none()); - - let state = - omnigraph_server::open_multi_graph_state(graphs, Vec::new(), None, config_path, false) - .await - .unwrap(); - let app = build_app(state); - - // The management surface keeps its closed-by-default contract: without a - // cluster-scoped policy bundle there is no server-level Cedar engine, so - // GET /graphs refuses even in cluster mode. 
- let (status, body) = json_response( - &app, - Request::builder() - .uri("/graphs") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::FORBIDDEN, "{body}"); - - let (status, body) = json_response( - &app, - Request::builder() - .uri("/graphs/knowledge/queries") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK, "{body}"); - assert!( - body["queries"] - .as_array() - .unwrap() - .iter() - .any(|q| q["name"] == "find_person"), - "{body}" - ); - - let (status, body) = json_response( - &app, - Request::builder() - .method(Method::POST) - .uri("/graphs/knowledge/queries/find_person") - .header("content-type", "application/json") - .body(Body::from(r#"{"params":{"name":"nobody"}}"#)) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK, "{body}"); -} - -#[tokio::test] -async fn cluster_boot_quarantines_graph_open_failures() { - let temp = tempfile::tempdir().unwrap(); - let schema = "\nnode Person {\n name: String @key\n}\n"; - let good_uri = temp.path().join("good.omni"); - Omnigraph::init(good_uri.to_string_lossy().as_ref(), schema) - .await - .unwrap(); - let bad_uri = temp.path().join("missing.omni"); - let server_policy = omnigraph_server::PolicySource::Inline( - r#" -version: 1 -kind: server -groups: - admins: [act-admin] -rules: - - id: admins-list-graphs - allow: - actors: { group: admins } - actions: [graph_list] -"# - .to_string(), - ); - let graphs = vec![ - omnigraph_server::GraphStartupConfig { - graph_id: "broken".to_string(), - uri: bad_uri.to_string_lossy().to_string(), - policy: None, - embedding: None, - queries: stored_query_registry(&[]), - }, - omnigraph_server::GraphStartupConfig { - graph_id: "good".to_string(), - uri: good_uri.to_string_lossy().to_string(), - policy: None, - embedding: None, - queries: stored_query_registry(&[]), - }, - ]; - let strict_err = match omnigraph_server::open_multi_graph_state( - graphs.clone(), - vec![("act-admin".to_string(), 
"admin-token".to_string())], - Some(&server_policy), - temp.path().join("cluster.yaml"), - true, - ) - .await - { - Ok(_) => panic!("strict startup should reject a failed graph open"), - Err(err) => err, - }; - assert!( - strict_err - .to_string() - .contains("strict multi-graph startup requires every graph to open"), - "{strict_err}" - ); - let state = omnigraph_server::open_multi_graph_state( - graphs, - vec![("act-admin".to_string(), "admin-token".to_string())], - Some(&server_policy), - temp.path().join("cluster.yaml"), - false, - ) - .await - .unwrap(); - let mut ready: Vec<_> = state - .routing() - .registry - .list() - .iter() - .map(|handle| handle.key.graph_id.as_str().to_string()) - .collect(); - ready.sort(); - assert_eq!(ready, vec!["good"]); - let app = build_app(state); - - let (status, body) = json_response( - &app, - Request::builder() - .uri("/graphs") - .header("authorization", "Bearer admin-token") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK, "{body}"); - assert_eq!( - body["graphs"] - .as_array() - .unwrap() - .iter() - .map(|graph| graph["graph_id"].as_str().unwrap()) - .collect::<Vec<_>>(), - vec!["good"] - ); - - let (status, body) = json_response( - &app, - Request::builder() - .uri("/graphs/broken/queries") - .header("authorization", "Bearer admin-token") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::NOT_FOUND, "{body}"); -} - -#[tokio::test(flavor = "multi_thread")] -#[serial] -async fn cluster_boot_injects_embedding_provider_config() { - const EMBED_SCHEMA: &str = r#" -node Doc { - slug: String @key - title: String @index - embedding: Vector(4) @embed("title", model="cluster-mock") @index -} -"#; - const EMBED_QUERY: &str = r#" -query vector_search_string($q: String) { - match { $d: Doc } - return { $d.slug, $d.title } - order { nearest($d.embedding, $q) } - limit 3 -} -"#; - - let alpha = mock_embedding("alpha", 4); - let beta = mock_embedding("beta", 4); - let gamma 
= mock_embedding("gamma", 4); - let data = format!( - concat!( - r#"{{"type":"Doc","data":{{"slug":"alpha-doc","title":"alpha guide","embedding":[{}]}}}}"#, - "\n", - r#"{{"type":"Doc","data":{{"slug":"beta-doc","title":"beta guide","embedding":[{}]}}}}"#, - "\n", - r#"{{"type":"Doc","data":{{"slug":"gamma-doc","title":"gamma handbook","embedding":[{}]}}}}"# - ), - format_vector(&alpha), - format_vector(&beta), - format_vector(&gamma), - ); - - let temp = tempfile::tempdir().unwrap(); - fs::write(temp.path().join("docs.pg"), EMBED_SCHEMA).unwrap(); - fs::write(temp.path().join("search.gq"), EMBED_QUERY).unwrap(); - fs::write( - temp.path().join("cluster.yaml"), - r#" -version: 1 -providers: - embedding: - default: - kind: mock - model: cluster-mock -graphs: - knowledge: - schema: ./docs.pg - embedding_provider: default - queries: - vector_search_string: - file: ./search.gq -"#, - ) - .unwrap(); - let import = omnigraph_cluster::import_config_dir(temp.path()).await; - assert!(import.ok, "{:?}", import.diagnostics); - let apply = omnigraph_cluster::apply_config_dir(temp.path()).await; - assert!(apply.ok && apply.converged, "{:?}", apply.diagnostics); - - let graph_uri = temp - .path() - .join("graphs/knowledge.omni") - .to_string_lossy() - .to_string(); - let mut db = Omnigraph::open(&graph_uri).await.unwrap(); - load_jsonl(&mut db, &data, LoadMode::Overwrite) - .await - .unwrap(); - - let _guard = EnvGuard::set(&[ - ("OMNIGRAPH_EMBEDDINGS_MOCK", None), - ("OMNIGRAPH_EMBED_PROVIDER", None), - ("OMNIGRAPH_EMBED_BASE_URL", None), - ("OMNIGRAPH_EMBED_MODEL", None), - ("OPENROUTER_API_KEY", None), - ("OPENAI_API_KEY", None), - ("GEMINI_API_KEY", None), - ]); - let settings = cluster_settings(temp.path()).await.unwrap(); - let omnigraph_server::ServerConfigMode::Multi { - graphs, - config_path, - server_policy, - } = settings.mode - else { - panic!("cluster boot must select multi-graph routing"); - }; - let state = omnigraph_server::open_multi_graph_state( - graphs, - 
Vec::new(), - server_policy.as_ref(), - config_path, - false, - ) - .await - .unwrap(); - let app = build_app(state); - - let read = ReadRequest { - query_source: EMBED_QUERY.to_string(), - query_name: Some("vector_search_string".to_string()), - params: Some(serde_json::json!({ "q": "alpha" })), - branch: Some("main".to_string()), - snapshot: None, - }; - let (status, body) = json_response( - &app, - Request::builder() - .uri("/graphs/knowledge/read") - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&read).unwrap())) - .unwrap(), - ) - .await; - - assert_eq!(status, StatusCode::OK, "{body}"); - assert_eq!(body["row_count"], 3); - assert_eq!(body["rows"][0]["d.slug"], "alpha-doc"); -} - -#[tokio::test(flavor = "multi_thread")] -#[serial] -async fn cluster_boot_refuses_missing_embedding_secret_env() { - let temp = tempfile::tempdir().unwrap(); - fs::write( - temp.path().join("people.pg"), - "\nnode Person {\n name: String @key\n}\n", - ) - .unwrap(); - fs::write( - temp.path().join("people.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n", - ) - .unwrap(); - fs::write( - temp.path().join("cluster.yaml"), - r#" -version: 1 -providers: - embedding: - default: - kind: openai-compatible - api_key: ${OG_TEST_MISSING_EMBED_KEY} -graphs: - knowledge: - schema: ./people.pg - embedding_provider: default - queries: - find_person: - file: ./people.gq -"#, - ) - .unwrap(); - let import = omnigraph_cluster::import_config_dir(temp.path()).await; - assert!(import.ok, "{:?}", import.diagnostics); - let apply = omnigraph_cluster::apply_config_dir(temp.path()).await; - assert!(apply.ok && apply.converged, "{:?}", apply.diagnostics); - - let _guard = EnvGuard::set(&[ - ("OG_TEST_MISSING_EMBED_KEY", None), - ("OMNIGRAPH_EMBEDDINGS_MOCK", None), - ]); - let err = cluster_settings(temp.path()).await.unwrap_err(); - let message = err.to_string(); - assert!( - 
message.contains("embedding provider for graph 'knowledge'"), - "{message}" - ); - assert!(message.contains("OG_TEST_MISSING_EMBED_KEY"), "{message}"); -} - -#[tokio::test] -async fn cluster_boot_wires_policy_bindings_into_cedar_slots() { - let temp = tempfile::tempdir().unwrap(); - drop(temp); - let policy_block = r#"policies: - graph_rules: - file: ./graph.policy.yaml - applies_to: [knowledge] - cluster_rules: - file: ./cluster.policy.yaml - applies_to: [cluster] -"#; - let temp = { - let temp = tempfile::tempdir().unwrap(); - fs::write( - temp.path().join("people.pg"), - "\nnode Person {\n name: String @key\n}\n", - ) - .unwrap(); - fs::write( - temp.path().join("people.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n", - ) - .unwrap(); - fs::write( - temp.path().join("graph.policy.yaml"), - permit_all_policy_yaml(&["default"]), - ) - .unwrap(); - fs::write( - temp.path().join("cluster.policy.yaml"), - permit_all_policy_yaml(&["default"]).replace( - "protected_branches: [main]\n", - "protected_branches: [main]\nkind: server\n", - ), - ) - .unwrap(); - fs::write( - temp.path().join("cluster.yaml"), - format!( - r#" -version: 1 -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -{policy_block}"# - ), - ) - .unwrap(); - let import = omnigraph_cluster::import_config_dir(temp.path()).await; - assert!(import.ok, "{:?}", import.diagnostics); - let apply = omnigraph_cluster::apply_config_dir(temp.path()).await; - assert!(apply.ok && apply.converged, "{:?}", apply.diagnostics); - temp - }; - - let settings = cluster_settings(temp.path()).await.unwrap(); - let omnigraph_server::ServerConfigMode::Multi { - graphs, - server_policy, - .. - } = settings.mode - else { - panic!("cluster boot must select multi-graph routing"); - }; - // Cluster boots carry policy CONTENT (digest-verified catalog blobs), - // not paths β€” the catalog may live on object storage. 
- let omnigraph_server::PolicySource::Inline(graph_policy) = - graphs[0].policy.as_ref().expect("graph-bound bundle") - else { - panic!("cluster-mode graph policy must be inline content"); - }; - assert!(graph_policy.contains("actors:"), "{graph_policy:?}"); - let omnigraph_server::PolicySource::Inline(server_policy) = - server_policy.expect("cluster-bound bundle") - else { - panic!("cluster-mode server policy must be inline content"); - }; - assert!(server_policy.contains("kind: server"), "{server_policy:?}"); -} - -#[tokio::test] -async fn cluster_boot_refusals() { - // RFC-011 cluster-only: with no --cluster, boot refuses with the - // cluster-required remedy. - let err = omnigraph_server::load_server_settings(None, None, true, false) - .await - .unwrap_err(); - assert!(err.to_string().contains("boots from a cluster"), "{err}"); - - let temp = converged_cluster_dir("").await; - let dir = temp.path().to_path_buf(); - - // Tampered catalog blob refuses boot with the remedy. - let blob_dir = dir.join("__cluster/resources/query/knowledge/find_person"); - let blob = fs::read_dir(&blob_dir) - .unwrap() - .next() - .unwrap() - .unwrap() - .path(); - fs::write(&blob, "tampered").unwrap(); - let err = cluster_settings(&dir).await.unwrap_err(); - assert!( - err.to_string().contains("catalog_payload_digest_mismatch"), - "{err}" - ); - assert!(err.to_string().contains("cluster refresh"), "{err}"); - - // Missing state refuses with the import/apply remedy. 
- let empty = tempfile::tempdir().unwrap(); - let err = cluster_settings(empty.path()).await.unwrap_err(); - assert!(err.to_string().contains("cluster_state_missing"), "{err}"); -} diff --git a/crates/omnigraph-server/tests/openapi.rs b/crates/omnigraph-server/tests/openapi.rs index 9276482..a2542db 100644 --- a/crates/omnigraph-server/tests/openapi.rs +++ b/crates/omnigraph-server/tests/openapi.rs @@ -8,9 +8,10 @@ use axum::body::{Body, to_bytes}; use axum::http::{Method, Request, StatusCode}; use omnigraph::db::Omnigraph; use omnigraph::loader::{LoadMode, load_jsonl}; -use omnigraph_server::{AppState, build_app, served_openapi}; +use omnigraph_server::{ApiDoc, AppState, build_app}; use serde_json::Value; use tower::ServiceExt; +use utoipa::OpenApi; fn fixture(name: &str) -> PathBuf { PathBuf::from(env!("CARGO_MANIFEST_DIR")) @@ -70,10 +71,7 @@ async fn json_response(app: &Router, request: Request) -> (StatusCode, Val } fn openapi_doc() -> utoipa::openapi::OpenApi { - // RFC-011 cluster-only: the canonical committed spec is the SERVED - // shape β€” protected routes nested under `/graphs/{graph_id}/…`, - // `/healthz` and `/graphs` flat. This matches what the server serves. - served_openapi() + ApiDoc::openapi() } fn openapi_json() -> Value { @@ -161,28 +159,23 @@ fn openapi_info_contains_version() { // Path coverage tests // --------------------------------------------------------------------------- -// The canonical served spec keeps `/healthz` and `/graphs` flat; every -// protected route nests under `/graphs/{graph_id}/…`. 
const EXPECTED_PATHS: &[&str] = &[ "/healthz", "/graphs", - "/graphs/{graph_id}/snapshot", - "/graphs/{graph_id}/read", - "/graphs/{graph_id}/query", - "/graphs/{graph_id}/export", - "/graphs/{graph_id}/change", - "/graphs/{graph_id}/mutate", - "/graphs/{graph_id}/queries", - "/graphs/{graph_id}/queries/{name}", - "/graphs/{graph_id}/schema", - "/graphs/{graph_id}/schema/apply", - "/graphs/{graph_id}/load", - "/graphs/{graph_id}/ingest", - "/graphs/{graph_id}/branches", - "/graphs/{graph_id}/branches/{branch}", - "/graphs/{graph_id}/branches/merge", - "/graphs/{graph_id}/commits", - "/graphs/{graph_id}/commits/{commit_id}", + "/snapshot", + "/read", + "/query", + "/export", + "/change", + "/mutate", + "/schema", + "/schema/apply", + "/ingest", + "/branches", + "/branches/{branch}", + "/branches/merge", + "/commits", + "/commits/{commit_id}", ]; #[test] @@ -226,25 +219,25 @@ fn openapi_healthz_is_get() { #[test] fn openapi_read_is_post() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/read"]["post"].is_object()); + assert!(doc["paths"]["/read"]["post"].is_object()); } #[test] fn openapi_export_is_post() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/export"]["post"].is_object()); + assert!(doc["paths"]["/export"]["post"].is_object()); } #[test] fn openapi_change_is_post() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/change"]["post"].is_object()); + assert!(doc["paths"]["/change"]["post"].is_object()); } #[test] fn openapi_mutate_is_post() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/mutate"]["post"].is_object()); + assert!(doc["paths"]["/mutate"]["post"].is_object()); } // Deprecation flagging β€” `/read` and `/change` are kept indefinitely for @@ -257,7 +250,7 @@ fn openapi_mutate_is_post() { fn openapi_read_is_deprecated() { let doc = openapi_json(); assert_eq!( - doc["paths"]["/graphs/{graph_id}/read"]["post"]["deprecated"], + 
doc["paths"]["/read"]["post"]["deprecated"], serde_json::Value::Bool(true), "/read must be flagged deprecated in OpenAPI; use /query instead" ); @@ -267,7 +260,7 @@ fn openapi_read_is_deprecated() { fn openapi_change_is_deprecated() { let doc = openapi_json(); assert_eq!( - doc["paths"]["/graphs/{graph_id}/change"]["post"]["deprecated"], + doc["paths"]["/change"]["post"]["deprecated"], serde_json::Value::Bool(true), "/change must be flagged deprecated in OpenAPI; use /mutate instead" ); @@ -276,7 +269,7 @@ fn openapi_change_is_deprecated() { #[test] fn openapi_query_is_not_deprecated() { let doc = openapi_json(); - let deprecated = doc["paths"]["/graphs/{graph_id}/query"]["post"] + let deprecated = doc["paths"]["/query"]["post"] .get("deprecated") .and_then(serde_json::Value::as_bool) .unwrap_or(false); @@ -289,7 +282,7 @@ fn openapi_query_is_not_deprecated() { #[test] fn openapi_mutate_is_not_deprecated() { let doc = openapi_json(); - let deprecated = doc["paths"]["/graphs/{graph_id}/mutate"]["post"] + let deprecated = doc["paths"]["/mutate"]["post"] .get("deprecated") .and_then(serde_json::Value::as_bool) .unwrap_or(false); @@ -302,64 +295,38 @@ fn openapi_mutate_is_not_deprecated() { #[test] fn openapi_ingest_is_post() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/ingest"]["post"].is_object()); -} - -#[test] -fn openapi_load_is_not_deprecated() { - // RFC-009 Phase 5: /load is the canonical bulk-load endpoint. - let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/load"]["post"].is_object()); - let deprecated = doc["paths"]["/graphs/{graph_id}/load"]["post"] - .get("deprecated") - .and_then(serde_json::Value::as_bool) - .unwrap_or(false); - assert!( - !deprecated, - "/load is the canonical load endpoint and must not be deprecated" - ); -} - -#[test] -fn openapi_ingest_is_deprecated() { - // RFC-009 Phase 5: /ingest is now the deprecated alias of /load. 
- let doc = openapi_json(); - assert_eq!( - doc["paths"]["/graphs/{graph_id}/ingest"]["post"]["deprecated"], - serde_json::Value::Bool(true), - "/ingest must be flagged deprecated now that /load is canonical" - ); + assert!(doc["paths"]["/ingest"]["post"].is_object()); } #[test] fn openapi_branches_supports_get_and_post() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/branches"]["get"].is_object()); - assert!(doc["paths"]["/graphs/{graph_id}/branches"]["post"].is_object()); + assert!(doc["paths"]["/branches"]["get"].is_object()); + assert!(doc["paths"]["/branches"]["post"].is_object()); } #[test] fn openapi_branch_delete_is_delete() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/branches/{branch}"]["delete"].is_object()); + assert!(doc["paths"]["/branches/{branch}"]["delete"].is_object()); } #[test] fn openapi_branch_merge_is_post() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/branches/merge"]["post"].is_object()); + assert!(doc["paths"]["/branches/merge"]["post"].is_object()); } #[test] fn openapi_commits_is_get() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/commits"]["get"].is_object()); + assert!(doc["paths"]["/commits"]["get"].is_object()); } #[test] fn openapi_commit_show_is_get() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/commits/{commit_id}"]["get"].is_object()); + assert!(doc["paths"]["/commits/{commit_id}"]["get"].is_object()); } // --------------------------------------------------------------------------- @@ -514,13 +481,13 @@ fn query_request_query_is_required() { #[test] fn openapi_query_is_post() { let doc = openapi_json(); - assert!(doc["paths"]["/graphs/{graph_id}/query"]["post"].is_object()); + assert!(doc["paths"]["/query"]["post"].is_object()); } #[test] fn query_endpoint_documents_mutation_400() { let doc = openapi_json(); - let four_hundred = &doc["paths"]["/graphs/{graph_id}/query"]["post"]["responses"]["400"]; + let 
four_hundred = &doc["paths"]["/query"]["post"]["responses"]["400"]; let description = four_hundred["description"].as_str().unwrap_or_default(); assert!( description.contains("mutations") || description.contains("POST /mutate"), @@ -731,21 +698,18 @@ fn openapi_defines_bearer_token_security_scheme() { fn protected_endpoints_reference_bearer_token_security() { let doc = openapi_json(); let protected_paths = [ - ("/graphs/{graph_id}/read", "post"), - ("/graphs/{graph_id}/change", "post"), - ("/graphs/{graph_id}/schema/apply", "post"), - ("/graphs/{graph_id}/queries", "get"), - ("/graphs/{graph_id}/queries/{name}", "post"), - ("/graphs/{graph_id}/load", "post"), - ("/graphs/{graph_id}/ingest", "post"), - ("/graphs/{graph_id}/export", "post"), - ("/graphs/{graph_id}/snapshot", "get"), - ("/graphs/{graph_id}/branches", "get"), - ("/graphs/{graph_id}/branches", "post"), - ("/graphs/{graph_id}/branches/{branch}", "delete"), - ("/graphs/{graph_id}/branches/merge", "post"), - ("/graphs/{graph_id}/commits", "get"), - ("/graphs/{graph_id}/commits/{commit_id}", "get"), + ("/read", "post"), + ("/change", "post"), + ("/schema/apply", "post"), + ("/ingest", "post"), + ("/export", "post"), + ("/snapshot", "get"), + ("/branches", "get"), + ("/branches", "post"), + ("/branches/{branch}", "delete"), + ("/branches/merge", "post"), + ("/commits", "get"), + ("/commits/{commit_id}", "get"), ]; for (path, method) in protected_paths { @@ -777,7 +741,7 @@ fn healthz_does_not_require_security() { #[test] fn branch_delete_has_branch_path_parameter() { let doc = openapi_json(); - let params = doc["paths"]["/graphs/{graph_id}/branches/{branch}"]["delete"]["parameters"] + let params = doc["paths"]["/branches/{branch}"]["delete"]["parameters"] .as_array() .unwrap(); let has_branch = params @@ -792,7 +756,7 @@ fn branch_delete_has_branch_path_parameter() { #[test] fn commit_show_has_commit_id_path_parameter() { let doc = openapi_json(); - let params = 
doc["paths"]["/graphs/{graph_id}/commits/{commit_id}"]["get"]["parameters"] + let params = doc["paths"]["/commits/{commit_id}"]["get"]["parameters"] .as_array() .unwrap(); let has_commit_id = params @@ -807,7 +771,7 @@ fn commit_show_has_commit_id_path_parameter() { #[test] fn snapshot_has_branch_query_parameter() { let doc = openapi_json(); - let params = doc["paths"]["/graphs/{graph_id}/snapshot"]["get"]["parameters"] + let params = doc["paths"]["/snapshot"]["get"]["parameters"] .as_array() .unwrap(); let has_branch = params @@ -822,7 +786,7 @@ fn snapshot_has_branch_query_parameter() { #[test] fn commits_has_branch_query_parameter() { let doc = openapi_json(); - let params = doc["paths"]["/graphs/{graph_id}/commits"]["get"]["parameters"] + let params = doc["paths"]["/commits"]["get"]["parameters"] .as_array() .unwrap(); let has_branch = params @@ -862,7 +826,7 @@ fn openapi_operations_have_tags() { #[test] fn read_endpoint_200_references_read_output_schema() { let doc = openapi_json(); - let content = &doc["paths"]["/graphs/{graph_id}/read"]["post"]["responses"]["200"]["content"]; + let content = &doc["paths"]["/read"]["post"]["responses"]["200"]["content"]; let schema = &content["application/json"]["schema"]; let ref_path = schema["$ref"].as_str().unwrap(); assert!( @@ -874,7 +838,7 @@ fn read_endpoint_200_references_read_output_schema() { #[test] fn change_endpoint_200_references_change_output_schema() { let doc = openapi_json(); - let content = &doc["paths"]["/graphs/{graph_id}/change"]["post"]["responses"]["200"]["content"]; + let content = &doc["paths"]["/change"]["post"]["responses"]["200"]["content"]; let schema = &content["application/json"]["schema"]; let ref_path = schema["$ref"].as_str().unwrap(); assert!( @@ -899,11 +863,11 @@ fn healthz_200_references_health_output_schema() { fn error_responses_reference_error_output_schema() { let doc = openapi_json(); let paths_with_errors = [ - ("/graphs/{graph_id}/read", "post", "400"), - 
("/graphs/{graph_id}/read", "post", "401"), - ("/graphs/{graph_id}/change", "post", "400"), - ("/graphs/{graph_id}/change", "post", "409"), - ("/graphs/{graph_id}/branches", "post", "409"), + ("/read", "post", "400"), + ("/read", "post", "401"), + ("/change", "post", "400"), + ("/change", "post", "409"), + ("/branches", "post", "409"), ]; for (path, method, status) in paths_with_errors { @@ -925,13 +889,13 @@ fn error_responses_reference_error_output_schema() { fn post_endpoints_have_request_body() { let doc = openapi_json(); let post_paths = [ - ("/graphs/{graph_id}/read", "ReadRequest"), - ("/graphs/{graph_id}/change", "ChangeRequest"), - ("/graphs/{graph_id}/schema/apply", "SchemaApplyRequest"), - ("/graphs/{graph_id}/ingest", "IngestRequest"), - ("/graphs/{graph_id}/export", "ExportRequest"), - ("/graphs/{graph_id}/branches", "BranchCreateRequest"), - ("/graphs/{graph_id}/branches/merge", "BranchMergeRequest"), + ("/read", "ReadRequest"), + ("/change", "ChangeRequest"), + ("/schema/apply", "SchemaApplyRequest"), + ("/ingest", "IngestRequest"), + ("/export", "ExportRequest"), + ("/branches", "BranchCreateRequest"), + ("/branches/merge", "BranchMergeRequest"), ]; for (path, expected_schema) in post_paths { @@ -949,34 +913,6 @@ fn post_endpoints_have_request_body() { } } -#[test] -fn invoke_stored_query_request_body_is_optional() { - let doc = openapi_json(); - let request_body = &doc["paths"]["/graphs/{graph_id}/queries/{name}"]["post"]["requestBody"]; - assert!( - request_body.is_object(), - "POST /queries/{{name}} should document its optional request body" - ); - assert_eq!( - request_body["required"].as_bool().unwrap_or(false), - false, - "stored-query invocation body should be optional" - ); - let schema = &request_body["content"]["application/json"]["schema"]; - let ref_path = schema["$ref"] - .as_str() - .or_else(|| { - schema["oneOf"] - .as_array() - .and_then(|schemas| schemas.iter().find_map(|schema| schema["$ref"].as_str())) - }) - .unwrap(); - assert!( 
- ref_path.contains("InvokeStoredQueryRequest"), - "POST /queries/{{name}} requestBody should reference InvokeStoredQueryRequest, got {ref_path}" - ); -} - // --------------------------------------------------------------------------- // Serialization round-trip test // --------------------------------------------------------------------------- @@ -1055,14 +991,12 @@ async fn auth_mode_spec_has_security_on_protected_operations() { .body(Body::empty()) .unwrap(); let (_, json) = json_response(&app, request).await; - // RFC-011 cluster-only: the served spec always nests protected - // routes under `/graphs/{graph_id}/...`. let protected_paths = [ - ("/graphs/{graph_id}/read", "post"), - ("/graphs/{graph_id}/change", "post"), - ("/graphs/{graph_id}/snapshot", "get"), - ("/graphs/{graph_id}/branches", "get"), - ("/graphs/{graph_id}/commits", "get"), + ("/read", "post"), + ("/change", "post"), + ("/snapshot", "get"), + ("/branches", "get"), + ("/commits", "get"), ]; for (path, method) in protected_paths { let security = &json["paths"][path][method]["security"]; @@ -1079,6 +1013,22 @@ async fn auth_mode_spec_has_security_on_protected_operations() { } } +#[tokio::test] +async fn auth_mode_spec_matches_static_generation() { + let (_temp, app) = app_for_loaded_graph_with_auth("secret").await; + let request = Request::builder() + .method(Method::GET) + .uri("/openapi.json") + .body(Body::empty()) + .unwrap(); + let (_, served) = json_response(&app, request).await; + let static_doc = openapi_json(); + assert_eq!( + served, static_doc, + "auth-mode served spec must match static generation" + ); +} + #[tokio::test] async fn auth_mode_healthz_still_has_no_security() { let (_temp, app) = app_for_loaded_graph_with_auth("secret").await; @@ -1167,7 +1117,6 @@ async fn app_for_multi_mode(graph_ids: &[&str]) -> (Vec, Rout uri: graph_uri, engine: Arc::new(engine), policy: None, - queries: None, })); dirs.push(dir); } @@ -1384,9 +1333,8 @@ async fn multi_mode_operation_ids_are_unique() 
{ } #[tokio::test] -async fn served_spec_always_nests_under_cluster_prefix() { - // RFC-011 cluster-only: even a one-graph convenience app serves the - // nested cluster surface and never the flat protected routes. +async fn single_mode_openapi_unchanged_by_cluster_filter() { + // Regression: single mode still emits the legacy flat surface. let (_temp, app) = app_for_loaded_graph().await; let request = Request::builder() .method(Method::GET) @@ -1396,37 +1344,16 @@ async fn served_spec_always_nests_under_cluster_prefix() { let (_, json) = json_response(&app, request).await; let paths = json["paths"].as_object().unwrap(); let path_keys: HashSet<&str> = paths.keys().map(|k| k.as_str()).collect(); - for cluster in EXPECTED_CLUSTER_PATHS { + for expected in EXPECTED_PATHS { assert!( - path_keys.contains(cluster), - "served spec must emit cluster path: {cluster}. Found: {path_keys:?}" + path_keys.contains(expected), + "single mode must still emit flat path: {expected}" ); } - // The flat protected routes must NOT appear — only the nested - // cluster surface plus the always-flat `/healthz` and `/graphs`. - let flat_protected = [ - "/snapshot", - "/read", - "/query", - "/export", - "/change", - "/mutate", - "/queries", - "/queries/{name}", - "/schema", - "/schema/apply", - "/load", - "/ingest", - "/branches", - "/branches/{branch}", - "/branches/merge", - "/commits", - "/commits/{commit_id}", - ]; - for flat in flat_protected { + for cluster in EXPECTED_CLUSTER_PATHS { assert!( - !path_keys.contains(flat), - "served spec must NOT emit flat protected path: {flat}" + !path_keys.contains(cluster), + "single mode must NOT emit cluster path: {cluster}" ); } } diff --git a/crates/omnigraph-server/tests/s3.rs b/crates/omnigraph-server/tests/s3.rs deleted file mode 100644 index 793d79d..0000000 --- a/crates/omnigraph-server/tests/s3.rs +++ /dev/null @@ -1,179 +0,0 @@ -//! S3-backed single-graph serving (gated on OMNIGRAPH_S3_TEST_BUCKET). -//!
Moved verbatim from tests/server.rs in the modularization. - -use std::fs; - -use axum::body::Body; -use axum::http::{Method, Request, StatusCode}; -use omnigraph::db::Omnigraph; -use omnigraph::loader::{LoadMode, load_jsonl}; -use omnigraph_server::api::ReadRequest; -use omnigraph_server::{AppState, build_app}; -use serde_json::json; - -mod support; -use support::*; - -#[tokio::test(flavor = "multi_thread")] -async fn server_opens_s3_graph_directly_and_serves_snapshot_and_read() { - let Some(uri) = s3_test_graph_uri("server") else { - eprintln!("skipping s3 server test: OMNIGRAPH_S3_TEST_BUCKET is not set"); - return; - }; - - Omnigraph::init(&uri, &fs::read_to_string(fixture("test.pg")).unwrap()) - .await - .unwrap(); - let mut db = Omnigraph::open(&uri).await.unwrap(); - load_jsonl( - &mut db, - &fs::read_to_string(fixture("test.jsonl")).unwrap(), - LoadMode::Overwrite, - ) - .await - .unwrap(); - - let app = build_app( - AppState::open_with_bearer_token(uri.clone(), Some("s3-token".to_string())) - .await - .unwrap(), - ); - - let (snapshot_status, snapshot_body) = json_response( - &app, - Request::builder() - .uri(g("/snapshot")) - .method(Method::GET) - .header("authorization", "Bearer s3-token") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(snapshot_status, StatusCode::OK); - assert!(snapshot_body["tables"].is_array()); - - let read = ReadRequest { - query_source: fs::read_to_string(fixture("test.gq")).unwrap(), - query_name: Some("get_person".to_string()), - params: Some(json!({ "name": "Alice" })), - branch: Some("main".to_string()), - snapshot: None, - }; - let (read_status, read_body) = json_response( - &app, - Request::builder() - .uri(g("/read")) - .method(Method::POST) - .header("authorization", "Bearer s3-token") - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&read).unwrap())) - .unwrap(), - ) - .await; - assert_eq!(read_status, StatusCode::OK); - assert_eq!(read_body["row_count"], 1); - 
assert_eq!(read_body["rows"][0]["p.name"], "Alice"); -} - -/// Config-free cluster serving (RFC-006): boot `--cluster s3://bucket/prefix` -/// with NO local files at all — the ledger and catalog on the bucket are the -/// whole deployment artifact. The fixture cluster is applied from a temp -/// config dir, which is then dropped before the server boots from the URI. -#[tokio::test(flavor = "multi_thread")] -async fn server_boots_cluster_from_bare_storage_uri_and_serves_query() { - let Some(bucket) = std::env::var("OMNIGRAPH_S3_TEST_BUCKET").ok() else { - eprintln!("skipping s3 cluster-serving test: OMNIGRAPH_S3_TEST_BUCKET is not set"); - return; - }; - let unique = format!( - "{}-{}", - std::process::id(), - std::time::SystemTime::now() - .duration_since(std::time::UNIX_EPOCH) - .unwrap() - .as_nanos() - ); - let root = format!("s3://{bucket}/cluster-serve/{unique}"); - - // Apply a one-graph cluster onto the bucket, seed it, then DROP the - // config dir — the boot below must need nothing local.
- { - let dir = tempfile::tempdir().unwrap(); - fs::write( - dir.path().join("people.pg"), - "node Person {\n name: String @key\n}\n", - ) - .unwrap(); - fs::write( - dir.path().join("people.gq"), - "query find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n", - ) - .unwrap(); - fs::write( - dir.path().join("cluster.yaml"), - format!( - "version: 1\nstorage: {root}\ngraphs:\n knowledge:\n schema: people.pg\n queries:\n find_person:\n file: people.gq\n" - ), - ) - .unwrap(); - let import = omnigraph_cluster::import_config_dir(dir.path()).await; - assert!(import.ok, "{:?}", import.diagnostics); - let apply = omnigraph_cluster::apply_config_dir(dir.path()).await; - assert!(apply.ok && apply.converged, "{:?}", apply.diagnostics); - - let graph_uri = format!("{root}/graphs/knowledge.omni"); - let mut db = Omnigraph::open(&graph_uri).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Ada\"}}\n", - LoadMode::Overwrite, - ) - .await - .unwrap(); - } - - let settings = omnigraph_server::load_server_settings( - Some(&std::path::PathBuf::from(&root)), - None, - true, - false, - ) - .await - .unwrap(); - let omnigraph_server::ServerConfigMode::Multi { - graphs, - config_path, - server_policy, - } = settings.mode - else { - panic!("cluster boot must select multi-graph routing"); - }; - let state = omnigraph_server::open_multi_graph_state( - graphs, - Vec::new(), - server_policy.as_ref(), - config_path, - false, - ) - .await - .unwrap(); - let app = build_app(state); - - let response = tower::ServiceExt::oneshot( - app, - Request::builder() - .method(Method::POST) - .uri("/graphs/knowledge/queries/find_person") - .header("content-type", "application/json") - .body(Body::from(json!({"params": {"name": "Ada"}}).to_string())) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(response.status(), StatusCode::OK); - let bytes = axum::body::to_bytes(response.into_body(), usize::MAX) - .await - .unwrap(); - let 
value: serde_json::Value = serde_json::from_slice(&bytes).unwrap(); - assert_eq!(value["rows"][0]["p.name"], "Ada", "{value}"); -} diff --git a/crates/omnigraph-server/tests/schema_routes.rs b/crates/omnigraph-server/tests/schema_routes.rs deleted file mode 100644 index c73591c..0000000 --- a/crates/omnigraph-server/tests/schema_routes.rs +++ /dev/null @@ -1,950 +0,0 @@ -//! Schema read/apply routes: migrations over HTTP, drift, gating. -//! Moved verbatim from tests/server.rs in the modularization. - -use std::fs; -use std::sync::Arc; - -use axum::body::Body; -use axum::http::{Method, Request, StatusCode}; -use lance::index::DatasetIndexExt; -use omnigraph::db::{Omnigraph, ReadTarget}; -use omnigraph::loader::LoadMode; -use omnigraph_server::api::{ - ChangeRequest, ErrorOutput, ReadRequest, SchemaApplyRequest, SchemaOutput, -}; -use omnigraph_server::{ - AppState, GraphHandle, GraphId, GraphKey, PolicyEngine, build_app, workload, -}; -use serde_json::json; - - -mod support; -use support::*; - -#[tokio::test] -async fn schema_apply_route_updates_graph_for_authorized_admin() { - let (temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - let schema = additive_schema_with_nickname(); - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: schema, - ..Default::default() - }) - .unwrap(), - )) - .unwrap(); - let (status, payload) = json_response(&app, request).await; - - assert_eq!(status, StatusCode::OK); - assert_eq!(payload["applied"], true); - let graph = graph_path(temp.path()); - let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - assert!( - reopened.catalog().node_types["Person"] - .properties - 
.contains_key("nickname") - ); -} - -#[tokio::test] -async fn schema_apply_route_refuses_cluster_backed_server_mode() { - let temp = init_graph_with_schema(&fs::read_to_string(fixture("test.pg")).unwrap()).await; - let graph = graph_path(temp.path()); - let graph_uri = graph.to_string_lossy().to_string(); - let engine = Omnigraph::open(&graph_uri).await.unwrap(); - let handle = Arc::new(GraphHandle { - key: GraphKey::cluster(GraphId::try_from("default").unwrap()), - uri: graph_uri.clone(), - engine: Arc::new(engine), - policy: None, - queries: None, - }); - let state = AppState::new_multi( - vec![handle], - Vec::new(), - None, - workload::WorkloadController::from_env(), - Some(temp.path().join("cluster.yaml")), - ) - .unwrap(); - let app = build_app(state); - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: additive_schema_with_nickname(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(); - let (status, payload) = json_response(&app, request).await; - - assert_eq!(status, StatusCode::CONFLICT, "body: {payload}"); - assert!( - payload["error"] - .as_str() - .unwrap_or_default() - .contains("cluster apply"), - "body: {payload}" - ); - let reopened = Omnigraph::open(&graph_uri).await.unwrap(); - assert!( - !reopened.catalog().node_types["Person"] - .properties - .contains_key("nickname"), - "cluster-backed schema apply must not mutate the graph" - ); -} - -#[tokio::test] -async fn schema_apply_route_cluster_backed_denies_unauthorized_actor_before_409() { - // The cluster-backed 409 is reported AFTER the Cedar gate, so an actor - // without `schema_apply` permission gets a 403 — never a 409 that would - // disclose the server is cluster-backed (401 → 403 → 409, no topology leak - // before authorization). POLICY_YAML grants read/export but not schema_apply, - // so act-ragnor is denied.
- let temp = init_graph_with_schema(&fs::read_to_string(fixture("test.pg")).unwrap()).await; - let graph = graph_path(temp.path()); - let graph_uri = graph.to_string_lossy().to_string(); - let engine = Omnigraph::open(&graph_uri).await.unwrap(); - let policy = PolicyEngine::load_graph_from_source(POLICY_YAML, "default").unwrap(); - let handle = Arc::new(GraphHandle { - key: GraphKey::cluster(GraphId::try_from("default").unwrap()), - uri: graph_uri, - engine: Arc::new(engine), - policy: Some(Arc::new(policy)), - queries: None, - }); - let state = AppState::new_multi( - vec![handle], - vec![("act-ragnor".to_string(), "admin-token".to_string())], - None, - workload::WorkloadController::from_env(), - Some(temp.path().join("cluster.yaml")), - ) - .unwrap(); - let app = build_app(state); - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: additive_schema_with_nickname(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(); - let (status, payload) = json_response(&app, request).await; - - assert_eq!( - status, - StatusCode::FORBIDDEN, - "an unauthorized actor must get 403 before the cluster-backed 409: {payload}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn schema_apply_route_rejects_stored_query_breakage_before_publish() { - let (temp, app) = app_with_stored_queries( - &[("find_person", FIND_PERSON_GQ, true)], - &[("act-ragnor", "admin-token")], - STORED_QUERY_SCHEMA_APPLY_POLICY_YAML, - ) - .await; - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: renamed_age_schema(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(); - let 
(status, payload) = json_response(&app, request).await; - assert_eq!(status, StatusCode::BAD_REQUEST, "body: {payload}"); - let message = payload["error"].as_str().unwrap_or_default(); - assert!( - message.contains("find_person") && message.contains("schema check"), - "registry breakage should name the stored query; body: {payload}" - ); - - let reopened = Omnigraph::open(graph_path(temp.path()).to_str().unwrap()) - .await - .unwrap(); - let person = &reopened.catalog().node_types["Person"]; - assert!(person.properties.contains_key("age")); - assert!(!person.properties.contains_key("years")); - - let (invoke_status, invoke_body) = json_response( - &app, - invoke_request( - "find_person", - "admin-token", - json!({ "params": { "name": "Alice" } }), - ), - ) - .await; - assert_eq!(invoke_status, StatusCode::OK, "body: {invoke_body}"); - assert_eq!(invoke_body["row_count"], 1); -} - -#[tokio::test(flavor = "multi_thread")] -async fn schema_apply_route_noop_keeps_valid_stored_query_registry() { - let (_temp, app) = app_with_stored_queries( - &[("find_person", FIND_PERSON_GQ, true)], - &[("act-ragnor", "admin-token")], - STORED_QUERY_SCHEMA_APPLY_POLICY_YAML, - ) - .await; - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: fs::read_to_string(fixture("test.pg")).unwrap(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(); - let (status, payload) = json_response(&app, request).await; - assert_eq!(status, StatusCode::OK, "body: {payload}"); - assert_eq!(payload["applied"], false); -} - -#[tokio::test] -async fn schema_apply_route_requires_schema_apply_policy_permission() { - let (_temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - POLICY_YAML, - ) - .await; - - 
let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: additive_schema_with_nickname(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(); - let (status, payload) = json_response(&app, request).await; - - assert_eq!(status, StatusCode::FORBIDDEN); - assert_eq!( - payload["code"], - serde_json::to_value(omnigraph_server::api::ErrorCode::Forbidden).unwrap() - ); -} - -#[tokio::test] -async fn schema_apply_route_requires_bearer_token_when_policy_enabled() { - let (_temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: additive_schema_with_nickname(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(); - let (status, payload) = json_response(&app, request).await; - - assert_eq!(status, StatusCode::UNAUTHORIZED); - assert_eq!( - payload["code"], - serde_json::to_value(omnigraph_server::api::ErrorCode::Unauthorized).unwrap() - ); -} - -#[tokio::test] -async fn schema_apply_route_can_rename_type() { - let (temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: renamed_person_schema(), - ..Default::default() - }) - .unwrap(), - )) - 
.unwrap(); - let (status, payload) = json_response(&app, request).await; - - assert_eq!(status, StatusCode::OK); - assert_eq!(payload["applied"], true); - let graph = graph_path(temp.path()); - let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - let snapshot = reopened - .snapshot_of(ReadTarget::branch("main")) - .await - .unwrap(); - assert!(snapshot.entry("node:Human").is_some()); - assert!(snapshot.entry("node:Person").is_none()); -} - -#[tokio::test] -async fn schema_apply_route_can_rename_property() { - let (temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: renamed_age_schema(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(); - let (status, payload) = json_response(&app, request).await; - - assert_eq!(status, StatusCode::OK); - assert_eq!(payload["applied"], true); - let graph = graph_path(temp.path()); - let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - let person = &reopened.catalog().node_types["Person"]; - assert!(person.properties.contains_key("years")); - assert!(!person.properties.contains_key("age")); -} - -#[tokio::test] -async fn schema_apply_route_can_add_index() { - let (temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - let graph = graph_path(temp.path()); - let before_index_count = { - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - let snapshot = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); - let dataset = 
snapshot.open("node:Person").await.unwrap(); - dataset.load_indices().await.unwrap().len() - }; - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: indexed_name_schema(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(); - let (status, payload) = json_response(&app, request).await; - - assert_eq!(status, StatusCode::OK); - assert_eq!(payload["applied"], true); - // iss-848: the /schema/apply route accepts the index-add and applies it as a - // metadata change — it records the `@index` intent in the catalog/IR but does - // NOT build the physical index inline (the build is deferred to - // ensure_indices/optimize; on this empty table nothing would build anyway). - // So the physical index count is unchanged by the apply. - let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - let snapshot = reopened - .snapshot_of(ReadTarget::branch("main")) - .await - .unwrap(); - let dataset = snapshot.open("node:Person").await.unwrap(); - let after_index_count = dataset.load_indices().await.unwrap().len(); - assert_eq!( - after_index_count, before_index_count, - "schema apply records @index intent but defers the physical build (iss-848)" - ); -} - -#[tokio::test] -async fn schema_apply_route_rejects_unsupported_plan() { - let (_temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: unsupported_schema_change(), - ..Default::default() - }) - .unwrap(), - ))
- .unwrap(); - let (status, payload) = json_response(&app, request).await; - - assert_eq!(status, StatusCode::BAD_REQUEST); - assert_eq!( - payload["code"], - serde_json::to_value(omnigraph_server::api::ErrorCode::BadRequest).unwrap() - ); -} - -#[tokio::test] -async fn schema_apply_route_rejects_when_non_main_branch_exists() { - let temp = init_graph_with_schema(&fs::read_to_string(fixture("test.pg")).unwrap()).await; - let graph = graph_path(temp.path()); - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.branch_create("feature").await.unwrap(); - drop(db); - - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, SCHEMA_APPLY_POLICY_YAML).unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![("act-ragnor".to_string(), "admin-token".to_string())], - Some(&policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - - let request = Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: additive_schema_with_nickname(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(); - let (status, payload) = json_response(&app, request).await; - - assert_eq!(status, StatusCode::CONFLICT); - assert_eq!( - payload["code"], - serde_json::to_value(omnigraph_server::api::ErrorCode::Conflict).unwrap() - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn schema_drift_returns_conflict_for_snapshot_read_and_change() { - let (temp, app) = app_for_loaded_graph().await; - let graph = graph_path(temp.path()); - fs::write(graph.join("_schema.pg"), drifted_test_schema()).unwrap(); - - let (snapshot_status, snapshot_body) = json_response( - &app, - Request::builder() - .uri(g("/snapshot?branch=main")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - 
let snapshot_error: ErrorOutput = serde_json::from_value(snapshot_body).unwrap(); - assert_eq!(snapshot_status, StatusCode::CONFLICT); - assert_eq!( - snapshot_error.code, - Some(omnigraph_server::api::ErrorCode::Conflict) - ); - assert!( - snapshot_error - .error - .contains("schema evolution is locked down in phase 1") - ); - - let read = ReadRequest { - query_source: fs::read_to_string(fixture("test.gq")).unwrap(), - query_name: Some("get_person".to_string()), - params: Some(json!({ "name": "Alice" })), - branch: Some("main".to_string()), - snapshot: None, - }; - let (read_status, read_body) = json_response( - &app, - Request::builder() - .uri(g("/read")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&read).unwrap())) - .unwrap(), - ) - .await; - let read_error: ErrorOutput = serde_json::from_value(read_body).unwrap(); - assert_eq!(read_status, StatusCode::CONFLICT); - assert_eq!( - read_error.code, - Some(omnigraph_server::api::ErrorCode::Conflict) - ); - assert!( - read_error - .error - .contains("schema evolution is locked down in phase 1") - ); - - let change = ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": "Mina", "age": 28 })), - branch: Some("main".to_string()), - }; - let (change_status, change_body) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&change).unwrap())) - .unwrap(), - ) - .await; - let change_error: ErrorOutput = serde_json::from_value(change_body).unwrap(); - assert_eq!(change_status, StatusCode::CONFLICT); - assert_eq!( - change_error.code, - Some(omnigraph_server::api::ErrorCode::Conflict) - ); - assert!( - change_error - .error - .contains("schema evolution is locked down in phase 1") - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn 
schema_route_returns_current_source() { - let (_temp, app) = app_for_loaded_graph().await; - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/schema")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - - assert_eq!(status, StatusCode::OK); - let output: SchemaOutput = serde_json::from_value(body).unwrap(); - assert!(output.schema_source.contains("node Person")); -} - -#[tokio::test(flavor = "multi_thread")] -async fn schema_route_requires_bearer_token_when_auth_configured() { - let (_temp, app) = app_for_loaded_graph_with_auth("demo-token").await; - - let (missing_status, missing_body) = json_response( - &app, - Request::builder() - .uri(g("/schema")) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await; - let missing_error: ErrorOutput = serde_json::from_value(missing_body).unwrap(); - assert_eq!(missing_status, StatusCode::UNAUTHORIZED); - assert_eq!( - missing_error.code, - Some(omnigraph_server::api::ErrorCode::Unauthorized) - ); - - let (ok_status, ok_body) = json_response( - &app, - Request::builder() - .uri(g("/schema")) - .method(Method::GET) - .header("authorization", "Bearer demo-token") - .body(Body::empty()) - .unwrap(), - ) - .await; - assert_eq!(ok_status, StatusCode::OK); - let output: SchemaOutput = serde_json::from_value(ok_body).unwrap(); - assert!(!output.schema_source.is_empty()); -} - -#[tokio::test(flavor = "multi_thread")] -async fn schema_route_denied_when_actor_lacks_read_permission() { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let policy_path = temp.path().join("policy.yaml"); - // Policy grants branch_create only; no read action for act-bruno. 
- fs::write(&policy_path, INGEST_CREATE_ONLY_POLICY_YAML).unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![("act-bruno".to_string(), "team-token".to_string())], - Some(&policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - - let (status, body) = json_response( - &app, - Request::builder() - .uri(g("/schema")) - .method(Method::GET) - .header("authorization", "Bearer team-token") - .body(Body::empty()) - .unwrap(), - ) - .await; - let error: ErrorOutput = serde_json::from_value(body).unwrap(); - assert_eq!(status, StatusCode::FORBIDDEN); - assert_eq!( - error.code, - Some(omnigraph_server::api::ErrorCode::Forbidden) - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn schema_apply_route_soft_drops_property_via_http() { - let (temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - // Load a row that has the column we're about to drop. - let graph = graph_path(temp.path()); - { - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.load( - "main", - r#"{"type":"Person","data":{"name":"PreDrop","age":42}}"#, - LoadMode::Append, - ) - .await - .unwrap(); - } - let pre_version = manifest_dataset_version(&graph).await; - - let (status, payload) = json_response( - &app, - Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: schema_without_age(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(payload["applied"], true); - - // Catalog reflects the drop: `age` is gone from the live schema. 
- let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - assert!( - !reopened.catalog().node_types["Person"] - .properties - .contains_key("age"), - "catalog should not contain `age` after drop" - ); - - // Soft drop preserves the prior version: `age` is still readable - // via time travel to the pre-drop manifest version. Mirrors the - // SDK-side assertion in `apply_schema_drops_a_nullable_property_softly_preserves_prior_version`. - let pre_drop_snapshot = reopened.snapshot_at_version(pre_version).await.unwrap(); - let pre_drop_ds = pre_drop_snapshot.open("node:Person").await.unwrap(); - let pre_drop_fields = pre_drop_ds - .schema() - .fields - .iter() - .map(|f| f.name.clone()) - .collect::<Vec<_>>(); - assert!( - pre_drop_fields.iter().any(|f| f == "age"), - "soft drop should leave the pre-drop dataset's `age` column \ - time-travel-reachable; got fields {pre_drop_fields:?}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn schema_apply_route_soft_drops_node_type_via_http() { - let (temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - let graph = graph_path(temp.path()); - - let (status, payload) = json_response( - &app, - Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: schema_without_company(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(payload["applied"], true); - - let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - assert!( - !reopened.catalog().node_types.contains_key("Company"), - "catalog should not contain `Company` after drop" - ); - assert!( - !reopened.catalog().edge_types.contains_key("WorksAt"), 
- "catalog should not contain `WorksAt` after cascade" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn schema_apply_route_hard_drops_property_with_allow_data_loss() { - let (temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - let graph = graph_path(temp.path()); - { - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.load( - "main", - r#"{"type":"Person","data":{"name":"PreDropHard","age":50}}"#, - LoadMode::Append, - ) - .await - .unwrap(); - } - - // Apply with allow_data_loss=true → Hard mode promotion. - let (status, payload) = json_response( - &app, - Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: schema_without_age(), - allow_data_loss: true, - }) - .unwrap(), - )) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(payload["applied"], true); - - // Catalog reflects the drop. - let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - assert!( - !reopened.catalog().node_types["Person"] - .properties - .contains_key("age"), - "catalog should not contain `age` after Hard drop" - ); - // Plan steps should show DropMode::Hard for property drops. 
- let steps = payload["steps"].as_array().expect("steps array"); - let drop_step = steps - .iter() - .find(|s| s["kind"] == "drop_property") - .expect("plan should include drop_property step"); - let mode = &drop_step["mode"]; - assert_eq!( - mode, "hard", - "expected hard mode under allow_data_loss=true" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn schema_apply_route_keeps_drops_soft_without_flag() { - // Symmetric to the Hard test: same schema change, but no - // allow_data_loss flag → drops stay Soft (prior column data - // remains time-travel-reachable). Pins the default semantics - // against accidental Hard promotion. - let (temp, app) = app_for_graph_with_auth_tokens_and_policy( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - let graph = graph_path(temp.path()); - - let (status, payload) = json_response( - &app, - Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: schema_without_age(), - allow_data_loss: false, - }) - .unwrap(), - )) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(payload["applied"], true); - - let steps = payload["steps"].as_array().expect("steps array"); - let drop_step = steps - .iter() - .find(|s| s["kind"] == "drop_property") - .expect("plan should include drop_property step"); - let mode = &drop_step["mode"]; - assert_eq!(mode, "soft", "expected soft mode without allow_data_loss"); - let _ = graph; -} - -#[tokio::test(flavor = "multi_thread")] -async fn schema_apply_route_additive_property_preserves_existing_rows() { - // SDK suite covers rename and drop data preservation. Additive - // AddProperty wasn't pinned with a row-count check anywhere. 
- // Load N rows, apply schema adding nullable property, verify - // every row is still readable and the new column is null. - let (temp, app) = app_for_loaded_graph_with_auth_tokens_and_policy( - &[("act-ragnor", "admin-token")], - SCHEMA_APPLY_POLICY_YAML, - ) - .await; - let graph = graph_path(temp.path()); - - // Standard fixture data is loaded before the app is built, so the server - // handle applies schema from the same manifest it is serving. - let pre_count = { - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - let snap = db - .snapshot_of(omnigraph::db::ReadTarget::branch("main")) - .await - .unwrap(); - snap.open("node:Person") - .await - .expect("Person") - .count_rows(None) - .await - .unwrap() - }; - assert!(pre_count > 0, "fixture should have loaded Person rows"); - - let (status, payload) = json_response( - &app, - Request::builder() - .method(Method::POST) - .uri(g("/schema/apply")) - .header("content-type", "application/json") - .header("authorization", "Bearer admin-token") - .body(Body::from( - serde_json::to_vec(&SchemaApplyRequest { - schema_source: additive_schema_with_nickname(), - ..Default::default() - }) - .unwrap(), - )) - .unwrap(), - ) - .await; - assert_eq!(status, StatusCode::OK); - assert_eq!(payload["applied"], true); - - // Row count preserved. 
- let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - let snap = db - .snapshot_of(omnigraph::db::ReadTarget::branch("main")) - .await - .unwrap(); - let post_count = snap - .open("node:Person") - .await - .expect("Person") - .count_rows(None) - .await - .unwrap(); - assert_eq!( - post_count, pre_count, - "AddProperty should preserve row count", - ); -} diff --git a/crates/omnigraph-server/tests/server.rs b/crates/omnigraph-server/tests/server.rs new file mode 100644 index 0000000..3ace80e --- /dev/null +++ b/crates/omnigraph-server/tests/server.rs @@ -0,0 +1,5580 @@ +use std::env; +use std::fs; +use std::path::{Path, PathBuf}; +use std::sync::Arc; + +use axum::Router; +use axum::body::{Body, to_bytes}; +use axum::http::header::AUTHORIZATION; +use axum::http::{Method, Request, StatusCode}; +use lance::index::DatasetIndexExt; +use omnigraph::db::{Omnigraph, ReadTarget, SchemaApplyOptions}; +use omnigraph::error::OmniError; +use omnigraph::loader::{LoadMode, load_jsonl}; +use omnigraph_policy::{PolicyChecker, PolicyEngine}; +use omnigraph_server::api::{ + BranchCreateRequest, BranchMergeRequest, ChangeRequest, ErrorOutput, ExportRequest, + IngestRequest, QueryRequest, ReadRequest, SchemaApplyRequest, SchemaOutput, +}; +use omnigraph_server::{AppState, build_app}; +use serde_json::{Value, json}; +use serial_test::serial; +use tower::ServiceExt; + +const MUTATION_QUERIES: &str = r#" +query insert_person($name: String, $age: I32) { + insert Person { name: $name, age: $age } +} + +query set_age($name: String, $age: I32) { + update Person set { age: $age } where name = $name +} +"#; + +const POLICY_YAML: &str = r#" +version: 1 +groups: + team: [act-andrew, act-bruno, act-ragnor] + admins: [act-ragnor] +protected_branches: [main] +rules: + - id: team-read + allow: + actors: { group: team } + actions: [read] + branch_scope: any + - id: admins-export + allow: + actors: { group: admins } + actions: [export] + branch_scope: any + - id: team-write-unprotected + 
allow: + actors: { group: team } + actions: [change] + branch_scope: unprotected + - id: admins-merge + allow: + actors: { group: admins } + actions: [branch_delete, branch_merge] + target_branch_scope: protected +"#; + +const POLICY_PROTECTED_READ_YAML: &str = r#" +version: 1 +groups: + team: [act-bruno] +protected_branches: [main] +rules: + - id: protected-read + allow: + actors: { group: team } + actions: [read] + branch_scope: protected +"#; + +const INGEST_CREATE_ONLY_POLICY_YAML: &str = r#" +version: 1 +groups: + team: [act-bruno] +protected_branches: [main] +rules: + - id: team-branch-create + allow: + actors: { group: team } + actions: [branch_create] + target_branch_scope: unprotected +"#; + +const SCHEMA_APPLY_POLICY_YAML: &str = r#" +version: 1 +groups: + admins: [act-ragnor] +protected_branches: [main] +rules: + - id: admins-schema-apply + allow: + actors: { group: admins } + actions: [schema_apply] + target_branch_scope: protected +"#; + +fn fixture(name: &str) -> PathBuf { + PathBuf::from(env!("CARGO_MANIFEST_DIR")) + .join("../omnigraph/tests/fixtures") + .join(name) +} + +async fn init_loaded_graph() -> tempfile::TempDir { + init_graph_with_schema_and_data( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &fs::read_to_string(fixture("test.jsonl")).unwrap(), + ) + .await +} + +async fn init_graph_with_schema_and_data(schema: &str, data: &str) -> tempfile::TempDir { + let temp = tempfile::tempdir().unwrap(); + let graph = graph_path(temp.path()); + fs::create_dir_all(&graph).unwrap(); + Omnigraph::init(graph.to_str().unwrap(), schema) + .await + .unwrap(); + let mut db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + load_jsonl(&mut db, data, LoadMode::Overwrite) + .await + .unwrap(); + temp +} + +async fn init_graph_with_schema(schema: &str) -> tempfile::TempDir { + let temp = tempfile::tempdir().unwrap(); + let graph = graph_path(temp.path()); + fs::create_dir_all(&graph).unwrap(); + Omnigraph::init(graph.to_str().unwrap(), schema) 
+ .await + .unwrap(); + temp +} + +fn graph_path(root: &Path) -> PathBuf { + root.join("server.omni") +} + +fn drifted_test_schema() -> String { + fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace("age: I32?", "age: I64?") +} + +async fn manifest_dataset_version(graph: &Path) -> u64 { + Omnigraph::open(graph.to_string_lossy().as_ref()) + .await + .unwrap() + .snapshot_of(ReadTarget::branch("main")) + .await + .unwrap() + .version() +} + +fn s3_test_graph_uri(suite: &str) -> Option<String> { + let bucket = env::var("OMNIGRAPH_S3_TEST_BUCKET").ok()?; + let prefix = env::var("OMNIGRAPH_S3_TEST_PREFIX") + .ok() + .filter(|value| !value.trim().is_empty()) + .unwrap_or_else(|| "omnigraph-itests".to_string()); + let unique = std::time::SystemTime::now() + .duration_since(std::time::UNIX_EPOCH) + .ok()? + .as_nanos(); + Some(format!("s3://{}/{}/{}/{}", bucket, prefix, suite, unique)) +} + +async fn app_for_loaded_graph() -> (tempfile::TempDir, Router) { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let state = AppState::open(graph.to_string_lossy().to_string()) + .await + .unwrap(); + (temp, build_app(state)) +} + +/// Build a permit-all policy YAML that grants every action used by the +/// HTTP-layer tests to the listed actor names. MR-723 default-deny +/// closed the "tokens but no policy" loophole; helpers that used to +/// represent "auth without policy" now install this permit-all policy +/// so test cases retain their pre-MR-723 semantics ("auth required, +/// every action permitted") without conflicting with the new state +/// matrix. Tests that specifically need the State-2 deny path use +/// `app_for_graph_with_auth_tokens_only` instead. 
+fn permit_all_policy_yaml(actors: &[&str]) -> String { + let members = actors + .iter() + .map(|a| format!("\"{a}\"")) + .collect::<Vec<_>>() + .join(", "); + format!( + r#" +version: 1 +groups: + permitted: [{members}] +protected_branches: [main] +rules: + - id: permit-data + allow: + actors: {{ group: permitted }} + actions: [read, change, export] + branch_scope: any + - id: permit-protected-target-actions + allow: + actors: {{ group: permitted }} + actions: [schema_apply, branch_create, branch_delete, branch_merge] + target_branch_scope: any +"# + ) +} + +async fn app_for_loaded_graph_with_auth(token: &str) -> (tempfile::TempDir, Router) { + // `AppState::new_with_bearer_token(token)` maps the token to actor "default"; + // permit-all policy needs to include that actor. + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, permit_all_policy_yaml(&["default"])).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![("default".to_string(), token.to_string())], + Some(&policy_path), + ) + .await + .unwrap(); + (temp, build_app(state)) +} + +async fn app_for_loaded_graph_with_auth_tokens( + tokens: &[(&str, &str)], +) -> (tempfile::TempDir, Router) { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let policy_path = temp.path().join("policy.yaml"); + let actors: Vec<&str> = tokens.iter().map(|(actor, _)| *actor).collect(); + fs::write(&policy_path, permit_all_policy_yaml(&actors)).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + tokens + .iter() + .map(|(actor, token)| ((*actor).to_string(), (*token).to_string())) + .collect(), + Some(&policy_path), + ) + .await + .unwrap(); + (temp, build_app(state)) +} + +async fn app_for_loaded_graph_with_auth_tokens_and_policy( + tokens: &[(&str, &str)], + policy: &str, +) -> 
(tempfile::TempDir, Router) { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, policy).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + tokens + .iter() + .map(|(actor, token)| ((*actor).to_string(), (*token).to_string())) + .collect(), + Some(&policy_path), + ) + .await + .unwrap(); + (temp, build_app(state)) +} + +async fn app_for_graph_with_auth_tokens_and_policy( + schema: &str, + tokens: &[(&str, &str)], + policy: &str, +) -> (tempfile::TempDir, Router) { + let temp = init_graph_with_schema(schema).await; + let graph = graph_path(temp.path()); + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, policy).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + tokens + .iter() + .map(|(actor, token)| ((*actor).to_string(), (*token).to_string())) + .collect(), + Some(&policy_path), + ) + .await + .unwrap(); + (temp, build_app(state)) +} + +/// MR-723 default-deny mode: bearer tokens configured, no policy file. +/// Exercises ServerRuntimeState::DefaultDeny: authenticated requests 
+async fn app_for_graph_with_auth_tokens_only( + schema: &str, + tokens: &[(&str, &str)], +) -> (tempfile::TempDir, Router) { + let temp = init_graph_with_schema(schema).await; + let graph = graph_path(temp.path()); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + tokens + .iter() + .map(|(actor, token)| ((*actor).to_string(), (*token).to_string())) + .collect(), + None, + ) + .await + .unwrap(); + (temp, build_app(state)) +} + +fn additive_schema_with_nickname() -> String { + fs::read_to_string(fixture("test.pg")).unwrap().replace( + " age: I32?\n}", + " age: I32?\n nickname: String?\n}", + ) +} + +fn schema_without_age() -> String { + // Drop the nullable `age` column from the test schema. Used by the + // HTTP soft/hard drop tests below. + fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace(" age: I32?\n", "") +} + +fn schema_without_company() -> String { + // Drop the `Company` node type and the edge referencing it. Used + // by the HTTP DropType test below. Hand-crafted (no template + // string replace) because the fixture interleaves the type and + // its edge. + r#"node Person { + name: String @key + age: I32? +} + +edge Knows: Person -> Person { + since: Date? +} +"# + .to_string() +} + +fn renamed_person_schema() -> String { + fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace("node Person {\n", "node Human @rename_from(\"Person\") {\n") + .replace("edge Knows: Person -> Person", "edge Knows: Human -> Human") + .replace( + "edge WorksAt: Person -> Company", + "edge WorksAt: Human -> Company", + ) +} + +fn renamed_age_schema() -> String { + fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace("age: I32?", "years: I32? 
@rename_from(\"age\")") +} + +fn indexed_name_schema() -> String { + fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace("name: String @key", "name: String @key @index") +} + +fn unsupported_schema_change() -> String { + fs::read_to_string(fixture("test.pg")) + .unwrap() + .replace("age: I32?", "age: I64?") +} + +async fn json_response(app: &Router, request: Request<Body>) -> (StatusCode, Value) { + let response = app.clone().oneshot(request).await.unwrap(); + let status = response.status(); + let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); + let value = serde_json::from_slice(&body).unwrap(); + (status, value) +} + +#[tokio::test] +async fn schema_apply_route_updates_graph_for_authorized_admin() { + let (temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + let schema = additive_schema_with_nickname(); + + let request = Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: schema, + ..Default::default() + }) + .unwrap(), + )) + .unwrap(); + let (status, payload) = json_response(&app, request).await; + + assert_eq!(status, StatusCode::OK); + assert_eq!(payload["applied"], true); + let graph = graph_path(temp.path()); + let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + assert!( + reopened.catalog().node_types["Person"] + .properties + .contains_key("nickname") + ); +} + +#[tokio::test] +async fn schema_apply_route_requires_schema_apply_policy_permission() { + let (_temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + POLICY_YAML, + ) + .await; + + let request = Request::builder() + .method(Method::POST) + 
.uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: additive_schema_with_nickname(), + ..Default::default() + }) + .unwrap(), + )) + .unwrap(); + let (status, payload) = json_response(&app, request).await; + + assert_eq!(status, StatusCode::FORBIDDEN); + assert_eq!( + payload["code"], + serde_json::to_value(omnigraph_server::api::ErrorCode::Forbidden).unwrap() + ); +} + +#[tokio::test] +async fn schema_apply_route_requires_bearer_token_when_policy_enabled() { + let (_temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + + let request = Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: additive_schema_with_nickname(), + ..Default::default() + }) + .unwrap(), + )) + .unwrap(); + let (status, payload) = json_response(&app, request).await; + + assert_eq!(status, StatusCode::UNAUTHORIZED); + assert_eq!( + payload["code"], + serde_json::to_value(omnigraph_server::api::ErrorCode::Unauthorized).unwrap() + ); +} + +#[tokio::test] +async fn schema_apply_route_can_rename_type() { + let (temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + + let request = Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: renamed_person_schema(), + ..Default::default() + }) + .unwrap(), + )) + .unwrap(); + let (status, payload) = json_response(&app, 
request).await; + + assert_eq!(status, StatusCode::OK); + assert_eq!(payload["applied"], true); + let graph = graph_path(temp.path()); + let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + let snapshot = reopened + .snapshot_of(ReadTarget::branch("main")) + .await + .unwrap(); + assert!(snapshot.entry("node:Human").is_some()); + assert!(snapshot.entry("node:Person").is_none()); +} + +#[tokio::test] +async fn schema_apply_route_can_rename_property() { + let (temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + + let request = Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: renamed_age_schema(), + ..Default::default() + }) + .unwrap(), + )) + .unwrap(); + let (status, payload) = json_response(&app, request).await; + + assert_eq!(status, StatusCode::OK); + assert_eq!(payload["applied"], true); + let graph = graph_path(temp.path()); + let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + let person = &reopened.catalog().node_types["Person"]; + assert!(person.properties.contains_key("years")); + assert!(!person.properties.contains_key("age")); +} + +#[tokio::test] +async fn schema_apply_route_can_add_index() { + let (temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + let graph = graph_path(temp.path()); + let before_index_count = { + let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + let snapshot = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); + let dataset = snapshot.open("node:Person").await.unwrap(); + 
dataset.load_indices().await.unwrap().len() + }; + + let request = Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: indexed_name_schema(), + ..Default::default() + }) + .unwrap(), + )) + .unwrap(); + let (status, payload) = json_response(&app, request).await; + + assert_eq!(status, StatusCode::OK); + assert_eq!(payload["applied"], true); + let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + let snapshot = reopened + .snapshot_of(ReadTarget::branch("main")) + .await + .unwrap(); + let dataset = snapshot.open("node:Person").await.unwrap(); + let after_index_count = dataset.load_indices().await.unwrap().len(); + assert!(after_index_count > before_index_count); +} + +#[tokio::test] +async fn schema_apply_route_rejects_unsupported_plan() { + let (_temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + + let request = Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: unsupported_schema_change(), + ..Default::default() + }) + .unwrap(), + )) + .unwrap(); + let (status, payload) = json_response(&app, request).await; + + assert_eq!(status, StatusCode::BAD_REQUEST); + assert_eq!( + payload["code"], + serde_json::to_value(omnigraph_server::api::ErrorCode::BadRequest).unwrap() + ); +} + +#[tokio::test] +async fn schema_apply_route_rejects_when_non_main_branch_exists() { + let temp = init_graph_with_schema(&fs::read_to_string(fixture("test.pg")).unwrap()).await; + let graph = graph_path(temp.path()); + let mut db = 
Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.branch_create("feature").await.unwrap(); + drop(db); + + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, SCHEMA_APPLY_POLICY_YAML).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![("act-ragnor".to_string(), "admin-token".to_string())], + Some(&policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + + let request = Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: additive_schema_with_nickname(), + ..Default::default() + }) + .unwrap(), + )) + .unwrap(); + let (status, payload) = json_response(&app, request).await; + + assert_eq!(status, StatusCode::CONFLICT); + assert_eq!( + payload["code"], + serde_json::to_value(omnigraph_server::api::ErrorCode::Conflict).unwrap() + ); +} + +struct EnvGuard { + saved: Vec<(&'static str, Option<String>)>, +} + +impl EnvGuard { + fn set(vars: &[(&'static str, Option<&str>)]) -> Self { + let saved = vars + .iter() + .map(|(name, _)| (*name, env::var(name).ok())) + .collect::<Vec<_>>(); + for (name, value) in vars { + unsafe { + match value { + Some(value) => env::set_var(name, value), + None => env::remove_var(name), + } + } + } + Self { saved } + } +} + +impl Drop for EnvGuard { + fn drop(&mut self) { + for (name, value) in self.saved.drain(..) 
{ + unsafe { + match value { + Some(value) => env::set_var(name, value), + None => env::remove_var(name), + } + } + } + } +} + +fn format_vector(values: &[f32]) -> String { + values + .iter() + .map(|value| format!("{:.8}", value)) + .collect::<Vec<_>>() + .join(", ") +} + +fn normalize_vector(mut values: Vec<f32>) -> Vec<f32> { + let norm = values + .iter() + .map(|value| (*value as f64) * (*value as f64)) + .sum::<f64>() + .sqrt() as f32; + if norm > f32::EPSILON { + for value in &mut values { + *value /= norm; + } + } + values +} + +fn fnv1a64(bytes: &[u8]) -> u64 { + let mut hash = 14695981039346656037u64; + for byte in bytes { + hash ^= *byte as u64; + hash = hash.wrapping_mul(1099511628211u64); + } + hash +} + +fn xorshift64(mut x: u64) -> u64 { + x ^= x << 13; + x ^= x >> 7; + x ^= x << 17; + x +} + +fn mock_embedding(input: &str, dim: usize) -> Vec<f32> { + let mut seed = fnv1a64(input.as_bytes()); + let mut out = Vec::with_capacity(dim); + for _ in 0..dim { + seed = xorshift64(seed); + let ratio = (seed as f64 / u64::MAX as f64) as f32; + out.push((ratio * 2.0) - 1.0); + } + normalize_vector(out) +} + +#[tokio::test(flavor = "multi_thread")] +async fn healthz_succeeds_after_startup() { + let (_temp, app) = app_for_loaded_graph().await; + let (status, body) = json_response( + &app, + Request::builder() + .uri("/healthz") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + + assert_eq!(status, StatusCode::OK); + assert_eq!(body["status"], "ok"); + assert_eq!(body["version"], env!("CARGO_PKG_VERSION")); + match option_env!("OMNIGRAPH_SOURCE_VERSION") { + Some(source_version) => assert_eq!(body["source_version"], source_version), + None => assert!(body.get("source_version").is_none()), + } +} + +#[tokio::test(flavor = "multi_thread")] +async fn schema_drift_returns_conflict_for_snapshot_read_and_change() { + let (temp, app) = app_for_loaded_graph().await; + let graph = graph_path(temp.path()); + fs::write(graph.join("_schema.pg"), drifted_test_schema()).unwrap(); 
+ + let (snapshot_status, snapshot_body) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + let snapshot_error: ErrorOutput = serde_json::from_value(snapshot_body).unwrap(); + assert_eq!(snapshot_status, StatusCode::CONFLICT); + assert_eq!( + snapshot_error.code, + Some(omnigraph_server::api::ErrorCode::Conflict) + ); + assert!( + snapshot_error + .error + .contains("schema evolution is locked down in phase 1") + ); + + let read = ReadRequest { + query_source: fs::read_to_string(fixture("test.gq")).unwrap(), + query_name: Some("get_person".to_string()), + params: Some(json!({ "name": "Alice" })), + branch: Some("main".to_string()), + snapshot: None, + }; + let (read_status, read_body) = json_response( + &app, + Request::builder() + .uri("/read") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&read).unwrap())) + .unwrap(), + ) + .await; + let read_error: ErrorOutput = serde_json::from_value(read_body).unwrap(); + assert_eq!(read_status, StatusCode::CONFLICT); + assert_eq!( + read_error.code, + Some(omnigraph_server::api::ErrorCode::Conflict) + ); + assert!( + read_error + .error + .contains("schema evolution is locked down in phase 1") + ); + + let change = ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": "Mina", "age": 28 })), + branch: Some("main".to_string()), + }; + let (change_status, change_body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&change).unwrap())) + .unwrap(), + ) + .await; + let change_error: ErrorOutput = serde_json::from_value(change_body).unwrap(); + assert_eq!(change_status, StatusCode::CONFLICT); + assert_eq!( + change_error.code, + 
Some(omnigraph_server::api::ErrorCode::Conflict) + ); + assert!( + change_error + .error + .contains("schema evolution is locked down in phase 1") + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn protected_routes_require_bearer_token() { + let (_temp, app) = app_for_loaded_graph_with_auth("demo-token").await; + let (status, body) = json_response( + &app, + Request::builder() + .uri("/branches") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + + let error: ErrorOutput = serde_json::from_value(body).unwrap(); + assert_eq!(status, StatusCode::UNAUTHORIZED); + assert_eq!( + error.code, + Some(omnigraph_server::api::ErrorCode::Unauthorized) + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn protected_routes_accept_valid_bearer_token_while_healthz_stays_open() { + let (_temp, app) = app_for_loaded_graph_with_auth("demo-token").await; + + let health = app + .clone() + .oneshot( + Request::builder() + .uri("/healthz") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(health.status(), StatusCode::OK); + + let (status, body) = json_response( + &app, + Request::builder() + .uri("/branches") + .method(Method::GET) + .header("authorization", "Bearer demo-token") + .body(Body::empty()) + .unwrap(), + ) + .await; + + assert_eq!(status, StatusCode::OK); + assert!(body["branches"].is_array()); +} + +#[tokio::test(flavor = "multi_thread")] +async fn export_route_returns_jsonl_for_branch_snapshot() { + let token = "demo-token"; + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let mut db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.branch_create_from(ReadTarget::branch("main"), "feature") + .await + .unwrap(); + db.load( + "feature", + r#"{"type":"Person","data":{"name":"Eve","age":29}}"#, + LoadMode::Append, + ) + .await + .unwrap(); + let expected = db + .export_jsonl("feature", &["Person".to_string()], &[]) + .await + .unwrap(); + 
drop(db); + + // MR-723: tokens-without-policy is now default-deny. Install a + // permit-all policy alongside the bearer token so /export + // (action=Export) passes Cedar evaluation. The test is exercising + // export semantics, not policy; the policy is just enough to clear + // the State 3 path. + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, permit_all_policy_yaml(&["default"])).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![("default".to_string(), token.to_string())], + Some(&policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + + let response = app + .clone() + .oneshot( + Request::builder() + .uri("/export") + .method(Method::POST) + .header("content-type", "application/json") + .header("authorization", format!("Bearer {}", token)) + .body(Body::from( + serde_json::to_vec(&ExportRequest { + branch: Some("feature".to_string()), + type_names: vec!["Person".to_string()], + table_keys: Vec::new(), + }) + .unwrap(), + )) + .unwrap(), + ) + .await + .unwrap(); + + assert_eq!(response.status(), StatusCode::OK); + assert_eq!( + response.headers().get("content-type").unwrap(), + "application/x-ndjson; charset=utf-8" + ); + let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); + let text = String::from_utf8(body.to_vec()).unwrap(); + assert_eq!(text, expected); +} + +#[tokio::test(flavor = "multi_thread")] +async fn protected_routes_accept_any_configured_team_bearer_token() { + let (_temp, app) = app_for_loaded_graph_with_auth_tokens(&[ + ("team-01", "token-one"), + ("team-02", "token-two"), + ]) + .await; + + let (status, body) = json_response( + &app, + Request::builder() + .uri("/branches") + .method(Method::GET) + .header("authorization", "Bearer token-two") + .body(Body::empty()) + .unwrap(), + ) + .await; + + assert_eq!(status, StatusCode::OK); + assert!(body["branches"].is_array()); +} + +/// Verifies the hashed-token lookup 
correctly resolves each bearer to its +/// associated actor, and that the resolved actor (not the handler-supplied +/// default) is what the policy engine sees. Two tokens for two distinct +/// actors; policy grants read to actor-A only. Swapping tokens must swap +/// the policy outcome. +#[tokio::test(flavor = "multi_thread")] +async fn bearer_token_resolves_to_correct_actor_for_policy_decisions() { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let policy_path = temp.path().join("policy.yaml"); + fs::write( + &policy_path, + r#" +version: 1 +groups: + readers: [act-a] + writers: [act-b] +protected_branches: [main] +rules: + - id: readers-only + allow: + actors: { group: readers } + actions: [read] + branch_scope: any +"#, + ) + .unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![ + ("act-a".to_string(), "token-a".to_string()), + ("act-b".to_string(), "token-b".to_string()), + ], + Some(&policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + + // act-a is authenticated AND authorized. + let (ok_status, _) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .header("authorization", "Bearer token-a") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(ok_status, StatusCode::OK); + + // act-b is authenticated but policy rejects, which proves the resolved actor + // (not some default) was the policy subject. 
+ let (denied_status, denied_body) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .header("authorization", "Bearer token-b") + .body(Body::empty()) + .unwrap(), + ) + .await; + let denied_error: ErrorOutput = serde_json::from_value(denied_body).unwrap(); + assert_eq!(denied_status, StatusCode::FORBIDDEN); + assert_eq!( + denied_error.code, + Some(omnigraph_server::api::ErrorCode::Forbidden) + ); + + // Unknown token: 401, never reaches the policy engine. + let (bad_status, _) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .header("authorization", "Bearer wrong-token") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(bad_status, StatusCode::UNAUTHORIZED); +} + +/// Regression test for MR-731: actor identity comes from the matched +/// bearer token, never from a client-supplied request header. A future +/// "convenience" PR that lets clients override `actor_id` to spoof +/// another identity must break this test. The principle is named in +/// `docs/dev/invariants.md` Hard Invariant 11 and at the actor-resolution +/// site in `omnigraph-server/src/lib.rs::authorize_request`. +/// +/// Two assertions in one fixture: +/// 1. Spoof-up: bearer for a *denied* actor + X-Actor-Id naming an +/// *allowed* actor: policy still denies (proves the spoof header +/// doesn't promote the request). +/// 2. Spoof-down: bearer for an *allowed* actor + X-Actor-Id naming a +/// *denied* actor: policy still allows (proves the server-resolved +/// identity wins; the spoof can't trick the request into a denial +/// either, which would otherwise be a confusing UX trap). +/// +/// Cross-reference: MR-777 covers boundary cases like actor-id +/// *collision* (two distinct tokens minting the same actor_id) and +/// malformed bearer header parsing. 
See `auth_boundary_case_coverage` +/// suite when it lands; the two tests together pin the full bearer-token +/// → actor identity contract. +#[tokio::test(flavor = "multi_thread")] +async fn actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers() { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let policy_path = temp.path().join("policy.yaml"); + // Same readers/writers split as + // `bearer_token_resolves_to_correct_actor_for_policy_decisions`: + // `act-a` can read main, `act-b` cannot. The asymmetry is what + // makes the spoof-up/spoof-down distinction observable. + fs::write( + &policy_path, + r#" +version: 1 +groups: + readers: [act-a] + writers: [act-b] +protected_branches: [main] +rules: + - id: readers-only + allow: + actors: { group: readers } + actions: [read] + branch_scope: any +"#, + ) + .unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![ + ("act-a".to_string(), "token-a".to_string()), + ("act-b".to_string(), "token-b".to_string()), + ], + Some(&policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + + // (1) Spoof-up: bearer for act-b (denied) + X-Actor-Id: act-a (allowed). + // If the server were trusting the header, this would succeed as + // act-a. The contract is: the bearer wins. Expect 403 because + // act-b can't read. 
+ let (spoof_up_status, spoof_up_body) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .header("authorization", "Bearer token-b") + .header("x-actor-id", "act-a") + .body(Body::empty()) + .unwrap(), + ) + .await; + let spoof_up_error: ErrorOutput = serde_json::from_value(spoof_up_body).unwrap(); + assert_eq!( + spoof_up_status, + StatusCode::FORBIDDEN, + "X-Actor-Id must not promote a denied bearer to an allowed actor", + ); + assert_eq!( + spoof_up_error.code, + Some(omnigraph_server::api::ErrorCode::Forbidden), + ); + + // (2) Spoof-down: bearer for act-a (allowed) + X-Actor-Id: act-b (denied). + // If the server were trusting the header, this would fail as act-b. + // The contract is: the bearer wins. Expect 200 because act-a can read. + let (spoof_down_status, _) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .header("authorization", "Bearer token-a") + .header("x-actor-id", "act-b") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!( + spoof_down_status, + StatusCode::OK, + "X-Actor-Id must not demote an allowed bearer to a denied actor", + ); + + // (3) Empty-string spoof attempt: an X-Actor-Id of "" must not + // leak through as the policy subject. Same expectation as (1): + // bearer for act-b is denied regardless of what the header tries. 
+ let (empty_spoof_status, _) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .header("authorization", "Bearer token-b") + .header("x-actor-id", "") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!( + empty_spoof_status, + StatusCode::FORBIDDEN, + "empty X-Actor-Id must not clear the resolved actor", + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn policy_allows_read_but_distinguishes_401_from_403() { + let (_temp, app) = app_for_loaded_graph_with_auth_tokens_and_policy( + &[("act-bruno", "team-token"), ("act-ragnor", "admin-token")], + POLICY_YAML, + ) + .await; + + let (missing_status, missing_body) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + let missing_error: ErrorOutput = serde_json::from_value(missing_body).unwrap(); + assert_eq!(missing_status, StatusCode::UNAUTHORIZED); + assert_eq!( + missing_error.code, + Some(omnigraph_server::api::ErrorCode::Unauthorized) + ); + + let (snapshot_status, snapshot_body) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .header("authorization", "Bearer team-token") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(snapshot_status, StatusCode::OK); + assert_eq!(snapshot_body["branch"], "main"); + + let export_request = ExportRequest { + branch: Some("main".to_string()), + type_names: Vec::new(), + table_keys: Vec::new(), + }; + let (forbidden_status, forbidden_body) = json_response( + &app, + Request::builder() + .uri("/export") + .method(Method::POST) + .header("authorization", "Bearer team-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&export_request).unwrap())) + .unwrap(), + ) + .await; + let forbidden_error: ErrorOutput = serde_json::from_value(forbidden_body).unwrap(); + assert_eq!(forbidden_status, StatusCode::FORBIDDEN); 
+ assert_eq!( + forbidden_error.code, + Some(omnigraph_server::api::ErrorCode::Forbidden) + ); + + let response = app + .clone() + .oneshot( + Request::builder() + .uri("/export") + .method(Method::POST) + .header("authorization", "Bearer admin-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&export_request).unwrap())) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(response.status(), StatusCode::OK); +} + +#[tokio::test(flavor = "multi_thread")] +async fn policy_uses_resolved_branch_for_snapshot_reads() { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let snapshot_id = { + let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.resolve_snapshot("main").await.unwrap().to_string() + }; + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, POLICY_PROTECTED_READ_YAML).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![("act-bruno".to_string(), "team-token".to_string())], + Some(&policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + + let read = ReadRequest { + query_source: fs::read_to_string(fixture("test.gq")).unwrap(), + query_name: Some("get_person".to_string()), + params: Some(json!({ "name": "Alice" })), + branch: None, + snapshot: Some(snapshot_id), + }; + let (status, body) = json_response( + &app, + Request::builder() + .uri("/read") + .method(Method::POST) + .header("authorization", "Bearer team-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&read).unwrap())) + .unwrap(), + ) + .await; + + assert_eq!(status, StatusCode::OK); + assert_eq!(body["target"]["branch"], Value::Null); + assert_eq!( + body["target"]["snapshot"].as_str(), + read.snapshot.as_deref() + ); + assert_eq!(body["row_count"], 1); +} + +#[tokio::test(flavor = "multi_thread")] +async fn snapshot_route_returns_manifest_dataset_version() { + let 
(temp, app) = app_for_loaded_graph().await; + let graph = graph_path(temp.path()); + let expected_manifest_version = manifest_dataset_version(&graph).await; + + let (snapshot_status, snapshot_body) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + + assert_eq!(snapshot_status, StatusCode::OK); + assert_eq!(snapshot_body["branch"], "main"); + assert_eq!( + snapshot_body["manifest_version"].as_u64().unwrap(), + expected_manifest_version + ); + assert!(snapshot_body["tables"].is_array()); +} + +#[tokio::test(flavor = "multi_thread")] +async fn schema_route_returns_current_source() { + let (_temp, app) = app_for_loaded_graph().await; + let (status, body) = json_response( + &app, + Request::builder() + .uri("/schema") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + + assert_eq!(status, StatusCode::OK); + let output: SchemaOutput = serde_json::from_value(body).unwrap(); + assert!(output.schema_source.contains("node Person")); +} + +#[tokio::test(flavor = "multi_thread")] +async fn schema_route_requires_bearer_token_when_auth_configured() { + let (_temp, app) = app_for_loaded_graph_with_auth("demo-token").await; + + let (missing_status, missing_body) = json_response( + &app, + Request::builder() + .uri("/schema") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + let missing_error: ErrorOutput = serde_json::from_value(missing_body).unwrap(); + assert_eq!(missing_status, StatusCode::UNAUTHORIZED); + assert_eq!( + missing_error.code, + Some(omnigraph_server::api::ErrorCode::Unauthorized) + ); + + let (ok_status, ok_body) = json_response( + &app, + Request::builder() + .uri("/schema") + .method(Method::GET) + .header("authorization", "Bearer demo-token") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(ok_status, StatusCode::OK); + let output: SchemaOutput = serde_json::from_value(ok_body).unwrap(); + 
assert!(!output.schema_source.is_empty()); +} + +#[tokio::test(flavor = "multi_thread")] +async fn schema_route_denied_when_actor_lacks_read_permission() { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let policy_path = temp.path().join("policy.yaml"); + // Policy grants branch_create only; no read action for act-bruno. + fs::write(&policy_path, INGEST_CREATE_ONLY_POLICY_YAML).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![("act-bruno".to_string(), "team-token".to_string())], + Some(&policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + + let (status, body) = json_response( + &app, + Request::builder() + .uri("/schema") + .method(Method::GET) + .header("authorization", "Bearer team-token") + .body(Body::empty()) + .unwrap(), + ) + .await; + let error: ErrorOutput = serde_json::from_value(body).unwrap(); + assert_eq!(status, StatusCode::FORBIDDEN); + assert_eq!( + error.code, + Some(omnigraph_server::api::ErrorCode::Forbidden) + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn policy_blocks_change_on_protected_main_but_allows_unprotected_branch() { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let mut db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.branch_create_from(ReadTarget::branch("main"), "feature") + .await + .unwrap(); + drop(db); + + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, POLICY_YAML).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![("act-bruno".to_string(), "team-token".to_string())], + Some(&policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + + let main_change = ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": "Mina", "age": 28 })), + branch: Some("main".to_string()), 
+ }; + let (main_status, main_body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header("authorization", "Bearer team-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&main_change).unwrap())) + .unwrap(), + ) + .await; + let main_error: ErrorOutput = serde_json::from_value(main_body).unwrap(); + assert_eq!(main_status, StatusCode::FORBIDDEN); + assert_eq!( + main_error.code, + Some(omnigraph_server::api::ErrorCode::Forbidden) + ); + + let feature_change = ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": "Mina", "age": 28 })), + branch: Some("feature".to_string()), + }; + let (feature_status, feature_body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header("authorization", "Bearer team-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&feature_change).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(feature_status, StatusCode::OK); + assert_eq!(feature_body["branch"], "feature"); + assert_eq!(feature_body["affected_nodes"], 1); +} + +#[tokio::test(flavor = "multi_thread")] +async fn policy_blocks_non_admin_merge_to_main_and_allows_admin() { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let mut db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.branch_create_from(ReadTarget::branch("main"), "feature") + .await + .unwrap(); + db.load( + "feature", + r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#, + LoadMode::Append, + ) + .await + .unwrap(); + drop(db); + + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, POLICY_YAML).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![ + ("act-bruno".to_string(), "team-token".to_string()), + ("act-ragnor".to_string(), 
"admin-token".to_string()), + ], + Some(&policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + + let merge = BranchMergeRequest { + source: "feature".to_string(), + target: Some("main".to_string()), + }; + let (deny_status, deny_body) = json_response( + &app, + Request::builder() + .uri("/branches/merge") + .method(Method::POST) + .header("authorization", "Bearer team-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&merge).unwrap())) + .unwrap(), + ) + .await; + let deny_error: ErrorOutput = serde_json::from_value(deny_body).unwrap(); + assert_eq!(deny_status, StatusCode::FORBIDDEN); + assert_eq!( + deny_error.code, + Some(omnigraph_server::api::ErrorCode::Forbidden) + ); + + let (allow_status, allow_body) = json_response( + &app, + Request::builder() + .uri("/branches/merge") + .method(Method::POST) + .header("authorization", "Bearer admin-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&merge).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(allow_status, StatusCode::OK); + assert_eq!(allow_body["actor_id"], "act-ragnor"); +} + +#[tokio::test(flavor = "multi_thread")] +async fn authenticated_change_stamps_actor_on_commits() { + // With the Run state machine removed, actor_id is recorded + // directly on the commit graph (no intermediate run record). 
+ let (_temp, app) = app_for_loaded_graph_with_auth_tokens(&[("act-andrew", "token-one")]).await; + + let change = ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": "Mina", "age": 28 })), + branch: Some("main".to_string()), + }; + let (change_status, change_body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header("authorization", "Bearer token-one") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&change).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(change_status, StatusCode::OK); + assert_eq!(change_body["actor_id"], "act-andrew"); + + let (commits_status, commits_body) = json_response( + &app, + Request::builder() + .uri("/commits?branch=main") + .method(Method::GET) + .header("authorization", "Bearer token-one") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(commits_status, StatusCode::OK); + let head = commits_body["commits"] + .as_array() + .unwrap() + .last() + .expect("head commit should exist"); + assert_eq!(head["actor_id"], "act-andrew"); +} + +#[tokio::test(flavor = "multi_thread")] +async fn ingest_creates_branch_returns_metadata_and_stamps_actor() { + let (temp, app) = app_for_loaded_graph_with_auth_tokens(&[("act-andrew", "token-one")]).await; + let graph = graph_path(temp.path()); + let ingest = IngestRequest { + branch: Some("feature-ingest".to_string()), + from: Some("main".to_string()), + mode: Some(LoadMode::Merge), + data: r#"{"type":"Person","data":{"name":"Zoe","age":33}} +{"type":"Person","data":{"name":"Bob","age":26}}"# + .to_string(), + }; + + let (status, body) = json_response( + &app, + Request::builder() + .uri("/ingest") + .method(Method::POST) + .header("authorization", "Bearer token-one") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&ingest).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(status, 
StatusCode::OK); + assert_eq!(body["branch"], "feature-ingest"); + assert_eq!(body["base_branch"], "main"); + assert_eq!(body["branch_created"], true); + assert_eq!(body["mode"], "merge"); + assert_eq!(body["actor_id"], "act-andrew"); + assert_eq!(body["tables"][0]["table_key"], "node:Person"); + assert_eq!(body["tables"][0]["rows_loaded"], 2); + + let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + let snapshot = db + .snapshot_of(ReadTarget::branch("feature-ingest")) + .await + .unwrap(); + let person_ds = snapshot.open("node:Person").await.unwrap(); + assert_eq!(person_ds.count_rows(None).await.unwrap(), 5); + let head = db + .list_commits(Some("feature-ingest")) + .await + .unwrap() + .into_iter() + .last() + .unwrap(); + assert_eq!(head.actor_id.as_deref(), Some("act-andrew")); +} + +#[tokio::test(flavor = "multi_thread")] +async fn ingest_existing_branch_skips_branch_create_policy_check() { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + { + let mut db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.branch_create_from(ReadTarget::branch("main"), "feature") + .await + .unwrap(); + } + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, POLICY_YAML).unwrap(); + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![("act-bruno".to_string(), "team-token".to_string())], + Some(&policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + let ingest = IngestRequest { + branch: Some("feature".to_string()), + from: Some("other-base".to_string()), + mode: Some(LoadMode::Merge), + data: r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#.to_string(), + }; + + let (status, body) = json_response( + &app, + Request::builder() + .uri("/ingest") + .method(Method::POST) + .header("authorization", "Bearer team-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&ingest).unwrap())) + 
.unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); + assert_eq!(body["branch"], "feature"); + assert_eq!(body["branch_created"], false); + assert_eq!(body["base_branch"], "other-base"); +} + +#[tokio::test(flavor = "multi_thread")] +async fn ingest_denies_missing_branch_without_branch_create_permission() { + let (_temp, app) = app_for_loaded_graph_with_auth_tokens_and_policy( + &[("act-bruno", "team-token")], + POLICY_YAML, + ) + .await; + let ingest = IngestRequest { + branch: Some("feature".to_string()), + from: Some("main".to_string()), + mode: Some(LoadMode::Merge), + data: r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#.to_string(), + }; + + let (status, body) = json_response( + &app, + Request::builder() + .uri("/ingest") + .method(Method::POST) + .header("authorization", "Bearer team-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&ingest).unwrap())) + .unwrap(), + ) + .await; + let error: ErrorOutput = serde_json::from_value(body).unwrap(); + assert_eq!(status, StatusCode::FORBIDDEN); + assert_eq!( + error.code, + Some(omnigraph_server::api::ErrorCode::Forbidden) + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn ingest_denies_when_actor_lacks_change_permission() { + let (_temp, app) = app_for_loaded_graph_with_auth_tokens_and_policy( + &[("act-bruno", "team-token")], + INGEST_CREATE_ONLY_POLICY_YAML, + ) + .await; + let ingest = IngestRequest { + branch: Some("feature".to_string()), + from: Some("main".to_string()), + mode: Some(LoadMode::Merge), + data: r#"{"type":"Person","data":{"name":"Zoe","age":33}}"#.to_string(), + }; + + let (status, body) = json_response( + &app, + Request::builder() + .uri("/ingest") + .method(Method::POST) + .header("authorization", "Bearer team-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&ingest).unwrap())) + .unwrap(), + ) + .await; + let error: ErrorOutput = serde_json::from_value(body).unwrap(); + 
assert_eq!(status, StatusCode::FORBIDDEN); + assert_eq!( + error.code, + Some(omnigraph_server::api::ErrorCode::Forbidden) + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn ingest_rejects_payloads_over_32_mib() { + let (_temp, app) = app_for_loaded_graph().await; + let oversize = IngestRequest { + branch: Some("feature".to_string()), + from: Some("main".to_string()), + mode: Some(LoadMode::Merge), + data: "x".repeat(33 * 1024 * 1024), + }; + + let response = app + .clone() + .oneshot( + Request::builder() + .uri("/ingest") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&oversize).unwrap())) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(response.status(), StatusCode::PAYLOAD_TOO_LARGE); +} + +#[tokio::test(flavor = "multi_thread")] +async fn authenticated_branch_merge_stamps_merge_actor_on_head_commit() { + let (_temp, app) = app_for_loaded_graph_with_auth_tokens(&[ + ("act-andrew", "token-one"), + ("act-ragnor", "token-two"), + ]) + .await; + + let create = BranchCreateRequest { + from: Some("main".to_string()), + name: "feature".to_string(), + }; + let (create_status, _) = json_response( + &app, + Request::builder() + .uri("/branches") + .method(Method::POST) + .header("authorization", "Bearer token-one") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&create).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(create_status, StatusCode::OK); + + let change = ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": "Zoe", "age": 33 })), + branch: Some("feature".to_string()), + }; + let (change_status, _) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header("authorization", "Bearer token-one") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&change).unwrap())) + .unwrap(), + ) + .await; + 
assert_eq!(change_status, StatusCode::OK); + + let merge = BranchMergeRequest { + source: "feature".to_string(), + target: Some("main".to_string()), + }; + let (merge_status, merge_body) = json_response( + &app, + Request::builder() + .uri("/branches/merge") + .method(Method::POST) + .header("authorization", "Bearer token-two") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&merge).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(merge_status, StatusCode::OK); + assert_eq!(merge_body["actor_id"], "act-ragnor"); + + let (commit_status, commit_body) = json_response( + &app, + Request::builder() + .uri("/commits?branch=main") + .method(Method::GET) + .header("authorization", "Bearer token-two") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(commit_status, StatusCode::OK); + let head = commit_body["commits"] + .as_array() + .unwrap() + .last() + .expect("head commit should exist"); + assert_eq!(head["actor_id"], "act-ragnor"); +} + +#[tokio::test(flavor = "multi_thread")] +async fn branch_merge_conflict_response_includes_structured_conflicts() { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let mut db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.branch_create_from(ReadTarget::branch("main"), "feature") + .await + .unwrap(); + db.mutate( + "main", + MUTATION_QUERIES, + "set_age", + &omnigraph_compiler::json_params_to_param_map( + Some(&json!({"name": "Alice", "age": 31 })), + &omnigraph_compiler::find_named_query(MUTATION_QUERIES, "set_age") + .unwrap() + .params, + omnigraph_compiler::JsonParamMode::Standard, + ) + .unwrap(), + ) + .await + .unwrap(); + db.mutate( + "feature", + MUTATION_QUERIES, + "set_age", + &omnigraph_compiler::json_params_to_param_map( + Some(&json!({"name": "Alice", "age": 32 })), + &omnigraph_compiler::find_named_query(MUTATION_QUERIES, "set_age") + .unwrap() + .params, + omnigraph_compiler::JsonParamMode::Standard, + ) + .unwrap(), + ) + 
.await + .unwrap(); + drop(db); + + let state = AppState::open(graph.to_string_lossy().to_string()) + .await + .unwrap(); + let app = build_app(state); + let merge = BranchMergeRequest { + source: "feature".to_string(), + target: Some("main".to_string()), + }; + let (status, body) = json_response( + &app, + Request::builder() + .uri("/branches/merge") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&merge).unwrap())) + .unwrap(), + ) + .await; + + let error: ErrorOutput = serde_json::from_value(body).unwrap(); + assert_eq!(status, StatusCode::CONFLICT); + assert_eq!(error.code, Some(omnigraph_server::api::ErrorCode::Conflict)); + assert!(error.error.contains("merge conflict")); + assert!(error.merge_conflicts.iter().any(|conflict| { + conflict.table_key == "node:Person" + && conflict.row_id.as_deref() == Some("Alice") + && conflict.kind == omnigraph_server::api::MergeConflictKindOutput::DivergentUpdate + })); +} + +#[tokio::test(flavor = "multi_thread")] +async fn repeated_read_after_change_sees_updated_state_from_same_app() { + let (_temp, app) = app_for_loaded_graph().await; + + let change = ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": "Mina", "age": 28 })), + branch: Some("main".to_string()), + }; + let (change_status, change_body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&change).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(change_status, StatusCode::OK); + assert_eq!(change_body["affected_nodes"], 1); + + let read = ReadRequest { + query_source: fs::read_to_string(fixture("test.gq")).unwrap(), + query_name: Some("get_person".to_string()), + params: Some(json!({ "name": "Mina" })), + branch: Some("main".to_string()), + snapshot: None, + }; + let (read_status, read_body) = 
json_response( + &app, + Request::builder() + .uri("/read") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&read).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(read_status, StatusCode::OK); + assert_eq!(read_body["row_count"], 1); + assert_eq!(read_body["rows"][0]["p.name"], "Mina"); +} + +#[tokio::test(flavor = "multi_thread")] +async fn query_endpoint_runs_inline_read() { + let (_temp, app) = app_for_loaded_graph().await; + + let query = QueryRequest { + query: fs::read_to_string(fixture("test.gq")).unwrap(), + name: Some("get_person".to_string()), + params: Some(json!({ "name": "Alice" })), + branch: Some("main".to_string()), + snapshot: None, + }; + let (status, body) = json_response( + &app, + Request::builder() + .uri("/query") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&query).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); + assert_eq!(body["query_name"], "get_person"); + assert_eq!(body["row_count"], 1); + assert_eq!(body["rows"][0]["p.name"], "Alice"); +} + +#[tokio::test(flavor = "multi_thread")] +async fn query_endpoint_rejects_mutation_with_400() { + let (_temp, app) = app_for_loaded_graph().await; + + let query = QueryRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": "Should", "age": 1 })), + branch: Some("main".to_string()), + snapshot: None, + }; + let (status, body) = json_response( + &app, + Request::builder() + .uri("/query") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&query).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::BAD_REQUEST); + let err = body["error"].as_str().unwrap_or_default(); + assert!( + err.contains("contains mutations") && err.contains("POST /mutate"), + "expected mutation-rejection message pointing at canonical 
/mutate, got: {err}" + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn mutate_endpoint_runs_inline_mutation() { + // Canonical mutation endpoint. Pairs with `/query` on the read side. + // Same wire shape as `/change`, no deprecation signal. + let (_temp, app) = app_for_loaded_graph().await; + + let request = json!({ + "query": MUTATION_QUERIES, + "name": "insert_person", + "params": { "name": "Mutie", "age": 30 }, + "branch": "main", + }); + let response = app + .clone() + .oneshot( + Request::builder() + .uri("/mutate") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&request).unwrap())) + .unwrap(), + ) + .await + .unwrap(); + + assert_eq!(response.status(), StatusCode::OK); + // Canonical route is NOT deprecated; no Deprecation header expected. + assert!( + response.headers().get("deprecation").is_none(), + "POST /mutate must not advertise itself as deprecated" + ); + let body_bytes = to_bytes(response.into_body(), usize::MAX).await.unwrap(); + let body: Value = serde_json::from_slice(&body_bytes).unwrap(); + assert_eq!(body["affected_nodes"], 1); + assert_eq!(body["query_name"], "insert_person"); + assert_eq!(body["branch"], "main"); +} + +#[tokio::test(flavor = "multi_thread")] +async fn change_endpoint_emits_deprecation_headers() { + // `/change` is kept indefinitely for back-compat but flagged at runtime + // per RFC 9745 (`Deprecation: true`) + RFC 8288 (`Link: </mutate>; + // rel="successor-version"`). The OpenAPI side is covered by + // `openapi_change_is_deprecated` in tests/openapi.rs. 
+ let (_temp, app) = app_for_loaded_graph().await; + + let request = json!({ + "query": MUTATION_QUERIES, + "name": "insert_person", + "params": { "name": "Legacyer", "age": 33 }, + "branch": "main", + }); + let response = app + .clone() + .oneshot( + Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&request).unwrap())) + .unwrap(), + ) + .await + .unwrap(); + + assert_eq!(response.status(), StatusCode::OK); + assert_eq!( + response + .headers() + .get("deprecation") + .and_then(|v| v.to_str().ok()), + Some("true"), + "POST /change must advertise `Deprecation: true` (RFC 9745)" + ); + assert_eq!( + response.headers().get("link").and_then(|v| v.to_str().ok()), + Some("</mutate>; rel=\"successor-version\""), + "POST /change must point at /mutate via `Link` rel=successor-version (RFC 8288)" + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn read_endpoint_emits_deprecation_headers() { + // `/read` is kept indefinitely for byte-stable back-compat but flagged + // at runtime per RFC 9745 + RFC 8288. Successor is `/query`. 
+ let (_temp, app) = app_for_loaded_graph().await; + + let request = ReadRequest { + query_source: fs::read_to_string(fixture("test.gq")).unwrap(), + query_name: Some("get_person".to_string()), + params: Some(json!({ "name": "Alice" })), + branch: Some("main".to_string()), + snapshot: None, + }; + let response = app + .clone() + .oneshot( + Request::builder() + .uri("/read") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&request).unwrap())) + .unwrap(), + ) + .await + .unwrap(); + + assert_eq!(response.status(), StatusCode::OK); + assert_eq!( + response + .headers() + .get("deprecation") + .and_then(|v| v.to_str().ok()), + Some("true"), + "POST /read must advertise `Deprecation: true` (RFC 9745)" + ); + assert_eq!( + response.headers().get("link").and_then(|v| v.to_str().ok()), + Some("</query>; rel=\"successor-version\""), + "POST /read must point at /query via `Link` rel=successor-version (RFC 8288)" + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn query_endpoint_does_not_emit_deprecation_headers() { + // Sanity check the inverse: the canonical `/query` endpoint must not + // carry deprecation signaling, so SDK codegens don't propagate a + // bogus `@deprecated` marker. 
+ let (_temp, app) = app_for_loaded_graph().await; + + let request = QueryRequest { + query: fs::read_to_string(fixture("test.gq")).unwrap(), + name: Some("get_person".to_string()), + params: Some(json!({ "name": "Alice" })), + branch: Some("main".to_string()), + snapshot: None, + }; + let response = app + .clone() + .oneshot( + Request::builder() + .uri("/query") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&request).unwrap())) + .unwrap(), + ) + .await + .unwrap(); + + assert_eq!(response.status(), StatusCode::OK); + assert!( + response.headers().get("deprecation").is_none(), + "POST /query is canonical and must not advertise itself as deprecated" + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn change_endpoint_accepts_legacy_field_names() { + // The canonical wire field names on /change are `query` and `name`, but + // serde aliases keep the legacy `query_source`/`query_name` payload + // shape working for clients that haven't migrated yet. Pin both shapes. 
+ let (_temp, app) = app_for_loaded_graph().await; + + let legacy_body = json!({ + "query_source": MUTATION_QUERIES, + "query_name": "insert_person", + "params": { "name": "Legacy", "age": 21 }, + "branch": "main", + }); + let (status, body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&legacy_body).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); + assert_eq!(body["affected_nodes"], 1); + + let canonical_body = json!({ + "query": MUTATION_QUERIES, + "name": "insert_person", + "params": { "name": "Canonical", "age": 22 }, + "branch": "main", + }); + let (status, body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&canonical_body).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); + assert_eq!(body["affected_nodes"], 1); +} + +#[tokio::test(flavor = "multi_thread")] +async fn remote_branch_list_create_merge_flow_works() { + let (_temp, app) = app_for_loaded_graph().await; + + let (list_status, list_body) = json_response( + &app, + Request::builder() + .uri("/branches") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(list_status, StatusCode::OK); + assert_eq!(list_body["branches"], json!(["main"])); + + let create = BranchCreateRequest { + from: Some("main".to_string()), + name: "feature".to_string(), + }; + let (create_status, create_body) = json_response( + &app, + Request::builder() + .uri("/branches") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&create).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(create_status, StatusCode::OK); + assert_eq!(create_body["from"], "main"); + assert_eq!(create_body["name"], "feature"); + + let (list_status, list_body) = 
json_response( + &app, + Request::builder() + .uri("/branches") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(list_status, StatusCode::OK); + assert_eq!(list_body["branches"], json!(["feature", "main"])); + + let change = ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": "Zoe", "age": 33 })), + branch: Some("feature".to_string()), + }; + let (change_status, change_body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&change).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(change_status, StatusCode::OK); + assert_eq!(change_body["branch"], "feature"); + assert_eq!(change_body["affected_nodes"], 1); + + let read_main_before = ReadRequest { + query_source: fs::read_to_string(fixture("test.gq")).unwrap(), + query_name: Some("get_person".to_string()), + params: Some(json!({ "name": "Zoe" })), + branch: Some("main".to_string()), + snapshot: None, + }; + let (read_status, read_body) = json_response( + &app, + Request::builder() + .uri("/read") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&read_main_before).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(read_status, StatusCode::OK); + assert_eq!(read_body["row_count"], 0); + + let merge = BranchMergeRequest { + source: "feature".to_string(), + target: Some("main".to_string()), + }; + let (merge_status, merge_body) = json_response( + &app, + Request::builder() + .uri("/branches/merge") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&merge).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(merge_status, StatusCode::OK); + assert_eq!(merge_body["source"], "feature"); + assert_eq!(merge_body["target"], "main"); + assert_eq!(merge_body["outcome"], 
"fast_forward"); + + let read_main_after = ReadRequest { + query_source: fs::read_to_string(fixture("test.gq")).unwrap(), + query_name: Some("get_person".to_string()), + params: Some(json!({ "name": "Zoe" })), + branch: Some("main".to_string()), + snapshot: None, + }; + let (read_status, read_body) = json_response( + &app, + Request::builder() + .uri("/read") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&read_main_after).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(read_status, StatusCode::OK); + assert_eq!(read_body["row_count"], 1); + assert_eq!(read_body["rows"][0]["p.name"], "Zoe"); +} + +#[tokio::test(flavor = "multi_thread")] +async fn remote_branch_delete_flow_works() { + let (_temp, app) = app_for_loaded_graph().await; + + let create = BranchCreateRequest { + from: Some("main".to_string()), + name: "feature".to_string(), + }; + let (create_status, _) = json_response( + &app, + Request::builder() + .uri("/branches") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&create).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(create_status, StatusCode::OK); + + let (delete_status, delete_body) = json_response( + &app, + Request::builder() + .uri("/branches/feature") + .method(Method::DELETE) + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(delete_status, StatusCode::OK); + assert_eq!(delete_body["name"], "feature"); + + let (list_status, list_body) = json_response( + &app, + Request::builder() + .uri("/branches") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(list_status, StatusCode::OK); + assert_eq!(list_body["branches"], json!(["main"])); +} + +#[tokio::test(flavor = "multi_thread")] +async fn branch_delete_denies_without_policy_permission() { + let (temp, app) = app_for_loaded_graph_with_auth_tokens_and_policy( + &[("act-andrew", "token-admin"), ("act-bruno", "token-team")], + 
POLICY_YAML, + ) + .await; + let graph = graph_path(temp.path()); + + let mut db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.branch_create_from(ReadTarget::branch("main"), "feature") + .await + .unwrap(); + drop(db); + + let (status, body) = json_response( + &app, + Request::builder() + .uri("/branches/feature") + .method(Method::DELETE) + .header("authorization", "Bearer token-team") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::FORBIDDEN); + assert!( + body["error"] + .as_str() + .unwrap() + .contains("policy denied action 'branch_delete'") + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn server_opens_s3_graph_directly_and_serves_snapshot_and_read() { + let Some(uri) = s3_test_graph_uri("server") else { + eprintln!("skipping s3 server test: OMNIGRAPH_S3_TEST_BUCKET is not set"); + return; + }; + + Omnigraph::init(&uri, &fs::read_to_string(fixture("test.pg")).unwrap()) + .await + .unwrap(); + let mut db = Omnigraph::open(&uri).await.unwrap(); + load_jsonl( + &mut db, + &fs::read_to_string(fixture("test.jsonl")).unwrap(), + LoadMode::Overwrite, + ) + .await + .unwrap(); + + let app = build_app( + AppState::open_with_bearer_token(uri.clone(), Some("s3-token".to_string())) + .await + .unwrap(), + ); + + let (snapshot_status, snapshot_body) = json_response( + &app, + Request::builder() + .uri("/snapshot") + .method(Method::GET) + .header("authorization", "Bearer s3-token") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(snapshot_status, StatusCode::OK); + assert!(snapshot_body["tables"].is_array()); + + let read = ReadRequest { + query_source: fs::read_to_string(fixture("test.gq")).unwrap(), + query_name: Some("get_person".to_string()), + params: Some(json!({ "name": "Alice" })), + branch: Some("main".to_string()), + snapshot: None, + }; + let (read_status, read_body) = json_response( + &app, + Request::builder() + .uri("/read") + .method(Method::POST) + .header("authorization", 
"Bearer s3-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&read).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(read_status, StatusCode::OK); + assert_eq!(read_body["row_count"], 1); + assert_eq!(read_body["rows"][0]["p.name"], "Alice"); +} + +#[tokio::test(flavor = "multi_thread")] +#[serial] +async fn remote_read_embeds_string_nearest_queries_with_mock_runtime() { + const EMBED_SCHEMA: &str = r#" +node Doc { + slug: String @key + title: String @index + embedding: Vector(4) @index +} +"#; + const EMBED_QUERY: &str = r#" +query vector_search_string($q: String) { + match { $d: Doc } + return { $d.slug, $d.title } + order { nearest($d.embedding, $q) } + limit 3 +} +"#; + + let alpha = mock_embedding("alpha", 4); + let beta = mock_embedding("beta", 4); + let gamma = mock_embedding("gamma", 4); + let data = format!( + concat!( + r#"{{"type":"Doc","data":{{"slug":"alpha-doc","title":"alpha guide","embedding":[{}]}}}}"#, + "\n", + r#"{{"type":"Doc","data":{{"slug":"beta-doc","title":"beta guide","embedding":[{}]}}}}"#, + "\n", + r#"{{"type":"Doc","data":{{"slug":"gamma-doc","title":"gamma handbook","embedding":[{}]}}}}"# + ), + format_vector(&alpha), + format_vector(&beta), + format_vector(&gamma), + ); + + let _guard = EnvGuard::set(&[ + ("OMNIGRAPH_EMBEDDINGS_MOCK", Some("1")), + ("GEMINI_API_KEY", None), + ]); + let temp = init_graph_with_schema_and_data(EMBED_SCHEMA, &data).await; + let graph = graph_path(temp.path()); + let state = AppState::open(graph.to_string_lossy().to_string()) + .await + .unwrap(); + let app = build_app(state); + + let read = ReadRequest { + query_source: EMBED_QUERY.to_string(), + query_name: Some("vector_search_string".to_string()), + params: Some(json!({ "q": "alpha" })), + branch: Some("main".to_string()), + snapshot: None, + }; + let (status, body) = json_response( + &app, + Request::builder() + .uri("/read") + .method(Method::POST) + .header("content-type", "application/json") + 
.body(Body::from(serde_json::to_vec(&read).unwrap())) + .unwrap(), + ) + .await; + + assert_eq!(status, StatusCode::OK); + assert_eq!(body["row_count"], 3); + assert_eq!(body["rows"][0]["d.slug"], "alpha-doc"); +} + +#[tokio::test(flavor = "multi_thread")] +async fn change_conflict_returns_manifest_conflict_409() { + // A write that races with another writer surfaces as HTTP 409 with + // a structured `manifest_conflict` body (`table_key`, `expected`, + // and `actual`) so clients can detect-and-retry without parsing + // the message. + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + + // Build the server first so its handle pins the pre-mutation manifest + // version. Then advance the manifest from outside the server. The + // server's next /change call will capture stale `expected_versions` + // (from its still-pinned snapshot) and the publisher's CAS rejects. + let state = AppState::open(graph.to_string_lossy().to_string()) + .await + .unwrap(); + let app = build_app(state); + + { + let mut db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.mutate( + "main", + MUTATION_QUERIES, + "set_age", + &omnigraph_compiler::json_params_to_param_map( + Some(&json!({"name": "Alice", "age": 31 })), + &omnigraph_compiler::find_named_query(MUTATION_QUERIES, "set_age") + .unwrap() + .params, + omnigraph_compiler::JsonParamMode::Standard, + ) + .unwrap(), + ) + .await + .unwrap(); + } + + let (status, body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from( + serde_json::to_vec(&ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("set_age".to_string()), + params: Some(json!({ "name": "Alice", "age": 33 })), + branch: Some("main".to_string()), + }) + .unwrap(), + )) + .unwrap(), + ) + .await; + + assert_eq!(status, StatusCode::CONFLICT); + let error: ErrorOutput = serde_json::from_value(body).unwrap(); + 
assert_eq!(error.code, Some(omnigraph_server::api::ErrorCode::Conflict)); + let conflict = error + .manifest_conflict + .expect("publisher CAS rejection must populate manifest_conflict body"); + assert_eq!(conflict.table_key, "node:Person"); + assert!( + conflict.actual > conflict.expected, + "actual ({}) should be ahead of expected ({})", + conflict.actual, + conflict.expected, + ); +} + +#[tokio::test(flavor = "multi_thread", worker_threads = 4)] +async fn change_concurrent_inserts_same_key_serialize_without_409() { + // PR 2 Phase 2 (MR-686): pin the design fix for the same-key + // concurrency hazard. Pre-fix, in-process concurrent inserts on + // the same `(table, branch)` rejected with 409 manifest_conflict + // because `ensure_expected_version` fired before the per-table + // queue was acquired and saw Lance HEAD already advanced by a + // peer writer. Post-fix, Insert/Merge skip the strict pre-stage + // check (see `MutationOpKind::strict_pre_stage_version_check`); + // the queue serializes commit_staged; Lance's natural rebase + // handles the in-flight stage; the publisher's CAS on a fresh + // per-branch snapshot under the queue catches genuine cross- + // process drift. + // + // This test spawns N concurrent /change inserts on a single + // node type and asserts: every request returns 200 (no 409), + // and the final row count equals the seed count + N (every + // staged batch actually committed). + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let state = AppState::open(graph.to_string_lossy().to_string()) + .await + .unwrap(); + let app = build_app(state); + + // test.jsonl seeds 4 Persons (Alice, Bob, Charlie, Diana). 
+ const SEED_PERSON_ROWS: u64 = 4; + const N: usize = 12; + + let mut handles = Vec::with_capacity(N); + for i in 0..N { + let app = app.clone(); + handles.push(tokio::spawn(async move { + let body = serde_json::to_vec(&ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": format!("racer-{i}"), "age": i as i32 })), + branch: Some("main".to_string()), + }) + .unwrap(); + let req = Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(); + let response = app.oneshot(req).await.unwrap(); + response.status() + })); + } + + let mut statuses = Vec::with_capacity(N); + for h in handles { + statuses.push(h.await.unwrap()); + } + + let bad: Vec<_> = statuses + .iter() + .enumerate() + .filter(|(_, s)| **s != StatusCode::OK) + .collect(); + assert!( + bad.is_empty(), + "expected every concurrent insert to return 200, got non-200 for: {:?}", + bad + ); + + // Verify the inserts actually landed. The status check above only proves + // the publisher CAS didn't reject; the row count proves none of the + // concurrent commits silently overwrote a peer. 
+ let (snapshot_status, snapshot_body) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(snapshot_status, StatusCode::OK); + let person_rows = snapshot_body["tables"] + .as_array() + .and_then(|tables| { + tables + .iter() + .find(|t| t["table_key"].as_str() == Some("node:Person")) + }) + .and_then(|t| t["row_count"].as_u64()) + .expect("snapshot must include node:Person row_count"); + assert_eq!( + person_rows, + SEED_PERSON_ROWS + N as u64, + "expected {} seeded + {} concurrent inserts = {} Person rows; got {}", + SEED_PERSON_ROWS, + N, + SEED_PERSON_ROWS + N as u64, + person_rows, + ); +} + +#[tokio::test(flavor = "multi_thread", worker_threads = 4)] +async fn change_concurrent_updates_same_key_serialize_via_publisher_cas() { + // Pin Update RYW semantics under in-process concurrency on the same + // `(table, branch)`. With per-table queue serialization and op-kind-aware + // drift detection at commit time, exactly one of N concurrent UPDATEs + // on the same row commits; the rest are rejected as 409 manifest_conflict. + // + // Pre-fix bug class: in `MutationStaging::commit_all`, after queue + // acquisition, the staged Lance transaction is handed straight to + // `commit_staged`. For a writer whose staged dataset is at V0 but + // Lance HEAD has advanced to V1 (because the queue's prior winner + // already published), Lance's transaction conflict resolver fires + // `RetryableCommitConflict` on Update vs Update on the same row. + // That error gets wrapped as `OmniError::Lance()` and the + // API surfaces it as **500 internal**, not 409. Users see "internal + // server error" instead of a retryable conflict, breaking the + // documented 409 contract for in-process drift. + // + // Post-fix invariant: `commit_all` does an op-kind-aware drift check + // before each `commit_staged`. 
For tables whose tracked op_kind has + // `strict_pre_stage_version_check() == true` (Update / Delete / + // SchemaRewrite), if the staged dataset's version doesn't match the + // fresh manifest pin, return `OmniError::manifest_expected_version_mismatch` + // -> 409 ExpectedVersionMismatch. The N-1 losers see a clean 409 + // before Lance's commit_staged ever runs. + // + // Why correct-by-design: closing the class "Lance internal conflict + // surfaces as 500 instead of 409" rather than mapping the specific + // Lance error variant. The drift check fires at the right architectural + // layer (engine boundary, under the queue) and respects the existing + // `MutationOpKind` policy. + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let state = AppState::open(graph.to_string_lossy().to_string()) + .await + .unwrap(); + let app = build_app(state); + + // Spawn N=8 concurrent UPDATEs on Alice (from test.jsonl, age=30 at V0) + // writing distinct ages. + const N: usize = 8; + let mut handles = Vec::with_capacity(N); + for i in 0..N { + let app = app.clone(); + let target_age = 100 + i as i32; + handles.push(tokio::spawn(async move { + let body = serde_json::to_vec(&ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("set_age".to_string()), + params: Some(json!({ "name": "Alice", "age": target_age })), + branch: Some("main".to_string()), + }) + .unwrap(); + let req = Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(); + let response = app.oneshot(req).await.unwrap(); + let status = response.status(); + let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); + (status, body.to_vec()) + })); + } + + let mut results = Vec::with_capacity(N); + for h in handles { + results.push(h.await.unwrap()); + } + let statuses: Vec<StatusCode> = results.iter().map(|(s, _)| *s).collect(); + + let ok_count = statuses.iter().filter(|s| **s == 
StatusCode::OK).count(); + let conflict_count = statuses + .iter() + .filter(|s| **s == StatusCode::CONFLICT) + .count(); + let other: Vec<_> = statuses + .iter() + .enumerate() + .filter(|(_, s)| **s != StatusCode::OK && **s != StatusCode::CONFLICT) + .collect(); + + let other_bodies: Vec<(usize, StatusCode, String)> = other + .iter() + .map(|(i, s)| { + let body_str = String::from_utf8_lossy(&results[*i].1).to_string(); + (*i, **s, body_str) + }) + .collect(); + assert!( + other.is_empty(), + "expected only 200 or 409 statuses, got non-200/409 entries: {:?}", + other_bodies + ); + assert_eq!( + ok_count + conflict_count, + N, + "all responses must be 200 or 409 to satisfy the RYW invariant; statuses: {:?}", + statuses + ); + assert_eq!( + ok_count, + 1, + "expected exactly one update to commit and N-1 to receive 409 manifest_conflict \ + (op-kind-aware drift check rejects stale-V0 staged datasets at commit_all entry). \ + Got {} OK + {} 409 + {} other. \ + Pre-fix symptom: 1 OK + (N-1) x 500 because Lance's RetryableCommitConflict for \ + Update vs Update on the same row bubbles up as `OmniError::Lance()` and \ + the API maps it to 500 internal, not 409. Statuses: {:?}", + ok_count, + conflict_count, + statuses.len() - ok_count - conflict_count, + statuses, + ); +} + +// ───────────────────────────────────────────────────────────────────────── +// Branch-ops morphological matrix +// +// Table-driven test covering all interesting (op_a, op_b, target_overlap) +// concurrent-pair cells with the C1-C6 invariants asserted uniformly: +// +// C1 - both complete (no deadlock, no hang) +// C2 - status: both 200, or exactly one clean conflict (409/429), no 500 +// C3 - per-target row count +// C4 - per-target row identity (present + absent named persons) +// C5 - engine state remains coherent (subsequent /snapshot is consistent) +// C6 - post-op /change on main succeeds (engine state isn't poisoned) +// +// Cell list (a-k) below. 
Each cell uses a fresh tempdir + AppState so a +// failure in one doesn't leak into the next. Within a cell, ops align at +// a tokio::sync::Barrier so both reach the engine close in time, and the +// pair is wrapped in tokio::time::timeout(15s) so a deadlock surfaces +// as a clean panic. +// +// Replaces the three narrow concurrent_branch_* tests below; their +// scenarios are folded into cells f, h, i (branch_create_from race), +// cell a (merge race with C4 identity assertions), and cell d +// (concurrent change-during-merge). +// ───────────────────────────────────────────────────────────────────────── + +mod matrix { + use super::*; + use std::time::Duration; + use tokio::sync::Barrier; + + #[derive(Debug)] + pub(super) struct OpStatus { + pub status: StatusCode, + pub body: Vec<u8>, + } + + pub(super) struct Harness { + pub _temp: tempfile::TempDir, + pub app: Router, + } + + impl Harness { + pub async fn new() -> Self { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + // Build the WorkloadController explicitly with defaults rather + // than letting `AppState::open` call + // `WorkloadController::from_env()`. The admission-gate test + // (`ingest_per_actor_admission_cap_returns_429`) sets + // OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1 inside an EnvGuard while + // it runs. Process-wide env vars are visible to + // concurrently-running tests; if a matrix cell reads env at + // AppState construction time during that window it picks up + // cap=1 and the second concurrent merge in cell b surfaces + // 429 instead of the expected 200. Constructing the + // controller here with explicit defaults makes cells + // independent of any env mutation other tests perform. 
+ let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + let workload = omnigraph_server::workload::WorkloadController::with_defaults(); + let state = AppState::new_with_workload( + graph.to_string_lossy().to_string(), + db, + Vec::new(), + workload, + ); + let app = build_app(state); + Self { _temp: temp, app } + } + + pub async fn create_branch(&self, from: &str, name: &str) { + let body = serde_json::to_vec(&BranchCreateRequest { + from: Some(from.to_string()), + name: name.to_string(), + }) + .unwrap(); + let r = self + .app + .clone() + .oneshot( + Request::builder() + .uri("/branches") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!( + r.status(), + StatusCode::OK, + "setup create_branch {} from {} failed", + name, + from + ); + } + + pub async fn insert_person(&self, branch: &str, name: &str, age: i32) { + let body = serde_json::to_vec(&ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": name, "age": age })), + branch: Some(branch.to_string()), + }) + .unwrap(); + let r = self + .app + .clone() + .oneshot( + Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!( + r.status(), + StatusCode::OK, + "setup insert {} on {} failed", + name, + branch + ); + } + + /// Run two ops concurrently with barrier alignment + 15s deadlock + /// timeout. Returns `(op_a, op_b)`. Panics on timeout. 
+ pub async fn run_pair( + &self, + op_a: impl FnOnce(Router, Arc<Barrier>) -> tokio::task::JoinHandle<OpStatus>, + op_b: impl FnOnce(Router, Arc<Barrier>) -> tokio::task::JoinHandle<OpStatus>, + ) -> (OpStatus, OpStatus) { + let barrier = Arc::new(Barrier::new(2)); + let h_a = op_a(self.app.clone(), Arc::clone(&barrier)); + let h_b = op_b(self.app.clone(), Arc::clone(&barrier)); + let result = tokio::time::timeout(Duration::from_secs(15), async { + let a = h_a.await.unwrap(); + let b = h_b.await.unwrap(); + (a, b) + }) + .await; + result.expect("concurrent op pair deadlocked (>15s)") + } + + pub async fn person_count(&self, branch: &str) -> u64 { + let r = self + .app + .clone() + .oneshot( + Request::builder() + .uri(format!("/snapshot?branch={}", branch)) + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(r.status(), StatusCode::OK, "snapshot {} failed", branch); + let body = to_bytes(r.into_body(), usize::MAX).await.unwrap(); + let v: Value = serde_json::from_slice(&body).unwrap(); + v["tables"] + .as_array() + .and_then(|tables| { + tables + .iter() + .find(|t| t["table_key"].as_str() == Some("node:Person")) + }) + .and_then(|t| t["row_count"].as_u64()) + .unwrap_or_else(|| panic!("snapshot {} missing node:Person", branch)) + } + + /// True iff the named Person exists on `branch`. Uses the + /// `get_person` query from `test.gq` for identity rather than + /// just count. 
+ pub async fn person_exists(&self, branch: &str, name: &str) -> bool { + let body = serde_json::to_vec(&ReadRequest { + query_source: include_str!("../../omnigraph/tests/fixtures/test.gq").to_string(), + query_name: Some("get_person".to_string()), + params: Some(json!({ "name": name })), + branch: Some(branch.to_string()), + snapshot: None, + }) + .unwrap(); + let r = self + .app + .clone() + .oneshot( + Request::builder() + .uri("/read") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!( + r.status(), + StatusCode::OK, + "person_exists query for {} on {} failed", + name, + branch + ); + let body = to_bytes(r.into_body(), usize::MAX).await.unwrap(); + let v: Value = serde_json::from_slice(&body).unwrap(); + v["row_count"].as_u64().unwrap_or(0) > 0 + } + + /// Asserts each name in `present` exists on `branch` and each in + /// `absent` does not. Identity-grade check that catches symmetric + /// swap races a row-count assertion would miss. + pub async fn assert_persons( + &self, + branch: &str, + cell: &str, + present: &[&str], + absent: &[&str], + ) { + for name in present { + assert!( + self.person_exists(branch, name).await, + "[{}] expected {} to be present on {}", + cell, + name, + branch + ); + } + for name in absent { + assert!( + !self.person_exists(branch, name).await, + "[{}] expected {} to be absent from {}", + cell, + name, + branch + ); + } + } + + /// C6: insert a uniquely-named sentinel on main and verify it + /// landed. Catches engine-state poisoning where a cell's + /// concurrent ops left the engine half-broken, so that a subsequent + /// /change either deadlocks or returns a non-200. 
+ pub async fn assert_post_op_sentinel(&self, cell: &str, sentinel: &str) { + let body = serde_json::to_vec(&ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": sentinel, "age": 99 })), + branch: Some("main".to_string()), + }) + .unwrap(); + let r = self + .app + .clone() + .oneshot( + Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!( + r.status(), + StatusCode::OK, + "[{}] post-op sentinel /change on main failed (engine poisoned?)", + cell + ); + assert!( + self.person_exists("main", sentinel).await, + "[{}] sentinel {} did not land on main", + cell, + sentinel + ); + } + } + + // Helpers that build the closures for `run_pair`. Each takes a + // Router + Barrier and returns a JoinHandle yielding the status/body. + + pub(super) fn op_merge( + source: String, + target: String, + ) -> impl FnOnce(Router, Arc<Barrier>) -> tokio::task::JoinHandle<OpStatus> { + move |app: Router, barrier: Arc<Barrier>| { + tokio::spawn(async move { + barrier.wait().await; + let body = serde_json::to_vec(&BranchMergeRequest { + source, + target: Some(target), + }) + .unwrap(); + let response = app + .oneshot( + Request::builder() + .uri("/branches/merge") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(), + ) + .await + .unwrap(); + let status = response.status(); + let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); + OpStatus { + status, + body: body.to_vec(), + } + }) + } + } + + pub(super) fn op_change_insert( + branch: String, + name: String, + age: i32, + ) -> impl FnOnce(Router, Arc<Barrier>) -> tokio::task::JoinHandle<OpStatus> { + move |app: Router, barrier: Arc<Barrier>| { + tokio::spawn(async move { + barrier.wait().await; + let body = serde_json::to_vec(&ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: 
Some("insert_person".to_string()), + params: Some(json!({ "name": name, "age": age })), + branch: Some(branch), + }) + .unwrap(); + let response = app + .oneshot( + Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(), + ) + .await + .unwrap(); + let status = response.status(); + let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); + OpStatus { + status, + body: body.to_vec(), + } + }) + } + } + + pub(super) fn op_branch_create( + from: String, + name: String, + ) -> impl FnOnce(Router, Arc<Barrier>) -> tokio::task::JoinHandle<OpStatus> { + move |app: Router, barrier: Arc<Barrier>| { + tokio::spawn(async move { + barrier.wait().await; + let body = serde_json::to_vec(&BranchCreateRequest { + from: Some(from), + name, + }) + .unwrap(); + let response = app + .oneshot( + Request::builder() + .uri("/branches") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(), + ) + .await + .unwrap(); + let status = response.status(); + let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); + OpStatus { + status, + body: body.to_vec(), + } + }) + } + } + + pub(super) fn op_branch_delete( + name: String, + ) -> impl FnOnce(Router, Arc<Barrier>) -> tokio::task::JoinHandle<OpStatus> { + move |app: Router, barrier: Arc<Barrier>| { + tokio::spawn(async move { + barrier.wait().await; + let response = app + .oneshot( + Request::builder() + .uri(format!("/branches/{}", name)) + .method(Method::DELETE) + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + let status = response.status(); + let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); + OpStatus { + status, + body: body.to_vec(), + } + }) + } + } +} + +#[tokio::test(flavor = "multi_thread", worker_threads = 4)] +async fn concurrent_branch_ops_morphological_matrix() { + // Cell a: Merge × Merge, distinct targets. 
+ // Pre-fix on b09a097/22d76db: branch_merge_impl's swap-restore race + // landed feature_a's content in target_b instead of target_a (and + // vice versa, i.e. a symmetric swap). Identity asserts catch both + // asymmetric and symmetric variants. + { + let cell = "a:merge×merge:distinct-targets"; + let h = matrix::Harness::new().await; + h.create_branch("main", "feature-a-cella").await; + h.insert_person("feature-a-cella", "EveA-cella", 22).await; + h.create_branch("main", "feature-b-cella").await; + h.insert_person("feature-b-cella", "FrankB-cella", 33).await; + h.create_branch("main", "target-a-cella").await; + h.create_branch("main", "target-b-cella").await; + + let (sa, sb) = h + .run_pair( + matrix::op_merge("feature-a-cella".to_string(), "target-a-cella".to_string()), + matrix::op_merge("feature-b-cella".to_string(), "target-b-cella".to_string()), + ) + .await; + assert_eq!(sa.status, StatusCode::OK, "[{}] merge a", cell); + assert_eq!(sb.status, StatusCode::OK, "[{}] merge b", cell); + h.assert_persons("target-a-cella", cell, &["EveA-cella"], &["FrankB-cella"]) + .await; + h.assert_persons("target-b-cella", cell, &["FrankB-cella"], &["EveA-cella"]) + .await; + h.assert_post_op_sentinel(cell, "sentinel-cella").await; + } + + // Cell b: Merge × Merge, same target / distinct sources. + // Both want to land in main. merge_exclusive serializes; both should + // succeed and main should contain BOTH sources' contributions. 
+ { + let cell = "b:merge×merge:same-target-distinct-sources"; + let h = matrix::Harness::new().await; + h.create_branch("main", "src-x-cellb").await; + h.insert_person("src-x-cellb", "Xavier-cellb", 41).await; + h.create_branch("main", "src-y-cellb").await; + h.insert_person("src-y-cellb", "Yvonne-cellb", 42).await; + + let (sa, sb) = h + .run_pair( + matrix::op_merge("src-x-cellb".to_string(), "main".to_string()), + matrix::op_merge("src-y-cellb".to_string(), "main".to_string()), + ) + .await; + assert_eq!(sa.status, StatusCode::OK, "[{}] merge x", cell); + assert_eq!(sb.status, StatusCode::OK, "[{}] merge y", cell); + h.assert_persons("main", cell, &["Xavier-cellb", "Yvonne-cellb"], &[]) + .await; + h.assert_post_op_sentinel(cell, "sentinel-cellb").await; + } + + // Cell c: Merge × Merge, same source / distinct targets (fanout). + // One source merged into two targets simultaneously. merge_exclusive + // serializes; both targets should reflect the source's content. + { + let cell = "c:merge×merge:same-source-distinct-targets"; + let h = matrix::Harness::new().await; + h.create_branch("main", "src-shared-cellc").await; + h.insert_person("src-shared-cellc", "Sharon-cellc", 50) + .await; + h.create_branch("main", "tgt-1-cellc").await; + h.create_branch("main", "tgt-2-cellc").await; + + let (sa, sb) = h + .run_pair( + matrix::op_merge("src-shared-cellc".to_string(), "tgt-1-cellc".to_string()), + matrix::op_merge("src-shared-cellc".to_string(), "tgt-2-cellc".to_string()), + ) + .await; + assert_eq!(sa.status, StatusCode::OK, "[{}] merge into tgt-1", cell); + assert_eq!(sb.status, StatusCode::OK, "[{}] merge into tgt-2", cell); + h.assert_persons("tgt-1-cellc", cell, &["Sharon-cellc"], &[]) + .await; + h.assert_persons("tgt-2-cellc", cell, &["Sharon-cellc"], &[]) + .await; + h.assert_post_op_sentinel(cell, "sentinel-cellc").await; + } + + // Cell d: Merge × Change, both touching main. 
C2 permits both to + // succeed, or exactly one clean 409 if the merge detects target + // movement after planning but before acquiring the queue. + { + let cell = "d:merge×change:into-target"; + let h = matrix::Harness::new().await; + h.create_branch("main", "feature-celld").await; + h.insert_person("feature-celld", "EveD-celld", 22).await; + + let (sa, sb) = h + .run_pair( + matrix::op_merge("feature-celld".to_string(), "main".to_string()), + matrix::op_change_insert("main".to_string(), "FrankD-celld".to_string(), 33), + ) + .await; + assert_eq!(sb.status, StatusCode::OK, "[{}] change", cell); + assert!( + sa.status == StatusCode::OK || sa.status == StatusCode::CONFLICT, + "[{}] merge must be 200 or clean 409, got {}", + cell, + sa.status + ); + if sa.status == StatusCode::OK { + h.assert_persons("main", cell, &["EveD-celld", "FrankD-celld"], &[]) + .await; + } else { + let error: ErrorOutput = serde_json::from_slice(&sa.body).unwrap(); + let conflict = error + .manifest_conflict + .expect("merge 409 must include manifest_conflict"); + assert_eq!( + conflict.table_key, "node:Person", + "[{}] conflict table", + cell + ); + h.assert_persons("main", cell, &["FrankD-celld"], &["EveD-celld"]) + .await; + } + h.assert_post_op_sentinel(cell, "sentinel-celld").await; + } + + // Cell e: Merge × BranchCreateFrom-target. Concurrent fork off the + // merge target while the merge runs. Both should succeed; the new + // branch should have a coherent view (either pre- or post-merge, + // both valid). After both, target = main has the merged content. 
+ { + let cell = "e:merge×branch_create_from:target"; + let h = matrix::Harness::new().await; + h.create_branch("main", "src-celle").await; + h.insert_person("src-celle", "Eve-celle", 22).await; + + let (sa, sb) = h + .run_pair( + matrix::op_merge("src-celle".to_string(), "main".to_string()), + matrix::op_branch_create("main".to_string(), "fork-celle".to_string()), + ) + .await; + assert_eq!(sa.status, StatusCode::OK, "[{}] merge", cell); + assert_eq!(sb.status, StatusCode::OK, "[{}] branch_create_from", cell); + // Main definitely has Eve. + h.assert_persons("main", cell, &["Eve-celle"], &[]).await; + // fork-celle was forked off main at SOME version; main's current + // count is 5 (4 seeded + Eve). fork-celle has either 4 (pre-merge + // snapshot) or 5 (post-merge snapshot); both are valid timings. + let fork_count = h.person_count("fork-celle").await; + assert!( + fork_count == 4 || fork_count == 5, + "[{}] fork-celle row count must be pre- or post-merge view (4 or 5), got {}", + cell, + fork_count + ); + h.assert_post_op_sentinel(cell, "sentinel-celle").await; + } + + // Cell f: BranchCreateFrom × BranchCreateFrom, distinct parents. + // Pre-fix on f925ad1: swap-restore race in branch_create_from_impl + // forked the new branch off the wrong parent. Identity asserts pin + // that fork-from-A inherits A's content, fork-from-B inherits B's. 
+ { + let cell = "f:branch_create_from×branch_create_from:distinct-parents"; + let h = matrix::Harness::new().await; + h.create_branch("main", "alpha-cellf").await; + h.insert_person("alpha-cellf", "Eve-cellf", 22).await; + h.create_branch("main", "beta-cellf").await; + + let (sa, sb) = h + .run_pair( + matrix::op_branch_create("alpha-cellf".to_string(), "gamma-cellf".to_string()), + matrix::op_branch_create("beta-cellf".to_string(), "delta-cellf".to_string()), + ) + .await; + assert_eq!(sa.status, StatusCode::OK, "[{}] gamma create", cell); + assert_eq!(sb.status, StatusCode::OK, "[{}] delta create", cell); + // gamma forks off alpha → must contain Eve. + h.assert_persons("gamma-cellf", cell, &["Eve-cellf"], &[]) + .await; + // delta forks off beta → must NOT contain Eve. + h.assert_persons("delta-cellf", cell, &[], &["Eve-cellf"]) + .await; + h.assert_post_op_sentinel(cell, "sentinel-cellf").await; + } + + // Cell g: BranchCreateFrom × BranchDelete, unrelated branches. + // Disjoint branches; both should complete cleanly without + // interference. + { + let cell = "g:branch_create_from×branch_delete:unrelated"; + let h = matrix::Harness::new().await; + h.create_branch("main", "doomed-cellg").await; + + let (sa, sb) = h + .run_pair( + matrix::op_branch_create("main".to_string(), "newborn-cellg".to_string()), + matrix::op_branch_delete("doomed-cellg".to_string()), + ) + .await; + assert_eq!(sa.status, StatusCode::OK, "[{}] create newborn", cell); + assert_eq!(sb.status, StatusCode::OK, "[{}] delete doomed", cell); + // newborn-cellg exists with main's content. + h.assert_persons("newborn-cellg", cell, &["Alice"], &[]) + .await; + h.assert_post_op_sentinel(cell, "sentinel-cellg").await; + } + + // Cell h: BranchDelete × BranchDelete, distinct branches. Both call + // refresh() internally; verify no deadlock and both deletes land. 
+ { + let cell = "h:branch_delete×branch_delete:distinct"; + let h = matrix::Harness::new().await; + h.create_branch("main", "doomed1-cellh").await; + h.create_branch("main", "doomed2-cellh").await; + + let (sa, sb) = h + .run_pair( + matrix::op_branch_delete("doomed1-cellh".to_string()), + matrix::op_branch_delete("doomed2-cellh".to_string()), + ) + .await; + assert_eq!(sa.status, StatusCode::OK, "[{}] delete 1", cell); + assert_eq!(sb.status, StatusCode::OK, "[{}] delete 2", cell); + // Verify both gone via /branches list (snapshot would still work + // for a deleted branch via parent fallback in some paths, so we + // use the explicit list). + let r = h + .app + .clone() + .oneshot( + Request::builder() + .uri("/branches") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(r.status(), StatusCode::OK); + let body = to_bytes(r.into_body(), usize::MAX).await.unwrap(); + let list_body: Value = serde_json::from_slice(&body).unwrap(); + let branches: Vec<&str> = list_body["branches"] + .as_array() + .unwrap() + .iter() + .filter_map(|v| v.as_str()) + .collect(); + assert!( + !branches.contains(&"doomed1-cellh"), + "[{}] doomed1 still in branch list: {:?}", + cell, + branches + ); + assert!( + !branches.contains(&"doomed2-cellh"), + "[{}] doomed2 still in branch list: {:?}", + cell, + branches + ); + h.assert_post_op_sentinel(cell, "sentinel-cellh").await; + } + + // Cell i: BranchDelete × Change, on a different branch. Delete one + // branch while a /change runs on main. Both should succeed. 
+ { + let cell = "i:branch_delete×change:distinct-branch"; + let h = matrix::Harness::new().await; + h.create_branch("main", "doomed-celli").await; + + let (sa, sb) = h + .run_pair( + matrix::op_branch_delete("doomed-celli".to_string()), + matrix::op_change_insert("main".to_string(), "Pat-celli".to_string(), 44), + ) + .await; + assert_eq!(sa.status, StatusCode::OK, "[{}] delete", cell); + assert_eq!(sb.status, StatusCode::OK, "[{}] change", cell); + h.assert_persons("main", cell, &["Pat-celli"], &[]).await; + h.assert_post_op_sentinel(cell, "sentinel-celli").await; + } + + // Cell j: BranchCreateFrom × Change, both on main. The fork timing + // determines whether the new branch sees the change (pre or post). + // Both valid. Main must contain the inserted row. + { + let cell = "j:branch_create_from×change:on-source"; + let h = matrix::Harness::new().await; + + let (sa, sb) = h + .run_pair( + matrix::op_branch_create("main".to_string(), "twin-cellj".to_string()), + matrix::op_change_insert("main".to_string(), "Quincy-cellj".to_string(), 55), + ) + .await; + assert_eq!(sa.status, StatusCode::OK, "[{}] branch_create", cell); + assert_eq!(sb.status, StatusCode::OK, "[{}] change", cell); + h.assert_persons("main", cell, &["Quincy-cellj"], &[]).await; + // twin-cellj has either pre-change view (no Quincy) or + // post-change view (with Quincy); either is valid. + let twin_has_quincy = h.person_exists("twin-cellj", "Quincy-cellj").await; + let _ = twin_has_quincy; // either timing is valid; just ensure no panic + h.assert_post_op_sentinel(cell, "sentinel-cellj").await; + } + + // Cell k: reopen consistency. Run a representative concurrent pair, + // drop the engine, reopen on a separate handle, verify state matches. 
+ { + let cell = "k:reopen-after-pair"; + let h = matrix::Harness::new().await; + h.create_branch("main", "src-cellk").await; + h.insert_person("src-cellk", "Rita-cellk", 36).await; + + let (sa, sb) = h + .run_pair( + matrix::op_merge("src-cellk".to_string(), "main".to_string()), + matrix::op_change_insert("main".to_string(), "Steve-cellk".to_string(), 37), + ) + .await; + assert_eq!(sb.status, StatusCode::OK, "[{}] change", cell); + assert!( + sa.status == StatusCode::OK || sa.status == StatusCode::CONFLICT, + "[{}] merge must be 200 or clean 409, got {}", + cell, + sa.status + ); + if sa.status == StatusCode::OK { + h.assert_persons("main", cell, &["Rita-cellk", "Steve-cellk"], &[]) + .await; + } else { + let error: ErrorOutput = serde_json::from_slice(&sa.body).unwrap(); + let conflict = error + .manifest_conflict + .expect("merge 409 must include manifest_conflict"); + assert_eq!( + conflict.table_key, "node:Person", + "[{}] conflict table", + cell + ); + h.assert_persons("main", cell, &["Steve-cellk"], &["Rita-cellk"]) + .await; + } + + // Reopen via a fresh AppState on the same graph. + let graph_uri = format!("{}/server.omni", h._temp.path().display()); + let reopened = AppState::open(graph_uri.clone()).await.unwrap(); + let app2 = build_app(reopened); + // Sanity: the same identity check via the new app must see + // Rita and Steve. 
+ let r = app2 + .clone() + .oneshot( + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(r.status(), StatusCode::OK, "[{}] reopen snapshot", cell); + let body = to_bytes(r.into_body(), usize::MAX).await.unwrap(); + let v: Value = serde_json::from_slice(&body).unwrap(); + let person_rows = v["tables"] + .as_array() + .and_then(|tables| { + tables + .iter() + .find(|t| t["table_key"].as_str() == Some("node:Person")) + }) + .and_then(|t| t["row_count"].as_u64()) + .expect("reopen snapshot must include node:Person row_count"); + let expected_rows = if sa.status == StatusCode::OK { 6 } else { 5 }; + assert_eq!( + person_rows, expected_rows, + "[{}] reopened main should include seed (4) + committed concurrent writes", + cell, + ); + } +} + +#[tokio::test(flavor = "multi_thread", worker_threads = 4)] +async fn change_disjoint_table_concurrency_succeeds_at_http_level() { + // HTTP-level pin for MR-686's disjoint-table promise: concurrent /change + // requests touching different node types must coexist without admission + // rejection or publisher-CAS conflict. The bench harness measures + // throughput; this test is the regression sentinel that catches a + // future change which accidentally re-introduces graph-wide + // serialization on the disjoint path. + // + // Setup: test.jsonl seeds 4 Persons + 2 Companies. Spawn N=4 concurrent + // /change inserts on `node:Person` and N=4 concurrent inserts on + // `node:Company`. All 8 must return 200, and the post-test row counts + // must reflect every insert. 
+ const PERSON_QUERY: &str = r#" +query insert_p($name: String, $age: I32) { + insert Person { name: $name, age: $age } +} +"#; + const COMPANY_QUERY: &str = r#" +query insert_c($name: String) { + insert Company { name: $name } +} +"#; + const SEED_PERSONS: u64 = 4; + const SEED_COMPANIES: u64 = 2; + const PER_TYPE: usize = 4; + + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let state = AppState::open(graph.to_string_lossy().to_string()) + .await + .unwrap(); + let app = build_app(state); + + let mut handles = Vec::with_capacity(PER_TYPE * 2); + for i in 0..PER_TYPE { + let app_p = app.clone(); + handles.push(tokio::spawn(async move { + let body = serde_json::to_vec(&ChangeRequest { + query: PERSON_QUERY.to_string(), + name: Some("insert_p".to_string()), + params: Some(json!({ "name": format!("p-{i}"), "age": i as i32 })), + branch: Some("main".to_string()), + }) + .unwrap(); + let req = Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(); + app_p.oneshot(req).await.unwrap().status() + })); + let app_c = app.clone(); + handles.push(tokio::spawn(async move { + let body = serde_json::to_vec(&ChangeRequest { + query: COMPANY_QUERY.to_string(), + name: Some("insert_c".to_string()), + params: Some(json!({ "name": format!("c-{i}") })), + branch: Some("main".to_string()), + }) + .unwrap(); + let req = Request::builder() + .uri("/change") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(); + app_c.oneshot(req).await.unwrap().status() + })); + } + + let mut statuses = Vec::with_capacity(PER_TYPE * 2); + for h in handles { + statuses.push(h.await.unwrap()); + } + + let bad: Vec<_> = statuses + .iter() + .enumerate() + .filter(|(_, s)| **s != StatusCode::OK) + .collect(); + assert!( + bad.is_empty(), + "expected every disjoint /change insert to return 200, got non-200 for: {:?}", + bad, + ); + + 
// Verify both tables landed every insert. + let (status, body) = json_response( + &app, + Request::builder() + .uri("/snapshot?branch=main") + .method(Method::GET) + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); + let lookup_count = |table_key: &str| -> u64 { + body["tables"] + .as_array() + .and_then(|tables| { + tables + .iter() + .find(|t| t["table_key"].as_str() == Some(table_key)) + }) + .and_then(|t| t["row_count"].as_u64()) + .unwrap_or_else(|| panic!("snapshot missing {}", table_key)) + }; + assert_eq!( + lookup_count("node:Person"), + SEED_PERSONS + PER_TYPE as u64, + "Person row count after concurrent inserts", + ); + assert_eq!( + lookup_count("node:Company"), + SEED_COMPANIES + PER_TYPE as u64, + "Company row count after concurrent inserts", + ); +} + +#[tokio::test(flavor = "multi_thread", worker_threads = 4)] +async fn ingest_per_actor_admission_cap_returns_429() { + // Pin the admission gate on `/ingest`. With per-actor in-flight cap of 1 + // and 8 concurrent requests from the same actor, at least one request + // must be rejected with HTTP 429 and `code: too_many_requests`. + // + // Pre-fix bug class: the admission pattern at `server_change` + // (`crates/omnigraph-server/src/lib.rs:932`) was the only handler + // that called `WorkloadController::try_admit`. A heavy actor sending + // bulk-ingest traffic would exhaust shared engine capacity (Lance I/O + // threads, manifest churn) without ever hitting an admission cap. + // Pinned at the HTTP boundary so future refactors that drop the + // try_admit call from a mutating handler turn this red. + // + // Post-fix invariant: `/ingest`, `/branches/create`, `/branches/delete`, + // `/branches/merge`, and `/schema/apply` all gate on + // `state.workload.try_admit(&actor_arc, est_bytes)` after Cedar + // authorization and before the engine call. Cap exhaustion surfaces as + // 429 with `code: too_many_requests`. 
+ // + // Construct the WorkloadController directly with cap=1 instead of + // mutating `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX` via EnvGuard. Process-wide + // env vars are visible to concurrently-running tests; the previous + // `EnvGuard + #[serial]` pair leaked the override into any other test + // that called `AppState::open` during the guard's window + // (matrix CI failure on commit 99b0941). Using the explicit + // `AppState::new_with_workload` constructor closes that bug class: + // this test no longer mutates global state and no longer needs + // `#[serial]`. + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + let workload = omnigraph_server::workload::WorkloadController::new( + 1, // per-actor in-flight cap (the fixture under test) + 1_000_000_000, // per-actor byte budget, large so it never bottlenecks + ); + // MR-723: install a permit-all policy alongside the bearer token so + // /ingest (action=Change) passes Cedar evaluation. The test is + // exercising the admission cap, not policy; the policy is just + // enough to clear the State 3 path so the test reaches workload. + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, permit_all_policy_yaml(&["act-flooder"])).unwrap(); + let policy_engine = + omnigraph_server::PolicyEngine::load_graph(&policy_path, graph.to_string_lossy().as_ref()) + .unwrap(); + let state = AppState::new_single( + graph.to_string_lossy().to_string(), + db, + vec![("act-flooder".to_string(), "flooder-token".to_string())], + Some(policy_engine), + workload, + ); + let app = build_app(state); + let _temp = temp; + + // Eight concurrent ingests, all from act-flooder. Only one fits in a + // cap=1 in-flight semaphore; the others must 429. 
+ const N: usize = 8; + let barrier = Arc::new(tokio::sync::Barrier::new(N)); + let mut handles = Vec::with_capacity(N); + for i in 0..N { + let app = app.clone(); + let barrier = Arc::clone(&barrier); + handles.push(tokio::spawn(async move { + // Align the 8 tasks at the barrier so they all attempt + // try_admit close in time. + barrier.wait().await; + + let body = serde_json::to_vec(&IngestRequest { + data: format!( + "{{\"type\":\"Person\",\"data\":{{\"name\":\"flooder-{i}\",\"age\":{i}}}}}\n" + ), + branch: Some("main".to_string()), + from: Some("main".to_string()), + mode: Some(omnigraph::loader::LoadMode::Merge), + }) + .unwrap(); + let req = Request::builder() + .uri("/ingest") + .method(Method::POST) + .header("authorization", "Bearer flooder-token") + .header("content-type", "application/json") + .body(Body::from(body)) + .unwrap(); + let response = app.oneshot(req).await.unwrap(); + let status = response.status(); + let headers = response.headers().clone(); + let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); + (status, headers, body.to_vec()) + })); + } + + let mut results = Vec::with_capacity(N); + for h in handles { + results.push(h.await.unwrap()); + } + let statuses: Vec<StatusCode> = results.iter().map(|(s, _, _)| *s).collect(); + + let too_many: Vec<usize> = statuses + .iter() + .enumerate() + .filter(|(_, s)| **s == StatusCode::TOO_MANY_REQUESTS) + .map(|(i, _)| i) + .collect(); + assert!( + !too_many.is_empty(), + "expected at least one /ingest under cap=1 to return 429; got statuses: {:?}", + statuses, + ); + + // Validate the structured error body for each 429 (body must carry + // the `too_many_requests` code so clients can distinguish it from + // generic conflicts). 
+ for i in &too_many { + let body_value: Value = serde_json::from_slice(&results[*i].2).unwrap(); + let error: ErrorOutput = serde_json::from_value(body_value).unwrap(); + assert_eq!( + error.code, + Some(omnigraph_server::api::ErrorCode::TooManyRequests), + "429 body must carry code=too_many_requests; idx {} got {:?}", + i, + error.code, + ); + } + + // Validate the `Retry-After` header is set on every 429. Pinned by + // the same test so a future refactor that drops the header from + // `IntoResponse for ApiError` turns this red. The constant + // matches `crates/omnigraph-server/src/lib.rs::ApiError::into_response`. + for i in &too_many { + let retry_after = results[*i] + .1 + .get(axum::http::header::RETRY_AFTER) + .and_then(|v| v.to_str().ok()) + .map(str::to_string); + assert!( + retry_after.is_some(), + "429 response must include a Retry-After header; idx {} headers were: {:?}", + i, + results[*i].1, + ); + } +} + +/// Regression for B2 (MR-668): when an `AppState` is built with a +/// per-graph policy and a custom workload, the engine inside the +/// routing's `GraphHandle` MUST have the same policy applied via +/// `Omnigraph::with_policy`. Pre-fix, `new_with_workload(...).with_policy_engine(p)` +/// installed the policy only on the HTTP-layer `handle.policy`; the +/// underlying `Arc<Omnigraph>` was reused without `with_policy`, so any +/// caller reaching through `state.routing()` could bypass Cedar. +/// +/// This test reaches the engine the same way an embedded SDK consumer +/// or future routing code path would, and asserts the policy still +/// fires. The deny path is "act-blocked has a valid bearer but isn't in +/// the policy's allowed group", i.e., authenticated-but-unauthorised. 
+#[tokio::test(flavor = "multi_thread")] +async fn engine_layer_policy_fires_via_direct_arc_omnigraph_from_new_single() { + use omnigraph_server::GraphRouting; + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + + // Permit `act-allowed` for change actions; `act-blocked` is not in + // any allowed group, so every change request from them must be denied. + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, permit_all_policy_yaml(&["act-allowed"])).unwrap(); + let policy_engine = + omnigraph_server::PolicyEngine::load_graph(&policy_path, graph.to_string_lossy().as_ref()) + .unwrap(); + + let workload = omnigraph_server::workload::WorkloadController::new(100, 1_000_000_000); + let state = AppState::new_single( + graph.to_string_lossy().to_string(), + db, + vec![("act-blocked".to_string(), "block-token".to_string())], + Some(policy_engine), + workload, + ); + + // Reach into the routing and pull the engine the same way an + // embedded consumer holding `Arc<Omnigraph>` would. If `new_single` + // failed to apply `with_policy` to the engine, this `mutate_as` + // would succeed: the HTTP layer is bypassed entirely. + let handle = match state.routing() { + GraphRouting::Single { handle } => Arc::clone(handle), + GraphRouting::Multi { .. 
} => panic!("expected single-mode routing"), + }; + let engine = Arc::clone(&handle.engine); + + let mut params: omnigraph_compiler::ParamMap = Default::default(); + params.insert( + "name".to_string(), + omnigraph_compiler::Literal::String("EngineLayerBlocked".to_string()), + ); + params.insert("age".to_string(), omnigraph_compiler::Literal::Integer(30)); + let result = engine + .mutate_as( + "main", + MUTATION_QUERIES, + "insert_person", + &params, + Some("act-blocked"), + ) + .await; + match result { + Err(OmniError::Policy(_)) => { /* expected: engine-layer gate fired */ } + Ok(_) => panic!( + "engine-layer policy did NOT fire: act-blocked successfully ran mutate_as via \ + the engine pulled from the registry handle. AppState::new_single failed to apply \ + with_policy to the underlying Omnigraph engine. This is the B2 footgun the \ + with_policy_engine deletion was supposed to close." + ), + Err(other) => panic!("expected OmniError::Policy, got: {other:?}"), + } +} + +#[tokio::test(flavor = "multi_thread")] +async fn oversized_request_body_returns_payload_too_large() { + let (_temp, app) = app_for_loaded_graph().await; + let oversized = "x".repeat(1_100_000); + let response = app + .clone() + .oneshot( + Request::builder() + .uri("/read") + .method(Method::POST) + .header("content-type", "application/json") + .body(Body::from(oversized)) + .unwrap(), + ) + .await + .unwrap(); + + assert_eq!(response.status(), StatusCode::PAYLOAD_TOO_LARGE); +} + +// ─── MR-723 default-deny mode (State 2: tokens without policy) ────────── +// +// `authorize_request` returns 403 for every action except `Read` when a +// PolicyEngine is not installed but bearer tokens are configured. Pinned +// by the three tests below (Read allowed, Change/SchemaApply denied) +// to prevent regressing back to the pre-MR-723 "tokens configured but +// no policy = fully open" trap. 
+ +#[tokio::test(flavor = "multi_thread")] +async fn default_deny_mode_allows_read_for_authenticated_actor() { + let (_temp, app) = app_for_graph_with_auth_tokens_only( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-andrew", "demo-token")], + ) + .await; + + let (status, _body) = json_response( + &app, + Request::builder() + .uri("/snapshot") + .method(Method::GET) + .header(AUTHORIZATION, "Bearer demo-token") + .body(Body::empty()) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); +} + +#[tokio::test(flavor = "multi_thread")] +async fn default_deny_mode_rejects_change_with_forbidden() { + let (_temp, app) = app_for_graph_with_auth_tokens_only( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-andrew", "demo-token")], + ) + .await; + + let change = ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": "DefaultDeny", "age": 1 })), + branch: Some("main".to_string()), + }; + let (status, body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header(AUTHORIZATION, "Bearer demo-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&change).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::FORBIDDEN); + let error: ErrorOutput = serde_json::from_value(body).unwrap(); + assert!( + error.error.contains("default-deny"), + "expected default-deny in error message, got: {}", + error.error + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn default_deny_mode_rejects_schema_apply_with_forbidden() { + let (_temp, app) = app_for_graph_with_auth_tokens_only( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-andrew", "demo-token")], + ) + .await; + + let req = SchemaApplyRequest { + schema_source: additive_schema_with_nickname(), + ..Default::default() + }; + let (status, body) = json_response( + &app, + Request::builder() + .uri("/schema/apply") + 
.method(Method::POST) + .header(AUTHORIZATION, "Bearer demo-token") + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&req).unwrap())) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::FORBIDDEN); + let error: ErrorOutput = serde_json::from_value(body).unwrap(); + assert!( + error.error.contains("default-deny"), + "expected default-deny in error message, got: {}", + error.error + ); +} + +// ─── SDK ↔ HTTP decision parity (MR-722 PR A) ───────────────────────────── +// +// Engine and HTTP both consult Cedar via `PolicyChecker::check()`; by +// construction they cannot disagree on a decision. These tests pin that +// property explicitly so a future refactor that introduces a separate +// auth path (or copy-pastes Cedar evaluation logic) turns red. +// +// Four cases cover the per-action scope shapes: +// * Change on a protected branch via `mutate_as` / POST /change +// * Change with an actor that has no permit +// * BranchMerge to a protected target via `branch_merge_as` / POST /branches/merge +// * BranchMerge with an actor that has no permit + +const PARITY_POLICY_YAML: &str = r#" +version: 1 +groups: + team: [act-bruno] + admins: [act-ragnor] +protected_branches: [main] +rules: + - id: admins-change-anywhere + allow: + actors: { group: admins } + actions: [change] + branch_scope: any + - id: admins-merge-to-protected + allow: + actors: { group: admins } + actions: [branch_merge] + target_branch_scope: protected +"#; + +#[derive(Clone, Copy, Debug)] +enum ParityDecision { + Allow, + Deny, +} + +async fn build_parity_graph() -> (tempfile::TempDir, PathBuf, PathBuf) { + // Build a graph with `main` loaded and a `feature` branch ready for + // merge. Returns the graph path and a written policy.yaml path. 
+ let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + { + let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.branch_create_from(ReadTarget::branch("main"), "feature") + .await + .unwrap(); + db.load_as( + "feature", + r#"{"type":"Person","data":{"name":"ParityEve","age":29}}"#, + LoadMode::Append, + None, + ) + .await + .unwrap(); + } + let policy_path = temp.path().join("policy.yaml"); + fs::write(&policy_path, PARITY_POLICY_YAML).unwrap(); + (temp, graph, policy_path) +} + +async fn sdk_change_decision(graph: &Path, policy_path: &Path, actor: &str) -> ParityDecision { + let policy = PolicyEngine::load_graph(policy_path, graph.to_string_lossy().as_ref()).unwrap(); + let db = Omnigraph::open(graph.to_str().unwrap()) + .await + .unwrap() + .with_policy(Arc::new(policy) as Arc<dyn PolicyChecker>); + let mut params: omnigraph_compiler::ParamMap = Default::default(); + // Parameter keys are bare names (no `$` prefix); the runtime resolves + // `$name` references in the query body to `params["name"]`. 
+ params.insert( + "name".to_string(), + omnigraph_compiler::Literal::String("ParityCharlie".to_string()), + ); + params.insert("age".to_string(), omnigraph_compiler::Literal::Integer(30)); + let result = db + .mutate_as( + "main", + MUTATION_QUERIES, + "insert_person", + &params, + Some(actor), + ) + .await; + match result { + Ok(_) => ParityDecision::Allow, + Err(OmniError::Policy(_)) => ParityDecision::Deny, + Err(other) => panic!("unexpected SDK error for change: {other:?}"), + } +} + +async fn http_change_decision( + graph: &Path, + policy_path: &PathBuf, + actor: &str, + token: &str, +) -> ParityDecision { + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![(actor.to_string(), token.to_string())], + Some(policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + let req = ChangeRequest { + query: MUTATION_QUERIES.to_string(), + name: Some("insert_person".to_string()), + params: Some(json!({ "name": "ParityCharlie", "age": 30 })), + branch: Some("main".to_string()), + }; + let (status, _body) = json_response( + &app, + Request::builder() + .uri("/change") + .method(Method::POST) + .header(AUTHORIZATION, format!("Bearer {token}")) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&req).unwrap())) + .unwrap(), + ) + .await; + match status { + StatusCode::OK => ParityDecision::Allow, + StatusCode::FORBIDDEN => ParityDecision::Deny, + other => panic!("unexpected HTTP status for change: {other}"), + } +} + +async fn sdk_merge_decision(graph: &Path, policy_path: &Path, actor: &str) -> ParityDecision { + let policy = PolicyEngine::load_graph(policy_path, graph.to_string_lossy().as_ref()).unwrap(); + let db = Omnigraph::open(graph.to_str().unwrap()) + .await + .unwrap() + .with_policy(Arc::new(policy) as Arc<dyn PolicyChecker>); + let result = db.branch_merge_as("feature", "main", Some(actor)).await; + match result { + Ok(_) => ParityDecision::Allow, + 
Err(OmniError::Policy(_)) => 
ParityDecision::Deny, + Err(other) => panic!("unexpected SDK error for branch_merge: {other:?}"), + } +} + +async fn http_merge_decision( + graph: &Path, + policy_path: &PathBuf, + actor: &str, + token: &str, +) -> ParityDecision { + let state = AppState::open_with_bearer_tokens_and_policy( + graph.to_string_lossy().to_string(), + vec![(actor.to_string(), token.to_string())], + Some(policy_path), + ) + .await + .unwrap(); + let app = build_app(state); + let req = BranchMergeRequest { + source: "feature".to_string(), + target: Some("main".to_string()), + }; + let (status, _body) = json_response( + &app, + Request::builder() + .uri("/branches/merge") + .method(Method::POST) + .header(AUTHORIZATION, format!("Bearer {token}")) + .header("content-type", "application/json") + .body(Body::from(serde_json::to_vec(&req).unwrap())) + .unwrap(), + ) + .await; + match status { + StatusCode::OK => ParityDecision::Allow, + StatusCode::FORBIDDEN => ParityDecision::Deny, + other => panic!("unexpected HTTP status for branch_merge: {other}"), + } +} + +#[tokio::test(flavor = "multi_thread")] +async fn policy_decision_parity_change_admin_on_main_allowed() { + // (act-ragnor, change, main): the admins-change-anywhere rule applies. + // Both SDK and HTTP must allow. Each path uses its own fresh graph + // because an allowed change has side effects. + let (_t1, graph1, policy1) = build_parity_graph().await; + let sdk = sdk_change_decision(&graph1, &policy1, "act-ragnor").await; + let (_t2, graph2, policy2) = build_parity_graph().await; + let http = http_change_decision(&graph2, &policy2, "act-ragnor", "ragnor-token").await; + assert!( + matches!(sdk, ParityDecision::Allow) && matches!(http, ParityDecision::Allow), + "SDK={sdk:?} HTTP={http:?}; should both Allow", + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn policy_decision_parity_change_team_on_main_denied() { + // (act-bruno, change, main): no rule grants bruno change on + // protected. Both SDK and HTTP must deny. 
The same graph is reusable + // because a denied request has no side effects. + let (_temp, graph, policy) = build_parity_graph().await; + let sdk = sdk_change_decision(&graph, &policy, "act-bruno").await; + let http = http_change_decision(&graph, &policy, "act-bruno", "bruno-token").await; + assert!( + matches!(sdk, ParityDecision::Deny) && matches!(http, ParityDecision::Deny), + "SDK={sdk:?} HTTP={http:?}; should both Deny", + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn policy_decision_parity_branch_merge_admin_allowed() { + // (act-ragnor, branch_merge, feature→main): the admins-merge-to-protected + // rule applies. Both Allow. Each path uses its own fresh graph because + // a successful merge consumes the feature branch's commit on main. + let (_t1, graph1, policy1) = build_parity_graph().await; + let sdk = sdk_merge_decision(&graph1, &policy1, "act-ragnor").await; + let (_t2, graph2, policy2) = build_parity_graph().await; + let http = http_merge_decision(&graph2, &policy2, "act-ragnor", "ragnor-token").await; + assert!( + matches!(sdk, ParityDecision::Allow) && matches!(http, ParityDecision::Allow), + "SDK={sdk:?} HTTP={http:?}; should both Allow", + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn policy_decision_parity_branch_merge_team_denied() { + // (act-bruno, branch_merge, feature→main): no rule grants bruno + // branch_merge. Both Deny. + let (_temp, graph, policy) = build_parity_graph().await; + let sdk = sdk_merge_decision(&graph, &policy, "act-bruno").await; + let http = http_merge_decision(&graph, &policy, "act-bruno", "bruno-token").await; + assert!( + matches!(sdk, ParityDecision::Deny) && matches!(http, ParityDecision::Deny), + "SDK={sdk:?} HTTP={http:?}; should both Deny", + ); +} + +// ─── MR-694 PR B: HTTP soft + hard drop semantics + data preservation ──── +// +// SDK-level drop semantics are pinned in `crates/omnigraph/tests/schema_apply.rs`. 
+// These HTTP-side tests mirror the assertions through POST /schema/apply +// and exercise the new `allow_data_loss` field (closes the gap where +// the schema-lint chassis v1.2 shipped Hard mode on the CLI but the +// HTTP request struct had no equivalent field). + +#[tokio::test(flavor = "multi_thread")] +async fn schema_apply_route_soft_drops_property_via_http() { + let (temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + // Load a row that has the column we're about to drop. + let graph = graph_path(temp.path()); + { + let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.load( + "main", + r#"{"type":"Person","data":{"name":"PreDrop","age":42}}"#, + LoadMode::Append, + ) + .await + .unwrap(); + } + let pre_version = manifest_dataset_version(&graph).await; + + let (status, payload) = json_response( + &app, + Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: schema_without_age(), + ..Default::default() + }) + .unwrap(), + )) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); + assert_eq!(payload["applied"], true); + + // Catalog reflects the drop: `age` is gone from the live schema. + let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + assert!( + !reopened.catalog().node_types["Person"] + .properties + .contains_key("age"), + "catalog should not contain `age` after drop" + ); + + // Soft drop preserves the prior version β€” `age` is still readable + // via time travel to the pre-drop manifest version. Mirrors the + // SDK-side assertion in `apply_schema_drops_a_nullable_property_softly_preserves_prior_version`. 
+ let pre_drop_snapshot = reopened.snapshot_at_version(pre_version).await.unwrap(); + let pre_drop_ds = pre_drop_snapshot.open("node:Person").await.unwrap(); + let pre_drop_fields = pre_drop_ds + .schema() + .fields + .iter() + .map(|f| f.name.clone()) + .collect::<Vec<_>>(); + assert!( + pre_drop_fields.iter().any(|f| f == "age"), + "soft drop should leave the pre-drop dataset's `age` column \ + time-travel-reachable; got fields {pre_drop_fields:?}" + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn schema_apply_route_soft_drops_node_type_via_http() { + let (temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + let graph = graph_path(temp.path()); + + let (status, payload) = json_response( + &app, + Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: schema_without_company(), + ..Default::default() + }) + .unwrap(), + )) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); + assert_eq!(payload["applied"], true); + + let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + assert!( + !reopened.catalog().node_types.contains_key("Company"), + "catalog should not contain `Company` after drop" + ); + assert!( + !reopened.catalog().edge_types.contains_key("WorksAt"), + "catalog should not contain `WorksAt` after cascade" + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn schema_apply_route_hard_drops_property_with_allow_data_loss() { + let (temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + let graph = graph_path(temp.path()); + { + let db = 
Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.load( + "main", + r#"{"type":"Person","data":{"name":"PreDropHard","age":50}}"#, + LoadMode::Append, + ) + .await + .unwrap(); + } + + // Apply with allow_data_loss=true β†’ Hard mode promotion. + let (status, payload) = json_response( + &app, + Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: schema_without_age(), + allow_data_loss: true, + }) + .unwrap(), + )) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); + assert_eq!(payload["applied"], true); + + // Catalog reflects the drop. + let reopened = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + assert!( + !reopened.catalog().node_types["Person"] + .properties + .contains_key("age"), + "catalog should not contain `age` after Hard drop" + ); + // Plan steps should show DropMode::Hard for property drops. + let steps = payload["steps"].as_array().expect("steps array"); + let drop_step = steps + .iter() + .find(|s| s["kind"] == "drop_property") + .expect("plan should include drop_property step"); + let mode = &drop_step["mode"]; + assert_eq!( + mode, "hard", + "expected hard mode under allow_data_loss=true" + ); +} + +#[tokio::test(flavor = "multi_thread")] +async fn schema_apply_route_keeps_drops_soft_without_flag() { + // Symmetric to the Hard test: same schema change, but no + // allow_data_loss flag β†’ drops stay Soft (prior column data + // remains time-travel-reachable). Pins the default semantics + // against accidental Hard promotion. 
+ let (temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + let graph = graph_path(temp.path()); + + let (status, payload) = json_response( + &app, + Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: schema_without_age(), + allow_data_loss: false, + }) + .unwrap(), + )) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); + assert_eq!(payload["applied"], true); + + let steps = payload["steps"].as_array().expect("steps array"); + let drop_step = steps + .iter() + .find(|s| s["kind"] == "drop_property") + .expect("plan should include drop_property step"); + let mode = &drop_step["mode"]; + assert_eq!(mode, "soft", "expected soft mode without allow_data_loss"); + let _ = graph; +} + +#[tokio::test(flavor = "multi_thread")] +async fn schema_apply_route_additive_property_preserves_existing_rows() { + // SDK suite covers rename and drop data preservation. Additive + // AddProperty wasn't pinned with a row-count check anywhere. + // Load N rows, apply schema adding nullable property, verify + // every row is still readable and the new column is null. + let (temp, app) = app_for_graph_with_auth_tokens_and_policy( + &fs::read_to_string(fixture("test.pg")).unwrap(), + &[("act-ragnor", "admin-token")], + SCHEMA_APPLY_POLICY_YAML, + ) + .await; + let graph = graph_path(temp.path()); + + // Standard fixture data: 4 Persons + 1 Company. Load it. 
+ let pre_count = { + let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + db.load( + "main", + &fs::read_to_string(fixture("test.jsonl")).unwrap(), + LoadMode::Append, + ) + .await + .unwrap(); + let snap = db + .snapshot_of(omnigraph::db::ReadTarget::branch("main")) + .await + .unwrap(); + snap.entry("node:Person").expect("Person").row_count + }; + assert!(pre_count > 0, "fixture should have loaded Person rows"); + + let (status, payload) = json_response( + &app, + Request::builder() + .method(Method::POST) + .uri("/schema/apply") + .header("content-type", "application/json") + .header("authorization", "Bearer admin-token") + .body(Body::from( + serde_json::to_vec(&SchemaApplyRequest { + schema_source: additive_schema_with_nickname(), + ..Default::default() + }) + .unwrap(), + )) + .unwrap(), + ) + .await; + assert_eq!(status, StatusCode::OK); + assert_eq!(payload["applied"], true); + + // Row count preserved. + let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); + let snap = db + .snapshot_of(omnigraph::db::ReadTarget::branch("main")) + .await + .unwrap(); + let post_count = snap.entry("node:Person").expect("Person").row_count; + assert_eq!( + post_count, pre_count, + "AddProperty should preserve row count", + ); +} + +// ─── MR-668: multi-graph startup ────────────────────────────────────────── + +mod multi_graph_startup { + use super::*; + use omnigraph::storage::normalize_root_uri; + use omnigraph_server::{ + GraphHandle, GraphId, GraphKey, GraphRegistry, InsertError, ServerConfig, ServerConfigMode, + load_server_settings, + }; + use std::sync::Arc; + + async fn build_multi_mode_app(graph_ids: &[&str]) -> (Vec<tempfile::TempDir>, Router) { + let mut dirs = Vec::with_capacity(graph_ids.len()); + let mut handles = Vec::with_capacity(graph_ids.len()); + for id in graph_ids { + let dir = tempfile::tempdir().unwrap(); + let graph_uri = dir.path().join(id).to_str().unwrap().to_string(); + let schema = 
fs::read_to_string(fixture("test.pg")).unwrap(); + let 
engine = Omnigraph::init(&graph_uri, &schema).await.unwrap(); + handles.push(Arc::new(GraphHandle { + key: GraphKey::cluster(GraphId::try_from(*id).unwrap()), + uri: graph_uri, + engine: Arc::new(engine), + policy: None, + })); + dirs.push(dir); + } + let workload = omnigraph_server::workload::WorkloadController::from_env(); + let state = AppState::new_multi(handles, Vec::new(), None, workload, None).unwrap(); + let app = build_app(state); + (dirs, app) + } + + /// Cluster route `/graphs/{graph_id}/snapshot` resolves to the right + /// engine. Two graphs side by side; assert each responds to its own + /// id and does NOT respond to the other's URL. + #[tokio::test(flavor = "multi_thread")] + async fn cluster_routes_dispatch_per_graph_handle() { + let (_dirs, app) = build_multi_mode_app(&["alpha", "beta"]).await; + for id in ["alpha", "beta"] { + let resp = app + .clone() + .oneshot( + Request::builder() + .method(Method::GET) + .uri(format!("/graphs/{id}/snapshot?branch=main")) + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!( + resp.status(), + StatusCode::OK, + "graph '{id}' must respond OK on its cluster snapshot route" + ); + } + } + + /// Unknown graph id under the cluster prefix yields 404 (not 500, + /// not 410 β€” `Gone` is reserved for the future DELETE flow). + #[tokio::test(flavor = "multi_thread")] + async fn cluster_route_for_unknown_graph_returns_404() { + let (_dirs, app) = build_multi_mode_app(&["alpha"]).await; + let resp = app + .oneshot( + Request::builder() + .method(Method::GET) + .uri("/graphs/nonexistent/snapshot?branch=main") + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(resp.status(), StatusCode::NOT_FOUND); + } + + /// Coverage net for cluster-route regressions across every + /// protected handler β€” not just the few that have inner path + /// params. 
Bug-1 surfaced because only `/snapshot` was being + /// exercised in cluster mode, leaving the other six protected + /// routes implicitly untested. This sweep hits each one and + /// asserts the response shows the handler was reached: no 404 + /// (router didn't match), no 500 with "Wrong number of path + /// arguments" (path extractor broke), no 500 with "missing + /// extension" (routing middleware didn't inject the handle). + /// + /// Status codes are negative assertions because each handler's + /// happy-path inputs differ β€” what matters is "the request + /// reached the handler," not "the handler returned 200." The + /// individual handlers' logic is already tested in single mode. + #[tokio::test(flavor = "multi_thread")] + async fn all_protected_cluster_routes_resolve_to_their_handler() { + let (_dirs, app) = build_multi_mode_app(&["alpha"]).await; + + // (method, path, body) β€” one minimal request per protected + // cluster route. Bodies are valid enough that the router and + // extractors succeed; whether the engine ultimately returns + // 200 or 4xx is per-handler and not what this test pins. 
+ let cases: &[(Method, &str, Option<&str>)] = &[ + (Method::GET, "/graphs/alpha/snapshot?branch=main", None), + (Method::GET, "/graphs/alpha/schema", None), + (Method::GET, "/graphs/alpha/branches", None), + (Method::GET, "/graphs/alpha/commits", None), + ( + Method::POST, + "/graphs/alpha/read", + Some(r#"{"query_source":"query q() { return {} }"}"#), + ), + ( + Method::POST, + "/graphs/alpha/change", + Some(r#"{"query_source":"query q() { return {} }"}"#), + ), + ( + Method::POST, + "/graphs/alpha/export", + Some(r#"{"branch":"main"}"#), + ), + ( + Method::POST, + "/graphs/alpha/schema/apply", + Some(r#"{"schema_source":"","allow_data_loss":false}"#), + ), + (Method::POST, "/graphs/alpha/ingest", Some(r#"{"data":""}"#)), + ( + Method::POST, + "/graphs/alpha/branches/merge", + Some(r#"{"source":"main","target":"main"}"#), + ), + ]; + + for (method, path, body) in cases { + let req_body = body + .map(|s| Body::from(s.to_string())) + .unwrap_or_else(Body::empty); + let req = Request::builder() + .method(method.clone()) + .uri(*path) + .header("content-type", "application/json") + .body(req_body) + .unwrap(); + let resp = app.clone().oneshot(req).await.unwrap(); + let status = resp.status(); + let bytes = to_bytes(resp.into_body(), usize::MAX).await.unwrap(); + let body_str = String::from_utf8_lossy(&bytes); + + assert_ne!( + status, + StatusCode::NOT_FOUND, + "{} {} β€” router didn't match (cluster-route mounting regression). Body: {}", + method, + path, + body_str, + ); + assert!( + !(status == StatusCode::INTERNAL_SERVER_ERROR + && body_str.contains("Wrong number of path arguments")), + "{} {} β€” path extractor broke (Bug-1 class regression). Body: {}", + method, + path, + body_str, + ); + assert!( + !(status == StatusCode::INTERNAL_SERVER_ERROR + && body_str.to_lowercase().contains("missing extension")), + "{} {} β€” routing middleware didn't inject GraphHandle. 
Body: {}", + method, + path, + body_str, + ); + } + } + + /// Regression for the bot-surfaced path-extractor bug: cluster + /// routes whose inner path also captures a parameter + /// (`/graphs/{graph_id}/branches/{branch}`, + /// `/graphs/{graph_id}/commits/{commit_id}`) must extract the + /// inner param cleanly. Axum 0.8 propagates the outer `{graph_id}` + /// capture into nested handlers, so a `Path` extractor + /// would see two values and fail with "Wrong number of path + /// arguments. Expected 1 but got 2." Today both DELETE branch and + /// GET commit-by-id break in multi-mode because their handlers + /// use bare `Path` β€” this test pins the fix. + /// + /// The broader `all_protected_cluster_routes_resolve_to_their_handler` + /// test sweeps the full route surface; this one stays narrowly + /// targeted at the inner-path-param shape because that's the + /// specific regression class. + #[tokio::test(flavor = "multi_thread")] + async fn cluster_routes_with_inner_path_params_deserialize_correctly() { + let (_dirs, app) = build_multi_mode_app(&["alpha"]).await; + + // Create a branch we can then delete β€” DELETE /graphs/alpha/branches/feature + let create_resp = app + .clone() + .oneshot( + Request::builder() + .method(Method::POST) + .uri("/graphs/alpha/branches") + .header("content-type", "application/json") + .body(Body::from(r#"{"name":"feature"}"#)) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!( + create_resp.status(), + StatusCode::OK, + "branch create on the cluster route must succeed before delete can be tested" + ); + + // DELETE /graphs/{graph_id}/branches/{branch} β€” exercises a handler + // whose only Path extractor (`branch`) is inside a nested route + // that also captures `graph_id`. The handler must pick `branch` + // by name, not by position. 
+ let delete_resp = app + .clone() + .oneshot( + Request::builder() + .method(Method::DELETE) + .uri("/graphs/alpha/branches/feature") + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + let delete_status = delete_resp.status(); + let delete_body = to_bytes(delete_resp.into_body(), usize::MAX).await.unwrap(); + assert_eq!( + delete_status, + StatusCode::OK, + "DELETE /graphs/{{id}}/branches/{{branch}} must extract `branch` cleanly. \ + Body: {}", + String::from_utf8_lossy(&delete_body), + ); + + // GET /graphs/{graph_id}/commits/{commit_id} β€” same shape: the + // handler's only Path extractor is the inner `commit_id`, which + // must deserialize by name even though `graph_id` is also in scope. + // We don't know a real commit_id, but the failure mode under test + // is path extraction, not commit lookup β€” a 404 from the engine + // is fine; a 500 with "Wrong number of path arguments" is the bug. + let commit_resp = app + .oneshot( + Request::builder() + .method(Method::GET) + .uri("/graphs/alpha/commits/0000000000000000") + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + let commit_status = commit_resp.status(); + let commit_body = to_bytes(commit_resp.into_body(), usize::MAX).await.unwrap(); + let body_str = String::from_utf8_lossy(&commit_body); + assert!( + commit_status != StatusCode::INTERNAL_SERVER_ERROR + || !body_str.contains("Wrong number of path arguments"), + "GET /graphs/{{id}}/commits/{{commit_id}} must extract `commit_id` cleanly. \ + Got: {} | {}", + commit_status, + body_str, + ); + } + + /// Flat routes 404 in multi mode β€” the router only mounts under + /// `/graphs/{graph_id}/...` so `/snapshot` doesn't resolve. 
+ #[tokio::test(flavor = "multi_thread")] + async fn flat_routes_404_in_multi_mode() { + let (_dirs, app) = build_multi_mode_app(&["alpha"]).await; + let resp = app + .oneshot( + Request::builder() + .method(Method::GET) + .uri("/snapshot?branch=main") + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(resp.status(), StatusCode::NOT_FOUND); + } + + /// `GraphId` validation runs at startup β€” a reserved name in + /// `omnigraph.yaml` produces a clear error rather than getting + /// rejected per-request. + #[test] + fn load_server_settings_rejects_reserved_graph_id() { + let temp = tempfile::tempdir().unwrap(); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write( + &config_path, + r#" +graphs: + policies: + uri: /tmp/g1.omni +"#, + ) + .unwrap(); + let err = load_server_settings(Some(&config_path), None, None, None, false).unwrap_err(); + assert!( + err.to_string().contains("invalid graph id 'policies'"), + "expected reserved-name rejection, got: {err}" + ); + } + + #[tokio::test(flavor = "multi_thread")] + async fn registry_rejects_duplicate_normalized_graph_uris() { + let dir = tempfile::tempdir().unwrap(); + let graph_uri = dir.path().join("same").to_str().unwrap().to_string(); + let schema = fs::read_to_string(fixture("test.pg")).unwrap(); + let engine = Arc::new(Omnigraph::init(&graph_uri, &schema).await.unwrap()); + + let alpha = Arc::new(GraphHandle { + key: GraphKey::cluster(GraphId::try_from("alpha").unwrap()), + uri: graph_uri.clone(), + engine: Arc::clone(&engine), + policy: None, + }); + let beta = Arc::new(GraphHandle { + key: GraphKey::cluster(GraphId::try_from("beta").unwrap()), + uri: format!("file://{graph_uri}/"), + engine, + policy: None, + }); + + match GraphRegistry::from_handles(vec![alpha, beta]) { + Err(InsertError::DuplicateUri(uri)) => { + assert!( + normalize_root_uri(&uri).is_ok(), + "duplicate URI should still be parseable, got {uri}" + ); + } + Err(err) => panic!("expected DuplicateUri for 
normalized aliases, got {err:?}"), + Ok(_) => panic!("expected DuplicateUri for normalized aliases, got Ok"), + } + } + + #[tokio::test(flavor = "multi_thread")] + async fn registry_stores_canonical_graph_uri() { + let dir = tempfile::tempdir().unwrap(); + let graph_uri = dir.path().join("canonical").to_str().unwrap().to_string(); + let schema = fs::read_to_string(fixture("test.pg")).unwrap(); + let engine = Omnigraph::init(&graph_uri, &schema).await.unwrap(); + let handle = Arc::new(GraphHandle { + key: GraphKey::cluster(GraphId::try_from("alpha").unwrap()), + uri: format!("file://{graph_uri}/"), + engine: Arc::new(engine), + policy: None, + }); + + let registry = GraphRegistry::from_handles(vec![handle]).unwrap(); + let listed = registry.list(); + assert_eq!(listed.len(), 1); + assert_eq!(listed[0].uri, graph_uri); + } + + // ── Four-rule mode inference matrix ─────────────────────────────── + + /// Rule 1: CLI positional URI → Single. + #[test] + fn mode_inference_cli_uri_is_single() { + let settings = load_server_settings( + None, + Some("/tmp/cli.omni".to_string()), + None, + None, + true, // allow unauth so we get past the runtime-state check + ) + .unwrap(); + match settings.mode { + ServerConfigMode::Single { uri, .. } => assert_eq!(uri, "/tmp/cli.omni"), + ServerConfigMode::Multi { .. } => panic!("expected Single (rule 1), got Multi"), + } + } + + /// Rule 2: --target picks one graph from `graphs:` map → Single. + #[test] + fn mode_inference_cli_target_is_single() { + let temp = tempfile::tempdir().unwrap(); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write( + &config_path, + r#" +graphs: + alpha: + uri: /tmp/alpha.omni + beta: + uri: /tmp/beta.omni +"#, + ) + .unwrap(); + let settings = + load_server_settings(Some(&config_path), None, Some("alpha".into()), None, true) + .unwrap(); + match settings.mode { + ServerConfigMode::Single { uri, .. } => assert_eq!(uri, "/tmp/alpha.omni"), + ServerConfigMode::Multi { ..
} => panic!("expected Single (rule 2), got Multi"), + } + } + + /// Rule 3: `server.graph` set → Single (target picked from config). + #[test] + fn mode_inference_server_graph_is_single() { + let temp = tempfile::tempdir().unwrap(); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write( + &config_path, + r#" +graphs: + alpha: + uri: /tmp/alpha.omni + beta: + uri: /tmp/beta.omni +server: + graph: beta +"#, + ) + .unwrap(); + let settings = load_server_settings(Some(&config_path), None, None, None, true).unwrap(); + match settings.mode { + ServerConfigMode::Single { uri, .. } => assert_eq!(uri, "/tmp/beta.omni"), + ServerConfigMode::Multi { .. } => panic!("expected Single (rule 3), got Multi"), + } + } + + /// Rule 4: `--config` + non-empty `graphs:` + no single-mode selector → Multi. + #[test] + fn mode_inference_config_plus_graphs_is_multi() { + let temp = tempfile::tempdir().unwrap(); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write( + &config_path, + r#" +graphs: + alpha: + uri: /tmp/alpha.omni + beta: + uri: /tmp/beta.omni +"#, + ) + .unwrap(); + let settings = load_server_settings(Some(&config_path), None, None, None, true).unwrap(); + match settings.mode { + ServerConfigMode::Multi { graphs, .. } => { + let ids: Vec<&str> = graphs.iter().map(|g| g.graph_id.as_str()).collect(); + // BTreeMap iteration order is alphabetical. + assert_eq!(ids, vec!["alpha", "beta"]); + } + ServerConfigMode::Single { ..
} => panic!("expected Multi (rule 4), got Single"), + } + } + + #[test] + fn mode_inference_multi_rejects_top_level_policy_file() { + let temp = tempfile::tempdir().unwrap(); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write( + &config_path, + r#" +policy: + file: ./policy.yaml +graphs: + alpha: + uri: /tmp/alpha.omni +"#, + ) + .unwrap(); + let err = load_server_settings(Some(&config_path), None, None, None, true).unwrap_err(); + let msg = err.to_string(); + assert!( + msg.contains("top-level `policy.file` is single-graph/CLI-local policy only"), + "expected single-graph policy guidance, got: {msg}" + ); + assert!( + msg.contains("graphs..policy.file"), + "expected per-graph migration guidance, got: {msg}" + ); + assert!( + msg.contains("server.policy.file"), + "expected server policy migration guidance, got: {msg}" + ); + } + + #[test] + fn mode_inference_normalizes_multi_graph_uris() { + let temp = tempfile::tempdir().unwrap(); + let graph = temp.path().join("alpha.omni"); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write( + &config_path, + format!( + r#" +graphs: + alpha: + uri: file://{}/ +"#, + graph.display() + ), + ) + .unwrap(); + let settings = load_server_settings(Some(&config_path), None, None, None, true).unwrap(); + match settings.mode { + ServerConfigMode::Multi { graphs, .. } => { + assert_eq!(graphs[0].uri, graph.to_string_lossy()); + } + ServerConfigMode::Single { .. } => panic!("expected Multi"), + } + } + + /// Rule 5: nothing → error with migration hint. + #[test] + fn mode_inference_no_inputs_errors_with_migration_hint() { + let err = load_server_settings(None, None, None, None, true).unwrap_err(); + let msg = err.to_string(); + assert!( + msg.contains("no graph to serve"), + "expected migration-hint error, got: {msg}" + ); + } + + /// Rule 4 sub-case: `--config` with empty `graphs:` map and no + /// single-mode selector → rule 5 fires (no graph to serve).
+ #[test] + fn mode_inference_empty_graphs_map_errors() { + let temp = tempfile::tempdir().unwrap(); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write(&config_path, "server:\n bind: 127.0.0.1:8080\n").unwrap(); + let err = load_server_settings(Some(&config_path), None, None, None, true).unwrap_err(); + assert!(err.to_string().contains("no graph to serve")); + } + + /// `--config` + `` together: URI wins → Single (the CLI URI + /// takes precedence over the config's graphs map). + #[test] + fn mode_inference_cli_uri_overrides_graphs_map() { + let temp = tempfile::tempdir().unwrap(); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write( + &config_path, + r#" +graphs: + alpha: + uri: /tmp/alpha.omni +"#, + ) + .unwrap(); + let settings = load_server_settings( + Some(&config_path), + Some("/tmp/cli-override.omni".to_string()), + None, + None, + true, + ) + .unwrap(); + match settings.mode { + ServerConfigMode::Single { uri, .. } => { + assert_eq!( + uri, "/tmp/cli-override.omni", + "CLI URI must win over graphs: map" + ); + } + ServerConfigMode::Multi { .. } => { + panic!("expected Single (CLI URI wins), got Multi") + } + } + } + + /// Per-graph `policy.file` is resolved relative to the config base_dir. + #[test] + fn per_graph_policy_file_is_resolved_relative_to_base_dir() { + let temp = tempfile::tempdir().unwrap(); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write( + &config_path, + r#" +graphs: + alpha: + uri: /tmp/alpha.omni + policy: + file: ./policies/alpha.yaml + beta: + uri: /tmp/beta.omni +"#, + ) + .unwrap(); + let settings = load_server_settings(Some(&config_path), None, None, None, true).unwrap(); + let graphs = match settings.mode { + ServerConfigMode::Multi { graphs, .. } => graphs, + _ => panic!("expected Multi"), + }; + // graphs is BTreeMap-iter order (alphabetical).
+ let alpha = &graphs[0]; + let beta = &graphs[1]; + assert_eq!(alpha.graph_id, "alpha"); + assert_eq!( + alpha.policy_file.as_ref().unwrap(), + &temp.path().join("policies/alpha.yaml") + ); + assert_eq!(beta.graph_id, "beta"); + assert!(beta.policy_file.is_none()); + } + + /// `server.policy.file` resolves alongside the graphs map. + #[test] + fn server_policy_file_is_resolved_relative_to_base_dir() { + let temp = tempfile::tempdir().unwrap(); + let config_path = temp.path().join("omnigraph.yaml"); + fs::write( + &config_path, + r#" +server: + policy: + file: ./server-policy.yaml +graphs: + alpha: + uri: /tmp/alpha.omni +"#, + ) + .unwrap(); + let settings = load_server_settings(Some(&config_path), None, None, None, true).unwrap(); + match settings.mode { + ServerConfigMode::Multi { + server_policy_file, .. + } => { + assert_eq!( + server_policy_file.unwrap(), + temp.path().join("server-policy.yaml") + ); + } + _ => panic!("expected Multi"), + } + } + + /// `GET /graphs` must NOT leak the registry in Open mode without + /// an explicit server policy. Operators who pass `--unauthenticated` + /// opted into trusting the network for graph DATA, not for leaking + /// server topology (graph IDs + URIs, which may contain S3 bucket + /// paths or internal hostnames). Cedar gating the management + /// surface is the documented contract for `server_graphs_list` + /// ("don't leak the registry until the operator explicitly + /// authorizes it"); enforcing that contract in every runtime + /// state — not just `PolicyEnabled` — is the correct-by-design + /// closure of the open-mode hole the bot-review pass surfaced. + /// + /// Today (pre-fix) this returns 200 because `authorize_request`'s + /// no-policy fallback only denies when `actor.is_some()`, so Open + /// mode (`actor: None`) falls through to `Ok(())`. The fix in the + /// next commit tightens the fallback so server-scoped actions + /// always require explicit policy.
+ /// + /// Sort-order coverage previously lived here; it has moved to + /// `get_graphs_with_server_policy_authorizes_per_cedar` where + /// the response body is now non-empty and operator-authorized. + #[tokio::test(flavor = "multi_thread")] + async fn get_graphs_denied_in_open_mode_without_server_policy() { + let (_dirs, app) = build_multi_mode_app(&["beta", "alpha"]).await; + let resp = app + .oneshot( + Request::builder() + .method(Method::GET) + .uri("/graphs") + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + let status = resp.status(); + let body = to_bytes(resp.into_body(), usize::MAX).await.unwrap(); + let body_str = String::from_utf8_lossy(&body); + assert_eq!( + status, + StatusCode::FORBIDDEN, + "GET /graphs must require an explicit server policy in every \ + runtime state; Open-mode bypass would leak server topology. \ + Body: {body_str}", + ); + } + + /// `GET /graphs` returns 405 in single mode (resource exists in the + /// API surface, just not operational without a `graphs:` map). + #[tokio::test(flavor = "multi_thread")] + async fn get_graphs_returns_405_in_single_mode() { + let temp = init_loaded_graph().await; + let graph = graph_path(temp.path()); + let state = AppState::open(graph.to_string_lossy().to_string()) + .await + .unwrap(); + let app = build_app(state); + let resp = app + .oneshot( + Request::builder() + .method(Method::GET) + .uri("/graphs") + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(resp.status(), StatusCode::METHOD_NOT_ALLOWED); + } + + /// `GET /graphs` requires bearer auth when tokens are configured. + #[tokio::test(flavor = "multi_thread")] + async fn get_graphs_requires_bearer_auth_when_configured() { + use omnigraph_server::{GraphHandle, GraphId, GraphKey}; + // Build a multi-mode app with bearer tokens configured. 
+ let dir = tempfile::tempdir().unwrap(); + let graph_uri = dir.path().join("alpha").to_str().unwrap().to_string(); + let schema = fs::read_to_string(fixture("test.pg")).unwrap(); + let engine = Omnigraph::init(&graph_uri, &schema).await.unwrap(); + let handle = Arc::new(GraphHandle { + key: GraphKey::cluster(GraphId::try_from("alpha").unwrap()), + uri: graph_uri, + engine: Arc::new(engine), + policy: None, + }); + let tokens = vec![("act-andrew".to_string(), "secret-token".to_string())]; + let workload = omnigraph_server::workload::WorkloadController::from_env(); + let state = AppState::new_multi(vec![handle], tokens, None, workload, None).unwrap(); + let app = build_app(state); + + // No Authorization header → 401. + let resp_no_auth = app + .clone() + .oneshot( + Request::builder() + .method(Method::GET) + .uri("/graphs") + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(resp_no_auth.status(), StatusCode::UNAUTHORIZED); + + // With auth but no server policy → 403 (default-deny, since + // GraphList is not Read). + let resp_authed = app + .oneshot( + Request::builder() + .method(Method::GET) + .uri("/graphs") + .header("authorization", "Bearer secret-token") + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!(resp_authed.status(), StatusCode::FORBIDDEN); + } + + /// `GET /graphs` with a server policy that allows `graph_list` → 200 + /// and returns the registry sorted alphabetically by `graph_id`. + /// `GET /graphs` with a server policy that does NOT allow + /// `graph_list` (viewer group) → 403. + /// + /// This test owns the alphabetical-sort coverage that previously + /// lived in `get_graphs_lists_registered_graphs_in_multi_mode`. + /// That test now asserts denial in Open mode (server-scoped actions + /// require explicit policy in every runtime state), so the positive + /// body-shape assertions need a home where the response is + /// operator-authorized — here.
+ #[tokio::test(flavor = "multi_thread")] + async fn get_graphs_with_server_policy_authorizes_per_cedar() { + use omnigraph_policy::PolicyEngine; + use omnigraph_server::{GraphHandle, GraphId, GraphKey}; + + let dir = tempfile::tempdir().unwrap(); + + // Two graphs deliberately registered in non-alphabetical order + // so the test would fail if the handler relied on insertion + // order instead of server-side sorting. + let schema = fs::read_to_string(fixture("test.pg")).unwrap(); + let mut handles = Vec::new(); + for id in ["beta", "alpha"] { + let graph_uri = dir.path().join(id).to_str().unwrap().to_string(); + let engine = Omnigraph::init(&graph_uri, &schema).await.unwrap(); + handles.push(Arc::new(GraphHandle { + key: GraphKey::cluster(GraphId::try_from(id).unwrap()), + uri: graph_uri, + engine: Arc::new(engine), + policy: None, + })); + } + + // Server policy: admins can graph_list, viewers cannot. + let policy_path = dir.path().join("server-policy.yaml"); + fs::write( + &policy_path, + r#" +version: 1 +groups: + admins: [act-andrew] + viewers: [act-bruno] +rules: + - id: admins-list-graphs + allow: + actors: { group: admins } + actions: [graph_list] +"#, + ) + .unwrap(); + let server_policy = PolicyEngine::load_server(&policy_path).unwrap(); + + let tokens = vec![ + ("act-andrew".to_string(), "andrew-token".to_string()), + ("act-bruno".to_string(), "bruno-token".to_string()), + ]; + let workload = omnigraph_server::workload::WorkloadController::from_env(); + let state = + AppState::new_multi(handles, tokens, Some(server_policy), workload, None).unwrap(); + let app = build_app(state); + + // Admin → 200, body returns both graphs alphabetically sorted.
+ let resp_admin = app + .clone() + .oneshot( + Request::builder() + .method(Method::GET) + .uri("/graphs") + .header("authorization", "Bearer andrew-token") + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!( + resp_admin.status(), + StatusCode::OK, + "admin must be allowed graph_list" + ); + let body = to_bytes(resp_admin.into_body(), usize::MAX).await.unwrap(); + let json: Value = serde_json::from_slice(&body).unwrap(); + let graphs = json["graphs"].as_array().unwrap(); + assert_eq!(graphs.len(), 2, "response must list both registered graphs"); + assert_eq!( + graphs[0]["graph_id"].as_str().unwrap(), + "alpha", + "server must sort graphs alphabetically by graph_id (insertion order was 'beta', 'alpha')" + ); + assert_eq!(graphs[1]["graph_id"].as_str().unwrap(), "beta"); + + // Viewer → 403 + let resp_viewer = app + .oneshot( + Request::builder() + .method(Method::GET) + .uri("/graphs") + .header("authorization", "Bearer bruno-token") + .body(Body::empty()) + .unwrap(), + ) + .await + .unwrap(); + assert_eq!( + resp_viewer.status(), + StatusCode::FORBIDDEN, + "viewer must be denied graph_list (Cedar gate)" + ); + } + + /// Loads an `omnigraph.yaml` with two graphs and verifies multi-mode + /// inference plus graph entry resolution. Cluster-route dispatch is + /// covered by the route tests above. + #[tokio::test(flavor = "multi_thread")] + async fn server_settings_load_multi_graph_config_entries() { + let cfg_dir = tempfile::tempdir().unwrap(); + // Real graph storage dirs (the URIs in the config must point to + // a graph init-able location).
+ let alpha_dir = cfg_dir.path().join("alpha.omni"); + let beta_dir = cfg_dir.path().join("beta.omni"); + let schema = fs::read_to_string(fixture("test.pg")).unwrap(); + Omnigraph::init(alpha_dir.to_str().unwrap(), &schema) + .await + .unwrap(); + Omnigraph::init(beta_dir.to_str().unwrap(), &schema) + .await + .unwrap(); + + let config_path = cfg_dir.path().join("omnigraph.yaml"); + fs::write( + &config_path, + format!( + r#" +graphs: + alpha: + uri: {alpha} + beta: + uri: {beta} +"#, + alpha = alpha_dir.display(), + beta = beta_dir.display(), + ), + ) + .unwrap(); + + let settings: ServerConfig = + load_server_settings(Some(&config_path), None, None, None, true).unwrap(); + assert!(matches!(settings.mode, ServerConfigMode::Multi { .. })); + + match settings.mode { + ServerConfigMode::Multi { graphs, .. } => { + assert_eq!(graphs.len(), 2); + let ids: Vec<&str> = graphs.iter().map(|g| g.graph_id.as_str()).collect(); + assert_eq!(ids, vec!["alpha", "beta"]); + } + _ => unreachable!(), + } + } +} diff --git a/crates/omnigraph-server/tests/stored_queries.rs b/crates/omnigraph-server/tests/stored_queries.rs deleted file mode 100644 index 00b0229..0000000 --- a/crates/omnigraph-server/tests/stored_queries.rs +++ /dev/null @@ -1,422 +0,0 @@ -//! Stored-query registry boot, /queries listing, and invocation routes. -//! Moved verbatim from tests/server.rs in the modularization. - - -use axum::body::Body; -use axum::http::StatusCode; -use omnigraph_server::AppState; -use serde_json::json; - - -mod support; -use support::*; - -#[tokio::test] -async fn server_boots_with_a_valid_stored_query_registry() { - // A stored query that type-checks against the fixture schema - // (`Person { name, age }`) must let the server boot. 
- let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let registry = stored_query_registry(&[( - "find_person", - "query find_person($name: String) { match { $p: Person { name: $name } } return { $p.age } }", - false, - )]); - let state = AppState::open_single_with_queries( - graph.to_string_lossy().to_string(), - vec![], - None, - registry, - ) - .await; - assert!(state.is_ok(), "valid registry should boot: {:?}", state.err()); -} - -#[tokio::test] -async fn server_refuses_boot_on_type_broken_stored_query() { - // A stored query referencing a type not in the schema (`Widget`) - // must abort boot, naming the offending query. - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let registry = stored_query_registry(&[( - "ghost", - "query ghost() { match { $w: Widget } return { $w.name } }", - false, - )]); - let result = AppState::open_single_with_queries( - graph.to_string_lossy().to_string(), - vec![], - None, - registry, - ) - .await; - // `AppState` is not `Debug`, so match rather than `expect_err`. 
- let err = match result { - Ok(_) => panic!("type-broken stored query must refuse boot"), - Err(err) => err, - }; - let msg = err.to_string(); - assert!(msg.contains("ghost"), "error should name the broken query: {msg}"); - assert!( - msg.contains("schema check"), - "error should mention the schema check: {msg}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn invoke_stored_read_returns_rows() { - let (_temp, app) = app_with_stored_queries( - &[("find_person", FIND_PERSON_GQ, false)], - &[("act-invoke", "t-invoke")], - INVOKE_POLICY_YAML, - ) - .await; - let (status, body) = json_response( - &app, - invoke_request("find_person", "t-invoke", json!({ "params": { "name": "Alice" } })), - ) - .await; - assert_eq!(status, StatusCode::OK, "body: {body}"); - assert_eq!(body["query_name"], "find_person"); - assert_eq!(body["row_count"], 1, "Alice is in the fixture; body: {body}"); - assert!(body["rows"].is_array(), "read envelope shape; body: {body}"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn invoke_with_mismatched_expected_kind_is_rejected() { - // RFC-011 D3: the CLI verb asserts the stored query's kind via - // `expect_mutation`. Invoking a read with `expect_mutation: true` - // (i.e. `omnigraph mutate `) is a 400 naming the right verb. 
- let (_temp, app) = app_with_stored_queries( - &[("find_person", FIND_PERSON_GQ, false)], - &[("act-invoke", "t-invoke")], - INVOKE_POLICY_YAML, - ) - .await; - let (status, body) = json_response( - &app, - invoke_request( - "find_person", - "t-invoke", - json!({ "expect_mutation": true, "params": { "name": "Alice" } }), - ), - ) - .await; - assert_eq!(status, StatusCode::BAD_REQUEST, "body: {body}"); - assert!( - body["error"] - .as_str() - .unwrap_or_default() - .contains("'find_person' is a read — use omnigraph query find_person"), - "expected a kind-mismatch error; body: {body}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn invoke_with_matching_expected_kind_runs() { - // The matching assertion (`omnigraph query `) passes through. - let (_temp, app) = app_with_stored_queries( - &[("find_person", FIND_PERSON_GQ, false)], - &[("act-invoke", "t-invoke")], - INVOKE_POLICY_YAML, - ) - .await; - let (status, body) = json_response( - &app, - invoke_request( - "find_person", - "t-invoke", - json!({ "expect_mutation": false, "params": { "name": "Alice" } }), - ), - ) - .await; - assert_eq!(status, StatusCode::OK, "matching kind should run; body: {body}"); - assert_eq!(body["query_name"], "find_person"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn invoke_stored_read_accepts_absent_or_empty_body() { - let no_param_query = "query list_people() { match { $p: Person } return { $p.name } }"; - let (_temp, app) = app_with_stored_queries( - &[("list_people", no_param_query, false)], - &[("act-invoke", "t-invoke")], - INVOKE_POLICY_YAML, - ) - .await; - - let (status, body) = json_response( - &app, - invoke_request_bytes("list_people", "t-invoke", Body::empty(), None), - ) - .await; - assert_eq!(status, StatusCode::OK, "body: {body}"); - assert_eq!(body["query_name"], "list_people"); - - let (status, body) = json_response( - &app, - invoke_request_bytes( - "list_people", - "t-invoke", - Body::empty(), - Some("application/json"), - ), - ) - .await; -
assert_eq!(status, StatusCode::OK, "body: {body}"); - - let (status, body) = json_response( - &app, - invoke_request_bytes( - "list_people", - "t-invoke", - Body::from("{}"), - Some("application/json"), - ), - ) - .await; - assert_eq!(status, StatusCode::OK, "body: {body}"); - - let (status, body) = json_response( - &app, - invoke_request_bytes( - "list_people", - "t-invoke", - Body::from("{"), - Some("application/json"), - ), - ) - .await; - assert_eq!(status, StatusCode::BAD_REQUEST, "body: {body}"); - assert!( - body["error"] - .as_str() - .unwrap_or_default() - .contains("invalid stored-query invocation body"), - "malformed JSON should be rejected as bad request; body: {body}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn invoke_stored_mutation_double_gates_on_change() { - let specs: &[(&str, &str, bool)] = &[( - "add_person", - "query add_person($name: String) { insert Person { name: $name } }", - false, - )]; - let (_temp, app) = app_with_stored_queries( - specs, - &[("act-invoke", "t-invoke"), ("act-full", "t-full")], - INVOKE_POLICY_YAML, - ) - .await; - - // Has invoke_query but NOT change → the inner change gate denies (403). - let (status, body) = json_response( - &app, - invoke_request("add_person", "t-invoke", json!({ "params": { "name": "Eve" } })), - ) - .await; - assert_eq!( - status, - StatusCode::FORBIDDEN, - "invoke_query without change must 403; body: {body}" - ); - - // Has invoke_query + change → applied.
- let (status, body) = json_response( - &app, - invoke_request("add_person", "t-full", json!({ "params": { "name": "Eve" } })), - ) - .await; - assert_eq!(status, StatusCode::OK, "body: {body}"); - assert_eq!(body["affected_nodes"], 1, "body: {body}"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn invoke_stored_query_bad_param_is_400() { - let (_temp, app) = app_with_stored_queries( - &[("find_person", FIND_PERSON_GQ, false)], - &[("act-invoke", "t-invoke")], - INVOKE_POLICY_YAML, - ) - .await; - // `name` is declared String; pass a number. - let (status, body) = json_response( - &app, - invoke_request("find_person", "t-invoke", json!({ "params": { "name": 123 } })), - ) - .await; - assert_eq!(status, StatusCode::BAD_REQUEST, "body: {body}"); - assert!( - body["error"].as_str().unwrap_or_default().contains("name"), - "400 should name the offending param; body: {body}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn invoke_unknown_query_and_denied_actor_return_identical_404() { - let (_temp, app) = app_with_stored_queries( - &[("find_person", FIND_PERSON_GQ, false)], - &[("act-invoke", "t-invoke"), ("act-noinvoke", "t-noinvoke")], - INVOKE_POLICY_YAML, - ) - .await; - - // Authorized actor, unknown query name → 404. - let (unknown_status, unknown_body) = - json_response(&app, invoke_request("does_not_exist", "t-invoke", json!({}))).await; - // Denied actor (no invoke_query), real query name → 404.
- let (denied_status, denied_body) = json_response( - &app, - invoke_request("find_person", "t-noinvoke", json!({ "params": { "name": "Alice" } })), - ) - .await; - - assert_eq!(unknown_status, StatusCode::NOT_FOUND); - assert_eq!(denied_status, StatusCode::NOT_FOUND); - assert_eq!( - unknown_body, denied_body, - "deny must be byte-identical to a missing query (no catalog probing)" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn invoke_query_holder_without_read_sees_403_not_404() { - // The 404-hiding is for callers WITHOUT invoke_query. An actor that - // HOLDS invoke_query but lacks `read` clears the boundary gate, then the - // inner read gate denies → 403 for an EXISTING read query, vs 404 for an - // unknown one. Existence is visible to grant-holders by design (the - // documented double-gate); this pins that actual contract. - let (_temp, app) = app_with_stored_queries( - &[("find_person", FIND_PERSON_GQ, false)], - &[("act-invokeonly", "t-invokeonly")], - INVOKE_POLICY_YAML, - ) - .await; - let (exists_status, _) = json_response( - &app, - invoke_request("find_person", "t-invokeonly", json!({ "params": { "name": "Alice" } })), - ) - .await; - let (absent_status, _) = - json_response(&app, invoke_request("does_not_exist", "t-invokeonly", json!({}))).await; - assert_eq!( - exists_status, - StatusCode::FORBIDDEN, - "an existing read query the holder can't read → inner-gate 403" - ); - assert_eq!(absent_status, StatusCode::NOT_FOUND, "unknown query still 404s"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn list_queries_returns_only_exposed_with_typed_params() { - let (_temp, app) = app_with_stored_queries( - &[ - ("find_person", FIND_PERSON_GQ, true), - ( - "add_person", - "query add_person($name: String) { insert Person { name: $name } }", - true, - ), - ("hidden", "query hidden() { match { $p: Person } return { $p.name } }", false), - ], - &[("act-invoke", "t-invoke")], - INVOKE_POLICY_YAML, - ) - .await; - let (status, body) =
json_response(&app, get_request(&g("/queries"), "t-invoke")).await; - assert_eq!(status, StatusCode::OK, "body: {body}"); - - let entries = body["queries"].as_array().unwrap(); - let names: Vec<&str> = entries.iter().map(|q| q["name"].as_str().unwrap()).collect(); - assert!( - names.contains(&"find_person") && names.contains(&"add_person"), - "exposed queries listed: {names:?}" - ); - assert!(!names.contains(&"hidden"), "non-exposed query hidden from the catalog: {names:?}"); - - let fp = entries.iter().find(|q| q["name"] == "find_person").unwrap(); - assert_eq!(fp["mutation"], false); - assert_eq!(fp["tool_name"], "find_person"); - assert_eq!(fp["params"][0]["name"], "name"); - assert_eq!(fp["params"][0]["kind"], "string"); - let ap = entries.iter().find(|q| q["name"] == "add_person").unwrap(); - assert_eq!(ap["mutation"], true, "stored insert → mutation"); -} - -#[tokio::test(flavor = "multi_thread")] -async fn list_queries_is_read_gated_so_a_non_invoker_can_list() { - // The catalog is read-gated (not invoke_query-gated), so a reader who - // lacks invoke_query still enumerates the exposed queries — the - // documented probe-oracle gap until per-query Cedar filtering lands.
- let (_temp, app) = app_with_stored_queries( - &[("find_person", FIND_PERSON_GQ, true)], - &[("act-noinvoke", "t-noinvoke")], - INVOKE_POLICY_YAML, - ) - .await; - let (status, body) = json_response(&app, get_request(&g("/queries"), "t-noinvoke")).await; - assert_eq!(status, StatusCode::OK, "read-gated catalog; body: {body}"); - let names: Vec<&str> = body["queries"] - .as_array() - .unwrap() - .iter() - .map(|q| q["name"].as_str().unwrap()) - .collect(); - assert!( - names.contains(&"find_person"), - "a reader lists the catalog despite lacking invoke_query: {names:?}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn list_queries_surfaces_query_description_and_instruction() { - // E2e for the query-level `.gq` surface: `@description`/`@instruction` on - // a stored query declaration are carried through to clients via the typed - // `QueryCatalogEntry` fields over `GET /queries`. A query without them - // omits both fields (serde `skip_serializing_if = "Option::is_none"`). - let described = "query described($name: String) \ - @description(\"Find a person by exact name.\") \ - @instruction(\"Use for exact lookups; prefer search for fuzzy matches.\") \ - { match { $p: Person { name: $name } } return { $p.age } }"; - let (_temp, app) = app_with_stored_queries( - &[ - ("described", described, true), - ("bare", "query bare() { match { $p: Person } return { $p.name } }", true), - ], - &[("act-invoke", "t-invoke")], - INVOKE_POLICY_YAML, - ) - .await; - let (status, body) = json_response(&app, get_request(&g("/queries"), "t-invoke")).await; - assert_eq!(status, StatusCode::OK, "body: {body}"); - let entries = body["queries"].as_array().unwrap(); - - let described = entries.iter().find(|q| q["name"] == "described").unwrap(); - assert_eq!( - described["description"], "Find a person by exact name.", - "query @description surfaces over GET /queries: {described}" - ); - assert_eq!( - described["instruction"], - "Use for exact lookups; prefer search for fuzzy 
matches.", - "query @instruction surfaces over GET /queries: {described}" - ); - - let bare = entries.iter().find(|q| q["name"] == "bare").unwrap(); - assert!( - bare.get("description").is_none() && bare.get("instruction").is_none(), - "a query without the annotations omits both fields: {bare}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -async fn list_queries_is_empty_when_no_registry() { - let (_temp, app) = app_for_loaded_graph_with_auth("demo-token").await; - let (status, body) = json_response(&app, get_request(&g("/queries"), "demo-token")).await; - assert_eq!(status, StatusCode::OK, "body: {body}"); - assert!( - body["queries"].as_array().unwrap().is_empty(), - "no stored-query registry → empty catalog" - ); -} diff --git a/crates/omnigraph-server/tests/support/mod.rs b/crates/omnigraph-server/tests/support/mod.rs deleted file mode 100644 index 694db46..0000000 --- a/crates/omnigraph-server/tests/support/mod.rs +++ /dev/null @@ -1,1202 +0,0 @@ -//! Shared helpers for the server integration suites (moved verbatim -//! from the monolithic tests/server.rs in the modularization).
-#![allow(dead_code)] - -use std::env; -use std::fs; -use std::path::{Path, PathBuf}; -use std::sync::Arc; - -use axum::Router; -use axum::body::{Body, to_bytes}; -use axum::http::header::AUTHORIZATION; -use axum::http::{Method, Request, StatusCode}; -use omnigraph::db::{Omnigraph, ReadTarget}; -use omnigraph::error::OmniError; -use omnigraph::loader::{LoadMode, load_jsonl}; -use omnigraph_policy::{PolicyChecker, PolicyEngine}; -use omnigraph_server::api::{BranchCreateRequest, BranchMergeRequest, ChangeRequest, ReadRequest}; -use omnigraph_server::queries::{QueryRegistry, RegistrySpec}; -use omnigraph_server::{AppState, build_app}; -use serde_json::{Value, json}; -use tower::ServiceExt; - -pub const MUTATION_QUERIES: &str = r#" -query insert_person($name: String, $age: I32) { - insert Person { name: $name, age: $age } -} - -query set_age($name: String, $age: I32) { - update Person set { age: $age } where name = $name -} -"#; - -pub const POLICY_YAML: &str = r#" -version: 1 -groups: - team: [act-andrew, act-bruno, act-ragnor] - admins: [act-ragnor] -protected_branches: [main] -rules: - - id: team-read - allow: - actors: { group: team } - actions: [read] - branch_scope: any - - id: admins-export - allow: - actors: { group: admins } - actions: [export] - branch_scope: any - - id: team-write-unprotected - allow: - actors: { group: team } - actions: [change] - branch_scope: unprotected - - id: admins-merge - allow: - actors: { group: admins } - actions: [branch_delete, branch_merge] - target_branch_scope: protected -"#; - -pub const POLICY_PROTECTED_READ_YAML: &str = r#" -version: 1 -groups: - team: [act-bruno] -protected_branches: [main] -rules: - - id: protected-read - allow: - actors: { group: team } - actions: [read] - branch_scope: protected -"#; - -pub const INGEST_CREATE_ONLY_POLICY_YAML: &str = r#" -version: 1 -groups: - team: [act-bruno] -protected_branches: [main] -rules: - - id: team-branch-create - allow: - actors: { group: team } - actions: [branch_create] 
- target_branch_scope: unprotected -"#; - -pub const SCHEMA_APPLY_POLICY_YAML: &str = r#" -version: 1 -groups: - admins: [act-ragnor] -protected_branches: [main] -rules: - - id: admins-schema-apply - allow: - actors: { group: admins } - actions: [schema_apply] - target_branch_scope: protected -"#; - -pub fn fixture(name: &str) -> PathBuf { - PathBuf::from(env!("CARGO_MANIFEST_DIR")) - .join("../omnigraph/tests/fixtures") - .join(name) -} - -pub async fn init_loaded_graph() -> tempfile::TempDir { - init_graph_with_schema_and_data( - &fs::read_to_string(fixture("test.pg")).unwrap(), - &fs::read_to_string(fixture("test.jsonl")).unwrap(), - ) - .await -} - -pub async fn init_graph_with_schema_and_data(schema: &str, data: &str) -> tempfile::TempDir { - let temp = tempfile::tempdir().unwrap(); - let graph = graph_path(temp.path()); - fs::create_dir_all(&graph).unwrap(); - Omnigraph::init(graph.to_str().unwrap(), schema) - .await - .unwrap(); - let mut db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - load_jsonl(&mut db, data, LoadMode::Overwrite) - .await - .unwrap(); - temp -} - -pub async fn init_graph_with_schema(schema: &str) -> tempfile::TempDir { - let temp = tempfile::tempdir().unwrap(); - let graph = graph_path(temp.path()); - fs::create_dir_all(&graph).unwrap(); - Omnigraph::init(graph.to_str().unwrap(), schema) - .await - .unwrap(); - temp -} - -pub fn graph_path(root: &Path) -> PathBuf { - root.join("server.omni") -} - -pub fn stored_query_registry(specs: &[(&str, &str, bool)]) -> QueryRegistry { - QueryRegistry::from_specs( - specs - .iter() - .map(|(name, source, expose)| RegistrySpec { - name: name.to_string(), - source: source.to_string(), - expose: *expose, - tool_name: None, - }) - .collect(), - ) - .expect("specs parse and key==symbol") -} - -pub async fn app_with_stored_queries( - specs: &[(&str, &str, bool)], - tokens: &[(&str, &str)], - policy: &str, -) -> (tempfile::TempDir, Router) { - let temp = init_loaded_graph().await; - let graph 
= graph_path(temp.path()); - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, policy).unwrap(); - let registry = stored_query_registry(specs); - let state = AppState::open_single_with_queries( - graph.to_string_lossy().to_string(), - tokens - .iter() - .map(|(actor, token)| ((*actor).to_string(), (*token).to_string())) - .collect(), - Some(&policy_path), - registry, - ) - .await - .unwrap(); - (temp, build_app(state)) -} - -pub const INVOKE_POLICY_YAML: &str = r#" -version: 1 -groups: - invokers: ["act-invoke"] - full: ["act-full"] - readers: ["act-noinvoke"] - invoke_only: ["act-invokeonly"] -protected_branches: [main] -rules: - # invoke_query is graph-scoped — its own rules, no branch_scope. - - id: invokers-can-invoke - allow: - actors: { group: invokers } - actions: [invoke_query] - - id: full-can-invoke - allow: - actors: { group: full } - actions: [invoke_query] - - id: invoke-only-can-invoke - allow: - actors: { group: invoke_only } - actions: [invoke_query] - # read / change are branch-scoped. 
- - id: invokers-can-read - allow: - actors: { group: invokers } - actions: [read] - branch_scope: any - - id: full-can-read-change - allow: - actors: { group: full } - actions: [read, change] - branch_scope: any - - id: readers-can-read - allow: - actors: { group: readers } - actions: [read] - branch_scope: any -"#; - -pub const STORED_QUERY_SCHEMA_APPLY_POLICY_YAML: &str = r#" -version: 1 -groups: - admins: [act-ragnor] -protected_branches: [main] -rules: - - id: admins-can-invoke - allow: - actors: { group: admins } - actions: [invoke_query] - - id: admins-can-read - allow: - actors: { group: admins } - actions: [read] - branch_scope: any - - id: admins-can-schema-apply - allow: - actors: { group: admins } - actions: [schema_apply] - target_branch_scope: protected -"#; - -pub const FIND_PERSON_GQ: &str = - "query find_person($name: String) { match { $p: Person { name: $name } } return { $p.age } }"; - -/// RFC-011 cluster-only: the single-graph convenience apps built by the -/// `app_for_loaded_graph*` helpers serve the graph under the reserved id -/// `default`. This prefixes a flat per-graph path (e.g. `/snapshot`) with -/// the cluster route prefix so tests address `/graphs/default/snapshot`. 
-pub fn g(path: &str) -> String { - format!("/graphs/default{path}") -} - -pub fn invoke_request(name: &str, token: &str, body: Value) -> Request { - Request::builder() - .uri(g(&format!("/queries/{name}"))) - .method(Method::POST) - .header("content-type", "application/json") - .header("authorization", format!("Bearer {token}")) - .body(Body::from(serde_json::to_vec(&body).unwrap())) - .unwrap() -} - -pub fn invoke_request_bytes( - name: &str, - token: &str, - body: impl Into, - content_type: Option<&str>, -) -> Request { - let mut builder = Request::builder() - .uri(g(&format!("/queries/{name}"))) - .method(Method::POST) - .header("authorization", format!("Bearer {token}")); - if let Some(content_type) = content_type { - builder = builder.header("content-type", content_type); - } - builder.body(body.into()).unwrap() -} - -pub fn get_request(uri: &str, token: &str) -> Request { - Request::builder() - .uri(uri) - .method(Method::GET) - .header("authorization", format!("Bearer {token}")) - .body(Body::empty()) - .unwrap() -} - -pub fn drifted_test_schema() -> String { - fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace("age: I32?", "age: I64?") -} - -pub async fn manifest_dataset_version(graph: &Path) -> u64 { - Omnigraph::open(graph.to_string_lossy().as_ref()) - .await - .unwrap() - .snapshot_of(ReadTarget::branch("main")) - .await - .unwrap() - .version() -} - -pub fn s3_test_graph_uri(suite: &str) -> Option { - let bucket = env::var("OMNIGRAPH_S3_TEST_BUCKET").ok()?; - let prefix = env::var("OMNIGRAPH_S3_TEST_PREFIX") - .ok() - .filter(|value| !value.trim().is_empty()) - .unwrap_or_else(|| "omnigraph-itests".to_string()); - let unique = std::time::SystemTime::now() - .duration_since(std::time::UNIX_EPOCH) - .ok()? 
- .as_nanos(); - Some(format!("s3://{}/{}/{}/{}", bucket, prefix, suite, unique)) -} - -pub async fn app_for_loaded_graph() -> (tempfile::TempDir, Router) { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let state = AppState::open(graph.to_string_lossy().to_string()) - .await - .unwrap(); - (temp, build_app(state)) -} - -pub fn permit_all_policy_yaml(actors: &[&str]) -> String { - let members = actors - .iter() - .map(|a| format!("\"{a}\"")) - .collect::>() - .join(", "); - format!( - r#" -version: 1 -groups: - permitted: [{members}] -protected_branches: [main] -rules: - - id: permit-data - allow: - actors: {{ group: permitted }} - actions: [read, change, export] - branch_scope: any - - id: permit-protected-target-actions - allow: - actors: {{ group: permitted }} - actions: [schema_apply, branch_create, branch_delete, branch_merge] - target_branch_scope: any -"# - ) -} - -pub async fn app_for_loaded_graph_with_auth(token: &str) -> (tempfile::TempDir, Router) { - // `AppState::new_with_bearer_token(token)` maps the token to actor "default"; - // permit-all policy needs to include that actor. 
- let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, permit_all_policy_yaml(&["default"])).unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![("default".to_string(), token.to_string())], - Some(&policy_path), - ) - .await - .unwrap(); - (temp, build_app(state)) -} - -pub async fn app_for_loaded_graph_with_auth_tokens( - tokens: &[(&str, &str)], -) -> (tempfile::TempDir, Router) { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let policy_path = temp.path().join("policy.yaml"); - let actors: Vec<&str> = tokens.iter().map(|(actor, _)| *actor).collect(); - fs::write(&policy_path, permit_all_policy_yaml(&actors)).unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - tokens - .iter() - .map(|(actor, token)| ((*actor).to_string(), (*token).to_string())) - .collect(), - Some(&policy_path), - ) - .await - .unwrap(); - (temp, build_app(state)) -} - -pub async fn app_for_loaded_graph_with_auth_tokens_and_policy( - tokens: &[(&str, &str)], - policy: &str, -) -> (tempfile::TempDir, Router) { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, policy).unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - tokens - .iter() - .map(|(actor, token)| ((*actor).to_string(), (*token).to_string())) - .collect(), - Some(&policy_path), - ) - .await - .unwrap(); - (temp, build_app(state)) -} - -pub async fn app_for_graph_with_auth_tokens_and_policy( - schema: &str, - tokens: &[(&str, &str)], - policy: &str, -) -> (tempfile::TempDir, Router) { - let temp = init_graph_with_schema(schema).await; - let graph = graph_path(temp.path()); - let policy_path = 
temp.path().join("policy.yaml"); - fs::write(&policy_path, policy).unwrap(); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - tokens - .iter() - .map(|(actor, token)| ((*actor).to_string(), (*token).to_string())) - .collect(), - Some(&policy_path), - ) - .await - .unwrap(); - (temp, build_app(state)) -} - -pub async fn app_for_graph_with_auth_tokens_only( - schema: &str, - tokens: &[(&str, &str)], -) -> (tempfile::TempDir, Router) { - let temp = init_graph_with_schema(schema).await; - let graph = graph_path(temp.path()); - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - tokens - .iter() - .map(|(actor, token)| ((*actor).to_string(), (*token).to_string())) - .collect(), - None, - ) - .await - .unwrap(); - (temp, build_app(state)) -} - -pub fn additive_schema_with_nickname() -> String { - fs::read_to_string(fixture("test.pg")).unwrap().replace( - " age: I32?\n}", - " age: I32?\n nickname: String?\n}", - ) -} - -pub fn schema_without_age() -> String { - // Drop the nullable `age` column from the test schema. Used by the - // HTTP soft/hard drop tests below. - fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace(" age: I32?\n", "") -} - -pub fn schema_without_company() -> String { - // Drop the `Company` node type and the edge referencing it. Used - // by the HTTP DropType test below. Hand-crafted (no template - // string replace) because the fixture interleaves the type and - // its edge. - r#"node Person { - name: String @key - age: I32? -} - -edge Knows: Person -> Person { - since: Date? 
-} -"# - .to_string() -} - -pub fn renamed_person_schema() -> String { - fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace("node Person {\n", "node Human @rename_from(\"Person\") {\n") - .replace("edge Knows: Person -> Person", "edge Knows: Human -> Human") - .replace( - "edge WorksAt: Person -> Company", - "edge WorksAt: Human -> Company", - ) -} - -pub fn renamed_age_schema() -> String { - fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace("age: I32?", "years: I32? @rename_from(\"age\")") -} - -pub fn indexed_name_schema() -> String { - fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace("name: String @key", "name: String @key @index") -} - -pub fn unsupported_schema_change() -> String { - fs::read_to_string(fixture("test.pg")) - .unwrap() - .replace("age: I32?", "age: I64?") -} - -pub async fn json_response(app: &Router, request: Request) -> (StatusCode, Value) { - let response = app.clone().oneshot(request).await.unwrap(); - let status = response.status(); - let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); - let value = serde_json::from_slice(&body).unwrap(); - (status, value) -} - -pub struct EnvGuard { - saved: Vec<(&'static str, Option)>, -} - -impl EnvGuard { - pub fn set(vars: &[(&'static str, Option<&str>)]) -> Self { - let saved = vars - .iter() - .map(|(name, _)| (*name, env::var(name).ok())) - .collect::>(); - for (name, value) in vars { - unsafe { - match value { - Some(value) => env::set_var(name, value), - None => env::remove_var(name), - } - } - } - Self { saved } - } -} - -impl Drop for EnvGuard { - fn drop(&mut self) { - for (name, value) in self.saved.drain(..) 
{ - unsafe { - match value { - Some(value) => env::set_var(name, value), - None => env::remove_var(name), - } - } - } - } -} - -pub fn format_vector(values: &[f32]) -> String { - values - .iter() - .map(|value| format!("{:.8}", value)) - .collect::>() - .join(", ") -} - -pub fn normalize_vector(mut values: Vec) -> Vec { - let norm = values - .iter() - .map(|value| (*value as f64) * (*value as f64)) - .sum::() - .sqrt() as f32; - if norm > f32::EPSILON { - for value in &mut values { - *value /= norm; - } - } - values -} - -pub fn fnv1a64(bytes: &[u8]) -> u64 { - let mut hash = 14695981039346656037u64; - for byte in bytes { - hash ^= *byte as u64; - hash = hash.wrapping_mul(1099511628211u64); - } - hash -} - -pub fn xorshift64(mut x: u64) -> u64 { - x ^= x << 13; - x ^= x >> 7; - x ^= x << 17; - x -} - -pub fn mock_embedding(input: &str, dim: usize) -> Vec { - let mut seed = fnv1a64(input.as_bytes()); - let mut out = Vec::with_capacity(dim); - for _ in 0..dim { - seed = xorshift64(seed); - let ratio = (seed as f64 / u64::MAX as f64) as f32; - out.push((ratio * 2.0) - 1.0); - } - normalize_vector(out) -} - -pub mod matrix { - use super::*; - use std::time::Duration; - use tokio::sync::Barrier; - - #[derive(Debug)] - pub struct OpStatus { - pub status: StatusCode, - pub body: Vec, - } - - pub struct Harness { - pub _temp: tempfile::TempDir, - pub app: Router, - } - - impl Harness { - pub async fn new() -> Self { - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - // Build the WorkloadController explicitly with defaults rather - // than letting `AppState::open` call - // `WorkloadController::from_env()`. The admission-gate test - // (`ingest_per_actor_admission_cap_returns_429`) sets - // OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1 inside an EnvGuard while - // it runs. 
Process-wide env vars are visible to - // concurrently-running tests; if a matrix cell reads env at - // AppState construction time during that window it picks up - // cap=1 and the second concurrent merge in cell b surfaces - // 429 instead of the expected 200. Constructing the - // controller here with explicit defaults makes cells - // independent of any env mutation other tests perform. - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - let workload = omnigraph_server::workload::WorkloadController::with_defaults(); - let state = AppState::new_with_workload( - graph.to_string_lossy().to_string(), - db, - Vec::new(), - workload, - ); - let app = build_app(state); - Self { _temp: temp, app } - } - - pub async fn create_branch(&self, from: &str, name: &str) { - let body = serde_json::to_vec(&BranchCreateRequest { - from: Some(from.to_string()), - name: name.to_string(), - }) - .unwrap(); - let r = self - .app - .clone() - .oneshot( - Request::builder() - .uri(g("/branches")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!( - r.status(), - StatusCode::OK, - "setup create_branch {} from {} failed", - name, - from - ); - } - - pub async fn insert_person(&self, branch: &str, name: &str, age: i32) { - let body = serde_json::to_vec(&ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": name, "age": age })), - branch: Some(branch.to_string()), - }) - .unwrap(); - let r = self - .app - .clone() - .oneshot( - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!( - r.status(), - StatusCode::OK, - "setup insert {} on {} failed", - name, - branch - ); - } - - /// Run two ops concurrently with barrier alignment + 15s deadlock - /// timeout. 
Returns `(op_a, op_b)`. Panics on timeout. - pub async fn run_pair( - &self, - op_a: impl FnOnce(Router, Arc) -> tokio::task::JoinHandle, - op_b: impl FnOnce(Router, Arc) -> tokio::task::JoinHandle, - ) -> (OpStatus, OpStatus) { - let barrier = Arc::new(Barrier::new(2)); - let h_a = op_a(self.app.clone(), Arc::clone(&barrier)); - let h_b = op_b(self.app.clone(), Arc::clone(&barrier)); - let result = tokio::time::timeout(Duration::from_secs(15), async { - let a = h_a.await.unwrap(); - let b = h_b.await.unwrap(); - (a, b) - }) - .await; - result.expect("concurrent op pair deadlocked (>15s)") - } - - pub async fn person_count(&self, branch: &str) -> u64 { - let r = self - .app - .clone() - .oneshot( - Request::builder() - .uri(g(&format!("/snapshot?branch={}", branch))) - .method(Method::GET) - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!(r.status(), StatusCode::OK, "snapshot {} failed", branch); - let body = to_bytes(r.into_body(), usize::MAX).await.unwrap(); - let v: Value = serde_json::from_slice(&body).unwrap(); - v["tables"] - .as_array() - .and_then(|tables| { - tables - .iter() - .find(|t| t["table_key"].as_str() == Some("node:Person")) - }) - .and_then(|t| t["row_count"].as_u64()) - .unwrap_or_else(|| panic!("snapshot {} missing node:Person", branch)) - } - - /// True iff the named Person exists on `branch`. Uses the - /// `get_person` query from `test.gq` for identity rather than - /// just count. 
- pub async fn person_exists(&self, branch: &str, name: &str) -> bool { - let body = serde_json::to_vec(&ReadRequest { - query_source: include_str!("../../../omnigraph/tests/fixtures/test.gq").to_string(), - query_name: Some("get_person".to_string()), - params: Some(json!({ "name": name })), - branch: Some(branch.to_string()), - snapshot: None, - }) - .unwrap(); - let r = self - .app - .clone() - .oneshot( - Request::builder() - .uri(g("/read")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!( - r.status(), - StatusCode::OK, - "person_exists query for {} on {} failed", - name, - branch - ); - let body = to_bytes(r.into_body(), usize::MAX).await.unwrap(); - let v: Value = serde_json::from_slice(&body).unwrap(); - v["row_count"].as_u64().unwrap_or(0) > 0 - } - - /// Asserts each name in `present` exists on `branch` and each in - /// `absent` does not. Identity-grade check that catches symmetric - /// swap races a row-count assertion would miss. - pub async fn assert_persons( - &self, - branch: &str, - cell: &str, - present: &[&str], - absent: &[&str], - ) { - for name in present { - assert!( - self.person_exists(branch, name).await, - "[{}] expected {} to be present on {}", - cell, - name, - branch - ); - } - for name in absent { - assert!( - !self.person_exists(branch, name).await, - "[{}] expected {} to be absent from {}", - cell, - name, - branch - ); - } - } - - /// C6: insert a uniquely-named sentinel on main and verify it - /// landed. Catches engine-state poisoning where a cell's - /// concurrent ops left the engine half-broken — subsequent - /// /change either deadlocks or returns a non-200. 
- pub async fn assert_post_op_sentinel(&self, cell: &str, sentinel: &str) { - let body = serde_json::to_vec(&ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": sentinel, "age": 99 })), - branch: Some("main".to_string()), - }) - .unwrap(); - let r = self - .app - .clone() - .oneshot( - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(), - ) - .await - .unwrap(); - assert_eq!( - r.status(), - StatusCode::OK, - "[{}] post-op sentinel /change on main failed (engine poisoned?)", - cell - ); - assert!( - self.person_exists("main", sentinel).await, - "[{}] sentinel {} did not land on main", - cell, - sentinel - ); - } - } - - // Helpers that build the closures for `run_pair`. Each takes a - // Router + Barrier and returns a JoinHandle yielding the status/body. - - pub fn op_merge( - source: String, - target: String, - ) -> impl FnOnce(Router, Arc) -> tokio::task::JoinHandle { - move |app: Router, barrier: Arc| { - tokio::spawn(async move { - barrier.wait().await; - let body = serde_json::to_vec(&BranchMergeRequest { - source, - target: Some(target), - }) - .unwrap(); - let response = app - .oneshot( - Request::builder() - .uri(g("/branches/merge")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(), - ) - .await - .unwrap(); - let status = response.status(); - let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); - OpStatus { - status, - body: body.to_vec(), - } - }) - } - } - - pub fn op_change_insert( - branch: String, - name: String, - age: i32, - ) -> impl FnOnce(Router, Arc) -> tokio::task::JoinHandle { - move |app: Router, barrier: Arc| { - tokio::spawn(async move { - barrier.wait().await; - let body = serde_json::to_vec(&ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - 
params: Some(json!({ "name": name, "age": age })), - branch: Some(branch), - }) - .unwrap(); - let response = app - .oneshot( - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(), - ) - .await - .unwrap(); - let status = response.status(); - let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); - OpStatus { - status, - body: body.to_vec(), - } - }) - } - } - - pub fn op_branch_create( - from: String, - name: String, - ) -> impl FnOnce(Router, Arc) -> tokio::task::JoinHandle { - move |app: Router, barrier: Arc| { - tokio::spawn(async move { - barrier.wait().await; - let body = serde_json::to_vec(&BranchCreateRequest { - from: Some(from), - name, - }) - .unwrap(); - let response = app - .oneshot( - Request::builder() - .uri(g("/branches")) - .method(Method::POST) - .header("content-type", "application/json") - .body(Body::from(body)) - .unwrap(), - ) - .await - .unwrap(); - let status = response.status(); - let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); - OpStatus { - status, - body: body.to_vec(), - } - }) - } - } - - pub fn op_branch_delete( - name: String, - ) -> impl FnOnce(Router, Arc) -> tokio::task::JoinHandle { - move |app: Router, barrier: Arc| { - tokio::spawn(async move { - barrier.wait().await; - let response = app - .oneshot( - Request::builder() - .uri(g(&format!("/branches/{}", name))) - .method(Method::DELETE) - .body(Body::empty()) - .unwrap(), - ) - .await - .unwrap(); - let status = response.status(); - let body = to_bytes(response.into_body(), usize::MAX).await.unwrap(); - OpStatus { - status, - body: body.to_vec(), - } - }) - } - } -} - -pub const PARITY_POLICY_YAML: &str = r#" -version: 1 -groups: - team: [act-bruno] - admins: [act-ragnor] -protected_branches: [main] -rules: - - id: admins-change-anywhere - allow: - actors: { group: admins } - actions: [change] - branch_scope: any - - id: 
admins-merge-to-protected - allow: - actors: { group: admins } - actions: [branch_merge] - target_branch_scope: protected -"#; - -#[derive(Clone, Copy, Debug)] -pub enum ParityDecision { - Allow, - Deny, -} - -pub async fn build_parity_graph() -> (tempfile::TempDir, PathBuf, PathBuf) { - // Build a graph with `main` loaded and a `feature` branch ready for - // merge. Returns the graph path and a written policy.yaml path. - let temp = init_loaded_graph().await; - let graph = graph_path(temp.path()); - { - let db = Omnigraph::open(graph.to_str().unwrap()).await.unwrap(); - db.branch_create_from(ReadTarget::branch("main"), "feature") - .await - .unwrap(); - db.load_as( - "feature", - None, - r#"{"type":"Person","data":{"name":"ParityEve","age":29}}"#, - LoadMode::Append, - None, - ) - .await - .unwrap(); - } - let policy_path = temp.path().join("policy.yaml"); - fs::write(&policy_path, PARITY_POLICY_YAML).unwrap(); - (temp, graph, policy_path) -} - -pub async fn sdk_change_decision(graph: &Path, policy_path: &Path, actor: &str) -> ParityDecision { - let policy = PolicyEngine::load_graph(policy_path, graph.to_string_lossy().as_ref()).unwrap(); - let db = Omnigraph::open(graph.to_str().unwrap()) - .await - .unwrap() - .with_policy(Arc::new(policy) as Arc); - let mut params: omnigraph_compiler::ParamMap = Default::default(); - // Parameter keys are bare names (no `$` prefix); the runtime resolves - // `$name` references in the query body to `params["name"]`. 
- params.insert( - "name".to_string(), - omnigraph_compiler::Literal::String("ParityCharlie".to_string()), - ); - params.insert("age".to_string(), omnigraph_compiler::Literal::Integer(30)); - let result = db - .mutate_as( - "main", - MUTATION_QUERIES, - "insert_person", - &params, - Some(actor), - ) - .await; - match result { - Ok(_) => ParityDecision::Allow, - Err(OmniError::Policy(_)) => ParityDecision::Deny, - Err(other) => panic!("unexpected SDK error for change: {other:?}"), - } -} - -pub async fn http_change_decision( - graph: &Path, - policy_path: &PathBuf, - actor: &str, - token: &str, -) -> ParityDecision { - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![(actor.to_string(), token.to_string())], - Some(policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - let req = ChangeRequest { - query: MUTATION_QUERIES.to_string(), - name: Some("insert_person".to_string()), - params: Some(json!({ "name": "ParityCharlie", "age": 30 })), - branch: Some("main".to_string()), - }; - let (status, _body) = json_response( - &app, - Request::builder() - .uri(g("/change")) - .method(Method::POST) - .header(AUTHORIZATION, format!("Bearer {token}")) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&req).unwrap())) - .unwrap(), - ) - .await; - match status { - StatusCode::OK => ParityDecision::Allow, - StatusCode::FORBIDDEN => ParityDecision::Deny, - other => panic!("unexpected HTTP status for change: {other}"), - } -} - -pub async fn sdk_merge_decision(graph: &Path, policy_path: &Path, actor: &str) -> ParityDecision { - let policy = PolicyEngine::load_graph(policy_path, graph.to_string_lossy().as_ref()).unwrap(); - let db = Omnigraph::open(graph.to_str().unwrap()) - .await - .unwrap() - .with_policy(Arc::new(policy) as Arc); - let result = db.branch_merge_as("feature", "main", Some(actor)).await; - match result { - Ok(_) => ParityDecision::Allow, - 
Err(OmniError::Policy(_)) => 
ParityDecision::Deny, - Err(other) => panic!("unexpected SDK error for branch_merge: {other:?}"), - } -} - -pub async fn http_merge_decision( - graph: &Path, - policy_path: &PathBuf, - actor: &str, - token: &str, -) -> ParityDecision { - let state = AppState::open_with_bearer_tokens_and_policy( - graph.to_string_lossy().to_string(), - vec![(actor.to_string(), token.to_string())], - Some(policy_path), - ) - .await - .unwrap(); - let app = build_app(state); - let req = BranchMergeRequest { - source: "feature".to_string(), - target: Some("main".to_string()), - }; - let (status, _body) = json_response( - &app, - Request::builder() - .uri(g("/branches/merge")) - .method(Method::POST) - .header(AUTHORIZATION, format!("Bearer {token}")) - .header("content-type", "application/json") - .body(Body::from(serde_json::to_vec(&req).unwrap())) - .unwrap(), - ) - .await; - match status { - StatusCode::OK => ParityDecision::Allow, - StatusCode::FORBIDDEN => ParityDecision::Deny, - other => panic!("unexpected HTTP status for branch_merge: {other}"), - } -} - -pub async fn converged_cluster_dir(policies_yaml: &str) -> tempfile::TempDir { - let temp = tempfile::tempdir().unwrap(); - fs::write( - temp.path().join("people.pg"), - "\nnode Person {\n name: String @key\n}\n", - ) - .unwrap(); - fs::write( - temp.path().join("people.gq"), - "\nquery find_person($name: String) {\n match { $p: Person { name: $name } }\n return { $p.name }\n}\n", - ) - .unwrap(); - fs::write( - temp.path().join("cluster.yaml"), - format!( - r#" -version: 1 -graphs: - knowledge: - schema: ./people.pg - queries: - find_person: - file: ./people.gq -{policies_yaml}"# - ), - ) - .unwrap(); - let import = omnigraph_cluster::import_config_dir(temp.path()).await; - assert!(import.ok, "{:?}", import.diagnostics); - let apply = omnigraph_cluster::apply_config_dir(temp.path()).await; - assert!(apply.ok && apply.converged, "{:?}", apply.diagnostics); - temp -} - -pub async fn cluster_settings( - dir: &Path, -) -> 
color_eyre::eyre::Result { - omnigraph_server::load_server_settings(Some(&dir.to_path_buf()), None, true, false).await -} diff --git a/crates/omnigraph/Cargo.toml b/crates/omnigraph/Cargo.toml index 5038fd1..1fa3436 100644 --- a/crates/omnigraph/Cargo.toml +++ b/crates/omnigraph/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "omnigraph-engine" -version = "0.7.2" +version = "0.6.0" edition = "2024" description = "Runtime engine for the Omnigraph graph database." license = "MIT" @@ -16,8 +16,8 @@ default = [] failpoints = ["dep:fail", "fail/failpoints"] [dependencies] -omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.7.2" } -omnigraph-policy = { path = "../omnigraph-policy", version = "0.7.2" } +omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" } +omnigraph-policy = { path = "../omnigraph-policy", version = "0.6.0" } lance = { workspace = true } lance-datafusion = { workspace = true } datafusion = { workspace = true } @@ -37,7 +37,6 @@ serde_json = { workspace = true } reqwest = { workspace = true } object_store = { workspace = true } ulid = { workspace = true } -sha2 = { workspace = true } base64 = { workspace = true } futures = { workspace = true } tracing = { workspace = true } @@ -52,9 +51,7 @@ chrono = { workspace = true } arc-swap = { workspace = true } [dev-dependencies] -omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.7.2" } +omnigraph-compiler = { path = "../omnigraph-compiler", version = "0.6.0" } tokio = { workspace = true } lance-namespace-impls = { workspace = true } -lance-io = "7.0.0" serial_test = "3" -proptest = "1" diff --git a/crates/omnigraph/examples/bench_expand.rs b/crates/omnigraph/examples/bench_expand.rs index bb904a0..c723b24 100644 --- a/crates/omnigraph/examples/bench_expand.rs +++ b/crates/omnigraph/examples/bench_expand.rs @@ -221,65 +221,6 @@ fn microbench_dedup() { ); } -/// Selective single-source traversal, timed cold in CSR vs indexed mode across -/// growing |E|. 
The win of the indexed path: a small fixed frontier should be -/// ~flat in |E| (one BTREE scan per hop), whereas CSR pays an O(|E|) adjacency -/// build on the first (cold) query. Also asserts both modes return the same -/// rows — a guard against the scalar-index `physical_rows` silent fallback -/// dropping unindexed-fragment rows. -async fn bench_selective_modes() { - println!("\n── Selective traversal: indexed vs CSR (cold, single-source knows{{1,2}}) ──"); - let sel = r#" -query sel($name: String) { - match { - $a: Person { name: $name } - $a knows{1,2} $b - } - return { $b.name } -} -"#; - for &(n, avg_deg) in &[(1_000usize, 8usize), (10_000, 8), (30_000, 8)] { - let jsonl = generate_jsonl(n, avg_deg, 42); - let mut params = ParamMap::new(); - params.insert( - "name".to_string(), - omnigraph_compiler::query::ast::Literal::String("p0".to_string()), - ); - - let mut rows_by_mode: Vec<(&str, usize)> = Vec::new(); - for mode in ["csr", "indexed"] { - // Fresh db per measurement so the query is cold (CSR pays its build). - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, SCHEMA).await.unwrap(); - load_jsonl(&mut db, &jsonl, LoadMode::Overwrite).await.unwrap(); - // SAFE: example main drives queries sequentially; no concurrent env reader. 
- unsafe { std::env::set_var("OMNIGRAPH_TRAVERSAL_MODE", mode) }; - - let t = Instant::now(); - let r = db - .query(ReadTarget::branch("main"), sel, "sel", &params) - .await - .expect("sel query"); - let elapsed = t.elapsed(); - let rows = r.num_rows(); - rows_by_mode.push((mode, rows)); - println!( - " |E|≈{:>7} {:<8} cold={:>9.2?} rows={}", - n * avg_deg, - mode, - elapsed, - rows - ); - } - unsafe { std::env::remove_var("OMNIGRAPH_TRAVERSAL_MODE") }; - assert_eq!( - rows_by_mode[0].1, rows_by_mode[1].1, - "indexed and CSR must return identical rows (no silent drop under partial index coverage)" - ); - } -} - #[tokio::main(flavor = "multi_thread")] async fn main() { println!("── End-to-end query latency ──"); @@ -321,7 +262,5 @@ async fn main() { } } - bench_selective_modes().await; - microbench_dedup(); } diff --git a/crates/omnigraph/src/changes/mod.rs b/crates/omnigraph/src/changes/mod.rs index 2e9bc02..7c9e8ea 100644 --- a/crates/omnigraph/src/changes/mod.rs +++ b/crates/omnigraph/src/changes/mod.rs @@ -7,7 +7,6 @@ use lance::dataset::scanner::ColumnOrdering; use crate::db::SubTableEntry; use crate::db::manifest::Snapshot; use crate::error::Result; -use crate::storage_layer::{SnapshotHandle, TableStorage}; use crate::table_store::TableStore; // ─── Types ────────────────────────────────────────────────────────────────── @@ -230,8 +229,7 @@ async fn diff_table_same_lineage( ) -> Result> { let vf = from_entry.table_version; let vt = to_entry.table_version; - let storage: &dyn TableStorage = table_store; - let to_ds = storage.open_snapshot_at_entry(to_entry).await?; + let to_ds = table_store.open_at_entry(to_entry).await?; let cols: Vec<&str> = if is_edge { vec!["id", "src", "dst", "_row_last_updated_at_version"] @@ -248,23 +246,23 @@ async fn diff_table_same_lineage( // Inserts + Updates: use _row_last_updated_at_version to find all rows // touched since Vf, then classify by checking whether the ID existed at Vf.
// - // We key on _row_last_updated_at_version because one scan over it catches - // every row touched in the window — inserts and updates alike — regardless - // of write mode, and ID-set membership at Vf then distinguishes inserts from - // updates. (lance#6774 made merge_insert stamp new rows' _row_created_at_version - // with the commit version, so created_at became reliable too; last_updated - // stays the right key since it also covers updates.) + // Why not _row_created_at_version for inserts: Lance's merge_insert stamps + // new rows with _row_created_at_version = dataset_creation_version (v1), + // not the merge_insert commit version. This makes _row_created_at_version + // unreliable for detecting inserts from merge_insert writes. Using + // _row_last_updated_at_version catches all touched rows regardless of + // write mode, and ID-set membership distinguishes inserts from updates. if wants_inserts || wants_updates { let filter_sql = format!( "_row_last_updated_at_version > {} AND _row_last_updated_at_version <= {}", vf, vt ); - let changed_rows = scan_with_filter(storage, &to_ds, &cols, &filter_sql).await?; + let changed_rows = scan_with_filter(table_store, &to_ds, &cols, &filter_sql).await?; if !changed_rows.is_empty() { // Build the set of IDs that existed at the from version - let from_ds = storage.open_snapshot_at_entry(from_entry).await?; - let from_ids: HashSet = scan_id_set(storage, &from_ds, &["id"]) + let from_ds = table_store.open_at_entry(from_entry).await?; + let from_ids: HashSet = scan_id_set(table_store, &from_ds, &["id"]) .await?
.into_iter() .map(|r| r.id) @@ -284,8 +282,8 @@ async fn diff_table_same_lineage( // Deletes: ID set-difference if wants_deletes { - let from_ds = storage.open_snapshot_at_entry(from_entry).await?; - let deleted = deleted_ids_by_set_diff(storage, &from_ds, &to_ds, is_edge).await?; + let from_ds = table_store.open_at_entry(from_entry).await?; + let deleted = deleted_ids_by_set_diff(table_store, &from_ds, &to_ds, is_edge).await?; changes.extend(deleted); } @@ -302,14 +300,13 @@ async fn diff_table_cross_branch( is_edge: bool, filter: &ChangeFilter, ) -> Result> { - let storage: &dyn TableStorage = table_store; - let from_ds = storage - .open_snapshot_at_table(from_snap, table_key) + let from_ds = table_store + .open_snapshot_table(from_snap, table_key) .await?; - let to_ds = storage.open_snapshot_at_table(to_snap, table_key).await?; + let to_ds = table_store.open_snapshot_table(to_snap, table_key).await?; - let from_rows = scan_all_rows_ordered(storage, &from_ds, is_edge).await?; - let to_rows = scan_all_rows_ordered(storage, &to_ds, is_edge).await?; + let from_rows = scan_all_rows_ordered(table_store, &from_ds, is_edge).await?; + let to_rows = scan_all_rows_ordered(table_store, &to_ds, is_edge).await?; let mut changes = Vec::new(); let mut fi = 0; @@ -395,9 +392,8 @@ async fn diff_table_added( if !filter.wants_op(ChangeOp::Insert) { return Ok(Vec::new()); } - let storage: &dyn TableStorage = table_store; - let ds = storage.open_snapshot_at_table(to_snap, table_key).await?; - let rows = scan_all_rows_ordered(storage, &ds, is_edge).await?; + let ds = table_store.open_snapshot_table(to_snap, table_key).await?; + let rows = scan_all_rows_ordered(table_store, &ds, is_edge).await?; Ok(rows .into_iter() .map(|r| entity_change_from_row(&r, ChangeOp::Insert, is_edge)) @@ -414,11 +410,10 @@ async fn diff_table_removed( if !filter.wants_op(ChangeOp::Delete) { return Ok(Vec::new()); } - let storage: &dyn TableStorage = table_store; - let ds = storage - 
.open_snapshot_at_table(from_snap, table_key) + let ds = table_store + .open_snapshot_table(from_snap, table_key) .await?; - let rows = scan_all_rows_ordered(storage, &ds, is_edge).await?; + let rows = scan_all_rows_ordered(table_store, &ds, is_edge).await?; Ok(rows .into_iter() .map(|r| entity_change_from_row(&r, ChangeOp::Delete, is_edge)) @@ -429,12 +424,12 @@ async fn diff_table_removed( /// Scan with a SQL filter, projecting specific columns. async fn scan_with_filter( - storage: &dyn TableStorage, - ds: &SnapshotHandle, + table_store: &TableStore, + ds: &lance::Dataset, cols: &[&str], filter_sql: &str, ) -> Result> { - let batches = storage + let batches = table_store .scan(ds, Some(cols), Some(filter_sql), None) .await?; Ok(extract_rows(&batches)) @@ -442,11 +437,11 @@ async fn scan_with_filter( /// Scan all rows ordered by id, projecting id (+ src/dst for edges) + all columns for signature. async fn scan_all_rows_ordered( - storage: &dyn TableStorage, - ds: &SnapshotHandle, + table_store: &TableStore, + ds: &lance::Dataset, is_edge: bool, ) -> Result> { - let batches = storage + let batches = table_store .scan( ds, None, @@ -459,9 +454,9 @@ async fn scan_all_rows_ordered( /// Compute deleted IDs: scan id at from and to, set-difference. async fn deleted_ids_by_set_diff( - storage: &dyn TableStorage, - from_ds: &SnapshotHandle, - to_ds: &SnapshotHandle, + table_store: &TableStore, + from_ds: &lance::Dataset, + to_ds: &lance::Dataset, is_edge: bool, ) -> Result> { let cols: Vec<&str> = if is_edge { @@ -470,8 +465,8 @@ async fn deleted_ids_by_set_diff( vec!["id"] }; - let from_rows = scan_id_set(storage, from_ds, &cols).await?; - let to_ids: HashSet = scan_id_set(storage, to_ds, &["id"]) + let from_rows = scan_id_set(table_store, from_ds, &cols).await?; + let to_ids: HashSet = scan_id_set(table_store, to_ds, &["id"]) .await? 
.into_iter() .map(|r| r.id) @@ -485,11 +480,11 @@ async fn deleted_ids_by_set_diff( } async fn scan_id_set( - storage: &dyn TableStorage, - ds: &SnapshotHandle, + table_store: &TableStore, + ds: &lance::Dataset, cols: &[&str], ) -> Result> { - let batches = storage.scan(ds, Some(cols), None, None).await?; + let batches = table_store.scan(ds, Some(cols), None, None).await?; Ok(extract_rows(&batches)) } diff --git a/crates/omnigraph/src/db/commit_graph.rs b/crates/omnigraph/src/db/commit_graph.rs index fb61874..565bd69 100644 --- a/crates/omnigraph/src/db/commit_graph.rs +++ b/crates/omnigraph/src/db/commit_graph.rs @@ -1,5 +1,6 @@ use std::collections::{HashMap, VecDeque}; use std::sync::Arc; +use std::time::{SystemTime, UNIX_EPOCH}; use arrow_array::{ Array, RecordBatch, RecordBatchIterator, StringArray, TimestampMicrosecondArray, UInt64Array, @@ -28,16 +29,7 @@ pub struct GraphCommit { pub struct CommitGraph { root_uri: String, - /// Handle on `_graph_commits.lance` at the active branch, held only for the - /// branch-management WRITES (`create_branch`, formerly `version`) and - /// `refresh`. It is a DERIVED artifact (RFC-013 Phase 7): graph lineage lives - /// in `__manifest`, and reads (`head_commit`/`load_commits`/`get_commit`/ - /// `merge_base`) never touch it. `None` means the branch's - /// `_graph_commits.lance` ref is missing (an interrupted fork-reclaim or a - /// `cleanup` race) while the manifest lineage is still authoritative β€” so the - /// READS stay correct and only a subsequent `create_branch` surfaces the loud - /// actionable error. Mirrors `actor_dataset`'s best-effort `Option`. - dataset: Option, + dataset: Dataset, actor_dataset: Option, active_branch: Option, actor_by_commit_id: HashMap, @@ -46,26 +38,25 @@ pub struct CommitGraph { } impl CommitGraph { - /// Create the commit-graph datasets for a fresh graph. 
The genesis - /// `graph_commit` + `graph_head` rows live in `__manifest` (folded into the - /// init write β€” RFC-013 Phase 7), so `_graph_commits.lance` is created EMPTY - /// here: it exists only to carry the Lance branch refs that `create_branch` / - /// `list_branches` / the `cleanup` orphan reconciler operate on. No commit - /// rows are ever written to it. The in-memory cache is sourced from the - /// manifest projection β€” the same path as [`open`], so genesis is seen - /// identically whether the graph was just initialized or reopened. - pub async fn init(root_uri: &str) -> Result { + pub async fn init(root_uri: &str, manifest_version: u64) -> Result { let root = root_uri.trim_end_matches('/'); let uri = graph_commits_uri(root); + let genesis = GraphCommit { + graph_commit_id: ulid::Ulid::new().to_string(), + manifest_branch: None, + manifest_version, + parent_commit_id: None, + merged_parent_commit_id: None, + actor_id: None, + created_at: now_micros()?, + }; - let batch = RecordBatch::new_empty(commit_graph_schema()); + let batch = commits_to_batch(&[genesis.clone()])?; let reader = RecordBatchIterator::new(vec![Ok(batch)], commit_graph_schema()); let params = WriteParams { mode: WriteMode::Create, enable_stable_row_ids: true, data_storage_version: Some(LanceFileVersion::V2_2), - auto_cleanup: None, - skip_auto_cleanup: true, ..Default::default() }; let dataset = Dataset::write(reader, &uri as &str, Some(params)) @@ -73,58 +64,34 @@ impl CommitGraph { .map_err(|e| OmniError::Lance(e.to_string()))?; let actor_dataset = create_commit_actor_dataset(root).await?; - let (commit_by_id, head_commit) = load_commit_cache_from_manifest(root, None).await?; Ok(Self { root_uri: root.to_string(), - dataset: Some(dataset), + dataset, actor_dataset: Some(actor_dataset), active_branch: None, actor_by_commit_id: HashMap::new(), - commit_by_id, - head_commit, + commit_by_id: HashMap::from([(genesis.graph_commit_id.clone(), genesis.clone())]), + head_commit: 
Some(genesis), }) } - /// Insert a just-published commit into the in-memory cache (RFC-013 Phase 7). - /// The durable write already happened in the manifest publish CAS; this only - /// keeps the cache consistent for same-handle reads, with no storage I/O. - /// Head selection matches the manifest-sourced load (`should_replace_head`). - pub fn insert_committed(&mut self, commit: GraphCommit) { - if should_replace_head(self.head_commit.as_ref(), &commit) { - self.head_commit = Some(commit.clone()); - } - self.commit_by_id - .insert(commit.graph_commit_id.clone(), commit); - } - pub async fn open(root_uri: &str) -> Result { let root = root_uri.trim_end_matches('/'); - let wrapper = crate::instrumentation::commit_graph_wrapper(); - let dataset = - crate::instrumentation::open_dataset_tracked(&graph_commits_uri(root), wrapper.clone()) - .await?; - let actor_dataset = - crate::instrumentation::open_dataset_tracked(&graph_commit_actors_uri(root), wrapper) - .await - .ok(); - // RFC-013 step 4: source the in-memory cache from the `__manifest` - // lineage projection (which carries the actor inline), not from - // `_graph_commits.lance`. The dataset handles above are retained for the - // branch-management ops (create/delete/list/version) that still target - // the commit-graph dataset; the actor dataset is only kept for the - // dual-write append path. The projection-equivalence gate proves this - // cache equals the prior `_graph_commits.lance` read. A pre-Phase-7 (v3) - // graph not yet migrated falls back to the legacy read β€” see - // `load_commit_cache_for_branch`. 
- let (commit_by_id, head_commit) = load_commit_cache_for_branch(root, None).await?; + let dataset = Dataset::open(&graph_commits_uri(root)) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + let actor_dataset = Dataset::open(&graph_commit_actors_uri(root)).await.ok(); + let actor_by_commit_id = match &actor_dataset { + Some(dataset) => load_commit_actor_cache(dataset).await?, + None => HashMap::new(), + }; + let (commit_by_id, head_commit) = load_commit_cache(&dataset, &actor_by_commit_id).await?; Ok(Self { root_uri: root.to_string(), - // `open` targets main and never checks out a branch (main cannot be - // deleted/recreated), so the handle is always present here. - dataset: Some(dataset), + dataset, actor_dataset, active_branch: None, - actor_by_commit_id: HashMap::new(), + actor_by_commit_id, commit_by_id, head_commit, }) @@ -132,37 +99,25 @@ impl CommitGraph { pub async fn open_at_branch(root_uri: &str, branch: &str) -> Result { let root = root_uri.trim_end_matches('/'); - let wrapper = crate::instrumentation::commit_graph_wrapper(); - let dataset = - crate::instrumentation::open_dataset_tracked(&graph_commits_uri(root), wrapper.clone()) - .await?; - // Best-effort checkout of the DERIVED `_graph_commits.lance` branch ref. - // It is held only for `create_branch` (a write); the lineage READ below - // comes from `__manifest`. A missing ref (interrupted fork-reclaim / - // `cleanup` race) must not wedge the read, so a typed not-found yields a - // `None` handle β€” a subsequent `create_branch` then surfaces the loud - // error. Any OTHER open error (transient IO / corrupt) still propagates, - // matching the `force_delete_branch` / `read_legacy_commit_cache` idiom. - let dataset = match dataset.checkout_branch(branch).await { - Ok(ds) => Some(ds), - Err(lance::Error::RefNotFound { .. }) | Err(lance::Error::NotFound { .. 
}) => None, - Err(e) => return Err(OmniError::Lance(e.to_string())), + let dataset = Dataset::open(&graph_commits_uri(root)) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + let dataset = dataset + .checkout_branch(branch) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + let actor_dataset = Dataset::open(&graph_commit_actors_uri(root)).await.ok(); + let actor_by_commit_id = match &actor_dataset { + Some(dataset) => load_commit_actor_cache(dataset).await?, + None => HashMap::new(), }; - let actor_dataset = - crate::instrumentation::open_dataset_tracked(&graph_commit_actors_uri(root), wrapper) - .await - .ok(); - // Hard `?`: the manifest existence gate. `load_commit_cache_for_branch` - // opens the branch's `__manifest` (its own `checkout_branch` on the - // authoritative table), so a TRULY absent branch still fails loudly here β€” - // only the derived `_graph_commits.lance` ref is allowed to be missing. - let (commit_by_id, head_commit) = load_commit_cache_for_branch(root, Some(branch)).await?; + let (commit_by_id, head_commit) = load_commit_cache(&dataset, &actor_by_commit_id).await?; Ok(Self { root_uri: root.to_string(), dataset, actor_dataset, active_branch: Some(branch.to_string()), - actor_by_commit_id: HashMap::new(), + actor_by_commit_id, commit_by_id, head_commit, }) @@ -170,50 +125,35 @@ impl CommitGraph { pub async fn refresh(&mut self) -> Result<()> { let root = self.root_uri.clone(); - let wrapper = crate::instrumentation::commit_graph_wrapper(); - let dataset = crate::instrumentation::open_dataset_tracked( - &graph_commits_uri(&root), - wrapper.clone(), - ) - .await?; - // Same best-effort checkout as `open_at_branch`: a missing DERIVED branch - // ref leaves the handle `None` (only `create_branch` then errors), while - // the in-memory cache re-syncs from the authoritative manifest below. 
- self.dataset = match &self.active_branch { - Some(branch) => match dataset.checkout_branch(branch).await { - Ok(ds) => Some(ds), - Err(lance::Error::RefNotFound { .. }) | Err(lance::Error::NotFound { .. }) => None, - Err(e) => return Err(OmniError::Lance(e.to_string())), - }, - None => Some(dataset), - }; - self.actor_dataset = - crate::instrumentation::open_dataset_tracked(&graph_commit_actors_uri(&root), wrapper) + self.dataset = Dataset::open(&graph_commits_uri(&root)) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + if let Some(branch) = &self.active_branch { + self.dataset = self + .dataset + .checkout_branch(branch) .await - .ok(); + .map_err(|e| OmniError::Lance(e.to_string()))?; + } + self.actor_dataset = Dataset::open(&graph_commit_actors_uri(&root)).await.ok(); + self.actor_by_commit_id = match &self.actor_dataset { + Some(dataset) => load_commit_actor_cache(dataset).await?, + None => HashMap::new(), + }; let (commit_by_id, head_commit) = - load_commit_cache_for_branch(&root, self.active_branch.as_deref()).await?; + load_commit_cache(&self.dataset, &self.actor_by_commit_id).await?; self.commit_by_id = commit_by_id; self.head_commit = head_commit; Ok(()) } + pub fn version(&self) -> u64 { + self.dataset.version().version + } + pub async fn create_branch(&mut self, name: &str) -> Result<()> { - // The held `_graph_commits.lance` handle is the only thing that can fork a - // branch ref. If it is missing (an interrupted fork-reclaim or a `cleanup` - // race dropped the derived ref while manifest lineage stayed authoritative), - // fail loudly + actionably rather than silently. Repair is the existing - // `cleanup` orphan reconciler (`reconcile_commit_graph_orphans`), not an - // inline write on this path. 
- let Some(dataset) = &self.dataset else { - let branch = self.active_branch.as_deref().unwrap_or("main"); - return Err(OmniError::manifest_internal(format!( - "commit-graph branch ref for '{branch}' is missing; run `omnigraph cleanup` then retry" - ))); - }; - let version = dataset.version().version; - let mut ds = dataset.clone(); - ds.create_branch(name, version, None) + let mut ds = self.dataset.clone(); + ds.create_branch(name, self.version(), None) .await .map_err(|e| OmniError::Lance(e.to_string()))?; Ok(()) @@ -229,48 +169,7 @@ impl CommitGraph { self.refresh().await } - /// Idempotently drop the commit-graph branch `name`, tolerating an - /// already-absent branch (see [`TableStore::force_delete_branch`] for the - /// same semantics). Used by the best-effort reclaim in `branch_delete` and - /// the `cleanup` orphan reconciler. `RefConflict` (referencing descendants) - /// is still surfaced. - pub async fn force_delete_branch(&mut self, name: &str) -> Result<()> { - let mut ds = Dataset::open(&graph_commits_uri(&self.root_uri)) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - match ds.force_delete_branch(name).await { - Ok(()) => {} - Err(lance::Error::RefNotFound { .. }) | Err(lance::Error::NotFound { .. }) => {} - Err(e) => return Err(OmniError::Lance(e.to_string())), - } - self.refresh().await - } - - /// List the named branches present on the commit-graph dataset. The - /// `cleanup` reconciler diffs this against the manifest branch set to find - /// orphaned commit-graph branches to reclaim. 
- pub async fn list_branches(&self) -> Result> { - let ds = Dataset::open(&graph_commits_uri(&self.root_uri)) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - let branches = ds - .list_branches() - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - Ok(branches.into_keys().collect()) - } - - // DEAD as of RFC-013 Phase 7: graph commits are recorded in `__manifest` - // (folded into the publish CAS), never appended to `_graph_commits.lance`. - // These append helpers are retained only because the actor sidecar table they - // touch is still enumerated by `optimize` (internal-table compaction); they - // have no caller on any write path. The single-source invariant is guarded by - // `tests/lineage_projection.rs`, which fails if `_graph_commits.lance` ever - // gains a commit row. Do NOT call these to record a commit β€” use the - // coordinator's `commit_*_with_actor` / `commit_merge_with_actor`, which carry - // the lineage intent into the manifest publish. - #[allow(dead_code)] - async fn append_commit( + pub async fn append_commit( &mut self, manifest_branch: Option<&str>, manifest_version: u64, @@ -287,8 +186,7 @@ impl CommitGraph { .await } - #[allow(dead_code)] - async fn append_merge_commit( + pub async fn append_merge_commit( &mut self, manifest_branch: Option<&str>, manifest_version: u64, @@ -306,7 +204,6 @@ impl CommitGraph { .await } - #[allow(dead_code)] async fn append_commit_with_parents( &mut self, manifest_branch: Option<&str>, @@ -323,22 +220,16 @@ impl CommitGraph { parent_commit_id: parent_commit_id.map(|s| s.to_string()), merged_parent_commit_id: merged_parent_commit_id.map(|s| s.to_string()), actor_id: actor_id.map(str::to_string), - created_at: crate::db::now_micros()?, + created_at: now_micros()?, }; let batch = commits_to_batch(&[commit.clone()])?; let reader = RecordBatchIterator::new(vec![Ok(batch)], commit_graph_schema()); - // This helper is dead on every write path (RFC-013 Phase 7) β€” reached only - // by the 
transitional v3 fixtures, which always hold the commits dataset. - // A `None` here would be a fixture bug, so fail loudly rather than silently. - let mut ds = self - .dataset - .clone() - .ok_or_else(|| OmniError::manifest_internal("commit-graph dataset is missing"))?; + let mut ds = self.dataset.clone(); ds.append(reader, None) .await .map_err(|e| OmniError::Lance(e.to_string()))?; - self.dataset = Some(ds); + self.dataset = ds; if let Some(actor_id) = actor_id { self.append_actor(&graph_commit_id, actor_id).await?; } @@ -351,7 +242,6 @@ impl CommitGraph { Ok(graph_commit_id) } - #[allow(dead_code)] // RFC-013 Phase 7: dead β€” see `append_commit`. async fn append_actor(&mut self, graph_commit_id: &str, actor_id: &str) -> Result<()> { if self .actor_by_commit_id @@ -364,7 +254,7 @@ impl CommitGraph { let record = CommitActorRecord { graph_commit_id: graph_commit_id.to_string(), actor_id: actor_id.to_string(), - created_at: crate::db::now_micros()?, + created_at: now_micros()?, }; let batch = commit_actors_to_batch(&[record])?; let reader = RecordBatchIterator::new(vec![Ok(batch)], commit_actor_schema()); @@ -455,11 +345,11 @@ impl CommitGraph { } } -pub(crate) fn graph_commits_uri(root_uri: &str) -> String { +fn graph_commits_uri(root_uri: &str) -> String { format!("{}/{}", root_uri.trim_end_matches('/'), GRAPH_COMMITS_DIR) } -pub(crate) fn graph_commit_actors_uri(root_uri: &str) -> String { +fn graph_commit_actors_uri(root_uri: &str) -> String { format!( "{}/{}", root_uri.trim_end_matches('/'), @@ -509,18 +399,11 @@ async fn create_commit_actor_dataset(root_uri: &str) -> Result { mode: WriteMode::Create, enable_stable_row_ids: true, data_storage_version: Some(LanceFileVersion::V2_2), - auto_cleanup: None, - skip_auto_cleanup: true, ..Default::default() }; match Dataset::write(reader, &uri as &str, Some(params)).await { Ok(dataset) => Ok(dataset), - // Create-or-open idempotency: a concurrent/prior create raced us. 
Match - // the typed `DatasetAlreadyExists` variant, not the display string β€” the - // message is not a Lance API contract (a wording change would silently - // break this fallback). Pinned by - // `lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists`. - Err(lance::Error::DatasetAlreadyExists { .. }) => Dataset::open(&uri) + Err(err) if err.to_string().contains("Dataset already exists") => Dataset::open(&uri) .await .map_err(|open_err| OmniError::Lance(open_err.to_string())), Err(err) => Err(OmniError::Lance(err.to_string())), @@ -558,156 +441,6 @@ fn commits_to_batch(commits: &[GraphCommit]) -> Result { .map_err(|e| OmniError::Lance(e.to_string())) } -/// Build the in-memory commit cache for `branch`, choosing the source by the -/// branch manifest's internal-schema stamp (RFC-013 step 4 forward/back-compat): -/// -/// - stamp β‰₯ v4 (post-Phase-7, the normal case): the `__manifest` lineage -/// projection β€” `graph_commit`/`graph_head` rows folded into the publish CAS. -/// - stamp < v4 (a pre-Phase-7 graph not yet migrated): the legacy -/// `_graph_commits.lance` read. This is the **transitional v3 fallback** that -/// lets a READ-ONLY open of an un-migrated graph still see correct history β€” -/// a read-only open never runs the v3β†’v4 backfill (it must not write), so -/// without this gate it would read an empty DAG from `__manifest`. A -/// read-write open backfills `__manifest` on its first write and thereafter -/// takes the projection branch. -/// -/// Both sources pick the head with `should_replace_head`, so the cache is -/// identical regardless of which branch is taken. Remove the fallback once no -/// graph below internal-schema v4 remains. 
-async fn load_commit_cache_for_branch( - root_uri: &str, - branch: Option<&str>, -) -> Result<(HashMap, Option)> { - let stamp = crate::db::manifest::internal_schema_stamp_at(root_uri, branch).await?; - // Defense-in-depth: refuse a branch whose stamp this binary cannot serve β€” - // newer than CURRENT, or below MIN_SUPPORTED β€” for the same reason the main - // read path does (`refuse_if_internal_schema_unsupported`). A `> CURRENT` stamp - // means a newer binary wrote a shape we can't read, so the projection below - // would misread it; a `< MIN` stamp predates the legacy readers this binary - // still carries. Not a live hole today: migrations run main-first - // (`migrate_on_open` migrates main; each branch migrates on its own first - // write), so main's stamp bounds every branch's and the main read path already - // refuses first. The guard closes the gap if that ordering is ever weakened. - crate::db::manifest::refuse_if_stamp_unsupported(stamp)?; - if stamp < crate::db::manifest::INTERNAL_MANIFEST_SCHEMA_VERSION { - // Transitional: un-migrated v3 graph β€” read lineage from the legacy - // `_graph_commits.lance` so reads (incl. read-only opens) see history. - return read_legacy_commit_cache(root_uri, branch).await; - } - load_commit_cache_from_manifest(root_uri, branch).await -} - -/// Build the in-memory commit cache from the `__manifest` graph-lineage -/// projection (RFC-013 step 4) rather than `_graph_commits.lance`. The lineage -/// rows carry the actor inline, so no separate actor-table read is needed. Head -/// selection is identical to [`load_commit_cache`] (`should_replace_head`), so -/// the resulting cache is equivalent to the prior `_graph_commits.lance` read. 
-async fn load_commit_cache_from_manifest( - root_uri: &str, - branch: Option<&str>, -) -> Result<(HashMap, Option)> { - let (rows, _heads) = - crate::db::manifest::ManifestCoordinator::read_graph_lineage_at(root_uri, branch).await?; - let mut commit_by_id = HashMap::with_capacity(rows.len()); - let mut head_commit = None; - for row in rows { - let commit = GraphCommit { - graph_commit_id: row.graph_commit_id, - manifest_branch: row.manifest_branch, - manifest_version: row.manifest_version, - parent_commit_id: row.parent_commit_id, - merged_parent_commit_id: row.merged_parent_commit_id, - actor_id: row.actor_id, - created_at: row.created_at, - }; - if should_replace_head(head_commit.as_ref(), &commit) { - head_commit = Some(commit.clone()); - } - commit_by_id.insert(commit.graph_commit_id.clone(), commit); - } - Ok((commit_by_id, head_commit)) -} - -/// Read the legacy `_graph_commits.lance` (+ its actor sidecar) for `branch` -/// into an in-memory cache β€” the transitional source for graphs not yet -/// migrated to internal-schema v4 (RFC-013 step 4). Two callers, both -/// transitional: the v3β†’v4 migration backfill (which copies these rows into -/// `__manifest`) and the read-only v3 fallback in `CommitGraph::open*`. Returns -/// `(commit_by_id, head)`, with the head picked by `should_replace_head` β€” -/// identical to the manifest projection. A genuinely ABSENT (not-found) commit -/// dataset or actor sidecar yields an empty cache (no head); any OTHER open error -/// (transient IO / corrupt file) propagates loudly rather than being read as -/// "empty" β€” a swallow here would let the v3β†’v4 migration backfill nothing and -/// still stamp v4, orphaning the real lineage permanently. This keeps the legacy -/// readers alive while any v3 graph survives; once no graph is below v4 it can -/// retire. 
-pub(crate) async fn read_legacy_commit_cache( - root_uri: &str, - branch: Option<&str>, -) -> Result<(HashMap, Option)> { - let root = root_uri.trim_end_matches('/'); - let commits_uri = graph_commits_uri(root); - let commits_open = match crate::failpoints::maybe_fail_lance_open("migration.v3_to_v4.legacy_open") - { - Ok(()) => Dataset::open(&commits_uri).await, - Err(injected) => Err(injected), - }; - let mut dataset = match commits_open { - Ok(dataset) => dataset, - // An ABSENT commits dataset is the legitimate "no legacy data" signal β€” - // a graph with no `_graph_commits.lance` (or none on this branch) yields - // an empty cache. But ONLY a genuine not-found gets that treatment: a - // transient/corrupt open (IO / CorruptFile / …) must propagate, never be - // read as "empty". The v3β†’v4 migration calls this once before stamping - // v4; swallowing a non-not-found error here would backfill nothing and - // stamp v4 anyway, orphaning the real lineage permanently (the migration - // never re-runs, and the v3 fallback is then disabled). Lance maps an - // object-store NotFound to `DatasetNotFound`; the variant match (vs an - // existence probe) is exactly right and not over-strict β€” pinned by - // `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant`. - Err(lance::Error::DatasetNotFound { .. }) | Err(lance::Error::NotFound { .. }) => { - return Ok((HashMap::new(), None)); - } - Err(e) => return Err(OmniError::Lance(e.to_string())), - }; - if let Some(branch) = branch.filter(|b| *b != "main") { - dataset = dataset - .checkout_branch(branch) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - } - - // The actor sidecar may be absent (older graphs without authored commits); - // an empty actor map then leaves every commit's actor `None`. 
It is read - // FLAT (no branch checkout): the pre-Phase-7 commit graph never forked the - // actor dataset β€” actors are keyed by `graph_commit_id` globally β€” so a - // branch's commits resolve their actor from the same single actor table. - // This matches the live `CommitGraph::open_at_branch`, which also opens the - // actor dataset on main while checking out the branch only on the commits - // dataset. - let actors_open = - match crate::failpoints::maybe_fail_lance_open("migration.v3_to_v4.legacy_open") { - Ok(()) => Dataset::open(&graph_commit_actors_uri(root)).await, - Err(injected) => Err(injected), - }; - let actor_by_commit_id = match actors_open { - Ok(actor_dataset) => load_commit_actor_cache(&actor_dataset).await?, - // An ABSENT actor sidecar is benign (older graphs without authored - // commits) β€” every commit's actor stays `None`. A not-found is therefore - // the empty-map signal. But a CORRUPT/transient actor open must NOT be - // read as "no authors": silently wiping all authorship and then stamping - // v4 is the same permanent-loss hole as the commits arm, so anything - // other than not-found propagates. (Same variant contract, different - // rationale β€” absence is normal here, error is not.) - Err(lance::Error::DatasetNotFound { .. }) | Err(lance::Error::NotFound { .. }) => { - HashMap::new() - } - Err(e) => return Err(OmniError::Lance(e.to_string())), - }; - - load_commit_cache(&dataset, &actor_by_commit_id).await -} - async fn load_commit_cache( dataset: &Dataset, actor_by_commit_id: &HashMap, @@ -912,170 +645,11 @@ async fn open_for_branch(root_uri: &str, branch: Option<&str>) -> Result, -} - -/// Build a synthetic pre-Phase-7 (internal-schema v3) graph at `root_uri`: graph -/// lineage lives ONLY in `_graph_commits.lance` (+ its actor sidecar), `__manifest` -/// carries NO `graph_commit`/`graph_head` rows, and the stamp is set to v3. 
This -/// reproduces exactly the on-disk shape a graph created by a pre-RFC-013-Phase-7 -/// binary would have, so the v3→v4 migration and the v3-read fallback can be -/// tested against it. -/// -/// The lineage is a realistic DAG with a branch + a real merge: genesis → A → -/// (feature commit, off to the side) → merge(A, feature) at the head of main, -/// with authored actors on the non-genesis commits. Reaches the dead-on-the- -/// write-path `append_commit_with_parents`/`append_actor` (still present for -/// exactly this transitional purpose) to write the legacy rows. -#[cfg(any(test, feature = "failpoints"))] -pub async fn seed_legacy_v3_lineage(root_uri: &str) -> Result { - let root = root_uri.trim_end_matches('/'); - - // 1. Create `__manifest` (Phase-7 folds genesis lineage into it) and the - // EMPTY legacy `_graph_commits.lance`. We then append the v3-style commit - // rows below -- a real v3 graph carried its genesis in `_graph_commits`. - crate::db::manifest::seed_manifest_for_v3_fixture(root).await?; - let mut cg = CommitGraph::init(root).await?; - // Clear the cache that init seeded from the (genesis-bearing) manifest, so - // the appended rows below are the whole story and parents come out right. - cg.commit_by_id.clear(); - cg.head_commit = None; - - // 2. Append the legacy lineage to `_graph_commits.lance` on main. - let genesis = cg - .append_commit_with_parents(None, 1, None, None, None) - .await?; - let commit_a = cg - .append_commit_with_parents(None, 2, Some(&genesis), None, Some("act-a")) - .await?; - let feature_commit = cg - .append_commit_with_parents(Some("feature"), 3, Some(&commit_a), None, Some("act-feature")) - .await?; - let merge_commit = cg - .append_commit_with_parents( - None, - 4, - Some(&commit_a), - Some(&feature_commit), - Some("act-merger"), - ) - .await?; - - // 3.
Strip the genesis lineage rows the Phase-7 init folded into `__manifest` - // and rewind the stamp to v3, so the manifest matches a true pre-Phase-7 - // graph (no lineage in `__manifest`, stamp v3). - crate::db::manifest::strip_lineage_and_set_v3_stamp_for_fixture(root).await?; - - Ok(V3LineageFixture { - genesis: genesis.clone(), - commit_a: commit_a.clone(), - feature_commit: feature_commit.clone(), - merge_commit: merge_commit.clone(), - all_ids: vec![genesis, commit_a, feature_commit, merge_commit], - }) -} - -/// Identities of a synthetic pre-Phase-7 (v3) graph that carries a REAL Lance -/// branch (built by [`seed_legacy_v3_lineage_with_branch`]). -#[cfg(test)] -#[derive(Debug, Clone)] -pub struct V3BranchedLineageFixture { - /// The genesis (parentless) commit on main. - pub genesis: String, - /// A direct authored commit on main (actor `act-a`). The head of main. - pub commit_a: String, - /// A commit on the real `feature` Lance branch (actor `act-branch`), - /// parented off `commit_a`. The head of `feature`. - pub branch_commit: String, - /// The branch name forked on both `_graph_commits.lance` and `__manifest`. - pub branch: String, -} - -/// Build a synthetic pre-Phase-7 (internal-schema v3) graph at `root_uri` that -/// carries a REAL Lance branch `feature` on BOTH `_graph_commits.lance` and -/// `__manifest`, reproducing exactly the on-disk shape of a branched graph -/// created by a pre-RFC-013-Phase-7 binary: -/// -/// - `_graph_commits.lance`: main has `genesis → A`; the `feature` Lance branch -/// adds `branch_commit` (parent `A`). Authored actors land in the FLAT actor -/// sidecar (the pre-Phase-7 commit graph never forked the actor table). -/// - `__manifest`: main is stamped v3 with NO lineage rows; the `feature` branch -/// is forked from main's v3 state, so it too is v3 with NO lineage of its own.
-/// -/// This is the fixture the per-branch v3→v4 migration runs against: it lets a -/// test prove that migrating the `feature` branch reads the branch's legacy -/// lineage, writes it into the BRANCH's `__manifest`, and leaves main untouched -- -/// the case the main-only [`seed_legacy_v3_lineage`] cannot exercise. -#[cfg(test)] -pub async fn seed_legacy_v3_lineage_with_branch(root_uri: &str) -> Result { - let root = root_uri.trim_end_matches('/'); - - // 1. `__manifest` (genesis folded by Phase-7 init) + an empty legacy - // `_graph_commits.lance`. Clear the init-seeded cache so the rows we - // append below are the whole story. - crate::db::manifest::seed_manifest_for_v3_fixture(root).await?; - let mut cg = CommitGraph::init(root).await?; - cg.commit_by_id.clear(); - cg.head_commit = None; - - // 2. Main lineage on `_graph_commits.lance`: genesis → A (authored). - let genesis = cg - .append_commit_with_parents(None, 1, None, None, None) - .await?; - let commit_a = cg - .append_commit_with_parents(None, 2, Some(&genesis), None, Some("act-a")) - .await?; - - // 3. Fork a real `feature` Lance branch on `_graph_commits.lance`, switch the - // handle to it, and append an authored branch commit (its actor lands in - // the flat main actor table -- exactly the pre-Phase-7 shape). - cg.create_branch("feature").await?; - let commits_ds = cg - .dataset - .take() - .expect("commits dataset present after create_branch") - .checkout_branch("feature") - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - cg.dataset = Some(commits_ds); - cg.active_branch = Some("feature".to_string()); - let branch_commit = cg - .append_commit_with_parents(Some("feature"), 3, Some(&commit_a), None, Some("act-branch")) - .await?; - - // 4. Rewind main's `__manifest` to the v3 shape (strip the folded genesis - // lineage, set stamp 3) BEFORE forking -- so the `feature` manifest branch - // inherits the stripped v3 state (no lineage, stamp 3).
- crate::db::manifest::strip_lineage_and_set_v3_stamp_for_fixture(root).await?; - crate::db::manifest::fork_manifest_branch_for_v3_fixture(root, "feature").await?; - - Ok(V3BranchedLineageFixture { - genesis, - commit_a, - branch_commit, - branch: "feature".to_string(), - }) +fn now_micros() -> Result { + let duration = SystemTime::now() + .duration_since(UNIX_EPOCH) + .map_err(|e| OmniError::manifest(format!("system clock before UNIX_EPOCH: {}", e)))?; + Ok(duration.as_micros() as i64) } #[cfg(test)] @@ -1086,83 +660,6 @@ mod tests { use super::*; - // RFC-013 step 4: the v3-read fallback / migration source reads a NAMED - // branch's lineage from a real Lance branch on `_graph_commits.lance`, while - // resolving actors from the FLAT actor table (the pre-Phase-7 commit graph - // forked only the commits dataset, never the actor sidecar). This guards - // both that branch-checkout path and the flat-actor resolution -- the case - // the main-branch fixture (commits on main only) does not exercise. - #[tokio::test] - async fn read_legacy_commit_cache_resolves_branch_commits_with_flat_actors() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - - // A v3 graph needs `__manifest` to exist for `CommitGraph::init`'s - // genesis-cache seed; we clear that cache and write our own legacy rows. - crate::db::manifest::seed_manifest_for_v3_fixture(uri) - .await - .unwrap(); - let mut cg = CommitGraph::init(uri).await.unwrap(); - cg.commit_by_id.clear(); - cg.head_commit = None; - - // Main lineage: genesis → A (authored). The actor lands in the FLAT - // `_graph_commit_actors.lance` (never branched).
- let genesis = cg - .append_commit_with_parents(None, 1, None, None, None) - .await - .unwrap(); - let commit_a = cg - .append_commit_with_parents(None, 2, Some(&genesis), None, Some("act-a")) - .await - .unwrap(); - - // Fork a real Lance branch on `_graph_commits.lance`, switch the handle - // to it, and append an authored branch commit (its actor also goes to - // the flat main actor table -- exactly the pre-Phase-7 shape). - cg.create_branch("feature").await.unwrap(); - cg.dataset = Some( - cg.dataset - .take() - .unwrap() - .checkout_branch("feature") - .await - .unwrap(), - ); - cg.active_branch = Some("feature".to_string()); - let branch_commit = cg - .append_commit_with_parents( - Some("feature"), - 3, - Some(&commit_a), - None, - Some("act-branch"), - ) - .await - .unwrap(); - - // The legacy read at the branch sees the inherited main commits + the - // branch commit, the head is the branch commit, and the authored actors - // resolve from the flat table (no branch checkout on the actor dataset).
- let (commits, head) = read_legacy_commit_cache(uri, Some("feature")).await.unwrap(); - assert_eq!(commits.len(), 3, "branch inherits genesis + A + its own commit"); - assert_eq!( - head.as_ref().unwrap().graph_commit_id, - branch_commit, - "the branch commit is the head" - ); - assert_eq!( - commits.get(&commit_a).unwrap().actor_id.as_deref(), - Some("act-a"), - "main commit's actor resolves from the flat actor table", - ); - assert_eq!( - commits.get(&branch_commit).unwrap().actor_id.as_deref(), - Some("act-branch"), - "branch commit's actor resolves from the flat actor table", - ); - } - #[test] fn load_commits_from_batches_returns_error_for_bad_schema() { let batch = RecordBatch::try_new( diff --git a/crates/omnigraph/src/db/graph_coordinator.rs b/crates/omnigraph/src/db/graph_coordinator.rs index aff791d..a721036 100644 --- a/crates/omnigraph/src/db/graph_coordinator.rs +++ b/crates/omnigraph/src/db/graph_coordinator.rs @@ -10,9 +10,7 @@ use crate::storage::{StorageAdapter, join_uri, normalize_root_uri}; use super::commit_graph::{CommitGraph, GraphCommit}; use super::is_internal_system_branch; -use super::manifest::{ - ManifestChange, ManifestCoordinator, ManifestIncarnation, Snapshot, SubTableUpdate, -}; +use super::manifest::{ManifestChange, ManifestCoordinator, Snapshot, SubTableUpdate}; const GRAPH_COMMITS_DIR: &str = "_graph_commits.lance"; @@ -28,11 +26,10 @@ impl SnapshotId { &self.0 } - pub(crate) fn synthetic(branch: Option<&str>, version: u64, e_tag: Option<&str>) -> Self { - let branch = branch.unwrap_or("main"); - match e_tag { - Some(e_tag) => Self(format!("manifest:{}:v{}:etag:{}", branch, version, e_tag)), - None => Self(format!("manifest:{}:v{}", branch, version)), + pub(crate) fn synthetic(branch: Option<&str>, version: u64) -> Self { + match branch { + Some(branch) => Self(format!("manifest:{}:v{}", branch, version)), + None => Self(format!("manifest:main:v{}", version)), } } } @@ -106,17 +103,13 @@ impl GraphCoordinator { storage: Arc, ) -> 
Result { let root = normalize_root_uri(root_uri)?; - // The genesis graph commit is folded into the manifest init write, so - // `__manifest` is the single source of graph lineage from version one - // (RFC-013 Phase 7). `CommitGraph::init` then creates the empty - // branch-ref dataset and seeds its cache from that manifest genesis. let manifest = ManifestCoordinator::init(&root, catalog).await?; - let commit_graph = CommitGraph::init(&root).await?; + let commit_graph = Some(CommitGraph::init(&root, manifest.version()).await?); Ok(Self { root_uri: root, storage, manifest, - commit_graph: Some(commit_graph), + commit_graph, bound_branch: None, }) } @@ -173,10 +166,6 @@ impl GraphCoordinator { self.manifest.version() } - pub(crate) fn manifest_incarnation(&self) -> ManifestIncarnation { - self.manifest.incarnation() - } - pub fn snapshot(&self) -> Snapshot { self.manifest.snapshot() } @@ -193,19 +182,6 @@ impl GraphCoordinator { Ok(()) } - pub(crate) async fn probe_latest_incarnation(&self) -> Result { - crate::instrumentation::record_probe(); - self.manifest.probe_latest_incarnation().await - } - - /// Refresh only the manifest (not the commit graph). The read path uses this - /// on a stale same-branch probe: a read pins its snapshot by manifest version - /// and never needs the commit graph, so a full `refresh` (which also scans - /// the commit graph) would be wasted IO. - pub async fn refresh_manifest_only(&mut self) -> Result<()> { - self.manifest.refresh().await - } - pub async fn branch_list(&self) -> Result> { self.manifest.list_branches().await.map(|branches| { branches @@ -235,47 +211,14 @@ impl GraphCoordinator { let branch = normalize_branch_name(name)? .ok_or_else(|| OmniError::manifest("cannot create branch 'main'".to_string()))?; self.ensure_commit_graph_initialized().await?; - - // Manifest authority flip first. self.manifest.create_branch(&branch).await?; - - // Derived commit-graph branch. 
If anything after the authority flip - // fails, roll back the manifest branch so the branch never half-exists - // (a manifest branch with no commit-graph branch breaks the next write). - if let Err(err) = self.create_commit_graph_branch(&branch).await { - if let Err(rollback_err) = self.manifest.delete_branch(&branch).await { - tracing::warn!( - target: "omnigraph::branch_create", - branch = %branch, - error = %rollback_err, - "rollback of manifest branch failed after commit-graph create failure", - ); - } - return Err(err); + failpoints::maybe_fail("branch_create.after_manifest_branch_create")?; + if let Some(commit_graph) = &mut self.commit_graph { + commit_graph.create_branch(&branch).await?; } Ok(()) } - /// Create the derived commit-graph branch for `branch`, healing a zombie ref - /// left by an incomplete prior delete. The manifest branch was just created - /// fresh, so any existing commit-graph branch with this name is provably - /// orphaned and is force-dropped before recreating. - async fn create_commit_graph_branch(&mut self, branch: &str) -> Result<()> { - failpoints::maybe_fail(crate::failpoints::names::BRANCH_CREATE_AFTER_MANIFEST_BRANCH_CREATE)?; - let Some(commit_graph) = &mut self.commit_graph else { - return Ok(()); - }; - if commit_graph - .list_branches() - .await? - .iter() - .any(|existing| existing == branch) - { - commit_graph.force_delete_branch(branch).await?; - } - commit_graph.create_branch(branch).await - } - pub async fn branch_delete(&mut self, name: &str) -> Result<()> { let branch = normalize_branch_name(name)? .ok_or_else(|| OmniError::manifest("cannot delete branch 'main'".to_string()))?; @@ -286,43 +229,20 @@ impl GraphCoordinator { ))); } - // Manifest authority flip -- the single atomic op that makes the branch - // cease to exist. Must succeed; everything after is derived state - // reclaimed best-effort. self.manifest.delete_branch(&branch).await?; - // Commit-graph branch is derived state.
Reclaim best-effort with the - // idempotent force variant: a failure here (or a missing dataset) is - // reconciled by `cleanup` and must not fail the delete after the - // authority already flipped. - if let Err(err) = self.reclaim_commit_graph_branch(&branch).await { - tracing::warn!( - target: "omnigraph::branch_delete::cleanup", - branch = %branch, - error = %err, - "best-effort commit-graph branch reclaim failed; cleanup will reconcile", - ); - } - - Ok(()) - } - - /// Best-effort, idempotent reclaim of the commit-graph branch `branch`. - /// Tolerates an absent commit-graph dataset (a graph that never committed). - async fn reclaim_commit_graph_branch(&mut self, branch: &str) -> Result<()> { - failpoints::maybe_fail(crate::failpoints::names::BRANCH_DELETE_BEFORE_COMMIT_GRAPH_RECLAIM)?; if let Some(commit_graph) = &mut self.commit_graph { - commit_graph.force_delete_branch(branch).await + commit_graph.delete_branch(&branch).await?; } else if self .storage .exists(&graph_commits_uri(self.root_uri())) .await? { let mut commit_graph = CommitGraph::open(self.root_uri()).await?; - commit_graph.force_delete_branch(branch).await - } else { - Ok(()) + commit_graph.delete_branch(&branch).await?; } + + Ok(()) } pub async fn snapshot_at_version(&self, version: u64) -> Result { @@ -339,13 +259,10 @@ impl GraphCoordinator { None => GraphCoordinator::open(self.root_uri(), Arc::clone(&self.storage)).await?, }; - Ok(other.head_commit_id().await?.unwrap_or_else(|| { - SnapshotId::synthetic( - other.current_branch(), - other.version(), - other.manifest_incarnation().e_tag.as_deref(), - ) - })) + Ok(other + .head_commit_id() + .await? 
+ .unwrap_or_else(|| SnapshotId::synthetic(other.current_branch(), other.version()))) } pub async fn resolve_target(&self, target: &ReadTarget) -> Result { @@ -366,11 +283,7 @@ impl GraphCoordinator { } }; let snapshot_id = other.head_commit_id().await?.unwrap_or_else(|| { - SnapshotId::synthetic( - other.current_branch(), - other.version(), - other.manifest_incarnation().e_tag.as_deref(), - ) + SnapshotId::synthetic(other.current_branch(), other.version()) }); Ok(ResolvedTarget { requested: target.clone(), @@ -442,12 +355,7 @@ impl GraphCoordinator { .exists(&graph_commits_uri(self.root_uri())) .await? { - // A graph opened without a commit-graph dataset gets the empty - // branch-ref dataset created lazily here. Graph lineage lives in - // `__manifest` (RFC-013 Phase 7) -- a graph initialized by current - // code already carries its genesis there, and the commit graph - // sources its cache from it. No genesis is written here. - CommitGraph::init(self.root_uri()).await?; + let _ = CommitGraph::init(self.root_uri(), self.manifest.version()).await?; } self.commit_graph = match self.current_branch() { Some(branch) => Some(CommitGraph::open_at_branch(self.root_uri(), branch).await?), @@ -461,8 +369,12 @@ impl GraphCoordinator { updates: &[SubTableUpdate], actor_id: Option<&str>, ) -> Result { - self.commit_updates_with_actor_with_expected(updates, &HashMap::new(), actor_id) - .await + let manifest_version = self.commit_manifest_updates(updates).await?; + let snapshot_id = self.record_graph_commit(manifest_version, actor_id).await?; + Ok(PublishedSnapshot { + manifest_version, + _snapshot_id: snapshot_id, + }) } /// Commit with publisher-level OCC fence.
The `expected_table_versions` map @@ -476,9 +388,45 @@ impl GraphCoordinator { expected_table_versions: &HashMap, actor_id: Option<&str>, ) -> Result { - let changes = updates_to_changes(updates); - self.commit_changes_with_actor_with_expected(&changes, expected_table_versions, actor_id) - .await + let manifest_version = self + .commit_manifest_updates_with_expected(updates, expected_table_versions) + .await?; + let snapshot_id = self.record_graph_commit(manifest_version, actor_id).await?; + Ok(PublishedSnapshot { + manifest_version, + _snapshot_id: snapshot_id, + }) + } + + pub(crate) async fn commit_manifest_updates( + &mut self, + updates: &[SubTableUpdate], + ) -> Result { + let manifest_version = self.manifest.commit(updates).await?; + failpoints::maybe_fail("graph_publish.after_manifest_commit")?; + Ok(manifest_version) + } + + pub(crate) async fn commit_manifest_updates_with_expected( + &mut self, + updates: &[SubTableUpdate], + expected_table_versions: &HashMap, + ) -> Result { + let manifest_version = self + .manifest + .commit_with_expected(updates, expected_table_versions) + .await?; + failpoints::maybe_fail("graph_publish.after_manifest_commit")?; + Ok(manifest_version) + } + + pub(crate) async fn commit_manifest_changes( + &mut self, + changes: &[ManifestChange], + ) -> Result { + let manifest_version = self.manifest.commit_changes(changes).await?; + failpoints::maybe_fail("graph_publish.after_manifest_commit")?; + Ok(manifest_version) } pub(crate) async fn commit_changes_with_actor( @@ -486,110 +434,57 @@ impl GraphCoordinator { changes: &[ManifestChange], actor_id: Option<&str>, ) -> Result { - self.commit_changes_with_actor_with_expected(changes, &HashMap::new(), actor_id) - .await - } - - /// Publish `changes` and record one graph commit in the SAME manifest CAS - /// (RFC-013 Phase 7). 
The lineage intent (a freshly minted commit id, the - /// branch, the actor) rides the publish so the `graph_commit` + `graph_head` - /// rows land atomically with the table-version rows -- one manifest version, - /// no separate write, no `commit_graph.refresh()` to pick a parent (the - /// publisher resolves it under the CAS). The in-memory commit cache is then - /// updated from the intent + the resolved parent without a re-read. - async fn commit_changes_with_actor_with_expected( - &mut self, - changes: &[ManifestChange], - expected_table_versions: &HashMap, - actor_id: Option<&str>, - ) -> Result { - self.ensure_commit_graph_initialized().await?; - let intent = self.new_lineage_intent(actor_id, None)?; - failpoints::maybe_fail(crate::failpoints::names::GRAPH_PUBLISH_BEFORE_COMMIT_APPEND)?; - let outcome = self - .manifest - .commit_changes_with_lineage(changes, expected_table_versions, Some(&intent)) - .await?; - failpoints::maybe_fail(crate::failpoints::names::GRAPH_PUBLISH_AFTER_MANIFEST_COMMIT)?; - let snapshot_id = self.apply_lineage_to_cache(intent, &outcome); + let manifest_version = self.commit_manifest_changes(changes).await?; + let snapshot_id = self.record_graph_commit(manifest_version, actor_id).await?; Ok(PublishedSnapshot { - manifest_version: outcome.version, + manifest_version, _snapshot_id: snapshot_id, }) } - /// Publish a branch-merge: `updates` (the merged table versions) plus the - /// merge commit, in one manifest CAS (RFC-013 Phase 7). The merge commit's - /// merged-in parent is `merged_parent_commit_id` (the source head, stable); - /// its first parent is resolved by the publisher as the current target-branch - /// head -- the live head, which is the post-merge correct parent even if the - /// target advanced since the merge began.
- pub(crate) async fn commit_merge_with_actor( + pub(crate) async fn record_graph_commit( &mut self, - updates: &[SubTableUpdate], + manifest_version: u64, + actor_id: Option<&str>, + ) -> Result { + self.ensure_commit_graph_initialized().await?; + let current_branch = self.current_branch().map(str::to_string); + let Some(commit_graph) = &mut self.commit_graph else { + return Ok(SnapshotId::synthetic( + current_branch.as_deref(), + manifest_version, + )); + }; + failpoints::maybe_fail("graph_publish.before_commit_append")?; + let graph_commit_id = commit_graph + .append_commit(current_branch.as_deref(), manifest_version, actor_id) + .await?; + Ok(SnapshotId::new(graph_commit_id)) + } + + pub(crate) async fn record_merge_commit( + &mut self, + manifest_version: u64, + parent_commit_id: &str, merged_parent_commit_id: &str, actor_id: Option<&str>, ) -> Result { self.ensure_commit_graph_initialized().await?; - let intent = - self.new_lineage_intent(actor_id, Some(merged_parent_commit_id.to_string()))?; - failpoints::maybe_fail(crate::failpoints::names::GRAPH_PUBLISH_BEFORE_COMMIT_APPEND)?; - let changes = updates_to_changes(updates); - let outcome = self - .manifest - .commit_changes_with_lineage(&changes, &HashMap::new(), Some(&intent)) + let current_branch = self.current_branch().map(str::to_string); + let commit_graph = self.commit_graph.as_mut().ok_or_else(|| { + OmniError::manifest("branch merge requires _graph_commits.lance".to_string()) + })?; + failpoints::maybe_fail("graph_publish.before_commit_append")?; + let graph_commit_id = commit_graph + .append_merge_commit( + current_branch.as_deref(), + manifest_version, + parent_commit_id, + merged_parent_commit_id, + actor_id, + ) .await?; - failpoints::maybe_fail(crate::failpoints::names::GRAPH_PUBLISH_AFTER_MANIFEST_COMMIT)?; - Ok(self.apply_lineage_to_cache(intent, &outcome)) - } - - /// Mint a [`LineageIntent`] for the next commit on the current branch: a - /// fresh ULID (stable across the publisher's CAS 
retries) and a timestamp. - /// The parent is NOT chosen here -- the publisher resolves it per attempt - /// against the manifest it commits against. - fn new_lineage_intent( - &self, - actor_id: Option<&str>, - merged_parent_commit_id: Option, - ) -> Result { - Ok(crate::db::manifest::LineageIntent { - graph_commit_id: ulid::Ulid::new().to_string(), - branch: self.current_branch().map(str::to_string), - actor_id: actor_id.map(str::to_string), - merged_parent_commit_id, - created_at: crate::db::now_micros()?, - }) - } - - /// Insert the just-published commit into the in-memory commit cache from the - /// intent + the publisher-resolved parent + the new manifest version. No - /// storage I/O: the durable write already happened in the publish CAS, and - /// this keeps a same-handle read's `head_commit_id` consistent with the - /// snapshot it just advanced. Falls back to a synthetic id only when the - /// commit graph is somehow absent (never on a real write). - fn apply_lineage_to_cache( - &mut self, - intent: crate::db::manifest::LineageIntent, - outcome: &crate::db::manifest::CommitOutcome, - ) -> SnapshotId { - let Some(commit_graph) = &mut self.commit_graph else { - return SnapshotId::synthetic( - self.bound_branch.as_deref(), - outcome.version, - self.manifest.incarnation().e_tag.as_deref(), - ); - }; - let commit = GraphCommit { - graph_commit_id: intent.graph_commit_id.clone(), - manifest_branch: intent.branch, - manifest_version: outcome.version, - parent_commit_id: outcome.parent_commit_id.clone(), - merged_parent_commit_id: intent.merged_parent_commit_id, - actor_id: intent.actor_id, - created_at: intent.created_at, - }; - commit_graph.insert_committed(commit); - SnapshotId::new(intent.graph_commit_id) + Ok(SnapshotId::new(graph_commit_id)) } async fn open_commit_graph_for_branch( @@ -633,15 +528,6 @@ fn graph_commits_uri(root_uri: &str) -> String { join_uri(root_uri, GRAPH_COMMITS_DIR) } -/// Wrap each `SubTableUpdate` as a `ManifestChange::Update` for
the publisher. -fn updates_to_changes(updates: &[SubTableUpdate]) -> Vec { - updates - .iter() - .cloned() - .map(ManifestChange::Update) - .collect() -} - fn normalize_branch_name(branch: &str) -> Result> { let branch = branch.trim(); if branch.is_empty() { diff --git a/crates/omnigraph/src/db/manifest.rs b/crates/omnigraph/src/db/manifest.rs index da22136..7fcf7de 100644 --- a/crates/omnigraph/src/db/manifest.rs +++ b/crates/omnigraph/src/db/manifest.rs @@ -14,10 +14,6 @@ mod layout; mod metadata; #[path = "manifest/migrations.rs"] mod migrations; -// Entirely test-only since RFC-013 step 3a: with both reads (Fix 2) and writes -// bypassing the Lance namespace, nothing in production routes through it; the -// `LanceNamespace` impls are retained only to validate the contract in unit tests. -#[cfg(test)] #[path = "manifest/namespace.rs"] mod namespace; #[path = "manifest/publisher.rs"] @@ -28,24 +24,21 @@ mod recovery; mod state; use graph::{init_manifest_graph, open_manifest_graph, snapshot_state_at}; -use layout::{open_manifest_dataset, table_uri_for_path, type_name_hash}; -pub(crate) use layout::manifest_uri; +use layout::{manifest_uri, open_manifest_dataset, type_name_hash}; pub(crate) use metadata::TableVersionMetadata; #[cfg(test)] use metadata::{OMNIGRAPH_ROW_COUNT_KEY, table_version_metadata_for_state}; +use namespace::open_table_at_version_from_manifest; +pub(crate) use namespace::open_table_head_for_write; #[cfg(test)] use namespace::{branch_manifest_namespace, staged_table_namespace}; -pub(crate) use migrations::refuse_if_stamp_unsupported; -pub(crate) use publisher::LineageIntent; -use publisher::{GraphNamespacePublisher, ManifestBatchPublisher, PublishOutcome}; +use publisher::{GraphNamespacePublisher, ManifestBatchPublisher}; pub(crate) use recovery::{ RecoveryMode, RecoverySidecar, RecoverySidecarHandle, SidecarKind, SidecarTablePin, - SidecarTableRegistration, SidecarTombstone, confirm_sidecar_phase_b, delete_sidecar, - has_schema_apply_sidecar, 
heal_pending_sidecars_roll_forward, list_sidecars, new_sidecar, - recover_manifest_drift, schema_apply_serial_queue_key, write_sidecar, + SidecarTableRegistration, SidecarTombstone, delete_sidecar, has_schema_apply_sidecar, + new_sidecar, recover_manifest_drift, write_sidecar, }; pub use state::SubTableEntry; -pub(crate) use state::{GraphLineageRow, read_graph_lineage}; #[cfg(test)] use state::string_column; use state::{ManifestState, read_manifest_state}; @@ -53,148 +46,8 @@ use state::{ManifestState, read_manifest_state}; const OBJECT_TYPE_TABLE: &str = "table"; const OBJECT_TYPE_TABLE_VERSION: &str = "table_version"; const OBJECT_TYPE_TABLE_TOMBSTONE: &str = "table_tombstone"; -/// Immutable per-commit graph-lineage row (RFC-013 Phase 7). One row per graph -/// commit; the projected form reconstructs a [`GraphCommit`]. `__manifest` is -/// the single source -- written in the same publish CAS as the table-version -/// rows (no `_graph_commits.lance` row). -const OBJECT_TYPE_GRAPH_COMMIT: &str = "graph_commit"; -/// Mutable per-branch head pointer for the graph lineage (RFC-013 Phase 7). -/// `object_id` is `graph_head:` (`graph_head:main` for the main branch). -const OBJECT_TYPE_GRAPH_HEAD: &str = "graph_head"; const TABLE_VERSION_MANAGEMENT_KEY: &str = "table_version_management"; -/// Stable head-key segment for the main branch in `graph_head:` rows. -/// `table_branch`/`manifest_branch` encode main as null, but `object_id` must be -/// non-null, so the head row needs a literal -- matching the `"main"` sentinel -/// already used by `SnapshotId::synthetic` and `open_for_branch`. -pub(crate) const MAIN_BRANCH_HEAD_KEY: &str = "main"; - -/// The result of a manifest commit that may have folded in a graph commit -/// (RFC-013 Phase 7). -#[derive(Debug, Clone)] -pub(crate) struct CommitOutcome { - /// The new `__manifest` version after the publish.
- pub version: u64, - /// The parent the publisher resolved for the recorded commit, or `None` when - /// no lineage was recorded or the commit is the genesis. Lets the caller - /// update its in-memory commit cache without re-reading the manifest. - pub parent_commit_id: Option, -} - -/// Apply pending internal-schema migrations against `__manifest` on the -/// open-for-write path, independent of a publish. -/// -/// `Omnigraph::open(ReadWrite)` calls this before the coordinator reads branch -/// state, so branch-observing code (`branch_list`, the schema-apply -/// blocking-branch checks) sees the post-migration graph. In particular the -/// v2→v3 step sweeps legacy `__run__*` staging branches off `__manifest` -/// (MR-770); running it here closes the window where those branches would -/// otherwise block schema apply before the first publish runs the migration. -/// -/// Idempotent: a no-op stamp read when the on-disk version already matches. -pub(crate) async fn migrate_on_open(root_uri: &str) -> Result<()> { - let mut dataset = open_manifest_dataset(root_uri, None).await?; - // Main branch: the v3→v4 lineage backfill reads `_graph_commits.lance` at - // main. Named branches migrate on their own first write via the publisher. - migrations::migrate_internal_schema(&mut dataset, root_uri, None).await -} - -/// The on-disk internal-schema stamp of `__manifest` at `branch` (main when -/// `None`). The transitional v3-read fallback in `CommitGraph` uses this to -/// decide whether to source lineage from `__manifest` (stamp ≥ v4, post-Phase-7) -/// or from the legacy `_graph_commits.lance` (stamp < v4, not yet migrated).
-pub(crate) async fn internal_schema_stamp_at(root_uri: &str, branch: Option<&str>) -> Result { - let dataset = open_manifest_dataset(root_uri, branch).await?; - Ok(migrations::read_stamp(&dataset)) -} - -/// Refuse to open a graph whose `__manifest` is stamped outside this binary's -/// supported internal-schema range (newer than CURRENT, or older than -/// MIN_SUPPORTED). The read-only open path calls this -- it skips the write-path -/// migration where the refusal otherwise lives -- so an old binary still refuses a -/// newer graph instead of silently misreading it, and a too-new binary refuses a -/// below-floor graph instead of opening an unmigrated one. -pub(crate) async fn refuse_if_internal_schema_unsupported(root_uri: &str) -> Result<()> { - let stamp = internal_schema_stamp_at(root_uri, None).await?; - migrations::refuse_if_stamp_unsupported(stamp) -} - -/// The internal-schema version this binary writes. Exposed so the v3-read -/// fallback can compare a branch's on-disk stamp against it. -pub(crate) const INTERNAL_MANIFEST_SCHEMA_VERSION: u32 = - migrations::INTERNAL_MANIFEST_SCHEMA_VERSION; - -/// Test-only: create a `__manifest` for a minimal catalog, the first half of a -/// synthetic pre-Phase-7 (v3) graph (see `commit_graph::seed_legacy_v3_lineage`). -/// A small two-type schema is enough -- the v3→v4 migration touches only the -/// lineage rows, never the table-version rows.
-#[cfg(any(test, feature = "failpoints"))] -pub(crate) async fn seed_manifest_for_v3_fixture(root_uri: &str) -> Result<()> { - let schema = omnigraph_compiler::schema::parser::parse_schema( - "node Person { name: String }\nedge Knows: Person -> Person { }\n", - ) - .map_err(|e| OmniError::manifest(e.to_string()))?; - let catalog = - omnigraph_compiler::catalog::build_catalog(&schema).map_err(|e| OmniError::manifest(e.to_string()))?; - ManifestCoordinator::init(root_uri, &catalog).await?; - Ok(()) -} - -/// Test-only: strip the `graph_commit`/`graph_head` rows that Phase-7 init folds -/// into `__manifest`, then rewind the internal-schema stamp to v3 — completing a -/// synthetic pre-Phase-7 graph whose lineage lives only in `_graph_commits.lance`. -#[cfg(any(test, feature = "failpoints"))] -pub(crate) async fn strip_lineage_and_set_v3_stamp_for_fixture(root_uri: &str) -> Result<()> { - let mut dataset = open_manifest_dataset(root_uri, None).await?; - dataset - .delete("object_type = 'graph_commit' OR object_type = 'graph_head'") - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - // Re-open so the stamp write lands on the post-delete HEAD. - let mut dataset = open_manifest_dataset(root_uri, None).await?; - migrations::set_stamp_for_test(&mut dataset, 3).await -} - -/// Test-only: fork a real Lance branch `name` on `__manifest` from main's CURRENT -/// state. Call AFTER `strip_lineage_and_set_v3_stamp_for_fixture` so the forked -/// branch inherits the v3 stamp with no lineage rows — i.e. a faithful -/// pre-Phase-7 branch whose `__manifest` carries no lineage of its own. The -/// branch's commits live only on the `_graph_commits.lance` branch until the -/// per-branch v3→v4 migration runs against this branch's `__manifest`.
-#[cfg(test)] -pub(crate) async fn fork_manifest_branch_for_v3_fixture(root_uri: &str, name: &str) -> Result<()> { - let mut dataset = open_manifest_dataset(root_uri, None).await?; - let version = dataset.version().version; - dataset - .create_branch(name, version, None) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - Ok(()) -} - -/// Test-support re-export of the read-write migration entry point for the -/// `failpoints` integration binary (which can't reach `pub(crate)` items). Gated -/// on `test` OR `failpoints`; never in a release build. -#[cfg(any(test, feature = "failpoints"))] -pub async fn migrate_on_open_for_test(root_uri: &str) -> Result<()> { - migrate_on_open(root_uri).await -} - -/// Test-support: the number of `graph_commit` lineage rows in `__manifest` at -/// `branch` (main when `None`), plus the on-disk internal-schema stamp. Lets the -/// `failpoints` integration binary assert the migration neither stamped nor -/// backfilled when a legacy-open fault fired. Gated on `test` OR `failpoints`. -#[cfg(any(test, feature = "failpoints"))] -pub async fn lineage_row_count_and_stamp_for_test( - root_uri: &str, - branch: Option<&str>, -) -> Result<(usize, u32)> { - let dataset = open_manifest_dataset(root_uri, branch).await?; - let stamp = migrations::read_stamp(&dataset); - let (rows, _heads) = read_graph_lineage(&dataset).await?; - Ok((rows.len(), stamp)) -} - /// Immutable point-in-time view of the database. /// /// Cheap to create (no storage I/O). All reads within a query go through one @@ -204,51 +57,16 @@ pub struct Snapshot { root_uri: String, version: u64, entries: HashMap, - /// Per-graph read caches (shared `Session` + held-handle cache), injected by - /// `Omnigraph::resolved_target` for live Branch reads so table opens reuse - /// handles (0 IO on a warm repeat) and one `Session`. `None` for write-prelude - /// snapshots, time-travel / Snapshot-id reads, and directly-built test - /// snapshots, which fall back to a plain open. 
- read_caches: Option>, } impl Snapshot { - /// Open a sub-table dataset at its pinned version. With read caches present - /// (live Branch reads), reuse a held handle through the cache (0 open IO on a - /// warm repeat) and the shared `Session`; otherwise plain-open (Fix 2). + /// Open a sub-table dataset at its pinned version. pub async fn open(&self, table_key: &str) -> Result { let entry = self .entries .get(table_key) .ok_or_else(|| OmniError::manifest(format!("no manifest entry for {}", table_key)))?; - match &self.read_caches { - Some(caches) => { - let location = table_uri_for_path( - &self.root_uri, - &entry.table_path, - entry.table_branch.as_deref(), - ); - caches - .handles - .get_or_open( - &entry.table_path, - entry.table_branch.as_deref(), - entry.table_version, - entry.version_metadata.e_tag(), - &location, - Some(&caches.session), - ) - .await - } - None => entry.open(&self.root_uri).await, - } - } - - /// Attach per-graph read caches (shared `Session` + handle cache) so this - /// snapshot's table opens reuse handles and the session. Set by - /// `Omnigraph::resolved_target` for live Branch reads only. - pub(crate) fn set_read_caches(&mut self, caches: Arc) { - self.read_caches = Some(caches); + entry.open(&self.root_uri).await } /// Manifest version this snapshot was taken from. @@ -266,31 +84,6 @@ impl Snapshot { } } -#[derive(Debug, Clone, PartialEq, Eq)] -pub(crate) struct ManifestIncarnation { - pub(crate) version: u64, - pub(crate) e_tag: Option, - timestamp_nanos: Option, -} - -impl ManifestIncarnation { - pub(crate) fn matches(&self, held: &Self) -> bool { - if self.version != held.version { - return false; - } - match (&self.e_tag, &held.e_tag) { - (Some(latest), Some(current)) => latest == current, - _ => match (self.timestamp_nanos, held.timestamp_nanos) { - (Some(latest), Some(current)) => latest == current, - // Some object stores can omit both e_tag and manifest timestamp - // from the reachable API. 
In that narrow case the version-number - // probe is the strongest available identity. - _ => true, - }, - } - } -} - impl SubTableUpdate { pub(crate) fn to_create_table_version_request(&self) -> CreateTableVersionRequest { self.version_metadata.to_create_table_version_request( @@ -322,28 +115,14 @@ pub(crate) enum ManifestChange { } impl SubTableEntry { - /// Open this sub-table at its pinned version directly by location (Fix 2), - /// without the Lance namespace — which would full-scan `__manifest` twice per - /// open (`describe_table` + `describe_table_version`). The resolved Snapshot - /// already holds the path, version, and branch. Branches are Lance native - /// branches, so `with_branch` resolves `{base}/tree/{branch}` from the base - /// URI; main uses `with_version`. pub(crate) async fn open(&self, root_uri: &str) -> Result { - // The branch-qualified location is the dataset that physically holds this - // version: main at `{table_path}`, a branch at - // `{table_path}/tree/{branch}` (Lance native-branch storage). `with_version` - // then resolves the version within THAT dataset's `_versions` — a branch - // version lives under `tree/{branch}/_versions`, not the base. This - // matches the physical layout the namespace path resolved, without the - // per-open `__manifest` scan. - let location = table_uri_for_path(root_uri, &self.table_path, self.table_branch.as_deref()); - // Route through the instrumented data-table opener (Fix 3). With no - // session this is exactly the Fix-2 `from_uri(location).with_version`. - // This is the uncached fallback (a snapshot with no read caches); the - // cached path (`Snapshot::open` → handle cache) calls the same opener on - // a miss with the shared session, so both paths count on the per-query - // `table_wrapper`.
- crate::instrumentation::open_table_dataset(&location, self.table_version, None).await + open_table_at_version_from_manifest( + root_uri, + &self.table_key, + self.table_branch.as_deref(), + self.table_version, + ) + .await } } @@ -427,7 +206,6 @@ impl ManifestCoordinator { .into_iter() .map(|entry| (entry.table_key.clone(), entry)) .collect(), - read_caches: None, } } @@ -440,9 +218,6 @@ impl ManifestCoordinator { /// Create a new graph at `root_uri` from a catalog. /// /// Creates per-type Lance datasets and the namespace `__manifest` table. - /// The genesis graph commit is folded into the init write, so `__manifest` - /// is the single source of graph lineage from version one β€” callers read it - /// back through the lineage projection rather than via a second write. pub async fn init(root_uri: &str, catalog: &Catalog) -> Result { let root = root_uri.trim_end_matches('/'); let (dataset, known_state) = init_manifest_graph(root, catalog).await?; @@ -549,58 +324,17 @@ impl ManifestCoordinator { changes: &[ManifestChange], expected_table_versions: &HashMap, ) -> Result { - Ok(self - .commit_changes_with_lineage(changes, expected_table_versions, None) - .await? - .version) - } - - /// Publish `changes` and, when `lineage` is present, record the graph commit - /// in the SAME merge-insert (RFC-013 Phase 7). `__manifest` is the single - /// source of graph lineage: the `graph_commit` + `graph_head:` rows - /// ride the table-version publish so the whole commit lands at one manifest - /// version β€” no separate write, no manifestβ†’commit-graph atomicity gap, no - /// per-write commit-graph refresh. Returns the new version and the parent the - /// publisher resolved for the commit (so the caller can update its in-memory - /// commit cache without a re-read). 
- pub(crate) async fn commit_changes_with_lineage( - &mut self, - changes: &[ManifestChange], - expected_table_versions: &HashMap, - lineage: Option<&LineageIntent>, - ) -> Result { - if changes.is_empty() && expected_table_versions.is_empty() && lineage.is_none() { - return Ok(CommitOutcome { - version: self.version(), - parent_commit_id: None, - }); + if changes.is_empty() && expected_table_versions.is_empty() { + return Ok(self.version()); } - let PublishOutcome { - dataset, - parent_commit_id, - } = self + self.dataset = self .publisher - .publish(changes, expected_table_versions, lineage) + .publish(changes, expected_table_versions) .await?; - self.dataset = dataset; self.known_state = read_manifest_state(&self.dataset).await?; - Ok(CommitOutcome { - version: self.version(), - parent_commit_id, - }) - } - - /// Project the graph-lineage rows out of `__manifest` at `branch` without an - /// open coordinator. Opens the manifest fresh; used by `CommitGraph` to - /// source its in-memory cache from the manifest projection. - pub(crate) async fn read_graph_lineage_at( - root_uri: &str, - branch: Option<&str>, - ) -> Result<(Vec, HashMap)> { - let dataset = open_manifest_dataset(root_uri, branch).await?; - read_graph_lineage(&dataset).await + Ok(self.version()) } /// Current manifest version. @@ -608,48 +342,6 @@ impl ManifestCoordinator { self.dataset.version().version } - /// Latest committed manifest version on disk (one object-store op, no row - /// scan). The freshness probe for warm reuse: compare against `version()` - /// (the held handle's pinned version) to decide whether to refresh. 
- pub async fn probe_latest_version(&self) -> Result { - self.dataset - .latest_version_id() - .await - .map_err(|e| OmniError::Lance(e.to_string())) - } - - pub(crate) fn incarnation(&self) -> ManifestIncarnation { - ManifestIncarnation { - version: self.version(), - e_tag: self.dataset.manifest_location().e_tag.clone(), - timestamp_nanos: Some(self.dataset.manifest().timestamp_nanos), - } - } - - /// Latest committed manifest identity. Main cannot be deleted/recreated, so - /// the cheap version-number probe is sufficient there. Non-main Lance - /// branches can be deleted and recreated with the same version number, so - /// load the latest manifest location and compare its e_tag / timestamp too. - pub(crate) async fn probe_latest_incarnation(&self) -> Result { - if self.active_branch.is_none() { - return Ok(ManifestIncarnation { - version: self.probe_latest_version().await?, - e_tag: self.dataset.manifest_location().e_tag.clone(), - timestamp_nanos: Some(self.dataset.manifest().timestamp_nanos), - }); - } - let (manifest, location) = self - .dataset - .latest_manifest() - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - Ok(ManifestIncarnation { - version: manifest.version, - e_tag: location.e_tag, - timestamp_nanos: Some(manifest.timestamp_nanos), - }) - } - pub fn active_branch(&self) -> Option<&str> { self.active_branch.as_deref() } diff --git a/crates/omnigraph/src/db/manifest/graph.rs b/crates/omnigraph/src/db/manifest/graph.rs index e0d3c85..6c414aa 100644 --- a/crates/omnigraph/src/db/manifest/graph.rs +++ b/crates/omnigraph/src/db/manifest/graph.rs @@ -14,17 +14,9 @@ use super::layout::{manifest_uri, open_manifest_dataset, type_name_hash}; use super::metadata::TableVersionMetadata; use super::migrations::stamp_current_version; use super::state::{ - GraphLineageRow, ManifestState, SubTableEntry, entries_to_batch, graph_lineage_row_parts, - manifest_schema, read_manifest_state, + ManifestState, SubTableEntry, entries_to_batch, manifest_schema, 
read_manifest_state, }; -/// The manifest version the init `Dataset::write` produces (Lance datasets start -/// at version one). The genesis graph commit pins this version β€” a snapshot at -/// it is the empty, freshly-initialized graph. The two config-only commits that -/// follow (`update_config`, `stamp_current_version`) advance the live manifest -/// version but add no table data, so genesis correctly stays pinned at one. -const GENESIS_MANIFEST_VERSION: u64 = 1; - pub(super) async fn init_manifest_graph( root_uri: &str, catalog: &Catalog, @@ -32,29 +24,13 @@ pub(super) async fn init_manifest_graph( let root = root_uri.trim_end_matches('/'); let (entries, version_metadata) = build_initial_entries(root, catalog).await?; - // Genesis graph commit: parentless, actorless, minted once and folded into - // the init write so `__manifest` is the single source of graph lineage from - // version one (no `_graph_commits.lance` row, no separate publish). - let genesis = GraphLineageRow { - graph_commit_id: ulid::Ulid::new().to_string(), - manifest_branch: None, - manifest_version: GENESIS_MANIFEST_VERSION, - parent_commit_id: None, - merged_parent_commit_id: None, - actor_id: None, - created_at: crate::db::now_micros()?, - }; - let genesis_lineage = graph_lineage_row_parts(&genesis, None)?; - - let manifest_batch = entries_to_batch(&entries, &version_metadata, &genesis_lineage)?; + let manifest_batch = entries_to_batch(&entries, &version_metadata)?; let schema = manifest_schema(); let reader = RecordBatchIterator::new(vec![Ok(manifest_batch)], schema); let params = WriteParams { mode: WriteMode::Create, enable_stable_row_ids: true, data_storage_version: Some(LanceFileVersion::V2_2), - auto_cleanup: None, - skip_auto_cleanup: true, ..Default::default() }; let manifest_path = manifest_uri(root); @@ -151,8 +127,6 @@ async fn create_empty_dataset(uri: &str, schema: &SchemaRef) -> Result enable_stable_row_ids: true, data_storage_version: Some(LanceFileVersion::V2_2), 
allow_external_blob_outside_bases: true, - auto_cleanup: None, - skip_auto_cleanup: true, ..Default::default() }; Dataset::write(reader, uri, Some(params)) diff --git a/crates/omnigraph/src/db/manifest/layout.rs b/crates/omnigraph/src/db/manifest/layout.rs index 12894a7..9cfde9a 100644 --- a/crates/omnigraph/src/db/manifest/layout.rs +++ b/crates/omnigraph/src/db/manifest/layout.rs @@ -15,17 +15,14 @@ pub(super) fn type_name_hash(name: &str) -> String { format!("{:016x}", h) } -pub(crate) fn manifest_uri(root: &str) -> String { +pub(super) fn manifest_uri(root: &str) -> String { format!("{}/{}", root.trim_end_matches('/'), MANIFEST_DIR) } pub(super) async fn open_manifest_dataset(root_uri: &str, branch: Option<&str>) -> Result { - let uri = manifest_uri(root_uri.trim_end_matches('/')); - let dataset = crate::instrumentation::open_dataset_tracked( - &uri, - crate::instrumentation::manifest_wrapper(), - ) - .await?; + let dataset = Dataset::open(&manifest_uri(root_uri.trim_end_matches('/'))) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; match branch { Some(branch) if branch != "main" => dataset .checkout_branch(branch) @@ -76,7 +73,6 @@ pub(super) fn table_uri_for_path(root_uri: &str, table_path: &str, branch: Optio } } -#[cfg(test)] pub(super) fn namespace_internal_error(message: impl Into) -> LanceNamespaceError { LanceNamespaceError::namespace_source(Box::new(std::io::Error::other(message.into()))) } diff --git a/crates/omnigraph/src/db/manifest/metadata.rs b/crates/omnigraph/src/db/manifest/metadata.rs index d84db34..0bf14b6 100644 --- a/crates/omnigraph/src/db/manifest/metadata.rs +++ b/crates/omnigraph/src/db/manifest/metadata.rs @@ -2,9 +2,7 @@ use std::collections::HashMap; use lance::Dataset; use lance_namespace::Error as LanceNamespaceError; -use lance_namespace::models::CreateTableVersionRequest; -#[cfg(test)] -use lance_namespace::models::TableVersion; +use lance_namespace::models::{CreateTableVersionRequest, TableVersion}; use 
serde::{Deserialize, Serialize}; use crate::error::{OmniError, Result}; @@ -113,6 +111,7 @@ impl TableVersionMetadata { self.manifest_size } + #[cfg(test)] pub(crate) fn e_tag(&self) -> Option<&str> { self.e_tag.as_deref() } @@ -139,12 +138,10 @@ impl TableVersionMetadata { request } - #[cfg(test)] pub(super) fn to_namespace_version(&self, version: u64) -> TableVersion { self.to_namespace_version_with_details(version, None, None) } - #[cfg(test)] pub(super) fn to_namespace_version_with_details( &self, version: u64, diff --git a/crates/omnigraph/src/db/manifest/migrations.rs b/crates/omnigraph/src/db/manifest/migrations.rs index 207def4..bbb7995 100644 --- a/crates/omnigraph/src/db/manifest/migrations.rs +++ b/crates/omnigraph/src/db/manifest/migrations.rs @@ -37,9 +37,6 @@ use lance::Dataset; use crate::error::{OmniError, Result}; -use crate::db::commit_graph::GraphCommit; -use super::state::{GraphLineageRow, graph_lineage_row_parts, merge_lineage_rows, read_graph_lineage}; - /// Current internal schema version this binary expects to find on disk. /// /// History: @@ -49,66 +46,14 @@ use super::state::{GraphLineageRow, graph_lineage_row_parts, merge_lineage_rows, /// - v2 β€” `__manifest.object_id` carries the unenforced-PK annotation, /// engaging Lance's bloom-filter conflict resolver at commit time. Added /// alongside `expected_table_versions` OCC on `ManifestBatchPublisher::publish`. -/// - v3 β€” one-time sweep of legacy `__run__` staging branches left on the -/// `__manifest` dataset by the pre-v0.4.0 Run state machine (removed in -/// MR-771). Once swept, the `is_internal_run_branch` defense-in-depth guard -/// is no longer needed (MR-770). -/// - v4 β€” RFC-013 Phase 7 folds graph lineage into `__manifest` as -/// `graph_commit`/`graph_head` rows written in the publish CAS. A pre-Phase-7 -/// (v3) graph has its lineage only in `_graph_commits.lance`, so the new -/// binary would read an empty commit DAG. 
This one-time per-branch backfill -/// copies the lineage from `_graph_commits.lance` into `__manifest` -/// (`migrate_v3_to_v4`). `_graph_commits.lance` is left in place as the -/// branch-ref carrier; no commit rows are ever written to it again. -pub(crate) const INTERNAL_MANIFEST_SCHEMA_VERSION: u32 = 4; - -/// The oldest on-disk internal-schema stamp this binary will open. A graph below -/// this floor is refused (`refuse_if_stamp_unsupported`) with a "migrate it -/// forward with an older release first" error, instead of obliging this binary to -/// carry that version's `migrate_vN_…` arm and the legacy readers it needs -/// forever. Raising the floor is how the migration chain sheds old code. -/// -/// **Retirement runbook** β€” turning "accumulates forever" into a sliding window: -/// 1. *Shed version N* once no graph below `N+1` remains in the fleet: bump this -/// floor AND `LOWEST_REGISTERED_MIGRATION_SOURCE` to `N+1`, then delete the -/// `N =>` arm in `migrate_internal_schema`, `migrate_vN_to_vN+1`, and its -/// helpers + tests. The tripwire test keeps the two consts in lockstep, so a -/// half-done shed fails CI. -/// 2. *Retire the v3 legacy readers entirely* once MIN β‰₯ 4: `git rm` the -/// `commit_graph/commit_graph_legacy_v3.rs` seam file and flip the single -/// `stamp < CURRENT` gate in `load_commit_cache_for_branch` to read the -/// manifest projection unconditionally. -/// -/// MIN = 1 today is a pure no-op: `read_stamp` floors an absent stamp at 1 and no -/// real graph carries 0, so nothing is refused. -pub(crate) const MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION: u32 = 1; - -/// The lowest `current` value the `migrate_internal_schema` dispatcher still has a -/// `match` arm for. Mirrors the lowest registered migration source so a floor bump -/// that forgets to delete the now-dead arm (or vice versa) is caught by the -/// compile-time tripwire below. 
Migration arms aren't an enumerable registry, so -/// this hand-mirrored const is the minimal enforced coupling β€” cheaper than -/// reshaping the dispatcher into a data-driven table. -const LOWEST_REGISTERED_MIGRATION_SOURCE: u32 = 1; - -/// Retirement tripwire (compile-time): the refusal floor and the lowest migration -/// arm must move together. Raising `MIN_SUPPORTED` without deleting the now-dead -/// below-floor arm β€” or vice versa β€” fails the build with this message, which is -/// stronger than a runtime test and impossible to skip. Migration arms can't be -/// enumerated, so this const-mirror is the check. -const _: () = assert!( - LOWEST_REGISTERED_MIGRATION_SOURCE == MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION, - "internal-schema floor drifted from the lowest registered migration arm: when raising \ - MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION, delete every below-floor `N =>` arm + migrate_vN_… \ - + its helpers/tests and bump LOWEST_REGISTERED_MIGRATION_SOURCE to match (or vice versa)", -); +pub(super) const INTERNAL_MANIFEST_SCHEMA_VERSION: u32 = 2; const INTERNAL_SCHEMA_VERSION_KEY: &str = "omnigraph:internal_schema_version"; const OBJECT_ID_PK_KEY: &str = "lance-schema:unenforced-primary-key"; /// Read the on-disk stamp from `__manifest`'s schema-level metadata. /// Absent β‡’ v1 (pre-stamp world). -pub(crate) fn read_stamp(dataset: &Dataset) -> u32 { +pub(super) fn read_stamp(dataset: &Dataset) -> u32 { dataset .schema() .metadata @@ -123,52 +68,20 @@ pub(super) async fn stamp_current_version(dataset: &mut Dataset) -> Result<()> { set_stamp(dataset, INTERNAL_MANIFEST_SCHEMA_VERSION).await } -/// Refuse to open a manifest whose stamp this binary cannot serve β€” in either -/// direction β€” with a clear upgrade path. Shared by every place a stamp is read -/// and enforced: the write-path migration dispatcher, the read-only open guard, -/// and the branch lineage-read path. 
Checking both bounds in one function means a -/// new stamp-reading caller gets the floor and the ceiling together and cannot -/// half-enforce. -/// -/// - `stamp > CURRENT`: the graph was written by a newer binary β€” upgrade omnigraph. -/// - `stamp < MIN_SUPPORTED`: the graph predates the oldest migration this binary -/// still carries β€” migrate it forward with an older release first, then reopen. -pub(crate) fn refuse_if_stamp_unsupported(stamp: u32) -> Result<()> { - if stamp > INTERNAL_MANIFEST_SCHEMA_VERSION { - return Err(OmniError::manifest(format!( - "__manifest is stamped at internal schema v{} but this binary expects v{} \ - β€” upgrade omnigraph before opening this graph", - stamp, INTERNAL_MANIFEST_SCHEMA_VERSION, - ))); - } - if stamp < MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION { - return Err(OmniError::manifest(format!( - "__manifest is stamped at internal schema v{} but this binary supports v{} or later \ - β€” open it with an older omnigraph release to migrate it forward first, then reopen", - stamp, MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION, - ))); - } - Ok(()) -} - /// Apply any pending internal-schema migrations to the manifest dataset. /// /// Idempotent: when the on-disk stamp matches the binary, this is a single /// metadata read with no writes. -/// -/// `root_uri` + `branch` identify which graph + branch this `dataset` is a -/// manifest for. The v3β†’v4 lineage backfill needs them to read that branch's -/// `_graph_commits.lance`. `migrate_on_open` passes the main branch -/// (`branch = None`); the publisher's `load_publish_state` passes its own -/// branch, so each branch backfills on its first write. 
-pub(super) async fn migrate_internal_schema( - dataset: &mut Dataset, - root_uri: &str, - branch: Option<&str>, -) -> Result<()> { +pub(super) async fn migrate_internal_schema(dataset: &mut Dataset) -> Result<()> { let mut current = read_stamp(dataset); - refuse_if_stamp_unsupported(current)?; + if current > INTERNAL_MANIFEST_SCHEMA_VERSION { + return Err(OmniError::manifest(format!( + "__manifest is stamped at internal schema v{} but this binary expects v{} \ + β€” upgrade omnigraph before opening this graph for writes", + current, INTERNAL_MANIFEST_SCHEMA_VERSION, + ))); + } while current < INTERNAL_MANIFEST_SCHEMA_VERSION { match current { @@ -176,14 +89,6 @@ pub(super) async fn migrate_internal_schema( migrate_v1_to_v2(dataset).await?; current = 2; } - 2 => { - migrate_v2_to_v3(dataset).await?; - current = 3; - } - 3 => { - migrate_v3_to_v4(dataset, root_uri, branch).await?; - current = 4; - } other => { return Err(OmniError::manifest_internal(format!( "no internal-schema migration registered for v{} β†’ v{}", @@ -200,305 +105,21 @@ pub(super) async fn migrate_internal_schema( /// so the merge-insert conflict resolver enforces row-level CAS at commit /// time, then bump the stamp. /// -/// Idempotent under crash-retry by construction. Lance 7 makes the unenforced -/// primary key **immutable once set**: any write that touches the reserved -/// `lance-schema:unenforced-primary-key` field metadata after the PK is set -/// errors ("cannot be changed once set", `lance::dataset::transaction`), even -/// re-applying the same value. A crash between the field-set and the stamp -/// bump leaves the field set without a stamp, so the next open re-enters here -/// with the PK already present β€” we must therefore set it only when absent. -/// (Fresh graphs bake the PK into `manifest_schema()` at init and never run -/// this migration; only genuine pre-v0.4.0 graphs do.) 
+/// Both steps are idempotent under retry: re-applying the field annotation +/// at its current value is a no-op-ish bump in Lance, and the stamp is a +/// simple key-value write. A crash between the two leaves the field set +/// without a stamp; the next open re-runs this fn and only the stamp lands. async fn migrate_v1_to_v2(dataset: &mut Dataset) -> Result<()> { - // The guard is over the *specific* field, not just "any PK is set": skipping - // when `object_id` is already the PK is the idempotent crash-recovery path, - // but a manifest whose PK is some *other* field has the wrong CAS key β€” and - // Lance 7 won't let us change it. Refuse loudly rather than silently leave - // merge-insert conflict detection keyed on the wrong column. - let pk_fields: Vec<&str> = dataset - .schema() - .unenforced_primary_key() - .iter() - .map(|field| field.name.as_str()) - .collect(); - match pk_fields.as_slice() { - ["object_id"] => {} // already migrated (or a crash re-entry) β€” idempotent no-op - [] => { - dataset - .update_field_metadata() - .update( - "object_id", - [(OBJECT_ID_PK_KEY.to_string(), "true".to_string())], - ) - .map_err(|e| OmniError::Lance(e.to_string()))? - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - } - other => { - return Err(OmniError::manifest_internal(format!( - "__manifest unenforced primary key is {other:?}, expected [\"object_id\"]; \ - refusing to migrate a manifest with an unexpected CAS key" - ))); - } - } - set_stamp(dataset, 2).await -} - -/// v2 β†’ v3: sweep legacy `__run__` staging branches off the `__manifest` -/// dataset, then bump the stamp. -/// -/// The pre-v0.4.0 Run state machine (removed in MR-771) created graph-level -/// staging branches named `__run__` on `__manifest`. MR-771 stopped -/// creating them but left any pre-existing ones in place; Lance's -/// `list_branches` still enumerates them, so they leak into `branch_list()` -/// and count as blocking branches at schema-apply time. 
This one-time sweep -/// removes them so the `is_internal_run_branch` guard can retire (MR-770). -/// -/// The `"__run__"` prefix is inlined here on purpose: this migration must keep -/// working after the `run_registry` module (the guard) is deleted, so it does -/// not depend on it. -/// -/// Idempotent under both sequential retry and concurrent runners: each run -/// re-enumerates `list_branches` fresh, and `force_delete_branch` tolerates a -/// branch that is already gone β€” so a crash before the stamp bump, or a second -/// process opening the same legacy graph at the same time, never errors out. -async fn migrate_v2_to_v3(dataset: &mut Dataset) -> Result<()> { - const LEGACY_RUN_BRANCH_PREFIX: &str = "__run__"; - let branches = dataset - .list_branches() + dataset + .update_field_metadata() + .update( + "object_id", + [(OBJECT_ID_PK_KEY.to_string(), "true".to_string())], + ) + .map_err(|e| OmniError::Lance(e.to_string()))? .await .map_err(|e| OmniError::Lance(e.to_string()))?; - let run_branches: Vec = branches - .into_keys() - .filter(|name| { - name.trim_start_matches('/') - .starts_with(LEGACY_RUN_BRANCH_PREFIX) - }) - .collect(); - for name in run_branches { - // `force_delete_branch` deletes even when the `BranchContents` is - // already gone. Plain `delete_branch` errors "BranchContents not - // found", which would fail a second concurrent open (or a retry that - // raced another runner) after the first one swept the branch. Force is - // exactly Lance's documented path for cleaning up zombie branches. - dataset - .force_delete_branch(&name) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - } - set_stamp(dataset, 3).await -} - -/// v3 β†’ v4: backfill the graph lineage from `_graph_commits.lance` into -/// `__manifest`, then bump the stamp. -/// -/// RFC-013 Phase 7 made `__manifest` the single source of graph lineage -/// (`graph_commit` / `graph_head:` rows, written in the publish CAS). 
-/// A pre-Phase-7 (v3) graph has its lineage only in `_graph_commits.lance` and -/// none in `__manifest`, so the new binary would read an EMPTY commit DAG. This -/// one-time per-branch migration copies that branch's commits + the single head -/// into `__manifest` so reads see the real history. `_graph_commits.lance` -/// itself is left untouched as the branch-ref carrier (no commit row is ever -/// written to it again). -/// -/// `dataset` is the `__manifest` for `branch` (main when `branch` is `None`); -/// the migration runs per-branch on that branch's first write, so it reads -/// `_graph_commits.lance` at the SAME branch. -/// -/// Idempotency + crash recovery: the stamp bump is the LAST step, and the -/// lineage merge is keyed on `object_id` (re-inserting the same commit rows is a -/// no-op update). A crash after the merge but before the stamp bump re-enters -/// here at v3 and re-runs harmlessly. As a fast path, if `__manifest` already -/// carries `graph_commit` rows (a previous run completed the merge), we skip -/// straight to the stamp bump. -/// -/// Concurrent runners: two processes (or two open-for-write handles) can open the -/// same legacy graph at once and both reach the backfill merge. `merge_lineage_rows` -/// uses `conflict_retries(0)`, so the row-level CAS loser on `graph_head:` -/// must be re-driven here rather than failing the open β€” `migrate_v2_to_v3` is -/// concurrent-runner idempotent and this step must be too. The bounded loop -/// re-reads the fast path (a concurrent winner's merge is one atomic Lance commit, -/// so a re-read sees either zero or all of its rows, never partial), re-opens the -/// stale handle past the winner's commit, and retries. 
On budget exhaustion it -/// returns a `RowLevelCasContention`-typed error so the publisher's OUTER retry -/// loop (which only re-runs `is_retryable_publish_conflict` conflicts) completes -/// it on the next attempt β€” the same converge-on-next-attempt contract the -/// recovery sweep uses. -async fn migrate_v3_to_v4( - dataset: &mut Dataset, - root_uri: &str, - branch: Option<&str>, -) -> Result<()> { - // Mirror the publisher's budget (`publisher::PUBLISHER_RETRY_BUDGET = 5`); kept - // as a local const rather than re-exporting that private one β€” the two are the - // same shape (bounded row-level-CAS retries) but independent knobs. - const MIGRATION_MERGE_RETRY_BUDGET: u32 = 5; - - // Exclusive range + an unguarded retryable arm (see `commit_v4_stamp_idempotently` - // for the rationale): every retryable conflict re-opens and retries inside the - // loop, and the SINGLE reachable exhaustion path is the typed contention return - // below β€” so the retryable variant can never fall through to the `Err(err)` - // propagate arm on the last iteration. - for _ in 0..MIGRATION_MERGE_RETRY_BUDGET { - // Fast path / idempotency + concurrent-winner guard: if the backfill - // already landed (a previous run, OR a concurrent runner that won the CAS - // β€” its merge is atomic, so this is all-or-nothing), don't re-merge β€” just - // (re)stamp. `dataset` is re-opened past any winner's commit below, so this - // re-read sees the winner's rows on a retry. - let (existing_lineage, _heads) = read_graph_lineage(dataset).await?; - if !existing_lineage.is_empty() { - return commit_v4_stamp_idempotently(dataset, root_uri, branch).await; - } - - // Read this branch's legacy commit cache (commits + the head). An absent or - // empty `_graph_commits.lance` yields no commits β€” nothing to backfill. 
- let (commit_by_id, head) = - crate::db::commit_graph::read_legacy_commit_cache(root_uri, branch).await?; - if commit_by_id.is_empty() { - return commit_v4_stamp_idempotently(dataset, root_uri, branch).await; - } - - let parts = build_lineage_backfill_parts(&commit_by_id, head.as_ref(), branch)?; - - match merge_lineage_rows(dataset.clone(), &parts).await { - Ok(new_dataset) => { - *dataset = new_dataset; - // Stamp LAST. Crash window: a failure between the merge above and - // this stamp bump leaves stamp v3 + lineage present in `__manifest`. - // The next open re-enters at v3, the fast path at the top sees the - // lineage and skips straight to the stamp bump — completing the - // migration with no duplicate rows (the merge is keyed on - // `object_id`). Pinned by - // `crash_after_merge_before_stamp_completes_on_next_open`. - return commit_v4_stamp_idempotently(dataset, root_uri, branch).await; - } - // A concurrent runner won the `graph_head:` CAS. Our in-hand - // handle is stale at the pre-contention HEAD, so a re-open is required - // to see the winner's commit; then re-loop (the fast path will see the - // winner's lineage and stamp). Bounded by the budget. - Err(err) if super::publisher::is_retryable_publish_conflict(&err) => { - *dataset = super::layout::open_manifest_dataset(root_uri, branch).await?; - continue; - } - Err(err) => return Err(err), - } - } - - // Budget exhausted under sustained contention. Return a CAS-typed error (not a - // plain conflict) so the publisher's outer retry loop — which only re-runs - // `is_retryable_publish_conflict` — re-runs `load_publish_state` and completes - // the migration, rather than giving up. - Err(OmniError::manifest_row_level_cas_contention(format!( - "v3→v4 lineage backfill exhausted {} retries against concurrent runners", - MIGRATION_MERGE_RETRY_BUDGET - ))) -} - -/// Stamp the v3→v4 migration's terminal version idempotently under concurrent -/// runners. 
`set_stamp` issues an `UpdateConfig` Lance commit; once the merge CAS -/// loser is made to converge (above), BOTH runners reach this stamp bump and race -/// it — the loser gets `lance::Error::IncompatibleTransaction` (two `UpdateConfig` -/// commits touching the same metadata key), which is NOT a row-level CAS -/// contention and so is not caught by the merge loop. But both write the SAME -/// value, so the conflict is benign: re-open and, if the stamp already reached the -/// target (the concurrent runner finished it), succeed; otherwise re-apply. -/// Bounded; on exhaustion surface a CAS-typed error for the publisher's outer -/// retry, same as the merge loop. -async fn commit_v4_stamp_idempotently( - dataset: &mut Dataset, - root_uri: &str, - branch: Option<&str>, -) -> Result<()> { - const STAMP_RETRY_BUDGET: u32 = 5; - // Exclusive range + an UNGUARDED `IncompatibleTransaction` arm: the retryable - // variant is always handled inside the loop (re-open + same-value check + retry), - // so it can never fall through to the stringifying `Err(e)` catch-all, and the - // SINGLE reachable exhaustion path is the typed contention return below. (A - // `0..=BUDGET` range with an `attempt < BUDGET` guard let the last iteration's - // retryable conflict reach the catch-all and return a non-retryable - // `OmniError::Lance` — the publisher's outer retry would then give up.) - for _ in 0..STAMP_RETRY_BUDGET { - // Inline the `update_schema_metadata` write (rather than `set_stamp`) so the - // raw Lance error variant is in hand — `set_stamp` pre-stringifies it. - let stamp_result = stamp_internal_schema(dataset).await; - match stamp_result { - Ok(_) => return Ok(()), - Err(lance::Error::IncompatibleTransaction { .. }) => { - // A concurrent runner's `UpdateConfig` preempted ours — the - // retryable case. Re-open past its commit; if it already stamped to - // the target we're done (the value is identical), else fall through - // to retry on the advanced handle. 
- *dataset = super::layout::open_manifest_dataset(root_uri, branch).await?; - if read_stamp(dataset) >= INTERNAL_MANIFEST_SCHEMA_VERSION { - return Ok(()); - } - } - Err(e) => return Err(OmniError::Lance(e.to_string())), - } - } - - // Exhausted the budget against sustained concurrent stampers. Return a - // CAS-typed (retryable) error so the publisher's OUTER retry — which only - // re-runs `is_retryable_publish_conflict` — completes it, rather than the - // stringified `OmniError::Lance` it would treat as fatal. - Err(OmniError::manifest_row_level_cas_contention(format!( - "v3→v4 stamp bump exhausted {} retries against concurrent runners", - STAMP_RETRY_BUDGET - ))) -} - -/// The single `update_schema_metadata` write that bumps the on-disk internal-schema -/// stamp to the current version. Extracted from `commit_v4_stamp_idempotently`'s -/// retry loop so a `failpoints` test can inject a concurrent-stamper -/// `IncompatibleTransaction` deterministically (the loop's exhaustion path is -/// otherwise near-unreachable). Returns the RAW `lance::Error` so the loop can match -/// the `IncompatibleTransaction` variant — `set_stamp` pre-stringifies it. -async fn stamp_internal_schema(dataset: &mut Dataset) -> std::result::Result<(), lance::Error> { - crate::failpoints::maybe_fail_lance_incompatible("migration.v4_stamp.force_incompatible")?; - dataset - .update_schema_metadata([( - INTERNAL_SCHEMA_VERSION_KEY.to_string(), - INTERNAL_MANIFEST_SCHEMA_VERSION.to_string(), - )]) - .await - .map(|_| ()) -} - -/// Build the `__manifest` rows for the v3→v4 backfill: one immutable -/// `graph_commit` row per commit, plus EXACTLY ONE `graph_head:` row for -/// the actual head. Each commit encodes to a `[graph_commit, graph_head]` pair, -/// but only the head commit's head row is kept — the others would be redundant -/// updates of the same `graph_head:` object_id (the head is per-branch, -/// not per-commit). 
-fn build_lineage_backfill_parts( - commit_by_id: &std::collections::HashMap, - head: Option<&GraphCommit>, - branch: Option<&str>, -) -> Result> { - let head_id = head.map(|h| h.graph_commit_id.as_str()); - // Deterministic iteration order (the source is a HashMap): merge-insert is - // keyed on `object_id` so the final manifest content is order-independent, - // but a stable order keeps the produced batch reproducible regardless. - let mut commits: Vec<&GraphCommit> = commit_by_id.values().collect(); - commits.sort_by(|a, b| a.graph_commit_id.cmp(&b.graph_commit_id)); - let mut parts = Vec::with_capacity(commits.len() + 1); - for commit in commits { - let row = GraphLineageRow { - graph_commit_id: commit.graph_commit_id.clone(), - manifest_branch: commit.manifest_branch.clone(), - manifest_version: commit.manifest_version, - parent_commit_id: commit.parent_commit_id.clone(), - merged_parent_commit_id: commit.merged_parent_commit_id.clone(), - actor_id: commit.actor_id.clone(), - created_at: commit.created_at, - }; - let [commit_part, head_part] = graph_lineage_row_parts(&row, branch)?; - parts.push(commit_part); - if Some(commit.graph_commit_id.as_str()) == head_id { - parts.push(head_part); - } - } - Ok(parts) + set_stamp(dataset, 2).await } async fn set_stamp(dataset: &mut Dataset, version: u32) -> Result<()> { @@ -508,42 +129,3 @@ async fn set_stamp(dataset: &mut Dataset, version: u32) -> Result<()> { .map_err(|e| OmniError::Lance(e.to_string()))?; Ok(()) } - -/// Test-only: force the on-disk internal-schema stamp to `version`. Used to -/// synthesize a pre-migration graph (rewinding to v3) and to simulate a crash -/// that lost the final stamp bump. Gated on `test` OR `failpoints` so the -/// fault-injection migration test (in the `failpoints` integration binary, -/// compiled without `cfg(test)`) can reach it too. 
-#[cfg(any(test, feature = "failpoints"))] -pub(crate) async fn set_stamp_for_test(dataset: &mut Dataset, version: u32) -> Result<()> { - set_stamp(dataset, version).await -} - -#[cfg(test)] -mod tests { - use super::*; - - /// The floor never refuses any stamp the binary can actually serve — a graph - /// at MIN through CURRENT passes, only sub-MIN / super-CURRENT are rejected. - /// With MIN = 1 and CURRENT = 4 this proves the live range is exactly [1, 4] - /// and that the floor is a no-op for every real graph (lowest real stamp is 1). - #[test] - fn unsupported_guard_accepts_exactly_the_supported_range() { - for stamp in MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION..=INTERNAL_MANIFEST_SCHEMA_VERSION { - assert!( - refuse_if_stamp_unsupported(stamp).is_ok(), - "stamp v{stamp} is within [MIN, CURRENT] and must be accepted" - ); - } - if MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION > 0 { - assert!( - refuse_if_stamp_unsupported(MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION - 1).is_err(), - "a sub-floor stamp must be refused" - ); - } - assert!( - refuse_if_stamp_unsupported(INTERNAL_MANIFEST_SCHEMA_VERSION + 1).is_err(), - "a future stamp must be refused" - ); - } -} diff --git a/crates/omnigraph/src/db/manifest/namespace.rs b/crates/omnigraph/src/db/manifest/namespace.rs index a684b4d..80d206f 100644 --- a/crates/omnigraph/src/db/manifest/namespace.rs +++ b/crates/omnigraph/src/db/manifest/namespace.rs @@ -1,10 +1,3 @@ -// Both the read namespace (BranchManifestNamespace) and the write namespace -// (StagedTableNamespace) are now test-only contract validation. Reads open -// sub-tables directly by location+version (SubTableEntry::open, Fix 2), and -// writes open the table head directly by URI (TableStore::open_dataset_head, -// RFC-013 step 3a), so nothing in production routes through the Lance namespace -// anymore. These impls are retained only to validate the LanceNamespace -// contract in unit tests. 
use std::sync::Arc; use async_trait::async_trait; @@ -17,9 +10,7 @@ use lance_namespace::models::{ }; use lance_namespace::{Error as LanceNamespaceError, LanceNamespace, NamespaceError}; use lance_table::io::commit::ManifestNamingScheme; -use object_store::{ - Error as ObjectStoreError, ObjectStore as _, ObjectStoreExt, PutMode, PutOptions, path::Path, -}; +use object_store::{Error as ObjectStoreError, ObjectStore as _, PutMode, PutOptions, path::Path}; use crate::error::{OmniError, Result}; @@ -162,6 +153,41 @@ pub(crate) fn staged_table_namespace( )) } +async fn load_table_from_namespace( + namespace: Arc, + table_key: &str, + branch: Option<&str>, + version: Option, +) -> Result { + let builder = DatasetBuilder::from_namespace(namespace, vec![table_key.to_string()]) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + let builder = match (branch, version) { + (Some(branch), version) => builder.with_branch(branch, version), + (None, Some(version)) => builder.with_version(version), + (None, None) => builder, + }; + builder + .load() + .await + .map_err(|e| OmniError::Lance(e.to_string())) +} + +pub(crate) async fn open_table_at_version_from_manifest( + root_uri: &str, + table_key: &str, + branch: Option<&str>, + version: u64, +) -> Result { + load_table_from_namespace( + branch_manifest_namespace(root_uri, branch), + table_key, + branch, + Some(version), + ) + .await +} + #[async_trait] impl LanceNamespace for BranchManifestNamespace { fn namespace_id(&self) -> String { @@ -516,3 +542,18 @@ impl LanceNamespace for StagedTableNamespace { Ok(response) } } + +pub(crate) async fn open_table_head_for_write( + root_uri: &str, + table_key: &str, + table_path: &str, + branch: Option<&str>, +) -> Result { + load_table_from_namespace( + staged_table_namespace(root_uri, table_key, table_path, branch), + table_key, + branch, + None, + ) + .await +} diff --git a/crates/omnigraph/src/db/manifest/publisher.rs b/crates/omnigraph/src/db/manifest/publisher.rs index 
382b51a..d13dd08 100644 --- a/crates/omnigraph/src/db/manifest/publisher.rs +++ b/crates/omnigraph/src/db/manifest/publisher.rs @@ -24,23 +24,20 @@ use lance::Dataset; use lance::Error as LanceError; use lance::dataset::{MergeInsertBuilder, WhenMatched, WhenNotMatched}; use lance_namespace::NamespaceError; -#[cfg(test)] use lance_namespace::models::CreateTableVersionRequest; use crate::error::{OmniError, Result}; -#[cfg(test)] -use super::SubTableUpdate; use super::layout::{open_manifest_dataset, tombstone_object_id, version_object_id}; use super::metadata::parse_namespace_version_request; use super::migrations::migrate_internal_schema; use super::state::{ - GraphLineageRow, GraphLineageRowPart, graph_lineage_row_parts, head_lineage_row, - manifest_rows_batch, manifest_schema, read_publish_scan, + manifest_rows_batch, manifest_schema, read_manifest_entries, read_registered_table_locations, + read_tombstone_versions, }; use super::{ ManifestChange, OBJECT_TYPE_TABLE, OBJECT_TYPE_TABLE_TOMBSTONE, OBJECT_TYPE_TABLE_VERSION, - SubTableEntry, TableRegistration, TableTombstone, + SubTableEntry, SubTableUpdate, TableRegistration, TableTombstone, }; /// Bound on the publisher-level retry loop that wraps Lance's row-level CAS @@ -50,48 +47,13 @@ use super::{ /// iteration re-runs `load_publish_state` and the expected-version pre-check. const PUBLISHER_RETRY_BUDGET: u32 = 5; -/// The graph-lineage commit to record atomically with a manifest publish -/// (RFC-013 Phase 7). One logical commit per publish: the `graph_commit_id` is -/// minted once by the caller and stays stable across the publisher's CAS -/// retries; only the parent re-resolves per attempt (against the freshly loaded -/// `__manifest`), so a retry after a concurrent commit parents off the new head -/// — the TOCTOU the dual-write era's `commit_graph.refresh()` guarded is closed -/// by construction. 
-#[derive(Debug, Clone)] -pub(crate) struct LineageIntent { - /// ULID minted once before the publish loop; the graph commit's identity. - pub graph_commit_id: String, - /// The branch this commit lands on (`None` = main). Selects the - /// `graph_head:` pointer row that gets updated. - pub branch: Option, - /// Authoring actor, or `None` for unauthored / system writes. - pub actor_id: Option, - /// The merged-in source head — `Some` only for a branch-merge commit. - pub merged_parent_commit_id: Option, - /// Commit timestamp (microseconds since the UNIX epoch). - pub created_at: i64, -} - -/// The result of a manifest publish that may have folded in a graph commit. -#[derive(Debug)] -pub(super) struct PublishOutcome { - /// The advanced `__manifest` dataset (its version is the published version). - pub dataset: Dataset, - /// The parent the publisher resolved for the recorded commit, if a - /// [`LineageIntent`] was supplied. Returned so the caller can update its - /// in-memory commit cache without a re-read. `None` when no lineage was - /// recorded, or when the commit is the genesis (no parent). - pub parent_commit_id: Option, -} - #[async_trait] pub(super) trait ManifestBatchPublisher: Send + Sync { async fn publish( &self, changes: &[ManifestChange], expected_table_versions: &HashMap, - lineage: Option<&LineageIntent>, - ) -> Result; + ) -> Result; } pub(super) struct GraphNamespacePublisher { @@ -111,19 +73,6 @@ struct PendingVersionRow { row_count: Option, } -/// Everything one CAS attempt needs out of a single `__manifest` scan -/// (RFC-013 P2): the open dataset, table state for the pre-check + pending-row -/// build, and the `graph_commit` lineage rows for parent resolution. Folding the -/// lineage into this struct is what lets `resolve_lineage_rows` skip its own -/// `read_graph_lineage` scan. 
-struct LoadedPublishState { - dataset: Dataset, - registered_tables: HashMap, - existing_versions: HashMap<(String, u64), SubTableEntry>, - existing_tombstones: HashMap<(String, u64), ()>, - lineage_rows: Vec, -} - impl GraphNamespacePublisher { pub(super) fn new(root_uri: &str, branch: Option<&str>) -> Self { Self { @@ -138,31 +87,22 @@ impl GraphNamespacePublisher { open_manifest_dataset(&self.root_uri, self.branch.as_deref()).await } - async fn load_publish_state(&self) -> Result { - // Test seam: inject a retryable contention here to exercise the outer - // retry loop's re-run-on-retryable-load-error path (no-op without the - // `failpoints` feature). The migration surfaces the same typed error. - crate::failpoints::maybe_fail_retryable_contention( - crate::failpoints::names::PUBLISH_LOAD_STATE_RETRYABLE_CONTENTION, - )?; + async fn load_publish_state( + &self, + ) -> Result<( + Dataset, + HashMap, + HashMap<(String, u64), SubTableEntry>, + HashMap<(String, u64), ()>, + )> { let mut dataset = self.dataset().await?; // Run pending internal-schema migrations exactly once per publish on // the open-for-write path; idempotent when the on-disk stamp already - // matches this binary. Pass this publisher's branch so the v3→v4 lineage - // backfill reads `_graph_commits.lance` at the SAME branch it is - // publishing to (each branch backfills on its first write). See - // `db/manifest/migrations.rs`. - migrate_internal_schema(&mut dataset, &self.root_uri, self.branch.as_deref()).await?; - // ONE `__manifest` scan for everything the publish needs: table - // locations, version entries, tombstones, AND the `graph_commit` lineage - // rows for parent resolution (RFC-013 P2). The lineage extraction rides - // this pass instead of a second `read_graph_lineage` scan in - // `resolve_lineage_rows`; the per-attempt re-read is preserved because - // `load_publish_state` runs once per CAS attempt, so a retry sees the - // advanced head and re-parents correctly. 
- let scan = read_publish_scan(&dataset).await?; - let existing_versions = scan - .version_entries + // matches this binary. See `db/manifest/migrations.rs`. + migrate_internal_schema(&mut dataset).await?; + let registered_tables = read_registered_table_locations(&dataset).await?; + let existing_entries = read_manifest_entries(&dataset).await?; + let existing_versions = existing_entries .iter() .map(|entry| { ( @@ -171,14 +111,13 @@ impl GraphNamespacePublisher { ) }) .collect(); - let existing_tombstones = scan.tombstones.into_iter().collect(); - Ok(LoadedPublishState { + let existing_tombstones = read_tombstone_versions(&dataset).await?; + Ok(( dataset, - registered_tables: scan.table_locations, + registered_tables, existing_versions, existing_tombstones, - lineage_rows: scan.lineage_rows, - }) + )) } fn build_pending_rows( @@ -324,50 +263,6 @@ impl GraphNamespacePublisher { Ok(rows) } - /// Resolve the parent for `intent` against the just-loaded `dataset` and - /// build the two lineage rows (`graph_commit` + `graph_head:`) to - /// fold into the publish batch. Runs INSIDE the CAS retry loop, so the - /// parent is read from the manifest state this attempt will commit against — - /// a retry after a concurrent commit re-reads the advanced head and parents - /// correctly (TOCTOU closed). `new_manifest_version` is the version this - /// publish produces (the recorded commit pins it). - /// - /// The parent is the current head of the branch's lineage — the - /// `should_replace_head` winner over the visible `graph_commit` rows, the - /// same selection the commit-graph cache uses. (The denormalized - /// `graph_head:` row is written for forward-compat but is not the - /// parent source here: a branch freshly forked from main inherits main's - /// commits but not yet a `graph_head:` row, and the head-over-rows - /// computation gives the correct fork-point parent in that case.) 
- /// - /// `lineage_rows` is the `graph_commit` set this attempt already parsed in - /// `load_publish_state`'s single scan (RFC-013 P2) — NOT a fresh - /// `read_graph_lineage` scan. The per-attempt re-read is still preserved: the - /// retry loop re-runs `load_publish_state`, so each attempt's `lineage_rows` - /// reflects the head as it stands for that attempt. - fn resolve_lineage_rows( - lineage_rows: &[GraphLineageRow], - intent: &LineageIntent, - new_manifest_version: u64, - ) -> Result<(Vec, Option)> { - let parent_commit_id = head_lineage_row(lineage_rows).map(|h| h.graph_commit_id.clone()); - - let commit = GraphLineageRow { - graph_commit_id: intent.graph_commit_id.clone(), - manifest_branch: intent.branch.clone(), - manifest_version: new_manifest_version, - parent_commit_id: parent_commit_id.clone(), - merged_parent_commit_id: intent.merged_parent_commit_id.clone(), - actor_id: intent.actor_id.clone(), - created_at: intent.created_at, - }; - let parts = graph_lineage_row_parts(&commit, intent.branch.as_deref())?; - Ok(( - parts.into_iter().map(lineage_part_to_pending).collect(), - parent_commit_id, - )) - } - fn pending_rows_to_batch(rows: Vec) -> Result { let mut object_ids = Vec::with_capacity(rows.len()); let mut object_types = Vec::with_capacity(rows.len()); @@ -486,12 +381,6 @@ impl GraphNamespacePublisher { // the publisher loop above, where each attempt re-runs the pre-check. merge_builder.conflict_retries(0); merge_builder.use_index(false); - // Skip Lance's auto-cleanup hook: `__manifest` versions are the snapshot - // / time-travel authority and must never be GC'd by Lance's per-commit - // hook. A `__manifest` created before the v7 bump (6.0.1 defaulted - // auto_cleanup ON) still carries the stored config, so this skip is - // load-bearing on upgraded graphs, not just defensive. - merge_builder.skip_auto_cleanup(true); let (new_dataset, _stats) = merge_builder .try_build() .map_err(|e| OmniError::Lance(e.to_string()))? 
@@ -501,7 +390,6 @@ impl GraphNamespacePublisher { Ok(Arc::try_unwrap(new_dataset).unwrap_or_else(|arc| (*arc).clone())) } - #[cfg(test)] pub(super) async fn publish_requests( &self, requests: &[CreateTableVersionRequest], @@ -522,25 +410,7 @@ impl GraphNamespacePublisher { })) }) .collect::>>()?; - Ok(self.publish(&changes, &HashMap::new(), None).await?.dataset) - } -} - -/// Map a `state::GraphLineageRowPart` onto a `PendingVersionRow` so a graph -/// commit's two lineage rows ride the same publish batch as the table-version -/// rows (RFC-013 Phase 7). Lineage rows carry no table identity: `table_key` is -/// the empty string (never matched by a real key) and `location`/`row_count` -/// are null. -fn lineage_part_to_pending(part: GraphLineageRowPart) -> PendingVersionRow { - PendingVersionRow { - object_id: part.object_id, - object_type: part.object_type.to_string(), - location: None, - metadata: Some(part.metadata), - table_key: String::new(), - table_version: part.table_version, - table_branch: part.table_branch, - row_count: None, + self.publish(&changes, &HashMap::new()).await } } @@ -549,17 +419,7 @@ fn lineage_part_to_pending(part: GraphLineageRowPart) -> PendingVersionRow { /// merge-insert join key, annotated as an unenforced primary key on /// `__manifest`). Translate it to a typed manifest conflict so callers can /// match without parsing strings; everything else is opaque storage. -/// -/// Shared (`pub(crate)`) with the v3→v4 lineage backfill -/// (`state::merge_lineage_rows`), which issues its own `__manifest` merge-insert -/// outside the publisher and must surface the SAME typed -/// `RowLevelCasContention` so the migration's re-open retry loop can recognize a -/// CAS loss. This is the merge-insert (`execute_reader`) conflict vocabulary -/// only. 
It is deliberately NOT `optimize::is_retryable_lance_conflict`: that one -/// also matches `CommitConflict`/`RetryableCommitConflict` from the COMPACTION -/// commit path (`compact_files` -> `apply_commit`), which a row-level merge-insert -/// never emits — folding it in here would match impossible variants. -pub(crate) fn map_lance_publish_error(err: LanceError) -> OmniError { +fn map_lance_publish_error(err: LanceError) -> OmniError { if matches!(err, LanceError::TooMuchWriteContention { .. }) { return OmniError::manifest_row_level_cas_contention(format!( "manifest publish lost a row-level CAS race: {}", @@ -575,40 +435,14 @@ impl ManifestBatchPublisher for GraphNamespacePublisher { &self, changes: &[ManifestChange], expected_table_versions: &HashMap, - lineage: Option<&LineageIntent>, - ) -> Result { - if changes.is_empty() && expected_table_versions.is_empty() && lineage.is_none() { - return Ok(PublishOutcome { - dataset: self.dataset().await?, - parent_commit_id: None, - }); + ) -> Result { + if changes.is_empty() && expected_table_versions.is_empty() { + return self.dataset().await; } for attempt in 0..=PUBLISHER_RETRY_BUDGET { - // `load_publish_state` runs the v3→v4 migration (`migrate_internal_schema`) - // on its first scan. The migration's bounded merge/stamp retries surface a - // retryable `RowLevelCasContention` on exhaustion EXPECTING this outer loop - // to re-run them — a re-run re-reads the manifest, by which point a - // concurrent winner has usually completed the migration (next scan is a - // no-op). Route a retryable load error through the SAME retry path as a - // retryable `merge_rows` conflict below, so that typed contention actually - // composes with the publisher retry instead of aborting the publish. 
- let loaded = match self.load_publish_state().await { - Ok(loaded) => loaded, - Err(err) - if attempt < PUBLISHER_RETRY_BUDGET && is_retryable_publish_conflict(&err) => - { - continue; - } - Err(err) => return Err(err), - }; - let LoadedPublishState { - dataset, - registered_tables: known_tables, - existing_versions, - existing_tombstones, - lineage_rows, - } = loaded; + let (dataset, known_tables, existing_versions, existing_tombstones) = + self.load_publish_state().await?; let latest_per_table = Self::latest_visible_per_table(&existing_versions, &existing_tombstones); @@ -617,48 +451,19 @@ impl ManifestBatchPublisher for GraphNamespacePublisher { // surfaced as `ExpectedVersionMismatch` rather than retried. Self::check_expected_table_versions(&latest_per_table, expected_table_versions)?; - let mut rows = Self::build_pending_rows( + if changes.is_empty() { + return Ok(dataset); + } + + let rows = Self::build_pending_rows( changes, &known_tables, &existing_versions, &existing_tombstones, )?; - // Fold the graph commit into the SAME batch so table-version rows - // and lineage rows land in one merge-insert (one Lance commit, one - // manifest version) — no separate write, no manifest→commit-graph - // atomicity gap. The merge-insert advances exactly one version on - // top of the loaded dataset, so the commit pins - // `current + 1`. The parent is resolved here, per attempt, from the - // lineage rows THIS attempt's scan loaded (TOCTOU closed on a CAS - // retry — a retry re-runs `load_publish_state` → fresh lineage). - let parent_commit_id = match lineage { - Some(intent) => { - let new_manifest_version = dataset.version().version + 1; - let (commit_rows, parent) = - Self::resolve_lineage_rows(&lineage_rows, intent, new_manifest_version)?; - rows.extend(commit_rows); - parent - } - None => None, - }; - - if rows.is_empty() { - // Expected-version-only publish with no changes and no lineage: - // the precondition held, nothing to write. 
- return Ok(PublishOutcome { - dataset, - parent_commit_id, - }); - } - match self.merge_rows(dataset, rows).await { - Ok(new_dataset) => { - return Ok(PublishOutcome { - dataset: new_dataset, - parent_commit_id, - }); - } + Ok(new_dataset) => return Ok(new_dataset), Err(err) => { if attempt < PUBLISHER_RETRY_BUDGET && is_retryable_publish_conflict(&err) { continue; @@ -682,12 +487,7 @@ impl ManifestBatchPublisher for GraphNamespacePublisher { /// contention; if the caller's `expected_table_versions` still holds against /// the new manifest state, we re-attempt. Other conflict variants (notably /// `ExpectedVersionMismatch`) propagate so the caller learns immediately. -/// -/// Shared (`pub(crate)`) with the v3→v4 lineage backfill's re-open retry loop -/// (`migrations::migrate_v3_to_v4`), so the migration's retry decision matches the -/// publisher's by construction — both retry exactly `RowLevelCasContention` and -/// propagate everything else. -pub(crate) fn is_retryable_publish_conflict(err: &OmniError) -> bool { +fn is_retryable_publish_conflict(err: &OmniError) -> bool { matches!( err, OmniError::Manifest(m) diff --git a/crates/omnigraph/src/db/manifest/recovery.rs b/crates/omnigraph/src/db/manifest/recovery.rs index c32c493..425499a 100644 --- a/crates/omnigraph/src/db/manifest/recovery.rs +++ b/crates/omnigraph/src/db/manifest/recovery.rs @@ -2,7 +2,7 @@ //! //! This module implements the building blocks of the per-sidecar recovery //! sweep that closes the documented Phase B → Phase C residual (see -//! `docs/dev/writes.md` "Open-time recovery sweep"). The high-level shape: +//! `docs/dev/runs.md` "Open-time recovery sweep"). The high-level shape: //! //! 1. Each writer that performs a multi-table commit writes a small JSON //! sidecar at `__recovery/{ulid}.json` BEFORE its per-table @@ -25,14 +25,13 @@ //! version. Pinned by //! `tests/staged_writes.rs::lance_restore_appends_one_commit_with_checked_out_content`. //! 
- `Dataset::restore` "wins" against concurrent Append/Update/Delete/ -//! CreateIndex/Merge — see `check_restore_txn` at lance-6.0.1 +//! CreateIndex/Merge — see `check_restore_txn` at lance-4.0.0 //! `src/io/commit/conflict_resolver.rs:986`. The hazard is documented //! by `tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning`. -//! The open-time sweep sidesteps the hazard by running before any -//! other writers can race; the in-process heal -//! ([`heal_pending_sidecars_roll_forward`]) never restores (roll- -//! forward only) and guards via per-(table_key, branch) queue -//! acquisition. +//! This module sidesteps the hazard by running recovery only at +//! `Omnigraph::open` (before any other writers can race). A future +//! continuous in-process recovery reconciler will need to guard via +//! per-(table_key, branch) queue acquisition. use std::collections::HashMap; @@ -40,14 +39,17 @@ use lance::Dataset; use serde::{Deserialize, Serialize}; use tracing::warn; +use crate::db::commit_graph::CommitGraph; use crate::db::graph_coordinator::GraphCoordinator; -use crate::db::recovery_audit::{RecoveryAudit, RecoveryAuditRecord, RecoveryKind, TableOutcome}; +use crate::db::recovery_audit::{ + RecoveryAudit, RecoveryAuditRecord, RecoveryKind, TableOutcome, now_micros, +}; use crate::db::schema_state::SchemaStateRecovery; use crate::error::{OmniError, Result}; use crate::storage::StorageAdapter; use super::Snapshot; -use super::publisher::{GraphNamespacePublisher, LineageIntent, ManifestBatchPublisher}; +use super::publisher::{GraphNamespacePublisher, ManifestBatchPublisher}; use super::{ManifestChange, SubTableUpdate, TableRegistration, TableTombstone}; /// System actor identifier recorded on every recovery commit. Operators @@ -56,67 +58,13 @@ use super::{ManifestChange, SubTableUpdate, TableRegistration, TableTombstone}; /// into the audit row's `recovery_for_actor` field. 
pub(crate) const RECOVERY_ACTOR: &str = "omnigraph:recovery"; -/// Publish a recovery action's manifest `updates` AND its recovery commit in one -/// CAS (RFC-013 Phase 7). The recovery commit's lineage (`graph_commit` + -/// `graph_head`) rides the same merge-insert as the table-version re-pin — there -/// is no separate `_graph_commits.lance` write and no manifest→commit-graph gap. -/// `updates` is empty for the no-table-change recovery paths (all-NoMovement -/// roll-back, stale-sidecar cleanup, orphaned-branch discard); the lineage rows -/// still publish, so the recovery commit is always durable. -/// -/// The commit's first parent is resolved by the publisher (the live head of the -/// recovery's branch); its merged-in parent is the sidecar's recorded source -/// head for a rolled-forward branch merge, matching the pre-Phase-7 merge-commit -/// shape. Returns the new manifest version and the minted recovery commit id -/// (which the audit row references). -async fn publish_recovery_commit( - root_uri: &str, - sidecar: &RecoverySidecar, - kind: RecoveryKind, - updates: &[ManifestChange], - expected: &HashMap, -) -> Result<(u64, String)> { - let merged_parent_commit_id = match (sidecar.writer_kind, kind) { - (SidecarKind::BranchMerge, RecoveryKind::RolledForward) => { - sidecar.merge_source_commit_id.clone() - } - _ => None, - }; - let intent = LineageIntent { - graph_commit_id: ulid::Ulid::new().to_string(), - branch: sidecar.branch.clone(), - actor_id: Some(RECOVERY_ACTOR.to_string()), - merged_parent_commit_id, - created_at: crate::db::now_micros()?, - }; - let publisher = GraphNamespacePublisher::new(root_uri, sidecar.branch.as_deref()); - let outcome = publisher.publish(updates, expected, Some(&intent)).await?; - Ok((outcome.dataset.version().version, intent.graph_commit_id)) -} - /// Subdirectory under the graph root holding sidecar files. 
pub(crate) const RECOVERY_DIR_NAME: &str = "__recovery"; -/// Max sidecar JSON shape/semantics version this binary writes and understands. -/// The reader accepts every version `<= ` this and refuses only versions ABOVE -/// it (an older binary cannot guess semantics a newer writer baked in — see -/// [`SidecarSchemaError`] and [`parse_sidecar`]). Bump this whenever a change -/// alters how an existing field is *interpreted* (not merely adds an optional -/// one), and add a fixed `*_SCHEMA_VERSION` floor like the one below so older -/// generations keep their original semantics. -/// -/// v1 → v2: Phase-B confirmation. A `BranchMerge` sidecar at v2 carries -/// `confirmed_version` and is classified strictly (unconfirmed ⇒ partial ⇒ roll -/// back); at v1 it predates confirmation and keeps the loose roll-forward. The -/// reader must distinguish the two, so this is a real version bump, not an -/// additive field. -pub(crate) const SIDECAR_SCHEMA_VERSION: u32 = 2; - -/// The version at which Phase-B confirmation shipped. A `BranchMerge` sidecar is -/// confirmation-aware (strict classification) iff `schema_version >=` this. -/// FIXED at 2 — NOT derived from [`SIDECAR_SCHEMA_VERSION`] — so a future bump to -/// v3+ still treats v2 sidecars as confirmation-aware. -pub(crate) const CONFIRMATION_SCHEMA_VERSION: u32 = 2; +/// Current sidecar JSON shape version. Bumping this is a breaking change: +/// older binaries will refuse to interpret newer sidecars (intentional — +/// see [`SidecarSchemaError`]). +pub(crate) const SIDECAR_SCHEMA_VERSION: u32 = 1; /// Selects which recovery actions are allowed in a sweep. /// @@ -158,60 +106,6 @@ pub(crate) enum SidecarKind { BranchMerge, /// `ensure_indices_for_branch` — index lifecycle commits. EnsureIndices, - /// `optimize_all_tables` — Lance `compact_files` (reserve-fragments + - /// rewrite commits) followed by a manifest publish of the compacted - /// version. 
Loose-match like the other multi-commit writers; roll-forward - /// is always safe because compaction is content-preserving (Lance - /// `Operation::Rewrite` "reorganizes data without semantic modification"). - Optimize, -} - -/// Which recovery-classification semantics a sidecar's tables use. Resolved once -/// from `(writer_kind, schema_version)` — see [`SidecarKind::classification_mode`] -/// — so [`classify_table`] dispatches on the mode instead of re-deriving it from -/// a kind×version match. Adding a writer kind or a version floor is then one arm -/// in the resolver, not a guard threaded through `classify_table`. -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -pub(crate) enum ClassificationMode { - /// Exactly one `commit_staged` per table (`Mutation`, `Load`): require - /// `lance_head == manifest_pinned + 1` and the pin to match. - Strict, - /// N ≥ 1 commits per table whose drift is content-preserving / derived - /// state (`SchemaApply`, `EnsureIndices`, `Optimize`, and pre-confirmation - /// `BranchMerge`): any `lance_head > manifest_pinned` rolls forward. - Loose, - /// Multi-commit publish of *distinct logical rows* with a recorded - /// `confirmed_version` (`BranchMerge` at `schema_version >= - /// CONFIRMATION_SCHEMA_VERSION`): roll forward ONLY to the confirmed - /// version; an unconfirmed moved HEAD is a partial publish and rolls back. - Confirmed, -} - -impl SidecarKind { - /// Resolve the classification mode for this writer at a given sidecar - /// `schema_version`. Exhaustive over `SidecarKind`, so adding a variant is a - /// compile error here until its recovery semantics are declared. - pub(crate) fn classification_mode(self, schema_version: u32) -> ClassificationMode { - match self { - SidecarKind::Mutation | SidecarKind::Load => ClassificationMode::Strict, - // BranchMerge gained two-phase confirmation at - // `CONFIRMATION_SCHEMA_VERSION`. 
A sidecar written before that - // carries no `confirmed_version` and must keep the prior loose - // roll-forward — classifying it strictly would misread a *completed* - // pre-upgrade merge as a partial and roll it back. (The read gate - // already refused any version newer than this binary.) - SidecarKind::BranchMerge => { - if schema_version >= CONFIRMATION_SCHEMA_VERSION { - ClassificationMode::Confirmed - } else { - ClassificationMode::Loose - } - } - SidecarKind::SchemaApply | SidecarKind::EnsureIndices | SidecarKind::Optimize => { - ClassificationMode::Loose - } - } - } } /// One table's contribution to a sidecar's intended commit. The classifier @@ -225,22 +119,8 @@ pub(crate) struct SidecarTablePin { /// Manifest-pinned version at writer start (CAS expectation). pub expected_version: u64, /// Lance HEAD that the writer's `commit_staged` would produce - /// (typically `expected_version + 1`). For multi-commit writers this is - /// only a *lower bound* — see `confirmed_version`. + /// (typically `expected_version + 1`). pub post_commit_pin: u64, - /// Phase-B confirmation: the exact Lance HEAD this table reached once the - /// writer's *entire* multi-commit publish for it finished, recorded by a - /// second sidecar write immediately before the manifest publish (Phase C). - /// `None` means Phase B did not complete (the writer crashed mid-publish), - /// so the on-disk drift is a *partial* commit and recovery must roll the - /// whole operation BACK rather than publish an incomplete state. Only the - /// `BranchMerge` writer records this today (its per-table publish is - /// append → upsert → delete, several HEAD advances that the manifest - /// publish makes atomic); other writers leave it `None` and keep their - /// existing loose roll-forward. Backward-compatible: absent on older - /// sidecars. 
- #[serde(default, skip_serializing_if = "Option::is_none")] - pub confirmed_version: Option, /// Lance branch ref this table lives on (mirrors /// `SubTableEntry::table_branch`). Required for the recovery sweep /// to open the dataset at the correct ref — `Dataset::open(path)` @@ -331,27 +211,25 @@ pub(crate) struct RecoverySidecarHandle { pub(crate) sidecar_uri: String, } -/// Error returned when the sidecar's `schema_version` is NEWER than this binary -/// understands. We refuse-and-error rather than read-and-warn: an old binary -/// cannot guess what semantics a newer writer baked into a future shape. (Older -/// versions are accepted and interpreted with their original semantics — see -/// [`parse_sidecar`].) Operator action is required (typically: upgrade the -/// binary). +/// Error returned when the sidecar's `schema_version` is unknown to this +/// binary. We refuse-and-error rather than read-and-warn: an old binary +/// cannot guess what semantics a newer writer baked into a future shape. +/// Operator action is required (typically: upgrade the binary). #[derive(Debug)] pub(crate) struct SidecarSchemaError { pub sidecar_uri: String, pub found_version: u32, - pub max_supported_version: u32, + pub supported_version: u32, } impl std::fmt::Display for SidecarSchemaError { fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { write!( f, - "recovery sidecar at '{}' declares schema_version={}, newer than the \ - maximum this binary supports (schema_version={}); refusing to interpret \ + "recovery sidecar at '{}' declares schema_version={}, but this \ + binary supports only schema_version={}; refusing to interpret \ — upgrade omnigraph or remove the sidecar with operator review", - self.sidecar_uri, self.found_version, self.max_supported_version, + self.sidecar_uri, self.found_version, self.supported_version, ) } } @@ -386,14 +264,6 @@ pub(crate) enum TableClassification { /// previous restore attempt or an external mutation. 
Roll back to /// the manifest pin. UnexpectedMultistep, - /// A confirmation-using writer (`BranchMerge`) advanced this table's HEAD - /// (`lance_head > manifest_pinned`) but the sidecar carries no - /// `confirmed_version` — its multi-commit publish crashed mid-flight, so - /// the drift is a *partial* commit (e.g. an append without its sibling - /// upsert/delete). Roll back to the manifest pin; the whole operation is - /// re-run from scratch. Distinct from `UnexpectedMultistep` so the audit - /// records a partial Phase B, not a foreign mutation. - IncompletePhaseB, /// `lance_head < manifest_pinned`. Should be impossible: the manifest /// pin can only advance after a successful Lance commit. Surface /// loudly and abort recovery. @@ -440,8 +310,8 @@ pub(crate) fn sidecar_uri(root_uri: &str, operation_id: &str) -> String { /// Write a sidecar atomically and return a handle for later deletion. /// /// The atomicity contract is inherited from [`StorageAdapter::write_text`]: -/// the local backend publishes via a staged temp file + rename (atomic on -/// POSIX); object stores write via PutObject (atomic at the object level). +/// LocalStorageAdapter writes via `tokio::fs::write` (whole-file replace); +/// S3StorageAdapter writes via PutObject (atomic at the object level). /// Both are sufficient for sidecar semantics — readers either see the /// complete sidecar or none. pub(crate) async fn write_sidecar( @@ -449,9 +319,6 @@ pub(crate) async fn write_sidecar( storage: &dyn StorageAdapter, sidecar: &RecoverySidecar, ) -> Result { - // Failpoint: models a storage put failure (S3 PutObject / fs write) - // in Phase A — every writer must abort before any HEAD advance. 
- crate::failpoints::maybe_fail(crate::failpoints::names::RECOVERY_SIDECAR_WRITE)?; debug_assert_eq!(sidecar.schema_version, SIDECAR_SCHEMA_VERSION); let uri = sidecar_uri(root_uri, &sidecar.operation_id); let json = serde_json::to_string_pretty(sidecar).map_err(|err| { @@ -464,67 +331,11 @@ pub(crate) async fn write_sidecar( }) } -/// Phase-B confirmation: stamp each pin with the exact Lance HEAD its publish -/// reached, then re-write the sidecar in place (same object). Called once, after -/// the writer's whole multi-commit publish completed and before the manifest -/// publish (Phase C). Recovery then rolls forward ONLY to these confirmed -/// versions; a sidecar still missing them is a partial Phase B that rolls back. -/// -/// Overwriting the same object is atomic (same contract as [`write_sidecar`]): -/// a torn rewrite is never observed, so recovery reads either the pre-confirm -/// sidecar (→ roll back, safe) or the confirmed one (→ roll forward). A failure -/// here leaves the pre-confirm sidecar, so the operation rolls back — correct. -/// -/// SURVIVES the fragment-adopt work (unlike the row-level merge it currently -/// serves — see `AdoptDelta` in `exec/merge.rs`). The recovery sidecar is the -/// cross-table write-ahead log that makes a fast-forward-main commit -/// all-or-nothing across N tables, which a fragment graft still needs. What -/// narrows is the *within-table* reason for confirmation: once each table's -/// merge is a single graft commit, the multi-step partial window shrinks to one -/// commit, so the `BranchMerge` arm of `classify_table` could fold back into the -/// strict single-commit path and `IncompletePhaseB` retire. Do NOT delete this -/// with the row path — keep the sidecar; only simplify the classifier. 
-pub(crate) async fn confirm_sidecar_phase_b( - root_uri: &str, - storage: &dyn StorageAdapter, - sidecar: &mut RecoverySidecar, - confirmed_versions: &HashMap, -) -> Result<()> { - // Failpoint: models a storage failure on the confirmation write — the - // pre-confirm sidecar stays on disk, so recovery rolls the operation back. - crate::failpoints::maybe_fail(crate::failpoints::names::RECOVERY_SIDECAR_CONFIRM)?; - for pin in &mut sidecar.tables { - // Every pinned table MUST have an achieved version. A miss means the - // pin set and the publish `updates` diverged — fail loudly at the - // producer rather than leave the pin unconfirmed, which recovery would - // read as a partial Phase B and silently roll the whole (complete) merge - // back. Today the two are kept in lockstep by construction; this guards - // the invariant against a future edit to either filter. - let version = confirmed_versions.get(&pin.table_key).ok_or_else(|| { - OmniError::manifest_internal(format!( - "confirm_sidecar_phase_b: no achieved version for pinned table '{}' \ - (pins and publish updates diverged)", - pin.table_key - )) - })?; - pin.confirmed_version = Some(*version); - } - let uri = sidecar_uri(root_uri, &sidecar.operation_id); - let json = serde_json::to_string_pretty(sidecar).map_err(|err| { - OmniError::manifest_internal(format!("failed to serialize recovery sidecar: {}", err)) - })?; - storage.write_text(&uri, &json).await -} - /// Delete a sidecar after Phase C succeeded. Idempotent (safe to retry). pub(crate) async fn delete_sidecar( handle: &RecoverySidecarHandle, storage: &dyn StorageAdapter, ) -> Result<()> { - // Failpoint: models a storage delete failure (S3 DeleteObject) in - // Phase D — callers swallow it (the write already published) and the - // stale sidecar is healed by the next write or open. 
- crate::failpoints::maybe_fail(crate::failpoints::names::RECOVERY_SIDECAR_DELETE)?; storage.delete(&handle.sidecar_uri).await } @@ -539,10 +350,6 @@ pub(crate) async fn list_sidecars( root_uri: &str, storage: &dyn StorageAdapter, ) -> Result> { - // Failpoint: models a storage list failure (S3 ListObjectsV2) — every - // consumer (open-time sweep, write-entry heal) must fail loudly - // rather than silently skipping recovery. - crate::failpoints::maybe_fail(crate::failpoints::names::RECOVERY_SIDECAR_LIST)?; let dir = recovery_dir_uri(root_uri); let mut uris = storage.list_dir(&dir).await?; // Sort by URI so the sweep processes sidecars deterministically. @@ -583,15 +390,11 @@ pub(crate) fn parse_sidecar(sidecar_uri: &str, body: &str) -> Result SIDECAR_SCHEMA_VERSION { + if peek.schema_version != SIDECAR_SCHEMA_VERSION { return Err(SidecarSchemaError { sidecar_uri: sidecar_uri.to_string(), found_version: peek.schema_version, - max_supported_version: SIDECAR_SCHEMA_VERSION, + supported_version: SIDECAR_SCHEMA_VERSION, } .into()); } @@ -606,38 +409,24 @@ pub(crate) fn parse_sidecar(sidecar_uri: &str, body: &str) -> Result manifest_pinned` is ambiguous — it may be a *complete* -/// publish or a *partial* one crashed mid-sequence. The writer resolves -/// the ambiguity by recording the exact achieved version -/// (`confirmed_version`) only after the whole publish finished. So roll -/// forward ONLY to that confirmed version; a missing confirmation is a -/// partial commit (`IncompletePhaseB`) and rolls back. This is the safe -/// form of the loose match for writers where a partial would publish an -/// incomplete delta. /// - **Strict** (`Mutation`, `Load`): exactly one `commit_staged` per /// table, so `lance_head == manifest_pinned + 1` AND /// `post_commit_pin == lance_head` is required. 
-/// - **Loose** (`SchemaApply`, `EnsureIndices`, `Optimize`): the writer -/// advances the Lance HEAD by N ≥ 1 commits per table (one per index -/// built + one for the overwrite, etc.; `Optimize` runs `compact_files`, -/// which commits reserve-fragments + rewrite) and the exact N is hard to -/// compute at sidecar-write time. The loose match accepts +/// - **Loose** (`SchemaApply`, `EnsureIndices`, `BranchMerge`): the +/// writer may run N ≥ 1 `commit_staged` calls per table (one per +/// index built + one for the overwrite, etc.; merge tables run +/// merge_insert + delete_where + index rebuilds) and the exact N +/// is hard to compute at sidecar-write time. The loose match accepts /// any `lance_head > manifest_pinned` as `RolledPastExpected` when /// `pin.expected_version == manifest_pinned` (the writer's CAS -/// target matches what the manifest currently shows). This is safe for -/// these writers because their drift is derived state (index coverage, -/// compaction) the reconciler reproduces — a partial roll-forward loses -/// no logical rows. The risk it admits — an external agent advancing HEAD -/// between sidecar write and recovery — is out of scope for the -/// single-coordinator model. +/// target matches what the manifest currently shows). The risk this +/// admits — an external agent advancing HEAD between sidecar write +/// and recovery — is out of scope for the single-coordinator model. pub(crate) fn classify_table( pin: &SidecarTablePin, lance_head: u64, manifest_pinned: u64, kind: SidecarKind, - schema_version: u32, ) -> TableClassification { use TableClassification::*; if lance_head < manifest_pinned { @@ -648,49 +437,27 @@ pub(crate) fn classify_table( if lance_head == manifest_pinned { return NoMovement; } - // lance_head > manifest_pinned. The "which semantics" decision is resolved - // once from (kind, schema_version); dispatch on it. 
- match kind.classification_mode(schema_version) { - ClassificationMode::Confirmed => { - // Two-phase confirmation: roll forward only to the exact version the - // writer recorded after its whole multi-commit publish completed. No - // confirmation ⇒ the publish crashed mid-sequence ⇒ partial ⇒ roll - // back. A confirmation that doesn't match the observed HEAD means a - // foreign writer advanced the table — don't roll a surprise forward. - match pin.confirmed_version { - Some(confirmed) - if lance_head == confirmed && pin.expected_version == manifest_pinned => - { - RolledPastExpected - } - Some(_) => UnexpectedMultistep, - None => IncompletePhaseB, - } - } - ClassificationMode::Strict => { - if lance_head == manifest_pinned + 1 { - if pin.expected_version == manifest_pinned && pin.post_commit_pin == lance_head { - RolledPastExpected - } else { - UnexpectedAtP1 - } - } else { - // lance_head > manifest_pinned + 1 - UnexpectedMultistep - } - } - ClassificationMode::Loose => { - // Multi-commit writers whose drift is content-preserving / derived - // state (and pre-confirmation BranchMerge sidecars): any - // `lance_head > manifest_pinned` rolls forward when the CAS target - // matches what the manifest currently shows. - if pin.expected_version == manifest_pinned { + // lance_head > manifest_pinned + let strict = matches!(kind, SidecarKind::Mutation | SidecarKind::Load); + if strict { + if lance_head == manifest_pinned + 1 { + if pin.expected_version == manifest_pinned && pin.post_commit_pin == lance_head { RolledPastExpected - } else if lance_head == manifest_pinned + 1 { - UnexpectedAtP1 } else { - UnexpectedMultistep + UnexpectedAtP1 } + } else { + // lance_head > manifest_pinned + 1 + UnexpectedMultistep + } + } else { + // Loose match for multi-commit writers (SchemaApply, EnsureIndices). 
+ if pin.expected_version == manifest_pinned { + RolledPastExpected + } else if lance_head == manifest_pinned + 1 { + UnexpectedAtP1 + } else { + UnexpectedMultistep } } } @@ -709,7 +476,7 @@ pub(crate) fn decide(classifications: &[TableClassification]) -> SidecarDecision } if classifications .iter() - .any(|c| matches!(c, NoMovement | UnexpectedAtP1 | UnexpectedMultistep | IncompletePhaseB)) + .any(|c| matches!(c, NoMovement | UnexpectedAtP1 | UnexpectedMultistep)) { return RollBack; } @@ -727,12 +494,9 @@ pub(crate) fn decide(classifications: &[TableClassification]) -> SidecarDecision /// Skipping the restore in those cases would leave Lance HEAD ahead of /// the manifest with no recovery artifact left. /// -/// Cost: a successful roll-back appends one restore commit and then publishes -/// the manifest to match (`roll_back_sidecar`), so the table converges -/// (`manifest == HEAD`) in one pass. Only repeated crashes *between* the restore -/// and that publish (rare) accumulate extra restore commits; each re-classified -/// roll-back restores again and `omnigraph cleanup` reclaims the surplus. -/// Bounded by the number of interrupted recovery iterations — typically 0. +/// Cost: under repeated mid-rollback crashes (rare), Lance HEAD +/// accumulates extra restore commits that `omnigraph cleanup` reclaims. +/// Bounded by the number of recovery iterations — typically 1. pub(crate) async fn restore_table_to_version( table_path: &str, branch: Option<&str>, @@ -759,241 +523,6 @@ pub(crate) async fn restore_table_to_version( Ok(()) } -/// In-process heal for pending recovery sidecars — the entry point for -/// long-lived handles (`Omnigraph::refresh` and the staged-write entry -/// points `load_as` / `mutate_as`). -/// -/// Steady-state cost is one `list_dir` of `__recovery/` (typically -/// empty → immediate return), so write entry points can afford to call -/// this on every request. 
When sidecars exist, each is processed in -/// `RecoveryMode::RollForwardOnly`: the common Phase B → Phase C -/// residual (per-table `commit_staged` landed, manifest publish did -/// not) rolls forward in-process; rollback-eligible or invariant- -/// violating sidecars are deferred to the next ReadWrite open, exactly -/// as `Omnigraph::refresh` documents. -/// -/// Concurrency: unlike the open-time sweep, this runs while other -/// writers may be in flight. Every sidecar writer (mutation/load -/// finalize, schema_apply, branch_merge, ensure_indices, optimize) -/// acquires its per-`(table_key, table_branch)` write queues *before* -/// `write_sidecar` and holds them until after `delete_sidecar` — so -/// acquiring the same queues here blocks until that writer either -/// finished (sidecar deleted; the existence re-check skips it) or died -/// (sidecar is genuinely orphaned; safe to process). Without this, the -/// heal could observe a live writer's sidecar in its commit→publish -/// window, roll it forward, and fail that writer's own publish CAS. -/// Lock order is queues → coordinator, matching every writer's -/// commit→publish path. -/// -/// The schema-staging reconcile runs lazily, per SchemaApply sidecar, -/// AFTER that sidecar's queue guards are held and its existence is -/// re-confirmed — never up front. An up-front reconcile can promote a -/// LIVE schema apply's staging files and steal its commit (pinned by -/// `tests/failpoints.rs::heal_does_not_promote_live_schema_apply_staging`). -/// -/// Returns `true` when at least one sidecar was processed (the caller -/// should invalidate per-snapshot caches). 
-pub(crate) async fn heal_pending_sidecars_roll_forward( - root_uri: &str, - storage: std::sync::Arc, - coordinator: &tokio::sync::RwLock, - write_queue: &crate::db::write_queue::WriteQueueManager, -) -> Result { - let sidecars = list_sidecars(root_uri, storage.as_ref()).await?; - if sidecars.is_empty() { - return Ok(false); - } - let mut processed_any = false; - for sidecar in sidecars { - // Serialize against a possibly-live writer (see fn docs). Guards - // are scoped per sidecar so two sidecars never hold queues - // simultaneously (no cross-sidecar lock-order surface). - let mut queue_keys: Vec = sidecar - .tables - .iter() - .map(|pin| (pin.table_key.clone(), pin.table_branch.clone())) - .collect(); - let is_schema_apply = matches!(sidecar.writer_kind, SidecarKind::SchemaApply); - if is_schema_apply { - // A SchemaApply sidecar's per-table pins don't cover a - // registration-only migration (no existing tables touched, - // but staging files + a sidecar on disk). The schema-apply - // writer holds this serialization key from before its - // sidecar write until after its sidecar delete, so blocking - // on it — then re-checking sidecar existence — guarantees - // the writer is gone before the reconcile below mutates - // schema staging. - queue_keys.push(schema_apply_serial_queue_key()); - } - let _guards = write_queue.acquire_many(&queue_keys).await; - // Re-check after the wait: the writer we blocked on may have - // completed Phase C and deleted its sidecar. - if !storage - .exists(&sidecar_uri(root_uri, &sidecar.operation_id)) - .await? - { - continue; - } - // Schema-staging reconcile, per SchemaApply sidecar, UNDER the - // sidecar's guards: a sidecar still on disk after the queue wait - // belongs to a dead writer, so promoting its staging files can no - // longer race the live apply's own renames or steal its commit. 
- // It also re-runs per sidecar, so a multi-sidecar pass never - // classifies against a reconcile result an earlier roll-forward - // staled. Non-SchemaApply sidecars never consult the value. - let schema_state_recovery = if is_schema_apply { - let snapshot = { - let mut coord = coordinator.write().await; - coord.refresh().await?; - coord.snapshot() - }; - crate::db::schema_state::recover_schema_state_files( - root_uri, - std::sync::Arc::clone(&storage), - &snapshot, - ) - .await? - } else { - SchemaStateRecovery::Noop - }; - // Fresh per-branch snapshot — same rationale as - // `recover_manifest_drift`: classify against the branch the - // sidecar's writer targeted, refreshed after any prior - // sidecar's roll-forward. - let branch_snapshot = match sidecar.branch.as_deref() { - Some(b) => { - // Orphan check against the manifest's branch list (the - // authority) BEFORE opening: a deferred sidecar whose - // branch was deleted would otherwise wedge every write - // on the dead-branch open. - let branch_exists = { - let mut coord = coordinator.write().await; - coord.refresh().await?; - coord.all_branches().await?.iter().any(|name| name == b) - }; - if !branch_exists { - discard_orphaned_branch_sidecar(root_uri, storage.as_ref(), &sidecar).await?; - processed_any = true; - continue; - } - let mut branch_coord = - GraphCoordinator::open_branch(root_uri, b, std::sync::Arc::clone(&storage)) - .await?; - branch_coord.refresh().await?; - branch_coord.snapshot() - } - None => { - let mut coord = coordinator.write().await; - coord.refresh().await?; - coord.snapshot() - } - }; - if process_sidecar( - root_uri, - &storage, - &branch_snapshot, - &sidecar, - RecoveryMode::RollForwardOnly, - schema_state_recovery, - ) - .await? - { - processed_any = true; - } - } - // Re-read coordinator state so the caller's handle observes the - // post-heal manifest. 
- coordinator.write().await.refresh().await?; - Ok(processed_any) -} - -/// Discard a sidecar whose branch no longer exists in the manifest (the -/// authority — callers must key the orphan classification off the branch -/// LIST, never off a `Not found` from an open, which could be a transient -/// storage error masking real recovery intent). The branch's tree and -/// per-table forks are already reclaimed, so the drift the sidecar pins is -/// unreachable and the sidecar is provably moot; leaving it would wedge -/// every heal (write entry) and every ReadWrite open on a dead-branch -/// open, with `repair` refusing while it pends. Records an -/// `OrphanedBranchDiscarded` audit row (commit appended on main — the -/// sidecar's own branch has no commit graph anymore). -async fn discard_orphaned_branch_sidecar( - root_uri: &str, - storage: &dyn StorageAdapter, - sidecar: &RecoverySidecar, -) -> Result<()> { - warn!( - operation_id = sidecar.operation_id.as_str(), - writer_kind = ?sidecar.writer_kind, - branch = sidecar.branch.as_deref().unwrap_or(""), - "recovery: discarding sidecar for a deleted branch (drift unreachable; audit recorded)" - ); - let mut audit = RecoveryAudit::open(root_uri).await?; - // Idempotency across a Phase D delete fault: the audit row + commit - // land before the sidecar delete, so a failed delete re-enters here - // with the audit already durable. Append only once per operation — - // the retry's sole remaining job is finishing the delete. (Cold - // path: the list scan runs only when an orphaned sidecar exists.) - // - // Documented residual: the commit append and the audit append are - // two writes. A failure BETWEEN them leaves a recovery commit with - // no audit row; the retry (keyed on the audit row, the operator- - // facing record) appends a second commit before the audit lands — - // bounded commit-graph noise, audit row still exactly-once. 
Same - // not-atomic-pair-write tolerance as `record_audit` and the - // manifest→commit-graph Known Gap; keying on commit rows instead - // would need an operation_id column on `_graph_commits`, and - // audit-before-commit would dangle the `graph_commit_id` join. - let already_recorded = audit.list().await?.iter().any(|record| { - record.operation_id == sidecar.operation_id - && record.recovery_kind == RecoveryKind::OrphanedBranchDiscarded - }); - if !already_recorded { - // The orphan-discard commit is recorded on MAIN (the sidecar's own - // branch is gone), via a lineage-only publish into `__manifest` (RFC-013 - // Phase 7) — no `_graph_commits.lance` row. The publisher stamps the - // commit at the version it produces. - let intent = LineageIntent { - graph_commit_id: ulid::Ulid::new().to_string(), - branch: None, - actor_id: Some(RECOVERY_ACTOR.to_string()), - merged_parent_commit_id: None, - created_at: crate::db::now_micros()?, - }; - let publisher = GraphNamespacePublisher::new(root_uri, None); - publisher.publish(&[], &HashMap::new(), Some(&intent)).await?; - // Failpoint: the residual window above — commit published, audit - // not yet durable. - crate::failpoints::maybe_fail(crate::failpoints::names::RECOVERY_ORPHAN_DISCARD_AUDIT_APPEND)?; - audit - .append(RecoveryAuditRecord { - graph_commit_id: intent.graph_commit_id, - recovery_kind: RecoveryKind::OrphanedBranchDiscarded, - recovery_for_actor: sidecar.actor_id.clone(), - operation_id: sidecar.operation_id.clone(), - sidecar_writer_kind: format!("{:?}", sidecar.writer_kind), - per_table_outcomes: Vec::new(), - created_at: crate::db::now_micros()?, - }) - .await?; - } - let handle = RecoverySidecarHandle { - operation_id: sidecar.operation_id.clone(), - sidecar_uri: sidecar_uri(root_uri, &sidecar.operation_id), - }; - delete_sidecar(&handle, storage).await -} - -/// The write-queue key serializing schema-apply's sidecar lifecycle -/// against the write-entry heal. 
The schema-apply writer acquires it -/// (alongside its per-table keys) from before `write_sidecar` until -/// after `delete_sidecar`; the heal acquires it before reconciling -/// schema staging or processing a SchemaApply sidecar. The name cannot -/// collide with real table keys (those are `node:`/`edge:`-prefixed). -pub(crate) fn schema_apply_serial_queue_key() -> crate::db::write_queue::TableQueueKey { - ("__schema_apply__".to_string(), None) -} - /// Open-time recovery sweep — the entry point invoked from /// `Omnigraph::open` (gated on `OpenMode::ReadWrite`). /// @@ -1007,12 +536,11 @@ pub(crate) fn schema_apply_serial_queue_key() -> crate::db::write_queue::TableQu /// same table append extra Lance restore commits which `omnigraph /// cleanup` reclaims. /// -/// Concurrency: the open-time sweep runs synchronously in `Omnigraph::open` -/// before the engine handle is published to any caller, so no request -/// handler can race it and it does NOT acquire write queues. In-process -/// callers (refresh, write entry points) must use -/// [`heal_pending_sidecars_roll_forward`] instead, which serializes -/// against live writers via per-(table_key, branch) queue acquisition. +/// Concurrency: today recovery runs synchronously in `Omnigraph::open` +/// *before* the engine is wrapped in the server's `Arc>`. +/// No request handlers can race. A future per-(table_key, branch) writer +/// queue model (paired with a background reconciler) will need to acquire +/// queues before the sweep restores or publishes. 
pub(crate) async fn recover_manifest_drift( root_uri: &str, storage: std::sync::Arc, @@ -1039,20 +567,6 @@ pub(crate) async fn recover_manifest_drift( for sidecar in sidecars { let branch_snapshot = match sidecar.branch.as_deref() { Some(b) => { - // Orphan check against the manifest's branch list (the - // authority) BEFORE opening — same classification as the - // write-entry heal: a deferred sidecar whose branch was - // deleted would otherwise fail every ReadWrite open. - coordinator.refresh().await?; - if !coordinator - .all_branches() - .await? - .iter() - .any(|name| name == b) - { - discard_orphaned_branch_sidecar(root_uri, storage.as_ref(), &sidecar).await?; - continue; - } let mut branch_coord = GraphCoordinator::open_branch(root_uri, b, std::sync::Arc::clone(&storage)) .await?; @@ -1066,7 +580,7 @@ pub(crate) async fn recover_manifest_drift( }; process_sidecar( root_uri, - &storage, + storage.as_ref(), &branch_snapshot, &sidecar, mode, @@ -1081,16 +595,12 @@ pub(crate) async fn recover_manifest_drift( async fn process_sidecar( root_uri: &str, - storage: &std::sync::Arc, + storage: &dyn StorageAdapter, snapshot: &Snapshot, sidecar: &RecoverySidecar, mode: RecoveryMode, schema_state_recovery: SchemaStateRecovery, -) -> Result { - // Returns whether durable state changed (roll-forward, roll-back, or - // stale-sidecar audit recovery). `false` = the sidecar was deferred - // untouched -- callers must not treat that as a completed heal (no - // schema reload / cache invalidation is warranted). 
+) -> Result<()> { let mut states = Vec::with_capacity(sidecar.tables.len()); for pin in &sidecar.tables { let lance_head = open_lance_head(&pin.table_path, pin.table_branch.as_deref()).await?; @@ -1099,13 +609,7 @@ async fn process_sidecar( .map(|e| e.table_version) .unwrap_or(0); states.push(ClassifiedTable { - classification: classify_table( - pin, - lance_head, - manifest_pinned, - sidecar.writer_kind, - sidecar.schema_version, - ), + classification: classify_table(pin, lance_head, manifest_pinned, sidecar.writer_kind), manifest_pinned, lance_head, }); @@ -1140,7 +644,7 @@ async fn process_sidecar( writer_kind = ?sidecar.writer_kind, "recovery: deferring sidecar with invariant violation to next ReadWrite open" ); - Ok(false) + Ok(()) } }, SidecarDecision::RollBack => { @@ -1184,10 +688,9 @@ async fn process_sidecar( ); } return record_audit_recovery_rollforward( - root_uri, storage.as_ref(), sidecar, &states, + root_uri, storage, snapshot, sidecar, &states, ) - .await - .map(|()| true); + .await; } if matches!(mode, RecoveryMode::RollForwardOnly) { // In-process recovery cannot run Dataset::restore safely @@ -1199,16 +702,14 @@ async fn process_sidecar( writer_kind = ?sidecar.writer_kind, "recovery: deferring rollback-eligible sidecar to next ReadWrite open" ); - return Ok(false); + return Ok(()); } warn!( operation_id = sidecar.operation_id.as_str(), writer_kind = ?sidecar.writer_kind, "recovery: rolling back sidecar (mixed or unexpected state)" ); - roll_back_sidecar(root_uri, storage.as_ref(), sidecar, &states) - .await - .map(|()| true) + roll_back_sidecar(root_uri, storage, snapshot, sidecar, &states).await } SidecarDecision::RollForward => { if matches!(sidecar.writer_kind, SidecarKind::SchemaApply) @@ -1221,9 +722,7 @@ async fn process_sidecar( "recovery: rolling back SchemaApply sidecar because schema staging \ files were not promoted in this recovery pass" ); - roll_back_sidecar(root_uri, storage.as_ref(), sidecar, &states) - .await - .map(|()| true) + 
roll_back_sidecar(root_uri, storage, snapshot, sidecar, &states).await } RecoveryMode::RollForwardOnly => { warn!( @@ -1231,7 +730,7 @@ async fn process_sidecar( "recovery: deferring SchemaApply sidecar because schema staging files \ were not promoted in this recovery pass" ); - Ok(false) + Ok(()) } }; } @@ -1241,36 +740,8 @@ async fn process_sidecar( "recovery: rolling forward sidecar (Phase B completed; \ Phase C did not land)" ); - // TOCTOU window: between `classify_table` (which read the manifest - // pin) and the publish CAS below, a concurrent live writer can - // advance the manifest past our expected version. The failpoint - // lets a test force that interleave deterministically. - crate::failpoints::maybe_fail( - crate::failpoints::names::RECOVERY_BEFORE_ROLL_FORWARD_PUBLISH, - )?; - // RFC-013 Phase 7: `roll_forward_all` folds the recovery commit into the - // manifest publish CAS, so it also returns the minted `graph_commit_id` - // for the audit row below. - let (new_manifest_version, published_versions, graph_commit_id) = - match roll_forward_all(root_uri, sidecar, &states, snapshot).await { - Ok(published) => published, - // Convergence-idempotent (invariants 7 & 15): a roll-forward's - // postcondition is "the manifest reflects the sidecar's committed - // Lance state", NOT "this sweep personally won the CAS". A - // concurrent writer that advanced the manifest to/past that goal - // during the classify→publish window is convergence, not a logical - // conflict -- so re-read and either record the already-achieved - // roll-forward or defer to the next pass; never fail the open. - // Any other error still propagates.
- Err(err) if is_expected_version_mismatch(&err) => { - return converge_or_defer_roll_forward( - root_uri, storage, sidecar, &states, err, - ) - .await; - } - Err(err) => return Err(err), - }; - let _ = new_manifest_version; + let (new_manifest_version, published_versions) = + roll_forward_all(root_uri, sidecar, snapshot).await?; // `to_version` records the ACTUAL Lance HEAD published for // each table (not pin.post_commit_pin, which is a lower bound // for loose-match writers like SchemaApply / EnsureIndices / @@ -1300,214 +771,17 @@ async fn process_sidecar( record_audit( root_uri, sidecar, - graph_commit_id, + new_manifest_version, RecoveryKind::RolledForward, outcomes, ) .await?; - delete_sidecar_by_operation_id(root_uri, storage.as_ref(), &sidecar.operation_id) - .await?; - Ok(true) + delete_sidecar_by_operation_id(root_uri, storage, &sidecar.operation_id).await?; + Ok(()) } } } -/// True if `err` is the publisher's per-table CAS precondition failure -/// (`ExpectedVersionMismatch`) -- the signal that a concurrent writer advanced -/// the manifest past what this caller expected. -fn is_expected_version_mismatch(err: &OmniError) -> bool { - matches!( - err, - OmniError::Manifest(m) - if matches!( - m.details, - Some(crate::error::ManifestConflictDetails::ExpectedVersionMismatch { .. }) - ) - ) -} - -/// Whether the live manifest already reflects everything this sidecar intended -/// to publish. -/// -/// SOUNDNESS: the per-table test is `current_version >= observed lance_head`, a -/// *proxy* for "the sidecar's committed Lance commit is an ancestor of the -/// published HEAD" (so a higher version is a descendant that contains it). The -/// proxy is sound only because of the heal-first invariant: every writer that -/// can advance a table's manifest version first heals pending sidecars -/// (`heal_pending_recovery_sidecars` runs at the head of `load`/`mutate`/ -/// schema-apply/branch-merge) or refuses on an unrecovered graph (`optimize`).
-/// So the only path past `expected_version` is one that first publishes THIS -/// sidecar's commit at `lance_head` -- version ordering then implies lineage -/// containment. A future writer that advances a pinned table WITHOUT healing -/// first (e.g. a non-heal-first `Overwrite` that replaces rows) would void this -/// proxy and must be re-validated by row-id lineage, not version ordering. -/// Added tables must be registered; tombstoned tables must be gone. -fn sidecar_intent_satisfied( - snapshot: &Snapshot, - sidecar: &RecoverySidecar, - states: &[ClassifiedTable], -) -> bool { - for (pin, state) in sidecar.tables.iter().zip(states.iter()) { - let current = snapshot - .entry(&pin.table_key) - .map(|e| e.table_version) - .unwrap_or(0); - if current < state.lance_head { - return false; - } - } - for reg in &sidecar.additional_registrations { - if snapshot.entry(&reg.table_key).is_none() { - return false; - } - } - for tomb in &sidecar.tombstones { - if snapshot.entry(&tomb.table_key).is_some() { - return false; - } - } - true -} - -/// Re-read the live manifest snapshot for the sidecar's branch. -async fn fresh_snapshot_for_sidecar( - root_uri: &str, - storage: &std::sync::Arc, - sidecar: &RecoverySidecar, -) -> Result { - let mut coordinator = match sidecar.branch.as_deref() { - Some(branch) if branch != "main" => { - GraphCoordinator::open_branch(root_uri, branch, std::sync::Arc::clone(storage)).await? - } - _ => GraphCoordinator::open(root_uri, std::sync::Arc::clone(storage)).await?, - }; - coordinator.refresh().await?; - Ok(coordinator.snapshot()) -} - -/// Convergence-idempotent handling of a roll-forward publish CAS that lost to a -/// concurrent writer (`ExpectedVersionMismatch`). A roll-forward's postcondition -/// is "the manifest reflects the sidecar's committed Lance state", not "this -/// sweep won the CAS" (invariants 7 & 15).
Re-read the live manifest: -/// -/// - if it already reached the sidecar's goal, the work is done (just not by us) -/// -- record the `RolledForward` audit and delete the sidecar idempotently; -/// - otherwise the manifest is progressing but not yet at the goal -- leave the -/// sidecar for the next open / the live writer's own Phase D. -/// -/// Either way the open does NOT fail. A genuine logical conflict (a table below -/// `expected_version`, i.e. data lost) is not satisfiable here and re-surfaces -/// loudly via the classifier's `InvariantViolation` on the next pass. -/// See iss-schema-apply-reopen-recovery-race. -async fn converge_or_defer_roll_forward( - root_uri: &str, - storage: &std::sync::Arc, - sidecar: &RecoverySidecar, - states: &[ClassifiedTable], - conflict: OmniError, -) -> Result { - let fresh = fresh_snapshot_for_sidecar(root_uri, storage, sidecar).await?; - if !sidecar_intent_satisfied(&fresh, sidecar, states) { - warn!( - operation_id = sidecar.operation_id.as_str(), - writer_kind = ?sidecar.writer_kind, - "recovery: roll-forward publish lost a CAS and the manifest has not \ - yet reached the sidecar's goal; deferring to the next pass \ - (conflict: {conflict})" - ); - return Ok(false); - } - // The manifest already reached the sidecar's goal -- some other actor - // advanced it. Under the heal-first invariant, whoever advanced past - // `expected_version` first healed THIS sidecar (recorded its RolledForward - // audit and deleted it). So the audit row already exists; recording another - // here would put two RolledForward rows in `_graph_commit_recoveries` for - // one recovery event (visible in `commit list --filter actor=…recovery`). - // Only finish the bookkeeping if the sidecar is still on disk (the winner - // crashed between audit and delete); if it is already gone, the winner - // completed it -- return success WITHOUT a duplicate audit, keeping the - // audit append-idempotent per operation_id across concurrent sweeps.
- let sidecar_path = sidecar_uri(root_uri, &sidecar.operation_id); - if !storage.exists(&sidecar_path).await? { - warn!( - operation_id = sidecar.operation_id.as_str(), - writer_kind = ?sidecar.writer_kind, - "recovery: roll-forward publish lost a CAS; the winner already \ - converged and cleaned up this sidecar -- nothing to do" - ); - return Ok(true); - } - warn!( - operation_id = sidecar.operation_id.as_str(), - writer_kind = ?sidecar.writer_kind, - "recovery: roll-forward publish lost a CAS to a concurrent writer that \ - already reached the goal; converging (RolledForward audit + delete)" - ); - let mut outcomes: Vec = sidecar - .tables - .iter() - .map(|pin| TableOutcome { - table_key: pin.table_key.clone(), - from_version: pin.expected_version, - to_version: fresh - .entry(&pin.table_key) - .map(|e| e.table_version) - .unwrap_or(pin.post_commit_pin), - }) - .collect(); - // Mirror the normal roll-forward audit shape: SchemaApply sidecars also - // register added tables, so the audit must list them too (else a converge - // audit row is incomplete vs the `roll_forward_all` path for the same - // recovery kind). - for reg in &sidecar.additional_registrations { - outcomes.push(TableOutcome { - table_key: reg.table_key.clone(), - from_version: 0, - to_version: fresh - .entry(&reg.table_key) - .map(|e| e.table_version) - .unwrap_or(0), - }); - } - // RFC-013 Phase 7: the winning writer folded its recovery commit into the - // manifest CAS, so the converge audit references THAT commit. We lost the CAS - // and never minted it, but a recovery commit is distinguishable by its - // `RECOVERY_ACTOR` authorship (`publish_recovery_commit`), so the latest - // recovery-actored commit on this branch IS it. Do NOT use the branch head: - // a concurrent USER write can advance `graph_head` past the recovery commit - // between the winner's publish and this read, which would attribute the audit - // row to the wrong (later, user) commit.
(We only reach here with the sidecar - // still on disk: the winner advanced the manifest but crashed before its own - // audit+delete, so we finish its bookkeeping.) - let cache = match sidecar.branch.as_deref() { - Some(branch) => { - crate::db::commit_graph::CommitGraph::open_at_branch(root_uri, branch).await? - } - None => crate::db::commit_graph::CommitGraph::open(root_uri).await?, - }; - let converged_commit_id = match cache - .load_commits() - .await? - .into_iter() - .rfind(|c| c.actor_id.as_deref() == Some(RECOVERY_ACTOR)) - { - Some(recovery_commit) => recovery_commit.graph_commit_id, - // No recovery commit visible -- unexpected on this path (the winner just - // published one); fall back to the head rather than an empty id. - None => cache.head_commit_id().await?.unwrap_or_default(), - }; - record_audit( - root_uri, - sidecar, - converged_commit_id, - RecoveryKind::RolledForward, - outcomes, - ) - .await?; - delete_sidecar_by_operation_id(root_uri, storage.as_ref(), &sidecar.operation_id).await?; - Ok(true) -} - #[derive(Debug, Clone, Copy)] struct ClassifiedTable { classification: TableClassification, @@ -1523,34 +797,23 @@ struct ClassifiedTable { async fn roll_back_sidecar( root_uri: &str, storage: &dyn StorageAdapter, + snapshot: &Snapshot, sidecar: &RecoverySidecar, states: &[ClassifiedTable], ) -> Result<()> { - // Restore every drifted table (RolledPastExpected / UnexpectedAtP1 / - // UnexpectedMultistep) to its manifest-pinned content, then PUBLISH so - // `manifest == Lance HEAD` for each -- symmetric with roll-forward. The - // restore commit's content equals the manifest-pinned version, so re-pinning - // the manifest to the new (restored) HEAD is content-correct and closes the - // orphaned-drift class (`HEAD > manifest` with no covering sidecar).
This is - // what makes a failed-then-retried schema_apply converge: after one - // roll-back `manifest == HEAD`, so the retry's precondition passes instead of - // failing one version higher each iteration. - // - // NoMovement tables are already at the pin -- excluded from both the restore - // and the publish. The audit `to_version` stays the *logical* rolled-back-to - // version (`manifest_pinned`), while the manifest is published at - // `manifest_pinned + 1` (the restore commit, same content) -- keep that - // asymmetry so the audit records the drift (`from_version > to_version`). + // Restore every table whose Lance HEAD has drifted from the + // manifest pin (RolledPastExpected, UnexpectedAtP1, + // UnexpectedMultistep). NoMovement tables are already at the + // manifest pin -- no action. Restore is unconditional; repeated + // mid-rollback crashes accumulate a few extra Lance commits that + // `omnigraph cleanup` reclaims. let mut outcomes = Vec::with_capacity(sidecar.tables.len()); - let mut updates: Vec = Vec::with_capacity(sidecar.tables.len()); - let mut expected: HashMap = HashMap::with_capacity(sidecar.tables.len()); for (pin, state) in sidecar.tables.iter().zip(states.iter()) { if matches!( state.classification, TableClassification::RolledPastExpected | TableClassification::UnexpectedAtP1 | TableClassification::UnexpectedMultistep - | TableClassification::IncompletePhaseB ) { restore_table_to_version( &pin.table_path, @@ -1558,23 +821,10 @@ async fn roll_back_sidecar( state.manifest_pinned, ) .await?; - // Publish the post-restore HEAD (the restore commit we just made), - // CAS against the current (unmoved) manifest pin -- the same helper - // roll-forward uses. `None` target: there is no prior observation to - // pin to; the version to publish is the HEAD the restore produced.
- push_table_update( - root_uri, - &pin.table_key, - &pin.table_path, - pin.table_branch.as_deref(), - state.manifest_pinned, - None, - &mut updates, - &mut expected, - ) - .await?; - // `from_version` records the Lance HEAD observed BEFORE the restore - // (the actual drift); `to_version` the logical pin we rolled back to. + // `from_version` records the Lance HEAD observed BEFORE the + // restore (the actual drift), not the manifest pin. Operators + // reading `_graph_commit_recoveries.lance` see "rolled back + // from v7 to v5" rather than "v5 → v5". outcomes.push(TableOutcome { table_key: pin.table_key.clone(), from_version: state.lance_head, @@ -1582,18 +832,13 @@ async fn roll_back_sidecar( }); } } - // Publish the restored HEADs so manifest == HEAD AND record the recovery - // commit in the same CAS (RFC-013 Phase 7). A degenerate all-NoMovement - // roll-back restores no table -- `updates` is empty -- but the recovery commit - // lineage still publishes (a lineage-only merge), so the rollback is recorded - // in the commit history just like a roll-forward. - let (_manifest_version, graph_commit_id) = - publish_recovery_commit(root_uri, sidecar, RecoveryKind::RolledBack, &updates, &expected) - .await?; + // Manifest pin doesn't move on rollback; record an audit-only + // commit at the existing version so operators can correlate via + // `omnigraph commit list --filter actor=omnigraph:recovery`.
record_audit( root_uri, sidecar, - graph_commit_id, + snapshot.version(), RecoveryKind::RolledBack, outcomes, ) @@ -1619,6 +864,7 @@ async fn roll_back_sidecar( async fn record_audit_recovery_rollforward( root_uri: &str, storage: &dyn StorageAdapter, + snapshot: &Snapshot, sidecar: &RecoverySidecar, states: &[ClassifiedTable], ) -> Result<()> { @@ -1632,22 +878,10 @@ async fn record_audit_recovery_rollforward( to_version: state.manifest_pinned, }) .collect(); - // The substrate is already in the post-roll-forward state (the prior pass's - // table re-pin landed), so there are no table `updates` -- but a recovery - // commit is still recorded for this cleanup pass via a lineage-only publish - // (RFC-013 Phase 7), which the audit row references. - let (_manifest_version, graph_commit_id) = publish_recovery_commit( - root_uri, - sidecar, - RecoveryKind::RolledForward, - &[], - &HashMap::new(), - ) - .await?; record_audit( root_uri, sidecar, - graph_commit_id, + snapshot.version(), RecoveryKind::RolledForward, outcomes, ) @@ -1667,19 +901,16 @@ async fn record_audit_recovery_rollforward( /// contention; persistent contention surfaces the typed conflict error to /// the recovery sweep, which leaves the sidecar in place for the next /// open's retry. -/// Returns `(new_manifest_version, per_table_published_versions, -/// recovery_commit_id)`. The per-table map is what the audit row's `to_version` -/// should record -- for loose-match writers the actual Lance HEAD can be higher -/// than the sidecar's `post_commit_pin` (which is a lower bound), so the pin is -/// the wrong source of truth for an operator-facing audit field. The recovery -/// commit id is the `graph_commit` folded into the publish CAS (RFC-013 -/// Phase 7), which the audit row references. +/// Returns `(new_manifest_version, per_table_published_versions)`.
The +/// per-table map is what the audit row's `to_version` should record -- +/// for loose-match writers the actual Lance HEAD can be higher than the +/// sidecar's `post_commit_pin` (which is a lower bound), so the pin is +/// the wrong source of truth for an operator-facing audit field. async fn roll_forward_all( root_uri: &str, sidecar: &RecoverySidecar, - states: &[ClassifiedTable], snapshot: &Snapshot, -) -> Result<(u64, HashMap, String)> { +) -> Result<(u64, HashMap)> { let total_changes = sidecar.tables.len() + sidecar.additional_registrations.len() + sidecar.tombstones.len(); let mut updates: Vec = Vec::with_capacity(total_changes); @@ -1687,25 +918,46 @@ async fn roll_forward_all( let mut published_versions: HashMap = HashMap::with_capacity(sidecar.tables.len() + sidecar.additional_registrations.len()); - for (pin, state) in sidecar.tables.iter().zip(states.iter()) { - // Publish the version classification OBSERVED (`state.lance_head`), not a - // fresh HEAD re-read. For a `Confirmed` pin classify already validated - // `lance_head == confirmed_version`, so this publishes the recorded WAL - // intent by construction; for loose/strict pins it's the multi-commit - // HEAD classify saw. Single observation, no classify→publish TOCTOU. CAS - // against the pin's pre-write `expected_version`. - let published = push_table_update( + for pin in &sidecar.tables { + // Open the dataset at its CURRENT Lance HEAD on the pin's branch + // (not at the sidecar's post_commit_pin). For strict-match writers + // (Mutation/Load) HEAD == post_commit_pin by construction. For + // loose-match writers (SchemaApply/EnsureIndices/BranchMerge) HEAD + // may be higher than post_commit_pin (multiple commit_staged + // calls per table); we want to publish to the actual current HEAD.
+ let head_ds = Dataset::open(&pin.table_path) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + let head_ds = match pin.table_branch.as_deref() { + Some(b) if b != "main" => head_ds + .checkout_branch(b) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?, + _ => head_ds, + }; + let head_version = head_ds.version().version; + + let row_count = head_ds + .count_rows(None) + .await + .map_err(|e| OmniError::Lance(e.to_string()))? as u64; + + let table_relative_path = super::table_path_for_table_key(&pin.table_key)?; + let version_metadata = super::metadata::TableVersionMetadata::from_dataset( root_uri, - &pin.table_key, - &pin.table_path, - pin.table_branch.as_deref(), - pin.expected_version, - Some(state.lance_head), - &mut updates, - &mut expected, - ) - .await?; - published_versions.insert(pin.table_key.clone(), published); + &table_relative_path, + &head_ds, + )?; + + updates.push(ManifestChange::Update(SubTableUpdate { + table_key: pin.table_key.clone(), + table_version: head_version, + table_branch: pin.table_branch.clone(), + row_count, + version_metadata, + })); + expected.insert(pin.table_key.clone(), pin.expected_version); + published_versions.insert(pin.table_key.clone(), head_version); } // SchemaApply-only: register added tables (and renamed targets) and @@ -1790,100 +1042,62 @@ async fn roll_forward_all( ); } - let (new_manifest_version, graph_commit_id) = - publish_recovery_commit(root_uri, sidecar, RecoveryKind::RolledForward, &updates, &expected) - .await?; - Ok((new_manifest_version, published_versions, graph_commit_id)) + let publisher = GraphNamespacePublisher::new(root_uri, sidecar.branch.as_deref()); + let new_dataset = publisher.publish(&updates, &expected).await?; + Ok((new_dataset.version().version, published_versions)) } -/// Open `table_path` at its branch HEAD, read the current Lance HEAD version, -/// row count, and version metadata, and push a `ManifestChange::Update` (plus -/// its CAS `expected` entry) that re-pins 
the manifest to that HEAD. Returns the -/// published HEAD version. +/// Append the audit row describing this recovery action. /// -/// Shared by `roll_forward_all` (where `expected_version` is the sidecar's -/// pre-write pin) and `roll_back_sidecar` (where it is the manifest-pinned -/// version the table was just restored to). The HEAD is read AFTER any restore -/// in the same single-threaded sweep, so no concurrent writer can have advanced -/// it. -/// Stage a manifest `Update` for one table. -/// -/// `target_version` selects WHICH Lance version's state to publish: -/// - `Some(v)` -- pin the dataset at version `v` and publish it. Roll-FORWARD -/// passes the version classification observed (and, for a `Confirmed` pin, -/// validated equals `confirmed_version`), so recovery publishes the version it -/// *decided* on rather than re-reading a HEAD a concurrent writer may have -/// advanced since classification -- one observation, used for both the decision -/// and the publish (invariant 15). -/// - `None` -- publish the dataset's current HEAD. Roll-BACK uses this: it just -/// created the restore commit, so HEAD *is* the version to publish. -async fn push_table_update( - root_uri: &str, - table_key: &str, - table_path: &str, - branch: Option<&str>, - expected_version: u64, - target_version: Option, - updates: &mut Vec, - expected: &mut HashMap, -) -> Result { - let ds = Dataset::open(table_path) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - let ds = match branch { - Some(b) if b != "main" => ds - .checkout_branch(b) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?, - _ => ds, - }; - let ds = match target_version { - Some(v) => ds - .checkout_version(v) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?, - None => ds, - }; - let published_version = ds.version().version; - let row_count = ds - .count_rows(None) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?
as u64; - let table_relative_path = super::table_path_for_table_key(table_key)?; - let version_metadata = - super::metadata::TableVersionMetadata::from_dataset(root_uri, &table_relative_path, &ds)?; - updates.push(ManifestChange::Update(SubTableUpdate { - table_key: table_key.to_string(), - table_version: published_version, - table_branch: branch.map(str::to_string), - row_count, - version_metadata, - })); - expected.insert(table_key.to_string(), expected_version); - Ok(published_version) -} - -/// Append the audit row describing this recovery action (RFC-013 Phase 7). -/// -/// The recovery COMMIT (`graph_commit` + `graph_head`) was already recorded -/// durably in `__manifest` by `publish_recovery_commit` (folded into the same -/// CAS as the table re-pin), so this only writes the `_graph_commit_recoveries` -/// row, referencing that commit by `graph_commit_id`. A crash between the -/// recovery publish and this audit append leaves a recovery commit with no audit -/// row -- the same not-atomic-pair-write shape as before; the sweep tolerates it -/// (on re-entry the classifier surfaces `NoMovement`, the action is a no-op, and -/// the audit append is retried, minting a fresh recovery commit). +/// Two-part write: (a) `_graph_commits.lance` row anchored on the recovery +/// actor (`omnigraph:recovery`); (b) `_graph_commit_recoveries.lance` row +/// linking back to (a) and naming the original actor + per-table outcomes. +/// Same not-atomic-pair-write shape as the existing `_graph_commits` +/// + `_graph_commit_actors` split -- a crash between the two leaves an +/// orphan commit row with no audit row. The recovery sweep tolerates this: +/// on re-entry the classifier surfaces `NoMovement` for already-restored / +/// already-published tables, the action is a no-op, and the audit append +/// is retried.
async fn record_audit( root_uri: &str, sidecar: &RecoverySidecar, - graph_commit_id: String, + manifest_version: u64, kind: RecoveryKind, outcomes: Vec, ) -> Result<()> { - // Failpoint: models an audit write failure after the roll-forward / - // roll-back publish (with its folded-in recovery commit) already landed -- - // the sweep aborts, the sidecar stays, and re-entry records the audit row. - crate::failpoints::maybe_fail(crate::failpoints::names::RECOVERY_RECORD_AUDIT)?; + // Non-main recovery commits must be appended on the sidecar branch's + // commit graph, otherwise parent_commit_id comes from the global + // main head. BranchMerge additionally records the source branch's + // HEAD as merged_parent_commit_id so future merges between the same + // pair recognize "already up-to-date". + let target_branch = sidecar.branch.as_deref(); + let mut graph = match target_branch { + Some(branch) => CommitGraph::open_at_branch(root_uri, branch).await?, + None => CommitGraph::open(root_uri).await?, + }; + let graph_commit_id = match ( + sidecar.writer_kind, + sidecar.merge_source_commit_id.as_deref(), + kind, + ) { + (SidecarKind::BranchMerge, Some(source_id), RecoveryKind::RolledForward) => { + let parent_commit_id = graph.head_commit_id().await?.unwrap_or_default(); + graph + .append_merge_commit( + target_branch, + manifest_version, + &parent_commit_id, + source_id, + Some(RECOVERY_ACTOR), + ) + .await? + } + _ => { + graph + .append_commit(target_branch, manifest_version, Some(RECOVERY_ACTOR)) + .await?
+ } + }; let mut audit = RecoveryAudit::open(root_uri).await?; audit .append(RecoveryAuditRecord { @@ -1893,7 +1107,7 @@ async fn record_audit( operation_id: sidecar.operation_id.clone(), sidecar_writer_kind: format!("{:?}", sidecar.writer_kind), per_table_outcomes: outcomes, - created_at: crate::db::now_micros()?, + created_at: now_micros()?, }) .await?; Ok(()) @@ -1977,7 +1191,7 @@ pub(crate) fn new_sidecar( #[cfg(test)] mod tests { use super::*; - use crate::storage::ObjectStorageAdapter; + use crate::storage::LocalStorageAdapter; use crate::table_store::TableStore; use arrow_array::{Int32Array, RecordBatch, StringArray}; use arrow_schema::{DataType, Field, Schema}; @@ -2009,7 +1223,6 @@ mod tests { table_path: table_path.to_string(), expected_version: expected, post_commit_pin: post, - confirmed_version: None, table_branch: None, } } @@ -2034,39 +1247,30 @@ mod tests { } #[test] - fn parse_sidecar_refuses_future_but_accepts_older_schema_version() { - let body = |version: u32| { - format!( - r#"{{ - "schema_version": {version}, - "operation_id": "01H000000000000000000000XX", - "started_at": "0", - "branch": null, - "actor_id": null, - "writer_kind": "BranchMerge", - "tables": [] - }}"# - ) - }; - // A version NEWER than this binary's max → refuse (can't guess the future).
- let err = parse_sidecar("file:///tmp/__recovery/x.json", &body(99)).unwrap_err(); + fn parse_sidecar_refuses_unknown_schema_version() { + let body = r#"{ + "schema_version": 99, + "operation_id": "01H000000000000000000000XX", + "started_at": "0", + "branch": null, + "actor_id": null, + "writer_kind": "Mutation", + "tables": [] + }"#; + let err = parse_sidecar("file:///tmp/__recovery/x.json", body).unwrap_err(); let msg = err.to_string(); assert!( - msg.contains("schema_version=99") && msg.contains("newer than the maximum"), - "expected a future-version refusal, got: {msg}", + msg.contains("schema_version=99") && msg.contains("supports only schema_version=1"), + "expected SidecarSchemaError mentioning the version mismatch, got: {}", + msg, ); - // An OLDER version (pre-confirmation v1) → accept and interpret with its - // original semantics; never refuse a version we were built to understand. - let parsed = parse_sidecar("file:///tmp/__recovery/x.json", &body(1)) - .expect("a v1 (older) sidecar must parse, not be refused"); - assert_eq!(parsed.schema_version, 1); } #[test] fn classify_no_movement_when_head_equals_pinned() { let pin = make_pin("node:Person", "irrelevant", 5, 6); assert_eq!( - classify_table(&pin, 5, 5, SidecarKind::Mutation, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 5, 5, SidecarKind::Mutation), TableClassification::NoMovement, ); } @@ -2075,7 +1279,7 @@ mod tests { fn classify_rolled_past_expected_when_sidecar_matches_strict() { let pin = make_pin("node:Person", "irrelevant", 5, 6); assert_eq!( - classify_table(&pin, 6, 5, SidecarKind::Mutation, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 6, 5, SidecarKind::Mutation), TableClassification::RolledPastExpected, ); } @@ -2085,7 +1289,7 @@ mod tests { // Same +1 drift but post_commit_pin says it should be 7, not 6.
let pin = make_pin("node:Person", "irrelevant", 5, 7); assert_eq!( - classify_table(&pin, 6, 5, SidecarKind::Mutation, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 6, 5, SidecarKind::Mutation), TableClassification::UnexpectedAtP1, ); } @@ -2094,7 +1298,7 @@ mod tests { fn classify_unexpected_multistep_when_head_jumped_more_than_one_strict() { let pin = make_pin("node:Person", "irrelevant", 5, 6); assert_eq!( - classify_table(&pin, 8, 5, SidecarKind::Mutation, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 8, 5, SidecarKind::Mutation), TableClassification::UnexpectedMultistep, ); } @@ -2103,7 +1307,7 @@ mod tests { fn classify_invariant_violation_when_head_below_pinned() { let pin = make_pin("node:Person", "irrelevant", 5, 6); assert_eq!( - classify_table(&pin, 3, 5, SidecarKind::Mutation, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 3, 5, SidecarKind::Mutation), TableClassification::InvariantViolation { observed: 3 }, ); } @@ -2119,7 +1323,7 @@ mod tests { // built two indices). Strict would say UnexpectedMultistep; loose // accepts it as RolledPastExpected. 
assert_eq!( - classify_table(&pin, 8, 5, SidecarKind::SchemaApply, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 8, 5, SidecarKind::SchemaApply), TableClassification::RolledPastExpected, ); } @@ -2128,7 +1332,7 @@ mod tests { fn classify_loose_match_accepts_multi_commit_drift_for_ensure_indices() { let pin = make_pin("node:Person", "irrelevant", 5, 6); assert_eq!( - classify_table(&pin, 9, 5, SidecarKind::EnsureIndices, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 9, 5, SidecarKind::EnsureIndices), TableClassification::RolledPastExpected, ); } @@ -2137,7 +1341,7 @@ mod tests { fn classify_loose_match_no_movement_unchanged() { let pin = make_pin("node:Person", "irrelevant", 5, 6); assert_eq!( - classify_table(&pin, 5, 5, SidecarKind::SchemaApply, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 5, 5, SidecarKind::SchemaApply), TableClassification::NoMovement, ); } @@ -2146,65 +1350,31 @@ mod tests { fn classify_loose_match_invariant_violation_unchanged() { let pin = make_pin("node:Person", "irrelevant", 5, 6); assert_eq!( - classify_table(&pin, 3, 5, SidecarKind::SchemaApply, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 3, 5, SidecarKind::SchemaApply), TableClassification::InvariantViolation { observed: 3 }, ); } - /// BranchMerge advances each table by several commits per table - /// (adopt: append + upsert + delete; three-way: merge_insert + delete + - /// index), so a bare "HEAD moved" is ambiguous between a complete and a - /// partial publish. At a confirmation-aware version the two-phase - /// confirmation resolves it: roll forward ONLY to the recorded - /// `confirmed_version`; an unconfirmed moved HEAD is a partial publish - /// (`IncompletePhaseB` ⇒ roll back), and a confirmed version that doesn't - /// match the observed HEAD is a foreign advance (`UnexpectedMultistep` ⇒ - /// roll back).
A *pre-confirmation* (v1) sidecar carries no confirmation and - /// must keep the original loose roll-forward -- reading it as strict would - /// roll a completed pre-upgrade merge back (silent discard). + /// BranchMerge must be loose-matched, not strict: while the strict + /// classifier expects exactly one `commit_staged` per table, + /// `publish_rewritten_merge_table` runs multiple per table + /// (merge_insert + delete_where + index rebuilds -- the comment in + /// `merge.rs` explicitly says so). Strict classification would roll + /// back valid completed Phase B work as `UnexpectedMultistep`. #[test] - fn classify_branch_merge_requires_phase_b_confirmation() { - // Unconfirmed multi-commit drift at a confirmation-aware version → - // partial Phase B → roll back. - let unconfirmed = make_pin("node:Person", "irrelevant", 5, 6); + fn classify_loose_match_accepts_multi_commit_drift_for_branch_merge() { + let pin = make_pin("node:Person", "irrelevant", 5, 6); assert_eq!( - classify_table(&unconfirmed, 8, 5, SidecarKind::BranchMerge, SIDECAR_SCHEMA_VERSION), - TableClassification::IncompletePhaseB, - ); - // Backward-compat: the SAME unconfirmed pin in a PRE-confirmation (v1) - // sidecar → loose roll-forward (the regression fix -- a completed - // pre-upgrade merge must not be discarded). - assert_eq!( - classify_table( - &unconfirmed, - 8, - 5, - SidecarKind::BranchMerge, - CONFIRMATION_SCHEMA_VERSION - 1, - ), + classify_table(&pin, 8, 5, SidecarKind::BranchMerge), TableClassification::RolledPastExpected, ); - // Confirmed to the observed HEAD → complete Phase B → roll forward. - let confirmed = SidecarTablePin { - confirmed_version: Some(8), - ..make_pin("node:Person", "irrelevant", 5, 6) - }; - assert_eq!( - classify_table(&confirmed, 8, 5, SidecarKind::BranchMerge, SIDECAR_SCHEMA_VERSION), - TableClassification::RolledPastExpected, - ); - // Confirmed, but HEAD drifted past it (foreign writer) → roll back.
- assert_eq!( - classify_table(&confirmed, 9, 5, SidecarKind::BranchMerge, SIDECAR_SCHEMA_VERSION), - TableClassification::UnexpectedMultistep, - ); } #[test] fn classify_loose_match_branch_merge_no_movement_unchanged() { let pin = make_pin("node:Person", "irrelevant", 5, 6); assert_eq!( - classify_table(&pin, 5, 5, SidecarKind::BranchMerge, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 5, 5, SidecarKind::BranchMerge), TableClassification::NoMovement, ); } @@ -2213,7 +1383,7 @@ mod tests { fn classify_loose_match_branch_merge_invariant_violation_unchanged() { let pin = make_pin("node:Person", "irrelevant", 5, 6); assert_eq!( - classify_table(&pin, 3, 5, SidecarKind::BranchMerge, SIDECAR_SCHEMA_VERSION), + classify_table(&pin, 3, 5, SidecarKind::BranchMerge), TableClassification::InvariantViolation { observed: 3 }, ); } @@ -2334,7 +1504,7 @@ mod tests { #[tokio::test] async fn list_sidecars_returns_empty_when_dir_missing() { let dir = tempfile::tempdir().unwrap(); - let storage = ObjectStorageAdapter::local(); + let storage = LocalStorageAdapter::default(); let result = list_sidecars(dir.path().to_str().unwrap(), &storage) .await .unwrap(); @@ -2344,10 +1514,10 @@ mod tests { #[tokio::test] async fn write_then_list_then_delete_round_trip() { let dir = tempfile::tempdir().unwrap(); - // No pre-created __recovery/ subdir: the storage backend creates - // missing parents on put, which is what the first sidecar write - // of a fresh graph relies on. - let storage = ObjectStorageAdapter::local(); + // Create the __recovery/ subdir so write_sidecar's parent exists + // (LocalStorageAdapter::write_text doesn't mkdir parents). 
+ std::fs::create_dir(dir.path().join(RECOVERY_DIR_NAME)).unwrap(); + let storage = LocalStorageAdapter::default(); let root = dir.path().to_str().unwrap(); let sidecar = new_sidecar( @@ -2368,37 +1538,6 @@ mod tests { assert!(after.is_empty()); } - #[tokio::test] - async fn confirm_sidecar_phase_b_errors_when_pin_missing_from_updates() { - // A pinned table with no achieved version in the publish `updates` must - // be a loud producer error, NOT a silent skip that leaves the pin - // unconfirmed (which recovery would read as a partial Phase B and roll - // the whole complete merge back). Guards the implicit `pins ⊆ updates` - // invariant against a future divergence between the two filters. - let dir = tempfile::tempdir().unwrap(); - let storage = ObjectStorageAdapter::local(); - let mut sidecar = new_sidecar( - SidecarKind::BranchMerge, - Some("main".to_string()), - None, - vec![make_pin("node:Person", "file:///tmp/x.lance", 5, 6)], - ); - // The confirmed-versions map does NOT cover the pinned table.
- let confirmed: HashMap = HashMap::new(); - let err = confirm_sidecar_phase_b( - dir.path().to_str().unwrap(), - &storage, - &mut sidecar, - &confirmed, - ) - .await - .expect_err("a pinned table with no achieved version must be a loud error"); - assert!( - err.to_string().contains("pins and publish updates diverged"), - "expected a pin/updates divergence error, got: {err}", - ); - } - #[tokio::test] async fn list_sidecars_skips_non_json_files() { let dir = tempfile::tempdir().unwrap(); @@ -2409,7 +1548,7 @@ mod tests { "noise", ) .unwrap(); - let storage = ObjectStorageAdapter::local(); + let storage = LocalStorageAdapter::default(); let result = list_sidecars(dir.path().to_str().unwrap(), &storage) .await .unwrap(); @@ -2425,7 +1564,7 @@ mod tests { async fn list_sidecars_returns_deterministic_order() { let dir = tempfile::tempdir().unwrap(); std::fs::create_dir(dir.path().join(RECOVERY_DIR_NAME)).unwrap(); - let storage = ObjectStorageAdapter::local(); + let storage = LocalStorageAdapter::default(); let root = dir.path().to_str().unwrap(); // Write sidecars in REVERSE chronological order (newest first). diff --git a/crates/omnigraph/src/db/manifest/state.rs b/crates/omnigraph/src/db/manifest/state.rs index 4fbbde3..e222ede 100644 --- a/crates/omnigraph/src/db/manifest/state.rs +++ b/crates/omnigraph/src/db/manifest/state.rs @@ -10,10 +10,7 @@ use crate::error::{OmniError, Result}; use super::layout::version_object_id; use super::metadata::TableVersionMetadata; -use super::{ - MAIN_BRANCH_HEAD_KEY, OBJECT_TYPE_GRAPH_COMMIT, OBJECT_TYPE_GRAPH_HEAD, OBJECT_TYPE_TABLE, - OBJECT_TYPE_TABLE_TOMBSTONE, OBJECT_TYPE_TABLE_VERSION, -}; +use super::{OBJECT_TYPE_TABLE, OBJECT_TYPE_TABLE_TOMBSTONE, OBJECT_TYPE_TABLE_VERSION}; #[derive(Debug, Clone)] pub struct SubTableEntry { @@ -37,64 +34,11 @@ struct TableTombstoneEntry { tombstone_version: u64, } -/// A graph-lineage commit projected out of the `__manifest` `graph_commit` -/// rows (RFC-013 step 4). 
Field-for-field identical to `commit_graph::GraphCommit` -/// so the commit-graph cache can be sourced from the manifest projection without -/// touching any reader above that boundary. Kept as a separate struct here to -/// keep `state.rs` free of the `commit_graph` module dependency. -#[derive(Debug, Clone, PartialEq, Eq)] -pub(crate) struct GraphLineageRow { - pub(crate) graph_commit_id: String, - pub(crate) manifest_branch: Option, - pub(crate) manifest_version: u64, - pub(crate) parent_commit_id: Option, - pub(crate) merged_parent_commit_id: Option, - pub(crate) actor_id: Option, - pub(crate) created_at: i64, -} - -/// JSON payload of a `graph_commit` row's `metadata` column. The immutable -/// commit fields that have no dedicated manifest column live here; the mutable -/// ones (`graph_commit_id`, `manifest_branch`, `manifest_version`) reuse -/// `object_id` / `table_branch` / `table_version`. -#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)] -struct GraphCommitMetadata { - #[serde(default, skip_serializing_if = "Option::is_none")] - parent_commit_id: Option, - #[serde(default, skip_serializing_if = "Option::is_none")] - merged_parent_commit_id: Option, - #[serde(default, skip_serializing_if = "Option::is_none")] - actor_id: Option, - created_at: i64, -} - -/// JSON payload of a `graph_head` row's `metadata` column. -#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)] -struct GraphHeadMetadata { - head_commit_id: String, - #[serde(default, skip_serializing_if = "Option::is_none")] - parent_commit_id: Option, -} - -/// The `object_id` for a branch's mutable head pointer row. Main encodes as -/// `graph_head:main`; named branches as `graph_head:`. 
-pub(crate) fn graph_head_object_id(branch: Option<&str>) -> String { - format!("graph_head:{}", branch.unwrap_or(MAIN_BRANCH_HEAD_KEY)) -} - #[derive(Debug, Clone)] struct ManifestScan { table_locations: HashMap, version_entries: Vec, tombstones: Vec, - /// Graph-lineage `graph_commit` rows, collected in the SAME pass only when - /// the caller asked (`collect_lineage`). Empty on the table-state read hot - /// path so it never pays the O(commits) lineage JSON decode; populated on the - /// publish path, where `load_publish_state` already needs the parent and would - /// otherwise scan `__manifest` a second time via `read_graph_lineage`. `graph_head` - /// rows are not collected here -- parent resolution uses the head-over-commits - /// computation, not the denormalized head pointer (see `resolve_lineage_rows`). - lineage_rows: Vec, } pub(super) fn manifest_schema() -> SchemaRef { @@ -129,8 +73,7 @@ pub(super) fn manifest_schema() -> SchemaRef { pub(super) async fn read_manifest_state(dataset: &Dataset) -> Result { let version = dataset.version().version; - // The table-state hot path never needs lineage, so don't pay its JSON decode. - let scan = read_manifest_scan(dataset, false).await?; + let scan = read_manifest_scan(dataset).await?; let mut latest_versions = HashMap::::new(); for entry in scan.version_entries { @@ -166,85 +109,28 @@ pub(super) async fn read_manifest_state(dataset: &Dataset) -> Result Result> { - Ok(read_manifest_scan(dataset, false).await?.version_entries) + Ok(read_manifest_scan(dataset).await?.version_entries) } -/// The full table state the publisher needs to build its CAS batch, plus the -/// `graph_commit` lineage rows for parent resolution -- all from ONE `__manifest` -/// scan (RFC-013 P2). Replaces the prior four scans on the publish path (three -/// thin accessors + a separate `read_graph_lineage`): `load_publish_state` -/// projects every piece it needs out of this single result.
-pub(super) struct PublishScan { - pub(super) table_locations: HashMap, - pub(super) version_entries: Vec, - pub(super) tombstones: Vec<((String, u64), ())>, - pub(super) lineage_rows: Vec, +pub(super) async fn read_registered_table_locations( + dataset: &Dataset, +) -> Result> { + Ok(read_manifest_scan(dataset).await?.table_locations) } -/// One-scan read of everything the publish path needs. `collect_lineage` is -/// always on here (the publisher resolves a parent), so the lineage JSON decode -/// rides the same pass as the table-state assembly instead of a second scan. -pub(super) async fn read_publish_scan(dataset: &Dataset) -> Result { - let scan = read_manifest_scan(dataset, true).await?; - Ok(PublishScan { - table_locations: scan.table_locations, - version_entries: scan.version_entries, - tombstones: scan - .tombstones - .into_iter() - .map(|tombstone| ((tombstone.table_key, tombstone.tombstone_version), ())) - .collect(), - lineage_rows: scan.lineage_rows, - }) +pub(super) async fn read_tombstone_versions( + dataset: &Dataset, +) -> Result> { + Ok(read_manifest_scan(dataset) + .await? + .tombstones + .into_iter() + .map(|tombstone| ((tombstone.table_key, tombstone.tombstone_version), ())) + .collect()) } -/// Decode one `graph_commit` row (`object_type == OBJECT_TYPE_GRAPH_COMMIT`) into -/// a [`GraphLineageRow`]. The single decode for both lineage readers -- the -/// dedicated `read_graph_lineage` scan and the folded `collect_lineage` branch of -/// `read_manifest_scan` -- so the two cannot drift. The caller has already matched -/// the object type; `row` indexes into the per-batch columns.
-fn decode_graph_commit_row( - object_ids: &StringArray, - metadata: &StringArray, - versions: &UInt64Array, - branches: &StringArray, - row: usize, -) -> Result { - if metadata.is_null(row) { - return Err(OmniError::manifest_internal(format!( - "manifest graph_commit row missing metadata for {}", - object_ids.value(row) - ))); - } - let commit_meta: GraphCommitMetadata = - serde_json::from_str(metadata.value(row)).map_err(|e| { - OmniError::manifest_internal(format!("failed to decode graph_commit metadata: {e}")) - })?; - Ok(GraphLineageRow { - graph_commit_id: object_ids.value(row).to_string(), - manifest_branch: if branches.is_null(row) { - None - } else { - Some(branches.value(row).to_string()) - }, - manifest_version: required_u64(versions, row, "table_version")?, - parent_commit_id: commit_meta.parent_commit_id, - merged_parent_commit_id: commit_meta.merged_parent_commit_id, - actor_id: commit_meta.actor_id, - created_at: commit_meta.created_at, - }) -} - -async fn read_manifest_scan(dataset: &Dataset, collect_lineage: bool) -> Result { +async fn read_manifest_scan(dataset: &Dataset) -> Result { let batches: Vec = dataset .scan() .try_into_stream() @@ -257,7 +143,6 @@ async fn read_manifest_scan(dataset: &Dataset, collect_lineage: bool) -> Result< let mut table_locations = HashMap::new(); let mut version_entries = Vec::new(); let mut tombstones = Vec::new(); - let mut lineage_rows = Vec::new(); for batch in &batches { let object_types = string_column(batch, "object_type")?; @@ -267,13 +152,6 @@ async fn read_manifest_scan(dataset: &Dataset, collect_lineage: bool) -> Result< let versions = u64_column(batch, "table_version")?; let branches = string_column(batch, "table_branch")?; let row_counts = u64_column(batch, "row_count")?; - // `object_id` is only needed for lineage decoding; skip the lookup - // entirely on the table-state hot path (`collect_lineage == false`). - let object_ids = if collect_lineage { - Some(string_column(batch, "object_id")?) 
- } else { - None - }; for row in 0..batch.num_rows() { let table_key = table_keys.value(row).to_string(); @@ -317,21 +195,6 @@ async fn read_manifest_scan(dataset: &Dataset, collect_lineage: bool) -> Result< tombstone_version, }); } - // `graph_commit` rows (RFC-013) are decoded into the scan ONLY - // when `collect_lineage` is set (the publish path, which resolves - // a parent). The table-state hot path leaves them -- and - // `graph_head` + any future object type -- in the `_` arm so it - // never pays the O(commits) lineage JSON decode. When NOT - // collecting, `object_ids` is `None`, so this arm is the same - // forward-compat skip as the `_` arm. - OBJECT_TYPE_GRAPH_COMMIT if collect_lineage => { - let object_ids = object_ids.expect("object_ids read when collect_lineage"); - lineage_rows.push(decode_graph_commit_row( - object_ids, metadata, versions, branches, row, - )?); - } - // Skipped on the table-state path (and for `graph_head` / unknown - // future object types on every path): no table snapshot needs them. _ => {} } } @@ -362,167 +225,21 @@ async fn read_manifest_scan(dataset: &Dataset, collect_lineage: bool) -> Result< table_locations, version_entries: entries, tombstones, - lineage_rows, }) } -/// Project the graph-lineage rows (`graph_commit` + `graph_head`) out of -/// `__manifest` (RFC-013 step 4). Returns every commit and the per-branch head -/// map (keyed by branch name, `"main"` for main). `__manifest` is the single -/// source of graph lineage: the commit-graph cache is sourced from here, and the -/// publisher resolves a new commit's parent from here inside its CAS loop. -/// -/// Dedicated scan (separate from `read_manifest_scan`): it decodes ONLY the two -/// lineage object types and builds no table snapshot, so the table-state hot -/// path never pays for lineage JSON and this path never pays for table-entry -/// assembly.
-pub(crate) async fn read_graph_lineage( - dataset: &Dataset, -) -> Result<(Vec, HashMap)> { - let batches: Vec = dataset - .scan() - .try_into_stream() - .await - .map_err(|e| OmniError::Lance(e.to_string()))? - .try_collect() - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - - let mut graph_commits = Vec::new(); - let mut graph_heads = HashMap::new(); - - for batch in &batches { - let object_ids = string_column(batch, "object_id")?; - let object_types = string_column(batch, "object_type")?; - let metadata = string_column(batch, "metadata")?; - let versions = u64_column(batch, "table_version")?; - let branches = string_column(batch, "table_branch")?; - - for row in 0..batch.num_rows() { - match object_types.value(row) { - OBJECT_TYPE_GRAPH_COMMIT => { - graph_commits.push(decode_graph_commit_row( - object_ids, metadata, versions, branches, row, - )?); - } - OBJECT_TYPE_GRAPH_HEAD => { - if metadata.is_null(row) { - return Err(OmniError::manifest_internal(format!( - "manifest graph_head row missing metadata for {}", - object_ids.value(row) - ))); - } - let head_meta: GraphHeadMetadata = serde_json::from_str(metadata.value(row)) - .map_err(|e| { - OmniError::manifest_internal(format!( - "failed to decode graph_head metadata: {e}" - )) - })?; - // `object_id` is `graph_head:`; the branch key after - // the prefix is the projection's map key (`main` for main). - let branch_key = object_ids - .value(row) - .strip_prefix("graph_head:") - .unwrap_or_default() - .to_string(); - graph_heads.insert(branch_key, head_meta.head_commit_id); - } - _ => {} - } - } - } - - Ok((graph_commits, graph_heads)) -} - -/// The current head of a branch's lineage: the [`GraphLineageRow`] with the -/// greatest `(manifest_version, created_at, graph_commit_id)`. This is the same -/// ordering the commit-graph cache uses to pick its head (`should_replace_head`) -/// -- kept in one place so the publisher's per-attempt parent resolution and the -/// cache agree by construction.
`None` only for a graph with no commits yet -/// (a parentless genesis). -pub(crate) fn head_lineage_row(rows: &[GraphLineageRow]) -> Option<&GraphLineageRow> { - rows.iter().max_by(|a, b| { - a.manifest_version - .cmp(&b.manifest_version) - .then_with(|| a.created_at.cmp(&b.created_at)) - .then_with(|| a.graph_commit_id.cmp(&b.graph_commit_id)) - }) -} - -/// One `__manifest` row materializing a piece of a graph commit's lineage. The -/// publisher maps these onto its `PendingVersionRow`s (folding lineage into the -/// table-version publish batch), and the genesis init path pushes them straight -/// into the init batch. -pub(crate) struct GraphLineageRowPart { - pub(crate) object_id: String, - pub(crate) object_type: &'static str, - pub(crate) metadata: String, - pub(crate) table_version: Option, - pub(crate) table_branch: Option, -} - -/// Encode one graph commit into its two `__manifest` rows: the immutable -/// `graph_commit` row plus the mutable `graph_head:` pointer (a -/// merge-insert on `object_id` updates the head in place). `branch` is `None` -/// for main. The immutable commit fields with no dedicated column live in the -/// `graph_commit` row's `metadata` JSON; the mutable head pointer payload lives -/// in the `graph_head` row's `metadata`. 
-pub(crate) fn graph_lineage_row_parts( - commit: &GraphLineageRow, - branch: Option<&str>, -) -> Result<[GraphLineageRowPart; 2]> { - let commit_metadata = serde_json::to_string(&GraphCommitMetadata { - parent_commit_id: commit.parent_commit_id.clone(), - merged_parent_commit_id: commit.merged_parent_commit_id.clone(), - actor_id: commit.actor_id.clone(), - created_at: commit.created_at, - }) - .map_err(|e| { - OmniError::manifest_internal(format!("failed to encode graph_commit metadata: {e}")) - })?; - let head_metadata = serde_json::to_string(&GraphHeadMetadata { - head_commit_id: commit.graph_commit_id.clone(), - parent_commit_id: commit.parent_commit_id.clone(), - }) - .map_err(|e| { - OmniError::manifest_internal(format!("failed to encode graph_head metadata: {e}")) - })?; - - Ok([ - // Only the immutable commit row carries the manifest version + branch. - GraphLineageRowPart { - object_id: commit.graph_commit_id.clone(), - object_type: OBJECT_TYPE_GRAPH_COMMIT, - metadata: commit_metadata, - table_version: Some(commit.manifest_version), - table_branch: commit.manifest_branch.clone(), - }, - // The head row reuses `metadata` for its pointer payload. 
- GraphLineageRowPart { - object_id: graph_head_object_id(branch), - object_type: OBJECT_TYPE_GRAPH_HEAD, - metadata: head_metadata, - table_version: None, - table_branch: None, - }, - ]) -} - pub(super) fn entries_to_batch( entries: &[SubTableEntry], version_metadata: &HashMap, - genesis_lineage: &[GraphLineageRowPart], ) -> Result { - let cap = entries.len() * 2 + genesis_lineage.len(); - let mut object_ids = Vec::with_capacity(cap); - let mut object_types = Vec::with_capacity(cap); - let mut locations = Vec::with_capacity(cap); - let mut metadata = Vec::with_capacity(cap); - let mut table_keys = Vec::with_capacity(cap); - let mut table_versions = Vec::with_capacity(cap); - let mut table_branches = Vec::with_capacity(cap); - let mut row_counts = Vec::with_capacity(cap); + let mut object_ids = Vec::with_capacity(entries.len() * 2); + let mut object_types = Vec::with_capacity(entries.len() * 2); + let mut locations = Vec::with_capacity(entries.len() * 2); + let mut metadata = Vec::with_capacity(entries.len() * 2); + let mut table_keys = Vec::with_capacity(entries.len() * 2); + let mut table_versions = Vec::with_capacity(entries.len() * 2); + let mut table_branches = Vec::with_capacity(entries.len() * 2); + let mut row_counts = Vec::with_capacity(entries.len() * 2); for entry in entries { object_ids.push(entry.table_key.clone()); @@ -554,22 +271,6 @@ pub(super) fn entries_to_batch( row_counts.push(Some(entry.row_count)); } - // Genesis graph-lineage rows ride the init write so a fresh graph carries - // its `graph_commit` + `graph_head` in `__manifest` from version one (no - // separate lineage fragment, no second commit). `table_key` is non-nullable - // but lineage rows have no table identity, so the empty string stands in - // (never matched by a real key). 
- for part in genesis_lineage { - object_ids.push(part.object_id.clone()); - object_types.push(part.object_type.to_string()); - locations.push(None); - metadata.push(Some(part.metadata.clone())); - table_keys.push(String::new()); - table_versions.push(part.table_version); - table_branches.push(part.table_branch.clone()); - row_counts.push(None); - } - manifest_rows_batch( object_ids, object_types, @@ -582,72 +283,6 @@ pub(super) fn entries_to_batch( ) } -/// Merge-insert a set of graph-lineage rows (`graph_commit` + `graph_head`) -/// straight into `__manifest`, keyed on `object_id`. Used only by the v3→v4 -/// internal-schema backfill (RFC-013 step 4): the normal write path folds -/// lineage into the publisher's batch, but the migration writes lineage with -/// no accompanying table-version change, so it issues its own merge. -/// -/// Mirrors the publisher's merge knobs (`use_index(false)`, `skip_auto_cleanup`, -/// `conflict_retries(0)`) so it has identical CAS / cleanup semantics. The -/// migration runs under the open-for-write path and is idempotent (re-inserting -/// the same `object_id` rows updates them in place), so it does not need the -/// publisher's retry loop. Returns the advanced dataset (its version is the -/// commit the lineage landed in).
-pub(crate) async fn merge_lineage_rows( - dataset: Dataset, - parts: &[GraphLineageRowPart], -) -> Result { - let len = parts.len(); - let mut object_ids = Vec::with_capacity(len); - let mut object_types = Vec::with_capacity(len); - let mut metadata = Vec::with_capacity(len); - let mut table_versions = Vec::with_capacity(len); - let mut table_branches = Vec::with_capacity(len); - for part in parts { - object_ids.push(part.object_id.clone()); - object_types.push(part.object_type.to_string()); - metadata.push(Some(part.metadata.clone())); - table_versions.push(part.table_version); - table_branches.push(part.table_branch.clone()); - } - // Lineage rows carry no table identity: empty `table_key`, null location / - // row_count (matching `lineage_part_to_pending` in the publisher). - let batch = manifest_rows_batch( - object_ids, - object_types, - vec![None; len], - metadata, - vec![String::new(); len], - table_versions, - table_branches, - vec![None; len], - )?; - let reader = - arrow_array::RecordBatchIterator::new(vec![Ok(batch)], manifest_schema()); - let dataset = Arc::new(dataset); - let mut merge_builder = - lance::dataset::MergeInsertBuilder::try_new(dataset, vec!["object_id".to_string()]) - .map_err(|e| OmniError::Lance(e.to_string()))?; - merge_builder.when_matched(lance::dataset::WhenMatched::UpdateAll); - merge_builder.when_not_matched(lance::dataset::WhenNotMatched::InsertAll); - merge_builder.conflict_retries(0); - merge_builder.use_index(false); - merge_builder.skip_auto_cleanup(true); - let (new_dataset, _stats) = merge_builder - .try_build() - .map_err(|e| OmniError::Lance(e.to_string()))? - .execute_reader(Box::new(reader)) - // Route through the publisher's classifier (not a stringify) so a - // concurrent first-open's CAS loss on `__manifest` surfaces as the SAME - // typed `RowLevelCasContention` the publisher's retry consumes. The - // migration's re-open retry loop matches on that to converge instead of - // erroring out (FIX B). 
- .await - .map_err(super::publisher::map_lance_publish_error)?; - Ok(Arc::try_unwrap(new_dataset).unwrap_or_else(|arc| (*arc).clone())) -} - pub(super) fn manifest_rows_batch( object_ids: Vec, object_types: Vec, diff --git a/crates/omnigraph/src/db/manifest/tests.rs b/crates/omnigraph/src/db/manifest/tests.rs index 31a77fe..effa0b5 100644 --- a/crates/omnigraph/src/db/manifest/tests.rs +++ b/crates/omnigraph/src/db/manifest/tests.rs @@ -12,7 +12,7 @@ use lance_namespace::models::{ use lance_namespace_impls::DirectoryNamespaceBuilder; use tokio::sync::Mutex; -use super::publisher::{LineageIntent, ManifestBatchPublisher, PublishOutcome}; +use super::publisher::ManifestBatchPublisher; use super::*; use omnigraph_compiler::catalog::build_catalog; use omnigraph_compiler::schema::parser::parse_schema; @@ -336,77 +336,40 @@ async fn test_directory_namespace_direct_publish_cannot_replace_native_omnigraph .await .unwrap(); - // Lance 7: the native `DirectoryNamespace` no longer recognizes omnigraph's - // manifest-tracked tables, so list / describe / create_table_version all - // return `TableNotFound`. The mechanism is *contingent on omnigraph's legacy - // boolean PK key*, not an unconditional v7 property: v7's namespace eagerly - // rewrites any `__manifest` whose `object_id` lacks the new - // `lance-schema:unenforced-primary-key:position` key, omnigraph declares the - // PK with the legacy boolean key, and v7 forbids changing a PK once set -- so - // `ensure_manifest_table_up_to_date` errors, the namespace silently falls - // back to directory listing (disabled here), and `check_table_status` reports - // the table absent. omnigraph keeps the boolean key deliberately: Lance - // honors it permanently (it maps to PK position 0) and one uniform on-disk - // format beats a new-vs-old split, since existing graphs can't be re-keyed to - // the position key under that same immutability rule.
The decoupling is - // therefore an accepted, production-irrelevant tradeoff (omnigraph never uses - // the native namespace -- its publisher writes `__manifest` via merge_insert - // and its reads go through its own `LanceNamespace` impls), and it only - // strengthens this guard's thesis: native tooling cannot enumerate, inspect, - // or publish over omnigraph's tables, let alone replace the write path. - let assert_table_not_found = |what: &str, dbg: String| { - assert!( - dbg.contains("TableNotFound") && dbg.contains("node:Person"), - "{what}: expected TableNotFound for node:Person, got: {dbg}" - ); - }; - assert_table_not_found( - "list_table_versions", - format!( - "{:?}", - namespace - .list_table_versions(ListTableVersionsRequest { - id: Some(vec!["node:Person".to_string()]), - descending: Some(true), - ..Default::default() - }) - .await - .unwrap_err() - ), - ); - assert_table_not_found( - "describe_table_version", - format!( - "{:?}", - namespace - .describe_table_version(DescribeTableVersionRequest { - id: Some(vec!["node:Person".to_string()]), - version: Some(person_version as i64), - ..Default::default() - }) - .await - .unwrap_err() - ), - ); - assert_table_not_found( - "create_table_version", - format!( - "{:?}", - namespace - .create_table_version(version_metadata.to_create_table_version_request( - "node:Person", - person_version, - 1, - None, - )) - .await - .unwrap_err() - ), + let versions = namespace + .list_table_versions(ListTableVersionsRequest { + id: Some(vec!["node:Person".to_string()]), + descending: Some(true), + ..Default::default() + }) + .await + .unwrap(); + assert_eq!( + versions.versions[0].version as u64, + person_entry.table_version ); - // omnigraph's manifest stays authoritative: refresh ignores the direct - // `person_ds.append` above (it was never manifest-published), so the row - // count stays 0 and the version is unchanged.
+ let err = namespace + .describe_table_version(DescribeTableVersionRequest { + id: Some(vec!["node:Person".to_string()]), + version: Some(person_version as i64), + ..Default::default() + }) + .await + .unwrap_err(); + assert!(err.to_string().contains("not found")); + + let err = namespace + .create_table_version(version_metadata.to_create_table_version_request( + "node:Person", + person_version, + 1, + None, + )) + .await + .unwrap_err(); + assert!(err.to_string().contains("already exists")); + mc.refresh().await.unwrap(); assert_eq!( mc.snapshot().entry("node:Person").unwrap().table_version, @@ -988,8 +951,7 @@ impl ManifestBatchPublisher for RecordingPublisher { &self, changes: &[ManifestChange], expected_table_versions: &HashMap, - lineage: Option<&LineageIntent>, - ) -> Result { + ) -> Result { let requests: Vec = changes .iter() .filter_map(|change| match change { @@ -998,9 +960,7 @@ impl ManifestBatchPublisher for RecordingPublisher { }) .collect(); self.requests.lock().await.extend_from_slice(&requests); - self.inner - .publish(changes, expected_table_versions, lineage) - .await + self.inner.publish(changes, expected_table_versions).await } } @@ -1012,8 +972,7 @@ impl ManifestBatchPublisher for FailingPublisher { &self, _changes: &[ManifestChange], _expected_table_versions: &HashMap, - _lineage: Option<&LineageIntent>, - ) -> Result { + ) -> Result { Err(OmniError::manifest( "injected batch publisher failure".to_string(), )) @@ -1393,8 +1352,8 @@ async fn test_concurrent_publish_with_overlapping_expected_versions_one_succeeds let expected_b = expected; let (res_a, res_b) = tokio::join!( - async { publisher_a.publish(&changes_a, &expected_a, None).await }, - async { publisher_b.publish(&changes_b, &expected_b, None).await } + async { publisher_a.publish(&changes_a, &expected_a).await }, + async { publisher_b.publish(&changes_b, &expected_b).await } ); let (succeeded, err) = match (res_a, res_b) { @@ -1485,7 +1444,7 @@ async fn 
test_publish_migrates_pre_stamp_manifest_to_current_version() { let mut expected = HashMap::new(); expected.insert("node:Person".to_string(), 1); GraphNamespacePublisher::new(uri, None) - .publish(&[], &expected, None) + .publish(&[], &expected) .await .unwrap(); @@ -1502,87 +1461,6 @@ async fn test_publish_migrates_pre_stamp_manifest_to_current_version() { assert!(reopened.snapshot().entry("node:Person").is_some()); } -#[tokio::test] -async fn test_v2_to_v3_sweeps_legacy_run_branches_on_write_open() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let catalog = build_test_catalog(); - let mut mc = ManifestCoordinator::init(uri, &catalog).await.unwrap(); - - // Synthesize a pre-MR-770 graph: several stale `__run__` staging branches - // left on `__manifest` (a real legacy graph accumulates one per run), plus - // a real user branch that must survive the sweep. Multiple run branches - // exercise the migration's delete loop on a single reused dataset handle. - mc.create_branch("__run__01J9LEGACY").await.unwrap(); - mc.create_branch("__run__01J9SECOND").await.unwrap(); - mc.create_branch("__run__01J9THIRD").await.unwrap(); - mc.create_branch("feature").await.unwrap(); - let before = mc.list_branches().await.unwrap(); - assert_eq!( - before.iter().filter(|b| b.starts_with("__run__")).count(), - 3, - "precondition: three legacy run branches exist on __manifest; got {before:?}", - ); - - // Rewind the internal-schema stamp to v2 so the next write-open runs the - // v2 → v3 sweep arm (init stamps at the current version, which is past it).
- { - let mut ds = open_manifest_dataset(uri, None).await.unwrap(); - ds.update_schema_metadata([( - "omnigraph:internal_schema_version".to_string(), - Some("2".to_string()), - )]) - .await - .unwrap(); - let post = open_manifest_dataset(uri, None).await.unwrap(); - assert_eq!( - super::migrations::read_stamp(&post), - 2, - "stamp rewound to v2" - ); - } - - // A no-op publish forces the open-for-write path, which runs the migration. - let mut expected = HashMap::new(); - expected.insert("node:Person".to_string(), 1); - GraphNamespacePublisher::new(uri, None) - .publish(&[], &expected, None) - .await - .unwrap(); - - // Stamp advanced to current; the legacy run branch is physically gone from - // `__manifest` (checked via the raw, unfiltered manifest list -- not the - // guard-filtered `branch_list`), and the real branch + `main` survive. - let post = open_manifest_dataset(uri, None).await.unwrap(); - assert_eq!( - super::migrations::read_stamp(&post), - super::migrations::INTERNAL_MANIFEST_SCHEMA_VERSION, - ); - let reopened = ManifestCoordinator::open(uri).await.unwrap(); - let after = reopened.list_branches().await.unwrap(); - assert!( - !after.iter().any(|b| b.starts_with("__run__")), - "legacy run branch must be swept; got {after:?}", - ); - assert!( - after.iter().any(|b| b == "feature"), - "user branch must survive" - ); - assert!(after.iter().any(|b| b == "main"), "main must survive"); - - // Idempotent: a second write-open finds the stamp at current and does not - // re-run the sweep or error.
- GraphNamespacePublisher::new(uri, None) - .publish(&[], &expected, None) - .await - .unwrap(); - let final_ds = open_manifest_dataset(uri, None).await.unwrap(); - assert_eq!( - super::migrations::read_stamp(&final_ds), - super::migrations::INTERNAL_MANIFEST_SCHEMA_VERSION, - ); -} - #[tokio::test] async fn test_publish_rejects_manifest_stamped_at_future_version() { let dir = tempfile::tempdir().unwrap(); @@ -1605,7 +1483,7 @@ async fn test_publish_rejects_manifest_stamped_at_future_version() { let mut expected = HashMap::new(); expected.insert("node:Person".to_string(), 1); let err = GraphNamespacePublisher::new(uri, None) - .publish(&[], &expected, None) + .publish(&[], &expected) .await .expect_err("future-stamped manifest should reject open-for-write"); let msg = err.to_string(); @@ -1631,957 +1509,3 @@ fn manifest_column_helpers_return_error_for_bad_schema() { let err = string_column(&batch, "table_key").unwrap_err(); assert!(err.to_string().contains("table_key")); } - -// ── RFC-013 Phase 7 stage 4: existing-graph (v3 → v4) lineage migration ────── -// -// A graph created by a pre-Phase-7 binary (internal schema v3) keeps its -// lineage in `_graph_commits.lance`, with NONE in `__manifest`. The new binary -// reads lineage from the `__manifest` projection, so without a migration it -// would see an EMPTY commit DAG. These tests pin the backfill (`migrate_v3_to_v4`), -// its idempotency, the transitional v3-read fallback, the read-only refusal, and -// the crash-mid-migration recovery. - -use crate::db::commit_graph::{CommitGraph, seed_legacy_v3_lineage}; - -/// Number of `graph_commit` rows in `__manifest` at main.
-async fn manifest_commit_row_count(uri: &str) -> usize { - let ds = open_manifest_dataset(uri, None).await.unwrap(); - let (rows, _heads) = read_graph_lineage(&ds).await.unwrap(); - rows.len() -} - -#[tokio::test] -async fn v3_graph_backfills_lineage_into_manifest_on_read_write_open() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - - let fixture = seed_legacy_v3_lineage(uri).await.unwrap(); - - // Precondition: a true v3 graph -- stamp 3, NO lineage rows in `__manifest`, - // and a NEW-binary projection therefore reads an empty DAG. - { - let ds = open_manifest_dataset(uri, None).await.unwrap(); - assert_eq!(super::migrations::read_stamp(&ds), 3, "fixture is stamped v3"); - } - assert_eq!( - manifest_commit_row_count(uri).await, - 0, - "precondition: __manifest carries no graph_commit rows in a v3 graph", - ); - - // Run the production read-write migration entry point (main branch). - super::migrate_on_open(uri).await.unwrap(); - - // The manifest now carries the lineage and is stamped at the current version. - { - let ds = open_manifest_dataset(uri, None).await.unwrap(); - assert_eq!( - super::migrations::read_stamp(&ds), - super::migrations::INTERNAL_MANIFEST_SCHEMA_VERSION, - "migration stamps the manifest at the current internal schema version", - ); - } - // 4 commits (genesis, A, feature, merge) → 4 `graph_commit` rows. - assert_eq!( - manifest_commit_row_count(uri).await, - fixture.all_ids.len(), - "every legacy commit is backfilled into __manifest", - ); - - // The commit-graph projection (now sourced from __manifest) reconstructs the - // full DAG: every old id resolves, parents/merge parents are connected, the - // merge commit's actor + two parents survive, and the head is the merge.
- let cg = CommitGraph::open(uri).await.unwrap(); - let commits = cg.load_commits().await.unwrap(); - assert_eq!(commits.len(), fixture.all_ids.len()); - for id in &fixture.all_ids { - assert!( - cg.get_commit(id).is_some(), - "old commit id {id} must still resolve after migration", - ); - } - - let genesis = cg.get_commit(&fixture.genesis).unwrap(); - assert!(genesis.parent_commit_id.is_none(), "genesis is parentless"); - assert!(genesis.actor_id.is_none(), "genesis is actorless"); - - let commit_a = cg.get_commit(&fixture.commit_a).unwrap(); - assert_eq!(commit_a.parent_commit_id.as_deref(), Some(fixture.genesis.as_str())); - assert_eq!(commit_a.actor_id.as_deref(), Some("act-a"), "actor backfilled inline"); - - let merge = cg.get_commit(&fixture.merge_commit).unwrap(); - assert_eq!(merge.parent_commit_id.as_deref(), Some(fixture.commit_a.as_str())); - assert_eq!( - merge.merged_parent_commit_id.as_deref(), - Some(fixture.feature_commit.as_str()), - "the merge commit keeps both parents", - ); - assert_eq!(merge.actor_id.as_deref(), Some("act-merger")); - - assert_eq!( - cg.head_commit_id().await.unwrap().as_deref(), - Some(fixture.merge_commit.as_str()), - "the merge commit is the head of main after migration", - ); - - // merge_base of main vs main is reflexively the head -- a smoke check that the - // ancestor walk works over the backfilled DAG. - let base = CommitGraph::merge_base(uri, None, None).await.unwrap(); - assert!(base.is_some(), "merge_base resolves over the backfilled DAG"); -} - -#[tokio::test] -async fn v3_to_v4_migration_is_idempotent() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let fixture = seed_legacy_v3_lineage(uri).await.unwrap(); - - super::migrate_on_open(uri).await.unwrap(); - let after_first = manifest_commit_row_count(uri).await; - // Re-running the migration must not duplicate any rows.
- super::migrate_on_open(uri).await.unwrap(); - let after_second = manifest_commit_row_count(uri).await; - - assert_eq!(after_first, fixture.all_ids.len()); - assert_eq!( - after_first, after_second, - "a second migration pass adds no duplicate graph_commit rows", - ); -} - -#[tokio::test] -async fn v3_graph_reads_history_via_fallback_without_migrating() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let fixture = seed_legacy_v3_lineage(uri).await.unwrap(); - - // Open the commit-graph projection WITHOUT running the migration (this is the - // read-only path: `CommitGraph::open` reads, never writes). The stamp-gated - // fallback sources lineage from `_graph_commits.lance`, so history is correct. - let cg = CommitGraph::open(uri).await.unwrap(); - let commits = cg.load_commits().await.unwrap(); - assert_eq!( - commits.len(), - fixture.all_ids.len(), - "the v3 fallback reads the full legacy DAG with no migration", - ); - assert_eq!( - cg.head_commit_id().await.unwrap().as_deref(), - Some(fixture.merge_commit.as_str()), - ); - - // The fallback is read-only: stamp stays v3, __manifest still has no lineage. - { - let ds = open_manifest_dataset(uri, None).await.unwrap(); - assert_eq!(super::migrations::read_stamp(&ds), 3, "fallback did not write"); - } - assert_eq!( - manifest_commit_row_count(uri).await, - 0, - "the read-only fallback writes nothing to __manifest", - ); -} - -#[tokio::test] -async fn future_stamp_is_refused_in_both_open_modes() { - use crate::db::{Omnigraph, OpenMode}; - use crate::storage::storage_for_uri; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - // A full graph (schema artifacts present) so `Omnigraph::open*` gets past its - // schema read to the stamp check. - Omnigraph::init(uri, "node Person { name: String }\n") - .await - .unwrap(); - - // Stamp past this binary's known version. 
- { - let mut ds = open_manifest_dataset(uri, None).await.unwrap(); - ds.update_schema_metadata([( - "omnigraph:internal_schema_version".to_string(), - Some("5".to_string()), - )]) - .await - .unwrap(); - } - - let storage = storage_for_uri(uri).unwrap(); - for mode in [OpenMode::ReadWrite, OpenMode::ReadOnly] { - // `Omnigraph` is not `Debug`, so match instead of `expect_err`. - let err = match Omnigraph::open_with_storage_and_mode(uri, Arc::clone(&storage), mode).await - { - Ok(_) => panic!("{mode:?}: a future-stamped graph must be refused"), - Err(err) => err, - }; - assert!( - err.to_string().contains("upgrade omnigraph"), - "{mode:?}: expected an upgrade-omnigraph refusal, got: {err}", - ); - } -} - -#[tokio::test] -async fn sub_floor_stamp_is_refused_in_both_open_modes() { - use crate::db::{Omnigraph, OpenMode}; - use crate::storage::storage_for_uri; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - Omnigraph::init(uri, "node Person { name: String }\n") - .await - .unwrap(); - - // Stamp below MIN_SUPPORTED (1 today). No real graph carries 0 -- `read_stamp` - // floors an absent stamp at 1 -- so this is the symmetric twin of - // `future_stamp_is_refused_in_both_open_modes`, exercising the floor the - // combined `refuse_if_stamp_unsupported` guard adds at every open mode - // (write-path migrate, read-only open, and the branch lineage-read path). The - // upper side -- a graph at exactly MIN migrating to CURRENT -- is covered by - // `test_publish_migrates_pre_stamp_manifest_to_current_version`, where an - // absent stamp reads as 1 = MIN.
- { - let mut ds = open_manifest_dataset(uri, None).await.unwrap(); - ds.update_schema_metadata([( - "omnigraph:internal_schema_version".to_string(), - Some("0".to_string()), - )]) - .await - .unwrap(); - } - - let storage = storage_for_uri(uri).unwrap(); - for mode in [OpenMode::ReadWrite, OpenMode::ReadOnly] { - let err = match Omnigraph::open_with_storage_and_mode(uri, Arc::clone(&storage), mode).await - { - Ok(_) => panic!("{mode:?}: a sub-floor graph must be refused"), - Err(err) => err, - }; - assert!( - err.to_string().contains("migrate it forward"), - "{mode:?}: expected a migrate-forward floor refusal, got: {err}", - ); - } -} - -#[tokio::test] -async fn crash_after_merge_before_stamp_completes_on_next_open() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let fixture = seed_legacy_v3_lineage(uri).await.unwrap(); - - // Simulate a crash that landed the lineage merge but lost the stamp bump: - // run the full migration (lineage now in __manifest), then rewind the stamp - // to v3. This is exactly the on-disk state after a crash at the - // `migration.v3_to_v4.after_merge_before_stamp` window. - super::migrate_on_open(uri).await.unwrap(); - { - let mut ds = open_manifest_dataset(uri, None).await.unwrap(); - super::migrations::set_stamp_for_test(&mut ds, 3).await.unwrap(); - } - assert_eq!( - manifest_commit_row_count(uri).await, - fixture.all_ids.len(), - "crash state: lineage present, stamp rewound to v3", - ); - - // The next open re-enters at v3; the idempotency guard sees the lineage and - // skips straight to the stamp bump -- no duplicate rows, migration completes.
- super::migrate_on_open(uri).await.unwrap(); - { - let ds = open_manifest_dataset(uri, None).await.unwrap(); - assert_eq!( - super::migrations::read_stamp(&ds), - super::migrations::INTERNAL_MANIFEST_SCHEMA_VERSION, - "the re-entered migration completes the stamp bump", - ); - } - assert_eq!( - manifest_commit_row_count(uri).await, - fixture.all_ids.len(), - "re-running over an already-merged manifest adds no duplicate rows", - ); -} - -/// Migrate the `__manifest` at `branch` (the per-branch v3→v4 entry shape: -/// `migrate_on_open` runs it for main; the publisher runs it for each branch's -/// first write). Returns the migrated branch lineage `(commit_by_id, heads)`. -async fn migrate_branch_and_read_lineage( - uri: &str, - branch: &str, -) -> ( - std::collections::HashMap, - std::collections::HashMap, -) { - let mut ds = open_manifest_dataset(uri, Some(branch)).await.unwrap(); - super::migrations::migrate_internal_schema(&mut ds, uri, Some(branch)) - .await - .unwrap(); - // Re-open at the branch so the read sees the migration's committed HEAD. - let ds = open_manifest_dataset(uri, Some(branch)).await.unwrap(); - let (rows, heads) = read_graph_lineage(&ds).await.unwrap(); - let by_id = rows - .into_iter() - .map(|r| (r.graph_commit_id.clone(), r)) - .collect(); - (by_id, heads) -} - -// FIX C -- the per-branch v3→v4 migration against a REAL Lance branch. -// -// `seed_legacy_v3_lineage` writes every commit (incl. the "feature"-tagged one) -// to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field -- it -// never exercises the production per-branch path (`read_legacy_commit_cache` → -// `checkout_branch`, and a branch-scoped `__manifest`). This test builds a graph -// with a REAL Lance branch on both `_graph_commits.lance` and `__manifest`, then -// migrates the BRANCH and asserts the branch's lineage lands in the BRANCH's -// `__manifest` with main untouched.
-// -// It also EMPIRICALLY decides the open question behind FIX B: the fast-path -// `read_graph_lineage(dataset)` has no `manifest_branch` filter in its query, but -// `dataset` is branch-scoped (`__manifest` is Lance-branched per graph-branch), -// so a branch should read only its OWN lineage. If migrating the branch were to -// leak main's backfill (or vice versa), that would be a 5th bug needing a branch -// filter. The assertions below pin that it does not. -#[tokio::test] -async fn v3_branch_migration_backfills_branch_manifest_and_leaves_main_untouched() { - use crate::db::commit_graph::seed_legacy_v3_lineage_with_branch; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let fx = seed_legacy_v3_lineage_with_branch(uri).await.unwrap(); - - // Precondition: both main and the branch are v3 with no lineage in __manifest. - for branch in [None, Some(fx.branch.as_str())] { - let ds = open_manifest_dataset(uri, branch).await.unwrap(); - assert_eq!( - super::migrations::read_stamp(&ds), - 3, - "{branch:?}: fixture branch is stamped v3", - ); - let (rows, _heads) = read_graph_lineage(&ds).await.unwrap(); - assert!( - rows.is_empty(), - "{branch:?}: fixture branch has no lineage in __manifest", - ); - } - - // Migrate ONLY the branch. - let (branch_by_id, branch_heads) = migrate_branch_and_read_lineage(uri, &fx.branch).await; - - // The branch's __manifest now carries the branch's full DAG: genesis, A, and - // the branch commit (3 rows), with the branch commit as `graph_head:feature`. 
- assert_eq!( - branch_by_id.len(), - 3, - "the branch backfill carries genesis + A + the branch commit", - ); - for id in [&fx.genesis, &fx.commit_a, &fx.branch_commit] { - assert!( - branch_by_id.contains_key(id), - "branch commit {id} must be backfilled into the branch __manifest", - ); - } - assert_eq!( - branch_heads.get(&fx.branch).map(String::as_str), - Some(fx.branch_commit.as_str()), - "graph_head:feature points at the branch commit", - ); - - // Parents + actors survived the backfill. - let branch_commit = &branch_by_id[&fx.branch_commit]; - assert_eq!( - branch_commit.parent_commit_id.as_deref(), - Some(fx.commit_a.as_str()), - "the branch commit keeps its parent", - ); - assert_eq!( - branch_commit.actor_id.as_deref(), - Some("act-branch"), - "the branch commit's authored actor survives", - ); - assert_eq!( - branch_by_id[&fx.commit_a].actor_id.as_deref(), - Some("act-a"), - "the inherited main commit's actor survives on the branch", - ); - - // Contingency check: migrating the branch left MAIN's __manifest untouched -- - // still v3, still no lineage. The unfiltered fast-path read is branch-correct - // because `__manifest` is Lance-branched; no `manifest_branch` filter is - // needed (no 5th bug). - { - let main_ds = open_manifest_dataset(uri, None).await.unwrap(); - assert_eq!( - super::migrations::read_stamp(&main_ds), - 3, - "migrating the branch must not advance main's stamp", - ); - let (main_rows, _heads) = read_graph_lineage(&main_ds).await.unwrap(); - assert!( - main_rows.is_empty(), - "migrating the branch must not backfill main's __manifest", - ); - } -} - -// FIX D -- the branch read path refuses a `> CURRENT` branch stamp. -// -// `load_commit_cache_for_branch` handled `< CURRENT` (the v3 fallback) and -// `>= CURRENT` (the manifest projection), but never a `> CURRENT` branch stamp -- -// it would misread a future shape with the projection.
The main read path already -// refuses (`refuse_if_internal_schema_unsupported`), and migrations run main-first so -// main's stamp ≥ every branch's -- so this is not a live hole today. The guard is -// defense-in-depth against that ordering invariant ever weakening. Here we -// synthesize the unreachable state directly (force-stamp a branch past CURRENT) -// and assert the branch read refuses loudly instead of misreading. -#[tokio::test] -async fn branch_read_refuses_future_internal_schema_stamp() { - use crate::db::commit_graph::{CommitGraph, seed_legacy_v3_lineage_with_branch}; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - // A graph with a real `feature` Lance branch on both `_graph_commits.lance` - // and `__manifest` (so `open_at_branch` can check it out). - let fx = seed_legacy_v3_lineage_with_branch(uri).await.unwrap(); - - // Force the BRANCH's `__manifest` stamp past this binary's known version. - let future = super::migrations::INTERNAL_MANIFEST_SCHEMA_VERSION + 1; - { - let mut branch_ds = open_manifest_dataset(uri, Some(&fx.branch)).await.unwrap(); - super::migrations::set_stamp_for_test(&mut branch_ds, future) - .await - .unwrap(); - } - - // Reading the commit graph at that branch must refuse, not misread. - let err = match CommitGraph::open_at_branch(uri, &fx.branch).await { - Ok(_) => panic!("a branch stamped past CURRENT must be refused on read"), - Err(e) => e, - }; - assert!( - err.to_string().contains("upgrade omnigraph"), - "expected an upgrade-omnigraph refusal at the branch read, got: {err}", - ); -} - -// A v4 branch whose AUTHORITATIVE lineage lives in `__manifest` must stay -// readable even when its DERIVED `_graph_commits.lance` branch ref is gone. -// -// `_graph_commits.lance` is no longer the source of graph lineage on a v4 graph -// (RFC-013 Phase 7) -- `__manifest`'s `graph_commit`/`graph_head:` rows -// are.
The Lance branch ref on `_graph_commits.lance` is a derived artifact, kept -// only so `create_branch`/`cleanup` have something to operate on. An interrupted -// fork-reclaim or a `cleanup` race can leave that ref missing while the manifest -// lineage is fully intact. Per invariants 7 + 15 a missing DERIVED ref must not -// fail a LOGICAL read of the lineage. -// -// The wedge: take a real v4 `feature` branch (its `graph_head:feature` row in -// `__manifest`), then `force_delete` ONLY the `_graph_commits.lance` `feature` -// ref -- manifest lineage is left authoritative. The contract: -// - reads at the wedged branch (`open_at_branch` / list-commits / `merge_base`) -// SUCCEED, sourcing the DAG from `__manifest`; and -// - a WRITE that needs the derived ref (`create_branch`) fails LOUDLY with the -// typed actionable error, deferring repair to `cleanup`'s orphan reconciler. -// -// RED before the fix: `open_at_branch` does a hard `checkout_branch(branch)?` on -// the now-missing `_graph_commits.lance` ref and errors `OmniError::Lance`, -// wedging the logical read. -#[tokio::test] -async fn open_at_branch_reads_manifest_lineage_when_commit_graph_ref_is_missing() { - use crate::db::commit_graph::{CommitGraph, seed_legacy_v3_lineage_with_branch}; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - - // 1. A graph with a REAL `feature` Lance branch on both `_graph_commits.lance` - // and `__manifest`, then migrate BOTH main and the branch to v4 so the - // branch's lineage is authoritative in `__manifest` (not the legacy - // fallback). After this, `graph_head:feature` resolves the branch commit - // from `__manifest` and the `_graph_commits.lance` `feature` ref still - // exists (the v3→v4 migration leaves it in place).
- let fx = seed_legacy_v3_lineage_with_branch(uri).await.unwrap(); - super::migrate_on_open(uri).await.unwrap(); - let (_branch_by_id, branch_heads) = migrate_branch_and_read_lineage(uri, &fx.branch).await; - assert_eq!( - branch_heads.get(&fx.branch).map(String::as_str), - Some(fx.branch_commit.as_str()), - "precondition: __manifest carries graph_head:feature (lineage is authoritative)", - ); - - // 2. Force-delete ONLY the derived `_graph_commits.lance` `feature` ref, - // leaving the `__manifest` `feature` branch (and its lineage) untouched -- - // the exact shape an interrupted fork-reclaim / cleanup race produces. - { - let mut cg = CommitGraph::open(uri).await.unwrap(); - cg.force_delete_branch(&fx.branch).await.unwrap(); - } - // Sanity: the derived ref is genuinely gone from `_graph_commits.lance`. - { - let cg = CommitGraph::open(uri).await.unwrap(); - let branches = cg.list_branches().await.unwrap(); - assert!( - !branches.iter().any(|b| b == &fx.branch), - "the _graph_commits.lance feature ref must be deleted to build the wedge, got: {branches:?}", - ); - } - - // 3a. The logical READS at the branch succeed from `__manifest` despite the - // missing derived ref. `open_at_branch` is the one that errors pre-fix.
- let mut cg = CommitGraph::open_at_branch(uri, &fx.branch) - .await - .expect("open_at_branch must read manifest lineage when the commit-graph ref is missing"); - let commits = cg.load_commits().await.unwrap(); - assert_eq!( - commits.len(), - 3, - "the branch DAG (genesis + A + branch commit) is read from __manifest", - ); - assert_eq!( - cg.head_commit_id().await.unwrap().as_deref(), - Some(fx.branch_commit.as_str()), - "the branch head resolves from __manifest's graph_head:feature", - ); - let base = CommitGraph::merge_base(uri, Some(&fx.branch), Some(&fx.branch)) - .await - .expect("merge_base must resolve over the manifest-sourced DAG"); - assert_eq!( - base.map(|c| c.graph_commit_id), - Some(fx.branch_commit.clone()), - "merge_base(feature, feature) is reflexively the branch head", - ); - - // 3b. A WRITE that needs the derived ref fails loudly + actionably -- the repair - // is deferred to `cleanup`'s orphan reconciler, not inlined on a read. - let err = match cg.create_branch("derived").await { - Ok(()) => panic!("create_branch must fail when the commit-graph branch ref is missing"), - Err(e) => e, - }; - let msg = err.to_string(); - assert!( - msg.contains("commit-graph branch ref") && msg.contains("is missing"), - "expected the typed missing-ref error, got: {msg}", - ); -} - -// FIX B -- the v3→v4 lineage backfill must be concurrent-runner idempotent. -// -// `migrate_v2_to_v3` is explicitly safe under two processes opening the same -// legacy graph at once (each re-enumerates branches; `force_delete_branch` -// tolerates an already-gone branch). v3→v4 regressed that: `merge_lineage_rows` -// uses `conflict_retries(0)` and the migration had no app-level retry, so a -// concurrent first-open's CAS loser errored the whole open instead of converging.
-// -// This test reproduces exactly two concurrent first-opens: two `__manifest` -// handles opened at the SAME pre-migration (v3, empty-lineage) HEAD, then their -// `migrate_internal_schema` calls run under `tokio::join!`. Both pass the -// fast-path empty-lineage check and both attempt the backfill merge, so the -// row-level CAS on `graph_head:main` is guaranteed to fire -- deterministically -// red against the pre-fix code (the loser errors). The contract: BOTH converge -// to `Ok`, the manifest carries exactly the fixture's commit rows (merge keyed on -// `object_id`, so a double-merge stays exact), and the stamp is v4. -// -// (Driving pre-opened handles rather than `migrate_on_open(uri)` twice is a -// deliberate choice: `migrate_on_open` opens fresh each call, so two of them can -// luckily serialize -- one finishes before the other reads the fast path -- which -// would not exercise the CAS path and would pass even pre-fix. Pre-opening both -// at the empty-lineage HEAD forces the contention every run, so the RED is real.) -#[tokio::test(flavor = "multi_thread", worker_threads = 2)] -async fn concurrent_v3_to_v4_migrations_both_converge() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let fixture = seed_legacy_v3_lineage(uri).await.unwrap(); - - // Two handles opened at the same pre-migration HEAD: both see stamp v3 and an - // empty lineage, so both will run the full backfill and collide on the merge. - let mut ds_a = open_manifest_dataset(uri, None).await.unwrap(); - let mut ds_b = open_manifest_dataset(uri, None).await.unwrap(); - - let (res_a, res_b) = tokio::join!( - super::migrations::migrate_internal_schema(&mut ds_a, uri, None), - super::migrations::migrate_internal_schema(&mut ds_b, uri, None), - ); - - // The whole contract: a concurrent first-open's CAS loser converges instead of - // erroring. BOTH must succeed.
- res_a.expect("migration runner A must converge"); - res_b.expect("migration runner B must converge"); - - // Exactly the fixture's commits, no duplicates (the merge is keyed on - // `object_id`, so even a double-merge under read-after-write lag stays exact). - assert_eq!( - manifest_commit_row_count(uri).await, - fixture.all_ids.len(), - "concurrent backfills converge to exactly the fixture's commit rows", - ); - // And the stamp landed at v4. - { - let ds = open_manifest_dataset(uri, None).await.unwrap(); - assert_eq!( - super::migrations::read_stamp(&ds), - super::migrations::INTERNAL_MANIFEST_SCHEMA_VERSION, - "both runners leave the manifest stamped at the current version", - ); - } -} - -// ── RFC-013 Phase 7 / step 5: the `graph_head` concurrency gate ────────────── -// -// Two (or N) writers committing DISJOINT tables on the same branch still share -// one mutable `graph_head:main` row (one `object_id`, `WhenMatched::UpdateAll`). -// Their table-version rows never collide (distinct `object_id`s), so the *only* -// row-level CAS contention is on `graph_head:main`. The contract under test: -// exactly one writer wins each CAS round; the loser retries, re-resolves its -// parent off the freshly-advanced head (inside the publisher's retry loop), and -// re-commits -- so every writer commits and the resulting graph_commit DAG is a -// single LINEAR chain (no fork), not a tree. This is the cross-process -// disjoint-table fork closed by the shared head row (invariants.md §7.1). - -/// A microsecond UNIX timestamp for a `LineageIntent`, matching the genesis / -/// commit-graph `created_at` unit. -fn lineage_now_micros() -> i64 { - std::time::SystemTime::now() - .duration_since(std::time::UNIX_EPOCH) - .unwrap() - .as_micros() as i64 -} - -/// Append one row to a two-column NODE table (`id`, `name`) and return the -/// resulting `SubTableUpdate` at the new on-disk version.
Generalizes -/// `append_person_and_make_update` to any node table whose schema is `(id: -/// String, name: String[, ...])`; the extra `Person.age` column is filled null -/// when present so the same helper drives both `node:Person` and `node:Company`. -async fn append_node_row_and_make_update( - uri: &str, - entry: &SubTableEntry, - id: &str, - name: &str, -) -> SubTableUpdate { - let mut ds = Dataset::open(&format!("{}/{}", uri, entry.table_path)) - .await - .unwrap(); - let schema = Arc::new(ds.schema().into()); - let arrow_schema: &Schema = &schema; - // Columns 0/1 are (id, name); a third column (Person.age) is filled null. - let mut columns: Vec> = vec![ - Arc::new(StringArray::from(vec![id.to_string()])), - Arc::new(StringArray::from(vec![name.to_string()])), - ]; - for field in arrow_schema.fields().iter().skip(2) { - columns.push(arrow_array::new_null_array(field.data_type(), 1)); - } - let row = RecordBatch::try_new(Arc::clone(&schema), columns).unwrap(); - let reader = RecordBatchIterator::new(vec![Ok(row)], schema); - ds.append(reader, None).await.unwrap(); - let new_version = ds.version().version; - let version_metadata = - table_version_metadata_for_state(uri, &entry.table_path, None, new_version) - .await - .unwrap(); - SubTableUpdate { - table_key: entry.table_key.clone(), - table_version: new_version, - table_branch: None, - row_count: 1, - version_metadata, - } -} - -/// Read the `graph_commit` lineage rows from `__manifest` at main and assert -/// they form a single LINEAR chain of `expected_total` commits (one genesis + -/// the rest), with no fork. Returns the head commit id. -/// -/// "Linear, not a fork" is proven structurally: (1) exactly one parentless -/// genesis; (2) no two commits share a `parent_commit_id` (a fork would have two -/// children off one parent); (3) every commit except the unique head is the -/// parent of exactly one other commit -- so the parent pointers form one path -/// that visits all commits.
(1)+(2)+(3) over a connected set is a single chain. -async fn assert_linear_chain(uri: &str, expected_total: usize) -> String { - let ds = open_manifest_dataset(uri, None).await.unwrap(); - let (rows, _heads) = read_graph_lineage(&ds).await.unwrap(); - assert_eq!( - rows.len(), - expected_total, - "expected {expected_total} graph_commit rows (genesis + the concurrent commits), got {}", - rows.len(), - ); - - // (1) exactly one genesis. - let genesis: Vec<&GraphLineageRow> = - rows.iter().filter(|r| r.parent_commit_id.is_none()).collect(); - assert_eq!( - genesis.len(), - 1, - "exactly one parentless genesis commit in a linear chain, got {}", - genesis.len(), - ); - - // (2) no two commits parent off the same commit (no fork). - let mut parents: Vec<&str> = rows - .iter() - .filter_map(|r| r.parent_commit_id.as_deref()) - .collect(); - let parent_count = parents.len(); - parents.sort_unstable(); - parents.dedup(); - assert_eq!( - parents.len(), - parent_count, - "two commits share a parent_commit_id -- the DAG forked instead of forming a linear chain", - ); - - // (3) the head (the `should_replace_head` winner) plus the parent set covers - // every commit exactly once: each non-head commit is some commit's parent. - let head = super::state::head_lineage_row(&rows).expect("a non-empty lineage has a head"); - let ids: std::collections::HashSet<&str> = - rows.iter().map(|r| r.graph_commit_id.as_str()).collect(); - let parent_set: std::collections::HashSet<&str> = parents.iter().copied().collect(); - // The head is the only commit that is not a parent of anything. - let non_parents: Vec<&str> = ids - .iter() - .copied() - .filter(|id| !parent_set.contains(id)) - .collect(); - assert_eq!( - non_parents, - vec![head.graph_commit_id.as_str()], - "the only commit that is no one's parent must be the head -- a fork or break leaves others", - ); - // Every parent points at a real commit (connectedness).
- for parent in &parent_set { - assert!( - ids.contains(parent), - "parent {parent} must be a known commit in the chain", - ); - } - - head.graph_commit_id.clone() -} - -/// Test A (deterministic, the must-have): two writers, two DISJOINT table -/// updates, two distinct `LineageIntent`s, `tokio::join!`. BOTH commit (the loser -/// retries on the `graph_head:main` CAS conflict and re-parents off the winner), -/// and the on-disk graph_commit DAG is a single linear chain genesis → c → c'. -#[tokio::test(flavor = "multi_thread", worker_threads = 2)] -async fn concurrent_disjoint_writes_share_head_and_form_linear_chain() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let catalog = build_test_catalog(); - let mc = ManifestCoordinator::init(uri, &catalog).await.unwrap(); - let snap = mc.snapshot(); - let person_entry = snap.entry("node:Person").unwrap().clone(); - let company_entry = snap.entry("node:Company").unwrap().clone(); - - // Two DISJOINT table-version rows (`node:Person@v=2`, `node:Company@v=2`): - // distinct `object_id`s, so neither hits the table-version CAS. The ONLY - // shared row both writers merge is `graph_head:main`. - let update_a = append_node_row_and_make_update(uri, &person_entry, "p1", "Alice").await; - let update_b = append_node_row_and_make_update(uri, &company_entry, "c1", "Acme").await; - - let publisher_a = GraphNamespacePublisher::new(uri, None); - let publisher_b = GraphNamespacePublisher::new(uri, None); - let changes_a = vec![ManifestChange::Update(update_a)]; - let changes_b = vec![ManifestChange::Update(update_b)]; - // Each writer mints its own stable commit id; the parent re-resolves per - // attempt inside the publisher.
- let intent_a = LineageIntent { - graph_commit_id: ulid::Ulid::new().to_string(), - branch: None, - actor_id: Some("act-a".to_string()), - merged_parent_commit_id: None, - created_at: lineage_now_micros(), - }; - let intent_b = LineageIntent { - graph_commit_id: ulid::Ulid::new().to_string(), - branch: None, - actor_id: Some("act-b".to_string()), - merged_parent_commit_id: None, - created_at: lineage_now_micros(), - }; - // Empty expected-versions: the two writers are disjoint, so neither asserts a - // version on the other's table; contention is purely the shared head row. - let empty = HashMap::new(); - let (res_a, res_b) = tokio::join!( - async { publisher_a.publish(&changes_a, &empty, Some(&intent_a)).await }, - async { publisher_b.publish(&changes_b, &empty, Some(&intent_b)).await } - ); - - // BOTH commit: disjoint tables → the head-row CAS loser retries within - // PUBLISHER_RETRY_BUDGET, re-resolves its parent off the winner, and lands. - res_a.expect("writer A must commit"); - res_b.expect("writer B must commit"); - - // End-state assertion (the on-disk DAG is fixed once both committed): a single - // linear chain genesis → first → second, no fork. The two minted ids both - // appear; their parents form a chain (one off genesis, the other off the - // first), so no two commits share a parent. - let head = assert_linear_chain(uri, 3).await; - assert!( - head == intent_a.graph_commit_id || head == intent_b.graph_commit_id, - "the head must be one of the two concurrent commits", - ); - // Both committed table writes are visible (Person and Company advanced).
- let reopened = ManifestCoordinator::open(uri).await.unwrap(); - let after = reopened.snapshot(); - assert_eq!(after.entry("node:Person").unwrap().table_version, 2); - assert_eq!(after.entry("node:Company").unwrap().table_version, 2); -} - -/// Test C (S3 variant, bucket-gated): the same two-disjoint-writers + -/// `LineageIntent` race as Test A, but on a real object store so the one-winner -/// behaviour exercises the genuine conditional-put CAS on `__manifest` rather -/// than the local content-token emulation. Skips with a log when -/// `OMNIGRAPH_S3_TEST_BUCKET` is unset (the `tests/s3_storage.rs` gate); the -/// rustfs CI job sets it. Asserts the same end-state: both commit, single linear -/// chain. -#[tokio::test(flavor = "multi_thread", worker_threads = 2)] -async fn concurrent_disjoint_writes_form_linear_chain_on_s3() { - let Ok(bucket) = std::env::var("OMNIGRAPH_S3_TEST_BUCKET") else { - eprintln!( - "SKIP concurrent_disjoint_writes_form_linear_chain_on_s3: \ - OMNIGRAPH_S3_TEST_BUCKET unset β€” the S3 lineage-CAS gate needs an object store" - ); - return; - }; - let uri = format!( - "s3://{bucket}/lineage-concurrency/{}-{}", - std::process::id(), - ulid::Ulid::new() - ); - - let catalog = build_test_catalog(); - let mc = ManifestCoordinator::init(&uri, &catalog).await.unwrap(); - let snap = mc.snapshot(); - let person_entry = snap.entry("node:Person").unwrap().clone(); - let company_entry = snap.entry("node:Company").unwrap().clone(); - - let update_a = append_node_row_and_make_update(&uri, &person_entry, "p1", "Alice").await; - let update_b = append_node_row_and_make_update(&uri, &company_entry, "c1", "Acme").await; - - let publisher_a = GraphNamespacePublisher::new(&uri, None); - let publisher_b = GraphNamespacePublisher::new(&uri, None); - let changes_a = vec![ManifestChange::Update(update_a)]; - let changes_b = vec![ManifestChange::Update(update_b)]; - let intent_a = LineageIntent { - graph_commit_id: ulid::Ulid::new().to_string(), - branch: None, - 
actor_id: Some("act-a".to_string()), - merged_parent_commit_id: None, - created_at: lineage_now_micros(), - }; - let intent_b = LineageIntent { - graph_commit_id: ulid::Ulid::new().to_string(), - branch: None, - actor_id: Some("act-b".to_string()), - merged_parent_commit_id: None, - created_at: lineage_now_micros(), - }; - let empty = HashMap::new(); - let (res_a, res_b) = tokio::join!( - async { publisher_a.publish(&changes_a, &empty, Some(&intent_a)).await }, - async { publisher_b.publish(&changes_b, &empty, Some(&intent_b)).await } - ); - res_a.expect("writer A must commit on S3"); - res_b.expect("writer B must commit on S3"); - - let head = assert_linear_chain(&uri, 3).await; - assert!( - head == intent_a.graph_commit_id || head == intent_b.graph_commit_id, - "the head must be one of the two concurrent commits", - ); -} - -/// Test B (bounded-retry convergence, scaled): N=8 same-branch writers, each -/// touching a DISJOINT table-version row + its own `LineageIntent`, each wrapped -/// in an APP-LEVEL retry loop. `PUBLISHER_RETRY_BUDGET=5` means the later writers -/// can exhaust the internal budget under contention, so the app loop re-submits -/// on a typed `Conflict` / row-level-CAS-contention error. All 8 eventually -/// commit and the final DAG is a single linear chain of 8 (+genesis), no fork, -/// no lost commit. 
-#[tokio::test(flavor = "multi_thread", worker_threads = 4)] -async fn n_concurrent_disjoint_writers_converge_to_one_linear_chain() { - use crate::error::ManifestConflictDetails; - use crate::error::ManifestErrorKind; - - const N: usize = 8; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let catalog = build_test_catalog(); - let mc = ManifestCoordinator::init(uri, &catalog).await.unwrap(); - let snap = mc.snapshot(); - let person_entry = snap.entry("node:Person").unwrap().clone(); - let company_entry = snap.entry("node:Company").unwrap().clone(); - - // Synthesize N=8 DISJOINT table-version updates by sequentially advancing the - // two node tables four versions each (Person@v2..v5, Company@v2..v5). Each - // update is a distinct `object_id`, so the writers never collide on a - // table-version row β€” only on the shared `graph_head:main`. Built serially - // here (before the concurrent phase) so the on-disk versions exist. - let mut updates: Vec = Vec::with_capacity(N); - for i in 0..(N / 2) { - updates.push( - append_node_row_and_make_update(uri, &person_entry, &format!("p{i}"), &format!("P{i}")) - .await, - ); - updates.push( - append_node_row_and_make_update(uri, &company_entry, &format!("c{i}"), &format!("C{i}")) - .await, - ); - } - assert_eq!(updates.len(), N); - - // Each writer: its own publisher + its own commit id + an app-level retry loop - // re-submitting on a typed Conflict (the publisher's internal budget can be - // exhausted by the later contenders, so convergence relies on the app retry). 
- let uri_owned = uri.to_string(); - let mut handles = Vec::with_capacity(N); - for update in updates { - let uri = uri_owned.clone(); - handles.push(tokio::spawn(async move { - let commit_id = ulid::Ulid::new().to_string(); - let changes = vec![ManifestChange::Update(update)]; - let empty = HashMap::new(); - // Bounded app-level retry: re-submit on a Conflict-kind manifest error - // (the only retryable outcome here is losing the shared-head CAS). - for _attempt in 0..64 { - let intent = LineageIntent { - graph_commit_id: commit_id.clone(), - branch: None, - actor_id: None, - merged_parent_commit_id: None, - created_at: lineage_now_micros(), - }; - let publisher = GraphNamespacePublisher::new(&uri, None); - match publisher.publish(&changes, &empty, Some(&intent)).await { - Ok(_) => return commit_id, - Err(OmniError::Manifest(m)) - if matches!(m.kind, ManifestErrorKind::Conflict) - && matches!( - m.details, - Some(ManifestConflictDetails::RowLevelCasContention) - ) => - { - // lost the shared-head CAS after exhausting the internal - // budget β€” re-resolve parent + re-submit. - continue; - } - Err(other) => panic!("non-retryable publish error: {other:?}"), - } - } - panic!("writer for commit {commit_id} did not converge within the app-retry budget"); - })); - } - - let mut committed_ids = Vec::with_capacity(N); - for handle in handles { - committed_ids.push(handle.await.unwrap()); - } - // All 8 distinct writer ids committed (no lost commit, no duplicate id). - committed_ids.sort(); - committed_ids.dedup(); - assert_eq!(committed_ids.len(), N, "every writer must commit exactly once"); - - // The final DAG is a single linear chain of genesis + 8 = 9, no fork. 
- assert_linear_chain(uri, N + 1).await; -} diff --git a/crates/omnigraph/src/db/mod.rs b/crates/omnigraph/src/db/mod.rs index 2ce3e29..d0b292f 100644 --- a/crates/omnigraph/src/db/mod.rs +++ b/crates/omnigraph/src/db/mod.rs @@ -3,6 +3,7 @@ pub mod graph_coordinator; pub mod manifest; mod omnigraph; mod recovery_audit; +mod run_registry; mod schema_state; pub(crate) mod write_queue; @@ -10,14 +11,11 @@ pub use commit_graph::GraphCommit; pub use graph_coordinator::{GraphCoordinator, ReadTarget, ResolvedTarget, SnapshotId}; pub use manifest::{Snapshot, SubTableEntry, SubTableUpdate}; pub(crate) use omnigraph::ensure_public_branch_ref; -pub(crate) use omnigraph::WriteTxn; pub use omnigraph::{ - CleanupPolicyOptions, InitOptions, MergeOutcome, Omnigraph, OpenMode, PendingIndex, - RepairAction, RepairClassification, RepairOptions, RepairStats, SchemaApplyOptions, - SchemaApplyResult, SkipReason, TableCleanupStats, TableOptimizeStats, TableRepairStats, + CleanupPolicyOptions, InitOptions, MergeOutcome, Omnigraph, OpenMode, SchemaApplyOptions, + SchemaApplyResult, TableCleanupStats, TableOptimizeStats, }; - -use crate::error::{OmniError, Result}; +pub(crate) use run_registry::is_internal_run_branch; pub(crate) const SCHEMA_APPLY_LOCK_BRANCH: &str = "__schema_apply_lock__"; @@ -71,19 +69,5 @@ pub(crate) fn is_schema_apply_lock_branch(name: &str) -> bool { } pub(crate) fn is_internal_system_branch(name: &str) -> bool { - // Legacy `__run__*` staging branches (Run state machine, removed MR-771) - // are swept off `__manifest` by the v2β†’v3 internal-schema migration, so the - // only internal branch the engine still creates is the schema-apply lock. - is_schema_apply_lock_branch(name) -} - -/// Microseconds since the UNIX epoch β€” the `created_at` stamp threaded through -/// every graph-lineage / recovery-audit / commit-graph row. One canonical -/// helper so the clock-error mapping (variant + message) cannot drift across -/// the call sites that record those timestamps. 
-pub(crate) fn now_micros() -> Result { - let duration = std::time::SystemTime::now() - .duration_since(std::time::UNIX_EPOCH) - .map_err(|e| OmniError::manifest(format!("system clock before UNIX_EPOCH: {e}")))?; - Ok(duration.as_micros() as i64) + is_internal_run_branch(name) || is_schema_apply_lock_branch(name) } diff --git a/crates/omnigraph/src/db/omnigraph.rs b/crates/omnigraph/src/db/omnigraph.rs index 4a770f0..5c92ac3 100644 --- a/crates/omnigraph/src/db/omnigraph.rs +++ b/crates/omnigraph/src/db/omnigraph.rs @@ -16,7 +16,7 @@ use lance::dataset::scanner::ColumnOrdering; use lance::datatypes::BlobKind; use omnigraph_compiler::catalog::{Catalog, EdgeType, NodeType}; use omnigraph_compiler::schema::parser::parse_schema; -use omnigraph_compiler::types::{PropType, ScalarType}; +use omnigraph_compiler::types::ScalarType; use omnigraph_compiler::{ DropMode, SchemaIR, SchemaMigrationPlan, SchemaMigrationStep, SchemaTypeKind, build_catalog_from_ir, build_schema_ir, plan_schema_migration, @@ -26,22 +26,15 @@ use crate::db::graph_coordinator::{GraphCoordinator, PublishedSnapshot}; use crate::error::{OmniError, Result}; use crate::runtime_cache::RuntimeCache; use crate::storage::{StorageAdapter, join_uri, normalize_root_uri, storage_for_uri}; -use crate::storage_layer::SnapshotHandle; use crate::table_store::TableStore; mod export; mod optimize; -mod repair; mod schema_apply; mod table_ops; -pub use optimize::{CleanupPolicyOptions, SkipReason, TableCleanupStats, TableOptimizeStats}; -pub use repair::{ - RepairAction, RepairClassification, RepairOptions, RepairStats, TableRepairStats, -}; +pub use optimize::{CleanupPolicyOptions, TableCleanupStats, TableOptimizeStats}; pub use schema_apply::SchemaApplyOptions; -pub use table_ops::PendingIndex; -pub(crate) use table_ops::OpenedForMutation; use super::commit_graph::GraphCommit; use super::manifest::{ @@ -74,41 +67,6 @@ pub struct SchemaApplyResult { pub steps: Vec, } -#[derive(Debug, Clone)] -pub struct SchemaApplyPreview 
{ - pub plan: SchemaMigrationPlan, - pub catalog: Catalog, -} - -/// A capture-once write transaction (RFC-013 step 3b). Pins the operation's read -/// base ONCE so the per-table opens reuse the pinned version instead of -/// re-resolving / re-validating per table. The schema contract is validated once -/// (when `base` is captured). NOT a general "no re-resolution" handle β€” the -/// commit-time OCC re-read, the live-HEAD drift probe, and the fork-authority reads -/// stay fresh (correctness machinery). Step 5 (PublishPlan unification) makes this -/// the non-optional publish carrier and adds session-aware base opens there, gated -/// by an S3 cost test β€” the warm-session benefit on the single remaining open is an -/// object-store phenomenon, so it earns its own gate rather than riding this PR. -/// -/// Threaded as `Option<&WriteTxn>` through the mutate/load write chain -/// (`open_for_mutation_on_branch`, `commit_all`, `commit_updates_on_branch_with_expected`) -/// so a single write validates the schema contract EXACTLY ONCE β€” at capture. When -/// present, the per-table resolves source the pinned `base` entry instead of calling -/// `resolved_branch_target` / `snapshot_for_branch` / `fresh_snapshot_for_branch` -/// (each of which re-runs `ensure_schema_state_valid`). When absent (`None` β€” every -/// non-mutate/load caller), every threaded function behaves byte-identically to -/// before. The carrier never removes a version guard or changes which dataset version -/// the per-table open targets: strict ops keep `open_dataset_head_for_write` + -/// `ensure_expected_version`, and the commit-time OCC re-read still opens a fresh -/// manifest snapshot (via `fresh_snapshot_for_branch_unchecked`) β€” only the redundant -/// schema re-validation is dropped. -pub(crate) struct WriteTxn { - /// The resolved branch (`None` = main). - pub(crate) branch: Option, - /// The pinned base snapshot (per-table location + version + e_tag), captured once. 
- pub(crate) base: Snapshot, -} - /// Top-level handle to an Omnigraph database. /// /// An Omnigraph is a Lance-native graph database with git-style branching. @@ -123,12 +81,12 @@ pub struct Omnigraph { /// calls without a global write lock). Reads (`snapshot`, `version`, /// `current_branch`, `branch_list`, `resolve_*`, `head_commit_id`, /// `list_commits`, …) acquire `.read().await` and parallelize. - /// Writes (`refresh`, `branch_create`, `branch_delete`, `commit_*`) - /// acquire `.write().await` and serialize. The atomic commit invariant β€” - /// table-version rows and the graph commit are one unit β€” holds by - /// construction since RFC-013 Phase 7: both ride a SINGLE manifest publish - /// CAS (`commit_changes_with_lineage`), so there is no two-write window to - /// keep atomic. PR 2 Phase 2 + /// Writes (`refresh`, `branch_create`, `branch_delete`, `commit_*`, + /// `record_*`) acquire `.write().await` and serialize. The atomic + /// commit invariant β€” `commit_manifest_updates` followed by + /// `record_graph_commit` must be atomic β€” is preserved by the + /// single `.write()` covering both calls inside + /// `commit_updates_with_actor_with_expected`. PR 2 Phase 2 /// converted from `Mutex` to `RwLock` because the bench showed /// the Mutex was the dominant serializer for disjoint-table /// workloads. Lock acquisition order: always before `runtime_cache` @@ -136,12 +94,6 @@ pub struct Omnigraph { coordinator: Arc>, table_store: TableStore, runtime_cache: RuntimeCache, - /// Per-graph read caches: one shared Lance `Session` plus the held-`Dataset` - /// handle cache, handed to live-Branch-read snapshots (via - /// `resolved_target`) so table opens reuse handles (0 IO on a warm repeat) - /// and one session. Invalidated alongside `runtime_cache` on branch switch / - /// refresh β€” hygiene only; version-in-key carries correctness. - read_caches: Arc, /// Read-heavy on every query, written only by `apply_schema`. 
ArcSwap /// gives atomic pointer swap with zero-cost reads (`load()` returns a /// `Guard>`), so concurrent queries on different actors @@ -150,11 +102,10 @@ pub struct Omnigraph { /// Read-heavy on schema introspection paths, written only by /// `apply_schema`. Same ArcSwap rationale as `catalog`. schema_source: Arc>, - /// Per-`(table_key, branch)` writer queues β€” the engine's - /// write-serialization mechanism (the server holds the engine as a - /// lockless `Arc`). Reachable from engine internals - /// (mutation finalize, schema_apply, branch_merge, ensure_indices, - /// delete_where, the fork path, recovery reconciler). + /// Per-`(table_key, branch)` writer queues. Reachable from engine + /// internals (mutation finalize, schema_apply, branch_merge, + /// ensure_indices, delete_where) and from future MR-870 recovery + /// reconciler. PR 1b adds the field; callers acquire in commits 4+. write_queue: Arc, /// Process-wide mutex held across the swap β†’ operate β†’ restore window /// in `branch_merge_impl`. Two concurrent merges with distinct targets @@ -194,17 +145,6 @@ pub struct Omnigraph { /// `apply_schema_as` consults this field (PR #2 proof-of-concept); /// PR #3 fans the `enforce()` call out to the remaining writers. policy: Option>, - /// Lazily-built, reused-across-queries embedding client. Built on the first - /// `nearest($v, "string")` that needs server-side embedding (so a graph that - /// never embeds needs no provider key), then shared by every later query β€” - /// avoids the per-query `from_env()` rebuild and keeps the provider HTTP - /// connection pool warm. `OnceCell` guarantees a single initialization. - embedding: Arc>, - /// Optional pre-resolved embedding config (RFC-012 Phase 5), injected from an - /// applied cluster `providers.embedding` profile via [`Omnigraph::with_embedding_config`]. - /// When set, the embedding cell builds its client from this instead of - /// `EmbeddingClient::from_env()`; `None` keeps the env fallback. 
- embedding_config: Option>, } /// Whether [`Omnigraph::open`] runs the open-time recovery sweep. @@ -317,7 +257,7 @@ impl Omnigraph { { return Err(OmniError::AlreadyInitialized { uri: root.clone() }); } - if let Err(err) = crate::failpoints::maybe_fail(crate::failpoints::names::INIT_AFTER_SCHEMA_PG_WRITTEN) { + if let Err(err) = crate::failpoints::maybe_fail("init.after_schema_pg_written") { best_effort_cleanup_init_artifacts(&root, storage.as_ref()).await; return Err(err); } @@ -363,21 +303,11 @@ impl Omnigraph { coordinator: Arc::new(tokio::sync::RwLock::new(coordinator)), table_store: TableStore::new(&root), runtime_cache: RuntimeCache::default(), - // One shared Session per graph (LanceDB's one-session-per-connection - // model) plus the held-handle cache, created once and reused across - // reads. Session::default() caps are lazy (6 GiB index / 1 GiB - // metadata); multi-graph cap/sharing is a deferred follow-up. - read_caches: Arc::new(crate::runtime_cache::ReadCaches { - session: Arc::new(lance::session::Session::default()), - handles: Arc::new(crate::runtime_cache::TableHandleCache::default()), - }), catalog: Arc::new(ArcSwap::from_pointee(catalog)), schema_source: Arc::new(ArcSwap::from_pointee(schema_source.to_string())), write_queue: Arc::new(crate::db::write_queue::WriteQueueManager::new()), merge_exclusive: Arc::new(tokio::sync::Mutex::new(())), policy: None, - embedding: Arc::new(tokio::sync::OnceCell::new()), - embedding_config: None, }) } @@ -395,10 +325,12 @@ impl Omnigraph { Self::open_with_storage_and_mode(uri, storage_for_uri(uri)?, OpenMode::ReadOnly).await } - /// Open with a caller-supplied [`StorageAdapter`]. Used by init/test paths - /// and by embedding/test consumers that wrap storage (e.g. a counting - /// decorator for IO-budget tests). Defaults to `OpenMode::ReadWrite`. - pub async fn open_with_storage(uri: &str, storage: Arc) -> Result { + /// `open_with_storage` retained for existing callers (init/test paths). 
+ /// Defaults to `OpenMode::ReadWrite`. + pub(crate) async fn open_with_storage( + uri: &str, + storage: Arc, + ) -> Result { Self::open_with_storage_and_mode(uri, storage, OpenMode::ReadWrite).await } @@ -408,24 +340,6 @@ impl Omnigraph { mode: OpenMode, ) -> Result { let root = normalize_root_uri(uri)?; - // Apply pending internal-schema migrations before the coordinator reads - // branch state, so `branch_list` and the schema-apply blocking-branch - // checks observe the post-migration graph β€” notably the v2β†’v3 sweep of - // legacy `__run__*` staging branches (MR-770). ReadWrite only: a - // read-only open must not trigger object-store writes, so a read-only - // open of an unmigrated legacy graph still lists `__run__*` until its - // first read-write open (an accepted, documented limitation). - if matches!(mode, OpenMode::ReadWrite) { - crate::db::manifest::migrate_on_open(&root).await?; - } else { - // A read-only open skips `migrate_on_open` (no object-store writes), - // which is where the version refusal otherwise lives. Still refuse a - // `__manifest` stamped outside this binary's supported range β€” newer - // than CURRENT (an old binary cannot silently misread a newer graph, - // e.g. one folded to internal-schema v4 lineage), or below - // MIN_SUPPORTED (predates the readers we carry). Read-only, no write. - crate::db::manifest::refuse_if_internal_schema_unsupported(&root).await?; - } // Open the coordinator first so the schema-staging recovery sweep can // compare its snapshot against any leftover staging files. let mut coordinator = GraphCoordinator::open(&root, Arc::clone(&storage)).await?; @@ -443,11 +357,10 @@ impl Omnigraph { recover_schema_state_files(&root, Arc::clone(&storage), &coordinator.snapshot()) .await?; // Recovery sweep: close the Phase B β†’ Phase C residual on - // any sidecar left over from a crashed writer. 
Long-running - // processes additionally converge in-process: the staged- - // write entry points and `refresh` run the roll-forward-only - // heal (`heal_pending_sidecars_roll_forward`); only - // rollback-eligible sidecars wait for this open-time sweep. + // any sidecar left over from a crashed writer. Continuous + // in-process recovery for long-running servers (no restart + // required between Phase B failure and recovery) is a + // separate background-reconciler effort. crate::db::manifest::recover_manifest_drift( &root, Arc::clone(&storage), @@ -478,21 +391,11 @@ impl Omnigraph { coordinator: Arc::new(tokio::sync::RwLock::new(coordinator)), table_store: TableStore::new(&root), runtime_cache: RuntimeCache::default(), - // One shared Session per graph (LanceDB's one-session-per-connection - // model) plus the held-handle cache, created once and reused across - // reads. Session::default() caps are lazy (6 GiB index / 1 GiB - // metadata); multi-graph cap/sharing is a deferred follow-up. - read_caches: Arc::new(crate::runtime_cache::ReadCaches { - session: Arc::new(lance::session::Session::default()), - handles: Arc::new(crate::runtime_cache::TableHandleCache::default()), - }), catalog: Arc::new(ArcSwap::from_pointee(catalog)), schema_source: Arc::new(ArcSwap::from_pointee(schema_source)), write_queue: Arc::new(crate::db::write_queue::WriteQueueManager::new()), merge_exclusive: Arc::new(tokio::sync::Mutex::new(())), policy: None, - embedding: Arc::new(tokio::sync::OnceCell::new()), - embedding_config: None, }) } @@ -540,29 +443,6 @@ impl Omnigraph { self } - /// The lazily-initialized, reused-across-queries embedding client cell - /// (see the `embedding` field doc). The query executor resolves the client - /// through this on the first `nearest($v, "string")` that needs embedding. - pub(crate) fn embedding_cell( - &self, - ) -> &tokio::sync::OnceCell { - &self.embedding - } - - /// Install a pre-resolved embedding config (RFC-012 Phase 5). 
Builder-style, - /// mirroring [`Omnigraph::with_policy`]: a graph served from a cluster - /// embedding provider profile injects it here; an embedded/CLI caller that doesn't - /// call this keeps the `EmbeddingClient::from_env()` fallback. - pub fn with_embedding_config(mut self, config: Arc) -> Self { - self.embedding_config = Some(config); - self - } - - /// The injected embedding config, if any (see the `embedding_config` field). - pub(crate) fn embedding_config_ref(&self) -> Option<&crate::embedding::EmbeddingConfig> { - self.embedding_config.as_deref() - } - /// Engine-layer policy enforcement gate (MR-722 chassis core). /// /// * If no policy is installed → no-op (returns `Ok(())`). @@ -597,12 +477,6 @@ impl Omnigraph { } pub(crate) async fn ensure_schema_state_valid(&self) -> Result<()> { - // Full per-call validation is intentional: a long-lived handle must - // detect external drift of the schema source, IR, OR state on its next - // operation (see lifecycle::long_lived_handle_rejects_schema_* tests). A - // source-only fast path would miss IR/state drift when _schema.pg is - // unchanged, so the only safe latency win is not calling this twice per - // query (finding A removes the redundant caller in exec/query.rs).
validate_schema_contract(self.uri(), Arc::clone(&self.storage)).await } @@ -619,14 +493,6 @@ impl Omnigraph { schema_apply::plan_schema(self, desired_schema_source, options).await } - pub async fn preview_schema_apply_with_options( - &self, - desired_schema_source: &str, - options: SchemaApplyOptions, - ) -> Result { - schema_apply::preview_schema_apply(self, desired_schema_source, options).await - } - pub async fn apply_schema(&self, desired_schema_source: &str) -> Result { self.apply_schema_as(desired_schema_source, SchemaApplyOptions::default(), None) .await @@ -657,28 +523,7 @@ impl Omnigraph { options: SchemaApplyOptions, actor: Option<&str>, ) -> Result { - self.apply_schema_as_with_catalog_check(desired_schema_source, options, actor, |_| Ok(())) - .await - } - - pub async fn apply_schema_as_with_catalog_check( - &self, - desired_schema_source: &str, - options: SchemaApplyOptions, - actor: Option<&str>, - validate_catalog: F, - ) -> Result - where - F: FnOnce(&Catalog) -> Result<()>, - { - schema_apply::apply_schema( - self, - desired_schema_source, - options, - actor, - validate_catalog, - ) - .await + schema_apply::apply_schema(self, desired_schema_source, options, actor).await } pub(crate) async fn ensure_schema_apply_idle(&self, operation: &str) -> Result<()> { @@ -689,30 +534,19 @@ impl Omnigraph { schema_apply::ensure_schema_apply_not_locked(self, operation).await } - /// Engine-facing trait surface around `TableStore`. - /// - /// This is the **only** accessor for engine code reaching into the - /// storage layer. The trait's signatures use opaque `SnapshotHandle` - /// / `StagedHandle` instead of leaking `lance::Dataset` / - /// `lance::dataset::transaction::Transaction`, so newly-added engine - /// call sites cannot drift the staged-write invariant by mistake - /// (the trait's `stage_*` + `commit_staged` pair is the only way to - /// land a write). 
- pub(crate) fn storage(&self) -> &dyn crate::storage_layer::TableStorage { + pub(crate) fn table_store(&self) -> &TableStore { &self.table_store } - /// Inline-commit residual surface (`delete_where`, - /// `create_vector_index`) β€” the writes Lance cannot yet express as a - /// stage-then-commit pair. Deliberately separate from [`Self::storage`] so - /// the default storage surface is staged-only and a new writer cannot couple - /// "write bytes" with "advance HEAD" by reaching for `db.storage()`. Only - /// the handful of documented residual call sites (mutation/merge deletes, - /// vector-index build) use this accessor. See - /// `crate::storage_layer::InlineCommitResidual` for the per-method blocker. - pub(crate) fn storage_inline_residual( - &self, - ) -> &dyn crate::storage_layer::InlineCommitResidual { + /// Engine-facing trait surface around `TableStore`. + /// + /// This is the canonical accessor for newly-written engine code. The + /// trait's signatures use opaque `SnapshotHandle` / `StagedHandle` + /// instead of leaking `lance::Dataset` / + /// `lance::dataset::transaction::Transaction`. Existing call sites + /// that still use `db.table_store.X(...)` (the inherent struct + /// methods) are migrated incrementally. + pub(crate) fn storage(&self) -> &dyn crate::storage_layer::TableStorage { &self.table_store } @@ -774,29 +608,6 @@ impl Omnigraph { *self.coordinator.write().await = coordinator; } - /// Open a capture-once write transaction (RFC-013 step 3b): validate the schema - /// contract ONCE and pin the base snapshot. The per-table opens take - /// `Option<&WriteTxn>` and, on the bound branch for the non-strict (Insert/Merge) - /// path, source the pinned base entry β€” instead of re-resolving (re-validating the - /// schema) per table. Strict ops, the fork path, and the commit-time OCC re-read - /// keep their fresh reads (those are correctness machinery β€” see the handoff doc). 
- /// - /// "Once" covers the table-touch hot path captured here (proven by the node-insert - /// gate `write_validates_schema_contract_once`); it does NOT yet cover edge endpoint - /// / cardinality RI validation (`ensure_node_id_exists`, the loader's RI/cardinality), - /// which still resolve through `snapshot_for_branch` and re-validate. Those reads must - /// observe LIVE committed state, so unifying them (validate-once + pinned + re-checked - /// read-set) is step 4's Β§7.1 work β€” threading `txn.base` there would re-introduce the - /// stale-read class the #298 cardinality fix removed. A session-aware base open is - /// likewise deferred to step 5 (handoff Β§1d). - pub(crate) async fn open_write_txn(&self, branch: Option<&str>) -> Result { - let resolved = self.resolved_branch_target(branch).await?; - Ok(WriteTxn { - branch: resolved.branch, - base: resolved.snapshot, - }) - } - pub(crate) async fn resolved_branch_target( &self, branch: Option<&str>, @@ -806,13 +617,10 @@ impl Omnigraph { let normalized = normalize_branch_name(branch.unwrap_or("main"))?; let coord = self.coordinator.read().await; if normalized.as_deref() == coord.current_branch() { - let snapshot_id = coord.head_commit_id().await?.unwrap_or_else(|| { - SnapshotId::synthetic( - coord.current_branch(), - coord.version(), - coord.manifest_incarnation().e_tag.as_deref(), - ) - }); + let snapshot_id = coord + .head_commit_id() + .await? + .unwrap_or_else(|| SnapshotId::synthetic(coord.current_branch(), coord.version())); return Ok(ResolvedTarget { requested, branch: coord.current_branch().map(str::to_string), @@ -829,43 +637,6 @@ impl Omnigraph { .map(|resolved| resolved.snapshot) } - pub(crate) async fn fresh_snapshot_for_branch(&self, branch: Option<&str>) -> Result { - self.ensure_schema_state_valid().await?; - self.fresh_snapshot_for_branch_unchecked(branch).await - } - - /// Fresh per-branch manifest snapshot WITHOUT the schema-contract - /// re-validation. 
Identical OCC freshness to [`fresh_snapshot_for_branch`] - /// β€” a fresh manifest re-read from storage, never the warm cache β€” only the - /// redundant `ensure_schema_state_valid` is dropped. Used inside a single - /// write once a `WriteTxn` has already validated the contract at capture: the - /// commit-time drift re-read needs the live manifest, not a second contract - /// read. Callers with no `WriteTxn` MUST use the checked variant. - /// - /// Reads the manifest directly via `ManifestCoordinator` rather than - /// `resolve_target`. The OCC re-read uses only the returned `Snapshot` - /// (per-table location + version), which `ManifestCoordinator::open().snapshot()` - /// produces identically to `GraphCoordinator::open(...).snapshot()` β€” but - /// `resolve_target` additionally opens the commit graph (an extra - /// `_graph_commits.lance` probe) the OCC read never consults. Skipping that - /// load is a pure read-cost reduction, not a freshness change. The checked - /// `fresh_snapshot_for_branch` delegates here, so its no-`txn` callers - /// (commit_all's None arm, optimize, repair, fork reclaim) get the same - /// identical `Snapshot` via this lighter manifest-only read; they consume - /// only the snapshot and never relied on the commit-graph side load. - pub(crate) async fn fresh_snapshot_for_branch_unchecked( - &self, - branch: Option<&str>, - ) -> Result { - let manifest = match branch { - Some(branch) => { - crate::db::manifest::ManifestCoordinator::open_at_branch(self.uri(), branch).await? 
- } - None => crate::db::manifest::ManifestCoordinator::open(self.uri()).await?, - }; - Ok(manifest.snapshot()) - } - pub(crate) async fn version(&self) -> u64 { self.coordinator.read().await.version() } @@ -902,13 +673,8 @@ impl Omnigraph { let branch = normalize_branch_name(branch)?; let next = self.open_coordinator_for_branch(branch.as_deref()).await?; *self.coordinator.write().await = next; - self.invalidate_read_caches().await; - Ok(()) - } - - async fn invalidate_read_caches(&self) { self.runtime_cache.invalidate_all().await; - self.read_caches.handles.invalidate_all().await; + Ok(()) } /// Re-read the handle-local coordinator state from storage AND run @@ -918,7 +684,7 @@ impl Omnigraph { /// /// Composition mirrors `Omnigraph::open_with_storage_and_mode`'s /// recovery sequence, in the same order, with one restriction: the - /// manifest-drift heal runs in `RollForwardOnly` mode (rollback / + /// manifest-drift sweep runs in `RollForwardOnly` mode (rollback / /// abort cases defer to the next ReadWrite open because /// `Dataset::restore` is unsafe under concurrency). Each step: /// @@ -930,120 +696,46 @@ impl Omnigraph { /// SchemaApply roll-forward doesn't publish the manifest while /// the staging files remain unrenamed (which would corrupt the /// graph: data on new schema, catalog on old). - /// 3. `heal_pending_sidecars_roll_forward` β€” close the + /// 3. `recover_manifest_drift(... RollForwardOnly)` β€” close the /// finalizeβ†’publisher residual via roll-forward; defer rollback - /// work to next ReadWrite open. Serializes against live writers - /// by acquiring each sidecar's per-(table_key, branch) write - /// queues, so refresh never rolls forward an in-flight writer's - /// sidecar from under it. + /// work to next ReadWrite open. /// 4. `runtime_cache.invalidate_all` β€” drop stale per-snapshot caches. /// /// Steady state cost: one `list_dir` of `__recovery/` (typically /// returns empty β†’ early return for both passes). 
No additional /// Lance reads. /// - /// The staged-write entry points (`load_as`, `mutate_as`) run the - /// same heal via - /// [`heal_pending_recovery_sidecars`](Self::heal_pending_recovery_sidecars), - /// so a long-lived server converges on the next write without an - /// explicit refresh. Engine-internal callers that already hold an - /// in-flight sidecar (e.g. `schema_apply` mid-write) MUST use + /// Engine-internal callers that already hold an in-flight sidecar + /// (e.g. `schema_apply` mid-write) MUST use /// [`refresh_coordinator_only`](Self::refresh_coordinator_only) to /// avoid the recovery sweep racing their own sidecar. pub async fn refresh(&self) -> Result<()> { - // Standalone schema-staging reconcile ONLY when no recovery - // sidecar exists (legacy/manual staging residue). When sidecars - // exist, the heal below owns the reconcile β€” per SchemaApply - // sidecar, under that sidecar's queue guards β€” because an - // unserialized reconcile can promote a LIVE schema apply's - // staging files from under it, and a pre-promoted result would - // make the heal's own guarded reconcile see clean staging and - // wrongly defer the sidecar. The no-sidecar case cannot race a - // live apply: its sidecar is on disk before its staging files. - // - // Scope the coord write guard to the schema-state section only. + // Scope the coord write guard to the recovery section only. // `reload_schema_if_source_changed` (below) acquires // `self.coordinator.read().await` when the on-disk schema source // has drifted from the cached `schema_source`. Tokio's RwLock is // not reentrant, so holding the write across that call deadlocks. // Pinned by `composite_flow_schema_apply_then_branch_ops_no_deadlock_in_refresh`. - // The heal also takes the lock itself (queues β†’ coordinator - // order), so it must run after this guard is released. 
{ - // Hold the schema-apply serialization key across the - // list-then-reconcile pair: without it, a live apply can - // write its sidecar + staging between the empty check and - // the reconcile (the same race, through a smaller window). - // Queue before coordinator β€” the documented lock order. - // - // Liveness note: with a pending NON-SchemaApply sidecar - // (e.g. a Mutation residual), this gate skips the standalone - // reconcile and the heal below reconciles only per - // SchemaApply sidecar β€” so pre-sidecar-era orphaned staging - // residue waits for the NEXT refresh after the sidecars are - // consumed. Convergence holds, one pass late. Do not "fix" - // by re-running the reconcile unserialized here: that is - // exactly the live-apply race this block exists to close. - let _serial = self - .write_queue - .acquire(&crate::db::manifest::schema_apply_serial_queue_key()) - .await; - if crate::db::manifest::list_sidecars(&self.root_uri, self.storage.as_ref()) - .await? - .is_empty() - { - let mut coord = self.coordinator.write().await; - coord.refresh().await?; - recover_schema_state_files( - &self.root_uri, - Arc::clone(&self.storage), - &coord.snapshot(), - ) - .await?; - } - } // ← guards released before the heal's queue acquisition - crate::db::manifest::heal_pending_sidecars_roll_forward( - &self.root_uri, - Arc::clone(&self.storage), - &self.coordinator, - &self.write_queue, - ) - .await?; + let mut coord = self.coordinator.write().await; + coord.refresh().await?; + let schema_state_recovery = recover_schema_state_files( + &self.root_uri, + Arc::clone(&self.storage), + &coord.snapshot(), + ) + .await?; + crate::db::manifest::recover_manifest_drift( + &self.root_uri, + Arc::clone(&self.storage), + &mut *coord, + crate::db::manifest::RecoveryMode::RollForwardOnly, + schema_state_recovery, + ) + .await?; + } // ← write guard released before reload's read acquisition self.reload_schema_if_source_changed().await?; - self.invalidate_read_caches().await; 
- Ok(()) - } - - /// Write-entry heal: converge any pending recovery sidecars (a - /// previously failed writer's Phase B → Phase C residual) before - /// starting a new staged write, so a long-lived process (the HTTP - /// server, an embedded handle) recovers on its next write instead - /// of wedging every write on the commit-time drift guard until - /// restart. Roll-forward only; rollback-eligible sidecars defer to - /// the next ReadWrite open exactly as [`refresh`](Self::refresh) - /// does. - /// - /// Steady-state cost: one `list_dir` of `__recovery/` (typically - /// empty → immediate return). See - /// `recovery::heal_pending_sidecars_roll_forward` for the - /// concurrency contract (per-table write-queue acquisition). - pub(crate) async fn heal_pending_recovery_sidecars(&self) -> Result<()> { - let processed = crate::db::manifest::heal_pending_sidecars_roll_forward( - &self.root_uri, - Arc::clone(&self.storage), - &self.coordinator, - &self.write_queue, - ) - .await?; - if processed { - // A rolled-forward SchemaApply sidecar moved disk + manifest - // to the new schema (staging promoted, registrations - // published); the in-memory catalog must follow or the very - // write that triggered the heal validates against the stale - // schema. Same post-heal step as `refresh`. - self.reload_schema_if_source_changed().await?; - self.invalidate_read_caches().await; - } + self.runtime_cache.invalidate_all().await; Ok(()) } @@ -1078,7 +770,7 @@ impl Omnigraph { /// own publish path.
pub(crate) async fn refresh_coordinator_only(&self) -> Result<()> { self.coordinator.write().await.refresh().await?; - self.invalidate_read_caches().await; + self.runtime_cache.invalidate_all().await; Ok(()) } @@ -1096,66 +788,11 @@ impl Omnigraph { target: impl Into, ) -> Result { self.ensure_schema_state_valid().await?; - let target = target.into(); - let mut resolved = self.resolve_target_inner(&target).await?; - // Attach the read caches (shared Session + held-handle cache) for live - // Branch reads so table opens reuse handles (0 IO on a warm repeat). - // Snapshot-id reads are deliberately NOT cached: they pin a historical - // version `cleanup` may GC, so bypassing the cache sidesteps the - // cleanup-vs-cached-handle edge. Writes never reach here (they use - // `resolved_branch_target`), so they never receive a pinned handle. - if matches!(target, ReadTarget::Branch(_)) { - resolved - .snapshot - .set_read_caches(Arc::clone(&self.read_caches)); - } - Ok(resolved) - } - - /// Resolve a read target to its snapshot, without attaching read caches. - /// Same-branch reads reuse the warm coordinator, gated by a cheap version - /// probe (invariant 6: strong consistency, never a blind warm read). Reads do - /// not need the commit graph (the manifest version is the visibility - /// authority, invariant 2), so the id is synthetic and no commit-graph scan - /// happens on this path. - async fn resolve_target_inner(&self, target: &ReadTarget) -> Result { - if let ReadTarget::Branch(branch) = target { - let normalized = normalize_branch_name(branch)?; - { - let coord = self.coordinator.read().await; - if normalized.as_deref() != coord.current_branch() { - // Different branch: cold resolve (opens that branch). - return coord.resolve_target(target).await; - } - let held = coord.manifest_incarnation(); - if coord.probe_latest_incarnation().await?.matches(&held) { - return Ok(warm_resolved_target(&coord, target)); - } - // Stale: refresh under the write lock below. 
- } - let mut coord = self.coordinator.write().await; - if normalized.as_deref() == coord.current_branch() { - // Re-check after taking the write lock; another writer may have - // refreshed (tokio RwLock has no read->write upgrade). - let held = coord.manifest_incarnation(); - let mut refreshed = false; - if !coord.probe_latest_incarnation().await?.matches(&held) { - coord.refresh_manifest_only().await?; - refreshed = true; - } - let resolved = warm_resolved_target(&coord, target); - drop(coord); - if refreshed { - self.invalidate_read_caches().await; - } - return Ok(resolved); - } - // Branch changed while waiting for the write lock: cold resolve. - return coord.resolve_target(target).await; - } - - // Snapshot target: resolve through the commit graph as before. - self.coordinator.read().await.resolve_target(target).await + self.coordinator + .read() + .await + .resolve_target(&target.into()) + .await } // ─── Change detection ──────────────────────────────────────────────── @@ -1286,15 +923,11 @@ impl Omnigraph { /// unbranched subtables keep inheriting `main`, while subtables inherited /// from an ancestor branch are first forked into the active branch before /// their index metadata is updated. - /// Returns the declared indexes that could not be materialized on this - /// pass (today: vector columns with no trainable vectors yet). They are - /// deferred, not errors; a later `ensure_indices`/`optimize` builds them - /// once the column is trainable. Reads stay correct (brute-force) meanwhile.
- pub async fn ensure_indices(&self) -> Result> { + pub async fn ensure_indices(&self) -> Result<()> { table_ops::ensure_indices(self).await } - pub async fn ensure_indices_on(&self, branch: &str) -> Result> { + pub async fn ensure_indices_on(&self, branch: &str) -> Result<()> { table_ops::ensure_indices_on(self, branch).await } @@ -1321,13 +954,6 @@ impl Omnigraph { optimize::optimize_all_tables(self).await } - /// Classify and explicitly repair uncovered manifest/head drift. See - /// [`repair`] for the distinction between safe maintenance drift and - /// suspicious/unverifiable drift. - pub async fn repair(&self, options: repair::RepairOptions) -> Result { - repair::repair_all_tables(self, options).await - } - /// Remove Lance manifests (and the fragments they uniquely own) per the /// given [`optimize::CleanupPolicyOptions`]. Destructive to version /// history. See [`optimize`] for details. @@ -1363,24 +989,19 @@ impl Omnigraph { let snapshot = self.snapshot().await; let table_key = format!("node:{}", type_name); - let handle = self - .storage() - .open_snapshot_at_table(&snapshot, &table_key) - .await?; + let ds = snapshot.open(&table_key).await?; let filter_sql = format!("id = '{}'", id.replace('\'', "''")); let row_id = self - .storage() - .first_row_id_for_filter(&handle, &filter_sql) + .table_store + .first_row_id_for_filter(&ds, &filter_sql) .await? .ok_or_else(|| { OmniError::manifest(format!("no {} with id '{}' found", type_name, id)) })?; - // `take_blobs` is a Lance-specific blob accessor not surfaced - // through the `TableStorage` trait β€” reach the inner `Arc` - // via the `pub(crate)` accessor for this read-only call. - let ds = handle.into_arc(); + // Use take_blobs to get the BlobFile handle + let ds = Arc::new(ds); let mut blobs = ds .take_blobs(&[row_id], property) .await @@ -1437,14 +1058,11 @@ impl Omnigraph { Ok(()) } - /// Best-effort reclaim of the per-table Lance forks a just-deleted branch - /// owned. 
Runs AFTER the manifest authority flip, so the branch is already - /// gone and these forks are unreachable orphans. A failure here (transient - /// object-store error, the `branch_delete.before_table_cleanup` failpoint) - /// is logged and swallowed: the `cleanup` reconciler is the guaranteed - /// backstop that converges any leftover orphan. Uses `force_delete_branch` - /// so a partially-reclaimed retry is idempotent. - async fn cleanup_deleted_branch_tables(&self, branch: &str, owned_tables: &[(String, String)]) { + async fn cleanup_deleted_branch_tables( + &self, + branch: &str, + owned_tables: &[(String, String)], + ) -> Result<()> { let mut seen_paths = HashSet::new(); let mut cleanup_targets = owned_tables .iter() @@ -1454,26 +1072,16 @@ impl Omnigraph { cleanup_targets.sort_by(|left, right| left.0.cmp(&right.0)); for (table_key, table_path) in cleanup_targets { - let dataset_uri = self.storage().dataset_uri(&table_path); - let outcome = match crate::failpoints::maybe_fail(crate::failpoints::names::BRANCH_DELETE_BEFORE_TABLE_CLEANUP) - { - Ok(()) => { - self.storage() - .force_delete_branch(&dataset_uri, branch) - .await - } - Err(injected) => Err(injected), - }; - if let Err(err) = outcome { - tracing::warn!( - target: "omnigraph::branch_delete::cleanup", - branch = %branch, - table = %table_key, - error = %err, - "best-effort fork reclaim failed; cleanup will reconcile the orphan", - ); + let dataset_uri = self.table_store.dataset_uri(&table_path); + if let Err(err) = self.table_store.delete_branch(&dataset_uri, branch).await { + return Err(OmniError::manifest_internal(format!( + "branch '{}' was deleted but cleanup failed for {}: {}", + branch, table_key, err + ))); } } + + Ok(()) } async fn delete_branch_storage_only(&self, branch: &str) -> Result<()> { @@ -1497,12 +1105,9 @@ impl Omnigraph { .map(|entry| (entry.table_key.clone(), entry.table_path.clone())) .collect::>(); - // Authority flip (+ best-effort commit-graph reclaim) β€” must succeed. 
self.coordinator.write().await.branch_delete(branch).await?; - // Best-effort per-table fork reclaim; cleanup reconciles any leftover. self.cleanup_deleted_branch_tables(branch, &owned_tables) - .await; - Ok(()) + .await } pub(crate) fn normalize_branch_name(branch: &str) -> Result> { @@ -1687,7 +1292,7 @@ impl Omnigraph { &self, table_key: &str, op_kind: crate::db::MutationOpKind, - ) -> Result { + ) -> Result<(Dataset, String, Option)> { table_ops::open_for_mutation(self, table_key, op_kind).await } @@ -1696,18 +1301,10 @@ impl Omnigraph { branch: Option<&str>, table_key: &str, op_kind: crate::db::MutationOpKind, - txn: Option<&crate::db::WriteTxn>, - ) -> Result { - table_ops::open_for_mutation_on_branch(self, branch, table_key, op_kind, txn).await + ) -> Result<(Dataset, String, Option)> { + table_ops::open_for_mutation_on_branch(self, branch, table_key, op_kind).await } - /// Fork `table_key` onto `active_branch` from the given source state, - /// self-healing a manifest-unreferenced leftover fork if one is in the - /// way. Callers that reach this MUST already hold the per-`(table_key, - /// active_branch)` write queue (so the reclaim cannot race an in-process - /// fork) and must have confirmed via the live manifest that the table is - /// not yet on `active_branch`. Both the first-write fork path - /// (`open_owned_dataset_for_branch_write`) and `branch_merge` satisfy this. pub(crate) async fn fork_dataset_from_entry_state( &self, table_key: &str, @@ -1715,8 +1312,8 @@ impl Omnigraph { source_branch: Option<&str>, source_version: u64, active_branch: &str, - ) -> Result { - match table_ops::fork_dataset_from_entry_state( + ) -> Result { + table_ops::fork_dataset_from_entry_state( self, table_key, full_path, @@ -1724,21 +1321,7 @@ impl Omnigraph { source_version, active_branch, ) - .await? 
- { - crate::storage_layer::ForkOutcome::Created(ds) => Ok(ds), - crate::storage_layer::ForkOutcome::RefAlreadyExists => { - table_ops::reclaim_orphaned_fork_and_refork( - self, - table_key, - full_path, - source_branch, - source_version, - active_branch, - ) - .await - } - } + .await } pub(crate) async fn reopen_for_mutation( @@ -1748,7 +1331,7 @@ impl Omnigraph { table_branch: Option<&str>, expected_version: u64, op_kind: crate::db::MutationOpKind, - ) -> Result { + ) -> Result { table_ops::reopen_for_mutation( self, table_key, @@ -1765,18 +1348,27 @@ impl Omnigraph { table_path: &str, table_branch: Option<&str>, table_version: u64, - ) -> Result { + ) -> Result { table_ops::open_dataset_at_state(self, table_path, table_branch, table_version).await } pub(crate) async fn build_indices_on_dataset( &self, table_key: &str, - ds: &mut SnapshotHandle, - ) -> Result> { + ds: &mut Dataset, + ) -> Result<()> { table_ops::build_indices_on_dataset(self, table_key, ds).await } + pub(crate) async fn build_indices_on_dataset_for_catalog( + &self, + catalog: &Catalog, + table_key: &str, + ds: &mut Dataset, + ) -> Result<()> { + table_ops::build_indices_on_dataset_for_catalog(self, catalog, table_key, ds).await + } + // Used only by in-tree tests (`#[cfg(test)]`); the runtime path now // uses `commit_updates_on_branch_with_expected` exclusively. #[cfg(test)] @@ -1787,17 +1379,28 @@ impl Omnigraph { table_ops::commit_updates(self, updates).await } - /// Publish a branch merge: the merged table `updates` and the merge commit - /// in one manifest CAS (RFC-013 Phase 7). The merge commit's merged-in parent - /// is `merged_parent_commit_id` (the source head); its first parent is the - /// live target-branch head, resolved by the publisher. 
- pub(crate) async fn commit_merge_with_actor( + pub(crate) async fn commit_manifest_updates( &self, updates: &[crate::db::SubTableUpdate], + ) -> Result { + table_ops::commit_manifest_updates(self, updates).await + } + + pub(crate) async fn record_merge_commit( + &self, + manifest_version: u64, + parent_commit_id: &str, merged_parent_commit_id: &str, actor_id: Option<&str>, ) -> Result { - table_ops::commit_merge_with_actor(self, updates, merged_parent_commit_id, actor_id).await + table_ops::record_merge_commit( + self, + manifest_version, + parent_commit_id, + merged_parent_commit_id, + actor_id, + ) + .await } pub(crate) async fn commit_updates_on_branch_with_expected( @@ -1806,8 +1409,6 @@ impl Omnigraph { updates: &[crate::db::SubTableUpdate], expected_table_versions: &std::collections::HashMap, actor_id: Option<&str>, - txn: Option<&crate::db::WriteTxn>, - committed_handles: std::collections::HashMap, ) -> Result { table_ops::commit_updates_on_branch_with_expected( self, @@ -1815,8 +1416,6 @@ impl Omnigraph { updates, expected_table_versions, actor_id, - txn, - committed_handles, ) .await } @@ -1844,25 +1443,13 @@ pub(crate) fn normalize_branch_name(branch: &str) -> Result> { Ok(Some(branch.to_string())) } -/// Build a `ResolvedTarget` from the warm coordinator without opening the commit -/// graph. The live branch snapshot is pinned by the manifest incarnation, so the -/// id is synthetic `(branch, version, e_tag when available)`; nothing on the read -/// path needs a real commit ULID (only `RuntimeCache` keys on the id, where -/// synthetic is consistent). 
-fn warm_resolved_target(coord: &GraphCoordinator, requested: &ReadTarget) -> ResolvedTarget { - ResolvedTarget { - requested: requested.clone(), - branch: coord.current_branch().map(str::to_string), - snapshot_id: SnapshotId::synthetic( - coord.current_branch(), - coord.version(), - coord.manifest_incarnation().e_tag.as_deref(), - ), - snapshot: coord.snapshot(), - } -} - pub(crate) fn ensure_public_branch_ref(branch: &str, operation: &str) -> Result<()> { + if super::is_internal_run_branch(branch) { + return Err(OmniError::manifest(format!( + "{} does not allow internal run ref '{}'", + operation, branch + ))); + } if is_internal_system_branch(branch) { return Err(OmniError::manifest(format!( "{} does not allow internal system ref '{}'", @@ -2021,14 +1608,14 @@ async fn init_storage_phase( if write_schema_pg { let schema_path = join_uri(root, SCHEMA_SOURCE_FILENAME); storage.write_text(&schema_path, schema_source).await?; - crate::failpoints::maybe_fail(crate::failpoints::names::INIT_AFTER_SCHEMA_PG_WRITTEN)?; + crate::failpoints::maybe_fail("init.after_schema_pg_written")?; } write_schema_contract(root, storage.as_ref(), schema_ir).await?; - crate::failpoints::maybe_fail(crate::failpoints::names::INIT_AFTER_SCHEMA_CONTRACT_WRITTEN)?; + crate::failpoints::maybe_fail("init.after_schema_contract_written")?; let coordinator = GraphCoordinator::init(root, catalog, Arc::clone(storage)).await?; - crate::failpoints::maybe_fail(crate::failpoints::names::INIT_AFTER_COORDINATOR_INIT)?; + crate::failpoints::maybe_fail("init.after_coordinator_init")?; Ok(coordinator) } @@ -2266,12 +1853,13 @@ fn json_value_from_array(array: &dyn Array, row: usize) -> Result Person { edge WorksAt: Person -> Company "#; - #[derive(Debug)] + #[derive(Debug, Default)] struct RecordingStorageAdapter { - inner: ObjectStorageAdapter, + inner: LocalStorageAdapter, reads: Mutex>, writes: Mutex>, exists_checks: Mutex>, @@ -2297,19 +1885,6 @@ edge WorksAt: Person -> Company deletes: Mutex>, } - impl 
Default for RecordingStorageAdapter { - fn default() -> Self { - Self { - inner: ObjectStorageAdapter::local(), - reads: Mutex::default(), - writes: Mutex::default(), - exists_checks: Mutex::default(), - renames: Mutex::default(), - deletes: Mutex::default(), - } - } - } - impl RecordingStorageAdapter { fn reads(&self) -> Vec { self.reads.lock().unwrap().clone() @@ -2362,30 +1937,11 @@ edge WorksAt: Person -> Company async fn list_dir(&self, dir_uri: &str) -> Result> { self.inner.list_dir(dir_uri).await } - - async fn read_text_versioned(&self, uri: &str) -> Result<(String, String)> { - self.inner.read_text_versioned(uri).await - } - - async fn write_text_if_match( - &self, - uri: &str, - contents: &str, - expected_version: &str, - ) -> Result> { - self.inner - .write_text_if_match(uri, contents, expected_version) - .await - } - - async fn delete_prefix(&self, prefix_uri: &str) -> Result<()> { - self.inner.delete_prefix(prefix_uri).await - } } #[derive(Debug)] struct InitRaceStorageAdapter { - inner: ObjectStorageAdapter, + inner: LocalStorageAdapter, root: String, barrier: Arc, } @@ -2423,25 +1979,6 @@ edge WorksAt: Person -> Company async fn list_dir(&self, dir_uri: &str) -> Result> { self.inner.list_dir(dir_uri).await } - - async fn read_text_versioned(&self, uri: &str) -> Result<(String, String)> { - self.inner.read_text_versioned(uri).await - } - - async fn write_text_if_match( - &self, - uri: &str, - contents: &str, - expected_version: &str, - ) -> Result> { - self.inner - .write_text_if_match(uri, contents, expected_version) - .await - } - - async fn delete_prefix(&self, prefix_uri: &str) -> Result<()> { - self.inner.delete_prefix(prefix_uri).await - } } #[tokio::test(flavor = "multi_thread", worker_threads = 2)] @@ -2450,7 +1987,7 @@ edge WorksAt: Person -> Company let uri = dir.path().to_str().unwrap().to_string(); let root = normalize_root_uri(&uri).unwrap(); let storage: Arc = Arc::new(InitRaceStorageAdapter { - inner: ObjectStorageAdapter::local(), + 
inner: LocalStorageAdapter, root, barrier: Arc::new(tokio::sync::Barrier::new(2)), }); @@ -2531,12 +2068,8 @@ edge WorksAt: Person -> Company async fn table_rows_json(db: &Omnigraph, table_key: &str) -> Vec { let snapshot = db.snapshot().await; - let ds = db - .storage() - .open_snapshot_at_table(&snapshot, table_key) - .await - .unwrap(); - let batches = db.storage().scan_batches(&ds).await.unwrap(); + let ds = snapshot.open(table_key).await.unwrap(); + let batches = db.table_store().scan_batches(&ds).await.unwrap(); batches .into_iter() .flat_map(|batch| { @@ -2548,14 +2081,11 @@ edge WorksAt: Person -> Company } async fn seed_person_row(db: &mut Omnigraph, name: &str, age: Option) { - // No-txn entry, so the handle is always `Some` (collapse #1's skip is - // gated on `txn.is_some()`). - let (ds, full_path, table_branch) = db + let (mut ds, full_path, table_branch) = db .open_for_mutation("node:Person", crate::db::MutationOpKind::Insert) .await - .unwrap() - .require_handle("seed_person_row test"); - let schema: Arc = Arc::new(ds.dataset().schema().into()); + .unwrap(); + let schema: Arc = Arc::new(ds.schema().into()); let columns: Vec> = schema .fields() .iter() @@ -2567,11 +2097,9 @@ edge WorksAt: Person -> Company }) .collect(); let batch = RecordBatch::try_new(Arc::clone(&schema), columns).unwrap(); - let staged = db.storage().stage_append(&ds, batch, &[]).await.unwrap(); - let committed = db.storage().commit_staged(ds, staged).await.unwrap(); let state = db - .storage() - .table_state(&full_path, &committed) + .table_store() + .append_batch(&full_path, &mut ds, batch) .await .unwrap(); db.commit_updates(&[crate::db::SubTableUpdate { @@ -2663,11 +2191,11 @@ edge WorksAt: Person -> Company #[tokio::test] async fn test_apply_schema_succeeds_after_load() { // Historical: schema apply used to be blocked by leftover - // `__run__` branches. 
The Run state machine was removed in - // MR-771, so a fresh graph never creates a `__run__` branch; - // legacy ones are swept by the v2β†’v3 manifest migration. This - // asserts the invariant a current graph upholds: publish leaves - // no `__run__` branch behind, so schema apply proceeds. + // `__run__` branches. A defense-in-depth filter now skips + // internal system branches, and run branches were made + // ephemeral on every terminal state β€” so in practice no + // `__run__` branch survives publish. The filter still guards + // the invariant. let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); @@ -2682,8 +2210,8 @@ edge WorksAt: Person -> Company let all_branches = db.coordinator.read().await.all_branches().await.unwrap(); assert!( - !all_branches.iter().any(|b| b.starts_with("__run__")), - "no __run__ branch should exist after publish, got: {:?}", + !all_branches.iter().any(|b| is_internal_run_branch(b)), + "run branch should be deleted after publish, got: {:?}", all_branches ); @@ -2695,101 +2223,22 @@ edge WorksAt: Person -> Company assert!(result.applied, "schema apply should have applied"); } - /// Regression (MR-770): a pre-v0.4.0 graph that still carries a stale - /// `__run__*` branch on `__manifest` must not block schema apply. The - /// v2β†’v3 sweep runs in `Omnigraph::open(ReadWrite)` β€” before the - /// schema-apply blocking-branch check β€” so apply succeeds with no - /// intervening publish. - /// - /// Confirmed to fail before the open-time migration landed: the reopened - /// graph still listed `__run__legacy`, and `apply_schema` returned - /// "found non-main branches: __run__legacy". 
#[tokio::test] - async fn legacy_run_branch_is_swept_on_open_and_does_not_block_schema_apply() { + async fn test_apply_schema_adds_index_for_existing_property() { let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - // Synthesize a legacy graph: a stale `__run__` branch on `__manifest` - // plus the manifest stamp rewound to v2 (pre-sweep). - db.branch_create("__run__legacy").await.unwrap(); - drop(db); - { - // forbidden-api-allow: test synthesizes a legacy graph by editing __manifest directly. - let mut ds = lance::Dataset::open(&format!("{}/__manifest", uri)) - .await - .unwrap(); - ds.update_schema_metadata([( - "omnigraph:internal_schema_version".to_string(), - Some("2".to_string()), - )]) - .await - .unwrap(); - } - - // Reopen (ReadWrite): the open-time migration must sweep `__run__legacy` - // before any branch-observing code runs. - let db = Omnigraph::open(uri).await.unwrap(); - let branches = db.branch_list().await.unwrap(); - assert!( - !branches.iter().any(|b| b.starts_with("__run__")), - "open-time migration must sweep legacy __run__ branches; got {branches:?}", - ); - - // Schema apply must proceed with no intervening publish β€” the - // blocking-branch check no longer sees `__run__legacy`. - let desired = TEST_SCHEMA.replace( - " age: I32?\n}", - " age: I32?\n nickname: String?\n}", - ); - let result = db.apply_schema(&desired).await.unwrap(); - assert!(result.applied, "schema apply should have applied"); - } - - #[tokio::test] - async fn test_apply_schema_defers_index_then_reconciler_builds_it() { - // iss-848: schema apply records the @index intent but builds nothing - // inline; a later ensure_indices materializes it once the table has - // rows. (Use `age`, which is unindexed in TEST_SCHEMA β€” `name @key` is - // already FTS-indexed at seed, so it can't show the deferral.) 
- let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - seed_person_row(&mut db, "Alice", Some(30)).await; - - let desired = TEST_SCHEMA.replace("age: I32?", "age: I32? @index"); + let desired = TEST_SCHEMA.replace("name: String @key", "name: String @key @index"); db.apply_schema(&desired).await.unwrap(); - // Apply built nothing β€” the BTREE on `age` is deferred. let snapshot = db.snapshot().await; - let ds = db - .storage() - .open_snapshot_at_table(&snapshot, "node:Person") - .await - .unwrap(); - assert!( - !db.storage().has_btree_index(&ds, "age").await.unwrap(), - "apply must not build the index inline (deferred to the reconciler)" - ); - - // The reconciler materializes it (Person has a row). - db.ensure_indices().await.unwrap(); - let snapshot = db.snapshot().await; - let ds = db - .storage() - .open_snapshot_at_table(&snapshot, "node:Person") - .await - .unwrap(); - assert!( - db.storage().has_btree_index(&ds, "age").await.unwrap(), - "ensure_indices must build the deferred index" - ); + let ds = snapshot.open("node:Person").await.unwrap(); + assert!(db.table_store().has_fts_index(&ds, "name").await.unwrap()); } #[tokio::test] - async fn test_apply_schema_rewrite_defers_index_then_reconciler_restores() { - // iss-848: an AddProperty rewrite writes a new dataset version without - // rebuilding indexes inline (deferred); ensure_indices restores them. + async fn test_apply_schema_rewrite_preserves_existing_indices() { let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); let initial_schema = TEST_SCHEMA.replace("name: String @key", "name: String @key @index"); @@ -2802,16 +2251,10 @@ edge WorksAt: Person -> Company ); db.apply_schema(&desired).await.unwrap(); - // After the rewrite the reconciler restores index coverage. 
- db.ensure_indices().await.unwrap(); let snapshot = db.snapshot().await; - let ds = db - .storage() - .open_snapshot_at_table(&snapshot, "node:Person") - .await - .unwrap(); - assert!(db.storage().has_btree_index(&ds, "id").await.unwrap()); - assert!(db.storage().has_fts_index(&ds, "name").await.unwrap()); + let ds = snapshot.open("node:Person").await.unwrap(); + assert!(db.table_store().has_btree_index(&ds, "id").await.unwrap()); + assert!(db.table_store().has_fts_index(&ds, "name").await.unwrap()); } #[tokio::test] diff --git a/crates/omnigraph/src/db/omnigraph/export.rs b/crates/omnigraph/src/db/omnigraph/export.rs index 7696056..366f50a 100644 --- a/crates/omnigraph/src/db/omnigraph/export.rs +++ b/crates/omnigraph/src/db/omnigraph/export.rs @@ -60,12 +60,12 @@ async fn entity_from_snapshot( } let ds = db - .storage() - .open_snapshot_at_table(snapshot, table_key) + .table_store + .open_snapshot_table(snapshot, table_key) .await?; let filter_sql = format!("id = '{}'", id.replace('\'', "''")); let batches = db - .storage() + .table_store .scan(&ds, None, Some(&filter_sql), None) .await?; let Some(batch) = batches.iter().find(|batch| batch.num_rows() > 0) else { @@ -143,23 +143,23 @@ async fn export_table_to_writer( writer: &mut W, ) -> Result<()> { let ds = db - .storage() - .open_snapshot_at_table(snapshot, table_key) + .table_store + .open_snapshot_table(snapshot, table_key) .await?; let ordering = Some(vec![ColumnOrdering::asc_nulls_last("id".to_string())]); let catalog = db.catalog(); let blob_properties = blob_properties_for_table_key(&catalog, table_key)?; if blob_properties.is_empty() { - for batch in db.storage().scan(&ds, None, None, ordering).await? { + for batch in db.table_store.scan(&ds, None, None, ordering).await? 
{ write_export_rows_from_batch(db, table_key, &batch, None, writer)?; } return Ok(()); } let batches = db - .storage() - .scan_with_row_id(&ds, None, None, ordering, true) + .table_store + .scan_with(&ds, None, None, ordering, true, |_| Ok(())) .await?; for batch in batches { let row_ids = batch @@ -175,13 +175,7 @@ async fn export_table_to_writer( .iter() .copied() .collect::>(); - // Blob materialization reaches through to the inner Lance - // `Dataset` because `take_blobs` is a Lance-only API not lifted - // onto the `TableStorage` trait surface (the trait covers - // staged-write and snapshot-scan primitives; blob descriptor - // materialization sits outside that surface). - let blob_values = - export_blob_values(ds.dataset(), &batch, &row_ids, blob_properties).await?; + let blob_values = export_blob_values(&ds, &batch, &row_ids, blob_properties).await?; write_export_rows_from_batch(db, table_key, &batch, Some(&blob_values), writer)?; } Ok(()) diff --git a/crates/omnigraph/src/db/omnigraph/optimize.rs b/crates/omnigraph/src/db/omnigraph/optimize.rs index bae0c88..e158dc7 100644 --- a/crates/omnigraph/src/db/omnigraph/optimize.rs +++ b/crates/omnigraph/src/db/omnigraph/optimize.rs @@ -8,14 +8,8 @@ //! Two dials: //! //! * `optimize_all_tables` β€” Lance `compact_files` on every table. Rewrites -//! small fragments into fewer large ones, then **publishes the compacted -//! version to the `__manifest`** so the manifest's `table_version` tracks the -//! compacted Lance HEAD (reads pin the manifest version, so without the -//! publish compaction would be invisible to readers and would break the -//! HEAD-vs-manifest precondition of schema apply / strict writes). Compaction -//! is content-preserving (Lance `Operation::Rewrite` "reorganizes data -//! without semantic modification"), so old fragments remain reachable via -//! older manifest versions until `cleanup` runs. +//! small fragments into fewer large ones. Non-destructive (creates a new +//! 
version; old fragments remain reachable via older manifest versions). //! * `cleanup_all_tables` -- Lance `cleanup_old_versions` on every table. //! Removes manifests (and their unique fragments) older than the configured //! retention. Destructive to version history -- callers should gate this @@ -29,11 +23,7 @@ use std::time::Duration; use chrono::Utc; use futures::stream::StreamExt; use lance::dataset::cleanup::{CleanupPolicy, RemovalStats}; -use lance::dataset::optimize::{ - CompactionMetrics, CompactionOptions, compact_files, plan_compaction, -}; -use lance::index::DatasetIndexExt; -use lance_index::optimize::OptimizeOptions; +use lance::dataset::optimize::{CompactionMetrics, CompactionOptions, compact_files}; use super::*; @@ -50,20 +40,6 @@ fn maint_concurrency() -> usize { .unwrap_or(DEFAULT_MAINT_CONCURRENCY) } -/// Whether the installed Lance can compact a dataset that contains blob -/// columns. `false` today: Lance `compact_files` forces -/// `BlobHandling::AllBinary` on the read side, and the blob-v2 struct decoder -/// mis-counts columns ("there were more fields in the schema than provided -/// column indices"), failing even a pristine uniform-V2_2 multi-fragment blob -/// table. Reads are unaffected (queries use descriptor handling). -/// -/// While `false`, [`optimize_all_tables`] skips blob-bearing tables and reports -/// [`SkipReason::BlobColumnsUnsupportedByLance`] instead of aborting the whole -/// sweep. Flip to `true` once the upstream Lance fix ships -- the -/// `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns` guard -/// turns red on that bump and forces this flip. Tracked in `docs/dev/lance.md`. -const LANCE_SUPPORTS_BLOB_COMPACTION: bool = false; - /// Retention knobs for [`cleanup_all_tables`]. At least one must be set or /// nothing is cleaned. If both are set, Lance applies them as AND (a manifest /// is kept if it satisfies either -- i.e. 
only manifests older than BOTH the @@ -76,784 +52,74 @@ pub struct CleanupPolicyOptions { pub older_than: Option, } -/// Why `optimize` did not compact a table. Typed so callers branch on the -/// reason rather than sniffing a string. -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -#[non_exhaustive] -pub enum SkipReason { - /// The table has one or more `Blob` columns. Lance `compact_files` forces - /// `BlobHandling::AllBinary`, which mis-decodes blob-v2 columns; see - /// [`LANCE_SUPPORTS_BLOB_COMPACTION`] and `docs/dev/lance.md`. - BlobColumnsUnsupportedByLance, - /// The Lance dataset HEAD is ahead of the version recorded in - /// `__manifest`, and no recovery sidecar covers that movement. `optimize` - /// cannot infer whether the drift is benign maintenance or an external - /// semantic write, so it leaves the table untouched and points operators at - /// explicit `repair`. - DriftNeedsRepair, -} - -impl SkipReason { - /// Stable machine-readable token for serialized output (e.g. CLI `--json`). - /// Once emitted this is part of the output contract β€” keep it stable. - pub fn as_str(&self) -> &'static str { - match self { - SkipReason::BlobColumnsUnsupportedByLance => "blob_columns_unsupported_by_lance", - SkipReason::DriftNeedsRepair => "drift_needs_repair", - } - } -} - -impl std::fmt::Display for SkipReason { - /// Human-readable reason for CLI and log output. - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - let msg = match self { - SkipReason::BlobColumnsUnsupportedByLance => { - "blob columns β€” Lance compaction unsupported" - } - SkipReason::DriftNeedsRepair => "manifest/head drift β€” run omnigraph repair", - }; - f.write_str(msg) - } -} - -/// Per-table outcome of `optimize_all_tables`. This is a returned result type, -/// not built by callers, so it is `#[non_exhaustive]`: future fields stay -/// non-breaking and downstream code reads fields rather than constructing it. +/// Per-table outcome of `optimize_all_tables`. 
#[derive(Debug, Clone)] -#[non_exhaustive] pub struct TableOptimizeStats { pub table_key: String, /// Number of source fragments that were rewritten by Lance. pub fragments_removed: usize, /// Number of new, larger fragments Lance produced. pub fragments_added: usize, - /// Did this table get a new manifest version from the compaction? True when - /// compaction ran and its compacted version was published to `__manifest`. + /// Did this table get a new Lance manifest version from the compaction? pub committed: bool, - /// `Some(reason)` if this table was deliberately not compacted. When set, - /// `fragments_removed == 0`, `fragments_added == 0`, and `!committed`. - pub skipped: Option, - /// Manifest table version observed by optimize for drift skips. `None` for - /// normal compaction/no-op/blob skips. - pub manifest_version: Option, - /// Lance HEAD version observed by optimize for drift skips. `None` for - /// normal compaction/no-op/blob skips. - pub lance_head_version: Option, - /// Declared `@index` columns on this table the reconciler could not build - /// this run, each with the `reason` (today: a vector column with no - /// trainable vectors yet). Empty on the common path. Reported, not fatal β€” a - /// later `optimize` retries; the `list_indices`/`indisvalid` analog so - /// operators can see which index is pending and why. - pub pending_indexes: Vec, } -impl TableOptimizeStats { - /// Stat for a table that Lance actually compacted. - fn compacted(table_key: String, metrics: &CompactionMetrics, committed: bool) -> Self { - Self { - table_key, - fragments_removed: metrics.fragments_removed, - fragments_added: metrics.fragments_added, - committed, - skipped: None, - manifest_version: None, - lance_head_version: None, - pending_indexes: Vec::new(), - } - } - - /// Stat for a table that was deliberately skipped (compaction not attempted). 
- fn skipped(table_key: String, reason: SkipReason) -> Self { - Self { - table_key, - fragments_removed: 0, - fragments_added: 0, - committed: false, - skipped: Some(reason), - manifest_version: None, - lance_head_version: None, - pending_indexes: Vec::new(), - } - } - - /// Stat for a table skipped because the manifest and Lance HEAD disagree. - fn skipped_for_drift( - table_key: String, - manifest_version: u64, - lance_head_version: u64, - ) -> Self { - Self { - table_key, - fragments_removed: 0, - fragments_added: 0, - committed: false, - skipped: Some(SkipReason::DriftNeedsRepair), - manifest_version: Some(manifest_version), - lance_head_version: Some(lance_head_version), - pending_indexes: Vec::new(), - } - } -} - -/// Per-table outcome of `cleanup_all_tables`. `error` is `Some` when this -/// table's version GC failed; cleanup is fault-isolated per table, so a single -/// table's failure is recorded here rather than aborting the whole sweep. +/// Per-table outcome of `cleanup_all_tables`. #[derive(Debug, Clone)] pub struct TableCleanupStats { pub table_key: String, pub bytes_removed: u64, pub old_versions_removed: u64, - pub error: Option, } -/// Run Lance `compact_files` on every node + edge table on `main`, publishing -/// each compacted table's new version to the `__manifest`. Tables run in -/// parallel (bounded concurrency); each is fault-isolated only at the Lance -/// level -- a publish error is propagated (the recovery sidecar covers it). +/// Run Lance `compact_files` on every node + edge table on `main`. +/// Tables run in parallel (bounded concurrency). pub async fn optimize_all_tables(db: &Omnigraph) -> Result> { db.ensure_schema_state_valid().await?; db.ensure_schema_apply_idle("optimize").await?; - // Refuse on an unrecovered graph. 
A pending recovery sidecar means a failed - // write left partial state that the open-time sweep must resolve (roll - // forward/back) first; compacting + publishing a table covered by such a - // sidecar could commit a partial write the sweep would roll back. Reopen the - // graph to run recovery, then re-run optimize. - if !crate::db::manifest::list_sidecars(db.root_uri(), db.storage_adapter()) - .await? - .is_empty() - { - return Err(OmniError::manifest_conflict( - "optimize requires a clean recovery state; reopen the graph to run the \ - recovery sweep before optimizing", - )); + let resolved = db.resolved_branch_target(None).await?; + let snapshot = resolved.snapshot; + + let table_tasks: Vec<_> = all_table_keys(&db.catalog()) + .into_iter() + .filter_map(|table_key| { + let entry = snapshot.entry(&table_key)?; + let full_path = format!("{}/{}", db.root_uri, entry.table_path); + Some((table_key, full_path)) + }) + .collect(); + + if table_tasks.is_empty() { + return Ok(Vec::new()); } - let snapshot = db.fresh_snapshot_for_branch(None).await?; - - // Compute per-table state (path + whether it has blob columns) up front, in - // a scope that drops the catalog handle before the async stream starts. - let table_tasks: Vec<(String, String, bool)> = { - let catalog = db.catalog(); - let mut tasks = Vec::new(); - for table_key in all_table_keys(&catalog) { - let Some(entry) = snapshot.entry(&table_key) else { - continue; - }; - let full_path = format!("{}/{}", db.root_uri, entry.table_path); - let has_blob = !blob_properties_for_table_key(&catalog, &table_key)?.is_empty(); - tasks.push((table_key, full_path, has_blob)); - } - tasks - }; - - // NB: do NOT early-return when `table_tasks` is empty (a schema with no - // node/edge types) -- the internal system tables below must still be compacted. 
let concurrency = maint_concurrency().min(table_tasks.len()).max(1); + let table_store = &db.table_store; let stats: Vec> = futures::stream::iter(table_tasks.into_iter()) - .map(move |(table_key, full_path, has_blob)| async move { - optimize_one_table(db, table_key, full_path, has_blob).await + .map(|(table_key, full_path)| async move { + let mut ds = table_store + .open_dataset_head_for_write(&table_key, &full_path, None) + .await?; + let version_before = ds.version().version; + let metrics: CompactionMetrics = + compact_files(&mut ds, CompactionOptions::default(), None) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + let version_after = ds.version().version; + Ok(TableOptimizeStats { + table_key, + fragments_removed: metrics.fragments_removed, + fragments_added: metrics.fragments_added, + committed: version_after != version_before, + }) }) .buffer_unordered(concurrency) .collect() .await; - // Invalidate caches for any table that published a compaction -- done BEFORE - // propagating a sibling table's error, since the published versions are - // durable and reads must observe the new fragment layout (Lance invalidates - // the original row addresses on rewrite). The CSR/CSC graph topology index - // is rebuilt only when an edge table moved. Mirrors schema_apply's - // post-publish invalidation. - let any_committed = stats.iter().any(|s| matches!(s, Ok(st) if st.committed)); - let edge_committed = stats - .iter() - .any(|s| matches!(s, Ok(st) if st.committed && st.table_key.starts_with("edge:"))); - if any_committed { - db.runtime_cache.invalidate_all().await; - if edge_committed { - db.invalidate_graph_index().await; - } - } - - // Compact the internal system tables too (RFC-013 step 2). They are not - // catalog-tracked, so they take a separate, simpler path (`compact_internal_table`): - // compact in place, no manifest publish, no sidecar. 
Appended after the - // data-table stats so the data-table cache invalidation above is computed from - // data-table stats only; each internal compaction does its own coordinator - // refresh for cache coherence. - let mut all = stats; - // One source of truth for the internal system tables optimize compacts. The - // commit graph is THREE tables, not one: the DAG (`_graph_commits`), the actor - // map (`_graph_commit_actors`, appended by every *authenticated* write β€” the - // production server/CLI path always carries an actor), and the manifest. Missing - // any leaves an O(history) scan on a live write path. `__manifest` is always - // present (created at init); the two commit-graph tables may be absent (the - // coordinator opens them as `Option`, gated on existence β€” graphs predating the - // commit graph, and the actor table is itself optional), so guard each with the - // same existence check rather than letting `Dataset::open` error and fail the - // whole optimize. - let root = db.root_uri(); - let internal_tables: [(&str, String); 3] = [ - ("__manifest", crate::db::manifest::manifest_uri(root)), - ( - "_graph_commits", - crate::db::commit_graph::graph_commits_uri(root), - ), - ( - "_graph_commit_actors", - crate::db::commit_graph::graph_commit_actors_uri(root), - ), - ]; - for (table_key, uri) in internal_tables { - if table_key == "__manifest" || db.storage_adapter().exists(&uri).await? { - all.push(compact_internal_table(db, table_key, uri).await); - } - } - - all.into_iter().collect() -} - -/// Compact one table and publish the compacted version to the `__manifest`. -/// -/// Compaction (`compact_files`) advances the *dataset's* Lance HEAD via a -/// reserve-fragments + rewrite commit, but Lance knows nothing about the -/// `__manifest`. To keep the manifest the single authority for each table's -/// visible version (invariant 2), optimize must publish the compacted version. 
-/// The Lance-HEAD-before-manifest-publish gap is unavoidable (Lance has no -/// staged/uncommitted compaction), so it is covered by a recovery sidecar like -/// the other multi-commit writers; roll-forward is always safe because -/// compaction is content-preserving. -async fn optimize_one_table( - db: &Omnigraph, - table_key: String, - full_path: String, - has_blob: bool, -) -> Result { - // Lance `compact_files` mis-decodes blob-v2 columns under the forced - // `BlobHandling::AllBinary` read (see LANCE_SUPPORTS_BLOB_COMPACTION). Skip - // blob-bearing tables before acquiring the write queue; `repair` is the - // operator tool for full manifest/head drift classification. - if has_blob && !LANCE_SUPPORTS_BLOB_COMPACTION { - tracing::warn!( - target: "omnigraph::optimize", - table = %table_key, - "skipping compaction: table has blob columns the current Lance \ - cannot rewrite (blob-v2 AllBinary decode bug); other tables \ - unaffected β€” rerun after the Lance fix", - ); - return Ok(TableOptimizeStats::skipped( - table_key, - SkipReason::BlobColumnsUnsupportedByLance, - )); - } - - // Serialize the whole compactβ†’publish against concurrent mutations on this - // (table, main): compaction is a Rewrite op that retryable-conflicts with a - // concurrent Merge/Update/Delete on overlapping fragments, and an - // interleaved write would also move the manifest version out from under the - // CAS below. Holding the queue makes the CAS baseline read under it exact. - let _guard = db - .write_queue() - .acquire_many(&[(table_key.clone(), None)]) - .await; - - // Survive a CROSS-PROCESS race (a CLI `optimize` vs the served server): the - // in-process write queue above serializes only same-process writers, so we also - // retry. Two failure modes, two retry levels: - // * Outer loop β€” a genuine Lance `Rewrite`-vs-`Update/Delete` same-fragment - // conflict (compaction did NOT commit). Reopen at the new HEAD and re-plan, - // exactly as the internal-table path does. 
(Lance rebases the common disjoint - // case β€” a concurrent insert/delete on other fragments β€” for free, so this - // fires only on real overlap.) - // * Inner loop (Phase C) β€” the manifest advanced under us between our - // compaction and our publish. The compaction IS committed at Lance HEAD, so - // we must NOT reopen (that would trip the HEAD>manifest drift guard on our - // own work); instead re-read the current manifest version and either no-op - // (the manifest already moved past our version β€” being linear, it descends - // from and includes our compaction) or fast-forward to it. Monotonic, never - // the strict equality CAS that manufactured the bug. - // - // The Phase-A sidecar is written ONCE on the first productive attempt and reused - // across reopen attempts: every Phase-B commit is content-preserving, so a crash - // mid-retry leaves the table readable and recovery either rolls the observed HEAD - // forward (pin still matches the manifest) or safely rolls the compaction back. - let mut sidecar: Option = None; - - // Tracks whether one of OUR Phase-B ops (auto-cleanup strip / compact / reindex) - // already committed and advanced Lance HEAD past the manifest in a prior attempt. - // Once true, a reopened `lance_head > manifest` is our own sidecar-covered work, - // NOT external drift β€” so the drift guard and the no-op early-return must not treat - // it as such (that would drop our committed work as uncovered drift). - let mut head_advanced = false; - - // Outer loop: open β†’ plan β†’ Phase B, reopening + re-planning on a retryable - // Lance conflict. Breaks with the committed snapshot once Phase B succeeds. - let mut attempt: u32 = 0; - let (snapshot, metrics, pending_indexes, committed) = loop { - attempt += 1; - - // `compact_files` is a Lance-only maintenance API that needs `&mut Dataset`. 
- // The `TableStorage` trait deliberately does not surface it; unwrap the - // opaque `SnapshotHandle` via `into_dataset()` (gated to the maintenance path). - let mut ds = db - .storage() - .open_dataset_head_for_write(&table_key, &full_path, None) - .await? - .into_dataset(); - - // CAS baseline: the table's current manifest version, re-read each attempt - // (a reopen means the manifest may have advanced). - let expected_version = db - .fresh_snapshot_for_branch(None) - .await? - .entry(&table_key) - .map(|e| e.table_version) - .ok_or_else(|| OmniError::manifest(format!("no manifest entry for {}", table_key)))?; - - let lance_head_version = ds.version().version; - if lance_head_version < expected_version { - return Err(OmniError::manifest_internal(format!( - "table '{}' Lance HEAD version {} is behind manifest version {}", - table_key, lance_head_version, expected_version - ))); - } - if !head_advanced && lance_head_version > expected_version { - // Pre-existing EXTERNAL uncovered drift (we have not advanced HEAD yet) β€” - // go through explicit repair. Once `head_advanced` is set, a reopened - // `lance_head > manifest` is our own prior Phase-B commit (sidecar-covered) - // that the publish below fast-forwards, NOT external drift, so this guard is - // skipped on those retries. - if let Some(h) = sidecar.take() { - let _ = crate::db::manifest::delete_sidecar(&h, db.storage_adapter()).await; - } - tracing::warn!( - target: "omnigraph::optimize", - table = %table_key, - manifest_version = expected_version, - lance_head_version, - "skipping compaction: Lance HEAD is ahead of the manifest; run `omnigraph repair` \ - to classify and publish covered maintenance drift explicitly", - ); - return Ok(TableOptimizeStats::skipped_for_drift( - table_key.clone(), - expected_version, - lance_head_version, - )); - } - - // Precise "will it compact?" check β€” `plan_compaction` also accounts for - // deletion materialization (which can rewrite even a single fragment). 
- let options = CompactionOptions::default(); - let plan = plan_compaction(&ds, &options) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - let will_compact = plan.num_tasks() > 0; - // Even with nothing to compact, the table may still have index work - // (needs_reindex: rows appended since the index was built; needs_index_create: - // a declared `@index` whose physical build schema apply deferred, iss-848). - // Any of the three enters the publish path. If NONE, this is a no-op and must - // NOT be pinned in a sidecar (a zero-commit pin classifies NoMovement on - // recovery and rolls back siblings). - let needs_reindex = TableStore::has_unindexed_fragments(&ds).await?; - let needs_index_create = if let Some(type_name) = table_key.strip_prefix("node:") { - super::table_ops::needs_index_work_node(db, type_name, &table_key, &full_path, None) - .await? - } else { - super::table_ops::needs_index_work_edge(db, &table_key, &full_path, None).await? - }; - if !will_compact && !needs_reindex && !needs_index_create { - if head_advanced { - // Nothing left to compact, but a prior attempt already advanced HEAD - // (e.g. the strip committed, then compaction conflicted, and the reopen - // is now already compacted). Publish that committed work instead of - // dropping it as uncovered drift. - break ( - crate::storage_layer::SnapshotHandle::new(ds), - CompactionMetrics::default(), - Vec::new(), - true, - ); - } - if let Some(h) = sidecar.take() { - let _ = crate::db::manifest::delete_sidecar(&h, db.storage_adapter()).await; - } - return Ok(TableOptimizeStats::compacted( - table_key.clone(), - &CompactionMetrics::default(), - false, - )); - } - - // Phase A: recovery sidecar BEFORE any HEAD-advancing op, written once and - // reused across reopen attempts. - if sidecar.is_none() { - let sc = crate::db::manifest::new_sidecar( - crate::db::manifest::SidecarKind::Optimize, - None, - // optimize is system-attributed (no `optimize_as` actor API today). 
- None, - vec![crate::db::manifest::SidecarTablePin { - table_key: table_key.clone(), - table_path: full_path.clone(), - expected_version, - // Lower bound β€” compaction commits Nβ‰₯1 versions (reserve + rewrite); - // the classifier loose-matches SidecarKind::Optimize. - post_commit_pin: expected_version + 1, - confirmed_version: None, - table_branch: None, - }], - ); - sidecar = Some( - crate::db::manifest::write_sidecar(db.root_uri(), db.storage_adapter(), &sc).await?, - ); - } - - // Test seam: a concurrent (cross-process) writer can interleave here, before - // any Phase-B commit lands, to exercise the reopen+replan path. - crate::failpoints::maybe_fail(crate::failpoints::names::OPTIMIZE_BEFORE_COMPACT)?; - - // Phase B: scrub stale auto_cleanup (keeps optimize non-destructive on a - // graph upgraded from a pre-v7 binary whose `compact_files`/`optimize_indices` - // commits would otherwise fire Lance's auto-cleanup GC hook), compact, - // incremental reindex, then materialize declared-but-missing indexes. Each is - // an inline-commit residual covered by the sidecar. A retryable Lance conflict - // here means a concurrent writer preempted an overlapping fragment β†’ reopen at - // the new HEAD and re-plan. Baseline captured BEFORE the scrub so that if the - // scrub is the only commit, `committed` still triggers the Phase-C publish. - let version_before = ds.version().version; - match clear_stale_auto_cleanup_config(&mut ds).await { - // `true` β‡’ the strip committed and advanced HEAD past the manifest. 
- Ok(stripped) => head_advanced |= stripped, - Err(e) if attempt < COMPACTION_RETRY_BUDGET && is_retryable_lance_conflict(&e) => { - continue; - } - Err(e) => return Err(OmniError::Lance(e.to_string())), - } - let metrics: CompactionMetrics = if will_compact { - match compact_files(&mut ds, options, None).await { - Ok(m) => { - head_advanced = true; - m - } - Err(e) if attempt < COMPACTION_RETRY_BUDGET && is_retryable_lance_conflict(&e) => { - continue; - } - Err(e) => return Err(OmniError::Lance(e.to_string())), - } - } else { - CompactionMetrics::default() - }; - // Test seam: inject one retryable reindex conflict AFTER compaction has - // committed (so HEAD is already ahead of the manifest from our own work), - // exercising the own-HEAD (not external) drift classification on the next - // reopened attempt. - if crate::failpoints::maybe_fail(crate::failpoints::names::OPTIMIZE_INJECT_REINDEX_CONFLICT).is_err() - && attempt < COMPACTION_RETRY_BUDGET - { - continue; - } - match ds.optimize_indices(&OptimizeOptions::default()).await { - Ok(()) => {} - Err(e) if attempt < COMPACTION_RETRY_BUDGET && is_retryable_lance_conflict(&e) => { - continue; - } - Err(e) => { - return Err(OmniError::Lance(format!("optimize_indices on {}: {}", table_key, e))); - } - } - - let catalog = db.catalog(); - let mut snapshot = crate::storage_layer::SnapshotHandle::new(ds); - let pending_indexes: Vec = - super::table_ops::build_indices_on_dataset_for_catalog( - db, - &catalog, - &table_key, - &mut snapshot, - ) - .await?; - // optimize_indices / index build may also have committed (folded fragments, - // built a deferred index). Any HEAD advance this attempt counts too. - let version_after = snapshot.dataset().version().version; - head_advanced |= version_after != version_before; - - break (snapshot, metrics, pending_indexes, head_advanced); - }; - - // Pin the per-writer Phase B β†’ Phase C residual: Lance HEAD has advanced but the - // manifest publish below hasn't run. 
- crate::failpoints::maybe_fail(crate::failpoints::names::OPTIMIZE_POST_PHASE_B_PRE_MANIFEST_COMMIT)?; - - // Phase C: monotonic fast-forward publish. The compaction is committed at Lance - // HEAD `N`; publish a manifest pointer that includes it. If a concurrent writer - // already advanced the manifest to ≥ N (it built on our compaction), there is - // nothing to do. Otherwise advance to N; a concurrent advance during this window - // is a retryable manifest conflict -- re-read the current version and re-evaluate - // (NOT a reopen: the compaction is already committed). - if committed { - let state = db.storage().table_state(&full_path, &snapshot).await?; - let mut published = false; - let mut last_conflict: Option = None; - for _ in 0..COMPACTION_RETRY_BUDGET { - let current = current_manifest_version(db, &table_key).await?; - if current >= state.version { - // The manifest already points at a version that includes our - // compaction (Lance versions are linear). Nothing to publish. - published = true; - break; - } - let update = crate::db::SubTableUpdate { - table_key: table_key.clone(), - table_version: state.version, - table_branch: None, - row_count: state.row_count, - version_metadata: state.version_metadata.clone(), - }; - let mut expected = std::collections::HashMap::new(); - expected.insert(table_key.clone(), current); - match db - .coordinator - .write() - .await - .commit_updates_with_actor_with_expected(&[update], &expected, None) - .await - { - Ok(_) => { - published = true; - break; - } - // A retryable manifest conflict means the manifest moved under us -- - // loop and re-read `current` (the top check converges if it now - // already includes our compaction). Record it for the exhaustion path. - Err(e) if is_retryable_manifest_conflict(&e) => last_conflict = Some(e), - // Leave the sidecar for the open-time recovery sweep to roll forward. - Err(e) => return Err(e), - } - } - if !published { - // Budget exhausted under sustained contention. 
The final conflict may - // itself mean a concurrent writer published a version that already - // includes our (content-preserving) compaction β€” the postcondition is - // "the manifest reflects our compaction," not "we won the CAS" β€” so - // re-check before surfacing an error (Β§6.6). - let current = current_manifest_version(db, &table_key).await?; - if current < state.version { - return Err(last_conflict.unwrap_or_else(|| { - OmniError::manifest_conflict(format!( - "optimize publish of {table_key} exhausted {COMPACTION_RETRY_BUDGET} \ - retries against concurrent writers" - )) - })); - } - } - } - - // Phase D: delete the sidecar (best-effort; recovery resolves a leftover). - if let Some(h) = sidecar.take() { - if let Err(err) = crate::db::manifest::delete_sidecar(&h, db.storage_adapter()).await { - tracing::warn!( - error = %err, - operation_id = h.operation_id.as_str(), - "optimize recovery sidecar cleanup failed; next open's recovery sweep will resolve it" - ); - } - } - - let mut stat = TableOptimizeStats::compacted(table_key, &metrics, committed); - stat.pending_indexes = pending_indexes; - Ok(stat) -} - -/// Bound on the app-level retry of an internal-table compaction against a -/// concurrent live writer (see [`is_retryable_lance_conflict`]). -const COMPACTION_RETRY_BUDGET: u32 = 5; - -/// A Lance commit error that means "a concurrent writer preempted us; reload the -/// dataset and rerun." `compact_files` commits via `commit_compaction` -> -/// `apply_commit` *directly* β€” unlike the merge-insert path it is NOT wrapped in -/// `execute_with_retry`, so a `Rewrite`-vs-`Merge`/`Update`/`Delete` `check_txn` -/// conflict propagates raw instead of being rebased or converted to -/// `TooMuchWriteContention`. Lance's transaction spec prescribes that the -/// *application* reruns these, which is what `compact_internal_table` does β€” so a -/// maintenance compaction (a physical op) never fails a live write (a logical op), -/// invariant 7. 
(`TooMuchWriteContention` is included for the exhausted-retry form -/// some commit paths surface.) -fn is_retryable_lance_conflict(err: &lance::Error) -> bool { - matches!( - err, - lance::Error::RetryableCommitConflict { .. } - | lance::Error::CommitConflict { .. } - | lance::Error::TooMuchWriteContention { .. } - ) -} - -/// A manifest publish conflict that optimize's monotonic Phase-C loop re-evaluates -/// (re-read the current version, then no-op or fast-forward). Both shapes that reach -/// here are `Conflict`-kind and mean "the manifest moved under us; reconsider," never -/// a lost update: the typed `ExpectedVersionMismatch` (a concurrent writer advanced -/// the table) and the publisher's exhausted row-level CAS (`manifest_conflict`). -fn is_retryable_manifest_conflict(err: &OmniError) -> bool { - matches!( - err, - OmniError::Manifest(m) if m.kind == crate::error::ManifestErrorKind::Conflict - ) -} - -/// The table's current manifest version on `main` (0 if absent), read fresh. Used by -/// optimize's monotonic publish loop to decide no-op (`current >= N`) vs fast-forward. -async fn current_manifest_version(db: &Omnigraph, table_key: &str) -> Result { - Ok(db - .fresh_snapshot_for_branch(None) - .await? - .entry(table_key) - .map(|e| e.table_version) - .unwrap_or(0)) -} - -/// Remove any stored `lance.auto_cleanup.*` config from a table so compaction -/// stays **non-destructive by construction**. Used by both the internal-table -/// path ([`compact_internal_table`]) and the data-table path -/// ([`optimize_one_table`]). 
-/// -/// `compact_files` / `optimize_indices` commit with a default `CommitConfig` -/// (`skip_auto_cleanup = false`) and `CompactionOptions` exposes no override, so on -/// a dataset whose stored config has `lance.auto_cleanup.interval` set, the -/// compaction/reindex commit would fire Lance's auto-cleanup hook (version GC) — -/// deletion of old versions, including ones `__manifest` pins for snapshots / -/// time-travel (data tables) or that hold lineage/time-travel state (internal -/// tables). New graphs create tables with `auto_cleanup: None` (`manifest/graph.rs`, -/// `commit_graph.rs`, and the data-table create path) so there is nothing to clear; -/// only pre-`auto_cleanup`-fix *upgraded* graphs carry the config. OmniGraph owns -/// version cleanup explicitly (`cleanup`), so Lance's hook is unwanted regardless — -/// clearing it both makes `optimize` non-destructive and aligns the table with the -/// new-graph posture. The `delete_config_keys` commit itself does not GC: the -/// resulting manifest no longer has the `interval` key, so the post-commit hook is a -/// no-op. Returns whether any config was cleared (it advances Lance HEAD iff so). -/// Recovery coverage differs by caller: the data-table path runs this inside the -/// Optimize sidecar window; the internal-table path needs none (it commits at HEAD -/// and is read at HEAD — the strip is a content-preserving config commit, so a crash -/// leaves the table readable and content-identical, see [`compact_internal_table`]). -async fn clear_stale_auto_cleanup_config( - ds: &mut lance::Dataset, -) -> std::result::Result { - let keys: Vec = ds - .config() - .keys() - .filter(|k| k.starts_with("lance.auto_cleanup.")) - .cloned() - .collect(); - if keys.is_empty() { - return Ok(false); - } - // Merge-update with `None` values to delete the keys — the non-deprecated - // replacement for `delete_config_keys` (awaiting the builder merges rather - // than replacing the whole config map).
- let entries: Vec<(&str, Option<&str>)> = keys.iter().map(|k| (k.as_str(), None)).collect(); - ds.update_config(entries).await?; - Ok(true) -} - -/// Compact one INTERNAL system table (`__manifest` / `_graph_commits` / -/// `_graph_commit_actors`) in place. -/// -/// Unlike catalog data tables, the internal tables are not tracked in the -/// `__manifest` (they ARE the manifest / the lineage DAG): readers open them at -/// their latest Lance HEAD, so compaction just advances that HEAD and the next -/// reader transparently observes the compacted version. That makes this path much -/// simpler than [`optimize_one_table`] — no manifest publish (nothing to publish -/// to), and no recovery sidecar. The sidecar-free claim does NOT rest on -/// single-commit atomicity: `compact_files` can emit a `ReserveFragments` commit -/// before the final `Rewrite` (and the config strip is a separate commit before -/// both), so this advances HEAD over one or more commits. It needs no sidecar -/// because every one of those commits is content-preserving and the table is read -/// at HEAD — a crash at any point leaves the table readable and content-identical, -/// and the next `optimize` re-plans. Internal tables carry no Lance index (only -/// `object_id`'s unenforced-PK schema metadata), so no `optimize_indices`. -/// -/// Concurrency: no application lock, but `compact_files` does NOT auto-retry a -/// semantic conflict — its `Operation::Rewrite` commits through `apply_commit` -/// directly (not the merge-insert `execute_with_retry` path), so a `Rewrite` -/// vs concurrent `Update`/`Merge`/`Delete` `check_txn` conflict propagates raw. -/// We own the retry here (see [`is_retryable_lance_conflict`]): on a retryable -/// conflict, reopen at the new HEAD and rerun. A follow-up coordinator `refresh` -/// makes the warm internal-table handles observe the compacted HEAD -/// deterministically (the version probe would also self-heal on the next read).
-async fn compact_internal_table( - db: &Omnigraph, - table_key: &str, - uri: String, -) -> Result { - // App-level retry against concurrent live writers. compact_files does NOT - // auto-retry a Rewrite-vs-live-write conflict (see is_retryable_lance_conflict), - // so optimize would otherwise fail spuriously on a live graph. On a retryable - // conflict we re-open at the new HEAD and rerun — the canonical Lance-consumer - // pattern. Each attempt opens fresh because the conflict means the version moved. - for attempt in 0..COMPACTION_RETRY_BUDGET { - let handle = db - .storage() - .open_dataset_head_for_write(table_key, &uri, None) - .await?; - let mut ds = handle.into_dataset(); - - // Keep optimize non-destructive by construction (see clear_stale_auto_cleanup_config). - // Returns whether it committed a config-strip (which advances Lance HEAD). - let cleared_config = match clear_stale_auto_cleanup_config(&mut ds).await { - Ok(cleared) => cleared, - Err(e) => { - if attempt + 1 < COMPACTION_RETRY_BUDGET && is_retryable_lance_conflict(&e) - { - continue; - } - return Err(OmniError::Lance(e.to_string())); - } - }; - - let options = CompactionOptions::default(); - let plan = plan_compaction(&ds, &options) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - if plan.num_tasks() == 0 { - // No compaction work, but a config-strip still advanced HEAD — refresh - // the warm coordinator handles so they observe it deterministically - // (same cache-coherence step the successful-compaction path takes - // below; otherwise they stay pinned until the next version probe).
- if cleared_config { - db.coordinator.write().await.refresh().await?; - } - return Ok(TableOptimizeStats::compacted( - table_key.to_string(), - &CompactionMetrics::default(), - false, - )); - } - - match compact_files(&mut ds, options, None).await { - Ok(metrics) => { - // Cache coherence: re-open the warm coordinator's internal-table - // handles at the compacted HEAD (they live in `db.coordinator`, not - // the data-table `runtime_cache`). - db.coordinator.write().await.refresh().await?; - return Ok(TableOptimizeStats::compacted( - table_key.to_string(), - &metrics, - true, - )); - } - Err(e) - if attempt + 1 < COMPACTION_RETRY_BUDGET - && is_retryable_lance_conflict(&e) => - { - continue; - } - Err(e) => return Err(OmniError::Lance(e.to_string())), - } - } - Err(OmniError::manifest_conflict(format!( - "internal-table compaction of {table_key} exhausted {COMPACTION_RETRY_BUDGET} \ - retries against concurrent writers" - ))) + stats.into_iter().collect() } /// Run Lance `cleanup_old_versions` on every node + edge table on `main`, @@ -872,26 +138,6 @@ pub async fn cleanup_all_tables( db.ensure_schema_state_valid().await?; db.ensure_schema_apply_idle("cleanup").await?; - // Reclaim orphaned branch forks (from an incomplete prior `branch_delete`) - // before version GC. Authority-derived and idempotent; the eager - // best-effort reclaim in `branch_delete` covers the common case, this is - // the guaranteed backstop. Logged for observability. 
- let reconciled = reconcile_orphaned_branches(db).await?; - if !reconciled.reclaimed.is_empty() { - tracing::info!( - count = reconciled.reclaimed.len(), - reclaimed = ?reconciled.reclaimed, - "cleanup reconciled orphaned branch forks" - ); - } - if !reconciled.failures.is_empty() { - tracing::warn!( - count = reconciled.failures.len(), - failures = ?reconciled.failures, - "cleanup could not reconcile some orphaned forks; will retry next cleanup" - ); - } - let before_timestamp = options.older_than.map(|d| Utc::now() - d); let keep_versions = options.keep_versions; @@ -912,332 +158,41 @@ pub async fn cleanup_all_tables( } let concurrency = maint_concurrency().min(table_tasks.len()).max(1); - let storage = db.storage(); + let table_store = &db.table_store; - // Fault-isolated per table: a single table's GC failure is recorded on its - // stats row (`error: Some`) and logged, never aborting the healthy tables. - // cleanup is the convergence backstop, so it must do as much as it can and - // converge on re-run rather than fail wholesale (invariant 13). - let results: Vec = futures::stream::iter(table_tasks.into_iter()) + let results: Vec> = futures::stream::iter(table_tasks.into_iter()) .map(|(table_key, full_path)| async move { - let outcome: Result = async { - crate::failpoints::maybe_fail(crate::failpoints::names::CLEANUP_TABLE_GC)?; - // `cleanup_old_versions` is a Lance-only maintenance API not - // surfaced through `TableStorage` — see the optimize path - // above for the same rationale. Unwrap via `into_dataset()`.
- let handle = storage - .open_dataset_head_for_write(&table_key, &full_path, None) - .await?; - let ds = handle.into_dataset(); - let before_version = keep_versions - .map(|n| ds.version().version.saturating_sub(n as u64)) - .filter(|v| *v > 0); - let policy = CleanupPolicy { - before_timestamp, - before_version, - delete_unverified: false, - error_if_tagged_old_versions: false, - clean_referenced_branches: false, - delete_rate_limit: None, - }; - lance::dataset::cleanup::cleanup_old_versions(&ds, policy) - .await - .map_err(|e| OmniError::Lance(e.to_string())) - } - .await; - match outcome { - Ok(removed) => TableCleanupStats { - table_key, - bytes_removed: removed.bytes_removed, - old_versions_removed: removed.old_versions, - error: None, - }, - Err(err) => { - tracing::warn!( - target: "omnigraph::cleanup", - table = %table_key, - error = %err, - "version GC failed for table; other tables unaffected", - ); - TableCleanupStats { - table_key, - bytes_removed: 0, - old_versions_removed: 0, - error: Some(err.to_string()), - } - } - } + let ds = table_store + .open_dataset_head_for_write(&table_key, &full_path, None) + .await?; + let before_version = keep_versions + .map(|n| ds.version().version.saturating_sub(n as u64)) + .filter(|v| *v > 0); + let policy = CleanupPolicy { + before_timestamp, + before_version, + delete_unverified: false, + error_if_tagged_old_versions: false, + clean_referenced_branches: false, + delete_rate_limit: None, + }; + let removed: RemovalStats = lance::dataset::cleanup::cleanup_old_versions(&ds, policy) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + Ok(TableCleanupStats { + table_key, + bytes_removed: removed.bytes_removed, + old_versions_removed: removed.old_versions, + }) }) .buffer_unordered(concurrency) .collect() .await; - Ok(results) + results.into_iter().collect() } -/// Outcome of [`reconcile_orphaned_branches`]: the `(owner, branch)` pairs -/// reclaimed and the `(owner, error)` pairs that failed, where `owner` is a 
-/// table key (e.g. `node:Person`) or `"_graph_commits"`. Per-owner failures are -/// isolated and recorded here, not propagated — the next reconcile converges. -#[derive(Debug, Clone, Default)] -pub struct BranchReconcileStats { - pub reclaimed: Vec<(String, String)>, - pub failures: Vec<(String, String)>, -} - -/// Drop every per-table and commit-graph Lance branch fork the manifest does -/// not reference. -/// -/// Two origins produce a manifest-unreferenced fork: -/// 1. A `branch_delete` flips the manifest authority (atomic) but a -/// downstream best-effort reclaim does not complete — the whole branch is -/// gone from the manifest, but a `tree/{branch}/` ref lingers. -/// 2. A first-write fork (or a merge fork) creates the branch ref before the -/// manifest publish, then the writer dies / is cancelled — the branch is -/// still a live manifest branch, but the manifest's snapshot of it does -/// not place *this table* on the branch. -/// -/// The write path self-heals (2) on the next write to the table -/// (`reclaim_orphaned_fork_and_refork`); this is the guaranteed-convergence -/// backstop that also covers (1) and any table the write path never revisits. -/// -/// The orphan test is therefore **per-table**, not per-branch-name: a Lance -/// branch `B` on table `T` is an orphan iff `B` is not a live manifest branch -/// at all (origin 1) OR the manifest's branch-`B` snapshot does not place `T` -/// on `B` (origin 2). A legitimately-forked table (`table_branch == Some(B)`) -/// is kept. `main` and internal/system branches are never candidates. Lance -/// refuses to force-delete a branch with referencing descendants, so children -/// are dropped before parents (longest name first). Idempotent and authority- -/// derived: no-ops once reconciled, and degrades to finding nothing if a future -/// Lance atomic multi-dataset branch op prevents orphans from forming.
-pub async fn reconcile_orphaned_branches(db: &Omnigraph) -> Result { - use std::collections::{HashMap, HashSet}; - - // Live manifest branches: the set whose per-table placements are - // authoritative. A branch absent here is a whole-branch (origin-1) orphan. - let live_branches: HashSet = db - .coordinator - .read() - .await - .all_branches() - .await? - .into_iter() - .collect(); - - let resolved = db.resolved_branch_target(None).await?; - let snapshot = resolved.snapshot; - let table_targets: Vec<(String, String)> = all_table_keys(&db.catalog()) - .into_iter() - .filter_map(|table_key| { - let entry = snapshot.entry(&table_key)?; - let full_path = format!("{}/{}", db.root_uri, entry.table_path); - Some((table_key, full_path)) - }) - .collect(); - - let mut stats = BranchReconcileStats::default(); - // Per-branch snapshots are resolved once and cached across tables (few - // branches in practice); origin-2 detection consults the branch's own view. - // Failures are cached too: one branch-level read failure should not refetch - // and append duplicate per-table noise for every table that lists the ref. - let mut branch_snapshots: HashMap = HashMap::new(); - let mut failed_branch_snapshots: HashSet = HashSet::new(); - - // Per-table fault isolation: one table's transient failure is recorded and - // logged, never aborting the rest of the sweep. - let storage = db.storage(); - for (table_key, full_path) in table_targets { - let listed = match storage.list_branches(&full_path).await { - Ok(listed) => listed, - Err(err) => { - tracing::warn!( - target: "omnigraph::cleanup", - table = %table_key, - error = %err, - "listing branches failed during reconcile; skipping table", - ); - stats.failures.push((table_key.clone(), err.to_string())); - continue; - } - }; - - // Decide per (table, branch) whether the fork is an orphan. - let mut orphans: Vec = Vec::new(); - for branch in listed { - // `main` is not a named Lance branch; system/internal branches - // (e.g. 
the schema-apply lock) own legitimate forks — never touch. - if branch == "main" || crate::db::is_internal_system_branch(&branch) { - continue; - } - let is_orphan = if !live_branches.contains(&branch) { - true // origin 1: whole branch gone from the manifest - } else { - // origin 2: live branch, but does the manifest place THIS - // table on it? Resolve (and cache) the branch's snapshot. - if failed_branch_snapshots.contains(&branch) { - continue; - } - if !branch_snapshots.contains_key(&branch) { - let branch_snapshot = - match crate::failpoints::maybe_fail(crate::failpoints::names::CLEANUP_RESOLVE_BRANCH_SNAPSHOT) { - Ok(()) => db.snapshot_for_branch(Some(&branch)).await, - Err(injected) => Err(injected), - }; - match branch_snapshot { - Ok(snap) => { - branch_snapshots.insert(branch.clone(), snap); - } - Err(err) => { - tracing::warn!( - target: "omnigraph::cleanup", - table = %table_key, - branch = %branch, - error = %err, - "resolving branch snapshot failed during reconcile; skipping", - ); - stats.failures.push((table_key.clone(), err.to_string())); - failed_branch_snapshots.insert(branch.clone()); - continue; - } - } - } - branch_snapshots[&branch] - .entry(&table_key) - .map(|e| e.table_branch.as_deref() != Some(branch.as_str())) - .unwrap_or(true) - }; - if is_orphan { - orphans.push(branch); - } - } - // Children before parents (longest name first) so Lance's referenced- - // parent RefConflict cannot block reclamation. - orphans.sort_by(|a, b| b.len().cmp(&a.len()).then_with(|| a.cmp(b))); - - for branch in orphans { - // Serialize against in-process live writers before destroying a ref. - // A first-write fork holds the per-(table, branch) write queue from - // before the fork through the manifest publish; on a LIVE branch its - // in-flight fork looks exactly like an origin-2 orphan (manifest not - // yet advanced).
Acquire the same queue so cleanup waits for any such - // writer, then RE-VALIDATE under the queue with a fresh read: if the - // writer published in the meantime (table now placed on the branch), - // it is no longer an orphan — skip it. (Cross-process writers remain - // the documented one-winner-CAS gap.) One key held at a time → no - // lock-order inversion against multi-table `acquire_many` writers. - let _guard = db - .write_queue() - .acquire(&(table_key.clone(), Some(branch.clone()))) - .await; - // Decide under the queue from FRESH authority via the shared - // classifier (same decision the write-path reclaim uses) — never - // from the sweep-start `live_branches` capture. A branch created - // AFTER that capture is absent from the stale set yet may already - // carry a legitimately-published fork (an in-process writer held - // this queue through its fork+publish; we just waited on it), so a - // stale "origin-1 ⇒ delete" shortcut would destroy a live fork. - // Only `Orphan` is reclaimed; `Indeterminate` (transient read) is - // skipped and recorded. (Cross-process writers remain the documented - // one-winner-CAS gap.) One key held at a time → no lock-order - // inversion vs multi-table `acquire_many` writers.
- match super::table_ops::classify_fork_ref(db, &table_key, &branch).await { - super::table_ops::ForkRefStatus::Orphan => {} - super::table_ops::ForkRefStatus::Legitimate => continue, - super::table_ops::ForkRefStatus::Indeterminate => { - tracing::warn!( - target: "omnigraph::cleanup", - table = %table_key, - branch = %branch, - "fresh re-check inconclusive during reconcile; skipping to avoid \ - destroying a possibly-live fork (will retry next cleanup)", - ); - stats.failures.push(( - table_key.clone(), - format!("indeterminate fork status for {branch}"), - )); - continue; - } - } - let outcome = match crate::failpoints::maybe_fail(crate::failpoints::names::CLEANUP_RECONCILE_FORK) { - Ok(()) => storage.force_delete_branch(&full_path, &branch).await, - Err(injected) => Err(injected), - }; - match outcome { - Ok(()) => stats.reclaimed.push((table_key.clone(), branch)), - Err(err) => { - tracing::warn!( - target: "omnigraph::cleanup", - table = %table_key, - branch = %branch, - error = %err, - "reclaiming orphaned fork failed; will retry next cleanup", - ); - stats.failures.push((table_key.clone(), err.to_string())); - } - } - } - } - - // Commit-graph orphans are whole-branch (not per-table), so the simple - // "branch name not in the live set" test still applies there. - if let Err(err) = reconcile_commit_graph_orphans(db, &live_branches, &mut stats).await { - tracing::warn!( - target: "omnigraph::cleanup", - error = %err, - "commit-graph orphan reconcile failed; will retry next cleanup", - ); - stats - .failures - .push(("_graph_commits".to_string(), err.to_string())); - } - - Ok(stats) -} - -/// Commit-graph half of [`reconcile_orphaned_branches`], split out so its -/// errors can be isolated. Returns `Ok` when the commit-graph dataset is absent. 
-async fn reconcile_commit_graph_orphans( - db: &Omnigraph, - keep: &std::collections::HashSet, - stats: &mut BranchReconcileStats, -) -> Result<()> { - let commits_uri = crate::db::commit_graph::graph_commits_uri(db.root_uri()); - if !db.storage_adapter().exists(&commits_uri).await? { - return Ok(()); - } - let mut commit_graph = crate::db::commit_graph::CommitGraph::open(db.root_uri()).await?; - for branch in orphan_branches(commit_graph.list_branches().await?, keep) { - match commit_graph.force_delete_branch(&branch).await { - Ok(()) => stats.reclaimed.push(("_graph_commits".to_string(), branch)), - Err(err) => { - tracing::warn!( - target: "omnigraph::cleanup", - branch = %branch, - error = %err, - "reclaiming orphaned commit-graph branch failed; will retry next cleanup", - ); - stats - .failures - .push(("_graph_commits".to_string(), err.to_string())); - } - } - } - Ok(()) -} - -/// Filter `present` Lance branches down to those absent from the manifest -/// `keep` set, ordered children-before-parents (longest name first) so Lance's -/// referenced-parent `RefConflict` cannot block reclamation. 
-fn orphan_branches(present: Vec, keep: &std::collections::HashSet) -> Vec { - let mut orphans: Vec = present - .into_iter() - .filter(|branch| !keep.contains(branch)) - .collect(); - orphans.sort_by(|a, b| b.len().cmp(&a.len()).then_with(|| a.cmp(b))); - orphans -} - -pub(super) fn all_table_keys(catalog: &omnigraph_compiler::catalog::Catalog) -> Vec { +fn all_table_keys(catalog: &omnigraph_compiler::catalog::Catalog) -> Vec { let mut keys: Vec = catalog .node_types .keys() @@ -1247,90 +202,3 @@ pub(super) fn all_table_keys(catalog: &omnigraph_compiler::catalog::Catalog) -> keys.sort(); keys } - -#[cfg(all(test, feature = "failpoints"))] -mod tests { - use super::*; - use crate::failpoints::ScopedFailPoint; - use crate::loader::{LoadMode, load_jsonl}; - - /// The internal-table compaction retry classifier: a concurrent live writer - /// preempting our `Rewrite` is retryable (Lance prescribes app-rerun, and - /// compact_files does not auto-retry it); a non-conflict error is not (must not - /// be masked by a blind retry). 
- #[test] - fn retryable_lance_conflicts_are_classified() { - assert!(is_retryable_lance_conflict( - &lance::Error::retryable_commit_conflict_source( - 1, - Box::new(std::io::Error::other("preempted by concurrent write")), - ) - )); - assert!(is_retryable_lance_conflict( - &lance::Error::too_much_write_contention("contended") - )); - assert!(!is_retryable_lance_conflict(&lance::Error::invalid_input( - "not a conflict" - ))); - } - - fn node_table_uri(root: &str, type_name: &str) -> String { - let mut hash: u64 = 0xcbf2_9ce4_8422_2325; - for &b in type_name.as_bytes() { - hash ^= b as u64; - hash = hash.wrapping_mul(0x100_0000_01b3); - } - format!("{}/nodes/{hash:016x}", root.trim_end_matches('/')) - } - - #[tokio::test] - async fn reconcile_caches_live_branch_snapshot_resolution_failure() { - let _scenario = fail::FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let schema = "node Person { name: String @key }\nnode Company { name: String @key }\n"; - let mut db = Omnigraph::init(uri, schema).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Alice\"}}\n\ - {\"type\":\"Company\",\"data\":{\"name\":\"Acme\"}}", - LoadMode::Merge, - ) - .await - .unwrap(); - db.branch_create("feature").await.unwrap(); - - for type_name in ["Person", "Company"] { - let table_uri = node_table_uri(uri, type_name); - // forbidden-api-allow: test synthesizes a branch ref directly on the Lance dataset. 
- let mut ds = lance::Dataset::open(&table_uri).await.unwrap(); - let base = ds.version().version; - ds.create_branch("feature", base, None).await.unwrap(); - } - - let _fp = ScopedFailPoint::new( - crate::failpoints::names::CLEANUP_RESOLVE_BRANCH_SNAPSHOT, - "return", - ); - let stats = reconcile_orphaned_branches(&db).await.unwrap(); - - assert_eq!( - stats.failures.len(), - 1, - "one live-branch snapshot resolution failure should be reported once, \ - not once per table: {:?}", - stats.failures - ); - assert!( - stats.failures[0] - .1 - .contains("cleanup.resolve_branch_snapshot"), - "the recorded failure should be the branch-snapshot resolution failure: {:?}", - stats.failures - ); - assert!( - stats.reclaimed.is_empty(), - "unreadable live-branch refs must be left for the next cleanup run" - ); - } -} diff --git a/crates/omnigraph/src/db/omnigraph/repair.rs b/crates/omnigraph/src/db/omnigraph/repair.rs deleted file mode 100644 index 8e7146a..0000000 --- a/crates/omnigraph/src/db/omnigraph/repair.rs +++ /dev/null @@ -1,340 +0,0 @@ -//! Explicit repair for uncovered manifest/head drift. -//! -//! Recovery sidecars handle deterministic crash residuals automatically. This -//! module is for the different case: a table's Lance HEAD is ahead of the -//! version recorded in `__manifest` and there is no sidecar encoding writer -//! intent. `repair` classifies that uncovered drift from Lance transactions and -//! only auto-publishes maintenance-only drift when the operator confirms. - -use std::collections::HashMap; - -use lance::Dataset; -use lance::dataset::transaction::Operation; - -use super::*; - -/// Options for [`Omnigraph::repair`]. -#[derive(Debug, Clone, Copy, Default)] -pub struct RepairOptions { - /// Preview by default. With `confirm`, verified maintenance drift is - /// published to `__manifest`. - pub confirm: bool, - /// Also publish suspicious/unverifiable drift. Requires `confirm`. 
- pub force: bool, -} - -/// Classification of a table's manifest/head state. -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -#[non_exhaustive] -pub enum RepairClassification { - /// Lance HEAD equals the manifest pin. - NoDrift, - /// Every uncovered Lance transaction is maintenance-only (`Rewrite` or - /// `ReserveFragments`), so publishing the HEAD is content-preserving. - VerifiedMaintenance, - /// At least one uncovered transaction is semantic (`Append`, `Delete`, - /// `Update`, etc.). - Suspicious, - /// A needed transaction could not be read, so the drift cannot be judged. - Unverifiable, -} - -impl RepairClassification { - /// Stable machine-readable token for serialized output. - pub fn as_str(&self) -> &'static str { - match self { - Self::NoDrift => "no_drift", - Self::VerifiedMaintenance => "verified_maintenance", - Self::Suspicious => "suspicious", - Self::Unverifiable => "unverifiable", - } - } -} - -impl std::fmt::Display for RepairClassification { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - f.write_str(self.as_str()) - } -} - -/// What repair did for a table. -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -#[non_exhaustive] -pub enum RepairAction { - /// Nothing to do. - NoOp, - /// Drift was reported but not published because this was a preview. - Preview, - /// Verified maintenance drift was published to `__manifest`. - Healed, - /// Suspicious/unverifiable drift was published because `force` was set. - Forced, - /// Drift was left untouched because it was not safe to publish without - /// `force`. - Refused, -} - -impl RepairAction { - /// Stable machine-readable token for serialized output. 
- pub fn as_str(&self) -> &'static str { - match self { - Self::NoOp => "no_op", - Self::Preview => "preview", - Self::Healed => "healed", - Self::Forced => "forced", - Self::Refused => "refused", - } - } -} - -impl std::fmt::Display for RepairAction { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - f.write_str(self.as_str()) - } -} - -/// Per-table repair outcome. -#[derive(Debug, Clone)] -#[non_exhaustive] -pub struct TableRepairStats { - pub table_key: String, - pub manifest_version: u64, - pub lance_head_version: u64, - pub classification: RepairClassification, - pub action: RepairAction, - pub operations: Vec, - pub error: Option, -} - -/// Whole-graph repair outcome. -#[derive(Debug, Clone)] -#[non_exhaustive] -pub struct RepairStats { - pub tables: Vec, - /// New graph manifest version if repair published any table pins. - pub manifest_version: Option, -} - -struct ClassificationResult { - classification: RepairClassification, - operations: Vec, - error: Option, -} - -pub async fn repair_all_tables(db: &Omnigraph, options: RepairOptions) -> Result { - if options.force && !options.confirm { - return Err(OmniError::manifest("repair --force requires --confirm")); - } - - db.ensure_schema_state_valid().await?; - db.ensure_schema_apply_idle("repair").await?; - ensure_no_pending_recovery_sidecars(db, "repair").await?; - - let snapshot = db.fresh_snapshot_for_branch(None).await?; - let table_tasks: Vec<(String, String)> = { - let catalog = db.catalog(); - let mut tasks = Vec::new(); - for table_key in optimize::all_table_keys(&catalog) { - let Some(entry) = snapshot.entry(&table_key) else { - continue; - }; - let full_path = format!("{}/{}", db.root_uri, entry.table_path); - tasks.push((table_key, full_path)); - } - tasks - }; - - if table_tasks.is_empty() { - return Ok(RepairStats { - tables: Vec::new(), - manifest_version: None, - }); - } - - let queue_keys: Vec<(String, Option)> = table_tasks - .iter() - .map(|(table_key, _)| 
(table_key.clone(), None)) - .collect(); - let _guards = db.write_queue().acquire_many(&queue_keys).await; - ensure_no_pending_recovery_sidecars(db, "repair").await?; - - let snapshot = db.fresh_snapshot_for_branch(None).await?; - let mut tables = Vec::with_capacity(table_tasks.len()); - let mut updates = Vec::new(); - let mut expected = HashMap::new(); - let mut any_forced = false; - - for (table_key, full_path) in table_tasks { - // `classify_drift` inspects raw Lance transaction history - // (`read_transaction_by_version`), a Lance-only maintenance read the - // staged-write trait does not surface. Open via `db.storage()` and - // unwrap the opaque handle (mirrors optimize / cleanup). - let ds = db - .storage() - .open_dataset_head_for_write(&table_key, &full_path, None) - .await? - .into_dataset(); - let manifest_version = snapshot - .entry(&table_key) - .map(|e| e.table_version) - .ok_or_else(|| OmniError::manifest(format!("no manifest entry for {}", table_key)))?; - let lance_head_version = ds.version().version; - - if lance_head_version < manifest_version { - return Err(OmniError::manifest_internal(format!( - "table '{}' Lance HEAD version {} is behind manifest version {}", - table_key, lance_head_version, manifest_version - ))); - } - - if lance_head_version == manifest_version { - tables.push(TableRepairStats { - table_key, - manifest_version, - lance_head_version, - classification: RepairClassification::NoDrift, - action: RepairAction::NoOp, - operations: Vec::new(), - error: None, - }); - continue; - } - - let classification = classify_drift(&ds, manifest_version, lance_head_version).await; - let action = match ( - options.confirm, - options.force, - classification.classification, - ) { - (false, _, _) => RepairAction::Preview, - (true, _, RepairClassification::VerifiedMaintenance) => RepairAction::Healed, - (true, true, RepairClassification::Suspicious | RepairClassification::Unverifiable) => { - any_forced = true; - RepairAction::Forced - } - (true, _, 
RepairClassification::Suspicious | RepairClassification::Unverifiable) => { - RepairAction::Refused - } - (true, _, RepairClassification::NoDrift) => RepairAction::NoOp, - }; - - if matches!(action, RepairAction::Healed | RepairAction::Forced) { - // Re-wrap the opened dataset to read its state through the trait - // surface (`table_state` is a read; no HEAD advance). - let snapshot = crate::storage_layer::SnapshotHandle::new(ds); - let state = db.storage().table_state(&full_path, &snapshot).await?; - updates.push(crate::db::SubTableUpdate { - table_key: table_key.clone(), - table_version: state.version, - table_branch: None, - row_count: state.row_count, - version_metadata: state.version_metadata, - }); - expected.insert(table_key.clone(), manifest_version); - } - - tables.push(TableRepairStats { - table_key, - manifest_version, - lance_head_version, - classification: classification.classification, - action, - operations: classification.operations, - error: classification.error, - }); - } - - let manifest_version = if updates.is_empty() { - None - } else { - let actor = if any_forced { - Some("omnigraph:repair:force") - } else { - Some("omnigraph:repair") - }; - let PublishedSnapshot { - manifest_version, - _snapshot_id: _, - } = db - .coordinator - .write() - .await - .commit_updates_with_actor_with_expected(&updates, &expected, actor) - .await?; - db.runtime_cache.invalidate_all().await; - if updates - .iter() - .any(|update| update.table_key.starts_with("edge:")) - { - db.invalidate_graph_index().await; - } - Some(manifest_version) - }; - - Ok(RepairStats { - tables, - manifest_version, - }) -} - -async fn ensure_no_pending_recovery_sidecars(db: &Omnigraph, operation: &str) -> Result<()> { - if !crate::db::manifest::list_sidecars(db.root_uri(), db.storage_adapter()) - .await? 
- .is_empty() - { - return Err(OmniError::manifest_conflict(format!( - "{operation} requires a clean recovery state; reopen the graph to run the \ - recovery sweep before repairing" - ))); - } - Ok(()) -} - -async fn classify_drift( - ds: &Dataset, - manifest_version: u64, - lance_head_version: u64, -) -> ClassificationResult { - let mut operations = Vec::new(); - let mut saw_suspicious = false; - let mut error = None; - - for version in manifest_version.saturating_add(1)..=lance_head_version { - match ds.read_transaction_by_version(version).await { - Ok(Some(transaction)) => { - let operation = transaction.operation; - operations.push(operation.name().to_string()); - if !matches!( - operation, - Operation::Rewrite { .. } | Operation::ReserveFragments { .. } - ) { - saw_suspicious = true; - } - } - Ok(None) => { - error = Some(format!("missing Lance transaction for version {version}")); - break; - } - Err(err) => { - error = Some(format!( - "failed to read Lance transaction for version {version}: {err}" - )); - break; - } - } - } - - let classification = if error.is_some() { - RepairClassification::Unverifiable - } else if saw_suspicious { - RepairClassification::Suspicious - } else { - RepairClassification::VerifiedMaintenance - }; - - ClassificationResult { - classification, - operations, - error, - } -} diff --git a/crates/omnigraph/src/db/omnigraph/schema_apply.rs b/crates/omnigraph/src/db/omnigraph/schema_apply.rs index 364f5a4..0dcf0f9 100644 --- a/crates/omnigraph/src/db/omnigraph/schema_apply.rs +++ b/crates/omnigraph/src/db/omnigraph/schema_apply.rs @@ -48,24 +48,57 @@ pub(super) async fn plan_schema( Ok(plan) } -struct PlannedSchemaApply { - plan: SchemaMigrationPlan, - desired_ir: SchemaIR, - desired_catalog: Catalog, -} - -async fn plan_schema_for_apply( +pub(super) async fn apply_schema( db: &Omnigraph, desired_schema_source: &str, options: SchemaApplyOptions, -) -> Result { + actor: Option<&str>, +) -> Result { + // Engine-layer policy gate (MR-722 
chassis core). + // + // Fires BEFORE acquiring the schema-apply lock or doing any other + // work. When no PolicyChecker is installed this is a no-op and + // the apply path behaves exactly as it did before MR-722. When + // a PolicyChecker IS installed and the actor is None, this is a + // hard error — see Omnigraph::enforce's docstring for the + // forget-the-actor-footgun reasoning. + // + // Scope is TargetBranch("main") to match the HTTP-layer convention + // for SchemaApply: branch=None, target_branch=Some("main"). Cedar + // policies in the wild use `target_branch_scope: protected` to + // gate schema applies, so the engine-layer call has to set the + // target_branch shape that activates that predicate. Wrong scope + // here = silent policy mismatch with HTTP. See + // `omnigraph_policy::ResourceScope::to_branch_pair` for the mapping. + db.enforce( + omnigraph_policy::PolicyAction::SchemaApply, + &omnigraph_policy::ResourceScope::TargetBranch("main".to_string()), + actor, + )?; + + acquire_schema_apply_lock(db).await?; + let result = apply_schema_with_lock(db, desired_schema_source, options).await; + let release_result = release_schema_apply_lock(db).await; + match (result, release_result) { + (Ok(result), Ok(())) => Ok(result), + (Ok(_), Err(err)) => Err(err), + (Err(err), Ok(())) => Err(err), + (Err(err), Err(_)) => Err(err), + } +} + +pub(super) async fn apply_schema_with_lock( + db: &Omnigraph, + desired_schema_source: &str, + options: SchemaApplyOptions, +) -> Result { db.ensure_schema_state_valid().await?; let branches = db.coordinator.read().await.all_branches().await?; - // Skip `main` and internal system branches (the schema-apply lock branch, - // the cluster-wide schema-apply serializer). Legacy `__run__*` staging - // branches were swept off `__manifest` by the v2→v3 migration that runs in - // `Omnigraph::open(ReadWrite)` before this check (MR-770), so they no - // longer appear here. + // Skip `main` and internal system branches. 
The schema-apply lock branch + // is excluded because it is the cluster-wide schema-apply serializer. + // `__run__*` branches are no longer created; the filter remains as + // defense-in-depth for legacy graphs with leftover staging branches. + // A future production sweep will let this guard go. let blocking_branches = branches .into_iter() .filter(|branch| branch != "main" && !is_internal_system_branch(branch)) @@ -90,95 +123,6 @@ async fn plan_schema_for_apply( .unwrap_or_else(|| "unsupported schema migration plan".to_string()); return Err(OmniError::manifest(message)); } - - let mut desired_catalog = build_catalog_from_ir(&desired_ir)?; - fixup_blob_schemas(&mut desired_catalog); - Ok(PlannedSchemaApply { - plan, - desired_ir, - desired_catalog, - }) -} - -pub(super) async fn preview_schema_apply( - db: &Omnigraph, - desired_schema_source: &str, - options: SchemaApplyOptions, -) -> Result { - let planned = plan_schema_for_apply(db, desired_schema_source, options).await?; - Ok(SchemaApplyPreview { - plan: planned.plan, - catalog: planned.desired_catalog, - }) -} - -pub(super) async fn apply_schema( - db: &Omnigraph, - desired_schema_source: &str, - options: SchemaApplyOptions, - actor: Option<&str>, - validate_catalog: F, -) -> Result -where - F: FnOnce(&Catalog) -> Result<()>, -{ - // Engine-layer policy gate (MR-722 chassis core). - // - // Fires BEFORE acquiring the schema-apply lock or doing any other - // work. When no PolicyChecker is installed this is a no-op and - // the apply path behaves exactly as it did before MR-722. When - // a PolicyChecker IS installed and the actor is None, this is a - // hard error — see Omnigraph::enforce's docstring for the - // forget-the-actor-footgun reasoning. - // - // Scope is TargetBranch("main") to match the HTTP-layer convention - // for SchemaApply: branch=None, target_branch=Some("main"). 
Cedar - // policies in the wild use `target_branch_scope: protected` to - // gate schema applies, so the engine-layer call has to set the - // target_branch shape that activates that predicate. Wrong scope - // here = silent policy mismatch with HTTP. See - // `omnigraph_policy::ResourceScope::to_branch_pair` for the mapping. - db.enforce( - omnigraph_policy::PolicyAction::SchemaApply, - &omnigraph_policy::ResourceScope::TargetBranch("main".to_string()), - actor, - )?; - - // Converge any pending recovery sidecar before planning: a table - // rewrite over sidecar-covered drift would otherwise re-plan from - // the manifest pin and orphan the drifted Phase-B commit (silently - // dropping its rows) while the stale sidecar lingers to misclassify - // against the post-apply pins. Runs before the apply's own sidecar - // exists, so the heal can never observe it. - db.heal_pending_recovery_sidecars().await?; - - acquire_schema_apply_lock(db).await?; - let result = apply_schema_with_lock(db, desired_schema_source, options, validate_catalog).await; - let release_result = release_schema_apply_lock(db).await; - match (result, release_result) { - (Ok(result), Ok(())) => Ok(result), - (Ok(_), Err(err)) => Err(err), - (Err(err), Ok(())) => Err(err), - (Err(err), Err(_)) => Err(err), - } -} - -pub(super) async fn apply_schema_with_lock( - db: &Omnigraph, - desired_schema_source: &str, - options: SchemaApplyOptions, - validate_catalog: F, -) -> Result -where - F: FnOnce(&Catalog) -> Result<()>, -{ - let planned = plan_schema_for_apply(db, desired_schema_source, options).await?; - validate_catalog(&planned.desired_catalog)?; - let PlannedSchemaApply { - plan, - desired_ir, - desired_catalog, - } = planned; if plan.steps.is_empty() { return Ok(SchemaApplyResult { supported: true, @@ -188,11 +132,15 @@ where }); } + let mut desired_catalog = build_catalog_from_ir(&desired_ir)?; + fixup_blob_schemas(&mut desired_catalog); + let snapshot = db.snapshot().await; let 
base_manifest_version = snapshot.version(); let mut added_tables = BTreeSet::new(); let mut renamed_tables = HashMap::new(); let mut rewritten_tables = BTreeSet::new(); + let mut indexed_tables = BTreeSet::new(); let mut dropped_tables = BTreeSet::new(); // Hard-drop cleanup targets: (table_key, full_dataset_uri). // Populated for DropProperty { Hard } and DropType { Hard }; the @@ -251,14 +199,14 @@ where .or_default() .insert(to.clone(), from.clone()); } - // AddConstraint is only ever an `@index` addition (every other - // added constraint plans as UnsupportedChange). It records intent - // in the desired catalog/IR; the physical index is built off the - // critical path by ensure_indices/optimize (iss-848), so the apply - // does no table work for it — a pure metadata change like the two - // metadata steps below. - SchemaMigrationStep::AddConstraint { .. } - | SchemaMigrationStep::UpdateTypeMetadata { .. } + SchemaMigrationStep::AddConstraint { + type_kind, + type_name, + .. + } => { + indexed_tables.insert(schema_table_key(*type_kind, type_name)); + } + SchemaMigrationStep::UpdateTypeMetadata { .. } | SchemaMigrationStep::UpdatePropertyMetadata { .. } => {} SchemaMigrationStep::DropProperty { type_kind, @@ -346,25 +294,25 @@ where let mut table_updates = HashMap::::new(); let mut table_tombstones = HashMap::::new(); - // Recovery sidecar: protect the per-table `stage_overwrite` + - // `commit_staged` in rewritten_tables — the only tables that advance Lance - // HEAD inline now that index building is deferred to the reconciler - // (iss-848). Each rewritten table is exactly one commit, so - // `post_commit_pin = expected + 1` is now exact (it was a loose lower bound - // when index builds added extra commits); the classifier's loose-match for - // SidecarKind::SchemaApply still accepts it. + // Recovery sidecar: protect the per-table commit_staged loop in + // rewritten_tables + indexed_tables. 
The post_commit_pin we record + // here is a lower bound (expected + 1); the classifier loose-matches + // for SidecarKind::SchemaApply because the actual N depends on how + // many indices need building. See classify_table's loose-match arm. let recovery_pins: Vec = rewritten_tables .iter() + .chain(indexed_tables.iter().filter(|t| { + !rewritten_tables.contains(*t) + && !added_tables.contains(*t) + && !renamed_tables.contains_key(*t) + })) .filter_map(|table_key| { let entry = snapshot.entry(table_key)?; Some(crate::db::manifest::SidecarTablePin { table_key: table_key.clone(), - table_path: db.storage().dataset_uri(&entry.table_path), + table_path: db.table_store.dataset_uri(&entry.table_path), expected_version: entry.table_version, post_commit_pin: entry.table_version + 1, - // SchemaApply uses the loose match, not BranchMerge's Phase-B - // confirmation — left None. - confirmed_version: None, table_branch: entry.table_branch.clone(), }) }) @@ -431,33 +379,23 @@ where // manifest publish via `commit_changes_with_actor` below. // // Schema-apply already holds the graph-wide `__schema_apply_lock__` - // sentinel branch, so these per-table acquisitions are uncontended in - // practice. They exist for symmetry with the recovery reconciler, which - // acquires the same queues before any `Dataset::restore` it issues for - // SchemaApply sidecars. - let mut schema_apply_queue_keys: Vec<(String, Option)> = recovery_pins + // sentinel branch, so under PR 1b's intermediate state these + // per-table acquisitions are uncontended. They exist for symmetry + // with future MR-870 recovery, which will need queue acquisition + // before any `Dataset::restore` it issues for SchemaApply sidecars. + let schema_apply_queue_keys: Vec<(String, Option)> = recovery_pins .iter() .map(|pin| (pin.table_key.clone(), pin.table_branch.clone())) .collect(); - // The serialization key the write-entry heal acquires before touching - // schema staging or a SchemaApply sidecar. 
Per-table keys alone don't - // cover a registration-only migration (no pins, but a sidecar and - // staging files on disk) — without this, a concurrent write's heal can - // promote this apply's staging files and publish its registrations out - // from under it. Acquired whenever a sidecar will be written, held - // through Phase D (the guards live to the end of this function). - let writes_sidecar = !(recovery_pins.is_empty() - && sidecar_registrations.is_empty() - && sidecar_tombstones.is_empty()); - if writes_sidecar { - schema_apply_queue_keys.push(crate::db::manifest::schema_apply_serial_queue_key()); - } let _schema_apply_queue_guards = db .write_queue() .acquire_many(&schema_apply_queue_keys) .await; - let recovery_handle = if !writes_sidecar { + let recovery_handle = if recovery_pins.is_empty() + && sidecar_registrations.is_empty() + && sidecar_tombstones.is_empty() + { None } else { // `branch=None` because schema_apply publishes against main — @@ -486,14 +424,12 @@ where for table_key in &added_tables { let table_path = table_path_for_table_key(table_key)?; - let dataset_uri = db.storage().dataset_uri(&table_path); + let dataset_uri = db.table_store.dataset_uri(&table_path); let schema = schema_for_table_key(&desired_catalog, table_key)?; - let ds = - SnapshotHandle::new(TableStore::create_empty_dataset(&dataset_uri, &schema).await?); - // Indexes for the new table are materialized off the critical path by - // ensure_indices/optimize (iss-848); a 0-row table is never trainable - // anyway. The @index intent is recorded in the persisted catalog/IR. 
- let state = db.storage().table_state(&dataset_uri, &ds).await?; + let mut ds = TableStore::create_empty_dataset(&dataset_uri, &schema).await?; + db.build_indices_on_dataset_for_catalog(&desired_catalog, table_key, &mut ds) + .await?; + let state = db.table_store.table_state(&dataset_uri, &ds).await?; table_registrations.insert(table_key.clone(), table_path); table_updates.insert( table_key.clone(), @@ -515,10 +451,7 @@ where )) })?; ensure_snapshot_entry_head_matches(db, source_entry).await?; - let source_ds = db - .storage() - .open_snapshot_at_table(&snapshot, source_table_key) - .await?; + let source_ds = snapshot.open(source_table_key).await?; let current_catalog = db.catalog(); let batch = batch_for_schema_apply_rewrite( db, @@ -531,10 +464,11 @@ where ) .await?; let table_path = table_path_for_table_key(target_table_key)?; - let dataset_uri = db.storage().dataset_uri(&table_path); - let target_ds = SnapshotHandle::new(TableStore::write_dataset(&dataset_uri, batch).await?); - // Indexes on the renamed table are reconciled later (iss-848). 
- let state = db.storage().table_state(&dataset_uri, &target_ds).await?; + let dataset_uri = db.table_store.dataset_uri(&table_path); + let mut target_ds = TableStore::write_dataset(&dataset_uri, batch).await?; + db.build_indices_on_dataset_for_catalog(&desired_catalog, target_table_key, &mut target_ds) + .await?; + let state = db.table_store.table_state(&dataset_uri, &target_ds).await?; table_registrations.insert(target_table_key.clone(), table_path); table_updates.insert( target_table_key.clone(), @@ -563,10 +497,7 @@ where )) })?; ensure_snapshot_entry_head_matches(db, entry).await?; - let source_ds = db - .storage() - .open_snapshot_at_table(&snapshot, table_key) - .await?; + let source_ds = snapshot.open(table_key).await?; let current_catalog = db.catalog(); let batch = batch_for_schema_apply_rewrite( db, @@ -578,23 +509,37 @@ where property_renames.get(table_key), ) .await?; - let dataset_uri = db.storage().dataset_uri(&entry.table_path); - // Pass `entry.table_branch.as_deref()` (not `None`) for - // consistency with the indexed_tables block below. Schema - // apply runs under `__schema_apply_lock__` which today rejects - // non-main branches, so `entry.table_branch` is expected to be - // `None`. But the defensive passthrough means a future relaxation - // of the lock-check can't quietly open the wrong HEAD here. - let existing = db - .storage() - .open_dataset_head_for_write(table_key, &dataset_uri, entry.table_branch.as_deref()) + let dataset_uri = db.table_store.dataset_uri(&entry.table_path); + // Route through stage_overwrite + commit_staged for non-empty + // batches. Lance's `InsertBuilder::execute_uncommitted` + // errors on empty data (lance-4.0.0 `src/dataset/write/insert.rs:144`), + // so the empty-rewrite case stays on `overwrite_dataset` (which + // accepts empty input). 
The empty case is rare in schema_apply + // — it only fires when the source table itself was already empty + // — and schema_apply runs under `__schema_apply_lock__` so the + // narrow inline-commit residual is bounded. + let mut target_ds = if batch.num_rows() == 0 { + TableStore::overwrite_dataset(&dataset_uri, batch).await? + } else { + // Pass `entry.table_branch.as_deref()` (not `None`) for + // consistency with the indexed_tables block below. Schema + // apply runs under `__schema_apply_lock__` which today + // rejects non-main branches, so `entry.table_branch` is + // expected to be `None`. But the defensive passthrough + // means a future relaxation of the lock-check can't quietly + // open the wrong HEAD here. + let existing = db + .table_store + .open_dataset_head_for_write(table_key, &dataset_uri, entry.table_branch.as_deref()) + .await?; + let staged = db.table_store.stage_overwrite(&existing, batch).await?; + db.table_store + .commit_staged(Arc::new(existing), staged.transaction) + .await? + }; + db.build_indices_on_dataset_for_catalog(&desired_catalog, table_key, &mut target_ds) .await?; - let staged = db.storage().stage_overwrite(&existing, batch).await?; - let target_ds = db.storage().commit_staged(existing, staged).await?; - // The rewrite drops the table's existing index coverage; it is - // restored off the critical path by optimize's optimize_indices / - // ensure_indices (iss-848). Reads scan uncovered fragments meanwhile. - let state = db.storage().table_state(&dataset_uri, &target_ds).await?; + let state = db.table_store.table_state(&dataset_uri, &target_ds).await?; table_updates.insert( table_key.clone(), crate::db::SubTableUpdate { @@ -607,12 +552,41 @@ where ); } - // Index-only changes (AddConstraint, i.e. 
adding an `@index`) are pure - // metadata: the new `@index` intent is recorded in the desired catalog/IR - // persisted below, and the physical index is materialized off the critical - // path by `ensure_indices`/`optimize` (iss-848). Schema apply touches no - // table data for them, so there is no per-table loop here and no recovery - // pin (no Lance HEAD advances). Reads stay correct meanwhile via a scan. + for table_key in &indexed_tables { + if added_tables.contains(table_key) + || renamed_tables.contains_key(table_key) + || rewritten_tables.contains(table_key) + { + continue; + } + let entry = snapshot.entry(table_key).ok_or_else(|| { + OmniError::manifest(format!( + "missing table '{}' for schema index apply", + table_key + )) + })?; + ensure_snapshot_entry_head_matches(db, entry).await?; + let dataset_uri = db.table_store.dataset_uri(&entry.table_path); + let mut ds = db + .table_store + .open_dataset_head_for_write(table_key, &dataset_uri, entry.table_branch.as_deref()) + .await?; + db.table_store + .ensure_expected_version(&ds, table_key, entry.table_version)?; + db.build_indices_on_dataset_for_catalog(&desired_catalog, table_key, &mut ds) + .await?; + let state = db.table_store.table_state(&dataset_uri, &ds).await?; + table_updates.insert( + table_key.clone(), + crate::db::SubTableUpdate { + table_key: table_key.clone(), + table_version: state.version, + table_branch: None, + row_count: state.row_count, + version_metadata: state.version_metadata, + }, + ); + } let mut manifest_changes = Vec::new(); for (table_key, table_path) in table_registrations { @@ -648,7 +622,7 @@ where // `recover_schema_state_files`: // - crash before commit → manifest unchanged; staging deleted on open // - crash after commit → manifest advanced; staging renamed on open - crate::failpoints::maybe_fail(crate::failpoints::names::SCHEMA_APPLY_BEFORE_STAGING_WRITE)?; + crate::failpoints::maybe_fail("schema_apply.before_staging_write")?; let staging_pg_uri = 
schema_source_staging_uri(&db.root_uri); db.storage @@ -656,7 +630,7 @@ where .await?; write_schema_contract_staging(&db.root_uri, db.storage.as_ref(), &desired_ir).await?; - crate::failpoints::maybe_fail(crate::failpoints::names::SCHEMA_APPLY_AFTER_STAGING_WRITE)?; + crate::failpoints::maybe_fail("schema_apply.after_staging_write")?; // `apply_schema` doesn't currently take an actor; system-attributed. let PublishedSnapshot { @@ -669,7 +643,7 @@ where .commit_changes_with_actor(&manifest_changes, None) .await?; - crate::failpoints::maybe_fail(crate::failpoints::names::SCHEMA_APPLY_AFTER_MANIFEST_COMMIT)?; + crate::failpoints::maybe_fail("schema_apply.after_manifest_commit")?; db.storage .rename_text(&staging_pg_uri, &schema_source_uri(&db.root_uri)) @@ -751,7 +725,6 @@ where async fn cleanup_dataset_old_versions(db: &Omnigraph, full_uri: &str) -> Result<()> { use chrono::Utc; use lance::dataset::cleanup::CleanupPolicy; - // forbidden-api-allow: maintenance (Hard-drop version GC) opens the dataset to run cleanup_old_versions. 
let ds = lance::Dataset::open(full_uri) .await .map_err(|e| OmniError::Lance(e.to_string()))?; @@ -851,22 +824,22 @@ pub(super) async fn ensure_snapshot_entry_head_matches( db: &Omnigraph, entry: &SubTableEntry, ) -> Result<()> { - let dataset_uri = db.storage().dataset_uri(&entry.table_path); + let dataset_uri = db.table_store.dataset_uri(&entry.table_path); let ds = db - .storage() + .table_store .open_dataset_head_for_write( &entry.table_key, &dataset_uri, entry.table_branch.as_deref(), ) .await?; - db.storage() + db.table_store .ensure_expected_version(&ds, &entry.table_key, entry.table_version) } pub(super) async fn batch_for_schema_apply_rewrite( db: &Omnigraph, - source_ds: &SnapshotHandle, + source_ds: &Dataset, source_table_key: &str, source_catalog: &Catalog, target_table_key: &str, @@ -878,11 +851,11 @@ pub(super) async fn batch_for_schema_apply_rewrite( let target_blob_properties = blob_properties_for_table_key(target_catalog, target_table_key)?; let needs_row_ids = !source_blob_properties.is_empty() || !target_blob_properties.is_empty(); let batches = if needs_row_ids { - db.storage() - .scan_with_row_id(source_ds, None, None, None, true) + db.table_store() + .scan_with(source_ds, None, None, None, true, |_| Ok(())) .await? } else { - db.storage().scan_batches(source_ds).await? + db.table_store().scan_batches(source_ds).await? }; if batches.is_empty() { return Ok(RecordBatch::new_empty(target_schema)); @@ -952,7 +925,7 @@ pub(super) async fn batch_for_schema_apply_rewrite( async fn rebuild_blob_column( _db: &Omnigraph, - source_ds: &SnapshotHandle, + source_ds: &Dataset, column_name: &str, descriptions: &StructArray, row_ids: &[u64], @@ -972,7 +945,7 @@ async fn rebuild_blob_column( let blob_files = if non_null_row_ids.is_empty() { Vec::new() } else { - Arc::new(source_ds.dataset().clone()) + Arc::new(source_ds.clone()) .take_blobs(&non_null_row_ids, column_name) .await .map_err(|e| OmniError::Lance(e.to_string()))? 
diff --git a/crates/omnigraph/src/db/omnigraph/table_ops.rs b/crates/omnigraph/src/db/omnigraph/table_ops.rs index a917150..0e89c45 100644 --- a/crates/omnigraph/src/db/omnigraph/table_ops.rs +++ b/crates/omnigraph/src/db/omnigraph/table_ops.rs @@ -21,7 +21,7 @@ pub(super) async fn graph_index_for_resolved( db.runtime_cache.graph_index(resolved, &catalog).await } -pub(super) async fn ensure_indices(db: &Omnigraph) -> Result> { +pub(super) async fn ensure_indices(db: &Omnigraph) -> Result<()> { let current_branch = db .coordinator .read() @@ -31,7 +31,7 @@ pub(super) async fn ensure_indices(db: &Omnigraph) -> Result> ensure_indices_for_branch(db, current_branch.as_deref()).await } -pub(super) async fn ensure_indices_on(db: &Omnigraph, branch: &str) -> Result> { +pub(super) async fn ensure_indices_on(db: &Omnigraph, branch: &str) -> Result<()> { let branch = normalize_branch_name(branch)?; ensure_indices_for_branch(db, branch.as_deref()).await } @@ -50,10 +50,10 @@ pub(super) async fn failpoint_publish_table_head_without_index_rebuild_for_test( .ok_or_else(|| OmniError::manifest(format!("no manifest entry for {}", table_key)))?; let full_path = format!("{}/{}", db.root_uri, entry.table_path); let ds = db - .storage() + .table_store .open_dataset_head_for_write(table_key, &full_path, table_branch) .await?; - let state = db.storage().table_state(&full_path, &ds).await?; + let state = db.table_store.table_state(&full_path, &ds).await?; let update = crate::db::SubTableUpdate { table_key: table_key.to_string(), table_version: state.version, @@ -73,16 +73,12 @@ pub(super) async fn failpoint_publish_table_head_without_index_rebuild_for_test( .await } -pub(super) async fn ensure_indices_for_branch( - db: &Omnigraph, - branch: Option<&str>, -) -> Result> { +pub(super) async fn ensure_indices_for_branch(db: &Omnigraph, branch: Option<&str>) -> Result<()> { db.ensure_schema_state_valid().await?; db.ensure_schema_apply_idle("ensure_indices").await?; let resolved = 
db.resolved_branch_target(branch).await?; let snapshot = resolved.snapshot; let mut updates = Vec::new(); - let mut pending = Vec::new(); let active_branch = resolved.branch; let catalog = db.catalog(); @@ -125,9 +121,6 @@ pub(super) async fn ensure_indices_for_branch( table_path: full_path, expected_version: entry.table_version, post_commit_pin: entry.table_version + 1, - // EnsureIndices uses the loose match (index coverage is derived - // state), not BranchMerge's Phase-B confirmation — left None. - confirmed_version: None, // Use active_branch (where commits actually land), NOT // entry.table_branch (where the table currently lives). // open_owned_dataset_for_branch_write forks a feature @@ -153,9 +146,6 @@ pub(super) async fn ensure_indices_for_branch( table_path: full_path, expected_version: entry.table_version, post_commit_pin: entry.table_version + 1, - // EnsureIndices uses the loose match (index coverage is derived - // state), not BranchMerge's Phase-B confirmation — left None. - confirmed_version: None, // Use active_branch (where commits actually land), NOT // entry.table_branch (where the table currently lives). // open_owned_dataset_for_branch_write forks a feature @@ -170,8 +160,9 @@ pub(super) async fn ensure_indices_for_branch( // that needs index work. Held across the per-table commit loop and // the manifest publish at the end of this function. Sorted-order // acquisition prevents lock-order inversion against concurrent - // multi-table writers (mutation finalize, branch_merge, the fork - // path, recovery). + // multi-table writers (mutation finalize, branch_merge, future + // MR-870 recovery). Under PR 1b's intermediate state (global server + // RwLock still in place), this acquisition is uncontended. 
let queue_keys: Vec<(String, Option)> = recovery_pins .iter() .map(|pin| (pin.table_key.clone(), pin.table_branch.clone())) @@ -218,18 +209,18 @@ pub(super) async fn ensure_indices_for_branch( } }, None => ( - db.storage() + db.table_store .open_dataset_head_for_write(&table_key, &full_path, None) .await?, None, ), }; - let row_count = db.storage().count_rows(&ds, None).await.unwrap_or(0); + let row_count = db.table_store.count_rows(&ds, None).await.unwrap_or(0); if row_count > 0 { - pending.extend(build_indices_on_dataset(db, &table_key, &mut ds).await?); + build_indices_on_dataset(db, &table_key, &mut ds).await?; } - let state = db.storage().table_state(&full_path, &ds).await?; + let state = db.table_store.table_state(&full_path, &ds).await?; if state.version != entry.table_version || resolved_branch.as_deref() != entry.table_branch.as_deref() { @@ -266,18 +257,18 @@ pub(super) async fn ensure_indices_for_branch( } }, None => ( - db.storage() + db.table_store .open_dataset_head_for_write(&table_key, &full_path, None) .await?, None, ), }; - let row_count = db.storage().count_rows(&ds, None).await.unwrap_or(0); + let row_count = db.table_store.count_rows(&ds, None).await.unwrap_or(0); if row_count > 0 { - pending.extend(build_indices_on_dataset(db, &table_key, &mut ds).await?); + build_indices_on_dataset(db, &table_key, &mut ds).await?; } - let state = db.storage().table_state(&full_path, &ds).await?; + let state = db.table_store.table_state(&full_path, &ds).await?; if state.version != entry.table_version || resolved_branch.as_deref() != entry.table_branch.as_deref() { @@ -296,7 +287,7 @@ pub(super) async fn ensure_indices_for_branch( // (one commit_staged per index built) but the manifest publish below // hasn't run. Used by // `tests/failpoints.rs::ensure_indices_phase_b_failure_recovered_on_next_open`. 
- crate::failpoints::maybe_fail(crate::failpoints::names::ENSURE_INDICES_POST_PHASE_B_PRE_MANIFEST_COMMIT)?; + crate::failpoints::maybe_fail("ensure_indices.post_phase_b_pre_manifest_commit")?; if !updates.is_empty() { commit_prepared_updates_on_branch(db, branch, &updates, None).await?; @@ -316,69 +307,7 @@ pub(super) async fn ensure_indices_for_branch( } } - Ok(pending) -} - -/// The single scalar/vector index a node property receives from a one-column -/// `@index`/`@key` declaration, or `None` when the property type is not -/// indexable here (a list column or `Blob`). -/// -/// Shared by `build_indices_on_dataset_for_catalog` (which builds the index) -/// and `needs_index_work_node` (which checks coverage to decide recovery- -/// sidecar pinning) so the two cannot drift: an enum or orderable scalar the -/// builder gives a BTREE must also be reported as "needs work" until that -/// BTREE exists, or the HEAD-advancing build would run without sidecar cover. -#[derive(Clone, Copy, PartialEq, Eq, Debug)] -enum NodePropIndexKind { - Btree, - Fts, - Vector, -} - -fn node_prop_index_kind(prop_type: &PropType) -> Option { - if prop_type.list { - return None; - } - // Enums are physically `String` but filtered by equality, so they take a - // scalar BTREE, not an FTS inverted index (Lance never consults an inverted - // index for `=`/range). Free-text Strings keep FTS for - // `search()`/`match_text`/`bm25`. 
- let is_enum = prop_type.enum_values.is_some(); - match prop_type.scalar { - ScalarType::String if !is_enum => Some(NodePropIndexKind::Fts), - ScalarType::Vector(_) => Some(NodePropIndexKind::Vector), - ScalarType::String - | ScalarType::DateTime - | ScalarType::Date - | ScalarType::I32 - | ScalarType::I64 - | ScalarType::U32 - | ScalarType::U64 - | ScalarType::F32 - | ScalarType::F64 - | ScalarType::Bool => Some(NodePropIndexKind::Btree), - ScalarType::Blob => None, - } -} - -/// Whether a vector column currently has at least one non-null vector — the -/// minimum for Lance IVF k-means to train (the `ivf_flat(1)` index we build -/// needs >=1 vector). Used identically by `needs_index_work_node` (so an -/// untrainable column is not pinned for recovery — avoiding a zero-commit pin -/// that would roll back a sibling's index work) and by the vector build arm (so -/// `create_vector_index` is only attempted when it can succeed, keeping its -/// genuine errors fatal instead of swallowed as pending). If index params -/// become size-aware (dev-graph iss-687), this threshold moves with them. -async fn vector_column_trainable( - db: &Omnigraph, - ds: &SnapshotHandle, - column: &str, -) -> Result { - Ok(db - .storage() - .count_rows(ds, Some(format!("{column} IS NOT NULL"))) - .await? - > 0) + Ok(()) } /// Returns true if the node table is missing at least one declared @@ -389,13 +318,12 @@ async fn vector_column_trainable( /// would force `NoMovement` classification on recovery and trigger the /// all-or-nothing rollback of sibling tables' legitimate index work). /// -/// Per `build_indices_on_dataset_for_catalog`, nodes get BTree (id) plus, for -/// each one-column `@index`/`@key` property, the index `node_prop_index_kind` -/// assigns: a scalar BTREE for enums and orderable scalars -/// (DateTime/Date/numeric/Bool), FTS for free-text Strings, or a Vector index. -/// Edges get BTree only (id, src, dst). 
This helper and the builder share -/// `node_prop_index_kind` so they cannot drift — see its doc comment. -pub(super) async fn needs_index_work_node( +/// Per the actual `build_indices_on_dataset_for_catalog` implementation +/// (this file, ~line 419-491), nodes get BTree (id) + per-prop FTS +/// (@search String) + per-prop Vector indices; edges get BTree only +/// (id, src, dst). The two helpers mirror that asymmetry — see the +/// `needs_index_work_edge` doc comment. +async fn needs_index_work_node( db: &Omnigraph, type_name: &str, table_key: &str, @@ -403,7 +331,7 @@ pub(super) async fn needs_index_work_node( table_branch: Option<&str>, ) -> Result { let ds = db - .storage() + .table_store .open_dataset_head_for_write(table_key, full_path, table_branch) .await?; // Empty tables are skipped by the ensure_indices loop, so they must @@ -413,10 +341,10 @@ pub(super) async fn needs_index_work_node( // Errors from count_rows are propagated: silently treating them as // "0 rows" risks skipping a table that is actually about to be // modified. - if db.storage().count_rows(&ds, None).await? == 0 { + if db.table_store.count_rows(&ds, None).await? == 0 { return Ok(false); } - if !db.storage().has_btree_index(&ds, "id").await? { + if !db.table_store.has_btree_index(&ds, "id").await? { return Ok(true); } let catalog = db.catalog(); @@ -431,30 +359,14 @@ pub(super) async fn needs_index_work_node( let Some(prop_type) = node_type.properties.get(prop_name) else { continue; }; - match node_prop_index_kind(prop_type) { - Some(NodePropIndexKind::Fts) => { - if !db.storage().has_fts_index(&ds, prop_name).await? { - return Ok(true); - } + if matches!(prop_type.scalar, ScalarType::String) && !prop_type.list { + if !db.table_store.has_fts_index(&ds, prop_name).await? { + return Ok(true); } - Some(NodePropIndexKind::Vector) => { - // Only count a missing vector index as buildable *work* when the - // column is trainable (>=1 non-null vector). 
An untrainable - // column would defer in the build and commit nothing; pinning it - // for recovery would be a zero-commit pin that classifies - // NoMovement and rolls back a sibling table's index work. - if !db.storage().has_vector_index(&ds, prop_name).await? - && vector_column_trainable(db, &ds, prop_name).await? - { - return Ok(true); - } + } else if matches!(prop_type.scalar, ScalarType::Vector(_)) && !prop_type.list { + if !db.table_store.has_vector_index(&ds, prop_name).await? { + return Ok(true); } - Some(NodePropIndexKind::Btree) => { - if !db.storage().has_btree_index(&ds, prop_name).await? { - return Ok(true); - } - } - None => {} } } Ok(false) @@ -470,70 +382,36 @@ pub(super) async fn needs_index_work_node( /// /// Empty edge tables are skipped by the ensure_indices loop the same /// way node tables are; see `needs_index_work_node`. -pub(super) async fn needs_index_work_edge( +async fn needs_index_work_edge( db: &Omnigraph, table_key: &str, full_path: &str, table_branch: Option<&str>, ) -> Result { let ds = db - .storage() + .table_store .open_dataset_head_for_write(table_key, full_path, table_branch) .await?; - if db.storage().count_rows(&ds, None).await? == 0 { + if db.table_store.count_rows(&ds, None).await? == 0 { return Ok(false); } - Ok(!db.storage().has_btree_index(&ds, "id").await? - || !db.storage().has_btree_index(&ds, "src").await? - || !db.storage().has_btree_index(&ds, "dst").await?) -} - -/// Result of opening a sub-table for mutation. `handle` is `None` only when a -/// non-strict (Insert/Merge) op on the WriteTxn's own branch skipped the -/// accumulation open (RFC-013 step 3b collapse #1) — there the caller needs just -/// `expected_version`. It is ALWAYS `Some` for strict ops, the fork path, and -/// every no-`txn` caller (branch merge), which use [`Self::require_handle`]. -#[derive(Debug)] -pub(crate) struct OpenedForMutation { - /// The opened dataset, or `None` on the non-strict-txn open-skip path. 
- pub(crate) handle: Option, - /// The publisher's CAS fence: the opened handle's version, or — when the open - /// was skipped — the pinned base entry's version (equal absent uncovered drift). - pub(crate) expected_version: u64, - pub(crate) full_path: String, - pub(crate) table_branch: Option, -} - -impl OpenedForMutation { - /// Destructure for a caller that REQUIRES the handle (strict ops, the fork - /// path, every no-`txn` caller). The `None` skip fires solely on the - /// non-strict `txn` path, which these callers are not — so a panic here means - /// a future change broke that contract, named by `ctx`. - pub(crate) fn require_handle(self, ctx: &str) -> (SnapshotHandle, String, Option) { - let handle = self.handle.unwrap_or_else(|| { - panic!("{ctx}: open_for_mutation returned no handle on a path that requires one") - }); - (handle, self.full_path, self.table_branch) - } + Ok(!db.table_store.has_btree_index(&ds, "id").await? + || !db.table_store.has_btree_index(&ds, "src").await? + || !db.table_store.has_btree_index(&ds, "dst").await?) } pub(super) async fn open_for_mutation( db: &Omnigraph, table_key: &str, op_kind: crate::db::MutationOpKind, -) -> Result { +) -> Result<(Dataset, String, Option)> { let current_branch = db .coordinator .read() .await .current_branch() .map(str::to_string); - // `open_for_mutation` is the no-txn entry (branch merge). Passing `None` - // keeps the exact pre-WriteTxn code path (a fresh `resolved_branch_target` - // that re-validates the schema). With `txn = None` the non-strict early-skip - // in `open_for_mutation_on_branch` never fires, so this always returns a - // `Some(handle)` for its callers. - open_for_mutation_on_branch(db, current_branch.as_deref(), table_key, op_kind, None).await + open_for_mutation_on_branch(db, current_branch.as_deref(), table_key, op_kind).await } /// Open a sub-table for mutation. 
The `op_kind` selects the strict-vs-relaxed @@ -547,85 +425,25 @@ pub(super) async fn open_for_mutation_on_branch( branch: Option<&str>, table_key: &str, op_kind: crate::db::MutationOpKind, - txn: Option<&crate::db::WriteTxn>, -) -> Result { +) -> Result<(Dataset, String, Option)> { db.ensure_schema_apply_not_locked("write").await?; - // Source the resolved (snapshot, branch). With a `WriteTxn` the contract was - // validated once at capture, so use the pinned base + resolved branch instead - // of `resolved_branch_target` (which re-runs `ensure_schema_state_valid`). The - // base is the same fresh per-branch manifest read the no-txn path would have - // resolved β€” only the redundant schema re-validation is dropped. Without a txn - // this is byte-identical to the prior `resolved_branch_target` call. - let (snapshot, resolved_branch) = match txn { - Some(txn) => (txn.base.clone(), txn.branch.clone()), - None => { - let resolved = db.resolved_branch_target(branch).await?; - (resolved.snapshot, resolved.branch) - } - }; - let entry = snapshot + let resolved = db.resolved_branch_target(branch).await?; + let entry = resolved + .snapshot .entry(table_key) .ok_or_else(|| OmniError::manifest(format!("no manifest entry for {}", table_key)))?; let full_path = format!("{}/{}", db.root_uri, entry.table_path); - - // Collapse #1 (RFC-013 step 3b): a non-strict op (Insert/Merge) on the txn's - // own branch needs no dataset open for ACCUMULATION β€” the only thing the - // caller reads from this handle on the non-strict path is `.version()` (the - // publisher's CAS fence), which is exactly the pinned base version. The base - // already validated the schema contract once, and the staging reopen - // (`reopen_for_mutation`) plus the publisher CAS in `commit_all` are the real - // drift guards. So skip `open_dataset_head_for_write` entirely and source the - // expected version from the pinned entry. 
- // - // Gated on `txn.is_some()`: without a txn (branch merge's `open_for_mutation`) - // every arm below is byte-identical to before. STRICT ops (Update/Delete/ - // SchemaRewrite) always open live HEAD + run `ensure_expected_version` - // (read-modify-write SI), and any write that must FORK (the table isn't yet on - // the resolved branch) opens too (the fork is a real Lance state advance the - // manifest snapshot can't substitute for). - if txn.is_some() && !op_kind.strict_pre_stage_version_check() { - match resolved_branch.as_deref() { - // Non-strict, table already on the active branch → no open, no fork. - Some(active_branch) if entry.table_branch.as_deref() == Some(active_branch) => { - return Ok(OpenedForMutation { - handle: None, - expected_version: entry.table_version, - full_path, - table_branch: Some(active_branch.to_string()), - }); - } - // Main branch, non-strict → no open. (Main never forks.) - None => { - return Ok(OpenedForMutation { - handle: None, - expected_version: entry.table_version, - full_path, - table_branch: None, - }); - } - // Non-strict but the table isn't on the active branch yet — falls - // through to fork below.
- Some(_) => {} - } - } - - match resolved_branch.as_deref() { + match resolved.branch.as_deref() { None => { let ds = db - .storage() + .table_store .open_dataset_head_for_write(table_key, &full_path, None) .await?; if op_kind.strict_pre_stage_version_check() { - db.storage() + db.table_store .ensure_expected_version(&ds, table_key, entry.table_version)?; } - let version = ds.version(); - Ok(OpenedForMutation { - handle: Some(ds), - expected_version: version, - full_path, - table_branch: None, - }) + Ok((ds, full_path, None)) } Some(active_branch) => { let (ds, table_branch) = open_owned_dataset_for_branch_write( @@ -638,13 +456,7 @@ pub(super) async fn open_for_mutation_on_branch( op_kind, ) .await?; - let version = ds.version(); - Ok(OpenedForMutation { - handle: Some(ds), - expected_version: version, - full_path, - table_branch, - }) + Ok((ds, full_path, table_branch)) } } } @@ -657,44 +469,22 @@ pub(super) async fn open_owned_dataset_for_branch_write( entry_version: u64, active_branch: &str, op_kind: crate::db::MutationOpKind, -) -> Result<(SnapshotHandle, Option)> { +) -> Result<(Dataset, Option)> { match entry_branch { Some(branch) if branch == active_branch => { let ds = db - .storage() + .table_store .open_dataset_head_for_write(table_key, full_path, Some(active_branch)) .await?; if op_kind.strict_pre_stage_version_check() { - db.storage() + db.table_store .ensure_expected_version(&ds, table_key, entry_version)?; } Ok((ds, Some(active_branch.to_string()))) } source_branch => { - crate::failpoints::maybe_fail(crate::failpoints::names::FORK_BEFORE_CLASSIFY)?; - // Authority check before forking: re-read the live manifest. If this - // table is already forked on active_branch, a concurrent first-write - // won the race and our snapshot is stale — that is a retryable - // conflict, not an orphan. (A zombie fork is never in the manifest, - // so this only fires for a live concurrent fork.)
- let live = db.snapshot_for_branch(Some(active_branch)).await?; - if let Some(entry) = live.entry(table_key) { - if entry.table_branch.as_deref() == Some(active_branch) { - return Err(OmniError::manifest_expected_version_mismatch( - table_key, - entry_version, - entry.table_version, - )); - } - } - // The fork advances Lance state before the manifest publish. The - // caller holds the per-(table, active_branch) write queue from - // before this fork through the publish, so a leftover ref is a - // manifest-unreferenced fork (interrupted prior fork, or - // delete+recreate), not a live in-process fork. The wrapper - // self-heals it (reclaim + re-fork); see - // `Omnigraph::fork_dataset_from_entry_state`. - db.fork_dataset_from_entry_state( + fork_dataset_from_entry_state( + db, table_key, full_path, source_branch, @@ -703,11 +493,11 @@ pub(super) async fn open_owned_dataset_for_branch_write( ) .await?; let ds = db - .storage() + .table_store .open_dataset_head_for_write(table_key, full_path, Some(active_branch)) .await?; if op_kind.strict_pre_stage_version_check() { - db.storage() + db.table_store .ensure_expected_version(&ds, table_key, entry_version)?; } Ok((ds, Some(active_branch.to_string()))) @@ -722,8 +512,8 @@ pub(super) async fn fork_dataset_from_entry_state( source_branch: Option<&str>, source_version: u64, active_branch: &str, -) -> Result> { - db.storage() +) -> Result { + db.table_store .fork_branch_from_state( full_path, source_branch, @@ -734,172 +524,6 @@ pub(super) async fn fork_dataset_from_entry_state( .await } -/// Classification of a Lance branch ref `B` on table `T` against FRESH manifest -/// authority — the single decision both fork-ref reclaim sites share: the -/// write-path reclaim ([`reclaim_orphaned_fork_and_refork`]) and the cleanup -/// reconciler (`optimize::reconcile_orphaned_branches`).
Having one classifier -/// keeps the two destructive sites from drifting (the bug history: each was -/// hardened separately and the other lagged). -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -pub(crate) enum ForkRefStatus { - /// The manifest places `T` on `B` — a legitimate fork. Never destroy. - Legitimate, - /// The manifest does not reference this fork (`T` not on `B`, or `B` absent - /// from the manifest entirely). Reclaimable. - Orphan, - /// Fresh authority could not be established (a transient read failure on a - /// live branch). Ambiguous — do not destroy; the caller retries / converges. - Indeterminate, -} - -/// Classify a fork ref from FRESH manifest authority (bypasses the coordinator -/// cache). MUST be called with the per-`(table, branch)` write queue held, so -/// the classification is stable against in-process writers for the caller's -/// critical section. Both reclaim sites map the result to their own action -/// (write path: reclaim vs retryable; cleanup: delete vs skip), but the -/// destroy-only-on-`Orphan` rule is enforced here, once. -pub(crate) async fn classify_fork_ref( - db: &Omnigraph, - table_key: &str, - branch: &str, -) -> ForkRefStatus { - // `classify.fresh_read` failpoint: simulate a transient failure of the - // fresh-authority read (no-op without the `failpoints` feature). Lets a - // test exercise the Indeterminate path — a read failure on a live branch - // must classify as Indeterminate (skip), never Orphan (destroy).
- let fresh = match crate::failpoints::maybe_fail(crate::failpoints::names::CLASSIFY_FRESH_READ) { - Ok(()) => db.fresh_snapshot_for_branch(Some(branch)).await, - Err(injected) => Err(injected), - }; - match fresh { - Ok(snap) => { - let placed = snap - .entry(table_key) - .map(|e| e.table_branch.as_deref() == Some(branch)) - .unwrap_or(false); - if placed { - ForkRefStatus::Legitimate - } else { - // Branch resolves but the manifest does not place this table on - // it β€” a manifest-unreferenced fork. - ForkRefStatus::Orphan - } - } - // Branch did not resolve. `all_branches` lists `_refs/branches/` live, so - // absent there = genuinely no such manifest branch (origin-1 orphan); - // present (or a list error) = transient read β€” never destroy on that. - Err(_) => match db.coordinator.read().await.all_branches().await { - Ok(fresh) if !fresh.iter().any(|b| b == branch) => ForkRefStatus::Orphan, - _ => ForkRefStatus::Indeterminate, - }, - } -} - -/// Reclaim a manifest-unreferenced fork and re-fork in its place. -/// -/// Reached when `fork_branch_from_state` reports `RefAlreadyExists`. This is a -/// destructive op (it force-deletes a Lance branch ref), so it owns its own -/// safety precondition rather than trusting the caller's: it re-derives, via -/// [`classify_fork_ref`], that the manifest does not place this table on -/// `active_branch`. The caller's earlier proof may have come from the -/// coordinator's *cached* branch snapshot (`resolved_branch_target` returns -/// the cache when the handle is bound to `active_branch` β€” an embedded handle -/// on the branch, or `branch_merge`'s target swap); trusting it could -/// force-delete a fork a concurrent writer just legitimately published. Only -/// once fresh authority confirms the ref is unreferenced does it drop the ref -/// (idempotent `force_delete_branch`) and re-fork, exactly once. 
-/// -/// If fresh authority shows the table IS on `active_branch` (a legitimate -/// concurrent fork), or a second collision occurs after reclaim (a foreign- -/// process writer recreated the ref β€” the documented one-winner-CAS gap), it -/// surfaces a retryable conflict; on retry the winner's fork is visible and -/// the no-fork path runs. -pub(super) async fn reclaim_orphaned_fork_and_refork( - db: &Omnigraph, - table_key: &str, - full_path: &str, - source_branch: Option<&str>, - source_version: u64, - active_branch: &str, -) -> Result { - // Self-validate against FRESH authority before destroying anything. Only an - // Orphan is reclaimable; a Legitimate status (a concurrent writer published - // a real fork despite the caller's possibly-cached proof) or an - // Indeterminate one (transient read) surfaces a retryable conflict rather - // than stranding the manifest at a version the recreated ref won't have. - match classify_fork_ref(db, table_key, active_branch).await { - ForkRefStatus::Orphan => {} - ForkRefStatus::Legitimate => { - let actual = db - .fresh_snapshot_for_branch(Some(active_branch)) - .await - .ok() - .and_then(|s| s.entry(table_key).map(|e| e.table_version)) - .unwrap_or(source_version); - return Err(OmniError::manifest_expected_version_mismatch( - table_key, - source_version, - actual, - )); - } - ForkRefStatus::Indeterminate => { - return Err(OmniError::manifest_conflict(format!( - "could not verify whether branch '{active_branch}' still owns an orphaned \ - fork for table '{table_key}' because fresh manifest authority was \ - unavailable; refresh and retry" - ))); - } - } - - crate::failpoints::maybe_fail(crate::failpoints::names::FORK_BEFORE_RECLAIM)?; - db.storage() - .force_delete_branch(full_path, active_branch) - .await - .map_err(|e| { - // Lance refuses to delete a branch with dependent child branches - // even under force (RefConflict). 
Unreachable for a leaf first-write - // fork (the cleanup reconciler also drops children before parents), - // but surface it actionably if it ever happens. We match loosely on - // "referenc" rather than the exact prose, which is not a Lance API - // contract; a typed RefConflict variant through `force_delete_branch` - // is the durable follow-up. - if e.to_string().contains("referenc") { - OmniError::manifest_conflict(format!( - "branch '{active_branch}' cannot reclaim the leftover fork for \ - table '{table_key}' because it has dependent child branches; \ - delete the child branches (or run `omnigraph cleanup`) first" - )) - } else { - e - } - })?; - - match fork_dataset_from_entry_state( - db, - table_key, - full_path, - source_branch, - source_version, - active_branch, - ) - .await? - { - crate::storage_layer::ForkOutcome::Created(ds) => Ok(ds), - crate::storage_layer::ForkOutcome::RefAlreadyExists => { - let live = db.fresh_snapshot_for_branch(Some(active_branch)).await?; - let actual = live - .entry(table_key) - .map(|e| e.table_version) - .unwrap_or(source_version); - Err(OmniError::manifest_expected_version_mismatch( - table_key, - source_version, - actual, - )) - } - } -} - pub(super) async fn reopen_for_mutation( db: &Omnigraph, table_key: &str, @@ -907,10 +531,10 @@ pub(super) async fn reopen_for_mutation( table_branch: Option<&str>, expected_version: u64, op_kind: crate::db::MutationOpKind, -) -> Result { +) -> Result { db.ensure_schema_apply_not_locked("write").await?; if op_kind.strict_pre_stage_version_check() { - db.storage() + db.table_store .reopen_for_mutation(full_path, table_branch, table_key, expected_version) .await } else { @@ -923,7 +547,7 @@ pub(super) async fn reopen_for_mutation( // genuine cross-process drift as 409. See // [`crate::db::MutationOpKind`] for the policy rationale. 
let _ = expected_version; - db.storage() + db.table_store .open_dataset_head_for_write(table_key, full_path, table_branch) .await } @@ -934,31 +558,17 @@ pub(super) async fn open_dataset_at_state( table_path: &str, table_branch: Option<&str>, table_version: u64, -) -> Result { - db.storage() +) -> Result { + db.table_store .open_dataset_at_state(table_path, table_branch, table_version) .await } -/// A declared index the builder could not materialize on this pass. Today the -/// only such case is a vector (IVF) column with no trainable vectors yet -/// (KMeans needs >=1 vector), e.g. the load-before-embed window. Reported, not -/// fatal: a later `ensure_indices`/`optimize` retries once the column is -/// buildable, and reads stay correct via brute-force meanwhile. Surfacing -/// pending index *status* rather than failing the operation is the database -/// norm (Postgres `indisvalid`, LanceDB `list_indices`). -#[derive(Debug, Clone)] -pub struct PendingIndex { - pub table_key: String, - pub column: String, - pub reason: String, -} - pub(super) async fn build_indices_on_dataset( db: &Omnigraph, table_key: &str, - ds: &mut SnapshotHandle, -) -> Result> { + ds: &mut Dataset, +) -> Result<()> { let catalog = db.catalog(); build_indices_on_dataset_for_catalog(db, &catalog, table_key, ds).await } @@ -967,11 +577,10 @@ pub(super) async fn build_indices_on_dataset_for_catalog( db: &Omnigraph, catalog: &Catalog, table_key: &str, - ds: &mut SnapshotHandle, -) -> Result> { + ds: &mut Dataset, +) -> Result<()> { if let Some(type_name) = table_key.strip_prefix("node:") { - let mut pending = Vec::new(); - if !db.storage().has_btree_index(ds, "id").await? { + if !db.table_store.has_btree_index(ds, "id").await? 
{ stage_and_commit_btree(db, table_key, ds, &["id"]).await?; } @@ -990,94 +599,46 @@ pub(super) async fn build_indices_on_dataset_for_catalog( } let prop_name = &index_cols[0]; if let Some(prop_type) = node_type.properties.get(prop_name) { - match node_prop_index_kind(prop_type) { - Some(NodePropIndexKind::Fts) => { - if !db.storage().has_fts_index(ds, prop_name).await? { - stage_and_commit_inverted(db, table_key, ds, prop_name.as_str()) - .await?; - } + if matches!(prop_type.scalar, ScalarType::String) && !prop_type.list { + if !db.table_store.has_fts_index(ds, prop_name).await? { + stage_and_commit_inverted(db, table_key, ds, prop_name.as_str()) + .await?; } - Some(NodePropIndexKind::Vector) => { - if !db.storage().has_vector_index(ds, prop_name).await? { - // A vector (IVF) index trains k-means over the column, - // so it needs >=1 non-null vector (KMeans errors - // "cannot train N centroids with 0 vectors"). Precheck - // trainability: a column with no vectors yet (e.g. rows - // loaded before `embed`) is recorded as a *pending* - // index and skipped β€” deferred, not failed. The SAME - // predicate gates `needs_index_work_node`, so an - // untrainable column is never pinned for recovery (no - // zero-commit pin that would roll back a sibling - // table's index work). This function is the chokepoint - // every write path funnels through (load/mutate, schema - // apply, ensure_indices, optimize, merge), realizing - // the governing principle β€” physical index state never - // fails a logical operation. Only when trainable do we - // attempt the build, and then we PROPAGATE any error: a - // genuine I/O/manifest/Lance failure must stay fatal, - // not be hidden as pending. (Vector creation is an - // inline-commit residual until lance#6666; iss-951.) - if vector_column_trainable(db, ds, prop_name).await? 
{ - let new_snap = db - .storage_inline_residual() - .create_vector_index(ds.clone(), prop_name.as_str()) - .await - .map_err(|e| { - OmniError::Lance(format!( - "create Vector index on {}({}): {}", - table_key, prop_name, e - )) - })?; - *ds = new_snap; - } else { - tracing::info!( - target: "omnigraph::index", - table = %table_key, - column = %prop_name, - "deferring Vector index: column has no \ - trainable vectors yet", - ); - pending.push(PendingIndex { - table_key: table_key.to_string(), - column: prop_name.clone(), - reason: "column has no non-null vectors to \ - train on yet" - .to_string(), - }); - } - } + } else if matches!(prop_type.scalar, ScalarType::Vector(_)) && !prop_type.list { + if !db.table_store.has_vector_index(ds, prop_name).await? { + // Inline-commit residual: lance-4.0.0 does not + // expose `build_index_metadata_from_segments` as + // `pub`, so vector indices cannot be staged from + // outside the lance crate. Document at the call + // site; companion ticket to lance-format/lance#6658. + db.table_store + .create_vector_index(ds, prop_name.as_str()) + .await + .map_err(|e| { + OmniError::Lance(format!( + "create Vector index on {}({}): {}", + table_key, prop_name, e + )) + })?; } - // Enum + orderable scalars (DateTime/Date/numeric/Bool) - // get a BTREE so `=`, range, IN, and IS NULL are index- - // accelerated instead of degrading to a full scan. - Some(NodePropIndexKind::Btree) => { - if !db.storage().has_btree_index(ds, prop_name).await? { - stage_and_commit_btree(db, table_key, ds, &[prop_name.as_str()]) - .await?; - } - } - // List or Blob column: not indexable as a scalar here. - None => {} } } } } - return Ok(pending); + return Ok(()); } if table_key.starts_with("edge:") { - if !db.storage().has_btree_index(ds, "id").await? { + if !db.table_store.has_btree_index(ds, "id").await? { stage_and_commit_btree(db, table_key, ds, &["id"]).await?; } - if !db.storage().has_btree_index(ds, "src").await? 
{ + if !db.table_store.has_btree_index(ds, "src").await? { stage_and_commit_btree(db, table_key, ds, &["src"]).await?; } - if !db.storage().has_btree_index(ds, "dst").await? { + if !db.table_store.has_btree_index(ds, "dst").await? { stage_and_commit_btree(db, table_key, ds, &["dst"]).await?; } - // Edge tables only get BTree (id/src/dst), which build at any - // cardinality; no pending state is possible here. - return Ok(Vec::new()); + return Ok(()); } Err(OmniError::manifest(format!( @@ -1097,11 +658,11 @@ pub(super) async fn build_indices_on_dataset_for_catalog( async fn stage_and_commit_btree( db: &Omnigraph, table_key: &str, - ds: &mut SnapshotHandle, + ds: &mut Dataset, columns: &[&str], ) -> Result<()> { let staged = db - .storage() + .table_store .stage_create_btree_index(ds, columns) .await .map_err(|e| { @@ -1114,10 +675,10 @@ async fn stage_and_commit_btree( // to demonstrate that a stage-step failure in the staged-index // path (`stage_create_btree_index` succeeded; `commit_staged` not // yet called) leaves no Lance-HEAD drift on the touched table. 
- crate::failpoints::maybe_fail(crate::failpoints::names::ENSURE_INDICES_POST_STAGE_PRE_COMMIT_BTREE)?; + crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")?; let new_ds = db - .storage() - .commit_staged(ds.clone(), staged) + .table_store + .commit_staged(Arc::new(ds.clone()), staged.transaction) .await .map_err(|e| { OmniError::Lance(format!( @@ -1134,11 +695,11 @@ async fn stage_and_commit_btree( async fn stage_and_commit_inverted( db: &Omnigraph, table_key: &str, - ds: &mut SnapshotHandle, + ds: &mut Dataset, column: &str, ) -> Result<()> { let staged = db - .storage() + .table_store .stage_create_inverted_index(ds, column) .await .map_err(|e| { @@ -1148,8 +709,8 @@ async fn stage_and_commit_inverted( )) })?; let new_ds = db - .storage() - .commit_staged(ds.clone(), staged) + .table_store + .commit_staged(Arc::new(ds.clone()), staged.transaction) .await .map_err(|e| { OmniError::Lance(format!( @@ -1165,30 +726,12 @@ async fn prepare_updates_for_commit( db: &Omnigraph, branch: Option<&str>, updates: &[crate::db::SubTableUpdate], - txn: Option<&crate::db::WriteTxn>, - // Post-`commit_staged` handles handed out by `StagedMutation::commit_all` - // (RFC-013 step 3b, collapse #4): table_key β†’ the handle already open at - // its just-committed version. When a table's handle is present, the index - // build below reuses it and SKIPS the `reopen_for_mutation` open. Absent - // entries (other writers β€” schema apply, merge, ensure_indices, tests β€” - // pass `HashMap::new()`; inline-committed/delete tables are never staged) - // keep the byte-identical `reopen_for_mutation` path. - mut committed_handles: std::collections::HashMap, ) -> Result> { if updates.is_empty() { return Ok(Vec::new()); } - // With a `WriteTxn` the schema contract was validated once at capture, so - // reuse the pinned base entries (same per-branch manifest snapshot) instead - // of `snapshot_for_branch` (which re-runs `ensure_schema_state_valid`). 
Only - // the `entry(table_key).table_path` is read out of it here, identical to the - // no-txn path; the post-`commit_staged` index build below still reopens the - // dataset at its just-committed version. Without a txn, byte-identical. - let snapshot = match txn { - Some(txn) => txn.base.clone(), - None => db.snapshot_for_branch(branch).await?, - }; + let snapshot = db.snapshot_for_branch(branch).await?; let mut prepared = Vec::with_capacity(updates.len()); for update in updates { @@ -1202,41 +745,23 @@ async fn prepare_updates_for_commit( let mut prepared_update = update.clone(); if prepared_update.row_count > 0 { let full_path = format!("{}/{}", db.root_uri, entry.table_path); - // Reuse the post-`commit_staged` handle when the caller handed one - // out (collapse #4): it is already open at exactly - // `prepared_update.table_version`, so the defense-in-depth strict - // re-check `reopen_for_mutation` would run is trivially satisfied - // and the open is redundant. When no handle is present (other - // writers, or any non-staged table), fall back to the byte-identical - // `reopen_for_mutation` path. - // - // Strict version check is correct on the fallback: this runs INSIDE + // Strict version check is correct here: this runs INSIDE // the publisher commit path, after `commit_staged` already // advanced Lance HEAD to `prepared_update.table_version`. // The check is a defense-in-depth assertion that the // dataset state matches what we just committed; not the // pre-stage race the op-kind policy targets. - let mut ds = match committed_handles.remove(&prepared_update.table_key) { - Some(ds) => ds, - None => { - reopen_for_mutation( - db, - &prepared_update.table_key, - &full_path, - prepared_update.table_branch.as_deref(), - prepared_update.table_version, - crate::db::MutationOpKind::SchemaRewrite, - ) - .await? - } - }; - // Any column not yet buildable (e.g. 
a vector column whose rows - // have null embeddings) is deferred and logged inside - // build_indices; a later ensure_indices/optimize materializes it. - // The load/mutate/merge commit must not fail on it. - let _pending = - build_indices_on_dataset(db, &prepared_update.table_key, &mut ds).await?; - let state = db.storage().table_state(&full_path, &ds).await?; + let mut ds = reopen_for_mutation( + db, + &prepared_update.table_key, + &full_path, + prepared_update.table_branch.as_deref(), + prepared_update.table_version, + crate::db::MutationOpKind::SchemaRewrite, + ) + .await?; + build_indices_on_dataset(db, &prepared_update.table_key, &mut ds).await?; + let state = db.table_store.table_state(&full_path, &ds).await?; prepared_update.table_version = state.version; prepared_update.row_count = state.row_count; prepared_update.version_metadata = state.version_metadata; @@ -1368,27 +893,37 @@ pub(super) async fn commit_updates( .await .current_branch() .map(str::to_string); - let prepared = prepare_updates_for_commit( - db, - current_branch.as_deref(), - updates, - None, - std::collections::HashMap::new(), - ) - .await?; + let prepared = prepare_updates_for_commit(db, current_branch.as_deref(), updates).await?; commit_prepared_updates(db, &prepared, None).await } -pub(super) async fn commit_merge_with_actor( +pub(super) async fn commit_manifest_updates( db: &Omnigraph, updates: &[crate::db::SubTableUpdate], +) -> Result { + db.coordinator + .write() + .await + .commit_manifest_updates(updates) + .await +} + +pub(super) async fn record_merge_commit( + db: &Omnigraph, + manifest_version: u64, + parent_commit_id: &str, merged_parent_commit_id: &str, actor_id: Option<&str>, ) -> Result { db.coordinator .write() .await - .commit_merge_with_actor(updates, merged_parent_commit_id, actor_id) + .record_merge_commit( + manifest_version, + parent_commit_id, + merged_parent_commit_id, + actor_id, + ) .await .map(|snapshot_id| snapshot_id.as_str().to_string()) } @@ -1402,12 +937,9 
@@ pub(super) async fn commit_updates_on_branch_with_expected( updates: &[crate::db::SubTableUpdate], expected_table_versions: &std::collections::HashMap, actor_id: Option<&str>, - txn: Option<&crate::db::WriteTxn>, - committed_handles: std::collections::HashMap, ) -> Result { db.ensure_schema_apply_not_locked("write commit").await?; - let prepared = - prepare_updates_for_commit(db, branch, updates, txn, committed_handles).await?; + let prepared = prepare_updates_for_commit(db, branch, updates).await?; commit_prepared_updates_on_branch_with_expected( db, branch, @@ -1429,80 +961,3 @@ pub(super) async fn ensure_commit_graph_initialized(db: &Omnigraph) -> Result<() pub(super) async fn invalidate_graph_index(db: &Omnigraph) { db.runtime_cache.invalidate_all().await; } - -#[cfg(test)] -mod classify_fork_ref_tests { - //! Direct coverage of [`classify_fork_ref`] β€” the single fresh-authority - //! decision both fork-ref reclaim sites (write-path reclaim + cleanup - //! reconciler) route through. Pins each deterministic status so reverting - //! the fresh-authority logic at either site fails here. (The `Indeterminate` - //! arm needs an injected transient read and is covered under the - //! `failpoints` suite.) - use super::*; - use crate::db::Omnigraph; - use crate::loader::LoadMode; - - const SCHEMA: &str = "node Person { name: String @key }\nnode Company { name: String @key }\n"; - - /// On-disk dataset path for a node table, taken from the manifest entry - /// (the same path the engine uses) so the test forges against the real ref. 
- async fn node_path(db: &Omnigraph, branch: &str, table_key: &str) -> String { - let snap = db.snapshot_for_branch(Some(branch)).await.unwrap(); - let entry = snap.entry(table_key).unwrap(); - format!("{}/{}", db.root_uri, entry.table_path) - } - - #[tokio::test] - async fn classify_distinguishes_legitimate_unreferenced_and_ghost() { - let dir = tempfile::tempdir().unwrap(); - let db = Omnigraph::init(dir.path().to_str().unwrap(), SCHEMA) - .await - .unwrap(); - db.branch_create("feature").await.unwrap(); - - // Legitimate: a real write forks Company onto `feature`, and the - // manifest places Company on `feature`. - db.load_as( - "feature", - None, - r#"{"type":"Company","data":{"name":"Acme"}}"#, - LoadMode::Merge, - None, - ) - .await - .unwrap(); - assert_eq!( - classify_fork_ref(&db, "node:Company", "feature").await, - ForkRefStatus::Legitimate, - "a manifest-placed fork must classify as Legitimate (never destroyed)" - ); - - // Orphan (manifest-unreferenced): forge a `feature` ref on Person, which - // the manifest's `feature` snapshot still places on main. - let person = node_path(&db, "feature", "node:Person").await; - { - // forbidden-api-allow: test synthesizes a branch ref directly on the Lance dataset. - let mut ds = lance::Dataset::open(&person).await.unwrap(); - let v = ds.version().version; - ds.create_branch("feature", v, None).await.unwrap(); - } - assert_eq!( - classify_fork_ref(&db, "node:Person", "feature").await, - ForkRefStatus::Orphan, - "a ref the manifest does not place on the branch must classify as Orphan" - ); - - // Orphan (ghost): a ref for a branch the manifest does not have at all. - { - // forbidden-api-allow: test synthesizes a branch ref directly on the Lance dataset. 
- let mut ds = lance::Dataset::open(&person).await.unwrap(); - let v = ds.version().version; - ds.create_branch("ghost", v, None).await.unwrap(); - } - assert_eq!( - classify_fork_ref(&db, "node:Person", "ghost").await, - ForkRefStatus::Orphan, - "a ref for a branch absent from the manifest must classify as Orphan" - ); - } -} diff --git a/crates/omnigraph/src/db/recovery_audit.rs b/crates/omnigraph/src/db/recovery_audit.rs index 3444773..b9e8e7b 100644 --- a/crates/omnigraph/src/db/recovery_audit.rs +++ b/crates/omnigraph/src/db/recovery_audit.rs @@ -14,14 +14,15 @@ //! this change additive. //! //! Atomicity caveat: append to `_graph_commit_recoveries.lance` is -//! sequential w.r.t. the recovery commit, which RFC-013 Phase 7 records in -//! `__manifest` (folded into the recovery publish CAS via `publish_recovery_commit`). -//! A crash between the publish and this audit append leaves a recovery commit -//! with no audit row. The recovery sweep tolerates it the same way (re-entry -//! sees `NoMovement` for already-restored / already-published tables; the audit -//! append is retried, minting a fresh recovery commit). +//! sequential w.r.t. the `CommitGraph::append_commit` write. A crash +//! between the two leaves an orphan commit-graph row with no audit row. +//! Same shape as the existing `_graph_commits` + `_graph_commit_actors` +//! split; the recovery sweep tolerates it the same way (re-entry sees +//! `NoMovement` for already-restored / already-published tables; the +//! audit append is retried). 
use std::sync::Arc; +use std::time::{SystemTime, UNIX_EPOCH}; use arrow_array::{ Array, RecordBatch, RecordBatchIterator, StringArray, TimestampMicrosecondArray, @@ -42,11 +43,6 @@ const RECOVERIES_DIR: &str = "_graph_commit_recoveries.lance"; pub(crate) enum RecoveryKind { RolledForward, RolledBack, - /// The sidecar's branch no longer exists in the manifest: its tree - /// and forks are reclaimed, the pinned drift is unreachable, and the - /// sidecar is provably moot — discarded with this audit row instead - /// of wedging every heal/sweep on a dead-branch open. - OrphanedBranchDiscarded, } impl RecoveryKind { @@ -54,7 +50,6 @@ impl RecoveryKind { match self { RecoveryKind::RolledForward => "RolledForward", RecoveryKind::RolledBack => "RolledBack", - RecoveryKind::OrphanedBranchDiscarded => "OrphanedBranchDiscarded", } } @@ -62,7 +57,6 @@ impl RecoveryKind { match s { "RolledForward" => Ok(RecoveryKind::RolledForward), "RolledBack" => Ok(RecoveryKind::RolledBack), - "OrphanedBranchDiscarded" => Ok(RecoveryKind::OrphanedBranchDiscarded), other => Err(OmniError::manifest_internal(format!( "unknown recovery_kind '{}' in _graph_commit_recoveries.lance", other @@ -188,17 +182,11 @@ async fn create_recoveries_dataset(root_uri: &str) -> Result { mode: WriteMode::Create, enable_stable_row_ids: true, data_storage_version: Some(LanceFileVersion::V2_2), - auto_cleanup: None, - skip_auto_cleanup: true, ..Default::default() }; match Dataset::write(reader, &uri as &str, Some(params)).await { Ok(dataset) => Ok(dataset), - // Create-or-open idempotency — match the typed `DatasetAlreadyExists` - // variant, not the display string (not a Lance API contract). Same - // discipline as `commit_graph.rs`'s create-or-open; pinned by - // `lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists`. - Err(lance::Error::DatasetAlreadyExists { .. 
}) => Dataset::open(&uri) + Err(err) if err.to_string().contains("Dataset already exists") => Dataset::open(&uri) .await .map_err(|open_err| OmniError::Lance(open_err.to_string())), Err(err) => Err(OmniError::Lance(err.to_string())), @@ -279,6 +267,13 @@ fn decode_row(batch: &RecordBatch, row: usize) -> Result { }) } +pub(crate) fn now_micros() -> Result { + SystemTime::now() + .duration_since(UNIX_EPOCH) + .map(|d| d.as_micros() as i64) + .map_err(|e| OmniError::manifest_internal(format!("system clock before unix epoch: {}", e))) +} + #[cfg(test)] mod tests { use super::*; diff --git a/crates/omnigraph/src/db/run_registry.rs b/crates/omnigraph/src/db/run_registry.rs new file mode 100644 index 0000000..ee3d336 --- /dev/null +++ b/crates/omnigraph/src/db/run_registry.rs @@ -0,0 +1,16 @@ +// The Run state machine has been removed. Mutations now write directly +// to target tables and use the publisher's `expected_table_versions` +// CAS for cross-table OCC; `__run__` staging branches and the +// `_graph_runs.lance` state machine no longer exist. +// +// What remains is the branch-name predicate, kept as a defense-in-depth +// guard against users naming a public branch `__run__*`. A future +// production sweep of legacy `_graph_runs.lance` rows and stale +// `__run__*` branches will let this predicate (and this file) go too. + +pub(crate) const INTERNAL_RUN_BRANCH_PREFIX: &str = "__run__"; + +pub(crate) fn is_internal_run_branch(name: &str) -> bool { + name.trim_start_matches('/') + .starts_with(INTERNAL_RUN_BRANCH_PREFIX) +} diff --git a/crates/omnigraph/src/db/write_queue.rs b/crates/omnigraph/src/db/write_queue.rs index 18a14d1..1f0c53a 100644 --- a/crates/omnigraph/src/db/write_queue.rs +++ b/crates/omnigraph/src/db/write_queue.rs @@ -1,15 +1,12 @@ -//! Per-`(table_key, branch)` writer queues. +//! Per-`(table_key, branch)` writer queues — MR-686 scaffolding. //! -//! These queues are the engine's write-serialization mechanism: the server -//! 
holds the engine as a lockless `Arc` (writes are `&self`), so -//! disjoint-key writes proceed concurrently and only writes to the same -//! `(table_key, branch_ref)` serialize here. This module owns the queue -//! data structure; callers in `MutationStaging::commit_all`, `branch_merge`, -//! `schema_apply`, `ensure_indices`, `delete_where`, the fork path (first -//! write to a table on a branch — acquired before the fork, held through the -//! manifest publish), and the recovery reconciler acquire guards before any -//! per-table Lance commit. Serialization is in-process only; cross-process -//! writers on one graph remain one-winner-CAS at the manifest publish. +//! Today every server-layer write serializes on the global +//! `Arc>` in `AppState`. MR-686 replaces that with +//! per-`(table_key, branch_ref)` queues so disjoint-key writes proceed +//! concurrently. This module owns the queue data structure; callers in +//! `MutationStaging::commit_all`, `branch_merge`, `schema_apply`, +//! `ensure_indices`, `delete_where`, and the future MR-870 recovery +//! reconciler acquire guards before any per-table Lance commit. //! //! ## Why exclusive `tokio::sync::Mutex<()>` per key //! 
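Reviewer note: the module doc in the `write_queue.rs` hunk above describes one exclusive lock per `(table_key, branch_ref)`, so that writes to disjoint keys do not wait on each other. The sketch below illustrates that idea only and is not the module's code. `WriteQueues` and `lock_for` are hypothetical names, and `std::sync::Mutex` stands in for the `tokio::sync::Mutex<()>` named in the doc comment so that the example compiles with the standard library alone.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Queue key: (table_key, branch_ref). Hypothetical alias for illustration.
type QueueKey = (String, String);

/// Hypothetical registry handing out one exclusive lock per key.
/// Assumption: the real module uses `tokio::sync::Mutex<()>`; this sketch
/// swaps in `std::sync::Mutex<()>` to stay dependency-free.
#[derive(Default)]
struct WriteQueues {
    locks: Mutex<HashMap<QueueKey, Arc<Mutex<()>>>>,
}

impl WriteQueues {
    /// Return the lock for `(table_key, branch)`, creating it on first use.
    /// Callers asking for the same key share one `Arc<Mutex<()>>`; callers
    /// asking for different keys get independent locks.
    fn lock_for(&self, table_key: &str, branch: &str) -> Arc<Mutex<()>> {
        let mut map = self.locks.lock().unwrap();
        map.entry((table_key.to_string(), branch.to_string()))
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone()
    }
}
```

In this sketch a writer would hold the guard from `lock_for(..).lock()` across its per-table commit. A second writer on the same key waits for it, while a writer on another table or branch proceeds immediately.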
diff --git a/crates/omnigraph/src/embedding.rs b/crates/omnigraph/src/embedding.rs index 246836c..cfd4071 100644 --- a/crates/omnigraph/src/embedding.rs +++ b/crates/omnigraph/src/embedding.rs @@ -8,157 +8,29 @@ use tokio::time::sleep; use crate::error::{OmniError, Result}; -const DEFAULT_OPENROUTER_BASE_URL: &str = "https://openrouter.ai/api/v1"; -const DEFAULT_OPENROUTER_MODEL: &str = "openai/text-embedding-3-large"; -const DEFAULT_OPENAI_BASE_URL: &str = "https://api.openai.com/v1"; -const DEFAULT_OPENAI_MODEL: &str = "text-embedding-3-large"; +const GEMINI_EMBED_MODEL: &str = "gemini-embedding-2-preview"; const DEFAULT_GEMINI_BASE_URL: &str = "https://generativelanguage.googleapis.com/v1beta"; -const DEFAULT_GEMINI_MODEL: &str = "gemini-embedding-2"; const DEFAULT_TIMEOUT_MS: u64 = 30_000; const DEFAULT_RETRY_ATTEMPTS: usize = 4; const DEFAULT_RETRY_BACKOFF_MS: u64 = 200; -const DEFAULT_DEADLINE_MS: u64 = 60_000; -const GEMINI_QUERY_TASK_TYPE: &str = "RETRIEVAL_QUERY"; -const GEMINI_DOCUMENT_TASK_TYPE: &str = "RETRIEVAL_DOCUMENT"; +const QUERY_TASK_TYPE: &str = "RETRIEVAL_QUERY"; +const DOCUMENT_TASK_TYPE: &str = "RETRIEVAL_DOCUMENT"; -/// Which embedding API a client speaks. Each variant owns its request shape, -/// auth, and response parsing; everything else (retry, deadline, normalization, -/// tracing) is provider-independent. -#[derive(Clone, Copy, Debug, PartialEq, Eq)] -pub enum Provider { - /// OpenAI-compatible (`POST {base}/embeddings`, bearer auth, - /// `{model, input, dimensions}`). Covers OpenRouter (the default gateway), - /// OpenAI direct, and self-hosted endpoints (vLLM/Ollama/LM Studio). - OpenAiCompatible, - /// Google Gemini `generativelanguage` (`POST {base}/models/{model}:embedContent`, - /// `x-goog-api-key`), with `RETRIEVAL_QUERY` / `RETRIEVAL_DOCUMENT` task types. - Gemini, - /// Deterministic, offline. No network, no key. - Mock, -} - -/// Whether the text being embedded is a search query or a stored document. 
-/// Only Gemini distinguishes these (`RETRIEVAL_QUERY` vs `RETRIEVAL_DOCUMENT`); -/// OpenAI-compatible providers and Mock produce the identical request for both, -/// which is also the same-space property a query relies on. -#[derive(Clone, Copy, Debug, PartialEq, Eq)] -enum EmbedRole { - Query, - Document, -} - -/// The single source of truth for how embedding text becomes a vector: -/// provider + model + endpoint + key. Resolved once (from env for direct -/// engine/CLI callers, or from an applied cluster `providers.embedding` profile -/// at server boot) and shared by the query path and the offline CLI so stored -/// and query vectors stay same-space by construction. #[derive(Clone, Debug)] -pub struct EmbeddingConfig { - pub provider: Provider, - pub model: String, - pub base_url: String, - pub api_key: String, -} - -impl EmbeddingConfig { - /// Resolve from the environment. Precedence: - /// 1. `OMNIGRAPH_EMBEDDINGS_MOCK` → Mock. - /// 2. `OMNIGRAPH_EMBED_PROVIDER` (`openai-compatible`|`openai`|`gemini`|`mock`); - /// unset defaults to `openai-compatible` (OpenRouter). - /// 3. `OMNIGRAPH_EMBED_BASE_URL` else the provider default. - /// 4. `OMNIGRAPH_EMBED_MODEL` else the provider default. - /// 5. provider api-key env (`OPENROUTER_API_KEY`/`OPENAI_API_KEY`, or `GEMINI_API_KEY`). 
- pub fn from_env() -> Result { - if env_flag("OMNIGRAPH_EMBEDDINGS_MOCK") { - return Ok(Self::mock()); - } - - let alias = env_string("OMNIGRAPH_EMBED_PROVIDER"); - if alias.as_deref() == Some("mock") { - return Ok(Self::mock()); - } - - let (provider, default_base, default_model, key_envs) = provider_profile(alias.as_deref())?; - let base_url = env_string("OMNIGRAPH_EMBED_BASE_URL") - .unwrap_or_else(|| default_base.to_string()) - .trim_end_matches('/') - .to_string(); - let model = - env_string("OMNIGRAPH_EMBED_MODEL").unwrap_or_else(|| default_model.to_string()); - - let api_key = key_envs.iter().copied().find_map(env_string).ok_or_else(|| { - OmniError::manifest_internal(format!( - "{} is required for the {} embedding provider", - key_envs.join(" or "), - alias.as_deref().unwrap_or("openai-compatible") - )) - })?; - - Ok(Self { - provider, - model, - base_url, - api_key, - }) - } - - /// Build a config from explicit parts — the cluster `providers.embedding` profile path - /// (RFC-012 Phase 5). `provider`/`base_url`/`model` default exactly as - /// `from_env` does (shared `provider_profile`); `api_key` is already resolved - /// (the cluster path resolves a `${NAME}` ref before calling this). - pub fn from_parts( - provider: Option<&str>, - base_url: Option, - model: Option, +enum EmbeddingTransport { + Mock, + Gemini { api_key: String, - ) -> Result { - if provider == Some("mock") { - // An explicit `model` (e.g. a cluster `providers.embedding` profile) is - // authoritative — it is what the same-space check compares against — - // so honor it; fall back to `mock()`'s env-based model only when the - // caller supplied none. Without this, a profile's `model` is silently - // dropped and the same-space check resolves to OMNIGRAPH_EMBED_MODEL. 
- let mut config = Self::mock(); - if let Some(model) = model { - config.model = model; - } - return Ok(config); - } - let (provider, default_base, default_model, _key_envs) = provider_profile(provider)?; - let base_url = base_url - .unwrap_or_else(|| default_base.to_string()) - .trim_end_matches('/') - .to_string(); - let model = model.unwrap_or_else(|| default_model.to_string()); - Ok(Self { - provider, - model, - base_url, - api_key, - }) - } - - fn mock() -> Self { - Self { - provider: Provider::Mock, - // Honor OMNIGRAPH_EMBED_MODEL so the same-space check is exercisable - // under mock; the mock vectors themselves don't depend on the model. - model: env_string("OMNIGRAPH_EMBED_MODEL").unwrap_or_default(), - base_url: String::new(), - api_key: String::new(), - } - } + base_url: String, + http: Client, + }, } #[derive(Clone, Debug)] pub struct EmbeddingClient { - config: EmbeddingConfig, - http: Client, retry_attempts: usize, retry_backoff_ms: u64, - /// Total wall-clock budget for one embed call, across all retries - /// (`OMNIGRAPH_EMBED_DEADLINE_MS`). `0` = unbounded. - deadline_ms: u64, + transport: EmbeddingTransport, } struct EmbedCallError { @@ -186,39 +58,35 @@ struct GoogleErrorBody { message: String, } -#[derive(Debug, Deserialize)] -struct OpenAiEmbeddingResponse { - data: Vec, -} - -#[derive(Debug, Deserialize)] -struct OpenAiEmbeddingDatum { - index: usize, - embedding: Vec, -} - -#[derive(Debug, Deserialize)] -struct OpenAiErrorEnvelope { - error: OpenAiErrorBody, -} - -#[derive(Debug, Deserialize)] -struct OpenAiErrorBody { - message: String, -} - impl EmbeddingClient { pub fn from_env() -> Result { - Self::new(EmbeddingConfig::from_env()?) 
- } - - pub fn new(config: EmbeddingConfig) -> Result { let retry_attempts = parse_env_usize("OMNIGRAPH_EMBED_RETRY_ATTEMPTS", DEFAULT_RETRY_ATTEMPTS); let retry_backoff_ms = parse_env_u64("OMNIGRAPH_EMBED_RETRY_BACKOFF_MS", DEFAULT_RETRY_BACKOFF_MS); - let deadline_ms = - parse_env_u64_allow_zero("OMNIGRAPH_EMBED_DEADLINE_MS", DEFAULT_DEADLINE_MS); + + if env_flag("OMNIGRAPH_EMBEDDINGS_MOCK") { + return Ok(Self { + retry_attempts, + retry_backoff_ms, + transport: EmbeddingTransport::Mock, + }); + } + + let api_key = std::env::var("GEMINI_API_KEY") + .ok() + .map(|v| v.trim().to_string()) + .filter(|v| !v.is_empty()) + .ok_or_else(|| { + OmniError::manifest_internal( + "GEMINI_API_KEY is required when nearest() needs a string embedding", + ) + })?; + let base_url = std::env::var("OMNIGRAPH_GEMINI_BASE_URL") + .ok() + .map(|v| v.trim_end_matches('/').to_string()) + .filter(|v| !v.is_empty()) + .unwrap_or_else(|| DEFAULT_GEMINI_BASE_URL.to_string()); let timeout_ms = parse_env_u64("OMNIGRAPH_EMBED_TIMEOUT_MS", DEFAULT_TIMEOUT_MS); let http = Client::builder() .timeout(Duration::from_millis(timeout_ms)) @@ -228,36 +96,39 @@ impl EmbeddingClient { })?; Ok(Self { - config, - http, retry_attempts, retry_backoff_ms, - deadline_ms, + transport: EmbeddingTransport::Gemini { + api_key, + base_url, + http, + }, }) } - pub fn config(&self) -> &EmbeddingConfig { - &self.config - } - #[cfg(test)] fn mock_for_tests() -> Self { - Self::new(EmbeddingConfig::mock()).expect("mock client builds") + Self { + retry_attempts: DEFAULT_RETRY_ATTEMPTS, + retry_backoff_ms: DEFAULT_RETRY_BACKOFF_MS, + transport: EmbeddingTransport::Mock, + } } pub async fn embed_query_text(&self, input: &str, expected_dim: usize) -> Result> { - self.embed_text(input, expected_dim, EmbedRole::Query).await + self.embed_text(input, expected_dim, QUERY_TASK_TYPE).await } pub async fn embed_document_text(&self, input: &str, expected_dim: usize) -> Result> { - self.embed_text(input, expected_dim, 
EmbedRole::Document).await + self.embed_text(input, expected_dim, DOCUMENT_TASK_TYPE) + .await } async fn embed_text( &self, input: &str, expected_dim: usize, - role: EmbedRole, + task_type: &'static str, ) -> Result> { if expected_dim == 0 { return Err(OmniError::manifest_internal( @@ -265,71 +136,10 @@ impl EmbeddingClient { )); } - let started = std::time::Instant::now(); - let result = self - .run_with_deadline(self.embed_text_inner(input, expected_dim, role)) - .await; - let elapsed_ms = started.elapsed().as_millis() as u64; - - match &result { - Ok(_) => tracing::info!( - target: "omnigraph::embedding", - provider = ?self.config.provider, - model = %self.config.model, - dim = expected_dim, - elapsed_ms, - outcome = "ok", - "embedding succeeded" - ), - Err(err) => tracing::warn!( - target: "omnigraph::embedding", - provider = ?self.config.provider, - model = %self.config.model, - dim = expected_dim, - elapsed_ms, - outcome = "error", - error = %err, - "embedding failed" - ), - } - result - } - - /// Bound the whole embed operation (all retries + backoff) by `deadline_ms`, - /// so a degraded provider can never hang the caller for the full retry - /// envelope. Applies to every embed call (query and document). `0` = - /// unbounded. Embedding has no Lance/manifest side effects, so cancelling the - /// in-flight request future on elapse is safe. 
- async fn run_with_deadline(&self, fut: F) -> Result> - where - F: Future>>, - { - if self.deadline_ms == 0 { - return fut.await; - } - match tokio::time::timeout(Duration::from_millis(self.deadline_ms), fut).await { - Ok(res) => res, - Err(_elapsed) => Err(OmniError::manifest_internal(format!( - "embedding deadline exceeded after {} ms (provider={:?}, model={})", - self.deadline_ms, self.config.provider, self.config.model - ))), - } - } - - async fn embed_text_inner( - &self, - input: &str, - expected_dim: usize, - role: EmbedRole, - ) -> Result> { - match self.config.provider { - Provider::Mock => Ok(mock_embedding(input, expected_dim)), - Provider::Gemini => { - self.with_retry(|| self.embed_gemini_once(input, expected_dim, role)) - .await - } - Provider::OpenAiCompatible => { - self.with_retry(|| self.embed_openai_once(input, expected_dim)) + match &self.transport { + EmbeddingTransport::Mock => Ok(mock_embedding(input, expected_dim)), + EmbeddingTransport::Gemini { .. } => { + self.with_retry(|| self.embed_text_gemini_once(input, expected_dim, task_type)) .await } } @@ -350,14 +160,6 @@ impl EmbeddingClient { if !err.retryable || attempt >= max_attempt { return Err(OmniError::manifest_internal(err.message)); } - tracing::warn!( - target: "omnigraph::embedding", - provider = ?self.config.provider, - model = %self.config.model, - attempt, - error = %err.message, - "embedding attempt failed, retrying" - ); let shift = (attempt - 1).min(10) as u32; let delay = self.retry_backoff_ms.saturating_mul(1u64 << shift); sleep(Duration::from_millis(delay)).await; @@ -366,27 +168,25 @@ impl EmbeddingClient { } } - async fn embed_gemini_once( + async fn embed_text_gemini_once( &self, input: &str, expected_dim: usize, - role: EmbedRole, + task_type: &'static str, ) -> std::result::Result, EmbedCallError> { - let task_type = match role { - EmbedRole::Query => GEMINI_QUERY_TASK_TYPE, - EmbedRole::Document => GEMINI_DOCUMENT_TASK_TYPE, + let (api_key, base_url, http) = match 
&self.transport { + EmbeddingTransport::Gemini { + api_key, + base_url, + http, + } => (api_key, base_url, http), + EmbeddingTransport::Mock => unreachable!("mock transport should not call Gemini"), }; - let response = self - .http - .post(gemini_endpoint(&self.config.base_url, &self.config.model)) - .header("x-goog-api-key", &self.config.api_key) - .json(&build_gemini_request( - &self.config.model, - input, - expected_dim, - task_type, - )) + let response = http + .post(gemini_endpoint(base_url)) + .header("x-goog-api-key", api_key) + .json(&build_gemini_request(input, expected_dim, task_type)) .send() .await; let response = match response { @@ -405,7 +205,10 @@ impl EmbeddingClient { Ok(body) => body, Err(err) => { return Err(EmbedCallError { - message: format!("embedding response read failed (status {}): {}", status, err), + message: format!( + "embedding response read failed (status {}): {}", + status, err + ), retryable: status.is_server_error() || status.as_u16() == 429, }); } @@ -414,7 +217,10 @@ impl EmbeddingClient { if !status.is_success() { let message = parse_google_error_message(&body).unwrap_or(body); return Err(EmbedCallError { - message: format!("embedding request failed with status {}: {}", status, message), + message: format!( + "embedding request failed with status {}: {}", + status, message + ), retryable: status.is_server_error() || status.as_u16() == 429, }); } @@ -432,85 +238,19 @@ impl EmbeddingClient { } }) } - - async fn embed_openai_once( - &self, - input: &str, - expected_dim: usize, - ) -> std::result::Result, EmbedCallError> { - let response = self - .http - .post(format!("{}/embeddings", self.config.base_url)) - .bearer_auth(&self.config.api_key) - .json(&build_openai_request(&self.config.model, input, expected_dim)) - .send() - .await; - let response = match response { - Ok(response) => response, - Err(err) => { - let retryable = err.is_timeout() || err.is_connect() || err.is_request(); - return Err(EmbedCallError { - message: 
format!("embedding request failed: {}", err), - retryable, - }); - } - }; - - let status = response.status(); - let body = match response.text().await { - Ok(body) => body, - Err(err) => { - return Err(EmbedCallError { - message: format!("embedding response read failed (status {}): {}", status, err), - retryable: status.is_server_error() || status.as_u16() == 429, - }); - } - }; - - if !status.is_success() { - let message = parse_openai_error_message(&body).unwrap_or(body); - return Err(EmbedCallError { - message: format!("embedding request failed with status {}: {}", status, message), - retryable: status.is_server_error() || status.as_u16() == 429, - }); - } - - let parsed: OpenAiEmbeddingResponse = - serde_json::from_str(&body).map_err(|err| EmbedCallError { - message: format!("embedding response decode failed: {}", err), - retryable: false, - })?; - - // The query path embeds exactly one string, so expect one datum at index 0. - let datum = parsed - .data - .into_iter() - .find(|d| d.index == 0) - .ok_or_else(|| EmbedCallError { - message: "embedding response missing data[0]".to_string(), - retryable: false, - })?; - - validate_and_normalize_embedding(datum.embedding, expected_dim).map_err(|message| { - EmbedCallError { - message, - retryable: false, - } - }) - } } -fn gemini_endpoint(base_url: &str, model: &str) -> String { +fn gemini_endpoint(base_url: &str) -> String { format!( "{}/models/{}:embedContent", base_url.trim_end_matches('/'), - model + GEMINI_EMBED_MODEL ) } -fn build_gemini_request(model: &str, input: &str, expected_dim: usize, task_type: &str) -> Value { +fn build_gemini_request(input: &str, expected_dim: usize, task_type: &'static str) -> Value { json!({ - "model": format!("models/{}", model), + "model": format!("models/{}", GEMINI_EMBED_MODEL), "content": { "parts": [ { @@ -523,14 +263,6 @@ fn build_gemini_request(model: &str, input: &str, expected_dim: usize, task_type }) } -fn build_openai_request(model: &str, input: &str, expected_dim: 
usize) -> Value { - json!({ - "model": model, - "input": [input], - "dimensions": expected_dim, - }) -} - fn validate_and_normalize_embedding( values: Vec, expected_dim: usize, @@ -566,57 +298,6 @@ fn parse_google_error_message(body: &str) -> Option { .filter(|msg| !msg.trim().is_empty()) } -fn parse_openai_error_message(body: &str) -> Option { - serde_json::from_str::(body) - .ok() - .map(|e| e.error.message) - .filter(|msg| !msg.trim().is_empty()) -} - -/// Map a provider alias to `(provider, default base URL, default model, ordered -/// api-key envs)`. Shared by `from_env` and `from_parts` so both apply identical -/// defaults: `openai-compatible`/unset → the OpenRouter gateway, `openai` → -/// OpenAI's own host. `mock` is handled by callers before this is reached. The -/// `Provider` enum alone would collapse the two openai aliases, so the alias -/// (not the enum) determines the key-env order here. -fn provider_profile( - alias: Option<&str>, -) -> Result<(Provider, &'static str, &'static str, &'static [&'static str])> { - Ok(match alias { - None | Some("openai-compatible") => ( - Provider::OpenAiCompatible, - DEFAULT_OPENROUTER_BASE_URL, - DEFAULT_OPENROUTER_MODEL, - &["OPENROUTER_API_KEY", "OPENAI_API_KEY"], - ), - Some("openai") => ( - Provider::OpenAiCompatible, - DEFAULT_OPENAI_BASE_URL, - DEFAULT_OPENAI_MODEL, - &["OPENAI_API_KEY"], - ), - Some("gemini") => ( - Provider::Gemini, - DEFAULT_GEMINI_BASE_URL, - DEFAULT_GEMINI_MODEL, - &["GEMINI_API_KEY"], - ), - Some(other) => { - return Err(OmniError::manifest_internal(format!( - "unknown embedding provider '{}' (expected openai-compatible|openai|gemini|mock)", - other - ))); - } - }) -} - -fn env_string(name: &str) -> Option { - std::env::var(name) - .ok() - .map(|v| v.trim().to_string()) - .filter(|v| !v.is_empty()) -} - fn parse_env_usize(name: &str, default: usize) -> usize { std::env::var(name) .ok() @@ -633,15 +314,6 @@ fn parse_env_u64(name: &str, default: u64) -> u64 { .unwrap_or(default) } -/// 
Like [`parse_env_u64`] but accepts `0` as a meaningful value (the deadline -/// uses `0` for "unbounded"). -fn parse_env_u64_allow_zero(name: &str, default: u64) -> u64 { - std::env::var(name) - .ok() - .and_then(|v| v.trim().parse::().ok()) - .unwrap_or(default) -} - fn env_flag(name: &str) -> bool { std::env::var(name) .ok() @@ -723,25 +395,6 @@ mod tests { } } - // Every test that calls `EmbeddingConfig::from_env` clears the full set of - // embedding env vars first so the host environment can't leak in. - const EMBED_ENV: &[&str] = &[ - "OMNIGRAPH_EMBEDDINGS_MOCK", - "OMNIGRAPH_EMBED_PROVIDER", - "OMNIGRAPH_EMBED_BASE_URL", - "OMNIGRAPH_EMBED_MODEL", - "OPENROUTER_API_KEY", - "OPENAI_API_KEY", - "GEMINI_API_KEY", - ]; - - fn cleared_env(extra: &[(&'static str, Option<&str>)]) -> EnvGuard { - let mut vars: Vec<(&'static str, Option<&str>)> = - EMBED_ENV.iter().map(|n| (*n, None)).collect(); - vars.extend_from_slice(extra); - EnvGuard::set(&vars) - } - #[tokio::test] async fn mock_embeddings_are_deterministic() { let client = EmbeddingClient::mock_for_tests(); @@ -754,30 +407,18 @@ mod tests { } #[test] - fn gemini_request_uses_model_retrieval_query_and_dimension() { - let request = - build_gemini_request("gemini-embedding-2", "alpha", 4, GEMINI_QUERY_TASK_TYPE); - assert_eq!(request["model"], "models/gemini-embedding-2"); - assert_eq!(request["taskType"], GEMINI_QUERY_TASK_TYPE); + fn gemini_request_uses_preview_model_retrieval_query_and_dimension() { + let request = build_gemini_request("alpha", 4, QUERY_TASK_TYPE); + assert_eq!(request["model"], "models/gemini-embedding-2-preview"); + assert_eq!(request["taskType"], QUERY_TASK_TYPE); assert_eq!(request["outputDimensionality"], 4); assert_eq!(request["content"]["parts"][0]["text"], "alpha"); } #[test] fn gemini_document_request_uses_retrieval_document_task_type() { - let request = - build_gemini_request("gemini-embedding-2", "alpha", 4, GEMINI_DOCUMENT_TASK_TYPE); - assert_eq!(request["taskType"], 
GEMINI_DOCUMENT_TASK_TYPE); - } - - #[test] - fn openai_request_uses_model_input_array_and_dimensions() { - let request = build_openai_request("openai/text-embedding-3-large", "alpha", 4); - assert_eq!(request["model"], "openai/text-embedding-3-large"); - assert_eq!(request["input"][0], "alpha"); - assert!(request["input"].is_array()); - assert_eq!(request["dimensions"], 4); - assert!(request.get("taskType").is_none()); + let request = build_gemini_request("alpha", 4, DOCUMENT_TASK_TYPE); + assert_eq!(request["taskType"], DOCUMENT_TASK_TYPE); } #[test] @@ -834,202 +475,15 @@ mod tests { assert!(err.to_string().contains("do not retry")); } - #[tokio::test] - async fn run_with_deadline_aborts_slow_future() { - let mut client = EmbeddingClient::mock_for_tests(); - client.deadline_ms = 20; - let slow = async { - tokio::time::sleep(Duration::from_secs(5)).await; - Ok(vec![0.0_f32]) - }; - let err = client.run_with_deadline(slow).await.unwrap_err(); - assert!(err.to_string().contains("deadline exceeded")); - } - - #[tokio::test] - async fn run_with_deadline_passes_through_fast_future() { - let client = EmbeddingClient::mock_for_tests(); - let ok = client - .run_with_deadline(async { Ok(vec![1.0_f32, 2.0]) }) - .await - .unwrap(); - assert_eq!(ok, vec![1.0, 2.0]); - } - - #[tokio::test] - async fn run_with_deadline_zero_is_unbounded() { - let mut client = EmbeddingClient::mock_for_tests(); - client.deadline_ms = 0; - let ok = client - .run_with_deadline(async { Ok(vec![3.0_f32]) }) - .await - .unwrap(); - assert_eq!(ok, vec![3.0]); - } - #[test] #[serial] - fn from_env_defaults_to_openai_compatible_openrouter() { - let _guard = cleared_env(&[("OPENROUTER_API_KEY", Some("sk-test"))]); - let config = EmbeddingConfig::from_env().unwrap(); - assert_eq!(config.provider, Provider::OpenAiCompatible); - assert_eq!(config.base_url, DEFAULT_OPENROUTER_BASE_URL); - assert_eq!(config.model, DEFAULT_OPENROUTER_MODEL); - assert_eq!(config.api_key, "sk-test"); - } - - #[test] - 
#[serial] - fn from_env_openai_alias_uses_openai_host_not_openrouter() { - let _guard = cleared_env(&[ - ("OMNIGRAPH_EMBED_PROVIDER", Some("openai")), - ("OPENAI_API_KEY", Some("k")), + fn from_env_requires_gemini_api_key_when_not_mocking() { + let _guard = EnvGuard::set(&[ + ("OMNIGRAPH_EMBEDDINGS_MOCK", None), + ("GEMINI_API_KEY", None), ]); - let config = EmbeddingConfig::from_env().unwrap(); - assert_eq!(config.provider, Provider::OpenAiCompatible); - assert_eq!(config.base_url, DEFAULT_OPENAI_BASE_URL); // api.openai.com, not OpenRouter - assert_eq!(config.model, DEFAULT_OPENAI_MODEL); // text-embedding-3-large, no openai/ prefix - assert_eq!(config.api_key, "k"); - } - #[test] - #[serial] - fn from_env_openai_alias_prefers_openai_key_over_openrouter() { - // `openai` targets api.openai.com, so an OpenRouter key must not be sent there. - let _guard = cleared_env(&[ - ("OMNIGRAPH_EMBED_PROVIDER", Some("openai")), - ("OPENROUTER_API_KEY", Some("router")), - ("OPENAI_API_KEY", Some("openai")), - ]); - let config = EmbeddingConfig::from_env().unwrap(); - assert_eq!(config.base_url, DEFAULT_OPENAI_BASE_URL); - assert_eq!(config.api_key, "openai"); - } - - #[test] - #[serial] - fn from_env_openai_alias_errors_when_only_openrouter_key_is_set() { - let _guard = cleared_env(&[ - ("OMNIGRAPH_EMBED_PROVIDER", Some("openai")), - ("OPENROUTER_API_KEY", Some("router")), - ]); - let err = EmbeddingConfig::from_env().unwrap_err(); - assert!(err.to_string().contains("OPENAI_API_KEY"), "got: {err}"); - } - - #[test] - fn from_parts_applies_provider_defaults_and_overrides() { - let openrouter = EmbeddingConfig::from_parts(None, None, None, "k".to_string()).unwrap(); - assert_eq!(openrouter.provider, Provider::OpenAiCompatible); - assert_eq!(openrouter.base_url, DEFAULT_OPENROUTER_BASE_URL); - assert_eq!(openrouter.model, DEFAULT_OPENROUTER_MODEL); - assert_eq!(openrouter.api_key, "k"); - - let gemini = - EmbeddingConfig::from_parts(Some("gemini"), None, None, 
"g".to_string()).unwrap(); - assert_eq!(gemini.provider, Provider::Gemini); - assert_eq!(gemini.base_url, DEFAULT_GEMINI_BASE_URL); - - let overridden = EmbeddingConfig::from_parts( - Some("openai"), - Some("https://x/v1/".to_string()), - Some("custom".to_string()), - "k".to_string(), - ) - .unwrap(); - assert_eq!(overridden.base_url, "https://x/v1"); // trailing slash trimmed - assert_eq!(overridden.model, "custom"); - - let err = - EmbeddingConfig::from_parts(Some("cohere"), None, None, "k".to_string()).unwrap_err(); - assert!( - err.to_string().contains("unknown embedding provider"), - "got: {err}" - ); - } - - #[test] - #[serial] - fn from_parts_mock_honors_an_explicit_model() { - // A cluster `providers.embedding` profile that sets `kind: mock, model: X` - // must resolve to model X β€” it is what the query-time same-space check - // compares against. Env cleared so the assertion isolates the arg. - let _guard = cleared_env(&[]); - let pinned = - EmbeddingConfig::from_parts(Some("mock"), None, Some("recorded-x".to_string()), String::new()) - .unwrap(); - assert_eq!(pinned.provider, Provider::Mock); - assert_eq!(pinned.model, "recorded-x"); - // With no explicit model, mock falls back to its env-based default (here - // empty, since the env is cleared). 
- let bare = EmbeddingConfig::from_parts(Some("mock"), None, None, String::new()).unwrap(); - assert_eq!(bare.provider, Provider::Mock); - assert_eq!(bare.model, ""); - } - - #[test] - #[serial] - fn from_env_openai_compatible_prefers_openrouter_key() { - let _guard = cleared_env(&[ - ("OPENROUTER_API_KEY", Some("router")), - ("OPENAI_API_KEY", Some("openai")), - ]); - let config = EmbeddingConfig::from_env().unwrap(); - assert_eq!(config.api_key, "router"); - } - - #[test] - #[serial] - fn from_env_explicit_gemini_provider() { - let _guard = cleared_env(&[ - ("OMNIGRAPH_EMBED_PROVIDER", Some("gemini")), - ("GEMINI_API_KEY", Some("g-key")), - ]); - let config = EmbeddingConfig::from_env().unwrap(); - assert_eq!(config.provider, Provider::Gemini); - assert_eq!(config.base_url, DEFAULT_GEMINI_BASE_URL); - assert_eq!(config.model, DEFAULT_GEMINI_MODEL); - assert_eq!(config.api_key, "g-key"); - } - - #[test] - #[serial] - fn from_env_base_url_and_model_overrides_apply() { - let _guard = cleared_env(&[ - ("OMNIGRAPH_EMBED_PROVIDER", Some("openai-compatible")), - ("OMNIGRAPH_EMBED_BASE_URL", Some("https://example.test/v1/")), - ("OMNIGRAPH_EMBED_MODEL", Some("custom/model")), - ("OPENAI_API_KEY", Some("k")), - ]); - let config = EmbeddingConfig::from_env().unwrap(); - assert_eq!(config.base_url, "https://example.test/v1"); // trailing slash trimmed - assert_eq!(config.model, "custom/model"); - } - - #[test] - #[serial] - fn from_env_unknown_provider_errors() { - let _guard = cleared_env(&[("OMNIGRAPH_EMBED_PROVIDER", Some("cohere"))]); - let err = EmbeddingConfig::from_env().unwrap_err(); - assert!(err.to_string().contains("unknown embedding provider")); - } - - #[test] - #[serial] - fn from_env_errors_when_no_key_present() { - let _guard = cleared_env(&[]); - let err = EmbeddingConfig::from_env().unwrap_err(); - assert!(err.to_string().contains("OPENROUTER_API_KEY or OPENAI_API_KEY")); - } - - #[test] - #[serial] - fn from_env_mock_flag_wins() { - let _guard = 
cleared_env(&[ - ("OMNIGRAPH_EMBEDDINGS_MOCK", Some("1")), - ("OMNIGRAPH_EMBED_PROVIDER", Some("gemini")), - ]); - let config = EmbeddingConfig::from_env().unwrap(); - assert_eq!(config.provider, Provider::Mock); + let err = EmbeddingClient::from_env().unwrap_err(); + assert!(err.to_string().contains("GEMINI_API_KEY")); } } diff --git a/crates/omnigraph/src/error.rs b/crates/omnigraph/src/error.rs index a24f153..11f4da0 100644 --- a/crates/omnigraph/src/error.rs +++ b/crates/omnigraph/src/error.rs @@ -74,7 +74,7 @@ pub enum MergeConflictKind { #[derive(Debug, Error)] pub enum OmniError { #[error("{0}")] - Compiler(#[from] omnigraph_compiler::error::CompilerError), + Compiler(#[from] omnigraph_compiler::error::NanoError), #[error("storage: {0}")] Lance(String), #[error("query: {0}")] diff --git a/crates/omnigraph/src/exec/merge.rs b/crates/omnigraph/src/exec/merge.rs index c846894..2e5f32e 100644 --- a/crates/omnigraph/src/exec/merge.rs +++ b/crates/omnigraph/src/exec/merge.rs @@ -5,14 +5,7 @@ const MERGE_STAGE_DIR_ENV: &str = "OMNIGRAPH_MERGE_STAGING_DIR"; #[derive(Debug)] enum CandidateTableState { - /// Adopt the source's table state via a pointer switch or a branch fork β€” - /// no data HEAD advance, so nothing to pin for recovery. AdoptSourceState, - /// Adopt the source's state by applying a non-empty delta onto the target's - /// lineage (append new + upsert changed + delete removed). The delta is - /// pre-computed at classification so this candidate can be recovery-pinned: - /// its publish advances Lance HEAD before the manifest commit. - AdoptWithDelta(AdoptDelta), RewriteMerged(StagedMergeResult), } @@ -29,38 +22,6 @@ struct StagedMergeResult { deleted_ids: Vec, } -/// Delta for an adopted-source merge (the fast-forward / target-owns path): -/// the new + changed rows to apply onto the target's base lineage, plus the ids -/// removed on source. 
Distinct from [`StagedMergeResult`] (the three-way path), -/// which also carries a `full_staged` table for validation β€” the adopt path -/// validates against the source snapshot directly (`candidate_dataset`), so it -/// needs no `full_staged` and never builds it. -/// -/// TRANSITIONAL β€” fragment-adopt excision point. This whole row-level adopt -/// (`AdoptDelta`, [`compute_adopt_delta`], [`publish_adopted_delta`], and the -/// streaming append it drives) re-derives the source branch row-by-row because -/// today's Lance offers no fragment-level branch merge. When Lance ships -/// branch-merge/rebase ([#7263]) + UUID branch paths ([#7185]), a fast-forward -/// merge becomes a *fragment graft* β€” adopt the source table version's -/// fragments (and their already-built indexes) by reference, no rows scanned, -/// re-appended, upserted, or deleted. At that point this struct and its two -/// functions are removed wholesale; the merge collapses to ~one ref/metadata -/// op per table. Keep them self-contained so that excision stays a clean delete. -/// -/// [#7263]: https://github.com/lance-format/lance/issues/7263 -/// [#7185]: https://github.com/lance-format/lance/issues/7185 -#[derive(Debug)] -struct AdoptDelta { - /// New-on-source rows β†’ `stage_append` (a streaming `Operation::Append`, no - /// hash join). The connector's dominant case and the OOM fix: appending new - /// rows never buffers the whole delta in a full-outer hash join. - appends: Option, - /// Changed-on-source rows β†’ `stage_merge_insert` (a hash join bounded to the - /// genuinely-changed set, not the whole delta). - upserts: Option, - deleted_ids: Vec, -} - #[derive(Debug, Clone)] struct CursorRow { id: String, @@ -70,48 +31,24 @@ struct CursorRow { row_index: usize, } -impl CursorRow { - /// Compute this row's signature on demand. 
Used by the lazy adopt cursor, - /// where `signature` is left empty; the value is identical to the eager - /// `signature` field the three-way cursor populates. - fn compute_signature(&self) -> Result { - row_signature(&self.batch, self.row_index) - } -} - struct OrderedTableCursor { stream: Option>>, dataset: Option, current_batch: Option, current_row: usize, peeked: Option, - /// When false, `next_row` leaves `CursorRow::signature` empty and callers - /// compute it on demand via `CursorRow::compute_signature`. The adopt path - /// uses this: new/deleted rows never need a signature comparison and would - /// otherwise eagerly stringify their embedding for nothing. - eager_signatures: bool, } impl OrderedTableCursor { async fn from_snapshot(snapshot: &Snapshot, table_key: &str) -> Result { - Self::open(snapshot, table_key, true).await - } - - /// Like `from_snapshot` but leaves row signatures uncomputed (callers use - /// `CursorRow::compute_signature` on demand). See `eager_signatures`. 
- async fn from_snapshot_lazy(snapshot: &Snapshot, table_key: &str) -> Result { - Self::open(snapshot, table_key, false).await - } - - async fn open(snapshot: &Snapshot, table_key: &str, eager_signatures: bool) -> Result { let dataset = match snapshot.entry(table_key) { Some(_) => Some(snapshot.open(table_key).await?), None => None, }; - Self::from_dataset(dataset, eager_signatures).await + Self::from_dataset(dataset).await } - async fn from_dataset(dataset: Option, eager_signatures: bool) -> Result { + async fn from_dataset(dataset: Option) -> Result { let stream = if let Some(ds) = &dataset { Some(Box::pin( crate::table_store::TableStore::scan_stream_with( @@ -134,7 +71,6 @@ impl OrderedTableCursor { current_batch: None, current_row: 0, peeked: None, - eager_signatures, }) } @@ -161,14 +97,9 @@ impl OrderedTableCursor { let dataset = self.dataset.clone().ok_or_else(|| { OmniError::manifest("cursor row missing source dataset".to_string()) })?; - let signature = if self.eager_signatures { - row_signature(batch, row_index)? - } else { - String::new() - }; return Ok(Some(CursorRow { id: row_id_at(batch, row_index)?, - signature, + signature: row_signature(batch, row_index)?, dataset, batch: batch.clone(), row_index, @@ -327,30 +258,20 @@ fn sanitize_table_key(table_key: &str) -> String { } /// Computes the delta between base and source for an adopted-source merge. -/// Returns the new + changed rows and the ids deleted on source. -/// -/// Unchanged rows are dropped: the adopt path validates against the source -/// snapshot directly (`candidate_dataset`), so no `full_staged` table is built -/// β€” saving the O(rows) temp write that `compute_source_delta` used to produce -/// and then discard. -/// -/// TRANSITIONAL β€” removed by the fragment-adopt work (see [`AdoptDelta`]): a -/// fragment graft adopts the source's fragments by reference, so there is no -/// row-level delta to compute. 
-async fn compute_adopt_delta( +/// Returns the changed/new rows (for merge_insert) and deleted IDs (for delete). +async fn compute_source_delta( table_key: &str, catalog: &Catalog, base_snapshot: &Snapshot, source_snapshot: &Snapshot, -) -> Result> { +) -> Result> { let schema = schema_for_table_key(catalog, table_key)?; - let mut append_writer = - StagedTableWriter::new(&format!("{}_adopt_append", table_key), schema.clone())?; - let mut upsert_writer = - StagedTableWriter::new(&format!("{}_adopt_upsert", table_key), schema)?; + let mut full_writer = + StagedTableWriter::new(&format!("{}_adopt_full", table_key), schema.clone())?; + let mut delta_writer = StagedTableWriter::new(&format!("{}_adopt_delta", table_key), schema)?; let mut deleted_ids: Vec = Vec::new(); - let mut base = OrderedTableCursor::from_snapshot_lazy(base_snapshot, table_key).await?; - let mut source = OrderedTableCursor::from_snapshot_lazy(source_snapshot, table_key).await?; + let mut base = OrderedTableCursor::from_snapshot(base_snapshot, table_key).await?; + let mut source = OrderedTableCursor::from_snapshot(source_snapshot, table_key).await?; let mut needs_update = false; @@ -376,6 +297,9 @@ async fn compute_adopt_delta( None }; + let base_sig = base_row.as_ref().map(|r| r.signature.as_str()); + let source_sig = source_row.as_ref().map(|r| r.signature.as_str()); + match (&base_row, &source_row) { (Some(_), None) => { // Deleted on source @@ -383,21 +307,20 @@ async fn compute_adopt_delta( needs_update = true; } (None, Some(src)) => { - // New on source β†’ append (streaming, no hash join). No signature - // needed β€” a new id is absent from base by construction. - append_writer.push_row(src).await?; + // New on source + full_writer.push_row(src).await?; + delta_writer.push_row(src).await?; needs_update = true; } - (Some(base), Some(src)) => { - // Present on both β€” compute signatures lazily (the only case - // that needs them) to tell a changed row from an unchanged one. 
- // New/deleted rows above skip the embedding stringify entirely. - if src.compute_signature()? != base.compute_signature()? { - // Changed on source β†’ upsert. - upsert_writer.push_row(src).await?; - needs_update = true; - } - // else unchanged β€” already on the target's base lineage; drop. + (Some(_), Some(src)) if source_sig != base_sig => { + // Changed on source + full_writer.push_row(src).await?; + delta_writer.push_row(src).await?; + needs_update = true; + } + (Some(base), Some(_)) => { + // Unchanged β€” write to full (for validation), skip delta + full_writer.push_row(base).await?; } (None, None) => unreachable!(), } @@ -407,20 +330,15 @@ async fn compute_adopt_delta( return Ok(None); } - let appends = if append_writer.row_count > 0 { - Some(append_writer.finish().await?) - } else { - None - }; - let upserts = if upsert_writer.row_count > 0 { - Some(upsert_writer.finish().await?) + let delta_staged = if delta_writer.row_count > 0 { + Some(delta_writer.finish().await?) } else { None }; - Ok(Some(AdoptDelta { - appends, - upserts, + Ok(Some(StagedMergeResult { + full_staged: full_writer.finish().await?, + delta_staged, deleted_ids, })) } @@ -733,12 +651,10 @@ async fn candidate_dataset( ) -> Result> { if let Some(candidate) = candidates.get(table_key) { return match candidate { - CandidateTableState::AdoptSourceState | CandidateTableState::AdoptWithDelta(_) => { - match source_snapshot.entry(table_key) { - Some(_) => Ok(Some(source_snapshot.open(table_key).await?)), - None => Ok(None), - } - } + CandidateTableState::AdoptSourceState => match source_snapshot.entry(table_key) { + Some(_) => Ok(Some(source_snapshot.open(table_key).await?)), + None => Ok(None), + }, CandidateTableState::RewriteMerged(staged) => { Ok(Some(staged.full_staged.dataset.clone())) } @@ -754,34 +670,36 @@ fn update_unique_constraints( table_key: &str, batch: &RecordBatch, constraints: &[Vec], - seen: &mut [HashMap, String>], + seen: &mut [HashMap], conflicts: &mut Vec, ) -> 
Result<()> { for (constraint_idx, columns) in constraints.iter().enumerate() { let seen = &mut seen[constraint_idx]; - // Resolve the group's columns once. The candidate dataset always - // carries the full table schema, so a missing column is an internal - // error rather than a skip. - let group_columns = columns - .iter() - .map(|column_name| { - batch.column_by_name(column_name).cloned().ok_or_else(|| { + for row in 0..batch.num_rows() { + let mut parts = Vec::with_capacity(columns.len()); + let mut any_null = false; + for column_name in columns { + let column = batch.column_by_name(column_name).ok_or_else(|| { OmniError::manifest(format!( "table {} missing unique column '{}'", table_key, column_name )) - }) - }) - .collect::>>()?; - for row in 0..batch.num_rows() { - // Same tuple key as the intake path β€” one shared derivation in - // `crate::loader::composite_unique_key`, so the two cannot drift on - // separator or scalar conversion. Null rows are exempt. - let Some(key) = crate::loader::composite_unique_key(&group_columns, row)? else { + })?; + if column.is_null(row) { + any_null = true; + break; + } + parts.push( + array_value_to_string(column.as_ref(), row) + .map_err(|e| OmniError::Lance(e.to_string()))?, + ); + } + if any_null { continue; - }; + } + let value = parts.join("|"); let row_id = row_id_at(batch, row)?; - if let Some(first_row_id) = seen.insert(key, row_id.clone()) { + if let Some(first_row_id) = seen.insert(value.clone(), row_id.clone()) { conflicts.push(MergeConflict { table_key: table_key.to_string(), row_id: Some(row_id.clone()), @@ -924,62 +842,13 @@ fn row_id_at(batch: &RecordBatch, row: usize) -> Result { Ok(ids.value(row).to_string()) } -/// Classify a table whose target state equals base (the adopt / fast-forward -/// case). 
Returns [`CandidateTableState::AdoptWithDelta`] β€” with the delta -/// pre-computed so it can be recovery-pinned β€” when the adopt applies a -/// non-empty delta onto the target's lineage (a HEAD-advancing publish via -/// [`publish_adopted_delta`]); otherwise [`CandidateTableState::AdoptSourceState`] -/// (a pointer switch or fork, which does not advance the data HEAD). -/// -/// The HEAD-advancing subcases mirror [`publish_adopted_source_state`]: source -/// on a branch with the target either on main or owning the table. Computing the -/// delta here (rather than inside the publish) is what closes the recovery gap β€” -/// the classifier knows whether the publish will move Lance HEAD. -async fn classify_adopt( +async fn publish_adopted_source_state( target_db: &Omnigraph, catalog: &Catalog, base_snapshot: &Snapshot, source_snapshot: &Snapshot, target_snapshot: &Snapshot, table_key: &str, -) -> Result { - let Some(source_entry) = source_snapshot.entry(table_key) else { - return Ok(CandidateTableState::AdoptSourceState); - }; - let target_entry = target_snapshot.entry(table_key); - let target_active = target_db.active_branch().await; - let advances_head = match ( - target_active.as_deref(), - source_entry.table_branch.as_deref(), - ) { - // Source on a branch, target on main β€” delta applied onto main's lineage. - (None, Some(_)) => true, - // Both on branches, target owns this table β€” delta applied onto it. - (Some(target_branch), Some(_)) => { - target_entry.and_then(|e| e.table_branch.as_deref()) == Some(target_branch) - } - // Source on main (pointer switch) or target doesn't own (fork): no advance. - _ => false, - }; - if !advances_head { - return Ok(CandidateTableState::AdoptSourceState); - } - match compute_adopt_delta(table_key, catalog, base_snapshot, source_snapshot).await? 
{ - Some(delta) => Ok(CandidateTableState::AdoptWithDelta(delta)), - None => Ok(CandidateTableState::AdoptSourceState), - } -} - -/// Adopt the source's table state without applying a row delta: a pointer -/// switch (source/target share lineage) or a branch fork. The HEAD-advancing -/// delta case is classified [`CandidateTableState::AdoptWithDelta`] and -/// published by [`publish_adopted_delta`], so reaching the branch-bearing arms -/// here means the delta was empty. -async fn publish_adopted_source_state( - target_db: &Omnigraph, - source_snapshot: &Snapshot, - target_snapshot: &Snapshot, - table_key: &str, ) -> Result { let source_entry = source_snapshot .entry(table_key) @@ -1008,31 +877,44 @@ async fn publish_adopted_source_state( row_count: source_entry.row_count, version_metadata: source_entry.version_metadata.clone(), }), - // Source on branch, target on main, empty delta β€” adopt source's - // version by a pointer switch (the non-empty case is `AdoptWithDelta`). - (None, Some(_source_branch)) => Ok(crate::db::SubTableUpdate { - table_key: table_key.to_string(), - table_version: target_entry - .map(|e| e.table_version) - .unwrap_or(source_entry.table_version), - table_branch: None, - row_count: source_entry.row_count, - version_metadata: target_entry - .map(|entry| entry.version_metadata.clone()) - .unwrap_or_else(|| source_entry.version_metadata.clone()), - }), + // Source on branch, target on main β€” apply delta to preserve version metadata + (None, Some(_source_branch)) => { + let delta = + compute_source_delta(table_key, catalog, base_snapshot, source_snapshot).await?; + match delta { + Some(staged) => publish_rewritten_merge_table(target_db, table_key, &staged).await, + None => Ok(crate::db::SubTableUpdate { + table_key: table_key.to_string(), + table_version: target_entry + .map(|e| e.table_version) + .unwrap_or(source_entry.table_version), + table_branch: None, + row_count: source_entry.row_count, + version_metadata: target_entry + .map(|entry| 
entry.version_metadata.clone()) + .unwrap_or_else(|| source_entry.version_metadata.clone()), + }), + } + } // Both on branches (Some(target_branch), Some(source_branch)) => { if target_entry.and_then(|entry| entry.table_branch.as_deref()) == Some(target_branch) { - // Target already owns this table, empty delta β€” pointer switch - // onto its own lineage (the non-empty case is `AdoptWithDelta`). - Ok(crate::db::SubTableUpdate { - table_key: table_key.to_string(), - table_version: target_entry.unwrap().table_version, - table_branch: Some(target_branch.to_string()), - row_count: source_entry.row_count, - version_metadata: target_entry.unwrap().version_metadata.clone(), - }) + // Target already owns this table β€” apply delta onto its lineage + let delta = + compute_source_delta(table_key, catalog, base_snapshot, source_snapshot) + .await?; + match delta { + Some(staged) => { + publish_rewritten_merge_table(target_db, table_key, &staged).await + } + None => Ok(crate::db::SubTableUpdate { + table_key: table_key.to_string(), + table_version: target_entry.unwrap().table_version, + table_branch: Some(target_branch.to_string()), + row_count: source_entry.row_count, + version_metadata: target_entry.unwrap().version_metadata.clone(), + }), + } } else { // Target doesn't own this table yet β€” fork from source state. // This creates the target branch on the sub-table dataset. @@ -1046,7 +928,7 @@ async fn publish_adopted_source_state( target_branch, ) .await?; - let state = target_db.storage().table_state(&full_path, &ds).await?; + let state = target_db.table_store().table_state(&full_path, &ds).await?; Ok(crate::db::SubTableUpdate { table_key: table_key.to_string(), table_version: state.version, @@ -1068,13 +950,10 @@ async fn publish_rewritten_merge_table( // source onto target). The inline `delete_where` later in this // function operates on rows the rewrite chose to remove, not // user-facing predicates, so Merge is the correct policy here. 
- // `open_for_mutation` is the no-txn entry, so collapse #1's non-strict - // open-skip (gated on `txn.is_some()`) never fires here β€” the handle is - // always `Some`. - let (mut current_ds, full_path, table_branch) = target_db + let (ds, full_path, table_branch) = target_db .open_for_mutation(table_key, crate::db::MutationOpKind::Merge) - .await? - .require_handle("branch merge"); + .await?; + let mut current_ds = ds; // Phase 1: merge_insert changed/new rows (preserves _row_created_at_version for // existing rows, bumps _row_last_updated_at_version only for actually-changed rows). @@ -1086,13 +965,9 @@ async fn publish_rewritten_merge_table( // commit point, narrowed from the previous "merge_insert + delete + // index" multi-step inline-commit chain. if let Some(delta) = &staged.delta_staged { - // The staged delta dataset is a temp-dir Lance dataset used only - // to collect the rewrite batches; wrap it in a `SnapshotHandle` - // so we can route through the trait's `scan_batches_for_rewrite`. - let delta_snapshot = SnapshotHandle::new(delta.dataset.clone()); let batches: Vec = target_db - .storage() - .scan_batches_for_rewrite(&delta_snapshot) + .table_store() + .scan_batches_for_rewrite(&delta.dataset) .await? .into_iter() .filter(|batch| batch.num_rows() > 0) @@ -1107,7 +982,7 @@ async fn publish_rewritten_merge_table( .map_err(|e| OmniError::Lance(e.to_string()))? }; let staged_merge = target_db - .storage() + .table_store() .stage_merge_insert( current_ds.clone(), combined, @@ -1117,22 +992,15 @@ async fn publish_rewritten_merge_table( ) .await?; current_ds = target_db - .storage() - .commit_staged(current_ds, staged_merge) + .table_store() + .commit_staged(Arc::new(current_ds), staged_merge.transaction) .await?; } } - // Failpoint: crash after the Phase 1 merge_insert commit, before the delete. 
- // Models a partial Phase B on the three-way path — the merged constructive - // rows are on Lance HEAD but the delete has not committed and the - // achieved-version intent has not been recorded, so recovery must roll BACK. - // See tests/failpoints.rs::branch_merge_rewrite_partial_after_merge_rolls_back. - crate::failpoints::maybe_fail(crate::failpoints::names::BRANCH_MERGE_REWRITE_AFTER_MERGE_PRE_DELETE)?; - // Phase 2: delete removed rows via deletion vectors. // - // INLINE-COMMIT RESIDUAL: lance-6.0.1 does not expose a public + // INLINE-COMMIT RESIDUAL: lance-4.0.0 does not expose a public // two-phase delete API (DeleteJob is `pub(crate)` — // lance-format/lance#6658 is open with no PRs). We deliberately do // NOT introduce a `stage_delete` wrapper that would secretly @@ -1146,30 +1014,21 @@ async fn publish_rewritten_merge_table( .map(|id| format!("'{}'", id.replace('\'', "''"))) .collect(); let filter = format!("id IN ({})", escaped.join(", ")); - let (new_ds, _) = target_db - .storage_inline_residual() - .delete_where(&full_path, current_ds, &filter) + target_db + .table_store() + .delete_where(&full_path, &mut current_ds, &filter) .await?; - current_ds = new_ds; } - // Failpoint: crash after the Phase 2 delete commit, before the index build. - // Models a partial Phase B on the three-way path — constructive rows + - // deletes are on Lance HEAD but the achieved-version intent has not been - // recorded, so recovery must roll BACK (the index is reconciler-owned derived - // state, but the merge itself never reached its commit boundary). See - // tests/failpoints.rs::branch_merge_rewrite_partial_after_delete_rolls_back. - crate::failpoints::maybe_fail(crate::failpoints::names::BRANCH_MERGE_REWRITE_AFTER_DELETE_PRE_INDEX)?; - // Phase 3: rebuild indices. // // `build_indices_on_dataset` uses `stage_create_btree_index` / // `stage_create_inverted_index` + `commit_staged` for scalar // indices. 
Vector indices remain inline-commit // (`build_index_metadata_from_segments` is `pub(crate)` in lance- - // 6.0.1 — companion ticket to lance-format/lance#6666). + // 4.0.0 — companion ticket to lance-format/lance#6658). let row_count = target_db - .storage() + .table_store() .table_state(&full_path, &current_ds) .await? .row_count; @@ -1179,164 +1038,7 @@ async fn publish_rewritten_merge_table( .await?; } let final_state = target_db - .storage() - .table_state(&full_path, &current_ds) - .await?; - - Ok(crate::db::SubTableUpdate { - table_key: table_key.to_string(), - table_version: final_state.version, - table_branch, - row_count: final_state.row_count, - version_metadata: final_state.version_metadata, - }) -} - -/// Scan a staged temp table and concat its non-empty batches into the single -/// batch that `stage_append` / `stage_merge_insert` consume. Returns `None` when -/// the table has no rows (both staged primitives reject an empty batch). -async fn scan_staged_combined( - target_db: &Omnigraph, - table: &StagedTable, -) -> Result> { - crate::instrumentation::record_scan_staged_combined(); - let snapshot = SnapshotHandle::new(table.dataset.clone()); - let batches: Vec = target_db - .storage() - .scan_batches_for_rewrite(&snapshot) - .await? - .into_iter() - .filter(|batch| batch.num_rows() > 0) - .collect(); - if batches.is_empty() { - return Ok(None); - } - let combined = if batches.len() == 1 { - batches.into_iter().next().unwrap() - } else { - let schema = batches[0].schema(); - arrow_select::concat::concat_batches(&schema, &batches) - .map_err(|e| OmniError::Lance(e.to_string()))? - }; - Ok(Some(combined)) -} - -/// Apply an [`AdoptDelta`] onto the target's base lineage (the fast-forward / -/// target-owns path). 
Kept separate from `publish_rewritten_merge_table` (the -/// three-way path) because the two paths diverge: commit 3 splits this Phase 1 -/// into append (new) + merge_insert (changed), and commit 6 makes its index -/// coverage incremental β€” neither of which the three-way path takes. -/// -/// `open_for_mutation(Merge)` opens the target's own table lineage (active -/// branch is the merge target after the caller's swap), so every write lands on -/// the target and survives source-branch deletion β€” GC-safe. -/// -/// TRANSITIONAL β€” removed by the fragment-adopt work (see [`AdoptDelta`]): the -/// multi-commit append β†’ upsert β†’ delete publish here (the source of the -/// partial-Phase-B recovery window the sidecar confirmation guards) collapses to -/// a single fragment-graft commit per table, so this whole function goes away. -async fn publish_adopted_delta( - target_db: &Omnigraph, - table_key: &str, - delta: &AdoptDelta, -) -> Result { - // `open_for_mutation` is the no-txn entry, so collapse #1's non-strict - // open-skip (gated on `txn.is_some()`) never fires here β€” the handle is - // always `Some`. - let (mut current_ds, full_path, table_branch) = target_db - .open_for_mutation(table_key, crate::db::MutationOpKind::Merge) - .await? - .require_handle("branch merge"); - - // Phase 1a: append the NEW rows. `stage_append_stream` is a streaming - // `Operation::Append` β€” no hash join β€” so it never buffers the delta and - // cannot exhaust the DataFusion memory pool (the OOM fix). It streams the - // staged rows straight into the target (Lance rolls fragments at - // `max_rows_per_file`), so memory is bounded regardless of how many rows the - // connector appended β€” never the whole set in one batch. New ids are absent - // from base by construction (the ordered walk only classifies a row - // `(None, Some)` when base lacks it), so they never collide on `id`. 
Routed - // through the staged primitive so a failure between writing fragments and - // committing leaves no Lance-HEAD drift. `appends` is `Some` only when the - // staged table is non-empty (`compute_adopt_delta`). - if let Some(append_table) = &delta.appends { - let source = SnapshotHandle::new(append_table.dataset.clone()); - let staged = target_db - .storage() - .stage_append_stream(&current_ds, &source, &[]) - .await?; - current_ds = target_db - .storage() - .commit_staged(current_ds, staged) - .await?; - } - - // Failpoint: crash after the Phase 1a append commit, before the upsert. - // Models a partial Phase B — appends are on Lance HEAD but the upserts/deletes - // have not committed and the achieved-version intent has not been recorded, so - // recovery must roll BACK (not publish the appends-only state). See - // tests/failpoints.rs::branch_merge_adopt_partial_after_append_rolls_back. - crate::failpoints::maybe_fail(crate::failpoints::names::BRANCH_MERGE_ADOPT_AFTER_APPEND_PRE_UPSERT)?; - - // Phase 1b: upsert the CHANGED rows. The merge_insert hash join is now - // bounded to the genuinely-changed set, not the whole delta. It runs against - // the committed view that already includes the appends; the changed ids are - // disjoint from the appended ids (each id is classified into exactly one of - // new / changed / deleted / unchanged in the single ordered walk), so the - // join never collides with an appended row. - if let Some(upsert_table) = &delta.upserts { - if let Some(combined) = scan_staged_combined(target_db, upsert_table).await? { - let staged_merge = target_db - .storage() - .stage_merge_insert( - current_ds.clone(), - combined, - vec!["id".to_string()], - lance::dataset::WhenMatched::UpdateAll, - lance::dataset::WhenNotMatched::InsertAll, - ) - .await?; - current_ds = target_db - .storage() - .commit_staged(current_ds, staged_merge) - .await?; - } - } - - // Failpoint: crash after the Phase 1b upsert commit, before the delete. 
- // Models a partial Phase B — appends + upserts on Lance HEAD but the delete - // has not committed and the achieved-version intent has not been recorded, so - // recovery must roll BACK. See - // tests/failpoints.rs::branch_merge_adopt_partial_after_upsert_rolls_back. - crate::failpoints::maybe_fail(crate::failpoints::names::BRANCH_MERGE_ADOPT_AFTER_UPSERT_PRE_DELETE)?; - - // Phase 2: delete removed rows via deletion vectors (inline-commit residual, - // same as the three-way path until Lance ships a public two-phase delete). - if !delta.deleted_ids.is_empty() { - let escaped: Vec<String> = delta - .deleted_ids - .iter() - .map(|id| format!("'{}'", id.replace('\'', "''"))) - .collect(); - let filter = format!("id IN ({})", escaped.join(", ")); - let (new_ds, _) = target_db - .storage_inline_residual() - .delete_where(&full_path, current_ds, &filter) - .await?; - current_ds = new_ds; - } - - // Phase 4: index coverage is reconciler-owned on the adopt path. Unlike the - // three-way `RewriteMerged` path, this does NOT build indices inline: the - // appended/upserted rows are left uncovered (reads stay correct via - // brute-force — indexes are derived state, invariant 7) and - // `optimize` / `ensure_indices` folds them in. This keeps even the first - // merge into a freshly schema-applied (unindexed) table fast — no inline IVF - // retrain on the publish path — and is the row-level approximation of Layer - // 2's fragment-adopt, where the source branch's already-built indices carry - // over by reference. See docs/user/branching/merge.md. 
- let final_state = target_db - .storage() + .table_store() .table_state(&full_path, &current_ds) .await?; @@ -1376,13 +1078,6 @@ impl Omnigraph { actor_id, )?; self.ensure_schema_apply_idle("branch_merge").await?; - // Converge any pending recovery sidecar before the merge - // captures its target snapshot: the merge's publish would - // otherwise make the drifted Phase-B commit visible as an - // unattributed side effect (manifest catches up to HEAD with no - // recovery audit row) and leave the stale sidecar behind. Runs - // before the merge's own sidecar exists. - self.heal_pending_recovery_sidecars().await?; self.branch_merge_impl(source, target, actor_id).await } @@ -1392,9 +1087,9 @@ impl Omnigraph { target: &str, actor_id: Option<&str>, ) -> Result { - if is_internal_system_branch(source) || is_internal_system_branch(target) { + if is_internal_run_branch(source) || is_internal_run_branch(target) { return Err(OmniError::manifest(format!( - "branch_merge does not allow internal system refs ('{}' -> '{}')", + "branch_merge does not allow internal run refs ('{}' -> '{}')", source, target ))); } @@ -1557,16 +1252,7 @@ impl Omnigraph { continue; } if same_manifest_state(base_entry, target_entry) { - let candidate = classify_adopt( - self, - &self.catalog(), - base_snapshot, - source_snapshot, - &target_snapshot, - table_key, - ) - .await?; - candidates.insert(table_key.clone(), candidate); + candidates.insert(table_key.clone(), CandidateTableState::AdoptSourceState); continue; } @@ -1594,24 +1280,31 @@ impl Omnigraph { validate_merge_candidates(self, source_snapshot, &target_snapshot, &candidates).await?; // Recovery sidecar: protect the per-table commit_staged loop. 
- // Pin `RewriteMerged` and `AdoptWithDelta` candidates β€” both advance - // Lance HEAD before the manifest publish (RewriteMerged via - // publish_rewritten_merge_table; AdoptWithDelta via publish_adopted_delta: - // stage_append + stage_merge_insert + delete_where + index β€” multiple - // commit_staged calls per table, which the loose classification handles - // as multi-step drift). + // Pin only `RewriteMerged` candidates because they always + // advance Lance HEAD through `publish_rewritten_merge_table` + // (which runs stage_merge_insert + delete_where + index + // rebuilds β€” multiple commit_staged calls per table; loose + // classification handles the multi-step drift). // // `AdoptSourceState` candidates are NOT pinned: their publish - // (`publish_adopted_source_state`) is a pure pointer switch or a fork - // (`fork_dataset_from_entry_state` only adds a Lance branch ref), neither - // of which advances the data HEAD. Pinning them would classify as - // NoMovement and force an all-or-nothing rollback that destroys sibling - // tables' committed work. + // path is `publish_adopted_source_state`, whose subcases mostly + // don't advance Lance HEAD (pure manifest pointer switch, or + // fork via `fork_dataset_from_entry_state` which only adds a + // Lance branch ref). If those subcases were pinned, recovery + // would classify them as NoMovement and the all-or-nothing + // decision would force a rollback that destroys legitimately- + // committed work on sibling RewriteMerged tables. // - // The former gap β€” adopt subcases that applied a non-empty delta advanced - // HEAD unpinned β€” is closed: `classify_adopt` pre-computes the delta, so a - // HEAD-advancing adopt is `AdoptWithDelta` (pinned here) and an empty-delta - // adopt stays `AdoptSourceState`. + // Residual: two `AdoptSourceState` subcases (when source has a + // table_branch AND the source delta is non-empty) internally + // call `publish_rewritten_merge_table` and DO advance HEAD. 
+ // Those are not covered by this sidecar — if they fail mid- + // commit, the residual persists until the next ReadWrite open + // detects it via a subsequent ExpectedVersionMismatch from a + // later writer that touches the same table. Closing this gap + // requires pre-computing source deltas during candidate + // classification (a structural change to `CandidateTableState`) + // and is left as follow-up work. // Acquire per-(table_key, target_branch) queues for every table // touched by the merge plan. Sorted-order acquisition prevents // lock-order inversion against concurrent multi-table writers. @@ -1620,9 +1313,9 @@ impl Omnigraph { // branch_merge writes only to the target branch. // // Held across the per-table publish loop and the manifest - // commit + record_merge_commit calls below, so no concurrent - // writer to a touched (table, target_branch) can interleave - // between our commit_staged and our publish. + // commit + record_merge_commit calls below. Under PR 1b's + // intermediate state (global server RwLock still in place), + // this acquisition is uncontended.
let active_branch_for_keys = self.active_branch().await; let merge_queue_keys: Vec<(String, Option)> = ordered_table_keys .iter() @@ -1631,7 +1324,6 @@ impl Omnigraph { candidates.get(*table_key), Some(CandidateTableState::RewriteMerged(_)) | Some(CandidateTableState::AdoptSourceState) - | Some(CandidateTableState::AdoptWithDelta(_)) ) }) .map(|table_key| (table_key.clone(), active_branch_for_keys.clone())) @@ -1645,9 +1337,7 @@ impl Omnigraph { }; if !matches!( candidate, - CandidateTableState::RewriteMerged(_) - | CandidateTableState::AdoptSourceState - | CandidateTableState::AdoptWithDelta(_) + CandidateTableState::RewriteMerged(_) | CandidateTableState::AdoptSourceState ) { continue; } @@ -1668,23 +1358,15 @@ impl Omnigraph { .iter() .filter_map(|table_key| { let candidate = candidates.get(table_key)?; - if !matches!( - candidate, - CandidateTableState::RewriteMerged(_) | CandidateTableState::AdoptWithDelta(_) - ) { + if !matches!(candidate, CandidateTableState::RewriteMerged(_)) { return None; } let entry = target_snapshot.entry(table_key)?; Some(crate::db::manifest::SidecarTablePin { table_key: table_key.clone(), - table_path: self.storage().dataset_uri(&entry.table_path), + table_path: self.table_store().dataset_uri(&entry.table_path), expected_version: entry.table_version, post_commit_pin: entry.table_version + 1, - // Stamped after the whole per-table publish completes - // (Phase-B confirmation, just before the manifest publish). - // Until then `None` marks an unfinished publish that - // recovery must roll back, not roll forward. - confirmed_version: None, // Use the merge target branch (where commits actually // land), NOT entry.table_branch (where the table // currently lives). 
publish_rewritten_merge_table calls @@ -1701,14 +1383,7 @@ impl Omnigraph { }) }) .collect(); - // Keep the sidecar alongside its handle: after the per-table publish - // loop completes (Phase B), we re-write it with each table's confirmed - // version before the manifest publish, so recovery can tell a finished - // publish (roll forward) from a partial one (roll back). - let mut recovery: Option<( - crate::db::manifest::RecoverySidecar, - crate::db::manifest::RecoverySidecarHandle, - )> = if recovery_pins.is_empty() { + let recovery_handle = if recovery_pins.is_empty() { None } else { // Use the merge target branch directly, NOT a heuristic @@ -1733,13 +1408,14 @@ impl Omnigraph { // this, future merges between the same pair lose // already-up-to-date detection and merge-base correctness. sidecar.merge_source_commit_id = Some(source_head_commit_id.to_string()); - let handle = crate::db::manifest::write_sidecar( - self.root_uri(), - self.storage_adapter(), - &sidecar, + Some( + crate::db::manifest::write_sidecar( + self.root_uri(), + self.storage_adapter(), + &sidecar, + ) + .await?, ) - .await?; - Some((sidecar, handle)) }; let mut updates = Vec::new(); @@ -1750,11 +1426,15 @@ impl Omnigraph { }; let update = match candidate_state { CandidateTableState::AdoptSourceState => { - publish_adopted_source_state(self, source_snapshot, &target_snapshot, table_key) - .await? - } - CandidateTableState::AdoptWithDelta(delta) => { - publish_adopted_delta(self, table_key, delta).await? + publish_adopted_source_state( + self, + &self.catalog(), + base_snapshot, + source_snapshot, + &target_snapshot, + table_key, + ) + .await? } CandidateTableState::RewriteMerged(staged) => { publish_rewritten_merge_table(self, table_key, staged).await? @@ -1766,50 +1446,22 @@ impl Omnigraph { updates.push(update); } - // Phase-B confirmation: every table's publish finished, so stamp the - // sidecar with each table's exact achieved version before the manifest - // publish. 
This is the commit point of the recovery WAL: a crash from - // here on rolls FORWARD to these versions, while a crash anywhere in the - // publish loop above left the sidecar unconfirmed and rolls BACK. The - // `updates` carry the real per-table final versions (multiple - // commit_staged calls per table, so not derivable from `post_commit_pin` - // alone). A failure here leaves the unconfirmed sidecar → roll back. - if let Some((sidecar, _)) = recovery.as_mut() { - let confirmed_versions: std::collections::HashMap = updates - .iter() - .map(|u| (u.table_key.clone(), u.table_version)) - .collect(); - crate::db::manifest::confirm_sidecar_phase_b( - self.root_uri(), - self.storage_adapter(), - sidecar, - &confirmed_versions, - ) - .await?; - } - // Failpoint: pin the per-writer Phase B → Phase C residual for // branch_merge. Lance HEAD has advanced on every touched table - // (publish_*) AND the sidecar is confirmed, but the manifest publish - // below hasn't run — so recovery rolls FORWARD. Used by - // `tests/failpoints.rs::branch_merge_phase_b_failure_recovered_on_next_open`. - crate::failpoints::maybe_fail(crate::failpoints::names::BRANCH_MERGE_POST_PHASE_B_PRE_MANIFEST_COMMIT)?; + // (publish_*) but the manifest publish below hasn't run. Used + // by `tests/failpoints.rs::branch_merge_phase_b_failure_recovered_on_next_open`. + crate::failpoints::maybe_fail("branch_merge.post_phase_b_pre_manifest_commit")?; - // Publish the merged table versions AND the merge commit in one manifest - // CAS (RFC-013 Phase 7): `graph_commit` + `graph_head` rows ride the same - // merge-insert as the table-version rows. The merge commit's first parent - // is resolved by the publisher as the live target-branch head (the - // post-merge correct parent even if the target advanced); its merged-in - // parent is the source head. `target_head_commit_id` is no longer passed - // — it was the pre-merge target head, which the publisher reads live.
- let _ = target_head_commit_id; - self.commit_merge_with_actor(&updates, source_head_commit_id, actor_id) - .await?; + let manifest_version = if updates.is_empty() { + self.version().await + } else { + self.commit_manifest_updates(&updates).await? + }; - // Recovery sidecar lifecycle: delete after the manifest publish (Phase C). - // Best-effort cleanup; the merge already landed durably so failing the - // user here is undesirable. - if let Some((_, handle)) = recovery { + // Recovery sidecar lifecycle: delete after manifest publish. + // Best-effort cleanup; the merge already landed durably so + // failing the user here is undesirable. + if let Some(handle) = recovery_handle { if let Err(err) = crate::db::manifest::delete_sidecar(&handle, self.storage_adapter()).await { @@ -1820,6 +1472,13 @@ impl Omnigraph { ); } } + self.record_merge_commit( + manifest_version, + target_head_commit_id, + source_head_commit_id, + actor_id, + ) + .await?; if changed_edge_tables { self.invalidate_graph_index().await; diff --git a/crates/omnigraph/src/exec/mod.rs b/crates/omnigraph/src/exec/mod.rs index 4076414..33a7e41 100644 --- a/crates/omnigraph/src/exec/mod.rs +++ b/crates/omnigraph/src/exec/mod.rs @@ -35,12 +35,11 @@ use time::format_description::well_known::Rfc3339; use crate::db::commit_graph::CommitGraph; use crate::db::manifest::ManifestCoordinator; -use crate::db::{MergeOutcome, Omnigraph, is_internal_system_branch}; +use crate::db::{MergeOutcome, Omnigraph, is_internal_run_branch}; use crate::db::{ReadTarget, Snapshot}; use crate::embedding::EmbeddingClient; use crate::error::{MergeConflict, MergeConflictKind, OmniError, Result}; use crate::graph_index::GraphIndex; -use crate::storage_layer::SnapshotHandle; use tempfile::{Builder as TempDirBuilder, TempDir}; mod merge; diff --git a/crates/omnigraph/src/exec/mutation.rs b/crates/omnigraph/src/exec/mutation.rs index fe63a0c..02b2a21 100644 --- a/crates/omnigraph/src/exec/mutation.rs +++ 
b/crates/omnigraph/src/exec/mutation.rs @@ -428,11 +428,12 @@ async fn ensure_node_id_exists( let filter = format!("id = '{}'", id.replace('\'', "''")); let snapshot = db.snapshot_for_branch(branch).await?; - let ds = db - .storage() - .open_snapshot_at_table(&snapshot, &table_key) - .await?; - let exists = db.storage().count_rows(&ds, Some(filter)).await? > 0; + let ds = snapshot.open(&table_key).await?; + let exists = ds + .count_rows(Some(filter)) + .await + .map_err(|e| OmniError::Lance(e.to_string()))? + > 0; if exists { Ok(()) @@ -477,12 +478,6 @@ fn predicate_to_sql( } }; - // #283: emit the column UNQUOTED. Lance's `Scanner::filter(&str)` (the - // committed-scan consumer) preserves an unquoted identifier's case but - // treats a double-quoted `"col"` as a string literal, so quoting here - // would silently match zero committed rows. The pending-batch MemTable - // query is instead made case-preserving by disabling DataFusion identifier - // normalization on its `SessionContext` (see `scan_pending_batches`). Ok(format!("{} {} {}", column, op, value_sql)) } @@ -574,8 +569,7 @@ use super::staging::{MutationStaging, PendingMode}; /// via `open_for_mutation_on_branch`, which compares Lance HEAD against /// the manifest's pinned version — that fence is the engine's /// publisher-style OCC catching cross-writer drift before we make any -/// changes. For delete-only queries, this strict open is also the uncovered -/// drift guard that runs before `delete_where` can inline-commit. +/// changes. /// /// On subsequent touches *within the same query*, behavior depends on /// whether the table has already been inline-committed by a delete op: @@ -601,51 +595,13 @@ use super::staging::{MutationStaging, PendingMode}; /// away once Lance exposes a two-phase delete API /// ([lance-format/lance#6658](https://github.com/lance-format/lance/issues/6658)) /// and we can stage deletes on the same path as inserts/updates.
-impl Omnigraph { - /// Resolve a LIVE-HEAD read handle for an edge table's committed-state `@card` - /// scan when collapse #1 skipped the accumulation open. The edge-insert path no - /// longer opens the edge dataset (non-strict op + txn), but cardinality is - /// validated ONCE (never rechecked at commit), so the scan must observe the - /// freshest committed edges — NOT the pinned `txn.base`. A concurrent writer can - /// commit edges to this table after `txn` capture; counting against the stale - /// base undercounts and lets a violating insert through (invariant 9). The table - /// LOCATION is read from the pinned entry (stable across versions); the dataset is - /// opened at live HEAD via `open_dataset_head_for_write` (a read here despite the - /// name — no lock/stage), restoring the pre-3b image (the mutation's own open). - /// The residual validate→commit race (a writer committing between this scan and - /// the end-of-query commit) is the §7.1 gap, closed by RFC-013 step 4. - async fn edge_cardinality_read_handle( - &self, - txn: Option<&crate::db::WriteTxn>, - table_key: &str, - ) -> Result { - let branch = txn.and_then(|t| t.branch.as_deref()); - match txn.and_then(|t| t.base.entry(table_key)) { - Some(entry) => { - let full_path = self.storage().dataset_uri(&entry.table_path); - self.storage() - .open_dataset_head_for_write(table_key, &full_path, branch) - .await - } - // Unreachable today (the `None` handle only reaches here under a txn whose - // base contains the table). Defensive: resolve the table fresh (live) - // without the schema re-validation `snapshot_for_branch` would re-run.
- None => { - let snapshot = self.fresh_snapshot_for_branch_unchecked(branch).await?; - self.storage().open_snapshot_at_table(&snapshot, table_key).await - } - } - } -} - async fn open_table_for_mutation( db: &Omnigraph, staging: &mut MutationStaging, branch: Option<&str>, table_key: &str, op_kind: crate::db::MutationOpKind, - txn: Option<&crate::db::WriteTxn>, -) -> Result<(Option, String, Option)> { +) -> Result<(Dataset, String, Option)> { if let Some(prior) = staging.inline_committed.get(table_key) { let path = staging.paths.get(table_key).ok_or_else(|| { OmniError::manifest_internal(format!( @@ -653,10 +609,6 @@ async fn open_table_for_mutation( table_key )) })?; - // The inline-committed reopen does NOT validate the schema contract - // (it reopens at the post-inline-commit Lance version directly), so it - // takes no `txn` — threading it here would change nothing. Deletes are - // strict ops, so this always opens (returns `Some`). let ds = db .reopen_for_mutation( table_key, @@ -666,32 +618,20 @@ async fn open_table_for_mutation( op_kind, ) .await?; - return Ok((Some(ds), path.full_path.clone(), path.table_branch.clone())); + return Ok((ds, path.full_path.clone(), path.table_branch.clone())); } - // `open_for_mutation_on_branch` returns the expected version even when it - // skips the open (collapse #1, the non-strict insert/merge path): the version - // is the pinned base's, identical to the opened handle's `.version()`. Use it - // directly for `ensure_path` so the no-open path still captures the publisher - // CAS fence. - let opened = db - .open_for_mutation_on_branch(branch, table_key, op_kind, txn) + let (ds, full_path, table_branch) = db + .open_for_mutation_on_branch(branch, table_key, op_kind) .await?; - // Pin the open-skip contract (collapse #1): a missing handle is legal ONLY on - // the non-strict `txn` path. A future change that returns `None` elsewhere - // (e.g.
a new strict arm) trips this in debug builds rather than silently - // handing a `None` to a `require_handle` consumer. - debug_assert!( - opened.handle.is_some() || (txn.is_some() && !op_kind.strict_pre_stage_version_check()), - "open_for_mutation_on_branch returned no handle outside the non-strict txn open-skip path", - ); + let expected_version = ds.version().version; staging.ensure_path( table_key, - opened.full_path.clone(), - opened.table_branch.clone(), - opened.expected_version, + full_path.clone(), + table_branch.clone(), + expected_version, op_kind, ); - Ok((opened.handle, opened.full_path, opened.table_branch)) + Ok((ds, full_path, table_branch)) } /// D₂ parse-time check: a single mutation query is either insert/update-only @@ -699,7 +639,7 @@ async fn open_table_for_mutation( /// /// Reason: under the staged-write writer, inserts and updates /// accumulate in memory and commit at end-of-query, while deletes still -/// inline-commit (Lance lacks a public two-phase delete in 6.0.1). +/// inline-commit (Lance lacks a public two-phase delete in 4.0.0). /// Mixing creates ordering hazards (same-row insert→delete becomes a no-op /// because the staged insert isn't visible to delete; cascading deletes /// of just-inserted edges break referential integrity by silent design). @@ -774,15 +714,7 @@ impl Omnigraph { params: &ParamMap, actor_id: Option<&str>, ) -> Result { - // Converge any pending recovery sidecar (a previously failed - // writer's Phase B → Phase C residual) before executing: the - // inline delete path advances Lance HEAD during execution and - // the staged path's commit-time drift guard refuses - // sidecar-covered drift, so a long-lived handle must heal here - // — not at restart. One `list_dir` when no sidecars exist (the - // steady state). MUST run before `open_write_txn` below — the heal - // may advance the manifest, so the pinned base must be captured after.
- self.heal_pending_recovery_sidecars().await?; + self.ensure_schema_state_valid().await?; let requested = Self::normalize_branch_name(branch)?; // Reject internal `__run__*` / system-prefixed branches at the // public write boundary. Direct-publish paths assert this @@ -791,16 +723,6 @@ impl Omnigraph { if let Some(name) = requested.as_deref() { crate::db::ensure_public_branch_ref(name, "mutate")?; } - // Capture-once write transaction (RFC-013 step 3b). `open_write_txn` - // validates the schema contract ONCE (it resolves the branch target, - // whose first line is `ensure_schema_state_valid`) and pins the base - // snapshot for this write. Threaded as `Some(&txn)` through execution, - // staging commit, and the manifest publish so the per-table opens and - // the commit-time OCC re-read reuse the pinned base instead of - // re-validating the contract at every resolve point. Captured AFTER the - // recovery heal (which may advance the manifest) and AFTER `requested` - // is known so it pins the post-heal snapshot for the correct branch. - let txn = self.open_write_txn(requested.as_deref()).await?; let resolved_params = enrich_mutation_params(params)?; // Per-query staging accumulator. Inserts and updates push batches @@ -811,50 +733,13 @@ impl Omnigraph { // tables. Branch is threaded explicitly — no coordinator swap. let mut staging = MutationStaging::default(); - // Lower + validate up front so the touched-table set is known before - // execution. A lowering/validation error returns exactly as it did - // when this happened inside execute_named_mutation.
- let ir = self.lower_named_mutation(query_source, query_name)?; - - // Up-front fork-queue acquisition (see the loader for the full - // rationale): if this mutation will fork any touched table onto a - // non-main branch, acquire the per-(table, branch) write queues for - // every touched table before the first fork and hold them through the - // publish, so the orphan-fork reclaim can't race a concurrent - // in-process fork. The touched set is derived from the lowered IR. - let fork_queue_guards: Option<( - Vec<(String, Option)>, - Vec>, - )> = if let Some(active) = requested.as_deref() { - let snapshot = self.snapshot_for_branch(Some(active)).await?; - let touched: Vec<(String, Option)> = self - .touched_table_keys(&ir) - .into_iter() - .map(|k| (k, Some(active.to_string()))) - .collect(); - let needs_fork = touched.iter().any(|(table_key, _)| { - snapshot - .entry(table_key) - .map(|e| e.table_branch.as_deref() != Some(active)) - .unwrap_or(false) - }); - if needs_fork { - let guards = self.write_queue().acquire_many(&touched).await; - Some((touched, guards)) - } else { - None - } - } else { - None - }; - let exec_result = self .execute_named_mutation( - &ir, + query_source, + query_name, &resolved_params, requested.as_deref(), &mut staging, - Some(&txn), ) .await; @@ -869,20 +754,12 @@ impl Omnigraph { // interleave between our commit_staged and our publish // (which would correctly fail our CAS but leave Lance // HEAD advanced — the residual class MR-870 recovers).
- let super::staging::CommittedMutation { - updates, - expected_versions, - sidecar_handle, - guards: _queue_guards, - committed_handles, - } = staged + let (updates, expected_versions, sidecar_handle, _queue_guards) = staged .commit_all( self, requested.as_deref(), crate::db::manifest::SidecarKind::Mutation, actor_id, - fork_queue_guards, - Some(&txn), ) .await?; // Failpoint that wedges the documented finalize→publisher @@ -895,14 +772,12 @@ impl Omnigraph { // across this failure so the next `Omnigraph::open`'s // recovery sweep can roll forward — see // `tests/failpoints.rs::recovery_rolls_forward_after_finalize_publisher_failure`. - crate::failpoints::maybe_fail(crate::failpoints::names::MUTATION_POST_FINALIZE_PRE_PUBLISHER)?; + crate::failpoints::maybe_fail("mutation.post_finalize_pre_publisher")?; self.commit_updates_on_branch_with_expected( requested.as_deref(), &updates, &expected_versions, actor_id, - Some(&txn), - committed_handles, ) .await?; // Phase C succeeded — sidecar can be deleted. If this @@ -934,19 +809,14 @@ impl Omnigraph { } } - /// Lower + validate a named mutation query into its IR. - /// - /// Hoisted out of [`Self::execute_named_mutation`] so the caller can - /// inspect the IR before execution — specifically to compute the - /// touched-table set (see [`Self::touched_table_keys`]) for up-front - /// write-queue acquisition. Performs the same find → typecheck → lower - /// → D₂ checks that execution previously did inline, so error behavior - /// is unchanged.
- fn lower_named_mutation( + async fn execute_named_mutation( &self, query_source: &str, query_name: &str, - ) -> Result { + params: &ParamMap, + branch: Option<&str>, + staging: &mut MutationStaging, + ) -> Result { let query_decl = omnigraph_compiler::find_named_query(query_source, query_name) .map_err(|e| OmniError::manifest(e.to_string()))?; @@ -963,62 +833,7 @@ impl Omnigraph { let ir = lower_mutation_query(&query_decl)?; // D₂: reject mixed insert/update + delete before any I/O. enforce_no_mixed_destructive_constructive(&ir)?; - Ok(ir) - } - /// The COMPLETE set of `(node|edge):{type}` table keys a mutation IR can - /// touch at execution time, keyed as `MutationStaging`/`commit_all` key - /// them. Must be a superset of everything execution forks/commits, since - /// it drives the up-front fork-queue acquisition and `commit_all`'s - /// held-guard coverage check — a miss means an unserialized fork/commit. - /// - /// The set is a pure function of (IR ops + catalog). For each op it mirrors - /// the execute path's node-vs-edge dispatch (`node_types` first, then - /// `edge_types`). A `delete ` additionally **cascades** to every edge - /// type whose endpoint is that node (see `execute_delete_node`), forking - /// those edge tables during execution — so they are included here, derived - /// the same way the executor derives them (`from_type`/`to_type` match). - /// Unknown types are skipped (the execute path surfaces the error). - /// Sorted + deduped for one-shot `acquire_many`. - fn touched_table_keys(&self, ir: &omnigraph_compiler::ir::MutationIR) -> Vec { - use omnigraph_compiler::ir::MutationOpIR; - let catalog = self.catalog(); - let mut keys: Vec = Vec::new(); - for op in &ir.ops { - let type_name = match op { - MutationOpIR::Insert { type_name, .. } - | MutationOpIR::Update { type_name, .. } - | MutationOpIR::Delete { type_name, ..
} => type_name, - }; - if catalog.node_types.contains_key(type_name) { - keys.push(format!("node:{type_name}")); - // A node delete cascades to every edge touching this node type, - // forking those edge tables. Include them so the up-front - // acquisition covers the cascade (mirrors execute_delete_node). - if matches!(op, MutationOpIR::Delete { .. }) { - for (edge_name, edge_type) in &catalog.edge_types { - if edge_type.from_type == *type_name || edge_type.to_type == *type_name { - keys.push(format!("edge:{edge_name}")); - } - } - } - } else if catalog.edge_types.contains_key(type_name) { - keys.push(format!("edge:{type_name}")); - } - } - keys.sort(); - keys.dedup(); - keys - } - - async fn execute_named_mutation( - &self, - ir: &omnigraph_compiler::ir::MutationIR, - params: &ParamMap, - branch: Option<&str>, - staging: &mut MutationStaging, - txn: Option<&crate::db::WriteTxn>, - ) -> Result { let mut total = MutationResult::default(); for op in &ir.ops { let result = match op { @@ -1026,7 +841,7 @@ impl Omnigraph { type_name, assignments, } => { - self.execute_insert(type_name, assignments, params, branch, staging, txn) + self.execute_insert(type_name, assignments, params, branch, staging) .await? } MutationOpIR::Update { @@ -1034,16 +849,14 @@ impl Omnigraph { assignments, predicate, } => { - self.execute_update( - type_name, assignments, predicate, params, branch, staging, txn, - ) - .await? + self.execute_update(type_name, assignments, predicate, params, branch, staging) + .await? } MutationOpIR::Delete { type_name, predicate, } => { - self.execute_delete(type_name, predicate, params, branch, staging, txn) + self.execute_delete(type_name, predicate, params, branch, staging) .await? 
} }; @@ -1060,7 +873,6 @@ impl Omnigraph { params: &ParamMap, branch: Option<&str>, staging: &mut MutationStaging, - txn: Option<&crate::db::WriteTxn>, ) -> Result { let mut resolved: HashMap = HashMap::new(); for a in assignments { @@ -1092,12 +904,12 @@ impl Omnigraph { let batch = build_insert_batch(&schema, &id, &resolved, &blob_props)?; crate::loader::validate_value_constraints(&batch, node_type)?; crate::loader::validate_enum_constraints(&batch, &node_type.properties, type_name)?; - let unique_groups = crate::loader::unique_constraint_groups_for_node(node_type); - if !unique_groups.is_empty() { + let unique_props = crate::loader::unique_property_names_for_node(node_type); + if !unique_props.is_empty() { crate::loader::enforce_unique_constraints_intra_batch( &batch, type_name, - &unique_groups, + &unique_props, )?; } let has_key = node_type.key_property().is_some(); @@ -1108,12 +920,8 @@ impl Omnigraph { } else { crate::db::MutationOpKind::Insert }; - // Node inserts are non-strict (Insert/Merge), so with a `WriteTxn` - // this opens NOTHING (collapse #1) — the handle is discarded anyway; - // only `ensure_path`'s captured version (read inside - // `open_table_for_mutation`) is used downstream. let (_ds, _full_path, _table_branch) = - open_table_for_mutation(self, staging, branch, &table_key, insert_kind, txn).await?; + open_table_for_mutation(self, staging, branch, &table_key, insert_kind).await?; // Accumulate. @key inserts go into the Merge stream (so a // later update on the same id coalesces correctly); no-key // inserts go into the Append stream.
@@ -1137,25 +945,22 @@ impl Omnigraph { let batch = build_insert_batch(&schema, &id, &resolved, &blob_props)?; validate_edge_insert_endpoints(self, staging, branch, type_name, &resolved).await?; crate::loader::validate_enum_constraints(&batch, &edge_type.properties, type_name)?; - let unique_groups = crate::loader::unique_constraint_groups_for_edge(edge_type); - if !unique_groups.is_empty() { + let unique_props = crate::loader::unique_property_names_for_edge(edge_type); + if !unique_props.is_empty() { crate::loader::enforce_unique_constraints_intra_batch( &batch, type_name, - &unique_groups, + &unique_props, )?; } let table_key = format!("edge:{}", type_name); - // Capture pre-write metadata on first touch. Edge inserts are - // non-strict, so with a `WriteTxn` this opens NOTHING (collapse #1) - // and returns `None`. - let (handle, _full_path, _table_branch) = open_table_for_mutation( + // Capture pre-write metadata on first touch (no Lance write). + let (ds, _full_path, _table_branch) = open_table_for_mutation( self, staging, branch, &table_key, crate::db::MutationOpKind::Insert, - txn, ) .await?; // Accumulate the new edge row. Edge IDs are ULID-generated so @@ -1165,27 +970,9 @@ impl Omnigraph { // Edge cardinality validation: scan committed edges via Lance // + iterate pending edges in-memory for the `src` column, // group-by-src. The pending side already includes the row - // we just appended (above). When the open was skipped (collapse - // #1), resolve a read handle for the committed scan at LIVE HEAD - // (`edge_cardinality_read_handle`, #298) — NOT the pinned txn.base, - // which would undercount edges a concurrent writer committed since - // capture. Only when cardinality is non-default, so the common - // default-cardinality edge keeps the open-free path. (The residual - // validate→commit race is the §7.1 gap — step 4.)
- if !edge_type.cardinality.is_default() { - let committed_ds = match handle { - Some(h) => h, - None => self.edge_cardinality_read_handle(txn, &table_key).await?, - }; - validate_edge_cardinality_with_pending( - self, - &committed_ds, - staging, - &table_key, - edge_type, - ) + // we just appended (above). + validate_edge_cardinality_with_pending(self, &ds, staging, &table_key, edge_type) .await?; - } self.invalidate_graph_index().await; @@ -1206,7 +993,6 @@ impl Omnigraph { params: &ParamMap, branch: Option<&str>, staging: &mut MutationStaging, - txn: Option<&crate::db::WriteTxn>, ) -> Result { // Defense in depth: ensure this is a node type if !self.catalog().node_types.contains_key(type_name) { @@ -1231,18 +1017,14 @@ impl Omnigraph { let blob_props = self.catalog().node_types[type_name].blob_properties.clone(); let table_key = format!("node:{}", type_name); - let (handle, _full_path, _table_branch) = open_table_for_mutation( + let (ds, _full_path, _table_branch) = open_table_for_mutation( self, staging, branch, &table_key, crate::db::MutationOpKind::Update, - txn, ) .await?; - // Update is a STRICT op, so collapse #1 never skips its open — the - // handle is always `Some` (and it's needed for the committed scan below). - let ds = handle.expect("strict Update op always opens its dataset"); // Scan committed via Lance + apply the same predicate to pending // batches via DataFusion `MemTable` (read-your-writes for prior @@ -1273,7 +1055,7 @@ impl Omnigraph { // and a chained `update where ` can match a row whose // pending value no longer satisfies .
let batches = self - .storage() + .table_store() .scan_with_pending( &ds, pending_batches, @@ -1311,12 +1093,12 @@ impl Omnigraph { let node_type = &self.catalog().node_types[type_name]; crate::loader::validate_value_constraints(&updated, node_type)?; crate::loader::validate_enum_constraints(&updated, &node_type.properties, type_name)?; - let unique_groups = crate::loader::unique_constraint_groups_for_node(node_type); - if !unique_groups.is_empty() { + let unique_props = crate::loader::unique_property_names_for_node(node_type); + if !unique_props.is_empty() { crate::loader::enforce_unique_constraints_intra_batch( &updated, type_name, - &unique_groups, + &unique_props, )?; } @@ -1341,14 +1123,13 @@ impl Omnigraph { params: &ParamMap, branch: Option<&str>, staging: &mut MutationStaging, - txn: Option<&crate::db::WriteTxn>, ) -> Result { let is_node = self.catalog().node_types.contains_key(type_name); if is_node { - self.execute_delete_node(type_name, predicate, params, branch, staging, txn) + self.execute_delete_node(type_name, predicate, params, branch, staging) .await } else { - self.execute_delete_edge(type_name, predicate, params, branch, staging, txn) + self.execute_delete_edge(type_name, predicate, params, branch, staging) .await } } @@ -1360,29 +1141,25 @@ impl Omnigraph { params: &ParamMap, branch: Option<&str>, staging: &mut MutationStaging, - txn: Option<&crate::db::WriteTxn>, ) -> Result { let pred_sql = predicate_to_sql(predicate, params, false)?; let table_key = format!("node:{}", type_name); - let (handle, full_path, table_branch) = open_table_for_mutation( + let (ds, full_path, table_branch) = open_table_for_mutation( self, staging, branch, &table_key, crate::db::MutationOpKind::Delete, - txn, ) .await?; - // Delete is a STRICT op, so collapse #1 never skips its open. 
- let ds = handle.expect("strict Delete op always opens its dataset"); - let initial_version = ds.version(); + let initial_version = ds.version().version; // Scan matching IDs for cascade. Per D₂ this never overlaps with // staged inserts (mixed insert/delete in one query is rejected at // parse time), so we scan committed only. let batches = self - .storage() + .table_store() .scan(&ds, Some(&["id"]), Some(&pred_sql), None) .await?; @@ -1410,11 +1187,11 @@ impl Omnigraph { let affected_nodes = deleted_ids.len(); // Delete nodes — still inline-commit (Lance's `Dataset::delete` is - // not exposed as a two-phase op in 6.0.1). D₂ keeps inserts and + // not exposed as a two-phase op in 4.0.0). D₂ keeps inserts and // deletes from coexisting in one query, so this advance of Lance // HEAD is the only HEAD movement during the query and the // publisher's CAS captures it intact. - let ds = self + let mut ds = self .reopen_for_mutation( &table_key, &full_path, @@ -1423,10 +1200,10 @@ impl Omnigraph { crate::db::MutationOpKind::Delete, ) .await?; - crate::failpoints::maybe_fail(crate::failpoints::names::MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE)?; - let (_new_ds, delete_state) = self - .storage_inline_residual() - .delete_where(&full_path, ds, &pred_sql) + crate::failpoints::maybe_fail("mutation.delete_node_pre_primary_delete")?; + let delete_state = self + .table_store() + .delete_where(&full_path, &mut ds, &pred_sql) .await?; staging.record_inline(crate::db::SubTableUpdate { @@ -1465,21 +1242,18 @@ impl Omnigraph { let edge_table_key = format!("edge:{}", edge_name); let cascade_filter = cascade_filters.join(" OR "); - let (edge_handle, edge_full_path, edge_table_branch) = open_table_for_mutation( + let (mut edge_ds, edge_full_path, edge_table_branch) = open_table_for_mutation( self, staging, branch, &edge_table_key, crate::db::MutationOpKind::Delete, - txn, ) .await?; - // Delete is a STRICT op, so collapse #1 never skips its open.
- let edge_ds = edge_handle.expect("strict Delete op always opens its dataset"); - let (_new_edge_ds, edge_delete) = self - .storage_inline_residual() - .delete_where(&edge_full_path, edge_ds, &cascade_filter) + let edge_delete = self + .table_store() + .delete_where(&edge_full_path, &mut edge_ds, &cascade_filter) .await?; affected_edges += edge_delete.deleted_rows; @@ -1512,26 +1286,22 @@ impl Omnigraph { params: &ParamMap, branch: Option<&str>, staging: &mut MutationStaging, - txn: Option<&crate::db::WriteTxn>, ) -> Result { let pred_sql = predicate_to_sql(predicate, params, true)?; let table_key = format!("edge:{}", type_name); - let (handle, full_path, table_branch) = open_table_for_mutation( + let (mut ds, full_path, table_branch) = open_table_for_mutation( self, staging, branch, &table_key, crate::db::MutationOpKind::Delete, - txn, ) .await?; - // Delete is a STRICT op, so collapse #1 never skips its open. - let ds = handle.expect("strict Delete op always opens its dataset"); - let (_new_ds, delete_state) = self - .storage_inline_residual() - .delete_where(&full_path, ds, &pred_sql) + let delete_state = self + .table_store() + .delete_where(&full_path, &mut ds, &pred_sql) .await?; let affected = delete_state.deleted_rows; @@ -1585,7 +1355,7 @@ fn concat_match_batches_to_schema( /// dedup needed (`dedupe_key_column = None`). async fn validate_edge_cardinality_with_pending( db: &Omnigraph, - committed_ds: &SnapshotHandle, + committed_ds: &Dataset, staging: &MutationStaging, table_key: &str, edge_type: &omnigraph_compiler::catalog::EdgeType, @@ -1608,29 +1378,3 @@ fn enrich_mutation_params(params: &ParamMap) -> Result { } Ok(resolved) } - -#[cfg(test)] -mod predicate_sql_tests { - use super::*; - - // #283: a camelCase column in a mutation predicate must be emitted - // UNQUOTED and case-preserved. 
The committed-scan consumer, Lance's - // `Scanner::filter(&str)`, preserves an unquoted identifier's case but - // treats a double-quoted `"col"` as a string literal (which silently - // matches zero rows), so the predicate string must not quote the column. - // The pending MemTable path stays case-preserving by disabling DataFusion - // identifier normalization on its context, not by quoting here. - #[test] - fn predicate_to_sql_preserves_camelcase_column_unquoted() { - let predicate = IRMutationPredicate { - property: "repoName".to_string(), - op: CompOp::Eq, - value: IRExpr::Literal(Literal::String("acme".into())), - }; - let sql = predicate_to_sql(&predicate, &ParamMap::new(), false).unwrap(); - assert_eq!( - sql, "repoName = 'acme'", - "column must be unquoted and case-preserved, got {sql}" - ); - } -} diff --git a/crates/omnigraph/src/exec/projection.rs b/crates/omnigraph/src/exec/projection.rs index bb6e665..dec13a8 100644 --- a/crates/omnigraph/src/exec/projection.rs +++ b/crates/omnigraph/src/exec/projection.rs @@ -72,11 +72,7 @@ fn evaluate_expr(batch: &RecordBatch, expr: &IRExpr, params: &ParamMap) -> Resul } /// Create a constant array from a literal value. -/// -/// `pub(super)` so the pushdown arm (`query.rs::literal_to_typed_expr`) can build -/// a literal in the same natural Arrow type and cast it to the column type through -/// the identical `arrow_cast` path used here, keeping the two filter arms in sync. -pub(super) fn literal_to_array(lit: &Literal, num_rows: usize) -> Result { +fn literal_to_array(lit: &Literal, num_rows: usize) -> Result { Ok(match lit { Literal::Null => arrow_array::new_null_array(&DataType::Utf8, num_rows), Literal::String(s) => Arc::new(StringArray::from(vec![s.as_str(); num_rows])) as ArrayRef, @@ -426,35 +422,6 @@ pub(super) fn apply_ordering( }); } - // Deterministic tie-break for a TOTAL order. 
`lexsort_to_indices` is unstable - // and the input row order is not guaranteed (scan parallelism, upstream - // hashing), so equal user-sort keys would otherwise come out run-dependent — - // making `ORDER ... LIMIT` non-deterministic. Append the bound entities' key - // columns (`.id`, unique per row) in canonical (name-sorted) order as - // ascending tie-breaks. The combination of all bound keys uniquely identifies - // a result row, so the order is total and reproducible. (Aggregate results - // have no `.id` columns; their group rows are already distinct on the - // projected group keys.) - let mut tiebreak_cols: Vec = source - .schema() - .fields() - .iter() - .map(|f| f.name().to_string()) - .filter(|name| name.ends_with(".id")) - .collect(); - tiebreak_cols.sort(); - for name in &tiebreak_cols { - if let Some(col) = source.column_by_name(name) { - sort_columns.push(SortColumn { - values: col.clone(), - options: Some(arrow_schema::SortOptions { - descending: false, - nulls_first: true, - }), - }); - } - } - let indices = lexsort_to_indices(&sort_columns, None).map_err(|e| OmniError::Lance(e.to_string()))?; diff --git a/crates/omnigraph/src/exec/query.rs b/crates/omnigraph/src/exec/query.rs index 23e1434..7590512 100644 --- a/crates/omnigraph/src/exec/query.rs +++ b/crates/omnigraph/src/exec/query.rs @@ -2,30 +2,6 @@ use super::*; use super::projection::{apply_filter, apply_ordering, project_return}; -/// Bundles the per-handle embedding client cell with the optional injected -/// config (RFC-012 Phase 5) so the lazy init uses the injected config when -/// present, else `EmbeddingClient::from_env()`. Threaded through the query path -/// in place of the bare cell, preserving laziness (a graph that never embeds -/// builds no client and needs no key).
-pub(crate) struct EmbeddingResolver<'a> { - cell: &'a tokio::sync::OnceCell, - config: Option<&'a crate::embedding::EmbeddingConfig>, -} - -impl EmbeddingResolver<'_> { - async fn resolve(&self) -> Result<&EmbeddingClient> { - let config = self.config.cloned(); - self.cell - .get_or_try_init(|| async move { - match config { - Some(cfg) => EmbeddingClient::new(cfg), - None => EmbeddingClient::from_env(), - } - }) - .await - } -} - impl Omnigraph { /// Run a named query against an explicit branch or snapshot target. pub async fn query( @@ -35,7 +11,7 @@ impl Omnigraph { query_name: &str, params: &ParamMap, ) -> Result { - // resolved_target validates the schema contract; no redundant call here. + self.ensure_schema_state_valid().await?; let resolved = self.resolved_target(target).await?; let catalog = self.catalog(); @@ -48,23 +24,18 @@ impl Omnigraph { .pipeline .iter() .any(|op| matches!(op, IROp::Expand { .. } | IROp::AntiJoin { .. })); - // Lazy: an index-served query with no AntiJoin never builds the CSR. let graph_index = if needs_graph { - GraphIndexHandle::cached(self, &resolved) + Some(self.graph_index_for_resolved(&resolved).await?) } else { - GraphIndexHandle::none() + None }; execute_query( &ir, params, &resolved.snapshot, - &graph_index, + graph_index.as_deref(), &catalog, - &EmbeddingResolver { - cell: self.embedding_cell(), - config: self.embedding_config_ref(), - }, ) .await } @@ -80,7 +51,7 @@ impl Omnigraph { query_name: &str, params: &ParamMap, ) -> Result { - // snapshot_at_version validates the schema contract; no redundant call here. + self.ensure_schema_state_valid().await?; let snapshot = self.snapshot_at_version(version).await?; let catalog = self.catalog(); @@ -93,32 +64,18 @@ impl Omnigraph { .pipeline .iter() .any(|op| matches!(op, IROp::Expand { .. } | IROp::AntiJoin { .. 
})); - // Lazy build against this historical snapshot (not the RuntimeCache, - // which is keyed to live branch targets); only a CSR-path Expand or an - // AntiJoin triggers it. let graph_index = if needs_graph { let edge_types = catalog .edge_types .iter() .map(|(name, et)| (name.clone(), (et.from_type.clone(), et.to_type.clone()))) .collect(); - GraphIndexHandle::direct(&snapshot, edge_types) + Some(Arc::new(GraphIndex::build(&snapshot, &edge_types).await?)) } else { - GraphIndexHandle::none() + None }; - execute_query( - &ir, - params, - &snapshot, - &graph_index, - &catalog, - &EmbeddingResolver { - cell: self.embedding_cell(), - config: self.embedding_config_ref(), - }, - ) - .await + execute_query(&ir, params, &snapshot, graph_index.as_deref(), &catalog).await } } @@ -148,7 +105,6 @@ async fn extract_search_mode( ir: &QueryIR, params: &ParamMap, catalog: &Catalog, - embedding: &EmbeddingResolver<'_>, ) -> Result { if ir.order_by.is_empty() { return Ok(SearchMode::default()); @@ -161,8 +117,7 @@ async fn extract_search_mode( query, } => { let vec = - resolve_nearest_query_vec(ir, catalog, variable, property, query, params, embedding) - .await?; + resolve_nearest_query_vec(ir, catalog, variable, property, query, params).await?; let k = ir.limit.ok_or_else(|| { OmniError::manifest("nearest() ordering requires a limit clause".to_string()) })? 
as usize; @@ -205,10 +160,9 @@ async fn extract_search_mode( .unwrap_or(60) as u32; let primary_mode = - extract_sub_search_mode(ir, primary, params, catalog, ir.limit, embedding).await?; + extract_sub_search_mode(ir, primary, params, catalog, ir.limit).await?; let secondary_mode = - extract_sub_search_mode(ir, secondary, params, catalog, ir.limit, embedding) - .await?; + extract_sub_search_mode(ir, secondary, params, catalog, ir.limit).await?; Ok(SearchMode { rrf: Some(RrfMode { @@ -231,7 +185,6 @@ async fn extract_sub_search_mode( params: &ParamMap, catalog: &Catalog, limit: Option, - embedding: &EmbeddingResolver<'_>, ) -> Result { match expr { IRExpr::Nearest { @@ -240,8 +193,7 @@ async fn extract_sub_search_mode( query, } => { let vec = - resolve_nearest_query_vec(ir, catalog, variable, property, query, params, embedding) - .await?; + resolve_nearest_query_vec(ir, catalog, variable, property, query, params).await?; let k = limit.unwrap_or(100) as usize; Ok(SearchMode { nearest: Some((variable.clone(), property.clone(), vec, k)), @@ -280,34 +232,15 @@ async fn resolve_nearest_query_vec( property: &str, expr: &IRExpr, params: &ParamMap, - embedding: &EmbeddingResolver<'_>, ) -> Result> { let lit = resolve_literal_or_param(expr, params)?; match lit { Literal::List(_) => literal_to_f32_vec(&lit), Literal::String(text) => { - let (expected_dim, recorded_model) = - nearest_property_dim_and_model(ir, catalog, variable, property)?; - // Lazily resolve the per-handle client once, then reuse it across - // queries (keeps the provider connection pool warm); a graph that - // never embeds never builds a client and needs no provider key. - let client = embedding.resolve().await?; - // Same-space guarantee: if the property recorded the model that - // produced its stored vectors (`@embed("…", model="…")`), the query - // embedder must resolve to that same model — otherwise the comparison - // is across vector spaces. Reject loudly instead of ranking garbage.
- if let Some(recorded) = &recorded_model { - let resolved = &client.config().model; - if resolved != recorded { - return Err(OmniError::manifest(format!( - "nearest() on '{property}': its stored vectors were embedded with model \ - '{recorded}', but the query embedder resolves to '{resolved}'. Set \ - OMNIGRAPH_EMBED_MODEL='{recorded}' (and the matching provider) or re-embed \ - the stored vectors." - ))); - } - } - client.embed_query_text(&text, expected_dim).await + let expected_dim = nearest_property_dimension(ir, catalog, variable, property)?; + EmbeddingClient::from_env()? + .embed_query_text(&text, expected_dim) + .await } _ => Err(OmniError::manifest( "nearest query must be a string or list of floats".to_string(), @@ -349,14 +282,12 @@ fn literal_to_f32_vec(lit: &Literal) -> Result> { } } -/// Resolve the nearest() target property's vector dimension and the embedding -/// model recorded for it via `@embed("…", model="…")` (`None` if unrecorded). -fn nearest_property_dim_and_model( +fn nearest_property_dimension( ir: &QueryIR, catalog: &Catalog, variable: &str, property: &str, -) -> Result<(usize, Option)> { +) -> Result { let type_name = resolve_binding_type_name(&ir.pipeline, variable).ok_or_else(|| { OmniError::manifest_internal(format!( "nearest() variable '${}' is not bound to a node type in the lowered pipeline", @@ -375,20 +306,13 @@ fn nearest_property_dim_and_model( type_name, property )) })?; - let dim = match prop.scalar { - ScalarType::Vector(dim) if !prop.list => dim as usize, - _ => { - return Err(OmniError::manifest_internal(format!( - "nearest() property '{}.{}' is not a scalar vector", - type_name, property - ))); - } - }; - let recorded_model = node_type - .embed_sources - .get(property) - .and_then(|embed| embed.model.clone()); - Ok((dim, recorded_model)) + match prop.scalar { + ScalarType::Vector(dim) if !prop.list => Ok(dim as usize), + _ => Err(OmniError::manifest_internal(format!( + "nearest() property '{}.{}' is not a scalar vector",
+ type_name, property + ))), + } } fn resolve_binding_type_name<'a>(pipeline: &'a [IROp], variable: &str) -> Option<&'a str> { @@ -418,11 +342,10 @@ pub async fn execute_query( ir: &QueryIR, params: &ParamMap, snapshot: &Snapshot, - graph_index: &GraphIndexHandle<'_>, + graph_index: Option<&GraphIndex>, catalog: &Catalog, - embedding: &EmbeddingResolver<'_>, ) -> Result { - let search_mode = extract_search_mode(ir, params, catalog, embedding).await?; + let search_mode = extract_search_mode(ir, params, catalog).await?; // RRF requires forked execution if let Some(ref rrf) = search_mode.rrf { @@ -477,7 +400,7 @@ async fn execute_rrf_query( ir: &QueryIR, params: &ParamMap, snapshot: &Snapshot, - graph_index: &GraphIndexHandle<'_>, + graph_index: Option<&GraphIndex>, catalog: &Catalog, rrf: &RrfMode, ) -> Result { @@ -660,7 +583,7 @@ fn execute_pipeline<'a>( pipeline: &'a [IROp], params: &'a ParamMap, snapshot: &'a Snapshot, - graph_index: &'a GraphIndexHandle<'a>, + graph_index: Option<&'a GraphIndex>, catalog: &'a Catalog, wide: &'a mut Option, search_mode: &'a SearchMode, @@ -730,10 +653,13 @@ fn execute_pipeline<'a>( max_hops, dst_filters, } => { + let gi = graph_index.ok_or_else(|| { + OmniError::manifest("graph index required for traversal".to_string()) + })?; if let Some(batch) = wide.as_mut() { execute_expand( batch, - graph_index, + gi, snapshot, catalog, src_var, @@ -762,673 +688,8 @@ fn execute_pipeline<'a>( }) } -/// Lazily provides the in-memory CSR graph index, building it on first use and -/// memoizing for the rest of the query. Indexed-mode Expand never asks for it, -/// so a query that is entirely index-served and has no AntiJoin never pays the -/// O(|E|) CSR build (the whole point of the indexed path). The `Cached` builder -/// also reuses the cross-query `RuntimeCache` entry; `Direct` builds against an -/// arbitrary snapshot (time-travel reads); `None` is for queries with no -/// traversal at all. 
-pub struct GraphIndexHandle<'a> { - cell: tokio::sync::OnceCell>>, - builder: GraphIndexBuilder<'a>, -} - -enum GraphIndexBuilder<'a> { - None, - Cached(&'a Omnigraph, &'a crate::db::ResolvedTarget), - Direct(&'a Snapshot, HashMap), -} - -impl<'a> GraphIndexHandle<'a> { - fn none() -> Self { - Self { - cell: tokio::sync::OnceCell::new(), - builder: GraphIndexBuilder::None, - } - } - - fn cached(db: &'a Omnigraph, resolved: &'a crate::db::ResolvedTarget) -> Self { - Self { - cell: tokio::sync::OnceCell::new(), - builder: GraphIndexBuilder::Cached(db, resolved), - } - } - - fn direct(snapshot: &'a Snapshot, edge_types: HashMap) -> Self { - Self { - cell: tokio::sync::OnceCell::new(), - builder: GraphIndexBuilder::Direct(snapshot, edge_types), - } - } - - /// The CSR index, built on first call. `None` only when the query needs no - /// traversal (the `None` builder). - async fn get(&self) -> Result> { - let built = self - .cell - .get_or_try_init(|| async { - match &self.builder { - GraphIndexBuilder::None => Ok::>, OmniError>(None), - GraphIndexBuilder::Cached(db, resolved) => { - Ok(Some(db.graph_index_for_resolved(resolved).await?)) - } - GraphIndexBuilder::Direct(snapshot, edge_types) => { - Ok(Some(Arc::new(GraphIndex::build(snapshot, edge_types).await?))) - } - } - }) - .await?; - Ok(built.as_deref()) - } - - /// Whether the in-memory CSR is already materialized for this query (a prior - /// Expand or bulk AntiJoin realized it), so reusing it is ~free. Lets the - /// cost chooser prefer the warm CSR over per-hop indexed scans. - fn is_built(&self) -> bool { - matches!(self.cell.get(), Some(Some(_))) - } -} - -/// Explicit traversal-mode override. `OMNIGRAPH_TRAVERSAL_MODE=indexed|csr` -/// forces the path (ops escape hatch + test hook). Both modes are semantically -/// identical, so the override only changes which path runs, never the result. 
-fn traversal_indexed_override() -> Option { - match std::env::var("OMNIGRAPH_TRAVERSAL_MODE").ok().as_deref() { - Some("indexed") => Some(true), - Some("csr") => Some(false), - _ => None, - } -} - -/// Max source-row frontier for which Expand uses the BTREE-indexed path. -/// Larger frontiers fall back to the in-memory CSR (dense / whole-graph). See -/// `docs/user/reference/constants.md`. -const DEFAULT_EXPAND_INDEXED_MAX_FRONTIER: usize = 1024; -/// Max hop count for the indexed path (each hop is one indexed scan; very deep -/// traversals fan out toward whole-graph and are better served by CSR). -const DEFAULT_EXPAND_INDEXED_MAX_HOPS: u32 = 6; - -fn expand_indexed_max_frontier() -> usize { - std::env::var("OMNIGRAPH_EXPAND_INDEXED_MAX_FRONTIER") - .ok() - .and_then(|v| v.parse::().ok()) - .unwrap_or(DEFAULT_EXPAND_INDEXED_MAX_FRONTIER) -} - -fn expand_indexed_max_hops() -> u32 { - std::env::var("OMNIGRAPH_EXPAND_INDEXED_MAX_HOPS") - .ok() - .and_then(|v| v.parse::().ok()) - .filter(|&v| v > 0) - .unwrap_or(DEFAULT_EXPAND_INDEXED_MAX_HOPS) -} - -/// The two Expand execution paths the chooser dispatches between. Extensible: -/// a future persisted-adjacency artifact would become a third variant here, and -/// `choose_expand_mode` would learn to prefer it when covered. -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -enum ExpandMode { - /// Per-hop neighbor lookup via the persisted src/dst BTREE. Work scales - /// with the frontier, not |E| — best for selective traversals. - IndexedScan, - /// Whole-graph in-memory CSR (built once, reused). Best for dense / deep / - /// large-frontier traversals, or when the index is degraded and a full - /// scan would be paid per hop anyway. - Csr, -} - -/// Building the in-memory CSR costs more than a bare edge scan: it scans every -/// edge AND allocates + groups the adjacency. This factor expresses that -/// overhead so a one-off degraded single-hop scan can still edge out a full CSR -/// build.
The crossover is insensitive to its exact value. -const CSR_BUILD_FACTOR: f64 = 1.5; - -/// Cardinality inputs for the (pure, IO-free) traversal-mode cost model. Every -/// field is a cheap manifest-resident count or an already-in-hand value — the -/// chooser performs no scans. -#[derive(Debug, Clone)] -struct ExpandCostInputs { - /// Current frontier size (`wide.num_rows()`). - frontier_rows: usize, - /// |E| for the edge type (manifest `row_count`). - edge_count: u64, - /// |V_src| — node count of the keyed endpoint type (manifest `row_count`). - src_node_count: u64, - /// Effective max hop count for this Expand. - effective_max_hops: u32, - /// Hard ceiling above which the indexed path is never used (resolved - /// `OMNIGRAPH_EXPAND_INDEXED_MAX_HOPS`). - max_hops_cap: u32, - /// Hard ceiling above which the indexed path is never used (resolved - /// `OMNIGRAPH_EXPAND_INDEXED_MAX_FRONTIER`). - max_frontier_cap: usize, - /// Whether `scan_edges_by_endpoint`'s `key_col IN (...)` is served by the - /// BTREE (`Indexed`) or silently falls back to a full scan (`Degraded`). - coverage: crate::table_store::IndexCoverage, - /// Whether the cross-query CSR for this snapshot+edge-version is already - /// built (making the CSR path ≈ free). Conservatively `false` until the - /// cache-peek is wired (the plan's optional refinement). - csr_cached: bool, -} - -/// Pure cost-based traversal-mode chooser. Compares an estimate of the indexed -/// path's frontier-relative work against the cost of building (or reusing) the -/// whole-graph CSR, and picks the cheaper. Deterministic and IO-free so it is -/// unit-tested at the crossover; the caller supplies the manifest counts and the -/// (optionally degraded) index coverage. -/// -/// Under `Indexed` coverage and a cold CSR the decision reduces to a clean -/// selectivity ratio — indexed wins when `hops * frontier < BUILD_FACTOR * -/// |V_src|`, i.e.
when the frontier is a small fraction of the source vertex -/// set — which is independent of |E| (the flat-in-|E| property PR #149 shipped). -fn choose_expand_mode(i: &ExpandCostInputs) -> ExpandMode { - // Hard ceilings: very deep or very large frontiers fan out toward - // whole-graph and are always better served by CSR, regardless of the cost - // estimate. These preserve the documented semantics of the two cap flags. - if i.effective_max_hops > i.max_hops_cap || i.frontier_rows > i.max_frontier_cap { - return ExpandMode::Csr; - } - - let hops = i.effective_max_hops.max(1) as f64; - let frontier = i.frontier_rows as f64; - let edges = i.edge_count as f64; - let src = i.src_node_count.max(1) as f64; - let fanout = edges / src; - - // Indexed work scales with the frontier when the BTREE serves the IN-list; - // a degraded scan is a full edge scan per hop instead (the C6 perf cliff). - let indexed_cost = match i.coverage { - crate::table_store::IndexCoverage::Indexed => hops * frontier * fanout, - crate::table_store::IndexCoverage::Degraded { .. } => hops * edges, - }; - // A warm CSR is ~free to reuse; a cold one costs a build over all edges. - let csr_cost = if i.csr_cached { - 0.0 - } else { - CSR_BUILD_FACTOR * edges - }; - - if indexed_cost < csr_cost { - ExpandMode::IndexedScan - } else { - ExpandMode::Csr - } -} - -/// Hops the indexed path will actually run, for cost-model purposes. A cross-type -/// edge cannot chain, so `execute_expand_indexed` caps it at one hop regardless of -/// the requested range; the cost model must use that, or it over-estimates the -/// indexed cost of a cross-type variable-length expand and skews toward CSR. -fn cost_effective_hops(requested_max_hops: u32, same_type: bool) -> u32 { - if same_type { - requested_max_hops - } else { - requested_max_hops.min(1) - } -} - -/// Gather the cost-model inputs from cheap manifest counts. `None` when the -/// edge type, its source node type, or their manifest entries are absent (e.g.
-/// a not-yet-materialized table) — the caller then falls back to the legacy -/// frontier/hop ceiling so the decision is always defined. -fn gather_cost_inputs( - snapshot: &Snapshot, - catalog: &Catalog, - edge_type: &str, - direction: Direction, - frontier_rows: usize, - effective_max_hops: u32, - coverage: crate::table_store::IndexCoverage, - csr_cached: bool, -) -> Option { - let edge_entry = snapshot.entry(&format!("edge:{}", edge_type))?; - let edge_def = catalog.edge_types.get(edge_type)?; - // Match the indexed path's cross-type one-hop cap so the cost estimate - // reflects what actually runs (see `cost_effective_hops`). - let effective_max_hops = - cost_effective_hops(effective_max_hops, edge_def.from_type == edge_def.to_type); - // The frontier source vertices are the keyed endpoint's type: `from` for an - // Out traversal (keyed on `src`), `to` for In (keyed on `dst`). - let src_type = match direction { - Direction::Out => &edge_def.from_type, - Direction::In => &edge_def.to_type, - }; - let src_entry = snapshot.entry(&format!("node:{}", src_type))?; - Some(ExpandCostInputs { - frontier_rows, - edge_count: edge_entry.row_count, - src_node_count: src_entry.row_count, - effective_max_hops, - max_hops_cap: expand_indexed_max_hops(), - max_frontier_cap: expand_indexed_max_frontier(), - coverage, - csr_cached, - }) -} - -/// Coverage value to feed the cost decision. A failed coverage probe is treated -/// as `Degraded` (conservative: don't over-favor the indexed path when we can't -/// confirm the BTREE will serve the scan). -fn coverage_for_decision( - coverage: &Result, -) -> crate::table_store::IndexCoverage { - match coverage { - Ok(c) => c.clone(), - Err(_) => crate::table_store::IndexCoverage::Degraded { - reason: "coverage check failed".to_string(), - }, - } -} - -/// Surface the C6 silent scalar-index fallback (commit `5a7ab6d`): warn when the -/// per-hop `key_col IN (...)` won't route through the BTREE.
Detection-only; -/// never fails the query. Behavior-identical to the inline check it replaced. -fn warn_on_degraded_coverage( - coverage: &Result, - key_col: &str, - edge_type: &str, -) { - match coverage { - Ok(crate::table_store::IndexCoverage::Degraded { reason }) => tracing::warn!( - target: "omnigraph::traverse", - edge = %edge_type, - key_col = key_col, - reason = %reason, - "indexed traversal falls back to a full edge scan (results correct, perf degraded)" - ), - Ok(crate::table_store::IndexCoverage::Indexed) => {} - Err(e) => tracing::debug!( - target: "omnigraph::traverse", - error = %e, - "index-coverage check failed; proceeding with traversal" - ), - } -} - -/// The (key, opposite) endpoint columns for a traversal direction. Out follows -/// src -> dst (key on src); In follows the reverse. The persisted BTREE exists -/// on both columns. -fn endpoint_columns(direction: Direction) -> (&'static str, &'static str) { - match direction { - Direction::Out => ("src", "dst"), - Direction::In => ("dst", "src"), - } -} - -/// Execute a graph traversal (Expand). Dispatches to the BTREE-indexed path -/// (selective traversals — neighbor lookups via the persisted src/dst index) or -/// the in-memory CSR path (dense / whole-graph traversals). The CSR index is -/// built lazily and only the CSR path requests it. +/// Execute a graph traversal (Expand). async fn execute_expand( - wide: &mut RecordBatch, - graph_index: &GraphIndexHandle<'_>, - snapshot: &Snapshot, - catalog: &Catalog, - src_var: &str, - dst_var: &str, - edge_type: &str, - direction: Direction, - dst_type: &str, - min_hops: u32, - max_hops: Option, - dst_filters: &[IRFilter], - params: &ParamMap, -) -> Result<()> { - let frontier_rows = wide.num_rows(); - let effective_max_hops = max_hops.unwrap_or(min_hops.max(1)); - let (key_col, _) = endpoint_columns(direction); - let edge_table_key = format!("edge:{}", edge_type); - - // Cardinality-first preliminary decision (no IO).
The override wins; else the - // cost model decides under *optimistic* coverage. Optimistic is what lets us - // skip the dataset open on a clearly-CSR traversal: real coverage can only - // make the indexed path costlier, so if even a perfectly-indexed scan loses - // to CSR here, it loses for real. - let forced = traversal_indexed_override(); - let lean_indexed = match forced { - Some(v) => v, - None => match gather_cost_inputs( - snapshot, - catalog, - edge_type, - direction, - frontier_rows, - effective_max_hops, - crate::table_store::IndexCoverage::Indexed, - graph_index.is_built(), - ) { - Some(inputs) => choose_expand_mode(&inputs) == ExpandMode::IndexedScan, - // Manifest counts absent (e.g. not-yet-materialized table): fall back - // to the legacy frontier/hop ceiling so the decision is defined. - None => { - frontier_rows <= expand_indexed_max_frontier() - && effective_max_hops <= expand_indexed_max_hops() - } - }, - }; - - if !lean_indexed { - tracing::debug!( - target: "omnigraph::traverse", - edge = %edge_type, - frontier = frontier_rows, - hops = effective_max_hops, - mode = "csr", - "expand mode chosen", - ); - let gi = graph_index.get().await?.ok_or_else(|| { - OmniError::manifest("graph index required for CSR traversal".to_string()) - })?; - return execute_expand_csr( - wide, gi, snapshot, catalog, src_var, dst_var, edge_type, direction, dst_type, - min_hops, max_hops, dst_filters, params, - ) - .await; - } - - // Leaning indexed: open the edge dataset once, confirm real coverage, and - // (unless forced) re-decide with it. The opened dataset is threaded into the - // indexed path so it is never opened twice. 
- let edge_ds = snapshot.open(&edge_table_key).await?; - let coverage = - crate::table_store::TableStore::key_column_index_coverage(&edge_ds, key_col).await; - - if forced.is_none() { - if let Some(inputs) = gather_cost_inputs( - snapshot, - catalog, - edge_type, - direction, - frontier_rows, - effective_max_hops, - coverage_for_decision(&coverage), - graph_index.is_built(), - ) { - if choose_expand_mode(&inputs) == ExpandMode::Csr { - tracing::debug!( - target: "omnigraph::traverse", - edge = %edge_type, - frontier = frontier_rows, - hops = effective_max_hops, - mode = "csr", - reason = "index coverage degraded", - "expand mode chosen", - ); - let gi = graph_index.get().await?.ok_or_else(|| { - OmniError::manifest("graph index required for CSR traversal".to_string()) - })?; - return execute_expand_csr( - wide, gi, snapshot, catalog, src_var, dst_var, edge_type, direction, dst_type, - min_hops, max_hops, dst_filters, params, - ) - .await; - } - } - } - - tracing::debug!( - target: "omnigraph::traverse", - edge = %edge_type, - frontier = frontier_rows, - hops = effective_max_hops, - mode = "indexed", - "expand mode chosen", - ); - // Surface the C6 silent scalar-index fallback once, now that coverage is known. - warn_on_degraded_coverage(&coverage, key_col, edge_type); - execute_expand_indexed( - wide, snapshot, catalog, src_var, dst_var, edge_type, direction, dst_type, min_hops, - max_hops, dst_filters, params, edge_ds, - ) - .await -} - -/// BTREE-indexed graph traversal: per hop, batch the current frontier into one -/// `scan_edges_by_endpoint` call against the persisted src/dst index, then fan -/// out per source row. Cost scales with the frontier, not |E|. Produces the -/// same `(src_row, dst_id)` pairs as the CSR path and shares its hydrate+align -/// tail. Multi-hop only advances for same-type edges; cross-type frontiers go -/// empty after one hop (no edges key off the destination type), matching CSR. 
-async fn execute_expand_indexed( - wide: &mut RecordBatch, - snapshot: &Snapshot, - catalog: &Catalog, - src_var: &str, - dst_var: &str, - edge_type: &str, - direction: Direction, - dst_type: &str, - min_hops: u32, - max_hops: Option, - dst_filters: &[IRFilter], - params: &ParamMap, - edge_ds: Dataset, -) -> Result<()> { - let src_id_col_name = format!("{}.id", src_var); - let src_ids = wide - .column_by_name(&src_id_col_name) - .ok_or_else(|| { - OmniError::manifest(format!("wide batch missing '{}' column", src_id_col_name)) - })? - .as_any() - .downcast_ref::() - .ok_or_else(|| OmniError::manifest(format!("'{}' column is not Utf8", src_id_col_name)))? - .clone(); - - let edge_def = catalog - .edge_types - .get(edge_type) - .ok_or_else(|| OmniError::manifest(format!("unknown edge type '{}'", edge_type)))?; - let same_type = edge_def.from_type == edge_def.to_type; - // The keyed/opposite endpoint columns for this direction. The edge dataset - // and the C6 coverage warn are owned by the caller (`execute_expand`), which - // opens the dataset once and threads it in. - let (key_col, opp_col) = endpoint_columns(direction); - - let max = max_hops.unwrap_or(min_hops.max(1)); - // Cross-type edges cannot chain (a Company is not a `WorksAt` source), so a - // variable-length traversal over one is structurally single-hop. Enforce it - // here instead of relying on the hop-2 scan returning empty: this BFS interns - // every endpoint string into ONE dense id space, so a cross-type id-string - // collision (a Person and a Company sharing an id) would otherwise let hop 2 - // de-intern a destination id back to the colliding source-type id and match - // its edges, emitting rows the CSR path never produces. 
- let max = if same_type { max } else { max.min(1) }; - - // Per-source BFS state in DENSE id space: intern node ids to u32 once via a - // per-traversal interner so visited/seen/frontier/neighbor-map avoid string - // hashing + cloning in the hot loop (mirrors the CSR path's TypeIndex). The - // GraphIndex/CSR is NOT built — only a local id↔u32 dictionary. Strings - // survive at the substrate edges only: the per-hop IN-list to Lance, and the - // emitted dst ids handed to the string-keyed hydrate+align tail. - let mut interner = crate::graph_index::TypeIndex::new(); - let n = src_ids.len(); - let mut frontiers: Vec> = Vec::with_capacity(n); - let mut visited: Vec> = Vec::with_capacity(n); - let mut seen_dst: Vec> = Vec::with_capacity(n); - for i in 0..n { - let sid = interner.get_or_insert(src_ids.value(i)); - let mut v = HashSet::new(); - if same_type { - v.insert(sid); - } - frontiers.push(vec![sid]); - visited.push(v); - seen_dst.push(HashSet::new()); - } - - let mut src_indices: Vec = Vec::new(); - let mut dst_dense: Vec = Vec::new(); - - for hop in 1..=max { - // Union of all live frontiers (dense), de-interned once for the IN-list. - let mut union_dense: Vec = Vec::new(); - { - let mut seen: HashSet = HashSet::new(); - for f in &frontiers { - for &node in f { - if seen.insert(node) { - union_dense.push(node); - } - } - } - } - if union_dense.is_empty() { - break; - } - let union_keys: Vec = union_dense - .iter() - .map(|&u| { - interner - .to_id(u) - .expect("interned frontier id must resolve") - .to_string() - }) - .collect(); - - let batches = crate::table_store::TableStore::scan_edges_by_endpoint( - &edge_ds, key_col, opp_col, &union_keys, - ) - .await?; - - // dense key -> dense neighbors (scan order; duplicates preserved, like CSR multi-edges). - let mut neighbor_map: HashMap> = HashMap::new(); - for batch in &batches { - let keys = batch - .column_by_name(key_col) - .ok_or_else(|| OmniError::manifest(format!("edge batch missing '{}'", key_col)))?
- .as_any() - .downcast_ref::() - .ok_or_else(|| OmniError::manifest(format!("edge '{}' is not Utf8", key_col)))?; - let opps = batch - .column_by_name(opp_col) - .ok_or_else(|| OmniError::manifest(format!("edge batch missing '{}'", opp_col)))? - .as_any() - .downcast_ref::() - .ok_or_else(|| OmniError::manifest(format!("edge '{}' is not Utf8", opp_col)))?; - for r in 0..batch.num_rows() { - let k = interner.get_or_insert(keys.value(r)); - let o = interner.get_or_insert(opps.value(r)); - neighbor_map.entry(k).or_default().push(o); - } - } - - // Advance each source row's frontier independently (dense ids). - for i in 0..n { - let cur = std::mem::take(&mut frontiers[i]); - let mut next: Vec = Vec::new(); - for &node in &cur { - let Some(neighbors) = neighbor_map.get(&node) else { - continue; - }; - for &neighbor in neighbors { - if !same_type || visited[i].insert(neighbor) { - next.push(neighbor); - if hop >= min_hops && seen_dst[i].insert(neighbor) { - src_indices.push(i as u32); - dst_dense.push(neighbor); - } - } - } - } - frontiers[i] = next; - } - } - - // De-intern emitted destination ids (parallel to src_indices) for the - // string-keyed hydrate+align tail, exactly as the CSR path does. - let dst_ids: Vec = dst_dense - .iter() - .map(|&d| { - interner - .to_id(d) - .expect("interned dst id must resolve") - .to_string() - }) - .collect(); - - expand_hydrate_and_align( - wide, src_indices, dst_ids, snapshot, catalog, dst_type, dst_var, dst_filters, params, - ) - .await -} - -/// Shared tail for both Expand modes: hydrate the unique destination ids, align -/// the `(src_row, dst_id)` pairs back onto `wide`, hconcat, and apply -/// non-pushable destination filters in memory. 
-async fn expand_hydrate_and_align( - wide: &mut RecordBatch, - src_indices: Vec, - dst_ids: Vec, - snapshot: &Snapshot, - catalog: &Catalog, - dst_type: &str, - dst_var: &str, - dst_filters: &[IRFilter], - params: &ParamMap, -) -> Result<()> { - // Pushable destination filters are applied by `hydrate_nodes`; the rest - // (`ir_filter_to_expr` → None) are applied in memory after hconcat. The - // schema arg only affects a pushable literal's TYPE, never Some-vs-None, so - // `None` here yields the same pushable/non-pushable split as `hydrate_nodes`. - let non_pushable: Vec<&IRFilter> = dst_filters - .iter() - .filter(|f| ir_filter_to_expr(f, params, None).is_none()) - .collect(); - - // Unique destination ids (first-seen order) for one batched hydration. - let mut unique_dst_list: Vec = Vec::new(); - { - let mut seen: HashSet<&str> = HashSet::with_capacity(dst_ids.len()); - for id in &dst_ids { - if seen.insert(id.as_str()) { - unique_dst_list.push(id.clone()); - } - } - } - let dst_batch = - hydrate_nodes(snapshot, catalog, dst_type, &unique_dst_list, dst_filters, params).await?; - - // id -> row index in the hydrated batch. - let dst_batch_id_col = dst_batch - .column_by_name("id") - .ok_or_else(|| OmniError::manifest("hydrated batch missing 'id' column".to_string()))? - .as_any() - .downcast_ref::() - .ok_or_else(|| OmniError::manifest("hydrated 'id' column is not Utf8".to_string()))?; - let mut id_to_row: HashMap<&str, u32> = HashMap::with_capacity(dst_batch_id_col.len()); - for row in 0..dst_batch_id_col.len() { - id_to_row.insert(dst_batch_id_col.value(row), row as u32); - } - - // Align pairs to (src_row, hydrated_dst_row), dropping ids hydration filtered out. 
- let mut final_src_indices: Vec = Vec::with_capacity(src_indices.len()); - let mut dst_indices: Vec = Vec::with_capacity(src_indices.len()); - for (&src_idx, dst_id) in src_indices.iter().zip(dst_ids.iter()) { - if let Some(&dst_row) = id_to_row.get(dst_id.as_str()) { - final_src_indices.push(src_idx); - dst_indices.push(dst_row); - } - } - - let src_take = UInt32Array::from(final_src_indices); - let dst_take = UInt32Array::from(dst_indices); - let expanded_wide = take_batch(wide, &src_take)?; - let dst_prefixed = prefix_batch(&dst_batch, dst_var)?; - let aligned_dst = take_batch(&dst_prefixed, &dst_take)?; - *wide = hconcat_batches(&expanded_wide, &aligned_dst)?; - - for f in &non_pushable { - apply_filter(wide, f, params)?; - } - Ok(()) -} - -/// CSR-backed graph traversal: BFS over the in-memory adjacency index. Used for -/// dense / whole-graph traversals; selective traversals use -/// `execute_expand_indexed`. Both share `expand_hydrate_and_align`. -async fn execute_expand_csr( wide: &mut RecordBatch, graph_index: &GraphIndex, snapshot: &Snapshot, @@ -1481,9 +742,6 @@ async fn execute_expand_csr( let max = max_hops.unwrap_or(min_hops.max(1)); let same_type = src_type_name == dst_type_name; - // Cross-type edges cannot chain; a variable-length traversal over one is - // structurally single-hop (mirrors the indexed path's guarantee). - let max = if same_type { max } else { max.min(1) }; // BFS to collect (src_row_idx, dst_dense) pairs with per-source dedup. // Dense u32 ids stay in hand through BFS, dedup, and align β€” we only @@ -1527,52 +785,88 @@ async fn execute_expand_csr( } } - // Map BFS-produced dense destination ids to string ids for the shared - // hydrate+align tail. Dense ids always resolve (they came from the index); - // drop any that don't, keeping the (src, dst) arrays parallel. 
- let mut tail_src_indices: Vec = Vec::with_capacity(src_indices.len()); - let mut dst_ids: Vec = Vec::with_capacity(dst_dense_list.len()); - for (&s, &d) in src_indices.iter().zip(dst_dense_list.iter()) { - if let Some(id) = dst_type_idx.to_id(d) { - tail_src_indices.push(s); - dst_ids.push(id.to_string()); + // Split dst_filters: SQL-pushable go to Lance, the rest applied post-hconcat + let pushdown_sql = build_lance_filter(dst_filters, params); + let non_pushable: Vec<&IRFilter> = dst_filters + .iter() + .filter(|f| ir_filter_to_sql(f, params).is_none()) + .collect(); + + // Dedup dst dense ids globally across source rows, then stringify once + // for the Lance IN-list. The post-hydrate alignment fans rows back out to + // the original (src, dst) pairs via a dense-indexed lookup below. + let mut unique_dst_list: Vec = Vec::new(); + { + let mut seen: HashSet = HashSet::with_capacity(dst_dense_list.len()); + for &d in &dst_dense_list { + if seen.insert(d) { + if let Some(id) = dst_type_idx.to_id(d) { + unique_dst_list.push(id.to_string()); + } + } } } - - expand_hydrate_and_align( - wide, - tail_src_indices, - dst_ids, + let dst_batch = hydrate_nodes( snapshot, catalog, dst_type, - dst_var, - dst_filters, - params, + &unique_dst_list, + pushdown_sql.as_deref(), ) - .await + .await?; + + // Build dense → row-in-hydrated-batch via a direct-indexed array. + let dst_batch_id_col = dst_batch + .column_by_name("id") + .ok_or_else(|| OmniError::manifest("hydrated batch missing 'id' column".to_string()))? 
+ .as_any() + .downcast_ref::() + .ok_or_else(|| OmniError::manifest("hydrated 'id' column is not Utf8".to_string()))?; + let mut dense_to_row: Vec> = vec![None; dst_type_idx.len()]; + for row in 0..dst_batch_id_col.len() { + let id_str = dst_batch_id_col.value(row); + if let Some(dense) = dst_type_idx.to_dense(id_str) { + dense_to_row[dense as usize] = Some(row as u32); + } + } + + // Build aligned src/dst index arrays (only for ids that exist in hydrated batch) + let mut final_src_indices: Vec = Vec::new(); + let mut dst_indices: Vec = Vec::new(); + for (src_idx, dst_dense) in src_indices.iter().zip(dst_dense_list.iter()) { + if let Some(dst_row) = dense_to_row[*dst_dense as usize] { + final_src_indices.push(*src_idx); + dst_indices.push(dst_row); + } + } + + let src_take = UInt32Array::from(final_src_indices); + let dst_take = UInt32Array::from(dst_indices); + let expanded_wide = take_batch(wide, &src_take)?; + let dst_prefixed = prefix_batch(&dst_batch, dst_var)?; + let aligned_dst = take_batch(&dst_prefixed, &dst_take)?; + *wide = hconcat_batches(&expanded_wide, &aligned_dst)?; + + // Apply any non-pushable destination filters (e.g. list-contains) in memory + for f in &non_pushable { + apply_filter(wide, f, params)?; + } + + Ok(()) } /// Load full node rows for a set of IDs from a snapshot. /// -/// The `id IN (...)` predicate is built as a structured DataFusion `Expr` and -/// AND'd with any pushable `dst_filters` (destination-binding filters), then -/// applied via `Scanner::filter_expr`. The structured form routes the id -/// IN-list through the `id` BTREE scalar index (index-search → take) rather -/// than evaluating a string filter via DataFusion `InListEval`, which is -/// O(N×M) and was measured at 72× the indexed cost on a 100k-node hop -/// (MR-376). Non-pushable `dst_filters` (`ir_filter_to_expr` → None) are -/// applied in memory by the caller after hydration. 
+/// When `extra_filter_sql` is provided (from deferred destination-binding +/// filters), it is ANDed with the `id IN (...)` clause so that Lance can +/// skip non-matching rows at the storage level. async fn hydrate_nodes( snapshot: &Snapshot, catalog: &Catalog, type_name: &str, ids: &[String], - dst_filters: &[IRFilter], - params: &ParamMap, + extra_filter_sql: Option<&str>, ) -> Result { - use datafusion::prelude::{col, lit}; - let node_type = catalog .node_types .get(type_name) @@ -1585,14 +879,15 @@ async fn hydrate_nodes( let table_key = format!("node:{}", type_name); let ds = snapshot.open(&table_key).await?; - // `id IN (ids)` AND any pushable destination filters, as a structured Expr. - let id_list: Vec = ids.iter().map(|id| lit(id.clone())).collect(); - let mut filter_expr = col("id").in_list(id_list, false); - if let Some(dst_expr) = build_lance_filter_expr(dst_filters, params, Some(&node_type.arrow_schema)) - { - filter_expr = filter_expr.and(dst_expr); + // Build filter: id IN ('a', 'b', 'c') + let escaped: Vec = ids + .iter() + .map(|id| format!("'{}'", id.replace('\'', "''"))) + .collect(); + let mut filter_sql = format!("id IN ({})", escaped.join(", ")); + if let Some(extra) = extra_filter_sql { + filter_sql = format!("({}) AND ({})", filter_sql, extra); } - let has_blobs = !node_type.blob_properties.is_empty(); let non_blob_cols: Vec<&str> = node_type .arrow_schema @@ -1602,16 +897,12 @@ async fn hydrate_nodes( .map(|f| f.name().as_str()) .collect(); let projection = has_blobs.then_some(non_blob_cols.as_slice()); - let batches = crate::table_store::TableStore::scan_stream_with( + let batches = crate::table_store::TableStore::scan_stream( &ds, projection, - None, + Some(&filter_sql), None, false, - |scanner| { - scanner.filter_expr(filter_expr); - Ok(()) - }, ) .await? 
.try_collect::>() @@ -1634,25 +925,6 @@ async fn hydrate_nodes( Ok(scan_result) } -/// Whether the inner pipeline is the bulk-anti-join shape: a single Expand from -/// the outer var with no destination filters (the only shape the CSR -/// `has_neighbors` fast path can serve). Pure β€” it does not touch the CSR β€” so -/// the caller can decide whether to realize the O(|E|) graph index at all. -fn bulk_anti_join_applies(inner_pipeline: &[IROp], outer_var: &str) -> bool { - matches!( - inner_pipeline, - [IROp::Expand { src_var, dst_filters, min_hops, max_hops, .. }] - if src_var == outer_var - && dst_filters.is_empty() - // `has_neighbors` is a ONE-hop existence test, so the fast path - // is valid only for a single-hop expand. Multi-hop negations - // (e.g. `not { $p knows{2,2} $x }`) fall to the slow path, whose - // inner Expand runs the real bounded traversal. - && *min_hops == 1 - && (*max_hops).unwrap_or(1) == 1 - ) -} - /// Try bulk anti-join via CSR existence check. Returns Some(mask) if the inner /// pipeline is a single Expand from outer_var (the common negation pattern). fn try_bulk_anti_join_mask( @@ -1662,17 +934,27 @@ fn try_bulk_anti_join_mask( catalog: &Catalog, outer_var: &str, ) -> Option { - if !bulk_anti_join_applies(inner_pipeline, outer_var) { + if inner_pipeline.len() != 1 { return None; } let IROp::Expand { + src_var, edge_type, direction, + dst_filters, .. } = &inner_pipeline[0] else { return None; }; + if src_var != outer_var { + return None; + } + // Bulk CSR check only tests neighbor existence, not destination + // properties. Fall back to the slow path when dst_filters are present. 
+ if !dst_filters.is_empty() { + return None; + } let gi = graph_index?; let edge_def = catalog.edge_types.get(edge_type.as_str())?; @@ -1711,106 +993,49 @@ async fn execute_anti_join( inner_pipeline: &[IROp], params: &ParamMap, snapshot: &Snapshot, - graph_index: &GraphIndexHandle<'_>, + graph_index: Option<&GraphIndex>, catalog: &Catalog, outer_var: &str, ) -> Result<()> { - // Only the bulk fast path consumes the CSR; the slow path's inner Expand - // chooses its own access path. Realize the O(|E|) graph index ONLY when the - // inner-pipeline shape qualifies for the bulk check β€” a filtered/nested - // anti-join over a large graph must not pay a whole-graph build it won't use. - let gi = if bulk_anti_join_applies(inner_pipeline, outer_var) { - graph_index.get().await? - } else { - None - }; // Fast path: bulk CSR existence check (O(N), zero Lance I/O) - if let Some(mask) = try_bulk_anti_join_mask(wide, inner_pipeline, gi, catalog, outer_var) { + if let Some(mask) = + try_bulk_anti_join_mask(wide, inner_pipeline, graph_index, catalog, outer_var) + { *wide = arrow_select::filter::filter_record_batch(wide, &mask) .map_err(|e| OmniError::Lance(e.to_string()))?; return Ok(()); } - // Slow path (filtered / non-bulk inner): run the inner pipeline ONCE over the - // whole frontier β€” a set-oriented anti-semi-join β€” instead of row-by-row. - // Each outer row is tagged with a synthetic index; an outer row matches iff - // it produced at least one surviving inner row. No per-row dispatch, so the - // inner Expand runs as a single set-at-a-time traversal over the full - // frontier (its own chooser picks indexed vs CSR) rather than one Lance scan - // per outer row. 
+ // Slow path: per-row inner pipeline execution let num_rows = wide.num_rows(); - if num_rows == 0 { - return Ok(()); - } + let mut keep_mask = vec![true; num_rows]; - // The tag rides through the inner pipeline: Expand's hconcat preserves - // existing columns and Filter only drops rows, so each surviving row carries - // its originating outer-row index. Correlating on the row index (not - // `outer_var.id`) stays correct even if a dst-filter references other outer - // bindings. Nested anti-joins reuse this slow path and an enclosing tag rides - // through too; Arrow allows duplicate field names and `column_by_name` - // returns the FIRST match, so choose a tag name not already present (each - // nesting level then reads its own) instead of a fixed one. - let tag_col: String = { - let mut n = 0usize; - loop { - let candidate = format!("__antijoin_outer_row_{n}"); - if wide.schema().column_with_name(&candidate).is_none() { - break candidate; - } - n += 1; - } - }; - let mut fields: Vec = wide - .schema() - .fields() - .iter() - .map(|f| f.as_ref().clone()) - .collect(); - fields.push(Field::new(tag_col.as_str(), DataType::UInt32, false)); - let mut columns: Vec = wide.columns().to_vec(); - columns.push(Arc::new(UInt32Array::from_iter_values(0..num_rows as u32))); - let tagged = RecordBatch::try_new(Arc::new(Schema::new(fields)), columns) - .map_err(|e| OmniError::Lance(e.to_string()))?; + for i in 0..num_rows { + let single_row = wide.slice(i, 1); + let mut inner_wide: Option = Some(single_row); - let mut inner_wide: Option = Some(tagged); - let no_search = SearchMode::default(); - execute_pipeline( - inner_pipeline, - params, - snapshot, - graph_index, - catalog, - &mut inner_wide, - &no_search, - ) - .await?; + let no_search = SearchMode::default(); + execute_pipeline( + inner_pipeline, + params, + snapshot, + graph_index, + catalog, + &mut inner_wide, + &no_search, + ) + .await?; - // Outer rows whose tag survived have >= 1 match. 
A produced-but-untagged - // batch means the inner pipeline dropped the correlation column β€” fail loudly - // rather than silently keeping every row (which would corrupt the anti-join). - let mut matched: HashSet = HashSet::new(); - if let Some(batch) = inner_wide { - if batch.num_rows() > 0 { - let tags = batch - .column_by_name(tag_col.as_str()) - .ok_or_else(|| { - OmniError::manifest( - "anti-join inner pipeline dropped the correlation column".to_string(), - ) - })? - .as_any() - .downcast_ref::() - .ok_or_else(|| { - OmniError::manifest(format!("'{}' column is not UInt32", tag_col)) - })?; - for i in 0..tags.len() { - matched.insert(tags.value(i)); - } + let has_match = inner_wide + .as_ref() + .map(|batch| batch.num_rows() > 0) + .unwrap_or(false); + + if has_match { + keep_mask[i] = false; } } - let keep_mask: Vec = (0..num_rows as u32).map(|i| !matched.contains(&i)).collect(); let mask = BooleanArray::from(keep_mask); *wide = arrow_select::filter::filter_record_batch(wide, &mask) .map_err(|e| OmniError::Lance(e.to_string()))?; @@ -1830,23 +1055,21 @@ async fn execute_node_scan( let table_key = format!("node:{}", type_name); let ds = snapshot.open(&table_key).await?; - let node_type = &catalog.node_types[type_name]; - // Lower the IR filters to a DataFusion `Expr` and apply via // `Scanner::filter_expr` inside the configure closure. The string // pushdown path (`build_lance_filter` β†’ `scanner.filter(&str)`) is // gone for node scans β€” structured Expr unlocks `CompOp::Contains` // pushdown (via `array_has`) and lets DF 53's optimizer rules // (vectorized IN-list, PhysicalExprSimplifier, CASE-NULL shortcut) - // reach our predicates. Passing the node's `arrow_schema` lets the lowering - // coerce literals to each column's exact type so narrow-numeric BTREEs are - // used. Other call sites that still take string SQL (count_rows, the - // mutation delete path) migrate in follow-up MRs. 
- let filter_expr = build_lance_filter_expr(filters, params, Some(&node_type.arrow_schema)); + // reach our predicates. Other call sites that still take string SQL + // (hydrate_nodes for the Expand pushdown, count_rows, the mutation + // delete path) migrate in follow-up MRs. + let filter_expr = build_lance_filter_expr(filters, params); // Blob columns must be excluded from scan when a filter is present // (Lance bug: BlobsDescriptions + filter triggers a projection assertion). // We exclude blob columns and add metadata post-scan via take_blobs_by_indices. + let node_type = &catalog.node_types[type_name]; let has_blobs = !node_type.blob_properties.is_empty(); let non_blob_cols: Vec<&str> = node_type .arrow_schema @@ -1963,6 +1186,45 @@ fn add_null_blob_columns( .map_err(|e| OmniError::Lance(e.to_string())) } +/// Convert IR filters to a Lance SQL filter string. +fn build_lance_filter(filters: &[IRFilter], params: &ParamMap) -> Option { + if filters.is_empty() { + return None; + } + + let parts: Vec = filters + .iter() + .filter_map(|f| ir_filter_to_sql(f, params)) + .collect(); + + if parts.is_empty() { + return None; + } + + Some(parts.join(" AND ")) +} + +fn ir_filter_to_sql(filter: &IRFilter, params: &ParamMap) -> Option { + // Search predicates (search/fuzzy/match_text = true) are NOT converted to SQL. + // They are handled via scanner.full_text_search() in execute_node_scan. + if is_search_filter(filter) { + return None; + } + + let left = ir_expr_to_sql(&filter.left, params)?; + let right = ir_expr_to_sql(&filter.right, params)?; + let op = match filter.op { + CompOp::Eq => "=", + CompOp::Ne => "!=", + CompOp::Gt => ">", + CompOp::Lt => "<", + CompOp::Ge => ">=", + CompOp::Le => "<=", + CompOp::Contains => return None, // Can't pushdown list contains + }; + Some(format!("{} {} {}", left, op, right)) +} + /// Build a FullTextSearchQuery from a search IR expression. 
fn build_fts_query( expr: &IRExpr, @@ -2035,6 +1297,15 @@ fn resolve_to_int(expr: &IRExpr, params: &ParamMap) -> Option { } } +fn ir_expr_to_sql(expr: &IRExpr, params: &ParamMap) -> Option { + match expr { + IRExpr::PropAccess { property, .. } => Some(property.clone()), + IRExpr::Literal(lit) => Some(literal_to_sql(lit)), + IRExpr::Param(name) => params.get(name).map(literal_to_sql), + _ => None, + } +} + pub(super) fn literal_to_sql(lit: &Literal) -> String { match lit { Literal::Null => "NULL".to_string(), @@ -2065,24 +1336,23 @@ pub(super) fn literal_to_sql(lit: &Literal) -> String { // // Search predicates (`is_search_filter`) are still handled separately via // `scanner.full_text_search(...)`, not via filter_expr β€” they stay None -// here (search predicates are never lowered to a scalar filter). The -// `literal_to_sql` path remains because the mutation/update layer -// (`exec/mutation.rs`) still produces SQL strings for `Dataset::delete(&str)`; -// that migration is MR-A's territory (Lance #6658 + delete two-phase). +// here just like in `ir_filter_to_sql`. The `literal_to_sql` path remains +// because the mutation/update layer (`exec/mutation.rs`) still produces +// SQL strings for `Dataset::delete(&str)`; that migration is MR-A's +// territory (Lance #6658 + delete two-phase). /// Convert IR filters to a single DataFusion `Expr` (AND-joined), or /// `None` if no filter is pushable. 
pub(super) fn build_lance_filter_expr( filters: &[IRFilter], params: &ParamMap, - schema: Option<&Schema>, ) -> Option { use datafusion::logical_expr::Operator; use datafusion::prelude::Expr; let mut acc: Option = None; for f in filters { - let Some(e) = ir_filter_to_expr(f, params, schema) else { + let Some(e) = ir_filter_to_expr(f, params) else { continue; }; acc = Some(match acc { @@ -2103,7 +1373,6 @@ pub(super) fn build_lance_filter_expr( pub(super) fn ir_filter_to_expr( filter: &IRFilter, params: &ParamMap, - schema: Option<&Schema>, ) -> Option { use datafusion::functions_nested::expr_fn::array_has; @@ -2112,24 +1381,16 @@ pub(super) fn ir_filter_to_expr( } // List-contains: `prop CONTAINS value` lowers to `array_has(prop, value)`. - // This is the case the old SQL-string pushdown had to return None for - // ("Can't pushdown list contains"); with structured Expr it pushes down fine. - // (Element-type coercion for the contained value is deferred β€” list columns - // are not scalar-indexed, so the index-eligibility concern below does not apply.) + // This is the case `ir_filter_to_sql` had to return None for ("Can't + // pushdown list contains"); with structured Expr it pushes down fine. if matches!(filter.op, CompOp::Contains) { - let left = ir_expr_to_expr(&filter.left, params, None)?; - let right = ir_expr_to_expr(&filter.right, params, None)?; + let left = ir_expr_to_expr(&filter.left, params)?; + let right = ir_expr_to_expr(&filter.right, params)?; return Some(array_has(left, right)); } - // A literal/param operand is coerced to the OTHER operand's column type so - // the predicate stays a direct `col OP literal` and the scalar index is used. - // Without this, DataFusion widens a narrow column (`CAST(col AS Int64)`), - // which defeats the BTREE (validated by `probe_scalar_index_use_under_literal_type`). 
- let left_col_type = prop_data_type(&filter.left, schema); - let right_col_type = prop_data_type(&filter.right, schema); - let left = ir_expr_to_expr(&filter.left, params, right_col_type.as_ref())?; - let right = ir_expr_to_expr(&filter.right, params, left_col_type.as_ref())?; + let left = ir_expr_to_expr(&filter.left, params)?; + let right = ir_expr_to_expr(&filter.right, params)?; Some(match filter.op { CompOp::Eq => left.eq(right), CompOp::Ne => left.not_eq(right), @@ -2147,95 +1408,19 @@ pub(super) fn ir_filter_to_expr( pub(super) fn ir_expr_to_expr( expr: &IRExpr, params: &ParamMap, - target: Option<&arrow_schema::DataType>, ) -> Option { - use datafusion::prelude::ident; + use datafusion::prelude::{col, lit}; match expr { - // #283: `ident()` preserves the identifier's case. `col()` would route - // through SQL identifier normalization and lowercase an unquoted - // camelCase column (`repoName` β†’ `reponame`), which then fails to - // resolve against the case-sensitive Lance/Arrow schema. - IRExpr::PropAccess { property, .. } => Some(ident(property)), - IRExpr::Literal(l) => literal_to_expr_coerced(l, target), - IRExpr::Param(name) => params - .get(name) - .and_then(|l| literal_to_expr_coerced(l, target)), + IRExpr::PropAccess { property, .. } => Some(col(property)), + IRExpr::Literal(l) => literal_to_expr(l), + IRExpr::Param(name) => params.get(name).and_then(literal_to_expr), _ => None, } } -/// The Arrow type of a `PropAccess` operand, looked up in the scan's schema, or -/// `None` if the expr is not a column or the schema/field is unavailable. -fn prop_data_type(expr: &IRExpr, schema: Option<&Schema>) -> Option { - match expr { - IRExpr::PropAccess { property, .. } => schema? - .field_with_name(property) - .ok() - .map(|f| f.data_type().clone()), - _ => None, - } -} - -/// Lower a literal for pushdown, coercing it to `target` (the comparison -/// column's Arrow type) when known. 
Falls back to the natural-type -/// `literal_to_expr` on a missing target or any coercion failure, so a filter is -/// never demoted to `None` by coercion (a node scan has no in-memory fallback for -/// inline filters β€” see `execute_node_scan`). -fn literal_to_expr_coerced( - lit: &Literal, - target: Option<&arrow_schema::DataType>, -) -> Option { - if let Some(target) = target { - if let Some(e) = literal_to_typed_expr(lit, target) { - return Some(e); - } - } - literal_to_expr(lit) -} - -/// Build a literal as a typed Arrow scalar matching `target`, reusing the same -/// `literal_to_array` + `arrow_cast` path as the in-memory arm -/// (`projection.rs::evaluate_filter`) so the two arms agree. Returns `None` on -/// any failure (unbuildable literal, incompatible cast) β€” the caller then falls -/// back to the natural-type literal. -/// -/// Lossless-only for integer targets: typecheck permits numeric cross-type -/// comparisons (`types_compatible`), so a fractional float or out-of-range -/// integer can reach here. Casting those to a narrower integer would truncate -/// (`2.7 -> 2`) or overflow to null, silently changing which rows match. We -/// round-trip the cast and, on mismatch, return `None` so the caller keeps the -/// natural literal β€” correct via DataFusion coercion, the index just goes unused -/// for that out-of-domain predicate. Float targets are exempt: narrowing -/// `F64 -> F32` is the column's own precision domain, not a value error. 
-fn literal_to_typed_expr( - lit: &Literal, - target: &arrow_schema::DataType, -) -> Option { - use datafusion::prelude::lit as df_lit; - use datafusion::scalar::ScalarValue; - - let arr = super::projection::literal_to_array(lit, 1).ok()?; - if arr.data_type() == target { - return Some(df_lit(ScalarValue::try_from_array(&arr, 0).ok()?)); - } - let casted = arrow_cast::cast::cast(&arr, target).ok()?; - if target.is_integer() { - let back = arrow_cast::cast::cast(&casted, arr.data_type()).ok()?; - let original = ScalarValue::try_from_array(&arr, 0).ok()?; - let round_tripped = ScalarValue::try_from_array(&back, 0).ok()?; - if original != round_tripped { - return None; - } - } - Some(df_lit(ScalarValue::try_from_array(&casted, 0).ok()?)) -} - -/// Convert a Literal to a DataFusion `Expr` in its NATURAL Arrow type. This is -/// the fallback used when the comparison column's type is unknown (no schema) or -/// when coercion to it fails; the typed, column-matched coercion that keeps -/// scalar indexes usable lives in `literal_to_typed_expr`. Returns `None` for -/// List (the SQL path also could not pushdown it β€” falls through to post-scan -/// in-memory application). +/// Convert a Literal to a DataFusion `Expr`. Returns `None` for List +/// (which the existing SQL path also can't pushdown β€” falls through to +/// post-scan in-memory application). fn literal_to_expr(lit: &Literal) -> Option { use datafusion::prelude::lit as df_lit; Some(match lit { @@ -2244,12 +1429,9 @@ fn literal_to_expr(lit: &Literal) -> Option { Literal::Integer(n) => df_lit(*n), Literal::Float(f) => df_lit(*f), Literal::Bool(b) => df_lit(*b), - // Date/DateTime pass through as strings here. Against a typed Date - // column DataFusion casts the LITERAL (`CAST(Utf8 AS Date32)`), which is - // index-safe (proven by `scalar_index_use_requires_matched_literal_type`). 
- // At real pushdown sites the schema is known, so `literal_to_typed_expr` - // produces a typed Date32/Date64 anyway; this branch is only the - // no-schema fallback. + // Date/DateTime stored as strings; pass through as string literals + // β€” Lance/DataFusion handles the comparison against typed columns + // via implicit cast, matching the existing string-SQL behavior. Literal::Date(s) => df_lit(s.clone()), Literal::DateTime(s) => df_lit(s.clone()), Literal::List(_) => return None, @@ -2335,386 +1517,3 @@ fn take_batch(batch: &RecordBatch, indices: &UInt32Array) -> Result .map_err(|e| OmniError::Lance(e.to_string()))?; RecordBatch::try_new(batch.schema(), columns).map_err(|e| OmniError::Lance(e.to_string())) } - -#[cfg(test)] -mod expand_chooser_tests { - use super::*; - use crate::table_store::IndexCoverage; - - /// Build cost inputs with generous hard caps, so the cost comparison (not a - /// ceiling) is what the assertions exercise unless a test sets one on purpose. - fn inputs( - frontier_rows: usize, - edge_count: u64, - src_node_count: u64, - effective_max_hops: u32, - coverage: IndexCoverage, - ) -> ExpandCostInputs { - ExpandCostInputs { - frontier_rows, - edge_count, - src_node_count, - effective_max_hops, - max_hops_cap: 6, - max_frontier_cap: 1024, - coverage, - csr_cached: false, - } - } - - #[test] - fn selective_frontier_on_large_graph_picks_indexed() { - // 50 source rows against 1M source vertices, one hop: tiny selectivity β€” - // the PR #149 win the chooser must preserve. - let m = choose_expand_mode(&inputs(50, 10_000_000, 1_000_000, 1, IndexCoverage::Indexed)); - assert_eq!(m, ExpandMode::IndexedScan); - } - - #[test] - fn flat_in_edge_count_same_selectivity_same_choice() { - // Same selectivity (frontier/|V_src|), 1000Γ— difference in |E|. Indexed - // cost is independent of |E|, so the choice must not flip. 
- let small = choose_expand_mode(&inputs(50, 100_000, 1_000_000, 1, IndexCoverage::Indexed)); - let huge = - choose_expand_mode(&inputs(50, 100_000_000, 1_000_000, 1, IndexCoverage::Indexed)); - assert_eq!(small, ExpandMode::IndexedScan); - assert_eq!(huge, ExpandMode::IndexedScan); - } - - #[test] - fn frontier_large_fraction_of_source_picks_csr() { - // hops*frontier (200) exceeds BUILD_FACTOR*|V_src| (1.5*100=150) → CSR, - // and 200 is below the frontier cap, so it is the cost model deciding. - let m = choose_expand_mode(&inputs(200, 1_000, 100, 1, IndexCoverage::Indexed)); - assert_eq!(m, ExpandMode::Csr); - } - - #[test] - fn frontier_over_hard_cap_picks_csr() { - // 2000 > 1024 ceiling, even though the selectivity is tiny. - let m = choose_expand_mode(&inputs(2000, 10_000_000, 1_000_000, 1, IndexCoverage::Indexed)); - assert_eq!(m, ExpandMode::Csr); - } - - #[test] - fn hops_over_hard_cap_picks_csr() { - let m = choose_expand_mode(&inputs(10, 10_000_000, 1_000_000, 8, IndexCoverage::Indexed)); - assert_eq!(m, ExpandMode::Csr); - } - - #[test] - fn degraded_single_hop_tiny_frontier_stays_indexed() { - // One full degraded scan (1*|E|) still edges out a full CSR build - // (1.5*|E|) for a one-off single hop. - let m = choose_expand_mode(&inputs( - 5, - 10_000, - 10_000, - 1, - IndexCoverage::Degraded { - reason: "no btree".into(), - }, - )); - assert_eq!(m, ExpandMode::IndexedScan); - } - - #[test] - fn degraded_multi_hop_picks_csr() { - // Two degraded scans (2*|E|) lose to one CSR build (1.5*|E|). - let m = choose_expand_mode(&inputs( - 5, - 10_000, - 10_000, - 2, - IndexCoverage::Degraded { - reason: "no btree".into(), - }, - )); - assert_eq!(m, ExpandMode::Csr); - } - - #[test] - fn warm_csr_is_always_reused() { - // A maximally selective traversal still prefers an already-built CSR - // (cost ~0) over re-scanning per hop. 
- let mut i = inputs(1, 10_000_000, 1_000_000, 1, IndexCoverage::Indexed); - i.csr_cached = true; - assert_eq!(choose_expand_mode(&i), ExpandMode::Csr); - } - - #[test] - fn cost_model_caps_cross_type_hops() { - // Same-type passes the requested range through; cross-type caps at 1, - // matching execute_expand_indexed. - assert_eq!(cost_effective_hops(5, true), 5); - assert_eq!(cost_effective_hops(5, false), 1); - assert_eq!(cost_effective_hops(1, false), 1); - - // Consequence: a selective frontier where the requested 5 hops would - // (wrongly) flip cross-type to CSR, but the capped 1 hop β€” what actually - // runs β€” keeps it indexed. - let mut i = inputs(50, 10_000, 100, cost_effective_hops(5, false), IndexCoverage::Indexed); - assert_eq!(choose_expand_mode(&i), ExpandMode::IndexedScan); - i.effective_max_hops = 5; // as if the cross-type cap were not applied - assert_eq!(choose_expand_mode(&i), ExpandMode::Csr); - } -} - -#[cfg(test)] -mod literal_lowering_tests { - use super::*; - use datafusion::prelude::Expr; - use datafusion::scalar::ScalarValue; - - // With the column type known, the generic coercion types a date literal to - // the column's Date32/Date64 (the live pushdown path). Without a target it - // is the natural Utf8 fallback, which is still index-safe for dates because - // DataFusion casts the LITERAL, not the column (proven by - // `lance_surface_guards::scalar_index_use_requires_matched_literal_type`). 
- #[test] - fn date_literals_coerce_to_typed_arrow_scalars() { - use arrow_schema::DataType; - let dt = literal_to_expr_coerced( - &Literal::DateTime("2024-06-01T12:00:00Z".into()), - Some(&DataType::Date64), - ) - .unwrap(); - assert!( - matches!(dt, Expr::Literal(ScalarValue::Date64(Some(_)), ..)), - "DateTime vs Date64 column must coerce to a typed Date64, got {dt:?}" - ); - let d = literal_to_expr_coerced(&Literal::Date("2024-06-01".into()), Some(&DataType::Date32)) - .unwrap(); - assert!( - matches!(d, Expr::Literal(ScalarValue::Date32(Some(_)), ..)), - "Date vs Date32 column must coerce to a typed Date32, got {d:?}" - ); - let nat = literal_to_expr_coerced(&Literal::Date("2024-06-01".into()), None).unwrap(); - assert!( - matches!(nat, Expr::Literal(ScalarValue::Utf8(Some(_)), ..)), - "no target should keep the natural Utf8 date literal, got {nat:?}" - ); - } - - // A malformed date string makes coercion fail, so it falls back to the - // natural Utf8 literal rather than dropping the predicate to None. - #[test] - fn malformed_date_literal_falls_back_to_string() { - use arrow_schema::DataType; - let bad = literal_to_expr_coerced( - &Literal::DateTime("not-a-date".into()), - Some(&DataType::Date64), - ) - .unwrap(); - assert!( - matches!(bad, Expr::Literal(ScalarValue::Utf8(Some(_)), ..)), - "malformed DateTime literal should fall back to a Utf8 literal, got {bad:?}" - ); - } - - // With a column target, a literal lowers to the column's EXACT Arrow type - // (not its natural width), so DataFusion does not widen and cast the column - // — keeping the scalar BTREE usable. See - // `lance_surface_guards::scalar_index_use_requires_matched_literal_type`.
- #[test] - fn integer_literal_coerces_to_narrow_column_type() { - use arrow_schema::DataType; - let i32_lit = literal_to_expr_coerced(&Literal::Integer(5), Some(&DataType::Int32)).unwrap(); - assert!( - matches!(i32_lit, Expr::Literal(ScalarValue::Int32(Some(5)), ..)), - "integer literal vs Int32 column must lower to Int32, got {i32_lit:?}" - ); - let u32_lit = literal_to_expr_coerced(&Literal::Integer(7), Some(&DataType::UInt32)).unwrap(); - assert!( - matches!(u32_lit, Expr::Literal(ScalarValue::UInt32(Some(7)), ..)), - "integer literal vs UInt32 column must lower to UInt32, got {u32_lit:?}" - ); - } - - #[test] - fn float_literal_coerces_to_f32_column_type() { - use arrow_schema::DataType; - let f32_lit = - literal_to_expr_coerced(&Literal::Float(1.5), Some(&DataType::Float32)).unwrap(); - assert!( - matches!(f32_lit, Expr::Literal(ScalarValue::Float32(Some(_)), ..)), - "float literal vs Float32 column must lower to Float32, got {f32_lit:?}" - ); - } - - // Lossless guard: a fractional float against an integer column must NOT - // truncate (2.7 -> 2). Fall back to the natural Float64 so the comparison - // stays exact (no integer equals 2.7). - #[test] - fn fractional_float_vs_int_column_falls_back_not_truncate() { - use arrow_schema::DataType; - let e = literal_to_expr_coerced(&Literal::Float(2.7), Some(&DataType::Int32)).unwrap(); - assert!( - matches!(e, Expr::Literal(ScalarValue::Float64(Some(_)), ..)), - "fractional float vs Int32 must fall back to natural Float64, got {e:?}" - ); - } - - // A whole-number float IS lossless against an integer column, so it coerces. 
- #[test] - fn whole_float_vs_int_column_coerces() { - use arrow_schema::DataType; - let e = literal_to_expr_coerced(&Literal::Float(2.0), Some(&DataType::Int32)).unwrap(); - assert!( - matches!(e, Expr::Literal(ScalarValue::Int32(Some(2)), ..)), - "whole-number float vs Int32 is lossless and must coerce to Int32(2), got {e:?}" - ); - } - - // Lossless guard: an integer literal outside the column's range must NOT - // overflow to null; fall back to the natural Int64 (correct via DataFusion). - #[test] - fn out_of_range_int_vs_narrow_column_falls_back() { - use arrow_schema::DataType; - let e = literal_to_expr_coerced(&Literal::Integer(3_000_000_000), Some(&DataType::Int32)) - .unwrap(); - assert!( - matches!(e, Expr::Literal(ScalarValue::Int64(Some(3_000_000_000)), ..)), - "out-of-range integer vs Int32 must fall back to natural Int64, got {e:?}" - ); - } - - // Float targets are exempt from the lossless guard: narrowing to the column's - // own precision is the correct comparison domain, even when the value is not - // exactly representable in F32 (0.1). - #[test] - fn float_vs_f32_column_coerces_even_when_not_exactly_representable() { - use arrow_schema::DataType; - let e = literal_to_expr_coerced(&Literal::Float(0.1), Some(&DataType::Float32)).unwrap(); - assert!( - matches!(e, Expr::Literal(ScalarValue::Float32(Some(_)), ..)), - "float target must coerce 0.1 to Float32 (exempt from lossless guard), got {e:?}" - ); - } - - // No target (caller without a schema) keeps the natural width — the existing - // fallback, so behavior never regresses where the column type is unknown. - #[test] - fn literal_without_target_keeps_natural_width() { - let nat = literal_to_expr_coerced(&Literal::Integer(5), None).unwrap(); - assert!( - matches!(nat, Expr::Literal(ScalarValue::Int64(Some(5)), ..)), - "no target should keep the natural Int64 width, got {nat:?}" - ); - } - - // True if either operand of a binary comparison is an Int32 literal.
- fn binary_has_int32_literal(e: &Expr) -> bool { - if let Expr::BinaryExpr(b) = e { - [b.left.as_ref(), b.right.as_ref()] - .iter() - .any(|side| matches!(side, Expr::Literal(ScalarValue::Int32(Some(_)), ..))) - } else { - false - } - } - - fn int32_schema() -> arrow_schema::Schema { - use arrow_schema::{DataType, Field}; - arrow_schema::Schema::new(vec![Field::new("count", DataType::Int32, true)]) - } - - fn count_prop() -> IRExpr { - IRExpr::PropAccess { - variable: "m".into(), - property: "count".into(), - } - } - - // Coercion is operator-independent: a range comparison's literal coerces to - // the column type just like equality does, so range filters on a narrow - // numeric column keep the BTREE. - #[test] - fn ir_filter_coerces_literal_for_range_op() { - let schema = int32_schema(); - let filter = IRFilter { - left: count_prop(), - op: CompOp::Ge, - right: IRExpr::Literal(Literal::Integer(2)), - }; - let expr = ir_filter_to_expr(&filter, &ParamMap::new(), Some(&schema)).unwrap(); - assert!( - binary_has_int32_literal(&expr), - "range-op literal must coerce to the Int32 column type, got {expr:?}" - ); - } - - // The column may be on either side; the literal coerces to the opposite - // operand's column type regardless of order (`5 < count`). - #[test] - fn ir_filter_coerces_literal_when_column_is_on_the_right() { - let schema = int32_schema(); - let filter = IRFilter { - left: IRExpr::Literal(Literal::Integer(2)), - op: CompOp::Lt, - right: count_prop(), - }; - let expr = ir_filter_to_expr(&filter, &ParamMap::new(), Some(&schema)).unwrap(); - assert!( - binary_has_int32_literal(&expr), - "reversed-operand literal must coerce to the Int32 column type, got {expr:?}" - ); - } - - // Name of the left operand's column in a binary comparison `col OP lit`. 
- fn binary_left_column_name(e: &Expr) -> Option<String> { - match e { - Expr::BinaryExpr(b) => match b.left.as_ref() { - Expr::Column(c) => Some(c.name.clone()), - _ => None, - }, - _ => None, - } - } - - // #283: a camelCase property must reach the scan as its exact column name, - // not a SQL-normalized (lowercased) one. `col()` lowercases unquoted - // identifiers; the pushed-down column ref must stay `repoName`. - #[test] - fn ir_filter_preserves_camelcase_column_name() { - use arrow_schema::{DataType, Field}; - let schema = arrow_schema::Schema::new(vec![Field::new("repoName", DataType::Utf8, true)]); - let filter = IRFilter { - left: IRExpr::PropAccess { - variable: "d".into(), - property: "repoName".into(), - }, - op: CompOp::Eq, - right: IRExpr::Literal(Literal::String("acme".into())), - }; - let expr = ir_filter_to_expr(&filter, &ParamMap::new(), Some(&schema)).unwrap(); - assert_eq!( - binary_left_column_name(&expr).as_deref(), - Some("repoName"), - "camelCase column must be preserved (not lowercased to `reponame`), got {expr:?}" - ); - } - - // Index preservation: a camelCase numeric column still coerces its literal - // (so the scalar BTREE stays eligible) — the col→ident fix must not disturb - // the coercion path (which resolves the column type via field_with_name).
- #[test] - fn ir_filter_coerces_literal_for_camelcase_int_column() { - use arrow_schema::{DataType, Field}; - let schema = - arrow_schema::Schema::new(vec![Field::new("itemCount", DataType::Int32, true)]); - let filter = IRFilter { - left: IRExpr::PropAccess { - variable: "m".into(), - property: "itemCount".into(), - }, - op: CompOp::Eq, - right: IRExpr::Literal(Literal::Integer(2)), - }; - let expr = ir_filter_to_expr(&filter, &ParamMap::new(), Some(&schema)).unwrap(); - assert!( - binary_has_int32_literal(&expr), - "camelCase int column must keep its coerced Int32 literal (BTREE-eligible), got {expr:?}" - ); - } -} diff --git a/crates/omnigraph/src/exec/staging.rs b/crates/omnigraph/src/exec/staging.rs index 7760c95..0d26fd3 100644 --- a/crates/omnigraph/src/exec/staging.rs +++ b/crates/omnigraph/src/exec/staging.rs @@ -21,10 +21,9 @@ use std::collections::{HashMap, HashSet}; use std::sync::Arc; -use crate::storage_layer::{SnapshotHandle, StagedHandle}; use arrow_array::{Array, RecordBatch, StringArray, UInt32Array}; use arrow_schema::SchemaRef; -use futures::stream::StreamExt; +use lance::Dataset; use omnigraph_compiler::catalog::EdgeType; use crate::db::manifest::{ @@ -33,13 +32,15 @@ use crate::db::manifest::{ use crate::db::{MutationOpKind, SubTableUpdate}; use crate::error::{OmniError, Result}; -/// Whether the per-table accumulator should commit via `stage_append`, -/// `stage_merge_insert`, or `stage_overwrite`. +/// Whether the per-table accumulator should commit via `stage_append` +/// (no @key inserts, edge inserts) or `stage_merge_insert` (any @key insert +/// or update). Once set to `Merge` for a table within a query, subsequent +/// inserts on that table are rolled into the same merge — a `WhenNotMatched +/// = InsertAll` merge is correct for both cases. #[derive(Debug, Clone, Copy, PartialEq, Eq)] pub(crate) enum PendingMode { Append, Merge, - Overwrite, } /// Per-table accumulator.
Each insert/update op pushes a `RecordBatch` into @@ -157,9 +158,9 @@ impl MutationStaging { mode: PendingMode, batch: RecordBatch, ) -> Result<()> { - if batch.num_rows() == 0 && mode != PendingMode::Overwrite { - // No-op for additive modes. For Overwrite, an empty batch is - // observable: it means "replace this table with zero rows". + if batch.num_rows() == 0 { + // No-op — staging is purely additive; an empty batch should not + // be appended. return Ok(()); } // If we've already accumulated a batch on this table, the new @@ -173,14 +174,6 @@ impl MutationStaging { // caller a clearer point of failure attached to the specific // op that introduced the drift. if let Some(existing) = self.pending.get(table_key) { - if existing.mode == PendingMode::Overwrite || mode == PendingMode::Overwrite { - if existing.mode != mode { - return Err(OmniError::manifest_internal(format!( - "table '{}' cannot mix overwrite staging with append/merge staging", - table_key - ))); - } - } if !schemas_compatible(&existing.schema, &batch.schema()) { return Err(OmniError::manifest(format!( "table '{}' accumulated mutation batches with mismatched schemas: \ @@ -201,9 +194,8 @@ impl MutationStaging { .pending .entry(table_key.to_string()) .or_insert_with(|| PendingTable::new(schema.clone(), mode)); - // Upgrade Append -> Merge if any op needs merge semantics. Overwrite - // is never mixed with additive modes (guarded above). - if mode == PendingMode::Merge && entry.mode == PendingMode::Append { + // Upgrade Append -> Merge if any op needs merge semantics. + if mode == PendingMode::Merge { entry.mode = PendingMode::Merge; } entry.batches.push(batch); @@ -225,11 +217,6 @@ impl MutationStaging { .unwrap_or(&[]) } - /// Accumulator mode for `table_key`, if this query has touched it.
- pub(crate) fn pending_mode(&self, table_key: &str) -> Option { - self.pending.get(table_key).map(|p| p.mode) - } - /// Schema of the accumulated batches for `table_key`, or `None` if no /// op has touched the table. Used by `scan_with_pending` to construct /// the in-memory `MemTable`. @@ -262,21 +249,9 @@ impl MutationStaging { /// Lance datasets is a perf follow-up; same loop structure as the /// pre-split `finalize`. pub(crate) async fn stage_all( - self, - db: &crate::db::Omnigraph, - branch: Option<&str>, - ) -> Result { - self.stage_all_with_concurrency(db, branch, 1).await - } - - /// Loader-facing variant of [`stage_all`] that preserves - /// `OMNIGRAPH_LOAD_CONCURRENCY` for the fragment-writing stage while - /// still leaving all Lance HEAD movement to [`StagedMutation::commit_all`]. - pub(crate) async fn stage_all_with_concurrency( self, db: &crate::db::Omnigraph, _branch: Option<&str>, - concurrency: usize, ) -> Result { let MutationStaging { expected_versions, @@ -286,8 +261,7 @@ impl MutationStaging { op_kinds, } = self; - let mut stage_inputs: Vec<(String, PendingTable, StagedTablePath, u64)> = - Vec::with_capacity(pending.len()); + let mut staged_entries: Vec = Vec::with_capacity(pending.len()); for (table_key, table) in pending { let path = paths.get(&table_key).cloned().ok_or_else(|| { OmniError::manifest_internal(format!( @@ -301,22 +275,77 @@ impl MutationStaging { table_key )) })?; - stage_inputs.push((table_key, table, path, expected)); + + // Reopen the dataset for staging. The op_kind reflects the + // accumulated PendingTable's mode: Append-mode batches are + // INSERT-shaped (no key-based dedup at commit_staged); Merge- + // mode batches are MERGE-shaped (key-dedup at commit_staged). 
+ // Both skip the strict pre-stage version check under the + // [`MutationOpKind`] policy: Lance's natural rebase + the + // per-(table, branch) queue + the publisher CAS in + // `commit_all` handle drift; the strict check would + // over-reject in-process concurrent inserts (PR 2 / MR-686 + // Phase 2). + let stage_kind = match table.mode { + PendingMode::Append => crate::db::MutationOpKind::Insert, + PendingMode::Merge => crate::db::MutationOpKind::Merge, + }; + let ds = db + .reopen_for_mutation( + &table_key, + &path.full_path, + path.table_branch.as_deref(), + expected, + stage_kind, + ) + .await?; + + if table.batches.is_empty() { + continue; + } + + // For Merge mode, dedupe accumulated batches by `id`, keeping + // the LAST occurrence (last-write-wins for the query). This + // is required because Lance's `MergeInsertBuilder` produces + // arbitrary results on duplicate keys in the source. Append + // mode is exempt because no-key node and edge inserts use + // ULID-generated ids that are unique within a query. + let combined = match table.mode { + PendingMode::Merge => dedupe_merge_batches_by_id(&table.schema, table.batches)?, + PendingMode::Append => { + if table.batches.len() == 1 { + table.batches.into_iter().next().unwrap() + } else { + arrow_select::concat::concat_batches(&table.schema, &table.batches) + .map_err(|e| OmniError::Lance(e.to_string()))? + } + } + }; + + // Stage produces uncommitted fragments + transaction. No + // Lance HEAD advance until `commit_all` runs `commit_staged`. + let staged = match table.mode { + PendingMode::Append => db.table_store().stage_append(&ds, combined, &[]).await?, + PendingMode::Merge => { + db.table_store() + .stage_merge_insert( + ds.clone(), + combined, + vec!["id".to_string()], + lance::dataset::WhenMatched::UpdateAll, + lance::dataset::WhenNotMatched::InsertAll, + ) + .await? 
+ } + }; + staged_entries.push(StagedTableEntry { + table_key, + path, + expected_version: expected, + dataset: ds, + staged_write: staged, + }); } - let concurrency = concurrency.min(stage_inputs.len()).max(1); - let staged_entries = futures::stream::iter(stage_inputs.into_iter().map( - |(table_key, table, path, expected)| async move { - stage_pending_table(db, table_key, table, path, expected).await - }, - )) - .buffered(concurrency) - .collect::>>>() - .await - .into_iter() - .collect::>>()? - .into_iter() - .flatten() - .collect(); Ok(StagedMutation { inline_committed, @@ -328,73 +357,6 @@ impl MutationStaging { } } -async fn stage_pending_table( - db: &crate::db::Omnigraph, - table_key: String, - table: PendingTable, - path: StagedTablePath, - expected: u64, -) -> Result> { - // Reopen the dataset for staging. Append/Merge can be rebased later by - // Lance + publisher CAS; Overwrite is a strict replacement and uses the - // same SchemaRewrite policy as schema apply. - let stage_kind = match table.mode { - PendingMode::Append => crate::db::MutationOpKind::Insert, - PendingMode::Merge => crate::db::MutationOpKind::Merge, - PendingMode::Overwrite => crate::db::MutationOpKind::SchemaRewrite, - }; - let ds = db - .reopen_for_mutation( - &table_key, - &path.full_path, - path.table_branch.as_deref(), - expected, - stage_kind, - ) - .await?; - - if table.batches.is_empty() { - return Ok(None); - } - - let combined = match table.mode { - PendingMode::Merge => dedupe_merge_batches_by_id(&table.schema, table.batches)?, - PendingMode::Append | PendingMode::Overwrite => { - if table.batches.len() == 1 { - table.batches.into_iter().next().unwrap() - } else { - arrow_select::concat::concat_batches(&table.schema, &table.batches) - .map_err(|e| OmniError::Lance(e.to_string()))? - } - } - }; - - // Stage produces uncommitted fragments + transaction. No Lance HEAD - // advance until `commit_all` runs `commit_staged`. 
- let staged = match table.mode { - PendingMode::Append => db.storage().stage_append(&ds, combined, &[]).await?, - PendingMode::Merge => { - db.storage() - .stage_merge_insert( - ds.clone(), - combined, - vec!["id".to_string()], - lance::dataset::WhenMatched::UpdateAll, - lance::dataset::WhenNotMatched::InsertAll, - ) - .await? - } - PendingMode::Overwrite => db.storage().stage_overwrite(&ds, combined).await?, - }; - Ok(Some(StagedTableEntry { - table_key, - path, - expected_version: expected, - dataset: ds, - staged_write: staged, - })) -} - /// Output of [`MutationStaging::stage_all`]. Carries the staged Lance /// transactions (Phase A complete; uncommitted fragments written) plus /// the per-table metadata needed to write the recovery sidecar, run @@ -427,37 +389,15 @@ pub(crate) struct StagedMutation { } /// Per-table state captured during `stage_all` and consumed by -/// `commit_all`. Holds the opened snapshot (so `commit_staged` doesn't -/// re-open) plus the staged Lance transaction that `commit_staged` -/// will execute. Both held as opaque `TableStorage` handles per MR-793 -/// §III.9 — the inner `lance::Dataset` / `StagedWrite` are not visible -/// to engine code outside the storage layer. +/// `commit_all`. Holds the opened `Dataset` so `commit_staged` doesn't +/// re-open, and the `StagedWrite` whose `transaction` `commit_staged` +/// will execute. struct StagedTableEntry { table_key: String, path: StagedTablePath, expected_version: u64, - dataset: SnapshotHandle, - staged_write: StagedHandle, -} - -/// Output of [`StagedMutation::commit_all`] (Phase B): the publisher's input plus -/// the queue guards the caller must hold across the manifest publish. -pub(crate) struct CommittedMutation { - /// Per-table updates to publish to the manifest. - pub(crate) updates: Vec, - /// Per-table manifest pins refreshed under the write queue — the publisher's CAS fence.
- pub(crate) expected_versions: HashMap, - /// Recovery sidecar to delete after Phase C succeeds (`None` when nothing staged). - pub(crate) sidecar_handle: Option, - /// Per-`(table, branch)` write-queue guards — the caller MUST hold these across - /// the manifest publish (see `commit_all`) so no writer interleaves between - /// `commit_staged` and the publish. - pub(crate) guards: Vec>, - /// Post-`commit_staged` handle per STAGED table (table_key → handle at the - /// just-committed version). Carried out (RFC-013 step 3b, collapse #4) so the - /// publish-prepare index build reuses it instead of a fresh `reopen_for_mutation` - /// at the same version. Inline-committed / delete tables are absent (no staged handle). - pub(crate) committed_handles: HashMap, + dataset: lance::Dataset, + staged_write: crate::table_store::StagedWrite, } impl StagedMutation { @@ -483,30 +423,18 @@ impl StagedMutation { /// unreferenced (cleaned by `cleanup_old_versions`'s age sweep) /// rather than being committed and creating a Lance-HEAD-ahead /// residual. - /// `held_guards`: when the caller already holds the per-`(table_key, - /// branch)` write queues for every touched table (the fork path acquires - /// them up front, before the fork, and holds them through the manifest - /// publish), it passes `(acquired_keys, guards)` here so `commit_all` - /// reuses them instead of re-acquiring — the queue is a non-re-entrant - /// `tokio::Mutex`, so re-acquiring a held key would self-deadlock. - /// `None` (the steady-state path) means `commit_all` acquires them - /// itself. `acquired_keys` must cover every key `commit_all` would - /// acquire (debug-asserted below) — the guards from `acquire_many` don't - /// carry their keys, so the caller hands the key set alongside them. The - /// fork path guarantees coverage by keying every touched table uniformly - /// by the resolved target branch.
pub(crate) async fn commit_all( self, db: &crate::db::Omnigraph, branch: Option<&str>, sidecar_kind: SidecarKind, actor_id: Option<&str>, - held_guards: Option<( - Vec<(String, Option)>, - Vec>, - )>, - txn: Option<&crate::db::WriteTxn>, - ) -> Result { + ) -> Result<( + Vec, + HashMap, + Option, + Vec>, + )> { let StagedMutation { inline_committed, mut staged, @@ -515,18 +443,21 @@ impl StagedMutation { op_kinds, } = self; - // Per-(table_key, branch) queues for every touched table — both - // staged and inline-committed. Sorted by `acquire_many` internally - // so all multi-table writers (mutation, branch_merge, schema_apply, - // the fork path, recovery) agree on acquisition order — prevents - // lock-order inversion deadlock. + // Acquire per-(table_key, branch) queues for every touched + // table — both staged and inline-committed. Sorted by + // `acquire_many` internally so all multi-table writers + // (mutation, branch_merge, schema_apply, future MR-870 + // recovery) agree on acquisition order — prevents lock-order + // inversion deadlock. // - // For inline-committed tables (delete-only mutations), Lance HEAD - // has already advanced inside `delete_where` before `commit_all` - // runs. Holding the queue here prevents another writer from - // interleaving between our delete and our publish, which would - // otherwise leave a Lance-HEAD-ahead residual the delete-only - // sidecar (added below) would have to recover. + // For inline-committed tables (delete-only mutations), Lance + // HEAD has already advanced inside `delete_where` before + // `commit_all` runs. Holding the queue here doesn't prevent + // that interleaving (commit 6 will move queue acquisition into + // `delete_where`'s call site); it does prevent another writer + // from interleaving between our delete and our publish, which + // would otherwise leave a Lance-HEAD-ahead residual the + // delete-only sidecar (added below) would have to recover.
let mut queue_keys: Vec<(String, Option)> = Vec::with_capacity(staged.len() + inline_committed.len()); for entry in &staged { @@ -541,30 +472,7 @@ impl StagedMutation { })?; queue_keys.push((table_key.clone(), path.table_branch.clone())); } - // Reuse the caller's guards (fork path) when handed in, else acquire - // our own. When reusing, every key we would acquire MUST already be - // covered — re-acquiring a held non-re-entrant key would deadlock, and - // a key we'd need but DON'T hold would commit unserialized. This is a - // load-bearing safety invariant, so it is checked in ALL builds (not a - // debug_assert) and fails the write loudly+safely rather than silently - // proceeding unguarded if a future execution path ever touches a table - // outside the caller's pre-computed set. - let guards = match held_guards { - Some((acquired_keys, guards)) => { - let held: std::collections::HashSet<&(String, Option)> = - acquired_keys.iter().collect(); - if let Some(missing) = queue_keys.iter().find(|k| !held.contains(k)) { - return Err(OmniError::manifest_internal(format!( - "commit_all: pre-held write-queue guards do not cover touched table \ - '{}' on branch {:?} — the caller's up-front acquisition set diverged \ - from the staged/inline set (a touched-table-set bug)", - missing.0, missing.1 - ))); - } - guards - } - None => db.write_queue().acquire_many(&queue_keys).await, - }; + let guards = db.write_queue().acquire_many(&queue_keys).await; // Re-capture manifest pins under the queue (PR 2 / MR-686). // @@ -587,32 +495,25 @@ impl StagedMutation { // until `ensure_path` learns how to bump expected_version on // op-kind upgrade. // - // Why a fresh per-branch snapshot (and not the bound-branch - // `db.snapshot()` / `snapshot_for_branch()` fast path): a stale - // engine handle may be bound to the same branch it is writing.
For - // non-strict Insert/Merge, that stale local view is allowed to rebase - // to the live manifest pin under the queue; only uncovered Lance - // HEAD>manifest drift is refused. For writes targeting a branch other - // than the engine's bound branch (e.g., feature-branch ingest from a - // server handle bound to main), the same helper also resolves the - // correct branch pin. The cost is one fresh manifest read per mutation - // plus one Lance HEAD open per staged table for the drift guard below. + // Why per-branch (and not the bound-branch `db.snapshot()`): + // when the caller mutates a branch other than the engine's + // bound branch (e.g., feature-branch ingest from a server + // handle bound to main), `db.snapshot()` returns the bound + // branch's view of each table — which is the wrong pin for + // the publisher's CAS on a different branch. Using + // `snapshot_for_branch(branch)` resolves the per-branch + // entries correctly. The cost is one fresh manifest read per + // mutation; PR 1b's regression came from this same read, but + // that read is now strictly necessary for cross-branch + // correctness. Single-table same-branch mutations could still + // skip this read (queue exclusivity makes the publisher CAS a + // no-op), but the conditional adds complexity for marginal + // gain — left as a follow-up perf optimization. // // Multi-coordinator deployments (§VI.27 aspirational) get // genuine cross-process drift detection from this read for // free. - // - // This MUST be a FRESH per-branch manifest read (never the warm - // cache) for the OCC re-capture below — but with a `WriteTxn` the - // schema contract was already validated at capture, so use the - // `_unchecked` variant, which drops the redundant - // `ensure_schema_state_valid` AND the commit-graph load the OCC read - // never consults (a fresh manifest read yields the same `Snapshot`). - // Without a txn this is byte-identical to the prior checked call.
- let snapshot = match txn { - Some(_) => db.fresh_snapshot_for_branch_unchecked(branch).await?, - None => db.fresh_snapshot_for_branch(branch).await?, - }; + let snapshot = db.snapshot_for_branch(branch).await?; for entry in staged.iter_mut() { let current = snapshot .entry(&entry.table_key) @@ -640,83 +541,6 @@ impl StagedMutation { )); } - // Separate manifest-visible concurrency from uncovered Lance drift. - // Non-strict inserts/merges are allowed to rebase from their staged - // read version to the fresh manifest pin above, but only if the - // live Lance HEAD still equals that manifest pin. If an external - // raw Lance write or a pre-fix maintenance path moved HEAD without - // publishing `__manifest`, this write must not silently fold it. - // - // `latest_version_id` reads the latest manifest pointer off the - // already-open staged handle (the #2 staging open) WITHOUT a fresh - // `Dataset::open` — the same cheap live-HEAD probe - // `ManifestCoordinator::probe_latest_version` uses. This replaces a - // redundant `open_dataset_head_for_write` (RFC-013 step 3b, collapse - // #3): the drift comparison below is byte-identical; only how `head` - // is obtained changes (probe vs cold open). - let head = entry - .dataset - .dataset() - .latest_version_id() - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - if head < current { - return Err(OmniError::manifest_internal(format!( - "table '{}' Lance HEAD version {} is behind manifest version {}", - entry.table_key, head, current - ))); - } - if head > current { - // Error path only: tell the operator which drift class - // this is. Uncovered drift (external raw Lance write, - // pre-fix maintenance) goes through `omnigraph repair`. - // Sidecar-covered drift reaching this guard means the - // write-entry heal deferred it (rollback-eligible), and - // `repair` refuses while a sidecar is pending — the - // recovery path is a read-write reopen.
A list failure - // must not mask the conflict — and must not pick a - // class confidently either: "could not classify" names - // both paths and the cause, never routing the operator - // to a command that will refuse. - let action = match crate::db::manifest::list_sidecars( - db.root_uri(), - db.storage_adapter(), - ) - .await - { - Ok(sidecars) => { - let covered = sidecars.iter().any(|sidecar| { - sidecar.tables.iter().any(|pin| { - // Branch-aware: a sidecar pinning the - // same table on ANOTHER branch does not - // cover this branch's drift — a reopen - // would recover that sidecar but leave - // this drift for `repair`. - pin.table_key == entry.table_key - && pin.table_branch == entry.path.table_branch - }) - }); - if covered { - "a pending recovery sidecar requires rollback — reopen the \ - graph read-write (e.g. restart the server) to recover" - .to_string() - } else { - "run `omnigraph repair` before writing".to_string() - } - } - Err(list_err) => format!( - "could not classify the drift (sidecar listing failed: {}); \ - run `omnigraph repair`, or reopen the graph read-write if \ - repair reports a pending recovery sidecar", - list_err - ), - }; - return Err(OmniError::manifest_conflict(format!( - "table '{}' has Lance HEAD version {} ahead of manifest version {}; {}", - entry.table_key, head, current, action - ))); - } - entry.expected_version = current; expected_versions.insert(entry.table_key.clone(), current); } @@ -744,9 +568,6 @@ impl StagedMutation { table_path: entry.path.full_path.clone(), expected_version: entry.expected_version, post_commit_pin: entry.expected_version + 1, - // Mutation/Load use strict single-commit classification, not - // BranchMerge's Phase-B confirmation — left None. - confirmed_version: None, table_branch: entry.path.table_branch.clone(), }); } @@ -773,7 +594,6 @@ impl StagedMutation { // can advance HEAD by more than one version (e.g., // when Lance internally compacts deletion vectors).
post_commit_pin: update.table_version, - confirmed_version: None, table_branch: path.table_branch.clone(), }); } @@ -818,12 +638,6 @@ impl StagedMutation { let mut updates: Vec = inline_committed.into_values().collect(); - // Carry each staged table's post-`commit_staged` handle out so the - // publish-prepare index build reuses it (collapse #4) instead of - // re-opening the dataset at the same just-committed version. - let mut committed_handles: HashMap = - HashMap::with_capacity(staged.len()); - for entry in staged { let StagedTableEntry { table_key, @@ -833,25 +647,24 @@ impl StagedMutation { staged_write, } = entry; - let new_ds = db.storage().commit_staged(dataset, staged_write).await?; - let state = db.storage().table_state(&path.full_path, &new_ds).await?; + let new_ds = db + .table_store() + .commit_staged(Arc::new(dataset), staged_write.transaction) + .await?; + let state = db + .table_store() + .table_state(&path.full_path, &new_ds) + .await?; updates.push(SubTableUpdate { - table_key: table_key.clone(), + table_key, table_version: state.version, table_branch: path.table_branch.clone(), row_count: state.row_count, version_metadata: state.version_metadata, }); - committed_handles.insert(table_key, new_ds); } - Ok(CommittedMutation { - updates, - expected_versions, - sidecar_handle, - guards, - committed_handles, - }) + Ok((updates, expected_versions, sidecar_handle, guards)) } } @@ -975,9 +788,7 @@ fn dedupe_merge_batches_by_id( /// Count edges per `src` value across committed (Lance scan) + pending /// (in-memory). Caller supplies an opened committed dataset so the /// mutation path (which already has one) and the loader path (which -/// opens via snapshot) share the same body. For overwrite staging, the -/// pending batches are the replacement table image, so committed rows are -/// intentionally skipped. +/// opens via snapshot) share the same body. 
/// /// `dedupe_key_column` controls whether committed rows are shadowed by /// pending: @@ -992,7 +803,7 @@ fn dedupe_merge_batches_by_id( /// `LoadMode::Merge` double-counts. pub(crate) async fn count_src_per_edge( db: &crate::db::Omnigraph, - committed_ds: &SnapshotHandle, + committed_ds: &Dataset, table_key: &str, staging: &MutationStaging, dedupe_key_column: Option<&str>, @@ -1023,44 +834,41 @@ pub(crate) async fn count_src_per_edge( _ => None, }; - let replace_committed = staging.pending_mode(table_key) == Some(PendingMode::Overwrite); - if !replace_committed { - // Committed side: scan `src` plus the dedupe key column when set, so - // we can both count and shadow in one pass. - let projection: Vec<&str> = match dedupe_key_column { - Some(col) if pending_keys.as_ref().is_some_and(|s| !s.is_empty()) => vec!["src", col], - _ => vec!["src"], + // Committed side: scan `src` plus the dedupe key column when set, so + // we can both count and shadow in one pass. + let projection: Vec<&str> = match dedupe_key_column { + Some(col) if pending_keys.as_ref().is_some_and(|s| !s.is_empty()) => vec!["src", col], + _ => vec!["src"], + }; + let committed = db + .table_store() + .scan(committed_ds, Some(&projection), None, None) + .await?; + for batch in &committed { + let srcs = batch + .column_by_name("src") + .ok_or_else(|| OmniError::Lance("missing 'src' column on edge table".into()))? + .as_any() + .downcast_ref::() + .ok_or_else(|| OmniError::Lance("'src' column is not Utf8".into()))?; + // Optional shadow-key column (only present when dedupe is on). 
+ let key_arr = match (&pending_keys, dedupe_key_column) { + (Some(set), Some(col)) if !set.is_empty() => batch + .column_by_name(col) + .and_then(|c| c.as_any().downcast_ref::()), + _ => None, }; - let committed = db - .storage() - .scan(committed_ds, Some(&projection), None, None) - .await?; - for batch in &committed { - let srcs = batch - .column_by_name("src") - .ok_or_else(|| OmniError::Lance("missing 'src' column on edge table".into()))? - .as_any() - .downcast_ref::() - .ok_or_else(|| OmniError::Lance("'src' column is not Utf8".into()))?; - // Optional shadow-key column (only present when dedupe is on). - let key_arr = match (&pending_keys, dedupe_key_column) { - (Some(set), Some(col)) if !set.is_empty() => batch - .column_by_name(col) - .and_then(|c| c.as_any().downcast_ref::()), - _ => None, - }; - for i in 0..srcs.len() { - if !srcs.is_valid(i) { + for i in 0..srcs.len() { + if !srcs.is_valid(i) { + continue; + } + // Shadow this committed row if its key is in pending. + if let (Some(arr), Some(set)) = (key_arr, pending_keys.as_ref()) { + if arr.is_valid(i) && set.contains(arr.value(i)) { continue; } - // Shadow this committed row if its key is in pending. - if let (Some(arr), Some(set)) = (key_arr, pending_keys.as_ref()) { - if arr.is_valid(i) && set.contains(arr.value(i)) { - continue; - } - } - *counts.entry(srcs.value(i).to_string()).or_insert(0) += 1; } + *counts.entry(srcs.value(i).to_string()).or_insert(0) += 1; } } diff --git a/crates/omnigraph/src/failpoints.rs b/crates/omnigraph/src/failpoints.rs index a353345..461b73e 100644 --- a/crates/omnigraph/src/failpoints.rs +++ b/crates/omnigraph/src/failpoints.rs @@ -14,115 +14,6 @@ pub(crate) fn maybe_fail(_name: &str) -> Result<()> { Ok(()) } -/// Failpoint that injects a *Lance* error rather than an `OmniError`. 
Used to -/// stand in for a `Dataset::open` failing with a transient/corrupt (non-not-found) -/// error, so a test can drive the caller's lance-error classification — the -/// behavior FIX A (`read_legacy_commit_cache`) relies on: a not-found is benign -/// (empty), anything else propagates. A no-op without the `failpoints` feature -/// (the injected variant is therefore unreachable in release builds). -#[allow(unused_variables)] -pub(crate) fn maybe_fail_lance_open(name: &str) -> std::result::Result<(), lance::Error> { - #[cfg(feature = "failpoints")] - { - fail::fail_point!(name, |_| { - Err(lance::Error::io(format!( - "injected failpoint triggered: {name}" - ))) - }); - } - Ok(()) -} - -/// Failpoint that injects a Lance `IncompatibleTransaction` — the variant a -/// concurrent `UpdateConfig` stamp race produces. Lets a test drive the v3→v4 -/// stamp loop's exhaustion path (`commit_v4_stamp_idempotently`) deterministically; -/// it is otherwise near-unreachable, since a real concurrent winner stamps the SAME -/// value, so the loop's re-read returns `Ok` on the first retry. A no-op without the -/// `failpoints` feature. -#[allow(unused_variables)] -pub(crate) fn maybe_fail_lance_incompatible(name: &str) -> std::result::Result<(), lance::Error> { - #[cfg(feature = "failpoints")] - { - fail::fail_point!(name, |_| { - Err(lance::Error::incompatible_transaction_source( - format!("injected failpoint triggered: {name}").into(), - )) - }); - } - Ok(()) -} - -/// Failpoint that injects a *retryable* `RowLevelCasContention` `OmniError` — the -/// typed conflict the manifest publisher's outer retry treats as retryable -/// (`is_retryable_publish_conflict`). Used to drive the publisher's -/// retry-on-`load_publish_state`-error path deterministically: the v3→v4 migration -/// surfaces this same type on exhaustion EXPECTING the publisher to re-run the -/// load, a path otherwise reachable only under sustained multi-writer contention. 
-/// A no-op without the `failpoints` feature. -#[allow(unused_variables)] -pub(crate) fn maybe_fail_retryable_contention(name: &str) -> Result<()> { - #[cfg(feature = "failpoints")] - { - fail::fail_point!(name, |_| { - return Err(crate::error::OmniError::manifest_row_level_cas_contention( - format!("injected retryable contention failpoint: {name}"), - )); - }); - } - Ok(()) -} - -/// Compile-checked catalog of every failpoint name in this crate. Call sites -/// (`maybe_fail`) and tests (`ScopedFailPoint` / the test rendezvous helper) -/// reference these constants instead of bare string literals, so a typo is a -/// compile error rather than a silently-never-firing failpoint. -pub mod names { - pub const BRANCH_CREATE_AFTER_MANIFEST_BRANCH_CREATE: &str = "branch_create.after_manifest_branch_create"; - pub const BRANCH_DELETE_BEFORE_COMMIT_GRAPH_RECLAIM: &str = "branch_delete.before_commit_graph_reclaim"; - pub const BRANCH_DELETE_BEFORE_TABLE_CLEANUP: &str = "branch_delete.before_table_cleanup"; - pub const BRANCH_MERGE_ADOPT_AFTER_APPEND_PRE_UPSERT: &str = "branch_merge.adopt_after_append_pre_upsert"; - pub const BRANCH_MERGE_ADOPT_AFTER_UPSERT_PRE_DELETE: &str = "branch_merge.adopt_after_upsert_pre_delete"; - pub const BRANCH_MERGE_POST_PHASE_B_PRE_MANIFEST_COMMIT: &str = "branch_merge.post_phase_b_pre_manifest_commit"; - pub const BRANCH_MERGE_REWRITE_AFTER_DELETE_PRE_INDEX: &str = "branch_merge.rewrite_after_delete_pre_index"; - pub const BRANCH_MERGE_REWRITE_AFTER_MERGE_PRE_DELETE: &str = "branch_merge.rewrite_after_merge_pre_delete"; - pub const CLASSIFY_FRESH_READ: &str = "classify.fresh_read"; - pub const CLEANUP_RECONCILE_FORK: &str = "cleanup.reconcile_fork"; - pub const CLEANUP_RESOLVE_BRANCH_SNAPSHOT: &str = "cleanup.resolve_branch_snapshot"; - pub const CLEANUP_TABLE_GC: &str = "cleanup.table_gc"; - pub const ENSURE_INDICES_POST_PHASE_B_PRE_MANIFEST_COMMIT: &str = "ensure_indices.post_phase_b_pre_manifest_commit"; - pub const 
ENSURE_INDICES_POST_STAGE_PRE_COMMIT_BTREE: &str = "ensure_indices.post_stage_pre_commit_btree"; - pub const FORK_BEFORE_CLASSIFY: &str = "fork.before_classify"; - pub const FORK_BEFORE_RECLAIM: &str = "fork.before_reclaim"; - pub const GRAPH_PUBLISH_AFTER_MANIFEST_COMMIT: &str = "graph_publish.after_manifest_commit"; - pub const GRAPH_PUBLISH_BEFORE_COMMIT_APPEND: &str = "graph_publish.before_commit_append"; - pub const INIT_AFTER_COORDINATOR_INIT: &str = "init.after_coordinator_init"; - pub const INIT_AFTER_SCHEMA_CONTRACT_WRITTEN: &str = "init.after_schema_contract_written"; - pub const INIT_AFTER_SCHEMA_PG_WRITTEN: &str = "init.after_schema_pg_written"; - pub const MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE: &str = "mutation.delete_node_pre_primary_delete"; - pub const MUTATION_POST_FINALIZE_PRE_PUBLISHER: &str = "mutation.post_finalize_pre_publisher"; - pub const OPTIMIZE_BEFORE_COMPACT: &str = "optimize.before_compact"; - pub const OPTIMIZE_INJECT_REINDEX_CONFLICT: &str = "optimize.inject_reindex_conflict"; - pub const OPTIMIZE_POST_PHASE_B_PRE_MANIFEST_COMMIT: &str = "optimize.post_phase_b_pre_manifest_commit"; - pub const RECOVERY_BEFORE_ROLL_FORWARD_PUBLISH: &str = "recovery.before_roll_forward_publish"; - pub const RECOVERY_ORPHAN_DISCARD_AUDIT_APPEND: &str = "recovery.orphan_discard_audit_append"; - pub const RECOVERY_RECORD_AUDIT: &str = "recovery.record_audit"; - pub const RECOVERY_SIDECAR_CONFIRM: &str = "recovery.sidecar_confirm"; - pub const RECOVERY_SIDECAR_DELETE: &str = "recovery.sidecar_delete"; - pub const RECOVERY_SIDECAR_LIST: &str = "recovery.sidecar_list"; - pub const RECOVERY_SIDECAR_WRITE: &str = "recovery.sidecar_write"; - pub const SCHEMA_APPLY_AFTER_MANIFEST_COMMIT: &str = "schema_apply.after_manifest_commit"; - pub const SCHEMA_APPLY_AFTER_STAGING_WRITE: &str = "schema_apply.after_staging_write"; - pub const SCHEMA_APPLY_BEFORE_STAGING_WRITE: &str = "schema_apply.before_staging_write"; - // RFC-013 Phase 7 migration failpoints (this 
branch). - pub const MIGRATION_V3_TO_V4_LEGACY_OPEN: &str = "migration.v3_to_v4.legacy_open"; - pub const MIGRATION_V4_STAMP_FORCE_INCOMPATIBLE: &str = "migration.v4_stamp.force_incompatible"; - /// Injects a retryable `RowLevelCasContention` from `load_publish_state` so a - /// test can prove the publisher's outer retry re-runs the load (the migration - /// surfaces this same typed error on exhaustion). - pub const PUBLISH_LOAD_STATE_RETRYABLE_CONTENTION: &str = - "publish.load_state_retryable_contention"; -} - #[cfg(feature = "failpoints")] pub struct ScopedFailPoint { name: String, @@ -136,20 +27,6 @@ impl ScopedFailPoint { name: name.to_string(), } } - - /// Register a callback failpoint with the same Drop-based cleanup as - /// `new`. Without the guard, a panic while the point is active would - /// leak the callback into the process-global registry and fire it under - /// later tests in the same binary. - pub fn with_callback(name: &str, callback: F) -> Self - where - F: Fn() + Send + Sync + 'static, - { - fail::cfg_callback(name, callback).expect("configure callback failpoint"); - Self { - name: name.to_string(), - } - } } #[cfg(feature = "failpoints")] diff --git a/crates/omnigraph/src/instrumentation.rs b/crates/omnigraph/src/instrumentation.rs deleted file mode 100644 index 9718686..0000000 --- a/crates/omnigraph/src/instrumentation.rs +++ /dev/null @@ -1,372 +0,0 @@ -//! Read-path cost instrumentation (test seam). -//! -//! Two boundary instruments let cost-budget tests assert that a warm read does -//! no redundant IO, the way LanceDB's IO-counted tests do (see -//! `docs/dev/testing.md`, "Cost-budget tests"): -//! -//! - **Lance object store** — a per-query [`WrappingObjectStore`] attached to the -//! datasets a query opens, so a test counts real `read_iops`. Delivered through -//! a task-local ([`QueryIoProbes`]) set by the test; production leaves it unset, -//! so the open helpers attach nothing (one unset-`Option` check per open). -//! 
- **omnigraph `StorageAdapter`** — [`CountingStorageAdapter`], a decorator that -//! counts per-method calls (the schema-contract reads on the query path). -//! -//! Nothing here changes runtime behavior: the wrappers only observe, and the -//! decorator delegates every call. `IOTracker` (the concrete counter) lives in -//! tests via the `lance-io` dev-dependency; this module stays generic over the -//! `lance::io`-re-exported trait, so it adds no production dependency. - -use std::sync::Arc; -use std::sync::atomic::{AtomicU64, Ordering}; - -use async_trait::async_trait; -use lance::Dataset; -use lance::dataset::builder::DatasetBuilder; -use lance::io::{ObjectStoreParams, WrappingObjectStore}; - -use crate::error::{OmniError, Result}; -use crate::storage::StorageAdapter; - -/// Per-query IO probes, installed for a query's task via [`with_query_io_probes`]. -/// -/// Each wrapper is attached (when present) to the datasets that category opens, -/// so a test reads `read_iops` off its own `IOTracker` handle. `probe_count` -/// records calls to the version probe (which runs on the coordinator's already-open -/// handle, so it is counted by invocation rather than by the per-query wrappers). -#[derive(Clone, Default)] -pub struct QueryIoProbes { - pub manifest_wrapper: Option>, - pub commit_graph_wrapper: Option>, - /// Attached to the per-table data opens a query performs (the cache-miss - /// path in `SubTableEntry::open`). Lets a cost test assert how many tables - /// a query actually opened — N on a cold read, 0 on a warm repeat once the - /// handle cache (Fix 3) serves them. 
- pub table_wrapper: Option>, - pub probe_count: Arc, - /// Counts DATA-table open CALLS through the two instrumented chokepoints - /// (`open_dataset_tracked` / `open_table_dataset`), classified by URI so the - /// internal/system tables (`__manifest`, `_graph_commits*`) are EXCLUDED — the - /// publisher CAS and commit-graph append open those every write, and counting - /// them would make the `data_open_count <= |touched_tables|` write gate - /// (RFC-013 step 3b) unreachable by threading alone. Unlike the opener-read - /// term (which mixes with the merge-insert/RI scan on the write path), this is - /// an exact open-invocation count. `forbidden_apis` keeps engine code OUTSIDE the - /// storage layer (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) from opening - /// datasets except through these chokepoints, so the count is complete for the - /// keyed-write data path the gate measures. (`table_store.rs` is allow-listed and - /// does hold direct `Dataset::open`s — but only for branch-management ops - /// (`delete_branch`/`list_branches`/`force_delete_branch`), never that hot path.) - pub data_open_count: Arc, - /// Internal/system-table (`__manifest`, `_graph_commits*`) open CALLS — the - /// complement of `data_open_count`, kept for symmetry and debugging. - pub internal_open_count: Arc, -} - -tokio::task_local! { - static QUERY_IO_PROBES: QueryIoProbes; -} - -/// Run `fut` with per-query IO probes installed. Test-only entry point; nothing -/// in production sets the probes, so the accessors below return `None`/no-op. 
-pub async fn with_query_io_probes(probes: QueryIoProbes, fut: F) -> F::Output -where - F: std::future::Future, -{ - QUERY_IO_PROBES.scope(probes, fut).await -} - -fn current(f: impl FnOnce(&QueryIoProbes) -> R) -> Option { - QUERY_IO_PROBES.try_with(f).ok() -} - -pub(crate) fn manifest_wrapper() -> Option> { - current(|p| p.manifest_wrapper.clone()).flatten() -} - -pub(crate) fn commit_graph_wrapper() -> Option> { - current(|p| p.commit_graph_wrapper.clone()).flatten() -} - -pub(crate) fn table_wrapper() -> Option> { - current(|p| p.table_wrapper.clone()).flatten() -} - -/// Record one version-probe invocation against the active per-query probes. -/// No-op when no probes are installed (production). -pub(crate) fn record_probe() { - let _ = current(|p| p.probe_count.fetch_add(1, Ordering::Relaxed)); -} - -/// Internal/system table directory names. An open of one of these is a metadata -/// open (publisher CAS, commit-graph append, recovery audit), NOT a data-table -/// open. Kept in sync with the dir constants in `db/manifest/layout.rs`, -/// `db/commit_graph.rs`, and `db/recovery_audit.rs`. -const INTERNAL_TABLE_DIRS: [&str; 4] = [ - "__manifest", - "_graph_commits.lance", - "_graph_commit_actors.lance", - "_graph_commit_recoveries.lance", -]; - -/// True when `uri`'s last path segment names an internal/system table. -fn open_is_internal(uri: &str) -> bool { - let trimmed = uri.trim_end_matches('/'); - let last = trimmed.rsplit('/').next().unwrap_or(trimmed); - INTERNAL_TABLE_DIRS.contains(&last) -} - -/// Record one table-open call against the active per-query probes, classified by -/// table class (the URI's last segment) so the write gate counts DATA-table opens -/// only and ignores the publisher/commit-graph metadata opens. No-op in production -/// (the classification runs only inside the probe closure, which `current` skips -/// when no probes are installed). Called at both open chokepoints. 
-pub(crate) fn record_open(uri: &str) { - let _ = current(|p| { - if open_is_internal(uri) { - p.internal_open_count.fetch_add(1, Ordering::Relaxed); - } else { - p.data_open_count.fetch_add(1, Ordering::Relaxed); - } - }); -} - -/// Per-operation staged-write counts, installed for a task via -/// [`with_merge_write_probes`]. Lets a cost-budget test assert WHICH staged-write -/// primitive an operation invokes — e.g. that an append-only fast-forward merge -/// routes new rows through `stage_append` and does **zero** `stage_merge_insert` -/// (the full-outer hash join). Counts the publish-path primitives only; -/// merge-staging temp tables use `append_or_create_batch`, not these. -#[derive(Clone, Default)] -pub struct MergeWriteProbes { - pub stage_append_calls: Arc, - pub stage_append_rows: Arc, - pub stage_merge_insert_calls: Arc, - pub stage_merge_insert_rows: Arc, - /// Inline vector-index (IVF) builds. The fast-forward adopt path defers - /// index coverage to the reconciler, so an adopt merge must do 0 of these. - pub create_vector_index_calls: Arc, - /// Times the merge materialized a staged delta into one in-memory batch - /// (`scan_staged_combined`). The append path streams instead, so an - /// append-only fast-forward merge must do 0 of these. 
- pub scan_staged_combined_calls: Arc, -} - -impl MergeWriteProbes { - pub fn stage_append_calls(&self) -> u64 { - self.stage_append_calls.load(Ordering::Relaxed) - } - pub fn stage_append_rows(&self) -> u64 { - self.stage_append_rows.load(Ordering::Relaxed) - } - pub fn stage_merge_insert_calls(&self) -> u64 { - self.stage_merge_insert_calls.load(Ordering::Relaxed) - } - pub fn stage_merge_insert_rows(&self) -> u64 { - self.stage_merge_insert_rows.load(Ordering::Relaxed) - } - pub fn create_vector_index_calls(&self) -> u64 { - self.create_vector_index_calls.load(Ordering::Relaxed) - } - pub fn scan_staged_combined_calls(&self) -> u64 { - self.scan_staged_combined_calls.load(Ordering::Relaxed) - } -} - -tokio::task_local! { - static MERGE_WRITE_PROBES: MergeWriteProbes; -} - -/// Run `fut` with staged-write probes installed. Test-only entry point; nothing -/// in production sets the probes, so `record_stage_*` below are no-ops. -pub async fn with_merge_write_probes(probes: MergeWriteProbes, fut: F) -> F::Output -where - F: std::future::Future, -{ - MERGE_WRITE_PROBES.scope(probes, fut).await -} - -/// Record one `stage_append` of `rows` rows against the active probes. No-op in -/// production (no probes installed). -pub(crate) fn record_stage_append(rows: u64) { - let _ = MERGE_WRITE_PROBES.try_with(|p| { - p.stage_append_calls.fetch_add(1, Ordering::Relaxed); - p.stage_append_rows.fetch_add(rows, Ordering::Relaxed); - }); -} - -/// Record one `stage_merge_insert` of `rows` rows against the active probes. -/// No-op in production (no probes installed). -pub(crate) fn record_stage_merge_insert(rows: u64) { - let _ = MERGE_WRITE_PROBES.try_with(|p| { - p.stage_merge_insert_calls.fetch_add(1, Ordering::Relaxed); - p.stage_merge_insert_rows.fetch_add(rows, Ordering::Relaxed); - }); -} - -/// Record one inline vector-index build against the active probes. No-op in -/// production (no probes installed). 
-pub(crate) fn record_create_vector_index() { - let _ = MERGE_WRITE_PROBES.try_with(|p| { - p.create_vector_index_calls.fetch_add(1, Ordering::Relaxed); - }); -} - -/// Record one `scan_staged_combined` materialization against the active probes. -/// No-op in production (no probes installed). -pub(crate) fn record_scan_staged_combined() { - let _ = MERGE_WRITE_PROBES.try_with(|p| { - p.scan_staged_combined_calls.fetch_add(1, Ordering::Relaxed); - }); -} - -/// Open a Lance dataset at `uri`, attaching `wrapper` (for IO counting) when -/// present. With no wrapper this is exactly `Dataset::open(uri)`. The wrapper is -/// set via `ObjectStoreParams` on the builder so the open itself is counted -/// (`Dataset::with_object_store_wrappers` only wraps an already-open store). -pub(crate) async fn open_dataset_tracked( - uri: &str, - wrapper: Option>, -) -> Result { - record_open(uri); - let result = match wrapper { - None => Dataset::open(uri).await, - Some(wrapper) => { - DatasetBuilder::from_uri(uri) - .with_store_params(ObjectStoreParams { - object_store_wrapper: Some(wrapper), - ..Default::default() - }) - .load() - .await - } - }; - result.map_err(|e| OmniError::Lance(e.to_string())) -} - -/// Open a data-table dataset at `location` pinned to `version` — the cache-miss -/// path of the data-read boundary (`SubTableEntry::open`). Attaches the shared -/// per-graph `Session` (warms metadata/index caches across opens, LanceDB's -/// one-session-per-connection pattern) and the per-query `table_wrapper` (for IO -/// counting) when present. With neither, this is exactly the Fix-2 -/// `from_uri(location).with_version(version)` open. 
-pub(crate) async fn open_table_dataset( - location: &str, - version: u64, - session: Option<&Arc>, -) -> Result { - record_open(location); - let mut builder = DatasetBuilder::from_uri(location).with_version(version); - if let Some(session) = session { - builder = builder.with_session(session.clone()); - } - if let Some(wrapper) = table_wrapper() { - builder = builder.with_store_params(ObjectStoreParams { - object_store_wrapper: Some(wrapper), - ..Default::default() - }); - } - builder - .load() - .await - .map_err(|e| OmniError::Lance(e.to_string())) -} - -/// Per-method read counts for [`CountingStorageAdapter`]. -#[derive(Debug, Default)] -pub struct StorageReadCounts { - pub read_text: AtomicU64, - pub exists: AtomicU64, - pub read_text_versioned: AtomicU64, - pub list_dir: AtomicU64, -} - -impl StorageReadCounts { - pub fn read_text(&self) -> u64 { - self.read_text.load(Ordering::Relaxed) - } - pub fn exists(&self) -> u64 { - self.exists.load(Ordering::Relaxed) - } - pub fn read_text_versioned(&self) -> u64 { - self.read_text_versioned.load(Ordering::Relaxed) - } - pub fn list_dir(&self) -> u64 { - self.list_dir.load(Ordering::Relaxed) - } -} - -/// Boundary decorator over a [`StorageAdapter`] that counts read-facing calls. -/// Reads delegate after incrementing; writes delegate unchanged. Construct with -/// [`CountingStorageAdapter::new`] and open an engine via -/// `Omnigraph::open_with_storage` to count its non-Lance storage IO. -#[derive(Debug)] -pub struct CountingStorageAdapter { - inner: Arc, - counts: Arc, -} - -impl CountingStorageAdapter { - /// Wrap `inner`, returning the adapter and a shared handle to its counts. 
- pub fn new(inner: Arc) -> (Arc, Arc) { - let counts = Arc::new(StorageReadCounts::default()); - let adapter: Arc = Arc::new(Self { - inner, - counts: Arc::clone(&counts), - }); - (adapter, counts) - } -} - -#[async_trait] -impl StorageAdapter for CountingStorageAdapter { - async fn read_text(&self, uri: &str) -> Result { - self.counts.read_text.fetch_add(1, Ordering::Relaxed); - self.inner.read_text(uri).await - } - - async fn write_text(&self, uri: &str, contents: &str) -> Result<()> { - self.inner.write_text(uri, contents).await - } - - async fn write_text_if_absent(&self, uri: &str, contents: &str) -> Result { - self.inner.write_text_if_absent(uri, contents).await - } - - async fn exists(&self, uri: &str) -> Result { - self.counts.exists.fetch_add(1, Ordering::Relaxed); - self.inner.exists(uri).await - } - - async fn rename_text(&self, from_uri: &str, to_uri: &str) -> Result<()> { - self.inner.rename_text(from_uri, to_uri).await - } - - async fn delete(&self, uri: &str) -> Result<()> { - self.inner.delete(uri).await - } - - async fn list_dir(&self, dir_uri: &str) -> Result> { - self.counts.list_dir.fetch_add(1, Ordering::Relaxed); - self.inner.list_dir(dir_uri).await - } - - async fn read_text_versioned(&self, uri: &str) -> Result<(String, String)> { - self.counts.read_text_versioned.fetch_add(1, Ordering::Relaxed); - self.inner.read_text_versioned(uri).await - } - - async fn write_text_if_match( - &self, - uri: &str, - contents: &str, - expected_version: &str, - ) -> Result> { - self.inner - .write_text_if_match(uri, contents, expected_version) - .await - } - - async fn delete_prefix(&self, prefix_uri: &str) -> Result<()> { - self.inner.delete_prefix(prefix_uri).await - } -} diff --git a/crates/omnigraph/src/lib.rs b/crates/omnigraph/src/lib.rs index 7dd7135..ff0b3d6 100644 --- a/crates/omnigraph/src/lib.rs +++ b/crates/omnigraph/src/lib.rs @@ -14,7 +14,6 @@ pub mod error; mod exec; pub mod failpoints; pub mod graph_index; -pub mod instrumentation; pub mod 
loader; pub mod runtime_cache; pub mod storage; diff --git a/crates/omnigraph/src/loader/mod.rs b/crates/omnigraph/src/loader/mod.rs index 075724d..cade1f4 100644 --- a/crates/omnigraph/src/loader/mod.rs +++ b/crates/omnigraph/src/loader/mod.rs @@ -26,14 +26,6 @@ use crate::exec::staging::{MutationStaging, PendingMode}; /// Result of a load operation. #[derive(Debug, Clone, Default)] pub struct LoadResult { - /// Branch the load landed on (`"main"` when no branch was given). - pub branch: String, - /// Base branch a fork was requested from (the `base` parameter of - /// `load_as`), recorded verbatim even when the target branch already - /// existed and no fork happened. - pub base_branch: Option, - /// True when this load created `branch` by forking it from `base_branch`. - pub branch_created: bool, pub nodes_loaded: HashMap, pub edges_loaded: HashMap, } @@ -65,27 +57,21 @@ pub enum LoadMode { Merge, } -/// Convenience: load JSONL data onto the database handle's *active branch* -/// (`main` when unbound). Equivalent to `db.load(active_branch, data, mode)`; -/// use `Omnigraph::load`/`load_as` directly when targeting an explicit branch -/// or when fork-from-base semantics are needed. -pub async fn load_jsonl(db: &Omnigraph, data: &str, mode: LoadMode) -> Result { +/// Load JSONL data into an Omnigraph database. +pub async fn load_jsonl(db: &mut Omnigraph, data: &str, mode: LoadMode) -> Result { let current_branch = db.active_branch().await; let branch = current_branch.as_deref().unwrap_or("main"); db.load(branch, data, mode).await } -/// Convenience: like [`load_jsonl`] but reading from a file path. -pub async fn load_jsonl_file(db: &Omnigraph, path: &str, mode: LoadMode) -> Result { +/// Load JSONL data from a file path. 
+pub async fn load_jsonl_file(db: &mut Omnigraph, path: &str, mode: LoadMode) -> Result { let current_branch = db.active_branch().await; let branch = current_branch.as_deref().unwrap_or("main"); db.load_file(branch, path, mode).await } impl Omnigraph { - #[deprecated( - note = "use `load_as` with an explicit `base` instead; the ingest family will be removed in a future release" - )] pub async fn ingest( &self, branch: &str, @@ -93,17 +79,9 @@ impl Omnigraph { data: &str, mode: LoadMode, ) -> Result { - #[allow(deprecated)] self.ingest_as(branch, from, data, mode, None).await } - /// Deprecated shim over the unified `load_as`. Preserves the historical - /// ingest contract exactly: `from: None` means fork from `main`, and the - /// base branch is recorded in the result even when the target branch - /// already existed (no fork happened). - #[deprecated( - note = "use `load_as` with an explicit `base` instead; the ingest family will be removed in a future release" - )] pub async fn ingest_as( &self, branch: &str, @@ -112,24 +90,22 @@ impl Omnigraph { mode: LoadMode, actor_id: Option<&str>, ) -> Result { - let result = self - .load_as(branch, Some(from.unwrap_or("main")), data, mode, actor_id) - .await?; - Ok(IngestResult { - branch: result.branch.clone(), - base_branch: result - .base_branch - .clone() - .unwrap_or_else(|| "main".to_string()), - branch_created: result.branch_created, - mode, - tables: result.to_ingest_tables(), - }) + // Engine-layer policy gate (MR-722 fan-out / PR #3). Scope is + // `Branch(branch)` for the data-write portion. If ingest creates + // a new branch as a side-effect (target branch doesn't exist), + // the inner `branch_create_from_as` call below additionally + // checks `BranchCreate` — both authorities are genuinely needed + // for "ingest into a fresh branch", so the layered check is + // correct, not redundant. 
+ self.enforce( + omnigraph_policy::PolicyAction::Change, + &omnigraph_policy::ResourceScope::Branch(branch.to_string()), + actor_id, + )?; + self.ingest_with_current_actor(branch, from, data, mode, actor_id) + .await } - #[deprecated( - note = "use `load_file_as` with an explicit `base` instead; the ingest family will be removed in a future release" - )] pub async fn ingest_file( &self, branch: &str, @@ -137,13 +113,9 @@ impl Omnigraph { path: &str, mode: LoadMode, ) -> Result { - #[allow(deprecated)] self.ingest_file_as(branch, from, path, mode, None).await } - #[deprecated( - note = "use `load_file_as` with an explicit `base` instead; the ingest family will be removed in a future release" - )] pub async fn ingest_file_as( &self, branch: &str, @@ -153,51 +125,75 @@ impl Omnigraph { actor_id: Option<&str>, ) -> Result { let data = std::fs::read_to_string(path).map_err(OmniError::Io)?; - #[allow(deprecated)] self.ingest_as(branch, from, &data, mode, actor_id).await } - pub async fn load(&self, branch: &str, data: &str, mode: LoadMode) -> Result { - self.load_as(branch, None, data, mode, None).await + async fn ingest_with_current_actor( + &self, + branch: &str, + from: Option<&str>, + data: &str, + mode: LoadMode, + actor_id: Option<&str>, + ) -> Result { + self.ensure_schema_state_valid().await?; + let target_branch = + Self::normalize_branch_name(branch)?.unwrap_or_else(|| "main".to_string()); + let base_branch = Self::normalize_branch_name(from.unwrap_or("main"))? + .unwrap_or_else(|| "main".to_string()); + let branch_created = !self + .branch_list() + .await? + .iter() + .any(|name| name == &target_branch); + if branch_created { + // Thread the actor through to the implicit BranchCreate so + // policy decisions match what an explicit `branch_create_from_as` + // call would see. 
Calling the no-actor variant here would + // bypass BranchCreate enforcement when policy is installed — + // the footgun guard catches that case too, but threading is + // the correct fix. + self.branch_create_from_as( + crate::db::ReadTarget::branch(&base_branch), + &target_branch, + actor_id, + ) + .await?; + } + + let result = self.load_as(&target_branch, data, mode, actor_id).await?; + Ok(IngestResult { + branch: target_branch, + base_branch, + branch_created, + mode, + tables: result.to_ingest_tables(), + }) + } + + pub async fn load(&self, branch: &str, data: &str, mode: LoadMode) -> Result { + self.load_as(branch, data, mode, None).await } - /// Load JSONL data onto `branch`. - /// - /// `base` selects the branch-creation behavior: with `Some(base)`, a - /// missing target branch is forked from `base` first (the former - /// `ingest` semantics); with `None`, the target branch must already - /// exist — staging fails on an unknown branch when it resolves the - /// manifest snapshot, so a typo'd branch name can never create one. pub async fn load_as( &self, branch: &str, - base: Option<&str>, data: &str, mode: LoadMode, actor_id: Option<&str>, ) -> Result { // Engine-layer policy gate (MR-722 fan-out / PR #3). Scope is // `Branch(branch)` to match the HTTP-layer Change convention. - // When a fork happens below, `branch_create_from_as` additionally - // checks `BranchCreate` — both authorities are genuinely needed - // for "load into a fresh branch", so the layered check is - // correct, not redundant. + // `ingest_as` also calls `load_as` after enforcing its own + // Change gate — that double-check is fine because both gates + // resolve to identical Cedar decisions for the same actor + + // branch (the second check is a structurally-correct no-op). 
self.enforce( omnigraph_policy::PolicyAction::Change, &omnigraph_policy::ResourceScope::Branch(branch.to_string()), actor_id, )?; - // Schema-contract validation is captured ONCE per write via the - // `WriteTxn` opened in `load_jsonl_reader` (after branch resolution). - // The redundant `ensure_schema_state_valid` that used to run here is - // subsumed by `open_write_txn`'s `resolved_branch_target` call. - // Converge any pending recovery sidecar (a previously failed - // writer's Phase B → Phase C residual) before staging anything: - // without this, sidecar-covered drift wedges every load on the - // commit-time drift guard until a process restart — `repair` - // refuses while a sidecar is pending. One `list_dir` when no - // sidecars exist (the steady state). - self.heal_pending_recovery_sidecars().await?; + self.ensure_schema_state_valid().await?; // Reject internal `__run__*` / system-prefixed branches at the // public write boundary. Direct-publish paths assert this // explicitly so a caller can't write to legacy or system @@ -209,47 +205,15 @@ impl Omnigraph { // `commit_prepared_updates_on_branch_with_expected`) and leave // `self.coordinator` with a stale manifest snapshot. let requested = Self::normalize_branch_name(branch)?; - let base_branch = match base { - Some(base) => { - Some(Self::normalize_branch_name(base)?.unwrap_or_else(|| "main".to_string())) - } - None => None, - }; - // Fork-if-missing only when a base branch was explicitly given. - // `requested == None` is `main`, which always exists. - let mut branch_created = false; - if let (Some(target), Some(base_name)) = (requested.as_deref(), base_branch.as_deref()) { - let exists = self.branch_list().await?.iter().any(|name| name == target); - if !exists { - // Thread the actor through to the implicit BranchCreate so - // policy decisions match what an explicit `branch_create_from_as` - // call would see.
Calling the no-actor variant here would - // bypass BranchCreate enforcement when policy is installed — - // the footgun guard catches that case too, but threading is - // the correct fix. - self.branch_create_from_as( - crate::db::ReadTarget::branch(base_name), - target, - actor_id, - ) - .await?; - branch_created = true; - } - } // Direct-to-target writes: no Run state machine, no `__run__` staging // branch. Cross-table OCC is enforced by the publisher's // `expected_table_versions` CAS inside `load_jsonl_reader`. - let mut result = self - .load_direct_on_branch(requested.as_deref(), data, mode, actor_id) - .await?; - result.branch = requested.unwrap_or_else(|| "main".to_string()); - result.base_branch = base_branch; - result.branch_created = branch_created; - Ok(result) + self.load_direct_on_branch(requested.as_deref(), data, mode, actor_id) + .await } pub async fn load_file(&self, branch: &str, path: &str, mode: LoadMode) -> Result { - self.load_file_as(branch, None, path, mode, None).await + self.load_file_as(branch, path, mode, None).await } /// Read a file into memory and delegate to `load_as`. Used by the @@ -258,13 +222,12 @@ impl Omnigraph { pub async fn load_file_as( &self, branch: &str, - base: Option<&str>, path: &str, mode: LoadMode, actor_id: Option<&str>, ) -> Result { - let data = std::fs::read_to_string(path).map_err(OmniError::Io)?; - self.load_as(branch, base, &data, mode, actor_id).await + let data = std::fs::read_to_string(path).map_err(|e| OmniError::Io(e))?; + self.load_as(branch, &data, mode, actor_id).await } async fn load_direct_on_branch( @@ -325,24 +288,22 @@ async fn load_jsonl_reader( let mut node_rows: HashMap> = HashMap::new(); let mut edge_rows: HashMap> = HashMap::new(); - // Parse a stream of JSON values.
Accepts both compact JSONL (one object - // per line) and pretty-printed JSON where a single object spans multiple - // lines — serde's streaming deserializer treats any whitespace (including - // newlines) between top-level values as a separator. - for (idx, parsed) in serde_json::Deserializer::from_reader(reader) - .into_iter::() - .enumerate() - { - let record_num = idx + 1; - let value: JsonValue = parsed.map_err(|e| { - OmniError::manifest(format!("invalid JSON at record {}: {}", record_num, e)) + for (line_num, line) in reader.lines().enumerate() { + let line = line?; + let line = line.trim(); + if line.is_empty() { + continue; + } + let value: JsonValue = serde_json::from_str(line).map_err(|e| { + OmniError::manifest(format!("invalid JSON on line {}: {}", line_num + 1, e)) })?; if let Some(type_name) = value.get("type").and_then(|v| v.as_str()) { if !catalog.node_types.contains_key(type_name) { return Err(OmniError::manifest(format!( - "record {}: unknown node type '{}'", - record_num, type_name + "line {}: unknown node type '{}'", + line_num + 1, + type_name ))); } let data = value @@ -356,22 +317,23 @@ async fn load_jsonl_reader( } else if let Some(edge_name) = value.get("edge").and_then(|v| v.as_str()) { if catalog.lookup_edge_by_name(edge_name).is_none() { return Err(OmniError::manifest(format!( - "record {}: unknown edge type '{}'", - record_num, edge_name + "line {}: unknown edge type '{}'", + line_num + 1, + edge_name ))); } let from = value .get("from") .and_then(|v| v.as_str()) .ok_or_else(|| { - OmniError::manifest(format!("record {}: edge missing 'from'", record_num)) + OmniError::manifest(format!("line {}: edge missing 'from'", line_num + 1)) })? .to_string(); let to = value .get("to") .and_then(|v| v.as_str()) .ok_or_else(|| { - OmniError::manifest(format!("record {}: edge missing 'to'", record_num)) + OmniError::manifest(format!("line {}: edge missing 'to'", line_num + 1)) })?
.to_string(); let data = value @@ -385,39 +347,34 @@ async fn load_jsonl_reader( .push((from, to, data)); } else { return Err(OmniError::manifest(format!( - "record {}: expected 'type' or 'edge' field", - record_num + "line {}: expected 'type' or 'edge' field", + line_num + 1 ))); } } // Phase 2: Build per-type RecordBatches and accumulate into the - // staging pipeline. Batches go into an in-memory accumulator and a - // single `stage_*` + `commit_staged` per touched table runs at - // end-of-load — a mid-load failure (RI / cardinality violation) leaves - // Lance HEAD untouched. `LoadMode::Overwrite` uses Lance's staged - // `Overwrite` transaction rather than the former truncate-then-append - // inline path. + // staging pipeline. For Append/Merge, batches go into an in-memory + // accumulator and a single `stage_*` + `commit_staged` per touched + // table runs at end-of-load — a mid-load failure (RI / cardinality + // violation) leaves Lance HEAD untouched. For Overwrite, the legacy + // inline-commit path is preserved (truncate+append doesn't fit the + // staged shape cleanly, and overwrite has no in-flight read-your-writes + // requirement). let mut result = LoadResult::default(); - // Capture-once write transaction (RFC-013 step 3b). `open_write_txn` - // validates the schema contract ONCE and pins the base snapshot. Threaded - // as `Some(&txn)` through the per-table opens and the manifest publish so - // each resolve point reuses the pinned base instead of re-validating the - // contract. The branch already exists here (fork-if-missing ran in - // `load_as` before this), so this captures the post-fork snapshot. The - // load's own base read (`db.snapshot_for_branch` previously) is the same - // per-branch snapshot, so reuse `txn.base` for it — dropping a validation.
- let txn = db.open_write_txn(branch).await?; - let snapshot = txn.base.clone(); + let snapshot = db.snapshot_for_branch(branch).await?; + let use_staging = !matches!(mode, LoadMode::Overwrite); let mut staging = MutationStaging::default(); + let mut overwrite_updates: Vec = Vec::new(); + let mut overwrite_expected: HashMap = HashMap::new(); let pending_mode = match mode { LoadMode::Merge => PendingMode::Merge, // Append-mode loads accumulate as Append. Edge tables (no @key) // and no-key node tables stay safe on the stage_append path. The // Merge mode applies dedupe-by-id; Append assumes unique inputs. LoadMode::Append => PendingMode::Append, - LoadMode::Overwrite => PendingMode::Overwrite, + LoadMode::Overwrite => PendingMode::Append, // unused }; // Map LoadMode to MutationOpKind for the version-check policy. // Append/Merge skip the strict pre-stage check (concurrency-safe @@ -430,45 +387,6 @@ async fn load_jsonl_reader( LoadMode::Overwrite => crate::db::MutationOpKind::SchemaRewrite, }; - // Up-front fork-queue acquisition. The first write to a table on a - // non-main branch forks it (create_branch), which advances Lance state - // before the manifest publish; the reclaim of any manifest-unreferenced - // leftover (`reclaim_orphaned_fork_and_refork`) must not race a concurrent - // in-process fork. So when this load will fork at least one touched table, - // acquire the per-(table, branch) write queues for ALL touched tables up - // front (one sorted `acquire_many`, keyed uniformly by the target branch - // so it covers what `commit_all` recomputes) and hold them through the - // publish. Main-branch loads never fork; branch loads where every touched - // table is already forked skip this and let `commit_all` acquire at commit. 
- let fork_queue_guards: Option<( - Vec<(String, Option)>, - Vec>, - )> = if let Some(active) = branch { - let touched: Vec<(String, Option)> = node_rows - .keys() - .map(|t| (format!("node:{t}"), Some(active.to_string()))) - .chain( - edge_rows - .keys() - .map(|e| (format!("edge:{e}"), Some(active.to_string()))), - ) - .collect(); - let needs_fork = touched.iter().any(|(table_key, _)| { - snapshot - .entry(table_key) - .map(|e| e.table_branch.as_deref() != Some(active)) - .unwrap_or(false) - }); - if needs_fork { - let guards = db.write_queue().acquire_many(&touched).await; - Some((touched, guards)) - } else { - None - } - } else { - None - }; - // Phase 2a: build and validate every node batch up front. Cheap and // synchronous — surfaces validation errors before any S3 traffic. let mut prepared_nodes: Vec<(String, String, RecordBatch, usize)> = @@ -478,52 +396,87 @@ async fn load_jsonl_reader( let batch = build_node_batch(node_type, rows)?; validate_value_constraints(&batch, node_type)?; validate_enum_constraints(&batch, &node_type.properties, type_name)?; - let unique_groups = unique_constraint_groups_for_node(node_type); - if !unique_groups.is_empty() { - enforce_unique_constraints_intra_batch(&batch, type_name, &unique_groups)?; + let unique_props = unique_property_names_for_node(node_type); + if !unique_props.is_empty() { + enforce_unique_constraints_intra_batch(&batch, type_name, &unique_props)?; } let loaded_count = batch.num_rows(); let table_key = format!("node:{}", type_name); - let _entry = snapshot + let entry = snapshot .entry(&table_key) .ok_or_else(|| OmniError::manifest(format!("no manifest entry for {}", table_key)))?; + if !use_staging { + overwrite_expected.insert(table_key.clone(), entry.table_version); + } prepared_nodes.push((type_name.clone(), table_key, batch, loaded_count)); } - // Phase 2b: accumulate every node type in memory. Fragment writes are - // delayed until after all validation succeeds.
- for (type_name, table_key, batch, loaded_count) in prepared_nodes { - // The loader only needs the captured expected version (the publisher's - // CAS fence) for `ensure_path` — it discards the handle. With a - // non-strict load op (Merge/Append) and a `WriteTxn`, collapse #1 skips - // the dataset open and returns the pinned base version directly. - let opened = db - .open_for_mutation_on_branch(branch, &table_key, load_op_kind, Some(&txn)) - .await?; - staging.ensure_path( - &table_key, - opened.full_path, - opened.table_branch, - opened.expected_version, - load_op_kind, - ); - let schema = batch.schema(); - staging.append_batch(&table_key, schema, pending_mode, batch)?; - result.nodes_loaded.insert(type_name, loaded_count); + // Phase 2b: write every node type. Append/Merge → in-memory + // accumulator. Overwrite → concurrent inline-commit (legacy path). + if use_staging { + for (type_name, table_key, batch, loaded_count) in prepared_nodes { + let (ds, full_path, table_branch) = db + .open_for_mutation_on_branch(branch, &table_key, load_op_kind) + .await?; + let expected_version = ds.version().version; + staging.ensure_path( + &table_key, + full_path, + table_branch, + expected_version, + load_op_kind, + ); + let schema = batch.schema(); + staging.append_batch(&table_key, schema, pending_mode, batch)?; + result.nodes_loaded.insert(type_name, loaded_count); + } + } else { + let node_write_results = + write_batches_concurrently(db, branch, mode, prepared_nodes).await?; + for (type_name, table_key, loaded_count, state, table_branch) in node_write_results { + overwrite_updates.push(crate::db::SubTableUpdate { + table_key, + table_version: state.version, + table_branch, + row_count: state.row_count, + version_metadata: state.version_metadata, + }); + result.nodes_loaded.insert(type_name, loaded_count); + } } // Phase 2c: Validate edge referential integrity — every src/dst must - // reference an existing node ID in the appropriate type.
For - // Append/Merge the lookup unions snapshot-committed IDs with the - // in-memory pending batches. For Overwrite, a touched node table's - // pending batch is the replacement image, so committed rows are not - // included for that table. + // reference an existing node ID in the appropriate type. For staged + // loads, the lookup unions snapshot-committed IDs with the in-memory + // pending batches (which carry the just-staged node inserts). for (edge_name, rows) in &edge_rows { let edge_type = &catalog.edge_types[edge_name]; - let from_ids = - collect_node_ids_with_pending(db, branch, &edge_type.from_type, &staging).await?; - let to_ids = - collect_node_ids_with_pending(db, branch, &edge_type.to_type, &staging).await?; + let from_ids = if use_staging { + collect_node_ids_with_pending(db, branch, &edge_type.from_type, &staging).await? + } else { + collect_node_ids( + db, + branch, + &edge_type.from_type, + &node_rows, + &catalog, + &overwrite_updates, + ) + .await? + }; + let to_ids = if use_staging { + collect_node_ids_with_pending(db, branch, &edge_type.to_type, &staging).await? + } else { + collect_node_ids( + db, + branch, + &edge_type.to_type, + &node_rows, + &catalog, + &overwrite_updates, + ) + .await? 
+ }; for (i, (src, dst, _)) in rows.iter().enumerate() { if !from_ids.contains(src.as_str()) { @@ -554,99 +507,124 @@ async fn load_jsonl_reader( let edge_type = &catalog.edge_types[edge_name]; let batch = build_edge_batch(edge_type, rows)?; validate_enum_constraints(&batch, &edge_type.properties, edge_name)?; - let unique_groups = unique_constraint_groups_for_edge(edge_type); - if !unique_groups.is_empty() { - enforce_unique_constraints_intra_batch(&batch, edge_name, &unique_groups)?; + let unique_props = unique_property_names_for_edge(edge_type); + if !unique_props.is_empty() { + enforce_unique_constraints_intra_batch(&batch, edge_name, &unique_props)?; } let loaded_count = batch.num_rows(); let table_key = format!("edge:{}", edge_name); - let _entry = snapshot + let entry = snapshot .entry(&table_key) .ok_or_else(|| OmniError::manifest(format!("no manifest entry for {}", table_key)))?; + if !use_staging { + overwrite_expected.insert(table_key.clone(), entry.table_version); + } prepared_edges.push((edge_name.clone(), table_key, batch, loaded_count)); } - // Phase 2e: accumulate every edge type. Same dispatch as Phase 2b. - for (edge_name, table_key, batch, loaded_count) in prepared_edges { - // Same as the node phase: only the captured expected version is used; - // collapse #1 skips the open for a non-strict load op under a `WriteTxn`. - let opened = db - .open_for_mutation_on_branch(branch, &table_key, load_op_kind, Some(&txn)) - .await?; - staging.ensure_path( - &table_key, - opened.full_path, - opened.table_branch, - opened.expected_version, - load_op_kind, - ); - let schema = batch.schema(); - staging.append_batch(&table_key, schema, pending_mode, batch)?; - result.edges_loaded.insert(edge_name, loaded_count); + // Phase 2e: write every edge type. Same dispatch as Phase 2b. 
+ if use_staging { + for (edge_name, table_key, batch, loaded_count) in prepared_edges { + let (ds, full_path, table_branch) = db + .open_for_mutation_on_branch(branch, &table_key, load_op_kind) + .await?; + let expected_version = ds.version().version; + staging.ensure_path( + &table_key, + full_path, + table_branch, + expected_version, + load_op_kind, + ); + let schema = batch.schema(); + staging.append_batch(&table_key, schema, pending_mode, batch)?; + result.edges_loaded.insert(edge_name, loaded_count); + } + } else { + let edge_write_results = + write_batches_concurrently(db, branch, mode, prepared_edges).await?; + for (edge_name, table_key, loaded_count, state, table_branch) in edge_write_results { + overwrite_updates.push(crate::db::SubTableUpdate { + table_key, + table_version: state.version, + table_branch, + row_count: state.row_count, + version_metadata: state.version_metadata, + }); + result.edges_loaded.insert(edge_name, loaded_count); + } } // Phase 3: Validate edge cardinality constraints (before commit — - // invalid data must not be committed). The helper scans committed - // edges via Lance + iterates pending edges in-memory; for Overwrite it - // treats the pending edge batches as the replacement table image. + // invalid data must not be committed). Staged path scans committed + // edges via Lance + iterates pending edges in-memory. Overwrite path + // opens the just-written version (legacy behavior).
for (edge_name, _) in &edge_rows { let edge_type = &catalog.edge_types[edge_name]; let table_key = format!("edge:{}", edge_name); - validate_edge_cardinality_with_pending_loader( - db, branch, edge_type, &table_key, &staging, mode, - ) - .await?; + if use_staging { + validate_edge_cardinality_with_pending_loader( + db, branch, edge_type, &table_key, &staging, mode, + ) + .await?; + } else if let Some(update) = overwrite_updates.iter().find(|u| u.table_key == table_key) { + validate_edge_cardinality( + db, + branch, + edge_name, + update.table_version, + update.table_branch.as_deref(), + ) + .await?; + } } // Phase 4: Atomic manifest commit with publisher-level OCC. - let staged = staging - .stage_all_with_concurrency(db, branch, load_write_concurrency()) - .await?; - // `_queue_guards` holds per-(table_key, branch) write queues - // across the manifest publish below — see exec/mutation.rs for - // the rationale (interleaving prevention). - let crate::exec::staging::CommittedMutation { - updates, - expected_versions, - sidecar_handle, - guards: _queue_guards, - committed_handles, - } = staged - .commit_all( - db, + if use_staging { + let staged = staging.stage_all(db, branch).await?; + // `_queue_guards` holds per-(table_key, branch) write queues + // across the manifest publish below — see exec/mutation.rs for + // the rationale (interleaving prevention). + let (updates, expected_versions, sidecar_handle, _queue_guards) = staged + .commit_all(db, branch, crate::db::manifest::SidecarKind::Load, actor_id) + .await?; + // Same finalize → publisher residual as mutations: per-table + // staged commits have advanced Lance HEAD, but the manifest + // publish has not run yet. Reuse the mutation failpoint name so + // one failpoint pins the shared `MutationStaging` boundary.
+ crate::failpoints::maybe_fail("mutation.post_finalize_pre_publisher")?; + db.commit_updates_on_branch_with_expected(branch, &updates, &expected_versions, actor_id) + .await?; + // The recovery sidecar protects the per-table commit_staged → + // manifest publish window. Phase C succeeded — clean up + // best-effort: failing the user here would error out a write + // that already landed durably. + if let Some(handle) = sidecar_handle { + if let Err(err) = + crate::db::manifest::delete_sidecar(&handle, db.storage_adapter()).await + { + tracing::warn!( + error = %err, + operation_id = handle.operation_id.as_str(), + "recovery sidecar cleanup failed; the next open's recovery sweep will resolve it" + ); + } + } + } else { + // LoadMode::Overwrite keeps the legacy inline-commit path — + // truncate-then-append doesn't fit the staged shape (see + // `docs/runs.md` "LoadMode::Overwrite residual"). The recovery + // sidecar is not applicable here because the writer doesn't go + // through MutationStaging; per-table inline commits + a final + // manifest publish handle their own residual via the documented + // operator workflow (re-run overwrite to recover). + db.commit_updates_on_branch_with_expected( branch, - crate::db::manifest::SidecarKind::Load, + &overwrite_updates, + &overwrite_expected, actor_id, - fork_queue_guards, - Some(&txn), ) .await?; - // Same finalize → publisher residual as mutations: per-table - // staged commits have advanced Lance HEAD, but the manifest - // publish has not run yet. Reuse the mutation failpoint name so - // one failpoint pins the shared `MutationStaging` boundary. - crate::failpoints::maybe_fail(crate::failpoints::names::MUTATION_POST_FINALIZE_PRE_PUBLISHER)?; - db.commit_updates_on_branch_with_expected( - branch, - &updates, - &expected_versions, - actor_id, - Some(&txn), - committed_handles, - ) - .await?; - // The recovery sidecar protects the per-table commit_staged → - // manifest publish window.
Phase C succeeded — clean up - // best-effort: failing the user here would error out a write - // that already landed durably. - if let Some(handle) = sidecar_handle { - if let Err(err) = crate::db::manifest::delete_sidecar(&handle, db.storage_adapter()).await { - tracing::warn!( - error = %err, - operation_id = handle.operation_id.as_str(), - "recovery sidecar cleanup failed; the next open's recovery sweep will resolve it" - ); - } } Ok(result) @@ -1176,6 +1154,89 @@ fn load_write_concurrency() -> usize { .unwrap_or(DEFAULT_LOAD_WRITE_CONCURRENCY) } +/// Write a set of prepared `(type_name, table_key, batch, row_count)` tuples +/// concurrently. Returns results in original iteration order so callers can +/// zip them back to per-type metadata. +async fn write_batches_concurrently( + db: &Omnigraph, + branch: Option<&str>, + mode: LoadMode, + prepared: Vec<(String, String, RecordBatch, usize)>, +) -> Result< + Vec<( + String, + String, + usize, + crate::table_store::TableState, + Option, + )>, +> { + use futures::stream::StreamExt; + + if prepared.is_empty() { + return Ok(Vec::new()); + } + + let concurrency = load_write_concurrency().min(prepared.len()).max(1); + + futures::stream::iter(prepared.into_iter().map( + |(type_name, table_key, batch, loaded_count)| async move { + let (state, table_branch) = + write_batch_to_dataset(db, branch, &table_key, batch, mode).await?; + Ok::<_, OmniError>((type_name, table_key, loaded_count, state, table_branch)) + }, + )) + .buffered(concurrency) + .collect::>>() + .await + .into_iter() + .collect() +} + +async fn write_batch_to_dataset( + db: &Omnigraph, + branch: Option<&str>, + table_key: &str, + batch: RecordBatch, + mode: LoadMode, +) -> Result<(crate::table_store::TableState, Option)> { + let op_kind = match mode { + LoadMode::Append => crate::db::MutationOpKind::Insert, + LoadMode::Merge => crate::db::MutationOpKind::Merge, + LoadMode::Overwrite => crate::db::MutationOpKind::SchemaRewrite, + }; + let (mut ds,
full_path, table_branch) = db + .open_for_mutation_on_branch(branch, table_key, op_kind) + .await?; + let table_store = db.table_store(); + + match mode { + LoadMode::Overwrite => { + let state = table_store + .overwrite_batch(&full_path, &mut ds, batch) + .await?; + Ok((state, table_branch)) + } + LoadMode::Append => { + let state = table_store.append_batch(&full_path, &mut ds, batch).await?; + Ok((state, table_branch)) + } + LoadMode::Merge => { + let state = table_store + .merge_insert_batch( + &full_path, + ds, + batch, + vec!["id".to_string()], + lance::dataset::WhenMatched::UpdateAll, + lance::dataset::WhenNotMatched::InsertAll, + ) + .await?; + Ok((state, table_branch)) + } + } +} + fn generate_id() -> String { ulid::Ulid::new().to_string() } @@ -1361,16 +1422,8 @@ pub(crate) fn validate_enum_constraints( Ok(()) } -/// Detect duplicate values within a single `RecordBatch` for any of the -/// `unique_constraints` groups. Each group is a list of one or more columns -/// that together form a uniqueness key: a violation occurs when two rows share -/// the same tuple of values across *all* columns in a group, so a composite -/// `@unique(a, b)` only conflicts when both `a` and `b` match. Returns an -/// error on the first duplicate found. -/// -/// Rows where any column in a group is null are exempt (standard SQL semantics -/// for uniqueness over nullable columns), as is any group whose columns are -/// not all present in the batch (e.g. a partial-schema load). +/// Detect duplicate values within a single `RecordBatch` for any of the named +/// `unique_properties`. Returns an error on the first duplicate found. /// /// Note: this only catches duplicates *within* the batch. 
Cross-batch /// uniqueness against already-committed rows is not enforced here — that @@ -1378,37 +1431,22 @@ pub(crate) fn validate_enum_constraints( pub(crate) fn enforce_unique_constraints_intra_batch( batch: &RecordBatch, type_name: &str, - unique_constraints: &[Vec], + unique_properties: &[String], ) -> Result<()> { - for columns in unique_constraints { - // Resolve the group's columns once. A group whose columns aren't all - // present in this batch is skipped (e.g. a partial-schema load). - let Some(group_columns) = columns - .iter() - .map(|name| { - batch - .schema() - .index_of(name) - .ok() - .map(|i| batch.column(i).clone()) - }) - .collect::>>() - else { + for property in unique_properties { + let Some(col_idx) = batch.schema().index_of(property).ok() else { continue; }; - let mut seen: HashMap, usize> = HashMap::new(); + let arr = batch.column(col_idx); + let mut seen: HashMap = HashMap::new(); for row in 0..batch.num_rows() { - let Some(key) = composite_unique_key(&group_columns, row)? else { + let Some(value) = scalar_to_string(arr, row) else { continue; }; - if let Some(prev_row) = seen.insert(key.clone(), row) { + if let Some(prev_row) = seen.insert(value.clone(), row) { return Err(OmniError::manifest(format!( "@unique violation on {}.{}: value '{}' appears in rows {} and {}", - type_name, - format_tuple(columns), - format_tuple(&key), - prev_row, - row + type_name, property, value, prev_row, row ))); } } @@ -1416,131 +1454,80 @@ pub(crate) fn enforce_unique_constraints_intra_batch( Ok(()) } -/// Build the composite uniqueness key for `row` over a constraint group's -/// already-resolved columns (in declaration order). -/// -/// The key is the *tuple* of per-column scalar strings (`Vec`), keyed -/// directly in the dedup map — there is no separator, so no data value can -/// forge a collision (an earlier version joined on `U+001F`, which a value -/// containing that control char could still defeat).
-/// -/// - `Ok(None)` if any column is null: the row is exempt (a partial tuple -/// can't violate uniqueness under SQL null semantics). -/// - `Ok(Some(tuple))` otherwise. -/// - `Err(..)` propagated from [`unique_key_scalar`] on an un-keyable value. -/// -/// Shared by the intake path (`enforce_unique_constraints_intra_batch`) and the -/// branch-merge path (`exec/merge.rs::update_unique_constraints`) so the two -/// derive identical keys and cannot drift on separator or scalar conversion. -pub(crate) fn composite_unique_key( - group_columns: &[ArrayRef], - row: usize, -) -> Result>> { - let mut parts = Vec::with_capacity(group_columns.len()); - for column in group_columns { - match unique_key_scalar(column, row)? { - Some(value) => parts.push(value), - None => return Ok(None), - } - } - Ok(Some(parts)) -} - -/// Render a constraint's column tuple for error messages: a single item as -/// `col`, a composite as `(a, b)`. Used for both the column list and the -/// offending value tuple, which share the same shape. -fn format_tuple(items: &[String]) -> String { - match items { - [single] => single.clone(), - _ => format!("({})", items.join(", ")), - } -} - -/// Reduce a single Arrow scalar at (`array`, `row`) to its uniqueness-key -/// string. -/// -/// - `Ok(None)` for a null value: nulls are exempt from uniqueness (standard -/// SQL semantics over nullable columns). -/// - `Ok(Some(s))` for every scalar type a `@unique` / `@key` column can hold. -/// Strings are covered in all three physical Arrow encodings (`Utf8`, -/// `LargeUtf8`, `Utf8View`), so a legal string column is always keyable -/// regardless of how Lance materializes it on read-back. -/// - `Err(..)` for a non-null value whose Arrow type can't be reduced to a key -/// (a list, blob, or vector column). 
This fails loudly rather than silently -/// exempting the row, and because every legal scalar encoding is handled -/// above, the error fires only for a genuinely un-keyable column type — never -/// for a legal value that merely arrived in an unenumerated encoding. -fn unique_key_scalar(array: &ArrayRef, row: usize) -> Result> { - use arrow_array::{Array, LargeStringArray, StringViewArray}; +/// Reduce a single Arrow scalar at (`array`, `row`) to a `String` for +/// uniqueness comparison. Returns `None` for null values (nulls are exempt +/// from uniqueness in standard SQL semantics). +fn scalar_to_string(array: &ArrayRef, row: usize) -> Option { + use arrow_array::Array; if array.is_null(row) { - return Ok(None); + return None; } if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); - } - if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); - } - if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); + return Some(a.value(row).to_string()); } if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); + return Some(a.value(row).to_string()); } if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); + return Some(a.value(row).to_string()); } if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); + return Some(a.value(row).to_string()); } if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); + return Some(a.value(row).to_string()); } if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); + return Some(a.value(row).to_string()); } if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); + return Some(a.value(row).to_string()); } if let Some(a) = array.as_any().downcast_ref::() { - return
Ok(Some(a.value(row).to_string())); + return Some(a.value(row).to_string()); } if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); + return Some(a.value(row).to_string()); } if let Some(a) = array.as_any().downcast_ref::() { - return Ok(Some(a.value(row).to_string())); + return Some(a.value(row).to_string()); } - Err(OmniError::manifest(format!( - "uniqueness key: unsupported column type {:?} for @unique/@key enforcement", - array.data_type() - ))) + None } -/// Build the list of uniqueness constraint groups to enforce on a node type. -/// Each group is the column tuple of one constraint. Includes every -/// `@unique(...)` constraint (from `NodeType.unique_constraints`) and the -/// `@key` (which implies uniqueness over its column tuple). Grouping is -/// preserved so a composite `@unique(a, b)` is enforced as a composite key -/// rather than degraded into independent single-field checks. -pub(crate) fn unique_constraint_groups_for_node( +/// Build the flat list of property names that must be checked for uniqueness +/// on a node type. Includes both `@unique` properties (from +/// `NodeType.unique_constraints`) and the `@key` (which implies uniqueness). +pub(crate) fn unique_property_names_for_node( node_type: &omnigraph_compiler::catalog::NodeType, -) -> Vec> { - let mut groups: Vec> = node_type.unique_constraints.clone(); - if let Some(key) = &node_type.key - && !groups.contains(key) - { - groups.push(key.clone()); +) -> Vec { + let mut props: Vec = node_type + .unique_constraints + .iter() + .flatten() + .cloned() + .collect(); + if let Some(key) = &node_type.key { + props.extend(key.iter().cloned()); } - groups + props.sort(); + props.dedup(); + props } -/// Same as [`unique_constraint_groups_for_node`] but for an edge type (edges -/// have no `@key`). -pub(crate) fn unique_constraint_groups_for_edge( +/// Same as [`unique_property_names_for_node`] but for an edge type. 
+pub(crate) fn unique_property_names_for_edge( edge_type: &omnigraph_compiler::catalog::EdgeType, -) -> Vec> { - edge_type.unique_constraints.clone() +) -> Vec { + let mut props: Vec = edge_type + .unique_constraints + .iter() + .flatten() + .cloned() + .collect(); + props.sort(); + props.dedup(); + props } fn extract_numeric_value(col: &ArrayRef, row: usize) -> Option { @@ -1578,14 +1565,83 @@ fn literal_value_to_f64(v: &omnigraph_compiler::catalog::LiteralValue) -> f64 { // ─── Edge cardinality validation ───────────────────────────────────────────── +pub(crate) async fn validate_edge_cardinality( + db: &crate::db::Omnigraph, + branch: Option<&str>, + edge_name: &str, + written_version: u64, + written_branch: Option<&str>, +) -> Result<()> { + use arrow_array::Array; + let catalog = db.catalog(); + let edge_type = &catalog.edge_types[edge_name]; + if edge_type.cardinality.is_default() { + return Ok(()); + } + + // Open edge sub-table at the just-written version, not the snapshot's + // (the snapshot still pins to the pre-write version). 
+ let snapshot = db.snapshot_for_branch(branch).await?; + let table_key = format!("edge:{}", edge_name); + let entry = snapshot + .entry(&table_key) + .ok_or_else(|| OmniError::manifest(format!("no manifest entry for {}", table_key)))?; + let ds = db + .open_dataset_at_state( + &entry.table_path, + written_branch.or(entry.table_branch.as_deref()), + written_version, + ) + .await?; + + // Scan src column, count per source + let batches = db + .table_store() + .scan(&ds, Some(&["src"]), None, None) + .await?; + + let mut counts: HashMap = HashMap::new(); + for batch in &batches { + let srcs = batch + .column_by_name("src") + .unwrap() + .as_any() + .downcast_ref::() + .unwrap(); + for i in 0..srcs.len() { + *counts.entry(srcs.value(i).to_string()).or_insert(0) += 1; + } + } + + let card = &edge_type.cardinality; + for (src, count) in &counts { + if let Some(max) = card.max { + if *count > max { + return Err(OmniError::manifest(format!( + "@card violation on edge {}: source '{}' has {} edges (max {})", + edge_name, src, count, max + ))); + } + } + if *count < card.min { + return Err(OmniError::manifest(format!( + "@card violation on edge {}: source '{}' has {} edges (min {})", + edge_name, src, count, card.min + ))); + } + } + + Ok(()) +} + /// Validate edge `@card` cardinality with in-memory pending edges visible. /// /// Loader-level analog to `exec::mutation::validate_edge_cardinality_with_pending`: /// opens the committed dataset at the pre-load snapshot version, then /// delegates to the shared `count_src_per_edge` + `enforce_cardinality_bounds` -/// helpers in `exec::staging`. Used by every load mode; for `LoadMode::Overwrite` -/// it treats the pending edge batches as the replacement table image (the -/// committed rows are being replaced, so only the pending set is counted). +/// helpers in `exec::staging`. Used by Append/Merge loads (the Overwrite +/// path uses `validate_edge_cardinality` which opens the just-written +/// Lance version). 
/// /// `mode` controls dedup behavior. `LoadMode::Merge` passes `Some("id")` /// so committed edges that the load is *updating* (same edge id, @@ -1633,11 +1689,6 @@ async fn validate_edge_cardinality_with_pending_loader( /// - IDs from the staged loader's pending batches (in-memory; just-staged /// inserts of this type) /// - IDs from the committed sub-table at the pre-load snapshot version -/// -/// For `LoadMode::Overwrite`, if the node table is touched then the pending -/// batches are the replacement image. In that case committed IDs are not -/// included, so edge RI is validated against exactly what the overwrite will -/// publish. async fn collect_node_ids_with_pending( db: &Omnigraph, branch: Option<&str>, @@ -1660,10 +1711,6 @@ async fn collect_node_ids_with_pending( } } - if staging.pending_mode(&table_key) == Some(PendingMode::Overwrite) { - return Ok(ids); - } - // From the committed Lance sub-table at the pre-load snapshot version. let snapshot = db.snapshot_for_branch(branch).await?; let Some(entry) = snapshot.entry(&table_key) else { @@ -1677,7 +1724,10 @@ async fn collect_node_ids_with_pending( ) .await?; - let batches = db.storage().scan(&ds, Some(&["id"]), None, None).await?; + let batches = db + .table_store() + .scan(&ds, Some(&["id"]), None, None) + .await?; for batch in &batches { let id_col = batch @@ -1700,6 +1750,72 @@ async fn collect_node_ids_with_pending( Ok(ids) } +/// Collect all valid node IDs for a given type. 
Union of: +/// - IDs from the just-loaded batch (in memory, from node_rows) +/// - IDs from the sub-table at the just-written version (if it was updated) +/// - IDs from the sub-table at the snapshot-pinned version (if it was not updated) +async fn collect_node_ids( + db: &Omnigraph, + branch: Option<&str>, + type_name: &str, + node_rows: &HashMap>, + catalog: &omnigraph_compiler::catalog::Catalog, + updates: &[crate::db::SubTableUpdate], +) -> Result> { + let mut ids = HashSet::new(); + + // IDs from the in-memory batch (just loaded in this operation) + if let Some(rows) = node_rows.get(type_name) { + if let Some(node_type) = catalog.node_types.get(type_name) { + if let Some(key_prop) = node_type.key_property() { + for row in rows { + if let Some(id) = row.get(key_prop).and_then(|v| v.as_str()) { + ids.insert(id.to_string()); + } + } + } + } + } + + // IDs from the Lance sub-table + let table_key = format!("node:{}", type_name); + let snapshot = db.snapshot_for_branch(branch).await?; + let Some(entry) = snapshot.entry(&table_key) else { + return Ok(ids); + }; + // Use the just-written version if this type was updated, else snapshot version + let updated = updates + .iter() + .find(|u| u.table_key == table_key) + .map(|u| (u.table_version, u.table_branch.as_deref())); + let (version, branch) = updated.unwrap_or((entry.table_version, entry.table_branch.as_deref())); + let ds = db + .open_dataset_at_state(&entry.table_path, branch, version) + .await?; + + let batches = db + .table_store() + .scan(&ds, Some(&["id"]), None, None) + .await?; + + for batch in &batches { + let id_col = batch + .column_by_name("id") + .unwrap() + .as_any() + .downcast_ref::() + .unwrap(); + for i in 0..batch.num_rows() { + if !id_col.is_valid(i) { + continue; + } + ids.insert(id_col.value(i).to_string()); + } + } + + Ok(ids) +} + #[cfg(test)] mod tests { use super::*; @@ -1867,7 +1983,6 @@ edge WorksAt: Person -> Company } #[tokio::test] - #[allow(deprecated)] async fn 
test_ingest_creates_branch_and_reports_tables() { let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); @@ -1912,7 +2027,6 @@ edge WorksAt: Person -> Company } #[tokio::test] - #[allow(deprecated)] async fn test_ingest_existing_branch_ignores_from_and_merges_data() { let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); @@ -1987,7 +2101,6 @@ edge WorksAt: Person -> Company } #[tokio::test] - #[allow(deprecated)] async fn test_ingest_as_stamps_actor_on_branch_head_commit() { let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); @@ -2013,68 +2126,6 @@ edge WorksAt: Person -> Company assert_eq!(head.actor_id.as_deref(), Some("act-andrew")); } - #[tokio::test] - async fn test_load_as_with_base_forks_missing_branch_and_stamps_metadata() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - - let result = db - .load_as("feature", Some("main"), TEST_DATA, LoadMode::Merge, None) - .await - .unwrap(); - - assert_eq!(result.branch, "feature"); - assert_eq!(result.base_branch.as_deref(), Some("main")); - assert!(result.branch_created); - assert!( - db.branch_list() - .await - .unwrap() - .contains(&"feature".to_string()) - ); - - // Re-loading onto the now-existing branch records the base but - // performs no fork. 
- let again = db - .load_as( - "feature", - Some("main"), - r#"{"type":"Person","data":{"name":"Bob","age":26}}"#, - LoadMode::Merge, - None, - ) - .await - .unwrap(); - assert!(!again.branch_created); - assert_eq!(again.base_branch.as_deref(), Some("main")); - } - - #[tokio::test] - async fn test_load_as_without_base_errors_on_missing_branch() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - - let result = db - .load_as("nonexistent", None, TEST_DATA, LoadMode::Merge, None) - .await; - assert!(result.is_err(), "load without base must not create branches"); - assert!( - !db.branch_list() - .await - .unwrap() - .contains(&"nonexistent".to_string()), - "failed load must not leave a branch behind" - ); - - // Loads to main carry the default branch metadata. - let main_load = db.load("main", TEST_DATA, LoadMode::Overwrite).await.unwrap(); - assert_eq!(main_load.branch, "main"); - assert_eq!(main_load.base_branch, None); - assert!(!main_load.branch_created); - } - #[test] fn test_range_constraint_rejects_nan() { use arrow_array::{Float64Array, RecordBatch, StringArray}; @@ -2118,66 +2169,4 @@ edge WorksAt: Person -> Company let err = result.unwrap_err().to_string(); assert!(err.contains("NaN"), "error should mention NaN: {}", err); } - - #[test] - fn composite_unique_key_builds_tuple_and_exempts_null() { - let a: ArrayRef = Arc::new(StringArray::from(vec![Some("x|y"), Some("x"), None])); - let b: ArrayRef = Arc::new(StringArray::from(vec![Some("z"), Some("y|z"), Some("q")])); - let cols = [a, b]; - - // Tuple key, so `("x|y", "z")` and `("x", "y|z")` stay distinct -- - // a separator-joined key (the old `|` join) would collapse both to - // `x|y|z`. 
- assert_eq!( - composite_unique_key(&cols, 0).unwrap(), - Some(vec!["x|y".to_string(), "z".to_string()]) - ); - assert_eq!( - composite_unique_key(&cols, 1).unwrap(), - Some(vec!["x".to_string(), "y|z".to_string()]) - ); - assert_ne!( - composite_unique_key(&cols, 0).unwrap(), - composite_unique_key(&cols, 1).unwrap() - ); - - // Any null column → the whole row is exempt (SQL null semantics). - assert_eq!(composite_unique_key(&cols, 2).unwrap(), None); - } - - #[test] - fn unique_key_scalar_errors_loudly_on_unkeyable_type() { - use arrow_array::LargeBinaryArray; - // A binary/blob column can't be reduced to a uniqueness key. Before the - // hardening this returned `None`, so a `@unique` on such a column was - // silently un-enforced; now it errors instead of weakening the - // constraint in silence. - let blob: ArrayRef = Arc::new(LargeBinaryArray::from(vec![Some(&b"abc"[..])])); - let err = unique_key_scalar(&blob, 0).unwrap_err(); - assert!( - err.to_string().contains("unsupported column type"), - "un-keyable type must fail loudly (got: {err})" - ); - } - - #[test] - fn unique_key_scalar_handles_all_string_encodings() { - use arrow_array::{LargeStringArray, StringViewArray}; - // A legal string column is keyable in every physical Arrow encoding - // Lance might hand back (Utf8 / LargeUtf8 / Utf8View). None of these may - // fall through to the loud `Err` path -- that branch is reserved for - // genuinely un-keyable column types, not a legal value in an - // unenumerated encoding. 
- let utf8: ArrayRef = Arc::new(StringArray::from(vec![Some("v")])); - let large: ArrayRef = Arc::new(LargeStringArray::from(vec![Some("v")])); - let view: ArrayRef = Arc::new(StringViewArray::from(vec![Some("v")])); - for array in [&utf8, &large, &view] { - assert_eq!( - unique_key_scalar(array, 0).unwrap(), - Some("v".to_string()), - "string array {:?} must render, not error", - array.data_type() - ); - } - } } diff --git a/crates/omnigraph/src/runtime_cache.rs b/crates/omnigraph/src/runtime_cache.rs index e85a90a..84b562a 100644 --- a/crates/omnigraph/src/runtime_cache.rs +++ b/crates/omnigraph/src/runtime_cache.rs @@ -1,9 +1,6 @@ use std::collections::{HashMap, VecDeque}; -use std::hash::Hash; use std::sync::Arc; -use lance::Dataset; -use lance::session::Session; use omnigraph_compiler::catalog::Catalog; use tokio::sync::Mutex; @@ -29,15 +26,17 @@ pub struct RuntimeCache { graph_indices: Mutex, } -#[derive(Debug)] +#[derive(Debug, Default)] struct GraphIndexCache { - entries: LruMap>, + entries: HashMap>, + lru: VecDeque, } impl RuntimeCache { pub async fn invalidate_all(&self) { let mut cache = self.graph_indices.lock().await; - cache.entries.invalidate_all(); + cache.entries.clear(); + cache.lru.clear(); } pub async fn graph_index( @@ -49,6 +48,7 @@ impl RuntimeCache { { let mut cache = self.graph_indices.lock().await; if let Some(index) = cache.entries.get(&key).cloned() { + cache.touch(key.clone()); return Ok(index); } } @@ -62,6 +62,7 @@ impl RuntimeCache { let index = Arc::new(GraphIndex::build(&resolved.snapshot, &edge_types).await?); let mut cache = self.graph_indices.lock().await; if let Some(existing) = cache.entries.get(&key).cloned() { + cache.touch(key); return Ok(existing); } cache.insert(key, Arc::clone(&index)); @@ -71,86 +72,24 @@ impl RuntimeCache { impl GraphIndexCache { fn insert(&mut self, key: GraphIndexCacheKey, value: Arc) { - self.entries.insert(key, value); - } - - #[cfg(test)] - fn touch(&mut self, key: GraphIndexCacheKey) { - 
self.entries.touch(key); - } -} - -#[derive(Debug)] -struct LruMap -where - K: Clone + Eq + Hash, -{ - entries: HashMap, - lru: VecDeque, - cap: usize, -} - -impl LruMap -where - K: Clone + Eq + Hash, -{ - fn new(cap: usize) -> Self { - Self { - entries: HashMap::new(), - lru: VecDeque::new(), - cap, - } - } - - fn get(&mut self, key: &K) -> Option<&V> { - if self.entries.contains_key(key) { - self.touch(key.clone()); - self.entries.get(key) - } else { - None - } - } - - fn insert(&mut self, key: K, value: V) { self.entries.insert(key.clone(), value); self.touch(key); - while self.entries.len() > self.cap { + while self.entries.len() > 8 { let Some(oldest) = self.lru.pop_front() else { break; }; - self.entries.remove(&oldest); + if self.entries.remove(&oldest).is_some() { + break; + } } } - fn invalidate_all(&mut self) { - self.entries.clear(); - self.lru.clear(); - } - - #[cfg(test)] - fn contains_key(&self, key: &K) -> bool { - self.entries.contains_key(key) - } - - #[cfg(test)] - fn len(&self) -> usize { - self.entries.len() - } - - fn touch(&mut self, key: K) { + fn touch(&mut self, key: GraphIndexCacheKey) { self.lru.retain(|existing| existing != &key); self.lru.push_back(key); } } -impl Default for GraphIndexCache { - fn default() -> Self { - Self { - entries: LruMap::new(8), - } - } -} - fn graph_index_cache_key(resolved: &ResolvedTarget, catalog: &Catalog) -> GraphIndexCacheKey { let mut edge_tables: Vec = catalog .edge_types @@ -175,114 +114,6 @@ fn graph_index_cache_key(resolved: &ResolvedTarget, catalog: &Catalog) -> GraphI } } -/// Max held `Dataset` handles. A handle holds only Arcs (object store + manifest), -/// never table data, so this is cheap; it bounds how many `(table, branch, -/// version, e_tag)` cells stay warm. One graph's live table set across a couple -/// of branches at the current version fits comfortably, with headroom for the -/// recently-superseded versions left by writes until they age out. 
-const TABLE_HANDLE_CACHE_CAP: usize = 64; - -#[derive(Debug, Clone, PartialEq, Eq, Hash)] -struct TableHandleKey { - table_path: String, - table_branch: Option, - version: u64, - e_tag: Option, -} - -/// Held open-`Dataset` handles keyed by `(table_path, branch, version, e_tag)` -- the -/// version-keyed analogue of LanceDB's `DatasetConsistencyWrapper` -/// (`rust/lancedb/src/table/dataset.rs`). A warm read reuses a held handle with -/// zero open IO (a cheap `Dataset` clone); a miss opens once at the location with -/// the shared `Session`. Version plus e_tag are in the key, so a write (or a -/// delete/recreate that reuses a version number on object stores with e_tags) is -/// simply a new key. A same-branch manifest refresh clears this cache as the -/// fallback for e_tag-less table locations. Only read-path Data opens use this -- -/// writes open HEAD directly and never receive a pinned handle. -#[derive(Default)] -pub struct TableHandleCache { - inner: Mutex, -} - -struct TableHandleCacheInner { - entries: LruMap, -} - -impl TableHandleCache { - /// Drop all held handles. Correctness never requires this (version-in-key); - /// it is memory hygiene, called from the same hooks that clear the graph - /// index cache (branch switch / refresh). - pub async fn invalidate_all(&self) { - let mut inner = self.inner.lock().await; - inner.entries.invalidate_all(); - } - - /// Return the dataset for `(table_path, branch, version, e_tag)`, reusing a - /// held handle (0 open IO) or opening it once at `location` with the shared - /// `session` on a miss. 
- pub async fn get_or_open( - &self, - table_path: &str, - table_branch: Option<&str>, - version: u64, - e_tag: Option<&str>, - location: &str, - session: Option<&Arc>, - ) -> Result { - let key = TableHandleKey { - table_path: table_path.to_string(), - table_branch: table_branch.map(str::to_string), - version, - e_tag: e_tag.map(str::to_string), - }; - { - let mut inner = self.inner.lock().await; - if let Some(ds) = inner.entries.get(&key).cloned() { - return Ok(ds); - } - } - // Miss: open without holding the lock (the open is async IO). A concurrent - // double-miss opens twice and one wins the insert -- correct (the dataset - // at a version is immutable) and rare. - let ds = crate::instrumentation::open_table_dataset(location, version, session).await?; - let mut inner = self.inner.lock().await; - if let Some(existing) = inner.entries.get(&key).cloned() { - return Ok(existing); - } - inner.insert(key, ds.clone()); - Ok(ds) - } -} - -impl TableHandleCacheInner { - fn insert(&mut self, key: TableHandleKey, value: Dataset) { - self.entries.insert(key, value); - } -} - -impl Default for TableHandleCacheInner { - fn default() -> Self { - Self { - entries: LruMap::new(TABLE_HANDLE_CACHE_CAP), - } - } -} - -/// Per-graph read caches handed to a resolved `Snapshot` so its table opens reuse -/// one shared `Session` (LanceDB's one-session-per-connection pattern) and the -/// held-handle cache. Manual `Debug` because `lance::session::Session` is not -/// `Debug`; this lets `Snapshot` keep its `#[derive(Debug)]`. 
-pub struct ReadCaches { - pub session: Arc, - pub handles: Arc, -} - -impl std::fmt::Debug for ReadCaches { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - f.debug_struct("ReadCaches").finish_non_exhaustive() - } -} - #[cfg(test)] mod tests { use std::sync::Arc; @@ -325,21 +156,4 @@ mod tests { assert!(cache.entries.contains_key(&key(0))); assert!(!cache.entries.contains_key(&key(1))); } - - #[test] - fn lru_map_evicts_oldest_and_touch_refreshes_order() { - let mut map = LruMap::new(2); - map.insert("a", 1); - map.insert("b", 2); - - assert_eq!(map.get(&"a"), Some(&1)); - map.insert("c", 3); - - assert!(map.contains_key(&"a")); - assert!(!map.contains_key(&"b")); - assert!(map.contains_key(&"c")); - - map.invalidate_all(); - assert_eq!(map.len(), 0); - } } diff --git a/crates/omnigraph/src/storage.rs b/crates/omnigraph/src/storage.rs index 357f990..564b577 100644 --- a/crates/omnigraph/src/storage.rs +++ b/crates/omnigraph/src/storage.rs @@ -1,15 +1,14 @@ use std::env; use std::fmt::Debug; -use std::path::{Component, Path, PathBuf}; +use std::path::{Path, PathBuf}; use std::sync::Arc; use async_trait::async_trait; use futures::TryStreamExt; use object_store::aws::AmazonS3Builder; -use object_store::local::LocalFileSystem; -use object_store::memory::InMemory; use object_store::path::Path as ObjectPath; -use object_store::{DynObjectStore, ObjectStore, ObjectStoreExt, PutMode, PutPayload}; +use object_store::{DynObjectStore, ObjectStore, PutMode, PutPayload}; +use tokio::io::AsyncWriteExt; use url::Url; use crate::error::{OmniError, Result}; @@ -39,52 +38,7 @@ pub trait StorageAdapter: Debug + Send + Sync { /// List all files (non-recursively, files only) directly under `dir_uri`. /// Returns full URIs (same scheme as `dir_uri`). The result is unordered. /// Returns Ok(empty) if the directory does not exist or is empty. 
- /// Consumers must tolerate non-payload residue appearing in storage - /// (backend staging files are filtered by the backend, but crash residue - /// of any future producer may not be) -- filter by suffix, never assume - /// every entry is yours. async fn list_dir(&self, dir_uri: &str) -> Result>; - /// Read a text object together with its backend version token (stores - /// with conditional-update support: the object's ETag; local: sha256 of - /// the content). The token is opaque -- valid only for - /// `write_text_if_match` against the same adapter. - async fn read_text_versioned(&self, uri: &str) -> Result<(String, String)>; - /// Replace the object at `uri` only if its current version still matches - /// `expected_version` (obtained from a prior versioned read/write on this - /// adapter). Returns `Ok(Some(new_version))` on success and `Ok(None)` - /// when the precondition failed (a concurrent writer won -- the CAS-lost - /// case callers must surface, never swallow). Stores with conditional - /// updates (S3, in-memory) use a true conditional put (If-Match); the - /// local filesystem has no such primitive (`PutMode::Update` is - /// unimplemented upstream), so local compares content then replaces via - /// an atomic staged write -- the same single-machine semantics the - /// callers had before this trait, safe under the callers' own lock - /// protocol but not a cross-process barrier by itself (see the Known - /// Gaps entry in docs/dev/invariants.md). - async fn write_text_if_match( - &self, - uri: &str, - contents: &str, - expected_version: &str, - ) -> Result>; - /// Recursively delete every object under `prefix_uri`. Returns Ok(()) - /// when nothing exists there (idempotent). 
Local: `remove_dir_all` - /// (directories are a local-FS concept; list+delete would leave empty - /// directory skeletons that local existence probes report as present); - /// object stores: list + delete (NOT atomic -- callers must tolerate - /// partial prefixes on crash, which the cluster delete protocol does by - /// retry). - async fn delete_prefix(&self, prefix_uri: &str) -> Result<()>; -} - -/// Version token for local files: content identity. The local filesystem -/// backend reports mtime-derived ETags too coarse for CAS (sub-granularity -/// rewrites collide); sha256 is stable, cheap at these object sizes, and -/// already the cluster ledger's CAS vocabulary. -fn local_version_token(bytes: &[u8]) -> String { - use sha2::{Digest, Sha256}; - let digest = Sha256::digest(bytes); - digest.iter().map(|byte| format!("{byte:02x}")).collect() } #[derive(Debug, Clone, Copy, PartialEq, Eq)] @@ -93,34 +47,13 @@ pub enum StorageKind { S3, } -/// The one storage implementation: every backend is an -/// [`object_store::ObjectStore`], so the semantics (atomic-visibility puts, -/// conditional creates, path-delimited listing) are upstream-maintained and -/// identical across backends by construction. The per-backend residue is -/// confined to [`UriCodec`] (URI ↔ object path mapping) and the -/// `supports_conditional_update` capability flag (false only for the local -/// filesystem, where upstream `PutMode::Update` is unimplemented). -#[derive(Debug)] -pub struct ObjectStorageAdapter { - store: Arc, - codec: UriCodec, - /// Whether the backend implements `PutMode::Update` (ETag-conditioned - /// put). Gates BOTH the version-token source in `read_text_versioned` - /// and the `write_text_if_match` strategy -- the two must agree or every - /// CAS loses. 
- supports_conditional_update: bool, -} +#[derive(Debug, Default)] +pub struct LocalStorageAdapter; -#[derive(Debug, Clone, PartialEq, Eq)] -enum UriCodec { - /// Plain absolute/relative paths or `file://` URIs, mapped onto a - /// root-anchored [`LocalFileSystem`]. - Local, - /// `s3://{bucket}/{key}` URIs, mapped onto a bucket-scoped store. - S3 { bucket: String }, - /// Opaque keys for the in-memory test/embedded backend; leading - /// slashes are stripped. - Memory, +#[derive(Debug)] +pub struct S3StorageAdapter { + bucket: String, + store: Arc, } #[derive(Debug, Clone, PartialEq, Eq)] @@ -129,22 +62,224 @@ struct S3Location { key: String, } -impl ObjectStorageAdapter { - /// Local-filesystem backend rooted at `/`. URIs are plain paths or - /// `file://` URIs; relative paths are lexically absolutized against the - /// current working directory. - pub fn local() -> Self { - Self { - store: Arc::new(LocalFileSystem::new()), - codec: UriCodec::Local, - supports_conditional_update: false, +#[async_trait] +impl StorageAdapter for LocalStorageAdapter { + async fn read_text(&self, uri: &str) -> Result { + let path = local_path_from_uri(uri)?; + Ok(tokio::fs::read_to_string(&path).await?) + } + + async fn write_text(&self, uri: &str, contents: &str) -> Result<()> { + let path = local_path_from_uri(uri)?; + // Ensure parent directory exists. S3 has no equivalent (PutObject + // is path-agnostic). For local fs, callers like the recovery + // sidecar protocol expect transparent directory creation under + // the graph root (the `__recovery/` directory doesn't pre-exist; + // first sidecar write creates it). 
+ if let Some(parent) = path.parent() { + if !parent.as_os_str().is_empty() { + tokio::fs::create_dir_all(parent).await?; + } + } + tokio::fs::write(&path, contents).await?; + Ok(()) + } + + async fn write_text_if_absent(&self, uri: &str, contents: &str) -> Result { + let path = local_path_from_uri(uri)?; + if let Some(parent) = path.parent() { + if !parent.as_os_str().is_empty() { + tokio::fs::create_dir_all(parent).await?; + } + } + let mut file = match tokio::fs::OpenOptions::new() + .write(true) + .create_new(true) + .open(&path) + .await + { + Ok(file) => file, + Err(err) if err.kind() == std::io::ErrorKind::AlreadyExists => return Ok(false), + Err(err) => return Err(err.into()), + }; + if let Err(err) = file.write_all(contents.as_bytes()).await { + let _ = tokio::fs::remove_file(&path).await; + return Err(err.into()); + } + Ok(true) + } + + async fn exists(&self, uri: &str) -> Result { + Ok(local_path_from_uri(uri)?.exists()) + } + + async fn rename_text(&self, from_uri: &str, to_uri: &str) -> Result<()> { + let from = local_path_from_uri(from_uri)?; + let to = local_path_from_uri(to_uri)?; + tokio::fs::rename(&from, &to).await?; + Ok(()) + } + + async fn delete(&self, uri: &str) -> Result<()> { + let path = local_path_from_uri(uri)?; + match tokio::fs::remove_file(&path).await { + Ok(()) => Ok(()), + Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(()), + Err(err) => Err(err.into()), } } - /// S3 backend scoped to the bucket named in `root_uri`. Credentials and - /// endpoint come from the standard `AWS_*` environment variables (the - /// same ones Lance reads for its dataset stores). 
- pub fn s3_from_root_uri(root_uri: &str) -> Result { + async fn list_dir(&self, dir_uri: &str) -> Result> { + let path = local_path_from_uri(dir_uri)?; + let mut out = Vec::new(); + let mut entries = match tokio::fs::read_dir(&path).await { + Ok(e) => e, + Err(err) if err.kind() == std::io::ErrorKind::NotFound => return Ok(out), + Err(err) => return Err(err.into()), + }; + let dir_str = dir_uri.trim_end_matches('/'); + while let Some(entry) = entries.next_entry().await? { + let ft = entry.file_type().await?; + if !ft.is_file() { + continue; + } + if let Some(name) = entry.file_name().to_str() { + out.push(format!("{}/{}", dir_str, name)); + } + } + Ok(out) + } +} + +#[async_trait] +impl StorageAdapter for S3StorageAdapter { + async fn read_text(&self, uri: &str) -> Result { + let location = self.object_path(uri)?; + let bytes = self + .store + .get(&location) + .await + .map_err(|err| storage_backend_error("read", uri, err))? + .bytes() + .await + .map_err(|err| storage_backend_error("read", uri, err))?; + + String::from_utf8(bytes.to_vec()).map_err(|err| { + OmniError::manifest_internal(format!("storage read failed for '{}': {}", uri, err)) + }) + } + + async fn write_text(&self, uri: &str, contents: &str) -> Result<()> { + let location = self.object_path(uri)?; + self.store + .put(&location, PutPayload::from(contents.as_bytes().to_vec())) + .await + .map_err(|err| storage_backend_error("write", uri, err))?; + Ok(()) + } + + async fn write_text_if_absent(&self, uri: &str, contents: &str) -> Result { + let location = self.object_path(uri)?; + match self + .store + .put_opts( + &location, + PutPayload::from(contents.as_bytes().to_vec()), + PutMode::Create.into(), + ) + .await + { + Ok(_) => Ok(true), + Err(object_store::Error::AlreadyExists { .. }) + | Err(object_store::Error::Precondition { .. 
}) => Ok(false), + Err(err) => Err(storage_backend_error("write_if_absent", uri, err)), + } + } + + async fn exists(&self, uri: &str) -> Result { + let location = self.object_path(uri)?; + match self.store.head(&location).await { + Ok(_) => Ok(true), + Err(object_store::Error::NotFound { .. }) => { + let mut entries = self.store.list(Some(&location)); + let has_prefix_entries = entries + .try_next() + .await + .map_err(|err| storage_backend_error("exists", uri, err))? + .is_some(); + Ok(has_prefix_entries) + } + Err(err) => Err(storage_backend_error("exists", uri, err)), + } + } + + async fn rename_text(&self, from_uri: &str, to_uri: &str) -> Result<()> { + // S3 has no atomic rename. Copy then delete; if the copy succeeds and + // the delete fails (or the process crashes between them), both + // source and destination exist with the same content. Recovery code + // must tolerate this case -- see schema_state::recover_schema_state_files. + let from = self.object_path(from_uri)?; + let to = self.object_path(to_uri)?; + self.store + .copy(&from, &to) + .await + .map_err(|err| storage_backend_error("rename:copy", from_uri, err))?; + self.store + .delete(&from) + .await + .map_err(|err| storage_backend_error("rename:delete", from_uri, err))?; + Ok(()) + } + + async fn delete(&self, uri: &str) -> Result<()> { + let location = self.object_path(uri)?; + match self.store.delete(&location).await { + Ok(()) => Ok(()), + Err(object_store::Error::NotFound { .. }) => Ok(()), + Err(err) => Err(storage_backend_error("delete", uri, err)), + } + } + + async fn list_dir(&self, dir_uri: &str) -> Result> { + // Normalize: ensure the URI describes a directory (trailing '/') so + // we don't match sibling paths with a shared prefix + // (e.g. listing `__recovery` shouldn't match `__recovery_log/...`). 
+ let dir_with_slash = if dir_uri.ends_with('/') { + dir_uri.to_string() + } else { + format!("{}/", dir_uri) + }; + // object_store::Path strips the trailing '/'; re-add it for filtering. + let prefix_loc = self.object_path(&dir_with_slash)?; + let prefix_with_slash = format!("{}/", prefix_loc.as_ref()); + + let mut entries = self.store.list(Some(&prefix_loc)); + let mut out = Vec::new(); + let bucket_root = format!("{}{}/", S3_SCHEME_PREFIX, self.bucket); + while let Some(meta) = entries + .try_next() + .await + .map_err(|err| storage_backend_error("list_dir", dir_uri, err))? + { + let key_str = meta.location.as_ref(); + // Require the directory boundary to filter out sibling-prefix + // matches (object_store's `list` is prefix-based, not dir-based). + if !key_str.starts_with(&prefix_with_slash) { + continue; + } + let suffix = &key_str[prefix_with_slash.len()..]; + // Non-recursive: skip anything inside a sub-directory. + if suffix.contains('/') { + continue; + } + out.push(format!("{}{}", bucket_root, key_str)); + } + Ok(out) + } +} + +impl S3StorageAdapter { + fn from_root_uri(root_uri: &str) -> Result { let location = parse_s3_uri(root_uri)?; let mut builder = AmazonS3Builder::from_env().with_bucket_name(&location.bucket); @@ -170,311 +305,29 @@ impl ObjectStorageAdapter { })?; Ok(Self { + bucket: location.bucket, store: Arc::new(store), - codec: UriCodec::S3 { - bucket: location.bucket, - }, - supports_conditional_update: true, }) } - /// In-memory backend for tests and embedded experiments. Implements the - /// FULL contract including true conditional updates (unlike the local - /// filesystem), so contract tests exercise the strong-CAS path without a - /// bucket. State lives only as long as the adapter. 
- pub fn in_memory() -> Self { - Self { - store: Arc::new(InMemory::new()), - codec: UriCodec::Memory, - supports_conditional_update: true, - } - } - fn object_path(&self, uri: &str) -> Result { - match &self.codec { - UriCodec::Local => { - let path = absolutize_lexically(local_path_from_uri(uri)?)?; - ObjectPath::from_absolute_path(&path).map_err(|err| { - OmniError::manifest_internal(format!( - "invalid local object path for '{}': {}", - uri, err - )) - }) - } - UriCodec::S3 { bucket } => { - let location = parse_s3_uri(uri)?; - if &location.bucket != bucket { - return Err(OmniError::manifest_internal(format!( - "s3 storage bucket mismatch for '{}': expected '{}', found '{}'", - uri, bucket, location.bucket - ))); - } - if location.key.is_empty() { - return Err(OmniError::manifest_internal(format!( - "s3 storage path is empty for '{}'", - uri - ))); - } - ObjectPath::parse(&location.key).map_err(|err| { - OmniError::manifest_internal(format!( - "invalid s3 object path for '{}': {}", - uri, err - )) - }) - } - UriCodec::Memory => { - ObjectPath::parse(uri.trim_start_matches('/')).map_err(|err| { - OmniError::manifest_internal(format!( - "invalid memory object path for '{}': {}", - uri, err - )) - }) - } + let location = parse_s3_uri(uri)?; + if location.bucket != self.bucket { + return Err(OmniError::manifest_internal(format!( + "s3 storage bucket mismatch for '{}': expected '{}', found '{}'", + uri, self.bucket, location.bucket + ))); } - } -} - -#[async_trait] -impl StorageAdapter for ObjectStorageAdapter { - async fn read_text(&self, uri: &str) -> Result { - let location = self.object_path(uri)?; - let bytes = self - .store - .get(&location) - .await - .map_err(|err| storage_backend_error("read", uri, err))? 
- .bytes() - .await - .map_err(|err| storage_backend_error("read", uri, err))?; - - String::from_utf8(bytes.to_vec()).map_err(|err| { - OmniError::manifest_internal(format!("storage read failed for '{}': {}", uri, err)) + if location.key.is_empty() { + return Err(OmniError::manifest_internal(format!( + "s3 storage path is empty for '{}'", + uri + ))); + } + ObjectPath::parse(&location.key).map_err(|err| { + OmniError::manifest_internal(format!("invalid s3 object path for '{}': {}", uri, err)) }) } - - async fn write_text(&self, uri: &str, contents: &str) -> Result<()> { - // Atomic visibility is the backend's contract: object stores via - // PutObject; LocalFileSystem via an internal staged-temp + rename - // (a reader sees the old object or the new one, never a truncated - // in-progress write). Callers (sidecar protocol, cluster state) - // assume it. - let location = self.object_path(uri)?; - self.store - .put(&location, PutPayload::from(contents.as_bytes().to_vec())) - .await - .map_err(|err| storage_backend_error("write", uri, err))?; - Ok(()) - } - - async fn write_text_if_absent(&self, uri: &str, contents: &str) -> Result { - // PutMode::Create: atomic no-replace publish on every backend — - // exactly one of N concurrent claimants wins, and the winner's - // object is fully readable at the instant it becomes visible - // (LocalFileSystem stages the temp file completely, then - // hard_links it; pinned by - // `local_write_text_if_absent_is_read_visible_on_return`). - let location = self.object_path(uri)?; - match self - .store - .put_opts( - &location, - PutPayload::from(contents.as_bytes().to_vec()), - PutMode::Create.into(), - ) - .await - { - Ok(_) => Ok(true), - Err(object_store::Error::AlreadyExists { .. }) - | Err(object_store::Error::Precondition { .. 
}) => Ok(false), - Err(err) => Err(storage_backend_error("write_if_absent", uri, err)), - } - } - - async fn exists(&self, uri: &str) -> Result { - // head() answers for objects; the list fallback answers for - // "directory-shaped" URIs (e.g. a Lance dataset root, whose - // `_versions/*.manifest` makes any committed dataset non-empty). - // Object-store semantics throughout: only objects exist — - // an EMPTY local directory does not (callers that probe local - // directories use std::fs directly). - let location = self.object_path(uri)?; - match self.store.head(&location).await { - Ok(_) => Ok(true), - Err(object_store::Error::NotFound { .. }) => { - let mut entries = self.store.list(Some(&location)); - let has_prefix_entries = entries - .try_next() - .await - .map_err(|err| storage_backend_error("exists", uri, err))? - .is_some(); - Ok(has_prefix_entries) - } - Err(err) => Err(storage_backend_error("exists", uri, err)), - } - } - - async fn rename_text(&self, from_uri: &str, to_uri: &str) -> Result<()> { - // ObjectStore::rename: LocalFileSystem overrides it with an atomic - // fs::rename (creating missing destination parents); object stores - // use the default copy + delete — if the copy succeeds and the - // delete fails (or the process crashes between them), both source - // and destination exist with the same content. Recovery code must - // tolerate this case — see schema_state::recover_schema_state_files. - let from = self.object_path(from_uri)?; - let to = self.object_path(to_uri)?; - self.store - .rename(&from, &to) - .await - .map_err(|err| storage_backend_error("rename", from_uri, err))?; - Ok(()) - } - - async fn delete(&self, uri: &str) -> Result<()> { - let location = self.object_path(uri)?; - match self.store.delete(&location).await { - Ok(()) => Ok(()), - Err(object_store::Error::NotFound { .. 
}) => Ok(()), - Err(err) => Err(storage_backend_error("delete", uri, err)), - } - } - - async fn list_dir(&self, dir_uri: &str) -> Result> { - // list_with_delimiter is non-recursive and path-delimited on every - // backend (no sibling-prefix bleed: listing `__recovery` cannot - // match `__recovery_log/...`), and returns Ok(empty) for a missing - // directory. Output URIs are anchored on the INPUT `dir_uri` plus - // the entry filename, so the strings round-trip byte-identically - // into read_text/delete regardless of scheme (plain path, file://, - // s3://). - let anchor = dir_uri.trim_end_matches('/'); - let prefix = self.object_path(anchor)?; - let listing = self - .store - .list_with_delimiter(Some(&prefix)) - .await - .map_err(|err| storage_backend_error("list_dir", dir_uri, err))?; - let mut out = Vec::with_capacity(listing.objects.len()); - for meta in listing.objects { - if let Some(name) = meta.location.filename() { - out.push(format!("{}/{}", anchor, name)); - } - } - Ok(out) - } - - async fn read_text_versioned(&self, uri: &str) -> Result<(String, String)> { - let location = self.object_path(uri)?; - let result = self - .store - .get(&location) - .await - .map_err(|err| storage_backend_error("read", uri, err))?; - let etag = result.meta.e_tag.clone(); - let bytes = result - .bytes() - .await - .map_err(|err| storage_backend_error("read", uri, err))?; - // The token SOURCE must agree with the write_text_if_match strategy - // below: conditional-update backends compare ETags server-side, so - // the token is the ETag; the local emulation compares content, so - // the token is the content hash. Mixing them makes every CAS lose. - let version = if self.supports_conditional_update { - // Every S3-compatible store we target returns ETags; fall back - // to a content token rather than failing if one ever omits it. 
- etag.unwrap_or_else(|| local_version_token(&bytes)) - } else { - local_version_token(&bytes) - }; - let text = String::from_utf8(bytes.to_vec()).map_err(|err| { - OmniError::manifest_internal(format!("storage read failed for '{}': {}", uri, err)) - })?; - Ok((text, version)) - } - - async fn write_text_if_match( - &self, - uri: &str, - contents: &str, - expected_version: &str, - ) -> Result> { - let location = self.object_path(uri)?; - if self.supports_conditional_update { - let mode = PutMode::Update(object_store::UpdateVersion { - e_tag: Some(expected_version.to_string()), - version: None, - }); - return match self - .store - .put_opts( - &location, - PutPayload::from(contents.as_bytes().to_vec()), - mode.into(), - ) - .await - { - Ok(result) => Ok(Some( - result - .e_tag - .unwrap_or_else(|| local_version_token(contents.as_bytes())), - )), - Err(object_store::Error::Precondition { .. }) - | Err(object_store::Error::NotFound { .. }) => Ok(None), - Err(err) => Err(storage_backend_error("write_if_match", uri, err)), - }; - } - // Local emulation: content-compare then atomic replace. NOT a - // cross-process CAS (check-then-act gap) — safe under the callers' - // lock protocol only; tracked in docs/dev/invariants.md Known Gaps. - let current = match self.store.get(&location).await { - Ok(result) => result - .bytes() - .await - .map_err(|err| storage_backend_error("read", uri, err))?, - Err(object_store::Error::NotFound { .. 
}) => return Ok(None), - Err(err) => return Err(storage_backend_error("read", uri, err)), - }; - if local_version_token(&current) != expected_version { - return Ok(None); - } - self.store - .put(&location, PutPayload::from(contents.as_bytes().to_vec())) - .await - .map_err(|err| storage_backend_error("write_if_match", uri, err))?; - Ok(Some(local_version_token(contents.as_bytes()))) - } - - async fn delete_prefix(&self, prefix_uri: &str) -> Result<()> { - // Directories are a local-FS concept: a list+delete loop would - // leave empty directory skeletons that local existence probes - // (cluster graph_root_exists uses std Path::exists) report as - // still-present. remove_dir_all reclaims them in one call. - if self.codec == UriCodec::Local { - let path = absolutize_lexically(local_path_from_uri(prefix_uri)?)?; - return match tokio::fs::remove_dir_all(&path).await { - Ok(()) => Ok(()), - Err(err) if err.kind() == std::io::ErrorKind::NotFound => Ok(()), - Err(err) => Err(err.into()), - }; - } - let prefix = self.object_path(prefix_uri.trim_end_matches('/'))?; - let mut entries = self.store.list(Some(&prefix)); - let mut locations = Vec::new(); - while let Some(meta) = entries - .try_next() - .await - .map_err(|err| storage_backend_error("delete_prefix", prefix_uri, err))? - { - locations.push(meta.location); - } - for location in locations { - match self.store.delete(&location).await { - Ok(()) => {} - Err(object_store::Error::NotFound { .. 
}) => {} - Err(err) => return Err(storage_backend_error("delete_prefix", prefix_uri, err)), - } - } - Ok(()) - } } pub fn storage_kind_for_uri(uri: &str) -> StorageKind { @@ -487,8 +340,8 @@ pub fn storage_kind_for_uri(uri: &str) -> StorageKind { pub fn storage_for_uri(uri: &str) -> Result> { match storage_kind_for_uri(uri) { - StorageKind::Local => Ok(Arc::new(ObjectStorageAdapter::local())), - StorageKind::S3 => Ok(Arc::new(ObjectStorageAdapter::s3_from_root_uri(uri)?)), + StorageKind::Local => Ok(Arc::new(LocalStorageAdapter)), + StorageKind::S3 => Ok(Arc::new(S3StorageAdapter::from_root_uri(uri)?)), } } @@ -534,38 +387,6 @@ fn local_path_from_uri(uri: &str) -> Result { Ok(PathBuf::from(uri)) } -/// Lexically absolutize a local path: join relative paths onto the current -/// working directory and fold `.` / `..` components, without touching the -/// filesystem. Required because `object_store::path::Path` rejects -/// relative and dot segments, while callers (the CLI in particular) pass -/// paths like `./graph.omni` verbatim. -fn absolutize_lexically(path: PathBuf) -> Result { - let joined = if path.is_absolute() { - path - } else { - std::env::current_dir() - .map_err(|err| { - OmniError::manifest_internal(format!( - "cannot resolve relative storage path '{}': {}", - path.display(), - err - )) - })? 
- .join(path) - }; - let mut out = PathBuf::new(); - for component in joined.components() { - match component { - Component::CurDir => {} - Component::ParentDir => { - out.pop(); - } - other => out.push(other), - } - } - Ok(out) -} - fn local_path_from_file_uri(uri: &str) -> Result { let url = Url::parse(uri).map_err(|err| { OmniError::manifest_internal(format!("invalid file uri '{}': {}", uri, err)) @@ -625,260 +446,6 @@ fn env_var_truthy(key: &str) -> bool { mod tests { use super::*; - /// The executable backend contract: every assertion here must hold for - /// EVERY backend (the divergence class this adapter closed was "two - /// implementations, one prose contract, no referee"). The S3 variant - /// runs bucket-gated in `tests/s3_storage.rs` - /// (`s3_adapter_conditional_writes_contract`). - async fn contract_suite(adapter: &dyn StorageAdapter, root: &str) { - // Write/read round-trip; replace is in-place and atomic. - let a = format!("{root}/contract/a.json"); - adapter.write_text(&a, "v1").await.unwrap(); - assert_eq!(adapter.read_text(&a).await.unwrap(), "v1"); - adapter.write_text(&a, "v2").await.unwrap(); - assert_eq!(adapter.read_text(&a).await.unwrap(), "v2"); - - // exists: object yes; missing no; non-empty prefix yes (the - // directory-shaped probe Lance dataset roots rely on). - assert!(adapter.exists(&a).await.unwrap()); - assert!( - !adapter - .exists(&format!("{root}/contract/missing.json")) - .await - .unwrap() - ); - assert!(adapter.exists(&format!("{root}/contract")).await.unwrap()); - - // if_absent: exactly one claim wins; the loser leaves the winner's - // object untouched. 
- let claim = format!("{root}/contract/claim.json"); - assert!(adapter.write_text_if_absent(&claim, "first").await.unwrap()); - assert!(!adapter.write_text_if_absent(&claim, "second").await.unwrap()); - assert_eq!(adapter.read_text(&claim).await.unwrap(), "first"); - - // Versioned CAS: fresh token wins, stale token loses with Ok(None) - // (never a silent overwrite), missing object can't match. - let state = format!("{root}/contract/state.json"); - adapter.write_text(&state, "s1").await.unwrap(); - let (text, v1) = adapter.read_text_versioned(&state).await.unwrap(); - assert_eq!(text, "s1"); - let v2 = adapter - .write_text_if_match(&state, "s2", &v1) - .await - .unwrap() - .expect("fresh token must win"); - assert_ne!(v2, v1); - assert!( - adapter - .write_text_if_match(&state, "s3", &v1) - .await - .unwrap() - .is_none() - ); - assert_eq!(adapter.read_text(&state).await.unwrap(), "s2"); - assert!( - adapter - .write_text_if_match(&format!("{root}/contract/absent.json"), "x", &v1) - .await - .unwrap() - .is_none() - ); - - // rename: destination is replaced; source is gone. - let src = format!("{root}/contract/src.json"); - adapter.write_text(&src, "moved").await.unwrap(); - adapter.rename_text(&src, &a).await.unwrap(); - assert_eq!(adapter.read_text(&a).await.unwrap(), "moved"); - assert!(!adapter.exists(&src).await.unwrap()); - - // list_dir: direct children only, no sibling-prefix bleed, output - // URIs round-trip verbatim into read_text, missing dir is empty. 
- let dir_uri = format!("{root}/contract/list"); - adapter - .write_text(&format!("{dir_uri}/one.json"), "1") - .await - .unwrap(); - adapter - .write_text(&format!("{dir_uri}/two.json"), "2") - .await - .unwrap(); - adapter - .write_text(&format!("{dir_uri}/sub/three.json"), "3") - .await - .unwrap(); - adapter - .write_text(&format!("{root}/contract/list_log/x.json"), "x") - .await - .unwrap(); - let mut listed = adapter.list_dir(&dir_uri).await.unwrap(); - listed.sort(); - assert_eq!( - listed, - vec![ - format!("{dir_uri}/one.json"), - format!("{dir_uri}/two.json") - ] - ); - for uri in &listed { - adapter.read_text(uri).await.unwrap(); - } - assert!( - adapter - .list_dir(&format!("{root}/contract/nope")) - .await - .unwrap() - .is_empty() - ); - - // delete: idempotent. - adapter.delete(&claim).await.unwrap(); - adapter.delete(&claim).await.unwrap(); - assert!(!adapter.exists(&claim).await.unwrap()); - - // delete_prefix: recursive + idempotent; nothing under the prefix - // (including local directory skeletons) survives. - adapter - .delete_prefix(&format!("{root}/contract")) - .await - .unwrap(); - assert!(!adapter.exists(&a).await.unwrap()); - assert!(!adapter.exists(&format!("{root}/contract")).await.unwrap()); - adapter - .delete_prefix(&format!("{root}/contract")) - .await - .unwrap(); - } - - #[tokio::test] - async fn contract_suite_local() { - let dir = tempfile::tempdir().unwrap(); - let adapter = ObjectStorageAdapter::local(); - contract_suite(&adapter, dir.path().to_str().unwrap()).await; - } - - #[tokio::test] - async fn contract_suite_in_memory() { - // InMemory implements true conditional updates, so this runs the - // strong-CAS path (ETag tokens + PutMode::Update) without a bucket. 
- let adapter = ObjectStorageAdapter::in_memory(); - contract_suite(&adapter, "mem-root").await; - } - - /// `write_text_if_absent` must make the contents visible to any - /// subsequent reader before it returns — callers acknowledge - /// success the moment it resolves (cluster state bootstrap reads - /// the file back; init ownership claims depend on it). - /// Regression: the previous hand-rolled local adapter wrote through a - /// buffered `tokio::fs::File` without flushing, so the bytes could - /// still be in flight on the blocking pool while a reader saw an empty - /// or partial file. Reads back through `std::fs` deliberately — - /// cross-API visibility is the point. - #[tokio::test] - async fn local_write_text_if_absent_is_read_visible_on_return() { - let dir = tempfile::tempdir().unwrap(); - let adapter = ObjectStorageAdapter::local(); - let payload = "x".repeat(8 * 1024); - for i in 0..1000 { - let path = dir.path().join(format!("obj-{i}.json")); - let uri = format!("{}", path.display()); - assert!(adapter.write_text_if_absent(&uri, &payload).await.unwrap()); - let read = std::fs::read_to_string(&path).unwrap(); - assert_eq!( - read.len(), - payload.len(), - "iteration {i}: write_text_if_absent returned before its \ - contents reached the file" - ); - } - } - - /// Regression for the write_text_if_absent buffering bug, via the - /// `storage_for_uri` + `file://` construction path and a multi-thread - /// runtime (complements `local_write_text_if_absent_is_read_visible_- - /// on_return`, which uses the direct constructor and plain paths): a - /// reader immediately after Ok(true) must never see the created file - /// empty or short. 
- #[tokio::test(flavor = "multi_thread")] - async fn write_text_if_absent_is_read_consistent_immediately() { - let dir = tempfile::tempdir().unwrap(); - let adapter = storage_for_uri(&format!("file://{}", dir.path().display())).unwrap(); - let payload = "x".repeat(64 * 1024); - for i in 0..200 { - let uri = format!("file://{}/f{}.json", dir.path().display(), i); - assert!(adapter.write_text_if_absent(&uri, &payload).await.unwrap()); - let read = std::fs::read_to_string(dir.path().join(format!("f{i}.json"))).unwrap(); - assert_eq!(read.len(), payload.len(), "iteration {i}: short read"); - } - } - - /// Object-store semantics on the local filesystem: only objects exist. - /// An empty directory is not an object and not a non-empty prefix — - /// callers that genuinely probe local directories use std::fs. - #[tokio::test] - async fn local_exists_is_object_semantics_for_directories() { - let dir = tempfile::tempdir().unwrap(); - let probe = dir.path().join("maybe-dataset"); - let adapter = ObjectStorageAdapter::local(); - std::fs::create_dir(&probe).unwrap(); - assert!( - !adapter.exists(probe.to_str().unwrap()).await.unwrap(), - "an empty directory is not an object" - ); - std::fs::write(probe.join("1.manifest"), "m").unwrap(); - assert!( - adapter.exists(probe.to_str().unwrap()).await.unwrap(), - "a non-empty prefix exists (the Lance dataset-root probe shape)" - ); - } - - /// list_dir output is anchored on the INPUT dir_uri, so `file://` - /// anchors and paths with spaces round-trip byte-identically into - /// read_text — the cluster store passes file://-schemed roots. 
- #[tokio::test] - async fn local_list_round_trips_file_scheme_and_spaces() { - let dir = tempfile::tempdir().unwrap(); - let root = dir.path().join("with space"); - let adapter = ObjectStorageAdapter::local(); - let plain = format!("{}/x.json", root.display()); - adapter.write_text(&plain, "x").await.unwrap(); - - let listed = adapter.list_dir(root.to_str().unwrap()).await.unwrap(); - assert_eq!(listed, vec![plain.clone()]); - assert_eq!(adapter.read_text(&listed[0]).await.unwrap(), "x"); - - let file_anchor = format!("file://{}", root.display()); - let listed = adapter.list_dir(&file_anchor).await.unwrap(); - assert_eq!(listed, vec![format!("{file_anchor}/x.json")]); - assert_eq!(adapter.read_text(&listed[0]).await.unwrap(), "x"); - } - - /// Relative and dot-segment paths are lexically absolutized before - /// hitting the object-path layer (which rejects them) — the CLI passes - /// `./graph.omni`-shaped URIs verbatim. - #[tokio::test] - async fn local_paths_with_dot_segments_are_absolutized() { - let dir = tempfile::tempdir().unwrap(); - let adapter = ObjectStorageAdapter::local(); - let uri = format!("{}/sub/../dotted.json", dir.path().display()); - adapter.write_text(&uri, "x").await.unwrap(); - assert_eq!(adapter.read_text(&uri).await.unwrap(), "x"); - assert!(dir.path().join("dotted.json").exists()); - } - - /// Upstream local rename creates missing destination parents — more - /// lenient than the previous bare fs::rename; pinned so an upstream - /// regression is loud. 
- #[tokio::test] - async fn local_rename_creates_missing_destination_parents() { - let dir = tempfile::tempdir().unwrap(); - let adapter = ObjectStorageAdapter::local(); - let src = format!("{}/src.json", dir.path().display()); - adapter.write_text(&src, "x").await.unwrap(); - let dst = format!("{}/new-sub/dst.json", dir.path().display()); - adapter.rename_text(&src, &dst).await.unwrap(); - assert_eq!(adapter.read_text(&dst).await.unwrap(), "x"); - } - #[test] fn storage_backend_selection_is_scheme_aware() { assert_eq!(storage_kind_for_uri("/tmp/graph"), StorageKind::Local); @@ -931,4 +498,15 @@ mod tests { assert_eq!(location.key, "graph/_schema.pg"); } + #[tokio::test] + async fn local_write_text_if_absent_creates_once_without_overwrite() { + let dir = tempfile::tempdir().unwrap(); + let uri = dir.path().join("claim.txt"); + let uri = uri.to_str().unwrap(); + let storage = LocalStorageAdapter; + + assert!(storage.write_text_if_absent(uri, "first").await.unwrap()); + assert!(!storage.write_text_if_absent(uri, "second").await.unwrap()); + assert_eq!(storage.read_text(uri).await.unwrap(), "first"); + } } diff --git a/crates/omnigraph/src/storage_layer.rs b/crates/omnigraph/src/storage_layer.rs index 3ea9647..dac9482 100644 --- a/crates/omnigraph/src/storage_layer.rs +++ b/crates/omnigraph/src/storage_layer.rs @@ -7,32 +7,30 @@ //! way for new engine writers to advance Lance HEAD without coupling //! "write bytes" with "advance HEAD" in one Lance API call. //! -//! ## Inline-commit residuals live on a separate trait +//! ## Transitional residuals on the trait //! -//! The inline-commit writes that Lance cannot yet express as -//! stage-then-commit are NOT on `TableStorage`. They sit on -//! [`InlineCommitResidual`], reachable only via -//! `Omnigraph::storage_inline_residual()`, so the default `db.storage()` -//! surface is staged-only and cannot couple "write bytes" with "advance -//! HEAD" — MR-793 acceptance §1 closes by construction. The residuals: -//! -//! 
* `delete_where` — Lance #6658 (`DeleteBuilder::execute_uncommitted`) -//! did not backport to the 6.x line; it first ships in `v7.0.0-beta.10`. -//! Migration to staged two-phase delete is tracked as MR-A, gated on the -//! Lance v7.x bump. -//! * `create_vector_index` — segment-commit-path needs -//! `build_index_metadata_from_segments`, still `pub(crate)` in Lance -//! 6.0.1 ([#6666](https://github.com/lance-format/lance/issues/6666), -//! open). Scalar indices already stage. -//! -//! Each is named honestly at its call site; the forbidden-API guard test -//! catches direct lance::* misuse outside the storage layer. +//! Several inline-commit methods remain on the trait surface as +//! documented residuals: `delete_where` +//! ([#6658](https://github.com/lance-format/lance/issues/6658) closed +//! 2026-05-14, but the public `DeleteBuilder::execute_uncommitted` API +//! did not backport to the 6.x release line — it first ships in +//! `v7.0.0-beta.10`. Migration to staged two-phase delete is tracked as +//! MR-A and is gated on the Lance v7.x bump, not the current v6.0.1 pin), +//! `create_vector_index` (segment-commit-path requires +//! `build_index_metadata_from_segments` which is `pub(crate)` — see +//! [#6666](https://github.com/lance-format/lance/issues/6666), still open), and the +//! legacy `append_batch` / `merge_insert_batches` / `overwrite_batch` / +//! `create_btree_index` / `create_inverted_index` paths kept while +//! engine call sites finish migrating off of them (Phase 1b / Phase 9 +//! of MR-793). These are named honestly at every call site; the +//! forbidden-API guard test catches direct lance::* misuse outside the +//! storage layer. //! //! ## Sealed //! -//! Both `TableStorage` and `InlineCommitResidual` are `: sealed::Sealed`. -//! Only types in this crate can implement them, so a downstream crate -//! cannot subvert the contract by providing its own impl. +//! `TableStorage: sealed::Sealed`. 
Only types in this crate can implement +//! the trait, so a downstream crate cannot subvert the contract by +//! providing its own impl. //! //! ## Opaque handles //! @@ -42,15 +40,15 @@ //! through. This aligns with the storage-boundary invariant: //! `lance::Dataset` does not appear in trait signatures. //! -//! ## Migration status +//! ## Migration status (MR-793 PR #70) //! -//! Phases 1a / 2 / 4 / 5 / 6 landed in MR-793 PR #70 (trait scaffolding, -//! staged primitives, migration of `ensure_indices` / `branch_merge` / -//! `schema_apply` onto the staged surface). Phase 1b (call-site -//! conversion) and Phase 9 landed in MR-854, which also split the -//! inline-commit residuals onto `InlineCommitResidual` so `db.storage()` -//! is staged-only. Phase 7 (recovery reconciler) shipped as MR-847; -//! Phase 8 (index reconciler) is tracked as MR-848. +//! Phases 1a / 2 / 4 / 5 / 6 are landed: trait scaffolding, three new +//! staged primitives (`stage_overwrite`, scalar index staging), and +//! migration of `ensure_indices`, `branch_merge`, `schema_apply` onto +//! the staged surface. Phase 1b (call-site conversion to +//! `Arc`), Phase 9 (demote unused inline-commit +//! methods to `pub(crate)`), Phase 7 (recovery reconciler — MR-847), +//! and Phase 8 (index reconciler — MR-848) are deferred to follow-ups. use std::fmt::Debug; use std::sync::Arc; @@ -107,37 +105,12 @@ impl SnapshotHandle { &self.inner } - /// Take ownership of the inner `Arc`. Used by the - /// `TableStorage` impl when an op needs to mutate the dataset in - /// place (commit a staged write, append, overwrite, …). - /// - /// Performance note: callers consume the returned `Arc` via - /// `Arc::try_unwrap(...).unwrap_or_else(|arc| (*arc).clone())`. The - /// fast path (no clone) only fires when the snapshot is single-ref - /// — i.e. the caller dropped every other `SnapshotHandle` clone - /// before calling. Holding parallel clones (e.g. 
across an `await` - /// point or stashed in a struct) forces a deep `Dataset` clone on - /// every mutating op. Engine callers should pass `SnapshotHandle` - /// by value into the mutating method, not keep a side copy. + /// Take ownership of the inner `Arc`. Used when committing + /// staged writes (the call needs to consume the snapshot). pub(crate) fn into_arc(self) -> Arc { self.inner } - /// Take ownership of the inner `Dataset` by unwrapping the `Arc` - /// (or cloning if the snapshot is shared). `pub(crate)` — used - /// only by the maintenance path (`optimize`, `cleanup`) which - /// must hand `&mut Dataset` to Lance compaction / cleanup APIs - /// that the `TableStorage` trait does not (and should not) - /// surface. Engine code that participates in the staged-write - /// invariant must stay on the trait methods. - /// - /// Single-ref invariant: same fast-path/clone behavior as - /// `into_arc` — see that method's doc. Drop sibling - /// `SnapshotHandle` clones before calling. - pub(crate) fn into_dataset(self) -> Dataset { - Arc::try_unwrap(self.inner).unwrap_or_else(|arc| (*arc).clone()) - } - // ── public, lance-free accessors ── /// Current Lance manifest version of the snapshot. @@ -184,26 +157,6 @@ pub(crate) fn staged_handles_as_writes(handles: &[StagedHandle]) -> Vec { - Created(D), - RefAlreadyExists, -} - // ─── TableStorage trait ──────────────────────────────────────────────────── /// Engine-internal trait covering every Lance dataset operation an @@ -251,24 +204,10 @@ pub trait TableStorage: sealed::Sealed + Send + Sync + Debug { table_key: &str, source_version: u64, target_branch: &str, - ) -> Result>; + ) -> Result; async fn delete_branch(&self, dataset_uri: &str, branch: &str) -> Result<()>; - /// Idempotent variant of `delete_branch` used by the best-effort fork - /// reclaim under branch delete (`db/omnigraph.rs::cleanup_deleted_branch_tables`) - /// and by the orphan-fork reconciler in `optimize`. 
Tolerates an - /// already-absent branch (both Lance's `RefNotFound` and the local-store - /// `NotFound` quirk on a missing `tree/{branch}/` dir). A still-referenced - /// branch (`RefConflict`) still surfaces as `OmniError::Lance`. - async fn force_delete_branch(&self, dataset_uri: &str, branch: &str) -> Result<()>; - - /// List the named Lance branches present on the dataset at `dataset_uri`. - /// The `cleanup` orphan reconciler diffs this against the manifest - /// branch set to find orphaned per-table forks. `main`/default is not a - /// named branch and never appears here. - async fn list_branches(&self, dataset_uri: &str) -> Result>; - async fn reopen_for_mutation( &self, dataset_uri: &str, @@ -353,15 +292,6 @@ pub trait TableStorage: sealed::Sealed + Send + Sync + Debug { prior_stages: &[StagedHandle], ) -> Result; - /// Append `source`'s rows into `snapshot`'s table, streaming so the whole - /// row set is never materialized in memory (see `TableStore::stage_append_stream`). - async fn stage_append_stream( - &self, - snapshot: &SnapshotHandle, - source: &SnapshotHandle, - prior_stages: &[StagedHandle], - ) -> Result; - async fn stage_merge_insert( &self, snapshot: SnapshotHandle, @@ -398,19 +328,74 @@ pub trait TableStorage: sealed::Sealed + Send + Sync + Debug { column: &str, ) -> Result; - // ── Index presence (reads, no HEAD advance) ────────────────────── + // ── Inline-commit residuals (named honestly per MR-793 §3.2) ────── // - // The inline-commit writes (`delete_where`, `create_vector_index`) are - // deliberately NOT on this trait. They live on - // the separate `InlineCommitResidual` trait, reachable only through - // `Omnigraph::storage_inline_residual()`. As a result the default - // `db.storage()` surface cannot couple "write bytes" with "advance HEAD" - // — closing MR-793 acceptance §1 by construction rather than by review. + // These methods advance Lance HEAD as a side effect of writing. 
+ // They stay on the trait until the corresponding upstream Lance API + // ships: + // + // * `delete_where` — Lance #6658 (two-phase delete). + // * `create_*_index` — `build_index_metadata_from_segments` is + // `pub(crate)` for vector indices in lance-4.0.0; scalar indices + // migrate to staged in MR-793 Phase 2. + // * `append_batch`, `merge_insert_batches`, `overwrite_batch` — + // legacy paths that will be demoted to `pub(crate)` in MR-793 + // Phase 9 once all engine sites route through the staged + // primitives. + + async fn append_batch( + &self, + dataset_uri: &str, + snapshot: SnapshotHandle, + batch: RecordBatch, + ) -> Result<(SnapshotHandle, TableState)>; + + async fn merge_insert_batches( + &self, + dataset_uri: &str, + snapshot: SnapshotHandle, + batches: Vec, + key_columns: Vec, + when_matched: WhenMatched, + when_not_matched: WhenNotMatched, + ) -> Result; + + async fn overwrite_batch( + &self, + dataset_uri: &str, + snapshot: SnapshotHandle, + batch: RecordBatch, + ) -> Result<(SnapshotHandle, TableState)>; + + async fn delete_where( + &self, + dataset_uri: &str, + snapshot: SnapshotHandle, + filter: &str, + ) -> Result<(SnapshotHandle, DeleteState)>; async fn has_btree_index(&self, snapshot: &SnapshotHandle, column: &str) -> Result; async fn has_fts_index(&self, snapshot: &SnapshotHandle, column: &str) -> Result; async fn has_vector_index(&self, snapshot: &SnapshotHandle, column: &str) -> Result; + async fn create_btree_index( + &self, + snapshot: SnapshotHandle, + columns: &[&str], + ) -> Result; + + async fn create_inverted_index( + &self, + snapshot: SnapshotHandle, + column: &str, + ) -> Result; + + async fn create_vector_index( + &self, + snapshot: SnapshotHandle, + column: &str, + ) -> Result; + // ── URI helpers ──────────────────────────────────────────────────── // // These are pure string formatting; they live on the trait so engine @@ -437,38 +422,6 @@ pub trait TableStorage: sealed::Sealed + Send + Sync + Debug { ) -> 
Result; } -// ─── InlineCommitResidual trait ──────────────────────────────────────────── - -/// Inline-commit residual surface: the writes Lance cannot yet express as a -/// stage-then-commit pair, so they advance Lance HEAD as a side effect of -/// writing. Kept OFF `TableStorage` and reachable only through -/// `Omnigraph::storage_inline_residual()`, so the default `db.storage()` path -/// is staged-only and a new writer cannot reintroduce the write+commit coupling -/// by accident (MR-793 acceptance §1, by construction). -/// -/// Residual reasons (each is named honestly at its call site): -/// * `delete_where` -- Lance has no public two-phase delete on the 6.x line -/// (`DeleteBuilder::execute_uncommitted` first ships in v7.x; MR-A / Lance -/// #6658). The D2 parse-time rule + recovery sidecars cover the gap meanwhile. -/// * `create_vector_index` -- vector-index segment-commit needs -/// `build_index_metadata_from_segments`, still `pub(crate)` in Lance 6.0.1 -/// (Lance #6666). Scalar indices already stage. -#[async_trait] -pub(crate) trait InlineCommitResidual: sealed::Sealed + Send + Sync + Debug { - async fn delete_where( - &self, - dataset_uri: &str, - snapshot: SnapshotHandle, - filter: &str, - ) -> Result<(SnapshotHandle, DeleteState)>; - - async fn create_vector_index( - &self, - snapshot: SnapshotHandle, - column: &str, - ) -> Result; -} - // ─── single impl: TableStore ────────────────────────────────────────────── #[async_trait] @@ -526,36 +479,23 @@ impl TableStorage for TableStore { table_key: &str, source_version: u64, target_branch: &str, - ) -> Result> { - Ok( - match TableStore::fork_branch_from_state( - self, - dataset_uri, - source_branch, - table_key, - source_version, - target_branch, - ) - .await?
- { - ForkOutcome::Created(ds) => ForkOutcome::Created(SnapshotHandle::new(ds)), - ForkOutcome::RefAlreadyExists => ForkOutcome::RefAlreadyExists, - }, + ) -> Result { + TableStore::fork_branch_from_state( + self, + dataset_uri, + source_branch, + table_key, + source_version, + target_branch, ) + .await + .map(SnapshotHandle::new) } async fn delete_branch(&self, dataset_uri: &str, branch: &str) -> Result<()> { TableStore::delete_branch(self, dataset_uri, branch).await } - async fn force_delete_branch(&self, dataset_uri: &str, branch: &str) -> Result<()> { - TableStore::force_delete_branch(self, dataset_uri, branch).await - } - - async fn list_branches(&self, dataset_uri: &str) -> Result> { - TableStore::list_branches(self, dataset_uri).await - } - async fn reopen_for_mutation( &self, dataset_uri: &str, @@ -693,18 +633,6 @@ impl TableStorage for TableStore { .map(StagedHandle::new) } - async fn stage_append_stream( - &self, - snapshot: &SnapshotHandle, - source: &SnapshotHandle, - prior_stages: &[StagedHandle], - ) -> Result { - let staged_writes = staged_handles_as_writes(prior_stages); - TableStore::stage_append_stream(self, snapshot.dataset(), source.dataset(), &staged_writes) - .await - .map(StagedHandle::new) - } - async fn stage_merge_insert( &self, snapshot: SnapshotHandle, @@ -761,6 +689,61 @@ impl TableStorage for TableStore { .map(StagedHandle::new) } + async fn append_batch( + &self, + dataset_uri: &str, + snapshot: SnapshotHandle, + batch: RecordBatch, + ) -> Result<(SnapshotHandle, TableState)> { + let mut ds = Arc::try_unwrap(snapshot.into_arc()).unwrap_or_else(|arc| (*arc).clone()); + let state = TableStore::append_batch(self, dataset_uri, &mut ds, batch).await?; + Ok((SnapshotHandle::new(ds), state)) + } + + async fn merge_insert_batches( + &self, + dataset_uri: &str, + snapshot: SnapshotHandle, + batches: Vec, + key_columns: Vec, + when_matched: WhenMatched, + when_not_matched: WhenNotMatched, + ) -> Result { + let ds = 
Arc::try_unwrap(snapshot.into_arc()).unwrap_or_else(|arc| (*arc).clone()); + TableStore::merge_insert_batches( + self, + dataset_uri, + ds, + batches, + key_columns, + when_matched, + when_not_matched, + ) + .await + } + + async fn overwrite_batch( + &self, + dataset_uri: &str, + snapshot: SnapshotHandle, + batch: RecordBatch, + ) -> Result<(SnapshotHandle, TableState)> { + let mut ds = Arc::try_unwrap(snapshot.into_arc()).unwrap_or_else(|arc| (*arc).clone()); + let state = TableStore::overwrite_batch(self, dataset_uri, &mut ds, batch).await?; + Ok((SnapshotHandle::new(ds), state)) + } + + async fn delete_where( + &self, + dataset_uri: &str, + snapshot: SnapshotHandle, + filter: &str, + ) -> Result<(SnapshotHandle, DeleteState)> { + let mut ds = Arc::try_unwrap(snapshot.into_arc()).unwrap_or_else(|arc| (*arc).clone()); + let state = TableStore::delete_where(self, dataset_uri, &mut ds, filter).await?; + Ok((SnapshotHandle::new(ds), state)) + } + async fn has_btree_index(&self, snapshot: &SnapshotHandle, column: &str) -> Result { TableStore::has_btree_index(self, snapshot.dataset(), column).await } @@ -773,6 +756,36 @@ impl TableStorage for TableStore { TableStore::has_vector_index(self, snapshot.dataset(), column).await } + async fn create_btree_index( + &self, + snapshot: SnapshotHandle, + columns: &[&str], + ) -> Result { + let mut ds = Arc::try_unwrap(snapshot.into_arc()).unwrap_or_else(|arc| (*arc).clone()); + TableStore::create_btree_index(self, &mut ds, columns).await?; + Ok(SnapshotHandle::new(ds)) + } + + async fn create_inverted_index( + &self, + snapshot: SnapshotHandle, + column: &str, + ) -> Result { + let mut ds = Arc::try_unwrap(snapshot.into_arc()).unwrap_or_else(|arc| (*arc).clone()); + TableStore::create_inverted_index(self, &mut ds, column).await?; + Ok(SnapshotHandle::new(ds)) + } + + async fn create_vector_index( + &self, + snapshot: SnapshotHandle, + column: &str, + ) -> Result { + let mut ds = 
Arc::try_unwrap(snapshot.into_arc()).unwrap_or_else(|arc| (*arc).clone()); + TableStore::create_vector_index(self, &mut ds, column).await?; + Ok(SnapshotHandle::new(ds)) + } + fn root_uri(&self) -> &str { TableStore::root_uri(self) } @@ -802,27 +815,3 @@ impl TableStorage for TableStore { .await } } - -#[async_trait] -impl InlineCommitResidual for TableStore { - async fn delete_where( - &self, - dataset_uri: &str, - snapshot: SnapshotHandle, - filter: &str, - ) -> Result<(SnapshotHandle, DeleteState)> { - let mut ds = Arc::try_unwrap(snapshot.into_arc()).unwrap_or_else(|arc| (*arc).clone()); - let state = TableStore::delete_where(self, dataset_uri, &mut ds, filter).await?; - Ok((SnapshotHandle::new(ds), state)) - } - - async fn create_vector_index( - &self, - snapshot: SnapshotHandle, - column: &str, - ) -> Result { - let mut ds = Arc::try_unwrap(snapshot.into_arc()).unwrap_or_else(|arc| (*arc).clone()); - TableStore::create_vector_index(self, &mut ds, column).await?; - Ok(SnapshotHandle::new(ds)) - } -} diff --git a/crates/omnigraph/src/table_store.rs b/crates/omnigraph/src/table_store.rs index da31848..ddab706 100644 --- a/crates/omnigraph/src/table_store.rs +++ b/crates/omnigraph/src/table_store.rs @@ -2,7 +2,7 @@ use arrow_array::{ Array, ArrayRef, RecordBatch, StringArray, StructArray, UInt8Array, UInt32Array, UInt64Array, }; use arrow_schema::SchemaRef; -use datafusion::physical_plan::SendableRecordBatchStream; +use arrow_select::concat::concat_batches; use futures::TryStreamExt; use lance::Dataset; use lance::blob::BlobArrayBuilder; @@ -13,7 +13,7 @@ use lance::dataset::{ CommitBuilder, InsertBuilder, MergeInsertBuilder, WhenMatched, WhenNotMatched, WriteMode, WriteParams, }; -use lance::datatypes::{BlobKind, Schema as LanceSchema}; +use lance::datatypes::BlobKind; use lance::index::DatasetIndexExt; use lance::index::scalar::IndexDetails; use lance_file::version::LanceFileVersion; @@ -24,10 +24,9 @@ use lance_table::format::{Fragment, IndexMetadata, 
RowIdMeta}; use lance_table::rowids::{RowIdSequence, write_row_ids}; use std::sync::Arc; -use crate::db::manifest::TableVersionMetadata; +use crate::db::manifest::{TableVersionMetadata, open_table_head_for_write}; use crate::db::{Snapshot, SubTableEntry}; use crate::error::{OmniError, Result}; -use crate::storage_layer::ForkOutcome; #[derive(Debug, Clone, PartialEq, Eq)] pub struct TableState { @@ -44,26 +43,13 @@ pub struct DeleteState { pub(crate) version_metadata: TableVersionMetadata, } -/// Whether a `key_col IN (...)` scan on a dataset will be served by the -/// persisted scalar (BTREE) index, or silently fall back to a full filtered -/// scan. Detection-only (metadata, no IO); the scan returns the correct rows -/// either way. Surfaced by the indexed traversal path so the silent perf -/// fallback is observable, and available to a future cost-based planner. -#[derive(Debug, Clone, PartialEq, Eq)] -pub enum IndexCoverage { - /// The column has a usable BTREE and every fragment records `physical_rows`. - Indexed, - /// Lance will not use the scalar index for this scan (correct, full scan). - Degraded { reason: String }, -} - /// A Lance write that has produced fragment files on object storage but is /// not yet committed to the dataset's manifest. The staged-write primitives /// are consumed by `MutationStaging` (`exec/staging.rs`, /// `exec/mutation.rs`) and the bulk loader (`loader/mod.rs`). The /// intent: defer Lance commits to end-of-query so a mid-query failure /// leaves the touched table at the pre-mutation HEAD instead of -/// drifting ahead. See `docs/dev/writes.md` for the publisher-CAS contract +/// drifting ahead. See `docs/runs.md` for the publisher-CAS contract /// this builds on. /// /// `transaction` is opaque from our side -- Lance owns its semantics. We @@ -160,15 +146,9 @@ impl TableStore { dataset_uri: &str, branch: Option<&str>, ) -> Result { - // Direct open by URI (O(1) latest-resolution).
Routed through the tracked - // opener so a cost test counts it via the per-query `table_wrapper` - // (no-op in production -- the task-local is unset, so this is exactly - // `Dataset::open(uri)`). - let ds = crate::instrumentation::open_dataset_tracked( - dataset_uri, - crate::instrumentation::table_wrapper(), - ) - .await?; + let ds = Dataset::open(dataset_uri) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; match branch { Some(branch) if branch != "main" => ds .checkout_branch(branch) @@ -184,14 +164,8 @@ impl TableStore { dataset_uri: &str, branch: Option<&str>, ) -> Result { - // RFC-013 step 3a: open writes via the direct opener (O(1)) instead of the - // lance-namespace builder, which re-resolved the table's version chain - // O(depth) per write. The namespace is a catalog/discovery layer, not a - // per-open hot-path component (RFC §2.4); the manifest already holds the - // location, and `ensure_expected_version` still validates head == pinned - // for strict ops. `table_key` retained for signature stability. - let _ = table_key; - self.open_dataset_head(dataset_uri, branch).await + let table_path = self.table_path_from_dataset_uri(dataset_uri)?; + open_table_head_for_write(&self.root_uri, table_key, &table_path, branch).await } pub async fn delete_branch(&self, dataset_uri: &str, branch: &str) -> Result<()> { @@ -203,45 +177,6 @@ impl TableStore { .map_err(|e| OmniError::Lance(e.to_string())) } - /// List the named Lance branches present on the dataset at `dataset_uri`. - /// The `cleanup` orphan reconciler diffs this against the manifest branch - /// set to find orphaned per-table forks. `main`/default is not a named - /// branch and never appears here.
- pub async fn list_branches(&self, dataset_uri: &str) -> Result> { - let ds = Dataset::open(dataset_uri) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - let branches = ds - .list_branches() - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - Ok(branches.into_keys().collect()) - } - - /// Idempotently drop `branch` from the dataset at `dataset_uri`. - /// - /// Unlike [`delete_branch`](Self::delete_branch), this tolerates an - /// already-absent branch -- both a missing contents ref (Lance's - /// `force_delete_branch` handles that) and a missing `tree/{branch}/` - /// directory (the local-store `NotFound` quirk pinned by - /// `lance_surface_guards::force_delete_branch_semantics`). Safe to call on a - /// possibly-orphaned or already-reclaimed fork. - /// - /// A branch that still has referencing descendants (`RefConflict`) is NOT - /// tolerated: that is a real ordering error and surfaces as `OmniError::Lance`. - /// Used by the eager best-effort reclaim in `cleanup_deleted_branch_tables` - /// and the `cleanup` orphan reconciler. - pub async fn force_delete_branch(&self, dataset_uri: &str, branch: &str) -> Result<()> { - let mut ds = Dataset::open(dataset_uri) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - match ds.force_delete_branch(branch).await { - Ok(()) => Ok(()), - Err(lance::Error::RefNotFound { .. }) | Err(lance::Error::NotFound { .. }) => Ok(()), - Err(e) => Err(OmniError::Lance(e.to_string())), - } - } - pub async fn open_dataset_at_state( &self, table_path: &str, @@ -299,7 +234,7 @@ impl TableStore { table_key: &str, source_version: u64, target_branch: &str, - ) -> Result> { + ) -> Result { let mut source_ds = self .open_dataset_head(dataset_uri, source_branch) .await?
@@ -308,49 +243,28 @@ impl TableStore { .map_err(|e| OmniError::Lance(e.to_string()))?; self.ensure_expected_version(&source_ds, table_key, source_version)?; - if let Err(create_err) = source_ds + match source_ds .create_branch(target_branch, source_version, None) .await { - // Disambiguate the failure: only a genuinely pre-existing ref is a - // reclaim candidate. Mapping EVERY create_branch failure to - // `RefAlreadyExists` would route a transient I/O / version / Lance - // internal error into the destructive reclaim path. So check whether - // the ref actually exists; if not, the failure is real -- propagate - // it (preserving error fidelity) rather than force-deleting. - // - // `list_branches` reads `_refs/branches/` from the store, so it sees - // a fully-formed manifest-unreferenced fork (our common case -- a - // create_branch that completed but whose manifest publish did not). - // It does NOT see a phase-1-only Lance "zombie" (tree dir written, - // no BranchContents) -- but neither does `cleanup`'s reconciler, also - // list_branches-based. A zombie only forms if create_branch is - // interrupted *between its two internal phases* (a far narrower - // window than the manifest-publish gap), and it surfaces here as the - // propagated create error requiring manual reclaim. We deliberately - // do NOT force-delete on a not-found-ref failure: it is - // indistinguishable from a transient error on a fresh create, and - // force-deleting there is the destructive overreach this guard - // removes. The caller holds the per-(table, branch) write queue, so - // no in-process writer races this fork; a cross-process create - // between our check and now is the documented one-winner-CAS gap and - // propagates as a retryable error.
- let ref_exists = source_ds - .list_branches() + Ok(_) => {} + Err(create_err) => match self + .open_dataset_head(dataset_uri, Some(target_branch)) .await - .map(|b| b.contains_key(target_branch)) - .unwrap_or(false); - if ref_exists { - return Ok(ForkOutcome::RefAlreadyExists); - } - return Err(OmniError::Lance(create_err.to_string())); + { + Ok(ds) => { + self.ensure_expected_version(&ds, table_key, source_version)?; + return Ok(ds); + } + Err(_) => return Err(OmniError::Lance(create_err.to_string())), + }, } let ds = self .open_dataset_head(dataset_uri, Some(target_branch)) .await?; self.ensure_expected_version(&ds, table_key, source_version)?; - Ok(ForkOutcome::Created(ds)) + Ok(ds) } pub async fn scan_batches(&self, ds: &Dataset) -> Result> { @@ -375,29 +289,6 @@ impl TableStore { Ok(materialized) } - /// Streaming, blob-aware sibling of [`Self::scan_batches_for_rewrite`]. - /// Yields the dataset's rows lazily as a `SendableRecordBatchStream` so a - /// downstream writer (`stage_append_stream`) never materializes the whole - /// table in memory. Blob columns still need per-row rebuild, so those tables - /// fall back to the materialized path and are re-streamed from the `Vec` - /// (rare -- only tables with a `Blob` property; bounded-memory blob streaming - /// is a follow-up). The non-blob path is a true lazy scan. - pub async fn scan_stream_for_rewrite(&self, ds: &Dataset) -> Result { - let has_blob_columns = ds.schema().fields_pre_order().any(|field| field.is_blob()); - if has_blob_columns { - let arrow_schema: SchemaRef = Arc::new(ds.schema().into()); - let batches = self.scan_batches_for_rewrite(ds).await?; - let reader = arrow_array::RecordBatchIterator::new( - batches.into_iter().map(Ok), - arrow_schema, - ); - return Ok(lance_datafusion::utils::reader_to_stream(Box::new(reader))); - } - // Non-blob: a true lazy scan. `DatasetRecordBatchStream` converts to the - // `SendableRecordBatchStream` that `execute_uncommitted_stream` consumes.
- Ok(Self::scan_stream(ds, None, None, None, false).await?.into()) - } - pub(crate) async fn materialize_blob_batch( ds: &Dataset, batch: RecordBatch, @@ -649,147 +540,6 @@ impl TableStore { .map_err(|e| OmniError::Lance(e.to_string())) } - /// Indexed neighbor lookup for graph traversal. Given an edge dataset and a - /// set of endpoint keys on `key_col` (`"src"` for out-traversal, `"dst"` for - /// in-traversal), return the matching edge rows projected to - /// `[key_col, opposite_col]`. - /// - /// The `key_col IN (keys)` predicate is built as a structured DataFusion - /// `Expr` and applied via `Scanner::filter_expr`, so Lance routes it through - /// the persisted BTREE on `key_col` (index-search → take). Cost scales with - /// the frontier size, not |E| -- the basis for serving selective traversals - /// without building the whole in-memory CSR. Empty `keys` returns empty - /// without scanning. - /// - /// Note: like any indexed scan, this observes only fragments the BTREE - /// covers plus an unindexed-fragment scan fallback; it reads the committed - /// snapshot `ds` was opened at. - pub async fn scan_edges_by_endpoint( - ds: &Dataset, - key_col: &str, - opposite_col: &str, - keys: &[String], - ) -> Result> { - use datafusion::prelude::{col, lit}; - - if keys.is_empty() { - return Ok(Vec::new()); - } - let key_list: Vec = - keys.iter().map(|k| lit(k.clone())).collect(); - let filter_expr = col(key_col).in_list(key_list, false); - Self::scan_stream_with( - ds, - Some(&[key_col, opposite_col]), - None, - None, - false, - |scanner| { - scanner.filter_expr(filter_expr); - Ok(()) - }, - ) - .await? - .try_collect() - .await - .map_err(|e| OmniError::Lance(e.to_string())) - } - - /// Metadata-only check (no IO) of whether `scan_edges_by_endpoint` -- a - /// `key_col IN (...)` filter -- on `ds` will be served by the persisted BTREE - /// on `column`, or silently fall back to a full filtered scan.
Mirrors - /// Lance's own decision: scalar indices are disabled for the whole scan if - /// ANY fragment lacks `physical_rows` (lance `dataset/scanner.rs` - /// `create_filter_plan`), and are obviously unused if no BTREE on the - /// column exists. The scan is correct (returns all rows) either way -- this - /// only surfaces the perf cliff so the indexed traversal can warn on it. - pub async fn key_column_index_coverage(ds: &Dataset, column: &str) -> Result { - let Some(field_id) = ds.schema().field(column).map(|field| field.id) else { - return Ok(IndexCoverage::Degraded { - reason: format!("column '{}' not in schema", column), - }); - }; - let indices = ds - .load_indices() - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - let btree = indices - .iter() - .filter(|index| !is_system_index(index)) - .filter(|index| index.fields.len() == 1 && index.fields[0] == field_id) - .find(|index| { - index - .index_details - .as_ref() - .map(|details| details.type_url.ends_with("BTreeIndexDetails")) - .unwrap_or(false) - }); - let Some(btree) = btree else { - return Ok(IndexCoverage::Degraded { - reason: format!("no BTREE index on '{}'", column), - }); - }; - // Same check Lance runs: a fragment missing physical_rows disables - // scalar indices for the entire scan (all-or-nothing). - if ds.fragments().iter().any(|f| f.physical_rows.is_none()) { - return Ok(IndexCoverage::Degraded { - reason: "a fragment is missing physical_rows".to_string(), - }); - } - // An index only covers the fragments it was built over; fragments - // appended afterward (edge-index creation is skipped once a BTREE exists) - // are scanned unindexed. If any CURRENT fragment is absent from the - // index's `fragment_bitmap`, the scan is partly a full scan -- so the - // chooser must not price it as fully indexed. A `None` bitmap means Lance - // can't report coverage; don't over-degrade in that case.
- if let Some(bitmap) = btree.fragment_bitmap.as_ref() { - let uncovered = ds - .fragments() - .iter() - .filter(|f| !bitmap.contains(f.id as u32)) - .count(); - if uncovered > 0 { - return Ok(IndexCoverage::Degraded { - reason: format!( - "{} fragment(s) not covered by the index on '{}'", - uncovered, column - ), - }); - } - } - Ok(IndexCoverage::Indexed) - } - - /// True if any non-system index on `ds` leaves at least one current - /// fragment uncovered, i.e. rows that the index does not yet account for - /// (appended after the index was built, or rewritten by compaction). Such - /// fragments are scanned unindexed until a reindex (`optimize_indices`) - /// folds them in. Returns false when every index covers every fragment, or - /// when the table has no (non-system) indices to optimize. A `None` - /// `fragment_bitmap` means Lance cannot report coverage for that index, so - /// we do not treat it as uncovered (mirrors `key_column_index_coverage`). - /// - /// Used by `optimize` to decide whether an otherwise-already-compacted - /// table still has index work to do. - pub async fn has_unindexed_fragments(ds: &Dataset) -> Result { - let indices = ds - .load_indices() - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - let frag_ids: Vec = ds.fragments().iter().map(|f| f.id as u32).collect(); - for index in indices.iter() { - if is_system_index(index) { - continue; - } - if let Some(bitmap) = index.fragment_bitmap.as_ref() { - if frag_ids.iter().any(|id| !bitmap.contains(*id)) { - return Ok(true); - } - } - } - Ok(false) - } - pub async fn count_rows(&self, ds: &Dataset, filter: Option) -> Result { ds.count_rows(filter) .await @@ -809,16 +559,7 @@ impl TableStore { }) } - /// Legacy inline-commit append: writes fragments AND commits in one - /// call, advancing Lance HEAD as a side effect. Not on the - /// `TableStorage` trait surface -- the staged primitive `stage_append` - /// + `commit_staged` is the engine write path.
This inherent method - /// survives only for in-source recovery test setup, so it is - /// `#[cfg(test)]`-gated: engine code physically cannot call it (which - /// enforces "no new call sites" by construction and silences the - /// dead-code warning the non-test lib build would otherwise emit). - #[cfg(test)] - pub(crate) async fn append_batch( + pub async fn append_batch( &self, dataset_uri: &str, ds: &mut Dataset, @@ -832,8 +573,6 @@ impl TableStore { let params = WriteParams { mode: WriteMode::Append, allow_external_blob_outside_bases: true, - auto_cleanup: None, - skip_auto_cleanup: true, ..Default::default() }; ds.append(reader, Some(params)) @@ -853,8 +592,6 @@ impl TableStore { let params = WriteParams { mode: WriteMode::Append, allow_external_blob_outside_bases: true, - auto_cleanup: None, - skip_auto_cleanup: true, ..Default::default() }; ds.append(reader, Some(params)) @@ -868,8 +605,6 @@ impl TableStore { enable_stable_row_ids: true, data_storage_version: Some(LanceFileVersion::V2_2), allow_external_blob_outside_bases: true, - auto_cleanup: None, - skip_auto_cleanup: true, ..Default::default() }; Dataset::write(reader, dataset_uri, Some(params)) @@ -879,7 +614,139 @@ impl TableStore { } } - pub(crate) async fn delete_where( + pub async fn overwrite_batch( + &self, + dataset_uri: &str, + ds: &mut Dataset, + batch: RecordBatch, + ) -> Result { + ds.truncate_table() + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + self.append_batch(dataset_uri, ds, batch).await + } + + pub async fn overwrite_dataset(dataset_uri: &str, batch: RecordBatch) -> Result { + let reader = arrow_array::RecordBatchIterator::new(vec![Ok(batch.clone())], batch.schema()); + let params = WriteParams { + mode: WriteMode::Overwrite, + enable_stable_row_ids: true, + data_storage_version: Some(LanceFileVersion::V2_2), + allow_external_blob_outside_bases: true, + ..Default::default() + }; + Dataset::write(reader, dataset_uri, Some(params)) + .await + .map_err(|e| 
OmniError::Lance(e.to_string())) + } + + pub async fn merge_insert_batch( + &self, + dataset_uri: &str, + ds: Dataset, + batch: RecordBatch, + key_columns: Vec, + when_matched: WhenMatched, + when_not_matched: WhenNotMatched, + ) -> Result { + if batch.num_rows() == 0 { + return self.table_state(dataset_uri, &ds).await; + } + + // Precondition for the FirstSeen workaround below: every caller of + // this primitive must hand in a source batch that is unique by + // `key_columns`. Without this check, `SourceDedupeBehavior::FirstSeen` + // would silently collapse genuine duplicates instead of erroring. + check_batch_unique_by_keys(&batch, &key_columns, "merge_insert_batch")?; + + // TODO(lance-upstream): MergeInsertBuilder does not accept WriteParams, + // so allow_external_blob_outside_bases cannot be set here. External URI + // blobs via merge_insert (LoadMode::Merge, mutations) are unsupported + // until Lance exposes WriteParams on MergeInsertBuilder. + let ds = Arc::new(ds); + let mut builder = MergeInsertBuilder::try_new(ds, key_columns) + .map_err(|e| OmniError::Lance(e.to_string()))?; + builder.when_matched(when_matched); + builder.when_not_matched(when_not_matched); + // Workaround for a Lance 4.0.x bug class where sequential + // merge_insert calls against rows previously rewritten by + // merge_insert produce a spurious "Ambiguous merge inserts: + // multiple source rows match the same target row on (id = ...)" + // error. Lance's `processed_row_ids: Mutex>` + // (lance-4.0.0 `src/dataset/write/merge_insert.rs:2099`) + // double-processes the same source/target match against + // datasets previously rewritten by merge_insert, and the default + // `SourceDedupeBehavior::Fail` errors on the second insertion. + // `FirstSeen` makes Lance skip the duplicate match instead. + // + // Covers both observed surfaces: + // - PR #98 (sequential `load --mode merge` against same keys). + // - MR-920 (sequential `update T set {f} where x=y` on same row). 
+ // + // Correctness-preserving for OmniGraph because every call path + // that reaches this primitive either pre-dedupes the source batch + // by id, or surfaces a real source dup via the + // `check_batch_unique_by_keys` precondition above (which fires + // before the FirstSeen setter has a chance to silently collapse + // anything): + // - Load path: `enforce_unique_constraints_intra_batch` + // (`loader/mod.rs:1453`) errors on intra-batch `@key` dups. + // - Mutate path: `MutationStaging::finalize` (`exec/staging.rs`) + // accumulates and dedupes by `id`. + // - Branch-merge path: `compute_source_delta` / + // `compute_three_way_delta` (`exec/merge.rs`) walk via + // `OrderedTableCursor` and `push_row` each id at most once. + // So FirstSeen only suppresses the spurious Lance behavior, never + // user data. Pinned by `loader_rejects_intra_batch_duplicate_keys` + // in `tests/consistency.rs` plus the + // `check_batch_unique_by_keys` precondition. + // + // Retire when upstream Lance fixes the bug class. Tracked at + // MR-957; upstream: lance-format/lance#6877. 
+ builder.source_dedupe_behavior(SourceDedupeBehavior::FirstSeen); + let job = builder + .try_build() + .map_err(|e| OmniError::Lance(e.to_string()))?; + + let schema = batch.schema(); + let reader = arrow_array::RecordBatchIterator::new(vec![Ok(batch)], schema); + let (new_ds, _stats) = job + .execute(lance_datafusion::utils::reader_to_stream(Box::new(reader))) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + self.table_state(dataset_uri, &new_ds).await + } + + pub async fn merge_insert_batches( + &self, + dataset_uri: &str, + ds: Dataset, + batches: Vec, + key_columns: Vec, + when_matched: WhenMatched, + when_not_matched: WhenNotMatched, + ) -> Result { + if batches.is_empty() { + return self.table_state(dataset_uri, &ds).await; + } + let batch = if batches.len() == 1 { + batches.into_iter().next().unwrap() + } else { + let schema = batches[0].schema(); + concat_batches(&schema, &batches).map_err(|e| OmniError::Lance(e.to_string()))? + }; + self.merge_insert_batch( + dataset_uri, + ds, + batch, + key_columns, + when_matched, + when_not_matched, + ) + .await + } + + pub async fn delete_where( &self, dataset_uri: &str, ds: &mut Dataset, @@ -957,12 +824,9 @@ impl TableStore { "stage_append called with empty batch".to_string(), )); } - let appended_rows = batch.num_rows() as u64; let params = WriteParams { mode: WriteMode::Append, allow_external_blob_outside_bases: true, - auto_cleanup: None, - skip_auto_cleanup: true, ..Default::default() }; let transaction = InsertBuilder::new(Arc::new(ds.clone())) @@ -970,9 +834,6 @@ impl TableStore { .execute_uncommitted(vec![batch]) .await .map_err(|e| OmniError::Lance(e.to_string()))?; - // Record only after the staging write succeeds, so a failed write does - // not inflate the probe (matches `stage_append_stream`'s ordering). 
- crate::instrumentation::record_stage_append(appended_rows); let mut new_fragments = match &transaction.operation { Operation::Append { fragments } => fragments.clone(), Operation::Overwrite { fragments, .. } => fragments.clone(), @@ -984,7 +845,7 @@ impl TableStore { } }; // Assign real fragment IDs. Lance's `InsertBuilder::execute_uncommitted` - // returns fragments with `id = 0` ("Temporary ID" -- see lance-6.0.1 + // returns fragments with `id = 0` ("Temporary ID" -- see lance-4.0.0 // `dataset/write.rs:1044/1712`); the real assignment happens during // commit via `Transaction::fragments_with_ids`. Because we expose // these fragments to `scan_with_staged` *before* commit, two staged @@ -1013,71 +874,6 @@ impl TableStore { }) } - /// Streaming variant of [`Self::stage_append`]: appends the rows of `source` - /// into `ds` without materializing them in memory. It scans `source` lazily - /// (`scan_stream_for_rewrite`) and hands the stream to Lance's - /// `execute_uncommitted_stream`, which rolls fragments at `max_rows_per_file` - /// -- bounded memory, one Append transaction. This is the substrate-blessed - /// bulk-append path (the same one LanceDB's `Table::add` uses). Identical - /// fragment-id / stable-row-id staging as `stage_append`. - /// - /// TRANSITIONAL caller -- its only caller is the row-level merge append - /// (`publish_adopted_delta`, see `AdoptDelta`), which the fragment-adopt work - /// (Lance #7263/#7185) removes: a fragment graft re-appends no rows. This - /// primitive and `scan_stream_for_rewrite` are then dead unless re-adopted as - /// a general bulk-append path (the `Table::add` shape makes that plausible).
- pub async fn stage_append_stream( - &self, - ds: &Dataset, - source: &Dataset, - prior_stages: &[StagedWrite], - ) -> Result { - let stream = self.scan_stream_for_rewrite(source).await?; - let params = WriteParams { - mode: WriteMode::Append, - allow_external_blob_outside_bases: true, - auto_cleanup: None, - skip_auto_cleanup: true, - ..Default::default() - }; - let transaction = InsertBuilder::new(Arc::new(ds.clone())) - .with_params(&params) - .execute_uncommitted_stream(stream) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - let mut new_fragments = match &transaction.operation { - Operation::Append { fragments } => fragments.clone(), - Operation::Overwrite { fragments, .. } => fragments.clone(), - other => { - return Err(OmniError::manifest_internal(format!( - "stage_append_stream: unexpected Lance operation {:?}", - std::mem::discriminant(other) - ))); - } - }; - let appended_rows: u64 = new_fragments - .iter() - .filter_map(|f| f.physical_rows) - .map(|r| r as u64) - .sum(); - crate::instrumentation::record_stage_append(appended_rows); - // Same commit-time fragment-id / row-id renumbering as `stage_append`. - let next_id_base = ds.manifest.max_fragment_id.unwrap_or(0) as u64 - + 1 - + prior_stages_fragment_count(prior_stages); - assign_fragment_ids(&mut new_fragments, next_id_base); - if ds.manifest.uses_stable_row_ids() { - let prior_rows = prior_stages_row_count(prior_stages)?; - let start_row_id = ds.manifest.next_row_id + prior_rows; - assign_row_id_meta(&mut new_fragments, start_row_id)?; - } - Ok(StagedWrite { - transaction, - new_fragments, - removed_fragment_ids: Vec::new(), - }) - } - /// Stage a merge_insert (upsert): write fragment files describing the /// merge result, return the uncommitted transaction plus the new /// fragments.
The transaction's `Operation::Update` carries the @@ -1105,7 +901,7 @@ impl TableStore { /// Lift path: either a Lance API extension that lets /// `MergeInsertBuilder` accept additional staged fragments, or an /// in-memory pre-merge here that folds prior staged batches into the - /// input stream. See `docs/dev/writes.md`. + /// input stream. See `docs/runs.md`. pub async fn stage_merge_insert( &self, ds: Dataset, @@ -1119,14 +915,12 @@ impl TableStore { "stage_merge_insert called with empty batch".to_string(), )); } - let merged_rows = batch.num_rows() as u64; - // Precondition for the FirstSeen workaround below: every call path that - // reaches stage_merge_insert (load, MutationStaging::finalize, - // branch_merge::publish_rewritten_merge_table) must hand in a source - // batch that is unique by `key_columns`. Without this check, - // `SourceDedupeBehavior::FirstSeen` would silently collapse genuine - // duplicates instead of erroring. + // Precondition for FirstSeen below. See the comment on + // `merge_insert_batch` for why this check is here, not on the caller: + // every call path that reaches stage_merge_insert (load, + // MutationStaging::finalize, branch_merge::publish_rewritten_merge_table) + // must hand in a source batch that is unique by `key_columns`. check_batch_unique_by_keys(&batch, &key_columns, "stage_merge_insert")?; let ds = Arc::new(ds); @@ -1134,21 +928,11 @@ impl TableStore { .map_err(|e| OmniError::Lance(e.to_string()))?; builder.when_matched(when_matched); builder.when_not_matched(when_not_matched); - // Workaround for a Lance bug class where sequential merge_insert calls - // against rows previously rewritten by merge_insert produce a spurious - // "Ambiguous merge inserts: multiple source rows match the same target - // row on (id = ...)" error. 
Lance's `processed_row_ids: - // Mutex>` (lance-6.0.1 `src/dataset/write/merge_insert.rs`) - // double-processes the same source/target match against datasets - // previously rewritten by merge_insert, and the default - // `SourceDedupeBehavior::Fail` errors on the second insertion; FirstSeen - // makes Lance skip the duplicate match instead. Correctness-preserving - // because every call path pre-dedupes the source batch by id or surfaces - // a real source dup via `check_batch_unique_by_keys` above (load: - // `enforce_unique_constraints_intra_batch`; mutate: - // `MutationStaging::finalize`; branch-merge: the `OrderedTableCursor` - // walk in `exec/merge.rs`). Retire when upstream Lance fixes the bug - // class. Tracked at MR-957; upstream: lance-format/lance#6877. + // See `merge_insert_batch` for the FirstSeen rationale. Workaround + // for the Lance 4.0.x bug class where sequential merge_insert / + // update against rows previously rewritten by merge_insert trips + // Lance's `processed_row_ids` HashSet and errors under the default + // `SourceDedupeBehavior::Fail`. Retire when upstream Lance is fixed. builder.source_dedupe_behavior(SourceDedupeBehavior::FirstSeen); let job = builder .try_build() @@ -1160,9 +944,6 @@ impl TableStore { .execute_uncommitted(stream) .await .map_err(|e| OmniError::Lance(e.to_string()))?; - // Record only after the staging write succeeds, so a failed write does - // not inflate the probe (matches `stage_append`/`stage_append_stream`). - crate::instrumentation::record_stage_merge_insert(merged_rows); // Operation::Update { removed_fragment_ids, updated_fragments, new_fragments, .. } — // `new_fragments` are the freshly inserted rows; `updated_fragments` // are rewrites of existing fragments that include both retained and @@ -1208,16 +989,7 @@ impl TableStore { ds: Arc, transaction: Transaction, ) -> Result { - // Skip Lance's auto-cleanup hook on every commit.
OmniGraph owns version - // GC explicitly (optimize.rs::cleanup_all_tables); Lance's hook fires off - // the *dataset's stored* `lance.auto_cleanup.*` config, which graphs - // created before the v7 bump (6.0.1 defaulted auto_cleanup ON) still - // carry — so `WriteParams::auto_cleanup = None` alone does NOT stop it on - // upgraded graphs. Skipping here covers the staged write path (the main - // data path) for new and legacy datasets alike, preventing Lance from - // GC'ing versions the __manifest still pins for snapshots/time-travel. CommitBuilder::new(ds) - .with_skip_auto_cleanup(true) .execute(transaction) .await .map_err(|e| OmniError::Lance(e.to_string())) @@ -1236,53 +1008,40 @@ impl TableStore { /// MR-793 Phase 2: introduces this for the schema_apply rewrite path. /// Lance API verified in `.context/mr-793-design.md` Appendix A.1. pub async fn stage_overwrite(&self, ds: &Dataset, batch: RecordBatch) -> Result { - // `enable_stable_row_ids: true` is defensive — empirically Lance 6.0.1 + if batch.num_rows() == 0 { + return Err(OmniError::manifest_internal( + "stage_overwrite called with empty batch".to_string(), + )); + } + // `enable_stable_row_ids: true` is defensive — empirically Lance 4.0.0 // preserves the source dataset's flag through `Operation::Overwrite` // when WriteParams omits it (pinned by // `stage_overwrite_preserves_stable_row_ids` in tests/staged_writes.rs), - // but setting it explicitly keeps the invariant documented at every Overwrite site + // but setting it explicitly matches the public `overwrite_dataset` + // path and keeps the invariant documented at every Overwrite site // (see docs/storage.md "Stable row IDs"). Setting it on an existing // dataset that was created without stable row IDs is a no-op per // Lance's row-id-lineage spec, so this stays correct for legacy // datasets.
- let (transaction, mut new_fragments) = if batch.num_rows() == 0 { - let schema = LanceSchema::try_from(batch.schema().as_ref()) - .map_err(|e| OmniError::Lance(e.to_string()))?; - let transaction = TransactionBuilder::new( - ds.manifest.version, - Operation::Overwrite { - fragments: Vec::new(), - schema, - config_upsert_values: None, - initial_bases: None, - }, - ) - .build(); - (transaction, Vec::new()) - } else { - let params = WriteParams { - mode: WriteMode::Overwrite, - enable_stable_row_ids: true, - allow_external_blob_outside_bases: true, - auto_cleanup: None, - skip_auto_cleanup: true, - ..Default::default() - }; - let transaction = InsertBuilder::new(Arc::new(ds.clone())) - .with_params(&params) - .execute_uncommitted(vec![batch]) - .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - let new_fragments = match &transaction.operation { - Operation::Overwrite { fragments, .. } => fragments.clone(), - other => { - return Err(OmniError::manifest_internal(format!( - "stage_overwrite: unexpected Lance operation {:?}", - std::mem::discriminant(other) - ))); - } - }; - (transaction, new_fragments) + let params = WriteParams { + mode: WriteMode::Overwrite, + enable_stable_row_ids: true, + allow_external_blob_outside_bases: true, + ..Default::default() + }; + let transaction = InsertBuilder::new(Arc::new(ds.clone())) + .with_params(&params) + .execute_uncommitted(vec![batch]) + .await + .map_err(|e| OmniError::Lance(e.to_string()))?; + let mut new_fragments = match &transaction.operation { + Operation::Overwrite { fragments, .. } => fragments.clone(), + other => { + return Err(OmniError::manifest_internal(format!( + "stage_overwrite: unexpected Lance operation {:?}", + std::mem::discriminant(other) + ))); + } }; // Overwrite REPLACES every committed fragment, and Lance restarts // fragment-ID and row-ID counters at the post-commit version.
@@ -1295,7 +1054,7 @@ impl TableStore { // 2) For stable-row-id datasets, assign row_id_meta starting // at 0 (Overwrite is a fresh-start) so `scan_with_staged` // doesn't hit the "Missing row id meta" panic in - // lance-6.0.1 dataset/rowids.rs:22. + // lance-4.0.0 dataset/rowids.rs:22. assign_fragment_ids(&mut new_fragments, 1); if ds.manifest.uses_stable_row_ids() { assign_row_id_meta(&mut new_fragments, 0)?; @@ -1319,7 +1078,7 @@ impl TableStore { /// `IndexMetadata`; we manually wrap it in `Operation::CreateIndex /// { new_indices, removed_indices }` via the public `TransactionBuilder`, /// replicating the simple (non-segment-commit-path) branch of Lance's - /// `CreateIndexBuilder::execute` (lance-6.0.1 `src/index/create.rs:502-512`). + /// `CreateIndexBuilder::execute` (lance-4.0.0 `src/index/create.rs:502-512`). /// /// `removed_indices` mirrors `execute()` lines 466-476: when the /// build replaces an existing same-named index, those entries are @@ -1328,7 +1087,7 @@ impl TableStore { /// MR-793 Phase 2: scalar index types (BTree, Inverted) are /// stage-able. Vector indices are NOT (segment-commit-path requires /// `build_index_metadata_from_segments` which is `pub(crate)` in - /// lance-6.0.1); see `create_vector_index` and Appendix A.3. + /// lance-4.0.0); see `create_vector_index` and Appendix A.3. pub async fn stage_create_btree_index( &self, ds: &Dataset, @@ -1423,7 +1182,7 @@ impl TableStore { /// committed fragments carry; Lance's optimizer drops them from the /// filtered scan even when their data would match. Staged-fragment /// rows are silently absent from the result. `scanner.use_stats(false)` - /// does not fix this in lance 6.0.1. Callers needing correct filtered + /// does not fix this in lance 4.0.0. 
Callers needing correct filtered /// reads against staged data should use a different strategy — the /// engine's `MutationStaging` accumulator unions in-memory pending /// batches with the committed scan via DataFusion `MemTable` (see @@ -1647,16 +1406,31 @@ impl TableStore { })) } - pub(crate) async fn create_vector_index(&self, ds: &mut Dataset, column: &str) -> Result<()> { + pub async fn create_btree_index(&self, ds: &mut Dataset, columns: &[&str]) -> Result<()> { + let params = ScalarIndexParams::default(); + ds.create_index_builder(columns, IndexType::BTree, &params) + .replace(true) + .await + .map(|_| ()) + .map_err(|e| OmniError::Lance(e.to_string())) + } + + pub async fn create_inverted_index(&self, ds: &mut Dataset, column: &str) -> Result<()> { + let params = InvertedIndexParams::default(); + ds.create_index_builder(&[column], IndexType::Inverted, &params) + .replace(true) + .await + .map(|_| ()) + .map_err(|e| OmniError::Lance(e.to_string())) + } + + pub async fn create_vector_index(&self, ds: &mut Dataset, column: &str) -> Result<()> { let params = lance::index::vector::VectorIndexParams::ivf_flat(1, MetricType::L2); ds.create_index_builder(&[column], IndexType::Vector, &params) .replace(true) .await - .map_err(|e| OmniError::Lance(e.to_string()))?; - // Record only after the index build succeeds, so a failed build does not - // inflate the probe (matches the `stage_*` probes).
- crate::instrumentation::record_create_vector_index(); - Ok(()) + .map(|_| ()) + .map_err(|e| OmniError::Lance(e.to_string())) } pub async fn create_empty_dataset(dataset_uri: &str, schema: &SchemaRef) -> Result { @@ -1685,8 +1459,6 @@ impl TableStore { enable_stable_row_ids: true, data_storage_version: Some(LanceFileVersion::V2_2), allow_external_blob_outside_bases: true, - auto_cleanup: None, - skip_auto_cleanup: true, ..Default::default() }; Dataset::write(reader, dataset_uri, Some(params)) @@ -1736,7 +1508,7 @@ fn prior_stages_fragment_count(prior_stages: &[StagedWrite]) -> u64 { } /// Assign sequential fragment IDs starting at `start_id`. Mirrors Lance's -/// commit-time `Transaction::fragments_with_ids` (lance-6.0.1 +/// commit-time `Transaction::fragments_with_ids` (lance-4.0.0 /// `dataset/transaction.rs:1456`) — fragments produced by /// `InsertBuilder::execute_uncommitted` start with `id = 0` as a temporary /// placeholder; we renumber here so they don't collide with committed @@ -1767,7 +1539,7 @@ fn prior_stages_row_count(prior_stages: &[StagedWrite]) -> Result { /// Assign sequential row IDs to fragments that lack them, starting from /// `start_row_id`. Mirrors the relevant arm of Lance's -/// `Transaction::assign_row_ids` (lance-6.0.1 `dataset/transaction.rs:2682`) +/// `Transaction::assign_row_ids` (lance-4.0.0 `dataset/transaction.rs:2682`) /// for the `row_id_meta = None` case — fragments produced by /// `InsertBuilder::execute_uncommitted` against a stable-row-id dataset. /// @@ -1897,15 +1669,7 @@ async fn scan_pending_batches( filter: Option<&str>, ) -> Result> { let schema = pending_schema.unwrap_or_else(|| pending_batches[0].schema()); - // #283: disable SQL identifier normalization so an unquoted camelCase - // column in `filter` (e.g. `repoName = 'acme'`, emitted unquoted by - // `predicate_to_sql` because the committed Lance scan needs it unquoted) - // is matched case-preserving against the case-sensitive MemTable schema.
- // Without this, DataFusion lowercases `repoName` β†’ `reponame` and fails to - // resolve. Quoted identifiers (the projection list below) are unaffected. - let mut config = datafusion::execution::context::SessionConfig::new(); - config.options_mut().sql_parser.enable_ident_normalization = false; - let ctx = datafusion::execution::context::SessionContext::new_with_config(config); + let ctx = datafusion::execution::context::SessionContext::new(); let mem = datafusion::datasource::MemTable::try_new(schema, vec![pending_batches.to_vec()]) .map_err(|e| OmniError::Lance(e.to_string()))?; ctx.register_table("pending", Arc::new(mem)) @@ -1948,7 +1712,7 @@ fn combine_committed_with_staged(ds: &Dataset, staged: &[StagedWrite]) -> Vec Person { - @unique(src, dst) -} -"#; - -const EDGE_UNIQUE_DATA: &str = r#"{"type":"Person","data":{"name":"Alice"}} -{"type":"Person","data":{"name":"Bob"}} -{"type":"Person","data":{"name":"Carol"}}"#; - -const EDGE_UNIQUE_MUTATIONS: &str = r#" -query add_knows($from: String, $to: String) { - insert Knows { from: $from, to: $to } -} -"#; - const CARDINALITY_SCHEMA: &str = r#" node Person { name: String @key @@ -548,174 +528,6 @@ async fn branch_merge_records_single_latest_commit_with_two_parents() { ); } -// ── P1: commit-DAG coherence on same-branch writes after an external commit ── -// -// `append_commit` takes a new commit's parent from the coordinator's in-memory -// head (commit_graph head_commit, zero storage read), but `commit_all` rebases -// the MANIFEST from a fresh coordinator. So after an external writer advances -// the branch, a same-branch write on a non-refreshed handle commits a fresh -// manifest version yet appends off the stale head β€” forking the commit DAG (the -// new commit and the external commit share a parent). Data is unaffected (the -// manifest is the visibility authority); only commit history is malformed. -// P1 refreshes the commit-graph head before the append, so the parent is the -// true current head. 
These two tests are RED before that fix, GREEN after. - -/// Non-strict insert: the fork is pre-existing (commit_all rebases the manifest -/// regardless of the stale head), independent of Fix 1. -#[tokio::test] -async fn same_branch_insert_after_external_commit_is_linear() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - - // Handle A: a long-lived writer whose coordinator head stays pinned at the - // load commit (C0) — it never refreshes before its own write below. - let mut a = init_and_load(&dir).await; - let c0 = CommitGraph::open(uri) - .await - .unwrap() - .head_commit() - .await - .unwrap() - .unwrap(); - - // External writer B advances main: commit C1, parent C0. - let mut b = Omnigraph::open(uri).await.unwrap(); - mutate_main( - &mut b, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "ext_b")], &[("$age", 30)]), - ) - .await - .unwrap(); - let c1 = CommitGraph::open(uri) - .await - .unwrap() - .head_commit() - .await - .unwrap() - .unwrap(); - assert_eq!( - c1.parent_commit_id.as_deref(), - Some(c0.graph_commit_id.as_str()), - "sanity: B's commit C1 should descend from C0" - ); - - // A writes to main WITHOUT refreshing. A's coordinator still thinks the head - // is C0, so a pre-fix append parents the new commit on C0 instead of C1.
- mutate_main( - &mut a, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "local_a")], &[("$age", 40)]), - ) - .await - .unwrap(); - - let commits = CommitGraph::open(uri) - .await - .unwrap() - .load_commits() - .await - .unwrap(); - let latest = commits.iter().max_by_key(|c| c.manifest_version).unwrap(); - assert_eq!( - latest.parent_commit_id.as_deref(), - Some(c1.graph_commit_id.as_str()), - "A's same-branch write after an external commit must append off the true \ - head C1, not the stale head C0 (commit-DAG fork)" - ); - let c0_children = commits - .iter() - .filter(|c| c.parent_commit_id.as_deref() == Some(c0.graph_commit_id.as_str())) - .count(); - assert_eq!(c0_children, 1, "C0 must have exactly one child; two is the fork"); -} - -/// Strict update after a read: Fix 1's `refresh_manifest_only` makes the read -/// freshen the read-time pin, defeating the strict 409 that used to force a -/// coherent refresh — so the same stale-head append forks strict ops too. -#[tokio::test] -async fn same_branch_update_after_external_commit_and_read_is_linear() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - - // A inserts the row it will later update; this is A's own commit (Ca), so - // A's coordinator head is Ca. - let mut a = init_and_load(&dir).await; - mutate_main( - &mut a, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "target")], &[("$age", 40)]), - ) - .await - .unwrap(); - let ca = CommitGraph::open(uri) - .await - .unwrap() - .head_commit() - .await - .unwrap() - .unwrap(); - - // External writer B advances main: commit Cb, parent Ca.
- let mut b = Omnigraph::open(uri).await.unwrap(); - mutate_main( - &mut b, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "ext_b")], &[("$age", 30)]), - ) - .await - .unwrap(); - let cb = CommitGraph::open(uri) - .await - .unwrap() - .head_commit() - .await - .unwrap() - .unwrap(); - assert_eq!(cb.parent_commit_id.as_deref(), Some(ca.graph_commit_id.as_str())); - - // A reads main: the stale-probe path refreshes A's MANIFEST (via - // refresh_manifest_only) but not its commit-graph head, freshening the - // read-time pin so the strict update below skips its 409. - query_main(&mut a, TEST_QUERIES, "total_people", &params(&[])) - .await - .unwrap(); - - // Strict update, no explicit refresh: pre-fix it appends off the stale head - // Ca instead of Cb. - mutate_main( - &mut a, - MUTATION_QUERIES, - "set_age", - &mixed_params(&[("$name", "target")], &[("$age", 99)]), - ) - .await - .unwrap(); - - let commits = CommitGraph::open(uri) - .await - .unwrap() - .load_commits() - .await - .unwrap(); - let latest = commits.iter().max_by_key(|c| c.manifest_version).unwrap(); - assert_eq!( - latest.parent_commit_id.as_deref(), - Some(cb.graph_commit_id.as_str()), - "a strict update after an external commit and a local read must append \ - off the true head Cb, not the stale head Ca" - ); - let ca_children = commits - .iter() - .filter(|c| c.parent_commit_id.as_deref() == Some(ca.graph_commit_id.as_str())) - .count(); - assert_eq!(ca_children, 1, "Ca must have exactly one child; two is the fork"); -} - #[tokio::test] async fn branch_merge_records_actor_on_latest_commit() { let dir = tempfile::tempdir().unwrap(); @@ -1307,87 +1119,6 @@ async fn branch_merge_reports_unique_violation_conflict() { } } -/// Regression for the MR-983 follow-up: the branch-merge path must enforce an -/// edge composite `@unique(src, dst)` as a true composite key, consistent with -/// the intake path. Two branches inserting the *same* (src, dst) pair must -/// conflict on merge.
-#[tokio::test] -async fn branch_merge_reports_composite_unique_violation_conflict() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut main = init_db_from_schema_and_data(&dir, EDGE_UNIQUE_SCHEMA, EDGE_UNIQUE_DATA).await; - main.branch_create("feature").await.unwrap(); - - let mut feature = Omnigraph::open(uri).await.unwrap(); - - mutate_main( - &mut main, - EDGE_UNIQUE_MUTATIONS, - "add_knows", - &params(&[("$from", "Alice"), ("$to", "Bob")]), - ) - .await - .unwrap(); - - mutate_branch( - &mut feature, - "feature", - EDGE_UNIQUE_MUTATIONS, - "add_knows", - &params(&[("$from", "Alice"), ("$to", "Bob")]), - ) - .await - .unwrap(); - - let err = main.branch_merge("feature", "main").await.unwrap_err(); - match err { - OmniError::MergeConflicts(conflicts) => { - assert!(conflicts.iter().any(|conflict| { - conflict.table_key == "edge:Knows" - && conflict.kind == MergeConflictKind::UniqueViolation - })); - } - other => panic!("expected merge conflicts, got {other:?}"), - } -} - -/// Sibling to the above: pairs sharing `src` but differing on `dst` are unique -/// on the (src, dst) tuple and must merge cleanly. Guards against the composite -/// degrading back into a single-field `@unique(src)` on the merge path.
-#[tokio::test] -async fn branch_merge_allows_distinct_composite_unique_pairs() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut main = init_db_from_schema_and_data(&dir, EDGE_UNIQUE_SCHEMA, EDGE_UNIQUE_DATA).await; - main.branch_create("feature").await.unwrap(); - - let mut feature = Omnigraph::open(uri).await.unwrap(); - - mutate_main( - &mut main, - EDGE_UNIQUE_MUTATIONS, - "add_knows", - &params(&[("$from", "Alice"), ("$to", "Bob")]), - ) - .await - .unwrap(); - - mutate_branch( - &mut feature, - "feature", - EDGE_UNIQUE_MUTATIONS, - "add_knows", - &params(&[("$from", "Alice"), ("$to", "Carol")]), - ) - .await - .unwrap(); - - main.branch_merge("feature", "main") - .await - .expect("distinct (src, dst) pairs are unique on the composite and must merge cleanly"); - assert_eq!(count_rows(&main, "edge:Knows").await, 2); -} - #[tokio::test] async fn branch_merge_reports_cardinality_violation_conflict() { let dir = tempfile::tempdir().unwrap(); diff --git a/crates/omnigraph/tests/composite_flow.rs b/crates/omnigraph/tests/composite_flow.rs index dd41310..6c720da 100644 --- a/crates/omnigraph/tests/composite_flow.rs +++ b/crates/omnigraph/tests/composite_flow.rs @@ -294,19 +294,21 @@ async fn composite_flow_canonical_lifecycle() { ); // ───────────────────────────────────────────────────────────────── - // Step 10: optimize the post-merge graph — verify compaction is - // published to the manifest (so the manifest pin tracks the compacted - // Lance HEAD), indices stay valid and queryable, and a post-optimize - // strict write commits. + // Step 10: optimize the post-merge graph — verify indices stay + // valid and queryable.
// - // This step used to carry a "Known limitation": `optimize_all_tables` - // ran Lance `compact_files` without publishing the new version to - // `__manifest`, so the manifest pin lagged the Lance HEAD and the next - // strict write / schema apply failed with `ExpectedVersionMismatch` - // ("stale view … refresh and retry") — so post-optimize mutations were - // deliberately omitted here. optimize now publishes the compacted - // version, and this flow exercises exactly that previously-failing - // write below. + // **Known limitation**: `optimize_all_tables` calls Lance + // `compact_files` directly — it advances per-table Lance HEAD + // without updating the omnigraph `__manifest` pin. After optimize, + // the next writer's expected_table_versions captures the + // pre-optimize manifest pin, but the publisher's pre-check reads + // a higher version from the manifest dataset (because some other + // path — possibly schema-state recovery on reopen — wrote a newer + // __manifest row). The `ExpectedVersionMismatch` is benign + // (re-issuing the mutation after a snapshot refresh succeeds), but + // a composite test cannot reliably exercise post-optimize mutations + // until that path is investigated. Coverage of post-optimize + // mutations is left to a focused optimize+cleanup integration test. // ───────────────────────────────────────────────────────────────── let optimize_stats = db.optimize().await.unwrap(); assert!( @@ -329,28 +331,6 @@ async fn composite_flow_canonical_lifecycle() { "row counts unchanged by optimize" ); - // A strict update on a compacted table is exactly the write that - // failed with "stale view" before optimize published its compaction. - // It must now commit (Alice is one of the seed Persons; an update - // leaves the row count at 6).
- let post_optimize_update = mutate_main( - &mut db, - MUTATION_QUERIES, - "set_age", - &mixed_params(&[("$name", "Alice")], &[("$age", 41)]), - ) - .await - .expect("post-optimize strict update must commit — optimize published the manifest"); - assert_eq!( - post_optimize_update.affected_nodes, 1, - "post-optimize update must affect exactly Alice" - ); - assert_eq!( - count_rows(&db, "node:Person").await, - 6, - "an update must not change the Person row count" - ); - // ───────────────────────────────────────────────────────────────── // Step 11: cleanup — keep last 10 versions, only purge versions // older than 1 hour. With this small test, we have well under 10 @@ -393,27 +373,14 @@ async fn composite_flow_canonical_lifecycle() { branches, ); - // Final exercise — full read AND write path works post-reopen, - // post-cleanup. (The post-cleanup mutation was previously omitted - // pending resolution of the optimize-vs-manifest-pin interaction in - // Step 10; that is now fixed, so a strict write here must commit.) + // Final query exercise — full read path works post-reopen, + // post-cleanup. Post-cleanup mutation is omitted here pending + // resolution of the optimize-vs-manifest-pin interaction documented + // in Step 10.
let final_total = query_main(&mut db, TEST_QUERIES, "total_people", &ParamMap::default()) .await .unwrap(); assert!(!final_total.batches().is_empty()); - - let post_reopen_update = mutate_main( - &mut db, - MUTATION_QUERIES, - "set_age", - &mixed_params(&[("$name", "Alice")], &[("$age", 42)]), - ) - .await - .expect("post-reopen, post-cleanup strict update must commit"); - assert_eq!( - post_reopen_update.affected_nodes, 1, - "post-reopen update must affect exactly Alice" - ); } /// Cross-handle sequence that exercises operations after a schema_apply diff --git a/crates/omnigraph/tests/consistency.rs b/crates/omnigraph/tests/consistency.rs index aab0114..26517db 100644 --- a/crates/omnigraph/tests/consistency.rs +++ b/crates/omnigraph/tests/consistency.rs @@ -126,7 +126,7 @@ async fn load_merge_upserts_existing_and_inserts_new() { /// source batch had one row per key. /// /// Triggered by Lance's `processed_row_ids: Mutex>` -/// (lance-6.0.1 `src/dataset/write/merge_insert.rs:2099`) double- +/// (lance-4.0.0 `src/dataset/write/merge_insert.rs:2099`) double- /// processing the same source/target match against datasets previously /// rewritten by merge_insert. Worked around by opting /// `MergeInsertBuilder` into `SourceDedupeBehavior::FirstSeen` in @@ -188,7 +188,7 @@ node Thing { /// /// Defense in depth: /// 1. The loader's `enforce_unique_constraints_intra_batch` -/// (`loader/mod.rs:1442`), invoked unconditionally on any node type +/// (`loader/mod.rs:1453`), invoked unconditionally on any node type /// with a `@key`, errors on intra-batch duplicate `@key` values at /// intake — pinned by this test across every `LoadMode`. /// 2. The `check_batch_unique_by_keys` precondition at the top of @@ -229,122 +229,6 @@ node Thing { } } -/// Regression for MR-983: a node-level composite `@unique(a, b)` must be -/// enforced as a true composite key, not degraded into independent -/// single-field checks.
Pre-fix, `unique_property_names_for_node` flattened -/// every constraint group into one property list, so `@unique(source, -/// external_id)` was enforced as `@unique(source)` *and* `@unique(external_id)` -/// — rejecting rows that were unique on the composite key and naming only the -/// first field in the error. -#[tokio::test] -async fn loader_enforces_composite_unique_as_composite_key() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let schema = r#" -node ExternalID { - slug: String @key - source: String @index - external_id: String @index - @unique(source, external_id) -} -"#; - let mut db = Omnigraph::init(uri, schema).await.unwrap(); - - // Same `source`, different `external_id` → unique on the composite key. - // This is the exact repro from MR-983 and must be accepted. - let composite_ok = r#"{"type":"ExternalID","data":{"slug":"a","source":"whatsapp","external_id":"+E.164"}} -{"type":"ExternalID","data":{"slug":"b","source":"whatsapp","external_id":"pn:12345"}} -"#; - load_jsonl(&mut db, composite_ok, LoadMode::Overwrite) - .await - .expect("rows unique on the composite (source, external_id) must be accepted"); - assert_eq!(count_rows(&db, "node:ExternalID").await, 2); - - // Both composite columns equal → genuine violation. The error must name - // the whole composite, not just the first field. - let composite_dupe = r#"{"type":"ExternalID","data":{"slug":"c","source":"whatsapp","external_id":"dup"}} -{"type":"ExternalID","data":{"slug":"d","source":"whatsapp","external_id":"dup"}} -"#; - let err = load_jsonl(&mut db, composite_dupe, LoadMode::Overwrite) - .await - .unwrap_err(); - let msg = err.to_string(); - // Columns are canonicalized to sorted order in the catalog, so the - // message reads `(external_id, source)`; assert order-agnostically that - // both composite columns are named (not just the first, as pre-fix).
- assert!( - msg.contains("@unique violation") - && msg.contains("source") - && msg.contains("external_id"), - "composite violation must name both columns (got: {msg})" - ); -} - -/// Guard: the intake path (load/insert/update) and the branch-merge path must -/// derive the same composite `@unique(a, b)` key, so a pair of rows unique on -/// the tuple is accepted by BOTH. Both paths now key on the tuple itself (no -/// separator), so a value containing any byte — including the `|` that an -/// earlier merge-path join used as its separator — can't forge a collision. -/// `("x|y", "z")` and `("x", "y|z")` are distinct tuples and must survive a -/// load-on-branch then merge without a phantom `UniqueViolation`. This pins the -/// cross-path consistency against any future drift in the shared keying. -#[tokio::test] -async fn composite_unique_key_is_consistent_across_intake_and_merge() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let schema = r#" -node Item { - slug: String @key - a: String @index - b: String @index - @unique(a, b) -} -"#; - let insert_item = r#" -query insert_item($slug: String, $a: String, $b: String) { - insert Item { slug: $slug, a: $a, b: $b } -} -"#; - let main = Omnigraph::init(uri, schema).await.unwrap(); - main.branch_create("feature").await.unwrap(); - - // Two rows unique on the composite (a, b), where `a`/`b` carry a literal - // `|`. Distinct under a tuple key; identical (`x|y|z`) under a `|`-join.
- let feature = Omnigraph::open(uri).await.unwrap(); - feature - .mutate( - "feature", - insert_item, - "insert_item", - &params(&[("$slug", "r1"), ("$a", "x|y"), ("$b", "z")]), - ) - .await - .expect("intake must accept the first composite-unique row"); - feature - .mutate( - "feature", - insert_item, - "insert_item", - &params(&[("$slug", "r2"), ("$a", "x"), ("$b", "y|z")]), - ) - .await - .expect("intake must accept the second composite-unique row (distinct on the tuple)"); - - // The merge re-validates uniqueness over the adopted source rows. Both - // rows are unique on (a, b), so this must merge cleanly with no phantom - // conflict — intake and merge must key the tuple identically. - let merge_result = feature.branch_merge("feature", "main").await; - assert!( - merge_result.is_ok(), - "rows unique on the composite (a, b) must merge cleanly; \ - intake and merge must key the tuple the same way (got: {:?})", - merge_result.err() - ); - - let reopened = Omnigraph::open(uri).await.unwrap(); - assert_eq!(count_rows(&reopened, "node:Item").await, 2); -} - /// Canary for the upstream Lance gap that the `FirstSeen` workaround /// in `table_store.rs` masks. The bug class is "Window 2": load → /// indices built explicitly → merge → merge. Even with the engine diff --git a/crates/omnigraph/tests/end_to_end.rs b/crates/omnigraph/tests/end_to_end.rs index ea11d0e..a0fdb0e 100644 --- a/crates/omnigraph/tests/end_to_end.rs +++ b/crates/omnigraph/tests/end_to_end.rs @@ -1933,87 +1933,3 @@ query docs_with_tag($tag: String) { "contains-pushdown should return exactly the rows whose tags list contains 'red'" ); } - -// ─── Maintenance in the full lifecycle: optimize (compaction) ──────────────── - -/// `optimize` (Lance compaction) is part of a realistic graph lifecycle: it -/// advances the Lance HEAD and publishes the compacted version to the manifest.
-/// The rest of the flow must keep working across that boundary — reads observe -/// the compacted data, strict updates (which check Lance HEAD == manifest -/// version) still commit, inserts still commit, and the state survives a reopen -/// (the open-time recovery sweep finds no leftover drift). Before optimize -/// published its compaction, the manifest lagged the Lance HEAD here and the -/// post-optimize update below failed with "stale view ... refresh and retry". -#[tokio::test] -async fn full_flow_optimize_then_query_update_and_reopen() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let mut db = init_and_load(&dir).await; - - // Build several Person fragments so compaction has something to merge. - for (name, age) in [("Eve", 40), ("Frank", 41), ("Grace", 42)] { - mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", name)], &[("$age", age)]), - ) - .await - .unwrap(); - } - - let stats = db.optimize().await.unwrap(); - assert!( - stats.iter().any(|s| s.committed), - "a multi-fragment table should have compacted in this flow" - ); - - // Reads observe the compacted data. - let qr = query_main( - &mut db, - TEST_QUERIES, - "get_person", - &params(&[("$name", "Alice")]), - ) - .await - .unwrap(); - assert_eq!(qr.num_rows(), 1); - - // Strict update after optimize commits (previously failed with "stale view" - // because the manifest lagged the compacted Lance HEAD). - let upd = mutate_main( - &mut db, - MUTATION_QUERIES, - "set_age", - &mixed_params(&[("$name", "Alice")], &[("$age", 31)]), - ) - .await - .unwrap(); - assert_eq!(upd.affected_nodes, 1); - - // Insert after optimize also commits.
- mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Ivan")], &[("$age", 50)]), - ) - .await - .unwrap(); - assert_eq!(count_rows(&db, "node:Person").await, 8); // 4 seed + Eve/Frank/Grace + Ivan - - // State survives a reopen — the recovery sweep runs and finds no drift. - drop(db); - let reopened = Omnigraph::open(&uri).await.unwrap(); - assert_eq!(count_rows(&reopened, "node:Person").await, 8); - let alice = reopened - .entity_at_target(ReadTarget::branch("main"), "node:Person", "Alice") - .await - .unwrap() - .unwrap(); - assert_eq!( - alice["age"], - serde_json::json!(31), - "Alice's post-optimize age update must persist across reopen" - ); -} diff --git a/crates/omnigraph/tests/failpoint_names_guard.rs b/crates/omnigraph/tests/failpoint_names_guard.rs deleted file mode 100644 index df8fc1c..0000000 --- a/crates/omnigraph/tests/failpoint_names_guard.rs +++ /dev/null @@ -1,96 +0,0 @@ -//! Guard: failpoint names must come from the compile-checked `names` catalog -//! (`omnigraph::failpoints::names` / `omnigraph_cluster::failpoints::names`), -//! never bare string literals. -//! -//! The `names` consts give compile-time typo protection only if every call -//! site uses them. A bare `maybe_fail("typo.literal")` still compiles (the -//! arg is `&str`), so a typo there would silently never fire. This -//! source-walk closes that gap by construction — the same defense-in-depth -//! shape as `forbidden_apis.rs`. Add a new failpoint by adding its const to -//! the catalog first; this guard then forces every call site to reference it. - -use std::path::{Path, PathBuf}; - -/// Call-site prefixes whose first argument must be a `names::` constant. The -/// check is whitespace/newline-tolerant (it skips past the open paren to the -/// first non-whitespace token), so wrapping the call across lines cannot hide -/// a literal — a per-line `contains` scan would miss -/// `park_first(\n "name",\n)`.
-const CALL_PREFIXES: &[&str] = &[ - "maybe_fail(", - "ScopedFailPoint::new(", - "ScopedFailPoint::with_callback(", - "park_first(", -]; - -/// 1-based line number of `byte_off` within `contents`. -fn line_of(contents: &str, byte_off: usize) -> usize { - contents[..byte_off].bytes().filter(|&b| b == b'\n').count() + 1 -} - -fn manifest_dir() -> PathBuf { - PathBuf::from(env!("CARGO_MANIFEST_DIR")) -} - -/// Production call sites live under each crate's `src`; test call sites live -/// in the two failpoint integration binaries. This guard file is deliberately -/// not in the set (it names the patterns as literals itself). -fn files_to_scan() -> Vec<PathBuf> { - let engine = manifest_dir(); - let cluster = engine.join("../omnigraph-cluster"); - let mut out = Vec::new(); - collect_rs(&engine.join("src"), &mut out); - collect_rs(&cluster.join("src"), &mut out); - out.push(engine.join("tests/failpoints.rs")); - out.push(cluster.join("tests/failpoints.rs")); - out -} - -fn collect_rs(dir: &Path, out: &mut Vec<PathBuf>) { - let Ok(entries) = std::fs::read_dir(dir) else { - return; - }; - for entry in entries.flatten() { - let path = entry.path(); - if path.is_dir() { - collect_rs(&path, out); - } else if path.extension().is_some_and(|e| e == "rs") { - out.push(path); - } - } -} - -#[test] -fn failpoint_names_use_the_compile_checked_catalog() { - let mut violations = Vec::new(); - for file in files_to_scan() { - let Ok(contents) = std::fs::read_to_string(&file) else { - continue; - }; - for prefix in CALL_PREFIXES { - let mut from = 0; - while let Some(rel) = contents[from..].find(prefix) { - let after_open = from + rel + prefix.len(); - // Skip whitespace (incl. newlines) after the open paren. If the - // first argument token is a `"`, it's a literal failpoint name - // — across a line break or not.
- if contents[after_open..].trim_start().starts_with('"') { - violations.push(format!( - "{}:{}: literal failpoint name at `{}` — use a `names::` const", - file.display(), - line_of(&contents, from + rel), - prefix.trim_end_matches('('), - )); - } - from = after_open; - } - } - } - assert!( - violations.is_empty(), - "failpoint names must reference the compile-checked \ - `omnigraph::failpoints::names::*` (or `omnigraph_cluster::failpoints::names::*`) \ - constants, not string literals — a literal typo would silently never fire:\n{}", - violations.join("\n") - ); -} diff --git a/crates/omnigraph/tests/failpoints.rs b/crates/omnigraph/tests/failpoints.rs index cbd57be..5ea71c5 100644 --- a/crates/omnigraph/tests/failpoints.rs +++ b/crates/omnigraph/tests/failpoints.rs @@ -3,20 +3,15 @@ mod helpers; use fail::FailScenario; +use futures::FutureExt; use omnigraph::db::Omnigraph; -use omnigraph::error::{ManifestErrorKind, OmniError}; use omnigraph::failpoints::ScopedFailPoint; -use omnigraph::failpoints::names; -use omnigraph::loader::LoadMode; -use serial_test::serial; use helpers::recovery::{ FollowUpMutation, RecoveryExpectation, TableExpectation, assert_post_recovery_invariants, branch_head_commit_id, single_sidecar_operation_id, }; -use helpers::{ - MUTATION_QUERIES, collect_column_strings, mixed_params, mutate_main, read_table, version_main, -}; +use helpers::{MUTATION_QUERIES, mixed_params, mutate_main, version_main}; const SCHEMA_V1: &str = "node Person { name: String @key }\n"; const SCHEMA_V2_ADDED_TYPE: &str = @@ -32,13 +27,12 @@ fn node_table_uri(root: &str, type_name: &str) -> String { } #[tokio::test] -#[serial] async fn branch_create_failpoint_triggers() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); let db = Omnigraph::init(uri, helpers::TEST_SCHEMA).await.unwrap(); - let _failpoint = ScopedFailPoint::new(names::BRANCH_CREATE_AFTER_MANIFEST_BRANCH_CREATE, "return"); + let
_failpoint = ScopedFailPoint::new("branch_create.after_manifest_branch_create", "return"); let err = db.branch_create("feature").await.unwrap_err(); assert!( @@ -47,564 +41,14 @@ async fn branch_create_failpoint_triggers() { ); } -// Branch delete flips the manifest authority first, then reclaims the per-table -// forks best-effort. A failure during that reclaim (here, the -// `branch_delete.before_table_cleanup` failpoint, standing in for a transient -// object-store error) must NOT fail the call: the branch is already gone, and -// `cleanup` reconciles the stranded fork. The branch name is reusable after. -#[tokio::test] -#[serial] -async fn branch_delete_partial_failure_converges_via_cleanup() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let mut main = helpers::init_and_load(&dir).await; - - main.branch_create("feature").await.unwrap(); - let mut feature = Omnigraph::open(&uri).await.unwrap(); - helpers::mutate_branch( - &mut feature, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - .unwrap(); - drop(feature); - - let person_uri = node_table_uri(&uri, "Person"); - { - let ds = lance::Dataset::open(&person_uri).await.unwrap(); - assert!( - ds.list_branches().await.unwrap().contains_key("feature"), - "precondition: the owned table fork exists before delete" - ); - } - - // Inject a failure during per-table cleanup, AFTER the manifest authority - // flip. branch_delete must still succeed (best-effort reclaim). - { - let _fp = ScopedFailPoint::new(names::BRANCH_DELETE_BEFORE_TABLE_CLEANUP, "return"); - main.branch_delete("feature").await.expect( - "branch_delete is best-effort after the manifest flip: a cleanup-step \ - failure must not fail the call", - ); - } - - // Authority flipped: the branch is gone. 
- assert_eq!(main.branch_list().await.unwrap(), vec!["main".to_string()]); - - // The eager reclaim failed, so the orphan is stranded until cleanup. - { - let ds = lance::Dataset::open(&person_uri).await.unwrap(); - assert!( - ds.list_branches().await.unwrap().contains_key("feature"), - "failed eager reclaim should leave the orphan for cleanup to reconcile" - ); - } - - // cleanup converges: the orphan is reclaimed. - main.cleanup(omnigraph::db::CleanupPolicyOptions { - keep_versions: Some(1), - older_than: None, - }) - .await - .unwrap(); - { - let ds = lance::Dataset::open(&person_uri).await.unwrap(); - assert!( - !ds.list_branches().await.unwrap().contains_key("feature"), - "cleanup should reconcile the orphaned fork away" - ); - } - - // The name is reusable after cleanup reclaims the orphan. - main.branch_create("feature").await.unwrap(); - let mut feature2 = Omnigraph::open(&uri).await.unwrap(); - helpers::mutate_branch( - &mut feature2, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Frank")], &[("$age", 41)]), - ) - .await - .unwrap(); -} - -// Reusing a branch name whose delete left an orphaned fork (before `cleanup` -// reconciles it) must SELF-HEAL on the next write β€” the write reclaims the -// manifest-unreferenced fork and re-forks, rather than wedging with "incomplete -// prior delete; run cleanup". (This test was the inverse before the fork-as- -// idempotent-reconcile fix; its flip is the signal the bug class is closed.) 
-#[tokio::test] -#[serial] -async fn recreate_over_orphaned_fork_self_heals_without_cleanup() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let main = helpers::init_and_load(&dir).await; - - main.branch_create("feature").await.unwrap(); - let mut feature = Omnigraph::open(&uri).await.unwrap(); - helpers::mutate_branch( - &mut feature, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - .unwrap(); - drop(feature); - - // Partial delete: leaves the Person fork orphaned (cleanup not yet run). - { - let _fp = ScopedFailPoint::new(names::BRANCH_DELETE_BEFORE_TABLE_CLEANUP, "return"); - main.branch_delete("feature").await.unwrap(); - } - - // Recreate the name and write to the previously-forked table WITHOUT a - // cleanup in between. The write must self-heal the stale orphan fork. - main.branch_create("feature").await.unwrap(); - let mut feature2 = Omnigraph::open(&uri).await.unwrap(); - helpers::mutate_branch( - &mut feature2, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Frank")], &[("$age", 41)]), - ) - .await - .expect("recreate-over-orphan write must self-heal, not require cleanup"); - - // The recreated branch forks FRESH from main: the deleted branch's Eve is - // gone and only the new Frank is added on top of main's seed. A count of - // main + 2 would mean Eve resurrected from the stale fork (the bug). - let main_people = helpers::count_rows(&main, "node:Person").await; - let feature_people = helpers::count_rows_branch(&feature2, "feature", "node:Person").await; - assert_eq!( - feature_people, - main_people + 1, - "self-healed feature must fork fresh from main (+Frank only); \ - main={main_people}, feature={feature_people} (main+2 β‡’ Eve resurrected)" - ); -} - -// The write-path orphan reclaim shares the same fresh-authority classifier as -// cleanup. 
If that classifier is Indeterminate (transient read on a live -// branch), the write must return a clear retryable authority-read conflict and -// leave the ref in place. It must not squeeze the ambiguity through -// ExpectedVersionMismatch with expected == actual, which lies about the cause. -#[tokio::test] -#[serial] -async fn recreate_over_orphaned_fork_reports_indeterminate_authority_read() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let db = helpers::init_and_load(&dir).await; - db.branch_create("feature").await.unwrap(); - - let person_uri = node_table_uri(&uri, "Person"); - { - let mut ds = lance::Dataset::open(&person_uri).await.unwrap(); - let base = ds.version().version; - ds.create_branch("feature", base, None).await.unwrap(); - } - - let row = r#"{"type":"Person","data":{"name":"Grace","age":37}}"#; - { - let _fp = ScopedFailPoint::new(names::CLASSIFY_FRESH_READ, "return"); - let err = db - .load_as("feature", None, row, LoadMode::Merge, None) - .await - .expect_err("indeterminate authority read must fail retryably"); - - match &err { - OmniError::Manifest(manifest) => { - assert_eq!(manifest.kind, ManifestErrorKind::Conflict); - assert!( - manifest.details.is_none(), - "indeterminate authority read is not an expected-version mismatch: {manifest:?}" - ); - } - other => panic!("expected manifest conflict, got {other:?}"), - } - let message = err.to_string(); - assert!( - message.contains("could not verify") - && message.contains("fresh manifest authority was unavailable") - && message.contains("refresh and retry"), - "error should name the unavailable authority read, got: {message}" - ); - assert!( - !message.contains("expected manifest table version"), - "indeterminate authority must not be reported as a version mismatch: {message}" - ); - - let ds = lance::Dataset::open(&person_uri).await.unwrap(); - assert!( - 
ds.list_branches().await.unwrap().contains_key("feature"), - "ambiguous orphan status must leave the fork for a later retry" - ); - } - - db.load_as("feature", None, row, LoadMode::Merge, None) - .await - .expect("when fresh authority is available, the orphan is reclaimed and write converges"); -} - -// cleanup is the guaranteed convergence backstop, so one table's transient -// failure must not abort the whole sweep. Inject a one-shot version-GC failure -// for a single table and assert: cleanup still succeeds, the failure is -// surfaced per-table in the returned stats, and the independent reconcile pass -// still reclaimed an orphan. -#[tokio::test] -#[serial] -async fn cleanup_isolates_single_table_failure() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let mut db = helpers::init_and_load(&dir).await; - - // Forge an orphaned fork on the Person table (a reconcile target). - let person_uri = node_table_uri(&uri, "Person"); - { - let mut ds = lance::Dataset::open(&person_uri).await.unwrap(); - let base = ds.version().version; - ds.create_branch("ghost", base, None).await.unwrap(); - } - - // One table's version GC fails once; the sweep must isolate it. - let _fp = ScopedFailPoint::new(names::CLEANUP_TABLE_GC, "1*return"); - let stats = db - .cleanup(omnigraph::db::CleanupPolicyOptions { - keep_versions: Some(1), - older_than: None, - }) - .await - .expect("a single table's GC failure must not abort cleanup"); - - let errored = stats.iter().filter(|s| s.error.is_some()).count(); - assert_eq!( - errored, 1, - "exactly one table's GC failure should be surfaced in stats, got {errored}" - ); - assert!( - stats.len() >= 4, - "every node+edge table should still appear in the stats" - ); - - // The reconcile pass is independent of the GC failure, so the orphan is gone. 
- { - let ds = lance::Dataset::open(&person_uri).await.unwrap(); - assert!( - !ds.list_branches().await.unwrap().contains_key("ghost"), - "reconcile should reclaim the orphan despite the GC failure" - ); - } -} - -// Companion to the version-GC isolation test, exercising the OTHER cleanup -// loop: a force-delete failure while reconciling one orphaned fork must be -// isolated (logged, not propagated) so the sweep continues, and a later -// cleanup converges. This is the loop the Devin finding was about. -#[tokio::test] -#[serial] -async fn cleanup_isolates_reconcile_failure() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let mut db = helpers::init_and_load(&dir).await; - - // Forge an orphaned fork the reconcile pass will try to reclaim. - let person_uri = node_table_uri(&uri, "Person"); - { - let mut ds = lance::Dataset::open(&person_uri).await.unwrap(); - let base = ds.version().version; - ds.create_branch("ghost", base, None).await.unwrap(); - } - - // Inject a one-shot failure into the reconcile force-delete. The sweep must - // not abort. - { - let _fp = ScopedFailPoint::new(names::CLEANUP_RECONCILE_FORK, "1*return"); - db.cleanup(omnigraph::db::CleanupPolicyOptions { - keep_versions: Some(1), - older_than: None, - }) - .await - .expect("a reconcile force-delete failure must not abort cleanup"); - } - // The blocked orphan is still present (the failure was isolated, not retried). - { - let ds = lance::Dataset::open(&person_uri).await.unwrap(); - assert!( - ds.list_branches().await.unwrap().contains_key("ghost"), - "the orphan whose reclaim was injected-to-fail should remain" - ); - } - // A second cleanup with no injected failure converges. 
- db.cleanup(omnigraph::db::CleanupPolicyOptions { - keep_versions: Some(1), - older_than: None, - }) - .await - .unwrap(); - { - let ds = lance::Dataset::open(&person_uri).await.unwrap(); - assert!( - !ds.list_branches().await.unwrap().contains_key("ghost"), - "the second cleanup should reconcile the orphan" - ); - } -} - -// The cleanup reconciler must reclaim orphaned commit-graph branches, not just -// per-table forks. A delete whose best-effort commit-graph reclaim fails leaves -// a commit-graph orphan; the next cleanup must drop it. -#[tokio::test] -#[serial] -async fn cleanup_reclaims_orphaned_commit_graph_branch() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let mut db = helpers::init_and_load(&dir).await; - - db.branch_create("feature").await.unwrap(); - // Delete, failing the commit-graph reclaim β†’ commit-graph "feature" orphan - // (manifest branch gone, commit-graph branch left behind). - { - let _fp = ScopedFailPoint::new(names::BRANCH_DELETE_BEFORE_COMMIT_GRAPH_RECLAIM, "return"); - db.branch_delete("feature").await.unwrap(); - } - - let commits_uri = format!("{}/_graph_commits.lance", uri.trim_end_matches('/')); - { - let ds = lance::Dataset::open(&commits_uri).await.unwrap(); - assert!( - ds.list_branches().await.unwrap().contains_key("feature"), - "precondition: the commit-graph branch should be orphaned after the failed reclaim" - ); - } - - db.cleanup(omnigraph::db::CleanupPolicyOptions { - keep_versions: Some(1), - older_than: None, - }) - .await - .unwrap(); - - { - let ds = lance::Dataset::open(&commits_uri).await.unwrap(); - assert!( - !ds.list_branches().await.unwrap().contains_key("feature"), - "cleanup should reclaim the orphaned commit-graph branch" - ); - } -} - -// `classify_fork_ref` returns `Indeterminate` when the fresh-authority read -// fails on a LIVE branch β€” and a destructive caller must SKIP, never delete, on -// that ambiguity. 
Here the reconciler has a genuine origin-2 orphan candidate -// (a manifest-unreferenced Person fork on the live `feature` branch), but the -// `classify.fresh_read` failpoint makes the fresh re-check fail: cleanup must -// leave the ref in place (cannot confirm it is unreferenced), then reclaim it on -// the next run once the read succeeds. This pins the Indeterminate arm and the -// don't-destroy-on-ambiguity rule end-to-end through cleanup. -#[tokio::test] -#[serial] -async fn reconcile_skips_fork_when_fresh_recheck_is_unavailable_then_converges() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let mut db = helpers::init_and_load(&dir).await; - db.branch_create("feature").await.unwrap(); - - // Forge a manifest-unreferenced Person fork on the live `feature` branch β€” - // a genuine orphan the reconciler would normally reclaim. - let person_uri = node_table_uri(&uri, "Person"); - { - let mut ds = lance::Dataset::open(&person_uri).await.unwrap(); - let base = ds.version().version; - ds.create_branch("feature", base, None).await.unwrap(); - assert!( - ds.list_branches().await.unwrap().contains_key("feature"), - "precondition: forged orphan fork present" - ); - } - - // With the fresh re-check failing, the fork's status is Indeterminate (the - // branch is live but unreadable) β†’ cleanup must SKIP it, not delete. - { - let _fp = ScopedFailPoint::new(names::CLASSIFY_FRESH_READ, "return"); - db.cleanup(omnigraph::db::CleanupPolicyOptions { - keep_versions: Some(1), - older_than: None, - }) - .await - .unwrap(); - let ds = lance::Dataset::open(&person_uri).await.unwrap(); - assert!( - ds.list_branches().await.unwrap().contains_key("feature"), - "reconcile must NOT delete a fork whose fresh re-check is inconclusive" - ); - } - - // Read succeeds now β†’ cleanup confirms the orphan and reclaims it (converges). 
- db.cleanup(omnigraph::db::CleanupPolicyOptions { - keep_versions: Some(1), - older_than: None, - }) - .await - .unwrap(); - { - let ds = lance::Dataset::open(&person_uri).await.unwrap(); - assert!( - !ds.list_branches().await.unwrap().contains_key("feature"), - "next cleanup (fresh read available) must reclaim the confirmed orphan" - ); - } -} - -// A branch_delete whose best-effort commit-graph reclaim fails leaves a -// commit-graph "zombie" branch. Recreating that name must heal the zombie and -// succeed (branch_create force-deletes a stale commit-graph ref since the -// manifest branch is created fresh), instead of dying on the leftover ref. -#[tokio::test] -#[serial] -async fn branch_create_recreates_over_commit_graph_zombie() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let db = Omnigraph::init(dir.path().to_str().unwrap(), helpers::TEST_SCHEMA) - .await - .unwrap(); - - db.branch_create("feature").await.unwrap(); - { - // Fail the best-effort commit-graph reclaim β†’ commit-graph "feature" - // zombie survives the delete (manifest authority still flips). - let _fp = ScopedFailPoint::new(names::BRANCH_DELETE_BEFORE_COMMIT_GRAPH_RECLAIM, "return"); - db.branch_delete("feature").await.unwrap(); - } - assert_eq!(db.branch_list().await.unwrap(), vec!["main".to_string()]); - - db.branch_create("feature") - .await - .expect("branch_create should heal the zombie commit-graph branch and succeed"); - assert!( - db.branch_list() - .await - .unwrap() - .contains(&"feature".to_string()) - ); -} - -// branch_create is authority-then-derived: if the derived commit-graph branch -// cannot be created, the manifest branch (the authority) must be rolled back so -// the branch does not half-exist. The existing failpoint fires right after the -// manifest create, standing in for any post-authority failure. 
-#[tokio::test] -#[serial] -async fn branch_create_rolls_back_manifest_on_commit_graph_failure() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let db = Omnigraph::init(dir.path().to_str().unwrap(), helpers::TEST_SCHEMA) - .await - .unwrap(); - - let err = { - let _fp = ScopedFailPoint::new(names::BRANCH_CREATE_AFTER_MANIFEST_BRANCH_CREATE, "return"); - db.branch_create("feature").await.unwrap_err() - }; - assert!( - !db.branch_list() - .await - .unwrap() - .contains(&"feature".to_string()), - "branch_create must roll back the manifest branch when the derived \ - commit-graph branch fails, got error: {err}" - ); -} - -// A fork collision must be classified by the manifest authority, not by Lance -// branch versions. When a concurrent first-write legitimately wins the fork -// race, the loser sees a version mismatch β€” but that is a stale snapshot, not -// an orphan, so it must be a retryable "refresh and retry", never a misleading -// "run cleanup". -// -// Ordering is made deterministic (no fixed sleeps) via the shared rendezvous: -// it parks the first arrival (writer A) at the fork point until released; later -// arrivals (writer B) fall through. The test waits on the reached condition, -// lets B win and commit the fork, then releases A. 
#[tokio::test(flavor = "multi_thread")] -#[serial] -async fn fork_collision_with_live_concurrent_fork_is_retryable() { - let _scenario = FailScenario::setup(); - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let main = helpers::init_and_load(&dir).await; - main.branch_create("feature").await.unwrap(); - - let rv = helpers::failpoint::Rendezvous::park_first(names::FORK_BEFORE_CLASSIFY); - - let uri_a = uri.clone(); - let writer_a = tokio::spawn(async move { - let mut a = Omnigraph::open(&uri_a).await.unwrap(); - helpers::mutate_branch( - &mut a, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - }); - - // Wait until A is parked at the fork point. - rv.wait_until_reached().await; - - // B wins the fork and commits it. - let mut b = Omnigraph::open(&uri).await.unwrap(); - helpers::mutate_branch( - &mut b, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Frank")], &[("$age", 41)]), - ) - .await - .unwrap(); - - // Release A; it resumes, re-reads the manifest, and sees the fork is live. 
- rv.release(); - let err = writer_a - .await - .unwrap() - .expect_err("A's stale-snapshot fork should be a retryable conflict"); - - let msg = err.to_string(); - assert!( - !msg.contains("cleanup"), - "a live concurrent fork must not be misclassified as an orphan, got: {msg}" - ); - assert!( - msg.contains("refresh and retry") || msg.contains("expected manifest table version"), - "expected a retryable stale-view error, got: {msg}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -#[serial] async fn graph_publish_failpoint_triggers_before_commit_append() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); let mut db = Omnigraph::init(dir.path().to_str().unwrap(), helpers::TEST_SCHEMA) .await .unwrap(); - let _failpoint = ScopedFailPoint::new(names::GRAPH_PUBLISH_BEFORE_COMMIT_APPEND, "return"); + let _failpoint = ScopedFailPoint::new("graph_publish.before_commit_append", "return"); let err = mutate_main( &mut db, @@ -626,7 +70,6 @@ async fn graph_publish_failpoint_triggers_before_commit_append() { // state. 
#[tokio::test] -#[serial] async fn schema_apply_pre_commit_crash_rolls_forward_via_sidecar() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); @@ -634,7 +77,7 @@ async fn schema_apply_pre_commit_crash_rolls_forward_via_sidecar() { { let db = Omnigraph::init(&uri, SCHEMA_V1).await.unwrap(); - let _failpoint = ScopedFailPoint::new(names::SCHEMA_APPLY_AFTER_STAGING_WRITE, "return"); + let _failpoint = ScopedFailPoint::new("schema_apply.after_staging_write", "return"); let err = db.apply_schema(SCHEMA_V2_ADDED_TYPE).await.unwrap_err(); assert!( err.to_string() @@ -670,7 +113,6 @@ async fn schema_apply_pre_commit_crash_rolls_forward_via_sidecar() { } #[tokio::test] -#[serial] async fn schema_apply_recovers_post_commit_crash() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); @@ -678,7 +120,7 @@ async fn schema_apply_recovers_post_commit_crash() { { let db = Omnigraph::init(&uri, SCHEMA_V1).await.unwrap(); - let _failpoint = ScopedFailPoint::new(names::SCHEMA_APPLY_AFTER_MANIFEST_COMMIT, "return"); + let _failpoint = ScopedFailPoint::new("schema_apply.after_manifest_commit", "return"); let err = db.apply_schema(SCHEMA_V2_ADDED_TYPE).await.unwrap_err(); assert!( err.to_string() @@ -696,7 +138,6 @@ async fn schema_apply_recovers_post_commit_crash() { } #[tokio::test] -#[serial] async fn schema_apply_recovers_partial_rename() { // Construct a partial-rename state: _schema.pg has been renamed in // (matching v2), but _schema.ir.json.staging and __schema_state.json.staging @@ -761,7 +202,6 @@ async fn schema_apply_recovers_partial_rename() { /// Continuous in-process recovery (no restart needed between failure /// and recovery) is the goal of a future background reconciler. 
#[tokio::test] -#[serial] async fn recovery_rolls_forward_after_finalize_publisher_failure() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); @@ -771,7 +211,7 @@ async fn recovery_rolls_forward_after_finalize_publisher_failure() { // Setup: trigger the residual. { let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); + let _failpoint = ScopedFailPoint::new("mutation.post_finalize_pre_publisher", "return"); // The mutation's finalize completes (commit_staged advances Lance // HEAD on node:Person AND writes a `__recovery/{ulid}.json` @@ -848,114 +288,12 @@ async fn recovery_rolls_forward_after_finalize_publisher_failure() { ); } -/// Regression for iss-schema-apply-reopen-recovery-race: the open-time -/// recovery sweep's roll-forward must CONVERGE (not fatally error the open) -/// when a concurrent writer advances the manifest past the sidecar's pin -/// during the classifyβ†’publish window. -/// -/// Two concurrent `Omnigraph::open` sweeps race the same pending sidecar. -/// One is parked at `recovery.before_roll_forward_publish` (after it has -/// classified `RolledPastExpected`, before its publish CAS); the other falls -/// through, rolls the sidecar forward (manifest v β†’ v+1), and deletes it. The -/// parked sweep then loses its publish CAS at the now-stale `expected = v`. -/// The manifest already reached the sidecar's goal, so this is convergence, -/// not a logical conflict β€” the open must succeed, not panic with -/// `ExpectedVersionMismatch`. -#[tokio::test(flavor = "multi_thread", worker_threads = 4)] -#[serial] -async fn open_sweep_roll_forward_converges_when_manifest_advances_concurrently() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - // Setup: leave one pending sidecar (node:Person at Lance v+1, manifest v). 
- { - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - let _failpoint = - ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); - mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - .unwrap_err(); - } - assert_eq!( - std::fs::read_dir(dir.path().join("__recovery")) - .unwrap() - .count(), - 1, - "exactly one pending sidecar must persist for the sweep to roll forward" - ); - - // Park the FIRST sweep to reach the publish window; later arrivals fall - // through. wait_until_reached gates the second open so it is guaranteed - // to be the one that converges the sidecar. - let rv = helpers::failpoint::Rendezvous::park_first( - names::RECOVERY_BEFORE_ROLL_FORWARD_PUBLISH, - ); - - let uri_parked = uri.clone(); - let parked_open = tokio::spawn(async move { Omnigraph::open(&uri_parked).await }); - rv.wait_until_reached().await; - - // A concurrent open rolls the sidecar forward (manifest v β†’ v+1) and - // deletes it, advancing the manifest past the parked sweep's pin. - let converging_open = Omnigraph::open(&uri) - .await - .expect("the second open's sweep should roll the sidecar forward and succeed"); - assert_eq!( - helpers::count_rows(&converging_open, "node:Person").await, - 1, - "the converging open must publish the rolled-forward Person row" - ); - - // Release the parked sweep: its publish CAS finds the manifest already at - // the goal. It must converge, not fail the open. - rv.release(); - parked_open - .await - .expect("the parked open task must not panic") - .expect( - "the open-time sweep must converge when the manifest already reached \ - the sidecar's goal, not fail the open with ExpectedVersionMismatch", - ); - - // The sidecar is gone and the graph is readable and consistent. 
- let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 0, - "the sidecar must be gone after both sweeps converge" - ); - } - let db = Omnigraph::open(&uri).await.unwrap(); - assert_eq!(helpers::count_rows(&db, "node:Person").await, 1); - - // Exactly one RolledForward audit row for this recovery event: the loser's - // convergence path must NOT append a duplicate once the winner already - // recorded the audit and deleted the sidecar (append-idempotent per - // operation_id). Two rows here would be the duplicate-audit regression. - let kinds = helpers::recovery::recovery_audit_kinds(dir.path()).await; - assert_eq!( - kinds.len(), - 1, - "exactly one recovery audit row expected after concurrent convergence, got {kinds:?}" - ); -} - -#[tokio::test(flavor = "multi_thread")] -#[serial] +#[tokio::test] async fn inline_delete_conflict_writes_sidecar_before_rejecting() { - use std::sync::Arc; - let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap().to_string(); - let db = Arc::new(helpers::init_and_load(&dir).await); + let db = helpers::init_and_load(&dir).await; let pre_snapshot = db .snapshot_of(omnigraph::db::ReadTarget::branch("main")) @@ -965,37 +303,39 @@ async fn inline_delete_conflict_writes_sidecar_before_rejecting() { let person_uri = node_table_uri(&uri, "Person"); { - // Park the delete at the primary-delete point. The concurrent update - // then lands deterministically before the delete resumes, so the - // delete's manifest CAS is guaranteed stale -- no retry loop, no sleep.
- let rv = helpers::failpoint::Rendezvous::park_first( - names::MUTATION_DELETE_NODE_PRE_PRIMARY_DELETE, + let _pause_delete = + ScopedFailPoint::new("mutation.delete_node_pre_primary_delete", "pause"); + let delete_params = helpers::params(&[("$name", "Alice")]); + let delete = db.mutate("main", MUTATION_QUERIES, "remove_person", &delete_params); + tokio::pin!(delete); + + let mut concurrent_update_succeeded = false; + for _ in 0..50 { + if delete.as_mut().now_or_never().is_some() { + panic!("delete mutation completed before primary-delete failpoint was released"); + } + let mut concurrent = Omnigraph::open_read_only(&uri).await.unwrap(); + if mutate_main( + &mut concurrent, + MUTATION_QUERIES, + "set_age", + &mixed_params(&[("$name", "Bob")], &[("$age", 26)]), + ) + .await + .is_ok() + { + concurrent_update_succeeded = true; + break; + } + tokio::time::sleep(std::time::Duration::from_millis(20)).await; + } + assert!( + concurrent_update_succeeded, + "concurrent update must land while delete is paused" ); + fail::remove("mutation.delete_node_pre_primary_delete"); - let del_db = Arc::clone(&db); - let delete = tokio::spawn(async move { - let delete_params = helpers::params(&[("$name", "Alice")]); - del_db - .mutate("main", MUTATION_QUERIES, "remove_person", &delete_params) - .await - }); - - rv.wait_until_reached().await; - - // Concurrent update lands while the delete is parked. 
- let mut concurrent = Omnigraph::open_read_only(&uri).await.unwrap(); - mutate_main( - &mut concurrent, - MUTATION_QUERIES, - "set_age", - &mixed_params(&[("$name", "Bob")], &[("$age", 26)]), - ) - .await - .expect("concurrent update must land while delete is paused"); - - rv.release(); - - let err = delete.await.unwrap().unwrap_err(); + let err = delete.await.unwrap_err(); assert!( err.to_string().contains("stale view of 'node:Person'") || err.to_string().contains("ExpectedVersionMismatch") @@ -1027,7 +367,6 @@ async fn inline_delete_conflict_writes_sidecar_before_rejecting() { } #[tokio::test] -#[serial] async fn recovery_rolls_forward_load_on_feature_branch() { use omnigraph::loader::LoadMode; @@ -1058,7 +397,7 @@ async fn recovery_rolls_forward_load_on_feature_branch() { .table_version; feature_parent_commit_id = branch_head_commit_id(dir.path(), "feature").await.unwrap(); - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); + let _failpoint = ScopedFailPoint::new("mutation.post_finalize_pre_publisher", "return"); let err = db .load( "feature", @@ -1123,78 +462,6 @@ async fn recovery_rolls_forward_load_on_feature_branch() { } #[tokio::test] -#[serial] -async fn recovery_rolls_forward_load_overwrite() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let operation_id; - let parent_commit_id; - - { - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - load_jsonl(&mut db, helpers::TEST_DATA, LoadMode::Overwrite) - .await - .unwrap(); - parent_commit_id = branch_head_commit_id(dir.path(), "main").await.unwrap(); - - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); - let err = db - .load( - "main", - r#"{"type":"Person","data":{"name":"OverwriteLoad","age":41}} -"#, - LoadMode::Overwrite, - ) - .await - 
.unwrap_err(); - assert!( - err.to_string() - .contains("injected failpoint triggered: mutation.post_finalize_pre_publisher"), - "unexpected error: {err}" - ); - operation_id = single_sidecar_operation_id(dir.path()); - } - - let db = Omnigraph::open(&uri).await.unwrap(); - assert_eq!( - helpers::count_rows(&db, "node:Person").await, - 1, - "overwrite row must be visible after recovery rolls the load forward" - ); - drop(db); - - assert_post_recovery_invariants( - dir.path(), - &operation_id, - RecoveryExpectation::RolledForward { - tables: vec![ - TableExpectation::main("node:Person") - .expected_recovery_parent_commit_id(parent_commit_id) - .follow_up_mutation(FollowUpMutation::new( - "main", - MUTATION_QUERIES, - "insert_person", - mixed_params(&[("$name", "AfterOverwriteLoad")], &[("$age", 42)]), - )), - ], - }, - ) - .await - .unwrap(); - - let db = Omnigraph::open(&uri).await.unwrap(); - assert_eq!( - helpers::count_rows(&db, "node:Person").await, - 2, - "follow-up mutation must succeed after overwrite load recovery" - ); -} - -#[tokio::test] -#[serial] async fn recovery_rolls_forward_ensure_indices_on_feature_branch() { use lance::index::DatasetIndexExt; use omnigraph::loader::{LoadMode, load_jsonl}; @@ -1269,7 +536,7 @@ async fn recovery_rolls_forward_ensure_indices_on_feature_branch() { { let _failpoint = - ScopedFailPoint::new(names::ENSURE_INDICES_POST_PHASE_B_PRE_MANIFEST_COMMIT, "return"); + ScopedFailPoint::new("ensure_indices.post_phase_b_pre_manifest_commit", "return"); let err = db.ensure_indices_on("feature").await.unwrap_err(); assert!( err.to_string().contains( @@ -1337,7 +604,6 @@ async fn recovery_rolls_forward_ensure_indices_on_feature_branch() { /// on the same handle succeeds without restart and without /// ExpectedVersionMismatch. 
#[tokio::test] -#[serial] async fn refresh_runs_roll_forward_recovery_in_process() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); @@ -1347,7 +613,7 @@ async fn refresh_runs_roll_forward_recovery_in_process() { // Setup: trigger the residual (sidecar persists; manifest unchanged). { - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); + let _failpoint = ScopedFailPoint::new("mutation.post_finalize_pre_publisher", "return"); let err = mutate_main( &mut db, MUTATION_QUERIES, @@ -1408,1232 +674,6 @@ async fn refresh_runs_roll_forward_recovery_in_process() { assert_eq!(helpers::count_rows(&db, "node:Person").await, 2); } -/// The long-lived-process contract for `load`: a Phase B → Phase C -/// failure (per-table `commit_staged` advanced Lance HEAD, manifest -/// publish did not land, sidecar persists) must not wedge subsequent -/// loads on the same engine handle. This is the server shape -- `POST -/// /ingest` calls `load_as` on a shared handle with no reopen between -/// requests -- so the follow-up load must heal the sidecar-covered -/// drift in-process: no restart, no explicit `refresh()`, no -/// `omnigraph repair`. -#[tokio::test] -#[serial] -async fn load_after_finalize_publisher_failure_heals_without_reopen() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - - // Failed multi-table load: Person + Company + WorksAt all run - // commit_staged (Lance HEAD advances on three tables), then the - // publisher is wedged before the manifest commit.
- { - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); - let err = load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -{"type":"Person","data":{"name":"Bob","age":25}} -{"type":"Company","data":{"name":"Acme"}} -{"edge":"WorksAt","from":"Alice","to":"Acme"} -"#, - LoadMode::Merge, - ) - .await - .unwrap_err(); - assert!( - err.to_string() - .contains("injected failpoint triggered: mutation.post_finalize_pre_publisher"), - "unexpected error: {err}" - ); - let recovery_dir = dir.path().join("__recovery"); - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 1, - "exactly one sidecar must persist after the finalize failure" - ); - } - - // Follow-up load on the SAME handle, touching the drifted tables. - // Must succeed without manual intervention. - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Carol","age":41}} -{"type":"Company","data":{"name":"Globex"}} -"#, - LoadMode::Merge, - ) - .await - .expect( - "a follow-up load on the same handle must heal sidecar-covered \ - drift in-process instead of demanding repair/restart", - ); - - // Both batches are visible: the first load rolled forward, the - // second landed normally on top of it. - assert_eq!(helpers::count_rows(&db, "node:Person").await, 3); - assert_eq!(helpers::count_rows(&db, "node:Company").await, 2); - assert_eq!(helpers::count_rows(&db, "edge:WorksAt").await, 1); - - // The sidecar was consumed by the in-process roll-forward. 
- let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 0, - "sidecar must be consumed by the in-process roll-forward" - ); - } -} - -/// Phase A storage-fault contract: a sidecar PUT failure (S3 PutObject / -/// fs write, injected at `recovery.sidecar_write`) must abort the load -/// BEFORE any Lance HEAD advances -- no sidecar, no drift, nothing to -/// recover -- and the same handle must write normally once the fault -/// clears (a transient storage error never wedges the graph). -#[tokio::test] -#[serial] -async fn sidecar_write_failure_aborts_load_with_no_head_advance() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - - let person_uri = node_table_uri(&uri, "Person"); - let pre_head = lance::Dataset::open(&person_uri) - .await - .unwrap() - .version() - .version; - - { - let _failpoint = ScopedFailPoint::new(names::RECOVERY_SIDECAR_WRITE, "return"); - let err = load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -{"type":"Company","data":{"name":"Acme"}} -"#, - LoadMode::Merge, - ) - .await - .unwrap_err(); - assert!( - err.to_string() - .contains("injected failpoint triggered: recovery.sidecar_write"), - "unexpected error: {err}" - ); - } - - // Phase A ordering: the sidecar write precedes the first - // commit_staged, so the failed load left no sidecar and moved no - // Lance HEAD -- manifest and HEAD agree, nothing to recover.
- let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 0, - "a Phase A put failure must not leave a sidecar" - ); - } - let post_head = lance::Dataset::open(&person_uri) - .await - .unwrap() - .version() - .version; - assert_eq!( - pre_head, post_head, - "a Phase A put failure must abort before any Lance HEAD advance" - ); - let manifest_pin = db - .snapshot_of(omnigraph::db::ReadTarget::branch("main")) - .await - .unwrap() - .entry("node:Person") - .unwrap() - .table_version; - assert_eq!(manifest_pin, post_head, "no drift after a Phase A abort"); - assert_eq!(helpers::count_rows(&db, "node:Person").await, 0); - - // Fault cleared: the same handle writes normally -- no wedge, no - // recovery required. - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -{"type":"Company","data":{"name":"Acme"}} -"#, - LoadMode::Merge, - ) - .await - .expect("a transient sidecar put failure must not wedge later writes"); - assert_eq!(helpers::count_rows(&db, "node:Person").await, 1); - assert_eq!(helpers::count_rows(&db, "node:Company").await, 1); -} - -/// Real-backend coverage of the sidecar lifecycle: the same-handle heal -/// scenario on an S3-compatible store, exercising sidecar put / list / -/// delete through the S3 object-store backend instead of the -/// local filesystem backend. Skips unless `OMNIGRAPH_S3_TEST_BUCKET` is set -/// (same gate as `s3_storage.rs`); CI runs it against RustFS.
-#[tokio::test] -#[serial] -async fn s3_load_recovers_after_publisher_failure_without_reopen() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let Some(uri) = helpers::s3_test_graph_uri("failpoints") else { - eprintln!( - "skipping s3_load_recovers_after_publisher_failure_without_reopen: \ - OMNIGRAPH_S3_TEST_BUCKET is not set" - ); - return; - }; - - let _scenario = FailScenario::setup(); - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - - // Failed load: commit_staged lands on S3, manifest publish does not; - // the sidecar PUT went through the S3 adapter. - { - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); - let err = load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -{"type":"Company","data":{"name":"Acme"}} -"#, - LoadMode::Merge, - ) - .await - .err() - .expect("finalize failpoint must fail the load"); - assert!( - err.to_string() - .contains("injected failpoint triggered: mutation.post_finalize_pre_publisher"), - "unexpected error: {err}" - ); - } - - // Same-handle follow-up load: the entry heal LISTs __recovery/ on - // S3, rolls the sidecar forward, DELETEs it, and the write lands. - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Bob","age":25}} -"#, - LoadMode::Merge, - ) - .await - .expect("the same-handle heal must converge on an S3-backed graph"); - - assert_eq!(helpers::count_rows(&db, "node:Person").await, 2); - assert_eq!(helpers::count_rows(&db, "node:Company").await, 1); - - // Reopen cross-check: nothing left for the open-time sweep, state - // converged (the heal consumed the sidecar on S3). 
- drop(db); - let db = Omnigraph::open(&uri).await.unwrap(); - assert_eq!(helpers::count_rows(&db, "node:Person").await, 2); -} - -/// Storage-fault contract for the recovery AUDIT write (injected at -/// `recovery.record_audit`): a failure after the roll-forward's manifest -/// publish aborts that recovery attempt loudly and keeps the sidecar; -/// re-entry detects the already-published manifest (stale-sidecar path), -/// records exactly one `RolledForward` audit row, and converges -- the -/// documented retry tolerance in `record_audit`'s contract, exercised -/// end-to-end through a real injected failure. -#[tokio::test] -#[serial] -async fn record_audit_failure_after_roll_forward_converges_on_next_write() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - - // Pending sidecar with real drift. - { - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -"#, - LoadMode::Merge, - ) - .await - .err() - .expect("finalize failpoint must fail the load"); - } - - // The next write's heal rolls forward (manifest publish lands) but - // the audit write fails -- the write must fail loudly and the sidecar - // must survive for the retry.
- { - let _failpoint = ScopedFailPoint::new(names::RECOVERY_RECORD_AUDIT, "return"); - let err = load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Bob","age":25}} -"#, - LoadMode::Merge, - ) - .await - .err() - .expect("an audit write failure mid-heal must fail the write"); - assert!( - err.to_string() - .contains("injected failpoint triggered: recovery.record_audit"), - "unexpected error: {err}" - ); - let recovery_dir = dir.path().join("__recovery"); - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 1, - "the sidecar must survive an audit write failure so the retry can record it" - ); - } - - // Fault cleared: the next write converges -- stale-sidecar audit - // recovery (manifest already advanced) + the write itself. - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Carol","age":41}} -"#, - LoadMode::Merge, - ) - .await - .expect("recovery must converge once the audit fault clears"); - - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!(std::fs::read_dir(&recovery_dir).unwrap().count(), 0); - } - // Alice (rolled forward) + Carol (clean). Bob's write failed before - // staging anything -- the heal error aborted his load at entry. - assert_eq!(helpers::count_rows(&db, "node:Person").await, 2); - // Exactly one audit row despite two recovery attempts: the first - // attempt's audit failed before any row landed; the retry recorded - // the roll-forward once.
- let audit_uri = format!( - "{}/_graph_commit_recoveries.lance", - uri.trim_end_matches('/') - ); - let audit_rows = lance::Dataset::open(&audit_uri) - .await - .expect("audit dataset exists after the retried recovery") - .count_rows(None) - .await - .unwrap(); - assert_eq!(audit_rows, 1, "exactly one recovery audit row"); -} - -/// Storage-fault contract for the `__recovery/` LIST (S3 ListObjectsV2, -/// injected at `recovery.sidecar_list`): every consumer fails loudly -- -/// the write-entry heal fails the write, the open-time sweep fails the -/// open -- rather than silently skipping recovery over a pending sidecar -/// (which would be consumer tolerance of drift). Once the fault clears, -/// open recovers normally. -#[tokio::test] -#[serial] -async fn sidecar_list_failure_fails_write_and_open_loudly_then_clears() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - - // Pending sidecar via the usual finalize → publisher failure. - { - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); - let err = load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -"#, - LoadMode::Merge, - ) - .await - .unwrap_err(); - assert!( - err.to_string() - .contains("injected failpoint triggered: mutation.post_finalize_pre_publisher"), - "unexpected error: {err}" - ); - let recovery_dir = dir.path().join("__recovery"); - assert_eq!(std::fs::read_dir(&recovery_dir).unwrap().count(), 1); - } - - let _failpoint = ScopedFailPoint::new(names::RECOVERY_SIDECAR_LIST, "return"); - - // Write-entry heal: the list failure surfaces as the write's error -- - // no silent skip that would proceed over the pending sidecar.
- let err = load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Bob","age":25}} -"#, - LoadMode::Merge, - ) - .await - .unwrap_err(); - assert!( - err.to_string() - .contains("injected failpoint triggered: recovery.sidecar_list"), - "the write-entry heal must surface a list failure loudly; got: {err}" - ); - - // Open-time sweep: a fresh ReadWrite open fails on the same fault. - drop(db); - let err = Omnigraph::open(&uri) - .await - .err() - .expect("open must fail while the sidecar list fault is active"); - assert!( - err.to_string() - .contains("injected failpoint triggered: recovery.sidecar_list"), - "the open-time sweep must surface a list failure loudly; got: {err}" - ); - - // Fault cleared: open recovers the pending sidecar normally. - drop(_failpoint); - let db = Omnigraph::open(&uri).await.unwrap(); - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 0, - "open after the fault clears must recover the sidecar" - ); - } - assert_eq!(helpers::count_rows(&db, "node:Person").await, 1); -} - -/// Phase D storage-fault contract: a sidecar DELETE failure (S3 -/// DeleteObject, injected at `recovery.sidecar_delete`) after a -/// successful manifest publish must NOT fail the user's write -- the -/// data is durable and visible. The stale sidecar it leaves behind is -/// consumed by the next write's entry heal (attributed `RolledForward` -/// audit row), not by an operator.
-#[tokio::test] -#[serial] -async fn sidecar_delete_failure_keeps_write_success_and_next_write_heals() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - - { - let _failpoint = ScopedFailPoint::new(names::RECOVERY_SIDECAR_DELETE, "return"); - // The load itself must succeed: commit_staged + manifest publish - // landed; only the Phase D cleanup failed (swallowed + logged). - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -"#, - LoadMode::Merge, - ) - .await - .expect("a Phase D delete failure must not fail a write that already published"); - assert_eq!(helpers::count_rows(&db, "node:Person").await, 1); - let recovery_dir = dir.path().join("__recovery"); - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 1, - "the swallowed delete leaves a stale sidecar behind" - ); - } - - // Fault cleared: the next write's entry heal consumes the stale - // sidecar (manifest pin already caught up -- the stale-sidecar - // roll-forward audit path) and the write lands. - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Bob","age":25}} -"#, - LoadMode::Merge, - ) - .await - .expect("a stale sidecar from a failed Phase D delete must not block later writes"); - - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 0, - "the stale sidecar must be consumed by the next write's heal" - ); - } - assert_eq!(helpers::count_rows(&db, "node:Person").await, 2); -} - -/// Phase A storage-fault contract for branch_merge -- the multi-table -/// writer where sidecar-before-commit ordering matters most.
A sidecar -/// PUT failure must abort the merge before any target-table HEAD moves; -/// retrying after the fault clears merges cleanly. -#[tokio::test] -#[serial] -async fn sidecar_write_failure_aborts_branch_merge_with_no_head_advance() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -"#, - LoadMode::Append, - ) - .await - .unwrap(); - - db.branch_create("feature").await.unwrap(); - // Diverge BOTH sides so Person is a RewriteMerged candidate (the - // merge path that pins a recovery sidecar; an unchanged target would - // adopt source state without one). - helpers::mutate_branch( - &mut db, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - .unwrap(); - helpers::mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Mallory")], &[("$age", 35)]), - ) - .await - .unwrap(); - - let person_uri = node_table_uri(&uri, "Person"); - let pre_head = lance::Dataset::open(&person_uri) - .await - .unwrap() - .version() - .version; - - { - let _failpoint = ScopedFailPoint::new(names::RECOVERY_SIDECAR_WRITE, "return"); - let err = db.branch_merge("feature", "main").await.unwrap_err(); - assert!( - err.to_string() - .contains("injected failpoint triggered: recovery.sidecar_write"), - "unexpected error: {err}" - ); - } - - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 0, - "a Phase A put failure must not leave a sidecar" - ); - } - let post_head = lance::Dataset::open(&person_uri) - .await - .unwrap() - .version() - .version; - assert_eq!( - pre_head, post_head, - "a Phase A put 
failure must abort the merge before any target \ - Lance HEAD advance" - ); - assert_eq!(helpers::count_rows(&db, "node:Person").await, 2); - - // Fault cleared: the merge lands cleanly. - db.branch_merge("feature", "main") - .await - .expect("a transient sidecar put failure must not wedge the merge"); - assert_eq!(helpers::count_rows(&db, "node:Person").await, 3); -} - -/// Same contract as -/// `load_after_finalize_publisher_failure_heals_without_reopen`, for the -/// mutation entry point: after a failed mutation leaves a sidecar, the -/// next mutation on the same handle heals it in-process -- no explicit -/// `refresh()` (which `refresh_runs_roll_forward_recovery_in_process` -/// covers), no reopen. -#[tokio::test] -#[serial] -async fn mutation_after_finalize_publisher_failure_heals_without_reopen() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - - { - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); - let err = mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - .unwrap_err(); - assert!( - err.to_string() - .contains("injected failpoint triggered: mutation.post_finalize_pre_publisher"), - "unexpected error: {err}" - ); - let recovery_dir = dir.path().join("__recovery"); - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 1, - "exactly one sidecar must persist after the finalize failure" - ); - } - - // Follow-up mutation on the SAME handle, same table. No refresh, no - // reopen -- the write entry point heals the drift itself.
- mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Frank")], &[("$age", 33)]), - ) - .await - .expect( - "a follow-up mutation on the same handle must heal sidecar-covered \ - drift in-process instead of demanding repair/restart", - ); - - // Eve rolled forward, Frank landed normally. - assert_eq!(helpers::count_rows(&db, "node:Person").await, 2); - - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 0, - "sidecar must be consumed by the in-process roll-forward" - ); - } -} - -/// Same heal contract as the load/mutation variants, for the schema -/// apply entry point: a pending roll-forward-eligible sidecar (here -/// from a failed load) must be healed in-process before the migration -/// runs, so a long-lived handle can evolve the schema without a -/// restart after a Phase B → Phase C failure. -#[tokio::test] -#[serial] -async fn schema_apply_after_finalize_publisher_failure_heals_without_reopen() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - - { - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); - let err = load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -{"type":"Company","data":{"name":"Acme"}} -"#, - LoadMode::Merge, - ) - .await - .unwrap_err(); - assert!( - err.to_string() - .contains("injected failpoint triggered: mutation.post_finalize_pre_publisher"), - "unexpected error: {err}" - ); - let recovery_dir = dir.path().join("__recovery"); - assert_eq!(std::fs::read_dir(&recovery_dir).unwrap().count(), 1); - } - - // Additive migration on the SAME handle. Must heal the load's - // sidecar first, then apply normally.
- let desired = format!("{}\nnode Tag {{ name: String @key }}\n", helpers::TEST_SCHEMA); - db.apply_schema(&desired).await.expect( - "schema apply on the same handle must heal sidecar-covered \ - drift in-process instead of failing until restart", - ); - - // The failed load rolled forward; the migration landed. - assert_eq!(helpers::count_rows(&db, "node:Person").await, 1); - assert_eq!(helpers::count_rows(&db, "node:Company").await, 1); - assert_eq!(helpers::count_rows(&db, "node:Tag").await, 0); - - // No sidecar remains (the load's was consumed by the heal; schema - // apply deleted its own after publish). - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 0, - "no sidecar may remain after heal + successful schema apply" - ); - } -} - -/// Same heal contract for the branch-merge entry point: a pending -/// roll-forward-eligible sidecar on the target branch must be healed -/// (with its recovery audit row) before the merge reads its target -/// snapshot -- not silently folded into the merge's publish. -#[tokio::test] -#[serial] -async fn branch_merge_after_finalize_publisher_failure_heals_without_reopen() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -"#, - LoadMode::Append, - ) - .await - .unwrap(); - - // A feature branch with its own write, to merge back later.
- db.branch_create("feature").await.unwrap(); - helpers::mutate_branch( - &mut db, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - .unwrap(); - - // Failed load on MAIN: Person drifts ahead of the manifest with a - // sidecar covering it. - { - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); - let err = load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Bob","age":25}} -"#, - LoadMode::Merge, - ) - .await - .unwrap_err(); - assert!( - err.to_string() - .contains("injected failpoint triggered: mutation.post_finalize_pre_publisher"), - "unexpected error: {err}" - ); - let recovery_dir = dir.path().join("__recovery"); - assert_eq!(std::fs::read_dir(&recovery_dir).unwrap().count(), 1); - } - - // Merge on the SAME handle. The entry heal must consume the load's - // sidecar (publishing Bob with a recovery audit row) BEFORE the - // merge captures its target snapshot. - db.branch_merge("feature", "main").await.expect( - "branch merge on the same handle must heal sidecar-covered \ - drift in-process instead of failing or folding it silently", - ); - - // No sidecar remains: the heal consumed the load's sidecar; the - // merge deleted its own after publish. Without the entry heal the - // merge's publish makes the drifted commit visible as a side effect - // (manifest catches up to HEAD) and the stale sidecar lingers - // until some later sweep -- recovery must be attributed, not - // incidental. - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 0, - "the load's sidecar must be consumed by the entry heal, not left behind" - ); - } - - // All three writes are visible on main: Alice (clean load), Bob - // (rolled forward), Eve (merged).
- assert_eq!(helpers::count_rows(&db, "node:Person").await, 3); -} - -/// Discarding an orphaned-branch sidecar must be idempotent across a -/// Phase D delete failure: the audit row + commit land before the -/// sidecar delete, so a delete fault leaves the sidecar on disk with -/// the audit already written β€” the retry must NOT append a second -/// audit row for the same operation, only finish the delete. -#[tokio::test] -#[serial] -async fn orphaned_branch_discard_is_idempotent_across_delete_failure() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Alice\",\"age\":30}}\n", - LoadMode::Merge, - ) - .await - .unwrap(); - db.branch_create("feature").await.unwrap(); - helpers::mutate_branch( - &mut db, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - .unwrap(); - - // Deferred-shape sidecar pinned to feature (head < expected β‡’ - // invariant violation β‡’ every roll-forward-only pass defers it). - let person_uri = node_table_uri(&uri, "Person"); - let sidecar_json = format!( - r#"{{ - "schema_version": 1, - "operation_id": "01H000000000000000000000ID", - "started_at": "0", - "branch": "feature", - "actor_id": null, - "writer_kind": "Mutation", - "tables": [ - {{ - "table_key": "node:Person", - "table_path": "{person_uri}", - "expected_version": 999, - "post_commit_pin": 1000, - "table_branch": "feature" - }} - ] - }}"# - ); - let recovery_dir = dir.path().join("__recovery"); - std::fs::create_dir_all(&recovery_dir).unwrap(); - std::fs::write( - recovery_dir.join("01H000000000000000000000ID.json"), - &sidecar_json, - ) - .unwrap(); - - // Orphan the sidecar. 
- db.branch_delete("feature").await.unwrap(); - - // First write: the discard path writes its audit row, then the - // sidecar delete fails (injected). The write fails loudly. - { - let _failpoint = ScopedFailPoint::new(names::RECOVERY_SIDECAR_DELETE, "return"); - let err = load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Bob\",\"age\":25}}\n", - LoadMode::Merge, - ) - .await - .err() - .expect("a sidecar-delete fault mid-discard must fail the write"); - assert!( - err.to_string() - .contains("injected failpoint triggered: recovery.sidecar_delete"), - "unexpected error: {err}" - ); - assert_eq!(std::fs::read_dir(&recovery_dir).unwrap().count(), 1); - } - - // Retry: must finish the delete WITHOUT a second audit row. - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Bob\",\"age\":25}}\n", - LoadMode::Merge, - ) - .await - .expect("the retry must complete the orphan discard and the write"); - assert_eq!(std::fs::read_dir(&recovery_dir).unwrap().count(), 0); - let orphan_rows = helpers::recovery::recovery_audit_kinds(dir.path()) - .await - .into_iter() - .filter(|kind| kind == "OrphanedBranchDiscarded") - .count(); - assert_eq!( - orphan_rows, 1, - "exactly one OrphanedBranchDiscarded audit row despite the delete-fault retry" - ); -} - -/// When the commit-time drift guard cannot LIST sidecars to classify -/// the drift (transient storage fault on the guard's list, after the -/// entry heal's list succeeded), it must say so and name BOTH recovery -/// paths — not confidently route to `omnigraph repair`, which refuses -/// while a sidecar is pending. Sequenced failpoint: first list (entry -/// heal) passes, second list (the guard) fails.
-#[tokio::test] -#[serial] -async fn drift_guard_names_both_paths_when_sidecar_list_fails() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"alice\",\"age\":30}}\n", - LoadMode::Append, - ) - .await - .unwrap(); - - // Rollback-eligible (deferred) sidecar covering main's Person drift β€” - // same shape as refresh_defers_rollback_eligible_sidecar_to_next_open. - let snapshot = db - .snapshot_of(omnigraph::db::ReadTarget::branch("main")) - .await - .unwrap(); - let entry = snapshot.entry("node:Person").unwrap(); - let person_uri = format!("{}/{}", uri.trim_end_matches('/'), entry.table_path); - let manifest_pin = entry.table_version; - let mut ds = lance::Dataset::open(&person_uri).await.unwrap(); - helpers::lance_delete_inline(&mut ds, "1 = 2").await; - let head_after_drift = ds.version().version; - let sidecar_json = format!( - r#"{{ - "schema_version": 1, - "operation_id": "01H0000000000000000000LSTF", - "started_at": "0", - "branch": null, - "actor_id": null, - "writer_kind": "Mutation", - "tables": [ - {{ - "table_key":"node:Person", - "table_path":"{}", - "expected_version":{}, - "post_commit_pin":{} - }} - ] - }}"#, - person_uri, - manifest_pin - 1, - head_after_drift, - ); - let recovery_dir = dir.path().join("__recovery"); - std::fs::create_dir_all(&recovery_dir).unwrap(); - std::fs::write( - recovery_dir.join("01H0000000000000000000LSTF.json"), - &sidecar_json, - ) - .unwrap(); - - // First list (entry heal) passes and defers the sidecar; second - // list (the guard's classification) fails. 
- let _failpoint = ScopedFailPoint::new(names::RECOVERY_SIDECAR_LIST, "1*off->1*return"); - let err = load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"bob\",\"age\":25}}\n", - LoadMode::Merge, - ) - .await - .err() - .expect("drift must still fail the write"); - let msg = err.to_string(); - assert!( - msg.contains("could not classify the drift") - && msg.contains("omnigraph repair") - && msg.contains("reopen the graph read-write"), - "an unclassifiable drift must name BOTH recovery paths, not \ - confidently route to repair; got: {msg}" - ); -} - -/// The other half of the orphan-discard fault matrix: the audit append -/// fails AFTER the recovery commit landed. The retry (keyed on the -/// audit row, the operator-facing record) must converge to exactly one -/// audit row and a consumed sidecar. The second recovery commit the -/// retry appends is the documented not-atomic-pair-write tolerance -/// (same class as `record_audit` and the manifest→commit-graph Known -/// Gap): bounded commit-graph noise, never a lost or duplicated audit -/// record under clean failures. -#[tokio::test] -#[serial] -async fn orphaned_branch_discard_converges_across_audit_append_failure() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Alice\",\"age\":30}}\n", - LoadMode::Merge, - ) - .await - .unwrap(); - db.branch_create("feature").await.unwrap(); - helpers::mutate_branch( - &mut db, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - .unwrap(); - - // Deferred-shape sidecar pinned to feature, then orphaned.
- let person_uri = node_table_uri(&uri, "Person"); - let sidecar_json = format!( - r#"{{ - "schema_version": 1, - "operation_id": "01H000000000000000000000AF", - "started_at": "0", - "branch": "feature", - "actor_id": null, - "writer_kind": "Mutation", - "tables": [ - {{ - "table_key": "node:Person", - "table_path": "{person_uri}", - "expected_version": 999, - "post_commit_pin": 1000, - "table_branch": "feature" - }} - ] - }}"# - ); - let recovery_dir = dir.path().join("__recovery"); - std::fs::create_dir_all(&recovery_dir).unwrap(); - std::fs::write( - recovery_dir.join("01H000000000000000000000AF.json"), - &sidecar_json, - ) - .unwrap(); - db.branch_delete("feature").await.unwrap(); - - // First write: the recovery commit lands, then the audit append - // fails (injected). The write fails loudly; the sidecar survives so - // the discard is retried with the audit still owed. - { - let _failpoint = ScopedFailPoint::new(names::RECOVERY_ORPHAN_DISCARD_AUDIT_APPEND, "return"); - let err = load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Bob\",\"age\":25}}\n", - LoadMode::Merge, - ) - .await - .err() - .expect("an audit-append fault mid-discard must fail the write"); - assert!( - err.to_string() - .contains("injected failpoint triggered: recovery.orphan_discard_audit_append"), - "unexpected error: {err}" - ); - assert_eq!( - std::fs::read_dir(&recovery_dir).unwrap().count(), - 1, - "the sidecar must survive an audit-append fault so the discard is retried" - ); - let orphan_rows = helpers::recovery::recovery_audit_kinds(dir.path()) - .await - .into_iter() - .filter(|kind| kind == "OrphanedBranchDiscarded") - .count(); - assert_eq!(orphan_rows, 0, "no audit row landed before the fault"); - } - - // Retry: converges β€” sidecar consumed, exactly one audit row. 
- load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Bob\",\"age\":25}}\n", - LoadMode::Merge, - ) - .await - .expect("the retry must complete the orphan discard and the write"); - assert_eq!(std::fs::read_dir(&recovery_dir).unwrap().count(), 0); - let orphan_rows = helpers::recovery::recovery_audit_kinds(dir.path()) - .await - .into_iter() - .filter(|kind| kind == "OrphanedBranchDiscarded") - .count(); - assert_eq!( - orphan_rows, 1, - "exactly one OrphanedBranchDiscarded audit row despite the audit-fault retry" - ); -} - -/// After the write-entry heal rolls a SchemaApply sidecar forward (a -/// crashed apply on the SAME handle: staging promoted, registrations -/// published), the handle's in-memory catalog must be reloaded β€” disk -/// and manifest are on the new schema, and validating subsequent -/// writes against the stale catalog rejects rows of types the graph -/// already has. -#[tokio::test] -#[serial] -async fn load_after_schema_apply_phase_b_failure_uses_recovered_catalog() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"alice\",\"age\":30}}\n", - LoadMode::Append, - ) - .await - .unwrap(); - - // v2: a Person property (rewritten_tables work) + a new Tag type - // (table-set change, keeps the staging disambiguator decisive). - let v2_schema = r#"node Person { - name: String @key - age: I32? - city: String? -} - -node Company { - name: String @key -} - -node Tag { - label: String @key -} - -edge Knows: Person -> Person { - since: Date? 
-} - -edge WorksAt: Person -> Company -"#; - { - let _failpoint = ScopedFailPoint::new(names::SCHEMA_APPLY_AFTER_STAGING_WRITE, "return"); - let err = db.apply_schema(v2_schema).await.unwrap_err(); - assert!( - err.to_string() - .contains("injected failpoint triggered: schema_apply.after_staging_write"), - "unexpected error: {err}" - ); - let recovery_dir = dir.path().join("__recovery"); - assert_eq!(std::fs::read_dir(&recovery_dir).unwrap().count(), 1); - } - - // Same handle: a load of the NEW type. The entry heal rolls the - // apply forward (staging promoted, manifest registers node:Tag) β€” - // and the loader must then validate against the RECOVERED catalog, - // not the stale in-memory one. - load_jsonl( - &mut db, - "{\"type\":\"Tag\",\"data\":{\"label\":\"t1\"}}\n", - LoadMode::Merge, - ) - .await - .expect( - "after the heal rolls the schema apply forward, the same handle \ - must accept rows of the recovered schema's types", - ); - assert_eq!(helpers::count_rows(&db, "node:Tag").await, 1); - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!(std::fs::read_dir(&recovery_dir).unwrap().count(), 0); - } -} - -/// A concurrent write's entry heal must NOT promote a LIVE schema -/// apply's staging files. The apply pauses just after writing its -/// staging files (sidecar on disk from Phase A, staging on disk, -/// manifest not yet committed); a load on the same handle fires the -/// heal in that window. If the heal's schema-staging reconcile runs -/// unserialized, it promotes the staging files from under the live -/// apply β€” putting the NEW catalog live against the OLD manifest β€” and -/// the resumed apply's own renames then fail on the missing sources: -/// an error (and a corrupted catalog) for an otherwise-healthy apply. 
-#[tokio::test(flavor = "multi_thread", worker_threads = 4)] -#[serial] -async fn heal_does_not_promote_live_schema_apply_staging() { - use omnigraph::loader::LoadMode; - use std::sync::Arc; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - let db = Arc::new(Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap()); - - // Park the apply right after its staging files land (its sidecar is - // already on disk from Phase A; the manifest commit has not run). - let rv = helpers::failpoint::Rendezvous::park_first(names::SCHEMA_APPLY_AFTER_STAGING_WRITE); - - let apply_db = Arc::clone(&db); - let desired = format!("{}\nnode Tag {{ name: String @key }}\n", helpers::TEST_SCHEMA); - let apply = tokio::spawn(async move { apply_db.apply_schema(&desired).await }); - - // Wait until the apply is parked in the window (staging files written). - rv.wait_until_reached().await; - let staging_pg = dir.path().join("_schema.pg.staging"); - assert!(staging_pg.exists(), "schema apply never reached the paused window"); - - // Concurrent load on the same handle: its entry heal runs while the - // apply is paused. The load itself may fail (schema apply in - // progress) β€” what matters is what its heal does to the live apply. - let load_db = Arc::clone(&db); - let load = tokio::spawn(async move { - load_db - .load_as( - "main", - None, - "{\"type\":\"Person\",\"data\":{\"name\":\"Alice\",\"age\":30}}\n", - LoadMode::Merge, - None, - ) - .await - }); - - // Give the load's heal time to act inside the window. Broken code - // completes the load here (its heal promoted the staging files and - // stole the apply's commit); fixed code leaves the load blocked on - // the schema-apply serialization key until the apply finishes. 
- tokio::time::sleep(std::time::Duration::from_millis(500)).await; - rv.release(); - - let apply_result = apply.await.unwrap(); - let _ = tokio::time::timeout(std::time::Duration::from_secs(30), load) - .await - .expect("load must complete once the apply releases its guards") - .unwrap(); - apply_result.expect( - "a concurrent write's heal must not promote the live schema \ - apply's staging files out from under it", - ); - - // The migration landed and nothing recovery-shaped remains. - assert_eq!(helpers::count_rows(&db, "node:Tag").await, 0); - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - assert_eq!(std::fs::read_dir(&recovery_dir).unwrap().count(), 0); - } -} - /// Refresh-time recovery must NOT call `Dataset::restore` β€” it can /// silently orphan a concurrent writer's commit. Sidecars that would /// require rollback must be left on disk for the next ReadWrite open. @@ -2644,9 +684,9 @@ async fn heal_does_not_promote_live_schema_apply_staging() { /// sidecar still on disk, Lance HEAD unchanged (no restore commit). /// Then drop + open: full sweep handles it. #[tokio::test] -#[serial] async fn refresh_defers_rollback_eligible_sidecar_to_next_open() { use omnigraph::loader::{LoadMode, load_jsonl}; + use omnigraph::table_store::TableStore; let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); @@ -2676,8 +716,12 @@ async fn refresh_defers_rollback_eligible_sidecar_to_next_open() { // touching the manifest) so the classifier can reach UnexpectedAtP1 // / UnexpectedMultistep / RolledPastExpected paths that require // a real restore on rollback. 
+ let store = TableStore::new(&uri); let mut ds = lance::Dataset::open(&person_uri).await.unwrap(); - helpers::lance_delete_inline(&mut ds, "1 = 2").await; + store + .delete_where(&person_uri, &mut ds, "1 = 2") + .await + .unwrap(); let head_after_drift = ds.version().version; assert_eq!(head_after_drift, manifest_pin + 1); @@ -2751,31 +795,11 @@ async fn refresh_defers_rollback_eligible_sidecar_to_next_open() { pre_head={pre_head}, post_head={post_head}", ); - // A write attempt while the rollback-eligible sidecar is deferred: - // the write-entry heal defers it again (roll-forward-only), and the - // commit-time drift guard must name the actual recovery path (a - // read-write reopen) β€” NOT `omnigraph repair`, which refuses while - // a sidecar is pending. - let err = mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Grace")], &[("$age", 50)]), - ) - .await - .unwrap_err(); - assert!( - err.to_string() - .contains("a pending recovery sidecar requires rollback"), - "drift guard must point at a read-write reopen for sidecar-covered \ - rollback-eligible drift; got: {err}" - ); - // Cross-check: drop the engine and reopen β€” full sweep handles // the rollback (will use Dataset::restore safely; no concurrent // writers at open time). drop(db); - let db = Omnigraph::open(&uri).await.unwrap(); + let _db = Omnigraph::open(&uri).await.unwrap(); // After full-sweep recovery, the sidecar should be processed // (deleted). Sidecar's tables are eligible for rollback (UnexpectedAtP1): // restore happens on Person (HEAD advances by 1). @@ -2798,26 +822,12 @@ async fn refresh_defers_rollback_eligible_sidecar_to_next_open() { "full sweep must run Dataset::restore (head advances); \ post_head={post_head}, final_head={final_head}", ); - // Convergence: roll-back published the restored HEAD, so the manifest pin - // tracks Lance HEAD afterward (no residual drift). 
- let entry_version = db - .snapshot_of(omnigraph::db::ReadTarget::branch("main")) - .await - .unwrap() - .entry("node:Person") - .unwrap() - .table_version; - assert_eq!( - entry_version, final_head, - "full-sweep roll-back must publish so manifest pin ({entry_version}) == Lance HEAD ({final_head})", - ); } /// Companion to the above β€” confirms that a finalizeβ†’publisher failure /// on one table leaves OTHER tables untouched. Subsequent writes to /// non-drifted tables proceed normally; the drift is contained. #[tokio::test] -#[serial] async fn finalize_publisher_residual_does_not_drift_untouched_tables() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); @@ -2826,7 +836,7 @@ async fn finalize_publisher_residual_does_not_drift_untouched_tables() { .unwrap(); { - let _failpoint = ScopedFailPoint::new(names::MUTATION_POST_FINALIZE_PRE_PUBLISHER, "return"); + let _failpoint = ScopedFailPoint::new("mutation.post_finalize_pre_publisher", "return"); let _ = mutate_main( &mut db, MUTATION_QUERIES, @@ -2849,67 +859,69 @@ async fn finalize_publisher_residual_does_not_drift_untouched_tables() { } /// Acceptance test: a stage-step failure in the staged-index path -/// (`stage_create_btree_index` succeeded; `commit_staged` not yet called) -/// leaves NO Lance-HEAD drift, so other tables stay writable. +/// (`stage_create_btree_index` succeeded; `commit_staged` not yet +/// called) leaves NO Lance-HEAD drift on the existing tables. +/// Subsequent operations against those tables succeed without +/// `ExpectedVersionMismatch`. /// -/// Under iss-848 schema apply no longer builds indexes inline β€” the build -/// happens in the reconciler (`ensure_indices`/`optimize`) and at load. So this -/// fires the failpoint where it lives now: an `ensure_indices` build of a BTREE -/// that a prior apply declared (`@index`) but deferred. 
The failpoint fires -/// between `stage_create_btree_index` and `commit_staged`, so the staged -/// segment is written under `_indices//` but `node:Person`'s Lance HEAD is -/// unchanged. `ensure_indices` fails and its EnsureIndices sidecar pins only -/// Person at NoMovement (a clean no-op on the next open). A write to a -/// different, unpinned table (`node:Company`) is unaffected: mutations/loads run -/// a roll-forward-only heal and proceed β€” they do not refuse on a pending -/// sidecar the way `optimize`/`repair` do β€” so the write succeeds with no drift. +/// Path: `apply_schema(v1 β†’ v2)` adds a new node type. The +/// `added_tables` loop in `schema_apply` creates the empty dataset and +/// then calls `build_indices_on_dataset_for_catalog` β†’ +/// `stage_and_commit_btree(..., &["id"])`. The failpoint fires +/// between `stage_create_btree_index` and `commit_staged`, so the +/// staged segments are written under `_indices//` but Lance HEAD +/// on the new dataset is unchanged at v=1. The schema-apply lock +/// branch is released by `apply_schema`'s outer match. Existing +/// tables (e.g. `node:Person`) are completely untouched by the new +/// node's added_tables iteration β€” they're outside the failed apply +/// path entirely β€” and we assert that mutations against them continue +/// to work. +/// +/// The orphan empty dataset from the failed apply is acceptable +/// residual: it's unreferenced by `__manifest` and will be reclaimed +/// by `cleanup_old_versions` (or removed when a future apply at the +/// same target path resolves the rename). #[tokio::test] -#[serial] async fn ensure_indices_stage_btree_failure_leaves_existing_tables_writable() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap().to_string(); + + // Init with TEST_SCHEMA which declares Person + Knows. Indices on + // those tables get built during init. 
let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - // Seed a Person row β€” the load builds Person's id BTREE + name FTS. + // Apply a schema that adds a new node type. The added_tables loop + // will hit the failpoint between stage and commit on the new + // node:Project table's btree-on-id build. (TEST_SCHEMA already + // has Person + Company + Knows + WorksAt β€” pick a name that isn't + // already declared.) + let extended_schema = format!( + "{}\nnode Project {{ name: String @key }}\n", + helpers::TEST_SCHEMA + ); + + { + let _failpoint = + ScopedFailPoint::new("ensure_indices.post_stage_pre_commit_btree", "return"); + let err = db.apply_schema(&extended_schema).await.unwrap_err(); + assert!( + err.to_string() + .contains("ensure_indices.post_stage_pre_commit_btree"), + "schema apply should fail with the synthetic failpoint error, got: {err}" + ); + } + + // Existing tables stayed at their pre-apply versions; subsequent + // mutations against them succeed (no Lance-HEAD drift). mutate_main( &mut db, helpers::MUTATION_QUERIES, "insert_person", - &mixed_params(&[("$name", "Alice")], &[("$age", 30)]), + &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), ) .await - .expect("seed Person"); - - // Add `@index` on `age`: schema apply records the intent but defers the - // physical build (iss-848), so the BTREE on `age` is unbuilt. - let indexed_schema = helpers::TEST_SCHEMA.replace("age: I32?", "age: I32? @index"); - db.apply_schema(&indexed_schema) - .await - .expect("adding an @index is metadata-only and succeeds"); - - { - // ensure_indices builds the deferred `age` BTREE on Person; the failpoint - // fires between stage and commit, so Person's Lance HEAD does not move. 
- let _failpoint = - ScopedFailPoint::new(names::ENSURE_INDICES_POST_STAGE_PRE_COMMIT_BTREE, "return"); - let err = db.ensure_indices().await.unwrap_err(); - assert!( - err.to_string() - .contains("ensure_indices.post_stage_pre_commit_btree"), - "ensure_indices should fail with the synthetic failpoint error, got: {err}" - ); - } - - // A different, unpinned table is untouched by the failed index build. - use omnigraph::loader::{LoadMode, load_jsonl}; - load_jsonl( - &mut db, - r#"{"type": "Company", "data": {"name": "Acme"}}"#, - LoadMode::Append, - ) - .await - .expect("Company write on a table untouched by the failed ensure_indices should succeed"); + .expect("Person mutation must succeed after the failed schema apply — existing tables are not drifted"); } fn assert_no_staging_files(graph: &std::path::Path) { @@ -2945,7 +957,6 @@ fn assert_no_staging_files(graph: &std::path::Path) { // ExpectedVersionMismatch. #[tokio::test] -#[serial] async fn schema_apply_without_schema_staging_rolls_back_on_next_open() { use omnigraph::loader::{LoadMode, load_jsonl}; @@ -2973,7 +984,7 @@ async fn schema_apply_without_schema_staging_rolls_back_on_next_open() { { let db = Omnigraph::open(&uri).await.unwrap(); - let _failpoint = ScopedFailPoint::new(names::SCHEMA_APPLY_BEFORE_STAGING_WRITE, "return"); + let _failpoint = ScopedFailPoint::new("schema_apply.before_staging_write", "return"); let v2_schema = r#"node Person { name: String @key age: I32? @@ -3004,15 +1015,10 @@ edge WorksAt: Person -> Company } let db = Omnigraph::open(&uri).await.unwrap(); - // Roll-back now publishes the restored version, so the manifest version - // advances — but to the OLD-schema content: the migration never applied - // (asserted by count_rows + the `_schema.pg` checks below), and the sweep - // converges (`manifest == Lance HEAD`, asserted by - // assert_post_recovery_invariants's RolledBack arm).
- assert!( - version_main(&db).await.unwrap() > pre_failure_version, - "roll-back publishes the restored (old-schema) version, advancing the manifest; \ - pre={pre_failure_version}", + assert_eq!( + version_main(&db).await.unwrap(), + pre_failure_version, + "manifest must remain on the old schema when no schema staging files existed" ); assert_eq!( helpers::count_rows(&db, "node:Person").await, @@ -3043,7 +1049,6 @@ edge WorksAt: Person -> Company } #[tokio::test] -#[serial] async fn schema_apply_phase_b_failure_recovered_on_next_open() { use omnigraph::loader::{LoadMode, load_jsonl}; @@ -3079,7 +1084,7 @@ async fn schema_apply_phase_b_failure_recovered_on_next_open() { // written, but BEFORE the manifest publish. The recovery sidecar persists. { let db = Omnigraph::open(&uri).await.unwrap(); - let _failpoint = ScopedFailPoint::new(names::SCHEMA_APPLY_AFTER_STAGING_WRITE, "return"); + let _failpoint = ScopedFailPoint::new("schema_apply.after_staging_write", "return"); // v2 schema: add a `city` property to Person AND add a new // `Tag` node type. The new property triggers the rewritten_tables // path (Phase B sidecar coverage). The new type changes the @@ -3186,341 +1191,7 @@ edge WorksAt: Person -> Company ); } -/// `optimize` Phase B β†’ Phase C residual: `compact_files` advanced the Lance -/// HEAD but the manifest publish hasn't run. The `Optimize` recovery sidecar -/// (loose-match, like SchemaApply/EnsureIndices) must roll the compacted version -/// forward on next open so the manifest tracks the Lance HEAD β€” and the healed -/// table must then accept a schema apply (the original bug's victim). 
#[tokio::test] -#[serial(optimize)] -async fn optimize_phase_b_failure_recovered_on_next_open() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let operation_id; - - // Seed: several separate Person inserts → multiple fragments, so compaction - // has real work and advances the Lance HEAD. - { - let db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - for (name, age) in [("alice", 30), ("bob", 31), ("carol", 32), ("dave", 33)] { - db.mutate( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", name)], &[("$age", age)]), - ) - .await - .unwrap(); - } - } - - let pre_failure_version = { - let db = Omnigraph::open(&uri).await.unwrap(); - version_main(&db).await.unwrap() - }; - - // Failpoint fires AFTER compact_files advanced the Lance HEAD but BEFORE the - // manifest publish. The Optimize sidecar persists (only node:Person has - // compactable fragments, so exactly one sidecar is written). - { - let db = Omnigraph::open(&uri).await.unwrap(); - let _failpoint = - ScopedFailPoint::new(names::OPTIMIZE_POST_PHASE_B_PRE_MANIFEST_COMMIT, "return"); - let err = db.optimize().await.unwrap_err(); - assert!( - err.to_string().contains( - "injected failpoint triggered: optimize.post_phase_b_pre_manifest_commit" - ), - "unexpected error: {err}" - ); - - let recovery_dir = dir.path().join("__recovery"); - let sidecars: Vec<_> = std::fs::read_dir(&recovery_dir) - .unwrap() - .filter_map(|e| e.ok()) - .collect(); - assert_eq!( - sidecars.len(), - 1, - "exactly one Optimize sidecar must persist after optimize failure" - ); - operation_id = single_sidecar_operation_id(dir.path()); - } - - // Recovery: reopen runs the sweep. The Optimize sidecar classifies - // RolledPastExpected (loose-match) → RollForward → manifest extends to the - // compacted Lance HEAD.
- let db = Omnigraph::open(&uri).await.unwrap(); - let post_recovery_version = version_main(&db).await.unwrap(); - assert!( - post_recovery_version > pre_failure_version, - "manifest version must advance post-recovery (compaction rolled forward); \ - pre={pre_failure_version}, post={post_recovery_version}", - ); - drop(db); - - assert_post_recovery_invariants( - dir.path(), - &operation_id, - RecoveryExpectation::RolledForward { - tables: vec![TableExpectation::main("node:Person")], - }, - ) - .await - .unwrap(); - - // The healed table accepts an additive schema apply β€” its HEAD-vs-manifest - // precondition is satisfied because recovery published the compacted version. - let db = Omnigraph::open(&uri).await.unwrap(); - let desired = helpers::TEST_SCHEMA.replace( - " age: I32?\n}", - " age: I32?\n nickname: String?\n}", - ); - db.apply_schema(&desired) - .await - .expect("schema apply after optimize recovery must succeed"); -} - -/// Cross-process race (the prod bug): a served write advances the manifest on the -/// same table while a SEPARATE `optimize` process is paused between its compaction -/// and its manifest publish. The in-process write queue does NOT serialize across -/// processes, so optimize's equality-CAS publish (expected = its pre-compaction -/// version) finds the manifest already advanced. optimize must CONVERGE β€” the -/// concurrent write built on top of the compacted HEAD, so the compaction is -/// already reflected β€” not fail with "expected X but current Y". RED before the -/// monotonic-publish fix. 
-#[tokio::test(flavor = "multi_thread", worker_threads = 4)] -#[serial(optimize)] -async fn optimize_survives_concurrent_insert_advancing_manifest() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - { - let db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - for (name, age) in [("alice", 30), ("bob", 31), ("carol", 32), ("dave", 33)] { - db.mutate( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", name)], &[("$age", age)]), - ) - .await - .unwrap(); - } - } - - // Pause optimize BEFORE it compacts, so the concurrent insert lands while - // HEAD == manifest (no in-flight optimize drift for the writer to trip on); the - // insert advances the manifest, then optimize compacts on top and must converge - // its publish over the advanced manifest rather than fail the equality CAS. - let failpoint = ScopedFailPoint::new(names::OPTIMIZE_BEFORE_COMPACT, "pause"); - - let uri_opt = uri.clone(); - let optimize = tokio::spawn(async move { - let db = Omnigraph::open(&uri_opt).await.unwrap(); - db.optimize().await - }); - - // Wait until optimize reaches the pause (its Optimize sidecar is on disk). - assert!( - wait_for_sidecar(dir.path()).await, - "optimize never reached the pre-compact pause", - ); - - // Concurrent insert on the SAME table via a SEPARATE handle (= separate - // in-process write queue = a different process) advances the manifest. 
- { - let db_b = Omnigraph::open(&uri).await.unwrap(); - db_b.mutate( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "eve")], &[("$age", 34)]), - ) - .await - .unwrap(); - } - - drop(failpoint); // release optimize - let result = tokio::time::timeout(std::time::Duration::from_secs(20), optimize) - .await - .expect("optimize task hung") - .unwrap(); - result.expect("optimize must survive a concurrent same-table write (cross-process)"); - - // No lost write: 4 seed + eve all present; graph remains re-optimizable. - let db = Omnigraph::open(&uri).await.unwrap(); - assert_eq!( - helpers::count_rows(&db, "node:Person").await, - 5, - "concurrent insert must not be lost", - ); - db.optimize() - .await - .expect("graph must remain healthy / re-optimizable"); -} - -/// Cross-process race: a served DELETE commits on the same table while a SEPARATE -/// `optimize` process is parked just before its compaction. Lance rebases the -/// compaction past the delete cleanly (so this surfaces as a manifest-CAS mismatch -/// at publish, not a Lance `Rewrite` conflict β€” the genuine `Rewrite`-vs-`Rewrite` -/// overlap is the rarer many-fragment/concurrent-compaction case, covered by the -/// shared `is_retryable_lance_conflict` retry the internal-table path already -/// exercises). optimize must converge its publish over the advanced manifest and -/// preserve the delete. RED before the fix. 
-#[tokio::test(flavor = "multi_thread", worker_threads = 4)] -#[serial(optimize)] -async fn optimize_survives_concurrent_delete_before_compaction() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - { - let db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - for (name, age) in [("alice", 30), ("bob", 31), ("carol", 32), ("dave", 33)] { - db.mutate( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", name)], &[("$age", age)]), - ) - .await - .unwrap(); - } - } - - // Pause optimize BEFORE its compaction commits. - let failpoint = ScopedFailPoint::new(names::OPTIMIZE_BEFORE_COMPACT, "pause"); - - let uri_opt = uri.clone(); - let optimize = tokio::spawn(async move { - let db = Omnigraph::open(&uri_opt).await.unwrap(); - db.optimize().await - }); - - assert!( - wait_for_sidecar(dir.path()).await, - "optimize never reached the pre-compact pause", - ); - - // Concurrent DELETE of an existing row writes a deletion vector onto the - // fragment optimize is about to compact β†’ optimize's Rewrite overlap-conflicts - // at the Lance level ("Rewrite … preempted by concurrent Delete/Update"). - { - let db_b = Omnigraph::open(&uri).await.unwrap(); - db_b.mutate( - "main", - MUTATION_QUERIES, - "remove_person", - &mixed_params(&[("$name", "alice")], &[]), - ) - .await - .unwrap(); - } - - drop(failpoint); // release optimize - let result = tokio::time::timeout(std::time::Duration::from_secs(20), optimize) - .await - .expect("optimize task hung") - .unwrap(); - result.expect("optimize must reopen+replan past a concurrent overlapping delete"); - - // No lost write: alice's delete persisted (3 rows); graph remains re-optimizable. 
- let db = Omnigraph::open(&uri).await.unwrap(); - assert_eq!( - helpers::count_rows(&db, "node:Person").await, - 3, - "the concurrent delete must persist (alice removed)", - ); - db.optimize() - .await - .expect("graph must remain healthy / re-optimizable"); -} - -/// Regression: the outer compaction retry loop must NOT misclassify optimize's OWN -/// committed Phase-B work as external drift. Attempt 1 compacts (HEAD β†’ V+1); if a -/// LATER Phase-B op (reindex) then hits a retryable conflict, the reopened attempt -/// sees Lance HEAD ahead of the manifest β€” from OUR compaction, not an external -/// writer. The drift guard must skip it (we hold the sidecar) and converge, not -/// delete the sidecar and return `skipped_for_drift` (which would strand uncovered -/// drift). Reproduced by injecting one retryable reindex conflict after the compact. -#[tokio::test] -#[serial(optimize)] -async fn optimize_retry_does_not_misclassify_own_head_drift() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - { - let db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - for (name, age) in [("alice", 30), ("bob", 31), ("carol", 32), ("dave", 33)] { - db.mutate( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", name)], &[("$age", age)]), - ) - .await - .unwrap(); - } - } - - // Inject exactly one retryable reindex conflict: attempt 1 compacts (HEAD+1) then - // "conflicts" on reindex β†’ retry; attempt 2 reopens with HEAD ahead of the manifest - // from our own compaction β€” the misclassification trigger. 
- let _failpoint = ScopedFailPoint::new(names::OPTIMIZE_INJECT_REINDEX_CONFLICT, "1*return"); - - let db = Omnigraph::open(&uri).await.unwrap(); - let stats = db - .optimize() - .await - .expect("optimize must converge, not misclassify its own HEAD drift"); - let person = stats - .iter() - .find(|s| s.table_key == "node:Person") - .expect("node:Person stat present"); - assert!( - person.skipped.is_none(), - "node:Person must converge, not skipped_for_drift: {:?}", - person.skipped, - ); - - // No uncovered drift stranded: a follow-up optimize is clean and all rows read. - let stats2 = db.optimize().await.unwrap(); - let person2 = stats2 - .iter() - .find(|s| s.table_key == "node:Person") - .unwrap(); - assert!( - person2.skipped.is_none(), - "follow-up optimize must be clean (no stranded drift): {:?}", - person2.skipped, - ); - assert_eq!(helpers::count_rows(&db, "node:Person").await, 4); -} - -/// Poll until `optimize` has written its recovery sidecar (i.e. reached Phase B -/// and is about to / has compacted), signalling it is parked at its failpoint. 
-async fn wait_for_sidecar(root: &std::path::Path) -> bool { - let recovery_dir = root.join("__recovery"); - for _ in 0..1000 { - if recovery_dir.exists() - && std::fs::read_dir(&recovery_dir) - .map(|d| d.count() > 0) - .unwrap_or(false) - { - return true; - } - tokio::time::sleep(std::time::Duration::from_millis(10)).await; - } - false -} - -#[tokio::test] -#[serial] -#[serial(branch_merge_phase_b)] async fn branch_merge_phase_b_failure_recovered_on_next_open() { use omnigraph::loader::{LoadMode, load_jsonl}; @@ -3576,7 +1247,7 @@ async fn branch_merge_phase_b_failure_recovered_on_next_open() { { let db = Omnigraph::open(&uri).await.unwrap(); let _failpoint = - ScopedFailPoint::new(names::BRANCH_MERGE_POST_PHASE_B_PRE_MANIFEST_COMMIT, "return"); + ScopedFailPoint::new("branch_merge.post_phase_b_pre_manifest_commit", "return"); let err = db.branch_merge("feature", "main").await.unwrap_err(); assert!( err.to_string().contains( @@ -3632,24 +1303,47 @@ async fn branch_merge_phase_b_failure_recovered_on_next_open() { ); // The recovered branch_merge must record a MERGE commit (with - // `merged_parent_commit_id` set), not a plain commit. Without this, future - // merges between the same pair lose already-up-to-date detection. RFC-013 - // Phase 7 records the recovery commit in `__manifest` (folded into the - // recovery publish CAS), so we read it through the commit-graph projection - // (`CommitGraph::load_commits`) and assert some commit carries a non-null - // `merged_parent_commit_id`. Only a recovered branch_merge can produce one - // here (we never completed a normal merge in this test). + // `merged_parent_commit_id` set), not a plain commit. Without + // this, future merges between the same pair lose + // already-up-to-date detection. We verify by reading + // `_graph_commits.lance` and asserting the most recent commit + // tagged with the recovery actor has a non-null + // `merged_parent_commit_id`. 
{ - let commits = - omnigraph::db::commit_graph::CommitGraph::open(dir.path().to_str().unwrap()) - .await - .unwrap() - .load_commits() - .await - .unwrap(); - let found_recovery_merge = commits - .iter() - .any(|c| c.merged_parent_commit_id.is_some()); + use arrow_array::{Array, StringArray}; + use futures::TryStreamExt; + let commits_dir = dir.path().join("_graph_commits.lance"); + let ds = lance::Dataset::open(commits_dir.to_str().unwrap()) + .await + .unwrap(); + let batches: Vec<arrow_array::RecordBatch> = ds + .scan() + .try_into_stream() + .await + .unwrap() + .try_collect() + .await + .unwrap(); + let mut found_recovery_merge = false; + for batch in batches { + let merged = batch + .column_by_name("merged_parent_commit_id") + .expect("merged_parent_commit_id column present") + .as_any() + .downcast_ref::<StringArray>() + .expect("merged_parent_commit_id is Utf8"); + // The actor_id lives in _graph_commit_actors; cross-checking + // is heavier than necessary. Detecting any non-null + // merged_parent_commit_id in the post-recovery state is + // sufficient: only a recovered branch_merge can produce one + // here (we never completed a normal merge in this test). + for i in 0..merged.len() { + if !merged.is_null(i) { + found_recovery_merge = true; + break; + } + } + } assert!( found_recovery_merge, "recovered branch_merge must record `merged_parent_commit_id` so future \ @@ -3659,358 +1353,6 @@ async fn branch_merge_phase_b_failure_recovered_on_next_open() { drop(db); } -/// AdoptWithDelta recovery (the gap closure): a fast-forward merge — main has -/// NOT advanced since the branch forked, so the touched table is classified -/// `AdoptWithDelta`, not `RewriteMerged` — that fails after Phase B must still -/// recover on the next open. Before the recovery-pin closure this drifted -/// silently: the adopt path advanced Lance HEAD but was unpinned, so the sweep -/// found no sidecar and the merge was lost. 
-#[tokio::test] -#[serial] -#[serial(branch_merge_phase_b)] -async fn branch_merge_adopt_with_delta_phase_b_failure_recovered_on_next_open() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - // Seed main, branch off, mutate ONLY the branch. main stays at base, so the - // merge is a fast-forward and Person classifies `AdoptWithDelta` (forked - // source, target == base, non-empty delta) β€” NOT `RewriteMerged`. - { - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"alice","age":30}} -"#, - LoadMode::Append, - ) - .await - .unwrap(); - db.branch_create("feature").await.unwrap(); - db.mutate( - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Bob")], &[("$age", 40)]), - ) - .await - .unwrap(); - // main intentionally NOT mutated β†’ fast-forward β†’ AdoptWithDelta. - } - - let pre_failure_version = { - let db = Omnigraph::open(&uri).await.unwrap(); - version_main(&db).await.unwrap() - }; - - // Fail after the per-table publish loop, before commit_manifest_updates. - { - let db = Omnigraph::open(&uri).await.unwrap(); - let _failpoint = - ScopedFailPoint::new(names::BRANCH_MERGE_POST_PHASE_B_PRE_MANIFEST_COMMIT, "return"); - let err = db.branch_merge("feature", "main").await.unwrap_err(); - assert!( - err.to_string().contains( - "injected failpoint triggered: branch_merge.post_phase_b_pre_manifest_commit" - ), - "unexpected error: {err}" - ); - - // The gap closure: an AdoptWithDelta merge must persist a sidecar. 
- let recovery_dir = dir.path().join("__recovery"); - let sidecars: Vec<_> = std::fs::read_dir(&recovery_dir) - .unwrap() - .filter_map(|e| e.ok()) - .collect(); - assert_eq!( - sidecars.len(), - 1, - "AdoptWithDelta merge must persist exactly one recovery sidecar (the closed gap)" - ); - } - - // Reopen → the recovery sweep rolls the AdoptWithDelta merge forward. - let db = Omnigraph::open(&uri).await.unwrap(); - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - let remaining: Vec<_> = std::fs::read_dir(&recovery_dir) - .unwrap() - .filter_map(|e| e.ok()) - .collect(); - assert!( - remaining.is_empty(), - "sidecar must be deleted post-recovery; remaining: {remaining:?}" - ); - } - - let post_recovery_version = version_main(&db).await.unwrap(); - assert!( - post_recovery_version > pre_failure_version, - "manifest must advance post-recovery; pre={pre_failure_version} post={post_recovery_version}" - ); - let names = collect_column_strings(&read_table(&db, "node:Person").await, "name"); - assert!( - names.contains(&"Bob".to_string()), - "recovered AdoptWithDelta merge must include Bob; have {names:?}" - ); - drop(db); -} - -/// Which branch-merge publish path a partial-Phase-B test exercises. -enum MergeScenario { - /// main stays at base → the touched table is `AdoptWithDelta` - /// (`publish_adopted_delta`: append → upsert → delete). - Adopt, - /// main advances past base → the touched table is `RewriteMerged` - /// (`publish_rewritten_merge_table`: merge_insert → delete → index). - Rewrite, -} - -async fn sorted_person_names(db: &Omnigraph) -> Vec<String> { - let mut names = collect_column_strings(&read_table(db, "node:Person").await, "name"); - names.sort(); - names -} - -/// THE recovery-atomicity regression gate. A branch merge whose per-table publish -/// is a multi-commit sequence (append → upsert → delete, or merge_insert → delete -/// → index) advances Lance HEAD step by step before the manifest publish. 
If the -/// process dies *mid*-sequence β€” after some commits but before the achieved-version -/// intent is recorded β€” recovery must roll the whole merge **back**, not publish -/// the partial and record the merge as complete. -/// -/// The delta is deliberately MIXED β€” a fresh id (`bob`, append), a modified base id -/// (`carol`, upsert) and a removed base id (`dave`, delete) β€” so every partial -/// window leaves real work undone. Proof of rollback: after recovery the target is -/// back at its base name-set, and a *re-run* of the merge re-applies the full delta -/// (the partial was not silently recorded as "already merged"). -/// -/// RED before the fix: the loose `BranchMerge` classification rolls any -/// `lance_head > manifest_pinned` forward, so the partial is published (e.g. `bob` -/// present, `dave` kept) and the merge recorded β€” the first assert (back at base) -/// fails. GREEN after: `achieved_version == None` β†’ `IncompletePhaseB` β†’ roll back. -async fn assert_partial_merge_rolls_back(scenario: MergeScenario, failpoint: &str) { - use omnigraph::loader::load_jsonl; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - // Seed main {alice, carol, dave}; on `feature` add bob (append), bump carol - // (upsert), remove dave (delete). For Rewrite, also move main past base so the - // table classifies RewriteMerged instead of a fast-forward AdoptWithDelta. 
- { - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"alice\",\"age\":30}}\n\ - {\"type\":\"Person\",\"data\":{\"name\":\"carol\",\"age\":50}}\n\ - {\"type\":\"Person\",\"data\":{\"name\":\"dave\",\"age\":60}}\n", - LoadMode::Append, - ) - .await - .unwrap(); - db.branch_create("feature").await.unwrap(); - db.mutate( - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "bob")], &[("$age", 40)]), - ) - .await - .unwrap(); - db.mutate( - "feature", - MUTATION_QUERIES, - "set_age", - &mixed_params(&[("$name", "carol")], &[("$age", 55)]), - ) - .await - .unwrap(); - db.mutate( - "feature", - MUTATION_QUERIES, - "remove_person", - &mixed_params(&[("$name", "dave")], &[]), - ) - .await - .unwrap(); - if matches!(scenario, MergeScenario::Rewrite) { - db.mutate( - "main", - MUTATION_QUERIES, - "set_age", - &mixed_params(&[("$name", "alice")], &[("$age", 35)]), - ) - .await - .unwrap(); - } - } - - // Crash mid-Phase-B at the injected window. - { - let db = Omnigraph::open(&uri).await.unwrap(); - let _fp = ScopedFailPoint::new(failpoint, "return"); - let err = db.branch_merge("feature", "main").await.unwrap_err(); - assert!( - err.to_string().contains(failpoint), - "expected the injected failpoint {failpoint}, got: {err}" - ); - } - - // Reopen β†’ the open-time sweep must ROLL BACK to base (the merge never reached - // its commit boundary), and a re-run must then apply the FULL delta. 
- { - let db = Omnigraph::open(&uri).await.unwrap(); - assert_eq!( - sorted_person_names(&db).await, - vec!["alice", "carol", "dave"], - "partial Phase B at {failpoint} must roll back to base \ - (no bob, dave kept, carol's upsert reverted); the merge must NOT be recorded", - ); - db.branch_merge("feature", "main").await.unwrap(); - assert_eq!( - sorted_person_names(&db).await, - vec!["alice", "bob", "carol"], - "re-merge after rollback must re-apply the full delta \ - (bob added, dave removed) β€” proof the partial was not silently recorded", - ); - } -} - -#[tokio::test] -#[serial] -#[serial(branch_merge_phase_b)] -async fn branch_merge_adopt_partial_after_append_rolls_back() { - assert_partial_merge_rolls_back( - MergeScenario::Adopt, - "branch_merge.adopt_after_append_pre_upsert", - ) - .await; -} - -#[tokio::test] -#[serial] -#[serial(branch_merge_phase_b)] -async fn branch_merge_adopt_partial_after_upsert_rolls_back() { - assert_partial_merge_rolls_back( - MergeScenario::Adopt, - "branch_merge.adopt_after_upsert_pre_delete", - ) - .await; -} - -#[tokio::test] -#[serial] -#[serial(branch_merge_phase_b)] -async fn branch_merge_rewrite_partial_after_merge_rolls_back() { - assert_partial_merge_rolls_back( - MergeScenario::Rewrite, - "branch_merge.rewrite_after_merge_pre_delete", - ) - .await; -} - -#[tokio::test] -#[serial] -#[serial(branch_merge_phase_b)] -async fn branch_merge_rewrite_partial_after_delete_rolls_back() { - assert_partial_merge_rolls_back( - MergeScenario::Rewrite, - "branch_merge.rewrite_after_delete_pre_index", - ) - .await; -} - -/// Backward-compat: a `BranchMerge` sidecar written by a *pre-confirmation* -/// binary (schema_version 1, no `confirmed_version`) must NOT be misread as a -/// partial Phase B and rolled back. 
A pre-upgrade crash in the Phase-Bβ†’C gap can -/// leave such a sidecar over a *completed* merge; rolling it back would silently -/// discard a finished merge with no operator signal β€” the regression greptile / -/// Cursor flagged. -/// -/// We synthesize the pre-upgrade sidecar realistically: crash after Phase B (a -/// real sidecar + advanced Lance HEAD), then downgrade the on-disk JSON to the -/// v1 shape (`schema_version` = 1, strip every pin's `confirmed_version`) before -/// reopening β€” exactly what an old binary would have left. -/// -/// RED before the versioning fix: a v1 sidecar with no `confirmed_version` -/// classifies `IncompletePhaseB` β†’ rolls back β†’ `bob` is discarded. GREEN after: -/// the version-aware classifier reads v1 as the old loose generation β†’ rolls -/// forward β†’ `bob` preserved. -#[tokio::test] -#[serial] -#[serial(branch_merge_phase_b)] -async fn pre_upgrade_v1_branch_merge_sidecar_rolls_forward_not_back() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - // main {alice}; feature adds bob β†’ a fast-forward AdoptWithDelta merge, which - // writes a recovery sidecar. - { - let mut db = Omnigraph::init(&uri, helpers::TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"alice\",\"age\":30}}\n", - LoadMode::Append, - ) - .await - .unwrap(); - db.branch_create("feature").await.unwrap(); - db.mutate( - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "bob")], &[("$age", 40)]), - ) - .await - .unwrap(); - } - - // Crash after Phase B (Lance HEAD advanced, manifest not published) β†’ a real - // sidecar lands on disk. 
- { - let db = Omnigraph::open(&uri).await.unwrap(); - let _fp = ScopedFailPoint::new(names::BRANCH_MERGE_POST_PHASE_B_PRE_MANIFEST_COMMIT, "return"); - db.branch_merge("feature", "main").await.unwrap_err(); - } - - // Downgrade the sidecar to the pre-confirmation v1 shape an old binary writes. - { - let recovery_dir = std::path::Path::new(&uri).join("__recovery"); - let path = std::fs::read_dir(&recovery_dir) - .unwrap() - .filter_map(Result::ok) - .map(|e| e.path()) - .find(|p| p.extension().is_some_and(|x| x == "json")) - .expect("a recovery sidecar must exist after the post-Phase-B crash"); - let mut v: serde_json::Value = - serde_json::from_str(&std::fs::read_to_string(&path).unwrap()).unwrap(); - v["schema_version"] = serde_json::json!(1); - for table in v["tables"].as_array_mut().unwrap() { - table.as_object_mut().unwrap().remove("confirmed_version"); - } - std::fs::write(&path, serde_json::to_string_pretty(&v).unwrap()).unwrap(); - } - - // Reopen β†’ the pre-upgrade completed merge must roll FORWARD (bob kept), not - // be silently discarded. - { - let db = Omnigraph::open(&uri).await.unwrap(); - assert_eq!( - sorted_person_names(&db).await, - vec!["alice", "bob"], - "a pre-confirmation (v1) BranchMerge sidecar over a completed merge must roll \ - forward, not be misread as a partial and rolled back", - ); - } -} - /// Branch-axis variant of the branch_merge recovery test: target is a /// non-main branch. Catches the branch-specific commit-graph head bug /// (D2) β€” without `CommitGraph::open_at_branch`, the recovery sweep @@ -4018,8 +1360,6 @@ async fn pre_upgrade_v1_branch_merge_sidecar_rolls_forward_not_back() { /// target, and future merges between the same pair would lose /// already-up-to-date detection. 
#[tokio::test] -#[serial] -#[serial(branch_merge_phase_b)] async fn branch_merge_phase_b_failure_recovered_on_non_main_target() { use omnigraph::loader::{LoadMode, load_jsonl}; @@ -4083,7 +1423,7 @@ async fn branch_merge_phase_b_failure_recovered_on_non_main_target() { { let db = Omnigraph::open(&uri).await.unwrap(); let _failpoint = - ScopedFailPoint::new(names::BRANCH_MERGE_POST_PHASE_B_PRE_MANIFEST_COMMIT, "return"); + ScopedFailPoint::new("branch_merge.post_phase_b_pre_manifest_commit", "return"); let err = db .branch_merge("source_branch", "target_branch") .await @@ -4144,8 +1484,6 @@ async fn branch_merge_phase_b_failure_recovered_on_non_main_target() { /// keeps RewriteMerged tables on active_branch), the contract assertion /// catches a regression that reverts to `entry.table_branch.clone()`. #[tokio::test] -#[serial] -#[serial(branch_merge_phase_b)] async fn branch_merge_sidecar_pins_table_branch_to_active_branch() { use omnigraph::loader::{LoadMode, load_jsonl}; @@ -4186,7 +1524,7 @@ async fn branch_merge_sidecar_pins_table_branch_to_active_branch() { { let db = Omnigraph::open(&uri).await.unwrap(); let _failpoint = - ScopedFailPoint::new(names::BRANCH_MERGE_POST_PHASE_B_PRE_MANIFEST_COMMIT, "return"); + ScopedFailPoint::new("branch_merge.post_phase_b_pre_manifest_commit", "return"); let _ = db .branch_merge("source_branch", "target_branch") .await @@ -4247,7 +1585,6 @@ async fn branch_merge_sidecar_pins_table_branch_to_active_branch() { /// `needs_index_work_*` code path and the /// `recovery_ensure_indices_handles_empty_tables` integration test. 
#[tokio::test] -#[serial] async fn ensure_indices_phase_b_failure_does_not_leak_sidecar_when_no_work_needed() { use omnigraph::loader::{LoadMode, load_jsonl}; @@ -4278,7 +1615,7 @@ async fn ensure_indices_phase_b_failure_does_not_leak_sidecar_when_no_work_neede { let db = Omnigraph::open(&uri).await.unwrap(); let _failpoint = - ScopedFailPoint::new(names::ENSURE_INDICES_POST_PHASE_B_PRE_MANIFEST_COMMIT, "return"); + ScopedFailPoint::new("ensure_indices.post_phase_b_pre_manifest_commit", "return"); let err = db.ensure_indices().await.unwrap_err(); assert!( err.to_string().contains( @@ -4348,12 +1685,11 @@ async fn ensure_indices_phase_b_failure_does_not_leak_sidecar_when_no_work_neede // limitation. #[tokio::test] -#[serial] async fn init_failpoint_after_schema_pg_written_cleans_up_schema_file() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); - let _failpoint = ScopedFailPoint::new(names::INIT_AFTER_SCHEMA_PG_WRITTEN, "return"); + let _failpoint = ScopedFailPoint::new("init.after_schema_pg_written", "return"); let err = match Omnigraph::init(uri, helpers::TEST_SCHEMA).await { Ok(_) => panic!("expected Omnigraph::init to fail at the configured failpoint"), @@ -4375,12 +1711,11 @@ async fn init_failpoint_after_schema_pg_written_cleans_up_schema_file() { } #[tokio::test] -#[serial] async fn init_failpoint_after_schema_contract_written_cleans_up_all_schema_files() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); - let _failpoint = ScopedFailPoint::new(names::INIT_AFTER_SCHEMA_CONTRACT_WRITTEN, "return"); + let _failpoint = ScopedFailPoint::new("init.after_schema_contract_written", "return"); let err = match Omnigraph::init(uri, helpers::TEST_SCHEMA).await { Ok(_) => panic!("expected Omnigraph::init to fail at the configured failpoint"), @@ -4407,12 +1742,11 @@ async fn 
init_failpoint_after_schema_contract_written_cleans_up_all_schema_files } #[tokio::test] -#[serial] async fn init_failpoint_after_coordinator_init_cleans_up_schema_files() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); - let _failpoint = ScopedFailPoint::new(names::INIT_AFTER_COORDINATOR_INIT, "return"); + let _failpoint = ScopedFailPoint::new("init.after_coordinator_init", "return"); let err = match Omnigraph::init(uri, helpers::TEST_SCHEMA).await { Ok(_) => panic!("expected Omnigraph::init to fail at the configured failpoint"), @@ -4448,7 +1782,6 @@ async fn init_failpoint_after_coordinator_init_cleans_up_schema_files() { } #[tokio::test] -#[serial] async fn init_failpoint_returns_original_error_not_cleanup_error() { // The cleanup is best-effort. If `storage.delete` fails (e.g. transient // network blip on S3), the original init failpoint error must still @@ -4460,7 +1793,7 @@ async fn init_failpoint_returns_original_error_not_cleanup_error() { let _scenario = FailScenario::setup(); let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); - let _failpoint = ScopedFailPoint::new(names::INIT_AFTER_SCHEMA_PG_WRITTEN, "return"); + let _failpoint = ScopedFailPoint::new("init.after_schema_pg_written", "return"); let err = match Omnigraph::init(uri, helpers::TEST_SCHEMA).await { Ok(_) => panic!("expected Omnigraph::init to fail at the configured failpoint"), @@ -4473,153 +1806,3 @@ async fn init_failpoint_returns_original_error_not_cleanup_error() { "init error must surface the failpoint cause, got: {msg}" ); } - -// ── RFC-013 Phase 7 / FIX A: a transient legacy-open failure must abort the ── -// v3β†’v4 migration loudly, not silently swallow the lineage and stamp v4. -// -// `migrate_v3_to_v4` backfills graph lineage from `_graph_commits.lance` into -// `__manifest`, then stamps internal-schema v4. 
The migration runs exactly once -// per graph (`migrate_internal_schema` is `while stamp < CURRENT`). If a -// transient or corrupt `Dataset::open` of the legacy commit dataset is treated -// as "no legacy data" (the pre-fix `Err(_) => empty` arm), the migration backfills -// NOTHING and stamps v4 β€” orphaning the real lineage permanently, since the v3 -// fallback is then disabled. The fix matches the not-found variants (benign: -// genuinely no legacy data) and propagates anything else. -// -// This test injects a non-not-found Lance error at the legacy open via the -// `migration.v3_to_v4.legacy_open` failpoint. The load-bearing assertion is the -// last one: a once-transient failure leaves the graph RETRYABLE (stamp still v3, -// no lineage), so a later open with the fault cleared completes the migration β€” -// it was not a poison pill. -#[tokio::test] -async fn transient_legacy_open_failure_aborts_migration_without_stamping_v4() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - // A real pre-Phase-7 (v3) graph: lineage only in `_graph_commits.lance`, - // `__manifest` stamped v3 with no `graph_commit` rows. - let fixture = omnigraph::db::commit_graph::seed_legacy_v3_lineage(&uri) - .await - .unwrap(); - let (rows_before, stamp_before) = - omnigraph::db::manifest::lineage_row_count_and_stamp_for_test(&uri, None) - .await - .unwrap(); - assert_eq!(stamp_before, 3, "fixture is stamped v3"); - assert_eq!(rows_before, 0, "fixture has no lineage in __manifest"); - - // Arm the legacy-open fault and run the read-write migration entry point. 
- { - let _fp = ScopedFailPoint::new(names::MIGRATION_V3_TO_V4_LEGACY_OPEN, "return"); - let err = match omnigraph::db::manifest::migrate_on_open_for_test(&uri).await { - Ok(()) => panic!("migration must abort when the legacy open fails transiently"), - Err(e) => e, - }; - // The injected (non-not-found) Lance error must surface, not be masked. - let msg = err.to_string(); - assert!( - msg.contains("injected failpoint triggered: migration.v3_to_v4.legacy_open"), - "expected the injected legacy-open error to propagate, got: {msg}" - ); - } - - // The migration left NO drift: stamp still v3, still no lineage. (Pre-fix, - // the swallow would have stamped v4 with an empty backfill β€” permanent loss.) - let (rows_after_fault, stamp_after_fault) = - omnigraph::db::manifest::lineage_row_count_and_stamp_for_test(&uri, None) - .await - .unwrap(); - assert_eq!( - stamp_after_fault, 3, - "a transient legacy-open failure must NOT stamp the manifest to v4", - ); - assert_eq!( - rows_after_fault, 0, - "a transient legacy-open failure must NOT partially backfill lineage", - ); - - // The whole correctness claim: a once-transient failure is retryable. With the - // fault cleared, the next migration pass reads the legacy lineage and completes. - omnigraph::db::manifest::migrate_on_open_for_test(&uri) - .await - .unwrap(); - let (rows_done, stamp_done) = - omnigraph::db::manifest::lineage_row_count_and_stamp_for_test(&uri, None) - .await - .unwrap(); - assert_eq!(stamp_done, 4, "the retried migration stamps v4"); - assert_eq!( - rows_done, - fixture.all_ids.len(), - "the retried migration backfills every legacy commit", - ); -} - -// ── RFC-013 Phase 7 / FIX B follow-up: the v3β†’v4 stamp-bump retry loop must ── -// surface a RETRYABLE contention error on exhaustion, not a stringified Lance error. 
-// -// `commit_v4_stamp_idempotently` bumps the internal-schema stamp under concurrent -// runners: the `UpdateConfig` CAS loser gets `IncompatibleTransaction`, re-opens, -// confirms the winner stamped the same value, and is done. Genuine exhaustion (every -// attempt loses) must return a `RowLevelCasContention` so the publisher's OUTER retry -// completes the one-time open β€” an `OmniError::Lance` would be treated as fatal. The -// `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to lose, -// driving the otherwise-near-unreachable exhaustion path deterministically. (Pre-fix β€” -// `0..=BUDGET` + an `attempt < BUDGET` guard β€” the last iteration fell through to the -// stringifying `Err(e)` arm and returned a non-retryable `OmniError::Lance`.) -#[tokio::test] -async fn v4_stamp_exhaustion_returns_retryable_contention() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - // A real v3 graph: the backfill merge succeeds; only the terminal stamp loop - // is forced to exhaust. - let _fixture = omnigraph::db::commit_graph::seed_legacy_v3_lineage(&uri) - .await - .unwrap(); - - let _fp = ScopedFailPoint::new(names::MIGRATION_V4_STAMP_FORCE_INCOMPATIBLE, "return"); - let err = match omnigraph::db::manifest::migrate_on_open_for_test(&uri).await { - Ok(()) => panic!("migration must error when the stamp bump exhausts its retries"), - Err(e) => e, - }; - assert!( - matches!( - &err, - omnigraph::error::OmniError::Manifest(m) - if matches!( - m.details, - Some(omnigraph::error::ManifestConflictDetails::RowLevelCasContention) - ) - ), - "stamp-bump exhaustion must surface a RETRYABLE RowLevelCasContention so the \ - publisher's outer retry completes the open, got: {err:?}", - ); -} - -// The publisher's outer retry must re-run `load_publish_state` on a RETRYABLE error, -// not propagate it fatally. 
`load_publish_state` runs `migrate_internal_schema`, whose -// bounded merge/stamp loops surface a `RowLevelCasContention` on exhaustion EXPECTING -// this re-run (a clean second scan, by which point a concurrent winner has finished the -// migration). Before the fix, `load_publish_state().await?` short-circuited the loop — -// only `merge_rows` conflicts hit the retry — so the typed contention aborted the -// publish. Inject a ONE-SHOT retryable contention into `load_publish_state`: the write -// must still commit, because the publisher retries and the cleared second attempt wins. -#[tokio::test] -#[serial] -async fn publisher_retries_retryable_load_publish_state_error() { - let _scenario = FailScenario::setup(); - let dir = tempfile::tempdir().unwrap(); - let db = helpers::init_and_load(&dir).await; - - // `1*return`: fail only the FIRST `load_publish_state` of the next publish, so the - // retry's second call is clean. Set after `init_and_load` so its publishes are - // unaffected. 
- let _fp = ScopedFailPoint::new(names::PUBLISH_LOAD_STATE_RETRYABLE_CONTENTION, "1*return"); - let row = r#"{"type":"Person","data":{"name":"Grace","age":37}}"#; - db.load_as("main", None, row, LoadMode::Merge, None) - .await - .expect("publisher must retry the one-shot retryable load_publish_state error and commit"); -} diff --git a/crates/omnigraph/tests/fixtures/search.gq b/crates/omnigraph/tests/fixtures/search.gq index d53fbc9..c39af82 100644 --- a/crates/omnigraph/tests/fixtures/search.gq +++ b/crates/omnigraph/tests/fixtures/search.gq @@ -42,17 +42,3 @@ query hybrid_search($vq: Vector(4), $tq: String) { order { rrf(nearest($d.embedding, $vq), bm25($d.title, $tq)) } limit 3 } - -query rrf_two_fts($q: String) { - match { $d: Doc } - return { $d.slug, $d.title } - order { rrf(bm25($d.title, $q), bm25($d.body, $q)) } - limit 3 -} - -query rrf_two_vectors($q1: Vector(4), $q2: Vector(4)) { - match { $d: Doc } - return { $d.slug, $d.title } - order { rrf(nearest($d.embedding, $q1), nearest($d.embedding, $q2)) } - limit 3 -} diff --git a/crates/omnigraph/tests/forbidden_apis.rs b/crates/omnigraph/tests/forbidden_apis.rs index 667e8c5..1936815 100644 --- a/crates/omnigraph/tests/forbidden_apis.rs +++ b/crates/omnigraph/tests/forbidden_apis.rs @@ -29,21 +29,15 @@ //! the cross-table manifest commit. Documented exception. //! - `crates/omnigraph/src/storage_layer.rs` — IS the trait module. //! -//! ## Allow-list shape +//! ## Transitional allow-list //! -//! After MR-854, `db.storage()` (`&dyn TableStorage`) exposes only staged -//! primitives + reads. The inline-commit writes live on a separate -//! `InlineCommitResidual` trait reached via -//! `Omnigraph::storage_inline_residual()`, so the default storage surface -//! cannot couple "write bytes" with "advance HEAD" — engine code that -//! wants an inline residual must name the residual accessor explicitly. -//! The only residuals are `delete_where` (Lance #6658 / v7.x) and -//! 
`create_vector_index` (Lance #6666). The dead legacy methods -//! (trait `append_batch` / `merge_insert_batches`, inherent -//! `merge_insert_batch{,es}`, `create_{btree,inverted}_index`) were -//! removed entirely. This guard's scope is unchanged: it catches direct -//! `lance::*` inline-commit misuse outside the storage layer. The -//! file-level allow-list below matches that boundary. +//! The migration of writers onto staged primitives is incremental. +//! Several writers (ensure_indices, branch_merge, schema_apply rewrites) +//! already route through the staged primitives; others (bulk loader, +//! exec/mutation, exec/query) still use the legacy inherent +//! `TableStore` methods — they're not visible at the trait boundary, but +//! they DO call lance types. The file-level allow-list below reflects +//! this transitional state and tightens as call sites migrate. use std::path::{Path, PathBuf}; @@ -71,14 +65,6 @@ const FORBIDDEN_PATTERNS: &[&str] = &[ "Dataset::drop_columns", "Dataset::truncate_table", "Dataset::restore", - // Raw dataset OPENS — all reads must route through `Snapshot::open` (the - // held-handle cache + shared Session, Fix 3). Only the instrumented opener - // (`instrumentation.rs`) and the storage/manifest layers (allow-listed below) - // open datasets directly; forbidding these in the read/exec layer keeps a - // future read from silently bypassing the cache. - "Dataset::open", - "DatasetBuilder::from_uri", - "DatasetBuilder::from_namespace", // Lance-specific method names that don't clash with our `TableStore` // wrappers (we use `merge_insert_batch{,es}`, `add_columns_to_*`, // etc. — never the bare Lance names). Engine code that writes @@ -114,7 +100,6 @@ const ALLOW_LIST_FILES: &[&str] = &[ "commit_graph.rs", // Maintains `_graph_commits.lance` system table. "graph_coordinator.rs", // Drives the manifest publisher / branch coordinator. "recovery_audit.rs", // Maintains `_graph_commit_recoveries.lance` (recovery audit trail). 
- "instrumentation.rs", // The instrumented dataset opener (open_dataset_tracked / open_table_dataset). ]; /// Directories exempt from the guard. Files under these paths may use diff --git a/crates/omnigraph/tests/helpers/cost.rs b/crates/omnigraph/tests/helpers/cost.rs deleted file mode 100644 index 9c82229..0000000 --- a/crates/omnigraph/tests/helpers/cost.rs +++ /dev/null @@ -1,393 +0,0 @@ -//! Shared cost-budget test harness (RFC-013) — the single place the IO-counting -//! plumbing lives, so `warm_read_cost.rs`, `write_cost.rs`, and the S3 variant -//! assert in one vocabulary instead of duplicating `probes()` + raw `IOTracker` -//! reads. Three clean abstractions: structured counts, a `measure` primitive, a -//! named flat-assertion, plus store-agnostic backend fixtures. -//! -//! The data-table wrapper is a **path-classifying** counter (`PrefixCounter`), not a -//! plain `IOTracker`: it splits each read into the **opener** term (latest-version -//! resolution — reads of `_versions/`/`.manifest` objects) vs the **scan** term -//! (data-fragment reads, `data/`/`*.lance`). That lets a cost test isolate the -//! opener (RFC-013 step 3a's target, O(1) after the bypass) from the merge-insert/RI -//! scan (O(fragment-count), compaction's domain) even though both ride the same -//! `Dataset` — without controlling the fixture (no compaction needed). `__manifest` -//! and `_graph_commits` keep the plain `IOTracker` (no sub-prefixes worth splitting). 
-#![allow(dead_code)] - -use std::fmt; -use std::future::Future; -use std::sync::atomic::{AtomicU64, Ordering}; -use std::sync::{Arc, Mutex}; - -use async_trait::async_trait; -use futures::stream::BoxStream; -use lance::io::WrappingObjectStore; -use lance_io::utils::tracking_store::IOTracker; -use object_store::path::Path; -use object_store::{ - CopyOptions, GetOptions, GetResult, ListResult, MultipartUpload, ObjectMeta, ObjectStore, - PutMultipartOptions, PutOptions, PutPayload, PutResult, Result as OSResult, -}; - -use omnigraph::db::Omnigraph; -use omnigraph::instrumentation::{ - MergeWriteProbes, QueryIoProbes, with_merge_write_probes, with_query_io_probes, -}; -use omnigraph::loader::{LoadMode, load_jsonl}; - -use super::{MUTATION_QUERIES, TEST_DATA, TEST_SCHEMA, init_and_load, mixed_params}; - -/// Object-store op counts for one measured operation, by table class — the -/// vocabulary cost tests assert in (vs raw `IOTracker::stats().read_iops`). -#[derive(Debug, Clone, Copy, Default)] -pub struct IoCounts { - /// Per-table DATA opens (node/edge tables). The dominant write-path term. - pub data_reads: u64, - pub data_writes: u64, - /// DATA-table reads attributed to latest-version resolution (`_versions/`, - /// `.manifest`). This is the **opener** term step 3a flattened — isolated from - /// the scan, so it can be gated directly without compacting the fixture. - pub data_opener_reads: u64, - /// DATA-table reads attributed to data fragments (`data/`, `*.lance`) — the - /// merge-insert/RI **scan**, which grows with fragment count (compaction's - /// domain, not the opener). - pub data_scan_reads: u64, - /// `__manifest` registry scans (publish state). - pub manifest_reads: u64, - /// `_graph_commits` lineage scans. - pub commit_graph_reads: u64, - /// Version-probe invocations (the cheap freshness check). 
- pub version_probes: u64, - /// DATA-table open CALL count through the two instrumented chokepoints — an - /// exact open-invocation count (not the opener-read term), classified by URI so - /// internal/system-table opens are excluded. Step-3b target: - /// `data_open_count <= |touched_tables|` for a write. - pub data_open_count: u64, - /// Internal/system-table (`__manifest`, `_graph_commits*`) open CALL count — - /// the complement of `data_open_count` (publisher CAS + commit-graph append). - pub internal_open_count: u64, -} - -impl IoCounts { - pub fn total_reads(&self) -> u64 { - self.data_reads + self.manifest_reads + self.commit_graph_reads - } -} - -/// Which staged-write primitives an operation invoked (from `MergeWriteProbes`). -#[derive(Debug, Clone, Copy, Default)] -pub struct StagedCounts { - pub stage_append: u64, - pub stage_merge_insert: u64, - pub create_vector_index: u64, - pub scan_staged_combined: u64, -} - -// ── Path-classifying data-table read counter ── - -/// How a data-table object read is attributed. -enum ReadClass { - /// Latest-version resolution: `_versions/`, `.manifest`, `_latest`. - Opener, - /// Data fragments: `data/`, `*.lance`. - Scan, - /// Anything else (indices, deletion files, …) — counted in the total only. - Other, -} - -/// Classify a Lance object path by its role in a write open. Lance's on-object -/// layout is identical on local FS and S3, so this split is backend-independent. 
-fn classify(path: &Path) -> ReadClass { - let p = path.as_ref(); - if p.contains("_versions") || p.ends_with(".manifest") || p.contains("_latest") { - ReadClass::Opener - } else if p.contains("/data/") || p.starts_with("data/") || p.ends_with(".lance") { - ReadClass::Scan - } else { - ReadClass::Other - } -} - -#[derive(Debug, Default, Clone, Copy)] -struct PrefixCounts { - reads: u64, - writes: u64, - opener_reads: u64, - scan_reads: u64, -} - -/// A `WrappingObjectStore` that counts reads/writes and attributes each read to the -/// opener vs scan term by object-key prefix. Shares its tally via `Arc<Mutex<PrefixCounts>>` so -/// the wrapped store (handed to Lance) and the test read the same counters. -#[derive(Debug, Default, Clone)] -struct PrefixCounter(Arc<Mutex<PrefixCounts>>); - -impl PrefixCounter { - fn record_read(&self, location: &Path) { - let mut c = self.0.lock().unwrap(); - c.reads += 1; - match classify(location) { - ReadClass::Opener => c.opener_reads += 1, - ReadClass::Scan => c.scan_reads += 1, - ReadClass::Other => {} - } - } - - fn record_write(&self) { - self.0.lock().unwrap().writes += 1; - } - - fn snapshot(&self) -> PrefixCounts { - *self.0.lock().unwrap() - } -} - -impl WrappingObjectStore for PrefixCounter { - fn wrap(&self, _store_prefix: &str, target: Arc<dyn ObjectStore>) -> Arc<dyn ObjectStore> { - Arc::new(PrefixCountingStore { - target, - counter: self.clone(), - }) - } -} - -/// The wrapped `ObjectStore` that records each call into a [`PrefixCounter`]. -/// Implements only the required core `ObjectStore` methods (object_store 0.13: the -/// convenience surface — `get`/`put`/`head`/`get_range`/… — lives on -/// `ObjectStoreExt` and is provided by a blanket impl that routes through `get_opts` -/// / `put_opts`, so every read/write is still counted here). Per-read path -/// classification is the only addition over a plain pass-through. 
-#[derive(Debug)] -struct PrefixCountingStore { - target: Arc<dyn ObjectStore>, - counter: PrefixCounter, -} - -impl fmt::Display for PrefixCountingStore { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - write!(f, "PrefixCountingStore({})", self.target) - } -} - -#[async_trait] -impl ObjectStore for PrefixCountingStore { - async fn put_opts( - &self, - location: &Path, - payload: PutPayload, - opts: PutOptions, - ) -> OSResult<PutResult> { - self.counter.record_write(); - self.target.put_opts(location, payload, opts).await - } - - async fn put_multipart_opts( - &self, - location: &Path, - opts: PutMultipartOptions, - ) -> OSResult<Box<dyn MultipartUpload>> { - self.counter.record_write(); - self.target.put_multipart_opts(location, opts).await - } - - async fn get_opts(&self, location: &Path, options: GetOptions) -> OSResult<GetResult> { - self.counter.record_read(location); - self.target.get_opts(location, options).await - } - - fn delete_stream( - &self, - locations: BoxStream<'static, OSResult<Path>>, - ) -> BoxStream<'static, OSResult<Path>> { - self.target.delete_stream(locations) - } - - fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, OSResult<ObjectMeta>> { - self.counter.record_read(&prefix.cloned().unwrap_or_default()); - self.target.list(prefix) - } - - fn list_with_offset( - &self, - prefix: Option<&Path>, - offset: &Path, - ) -> BoxStream<'static, OSResult<ObjectMeta>> { - self.counter.record_read(&prefix.cloned().unwrap_or_default()); - self.target.list_with_offset(prefix, offset) - } - - async fn list_with_delimiter(&self, prefix: Option<&Path>) -> OSResult<ListResult> { - self.counter.record_read(&prefix.cloned().unwrap_or_default()); - self.target.list_with_delimiter(prefix).await - } - - async fn copy_opts(&self, from: &Path, to: &Path, options: CopyOptions) -> OSResult<()> { - self.counter.record_write(); - self.target.copy_opts(from, to, options).await - } -} - -/// The tracker handles backing one measurement; read once into [`IoCounts`]. 
-struct ProbeHandles { - manifest: IOTracker, - commit_graph: IOTracker, - table: PrefixCounter, - probe_count: Arc<AtomicU64>, - data_open_count: Arc<AtomicU64>, - internal_open_count: Arc<AtomicU64>, -} - -impl ProbeHandles { - fn install() -> (QueryIoProbes, Self) { - let h = ProbeHandles { - manifest: IOTracker::default(), - commit_graph: IOTracker::default(), - table: PrefixCounter::default(), - probe_count: Arc::new(AtomicU64::new(0)), - data_open_count: Arc::new(AtomicU64::new(0)), - internal_open_count: Arc::new(AtomicU64::new(0)), - }; - let probes = QueryIoProbes { - manifest_wrapper: Some(Arc::new(h.manifest.clone()) as Arc<dyn WrappingObjectStore>), - commit_graph_wrapper: Some( - Arc::new(h.commit_graph.clone()) as Arc<dyn WrappingObjectStore> - ), - table_wrapper: Some(Arc::new(h.table.clone()) as Arc<dyn WrappingObjectStore>), - probe_count: Arc::clone(&h.probe_count), - data_open_count: Arc::clone(&h.data_open_count), - internal_open_count: Arc::clone(&h.internal_open_count), - }; - (probes, h) - } - - fn counts(&self) -> IoCounts { - let t = self.table.snapshot(); - IoCounts { - data_reads: t.reads, - data_writes: t.writes, - data_opener_reads: t.opener_reads, - data_scan_reads: t.scan_reads, - manifest_reads: self.manifest.stats().read_iops, - commit_graph_reads: self.commit_graph.stats().read_iops, - version_probes: self.probe_count.load(Ordering::Relaxed), - data_open_count: self.data_open_count.load(Ordering::Relaxed), - internal_open_count: self.internal_open_count.load(Ordering::Relaxed), - } - } -} - -/// Run `op` under object-store IO counting; return its output + the counts. -/// The only place the `QueryIoProbes` task-local + tracker wiring lives. -pub async fn measure<F: Future>(op: F) -> (F::Output, IoCounts) { - let (probes, handles) = ProbeHandles::install(); - let out = with_query_io_probes(probes, op).await; - (out, handles.counts()) -} - -/// Like [`measure`], but also capture which staged-write primitives ran -/// (composes the two task-locals cleanly). 
-pub async fn measure_with_staged<F: Future>(op: F) -> (F::Output, IoCounts, StagedCounts) { - let (probes, handles) = ProbeHandles::install(); - let merge = MergeWriteProbes::default(); - let out = with_merge_write_probes(merge.clone(), with_query_io_probes(probes, op)).await; - let staged = StagedCounts { - stage_append: merge.stage_append_calls(), - stage_merge_insert: merge.stage_merge_insert_calls(), - create_vector_index: merge.create_vector_index_calls(), - scan_staged_combined: merge.scan_staged_combined_calls(), - }; - (out, handles.counts(), staged) -} - -/// Assert a per-depth metric is flat: the deepest sample must not exceed the -/// shallowest by more than `slack`. `select` picks the field; `what` names it in -/// the failure message. The shape every depth-swept cost gate uses. -pub fn assert_flat( - curve: &[(u64, IoCounts)], - select: impl Fn(&IoCounts) -> u64, - slack: u64, - what: &str, -) { - assert!(curve.len() >= 2, "assert_flat needs >= 2 depth points"); - let (d_lo, lo) = (curve[0].0, select(&curve[0].1)); - let (d_hi, hi) = (curve[curve.len() - 1].0, select(&curve[curve.len() - 1].1)); - assert!( - hi <= lo + slack, - "{what} grew with history: depth {d_lo} = {lo} -> depth {d_hi} = {hi} (slack {slack})" - ); -} - -/// Assert a per-depth metric *does* grow with history by at least `min_delta` — the -/// dual of [`assert_flat`], used to prove a term is genuinely history-dependent (so a -/// flat sibling term isn't flat merely because nothing was measured). 
-pub fn assert_grows( - curve: &[(u64, IoCounts)], - select: impl Fn(&IoCounts) -> u64, - min_delta: u64, - what: &str, -) { - assert!(curve.len() >= 2, "assert_grows needs >= 2 depth points"); - let (d_lo, lo) = (curve[0].0, select(&curve[0].1)); - let (d_hi, hi) = (curve[curve.len() - 1].0, select(&curve[curve.len() - 1].1)); - assert!( - hi >= lo + min_delta, - "{what} did not grow as expected: depth {d_lo} = {lo} -> depth {d_hi} = {hi} (min delta {min_delta})" - ); -} - -/// Measure one committing `insert_person` to `main` — the canonical write the cost -/// gates sweep over commit-history depth. Shared by `write_cost.rs` and -/// `write_cost_s3.rs` so the measured write is defined once. -pub async fn measure_insert(db: &mut Omnigraph, tag: &str) -> IoCounts { - let (res, io) = measure(db.mutate( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", tag)], &[("$age", 30)]), - )) - .await; - res.unwrap(); - io -} - -/// Like [`measure_insert`] but carries an actor, so the write appends to and reads -/// `_graph_commit_actors.lance` — the authenticated (server/CLI) write path. The -/// commit-graph IO wrapper covers both `_graph_commits` and `_graph_commit_actors`, -/// so `IoCounts::commit_graph_reads` includes the actor-table scan on this path. -pub async fn measure_insert_as(db: &mut Omnigraph, tag: &str, actor: &str) -> IoCounts { - let (res, io) = measure(db.mutate_as( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", tag)], &[("$age", 30)]), - Some(actor), - )) - .await; - res.unwrap(); - io -} - -// ── Backend fixtures — one knob, store-agnostic body ── - -/// Local tempdir graph (default; deterministic, every-PR). -pub async fn local_graph(dir: &tempfile::TempDir) -> Omnigraph { - init_and_load(dir).await -} - -/// Emulated-S3 graph, bucket-gated. Returns `None` **only** when -/// `OMNIGRAPH_S3_TEST_BUCKET` is unset, so the caller logs + skips — the -/// `tests/s3_storage.rs` graceful-skip pattern. 
Once the bucket *is* configured -/// (the rustfs CI job), any `init`/seed failure is a real failure and panics -/// rather than silently skipping — otherwise a down/misconfigured store would let -/// a bucket-gated gate pass vacuously. `name` disambiguates the prefix. -pub async fn s3_graph(name: &str) -> Option<Omnigraph> { - let bucket = std::env::var("OMNIGRAPH_S3_TEST_BUCKET").ok()?; - let uri = format!("s3://{bucket}/cost-tests/{name}-{}", std::process::id()); - let mut db = Omnigraph::init(&uri, TEST_SCHEMA) - .await - .expect("OMNIGRAPH_S3_TEST_BUCKET is set but S3 graph init failed"); - load_jsonl(&mut db, TEST_DATA, LoadMode::Overwrite) - .await - .expect("OMNIGRAPH_S3_TEST_BUCKET is set but S3 seed load failed"); - Some(db) -} diff --git a/crates/omnigraph/tests/helpers/failpoint.rs b/crates/omnigraph/tests/helpers/failpoint.rs deleted file mode 100644 index 0c93670..0000000 --- a/crates/omnigraph/tests/helpers/failpoint.rs +++ /dev/null @@ -1,84 +0,0 @@ -//! Deterministic rendezvous for concurrent failpoint tests. -//! -//! The pattern: park the FIRST thread that hits a failpoint until the test -//! explicitly releases it, while later arrivals fall through. This replaces -//! fixed "guess" `sleep`s for cross-thread coordination — the test waits on -//! the *condition* (the point was reached) with a bounded timeout that fails -//! loudly, instead of betting a fixed duration is long enough. -//! -//! Extracted from the open-coded `AtomicBool` + callback pattern that -//! `fork_collision_with_live_concurrent_fork_is_retryable` proved out. -//! -//! The `reached` flag also doubles as a fired-assertion: a point that is -//! never hit makes [`Rendezvous::wait_until_reached`] panic, so a typo'd or -//! misplaced failpoint cannot pass silently. - -use std::sync::Arc; -use std::sync::atomic::{AtomicBool, Ordering::SeqCst}; -use std::time::Duration; - -use omnigraph::failpoints::ScopedFailPoint; - -/// A parked-on-first-arrival rendezvous bound to a failpoint name. 
The -/// underlying callback is RAII-cleaned when this guard drops. -pub struct Rendezvous { - name: String, - reached: Arc<AtomicBool>, - release: Arc<AtomicBool>, - _failpoint: ScopedFailPoint, -} - -impl Rendezvous { - /// Register `name` so the FIRST thread to hit it records readiness and - /// blocks until [`release`](Self::release); later arrivals fall through - /// immediately. The park is bounded (~30s) so a test bug cannot hang the - /// suite forever. - pub fn park_first(name: &str) -> Self { - let reached = Arc::new(AtomicBool::new(false)); - let release = Arc::new(AtomicBool::new(false)); - let (cb_reached, cb_release) = (Arc::clone(&reached), Arc::clone(&release)); - let _failpoint = ScopedFailPoint::with_callback(name, move || { - if cb_reached - .compare_exchange(false, true, SeqCst, SeqCst) - .is_ok() - { - // ~30s bound (6000 * 5ms); released earlier on the common path. - for _ in 0..6000 { - if cb_release.load(SeqCst) { - return; - } - std::thread::sleep(Duration::from_millis(5)); - } - } - }); - Self { - name: name.to_string(), - reached, - release, - _failpoint, - } - } - - /// Async-wait until the parked thread has reached the failpoint, polling - /// the readiness condition with a bounded (~12s) timeout. Panics if the - /// point is never hit — the fired-assertion. - pub async fn wait_until_reached(&self) { - for _ in 0..2400 { - if self.reached.load(SeqCst) { - return; - } - tokio::time::sleep(Duration::from_millis(5)).await; - } - panic!("rendezvous: failpoint '{}' was never reached", self.name); - } - - /// Whether the parked thread has reached the failpoint yet. - pub fn reached(&self) -> bool { - self.reached.load(SeqCst) - } - - /// Release the parked thread so it resumes past the failpoint. 
- pub fn release(&self) { - self.release.store(true, SeqCst); - } -} diff --git a/crates/omnigraph/tests/helpers/mod.rs b/crates/omnigraph/tests/helpers/mod.rs index 13127f2..c97ff72 100644 --- a/crates/omnigraph/tests/helpers/mod.rs +++ b/crates/omnigraph/tests/helpers/mod.rs @@ -1,8 +1,5 @@ #![allow(dead_code)] -pub mod cost; -#[cfg(feature = "failpoints")] -pub mod failpoint; pub mod recovery; use arrow_array::{Array, RecordBatch, StringArray}; @@ -57,19 +54,6 @@ pub async fn init_and_load(dir: &tempfile::TempDir) -> Omnigraph { db } -/// On-disk Lance dataset URI for a node type, mirroring the engine's -/// `nodes/{fnv1a(type)}` layout. Used by tests that reach the raw Lance -/// dataset to forge or inspect branch state. (Local copies exist in -/// `failpoints.rs` / `maintenance.rs`; this is the shared one for new tests.) -pub fn node_table_uri(root: &str, type_name: &str) -> String { - let mut hash: u64 = 0xcbf2_9ce4_8422_2325; - for &b in type_name.as_bytes() { - hash ^= b as u64; - hash = hash.wrapping_mul(0x100_0000_01b3); - } - format!("{}/nodes/{hash:016x}", root.trim_end_matches('/')) -} - /// Read all rows from a sub-table by table_key. pub async fn read_table(db: &Omnigraph, table_key: &str) -> Vec { let snap = snapshot_main(db).await.unwrap(); @@ -169,37 +153,6 @@ pub async fn mutate_branch( db.mutate(branch, query_source, query_name, params).await } -/// Advance the manifest version `n` times (one commit per insert), building -/// deep commit history for cost-budget tests (history depth, not row count). -pub async fn commit_many(db: &mut Omnigraph, n: usize) { - for i in 0..n { - mutate_main( - db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", &format!("commit_many_{i}"))], &[("$age", 30)]), - ) - .await - .unwrap(); - } -} - -/// Like [`commit_many`] but every commit carries an actor, so it grows -/// `_graph_commit_actors.lance` too — the authenticated (server/CLI) write path. 
-pub async fn commit_many_as(db: &mut Omnigraph, n: usize, actor: &str) { - for i in 0..n { - db.mutate_as( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", &format!("commit_many_as_{i}"))], &[("$age", 30)]), - Some(actor), - ) - .await - .unwrap(); - } -} - pub async fn snapshot_main(db: &Omnigraph) -> Result { db.snapshot_of(ReadTarget::branch("main")).await } @@ -242,14 +195,6 @@ pub async fn diff_since_branch( .await } -/// Advance a Lance dataset HEAD directly from tests without going through -/// OmniGraph's storage residual surface. Used to synthesize uncovered drift. -pub async fn lance_delete_inline(ds: &mut lance::Dataset, filter: &str) -> usize { - let result = ds.delete(filter).await.unwrap(); - *ds = (*result.new_dataset).clone(); - result.num_deleted_rows as usize -} - /// Build a ParamMap from string key-value pairs. pub fn params(pairs: &[(&str, &str)]) -> ParamMap { pairs @@ -291,15 +236,6 @@ pub fn vector_param(name: &str, values: &[f32]) -> ParamMap { map } -/// Build a ParamMap with two vector params. -pub fn two_vector_params(name1: &str, vals1: &[f32], name2: &str, vals2: &[f32]) -> ParamMap { - let mut map = vector_param(name1, vals1); - let key = name2.strip_prefix('$').unwrap_or(name2).to_string(); - let lit = Literal::List(vals2.iter().map(|v| Literal::Float(*v as f64)).collect()); - map.insert(key, lit); - map -} - /// Build a ParamMap with a vector param and a string param. pub fn vector_and_string_params( vec_name: &str, @@ -313,27 +249,6 @@ pub fn vector_and_string_params( map } -/// Test-only helper: perform a raw `Dataset::append` against Lance, -/// advancing Lance HEAD without going through the manifest. Used by -/// `recovery::*` and `staged_writes::*` tests that deliberately set up -/// HEAD-ahead-of-manifest drift scenarios. 
-/// -/// This mirrors the body of the engine's inline-commit -/// `TableStore::append_batch` (which is `pub(crate)` after MR-854) — -/// kept here as a test helper because integration tests need to -/// simulate drift without depending on the demoted crate-internal API. -pub async fn lance_append_inline(ds: &mut lance::Dataset, batch: RecordBatch) { - use lance::dataset::{WriteMode, WriteParams}; - let schema = batch.schema(); - let reader = arrow_array::RecordBatchIterator::new(vec![Ok(batch)], schema); - let params = WriteParams { - mode: WriteMode::Append, - allow_external_blob_outside_bases: true, - ..Default::default() - }; - ds.append(reader, Some(params)).await.unwrap(); -} - pub fn s3_test_graph_uri(suite: &str) -> Option { let bucket = std::env::var("OMNIGRAPH_S3_TEST_BUCKET").ok()?; let prefix = std::env::var("OMNIGRAPH_S3_TEST_PREFIX") diff --git a/crates/omnigraph/tests/helpers/recovery.rs b/crates/omnigraph/tests/helpers/recovery.rs index 4cb45e0..c76009e 100644 --- a/crates/omnigraph/tests/helpers/recovery.rs +++ b/crates/omnigraph/tests/helpers/recovery.rs @@ -143,39 +143,6 @@ pub fn sidecar_operation_ids(graph_root: &Path) -> Vec { ids } -/// Recovery-audit rows' `recovery_kind` values at `graph_root`, in -/// storage order. Empty when the audit dataset doesn't exist yet. 
-pub async fn recovery_audit_kinds(graph_root: &Path) -> Vec { - let recoveries_dir = graph_root.join("_graph_commit_recoveries.lance"); - if !recoveries_dir.exists() { - return Vec::new(); - } - let ds = Dataset::open(recoveries_dir.to_str().unwrap()) - .await - .expect("recoveries dataset opens"); - let batches: Vec = ds - .scan() - .try_into_stream() - .await - .unwrap() - .try_collect() - .await - .unwrap(); - let mut out = Vec::new(); - for batch in batches { - let kinds = batch - .column_by_name("recovery_kind") - .expect("recovery_kind column present") - .as_any() - .downcast_ref::() - .expect("recovery_kind is Utf8"); - for i in 0..kinds.len() { - out.push(kinds.value(i).to_string()); - } - } - out -} - pub async fn branch_head_commit_id(graph_root: &Path, branch: &str) -> Result { let graph = match branch { "main" => CommitGraph::open(&graph_uri(graph_root)).await?, @@ -214,9 +181,6 @@ pub async fn assert_post_recovery_invariants( "audit row for {operation_id} recorded the wrong recovery_kind", ); assert_rollback_outcomes_record_drift(&audit); - // Roll-back now publishes the restored HEAD, so manifest == Lance - // HEAD afterward (symmetric with roll-forward) — no residual drift. 
- assert_manifest_pins_match_lance_heads(graph_root, &tables).await?; assert_recovery_commit_shape(graph_root, &audit, &tables).await?; assert_non_main_did_not_move_main(graph_root, &tables).await?; assert_idempotent_reopen(graph_root, operation_id).await?; diff --git a/crates/omnigraph/tests/lance_surface_guards.rs b/crates/omnigraph/tests/lance_surface_guards.rs index d34080b..b65a808 100644 --- a/crates/omnigraph/tests/lance_surface_guards.rs +++ b/crates/omnigraph/tests/lance_surface_guards.rs @@ -30,17 +30,9 @@ use arrow_schema::{DataType, Field, Schema}; use lance::Dataset; use lance::dataset::builder::DatasetBuilder; use lance::dataset::optimize::{CompactionOptions, compact_files}; -use lance::dataset::transaction::Operation; use lance::dataset::write::delete::DeleteResult; -use lance::dataset::{ - CommitBuilder, InsertBuilder, MergeInsertBuilder, WhenMatched, WhenNotMatched, WriteMode, - WriteParams, -}; -use lance::index::DatasetIndexExt; +use lance::dataset::{MergeInsertBuilder, WhenMatched, WhenNotMatched, WriteMode, WriteParams}; use lance_file::version::LanceFileVersion; -use lance_index::IndexType; -use lance_index::optimize::OptimizeOptions; -use lance_index::scalar::ScalarIndexParams; use lance_namespace::LanceNamespace; use lance_table::io::commit::ManifestNamingScheme; @@ -86,83 +78,6 @@ async fn lance_error_too_much_write_contention_variant_exists() { ); } -// --- Guard 1a: LanceError::IncompatibleTransaction variant exists ---------- -// -// `db/manifest/migrations.rs::commit_v4_stamp_idempotently` pattern-matches on -// this variant: two concurrent v3→v4 runners both bump the internal-schema stamp -// (an `UpdateConfig` commit on the same metadata key), and the loser gets -// `IncompatibleTransaction`. Since both write the same value the conflict is -// benign and is retried idempotently. If Lance renames the variant or removes the -// builder, the match silently stops catching the conflict — this guard fails to -// force an update. 
- -#[tokio::test] -async fn lance_error_incompatible_transaction_variant_exists() { - let err = - lance::Error::incompatible_transaction_source("concurrent UpdateConfig at version N".into()); - assert!( - matches!(err, lance::Error::IncompatibleTransaction { .. }), - "Lance::Error::IncompatibleTransaction variant missing or renamed; \ - update db/manifest/migrations.rs::commit_v4_stamp_idempotently and \ - this guard, then re-pin docs/dev/lance.md." - ); -} - -// --- Guard 1c: LanceError::DatasetAlreadyExists variant exists -------------- -// -// `db/commit_graph.rs` and `db/recovery_audit.rs` create internal Lance tables -// with a create-or-open idempotency fallback: a concurrent/prior create races, -// and the `DatasetAlreadyExists` arm falls back to `Dataset::open`. They match -// the typed variant, NOT the display string ("Dataset already exists: ..."), -// which is not a Lance API contract. If Lance renames the variant the match -// silently stops catching the race and a re-create errors instead of opening — -// this guard turns red to force an update. - -#[tokio::test] -async fn lance_error_dataset_already_exists_variant_exists() { - let err = lance::Error::dataset_already_exists("guard"); - assert!( - matches!(err, lance::Error::DatasetAlreadyExists { .. }), - "Lance::Error::DatasetAlreadyExists variant missing or renamed; update the \ - db/commit_graph.rs + db/recovery_audit.rs create-or-open fallbacks and \ - this guard, then re-pin docs/dev/lance.md." - ); -} - -// --- Guard 1b: Dataset::open on a missing path returns a not-found variant -- -// -// `db/commit_graph.rs::read_legacy_commit_cache` (the v3→v4 lineage migration -// source) classifies a legacy-open error: a genuine not-found is the benign -// "no legacy data" signal (empty cache), and ANY OTHER error propagates loudly -// rather than being read as "empty" — a swallow there would let the migration -// stamp v4 over an empty backfill, orphaning real lineage permanently. 
That -// classification relies on Lance mapping an object-store NotFound to -// `DatasetNotFound` (or, for some paths, `NotFound`). If a Lance bump emits a -// different variant for a missing dataset, the migration would propagate a -// genuine "no legacy data" as a hard error — this guard turns red to force the -// classifier (and this guard) to be updated together. - -#[tokio::test] -async fn dataset_open_missing_returns_not_found_variant() { - let dir = tempfile::tempdir().unwrap(); - // A path that was never written — nothing to open. - let missing = dir.path().join("does-not-exist.lance"); - let err = match Dataset::open(missing.to_str().unwrap()).await { - Ok(_) => panic!("opening a never-written dataset path must error"), - Err(e) => e, - }; - assert!( - matches!( - err, - lance::Error::DatasetNotFound { .. } | lance::Error::NotFound { .. } - ), - "Dataset::open on a missing path no longer returns DatasetNotFound/NotFound \ - (got: {err:?}); update db/commit_graph.rs::read_legacy_commit_cache's \ - legacy-open classification and this guard together, then re-pin \ - docs/dev/lance.md." - ); -} - // --- Guard 2: ManifestLocation field shape --------------------------------- // // `db/manifest/metadata.rs:84-88` reads `.path`, `.size`, `.e_tag`, @@ -307,33 +222,6 @@ async fn _compile_compact_files_signature() -> lance::Result<()> { Ok(()) } -// --- Guard 7b: transaction history exposes repair's classification surface - -// -// `db/omnigraph/repair.rs` reads Lance transactions between manifest and HEAD -// and treats only `ReserveFragments` + `Rewrite` as safe maintenance drift. -// Compile-only.
- -#[allow( - dead_code, - unreachable_code, - unused_variables, - unused_mut, - clippy::diverging_sub_expression -)] -async fn _compile_transaction_history_for_repair_signature() -> lance::Result<()> { - let ds: Dataset = unimplemented!(); - let tx = ds.read_transaction_by_version(1u64).await?; - if let Some(tx) = tx { - let operation = tx.operation; - let _name: &str = operation.name(); - match operation { - Operation::Rewrite { .. } | Operation::ReserveFragments { .. } => {} - _ => {} - } - } - Ok(()) -} - // --- Guard 8: Dataset::delete returns DeleteResult { new_dataset, num_deleted_rows } --- // // `table_store.rs::delete_where` consumes both fields. When MR-A migrates @@ -354,794 +242,3 @@ async fn _compile_delete_result_field_shape() -> lance::Result<()> { let _num_deleted: u64 = result.num_deleted_rows; Ok(()) } - -// --- Guard 9: force_delete_branch semantics -------------------------------- -// -// The branch-delete reconciler (`db/omnigraph/optimize.rs::reconcile_orphaned_branches`) -// and the eager best-effort reclaim in `cleanup_deleted_branch_tables` call -// `force_delete_branch` to drop orphaned branch refs. The single-authority -// design relies on three facts pinned here: -// 1. plain `delete_branch` errors on a missing ref (so the design uses the -// force variant instead); -// 2. `force_delete_branch` removes an existing (forked) branch — the orphan -// case, where a `tree/{branch}/` exists; -// 3. `force_delete_branch` on a *fully-absent* branch (no tree dir) still -// errors on the local store, because `remove_dir_all`'s NotFound is not -// caught for Lance's native error variant. `TableStore::force_delete_branch` -// wraps this to be fully idempotent. Pin the raw quirk so a future Lance -// fix (which would let us simplify the wrapper) is noticed.
- -#[tokio::test] -async fn force_delete_branch_semantics() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().join("guard9.lance"); - let uri = uri.to_str().unwrap(); - let mut ds = fresh_dataset(uri).await; - - // (1) Plain delete of a never-created branch errors (RefNotFound). - assert!( - ds.delete_branch("nope").await.is_err(), - "Dataset::delete_branch on a missing ref should error; if this is now \ - Ok, the reconciler could drop the force variant." - ); - - // (2) force_delete_branch removes an existing (forked) branch. - let base = ds.version().version; - ds.create_branch("feature", base, None).await.unwrap(); - ds.force_delete_branch("feature").await.unwrap(); - assert!( - !ds.list_branches().await.unwrap().contains_key("feature"), - "force_delete_branch should remove an existing branch ref" - ); - - // (3) Quirk: force_delete on a fully-absent branch errors on the local - // store (worked around by TableStore::force_delete_branch). - assert!( - ds.force_delete_branch("never").await.is_err(), - "force_delete_branch on a fully-absent branch no longer errors β€” \ - TableStore::force_delete_branch's NotFound tolerance can be simplified." - ); -} - -// --- Guard 10: blob-column compaction is still broken in this Lance -------- -// -// `db/omnigraph/optimize.rs` skips tables with blob columns while -// `LANCE_SUPPORTS_BLOB_COMPACTION = false`: Lance `compact_files` forces -// `BlobHandling::AllBinary`, and the blob-v2 struct decoder mis-counts columns -// ("more fields in the schema than provided column indices"), failing even a -// pristine uniform-V2_2 multi-fragment blob table. Reads are unaffected (they -// use descriptor handling). -// -// WHEN THIS TEST TURNS RED (compact_files no longer errors), the Lance bug is -// fixed: flip `LANCE_SUPPORTS_BLOB_COMPACTION` to true in optimize.rs, drop the -// blob-skip branch + the `optimize_skips_blob_table_and_reports_skip` -// skip assertions in maintenance.rs, and re-pin docs/dev/lance.md. 
- -#[tokio::test] -async fn compact_files_still_fails_on_blob_columns() { - use arrow_array::{LargeBinaryArray, StructArray}; - - fn blob_batch(start: i32, n: i32) -> RecordBatch { - let ids: Vec = (start..start + n).map(|i| format!("n{i}")).collect(); - let data = - LargeBinaryArray::from_iter_values((start..start + n).map(|i| format!("blob{i}"))); - let blob_uri = StringArray::from(vec![None::<&str>; n as usize]); - let DataType::Struct(fields) = lance::blob::blob_field("content", true).data_type().clone() - else { - unreachable!("blob_field is always a Struct"); - }; - let content = StructArray::new( - fields, - vec![Arc::new(data) as _, Arc::new(blob_uri) as _], - None, - ); - let schema = Arc::new(Schema::new(vec![ - Field::new("id", DataType::Utf8, false), - lance::blob::blob_field("content", true), - ])); - RecordBatch::try_new( - schema, - vec![ - Arc::new(StringArray::from(ids)) as _, - Arc::new(content) as _, - ], - ) - .unwrap() - } - - async fn write(uri: &str, batch: RecordBatch, mode: WriteMode) { - let schema = batch.schema(); - let reader = RecordBatchIterator::new(vec![Ok(batch)], schema); - // Blob v2 requires file version >= 2.2; without the pin the *write* - // would fail with a different error, masking the guard's intent. - let params = WriteParams { - mode, - enable_stable_row_ids: true, - data_storage_version: Some(LanceFileVersion::V2_2), - ..Default::default() - }; - Dataset::write(reader, uri, Some(params)).await.unwrap(); - } - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().join("guard10-blob.lance"); - let uri = uri.to_str().unwrap(); - - // Uniform V2_2, two fragments β†’ forces compaction to actually rewrite. 
- write(uri, blob_batch(0, 2), WriteMode::Create).await; - write(uri, blob_batch(100, 2), WriteMode::Append).await; - - let mut ds = Dataset::open(uri).await.unwrap(); - assert!( - ds.get_fragments().len() >= 2, - "guard needs a multi-fragment table to trigger a real compaction rewrite" - ); - - let result = compact_files(&mut ds, CompactionOptions::default(), None).await; - let err = result.expect_err( - "compact_files unexpectedly SUCCEEDED on a blob table β€” the Lance blob-v2 \ - compaction bug is fixed. Flip LANCE_SUPPORTS_BLOB_COMPACTION to true in \ - db/omnigraph/optimize.rs, remove the blob-skip branch, and re-pin docs/dev/lance.md.", - ); - assert!( - err.to_string() - .contains("more fields in the schema than provided column indices"), - "blob compaction failed with an unexpected error (Lance internals may have \ - shifted): {err}" - ); -} - -// --- Guard 11: scalar-index coverage surface (physical_rows + index details) --- -// -// `table_store.rs::key_column_index_coverage` mirrors Lance's `create_filter_plan` -// C6 fallback: it reads `fragment.physical_rows` (the field whose absence on ANY -// fragment disables the scalar index for the whole scan) and sniffs the BTREE via -// `load_indices()` β†’ `index.fields` / `index.index_details.type_url`. This is the -// one real Lance-internal coupling on the indexed-traversal read path. If any of -// these surfaces renames or changes type, the coverage check (and the cost-based -// traversal chooser that consumes it) silently misclassifies. Compile-only. - -#[allow( - dead_code, - unreachable_code, - unused_variables, - unused_mut, - clippy::diverging_sub_expression -)] -async fn _compile_scalar_index_coverage_surface() -> lance::Result<()> { - let ds: Dataset = unimplemented!(); - // The create_filter_plan coupling: a fragment lacking `physical_rows` - // disables the scalar index for the entire scan. 
- for frag in ds.fragments().iter() { - let _physical_rows: Option = frag.physical_rows; - // `key_column_index_coverage` checks each current fragment id against the - // index `fragment_bitmap`. - let _id: u64 = frag.id; - } - // The index sniff: BTREE presence is detected by single-field index whose - // details type_url ends with "BTreeIndexDetails". The fragment coverage check - // reads `fragment_bitmap` (Option) and calls `.contains(u32)`. - let indices = ds.load_indices().await?; - for index in indices.iter() { - let _fields: &Vec = &index.fields; - if let Some(details) = index.index_details.as_ref() { - let _type_url: &str = details.type_url.as_str(); - } - let _covered: Option = index.fragment_bitmap.as_ref().map(|b| b.contains(0u32)); - } - Ok(()) -} - -// --- Guard 12: can a scalar BTREE be built on a system version column? -------- -// -// The deferred persisted-adjacency artifact plan assumed a cheap delta read of -// `_row_last_updated_at_version > V` could be a BTREE range lookup. Lance resolves -// index columns from the dataset schema, and the version columns are system -// metadata β€” so this probe documents whether the assumption holds. The outcome is -// the load-bearing fact, not a pass/fail of intent: if this starts SUCCEEDING when -// it currently errors (or vice versa), the artifact's delta-cost story changes. - -#[tokio::test] -async fn scalar_index_on_system_version_column_probe() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().join("guard12.lance"); - let mut ds = fresh_dataset(uri.to_str().unwrap()).await; - - // Sanity: the system version column is present (stable row ids + V2_2). - assert!( - ds.schema().field("_row_last_updated_at_version").is_none(), - "PROBE NOTE: `_row_last_updated_at_version` is NOT in the user schema \ - (it is system metadata); indexing it resolves through a different path." 
- ); - - let result = ds - .create_index_builder( - &["_row_last_updated_at_version"], - IndexType::BTree, - &ScalarIndexParams::default(), - ) - .replace(true) - .await; - - // Pin the observed behavior: a scalar index on the system version column is - // NOT buildable via the normal create-index path in this Lance. If this turns - // green (Ok), the artifact delta CAN use a version-column BTREE β€” revisit the - // deferred plan's Phase-2 delta-cost note in docs/dev/traversal handoff. - assert!( - result.is_err(), - "create_index on `_row_last_updated_at_version` unexpectedly SUCCEEDED β€” \ - a system-column scalar index is now buildable; the persisted-artifact \ - delta read could use it. Update the deferred-design notes." - ); -} - -// --- Guard 13: per-fragment deletion metadata is exposed without a scan ------- -// -// The deferred artifact's delete-correctness coverage model needs to detect, -// cheaply (O(fragments), no row scan), that a covered fragment acquired new -// deletions. That hinges on Lance tracking deletions at fragment-metadata level. -// This pins that a delete populates `fragment.deletion_file`, and probes whether -// the deleted-row COUNT is available as metadata (`num_deleted_rows`) β€” the -// difference between an O(fragments) coverage check and an O(|E|) scan. - -#[tokio::test] -async fn fragment_deletion_metadata_is_available() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().join("guard13.lance"); - let ds = fresh_dataset(uri.to_str().unwrap()).await; // 2 rows: alice, bob - - let deleted: DeleteResult = { - let mut ds = ds; - ds.delete("id = 'alice'").await.unwrap() - }; - assert_eq!(deleted.num_deleted_rows, 1, "one row deleted"); - let ds = deleted.new_dataset; - - // A delete must be tracked at fragment-metadata level (not only in data). 
- let with_deletion = ds - .fragments() - .iter() - .find(|f| f.deletion_file.is_some()) - .expect( - "after a delete, some fragment must carry a deletion_file β€” if not, \ - Lance changed deletion tracking; the artifact coverage model's \ - cheap delete-detection assumption is invalid.", - ); - - // Probe: is the deleted-row count available as metadata (cheap), or must the - // deletion vector be read? Pin whichever holds so the artifact plan knows. - let count: Option = with_deletion - .deletion_file - .as_ref() - .and_then(|df| df.num_deleted_rows); - assert_eq!( - count, - Some(1), - "PROBE: deletion_file.num_deleted_rows is not a populated metadata count \ - (got {count:?}); the artifact coverage model cannot cheaply detect \ - per-fragment deletions and would need to read the deletion vector.", - ); -} - -// --- Guard 14: Dataset::optimize_indices signature ---------------------------- -// -// `db/omnigraph/optimize.rs::optimize_one_table` calls -// `ds.optimize_indices(&OptimizeOptions::default())` (via `DatasetIndexExt`) to -// fold appended/compacted fragments back into existing indexes. If Lance -// changes the receiver, the options type, or the return shape, this fails to -// compile. Compile-only. - -#[allow( - dead_code, - unreachable_code, - unused_variables, - unused_mut, - clippy::diverging_sub_expression -)] -async fn _compile_optimize_indices_signature() -> lance::Result<()> { - let mut ds: Dataset = unimplemented!(); - let options = OptimizeOptions::default(); - // `&mut self`, `&OptimizeOptions`, returns `Result<()>` (mutates in place - // and commits β€” there is no uncommitted variant in this Lance, which is why - // optimize treats it as an inline-commit residual under a recovery sidecar). 
- let _: () = ds.optimize_indices(&options).await?; - Ok(()) -} - -// --- Guard 15: optimize_indices extends fragment coverage ---------------------- -// -// PR3's reindex assumes `optimize_indices` folds fragments appended AFTER an -// index was built into that index (incremental merge, not retrain). This pins -// that Lance behavior at the surface layer so a regression turns red here, the -// first smoke check on a Lance bump, before the slower engine suite. - -#[tokio::test] -async fn optimize_indices_extends_fragment_coverage() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().join("guard_optimize_indices.lance"); - let uri = uri.to_str().unwrap(); - - // Fragment 0: alice, bob. Build a BTREE over `value` covering only it. - let mut ds = fresh_dataset(uri).await; - ds.create_index_builder(&["value"], IndexType::BTree, &ScalarIndexParams::default()) - .replace(true) - .await - .unwrap(); - - // Append a second fragment the existing index does not cover. - let schema = Arc::new(Schema::new(vec![ - Field::new("id", DataType::Utf8, false), - Field::new("value", DataType::Int32, false), - ])); - let batch = RecordBatch::try_new( - schema.clone(), - vec![ - Arc::new(StringArray::from(vec!["carol"])), - Arc::new(Int32Array::from(vec![3])), - ], - ) - .unwrap(); - let reader = RecordBatchIterator::new(vec![Ok(batch)], schema); - let params = WriteParams { - mode: WriteMode::Append, - enable_stable_row_ids: true, - data_storage_version: Some(LanceFileVersion::V2_2), - ..Default::default() - }; - Dataset::write(reader, uri, Some(params)).await.unwrap(); - - let mut ds = Dataset::open(uri).await.unwrap(); - assert!( - value_index_uncovered_count(&ds).await > 0, - "appended fragment should be uncovered by the BTREE before optimize_indices" - ); - - ds.optimize_indices(&OptimizeOptions::default()) - .await - .unwrap(); - - assert_eq!( - value_index_uncovered_count(&ds).await, - 0, - "optimize_indices must fold the appended fragment into the existing index \ 
- (incremental coverage); if this regresses, PR3's reindex no longer keeps \ - coverage current β€” revisit db/omnigraph/optimize.rs and docs/dev/lance.md." - ); -} - -/// Count current fragments not covered by the single-column `value` BTREE β€” -/// mirrors `TableStore::has_unindexed_fragments` (load_indices + -/// `fragment_bitmap.contains`), pinned by Guard 11. -async fn value_index_uncovered_count(ds: &Dataset) -> usize { - let indices = ds.load_indices().await.unwrap(); - let frag_ids: Vec = ds.fragments().iter().map(|f| f.id as u32).collect(); - let value_fid = ds.schema().field("value").unwrap().id; - for index in indices.iter() { - if index.fields.len() == 1 && index.fields[0] == value_fid { - if let Some(bitmap) = index.fragment_bitmap.as_ref() { - return frag_ids.iter().filter(|id| !bitmap.contains(**id)).count(); - } - } - } - // No `value` index found β€” treat as fully uncovered so a missing index - // is never mistaken for full coverage. - frag_ids.len() -} - -// --- Guard 16: scalar index use requires a literal matching the column type --- -// -// Pins the substrate behavior the pushdown literal-coercion fix relies on -// (`query.rs::literal_to_typed_expr`): Lance uses the BTREE only when the filter -// is `column OP literal` with a matching type. A width-mismatched literal makes -// DataFusion widen and cast the COLUMN (`CAST(n32 AS Int64)`), which drops the -// scalar index and full-scans. Temporal columns are immune (DataFusion casts the -// Utf8 LITERAL to the date type, not the column). If a Lance/DataFusion bump -// changes either coercion direction, this turns red β€” re-validate the fix. 
-#[tokio::test] -async fn scalar_index_use_requires_matched_literal_type() { - use datafusion::physical_plan::displayable; - use datafusion::prelude::{col, lit}; - use datafusion::scalar::ScalarValue; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().join("probe_literal_type.lance"); - let uri = uri.to_str().unwrap(); - - let schema = Arc::new(Schema::new(vec![ - Field::new("id", DataType::Utf8, false), - Field::new("n32", DataType::Int32, false), - Field::new("d32", DataType::Date32, false), - ])); - let batch = RecordBatch::try_new( - schema.clone(), - vec![ - Arc::new(StringArray::from(vec!["a", "b", "c", "d"])), - Arc::new(Int32Array::from(vec![1, 5, 9, 13])), - Arc::new(arrow_array::Date32Array::from(vec![19000, 19723, 20000, 20500])), - ], - ) - .unwrap(); - let reader = RecordBatchIterator::new(vec![Ok(batch)], schema); - let params = WriteParams { - mode: WriteMode::Create, - enable_stable_row_ids: true, - data_storage_version: Some(LanceFileVersion::V2_2), - ..Default::default() - }; - let mut ds = Dataset::write(reader, uri, Some(params)).await.unwrap(); - for c in ["n32", "d32"] { - ds.create_index_builder(&[c], IndexType::BTree, &ScalarIndexParams::default()) - .replace(true) - .await - .unwrap(); - } - - async fn plan_str(ds: &Dataset, filter: datafusion::prelude::Expr) -> String { - let mut scanner = ds.scan(); - scanner.filter_expr(filter); - let plan = scanner.create_plan().await.unwrap(); - format!("{}", displayable(plan.as_ref()).indent(true)) - } - - // (label, filter, expect_index_used) - let cases = [ - ("n32 = 5i32 (matched Int32)", col("n32").eq(lit(5i32)), true), - ("n32 = 5i64 (widened Int64)", col("n32").eq(lit(5i64)), false), - ( - "d32 = Date32 (matched)", - col("d32").eq(lit(ScalarValue::Date32(Some(19723)))), - true, - ), - ( - "d32 = '2024-01-01' (Utf8 vs Date32)", - col("d32").eq(lit("2024-01-01")), - true, - ), - ]; - - for (label, filter, expect_index) in cases { - let s = plan_str(&ds, filter).await; - let 
uses_index = s.contains("ScalarIndexQuery"); - assert_eq!( - uses_index, expect_index, - "[{label}] expected scalar-index use = {expect_index}, got {uses_index}.\n\ - A change here means Lance/DataFusion shifted its coercion or index \ - pushdown; re-validate query.rs::literal_to_typed_expr.\nplan:\n{s}" - ); - } - - // The widened case must show the index-defeating column CAST (the precise - // shape the fix avoids by coercing the literal to the column type). - let widened = plan_str(&ds, col("n32").eq(lit(5i64))).await; - assert!( - widened.contains("CAST(n32 AS Int64)"), - "expected a column-side cast in the widened plan, got:\n{widened}" - ); -} - -// --- Guard 17: BTREE scalar-index range-boundary correctness (lance#6796) ----- -// -// lance#6796 (issue #6792) fixed a BTREE range-query bound-inclusiveness bug: -// `price <= 10 AND price > 5` returned the wrong boundary row (5.0 instead of -// 10.0). OmniGraph today builds BTREE only on string `@key` columns and queries -// them by equality/IN, not range, so its current patterns do not hit this β€” the -// guard protects any future BTREE-range path. It reproduces the exact #6792 shape -// (5 rows + an explicit BTREE drives the index path even on tiny data, per the -// upstream repro) and pins the corrected inclusive-`<=` / exclusive-`>` semantics. 
-#[tokio::test] -async fn btree_range_query_boundary_is_correct() { - use arrow_array::Float64Array; - use futures::TryStreamExt; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().join("guard17.lance"); - let schema = Arc::new(Schema::new(vec![ - Field::new("id", DataType::Utf8, false), - Field::new("price", DataType::Float64, false), - ])); - let batch = RecordBatch::try_new( - schema.clone(), - vec![ - Arc::new(StringArray::from(vec!["a", "b", "c", "d", "e"])), - Arc::new(Float64Array::from(vec![1.0, 5.0, 10.0, 15.0, 20.0])), - ], - ) - .unwrap(); - let reader = RecordBatchIterator::new(vec![Ok(batch)], schema); - let params = WriteParams { - mode: WriteMode::Create, - enable_stable_row_ids: true, - data_storage_version: Some(LanceFileVersion::V2_2), - ..Default::default() - }; - let mut ds = Dataset::write(reader, uri.to_str().unwrap(), Some(params)) - .await - .unwrap(); - - // Build the BTREE on the numeric column so the range filter resolves through - // the scalar index (the path lance#6796 fixed). - ds.create_index_builder(&["price"], IndexType::BTree, &ScalarIndexParams::default()) - .replace(true) - .await - .unwrap(); - - let mut scanner = ds.scan(); - scanner.filter("price <= 10.0 AND price > 5.0").unwrap(); - let batches: Vec = scanner - .try_into_stream() - .await - .unwrap() - .try_collect() - .await - .unwrap(); - let mut got: Vec = Vec::new(); - for b in &batches { - let col = b - .column_by_name("price") - .unwrap() - .as_any() - .downcast_ref::() - .unwrap(); - for i in 0..col.len() { - got.push(col.value(i)); - } - } - got.sort_by(|a, b| a.partial_cmp(b).unwrap()); - assert_eq!( - got, - vec![10.0], - "BTREE range `price <= 10 AND price > 5` must return exactly [10.0] \ - (lance#6796 / issue #6792 boundary fix); got {got:?}. 
If this regressed, \ - Lance reintroduced the range-bound inclusiveness bug.", - ); -} - -// --- Guard 18: skip_auto_cleanup suppresses version GC (lance#6755 / PR #229) -- -// -// After the v7 bump, OmniGraph relies on `CommitBuilder::with_skip_auto_cleanup` -// (`commit_staged`) and `MergeInsertBuilder::skip_auto_cleanup` (the `__manifest` -// publisher) to stop Lance's per-commit auto-cleanup hook from GC'ing versions -// the `__manifest` pins for snapshots/time-travel. This is load-bearing for -// graphs created BEFORE the bump: 6.0.1 defaulted `WriteParams::auto_cleanup` ON, -// so those datasets carry `lance.auto_cleanup.*` config that `auto_cleanup = None` -// on new writes cannot retroactively clear β€” only the per-commit skip stops it. -// -// Pins both halves: WITHOUT the skip the aggressive config GCs v1; WITH the skip -// (the exact call `commit_staged` makes) v1 survives. -#[tokio::test] -async fn skip_auto_cleanup_suppresses_version_gc() { - use std::collections::HashMap; - - // The cleanup config 6.0.1 stored by default, made aggressive: fire on every - // commit, delete anything older than now. - async fn set_legacy_cleanup(ds: &mut Dataset) { - let mut cfg = HashMap::new(); - cfg.insert("lance.auto_cleanup.interval".to_string(), "1".to_string()); - cfg.insert("lance.auto_cleanup.older_than".to_string(), "0ms".to_string()); - ds.update_config(cfg).await.unwrap(); - } - fn row(i: i32) -> (Arc, RecordBatch) { - let schema = Arc::new(Schema::new(vec![ - Field::new("id", DataType::Utf8, false), - Field::new("value", DataType::Int32, false), - ])); - let batch = RecordBatch::try_new( - schema.clone(), - vec![ - Arc::new(StringArray::from(vec![format!("k{i}")])), - Arc::new(Int32Array::from(vec![i])), - ], - ) - .unwrap(); - (schema, batch) - } - - // Negative control: WITHOUT skip, the legacy config GCs the pinned v1. 
- let ctrl = tempfile::tempdir().unwrap(); - let curi = ctrl.path().join("g18_ctrl.lance"); - let curi = curi.to_str().unwrap(); - let mut ds = fresh_dataset(curi).await; - let v1 = ds.version().version; - set_legacy_cleanup(&mut ds).await; - for i in 0..5 { - let (schema, batch) = row(i); - let reader = RecordBatchIterator::new(vec![Ok(batch)], schema); - ds.append( - reader, - Some(WriteParams { - mode: WriteMode::Append, - ..Default::default() - }), - ) - .await - .unwrap(); - } - assert!( - ds.checkout_version(v1).await.is_err(), - "negative control: without skip_auto_cleanup, the legacy auto_cleanup \ - config should have GC'd pinned v{v1}; if this fails the config is not \ - firing and the positive assertion below proves nothing." - ); - - // The guarantee: WITH the per-commit skip, v1 survives. Mirrors - // `TableStore::commit_staged` (InsertBuilder::execute_uncommitted + - // CommitBuilder::with_skip_auto_cleanup(true)). - let keep = tempfile::tempdir().unwrap(); - let kuri = keep.path().join("g18.lance"); - let kuri = kuri.to_str().unwrap(); - let mut ds = fresh_dataset(kuri).await; - let v1 = ds.version().version; - set_legacy_cleanup(&mut ds).await; - for i in 0..5 { - let (_schema, batch) = row(i); - let tx = InsertBuilder::new(Arc::new(ds.clone())) - .with_params(&WriteParams { - mode: WriteMode::Append, - ..Default::default() - }) - .execute_uncommitted(vec![batch]) - .await - .unwrap(); - ds = CommitBuilder::new(Arc::new(ds.clone())) - .with_skip_auto_cleanup(true) - .execute(tx) - .await - .unwrap(); - } - assert!( - ds.checkout_version(v1).await.is_ok(), - "v{v1} was GC'd despite CommitBuilder::with_skip_auto_cleanup(true) β€” the \ - commit_staged / publisher skip is the only thing protecting \ - __manifest-pinned versions on upgraded (pre-bump) graphs." 
- ); -} - -// --- Guard 19: unenforced primary key is immutable once set (lance v7) ------ -// -// Lance 7 (`lance::dataset::transaction`) makes the unenforced PK reserved: -// once `lance-schema:unenforced-primary-key` is set on a field, any later write -// that touches that reserved key β€” even re-applying the SAME value β€” errors -// "the unenforced primary key is a reserved key and cannot be changed once set". -// -// This is the upstream behavior that broke -// `db/manifest/migrations.rs::migrate_v1_to_v2`'s crash-idempotency: a -// pre-v0.4.0 graph that crashed after the field-set but before the stamp bump -// re-enters the migration with the PK already present, and on Lance 6 the -// re-apply was a no-op. The migration now guards the set on the manifest's -// unenforced-PK field (`["object_id"]` β†’ no-op, `[]` β†’ set, anything else β†’ -// loud refusal). If Lance ever relaxes immutability (a re-set becomes a no-op -// again), this guard goes red β€” revisit whether that field-guard is still -// needed, and re-pin docs/dev/lance.md. -#[tokio::test] -async fn unenforced_primary_key_is_immutable_once_set() { - use lance::datatypes::LANCE_UNENFORCED_PRIMARY_KEY; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().join("g19.lance"); - let mut ds = fresh_dataset(uri.to_str().unwrap()).await; - - // Precondition: no unenforced PK yet (mirrors a genuine pre-v0.4.0 manifest). - assert!( - ds.schema().unenforced_primary_key().is_empty(), - "fresh dataset should carry no unenforced primary key" - ); - - // First set succeeds β€” the genuine pre-v0.4.0 migration path. (Discard the - // returned &Schema so the &mut borrow ends before the next call.) 
- ds.update_field_metadata() - .update( - "id", - [(LANCE_UNENFORCED_PRIMARY_KEY.to_string(), "true".to_string())], - ) - .unwrap() - .await - .unwrap(); - let pk: Vec = ds - .schema() - .unenforced_primary_key() - .iter() - .map(|field| field.name.clone()) - .collect(); - assert_eq!( - pk, - ["id"], - "first set should install `id` as the unenforced PK" - ); - - // Re-applying the SAME reserved key must still error. Normalize the sync - // validation stage (`.update()`) and the async commit stage (`.await`) into - // one Result so the actionable diagnostic below fires whichever stage Lance - // enforces immutability at β€” and even if a future Lance relaxes it to `Ok`. - // Bare `.unwrap()` / `.unwrap_err()` would instead panic with a generic - // message in those cases, defeating the guard's purpose. - let outcome: lance::Result<()> = match ds.update_field_metadata().update( - "id", - [(LANCE_UNENFORCED_PRIMARY_KEY.to_string(), "true".to_string())], - ) { - Ok(builder) => builder.await.map(|_| ()), - Err(e) => Err(e), - }; - assert!( - matches!(&outcome, Err(e) if e.to_string().contains("cannot be changed once set")), - "Lance no longer rejects re-setting the unenforced PK as immutable \ - (got: {outcome:?}); immutability relaxed or moved off the commit path \ - β€” revisit migrate_v1_to_v2's field-guard and re-pin docs/dev/lance.md." - ); -} - -// --- Guard 20: camelCase @index equality routes to the scalar index (#283) ---- -// -// The #283 read-pushdown fix builds the filter column with datafusion `ident()` -// (case-preserving) instead of `col()` (SQL identifier normalization, which -// lowercases an unquoted name). The correctness tests in literal_filters.rs / -// writes.rs prove the right rows come back, but a result-only assertion also -// passes on a full-scan fallback β€” exactly the gap testing.md warns about. 
This -// guard pins the *plan*: an equality on a camelCase BTREE column must compile to -// a `ScalarIndexQuery` under the fix's expr shape, and must NOT under the old -// `col()` shape (which lowercases `repoName` β†’ a nonexistent `reponame`). A -// regression that breaks camelCase index routing β€” or a revert to `col()` β€” -// turns this red instead of silently degrading to a full scan. -#[tokio::test] -async fn camelcase_index_equality_routes_to_scalar_index() { - use datafusion::physical_plan::displayable; - use datafusion::prelude::{col, ident, lit}; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().join("camelcase_index.lance"); - let uri = uri.to_str().unwrap(); - - let schema = Arc::new(Schema::new(vec![ - Field::new("id", DataType::Utf8, false), - Field::new("repoName", DataType::Utf8, false), - ])); - let batch = RecordBatch::try_new( - schema.clone(), - vec![ - Arc::new(StringArray::from(vec!["a", "b", "c", "d"])), - Arc::new(StringArray::from(vec![ - "acme", "globex", "initech", "umbrella", - ])), - ], - ) - .unwrap(); - let reader = RecordBatchIterator::new(vec![Ok(batch)], schema); - let params = WriteParams { - mode: WriteMode::Create, - enable_stable_row_ids: true, - data_storage_version: Some(LanceFileVersion::V2_2), - ..Default::default() - }; - let mut ds = Dataset::write(reader, uri, Some(params)).await.unwrap(); - ds.create_index_builder(&["repoName"], IndexType::BTree, &ScalarIndexParams::default()) - .replace(true) - .await - .unwrap(); - - async fn plan_str(ds: &Dataset, filter: datafusion::prelude::Expr) -> lance::Result { - let mut scanner = ds.scan(); - scanner.filter_expr(filter); - let plan = scanner.create_plan().await?; - Ok(format!("{}", displayable(plan.as_ref()).indent(true))) - } - - // The fix's shape: ident() preserves case β†’ resolves `repoName` β†’ index. 
- let used = plan_str(&ds, ident("repoName").eq(lit("acme"))) - .await - .expect("ident(\"repoName\") must plan against the case-preserved schema"); - assert!( - used.contains("ScalarIndexQuery"), - "camelCase @index equality must route to the scalar index (not full scan), got:\n{used}" - ); - - // The pre-fix shape: col() normalizes `repoName` → `reponame`, which does not - // exist in the case-sensitive schema, so planning fails. This is precisely - // why `col()` could never reach the index and surfaced the #283 runtime error - // — it could not silently full-scan past the index either. - let err = plan_str(&ds, col("repoName").eq(lit("acme"))).await; - assert!( - err.is_err(), - "col() lowercases repoName→reponame against a case-sensitive schema; \ - planning must fail rather than resolve, confirming ident() is required \ - for camelCase index routing. got plan:\n{err:?}" - ); -} diff --git a/crates/omnigraph/tests/lance_version_columns.rs b/crates/omnigraph/tests/lance_version_columns.rs index fbe0cb4..b9367b9 100644 --- a/crates/omnigraph/tests/lance_version_columns.rs +++ b/crates/omnigraph/tests/lance_version_columns.rs @@ -191,16 +191,14 @@ async fn lance_merge_insert_new_row_stamps_created_at_version() { let eve = rows.iter().find(|r| r.0 == "eve").unwrap(); eprintln!("Eve: created_at_version={}, v1={}, v2={}", eve.2, v1, v2); - // Lance behavior (7.0.0, lance#6774): merge_insert stamps new INSERT - // rows with _row_created_at_version = the commit version (v2). Earlier - // Lance used a fallback of the dataset creation version; #6774 changed - // it so created_at reflects when the row actually entered the dataset. - // Omnigraph's change detection keys on _row_last_updated_at_version + ID - // set membership (see changes/mod.rs), so this stamping change leaves - // insert-vs-update classification unaffected.
+ // Lance behavior (as of 3.0.1): merge_insert stamps new rows with + // _row_created_at_version = dataset_creation_version (v1), NOT the + // merge_insert commit version (v2). This is why Omnigraph's change + // detection uses _row_last_updated_at_version + ID set membership + // to classify inserts vs updates, not _row_created_at_version alone. assert_eq!( - eve.2, v2, - "Lance merge_insert stamps new rows with created_at = commit version (lance#6774)" + eve.2, v1, + "Lance merge_insert stamps new rows with created_at = dataset creation version, not commit version" ); assert_eq!( eve.3, v2, @@ -260,24 +258,11 @@ async fn lance_merge_insert_update_preserves_created_at_version() { assert_eq!(alice.2, v1, "alice created_at should still be v1"); assert_eq!(alice.3, v1, "alice updated_at should still be v1"); - // Bob: updated via merge_insert. + // Bob: updated via merge_insert + // created_at should be preserved (v1), updated_at should be bumped (v2) eprintln!( "Bob: created_at={}, updated_at={}, v1={}, v2={}", bob.2, bob.3, v1, v2 ); assert_eq!(bob.1, 99, "bob's value should be updated to 99"); - // created_at is preserved across an UPDATE (lance#6774 only changed the - // INSERT-row stamping), which is what this test's name promises. - assert_eq!( - bob.2, v1, - "bob created_at must be preserved across a merge_insert UPDATE" - ); - // updated_at bumps to the commit version on UPDATE — the change-feed - // invariant OmniGraph's insert/update classification relies on - // (changes/mod.rs keys on _row_last_updated_at_version). If this regresses, - // the diff/change feed silently misses updates.
- assert_eq!( - bob.3, v2, - "bob updated_at must bump to the commit version on a merge_insert UPDATE" - ); } diff --git a/crates/omnigraph/tests/lifecycle.rs b/crates/omnigraph/tests/lifecycle.rs index 9488e12..a56a80c 100644 --- a/crates/omnigraph/tests/lifecycle.rs +++ b/crates/omnigraph/tests/lifecycle.rs @@ -304,108 +304,3 @@ async fn init_with_force_recovers_from_orphan_schema_files() { "force-recovered graph must have full schema state written" ); } - -/// E2e for the schema-level `.pg` surface: `@description` (node / edge / -/// property) and `@instruction` (node / edge only) parse, validate, and -/// persist verbatim into the on-disk `_schema.ir.json` through `Omnigraph::init` -/// — the contract that surfaces them in catalog metadata for tooling. -#[tokio::test] -async fn schema_annotations_persist_into_ir_json_on_init() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - - let schema = r#" -node Task @description("Tracked work item") @instruction("Prefer querying by slug") { - slug: String @key @description("Stable external identifier") -} - -edge DependsOn: Task -> Task @description("Hard dependency") @instruction("Use only for blockers") -"#; - - Omnigraph::init(uri, schema).await.unwrap(); - - let ir_json = fs::read_to_string(dir.path().join("_schema.ir.json")).unwrap(); - let ir: serde_json::Value = serde_json::from_str(&ir_json).unwrap(); - - // Helper: collect the {name -> value} map of annotations that carry a - // string value. Value-less annotations (e.g. `@key`, which also desugars - // to a constraint) are skipped — they aren't what this test asserts.
- let anns = |v: &serde_json::Value| -> std::collections::BTreeMap<String, String> { - v["annotations"] - .as_array() - .unwrap() - .iter() - .filter_map(|a| { - Some(( - a["name"].as_str()?.to_string(), - a["value"].as_str()?.to_string(), - )) - }) - .collect() - }; - - let node = ir["nodes"] - .as_array() - .unwrap() - .iter() - .find(|n| n["name"] == "Task") - .unwrap(); - let node_anns = anns(node); - assert_eq!(node_anns.get("description").map(String::as_str), Some("Tracked work item")); - assert_eq!( - node_anns.get("instruction").map(String::as_str), - Some("Prefer querying by slug"), - "node @instruction persists into _schema.ir.json" - ); - - let prop = node["properties"] - .as_array() - .unwrap() - .iter() - .find(|p| p["name"] == "slug") - .unwrap(); - assert_eq!( - anns(prop).get("description").map(String::as_str), - Some("Stable external identifier"), - "property @description persists into _schema.ir.json" - ); - - let edge = ir["edges"] - .as_array() - .unwrap() - .iter() - .find(|e| e["name"] == "DependsOn") - .unwrap(); - let edge_anns = anns(edge); - assert_eq!(edge_anns.get("description").map(String::as_str), Some("Hard dependency")); - assert_eq!(edge_anns.get("instruction").map(String::as_str), Some("Use only for blockers")); -} - -/// `@instruction` is rejected on a property at compile time, so init aborts -/// before any graph state is written (mirrors the parser-level rejection from -/// the full engine boundary). -#[tokio::test] -async fn init_rejects_instruction_on_property() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - - let schema = r#" -node Task { - slug: String @key @instruction("bad") -} -"#; - - // `Omnigraph` is not `Debug`, so match rather than `unwrap_err`.
- let err = match Omnigraph::init(uri, schema).await { - Ok(_) => panic!("property-level @instruction must abort init"), - Err(err) => err, - }; - assert!( - err.to_string().contains("@instruction is only supported on node and edge types"), - "property-level @instruction must abort init: {err}" - ); - assert!( - !dir.path().join("_schema.ir.json").exists(), - "rejected init must not persist a schema IR" - ); -} diff --git a/crates/omnigraph/tests/lineage_projection.rs b/crates/omnigraph/tests/lineage_projection.rs deleted file mode 100644 index e2a6762..0000000 --- a/crates/omnigraph/tests/lineage_projection.rs +++ /dev/null @@ -1,235 +0,0 @@ -//! RFC-013 Phase 7 acceptance gate: graph lineage lives ONLY in `__manifest`. -//! -//! The `graph_commit` + `graph_head` rows ride the same publish CAS as the -//! table-version rows, so `_graph_commits.lance` carries NO commit rows. This -//! gate proves two things over a realistic history (commits on main, a branch, -//! a merge, all with actors): -//! -//! 1. The production commit-graph projection (`CommitGraph::open(...)`, which now -//! reads `__manifest`) reconstructs the full lineage correctly — commit set, -//! parents, the merge commit's two parents + merge actor, per-branch heads, -//! and the inline actors. -//! 2. `_graph_commits.lance` (and its actor sidecar) hold ZERO commit rows: the -//! dual-write is gone and nothing appends to them. This is the load-bearing -//! "single source" assertion. - -mod helpers; - -use futures::TryStreamExt; -use lance::Dataset; - -use omnigraph::db::commit_graph::CommitGraph; -use omnigraph::db::{GraphCommit, Omnigraph}; - -use helpers::*; - -/// Count rows in a Lance dataset directory under the graph root, or `0` if it -/// does not exist.
-async fn row_count(root: &str, dir: &str) -> usize { - let uri = format!("{}/{}", root.trim_end_matches('/'), dir); - let Ok(dataset) = Dataset::open(&uri).await else { - return 0; - }; - let batches: Vec = dataset - .scan() - .try_into_stream() - .await - .unwrap() - .try_collect() - .await - .unwrap(); - batches.iter().map(|b| b.num_rows()).sum() -} - -/// The production commit-graph projection at `branch`, sourced from `__manifest`. -async fn projected_commits(root: &str, branch: Option<&str>) -> Vec<GraphCommit> { - let graph = match branch { - Some(branch) => CommitGraph::open_at_branch(root, branch).await.unwrap(), - None => CommitGraph::open(root).await.unwrap(), - }; - let mut commits = graph.load_commits().await.unwrap(); - commits.sort_by(|a, b| { - a.manifest_version - .cmp(&b.manifest_version) - .then_with(|| a.created_at.cmp(&b.created_at)) - .then_with(|| a.graph_commit_id.cmp(&b.graph_commit_id)) - }); - commits -} - -async fn head_id(root: &str, branch: Option<&str>) -> String { - let graph = match branch { - Some(branch) => CommitGraph::open_at_branch(root, branch).await.unwrap(), - None => CommitGraph::open(root).await.unwrap(), - }; - graph - .head_commit() - .await - .unwrap() - .unwrap() - .graph_commit_id -} - -#[tokio::test] -async fn graph_lineage_lives_only_in_manifest() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - - // Build a realistic history: several authored commits on main, a branch with - // its own authored commits, then an authored merge back into main.
- let main = init_and_load(&dir).await; - - main.mutate_as( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Alice")], &[("$age", 30)]), - Some("act-alice"), - ) - .await - .unwrap(); - main.mutate_as( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Bob")], &[("$age", 41)]), - Some("act-bob"), - ) - .await - .unwrap(); - - main.branch_create("feature").await.unwrap(); - - let feature = Omnigraph::open(&uri).await.unwrap(); - feature - .mutate_as( - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Carol")], &[("$age", 27)]), - Some("act-carol"), - ) - .await - .unwrap(); - feature - .mutate_as( - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Dave")], &[("$age", 33)]), - Some("act-dave"), - ) - .await - .unwrap(); - - // Advance main once more so the merge is a real (non-fast-forward) merge with - // two distinct parents. - main.mutate_as( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Erin")], &[("$age", 38)]), - Some("act-erin"), - ) - .await - .unwrap(); - - let outcome = main - .branch_merge_as("feature", "main", Some("act-merger")) - .await - .unwrap(); - // A genuine three-way merge (both sides advanced past the base). - assert_eq!( - outcome, - omnigraph::db::MergeOutcome::Merged, - "expected a real merge, not fast-forward/up-to-date" - ); - - // ── single source: nothing writes `_graph_commits.lance` ───────────────── - // RFC-013 Phase 7 folds lineage into `__manifest`; the commit-graph dataset - // exists only to carry branch refs, so it (and its actor sidecar) hold ZERO - // commit rows. If a stray `append_commit` reappears, this turns red. 
- assert_eq!( - row_count(&uri, "_graph_commits.lance").await, - 0, - "_graph_commits.lance must carry no commit rows — lineage lives in __manifest" - ); - assert_eq!( - row_count(&uri, "_graph_commit_actors.lance").await, - 0, - "_graph_commit_actors.lance must carry no rows — actors live inline in __manifest" - ); - - // ── main lineage projected from `__manifest` ───────────────────────────── - let main_commits = projected_commits(&uri, None).await; - // genesis + Alice + Bob + Erin + the merge = 5 on main. - assert!( - main_commits.len() >= 5, - "expected a non-trivial main history, got {} commits", - main_commits.len() - ); - - // Genesis is the unique parentless commit and carries no actor. - let genesis: Vec<&GraphCommit> = main_commits - .iter() - .filter(|c| c.parent_commit_id.is_none()) - .collect(); - assert_eq!(genesis.len(), 1, "exactly one genesis (parentless) commit"); - assert!( - genesis[0].actor_id.is_none(), - "genesis commit carries no actor" - ); - - // Every non-genesis commit's parent resolves to a known commit (a connected - // lineage — the publisher resolved each parent under the CAS). - for commit in &main_commits { - if let Some(parent) = &commit.parent_commit_id { - assert!( - main_commits.iter().any(|c| &c.graph_commit_id == parent), - "parent {parent} of {} must be a known commit", - commit.graph_commit_id - ); - } - } - - // The merge commit carries both parents and the merge actor. - let merge_commit = main_commits - .iter() - .find(|c| c.merged_parent_commit_id.is_some()) - .expect("a merge commit with a merged parent must exist"); - assert_eq!(merge_commit.actor_id.as_deref(), Some("act-merger")); - assert!(merge_commit.parent_commit_id.is_some()); - // The merge is the head of main.
- assert_eq!( - head_id(&uri, None).await, - merge_commit.graph_commit_id, - "the merge commit is the head of main" - ); - - // ── feature lineage projected from `__manifest` ────────────────────────── - let feature_commits = projected_commits(&uri, Some("feature")).await; - // The feature head is Dave's commit (the last authored on the branch). - let feature_head = head_id(&uri, Some("feature")).await; - let feature_head_commit = feature_commits - .iter() - .find(|c| c.graph_commit_id == feature_head) - .expect("feature head must be in the feature projection"); - assert_eq!( - feature_head_commit.actor_id.as_deref(), - Some("act-dave"), - "feature head is Dave's authored commit" - ); - - // ── actors surface inline from the manifest metadata ───────────────────── - // main's authored commits: Alice, Bob, Erin (direct) + the merge (act-merger) - // = 4. Carol/Dave were authored on the feature branch, not main. Genesis has - // no actor. - let authored = main_commits - .iter() - .filter(|c| c.actor_id.is_some()) - .count(); - assert!( - authored >= 4, - "expected the authored commits to surface their actor in the projection, saw {authored}" - ); -} diff --git a/crates/omnigraph/tests/literal_filters.rs b/crates/omnigraph/tests/literal_filters.rs deleted file mode 100644 index 9fb480a..0000000 --- a/crates/omnigraph/tests/literal_filters.rs +++ /dev/null @@ -1,173 +0,0 @@ -//! Execution goldens for filtering by non-string/non-integer scalar LITERALS -//! (F64, F32, Bool, Date, DateTime), across both the in-memory comparison arm -//! (standalone `$m.prop op lit`) and the Lance-pushdown arm (inline binding -//! `Metric { prop: lit }`). Param-bound scalar filters and list-column -//! `contains` are already covered elsewhere; this closes the literal-RHS gap. 
- -mod helpers; - -use arrow_array::{Array, StringArray}; - -use omnigraph::db::Omnigraph; -use omnigraph::loader::{LoadMode, load_jsonl}; -use omnigraph_compiler::ir::ParamMap; - -use helpers::*; - -const SCHEMA: &str = r#" -node Metric { - name: String @key - score: F64? - ratio: F32? - count: I32? - active: Bool? - born: Date? - seen: DateTime? -} -"#; - -// Seeds partition every predicate, so a dropped filter returns all 4 rows. -const DATA: &str = r#"{"type":"Metric","data":{"name":"m1","score":2.5,"ratio":0.5,"count":1,"active":true,"born":"2024-06-01","seen":"2024-06-01T12:00:00Z"}} -{"type":"Metric","data":{"name":"m2","score":1.0,"ratio":0.25,"count":2,"active":false,"born":"2023-01-01","seen":"2023-01-01T00:00:00Z"}} -{"type":"Metric","data":{"name":"m3","score":3.0,"ratio":0.75,"count":3,"active":true,"born":"2025-01-01","seen":"2025-01-01T00:00:00Z"}} -{"type":"Metric","data":{"name":"m4","score":0.5,"ratio":0.1,"count":4,"active":false,"born":"2022-12-31","seen":"2022-01-01T00:00:00Z"}}"#; - -async fn metric_db(dir: &tempfile::TempDir) -> Omnigraph { - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, SCHEMA).await.unwrap(); - load_jsonl(&mut db, DATA, LoadMode::Overwrite).await.unwrap(); - db -} - -async fn sorted_metric_names(db: &mut Omnigraph, queries: &str, name: &str) -> Vec<String> { - let r = query_main(db, queries, name, &ParamMap::new()).await.unwrap(); - if r.num_rows() == 0 { - return Vec::new(); - } - let b = r.concat_batches().unwrap(); - let col = b.column(0).as_any().downcast_ref::<StringArray>().unwrap(); - let mut v: Vec<String> = (0..col.len()).map(|i| col.value(i).to_string()).collect(); - v.sort(); - v -} - -#[tokio::test] -async fn float_literal_filters_execute() { - let dir = tempfile::tempdir().unwrap(); - let mut db = metric_db(&dir).await; - let q = r#" -query gt() { match { $m: Metric $m.score > 1.5 } return { $m.name } } -query le() { match { $m: Metric $m.ratio <= 0.25 } return { $m.name } } -query inline() { match { $m:
Metric { score: 3.0 } } return { $m.name } } -"#; - // F64 standalone: scores 2.5, 3.0 > 1.5 - assert_eq!(sorted_metric_names(&mut db, q, "gt").await, vec!["m1", "m3"]); - // F32 standalone: ratios 0.25, 0.1 <= 0.25 - assert_eq!(sorted_metric_names(&mut db, q, "le").await, vec!["m2", "m4"]); - // F64 inline-binding pushdown: score == 3.0 - assert_eq!(sorted_metric_names(&mut db, q, "inline").await, vec!["m3"]); -} - -// Inline-binding equality is the Lance-pushdown arm. With the literal coerced to -// the column's exact Arrow type, a narrow-numeric column (I32) and an F32 column -// must still select the right rows — the coercion changes the literal's type, not -// the result set. (The index-use win this enables is pinned at the Lance-surface -// layer by `lance_surface_guards::scalar_index_use_requires_matched_literal_type`.) -#[tokio::test] -async fn int_and_f32_literal_pushdown_coercion() { - let dir = tempfile::tempdir().unwrap(); - let mut db = metric_db(&dir).await; - let q = r#" -query count_eq() { match { $m: Metric { count: 2 } } return { $m.name } } -query ratio_eq() { match { $m: Metric { ratio: 0.25 } } return { $m.name } } -query count_ge() { match { $m: Metric $m.count >= 3 } return { $m.name } } -"#; - // I32 column, integer literal coerced Int64 -> Int32: count == 2 is m2 only. - assert_eq!(sorted_metric_names(&mut db, q, "count_eq").await, vec!["m2"]); - // F32 column, float literal coerced Float64 -> Float32: ratio == 0.25 is m2. - assert_eq!(sorted_metric_names(&mut db, q, "ratio_eq").await, vec!["m2"]); - // Range on the I32 column: count 3,4 >= 3 -> m3, m4 (coercion is op-independent). - assert_eq!( - sorted_metric_names(&mut db, q, "count_ge").await, - vec!["m3", "m4"] - ); -} - -// A fractional float against an integer column must not be truncated by the -// pushdown coercion (`2.7 -> 2` would wrongly match the count=2 row).
The -// lossless guard falls back to the natural Float64 literal, so `count = 2.7` -// matches no integer and returns no rows. -#[tokio::test] -async fn fractional_float_equality_on_int_column_returns_no_rows() { - let dir = tempfile::tempdir().unwrap(); - let mut db = metric_db(&dir).await; - let q = r#" -query count_frac() { match { $m: Metric { count: 2.7 } } return { $m.name } } -"#; - assert!( - sorted_metric_names(&mut db, q, "count_frac") - .await - .is_empty(), - "count = 2.7 must match no integer rows (no truncation to count = 2)" - ); -} - -#[tokio::test] -async fn bool_literal_filters_execute() { - let dir = tempfile::tempdir().unwrap(); - let mut db = metric_db(&dir).await; - let q = r#" -query standalone() { match { $m: Metric $m.active = true } return { $m.name } } -query inline() { match { $m: Metric { active: true } } return { $m.name } } -query negated() { match { $m: Metric $m.active != true } return { $m.name } } -"#; - assert_eq!(sorted_metric_names(&mut db, q, "standalone").await, vec!["m1", "m3"]); - assert_eq!(sorted_metric_names(&mut db, q, "inline").await, vec!["m1", "m3"]); - assert_eq!(sorted_metric_names(&mut db, q, "negated").await, vec!["m2", "m4"]); -} - -#[tokio::test] -async fn date_and_datetime_literal_filters_execute() { - let dir = tempfile::tempdir().unwrap(); - let mut db = metric_db(&dir).await; - let q = r#" -query born_ge() { match { $m: Metric $m.born >= date("2024-01-01") } return { $m.name } } -query seen_lt() { match { $m: Metric $m.seen < datetime("2024-01-01T00:00:00Z") } return { $m.name } } -query born_eq() { match { $m: Metric { born: date("2024-06-01") } } return { $m.name } } -query seen_eq() { match { $m: Metric { seen: datetime("2024-06-01T12:00:00Z") } } return { $m.name } } -"#; - // born: m1 2024-06, m3 2025 >= 2024-01-01 - assert_eq!(sorted_metric_names(&mut db, q, "born_ge").await, vec!["m1", "m3"]); - // seen: m2 2023, m4 2022 < 2024-01-01 - assert_eq!(sorted_metric_names(&mut db, q, "seen_lt").await, 
vec!["m2", "m4"]); - // Inline-binding equality exercises the Lance-pushdown arm with a typed - // Date32/Date64 literal: the epoch conversion must select exactly m1. - assert_eq!(sorted_metric_names(&mut db, q, "born_eq").await, vec!["m1"]); - assert_eq!(sorted_metric_names(&mut db, q, "seen_eq").await, vec!["m1"]); -} - -// #283: a property-match on a camelCase `@index` field must execute, not fail -// with "No field named reponame" at the Lance scan. Exercises the pushdown arm -// (inline binding `Doc { repoName: $r }`) end-to-end. -const CC_SCHEMA: &str = r#" -node Doc { - slug: String @key - repoName: String @index -} -"#; -const CC_DATA: &str = r#"{"type":"Doc","data":{"slug":"d1","repoName":"acme"}} -{"type":"Doc","data":{"slug":"d2","repoName":"globex"}}"#; - -#[tokio::test] -async fn camelcase_property_filter_executes() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, CC_SCHEMA).await.unwrap(); - load_jsonl(&mut db, CC_DATA, LoadMode::Overwrite).await.unwrap(); - - let q = r#"query by_repo($r: String) { match { $d: Doc { repoName: $r } } return { $d.slug } }"#; - let r = query_main(&mut db, q, "by_repo", &params(&[("$r", "acme")])) - .await - .expect("camelCase property filter must execute, not fail at the Lance scan"); - assert_eq!(r.num_rows(), 1, "expected exactly the d1 row for repoName=acme"); -} diff --git a/crates/omnigraph/tests/maintenance.rs b/crates/omnigraph/tests/maintenance.rs index 8e7bfc9..3c6ab30 100644 --- a/crates/omnigraph/tests/maintenance.rs +++ b/crates/omnigraph/tests/maintenance.rs @@ -7,118 +7,32 @@ mod helpers; use std::time::Duration; -use lance::Dataset; -use lance::dataset::optimize::{CompactionOptions, compact_files}; -use omnigraph::db::{ - CleanupPolicyOptions, Omnigraph, ReadTarget, RepairAction, RepairClassification, RepairOptions, - SkipReason, -}; +use omnigraph::db::{CleanupPolicyOptions, Omnigraph}; use omnigraph::loader::{LoadMode, load_jsonl}; -use
omnigraph::table_store::{IndexCoverage, TableStore}; -use helpers::{ - MUTATION_QUERIES, TEST_DATA, TEST_SCHEMA, count_rows, init_and_load, mixed_params, mutate_main, - snapshot_main, -}; - -/// Filesystem URI of a node sub-table, mirroring the engine's layout -/// (FNV-1a of the type name under `nodes/`). Matches the helper in -/// `failpoints.rs`; used to inspect/forge Lance branches directly in tests. -fn node_table_uri(root: &str, type_name: &str) -> String { - let mut hash: u64 = 0xcbf2_9ce4_8422_2325; - for &b in type_name.as_bytes() { - hash ^= b as u64; - hash = hash.wrapping_mul(0x100_0000_01b3); - } - format!("{}/nodes/{hash:016x}", root.trim_end_matches('/')) -} - -async fn person_manifest_and_head(db: &Omnigraph, root: &str) -> (u64, u64, String) { - let snap = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); - let entry = snap.entry("node:Person").unwrap(); - let full = format!("{}/{}", root.trim_end_matches('/'), entry.table_path); - let head = Dataset::open(&full).await.unwrap().version().version; - (entry.table_version, head, full) -} - -async fn add_person_fragments(db: &mut Omnigraph) { - for (name, age) in [("Eve", 40), ("Frank", 41), ("Grace", 42), ("Heidi", 43)] { - mutate_main( - db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", name)], &[("$age", age as i64)]), - ) - .await - .expect("insert"); - } -} - -async fn forge_person_compaction_drift(db: &mut Omnigraph, root: &str) -> (u64, u64, String) { - add_person_fragments(db).await; - let (manifest_version, _, full) = person_manifest_and_head(db, root).await; - let mut ds = Dataset::open(&full).await.unwrap(); - let metrics = compact_files(&mut ds, CompactionOptions::default(), None) - .await - .expect("raw Lance compaction"); - let lance_head_version = ds.version().version; - assert!( - lance_head_version > manifest_version, - "raw Lance compaction should advance HEAD beyond manifest" - ); - assert!( - metrics.fragments_removed > 0 || metrics.fragments_added > 0, 
- "test precondition: raw compaction should rewrite fragments" - ); - (manifest_version, lance_head_version, full) -} - -async fn forge_person_delete_drift(db: &Omnigraph, root: &str) -> (u64, u64, String) { - let (manifest_version, _, full) = person_manifest_and_head(db, root).await; - let mut ds = Dataset::open(&full).await.unwrap(); - let deleted = ds.delete("name = 'Alice'").await.expect("raw Lance delete"); - assert_eq!(deleted.num_deleted_rows, 1, "fixture should delete Alice"); - let lance_head_version = deleted.new_dataset.version().version; - assert!( - lance_head_version > manifest_version, - "raw Lance delete should advance HEAD beyond manifest" - ); - (manifest_version, lance_head_version, full) -} +use helpers::{TEST_DATA, TEST_SCHEMA, count_rows, init_and_load}; #[tokio::test] async fn optimize_on_empty_graph_returns_stats_per_table_with_no_changes() { let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); - let db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); + let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); let stats = db.optimize().await.unwrap(); - // Schema declares 2 nodes + 2 edges = 4 data tables, plus the 3 internal - // system tables (`__manifest`, `_graph_commits`, `_graph_commit_actors`) optimize - // also compacts (RFC-013 step 2) = 7. Compaction should run on each but find - // nothing to merge. The genesis graph commit rides the SINGLE init - // `__manifest` write (RFC-013 Phase 7), so a fresh graph has one fragment per - // table — nothing to compact anywhere. - assert_eq!(stats.len(), 7); + // Schema declares 2 nodes + 2 edges = 4 tables. Compaction should run on + // each but find nothing to merge. + assert_eq!(stats.len(), 4); for s in &stats { assert_eq!(s.fragments_removed, 0, "{} should not remove", s.table_key); assert_eq!(s.fragments_added, 0, "{} should not add", s.table_key); } - // The internal tables are present and reported as no-ops on an empty graph.
- for key in ["__manifest", "_graph_commits", "_graph_commit_actors"] { - let s = stats - .iter() - .find(|s| s.table_key == key) - .unwrap_or_else(|| panic!("optimize stats missing internal table {key}")); - assert!(!s.committed, "{key} should be a no-op on an empty graph"); - } } #[tokio::test] async fn optimize_after_load_then_again_is_idempotent() { let dir = tempfile::tempdir().unwrap(); - let db = init_and_load(&dir).await; + let mut db = init_and_load(&dir).await; // First pass may compact (load wrote real fragments). let _first = db.optimize().await.unwrap(); @@ -145,798 +59,6 @@ async fn optimize_after_load_then_again_is_idempotent() { } } -/// RFC-013 step 2 + Phase 7: `optimize` compacts `__manifest`, which now -/// accumulates one fragment per commit for BOTH the table-version rows and the -/// folded-in graph-lineage rows (`graph_commit` + `graph_head`). The -/// commit-graph datasets (`_graph_commits`, `_graph_commit_actors`) no longer -/// take a per-commit row (lineage lives in `__manifest`), so they stay flat — -/// nothing to compact. After compaction `__manifest` sheds fragments, writes no -/// recovery sidecar (a single atomic Lance commit — no HEAD-before-publish gap), -/// and the graph stays coherent for subsequent reads + strict writes. -#[tokio::test] -async fn optimize_compacts_internal_tables() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - - // Build version-history depth so `__manifest` accumulates fragments. - for i in 0..20 { - mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", &format!("p{i}"))], &[("$age", 30)]), - ) - .await - .unwrap(); - } - - let stats = db.optimize().await.unwrap(); - - // `__manifest` carries every per-commit fragment (table versions + lineage) - // and compacts.
- let manifest_stats = stats - .iter() - .find(|s| s.table_key == "__manifest") - .expect("optimize stats missing internal table __manifest"); - assert!( - manifest_stats.committed, - "__manifest should compact after 20 commits" - ); - assert!( - manifest_stats.fragments_removed > 0, - "__manifest should shed fragments, removed {}", - manifest_stats.fragments_removed - ); - - // The commit-graph datasets take no per-commit row anymore (RFC-013 Phase 7 - // folds lineage into `__manifest`), so they stay at one fragment — no-ops. - for key in ["_graph_commits", "_graph_commit_actors"] { - let s = stats - .iter() - .find(|s| s.table_key == key) - .unwrap_or_else(|| panic!("optimize stats missing internal table {key}")); - assert!( - !s.committed, - "{key} carries no per-commit rows after Phase 7 — nothing to compact" - ); - } - - // Internal compaction leaks no recovery sidecar. - let recovery_dir = dir.path().join("__recovery"); - if recovery_dir.exists() { - let leftover: Vec<_> = std::fs::read_dir(&recovery_dir) - .unwrap() - .filter_map(|e| e.ok()) - .map(|e| e.file_name()) - .collect(); - assert!( - leftover.is_empty(), - "optimize leaked recovery sidecars: {leftover:?}" - ); - } - - // Coherent after internal compaction: reads + a strict write still work. - assert!(count_rows(&db, "node:Person").await > 0); - mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "after_compact")], &[("$age", 40)]), - ) - .await - .unwrap(); -} - -/// `optimize` must not fail on a graph that has no `_graph_commits.lance` — a valid -/// state the coordinator opens as `commit_graph = None` (graphs predating the commit -/// graph). Without the existence guard, `Dataset::open` on the absent table errors -/// and fails the whole optimize. Regression for the missing-existence-guard.
-/// -/// Uses an EMPTY graph deliberately: a graph with data would publish during -/// optimize, and a publish records a graph commit that recreates `_graph_commits` -/// before the guard runs — masking the bug. With no data, nothing recreates it, so -/// the table stays absent through the guard. -#[tokio::test] -async fn optimize_tolerates_absent_graph_commits_table() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - - // Simulate a graph with no commit-graph dataset. - std::fs::remove_dir_all(dir.path().join("_graph_commits.lance")).unwrap(); - - // Coordinator tolerates the absence; optimize must succeed (the guard skips the - // absent table rather than letting `Dataset::open` error) and omit its stat. - let db = Omnigraph::open(uri).await.unwrap(); - let stats = db.optimize().await.unwrap(); - assert!( - stats.iter().any(|s| s.table_key == "__manifest"), - "__manifest must still be compacted" - ); - assert!( - !stats.iter().any(|s| s.table_key == "_graph_commits"), - "absent _graph_commits must be skipped, not opened (would error)" - ); -} - -/// `optimize` must stay NON-DESTRUCTIVE on a pre-`auto_cleanup`-fix upgraded graph: -/// `compact_files` would otherwise fire the dataset's stored `lance.auto_cleanup.*` -/// hook (version GC) during the compaction commit. Internal-table compaction clears -/// that stale config first, so no versions are deleted. Without the clear, the -/// aggressive policy below GCs old versions and the count drops.
-#[tokio::test] -async fn optimize_clears_stale_auto_cleanup_and_preserves_versions() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - for i in 0..5 { - mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", &format!("v{i}"))], &[("$age", 30)]), - ) - .await - .unwrap(); - } - let manifest_uri = format!("{}/__manifest", dir.path().to_str().unwrap()); - - // Simulate an upgraded graph: an aggressive stored auto_cleanup config that, if - // it fired during compaction, would GC old versions. - { - let mut ds = Dataset::open(&manifest_uri).await.unwrap(); - ds.update_config([ - ("lance.auto_cleanup.interval", Some("1")), - ("lance.auto_cleanup.older_than", Some("0s")), - ]) - .await - .unwrap(); - } - let versions_before = Dataset::open(&manifest_uri) - .await - .unwrap() - .versions() - .await - .unwrap() - .len(); - - db.optimize().await.unwrap(); - - let ds = Dataset::open(&manifest_uri).await.unwrap(); - // (a) the stale auto_cleanup config was cleared (non-destructive by construction). - assert!( - !ds.config().keys().any(|k| k.starts_with("lance.auto_cleanup.")), - "optimize must clear stale auto_cleanup config; config = {:?}", - ds.config() - ); - // (b) no version GC: every pre-optimize version survives (compaction + the - // config-clear each add versions, so the count only grows). - let versions_after = ds.versions().await.unwrap().len(); - assert!( - versions_after >= versions_before, - "optimize must not GC __manifest versions: before={versions_before} after={versions_after}" - ); -} - -/// The same non-destructive guarantee on a DATA (node/edge) table, not just the -/// internal tables. 
`optimize_one_table` runs `compact_files` / `optimize_indices` -/// with a default `CommitConfig` (`skip_auto_cleanup = false`); on an upgraded -/// graph whose Person table still carries the pre-v7 `lance.auto_cleanup.*` config, -/// those commits would fire Lance's version-GC hook and prune `__manifest`-pinned -/// data-table versions. The path must strip that config first. Without the strip, -/// the aggressive policy below GCs old versions and the config survives the run. -#[tokio::test] -async fn optimize_clears_stale_auto_cleanup_on_data_tables_too() { - let dir = tempfile::tempdir().unwrap(); - let root = dir.path().to_str().unwrap().trim_end_matches('/').to_string(); - let mut db = init_and_load(&dir).await; - add_person_fragments(&mut db).await; // multiple fragments β†’ will_compact - - // Simulate an upgraded graph: set an aggressive stored auto_cleanup config on - // the Person table. This is an out-of-band Lance commit (an `UpdateConfig` that - // advances HEAD past the manifest), so realign the manifest with a forced repair - // first β€” otherwise optimize skips the table as uncovered drift and never - // reaches the scrub. (Forced because UpdateConfig is not verified maintenance.) - let (_, _, person_full) = person_manifest_and_head(&db, &root).await; - { - let mut ds = Dataset::open(&person_full).await.unwrap(); - ds.update_config([ - ("lance.auto_cleanup.interval", Some("1")), - ("lance.auto_cleanup.older_than", Some("0s")), - ]) - .await - .unwrap(); - } - db.repair(RepairOptions { - confirm: true, - force: true, - }) - .await - .unwrap(); - - let versions_before = Dataset::open(&person_full) - .await - .unwrap() - .versions() - .await - .unwrap() - .len(); - let rows_before = count_rows(&db, "node:Person").await; - - db.optimize().await.unwrap(); - - let ds = Dataset::open(&person_full).await.unwrap(); - // (a) the stale auto_cleanup config was cleared (non-destructive by construction). 
- assert!( - !ds.config().keys().any(|k| k.starts_with("lance.auto_cleanup.")), - "optimize must clear stale auto_cleanup config on data tables; config = {:?}", - ds.config() - ); - // (b) no version GC: every pre-optimize version survives (compaction + the - // config-clear each add versions, so the count only grows). - let versions_after = ds.versions().await.unwrap().len(); - assert!( - versions_after >= versions_before, - "optimize must not GC Person versions: before={versions_before} after={versions_after}" - ); - // (c) data is intact — the run rewrote fragments, it did not drop rows. - assert_eq!(count_rows(&db, "node:Person").await, rows_before); -} - -// PR3 (Workstream B): an existing scalar index does not cover fragments -// appended after it was built (build_indices is existence-gated), so those -// rows are scanned unindexed. `optimize` must fold them back in via Lance's -// incremental `optimize_indices`, restoring full coverage. -#[tokio::test] -async fn optimize_reindexes_fragments_appended_after_index_build() { - const SCHEMA: &str = r#" -node Doc { - slug: String @key - rank: I32 @index -} -"#; - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, SCHEMA).await.unwrap(); - - // First load builds the id + rank BTREEs over the initial fragment. - load_jsonl( - &mut db, - "{\"type\":\"Doc\",\"data\":{\"slug\":\"d1\",\"rank\":1}}\n\ - {\"type\":\"Doc\",\"data\":{\"slug\":\"d2\",\"rank\":2}}", - LoadMode::Merge, - ) - .await - .unwrap(); - - // A second load with NEW keys appends a fragment the existing BTREEs do not - // cover (the existence gate skips re-building an index that already exists). - load_jsonl( - &mut db, - "{\"type\":\"Doc\",\"data\":{\"slug\":\"d3\",\"rank\":3}}\n\ - {\"type\":\"Doc\",\"data\":{\"slug\":\"d4\",\"rank\":4}}", - LoadMode::Merge, - ) - .await - .unwrap(); - - // Precondition: the appended fragment is unindexed.
- { - let snap = snapshot_main(&db).await.unwrap(); - let ds = snap.open("node:Doc").await.unwrap(); - assert!( - TableStore::has_unindexed_fragments(&ds).await.unwrap(), - "appended fragment should be unindexed before optimize" - ); - } - - db.optimize().await.unwrap(); - - // Postcondition: optimize_indices folded the appended fragment in, so every - // index covers every fragment and `rank` reports fully Indexed. - let snap = snapshot_main(&db).await.unwrap(); - let ds = snap.open("node:Doc").await.unwrap(); - assert!( - !TableStore::has_unindexed_fragments(&ds).await.unwrap(), - "optimize must extend index coverage to all fragments" - ); - assert_eq!( - TableStore::key_column_index_coverage(&ds, "rank") - .await - .unwrap(), - IndexCoverage::Indexed, - "rank BTREE must cover all fragments after optimize" - ); -} - -// Regression: `optimize` must not crash on a graph that has a `Blob` table. -// -// Lance `compact_files` forces `BlobHandling::AllBinary`, which mis-decodes -// blob-v2 columns ("more fields in the schema than provided column indices"), -// failing even a pristine uniform-V2_2 multi-fragment blob table. `optimize` -// must skip blob-bearing tables (and report the skip) rather than aborting the -// whole sweep. -// -// Before the skip fix, `optimize()` returned that Lance error here and aborted -// the whole sweep; it now skips the blob table (`doc.skipped == Some(..)`) -// while the sibling non-blob `Tag` table still compacts. The skip is gated by -// `LANCE_SUPPORTS_BLOB_COMPACTION`; the surface guard -// `compact_files_still_fails_on_blob_columns` flags when the upstream Lance fix -// makes the skip (and this test's blob arm) removable. -#[tokio::test] -async fn optimize_skips_blob_table_and_reports_skip() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - // One Blob node type (`Doc`) + one plain node type (`Tag`): proves the blob - // table is skipped while a non-blob table in the same sweep still compacts. 
- let schema = "\ -node Doc {\n slug: String @key\n content: Blob\n}\n\ -node Tag {\n slug: String @key\n}\n"; - let mut db = Omnigraph::init(uri, schema).await.unwrap(); - - // Multi-fragment blob table: Overwrite creates fragment 1; each Merge of - // new keys appends another. A >=2-fragment blob table is exactly what - // crashes `compact_files` today (single fragment would no-op and not crash). - load_jsonl( - &mut db, - "{\"type\":\"Doc\",\"data\":{\"slug\":\"d1\",\"content\":\"base64:aGVsbG8x\"}}\n{\"type\":\"Doc\",\"data\":{\"slug\":\"d2\",\"content\":\"base64:aGVsbG8y\"}}", - LoadMode::Overwrite, - ) - .await - .unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Doc\",\"data\":{\"slug\":\"d3\",\"content\":\"base64:aGVsbG8z\"}}", - LoadMode::Merge, - ) - .await - .unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Doc\",\"data\":{\"slug\":\"d4\",\"content\":\"base64:aGVsbG80\"}}", - LoadMode::Merge, - ) - .await - .unwrap(); - // Plain table, also multi-fragment so it has something to compact. - load_jsonl( - &mut db, - "{\"type\":\"Tag\",\"data\":{\"slug\":\"t1\"}}\n{\"type\":\"Tag\",\"data\":{\"slug\":\"t2\"}}", - LoadMode::Merge, - ) - .await - .unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Tag\",\"data\":{\"slug\":\"t3\"}}", - LoadMode::Merge, - ) - .await - .unwrap(); - - let stats = db - .optimize() - .await - .expect("optimize must not crash on a graph with a Blob table"); - - let doc = stats - .iter() - .find(|s| s.table_key == "node:Doc") - .expect("Doc stat present"); - let tag = stats - .iter() - .find(|s| s.table_key == "node:Tag") - .expect("Tag stat present"); - // The blob table is skipped (and reported), not compacted. 
- assert_eq!( - doc.skipped, - Some(SkipReason::BlobColumnsUnsupportedByLance), - "blob table must be reported as skipped", - ); - assert!(!doc.committed, "skipped blob table is not compacted"); - assert_eq!(doc.fragments_removed, 0); - assert_eq!(doc.fragments_added, 0); - // The plain (non-blob) table is unaffected by the skip. - assert_eq!(tag.skipped, None, "non-blob table must not be skipped"); -} - -// Regression: `optimize` must publish its compaction to the `__manifest` so the -// manifest's recorded `table_version` tracks the compacted Lance HEAD. -// -// Lance `compact_files` advances the *dataset's* version (reserve-fragments + -// rewrite commits) but knows nothing about OmniGraph's `__manifest`. If optimize -// does not publish a manifest update, the manifest's `table_version` lags the -// Lance HEAD: reads stay pinned to the pre-compaction version (compaction is -// invisible to them) and any subsequent schema apply / strict update/delete -// fails its HEAD-vs-manifest precondition with -// "stale view of '': expected manifest table version X but current is Y". -// This pins the fix β€” optimize publishes the compacted version, so manifest == -// HEAD and migrations after a compaction succeed. -#[tokio::test] -async fn optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds() { - let dir = tempfile::tempdir().unwrap(); - let root = dir - .path() - .to_str() - .unwrap() - .trim_end_matches('/') - .to_string(); - let mut db = init_and_load(&dir).await; - - // Several separate inserts β†’ multiple Person fragments, so `compact_files` - // actually merges and moves the Lance HEAD (a single fragment is a no-op). 
- for (name, age) in [("Eve", 40), ("Frank", 41), ("Grace", 42), ("Heidi", 43)] { - mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", name)], &[("$age", age as i64)]), - ) - .await - .expect("insert"); - } - - let stats = db.optimize().await.unwrap(); - let person = stats - .iter() - .find(|s| s.table_key == "node:Person") - .expect("Person stat present"); - assert!( - person.committed, - "Person is multi-fragment, so optimize must have compacted it" - ); - - // After optimize, the manifest's recorded table_version must equal the actual - // Lance HEAD — optimize published its compaction, so there is no drift. - let snap = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); - let entry = snap.entry("node:Person").unwrap(); - let manifest_version = entry.table_version; - let full = format!("{}/{}", root, entry.table_path); - let lance_head = Dataset::open(&full).await.unwrap().version().version; - assert_eq!( - manifest_version, lance_head, - "after optimize, manifest table_version ({manifest_version}) must equal Lance HEAD ({lance_head})", - ); - - // Reads observe the compacted version with rows preserved (4 seed + 4 inserts). - assert_eq!(count_rows(&db, "node:Person").await, 8); - - // The headline: an additive (nullable property) migration touching the - // just-compacted table succeeds, where it previously failed with "stale view".
- let desired = TEST_SCHEMA.replace( - " age: I32?\n}", - " age: I32?\n nickname: String?\n}", - ); - let result = db - .apply_schema(&desired) - .await - .expect("additive schema apply after optimize must succeed"); - assert!(result.applied, "schema apply should report applied=true"); -} - -#[tokio::test] -async fn optimize_skips_preexisting_manifest_head_drift() { - let dir = tempfile::tempdir().unwrap(); - let root = dir - .path() - .to_str() - .unwrap() - .trim_end_matches('/') - .to_string(); - let mut db = init_and_load(&dir).await; - let (manifest_before, head_before, _) = forge_person_compaction_drift(&mut db, &root).await; - - let stats = db.optimize().await.unwrap(); - let person = stats - .iter() - .find(|s| s.table_key == "node:Person") - .expect("Person stat present"); - assert_eq!(person.skipped, Some(SkipReason::DriftNeedsRepair)); - assert!(!person.committed); - assert_eq!(person.manifest_version, Some(manifest_before)); - assert_eq!(person.lance_head_version, Some(head_before)); - - let (manifest_after, head_after, _) = person_manifest_and_head(&db, &root).await; - assert_eq!( - manifest_after, manifest_before, - "optimize must not publish uncovered drift" - ); - assert_eq!( - head_after, head_before, - "optimize must not move drifted HEAD" - ); -} - -#[tokio::test] -async fn repair_preview_reports_verified_maintenance_drift_without_healing() { - let dir = tempfile::tempdir().unwrap(); - let root = dir - .path() - .to_str() - .unwrap() - .trim_end_matches('/') - .to_string(); - let mut db = init_and_load(&dir).await; - let (manifest_before, head_before, _) = forge_person_compaction_drift(&mut db, &root).await; - - let stats = db - .repair(RepairOptions { - confirm: false, - force: false, - }) - .await - .unwrap(); - assert_eq!(stats.manifest_version, None); - let person = stats - .tables - .iter() - .find(|s| s.table_key == "node:Person") - .expect("Person repair stat present"); - assert_eq!( - person.classification, - 
RepairClassification::VerifiedMaintenance - ); - assert_eq!(person.action, RepairAction::Preview); - assert_eq!(person.manifest_version, manifest_before); - assert_eq!(person.lance_head_version, head_before); - assert!( - person - .operations - .iter() - .all(|op| op == "ReserveFragments" || op == "Rewrite"), - "maintenance drift should only include Lance maintenance operations: {:?}", - person.operations - ); - - let (manifest_after, head_after, _) = person_manifest_and_head(&db, &root).await; - assert_eq!(manifest_after, manifest_before); - assert_eq!(head_after, head_before); -} - -#[tokio::test] -async fn repair_confirm_heals_verified_maintenance_drift() { - let dir = tempfile::tempdir().unwrap(); - let root = dir - .path() - .to_str() - .unwrap() - .trim_end_matches('/') - .to_string(); - let mut db = init_and_load(&dir).await; - let (_, head_before, _) = forge_person_compaction_drift(&mut db, &root).await; - - let stats = db - .repair(RepairOptions { - confirm: true, - force: false, - }) - .await - .unwrap(); - assert!( - stats.manifest_version.is_some(), - "confirmed repair should publish one manifest commit" - ); - let person = stats - .tables - .iter() - .find(|s| s.table_key == "node:Person") - .expect("Person repair stat present"); - assert_eq!( - person.classification, - RepairClassification::VerifiedMaintenance - ); - assert_eq!(person.action, RepairAction::Healed); - - let (manifest_after, head_after, _) = person_manifest_and_head(&db, &root).await; - assert_eq!(manifest_after, head_before); - assert_eq!(head_after, head_before); - - let desired = TEST_SCHEMA.replace( - " age: I32?\n}", - " age: I32?\n nickname: String?\n}", - ); - let result = db - .apply_schema(&desired) - .await - .expect("strict schema apply should succeed after repair"); - assert!(result.applied); -} - -#[tokio::test] -async fn repair_refuses_raw_delete_without_force() { - let dir = tempfile::tempdir().unwrap(); - let root = dir - .path() - .to_str() - .unwrap() - 
.trim_end_matches('/') - .to_string(); - let db = init_and_load(&dir).await; - let (manifest_before, head_before, _) = forge_person_delete_drift(&db, &root).await; - - let stats = db - .repair(RepairOptions { - confirm: true, - force: false, - }) - .await - .unwrap(); - assert_eq!(stats.manifest_version, None); - let person = stats - .tables - .iter() - .find(|s| s.table_key == "node:Person") - .expect("Person repair stat present"); - assert_eq!(person.classification, RepairClassification::Suspicious); - assert_eq!(person.action, RepairAction::Refused); - assert!( - person.operations.iter().any(|op| op == "Delete"), - "raw Lance delete should be reported as a suspicious operation: {:?}", - person.operations - ); - - let (manifest_after, head_after, _) = person_manifest_and_head(&db, &root).await; - assert_eq!(manifest_after, manifest_before); - assert_eq!(head_after, head_before); - assert_eq!( - count_rows(&db, "node:Person").await, - 4, - "manifest-pinned reads should still see the pre-delete version" - ); -} - -#[tokio::test] -async fn repair_force_heals_suspicious_drift() { - let dir = tempfile::tempdir().unwrap(); - let root = dir - .path() - .to_str() - .unwrap() - .trim_end_matches('/') - .to_string(); - let db = init_and_load(&dir).await; - let (_, head_before, _) = forge_person_delete_drift(&db, &root).await; - - let stats = db - .repair(RepairOptions { - confirm: true, - force: true, - }) - .await - .unwrap(); - let person = stats - .tables - .iter() - .find(|s| s.table_key == "node:Person") - .expect("Person repair stat present"); - assert_eq!(person.classification, RepairClassification::Suspicious); - assert_eq!(person.action, RepairAction::Forced); - - let (manifest_after, head_after, _) = person_manifest_and_head(&db, &root).await; - assert_eq!(manifest_after, head_before); - assert_eq!(head_after, head_before); - assert_eq!( - count_rows(&db, "node:Person").await, - 3, - "forced repair publishes the raw delete's HEAD" - ); -} - -#[tokio::test] -async 
fn non_strict_load_refuses_uncovered_drift_before_folding_it() { - let dir = tempfile::tempdir().unwrap(); - let root = dir - .path() - .to_str() - .unwrap() - .trim_end_matches('/') - .to_string(); - let mut db = init_and_load(&dir).await; - let (manifest_before, head_before, _) = forge_person_compaction_drift(&mut db, &root).await; - - let err = load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Ivan\",\"age\":44}}", - LoadMode::Merge, - ) - .await - .expect_err("merge load must not silently fold uncovered drift"); - assert!( - err.to_string().contains("omnigraph repair"), - "error should point at explicit repair; got: {err}" - ); - - let (manifest_after, head_after, _) = person_manifest_and_head(&db, &root).await; - assert_eq!(manifest_after, manifest_before); - assert_eq!(head_after, head_before); -} - -#[tokio::test] -async fn delete_only_mutation_refuses_uncovered_drift_before_inline_commit() { - let dir = tempfile::tempdir().unwrap(); - let root = dir - .path() - .to_str() - .unwrap() - .trim_end_matches('/') - .to_string(); - let mut db = init_and_load(&dir).await; - let (manifest_before, head_before, _) = forge_person_compaction_drift(&mut db, &root).await; - - let err = mutate_main( - &mut db, - MUTATION_QUERIES, - "remove_person", - &mixed_params(&[("$name", "Alice")], &[]), - ) - .await - .expect_err("strict delete must reject uncovered drift before delete_where"); - assert!( - err.to_string().contains("expected"), - "delete should fail as a strict stale-version write; got: {err}" - ); - - let (manifest_after, head_after, _) = person_manifest_and_head(&db, &root).await; - assert_eq!(manifest_after, manifest_before); - assert_eq!( - head_after, head_before, - "delete_where must not run after the strict drift guard fails" - ); - assert_eq!( - count_rows(&db, "node:Person").await, - 8, - "manifest-pinned reads should still see all rows present before the failed delete" - ); -} - -// Regression: `optimize` must REFUSE when an unresolved 
recovery sidecar is -// pending. Operating on an unrecovered graph could publish a partial write that -// the all-or-nothing recovery sweep would roll back; the operator must reopen -// (run the recovery sweep) first. -#[tokio::test] -async fn optimize_defers_when_recovery_sidecar_is_pending() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let db = init_and_load(&dir).await; - - // Simulate an in-process failed write that left a recovery sidecar on disk. - let recovery_dir = dir.path().join("__recovery"); - std::fs::create_dir_all(&recovery_dir).unwrap(); - let person_path = node_table_uri(uri, "Person"); - let sidecar_json = format!( - r#"{{ - "schema_version": 1, - "operation_id": "01H000000000000000000DEFR", - "started_at": "0", - "branch": null, - "actor_id": "act-test", - "writer_kind": "Mutation", - "tables": [ - {{ - "table_key": "node:Person", - "table_path": "{}", - "expected_version": 1, - "post_commit_pin": 2 - }} - ] - }}"#, - person_path - ); - std::fs::write( - recovery_dir.join("01H000000000000000000DEFR.json"), - sidecar_json, - ) - .unwrap(); - - let err = db - .optimize() - .await - .expect_err("optimize must defer (error) while a recovery sidecar is pending"); - assert!( - err.to_string().to_lowercase().contains("recovery"), - "optimize defer error should mention recovery; got: {err}", - ); -} - #[tokio::test] async fn cleanup_without_any_policy_option_errors() { let dir = tempfile::tempdir().unwrap(); @@ -1036,278 +158,3 @@ async fn cleanup_then_optimize_preserves_rows_and_table_remains_writable() { .unwrap(); assert_eq!(count_rows(&db, "node:Person").await, people_before); } - -#[tokio::test] -async fn cleanup_reconciles_orphaned_branch_forks() { - // An incomplete prior `branch_delete` can leave a per-table Lance branch - // that the manifest no longer references (a "zombie" fork). It is - // unreachable through any snapshot but pins its `tree/{branch}/` storage. 
- // `cleanup` must reconcile it away: drop every Lance branch absent from the - // manifest authority, without touching `main`. - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let mut db = init_and_load(&dir).await; - - let people_before = count_rows(&db, "node:Person").await; - assert!(people_before > 0, "fixture should seed Person rows"); - - // Forge an orphaned fork the manifest never knew about. - let person_uri = node_table_uri(&uri, "Person"); - { - let mut ds = Dataset::open(&person_uri).await.unwrap(); - let base = ds.version().version; - ds.create_branch("ghost", base, None).await.unwrap(); - assert!( - ds.list_branches().await.unwrap().contains_key("ghost"), - "precondition: orphaned fork staged" - ); - } - - db.cleanup(CleanupPolicyOptions { - keep_versions: Some(1), - older_than: None, - }) - .await - .unwrap(); - - // Orphan reclaimed; main untouched. - { - let ds = Dataset::open(&person_uri).await.unwrap(); - assert!( - !ds.list_branches().await.unwrap().contains_key("ghost"), - "cleanup should reconcile the orphaned 'ghost' fork away" - ); - } - assert_eq!( - count_rows(&db, "node:Person").await, - people_before, - "cleanup must not disturb main while reconciling orphans" - ); - - // Idempotent: a second cleanup with the orphan already gone is a no-op. - db.cleanup(CleanupPolicyOptions { - keep_versions: Some(1), - older_than: None, - }) - .await - .unwrap(); -} - -// cleanup must reclaim a manifest-unreferenced fork even when the BRANCH is -// still live (origin 2: an interrupted first-write fork), while KEEPING a table -// that is legitimately forked on that same live branch. Before the per-table -// authority broadening, the reconciler keyed only on the branch name and so -// never reclaimed a fork on a live branch — the wedge the handoff hit.
-#[tokio::test] -async fn cleanup_reconciles_live_branch_orphan_fork_but_keeps_legitimate_fork() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let mut db = init_and_load(&dir).await; - - db.branch_create("feature").await.unwrap(); - - // Legitimately fork Company onto the live `feature` branch (a real write). - db.load_as( - "feature", - None, - r#"{"type":"Company","data":{"name":"Acme"}}"#, - LoadMode::Merge, - None, - ) - .await - .unwrap(); - - // Forge a manifest-unreferenced Person fork on the SAME live branch: the - // manifest's `feature` snapshot still places Person on main (Person was - // never written on feature), so this ref is an origin-2 orphan. - let person_uri = node_table_uri(&uri, "Person"); - { - let mut ds = Dataset::open(&person_uri).await.unwrap(); - let base = ds.version().version; - ds.create_branch("feature", base, None).await.unwrap(); - assert!( - ds.list_branches().await.unwrap().contains_key("feature"), - "precondition: forged orphan Person fork present on the live branch" - ); - } - - let company_uri = node_table_uri(&uri, "Company"); - let main_people = count_rows(&db, "node:Person").await; - let main_companies = count_rows(&db, "node:Company").await; - - db.cleanup(CleanupPolicyOptions { - keep_versions: Some(1), - older_than: None, - }) - .await - .unwrap(); - - // Origin-2 orphan reclaimed... - { - let ds = Dataset::open(&person_uri).await.unwrap(); - assert!( - !ds.list_branches().await.unwrap().contains_key("feature"), - "cleanup must reclaim the manifest-unreferenced Person fork on the live branch" - ); - } - // ...but the legitimate Company fork on the same live branch is kept. - { - let ds = Dataset::open(&company_uri).await.unwrap(); - assert!( - ds.list_branches().await.unwrap().contains_key("feature"), - "cleanup must NOT reclaim a legitimately-forked table on a live branch" - ); - } - // main is untouched. 
- assert_eq!(count_rows(&db, "node:Person").await, main_people); - assert_eq!(count_rows(&db, "node:Company").await, main_companies); -} - -// Regression (iss-848): a table with rows but NULL vectors (the load-before- -// embed window) must not abort index building. The vector (IVF) index cannot -// train on 0 vectors, so `create_vector_index` errors with "KMeans cannot -// train 1 centroids with 0 vectors". `build_indices_on_dataset_for_catalog` -// is the chokepoint every caller funnels through (load/mutate via -// prepare_updates_for_commit, ensure_indices, optimize, schema apply, merge), -// so per-index fault isolation there must defer that one column (pending) and -// still build the sibling scalar indexes, instead of propagating the error. -// This exercises both the load path (which builds indices inline) and the -// ensure_indices reconciler. Pre-fix this fails at the load step. -#[tokio::test] -async fn index_build_tolerates_null_vector_rows() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let schema = "node Doc {\n \ - slug: String @key\n \ - n: I64 @index\n \ - embedding: Vector(8)? @index\n\ - }\n"; - let mut db = Omnigraph::init(uri, schema).await.unwrap(); - // Rows present, embeddings null (loaded but not yet embedded). - load_jsonl( - &mut db, - "{\"type\":\"Doc\",\"data\":{\"slug\":\"d1\",\"n\":1}}\n\ - {\"type\":\"Doc\",\"data\":{\"slug\":\"d2\",\"n\":2}}", - LoadMode::Merge, - ) - .await - .expect("load rows with null embeddings"); - - // Must not abort: the untrainable vector column is deferred, the sibling - // BTREE on `n` still builds. - db.ensure_indices() - .await - .expect("ensure_indices must not abort when a vector column has no trainable vectors yet"); -} - -// iss-848: `optimize` converges declared-but-unbuilt indexes. After an @index is -// added post-data (a metadata-only apply that defers the physical build), the -// column is unindexed and reads scan. 
`optimize` — the operator's reconciler, -// run on a cron — must materialize it, by composing the ensure_indices -// reconciler after the compaction sweep. Pre-iss-848 optimize only maintained -// coverage of EXISTING indexes (optimize_indices) and never created missing ones. -#[tokio::test] -async fn optimize_materializes_index_declared_but_unbuilt() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let v1 = "node Doc {\n slug: String @key\n rank: I32\n}\n"; - let mut db = Omnigraph::init(uri, v1).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Doc\",\"data\":{\"slug\":\"d1\",\"rank\":1}}\n\ - {\"type\":\"Doc\",\"data\":{\"slug\":\"d2\",\"rank\":2}}", - LoadMode::Merge, - ) - .await - .unwrap(); - - // Add @index on `rank` after data exists: a metadata-only apply that defers - // the physical build (iss-848), so the column is declared-indexed but unbuilt. - let v2 = "node Doc {\n slug: String @key\n rank: I32 @index\n}\n"; - db.apply_schema(v2).await.expect("index-only apply"); - - // Precondition: `rank` is declared @index but unbuilt -> reads degrade. - { - let snap = snapshot_main(&db).await.unwrap(); - let ds = snap.open("node:Doc").await.unwrap(); - assert!( - matches!( - TableStore::key_column_index_coverage(&ds, "rank") - .await - .unwrap(), - IndexCoverage::Degraded { .. } - ), - "rank must be unindexed after the deferred apply" - ); - } - - db.optimize().await.unwrap(); - - // Postcondition: optimize's reconciler materialized the declared index. - let snap = snapshot_main(&db).await.unwrap(); - let ds = snap.open("node:Doc").await.unwrap(); - assert_eq!( - TableStore::key_column_index_coverage(&ds, "rank") - .await - .unwrap(), - IndexCoverage::Indexed, - "optimize must build the declared-but-unbuilt rank index" - ); -} - -// iss-848 (PR review): the rename path also defers index building.
A RenameType -// migration writes the renamed table as a new dataset with the existing rows -// but no indexes (its inline build was removed). optimize must then materialize -// the declared index on the renamed table. -#[tokio::test] -async fn optimize_materializes_index_after_type_rename() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let v1 = "node Doc {\n slug: String @key\n rank: I32 @index\n}\n"; - let mut db = Omnigraph::init(uri, v1).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Doc\",\"data\":{\"slug\":\"d1\",\"rank\":1}}\n\ - {\"type\":\"Doc\",\"data\":{\"slug\":\"d2\",\"rank\":2}}", - LoadMode::Merge, - ) - .await - .unwrap(); - - // Rename Doc -> Item; rows are preserved on the new table key. - let v2 = "node Item @rename_from(\"Doc\") {\n slug: String @key\n rank: I32 @index\n}\n"; - let result = db.apply_schema(v2).await.expect("rename apply"); - assert!(result.applied); - assert_eq!( - count_rows(&db, "node:Item").await, - 2, - "rename must preserve rows" - ); - - // Post-rename the renamed table's declared rank index is unbuilt (deferred). - { - let snap = snapshot_main(&db).await.unwrap(); - let ds = snap.open("node:Item").await.unwrap(); - assert!( - matches!( - TableStore::key_column_index_coverage(&ds, "rank") - .await - .unwrap(), - IndexCoverage::Degraded { .. } - ), - "rank must be unindexed immediately after the rename" - ); - } - - db.optimize().await.unwrap(); - - let snap = snapshot_main(&db).await.unwrap(); - let ds = snap.open("node:Item").await.unwrap(); - assert_eq!( - TableStore::key_column_index_coverage(&ds, "rank") - .await - .unwrap(), - IndexCoverage::Indexed, - "optimize must build the renamed table's deferred rank index" - ); -} diff --git a/crates/omnigraph/tests/merge_fast_forward.rs b/crates/omnigraph/tests/merge_fast_forward.rs deleted file mode 100644 index 185f45d..0000000 --- a/crates/omnigraph/tests/merge_fast_forward.rs +++ /dev/null @@ -1,213 +0,0 @@ -//! 
Fast-forward branch-merge cost + correctness. -//! -//! The data-path fix routes *new* rows of an adopted-source merge through -//! `stage_append` (a streaming `Operation::Append`) instead of lumping new + -//! changed rows into one `stage_merge_insert` (a full-outer hash join that -//! buffers the whole delta and exhausts the DataFusion memory pool on -//! embedding-bearing tables). -//! -//! The regression gate here is *structural*, not a brittle size threshold: it -//! asserts WHICH staged-write primitive the merge invokes, via the task-local -//! write probes in `omnigraph::instrumentation`. That is deterministic and -//! machine-independent β€” it cannot flake on a bigger memory pool. - -// Wrapping `branch_merge` in `with_merge_write_probes` (a task-local scope) -// nests the already-deep merge future one layer deeper, overflowing rustc's -// default 128 layout-query depth. Bump it for this test crate. -#![recursion_limit = "512"] - -mod helpers; - -use omnigraph::db::{MergeOutcome, Omnigraph}; -use omnigraph::instrumentation::{MergeWriteProbes, with_merge_write_probes}; - -use helpers::*; - -/// Insert `n` brand-new persons (fresh ids) onto `branch`, forking the Person -/// table onto it. All rows are "new on source" β€” none collide with base ids. -async fn append_new_persons(db: &mut Omnigraph, branch: &str, n: usize) { - for i in 0..n { - mutate_branch( - db, - branch, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", &format!("ff_new_{i}"))], &[("$age", 30)]), - ) - .await - .unwrap(); - } -} - -/// THE cost-budget gate. An append-only fast-forward merge must append the new -/// rows and run **zero** `stage_merge_insert` (the full-outer hash join that is -/// the OOM). RED today (new + changed are lumped into one `stage_merge_insert`); -/// GREEN once the adopt path splits newβ†’`stage_append`, changedβ†’`stage_merge_insert`. 
-#[tokio::test] -async fn append_only_fast_forward_merge_does_no_merge_insert() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let main = init_and_load(&dir).await; - main.branch_create("feature").await.unwrap(); - - let mut feature = Omnigraph::open(uri).await.unwrap(); - append_new_persons(&mut feature, "feature", 5).await; - - let probes = MergeWriteProbes::default(); - let outcome = - with_merge_write_probes(probes.clone(), main.branch_merge("feature", "main")) - .await - .unwrap(); - assert_eq!(outcome, MergeOutcome::FastForward); - - assert_eq!( - probes.stage_merge_insert_calls(), - 0, - "append-only fast-forward merge must do 0 stage_merge_insert (the OOM hash join); did {}", - probes.stage_merge_insert_calls(), - ); - assert!( - probes.stage_append_calls() >= 1, - "append-only fast-forward merge must append the new rows via stage_append; did {}", - probes.stage_append_calls(), - ); - assert_eq!( - probes.scan_staged_combined_calls(), - 0, - "append-only merge must stream the append (stage_append_stream), not materialize the \ - whole delta into one batch via scan_staged_combined; did {}", - probes.scan_staged_combined_calls(), - ); -} - -/// Functional correctness: a fast-forward merge of an append-only branch leaves -/// main equal to the source branch. Independent of the cost-budget gate. 
-#[tokio::test] -async fn fast_forward_merge_yields_source_state() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let main = init_and_load(&dir).await; - let base_count = count_rows(&main, "node:Person").await; - - main.branch_create("feature").await.unwrap(); - let mut feature = Omnigraph::open(uri).await.unwrap(); - append_new_persons(&mut feature, "feature", 5).await; - let source_count = count_rows_branch(&feature, "feature", "node:Person").await; - assert_eq!(source_count, base_count + 5); - - let outcome = main.branch_merge("feature", "main").await.unwrap(); - assert_eq!(outcome, MergeOutcome::FastForward); - - // main now equals source: the 5 new persons are present, the base rows kept. - assert_eq!(count_rows(&main, "node:Person").await, source_count); - let names = collect_column_strings(&read_table(&main, "node:Person").await, "name"); - for i in 0..5 { - assert!( - names.contains(&format!("ff_new_{i}")), - "merged main missing new person ff_new_{i}; have {names:?}" - ); - } -} - -const VEC_SCHEMA: &str = "node Chunk {\n slug: String @key\n embedding: Vector(8) @index\n}\n"; - -/// Commit 6 behavior: the fast-forward adopt path does NOT build indices inline -/// -- index coverage is reconciler-owned (`optimize`/`ensure_indices`). A merge -/// into a freshly-initialized (unindexed) vector table must perform **0** inline -/// vector-index (IVF) builds; reads stay correct via brute-force until -/// `optimize` covers the new rows. RED before the change (the publish path built -/// the IVF inline); GREEN after. -#[tokio::test] -async fn fast_forward_merge_defers_vector_index_to_reconciler() { - use omnigraph::loader::LoadMode; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - // Empty Chunk table → no vector index at init (KMeans can't train on 0 rows).
- let main = Omnigraph::init(uri, VEC_SCHEMA).await.unwrap(); - main.branch_create("feature").await.unwrap(); - - // Load embedding-bearing chunks onto the branch. The branch builds its own - // index here (outside the probe scope) -- irrelevant to the merge's cost. - let mut rows = String::new(); - for i in 0..24 { - let v: Vec<String> = (0..8).map(|j| format!("{}.0", (i + j) % 5)).collect(); - rows.push_str(&format!( - "{{\"type\":\"Chunk\",\"data\":{{\"slug\":\"c{i}\",\"embedding\":[{}]}}}}\n", - v.join(",") - )); - } - let feature = Omnigraph::open(uri).await.unwrap(); - feature.load("feature", &rows, LoadMode::Merge).await.unwrap(); - - // Merge, counting inline vector-index builds the publish path performs. - let probes = MergeWriteProbes::default(); - let outcome = with_merge_write_probes(probes.clone(), main.branch_merge("feature", "main")) - .await - .unwrap(); - assert_eq!(outcome, MergeOutcome::FastForward); - - assert_eq!( - probes.create_vector_index_calls(), - 0, - "fast-forward adopt merge must defer vector-index coverage to the reconciler \ - (0 inline IVF builds); did {}", - probes.create_vector_index_calls(), - ); - // Correctness: the rows landed on main (reads brute-force until optimize). - assert_eq!(count_rows(&main, "node:Chunk").await, 24); -} - -const BLOB_SCHEMA: &str = "node Document {\n title: String @key\n content: Blob?\n note: String?\n}\n"; -const BLOB_INSERT: &str = r#" -query insert_doc($title: String, $content: Blob, $note: String) { - insert Document { title: $title, content: $content, note: $note } -} -"#; - -/// A fast-forward merge of a branch with a Blob column exercises the blob -/// fallback in `scan_stream_for_rewrite` (materialize → re-stream) through the -/// streaming append. main is NOT mutated, so Document is `AdoptWithDelta` (the -/// adopt/append path), not `RewriteMerged`. The blob bytes must survive the -/// materialize → stream → append round-trip.
-#[tokio::test] -async fn fast_forward_merge_streams_blob_columns() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut main = Omnigraph::init(uri, BLOB_SCHEMA).await.unwrap(); - load_jsonl( - &mut main, - "{\"type\":\"Document\",\"data\":{\"title\":\"seed\",\"content\":\"base64:U2VlZA==\",\"note\":\"base\"}}", - LoadMode::Overwrite, - ) - .await - .unwrap(); - main.branch_create("feature").await.unwrap(); - - // Only the branch is mutated → fast-forward → adopt/append path. - let mut feature = Omnigraph::open(uri).await.unwrap(); - mutate_branch( - &mut feature, - "feature", - BLOB_INSERT, - "insert_doc", - &params(&[ - ("$title", "readme"), - ("$content", "base64:SGVsbG8="), - ("$note", "branch"), - ]), - ) - .await - .unwrap(); - - let outcome = main.branch_merge("feature", "main").await.unwrap(); - assert_eq!(outcome, MergeOutcome::FastForward); - - // The appended blob row's bytes survive the streaming append; the base row stays intact. - let readme = main.read_blob("Document", "readme", "content").await.unwrap(); - assert_eq!(&readme.read().await.unwrap()[..], b"Hello"); - let seed = main.read_blob("Document", "seed", "content").await.unwrap(); - assert_eq!(&seed.read().await.unwrap()[..], b"Seed"); -} diff --git a/crates/omnigraph/tests/merge_truth_table.rs b/crates/omnigraph/tests/merge_truth_table.rs index e2df882..068b439 100644 --- a/crates/omnigraph/tests/merge_truth_table.rs +++ b/crates/omnigraph/tests/merge_truth_table.rs @@ -941,8 +941,8 @@ async fn merge_pair_truth_table() { unsupported_cells, 45, "expected 45 cells involving dropProperty/addLabel/removeLabel" ); - // No wall-clock assertion here: `elapsed` is logged above for visibility, but - // a fixed time budget in a correctness test flakes under parallel test load - // (it tripped at ~31s in the full `--test-threads=4` gate while passing at - // ~20s in isolation).
Merge-perf regressions belong in a bench, not here. + assert!( + elapsed.as_secs() < 30, + "merge truth table exceeded 30s budget: {elapsed:?}" + ); } diff --git a/crates/omnigraph/tests/ordering.rs b/crates/omnigraph/tests/ordering.rs deleted file mode 100644 index 2684b1c..0000000 --- a/crates/omnigraph/tests/ordering.rs +++ /dev/null @@ -1,134 +0,0 @@ -//! ORDER BY golden coverage: descending, multi-key precedence, deterministic -//! tie-break (total order), and NULL placement. -//! -//! These pin the observable output-ordering contract (deny-list: "output -//! ordering … become dependencies once shipped"). `apply_ordering` appends the -//! bound entities' key columns as an ascending tie-break, so equal user-sort -//! keys yield a TOTAL, deterministic order (and `ORDER … LIMIT` is -//! deterministic). NULL placement is `nulls_first = !descending` (NULLs first -//! under ASC, last under DESC). Both are documented in -//! `docs/user/queries/index.md`. - -mod helpers; - -use arrow_array::{Array, StringArray}; - -use omnigraph::db::Omnigraph; -use omnigraph::loader::{LoadMode, load_jsonl}; -use omnigraph_compiler::ir::ParamMap; -use omnigraph_compiler::result::QueryResult; - -use helpers::*; - -/// Names in result ROW order (not sorted) -- these tests assert positional order. -fn names_in_order(result: &QueryResult) -> Vec<String> { - let batch = result.concat_batches().unwrap(); - if batch.num_rows() == 0 { - return Vec::new(); - } - let col = batch - .column(0) - .as_any() - .downcast_ref::<StringArray>() - .unwrap(); - (0..col.len()).map(|i| col.value(i).to_string()).collect() -} - -/// Init the standard schema and load a custom Person-only dataset.
-async fn init_people(dir: &tempfile::TempDir, jsonl: &str) -> Omnigraph { - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - load_jsonl(&mut db, jsonl, LoadMode::Overwrite).await.unwrap(); - db -} - -#[tokio::test] -async fn ordering_descending() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - let q = r#" -query q() { - match { $p: Person } - return { $p.name } - order { $p.age desc } -} -"#; - let got = names_in_order(&query_main(&mut db, q, "q", &ParamMap::new()).await.unwrap()); - // Charlie(35), Alice(30), Diana(28), Bob(25) - assert_eq!(got, vec!["Charlie", "Alice", "Diana", "Bob"]); -} - -#[tokio::test] -async fn ordering_multi_key_age_desc_name_asc() { - let dir = tempfile::tempdir().unwrap(); - // Alice & Bob tie at age 30; loaded Bob-first so the expected output order - // cannot be the load order. - let data = r#"{"type":"Person","data":{"name":"Bob","age":30}} -{"type":"Person","data":{"name":"Alice","age":30}} -{"type":"Person","data":{"name":"Charlie","age":25}}"#; - let mut db = init_people(&dir, data).await; - let q = r#" -query q() { - match { $p: Person } - return { $p.name } - order { $p.age desc, $p.name asc } -} -"#; - let got = names_in_order(&query_main(&mut db, q, "q", &ParamMap::new()).await.unwrap()); - // age desc -> [30,30,25]; the 30-tie broken by name asc -> Alice before Bob. - assert_eq!(got, vec!["Alice", "Bob", "Charlie"]); -} - -#[tokio::test] -async fn ordering_tiebreak_by_key_is_deterministic() { - let dir = tempfile::tempdir().unwrap(); - // Same tie at age 30, NO secondary sort key. Loaded Bob-first; the tie must - // break by the entity key (name) ascending -> Alice before Bob, regardless - // of load order. This locks the total-order tie-break in apply_ordering. 
- let data = r#"{"type":"Person","data":{"name":"Bob","age":30}} -{"type":"Person","data":{"name":"Alice","age":30}} -{"type":"Person","data":{"name":"Charlie","age":25}}"#; - let mut db = init_people(&dir, data).await; - let q = r#" -query q() { - match { $p: Person } - return { $p.name } - order { $p.age asc } -} -"#; - let got = names_in_order(&query_main(&mut db, q, "q", &ParamMap::new()).await.unwrap()); - // age asc -> Charlie(25), then the 30-tie broken by key asc -> Alice, Bob. - assert_eq!(got, vec!["Charlie", "Alice", "Bob"]); -} - -#[tokio::test] -async fn ordering_nulls_placement_asc_and_desc() { - let dir = tempfile::tempdir().unwrap(); - // Bob has a NULL age. - let data = r#"{"type":"Person","data":{"name":"Alice","age":30}} -{"type":"Person","data":{"name":"Bob","age":null}} -{"type":"Person","data":{"name":"Charlie","age":25}}"#; - let mut db = init_people(&dir, data).await; - - let asc = r#" -query q() { - match { $p: Person } - return { $p.name } - order { $p.age asc } -} -"#; - let got_asc = names_in_order(&query_main(&mut db, asc, "q", &ParamMap::new()).await.unwrap()); - // ASC: nulls_first -> Bob(null), then 25, 30. - assert_eq!(got_asc, vec!["Bob", "Charlie", "Alice"]); - - let desc = r#" -query q() { - match { $p: Person } - return { $p.name } - order { $p.age desc } -} -"#; - let got_desc = names_in_order(&query_main(&mut db, desc, "q", &ParamMap::new()).await.unwrap()); - // DESC: nulls last -> 30, 25, then Bob(null). 
- assert_eq!(got_desc, vec!["Alice", "Charlie", "Bob"]); -} diff --git a/crates/omnigraph/tests/policy_engine_chassis.rs b/crates/omnigraph/tests/policy_engine_chassis.rs index 8443940..def5349 100644 --- a/crates/omnigraph/tests/policy_engine_chassis.rs +++ b/crates/omnigraph/tests/policy_engine_chassis.rs @@ -243,7 +243,6 @@ async fn load_as_denies_when_policy_rejects_actor() { let result = db .load_as( "main", - None, ONE_PERSON_JSONL, LoadMode::Merge, Some("act-denied"), @@ -259,7 +258,6 @@ async fn load_as_allows_when_policy_permits_actor() { db.load_as( "main", - None, ONE_PERSON_JSONL, LoadMode::Merge, Some("act-allowed"), @@ -283,7 +281,6 @@ async fn load_file_as_denies_when_policy_rejects_actor() { let result = db .load_file_as( "main", - None, data_path.to_str().unwrap(), LoadMode::Merge, Some("act-denied"), @@ -301,7 +298,6 @@ async fn load_file_as_allows_when_policy_permits_actor() { db.load_file_as( "main", - None, data_path.to_str().unwrap(), LoadMode::Merge, Some("act-allowed"), @@ -311,7 +307,6 @@ async fn load_file_as_allows_when_policy_permits_actor() { } #[tokio::test] -#[allow(deprecated)] async fn ingest_as_denies_when_policy_rejects_actor() { let dir = tempfile::tempdir().unwrap(); let (db, _engine) = init_with_policy(&dir).await; @@ -329,7 +324,6 @@ async fn ingest_as_denies_when_policy_rejects_actor() { } #[tokio::test] -#[allow(deprecated)] async fn ingest_as_allows_when_policy_permits_actor() { let dir = tempfile::tempdir().unwrap(); let (db, _engine) = init_with_policy(&dir).await; diff --git a/crates/omnigraph/tests/proptest_equivalence.rs b/crates/omnigraph/tests/proptest_equivalence.rs deleted file mode 100644 index 3423a2f..0000000 --- a/crates/omnigraph/tests/proptest_equivalence.rs +++ /dev/null @@ -1,311 +0,0 @@ -//! Property-based query-correctness invariants over generated graphs. -//! -//! The cross-type id-collision bug (fixed in f6a0e53) was a silent wrong-result -//! 
divergence between the two Expand modes, caught only because someone -//! hand-built the one colliding fixture. This turns that single example into a -//! search over the whole class: node keys for BOTH types are drawn from a small -//! SHARED alphabet, so cross-type collisions -- plus cycles and self-loops -- -//! arise frequently. The invariants make any future fork divergence (the planned -//! third ExpandMode, the anti-join fast/slow fork) fail loudly instead of -//! silently. -//! -//! Each test is a sync `#[test]` + `#[serial]`: it builds its own runtime and -//! `block_on`s per generated case (proptest closures are sync), and the -//! mode-equivalence test writes `OMNIGRAPH_TRAVERSAL_MODE`, so serial execution -//! keeps env writes from racing other tests in this binary. - -mod helpers; - -use std::collections::HashSet; - -use arrow_array::{Array, StringArray}; -use proptest::prelude::*; -use proptest::test_runner::{Config, TestRunner}; -use serial_test::serial; - -use omnigraph::db::{Omnigraph, ReadTarget}; -use omnigraph::loader::{LoadMode, load_jsonl}; -use omnigraph_compiler::ir::ParamMap; -use omnigraph_compiler::query::ast::Literal; - -use helpers::*; - -/// Small SHARED key alphabet -- Person and Company keys are both drawn from this, -/// so cross-type id collisions are common.
-const KEYS: &[&str] = &["a", "b", "c", "d", "e"]; - -const QUERIES: &str = r#" -query friends($name: String) { - match { - $p: Person { name: $name } - $p knows{1,3} $f - } - return { $f.name } -} -query employers($name: String) { - match { - $p: Person { name: $name } - $p worksAt{1,2} $c - } - return { $c.name } -} -query all_persons() { - match { $p: Person } - return { $p.name } -} -query employed() { - match { - $p: Person - $p worksAt $c - } - return { $p.name } -} -query unemployed() { - match { - $p: Person - not { $p worksAt $_ } - } - return { $p.name } -} -"#; - -#[derive(Debug, Clone)] -struct GenGraph { - persons: Vec<String>, - companies: Vec<String>, - knows: Vec<(usize, usize)>, // indices into persons (self-loops & cycles allowed) - works_at: Vec<(usize, usize)>, // (person idx, company idx) -} - -impl GenGraph { - fn to_jsonl(&self) -> String { - let mut s = String::new(); - for p in &self.persons { - s.push_str(&format!("{{\"type\":\"Person\",\"data\":{{\"name\":\"{p}\"}}}}\n")); - } - for c in &self.companies { - s.push_str(&format!("{{\"type\":\"Company\",\"data\":{{\"name\":\"{c}\"}}}}\n")); - } - // Dedup exact-duplicate edge rows (the loader rejects intra-batch - // duplicate keys); collisions/cycles/self-loops are unaffected.
- let mut seen = HashSet::new(); - for &(a, b) in &self.knows { - if seen.insert(("k", a, b)) { - s.push_str(&format!( - "{{\"edge\":\"Knows\",\"from\":\"{}\",\"to\":\"{}\"}}\n", - self.persons[a], self.persons[b] - )); - } - } - for &(a, b) in &self.works_at { - if seen.insert(("w", a, b)) { - s.push_str(&format!( - "{{\"edge\":\"WorksAt\",\"from\":\"{}\",\"to\":\"{}\"}}\n", - self.persons[a], self.companies[b] - )); - } - } - s - } -} - -fn arb_keys() -> impl Strategy<Value = Vec<String>> { - proptest::sample::subsequence(KEYS.to_vec(), 1..=KEYS.len()) - .prop_map(|v| v.into_iter().map(String::from).collect()) -} - -fn arb_graph() -> impl Strategy<Value = GenGraph> { - (arb_keys(), arb_keys()).prop_flat_map(|(persons, companies)| { - let np = persons.len(); - let nc = companies.len(); - let knows = prop::collection::vec((0..np, 0..np), 0..=10); - let works = prop::collection::vec((0..np, 0..nc), 0..=10); - (Just(persons), Just(companies), knows, works).prop_map( - |(persons, companies, knows, works_at)| GenGraph { - persons, - companies, - knows, - works_at, - }, - ) - }) -} - -fn config() -> Config { - Config { - cases: 48, - ..Config::default() - } -} - -fn clear_mode() { - unsafe { std::env::remove_var("OMNIGRAPH_TRAVERSAL_MODE") }; -} - -/// RAII guard that sets `OMNIGRAPH_TRAVERSAL_MODE` and clears it on drop -- so a -/// panic mid-case (e.g. a query `unwrap`) cannot leak the forced mode into -/// proptest's subsequent shrink/cases and mask the divergence under test. SAFE: -/// every test in this binary is `#[serial]`, so no thread reads the env during -/// the write.
-struct ModeGuard; -impl ModeGuard { - fn set(mode: &str) -> Self { - unsafe { std::env::set_var("OMNIGRAPH_TRAVERSAL_MODE", mode) }; - ModeGuard - } -} -impl Drop for ModeGuard { - fn drop(&mut self) { - unsafe { std::env::remove_var("OMNIGRAPH_TRAVERSAL_MODE") }; - } -} - -async fn load_graph(graph: &GenGraph) -> (tempfile::TempDir, Omnigraph) { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - load_jsonl(&mut db, &graph.to_jsonl(), LoadMode::Overwrite) - .await - .unwrap(); - (dir, db) -} - -fn one_param(val: &str) -> ParamMap { - let mut m = ParamMap::new(); - m.insert("name".to_string(), Literal::String(val.to_string())); - m -} - -/// First-column strings, sorted (MULTISET -- preserves duplicate-row count so -/// mode comparisons catch dedup divergence, not just set divergence). -async fn col0_sorted(db: &mut Omnigraph, name: &str, params: &ParamMap) -> Vec<String> { - let r = db - .query(ReadTarget::branch("main"), QUERIES, name, params) - .await - .unwrap(); - if r.num_rows() == 0 { - return Vec::new(); - } - let b = r.concat_batches().unwrap(); - let col = b.column(0).as_any().downcast_ref::<StringArray>().unwrap(); - let mut v: Vec<String> = (0..col.len()).map(|i| col.value(i).to_string()).collect(); - v.sort(); - v -} - -async fn col0_set(db: &mut Omnigraph, name: &str, params: &ParamMap) -> HashSet<String> { - col0_sorted(db, name, params).await.into_iter().collect() -} - -// INVARIANT 1: mode equivalence. For any generated graph and start key, the -// CSR, indexed, and auto paths return identical result multisets -- over both a -// same-type traversal (knows{1,3}, exercises cycles/self-loops) and a cross-type -// one (worksAt{1,2}, collision-prone). This is the search-over-the-class version -// of the hand-built cross-type-collision fixture.
-#[test] -#[serial] -fn prop_expand_indexed_eq_csr() { - let rt = tokio::runtime::Runtime::new().unwrap(); - let mut runner = TestRunner::new(config()); - runner - .run(&arb_graph(), |graph| { - let mismatch = rt.block_on(async { - let (_dir, mut db) = load_graph(&graph).await; - for start in graph.persons.clone() { - let p = one_param(&start); - for q in ["friends", "employers"] { - // Each guard clears the mode on drop (end of the block, - // or on panic), so a forced mode never leaks across runs. - let csr = { - let _g = ModeGuard::set("csr"); - col0_sorted(&mut db, q, &p).await - }; - let indexed = { - let _g = ModeGuard::set("indexed"); - col0_sorted(&mut db, q, &p).await - }; - // No guard → env unset → auto (cost-based) path. - let auto = col0_sorted(&mut db, q, &p).await; - if csr != indexed || csr != auto { - return Some((start, q, csr, indexed, auto)); - } - } - } - None - }); - prop_assert!( - mismatch.is_none(), - "Expand mode divergence: {:?}", - mismatch - ); - Ok(()) - }) - .unwrap(); -} - -// INVARIANT 2: no phantom rows. Every key a traversal returns must belong to the -// destination type's loaded key set -- independent of the two-mode comparison, so -// it catches over-emission even if both modes are wrong identically.
-#[test] -#[serial] -fn prop_results_subset_of_existing_nodes() { - clear_mode(); - let rt = tokio::runtime::Runtime::new().unwrap(); - let mut runner = TestRunner::new(config()); - runner - .run(&arb_graph(), |graph| { - let bad = rt.block_on(async { - let (_dir, mut db) = load_graph(&graph).await; - let persons: HashSet<String> = graph.persons.iter().cloned().collect(); - let companies: HashSet<String> = graph.companies.iter().cloned().collect(); - for start in graph.persons.clone() { - let p = one_param(&start); - for f in col0_set(&mut db, "friends", &p).await { - if !persons.contains(&f) { - return Some(("friends", start, f)); - } - } - for c in col0_set(&mut db, "employers", &p).await { - if !companies.contains(&c) { - return Some(("employers", start, c)); - } - } - } - None - }); - prop_assert!(bad.is_none(), "phantom row: {:?}", bad); - Ok(()) - }) - .unwrap(); -} - -// INVARIANT 3: anti-join complement. `not { $p worksAt $_ }` and its complement -// (persons WITH a worksAt) must be disjoint and together cover all persons.
-#[test] -#[serial] -fn prop_antijoin_partitions_persons() { - clear_mode(); - let rt = tokio::runtime::Runtime::new().unwrap(); - let mut runner = TestRunner::new(config()); - runner - .run(&arb_graph(), |graph| { - let err = rt.block_on(async { - let (_dir, mut db) = load_graph(&graph).await; - let all = col0_set(&mut db, "all_persons", &ParamMap::new()).await; - let unemployed = col0_set(&mut db, "unemployed", &ParamMap::new()).await; - let employed = col0_set(&mut db, "employed", &ParamMap::new()).await; - let overlap: Vec<_> = unemployed.intersection(&employed).cloned().collect(); - let union: HashSet<_> = unemployed.union(&employed).cloned().collect(); - if !overlap.is_empty() { - return Some(format!("overlap {overlap:?}")); - } - if union != all { - return Some(format!("union {union:?} != all {all:?}")); - } - None - }); - prop_assert!(err.is_none(), "anti-join partition broken: {:?}", err); - Ok(()) - }) - .unwrap(); -} diff --git a/crates/omnigraph/tests/recovery.rs b/crates/omnigraph/tests/recovery.rs index 9658c58..a090178 100644 --- a/crates/omnigraph/tests/recovery.rs +++ b/crates/omnigraph/tests/recovery.rs @@ -104,10 +104,8 @@ async fn recovery_refuses_unknown_schema_version_on_open() { let _db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); drop(_db); - // A sidecar from a hypothetical future writer (version NEWER than this - // binary's max); the reader must refuse to interpret it -- it cannot guess - // semantics a newer writer baked in. (Older versions are accepted and - // interpreted with their original semantics; see `parse_sidecar`.) + // A sidecar from a hypothetical future writer; the older binary must + // refuse to interpret it (resolved-decisions §3 in the design doc).
let sidecar_json = r#"{ "schema_version": 99, "operation_id": "01H000000000000000000000ZZ", @@ -122,11 +120,11 @@ async fn recovery_refuses_unknown_schema_version_on_open() { let err = Omnigraph::open(uri) .await .err() - .expect("expected open to fail because of a future sidecar schema_version"); + .expect("expected open to fail because of unknown sidecar schema_version"); let msg = err.to_string(); assert!( - msg.contains("schema_version=99") && msg.contains("newer than the maximum"), - "expected a future-version refusal, got: {}", + msg.contains("schema_version=99") && msg.contains("supports only schema_version=1"), + "expected SidecarSchemaError mentioning the version mismatch, got: {}", msg, ); // Sidecar must still be on disk -- we don't auto-delete unparseable files. @@ -136,218 +134,6 @@ async fn recovery_refuses_unknown_schema_version_on_open() { ); } -#[tokio::test] -async fn recovery_refuses_corrupt_sidecar_on_open_and_write() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - - // A truncated/garbage sidecar -- e.g. a crashed writer or a partial - // local-FS write (S3 PutObject is atomic; local fs::write is not). - write_sidecar_file(dir.path(), "01H000000000000000000000CC", "{not json"); - - // A live handle's write-entry heal must surface the parse failure - // loudly instead of proceeding over a sidecar it cannot interpret. - let err = load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"Alice","age":30}} -"#, - LoadMode::Merge, - ) - .await - .err() - .expect("expected the write to fail on the corrupt sidecar"); - assert!( - err.to_string().contains("is not valid JSON"), - "expected the corrupt-sidecar parse error, got: {}", - err, - ); - - // A fresh ReadWrite open fails the same way.
- drop(db); - let err = Omnigraph::open(uri) - .await - .err() - .expect("expected open to fail because of the corrupt sidecar"); - let msg = err.to_string(); - assert!( - msg.contains("01H000000000000000000000CC") && msg.contains("is not valid JSON"), - "expected the corrupt-sidecar parse error naming the file, got: {}", - msg, - ); - // The file must remain on disk for inspection -- never auto-deleted. - assert!( - list_recovery_dir(dir.path()).contains(&"01H000000000000000000000CC.json".to_string()), - "corrupt sidecar should remain on disk after refusal" - ); - - // Read-only open still works -- the sweep is skipped entirely. - let _db = Omnigraph::open_read_only(uri).await.unwrap(); -} - -/// The commit-time drift guard's advice must be branch-aware: a pending -/// sidecar on ANOTHER branch does not cover this branch's drift. With a -/// deferred feature-branch sidecar on disk and genuinely uncovered drift -/// on main, the main write must still point at `omnigraph repair` -- a -/// read-write reopen recovers the sidecar but cannot repair main's -/// uncovered drift. -#[tokio::test] -async fn drift_guard_advice_ignores_other_branch_sidecars() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Alice\",\"age\":30}}\n", - LoadMode::Merge, - ) - .await - .unwrap(); - db.branch_create("feature").await.unwrap(); - // A real feature write forks Person's Lance dataset onto the branch - // (the heal classifies a feature sidecar against the forked head).
- db.mutate( - "feature", - helpers::MUTATION_QUERIES, - "insert_person", - &helpers::mixed_params(&[("$name", "eve")], &[("$age", 22)]), - ) - .await - .unwrap(); - - // A sidecar pinning node:Person ON FEATURE, shaped so the write-entry - // heal defers it (head < expected_version classifies as an invariant - // violation; roll-forward-only mode leaves it for the next ReadWrite - // open) -- it persists through the write attempt below. - let person_uri = node_table_uri(uri, "Person"); - let sidecar_json = format!( - r#"{{ - "schema_version": 1, - "operation_id": "01H000000000000000000000XB", - "started_at": "0", - "branch": "feature", - "actor_id": null, - "writer_kind": "Mutation", - "tables": [ - {{ - "table_key": "node:Person", - "table_path": "{person_uri}", - "expected_version": 999, - "post_commit_pin": 1000, - "table_branch": "feature" - }} - ] - }}"# - ); - write_sidecar_file(dir.path(), "01H000000000000000000000XB", &sidecar_json); - - // Genuinely uncovered drift on MAIN's Person (raw Lance write - // bypassing the manifest -- the `omnigraph repair` class). - let mut ds = Dataset::open(&person_uri).await.unwrap(); - let _ = helpers::lance_delete_inline(&mut ds, "1 = 2").await; - - let err = load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Bob\",\"age\":25}}\n", - LoadMode::Merge, - ) - .await - .err() - .expect("uncovered main drift must fail the write"); - assert!( - err.to_string().contains("run `omnigraph repair`"), - "a feature-branch sidecar must not flip main's uncovered-drift \ - advice to the reopen path; got: {err}" - ); -} - -/// A deferred sidecar pinned to a branch that is subsequently DELETED -/// must not wedge the graph: the branch's tree and forks are reclaimed, -/// so the pinned drift is unreachable and the sidecar is provably moot.
-/// Both the write-entry heal and the open-time sweep must classify it -/// as orphaned (audit + discard) instead of failing to open the dead -/// branch on every write and every ReadWrite open -- a terminal state, -/// since `repair` refuses while a sidecar is pending. -#[tokio::test] -async fn deleted_branch_sidecar_does_not_wedge_writes_or_open() { - use omnigraph::loader::{LoadMode, load_jsonl}; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let mut db = Omnigraph::init(&uri, TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Alice\",\"age\":30}}\n", - LoadMode::Merge, - ) - .await - .unwrap(); - db.branch_create("feature").await.unwrap(); - db.mutate( - "feature", - helpers::MUTATION_QUERIES, - "insert_person", - &helpers::mixed_params(&[("$name", "eve")], &[("$age", 22)]), - ) - .await - .unwrap(); - - // A rollback-eligible (deferred) sidecar pinned to feature -- shaped - // so every roll-forward-only pass leaves it on disk. - let person_uri = node_table_uri(&uri, "Person"); - let sidecar_json = format!( - r#"{{ - "schema_version": 1, - "operation_id": "01H000000000000000000000DB", - "started_at": "0", - "branch": "feature", - "actor_id": null, - "writer_kind": "Mutation", - "tables": [ - {{ - "table_key": "node:Person", - "table_path": "{person_uri}", - "expected_version": 999, - "post_commit_pin": 1000, - "table_branch": "feature" - }} - ] - }}"# - ); - write_sidecar_file(dir.path(), "01H000000000000000000000DB", &sidecar_json); - - // Branch delete defers the rollback-eligible sidecar and proceeds -- - // the sidecar now references a branch that no longer exists. - db.branch_delete("feature").await.unwrap(); - - // The next write's heal must classify the orphan and discard it, - // not fail opening the dead branch.
- load_jsonl( - &mut db, - "{\"type\":\"Person\",\"data\":{\"name\":\"Bob\",\"age\":25}}\n", - LoadMode::Merge, - ) - .await - .expect("a write after deleting a sidecar-pinned branch must succeed"); - assert_eq!( - list_recovery_dir(dir.path()).len(), - 0, - "the orphaned sidecar must be discarded (with an audit row), not left to wedge" - ); - - // And a fresh ReadWrite open must succeed too (the sweep shares the - // same classification). - drop(db); - let db = Omnigraph::open(&uri) - .await - .expect("ReadWrite open after deleting a sidecar-pinned branch must succeed"); - assert_eq!(helpers::count_rows(&db, "node:Person").await, 2); -} - #[tokio::test] async fn read_only_open_skips_recovery_sweep() { let dir = tempfile::tempdir().unwrap(); @@ -389,6 +175,7 @@ async fn read_only_open_skips_recovery_sweep() { #[tokio::test] async fn recovery_rolls_back_synthetic_drift_on_open() { use omnigraph::loader::{LoadMode, load_jsonl}; + use omnigraph::table_store::TableStore; let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); @@ -415,9 +202,13 @@ async fn recovery_rolls_back_synthetic_drift_on_open() { // residual the sweep recovers from is the manifest-vs-Lance-HEAD gap; // it's agnostic to *what* op caused the gap. let person_uri = node_table_uri(uri, "Person"); + let store = TableStore::new(uri); let mut ds = Dataset::open(&person_uri).await.unwrap(); let head_before_drift = ds.version().version; - let _ = helpers::lance_delete_inline(&mut ds, "1 = 2").await; + let _ = store + .delete_where(&person_uri, &mut ds, "1 = 2") + .await + .unwrap(); let head_after_drift = ds.version().version; assert_eq!( head_after_drift, @@ -487,92 +278,6 @@ async fn recovery_rolls_back_synthetic_drift_on_open() { ); } -/// Regression: recovery roll-back must PUBLISH the restored version so -/// `manifest == Lance HEAD` afterward (no residual "orphaned drift"). 
Before the -/// fix, roll-back restored via `Dataset::restore` but left the manifest pin -/// behind HEAD, so a subsequent strict write / schema apply failed its -/// HEAD-vs-manifest precondition ("stale view … refresh and retry") — and a -/// failed schema apply's own roll-back leaked +1 each retry (the original bug's -/// loop). With convergence, one roll-back leaves `manifest == HEAD` and the -/// follow-up succeeds. -#[tokio::test] -async fn recovery_rollback_converges_manifest_so_schema_apply_succeeds() { - use omnigraph::db::ReadTarget; - use omnigraph::loader::{LoadMode, load_jsonl}; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - load_jsonl( - &mut db, - r#"{"type":"Person","data":{"name":"alice","age":30}} -{"type":"Person","data":{"name":"bob","age":25}} -"#, - LoadMode::Append, - ) - .await - .unwrap(); - drop(db); - - // Forge a Phase-B residual: advance Person's Lance HEAD without publishing to - // the manifest (the manifest pin stays at the load's committed version). - let person_uri = node_table_uri(uri, "Person"); - let mut ds = Dataset::open(&person_uri).await.unwrap(); - let manifest_pin = ds.version().version; - let _ = helpers::lance_delete_inline(&mut ds, "1 = 2").await; - drop(ds); - - // Roll-back-classified sidecar (post_commit_pin != observed head ⇒ - // UnexpectedAtP1 ⇒ RollBack).
- let sidecar_json = format!( - r#"{{ - "schema_version": 1, - "operation_id": "01H0000000000000000000CVG", - "started_at": "0", - "branch": null, - "actor_id": "act-test", - "writer_kind": "Mutation", - "tables": [ - {{ - "table_key": "node:Person", - "table_path": "{}", - "expected_version": {}, - "post_commit_pin": {} - }} - ] - }}"#, - person_uri, manifest_pin, manifest_pin - ); - write_sidecar_file(dir.path(), "01H0000000000000000000CVG", &sidecar_json); - - // Reopen runs the sweep: restore Person to manifest_pin, then PUBLISH so the - // manifest tracks the restored Lance HEAD. - let db = Omnigraph::open(uri).await.unwrap(); - - // Convergence: manifest pin == Lance HEAD. Fails before the fix — the - // manifest stays at manifest_pin while HEAD advanced past it. - let snap = db.snapshot_of(ReadTarget::branch("main")).await.unwrap(); - let entry = snap.entry("node:Person").unwrap(); - let lance_head = Dataset::open(&person_uri).await.unwrap().version().version; - assert_eq!( - entry.table_version, lance_head, - "roll-back must publish so manifest pin ({}) == Lance HEAD ({})", - entry.table_version, lance_head, - ); - - // The +1-loop victim: an additive schema apply must now succeed (its - // HEAD-vs-manifest precondition is satisfied). Before the fix this failed - // with "stale view … refresh and retry". - let desired = TEST_SCHEMA.replace( - " age: I32?\n}", - " age: I32?\n nickname: String?\n}", - ); - db.apply_schema(&desired) - .await - .expect("schema apply after a converging roll-back must succeed"); -} - // ===================================================================== // Phase 4 — roll-forward path + audit row recording // ===================================================================== @@ -685,26 +390,44 @@ async fn list_recovery_audit_kinds(graph_root: &Path) -> Vec { out } -/// Helper: count graph commits authored by the recovery actor.
RFC-013 Phase 7 -/// records the recovery commit in `__manifest` (folded into the recovery publish -/// CAS), not `_graph_commits.lance`, so this counts through the production -/// commit-graph projection (`load_commits`), filtering on the inline actor. +/// Helper: count `_graph_commits.lance` rows tagged with the recovery actor. async fn count_recovery_actor_commits(graph_root: &Path) -> usize { - let commits = omnigraph::db::commit_graph::CommitGraph::open(graph_root.to_str().unwrap()) + let actors_dir = graph_root.join("_graph_commit_actors.lance"); + if !actors_dir.exists() { + return 0; + } + let ds = Dataset::open(actors_dir.to_str().unwrap()).await.unwrap(); + use arrow_array::{Array, StringArray}; + use futures::TryStreamExt; + let batches: Vec = ds + .scan() + .try_into_stream() .await .unwrap() - .load_commits() + .try_collect() .await .unwrap(); - commits - .iter() - .filter(|c| c.actor_id.as_deref() == Some("omnigraph:recovery")) - .count() + let mut count = 0; + for batch in &batches { + let actors = batch + .column_by_name("actor_id") + .unwrap() + .as_any() + .downcast_ref::<StringArray>() + .unwrap(); + for i in 0..actors.len() { + if actors.value(i) == "omnigraph:recovery" { + count += 1; + } + } + } + count } #[tokio::test] async fn recovery_rolls_forward_after_phase_b_completes() { use omnigraph::loader::{LoadMode, load_jsonl}; + use omnigraph::table_store::TableStore; let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); @@ -721,12 +444,16 @@ async fn recovery_rolls_forward_after_phase_b_completes() { drop(db); let person_uri = node_table_uri(uri, "Person"); + let store = TableStore::new(uri); let mut ds = Dataset::open(&person_uri).await.unwrap(); let head_before = ds.version().version; // Synthesize a successful Phase B: advance Lance HEAD by one // (delete_where with no-match — no fragment changes, but version bumps).
- let _ = helpers::lance_delete_inline(&mut ds, "1 = 2").await; + let _ = store + .delete_where(&person_uri, &mut ds, "1 = 2") + .await + .unwrap(); let head_after = ds.version().version; assert_eq!(head_after, head_before + 1); @@ -910,6 +637,7 @@ async fn recovery_records_rolled_forward_for_stale_sidecar_after_successful_roll #[tokio::test] async fn recovery_rolls_back_records_audit_row_with_recovery_actor() { use omnigraph::loader::{LoadMode, load_jsonl}; + use omnigraph::table_store::TableStore; let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); @@ -923,9 +651,13 @@ async fn recovery_rolls_back_records_audit_row_with_recovery_actor() { drop(db); let person_uri = node_table_uri(uri, "Person"); + let store = TableStore::new(uri); let mut ds = Dataset::open(&person_uri).await.unwrap(); let head_before = ds.version().version; - let _ = helpers::lance_delete_inline(&mut ds, "1 = 2").await; + let _ = store + .delete_where(&person_uri, &mut ds, "1 = 2") + .await + .unwrap(); let head_after = ds.version().version; let _ = head_after; @@ -972,6 +704,7 @@ async fn recovery_rolls_back_records_audit_row_with_recovery_actor() { #[tokio::test] async fn recovery_rolls_forward_with_null_actor() { use omnigraph::loader::{LoadMode, load_jsonl}; + use omnigraph::table_store::TableStore; let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); @@ -985,9 +718,13 @@ async fn recovery_rolls_forward_with_null_actor() { drop(db); let person_uri = node_table_uri(uri, "Person"); + let store = TableStore::new(uri); let mut ds = Dataset::open(&person_uri).await.unwrap(); let head_before = ds.version().version; - let _ = helpers::lance_delete_inline(&mut ds, "1 = 2").await; + let _ = store + .delete_where(&person_uri, &mut ds, "1 = 2") + .await + .unwrap(); let head_after = ds.version().version; // Sidecar with no actor_id (CLI-driven mutation; common case). 
@@ -1043,6 +780,7 @@ async fn recovery_rolls_forward_with_null_actor() { #[tokio::test] async fn recovery_processes_multiple_sidecars_with_fresh_snapshot_per_iter() { use omnigraph::loader::{LoadMode, load_jsonl}; + use omnigraph::table_store::TableStore; let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); @@ -1060,14 +798,21 @@ async fn recovery_processes_multiple_sidecars_with_fresh_snapshot_per_iter() { // Synthesize drift on both tables independently. let person_uri = node_table_uri(uri, "Person"); let company_uri = node_table_uri(uri, "Company"); + let store = TableStore::new(uri); let mut person_ds = Dataset::open(&person_uri).await.unwrap(); let person_pre = person_ds.version().version; - let _ = helpers::lance_delete_inline(&mut person_ds, "1 = 2").await; + let _ = store + .delete_where(&person_uri, &mut person_ds, "1 = 2") + .await + .unwrap(); let person_post = person_ds.version().version; let mut company_ds = Dataset::open(&company_uri).await.unwrap(); let company_pre = company_ds.version().version; - let _ = helpers::lance_delete_inline(&mut company_ds, "1 = 2").await; + let _ = store + .delete_where(&company_uri, &mut company_ds, "1 = 2") + .await + .unwrap(); let company_post = company_ds.version().version; // Drop two sidecars; ULID prefix ensures sort order is A then B. 
@@ -1247,6 +992,7 @@ async fn recovery_ensure_indices_handles_empty_tables() { #[tokio::test] async fn recovery_multi_sidecar_requires_fresh_snapshot_for_correctness() { use omnigraph::loader::{LoadMode, load_jsonl}; + use omnigraph::table_store::TableStore; let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_str().unwrap(); @@ -1265,6 +1011,7 @@ async fn recovery_multi_sidecar_requires_fresh_snapshot_for_correctness() { drop(db); let person_uri = node_table_uri(uri, "Person"); + let store = TableStore::new(uri); let mut ds = Dataset::open(&person_uri).await.unwrap(); let v1 = ds.version().version; @@ -1278,9 +1025,23 @@ async fn recovery_multi_sidecar_requires_fresh_snapshot_for_correctness() { // Bypassing __manifest is what `delete_where` and `append_batch` // both do (direct on Lance); using append_batch (instead of no-op // deletes) is what makes the fragment-set differ across versions. - helpers::lance_append_inline(&mut ds, person_batch(&[("bob-id", "bob", Some(25))])).await; + store + .append_batch( + &person_uri, + &mut ds, + person_batch(&[("bob-id", "bob", Some(25))]), + ) + .await + .unwrap(); let v2 = ds.version().version; - helpers::lance_append_inline(&mut ds, person_batch(&[("carol-id", "carol", Some(40))])).await; + store + .append_batch( + &person_uri, + &mut ds, + person_batch(&[("carol-id", "carol", Some(40))]), + ) + .await + .unwrap(); let v3 = ds.version().version; assert_eq!(v2, v1 + 1); assert_eq!(v3, v2 + 1); @@ -1445,7 +1206,14 @@ async fn recovery_classifies_feature_branch_sidecar_against_feature_branch() { .open_dataset_head(&person_uri, feature_branch_name.as_deref()) .await .unwrap(); - helpers::lance_append_inline(&mut ds, person_batch(&[("carol-id", "carol", Some(40))])).await; + store + .append_batch( + &person_uri, + &mut ds, + person_batch(&[("carol-id", "carol", Some(40))]), + ) + .await + .unwrap(); let v_head = ds.version().version; assert_eq!(v_head, v_pin + 1, "append must advance HEAD by 1"); @@ -1560,7 +1328,14 
@@ async fn recovery_rolls_back_feature_branch_sidecar_against_feature_branch() { .open_dataset_head(&person_uri, feature_branch_name.as_deref()) .await .unwrap(); - helpers::lance_append_inline(&mut ds, person_batch(&[("dave-id", "dave", Some(50))])).await; + store + .append_batch( + &person_uri, + &mut ds, + person_batch(&[("dave-id", "dave", Some(50))]), + ) + .await + .unwrap(); let v_head = ds.version().version; assert_eq!(v_head, v_pin + 1); diff --git a/crates/omnigraph/tests/writes.rs b/crates/omnigraph/tests/runs.rs similarity index 77% rename from crates/omnigraph/tests/writes.rs rename to crates/omnigraph/tests/runs.rs index b57d8fd..cfff3fc 100644 --- a/crates/omnigraph/tests/writes.rs +++ b/crates/omnigraph/tests/runs.rs @@ -1,13 +1,13 @@ -//! Tests for the direct-publish write path: mutations and loads write -//! directly to target tables and commit once via the publisher's -//! `expected_table_versions` CAS. (History: this replaced the removed Run -//! state machine / `__run__` staging branches / RunRecord — MR-771.) +//! Tests for the direct-to-target write path (Run state machine +//! removed). The Run/`__run__` staging branch / RunRecord state machine no +//! longer exists; mutations and loads write directly to target tables and +//! commit once via the publisher's `expected_table_versions` CAS. //! //! What this file covers: //! - No `__run__*` branches are created by load or mutate. //! - Cancellation of a mutation future leaves no graph-level state. -//! - Concurrent non-strict inserts/merges rebase under the per-table queue; -//! strict updates/deletes surface `ExpectedVersionMismatch` on stale state. +//! - Concurrent writers to the same table land exactly one publish; the +//! loser surfaces `ManifestConflictDetails::ExpectedVersionMismatch`. //! - Failed mutations and loads leave the target unchanged. //! - Multi-statement mutations are atomic (one commit per query). //! - actor_id propagates through to the commit graph.
@@ -17,7 +17,7 @@ mod helpers; use arrow_array::Array; use omnigraph::db::commit_graph::CommitGraph; use omnigraph::db::{Omnigraph, ReadTarget}; -use omnigraph::error::OmniError; +use omnigraph::error::{ManifestConflictDetails, ManifestErrorKind, OmniError}; use omnigraph::loader::{LoadMode, load_jsonl}; use helpers::*; @@ -241,11 +241,18 @@ async fn partial_failure_leaves_target_queryable_and_unblocks_next_mutation() { assert_eq!(frank.num_rows(), 1, "Frank must be visible after publish"); } -/// Stale non-strict writers rebase to the live manifest pin under the -/// per-table queue instead of folding raw drift or returning a false 409. -/// Strict update/delete semantics are covered by the consistency/server tests. +/// Concurrent writers to the same `(table, branch)` produce exactly one +/// success and one `ExpectedVersionMismatch`. The replacement for the old +/// `concurrent_conflicting_run_publish_fails_cleanly` test — the OCC fence +/// has moved from a graph-level run-publish merge into the publisher's +/// per-table CAS. +/// +/// Drives the race by interleaving two handles that captured the same +/// pre-write manifest snapshot: A commits first; B's commit then sees +/// `expected_versions[node:Person] = pre` while the manifest is at +/// `pre + 1`, and the publisher rejects. #[tokio::test] -async fn stale_non_strict_insert_rebases_to_live_manifest_pin() { +async fn concurrent_writers_one_succeeds_one_gets_expected_version_mismatch() { let dir = tempfile::tempdir().unwrap(); let uri = dir.path().to_string_lossy().into_owned(); @@ -274,30 +281,40 @@ async fn stale_non_strict_insert_rebases_to_live_manifest_pin() { .unwrap(); } - // Writer B's coordinator is still at the pre-A snapshot, but Insert is - // non-strict: commit_all re-reads the live manifest pin under the queue, - // verifies Lance HEAD equals that pin, and then lets Lance rebase the - // staged append.
- db_b.mutate( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "WriterB")], &[("$age", 42)]), - ) - .await - .unwrap(); - - for name in ["WriterA", "WriterB"] { - let person = query_main( - &mut db_b, - TEST_QUERIES, - "get_person", - &params(&[("$name", name)]), + // Writer B's coordinator is still at the pre-A snapshot. Its mutation + // captures expected_versions[node:Person] = pre (stale), then publishes + // — the publisher's CAS pre-check sees the manifest is now at post and + // rejects with ExpectedVersionMismatch. + let result_b = db_b + .mutate( + "main", + MUTATION_QUERIES, + "insert_person", + &mixed_params(&[("$name", "WriterB")], &[("$age", 42)]), ) - .await - .unwrap(); - assert_eq!(person.num_rows(), 1, "{name} should be visible"); - } + .await; + + let err = result_b.expect_err("stale writer must hit ExpectedVersionMismatch"); + let OmniError::Manifest(manifest_err) = err else { + panic!("expected Manifest error, got {err:?}"); + }; + assert_eq!(manifest_err.kind, ManifestErrorKind::Conflict); + let Some(ManifestConflictDetails::ExpectedVersionMismatch { + ref table_key, + expected, + actual, + }) = manifest_err.details + else { + panic!( + "expected ExpectedVersionMismatch, got {:?}", + manifest_err.details, + ); + }; + assert_eq!(table_key, "node:Person"); + assert!( + actual > expected, + "actual ({actual}) should be ahead of expected ({expected})", + ); } /// The cancellation hole that motivated removing the Run state machine: dropping a mutation future @@ -354,10 +371,11 @@ async fn cancelled_mutation_future_leaves_no_state() { // Cancel-safety property: no graph-level run/staging state remains. // - // No `__run__` branches can ever be created: the Run state machine - // (`begin_run` etc.) was deleted in MR-771 — verified by the build itself, - // those symbols no longer exist. Any legacy `__run__*` branch on an - // upgraded graph is swept by the v2→v3 manifest migration.
+ // Note: `branch_list()` already filters `__run__*` via + // `is_internal_system_branch`, so a runtime "no `__run__` branches" check + // would be vacuous. The structural property that no `__run__` branches + // can ever be created is enforced by deletion of `begin_run` etc. in + // (verified by the build itself — those symbols no longer exist). // // (1) The branch list is unchanged: cancellation/completion cannot // synthesize new public branches. @@ -424,40 +442,34 @@ async fn repeated_loads_do_not_accumulate_branches() { assert_eq!(db.branch_list().await.unwrap(), vec!["main".to_string()]); } -/// After MR-770, `__run__*` is an ordinary branch name — the Run state machine -/// and its `is_internal_run_branch` guard are gone. The surviving internal-ref -/// guard still rejects the active `__schema_apply_lock__` branch on the public -/// create/merge APIs. +/// User code must not be able to write to internal `__run__*` names. +/// The branch-name guard predicate is kept as defense-in-depth; it +/// will be removed once a future production sweep retires the legacy +/// branches. #[tokio::test] -async fn public_branch_apis_reject_internal_system_refs() { +async fn public_branch_apis_reject_internal_run_refs() { let dir = tempfile::tempdir().unwrap(); let mut db = init_and_load(&dir).await; - // `__run__*` is no longer reserved — creating it now succeeds. - db.branch_create("__run__formerly_reserved") - .await - .expect("__run__ prefix is a normal branch name post-MR-770"); - - // The schema-apply lock branch is still rejected on public branch APIs.
- let create_err = db.branch_create("__schema_apply_lock__").await.unwrap_err(); + let create_err = db.branch_create("__run__synthetic").await.unwrap_err(); let OmniError::Manifest(err) = create_err else { panic!("expected Manifest error"); }; assert!( - err.message.contains("internal system ref"), + err.message.contains("internal run ref"), "unexpected error: {}", err.message ); let merge_err = db - .branch_merge("__schema_apply_lock__", "main") + .branch_merge("__run__synthetic", "main") .await .unwrap_err(); let OmniError::Manifest(err) = merge_err else { panic!("expected Manifest error"); }; assert!( - err.message.contains("internal system refs"), + err.message.contains("internal run refs"), "unexpected error: {}", err.message ); @@ -613,10 +625,7 @@ async fn mixed_insert_and_update_on_same_person_coalesces_to_one_merge() { "dedupe must keep the update's age value, not the insert's", ); - // One-publish guarantee: manifest version advanced by exactly 1. The graph - // commit (`graph_commit` + `graph_head` rows) rides the SAME publish CAS as - // the table-version rows (RFC-013 Phase 7), so one graph commit is exactly - // one manifest version bump. + // One-publish guarantee: manifest version advanced by exactly 1. let post_version = version_main(&db).await.unwrap(); assert_eq!( post_version, @@ -662,9 +671,7 @@ async fn multiple_appends_to_same_edge_coalesce_to_one_append() { let edges_after = count_rows(&db, "edge:Knows").await; assert_eq!(edges_after, edges_before + 2); - // One manifest version bump for the two-edge query (atomic publish): the - // graph commit rides the same publish CAS as the table-version rows - // (RFC-013 Phase 7). + // One manifest version bump for the two-edge query (atomic publish). 
let post_version = version_main(&db).await.unwrap(); assert_eq!( post_version, @@ -695,8 +702,6 @@ async fn multi_statement_inserts_publish_exactly_once() { .await .unwrap(); - // One manifest version bump: the graph commit rides the same publish CAS - // as the table-version rows (RFC-013 Phase 7). let post_version = version_main(&db).await.unwrap(); assert_eq!( post_version, @@ -785,47 +790,6 @@ async fn load_with_bad_edge_reference_unblocks_next_load() { ); } -#[tokio::test] -async fn load_overwrite_with_bad_edge_reference_unblocks_next_load() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - load_jsonl(&mut db, TEST_DATA, LoadMode::Overwrite) - .await - .unwrap(); - - let pre_persons = count_rows(&db, "node:Person").await; - let pre_edges = count_rows(&db, "edge:Knows").await; - - let bad = r#"{"type": "Person", "data": {"name": "Mallory", "age": 5}} -{"edge": "Knows", "from": "Mallory", "to": "Ghost"} -"#; - let err = load_jsonl(&mut db, bad, LoadMode::Overwrite) - .await - .expect_err("RI violation must fail overwrite before commit_staged"); - let OmniError::Manifest(manifest_err) = err else { - panic!("expected Manifest error, got {err:?}"); - }; - assert!( - manifest_err.message.contains("not found"), - "unexpected error: {}", - manifest_err.message, - ); - - assert_eq!(count_rows(&db, "node:Person").await, pre_persons); - assert_eq!(count_rows(&db, "edge:Knows").await, pre_edges); - - let good = r#"{"type": "Person", "data": {"name": "Pat", "age": 55}} -{"type": "Person", "data": {"name": "Quinn", "age": 56}} -{"edge": "Knows", "from": "Pat", "to": "Quinn"} -"#; - load_jsonl(&mut db, good, LoadMode::Overwrite) - .await - .unwrap(); - assert_eq!(count_rows(&db, "node:Person").await, 2); - assert_eq!(count_rows(&db, "edge:Knows").await, 1); -} - /// Same shape as the RI test above, but driven by a cardinality /// violation (`@card(0..1)` on `WorksAt`). 
The staged loader's pending /// edge accumulator drives the cardinality scan; a violation aborts @@ -890,56 +854,6 @@ edge WorksAt: Person -> Company @card(0..1) ); } -#[tokio::test] -async fn load_overwrite_with_cardinality_violation_unblocks_next_load() { - const CARD_SCHEMA: &str = r#" -node Person { - name: String @key - age: I32? -} -node Company { - name: String @key -} -edge WorksAt: Person -> Company @card(0..1) -"#; - - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, CARD_SCHEMA).await.unwrap(); - - let seed = r#"{"type": "Person", "data": {"name": "Alice", "age": 30}} -{"type": "Company", "data": {"name": "Acme"}} -{"type": "Company", "data": {"name": "Bigco"}} -"#; - load_jsonl(&mut db, seed, LoadMode::Overwrite) - .await - .unwrap(); - - let pre_works = count_rows(&db, "edge:WorksAt").await; - - let bad = r#"{"edge": "WorksAt", "from": "Alice", "to": "Acme"} -{"edge": "WorksAt", "from": "Alice", "to": "Bigco"} -"#; - let err = load_jsonl(&mut db, bad, LoadMode::Overwrite) - .await - .expect_err("cardinality violation must fail overwrite before commit_staged"); - let OmniError::Manifest(manifest_err) = err else { - panic!("expected Manifest error, got {err:?}"); - }; - assert!( - manifest_err.message.contains("@card violation"), - "unexpected error: {}", - manifest_err.message, - ); - assert_eq!(count_rows(&db, "edge:WorksAt").await, pre_works); - - let good = r#"{"edge": "WorksAt", "from": "Alice", "to": "Acme"}"#; - load_jsonl(&mut db, good, LoadMode::Overwrite) - .await - .unwrap(); - assert_eq!(count_rows(&db, "edge:WorksAt").await, 1); -} - // ─── Chained-mutation correctness — pinned coverage ───────────────────────── /// Chained `update` ops in one query must respect each previous op's @@ -1012,8 +926,6 @@ async fn chained_updates_with_overlapping_predicate_respects_intermediate_value( "chained-update final value must reflect the second update applied to op-1's pending value"
); - // One manifest version bump: the graph commit rides the same publish CAS - // as the table-version rows (RFC-013 Phase 7). let post_version = version_main(&db).await.unwrap(); assert_eq!( post_version, @@ -1052,9 +964,6 @@ async fn multi_statement_delete_on_same_node_table() { pre_persons - 2, "both deletes must land", ); - // One manifest version bump: the graph commit (delete-only queries record - // one too) rides the same publish CAS as the table-version rows - // (RFC-013 Phase 7). let post_version = version_main(&db).await.unwrap(); assert_eq!( post_version, @@ -1552,176 +1461,3 @@ async fn second_sequential_update_on_same_row_succeeds() { "Alice's age must reflect the second update" ); } - -// An interrupted first-write fork (create_branch succeeded, the manifest -// publish did not) leaves a fully-formed Lance branch ref on the table that -// the manifest never references — a "manifest-unreferenced fork". The branch -// itself stays a valid manifest branch, so `cleanup`'s reconciler (keyed on -// the manifest branch list) never reclaims it. Today the next write to that -// table on that branch re-enters the fork path, `create_branch` collides, and -// the engine wedges with "incomplete prior delete; run `omnigraph cleanup`". -// -// We forge that exact residue (a live `feature` branch + a directly-created -// `feature` ref on the Person table the manifest doesn't reference) and assert -// the next write — via both `load` and `mutate` — self-heals by reclaiming the -// orphan fork and re-forking, rather than wedging. No process death / timing -// needed: the forge is the post-crash state. -#[tokio::test] -async fn first_write_self_heals_manifest_unreferenced_fork_on_live_branch() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap().to_string(); - let mut db = init_and_load(&dir).await; - db.branch_create("feature").await.unwrap(); - - // Forge the manifest-unreferenced fork directly at the Lance layer.
- let person_uri = node_table_uri(&uri, "Person"); - { - let mut ds = lance::Dataset::open(&person_uri).await.unwrap(); - let base = ds.version().version; - ds.create_branch("feature", base, None).await.unwrap(); - assert!( - ds.list_branches().await.unwrap().contains_key("feature"), - "precondition: forged orphan fork present on Person" - ); - } - - // load → must self-heal, not wedge with "incomplete prior delete". - let row = r#"{"type":"Person","data":{"name":"Zoe","age":30}}"#; - db.load_as("feature", None, row, LoadMode::Merge, None) - .await - .expect("load onto a manifest-unreferenced fork must self-heal, not wedge"); - - // mutate → same path, must also self-heal. - mutate_branch( - &mut db, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Yan")], &[("$age", 41)]), - ) - .await - .expect("mutate onto a manifest-unreferenced fork must self-heal"); - - // The healed branch holds the new rows; main is untouched (still no Zoe/Yan). - let feature_people = count_rows_branch(&db, "feature", "node:Person").await; - let main_people = count_rows(&db, "node:Person").await; - assert!( - feature_people >= main_people + 2, - "feature must contain the two new rows on top of the inherited set \ - (feature={feature_people}, main={main_people})" - ); -} - -// A node delete cascades to every edge table touching that node, forking those -// edge tables during execution. The up-front fork-queue acquisition must cover -// those cascade-forked edges, not just the node table named in the IR — else -// commit_all's held-guard coverage check fails the write (and, before the -// coverage check was promoted out of debug-only, edge commits would slip -// through unserialized). This drives the new code via a DELETE (the only -// cascading op), on a branch, as the FIRST write (so it actually forks).
-#[tokio::test] -async fn branch_cascade_delete_forks_node_and_edges_under_held_queues() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - db.branch_create("feature").await.unwrap(); - - // Baseline inherited from main (Alice has 2 Knows + 1 WorksAt edge). - let main_people = count_rows(&db, "node:Person").await; - let main_knows = count_rows(&db, "edge:Knows").await; - - // First write to `feature` is `delete Person Alice`, whose cascade forks - // node:Person AND edge:Knows + edge:WorksAt. Pre-fix the up-front set held - // only node:Person, so commit_all's coverage check rejected the write. - mutate_branch( - &mut db, - "feature", - MUTATION_QUERIES, - "remove_person", - &mixed_params(&[("$name", "Alice")], &[]), - ) - .await - .expect("branch cascade-delete must hold queues for cascade-forked edge tables"); - - // Alice and her edges are gone on feature; main is untouched. - assert_eq!( - count_rows_branch(&db, "feature", "node:Person").await, - main_people - 1, - "feature should have Alice removed from the inherited set" - ); - assert!( - count_rows_branch(&db, "feature", "edge:Knows").await < main_knows, - "feature should have Alice's cascade-deleted Knows edges removed" - ); - assert_eq!( - count_rows(&db, "node:Person").await, - main_people, - "main must be untouched by the branch delete" - ); -} - -// #283: a mutation predicate (`where camelField = ...`) on a camelCase column -// must execute, not fail at the Lance scan with "No field named ...". Covers -// both `update` (committed scan via scan_with_pending) and `delete` -// (delete_where), which share the same emitted SQL filter string. -const CC_SCHEMA: &str = r#" -node Doc { - slug: String @key - repoName: String @index - status: String? 
-} -"#; -const CC_DATA: &str = r#"{"type":"Doc","data":{"slug":"d1","repoName":"acme","status":"open"}} -{"type":"Doc","data":{"slug":"d2","repoName":"globex","status":"open"}}"#; - -#[tokio::test] -async fn camelcase_mutation_predicate_updates_and_deletes() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, CC_SCHEMA).await.unwrap(); - load_jsonl(&mut db, CC_DATA, LoadMode::Overwrite).await.unwrap(); - - let m = r#" -query set_status($repo: String, $st: String) { update Doc set { status: $st } where repoName = $repo } -query del($repo: String) { delete Doc where repoName = $repo } -"#; - - let upd = db - .mutate("main", m, "set_status", &params(&[("$repo", "acme"), ("$st", "closed")])) - .await - .expect("update with a camelCase predicate must execute"); - assert_eq!(upd.affected_nodes, 1, "exactly the acme Doc should update"); - - let del = db - .mutate("main", m, "del", &params(&[("$repo", "globex")])) - .await - .expect("delete with a camelCase predicate must execute"); - assert_eq!(del.affected_nodes, 1, "exactly the globex Doc should delete"); - - assert_eq!(count_rows(&db, "node:Doc").await, 1, "one Doc (acme) should remain"); -} - -// #283 (pending side): a chained mutation whose 2nd op filters a camelCase -// column must read op-1's staged rows through the pending DataFusion `MemTable` -// (`SELECT … WHERE {filter}` via ctx.sql), which lowercases unquoted idents. -// This is the path the single update/delete above does NOT exercise. -#[tokio::test] -async fn camelcase_chained_mutation_reads_pending_by_camelcase() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, CC_SCHEMA).await.unwrap(); - load_jsonl(&mut db, CC_DATA, LoadMode::Overwrite).await.unwrap(); - - // op-1 stages a status change to the acme Doc; op-2 re-filters the same - // camelCase column, so it must match op-1's pending row.
- let m = r#" -query chain($repo: String) { - update Doc set { status: "stage1" } where repoName = $repo - update Doc set { status: "stage2" } where repoName = $repo -} -"#; - let r = db - .mutate("main", m, "chain", &params(&[("$repo", "acme")])) - .await - .expect("chained camelCase mutation must read the pending row, not fail at the MemTable SELECT"); - assert_eq!(r.affected_nodes, 2, "both ops should touch the acme Doc (read-your-writes)"); -} diff --git a/crates/omnigraph/tests/s3_storage.rs b/crates/omnigraph/tests/s3_storage.rs index 3814600..7e4f0a3 100644 --- a/crates/omnigraph/tests/s3_storage.rs +++ b/crates/omnigraph/tests/s3_storage.rs @@ -167,80 +167,3 @@ async fn s3_public_load_uses_hidden_run_and_publishes() { .to_rust_json(); assert_eq!(loaded[0]["p.name"], "Loaded-Over-S3"); } - -/// The conditional-write contract the cluster ledger depends on (RFC-006): -/// versioned read -> If-Match replace -> stale token refused. Pins the -/// S3-compatible backend's behavior (RustFS in CI) — turns red if a backend -/// bump regresses conditional puts.
-#[tokio::test(flavor = "multi_thread")] -async fn s3_adapter_conditional_writes_contract() { - let Some(uri) = s3_test_graph_uri("adapter-cas") else { - eprintln!("skipping s3 adapter cas test: OMNIGRAPH_S3_TEST_BUCKET is not set"); - return; - }; - use omnigraph::storage::storage_for_uri; - let adapter = storage_for_uri(&uri).unwrap(); - let object = format!("{uri}/cas-probe.json"); - - assert!(adapter.write_text_if_absent(&object, "v1").await.unwrap()); - assert!(!adapter.write_text_if_absent(&object, "v1b").await.unwrap()); - - let (text, version) = adapter.read_text_versioned(&object).await.unwrap(); - assert_eq!(text, "v1"); - let next = adapter - .write_text_if_match(&object, "v2", &version) - .await - .unwrap() - .expect("fresh etag must win"); - assert!( - adapter - .write_text_if_match(&object, "v3", &version) - .await - .unwrap() - .is_none(), - "stale etag must be refused" - ); - let again = adapter - .write_text_if_match(&object, "v3", &next) - .await - .unwrap(); - assert!(again.is_some()); - - // Prefix delete: recursive + idempotent. - adapter - .write_text(&format!("{uri}/tree/a.json"), "a") - .await - .unwrap(); - adapter - .write_text(&format!("{uri}/tree/sub/b.json"), "b") - .await - .unwrap(); - adapter.delete_prefix(&format!("{uri}/tree")).await.unwrap(); - assert!(!adapter.exists(&format!("{uri}/tree/a.json")).await.unwrap()); - adapter.delete_prefix(&format!("{uri}/tree")).await.unwrap(); - adapter.delete(&object).await.unwrap(); -} - -/// Schema apply against an S3 graph: the cluster's schema executor will -/// lean on this; previously untested upstream on object storage.
-#[tokio::test(flavor = "multi_thread")] -async fn s3_schema_apply_migrates_live_graph() { - let Some(uri) = s3_test_graph_uri("schema-apply") else { - eprintln!("skipping s3 schema apply test: OMNIGRAPH_S3_TEST_BUCKET is not set"); - return; - }; - let mut db = Omnigraph::init(&uri, TEST_SCHEMA).await.unwrap(); - load_jsonl(&mut db, TEST_DATA, LoadMode::Overwrite) - .await - .unwrap(); - - let desired = format!("{TEST_SCHEMA}\nnode Note {{\n title: String @key\n}}\n"); - let result = db.apply_schema(&desired).await.unwrap(); - assert!(result.applied, "{result:?}"); - - let reopened = Omnigraph::open(&uri).await.unwrap(); - assert!( - reopened.schema_source().contains("Note"), - "live S3 schema must carry the migration" - ); -} diff --git a/crates/omnigraph/tests/scalar_indexes.rs b/crates/omnigraph/tests/scalar_indexes.rs deleted file mode 100644 index 8d8a3f0..0000000 --- a/crates/omnigraph/tests/scalar_indexes.rs +++ /dev/null @@ -1,74 +0,0 @@ -//! Coverage for `build_indices_on_dataset_for_catalog`'s per-property index -//! dispatch: which scalar/vector index each `@index`/`@key` column gets. -//! -//! The observable signal is `TableStore::key_column_index_coverage`, which -//! reports `Indexed` only when a BTREE covers the column (the same helper the -//! traversal chooser uses). Enums and orderable scalars must get a BTREE so -//! `=`/range/IN/IS NULL are index-accelerated; free-text Strings keep FTS -//! (which `key_column_index_coverage` does not count as a BTREE, by design). - -mod helpers; - -use omnigraph::db::Omnigraph; -use omnigraph::loader::{LoadMode, load_jsonl}; -use omnigraph::table_store::{IndexCoverage, TableStore}; - -use helpers::*; - -const SCHEMA: &str = r#" -node Item { - slug: String @key - status: enum(active, archived) @index - published: DateTime @index - rank: I32 @index - title: String @index - note: String? 
-} -"#; - -const DATA: &str = r#"{"type":"Item","data":{"slug":"a","status":"active","published":"2024-06-01T00:00:00Z","rank":1,"title":"alpha","note":"n1"}} -{"type":"Item","data":{"slug":"b","status":"archived","published":"2023-01-01T00:00:00Z","rank":2,"title":"beta","note":"n2"}} -{"type":"Item","data":{"slug":"c","status":"active","published":"2025-02-02T00:00:00Z","rank":3,"title":"gamma","note":"n3"}}"#; - -// Enums and orderable scalars (DateTime, numeric) get a BTREE from load's -// build-indices pass, so a `=`/range filter on them uses the index. Free-text -// String `@index` keeps FTS (no BTREE), and an un-annotated column has no -// scalar index; both report `Degraded`, which is the negative control that -// keeps this test from being vacuously green. -#[tokio::test] -async fn node_scalar_and_enum_index_columns_get_btree() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, SCHEMA).await.unwrap(); - load_jsonl(&mut db, DATA, LoadMode::Overwrite).await.unwrap(); - - let snap = snapshot_main(&db).await.unwrap(); - let ds = snap.open("node:Item").await.unwrap(); - - for col in ["status", "published", "rank"] { - let cov = TableStore::key_column_index_coverage(&ds, col).await.unwrap(); - assert_eq!( - cov, - IndexCoverage::Indexed, - "column '{col}' (enum/DateTime/numeric @index) must get a BTREE, got {cov:?}" - ); - } - - // Free-text String @index -> FTS, which is not a BTREE -> Degraded. - let title_cov = TableStore::key_column_index_coverage(&ds, "title") - .await - .unwrap(); - assert!( - matches!(title_cov, IndexCoverage::Degraded { .. }), - "free-text String @index should keep FTS (no BTREE), got {title_cov:?}" - ); - - // No @index annotation -> no scalar index at all -> Degraded. - let note_cov = TableStore::key_column_index_coverage(&ds, "note") - .await - .unwrap(); - assert!( - matches!(note_cov, IndexCoverage::Degraded { ..
}), - "un-annotated column should have no scalar index, got {note_cov:?}" - ); -} diff --git a/crates/omnigraph/tests/schema_apply.rs b/crates/omnigraph/tests/schema_apply.rs index 508451a..cc0cae2 100644 --- a/crates/omnigraph/tests/schema_apply.rs +++ b/crates/omnigraph/tests/schema_apply.rs @@ -736,108 +736,3 @@ edge Knows: Person -> Person { // current contract, the data is *unreachable* via omnigraph // (no manifest entry), which is the user-facing guarantee. } - -// Regression (bug 3 / dev-graph iss-848): a `Vector @index` on a 0-row table -// must not abort an otherwise-valid schema apply. A vector (IVF) index trains -// k-means centroids over the column's vectors, so Lance cannot build it on 0 -// vectors; it errors with "Creating empty vector indices with train=False is -// not yet implemented". When a *later* migration touches that table (here, an -// unrelated scalar `@index` on `body`), schema apply reconciles the table's -// whole index set, which previously tried to materialize the dormant vector -// index and aborted the entire migration (all-or-nothing). The build is now -// deferred (pending) when the column is untrainable, instead of failing the -// migration. The dormant index is materialized by a later `ensure_indices` / -// `optimize` once the table has rows. Full decoupling (intent recorded at -// apply, an async reconciler converges physical coverage) is iss-848. -#[tokio::test] -async fn apply_schema_defers_vector_index_on_empty_table() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - - // init does not build indices, so the declared-but-unbuilt vector index - // sits harmless on the empty table (this is how it survived earlier - // applies that never touched the table). - // `slug` is the user @key; omnigraph injects its own internal `id` column, - // so the key field must not be named `id`.
- let v1 = "node Doc {\n \ - slug: String @key\n \ - body: String?\n \ - embedding: Vector(8) @index\n\ - }\n"; - let mut db = Omnigraph::init(uri, v1).await.unwrap(); - - // Add an *unrelated* scalar @index on `body`. This routes Doc through - // schema apply's index reconcile, which must NOT abort on the untrainable - // empty vector index. - let v2 = "node Doc {\n \ - slug: String @key\n \ - body: String? @index\n \ - embedding: Vector(8) @index\n\ - }\n"; - let result = db.apply_schema(v2).await.expect( - "schema apply must succeed: an empty-table vector @index is deferred, not fatal", - ); - assert!(result.applied, "the scalar @index change must apply"); - - // The deferred vector index is not dropped: once the table has a - // trainable vector, `ensure_indices` materializes it without error. (If - // the guard wrongly skipped a non-empty column, this would still be - // unindexed; if it wrongly tried to build on empty, the apply above would - // have failed.) - load_jsonl( - &mut db, - r#"{"type":"Doc","data":{"slug":"d1","body":"hello","embedding":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8]}}"#, - LoadMode::Merge, - ) - .await - .expect("loading a Doc with an embedding must succeed"); - db.ensure_indices() - .await - .expect("the deferred vector index must build once the table has a trainable vector"); -} - -// iss-848: adding an `@index` to an existing column is a pure metadata change. -// Schema apply records the intent (the catalog/IR now declares the index) but -// must NOT build the index inline, so the table's data and manifest version are -// untouched. The physical index is materialized later by ensure_indices / -// optimize. Pre-iss-848 the indexed_tables block built the index inline and -// bumped the table version.
-#[tokio::test] -async fn index_only_constraint_apply_touches_no_table_data() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let v1 = "node Doc {\n slug: String @key\n n: I64\n}\n"; - let mut db = Omnigraph::init(uri, v1).await.unwrap(); - load_jsonl( - &mut db, - r#"{"type":"Doc","data":{"slug":"d1","n":1}}"#, - LoadMode::Merge, - ) - .await - .expect("load a Doc"); - - let before = db - .snapshot_of(ReadTarget::branch("main")) - .await - .unwrap() - .entry("node:Doc") - .unwrap() - .table_version; - - // Add an @index on the existing `n` column. - let v2 = "node Doc {\n slug: String @key\n n: I64 @index\n}\n"; - let result = db.apply_schema(v2).await.expect("index-only apply must succeed"); - assert!(result.applied, "the @index addition must apply"); - - let after = db - .snapshot_of(ReadTarget::branch("main")) - .await - .unwrap() - .entry("node:Doc") - .unwrap() - .table_version; - assert_eq!( - before, after, - "adding an @index must not bump the table version (no inline index build)" - ); -} diff --git a/crates/omnigraph/tests/search.rs b/crates/omnigraph/tests/search.rs index 425c51b..c4454cf 100644 --- a/crates/omnigraph/tests/search.rs +++ b/crates/omnigraph/tests/search.rs @@ -60,15 +60,6 @@ query hybrid_search_string($vq: String, $tq: String) { limit 3 } "#; -// Same shape as MOCK_SEARCH_SCHEMA but the vector records the model that -// produced its stored vectors, opting into the query-time same-space check. 
-const MODEL_RECORDED_SCHEMA: &str = r#" -node Doc { - slug: String @key - title: String @index - embedding: Vector(4) @embed("title", model="test-model-a") @index -} -"#; const SEARCH_MUTATIONS: &str = r#" query insert_doc($slug: String, $title: String, $body: String, $embedding: Vector(4)) { insert Doc { @@ -98,15 +89,6 @@ async fn init_mock_embedding_search_db(dir: &tempfile::TempDir) -> Omnigraph { db } -async fn init_model_recorded_search_db(dir: &tempfile::TempDir) -> Omnigraph { - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, MODEL_RECORDED_SCHEMA).await.unwrap(); - load_jsonl(&mut db, &mock_embedding_seed_data(), LoadMode::Overwrite) - .await - .unwrap(); - db -} - fn mock_embedding_seed_data() -> String { [ ("alpha-doc", "alpha guide", mock_embedding("alpha", 4)), @@ -528,14 +510,9 @@ async fn explicit_vector_nearest_does_not_require_gemini_credentials() { #[tokio::test] #[serial] -async fn string_nearest_requires_provider_credentials_when_mock_is_disabled() { - // With mock off and no provider key, the default (openai-compatible) - // provider fails loudly rather than silently producing garbage vectors. 
+async fn string_nearest_requires_gemini_credentials_when_mock_is_disabled() { let _guard = EnvGuard::set(&[ ("OMNIGRAPH_EMBEDDINGS_MOCK", None), - ("OMNIGRAPH_EMBED_PROVIDER", None), - ("OPENROUTER_API_KEY", None), - ("OPENAI_API_KEY", None), ("GEMINI_API_KEY", None), ]); @@ -551,105 +528,7 @@ async fn string_nearest_requires_provider_credentials_when_mock_is_disabled() { .await .unwrap_err(); - assert!( - err.to_string() - .contains("OPENROUTER_API_KEY or OPENAI_API_KEY"), - "unexpected error: {err}" - ); -} - -#[tokio::test] -#[serial] -async fn nearest_string_passes_when_query_model_matches_recorded_model() { - let _guard = EnvGuard::set(&[ - ("OMNIGRAPH_EMBEDDINGS_MOCK", Some("1")), - ("OMNIGRAPH_EMBED_MODEL", Some("test-model-a")), - ("OMNIGRAPH_EMBED_PROVIDER", None), - ("OPENROUTER_API_KEY", None), - ("OPENAI_API_KEY", None), - ("GEMINI_API_KEY", None), - ]); - - let dir = tempfile::tempdir().unwrap(); - let mut db = init_model_recorded_search_db(&dir).await; - - let result = query_main( - &mut db, - MOCK_SEARCH_QUERIES, - "vector_search_string", - &params(&[("$q", "alpha")]), - ) - .await - .unwrap(); - - assert_eq!(result_slugs(&result)[0], "alpha-doc"); -} - -#[tokio::test] -#[serial] -async fn nearest_string_errors_when_query_model_differs_from_recorded_model() { - let _guard = EnvGuard::set(&[ - ("OMNIGRAPH_EMBEDDINGS_MOCK", Some("1")), - ("OMNIGRAPH_EMBED_MODEL", Some("test-model-b")), - ("OMNIGRAPH_EMBED_PROVIDER", None), - ("OPENROUTER_API_KEY", None), - ("OPENAI_API_KEY", None), - ("GEMINI_API_KEY", None), - ]); - - let dir = tempfile::tempdir().unwrap(); - let mut db = init_model_recorded_search_db(&dir).await; - - let err = query_main( - &mut db, - MOCK_SEARCH_QUERIES, - "vector_search_string", - &params(&[("$q", "alpha")]), - ) - .await - .unwrap_err(); - - // The error must name both the recorded model and the resolved one.
- let msg = err.to_string(); - assert!(msg.contains("test-model-a"), "got: {msg}"); - assert!(msg.contains("test-model-b"), "got: {msg}"); -} - -#[tokio::test] -#[serial] -async fn injected_embedding_config_is_used_instead_of_env() { - // No mock flag and no provider keys in env, so `from_env()` would error. - // Injecting a Mock config proves the resolver uses the injected config - // (RFC-012 Phase 5), and its model satisfies the recorded same-space check. - let _guard = EnvGuard::set(&[ - ("OMNIGRAPH_EMBEDDINGS_MOCK", None), - ("OMNIGRAPH_EMBED_PROVIDER", None), - ("OMNIGRAPH_EMBED_MODEL", None), - ("OPENROUTER_API_KEY", None), - ("OPENAI_API_KEY", None), - ("GEMINI_API_KEY", None), - ]); - - let dir = tempfile::tempdir().unwrap(); - let mut db = init_model_recorded_search_db(&dir) - .await - .with_embedding_config(std::sync::Arc::new(omnigraph::embedding::EmbeddingConfig { - provider: omnigraph::embedding::Provider::Mock, - model: "test-model-a".to_string(), - base_url: String::new(), - api_key: String::new(), - })); - - let result = query_main( - &mut db, - MOCK_SEARCH_QUERIES, - "vector_search_string", - ¶ms(&[("$q", "alpha")]), - ) - .await - .unwrap(); - - assert_eq!(result_slugs(&result)[0], "alpha-doc"); + assert!(err.to_string().contains("GEMINI_API_KEY")); } // ─── BM25 search ──────────────────────────────────────────────────────────── @@ -677,111 +556,6 @@ async fn bm25_returns_ranked_results() { assert!(result.num_rows() <= 3, "bm25 should respect limit 3"); } -// Full rank-ORDER golden (not just top-1 / non-empty): pins ranks 2..k so a -// regression corrupting the tail or reversing the sort direction fails loudly. -// nearest skips apply_ordering (is_search_ordered) and returns Lance native -// order, so result_slugs row order == rank order. 
-#[tokio::test] -#[serial] -async fn nearest_full_rank_order() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_search_db(&dir).await; - let result = query_main( - &mut db, - SEARCH_QUERIES, - "vector_search", - &vector_param("$q", &[0.1, 0.2, 0.3, 0.4]), - ) - .await - .unwrap(); - // [0.1,0.2,0.3,0.4] == ml-intro's embedding (dist 0); the rest by ascending L2. - assert_eq!(result_slugs(&result), vec!["ml-intro", "nlp-guide", "rl-intro"]); -} - -#[tokio::test] -#[serial] -async fn bm25_full_rank_order() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_search_db(&dir).await; - let result = query_main( - &mut db, - SEARCH_QUERIES, - "bm25_search", - ¶ms(&[("$q", "Learning")]), - ) - .await - .unwrap(); - // Descending BM25 score order. - assert_eq!(result_slugs(&result), vec!["rl-intro", "ml-intro", "dl-basics"]); -} - -// Characterization: fuzzy() does NOT match under the default tokenizer/index in -// this setup β€” a one-edit typo ("Introductio" for "Introduction") returns no -// rows. (`search`/`match_text` DO work, so FTS itself is fine; fuzzy term -// queries specifically are inert here.) This pins that documented limitation -// instead of leaving fuzzy silently unasserted: if a Lance/tokenizer change -// makes fuzzy match, this turns red and should be promoted to a real -// matched-set + exclusion golden. -#[tokio::test] -#[serial] -async fn fuzzy_does_not_match_under_default_tokenizer() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_search_db(&dir).await; - let r = query_main(&mut db, SEARCH_QUERIES, "fuzzy_search", ¶ms(&[("$q", "Introductio")])) - .await - .unwrap(); - assert!( - result_slugs(&r).is_empty(), - "fuzzy now matches β€” promote this to a real matched-set/exclusion golden" - ); -} - -// match_text is a FILTER on the body: assert the exact matched set, not contains. 
-#[tokio::test] -#[serial] -async fn match_text_matches_exact_set_excludes_unrelated() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_search_db(&dir).await; - // "neural" appears only in dl-basics's body ("neural networks"). - let r = query_main(&mut db, SEARCH_QUERIES, "phrase_search", ¶ms(&[("$q", "neural")])) - .await - .unwrap(); - let mut got = result_slugs(&r); - got.sort(); - assert_eq!(got, vec!["dl-basics"]); -} - -// RRF fuses arms OTHER than the default nearest+bm25: two FTS arms (title+body). -// Proves primary_var resolves when neither arm is `nearest`, and fusion runs. -#[tokio::test] -#[serial] -async fn rrf_fuses_two_fts_fields() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_search_db(&dir).await; - let r = query_main(&mut db, SEARCH_QUERIES, "rrf_two_fts", ¶ms(&[("$q", "learning")])) - .await - .unwrap(); - assert_eq!(result_slugs(&r), vec!["dl-basics", "ml-intro", "rl-intro"]); -} - -// RRF fuses two vector arms (no embedding creds β€” explicit vectors). A doc near -// BOTH query vectors out-ranks one near only one. -#[tokio::test] -#[serial] -async fn rrf_fuses_two_vector_queries() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_search_db(&dir).await; - let r = query_main( - &mut db, - SEARCH_QUERIES, - "rrf_two_vectors", - &two_vector_params("$q1", &[0.1, 0.2, 0.3, 0.4], "$q2", &[0.5, 0.6, 0.7, 0.8]), - ) - .await - .unwrap(); - assert_eq!(result_slugs(&r), vec!["rl-intro", "ml-intro", "dl-basics"]); -} - #[tokio::test] #[serial] async fn mutation_commit_refreshes_search_indices_without_manual_ensure() { diff --git a/crates/omnigraph/tests/staged_writes.rs b/crates/omnigraph/tests/staged_writes.rs index cf0e04c..021b36e 100644 --- a/crates/omnigraph/tests/staged_writes.rs +++ b/crates/omnigraph/tests/staged_writes.rs @@ -2,7 +2,7 @@ //! exercise `stage_append`, `stage_merge_insert`, `scan_with_staged`, //! and `count_rows_with_staged` directly against a Lance dataset β€” no //! 
Omnigraph engine involved. The engine-level use of these primitives -//! is exercised by `tests/writes.rs`. +//! is exercised by `tests/runs.rs`. //! //! Test surface here: //! 1. `stage_append` + `scan_with_staged` shows committed + staged data @@ -23,9 +23,6 @@ use arrow_schema::{DataType, Field, Schema}; use futures::TryStreamExt; use lance::Dataset; use lance::dataset::{WhenMatched, WhenNotMatched}; -use lance::index::DatasetIndexExt; -use lance_index::IndexType; -use lance_linalg::distance::MetricType; use lance_table::format::Fragment; use omnigraph::table_store::{StagedWrite, TableStore}; use std::sync::Arc; @@ -37,22 +34,6 @@ fn person_schema() -> Arc { ])) } -/// Test-only helper: raw `Dataset::append` to advance Lance HEAD without -/// going through the manifest. Mirrors `TableStore::append_batch`'s body -/// (which is `pub(crate)` after MR-854) β€” kept local so these -/// drift-simulation tests don't depend on the demoted crate-internal API. -async fn lance_append_inline_local(ds: &mut Dataset, batch: RecordBatch) { - use lance::dataset::{WriteMode, WriteParams}; - let schema = batch.schema(); - let reader = arrow_array::RecordBatchIterator::new(vec![Ok(batch)], schema); - let params = WriteParams { - mode: WriteMode::Append, - allow_external_blob_outside_bases: true, - ..Default::default() - }; - ds.append(reader, Some(params)).await.unwrap(); -} - fn person_batch(rows: &[(&str, Option)]) -> RecordBatch { let ids: Vec<&str> = rows.iter().map(|(id, _)| *id).collect(); let ages: Vec> = rows.iter().map(|(_, age)| *age).collect(); @@ -370,7 +351,7 @@ async fn stage_merge_insert_then_commit_persists_merged_view() { /// `write_fragments_internal` lack per-column statistics. The result /// contains only matching committed rows; matching staged rows are /// silently absent. `scanner.use_stats(false)` does not bypass this in -/// lance 6.0.1. +/// lance 4.0.0. 
/// /// This test pins the actual behavior so a future change either /// preserves it (and updates the doc) or fixes it (and rewrites this @@ -635,58 +616,6 @@ async fn stage_overwrite_replaces_all_fragments() { ); } -#[tokio::test] -async fn stage_overwrite_empty_batch_replaces_all_rows() { - let dir = tempfile::tempdir().unwrap(); - let uri = format!("{}/people.lance", dir.path().to_str().unwrap()); - let store = TableStore::new(dir.path().to_str().unwrap()); - - let ds = TableStore::write_dataset( - &uri, - person_batch(&[("alice", Some(30)), ("bob", Some(25))]), - ) - .await - .unwrap(); - let pre_version = ds.version().version; - - let target_schema = Arc::new(Schema::new(vec![ - Field::new("id", DataType::Utf8, false), - Field::new("age", DataType::Int32, true), - Field::new("nickname", DataType::Utf8, true), - ])); - let staged = store - .stage_overwrite(&ds, RecordBatch::new_empty(target_schema.clone())) - .await - .unwrap(); - assert!( - staged.new_fragments.is_empty(), - "empty overwrite should produce a zero-fragment Lance Overwrite transaction" - ); - assert_eq!( - staged.removed_fragment_ids.len(), - ds.manifest.fragments.len(), - "empty overwrite still removes every committed fragment" - ); - assert_eq!( - ds.version().version, - pre_version, - "staging empty overwrite must not advance HEAD" - ); - - let new_ds = store - .commit_staged(Arc::new(ds.clone()), staged.transaction) - .await - .unwrap(); - assert_eq!(new_ds.version().version, pre_version + 1); - assert_eq!(new_ds.count_rows(None).await.unwrap(), 0); - assert!( - arrow_schema::Schema::from(new_ds.schema()) - .field_with_name("nickname") - .is_ok(), - "empty overwrite must commit the replacement batch schema" - ); -} - /// `stage_create_btree_index` writes index segments to object storage /// but does NOT advance Lance HEAD until `commit_staged`. After commit, /// the index is queryable. 
@@ -770,7 +699,7 @@ async fn stage_create_inverted_index_does_not_advance_head_until_commit() { ); } -/// Pin the inline-commit behavior of `delete_where`. Lance 6.0.1 does +/// Pin the inline-commit behavior of `delete_where`. Lance 4.0.0 does /// NOT expose a public `DeleteJob::execute_uncommitted` /// (`pub(crate)` β€” see lance-format/lance#6658). The trait deliberately /// does NOT introduce a `stage_delete` wrapper that would secretly @@ -780,11 +709,12 @@ async fn stage_create_inverted_index_does_not_advance_head_until_commit() { /// /// **When Lance #6658 lands**: this test will need to flip β€” replace /// the assertion with a `stage_delete` + `commit_staged` round-trip -/// and remove the residual line in `docs/dev/writes.md`. +/// and remove the residual line in `docs/runs.md`. #[tokio::test] async fn delete_where_advances_head_inline_documents_residual() { let dir = tempfile::tempdir().unwrap(); let uri = format!("{}/people.lance", dir.path().to_str().unwrap()); + let store = TableStore::new(dir.path().to_str().unwrap()); let mut ds = TableStore::write_dataset( &uri, @@ -794,11 +724,13 @@ async fn delete_where_advances_head_inline_documents_residual() { .unwrap(); let pre_version = ds.version().version; - let result = ds.delete("id = 'alice'").await.unwrap(); - ds = (*result.new_dataset).clone(); - assert_eq!(result.num_deleted_rows, 1); + let result = store + .delete_where(&uri, &mut ds, "id = 'alice'") + .await + .unwrap(); + assert_eq!(result.deleted_rows, 1); assert!( - ds.version().version > pre_version, + result.version > pre_version, "delete_where ADVANCES Lance HEAD inline (the residual). When \ lance-format/lance#6658 ships and we migrate to stage_delete + \ commit_staged, flip this assertion to assert that staging does \ @@ -807,9 +739,9 @@ async fn delete_where_advances_head_inline_documents_residual() { } /// Companion to `delete_where_*`: pin the inline-commit behavior of -/// `create_vector_index`. 
Lance 6.0.1 vector indices take the +/// `create_vector_index`. Lance 4.0.0 vector indices take the /// "segment commit path" which calls `build_index_metadata_from_segments` -/// (`pub(crate)` in lance-6.0.1 `src/index.rs:111`). Until upstream +/// (`pub(crate)` in lance-4.0.0 `src/index.rs:111`). Until upstream /// exposes that helper (companion ticket to lance-format/lance#6658), /// the trait surface deliberately does NOT include /// `stage_create_vector_index` β€” same rationale as `stage_delete`'s @@ -848,9 +780,8 @@ async fn create_vector_index_advances_head_inline_documents_residual() { let pre_version = ds.version().version; assert!(!store.has_vector_index(&ds, "embedding").await.unwrap()); - let params = lance::index::vector::VectorIndexParams::ivf_flat(1, MetricType::L2); - ds.create_index_builder(&["embedding"], IndexType::Vector, ¶ms) - .replace(true) + store + .create_vector_index(&mut ds, "embedding") .await .unwrap(); assert!( @@ -873,7 +804,7 @@ async fn create_vector_index_advances_head_inline_documents_residual() { /// The Lance source confirms this β€” `restore()` (no args) takes the /// currently-checked-out version's content and applies it via /// `apply_commit` against the latest manifest, advancing HEAD by one. -/// See lance-6.0.1 `src/dataset.rs:1106` and the transaction-spec +/// See lance-4.0.0 `src/dataset.rs:1106` and the transaction-spec /// example at https://lance.org/format/table/transaction/. /// /// If the lance bump (4.0.0 β†’ 4.x) ever changes this delta or the call @@ -884,6 +815,7 @@ async fn create_vector_index_advances_head_inline_documents_residual() { async fn lance_restore_appends_one_commit_with_checked_out_content() { let dir = tempfile::tempdir().unwrap(); let uri = format!("{}/people.lance", dir.path().to_str().unwrap()); + let store = TableStore::new(dir.path().to_str().unwrap()); // Build version history: v1 = {alice}, v2 = {alice, bob}, v3 = {alice, bob, carol}. 
let mut ds = TableStore::write_dataset(&uri, person_batch(&[("alice", Some(30))])) @@ -891,10 +823,16 @@ async fn lance_restore_appends_one_commit_with_checked_out_content() { .unwrap(); assert_eq!(ds.version().version, 1); - lance_append_inline_local(&mut ds, person_batch(&[("bob", Some(25))])).await; + store + .append_batch(&uri, &mut ds, person_batch(&[("bob", Some(25))])) + .await + .unwrap(); assert_eq!(ds.version().version, 2); - lance_append_inline_local(&mut ds, person_batch(&[("carol", Some(40))])).await; + store + .append_batch(&uri, &mut ds, person_batch(&[("carol", Some(40))])) + .await + .unwrap(); assert_eq!(ds.version().version, 3); let head_before = ds.version().version; @@ -940,7 +878,7 @@ async fn lance_restore_appends_one_commit_with_checked_out_content() { /// and any future continuous-recovery reconciler's queue-acquisition /// requirement. /// -/// `Dataset::restore`'s `check_restore_txn` (lance-6.0.1 +/// `Dataset::restore`'s `check_restore_txn` (lance-4.0.0 /// `src/io/commit/conflict_resolver.rs:986`) returns `Ok(())` against /// almost every other op (Append, Update, Delete, CreateIndex, Merge, …), /// so a Restore commits successfully even with concurrent commits in @@ -970,6 +908,7 @@ async fn lance_restore_appends_one_commit_with_checked_out_content() { async fn lance_restore_loses_to_concurrent_append_via_orphaning() { let dir = tempfile::tempdir().unwrap(); let uri = format!("{}/people.lance", dir.path().to_str().unwrap()); + let store = TableStore::new(dir.path().to_str().unwrap()); // v1: seed with alice. let _ = TableStore::write_dataset(&uri, person_batch(&[("alice", Some(30))])) @@ -986,7 +925,10 @@ async fn lance_restore_loses_to_concurrent_append_via_orphaning() { // This simulates a per-table-queue model where another tenant wrote // between recovery's open and recovery's restore call. 
let mut writer_handle = Dataset::open(&uri).await.unwrap(); - lance_append_inline_local(&mut writer_handle, person_batch(&[("bob", Some(25))])).await; + store + .append_batch(&uri, &mut writer_handle, person_batch(&[("bob", Some(25))])) + .await + .unwrap(); assert_eq!(writer_handle.version().version, 2); // Recovery now restores. Because restore's `check_restore_txn` returns @@ -1046,54 +988,3 @@ async fn lance_restore_loses_to_concurrent_append_via_orphaning() { let v2_ids = collect_ids(&v2_batches); assert_eq!(v2_ids, vec!["alice".to_string(), "bob".to_string()]); } - -/// Regression for PR #229: `commit_staged` must skip Lance's per-commit -/// auto-cleanup hook. A graph created BEFORE the v7 bump (6.0.1 defaulted -/// `WriteParams::auto_cleanup` ON) carries `lance.auto_cleanup.*` config on its -/// datasets that `auto_cleanup = None` on new writes cannot retroactively clear; -/// Lance's hook fires off that *stored* config at commit time. Without the skip, -/// the engine's own writes would GC the versions `__manifest` pins for -/// snapshots/time-travel. (The substrate negative control β€” that the config -/// really does GC without the skip β€” lives in -/// `lance_surface_guards.rs::skip_auto_cleanup_suppresses_version_gc`.) -#[tokio::test] -async fn commit_staged_skips_auto_cleanup_so_pinned_versions_survive() { - use std::collections::HashMap; - - let dir = tempfile::tempdir().unwrap(); - let uri = format!("{}/people.lance", dir.path().to_str().unwrap()); - let store = TableStore::new(dir.path().to_str().unwrap()); - - let mut ds = TableStore::write_dataset(&uri, person_batch(&[("seed", Some(0))])) - .await - .unwrap(); - let v1 = ds.version().version; - - // Simulate a pre-bump dataset: aggressive legacy auto_cleanup config (fire on - // every commit, delete anything older than now). 
- let mut cfg = HashMap::new(); - cfg.insert("lance.auto_cleanup.interval".to_string(), "1".to_string()); - cfg.insert("lance.auto_cleanup.older_than".to_string(), "0ms".to_string()); - ds.update_config(cfg).await.unwrap(); - - // Several writes through the engine's staged commit path. - for i in 0..5i32 { - let name = format!("p{i}"); - let staged = store - .stage_append(&ds, person_batch(&[(name.as_str(), Some(i))]), &[]) - .await - .unwrap(); - ds = store - .commit_staged(Arc::new(ds.clone()), staged.transaction) - .await - .unwrap(); - } - - // `commit_staged` sets `with_skip_auto_cleanup(true)`, so the legacy config - // must NOT have GC'd the `__manifest`-pinned create version. - assert!( - ds.checkout_version(v1).await.is_ok(), - "commit_staged must skip Lance auto-cleanup so a pre-bump graph's pinned \ - v{v1} survives; it was GC'd" - ); -} diff --git a/crates/omnigraph/tests/traversal.rs b/crates/omnigraph/tests/traversal.rs index 2f518fd..6efe7de 100644 --- a/crates/omnigraph/tests/traversal.rs +++ b/crates/omnigraph/tests/traversal.rs @@ -46,194 +46,6 @@ query not_at_acme() { assert_eq!(names_vec, vec!["Bob", "Charlie", "Diana"]); } -// Nested anti-join (double negation): proves `not { … not { … } }` recurses -// through execute_pipeline. "People who do NOT work at any NON-Acme company": -// inner `not { $c.name = "Acme" }` keeps the non-Acme employers, the outer `not` -// removes anyone who has one. Alice (Acme only), Charlie & Diana (no employer) -// remain β€” distinct from plain unemployed {Charlie, Diana}. 
-#[tokio::test] -async fn nested_anti_join_double_negation() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - - let queries = r#" -query no_nonacme_employer() { - match { - $p: Person - not { - $p worksAt $c - not { - $c.name = "Acme" - } - } - } - return { $p.name } -} -"#; - let result = query_main(&mut db, queries, "no_nonacme_employer", &ParamMap::new()) - .await - .unwrap(); - - let batch = result.concat_batches().unwrap(); - let names = batch - .column(0) - .as_any() - .downcast_ref::<StringArray>() - .unwrap(); - let mut names_vec: Vec<&str> = (0..names.len()).map(|i| names.value(i)).collect(); - names_vec.sort(); - assert_eq!(names_vec, vec!["Alice", "Charlie", "Diana"]); -} - -// The anti-join has two execution forks: the CSR `has_neighbors` fast path -// (bare single-op Expand inner) and the set-oriented inner-pipeline replay (when -// dst_filters force a multi-op inner). They must agree. `not { $p worksAt $_ }` -// takes the fast path; the same negation with an always-true dst filter -// (`$c.name != ""`) is semantically identical but forces the slow path.
-#[tokio::test] -async fn anti_join_fast_and_slow_paths_agree() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - - let queries = r#" -query fast() { - match { - $p: Person - not { $p worksAt $_ } - } - return { $p.name } -} -query slow() { - match { - $p: Person - not { - $p worksAt $c - $c.name != "" - } - } - return { $p.name } -} -"#; - let names = |result: omnigraph_compiler::result::QueryResult| { - let batch = result.concat_batches().unwrap(); - let col = batch - .column(0) - .as_any() - .downcast_ref::<StringArray>() - .unwrap(); - let mut v: Vec<String> = (0..col.len()).map(|i| col.value(i).to_string()).collect(); - v.sort(); - v - }; - - let fast = names(query_main(&mut db, queries, "fast", &ParamMap::new()).await.unwrap()); - let slow = names(query_main(&mut db, queries, "slow", &ParamMap::new()).await.unwrap()); - - assert_eq!(fast, slow, "anti-join fast and slow paths must agree"); - // Alice->Acme, Bob->Globex employed; Charlie & Diana have no employer. - assert_eq!(fast, vec!["Charlie", "Diana"]); -} - -// Regression: nested slow-path anti-joins must not collide on the synthetic -// correlation tag. The outer anti-join tags rows with a correlation column that -// rides through its inner pipeline; when the inner pipeline contains ANOTHER -// slow-path anti-join, a fixed tag name would duplicate, and reading it by name -// returns the OUTER tag — mis-correlating the inner negation. Fan-out (p1 works -// at two companies) makes the inner row indices diverge from the outer tags, so -// the bug produces a different person set than the correct one. -#[tokio::test] -async fn nested_anti_join_with_fanout_correlates_correctly() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - // p1 -> {Acme, Globex} (fan-out), p2 -> Globex, p3 -> Acme, p4 -> (none).
- let data = r#"{"type":"Person","data":{"name":"p1"}} -{"type":"Person","data":{"name":"p2"}} -{"type":"Person","data":{"name":"p3"}} -{"type":"Person","data":{"name":"p4"}} -{"type":"Company","data":{"name":"Acme"}} -{"type":"Company","data":{"name":"Globex"}} -{"edge":"WorksAt","from":"p1","to":"Acme"} -{"edge":"WorksAt","from":"p1","to":"Globex"} -{"edge":"WorksAt","from":"p2","to":"Globex"} -{"edge":"WorksAt","from":"p3","to":"Acme"}"#; - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - load_jsonl(&mut db, data, LoadMode::Overwrite).await.unwrap(); - - let queries = r#" -query no_nonacme_employer() { - match { - $p: Person - not { - $p worksAt $c - not { - $c.name = "Acme" - } - } - } - return { $p.name } -} -"#; - let result = query_main(&mut db, queries, "no_nonacme_employer", &ParamMap::new()) - .await - .unwrap(); - let batch = result.concat_batches().unwrap(); - let names = batch - .column(0) - .as_any() - .downcast_ref::<StringArray>() - .unwrap(); - let mut names_vec: Vec<&str> = (0..names.len()).map(|i| names.value(i)).collect(); - names_vec.sort(); - // p1 & p2 have a non-Acme employer (Globex) -> excluded; p3 (Acme only) and - // p4 (no employer) remain. - assert_eq!(names_vec, vec!["p3", "p4"]); -} - -// Regression: a multi-hop anti-join must not take the bulk fast path. The fast -// path answers via `has_neighbors` (ONE-hop existence), so `not { $p knows{2,2} -// $x }` would wrongly drop a node that has a 1-hop neighbor but no 2-hop path. -// Graph: a->b (b is a sink, so a has no 2-hop path), c->d->e (c has a 2-hop -// path). Only c has a 2-hop knows path, so only c is removed.
-#[tokio::test] -async fn anti_join_respects_multi_hop_bounds() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let data = r#"{"type":"Person","data":{"name":"a"}} -{"type":"Person","data":{"name":"b"}} -{"type":"Person","data":{"name":"c"}} -{"type":"Person","data":{"name":"d"}} -{"type":"Person","data":{"name":"e"}} -{"edge":"Knows","from":"a","to":"b"} -{"edge":"Knows","from":"c","to":"d"} -{"edge":"Knows","from":"d","to":"e"}"#; - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - load_jsonl(&mut db, data, LoadMode::Overwrite).await.unwrap(); - - let queries = r#" -query no_two_hop() { - match { - $p: Person - not { $p knows{2,2} $x } - } - return { $p.name } -} -"#; - let result = query_main(&mut db, queries, "no_two_hop", &ParamMap::new()) - .await - .unwrap(); - let batch = result.concat_batches().unwrap(); - let names = batch - .column(0) - .as_any() - .downcast_ref::<StringArray>() - .unwrap(); - let mut names_vec: Vec<&str> = (0..names.len()).map(|i| names.value(i)).collect(); - names_vec.sort(); - // Only c has a 2-hop knows path → removed; everyone else (incl. a, which has - // a 1-hop neighbor but no 2-hop path) is kept. - assert_eq!(names_vec, vec!["a", "b", "d", "e"]); -} - // ─── Variable-length hops ─────────────────────────────────────────────────── const CHAIN_SCHEMA: &str = r#" diff --git a/crates/omnigraph/tests/traversal_indexed.rs b/crates/omnigraph/tests/traversal_indexed.rs deleted file mode 100644 index 2ceed85..0000000 --- a/crates/omnigraph/tests/traversal_indexed.rs +++ /dev/null @@ -1,327 +0,0 @@ -//! BTREE-indexed Expand path (`execute_expand_indexed`) coverage. -//! -//! These tests force the Expand execution mode via `OMNIGRAPH_TRAVERSAL_MODE` -//! and assert the indexed path matches the CSR path (both are semantically -//! identical — the indexed path just serves neighbor lookups from the persisted -//! src/dst BTREE instead of an in-memory CSR). They live in their own test -//!
binary and are all `#[serial]`, so the env writes never race a concurrent -//! reader: within this process serial execution serializes every env read, and -//! other test binaries (e.g. `traversal.rs`) are separate processes whose env -//! stays unset (→ CSR), validating the shared hydrate/align tail on the CSR path. - -mod helpers; - -use arrow_array::{Array, StringArray}; - -use omnigraph::db::Omnigraph; -use omnigraph::loader::{LoadMode, load_jsonl}; -use omnigraph::table_store::{IndexCoverage, TableStore}; -use omnigraph_compiler::ir::ParamMap; -use serial_test::serial; - -use helpers::*; - -fn set_mode(mode: &str) { - // SAFE: every test here is #[serial] and this binary has no non-serial - // env reader, so no thread reads the environment during this write. - unsafe { std::env::set_var("OMNIGRAPH_TRAVERSAL_MODE", mode) }; -} - -fn clear_mode() { - unsafe { std::env::remove_var("OMNIGRAPH_TRAVERSAL_MODE") }; -} - -/// Run a name-returning query and return its first column, sorted. -async fn sorted_names(db: &mut Omnigraph, queries: &str, name: &str, params: &ParamMap) -> Vec<String> { - let result = query_main(db, queries, name, params).await.unwrap(); - if result.num_rows() == 0 { - return Vec::new(); - } - let batch = result.concat_batches().unwrap(); - let col = batch - .column(0) - .as_any() - .downcast_ref::<StringArray>() - .unwrap(); - let mut v: Vec<String> = (0..col.len()).map(|i| col.value(i).to_string()).collect(); - v.sort(); - v -} - -/// Run the same query under CSR, indexed, and auto (cost-chooser) modes; assert -/// all three produce identical results and return them. The auto pass exercises -/// `choose_expand_mode` end to end: whichever path it selects, the rows must -/// match the forced paths (the chooser changes which path runs, never the result).
-async fn both_modes(db: &mut Omnigraph, queries: &str, name: &str, params: &ParamMap) -> Vec<String> { - set_mode("csr"); - let csr = sorted_names(db, queries, name, params).await; - set_mode("indexed"); - let indexed = sorted_names(db, queries, name, params).await; - clear_mode(); - let auto = sorted_names(db, queries, name, params).await; - assert_eq!( - indexed, csr, - "indexed Expand must produce identical results to CSR for query '{name}'" - ); - assert_eq!( - auto, csr, - "auto (cost-chooser) Expand must produce identical results to the forced paths for query '{name}'" - ); - indexed -} - -// The C6 index-coverage guard: `key_column_index_coverage` must report whether -// a `key_col IN (...)` scan will use the persisted BTREE or silently full-scan. -// Not #[serial] — it calls the helper directly and reads no env. -#[tokio::test] -async fn key_column_index_coverage_detects_btree_presence() { - let dir = tempfile::tempdir().unwrap(); - let db = init_and_load(&dir).await; - let snap = snapshot_main(&db).await.unwrap(); - - // Edge `src` gets a BTREE from ensure_indices on load → Indexed. - let edge_ds = snap.open("edge:Knows").await.unwrap(); - let src_cov = TableStore::key_column_index_coverage(&edge_ds, "src") - .await - .unwrap(); - assert_eq!(src_cov, IndexCoverage::Indexed, "edge src is BTREE-indexed"); - - // A node property column with no scalar index → Degraded (the warn path). - let node_ds = snap.open("node:Person").await.unwrap(); - let age_cov = TableStore::key_column_index_coverage(&node_ds, "age") - .await - .unwrap(); - assert!( - matches!(age_cov, IndexCoverage::Degraded { .. }), - "non-indexed column should be Degraded, got {age_cov:?}" - ); -} - -// An edge appended after the BTREE was built lands in a new fragment that the -// index does not cover (edge-index creation is skipped once a BTREE exists).
The -// scan is then partly a full scan, so coverage must report `Degraded` — otherwise -// the cost chooser would price an unindexed-in-part scan as fully indexed. -// (Results stay correct regardless — `indexed_finds_unindexed_appended_edge`.) -#[tokio::test] -async fn coverage_degrades_for_appended_unindexed_fragment() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - - // Fresh load: the Knows BTREE covers every fragment → Indexed. - let snap = snapshot_main(&db).await.unwrap(); - let edge_ds = snap.open("edge:Knows").await.unwrap(); - assert_eq!( - TableStore::key_column_index_coverage(&edge_ds, "src").await.unwrap(), - IndexCoverage::Indexed, - "freshly-loaded edge BTREE covers all fragments" - ); - - // Append an edge → a new, unindexed fragment outside the index fragment_bitmap. - mutate_main( - &mut db, - MUTATION_QUERIES, - "add_friend", - &params(&[("$from", "Alice"), ("$to", "Diana")]), - ) - .await - .unwrap(); - - let snap2 = snapshot_main(&db).await.unwrap(); - let edge_ds2 = snap2.open("edge:Knows").await.unwrap(); - let cov = TableStore::key_column_index_coverage(&edge_ds2, "src").await.unwrap(); - assert!( - matches!(cov, IndexCoverage::Degraded { .. }), - "appended unindexed fragment must degrade coverage, got {cov:?}" - ); -} - -#[tokio::test] -#[serial] -async fn indexed_matches_csr_one_hop_same_type() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - // friends_of: `$p knows $f` (Person -> Person, single hop).
- let got = both_modes(&mut db, TEST_QUERIES, "friends_of", &params(&[("$name", "Alice")])).await; - assert_eq!(got, vec!["Bob", "Charlie"], "Alice knows Bob and Charlie"); -} - -#[tokio::test] -#[serial] -async fn indexed_matches_csr_multi_hop_same_type() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - let queries = r#" -query reach($name: String) { - match { - $p: Person { name: $name } - $p knows{1,2} $f - } - return { $f.name } -} -"#; - // Alice -> Bob, Charlie (1 hop); Bob -> Diana (2 hops). - let got = both_modes(&mut db, queries, "reach", &params(&[("$name", "Alice")])).await; - assert_eq!(got, vec!["Bob", "Charlie", "Diana"]); -} - -#[tokio::test] -#[serial] -async fn indexed_matches_csr_cross_type() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - let queries = r#" -query employer($name: String) { - match { - $p: Person { name: $name } - $p worksAt $c - } - return { $c.name } -} -"#; - let got = both_modes(&mut db, queries, "employer", &params(&[("$name", "Alice")])).await; - assert_eq!(got, vec!["Acme"], "Alice works at Acme"); -} - -#[tokio::test] -#[serial] -async fn indexed_matches_csr_no_match() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - // Diana has no outgoing Knows edges → empty in both modes. - let got = both_modes(&mut db, TEST_QUERIES, "friends_of", &params(&[("$name", "Diana")])).await; - assert!(got.is_empty(), "Diana knows no one"); -} - -#[tokio::test] -#[serial] -async fn indexed_finds_unindexed_appended_edge() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - - // Append Alice -> Diana AFTER the initial load. `ensure_indices`' existence - // guard means the src/dst BTREE built on the first load does NOT cover this - // new fragment. The indexed path must still find it via Lance's - // unindexed-fragment scan (fast_search=false default), so partial index - // coverage never silently drops rows.
- mutate_main( - &mut db, - MUTATION_QUERIES, - "add_friend", - &params(&[("$from", "Alice"), ("$to", "Diana")]), - ) - .await - .unwrap(); - - set_mode("indexed"); - let got = sorted_names(&mut db, TEST_QUERIES, "friends_of", &params(&[("$name", "Alice")])).await; - clear_mode(); - - assert_eq!( - got, - vec!["Bob", "Charlie", "Diana"], - "indexed traversal must see the freshly-appended, unindexed edge" - ); -} - -// Regression: a node `id` is unique only WITHIN a type, so a `Person` and a -// `Company` can share an id string. A variable-length traversal over a -// cross-type edge (`worksAt`, Person -> Company) must structurally stop after -// one hop — a Company is not a `worksAt` source — so `worksAt{1,2}` returns -// exactly the one-hop companies. Before the structural hop-cap, the indexed -// path's single string interner de-interned the hop-1 Company id back to the -// colliding Person id and ran a hop-2 `worksAt src IN (...)` scan that matched -// that same-string Person's edges, emitting a spurious second-hop company the -// CSR path never produces. `both_modes` (csr == indexed == auto) plus the -// golden assert catch both the divergence and an over-emitting shared bug. -#[tokio::test] -#[serial] -async fn cross_type_id_collision_does_not_bleed_into_second_hop() { - const SCHEMA: &str = r#" -node Person { name: String @key } -node Company { name: String @key } -edge WorksAt: Person -> Company -"#; - // `shared` is BOTH a Person id and a Company id. alice worksAt the Company - // `shared`; the Person `shared` worksAt the Company `other`.
- const DATA: &str = r#"{"type":"Person","data":{"name":"alice"}} -{"type":"Person","data":{"name":"shared"}} -{"type":"Company","data":{"name":"shared"}} -{"type":"Company","data":{"name":"other"}} -{"edge":"WorksAt","from":"alice","to":"shared"} -{"edge":"WorksAt","from":"shared","to":"other"}"#; - const QUERY: &str = r#" -query reach($name: String) { - match { - $p: Person { name: $name } - $p worksAt{1,2} $c - } - return { $c.name } -} -"#; - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let mut db = Omnigraph::init(uri, SCHEMA).await.unwrap(); - load_jsonl(&mut db, DATA, LoadMode::Overwrite).await.unwrap(); - - let got = both_modes(&mut db, QUERY, "reach", &params(&[("$name", "alice")])).await; - assert_eq!( - got, - vec!["shared"], - "cross-type worksAt{{1,2}} must return only the one-hop company; a hop-2 \ - result means the id-string collision bled across types" - ); -} - -const REACH_5: &str = r#" -query reach($name: String) { - match { - $p: Person { name: $name } - $p knows{1,5} $f - } - return { $f.name } -} -"#; - -// A directed 3-cycle a->b->c->a, traversed with a hop ceiling (5) ABOVE the cycle -// length. Variable-length traversal must terminate and dedup (the source is -// seeded into `visited`, so the c->a back-edge does not re-emit a). Uses a -// bounded range deliberately: an unbounded `{1,}` is a typecheck error, not a -// runtime path. `both_modes` also confirms indexed == csr on the cycle.
-#[tokio::test] -#[serial] -async fn variable_hops_terminate_and_dedup_on_cycle() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let data = r#"{"type":"Person","data":{"name":"a"}} -{"type":"Person","data":{"name":"b"}} -{"type":"Person","data":{"name":"c"}} -{"edge":"Knows","from":"a","to":"b"} -{"edge":"Knows","from":"b","to":"c"} -{"edge":"Knows","from":"c","to":"a"}"#; - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - load_jsonl(&mut db, data, LoadMode::Overwrite).await.unwrap(); - - let got = both_modes(&mut db, REACH_5, "reach", &params(&[("$name", "a")])).await; - // From a: b (1 hop), c (2 hops); the c->a back-edge hits the seeded source - // and is not re-emitted. No infinite loop, each node at most once. - assert_eq!(got, vec!["b", "c"]); -} - -// A self-loop a->a plus a->b. Variable-length traversal must not loop forever and -// must not re-emit the seeded source. -#[tokio::test] -#[serial] -async fn variable_hops_handle_self_loop() { - let dir = tempfile::tempdir().unwrap(); - let uri = dir.path().to_str().unwrap(); - let data = r#"{"type":"Person","data":{"name":"a"}} -{"type":"Person","data":{"name":"b"}} -{"edge":"Knows","from":"a","to":"a"} -{"edge":"Knows","from":"a","to":"b"}"#; - let mut db = Omnigraph::init(uri, TEST_SCHEMA).await.unwrap(); - load_jsonl(&mut db, data, LoadMode::Overwrite).await.unwrap(); - - let got = both_modes(&mut db, REACH_5, "reach", &params(&[("$name", "a")])).await; - // a->a hits the seeded source (pruned); only b is reached.
- assert_eq!(got, vec!["b"]); -} diff --git a/crates/omnigraph/tests/validators.rs b/crates/omnigraph/tests/validators.rs index ce8525d..4c7a2f3 100644 --- a/crates/omnigraph/tests/validators.rs +++ b/crates/omnigraph/tests/validators.rs @@ -237,58 +237,6 @@ async fn cardinality_rejected_on_mutation_insert_edge() { ); } -/// RFC-013 step 3b regression guard (cursor High / codex P1 on #298): edge `@card` -/// validation must scan LIVE committed HEAD, not the pinned `txn.base`. Collapse #1 -/// skips the edge accumulation open, so a non-strict edge insert under a `WriteTxn` -/// reopens for the cardinality scan — and that scan must observe edges a concurrent -/// writer committed after this mutation captured its base, or a `@card` max is -/// silently exceeded (invariant 9). The residual validate→commit TOCTOU is the §7.1 -/// gap (step 4); this only un-widens what 3b widened (live HEAD vs mutation-start base). -/// -/// Deterministic — no failpoint: handle B's coordinator is stale by construction -/// (the write path does not probe the manifest version, unlike the read path). B MUST -/// NOT read between A's commit and B's insert — a read refreshes B's coordinator and -/// masks the bug (the same caveat as the served stale-view repro in `writes.rs`). -#[tokio::test] -async fn cardinality_rejected_for_stale_handle_after_concurrent_edge_commit() { - let (dir, mut db_a) = init_with(CARDINALITY_SCHEMA, CARDINALITY_SEED).await; - let uri = dir.path().to_str().unwrap(); - - // Handle B opens the same graph at the seed version (no edges yet); it then - // never reads again, so its in-memory coordinator stays pinned at the seed. - let mut db_b = Omnigraph::open(uri).await.unwrap(); - - // Handle A commits WorksAt(Alice -> Acme): Alice is now at the @card(0..1) max. - // This advances the on-disk manifest; B's coordinator is now stale.
- mutate_main( - &mut db_a, - CARDINALITY_MUTATIONS, - "add_employment", - &params(&[("$person", "Alice"), ("$company", "Acme")]), - ) - .await - .unwrap(); - - // Handle B (stale, never read since A committed) inserts a second WorksAt for - // Alice. B is non-strict + under a WriteTxn, so collapse #1 skips the open and the - // cardinality scan reopens: it MUST read live HEAD (Alice has 1) → reject (1+1 > 1), - // not the stale base (Alice has 0) → which would wrongly pass and commit a 2nd edge. - let err = mutate_main( - &mut db_b, - CARDINALITY_MUTATIONS, - "add_employment", - &params(&[("$person", "Alice"), ("$company", "Beta")]), - ) - .await - .unwrap_err(); - assert!( - err.to_string().to_lowercase().contains("cardinality") - || err.to_string().to_lowercase().contains("@card"), - "a stale-handle edge insert must be rejected by @card against live HEAD, got: {}", - err - ); -} - #[tokio::test] async fn cardinality_rejected_on_jsonl_load() { // Already covered by existing loader Phase 3 logic but assert the diff --git a/crates/omnigraph/tests/warm_read_cost.rs b/crates/omnigraph/tests/warm_read_cost.rs deleted file mode 100644 index b3f5446..0000000 --- a/crates/omnigraph/tests/warm_read_cost.rs +++ /dev/null @@ -1,742 +0,0 @@ -//! Cost-budget tests for the warm read path (Fix 1): a warm same-branch read -//! must perform no manifest or commit-graph opens, measured via the shared -//! `helpers::cost` harness at the object-store boundary (the LanceDB -//! IO-counted-test pattern; see docs/dev/testing.md). Guards invariant 15 (read -//! cost bounded by work, not history) for snapshot resolution, and invariant 6 -//! (a warm reader still observes external commits).
- -mod helpers; - -use arrow_array::{Array, StringArray}; -use omnigraph::db::{Omnigraph, ReadTarget}; -use omnigraph_compiler::result::QueryResult; - -use helpers::cost::measure; -use helpers::{ - MUTATION_QUERIES, TEST_QUERIES, commit_many, count_rows, init_and_load, mixed_params, - mutate_branch, mutate_main, params, -}; - -fn first_column_strings(result: &QueryResult) -> Vec<String> { - if result.num_rows() == 0 { - return Vec::new(); - } - let batch = result.concat_batches().unwrap(); - let values = batch - .column(0) - .as_any() - .downcast_ref::<StringArray>() - .unwrap(); - let mut out = (0..values.len()) - .filter(|&row| !values.is_null(row)) - .map(|row| values.value(row).to_string()) - .collect::<Vec<_>>(); - out.sort(); - out -} - -/// A warm same-branch read must not re-open or scan `__manifest`, and must not -/// open the commit graph, even at commit-history depth. The only manifest IO is -/// the version probe (counted by invocation). Fails before Fix 1, where the read -/// path re-opens a fresh coordinator and scans both internal tables. -#[tokio::test] -async fn warm_same_branch_read_does_no_resolution_opens() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - // Deep history: warm-read resolution cost must be flat in commit count. - commit_many(&mut db, 20).await; - - let (out, io) = measure(db.query( - ReadTarget::branch("main"), - TEST_QUERIES, - "total_people", - &params(&[]), - )) - .await; - out.unwrap(); - - // A warm same-branch read opens nothing from the internal tables, even at - // commit-history depth. Fix 1 reuses the coordinator (no re-open: 0 - // commit-graph opens, exactly 1 cheap version probe). Fix 2 opens the touched - // data table by location+version instead of via the namespace, so the - // per-table __manifest scan is gone too. Pre-fix, each of these is a deep scan - // of an internal table that grows with commit count.
- assert_eq!( - io.manifest_reads, 0, - "warm same-branch read must not scan __manifest (resolution or per-table)" - ); - assert_eq!( - io.commit_graph_reads, 0, - "warm same-branch read must not open the commit graph (no coordinator re-open)" - ); - assert_eq!( - io.version_probes, 1, - "warm same-branch read performs exactly one version probe" - ); -} - -/// A multi-table query (a traversal touching Person, WorksAt, and Company) scans -/// `__manifest` zero times. Fix 2 opens every touched table by location+version, -/// so manifest IO no longer scales with the number of tables — pre-Fix-2 each -/// table cost two full `__manifest` scans (`describe_table` + -/// `describe_table_version`), which is the "2 tables = 2×" multi-table tax. -#[tokio::test] -async fn multi_table_query_does_no_manifest_scans() { - let dir = tempfile::tempdir().unwrap(); - let db = init_and_load(&dir).await; - - let (out, io) = measure(db.query( - ReadTarget::branch("main"), - TEST_QUERIES, - "age_stats", - &params(&[]), - )) - .await; - out.unwrap(); - - assert_eq!( - io.manifest_reads, 0, - "a multi-table read must not scan __manifest once per touched table" - ); -} - -/// A warm reader must observe a commit made through another handle (invariant 6, -/// strong consistency): the version probe detects the advance and refreshes. -/// Passes before and after Fix 1 (today's cold re-read is always fresh); a -/// regression guard so the warm-reuse fast path never serves a stale read. -#[tokio::test] -async fn external_commit_observed_by_warm_reader() { - let dir = tempfile::tempdir().unwrap(); - let mut writer = init_and_load(&dir).await; - let uri = dir.path().to_str().unwrap(); - let reader = Omnigraph::open(uri).await.unwrap(); - - let before = count_rows(&reader, "node:Person").await; - - // External commit through a separate handle.
- mutate_main( - &mut writer, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "ext_new_person")], &[("$age", 41)]), - ) - .await - .unwrap(); - - let after = count_rows(&reader, "node:Person").await; - assert_eq!( - after, - before + 1, - "warm reader must observe an external commit" - ); -} - -// ── Finding A: drop the redundant per-query schema validation ───────────────── -// -// Every query runs `ensure_schema_state_valid`. It ran TWICE per query (once in -// query()/run_query_at, once again in resolved_target/snapshot_at_version), each -// reading 3 contract files + 2 existence probes (~10 storage ops). Finding A -// removes the redundant caller, so validation runs once. (A cheaper source-only -// probe was rejected: the codebase requires per-call detection of IR/state drift -// on long-lived handles -- lifecycle::long_lived_handle_rejects_schema_ir_drift -// -- which a source-only compare would miss.) Measured at the StorageAdapter -// boundary with the counting decorator. - -/// A warm query validates the schema contract exactly once (3 reads + 2 exists), -/// not twice. Fails before finding A, where query() and resolved_target each -/// validate (6 read_text + 4 exists). -#[tokio::test] -async fn warm_query_validates_schema_contract_once() { - use omnigraph::instrumentation::CountingStorageAdapter; - use omnigraph::storage::storage_for_uri; - - let dir = tempfile::tempdir().unwrap(); - // Init through the standard path, then re-open behind a counting adapter to - // measure the per-query schema-contract storage reads (delta around the - // query excludes open-time reads). 
- let _ = init_and_load(&dir).await; - let uri = dir.path().to_str().unwrap(); - let (adapter, counts) = CountingStorageAdapter::new(storage_for_uri(uri).unwrap()); - let db = Omnigraph::open_with_storage(uri, adapter).await.unwrap(); - - let before_read_text = counts.read_text(); - let before_exists = counts.exists(); - db.query( - ReadTarget::branch("main"), - TEST_QUERIES, - "total_people", - &params(&[]), - ) - .await - .unwrap(); - - assert_eq!( - counts.read_text() - before_read_text, - 3, - "warm query should validate the schema contract once (3 reads), not twice" - ); - assert_eq!( - counts.exists() - before_exists, - 2, - "warm query should probe contract-file existence once (2 probes), not twice" - ); -} - -/// The cheap source-compare must still detect that the on-disk schema source has -/// drifted from the validated contract and fail the read, rather than serving the -/// stale-but-cached schema. Passes before and after finding A (regression guard -/// for the documented weaker per-query guard). -#[tokio::test] -async fn schema_source_drift_is_caught_on_read() { - let dir = tempfile::tempdir().unwrap(); - let _writer = init_and_load(&dir).await; - let uri = dir.path().to_str().unwrap(); - let reader = Omnigraph::open(uri).await.unwrap(); - - // Drift the on-disk schema source behind the reader's back. - std::fs::write( - dir.path().join("_schema.pg"), - "this is not a valid schema {{{", - ) - .unwrap(); - - let result = reader - .query( - ReadTarget::branch("main"), - TEST_QUERIES, - "total_people", - &params(&[]), - ) - .await; - assert!( - result.is_err(), - "a query must fail when the on-disk schema source has drifted from the validated contract" - ); -} - -// ── Morphological-matrix coverage: branch-warm + stale-refresh cells ────────── - -/// A WARM read on a non-main branch (handle synced to that branch) also scans -/// `__manifest` zero times.
Exercises Fix 2's branch-owned-table open -/// (`{table_path}/tree/{branch}` + with_version) on Fix 1's warm path — the cell -/// that regressed when the open used `with_branch` against the base. -#[tokio::test] -async fn warm_branch_read_does_no_manifest_scans() { - let dir = tempfile::tempdir().unwrap(); - let db = init_and_load(&dir).await; - db.branch_create("feature").await.unwrap(); - // Write to the branch so its tables are branch-owned (under tree/feature). - db.mutate( - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - .unwrap(); - // Bind the handle's coordinator to the branch so reads of it take the warm path. - db.sync_branch("feature").await.unwrap(); - - let (out, io) = measure(db.query( - ReadTarget::branch("feature"), - TEST_QUERIES, - "total_people", - &params(&[]), - )) - .await; - out.unwrap(); - - assert_eq!( - io.manifest_reads, 0, - "warm branch read must not scan __manifest (branch-owned table opened by location)" - ); - assert_eq!( - io.commit_graph_reads, 0, - "warm branch read must not open the commit graph" - ); - assert_eq!( - io.version_probes, 1, - "warm branch read performs exactly one version probe" - ); -} - -/// A non-main branch can be deleted and recreated at the same Lance version -/// number. Warm branch freshness therefore needs the manifest incarnation, not -/// just `version()`, or a reader pinned to the old incarnation can serve stale -/// rows from the deleted branch. This is the correctness guard for Phase 6A.
-#[tokio::test] -async fn warm_read_on_recreated_branch_observes_new_incarnation() { - let dir = tempfile::tempdir().unwrap(); - let mut writer = init_and_load(&dir).await; - let uri = dir.path().to_str().unwrap(); - let reader = Omnigraph::open(uri).await.unwrap(); - - writer.branch_create("feature").await.unwrap(); - mutate_branch( - &mut writer, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Eve")], &[("$age", 22)]), - ) - .await - .unwrap(); - - reader.sync_branch("feature").await.unwrap(); - let old_feature = reader - .query( - ReadTarget::branch("feature"), - TEST_QUERIES, - "get_person", - &params(&[("$name", "Eve")]), - ) - .await - .unwrap(); - assert_eq!( - old_feature.num_rows(), - 1, - "test setup: old feature branch must contain Eve" - ); - let old_version = reader - .version_of(ReadTarget::branch("feature")) - .await - .unwrap(); - - writer.branch_delete("feature").await.unwrap(); - mutate_main( - &mut writer, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "MainOnly")], &[("$age", 44)]), - ) - .await - .unwrap(); - writer.branch_create("feature").await.unwrap(); - let new_version = writer - .version_of(ReadTarget::branch("feature")) - .await - .unwrap(); - assert_eq!( - new_version, old_version, - "test setup must exercise branch incarnation reuse at one Lance version" - ); - - let (new_feature, io) = measure(reader.query( - ReadTarget::branch("feature"), - TEST_QUERIES, - "get_person", - &params(&[("$name", "MainOnly")]), - )) - .await; - let new_feature = new_feature.unwrap(); - - assert_eq!( - new_feature.num_rows(), - 1, - "warm reader must refresh to the recreated branch incarnation" - ); - assert!( - io.manifest_reads > 0, - "recreated branch must re-read the manifest after the incarnation probe" - ); - assert_eq!( - io.commit_graph_reads, 0, - "same-branch incarnation refresh must be manifest-only" - ); - assert_eq!( - io.version_probes, 2, - "stale same-branch read probes once under the read lock
and once under the write lock" - ); -} - -/// Recreated non-main branches can reuse the same branch-owned table version. -/// This forces the held table-handle cache to distinguish incarnations by the -/// per-table Lance manifest e_tag, not just `(table_path, branch, version)`. -#[tokio::test] -async fn recreated_branch_owned_table_handle_uses_table_etag() { - let dir = tempfile::tempdir().unwrap(); - let mut writer = init_and_load(&dir).await; - let uri = dir.path().to_str().unwrap(); - let reader = Omnigraph::open(uri).await.unwrap(); - - writer.branch_create("feature").await.unwrap(); - mutate_branch( - &mut writer, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "OldOnly")], &[("$age", 31)]), - ) - .await - .unwrap(); - - reader.sync_branch("feature").await.unwrap(); - let old_person = reader - .query( - ReadTarget::branch("feature"), - TEST_QUERIES, - "get_person", - &params(&[("$name", "OldOnly")]), - ) - .await - .unwrap(); - assert_eq!(old_person.num_rows(), 1); - let old_entry = reader - .snapshot_of(ReadTarget::branch("feature")) - .await - .unwrap() - .entry("node:Person") - .unwrap() - .clone(); - assert_eq!(old_entry.table_branch.as_deref(), Some("feature")); - - writer.branch_delete("feature").await.unwrap(); - writer.branch_create("feature").await.unwrap(); - mutate_branch( - &mut writer, - "feature", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "NewOnly")], &[("$age", 32)]), - ) - .await - .unwrap(); - let new_entry = writer - .snapshot_of(ReadTarget::branch("feature")) - .await - .unwrap() - .entry("node:Person") - .unwrap() - .clone(); - assert_eq!(new_entry.table_path, old_entry.table_path); - assert_eq!(new_entry.table_branch, old_entry.table_branch); - assert_eq!( - new_entry.table_version, old_entry.table_version, - "test setup must force table handle identity to differ only by e_tag" - ); - - let (new_person, io) = measure(reader.query( - ReadTarget::branch("feature"), - TEST_QUERIES, - 
"get_person", - ¶ms(&[("$name", "NewOnly")]), - )) - .await; - let new_person = new_person.unwrap(); - assert_eq!( - new_person.num_rows(), - 1, - "warm reader must open the recreated branch-owned table incarnation" - ); - assert!( - io.data_reads > 0, - "table e_tag must force a held-handle cache miss for the recreated table" - ); - assert!( - io.manifest_reads > 0, - "recreated branch must refresh the manifest" - ); - assert_eq!( - io.commit_graph_reads, 0, - "same-branch table-incarnation refresh must be manifest-only" - ); - assert_eq!( - io.version_probes, 2, - "stale same-branch read probes once under each lock" - ); - - let stale_old_person = reader - .query( - ReadTarget::branch("feature"), - TEST_QUERIES, - "get_person", - ¶ms(&[("$name", "OldOnly")]), - ) - .await - .unwrap(); - assert_eq!( - stale_old_person.num_rows(), - 0, - "old branch-owned table contents must not leak after branch recreation" - ); -} - -/// The graph-index cache is keyed by synthetic snapshot id plus edge-table -/// state. A recreated branch can reuse the same edge table `(branch, version)`, -/// so the synthetic snapshot id must carry the manifest incarnation or traversal -/// can reuse stale topology. 
-#[tokio::test] -async fn recreated_branch_traversal_uses_graph_index_incarnation() { - let dir = tempfile::tempdir().unwrap(); - let mut writer = init_and_load(&dir).await; - let uri = dir.path().to_str().unwrap(); - let reader = Omnigraph::open(uri).await.unwrap(); - - writer.branch_create("feature").await.unwrap(); - mutate_branch( - &mut writer, - "feature", - MUTATION_QUERIES, - "insert_person_and_friend", - &mixed_params( - &[("$name", "OldWalker"), ("$friend", "Alice")], - &[("$age", 41)], - ), - ) - .await - .unwrap(); - - reader.sync_branch("feature").await.unwrap(); - let old_friends = reader - .query( - ReadTarget::branch("feature"), - TEST_QUERIES, - "friends_of", - &params(&[("$name", "OldWalker")]), - ) - .await - .unwrap(); - assert_eq!(first_column_strings(&old_friends), vec!["Alice"]); - let old_edge_entry = reader - .snapshot_of(ReadTarget::branch("feature")) - .await - .unwrap() - .entry("edge:Knows") - .unwrap() - .clone(); - assert_eq!(old_edge_entry.table_branch.as_deref(), Some("feature")); - - writer.branch_delete("feature").await.unwrap(); - writer.branch_create("feature").await.unwrap(); - mutate_branch( - &mut writer, - "feature", - MUTATION_QUERIES, - "insert_person_and_friend", - &mixed_params( - &[("$name", "NewWalker"), ("$friend", "Bob")], - &[("$age", 42)], - ), - ) - .await - .unwrap(); - let new_edge_entry = writer - .snapshot_of(ReadTarget::branch("feature")) - .await - .unwrap() - .entry("edge:Knows") - .unwrap() - .clone(); - assert_eq!(new_edge_entry.table_path, old_edge_entry.table_path); - assert_eq!(new_edge_entry.table_branch, old_edge_entry.table_branch); - assert_eq!( - new_edge_entry.table_version, old_edge_entry.table_version, - "test setup must force graph-index identity to differ only by snapshot incarnation" - ); - - let (new_friends, io) = measure(reader.query( - ReadTarget::branch("feature"), - TEST_QUERIES, - "friends_of", - &params(&[("$name", "NewWalker")]), - )) - .await; - let new_friends = new_friends.unwrap(); - 
assert_eq!( - first_column_strings(&new_friends), - vec!["Bob"], - "traversal must use the recreated branch's topology, not stale cached graph index" - ); - assert!( - io.manifest_reads > 0, - "recreated branch traversal must refresh the manifest" - ); - assert_eq!( - io.commit_graph_reads, 0, - "same-branch traversal incarnation refresh must be manifest-only" - ); - assert_eq!( - io.version_probes, 2, - "stale same-branch read probes once under each lock" - ); - - let stale_old_friends = reader - .query( - ReadTarget::branch("feature"), - TEST_QUERIES, - "friends_of", - &params(&[("$name", "OldWalker")]), - ) - .await - .unwrap(); - assert_eq!( - first_column_strings(&stale_old_friends), - Vec::<String>::new(), - "old branch topology must not leak after branch recreation" - ); -} - -/// When an external writer advances the manifest, the reader's next query takes -/// the STALE path: it re-reads the manifest (read_iops > 0) but never scans the -/// commit graph (`refresh_manifest_only`), unlike a full coordinator refresh. -/// Pins Fix 1's manifest-only refresh. -#[tokio::test] -async fn stale_read_refreshes_manifest_only() { - let dir = tempfile::tempdir().unwrap(); - let mut writer = init_and_load(&dir).await; - let uri = dir.path().to_str().unwrap(); - let reader = Omnigraph::open(uri).await.unwrap(); - // Establish the reader's warm coordinator. - reader - .query( - ReadTarget::branch("main"), - TEST_QUERIES, - "total_people", - &params(&[]), - ) - .await - .unwrap(); - - // External commit advances the on-disk manifest behind the reader. 
- mutate_main( - &mut writer, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "Frank")], &[("$age", 33)]), - ) - .await - .unwrap(); - - let (out, io) = measure(reader.query( - ReadTarget::branch("main"), - TEST_QUERIES, - "total_people", - &params(&[]), - )) - .await; - out.unwrap(); - - assert!( - io.manifest_reads > 0, - "stale read must re-read the manifest" - ); - assert_eq!( - io.commit_graph_reads, 0, - "stale refresh must be manifest-only (no commit-graph scan)" - ); - assert_eq!( - io.version_probes, 2, - "stale same-branch read probes once under the read lock and once under the write lock" - ); -} - -// ── Fix 3: held-handle cache -- warm repeat reads stop re-opening tables ──────── -// -// After Fix 1+2 a warm same-branch read still re-opened every touched table per -// query (the "never warms up" residual). Fix 3 holds the open `Dataset` per -// `(table, branch, version, e_tag)` (the version-keyed analogue of LanceDB's -// `DatasetConsistencyWrapper`) and shares one `Session` per graph, so a second -// identical warm read reuses the handle with zero table opens. - -/// Headline: a second identical warm same-branch read does ZERO table opens -/// (the cold first read opens; the warm repeat serves from the held-handle -/// cache). Fails before Fix 3, where every read re-opens the table. -#[tokio::test] -async fn repeat_warm_read_reuses_table_handles() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - // Deep history: the win must hold regardless of commit count. - commit_many(&mut db, 10).await; - - // Cold first read: opens the touched table. - let (cold_out, cold) = measure(db.query( - ReadTarget::branch("main"), - TEST_QUERIES, - "total_people", - &params(&[]), - )) - .await; - cold_out.unwrap(); - assert!( - cold.data_reads > 0, - "the cold first read must open the table" - ); - - // Warm repeat: the held handle is reused, so no open happens through this - // query's table wrapper. 
A fresh `measure()` isolates the warm repeat's cost. - let (warm_out, warm) = measure(db.query( - ReadTarget::branch("main"), - TEST_QUERIES, - "total_people", - &params(&[]), - )) - .await; - warm_out.unwrap(); - assert_eq!( - warm.data_reads, 0, - "a warm repeat read must reuse the held handle (0 table opens)" - ); - assert_eq!(warm.manifest_reads, 0, "warm repeat read: 0 manifest opens"); - assert_eq!( - warm.commit_graph_reads, 0, - "warm repeat read: 0 commit-graph opens" - ); - assert_eq!( - warm.version_probes, 1, - "warm repeat read: exactly one version probe" - ); -} - -/// A write advances the table's version, so the next read misses the -/// version-keyed cache and re-opens -- never serving a stale handle (invariant 6 -/// for the cached path). Passes with or without the cache; a correctness guard -/// that the cache cannot serve pre-write data. -#[tokio::test] -async fn write_invalidates_table_cache_for_changed_table() { - let dir = tempfile::tempdir().unwrap(); - let mut db = init_and_load(&dir).await; - - let before = count_rows(&db, "node:Person").await; - - // Warm the cache for Person. - db.query( - ReadTarget::branch("main"), - TEST_QUERIES, - "total_people", - &params(&[]), - ) - .await - .unwrap(); - - // Write Person: its version advances, so the cached (table, branch, version) - // key is now superseded. - mutate_main( - &mut db, - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "cache_miss_one")], &[("$age", 50)]), - ) - .await - .unwrap(); - - // The next read re-opens Person at the new version (cache miss). 
- let (out, io) = measure(db.query( - ReadTarget::branch("main"), - TEST_QUERIES, - "total_people", - &params(&[]), - )) - .await; - out.unwrap(); - assert!( - io.data_reads > 0, - "a read after a write to the table must re-open it (version-keyed miss)" - ); - - let after = count_rows(&db, "node:Person").await; - assert_eq!( - after, - before + 1, - "the post-write read observes the new row (no stale handle served)" - ); -} diff --git a/crates/omnigraph/tests/write_cost.rs b/crates/omnigraph/tests/write_cost.rs deleted file mode 100644 index 6cbf763..0000000 --- a/crates/omnigraph/tests/write_cost.rs +++ /dev/null @@ -1,273 +0,0 @@ -//! Cost-budget tests for the WRITE path (RFC-013 step 1) -- the safety/latency -//! twin of `warm_read_cost.rs`, on the shared `helpers::cost` harness. A -//! committing write's per-table opens and internal-table scans must be bounded -//! and **flat across commit-history depth**, measured at the object-store -//! boundary. Guards invariant 15 (cost bounded by work, not history) on writes. -//! -//! **Backend split (see docs/dev/testing.md / RFC-013).** This file runs on -//! **local FS** and gates the **internal-table** term (`__manifest`/`_graph_commits` -//! fragment scans, ~+18/depth -- O(fragments) on any backend, step 2's target). -//! -//! The **data-table opener** term (step 3a's win) is a per-object-store-RPC -//! phenomenon and is NOT gated here: local-FS latest-resolution is cheap whether -//! the open goes through the namespace builder or direct-by-URI, so the -//! namespace→direct switch is invisible on local. Measured: the local data-table -//! read count grows with depth too (~+0.9/depth), but that is a *different* term -- -//! the merge-insert/RI scan reading O(depth) **fragments**, unchanged by the -//! opener switch (depth-100 = 92 ops both before and after step 3a, same slope) -//! and reduced only by compaction. The opener term shows up only on a real object -//! 
store (per-version GETs, ~+12/depth β†’ flat after step 3a), so it is gated in -//! `write_cost_s3.rs` (bucket-gated). Same `measure`/`IoCounts` harness, different -//! backend; each term gated where it actually manifests. -#![recursion_limit = "512"] - -mod helpers; - -use helpers::cost::{ - IoCounts, assert_flat, assert_grows, local_graph, measure, measure_insert, measure_insert_as, - measure_with_staged, -}; -use helpers::{MUTATION_QUERIES, commit_many, commit_many_as, init_and_load, mixed_params}; - -// ── (A) The internal-table LOCK β€” the acceptance test for step 2 (compaction) ── -// -// `__manifest` / `_graph_commits` / `_graph_commit_actors` scans on a write must be -// O(1) in commit-history depth **on a compacted graph**. Without internal-table -// compaction these scans are O(fragments) and grow forever; step 2 brings all three -// internal tables into `db.optimize()`, so after compaction the per-write scan is -// flat. The test runs the **authenticated (actorful) write path** β€” every commit -// carries an actor, so it grows `_graph_commit_actors.lance` too (the production -// server/CLI path); the commit-graph IO wrapper covers both that and `_graph_commits`, -// so `commit_graph_reads` includes the actor-table scan. It compacts at each depth -// checkpoint before measuring β€” pinning the production invariant "a periodically- -// compacted graph's write cost does not grow with version history." -#[tokio::test] -async fn internal_table_scans_are_flat_in_history() { - const ACTOR: &str = "act-cost-gate"; - let dir = tempfile::tempdir().unwrap(); - let mut db = local_graph(&dir).await; - - let mut curve: Vec<(u64, IoCounts)> = Vec::new(); - let mut current = 0u64; - for d in [10u64, 100] { - if d > current { - commit_many_as(&mut db, (d - current) as usize, ACTOR).await; - current = d; - } - // Step 2: compaction folds all three internal tables' O(depth) fragments back - // to a small constant, so the following write's scan of them is flat. 
- db.optimize().await.unwrap(); - let io = measure_insert_as(&mut db, &format!("lock_{d}"), ACTOR).await; - current += 1; // the measured write advanced depth by one - eprintln!( - "depth~{d}: data={} __manifest={} _graph_commits+actors={}", - io.data_reads, io.manifest_reads, io.commit_graph_reads - ); - curve.push((d, io)); - } - - assert_flat(&curve, |c| c.manifest_reads, 4, "__manifest scan"); - // commit_graph_reads covers BOTH _graph_commits and _graph_commit_actors (shared - // wrapper), so this also gates the actor table on the authenticated path. - assert_flat(&curve, |c| c.commit_graph_reads, 4, "_graph_commits + _graph_commit_actors scan"); -} - -// The data-table OPENER history-gate (opener flat across depth) lives in -// `write_cost_s3.rs` β€” its history-dependence is an S3-only phenomenon. But the -// *probe that isolates* the opener (the `PrefixCounter` split) is validated here, -// every-PR, on local FS: - -/// Proves the `PrefixCounter` opener/scan split: a committing write's data-table -/// reads divide into a **flat opener** term and a **growing scan** term. This pins -/// (a) the classifier actually attributes reads to the opener bucket (non-zero, so a -/// flat assertion isn't vacuously flat-at-zero), and (b) the local data-table growth -/// is the merge-insert/RI fragment scan, not the opener β€” which is *why* the S3 -/// gate asserts `data_opener_reads`, not total `data_reads`. (On local FS the opener -/// is O(1) regardless of step 3a; the opener's history-dependence is gated on S3.) 
-#[tokio::test] -async fn data_table_reads_split_into_flat_opener_and_growing_scan() { - let dir = tempfile::tempdir().unwrap(); - let mut db = local_graph(&dir).await; - - let mut curve: Vec<(u64, IoCounts)> = Vec::new(); - let mut current = 0u64; - for d in [10u64, 100] { - if d > current { - commit_many(&mut db, (d - current) as usize).await; - current = d; - } - let io = measure_insert(&mut db, &format!("split_{d}")).await; - current += 1; - eprintln!( - "depth~{d}: opener={} scan={} data_total={}", - io.data_opener_reads, io.data_scan_reads, io.data_reads - ); - curve.push((d, io)); - } - - assert!( - curve[0].1.data_opener_reads > 0, - "opener reads must be > 0 β€” the classifier missed version-resolution reads, \ - so a flat opener assertion would be vacuous" - ); - assert_flat(&curve, |c| c.data_opener_reads, 4, "local data-table opener"); - assert_grows(&curve, |c| c.data_scan_reads, 20, "local data-table scan"); -} - -// ── (B) Green-today regression guards β€” run on every PR ── - -/// A single insert's *data-table* write cost is O(1): the table commit is a small -/// constant number of writes, independent of history. -#[tokio::test] -async fn single_insert_data_write_is_bounded() { - let dir = tempfile::tempdir().unwrap(); - let mut db = local_graph(&dir).await; - commit_many(&mut db, 5).await; - let io = measure_insert(&mut db, "w").await; - eprintln!("single insert: data_writes={}", io.data_writes); - assert!(io.data_writes <= 4, "data-table write_iops should be a small constant, got {}", io.data_writes); -} - -/// At a fixed shallow depth, the per-write object-store read count is below a -/// documented ceiling. Fails the moment a change *adds* a round-trip on the write -/// path β€” the "no new round-trip" guard. 
-/// -/// Two folds keep the count low: RFC-013 Phase 7 put the `graph_commit` + -/// `graph_head` rows in the same publish merge-insert (no extra `__manifest` -/// write/scan per commit), and RFC-013 P2 collapsed the publish path's FOUR -/// `__manifest` scans (table locations + version entries + tombstones + a -/// separate `read_graph_lineage` for the parent) into ONE β€” the -/// `manifest_reads` sub-ceiling below would trip if any of those scans crept -/// back. Calibrated at depth ~5: ~26 `__manifest` reads / ~36 total after the -/// P2 fold (was ~44 / ~54 with the four separate scans). -#[tokio::test] -async fn write_op_count_ceiling_at_shallow_depth() { - let dir = tempfile::tempdir().unwrap(); - let mut db = local_graph(&dir).await; - commit_many(&mut db, 5).await; - let io = measure_insert(&mut db, "ceil").await; - eprintln!( - "depth~5: data={} __manifest={} _graph_commits={} total_reads={}", - io.data_reads, io.manifest_reads, io.commit_graph_reads, io.total_reads() - ); - // Sub-ceiling on `__manifest` reads specifically: the publish path does one - // scan, not four. ~26 measured at this depth; a re-added scan would push it - // well past this. (Deterministic on local FS.) - const MANIFEST_CEILING: u64 = 34; - assert!( - io.manifest_reads <= MANIFEST_CEILING, - "per-write __manifest reads {} exceeded ceiling {MANIFEST_CEILING} β€” a publish-path \ - scan was re-added (RFC-013 P2 folds them into one)", - io.manifest_reads, - ); - const CEILING: u64 = 80; - assert!( - io.total_reads() <= CEILING, - "per-write read ops {} exceeded ceiling {CEILING} β€” a new round-trip was added", - io.total_reads() - ); -} - -// ── (C) Fitness assert via the staged-write probes ── - -/// A keyed `Person` insert routes through `stage_merge_insert` exactly once, does -/// no `stage_append`, and no inline vector-index build. Pins the structural shape. 
-#[tokio::test] -async fn keyed_insert_routes_through_merge_insert_only() { - let dir = tempfile::tempdir().unwrap(); - let mut db = local_graph(&dir).await; - let (res, _io, staged) = measure_with_staged(db.mutate( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "fit")], &[("$age", 30)]), - )) - .await; - res.unwrap(); - assert_eq!(staged.stage_merge_insert, 1, "keyed Person insert stages one merge-insert"); - assert_eq!(staged.stage_append, 0, "keyed insert must not stage_append"); - assert_eq!(staged.create_vector_index, 0, "no inline vector-index build on a plain insert"); -} - -// ── (D) Step-3b capture-once fitness asserts (RED today → GREEN after WriteTxn) ── - -/// A write must validate the schema contract EXACTLY ONCE (3 `read_text` + 2 `exists`). -/// Today the write path re-validates at every resolve point (entry, per-table -/// `resolved_branch_target`, commit-time `fresh_snapshot_for_branch`), so the delta is -/// a multiple of that. Step 3b's `WriteTxn` validates once and threads it. The shape is -/// the write twin of `warm_read_cost.rs::warm_query_validates_schema_contract_once`, -/// built with ZERO production change via the counting storage adapter. 
-#[tokio::test] -async fn write_validates_schema_contract_once() { - use omnigraph::instrumentation::CountingStorageAdapter; - use omnigraph::storage::storage_for_uri; - - let dir = tempfile::tempdir().unwrap(); - let _ = init_and_load(&dir).await; - let uri = dir.path().to_str().unwrap(); - let (adapter, counts) = CountingStorageAdapter::new(storage_for_uri(uri).unwrap()); - let db = omnigraph::db::Omnigraph::open_with_storage(uri, adapter) - .await - .unwrap(); - - let before_read_text = counts.read_text(); - let before_exists = counts.exists(); - db.mutate( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "schema_once")], &[("$age", 30)]), - ) - .await - .unwrap(); - - let read_text_delta = counts.read_text() - before_read_text; - let exists_delta = counts.exists() - before_exists; - eprintln!("schema-contract reads on one write: read_text={read_text_delta} exists={exists_delta}"); - assert_eq!( - read_text_delta, 3, - "a write must validate the schema contract once (3 reads), not N times", - ); - assert_eq!( - exists_delta, 2, - "a write must probe contract-file existence once (2 probes), not N times", - ); -} - -/// A keyed single-table write must open its DATA table AT MOST ONCE. Today it opens -/// ~4Γ— (accumulation, staging, commit drift-guard, publish-prepare/index-build), each -/// a fresh cold `Dataset::open`. Step 3b opens the base once (a *session-aware* base -/// open is deferred to step 5), threads the commit-return handle, and replaces the -/// drift-guard open with a cheap `latest_version_id` probe β€” collapsing to 1 open. -/// Counted by `data_open_count`, the -/// table-class-scoped chokepoint probe: the internal-table opens (publisher CAS + -/// commit-graph append) are EXCLUDED, since they are unrelated to data-table reuse and -/// would otherwise keep this count >1 regardless of threading. 
(`forbidden_apis` keeps -/// engine code outside the storage layer from opening datasets except through the -/// instrumented chokepoints β€” `table_store.rs`'s own direct opens are branch-management -/// ops, not this keyed-write path.) -#[tokio::test] -async fn keyed_insert_opens_table_at_most_once() { - let dir = tempfile::tempdir().unwrap(); - let mut db = local_graph(&dir).await; - let io = { - let (res, io) = measure(db.mutate( - "main", - MUTATION_QUERIES, - "insert_person", - &mixed_params(&[("$name", "opens")], &[("$age", 30)]), - )) - .await; - res.unwrap(); - io - }; - eprintln!( - "data_open_count={} internal_open_count={} for a single-table keyed insert", - io.data_open_count, io.internal_open_count - ); - assert!( - io.data_open_count <= 1, - "a keyed single-table write must open its data table at most once, got {}", - io.data_open_count, - ); -} diff --git a/crates/omnigraph/tests/write_cost_s3.rs b/crates/omnigraph/tests/write_cost_s3.rs deleted file mode 100644 index d8ffd4f..0000000 --- a/crates/omnigraph/tests/write_cost_s3.rs +++ /dev/null @@ -1,71 +0,0 @@ -//! S3 (object-store) cost-budget gate for the WRITE path β€” the bucket-gated twin of -//! `write_cost.rs` that proves RFC-013 **step 3a's data-table opener win**. On the -//! shared `helpers::cost` harness (`measure`/`IoCounts`/`assert_flat`/`s3_graph`). -//! -//! The opener term is an **object-store-RPC phenomenon**: latest-version resolution -//! costs per-version GETs/HEADs on S3 (O(depth) before step 3a, when writes routed -//! through the lance-namespace builder), which local FS cannot reproduce (one cheap -//! `read_dir` regardless). After step 3a (direct-by-URI opens), the per-write -//! **data-table read count is FLAT across commit-history depth** β€” the measured 70% -//! win. This file is the redβ†’green acceptance for that term (it would be RED on the -//! pre-3a `from_namespace` opener); `write_cost.rs` gates the internal-table term on -//! local every-PR. -//! -//! 
**Isolating the opener (important):** total `data_reads` is not opener-only β€” the -//! same wrapped `Dataset` backs the merge-insert/RI **scan**, which reads -//! O(fragment-count) and grows with history for a *different* reason (compaction's -//! domain, not the opener; this is the term that made the *local* data-table count -//! grow). The shared harness's `PrefixCounter` attributes each read by object-key -//! prefix, so this gate asserts `data_opener_reads` (reads of `_versions/`/`.manifest`) -//! **directly** β€” no compaction or fixture massaging needed. After step 3a the opener -//! is O(1) regardless of version-history depth; before it grew ~+12/depth (RFC Β§2.4 -//! [M]). (See `write_cost.rs` for the local test that proves the split itself β€” -//! opener flat, scan growing.) -//! -//! Skips gracefully without `OMNIGRAPH_S3_TEST_BUCKET` (the `tests/s3_storage.rs` -//! pattern); runs for real in the rustfs CI job (`.github/workflows/ci.yml`). -#![recursion_limit = "512"] - -mod helpers; - -use helpers::cost::{IoCounts, assert_flat, measure_insert, s3_graph}; -use helpers::commit_many; - -/// After step 3a the data-table opener term is flat across depth on a real object -/// store (the measured win). RED on the pre-3a namespace-builder opener (O(depth) -/// per-version resolution). 
-#[tokio::test] -async fn data_table_opener_is_flat_in_history_on_s3() { - let Some(mut db) = s3_graph("write-cost-opener").await else { - eprintln!( - "SKIP data_table_opener_is_flat_in_history_on_s3: OMNIGRAPH_S3_TEST_BUCKET \ - unset (or store unreachable) β€” the S3 opener gate needs an object store" - ); - return; - }; - - let mut curve: Vec<(u64, IoCounts)> = Vec::new(); - let mut current = 0u64; - for d in [10u64, 50] { - if d > current { - commit_many(&mut db, (d - current) as usize).await; - current = d; - } - let io = measure_insert(&mut db, &format!("s3_{d}")).await; - current += 1; - eprintln!( - "depth~{d}: opener={} scan={} data_total={} __manifest={} _graph_commits={}", - io.data_opener_reads, - io.data_scan_reads, - io.data_reads, - io.manifest_reads, - io.commit_graph_reads - ); - curve.push((d, io)); - } - - // The opener (latest-version resolution) is O(1) after step 3a (direct-by-URI), - // isolated from the scan by the PrefixCounter. Slack absorbs object-store variance; - // the pre-3a builder grew this ~+12/depth (RFC Β§2.4 [M]). - assert_flat(&curve, |c| c.data_opener_reads, 8, "S3 data-table opener"); -} diff --git a/docker/entrypoint.sh b/docker/entrypoint.sh index 79d7de7..83b7d34 100644 --- a/docker/entrypoint.sh +++ b/docker/entrypoint.sh @@ -9,30 +9,8 @@ fi bind="${OMNIGRAPH_BIND:-0.0.0.0:8080}" -# Cluster mode first, and exclusive (the server's mode-inference rule 0): -# a deployment serves from cluster state XOR omnigraph.yaml, never a merge. -# Fail fast here with the same contract the server enforces. 
-if [ -n "${OMNIGRAPH_CLUSTER:-}" ]; then - if [ -n "${OMNIGRAPH_TARGET_URI:-}" ] || [ -n "${OMNIGRAPH_CONFIG:-}" ] || [ -n "${OMNIGRAPH_TARGET:-}" ]; then - echo "OMNIGRAPH_CLUSTER is an exclusive boot source; unset OMNIGRAPH_TARGET_URI/OMNIGRAPH_CONFIG/OMNIGRAPH_TARGET" >&2 - exit 64 - fi - set -- --cluster "${OMNIGRAPH_CLUSTER}" --bind "${bind}" - case "${OMNIGRAPH_REQUIRE_ALL_GRAPHS:-}" in - ""|0|false|FALSE) ;; - *) set -- "$@" --require-all-graphs ;; - esac - exec "$SERVER_BIN" "$@" -fi - -# URI comes from the env var (the positional arg wins over any config -# `graphs` block in resolve_target_uri). OMNIGRAPH_CONFIG, when also set, -# is forwarded as --config purely to supply a policy file β€” the two -# compose. Without OMNIGRAPH_CONFIG the behavior is unchanged. if [ -n "${OMNIGRAPH_TARGET_URI:-}" ]; then - exec "$SERVER_BIN" "${OMNIGRAPH_TARGET_URI}" \ - ${OMNIGRAPH_CONFIG:+--config "$OMNIGRAPH_CONFIG"} \ - --bind "${bind}" + exec "$SERVER_BIN" "${OMNIGRAPH_TARGET_URI}" --bind "${bind}" fi if [ -n "${OMNIGRAPH_CONFIG:-}" ]; then @@ -44,17 +22,11 @@ fi cat >&2 <<'EOF' omnigraph-server container startup requires one of: - - OMNIGRAPH_CLUSTER (serve a cluster directory's applied revision; - exclusive β€” cannot combine with the others) - OMNIGRAPH_TARGET_URI - OMNIGRAPH_CONFIG Optional: - OMNIGRAPH_BIND (default: 0.0.0.0:8080) - - OMNIGRAPH_REQUIRE_ALL_GRAPHS (cluster mode: fail startup unless every - applied graph is healthy) - OMNIGRAPH_TARGET (used with OMNIGRAPH_CONFIG) - - OMNIGRAPH_CONFIG (may also accompany OMNIGRAPH_TARGET_URI to add a - policy file; the URI still comes from OMNIGRAPH_TARGET_URI) EOF exit 64 diff --git a/docker/entrypoint_test.sh b/docker/entrypoint_test.sh deleted file mode 100755 index 3ee668f..0000000 --- a/docker/entrypoint_test.sh +++ /dev/null @@ -1,85 +0,0 @@ -#!/bin/sh -# Self-contained test for docker/entrypoint.sh argument composition. 
-# Runs the entrypoint against a stub server that echoes its args, and -# asserts the forwarded argv for each startup mode. No Docker required. -# -# sh docker/entrypoint_test.sh -# -# Exits 0 on success, 1 on the first mismatch. -set -eu - -here=$(CDPATH= cd -- "$(dirname -- "$0")" && pwd) -entrypoint="$here/entrypoint.sh" - -work=$(mktemp -d) -trap 'rm -rf "$work"' EXIT -mkdir -p "$work/bin" -cat > "$work/bin/omnigraph-server" <<'EOF' -#!/bin/sh -echo "ARGS: $*" -EOF -chmod +x "$work/bin/omnigraph-server" - -# Run the real entrypoint with SERVER_BIN pointed at the stub. -ep="$work/entrypoint.sh" -sed "s#SERVER_BIN=\"/usr/local/bin/omnigraph-server\"#SERVER_BIN=\"$work/bin/omnigraph-server\"#" \ - "$entrypoint" > "$ep" - -fail=0 -check() { - desc=$1; want=$2; got=$3 - if [ "$got" != "$want" ]; then - echo "FAIL: $desc" - echo " want: $want" - echo " got: $got" - fail=1 - else - echo "ok: $desc" - fi -} - -got=$(OMNIGRAPH_TARGET_URI="s3://b/g" OMNIGRAPH_BIND="0.0.0.0:8080" sh "$ep") -check "TARGET_URI only (legacy)" \ - "ARGS: s3://b/g --bind 0.0.0.0:8080" "$got" - -got=$(OMNIGRAPH_TARGET_URI="s3://b/g" OMNIGRAPH_CONFIG="/etc/omnigraph/omnigraph.yaml" OMNIGRAPH_BIND="0.0.0.0:8080" sh "$ep") -check "TARGET_URI + CONFIG composes (policy)" \ - "ARGS: s3://b/g --config /etc/omnigraph/omnigraph.yaml --bind 0.0.0.0:8080" "$got" - -got=$(OMNIGRAPH_CONFIG="/etc/omnigraph/omnigraph.yaml" OMNIGRAPH_BIND="0.0.0.0:8080" sh "$ep") -check "CONFIG only" \ - "ARGS: --config /etc/omnigraph/omnigraph.yaml --bind 0.0.0.0:8080" "$got" - -got=$(OMNIGRAPH_CONFIG="/etc/omnigraph/omnigraph.yaml" OMNIGRAPH_TARGET="active" OMNIGRAPH_BIND="0.0.0.0:8080" sh "$ep") -check "CONFIG + TARGET" \ - "ARGS: --config /etc/omnigraph/omnigraph.yaml --target active --bind 0.0.0.0:8080" "$got" - -got=$(sh "$ep" some-uri --bind 1.2.3.4:9 --extra) -check "explicit args passthrough" \ - "ARGS: some-uri --bind 1.2.3.4:9 --extra" "$got" - -got=$(OMNIGRAPH_CLUSTER="/var/lib/omnigraph/company-brain" 
OMNIGRAPH_BIND="0.0.0.0:8080" sh "$ep") -check "CLUSTER only (Phase 5 mode switch)" \ - "ARGS: --cluster /var/lib/omnigraph/company-brain --bind 0.0.0.0:8080" "$got" - -# Exclusivity: OMNIGRAPH_CLUSTER refuses every combination, exit 64. -for combo in "OMNIGRAPH_TARGET_URI=s3://b/g" "OMNIGRAPH_CONFIG=/etc/o.yaml" "OMNIGRAPH_TARGET=active"; do - if out=$(env "$combo" OMNIGRAPH_CLUSTER="/data/cluster" sh "$ep" 2>&1); then - echo "FAIL: CLUSTER + ${combo%%=*} unexpectedly succeeded: $out" - fail=1 - else - status=$? - if [ "$status" -ne 64 ]; then - echo "FAIL: CLUSTER + ${combo%%=*} exited $status, want 64" - fail=1 - else - echo "ok: CLUSTER + ${combo%%=*} refused (64)" - fi - fi -done - -if [ "$fail" -ne 0 ]; then - echo "entrypoint_test: FAILED" - exit 1 -fi -echo "entrypoint_test: all cases passed" diff --git a/docs/dev/architecture.md b/docs/dev/architecture.md index 972157b..8b7fca2 100644 --- a/docs/dev/architecture.md +++ b/docs/dev/architecture.md @@ -1,6 +1,6 @@ # Architecture -OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a per-operator `~/.omnigraph/config.yaml` plus team-owned cluster directories. +OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a single `omnigraph.yaml`. ## Reading guide @@ -10,7 +10,7 @@ Three views, increasing zoom: 2. **Layer view** β€” the eight-layer stack inside one OmniGraph process. 3. **Component zoom-ins** β€” what's inside each layer. -For runtime flows (read query, mutation), see [`docs/dev/execution.md`](execution.md). 
For the on-disk layout of a graph, see [`docs/user/storage.md`](../user/concepts/storage.md). +For runtime flows (read query, mutation), see [`docs/dev/execution.md`](execution.md). For the on-disk layout of a graph, see [`docs/user/storage.md`](../user/storage.md). L1 (orange in the diagrams) is what we inherit from Lance; L2 (blue) is what OmniGraph adds. The L1/L2 framing is also called out in prose at the bottom of this doc. @@ -133,7 +133,7 @@ flowchart TB subgraph state[graph state] coord[GraphCoordinator]:::l2 mr[ManifestCoordinator
db/manifest.rs]:::l2 - cg[CommitGraph
projection of __manifest graph_commit/graph_head rows]:::l2 + cg[CommitGraph
_graph_commits.lance]:::l2 stg[MutationStaging
per-query in-memory accumulator
exec/staging.rs]:::l2 end @@ -186,8 +186,7 @@ op-2 (insert/update) β†’ read committed via Lance + pending via DataFusion op-N β†’ push batch ─── end of query ─────────────────────────────────────── finalize: per pending table: - concat batches β†’ stage_append OR stage_merge_insert OR stage_overwrite - β†’ commit_staged + concat batches β†’ stage_append OR stage_merge_insert β†’ commit_staged publisher: ManifestBatchPublisher::publish (one cross-table CAS) ``` @@ -198,10 +197,9 @@ contracts: - `Dβ‚‚` parse-time rule: a query is either insert/update-only or delete-only. Mixed β†’ reject. Deletes still inline-commit (Lance 4.0.0 has no public two-phase delete); Dβ‚‚ keeps the inline path safe. -- `LoadMode::Overwrite` uses Lance `Operation::Overwrite` through the - same staged path. Loader validation runs against the replacement - in-memory batches before any `commit_staged`, and the publish window is - covered by `SidecarKind::Load` recovery. +- `LoadMode::Overwrite` keeps the inline-commit path + (truncate-then-append doesn't fit the staged shape; overwrite has no + in-flight read-your-writes requirement). - Read sites consume `TableStore::scan_with_pending`, which Lance-scans the committed snapshot at the captured `expected_version` and unions with a DataFusion `MemTable` over the pending batches. @@ -209,7 +207,7 @@ contracts: This pattern realizes read-your-writes within a multi-statement mutation and keeps failure scope bounded for inserts/updates by construction at the writer layer. See [docs/dev/invariants.md](invariants.md) and -[docs/dev/writes.md](writes.md) for the publisher CAS contract this builds on. +[docs/dev/runs.md](runs.md) for the publisher CAS contract this builds on. ### Storage trait β€” today vs. roadmap @@ -280,7 +278,7 @@ flowchart LR eng --> wq ``` -The server applies Cedar policy at the HTTP boundary today. The roadmap, called out in [docs/dev/invariants.md](invariants.md) as a known gap, is to push policy into the planner as predicates. 
After Cedar, mutating handlers go through `WorkloadController` (per-actor admission cap + byte budget; PR 2 / MR-686) before reaching the engine. The engine itself holds an `Arc` so concurrent mutations on the same `(table, branch)` serialize at the queue, while disjoint keys run in parallel — see [docs/user/server.md](../user/operations/server.md) "Per-actor admission control" and [docs/dev/writes.md](writes.md). The CLI bypasses the HTTP layer (and admission) and calls the engine API directly. +The server applies Cedar policy at the HTTP boundary today. The roadmap, called out in [docs/dev/invariants.md](invariants.md) as a known gap, is to push policy into the planner as predicates. After Cedar, mutating handlers go through `WorkloadController` (per-actor admission cap + byte budget; PR 2 / MR-686) before reaching the engine. The engine itself holds an `Arc` so concurrent mutations on the same `(table, branch)` serialize at the queue, while disjoint keys run in parallel — see [docs/user/server.md](../user/server.md) "Per-actor admission control" and [docs/dev/runs.md](runs.md). The CLI bypasses the HTTP layer (and admission) and calls the engine API directly. Code paths: diff --git a/docs/dev/branch-protection.md b/docs/dev/branch-protection.md index d3a9f6b..9b2fa78 100644 --- a/docs/dev/branch-protection.md +++ b/docs/dev/branch-protection.md @@ -8,14 +8,15 @@ This page explains what the policy says and how to change it. | Setting | Value | Why | |---|---|---| -| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws` | Every PR must pass the AWS-feature build/test and AGENTS.md link integrity. **`Test Workspace` is deliberately NOT required** — it runs only on push to `main` (post-merge), tags, and manual `workflow_dispatch`, to keep PR turnaround fast (it was the ~15min+ slow gate).
It is therefore *not* listed here: a required check that never reports on PRs (the `test` job is `if: github.event_name != 'pull_request'`) would leave every PR permanently pending — the job-never-reports trap. The trade-off (a regression lands on `main` and is caught by the post-merge run, so `main` can briefly go red) and its mitigations are documented in [ci.md](ci.md). Each required context must equal a job `name:` that actually reports on PRs **verbatim** — a context naming a job that never reports leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. | -| **Required approving reviews** | `0` | No human-review gate. With a 2-person team where both maintainers own everything, requiring an approval meant every PR needed the *other* person (or an admin/bypass override) — friction with no real review value. CI checks are the gate; maintainers merge their own PRs once checks pass. Raise this to `1` if an outside-contributor flow ever needs a review gate. | -| **Require code-owner reviews** | `false` | CODEOWNERS was removed entirely (see the git history of `.github/`); there is no code-owner review requirement. | +| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test Workspace`, `Test omnigraph-server --features aws`, `CODEOWNERS / drift`, `CODEOWNERS / noedit` | Every PR must pass workspace tests, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. `strict: true` requires the branch to be up-to-date with `main` before merge. | +| **Required approving reviews** | `1` | At least one reviewer. With a 2-person team, going higher would block all merges when one person is unavailable. | +| **Require code-owner reviews** | `true` | The reviewer must be a code owner per `.github/CODEOWNERS`. This is what makes the codeowners chassis enforced.
| +| **Dismiss stale reviews on new commits** | `true` | A push after approval invalidates the prior review. Prevents the "approve, then sneak in unreviewed changes" pattern. | | **Require linear history** | `true` | No merge commits β€” squash or rebase only. Matches recent practice. | | **Disallow force pushes** | `true` | No history rewrites on `main`. | | **Disallow branch deletions** | `true` | `main` cannot be deleted. | | **Required conversation resolution** | `true` | All review comment threads must be resolved before merge. | -| **Enforce on admins** | `false` | Admins can override the gates (`enforce_admins: false` in the JSON). This is the intended escape hatch for the 2-person team; tightening to `true` is tracked under hardening below. | +| **Enforce on admins** | `true` | Even repository admins go through the gates. The point is no bypasses. | | **Required signed commits** | not yet | Not enabled. Would lock out maintainers until everyone enrolls GPG/SSH commit signing. Tracked as a follow-up. | ## How to apply @@ -56,7 +57,7 @@ Outputs the live policy. Compare against `.github/branch-protection.json` to det - **Audit trail**: `git log .github/branch-protection.json` shows every change with a reviewable diff and a merge commit. - **Disaster recovery**: if branch protection is accidentally removed or weakened via the UI, the JSON is the canonical recovery point. -- **Consistency**: repository policy lives in the repository, reviewed like code. +- **Consistency**: pairs with `.github/codeowners-roles.yml` (the CODEOWNERS source of truth). Repository policy lives in the repository. ## What this gates @@ -64,11 +65,11 @@ After branch protection is applied, every PR targeting `main` must: 1. Pass all listed status checks. 2. Be up-to-date with `main` (rebase or merge-from-main). -3. Have all review conversations resolved. -4. Be squash- or rebase-merged (no merge commits). +3. Have at least one approving review from a code owner for the touched paths. +4. 
Have all review conversations resolved. +5. Be squash- or rebase-merged (no merge commits). -No human approval is required (`required_approving_review_count: 0`). Repository -admins can override the gates (`enforce_admins: false`). +Even repository admins are subject to these rules. ## Subsequent hardening (not in this PR) @@ -76,7 +77,7 @@ The branch-protection policy is the foundation. Future hardening adds: - **Required signed commits** (`required_signatures: true`) β€” once maintainers enroll GPG/SSH signing. - **Tag protection** for `v*` tags via `repos/.../tags/protection`. -- **Required reviewers from specific teams** for high-leverage paths (e.g., `docs/dev/invariants.md`) via a GitHub ruleset's path-scoped required-review rule, if a review gate is ever reintroduced. +- **Required reviewers from specific teams** for high-leverage paths (e.g., `docs/dev/invariants.md`) via CODEOWNERS tier expansion + the N-unique-approvers CI workaround. - **More required CI checks**: `cargo deny`, `cargo audit`, `cargo fmt --check`, `cargo clippy -D warnings`, CodeQL, secret scanning, schema-lint (MR-946). See the hardening playbook for the full plan. diff --git a/docs/dev/bug-case-fix.md b/docs/dev/bug-case-fix.md deleted file mode 100644 index d5d596e..0000000 --- a/docs/dev/bug-case-fix.md +++ /dev/null @@ -1,217 +0,0 @@ -# Bug case study: camelCase property filters lowercased at runtime - -**Issue:** [#283](https://github.com/ModernRelay/omnigraph/issues/283) (mirrored -in the dev-graph as `iss-990`) -**Reported on:** 0.7.0 (release binary) -**Status of code:** present on `v0.7.0`; fixed on branch `fix/iss-283-camelcase-filter` (read pushdown + pending mutation scan) -**Severity:** correctness β€” a valid, lint-clean query fails at run time. - -## Symptom - -A read query that filters on a **camelCase** schema field lints and plans -cleanly but fails when it executes: - -```text -No field named reponame. Column names are case sensitive. 
-``` - -Minimal repro: - -```pg -node SourceDocument { - repoName: String @index -} -``` - -```gq -query find($repoName: String) { - match { $d: SourceDocument { repoName: $repoName } } - return { $d.repoName } -} -``` - -`omnigraph lint` passes; running the query errors. The operator workaround is to -rename the field to all-lowercase (`repo`), which is why this looked like a -schema-design quirk rather than an engine bug. - -## Root cause - -The filter-pushdown path builds the Lance scan predicate's column reference with -`datafusion::prelude::col(property)`: - -- **Site:** `crates/omnigraph/src/exec/query.rs` β€” `ir_expr_to_expr`: - ```rust - IRExpr::PropAccess { property, .. } => Some(col(property)), - ``` -- `col(&str)` runs DataFusion's SQL **identifier normalization** - (`Column::from_qualified_name` β†’ `parse_identifiers_normalized(.., false)`), - which **lowercases unquoted identifiers**. So `col("repoName")` resolves to a - column named `reponame`. -- Lance stores columns **case-preserved** (`repoName`) and resolves them - case-sensitively, so the scan can't find `reponame` and errors. - -The IR is not at fault: the parser and lowering preserve the original case -(`property: pm.prop_name.clone()`), which is exactly why the compiler resolves -`repoName` and **lint passes**. The case is destroyed only at the -engine β†’ Lance boundary. - -There is a **second** boundary with the same root cause but a *different* -parser: the pending-batch scan in `table_store.rs::scan_pending_batches` splices -the mutation predicate string into a DataFusion `SELECT … WHERE {filter}` over a -`MemTable`, and DataFusion's SQL parser lowercases the unquoted column the same -way (`repoName` β†’ `reponame`). See **Part 2** of the fix β€” it surfaces only on a -*chained* mutation that re-reads the pending side, which is why a single -update/delete on a camelCase predicate looked fine. 
- -### Why the rest of the engine is unaffected - -The two pushdown sites above were the offenders; the remaining paths already -treat column names case-sensitively and handle camelCase correctly: - -- **Projection / return** uses the real Arrow field name (`f.name()`). -- **In-memory filtering** (the fallback for non-pushable predicates) looks the - column up by the preserved property name against the batch schema. -- **The committed Lance mutation scan** (`Scanner::filter(&str)`) preserves an - unquoted identifier's case, so committed-row matching on a camelCase predicate - already worked. - -So the read bug surfaces for predicates that *are* pushed down (e.g. an equality -on a scalar camelCase column), and the mutation bug only for the pending-side -re-scan of a chained mutation. - -### Why it slipped through - -The `ir_filter_to_expr` unit tests only use the all-lowercase field `count`, so -no test exercised a camelCase property. Nothing in CI compared the emitted -column name against the schema's casing. - -## Fix - -There are **two** engineβ†’Lance boundaries that lose case, and they need -**different** fixes because the two consumers disagree on quoting semantics. - -### Part 1 β€” read pushdown (`exec/query.rs`, `ir_expr_to_expr`) - -Use DataFusion's case-preserving column constructor, `ident()`, instead of -`col()`: - -```rust -IRExpr::PropAccess { property, .. } => Some(datafusion::prelude::ident(property)), -``` - -`ident()` builds `Expr::Column(Column::new_unqualified(property))` with no SQL -parse and no normalization, so the case is preserved. Property references here -are always bare column names (the variable is dropped via `..`), so there is no -qualified-name (`a.b`) handling to lose. - -This is the right layer and the right shape: - -- It is a **no-op for the lowercase columns that work today** (`slug`, `id`, - `status`, …) β€” lowercasing those was already a no-op β€” so there is no - regression risk for the common case. 
-- It makes pushdown **consistent** with projection and in-memory filtering, - which already use case-preserved names. -- It also restores **index use** for camelCase columns: today such a filter - errors before the BTREE is even considered. - -### Part 2 β€” pending mutation scan (`table_store.rs`, `scan_pending_batches`) - -`update`/`delete` predicates lower through `predicate_to_sql(..)` into a single -**SQL string** (`format!("{} {} {}", column, op, value_sql)`). That one string -is consumed by **two** different parsers, and *they disagree on what quoting -means*: - -- The **committed** side passes the string to Lance's `Scanner::filter(&str)`. - Lance **preserves an unquoted identifier's case** (so unquoted camelCase - *already works* on the committed scan) but treats a double-quoted `"col"` as a - **string literal** β€” `"repoName" = 'acme'` parses as `'repoName' = 'acme'`, - a constant-false predicate that silently matches **zero** committed rows. -- The **pending** side splices the same string into a DataFusion - `SELECT … FROM pending WHERE {filter}` over a `MemTable`. DataFusion's SQL - parser **lowercases** an unquoted identifier (`repoName` β†’ `reponame`) and - fails to resolve against the case-sensitive `MemTable` schema. - -So no single quoting choice for the column satisfies both: quoting fixes the -pending side but breaks the committed side, and vice versa. The fix keeps the -predicate **unquoted** (what the committed Lance scan needs) and makes the -*pending* context case-preserving instead, by disabling SQL identifier -normalization on its `SessionContext`: - -```rust -let mut config = SessionConfig::new(); -config.options_mut().sql_parser.enable_ident_normalization = false; -let ctx = SessionContext::new_with_config(config); -``` - -`predicate_to_sql` itself never lowercased anything (it copies the preserved -property name), so its emitted string is unchanged β€” it gains only a comment -recording the unquoted contract. 
The projection list in the same function is -already double-quoted and is unaffected (quoted identifiers are case-preserved -under either normalization setting). - -Rejected alternatives: banning/normalizing camelCase at the compiler (a real -usability regression β€” camelCase fields are legitimate), lowercasing column -names in storage (a breaking on-disk change), merely making lint *warn* (a -band-aid that leaves the runtime broken), or **quoting the column in -`predicate_to_sql`** (empirically breaks 7 existing lowercase-column mutation -tests because Lance reads `"col"` as a string literal β€” see Part 2). - -## Scope and caveats - -- **Not Windows-specific.** The original report's environment was Windows, but - the cause is platform-independent. -- **The mutation path was only *partially* broken, and not where first - assumed.** The committed side of `scan_with_pending(..)` (Lance - `Scanner::filter(&str)`) and `delete`'s `delete_where(..)` / `Dataset::delete` - preserve an unquoted identifier's case, so a *single* `update`/`delete` on a - camelCase predicate already worked. Only the **pending** side β€” the in-memory - `MemTable` re-scan that a *chained* mutation hits β€” lowercased the column. - This was confirmed empirically: a single update+delete on `repoName` passes - unfixed; a chained update that re-reads the pending side fails with - `No field named reponame`. The fix is Part 2 above (disable identifier - normalization on the pending `SessionContext`), **not** quoting the column. - The eventual MR-A migration (`delete_where` β†’ Lance 7 - `DeleteBuilder::execute_uncommitted`, structured `Expr`) is the longer-term - shape but is out of scope here. -- **Check the coercion lookup.** Adjacent to the fix, the literal-coercion step - (`prop_data_type(.., schema)`, which keeps the BTREE usable) also resolves the - column by name. 
Confirm it uses the preserved name; if it mishandles case a - camelCase filter would resolve but lose its index β€” a silent perf regression, - not a crash. -- **Do not use `col(r#""repoName""#)` as the general read-path fix.** Quoting - would preserve this one name, but it routes through SQL identifier parsing and - changes qualified-name semantics. The IR property here is already a bare - column name, so `ident(property)` / `Column::new_unqualified(property)` is the - precise structured expression. -- **Do not "fix" the mutation string by quoting the column.** It is tempting to - reuse a `quote_ident` helper symmetric with `literal_to_sql`'s value escaping, - but the column quote-rules differ between the two consumers of the predicate - string: Lance's `Scanner::filter(&str)` reads `"col"` as a *string literal* - (silently matching nothing), while DataFusion's `ctx.sql` reads it as a - case-preserved identifier. Because the committed Lance scan already preserves - the *unquoted* identifier's case, the column must stay unquoted and the - pending DataFusion context must be told not to normalize β€” not the reverse. - -## Validation (test-first) - -1. **Red:** add an `ir_filter_to_expr` test asserting the emitted - `Expr::Column` name for a camelCase property is `repoName`, not `reponame`. - Fails on current code. -2. **Green:** apply the `col` β†’ `ident` change (Part 1) and the pending-context - `enable_ident_normalization = false` change (Part 2). -3. **End-to-end:** a camelCase `@index` field with - `match { T { camelField: $x } }` returns the row (the unit test alone can't - catch an engine↔Lance boundary regression). -4. **Mutation parity:** with the same camelCase field, cover: - - `update T where camelField == $x set otherField = ...` updates the intended - row. - - `delete T where camelField == $x` deletes the intended row and cascades as - expected. 
- - A chained update that hits the pending side of `scan_with_pending` still - works, so both the committed Lance scan and pending DataFusion `MemTable` - predicate paths are case-preserving. -5. **Index preservation:** keep or add a plan/trace assertion that the - camelCase `@index` equality predicate still reaches the scalar-index path. - A result-only test can pass while silently falling back to a full scan. -6. Run the full engine suite (`cargo test -p omnigraph-engine`) β€” in particular - the existing BTREE index-eligibility tests, which `ident()` must not disturb. diff --git a/docs/dev/ci.md b/docs/dev/ci.md index 6cc4e1f..8495d5e 100644 --- a/docs/dev/ci.md +++ b/docs/dev/ci.md @@ -3,12 +3,8 @@ `.github/workflows/`: - **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repository PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`). - - **`Test Workspace` does not run on pull requests.** The job is gated `if: github.event_name != 'pull_request'`, so the full workspace + failpoints suite runs only on push to `main` (post-merge), on `v*` tags, and on manual `workflow_dispatch`. This was a deliberate PR-latency trade-off β€” it was the slowest gate (~15min warm, up to the 75min cold ceiling). `RustFS S3 Integration` `needs: test`, so it is push-/dispatch-only for the same reason. The fast PR gates remain: `Classify Changes`, `Check AGENTS.md Links`, and `Test omnigraph-server --features aws`. `Test Workspace` is correspondingly **not** in the required-check list (`.github/branch-protection.json`); see [branch-protection.md](branch-protection.md). 
- - **Consequences to internalize:** (1) a regression that the suite would catch now lands on `main` and turns the post-merge run red, rather than being blocked pre-merge β€” `main` can briefly break, so run `cargo test --workspace --locked` locally before merging anything non-trivial, or trigger this workflow on your branch via the Actions "Run workflow" button. (2) `openapi.json` is no longer auto-regenerated on PRs (that step is inside the `test` job); for server/API changes, regenerate it locally with `OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi` and commit it, or the strict drift check fails the post-merge `main` run. - - **Applying this policy:** removing `Test Workspace` from the JSON is inert until an admin runs `./scripts/apply-branch-protection.sh`. **Run it immediately after this change merges** β€” until then GitHub still requires a `Test Workspace` context that no longer reports on PRs, which leaves every open PR permanently pending (the job-never-reports trap). - **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest. -- **Windows binary build job**: `cargo build --release --locked -p omnigraph-cli -p omnigraph-server` on windows-latest with smoke checks for `omnigraph.exe version`, `omnigraph-server.exe --help`, and PowerShell installer syntax. - **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_graph_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`. -- **release-edge.yml**: on every push to main, retags `edge`, builds Linux x86_64 / macOS arm64 archives and Windows x86_64 zip + sha256, publishes a rolling prerelease, then smoke-tests the Windows PowerShell installer against `edge`. 
-- **release.yml**: on `v*` tags, builds the Linux x86_64 / macOS arm64 archives and Windows x86_64 zip release matrix, updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by pushing the regenerated formula to `ModernRelay/homebrew-tap`, and smoke-tests the Windows PowerShell installer against the tag. +- **release-edge.yml**: on every push to main, retags `edge`, builds Linux x86_64 / macOS arm64 archives + sha256, publishes a rolling prerelease. +- **release.yml**: on `v*` tags, builds the Linux x86_64 / macOS arm64 release matrix and updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by pushing the regenerated formula to `ModernRelay/homebrew-tap`. - **package.yml**: manual ECR image build; emits two image tags per commit (``, `-aws`) via CodeBuild. diff --git a/docs/dev/cluster-axioms.md b/docs/dev/cluster-axioms.md deleted file mode 100644 index dddecf1..0000000 --- a/docs/dev/cluster-axioms.md +++ /dev/null @@ -1,106 +0,0 @@ -# Cluster Control-Plane Axioms - -**Type:** Standing design filter -**Status:** Draft / thinking-in-progress -**Date:** 2026-06-07 -**Relationship:** the distilled axioms behind [cluster-config-specs.md](cluster-config-specs.md). The downstream implementation inventory and blast-radius assessment live in [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md). The high-level spec is the argument; this is the checklist. Hold any config / control-plane / deployment proposal against these and cite them by number (e.g. "violates axiom 5"). - -This file is intentionally short and stable. The axioms are phrased so other -docs can reference "axiom 6" without churn. The motivating requirement comes -first; the core axioms are what the design is *based on*; the derived rules are -consequences that follow from them. 
- -> **Revision 2026-06-07 β€” committed to the Terraform paradigm.** State is now an -> **authoritative, locked ledger in a backend** (no longer framed as a -> "mostly-rebuildable projection"); `plan` is a **config ↔ state diff**; and -> **ETL pipelines** join schema as config-defined resources that trigger -> data-plane effects. Secrets live in a gitignored **`.env`** file (`${NAME}`), -> and **query exposure is a policy decision** (no registry `expose:` flag). -> Axioms **2, 5, 6** revised; **12, 13, 14** added. The earlier -> "state is just a rebuildable projection; config is the *only* truth" framing is -> superseded β€” see axiom 5. -> -> **Revision 2026-06-08 β€” JSON state first.** The baseline state backend is now -> Terraform-style JSON documents plus backend lock/CAS, not Lance control-plane -> datasets. Lance remains a possible later backend only if row-level history or -> queryability justifies the extra machinery. -> -> **Revision 2026-06-09 β€” single ownership during migration.** Axiom **15** -> added: while `omnigraph.yaml` and the cluster catalog coexist, every fact has -> exactly one owner at a time β€” coexistence is a **mode switch, never a merge**. -> `omnigraph.yaml` does not get replaced; its job description shrinks to the -> permanent per-operator layer. - ---- - -## Tenet 0 β€” the motivating requirement - -**0. The Sarah/Bob test.** If one operator changes schema / queries / policies / UI / pipelines / aliases, another operator (or their agent) must learn *what the deployment is and what changed* from **one source, one history, one diff**. Fragmentation across separate mechanisms is the failure the whole design exists to eliminate. Every other axiom is in service of passing this test. - ---- - -## Core axioms (what the design is based on) - -**1. 
The cluster is the unit of declarative state.** Not the graph (policies, queries, UI, and pipelines cross-cut graphs; "which graphs exist" has no per-graph home), not the fleet (the next scope up β€” named and deferred). The cluster is what two operators collaborate over; a graph is a *resource within* it. - -**2. Two sources of truth, for two different questions β€” config for *intent*, state for *deployed reality*.** The version-controlled **config** (a set of files in one folder) is the source of truth for what the cluster *should be*. The **state ledger** is the source of truth for what *is* currently deployed. Change flows one way only: you edit config and `apply` converges the cluster (**code β†’ cluster**, never edit-the-cluster-and-call-it-intent). But "what exists right now" is read from **state**, not re-derived from the world on every command. `plan` is the diff between the two. - -**3. Declarative, not imperative.** You describe the desired end state; the reconciler computes the steps. No runtime mutation API that makes the running system the place *intent* lives. - -**4. As-code is structural, not stylistic (the recursion argument).** Code is the base case; modeling the definition *as data* (a meta-graph describing graphs) recurses with no base case. Config must live **outside** the running system so it is reviewable (PRs), reproducible (clone + apply), diffable as text, and editable by an agent β€” without the system having to describe itself. - - -**5. The Terraform model: config / state / reconcile β€” and state is an authoritative, locked ledger.** Config (as code) = desired truth. **State = the authoritative record of what has been applied**, held in a **backend** β€” the cluster's own object-store backend *or* a separate cloud store, the operator's choice, exactly like a Terraform backend. 
The baseline representation is JSON documents (`state.json`, status/approval/recovery JSON records) protected by backend lock/CAS, not Lance control-plane tables. State is **locked** during apply so two operators cannot converge concurrently. `validate` parses and schema-checks desired config; `plan` = `diff(config, state)` as a structured artifact with resource digests, dependency edges, state observations, proposed changes, blast radius, and approval gates; `apply` converges the cluster from an accepted fresh plan and **updates state**, and does not acknowledge success until state has recorded the result. A cluster-hosted JSON backend is still a separate state CAS step from graph Lance manifest moves; failures surface a repair/import condition instead of being described as cross-object all-or-nothing. A future Lance-backed state backend or cluster manifest publisher is optional and must earn its complexity by needing row-level queryability/history or tighter publish fencing. Because OmniGraph's running cluster is self-describing (manifests, commit logs), state is *reconstructable* by import/refresh if lost β€” its edge over opaque-cloud Terraform β€” but it is **treated as the source of truth for current reality, not casually regenerated**. The one slice that can never be reconstructed (who approved an irreversible apply) lives in the durable audit ledger; state references it (axiom 11). - -**6. The control plane reconciles definition, not data β€” across two data-plane seams.** Definition β€” schema, policies, queries, UI, bindings, aliases, ETL **pipelines**, embeddings config, and the set of graphs β€” is reconciled. Data β€” rows, edges, vectors β€” is data-plane content, versioned by the commit DAG and produced by `load` / `mutate` and **pipeline execution**, sitting **outside** the reconcile loop. 
Exactly two definition kinds *trigger* a data-plane effect without owning data: **schema** (a migration conforms existing rows; `plan` previews its impact) and **ETL pipelines** (their execution ingests external data). The loop converges their *definitions*; the data they produce is never what it reconciles. - -**7. Operated by agent (agent-as-controller).** An agent authors config changes and drives reconciliation as an authenticated actor, subject to policy and approval gates β€” no human state-management burden. This fuses Terraform's as-code config with Kubernetes' continuous reconciliation. - ---- - -## Derived rules (consequences of the axioms) - -**8. The reversibility gradient gates apply β€” including drift correction.** Irreversible / data-loss operations (drop a graph, hard-drop schema data, a pipeline that overwrites) and compatibility-narrowing migrations (for example, future validated enum narrowing) are gated; reversible ones (recolor a dashboard) are not. The gate is keyed to physics, not to who operates it, and a reconciler "just fixing drift" is never an exception. - -**9. Atomicity and referential integrity are plan-time, not runtime.** `ApplyGroup` is the atomicity unit; cross-resource references *force* grouping (mandatory, not opt-in); references use typed resource/provider addresses (`graph.knowledge`, `query.knowledge.find_experts`, `provider.source.github_org`) so the planner can reject wrong-kind or missing targets before apply β€” bare names in a kind-fixed field are accepted shorthand and normalized to the typed address (fix 2026-06-08), while a kind-ambiguous value (e.g. `source: github`) is rejected; a reference to a missing or being-removed resource is a fail-closed `plan` error, not a deferred runtime failure. - -**10. Secrets live in a `.env` file; connection/identity is per-operator.** The committed cluster config carries **no secret values** β€” only `${NAME}` references. 
The values (embedding API keys, pipeline **source credentials**, per-deployment settings) live in a separate **`.env` file** — which is gitignored and supplied per deployment, never committed. Separately, an operator's own connection (which cluster, which token) is the per-operator layer, distinct from both the shared config and its `.env` file. - -**11. Approvals and audit live in a durable ledger, not inline in state.** State *references* the audit record by id. In the baseline, that ledger is append-only JSON records in the state backend; a future Lance table is an implementation option, not a requirement. This keeps the bulk of state reconstructable and keeps approval facts — "who authorized this irreversible apply" — where loss is impossible. - -**12. State lives in a backend and is locked.** The state ledger is stored in a configurable backend — the cluster's own backend, or a separate cloud store — and `plan`/`apply` acquire a **state lock** first, so concurrent applies serialize instead of racing. (Generalizes the existing `__schema_apply_lock__` from schema scope to cluster scope.) The backend choice is part of the safety model: the first backend should be JSON plus object-store lock/CAS; any Lance-backed state backend needs its own RFC-level proof that the table semantics are worth the control-plane complexity. - -**13. Pipelines are definition; their execution is data-plane.** An ETL pipeline (external source → transform → target graph) is **declared in config and reconciled like any resource**; *running* it produces ordinary data-plane writes (`load`/`mutate`) outside the reconcile loop. `apply` converges the pipeline's *definition* (create / update / delete / schedule); the rows it ingests are never reconciled. 
A fan-out run over several graphs is statusful rather than magically atomic: each target records commit id, status, retryability, and idempotency key unless the pipeline explicitly uses a branch/merge protocol that can fence the whole target set. Source credentials are secret references (axiom 10). - - -**14. Exposure is a policy decision, not a config flag.** Target design: which stored queries (and the tools/dashboards built on them) an actor may **list or invoke** is decided by the policy layer (Cedar: `invoke_query` + catalog visibility), not by a per-query `expose:` boolean. The registry only says a query *exists* (name → file); **policy says who may see and run it**, so the MCP catalog (`GET /queries`) becomes each actor's policy-permitted set. This supersedes the engine's current `mcp.expose` flag only after per-query `invoke_query` scope and Cedar-filtered catalog listing land; until then, proposals must state the compatibility bridge to today's `mcp.expose` + coarse invocation gate. - -**15. Every fact has exactly one owner at a time; coexistence is a mode switch, never a merge.** `cluster.yaml` is not `omnigraph.yaml` v2 — the two documents end with disjoint jobs, and only the *shared-truth* parts of today's `omnigraph.yaml` (the set of graphs, stored-query registry, policy wiring, server boot source) migrate to the cluster catalog. The per-operator parts — connection/cluster selection, the operator's own credential reference, active graph/branch context, CLI ergonomics — are per-operator *by nature* (Sarah's and Bob's differ) and stay in the per-operator layer permanently; plan a **shrinking job description** for `omnigraph.yaml`, not an exit. During the migration window each fact is read from exactly one source at a time: a deployment serves from `omnigraph.yaml` **or** boots from cluster state (an exclusive mode switch), never from a precedence-merge of both. 
Two readers for one fact is the brittle-backcompat failure mode — it is the deny-list's "state that drifts from what it can be derived from" wearing a compatibility costume. Any compatibility bridge must name its replacement and its removal phase (the `mcp.expose` → policy-owned exposure bridge of axiom 14 is the template); bridges that accumulate without an exit are rejected at review. - ---- - -## The one-line compression - -**One cluster; config (a folder of files) is desired truth and a locked state ledger in a backend is deployed truth; `plan` diffs them, `apply` converges the cluster and updates state, an agent drives the loop — reconciling the cluster's *definition* (schema, policies, queries, UI, pipelines, …) and never its data — so any operator sees the whole system and its history from one place.** - ---- - -## How to use this file - -- **Reviewing a proposal:** walk axioms 0–15; any conflict is the burden of the proposer to justify. The most common tensions: - - Treating the *running system* as the source of truth for **intent** → axioms 2, 4 (intent lives in config). - - Treating state as a throwaway derivation rather than an authoritative, locked, backend-held ledger → axiom 5, 12. - - A runtime config-mutation API instead of declarative apply → axiom 3. - - "State" meaning a per-operator selection rather than the applied-cluster ledger → axiom 5. - - The control plane reconciling (or owning) data — including treating pipeline *rows* as reconciled state → axiom 6, 13. - - Treating fan-out pipeline execution as atomic without a branch/merge protocol or per-target status ledger → axiom 13. - - Per-graph or per-server scoping of cluster-level definition → axiom 1. - - Bare string references that force the planner to guess whether `knowledge` means a graph, query, provider, or path → axiom 9. 
- - A secret value (token, embedding key, pipeline source credential) inline in config instead of in the gitignored `.env` file → axiom 10. - - A per-query `expose:`/visibility flag in target-state cluster config instead of governing list/invoke in policy; or failing to account for today's `mcp.expose` compatibility bridge → axiom 14. - - Shipping `apply` before hermetic `validate` + read-only `plan` tests, or shipping graph/schema-moving apply before recovery tests for the graph/resource-moved-before-cluster-publish gap → axiom 5 and axiom 12. - - Reading one fact from both `omnigraph.yaml` and the cluster catalog with precedence rules (a merge instead of a mode switch), migrating per-operator concerns into shared cluster config, or adding a compatibility bridge with no named replacement and removal phase → axiom 15. -- **Citing:** reference axioms by number in PRs and review comments so the rationale is stable across renames and refactors. diff --git a/docs/dev/cluster-config-implementation-spec.md b/docs/dev/cluster-config-implementation-spec.md deleted file mode 100644 index b58e531..0000000 --- a/docs/dev/cluster-config-implementation-spec.md +++ /dev/null @@ -1,741 +0,0 @@ -# Cluster Config Implementation Spec And Blast Radius - -**Status:** Draft / implementation planning -**Type:** Downstream design spec -**Date:** 2026-06-08 -**Relationship:** companion to [cluster-config-specs.md](cluster-config-specs.md) -and [cluster-axioms.md](cluster-axioms.md). The high-level spec explains why -the cluster control plane should exist; this file names what must change -downstream and how large the blast radius is. - - - -## Executive Summary - -Overall blast radius: **very high**. - -This is not a small extension to `omnigraph.yaml`. The target design creates a -new shared cluster desired-state document, a locked state ledger, a cluster -manifest publisher, and a reconciler that coordinates resources above a single -graph. The existing config system remains useful, but its role changes: 
The existing config system remains useful, but its role changes: - -- `omnigraph.yaml` / global config remains the per-operator and startup bridge. -- `cluster.yaml` becomes shared desired state for a deployment. -- The cluster state ledger becomes the authoritative record of applied reality. -- Server/runtime surfaces eventually read from the cluster catalog instead of - only from process-start config. - -Safe rollout requires an additive path. Do not replace the current config, -server, or policy behavior in one step. - -## Current Surfaces Surveyed - -| Surface | Current behavior | Why it matters | -|---|---|---| -| `omnigraph-config::OmnigraphConfig` | Layered global/state/project config for CLI and server startup; strict `version: 1`; named maps replace wholesale | A cluster spec needs different ownership and merge semantics; do not stretch this type until it becomes ambiguous | -| `omnigraph-server::load_server_settings` | Opens either one selected graph or every configured embedded graph in multi mode | Cluster config changes startup, registry identity, and eventually runtime reconcile | -| `GraphRegistry` | Holds open graph handles; production registry is startup-only today; runtime insert is test-only | Cluster apply wants graph add/remove/reload as real control-plane operations | -| `omnigraph-queries::QueryRegistry` | Loads `.gq` files from `queries:` and honors `mcp.expose` for catalog listing | Target cluster config removes exposure from the registry and moves list/invoke to policy | -| `omnigraph-policy::PolicyAction` | Per-graph actions plus server-scoped `graph_list`; `invoke_query` is graph-scoped and coarse | Cluster plan/apply and per-query exposure need new policy scope without breaking coarse rules | -| Engine graph manifest | Graph-level atomic visibility via `__manifest`, expected table versions, and recovery sidecars | Cluster apply needs a higher-level publisher; Lance still commits per dataset | -| Schema apply | Existing plan/apply/lock 
shape for one graph; soft/hard drops already modeled | This is the prototype resource reconciler, but cluster apply cannot call it blindly and then claim cluster atomicity | -| Public docs/tests | Config, policy, server, and query behavior are already documented and tested | Every behavior change below has user docs and test fallout | - -## Compatibility Stance - - - -1. `cluster.yaml` is a new target-state file, not `omnigraph.yaml` v2. -2. Existing `omnigraph.yaml` keeps working for CLI, server boot, aliases, - graph locators, bearer-token env lookup, and the current stored-query - registry. -3. Initial cluster commands are explicit: `omnigraph cluster validate`, - `omnigraph cluster plan`, `omnigraph cluster apply`, `omnigraph cluster - status`, `omnigraph cluster refresh`, and `omnigraph cluster import`. -4. Cluster config is one shared folder, resolved from the command's cluster - root or explicit path. It is not merged from global + project + active - context layers. -5. The per-operator connection layer selects the cluster root and actor - identity. It is not committed into `cluster.yaml`. -6. `mcp.expose` remains supported in current `omnigraph.yaml` until the - per-query policy replacement ships. -7. **Single ownership (axiom 15).** While `omnigraph.yaml` and the cluster - catalog coexist, each fact is read from exactly one source at a time. - Phase 5 server boot is an exclusive mode switch — boot from cluster state - XOR from `omnigraph.yaml` — never a precedence-merge of both. No phase may - introduce a surface that reads the same fact (graph set, query registry, - policy wiring, bind address) from both sources with tie-break rules. -8. **`omnigraph.yaml` shrinks; it does not get deprecated.** Its terminal role - is the per-operator layer: connection/cluster selection, the operator's - credential reference, active graph/branch context, CLI ergonomics, and - purely personal aliases (target home: the operator's global config dir per - RFC-002). 
Shared-truth keys migrate to `cluster.yaml`; per-operator keys - never do. -9. **Bridges carry sunsets.** Every compatibility bridge names its replacement - and the phase that removes it (`mcp.expose` → Phase 6 policy-owned exposure - is the template). A bridge without an exit is a review-blocking finding. - -## Terraform-Aligned Schema Validation - - - -Every field in target-state `cluster.yaml` must be **honored or rejected**: - -- If a field is part of the declared resource schema, it must affect - validation, plan, apply, state, or status. -- If a field is misspelled, placed under the wrong resource kind, or reserved - for a future phase, `cluster validate` / `cluster plan` must fail with a - typed diagnostic. -- Compatibility warnings are allowed only in an explicit migration window for - old schema versions. They are not allowed in the target schema. -- Free-form extension areas must be named as such, for example `labels`, - `metadata`, `vars`, or `provider_options`; accidental unknown keys are never - treated as extension data. - -Examples: - -```yaml -graphs: - knowledge: - schema: ./knowledge.pg - lables: { team: platform } # invalid: typo, use `labels` - -pipelines: - github_sync: - source: { kind: github, token: ${GITHUB_TOKEN} } - into: - - { graph: engineering, map: ./github.map.yaml } - retry_magic: true # invalid unless `retry_magic` is in schema -``` - -```yaml -graphs: - knowledge: - schema: ./knowledge.pg - labels: { team: platform } # valid free-form metadata bucket - provider_options: - lance: - compaction_window: daily # valid only if this extension is declared -``` - -## Typed Resource And Provider Addresses - - - - -A locator is a typed address to another declared thing. **Internally — in plan and -state — every reference is a typed address** (axiom 9). 
At the config *surface* a -field may accept **bare shorthand when its schema fixes the referent kind** (a -policy `applies_to:` list is graph refs; a pipeline `into.graph` is a graph id) — -the parser normalizes it to the typed address before planning. A value whose -*kind* is ambiguous or wrong (a `source:` that could be a connector type, an -instance, or a provider) has no safe normalization and must be a typed -`provider.*` address or an explicit inline block. - -Target address forms: - -```text -graph. -schema. -query.. -policy. -ui.dashboard. -pipeline. -provider.storage. -provider.source. -provider.embedding. -``` - -Bad shape — the value's **kind is ambiguous or wrong**, not merely bare: - -```yaml -pipelines: - github_sync: - source: github # AMBIGUOUS kind: connector type, instance, or provider? - # → provider.source. or inline { kind: github, ... } -policies: - base_rbac: - applies_to: [query.knowledge.find_experts] # WRONG kind: a query address in a graph-ref field -``` - -OK shorthand (kind fixed by the field → normalized): - -```yaml -policies: - base_rbac: - applies_to: [knowledge, engineering] # bare names in a graph-ref field → graph.knowledge, graph.engineering -``` - -Target shape: - -```yaml -providers: - storage: - prod_graphs: - kind: s3 - bucket: company - prefix: prod - source: - github_org: - kind: github - token: ${GITHUB_TOKEN} - -graphs: - knowledge: - storage: provider.storage.prod_graphs - path: graphs/knowledge.omni - schema: ./knowledge.pg - engineering: - storage: provider.storage.prod_graphs - path: graphs/engineering.omni - schema: ./engineering.pg - -policies: - base_rbac: - file: ./base_rbac.policy.yaml - applies_to: - - graph.knowledge - - graph.engineering - -pipelines: - github_sync: - source: provider.source.github_org - into: - - { graph: graph.engineering, map: ./github_to_engineering.map.yaml } - - { graph: graph.knowledge, map: ./github_to_people.map.yaml } -``` - - - -Validation rules: - -- A field that expects 
a graph address accepts `graph.`, not - `query..` or an arbitrary string. -- A field that expects a query address accepts `query..`, and the - planner validates both the graph and the query symbol. -- A field that expects a source provider accepts `provider.source.`, not - `provider.storage.`. -- A field that expects storage accepts `provider.storage.` or an explicit - storage block, not a server URL or source connector. - -- A field whose schema **fixes the kind** accepts bare shorthand (e.g. `knowledge` - in a graph-ref field) and normalizes it to the typed address; a kind-ambiguous - or wrong-kind value is rejected with a typed diagnostic. -- Plan and state always store the **normalized typed address**, regardless of - whether the surface used shorthand. - -## Target Components - -Preferred split: - -| Component | Responsibility | Depends on | -|---|---|---| -| `omnigraph-cluster` crate | Cluster spec types, path resolution, resource graph, plan model, state backend traits, apply orchestration | `omnigraph-config` only for shared simple config types if needed; avoid server deps | -| `omnigraph` engine additions | Graph lifecycle primitives, schema-apply integration, recovery hooks for graph moves during cluster apply; optional future cluster manifest publisher if JSON state is not enough | Lance, existing graph manifest/recovery | -| `omnigraph-cli` | `cluster *` commands, plan rendering, approval collection, state lock UX | `omnigraph-cluster`, engine | -| `omnigraph-server` | Optional boot from cluster state, registry reload, status endpoints, policy-filtered query catalog | `omnigraph-cluster`, engine, policy | -| `omnigraph-policy` | Cluster/server actions, per-query list/invoke scope, approval policy predicates | none above server | -| `omnigraph-queries` | Registry without exposure side-channel; dependency metadata for downstream validation | compiler/config | -| `omnigraph-api-types` | New status/plan/apply response types if cluster HTTP endpoints ship | 
serde only | - -If the first implementation avoids a new crate, keep the same boundary in -modules. The important constraint is that cluster spec parsing must not drag -HTTP/server code into compiler or engine crates. - -## Resource Model - -Resource identity is stable and typed: - -```text -ClusterRoot -ResourceKey = // -ResourceAddress = . | .. -ProviderAddress = provider.. - -graph/cluster/knowledge -schema/graph:knowledge/main -query/graph:knowledge/find_experts -policy/cluster/base_rbac -ui/cluster/dashboard.overview -pipeline/cluster/github_sync -alias/cluster/experts -embedding/cluster/default -``` - - - -Resource records carry: - -| Field | Meaning | -|---|---| -| `kind` | Graph, Schema, Query, PolicyBundle, UiSpec, Binding, Alias, EmbeddingConfig, Pipeline | -| `scope` | Cluster or graph id | -| `name` | Stable resource name inside scope | -| `fingerprint` | Content hash of the normalized spec and all referenced files | -| `dependencies` | Resource keys this resource references | -| `observed` | Applied graph manifest version, policy digest, query digest, schedule id, etc. | -| `status` | `Pending`, `Planned`, `Applying`, `Applied`, `Drifted`, `Blocked`, `Error` | -| `conditions` | Typed details such as `ActualAppliedStatePending`, `NeedsApproval`, `DependencyMissing`, `PartialPipelineRun` | - -The planner builds a dependency graph from these records and uses it for both -validation and blast-radius reporting. 
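As a rough illustration of how the planner could use resource records for both validation and blast-radius reporting, the sketch below builds reverse dependency edges and walks them. The key shapes follow the examples in the Resource Model section; the specific edges, the `RESOURCES` table, and the function names `missing_dependencies` and `blast_radius` are hypothetical and are not taken from the OmniGraph codebase.

```python
from collections import defaultdict, deque

# Hypothetical resource records: resource key -> dependency keys.
# The edges are illustrative assumptions, not the real planner's output.
RESOURCES = {
    "graph/cluster/knowledge": [],
    "schema/graph:knowledge/main": ["graph/cluster/knowledge"],
    "query/graph:knowledge/find_experts": ["schema/graph:knowledge/main"],
    "ui/cluster/dashboard.overview": ["query/graph:knowledge/find_experts"],
    "alias/cluster/experts": ["query/graph:knowledge/find_experts"],
    "policy/cluster/base_rbac": ["graph/cluster/knowledge"],
}


def missing_dependencies(resources):
    """Return sorted (resource, dependency) pairs whose dependency is undeclared."""
    return sorted(
        (key, dep)
        for key, deps in resources.items()
        for dep in deps
        if dep not in resources
    )


def blast_radius(resources, changed):
    """Return every resource that transitively depends on `changed`, sorted."""
    # Invert the edges: dependency -> resources that reference it.
    dependents = defaultdict(list)
    for key, deps in resources.items():
        for dep in deps:
            dependents[dep].append(key)

    # Breadth-first walk over the inverted edges; `seen` also guards cycles.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for downstream in dependents[current]:
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return sorted(seen)
```

Under these assumed edges, a schema change reaches the stored query plus the dashboard and alias built on it, while a dangling reference shows up in `missing_dependencies` at plan time instead of failing later at runtime.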
- -## Terraform-Style Validate / Plan / Apply - -The cluster workflow deliberately mirrors Terraform's safe sequence: - -```text -cluster validate # parse + schema-check desired config, no state mutation -cluster plan # diff desired config against state, with optional refresh -cluster apply # apply an accepted fresh plan and update state -cluster status # read state-backed deployed reality -cluster refresh # repair/import observations from actual cluster state -``` - -Implementation rollout follows the same safety posture: ship parser/validate -first, then read-only plan, then state backend and lock, then apply. - -The plan is a structured artifact, not just terminal text. It must include: - -| Plan field | Why it exists | -|---|---| -| `desired_revision` | Git commit / config digest being evaluated | -| `resource_digests` | Exact digest of every schema, query, policy, UI, pipeline, and map file | -| `dependencies` | Edges such as query -> graph/schema, dashboard -> query, pipeline -> source provider + graph | -| `state_observations` | Applied revision, resource fingerprints, graph manifest versions, status conditions, and drift | -| `changes` | Create/update/delete/replace/refresh-only operations | -| `blast_radius` | Downstream resources to revalidate or affected behavior to surface | -| `approvals_required` | Irreversible/data-loss or compatibility-narrowing gates | - -`cluster apply` must reject a stale plan when state, resource digests, or -observed graph versions no longer match the plan base. The operator or agent -must re-plan or explicitly refresh first. 
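The stale-plan rule above can be sketched as a comparison between the facts a plan was computed against and the facts read back under the state lock. This is a minimal sketch under stated assumptions: `Snapshot`, `stale_reasons`, `check_fresh`, and `StalePlanError` are hypothetical names, and the three compared fields simply mirror the sentence above (state, resource digests, observed graph versions).

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical view of the facts a plan is pinned to."""
    state_revision: int
    resource_digests: dict = field(default_factory=dict)
    graph_versions: dict = field(default_factory=dict)


class StalePlanError(Exception):
    """Raised when the plan base no longer matches current state."""


def stale_reasons(plan_base, current):
    """List every way `current` has moved away from `plan_base`."""
    reasons = []
    if plan_base.state_revision != current.state_revision:
        reasons.append("state revision changed")
    for label, planned, live in (
        ("resource digest", plan_base.resource_digests, current.resource_digests),
        ("graph version", plan_base.graph_versions, current.graph_versions),
    ):
        # Compare the union of keys so added and removed entries count as drift.
        for key in sorted(set(planned) | set(live)):
            if planned.get(key) != live.get(key):
                reasons.append(f"{label} changed: {key}")
    return reasons


def check_fresh(plan_base, current):
    """Refuse to apply unless the plan base still matches current state."""
    reasons = stale_reasons(plan_base, current)
    if reasons:
        raise StalePlanError("; ".join(reasons))
```

Returning all reasons rather than stopping at the first mismatch lets the operator or agent see the whole divergence before deciding whether to re-plan or refresh.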
- -## Cluster Storage Layout - -Target Phase-1 cluster-root layout: - -```text -/ - __cluster/ - state.json - lock.json - status/ - .json - approvals/ - .json - recoveries/ - .json - resources/ - query///.gq - policy//.yaml - ui//.dashboard.yaml - pipeline//.pipeline.yaml - graphs/ - .omni/ -``` - - -The exact filenames can change, but the shape cannot: - -- There is one cluster-control namespace under the cluster root. -- Graph data remains in ordinary OmniGraph graph roots. -- State is a locked/CAS-updated JSON document, not a Lance dataset. -- Status, approval, and recovery ledgers are append-only or per-resource JSON - records until table semantics are proven necessary. -- Resource payloads are content-addressed by digest so apply can be idempotent. -- Cluster state is not inferred from the operator's working tree. -- A Lance-backed control-plane store is a future backend option only if - row-level queryability/history or tighter publish fencing justifies it. - -## State Backend Protocol - -### Cluster-Hosted JSON State - -When `state.backend: cluster`, the baseline backend stores JSON documents under -`/__cluster/` and protects `state.json` with object-store lock/CAS. -It is cluster-hosted, but it is still a separate state write from graph Lance -manifest movement. - -Apply protocol: - -1. Acquire the cluster state lock. -2. Read current `state.json` and backend CAS token / object generation. -3. Validate plan base still matches state. -4. Write a cluster recovery sidecar before any graph manifest or non-idempotent - resource can move. -5. Write content-addressed resource payloads and perform any required graph - manifest movements. -6. CAS-update `state.json` with the new applied revision, resource - fingerprints, observed graph versions, status references, and approval / - recovery references. -7. If step 6 fails after actual resources moved, do not acknowledge success. - Surface `ActualAppliedStatePending` and require `refresh` / `import` repair. -8. 
Delete the sidecar and release the lock only after the state outcome is - recorded. - -### External State - - - -When `state.backend` points outside the cluster root, the same JSON state shape -lives in an external store. It is locked and CAS-updated, but it is not atomic -with Lance or OmniGraph manifests. - -Apply protocol: - -1. Acquire the external state lock. -2. Read state and CAS token. -3. Validate plan base still matches state. -4. Write a cluster recovery sidecar. -5. Perform the cluster resource changes. -6. CAS-update external state with the new applied revision, statuses, and the - observed graph manifest / resource versions it records. -7. If step 6 fails, do not acknowledge success. Surface - `ActualAppliedStatePending` and require `refresh` / `import` repair. -8. Release the external lock only after the state outcome is recorded. - -This mode can be strongly coordinated, but it must never be documented as one -atomic commit across both stores. - -### Future Lance-Backed State - -A Lance-backed state/status/approval/recovery store is deliberately not the -baseline. It becomes attractive only if JSON files become a real liability: -large status sets need structured filtering, approval/recovery history needs -table scans, or cluster apply needs a manifest publisher that can fence state -and graph-version pins together. Until then, Lance datasets add bootstrapping, -schema migration, and control-plane recovery surface without enough benefit. - -## Cluster Manifest Publisher - -The cluster publisher is a possible later layer above today's graph publisher. -It does not replace Lance or the per-graph `__manifest` table, and it is not -required for Phase-1 JSON state / read-only plan. 
- -Required semantics: - -| Requirement | Detail | -|---|---| -| Expected-version CAS | Every resource in an apply group supplies its expected current version/fingerprint | -| Resource changes | Register/update/tombstone resource payloads and graph version pins | -| Graph-head fencing | If a graph schema/lifecycle operation moves a graph manifest, the cluster manifest records the exact graph manifest version | -| Sidecar coverage | Any graph or cluster resource that can move before cluster publish must be recoverable all-or-nothing | -| Deterministic publish order | Sidecars and apply groups process in stable order | -| Loud partials | If a group cannot be rolled back or forward in-process, status records the condition before more apply work proceeds | - -The risky case is nested publish: - -```text -schema apply moves graph:knowledge manifest -cluster apply has not yet published query/policy/state records -process crashes -``` - -That is not safe unless the cluster sidecar records enough information to roll -the graph movement forward into the cluster manifest or roll it back using the -same recovery discipline as current graph recovery. 
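The expected-version CAS and deterministic-order requirements in the table above can be illustrated with an in-memory stand-in for the publisher. This is only a sketch: `publish_group`, `VersionConflict`, the `(version, payload)` store shape, and the convention that version `0` means "must not exist yet" are all assumptions, and it presumes the caller already holds the cluster state lock. It does not model sidecars or crash recovery.

```python
class VersionConflict(Exception):
    """Raised when a resource's current version differs from the expected one."""


def publish_group(store, changes):
    """Apply `changes` to `store` only if every expected version matches.

    `store` maps resource key -> (version, payload).
    `changes` is a list of (key, expected_version, new_payload); an
    expected_version of 0 means the resource must not exist yet.
    Returns the keys in the order they were published.
    """
    # Stable order so repeated runs process the group identically.
    ordered = sorted(changes, key=lambda change: change[0])

    # Phase 1: check every expectation before touching anything.
    conflicts = []
    for key, expected, _payload in ordered:
        current_version = store.get(key, (0, None))[0]
        if current_version != expected:
            conflicts.append((key, expected, current_version))
    if conflicts:
        raise VersionConflict(conflicts)

    # Phase 2: all expectations held, so publish the whole group.
    for key, expected, payload in ordered:
        store[key] = (expected + 1, payload)
    return [key for key, _expected, _payload in ordered]
```

Checking the whole group before writing anything keeps a version conflict from leaving a half-published group in this toy store. The nested-publish crash case described above is a different problem and still needs the sidecar discipline.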
- -## Plan Model - -Plan output is a durable, replay-checked proposal, not just pretty text: - -```text -Plan { - plan_id, - desired_revision, - base_state_revision, - base_state_cas, - changes[], - apply_groups[], - approvals_required[], - blast_radius, - diagnostics[] -} -``` - -Each change records: - -| Field | Meaning | -|---|---| -| `resource` | Stable `ResourceKey` | -| `operation` | Create, Update, Delete, Replace, RefreshOnly | -| `reversibility` | Reversible, Recoverable, CompatibilityNarrowing, IrreversibleDataLoss | -| `effect` | ConfigOnly, Catalog, GraphDefinition, GraphDataRewrite, DataPlaneSchedule | -| `downstream` | Resources that must be revalidated or will observe changed behavior | -| `approval` | None, HumanRequired, PolicyRequired, AlreadySatisfied | - -`apply` must re-read state and reject stale plans unless an explicit -`--refresh` / `--replan` path recomputes the plan. - -## Downstream Dependency Rules - -These are the concrete "what requires downstream" rules. - -| Changed resource | Must revalidate / recompute downstream | Blocking failures | -|---|---|---| -| Graph create/delete/rename | Policies, queries, aliases, dashboards, pipelines, bindings, server registry, state graph set | Dangling graph references; duplicate URI; invalid `GraphId`; graph delete without irreversible approval | -| Schema | Stored queries, pipeline maps, UI bindings/query outputs, embedding/index config, data-impact preview, policy predicates once row/type pushdown exists | Unsupported migration; query breakage; missing target type/property; hard drop without approval | -| Stored query | Aliases, UI bindings, policy list/invoke grants, MCP/tool catalog compatibility, typed params | Query file parse/type errors; registry key != `query `; removed query still referenced | -| Policy bundle | Query catalog visibility, graph/server action authorization, approval gates, bootstrap permissions | Invalid Cedar/YAML; server-scoped action in graph policy; per-query 
list/invoke gap unhandled | -| UI/dashboard | Query bindings, graph refs, output field expectations, policy visibility for referenced queries | Binding to missing graph/query/param/output | -| Alias | CLI command resolution, graph/query refs, shared-vs-personal boundary | Dangling graph/query; mutation alias pointing at read-only context | -| Embedding config | Schema `@embed` columns, model dimension, index rebuild/reconcile, env refs | Dimension mismatch; missing env ref; unsupported model/provider | -| Pipeline definition | Target graph schemas, mapping files, env refs, scheduler/runtime state, per-target run ledger | Missing target graph/type/property; overwrite mode without approval; source secret missing | -| Binding | Referenced source/surface pair, dependency order, visibility policy | Missing source or target; incompatible params | -| State backend config | Lock implementation, import/refresh protocol, apply acknowledgements | Backend missing CAS/lock; state CAS failure after graph/resource movement | - -## Blast Radius Matrix - -| Area | Required downstream change | Blast radius | Notes | -|---|---|---|---| -| Config parsing | Add strict `cluster.yaml` parser, path/env-ref resolver, resource fingerprints, no layered merge | High | Separate from `OmnigraphConfig`; existing config tests still need backcompat coverage | -| CLI | Add `cluster validate/plan/apply/status/refresh/import`, plan rendering, approval flags, actor threading | High | Must not change existing command selection or `omnigraph use` behavior | -| State backend | Add JSON state document, status/approval/recovery records, lock/CAS, and import/refresh repair | High | Must not silently succeed after state CAS failure | -| Optional cluster publisher | Add a cluster manifest plus table-backed state/status store only if stronger all-or-nothing apply is required | Very high | Touches core atomicity and recovery invariants | -| Recovery | Add cluster sidecars and failpoint coverage for 
graph-move-before-state-publish gaps | Very high | Any missed sidecar is a correctness bug | -| Graph lifecycle | First-class graph resource create/delete/rename or stable-id story | High | Current server add/remove is intentionally not exposed | -| Schema apply integration | Make schema apply cluster-aware or wrap it with cluster recovery | High | Existing schema apply cannot be treated as cluster atomic by assertion | -| Query registry | Remove target-state exposure flag, add dependency metadata, keep `mcp.expose` bridge | Medium/high | Catalog behavior is observable public API | -| Policy | Add cluster plan/apply/admin actions and per-query list/invoke scope | High | Needs docs, tests, Cedar schema migration, and compatibility with coarse `invoke_query` | -| Server registry | Boot from cluster state, eventually reload/reconcile graph handles, expose statuses | High | Affects routing, OpenAPI, auth, and workload admission | -| API types/OpenAPI | Plan/status/apply DTOs if HTTP management endpoints ship | Medium/high | OpenAPI drift must be regenerated | -| UI specs | New renderer/spec validator/binding checker | High | New product surface, not currently implemented | -| Pipelines | New scheduler/runtime/connector/mapping/idempotency/run ledger | Very high | **Separate project** (socket reserved here); second data-plane seam, large product and correctness surface | -| Embeddings | Cluster-level defaults, env refs, model/dimension validation, index interaction | Medium | Existing embedding code is mostly offline/client-side | -| Docs | User docs for cluster config, policy, server, CLI; dev docs for invariants/testing | High | Public contract changes | -| Tests | New cluster suites plus extensions to config/server/policy/recovery/schema/query tests | High | Needs boundary-matched coverage | - -## Reversibility And Approval Tiers - -| Tier | Examples | Gate | -|---|---|---| -| Display-only | Dashboard layout, non-breaking alias addition | No approval beyond policy | 
-| Catalog behavior | Add query, hide/list query via policy, add policy grant | Policy check; no data-loss approval | -| Compatibility narrowing | Future validated enum narrowing, query param removal, policy removal that revokes access | Explicit compatibility warning; may require human approval by policy | -| Recoverable definition rewrite | Soft schema drop, graph schema rename, index rebuild | Plan warning; no data-loss approval unless policy requires | -| Irreversible data loss | Graph delete, hard schema drop, cleanup-triggered prior-version reclamation, overwriting pipeline target | Human approval artifact recorded in audit ledger | - -Future enum narrowing belongs in `CompatibilityNarrowing` unless the migration -also drops/coerces data or triggers cleanup. That distinction matters for plan -wording and for policy predicates. - -## Rollout Phases - - - -### Phase 0: Documentation And Parser Skeleton - -- Add cluster spec types and strict parser behind an unused feature/module. -- Implement `cluster validate --config ` with no state backend. -- Validate file paths, env refs, duplicate resource keys, and dependency graph. -- No behavior change to `omnigraph.yaml`, server boot, or query exposure. - -### Phase 1: Read-Only Planning - -- Add `cluster plan` against a mock/imported state snapshot. -- Produce plan JSON and human output. -- Reuse existing schema migration planner for schema resources. -- Validate stored queries against desired schema. -- Compute downstream dependencies and blast radius. -- Still no apply. - -### Phase 2: State Backend And Lock - -- Add `state.backend: cluster` JSON storage and lock/CAS. -- Add external backend trait only if lock + CAS semantics are explicit. -- Add `cluster status`, `refresh`, and `import`. -- Persist `AppliedRevision`, `ResourceStatus`, and audit references in JSON. - -### Phase 3: Config-Only Apply - -- Apply query, policy, UI, alias, embedding, and pipeline definition resources - that do not move graph manifests. 
-- Publish by writing content-addressed resource payloads and CAS-updating - `state.json`. -- Keep server boot from `omnigraph.yaml`; cluster state is inspectable but not - yet serving traffic. - -### Phase 4: Graph And Schema Apply - -Detailed design: [rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md) -(cluster sidecar schema, roll-forward-only recovery matrix, approval artifacts, -actor threading, 4A/4B/4C staging). - -- Add graph create/delete as cluster resources. -- Make schema apply cluster-aware, with sidecar coverage for graph manifest - movements before JSON state publish. -- Gate irreversible data-loss operations with approval artifacts. -- Consider a cluster manifest publisher only if the JSON sidecar + repair path - is not strong enough for the accepted safety contract. - -### Phase 5: Server Reads Cluster Catalog - -Detailed design: [rfc-005-server-cluster-boot.md](rfc-005-server-cluster-boot.md) -(the --cluster mode switch, applied-revision serving, serving metadata in -state, readiness table, migration path). - -- Allow server startup from cluster state. -- Add status and catalog endpoints as needed. -- Keep the current `omnigraph.yaml` startup path as compatibility mode — an - **exclusive mode switch** per deployment (cluster state XOR `omnigraph.yaml`), - never a merged read of both (Compatibility Stance #7, axiom 15). -- Regenerate OpenAPI for any HTTP surface. - -### Phase 6: Policy-Owned Query Exposure - -- Add per-query policy scope for list/invoke. -- Filter `GET /queries` by actor. -- Keep coarse `invoke_query` as a broad allow rule for compatibility until - docs and migrations say it can be narrowed. -- Deprecate and later remove `mcp.expose` from target-state cluster config.
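
The Phase 6 bullets can be sketched as a small catalog filter. This is only an illustration under stated assumptions: the `Grant` and `Catalog` types, the `"*"` wildcard standing in for the coarse `invoke_query` rule, and the `"not_found"` outcome are hypothetical names, not the shipped policy engine or HTTP surface.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    role: str
    action: str  # "list" or "invoke"
    query: str   # stored-query name, or "*" for the coarse compatibility rule


@dataclass
class Catalog:
    queries: set[str]
    grants: list[Grant] = field(default_factory=list)

    def allowed(self, roles: set[str], action: str, query: str) -> bool:
        # A grant matches on role, action, and either the exact query or "*".
        return any(
            g.role in roles and g.action == action and g.query in (query, "*")
            for g in self.grants
        )

    def list_queries(self, roles: set[str]) -> list[str]:
        # Per-actor listing: only queries the actor may list are returned.
        return sorted(q for q in self.queries if self.allowed(roles, "list", q))

    def invoke(self, roles: set[str], query: str) -> str:
        # Hidden and missing collapse into one outcome so callers cannot probe.
        if query not in self.queries or not self.allowed(roles, "invoke", query):
            return "not_found"
        return "ok"


catalog = Catalog(
    queries={"find_experts", "orphan_pages"},
    grants=[
        Grant("member", "list", "find_experts"),
        Grant("member", "invoke", "find_experts"),
        Grant("admin", "invoke", "*"),  # broad allow kept for compatibility
    ],
)
```

In this sketch a member sees only `find_experts` in the listing, and invoking `orphan_pages` gives the same result as invoking a name that does not exist.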
- -### Pipelines: separate project (socket only) - -Pipelines are **descoped from this rollout** (2026-06-10): the runtime -(scheduler/worker, connector contracts, mapping validation, idempotency keys, -per-target run status, retry behavior) is a separate project with its own -RFC. This rollout guarantees only the socket: - -- `pipelines:` stays a reserved config field, rejected with a typed - `future_phase_field` diagnostic (enforced + test-covered in - `omnigraph-cluster`). -- `pipeline.` stays a reserved typed address; the resource model - (kind-agnostic state entries, extensible sidecar kinds, dependency edges) - accepts the new kind without reshaping. -- Axiom 13 is the contract the future implementation must satisfy: the - definition is reconciled, the execution is data-plane; fan-out is statusful, - never silently atomic. - -## Test Ownership - -Tests must prove the Terraform-style workflow, not just individual parsers. -The minimum behavior contract: - -```text -validate catches bad config -plan is deterministic and complete -apply only applies a fresh accepted plan -state changes are locked and durable -drift and partial convergence are visible, not silent -``` - -| Change | Existing coverage to extend | New coverage likely needed | -|---|---|---| -| Cluster parser | `omnigraph-config` inline config tests for strictness/path resolution | `omnigraph-cluster` parser/dependency tests | -| Plan dependency graph | Schema planner tests, query registry tests | Golden plan JSON for cross-resource downstream impacts | -| State lock/backend | Existing schema apply lock tests as model | JSON state CAS/lock race tests | -| Optional cluster manifest publisher | `crates/omnigraph/src/db/manifest/tests.rs` | Cluster publisher CAS, expected-version, deterministic order tests if that backend ships | -| Cluster recovery | `recovery.rs`, `failpoints.rs` | Phase B -> state publish failpoints, external state CAS failure tests | -| Schema cluster apply | `schema_apply.rs`, 
failpoints schema apply cases | Nested graph/cluster recovery tests | -| Query exposure policy | `omnigraph-policy` invoke_query tests, server query catalog tests | Per-query list/invoke allow/deny and no-probing tests | -| Server cluster boot | `omnigraph-server/tests/server.rs`, `openapi.rs` | Boot from cluster state, registry reload/status tests | -| CLI cluster commands | `omnigraph-cli/tests/cli.rs`, `system_local.rs` | `cluster validate/plan/apply/status` system tests | -| Pipelines | None today | New runtime/mapping/idempotency/run-ledger suites | - -Workflow-specific tests: - -| Workflow area | Required assertions | -|---|---| -| Parser / validate | Unknown fields, wrong-kind typed addresses, missing providers, inline secret values, dangling graph/query/pipeline refs, and future-phase fields fail with typed diagnostics | -| Plan goldens | Given config + imported/fake state, plan JSON contains stable resource digests, dependency edges, state observations, proposed changes, blast radius, and approval gates in deterministic order | -| Fresh-plan apply | Changing config digest, state revision, resource digest, or observed graph manifest version after planning makes `cluster apply` reject and require re-plan/refresh | -| State lock / CAS | Concurrent applies against the same backend cannot both succeed; loser gets a typed lock/CAS conflict | -| Recovery / partial apply | Fail after graph/resource movement but before cluster state publish; assert recovery or status surfaces `ActualAppliedStatePending`/sidecar state and never returns success | -| Server/runtime phase | Before cluster state drives routing or registry reload, tests are hermetic: no real home dir, no real global config, no real credentials, no ignored remote tests | -| Pipeline phase | Fan-out run records per-target status, commit ids, retryability, and idempotency keys; no aggregate success unless every target succeeded | - -Hard gates: - -- Do not ship `cluster apply` until `cluster validate` and 
read-only - `cluster plan` have hermetic tests. -- Do not ship graph/schema-moving apply until failpoint recovery tests prove the - Phase B -> state publish gap is covered. (Stage 3B delivered the apply-side - half: `omnigraph-cluster` has failpoint infrastructure and tests for the - crash-after-payload and state-CAS-race windows of config-only apply, plus - catalog payload verification in status/refresh. Graph-moving sidecar - coverage remains Phase 4 work.) - -For docs-only changes, `scripts/check-agents-md.sh` is enough. For -implementation phases, run the boundary tests above before widening to -`cargo test --workspace --locked`. - -## User-Visible Documentation Fallout - -The following public docs must change when the corresponding phase ships: - -| Phase | User docs | -|---|---| -| Parser/validate | New `docs/user/cluster-config.md`; CLI reference for `cluster validate` | -| Plan/apply | CLI reference, transactions, policy, errors | -| State backend | Storage, deployment, constants, maintenance | -| Server cluster boot | Server, deployment, OpenAPI | -| Policy query exposure | Policy, server, query language / stored-query registry docs | -| Pipelines | New pipeline user guide, deployment, audit, errors | -| Embeddings config | Embeddings, indexes | - -Do not ship a user-visible command, flag, env var, endpoint, or config key -without updating the corresponding user doc in the same PR. - -## Known High-Risk Design Decisions - -1. **Cluster root identity.** Decide whether `metadata.name` is a label or - identity. Prefer root-derived stable identity plus display name to avoid a - rename breaking resource identity. -2. **Graph storage derivation.** The high-level sample omits graph storage. - Implementation should derive graph roots under `ClusterRoot/graphs/.omni` - by default and treat external graph roots as a separate, explicit feature. -3. **Nested apply.** Schema apply and graph lifecycle cannot move a graph - manifest outside cluster sidecar coverage. -4. 
**External state.** Must expose pending repair instead of returning success - when graph/resource movement succeeds and external state CAS fails. -5. **Per-query policy.** Catalog filtering must avoid probing leaks: callers - without list/invoke permission should not distinguish hidden from missing. -6. **Pipeline fan-out.** Do not promise atomic multi-graph ingestion unless the - runtime uses a real branch/merge or equivalent protocol for every target. -7. **Drift correction.** Reconciler-initiated deletes are the same data-loss - class as human-requested deletes. - -## Exit Criteria For A Real RFC - -Before implementation begins beyond parser/validate, the RFC must answer: - -1. Exact JSON state/status/approval/recovery schemas and object-store paths. -2. Exact sidecar JSON schema and recovery decision matrix. -3. State backend interface and supported lock/CAS implementations. -4. Cluster apply group syntax and dependency ordering rules. -5. Plan JSON schema, including blast-radius and approval fields. -6. Bootstrap authority and first-actor story. -7. Server startup and migration path from `omnigraph.yaml`. -8. Per-query policy schema and compatibility bridge for `mcp.expose`. -9. Pipeline runtime owner, status schema, and idempotency contract — **deferred to the separate pipelines project's own RFC**; this rollout only reserves the socket. diff --git a/docs/dev/cluster-config-specs.md b/docs/dev/cluster-config-specs.md deleted file mode 100644 index b9dfde8..0000000 --- a/docs/dev/cluster-config-specs.md +++ /dev/null @@ -1,496 +0,0 @@ -# Cluster Config Spec — Declarative, As-Code, Agent-Operated - -**Status:** Draft / thinking-in-progress -**Type:** Architecture direction -**Date:** 2026-06-07 -**Relationship:** generalizes today's `omnigraph.yaml` graph/query/policy configuration surface ([CLI reference](../user/cli/reference.md), [server docs](../user/operations/server.md)) into a future cluster control plane.
The distilled rules are in [cluster-axioms.md](cluster-axioms.md); detailed downstream implementation spec and blast-radius assessment in [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md). This is a proposed architecture, not an implemented RFC. - -> **Implementation status.** The examples below describe the full target schema. -> Stage 2B only accepts the read-only subset documented in -> [cluster-config.md](../user/clusters/config.md). Future-phase fields such as -> `env_file`, `apply`, `providers`, `pipelines`, `embeddings`, `ui`, `aliases`, -> and `bindings` are intentionally rejected with typed diagnostics until their -> reconciler semantics are implemented. - -> **Revision 2026-06-07 — full commitment to the Terraform paradigm.** Three changes from the earlier draft: (1) **state is an authoritative, locked ledger in a backend** (server-hosted *or* a separate cloud store), not "a mostly-rebuildable projection"; (2) `plan` is framed as the **CLI diff between local config and state**; (3) **ETL pipelines** (external data sources) are a first-class config asset — a second seam, alongside schema, where a definition triggers a data-plane effect. The full set of config assets (incl. **aliases**, **embeddings**) is enumerated below. - ---- - -## The problem (the Sarah/Bob test) - -Two operators, Sarah and Bob, administer the same OmniGraph deployment. Sarah adds new queries, changes a schema, adds a dashboard, updates policies, and wires in a new data feed. - -**How does Bob find out?** - -Today he can't — not cleanly.
Sarah's changes land in many different places via many different mechanisms: - -- schema → the schema-apply path, accepted state in `_schema.pg`, `_schema.ir.json`, `__schema_state.json`, and table versions in the graph manifest -- queries → `.gq` files passed per request or resolved through CLI query roots / aliases; not durable cluster state -- policies → `policy.file` in `omnigraph.yaml`, pointing at Cedar/YAML files that are usually GitOps'd externally -- aliases → CLI sugar in each operator's `omnigraph.yaml` -- external data → ad-hoc `load`/`ingest` scripts, cron jobs, glue code that lives nowhere durable -- UI → undefined - -There is no single diff that spans them, no single change record attributed to Sarah, no one place Bob (or Bob's agent) reads to answer "what is this deployment, and what changed?" The state is **fragmented**, and fragmentation is hostile to the one thing an agent must do: reason over the system *as a whole*. - -A design passes only if it answers the Sarah/Bob test directly. - ---- - -## Thesis - -The unit of declarative state is the **cluster** (the deployment), described by **a single config, as code, in version control**, operated by an **agent** through a plan/apply/reconcile loop against an authoritative state ledger. - -Every surface is a declarative as-code artifact — schema (`.pg`), queries (`.gq`), policies (`.yaml`), UI (`.yaml`), aliases, **ETL pipelines**, and embeddings config. The UI is not a separately-deployed application; it is a declarative spec, a first-class resource reconciled exactly like the others. - -Three pillars, none optional: - -1. **DECLARATIVE** — you describe the desired end state, not the steps. The reconciler computes the steps. -2. **AS CODE** — the config is declarative text in a repo, version-controlled. This is the **source of truth for *intent***. -3.
**OPERATED BY AGENT** — an agent authors config changes and drives reconciliation as an authenticated actor, with policy and approval gates. No human state-management burden. - -This is **Terraform's model, taken literally**: config (as code) is desired truth; **state is an authoritative, locked ledger** of what has been applied — held in a backend (the cluster, or a separate cloud store); `plan` diffs config against state; `apply` converges reality to config and updates state — applied at **cluster** scope, with OmniGraph as its own data-aware provider and an agent as the controller. - ---- - -## Why as-code (the recursion argument) - -"As code" is not branding. It is the structural property that makes a self-describing system well-founded. - -Consider the rejected alternative: model the cluster's definition *as a graph* (a meta-graph whose nodes are graphs/policies/queries/UI). To describe a graph you need a schema. The meta-graph's schema is either: - -- **hardcoded** → the base case is *code* (you smuggled code in at the bottom anyway), or -- **another graph** → infinite regress, no base case. - -Graph-describing-graph never terminates. **Code is the base case.** A declarative config needs no meta-describer because it is parsed by the engine's compiled code — not described by more user-space data. - -> **Declarative-as-code terminates. Declarative-as-data (a graph of graphs) recurses.** - -This is also why **config** must live **outside** the running system: reviewable (PRs), reproducible (clone + apply), diffable as text, and editable by an agent — without depending on the running system to describe its own intent. - -Corollary on direction: change flows **code → cluster, never the reverse.** You do not edit the running system and call that intent. (State, separately, *records* what the cluster currently is — see the next section — but it is never where you express what it *should* be.)
- ---- - -## Why per-cluster, not per-graph - -The definition Sarah changed does not *belong* to any single graph: - -1. **Policies cross-cut graphs.** "Member can't delete on any graph," "who may list/create/delete graphs" — cluster facts. No graph could own them. -2. **"Which graphs exist" has no home in a per-graph model.** The set of graphs is state *above* any graph. -3. **Queries, UI, pipelines, and aliases span graphs.** The MCP/tool catalog an agent discovers is the *cluster's* surface; a dashboard renders multiple graphs; a pipeline may fan out into several. -4. **Cross-graph apply groups.** Sarah may add a graph *and* wire it into the UI *and* grant policy access *and* attach a feed as one logical change — only the cluster can express, plan, and eventually fence that as one apply group. -5. **Operators operate clusters.** Bob is Sarah's peer on a *deployment*, not a graph. The collaboration unit is the cluster. - -The graph is a *resource within* the cluster, not the unit of operation. - -The mirror question — *why not per-fleet?* — is the same one this section used against per-graph, one level up. A fleet of clusters may eventually want its own declarative spec describing which clusters exist. That recursion is real but **out of scope here**: this proposal stops at the cluster because the cluster is the unit two operators collaborate over. Fleet is the next scope up, named and deferred, not denied.
- ---- - -## The model: config / state / reconcile (the Terraform model, literally) - -| Layer | What it is | Source of truth for… | Who manages it | -|---|---|---|---| -| **Config** (as code, a folder of files) | Desired state of the whole cluster — graphs, schemas, policies, queries, UI, bindings, aliases, embeddings, ETL pipelines | **Intent** ("what it should be") | Operators/agents, in version control | -| **State** (a locked ledger in a backend) | The authoritative record of what has been applied — applied revision, per-resource fingerprints, observed graph/table versions, audit-record references, resource conditions | **Deployed reality** ("what is") | The reconciler; humans don't hand-edit it | -| **Actual cluster** | The realized *definition* of the running graphs — schema/policies/queries/UI/pipelines as actually in force | — (reality itself) | The engine; `apply` converges it to config | - -**`plan`** = `diff(config, state)` → proposed change set (optionally refreshed against the actual cluster). -**`apply`** = acquire the state lock → converge actual → config → **update state** → release lock. Apply does **not** acknowledge success until the state update succeeds; if actual moved but the state write failed, the next `plan` / `refresh` must surface the non-success state and repair or import it before more work proceeds. - -### State is an authoritative, locked ledger — not a throwaway projection - -This is the 2026-06-07 revision. State is treated exactly as Terraform treats `tfstate`: - -- **Authoritative.** State is the trusted record of what is deployed. `plan` diffs config against **state** (fast, deterministic), not against a full live scan of the cluster on every command. "What exists" is answered from state. -- **In a backend.** State lives in a configurable backend: the **cluster's own object-store backend**, or a **separate cloud store** (e.g.
a different bucket/account) — the operator's choice, mirroring Terraform's local/S3/remote backends. The config declares which. -- **JSON first.** The baseline state format is Terraform-style JSON documents (`state.json` plus status/approval/recovery JSON records) protected by backend lock/CAS. Lance control-plane datasets are a possible later backend only if row-level history, queryability, or tighter publish fencing justifies the added machinery. -- **Atomicity depends on backend and publish scope.** A JSON state backend, even when stored under the cluster root, is a separate CAS step from graph Lance manifest moves. If actual resources move but the state write fails, apply must surface `ActualAppliedStatePending` (or equivalent) and require refresh/import repair instead of pretending one atomic commit covered every object. A future Lance-backed state backend or cluster manifest publisher may tighten this, but that is not the Phase-1 assumption. -- **Locked.** `plan`/`apply` acquire a **state lock** before touching state, so two operators (or two agents) cannot converge concurrently and corrupt the ledger. This generalizes the existing `__schema_apply_lock__` from schema scope to cluster scope. -- **Reconstructable, but not casually rebuilt.** OmniGraph's edge over opaque-cloud Terraform: the running cluster is self-describing (manifests, commit logs), so a lost state ledger can be **imported / refreshed** from the live cluster. That is a *resilience* property — not licence to treat state as disposable. State is protected and backed up like any source of truth. -- **One slice is never reconstructable.** Who *approved* an irreversible apply cannot be re-derived from a manifest scan. That approval/audit record lives in the **durable audit ledger** (baseline: append-only JSON records in the state backend; future: a Lance table only if needed). State *references* it by id; it never *is* it.
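
The "JSON first" and "Locked" bullets can be sketched as a compare-and-swap on a revision counter inside `state.json`. This is a minimal local-filesystem illustration, not the proposed backend: the `revision` and `resources` field names, `CasConflict`, and both helper functions are assumptions, and the read-check-write below is only safe while the caller holds the state lock. A real object-store backend would need a conditional write instead.

```python
import json
import os
import tempfile


class CasConflict(Exception):
    """Raised when the ledger moved since the caller last read it."""


def read_state(path: str) -> dict:
    # A missing ledger is treated as revision 0 with no applied resources.
    if not os.path.exists(path):
        return {"revision": 0, "resources": {}}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def cas_write_state(path: str, expected_revision: int, resources: dict) -> dict:
    # Assumes the caller already holds the cluster-scope state lock.
    current = read_state(path)
    if current["revision"] != expected_revision:
        raise CasConflict(
            f"expected revision {expected_revision}, found {current['revision']}"
        )
    next_state = {"revision": expected_revision + 1, "resources": resources}
    # Write to a temp file in the same directory, then rename over the ledger
    # so readers never observe a half-written document.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(next_state, f, sort_keys=True)
    os.replace(tmp, path)
    return next_state
```

A writer that planned against revision 0 after someone else already published revision 1 gets `CasConflict` rather than silently overwriting the ledger.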
- -**The control plane reconciles definition, not data.** The reconcile loop converges the cluster's *definition* — schema, policies, queries, UI, bindings, aliases, pipelines, and the set of graphs. It does **not** converge **data**: rows, edges, and vectors are data-plane content, mutated by `load`/`mutate` and by **pipeline execution**, versioned by the commit DAG, and they sit entirely outside the reconcile loop. (`load`/`mutate` never appear in `cluster.yaml`.) **Two** definition kinds *trigger* a data-plane effect without owning data — schema and ETL pipelines (see "ETL pipelines" below). - -### Cluster resource model - -Minimum vocabulary: - -- **ClusterRoot** — the object-store prefix / control namespace for one deployment. -- **DesiredRevision** — git commit, `cluster.yaml` digest, and per-resource digests. -- **ResourceKind** — `Graph`, `Schema`, `Query`, `PolicyBundle`, `UiSpec`, `Binding`, `Alias`, `EmbeddingConfig`, **`Pipeline`** (ETL), and future cluster-scoped resources. -- **ResourceAddress** — normalized typed references between resources, such as `graph.knowledge`, `query.knowledge.find_experts`, `policy.base_rbac`, and `pipeline.github_sync`; illustrative YAML may use shorthand, but plan/state store the typed form. -- **ProviderAddress** — typed references to provider instances, such as `provider.storage.prod_graphs`, `provider.source.github_org`, and `provider.embedding.default`; provider addresses keep storage, external sources, and embedding providers from being inferred from ambiguous strings. -- **StateBackend** — where the JSON state ledger is stored: `cluster` (this deployment's own backend) or an external store (a separate bucket/account). -- **StateLock** — the cluster-scope lock acquired before plan/apply. -- **AppliedRevision** — the durable, locked record (the heart of state) of which desired revision is applied, with audit-record references, resource fingerprints, and graph/table version observations.
-- **ResourceStatus** — `Pending | Planned | Applying | Applied | Drifted | Blocked | Error`, with typed conditions and observed actual state. -- **ApplyGroup** — the explicit atomicity unit. Default is one independent resource per group; cross-resource references force planner-derived groups, and user-declared groups may opt into larger atomicity only for resources the active backend protocol can fence or repair. Baseline JSON state supports small, explicit groups; larger all-or-nothing groups require a future cluster publisher or equivalent proof. - ---- - -## State: backend, lock, and the config ↔ state diff - -The CLI is the operator's window onto the gap between config and state. - -The Terraform-aligned workflow is: - -```text -cluster validate # parse + schema-check desired config, no state mutation -cluster plan # diff desired config against state, with optional refresh -cluster apply # apply an accepted fresh plan and update state -cluster status # read what state says is deployed now -cluster refresh # update/import state observations from actual cluster state -``` - -`plan` is the central artifact. It records the desired revision, resource -digests for every referenced file, dependency edges between resources, observed -state fingerprints / graph manifest versions, proposed changes, and approval -gates. The human output below is a rendering of that structured plan, not the -only representation.
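
As a rough illustration of `plan` as `diff(config, state)`, the sketch below compares per-resource digests of the desired config against the digests recorded in state and emits changes in a deterministic order. The function names and the shape of each change record are assumptions made for this example, not the plan JSON schema, which this document leaves open. Dependency edges, blast radius, and approval gates are omitted.

```python
import hashlib


def digest(payload: str) -> str:
    # Content digest of one resource payload (for example a `.gq` or `.yaml` file).
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan(desired: dict[str, str], state: dict[str, str]) -> list[dict]:
    """Diff desired payloads (address -> text) against state (address -> digest)."""
    changes = []
    # Sorting the typed addresses keeps the plan stable for identical inputs.
    for address in sorted(set(desired) | set(state)):
        want = digest(desired[address]) if address in desired else None
        have = state.get(address)
        if want == have:
            continue  # already converged, nothing to propose
        if have is None:
            action = "create"
        elif want is None:
            action = "delete"
        else:
            action = "update"
        changes.append({"address": address, "action": action, "from": have, "to": want})
    return changes


state = {
    "graph.knowledge": digest("graph v1"),
    "policy.base_rbac": digest("rbac v1"),
    "query.knowledge.orphan_pages": digest("orphan query"),
}
desired = {
    "graph.knowledge": "graph v1",
    "policy.base_rbac": "rbac v2",
    "query.knowledge.find_experts": "experts query",
}
proposed = plan(desired, state)
```

Here `proposed` contains an update for `policy.base_rbac`, a create for `query.knowledge.find_experts`, and a delete for `query.knowledge.orphan_pages`, while the unchanged `graph.knowledge` does not appear.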
- -``` - $ omnigraph cluster plan - config ./ → diff against state (backend: cluster · lock: acquired) - - ~ schema knowledge hard-drop Person.legacy_id ⚠ prior versions reclaimed — needs approval - + query knowledge.find_experts (new stored query) - - query knowledge.orphan_pages (removed) - ~ policy base_rbac grant invoke find_experts → members (this is what EXPOSES the new query) - + pipeline saas_sync notion → knowledge, hourly - ~ ui dashboards.overview add panel "experts" - + alias experts - ───────────────────────────────────────────────────────────────────── - 7 changes · 1 requires approval (hard schema drop on knowledge) · run `apply` to converge -``` - - -That output **is** the answer to the Sarah/Bob test: one diff, spanning every surface, attributed to a git commit and concrete resource digests, with data-impact peeked (axiom-6 schema seam), dependency fallout visible, observed state compared, and approval gates surfaced *before* anything moves. Drift (someone poked the live cluster out-of-band) shows up here too — `plan` reconciles state against the actual cluster and flags resources whose observed version no longer matches the ledger. - - -`apply` then: acquire **state lock** → execute the change set (ordered/grouped per the planner) → **CAS-update the JSON state ledger** with the new applied revision/status observations → release the lock. For config-only resources, content-addressed payload writes can happen before the state CAS because state is the publish point. For graph/schema moves, the graph manifest may move before the state CAS; a crash or CAS failure there leaves a loud repair/import condition and no success acknowledgement, not a silently successful atomic apply. A future cluster manifest publisher can tighten this gap, but the baseline protocol does not assume it.
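
The apply sequence described above can be sketched in memory. Everything here is a hypothetical model: the `Ledger` class, the `fail_next_cas` test hook, and the `LockConflict` and `StalePlan` outcome names are assumptions, and only `ActualAppliedStatePending` comes from this document. The point is the ordering: lock, fresh-plan check, move actual resources, CAS the state, and never report success when the state write did not land.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    revision: int = 0
    applied: dict = field(default_factory=dict)
    locked: bool = False
    fail_next_cas: bool = False  # test hook simulating a failed state write

    def cas(self, expected: int, applied: dict) -> bool:
        # Publish only if the ledger is still at the revision the plan saw.
        if self.fail_next_cas or self.revision != expected:
            self.fail_next_cas = False
            return False
        self.revision, self.applied = expected + 1, dict(applied)
        return True


def cluster_apply(ledger: Ledger, plan: dict, actual: dict) -> str:
    if ledger.locked:
        return "LockConflict"
    ledger.locked = True  # acquire the state lock
    try:
        # Fresh-plan check: reject a plan computed against an older revision.
        if plan["state_revision"] != ledger.revision:
            return "StalePlan"
        actual.update(plan["changes"])  # converge actual resources
        if not ledger.cas(plan["state_revision"], actual):
            # Actual moved but state did not: surface it, never claim success.
            return "ActualAppliedStatePending"
        return "Applied"
    finally:
        ledger.locked = False  # release the lock on every path


ledger, actual = Ledger(), {}
plan1 = {"state_revision": 0, "changes": {"query.knowledge.find_experts": "sha256:aaa"}}
first = cluster_apply(ledger, plan1, actual)
replayed = cluster_apply(ledger, plan1, actual)
```

In this model `first` is `"Applied"` and `replayed` is `"StalePlan"`, because the second call reuses a plan computed against revision 0 after the ledger moved to revision 1.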
- ---- - -## ETL pipelines (the second data-plane seam) - -> **Scope note (2026-06-10): descoped to a separate project.** Pipelines are -> a product surface of their own (scheduler, connectors, mapping language, -> idempotency, run ledger) and will be designed and built outside the cluster -> control-plane track. What this spec retains is the **socket** they plug -> into, which is already enforced: (1) the `pipelines:` config field is -> reserved — `cluster validate` rejects it with a typed `future_phase_field` -> diagnostic, so it can never be silently squatted; (2) the typed address -> form `pipeline.` and the `Pipeline` resource kind are reserved in the -> resource model; (3) axiom 13 fixes the contract any future implementation -> must satisfy — the pipeline *definition* is a reconciled cluster resource, -> its *execution* is data-plane and never reconciled. The design text below -> stands as the requirements record for that project, not as a phase of this -> one. - - -External data — from another database, an API, a file drop, a stream — is a first-class config asset, not glue code that lives nowhere. - -A **Pipeline** is declared in config: a **source** (e.g. `notion`, `github`, `slack`, `gdrive`, `postgres`, `http`, `s3-files`, `kafka`), an optional **schedule/trigger**, and **one or more target graphs**, each with its own **mapping/transform** (external records → graph types & properties). A single feed can **fan out across graphs** — e.g. a GitHub sync that populates both the `engineering` graph and the people/teams in `knowledge`. It is reconciled like any resource — `apply` creates / updates / deletes / (re)schedules the pipeline *definition*. This is the canonical "company brain" move: the deployment's graphs are continuously assembled from the SaaS tools the org already uses.
- -The crucial boundary (axiom 6, axiom 13): the pipeline **definition** is control-plane and reconciled; the pipeline's **execution** — actually pulling rows and writing them — is a **data-plane effect** that produces ordinary `load`/`mutate` commits *outside* the reconcile loop. The reconciler converges the pipeline; the rows it ingests are never reconciled state (just as a cron *definition* is config but its output is not). This makes ETL the **second seam** where a definition triggers a data-plane effect — schema being the first (a migration conforms existing rows; ETL ingests new ones). - -Consequences that fall out of the existing model: - -- **`plan` previews the pipeline, not the data.** "pipeline `saas_sync`: notion → `knowledge`, hourly" is a definition diff; it does not scan the source (data-volume-independent), the same way schema `plan` previews impact only at the bounded, opt-in data peek. -- **Source credentials come from the `.env` file** (axiom 10): `token: ${NOTION_TOKEN}` — resolved from the gitignored `.env` file per deployment, never inline. -- **Reversibility gradient applies** (axiom 8): a pipeline that *appends* is reversible-ish; one configured to *overwrite* a target is a data-loss path and hits the irreversible-op gate. -- **Referential integrity is plan-time** (axiom 9): a pipeline whose `into:` names a graph/type the same revision removes is a fail-closed `plan` error. -- **Fan-out is statusful, not magically atomic.** A pipeline execution that writes to several graphs is a set of ordinary per-target graph writes unless the pipeline explicitly stages through a branch/merge protocol that can fence those targets. A failed run may therefore leave `engineering=Applied`, `knowledge=Error` (for example), and the pipeline run ledger must expose per-target status, commit ids, retryability, and idempotency keys.
Control-plane `apply` only converges the definition/schedule; it never means every future data-plane target has ingested successfully. - ---- - -## Config assets — the full set - -Everything below is **shared cluster config** (in the folder, version-controlled, secret-free) unless marked per-operator. The rule of thumb: if two operators must agree on it, it's config; if it's how *you personally* reach or view the cluster, it's per-operator. - -| Asset | In config? | Notes | -|---|---|---| -| **Graphs** (the set that exists) | ✅ config | the named graphs; their existence is cluster state | -| **Schema** (`.pg`, **one per graph**) | ✅ config | also encodes indexes (`@index`/`@unique`/vector), constraints, and search (`@embed`) — so indexes & search are reconciled *via* schema | -| **Stored queries** (`.gq`, **per graph**) | ✅ config | a `.gq` file declares **many** named queries; the registry declares which exist (name → file, key must match the `query ` symbol). **Target design:** exposure — who may list/invoke each — is a policy decision, not a registry flag. **Current compatibility bridge:** shipped `omnigraph.yaml` still has `queries..mcp.expose`, and the HTTP catalog is not Cedar-filtered per query yet. Aliases & bindings reference a query by name | -| **Policy bundles** (`.yaml`) | ✅ config | YAML (not Cedar files); **shared across graphs** via `applies_to: [cluster \| ]` (many-to-many; fix 2026-06-08 unified the old `scope:`/`graphs:` split). Gates actions **and query exposure** (who may list/invoke each stored query) | -| **UI specs / dashboards** (`.yaml`) | ✅ config | first-class resources; a dashboard **reads from several graphs** (`graphs: [...]`) | -| **Bindings** | ✅ config | wiring between resources (query ⇄ UI surface) | -| **Aliases** | ✅ config* | CLI shortcut to a stored query: `{ command, query: <.gq file>, name: , args, format }` — `query` is the **file**, `name` the **query symbol** in it.
See note | -| **Embeddings config** | ✅ config | model + dimension + which fields embed; the **API key comes from the `.env` file** (`${…}`) | -| **ETL pipelines** | ✅ config | source → transform → **one or more target graphs**; source credentials come from the `.env` file | -| **Apply settings** | ✅ config | `apply.default_grain`, grouping/ordering hints | -| **State backend + lock** | ✅ config | where the ledger lives, whether to lock | -| **Secrets (`.env` file)** | ✅ ref'd by config; values **gitignored** | a separate `.env` of secret values, referenced as `${NAME}`; never committed (OmniGraph's standard env-file convention) | -| **Connection** (which cluster URI) | ❌ per-operator | how *you* reach the cluster | -| **Operator token** | ❌ per-operator (secret) | each operator's own credential to reach the cluster | -| **CLI prefs** (output format, table layout, active graph/branch selection) | ❌ per-operator | personal ergonomics, not shared truth | - -\* **Aliases — the one with a split.** A shared alias that names a cluster resource (a stored query, a dashboard) is config — it's a vocabulary the whole team relies on, and it belongs in the spec (often it *is* just the stored-query catalog entry, since that already carries name + params + tool metadata). A *purely personal* shortcut (your own command abbreviations) stays in the per-operator layer. When in doubt: if it should survive `git clone` and be the same for Bob as for Sarah, it's config. - ---- - -## The synthesis (beyond vanilla Terraform) - -Embracing Terraform does not mean stopping at Terraform. Three extensions make this specifically right for OmniGraph and the agentic future: - -1. **OmniGraph is its own data-aware provider, and `plan` can peek across the data boundary.** A Terraform provider CRUDs resources blind to your data.
Here, the control-plane resource is the schema **definition** (declarative, reconciled); converging it *triggers* a data-plane **effect** — currently soft/hard drops, rewrites, and index creation, with future validated migrations such as enum narrowing or `String`→`enum` conversion once the planner grows that tier. The leverage is that `plan`, before applying the definition change, can *peek* at bounded data-plane consequence and report it — **"hard-dropping this property requires approval and will make prior versions unreachable after cleanup"** or, in the future, **"narrowing this enum will fail on 37 rows"** — which Terraform structurally cannot do. This is deliberate and bounded: a data peek makes that `plan` cost scale with data volume, so it is **opt-in / bounded** (sampled or skippable for large tables), and it never makes the control plane the owner of data. Schema and ETL pipelines are the **two** seams where the control plane reaches into the data plane; everywhere else `plan` is data-volume-independent. - -2. **JSON state first, explicit partials, optional stronger fencing later.** Terraform apply is not transactional — partial applies are a real failure mode. Lance commits are per dataset, and today's OmniGraph manifest atomicity is graph-scoped: one graph commit flips the relevant sub-table versions together, protected by expected table versions and recovery sidecars. The first cluster-control backend should match Terraform's shape: a locked JSON state document plus append-only JSON status/approval/recovery records. That keeps Phase 1 inspectable and narrow. Cluster-level all-or-nothing apply is a later capability only if we add a **cluster manifest publisher** or Lance-backed state backend that fences graph *version pins*, query catalogs, policy bundles, UI specs, pipeline definitions, recovery sidecars, and state as one commit protocol. Until that exists, apply must surface partial convergence as `ResourceStatus`, not pretend it was atomic.
- -3. **Agent-as-controller fuses Terraform with Kubernetes.** Terraform contributes the as-code config (truth outside the system, recursion-terminating) and the locked state ledger. Kubernetes contributes *continuous* reconciliation (controllers watch, not apply-on-demand). The agent is both author and controller: it reads a config change, runs the data-aware plan, evaluates blast radius against the reversibility gradient, **auto-applies the reversible parts only when policy permits, and escalates irreversible / data-loss gates to a human approval artifact recorded in the audit ledger and referenced by state.** - -> Terraform's as-code config + locked state × Kubernetes' continuous reconciliation × the agent as the controller that bridges them — on OmniGraph's data-aware, atomic substrate. - ---- - -## Concrete shape (illustrative) - -The config is **a set of files in one folder** (flat, Terraform-style — the extension carries the type): - -``` - company-brain/ - ├── cluster.yaml # the spec (graphs, policies, ui, bindings, aliases, pipelines, state, vars ref) - ├── .env # SECRET VALUES — gitignored, never committed - ├── knowledge.pg · engineering.pg # schemas (one per graph) (.pg) - ├── knowledge.gq · engineering.gq # query files — each holds MANY queries (.gq) - ├── cluster_admin.policy.yaml · base_rbac.policy.yaml · knowledge_pii.policy.yaml # shared policy bundles - ├── overview.dashboard.yaml # cross-graph UI spec (.dashboard.yaml) - └── notion_to_knowledge.map.yaml · github_to_engineering.map.yaml · github_to_people.map.yaml # pipeline maps -``` - -Secrets live in a gitignored `.env` file (OmniGraph's standard env-file convention); the config references them as `${NAME}`: - -```bash -# .env — secret values; gitignored; never committed. Referenced in cluster.yaml as ${NAME}.
-NOTION_TOKEN=… -GITHUB_TOKEN=… -EMBEDDING_API_KEY=… -``` - -Resource relationships (so the wiring is unambiguous): - -``` - cluster ──has many──► graph ──has one──► schema - └────has──► query file(s) (.gq) ──each declares MANY──► query { … } symbols - registry entry key = the query symbol ──points to──► its .gq file (queries: { : { file } }) - (registry says a query EXISTS; it carries NO expose flag) - policy bundle ──applies to──► { cluster | one or MANY graphs } (SHARED, many-to-many) - └──governs query EXPOSURE──► who may LIST / INVOKE each stored query (no `expose:` in the registry) - alias (command, query = .gq FILE, name = symbol, args, format) ──selects one query from that file - binding names a query by registry name (graph.queryName) ──► resolved to (file, symbol) - dashboard ──reads from──► one or MANY graphs - pipeline ──writes into──► one or MANY graphs - secrets ──live in──► a separate gitignored `.env` file; config uses ${NAME} -``` - -```yaml -# cluster.yaml — desired state of the whole deployment (config = source of truth for INTENT) -version: 1 -metadata: - name: company-brain - -state: # the authoritative ledger's backend (Terraform-style) - backend: cluster # "cluster" = this deployment's own store; or s3://… (a separate store) - lock: true # acquire a state lock before plan/apply - -env_file: ./.env # secret VALUES live in a gitignored .env file; referenced below as ${NAME} - -apply: - default_grain: resource # references may force groups; explicit groups request more atomicity - -graphs: # the cluster's graphs — each is ONE schema + a set of named queries - knowledge: # people · teams · docs · decisions · projects - schema: ./knowledge.pg # desired schema; reconciler runs (and plan previews) the migration - queries: # the graph's stored (named) queries; KEY must match a `query ` in the file - find_experts: { file: ./knowledge.gq } # ─┐ `query find_experts` and `query related_docs` - related_docs: { file: ./knowledge.gq } # ─┘ both
live in knowledge.gq. Who may LIST/INVOKE → policy (not here) - engineering: # repos · services · incidents · PRs - schema: ./engineering.pg - queries: - service_owners: { file: ./engineering.gq } - open_incidents: { file: ./engineering.gq } - -policies: # policy BUNDLES (YAML) — SHARED across graphs (many-to-many). - # Policy ALSO governs query EXPOSURE: who may list/invoke each stored query. - # Fix (2026-06-08): unified the binding field on `applies_to:` (was a - # `scope:` + `graphs:` split) — one field, takes `cluster` or graph refs; - # bare graph names are shorthand for `graph.` (see impl-spec typed addresses). - cluster_admin: # cluster-scoped: graph_list, create/delete, management - file: ./cluster_admin.policy.yaml - applies_to: [cluster] - base_rbac: # read/write + which roles may invoke which queries, across both graphs - file: ./base_rbac.policy.yaml - applies_to: [knowledge, engineering] - knowledge_pii: # an extra bundle, only for knowledge - file: ./knowledge_pii.policy.yaml - applies_to: [knowledge] - -pipelines: # ETL — ONE pipeline may write into SEVERAL graphs (definition only) - saas_sync: # the "company brain" move: assemble graphs from the SaaS tools - source: { kind: notion, token: ${NOTION_TOKEN} } # secret via ${NAME}, never inline - schedule: "0 * * * *" # hourly; execution is a data-plane effect, not reconciled state - into: # fans out across graphs - - { graph: knowledge, map: ./notion_to_knowledge.map.yaml } - github_sync: - source: { kind: github, token: ${GITHUB_TOKEN} } - schedule: "*/15 * * * *" - into: - - { graph: engineering, map: ./github_to_engineering.map.yaml } - - { graph: knowledge, map: ./github_to_people.map.yaml } # same feed enriches a SECOND graph - -embeddings: # semantic search over docs/decisions; key via the `.env` file - model: gemini-embedding-2 - dimension: 3072 - api_key: ${EMBEDDING_API_KEY} - -ui: # dashboards read from SEVERAL graphs - dashboards: - overview: - file: ./overview.dashboard.yaml -
graphs: [knowledge, engineering] # cross-graph - -aliases: # CLI shortcuts. ⚠ an alias's `query:` is the .gq FILE PATH; - # `name:` selects the query SYMBOL inside it (a file declares many). - experts: { command: query, graph: knowledge, query: ./knowledge.gq, name: find_experts, args: [topic], format: table } - incidents: { command: query, graph: engineering, query: ./engineering.gq, name: open_incidents, format: table } - -bindings: # wiring between resources - - query: knowledge.find_experts - surface: ui.dashboards.overview -``` - - -What this is *not*: it is **not** a graph, and it carries **no credentials** — only secret *references* (`${…}`). It is parsed by the engine (the base case), describes the desired cluster, and is the thing two operators diff and review. - -The **state ledger** lives in the configured backend (the cluster, or a separate cloud store), versioned, CAS-updated, schema-versioned, locked during apply, agent-managed — the authoritative record of what is deployed. The baseline backend is JSON, so even cluster-hosted state is published through a state CAS and repaired explicitly if graph/resource movement happened first. A future cluster publisher can tighten that boundary, but it is not assumed by the high-level spec. - ---- - -## Boundaries that hold (orthogonal correctness, not Terraform-bias) - -1. **Secrets live in a `.env` file, never inline in config.** The committed config is what the cluster *is* (shared, reviewable, as code) and carries **no secret values** — only `${NAME}` references. The values (embedding API key, pipeline source credentials, per-deployment settings) live in a separate **`.env` file** — which is **gitignored and never committed**, and supplied per deployment. Separately, an *operator's own token* (how they personally reach the cluster) belongs to the per-operator connection layer, not the cluster config or its `.env` file. - -2.
**The reversibility gradient gates apply — including drift correction.** Dropping a graph, hard-dropping schema data, or an overwriting pipeline is irreversible data loss; a future validated enum narrowing is a compatibility-narrowing migration unless it also drops or coerces stored values; recoloring a dashboard is not. Unified config, unified plan — but **tiered gates inside apply**, keyed to physics, not to who operates it. The gate applies to **drift correction too**: converging actual→config can mean *dropping* something added out-of-band — a data-loss path that hits the same gate. A reconciler "just fixing drift" is never an exception. - -3. **Agents are actors, not ambient authority.** The reconciler runs with a resolved actor or service account, subject to Cedar policy. If it applies on behalf of a human, the durable audit ledger carries both the controller actor and the approving human / approval artifact, and state references that ledger entry. Client-supplied actor identity is never trusted. - -4. **Status is explicit when apply is not atomic.** A unified plan does not imply a unified commit. If an apply group partially converges, the cluster must expose `ResourceStatus` and typed conditions until reconciliation finishes or rolls back. Silent partial success is forbidden. - -5. **State integrity is protected.** State is locked during apply and stored durably in its backend. The baseline state backend is JSON plus lock/CAS, so state update failures surface a repair/import condition before success is acknowledged. A lost ledger is recoverable (import/refresh from the self-describing cluster), but state is never treated as disposable. - ---- - -## Relationship to current config - -This is not green field, but it is also not today's `omnigraph.yaml`. The current file is a shared convenience for CLI and server startup: named graph targets, server defaults, query roots, aliases, embeddings model, auth env-file lookup, and `policy.file`.
It is **not** the cluster's source of truth, it has no separate state ledger, and parts of it are intentionally per-operator. - -This proposal: - -- **splits** per-operator connection/credential/preference config from shared cluster config, -- **adds** `cluster.yaml` + a flat config folder as the full declarative cluster config (graphs, schemas, query catalog, policy bundles, UI specs, bindings, **aliases**, **embeddings**, **ETL pipelines**), -- **adds** the **JSON state ledger** (authoritative, locked, in a backend) and the `cluster plan`/`apply` loop, -- **adds** the reconciler (with OmniGraph as its own data-aware provider), while treating a cluster manifest publisher as a later option rather than the baseline, -- **lets an agent drive** plan/apply/continuous-reconcile. - -The connection/credential/preference layer remains per operator: it points at a cluster, resolves that operator's identity, and holds personal ergonomics. The cluster config stays shared, secret-free, and reviewable; the state ledger stays authoritative and locked. - -### Migration model: single ownership, mode switch, shrinking job description (axiom 15) - -`omnigraph.yaml` is not being replaced; its **job description shrinks**. Only the -shared-truth parts of its current role migrate to the cluster catalog (the set of -graphs, the stored-query registry, policy wiring, the server boot source). The -per-operator parts are per-operator *by nature* — Sarah's and Bob's differ — and -keep `omnigraph.yaml`/the per-operator layer as a permanent, well-defined home. - -While both exist, **each fact has exactly one owner at any moment, and -coexistence is a mode switch, never a merge**. The brittle version of backward -compatibility — the server reading graphs from `omnigraph.yaml` *and* from -cluster state with precedence rules gluing them together — is rejected outright: -two readers for one truth means every bug becomes "which file won?" and every -feature pays the tax twice.
The realistic timeline has three windows: - -1. **Now → Phase 4 (no conflict).** Cluster apply writes only to its own catalog - (`__cluster/`); `omnigraph.yaml` serves traffic. `Applied` status must - visibly mean "recorded in the cluster catalog, not yet serving" so the - overlap is loud, not hidden. -2. **Phase 5 (the mode switch).** A deployment opts into booting from cluster - state; `omnigraph.yaml`'s server-role keys become inert *for that - deployment*. Exclusive — boot from cluster state XOR `omnigraph.yaml` — with - no key-level aliasing and no merged precedence. -3. **Phase 6+ (bridges with sunsets).** Targeted compatibility bridges are - allowed only with a named replacement and a removal phase; `mcp.expose` → - policy-owned exposure is the template. Bridges that accumulate without an - exit are review-rejected. - -Key-by-key compatibility inside one evolving file is the expensive kind of -backcompat (the v1 `omnigraph.yaml` reshape's `--target`/legacy-key regressions -are the in-repo cautionary tale); resource-ownership seams between two files -with a mode switch is the cheap kind. Police the single-owner rule in every -Phase 3–6 PR: a proposal that merges the two sources for one fact is the -deny-list's "state that drifts from what it can be derived from" wearing a -compatibility costume.
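
A minimal sketch of the mode switch described above, in Python for brevity. `BootMode`, `BootSources`, and `serving_graphs` are hypothetical names and the URIs are placeholders; none of this is OmniGraph's actual boot code. The only point it illustrates is that the serving set comes from exactly one owner and the two sources are never merged:

```python
from dataclasses import dataclass
from enum import Enum


class BootMode(Enum):
    """Hypothetical per-deployment switch; the names are illustrative."""
    LEGACY_YAML = "omnigraph.yaml"
    CLUSTER_STATE = "cluster-state"


@dataclass(frozen=True)
class BootSources:
    yaml_graphs: dict      # graphs declared in omnigraph.yaml
    cluster_graphs: dict   # graphs recorded in the cluster catalog


def serving_graphs(mode: BootMode, sources: BootSources) -> dict:
    """Return the graphs to serve from exactly one owner, never a merge."""
    if mode is BootMode.CLUSTER_STATE:
        # omnigraph.yaml's server-role keys are inert for this deployment.
        return dict(sources.cluster_graphs)
    if mode is BootMode.LEGACY_YAML:
        # The cluster catalog may already be written, but it is not serving yet.
        return dict(sources.yaml_graphs)
    raise ValueError(f"unknown boot mode: {mode!r}")


# Placeholder data: the same graph name exists in both sources on purpose.
sources = BootSources(
    yaml_graphs={"knowledge": "s3://legacy/knowledge"},
    cluster_graphs={
        "knowledge": "s3://cluster/knowledge",
        "engineering": "s3://cluster/engineering",
    },
)

print(sorted(serving_graphs(BootMode.LEGACY_YAML, sources)))
print(sorted(serving_graphs(BootMode.CLUSTER_STATE, sources)))
```

Because the function branches on the mode instead of layering one dictionary over the other, there is no precedence rule to debug: a graph that appears in both sources is resolved by the mode alone.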
- -### The per-operator layer: contents and destination - -The per-operator layer must be **complete** — everything an operator needs to -work against any cluster from any directory, and nothing that two operators must -agree on: - -| Per-operator concern | Today | Target | -|---|---|---| -| Connection (which cluster/server, named endpoints) | `omnigraph.yaml` `graphs.` URIs / `servers:` refs | global config, per-operator | -| Operator credential **reference** (`bearer_token_env`, env-file lookup) | `omnigraph.yaml` + `.env` | global config references; secret values stay in env/`.env`, never in any config | -| Active context (current graph/branch selection) | ad-hoc per-command flags / `defaults` | global state layer (e.g. `omnigraph use`), explicitly **not** the cluster state ledger (axiom 5's "state" is the applied-cluster ledger, not a personal selection) | -| CLI ergonomics (output format, table layout) | `omnigraph.yaml` `cli:`/`defaults:` | global config, per-operator | -| Personal command shortcuts (purely personal aliases) | `omnigraph.yaml` `aliases:` | global config; *shared* aliases (team vocabulary) are cluster config — see the aliases split note above | - -Destination: this layer belongs in the operator's **global config dir** -(`~/.omnigraph`, per the RFC-002 global-first layered-config direction — -global config + active-context state file), not in a repo-committed file, so it -survives `git clone`, works from any directory, and never collides with the -shared cluster folder. The RFC-002 layering implementation is currently parked -(PRs #139/#162 closed over review findings), but the *boundary* it draws is the -one this spec depends on: per-operator → global dir; shared deployment intent → -the cluster config folder; deployed reality → the state ledger. - -Implementation gate: the Terraform-style workflow must be testable in order.
-`cluster validate` must catch bad config before any apply path exists; -read-only `cluster plan` must have deterministic structured-plan tests before -state mutation ships; and graph/schema-moving apply must have recovery tests for -the gap between graph/resource movement and JSON state publish. Otherwise the -control plane can look declarative while still hiding drift or partial success. - ---- - -## Open questions - -1. **Cluster state layout.** What exact JSON documents / object-store paths hold `AppliedRevision`, `ResourceStatus`, approval records, recovery records, sidecars, and resource content for query/policy/UI/pipeline specs? What evidence would justify a future Lance-backed state backend? -2. **State backend options.** Beyond "cluster" and "a separate bucket," what backends are first-class (a different account, a remote control service)? How is the backend itself bootstrapped and its lock implemented (object-store CAS vs an external lock service)? -3. **State import / refresh.** The exact actual-state scan that reconstructs a conservative `AppliedRevision` when the ledger is lost, and which fields become `Unknown`. -4. **Apply grain syntax.** Apply defaults to per-resource `ApplyGroup`; cross-resource references force planner-derived groups; user-declared groups opt into more atomicity. What's the YAML, and which combinations can the publisher actually fence? -5. **Pipeline runtime.** Where do pipelines *execute* (in the server? a worker? an external scheduler?), how are runs observed in `ResourceStatus`, and how does a failed/partial run reconcile vs. retry? -6. **Continuous reconciliation trigger.** Watch-and-converge (k8s-style) vs. apply-on-config-change. The agent-as-controller model leans toward continuous. -7. **Tenant partitioning (cloud).** A cluster may host multiple tenants; config/state is then tenant-partitioned, consistent with the reserved `GraphKey { tenant_id, graph_id }`. Tenant resolved from the token, never the config. -8. 
**Bootstrap — config, state, *and* authority.** How a cluster comes into existence from an initial config (`init` seeds; cluster owns; git mirrors for CI/DR), the first state write, and the chicken-and-egg of the very first apply (which needs an actor before any cluster exists to resolve policy against — so the bootstrap actor is necessarily out-of-band and privileged). Security-sensitive; needs an explicit story. -9. **Alias scoping.** Where exactly the shared/personal alias line falls, and whether shared aliases are just stored-query catalog entries. -10. **UI render and safety model.** Generic engine-side renderer vs. thin client, allowed components, query-binding validation, policy propagation, sandboxing, version compatibility. -11. **Cluster identity vs. `metadata.name`.** Is `metadata.name` a label or stable identity? If identity, renaming loses it — the stable-ID-across-rename gap already in `invariants.md`. Decide whether identity keys on `name` or on `ClusterRoot`, and reuse the existing known-gap framing. -12. **Resource dependency ordering.** Explicit dependency DAG (Terraform) vs. eventual convergence with retries (k8s). The most consequential unmade fork: it decides whether `plan` can promise an apply *order* before any data moves. -13. **Query exposure in policy (supersedes `mcp.expose`).** *Today* the stored-query registry carries a per-query `mcp.expose` flag and invocation is gated with the coarse `invoke_query` Cedar action — with **per-query authorization a documented gap** (the catalog isn't Cedar-filtered per query yet). This design **folds exposure fully into policy and drops the flag**: a stored query's visibility (catalog membership) and invocability are both policy decisions, so the catalog `GET /queries` returns each actor's policy-permitted set. The open work is the exact policy predicates for *list* vs *invoke* per query, and retiring `mcp.expose`.
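
One possible shape for the policy-owned exposure in question 13, sketched in Python. `QueryGrant`, `visible_catalog`, `may_invoke`, and the role names are assumptions made for illustration; the real predicates would be Cedar policy and remain open. The registry only says which queries exist, while separate *list* and *invoke* grants decide what each actor sees and may run:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QueryGrant:
    """Hypothetical policy row: roles that may list / invoke one stored query."""
    list_roles: frozenset = field(default_factory=frozenset)
    invoke_roles: frozenset = field(default_factory=frozenset)


def visible_catalog(registry, grants, actor_roles):
    """Names of registered queries the actor may LIST (default deny)."""
    roles = frozenset(actor_roles)
    return sorted(
        name for name in registry
        if roles & grants.get(name, QueryGrant()).list_roles
    )


def may_invoke(name, registry, grants, actor_roles):
    """True only if the query exists AND policy grants invoke to the actor."""
    if name not in registry:
        return False
    return bool(frozenset(actor_roles) & grants.get(name, QueryGrant()).invoke_roles)


# Registry mirrors the example config: it carries no expose flag.
registry = {
    "find_experts": "./knowledge.gq",
    "related_docs": "./knowledge.gq",
}

# Illustrative grants; role names are made up for this sketch.
grants = {
    "find_experts": QueryGrant(
        list_roles=frozenset({"analyst", "admin"}),
        invoke_roles=frozenset({"admin"}),
    ),
    "related_docs": QueryGrant(
        list_roles=frozenset({"admin"}),
        invoke_roles=frozenset({"admin"}),
    ),
}

print(visible_catalog(registry, grants, ["analyst"]))
print(may_invoke("find_experts", registry, grants, ["analyst"]))
```

Keeping *list* and *invoke* as separate grants means an actor can see a query in the catalog without being able to run it, which is the distinction the open question asks the policy predicates to settle.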
- ---- - -## Prior art - -- **Terraform** — declarative infra *as code*; config is desired truth, **state is an authoritative ledger in a backend**, **state locking** serializes applies, `plan` diffs config↔state, providers do the CRUD. The core model adopted here, taken literally. -- **Kubernetes** — one cluster store, many resource types under one API; controllers reconcile continuously; cluster-level RBAC. The continuous-reconciliation half of the synthesis. -- **dbt / Airflow / Dagster** — declarative, as-code data pipelines with lineage. Prior art for the **ETL-pipeline-as-config** asset (the second data-plane seam). -- **OmniGraph's own schema-apply** — already a faithful plan/apply/state/drift loop for the `schema` resource type, with `__schema_apply_lock__` as the lock seed; the reconciler this generalizes. diff --git a/docs/dev/codeowners.md b/docs/dev/codeowners.md new file mode 100644 index 0000000..9a7fb50 --- /dev/null +++ b/docs/dev/codeowners.md @@ -0,0 +1,37 @@ +# Code ownership + +`.github/CODEOWNERS` is **generated** — not hand-edited. The source of truth is `.github/codeowners-roles.yml`, expanded by `.github/scripts/render-codeowners.py`. CI rejects drift between the two and rejects direct edits to `CODEOWNERS` that don't accompany a yml change. + +This setup gives every role change a reviewable PR and a permanent in-repository audit trail (`git log .github/codeowners-roles.yml`). + +## Current roles + +| Role | Members | Scope | +|---|---|---| +| `engineering` | `@ragnorc` | All code under `crates/**`, repository infrastructure, default for unmapped paths | +| `docs` | `@ragnorc` | `docs/**`, README.md, AGENTS.md, CLAUDE.md, SECURITY.md | + +GitHub treats multiple owners in a CODEOWNERS line as **"any one of them satisfies the review requirement"**. To require N distinct approvers on a specific path, layer a CI check on top (not currently configured). + +## How to change role membership or path mappings + +1.
Edit `.github/codeowners-roles.yml`. +2. Run `python3 .github/scripts/render-codeowners.py` (requires PyYAML; `pip install pyyaml`). +3. Commit both files in the same PR. + +CI fails the PR if: +- `CODEOWNERS` was edited without a corresponding yml change, or +- The yml was changed but the rendered `CODEOWNERS` doesn't match. + +## How to add a new role + +1. Add a new entry to `roles:` in the yml with a `description` and `members` list. +2. Reference the role from `paths:` (or `default:`). +3. Regenerate + commit as above. + +## Why a generator, not direct CODEOWNERS edits? + +- **Audit trail**: `git log .github/codeowners-roles.yml` is the canonical record of every role change. The rendered `CODEOWNERS` is a derived artifact. +- **Roles are first-class**: paths reference roles, not raw handles. Renaming a person or rotating a role updates one place, not every path. +- **Future extension**: scheduled rotation (weekly on-call, quarterly leads) plugs into the same yml without changing the path mappings. Not enabled today. +- **Consistency with the product**: omnigraph itself enforces auditable Cedar policy. The repository's code-owner policy follows the same "policy as reviewed code" pattern. diff --git a/docs/dev/docs-issues.md b/docs/dev/docs-issues.md deleted file mode 100644 index c0a4fdb..0000000 --- a/docs/dev/docs-issues.md +++ /dev/null @@ -1,70 +0,0 @@ -# User Docs Coherence Ledger - -**Last review:** 2026-06-20 (against 0.7.1) -**Status:** all open findings resolved — living ledger for future audits. - -This page tracks stale or incoherent user-doc claims found during broad docs -reviews. Findings are validated against current **code/behavior**, not just -cross-doc consistency. Record new findings as they surface; mark them resolved -(with the fixing commit) once the public pages are corrected. - -## Resolved — 2026-06-20 docs/user coherence sweep - -Every finding from the 2026-06-20 review was validated (all reproduced) and -fixed.
Branch `docs/user-coherence-0-7-1`. - -| Pri | Finding | Resolution | -|---|---|---| -| P1 | `cluster apply` documented as catalog-only / "Stage 3A" with graph+schema deferred — in both `cli/reference.md` and the shipped CLI help (`cli.rs`) | Rewrote both to describe the real converge behavior (creates graphs, applies schema with soft drops, writes catalog, executes approved deletes in one ordered run); `deferred` now means the genuinely-unsupported case (standalone schema delete). | -| P1 | Stored-query exposure had two contracts: `server.md` documented a per-query `mcp:{expose:false}` knob; cluster docs said all queries are listed | Confirmed in code: cluster registry has no expose field (`QueryConfig`), boot bridge hardcodes `expose: true` (`omnigraph-server` settings), no GQ-level annotation. Removed the knob from `server.md`; documented "every applied query is listed; per-query exposure may become a Cedar-policy decision later". | -| P1 | The same stale "`mcp.expose == true` subset" contract lived in the **OpenAPI surface**: utoipa annotations (`handlers.rs:1029,1037`, `omnigraph-api-types/src/lib.rs:404`) drove `openapi.json` (Greptile catch on #293) | Updated the three Rust doc-comment/annotation strings to "every stored query" and regenerated `openapi.json` (`OMNIGRAPH_UPDATE_OPENAPI=1`); drift test green. Same-change per AGENTS.md rule 4. | -| P2 | `schema/index.md` claimed `allow_data_loss` honored "uniformly across transports" incl. HTTP `POST /schema/apply` | Scoped to the direct/embedded path; added that cluster-managed graphs evolve via `cluster apply` (soft drops only) and the HTTP route is 409-disabled for cluster serving. | -| P2 | `/load` missing from admission / body-limit / rate-limit / manifest-conflict prose (named `/ingest` only); constants called it "Ingest body limit" | Documented `/load` as canonical everywhere with `/ingest` as the deprecated alias; renamed the constant to "Load (bulk-write) body limit".
| -| P2 | CLI "Bearer token resolution" section listed removed `omnigraph.yaml` keys (`graphs..bearer_token_env`, `auth.env_file`) | Replaced with a pointer to the keyed-credential model (`OMNIGRAPH_TOKEN_` → `~/.omnigraph/credentials` → `OMNIGRAPH_BEARER_TOKEN`); no plaintext-in-config path. | -| P2 | Flat route names in a cluster-only server (`POST /query`, `POST /mutate`, `GET /queries`, `POST /queries/{name}`) | Added a one-line note that the per-graph subsections use shorthand under `/graphs/{id}/…`; the endpoint table is already fully qualified. | -| — | `version` printed `omnigraph 0.3.x` | → `0.7.x`. | -| — | `search/indexes.md` used deprecated `ingest --mode merge` | → `load --mode merge`. | -| — | `config.md` `deferred` disposition described as "graph/schema change, later phase" | → "an unsupported change (e.g. standalone schema delete)". | -| — | Stale stage labels (`Stage 3A`, `Stage 2C`, `Stage 1`) in active reference docs | Removed / reworded to plain language; release notes keep history. | - -## Open — surfaced 2026-06-20, not yet fixed - -- **Stale "config-only apply" / "Stage 3A" comments in `omnigraph-cluster` - source** (internal rustdoc, not user docs — out of scope for the docs sweep - above): `src/types.rs:147` ("Applied changes execute (config-only query/policy - catalog writes)"), `src/types.rs:265` ("Output of config-only cluster apply"), - `src/diff.rs:256`, and `src/tests.rs:1129` ("config-only apply (Stage 3A)"). - Apply now also runs graph creates, schema applies, and approved deletes - (`diff.rs:411` `GraphCreate` / `SchemaApply`; the Stage-4 create/schema/delete - executors + tests `apply_creates_graph_and_unblocks_dependents`, - `apply_schema_update_and_dependent_query_in_one_run`, - `apply_blocks_graph_delete_without_approval`). Update these comments in a - cluster-crate change.
-- **Cross-repo drift from this sweep** (separate repos): - - `omnigraph-ts` SDK — its generated `spec/openapi.json` + - `packages/sdk/src/generated/types.gen.ts` still describe the `GET /queries` - catalog as the `mcp.expose` subset. **No hand-fix:** the SDK's - `scripts/sync-spec.ts` pulls openapi.json from a *tagged* omnigraph release - (`/omnigraph/v{version}/openapi.json`), and the catalog fix landed on main - *after* the v0.7.1 tag — so it is in no tag yet and a hand-edit would be - overwritten on the next sync. It flows in automatically when the SDK bumps - to a tag containing the fix (v0.7.2+). Tracked, not actioned. - - `omnigraph-cookbooks/docs/best-practices.md` `bearer_token_env` chain — - **RESOLVED** by omnigraph-cookbooks PR #26 (2026-06-21), which deleted - `docs/best-practices.md` as part of the 0.7 restructure; the stale chain - survives nowhere on `main`. - -## Verification checklist (re-run on the next docs audit) - -```bash -rg -n "Stage [0-9]|graph/schema changes are deferred|reserved for later stages" docs/user crates/omnigraph-cli/src/cli.rs -rg -n "POST /query|POST /mutate|GET /queries|POST /queries/\{name\}|POST /schema/apply" docs/user -rg -n "ingest --mode|Ingest body limit|/ingest" docs/user -rg -n "0\.3\.x|bearer_token_env|auth\.env_file" docs/user -rg -n "expose: false|mcp\.expose" docs/user -``` - -Expected: active user docs have no matches for stale phrases, or the remaining -matches are explicitly marked as deprecated aliases, "no longer exist" notes, or -route shorthand disclaimed relative to `/graphs/{id}`. Release notes are allowed -to preserve historical behavior.
diff --git a/docs/dev/execution.md b/docs/dev/execution.md index e9ac9eb..f5c2840 100644 --- a/docs/dev/execution.md +++ b/docs/dev/execution.md @@ -84,7 +84,7 @@ Resolves expression values to literals, converts to typed Arrow arrays (`literal - `insert` (no `@key`, edges) → accumulate into `MutationStaging.pending` (Append mode); finalize calls `stage_append` once per touched table. - `insert` (`@key` node) → accumulate into `pending` (Merge mode); finalize calls `stage_merge_insert` once per touched table. - `update` → scan committed via Lance + pending via DataFusion `MemTable` (read-your-writes), apply assignments, accumulate into `pending` (Merge mode). -- `delete` → still inline-commits via `delete_where` (Lance v6.0.1 has no public two-phase delete; `DeleteBuilder::execute_uncommitted` first ships in v7.0.0-beta.10 — tracked as MR-A in [docs/dev/lance.md](lance.md)); recorded into `MutationStaging.inline_committed`. +- `delete` → still inline-commits via `delete_where` (Lance 4.0.0 has no public two-phase delete); recorded into `MutationStaging.inline_committed`. **D₂ parse-time rule.** A single mutation query is either insert/update-only or delete-only. Mixed → reject before any I/O. The check fires in `enforce_no_mixed_destructive_constructive(&ir)` inside `execute_named_mutation`. @@ -147,7 +147,7 @@ sequenceDiagram - End-of-query Lance commit: `TableStore::stage_append`, `stage_merge_insert`, `commit_staged` at `crates/omnigraph/src/table_store.rs` - Manifest commit primitive: `commit_updates_on_branch_with_expected` at `crates/omnigraph/src/db/omnigraph/table_ops.rs` -Atomicity guarantee for multi-statement mutations: a mid-query failure leaves Lance HEAD untouched on staged tables (no inline commit happened during op execution), so the next mutation proceeds normally with no `ExpectedVersionMismatch`.
The publisher CAS at the very end either succeeds (manifest advances atomically across all touched sub-tables) or fails with a typed `ManifestConflictDetails::ExpectedVersionMismatch` (no partial publish). See [docs/dev/invariants.md](invariants.md) and [docs/dev/writes.md](writes.md). +Atomicity guarantee for multi-statement mutations: a mid-query failure leaves Lance HEAD untouched on staged tables (no inline commit happened during op execution), so the next mutation proceeds normally with no `ExpectedVersionMismatch`. The publisher CAS at the very end either succeeds (manifest advances atomically across all touched sub-tables) or fails with a typed `ManifestConflictDetails::ExpectedVersionMismatch` (no partial publish). See [docs/dev/invariants.md](invariants.md) and [docs/dev/runs.md](runs.md). ## Bulk loader (`loader/mod.rs`) @@ -162,19 +162,19 @@ Atomicity guarantee for multi-statement mutations: a mid-query failure leaves La | Mode | Semantics | Path (post-MR-794) | |---|---|---| -| `Overwrite` | Replace all data in the target tables on the branch | Same accumulator; one `stage_overwrite` + `commit_staged` per touched table at end-of-load (a staged Lance `Operation::Overwrite` transaction – HEAD does not advance until commit; MR-793 Phase 2); publisher CAS. | +| `Overwrite` | Replace all data in the target tables on the branch | Inline-commit per type, then publisher CAS at end-of-load. Truncate-then-append doesn't fit the staged shape; documented residual. | | `Append` | Strict insert; duplicates error | In-memory `MutationStaging` accumulator; one `stage_append` + `commit_staged` per touched table at end-of-load; publisher CAS. | | `Merge` | Upsert by `id` (`merge_insert`) | Same accumulator; one `stage_merge_insert` per touched table at end-of-load (Merge mode dedupes by `id`, last-write-wins); publisher CAS. 
| -For all three modes, a mid-load failure (RI / cardinality violation, validation error) leaves Lance HEAD untouched on the staged tables – the next load on the same tables proceeds normally with no `ExpectedVersionMismatch`. +For Append/Merge, a mid-load failure (RI / cardinality violation, validation error) leaves Lance HEAD untouched on the staged tables – the next load on the same tables proceeds normally with no `ExpectedVersionMismatch`. For Overwrite, a mid-load failure can still leave Lance HEAD on a partially-truncated table; the next overwrite replaces it. -## `load` and the deprecated `ingest` shims +## `load` vs `ingest` -- `load_as(branch, base, data, mode, actor)` – the unified entry (single publisher commit per call). `base: Some(b)` forks a missing `branch` from `b` first (via `branch_create_from_as`, which enforces `BranchCreate`); `base: None` requires the branch to exist – staging fails on an unknown branch, so a typo'd name can never create one. -- `load(branch, data, mode)` – convenience wrapper with `base: None` and no actor. -- Returns `LoadResult { branch, base_branch, branch_created, nodes_loaded, edges_loaded }`. -- `ingest{,_as,_file,_file_as}` are `#[deprecated]` shims over `load_as` preserving the historical contract (`from: None` forks from `main`; returns `IngestResult`); they are slated for removal. The CLI `ingest` command is a deprecated alias of `load --from `. +- `load(branch, data, mode)` – direct load to a branch (single publisher commit per call). +- `ingest(branch, from, data, mode)` – branch-creating wrapper: if `branch` doesn't exist, fork it from `from` (default `main`) via `branch_create_from`, then call `load(branch, data, mode)`. +- Returns `IngestResult { branch, base_branch, branch_created, mode, tables[] }`. +- `ingest_as(actor_id)` records the actor on the resulting commit. ## Embeddings during load -The loader does **not** embed `@embed` properties at load time.
`@embed` is a catalog annotation consumed by query typecheck/lint; vectors are supplied directly in the load data, or pre-filled by the offline `omnigraph embed` pipeline. Query-time `nearest($v, "string")` auto-embeds the query string via the provider-independent embedding client. See [embeddings.md](../user/search/embeddings.md). (Ingest-time `@embed` execution is a planned RFC-012 phase.) +If a node type has `@embed` properties, the loader calls the engine embedding client (Gemini, RETRIEVAL_DOCUMENT) per row to populate the vector column. See [embeddings.md](../user/embeddings.md). diff --git a/docs/dev/handoff-rfc-013-write-path.md b/docs/dev/handoff-rfc-013-write-path.md deleted file mode 100644 index 3706012..0000000 --- a/docs/dev/handoff-rfc-013-write-path.md +++ /dev/null @@ -1,460 +0,0 @@ -# Handoff: finishing RFC-013 (write-path latency + correctness) - -**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)** – -this doc is the *current-state map + the decisions/validation from the latest work cycle -+ the concrete next actions*. When they disagree, the RFC wins (and fix this doc). - -**Audience:** the engineer/agent who picks up RFC-013 next. - ---- - -## 0. TL;DR – where we are and what's next - -RFC-013 makes the write path fast **and** correct on object storage (217 Lance tables -under one `__manifest` catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC -for the canonical list. Current reality: - -**Landed on `main`:** -- **Step 1** – Tier-1 cost gate + the shared `helpers::cost` harness (#288). -- **Step 3a** – opener bypass: write opens go direct (`Dataset::open` by URI + version) - instead of the Lance-namespace builder (#288). **This already banked the dominant - depth win** – see §2 below; it reframes everything. -- **Step 2a** – internal-table compaction: `optimize` now compacts `__manifest` / - `_graph_commits` / `_graph_commit_actors` (#291).
Plus the RFC latency-model - correction (#292). -- **Optimize-vs-write race** – optimize survives a cross-process write race on the - same table (#297, **LANDED** – origin/main `6d4606a8`; see §6 for why it's not - redundant with Design A). Step 3b stacks on top of this. - -**Open PRs (land these; relationships in §7):** -- **#296** `correctness-by-design-fix` – recovery roll-forward converges on a concurrent - manifest advance (the fix for the flaky `iss-schema-apply-reopen-recovery-race`). - **MERGED to main and integrated into this branch** – the converge helper now threads - Phase-7's manifest-CAS recovery `graph_commit_id` (see `converge_or_defer_roll_forward`). -- **#295** `docs/rfc-013-step-3b` – the step-3b RFC doc. -- **#254** `ragnorc/bug-4-schema-apply-occ` – schema-apply vs optimize false-fail - (same op-class family as #297, logical side). - -**Step 3b is DONE** (capture-once `WriteTxn`, schema-once + open-collapse; see §4) on -`rfc-013-step-3b-writetxn-v2`. **Next: Phase 7 (step 4), then the big one – Design A / -`PublishPlan` unification (step 5)** – see §5, the convergent fix for the bug *class* this -area keeps generating, which also absorbs 3b's deferred session-aware write opens. - ---- - -## 1. The corrected mental model (read this before touching anything) - -Three reframes from the latest cycle that the older RFC prose may not fully reflect: - -### 1a. 3a already won the depth fight → the residual is constant-factor + RTT -Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and -that path was **O(depth)** (it re-opened `__manifest` + `list_table_versions` per open – -**not** a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not -Lance – validated against Lance source). 3a swapped it for the direct opener, which is -**O(1)** (`from_uri(loc).with_version(N)` = arithmetic path + one HEAD). So: - -- The dominant **O(depth) data-table** term is **gone**. 
-- Step 2a flattened the secondary **internal-table** scan term. -- What remains is the **~110-hop serial backbone × RTT + compute** – a constant in - depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT - + compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an - unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after - 2a; the remaining 15.76s is the serial backbone – **step 3b's target**, not step 2's. -- Step 3b's win is therefore the **call-count/RTT collapse** (redundant opens, the - flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from - 3b; gate it on the constant-factor (S3 round-trips), not a curve. - -### 1b. Two op classes, two commit models (the §6.6 principle) -Every concurrency bug in this area is **one op class using the other's commit model**: - -| class | examples | commutes? | correct commit model | -|---|---|---|---| -| **maintenance** | compaction (`Rewrite`), `optimize_indices` | yes (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** – no epoch, no read-set | -| **logical mutation** | load / mutate / merge / delete | no (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence | - -Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance -(false conflicts – #297's bug), too weak for logical cross-process (§6.5 corruption). - -### 1c. The root liability (what keeps generating these bugs) -Lance gives **per-table atomic commits** but **no cross-table/cross-step atomicity**, so -every multi-commit op advances per-table Lance HEAD **before** the manifest references it -(the "A-before-B window"). The resulting `HEAD vs manifest` delta is **ambiguous** -(external drift? my own in-flight work?
a crashed writer?), and **many uncoordinated code -paths each re-interpret it** (4 writers + the maintenance path + recovery + the write-path -drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class: -- §6.5 cross-process logical corruption, -- #297's own-HEAD-drift misclassification, -- the flaky write-path "HEAD ahead of manifest, run repair" guard, -- the recovery classifier edges. - -**The convergent fix is Design A (one publish authority – step 5); Lance MTT eventually -retires the window entirely.** See §5. - -### 1d. The second facet: the write base is a stale pin (no probe) -The READ path resolves its base behind a freshness probe (`resolve_target_inner` -omnigraph.rs:~1072 → `probe_latest_incarnation` → `refresh_manifest_only`); the WRITE path -does NOT (`resolved_branch_target` omnigraph.rs:~778 returns the warm `coord.snapshot()` for -the bound branch, no probe). So a long-lived server's write base lags the live manifest. That -single staleness feeds **two distinct failure modes**, both surfaced this cycle: - -1. **Stale validation *reads* → integrity under-enforced.** Write-path RI checks read - committed state off the stale base. 3b's collapse #1 made it worse for edge `@card`: - `edge_cardinality_read_handle` (mutation.rs:~614) scans the pinned `txn.base` instead of - live HEAD (was live HEAD pre-3b), so a concurrent edge committed after `txn` capture is - uncounted → a `@card` max can be exceeded (cursor **High** / codex **P1** on #298, - **VALID**). **#298 fix: restore the live-HEAD read for that scan** (un-regress; gate-safe – - the `data_open_count` gate is a node insert) + a deterministic regression test (commit A's - edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc - comment (mutation.rs:~605-613, which assumes a single writer).
The *structural* liability - underneath: there is **no unified write-validation read-set** – endpoint - (`ensure_node_id_exists`, warm `snapshot_for_branch`), cardinality (mutation: pinned - `txn.base`; loader: warm `snapshot_for_branch` – the SAME check forks per write path), - commit drift guard (live `fresh_snapshot_for_branch`), and uniqueness - (`enforce_unique_constraints_intra_batch`, intra-batch only – cross-version uniqueness is a - documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the - §7.1 TOCTOU class, and each new constraint forks the pattern again. - -2. **Stale OCC *pin* → false-fail on a maintenance advance.** A served strict update/delete - pins the stale base version, then false-fails `ExpectedVersionMismatch` after an external - `optimize` advanced `__manifest` – even though the advance was content-preserving - compaction the logical write should fast-forward past (invariant 7). It's the **write-side - mirror of #297/§6.6** (#297 made optimize fast-forward past a logical write; this is a - logical write that must fast-forward past optimize). A served read clears it (the read - probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) + - `writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` - (`#[ignore]` on branch `fix/write-path-stale-view-probe`). **The naive "just probe" fix is - proven wrong** – a blanket probe silently refreshes past *logical* advances too, breaking - `consistency::stale_handle_public_mutation_must_refresh_then_retry` (the deliberate - cross-process lost-update OCC primitive). The fix must **discriminate by op class**. 
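The "discriminate by op class" rule in §1d.2 can be illustrated with a small, self-contained sketch. Everything below is hypothetical: `OpKind`, `AdvanceClass`, `PinDecision`, `classify_advance`, and `resolve_strict_pin` are names invented for illustration and are not Lance's or OmniGraph's types. The sketch assumes the caller has already obtained the list of operation kinds committed between the pinned version and the live version; how that list is read from storage is out of scope here.

```rust
/// Hypothetical operation kinds, named after the ones this handoff mentions.
/// This is NOT Lance's transaction type; it only models the rule.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OpKind {
    Rewrite,
    ReserveFragments,
    Append,
    Update,
    Delete,
    Merge,
}

#[derive(Debug, PartialEq, Eq)]
enum AdvanceClass {
    /// Content-preserving: every intervening commit was maintenance.
    Maintenance,
    /// At least one intervening commit changed logical content.
    Logical,
}

#[derive(Debug, PartialEq, Eq)]
enum PinDecision {
    /// Live version equals the pinned version; nothing to do.
    Unchanged,
    /// Only maintenance landed since the pin: move the pin forward.
    FastForward { to: u64 },
    /// A logical write landed since the pin: fail loudly.
    Conflict { expected: u64, actual: u64 },
}

/// Classify the commits that landed after the pin.
/// An empty history is treated as `Logical` so the check fails closed
/// when the version moved but no metadata is available to explain it.
fn classify_advance(ops: &[OpKind]) -> AdvanceClass {
    let all_maintenance = !ops.is_empty()
        && ops
            .iter()
            .all(|op| matches!(op, OpKind::Rewrite | OpKind::ReserveFragments));
    if all_maintenance {
        AdvanceClass::Maintenance
    } else {
        AdvanceClass::Logical
    }
}

/// Decide what a strict write should do with its pinned base version.
fn resolve_strict_pin(pinned: u64, live: u64, ops_since_pin: &[OpKind]) -> PinDecision {
    if live == pinned {
        return PinDecision::Unchanged;
    }
    match classify_advance(ops_since_pin) {
        AdvanceClass::Maintenance => PinDecision::FastForward { to: live },
        AdvanceClass::Logical => PinDecision::Conflict {
            expected: pinned,
            actual: live,
        },
    }
}
```

In this model a blanket probe would correspond to always returning `FastForward`, which is the behavior the text above calls proven wrong; the `Conflict` arm is what keeps the lost-update primitive intact.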
-**Both fold into Design A (step 5), same as §1c.** `open_txn`'s one warm probe makes the base -fresh (absorbs maintenance advances cheaply); the **op-class-aware strict precondition** – -derive from Lance's per-version transaction metadata (all `Rewrite`/`ReserveFragments` = -maintenance → fast-forward the pin; any `Append`/`Update`/`Delete`/`Merge` = logical → fail -loudly; NO parallel marker, invariant 1/15) – is the correctness fence for anything that lands -after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the -`graph_head` contention. So **the stale-view false-fail, the cardinality/validation-read-set -liability, and #297's mirror are one bug** (the write base is a stale, un-probed, un-classified -pin) with **one home: the single PublishPlan delta-interpreter** (§1c + §5). Strong corroboration -of Design A – three symptoms, one fix. - ---- - -## 2. Validated facts – do NOT re-derive these - -Established this cycle against **Lance 7.0.0 source** -(`~/.cargo/registry/src/index.crates.io-*/lance-7.0.0`) and current engine code. Cited so -you can trust them without re-investigating. - -**Lance (upstream):** -- `from_uri(loc).with_version(N).load()` and `checkout_version(N)` are **O(1)** (computed - V2 path `_versions/{u64::MAX-N:020}.manifest` + one HEAD; no listing/back-walk). - (`lance-table/src/io/commit.rs` `default_resolve_version`.) -- A shared `Arc` (`DatasetBuilder::with_session`) warms metadata/index caches - keyed by `(URI, version, e_tag)`. Caveat: the *first* manifest read on open is uncached - – the Session warms the *scan/index* metadata, not the first open. **`WriteParams` *does* - carry a `session` field** (`lance/src/dataset/write.rs`), but it only matters on the - `WriteDestination::Uri` arm; OmniGraph's staged path always drives off an **already-open - `Dataset`**, and Lance takes the store/session from that handle.
So to attach the shared - Session to a write base, open read-style (`open_table_dataset` → `from_uri().with_version() - .with_session()`) and drive the staged write off that handle. -- A held `Arc` at a pinned version is `Send + Sync`, immutable, safe to reuse for - many scans/count/staged-write base in one txn (OmniGraph's `TableHandleCache` already - relies on this). -- **No compaction `RetryExecutor`** (only Delete/MergeInsert/Update have one). - `commit_compaction` commits a fixed `Rewrite` via `apply_commit` direct. In - `commit_transaction`, a semantic `RetryableCommitConflict` **escapes the retry loop** - via `?` at `io/commit.rs:979`; the loop only retries the OCC `CommitConflict` - (`:1096`), and even that re-rebases the *same* transaction (never re-plans). ⇒ - **compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let - Lance own it.** -- `check_rewrite_txn`: a `Rewrite` rebases **cleanly** past a concurrent `Append`/disjoint - `Update`/`Delete` (preserving both); only a same-fragment overlap yields a retryable - conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app - retry fires only on real overlap. - -**Engine (internal):** -- Read path (post-#268) already has the capture-once machinery: `Snapshot` (`db/manifest.rs`), - warm `GraphCoordinator` behind a `latest_version_id`/incarnation probe, a held - `TableHandleCache` keyed `(table,branch,version,e_tag)`, **one shared `Session` per - graph** (`read_caches.session`). **Writes bypass all of it by construction** - (`resolved_branch_target` returns `read_caches: None`; the 3a write opener attaches no - session and opens by latest, not pinned version). -- A single write opens each table **3–4×** (accumulation → staging reopen → commit - drift-guard → publish prepare), each a fresh cold open.
`validate_schema_contract` - (`db/schema_state.rs`, via `ensure_schema_state_valid`) runs uncached (~3 `read_text` - + 2 `exists`) at every resolve point (~the flat-46). Both are constant-factor, flat in - depth – 3b's targets. -- Strict-op guards are the lost-update floor (3 layers: pre-stage `ensure_expected_version` - `table_store.rs`; commit-time strict drift `exec/staging.rs`; publisher CAS - `publisher.rs`). Capture-once **supplies** the pinned operand – never remove a guard. -- Fork-on-first-write authority reads (`classify_fork_ref` → `fresh_snapshot_for_branch`) - must stay **fresh** (not served from a pinned base). -- Cost harness: `helpers::cost` (`measure`/`measure_with_staged`/`IoCounts`/`assert_flat`/ - `local_graph`/`s3_graph`). The schema-once assert can reuse `CountingStorageAdapter` - (`warm_read_cost.rs::warm_query_validates_schema_contract_once`) with **zero** prod - change; an open-count assert wants a small `open_count` AtomicU64 in `QueryIoProbes` - (copy the `probe_count`/`record_probe` pattern). The forbidden-API guard - (`tests/forbidden_apis.rs`) makes an instrumentation-level counter complete. - ---- - -## 3. The #297 cycle (this branch) – what it is, and the lesson - -`fix-optimize-concurrency-race` (5 commits): a CLI `optimize` racing a served write on the -same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both -compaction paths on the internal path's **reopen+replan** shape, with a **two-level retry** -– outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest -publish a **monotonic fast-forward** (advance to compacted version `N`, or no-op when the -manifest already moved to `≥ N`), never the strict equality CAS. Sidecar written once; -in-process queue kept as a contention reducer (not the cross-process guard); no `writer_epoch`. 
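The monotonic fast-forward described in §3 (advance to `N`, or no-op when the pointer is already at or past `N`) can be modeled in a few lines. This is only a sketch under stated assumptions: an in-memory `AtomicU64` stands in for the manifest's version pointer, and `Publish` and `publish_monotonic` are hypothetical names. The real publish is a merge-insert into `__manifest` with a bounded retry, which this model does not reproduce.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Outcome of a monotonic publish attempt (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
enum Publish {
    /// The pointer moved forward from `from` to `to`.
    Advanced { from: u64, to: u64 },
    /// Someone else already published a version >= the target;
    /// the postcondition holds, so this is success, not a conflict.
    AlreadyAtOrPast { current: u64 },
}

/// Advance `manifest` to `target` unless it is already at or past it.
/// The check is ">=", never "no assertion": the pointer can only move forward.
/// Assumption: an in-memory atomic models the manifest; the loop here is
/// unbounded for brevity, whereas the handoff describes a bounded retry.
fn publish_monotonic(manifest: &AtomicU64, target: u64) -> Publish {
    let mut current = manifest.load(Ordering::Acquire);
    loop {
        if current >= target {
            return Publish::AlreadyAtOrPast { current };
        }
        match manifest.compare_exchange(current, target, Ordering::AcqRel, Ordering::Acquire) {
            Ok(prev) => return Publish::Advanced { from: prev, to: target },
            // Lost the race: re-read the observed value and re-check the
            // postcondition instead of reporting an error.
            Err(observed) => current = observed,
        }
    }
}
```

The design point the sketch tries to show is that losing the compare-exchange is not an error by itself; the loop re-checks whether the observed value already satisfies `>= target`, which mirrors the "postcondition is the state, not winning the CAS" principle.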
-**Two review rounds surfaced two follow-on bugs I introduced with the retry loop** – both -fixed, both regression-tested (own-HEAD-drift via negative control): -1. **Own-HEAD-drift misclassification** (`56d004e0`): the drift guard re-ran every - iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a - later op conflicts), saw `HEAD > manifest` from *our own* covered work and deleted the - sidecar + returned `skipped_for_drift` (stranding uncovered drift). Fix: track - `head_advanced`; the drift guard fires only when `!head_advanced`. -2. **Publish exhaustion spurious error** (`e9d16a2c`): the publish loop returned `Err` on - its final retry even if the conflict meant a concurrent writer already published `≥ N` - (postcondition met). Fix: re-check `current >= state.version` on exhaustion. - -**The lesson (write it on the wall):** *wrapping a sequence of side-effecting commits in a -retry silently converts every "checked once, before any side effect" precondition into -"re-checked after partial side effects."* That's a distinct bug class; it needs -fault-injection tests **at each commit boundary**, not just end-to-end concurrency tests. -(The `optimize.before_compact` / `optimize.inject_reindex_conflict` failpoints exist for -exactly this.) - -**Temporary mechanism flag:** `head_advanced` is an in-memory proxy for "is this HEAD -movement mine." Under Design A the authority answers that from the plan/sidecar **identity** -– so `head_advanced` is the part that gets *replaced*, while the monotonic-publish + -reopen/replan **semantics** are permanent. (Noted in RFC §6.6.) - ---- - -## 4. DONE: Step 3b – capture-once `WriteTxn` (shipped on `rfc-013-step-3b-writetxn-v2`) - -**Delivered:** on the **table-touch hot path**, a single `mutate`/`load` validates the schema -contract **once** and opens each touched data table **at most once** – a constant-factor/RTT -win (not a depth-slope win; 1a).
Two cost gates in `write_cost.rs` lock it (both on a node -insert): `write_validates_schema_contract_once` (3 `read_text` / 2 `exists`, was 12/9) and -`keyed_insert_opens_table_at_most_once` (`data_open_count <= 1`, was 4). The carrier is the -minimal `WriteTxn { branch, base }`, threaded as `Option<&WriteTxn>` (`Some` on the hot -mutate/load path, `None` byte-identical everywhere else); it **converges into** step 5's -`PublishPlan`. - -**Not "once" everywhere (scope, not regression):** edge endpoint / cardinality RI validation -(`ensure_node_id_exists`, the loader's RI + cardinality) still resolves through -`snapshot_for_branch` and re-validates the schema – and reads **warm**, not live. Threading -`txn.base` there to make it "once" would re-introduce the stale-read class the #298 cardinality -fix removed (it now reads live HEAD). Doing schema-once *and* fresh reads for those validations -needs the unified, re-checked read-set – **step 4 §7.1** (§1d). So #298 **un-regresses -cardinality only; it does not close write-validation freshness.** No edge-insert/load schema-once -gate yet (only the node gates above). - -Commits (off merged-#297 main): -- **Stage 0** – scope `open_count` → `data_open_count`/`internal_open_count` by URI class - (the review fix: `open_dataset_tracked` also opens `__manifest`/`_graph_commits`, so the - raw counter conflated them and the gate was unreachable). Re-baselined RED 4. -- **Commit A (schema-once)** – capture `txn` once at entry (the single validation); the 4 - validation sites collapse: S1 (entry `ensure_schema_state_valid`) removed; S3a - (`open_for_mutation_on_branch`) + S3b (`prepare_updates_for_commit`) source `txn.base`; - S4 (`commit_all`) uses new `fresh_snapshot_for_branch_unchecked` (the OCC manifest re-read - minus the schema re-validation). `fresh_snapshot_for_branch{,_unchecked}` now read the - manifest directly via `ManifestCoordinator` (drops a spurious commit-graph `exists` probe; - same `Snapshot`). 
-- **Commit B (open collapse 4→1)** – #1 accumulation open ELIMINATED (the node path discarded - the handle; read `txn.base.entry().table_version`); #2 staging open KEPT (the one open); - #3 commit drift-guard reads live HEAD via `entry.dataset.dataset().latest_version_id()` (a - cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses - the `commit_staged` handle threaded through `CommittedMutation`/`prepare_updates_for_commit`. -- **Commit B.1 + cleanup** – named the two positional returns (`OpenedForMutation`, - `CommittedMutation`) + a `debug_assert` pinning the open-skip contract; **removed the - unearned `WriteTxn.session` field** (the collapse uses skip/probe/reuse, not a session). - -**RFC §4.1 corrections – how they resolved:** -1. *Thread the evolving handle, not a version-keyed cache* → realized as collapse #4 (carry - the `commit_staged` handle forward into the index build). -2. *Don't forbid re-resolution* → honored: the commit-time OCC re-read - (`fresh_snapshot_for_branch_unchecked` – fresh manifest, only schema-revalidation dropped) - and the fork-authority reads stay fresh. -3. *Minimal carrier* → `WriteTxn { branch, base }` (even the `session` from the original - sketch was dropped as unearned). - -**Deferred to step 5 (NOT in this PR):** session-aware write base opens. The one remaining -open (#2) stays a HEAD open; warming the shared `Session` across writes is an object-store -(S3) phenomenon invisible on local FS, so it earns its own `write_cost_s3.rs` gate in step 5, -where `txn` becomes the non-optional publish carrier. No new concurrency test was needed here: -#2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD -probe fences are unchanged (covered by the green `writes.rs`/`consistency.rs`). 
-**Guardrails (don't regress):** schema validation is deliberately uncached for drift -detection – collapse to 1 *per write*, never cache across writes on a long-lived handle -(`lifecycle::long_lived_handle_rejects_schema_*`). The commit-time fresh read is OCC -machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh. -Pin the *correct* branch (server-bound-to-main writing a feature branch falls to a fresh -open). A branch `rfc-013-step-3b-writetxn` exists off an earlier main; rebase onto the -post-#297 main before starting. - ---- - -## 5. Design A – the `PublishPlan` unification (step 5) = the convergent fix - -**This is the real fix for the bug class in §1c.** Collapse the four hand-rolled writers + -the maintenance path into **one `publish(txn, plan)` authority** where the CAS + bounded -retry is **unconditional and unbypassable** (no caller can "hold the queue → skip the CAS"). -Properties: -- **One interpreter of the `HEAD vs manifest` delta** – and "is this my work?" is answered - by the plan/sidecar **identity**, not a re-derived comparison. The own-HEAD-drift bug, the - §6.5 writers, the write-path guard – all close *by construction*. -- **Recovery = the same `PublishPlan` re-applied** – the crash-recovery interpreter and the - live interpreter become the same code (`iss-merge-recovery-partial-rollforward` gone). -- Each `TableAction` commits by its **class** (§1b): `Rewrite` = maintenance (Lance rebase - + reopen/replan + monotonic fast-forward, **no epoch**); load/mutate = logical (strict OCC - + `writer_epoch`). - -**Why it composes with Lance MTT (don't over-build):** -- The **unification itself is convergent** – when MTT lands, it slots *underneath* the same - authority; nothing wasted. Build this. -- The **`writer_epoch`** is the one MTT-redundant piece (MTT's commit-handler lease subsumes - a cross-process fence). Build it *last and minimally*, gated on actually deploying - multi-writer topologies.
Per the deny-list, don't reimplement what the substrate will own. - -**Sequencing judgment (this cycle's strongest signal):** the bug density here (this PR alone -= 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim -is high integrated-over-time liability. **Consider pulling the *convergent half* of step 5 -(the single authority + recovery-as-plan) forward – possibly ahead of 3b** – because it stops -the bug class rather than patching instances. #297 + #254 are the *de-risking inputs*: they -validate the maintenance-class and logical-class commit models in isolation first, so Design -A implements a known spec rather than designing under refactor pressure. Do NOT build more -substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper -over the window – strictly higher liability than either Design A or waiting for MTT. - -**Deeper-than-A (post-MTT or as Lance exposes uncommitted variants):** all-uncommitted-fragments -+ one manifest commit would shrink the A-before-B window itself, blocked today by Lance not -exposing uncommitted variants for `compact_files` / `optimize_indices` / vector index (#6666 -open; delete #6658 shipped). Track, don't build yet. - -### 5.1 Step-5 design constraints inherited from the #295 spec review -3b shipped a **minimal** `WriteTxn { branch, base }` (schema-once + open-collapse via -eliminate/probe/thread) and **deferred** the full §4.1 opener-unification – the pinned-base -opener, the shared-`Session` open, the write-local **handle cache**, and the strict-op -conflict-timing move – to step 5. So the greptile-bot comments on the #295 *spec* were **moot -for #298** (which built none of those constructs) but are **load-bearing constraints for step -5** when it builds them. Bank them: -1.
**Handle cache must be `Send + Sync`** (`Mutex>`, not `RefCell`) if - `WriteTxn::open(&self)` is shared across concurrent stage futures – a `RefCell` compiles - but panics when two stages poll. Or make it `&mut self` (no parallel-stage sharing). This - is the deny-list "in-process-only `Dataset` impls – `Send + Sync`" item. -2. **The strict-op timing move needs an explicit retry contract.** If step 5 moves - strict-op conflict detection from open-time `ensure_expected_version` to commit-time CAS - (the §4.1 pinned-base design), it MUST specify: the txn is **discarded after any commit** - (success or conflict – the handle cache is commit-invalidated), and the retry **re-opens a - fresh `WriteTxn` at the new HEAD** (never re-stages against the stale pinned base – that - reproduces the lost-update). **This is the same retry/refresh contract as the stale-view - false-fail (§1d.2)** – the op-class-aware precondition + "fresh base on retry" are one - design point. Today (#298) strict ops keep open-at-HEAD + `ensure_expected_version`, so the - contract is unchanged; step 5 owns it the moment it pins strict reads to the base. -3. **The opener-equivalence test must be non-trivial.** A differential test that only passes - when `HEAD == base` proves nothing about pinning. To actually prove "`WriteTxn::open` - returns the pinned base, not HEAD," the test must **advance the branch HEAD externally - (direct Lance write), then assert the txn open still reads the base version** – and that a - strict write then fails `ExpectedVersionMismatch` at commit (verifying the timing move). - ---- - -## 6. Why #297 is still needed even if you do Design A -- Design A **relocates** #297's maintenance-class commit logic into the authority's - `TableAction::Rewrite` path; it does not eliminate it. #297 is the *validated spec + tests*. -- The two regression tests + §6.6 are the **contract** Design A must keep green. 
-- The prod bug is **live**; Design A is the largest write-path change in the RFC. Don't hold a - correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency. -- Genuinely throwaway under Design A: only the loop's *location* + the `head_advanced` proxy - (~a dozen lines). Everything else relocates or persists. **#297 LANDED.** - ---- - -## 7. Open PRs and their relationships -- **#297** – maintenance-class fix (optimize vs write). **LANDED** (origin/main `6d4606a8`); - step 3b stacks on it. -- **#254** – logical-class fix (schema-apply vs optimize false-fail). Same op-class family; - both are de-risking inputs for Design A's per-class commit models. -- **#296** – recovery roll-forward converges on concurrent manifest advance. The fix - for the flaky `iss-schema-apply-reopen-recovery-race`. It touches `recovery.rs` and is - *aligned* with #297's "postcondition is the state, not winning the CAS" principle. **#296 - landed on main first and is merged into this branch:** the converge helper - (`converge_or_defer_roll_forward`) was reconciled with Phase-7's manifest-CAS roll-forward – - on convergence the audit references the winner's folded `graph_commit_id` (the current - `graph_head`), not a freshly minted one. -- **#295** – the step-3b RFC doc (apply §4's three corrections to it). - ---- - -## 8. Remaining RFC steps after 3b (RFC §9 is canonical) -- **#298 follow-up (do on the 3b PR, before merge): the edge-`@card` stale-read regression** - (§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix - the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The - residual concurrent TOCTOU is the §7.1 gap (step 4) – un-widen here, don't over-reach. -- **Step 4 / Phase 7** (`iss-991`): lineage into `__manifest` (publish `graph_commit` + - mutable `graph_head:` in the same merge-insert; `_graph_commits` becomes a - projection).
Removes the per-write `commit_graph.refresh`; closes the manifest→commit-graph - atomicity + commit-graph-parent-under-concurrency gaps. **Hard prereq: step 2 (done).** - Carries the §7.1 *concurrent* write-skew fix (needs the `graph_head` contention row) — - **frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality + - cross-version uniqueness), not merely "add `graph_head`"** (§1d.1): the bespoke - `edge_cardinality_read_handle` and the mutation-vs-loader freshness fork dissolve into one - pinned read-set re-validated under the `graph_head` contention, or the liability survives as - a second special-case. -- **Step 5 / Design A** — §5 above. **Acceptance item: the served-strict-write stale-view - false-fail** (§1d.2) — the op-class-aware precondition + `open_txn` probe. The contract is - two tests passing *together*: un-ignore - `writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` (goes green) - *while* `consistency::stale_handle_public_mutation_must_refresh_then_retry` stays green - (maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone - like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter. -- **Step 2b** — internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag). - Deferred: only the secondary version-count/space term, touches the read/open path, and is - MTT-redundant. Land when version-count cost bites. -- **§7.1 sequential write-skew** (`iss-overwrite-orphans-committed-edges`) — inbound-RI - validation on node removal; independent, ships anytime. -- **#20** — the prod per-write `storage.ops` span metric (RFC §5.3), still owed. -- Branch ops: Lance `Clone` for create (`iss-691`). - ---- - -## 9. Gotchas / traps (learned the hard way) -- **In-process queue ≠ cross-process lock.** Any "I hold the queue → skip the retry/CAS" - reasoning is a bug across processes. 
This is the recurring trap. -- **Monotonic publish must be `≥`-conditional, never "no assertion."** The `__manifest` - merge-insert is unconditional `UpdateAll` keyed on `object_id` (`publisher.rs:379`), so - the equality (or monotonic) pre-check is the *only* guard — dropping it lets `UpdateAll` - regress a newer version = lost write. -- **The drift guard interprets an ambiguous delta.** Re-evaluating it in a retry over - self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest - interpretation on "have *we* committed yet." -- **`compact_files` fires Lance's auto_cleanup GC hook** (commits with - `skip_auto_cleanup=false`, no override) — optimize strips stale `lance.auto_cleanup.*` - config before compacting to stay non-destructive on upgraded graphs. The strip is a - separate commit (relevant to the partial-commit retry trap). -- **Lance rebases the common concurrent case for free** — so the data-table conflict usually - surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is - rare and needs failpoint injection to test. - ---- - -## 10. Verification (the gate) -- `cargo test --workspace --locked` — the canonical gate (matches CI). -- `cargo test -p omnigraph-engine --features failpoints --test failpoints optimize` — - the optimize concurrency/recovery tests. -- `cargo test -p omnigraph-engine --test write_cost` / `write_cost_s3` (bucket-gated) — - cost gates (3b adds the schema-once + open-count asserts here). -- `cargo test -p omnigraph-engine --test maintenance` — optimize/repair/cleanup. -- Re-read [`invariants.md`](invariants.md), [`lance.md`](lance.md), [`testing.md`](testing.md) - before each change (always-on requirement). - -Lance source for re-validation: -`/Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0` (key files: `io/commit.rs`, -`io/commit/conflict_resolver.rs`, `dataset/optimize.rs`, `dataset/write/retry.rs`, -`dataset/builder.rs`). 
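The monotonic-publish gotcha above can be pictured with a toy model. This is a minimal sketch under stated assumptions, not the project's publisher: `ToyManifest`, `publish_monotonic`, and `PublishOutcome` are hypothetical names, and a `HashMap` stands in for the `__manifest` merge-insert keyed on `object_id`.

```rust
use std::collections::HashMap;

/// Result of a guarded publish attempt (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum PublishOutcome {
    Applied,
    RejectedStale { current: u64, attempted: u64 },
}

/// Toy stand-in for the manifest: object_id -> published table version.
#[derive(Default)]
struct ToyManifest {
    rows: HashMap<String, u64>,
}

impl ToyManifest {
    /// Unconditional upsert: whatever version arrives overwrites the row.
    /// On its own this can move a row backwards.
    fn update_all(&mut self, object_id: &str, version: u64) {
        self.rows.insert(object_id.to_string(), version);
    }

    /// Guarded publish: the pre-check is the only thing that stops a
    /// stale writer from regressing a newer version.
    fn publish_monotonic(&mut self, object_id: &str, version: u64) -> PublishOutcome {
        match self.rows.get(object_id).copied() {
            // The incoming version is older than what is published: refuse.
            Some(current) if version < current => PublishOutcome::RejectedStale {
                current,
                attempted: version,
            },
            // New row, same version, or a newer version: apply the upsert.
            _ => {
                self.update_all(object_id, version);
                PublishOutcome::Applied
            }
        }
    }
}
```

Calling `update_all` directly with an older version silently overwrites the newer row, which is the lost write the gotcha describes; routing every publish through the guarded path keeps the row non-decreasing.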
diff --git a/docs/dev/index.md b/docs/dev/index.md index be98602..83df8c8 100644 --- a/docs/dev/index.md +++ b/docs/dev/index.md @@ -20,28 +20,28 @@ constraints. User-facing behavior should still be documented through | Area | Read | |---|---| | System structure, L1/L2 framing, component diagrams | [architecture.md](architecture.md) | -| On-disk layout, manifest schema, URI behavior | [storage.md](../user/concepts/storage.md) | -| Direct-publish writes, D2, staged writes, recovery sidecars | [writes.md](writes.md) | +| On-disk layout, manifest schema, URI behavior | [storage.md](../user/storage.md) | +| Direct-publish writes, D2, staged writes, recovery sidecars | [runs.md](runs.md) | | Query execution, mutation execution, loader flow | [execution.md](execution.md) | -| Index lifecycle and graph topology indexes | [indexes.md](../user/search/indexes.md) | -| Branch and commit internals | [branches-commits.md](../user/branching/index.md) | +| Index lifecycle and graph topology indexes | [indexes.md](../user/indexes.md) | +| Branch and commit internals | [branches-commits.md](../user/branches-commits.md) | | Three-way merge implementation and conflicts | [merge.md](merge.md) | -| Diff/change-feed implementation | [changes.md](../user/branching/changes.md) | +| Diff/change-feed implementation | [changes.md](../user/changes.md) | | Branch protection policy | [branch-protection.md](branch-protection.md) | +| CODEOWNERS source of truth | [codeowners.md](codeowners.md) | ## Language, Runtime, And Boundaries | Area | Read | |---|---| -| Schema grammar, catalog, migration planner | [schema-language.md](../user/schema/index.md) | -| Query grammar, IR, lints, mutation restrictions | [query-language.md](../user/queries/index.md) | -| Embedding client and `@embed` integration | [embeddings.md](../user/search/embeddings.md) | -| Cedar policy surface and server gating | [policy.md](../user/operations/policy.md) | -| Server auth, OpenAPI, endpoint handlers | 
[server.md](../user/operations/server.md) | -| Error taxonomy and serialization | [errors.md](../user/operations/errors.md) | -| Constants and tunables | [constants.md](../user/reference/constants.md) | -| Transaction model public contract | [transactions.md](../user/branching/transactions.md) | -| User-doc coherence cleanup ledger | [docs-issues.md](docs-issues.md) | +| Schema grammar, catalog, migration planner | [schema-language.md](../user/schema-language.md) | +| Query grammar, IR, lints, mutation restrictions | [query-language.md](../user/query-language.md) | +| Embedding client and `@embed` integration | [embeddings.md](../user/embeddings.md) | +| Cedar policy surface and server gating | [policy.md](../user/policy.md) | +| Server auth, OpenAPI, endpoint handlers | [server.md](../user/server.md) | +| Error taxonomy and serialization | [errors.md](../user/errors.md) | +| Constants and tunables | [constants.md](../user/constants.md) | +| Transaction model public contract | [transactions.md](../user/transactions.md) | ## Project Operations @@ -51,28 +51,6 @@ constraints. User-facing behavior should still be documented through | Install and deployment packaging | [install.md](../user/install.md), [deployment.md](../user/deployment.md) | | Release history | [releases/](../releases/) | -## Contribution & Governance - -| Area | Read | -|---|---| -| How to contribute (external) | [CONTRIBUTING.md](../../CONTRIBUTING.md) | -| Governance model, roles, decision authority | [GOVERNANCE.md](../../GOVERNANCE.md) | -| Public contribution RFC track | [rfcs/](../rfcs/) | - -The `docs/rfcs/` track is the **public, externally-authorable** RFC process. The -maintainer/internal RFCs below (`rfc-00N-*.md`) are a separate, team-owned -track; don't conflate the two. - -## Case Studies - -Worked write-ups of specific bugs — root cause, fix, and the reasoning that -ruled out the tempting-but-wrong alternatives. Read these for the debugging -pattern, not just the outcome. 
- -| Area | Read | -|---|---| -| camelCase property filters lowercased at runtime (#283) — two engine→Lance boundaries, two different fixes | [bug-case-fix.md](bug-case-fix.md) | - ## Active Implementation Plans Working documents for in-flight feature work. Removed when the work lands. @@ -81,19 +59,6 @@ Working documents for in-flight feature work. Removed when the work lands. |---|---| | Schema-lint chassis v1 (MR-694) — `--allow-data-loss`, soft/hard drops | [schema-lint-v1-plan.md](schema-lint-v1-plan.md) | | Inline + stored queries, request/response envelope, MCP (MR-656 / MR-976 / MR-969) | [rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) | -| Config & CLI architecture — layered config, client targeting, file naming (MR-973 / MR-974 / MR-981) | [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md) | -| MCP server surface — full tool parity, stored queries, modular auth (MR-969 / MR-956 / MR-974) | [rfc-003-mcp-server-surface.md](rfc-003-mcp-server-surface.md) | -| Future cluster control plane — declarative as-code config, JSON state ledger, reconciler | [cluster-config-specs.md](cluster-config-specs.md), [cluster-axioms.md](cluster-axioms.md), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) | -| Cluster graph & schema apply — Phase 4 sidecars, roll-forward recovery, approval artifacts | [rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md) | -| Server boots from cluster state — Phase 5 mode switch, applied-revision serving | [rfc-005-server-cluster-boot.md](rfc-005-server-cluster-boot.md) | -| Per-operator config — `~/.omnigraph/` identity, keyed credentials, named servers (the operator slice of RFC-002) | [rfc-007-operator-config.md](rfc-007-operator-config.md) | -| Deprecate `omnigraph.yaml` — one concern per config surface; key-by-key migration map and staged retirement | [rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md) 
| -| Unify CLI embedded/remote access paths — parity referee, shared wire-DTO crate, `GraphClient` trait, declared plane capabilities | [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) | -| Restructure the CLI around explicit planes — one graph-addressing model, declared capability surface, plane-grouped help (expands RFC-009 Phase 4) | [rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md) | -| CLI refactoring — one addressing & config model post-`omnigraph.yaml`: scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) | -| Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) | -| Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) | -| RFC-013 handoff — current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work | [handoff-rfc-013-write-path.md](handoff-rfc-013-write-path.md) | ## Boundary diff --git a/docs/dev/invariants.md b/docs/dev/invariants.md index 9bb6dbd..958042f 100644 --- a/docs/dev/invariants.md +++ b/docs/dev/invariants.md @@ -15,51 +15,13 @@ Use it this way: - Keep implementation ledgers, roadmap detail, and historical MR notes in the per-area docs. This file is the filter, not the encyclopedia. 
-## Governing principle: logical contract over physical state - -The hard invariants below are instances of one rule. Keep it in view whenever -a change touches the boundary between what the graph *means* and how it is -physically stored. - -> **Logical state is the contract. Physical state — index coverage, fragment -> layout, compaction versions, staged writes — is derived, rebuildable, and may -> be produced asynchronously. A physical operation must never fail a logical -> one. Preconditions are checked against logical state; physical reconciliation -> is idempotent and may lag or retry. Genuine logical conflicts still fail -> loudly: the licence to lag covers physical convergence, not correctness.** - -Invariants that instantiate it: **2** (manifest-atomic visibility) and **5** -(recovery is part of the commit protocol) — a partially-written physical layer -never changes what a graph commit means; **7** (indexes are derived state) — a -query is correct under partial index coverage, and expensive index work -converges from manifest state instead of gating the write path; **13** (failures -bounded and observable) — the licence to lag is not a licence to drop, so a -physical step that cannot make progress is surfaced, not swallowed. Deny-list -items that enforce it: synchronous inline vector/FTS index rebuilds on the -commit path; state that drifts from Lance or the manifest when it can be -derived; job queues for manifest-derivable state where a reconciler fits. - -The failure shape it rules out: a legitimate background operation on the -physical layer (compaction, an index build, an interrupted staged write) is -allowed to break a logical operation (a query's correctness, a migration's -success, a branch's writability). The smell to watch for is a logical operation -whose precondition is a *physical* fact — a cached file version, an index's -existence, a fragment count. Make the precondition logical and let a reconciler -converge the physical state. 
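The "idempotent reconciler" shape in the governing principle above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the engine's index code: `reconcile_indexes` is a hypothetical function, and two `BTreeSet`s stand in for declared index intent (logical state) and the indexes that physically exist.

```rust
use std::collections::BTreeSet;

/// Converge physical index state toward declared intent.
/// Returns the index names this pass had to create.
/// Hypothetical helper for illustration only.
fn reconcile_indexes(
    declared: &BTreeSet<String>,
    built: &mut BTreeSet<String>,
) -> Vec<String> {
    // Derive the work purely from the two states: declared minus built.
    let missing: Vec<String> = declared.difference(&*built).cloned().collect();

    // "Build" each missing index. A real build could fail or lag; the next
    // pass would simply pick the same name up again.
    for name in &missing {
        built.insert(name.clone());
    }
    missing
}
```

Because the work list is recomputed from state on every pass, running the function again after it has converged does nothing, which is what makes retry and lag safe. The logical side (`declared`) is never modified by the physical pass.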
- ## Hard Invariants 1. **Respect the substrate.** Lance owns columnar storage, per-dataset versioning, fragments, branches, compaction, cleanup, and index primitives. DataFusion should own relational execution where it fits. Do not add custom WALs, transaction managers, buffer pools, page formats, or local clones of - substrate behavior. Read [lance.md](lance.md) before guessing. Respecting the - substrate also means *using* it idiomatically, not only refraining from - rebuilding it: reuse long-lived handles instead of re-opening per call, - resolve latest state through the substrate's cheap primitive instead of - re-scanning, and share its caches/session. Re-deriving per call what the - substrate keeps warm is a substrate violation even when no code is - reimplemented. + substrate behavior. Read [lance.md](lance.md) before guessing. 2. **Graph visibility is manifest-atomic.** Lance commits are per dataset. OmniGraph's graph-level atomicity comes from publishing one manifest update @@ -76,16 +38,14 @@ converge the physical state. publishes one manifest update. Do not commit per statement. Delete-only queries are the documented inline residual; the parse-time D2 rule prevents mixing deletes with insert/update until Lance exposes two-phase delete. - Read [writes.md](writes.md) and [execution.md](execution.md). + Read [runs.md](runs.md) and [execution.md](execution.md). 5. **Recovery is part of the commit protocol.** Writers that can advance Lance HEAD before manifest publish must write `__recovery/{ulid}.json` sidecars. - `Omnigraph::open` in read-write mode runs the all-or-nothing sweep; the - write entry points (`load_as`, `mutate_as`, `apply_schema_as`, - `branch_merge_as`) and `refresh` run roll-forward-only recovery in-process, - so a long-lived process converges on its next write rather than at restart. Do not add a new writer kind without - sidecar coverage or an explicit proof that no Lance HEAD can move before - manifest publish. 
+ `Omnigraph::open` in read-write mode runs the all-or-nothing sweep, and + `refresh` runs roll-forward-only recovery for long-lived processes. Do not + add a new writer kind without sidecar coverage or an explicit proof that no + Lance HEAD can move before manifest publish. 6. **Strong consistency is the default.** Reads are snapshot-isolated, writes are durable before acknowledgement, and branch reads observe the current @@ -96,7 +56,7 @@ converge the physical state. branch they read even when index coverage is partial. Expensive index work should converge from manifest state instead of extending the critical write path. Scalar staged index builds and vector inline residuals are documented - in [writes.md](writes.md) and [indexes.md](../user/search/indexes.md). + in [runs.md](runs.md) and [indexes.md](../user/indexes.md). 8. **Schema identity survives renames.** Accepted schema identity must remain stable across type and property renames. Rename support belongs in migration @@ -132,40 +92,20 @@ converge the physical state. a substitute for missing lower-level assertions. Read [testing.md](testing.md) before adding tests. -15. **One source of truth, cheaply derived.** Lance and the manifest are the - source of truth. Everything the engine needs at runtime is a derived view of - them: read or projected on demand, held warm, refreshed by a cheap probe. Two - failure modes are forbidden. A *parallel copy* the engine maintains can drift - from the source, and that divergence compounds over time. *Cold - re-derivation* rebuilds the view from the full source on every call, so its - cost grows with history. Invariants 1 and 7, and the deny-list "state that - drifts" and "manifest-derivable reconciler" items, are instances; so is - bounding a read's cost to its working set rather than the commit count. This - is the structural face of "engineering is programming integrated over time": - both failure modes are liabilities that compound as the system grows. 
- ## Current Truth Matrix | Area | Current state | Source | |---|---|---| -| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [writes.md](writes.md), [architecture.md](architecture.md) | -| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [writes.md](writes.md), [execution.md](execution.md) | -| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/queries/index.md), [writes.md](writes.md) | -| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branching/index.md), [maintenance.md](../user/operations/maintenance.md) | -| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema/index.md), [execution.md](execution.md) | -| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (`loader::composite_unique_key`, a separator-free `Vec` tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | [schema-language.md](../user/schema/index.md) | -| Storage trait | `TableStorage` (via `db.storage()`) is staged-only; the inline-commit residuals (`delete_where`, `create_vector_index`) are split onto a separate sealed `InlineCommitResidual` trait reached via `db.storage_inline_residual()` (MR-854), so §1 holds by construction; capability/stat surfaces are roadmap | [writes.md](writes.md), 
[architecture.md](architecture.md) | -| Index lifecycle | `@index`/`@key` declares *intent*; the physical index is derived state and never fails a logical op. `schema apply` builds no indexes (records intent only; index-only changes touch no table data). `load`/`mutate` build inline through one chokepoint (`build_indices_on_dataset_for_catalog`, type-dispatched by `node_prop_index_kind`: enum + orderable scalar → BTREE, free-text String → FTS, Vector → vector) that fault-isolates an untrainable Vector column into a *pending* index instead of aborting. `optimize`/`ensure_indices` is the reconciler: it creates declared-but-missing indexes and folds appended/rewritten fragments into existing ones (`optimize_indices`), reporting still-pending columns. Explicit maintenance call, not yet a background loop | [indexes.md](../user/search/indexes.md), [maintenance.md](../user/operations/maintenance.md) | -| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/queries/index.md) | -| Auth | Bearer token hashing and server-side actor resolution are implemented at the HTTP boundary | [server.md](../user/operations/server.md), [policy.md](../user/operations/policy.md) | -| Tests | Tempdir-backed Lance tests are the current substrate; the storage adapter has an in-memory backend for adapter-level contract tests, but Lance datasets bypass it | [testing.md](testing.md) | - -The branch-delete reconciler is authority-derived: it reclaims orphaned forks -today and degrades to a no-op if Lance ships an atomic multi-dataset branch -operation, so the design composes with that future rather than blocking it. This -is the same shape as invariant 7 (indexes are derived state); prefer it over a -recovery-sidecar-style approach for any new multi-dataset metadata operation, -since the sidecar would be scaffolding to remove once the substrate closes the gap. 
+| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [runs.md](runs.md), [architecture.md](architecture.md) | +| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [runs.md](runs.md), [execution.md](execution.md) | +| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/query-language.md), [runs.md](runs.md) | +| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema-language.md), [execution.md](execution.md) | +| Unique constraints | Intra-batch and write-path checks exist; full cross-version uniqueness is still a gap | [schema-language.md](../user/schema-language.md) | +| Storage trait | `TableStorage` exists as the sealed staged-write surface; full call-site migration and capability/stat surfaces are incomplete | [runs.md](runs.md), [architecture.md](architecture.md) | +| Index lifecycle | `ensure_indices` is explicit today; reconciler-based convergence is roadmap | [indexes.md](../user/indexes.md), [maintenance.md](../user/maintenance.md) | +| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/query-language.md) | +| Auth | Bearer token hashing and server-side actor resolution are implemented at the HTTP boundary | [server.md](../user/server.md), [policy.md](../user/policy.md) | +| Tests | Tempdir-backed Lance tests are the current substrate; there is no `MemStorage` test backend | [testing.md](testing.md) | ## Known Gaps @@ -176,120 +116,12 @@ them explicit. renames. The current compiler still derives type IDs from `kind:name`; this must be fixed before relying on renamed IDs across accepted schemas. 
- **Storage abstraction:** `TableStorage` is present, sealed, and canonical for - staged writes. MR-854 sealed it: `db.storage()` exposes only staged primitives - + reads, and the inline-commit residuals are split onto a separate sealed - `InlineCommitResidual` trait reached via `db.storage_inline_residual()`, so a - new writer cannot couple a write with a HEAD advance through the default - surface. The dead legacy methods (`append_batch` on the trait, - `merge_insert_batch{,es}`, `create_{btree,inverted}_index`) were removed. The - remaining residuals are `delete_where` and `create_vector_index`. The Lance - 6.0.1 → 7.0.0 bump landed, so the staged two-phase delete API - (`DeleteBuilder::execute_uncommitted`, Lance #6658) is now available and MR-A - is unblocked — but the migration itself is still pending, so `delete_where` - stays inline for now. `create_vector_index` remains gated on Lance #6666 - (still open). See [lance.md](lance.md) and [writes.md](writes.md). New write - paths should use the staged shape unless a documented Lance blocker applies. + staged writes, but older inherent `TableStore` call sites and inline residuals + remain. New write paths should use the staged shape unless a documented Lance + blocker applies. - **Deletes and vector indexes:** `delete_where` and vector index creation still - advance Lance HEAD inline. The public delete two-phase API now exists (Lance - #6658 shipped in 7.0.0), so the delete residual is unblocked pending the MR-A - migration; vector index creation is still blocked (Lance #6666 open). Keep D2 - and recovery coverage in place until those residuals are removed. 
-- **Blob-column compaction:** Lance `compact_files` mis-decodes blob-v2 columns - under its forced `BlobHandling::AllBinary` read ("more fields in the schema - than provided column indices"), so `optimize` skips any table with a `Blob` - property — reporting `SkipReason::BlobColumnsUnsupportedByLance` (loud, not a - silent drop) behind the `LANCE_SUPPORTS_BLOB_COMPACTION` gate. Reads and writes - are unaffected; only space/fragment reclamation on blob tables is deferred. - Remove the skip when the upstream Lance fix lands — the - `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns` guard - turns red on that bump to force it. -- **Recovery is serialized against live writers in-process only:** the - write-entry heal (and `refresh`) serialize against a live writer's sidecar - lifetime via the per-`(table, branch)` write queues plus the schema-apply - serialization key — all in-process primitives. A recovery pass in one - process cannot serialize against a live writer in another (the open-time - sweep has the same exposure, and always has): it may roll a live foreign - writer's sidecar forward, which degrades to publisher-CAS contention for - data writes but can race the schema-staging promotion for a foreign live - schema apply. The roll-**forward** CAS contention is now - convergence-idempotent: when the publish loses the CAS to a concurrent - writer that already reached the sidecar's goal, the sweep treats it as - convergence (record the `RolledForward` audit + delete) rather than a fatal - `ExpectedVersionMismatch`, and defers when the manifest is only partway - (`converge_or_defer_roll_forward` in `db/manifest/recovery.rs`; - iss-schema-apply-reopen-recovery-race). So a concurrent advance no longer - fails the open. 
The schema-staging promotion race and the destructive - roll-**back** path (Lance `Restore` "trumps" a concurrent commit, so it - cannot be made idempotent — iss-recovery-sweep-live-writer-rollback) still - need the cross-process primitive. Multi-process writers on one graph are - already documented one-winner-CAS territory; closing this fully needs a - cross-process serialization primitive (e.g. lease-based use of the - schema-apply lock branch) — design it before promoting multi-process write - topologies. -- **Fork reclaim is in-process-safe only:** the first write to a table on a - branch forks it (a Lance `create_branch` that advances state before the - manifest publish). An interrupted fork (crash, or a cancelled request - future) leaves a manifest-unreferenced branch ref. The next write self-heals - it — `reclaim_orphaned_fork_and_refork` (`force_delete_branch` + re-fork) - — but reclaim is only safe because the writer holds the per-`(table, - branch)` write queue from before the fork through the publish AND re-checks - the live manifest under it, so no *in-process* writer can be mid-fork. A - reclaim cannot serialize against a foreign-*process* in-flight fork: it may - force-delete a peer's just-created ref, which makes that peer's commit fail - and retry — the same one-winner-CAS exposure as above, not corruption. The - reclaim never fires unless in-process-queue + manifest authority both prove - the ref is manifest-unreferenced. `cleanup`'s per-table reconciler - (`reconcile_orphaned_branches`) is the guaranteed backstop for any fork the - write path never revisits. Both degrade to a no-op if Lance ships an atomic - multi-dataset branch op. 
-- **Local `write_text_if_match` is not a cross-process CAS:** object-store - backends use a true conditional put (ETag If-Match; the in-memory test - backend too), but upstream `object_store` leaves `PutMode::Update` - unimplemented for `LocalFileSystem`, so the local path emulates CAS with - a content-token compare followed by an atomic replace — a check-then-act - gap plus content-token ABA. Every current caller goes through the cluster - lock protocol first, which makes this safe. A lock-free caller would get - S3-correct but local-racy behavior — the same divergence shape as the - acknowledged-before-visible bug this branch fixed. Close it (local CAS - primitive, or a trait-level lock requirement) before admitting any - lock-free `if_match` caller. -- **Manifest→commit-graph publish atomicity — CLOSED (RFC-013 Phase 7):** graph - lineage now lives ONLY in `__manifest`, as `graph_commit` + `graph_head:` - rows written in the SAME `MergeInsertBuilder` commit as the table-version rows - (`commit_changes_with_lineage` → `GraphNamespacePublisher::publish` with a - `LineageIntent`). There is no second write to fail between — a graph commit and - its lineage land at one manifest version atomically, so a crash after the publish - leaves no gap. The commit-graph cache is a derived projection of those manifest - rows; nothing writes `_graph_commits.lance` (it persists only to carry branch - refs). The prior two-write gap (manifest at N with no `_graph_commits` row for N) - is gone by construction. 
A graph created before Phase 7 (internal schema v3) - carries its lineage only in `_graph_commits.lance`; the `migrate_v3_to_v4` - internal-schema step (`db/manifest/migrations.rs`) backfills it into `__manifest` - per-branch on the first read-write open (idempotent, crash-safe, data-preserving), - and a read-only open of an un-migrated v3 graph sources the DAG from - `_graph_commits.lance` via a stamp-gated transitional fallback so reads stay - correct until the first write migrates it. An old binary refuses a v4-stamped - graph (read-write and read-only) with the standard upgrade error. The migration - is **loud on failure and concurrent-runner idempotent**: the legacy-open read - (`read_legacy_commit_cache`) treats only a genuine not-found as "no legacy data" - and propagates any other open error (so a transient/corrupt open can never stamp - v4 over an empty backfill — orphaning lineage permanently), and the backfill - converges all-or-nothing when two runners open the same legacy graph at once — a - bounded re-open retry on the `graph_head:` row-level CAS plus an - idempotent terminal stamp bump (both runners write the same value, so a concurrent - `UpdateConfig`/`IncompatibleTransaction` loss re-opens and no-ops if the stamp - already landed). The branch read path (`load_commit_cache_for_branch`) also - refuses an out-of-range branch stamp (`> CURRENT` or `< MIN_SUPPORTED`; - defense-in-depth; not a live hole because migrations run main-first, so main - refuses first). The migration chain is **floor-bounded**: - `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (migrations.rs; 1 today, a pure no-op) is - the oldest stamp this binary opens, enforced symmetrically with the ceiling by the - single `refuse_if_stamp_unsupported` guard at all three stamp-read sites - (write-path migrate, read-only open, branch lineage-read). 
Raising MIN sheds the - now-dead `migrate_vN_…` arms and (at MIN ≥ 4) the `commit_graph_legacy_v3` legacy - readers; a compile-time tripwire (`LOWEST_REGISTERED_MIGRATION_SOURCE`) fails the - build if the floor and the lowest registered arm drift. Retirement runbook lives on - the `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` doc-comment. + advance Lance HEAD inline because the required public Lance APIs are missing. + Keep D2 and recovery coverage in place until those residuals are removed. - **Planner capability/stat surfaces:** cost-aware planning, complete capability advertisement, and explain-with-cost are roadmap. Do not describe them as implemented. @@ -304,44 +136,6 @@ them explicit. - **Resource bounds:** some operations still lack enforced per-query memory or time budgets. New long-running work should add explicit bounds rather than widening the gap. -- **Read-path re-derivation (largely closed by the query-latency work):** - snapshot resolution used to re-open a fresh coordinator per read (a full - `__manifest` re-scan plus two commit-graph scans), open each table through the - namespace (two more `__manifest` scans per table), validate the schema twice, - and share no Lance `Session`. That was an O(commits) cost that never warmed up. - Fix 1 (warm coordinator reuse behind a `latest_version_id` probe), Fix 2 (open - tables by location+version), finding A (validate once), and Fix 3 (a held - `Dataset`-handle cache keyed by `(table, branch, version, e_tag when Lance - exposes it)` plus one shared `Session` per graph) remove that tax: a warm - same-branch read does one probe, one schema read, and zero opens on a repeat.
- Non-main branch freshness compares the manifest incarnation (`version` plus - manifest-location e_tag when available, otherwise Lance manifest timestamp), - because Lance branch names can be deleted/recreated at the same version number; - the manifest e_tag is carried into synthetic snapshot ids when available, and - a detected same-branch manifest refresh clears read caches as the fallback for - e_tag-less table locations/topology. Remaining: `optimize` now compacts the - internal metadata tables (`__manifest`, `_graph_commits`) too (RFC-013 step 2), - so a *periodically-optimized* graph keeps the probe/refresh/per-write scan flat - in history; but they are not yet brought into `cleanup` (version GC), so the - `_versions/` chain still grows until an explicit cleanup (the cleanup half is - deferred — it needs the Q8 cleanup-resurrection watermark first). The commit - graph IS now reconcilable from the manifest (RFC-013 Phase 7 — it is a pure - projection of the `graph_commit`/`graph_head` rows); the traversal id-map is - still rebuilt. -- **Commit-graph parent under concurrency — CLOSED (RFC-013 Phase 7):** the graph - commit is now recorded in the manifest publish CAS, and the publisher resolves - the new commit's parent INSIDE its retry loop, per attempt, from the just-loaded - `__manifest` (the `should_replace_head` winner over the visible `graph_commit` - rows). A CAS-conflict retry re-reads the advanced head and parents correctly, so - the refresh-then-append TOCTOU is gone. Two processes writing disjoint tables on - the same branch now also contend on the shared `graph_head:` row (one - `object_id`, `WhenMatched::UpdateAll`): one wins, the other retries and re-parents - — so the cross-process disjoint-table fork is closed too.
This is the intended - §7.1 contention point, pinned by - `manifest::tests::concurrent_disjoint_writes_share_head_and_form_linear_chain` - (two disjoint writers → both commit, single linear chain) and - `manifest::tests::n_concurrent_disjoint_writers_converge_to_one_linear_chain` - (N=8 disjoint writers with app-level retry → one linear chain of 8, no fork). ## Deny-list @@ -367,10 +161,6 @@ case is exceptional. - Cost-blind plan choice when statistics are available or required. - Hidden statistics for behavior that affects planning or operator choice. - Hash-map iteration order in result ordering, plan choice, or migration output. -- Cold re-derivation on the hot path: rebuilding from the full source what could - be held warm and refreshed cheaply, so cost scales with history rather than the - working set (the cost face of invariant 15; "state that drifts" above is its - shadow-copy face). - String-flattened SQL/filter generation when a structured pushdown API is available. - Eager multi-hop cross-product materialization when factorization fits. @@ -378,10 +168,6 @@ case is exceptional. fits. - Discarding retrieval score/rank before fusion or projection decisions. - Auto-creating placeholder nodes for orphan edges. -- Raw filesystem I/O for cluster-stored state (ledger, lock, sidecars, - approvals, catalog) outside the cluster crate's storage module — every - stored byte goes through the engine `StorageAdapter` so `file://` and - `s3://` stay one code path. - Wire-protocol-specific code in compiler or engine crates. - Cloud-only correctness fixes or forks of the OSS engine for correctness. - Mutating immutable substrate state in place, including Lance fragments or @@ -407,8 +193,6 @@ Use this as yes/no/NA for any non-trivial design or PR: - Are stats/capabilities exposed when behavior depends on them? - Are existing known gaps left no worse and documented if touched? - Does the test live at the same boundary as the change?
-- Is this operation's cost bounded with respect to history and scale, or does it - re-derive warm state from cold storage per call? - Does the change avoid every deny-list pattern, or justify the exception? ## Maintenance Policy diff --git a/docs/dev/lance.md b/docs/dev/lance.md index d0a5c31..ef83f2c 100644 --- a/docs/dev/lance.md +++ b/docs/dev/lance.md @@ -55,18 +55,18 @@ Adding/changing index types, fixing coverage, debugging FTS or vector recall, de | Topic | URL | |---|---| -| Index spec overview | https://lance.org/format/index/ | -| BTREE scalar index | https://lance.org/format/index/scalar/btree/ | -| Bitmap scalar index | https://lance.org/format/index/scalar/bitmap/ | -| Bloom-filter scalar index | https://lance.org/format/index/scalar/bloom_filter/ | -| Label-list scalar index | https://lance.org/format/index/scalar/label_list/ | -| Zone-map scalar index | https://lance.org/format/index/scalar/zonemap/ | -| R-Tree scalar index (spatial) | https://lance.org/format/index/scalar/rtree/ | -| Full-text search (FTS) index | https://lance.org/format/index/scalar/fts/ | -| N-gram scalar index | https://lance.org/format/index/scalar/ngram/ | -| Vector index | https://lance.org/format/index/vector/ | -| Fragment-reuse system index | https://lance.org/format/index/system/frag_reuse/ | -| MemWAL system index | https://lance.org/format/index/system/mem_wal/ | +| Index spec overview | https://lance.org/format/table/index/ | +| BTREE scalar index | https://lance.org/format/table/index/scalar/btree/ | +| Bitmap scalar index | https://lance.org/format/table/index/scalar/bitmap/ | +| Bloom-filter scalar index | https://lance.org/format/table/index/scalar/bloom_filter/ | +| Label-list scalar index | https://lance.org/format/table/index/scalar/label_list/ | +| Zone-map scalar index | https://lance.org/format/table/index/scalar/zonemap/ | +| R-Tree scalar index (spatial) | https://lance.org/format/table/index/scalar/rtree/ | +| Full-text search (FTS) index | 
https://lance.org/format/table/index/scalar/fts/ | +| N-gram scalar index | https://lance.org/format/table/index/scalar/ngram/ | +| Vector index | https://lance.org/format/table/index/vector/ | +| Fragment-reuse system index | https://lance.org/format/table/index/system/frag_reuse/ | +| MemWAL system index | https://lance.org/format/table/index/system/mem_wal/ | | HNSW Rust example | https://lance.org/examples/rust/hnsw/ | | Distributed indexing | https://lance.org/guide/distributed_indexing/ | | Tokenizer (FTS, n-gram) | https://lance.org/guide/tokenizer/ | @@ -125,7 +125,7 @@ Touching `omnigraph optimize` / `cleanup`, the underlying `compact_files` / `cle |---|---| | Read-and-write guide (covers `compact_files`, `cleanup_old_versions`) | https://lance.org/guide/read_and_write/ | | Performance (compaction tradeoffs) | https://lance.org/guide/performance/ | -| Fragment-reuse index | https://lance.org/format/index/system/frag_reuse/ | +| Fragment-reuse index | https://lance.org/format/table/index/system/frag_reuse/ | ### DataFusion integration @@ -156,25 +156,7 @@ If a future need pulls one of these into scope, add a row to the matching domain When Lance ships a major release that changes any of the above (file format bump, new index type, transaction semantics change, new branching primitive), refresh this index in the same change as the omnigraph upgrade. Stale Lance pointers are worse than no pointers. -### Last alignment audit: 2026-06-15 (Lance 7.0.0 upstream; omnigraph pinned at 7.0.0) - -Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, DataFusion stayed 53** (no change) — the only transitive bump is `object_store` 0.12.5 → 0.13.2. 141 upstream commits reviewed (6.0.1 → 7.0.0); no fixes lost (the 6.0.x release-branch backports are all forward-ported into 7.0.0).
Behavior-affecting findings: - -- **object_store 0.13 moved convenience methods behind a new `ObjectStoreExt` trait** (`get`/`put`/`head`/`rename`/`delete`; `list`/`list_with_delimiter`/`put_opts` stay on the core `ObjectStore` trait). Fix = add `use object_store::ObjectStoreExt;` to `storage.rs` and `db/manifest/namespace.rs`; no call-site changes. Mirrors Lance's own migration in PR #6672. The local-FS `PutMode::Update` gap is unchanged (still unimplemented upstream), so `storage.rs::write_text_if_match`'s local content-token emulation stays. -- **`roaring` must be pinned to 0.11.4** (`cargo update -p roaring --precise 0.11.4`). Lance 7.0.0's `UpdatedFragmentOffsets` newtype (PR #6650) derives `Eq` over `HashMap`, which needs `RoaringBitmap: Eq` — added only in roaring 0.11.4 (roaring-rs PR #341). Lance's loose `roaring = "0.11"` constraint otherwise resolves the broken 0.11.3 and **lance itself fails to compile** (`RoaringBitmap: Eq is not satisfied`). roaring is transitive (no direct workspace dep); the pin lives only in `Cargo.lock`. -- **`_row_created_at_version` for merge-insert INSERT rows now = the commit version** (PR #6774; was a fallback of 1 / dataset-creation version). Flipped `lance_version_columns.rs::lance_merge_insert_new_row_stamps_created_at_version` to assert `== v2`. Production change-detection keys on `_row_last_updated_at_version` + ID-set membership, so classification logic is unaffected (the `changes/mod.rs` rationale comment was corrected). -- **BTREE range-query bound inclusiveness fixed** (PR #6796, issue #6792): `x <= hi AND x > lo` returned the wrong boundary row on 6.0.1.
omnigraph today builds BTREE only on string `@key` columns (`id`/`src`/`dst`) and queries them by equality/IN, not range, so its *current* query patterns almost certainly never hit this bug — but the corrected boundary semantics are a contract we rely on the moment a BTREE-range path appears (BTREE-on-properties via the index-type tickets, or a range-on-key query). Pinned by `lance_surface_guards.rs::btree_range_query_boundary_is_correct` (reproduces #6792's 5-row + BTREE shape). -- **`WriteParams::auto_cleanup` default flipped from on (every-20-commits) to `None`** (PR #6755). On 6.0.1 the on-by-default hook could GC versions the `__manifest` pins for snapshots/time-travel. omnigraph owns cleanup explicitly (`optimize.rs::cleanup_all_tables`). Two parts to the fix, because `auto_cleanup` is **create-time config only and has no effect on existing datasets** (Lance `write.rs` docs): (1) `auto_cleanup: None` at all 11 `WriteParams` sites so *new* datasets store no cleanup config; (2) — the load-bearing half — `skip_auto_cleanup: true` on every commit path, because graphs created **before** the bump still carry the on-config in their datasets, and Lance's hook fires off the *dataset's stored* config at commit time (`io/commit.rs`: `if !commit_config.skip_auto_cleanup`). So the staged commit path (`commit_staged` → `CommitBuilder::with_skip_auto_cleanup(true)`), the `__manifest` publisher (`MergeInsertBuilder::skip_auto_cleanup(true)`), and the direct `WriteParams` paths all skip the hook. Without this, an upgraded graph would still auto-cleanup and delete `__manifest`-pinned versions. Pinned by `lance_surface_guards.rs::skip_auto_cleanup_suppresses_version_gc` (negative control + with-skip survival). -- **Lance #6658 SHIPPED in 7.0.0** (`DeleteBuilder::execute_uncommitted`, exposed via PR #6781) → MR-A (migrate `delete_where` to the staged two-phase API, retire the parse-time D2 rule) is now **unblocked**, tracked separately (dev-graph `iss-950`).
The bump itself keeps `delete_where` inline; the `_compile_delete_result_field_shape` guard is left untouched until MR-A. -- **The unenforced primary key is now immutable once set** (`lance::dataset::transaction`, ~L2472–2480: `if !primary_key_before.is_empty() && (writes_primary_key || primary_key_after != primary_key_before) → "the unenforced primary key is a reserved key and cannot be changed once set"`). omnigraph marks `__manifest.object_id` as the unenforced PK (`lance-schema:unenforced-primary-key`) for merge-insert row-level CAS — baked into `manifest_schema()` at init, and added by the `migrate_v1_to_v2` internal-schema migration for pre-v0.4.0 graphs. The migration relied on Lance 6's idempotent re-apply for crash-recovery (a crash after the field-set but before the stamp bump re-enters the migration with the PK already present); under v7 that re-apply errors, so a real v1 graph could never finish migrating. Fixed by guarding the set on the manifest's unenforced-PK field (`db/manifest/migrations.rs::migrate_v1_to_v2`): `["object_id"]` → no-op, `[]` → set, any other PK field → loud refusal (the wrong CAS key, unchangeable under v7). Pinned by `lance_surface_guards.rs::unenforced_primary_key_is_immutable_once_set` (red if Lance relaxes immutability); regression: `db::manifest::tests::test_publish_migrates_pre_stamp_manifest_to_current_version` (was red under v7). -- **Native `DirectoryNamespace` no longer recognizes omnigraph's manifest-tracked tables** (`lance-namespace-impls` dir.rs ~L1310): `list/describe/create_table_version` route through `check_table_status`, which reports an omnigraph table absent → `TableNotFound`.
The decoupling is *contingent on omnigraph's legacy boolean PK key*, not an unconditional v7 property: v7's namespace eagerly adds the new `lance-schema:unenforced-primary-key:position` key to any `__manifest` lacking it; that write hits the immutable-PK rule above (the boolean key already set the PK), so `ensure_manifest_table_up_to_date` errors and the namespace silently falls back to directory listing. omnigraph keeps the boolean key deliberately — Lance honors it permanently (maps to PK position 0), and one uniform on-disk format beats a new-vs-old split (existing graphs can't be re-keyed to the position key under that same immutability rule). omnigraph production never uses Lance's native namespace (its publisher writes `__manifest` directly via merge_insert; its own `namespace.rs` impls are custom), so this is test-only — the `test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path` surface guard was realigned to the v7 behavior (it now asserts the native namespace is fully decoupled, which only strengthens the guard's thesis). -- **Still NOT fixed in 7.0.0:** vector-index two-phase (Lance #6666 open) — `create_vector_index` inline residual retained; blob-column compaction — `compact_files_still_fails_on_blob_columns` guard still red on a fix, `optimize` still skips blob tables behind `LANCE_SUPPORTS_BLOB_COMPACTION`. -- **No Lance API surface omnigraph uses changed at *compile* time** (the only compile break was object_store) — but **two runtime behaviors did** (the unenforced-PK immutability and the native-namespace `TableNotFound`, above), each caught by the full engine test suite rather than the build. `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable.
Lesson: a clean build is not a clean alignment — run `cargo test --workspace` before declaring a Lance bump done. -- **Two surface guards added by the v3→v4 migration-robustness follow-up** (not a Lance bump, but they pin Lance error surfaces the migration now classifies on): `dataset_open_missing_returns_not_found_variant` (a missing `Dataset::open` returns `DatasetNotFound`/`NotFound` — the legacy-open read in `db/commit_graph.rs::read_legacy_commit_cache` treats only those as "no legacy data" and propagates everything else) and `lance_error_incompatible_transaction_variant_exists` (a concurrent `UpdateConfig` stamp-bump loses with `IncompatibleTransaction` — `db/manifest/migrations.rs::commit_v4_stamp_idempotently` matches it to retry the benign same-value race). Re-run on a Lance bump like the others. - -Bump this date stanza on the next alignment pass. - -### Prior alignment audit: 2026-05-22 (Lance 6.0.1 upstream; omnigraph pinned at 6.0.1) +### Last alignment audit: 2026-05-22 (Lance 6.0.1 upstream; omnigraph pinned at 6.0.1) Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53, Arrow 57 → 58, lance-tokenizer 6.0.1 added, tantivy* removed). Direct 4 → 6 jump; v5.x was not used as an intermediate (rationale in `~/.claude/plans/shimmering-percolating-duckling.md`). Behavior-affecting findings: @@ -187,14 +169,13 @@ Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53, - **`Dataset::checkout_version(N).await?.restore().await?`**: `restore()` takes `&mut self` and returns `Result<()>` (mutates in place, does not consume + return a new dataset). The recovery rollback hammer at `db/manifest/recovery.rs:505-522` continues to work. Pinned by `lance_surface_guards.rs::_compile_checkout_version_then_restore_signature`. - **`DatasetBuilder::from_namespace(...).with_branch(...).with_version(...).load()`** surface preserved (the namespace builder chain at `db/manifest/namespace.rs:162-174`).
Pinned by `lance_surface_guards.rs::_compile_dataset_builder_from_namespace_signature`. - **`compact_files(&mut ds, CompactionOptions::default(), None)`** signature stable. `CompactionOptions` still does not expose `data_storage_version`; `compact_files` builds its own `WriteParams { ..Default::default() }`. Note: `LanceFileVersion::default()` is now V2_1 in v6, so optimize-rewritten fragments come out at V2_1 by default (was V2_0 in v4). Existing explicit V2_2 pins on creates/appends still apply. -- **`Dataset::optimize_indices(&mut self, &lance_index::optimize::OptimizeOptions)`** (via `DatasetIndexExt`) is a depended-on surface as of the index-coverage work: `db/omnigraph/optimize.rs` calls it after `compact_files` to fold appended/rewritten fragments into existing indexes (incremental merge, not retrain). It is a **committing** call (mutates in place, advances HEAD; no uncommitted variant in v6.0.1), so optimize treats it as an inline-commit residual under the `SidecarKind::Optimize` recovery sidecar. Signature pinned by `lance_surface_guards.rs::_compile_optimize_indices_signature`; the incremental-coverage behavior pinned by `optimize_indices_extends_fragment_coverage` (appended fragment uncovered before, covered after). - **`Dataset::delete(predicate)` returns `DeleteResult { new_dataset: Arc, num_deleted_rows: u64 }`** — unchanged shape. Pinned by `lance_surface_guards.rs::_compile_delete_result_field_shape`. MR-A will repurpose this guard to the staged two-phase variant once `DeleteBuilder::execute_uncommitted` migration lands. - **File reader read methods now async** (Lance PR #6710, v6.0). No effect — omnigraph reaches Lance exclusively through `Dataset::scan` and the staged-write API. - **Tokenizer vendored as `lance-tokenizer`** (Lance PR #6512, v6.0). No effect — no direct tokenizer imports.
- **Lance #6658 closed** (2026-05-14) but `DeleteBuilder::execute_uncommitted` did **not** ship in v6.0.1 — binary search across the release stream shows it first appears in `v7.0.0-beta.10` (the closing commits landed on main but didn't backport to the 6.x line). Tracked as MR-A: migrate `delete_where` to staged, retire the parse-time D2 mutation rule, extend recovery sidecar coverage. **Gated on the Lance v7.x bump**, not this PR. v7.0.0-rc.1 dropped 2026-05-21. - **Lance #6666 still open** (`build_index_metadata_from_segments` public): vector-index two-phase blocked; inline `create_vector_index` residual retained. - **Lance #6877 still open** (`MergeInsertBuilder` dup-rowid): PR #109's `SourceDedupeBehavior::FirstSeen` + `check_batch_unique_by_keys` precondition stay load-bearing. -- **`Dataset::force_delete_branch`** (`branches().delete(name, force=true)`, dataset.rs:524) tolerates a missing branch-*contents* ref (vs plain `delete_branch`'s `RefNotFound`), but on the local store still errors `NotFound` if the branch `tree/` directory is fully absent (`remove_dir_all`'s NotFound is not caught for Lance's native error variant, refs.rs:526-549). Both variants still refuse a branch with referencing descendants (`RefConflict`). `TableStore::force_delete_branch` wraps this to be fully idempotent (tolerates already-absent). The single-authority branch-delete redesign uses it for orphan reclamation (eager best-effort reclaim + cleanup reconciler). Pinned by `lance_surface_guards.rs::force_delete_branch_semantics`. Branch delete is "flip the ref atomically, then `remove_dir_all(tree/{branch})`"; branch-exclusive data lives under `tree/{branch}/` so a drop reclaims it immediately without touching `main`.
-- **Lance blob-v2 `compact_files` bug** (no public issue found as of 2026-06): `compact_files` disables binary-copy for blob datasets and forces `BlobHandling::AllBinary` on the read side; the v2.1+ structural decoder then mis-counts column infos for the blob-v2 struct and fails with `Invalid user input: there were more fields in the schema than provided column indices / infos` (`lance-encoding/src/decoder.rs::ColumnInfoIter::expect_next`). This fails even a pristine uniform-V2_2 multi-fragment blob table; vector/list/scalar/ragged columns and mixed file versions all compact fine. Reads/queries use descriptor handling (`BlobHandling::default()`) and are unaffected. `optimize` skips blob-bearing tables behind `LANCE_SUPPORTS_BLOB_COMPACTION = false` (`db/omnigraph/optimize.rs`), reporting `SkipReason::BlobColumnsUnsupportedByLance`. Pinned by `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns`, which turns red when the bug is fixed → flip the gate, remove the skip branch + the `maintenance.rs::optimize_skips_blob_table_and_reports_skip` skip assertions. -Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (10 named guards; 5 runtime + 5 compile-only; plus the index-coverage work's `_compile_optimize_indices_signature` and `optimize_indices_extends_fragment_coverage`). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension). +Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (8 named guards; 3 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check.
Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension). + +Bump this date stanza on the next alignment pass. diff --git a/docs/dev/rfc-001-queries-envelope-mcp.md b/docs/dev/rfc-001-queries-envelope-mcp.md index 94d15e8..b5d62d4 100644 --- a/docs/dev/rfc-001-queries-envelope-mcp.md +++ b/docs/dev/rfc-001-queries-envelope-mcp.md @@ -348,4 +348,4 @@ Callers move at their own pace. The envelope upgrades + URL rename ship in v0.6. - RFC 8288 (`Link` relations, `successor-version`) - MCP spec: [modelcontextprotocol.io](https://modelcontextprotocol.io) - [invariants.md](./invariants.md) — substrate boundaries this work respects -- [../user/server.md](../user/operations/server.md) — current HTTP surface (post-MR-656 picks up the `/query`+`/mutate` rename and deprecation) +- [../user/server.md](../user/server.md) — current HTTP surface (post-MR-656 picks up the `/query`+`/mutate` rename and deprecation) diff --git a/docs/dev/rfc-002-config-cli-architecture.md b/docs/dev/rfc-002-config-cli-architecture.md deleted file mode 100644 index 8095eda..0000000 --- a/docs/dev/rfc-002-config-cli-architecture.md +++ /dev/null @@ -1,590 +0,0 @@ -# RFC: Config & CLI Architecture — Layered Config, Client Targeting, File Naming - -**Status:** Proposed (umbrella; implementation parked — PRs #139/#162). Its pieces have since landed or been superseded piecemeal: layered config/file-naming/credentials → [rfc-007-operator-config.md](rfc-007-operator-config.md) (landed); the project-manifest role of `omnigraph.yaml` → deprecated by [rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md) (stages 1–4 landed); the `omnigraph-api-types` extraction and client unification → [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) (proposed; salvages #139's clean extraction).
Still exclusively here: `GraphLocator`/multi-homing (§1), roles (§3), the State layer / `omnigraph use`. Do not implement from this document without checking those successors first. -**Date:** 2026-05-30 -**Tickets:** MR-668 (multi-graph server, shipped — the dependency this builds on), MR-969 (stored queries + MCP — supplies the in-repo agent tool surface), MR-973 (quickstart / onboarding), MR-974 (agent setup surface), MR-981 (agent-friendly CLI hardening) -**Target release:** v0.8.x (tentative; phased — see Rollout) - -## Summary - -OmniGraph today has a single config file, `omnigraph.yaml`, read both by the CLI (operating the embedded engine) and by `omnigraph-server` (hosting graphs). There is **no client-side configuration that targets a *running server*** — to talk to a deployed `omnigraph-server` you drop to `curl` or the `omnigraph-ts` client. This is the one real gap in an otherwise coherent design (storage-URI addressing, multi-graph routing, per-graph policy). - -This RFC defines the config and CLI architecture that closes that gap, derived from first principles — *working backwards from what OmniGraph uniquely enables* rather than copying kubeconfig / `helix.toml`. The result: - -1. A **global-first layered config** — user-global (`~/.omnigraph/`) is the **primary, self-sufficient default**; per-project (`./omnigraph.yaml`) is an *optional* override + deployment manifest. One uniform schema, both layers optional; the CLI works from any directory with **no project file** (the `kubectl`/`aws`/`gh` posture), unlike today's project-anchored behavior. -2. A single unifying noun — the **target** — that resolves a name to a concrete `(locus, graph, sub-state, credential)` tuple, where the locus is **embedded (storage URI) XOR remote (server endpoint)**. -3. A **multi-server × multi-graph** client model (OmniGraph hosts N graphs per server and there are M servers — unlike Helix's one-cluster-one-graph). -4.
**Credentials by reference, keyed by server name** (the AWS/gh/kube model) — OS keychain `omnigraph:` (preferred) → a `[]` profile in `~/.omnigraph/credentials` → `OMNIGRAPH_TOKEN[_]` env (CI). `servers.` is endpoint-only by default but may carry an explicit, secret-free `auth: { token: { env|file|command|keychain } }` source; no `credentials.yaml`; the shipped `bearer_token_env` + dotenv stay as a legacy compat path. Every committed/GitOps'd surface stays secret-free. -5. A **file-naming** decision: project and server config are **the same artifact, same name** (`omnigraph.yaml`); the only differently-named file is the user-global `config.yaml`, justified by **scope, not role**. - -The design optimizes jointly for **DX** (one command surface across embedded and remote; clone-and-go) and **AX** (agent experience: one flat resolved context, secrets structurally unreachable, branch-pinned reproducible reads, and a GitOps'd capability surface). - -## Reconciliation with shipped / planned CLI work - -Verified **against the code**, not ticket statuses (which are unreliable — e.g. MR-581 is marked done but is stale and unbuilt). Findings and the corrections they force: - -- **Noun is `graph`/`graphs`, NOT `target`/`targets`.** The config key is `graphs:` in `config.rs` and the flag is `--graph`. **This RFC uses `graphs:`/`--graph` throughout**; the unifying noun is a **`graphs:` entry** that is *embedded* (`storage:`, formerly `uri:`) XOR *remote* (`server:` + `graph_id:` defaulting to the entry key) — a typed locator (§1.1). Read any lingering `targets:`/`--target` below as `graphs:`/`--graph`. -- **`~/.omnigraph/` stands on its own merits** (Helix/aws/kube peer convention), **not** on precedent — there is **no `~/.omnigraph/` usage in the code** today. (MR-581 / MR-531 templates-into-`~/.omnigraph/` are *stale tickets, unbuilt*.) -- **Templates do not exist** in the code (no `template` command).
The template mechanism is a *design question for this RFC / the init family*, not an existing foothold. -- **What actually exists in the CLI** (verified): `init, query(read), mutate(change), load, ingest, branch, schema, lint, snapshot, export, commit, policy, optimize, cleanup, graphs`. **Not built:** `serve, quickstart, template, prune, login`. `omnigraph init` exists (with `scaffold_config_if_missing`, `main.rs:1415`); the rest of the "init family" (`quickstart` MR-973, `serve` MR-970, `prune`/`init --force` MR-972/975, `mcp install`/skills MR-974, agent-mode MR-981) are **unbuilt tickets**, some stale. -- **Config still uses `aliases:`** (no `operations:` in code; MR-839 unbuilt). §6's reconciliation talks about `aliases:` as-is, noting `operations:` is a *proposed* rename. -- **`bearer_token_env` exists** (per-graph, `config.rs`); MR-971 flags a CLI-parity / server-side gap. The per-`servers.` extension lands on top of that. -- **A top-level `omnigraph lint` command exists** (verified). A stored-query *registry* validator must pick a verb that doesn't read as a competing lint/check. - -## Motivation - -Three problems, in priority order: - -- **No client→server targeting config.** The moment an operator stands up `omnigraph-server` — for bearer auth + Cedar at a network boundary + admission control + multi-graph routing — the CLI can't address it. `curl` is the fallback. There is no named, switchable, credential-carrying way to say "run this against `prod` on the team server." -- **Multi-server × multi-graph has no first-class expression.** OmniGraph genuinely runs N graphs per server across M servers. The same graph is **multi-homed** — `s3://b/prod` may be `prod` on server A, `production` on server B, and opened directly by the CLI. Today's flat `graphs:` map (name→storage-URI) can't express "graph `production` on server `prod-eu`."
-- **Solo-first and embedded-first are unserved by the remote story.** A solo developer with no projects should define everything in `~`. A developer iterating locally (embedded, no server) and then pointing at staging (remote) should change *one word*, not learn a second command surface. - -MR-668 shipped the server side (multiple graphs per server). MR-969 ships the in-repo agent tool surface (stored queries / MCP). This RFC supplies the **client and config layer** that lets humans and agents target that surface coherently — the foundation under MR-973 / MR-974 / MR-981. - -## Non-Goals - -- **A control plane / dashboard for config.** Operators edit files and (for servers) restart. No runtime config-mutation API. Matches the MR-668 / MR-969 operational model. -- **Hot reload.** Restart-only for server-side config, matching MR-668 and MR-969. -- **Embedding secrets in any config file.** Credentials are by-reference; the git-ignored `auth.env_file` dotenv (or, later, the OS keychain) holds tokens. Never a committable `*.yaml`. -- **Renaming the project manifest by role.** No `omnigraph.server.yaml` / `omnigraph.client.yaml`. Role lives in sections, not filenames (see Design §3). -- **Dropping embedded mode.** Embedded-first is load-bearing for the file-naming decision; this RFC assumes it stays. -- **Cross-graph / cross-server tool listing in MCP.** Clients loop over per-graph catalogs (a MR-969 non-goal, restated). - -## Background - -OmniGraph runs on Lance 6.x: typed nodes/edges in per-type Lance datasets, atomic multi-table commits via a `__manifest` table, branchable and time-travelable. The CLI (`omnigraph`) operates the **embedded engine** directly against a storage URI — no HTTP client in its runtime dependencies. `omnigraph-server` (Axum) is a *separate* HTTP front-end over the same engine, with bearer auth + per-graph Cedar (MR-668). The two read the same `omnigraph.yaml` but never connect to each other.
- -OmniGraph **already has a credentials-by-reference mechanism**, which this RFC builds on rather than replacing: `TargetConfig.bearer_token_env` names the env var holding a graph's bearer token, and `auth.env_file` points at a git-ignored dotenv (`.env.omni`) that the CLI auto-loads into the process (`load_env_file_into_process`) with real-env-vars-win precedence; `resolve_remote_bearer_token` resolves a token via env var then dotenv named lookup. `.env.omni` is already in `.gitignore`. - -The six **irreducible enablers** that drive the design (referenced as E1–E6 below): - -| # | Enabler | Consequence | -|---|---|---| -| E1 | A graph is a **self-contained storage URI**; the substrate (object store + manifest CAS) is the source of truth — no server required to read/write. | A graph is addressable **directly (embedded)**, not only via a server. | -| E2 | A server hosts **many graphs**; **many servers** exist. | The remote address space is **`{server} × {graph_id}`**. | -| E3 | The same graph is **multi-homed** under different per-locus names. | **Name ≠ identity.** Resolution is mandatory. | -| E4 | **Branch / commit / snapshot** are first-class addressable sub-state. | An address is *graph @ branch/snapshot*, not just graph. | -| E5 | Enforcement is **two-layered**: engine-layer Cedar (`_as` writers, works embedded) + HTTP-boundary bearer+Cedar (server only). | *How* you reach a graph determines *which* enforcement applies. | -| E6 | **Stored queries / MCP tools are a per-graph registry defined in the project config** (MR-969). | The **agent tool surface is version-controlled in the repo**. | - -Competitors collapse dimensions OmniGraph keeps live: **Helix** fuses E2+E3 (one cluster = one graph); **namidb** fuses E1+E3 into the URI (`s3://b?ns=prod`) and serves one namespace per process. OmniGraph has all of E1–E6 at once, so its config resolves a richer space — but the richness is *earned* by capability. - -## Design - -### 1. 
The address space and the `target` abstraction - -Every OmniGraph address is a tuple: - -``` -(locus, graph, sub-state, credential) - locus = embedded(URI) XOR remote(server-endpoint) # E1, E2 - graph = a URI (embedded) | a graph_id on a server (remote) # E3 - sub-state = branch | snapshot # E4 - credential = cloud-storage creds (embedded) | bearer token (remote) # E5 -``` - -The config's only job is **name → this tuple**. Define one noun — a **target** — that resolves to either shape: - -```yaml -targets: - dev: # embedded — substrate-direct (E1) - storage: s3://team-bucket/dev.omni - branch: main # sub-state (E4) - staging: # remote — resolves a server by reference (E2/E3) - server: staging # → looked up in `servers` - graph_id: prod # the graph's id on that server (defaults to the entry key) - branch: review -``` - -`--target staging` resolves: project `targets.staging` → `{server: staging, graph_id: prod, branch: review}` → `servers.staging` → `{endpoint, token-by-ref}` → final `(remote(https://…), prod, review, $TOKEN)`. Embedded targets skip the server hop and use cloud-storage credentials. - -**Two concepts, not kubeconfig's three.** kube splits cluster / user / context; that 3-way split is its most-cursed UX. A target *bundles* server+graph+branch+defaults under one name; the **only** thing split out is `servers`, because endpoints+credentials are shared across many targets and are secret-bearing (different ownership and rate-of-change; see §2). Result: **2 nouns — `servers` and `targets`.** Embedded `targets` (`storage:`) subsume today's `graphs:` entries. - -### 1.1 The resolved address is a typed *locator*, not a `uri` string - -The shipped config models a graph as a single `uri: String`, and code branches on `is_remote_uri(uri)`. 
That conflates two structurally different addresses: an **embedded** graph is a *complete, self-contained* address — one storage URI = one graph, opened directly via the embedded engine; a **remote** graph is a *server endpoint + a `graph_id`* — one server hosts N graphs. A bare server URL **is not a graph**; it lacks the `graph_id`. The cost of the string model, in the code today: - -- the CLI re-decides "server or file?" via `is_remote_uri` at ~16 call sites; -- `TargetConfig` (one `uri` field) **cannot express** multi-server × multi-graph or a multi-homed graph (E2/E3) — "graph `production` on server `prod-eu`" has no representation; -- the CLI **bails on remote URIs** for most operations, precisely because the string can't carry the `graph_id`; -- the `omnigraph-ts` SDK had to model `baseUrl` **+** `graphId` *separately* (rewriting `/graphs/{graphId}/…`) — it invented the structure the string lacks. - -So the *resolved* address is a **typed locator**, not a string: - -```rust -enum GraphLocator { - Embedded { storage: StorageUri }, // file:// , s3:// — a complete graph - Remote { server: ServerId, graph_id: GraphId }, // which server + which graph (+ bearer creds) -} -``` - -A `graphs:` entry resolves into this **once**; downstream code dispatches on the variant (the breadboard's `GraphConn = Embedded(engine) | Remote(http)`) instead of re-sniffing a scheme at each call site. The `uri` string becomes an *input format* for the embedded variant, never the address itself. 
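The one-time resolution described above can be sketched in Rust. This is a minimal illustration under stated assumptions, not the shipped loader: `RawGraphEntry`, `LocatorError`, `resolve_locator`, and `describe` are hypothetical names, plain `String`s stand in for `StorageUri` / `ServerId` / `GraphId`, and the two rules it encodes (exactly one of `storage:` / `server:`, and `graph_id` defaulting to the entry key) are the ones this section states for `graphs:` entries.

```rust
/// Loosely parsed `graphs:` entry, as it might come out of the YAML layer.
/// Hypothetical shape, for illustration only.
#[derive(Debug, Clone, Default)]
pub struct RawGraphEntry {
    pub storage: Option<String>,
    pub server: Option<String>,
    pub graph_id: Option<String>,
}

/// The typed locator; `String` stands in for the RFC's newtypes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum GraphLocator {
    Embedded { storage: String },
    Remote { server: String, graph_id: String },
}

/// Hypothetical validation errors for the `storage:` xor `server:` rule.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LocatorError {
    BothStorageAndServer { key: String },
    NeitherStorageNorServer { key: String },
}

/// Resolve a raw entry into a locator exactly once.
/// `key` is the entry's name in the `graphs:` map.
pub fn resolve_locator(key: &str, raw: &RawGraphEntry) -> Result<GraphLocator, LocatorError> {
    match (raw.storage.as_ref(), raw.server.as_ref()) {
        // Both locators present: ambiguous, so reject.
        (Some(_), Some(_)) => Err(LocatorError::BothStorageAndServer { key: key.to_string() }),
        // No locator at all: nothing to open, so reject.
        (None, None) => Err(LocatorError::NeitherStorageNorServer { key: key.to_string() }),
        (Some(storage), None) => Ok(GraphLocator::Embedded { storage: storage.clone() }),
        (None, Some(server)) => Ok(GraphLocator::Remote {
            server: server.clone(),
            // `graph_id` falls back to the entry key when omitted.
            graph_id: raw.graph_id.clone().unwrap_or_else(|| key.to_string()),
        }),
    }
}

/// Downstream code dispatches on the variant instead of sniffing a URI scheme.
pub fn describe(locator: &GraphLocator) -> String {
    match locator {
        GraphLocator::Embedded { storage } => format!("embedded engine at {}", storage),
        GraphLocator::Remote { server, graph_id } => {
            format!("server {} / graph {}", server, graph_id)
        }
    }
}
```

Calling `resolve_locator("production", &entry)` on an entry that sets only `server: prod-eu` would yield a `Remote` locator whose `graph_id` is the key `production`; an entry that sets both `storage` and `server` returns an error rather than silently picking one.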
- -**YAML naming follows the locator — the *key* names the locus**, so neither the value's scheme nor a comment is load-bearing: - -| Locus | Key | Value | -|---|---|---| -| Embedded | **`storage:`** (shipped `uri:` is a deprecated alias) | a storage URI (`s3://…`, `file://…`) | -| Remote | **`server:`** | a name in `servers:` (its `endpoint` + creds resolve by name, §5) | -| Remote graph id | **`graph_id:`** | the id on that server — **defaults to the entry key**; set only when the local alias differs | - -An entry has `storage:` **xor** `server:` — the deserializer rejects *both* and *neither* (no silent ambiguity). This removes two prior confusions: `graphs:` (the map) vs `graph:` (the remote id), and `uri:`-might-be-a-server. - -```yaml -servers: - prod-eu: { endpoint: https://og-eu.internal:8080 } -graphs: - dev: { storage: s3://team-bucket/dev.omni } # embedded - production: { server: prod-eu } # remote — graph_id = "production" (the key) - staging: { server: prod-eu, graph_id: prod } # remote — alias ≠ server's id -``` - -### 1.2 Invalid configs are rejected by design - -The DX rule is: **a config field is either honored or rejected, never silently ignored**. The loader therefore has two phases: - -1. Parse YAML into a loose/raw shape that preserves origin (`base_dir`, layer, line/path when available). -2. Convert once into a typed, role-aware resolved config. Every command receives the resolved form, not the raw YAML structs. - -The typed graph shape is: - -```rust -enum GraphEntry { - Embedded(EmbeddedGraphEntry), - Remote(RemoteGraphEntry), -} - -struct EmbeddedGraphEntry { - storage: StorageUri, - branch: Option, - policy: Option, - queries: QueryRegistrySpec, -} - -struct RemoteGraphEntry { - server: ServerId, - graph_id: GraphId, - branch: Option, -} -``` - -That makes these rules structural rather than advisory: - -- A graph entry must specify **exactly one** locator: `storage:`/legacy `uri:` xor `server:`. 
-- `policy:` and `queries:` are valid only on `Embedded` graph entries, because they define the capability surface of a graph this process opens directly. A `Remote` graph entry points at a server; that server owns policy and stored-query definitions. -- `omnigraph-server` may serve only `Embedded` graph entries. A server manifest entry with `server:` is rejected: a server should not "host" a graph by proxying another server. -- A named graph uses its own graph entry. Top-level `policy:` / `queries:` are a legacy anonymous-bare-URI compatibility path only; if a named graph is selected while top-level blocks would be ignored, config validation errors with a migration hint. -- A client-defined remote graph discovers stored queries from the server (`GET /queries`) and invokes them (`POST /queries/{name}`); it does not define `queries:` locally for that remote graph. - -Examples that must fail fast: - -```yaml -graphs: - prod: - storage: s3://team-bucket/prod.omni - server: prod-us # invalid: storage xor server -``` - -```yaml -graphs: - prod: - server: prod-us - graph_id: production - policy: { file: ./policies/prod.yaml } # invalid: remote graph policy lives on the server - queries: - find_user: { file: ./queries/find_user.gq } # invalid: remote graph queries are discovered -``` - -`omnigraph config view --resolved --show-origin` is the user-facing debugger for this boundary: it shows the final `Embedded` or `Remote` graph and where every honored field came from. Fields that cannot be honored never make it into the resolved view; they fail validation first. - -### 2. 
Layered config — global-first, uniform schema, project-optional - -**Posture: global-first, project-optional.** OmniGraph's CLI is primarily a *client* (it operates against graphs and servers, embedded or remote), so it sits on the **global-first** side of the CLI-config axis — like `kubectl` / `aws` / `gh` / `docker`, and unlike *project-first* tools (`git` / `cargo` / `terraform`) whose primary config is per-repo. The **global user config is the primary, self-sufficient default**; the project file is an *optional* repo-scoped override (and, when present, the deployment manifest). `omnigraph query --target prod` must work from **any directory with no project file**, exactly as `kubectl get pods --context prod` works from anywhere. *(This is a deliberate flip from today, where the CLI reads `./omnigraph.yaml` and does not even walk parent dirs — i.e. today it is project-anchored.)* - -**Rule: the two layers share ONE raw schema, and each is fully self-sufficient** (the git-layering mechanism — same schema at both levels; you never need a repo to have a working config). Do **not** specialize the file format by layer. Instead, run the same role-aware validation everywhere (§1.2): the global and project layers may both define graph locators, defaults, servers, and aliases, but fields that are meaningless for a resolved graph variant are rejected rather than ignored. For example, `queries:` is valid for an embedded graph this config opens directly; it is invalid on a remote graph entry because remote stored queries are server-owned and discovered. - -This makes the **zero-project case the default, not an edge case**: a solo user (or an agent) defines everything needed for client work in `~/.omnigraph/config.yaml` — servers, embedded + remote graph locators, defaults, aliases, and optionally personal embedded-graph query registries — and **never creates a project file**. 
A team adds `./omnigraph.yaml` only when it wants repo-scoped overrides or a committed, GitOps'd deployment manifest. Global-first does **not** forbid project files; it stops *requiring* them (the kubectl model: `~/.kube/config` is sufficient and default; per-project kubeconfigs are opt-in via `KUBECONFIG`). - -| Layer | Required? | Typical use | Path | -|---|---|---|---| -| Global | no | **the default** — solo/agent's entire config; shared servers+creds for teams; even a personal server's graphs/queries | `~/.omnigraph/config.yaml` | -| Project | no | **opt-in** — repo-scoped overrides + the committed deployment manifest (graphs, queries, policy) | `./omnigraph.yaml` | - -**Precedence (low → high):** built-in defaults < global < project < env vars < CLI flags. With no project file it collapses to **built-in < global < env < flags** — the common global-only path. - -**Merge semantics — "closest layer wins, at the smallest meaningful unit"** (the field consensus: git / kubeconfig / cargo / Helm / VS Code): -- **Settings objects** (`defaults`, `auth`, `server`) → **deep-merge per field**: a project sets `defaults.graph` and *inherits* the global `defaults.output_format`. (VS Code / cargo behavior.) -- **Named-resource maps** (`servers`, `graphs` / compat `targets`, `queries`, `aliases`) → **union by key; on a collision the higher layer's entry REPLACES the lower wholesale** — *no field-level deep-merge within an entry*. (kubeconfig: union contexts by name.) The footgun this avoids: global `servers.prod = {endpoint, policy}`, project `servers.prod = {endpoint: other}` — deep-merge would silently retain the old fields; replace makes the project's `prod` self-contained and predictable. -- **Lists/arrays** → **replace, never append** (Helm convention; appending is order-sensitive and surprising). -- **Scalars** → higher layer wins. 
-- **Relative paths carry their origin's base_dir.** A `queries:` entry's `.gq` path, or a `policy.file`, resolves against the directory of the layer it was *defined in* — global entries under `~/.omnigraph/`, project entries under the project dir. -- **Inspectable (non-negotiable):** `omnigraph config view --resolved --show-origin` prints each final value *and which layer set it* (the `git config --show-origin` / `kubectl config view` rule). A layered config without origin-tracing is a debugging trap. - -### 3. Roles, and the file-naming decision (same name for project = server) - -`omnigraph.yaml` carries two *roles* that diverge in prod and collapse on a laptop: - -- **Server role** (read by `omnigraph-server`): `graphs:` entries that are **embedded storage locators**, per-graph `policy.file`, **`queries:` — the stored-query/MCP registry lives here**, plus serving knobs. Remote graph locators are rejected in this role. -- **Client role** (read by the CLI/agent): `servers:`, embedded or remote `graphs:` locators, `defaults:`, `aliases:`. A remote graph locator points at server-owned capabilities; it cannot define local `policy:` or `queries:`. - -**Project config and server config are the same artifact, hence the same name.** The server *serves the project*: the file that says "these graphs exist, with these stored queries and this policy" is simultaneously the project manifest and the server's deploy config. Role is distinguished by which *sections* are populated, never by filename. Readers ignore sections that are not theirs (today's file already does this with `cli:` vs `server:`). - -**Why not kube's role-split.** Two coherent models exist: (A) one project file with role-sections (Helix `helix.toml` holds both `[local.dev]` and `[enterprise.production]`; compose; Cargo), and (B) deployment-manifest strictly separate from client config (kubectl — you never put a context in `deployment.yaml`). 
kube is the sharpest topological analog (multi-server × multi-graph, one client targeting many), so B has a real claim. The tiebreaker is **E1: OmniGraph is embedded-first.** In embedded mode the manifest's `graphs:` *is* the local target list — manifest and local-client-view are the same object, so splitting them (B) fights the grain and forces two files for local work. kube splits because it has **no** embedded mode (client always remote+global). So: take the half kube is right about — *remote* client targeting (`servers:`, endpoints, creds) is a separate concern in a separate **user-global** file (`config.yaml`, like `~/.kube/config`); reject the half it is wrong about for us — do **not** split the *project* layer by role. **The second name (`config.yaml`) is justified by scope (user-global), not role.** *(If OmniGraph ever dropped embedded mode and went pure-remote, model B's strict split would become cleanest.)* - -### 4. File naming - -Principles from the field: **one global dir** `~/.omnigraph/` (like `~/.aws`/`~/.kube`/`~/.helix`), with config/cache/state as **subdirectories** (separation without XDG's three-root scatter); **secrets keyed by server name in the OS keychain or a separate git-ignored profile file** (AWS/gh model, not a new `credentials.yaml`); **project-root manifest keeps the app-named file** (`Cargo.toml`, `package.json`); **`.yaml`, not `.yml`**; keep OmniGraph's established names. The genuinely *new* decisions are the **global** dir's existence and keyed-by-name resolution with an explicit `auth.token` override (MR-971); the shipped `bearer_token_env` + `auth.env_file` mechanism remains as legacy compat. - -| Artifact | Path / name | Why | -|---|---|---| -| Project = server config (one artifact) | `./omnigraph.yaml` | **Keep.** Root manifest like `Cargo.toml` / `compose.yaml` / `helix.toml`. Same name for both roles because it is one file. 
In prod the server's deploy repo and an app repo each have their own `omnigraph.yaml` — same name, different repos. | -| Global user config | `~/.omnigraph/config.yaml` | **One dir** (`~/.omnigraph/`, like `~/.aws`/`~/.kube`/`~/.helix`). Named `config.yaml` *not* `omnigraph.yaml` — the name signals scope (and `~/.aws/config`, `~/.kube/config`, `~/.helix/config` all do this). Holds the full schema so a solo user needs nothing else. | -| Credentials | OS keychain (`omnigraph:`, preferred) → `~/.omnigraph/credentials` profile file (`[]`, `0600`, git-ignored). **Keyed by server name**, inside the one dir. | **Key by name, AWS/gh model** — `~/.aws/credentials [profile]`, `~/.kube/config users:`, `~/.helix/credentials`. *Not* a `credentials.yaml`, and *not* a per-server hand-named env var; the secret lives under the server name (no indirection). Legacy `bearer_token_env` + `.env.omni` dotenv remain as a compat path. See §5. | -| Cache / state | `~/.omnigraph/cache/`, `~/.omnigraph/state/` | Subdirs of the one dir (like `~/.aws/sso/cache/`, `~/.kube/cache/`) — cache is `rm -rf`-safe and backup-excludable without scattering across XDG roots. | -| Cedar policy | `./policies/.yaml` + `.tests.yaml` | **Keep.** Referenced by `policy.file`. | -| Schema | `./*.pg` (e.g. `schema.pg`) | **Keep.** | -| Stored queries | `./queries/*.gq` | **Keep.** `.gq` sources referenced by the `queries:` registry. | - -**Global dir: `~/.omnigraph/` — one place, with subdirectories.** Everything OmniGraph keeps for a user lives under a single `~/.omnigraph/` directory, matching the peer group (`~/.aws`, `~/.kube`, `~/.docker`) and the direct competitor (`~/.helix`). This is what DB/cloud-CLI users expect and the lowest-cognitive-load shape. - -*Separation and "one place" are not in conflict* — the decisive realization. 
The peer tools get config/cache/state separation via **subdirectories inside the one dir**, not via XDG's three scattered roots: `~/.aws/sso/cache/`, `~/.kube/cache/`. So OmniGraph keeps `~/.omnigraph/config.yaml`, `~/.omnigraph/credentials`, `~/.omnigraph/cache/` (catalogs — `rm -rf`-safe, backup-excludable), `~/.omnigraph/state/` (session, logs) — getting cache hygiene **and** a single discoverable location, without the XDG scatter. An earlier draft argued XDG on a false dichotomy (it assumed single-dir ⇒ mixed); subdirs dissolve it. `~/.omnigraph/` is canonical and documented; `$XDG_CONFIG_HOME` may optionally be honored if a user has set it, but XDG is not part of the mental model. - -**Env / override precedence (the `KUBECONFIG` analog):** -- `OMNIGRAPH_CONFIG=/path` — explicit config file, highest precedence. -- `OMNIGRAPH_HOME=/path` → the global dir (default `~/.omnigraph/`); `$XDG_CONFIG_HOME` optionally honored if a user has set it, but `~/.omnigraph/` is canonical. -- Cache and state are subdirs of the one dir: `~/.omnigraph/cache/` (cached remote catalogs), `~/.omnigraph/state/` (session, logs). -- Per-server token resolution: an explicit `auth: { token: {...} }` source (env/file/command/keychain) wins if set; otherwise **keyed by the server name** — `OMNIGRAPH_TOKEN_` (or `OMNIGRAPH_TOKEN` for the active server) → OS keychain `omnigraph:` → the `[]` profile in `~/.omnigraph/credentials`; legacy `bearer_token_env` still honored. See §5. - -### 5. 
Credentials, connection tiers, and bind portability (12-factor) - -**Credentials are by-reference everywhere, never inlined — and keyed by the *server name*, not by a hand-invented env-var name.** This is the one place the design departs from simply reusing the shipped `bearer_token_env` mechanism, because that mechanism is sub-optimal for a multi-server client: it forces the operator to invent and coordinate an env-var name per server (three steps to add a server: pick a var, name it in config, set it in the store). The peer group (AWS profiles, `gh` hosts, kubeconfig users, docker auths) instead keys the secret **by the server's name** — no indirection. OmniGraph should match that. - -**Resolution for server `` (no config field required):** -1. **`OMNIGRAPH_TOKEN_`** env var (name-derived, upper-snake), else **`OMNIGRAPH_TOKEN`** for the active server — the CI/headless override (12-factor). -2. **OS keychain** entry `omnigraph:` — the preferred interactive store (no plaintext on disk); written by `omnigraph login `. -3. **`~/.omnigraph/credentials`** — an AWS-style profile file keyed by server name (mode `0600`, git-ignored), the fallback when no keychain: - ```ini - [prod-us] - token = … - [prod-eu] - token = … - ``` -So a `servers.` with no token field resolves by name — adding a server is one step (`omnigraph login `), and "multiple servers, multiple tokens" falls out for free. - -**But implicit must not be the *only* path — explicit sourcing is a first-class option** (the DX/AX lesson). Pure-convention is invisible (you must *know* `OMNIGRAPH_TOKEN_`), can't integrate with a secrets-manager's fixed var name, and can't do dynamic/short-lived tokens. So a server may declare an explicit `auth:` block — a **method-agnostic wrapper** (today only `token:` for bearer; `mtls:`/`oidc:` are the future siblings, so the credential model never has to be re-keyed) holding a tagged token *source*. 
Secrets are *still* never inlined (every source is a reference): - -```yaml -servers: - prod-us: - endpoint: https://og-us… - auth: { token: { env: OG_PROD_US_TOKEN } } # explicit env var — self-documenting (= legacy bearer_token_env) - prod-eu: - endpoint: https://og-eu… - auth: { token: { command: [vault, read, -field=token, secret/og] } } # dynamic / short-lived - edge: - endpoint: https://og-edge… - auth: { token: { file: /run/secrets/og-token } } # k8s/docker mounted secret - staging: - endpoint: https://og-staging… # no auth: → implicit chain (below) -``` - -| `auth.token:` source | when | DX/AX value | -|---|---|---| -| *(auth omitted)* | the common case | zero-config; `omnigraph login` populates keychain `omnigraph:` | -| `{ env: VAR }` | secrets-manager / CI injects a fixed var | **self-documenting** — config states the source; = the legacy `bearer_token_env` | -| `{ file: PATH }` | k8s/docker secret mounted as a file | no env plumbing | -| `{ command: [...] }` | Vault, cloud IAM, `gh auth token` | **dynamic tokens** — first-class exec, the capability pure-env/keychain can't give (kube `exec` / AWS `credential_process`) | -| `{ keychain: ENTRY }` | pin a non-default keychain entry | explicit override of the name-derived default | - -**Resolution per server:** if `auth.token:` is set, use that source (no fallthrough). Else the **implicit chain**: `OMNIGRAPH_TOKEN_` (or `OMNIGRAPH_TOKEN` for the active server) → keychain `omnigraph:` → `[]` in `~/.omnigraph/credentials` (`0600`, git-ignored). `omnigraph login ` writes/rotates only that server's secret; per-server precedence is independent; sharing is opt-in (same env var or source). The `command` source runs locally with the operator's own privileges and is defined only in operator-owned config (never server-supplied), so it adds no remote-execution surface. 
The `auth:` wrapper is method-agnostic so adding mTLS/OIDC later is a new sibling key, not a breaking re-key (Hyrum's Law: the field name is a contract once shipped). There is **no `credentials.yaml`** and **no inlined secret**. *Convention for the floor, explicit for control — and explicit is legible to agents and never inlines a secret.* - -**Back-compat.** The shipped per-graph `bearer_token_env` + `auth.env_file` dotenv (`resolve_remote_bearer_token`, real-env-wins) keeps working unchanged for existing single-server setups; `bearer_token_env` is just the legacy flat alias for `auth: { token: { env } }`. Resolution tries an explicit `auth.token:` (or legacy `bearer_token_env`) first, then the keyed-by-name chain — so nothing breaks, but the zero-config default is the no-boilerplate keyed-by-name path. (MR-971 — the `bearer_token_env` parity gap — is where this resolver work lands.) - -**Three connection tiers** (Supabase/Prisma teach the zero-config floor): -1. **Env vars** — `OMNIGRAPH_SERVER=https://…` + `OMNIGRAPH_TOKEN=…`: zero-config remote, no file (the `DATABASE_URL` floor). -2. **Global `config.yaml`** — named `servers:` + `graphs:` for multi-server setups (the AWS-profiles convenience). -3. **Project `omnigraph.yaml`** — project-pinned targets/graphs, committed. - -**Keep `omnigraph.yaml` a *portable* manifest (12-factor).** Deploy-specific runtime that varies per environment — the **bind host/port**, worker counts — should be supplied by **`--bind` / `OMNIGRAPH_BIND` (flags/env)**, *not* a committed `server.bind:` baked into the manifest. A manifest that hardcodes `0.0.0.0:8080` is not portable across deploys and leaks an environment detail into a version-controlled file. The same-named `omnigraph.yaml` stays portable across deploys precisely because the volatile, per-environment knobs live in env/flags (12-factor config), while the stable, portable definition (graphs, queries, policy) lives in the file. 
This is the one concrete lesson taken from kube's model-B without adopting its file split: portability via env/flags, not via a second file. - -### 6. Where stored queries live: defined locally, invoked remotely - -A stored query splits across two axes; do not conflate them: -- **Definition** (`.gq` source + `queries:` entry) lives next to the **embedded graph entry that owns it**. For a hosted remote graph, that is the **deployment manifest** read by `omnigraph-server`; for a personal embedded graph, it may be the user's own config. It never lives on a client-side `Remote` graph entry. -- **Discovery** ("what tools exist for me?") is fetched from the **server** (Cedar-filtered `GET /queries` / MCP catalog) at connect time. -- **Invocation** is **remote** (client → server, HTTP/MCP) — or **embedded** (the CLI opens the graph directly and reads the same manifest). - -For remote use, the client carries *pointers to servers*, not query definitions; it **discovers and invokes**, never defines. This is the **capability-as-code guarantee for agents**: an agent can only invoke tools the server's *committed, reviewed* config exposes — it **cannot define a new tool at runtime**. Definition is structurally outside the agent's reach. - -`queries:` (graph-capability registry, Cedar-gated when served remotely, MCP-visible when exposed) and `aliases:` (client CLI shortcut) overlap — both can name `.gq`-backed operations. This RFC keeps them siblings (the MR-969 decision); the clean long-term is **one registry, two invocation surfaces** (embedded + remote), with `aliases:` subsumed. Out of scope here. - -#### Reconciling `aliases:` with the role model - -`aliases:` is the pre-MR-969, **client-role, embedded-only, ungated** ancestor of `queries:`. An alias bundles `command` (read/change), `query` (`.gq` path), `name` (symbol), `args` (positional param names), and `graph`/`branch`/`format` defaults; the CLI runs it embedded. The server never reads it. 
So: - -- **Role:** `aliases:` is **client-role** (CLI behavior) → it may live in **both** the user-global `config.yaml` and the project manifest, layered. `queries:` is **graph-capability role** → it lives only on an `Embedded` graph entry, and for remote server graphs that means the server deployment manifest. *Who opens the graph determines where query definitions can live.* -- **Difference:** `aliases:` = embedded invocation, no gating, explicit `command`, bundles client defaults + positional args. `queries:` = remote (+future embedded), Cedar + `mcp.expose`, **infers** read/mutate, bundles only MCP settings. -- **Convergence:** decompose an alias — *definition* (name→.gq+symbol) → `queries:` (the superset: typed, validated, gated, multi-surface, no redundant `command`); *target/branch/format* → client invocation context (`--target`/`--branch`/`--format` or `defaults:`), not baked per-query; *positional `args`* → thin CLI sugar or dropped (agents/services use named JSON params). End-state: one `queries:` registry + the client config model subsumes `aliases:`. -- **Validation:** a file-backed alias (`query: ./foo.gq`) may target only an embedded graph. A remote graph shortcut must be explicit that it invokes a server-owned stored query, e.g. `invoke: find_user`, so the client cannot smuggle a new `.gq` definition into a remote capability surface. -- **v1:** keep `aliases:` unchanged. Footgun worth a load-time warn: an alias and a query with the same name in one manifest are different namespaces invoked differently (`--alias X` vs `POST /queries/X`). - -```yaml -aliases: - local_owner: - command: query - query: ./queries/owner.gq - name: owner - graph: dev # valid only if `dev` resolves Embedded - - remote_owner: - invoke: find_user - graph: prod # valid only if `prod` resolves Remote; source lives on the server - args: [name] -``` - -### 7. 
CLI surface - -- `omnigraph login ` — interactive auth; stores the token keyed by server name in the OS keychain (`omnigraph:`) or the `[]` profile of `~/.omnigraph/credentials` (0600). The `gh auth login` analog. -- `omnigraph use ` — set the active graph (writes the appropriate layer). The `kubectl config use-context` analog. -- `omnigraph config view [--resolved] [--show-origin] []` — print the merged config and, with `--resolved`, the final tuple **plus the origin layer of every field** (the `git config --show-origin` / `kubectl config view` analog). Resolution is never a mystery. -- All existing verbs (`query`, `mutate`, `load`, `schema`, `branch`, …) gain `--graph `; resolution decides embedded vs remote transparently. - -### 7.5 Init, login, and bootstrap — three tiers (folds in the Q2 design) - -Scaffolding splits into three tiers by *scope* and *fatness*, mirroring the field (supabase `init` vs `login`; HelixDB thin `init` vs fat `chef`). Most of this lives in sibling tickets; this RFC owns only the **user route**. 
- -| Tier | Command | Scope | What it does | Model | Status | -|---|---|---|---|---|---| -| **User route** | `omnigraph login []` | user (`~/.omnigraph/`) | auth + write `~/.omnigraph/config.yaml` / `credentials`; first-run global setup | gh / supabase `login` | **this RFC** (unbuilt) | -| **Thin project init** | `omnigraph init` | project, in-place | create graph + `scaffold_config_if_missing` (`omnigraph.yaml` + minimal `.pg`/`.gq`); refuse-if-exists or `--force` | `cargo init`, `prisma init` | exists; `--force` purge = MR-975 | -| **Fat bootstrap** | `omnigraph quickstart [--template ] [--auto]` | project, possibly new-dir | scaffold + seed data + `serve start` + agent prompt file | HelixDB `chef`, `create-next-app` | MR-973 (unbuilt) | - -**Design positions** (first-principles, since none of the fat tier is built): -- **Split `init` (project) from `login` (user)** — never one command writing to both `$HOME` and the project (the supabase line, not the dbt line). `init`=project scaffold; `login`=user credential + global config. -- **`init` is in-place + refuse-if-exists** (cargo/prisma/terraform default): don't clobber; adopt existing files; require `--force` to overwrite (and `--force` purges Lance state per MR-975). -- **Interactive for humans, `--auto`/agent-mode for automation** (npm `-y`, create-* `--CI`, MR-981 `--machine`). In `OMNIGRAPH_AGENT_MODE` any prompt → fail with a repair hint. -- **Templates are a `--template ` flag on the fat tier** (create-vite model), with the *content* (schema + queries + seed) coming from a template source. Mechanism is a design question (bundled-in vs `og template pull` from a repo vs `npm create-*`-style delegation) — **not** an existing foothold (MR-581 stale). Lean: a small set of bundled templates first (generic `Person→Knows`, plus promote `omnigraph-intel-bootstrap`), `--template ` later. 
-- **`init`/`quickstart` can scaffold the `graphs:` map with one or more entries**; "init with specific graphs" = the scaffolded `graphs:` block (embedded `storage:` locally; the agent/operator adds remote `server:` entries via `login` + editing).
-- **Secrets-on-scaffold rule** (prisma/dbt/supabase all do this): anything that writes a token also keeps it out of VCS. `login` prefers the OS keychain (no file); the `~/.omnigraph/credentials` profile fallback is `0600` and git-ignored, and any project-local `.env`-shaped file gets a `.gitignore` entry.
-
-### 8. Concrete shape
-
-**Global** `~/.omnigraph/config.yaml` (per-user, secret-free):
-```yaml
-servers: # endpoint only; token is keyed by the server name
-  prod-us: { endpoint: https://og-us.internal:8080 }
-  prod-eu: { endpoint: https://og-eu.internal:8080 }
-  staging: { endpoint: https://og-staging.internal:8080 }
-graphs:
-  personal: { storage: ~/graphs/personal.omni }
-defaults:
-  graph: personal
-aliases:
-  my_people:
-    command: query
-    query: ~/queries/people.gq
-    name: list_people
-    graph: personal
-```
-
-**Project client** `./omnigraph.yaml` (committed, secret-free, portable; no `server.bind`). Note the shipped noun is `graphs:` (MR-603); an entry is embedded (`storage:`) XOR remote (`server:` + `graph_id:`, §1.1):
-```yaml
-graphs:
-  dev: { storage: s3://team-bucket/dev.omni, branch: main } # embedded
-  staging: { server: staging, graph_id: prod, branch: review } # remote → graph `prod` on server `staging`
-  prod-us: { server: prod-us, graph_id: production }
-  prod-eu: { server: prod-eu, graph_id: production } # multi-homed: same graph, another server
-defaults: { graph: dev, output_format: table }
-aliases:
-  owner:
-    command: query
-    query: ./queries/owner.gq
-    name: owner
-    args: [name]
-    graph: dev
-```
-Select with `--graph ` (shipped flag, MR-603).
-
-**Server deployment** `./omnigraph.yaml` (committed in the deploy repo, read by `omnigraph-server`). Every served graph is an embedded storage locator; server-owned policy and stored-query definitions live here:
-```yaml
-graphs:
-  production:
-    storage: s3://team-bucket/prod.omni
-    policy:
-      file: ./policies/prod.yaml
-    queries:
-      find_user:
-        file: ./queries/find_user.gq
-        mcp: { expose: true, tool_name: lookup_user }
-
-server:
-  policy:
-    file: ./policies/server.yaml
-```
-
-**Credentials** are keyed by server name: `omnigraph login prod-us` writes the OS keychain entry `omnigraph:prod-us` (or a `[prod-us]` profile in `~/.omnigraph/credentials`, 0600, git-ignored); `OMNIGRAPH_TOKEN_PROD_US` overrides for CI. No token fields in any config file; no committable secrets.
-
-## DX
-
-1. **One command surface, two loci.** `query --graph dev` (embedded) and `--graph staging` (remote) are the same command; only resolution differs. Change one word, not a mental model.
-2. **Clone-and-go.** Project config names servers+graphs; teammate runs `omnigraph login staging` once and every target resolves. The git + `gh auth login` model.
-3. **Multi-server × multi-graph is the default.** Remote graph entries reference `server` by name; `servers` is a global named map; graphs are per-server. `prod-us` and `prod-eu` both serving `production` is two graph entries; Helix cannot express this.
-4. **Solo-first.** Everything in `~`, no project required.
-5. **Laptop-to-fleet on one schema.** Local = one `omnigraph.yaml` (both roles); prod = role-split across repos. No second format to learn.
-
-## AX (agent experience)
-
-1. **One flat resolved context, never a config to navigate.** target→server→endpoint→token resolves *before* the agent sees anything. The agent reasons about tools, not topology (the LLM-safe-surface principle extended to config).
-2. **Secrets are structurally outside the agent's reach.** The repo it operates in has no tokens; they are in the global layer / keychain, outside its view. An agent *cannot* exfiltrate a prod token from project config because it is not there.
-3. **Branch/snapshot-pinned contexts** (E4): hand an agent a `branch: review` / `--snapshot v42` target and its reads are reproducible and cannot see uncommitted main-line state. No kubeconfig analog.
-4. **The agent's capabilities are a GitOps'd artifact** (E6): which graphs exist, which stored-query tools it may call, and which Cedar rules gate them are all in the version-controlled server config. Powers change only via a reviewed PR, deployed by restart. Infrastructure-as-code for what the AI can do.
-5. **Config + policy compose.** Config = "where am I pointed + which token"; Cedar = "what may I do there." Orthogonal; no enforcement logic leaks into config.
-
-## GitOps: three surfaces, secrets in none
-
-| Surface | Repo | Contents | Deploy | Secrets |
-|---|---|---|---|---|
-| Server deployment config | infra/deploy repo | `graphs:`, policy, **`queries:` + `.gq` files** | commit → CI → **server restart** (no hot reload) | none, by-reference |
-| Project client config | app repo | `graphs:` → embedded storage or remote server+graph | committed, read by CLI/agent | none |
-| Global user config | **not GitOps'd**; machine-local `~` | `servers:` + creds-by-ref | `omnigraph login` writes it | refs only (like `~/.kube/config`) |
-
-## Comparison
-
-| Property | kubeconfig | Helix | git | compose | **OmniGraph (this RFC)** |
-|---|---|---|---|---|---|
-| Named remote endpoints + creds-by-ref | ✅ | ✅ | partial | partial | ✅ (global `servers`) |
-| Global + project layering, uniform schema | ✗ | ✗ | ✅ | ✗ | ✅ |
-| Embedded OR remote under one name | ✗ | ✗ | n/a | ✗ | ✅ (E1) |
-| Multi-server × multi-graph | ✅ | ✗ | n/a | n/a | ✅ (E2) |
-| Branch/snapshot in the address | ✗ | ✗ | partial | ✗ | ✅ (E4) |
-| Agent tool surface in the repo | ✗ | ✗ (separate bundle) | n/a | n/a | ✅ (E6) |
-| Project manifest renamed by role | - | no | - | no | **no** |
-| Concept count | 3 | 1 | 2 | 1 | **2 (servers/targets)** |
-
-## Migration / backwards compatibility
-
-- **Additive.** Today's `omnigraph.yaml` (`graphs:`, `cli:`, `server:`, `aliases:`, `policy:`) keeps working unchanged. `graphs:` entries are equivalent to embedded `targets:` with a `storage:` (shipped `uri:` is a deprecated alias); both resolve.
-- **`targets:` is new** and optional. `servers:` is new and optional. Absent → today's behavior.
-- **Global `~/.omnigraph/config.yaml` is new.** Absent → only project + env + flags, exactly as now. Its addition is the **global-first posture flip**: today the CLI is project-anchored (reads `./omnigraph.yaml`, no parent walk); the global config becomes the new primary discovery path so the CLI works with no project file. Existing project-only workflows are unchanged (project still overrides global); the flip is additive: it adds a fallback layer below the project file, it does not remove the project file.
-- **`graphs:` → `targets:` is an evolution, not a break.** Both can coexist; `targets:` is the superset (adds remote + branch pinning). A future cleanup may alias `graphs:` to embedded `targets:`.
-- **`server.bind` stays supported** but documentation steers operators to `--bind` / `OMNIGRAPH_BIND` for portability; no removal.
-- **Credentials: keyed-by-name is new; `bearer_token_env` is the compat path.** The primary design (keychain / `[]` profile / `OMNIGRAPH_TOKEN_`) is new resolver work (lands on MR-971). The shipped `bearer_token_env` + `auth.env_file` dotenv (`resolve_remote_bearer_token`) is **unchanged and still honored**: existing single-server dotenv setups keep working, and the resolver honors an explicit `auth: { token: {...} }` source (env/file/command/keychain) with `bearer_token_env` as its flat legacy alias. No `credentials.yaml`.
-- **Validation tightens invalid mixes, not valid legacy use.** Top-level `policy:` / `queries:` remain only for anonymous bare-URI compatibility. Named graphs use per-entry fields. Remote graph entries with local `policy:` / `queries:` and server manifests with `server:` graph locators are rejected because there is no correct way to honor those fields.
-
-## Open questions
-
-- **`graphs:` vs `targets:` naming churn.** Do we rename `graphs:` → `targets:` (with a deprecation alias) or keep `graphs:` for embedded and add `targets:` for remote? Leaning: keep both, document `targets:` as the superset.
-- **Keychain integration scope.** Keychain is now the *primary* credential store (§5), so this is on the critical path, not optional: macOS Keychain first (matches operator practice) with the `0600` `[]` profile file as fallback; Linux Secret Service / `pass` later. Open: which keyring crate, and the exact `OMNIGRAPH_TOKEN_` name-derivation (upper-snake, non-alnum → `_`).
-- **Project-local `servers:`.** Allowed (e.g. a localhost dev server), merged with global. Confirm creds stay by-reference even for project-local servers (yes).
-- **`aliases:` ⇄ `queries:` convergence.** Out of scope here; tracked separately. One registry with embedded + remote invocation surfaces is the target end state.
-- **Single-file `KUBECONFIG`-style list.** Do we support `OMNIGRAPH_CONFIG` pointing at multiple files (colon-joined), or a single file only? Start single; revisit if demand appears.
-
-## Implementation: breadboard + slices (Shape A)
-
-Shaped via requirements + a fit check (Shape A: global-first layered config + unified `graphs:` entry + three-tier init; selected over a project-first minimal option and a Helix-clone). This section breadboards A and slices it. **Bold** = NEW.
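The `OMNIGRAPH_TOKEN_` name-derivation listed under Open questions (upper-snake, non-alnum → `_`) could look roughly like the sketch below. This is only an illustration of the rule as stated there, not the shipped resolver: the function name `token_env_var` is hypothetical, and the RFC leaves the exact derivation open.

```rust
/// Hypothetical helper: derive the per-server token variable name.
/// Assumed rule (from the open question): upper-case ASCII letters and
/// digits are kept, every other character becomes `_`.
fn token_env_var(server: &str) -> String {
    let suffix: String = server
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() {
                c.to_ascii_uppercase()
            } else {
                '_'
            }
        })
        .collect();
    format!("OMNIGRAPH_TOKEN_{}", suffix)
}
```

Under this rule `prod-us` maps to `OMNIGRAPH_TOKEN_PROD_US`, matching the CI override named in the Credentials paragraph. Note that the mapping is lossy: `prod-us` and `prod_us` would derive the same variable name, which is one thing the open question may need to settle.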
-
-### Places
-
-| # | Place | What |
-|---|---|---|
-| P1 | Disk | `~/.omnigraph/{config.yaml, credentials, cache/, state/}` + project `omnigraph.yaml` + `.env.omni` |
-| P2 | Config resolution | runs on every command: load layers → merge → resolve `--graph` |
-| P3 | Command execution | embedded engine OR remote HTTP client |
-| P4 | Remote `omnigraph-server` | existing HTTP surface (`/query`, `/mutate`, `/queries/{name}`) |
-| P5 | Scaffold | `login` / `init` / `quickstart` |
-
-### Affordances
-
-| # | Place | Affordance | NEW? | Wires |
-|---|---|---|---|---|
-| U1 | P1 | `~/.omnigraph/config.yaml` (operator edits) | **N** | → N1 |
-| U2 | P1 | project `./omnigraph.yaml` | - | → N1 |
-| U3 | P1 | `~/.omnigraph/credentials` / `.env.omni` dotenv (secrets, git-ignored) | - | → N4 |
-| U4 | P3 | `omnigraph --graph ` (any command) | - | → N14 |
-| U5 | P5 | `omnigraph login []` | **N** | → N11 |
-| U6 | P5 | `omnigraph init` / `quickstart [--template]` | partly | → N12 / N13 |
-| U7 | P2 | `omnigraph config view --resolved --show-origin` | **N** | → N10 |
-| N1 | P2 | `load_layered_config()`: global (N3) + project (cwd), serde each | **N** | → N2 |
-| N2 | P2 | **merge engine**: deep-merge settings; replace named-resource entries; replace lists; **retain provenance** and raw field origins | **N⚠️** | → N5, → S_merged |
-| N3 | P2 | global-dir resolver: `OMNIGRAPH_HOME` else `~/.omnigraph/` | **N** | → N1 |
-| N4 | P2 | `load_env_file_into_process`: dotenv, real-env-wins (existing) | - | → N9 |
-| N5 | P2 | `resolve_graph(name, merged)` → typed `Embedded`/`Remote` locator; rejects invalid role/field combinations before execution | **N⚠️** | → N6 |
-| N6 | P3 | `GraphConn`: `Embedded(engine)` \| `Remote(http)` dispatch | **N⚠️** | → N7, → N8 |
-| N7 | P3 | embedded path: `Omnigraph::open(uri)` (existing) | - | → engine |
-| N8 | P3 | **HTTP-client path**: POST `/query`/`/mutate`/`/queries/{name}` | **N⚠️** | → P4, → N9 |
-| N9 | P2 | `resolve_bearer_token(server)`: explicit `auth.token` source if set, else **keyed by name** (`OMNIGRAPH_TOKEN_`/`OMNIGRAPH_TOKEN` → keychain `omnigraph:` → `[]` profile); legacy `bearer_token_env`/dotenv (MR-971) | **N⚠️** | → N8 |
-| N10 | P2 | `config view` handler: merged + per-field origin (needs N2 provenance) | **N** | → U7 |
-| N11 | P5 | `login` handler: interactive auth → write `config.yaml` + `credentials` (0600) + `.gitignore` | **N⚠️** | → S_global |
-| N12 | P5 | `init` handler: `scaffold_config_if_missing` + create graph; refuse-if-exists/`--force` purge (MR-975) | partly | → S_project |
-| N13 | P5 | `quickstart` handler: scaffold + `--template` + seed + `serve start` + agent prompt (MR-973; needs serve MR-970) | **N⚠️** | → S_project |
-| N14 | P3 | agent-mode wrapper (`--machine`/`OMNIGRAPH_AGENT_MODE`): JSON, structured errors, never-prompt, typed exit codes (MR-981) | **N⚠️** | → N1 |
-| S_global | P1 | `~/.omnigraph/config.yaml` + `credentials` | **N** | read by N1/N9 |
-| S_project | P1 | `./omnigraph.yaml` + `.env.omni` | - | read by N1/N4 |
-| S_merged | P2 | in-memory resolved config (per command, with provenance) | **N** | read by N5/N10 |
-| S_cache | P1 | `~/.omnigraph/cache/` (remote catalogs) | **N** | read by N8 |
-
-```mermaid
-flowchart TB
-  subgraph P1["P1: Disk"]
-    U1["U1: ~/.omnigraph/config.yaml"]
-    U2["U2: ./omnigraph.yaml"]
-    U3["U3: credentials dotenv"]
-  end
-  subgraph P2["P2: Config resolution"]
-    N3["N3: global-dir (OMNIGRAPH_HOME)"]
-    N1["N1: load_layered_config"]
-    N2["N2: merge engine (+provenance)"]
-    N4["N4: dotenv loader"]
-    N5["N5: resolve_graph(--graph)"]
-    N9["N9: resolve_bearer_token"]
-    N10["N10: config view"]
-  end
-  subgraph P3["P3: Command execution"]
-    U4["U4: omnigraph --graph"]
-    N14["N14: agent-mode wrapper"]
-    N6["N6: GraphConn embedded|remote"]
-    N7["N7: embedded Omnigraph::open"]
-    N8["N8: HTTP-client POST"]
-  end
-  subgraph P5["P5: Scaffold"]
-    U5["U5: login"]; U6["U6: init/quickstart"]
-    N11["N11: login handler"]; N12["N12: init"]; N13["N13: quickstart"]
-  end
-  P4["P4: remote omnigraph-server"]
-  U1-->N1; U2-->N1; N3-->N1; N1-->N2-->N5-->N6
-  U3-->N4-->N9-->N8
-  U4-->N14-->N1
-  N6-->N7; N6-->N8-->P4
-  N2-->N10-->U7["U7: config view --resolved"]
-  U5-->N11; U6-->N12; U6-->N13
-  classDef ui fill:#ffb6c1,stroke:#d87093,color:#000
-  classDef n fill:#d3d3d3,stroke:#808080,color:#000
-  class U1,U2,U3,U4,U5,U6,U7 ui
-  class N1,N2,N3,N4,N5,N6,N7,N8,N9,N10,N11,N12,N13,N14 n
-```
-
-### Slices (vertical, each demo-able)
-
-| # | Slice | Parts/affordances | Demo |
-|---|---|---|---|
-| **V1** | **Global layer + merge + `config view`** | A1–A4 · N1,N2,N3,N10 · U1,U7,S_global,S_merged | Put config in `~/.omnigraph/`, run `omnigraph config view --resolved --show-origin` from any dir → merged result with per-field origin; existing embedded commands work global-first with no project file |
-| **V2** | **Remote graphs + HTTP client + creds** | A5–A7 · N5,N6,N8,N9 · S_cache | Define a `server:` graph entry; `omnigraph query --graph prod` hits the remote server (`curl`-free); embedded `--graph dev` still local |
-| **V3** | **`omnigraph login`** | A8 · N11,U5 | `omnigraph login prod` writes `~/.omnigraph/credentials` (0600) + `.gitignore`; V2 remote query now works with no manual env |
-| **V4** | **Thin-init hardening + quickstart + templates** | A9 · N12,N13,U6 (needs serve MR-970) | `omnigraph quickstart --template person-knows` scaffolds + seeds + serves; `init --force` purges (MR-975) |
-| **V5** | **Agent-mode** | A10 · N14,U4 (MR-981) | `OMNIGRAPH_AGENT_MODE=1 omnigraph query …` → JSON + structured errors + typed exit codes; never-prompt |
-
-V1 is the foundation (global-first + merge + view). V2 closes the substantive client→server gap. V3 is credential ergonomics. V4/V5 ride sibling tickets (MR-970/973/981). MR-969 (stored queries) ships independently and is reached by N8's `/queries/{name}` once V2 lands.
-
-## Rollout
-
-The slices above are the rollout order: **V1 (global layer + merge) → V2 (remote graphs + HTTP client) → V3 (login) → V4 (quickstart/templates, on MR-970) → V5 (agent-mode, MR-981).** V1–V2 close the substantive gap (global-first config + `curl`-free server access); V3–V5 are ergonomics that ride sibling tickets. Evaluate after V2 against early-adopter and agent-onboarding (MR-973 / MR-974) signal. The spikes (X1 HTTP-client, X2 merge engine, X3 resolver+provenance, X4 login) resolve before their owning slice.
-
-## Prior art
-
-- kubeconfig (clusters / users / contexts; `KUBECONFIG`; `kubectl config view`)
-- Helix CLI v2 (`helix.toml` local+enterprise instance blocks; `~/.helix/config`; `~/.helix/credentials`)
-- AWS CLI (`~/.aws/config` + `~/.aws/credentials` split; named profiles; `credential_process`)
-- git (`~/.gitconfig` + `.git/config`; `--show-origin`)
-- Cargo (`Cargo.toml` manifest + `~/.cargo/config.toml`)
-- Supabase / Prisma (one project manifest; connection via `DATABASE_URL` env)
-- 12-factor app (config that varies by deploy lives in the environment)
diff --git a/docs/dev/rfc-003-mcp-server-surface.md b/docs/dev/rfc-003-mcp-server-surface.md
deleted file mode 100644
index 32fbce5..0000000
--- a/docs/dev/rfc-003-mcp-server-surface.md
+++ /dev/null
@@ -1,270 +0,0 @@
-# RFC: MCP Server Surface for `omnigraph-server` (Full Tool Parity, Stored Queries, Modular Auth)
-
-**Status:** Proposed
-**Date:** 2026-06-01
-**Tickets:** MR-969 (stored queries + MCP exposure; the surface this completes), MR-956 (federated auth / WorkOS OAuth; the auth substrate this consumes), MR-971 (per-server credential resolver), MR-974 (agent setup surface; the installer that wires this), MR-668 (multi-graph server; shipped, the routing this builds on)
-**Builds on:** [omnigraph#128](https://github.com/ModernRelay/omnigraph/pull/128) (`ragnorc/stored-queries-mcp`): the shipped stored-query registry, `GET /queries`, `POST /queries/{name}`, and the coarse `invoke_query` gate.
-**Supersedes:** the MCP-transport portion of [rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) (`/mcp/tools` + `/mcp/invoke`). See [Relationship to RFC-001](#relationship-to-rfc-001).
-**Target release:** v0.8.x (phased; see Rollout)
-
-## Summary
-
-Add a first-class **MCP (Model Context Protocol) server surface to `omnigraph-server`**, exposed over **Streamable HTTP**, that projects the server's operations as MCP tools and resources for LLM clients (Claude Code/Desktop/web, Cursor, etc.). Two populations of tools share one projection path:
-
-1. **Built-in operational tools**: parity with the existing `@modernrelay/omnigraph-mcp` stdio package's **13 tools** (`health`, `snapshot`, `read`, `schema_get`, `branches_list`, `commits_list`, `commits_get`, `change`, `ingest`, `branches_create`, `branches_delete`, `branches_merge`, `schema_apply`) and its **2 resources** (`omnigraph://schema`, `omnigraph://branches`), plus a new server-scoped `graphs_list` tool and an `omnigraph://graphs` resource (multi-graph mode).
-2. **Dynamic stored-query tools**: one MCP tool per `mcp.expose: true` entry in the `queries:` registry (MR-969 / #128), with parameters typed from the `.gq` declaration via the shipped `query_catalog_entry` / `param_descriptor` projection.
-
-Every tool is **authorized by the server's existing Cedar policy engine**. The MCP layer never implements its own authentication: it consumes an **already-resolved `ResolvedActor`** from the server's bearer middleware (`require_bearer_auth` today; the `TokenVerifier` seam when MR-956 lands), so the **same MCP endpoint serves on-prem (static or customer-OIDC tokens) and our cloud (WorkOS OAuth) by configuration only**. Cloud OAuth is an additive layer (RFC 9728 protected-resource metadata) that slots in with zero MCP changes.
-
-The end-state collapses two diverging tool implementations into one: the in-server MCP is the canonical, Cedar-gated, remotely-reachable surface; the stdio package becomes a thin stdio↔HTTP proxy (local on-ramp) over it.
-
-> **Key caveat, stated up front (see §5.9 below):** the headline "a token scoped via Cedar to a *specific set* of stored queries" requires **per-query `invoke_query` scope**, which is *designed* (rfc-001) but **not yet implemented**; the shipped action is coarse (any stored query on the graph, or none). Per-actor Cedar curation works today for *built-in vs ad-hoc vs admin* tools and for *stored-vs-ad-hoc*; sub-selecting individual stored queries per actor is gated on a prerequisite (PR 0b). Until then, stored-query curation is graph-level (registry membership + `mcp.expose`).
-
-## Relationship to RFC-001
-
-[rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) (MR-656 / MR-976 / MR-969) is the parent design for stored queries + the response envelope + MCP. This RFC is the **detailed MCP-transport design** that #128 left for a follow-up, and it **revises rfc-001 in three places where the shipped code or the MCP wire protocol diverged from rfc-001's sketch**:
-
-1. **Transport shape.** rfc-001 sketched `GET /mcp/tools` + `POST /mcp/invoke` (a bespoke REST pair). **That is not the MCP wire protocol; real MCP clients cannot connect to it.** This RFC implements actual MCP JSON-RPC over Streamable HTTP and reuses `query_catalog_entry` as a *projection source*, not a parallel surface. (rfc-001's own Open Question already leaned toward Streamable HTTP.)
-2. **Exposure config.** rfc-001 specified inline `.gq` pragmas (`@mcp(expose=…)`, default `expose=false`). **#128 shipped a different mechanism:** YAML `queries..mcp.expose` in `omnigraph.yaml`, **default `true`** (declaring a query in the manifest *is* the opt-in). This RFC builds on the shipped YAML form; the `.gq`-pragma design in rfc-001 is superseded for exposure.
-3. **Schema introspection.** rfc-001 lists "Schema introspection through MCP" as a **non-goal** ("agents see types through declared return shapes"). This RFC **revises that**: the operational-parity tools include `schema_get` and `omnigraph://schema`, *because the shipped stdio package already exposes both*. The non-goal is achieved by *policy*, not omission: `schema_get`/`omnigraph://schema` are Cedar-gated by `Read`, and the recommended locked-down agent policy denies `Read`, so a curated agent still never sees the schema. (rfc-001's intent is preserved; the mechanism moves from "don't build it" to "build it, gate it.")
-
-Everything else in rfc-001 (two-paths-one-engine, per-query `invoke_query` *as the intended scope*, the response envelope, multi-graph per-graph endpoints) this RFC consumes unchanged.
-
-> **Numbering note:** the `TokenVerifier`/WorkOS auth design is referred to in code (`crates/omnigraph-server/src/identity.rs`) as "RFC 0001," which is a *different* document from this repo's `docs/dev/rfc-001-queries-envelope-mcp.md`. To avoid the collision this RFC cites the auth substrate as **MR-956** throughout, never "RFC 0001."
-
-## Reconciliation with shipped code (verified against `ragnorc/stored-queries-mcp` HEAD)
-
-Verified against `crates/omnigraph-server/src/{lib.rs,api.rs}` and `crates/omnigraph-policy/src/lib.rs` at the current branch head (not the #128 PR body, and not `api.rs` alone):
-
-- ✅ `GET /queries` returns the `mcp.expose == true` subset as `QueriesCatalogOutput { queries: [QueryCatalogEntry] }`, each with typed `ParamDescriptor`s, `tool_name`, `description`, `instruction`, and a `mutation` flag. **MCP-ready projection, but exposed as bespoke REST/JSON, not the MCP wire protocol.**
-- ✅ `POST /queries/{name}` route exists (`server_invoke_query`, `lib.rs`).
-- ✅ `query_catalog_entry()` / `param_descriptor()` with an exhaustive `ScalarType → ParamKind` map (a new scalar is a compile error).
-- ✅ `InvokeQuery` Cedar action defined in `omnigraph-policy`.
-- ✅ **`InvokeQuery` IS enforced** at `POST /queries/{name}`: `server_invoke_query` calls `authorize(PolicyAction::InvokeQuery)` and **masks a denial to a 404 identical to "unknown query"** so the catalog isn't probeable (the denial-masking the previous draft of this RFC reported as missing is shipped; it lives in `lib.rs`, not `api.rs`). The stored-mutation path is already double-gated: `InvokeQuery` outer, then `Change` inside `run_mutate`.
-- ✅ **Reuse path exists:** `run_query` / `run_mutate` are already decoupled from their HTTP request bodies and take registry-supplied `(source, name, params, branch/snapshot)`. MCP `tools/call` for both stored and ad-hoc tools delegates to these; no new business logic.
-- ❌ **Per-query (`invoke_query[name]`) scope is NOT implemented.** `PolicyRequest` carries only `{action, branch, target_branch}` (**no query-name dimension**) and the action is documented coarse ("permits *any* stored query on the graph"). rfc-001 *designed* per-name scope; it is unbuilt. This RFC's per-query Cedar filtering (§5.4) and recommended agent policy (§5.9) depend on it → tracked as **PR 0b**.
-- ❌ No MCP protocol surface (`initialize`/`tools/list`/`tools/call`, JSON-RPC, transport).
-- ❌ No `TokenVerifier` trait yet; `require_bearer_auth` resolves a `ResolvedActor` inline (static-hash). The trait/`OidcJwtVerifier` are MR-956 (draft). The MCP layer's only requirement (*consume `ResolvedActor`*) is satisfiable today.
-
-Stack (verified `Cargo.toml`): Axum + utoipa (OpenAPI) + `omnigraph-policy` (Cedar) + `futures` + `tokio`. **No MCP crate present.** `edition = "2024"`.
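The deny-equals-missing behaviour recorded in the checklist above (a denied `InvokeQuery` returns the same 404 as an unknown query) can be illustrated with a small stand-alone sketch. It assumes the coarse action, so the decision carries no query name; `Status`, `invoke_status`, and the boolean `actor_may_invoke` are hypothetical stand-ins, not the types in `lib.rs`.

```rust
use std::collections::HashSet;

/// Hypothetical outcome type; the real handler returns HTTP responses.
#[derive(Debug, PartialEq)]
enum Status {
    Ok,
    NotFound, // rendered as the same 404 body in both failure cases
}

/// `actor_may_invoke` stands in for the coarse Cedar decision on
/// `InvokeQuery` (graph-level, no per-query dimension).
fn invoke_status(known: &HashSet<&str>, actor_may_invoke: bool, name: &str) -> Status {
    // A denial and an unknown name take the same branch, so a caller
    // cannot distinguish "exists but forbidden" from "does not exist".
    if !actor_may_invoke || !known.contains(name) {
        return Status::NotFound;
    }
    Status::Ok
}
```

The point of folding both checks into one branch is that there is no code path on which the two failures could drift apart (for example by gaining different error messages).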
- -## Motivation - -- **One curated, safe, remotely-reachable tool surface.** MR-969's thesis: hand an LLM a token Cedar-scoped to a set of tools and it sees exactly those typed tools β€” cannot construct ad-hoc queries it isn't permitted, cannot read the schema it isn't permitted, cannot reach other graphs. Today the only MCP is the stdio package: local-only, full surface, ungated. -- **Parity, so the in-server MCP can be the single implementation.** Operators/agents already depend on the operational tools. Supporting them server-side behind one Cedar gate lets the stdio package degrade to a proxy and removes two diverging tool sets. -- **On-prem and cloud from one endpoint.** A managed cloud (WorkOS OAuth) and an on-prem/air-gapped deploy (static or customer-OIDC tokens) must serve the same MCP without forks or MCP-specific auth. -- **Foundation for the agent on-ramp (MR-974).** `omnigraph mcp install --agent ` needs a decided transport + a stable endpoint. - -## Goals - -- Project built-in tools + stored queries as MCP tools through **one** registry abstraction. -- `tools/list` and the callable set are **identical for argument-independent authorization**, both driven by Cedar (see Β§5.4 for the branch-scoped caveat). -- The MCP layer is **auth-method-agnostic**: it consumes `ResolvedActor`, never a raw token, never branches on how auth happened. -- The same endpoint works on-prem (static/OIDC) and cloud (WorkOS OAuth), switched by config; cloud OAuth is additive (RFC 9728). -- No new business logic: MCP tools delegate to the same `run_query`/`run_mutate`/branch/schema functions the HTTP routes call. -- Behaviour-neutral when unused: no MCP traffic = no change. - -## Non-Goals - -- **Building/hosting an OAuth authorization server.** The server is a Resource Server; WorkOS AuthKit+Connect is the AS (MR-956). The MCP endpoint validates tokens, never issues them, never holds client secrets. -- **OAuth/WorkOS implementation itself** β€” MR-956's work. 
This RFC leaves a clean RFC-9728 hook and consumes `ResolvedActor`. -- **MCP prompts, elicitation, `tools/list_changed`, resource subscriptions, server-initiated messages.** None needed β†’ enables a stateless POST-only transport (Β§5.6). -- **stdio transport inside the server.** stdio stays in the TS package (now a proxy). -- **Cross-graph tool listing.** Per-graph catalogs only (MR-969 + RFC-002 non-goal). -- **Hot reload of the query registry.** Restart-only (MR-969). - -## Background - -`omnigraph-server` (Axum) already implements every operation this RFC exposes as an authenticated HTTP route; each authorizes via a `PolicyAction` against the Cedar policy for a server-resolved actor and calls into the engine. The existing stdio MCP package is a *client* of these routes (it owns no business logic). MR-956 will introduce a `TokenVerifier` trait (`StaticHashTokenVerifier` today inline, `OidcJwtVerifier` for OIDC/WorkOS) producing the `ResolvedActor { actor_id, tenant_id: Option, scopes: Vec, source }` that already exists in `identity.rs` and is consumed by Cedar β€” token *validation* is offline (cached JWKS), so on-prem/air-gapped has no request-path dependency on the cloud. - -## Design - -### 5.1 One tool model: a `McpTool` trait, two populators - -Both built-in and stored-query tools implement one trait so `tools/list` / `tools/call` never special-case: - -```rust -trait McpTool: Send + Sync { - fn name(&self) -> &str; // MCP tool id (stable) - fn title(&self) -> Option<&str>; - fn description(&self) -> &str; - fn input_schema(&self) -> serde_json::Value; // JSON Schema (draft 2020-12) - fn annotations(&self) -> ToolAnnotations; // readOnlyHint / destructiveHint / idempotentHint - /// The Cedar request(s) this call requires, given parsed args. Used BOTH at - /// list-time (dry-run filter, default args) and call-time (enforce, real args). 
- fn authorization(&self, args: &ToolArgs) -> Vec; - async fn call(&self, ctx: &GraphCtx, args: ToolArgs) -> Result; -} -``` - -- **Built-ins**: ~14 static impls, each delegating to the *same* function its HTTP route calls (`run_query`, `run_mutate`, branch ops, `apply_schema_as`, …). `input_schema` authored once (or derived from each route's existing `utoipa`/`ToSchema` DTO). -- **Stored queries**: generated `McpTool` instances, one per `mcp.expose` entry; `input_schema` from `param_descriptor` (Β§5.3); `authorization` β†’ `InvokeQuery` (coarse today; `InvokeQuery{name}` after PR 0b) then the inner `Read`/`Change`. - -`ToolRegistry` for a graph = the static built-ins + the dynamic stored-query tools resolved from that graph's `GraphHandle` registry. - -### 5.2 Tool catalog (parity) and Cedar mapping - -Each built-in **reuses the exact `PolicyAction` its HTTP route already enforces** β€” verified against the handlers in `lib.rs`, not invented: - -| MCP tool | Scope | Read/Mutate | Cedar action (verified from route) | -|---|---|---|---| -| `health` | server | read | none (liveness/version) | -| `graphs_list` *(new)* | server | read | `GraphList` | -| `snapshot` | graph | read | `Read` | -| `schema_get` | graph | read | `Read` | -| `branches_list` | graph | read | `Read` | -| `commits_list`, `commits_get` | graph | read | `Read` | -| `read` (ad-hoc `.gq`) / `query` *(alias)* | graph | read | `Read` | -| `change` (ad-hoc `.gq`) / `mutate` *(alias)* | graph | mutate | `Change` | -| `ingest` (NDJSON) | graph | mutate | `Change` (+ `BranchCreate` when forking a new branch) | -| `branches_create` | graph | mutate | `BranchCreate` | -| `branches_delete` | graph | mutate | `BranchDelete` | -| `branches_merge` | graph | mutate | `BranchMerge` | -| `schema_apply` (`allow_data_loss`) | graph | mutate | `SchemaApply` | -| **stored query** (`find_user`, …) | graph | inferred | `InvokeQuery` (coarse; `InvokeQuery{name}` after PR 0b) + inner `Read`/`Change` | - -There is **no 
`Ingest` and no separate `snapshot`/`Export` action** β€” `ingest` enforces `Change`, `snapshot` enforces `Read`. (`Export` exists but maps to the `/export` route, which this RFC does not expose as a tool.) - -**Tool id parity vs. canonicalization.** The shipped stdio package uses tool ids **`read`/`change`** (and calls the deprecated `/read`,`/change` routes). The server HTTP surface canonicalized to `/query`,`/mutate` with `/read`,`/change` deprecated (MR-656). To keep existing package clients working *and* align with the server, the MCP exposes **`query`/`mutate` as canonical with `read`/`change` retained as deprecated-but-live aliases** (both dispatch to the same handler). Open Q7 asks whether to drop the aliases later. - -Resources (Β§5.5): `omnigraph://schema`, `omnigraph://branches` (parity), plus `omnigraph://graphs` *(new)* β€” each gated by the same action as its list/get route (`Read`, `Read`, `GraphList`). - -### 5.3 `ParamDescriptor β†’ JSON Schema` (stored-query tools) - -| `ParamKind` | JSON Schema | Notes | -|---|---|---| -| String | `{"type":"string"}` | | -| Bool | `{"type":"boolean"}` | | -| Int (i32/u32) | `{"type":"integer"}` | | -| BigInt (i64/u64) | `{"type":"string","pattern":"^-?\\d+$"}` | JSON numbers lose precision >2⁡³ β†’ string (matches the shipped `api.rs` rationale). (Open Q1) | -| Float (f32/f64) | `{"type":"number"}` | | -| Date | `{"type":"string","format":"date"}` | | -| DateTime | `{"type":"string","format":"date-time"}` | | -| Blob | `{"type":"string","contentEncoding":"base64"}` | | -| Vector | `{"type":"array","items":{"type":"number"},"minItems":dim,"maxItems":dim}` | uses `vector_dim` | -| List | `{"type":"array","items":}` | scalar items only (grammar guarantees) | - -`nullable == false` β†’ param is in `required`. Annotations: `mutation` β†’ `{readOnlyHint:false, destructiveHint:true}`; else `{readOnlyHint:true}`. `description` β†’ tool description; `instruction` β†’ appended to description (or `_meta`). 
(The shipped `check()` already warns when an `mcp.expose` query declares a `Vector` param an LLM can't supply.) - -For built-in tools the schema is hand-authored from the route DTO; e.g. `query` → `{source: string, branch?: string, params?: object}`; `schema_apply` → `{schema: string, allow_data_loss?: boolean}`; `ingest` → `{ndjson: string, mode?: "merge"|"append"|"overwrite", branch?: string}`. - -### 5.4 `tools/list` (Cedar-filtered) and `tools/call` (dispatch + masking) - -- **`tools/list`**: build the `ToolRegistry`; for each tool evaluate `authorization(default_args)` against the actor's Cedar policy; **emit only tools that authorize**. Authz decisions memoized per request. Stored-query tools additionally require `mcp.expose: true`. - - **Exactness caveat (R7 is conditional):** the listed set equals the callable set **only for tools whose authorization is argument-independent** (`health`, `graphs_list`, `snapshot`, `schema_get`, `branches_list`, `commits_*`, ad-hoc `query`/`mutate`, and stored queries under the *coarse* action). For **branch-scoped tools** (`branches_create`/`merge` with `target_branch_scope`, and any branch-scoped `Read`/`Change` rule), list-time uses `default_args` (e.g. branch `main`) and cannot know the real target, so the listed set is a *best-effort approximation* of callability: a call may still be denied (or, rarely, a hidden tool would have been allowed). `tools/call` is always the authoritative gate. The contract is: **list never shows a tool the actor can't ever call; for branch-scoped tools it may show one the actor can call only on some branches.** -- **`tools/call`**: resolve `name` → `McpTool` (masked-404 if unknown *or* `mcp.expose:false`); parse+validate args against `input_schema`; enforce `authorization(args)` (mutations stay double-gated: `InvokeQuery` then `Change`); on success `call`.
**Denial masking** lives in one place (the dispatcher): an authz denial is returned identically to "unknown tool" (§5.10), reusing the same deny≡missing principle already shipped at `POST /queries/{name}`. - -### 5.5 Resources - -Advertise `resources` capability (`subscribe:false, listChanged:false`). `resources/list` → the URIs the actor may read; `resources/read` → schema `.pg` text / branches JSON / (multi-graph) graphs JSON, each gated by the corresponding action (`Read`, `Read`, `GraphList`). A locked-down agent denied `Read` simply never sees `omnigraph://schema` or `omnigraph://branches`; this is how rfc-001's "agents don't introspect schema" intent is met *by policy* (§Relationship-to-RFC-001). - -### 5.6 Transport: Streamable HTTP, stateless, POST-only - -- **Streamable HTTP** (MCP's current standard; we're already an HTTP server). One endpoint per scope (§5.7). -- Because the server emits **no** server-initiated messages, implement the **minimal conformant** shape: client `POST`s JSON-RPC, server replies `application/json`. **No SSE channel, no `Mcp-Session-Id`, stateless**: each request is authenticated independently via the bearer middleware. Honour the `MCP-Protocol-Version` header. SSE/sessions can be added later if subscriptions land. -- **JSON-RPC methods:** `initialize` (advertise `{tools:{listChanged:false}, resources:{listChanged:false, subscribe:false}}` + serverInfo/version), `notifications/initialized` (no-op ack), `ping`, `tools/list`, `tools/call`, `resources/list`, `resources/read`. `prompts/list` returns empty if probed. -- **Library decision (Open Q2):** spike `rmcp` (official Rust MCP SDK) for conformance + Streamable-HTTP/Axum on edition 2024; **fall back to a hand-rolled ~150 LOC JSON-RPC-over-POST** (only the methods above) on friction. Given the tiny surface, hand-roll is an acceptable default.
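
A minimal sketch of the hand-rolled fallback, assuming a routing function that sees only the JSON-RPC method name and the result of tool lookup plus authorization. The `Lookup`, `Reply`, `mask` and `route` names are hypothetical and do not come from the codebase; the point is only that denial, unknown tool and `mcp.expose:false` collapse into one identical error in a single place, while an unknown method gets the standard `-32601`.

```rust
// Standard JSON-RPC 2.0 error codes referenced in the sections above.
const METHOD_NOT_FOUND: i32 = -32601;
const INVALID_PARAMS: i32 = -32602;

/// Hypothetical result of resolving a tool name and evaluating Cedar for the actor.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Lookup {
    Allowed,
    Unknown,
    NotExposed,
    Denied,
}

/// Hypothetical reply shape; a real handler would carry JSON payloads instead.
#[derive(Debug, PartialEq)]
enum Reply {
    /// A non-tool method the server answers directly (initialize, ping, lists, reads).
    Handled,
    /// `prompts/list` returns an empty list if probed.
    EmptyList,
    /// The call is authorized and is handed to the tool implementation.
    Dispatch,
    Error { code: i32, message: &'static str },
}

/// The single masking point: three different causes, one indistinguishable error.
fn mask(lookup: Lookup) -> Reply {
    match lookup {
        Lookup::Allowed => Reply::Dispatch,
        Lookup::Unknown | Lookup::NotExposed | Lookup::Denied => Reply::Error {
            code: INVALID_PARAMS,
            message: "unknown tool",
        },
    }
}

/// Routes only the methods listed in 5.6; everything else is "method not found".
fn route(method: &str, lookup: Lookup) -> Reply {
    match method {
        "initialize" | "notifications/initialized" | "ping" | "tools/list"
        | "resources/list" | "resources/read" => Reply::Handled,
        "prompts/list" => Reply::EmptyList,
        "tools/call" => mask(lookup),
        _ => Reply::Error {
            code: METHOD_NOT_FOUND,
            message: "method not found",
        },
    }
}
```

Because `mask` is the only function that turns a lookup outcome into an error, a caller comparing `route("tools/call", Lookup::Denied)` with `route("tools/call", Lookup::Unknown)` gets equal values, which is the property the masking rule is meant to guarantee.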
- -### 5.7 Endpoint routing (server- vs graph-scoped) - -- **Single-graph mode:** `POST /mcp`: graph tools + server tools (`health`, `graphs_list`). -- **Multi-graph mode (MR-668):** `POST /graphs/{graph_id}/mcp`: graph-scoped tools for that graph; plus a server-level `POST /mcp` exposing only server-scoped tools (`health`, `graphs_list`). A per-graph endpoint never lists another graph's tools (isolation, tested). Mirrors the shipped `/graphs/{graph_id}/…` cluster routing. (Open Q5: confirm naming + whether server tools also appear on the per-graph endpoint.) - -### 5.8 Modular / decoupled auth (the cross-cutting requirement) - -**Invariant (load-bearing, satisfiable today):** the MCP handler receives an **already-resolved `ResolvedActor`** and **branches on nothing** about how the token was verified. No token parsing, no method check, no OAuth inside the MCP module. Today that actor comes from `require_bearer_auth`; when MR-956 lands it comes from a `TokenVerifier`; the MCP code is identical either way. - -``` -request → [auth middleware: ResolvedActor] → [MCP route] → Cedar → McpTool -``` - -**Server side (auth is config, not code):** - -| Deployment | Verifier | MCP change | -|---|---|---| -| On-prem, static bearer | `require_bearer_auth` / `StaticHashTokenVerifier` | none | -| On-prem, customer IdP | `OidcJwtVerifier` → customer issuer (MR-956) | none | -| Our cloud | `OidcJwtVerifier` → WorkOS, `tenant_id = Some(org_id)` (MR-956) | none | - -Token validation is offline (cached JWKS); on-prem/air-gapped keeps working with no request-path cloud dependency. The MCP endpoint never terminates OAuth and never holds a client secret (Resource Server only). - -**Cloud client negotiation (additive, no MCP changes):** when MR-956 lands, the server publishes RFC 9728 `/.well-known/oauth-protected-resource` and returns `WWW-Authenticate: Bearer ..., resource_metadata="..."` on 401.
A compliant MCP client (Claude) then auto-negotiates: static bearer to an on-prem endpoint; on a cloud 401 it discovers the WorkOS AS and runs OAuth/PKCE itself: **same endpoint URL, zero client-side branching.** This RFC only requires that MCP routes flow through the standard 401 path so that hook can be added later without touching MCP. - -**Multi-user identity pass-through (cloud):** the *caller's* token (a WorkOS JWT, audience-bound per-tenant) must reach the server so Cedar enforces per-user/per-tenant policy, never a shared service token. The MCP endpoint validates it offline and maps `org_id → tenant_id`. This is why the **remote path is the in-server HTTP MCP that Claude connects to directly** (its token flows through), not a stdio bridge impersonating a user. - -**Client-side credential acquisition (CLI/SDK/proxy): pluggable `CredentialSource`** (RFC-002 §5, MR-971), keyed by server name, so OAuth is a future *sibling key*, not a re-key: - -```yaml -servers: - onprem: { endpoint: https://og.internal:8080, auth: { token: { env: OG_TOKEN } } } - edge: { endpoint: https://og-edge, auth: { token: { command: [vault, read, -field=token, secret/og] } } } - cloud: { endpoint: https://api.omnigraph.cloud, auth: { oauth: { issuer: workos } } } # future sibling -``` - -Implicit chain when `auth:` omitted: `OMNIGRAPH_TOKEN_` → keychain `omnigraph:` → `[]` in `~/.omnigraph/credentials`; legacy `bearer_token_env` honoured. Secrets never inlined. - -### 5.9 Safety model: Cedar is the gate, default-deny is the floor - -With ad-hoc `query`/`mutate`/`schema_apply` present as tools, the **only** thing protecting an untrusted agent is the Cedar policy. Therefore: - -- **Default-deny when tokens are configured** (MR-723, shipped) is the floor: an actor with no grants sees an empty tool list.
-- **What works today (coarse action):** a policy can hide all ad-hoc tools and admin tools per-actor (`deny Read, Change, SchemaApply, Branch*`) while allowing stored queries (`allow InvokeQuery`). That already reproduces "can't run ad-hoc, can't read schema, can only call stored queries"; the agent sees *every* exposed stored query plus nothing else. -- **What needs PR 0b (per-query scope):** selecting *which* stored queries an actor may call (`allow InvokeQuery [find_user, list_orders]`, deny the rest). The shipped `invoke_query` is coarse (all stored queries or none). Until PR 0b adds a query-name dimension to `PolicyRequest` + the Cedar schema (rfc-001's intended design), per-actor sub-selection of stored queries is **not expressible**; curation is graph-level (which `.gq` files are registered + `mcp.expose`). -- `schema_apply`, `branches_delete`, ad-hoc `mutate` require an explicit admin-tier grant; never in a default agent policy. -- (Open Q3) Optional `mcp.allow_adhoc` server switch defaulting **off** for the ad-hoc `query`/`mutate` tools, as defence-in-depth independent of Cedar, and independent of PR 0b. - -### 5.10 Result shaping and error mapping - -- **Success:** `tools/call` returns `content: [{type:"text", text:}]` where `` is the route's existing output envelope (read rows / mutation summary, i.e. `ReadOutput` / `ChangeOutput`). (Open Q4: also emit `structuredContent` + `outputSchema`; deferred, text-JSON for v1.) -- **Tool execution error** (bad params after schema validation, engine error): result with `isError:true` + a text content block. -- **Authorization denial / unknown tool / `mcp.expose:false`:** a single JSON-RPC error (`-32602`, message `"unknown tool"`), identical for all three so policy isn't probeable (same principle as the shipped `POST /queries/{name}` 404 masking).
-- **Auth failure** (bad/absent bearer): HTTP 401 from the middleware *before* MCP; it carries `WWW-Authenticate` (the RFC 9728 hook) and is never masked as a tool error. (This is exactly the path the shipped `authorize`/`authorize_request` split preserves: operational failures keep their status; only *denials* are masked.) - -## Relationship to the `@modernrelay/omnigraph-mcp` stdio package - -Verified surface of the package (`omnigraph-ts`, pkg version `0.3.0`, `@modelcontextprotocol/sdk@^1.29.0`, **stdio only**): **13 tools** (`health`, `snapshot`, `read`, `schema_get`, `branches_list`, `commits_list`, `commits_get`, `change`, `ingest`, `branches_create`, `branches_delete`, `branches_merge`, `schema_apply`) and **2 resources** (`omnigraph://schema`, `omnigraph://branches`). It is a thin client over the SDK → HTTP routes and **forwards the caller's bearer verbatim** (no inspection). - -Once parity lands, **collapse to one implementation**: the in-server MCP is canonical (Cedar-gated, remote-capable, the path that becomes a Claude-web connector via MR-956). The stdio package degrades to a **thin stdio↔HTTP proxy** forwarding JSON-RPC (and the incoming `Authorization`) to `/mcp`, staying the local on-ramp for Claude Code/Desktop while sharing one tool set, one Cedar gate. Transition: keep the current independent stdio package on its `0.3.x`/`0.6.x` line; ship proxy mode in a later TS minor once the server endpoint is GA. (Note: the package is currently several minors behind the server (its vendored `spec/openapi.json` predates the stored-query routes), so it needs the standard re-sync regardless of MCP work.) - -## Testing - -- **Protocol conformance:** `initialize` handshake + advertised capabilities; `tools/list` shape; `tools/call` happy path; JSON-RPC error envelopes (`-32601` unknown method, `-32602` invalid params / unknown tool); `resources/list` + `resources/read`.
-- **Cedar filtering (coarse, today):** an actor with `allow InvokeQuery` + `deny Read/Change` sees *all* exposed stored queries but **not** `query`/`mutate`/`schema_get`; `tools/call query` returns masked "unknown tool"; an admin sees the full catalog. -- **Cedar filtering (per-query, gated on PR 0b):** actor scoped to `InvokeQuery [find_user]` sees *only* `find_user`; `tools/call list_orders` masks. **This test ships with PR 0b**, not PR 1; it cannot pass against the coarse action. -- **Parity per built-in:** each tool round-trips against the same expectations as its HTTP route (reuse route tests); `read`/`change` aliases dispatch identically to `query`/`mutate`. -- **Double-gating:** a stored mutation requires both `InvokeQuery` and `Change`; `schema_apply` requires `SchemaApply`. -- **`mcp.expose:false`:** absent from `GET /queries` and MCP `tools/list`; still service-callable by name through `POST /queries/{name}` when the actor has `invoke_query`, but not MCP-callable. -- **Schema generation:** table-driven over every `ParamKind` incl. nullable / list / vector(dim). -- **Branch-scoped list approximation:** assert the documented R7 caveat: a branch-scoped policy lists `branches_create`, and `tools/call` is the authoritative gate (a denied target still 403s/masks). -- **Multi-graph isolation:** `/graphs/a/mcp` never lists graph `b`'s tools; server `/mcp` exposes only server tools. -- **Auth decoupling:** the MCP suite is green under the current `require_bearer_auth` and under a mock OIDC `ResolvedActor` source, proving verifier-agnosticism. A 401 carries `WWW-Authenticate`. -- **OpenAPI:** the JSON-RPC endpoint is not REST; document only the envelope in utoipa (or exclude); keep `openapi.json` drift test green (`OMNIGRAPH_UPDATE_OPENAPI=1` to regenerate on intentional change). -- **Cross-repo smoke (optional):** point `@modelcontextprotocol/sdk` (TS) at the HTTP endpoint in an `omnigraph-ts` integration test.
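
The schema-generation bullet in the list above could be exercised against a mapping along these lines. This is a sketch under stated assumptions: the `ParamKind` enum shape, the `Param` struct and the `json_schema` / `required` helpers are hypothetical stand-ins for whatever the real `ParamDescriptor` exposes, and the schema is built as a plain string to stay dependency-free. Only the kind-to-schema pairs themselves are taken from the §5.3 table.

```rust
/// Hypothetical mirror of the parameter kinds listed in the 5.3 table.
#[derive(Debug, Clone)]
enum ParamKind {
    String,
    Bool,
    Int,
    BigInt,
    Float,
    Date,
    DateTime,
    Blob,
    Vector { dim: usize },
    List(Box<ParamKind>),
}

/// Hypothetical parameter descriptor: name, kind, nullability.
struct Param {
    name: &'static str,
    kind: ParamKind,
    nullable: bool,
}

/// Renders the JSON Schema fragment for one kind, following the 5.3 table.
fn json_schema(kind: &ParamKind) -> String {
    match kind {
        ParamKind::String => r#"{"type":"string"}"#.to_string(),
        ParamKind::Bool => r#"{"type":"boolean"}"#.to_string(),
        ParamKind::Int => r#"{"type":"integer"}"#.to_string(),
        // 64-bit integers travel as strings because JSON numbers lose precision.
        ParamKind::BigInt => r#"{"type":"string","pattern":"^-?\\d+$"}"#.to_string(),
        ParamKind::Float => r#"{"type":"number"}"#.to_string(),
        ParamKind::Date => r#"{"type":"string","format":"date"}"#.to_string(),
        ParamKind::DateTime => r#"{"type":"string","format":"date-time"}"#.to_string(),
        ParamKind::Blob => r#"{"type":"string","contentEncoding":"base64"}"#.to_string(),
        ParamKind::Vector { dim } => format!(
            r#"{{"type":"array","items":{{"type":"number"}},"minItems":{dim},"maxItems":{dim}}}"#
        ),
        ParamKind::List(item) => {
            format!(r#"{{"type":"array","items":{}}}"#, json_schema(item))
        }
    }
}

/// Non-nullable params are the ones that land in the schema's `required` array.
fn required(params: &[Param]) -> Vec<&'static str> {
    params.iter().filter(|p| !p.nullable).map(|p| p.name).collect()
}
```

A table-driven test would then iterate over `(ParamKind, expected_schema)` pairs and compare strings, with a separate case asserting that a nullable parameter is omitted from `required`.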
- -## Rollout: phased by risk - -- **PR 0a: extract the reusable invoke path (small).** The coarse `invoke_query` gate + 404 denial-masking are **already shipped** in `server_invoke_query`. Extract the read/mutate dispatch into `invoke_stored_query(handle, name, params, branch/snapshot, actor)` so MCP `tools/call` and the HTTP route share one path. No behaviour change. *(Replaces the previous draft's "PR 0: wire the gate", which was already done.)* -- **PR 0b: per-query `invoke_query` scope (the safety prerequisite).** Add a query-name dimension to `PolicyRequest` + the Cedar schema (rfc-001's intended design), wire it at `POST /queries/{name}` and in the stored-query `McpTool::authorization`. Independently useful (the `allow InvokeQuery [find_user]` policy). **Gates the per-query Cedar-filtering test and §5.9's recommended agent policy.** -- **PR 1: MCP transport + read-only parity + stored-query reads.** Endpoint(s), `initialize`/`tools/list`/`tools/call`/`resources/*`, the `McpTool` registry, Cedar-filtered listing, the read-only built-ins (`health`, `graphs_list`, `snapshot`, `read`/`query`, `schema_get`, `branches_list`, `commits_*`) + resources + stored-query *reads*. All auth-agnostic. -- **PR 2: mutating parity + stored-query mutations.** `change`/`mutate`, `ingest`, `branches_create/delete/merge`, `schema_apply`, stored-query mutations + the `mcp.allow_adhoc` switch. -- **PR 3: docs + agent on-ramp hook.** `docs/user/server.md` MCP section (incl. the recommended agent policy + the coarse-vs-per-query caveat), `openapi.json` sync, the `omnigraph mcp install` config target (MR-974), and the downstream `omnigraph-ts` re-sync/proxy follow-up. -- **Later (separate, MR-956):** RFC 9728 protected-resource metadata + WorkOS; slots in with zero MCP changes. -- **Later (TS minor):** stdio package → proxy mode. - -## Migration / backwards compatibility - -- **Additive.** No `queries:` and no MCP traffic → today's behaviour unchanged.
New endpoints are new routes. -- **Cedar default-deny** (when tokens configured) means MCP exposes nothing until an actor is granted, so it is safe by default. -- The stdio package keeps working unchanged; proxy mode is opt-in later. -- `openapi.json` only gains the documented MCP envelope; existing REST routes untouched. - -## Open Questions - -1. **BigInt/u64 as JSON string** (recommended, precision-safe) vs number. -2. **`rmcp` vs hand-rolled** JSON-RPC (spike `rmcp` on edition 2024; default to hand-roll on friction). -3. **Default-off `mcp.allow_adhoc`** for ad-hoc `query`/`mutate` (recommended) vs always-on + Cedar-only. -4. **`structuredContent` + `outputSchema`** now vs text-JSON v1 (recommend v1 text-JSON). -5. **Endpoint paths:** `/mcp` + `/graphs/{id}/mcp`; confirm naming and whether server-scoped tools also appear on the per-graph endpoint. -6. **Stateless POST-only** confirmed (no near-term server-initiated messages); revisit only if subscriptions land. -7. **Legacy alias tools** (`read`/`change`): keep for client compat (the shipped package uses them), or drop and rely on `query`/`mutate`? -8. **PR 0b shape:** per-query scope as a Cedar *resource* (`StoredQuery::"find_user"`) vs a `query_name` *context attribute* + policy condition; this affects how `allow InvokeQuery [list]` is authored. diff --git a/docs/dev/rfc-004-cluster-graph-schema-apply.md b/docs/dev/rfc-004-cluster-graph-schema-apply.md deleted file mode 100644 index e9c0336..0000000 --- a/docs/dev/rfc-004-cluster-graph-schema-apply.md +++ /dev/null @@ -1,211 +0,0 @@ -# RFC: Cluster Graph & Schema Apply, Phase 4 of the Cluster Control Plane - -**Status:** Landed (4A #170, 4B #171, 4C; all shipped) -**Implementation deviations:** (1) D3 row 8 retires the stale delete sidecar and lets the still-approved delete re-propose and retry, instead of a pending-block; prefix removal is idempotent, so the retry is the repair.
(2) The approver/actor flag is the CLI's existing global `--as`, not a dedicated `--actor`/`--by`. (3) Consumed approval artifacts are rewritten with `consumed_at` rather than moved into state; the file and the ledger record both survive independently (axiom 11). -**Date:** 2026-06-10 -**Builds on:** cluster Stages 1–3B (shipped: validate/plan/status/refresh/import/force-unlock, config-only `cluster apply` with content-addressed catalog publish, catalog payload verification, failpoint-proven crash/CAS recovery for the apply protocol). Normative context: [cluster-config-specs.md](cluster-config-specs.md), [cluster-axioms.md](cluster-axioms.md), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md). -**Target release:** unversioned (phased; see Sequencing); no cluster functionality is in a tagged release yet. - -## Summary - -Extend `cluster apply` from config-only resources (stored queries, policy bundles) to **graph-moving resources**: graph create, cluster-driven schema apply, and graph delete. This is the nested-publish territory the implementation spec flags as its highest-risk decision: a graph's Lance manifest can move (via the engine's own atomic publish) *before* the cluster's JSON state CAS lands, and a crash in that window must never be silent, never acknowledged as success, and never repaired by guessing. - -Three design commitments make the phase tractable: - -1. **Cluster recovery is roll-forward-only.** The engine's recovery sidecars (`__recovery/{ulid}.json`, the open-time sweep in `db/manifest/recovery.rs`) already make every graph-level operation atomic *within the graph*: a schema apply either fully published or fully recovers at the next open. The cluster therefore never rolls a graph back.
Cluster sidecars exist to **classify and record**: after a crash, the sweep observes the live graph, decides "moved / didn't move / moved unexpectedly," and either rolls the *cluster state* forward to match observable reality (axiom 5) or surfaces a loud pending-repair condition. The cluster holds no second transaction log and no rollback hammer; that would duplicate substrate behavior the engine already owns (invariant: respect the substrate). -2. **Irreversible operations require a digest-bound approval artifact.** Graph delete and `allow_data_loss` schema applies consume an explicit `__cluster/approvals/{ulid}.json` record bound to the exact change digests, written by a new `cluster approve` command and retired into the state ledger's `approval_records` (the durable audit reference of axiom 11). -3. **The operator identity becomes explicit.** `cluster apply` gains an actor, threaded to the engine's `apply_schema_as` so Cedar enforcement and commit attribution work unchanged. The cluster control plane adds no policy engine of its own (transport/auth stay at the boundary). - -## Motivation - -After Stage 3B, the control plane converges everything *except* the resources that define the data plane. The Sarah/Bob test is half-passed: Bob can see that Sarah changed a schema (`plan` shows the deferred change; `refresh` shows drift), but the system cannot act on it: Sarah still applies schemas with the per-graph tool and the cluster ledger trails reality. Graph creation is worse: a new graph in `cluster.yaml` blocks every dependent query and policy with `dependency_missing` until someone runs `omnigraph init` by hand at exactly the derived path. Phase 4 closes the loop: desired config in, converged deployment out, for the full resource vocabulary of Stage 1.
- -The implementation spec's hard gate for this phase (failpoint recovery tests proving the movement-before-state-publish gap) was deliberately front-loaded: Stage 3B shipped the failpoint infrastructure and the apply-side crash/CAS tests. What remains is the design this RFC supplies: the sidecar schema, the recovery decision matrix, the approval artifact, and the ordering rules. - -## Non-Goals - -- **Server boot from cluster state** (Phase 5): applied graphs/schemas still serve nothing; the server boots from `omnigraph.yaml` until the explicit per-deployment mode switch (axiom 15). -- **Policy-owned query exposure / `mcp.expose` retirement** (Phase 6). -- **Pipelines, embeddings, UI, aliases, bindings, providers, `env_file`** (Phase 7 and reserved fields). -- **External or Lance-backed state backends**; the local JSON backend + lock/CAS remains the substrate. -- **A cluster manifest publisher.** Deferred, per the spec: it becomes interesting only if the sidecar + repair path proves too weak for the accepted safety contract. Nothing in this design forecloses it. -- **Multi-graph atomic apply groups.** Cross-graph convergence remains statusful-partial per resource; one graph's failure never pretends to fence another's success. -- **Graph rename.** Stable-identity-across-rename is an open known gap at the schema level already; graph rename compounds it and is explicitly out of scope (see Open Questions).
- -## Background - -What Phase 4 builds on (all shipped): - -- **The engine's recovery discipline.** Writers that can advance Lance HEAD before manifest publish write `__recovery/{ulid}.json` sidecars carrying per-table pins (`expected_version`, `post_commit_pin`); `Omnigraph::open` in read-write mode classifies every pinned table (`NoMovement` / `RolledPastExpected` / `UnexpectedAtP1` / `UnexpectedMultistep` / `InvariantViolation`) and decides all-or-nothing: roll forward via one manifest publish, or roll back via `Dataset::restore`, recording an audit row attributed to `omnigraph:recovery`. The cluster inherits the *vocabulary* of this design but not its mechanics; see the roll-forward-only argument below. -- **The engine's schema-apply surface.** `apply_schema_as(desired_source, SchemaApplyOptions { allow_data_loss }, actor)` returns `SchemaApplyResult { supported, applied, manifest_version, steps }`; `preview_schema_apply_with_options` returns the migration plan plus desired catalog without applying; the `__schema_apply_lock__` branch serializes schema applies graph-wide and refuses to run while user branches exist. Policy enforcement (`enforce(SchemaApply, TargetBranch("main"), actor)`) happens before the lock. -- **Graph init.** `Omnigraph::init(uri, schema_source)` with a strict preflight (errors if schema artifacts exist) and an atomic `_schema.pg` claim. A documented gap: a failed init does not clean up Lance datasets or `__manifest/` it already created. -- **No engine graph-delete primitive.** Deleting a graph today means removing its object-store prefix. This RFC works with that fact rather than waiting on a primitive. -- **Cluster state and observations.** `state.json` (locked, CAS-checked, atomically replaced) already records per-resource digests, statuses, `observations["graph."]` with `manifest_version` and live schema digest, plus empty `approval_records` / `recovery_records` placeholders reserved for this phase.
-- **Stage 3A/3B apply mechanics.** Dispositions (`applied`/`derived`/`deferred`/`blocked`), content-addressed catalog publish before the state CAS, persisted-statuses contract on write failure, idempotent re-apply, payload verification with the drift + self-heal loop, and failpoints `cluster_apply.after_payload_phase` / `cluster_apply.before_state_write`. - -## Design - -### D1. Resource semantics: which dispositions change - -The Stage 3A classifier gains executable rows. Everything else (catalog resources, `derived` composites, blocked dependents) is unchanged: - -| Change | Stage 3A disposition | Phase 4 disposition | -|---|---|---| -| `graph.` Create | Deferred | **Applied** (4A): `Omnigraph::init` at the derived root | -| `schema.` Create | Deferred | **Applied with the graph create** (the init carries the schema) | -| `schema.` Update | Deferred | **Applied** (4B): `apply_schema_as` against the live graph | -| `graph.` / `schema.` Delete | Deferred | **Applied behind approval** (4C): prefix removal | -| `query.*`/`policy.*` blocked on the above | Blocked | Unblocked in the same apply once the dependency lands (ordering, D5) | - -Graph roots remain **derived**: `ClusterRoot/graphs/.omni` (high-risk decision #2 dispositioned: external graph roots are a separate, explicit future feature, not this phase). - -### D2. Cluster recovery sidecar (exit criterion 2, first half) - -Written under the state lock **before** any engine call that can move or create a graph manifest; deleted only **after** the cluster state CAS that records the outcome lands. 
- -```json -{ - "schema_version": 1, - "operation_id": "", - "started_at": "", - "actor": "", - "kind": "graph_create | schema_apply | graph_delete", - "graph_id": "", - "graph_uri": "", - "observed_manifest_version": 7, - "expected_manifest_version": null, - "desired_schema_digest": "", - "state_cas_base": "sha256:" -} -``` - -Path: `__cluster/recoveries/{operation_id}.json`, atomic write (temp + rename, the `write_state` discipline). Notes: - -- `observed_manifest_version` is the live graph's main-branch manifest version read at sidecar-write time (`null` for `graph_create`: no graph yet). This is the fencing value: apply refuses to proceed if it differs from the version recorded in `observations["graph."]` at plan time *and* re-observed under the lock (the same recompute-under-lock posture as Stage 3A's diff). -- `expected_manifest_version` starts `null` and is **rewritten into the sidecar immediately after the engine call returns** with `SchemaApplyResult.manifest_version` (or the post-init observation). A crash before that rewrite leaves `null`, which the sweep treats as "engine call outcome unknown; classify by observation only." For `graph_delete` the field is **always `null`**: prefix removal produces no new manifest version, so there is no rewrite step for that kind; delete sidecars are classified purely by root presence + state tombstone (D3 rows 7/7b/8). -- `state_cas_base` is **recorded for audit and diagnostics only; the sweep decision logic never consults it.** The sweep re-reads `state.json` under the lock and performs ordinary CAS-checked writes, so an independent state mutation between sidecar write and sweep is handled by the CAS like any other concurrent write, not by this field. Its value is forensic: a recovery audit entry can show which state revision the interrupted operation departed from. -- One sidecar per graph-moving resource operation.
Apply processes graph-moving operations strictly sequentially (D5), so at most one sidecar is pending per apply run *per graph*, and the sweep processes sidecars in ULID order. - -### D3. Recovery decision matrix: roll-forward-only (exit criterion 2, second half) - -**Why no rollback.** The engine's sidecars already guarantee that a schema apply is atomic within the graph: by the time any cluster-visible manifest version moved, the engine either fully published or will recover all-or-nothing at its next read-write open. A cluster-level rollback would mean un-publishing a successfully published graph commit: rewriting substrate history the cluster does not own, duplicating the engine's transaction discipline (deny-list: custom transaction manager; state that drifts from what it can be derived from). The cluster's job after a crash is therefore *epistemic*, not transactional: observe what the graph actually is, and converge the ledger to it or refuse loudly. - -**Sweep trigger.** The sweep runs at the start of every state-mutating cluster command (`apply`, `refresh`, `import`), under the state lock, before the command's own work, mirroring the engine's open-time sweep gating (read-only `status`/`plan`/`validate` report pending sidecars as a warning, `cluster_recovery_pending`, but do not act). - -| # | Sidecar kind | Observation | Decision | -|---|---|---|---| -| 1 | any | Graph at `observed_manifest_version` (nothing moved) | Engine call never landed. Delete sidecar; the command's own plan/apply re-proposes the change. | -| 2 | `graph_create` / `schema_apply` | Graph at `expected_manifest_version`; state already records the outcome | Crash fell between state CAS and sidecar delete. Delete sidecar; done.
| -| 3 | `schema_apply` | Graph at `expected_manifest_version` (or, when `expected` is `null`, live schema digest == `desired_schema_digest`); state stale | **Roll the cluster state forward**: record the live schema digest, recompute the graph composite, set statuses `applied`, append a `recovery_records` entry (audit), CAS-write, delete sidecar. | -| 4 | `graph_create` | Graph opens read-only and its schema digest == `desired_schema_digest`; state stale | Same roll-forward as #3 (the create completed). | -| 5 | `graph_create` | Root exists but the graph does not open (the engine's partial-init gap) | Status `error`, condition `graph_create_incomplete`, message: remove the root and re-run apply. **No auto-delete**: reconciler-initiated deletion is the same data-loss class as human deletion (high-risk decision #7). Sidecar kept until the operator acts and a sweep observes a clean state. | -| 6 | any | Graph at any other version (out-of-band movement during the crash window) | Status `drifted`, condition `actual_applied_state_pending`; sidecar kept; the command refuses graph-moving work for that graph until `cluster refresh` re-observes and the operator re-plans. No success is acknowledged for the interrupted operation. | -| 7 | `graph_delete` | Root absent; state already tombstoned | The delete kind's analog of row 2 (no manifest exists to version-check): crash fell between state CAS and sidecar delete. Delete sidecar; done. | -| 7b | `graph_delete` | Root absent; state stale | Roll forward: tombstone the graph subtree out of state (D6), record audit, delete sidecar. Idempotent: re-entry after a crash mid-row lands in row 7. | -| 8 | `graph_delete` | Root present (delete crashed mid-prefix-removal or never started) | If the approval artifact is still attached (D4), the delete is re-proposed by plan and re-runnable; status `drifted`, condition `graph_delete_incomplete`. Partial prefix removal leaves an unopenable graph; same operator message as #5.
| - -Rows 3, 4 and 7b are the only mutations the sweep performs, and each is an ordinary CAS-checked state write under the lock; the sweep introduces no new write machinery. - -### D4. Approval artifacts (exit criteria 1-partial and 6-partial; axioms 8 and 11) - -The irreversible tier (graph delete and `allow_data_loss` schema apply, i.e. hard drops) requires a recorded human decision that survives any reconstruction of state. `plan` already emits `approvals_required`; Phase 4 adds the consumption side. - -**Artifact** (`__cluster/approvals/{approval_id}.json`, written by the new command, never by apply): - -```json -{ - "schema_version": 1, - "approval_id": "", - "resource": "graph.scratch", - "operation": "delete", - "reason": "", - "bound_config_digest": "", - "bound_before_digest": "", - "bound_after_digest": "", - "approved_by": "", - "created_at": "" -} -``` - -**Flow.** `cluster approve --config --by ` re-runs the plan under the lock, locates the pending gated change for that address, prints it, and writes the artifact bound to the exact digests. `cluster apply` executes a gated change only when a pending artifact matches **all** bound digests; a stale approval (config moved since) matches nothing, is reported (`approval_stale` warning), and the change stays `blocked` with condition `approval_required`. On successful execution the artifact file is moved into `state.approval_records[approval_id]` in the same state CAS that records the outcome (the state references the audit fact; losing state does not lose the approval, which is also why `import` preserves `approval_records` it finds; see D7). - -`allow_data_loss` is **never** a CLI flag on `cluster apply`; destructive promotion is expressed only through an approval artifact for the specific schema change. The default schema apply path runs with `allow_data_loss: false` (soft drops), which the spec's tier table classes as a recoverable definition rewrite: plan warning, no artifact. - -### D5.
Actor, ordering, and apply groups (exit criterion 4) - -**Actor.** `cluster apply --actor ` / `cluster approve --by `, with `OMNIGRAPH_CLUSTER_ACTOR` as the env fallback. The actor is threaded to `apply_schema_as` (so engine-side Cedar enforcement fires wherever a policy checker is installed and graph commits are attributed), recorded in sidecars, approvals, and `recovery_records`. The cluster adds no policy engine: graph-moving operations inherit the engine's gate; catalog-only operations remain ungated as today. When no actor is supplied and the target graph has no policy checker, behavior is unchanged from Stage 3A (`None` actor, as the engine's no-actor variants do); when a checker is installed the engine's existing "actor required" error surfaces as a typed diagnostic (`actor_required`). - -**Ordering.** Deterministic, dependency-shaped, within one apply run: - -1. graph creates (with their schemas) — ULID-stable order by graph id -2. schema applies — sequential, one graph at a time (each holds that graph's `__schema_apply_lock__`; the cluster state lock already serializes cluster-side) -3. catalog writes (queries/policies) — the Stage 3A path, unchanged -4. deletes last (catalog deletes, then approved graph deletes) - -Each graph-moving operation is its own apply group: sidecar → engine call → sidecar update → continue. The **state CAS stays single and final** (one write at the end recording every outcome), preserving Stage 3A's protocol; sidecars cover the widened gap between individual engine calls and that final CAS. A failure mid-sequence stops graph-moving work, reports per-resource statuses for everything already done (loud partials), and leaves sidecars for the sweep. Cross-graph atomicity is explicitly not promised. 
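The four-step ordering rule in D5 can be sketched as a plain sort over planned operations. This is a minimal illustration under stated assumptions, not the cluster's planner: `OpKind`, `PlannedOp`, and `order_ops` are hypothetical names introduced here, and the tie-break by resource address is an assumption standing in for the ULID-stable order by graph id that the text names.

```rust
/// Hypothetical operation kinds, declared in the order D5 prescribes.
/// The derived `Ord` follows declaration order, so sorting by kind
/// yields: creates, schema applies, catalog writes, then deletes.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum OpKind {
    GraphCreate,   // 1. graph creates (with their schemas)
    SchemaApply,   // 2. schema applies, one graph at a time
    CatalogWrite,  // 3. catalog writes (queries/policies)
    CatalogDelete, // 4. deletes last: catalog deletes first...
    GraphDelete,   //    ...then approved graph deletes
}

/// Hypothetical planned operation: a kind plus a resource address
/// such as "graph.scratch".
#[derive(Debug, Clone, PartialEq, Eq)]
struct PlannedOp {
    kind: OpKind,
    resource: String,
}

/// Returns the operations in a deterministic order: by kind first,
/// then by resource address (an assumed stand-in for ULID order).
fn order_ops(mut ops: Vec<PlannedOp>) -> Vec<PlannedOp> {
    ops.sort_by(|a, b| {
        a.kind
            .cmp(&b.kind)
            .then_with(|| a.resource.cmp(&b.resource))
    });
    ops
}
```

Because the order depends only on the planned operations themselves, two runs over the same plan visit resources in the same sequence, which is what lets a sidecar left by a crashed run be matched to a specific step.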
- -**Failpoints.** Each engine-call boundary gets a failpoint (`cluster_apply.before_graph_create`, `cluster_apply.after_graph_create`, `cluster_apply.before_schema_apply`, `cluster_apply.after_schema_apply`, `cluster_apply.before_graph_delete`) so every row of the D3 matrix is testable with the Stage 3B harness. - -### D6. Graph delete (4C) - -With no engine primitive, delete is cluster-orchestrated prefix removal: verify the approval artifact → sidecar (`kind: graph_delete`, current manifest version recorded) → recursively remove `ClusterRoot/graphs/.omni` → state CAS that tombstones the graph subtree (graph, schema, and its queries removed from `applied_revision.resources` and `resource_statuses`; observation replaced by a tombstone record `{deleted_at, approval_id}`) → delete sidecar. Catalog blobs of the graph's queries stay (GC remains a later stage, consistent with Stage 3A deletes). The engine gap (no atomic prefix delete; partial removal leaves an unopenable root) is handled by D3 row 8, and this RFC registers a desire for an engine-level `destroy_graph` primitive as future work, not a dependency. - -### D7. Plan and import integration - -- **Plan** gains real data impact for schema updates: where Stage 3A showed only a digest diff, Phase 4 calls `preview_schema_apply_with_options` against the live graph (read-only) and embeds the migration steps + drop warnings in the change record — the "data-aware provider peek" from the high-level spec, bounded to graphs the plan already observes. Failure to preview (graph unreachable) degrades to the digest diff with a warning, never blocks planning. -- **Import/refresh** already observe live graphs; Phase 4 makes `import` preserve `approval_records` and pending `recoveries/` it finds (state reconstruction must not orphan audit facts or pending repairs). - -### D8. 
Invariants and axioms check - -- *Respect the substrate / no custom transaction manager*: cluster never rolls back graphs; engine sidecars own intra-graph atomicity (D3). -- *Axiom 5 (state = deployed reality)*: recovery converges the ledger to observation, never observation to ledger. -- *Axiom 8 (reversibility gates apply, including drift correction)*: approval artifacts for the irreversible tier; sweep never auto-deletes (D3 rows 5/8). -- *Axiom 9 (plan-time integrity)*: ordering is planner-derived from existing dependency edges; no runtime discovery. -- *Axiom 11 (approvals in a durable ledger)*: artifacts are files first, state-referenced after consumption; reconstructable state never re-derives who approved. -- *Axiom 12 (locked state)*: every new write path (sidecars, approvals consumption, sweep) runs under the existing state lock. -- *Axiom 15 (single owner / mode switch)*: nothing here reads from or writes to `omnigraph.yaml`; applied graphs still serve nothing until Phase 5. -- *Loud partials (deny-list)*: every crash window lands in a typed status/condition; no path acknowledges unverified success. - -## Migration / Compatibility - -Additive. Stage 3A/3B behavior is unchanged for catalog-only configs; existing state files gain no required fields (`approval_records`/`recovery_records` already exist, empty). New CLI surface: `cluster approve`, `--actor` on `cluster apply`. A deployment that never declares schema changes or graph creates sees identical behavior to Stage 3B. The honored-or-rejected posture continues: no new `cluster.yaml` fields are introduced by this phase (graph roots stay derived). 
- -## Sequencing - -| Stage | Scope | Gate | -|---|---|---| -| **4A graph create** | `Omnigraph::init` at derived roots; create-intent sidecar; D3 rows 1/2/4/5; dependents unblock in-run | Failpoint tests for crash-before/after-init; e2e: declare graph → apply → import-less convergence | -| **4B schema apply** | Full sidecar lifecycle; roll-forward sweep (D3 rows 3/6); actor threading; plan data-impact preview; soft-drop default | Failpoint tests per matrix row; e2e: schema evolution fully cluster-driven (replaces the Stage 3A defer→manual→refresh loop) | -| **4C graph delete** | `cluster approve` + artifact consumption; prefix removal; tombstones; D3 rows 7/7b/8 | Failpoint tests incl. partial-removal; e2e: gated delete refused without artifact, executed with it, stale artifact rejected | - -Each stage is a separate PR with boundary-matched tests (the Stage 1–3B discipline). 4A ships first because it moves no existing manifest; 4B is the heart; 4C last because it is the only irreversible-tier executor and consumes the approval machinery 4B's hard-drop path also needs. - -## Exit-criteria coverage (implementation spec) - -| # | Criterion | This RFC | -|---|---|---| -| 1 | State/status/approval/recovery schemas + paths | **Approval + recovery schemas: answered** (D2, D4). State/status: unchanged from shipped Stage 2A/3A. | -| 2 | Sidecar schema + recovery decision matrix | **Answered** (D2, D3) | -| 3 | State backend interface / lock+CAS | Unchanged (local JSON backend, shipped) — out of scope | -| 4 | Apply group syntax + dependency ordering | **Answered** (D5): per-resource groups, fixed kind-ordering; no user-declared group syntax this phase | -| 5 | Plan JSON schema incl. 
blast radius + approvals | **Extended** (D7 preview embedding); base schema shipped | -| 6 | Bootstrap authority + first-actor | **Partial** (D5 actor threading); cluster bootstrap authority remains open (below) | -| 7 | Server startup migration | Phase 5 — deferred | -| 8 | Per-query policy / `mcp.expose` bridge | Phase 6 — deferred | -| 9 | Pipeline runtime | Phase 7 — deferred | - -## Open Questions - -1. **Bootstrap authority.** The first apply against a fresh cluster has no policy engine to consult and no actor registry; today the answer is "whoever holds the object store wins." The durable story (out-of-band privileged bootstrap actor, per the high-level spec §open-questions) is unresolved and blocks nothing in this phase, since graph-level Cedar still gates wherever installed. -2. **Approval expiry.** Artifacts are digest-bound, so config drift invalidates them naturally; is wall-clock expiry also wanted (operator hygiene), or does digest binding suffice? -3. **Sweep on read-only commands.** This RFC has `status`/`plan` only *warn* about pending sidecars. If operator feedback shows the warn-but-don't-repair posture causes confusion, promoting `plan` to run the sweep (it already takes the lock) is a compatible change. -4. **Graph rename.** Deliberately out of scope; interacts with the rename-stable-identity known gap in [invariants.md](invariants.md). A rename today is delete + create — i.e., gated, lossy, and honest about it. -5. **Engine `destroy_graph` primitive.** 4C's prefix removal is correct but unatomic; if the engine grows a graph-destroy primitive with its own recovery, D6 collapses onto it (the cluster code is shaped to delegate). 
- -## References - -- [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) — phases, exit criteria, high-risk decisions, approval tiers -- [cluster-axioms.md](cluster-axioms.md) — axioms 5, 8, 9, 11, 12, 15 -- [cluster-config-specs.md](cluster-config-specs.md) — the data-aware provider peek; state/ledger model -- `crates/omnigraph/src/db/manifest/recovery.rs` — the engine sidecar + classifier this design mirrors in vocabulary and deliberately does not duplicate in mechanics -- [writes.md](writes.md), [invariants.md](invariants.md) — engine recovery protocol and the deny-list this design is checked against diff --git a/docs/dev/rfc-005-server-cluster-boot.md b/docs/dev/rfc-005-server-cluster-boot.md deleted file mode 100644 index 6c57bba..0000000 --- a/docs/dev/rfc-005-server-cluster-boot.md +++ /dev/null @@ -1,143 +0,0 @@ -# RFC: Server Boots from Cluster State — Phase 5 of the Cluster Control Plane - -**Status:** Landed (5A policy bindings #175; 5B/5C the `--cluster` boot mode — one PR) -**Implementation deviations:** (1) cluster mode reuses `ServerConfigMode::Multi` (a new settings *source*, not a new enum variant; `config_path` carries the cluster dir). (2) Stored queries load via `QueryRegistry::from_specs` from verified blob *content*, not blob paths. (3) More than one policy bundle binding a single scope is a boot error (the serving pipeline holds one bundle per graph + one server-level; stacking is a later slice). (4) `GET /graphs` keeps its closed-by-default contract — without a cluster-bound bundle there is no server-level Cedar engine, so enumeration refuses. (5) Graph-attributed startup failures quarantine that graph by default; operators can restore all-or-nothing boot with `--require-all-graphs` / `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`. 
-**Date:** 2026-06-10 -**Builds on:** Phase 4 complete ([rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md), Landed): `cluster apply` converges graphs, schemas, stored queries, and policies into the cluster catalog. Normative context: [cluster-config-specs.md](cluster-config-specs.md) (the migration model's "window 2"), [cluster-axioms.md](cluster-axioms.md) (axiom 15), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) (Phase 5 rollout, Compatibility Stance #7–#9, exit criterion 7). -**Target release:** unversioned (phased — see Sequencing). - -## Summary - -Give `omnigraph-server` a second boot source: `omnigraph-server --cluster ` reads its graph set, stored queries, and Cedar policies from the **cluster catalog** — `state.json`'s applied revision plus the content-addressed blobs under `__cluster/resources/` — instead of `omnigraph.yaml`. This is the moment "applied" finally means "serving": the standing caveat in every cluster doc since Stage 3A ("the server still boots from `omnigraph.yaml`") retires for deployments that flip the switch. - -Three commitments: - -1. **An exclusive mode switch, never a merge** (axiom 15, Compatibility Stance #7). `--cluster ` is mutually exclusive with the positional URI, `--target`, and `--config`. In cluster mode, `omnigraph.yaml` is not read at all — not for graphs, not for queries, not for policies. There is no precedence, no key-level aliasing, no fallback read. A deployment serves from one source. -2. **The server serves the *applied* revision, not the desired config.** What's live is what `cluster apply` converged: graph roots recorded in state, query/policy content at the *applied* digests from the content-addressed catalog. Un-applied config drift never leaks into serving — the serving surface and the ledger cannot disagree (axiom 5 extended to the data path). -3. 
**The state ledger becomes serving-sufficient.** Today one fact needed to serve is missing from state: a policy's `applies_to` bindings live only in `cluster.yaml`. A prerequisite slice (5A) records binding metadata into the applied revision at apply time, so a booting server reads state + blobs and nothing else. Without this, "boot from state" would silently become "boot from state *and* config" — the merged read axiom 15 forbids. - -## Motivation - -Phase 4 closed the convergence loop but left it inert: an operator can declare, plan, approve, and apply an entire deployment, and the running server ignores all of it. The Sarah/Bob test still fails at the last step — Sarah's applied change is visible in `cluster status` but Bob's clients hit a server still wired to a hand-maintained `omnigraph.yaml`. Phase 5 makes the catalog the serving source, which is also the precondition for Phase 6 (policy-owned query exposure must filter a catalog the server actually reads). - -## Non-Goals - -- **Runtime reconciliation / hot reload.** Cluster-mode boot is static, exactly like today's boot: the server reads the applied revision once at startup; picking up a newer applied state means restarting the process. The registry's runtime-mutation seam (the test-only `insert()` + mutate `Mutex` in `registry.rs`) stays future-proofing for a later watch-and-reload slice, not this RFC. -- **Policy-owned query exposure** (Phase 6) — but this RFC defines the bridge it sunsets (§D5). -- **Remote cluster roots.** `--cluster ` is a local directory in this phase, same as the `cluster` CLI commands; S3-hosted cluster roots arrive with external state backends. -- **Retiring `omnigraph.yaml` server boot.** It remains a fully supported mode indefinitely (Compatibility Stance #8: the file's job shrinks; the *server-role* keys become inert only for deployments that switch). -- **New management endpoints** (`/cluster/status` etc.) 
— noted as future work; this RFC changes the boot source, not the HTTP surface (beyond OpenAPI regen if anything shifts). - -## Background (verified against main) - -- **Server boot today** (`omnigraph-server/src/main.rs`, `lib.rs:891-1029`): `load_server_settings` applies a four-rule mode inference (positional URI / `--target` / `server.graph` → Single; `--config` + `graphs:` → Multi), builds `ServerConfigMode::{Single,Multi}` with per-graph `GraphStartupConfig {graph_id, uri, policy_file, queries}`, loads `QueryRegistry` from `.gq` files at settings time (identity-checked), type-checks queries at engine open (`validate_and_attach`), loads Cedar via `PolicyEngine::load_graph`/`load_server`, installs it with `with_policy`, and assembles `GraphRegistry::from_handles` (startup-only; lock-free `ArcSwap` reads). Bind address and bearer tokens come from flags/env, not from graph config. No reload machinery exists. -- **The catalog today** (`omnigraph-cluster`): `state.json` records `applied_revision.resources` (address → digest) for `graph.*`, `schema.*`, `query..`, `policy.`, plus statuses, observations (incl. tombstones), approval and recovery records. Query/policy *content* lives content-addressed at `__cluster/resources/query///.gq` and `policy//.yaml`. Graph roots are derived: `/graphs/.omni`. -- **The gap**: state records a policy's *digest* only; `applies_to` (cluster vs graph refs) lives in `cluster.yaml`. Queries are fine — their graph binding is encoded in the address itself. - -## Design - -### D1. The mode switch - -New server flag: `omnigraph-server --cluster ` (the directory containing `cluster.yaml`, `__cluster/`, and `graphs/`). Mutually exclusive — a hard startup error, not a precedence rule — with the positional URI, `--target`, and `--config`. 
`--bind`, `--unauthenticated`, and the bearer-token env vars keep working identically: listen address and credentials are **process-operational facts**, not cluster facts (they differ per replica/host and never belonged to the shared catalog; if a `serve:` section ever joins `cluster.yaml`, that's a separate proposal). - -Mode inference gains rule 0: `--cluster ` → **Cluster mode**, which is always multi-graph routing (`/graphs/{graph_id}/...`), even for a single declared graph. No flat-route legacy surface in cluster mode — it's a new mode with no compatibility debt to carry. - -### D2. What the server reads (the applied revision, and only it) - -`load_server_settings` grows a cluster branch that reads, in order: - -1. `__cluster/state.json` — **missing state is a boot error** ("run `cluster import` + `cluster apply` first"). Invalid or unattributable recovery sidecars under `__cluster/recoveries/` are also a boot error: a server must not start if it cannot prove the blast radius. Valid graph-attributed sidecars quarantine that graph by default and are logged as `cluster_recovery_pending`; `--require-all-graphs` promotes them back to a boot error. -2. **Graph set** = state's `graph.` resources (tombstoned graphs are absent by construction). Each graph's URI is the derived root `/graphs/.omni`. A recorded graph whose root does not open quarantines that graph by default; `--require-all-graphs` restores the original fail-fast posture. -3. **Stored queries** = state's `query..` entries, content loaded from the catalog blob at the recorded digest. Blob-missing or digest-mismatched is a boot error (the catalog verification semantics from Stage 3B, applied at boot). Queries type-check at engine open exactly as today (`validate_and_attach` — unchanged). -4. 
**Policies** = state's `policy.` entries, content from catalog blobs, bindings from the applied metadata of D3: bundles bound to `cluster` load as the server-level Cedar engine (`PolicyEngine::load_server`); bundles bound to graphs load per-graph (`PolicyEngine::load_graph`) and install via `with_policy` — the existing two-gate structure, unchanged. -5. `cluster.yaml` is parsed **only** to validate that the directory is a cluster root (and for nothing else — explicitly not for resource content; a divergence between desired config and applied state is *served as applied*, visible via `cluster plan`). - -Everything downstream of settings construction — `GraphStartupConfig`, parallel engine opens, `GraphRegistry::from_handles`, routing middleware, auth, workload admission, OpenAPI — is reused as-is. Cluster mode is a new *source* for the same boot pipeline, not a new pipeline. - -### D3. Prerequisite: serving metadata in the applied revision (slice 5A) - -State's `StateResource` records only a digest. To make the ledger serving-sufficient, `cluster apply` (and the sweep's roll-forwards) additionally record **binding metadata** for policy resources at apply time: - -```json -"applied_revision": { - "resources": { - "policy.base_rbac": { - "digest": "", - "applies_to": ["cluster", "graph.knowledge"] - } - } -} -``` - -- Additive and optional (`#[serde(default)]`) — existing state files parse unchanged; a policy entry without `applies_to` (applied before 5A) is a **boot error in cluster mode** with the remedy "re-run `cluster apply`" (one apply rewrites the metadata; the digest needn't change — the metadata write is part of the state mutation, not the blob). -- `applies_to` is normalized to typed addresses (`cluster` | `graph.`) at apply time, mirroring the validator's normalization. -- Queries need no equivalent: the address (`query..`) already carries the binding, and the registry key/symbol invariant is enforced at apply (validate) time. 
-- This is deliberately *applied* metadata, not config mirroring: if `cluster.yaml` changes a binding, the server keeps serving the old binding until `cluster apply` converges it — the same contract as every other resource. - -### D4. Readiness and failure posture - -Cluster-global failures are fail-fast, matching the server's existing stance (bad policy YAML refuses boot). Graph-local failures quarantine the affected graph by default so a single bad graph cannot crash-loop an otherwise healthy cluster. Operators who prefer the original all-or-nothing contract pass `--require-all-graphs` or set `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`, which promotes every graph-local quarantine/open/settings failure to a boot error. - -| Condition | Behavior | -|---|---| -| `state.json` missing / unparseable / unsupported version | boot error | -| invalid/unreadable/unattributable recovery sidecars | boot error (run any state-mutating cluster command to sweep or inspect) | -| valid graph-attributed recovery sidecars | quarantine that graph; strict mode boot error | -| recorded graph root missing or unopenable | quarantine that graph; strict mode boot error | -| query/policy blob missing or digest-mismatched | boot error (run `cluster refresh` + `apply` to self-heal, then restart) | -| policy entry without `applies_to` metadata | boot error ("re-run cluster apply", D3) | -| stored query fails parse/type-check against the live schema | quarantine that graph; strict mode boot error | -| embedding provider configuration for one graph cannot resolve | quarantine that graph; strict mode boot error | -| every applied graph is quarantined or fails startup | boot error (`cluster_no_healthy_graphs`) | -| state lock held | **not** an error — boot takes no lock; it reads a point-in-time snapshot of an immutable-once-written state file (the CAS discipline means a concurrent apply produces a *new* file atomically; the server reads whichever was current at open) | - -### D5. 
The `mcp.expose` bridge in cluster mode - -The cluster query registry has no `expose` flag by design (axiom 14: exposure is a policy decision — Phase 6). Until Phase 6 ships, cluster-mode servers list **all** stored queries in `GET /queries`. This is the documented bridge: *cluster mode = everything exposed; omnigraph.yaml mode = `mcp.expose` honored as today*. Its named sunset is Phase 6's policy-filtered catalog (Compatibility Stance #9). Invocation remains gated by the existing coarse `invoke_query` Cedar action in both modes. - -### D6. Migration path (exit criterion 7) - -For an operator running multi-graph from `omnigraph.yaml`: - -1. Author `cluster.yaml` declaring the same graphs/queries/policies; place existing graph roots under `/graphs/.omni` (or start fresh). -2. `cluster import` (observes live graphs) → `cluster plan` → `cluster apply` (publishes queries/policies into the catalog; with 5A, records policy bindings). -3. Restart the server with `--cluster ` instead of `--config omnigraph.yaml`. -4. `omnigraph.yaml`'s `graphs:`/`serve:`/`queries:`/`policy:` keys are now inert for this deployment; the file remains the CLI's per-operator config. - -Rollback is the same switch in reverse — nothing in cluster mode mutates `omnigraph.yaml` or the graphs in a way the yaml mode can't serve. - -### D7. Invariants and axioms check - -- *Axiom 15 / Stance #7*: exclusive flag, hard mutual-exclusion error, zero `omnigraph.yaml` reads in cluster mode — no fact has two readers. -- *Axiom 5*: the server serves deployed reality (applied digests), never desired intent; D3 keeps the ledger the single serving source. -- *Axiom 12*: boot reads without the lock but relies on the atomic-replace write discipline; it never writes state. -- *Axiom 14 / Stance #9*: the expose-all bridge is named, scoped to cluster mode, and carries its Phase 6 sunset. 
-- *Loud failures (deny-list)*: every degraded condition is either a typed cluster-global boot error with a remedy or an explicit graph quarantine logged at startup; no silent fallback to the yaml. `--require-all-graphs` is the opt-in all-or-nothing mode for operators who treat any degraded graph as fatal. -- *Respect the boundaries*: `omnigraph-cluster` stays free of HTTP; the server reads the catalog through a small read-only loader (either a `pub` read surface on `omnigraph-cluster` or a thin module in the server consuming the documented file formats — implementation picks the one that keeps `omnigraph-cluster` dependency-light; the state/blob formats are already a documented contract). - -## Sequencing - -| Slice | Scope | Gate | -|---|---|---| -| **5A: serving metadata in state** | `applies_to` recorded on policy resources at apply + sweep roll-forward; additive state schema; `status`/plan surfacing | In-crate tests: metadata written/rolled-forward; old state parses; re-apply backfills | -| **5B: `--cluster` boot mode** | Flag + mode inference rule 0; catalog loader (state → `GraphStartupConfig`s + registries + policy engines); readiness table; OpenAPI regen if surface shifts | Server tests: boot from a converged fixture dir, serve `/graphs/{id}/query` + stored queries + Cedar gates; D4 cluster-global rows refuse boot; graph-local rows quarantine by default and refuse under `--require-all-graphs`; e2e: `cluster apply` then serve — "applied means serving" | -| **5C: docs + caveat retirement** | `cluster-config.md` mode-switch section; `server.md`/`deployment.md`; retire the "not serving" caveats for cluster-mode deployments; migration guide (D6) | `check-agents-md.sh`; doc accuracy review | - -## Exit-criteria coverage - -Answers implementation-spec exit criterion 7 (server startup + migration path) in full; touches 1 (state schema gains policy binding metadata — additive). 
Criteria 8 (per-query policy) and 9 (pipelines — descoped to a separate project) remain. - -## Open Questions - -1. **Loader home**: `pub` read-only API on `omnigraph-cluster` (server gains the dependency) vs a server-side reader of the documented formats. Leaning `omnigraph-cluster` API — one parser for the state schema beats two drifting ones; the crate stays HTTP-free either way. -2. **Boot-time blob re-hash**: D4 requires digest verification at boot; for large catalogs a stat-only fast path with full hashes behind a flag may matter later. Start with full verification (catalogs are small). -3. **`GET /graphs` enrichment**: cluster mode could expose applied digests/revision in the enumeration — deferred until a consumer exists. -4. **Watch-and-reload**: the natural follow-up once cluster mode exists; the registry's mutation seam is ready, but reload semantics (drain? cutover?) deserve their own design. - -## References - -- [rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md) — the convergence machinery this serves -- [cluster-config-specs.md](cluster-config-specs.md) §Migration model — window 2 is this RFC -- [cluster-axioms.md](cluster-axioms.md) — axioms 5, 12, 14, 15 -- [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) — Phase 5 rollout, Compatibility Stance #7–#9, blast-radius rows for the server registry -- `crates/omnigraph-server/src/lib.rs` (`load_server_settings`, `ServerConfigMode`, `GraphRegistry`) — the boot pipeline this extends without forking diff --git a/docs/dev/rfc-007-operator-config.md b/docs/dev/rfc-007-operator-config.md deleted file mode 100644 index 1cbc0ef..0000000 --- a/docs/dev/rfc-007-operator-config.md +++ /dev/null @@ -1,333 +0,0 @@ -# RFC: Per-Operator Config — the Operator Slice of RFC-002 - -**Status:** Proposed -**Date:** 2026-06-11 -**Builds on:** [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md) (Proposed; implementation parked — 
PRs #139/#162 closed over review findings), [rfc-005-server-cluster-boot.md](rfc-005-server-cluster-boot.md) (Landed), RFC-006 storage roots (#186/#190/#194, landed). The #139 review record is a normative input: every design rule in §D6 traces to a confirmed finding. -**Paired with:** [rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md) — together they define the two-surface architecture this RFC's operator half belongs to. -**Target release:** unversioned (staged; see Sequencing). - -## Summary - -Give OmniGraph the operator half of the **two-surface config architecture** -(RFC-008): **cluster config** (team-owned, in a repo — what the system *is*) -and **operator config** (person-owned, in `$HOME` — who *I* am). This is -Terraform's split: `~/.terraformrc` for the operator, the checkout for the -declaration. OmniGraph today has neither half cleanly — `omnigraph.yaml` -mixes both concerns (RFC-008 retires it), and there is no home-level config -at all: identity and credentials get re-declared per working directory, in -files that sit next to repo-committed config. - -This RFC introduces **`~/.omnigraph/config.yaml`** (the operator surface) -and a **keyed credentials chain**, scoped deliberately small: - -1. **Operator identity** — a default actor for every `--as` cascade. -2. **Credentials by server name** — no more inventing env-var names per - server; secrets never inline, never in any repo-committed file. -3. **Named servers** — operator-owned endpoint definitions; nothing a - checkout supplies can redefine them. - -It is explicitly a **subset of RFC-002**, sequenced to land. RFC-002 settled -the right long-term decisions (one `~/.omnigraph/` dir, credentials keyed by -server name, `OMNIGRAPH_CONFIG`/`OMNIGRAPH_HOME` env precedence) but its -implementation arrived as one 4,800-line PR mixing a crate extraction with -behavior changes, and died over ten confirmed findings. 
This RFC adopts -RFC-002's settled decisions verbatim where they apply, defers everything -else (`GraphLocator`, multi-homing, `omnigraph use`, the State layer), and -encodes the #139 findings as design rules so the same failures cannot recur. - -## Motivation - -Three concrete pains, all hit in real operation this cycle: - -- **Identity repetition.** The cluster actor cascade (#180) resolves - `--as` from the per-operator `omnigraph.yaml` — which means every - operator hand-maintains a copy in every working directory (the - `~/exp/intel` setup needed exactly this). A repo-committed - `omnigraph.yaml` cannot carry `as: act-andrew` without claiming every - contributor is Andrew. -- **Credential ergonomics.** `bearer_token_env` forces three coordinated - steps per server (invent a var name, reference it in config, set it in - the secret store). The peer group — AWS profiles, `gh hosts`, kubeconfig - users — keys secrets by the server's *name*. -- **Cluster-era working shape.** With clusters on object storage (RFC-006), - the project directory is a *declaration checkout* — operators run - `cluster apply --config ./checkout` from anywhere. The things that are - about the *operator* (who am I, which servers do I know, how do I like - output formatted) have no home that travels with them. - -## Non-Goals - -- **`GraphLocator` / multi-homed graph resolution** (RFC-002 §1) — the - biggest and riskiest part of config-v2; untouched here. -- **`omnigraph use` / the State layer** (`~/.omnigraph/state/`) — deferred - with it (finding #2 showed its precedence interacts badly with scaffolds; - that problem belongs to the slice that introduces it). -- **OS keychain integration** — the credentials *chain* (§D4) leaves a slot - for it; this RFC ships env + file sources only. 
-- **Config-file walk-up.** Terraform does not walk up from subdirectories - and neither do we — `--config` (or running in the directory) stays the - explicit, deterministic story for cluster checkouts. Rejected, not - deferred: walk-up makes "which config am I using" a function of cwd - depth, the class of surprise this RFC exists to remove. -- **Retiring `omnigraph.yaml`** — that is RFC-008's job, with its own - staging. This RFC builds the destination; during RFC-008's deprecation - window the legacy file keeps loading exactly as today. -- **Renaming or removing anything.** No flag renames, no key renames, no - schema-version bumps (findings #1, #3, #10). - -## Background (verified against main) - -- **Project-config lookup today** (`crates/omnigraph-server/src/config.rs:529-553`, - shared by CLI and server): `--config `, else `./omnigraph.yaml` in - cwd, else built-in defaults. Relative paths inside the file resolve - against the file's own directory (`base_dir`). No env var, no home file, - no walk-up. -- **Side-effect on load** (`crates/omnigraph-cli/src/helpers.rs:102-108`): - `load_cli_config` also loads `auth.env_file` into the process env — - this is how `OMNIGRAPH_BEARER_TOKEN` reaches remote commands today. -- **Actor resolution** (`helpers.rs:170`, #180): `--as` flag, else the - project config's actor — currently the end of the chain. -- **Existing credential mechanism**: `TargetConfig.bearer_token_env` names - an env var; `auth.env_file` points at a git-ignored dotenv. Both keep - working indefinitely (RFC-002 already committed to this; finding #3 - showed what happens otherwise). -- **`OMNIGRAPH_CONFIG`** exists today only as the *container entrypoint's* - translation to the server's `--config`. The CLI does not read it. - -## Design - -### D1. 
Files and discovery
-
-```
-~/.omnigraph/config.yaml      # the operator surface (this RFC)
-~/.omnigraph/credentials      # keyed secrets, 0600, git-irrelevant (§D4)
-./cluster.yaml + checkout     # the team surface (unchanged; RFC-004..006)
-./omnigraph.yaml              # legacy, loads as today through RFC-008's window
-```
-
-Discovery order for the operator file: `$OMNIGRAPH_HOME/config.yaml` if
-`OMNIGRAPH_HOME` is set, else `~/.omnigraph/config.yaml`. Absent file =
-empty layer, never an error. `~` is expanded wherever paths are read
-(finding #9 — today a literal `./~/...` directory gets created).
-
-`OMNIGRAPH_CONFIG=` becomes a first-class override for the `--config`
-argument in the CLI (highest precedence below the flag itself), aligning the
-CLI with the container contract that already uses this variable for the
-server. One name, one meaning, both binaries — it points at whatever the
-command's `--config` would (a cluster checkout for cluster commands; the
-legacy file during RFC-008's window).
-
-Per RFC-002 §4 (adopted verbatim): `~/.omnigraph/` is the one canonical
-dir — cache/state subdirectories arrive with their own slices; XDG roots are
-not part of the mental model (`$XDG_CONFIG_HOME` may be honored as a
-fallback read location if set, but is never written to).
-
-### D2. The operator schema (v1 of this layer)
-
-```yaml
-# ~/.omnigraph/config.yaml — about the OPERATOR, never about the system
-operator:
-  actor: act-andrew            # default for every --as cascade
-
-servers:                       # operator-owned endpoint definitions
-  intel-dev:
-    url: http://127.0.0.1:8080
-  prod:
-    url: https://graph.modernrelay.ai
-    # No token here, ever. Resolution: §D4.
-
-aliases:                       # personal shorthand over CLUSTER-owned queries
-  triage:
-    server: intel-dev          # required: names an operator server above
-    graph: spike               # optional (omit for single-mode servers)
-    query: weekly_triage       # STORED query name on that server — never a file
-    args: [since]              # positional CLI args -> params, in order
-    params: { limit: 20 }      # optional fixed defaults (positionals/--params win)
-    format: table              # optional; feeds the format cascade
-
-defaults:
-  output: table                # read --format default
-```
-
-Unknown keys are a **warning, not an error** in this layer (an operator file
-written by a newer CLI must not brick an older one; contrast with
-`cluster.yaml`, where unknown keys are deliberately fatal because they
-change what a *plan* means).
-
-#### Aliases are bindings, not content
-
-Three things must not be conflated:
-
-- **Stored queries (the cluster catalog)** are *content plus its canonical,
-  team-owned name* — reviewed, digest-pinned, invocable by name over HTTP.
-- **Legacy `omnigraph.yaml` aliases** conflate a personal name with a
-  pointer to query *content in a local file* — which is why they break
-  across directories and can drift from the catalog. RFC-008 retires them.
-- **Operator aliases** are pure **bindings, zero content**: a personal name
-  → (server, graph, stored-query *name*, arg mapping, defaults). An alias
-  that carries content competes with the catalog; an alias that references
-  a name composes with it.
-
-The three senses of "global", resolved by this split:
-
-1. **Across graphs/servers** — preserved and strengthened: today's aliases
-   are "global" only within one per-directory config file; operator
-   aliases live in one `$HOME` file, each binding self-contained, usable
-   from any cwd.
-2. **Across operators (team-shared shorthand)** — deliberately *no alias
-   mechanism*: the shared name IS the stored query's catalog name.
A team
-   that wants a shorter shared name renames the query in `cluster.yaml`
-   (reviewed, one name). A parallel team-alias namespace would be two
-   shared names for one thing — pure drift surface.
-3. **Across machines** — dotfile the one operator file; bindings carry no
-   local-file dependencies.
-
-Collision rule during the RFC-008 window: a legacy file-alias with the
-same name **wins**, with a warning naming both definitions — consistent
-with §D3's legacy-outranks-operator ordering.
-
-### D3. Precedence and the merge rule
-
-The end-state cascade is short, because the team surface (cluster config)
-deliberately carries **no operator-resolvable keys** — no actor, no tokens,
-no output preferences. Identity can never come from a checkout:
-
-```
-flag > env > operator config > built-in
-```
-
-During RFC-008's deprecation window, a legacy `omnigraph.yaml` slots in
-between env and operator config (its keys win over operator defaults,
-preserving today's behavior for unmigrated setups) — with the §D5
-credential inversion: **credentials and endpoint definitions never come
-from a legacy/checkout file when an operator-layer definition exists for
-the same server name.**
-
-Merging is **key-level**: scalars override per key; maps (`servers:`,
-`aliases:`) merge per *entry*, and entries merge per *field* (finding #13 —
-`merge_map` replacing whole entries silently dropped sibling fields).
-
-Concretely for the two flows this slice touches:
-
-- **Actor**: `--as` > legacy `cli.actor` (window only, unchanged semantics)
-  > `operator.actor` > none (commands that need an actor keep failing
-  loudly).
-- **Output format**: `--format` > legacy default (window only) >
-  `defaults.output` > `table`.
-
-### D4. Credentials: keyed by server name, by-reference always
-
-Adopted from RFC-002 §5 unchanged, minus the keychain (a later source in
-the same chain). For a server named ``, the resolution chain is:
-
-1.
`OMNIGRAPH_TOKEN_` (uppercased, `-`→`_`) — explicit env, wins.
-2. `[]` section in `~/.omnigraph/credentials` (INI-style, `0600`;
-   the loader refuses a group/world-readable file).
-3. The legacy pair — `bearer_token_env` + `auth.env_file` — exactly as
-   today, for configs that already use it.
-
-No inline secrets in any YAML file, anywhere (the existing invariant 12
-posture extended to disk). A future `omnigraph login `
-writes/rotates one section of the credentials file via temp + rename
-(finding #7: every operator-layer write is atomic), creating it `0600`.
-
-### D5. The trust boundary (the security findings, made structural)
-
-Findings #4, #5, #6 share one root cause: a file that arrives with a
-*repo checkout* could redirect where requests go and what secrets they
-carry. In the end state this is closed by construction — cluster config has
-no server/credential keys at all, and the operator surface never comes from
-a checkout. The rules below therefore govern the **RFC-008 window** (while
-legacy `omnigraph.yaml` still loads) and stand as the permanent law for any
-future checkout-supplied surface:
-
-1. **A checkout-supplied file may *reference* a server by name; it may not
-   *redefine* an operator-defined server.** If a legacy `./omnigraph.yaml`
-   declares `servers.prod.url` and `~/.omnigraph/config.yaml` also defines
-   `prod`, the operator definition wins and the CLI warns about the
-   shadowed entry. A legacy-only server name keeps working (compat), but
-   the keyed-credentials chain (§D4 steps 1–2) never resolves for it —
-   only the legacy explicit `bearer_token_env` does. Net effect: a
-   malicious checkout cannot point `prod` at an attacker host and harvest
-   the operator's `prod` token.
-2.
**`auth.env_file` keeps auto-loading (compat), but checkout-layer
-   env-files cannot *override* variables already set in the process or by
-   the operator layer** — first-set-wins, operator-before-checkout (the
-   existing real-env-wins rule, extended one layer down). Finding #5's
-   injection becomes a no-op against any var the operator actually uses.
-3. **A token is sent only to the server it is keyed to.** The legacy
-   single `OMNIGRAPH_BEARER_TOKEN` fallback keeps working for the
-   single-server shape, but when a request resolves through a *named*
-   server, only that name's chain applies (finding #6's broadcast).
-
-### D6. Compatibility rules (the #139 findings as law)
-
-| Rule | Source finding |
-|---|---|
-| No flag or key is removed or renamed; new behavior is additive | #1, #3 |
-| A config that loads today loads identically after this RFC; new validation applies only to new keys | #3, #8, #10 |
-| Every operator-layer file write is temp + rename, never in-place | #7 |
-| `~` expands wherever a path is read | #9 |
-| Map merges are per-entry, per-field — never wholesale replace | #13 |
-| One resolution path per concern — the actor chain and the token chain each have exactly one implementation, called by CLI and server alike | #11, #12 |
-| Each slice lands as its own PR with the workspace gate green; no slice mixes mechanical moves with behavior changes | #139's disposition |
-
-## Sequencing
-
-Three PRs, each independently useful, each landable without the next:
-
-1. **PR 1 — the operator file + identity** *(landed: #196)*. Loader for
-   `~/.omnigraph/config.yaml` (+ `OMNIGRAPH_HOME`, `~`-expansion, warn-only
-   unknown keys), `operator.actor` joining the `--as` cascade,
-   `defaults.output` joining the format cascade, `OMNIGRAPH_CONFIG` env for
-   the CLI's `--config`. Docs: `cli-reference.md` gains the two-surface
-   table.
-2. **PR 2 — keyed credentials** *(landed)*.
`servers:` in the operator layer, the
-   §D4 chain (env + credentials file), the §D5 trust rules, and
-   `omnigraph login ` (atomic write, `0600`). Legacy mechanisms
-   untouched and tested-as-untouched.
-3. **PR 3 — operator targeting** *(landed)*. `--server ` on remote-capable
-   commands and `aliases:` in the operator layer (server + graph + query +
-   default params), resolving through operator-defined servers. This is
-   the *bridge* toward RFC-002's locator — multi-server addressing in a
-   safe, minimal form without the `GraphLocator` rework — and the
-   replacement RFC-008 needs before legacy aliases can migrate.
-
-RFC-008's deprecation stages begin only after PRs 1–2 are on main: the
-operator surface must exist before `config migrate` has somewhere to move
-keys to.
-
-## Open questions
-
-- Should `operator.actor` apply to *local* (embedded-engine) writes too, or
-  only where a server/cluster boundary exists? Leaning yes-everywhere: one
-  identity chain (§D6 one-path rule), and local audit rows get better.
-- Does `defaults.output` belong in slice 1, or is identity-only an even
-  cleaner first PR? (Cost of including it is one cascade hop; value is
-  immediate.)
-- `omnigraph config view --resolved` (RFC-002 had it; #139 shipped a
-  version) — slice 1 or slice 2? It materially helps debugging precedence,
-  which argues early.
-
-## Relationship to RFC-002 and RFC-008
-
-**RFC-008 is the other half of this design**: this RFC builds the operator
-surface; RFC-008 retires the mixed-ownership file
-([rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md)),
-leaving exactly two config surfaces — cluster (team) and operator (person).
-Every mention of `omnigraph.yaml` in this RFC describes the deprecation
-window only. Sequencing couples them: RFC-007 PRs 1–2 land first, then
-RFC-008's migration stages run against them.
-
-[rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) consumes
-this RFC's surfaces: the actor chain and keyed-credential chain become
-constructor-time inputs of its `RemoteClient`/`EmbeddedClient`, and
-`--server`/operator aliases resolve to the same (base URL, credential)
-pair before its `GraphClient` trait is touched.
-
-RFC-002 remains the umbrella architecture. This RFC implements its §2
-(layered config, global-first), §4 (file naming / one dir), and §5
-(credentials) in their minimal load-bearing form, and explicitly defers §1
-(`GraphLocator`/targets), §3 (roles), and the State layer. If/when the
-locator work resumes, it builds on these layers rather than re-landing
-them. RFC-002's header should gain a pointer here once this merges.
diff --git a/docs/dev/rfc-008-deprecate-omnigraph-yaml.md b/docs/dev/rfc-008-deprecate-omnigraph-yaml.md
deleted file mode 100644
index 4b5bf35..0000000
--- a/docs/dev/rfc-008-deprecate-omnigraph-yaml.md
+++ /dev/null
@@ -1,178 +0,0 @@
-# RFC: Deprecate `omnigraph.yaml` — One Concern per Config Surface
-
-**Status:** Proposed
-**Date:** 2026-06-11
-**Builds on:** [rfc-007-operator-config.md](rfc-007-operator-config.md) (the
-operator layer that absorbs the identity/credential keys),
-[rfc-005-server-cluster-boot.md](rfc-005-server-cluster-boot.md) (Landed —
-cluster-booted serving), RFC-006 storage roots (landed: #186/#190/#194).
-**Supersedes in part:** RFC-007's "project layer" framing (§Relationship
-below) and [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md)'s
-assumption that `omnigraph.yaml` remains the project manifest.
-**Target release:** staged; final removal at the next major (see Sequencing).
-
-## Summary
-
-Retire `omnigraph.yaml`.
It is three unrelated concerns wearing one
-filename — server deployment config, project/CLI conveniences, and operator
-identity — and the mixture is not a cosmetic wart but the root cause of a
-recurring class of problems: operators keeping personal copies of "project"
-files, repo checkouts able to carry credential-adjacent keys (the #139
-security findings), `omnigraph init` scaffolding config into unrelated
-directories, and every config discussion needing a paragraph to establish
-which of the three files is meant.
-
-The end state is **two config surfaces with single owners**:
-
-| Surface | Owner | Declares |
-|---|---|---|
-| **Cluster config** (`cluster.yaml` + catalog) | the team, in a repo | what the system *is*: graphs, schemas, queries, policies, storage |
-| **Operator config** (`~/.omnigraph/`) | one person, in `$HOME` | who *I* am: identity, credentials, known servers, ergonomics |
-
-plus **flags/env** for the zero-config tier (one graph, one server, no
-control plane) — which already works today with no file at all.
-
-`omnigraph.yaml` has no role left once every key has a better home. This
-RFC gives each key that home, and stages the retirement so that no working
-setup breaks without a loud warning, a migration command, and a full
-deprecation cycle first.
-
-## Motivation
-
-- **It breaks the ownership logic.** A config file must have one owner. A
-  file that carries `graphs:` (team-owned, reviewable) next to `cli.actor`
-  (one person's identity) and `auth.env_file` (credential loading) can be
-  neither safely committed nor sensibly personal. Every real deployment
-  this cycle tripped on it: per-operator copies in `~/exp/intel`,
-  graph-scoped alias URIs that only make sense per-person, the #139
-  findings where a checkout-supplied file could redirect tokens.
-- **The cluster made it redundant.** Since RFC-005/006, a cluster
-  deployment serves from the applied catalog — `--cluster` mode does not
-  read `omnigraph.yaml` *at all*.
Stored queries, policies, bindings, and
-  graph addressing all have authoritative homes. What remains in
-  `omnigraph.yaml` for cluster users is dead weight that can silently
-  disagree with what is actually serving.
-- **Two declarative dialects is one too many.** `cluster.yaml` and
-  `omnigraph.yaml` both declare graphs/queries/policies with different
-  schemas, different validation strictness, and different lifecycle
-  guarantees. Maintaining, documenting, and testing both — and explaining
-  when each applies — is a permanent tax (the "programming integrated over
-  time" lens says: this forks on every config-surface change).
-
-## Non-Goals
-
-- **Breaking anyone now.** Every `omnigraph.yaml` that works today keeps
-  working through the entire deprecation window, with warnings.
-- **Retiring the zero-config tier.** `omnigraph-server s3://bucket/g.omni
-  --bind …` plus env vars stays first-class forever — that tier needs *no*
-  file, which is the point.
-- **Forcing the control plane on single-graph users.** The migration target
-  for a multi-graph yaml deployment is a *minimal* cluster (file-rooted,
-  no bucket required, `cluster.yaml` barely longer than the `graphs:` map
-  it replaces) — but a single graph never needs even that.
-- **Touching `cluster.yaml`** — its schema and strictness are unchanged.
-
-## Where every key goes (the complete migration map)
-
-The full `OmnigraphConfig` surface (verified against
-`crates/omnigraph-server/src/config.rs:182-207`):
-
-| `omnigraph.yaml` key | Concern | New home |
-|---|---|---|
-| `graphs..uri` | what exists / where | `cluster.yaml` `graphs:` (storage-root-derived) — or a flag/env for the zero-config tier |
-| `graphs..queries`, top-level `queries:` | what exists | cluster catalog (`.gq` discovery, RFC-004/#183) |
-| `graphs..policy.file`, top-level `policy.file`, `server.policy.file` | what's enforced | `cluster.yaml` `policies:` + `applies_to` bindings |
-| `server.bind` | deployment runtime | `--bind` / env (already authoritative; the key is a default) |
-| `server.graph` | deployment runtime | `--target`-style flag / env in the zero-config tier; meaningless under cluster boot |
-| `graphs..bearer_token_env`, `auth.env_file` | credentials | operator credentials chain (RFC-007 §D4) |
-| `cli.actor` | identity | `operator.actor` (RFC-007 §D3) |
-| `cli.output_format`, `cli.table_*` | personal ergonomics | `defaults:` in operator config (RFC-007 §D2) |
-| `cli.graph`, `cli.branch` | personal targeting | operator config: named servers + a per-operator default target (RFC-007 PR 3) |
-| `aliases.` | a personal name conflated with a content pointer | **splits in two** (RFC-007 §D2 "bindings, not content"): the referenced `.gq` file's *content* becomes a catalog stored query (team-reviewed); the *binding* becomes an operator alias referencing that name.
`config migrate` proposes both halves but cannot publish catalog content itself — that is a `cluster apply` |
-| `query.roots` | discovery convenience | obsolete — cluster query discovery (#183) replaced it |
-| `project.name` | label | dropped (the cluster's `metadata.name` is the deployment label) |
-
-Two placements worth defending:
-
-- **Aliases are operator config, not cluster config.** The stored query is
-  the shared contract (catalog-owned, digest-pinned); an alias is one
-  person's shorthand with their favorite default params and target. Putting
-  aliases in the cluster would force team review on personal ergonomics;
-  leaving them per-directory recreates today's problem. Per-operator,
-  keyed by server/graph name, is the AWS-profile shape.
-- **Multi-graph serving without a control plane migrates to a minimal
-  cluster, not to a new file.** The honest cost: `cluster import` + `apply`
-  once, on a `file://` root next to the graphs. The honest benefit: one
-  declarative dialect, one validation path, one serving source — and the
-  upgrade path to buckets/approvals is a one-line `storage:` change instead
-  of a re-platform.
-
-## Deprecation mechanics
-
-Per Hyrum's Law (the repo's own deny-list: shipped observable behavior is
-contract), retirement is staged, loud, and tooled:
-
-1. **Warn** *(landed)*. Loading `omnigraph.yaml` emits a one-line deprecation notice
-   naming the replacement for each key actually present in the file (not a
-   generic banner — the migration map above, applied to *your* file).
-   Suppressible per-process (`OMNIGRAPH_SUPPRESS_YAML_DEPRECATION=1`) for
-   CI logs during the window.
-2. **Migrate** *(landed)*.
`omnigraph config migrate` reads an existing
-   `omnigraph.yaml` and writes the split: the team half as a ready-to-review
-   `cluster.yaml` (+ moves query/policy files into the checkout layout),
-   the personal half merged into `~/.omnigraph/config.yaml` — printing a
-   diff-style summary and touching nothing without `--write`. The command
-   is the test of the migration map's completeness: any key it cannot
-   place is a bug in this RFC.
-3. **Stop scaffolding** *(landed)*. `omnigraph init` stops generating
-   `omnigraph.yaml` (it scaffolded one into cwd — the source of the
-   test-pollution bug). **No replacement scaffold**: a minimal
-   `cluster.yaml` is five lines; a generator would be a second copy of the
-   schema to keep in sync, producing a file that is unusable until
-   hand-edited anyway (Terraform has no config scaffolder either). New
-   users copy from the cluster quick-start; migrants get a ready-to-review
-   `cluster.yaml` from `config migrate`.
-4. **Opt-in strict** *(landed — the release gap to stages 1–3 collapsed: no version boundary was crossed between them, so all four ship in the same release)*. `OMNIGRAPH_NO_LEGACY_CONFIG=1` turns the warning into
-   an error — for teams that finished migrating and want regressions caught.
-5. **Remove at the next major** *(eased by [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) Phases 4–5: declared plane capabilities and route alignment shrink the yaml-boot removal diff)*. Loading the file becomes an error pointing
-   at `config migrate`. The `OmnigraphConfig` code path, the dual
-   query-registry loaders, and the yaml-mode server boot source are deleted
-   — the payoff that makes the whole exercise worth it.
-
-Stages 1–3 can land in one release once RFC-007 PRs 1–2 exist (the operator
-layer must exist before anything can migrate *to* it). Stage 4 the release
-after. Stage 5 at the major, with the removal listed in release notes from
-stage 1 onward.
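
The interaction of the two environment switches in stages 1 and 4 can be sketched as a small decision function. This is an illustrative sketch only: the enum, the function name, the rule that only the literal value `1` enables a switch, and the choice that strict mode outranks suppression are all assumptions, not the project's actual loader.

```rust
/// What a loader could do when it finds a legacy `omnigraph.yaml`.
/// Hypothetical type, introduced only for this illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum LegacyConfigAction {
    /// Load the file and print the per-key deprecation notice (stage 1).
    LoadWithWarning,
    /// Load the file without the notice, e.g. to keep CI logs quiet.
    LoadSilently,
    /// Refuse to load and point at `config migrate` (stage 4, opt-in).
    Reject,
}

/// Decide the action from the raw values of `OMNIGRAPH_NO_LEGACY_CONFIG`
/// (`no_legacy`) and `OMNIGRAPH_SUPPRESS_YAML_DEPRECATION` (`suppress`).
/// Assumption: only the literal "1" turns a switch on, and the strict
/// switch is checked first so that it wins over suppression.
pub fn legacy_config_action(
    no_legacy: Option<&str>,
    suppress: Option<&str>,
) -> LegacyConfigAction {
    let enabled = |value: Option<&str>| value == Some("1");
    if enabled(no_legacy) {
        LegacyConfigAction::Reject
    } else if enabled(suppress) {
        LegacyConfigAction::LoadSilently
    } else {
        LegacyConfigAction::LoadWithWarning
    }
}
```

A caller would read both variables with `std::env::var(...).ok()` and pass `as_deref()` of each into the function, which keeps the decision itself free of process state and easy to test.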
-
-## What this deletes, eventually
-
-- The `OmnigraphConfig` struct and its 12-key surface, the
-  `load_config`/`load_cli_config` pair and its env-side-effect, the
-  scaffolder, and the legacy resolution paths (`resolve_cli_graph`'s dual
-  modes — finding #11's root cause).
-- The yaml-mode multi-graph server boot (`ServerConfigMode::Multi` keeps
-  existing — cluster boot constructs it — but its `omnigraph.yaml` source
-  goes).
-- An entire class of documentation ("which file does X go in?") and the
-  #139 security surface (a checkout cannot hijack what no longer loads).
-
-## Relationship to RFC-007 and RFC-002
-
-RFC-007 ships the operator layer this RFC migrates *to*; its "project
-layer" language should be read as transitional — after this RFC, the
-project layer **is** the cluster checkout, and RFC-007's PR 3 (project
-`server:` references) applies to `cluster.yaml`-adjacent operator targeting
-rather than to `omnigraph.yaml`. RFC-002's locator/state-layer work, if
-resumed, targets the two-surface world directly. RFC-002's file-naming
-decisions (`~/.omnigraph/` as the one dir) are unaffected.
-
-## Open questions
-
-- **Window length**: one minor release between warn (stage 1) and strict
-  (stage 4), or two? Cookbooks, skills, and the deployment docs all need
-  the same pass; the migration command makes a short window defensible.
-- **`omnigraph login` vs `config migrate` ordering** — both write
-  `~/.omnigraph/`; whichever lands first establishes the file-locking and
-  atomic-write helpers the other reuses.
-- **Does the MCP server config** (RFC-003) reference `omnigraph.yaml`
-  anywhere that needs the same treatment? To be audited in stage 1.
diff --git a/docs/dev/rfc-009-unify-access-paths.md b/docs/dev/rfc-009-unify-access-paths.md
deleted file mode 100644
index 9b2d842..0000000
--- a/docs/dev/rfc-009-unify-access-paths.md
+++ /dev/null
@@ -1,236 +0,0 @@
-# RFC: Unify the CLI's Embedded and Remote Access Paths
-
-**Status:** Proposed
-**Date:** 2026-06-12
-**Audience:** engine/CLI/server maintainers
-**Builds on:** [rfc-007-operator-config.md](rfc-007-operator-config.md)
-(landed — `--server` targeting and operator aliases are remote-addressing
-surfaces the unified client must treat as first-class),
-[rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md)
-(stages 1–4 landed — the config-authority demotion this RFC's earlier
-drafts promised as a "companion" already happened; the remaining sliver,
-removing the yaml-mode server boot source, is RFC-008 stage 5 and is
-*eased* by Phases 4–5 here), [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md)
-(umbrella; see Prior Art for what is salvageable from its parked
-implementation).
-**Sequencing:** post-v0.7.0 (the release cut comes first).
-
-## Summary
-
-Collapse the CLI's per-command `is_remote` forks into one execution path coded
-against a `GraphClient` trait with two implementations (embedded engine, HTTP),
-sharing one wire-DTO crate with the server. Establish an executable parity
-referee *before* the refactor. This is the same cure, in the same order, that
-fixed the storage layer: one contract, one implementation where semantics are
-one thing, an executable contract where two implementations must exist.
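
To make the shape of that collapse concrete, here is a minimal sketch of one command body written once against a trait with two implementations. Only the names `GraphClient`, `EmbeddedClient`, and `RemoteClient` come from the RFC; the single `load` method, its signature, the `LoadSummary` type, the struct fields, and `run_load` are hypothetical stand-ins. The remote stub records what it would send instead of making an HTTP request.

```rust
/// Hypothetical result type; the project's real load output differs.
#[derive(Debug, PartialEq, Eq)]
pub struct LoadSummary {
    pub rows_loaded: usize,
}

/// Verb-level contract. Only `load` is shown here for brevity.
pub trait GraphClient {
    fn load(&mut self, rows: &[&str]) -> Result<LoadSummary, String>;
}

/// Stand-in for the embedded engine: the actor is supplied at construction.
pub struct EmbeddedClient {
    pub actor: String,
    pub stored: Vec<String>,
}

/// Stand-in for the HTTP client: the server would resolve the actor from
/// the token, so no actor field exists on this side.
pub struct RemoteClient {
    pub base_url: String,
    pub token: String,
    pub sent: Vec<String>,
}

impl GraphClient for EmbeddedClient {
    fn load(&mut self, rows: &[&str]) -> Result<LoadSummary, String> {
        // Pretend to write the rows straight into local storage.
        self.stored.extend(rows.iter().map(|r| r.to_string()));
        Ok(LoadSummary { rows_loaded: rows.len() })
    }
}

impl GraphClient for RemoteClient {
    fn load(&mut self, rows: &[&str]) -> Result<LoadSummary, String> {
        // Record the rows that a real client would put in a request body.
        self.sent.extend(rows.iter().map(|r| r.to_string()));
        Ok(LoadSummary { rows_loaded: rows.len() })
    }
}

/// The command body, written once against the trait rather than per fork.
pub fn run_load(client: &mut dyn GraphClient, rows: &[&str]) -> Result<String, String> {
    let summary = client.load(rows)?;
    Ok(format!("loaded {} rows", summary.rows_loaded))
}
```

The difference in actor handling lives in the two constructors, so `run_load` has no branch on which plane it is talking to.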
-
-## Motivation — validated facts
-
-Graph **semantics** cannot drift between paths today: both converge on the same
-engine `_as` entry points (verified: `omnigraph-server/src/handlers.rs` calls
-`mutate_as`, `apply_schema_as_with_catalog_check`, `load_as`,
-`branch_create_from_as`, `branch_delete_as`, `branch_merge_as` — the same
-functions the CLI's embedded arm calls), and Cedar enforcement lives inside
-those writers. What *can* drift is everything around them:
-
-1. **15 `is_remote` forks** in `crates/omnigraph-cli/src/main.rs`, each
-   duplicating request shaping and output mapping per command. (RFC-007 PR 3
-   threaded `apply_server_flag` through exactly these sites — the duplication
-   is measured, not estimated.)
-2. **Triple DTO construction.** "The result of a load" is built in three
-   places: the server handler (engine result → HTTP response), the CLI remote
-   arm (HTTP response → `LoadOutput` via `load_output_from_tables`), and the
-   CLI local arm (engine result → hand-built `LoadOutput`). Three mappings
-   that agree only by discipline — the exact shape of the storage-adapter bug
-   class (one prose contract, N implementations, no referee).
-3. **The remote `load` arm rides the deprecated `/ingest` route.** A
-   non-deprecated CLI verb coupled to a deprecated endpoint turns the
-   endpoint's eventual removal into a surprise CLI breakage.
-4. **Plane restrictions are accidental, not declared.** `init` / `optimize` /
-   `repair` / `cleanup` / `cluster *` are storage-only and `graphs list` is
-   server-only by code shape; pointing `optimize` at an `https://` target
-   fails with whatever `Omnigraph::open` says about an https URI. Per Hyrum,
-   that accidental error text is already someone's dependency.
-5. **Parity pinning is thin.** One explicit parity test
-   (`cli_schema_config.rs::schema_plan_parity_cli_and_sdk`), flow coverage in
-   `system_local.rs`, and the OpenAPI drift test. No systematic
-   per-verb embedded-vs-remote comparison exists.
Two bugs from the current
-   cycle argue the referee's value concretely: the operator-alias positional
-   bug (the hidden `legacy_uri` positional swallowed the first arg — local
-   and remote disagreed until a live smoke caught it) and the
-   `write_text_if_absent` flush bug (one of N implementations of an
-   unwritten contract) would both have failed a parity matrix.
-
-## Design
-
-Ordered so each phase is independently shippable and the referee exists before
-anything moves — mirroring the storage collapse, where the pinned contract
-tests gated the swap, and the test-monolith modularization (#192/#193), which
-makes Phase 3 tractable: the CLI dispatch is 1,184 lines today, not 4,200.
-
-### Phase 1 — Parity matrix (the referee; do first, no refactor) *(landed)*
-
-A CLI integration test (extend the `system_local.rs` harness, which already
-spawns both binaries): one fixture graph; for every forked verb, run the
-command once against the local URI and once against a spawned server with
-identical inputs; diff the `--json` outputs against an explicit allowlist of
-transport-only fields (e.g. resolved URI). Assert identical exit codes for the
-shared error cases.
-
-This pins today's behavior so Phase 3 can't silently change it, and catches
-every future fork drift. It also incidentally covers utoipa annotation↔route
-mismatches (a lying `#[utoipa::path]` makes the remote leg 404).
-
-**Phase 1 outcome (landed):** `crates/omnigraph-cli/tests/parity_matrix.rs`
-— 11 rows green with an **empty divergence ledger**: with matched Cedar
-policy on both arms, embedded and remote agree on every forked verb's
-scrubbed JSON and exit codes. Two findings along the way: like-for-like
-requires the same policy bundle on both arms (a tokens-only server is
-default-deny by design — the harness encodes this), and inline execution's
-unbound-param matches-all vs the invoke path's hard error is a cross-path
-asymmetry, filed as #207 and pinned (not repaired) by the matrix.
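
The scrub-then-diff step at the heart of the matrix can be sketched as follows. This is not the code in `parity_matrix.rs`: it assumes each `--json` output has already been flattened into a field-name to rendered-value map (the standard library has no JSON parser), and the names `Fields`, `scrub`, and `parity_diff` are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One command's output, flattened to field name -> rendered value.
/// Hypothetical simplification of the real `--json` documents.
pub type Fields = BTreeMap<String, String>;

/// Drop the fields that are allowed to differ between the two arms
/// (for example a resolved URI).
pub fn scrub(output: &Fields, transport_only: &BTreeSet<&str>) -> Fields {
    output
        .iter()
        .filter(|(key, _)| !transport_only.contains(key.as_str()))
        .map(|(key, value)| (key.clone(), value.clone()))
        .collect()
}

/// Return the names of all fields on which the scrubbed outputs disagree,
/// including fields present on only one side. An empty result means parity.
pub fn parity_diff(
    embedded: &Fields,
    remote: &Fields,
    transport_only: &BTreeSet<&str>,
) -> Vec<String> {
    let left = scrub(embedded, transport_only);
    let right = scrub(remote, transport_only);
    let keys: BTreeSet<&String> = left.keys().chain(right.keys()).collect();
    keys.into_iter()
        .filter(|key| left.get(*key) != right.get(*key))
        .cloned()
        .collect()
}
```

Keeping the allowlist explicit means a new field that differs between arms shows up as a divergence by default, and someone has to decide deliberately that it is transport-only.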
-
-### Phase 2 — One wire-DTO crate *(landed)*
-
-Move the HTTP request/response types and the single `engine result → DTO`
-mapping per verb into a shared crate (working name `omnigraph-api-types`),
-carrying serde + utoipa `ToSchema` derives:
-
-- `omnigraph-server` handlers serialize these types; utoipa derives
-  `openapi.json` from them (the existing `openapi.rs` regeneration test stays
-  the spec referee).
-- The CLI embedded path constructs them via the shared mapping.
-- The CLI remote path deserializes the literal same types.
-
-The mapping then exists once, next to the type — it cannot fork. Spec codegen
-remains exactly where it belongs: foreign-language clients (the TS SDK
-pipeline). Generating a Rust client from the spec is explicitly rejected — it
-would round-trip Rust types through a lossy intermediate when compile-time
-type sharing is available.
-
-**Prior art to salvage:** PR #139's review explicitly found the
-`omnigraph-api-types` extraction *clean* ("the crate extraction itself is
-clean and the openapi.json byte-identical claim holds" —
-[pr-139 findings](rfc-002-config-cli-architecture.md)); it was the behavior
-changes bundled alongside that killed the PR. Seed this phase from the
-extraction commits on `ragnorc/scrutinize-rfc-002` rather than rebuilding —
-cherry-picked narrowly, never relanded wholesale.
-
-Boundary note: this does NOT violate "transport/auth stay at the boundary"
-(invariants §11). The shared crate holds plain serde DTOs; it depends on
-neither axum nor the engine's internals. The engine crate does not depend on
-it — the `engine result → DTO` mapping lives in the shared crate (or the CLI/
-server side), taking engine result types as input.
-
-**Phase 2 outcome (landed):** `crates/omnigraph-api-types` holds the wire
-DTOs + their `engine-result → DTO` mappings; `omnigraph-server::api` is a
-`pub use` re-export (so `openapi.json` is byte-identical — the referee
-passed with zero diff), and the CLI consumes the crate directly. One
-deliberate refinement of the original sketch: `LoadOutput` is a rendered
-CLI output type, not a wire DTO, so it stayed CLI-side — both its mappings
-(local `LoadResult`, remote `IngestOutput`) now sit together in
-`output.rs`. The parity matrix passed textually unchanged.
-
-### Phase 3 — `GraphClient` trait, two implementations
-
-```text
-trait GraphClient            // verb-level: load, mutate, query, branch_*, schema_*, export, commit_*
-  ├── EmbeddedClient         // wraps Omnigraph + the shared mapping; actor: explicit (--as cascade, RFC-007)
-  └── RemoteClient           // reqwest + bearer; actor: resolved server-side from the token
-```
-
-Each CLI command body is written once against the trait; the 15 forks become
-2 impls × 1 contract. Actor resolution is a constructor-time difference of the
-impls, never a per-verb branch — the trust model (storage credentials =
-self-declared actor via the RFC-007 actor chain; server = token-resolved
-actor via the RFC-007 keyed-credential chain) is a feature, not drift.
-`RemoteClient` construction is where RFC-007's addressing lands once:
-positional URI, `--target`, `--server `, and operator aliases all
-resolve to the same (base URL, credential) pair before the trait is touched.
-The Phase 1 matrix becomes the trait's conformance suite, run against both
-impls.
-
-### Phase 4 — Declared plane capabilities
-
-Each CLI command declares `Storage | Server | Both`. Dispatch checks the
-resolved target against the declaration and fails with one deliberate message
-("maintenance commands operate on storage directly; use a storage URI, not a
-server target") instead of today's incidental errors.
The declaration table is -also documentation: it makes the control-plane/data-plane split (maintenance -and cluster commands must work with the server down) explicit in code. -"Server" targets include operator-config named servers (RFC-007), not only -literal `http(s)://` URIs. - -### Phase 5 β€” Route alignment (landed) - -Added a canonical `POST /load` (shared `run_ingest` body; the deprecated -`/ingest` is now a thin alias carrying `#[deprecated]` + RFC 9745/8288 -`Deprecation`/`Link: ` headers, exactly mirroring `/mutate`↔`/change`) -and pointed the CLI's remote `load` arm at it; `/ingest` stays on its -deprecation path. `/load` reuses `IngestRequest`/`IngestOutput` (as canonical -`/mutate` reuses `Change*`); a DTO rename is a separate change. - -Registration finding: the server **hand-mounts** routes (`.route(...)`) beside a -manual `#[openapi(paths(...))]` list, not `utoipa-axum`'s `OpenApiRouter`/ -`routes!`. This PR followed the existing manual pattern (one `.route` + one -`paths(...)` entry + the `#[utoipa::path]` annotation) rather than migrating -registration β€” the migration is a worthwhile but orthogonal cleanup, deferred. - -## Non-goals - -- **No localhost-server funnel for the embedded path.** Routing embedded use - through a daemon would destroy the embedded/CLI/test story to buy parity the - trait + matrix already provide. -- **No trust-model unification.** `--as` vs bearer-resolved actors stay. -- **No spec-codegen for the Rust client** (see Phase 2). -- **No change to plane-restricted command availability** β€” maintenance stays - storage-direct by design; Phase 4 only makes the restriction explicit. -- **No config-authority work** β€” that was RFC-008, already landed through - stage 4; this RFC neither accelerates nor blocks stage 5, though Phases 4–5 - make the eventual yaml-boot removal a smaller diff. - -## Compatibility - -- CLI `--json` output is observable contract; Phase 1 freezes it before - Phase 3 moves code. 
Any field-level unification that *changes* output is a - deliberate, release-noted decision, not a refactor side effect. -- Error-message text for mis-planed commands changes in Phase 4 β€” release-note - it (Hyrum). -- `openapi.json` should be byte-stable through Phase 2 if the DTO move is - faithful; the regeneration test enforces this. - -## Testing - -- Phase 1 matrix is the spine; it must stay green, textually unchanged, - through Phases 2–3 (the storage-collapse playbook). -- Phase 2: `openapi.rs` byte-stability + existing server tests. -- Phase 4: one test per capability class asserting the deliberate error. -- Phase 5: parity matrix leg for `load` flips to `/load`; an `/ingest` shim - test stays until removal. - -## Open questions - -1. Crate granularity: one `omnigraph-api-types` crate vs folding into an - existing one. (Leaning separate: server and CLI both depend on it; the - engine must not. The #139 extraction already answered this with a separate - crate that reviewed clean.) -2. Does the `query`/`read` streaming path (NDJSON export) fit the trait, or is - export a documented per-impl method? (Streaming over HTTP vs an iterator - over the embedded engine differ in shape, not content.) -3. ~~Whether `graphs list` belongs on the trait.~~ **Answered by the - two-surface architecture**: the embedded impl enumerates the **cluster - catalog** (`read_serving_snapshot` exists for exactly this), never - `omnigraph.yaml` (deprecated, RFC-008). `graphs list` becomes - `Both`-capability: remote = `GET /graphs` (policy-gated), embedded = - catalog enumeration from a cluster storage root. - -## Relationship to prior work - -The third application of the same principle in this lineage: storage adapters -(collapsed to one implementation + an executable contract β€” and its CAS/flush -bugs were exactly the no-referee class), recovery liveness, and now access -paths. 
RFC-007 supplied the addressing and credential surfaces `RemoteClient` -consumes; RFC-008 removed the competing config authority; RFC-002 remains the -umbrella whose remaining unimplemented pieces (`GraphLocator`, the State -layer) would build on the trait introduced here rather than on per-command -forks. diff --git a/docs/dev/rfc-010-cli-planes-restructure.md b/docs/dev/rfc-010-cli-planes-restructure.md deleted file mode 100644 index ad7f7b9..0000000 --- a/docs/dev/rfc-010-cli-planes-restructure.md +++ /dev/null @@ -1,449 +0,0 @@ -# RFC: Restructure the CLI Around Explicit Planes - -**Status:** Proposed -**Date:** 2026-06-13 -**Audience:** CLI/server/cluster maintainers -**Builds on:** [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) -(Phases 3a–3c landed β€” the embedded/remote data-plane fork is now one -`GraphClient` enum; this RFC **expands RFC-009 Phase 4** from a narrow -embedded-vs-remote capability table into the full plane model, and leaves -Phase 5 route alignment where it is), -[rfc-007-operator-config.md](rfc-007-operator-config.md) (operator -`--server`/`--graph`/`--target` addressing β€” the surfaces this RFC makes -uniform across planes), -[rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md). -**Sequencing:** post-v0.7.0, after RFC-009 Phase 3c (done). - -## Summary - -The CLI silently spans **three planes** β€” data, storage/maintenance, and -control β€” and forces the operator to know which plane each verb lives on *and* -address a graph differently per plane. The same graph you query as -`--server prod --graph knowledge` you must maintain as -`s3://bucket/knowledge.omni`. Plane restrictions (`graphs list` is server-only, -`optimize` is storage-only) are *accidental* β€” discovered by hitting a cryptic -error, not *declared*. - -This RFC makes the plane model **explicit and coherent** with three moves: - -1. 
**One graph-addressing model** across every verb (`--target`/`--graph`/ - positional URI/`--server`), resolving to a storage URI for maintenance and a - remote client for data β€” instead of two different ways to name one graph. -2. **A declared, per-subcommand capability surface** (RFC-009 Phase 4): each - verb declares its plane(s); wrong-plane invocations get an honest "this is - storage-plane, `--server` doesn't apply" error from one table, not scattered - `bail!`s. -3. **Plane-grouped `--help`** so the model is legible at a glance. - -No new server feature. Storage maintenance stays off the wire β€” deliberately. - -## Current state of affairs - -The CLI has 23 top-level commands. They divide into three planes, addressed -three different ways: - -| Plane | Verbs | Reaches the graph by | Addressing surface | -|---|---|---|---| -| **Data** | `query`, `mutate`, `load`, `ingest`, `branch *`, `snapshot`, `export`, `commit *`, `schema show/apply` (and `graphs list`, **remote-only today** β€” see note) | embedded engine **or** HTTP server (one `GraphClient`) | positional URI **or** `--target` / `--graph` / `--server` (config aliases) | -| **Storage / maintenance** | `init`, `optimize`, `repair`, `cleanup`, `schema plan`, `queries validate` | embedded engine **only**, directly on storage (`file://` or `s3://`) | positional URI **or** `--target` β€” **no `--server` / `--graph`** (except `init`, which today takes **only a required positional URI** β€” no `--target`) | -| **Control** | `cluster validate/plan/apply/approve/status/refresh/import/force-unlock` | a cluster **directory** (`file://` or `s3://`), not a graph URI | `--config ` | - -### What's confusing (validated facts) - -1. **Two names for one graph.** Data verbs resolve `--server prod --graph - knowledge` through `GraphClient::resolve*` (the embedded/remote fork collapsed - in RFC-009 Phases 3a–3c; only the two `GraphClient` factories call - `apply_server_flag`). 
Maintenance verbs instead use - `resolve_uri`/`resolve_local_uri` and accept only a positional URI or - `--target` β€” so to compact the graph you *query* as `--server prod --graph - knowledge` you must *type* `s3://bucket/knowledge.omni`. One graph, two - addressing vocabularies. - - > **Note (`graphs list`).** It is routed through `GraphClient` only to share - > the addressing/token resolver; its embedded arm fails loudly, so it is - > **remote-only today** (the later capability table and *Relationship to - > RFC-009* record it as remote-now / embedded-cluster-later). - -2. **Plane restrictions are accidental, not declared.** `graphs list` is - server-only and `optimize`/`repair`/`cleanup`/`init` are storage-only purely - by code shape. Point `optimize` at an `https://` URL and you get whatever - `Omnigraph::open` says about an https URI β€” accidental error text that, per - Hyrum's Law, is already someone's dependency. The capability is real but - unstated. - -3. **The split is per-subcommand, and the family names hide it.** `schema plan` - is storage-only (`resolve_local_uri`) while `schema show`/`schema apply` are - data-plane (the graph client). `queries validate` opens the graph to - typecheck while `queries list` only reads the registry config. The plane is - a property of the *subcommand*, not the family. - -4. **Maintenance has no server/cluster counterpart at all.** There is no HTTP - route and no `cluster` subcommand for `optimize`/`cleanup`/`repair` (verified: - nothing in the server route table, nothing in `omnigraph-cluster/src`). For a - server-backed deployment you run the *same CLI* against the storage URI, - out-of-band from the serving process. This is correct (maintenance is - heavyweight, destructive, single-operator β€” it should not be a multi-tenant - HTTP surface), but it is **undocumented in the CLI's own shape**, so it reads - as an omission rather than a decision. - -5. 
**`init` has a hidden control-plane twin.** Bare `init` creates a single - graph from storage; in cluster mode the equivalent is `cluster apply` - (graph-creation stage, with ledger/recovery/approval semantics). Same intent, - two entry points, no signpost between them. - -6. **Flat `--help`.** All 23 commands list as one undifferentiated block, so the - plane a verb belongs to is tribal knowledge. - -The net effect: a new operator must already know OmniGraph's plane architecture -to predict which flags work on which verb and how to name a graph. The CLI does -not teach its own model. - -## Target CLI ergonomics - -The throughline: **you name a graph one way, and the CLI tells you what works -where.** Simple examples of the end state: - -### One name for a graph, everywhere - -A config target `knowledge` works on every verb that touches that graph: - -```bash -omnigraph query --target knowledge --query q.gq # data (embedded or remote, auto) -omnigraph load --target knowledge --data rows.jsonl # data -omnigraph optimize --target knowledge # maintenance (resolves to its storage URI) -omnigraph cleanup --target knowledge --keep 10 --confirm -omnigraph repair --target knowledge --confirm -``` - -The positional URI form still works everywhere, unchanged: - -```bash -omnigraph optimize s3://bucket/knowledge.omni -``` - -### Data plane: same command, embedded or remote - -You don't pick "local vs server" syntax β€” resolution decides: - -```bash -omnigraph query ./local.omni --query q.gq # opens engine directly -omnigraph query --server prod --graph knowledge --query q.gq # over HTTP -omnigraph query --target knowledge --query q.gq # whichever the config says -``` - -### Maintenance: `--target` must resolve to direct storage (loud if not) - -```bash -$ omnigraph optimize --target prod -error: `--target prod` resolves to a remote server (https://prod…). - `optimize` is a storage-plane command and needs direct storage access. 
- Pass the graph's s3://… URI, or use --cluster --graph . -``` - -Cluster-managed graphs get an explicit, intentional path (no implicit -`cluster.yaml` peeking): - -```bash -omnigraph optimize --cluster ./cluster --graph knowledge -``` - -### Wrong-plane = one honest, stable error - -```bash -$ omnigraph optimize --server prod -error: `optimize` is a storage-plane command; `--server` addresses the data - plane and does not apply here. Use --target or a storage URI. - -$ omnigraph graphs list ./local.omni -error: `graphs list` needs a remote multi-graph server (http/https) today. - (Embedded cluster-catalog enumeration is planned β€” RFC-009.) -``` - -### `--help` teaches the model - -``` -DATA PLANE run against a graph (embedded or --server) - query mutate load branch snapshot export commit schema show schema apply - -STORAGE / MAINTENANCE direct storage access; no server - init optimize repair cleanup schema plan queries validate - -CONTROL PLANE manage a cluster directory - cluster - -INSPECT / SESSION - graphs list queries list lint policy embed login logout config -``` - -### Exceptions, signposted (not silent) - -```bash -omnigraph init --schema s.pg ./new.omni # plain path: fine - -$ omnigraph init --target knowledge --schema s.pg # cluster-managed target: redirected -error: `knowledge` is a cluster-managed graph. Create it via `cluster apply` - (which records ledger + recovery + approvals), not `init`. -``` - -**In one line:** one way to name a graph, the right flags accepted per verb, and -a CLI that tells you its planes instead of making you memorize them. - -## Proposed shape (mechanism) - -### One addressing model for every graph-addressing verb - -Route **all** graph-addressing verbs β€” data *and* maintenance β€” through one -resolver that turns `(positional URI | --target | --graph | --server)` into -either a **storage URI** (`file://`/`s3://`) β†’ embedded execution, or a **remote -`GraphClient`** β†’ HTTP execution, per the verb's declared plane. 
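As a rough illustration of the single resolver described above, the sketch below normalizes the three addressing forms into either a storage URI or a remote endpoint. Everything here is an assumption made for illustration: `Address`, `Resolved`, `OperatorConfig`, and `resolve` are hypothetical names, the scheme check is a simplification, and the real CLI resolver may be shaped quite differently.

```rust
use std::collections::HashMap;

/// The addressing forms a verb may receive (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
pub enum Address {
    Positional(String),
    Target(String),
    ServerGraph { server: String, graph: String },
}

/// What addressing resolves to: direct storage or a remote endpoint.
#[derive(Debug, Clone, PartialEq)]
pub enum Resolved {
    Storage(String),
    Remote { base_url: String, graph: Option<String> },
}

/// Hypothetical operator config: target name -> URI, server name -> base URL.
#[derive(Default)]
pub struct OperatorConfig {
    pub targets: HashMap<String, String>,
    pub servers: HashMap<String, String>,
}

/// Simplified classification: http(s) is remote, anything else is storage.
fn classify(uri: &str) -> Resolved {
    if uri.starts_with("http://") || uri.starts_with("https://") {
        Resolved::Remote { base_url: uri.to_string(), graph: None }
    } else {
        Resolved::Storage(uri.to_string())
    }
}

/// One entry point for every graph-addressing verb.
pub fn resolve(addr: &Address, cfg: &OperatorConfig) -> Result<Resolved, String> {
    match addr {
        Address::Positional(uri) => Ok(classify(uri)),
        Address::Target(name) => cfg
            .targets
            .get(name)
            .map(|uri| classify(uri))
            .ok_or_else(|| format!("unknown target `{}`", name)),
        Address::ServerGraph { server, graph } => cfg
            .servers
            .get(server)
            .map(|base| Resolved::Remote {
                base_url: base.clone(),
                graph: Some(graph.clone()),
            })
            .ok_or_else(|| format!("unknown server `{}`", server)),
    }
}
```

The caller would then compare the `Resolved` variant with the verb's declared plane; the sketch deliberately leaves that check out so that resolution and capability stay separate concerns.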
- -**Authority rule (the precedence must not be silent).** `--target` is an -operator/legacy target lookup; `cluster.yaml` is a *different* authority surface -(read only by `cluster` commands and `--cluster` boot). A maintenance verb must -not quietly consult both and invent a precedence. The rule: - -- A maintenance verb's `--target` resolves through the **operator/legacy** - config and its URI must already be **direct storage**; a target that resolves - to a remote (`http(s)://`) URL **fails loudly** (see the example above). -- **Cluster-managed graphs are addressed explicitly** via a cluster-root + - graph-id pair (spelled `--cluster --graph ` for illustration), so - reading cluster state is an intentional mode β€” never an implicit fallback - between operator config and `cluster.yaml`. - - > **Flag-shape caveat (deferred).** `--graph` is *already* a global flag that - > `requires = "server"` and appends `/graphs/` to a **remote** URL β€” a - > different meaning, and clap won't permit `--graph` without `--server`. So the - > cluster-maintenance addressing needs either a distinct flag (e.g. - > `--cluster-graph `) or an explicit global-flag migration. This is why - > the cluster-managed resolver is **deferred to a later slice** (it also rides - > the applied-state-vs-declared-config open question below); the - > operator/legacy `--target` path lands first. 
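The first half of the authority rule (a maintenance `--target` must already be direct storage, and a remote target fails loudly) could look something like the sketch below. It is illustrative only: `resolve_maintenance_target` and the plain `HashMap` standing in for the operator/legacy config are hypothetical, and the error wording is loosely modeled on the example earlier in this RFC rather than taken from the implementation.

```rust
use std::collections::HashMap;

/// Simplified remote check: only http(s) URLs count as remote.
fn is_remote(uri: &str) -> bool {
    uri.starts_with("http://") || uri.starts_with("https://")
}

/// Resolve a maintenance verb's `--target` through the operator/legacy
/// target map only. `cluster.yaml` is never consulted here, so there is
/// no hidden precedence between the two authority surfaces.
pub fn resolve_maintenance_target(
    verb: &str,
    target: &str,
    targets: &HashMap<String, String>,
) -> Result<String, String> {
    let uri = targets
        .get(target)
        .ok_or_else(|| format!("unknown target `{}`", target))?;

    if is_remote(uri) {
        // Fail loudly instead of falling back to another authority.
        return Err(format!(
            "`--target {}` resolves to a remote server ({}). \
             `{}` is a storage-plane command and needs direct storage access.",
            target, uri, verb
        ));
    }
    Ok(uri.clone())
}
```

Keeping the function's only input a single target map is the point of the design: a cluster-managed lookup would be a separate, explicitly invoked resolver, not a branch inside this one.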
- -### A declared, per-subcommand capability surface (RFC-009 Phase 4, expanded) - -One table, **per subcommand** (family-level rows hide exactly the cases the -table exists to make non-accidental): - -| Command | Data (embedded) | Data (remote) | Storage (direct) | Config / session | Notes | -|---|---|---|---|---|---| -| `query`, `mutate`, `load`, `ingest` | ✅ | ✅ | — | — | `ingest` is the deprecated alias of `load` | -| `branch create/list/delete/merge` | ✅ | ✅ | — | — | | -| `snapshot`, `export`, `commit list/show` | ✅ | ✅ | — | — | | -| `schema show` | ✅ | ✅ | — | — | | -| `schema apply` | ✅ | ✅ | — | — | declarative alternative: `cluster apply` | -| `schema plan` | — | — | ✅ | — | local resolver today | -| `queries validate` | — | — | ✅ | — | opens the graph to typecheck | -| `init` | — | — | ✅ | — | cluster-managed graphs → `cluster apply` | -| `optimize`, `repair`, `cleanup` | — | — | ✅ | — | | -| `graphs list` | (later) | ✅ | — | — | remote today; embedded-cluster later (RFC-009) | -| `queries list` | — | — | — | ✅ | reads the registry config; no graph | -| `lint` | — | — | ✅ | ✅ | `--schema` file, or opens a local graph | -| `policy validate/test/explain` | — | — | — | ✅ | reads policy files + config | -| `embed` | — | — | — | ✅ | local tooling (files + embedding API) | -| `login`, `logout`, `config`, `version` | — | — | — | ✅ | session / config; no graph | - -The resolver consults this table. A wrong-plane invocation produces one honest, -stable message instead of N ad-hoc `bail!`s and accidental `open` errors. - -### Plane-grouped `--help` - -Group the command list by plane (the `--help` block shown under Target CLI -ergonomics). Cosmetic, zero behavior change, highest legibility-per-line. 
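One way the per-subcommand capability table could be encoded as data is sketched below, with a single check that produces the wrong-plane error. The names `Plane`, `Capability`, `CAPABILITIES`, and `check` are hypothetical, only a handful of rows from the table are included, and the message text is a placeholder rather than the pinned strings the test plan calls for.

```rust
/// The four columns of the capability table.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Plane {
    DataEmbedded,
    DataRemote,
    Storage,
    Config,
}

/// One row: a subcommand and the planes it declares.
pub struct Capability {
    pub command: &'static str,
    pub planes: &'static [Plane],
}

/// A subset of the rows from the table, keyed per subcommand, not per family.
pub const CAPABILITIES: &[Capability] = &[
    Capability { command: "query", planes: &[Plane::DataEmbedded, Plane::DataRemote] },
    Capability { command: "schema show", planes: &[Plane::DataEmbedded, Plane::DataRemote] },
    Capability { command: "schema plan", planes: &[Plane::Storage] },
    Capability { command: "optimize", planes: &[Plane::Storage] },
    Capability { command: "graphs list", planes: &[Plane::DataRemote] },
    Capability { command: "queries list", planes: &[Plane::Config] },
    Capability { command: "lint", planes: &[Plane::Storage, Plane::Config] },
];

/// Single place where a wrong-plane invocation is rejected.
pub fn check(command: &str, requested: Plane) -> Result<(), String> {
    let cap = CAPABILITIES
        .iter()
        .find(|c| c.command == command)
        .ok_or_else(|| format!("unknown command `{}`", command))?;

    if cap.planes.contains(&requested) {
        Ok(())
    } else {
        Err(format!(
            "`{}` does not run on the {:?} plane; declared planes: {:?}",
            command, requested, cap.planes
        ))
    }
}
```

With the rows held in one constant, the plane-grouped `--help` listing could in principle be derived from the same data, which would keep the documentation and the guard from drifting apart.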
- -### Maintenance stays off the wire (decision, not omission) - -This RFC **does not** add server routes for `optimize`/`cleanup`/`repair`: - -- **Serving = the server.** Multi-tenant, safe-for-many-callers data plane. -- **Storage maintenance = the CLI against storage**, addressed uniformly, - run by an operator or a scheduled job with storage access. - -Adding maintenance-over-HTTP would re-introduce a heavyweight, destructive -multi-tenant surface and *add* a plane rather than clarify the three we have. -A future cluster-driven maintenance reconciler (scheduled compaction/GC as a -control-plane policy) is explicitly **out of scope** β€” net-new design (who runs -it, with what resource bounds), not a CLI restructure. - -### `init` is an explicit exception (decision) - -Direct-storage `init` against a plain URI/target stays. But if a target resolves -to a **cluster-managed** graph root, `init` **refuses and signposts** `cluster -apply` (which records ledger, recovery, and approval artifacts) rather than -initializing that root out of band. This closes the "hidden twin" of the current -state. - -## Compatibility - -Additive and low-risk: - -- **`--target`/`--graph` on maintenance verbs** is new capability; the positional - URI form keeps working unchanged. -- **Grouped `--help`** is cosmetic. -- **Capability-surface error text** changes the message you get on a wrong-plane - or misaddressed invocation. Per Hyrum's Law that text is observable; the change - is deliberate, release-noted, and replaces an *accidental* `Omnigraph::open` - string with a *stable, declared* one β€” a net improvement, but flagged. - -No engine, server, or wire-protocol change. The work is CLI-internal: the shared -resolver, the capability table, and help grouping. 
- -## Test plan - -Extend the existing CLI suites rather than adding a duplicate harness: - -- **`parity_matrix.rs`** β€” capability exclusions (the per-subcommand plane table - becomes the source of truth for which verbs are remote-only / storage-only). -- **`cli_data.rs`** β€” maintenance wrong-plane errors (`optimize --server`, - `optimize --target `), and `--target` resolving to direct storage. -- **`cli_schema_config.rs`** β€” `graphs list` plane behavior, `schema plan` - vs `schema show/apply` plane split, and plane-grouped `--help` output. -- **`system_local.rs`** β€” `--server` / operator-targeting edge cases end-to-end. - -Pin the new wrong-plane error strings deliberately: this RFC is intentionally -replacing accidental `Omnigraph::open` strings with stable capability errors, and -those strings become observable behavior (Hyrum). - -## Relationship to RFC-009 - -RFC-009 Phase 4 was scoped as "declared plane capabilities" for the -embedded-vs-remote axis only. This RFC **subsumes and broadens** that phase into -the full three-plane, per-subcommand model (adds uniform maintenance addressing, -the authority rule, and help grouping). RFC-009 Phase 5 (remote `load` β†’ -`/load` route alignment) is unaffected and remains in RFC-009. - -**`graphs list` reconciliation:** RFC-009's answered open question (pinned in -`parity_matrix.rs`'s exclusions comment) targets `graphs list` becoming -Both-capability once the embedded arm enumerates the cluster catalog. This RFC -**aligns** with that rather than superseding it: the capability table shows -`graphs list` as remote today, embedded-cluster later. - -## Open questions - -1. **Capability-table location** β€” a CLI-internal const, or surfaced (e.g. in - `--help` and a machine-readable `omnigraph capabilities` for tooling)? -2. **`--cluster --graph ` for maintenance** β€” does the maintenance - command resolve the storage URI from the applied cluster state, or from the - declared `cluster.yaml`? 
(Applied state is the truth the server serves; - declared config may be ahead of it.) - -## Review comments (Codex, 2026-06-13) - -Overall take: the direction is right. The planes already exist; making them -declared in code, help text, and error messages should reduce operator surprise. -Keeping storage maintenance off HTTP is also the right boundary: `optimize`, -`repair`, and `cleanup` are direct-storage operator actions, not a multi-tenant -serving surface. - -Before implementation, tighten these points: - -1. **Resolver authority needs a sharper rule.** The proposal says maintenance - resolves storage URIs "from `cluster.yaml` / operator config", but those are - different authority surfaces. Today `--target` is an operator/legacy - graph-target lookup; cluster config is read by `cluster` commands and by - `--cluster` server boot. Do not make a maintenance command silently consult - both and pick a precedence. Either: - - `--target` on maintenance means an operator/legacy target whose URI is - already direct storage, with remote targets failing loudly; or - - add an explicit cluster-root/config resolver for this case, so reading - cluster state is an intentional mode. - - **Resolution (accepted):** both β€” `--target` resolves through operator/legacy - config and must be direct storage (remote β†’ loud fail); cluster-managed graphs - use the explicit `--cluster --graph ` resolver. See *Authority - rule* under Proposed shape. - -2. **`graphs list` conflicts with RFC-009's target shape.** This RFC classifies - `graphs list` as remote-only, while RFC-009's answered open question says it - becomes Both-capability once the embedded arm enumerates the cluster catalog. - Pick one direction here: either this RFC explicitly supersedes that target, - or the capability table should show `graphs list` as remote today and - embedded-cluster later. - - **Resolution (accepted):** align, don't supersede. The table shows `graphs - list` remote-today / embedded-cluster-later. 
See *Relationship to RFC-009*. - -3. **The capability table should be per subcommand, not per family.** The - family-level rows hide the exact cases the table is supposed to make - non-accidental. At minimum, call out: - - `schema plan` as local/storage-backed today, while `schema show` and - `schema apply` route through the graph client; - - `queries validate` versus `queries list`, which do not have the same - plane shape; - - `lint`, `policy`, `embed`, `login`, `logout`, `config`, and `version`, so - enumeration/session/tooling commands are intentionally classified instead - of falling outside the model. - - **Resolution (accepted):** the capability table is now per-subcommand and - classifies every command, including the session/tooling group. - -4. **`init` should be an explicit exception.** Direct-storage `init` is fine. - A cluster-managed graph should be created by `cluster apply`, with ledger, - recovery, and approval semantics. If a named target resolves to a - cluster-managed graph root, `init` should signpost `cluster apply` rather - than quietly initializing that root out of band. - - **Resolution (accepted):** promoted from open question to a decision. See - *`init` is an explicit exception*. - -Testing notes for the implementation slice: - -- Extend the existing CLI suites rather than adding a new duplicate harness: - `parity_matrix.rs` for capability exclusions, `cli_data.rs` for maintenance - wrong-plane errors, `cli_schema_config.rs` for `graphs list` / help behavior, - and `system_local.rs` for `--server` / operator-targeting edge cases. -- Pin the new wrong-plane error strings deliberately. This RFC is intentionally - replacing accidental `Omnigraph::open` strings with stable capability errors, - and those strings become observable behavior. - - **Resolution (accepted):** captured as the *Test plan* section. 
- -## Verification comments (Codex, 2026-06-13) - -Follow-up verification against the current CLI/server code found a few -remaining current-state nits. These are doc-shape issues, not objections to the -proposal: - -1. **Current-state table overstates `graphs list`.** The table under *Current - state of affairs* still lists `graphs list` with data verbs that reach the - graph by embedded engine or HTTP. Current code routes it through `GraphClient` - only to share the resolver, but the embedded arm fails loudly; the later - RFC text correctly says remote today / embedded-cluster later. Make the - current-state row match that. - - **Resolution (accepted):** the Data row now marks `graphs list` **remote-only - today**, with a note that it rides `GraphClient` only to share the resolver. - -2. **Current-state table overstates `init` addressing.** `init` is grouped with - maintenance verbs whose addressing surface is positional URI or `--target`. - Current `init` only accepts a required positional URI and has no `--target` - or config path. The proposal can add that capability, but the current-state - table should not describe it as already present. - - **Resolution (accepted):** the Storage row now calls out that `init` takes - **only a required positional URI** today (no `--target`); adding `--target` to - `init` is part of the proposal, entangled with the `init`β†’`cluster apply` - signpost, not current state. - -3. **`apply_server_flag` call-site count is stale.** The text says data verbs - resolve `--server prod --graph knowledge` through `apply_server_flag` at - 16 call sites. Current code has the fork collapsed: data verbs call - `GraphClient::resolve*`, and only the two `GraphClient` factories call - `apply_server_flag`. Rephrase the verified fact around `GraphClient`, not - the old pre-collapse call-site count. 
- - **Resolution (accepted):** validated-fact #1 now describes the post-collapse - reality (`GraphClient::resolve*`; the two factories call `apply_server_flag`), - dropping the stale count. - -4. **`--cluster --graph ` collides with today's global `--graph` - semantics.** The target ergonomics section proposes that flag shape for - maintenance, but current `--graph` is a global flag that requires - `--server` and appends `/graphs/` to a remote server URL. Either choose - a separate cluster-maintenance graph flag shape, or call out the clap/global - flag migration explicitly as part of the implementation. - - **Resolution (accepted):** the *Authority rule* now carries a flag-shape - caveat β€” the cluster-managed resolver (and its flag shape, e.g. - `--cluster-graph` vs a `--graph` migration) is **deferred to a later slice**; - the operator/legacy `--target` path lands first. The illustrative - `--cluster --graph ` spelling is marked as not-final. diff --git a/docs/dev/rfc-011-cli-refactoring.md b/docs/dev/rfc-011-cli-refactoring.md deleted file mode 100644 index d26dd84..0000000 --- a/docs/dev/rfc-011-cli-refactoring.md +++ /dev/null @@ -1,756 +0,0 @@ -# RFC-011: CLI refactoring β€” one addressing & config model - -**Status:** Accepted β€” implemented (the `omnigraph.yaml` excision landed as -#250/#251/#252; D1–D4, D6, D7, D9, D10 shipped). Two items remain: **D11** -(server-side maintenance jobs) is gated on the bulk-data-plane RFC #219; **D5** -(combined admin scope) stays deferred by design. 
-**Date:** 2026-06-14 -**Audience:** CLI/server maintainers -**Builds on:** [rfc-007-operator-config.md](rfc-007-operator-config.md) -(per-operator config, keyed credentials, named servers), -[rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md) -(the legacy file this RFC finishes removing), -[rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) -(`GraphClient` β€” embedded ≑ remote at the execution layer), -[rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md) -(declared planes + the wrong-plane guard this RFC subsumes). -**Sequencing:** lands as / after RFC-008 stage 5 (the `omnigraph.yaml` removal). - -## Summary - -Refactor the CLI around one coherent model once `omnigraph.yaml` is gone. The -shape: - -- **One ontology** (store, server, cluster; cluster config vs operator config; - catalog; profile; capability) where each term names exactly one concept. -- **Addressing = scope + `--graph`, with the access path *derived*.** A command - resolves a *scope* (operator defaults, an optional named *profile*, or one - explicit primitive address β€” `--store` / `--server` / `--cluster`), selects a - graph inside it with `--graph`, and the **served-vs-direct access path falls out - of the scope's bindings Γ— the verb's capability** β€” it is never a per-command - toggle and never inferred from a URI scheme. -- **Served is the front door; direct storage is privileged.** The everyday scope - is a *server* (a bearer token, no bucket credentials). Reading or writing a - remote store/cluster directly is an explicit, credentialed, admin/break-glass - act β€” never the default, never baked into everyday operator config. -- **The CLI is stateless per command.** No `current_profile` pointer, no - `USE`-style mode; every command is fully determined by its flags + static - config. You *select* a graph, you do not *switch into* one. 
-- **Definitions are named; payloads are passed.** Queries (`.gq`) and schema - (`.pg`) live in the catalog and are invoked by name; params and bulk data are - the only per-call inputs. - -This removes `--target`, `--cluster-graph`, `--uri` scheme-dispatch, and the -plane guard's "a `--target` that resolves to a remote URL" special case β€” and it -collapses the four-plane vocabulary, for users, into a single capability rule. - -## Motivation: the legacy file pollutes the taxonomy - -Today the CLI exposes four overlapping addressing forms but the system has only -three real entities; the mismatch is the whole problem, and `omnigraph.yaml` is -the carrier: - -1. **`--target` straddles kinds.** It resolves through the legacy - `omnigraph.yaml` `graphs:` map (`config.rs::resolve_target_uri`), and that - `.uri` can be a **storage location** (`file`/`s3`) *or* a **remote server** - (`http`). One flag, two access paths with different capability and trust - models. The wrong-plane guard's storage-plane remote rejection - (`helpers.rs:467`) exists *only* to compensate for this overload. -2. **Scheme-inferred transport.** ``/`--uri` has the same disease a level - down: `is_remote_uri` (`helpers.rs:15`) silently picks embedded vs remote from - the scheme. Transport is guessed from a string, not declared. -3. **No single environment concept.** Defaults are smeared across the deprecated - `omnigraph.yaml` (`cli.graph`, `server.graph`) with no clean way to name or - switch environments. - -Removing `omnigraph.yaml` is the moment to fix all three at once. - -## Ontology - -Every term is one concept. The rest of this RFC uses them precisely. - -### Entities β€” the things that exist - -- **Graph** β€” a typed property graph (node/edge types over Lance); the thing you - query and mutate. *Example: the `knowledge` graph.* -- **Store** β€” the storage location of a **single** graph: its Lance datasets at a - `file://`/`s3://` URI. Addressed directly with `--store`. 
*Example: - `s3://acme/clusters/brain/graphs/knowledge.omni`.* -- **Cluster** β€” a storage root holding **many** graphs plus the catalog and - control-plane state (state ledger, approvals, recovery). Managed as-code by the - team. *Example: the `brain` cluster at `s3://acme/clusters/brain`.* -- **Server** β€” an `omnigraph-server` process serving graphs over HTTP with bearer - auth and Cedar policy; boots from a bare graph or a cluster. *Example: `prod` at - `https://graph.example.com`, serving the `brain` cluster.* - -### Config & catalog β€” the descriptions - -- **Cluster config** β€” `cluster.yaml` in the cluster root, declaring the **desired - state** (graphs, schemas, stored queries, policies, storage), applied with - `cluster apply`. Team-owned; the source of truth for *what the system is*. -- **Catalog** β€” the **applied** registry the cluster owns in storage: the graphs, - stored queries, and policies `cluster apply` materialized. What a server serves - and what `query ` resolves against. *(Cluster config is the spec; the - catalog is the applied result.)* -- **Operator config** β€” `~/.omnigraph/config.yaml`, your **personal** file: - identity (actor), default graph, named servers/clusters, output prefs, optional - profiles. Declares *who I am*, never what the system is. -- **Profile** β€” an optional named bundle of **defaults inside the operator - config** (one of {cluster, server, store} + a default graph). Config data, - **not state**: selecting one fills in omitted flags for a command; it does not - put you "in" a mode. Chosen per command (`--profile `) or per shell - (`OMNIGRAPH_PROFILE`). -- **Credential** β€” a bearer token keyed to a **server name**, resolved via - `OMNIGRAPH_TOKEN_` or `~/.omnigraph/credentials` (`0600`); sent only to - the server it is keyed to. (Per RFC-007 β€” the operator config holds endpoints, - never tokens.) 
- -### What you run β€” definitions vs payloads - -- **Schema** β€” the `.pg` type definitions for a graph; authored as a file, applied - via `schema apply` (or `cluster apply`). -- **Stored query** β€” a named query in the catalog, the team's reusable contract; - invoked by name. *Example: `find_people`.* -- **Query file (`.gq`)** β€” an authoring artifact holding `query ` - declarations; becomes a stored query when `cluster apply` adopts it. For - authoring/ad-hoc, not everyday invocation. -- **Payload** β€” the per-call inputs that vary each run: params (`--params`, - positional args) and bulk data (`--data`). Never part of config. - -### How a command resolves - -- **Scope** β€” the resolved environment a command addresses: operator defaults, a - named profile, or one explicit primitive address. -- **Access path** β€” **served** (through a server) or **direct** (open storage - in-process). Derived from scope Γ— capability; see "Access path" below. -- **Capability** β€” what a verb requires: `any`, `served`, `direct`, `control`, - or `local`. -- **Target shape** β€” whether the verb is **graph-scoped** (selects one graph - inside the scope), **scope-scoped** (operates on the whole server/cluster - scope), or **local** (does not resolve scope or graph). -- **Actor** β€” the identity a write is attributed to: server-resolved from the - bearer token (served), or `--as` ?? `operator.actor` (direct). - -### The relationships that prevent confusion - -- **Exactly two config surfaces:** **cluster config** (team) and **operator - config** (personal). Nothing else is "a config." -- A **profile is not a third config** β€” it lives *inside* the operator config, and - it is **defaults, not state**. -- A **catalog is not config** β€” it is the *applied state* the cluster owns. -- A **store is one graph; a cluster is many graphs** + catalog + control state. -- A **graph is the logical thing**; store/server/cluster are ways to reach it. 
-- "State" elsewhere is not the profile: *graph state* is committed data in Lance; - *cluster state* is the applied control-plane ledger. Neither is operator config. - -## Design - -### First principles - -> Addressing should be 1:1 with the system's real entities; the access path -> (served vs direct) should be **derived**, never inferred from a string or -> toggled per command; the CLI should be **terse by config and stateless per -> command**; and **definitions are named while payloads are passed**. - -Every command answers four orthogonal questions β€” kept orthogonal here: - -| Axis | Question | Today | Target | -|---|---|---|---| -| Scope | which environment? | `omnigraph.yaml` defaults / `--target` | operator defaults Β· `--profile` Β· one primitive | -| Target shape | whole scope or one graph? | implicit in command family | declared per verb | -| Graph | which graph in it? | tangled into the address | `--graph` only for graph-scoped server/cluster verbs | -| Access path | served or direct? | inferred from scheme / target | **derived** from scope Γ— capability | -| Actor | who am I? | `--as` > `cli.actor` (yaml) > `operator.actor` | `--as`/`operator.actor` (direct) Β· token (served) | - -### A scope binds one entity β€” and served is the default - -A scope (a profile, the flat defaults, or one primitive flag) binds **exactly one -of** {server, cluster, store}. Server and cluster scopes may contain many graphs -and can carry a `default_graph`; a store scope is already one graph and does not -accept `--graph`. They differ by privilege, and **the everyday default is a -server**: - -- **server** β†’ served (the everyday scope). A bearer token, **no storage - credentials**. Data verbs run through it, policy-enforced; maintenance verbs are - unavailable from this scope β€” there is no server route for them, so you must - name storage explicitly. This is what a normal operator's config binds. 
-- **cluster** β†’ direct storage to a managed cluster, for **control, - maintenance, and graph-backed validation only** (`cluster *`, - `optimize`/`repair`/`cleanup`/`schema plan`, graph-backed `lint`, and - `queries validate`). Data verbs are **not** run directly against a cluster β€” - they go served, or `--store` for ad-hoc. **Privileged:** requires bucket - credentials, so it appears only in a maintainer's config or as an explicit - `--cluster` flag β€” never in an everyday operator's defaults. -- **store** β†’ one graph's storage, direct. A **local file** store is ordinary - local dev; a **remote `s3://`** store is break-glass. No catalog (named queries - do not resolve β€” the ad-hoc lane). - -A scope names **one** thing, so there is no independent `server`+`cluster` pair -that could disagree (the audit's coherence hazard is gone by construction β€” the -default is just a server). And the storage root lives only where it must: - -### Direct storage access is privileged (the storage-root rule) - -> The storage root (`s3://…`) is **server-and-admin knowledge, never -> everyday-operator knowledge.** Everyday operator config binds a server (a bearer -> token, no bucket credentials). Direct remote access β€” opening a cluster root or -> an `s3://` store β€” is always **explicit and privileged**: you name -> `--cluster`/`--store`, and only someone with bucket credentials can. The CLI -> never opens a remote store from a default scope. - -This is the least-privilege posture β€” revoke a bearer token, don't rotate bucket -keys; only the **server process** and an occasional **maintenance admin** ever -hold storage credentials. It makes "use the server, not raw storage" -**structural**, not advisory: direct access requires credentials a normal operator -does not have *and* a flag they must type. The only storage root in an everyday -setup is the one the **server** boots from; operators never see it. 
(Local *file* -stores for dev are unaffected β€” a local file is not the production bucket.) - -### Access path is derived, not chosen - -The two access paths are genuinely different β€” not two transports for one thing: - -- **Served** (through a server): the server resolves your actor from a token and - enforces Cedar policy at the HTTP boundary. In cluster mode the **catalog and - config** (graph set, stored queries, policy bundles) are pinned to the applied - serving revision and move only on restart; **graph data** is read through the - server's engine handle against the requested branch/snapshot (it is not frozen - at boot, though a long-running server will not observe *out-of-band direct - writes* to storage until its handle refreshes). No storage credentials needed. -- **Direct** (open the Lance storage in-process): a **privileged** path β€” it needs - your own storage credentials, so only an admin/maintainer (or a local-dev file - store) takes it. Actor self-declared (`--as` ?? `operator.actor`), reads **live - storage HEAD**. There is **no server-side identity/auth gate** β€” but engine-level - Cedar policy *is* still enforced when the graph selection provides a policy - (enforcement is engine-wide; embedded `_as` writers call the same `enforce`). - "Direct" means "no HTTP boundary," not "unpoliced." - -Because they differ in authority, freshness, and availability, a graph reached via -a server and that graph's raw storage are **different things you name -differently** β€” not one identity you flip. Making the access path a per-command -toggle (`--via`) is the `--target` mistake in new clothes; it is rejected. - -> **The access path follows from the scope and the verb.** A **server** scope β†’ -> served (data/catalog). A **cluster** scope β†’ direct control, maintenance, and -> validation. A **store** scope β†’ direct ad-hoc data (no catalog). The verb's -> capability picks which applies and rejects the mismatches. 
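The rule quoted above can be read as a small total function over (capability, scope kind). What follows is a minimal sketch under stated assumptions, not the CLI's implementation: the names `ScopeKind`, `Capability`, `AccessPath`, and `derive_access_path` are hypothetical, the error strings are illustrative, and the mismatch cases follow step 6 of the RFC's resolution rule.

```rust
/// The single entity a resolved scope binds (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScopeKind {
    Server,
    Cluster,
    Store,
}

/// What a verb requires, using the RFC's five capability names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Capability {
    Any,
    Served,
    Direct,
    Control,
    Local,
}

/// The two access paths the RFC distinguishes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AccessPath {
    Served,
    Direct,
}

/// Derive the access path from capability x scope.
/// `Ok(None)` means the verb is local and never opens a graph; the caller
/// is assumed to have rejected scope flags for local verbs beforehand.
pub fn derive_access_path(
    cap: Capability,
    scope: ScopeKind,
) -> Result<Option<AccessPath>, String> {
    match (cap, scope) {
        (Capability::Local, _) => Ok(None),

        // Data verbs: served through a server, direct only against a store.
        (Capability::Any, ScopeKind::Server) => Ok(Some(AccessPath::Served)),
        (Capability::Any, ScopeKind::Store) => Ok(Some(AccessPath::Direct)),
        (Capability::Any, ScopeKind::Cluster) => Err(
            "cluster scope is maintenance-only; use a server for data or --store for ad-hoc"
                .to_string(),
        ),

        // Served verbs need an HTTP server.
        (Capability::Served, ScopeKind::Server) => Ok(Some(AccessPath::Served)),
        (Capability::Served, ScopeKind::Cluster | ScopeKind::Store) => {
            Err("this verb needs a server scope".to_string())
        }

        // Storage-native verbs never run through a server.
        (Capability::Direct, ScopeKind::Cluster | ScopeKind::Store) => {
            Ok(Some(AccessPath::Direct))
        }
        (Capability::Direct, ScopeKind::Server) => {
            Err("this verb needs direct storage access; name storage explicitly".to_string())
        }

        // Control verbs address the cluster itself, which is direct storage.
        (Capability::Control, ScopeKind::Cluster) => Ok(Some(AccessPath::Direct)),
        (Capability::Control, ScopeKind::Server | ScopeKind::Store) => {
            Err("this verb needs a cluster scope; name a cluster explicitly".to_string())
        }
    }
}
```

Because the match is exhaustive, adding a capability or a scope kind fails to compile until every pairing is decided, which is the same drift-guard property the RFC attributes to the `command_plane` match.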
- -State the bound plainly: the everyday data path -(`query`/`mutate`/`load`/`branch`/`export`/`commit`) against a served graph -**never needs direct storage access**, and direct access is legitimate only in -bounded places: **bootstrap** (`init`), **storage-native maintenance** -(`optimize`/`repair`/`cleanup`/`schema plan`), **graph-backed validation** -(`lint`), **catalog validation** (`queries validate`), the **control plane** -(`cluster *`), **local dev** with no server, and **break-glass** (recovery, or -checking whether a long-running server's handle lags live HEAD). Everything else -is served. This is what makes "discourage direct storage" enforceable rather -than aspirational. - -This list is expected to **shrink**: Decision 11 moves -`optimize`/`cleanup` (and healthy-path `repair`) to server-managed jobs, which -would leave direct access to just standalone/local dev, the control plane, and -break-glass β€” and remove the last routine reason an admin needs bucket -credentials. - -### Capability semantics - -The CLI validates through verb capability, not plane jargon: - -| Capability | Meaning | Examples | -|---|---|---| -| `any` | graph-scoped data; served via a server scope; direct only against a **store** scope (local dev / break-glass); **errors on a cluster scope** | `query`, `mutate`, `load`, `export`, branch reads, `schema show/apply` | -| `served` | requires an HTTP server; may be graph-scoped or scope-scoped | `graphs list`, `queries list` | -| `direct` | graph-scoped storage-native or graph-backed validation; no server form exists | `init`, `optimize`, `repair`, `cleanup`, `schema plan`, graph-backed `lint` | -| `control` | cluster-scoped catalog/control-plane work; addresses the cluster, not a single raw store | `cluster *`, `queries validate` | -| `local` | does not address a graph or scope | `config`, `profile`, `lint --query ... --schema ...` | - -`any` does **not** mean "the user picks": the resolver picks from the scope. 
-Internally the exhaustive `command_plane` match (`planes.rs`) stays as the drift -guard; user-facing errors speak in terms of what the command needs. - -### Definitions vs payloads - -Queries and schema are **definitions** β€” contracts that live in the catalog and -are invoked **by name**; params and data are **payloads** passed per call. So the -everyday form is `omnigraph query [params]`, not -`omnigraph query --file find.gq`. A `.gq` path on a routine query is a smell: the -query is not in the catalog yet. Lifecycle: **author a `.gq` β†’ `cluster apply` -adopts it β†’ invoke by name thereafter.** - -Named queries resolve through a **server** (which serves the cluster's catalog). -`queries list` is therefore a served catalog read. `queries validate` is a -control/catalog check against the cluster-owned query definitions. A bare -`--store` has **no catalog**, so it is the ad-hoc lane (`-e` / `--file`), and -`--cluster` does not invoke stored queries. So named-query invocation is a -**served** convenience; direct access (`--store`) is always ad-hoc. - -| Kind | Examples | How it enters a command | -|---|---|---| -| Definition | stored query, schema | named in the catalog; authored as a file, adopted by `cluster apply` | -| Payload | params, bulk data | passed per call (`--params`, positional args, `--data`) | -| Authoring / ad-hoc | a `.gq` you're writing | `-e '…'`, `--file new.gq`, `lint --query new.gq --schema schema.pg`, `schema apply --schema` | - -### Resolution rule - -1. If the verb is `local`, reject graph/scope flags and run without resolving a - scope. -2. If a primitive address is supplied (`--store`/`--server`/`--cluster`), use it - and ignore operator-config scope defaults. *(A **named** primitive β€” `--server - prod`, `--cluster brain` β€” still resolves through the operator-config registry; - a **literal** β€” `--server https://…`, `--store s3://…` β€” bypasses it. 
Per - Decision 2: a value containing `://` is a literal, otherwise a config-name - lookup.)* -3. Else if `--profile ` (or `OMNIGRAPH_PROFILE`) selects a profile, use it. -4. Else use the operator config's flat defaults. Error only if neither resolves. - *(No sticky "current" pointer β€” each command resolves scope fresh.)* -5. Resolve the graph only for **graph-scoped** verbs. Server/cluster scopes: - exactly one graph in scope β†’ use it; else `default_graph`; else require - `--graph `. Store scopes are already one graph, so `--graph` is rejected. - **Scope-scoped** verbs (`graphs list`, `queries list`, `queries validate`, - and `cluster *`) do not select a graph unless their own resource argument says - otherwise. -6. Derive the access path from capability Γ— scope: - - `direct` verb β†’ the scope's cluster/store; if the scope is a server, error - (name storage explicitly β€” it is privileged). - - `served` verb β†’ the scope's server; if the scope is a cluster/store, error. - - `control` verb β†’ the scope's cluster; if the scope is a server/store, error - (name a cluster explicitly β€” it is privileged). - - `any` verb β†’ **served** if the scope is a server; **direct** against a - **store** scope (ad-hoc); on a **cluster** scope, error β€” cluster is - maintenance-only, so use a server for data or `--store` for ad-hoc. -7. Reject mismatches with an error naming the missing axis. - -Good errors: - -```text -scope "prod" has 4 graphs; pass --graph or set default_graph -optimize needs direct storage access; scope "prod" is a server β€” name storage with --cluster s3://… or --store (requires storage credentials) -graphs list enumerates a server scope; do not pass --graph ---store opens raw storage directly, bypassing any server (no HTTP auth gate, live HEAD); for recovery/inspection -``` - -### Config shape (operator config) - -`~/.omnigraph/config.yaml` β€” your personal file; the cluster config -(`cluster.yaml` + catalog) is the separate, team-owned surface. 
The default-graph -key is `default_graph` everywhere (the per-command flag is `--graph`). - -**Everyday operator β€” binds a server, holds no storage root:** - -```yaml -defaults: - server: prod - default_graph: knowledge - output: table -servers: - prod: { url: https://graph.example.com } # token keyed by name (RFC-007); no creds here - staging: { url: https://staging.example.com } -profiles: # optional, only for multiple environments - staging: { server: staging, default_graph: knowledge } -``` - -A normal operator never has a storage root or bucket credentials. Their default -scope is served; `optimize`/`repair`/`cleanup` error with a pointer to name -storage explicitly. - -**Maintainer β€” opts into a cluster root (and has bucket credentials):** - -```yaml -profiles: - brain-admin: { cluster: brain, default_graph: knowledge } # direct; admin/control/maintenance -clusters: - brain: { root: s3://acme/clusters/brain } # the s3:// root lives ONLY here -``` - -The `clusters:` block β€” the only place a storage root appears in operator config β€” -is **admin-only and opt-in**, absent from a normal operator's file. Equivalently, -skip config and name it per command: -`omnigraph optimize --cluster s3://acme/clusters/brain --graph knowledge`. The -cluster stays the source of truth for the managed catalog; tokens live in the -keyed credential store, never in this file. - -### Command shape - -Assume the everyday flat defaults: server `prod`, default graph `knowledge`. 
- -| Intent | Command | Path | -|---|---|---| -| Run a catalog query | `omnigraph query find_people` | served | -| …with params | `omnigraph query find_people --params '{"title":"Eng"}'` | served | -| Another graph in scope | `omnigraph query find_people --graph archive` | served | -| Write | `omnigraph load --data batch.jsonl --mode append` | served | -| A different environment | `omnigraph --profile staging query find_people` | served | -| One-off server, no config | `omnigraph query find_people --server https://graph.example.com --graph knowledge` | served | -| Maintain (admin, explicit storage) | `omnigraph optimize --cluster s3://acme/clusters/brain --graph knowledge` | direct (privileged) | -| Maintain (admin, via admin profile) | `omnigraph --profile brain-admin optimize --graph knowledge` | direct (privileged) | -| List catalog queries | `omnigraph queries list` | served | -| Validate cluster query catalog | `omnigraph queries validate --cluster s3://acme/clusters/brain` | control (privileged) | -| Offline query lint | `omnigraph lint --query new.gq --schema schema.pg` | local | -| Graph-backed query lint | `omnigraph lint --query new.gq --cluster s3://acme/clusters/brain --graph knowledge` | direct (privileged) | -| Local dev, no server | `omnigraph query -e 'match { … } return { … }' --store graph.omni` | direct (local file) | -| Break-glass: raw storage of a served graph | `omnigraph query --file find.gq --store s3://acme/clusters/brain/graphs/knowledge.omni` | direct (privileged, rare) | - -Note what the everyday rows are: **all served.** `optimize` does *not* appear in -the default-scope rows β€” from a server scope it errors and points you to name -storage (see the resolution rule), so maintenance is always a deliberate, -credentialed act. There is no "force served/direct" row β€” you never toggle the -path on a configured graph; the only way to reach raw storage is to *name it* -(`--cluster`/`--store`), which makes the privileged bypass unmistakable. 
Everyday -rows invoke a query **by name**; a `.gq` file appears only where there is no -catalog (bare store, break-glass) via `-e`/`--file`. - -## Before / after - -**Before** = best available today (legacy `omnigraph.yaml` `--target`, `.gq` -files, `--cluster-graph`, scheme inference). **After** = this model. - -| Intent | Before | After | -|---|---|---| -| Run a query | `omnigraph query --target knowledge --query find.gq --name find_people` | `omnigraph query find_people` | -| Another graph | `omnigraph query --target archive --query find.gq --name find_people` | `omnigraph query find_people --graph archive` | -| Load | `omnigraph load --data b.jsonl --mode append --target knowledge` | `omnigraph load --data b.jsonl --mode append` | -| Maintain (admin) | `omnigraph optimize --cluster brain --cluster-graph knowledge` | `omnigraph optimize --cluster s3://acme/clusters/brain --graph knowledge` | -| Another environment | edit `omnigraph.yaml`, or re-address with full URIs | `--profile staging …` or `OMNIGRAPH_PROFILE=staging` | -| One-off remote | `omnigraph query --uri https://… --query find.gq` *(schemeβ†’remote)* | `omnigraph query find_people --server https://… --graph knowledge` | -| Raw storage of a served graph | `omnigraph query s3://…/knowledge.omni --query find.gq` *(looks like a normal query)* | `omnigraph query --file find.gq --store s3://…/knowledge.omni` *(explicit bypass)* | - -**Removed:** `--target`; `--cluster-graph` (`--graph` is the graph selector only -for graph-scoped server/cluster verbs); `--uri` http-scheme dispatch; `--via` -(never ships); everyday `--query ` (definitions are named); -`omnigraph.yaml` and its `cli.graph`/`server.graph` defaults. - -## Server-side corollary - -The same ontology applies to `omnigraph-server` boot: with `omnigraph.yaml` gone, -a server boots from a single bare graph URI **or** a cluster (`--cluster `, -RFC-005), never a `graphs:` map. The store/server/cluster ontology is then -consistent across CLI and server. 
- -## Migration & compatibility - -Addressing flags and config keys are observable contract (Hyrum); every removal is -staged and release-noted. - -- **`config migrate`** (shipped) maps each legacy `graphs:` entry **by what it - actually is**: `http(s)` URIs β†’ a `server:` (the recommended everyday shape); - `file` URIs β†’ a local `store:`; an `s3://` **graph** URI β†’ an **admin** `store:` - (it is a single graph, not a cluster); an `s3://` **cluster root** (one that - carries cluster state) β†’ an **admin** `cluster:`. Everyday `s3://` graph usage - migrates with a **warning** β€” prefer serving it via a server rather than - re-establishing direct remote access. It reports dropped keys. -- **Operators move to a server-default scope.** Where a legacy setup pointed - `cli.graph` at an `s3://` graph for everyday use, migration flags it: the - recommended shape is a `server:` scope (bearer token, no bucket creds), with the - `s3://` root kept only in a maintainer's config β€” not every operator's. -- **`--target`** warns for one release, then errors; **`OMNIGRAPH_NO_LEGACY_CONFIG=1`** - (already the strict switch) becomes the default β€” loading `omnigraph.yaml` is a - hard error. -- **`--cluster-graph` β†’ `--graph`**: `--cluster-graph` is accepted with a warning - for one release, then removed. -- **`--graph` meaning change**: today `--graph` is "graph id on a multi-graph - server" (paired with `--server`); it generalizes to "select the graph for - graph-scoped verbs in server/cluster scopes." Existing `--server --graph` - usage keeps working (it is a strict superset); release-note the broadened - meaning and the fact that store/scope-scoped verbs reject it. -- **`--uri http://…`** warns, then errors with a pointer to `--server`. -- **`--as` on served paths**: today global `--as` is accepted (a no-op on remote - writes β€” the server resolves the actor from the token); rejecting it on the - served path is staged β€” warn for one release, then error. 
-- **`--alias`** β†’ the `alias` namespace (`omnigraph alias `, Decision 4); - the old `--alias` flag warns for one release, then is removed. - -## Non-goals - -- **No change to the direct/served capability split.** Maintenance stays - storage-direct by design (no server routes for `optimize`/`repair`/`cleanup`); - this RFC only makes the split explicit. -- **No new transport.** Addressing surface, not protocol. -- **No positional sigil grammar** (`@server/graph`, `%cluster/graph`). Considered - and rejected: explicit flags are more discoverable; profiles already give - brevity. Revisit only on demonstrated expert-terseness demand. - -## Decisions - -The questions this RFC opened are resolved as follows. Two are explicitly -deferred (see below); they do not block the model. - -1. **Local-dev path β†’ embedded `--store` scope.** Local dev runs the engine - in-process against a `--store ` (or a store-scoped profile); `omnigraph - serve` stays available but is not required. Consistent with embedded ≑ remote - (RFC-009). -2. **Primitives are one flag, typed by content.** `--server` and `--cluster` - accept either a config name or a literal URI: a value containing `://` is a - literal (bypasses the registry); otherwise it is a config-name lookup (error if - unknown). `--store` is always a URI. (Replaces the earlier "literal-vs-named" - question β€” no `--server-url`/`--cluster-root` split.) -3. **Stored invocation: `query ` (read) / `mutate ` (write), one - catalog namespace.** A name maps to one definition; the verb asserts its kind - and the CLI errors on mismatch (`'apply_labels' is a mutation β€” use - omnigraph mutate apply_labels`). No `invoke` verb. -4. **Aliases live under an `alias` namespace** β€” `omnigraph alias [args]`, - never bare top-level. An alias can therefore neither shadow nor be shadowed by a - built-in (current or future) verb. -6. 
**Profile merge: scope wholesale, prefs layered.** The entity binding + - `default_graph` come *wholesale* from the active scope (a profile, or flat - defaults if none) β€” never per-key merged across the entity dimension (that would - yield "server *and* cluster"). Only non-scope preferences (`output`, table - layout) take flat defaults as a base. Precedence: explicit flag > profile > flat - defaults. -7. **No default graph β†’ error + list candidates.** A graph-scoped verb with no - `--graph`, no `default_graph`, and >1 graph in scope errors and lists candidates - (served: `GET /graphs`; cluster-direct: catalog enumeration). If enumeration is - policy-gated/unavailable, it says so and asks for `--graph`. Never auto-pick. -9. **Diagnostics & safety.** Writes echo the resolved scope + access path to stderr - (suppress with `--quiet`). Destructive verbs (`cleanup`, overwrite `load`, - `branch delete`) require confirmation when the scope is not local; `--yes` skips - it; **no TTY without `--yes` errors** (never silently proceed). `--json`/CI never - prompt β€” destructive without `--yes` errors. -10. **Cluster graphs evolve only via `cluster apply`.** `schema apply` (an `any` - verb) targets standalone graphs; against a cluster-managed graph it errors and - points at `cluster apply` (which records ledger/recovery/approvals β€” RFC-004). - Mirrors `init`'s refusal of a cluster-managed path. -11. **Maintenance moves server-side (committed direction).** `optimize`/`cleanup` - (and healthy-path `repair`) become server/cluster-managed async jobs β€” - policy-gated, audited, single-coordinator β€” with `direct` retained only as - break-glass (`repair` when the server is down). Runs out-of-band (a worker + - async job routes, the `POST …` / `GET …/{id}` shape of the bulk-data-plane RFC - (`docs/rfcs/0001-bulk-data-plane.md`, PR #219, not yet merged)), never inline in - serving; `schema plan` is - excluded (β‰ˆ `cluster plan` in cluster mode). 
The **mechanism** (job routes, - worker, scheduling) is a follow-up RFC; until it lands the capability table above - stands, and maintenance is `direct`. When it lands, the maintenance verbs' - capability becomes "served-job + direct break-glass." - -## Deferred - -Non-blocking; settle when convenient. - -- **D5 β€” combined admin scope.** A scope binds one entity; admins read via a - server scope and maintain via `--cluster`. A `deployments: { … }` object - (server + cluster validated coherent, referenced by a profile) is revisited only - if admin ergonomics demand it β€” and Decision 11 largely removes the need. -- **D8 β€” the `profile` command surface.** *Shipped:* `profile list` / `profile - show []` (read-only inspection). The *no sticky `profile use`* constraint - holds β€” it is a design principle, not a command. - -## Safety - -Dropping the sticky `current_profile` pointer removes the main footgun β€” a -destructive command silently inheriting a "current" environment from an earlier -session. Because each command resolves scope fresh, what is on the command line is -what runs. Two guards remain (a flat default or `OMNIGRAPH_PROFILE` can still point -at prod): echo the resolved scope + access path on writes, and require -confirmation (or `--yes`) for destructive verbs when the resolved scope is not -local (Decision 9). The most dangerous direct writes (`cleanup`, overwrite -`load`) are *structurally* rare now β€” unavailable from the everyday server scope, -and gated behind bucket credentials plus an explicit `--cluster`/`--store` β€” so a -normal operator's setup mostly cannot issue them by accident at all. - -## Invariants & deny-list check - -- **Β§10 query semantics first-class / Β§11 transport at the boundary:** preserved β€” - addressing resolves CLI-side to a `GraphClient`; no transport concepts leak into - engine crates. 
-- **Β§12 no client-set actor:** strengthened β€” the served path's actor stays - token-resolved and `--as` is rejected there; direct self-declares. -- **Least privilege (security posture):** everyday operators hold a revocable - bearer token, not bucket credentials; only the server process and maintenance - admins hold storage creds. Direct remote access is structural opt-in, not a - default β€” narrowing the blast radius of a leaked operator config. -- **Β§6 strong consistency:** both paths are snapshot-isolated per query; this RFC - changes addressing, not isolation. -- **Deny-list (no state that drifts):** profiles and aliases are static config - sugar that resolve to canonical scopes; they declare nothing the cluster or - server doesn't already own. No sticky session state is introduced. -- No Hard Invariant is weakened; the change is CLI surface + config removal. - -## Relationship to prior work - -The completion of the config/CLI lineage: RFC-007 added the operator config and -keyed credentials; RFC-008 demoted `omnigraph.yaml`; RFC-009 unified execution -behind `GraphClient`; RFC-010 declared the planes. This RFC removes the last -legacy addressing surface so the plane model becomes a clean function of the three -real entities, and folds the planes into a single capability rule. It is adjacent -to the public-track bulk-data-plane RFC (`docs/rfcs/0001-bulk-data-plane.md`, -PR #219, not yet merged), which canonicalizes `load`/`export` verbs; this RFC -canonicalizes how every verb *addresses* a graph. - -## Appendix: target CLI taxonomy (end state) - -The full command set under this model, organized by **capability** (the new -classifying axis) instead of plane β€” the end-state counterpart to the -current-taxonomy appendix below. Every command, with its end-state addressing. 
- ```
-omnigraph
-│
-├─ any — data verbs · served by default (server scope, or --server );
-│ --graph selects the graph in scope; --store forces ad-hoc direct (no catalog)
-│ ├─ query (alias: read*)  invoke a stored query by NAME; -e/--file for ad-hoc
-│ ├─ mutate (alias: change*)  invoke a stored mutation by name; -e/--file for ad-hoc
-│ ├─ load  bulk write — --data, --mode required; --from forks a missing branch
-│ ├─ export  dump graph data (NDJSON / Arrow)
-│ ├─ snapshot  current per-table versions
-│ ├─ branch { create | list | delete | merge }  merge takes --into
-│ ├─ commit { list | show }  inspect the commit graph
-│ └─ schema { show (alias: get) | apply }  cluster graphs evolve via cluster apply (Decision 10)
-│
-├─ served — needs a server (errors on a store/cluster scope)
-│ ├─ graphs list  enumerate the graphs a server serves
-│ └─ queries list  list stored queries in the served catalog
-│
-├─ direct — storage-native, PRIVILEGED · --cluster | --store + bucket creds; never a server
-│ ├─ init  bootstrap a graph (--store ); refuses a cluster-managed path
-│ ├─ optimize  compaction; --graph selects
-│ ├─ repair  publish uncovered drift; --confirm / --force
-│ ├─ cleanup  version GC; --keep / --older-than / --confirm
-│ ├─ schema plan  migration preview (reads storage directly)
-│ └─ lint --query  graph-backed query lint (with --graph on cluster scope)
-│
-├─ control — cluster/catalog control, PRIVILEGED · --cluster
-│ ├─ cluster { validate | plan | apply | approve | status | refresh | import | force-unlock }
-│ apply/approve take --as ; force-unlock takes
-│ └─ queries validate  validate cluster-owned stored queries against graph schemas
-│
-└─ local — no graph
- ├─ policy { validate | test | explain }  offline Cedar tooling
- ├─ profile { list | show }  read-only; NO mutating `use` (no sticky state)
- ├─ alias [args]  personal shortcut; expands to its bound stored-query call (D4)
- ├─ config { migrate }  finish the omnigraph.yaml split (RFC-008)
- ├─ login / logout  per-server bearer credentials
- ├─ embed  offline embedding pipeline
- ├─ lint --query --schema  file-only query lint
- └─ version (-v)
-```
-
-`*` `read`/`change` remain as deprecated aliases (warn on use); `ingest` and the
-`check`→`lint` argv-shim are **removed**. `get` aliases `schema show`.
-
-### Addressing forms (end state)
-
-Three scope forms — one per real entity — plus the graph selector. No `--target`,
-no `--cluster-graph`, no `--uri` scheme-dispatch, no `--via`.
-
-| Form | Resolves to | Access | Privilege |
-|---|---|---|---|
-| **server scope** — operator default, a `--profile`, or `--server ` | a served endpoint + keyed token | served | everyday (bearer token) |
-| **cluster scope** — an admin profile, or `--cluster ` | a managed cluster's storage + catalog | direct | privileged (bucket creds) |
-| **store scope** — `--store ` | one graph's storage (no catalog) | direct | local-dev (file) / break-glass (s3) |
-| **`--graph `** | selects the graph for graph-scoped verbs in server/cluster scopes; invalid for store scopes and scope-scoped verbs | — | — |
-
-Resolution: explicit primitive (`--server`/`--cluster`/`--store`) → `--profile` /
-`OMNIGRAPH_PROFILE` → operator flat defaults. Access path is then derived from the
-scope kind × the verb's capability (see the Resolution rule); it is never inferred
-from a URI scheme and never toggled.
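The resolution order in the paragraph above (explicit primitive, then profile, then flat defaults) together with Decision 2's literal-versus-name rule can be sketched as follows. This is an illustrative sketch only: `Primitive`, `ScopeSource`, `is_literal`, and `resolve_scope_source` are hypothetical names, the error text is made up, and letting the `--profile` flag win over `OMNIGRAPH_PROFILE` is an assumption the RFC does not state.

```rust
/// A primitive address given on the command line (hypothetical model).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Primitive {
    Server(String),
    Cluster(String),
    Store(String),
}

/// Where the scope for this one command comes from.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ScopeSource {
    /// A literal URI; bypasses the operator-config registry.
    Literal(Primitive),
    /// A config name; must be looked up in the operator-config registry.
    Named(Primitive),
    /// A named profile inside the operator config.
    Profile(String),
    /// The operator config's flat defaults.
    FlatDefaults,
}

/// Decision 2: a value containing `://` is a literal, otherwise a name.
pub fn is_literal(value: &str) -> bool {
    value.contains("://")
}

/// Resolve the scope source fresh for each command; no sticky state.
pub fn resolve_scope_source(
    primitive: Option<Primitive>,
    profile_flag: Option<&str>,
    profile_env: Option<&str>,
    has_flat_defaults: bool,
) -> Result<ScopeSource, String> {
    // 1. An explicit primitive wins and ignores scope defaults.
    if let Some(p) = primitive {
        let literal = match &p {
            // `--store` is always treated as a URI, never a config name.
            Primitive::Store(_) => true,
            Primitive::Server(v) | Primitive::Cluster(v) => is_literal(v),
        };
        return Ok(if literal {
            ScopeSource::Literal(p)
        } else {
            ScopeSource::Named(p)
        });
    }
    // 2. A selected profile (flag before env var: an assumption).
    if let Some(name) = profile_flag.or(profile_env) {
        return Ok(ScopeSource::Profile(name.to_string()));
    }
    // 3. Flat defaults, else an error.
    if has_flat_defaults {
        return Ok(ScopeSource::FlatDefaults);
    }
    Err("no scope resolved: name a server, cluster, or store, or select a profile".to_string())
}
```

Keeping this step separate from access-path derivation mirrors the RFC's split between choosing a scope and deciding how a verb reaches it.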
-
-### What moved vs today
-
-| Command(s) | Today (plane) | End state (capability) |
-|---|---|---|
-| `query`/`mutate`/`load`/`export`/`snapshot`/`branch`/`commit`/`schema show`/`schema apply` | Data | **`any`** (served-default; `--store` ad-hoc) |
-| `graphs list` | Data (remote-only) | **`served`** |
-| `queries list` | Session | **`served`** (catalog read) |
-| `init`/`optimize`/`repair`/`cleanup`/`schema plan`/graph-backed `lint` | Storage | **`direct`** (privileged) |
-| `queries validate` | Storage | **`control`** (catalog validation) |
-| `cluster *` | Control | **`control`** (unchanged) |
-| `policy *`/`embed`/`login`/`logout`/`config`/`version`/offline `lint --query --schema` | Session | **`local`** |
-| `ingest`; `--target`; `--cluster-graph`; `--uri http` dispatch | present | **removed** |
-| - | - | **added:** `profile { list \| show }` (read-only) |
-
-Cross-capability families: `schema` (`plan` is `direct`, `show`/`apply` are
-`any`), `queries` (`list` is `served`, `validate` is `control`), and `lint`
-(offline with `--schema` is `local`, graph-backed is `direct`) split per
-subcommand/mode, exactly where their authority and data dependencies differ.
-
-## Appendix: current CLI taxonomy (today)
-
-The **as-is** command surface this RFC transforms, kept so the RFC is
-self-contained. The source of truth is the exhaustive `command_plane` match in
-`crates/omnigraph-cli/src/planes.rs`.
-Where it disagrees with the design above (four planes, `--target`,
-`--cluster-graph`, scheme-inferred transport), the design is the *target* and this
-is *today*.
-
-### The four planes (today)
-
-| Plane | What it touches | Addressing accepted |
-|---|---|---|
-| **Data** | a graph, embedded **or** via a server | `` · `--target` · `--server` (+`--graph`) |
-| **Storage** | direct storage, no server | `` · `--target` (local/S3 only) · some also `--cluster`+`--cluster-graph` |
-| **Control** | a cluster *directory* | `--config ` |
-| **Session** | no graph | - |
-
-`--server`/`--graph` are gated strictly to the data plane; `guard_addressing`
-(`planes.rs:128`) rejects them elsewhere (RFC-010 Slice 1).
-
-### Command tree by plane (today)
-
-```
-omnigraph
-├─ DATA ────────── run against a graph; embedded or --server
-│  ├─ query (alias: read) · mutate (alias: change) · load · ingest (hidden, deprecated)
-│  ├─ branch { create | list | delete | merge } · snapshot · export · commit { list | show }
-│  ├─ graphs { list }                       (remote-only)
-│  └─ schema { show (alias: get) | apply }  ← show/apply are DATA
-├─ STORAGE ─────── direct file://|s3:// access; --server rejected
-│  ├─ init · optimize · repair · cleanup    (optimize/repair/cleanup also: --cluster --cluster-graph)
-│  ├─ lint (check shim) · schema plan       ← plan is STORAGE
-│  └─ queries validate
-├─ CONTROL ─────── cluster directory via --config
-│  └─ cluster { validate | plan | apply | approve | status | refresh | import | force-unlock }
-└─ SESSION ─────── no graph
-   ├─ policy { validate | test | explain } · embed · login / logout
-   ├─ config { migrate } · queries list     ← list is SESSION
-   └─ version (-v)
-```
-
-`read`/`change` are visible clap aliases (deprecated names, warn); `check` is an
-argv-shim → `lint`; `get` aliases `schema show`; `ingest` is hidden but runs.
-
-### Cross-plane families (today)
-
-- **`schema`**: `schema plan` is Storage; `schema show`/`apply` are Data.
-- **`queries`**: `queries validate` is Storage; `queries list` is Session.
- -### Addressing forms (today) - -| Form | Looks up in | Resolves to | Source | -|---|---|---|---| -| `` / `--uri` | nothing (explicit) | the literal URI | β€” | -| `--target ` | `omnigraph.yaml` `graphs:` | that graph's `uri` (local / S3 / **http**) | `config.rs::resolve_target_uri` | -| `--server ` (+`--graph`) | `~/.omnigraph/config.yaml` `servers:` | a remote server URL | `helpers.rs::resolve_server_flag` | -| `--cluster --cluster-graph ` | served cluster state | the graph's storage URI | `helpers.rs` (RFC-010 Slice 3) | - -Precedence (`resolve_target_uri`): explicit ``/`--uri` β†’ `--target` β†’ -`cli.graph` default β†’ error. `is_remote_uri` (`helpers.rs:15`) then selects -`GraphClient::Remote` vs `Embedded` (`client.rs:86`). - -### Enforcement points (today) - -- **`guard_addressing`** (`planes.rs:128`): `--server`/`--graph` on a non-data verb - fails with a declared message. -- **Storage-plane remote rejection** (`helpers.rs:467`): a storage verb whose - `--target` resolves to `http(s)://` is rejected. -- **`init` into a cluster layout** is refused (use `cluster apply`). - -## Audit comments - -Reviewed against the current CLI taxonomy, `planes.rs`, `cli.rs`, `helpers.rs`, -`client.rs`, RFC-007/RFC-010, and the user-facing CLI/server docs. - -### Validated - -- The target taxonomy now has a stable classifier: `any`, `served`, `direct`, - `control`, and `local` are all declared capabilities. -- Cluster scope is coherent: it is privileged direct storage for control, - maintenance, and validation, not a direct data path. `any` data verbs served by - default and reject cluster scope. -- Graph selection is no longer universal. Graph-scoped verbs select a graph; - scope-scoped verbs such as `graphs list`, `queries list`, `queries validate`, - and `cluster *` address the whole server/cluster scope. 
-- The current-state appendix still matches the implemented CLI: four planes, - `--target`, `--cluster-graph`, scheme-inferred transport, `schema plan` as - Storage, and `schema show/apply` as Data. - -Decisions and deferrals are tracked in [Decisions](#decisions) above β€” not -duplicated here. diff --git a/docs/dev/rfc-012-embedding-provider-config.md b/docs/dev/rfc-012-embedding-provider-config.md deleted file mode 100644 index 45083a2..0000000 --- a/docs/dev/rfc-012-embedding-provider-config.md +++ /dev/null @@ -1,295 +0,0 @@ -# RFC: Provider-Independent Embedding Configuration - -**Status:** Accepted β€” Phases 1-5 implemented -**Date:** 2026-06-15 -**Builds on:** the engine embedding client (`crates/omnigraph/src/embedding.rs`), the `@embed` catalog -annotation (`omnigraph-compiler/src/catalog`), the cluster `providers.embedding` surface -([cluster-config-specs.md](cluster-config-specs.md), [rfc-007-operator-config.md](rfc-007-operator-config.md) -for the secret-resolution pattern). -**Target release:** staged β€” NFR floor first, then the provider-independent config core; ingest-time `@embed` -execution is a separate later phase. - -## Summary - -OmniGraph's embedding subsystem is **hardwired to a single provider (Google Gemini)** and has no recorded -link between the model that produced a stored vector and the model that embeds a query string. Today that -happens to be self-consistent (one live client embeds both sides), but it is consistent by accident, not by -construction: the provider is hardcoded, the model is a moving `-preview` target, nothing validates that a -query vector and a stored vector share a space, and the one configurable knob (key + base URL) cannot change -the provider or model. - -This RFC makes embedding **provider-independent**: one resolved `EmbeddingConfig { provider, model, base_url, -api_key, dim, normalize }` behind a sealed provider abstraction, resolved once and shared by every embedder. 
-The **primary variant is OpenAI-compatible** β€” a single request/response shape (`POST {base}/embeddings`, -`{model, input, dimensions}`) that covers **OpenRouter** (the recommended default gateway, one key for Gemini, -OpenAI, Mistral, BGE, Qwen, sentence-transformers, …), OpenAI direct, and any self-hosted OpenAI-compatible -endpoint (vLLM, Ollama, LM Studio, Together). A native **Gemini** (`generativelanguage`) variant is retained -for shops that want to hit Google directly with its `RETRIEVAL_QUERY`/`RETRIEVAL_DOCUMENT` task-type -asymmetry, plus a deterministic **Mock**. The embedding *identity* (provider + model + dim) is recorded in the -schema IR so it travels with the data, and a query whose resolved embedder cannot match the stored vectors' -recorded identity is **rejected with a typed error instead of silently ranking across vector spaces.** -Provider/endpoint wiring lands on the already-reserved cluster `providers.embedding` field; secrets follow the -existing operator-credential pattern; no secret ever enters the schema. - -This RFC supersedes the framing in `docs/user/search/embeddings.md` that described "two embedding clients -with different defaults" β€” one of those clients was dead code with zero callers and has been removed (see -Phase 1); the OpenAI request shape returns as a first-class *provider variant* of the one client, not as a -second parallel client. - -## Motivation - -This work originated in an external handoff that reported a live cross-provider bug: gemini-3072 stored -vectors compared against OpenAI-1536 query vectors, silently. Investigation against the current source showed -the reported mechanism is **inaccurate** β€” the OpenAI client it blamed (`omnigraph-compiler/src/embedding.rs`) -was `pub(crate)`, `#![allow(dead_code)]`, and had **zero callers**; the live `nearest("string")` path and the -offline `omnigraph embed` CLI both use the engine **Gemini** client; and `@embed` does no ingest-time -embedding at all. 
So the documented happy path is self-consistent. But the investigation surfaced four real -problems the handoff's instincts correctly smelled: - -- **P1 β€” Provider is hardwired.** The one live client builds Google `generativelanguage` requests; only key + - base URL are configurable, not the provider or model. A non-Gemini shop cannot use `nearest("string")` - without a Gemini key, and cannot make it produce non-Gemini vectors. If they store their own vectors and - query with `nearest("string")`, the query is embedded with Gemini β†’ a silent cross-space ranking. This is - the handoff's failure, reached by a different cause. -- **P2 β€” A dead, divergent second client + stale docs** invited exactly the misdiagnosis the handoff made. -- **P3 β€” No same-space guarantee recorded with the data.** Nothing stamps which model/dim produced a stored - vector, so write-side and read-side embedders can drift with no validation. -- **P4 β€” `@embed` is declarative-in-name-only.** It records a source property for the typechecker but never - embeds at ingest; the docs claimed otherwise. - -Per the project's first principle, the lower-liability shape is **one provider-independent client with the -identity recorded next to the data**, not N independently-defaulted clients kept in lockstep by discipline. -Hardcoding one provider mortgages every future "we need OpenAI / a local model / Vertex" against a rewrite; -recording identity once closes the silent-wrong-results class by construction. 
-
-## Current state: which API we actually use
-
-| | Live engine client (`crates/omnigraph/src/embedding.rs`) | Deleted dead client (was `omnigraph-compiler/src/embedding.rs`) |
-|---|---|---|
-| Provider | **Google Gemini Developer API** (`generativelanguage`, *not* Vertex AI) | OpenAI |
-| Endpoint | `POST {base}/models/{model}:embedContent` | `POST {base}/embeddings` |
-| Auth | header `x-goog-api-key`, env `GEMINI_API_KEY` | `Authorization: Bearer`, env `OPENAI_API_KEY` |
-| Model | `gemini-embedding-2-preview` (hardcoded) | `text-embedding-3-small` (env `NANOGRAPH_EMBED_MODEL`) |
-| Base default | `https://generativelanguage.googleapis.com/v1beta` | `https://api.openai.com/v1` |
-| Request body | `{model, content:{parts:[{text}]}, taskType, outputDimensionality}` | `{model, input:[…], dimensions}` |
-| Response | `{embedding:{values:[f32]}}` | `{data:[{index, embedding:[f32]}]}` |
-| Task types | `RETRIEVAL_QUERY` / `RETRIEVAL_DOCUMENT` | none |
-| Status | **live**: used by `nearest("string")` and `omnigraph embed` | **removed in Phase 1** (zero callers) |
-
-Both shapes honour a requested output dimensionality (Gemini `outputDimensionality`, OpenAI `dimensions`)
-driven by the target column width, so dimension is already schema-driven. The two known shapes are exactly the
-two initial provider variants this RFC defines: the OpenAI shape returns from git history as a `Provider`
-variant of the single client.
-
-## Guide-level explanation
-
-### Configuring a provider (operator view)
-
-Pick a provider for the graph in `cluster.yaml` (the team-owned surface), referencing a secret by name.
The
-recommended default routes through OpenRouter (OpenAI-compatible, one key for many models):
-
-```yaml
-providers:
-  embedding:
-    default:
-      kind: openai-compatible            # openai-compatible | gemini | mock
-      base_url: https://openrouter.ai/api/v1
-      model: google/gemini-embedding-2   # or openai/text-embedding-3-large, mistralai/mistral-embed, …
-      api_key: ${OPENROUTER_API_KEY}
-graphs:
-  knowledge:
-    schema: knowledge.pg
-    embedding_provider: default
-```
-
-The same `openai-compatible` kind points at OpenAI direct (`base_url: https://api.openai.com/v1`,
-`model: text-embedding-3-large`) or a self-hosted endpoint (vLLM/Ollama/LM Studio) by changing `base_url`. Use
-`kind: gemini` only to reach Google's `generativelanguage` API directly (it keeps the query/document
-task-type asymmetry that the OpenAI-compatible shape does not expose). Dimensions are schema-driven by the
-target `Vector(N)` column, not duplicated in the provider profile.
-
-The zero-config tier keeps working with env only (`OMNIGRAPH_EMBED_PROVIDER`, `OMNIGRAPH_EMBED_BASE_URL`,
-`OMNIGRAPH_EMBED_MODEL`, and the provider api-key env: `OPENROUTER_API_KEY` / `OPENAI_API_KEY` /
-`GEMINI_API_KEY`), so no cluster file is required for a single-graph setup.
-
-### Recording identity in the schema
-
-`@embed` grows optional arguments that pin the embedding identity to the vector column:
-
-```pg
-node Doc {
-    slug: String @key
-    text: String
-    v: Vector(3072) @embed("text", model="gemini-embedding-2", dim=3072) @index
-}
-```
-
-The single-argument form `@embed("text")` keeps working unchanged. The recorded identity persists in the
-schema IR (`_schema.ir.json`) and so travels with `schema apply` and `schema show`.
-
-### What a mismatch looks like
-
-If the resolved read-side embedder cannot produce the recorded identity (wrong model, wrong dim, wrong
-provider), `nearest($v, "string")` fails with a typed error naming both sides, instead of returning a
-plausible-but-meaningless ranking.
Changing the recorded identity on an existing column is a loud schema-apply
-refusal (it is a re-embed, a deliberate migration step), reusing the migration planner's existing
-annotation-change rejection.
-
-## Reference-level design
-
-### One client, sealed provider abstraction
-
-Replace the two-variant `EmbeddingTransport` with a resolved config plus a sealed provider enum:
-
-```text
-EmbeddingConfig { provider: Provider, model, base_url, api_key, dim, normalize }
-enum Provider {
-    OpenAiCompatible, // POST {base}/embeddings, Bearer auth, {model, input, dimensions} → {data:[{embedding,index}]}
-                      // covers OpenRouter (default gateway), OpenAI direct, vLLM/Ollama/LM Studio/Together
-    Gemini,           // POST {base}/models/{model}:embedContent, x-goog-api-key, with RETRIEVAL_QUERY/DOCUMENT task types
-    Mock,             // deterministic, offline
-}
-struct EmbeddingClient { config, http, retry, deadline }
-```
-
-`Provider` owns the per-API differences (endpoint suffix, auth header, request JSON, response JSON, task-type
-support); the client owns retry/backoff, the deadline, normalization, and tracing, all of which are provider-independent.
-**OpenRouter is not a distinct variant**: it is `OpenAiCompatible` with `base_url =
-https://openrouter.ai/api/v1`, which is the point: one OpenAI-compatible shape gives provider-independence
-across every model OpenRouter fronts, so the gateway does the multi-provider fan-out and OmniGraph carries one
-request shape. The native `Gemini` variant exists only for direct-to-Google with task-type asymmetry. An enum
-(not a trait) is the earned complexity for this small, first-party set; if third-party plug-in providers are
-ever needed, the enum becomes a trait behind the same `EmbeddingConfig` surface without touching callers.
-
-The OpenAI-compatible `input` accepts an **array**, giving batch embedding for free, which the later
-ingest phase needs for throughput, and which removes the open dependency on Gemini's native
-`batchEmbedContents`.
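
To make the "`Provider` owns the per-API differences" split concrete, here is a small sketch of how the endpoint suffix and auth header could hang off the enum. It uses only the shapes stated above (`{base}/embeddings` with Bearer auth, `{base}/models/{model}:embedContent` with `x-goog-api-key`); the `RequestShape` struct and the `request_shape` / `supports_task_types` method names are illustrative assumptions, not the engine's API, and JSON bodies are left out.

```rust
/// The sealed provider set described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    OpenAiCompatible,
    Gemini,
    Mock,
}

/// URL plus auth header for one embed call (hypothetical helper type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RequestShape {
    pub url: String,
    pub auth_header: (String, String),
}

impl Provider {
    /// Per-provider endpoint and auth. Returns `None` for `Mock`, which is
    /// deterministic and offline, so it issues no HTTP request at all.
    pub fn request_shape(self, base_url: &str, model: &str, api_key: &str) -> Option<RequestShape> {
        // Tolerate a trailing slash on the configured base URL.
        let base = base_url.trim_end_matches('/');
        match self {
            Provider::OpenAiCompatible => Some(RequestShape {
                url: format!("{base}/embeddings"),
                auth_header: ("Authorization".to_string(), format!("Bearer {api_key}")),
            }),
            Provider::Gemini => Some(RequestShape {
                url: format!("{base}/models/{model}:embedContent"),
                auth_header: ("x-goog-api-key".to_string(), api_key.to_string()),
            }),
            Provider::Mock => None,
        }
    }

    /// Only the native Gemini shape carries the query/document task-type asymmetry.
    /// Treating `Mock` as task-type-free is an assumption of this sketch.
    pub fn supports_task_types(self) -> bool {
        matches!(self, Provider::Gemini)
    }
}
```

Under this shape, pointing at OpenRouter is just `Provider::OpenAiCompatible` with a different `base_url`, which is why the text does not give the gateway its own variant.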
- -### Config resolution (resolved once, shared) - -Precedence, highest first for served cluster graphs: applied cluster `providers.embedding.` profile β†’ -env (`OMNIGRAPH_EMBED_*`, provider api-key env) β†’ built-in defaults. The cluster `api_key` value is a -`${NAME}` env reference resolved at server boot; plaintext never lives in the schema, state ledger, or any -checked-in file. Resolution happens once per graph handle; the resolved client is shared by -`nearest("string")`. Direct single-graph serving, embedded callers, and the offline CLI keep the env path -unless they inject an `EmbeddingConfig` directly. - -### Identity recorded in the schema IR (not a new store) - -The `@embed` args serialize into `PropertyIR.annotations` β†’ `_schema.ir.json`, which `schema apply` already -persists atomically and which the catalog (the one thing `nearest()` reads at query time) is built from. No -new metadata store, no manifest column, no extra read on the query path. The migration planner already rejects -non-description annotation changes as `UnsupportedChange`, so "recorded identity is immutable without a -deliberate re-embed migration" is the default behaviour, not new code. (A second, optional copy in Lance -field metadata β€” co-located with the vectors β€” is available later by activating the currently no-op -`UpdatePropertyMetadata` migration step; out of scope here.) - -### Query-time validation - -`resolve_nearest_query_vec` compares the resolved read-side identity against the column's recorded identity -before embedding; on mismatch it returns a typed `OmniError` naming recorded vs resolved (model, dim, -provider). This is the only behaviour that closes P3 by construction. - -### NFR floor (independent of the provider work) - -- **Deadline:** wrap every embed call (query or document) in a total-operation deadline - (`OMNIGRAPH_EMBED_DEADLINE_MS`) so a degraded provider cannot hang the caller for the current ~121 s worst - case (4 Γ— 30 s timeout + backoff). 
-- **Observability:** `tracing` span per embed call (provider, model, dim, attempts, outcome, elapsed; `warn!`
-  per retry; token usage when the provider returns it). The subsystem has zero instrumentation today.
-- **Single normalization:** one `normalize_vector` (the dead client carried a divergent second copy; removed
-  in Phase 1).
-- **Stable model:** make the model configurable and default to a stable (non-`-preview`) model once the GA
-  name is confirmed.
-
-### Ingest-time `@embed` (later phase, not this RFC's core)
-
-Making `@embed` embed at ingest is a separate phase with a hard constraint: embedding is a slow, external,
-**non-idempotent** side effect, so it must run **entirely before staging** (in the pure in-memory phase,
-before any `stage_*`/Lance HEAD move, alongside the existing constraint validation) so a mid-load provider
-failure aborts with zero drift. It must never sit inside or after the commit protocol, because the recovery
-sweep cannot re-run or undo an external embedding. It also needs a content-hash skip (so `load --mode
-overwrite` does not re-bill every row), batching, and a bounded-concurrency stage. Specified here only to fix
-the design constraint; deferred to its own RFC/phase.
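
As a rough illustration of the content-hash skip and batching mentioned above, the sketch below plans which rows still need an embed call before anything is staged. Everything here is hypothetical (`content_hash`, `plan_embed_batches`, the `(key, text)` row shape, the stored-hash map); in particular `DefaultHasher` is used only to keep the example dependency-free, and a real implementation would presumably need a hash that is stable across builds before persisting it.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hash of the source text that an embedding was computed from.
/// Illustrative only: DefaultHasher output is not guaranteed stable across Rust releases.
pub fn content_hash(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}

/// Pure, in-memory planning step: pick the rows whose text is new or changed
/// since the last embed, and group their keys into provider-sized batches.
/// No I/O happens here, so it can run entirely before any staging.
pub fn plan_embed_batches<'a>(
    rows: &'a [(String, String)],   // (row key, source text)
    stored: &HashMap<String, u64>,  // row key -> hash recorded at the last embed
    batch_size: usize,
) -> Vec<Vec<&'a str>> {
    let pending: Vec<&'a str> = rows
        .iter()
        .filter(|(key, text)| stored.get(key) != Some(&content_hash(text)))
        .map(|(key, _)| key.as_str())
        .collect();
    // Guard against a zero batch size, which `chunks` would reject.
    pending
        .chunks(batch_size.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

An unchanged row produces no batch entry, so re-running an overwrite load over identical text would issue no provider calls in this model; bounded concurrency over the resulting batches is left out of the sketch.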
-
-### Phasing (implementation order)
-
-| Phase | Scope | Demo |
-|---|---|---|
-| **1: NFR floor + dead-client removal** | deadline, observability, single normalize, configurable model, delete dead client + `NANOGRAPH_*` | a hung provider fails at the deadline; embed calls traced; `rg NANOGRAPH_` empty |
-| **2: Provider-independent config** | `EmbeddingConfig` + `Provider` enum (OpenAiCompatible covering OpenRouter/OpenAI/local, Gemini, Mock); env-first resolution; client reuse | point `base_url` at OpenRouter, run `nearest("string")`, get correct neighbours vs OpenRouter-stored vectors; CLI shares the config |
-| **3: Record identity in schema IR** | `@embed` args grammar + catalog + IR persistence | `schema show` reflects recorded model/dim |
-| **4: Query-time validation** | compare resolved vs recorded; typed error; planner refusal on identity change | stored model A vs read model B → loud error, never silent garbage |
-| **5: Cluster provider wiring** | `providers.embedding` resources; `graphs..embedding_provider`; `${NAME}` resolution at server boot | provider profile resolved from applied cluster state; legacy `omnigraph.yaml` untouched |
-| later | ingest-time `@embed` (Shape C) | separate RFC |
-
-**Status:** Phases 1-5 are implemented (`@embed("…", model="…")` is recorded in the schema IR and validated at
-query time with a typed same-space error; an unrecorded `@embed` keeps working with no check; cluster-served
-graphs can bind an applied `providers.embedding` profile). Ingest-time `@embed` remains.
-
-## Invariants & deny-list check
-
-- **Invariant 9 (integrity failures are loud):** strengthened: query-time identity mismatch becomes a typed
-  error instead of silent wrong results.
-- **Invariant 10 (query semantics are first-class IR concepts):** embedding identity becomes IR/catalog data,
-  not an out-of-band env guess.
-- **Invariant 11 (transport stays at the boundary):** strengthened β€” Phase 1 removes the HTTP client + async - runtime (`reqwest`, `tokio`) from `omnigraph-compiler`, whose own manifest advertises "Zero Lance - dependency"; the embedding HTTP client lives only in the engine. -- **Invariant 12 / secret handling:** api-keys resolve through the existing credential chain; never in schema - or checked-in config. -- **Invariant 13 (bounded & observable):** addressed β€” the deadline bounds latency; tracing makes the - subsystem observable. -- **Deny-list β€” "silent fallback / dropped rows":** the cross-space ranking is exactly a silent-wrong-result; - this RFC closes it. -- **Deny-list β€” "new write paths that advance Lance HEAD before manifest publish without a recovery - sidecar":** the ingest phase (deferred) explicitly keeps embedding *before* staging, so it does not create a - new HEAD-advancing write path. No invariant is weakened. - -## Drawbacks & alternatives - -- **Do nothing.** The happy path works today, so the live risk is narrow (P1 + P3). But the provider hardwiring - and missing validation are a latent silent-wrong-results class that bites the first non-Gemini user. -- **Interim env-only provider switch (no schema record).** Cheaper, but leaves the same-space guarantee to - operator discipline (fails P3). Folded in as Phase 2's env-first resolution, with Phases 3–4 adding the - record/validate guarantee. -- **Trait-based provider plug-ins now.** Rejected as unearned complexity for two first-party providers; the - enum upgrades to a trait behind the same surface if needed. -- **Stamp identity in the manifest or Lance field metadata instead of the IR.** The manifest is the wrong - granularity; field metadata needs net-new wiring and a query-path dataset open. The IR is where `@embed` - already lives and is already read at query time (see spike). - -## Reversibility - -Mostly reversible. 
Phases 1–2 and 5 are code/config (env, CLI, cluster keys) and cheap to undo. Phase 3 -(recording identity in the schema IR) is **near-permanent** β€” it changes the on-disk `_schema.ir.json` shape -and the schema hash β€” so it earns the most scrutiny: the single-arg `@embed` form stays byte-compatible, and -recorded identity is additive (absent identity = today's behaviour). Provider request/response shapes are -external API contracts, not our format, so adding providers is reversible. - -## Gateway tradeoff (OpenRouter) - -Routing through OpenRouter (the default) buys provider-independence with one key and one billing relationship, -batch input, and access to the GA `google/gemini-embedding-2`. Costs to accept, all controllable: - -- **Extra network hop** β†’ more query-path latency. The Phase-1 deadline bounds it; the cache mitigates repeats. -- **Text transits a third party.** OpenRouter's `provider: { data_collection }` routing preference controls - retention; shops with strict residency requirements use `kind: gemini`/`openai-compatible` pointed at the - provider (or a self-hosted endpoint) directly instead of the gateway. Provider-independence means this is a - config change, not a code change. -- **Loses Gemini's task-type asymmetry** when Gemini is reached via the OpenAI-compatible gateway (both sides - embed symmetrically). This is a retrieval-quality cost, **not** a same-space correctness cost β€” both stored - and query vectors take the identical path, so they stay in one space by construction. Shops that want the - asymmetry use `kind: gemini`. - -## Unresolved questions - -- GA Gemini model name β€” **resolved:** `google/gemini-embedding-2` (via OpenRouter) / `gemini-embedding-2` - (direct), 128–3072 dims (recommended 768/1536/3072). Default flips off `-preview` in Phase 2. 
-- Gemini `batchEmbedContents` availability β€” **moot** when going through the OpenAI-compatible gateway (its - `input` array batches); still relevant only for the direct `kind: gemini` path. -- Identity granularity: per-vector-property args vs one graph-level default profile referenced by name. -- Whether to backfill recorded identity for existing graphs, or treat absent-identity as "unvalidated, legacy" - permanently. -- Default model for the zero-config tier: `google/gemini-embedding-2` vs `openai/text-embedding-3-large` - (both 3072-capable) β€” pick the project default. diff --git a/docs/dev/rfc-013-write-path-latency.md b/docs/dev/rfc-013-write-path-latency.md deleted file mode 100644 index 53f6430..0000000 --- a/docs/dev/rfc-013-write-path-latency.md +++ /dev/null @@ -1,1479 +0,0 @@ -# RFC-013: Write-path latency β€” capture-once `WriteTxn`, manifest-authoritative publish, bounded history, and a measured cost contract - -**Status:** Proposed -**Author(s):** write-path latency investigation (handoff + multi-agent validation) -**Date:** 2026-06-19 -**Audience:** engine / storage maintainers -**Builds on:** -[rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) (`GraphClient` β€” embedded ≑ remote), -the query-latency work (PR #268, read-path warm-up β€” the read-side twin of this change), -the iss-991 handoff (manifest-authoritative graph lineage / Phase 7), -[writes.md](writes.md), [execution.md](execution.md), [invariants.md](invariants.md). -**Tracking (dev graph `modernrelay`):** primary `iss-write-s3-roundtrip-amplification`; depth term `iss-991`; substrate seam `iss-863`/`iss-864`; branch-create `iss-691`; recovery `iss-856`/`iss-recovery-sweep-live-writer-rollback`/`iss-merge-recovery-partial-rollforward`; MemWAL `iss-681`; read twin `gap-read-path-rederivation`. - -> Status maintained by maintainers: `Proposed` while open, `Accepted` on merge. 
- ---- - -## Summary - -On object-store-backed clusters a single trivial write (one edge, one branch op) -issues **hundreds of mostly-sequential object-store round-trips**, and that count -**grows without bound with the graph's commit history**, so a long-lived graph -degrades to minutes per edge. The cost is invisible on a local filesystem -(Β΅s/call) and to correctness tests (results are right, just slow), and it was -never measured because nothing in the suite counts *object-store round-trips per -logical operation*. - -This RFC specifies the optimal write path from first principles β€” **a write is a -pure function of one version-pinned snapshot, published in a single -manifest-atomic CAS** β€” and the **cost contract that makes its O(1)-in-history -guarantee provable and non-regressable** (deterministic IO-counted tests on every -PR). It collapses four hand-rolled writers into one `GraphPublishAuthority`, -moves graph lineage into the manifest (so the per-write `_graph_commits` scan -disappears), brings the internal metadata tables into compaction (so the -per-write `__manifest` scan stops growing), takes recovery off the hot path, and -adds an epoch fence for multi-writer safety. None of it is a substrate rewrite β€” -the manifest-CAS model is already correct and is exactly what Lance native -multi-table transactions (lance#7264) will later formalize; this RFC builds the -seam to that future and pays down the write path onto it. - -**The dominant fix is demonstrated, not proposed:** a one-line opener-bypass -prototype (open writes direct-by-URI instead of through the lance-namespace builder) -flattens the depth-dominant term `31 + 12Β·depth β†’ flat 4` and cuts a depth-80 edge -**2.7Γ—** (1618 β†’ 593 ops), measured end-to-end and functionally correct on -main/branch/node paths (Β§2.4). It is shippable as a standalone PR first (Β§9 step -3a); the rest of the RFC is the constant-factor + correctness + internal-residual -work layered on the same seam. 
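
One way to picture the "O(1)-in-history, provable on every PR" idea is a slope check over (commit depth, op count) samples. The sketch below is only an illustration of that kind of assertion, not the RFC's cost gate: the function names and the tolerance are assumptions, and the sample totals used in the usage note are the ones reported in the §0(b) measurement table.

```rust
/// Least-squares slope of object-store ops against commit depth.
/// Returns `None` when there are too few samples or no spread in depth.
pub fn ops_per_depth_slope(samples: &[(f64, f64)]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean_x = samples.iter().map(|(x, _)| x).sum::<f64>() / n;
    let mean_y = samples.iter().map(|(_, y)| y).sum::<f64>() / n;
    // Centered sums keep a perfectly flat series at an exact zero slope.
    let sxy: f64 = samples.iter().map(|(x, y)| (x - mean_x) * (y - mean_y)).sum();
    let sxx: f64 = samples.iter().map(|(x, _)| (x - mean_x).powi(2)).sum();
    if sxx == 0.0 {
        None
    } else {
        Some(sxy / sxx)
    }
}

/// A write path is "flat in history" when ops grow by at most `tolerance` per commit.
pub fn is_flat_in_history(samples: &[(f64, f64)], tolerance: f64) -> bool {
    ops_per_depth_slope(samples).map_or(false, |slope| slope.abs() <= tolerance)
}
```

Fed the measured totals (156, 268, 358, 538, 898, 1618 ops at depths 0, 5, 10, 20, 40, 80), the slope comes out at roughly 18 ops per commit, so the check fails; a series that stays at a constant op count passes. Note that a slope check alone would not notice a large constant term, which is why the RFC pairs it with fixed-count assertions.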
- -**Correction (2026-06-20/21) β€” the latency metric is `(serial_hops + ops / -effective_concurrency) Β· RTT + compute`, measured [M].** Two findings, both from the -deployed edge binary (steps 1+3a landed) on rustfs behind a latency+concurrency proxy: -**(i)** under *unlimited* concurrency, wall-clock is a **~110-hop serial backbone, -depth-invariant** β€” the depth-driven ops parallelize away (Β§0(c)); but **(ii)** under -an **R2-realistic concurrency cap (8)**, the internal-table fragment scan can no longer -fan out, so **op count re-enters wall-clock** and an uncompacted graph *runs away* -(per-write ops 1273β†’3505, wall 6β†’16s and climbing) β€” while #291's internal-table -compaction cuts it ~6Γ— and bounds it (Β§0(d) A/B). So the design is **vindicated and -unchanged** (Β§3/Β§4.1: capture-once `WriteTxn` + parallel stages β†’ "~2–3 hops" is the -**serial-backbone** lever, step 3b; bounded history is the **op-count** lever, step 2a) -β€” what's corrected is the *measurement framing and step sizing*: op count was the wrong -latency proxy **only because the harness had unlimited concurrency**; on a capped store -both `serial_hops` (β†’ step 3b) and `ops` (β†’ step 2a) are on the critical path, and -which dominates is set by `effective_concurrency Γ— fragment_count`. The cost gate -(Β§5.1) is corrected to inject a **concurrency cap *and* latency**, and to assert serial -hops *and* op-count-flat-in-history. - ---- - -## 0. Validation ledger (read this first) - -Every claim is tagged: **[M]** measured by me this cycle, **[S]** verified in -v0.7.0 source (`file:line` given), **[U]** verified against upstream -Lance/LanceDB/SlateDB source or docs, **[G]** tracked in the dev graph (slug -given), **[I]** inferred/reasoned. - -A correction from the originating handoff: it hypothesized that **Cloudflare R2 -walks the full manifest listing on every open** (a prod-only amplifier absent -from AWS). 
**This is false for the pinned Lance 7.0.0 [U].** R2 is treated as
-lexically ordered (`list_is_lexically_ordered = !is_s3_express`,
-`lance-io/.../providers/aws.rs:183`), so R2 gets the O(1) head-only manifest fast
-path, same as AWS; only S3-Express buckets are excluded, and even those are O(1)
-via the v7 `latest_version_hint.json`. There is no R2-list config fix because
-there is no R2-list problem.
-
-**The depth term: corrected attribution.** Two measurements, one
-instrumentation-blind, one complete:
-
-*(a) IOTracker probe [M], internal tables only.* A throwaway probe (the
-`warm_read_cost` harness applied to a single insert to `main`, swept across
-commit depth) counted the two internal tables: `__manifest` ≈ 14 + 2·depth,
-`_graph_commits` ≈ 9 + 2·depth → ≈ 23 + 4·depth, `write_iops = 1`. **But this
-probe is structurally blind to the write path's per-table *data* opens**: they
-bypass the instrumented opener (`table_wrapper`), so it reports `probes=0` for the
-data tables. It measured the *minority* of the cost.
-
-*(b) Network-proxy measurement [M], all RPCs, fresh graph.* A counting proxy in
-front of `rustfs` (sees every object-store RPC, under `--mode merge`, the
-production path), on a brand-new graph (400 seed nodes, one committing merge per
-checkpoint), classified by S3 key:
-
-| commit depth | data `_versions` | `__manifest` | `_graph_commits` | node (RI) | schema | TOTAL | `write_iops` |
-|---:|---:|---:|---:|---:|---:|---:|---:|
-| 0 | 31 | 29 | 13 | 6 | 46 | 156 | 1 |
-| 5 | 121 | 44 | 23 | 6 | 46 | 268 | 1 |
-| 10 | 181 | 59 | 33 | 6 | 46 | 358 | 1 |
-| 20 | 301 | 89 | 53 | 6 | 46 | 538 | 1 |
-| 40 | 541 | 149 | 93 | 6 | 46 | 898 | 1 |
-| 80 | 1021 | 269 | 173 | 6 | 46 | 1618 | 1 |
-
-Slopes: **data table +12/depth (~67%)**, `__manifest` +3/depth, `_graph_commits`
-+2/depth → **TOTAL ≈ 156 + 18·depth**, `write_iops` flat at 1.
The IOTracker probe
-(a) saw only the +4/depth internal subset, blind to the data-table opens, the
-dominant ~67%.
-
-**Constant-factor finding [M]: the schema contract is a flat 46 reads/write**, not
-depth-scaling, but **29% of the depth-0 cost (46/156)**, from
-`validate_schema_contract` re-running uncached on every resolve (`omnigraph.rs:561`).
-A depth-slope gate will *not* catch it; WriteTxn's resolve/validate-once kills it,
-and the §5.1 fitness assert (`validate_schema_contract_calls == 1`) is what pins it
-(constant-factor delta, §6).
-
-The dominant term is **the written table's open routed through the lance-namespace
-builder ~13× per write**, now source-traced. The **write** path opens via
-`DatasetBuilder::from_namespace` (`namespace.rs:174`, from `open_table_head_for_write`
-`table_store.rs:181` / `namespace.rs:544`). Lance's builder calls the namespace's
-`describe_table` once and uses only `response.location` (`lance` `builder.rs:130-178`),
-but omnigraph's `describe_table` **opens the whole dataset** just to produce that
-location (`open_head` → `Dataset::open`, `namespace.rs:362`/`:112`), and `.load()`
-then **resolves the latest version again**: a **double latest-resolution per
-open**, ~13× per write, nothing cached. Crucially, latest-resolution is **not
-inherently O(depth)**: the namespace path is O(depth) because it **misses the V2
-lexical / `latest_version_hint.json` fast path** that the direct opener engages
-(most likely because `load_table_from_namespace` attaches no shared `Session`/store
-params, `namespace.rs:174`; inferred, not traced). The **read** path skips all of
-it (`from_uri(location).with_version(N)`, one HEAD, O(1)), which is why reads are
-flat while writes pay +12/depth on the data table (§0(b)).
**Proven on omnigraph's own table [M]:** -a direct `Dataset::open` of the *same physical* 85-version edge table = **2 ops -(O(1))**, the `from_namespace` open of that identical table = the O(depth) sweep β€” -same bytes, two open paths. `checkout_version` is also O(1) β€” **exonerated**, not a -back-walk. So `from_uri().with_version(N)` is the O(1) primitive and step 3 makes -each open O(1) *intrinsically* (cleanup then becomes hygiene/interim, not -load-bearing for read cost β€” Β§2.3). **Mode-independent [U]:** `append` ≑ `merge` ≑ +12/depth, so Β§0(a) -measuring a single insert was *not* the defect β€” the defect is the namespace open -path, not the verb. **Using `from_namespace` per-open is a misuse of Lance's -design** (the namespace is a catalog/discovery layer β€” resolve once, then open the -dataset directly, `lance-namespace` `operations/index.md` **[U]**); the read path -already bypasses it (PR #268 Fix 2 β€” see Β§2.4). - -**Corrected conclusion.** The depth blow-up is in omnigraph's DB layer and is -**data-table-dominated**: the redundant per-table opens (fixed by Β§9 step 3 β€” -WriteTxn open-once-by-pinned-version β€” plus scheduled *version cleanup* of the -node/edge tables) are ~70% of it; the uncompacted internal tables (Β§9 step 2) are -the secondary ~30%. Both the originating R2 hypothesis and the earlier "entirely -the internal tables" framing are wrong. The exact Lance call doing the data-table -chain re-read (`checkout_version` back-walk vs merge-insert conflict replay) is the -one unpinned item β€” see Β§12. Reads, by contrast, are flat in depth -(`warm_read_cost.rs`, PR #268). This is the O(history)-per-write β†’ -O(NΒ²)-cumulative behavior the production incident hit. - -**(c) Serial-hop measurement [M] β€” wall-clock is set by the serial backbone, not -the op count.** Β§0(b) counts *total* object-store ops; wall-clock is set by the ops -on the *critical path*. 
Measured on the **deployed edge binary `f6d2cc03`** (steps
-1+3a landed) via rustfs + a per-op latency proxy, sweeping injected per-op latency `L`
-and reading the slope of `wall = compute + serial_hops · L` (the slope **is** the
-critical-path hop count; the proxy also reports request overlap → parallelism):
-
-| depth | total ops | parallelism | **serial backbone (slope)** | `L=0` wall (compute floor) |
-|---:|---:|---:|---:|---:|
-| ~1 | 107 | 1.0–1.2 | **~109** | 2.15s |
-| ~33 | 338 | 3.4–4.0 | **~108** | 2.45s |
-| ~85 | 716 | 6.0–7.1 | **~113** | 4.27s |
-
-The serial backbone is **~110 hops and depth-INVARIANT**, while total ops grow
-`+~7/depth` (107→716, the §0(b) term) **and parallelize** (parallelism 1→6,
-`max_inflight` up to 65), so the depth-driven ops add almost nothing to wall-clock.
-`wall ≈ 110·RTT + compute`; the prod 35s direct-main write ≈ 110 hops × ~280ms
-cross-region RTT. Branch ops measured the same way (4-table graph; prod = 217 tables,
-≈50× worse): **branch-create serial ~77, branch-delete ~87** (op counts scale with
-table count → §9 step 6), and **branch-WRITE is worst: 1777 ops, serial ~258, 21s
-compute floor even at `L=0`** = fork-on-first-write (the path 3a did *not* cover; §9
-step 3b + the fork seam), matching prod's 103–138s.
-
-**The methodological correction this forces.** *Op count is a cost/space/compute-floor
-metric; the serial-hop count (latency slope / `num_stages`) is the wall-clock metric.*
-3a's real 90s→35s win (≈2.6×, matching its measured 2.7× op cut) is genuine **because
-it removed *serial* hops** (the per-table data opens were on the critical path). But
-the wall-clock predictor is not serial-hops *alone*; it is
-**`(serial_hops + ops / effective_concurrency) · RTT + compute`**: total op count
-re-enters wall-clock whenever the store **caps concurrency**, because the parallel
-tail can no longer fan out.
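
To make the predictor concrete, here is a minimal Rust sketch of `(serial_hops + ops / effective_concurrency) · RTT + compute`. The struct and function names are hypothetical (they are not omnigraph APIs). The inputs in `main` are round numbers borrowed from the measurements above (about 110 serial hops, 716 total ops, ~280 ms RTT) with the compute term set to zero, so the printed values are model predictions under those assumptions, not measurements.

```rust
/// Inputs to the wall-clock predictor. All names are illustrative.
#[derive(Debug, Clone, Copy)]
struct WriteCost {
    serial_hops: f64,
    total_ops: f64,
    /// Effective store concurrency; `f64::INFINITY` models an uncapped store.
    effective_concurrency: f64,
    rtt_secs: f64,
    compute_secs: f64,
}

/// (serial_hops + ops / effective_concurrency) * RTT + compute
fn predicted_wall_secs(c: WriteCost) -> f64 {
    // With infinite concurrency the parallel tail is 0.0 and only the backbone remains.
    let parallel_tail = c.total_ops / c.effective_concurrency;
    (c.serial_hops + parallel_tail) * c.rtt_secs + c.compute_secs
}

fn main() {
    let uncapped = WriteCost {
        serial_hops: 110.0,
        total_ops: 716.0,
        effective_concurrency: f64::INFINITY,
        rtt_secs: 0.28,
        compute_secs: 0.0,
    };
    // Same write, but the store admits only 8 requests in flight.
    let capped = WriteCost { effective_concurrency: 8.0, ..uncapped };

    println!("uncapped: {:.1} s", predicted_wall_secs(uncapped));
    println!("cap = 8:  {:.1} s", predicted_wall_secs(capped));
}
```

The only difference between the two inputs is `effective_concurrency`, which is the point of the correction: the same op count is free on an uncapped store and costs wall-clock on a capped one.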
-
-**(d) The concurrency-cap A/B [M]: proves op count *is* wall-clock on a capped store,
-and that step 2a is a primary latency lever (not a parallel afterthought).** §0(c) was
-measured on **rustfs with unlimited concurrency** (`max_inflight` reached **129**), a
-poor proxy for R2, which is connection-capped and rate-limited. Re-running the same
-write through a proxy capped at **8 concurrent** (R2-realistic), with internal-table
-**fragment count as the only variable** (edge binary for writes; the unmerged #291
-binary only to run `optimize`), depth ~130, `__manifest`≈137 fragments:
-
-| state | per-write ops | wall (cap=8, L=20) | trend |
-|---|---:|---:|---|
-| **uncompacted** (`__manifest` 137 frags) | 1273 → 1487 → **3505** | 5.9 → 8.4 → **16.4 s** | **runaway**: each write reads all frags **and appends one more** |
-| **after #291 `optimize`** (137→1 frag) | 275 → 250 → **197** | 6.2 → 5.4 → **3.8 s** | **bounded** |
-
-`optimize` collapsed `__manifest` 137→1, `_graph_commits` 140→1 frags → **~6× fewer
-ops/write and the runaway stopped.** Under unlimited concurrency this delta vanishes
-(the frags fan out); under the cap it is the dominant term. **This is the actual
-mechanism of the prod 35s and its degradation over time** (the `O(N²)` of §0/§2.2):
-on a capped store, every uncompacted write scans all `__manifest`/`_graph_commits`
-fragments *and adds one*, so latency climbs with graph age, exactly what prod shows,
-and exactly what step 2a halts. Prod confirms the scale: `__manifest` 1,739 obj /
-59 MiB, `_graph_commits` 1,848 obj / 23.5 MiB, read per write, **uncompacted** (the
-deployed `f6d2cc03` optimize is node/edge-only, per §9 step 2, so an operator `optimize`
-run on prod cannot touch them; only #291 can).
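
The runaway mechanism (every uncompacted write scans all fragments and then appends one) can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: one internal table, one fragment read per fragment per write, and a hypothetical `compact_every` schedule that collapses the table to a single fragment. It models the growth shape only, not omnigraph's actual op counts.

```rust
/// Toy model: each write scans every existing fragment, then appends one.
/// `compact_every = Some(k)` collapses the table to 1 fragment after every k-th write.
/// Returns (total fragment reads across all writes, final fragment count).
fn simulate_writes(writes: u64, compact_every: Option<u64>) -> (u64, u64) {
    let mut frags: u64 = 1;
    let mut total_reads: u64 = 0;
    for i in 1..=writes {
        total_reads += frags; // scan all fragments of the latest version
        frags += 1; // the write appends one more fragment
        if let Some(k) = compact_every {
            if k > 0 && i % k == 0 {
                frags = 1; // compaction rewrites everything into a single fragment
            }
        }
    }
    (total_reads, frags)
}

fn main() {
    let (uncompacted_reads, uncompacted_frags) = simulate_writes(100, None);
    let (compacted_reads, compacted_frags) = simulate_writes(100, Some(10));
    println!("uncompacted: {uncompacted_reads} reads, {uncompacted_frags} frags");
    println!("compact/10:  {compacted_reads} reads, {compacted_frags} frags");
}
```

Without compaction the cumulative reads grow quadratically in the number of writes; with a periodic compaction the per-write cost is bounded by the schedule interval.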
-
-**Corrected conclusion.** The §2.4 op-count math (`1720→198 ⇒ 258s→30s`) is still
-wrong *as stated* (it assumes full serialization), but the opposite over-correction,
-"step 2 is parallel, so irrelevant to latency", is **also wrong**, and an artifact of
-the unlimited-concurrency harness. The truth is **concurrency-dependent**: on a capped
-store (R2) the internal-scan op count *is* on the critical path and **step 2a is a
-primary latency lever and the anti-runaway fix**; the residual after compaction
-(~4 s here, mostly compute + the serial backbone) is then **step 3b**'s. Both are
-load-bearing; which dominates is set by `effective_concurrency × fragment_count`. So
-the cost gate (§5.1) must inject a **concurrency cap**, not just latency.
-
----
-
-## 1. Problem & measurements
-
-On object storage every call is a 10–100 ms RPC, there is no cheap stat, and
-sequential RPCs serialize. A long-lived production graph on R2, originating handoff
-**[M]**:
-
-| operation | prod (R2) | local `file://` |
-|---|---|---|
-| one-edge `load --mode merge` → main | ~3 min (90 s workflow timeout) | <1 s |
-| `branch create --from main` | 120 s | <1 s |
-| one-row `load` → a branch | 204 s | <1 s |
-| `branch delete` | 216 s | <1 s |
-| warm read / `/healthz` | fast (0.2–2 s) | fast |
-
-`iss-write-s3-roundtrip-amplification` **[G]** independently records the same:
-cross-region single insert ~46 s, 5-node mutation ~110 s, vs ~390 ms for a
-no-storage `/healthz`. Its acceptance criteria are this RFC's goal: *"a single
-insert issues O(1)-to-few S3 round-trips, not O(number of tables); bulk mutations
-amortize the manifest commit."*
-
-The cost decomposes into terms; the dominant one scales with history (§0):
-
-1.
**Per-table opens through the O(depth) lance-namespace builder (DOMINANT,
-   O(tables × depth)).** Each stage opens via `DatasetBuilder::from_namespace`
-   (`namespace.rs:174`); its `describe_table` opens the whole dataset just to return
-   a location (`open_head` → `Dataset::open`, `namespace.rs:362`/`:112`) and
-   `.load()` resolves latest **again**: a double latest-resolution per open,
-   O(depth) on the repro store, ~13× per write with nothing caching it **[S]**
-   (§2.2). The read path's direct `from_uri().with_version(N)` is O(1). →
-   **+12 reads/depth, ~70% of the slope [M]**. Fixed by opening once, by pinned
-   version via the direct opener (§9 step 3); node/edge version *cleanup* bounds it
-   further.
-2. **Per-write `__manifest` scan (O(history), secondary).** Every publish
-   full-scans the uncompacted `__manifest` (`load_publish_state` →
-   `read_manifest_scan`, `state.rs:133-141`) **[S]**; the internal tables are
-   never compacted/cleaned (`optimize` iterates node/edge only,
-   `optimize.rs:895-904`) **[S]**. +3 reads/depth **[M]**.
-3. **Per-write `_graph_commits` refresh (O(history), secondary).**
-   `record_graph_commit` reloads the entire commit cache before each append
-   (`commit_graph.rs:136-164`) **[S]**; never compacted/cleaned. +2 reads/depth
-   **[M]**. The "read-path anti-pattern, now on writes" (`iss-991` handoff **[G]**).
-
-Terms 2+3 are the secondary ~30%; term 1 dominates. Plus per-write fixed taxes:
-a `list_dir("__recovery/")` (`loader/mod.rs:197`,
-`exec/mutation.rs:725`, `exec/merge.rs:1090`) **[S]**, and the publisher CAS
-retry budget (`PUBLISHER_RETRY_BUDGET = 5`, `publisher.rs:51`) **[S]**.
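
The ~70% / ~30% split can be re-derived from the depth-0 and depth-80 rows of the §0(b) table. The sketch below does only that arithmetic; the helper name `slope` is hypothetical, and the node (RI) and schema columns are left out because they are flat in depth (6 and 46 at every row) and so contribute nothing to the slope.

```rust
/// Per-depth slope between two (depth, ops) samples.
fn slope(lo: (f64, f64), hi: (f64, f64)) -> f64 {
    (hi.1 - lo.1) / (hi.0 - lo.0)
}

fn main() {
    // Endpoints taken from the §0(b) table: (commit depth, ops).
    let data = slope((0.0, 31.0), (80.0, 1021.0)); // term 1: data-table opens
    let manifest = slope((0.0, 29.0), (80.0, 269.0)); // term 2: `__manifest` scan
    let commits = slope((0.0, 13.0), (80.0, 173.0)); // term 3: `_graph_commits` refresh
    let depth_scaling = data + manifest + commits;

    println!("data      {:+.2}/depth ({:.0}%)", data, 100.0 * data / depth_scaling);
    println!("manifest  {:+.2}/depth ({:.0}%)", manifest, 100.0 * manifest / depth_scaling);
    println!("commits   {:+.2}/depth ({:.0}%)", commits, 100.0 * commits / depth_scaling);
}
```

Endpoint slopes are a coarse fit, but the table is close enough to linear that they agree with the rounded figures quoted in the list.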
- -Branch ops compound it: `branch create` is a per-table sequential fork loop -(`fork_branch_from_state`, `table_store.rs:282`); `branch delete` opens a -snapshot per *other* branch (`ensure_branch_delete_safe`, `omnigraph.rs:1317`) -and force-deletes per forked table sequentially (`cleanup_deleted_branch_tables`, -`omnigraph.rs:1359`) **[S]**. Measured serial backbones (Β§0(c), edge binary): branch -create **~77 hops**, delete **~87** (op counts scale with table count β†’ Β§9 step 6); -**branch *write* is the worst β€” 1777 ops, ~258-hop serial backbone, a 21s compute -floor even at zero RTT** = fork-on-first-write (the path step 3a did not cover; Β§9 step -3b + the fork seam), which is why prod branch-load (103–138s) ≫ direct-main (35s). - ---- - -## 2. Root cause (validated) - -### 2.1 The write re-derives its world from storage every stage - -`loader/mod.rs:400` captures a `snapshot` once, but downstream stages **ignore -it** and re-resolve **[S]**: - -- `open_for_mutation_on_branch` (`table_ops.rs:505`) re-calls - `resolved_branch_target` **per table** (`:512`), which runs - `ensure_schema_state_valid` (a full schema-contract storage read with no cache, - `omnigraph.rs:561-568`) and then opens **by head** via - `open_dataset_head_for_write` (`:522`/`:559`), asserting head == pinned only - *after* the open. -- `fresh_snapshot_for_branch` (`omnigraph.rs:771`) always does fresh I/O; the - fork authority path re-reads the live manifest (`table_ops.rs:574`). -- The captured snapshot is used only for membership/fork checks, never for the - actual opens. - -The drift guards, CAS retries, and recovery scans are **compensating machinery** -for the staleness this self-inflicts. The `Snapshot`/coordinator primitive -already exists; it is treated as cheap-to-reacquire rather than as the -operation's authoritative identity. - -### 2.2 The depth terms β€” data-table re-reads dominate, internal tables secondary - -Confirmed in code and measurement (Β§0). 
The **dominant** term is Β§2.1's per-table -opens: ~13 opens per write through the lance-namespace builder -(`DatasetBuilder::from_namespace`, `namespace.rs:174`). The builder calls the -namespace's `describe_table` (`lance` `builder.rs:130-178`), and omnigraph's -`describe_table` opens the whole dataset just to return a location (`open_head` β†’ -`Dataset::open`, `namespace.rs:362`/`:112`); `.load()` then resolves latest again β€” -a **double latest-resolution per open**, O(depth) on the repro store β€” so cost -grows with the table's version count (+12 reads/depth, ~70%). The **read** path -opens direct `from_uri().with_version(N)` (`namespace.rs:112` / `SubTableEntry::open`) -β€” O(1) β€” and native pylance is flat 6 ops at any depth **[U]**, so this is -omnigraph's *namespace-open* pattern, not Lance; `checkout_version` is O(1) and not -implicated. (The heavier `list_table_versions` β€” `versions()` + a checkout per -version, `namespace.rs:395-427` β€” is **not** on this path; it is test-only today, a -separate latent O(depth): Β§10 follow-up.) The **secondary** terms are the two -internal tables: `load_publish_state` and -`commit_graph.refresh` each full-scan a table that gains a fragment per write and -is never compacted (+5 reads/depth, ~30%). This is the `gap-read-path-rederivation` -**[G]** failure mode β€” "cost grows with fragment count" β€” on the *write* path, -where PR #268 never reached. `invariants.md` documents the internal-table half: -*"the internal metadata tables (`__manifest`, `_graph_commits`) are still not -compacted, so the probe and refresh cost still grows with fragment count."* - -### 2.3 The `skip_auto_cleanup` interaction β€” and compaction β‰  cleanup - -v0.7.0 sets `skip_auto_cleanup: true` deliberately (`table_store.rs` 10 sites + -`publisher.rs:392`) **[S]** β€” load-bearing, because Lance 7's on-by-default -`auto_cleanup` would GC `__manifest`-pinned snapshot versions (`lance.md` audit) -**[U]**. 
Two distinct levers were turned off and must be replaced *separately*: -**compaction** (`compact_files`) rewrites small fragments into fewer larger ones -but does **not** prune old versions; **cleanup** (`cleanup_old_versions`) prunes -old versions. Measured on a ~85-version graph **[M]**: `optimize`/compaction -*added a version* (data-table reads 1035 β†’ 1083, frags 81β†’1 β€” **no help** against -the depth term); `cleanup --keep 3` dropped it 1035 β†’ 63 (89 versions pruned across -7 tables, **16Γ—**). So only *cleanup* bounds the version-chain length. Note today's -`cleanup`/`optimize` cover **node/edge tables only** (the "7 tables"; internal -`__manifest`/`_graph_commits` are excluded, `optimize.rs:895-904` **[M]**) β€” so -bounding the internal +5/depth residual needs them **added** to the key set (Β§9 step -2's code change). Operationally: `cleanup` aborts on a remote store without `--yes` -(the -scheduled job must pass it). Relation to step 3: while the namespace open is still -on the write path, cleanup **caps** the dominant term β€” a real interim mitigation; -once step 3 opens direct-by-version (O(1) regardless of version count, Β§2.4), -cleanup is **storage hygiene + internal-table sprawl**, not load-bearing for read -cost. The correct replacement is *scheduled* compaction **and** version cleanup -(Β§9 step 2), **not** re-enabling `auto_cleanup`. Without it, version history (and -per-write cost) grows forever. - -**Why Lance/LanceDB don't have this cost β€” the internal-table scan is self-inflicted -[U].** Verified in Lance 7.0.0 source (cargo registry): a Lance dataset's metadata is a -**per-version manifest *file*** β€” one self-contained protobuf -(`format/manifest.rs:35`, `struct Manifest { fragments: Arc>, … }`) β€” -and the current version is resolved **O(1)** via `latest_version_hint.json` -("O(1)/O(k) latest-version lookup via HEAD", `io/commit.rs:75-79`) or the V2 lexical -name. 
Reading current state is **one file read, never a scan over accumulated -metadata**; old manifests + `_transactions` files are reclaimed by **timestamp GC** -(`dataset/cleanup.rs`, on by default), and manifest *size* is bounded by data -compaction. **LanceDB** is multi-table but each table is an *independent* Lance -dataset; its catalog is a directory/namespace lookup (or a cloud catalog service), not -a mutable dataset read per write β€” it does **no cross-table atomic commit**, so it -needs no coordinating meta-table. Omnigraph's `__manifest`/`_graph_commits` are -therefore **not a Lance pattern** β€” they exist only because omnigraph layers a -**mutable catalog *as a Lance dataset*** over 217 independent tables to get a -cross-table atomic commit (the lance#7264 "Alternative A"). The whole Β§2.2 internal -term is the price of that choice: omnigraph reads its catalog as an **O(fragments) -dataset scan and appends a fragment per write**, where Lance reads its own metadata -**O(1)** and prunes by default. Step 2a (compact β†’ 1 fragment) β‰ˆ Lance's single-file -manifest read; step 2b (cleanup) β‰ˆ Lance's `cleanup_old_versions`; the design simply -re-derives, on a Lance-dataset catalog, the hygiene Lance treats as table stakes β€” and -Β§8/lance#7264 MTT is the path to delete the catalog and inherit Lance's O(1) metadata -outright. *(This also raises a design question β€” should the catalog be a Lance dataset -at all, vs a single flat CAS'd manifest file? β€” addressed in Β§8.)* - -### 2.4 Lance namespace: proper use (why the fix is bypass, not patch) - -The upstream Lance Namespace is a **catalog / discovery layer** β€” "table -discovery, resolving table locations, and coordinating commits" β€” whose intended -division of labor is *"the namespace provides basic information about the table, -[then] the Lance SDK … fulfill[s] the other operations"* (`lance-namespace` -`namespace/index.md`, `operations/index.md`) **[U]**. 
It is meant to be consulted -to *resolve a table once*, after which you operate on the `Dataset` directly β€” **not -consulted on every per-table open on a hot path.** `DatasetBuilder::from_namespace` -itself reflects this: it calls `describe_table` only to extract `location`, then -reduces to a `from_uri` builder (`lance` `builder.rs:130-178`). For a system that -*already holds* each table's location + version (omnigraph's `__manifest` does, via -`SubTableEntry`), routing per-open resolution back through the namespace is the -anti-pattern β€” and it aligns with this project's invariant 1 ("resolve latest state -through the substrate's cheap primitive instead of re-scanning") and the deny-list -"cold re-derivation on the hot path." - -So the fix is **bypass, not patch**: open writes by URI + pinned version -(`from_uri(location).with_version(N)`) β€” exactly what the **read** path already does -(PR #268 Fix 2; the read path's own comment notes the namespace open "would -full-scan `__manifest` twice per open (`describe_table` + `describe_table_version`)"), -so this completes #268's open-by-location migration on the write side (Β§9 step 3). -The **custom namespace impl stays** β€” it is still the right home for legitimate -*catalog* operations (`describe_table` / `table_exists` / `list_table_versions` / -`create_table_version` / managed-versioning commit coordination); only the -per-open *resolution* leaves it. Two Lance facts make this safe and final: opening -by explicit version is `default_resolve_version` = a single HEAD, O(1) (`lance` -`commit.rs:939-981`), and Lance's own latest-resolution cost work (version-hint, PR -#6752) confirms the latest path is the expensive one to avoid. 
**Proven on
-omnigraph's own table [M]:** a direct `Dataset::open` of the *same physical*
-85-version edge table is 2 ops (O(1)), while the `from_namespace` open of that
-identical table is the O(depth) sweep, so latest-resolution is not inherently
-O(depth); the namespace path is O(depth) only because it misses the fast path the
-direct opener engages (likely the un-threaded `Session`). Step 3 therefore makes
-each write open O(1) on its own, so node/edge `cleanup` (§2.3) is an **interim
-mitigation + storage hygiene**, not load-bearing for read cost once step 3 ships.
-
-**End-to-end proof [M]: the one-line opener bypass, measured.** A prototype
-patched `open_dataset_head_for_write` (`table_store.rs:174`) to open directly by URI
-(bypassing `from_namespace`, exactly step 3 / Alternative B), rebuilt v0.7.0, and
-re-ran the depth sweep on a fresh graph:
-
-| depth | data `edgeVER` baseline | data patched | TOTAL baseline | TOTAL patched |
-|---:|---:|---:|---:|---:|
-| 0 | 31 | **4** | 156 | 121 |
-| 10 | 181 | **4** | 358 | 173 |
-| 20 | 301 | **4** | 538 | 233 |
-| 40 | 541 | **4** | 898 | 353 |
-| 80 | 1021 | **4** | 1618 | **593** |
-
-The dominant data-table term collapses `31 + 12·depth → flat 4` (O(1) in history),
-the total slope drops `+18/depth → +5/depth` (the residual +5 is exactly the two
-internal tables, step 2's scope), and at depth 80 a single edge drops **1618 → 593
-ops (2.7×)** from this one change alone, before step 2 / Phase 7. Functional
-correctness verified on the hot paths: main edge merge, branch create + write +
-read-back, node merge (managed-versioning still correct); the direct opener already
-handles `checkout_branch` for non-main, so the namespace layer was not load-bearing
-for write correctness on these paths.
**Caveat:** the prototype did **not** exercise
-schema-apply, branch merge, fork-on-first-write to a new table on a branch, overwrite
-mode, or concurrent writers; a production step 3 must pass the full
-`merge_truth_table`/recovery/failpoint suite (the namespace may do
-managed-versioning work that matters there). It proves the thesis + hot-path
-correctness, not drop-in completeness.
-
-**Step 2 also proven [M].** On the step-3-patched binary at depth ~87, compacting
-the internal tables to 1 fragment each (content-preserving) collapsed their scans:
-`__manifest` 285 → 32 (8.9×), `_graph_commits` 177 → 11 (16×); the step-3 data term
-stayed flat at 4. So **both depth *op-count* terms are now empirically eliminated**:
-a depth-87 single edge drops **~1720 → 198 ops (~8.7× in op count)** with both fixes.
-**Wall-clock correction (§0(c)/(d)):** the `≈258 s → ≈30 s` figure was wrong (it
-multiplied *total* ops by RTT as if serial); but the win is **concurrency-dependent**,
-not zero. Under *unlimited* concurrency the depth-driven ops parallelize and this
-op-count cut barely moves wall-clock (the backbone is ~110 hops); **under an
-R2-realistic concurrency cap the same op-count cut is a primary latency win**: the
-§0(d) A/B shows the uncompacted internal scan *runs away* (6→16 s) and #291's
-compaction cuts it ~6× and bounds it. So step 2a is a **latency lever on a capped store
-(R2) and the anti-runaway fix**, *and* a compute-floor / Phase-7-prerequisite / space
-win; step 3b is the lever for the residual serial backbone. The internal term is
-**fragment-scan growth** (`read_manifest_scan` /
-`commit_graph.refresh` read all fragments of the *latest* version), so the fix is
-**compaction** (merge fragments), distinct from the data table's version-chain term
-that step 3 / version-cleanup handle. `optimize`'s `all_table_keys`
-(`optimize.rs:895-904`) excludes the internal tables, so step 2 is a real code
-change, not just scheduling.
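
The "resolve once, then open directly" shape that §2.4 argues for can be sketched as a per-transaction handle map. Everything here is a hypothetical stand-in: `PinnedTable`, `Handle`, and `TxnHandles` are not omnigraph or Lance types, and the "open" is simulated by a counter. The sketch covers only the pre-commit base opens; it does not model commits, which return a new handle that has to be threaded forward.

```rust
use std::collections::HashMap;

/// Hypothetical pinned (location, version) for one table, as a manifest entry would hold it.
#[derive(Debug, Clone, PartialEq)]
struct PinnedTable {
    location: String,
    version: u64,
}

/// Stand-in for an opened dataset handle.
#[derive(Debug, Clone, PartialEq)]
struct Handle {
    location: String,
    version: u64,
}

/// Per-transaction handle map: each table is opened at most once, by pinned version.
struct TxnHandles {
    pinned: HashMap<String, PinnedTable>,
    open: HashMap<String, Handle>,
    opens_issued: usize,
}

impl TxnHandles {
    fn new(pinned: HashMap<String, PinnedTable>) -> Self {
        Self { pinned, open: HashMap::new(), opens_issued: 0 }
    }

    /// Returns the handle for `table`, opening it only on first use.
    /// Returns `None` for a table that is not part of the pinned snapshot.
    fn get(&mut self, table: &str) -> Option<&Handle> {
        if !self.open.contains_key(table) {
            let pin = self.pinned.get(table)?;
            // A real implementation would perform the open-by-URI-and-version here.
            let handle = Handle { location: pin.location.clone(), version: pin.version };
            self.open.insert(table.to_string(), handle);
            self.opens_issued += 1;
        }
        self.open.get(table)
    }
}

fn main() {
    let mut pinned = HashMap::new();
    pinned.insert(
        "edge_knows".to_string(),
        PinnedTable { location: "mem://edge_knows".to_string(), version: 85 },
    );
    let mut txn = TxnHandles::new(pinned);

    // Thirteen stages ask for the same table; only the first one opens it.
    for _ in 0..13 {
        let _ = txn.get("edge_knows");
    }
    println!("opens issued: {}", txn.opens_issued);
}
```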
- ---- - -## 3. First principles - -On object storage the only objective function is **minimize the number of -*sequential* round-trips per logical operation, and make that number invariant to -graph age, history depth, and table count** β€” under the hard floor of SI, -durability, atomicity, and loud integrity. Three generating principles fall out, -each mapped to a validated failure: - -1. **Pin once, derive the rest (MVCC / invariant 15).** A write is a pure - function of one immutable, fully-pinned snapshot - `{branch, manifest_version, per-table (location, version, e_tag), schema_hash, - writer_epoch}`, resolved exactly once; every stage reads only from it - (open-by-pinned-version, O(1), cacheable); the only contact with "current" is - the final CAS. β†’ fixes Β§2.1. -2. **One source of truth, one commit (invariant 2).** Visibility + lineage + - version bumps are **one atomic manifest write**; the commit graph, indexes, - and topology are *projections* of the manifest, never second authorities to - keep in sync. β†’ fixes the Β§2.2 `_graph_commits` term (iss-991 Phase 7). -3. **The plan is the contract (correct-by-construction recovery).** The writer - serializes its *complete* publish intent **before any HEAD moves**; the live - commit and crash-recovery execute the *identical* plan, so they cannot - diverge. β†’ fixes the partial-publish bug class structurally - (`iss-merge-recovery-partial-rollforward`, PR #277). - -The optimal single-edge write under these: **~2–3 sequential hops, O(1) in size** -β€” 1 warm probe (0 if the coordinator is unchanged), 1 parallel stage of fragment -writes, 1 manifest CAS β€” regardless of 5 tables or 500, 10 commits or 10M. -Lance's own `test_commit_iops` (read 1 / write 2 / stages 3) **[U]** proves the -per-table primitive already hits this; the job is to make the *graph* write -inherit it. - -This is not speculative: it is exactly what the two reference object-storage -databases do. 
**LanceDB** threads a pinned `Arc` + shared `Session` and -commits with one CAS off a captured `read_version`, never re-resolving "latest" -under default consistency **[U]**. **SlateDB** captures a snapshot, treats a -monotonic-ID manifest (no pointer file) as the *sole* authority, commits with one -conditional-PUT, recovers on open (never per-write), fences with a monotonic -`writer_epoch`, and compacts on a schedule **[U]**. - ---- - -## 4. Reference-level design - -### 4.1 The interface β€” one publish authority, one declarative plan - -The deepest structural flaw is **four hand-rolled writers** (`load_as`, -`mutate_as`, `apply_schema_as`, `branch_merge_as`), each re-implementing open β†’ -stage β†’ commit β†’ sidecar β†’ lineage. There is **one publish machine**; the verbs -are different declarative plans fed to it. - -```rust -// The pinned, immutable operation identity β€” resolved ONCE. -struct WriteTxn { - branch: BranchRef, - base: PinnedSnapshot, // {manifest_version, per-table (loc,version,e_tag), schema_hash, writer_epoch} - session: Arc, // shared per-graph; warms metadata/index caches across opens - handles: HandleMap, // open the base once WITH session; thread the handle each - // commit RETURNS forward (HEAD walks Nβ†’N+1β†’N+2). NOT a - // version-keyed cache β€” HEAD moves, so a (table,version) key - // misses; reuse = forward the commit-return handle. [3b-validated] -} - -// A typed, declarative publish plan β€” the COMPLETE "what", built before any HEAD moves. 
-enum TableAction { - Append(Stream), Upsert(Batch), Overwrite(Image), Delete(Pred), - Fork { from_version: u64 }, Register(Schema), Tombstone, -} -struct PublishPlan { - base: PinnedSnapshot, - actions: Map>, - lineage: GraphCommitIntent, // parent = base.head; rides the SAME manifest CAS (Phase 7) - expected: Expectations, // per-table versions + graph_head + writer_epoch -} - -impl GraphPublishAuthority { - async fn open_txn(&self, branch: BranchRef) -> WriteTxn; // 1 warm probe - async fn publish(&self, txn: &WriteTxn, plan: PublishPlan) -> PublishedSnapshot; // stageβˆ₯ β†’ 1 CAS -} -``` - -Properties that make it optimal: - -- **Stages take `&WriteTxn`/`&PublishPlan` for the BASE** β€” re-resolving the pinned - read base / open-latest for the pre-commit phase is unrepresentable; invariants 2/3/15 - hold for the base by construction. **Caveat [3b-validated]:** this is NOT "no - re-resolution anywhere." Three commit-boundary reads are irreducible correctness - machinery and MUST stay fresh: the commit-time `fresh_snapshot_for_branch` (cross-process - OCC), the live-HEAD drift probe (a concurrent writer may have moved HEAD since staging), - and the fork-authority reads (`classify_fork_ref` deliberately bypasses the cached base β€” - a pinned base there re-opens the "force-delete a live fork" bug). Model "pinned base for - the pre-commit phase + named fresh re-reads at the commit/fork boundary." The achievable - open count is **1 base open (with session) + 1 cheap `latest_version_id` probe + threaded - commit handles**, not literally one open. -- **The recovery sidecar *is* the serialized `PublishPlan`.** Phase C and - recovery both call `plan.apply()` β€” a merge that bumps tables A+B can never - roll A forward and silently drop B. The - `iss-merge-recovery-partial-rollforward` bug class is gone by design. 
-- **One CAS.** `publish` issues exactly one conditional `__manifest` - merge-insert carrying every touched-table version + the `graph_commit` / - `graph_head` lineage rows + the `writer_epoch` check. -- **Verbs are thin lowerings.** `load`/`mutate`/`schema apply`/`branch merge` - each build a `PublishPlan` and call `publish`. Four copies β†’ one machine; the - public `load_as`/`mutate_as` API is unchanged (it lowers internally). - -The cost contract becomes part of `publish`'s documented API: - -> `publish(txn, plan)` costs `opens ≀ |plan.touched_tables|` (0 warm), -> `stages ≀ 3`, `manifest_ops = O(1)` β€” **invariant to history depth and table -> count.** - -### 4.2 Supporting mechanics (each validated this cycle) - -| Mechanic | Design | Validation | -|---|---|---| -| Open by pinned version | `from_uri(location).with_version(N)` + shared `Session` + warm handle cache β€” the O(1) opener *reads* already use (`instrumentation::open_table_dataset:112`, `SubTableEntry::open` `db/manifest.rs:200`). **NOT** the write path's `from_namespace` builder (`namespace.rs:174`), whose `describe_table` + `.load()` do an O(depth) double latest-resolution (Β§2.2 β€” the dominant cost), and **NOT** `open_dataset_at_state` (opens head then checks out, `table_store.rs:232`, not O(1)). | #0 **[S]** | -| Strict-op SI | Update/Delete/SchemaRewrite open by pinned version (consistent read base) and the publish CAS rejects a *same-table* advance. Insert/Merge rely on Lance's natural rebase. **Do not remove the open guards wholesale** β€” that is a silent lost-update. | #5 **[S]** | -| Fork Γ— pinned-version | Fork already opens source at the pinned version and creates the target from it; the live-manifest authority re-read before fork stays (not defeated by the pin). 
| #6 **[S]** | -| Open-once via the direct opener (**THE dominant depth fix**) | Reuse is **intra-transaction** (open each table once, by pinned version, thread it β€” kills the ~13 namespace-builder opens, the O(depth) double latest-resolution / +12/depth term, Β§0/Β§2.2). A commit invalidates its own entry, so no cross-write warm cache. Thread the shared per-graph `Session` through write opens (it is *not* today β€” `load_table_from_namespace` attaches no session, `namespace.rs:174`). | #9 **[S]** | -| Lineage in the manifest (Phase 7) | Publish `graph_commit` + mutable `graph_head:` rows in the same `__manifest` merge-insert with a branch-head CAS; `_graph_commits` becomes a projection. Removes the per-write `commit_graph.refresh` and closes the "manifestβ†’commit-graph atomicity" + "commit-graph parent under concurrency" gaps. | `iss-991` **[G]**, **[S]** | -| Bounded history (compaction **and** cleanup) | Bring the internal table(s) into the `optimize` loop AND schedule version *cleanup* of node/edge tables β€” compaction rewrites fragments, only cleanup prunes the version chain that Β§2.2's dominant term re-reads (Β§2.3). No blob/PK/CAS blocker (`__manifest` has no blob column, `state.rs:44-72`; the unenforced PK is orthogonal to a content-preserving Rewrite). Post-Phase-7 there is only **one** internal table to compact. | #8 **[S]** | -| Recovery off the hot path | Move the per-write `list_dir("__recovery/")` to coordinator-open + the CAS-conflict path, guarded by a sidecar-age grace window (the sidecar carries `created_at` micros + a ULID, `recovery.rs:762`/`:1522`). | #4, `iss-856`/`iss-recovery-sweep-live-writer-rollback` **[G][S]** | -| Epoch fence | Monotonic `writer_epoch` in `__manifest`, CAS-claimed at writer init, checked on every publish. Fences a whole zombie *writer* deterministically (no TTL); closes the multi-process exposure and the Lance-MTT TTL-lease gap. 
| SlateDB `FenceableTransactionalObject` **[U]** | -| Branch create | Lance `Clone` instead of the per-table fork loop (O(tables)→O(1) sequential). | `iss-691` **[G]** | -| Branch delete | Run the per-other-branch safety check and the per-table reclaim loops concurrently (`buffer_unordered`); read branch sets from in-memory coordinator state. | **[S]** | -| Maintenance-class commit (compaction) | Commutative/content-preserving ops do NOT use the logical class's strict OCC: Lance rebases the disjoint case, the app reopens+replans on a real overlap, and the manifest publish is a **monotonic fast-forward** (advance or no-op, never equality CAS) — no `writer_epoch`. The two-op-class rule + the found+fixed optimize-vs-write race: §6.6. | §6.6 **[M]**, **LANDED** | - ---- - -## 5. The cost contract — measurement & enforcement - -The bug class is invisible to correctness tests, to local-FS tests, and to -wall-clock benches. You can only prevent a regression in a quantity you **define -precisely, measure deterministically, and bound on every PR.** The quantity is -*sequential object-store round-trips per logical operation, as a function of -history depth and table count.* OmniGraph already has the correct pattern for -**reads** (`warm_read_cost.rs`, `IOTracker`, swept to depth 20); this RFC extends -it across the write/branch/open surface. This is exactly how Lance and SlateDB -enforce it **[U]**. - -### 5.1 Tier 1 — deterministic IO-counted gate (every PR) - -Ordinary `cargo test`, hermetic (in-memory / tempdir + `IOTracker`), no S3, no -wall-clock. Two shapes: - -```rust -// (A) cost-invariant-to-HISTORY — the load-bearing gate. Gate the MERGE verb (the prod path). -for depth in [10, 100, 1000] { // REAL commit history, not row count - build_history(depth); - reset_counters(); - let s = measured_merge(); // --mode merge, the read-modify production path - // PRIMARY — the dominant term (§0): the written table's data opens/reads, flat in depth.
- assert!(s.data_table_opens <= touched_tables); // open each table ONCE, by pinned version - assert!(s.data_table_reads_per_open <= K_OPEN); // each open O(1) in the table's version count - // SECONDARY — internal-table scans flat in depth (compaction + cleanup). - assert!(s.manifest_ops <= K_MANIFEST); // small CONSTANT, NOT a function of depth - assert!(s.lineage_ops <= K_LINEAGE); - assert!(s.stages <= 3); // bounded sequential hops -} -assert_flat_across_depths(); // ALL terms — esp. data-table opens — flat in N - -// (B) fitness functions — architectural invariants AS tests -assert_eq!(validate_schema_contract_calls(write), 1); // resolve-once -assert_eq!(coordinator_resolutions(write), 1); // O(1) resolution -assert_eq!(recovery_listdir_calls(steady_state_write), 0); -``` - -**Prerequisite, not a follow-up: route ALL opens (read + write) through the one -instrumented opener BEFORE the gate is meaningful.** Today the write path's data -opens bypass `table_wrapper` (the §0(a) blind spot), so a gate that asserts only -`manifest_ops`/`lineage_ops` would **pass a still-broken build** — one that -compacts the internal tables (§9 step 2) but keeps the dominant ~13× namespace-open -sweep (§2.2). The gate MUST count data-table opens/reads (the dominant term), which -requires the routing change first. The data term is **mode-independent** (append ≡ -merge ≡ +12/depth **[U]**), so either verb exercises it; gate the **merge** verb -as the production path. **Fixture caveat [U]:** use *valid* edge endpoints — a -write to a non-existent endpoint fails RI validation and rolls back at ~192 ops -with **zero chain reads**, so a bad-endpoint fixture silently measures the rollback -path and would pass falsely. - -The load-bearing rule both Lance and SlateDB mostly miss: **assert the constant is -flat across N, not just small at one N.** A shallow fixture cannot catch an -O(history) cost (the §0(b) table is the red baseline).
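
The flat-across-N rule can be sketched as a small helper. This is a hypothetical illustration, not the harness's own `assert_flat`: the names `Sample` and `is_flat`, the `slack` tolerance, and every number below are assumptions made up for the example.

```rust
/// One measurement of a per-write cost term at a given history depth.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy)]
struct Sample {
    depth: u64,
    ops: u64,
}

/// Returns true when no sample exceeds the shallowest sample's op count by
/// more than `slack`, i.e. the term does not grow with history depth.
fn is_flat(samples: &[Sample], slack: u64) -> bool {
    let Some(base) = samples.iter().min_by_key(|s| s.depth) else {
        return true; // nothing measured, nothing to violate
    };
    samples.iter().all(|s| s.ops <= base.ops + slack)
}

fn main() {
    // Made-up numbers: a constant term vs. one that grows with depth.
    let flat = [
        Sample { depth: 10, ops: 4 },
        Sample { depth: 100, ops: 4 },
        Sample { depth: 1000, ops: 5 },
    ];
    let growing = [
        Sample { depth: 10, ops: 120 },
        Sample { depth: 100, ops: 1200 },
        Sample { depth: 1000, ops: 12000 },
    ];
    // A budget checked only at depth 10 (say `ops <= 200`) would accept the
    // growing series; comparing across depths rejects it.
    println!("flat series passes: {}", is_flat(&flat, 1));
    println!("growing series passes: {}", is_flat(&growing, 1));
}
```

Comparing against the shallowest sample, rather than against a fixed constant, is what makes the check sensitive to growth instead of to absolute size.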
- -**Two latency LOCKs, and the `ThrottledStore` must cap concurrency *and* inject -latency (corrected per §0(c)/(d)).** The wall-clock model is -`(serial_hops + ops/effective_concurrency)·RTT + compute`, so the gate needs **both** -terms, and an unlimited-concurrency harness measures neither honestly: -(1) **serial-hop LOCK** (`serial_hops ≤ K`, flat in depth) — read off the -`wall = compute + serial_hops·L` slope (Lance's `test_commit_iops` setup); catches the -~110-hop backbone (step 3b's target). (2) **op-count-flat-in-history LOCK** under a -**capped-concurrency** `ThrottledStore` (e.g. `MAXCONC=8`) — catches the internal-scan -runaway (§0(d)) that step 2a fixes; *without the cap this LOCK is invisible* because -the ops fan out (the §0(d) trap). Both are load-bearing: a build can pass the serial-hop -LOCK and still run away on a capped store if its per-write op count grows with history. -Run the depth sweep through a `ThrottledStore` that **both** throttles per-op latency -**and** bounds in-flight concurrency to an R2-realistic value; assert `serial_hops` flat -*and* `ops` flat in history. (A pure op-count gate under unlimited concurrency would -*fail a correct build* whose parallel scans grow yet cost no wall-clock, and *pass a -slow one* — which is why the cap is the load-bearing addition.) - -### 5.2 Tier 2 — wall-clock trend (post-merge / nightly, never a PR gate) - -A `ThrottledStore` criterion bench injecting cross-region RTT (50/150 ms/op — the -incident's regime) for single-insert and branch-op latency, with a threshold -alert (Bencher.dev `--err` / github-action-benchmark `fail-on-alert`). Both -reference DBs keep wall-clock out of the PR gate (too noisy on shared runners) -and use it only as a trend.
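
The wall-clock model quoted in 5.1, `(serial_hops + ops/effective_concurrency) * RTT + compute`, can be worked through numerically to show why the concurrency cap matters. This is a sketch with illustrative inputs: the hop, op, and compute figures are assumptions chosen for round arithmetic, not measurements from this RFC; only the 150 ms RTT and the cap of 8 echo values mentioned in the text. `WriteCost` and `modeled_wall_ms` are hypothetical names.

```rust
/// Inputs to the wall-clock model. Hypothetical struct, for illustration.
#[derive(Debug, Clone, Copy)]
struct WriteCost {
    serial_hops: f64,
    ops: f64,
    effective_concurrency: f64,
    rtt_ms: f64,
    compute_ms: f64,
}

/// wall = (serial_hops + ops / effective_concurrency) * RTT + compute
fn modeled_wall_ms(c: &WriteCost) -> f64 {
    (c.serial_hops + c.ops / c.effective_concurrency) * c.rtt_ms + c.compute_ms
}

fn main() {
    // Same op count under two concurrency regimes (illustrative numbers).
    let capped = WriteCost {
        serial_hops: 10.0,
        ops: 800.0,
        effective_concurrency: 8.0,
        rtt_ms: 150.0,
        compute_ms: 500.0,
    };
    // With enough concurrency the ops term collapses toward a single hop.
    let uncapped = WriteCost { effective_concurrency: 800.0, ..capped };

    println!("capped store:   {} ms", modeled_wall_ms(&capped));
    println!("uncapped store: {} ms", modeled_wall_ms(&uncapped));
}
```

Under these assumed inputs the op-count term dominates on the capped store and nearly vanishes on the uncapped one, which is why a harness without a cap cannot see op-count growth in wall-clock terms.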
- -### 5.3 Close the loop — production metric - -Emit `storage.ops` and `storage.stages` per logical operation as a span/counter -(cheap always-on atomics; the heavy per-table attributing wrapper stays -test-only behind a `test-util`-style feature, zero release cost). The number -asserted in CI is the number observed in prod — `iss-write-s3-roundtrip-amplification`'s -cross-region signal becomes a direct readout. - -### 5.4 Process discipline — test-first for performance - -Write the depth-sweep cost-budget test **first**: it goes **red today** (§0), the -WriteTxn + Phase-7 + compaction work turns it **green** (flat in N), and the -red→green is the proof. This is CLAUDE.md rule 12 applied to cost, and the -originating handoff's sequencing (§8/§9: land the tests before the fix so the win -is measured and locked). Add the policy (extend invariant 15 + testing.md "Cost -budget tests"): *any change touching the read/write/branch/open path MUST add or -extend a cost-budget test asserting the metric is flat at history depth.* - -### 5.5 The correctness contract — concurrency tests (the safety twin of the cost gate) - -The cost gate proves *fast*; these prove *safe*. §6.5's multi-writer cliff slipped -the suite for the same structural reason the latency bug did — **nothing runs the -schedule that triggers it**: the suite is single-process with the in-process queue -(the bug is cross-process), uses local/in-memory stores (no object-store -cross-process CAS), and its recovery tests cover restart-time sweep, not -live-writer rollback. **These four must land before `PublishPlan`/epoch merge -(step 5):** - -1. **Cross-process multi-writer on a real/emulated object store** (the *corruption* - case) — N independent engine **processes** writing the same `(table, branch)`; - assert all commit-or-cleanly-retry (no lost updates, no stuck "needs recovery," - no HEAD-ahead-of-manifest).
**A single-process failpoint test cannot reproduce - the corruption** (in-process degrades to clean OCC, §6.5) — this genuinely needs - a multi-process harness (empirically 1/12 today). State that so nobody writes a - single-process test expecting it to fail. -2. **Deterministic in-process interleaving (failpoint) — WRITTEN, passes [M].** Two→ - eight handles, sleep failpoint at the `commit_staged`→publish window - (`loader/mod.rs:605`); resume losers and assert they retry cleanly. This - demonstrates the **benign** path (N=8 → 2 commit, 6 clean OCC retries) — it is the - regression guard for "in-process stays clean," *not* a reproduction of the - cross-process cliff. -3. **Live-writer recovery** (`iss-recovery-sweep-live-writer-rollback`) — a - concurrent open must not roll back a live in-flight publish (the grace window). -4. **Formal model** — a Quint/TLA+ model of `{two writers, interleave commit_staged - and manifest-CAS}` (`iss-934`); it finds the §6.5 cliff immediately. -5. **Cross-table write-skew — WRITTEN, red, and driven red→green in-process [M].** - Failpoint `loader.post_ri_pre_stage` (between RI-validation and staging): writer B - validates "Bob exists" and parks; writer A `overwrite`s `node:Person` dropping Bob - (non-cascading); B commits `Knows(Bob→Alice)` → committed orphan. The red test for - the §7.1 fix. **Acceptance is a single-process gate** — unlike the §6.5 HEAD-ahead - corruption (which genuinely needs the multi-process harness), this skew reproduces - *deterministically in one process*: the parked edge writer's snapshot really does - pin `edge:Knows:1` before the overwrite commits, so the overlap is real with two - in-process handles. The fix went red→green in-process behind a shared head row - (§7.1). Only #1–#4 (HEAD-ahead/epoch corruption) need cross-process scheduling.
- -Plus one **disambiguating run** owed (§6.5 confound): separate-handles in-process -on S3 — to confirm the corruption is the process boundary, not the store. - -This mirrors the cost gate's discipline (assert across the dimension the suite -otherwise never exercises) — there, history depth; here, concurrent cross-process -schedules. - ---- - -## 6. What is already right vs. the deltas - -**Already correct — do not rewrite.** The in-memory `MutationStaging` accumulator, -the recovery sidecar mechanism, the per-(table,branch) write queue, D2, the sealed -`TableStorage` trait, and the read-path warm-up (PR #268) all stay. This is **not** -a substrate rewrite. - -**One claim to soften — manifest-CAS is atomic *per publish*, not unconditionally -cross-table-serializing [M].** The manifest CAS (the reference impl of the -lance#7264 "Alternative A") makes each publish atomic and serializes any two writers -whose write-sets **share a `__manifest` row** — overlapping or same-table, which is -exactly why §6.5's same-table cases and the cascading-delete case retry cleanly. But -two writers touching **disjoint** tables write disjoint per-`object_id` rows, so Lance -sees no conflict and **both commit** (proven [M], §7.1). The genuinely-atomic -cross-table commit §13 contrasts with Delta is the **target** (§4.1's single -merge-insert over a shared head row), **not current state**. So "do not rewrite the -CAS" holds for the *commit primitive*, but the cross-table-serialization §7.1 needs -is a real addition (the shared `graph_head` row), not something the current CAS -already provides.
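
The difference between disjoint rows and a shared head row can be illustrated with a toy model of a row-level conditional publish. This is a minimal sketch, not the publisher's code: `Manifest`, `snapshot`, `try_publish`, `race`, and the row keys are hypothetical, and the real conflict detection is Lance's merge-insert, which this only imitates by comparing per-row versions.

```rust
use std::collections::HashMap;

/// Toy manifest: row key -> version. Hypothetical stand-in for `__manifest`.
#[derive(Default)]
struct Manifest {
    rows: HashMap<String, u64>,
}

impl Manifest {
    /// Versions of the given rows as seen right now (absent rows read as 0).
    fn snapshot(&self, keys: &[&str]) -> HashMap<String, u64> {
        keys.iter()
            .map(|k| (k.to_string(), self.rows.get(*k).copied().unwrap_or(0)))
            .collect()
    }

    /// Conditional publish: succeeds only if every row in the write-set still
    /// has the version the writer saw. Rows outside the write-set are never
    /// inspected, which is the point of the example.
    fn try_publish(&mut self, seen: &HashMap<String, u64>) -> Result<(), String> {
        for (key, seen_version) in seen {
            let current = self.rows.get(key).copied().unwrap_or(0);
            if current != *seen_version {
                return Err(format!("conflict on row '{key}'"));
            }
        }
        // All checks passed: advance every row in the write-set.
        for (key, seen_version) in seen {
            self.rows.insert(key.clone(), seen_version + 1);
        }
        Ok(())
    }
}

/// Two writers snapshot concurrently, then A publishes before B.
/// Returns whether each publish succeeded.
fn race(write_set_a: &[&str], write_set_b: &[&str]) -> (bool, bool) {
    let mut manifest = Manifest::default();
    let seen_a = manifest.snapshot(write_set_a);
    let seen_b = manifest.snapshot(write_set_b);
    let a = manifest.try_publish(&seen_a).is_ok();
    let b = manifest.try_publish(&seen_b).is_ok();
    (a, b)
}

fn main() {
    // Disjoint write-sets share no row, so neither publish sees a conflict.
    println!("disjoint: {:?}", race(&["node:Person"], &["edge:Knows"]));
    // A head row in both write-sets makes the second publish conflict.
    println!(
        "shared head: {:?}",
        race(&["node:Person", "graph_head"], &["edge:Knows", "graph_head"])
    );
}
```

In this model the second writer fails only once both write-sets contain a common row, which is the role the text assigns to the shared `graph_head` row.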
- -**The deltas (each a validated, localized gap):** - -| # | Delta | Mechanism | Tracking | -|---|---|---|---| -| 1 | Snapshot re-derived per stage | capture-once `WriteTxn`, thread by ref | `iss-write-s3-roundtrip-amplification` | -| 2 | Write opens via `from_namespace` re-resolve the data-table ~13×/write, missing the fast path (**DOMINANT, +12/depth**) | open each table **once, direct `from_uri().with_version(N)`** (bypass namespace, §2.4) + shared Session | `iss-write-s3-roundtrip-amplification`, #0 | -| 3 | Lineage = 2nd authority, O(history) refresh (secondary) | Phase 7: lineage into `__manifest` | `iss-991` | -| 4 | `__manifest`/`_graph_commits` excluded from optimize/cleanup (`optimize.rs:895-904`; prototype pruned "7 tables" = node/edge only **[M]**) — the +5/depth residual after step 3 | **add them to `all_table_keys`** (a code change) + scheduled compaction/cleanup | `gap-read-path-rederivation` (write twin) | -| 5 | `list_dir("__recovery/")` per write | move to open + conflict, grace window | `iss-856`, `iss-recovery-sweep-live-writer-rollback` | -| 6 | 4 hand-rolled writers, commit↔recovery drift | one `PublishPlan` executed by both | `iss-merge-recovery-partial-rollforward` (PR #277) | -| 7 | No writer epoch (multi-process exposure) | `writer_epoch` in `__manifest` | — (new) | -| 8 | branch create = O(tables) fork loop | Lance `Clone` | `iss-691` | -| 9 | branch delete = sequential loops | concurrent `buffer_unordered` | — (new) | -| 10 | No write/branch cost gate (must count **data-table** opens; route all opens through the instrumented opener first) | Tier-1 IO-counted tests, merge verb | — (new) | -| 11 | Schema contract re-validated uncached per resolve (**flat 46 reads/write — 29% of depth-0 cost; constant, not depth**) | resolve/validate-once in `WriteTxn`; §5.1 `validate_schema_contract_calls==1` (the depth gate misses it) | `iss-write-s3-roundtrip-amplification` | - ---- - -## 6.5 Concurrency correctness — the multi-writer
cliff (proven [M]) - -The latency fixes are about *speed*; a separate, proven finding is about *safety*. -A multi-writer experiment **[M]** shows concurrent same-branch writers behave very -differently by topology: - -| topology | concurrency | outcome | -|---|---|---| -| single server (shared in-proc queue, `loader/mod.rs:426`) | 12 | **12 / 12 commit** (clean) | -| in-process, separate handles, interleave failpoint at `commit_staged`→publish (`loader/mod.rs:605`) | 8 | **2 / 8 commit; the other 6 are clean retryable OCC** | -| multi-process (separate CLIs / S3, no shared queue) | 2 / 3 / 5 / 12 | **1 / N commit; the rest CORRUPT** | - -**Two distinct failure modes — and the corruption is strictly cross-process:** - -- **In-process → benign.** Even with *separate handles, no shared queue, high - contention*, losers fail with `stale view of 'edge:Knows': expected manifest table - version 5 but current is 7 — refresh and retry` — a **clean, retryable OCC - conflict; graph state stays consistent.** The publisher CAS is doing its job. -- **Cross-process → corruption.** `Lance HEAD version N+1 ahead of manifest version - N; a pending recovery sidecar requires rollback`. **Mechanism:** a losing writer - advances the table's Lance HEAD (`commit_staged`) *before* the manifest CAS; when - the CAS loses, HEAD is ahead of the manifest — a partial commit the per-write heal - **defers** (`recovery.rs:978-988`; only the open-time sweep rolls back), so a - *live* writer hitting it **fails instead of healing**. Self-heals on the next - read-write reopen (not permanently bricked), but during a burst throughput - collapses to one survivor. Reachable at **concurrency = 2** cross-process. - -So in-process safety **already comes from the publisher CAS** (clean OCC); the -corruption needs the process boundary.
*(Confound, stated honestly: the in-process -interleave ran on local-FS and the cross-process on S3-via-proxy — but -single-server-on-S3 was also clean (12/12), giving two independent "in-process -clean" points vs one "cross-process corrupt," triangulating on the process -boundary, not the store. One disambiguating run — separate-handles in-process on S3 -— would move this from triangulated to proven; §5.5.)* - -**Scoping (matters for urgency):** **single-server prod is serialized-correct, just -slow** — the in-process `(table,branch)` queue serializes same-branch writes (all 12 -commit, no lost updates); the production incident was the *latency* (serialized -O(depth) writes → 90 s timeout), **not** corruption. The corruption hazard is -**latent**: it appears the moment a second writer exists (server replica, -CLI-alongside-server, multi-writer scale-out). **So: single-server today = -serialized-correct (slow; fixed by steps 2/3); multi-writer = UNSAFE until -`writer_epoch` lands.** - -**The fix is the existing RFC, no new design.** The `A`-before-`B` window -(Lance HEAD moves before the manifest references it) is inherent to Lance's -per-table-lineage model — you cannot eliminate it, only fence and recover it: the -**`writer_epoch`** (delta #7) is a leader-lease via cross-process CAS so two writers -are never in the `commit_staged`→manifest-CAS window across processes (it removes -the concurrent-race dimension); the **`PublishPlan`=sidecar** (delta #6) makes a -single crashed writer roll forward/back deterministically (the crash dimension); and -**recovery off the hot path + grace window** (delta #5, Q2) is the exact reason the -live writers failed rather than self-healed (`iss-recovery-sweep-live-writer-rollback`). -This is the standard WAL-replay + leader-lease shape (confirmed against SlateDB's -`FenceableTransactionalObject` and Kleppmann's fencing-token canon, §10).
**This -finding promotes #6/#7 from "nice correctness work" to the load-bearing guard that -gates multi-writer topologies — and it is the motivating case for them.** - -## 6.6 The two op classes — and a found+fixed maintenance race (LANDED) - -§6.5 is about the **logical** write class. A prod run surfaced the same -process-boundary flaw in the **maintenance** class: a direct CLI `optimize` racing -a served write on the same table **failed** — either the Lance `Rewrite` lost -("preempted by concurrent Update") or the manifest publish lost the strict equality -CAS ("expected 14 but current 15"). Same root cause as §6.5 (the in-process write -queue does not serialize across processes), but the right fix is the **opposite** of -the logical class, because the two classes commute differently: - -| class | examples | commutes? | correct commit model | -|---|---|---|---| -| **maintenance** | compaction (`Rewrite`), `optimize_indices` | **yes** (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** — no epoch, no read-set | -| **logical mutation** | load / mutate / merge / delete | **no** (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence (§6.5, #7) | - -Applying strict OCC + equality-CAS uniformly is the mistake: **too strong for -maintenance** (it manufactures a false conflict against a commutative op — this -bug) and **too weak for logical writes cross-process** (§6.5). The maintenance fix -needs **no `writer_epoch`** — that is the tell that it is a different class. - -**Validated against Lance 7.0.0 source + reproduced [M].** Lance has no compaction -re-plan retry (the semantic `RetryableCommitConflict` escapes `commit_transaction`'s -loop at `io/commit.rs:979`; only the OCC manifest race is retried), so the -application must reopen + re-plan — exactly what the internal-table path already -did.
Notably, Lance **rebases the common case for free**: a concurrent -insert/update/delete on *other* fragments is disjoint, so the data-table compaction -commits cleanly and the conflict surfaces only as the manifest fast-forward -(the genuine `Rewrite`-vs-`Rewrite` overlap is the rarer many-fragment / -concurrent-compaction case). - -**Fix (LANDED).** Both compaction paths now share one reopen+replan shape with a -two-level retry: an outer loop reopens+replans on a real Lance overlap conflict; an -inner Phase-C loop makes the manifest publish a **monotonic fast-forward** -(advance to the compacted version `N`, or no-op when the manifest already moved to -`≥ N` — being linear, it descends from and includes the compaction), never the -equality CAS. The `Optimize` recovery sidecar is written once and reused across -attempts (every commit is content-preserving). The in-process queue is kept as an -in-process contention reducer, not the cross-process guard. No `writer_epoch`. -(`db/omnigraph/optimize.rs`; regression tests in `tests/failpoints.rs`: -`optimize_survives_concurrent_insert_advancing_manifest`, -`optimize_survives_concurrent_delete_before_compaction`.) - -**Relationship to step 5 (the unification).** This is the first correct *instance* of -the maintenance-class commit model, not a parallel band-aid: when step 5 collapses the -writers into the single `publish(txn, plan)` authority, it **relocates** this — a -`TableAction::Rewrite` carries the monotonic-fast-forward + reopen/replan commit model -into the unified publisher — rather than reinventing it. What step 5 deletes is the -*location* (the hand-rolled loop in `optimize_one_table`), not the *semantics*; the two -regression tests above are the contract that must stay green across that refactor.
It -also makes step 5 easier, not harder: it already unified the two compaction paths onto -one retry shape and drew the op-class line (logical writers keep equality CAS; only -compaction is monotonic), so there is one fewer pattern for the unification to absorb. - ---- - -## 7. Invariants & deny-list check - -Touches and *strengthens* (does not weaken) invariants in -[invariants.md](invariants.md): - -- **§2 (manifest-atomic visibility):** preserved; lineage now rides the same CAS - (strengthens — closes the "manifest→commit-graph atomicity" gap). -- **§3 (one snapshot per op):** enforced *by construction* via `&WriteTxn`. -- **§4 (publish at one boundary):** unchanged — still one manifest publish. -- **§5 (recovery part of the commit protocol):** preserved; the sidecar *is* the - `PublishPlan` (strengthens — commit and recovery cannot diverge). The grace - window addresses the documented "recovery serialized against live writers - in-process only" gap. -- **§7 (indexes derived) / §15 (one source of truth, cheaply derived):** this RFC - is the write-side application of §15 — bound cost to the working set, not - history. The commit graph becomes derived (strengthens). -- **§5 strict-op SI:** preserved (#5 validation — open guards kept for - read-modify-write). - -**Deny-list:** does *not* hit "cold re-derivation on the hot path" (it removes -two instances), "state that drifts" (lineage stops being a second authority), or -"acks before durable persistence." The `writer_epoch` is the closing move on the -"local `write_text_if_match` is not a cross-process CAS" / multi-process gaps — -add it before admitting multi-process write topologies. - -No invariant is weakened. Two Known Gaps **close** (manifest→commit-graph -atomicity; commit-graph parent under concurrency, via Phase 7); one -(read-path-rederivation) gets its **write twin** filed and addressed.
- -### 7.1 Scope of the correctness claims (literature review, §13) - -The "correct by construction" framing (§3, §4.1) is **precise but bounded** — the -DB-canon review flags three places not to over-claim: - -- **Per-table serializability, not graph-wide — but the gap is narrow and now - measured [M].** Three deterministic cases (failpoint `loader.post_ri_pre_stage`, - placed between RI-validation and staging; red test in `tests/failpoints.rs`): - - **Cross-table *disjoint* → genuine skew, VIOLATED.** A **non-cascading endpoint - removal** — `node:Person` *overwrite* dropping Bob, touching only the node table - — concurrent with an edge insert `Knows(Bob→Alice)`: both commit (write-set-only - CAS, RI validated once pre-commit and never re-checked at publish) → **committed - orphan**. (= `iss-ri-write-skew-dangling-edges` + the concurrent face of - `iss-overwrite-orphans-committed-edges`.) - - **Cross-table *overlapping* → incidentally protected.** `delete`-based removal - **cascades** into `edge:Knows`, so the write-sets overlap, the per-table CAS - engages, and the loser fails **cleanly** (stale-view OCC retry); invariant held. - - **Same-table → NOT a separate skew.** Cardinality / `@unique(src)` have - overlapping write-sets, so the per-table CAS holds the constraint; the loser's - failure is the **HEAD-ahead corruption already scoped to #6/#7** (epoch + - PublishPlan), not a consistency hole. *(This corrects an earlier - over-generalization: cardinality/uniqueness do not share the read-set gap.)* - - So the skew is **reachable only for the non-cascading-overwrite × disjoint-edge-insert - shape** — operation-specific, not constraint-specific.
- - **The scoped fix alone is a no-op — proven [M], and the reason is mechanical.** - Feeding the endpoint node-table versions into the edge's publish *expected* set - (`check_expected_table_versions`, `publisher.rs:353`) was prototyped exactly; debug - confirmed the pins reach the check, **and both writers still committed — the orphan - persisted.** Every publish writes a *unique per-`object_id` row* into `__manifest` - (merge key `object_id = version_object_id(table, version)`). Two disjoint-table - writers (`node:Person` vs `edge:Knows`) touch **no common row**, so Lance's - row-level merge-insert CAS commits both with **no conflict**, the publisher's retry - loop **never fires**, and `check_expected_table_versions` — a **non-atomic - pre-check, not part of the CAS** — is evaluated exactly once against the stale - pre-both manifest and passes for both. The read-set pin only bites if the loser is - **forced to retry and re-evaluate against fresh state**, which requires a *shared - contention row* every publish touches. Adding a stand-in global head row - (`UpdateAll`-touched by every publish) makes the disjoint writers overlap → Lance - conflict → publisher retry → the reloaded pin (`edge:Knows:1` vs current `5`) - rejects the stale writer → no orphan (red→green, failpoint suite 52/52). **That - shared row is exactly Phase-7's `graph_head:`.** - - **Consequence — §7.1 is NOT a standalone single-server PR** (correct earlier text - that called it "single-server-live, not deferrable" — it *is* urgent and - epoch-independent, but it cannot ship against today's per-`object_id` manifest - without a contention point). Land it one of three ways: **(a)** with Phase 7 - (step 4), reusing `graph_head:` as the contention row; **(b)** behind a - minimal per-branch head row ahead of Phase 7 (~15 lines, as prototyped); or - **(c)** as commit-time re-validation — still must win a serialization point first.
- **Recommended: (c) behind a per-branch head row.** The CAS-map approach carries the - two costs §11 anticipated — *table-granularity false conflicts* (any `Person` - overwrite conflicts with any concurrent `edge:Knows` insert, even different rows — - needs a row-granularity read-set) and *scope* (a global head serializes the whole - graph; per-branch `graph_head` is the right granularity). Commit-time re-validation - is precise (no false positives) **and** reuses the same serialization point, so once - the head row exists it strictly dominates the CAS-map. Either way the head row - imposes an inherent trade — same-branch writers serialize cross-process (throughput - ceiling 1/branch, bounded by `PUBLISHER_RETRY_BUDGET`) — **now a correctness - requirement, not just a Phase-7 side effect** (§11). - - **Two faces, two fixes — do not bundle them.** The above addresses only the - *concurrent* face (overlapping snapshots, `iss-ri-write-skew-dangling-edges`). The - *sequential* face (`iss-overwrite-orphans-committed-edges`) — an overwrite drops a - node that **already has a committed inbound edge**, with *zero* concurrency — - **cannot** be caught by read-set-in-CAS: the later writer's snapshot legitimately - post-dates the edge, so its pin matches and it commits. That is a pure - **inbound-RI-validation** gap: when an overwrite/delete removes node endpoints, - re-check that no live edge references them. A validation concern, not a CAS one; - it needs no contention row and ships independently. - *(Note: `iss-984` is a different bug — remote branch-merge idempotency — not this.)* -- **Recovery: roll-forward is by-construction; roll-back is not.** "Commit and - recovery replay the identical plan" holds for the **redo** direction (shared - `plan.apply()`).
The undo classifier (NoMovement / UnexpectedAtP1 / - UnexpectedMultistep / IncompletePhaseB) lives *outside* the shared executor, only - at open-time — that's where ARIES-style divergence risk concentrates and where the - §5.5 failpoint coverage is owed. -- **The fence and the cross-file atomicity rest on a linearizable conditional-put.** - Kleppmann's fencing-token guarantee, the manifest CAS, and the epoch all require a - linearizable register — true on S3/R2 (If-Match) but **not** on the local-FS path - (`write_text_if_match` is content-token compare-then-replace, ABA-prone — - `invariants.md` Known Gap). **Precondition to state up front: every "deterministic - fence" / "atomic CAS" claim holds *on a store with linearizable conditional-put*; - the epoch must not use the local-FS path.** Delta Lake §3.2.2 treats the - object-store consistency model (read-after-write + put-if-absent) as a first-class - design parameter; so should this RFC. - ---- - -## 8. Relationship to Lance MTT (the seam, not a dependency) - -`GraphPublishAuthority.publish(txn, plan)` is exactly the adapter to a future -Lance `catalog.transaction()`. lance#7264 ("Multi-Table Transactions via -Branching") is real and OmniGraph is its reference "Alternative A" -(fast-forward-main + WAL + roll-forward recovery) **[U]**, but it is a 5-day-old -discussion with two unbuilt dependencies (lance#7263 branch merge/rebase, -lance#7185 UUID branch paths), an unresolved central choice (it *favors* -pointer-swap — the opposite identity model from OmniGraph), and an open soundness -question (TTL lease needs an epoch). **Build the seam now on its own merits; do -not schedule around MTT landing.** When it ships, `publish`'s *body* swaps -(stage→CAS→sidecar → `catalog.transaction()`) while `WriteTxn`/`PublishPlan` and -every verb lowering stay. `iss-863`/`iss-864` **[G]** already scope this spike.
- -**Why keep `__manifest` as a Lance *dataset* (and compact it) rather than a single flat -CAS'd manifest file?** The Lance-source comparison (§2.3) makes this an explicit choice -to defend, not assume. Both reference designs the RFC cites store cross-version metadata -as **one flat file** read O(1): Lance's per-version manifest (`format/manifest.rs`) and -SlateDB's monotonic-ID manifest (§13). A flat `graph_manifest.json` updated by -conditional-PUT would give omnigraph O(1) catalog reads and a natural one-writer CAS -**with no fragment-scan / compaction / cleanup treadmill** — structurally cheaper than -the Lance-dataset `__manifest` whose hygiene §9 step 2 exists to maintain. The reason to -keep the Lance-dataset form is the **MTT seam**: `__manifest` is deliberately shaped so -`publish` swaps to Lance `catalog.transaction()` when lance#7264 lands, at which point -Lance owns the cross-table manifest and omnigraph **deletes `__manifest` entirely** — -inheriting Lance's O(1) metadata rather than maintaining its own. A flat-file rewrite -would be a detour *away* from that seam, replaced again by MTT. So the trade is -**"Lance-dataset catalog (compacted, MTT-aligned) over flat-file manifest (locally -cheaper, off the MTT path)"** — defensible, but it means step 2's compaction/cleanup -work is a *bridge cost*, justified only by the MTT endgame; if MTT slips materially, the -flat-file manifest becomes the better target and step 2 stops being a bridge and starts -being permanent overhead. Worth a revisit checkpoint tied to the lance#7264 timeline. - -The MemWAL/LSM ingest tier (`iss-681` **[G]**, `dec-adopt-lance-v7-memwal`) is -**complementary, not competing, and not in flight** (the `memwal-benefit-analysis` -branch is an empty placeholder; the real analysis is commit `c9a81266`). MemWAL -sits *below* the manifest publisher (per-table durability, opt-in, intra-table); -`WriteTxn` owns the cross-table CAS. Build `WriteTxn` first. - ---- - -## 9.
Sequencing - -Ordered by leverage and dependency. **The dominant depth term is the redundant -data-table opens (step 3), not the internal tables (step 2)** — §0; both must land -to flatten the curve. - -1. **Measure first (Tier-1 gate). ✅ LANDED (gate + harness).** *Prerequisite (1a):* - the write opener (`open_dataset_head`) is routed through the instrumented - `open_dataset_tracked` so the gate can count data-table opens (§5.1). The - write cost-budget tests live in `crates/omnigraph/tests/write_cost.rs` on a - **shared, store-agnostic harness** (`tests/helpers/cost.rs`: `measure`/`IoCounts`/ - `assert_flat`/`local_graph`/`s3_graph`) that `warm_read_cost.rs` and - `write_cost_s3.rs` also consume — one vocabulary, no duplicated `IOTracker` - plumbing. The local gate ships green every-PR guards + the RED `#[ignore]`'d - internal-table LOCK (step 2's red→green acceptance). *Still owed:* the prod - `storage.ops` span metric (§5.3) and the bucket-gated `write_cost_s3.rs` opener - LOCK (step 3a's red→green, S3-only per the §9-3a measurement note). -2. **Bound history — bring the INTERNAL tables into optimize/cleanup.** Split into - a compaction half (safe) and a cleanup half (version GC, needs the Q8 watermark). - Validated (Lance docs + source): compaction *preserves* versions and flattens the - per-write metadata *op-count* scan; cleanup is the separate version-deleting op that - opens the Q8 hole. **Latency role — concurrency-dependent, MEASURED (§0(d)):** the - internal fragment scan parallelizes only on a store with free concurrency; under an - R2-realistic cap (8) it serializes and an uncompacted graph *runs away* (per-write - ops 1273→3505, wall 6→16 s), which #291's compaction cuts ~6× and bounds.
So on R2 - step 2a is **both a primary latency lever and the anti-runaway fix**, *and* the - **hard prerequisite for Phase 7 / step 4** (the `graph_head` CAS retry re-runs - `load_publish_state`, only acceptable once `__manifest` is compacted), *and* a - compute-floor/space win. (On an unlimited-concurrency store the latency component - alone vanishes — the depth ops fan out — but R2 is not that store.) **#291 is merged - to main but undeployed; the deployed `f6d2cc03` optimize is node/edge-only, so an - operator `optimize` on prod cannot compact these tables — deploying #291 + running - optimize is the immediate prod win.** - - **2a. Internal-table compaction. ✅ LANDED.** `optimize` now compacts all - three internal tables — `__manifest`, `_graph_commits`, **and - `_graph_commit_actors`** (the actor table grows one fragment per commit on the - authenticated write path, so it carries the same O(depth) scan as the other - two and is compacted from one source-of-truth list with per-table existence - guards). `compact_internal_table` is a separate simpler path than - `optimize_one_table`: no manifest publish, no recovery sidecar. The sidecar-free - property does **not** rest on single-commit atomicity (`compact_files` can emit a - `ReserveFragments` commit before the `Rewrite`, and the auto-cleanup strip is a - further commit) — it holds because each of those commits is content-preserving - and the table is read at HEAD, so a crash leaves it readable and content-identical - and the next `optimize` re-plans. **Non-destructive by construction:** compaction - preserves versions, and before compacting it strips any stale `lance.auto_cleanup.*` - config off the table, so a graph created by an older binary (on-by-default GC - hook) cannot have the commit-time hook silently prune `__manifest`-pinned - versions during an `optimize` (current binaries store no such config; the - strip is the upgrade-path safety net).
**The same strip now also runs on the - data-table path** (`optimize_one_table`), inside the Optimize sidecar window — - so `optimize` is non-destructive on node/edge tables too, not just the internal - ones (the data-table path was a pre-existing gap, since `compact_files`/ - `optimize_indices` there also commit with the auto-cleanup hook enabled). **Concurrency:** - no app lock on the internal path — and `compact_files` does *not* auto-retry a - semantic conflict against a concurrent live writer (Lance prescribes app-rerun for - `Rewrite` vs `Update`/`Merge`), so `compact_internal_table` runs a *bounded* - retry loop that reopens fresh and replans on a retryable Lance conflict (the - canonical Lance-consumer pattern); transient contention does not fail the live - publisher or the operator's `optimize`, but sustained contention past the budget - surfaces a loud conflict error (bounded + observable, not an infinite loop). The - data-table path instead holds the per-table write queue, so it never contends. A - coordinator `refresh` after the compaction restores cache coherence. The - `internal_table_scans_are_flat_in_history` LOCK is now green on the - **authenticated** write path: on a compacted graph a write's - `__manifest`/`_graph_commits`+`_graph_commit_actors` scan is flat in history - (measured `__manifest` 4→2, commit-graph+actors 10→2 across depth 10→100). - Compacts all three tables even though Phase 7 (`iss-991`) will later fold - `_graph_commits` into `__manifest` (one-call throwaway; full interim win until - then). **2a is also the hard prerequisite for Phase 7** (its `graph_head` CAS - contention is only acceptable once `__manifest` compaction bounds the - publisher's `load_publish_state` scan). - - **2b. Internal-table cleanup + Q8 watermark — DEFERRED** (debated; not bundled - with 2a).
Cleanup is the version-deleting op that hits cleanup-resurrection - (§12 Q8: Lance's version CAS has no monotonic guard), so it must land **with** - a durable monotonic watermark (a Lance boundary tag — durable across cleanup, - `cleanup.rs` `is_tagged`). Deferred because it touches the read/open path - (a tag-floor clamp on every coordinator open), is the MTT-redundant part (MTT - may replace `__manifest`), and only buys the secondary version-count/space term - — whereas 2a delivers the dominant per-write scan win with zero resurrection - risk. Land it when the version-count cost bites or the Lance MTT timeline - clarifies. (`gap-read-path-rederivation` write twin.) -3. **The opener fix — a shippable lead + the structural follow-on.** - - **3a. Opener bypass (standalone PR, THE dominant fix — [M] proven). ✅ LANDED.** - `TableStore::open_dataset_head_for_write` now delegates to the direct - `open_dataset_head` opener (`Dataset::open` by URI + `checkout_branch`, routed - through `instrumentation::open_dataset_tracked` so the cost gate can count it; - no-op in prod) instead of the `from_namespace` builder. Measured end-to-end on - the prototype: data term `31 + 12·depth → flat 4`, total `+18 → +5/depth`, - depth-80 **2.7×** (§2.4), functionally correct on main/branch/node. - **Acceptance:** the full `cargo test --workspace --locked` suite passes under the - bypass (the `tests/` integration + `merge_truth_table` + recovery/failpoint - suites the prototype's `--lib` run didn't cover — schema-apply, branch merge, - fork-on-first-write, overwrite). **Namespace retired to test-only:** with both - reads (Fix 2) and now writes bypassing it, *nothing in production routes through - the Lance namespace* — confirming §2.4's premise.
The dead per-table open chain - (`load_table_from_namespace`, `open_table_head_for_write`) was deleted and the - `StagedTableNamespace` contract apparatus gated `#[cfg(test)]`, mirroring the - already-`#[cfg(test)]` read namespace (`BranchManifestNamespace`). **Measurement - note (corrected):** the opener win is **S3-only** — local FS resolves latest with - one cheap `read_dir` regardless of opener, so the namespace-vs-direct difference - is invisible there (the local data-table read count *does* grow with depth, but - that is the merge-insert/RI scan over O(depth) *fragments*, a compaction term, - not the opener; depth-100 = 92 ops identically before and after the bypass). The - opener LOCK therefore lives in the bucket-gated `write_cost_s3.rs`, not the local - `write_cost.rs`. - - **3b. Full `WriteTxn` (capture-once + intra-txn handle reuse + shared Session).** - Formalize 3a's open-once into the pinned, threaded `WriteTxn` (re-resolution - *unrepresentable*, invariant 3) and kill the flat-46 schema-read constant - (resolve/validate-once, §0/§6). (`iss-write-s3-roundtrip-amplification`.) -4. **Phase 7 — lineage into the manifest.** Removes the per-write - `commit_graph.refresh`; commit graph becomes a projection. (`iss-991`.) - **Hard dependency: step 2 must land first (Q1, §12)** — each publisher retry - re-runs the O(history) `load_publish_state` scan, so the `graph_head` CAS - contention Phase 7 introduces is acceptable only once compaction bounds that - scan. Acceptance includes the Q1 concurrent-same-branch-writer gate. - **Carries the §7.1 concurrent write-skew fix.** The `graph_head:` row is - the shared contention point the cross-table read-set-in-CAS needs — proven [M] - that the read-set fix is a no-op without it (§7.1). So the concurrent face of the - write-skew lands *with* this step (or, if §7.1 must ship earlier, behind a minimal - per-branch head row — ~15 lines — or as commit-time re-validation).
The - *sequential* face (`iss-overwrite-orphans-committed-edges`) is independent: - inbound-RI validation on node removal, no head row, ships anytime. -5. **`PublishPlan` unification + recovery off the hot path + epoch fence — the - multi-writer safety guard.** Collapse the four writers; move the `__recovery` list - to open/conflict; add the `writer_epoch` leader-lease. **Motivated by the proven - §6.5 cliff** (multi-process same-branch writers corrupt at concurrency = 2) — this - is the guard that makes multi-writer topologies safe, not optional polish. - **Gated by the §5.5 correctness contract** (the four concurrency tests must land - with it). `writer_epoch` must be a true cross-process conditional CAS — **not** - the local-FS `write_text_if_match` path (§7.1). (`iss-856`, - `iss-merge-recovery-partial-rollforward`, `iss-recovery-sweep-live-writer-rollback`, - `iss-934`.) -6. **Branch ops.** Lance `Clone` for create (`iss-691`); concurrent delete loops. - Measured backbones (§0(c)): create ~77, delete ~87 — op counts scale with table - count, so `Clone` (O(tables)→O(1)) + `buffer_unordered` delete are the fix. - **Note: branch *write* (1777 ops, ~258-hop backbone, 21s compute floor) is NOT a - step-6 item** — it is fork-on-first-write stacked on the main backbone, owned by - **step 3b + the fork seam** (the path 3a skipped); it is the single worst write - shape and should be a named acceptance case for step 3b. -7. **Freeze** investment in publisher/sidecar/fork internals; pursue the MTT - seam (`iss-863`/`iss-864`) as the strategic exit. - -**Land PR #277 first** — it closes `iss-merge-recovery-partial-rollforward` and is -the producer-side half of the `PublishPlan` discipline; the heal-relocation in -step 5 must preserve its merge pre-snapshot heal (`exec/merge.rs:1084-1090`) and -its open-time `IncompletePhaseB → RollBack` (which the per-write heal never -performed anyway). - ---- - -## 10.
Cross-reference map (the ties) - -**Dev-graph items (modernrelay) — what this RFC ties together:** - -- Primary: `iss-write-s3-roundtrip-amplification` (the bug). -- Depth term / Phase 7 (commit graph → manifest-derived projection): `iss-991` - (related: `iss-707` structured commit-graph lineage; `iss-934` Quint - multi-table-publish verification). Read twin: `gap-read-path-rederivation`. -- Substrate seam: `iss-863`, `iss-864`. Decision: `dec-adopt-lance-v7-memwal` - (`iss-681`). -- Recovery: `iss-856`, `iss-recovery-sweep-live-writer-rollback`, - `iss-merge-recovery-partial-rollforward`, `iss-903`, `iss-load-not-crash-safe`. -- Residual migration: `iss-950` (MR-A staged delete, retires D2), `iss-848` - (index-coverage reconciler, owns `create_vector_index`). -- Branch/load: `iss-691`, `iss-677`, `iss-895`, `iss-topology-cross-branch-cache`, - `iss-841`, `iss-982`, `iss-423`, `iss-989`. -- Concurrency correctness (survives MTT) — **two faces, two different fixes [M]** - (§7.1): `iss-ri-write-skew-dangling-edges` (the *concurrent* face; fix = - read-set-in-CAS **+ a shared `graph_head` contention row**, so it's coupled to - step 4 / a minimal head row / commit-time re-validation — NOT a standalone PR) and - `iss-overwrite-orphans-committed-edges` (the *sequential* face; fix = - **inbound-RI validation on node removal**, ships independently, no contention row). - *(`iss-984` — remote branch-merge idempotency — is unrelated; not a write-skew.)* -- Blockers: `blk-lance-6658` (shipped 7.0.0), `blk-lance-6666` (open, vector - index two-phase), `blk-lance-blob-compaction`. -- Epics: `epc-bulk-data-plane`, `epc-lance-v7-migration`, `epc-783` (reliability - harness), `epc-929` (Quint verification). - -**Proposed new dev-graph wiring (not yet written):** - -- New **Epic** `epc-write-path-latency` — owns the cluster of orphaned issues - above (none currently has an epic).
-- New **Gap** `gap-write-path-rederivation` — the write twin of - `gap-read-path-rederivation` (current: write re-derives snapshot + scans - uncompacted internal tables per write; target: capture-once + bounded history). -- New **Issues**: write-side cost-budget gate + prod metric (step 1; prereq 1a - routes all opens through the instrumented opener); **opener bypass — open writes - direct-by-URI, standalone (step 3a, [M] the dominant fix, completes PR #268 Fix 2 - on the write path, §2.4)**; full `WriteTxn` capture-once (step 3b); **add - `__manifest`/`_graph_commits` to `all_table_keys`** for compaction+cleanup (step 2 - — a code change, `optimize.rs:895-904`); `PublishPlan` unification + epoch - (step 5); branch-delete concurrency (step 6). -- **Per-table namespace retired to test-only (step 3a landed).** With reads (Fix 2) - and now writes (step 3a) both opening direct-by-URI, *nothing in production routes - through the per-table `StagedTableNamespace`*. The dead open chain - (`load_table_from_namespace`, `open_table_head_for_write`) was deleted; the - `StagedTableNamespace` struct/impl/factory are now `#[cfg(test)]`, mirroring the - already-`#[cfg(test)]` read namespace (`BranchManifestNamespace`). Both are retained - only to validate the `LanceNamespace` contract in unit tests. *Production catalog / - managed-versioning commit coordination for `__manifest` itself goes through a - **separate** namespace (`GraphNamespacePublisher`), unaffected by this change.* The - former follow-up to harden `StagedTableNamespace::list_table_versions` - (`checkout_version` per version, O(depth)) is now purely a test-hygiene note — no - prod caller can hit it; if any future version-list / time-travel feature needs - per-table version enumeration, build `TableVersion`s from `versions()` metadata - directly rather than resurrecting the namespace open path.
-- New **Decision** `dec-writetxn-manifest-authoritative-publish` — records this - RFC's design choice and the MTT-seam stance. - -**Key source locations (v0.7.0):** -`omnigraph.rs:561-568,739-779,1317-1389`; `table_ops.rs:505-609`; -`table_store.rs:157-280,282-341,797`; `loader/mod.rs:197,400,485,557`; -`exec/mutation.rs:725`; `exec/merge.rs:1084-1090`; -`db/manifest/publisher.rs:51,93-124,356-371,385,432-440,448-490`; -`exec/mutation.rs:640-673` (D2 rule); `db/manifest/state.rs:44-72,133-141`; `db/manifest/layout.rs:22-26`; -`db/manifest/namespace.rs:111-112` (read open, O(1)),`:357-385`/`:362` (`describe_table` → redundant `Dataset::open` — the write-path double-open),`:158-186,544-550` (write open via `from_namespace`),`:395-427` (`list_table_versions` per-version checkout — test-only O(depth), the §10 follow-up); -`db/manifest/recovery.rs:762,978-988,1522`; `db/commit_graph.rs:136-164,213-272`; -`db/omnigraph/optimize.rs:240,517,895-904`; `instrumentation.rs:37,112-131`; -`runtime_cache.rs:202-283`; `tests/warm_read_cost.rs` (the read-side gate to mirror). - -**Upstream:** lance#7264/#7263/#7185 (MTT); Lance `with_version` O(1) open -(`from_namespace` → `describe_table`, `builder.rs:130-178`; `default_resolve_version` -= one HEAD, `commit.rs:939-981`; version-hint PR #6752), -`list_is_lexically_ordered = !is_s3_express` (`aws.rs:183`), -`IOTracker`/`assert_io_*`/`num_stages`, `test_commit_iops`, -`test_commit_uses_version_hint_on_non_lexical_store`; **lance-namespace** design -(`namespace/index.md`, `operations/index.md` — catalog/discovery layer, resolve -once); LanceDB `io_tracking.rs`, `test_reload_resets_consistency_timer`; SlateDB -`FenceableTransactionalObject` (epoch fence), `InstrumentedObjectStore`, -monotonic-ID manifest.
- -**Reproduce the §0(b) network measurement:** `rustfs` (S3-compat) on `:9000` -behind a ~90-LoC Go counting proxy on `:9100` (adds `LATENCY_MS`, preserves the -SigV4 `Host` header, `/__ctl/reset` + `/__ctl/stat`); an omnigraph cluster on -`s3://…/cluster` through the proxy. Single-write breakdown: reset the proxy log, -`load --mode merge` one edge, classify by S3 key. Depth slope: write N× to main, -diff the per-write log at depth D vs D+20 by table. Native baseline: pylance 7.0.0 -`write_dataset(mode="append")` in a loop → flat 6 ops/append at any depth. - ---- - -## 11. Drawbacks, alternatives, reversibility - -**Drawbacks.** Phase 7 makes disjoint-table same-branch writers contend on the -`graph_head:` row (they don't today) — bounded by the Lance retry budget, -inherent to a linear per-branch DAG, gated on a measured concurrency test and on -step 2 landing first (§12 Q1, resolved). **Reframe [M]: this contention is -load-bearing for correctness, not merely a throughput tax.** The §7.1 write-skew is -*unreachable only because* the shared head row forces disjoint cross-table writers to -overlap, conflict, retry, and re-evaluate their read-set pins against fresh state -(proven — without it the scoped CAS fix is a no-op). So §7.1 and the head row are -**coupled**: the "drawback" is exactly what buys the cross-table invariant, and the -throughput ceiling (1 writer/branch, bounded by `PUBLISHER_RETRY_BUDGET`) is a -**correctness requirement** the moment §7.1 ships, not an optional Phase-7 side -effect. `PublishPlan` is a non-trivial refactor of four writers; it must land behind -the cost gate and the `merge_truth_table`/recovery/failpoint suites. - -**Alternatives.** (A) *Caching band-aid only* — memoize schema validation, cache -opens within a request: ~30–50% fewer round-trips but leaves open-by-latest and -the O(history) terms. Mitigation, not a fix.
(B) *Opener bypass only* (open -direct-by-URI+version, no full txn) — **kills the dominant depth term, now measured -[M]**: a one-line patch flattened the data term `31+12·depth → flat 4` and cut a -depth-80 edge **2.7×** (§2.4), leaving only the secondary internal-table term and -the writer unification. (C) *Full design (this RFC)* — correctness by construction. -(D) *Wait for Lance MTT* — future exit, not a current dependency (§8). -**Recommend: ship B as a standalone PR first (behind the step-1 gate), then C for -the constant-factor + correctness, then step 2 for the internal residual; D as the -strategic end-state.** B is the demonstrated dominant fix, not a partial one. - -**Reversibility.** The interface (`WriteTxn`/`PublishPlan`) is internal and -reversible. Phase 7's new `__manifest` object types (`graph_commit`, -`graph_head`) are an **on-disk format addition** — additive (old binaries skip -unknown `object_type`s) but near-permanent; it earns its own validation pass -(forward/back-compat, the validation checklist in the `iss-991` handoff). The -`writer_epoch` is likewise a durable manifest field. Everything else (compaction -scheduling, recovery relocation, branch concurrency, the cost gate) is cheap to -undo. - ---- - -## 12. Resolved questions (was: unresolved) - -All five original open questions were investigated read-only against post-#277/#284 -`origin/main`, upstream Lance 7.0.0, and the dev graph; each is resolved below. One -new item (Q6), surfaced by peer review, remains genuinely open. - -1. **`graph_head` CAS contention → RESOLVED, gated on step 2 + a concurrency test.** - Retry is publisher-owned; Lance's internal rebase-retry is disabled - (`conflict_retries(0)`, `publisher.rs:385`) → no double-retry. Row-CAS is true - one-winner (`TooMuchWriteContention` → retryable, `publisher.rs:432-440`), - bounded by `PUBLISHER_RETRY_BUDGET = 5`.
**But each retry re-runs the O(history) - `load_publish_state` scan (`publisher.rs:455`)**, so `graph_head` contention - multiplies the manifest term — **step 2 (compaction) is a hard prerequisite for - step 4 (Phase 7)**. Same-branch is the real workload (the incident is concurrent - `main` writes). Residual: a measured gate before Phase 7 — N≈100 concurrent - same-branch writers, assert bounded retry + O(working-set) re-scan + P99 within - SLA. Fallback: batched-lineage, or Alternative B (defer lineage-in-manifest). -2. **Recovery grace-window → RESOLVED.** PR #284 is **unrelated** (cluster-apply - trap; zero `recovery.rs` changes). The dangerous rollback classifications - (NoMovement / UnexpectedAtP1 / UnexpectedMultistep / #277's IncompletePhaseB) - fire only at the open-time Full sweep; the per-write heal defers all rollback - (`recovery.rs:978-988`), so moving the heal off the hot path doesn't break #277. - A sidecar-age grace window (defer sidecars younger than T_grace, loud typed - skip, `repair` override) on the existing `created_at`/ULID - (`recovery.rs:762`/`:1522`) is the sound interim; the permanent fix is the - in-process background reconciler `iss-856`. Lands step 5 with a failpoint test. -3. **Epoch fence × publisher CAS → RESOLVED (by construction).** With Lance retry - off (Q1), the publisher loop is the only retry layer. Model `writer_epoch` as a - **pre-publish hard-fail gate** beside `check_expected_table_versions` - (`publisher.rs:462`) but non-retryable (a stale epoch is a protocol violation, - not a race). No double-retry; the epoch gate and the row-CAS loop are - sequential. SlateDB `FenceableTransactionalObject` is the precedent. -4. **Compaction cadence → RESOLVED.** Not `auto_cleanup` (GCs pinned versions). - Not foreground every-N-writes (deny-list job-queue + invariant 7 + cost cliff).
- Minimum (step 2): extend `optimize`/`cleanup` to the internal tables AND node/edge - version cleanup — no special-casing (`SidecarKind::Optimize` already covers a - mid-compaction crash). Follow-up: an `iss-856`-shaped background reconciler - triggered by a cheap fragment-count probe (work off the hot path; a reconciler, - not a job queue — deny-list-clean; SlateDB's epoch-coordinated compactor is the - precedent). CLI `omnigraph optimize` stays the operator override. -5. **`PublishPlan` residuals → RESOLVED.** Both `delete_where` and - `create_vector_index` are representable as `TableAction` variants with existing - sidecar coverage (`SidecarKind::Mutation`/`EnsureIndices`) and are - content-preserving (roll-forward safe). `TableAction::Delete` migrates to staged - two-phase via MR-A / `iss-950` (now unblocked — `blk-lance-6658` shipped); **D2 - retires then** (`enforce_no_mixed_destructive_constructive`, - `exec/mutation.rs:640-673`). `TableAction::CreateVectorIndex` stays inline until - `blk-lance-6666` ships (`iss-848` reconciler path). - -**Resolved post-review:** - -6. **The exact mechanism of the data-table chain re-read → RESOLVED (§0, §2.4).** - Pinned by Lance-source trace + proxy + pylance isolation **[U]**: it is **not** - `checkout_version` (O(1)) and **not** merge-insert conflict replay. The write - open goes through `DatasetBuilder::from_namespace` (`namespace.rs:174`), whose - `describe_table` opens the whole dataset just to return a location - (`namespace.rs:362`/`:112`) and whose `.load()` resolves latest **again** — a - double latest-resolution per open, ~13× per write, nothing cached. The open - resolves latest **without the V2 lexical / version-hint fast path** the direct - opener uses (likely the un-threaded `Session`/store params, - `load_table_from_namespace` `namespace.rs:174` — inferred, not traced), so it is - O(depth) where a direct `from_uri().with_version(N)` is O(1).
**The mechanism - question is now academic for the fix:** bypassing `from_namespace` makes the open - flat regardless of the precise sub-mechanism (un-threaded `Session` / double - resolve / missed hint) — the bypass is the answer. (`list_table_versions` is - **not** on this path — test-only; §10 follow-up.) `checkout_version` stays - exonerated. - -**Resolved end-to-end [M]:** - -7. **End-to-end prototype of step 3 → DONE, measured (§2.4 before/after).** A - prototype patched the opener (`open_dataset_head_for_write`, `table_store.rs:174`) - to bypass `from_namespace` and open direct-by-URI, rebuilt v0.7.0, and re-ran the - sweep: the data term collapsed `31 + 12·depth → flat 4`, total `+18/depth → - +5/depth` (residual = the two internal tables, step 2), depth-80 **1618 → 593 ops - (2.7×)** — functionally correct on main edge merge, branch create+write+read, and - node merge. So step 3's "closes the dominant term outright" is **measured, not - inferred**, and the opener bypass is **shippable standalone** (§9 step 3a). - **Remaining (not blockers for step 3a's thesis):** the prototype did not cover - schema-apply / branch merge / fork-on-first-write / overwrite / concurrent — a - production opener change must pass the full `merge_truth_table`/recovery/failpoint - suite; and the internal-table cleanup demo (step 2) + the concurrency - fault-injection harness (steps 4/5) are still owed. - -**Newly surfaced (open):** - -8. **CAS-resurrection after cleanup → CONFIRMED VULNERABLE [S]; boundary watermark - is a HARD PREREQUISITE for step 2.** SlateDB found this race (RFC-0026 / issue - #352): a writer that stalls between computing manifest id `N+1` and creating it - can, *after GC deletes `N+1`*, re-create it and observe **false success**.
- Lance 7.0.0 was traced directly and is **not immune**: version creation is a plain - `put_opts(naming_scheme.manifest_path(base, version), PutMode::Create)` / - `rename_if_not_exists` (`lance-table commit.rs:1421-1437`, `:1358`) on a - version-numbered, **pruneable** path, with **no monotonic/boundary/watermark - guard** anywhere in the manifest/commit/dataset path; `cleanup_old_versions` is - **timestamp-based** (`cleanup.rs:1086`), so it deletes the very file the only - guard (AlreadyExists→rebase) relies on. A stalled publisher whose target version - was pruned by step-2 cleanup gets a `PutMode::Create` **success on a non-existent - version → false success.** Severity by store: **R2/S3 (lexical, prod) = a silent - lost write** (the resurrected version doesn't win V2 latest-resolution, so data - lands on a dead branch while the publisher believes it committed); non-lexical = - the version hint (`commit.rs:1439`) is overwritten to the stale version and - trusted (worse, but not the prod path). **Action:** step 2 ships **only with** a - durable **boundary/floor watermark** (GC advances it before deleting; every writer - rejects `id <= boundary` after a "successful" create — SlateDB's fix), which also - bounds any list-then-read-latest fallback. This was "lowest-risk earliest item"; - it is now gated (§9 step 2). - ---- - -## 13. External validation (subagents + literature) - -Validated read-only against OSS prior art and the DB/distributed-systems canon: - -- **SlateDB** (canonical object-store LSM) — tenet-by-tenet ✅ on capture-once - snapshot, monotonic-ID manifest (no pointer file — *explicitly rejected* in their - RFC-0001), the **epoch fence** (exact match: `FenceableTransactionalObject`, - hard-fail, TTL-lease *explicitly rejected* — adopt as specified), background - epoch-coordinated compaction/GC, and recovery-on-open.
**Adopt-items OmniGraph is - missing / under-specifies:** (1) the **boundary-file** CAS-resurrection guard (Q8); - (2) **group-commit batching** — coalesce pending `PublishPlan`s into one manifest - CAS, directly mitigating the Q1 / §6.5 contention; (3) SlateDB *peels* compaction - state *out* of the manifest (RFC-0013) — the **opposite** of Phase 7's fold-*in*; - §11 should defend "fold-in (lineage must be atomic with visibility) beats peel-out - for us"; (4) **write back-pressure** when cleanup lags (`l0_max_ssts`). **Citation - correction:** SlateDB has the per-RPC counter (`InstrumentedObjectStore`) but - **not** the flatness-across-history gate — the depth-swept Tier-1 gate (§5.1) is - OmniGraph-novel; cite it that way. -- **Literature** — OCC/MVCC (Kung-Robinson 1981; DDIA ch.7), ARIES redo/undo, the - fencing-token canon (Kleppmann — whose motivating example *is* OmniGraph's - S3-read-modify-write-paused-past-lease scenario), and the lakehouse genre (Delta - Lake VLDB 2020, Iceberg spec, Neon). The spine — OCC-over-MVCC + one atomic - manifest CAS + WAL-of-intent recovery + monotonic-epoch fence — is canon-blessed, - and OmniGraph **exceeds** Delta/Iceberg on the axis that matters (both are - explicitly *single-table*-transactional; the manifest CAS delivers the atomic - *cross-table* commit Delta only speculates about). The three scoping caveats are in - §7.1. -- **HelixDB** (embedded LMDB graph DB) — too different a substrate to validate the - object-store machinery (LMDB's `commit()` subsumes tenets #2–#8 for free), but it - **corroborates tenet #1** (capture-once, thread-by-reference, re-resolution - unrepresentable — its `&mut RwTxn`-threaded traversal is the embedded twin of - `WriteTxn`) and confirms the bug class is **substrate-induced**. Portable idea for - the roadmapped traversal work: adjacency as a *persisted, sorted, - label-partitioned projection* keyed by `(node, label)` (vs the cold-rebuilt - `TypeIndex`).
diff --git a/docs/dev/runs.md b/docs/dev/runs.md new file mode 100644 index 0000000..816f2ac --- /dev/null +++ b/docs/dev/runs.md @@ -0,0 +1,277 @@ +# Runs — REMOVED (MR-771) + +The Run state machine and `__run__` staging branches were removed in +MR-771. `mutate_as` and `load` now write **directly to the target table** +and call `ManifestBatchPublisher::publish` once at the end with +`expected_table_versions` (the per-table manifest versions captured before +the first write). Cross-table OCC is enforced inside the publisher; the +publisher's row-level CAS on `__manifest` is the single fence. + +## What this means in practice + +- No `RunRecord`, no `_graph_runs.lance`, no `_graph_run_actors.lance`. +- No `omnigraph run *` CLI subcommands and no `/runs/*` HTTP endpoints. +- No `__run__` staging branches. (Legacy on-disk artifacts from + pre-MR-771 repos are inert; MR-770 sweeps them in production.) +- Cancelled mutation futures leave **no graph-level state** — only orphaned + Lance fragments, which the existing `omnigraph cleanup` pipe reclaims. + +## Read-your-writes within a multi-statement mutation + +A `.gq` query with multiple ops (e.g. `insert Person … insert Knows …`) +must observe earlier ops' writes when validating later ops (referential +integrity, edge cardinality). After MR-794 step 2+ this is implemented +via an in-memory `MutationStaging` accumulator in +[`crates/omnigraph/src/exec/staging.rs`](../../crates/omnigraph/src/exec/staging.rs), +shared by both `mutate_as` and the bulk loader: + +- On the first touch of each table, the pre-write manifest version is + captured into `expected_versions[table_key]` (the publisher's CAS + fence at end-of-query). +- Each insert/update op pushes a `RecordBatch` into the per-table + pending accumulator. Lance HEAD does **not** advance during op + execution.
+- Read sites (validation, predicate matching for `update`) consume + `TableStore::scan_with_pending`, which scans committed via Lance + and applies the same SQL filter to the pending batches via DataFusion + `MemTable`. Same-query writes are visible to subsequent reads. +- At end-of-query, `MutationStaging::finalize` issues exactly one + `stage_*` + `commit_staged` per touched table (concatenating + accumulated batches; merge-mode dedupes by `id`, last-write-wins), + and the publisher publishes the manifest atomically across all + touched sub-tables. Cross-table conflicts surface as + `ManifestConflictDetails::ExpectedVersionMismatch`. +- **Deletes still inline-commit.** Lance's `Dataset::delete` is not + exposed as a two-phase op in 4.0.0; deletes go through `delete_where` + immediately and record their post-write state in + `MutationStaging.inline_committed`. The parse-time D₂ rule (below) + prevents inserts/updates from coexisting with deletes in one query, + so the inline path is safe for delete-only mutations. + +This upholds the manifest-atomic mutation and read-your-writes invariants +tracked in [docs/dev/invariants.md](invariants.md). + +### D₂ — parse-time mixed-mode rejection + +A single mutation query is either insert/update-only or delete-only. +Mixed → rejected at parse time with a clear error directing the user to +split the query. Reason: mixing creates ordering hazards +(insert→delete on the same row would silently no-op because the staged +insert isn't visible to delete; cascading deletes of just-inserted +edges break referential integrity). Until Lance exposes a two-phase +delete API, the parse-time rejection keeps both paths atomic and +correct. Tracked: MR-793, plus a Lance-upstream ticket.
+ +### MR-793 status (storage trait two-phase invariant) — partial + +MR-793 hoists the staged-write pattern into a `TableStorage` trait +surface with sealed-trait enforcement and opaque `SnapshotHandle` / +`StagedHandle` types — see `crates/omnigraph/src/storage_layer.rs`. +The trait is the canonical surface for new engine code; existing call +sites still use the inherent `TableStore` methods (mechanical migration +deferred to a follow-up cycle — tracked). + +Three writers have been migrated onto staged primitives: + +* **`ensure_indices`** (`db/omnigraph/table_ops.rs::build_indices_on_dataset_for_catalog`) + — scalar indices (BTree, Inverted) now use `stage_create_*_index` + + `commit_staged`. Vector indices stay inline (residual — Lance + `build_index_metadata_from_segments` is `pub(crate)` in 4.0.0; + companion ticket to lance-format/lance#6658 needed). +* **`branch_merge::publish_rewritten_merge_table`** + (`exec/merge.rs`) — merge_insert now uses `stage_merge_insert` + + `commit_staged`. Deletes stay inline (Lance #6658 residual). +* **`schema_apply` rewritten_tables** (`db/omnigraph/schema_apply.rs`) + — non-empty rewrites use `stage_overwrite` + `commit_staged`. + Empty-batch rewrites stay inline (Lance `InsertBuilder::execute_uncommitted` + rejects empty data; the empty case is rare and bounded by the + schema-apply lock branch). + +A defense-in-depth integration test (`tests/forbidden_apis.rs`) walks +engine source and fails if non-allow-listed code calls Lance's +inline-commit APIs directly. The trait surface itself is the primary +enforcement (sealed + only-callable-via-trait once call sites land); +the grep test catches type-system bypass attempts. + +The "finalize → publisher residual" described below applies equally to +the migrated writers — Lance has no multi-dataset atomic commit +primitive, so the per-table commit_staged → manifest publish gap is +the same drift class.
Closing it requires either upstream Lance +multi-dataset commit OR the omnigraph-side recovery-on-open reconciler +described in `.context/mr-793-design.md` §15 (deferred to MR-795). + +### Inline-commit method residuals on `TableStorage` (MR-793 acceptance §1 option b) + +MR-793's acceptance criterion §1 ("`TableStore` public API has no method that performs a manifest commit as a side effect of writing") is met **per-method** by enumerating every inline-commit method that remains on the trait surface, naming why it cannot yet be removed, and keeping the residual comment at every call site: + +| Method on `TableStore` | Inline-commit reason | Closes when | +|---|---|---| +| `delete_where` | `DeleteJob` is `pub(crate)` in lance-4.0.0 — no public two-phase delete API | [lance-format/lance#6658](https://github.com/lance-format/lance/issues/6658) lands and `stage_delete` joins the trait | +| `create_vector_index` | Vector indices take Lance's "segment commit path"; the helper `build_index_metadata_from_segments` is `pub(crate)` | [lance-format/lance#6666](https://github.com/lance-format/lance/issues/6666) lands and `stage_create_vector_index` joins the trait | +| `append_batch` | Legacy inherent method; some engine call sites haven't migrated to `stage_append + commit_staged` yet | MR-793 Phase 1b (call-site conversion) + Phase 9 (demote to `pub(crate)`) | +| `merge_insert_batch` / `merge_insert_batches` | Legacy inherent method | Same — Phase 1b + Phase 9 | +| `overwrite_batch` | Legacy inherent method | Same — Phase 1b + Phase 9 | +| `create_btree_index` (inherent) | Legacy inherent method (the migrated callers use `stage_create_btree_index` + `commit_staged`; the inherent stays for tests / un-migrated paths) | Same — Phase 1b + Phase 9 | +| `create_inverted_index` (inherent) | Same | Same — Phase 1b + Phase 9 + index-class split (MR-848) | +| `truncate_table` (inherent on `TableStore`) | Used by `overwrite_batch` internally | Phase 9 | + +After 
**lance#6658 + lance#6666 ship + MR-793 Phase 1b + MR-793 Phase 9 all complete**, the trait surface exposes only staged-write primitives + `commit_staged`. Until then this matrix names every residual explicitly, every call site carries a one-line residual comment, and no engine code outside `table_store.rs` is permitted to reach the inline-commit Lance APIs (enforced by the `tests/forbidden_apis.rs` guard). + +### `LoadMode::Overwrite` residual + +The bulk loader's Append and Merge modes use the staged-write path +described above. `LoadMode::Overwrite` keeps the legacy inline-commit +path: truncate-then-append doesn't fit the staged shape cleanly in +Lance 4.0.0, and overwrite has no in-flight read-your-writes +requirement (the prior data is being wiped). A mid-overwrite failure +can leave Lance HEAD on a partially-truncated table; the next overwrite +will replace it. Operator-driven (rare in agent workloads); document +permanently until Lance exposes `Operation::Overwrite { fragments }` as +a two-phase op. + +### Open-time recovery sweep + +The staged-write rewire eliminates one drift class **by construction at +the writer layer**: an op that fails before pushing to the in-memory +accumulator (validation errors, missing endpoints, parse-time D₂ +rejection) leaves Lance HEAD untouched on every staged table. This is +the case the `partial_failure_leaves_target_queryable_and_unblocks_next_mutation` +test pins. + +A second, narrower drift class — the **finalize → publisher window** — +is closed across one open cycle by the open-time recovery sweep: + +`MutationStaging::finalize` runs `stage_*` + `commit_staged` per touched +table sequentially, then the publisher commits the manifest. 
Lance has +no multi-dataset atomic commit, so the per-table `commit_staged` calls +are independent operations: if commit_staged on table N+1 fails *after* +commit_staged on tables 1..N succeeded, or if the publisher's CAS +pre-check rejects *after* every commit_staged succeeded, tables 1..N +are left at `Lance HEAD = manifest_pinned + 1`. + +**Recovery protocol** (lifecycle of every staged-write writer — +`MutationStaging::finalize`, `schema_apply::apply_schema_with_lock`, +`branch_merge_on_current_target`, `ensure_indices_for_branch`): + +1. **Phase A**: writer writes a sidecar JSON to + `__recovery/{ulid}.json` BEFORE its first `commit_staged`. The + sidecar names every `(table_key, table_path, expected_version, + post_commit_pin)` it intends to commit + the writer kind + + actor_id. +2. **Phase B**: writer's per-table `commit_staged` loop runs. +3. **Phase C**: publisher commits the manifest. +4. **Phase D**: writer deletes the sidecar. + +> **Phase letter convention.** Throughout the recovery code, log +> messages, failpoint names (e.g. `branch_merge.post_phase_b_pre_manifest_commit`), +> and the per-writer integration tests, "Phase A/B/C/D" refers +> exclusively to the four-step lifecycle above. The per-table +> staged-write contract (`stage_*` then `commit_staged`, two steps) +> is referred to by those API verbs — never by phase letters — so a +> reader of `recovery.rs`, `failpoints.rs`, or this document only +> encounters phase letters in the per-writer context. + +A failure between Phase A and Phase D leaves the sidecar on disk. The +next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the +recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: + +- For each sidecar in `__recovery/`, compare every named table's + Lance HEAD to the manifest pin. Classify per the all-or-nothing + decision tree (RolledPastExpected / NoMovement / UnexpectedAtP1 / + UnexpectedMultistep / InvariantViolation). 
+- If any table is `InvariantViolation` (Lance HEAD < manifest pinned — + should be impossible), **abort** with a loud error and leave the + sidecar on disk for operator review. +- Otherwise, if every table is `RolledPastExpected`, **roll forward**: + a single `ManifestBatchPublisher::publish` call extends every pin + atomically. `SchemaApply` sidecars are eligible only when schema-state + recovery promoted the matching staging files in the same recovery pass; + otherwise full open-time recovery rolls them back and refresh-time + recovery leaves them for the next read-write open. +- Otherwise **roll back**: per-table `Dataset::restore` to the + manifest-pinned table version for that branch. Rollback records the + actual restore target in the audit row's `to_version`. +- After a successful roll-forward or roll-back, an audit row is + recorded — `_graph_commits.lance` carries + a commit tagged `actor_id = "omnigraph:recovery"`, and a sibling + `_graph_commit_recoveries.lance` row carries `recovery_kind`, + `recovery_for_actor` (the original sidecar's actor), `operation_id`, + per-table outcomes. Operators run `omnigraph commit list --filter + actor=omnigraph:recovery` to find recoveries. +- Sidecar deleted as the final step. + +Triggers for the residual: transient Lance write errors during finalize +(object-store retry budget exhaustion, disk full); persistent publisher +contention exceeding `PUBLISHER_RETRY_BUDGET = 5` retries. + +**Long-running servers**: `Omnigraph::refresh` runs roll-forward-only +recovery in-process — the common Phase B → Phase C residual closes +without a restart. The next mutation on the same handle (after refresh) +no longer surfaces `ExpectedVersionMismatch` for the failed table. 
+Sidecars that would require a `Dataset::restore` (mixed / unexpected +state) are deferred to the next `OpenMode::ReadWrite` open: restore is +unsafe under concurrency because Lance's `check_restore_txn` accepts +the restore against in-flight Append/Update/Delete commits and +silently orphans them (pinned by +`tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning`). +Continuous in-process recovery for the rollback path is the goal of a +future background reconciler with per-(table, branch) writer-queue +acquisition. + +The publisher-CAS contract is unchanged: a *concurrent writer* that +advances any of our touched tables between snapshot capture and +publisher commit produces exactly one winner. The residual above is +about *our* abandoned commits in the failure path, not about +concurrency races. + +## Conflict shape + +Concurrent writers to the same `(table, branch)` produce exactly one +success and one failure. The losing writer's error is +`OmniError::Manifest` with kind `Conflict` and details +`ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected, +actual }`. The HTTP server maps this to **409 Conflict** with body +`{"error": "...", "code": "conflict", "manifest_conflict": { "table_key": +"...", "expected": N, "actual": M }}` — see [docs/user/server.md](../user/server.md). + +## Audit + +`actor_id` lands in `_graph_commits.lance` via `record_graph_commit` (no +intermediate run record). Audit history is queried via `omnigraph commit +list`. + +## Migration code + +`db/manifest/migrations.rs` does not change. Active deletion of +`_graph_runs.lance` belongs in MR-770 (the production sweep) — this PR +stops *creating* run state but does not destroy legacy bytes on disk. 
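As a rough illustration of the conflict shape described in the "Conflict shape" section above, the sketch below maps an expected-version mismatch to a 409 status and a JSON body with the documented keys. The struct, the `conflict_response` helper, the `u64` version type, the wording of the `error` message, and the `node:Person` key used in the usage note are all assumptions made for this example; the server's real handler and serializer are not shown in this document.

```rust
/// Stand-in for the details named in the text; field types are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ExpectedVersionMismatch {
    table_key: String,
    expected: u64,
    actual: u64,
}

/// Minimal JSON string escaping for the sketch: quotes, backslashes,
/// and ASCII control characters.
fn json_escape(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for ch in raw.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical mapping from the conflict details to (status, body).
fn conflict_response(details: &ExpectedVersionMismatch) -> (u16, String) {
    // The message text is illustrative; the document elides the real one.
    let message = format!(
        "table {} expected version {} but found {}",
        details.table_key, details.expected, details.actual
    );
    let body = format!(
        "{{\"error\": \"{}\", \"code\": \"conflict\", \"manifest_conflict\": {{\"table_key\": \"{}\", \"expected\": {}, \"actual\": {}}}}}",
        json_escape(&message),
        json_escape(&details.table_key),
        details.expected,
        details.actual
    );
    (409, body)
}
```

A losing writer on a hypothetical `node:Person` table with `expected = 3` and `actual = 4` would get status 409 and a body whose `manifest_conflict` object carries those three fields, which a client can use to decide whether to refresh and retry.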
+ +## Mid-query partial failure: closed by MR-794 + +The pre-MR-794 design had a known limitation: a multi-statement `.gq` +mutation where op-N inline-committed a Lance fragment and op-N+1 then +failed left the touched table at `Lance HEAD = manifest_version + 1`, +blocking the next mutation with `ExpectedVersionMismatch`. + +MR-794 (step 1 + step 2+) closed this for inserts/updates **by +construction at the writer layer**: insert and update batches accumulate +in memory; no Lance HEAD advance happens during op execution; one +`stage_*` + `commit_staged` per touched table runs at end-of-query, and +only after every op succeeded. A failed op leaves Lance HEAD untouched +on the staged tables, so the next mutation proceeds normally with no +drift to reconcile. + +The cancellation case (future drop mid-mutation) inherits the same +guarantee — the in-memory accumulator evaporates with the dropped task +and no Lance write was ever issued. + +For delete-touching mutations the legacy inline-commit shape is +preserved (Lance has no public two-phase delete in 4.0.0) — the same +narrow window remains. The parse-time D₂ rule prevents inserts/updates +from coexisting with deletes in one query, so a pure-delete failure +cannot drift any staged-table state. If a delete-only multi-table +mutation fails mid-cascade, the same workaround as before applies +(retry; rely on `omnigraph cleanup` once a later successful commit +moves HEAD past the orphan version). Closing this requires Lance to +expose `DeleteJob::execute_uncommitted`; tracked in MR-793 and a +Lance-upstream ticket. diff --git a/docs/dev/testing.md b/docs/dev/testing.md index 1b480f4..e6989ba 100644 --- a/docs/dev/testing.md +++ b/docs/dev/testing.md @@ -6,10 +6,9 @@ This file is the always-on map of the test surface. 
**Consult it before every ta | Crate | Path | Style | |---|---|---| -| `omnigraph` (engine) | `crates/omnigraph/tests/` | Integration tests (28 files), fixture-driven, share `tests/helpers/mod.rs` | -| `omnigraph-cli` | `crates/omnigraph-cli/tests/` | Per-area suites (post-modularization): `cli_cluster.rs` (cluster command surface + operator-actor cascade), `cli_cluster_e2e.rs` (spawned-binary lifecycle compositions β€” lost-state re-import recovery, out-of-band drift, graph-root destruction, multi-graph mixed-disposition convergence), `cli_data.rs` (load/read/change/branch/commit/export/snapshot/policy/embed/maintenance + operator format cascade), `cli_schema_config.rs` (init/config, schema plan/apply), `cli_queries.rs`, `parity_matrix.rs` (RFC-009 Phase 1: the embedded-vs-remote referee β€” every forked verb run against both arms with matched Cedar policy and the same actor, scrubbed-JSON + exit-code equality; divergences are pinned in its `KNOWN_DIVERGENCES` ledger, never silently repaired), `system_local.rs` (full-cycle cluster lifecycle with a spawned `--cluster` server, applied-policy enforcement over HTTP, keyed-credential auth, operator aliases), `system_remote.rs`; share `tests/support/mod.rs` (hermetic `OMNIGRAPH_HOME` by default) | -| `omnigraph-cluster` | mostly in-source `#[cfg(test)] mod tests`; `tests/failpoints.rs` (feature-gated); `tests/s3_cluster.rs` (bucket-gated full lifecycle on object storage) | Cluster config parser, local JSON state diff, state CAS/lock handling/recovery, read-only validate/plan/status plus explicit refresh/import graph observations, config-only apply (content-addressed payload publish, disposition gating, composite-digest convergence, idempotent re-apply), catalog payload verification (status read-only, refresh drift + self-heal), failpoint crash-mid-apply / CAS-race coverage, Stage 4A graph creation (create executor, recovery sidecars + sweep rows, create crash windows), Stage 4B schema apply (migration previews in plan, 
schema executor, schema-apply sweep classification, schema crash windows), Stage 4C gated deletes (digest-bound approvals, delete executor + tombstones, delete sweep rows, delete crash windows), and 5A policy binding metadata (applies_to in the applied revision, binding-change diffing + convergence, pre-5A backfill), and the 5B serving-snapshot read API (converged read, refusal rows) | -| `omnigraph-server` | `crates/omnigraph-server/tests/` | Per-area suites (post-modularization): `auth_policy.rs`, `data_routes.rs`, `schema_routes.rs`, `stored_queries.rs`, `multi_graph.rs` (cluster-mode boot β€” converged serving, policy binding wiring, boot refusals β€” + the concurrent branch-ops matrix), `boot_settings.rs` (mode inference, PolicySource), `s3.rs` (bucket-gated: single-graph serving + config-free `--cluster s3://` boot), `openapi.rs` (OpenAPI drift / regeneration); share `tests/support/mod.rs` | +| `omnigraph` (engine) | `crates/omnigraph/tests/` | Integration tests (21 files), fixture-driven, share `tests/helpers/mod.rs` | +| `omnigraph-cli` | `crates/omnigraph-cli/tests/` | `cli.rs` (unit-ish), `system_local.rs`, `system_remote.rs`, share `tests/support/mod.rs` | +| `omnigraph-server` | `crates/omnigraph-server/tests/` | `server.rs` (HTTP-level), `openapi.rs` (OpenAPI drift / regeneration) | | `omnigraph-compiler` | mostly in-source `#[cfg(test)] mod tests` | Parser, type-checker, IR lowering, lint | The engine's `tests/` is the principal coverage surface; most graph-shaped behavior is exercised there. @@ -21,34 +20,24 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav | `end_to_end.rs` | Full init β†’ load β†’ query/mutate flow | | `branching.rs` | Branch create / list / delete, lazy fork | | `merge_truth_table.rs` | Merge-pair truth table (MR-786): all 9Γ—9 `(left_op, right_op)` cells from `{noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel}`. 
Adding a new op to `OpVariant` forces a compile error in `build_case` until the new row + column are dispositioned. 36 executable cells run through real `branch_merge` with a structured oracle (`MergeOutcome` / `MergeConflictKind` + graph-state assert); 45 cells involving `dropProperty`/`addLabel`/`removeLabel` are recorded as `Unsupported` until the mutation grammar grows. | -| `writes.rs` | Direct-publish writes: cancellation, non-strict insert/merge rebase under the per-table queue, strict stale-write conflicts, multi-statement atomicity, MR-794 staged-write rewire (Dβ‚‚ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) | +| `runs.rs` | Direct-publish writes: cancellation, concurrent-writer CAS, multi-statement atomicity, MR-794 staged-write rewire (Dβ‚‚ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) | | `staged_writes.rs` | TableStore staged-write primitives (`stage_append`, `stage_merge_insert`, `commit_staged`, `scan_with_staged`, `count_rows_with_staged`) β€” primitive-level only; engine code uses the in-memory `MutationStaging` accumulator instead | -| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) β€” reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: ` sentinel exempts reviewed lines | -| `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) β€” the first smoke check on any Lance version bump; e.g. 
`compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands | -| `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below | -| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth β€” `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). 
The **data-table opener** term is S3-only β€” see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below | -| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary | | `lifecycle.rs` | Graph lifecycle, schema state | | `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) | | `changes.rs` | `diff_between` / `diff_commits` | | `consistency.rs` | Cross-table snapshot isolation, atomic publish | -| `schema_apply.rs` | Migration plan + apply, schema-apply lock; index materialization deferred to the reconciler (iss-848): `apply_schema_defers_vector_index_on_empty_table` (an empty-table Vector `@index` never aborts the apply) and `index_only_constraint_apply_touches_no_table_data` (adding an `@index` is metadata-only β€” no table-version bump) | +| `schema_apply.rs` | Migration plan + apply, schema-apply lock | | `search.rs` | FTS / vector / hybrid (`bm25`, `nearest`, `rrf`) | -| `traversal.rs` | `Expand`, variable-length hops, anti-join (CSR path β€” `OMNIGRAPH_TRAVERSAL_MODE` unset) | -| `traversal_indexed.rs` | BTREE-indexed Expand (`execute_expand_indexed`) forced via `OMNIGRAPH_TRAVERSAL_MODE`, asserted semantically equal to the CSR path; own binary, all `#[serial]` so env writes never race | -| `proptest_equivalence.rs` | Property-based query-correctness invariants over generated graphs (shared key alphabet forces cross-type id collisions, cycles, self-loops) β€” pins Expand-mode equivalence so a future fork divergence fails loudly instead of silently; `#[serial]` | -| `ordering.rs` | ORDER BY 
contract: descending, multi-key precedence, deterministic key-column tie-break (total order, so `ORDER … LIMIT` is deterministic), NULL placement (`nulls_first = !descending`) | -| `literal_filters.rs` | Execution goldens for non-string/non-integer scalar literal filters (F64/F32/Bool/Date/DateTime) across both the in-memory comparison arm and the Lance-pushdown arm | +| `traversal.rs` | `Expand`, variable-length hops, anti-join | | `aggregation.rs` | `count`, `sum`, `avg`, `min`, `max` | | `export.rs` | NDJSON streaming export filters | | `s3_storage.rs` | S3-backed graph (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) | | `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior | | `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths | -| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly β€” no HTTP β€” proving embedded and CLI callers hit the same gate as the server, with action Γ— scope shapes matching `authorize_request` | -| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` 
(optimize creates a declared-but-deferred index) | -| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B β†’ recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests β€” load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). Also the v3β†’v4 migration fault-injection test (`transient_legacy_open_failure_aborts_migration_without_stamping_v4`, `migration.v3_to_v4.legacy_open` failpoint): a transient legacy-open failure aborts the migration loudly and leaves it retryable (stamp stays v3, no partial backfill), never stamping v4 over an empty backfill. Also the v4 stamp-bump exhaustion regression (`v4_stamp_exhaustion_returns_retryable_contention`, `migration.v4_stamp.force_incompatible` failpoint): the stamp retry loop surfaces a retryable `RowLevelCasContention` on exhaustion, not a stringified `Lance`. 
And the convergence-idempotent roll-forward regression (`open_sweep_roll_forward_converges_when_manifest_advances_concurrently`: two concurrent open-sweeps race one sidecar at the `recovery.before_roll_forward_publish` rendezvous; the CAS loser must converge, not fail the open β€” iss-schema-apply-reopen-recovery-race). | +| `maintenance.rs` | `optimize` (compaction) + `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation | +| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the four per-writer Phase B β†’ recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`). | | `recovery.rs` | Open-time recovery sweep β€” sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path | -| `composite_flow.rs` | Compositional/narrative end-to-end stories β€” multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). | +| `composite_flow.rs` | Compositional/narrative end-to-end stories β€” multi-step flows that compose mechanics covered by other test files. 
Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories). | ## Fixtures @@ -60,34 +49,25 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav - **CLI** β€” `crates/omnigraph-cli/tests/support/mod.rs`: `Command`-style wrapper for invoking `omnigraph`, server-process spawning, fixture resolution, output assertion helpers. - **Server** β€” no shared helpers; server tests call the `Omnigraph` engine API directly and exercise endpoints over the wire. -> Note: the storage adapter has an in-memory backend (`ObjectStorageAdapter::in_memory()`, full contract including true conditional updates) used by the adapter contract tests in `storage.rs`. It covers only the text-object layer (sidecars, schema staging, cluster state) β€” **Lance datasets bypass the adapter**, so engine integration tests still use `tempfile::tempdir()`. An in-memory Lance substrate remains an architectural ask β€” keep it explicit in [docs/dev/invariants.md](invariants.md) under known gaps. +> Note: there is **no `MemStorage` or in-memory backend** today. Tests use `tempfile::tempdir()` for local FS. If you find yourself needing one for layer isolation, that's an architectural ask β€” keep it explicit in [docs/dev/invariants.md](invariants.md) under known gaps. ## Failpoints (fault injection) -- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` in `crates/omnigraph/Cargo.toml`; the cluster's `failpoints` feature additionally enables `omnigraph/failpoints` (`crates/omnigraph-cluster/Cargo.toml`), so the shared test guard is available to cluster tests. -- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` each expose `maybe_fail("name")` (per-crate error type). 
The test-side config guard `ScopedFailPoint` (`new` for action strings, `with_callback` for callbacks; RAII `Drop` removes the point) lives **once** in the engine and is reused by both test binaries. -- **Names are compile-checked.** Every failpoint name is a `pub const` in `omnigraph::failpoints::names` (engine) / `omnigraph_cluster::failpoints::names` (cluster). Call sites and tests reference the constant, never a bare literal β€” a typo is a compile error, not a silently-never-firing point. Add a new failpoint by adding its const first. -- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, the recovery sweep's classifyβ†’roll-forward-publish window, cluster apply's payloadβ†’state-write window, etc.). -- **Serialize and rendezvous, never sleep.** The `fail` registry is process-global, so every failpoint test carries `#[serial]` (`serial_test`). For concurrent tests, use `helpers::failpoint::Rendezvous` (`tests/helpers/failpoint.rs`): `park_first(name)` parks the first thread to hit the point until `release()`, and `wait_until_reached().await` blocks on that condition (it doubles as a fired-assertion). Do not coordinate threads with fixed `sleep`s. -- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (integration binaries, never in-source β€” the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`. +- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` (in `crates/omnigraph/Cargo.toml`). +- Wrapper: `crates/omnigraph/src/failpoints.rs` exposes `maybe_fail("name")` and `ScopedFailPoint` for tests. +- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, etc.). +- Activated tests: `crates/omnigraph/tests/failpoints.rs`. 
Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints`. ## RustFS / S3 integration -CI runs these S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml` β†’ `rustfs_integration` job): +CI runs three S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml` β†’ `rustfs_integration` job): - `cargo test -p omnigraph-engine --test s3_storage` -- `cargo test -p omnigraph-engine --test write_cost_s3` (RFC-013 step 3a's data-table opener cost gate β€” flat across commit depth on S3; the term local FS can't reproduce) -- `cargo test -p omnigraph-server --test s3` (single-graph serving + config-free `--cluster s3://` boot) -- `cargo test -p omnigraph-cluster --test s3_cluster` (full control-plane lifecycle on the bucket) +- `cargo test -p omnigraph-server --test server server_opens_s3_graph_directly_and_serves_snapshot_and_read` - `cargo test -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow` -- `cargo test -p omnigraph-engine --features failpoints --test failpoints s3_` (recovery-sidecar lifecycle on a real bucket) Locally, set `OMNIGRAPH_S3_TEST_BUCKET` (and the usual `AWS_*` vars including `AWS_ENDPOINT_URL_S3` for non-AWS) before running. Without those, S3 tests skip gracefully. -## System e2e requirements and suppression - -The CLI system tests (`system_local.rs`) spawn the workspace-built `omnigraph` and `omnigraph-server` binaries (cargo provides paths via `CARGO_BIN_EXE_*`), bind ephemeral localhost ports, and use local-FS temp dirs β€” no external services, no env vars required; they run in the default `cargo test --workspace`. The comprehensive cluster lifecycle e2es (multi-server-restart flows) honor an opt-out for constrained sandboxes: set `OMNIGRAPH_SKIP_SYSTEM_E2E=1` to skip them with a logged message (the same graceful-skip pattern as the S3 gate). Cargo-native filtering also works: `cargo test --test system_local -- --skip local_cluster`. 
- ## OpenAPI drift `crates/omnigraph-server/tests/openapi.rs` regenerates `openapi.json` and diffs against the checked-in copy. CI auto-commits the regeneration on same-repository PRs and otherwise runs in strict-check mode (env: `OMNIGRAPH_UPDATE_OPENAPI`). @@ -109,7 +89,7 @@ If introducing coverage tooling is in scope for your task, the natural first ste How to check: -1. **Map the change to an area** — use the engine integration-test table above (`branching.rs`, `writes.rs`, `search.rs`, etc.). The filename usually names the area. +1. **Map the change to an area** — use the engine integration-test table above (`branching.rs`, `runs.rs`, `search.rs`, etc.). The filename usually names the area. 2. **Open the file and skim every test fn name.** Test fn names are the index — read them all, not just the first few. 3. **Grep for the symbol or path you're changing.** `rg ` or `rg ` across all `tests/` directories surfaces existing coverage you might miss. 4. **Decide one of three outcomes**, in this order of preference: @@ -131,15 +111,5 @@ When you pick up any change, walk through this: 6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks. 7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`. 8. **Verify your change makes an existing test fail before it makes the new one pass.** If you can break the code without breaking a test, your coverage gap is the problem to fix first. -9. **Bound hot-path cost at history depth.** If the change touches a read, **write**, or open path, add or extend a test that asserts a *bounded* cost (e.g. a warm same-branch read performs zero `Dataset::open`, or a per-write read-op count flat across commit depth) against a fixture with realistic *commit-history depth*, not just realistic row counts. 
Reuse the shared `helpers::cost` harness (`measure`/`IoCounts`/`assert_flat`) — don't hand-roll `IOTracker` wiring. Cost that scales with history is invisible on a shallow fixture and only bites in production. See "Cost-budget tests" below. - -## Cost-budget tests: bound hot-path cost at history depth - -Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and degrade silently in production. The engine read path historically had no cost assertion, and fixtures carry shallow commit history, so an O(commits)-per-query cost stayed green in CI and only surfaced on a long-lived graph (read snapshot resolution re-scanned the internal manifest and commit-graph tables on every query, and those tables were never compacted). Guard against the class: - -- **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query." -- **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost. -- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. 
The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest`/`_graph_commits` fragment scans) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears. -- This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history). When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change. diff --git a/docs/dev/writes.md b/docs/dev/writes.md deleted file mode 100644 index 7239742..0000000 --- a/docs/dev/writes.md +++ /dev/null @@ -1,403 +0,0 @@ -# Direct-Publish Write Path - -> History: the Run state machine and `__run__` staging branches were -> removed in MR-771 (shipped v0.4.0). Writes now go directly to the target -> table; this document specifies that direct-publish path. - -`mutate_as` and `load` write **directly to the target table** -and call `ManifestBatchPublisher::publish` once at the end with -`expected_table_versions` (the per-table manifest versions captured before -the first write). Cross-table OCC is enforced inside the publisher; the -publisher's row-level CAS on `__manifest` is the single fence. 
- -## What this means in practice - -- No `RunRecord`, no `_graph_runs.lance`, no `_graph_run_actors.lance`. -- No `omnigraph run *` CLI subcommands and no `/runs/*` HTTP endpoints. -- No `__run__` staging branches; `__run__*` is no longer a reserved - name. The branch-name guard was removed in MR-770, and any stale - `__run__*` branch on an upgraded graph is swept off `__manifest` by the - v2→v3 internal-schema migration on first read-write open. (The inert - `_graph_runs.lance` bytes remain until a `delete_prefix` primitive lands.) -- Cancelled mutation futures leave **no graph-visible state** — the manifest - is never advanced. They can leave two kinds of unreferenced residue, both - self-healing: orphaned Lance fragments (reclaimed by `omnigraph cleanup`), - and — on the *first* write to a table on a branch, which forks it before the - publish — a manifest-unreferenced branch ref. The next write to that table - reclaims the stale fork and re-forks (`reclaim_orphaned_fork_and_refork`), - and `cleanup`'s per-table reconciler is the guaranteed backstop; see the - fork-reclaim note in [invariants.md](invariants.md). - -## Read-your-writes within a multi-statement mutation - -A `.gq` query with multiple ops (e.g. `insert Person … insert Knows …`) -must observe earlier ops' writes when validating later ops (referential -integrity, edge cardinality). After MR-794 step 2+ this is implemented -via an in-memory `MutationStaging` accumulator in -[`crates/omnigraph/src/exec/staging.rs`](../../crates/omnigraph/src/exec/staging.rs), -shared by both `mutate_as` and the bulk loader: - -- On the first touch of each table, the pre-write manifest version is - captured into `expected_versions[table_key]` (the publisher's CAS - fence at end-of-query). -- Each insert/update op pushes a `RecordBatch` into the per-table - pending accumulator. Lance HEAD does **not** advance during op - execution. 
-- Read sites (validation, predicate matching for `update`) consume - `TableStore::scan_with_pending`, which scans committed via Lance - and applies the same SQL filter to the pending batches via DataFusion - `MemTable`. Same-query writes are visible to subsequent reads. -- At end-of-query, `MutationStaging::finalize` issues exactly one - `stage_*` + `commit_staged` per touched table (concatenating - accumulated batches; merge-mode dedupes by `id`, last-write-wins), - and the publisher publishes the manifest atomically across all - touched sub-tables. Cross-table conflicts surface as - `ManifestConflictDetails::ExpectedVersionMismatch`. -- **Deletes still inline-commit.** Lance's `Dataset::delete` is not - exposed as a two-phase op in 6.0.1; deletes go through `delete_where` - immediately and record their post-write state in - `MutationStaging.inline_committed`. The parse-time D₂ rule (below) - prevents inserts/updates from coexisting with deletes in one query, - so the inline path is safe for delete-only mutations. - -This upholds the manifest-atomic mutation and read-your-writes invariants -tracked in [docs/dev/invariants.md](invariants.md). - -### D₂ — parse-time mixed-mode rejection - -A single mutation query is either insert/update-only or delete-only. -Mixed → rejected at parse time with a clear error directing the user to -split the query. Reason: mixing creates ordering hazards -(insert→delete on the same row would silently no-op because the staged -insert isn't visible to delete; cascading deletes of just-inserted -edges break referential integrity). Until Lance exposes a two-phase -delete API, the parse-time rejection keeps both paths atomic and -correct. Tracked: MR-793, plus a Lance-upstream ticket. 
- -### MR-793 status (storage trait two-phase invariant) — partial - -MR-793 hoists the staged-write pattern into a `TableStorage` trait -surface with sealed-trait enforcement and opaque `SnapshotHandle` / -`StagedHandle` types — see `crates/omnigraph/src/storage_layer.rs`. -The trait is the canonical surface for new engine code; existing call -sites still use the inherent `TableStore` methods (mechanical migration -deferred to a follow-up cycle — tracked). - -Three writers have been migrated onto staged primitives: - -* **`ensure_indices`** (`db/omnigraph/table_ops.rs::build_indices_on_dataset_for_catalog`) - — scalar indices (BTree, Inverted) use `stage_create_*_index` + - `commit_staged`. Which index a `@index`/`@key` property gets is dispatched by - type via `node_prop_index_kind` (enum + orderable scalar → BTree, free-text - String → Inverted/FTS, Vector → vector). Vector indices stay inline (residual - — Lance `build_index_metadata_from_segments` is `pub(crate)` in 6.0.1; - companion ticket to lance-format/lance#6658 needed). This build is - existence-gated (it creates a *missing* index over current fragments); folding - fragments appended afterward into an *existing* index is `optimize`'s - `optimize_indices` pass — an inline-commit residual, not a staged write (Lance - exposes no uncommitted index-optimize), covered by the optimize recovery - sidecar (see [maintenance.md](../user/operations/maintenance.md)). -* **`branch_merge::publish_rewritten_merge_table`** - (`exec/merge.rs`) — merge_insert now uses `stage_merge_insert` + - `commit_staged`. Deletes stay inline (Lance #6658 residual). -* **`schema_apply` rewritten_tables** (`db/omnigraph/schema_apply.rs`) - — rewrites use `stage_overwrite` + `commit_staged`, including empty-table - rewrites via a zero-fragment Lance `Operation::Overwrite`. 
- -A defense-in-depth integration test (`tests/forbidden_apis.rs`) walks -engine source and fails if non-allow-listed code calls Lance's -inline-commit APIs directly. The trait surface itself is the primary -enforcement (sealed + only-callable-via-trait once call sites land); -the grep test catches type-system bypass attempts. - -The "finalize → publisher residual" described below applies equally to -the migrated writers — Lance has no multi-dataset atomic commit -primitive, so the per-table commit_staged → manifest publish gap is -the same drift class. Closing it requires either upstream Lance -multi-dataset commit OR the omnigraph-side recovery-on-open reconciler -described in `.context/mr-793-design.md` §15 (deferred to MR-795). - -### Inline-commit residuals live on `InlineCommitResidual`, not `db.storage()` (MR-793 acceptance §1, by construction) - -MR-793's acceptance criterion §1 ("`TableStore` (or successor) public API has no method that performs a manifest commit as a side effect of writing") holds **by construction** after MR-854. `db.storage()` (`&dyn TableStorage`) exposes only staged primitives + reads; the inline-commit writes Lance cannot yet stage live on a separate `InlineCommitResidual` trait reached via `Omnigraph::storage_inline_residual()`. A new engine writer cannot couple a write with a Lance HEAD advance through the default surface — it would have to name the residual accessor explicitly. The dead legacy methods (trait `append_batch` / `merge_insert_batches`, inherent `merge_insert_batch{,es}`, `create_{btree,inverted}_index`) were removed; appends/merges and scalar index builds all use the `stage_*` primitives. 
- -Two methods remain on `InlineCommitResidual`, each named honestly at its call site: - -| Residual method | Inline-commit reason | Closes when | -|---|---|---| -| `delete_where` | `DeleteBuilder::execute_uncommitted` is not in Lance v6.0.1 (closed upstream as [#6658](https://github.com/lance-format/lance/issues/6658) but first ships in `v7.0.0-beta.10`); see [docs/dev/lance.md](lance.md) | MR-A: Lance v7.x bump migrates `delete_where` to staged, retires the parse-time D₂ mutation rule, and extends recovery sidecar coverage | -| `create_vector_index` | Vector indices take Lance's "segment commit path"; `build_index_metadata_from_segments` is `pub(crate)` (Lance [#6666](https://github.com/lance-format/lance/issues/6666) still open) | Lance #6666 lands and `stage_create_vector_index` joins the staged surface | - -The `tests/forbidden_apis.rs` guard still catches direct `lance::*` inline-commit misuse outside the storage layer; the trait split makes the staged-only default a type-system guarantee on top of it. - -### `LoadMode::Overwrite` uses staged Lance `Overwrite` - -The bulk loader's Append, Merge, and Overwrite modes all use the -staged-write path described above. `LoadMode::Overwrite` accumulates -replacement batches in memory, validates node/edge constraints, referential -integrity, and edge cardinality before any Lance HEAD movement, stages -each touched table with Lance `Operation::Overwrite`, then runs -`commit_staged` under the normal `SidecarKind::Load` recovery sidecar -before publishing `__manifest`. `OMNIGRAPH_LOAD_CONCURRENCY` applies to the -fragment-writing stage only; the commit and manifest publish still run -under the per-table write queues. Empty-table overwrite is represented as -a valid zero-fragment Lance `Overwrite` transaction, not as -truncate-then-append. 
- -### Open-time recovery sweep - -The staged-write rewire eliminates one drift class **by construction at -the writer layer**: an op that fails before pushing to the in-memory -accumulator (validation errors, missing endpoints, parse-time D₂ -rejection) leaves Lance HEAD untouched on every staged table. This is -the case the `partial_failure_leaves_target_queryable_and_unblocks_next_mutation` -test pins. - -A second, narrower drift class — the **finalize → publisher window** — -is closed across one open cycle by the open-time recovery sweep: - -`MutationStaging::finalize` runs `stage_*` + `commit_staged` per touched -table sequentially, then the publisher commits the manifest. Lance has -no multi-dataset atomic commit, so the per-table `commit_staged` calls -are independent operations: if commit_staged on table N+1 fails *after* -commit_staged on tables 1..N succeeded, or if the publisher's CAS -pre-check rejects *after* every commit_staged succeeded, tables 1..N -are left at `Lance HEAD = manifest_pinned + 1`. - -**Recovery protocol** (lifecycle of every staged-write writer — -`MutationStaging::finalize`, `schema_apply::apply_schema_with_lock`, -`branch_merge_on_current_target`, `ensure_indices_for_branch`, -`optimize_all_tables`): - -1. **Phase A**: writer writes a sidecar JSON to - `__recovery/{ulid}.json` BEFORE its first HEAD-advancing commit - (`commit_staged`, or `compact_files` for `optimize_all_tables`, - which advances the Lance HEAD via a reserve-fragments + rewrite - commit rather than a staged write). The - sidecar names every `(table_key, table_path, expected_version, - post_commit_pin)` it intends to commit + the writer kind + - actor_id. -2. **Phase B**: writer's per-table `commit_staged` loop runs. 
- - **Phase-B confirmation (`BranchMerge` only)**: a `BranchMerge` writer - advances each table's HEAD by *several* commits (append → upsert → - delete), so a bare "HEAD moved" is ambiguous — it could be a complete - publish or one crashed mid-sequence. After the whole per-table loop - finishes, the writer re-writes the sidecar stamping each pin's - `confirmed_version` with the exact achieved version, then proceeds to - Phase C. This is the commit point of the recovery WAL: a crash *after* - confirmation rolls forward to those versions; a crash *during* Phase B - (sidecar still unconfirmed) rolls back. Other writers don't confirm — - their drift is derived state (index coverage, compaction) that a partial - roll-forward never corrupts. -3. **Phase C**: publisher commits the manifest. -4. **Phase D**: writer deletes the sidecar. - -> **Phase letter convention.** Throughout the recovery code, log -> messages, failpoint names (e.g. `branch_merge.post_phase_b_pre_manifest_commit`), -> and the per-writer integration tests, "Phase A/B/C/D" refers -> exclusively to the four-step lifecycle above. The per-table -> staged-write contract (`stage_*` then `commit_staged`, two steps) -> is referred to by those API verbs — never by phase letters — so a -> reader of `recovery.rs`, `failpoints.rs`, or this document only -> encounters phase letters in the per-writer context. - -A failure between Phase A and Phase D leaves the sidecar on disk. The -next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the -recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: - -- For each sidecar in `__recovery/`, compare every named table's - Lance HEAD to the manifest pin. Classify per the all-or-nothing - decision tree (RolledPastExpected / NoMovement / UnexpectedAtP1 / - UnexpectedMultistep / IncompletePhaseB / InvariantViolation). 
For a - `BranchMerge` sidecar, a moved HEAD with no `confirmed_version` classifies - as `IncompletePhaseB` (a partial multi-commit publish) and forces roll-back; - with a `confirmed_version`, roll-forward targets exactly that version. -- If any table is `InvariantViolation` (Lance HEAD < manifest pinned — - should be impossible), **abort** with a loud error and leave the - sidecar on disk for operator review. -- Otherwise, if every table is `RolledPastExpected`, **roll forward**: - a single `ManifestBatchPublisher::publish` call extends every pin - atomically. `SchemaApply` sidecars are eligible only when schema-state - recovery promoted the matching staging files in the same recovery pass; - otherwise full open-time recovery rolls them back and refresh-time - recovery leaves them for the next read-write open. -- Otherwise **roll back**: per-table `Dataset::restore` to the - manifest-pinned table version, then a single `ManifestBatchPublisher::publish` - of the restored HEAD — symmetric with roll-forward, so `manifest == HEAD` - after recovery (no residual drift). This convergence is what lets a - failed-then-retried schema apply succeed instead of failing one version higher - each iteration. The audit row's `to_version` records the logical - rolled-back-to version (`manifest_pinned`); the manifest is published at the - restore commit (`manifest_pinned + 1`, same content). -- After a successful roll-forward or roll-back, an audit row is - recorded — the graph commit lineage (the `graph_commit` rows in `__manifest` - since RFC-013 Phase 7) carries a commit tagged - `actor_id = "omnigraph:recovery"`, and a sibling - `_graph_commit_recoveries.lance` row carries `recovery_kind`, - `recovery_for_actor` (the original sidecar's actor), `operation_id`, - per-table outcomes. Operators run `omnigraph commit list --filter - actor=omnigraph:recovery` to find recoveries. -- Sidecar deleted as the final step. 
- -Triggers for the residual: transient Lance write errors during finalize -(object-store retry budget exhaustion, disk full); persistent publisher -contention exceeding `PUBLISHER_RETRY_BUDGET = 5` retries. - -**Long-running servers**: the write entry points (`load_as`, -`mutate_as`, `apply_schema_as`, `branch_merge_as`) and -`Omnigraph::refresh` run roll-forward-only recovery in-process -(`recovery::heal_pending_sidecars_roll_forward`) — the common -Phase B → Phase C residual closes on the next write, without a -restart and without an explicit refresh. The heal lists `__recovery/` -(one `list_dir`; empty in the steady state) and, per sidecar, acquires -the same per-`(table_key, table_branch)` write queues every sidecar -writer holds from before `write_sidecar` until after `delete_sidecar` — -so it serializes against a live writer instead of rolling its -in-flight sidecar forward from under it (a sidecar whose queues can be -acquired belongs to a writer that finished or died; an existence -re-check after the wait skips the finished case). Lock order is -queues → coordinator, matching every writer's commit→publish path. -Pinned by the four -`tests/failpoints.rs::*_after_finalize_publisher_failure_heals_without_reopen` -tests (load, mutation, schema apply, branch merge). The maintenance -entries need the heal for more than liveness: without it, a schema -apply re-plans rewrites from the manifest pin and orphans the drifted -Phase-B commit (dropping its rows), and a branch merge publishes the -drift as an unattributed side effect — both while the stale sidecar -lingers to misclassify later. 
-Sidecars that would require a `Dataset::restore` (mixed / unexpected -state) are deferred to the next `OpenMode::ReadWrite` open: restore is -unsafe under concurrency because Lance's `check_restore_txn` accepts -the restore against in-flight Append/Update/Delete commits and -silently orphans them (pinned by -`tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning`). -When such a deferred sidecar blocks a write, the commit-time drift -guard says so explicitly ("a pending recovery sidecar requires -rollback — reopen the graph read-write") instead of pointing at -`omnigraph repair`, which refuses while a sidecar is pending. -Continuous in-process recovery for the rollback path is the goal of a -future background reconciler. `ensure_indices` does not heal at entry -itself — it runs inside the load / schema-apply flows after their -entry heal, and its strict preconditions still fail loudly on drift -when invoked directly. - -The publisher-CAS contract is unchanged: a *concurrent writer* that -advances any of our touched tables between snapshot capture and -publisher commit produces exactly one winner. The residual above is -about *our* abandoned commits in the failure path, not about -concurrency races. - -**Sidecar I/O failure semantics** (all sidecar I/O goes through the -backend-generic `StorageAdapter`; the contracts below are pinned by the -storage-fault failpoints `recovery.sidecar_{write,delete,list}` / -`recovery.record_audit` and their tests in `tests/failpoints.rs` and -`tests/recovery.rs`): - -- **Phase A put fails** (S3 PutObject / fs write): the writer aborts - before its first HEAD-advancing commit — no sidecar, no drift, - nothing to recover; a transient fault never wedges later writes. -- **Phase D delete fails** (S3 DeleteObject): swallowed with a warning — - the write already published, so failing the caller would report an - error for a durable write. 
The stale sidecar is consumed by the next - write's entry heal (or the next open) via the stale-sidecar - audit-recovery path, recorded as `RolledForward`. -- **`__recovery/` list fails** (S3 ListObjectsV2): loud at every - consumer — the write-entry heal fails the write, the open-time sweep - fails the open. Silently skipping recovery would be consumer - tolerance of drift. -- **Corrupt / unparseable sidecar**: refused loudly by heal and open - alike; the file stays on disk for operator inspection (read-only - opens still work — the sweep is skipped there). -- **Audit append fails after a roll-forward publish**: that recovery - attempt errors and keeps the sidecar; re-entry sees the - already-published manifest, records exactly one `RolledForward` - audit row, and deletes the sidecar (the retry tolerance documented - on `record_audit`). - -Backend notes (the adapter is one implementation over `object_store` -for every backend): local writes stage through `name#` temp -files that the backend filters from listings and refuses to address — -crash residue of that shape is invisible to the sweep, harmless, and -reclaimed by `delete_prefix`/manual cleanup. Storage errors are -backend-wrapped text without a typed NotFound discriminant — callers -that need missing-vs-error (the cluster store) probe `exists()` first. -`exists()` itself is object-store semantics everywhere: only objects -(or non-empty prefixes) exist, and a permission failure is a loud -error, not a silent `false`. - -## Conflict shape - -Concurrent writers to the same `(table, branch)` produce exactly one -success and one failure. The losing writer's error is -`OmniError::Manifest` with kind `Conflict` and details -`ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected, -actual }`. 
The HTTP server maps this to **409 Conflict** with body -`{"error": "...", "code": "conflict", "manifest_conflict": { "table_key": -"...", "expected": N, "actual": M }}` — see [docs/user/server.md](../user/operations/server.md). - -## Audit - -`actor_id` lands in the graph commit lineage — the `graph_commit` rows in -`__manifest`, written in the publish CAS (RFC-013 Phase 7; previously -`_graph_commits.lance`). Audit history is queried via `omnigraph commit list`. - -## Migration code - -`db/manifest/migrations.rs` is the single place on-disk `__manifest` shape is -reconciled with what the binary expects, stepping the -`omnigraph:internal_schema_version` stamp forward one `match`-arm at a time. It -runs in `Omnigraph::open(ReadWrite)` (via `manifest::migrate_on_open`, before the -coordinator reads branch state) and again on the publisher's write path, so each -branch migrates on its first write; every step is idempotent under crash-retry -(work first, stamp bump last). - -- **v2→v3** (MR-770): a one-time sweep that deletes legacy `__run__*` staging - branches off `__manifest`. Deleting the inert `_graph_runs.lance` / - `_graph_run_actors.lance` dataset *bytes* is still deferred — it needs a - `StorageAdapter::delete_prefix` primitive — but those bytes are invisible to - graph-level state. -- **v3→v4** (RFC-013 Phase 7, `migrate_v3_to_v4`): backfills the graph lineage - from `_graph_commits.lance` into `__manifest` as `graph_commit` / `graph_head` - rows. A graph created before Phase 7 has its lineage only in - `_graph_commits.lance`; the new binary reads lineage from the `__manifest` - projection, so without this backfill it would see an empty commit DAG. The - backfill is per-branch (each branch migrates on its first write), idempotent - (keyed on `object_id`; a fast-path guard skips when `__manifest` already - carries `graph_commit` rows), and writes exactly one `graph_head:` row - for the actual head. 
`_graph_commits.lance` is left in place as the branch-ref - carrier — no commit row is written to it again. While a graph is below v4, a - **read-only** open (which never writes, so never migrates) sources the commit - DAG from `_graph_commits.lance` via the stamp-gated transitional fallback in - `CommitGraph::open*`, so reads see correct history before the first write - migrates the graph. An old binary opening a v4-stamped graph is refused with an - "upgrade omnigraph" error in both read-write and read-only modes. - -## Mid-query partial failure: closed by MR-794 - -The pre-MR-794 design had a known limitation: a multi-statement `.gq` -mutation where op-N inline-committed a Lance fragment and op-N+1 then -failed left the touched table at `Lance HEAD = manifest_version + 1`, -blocking the next mutation with `ExpectedVersionMismatch`. - -MR-794 (step 1 + step 2+) closed this for inserts/updates **by -construction at the writer layer**: insert and update batches accumulate -in memory; no Lance HEAD advance happens during op execution; one -`stage_*` + `commit_staged` per touched table runs at end-of-query, and -only after every op succeeded. A failed op leaves Lance HEAD untouched -on the staged tables, so the next mutation proceeds normally with no -drift to reconcile. - -The cancellation case (future drop mid-mutation) inherits the same -guarantee — the in-memory accumulator evaporates with the dropped task -and no Lance write was ever issued. - -For delete-touching mutations the legacy inline-commit shape is -preserved (Lance has no public two-phase delete in 6.0.1) — the same -narrow window remains. The parse-time D₂ rule prevents inserts/updates -from coexisting with deletes in one query, so a pure-delete failure -cannot drift any staged-table state. If a delete-only multi-table -mutation fails mid-cascade, the same workaround as before applies -(retry; rely on `omnigraph cleanup` once a later successful commit -moves HEAD past the orphan version). 
Closing this requires Lance to -expose `DeleteJob::execute_uncommitted`; tracked in MR-793 and a -Lance-upstream ticket. diff --git a/docs/releases/v0.2.0.md b/docs/releases/v0.2.0.md new file mode 100644 index 0000000..7872ecf --- /dev/null +++ b/docs/releases/v0.2.0.md @@ -0,0 +1,86 @@ +# Omnigraph v0.2.0 + +Omnigraph v0.2.0 focuses on day-to-day operability: safer schema evolution, more capable mutation queries, better local and remote ergonomics, and a documented HTTP surface for clients and tooling. + +This release is especially relevant if you are running Omnigraph locally on RustFS or using the CLI and server together as a graph application backend. + +## Highlights + +### Schema planning and apply + +Schema changes can now move from planning to execution with first-class CLI and server support. + +- Added `omnigraph schema apply --schema ...` alongside `schema plan` +- Added `POST /schema/apply` on the server +- Added policy support for schema application through the `schema_apply` action +- Persisted accepted schema updates as part of a supported apply flow + +This makes schema evolution an actual product capability instead of a plan-only diagnostic. + +### Safer schema apply on live repos + +After the initial schema-apply rollout, the apply path was hardened to avoid clobbering concurrent writes and to preserve indexes during table rewrites. + +- Blocks writes while schema apply is in progress +- Verifies source heads before publishing rewritten tables +- Rebuilds the full expected index set after rewrite operations +- Keeps schema apply constrained to repos whose only branch is `main` + +The result is a much more defensible v1 schema migration path. + +### Multi-statement mutations + +Mutation queries can now contain multiple sequential statements that execute atomically within one run. 
+ +Example: + +```gq +query add_and_link($name: String, $age: I32, $friend: String) { + insert Person { name: $name, age: $age } + insert Knows { from: $name, to: $friend } +} +``` + +This is a meaningful step toward richer write-side workflows without forcing multiple client round trips. + +### OpenAPI support + +The server now publishes an OpenAPI document at `/openapi.json`. + +- Added schema-backed endpoint documentation for the Omnigraph HTTP API +- Documented request and response types for the current server surface +- Made the published spec reflect runtime auth mode, so open local deployments are documented correctly + +This makes Omnigraph easier to integrate with generated clients, inspection tools, and API consumers that want a machine-readable contract. + +### CLI and export ergonomics + +Several rough edges in the CLI were fixed. + +- Export now streams instead of buffering the full snapshot in memory first +- Load summaries now report actual loaded row counts +- Alias handling no longer steals legitimate first arguments +- `commit show` matches the documented `--uri` usage +- Remote and local usage are more consistent for common admin flows + +## Additional Improvements + +- RustFS CI is now scoped to relevant changes instead of burning time on unrelated pull requests +- README and install docs were tightened around public binary install behavior +- The local RustFS bootstrap remains aligned with the rolling `edge` binary channel + +## Upgrade Notes + +- If you use local or remote schema administration, prefer `schema plan` before `schema apply` +- `schema apply` is intentionally conservative in v1 and rejects repos with non-`main` branches +- If policy is enabled, make sure admin actors are allowed to perform `schema_apply` +- If you rely on published binaries, this release is the point where stable installers can pick up schema apply and the newer CLI/runtime behavior without using `edge` + +## Included Changes + +- PR #2: CLI ergonomics and 
streamed export output +- PR #5: schema apply command and policy support +- PR #7: schema apply concurrency and index-preservation hardening +- PR #4: multi-statement mutations +- PR #1: OpenAPI generation and auth-aware `/openapi.json` +- PR #8: RustFS CI scoping improvements diff --git a/docs/releases/v0.2.1.md b/docs/releases/v0.2.1.md new file mode 100644 index 0000000..b840885 --- /dev/null +++ b/docs/releases/v0.2.1.md @@ -0,0 +1,59 @@ +# Omnigraph v0.2.1 + +Omnigraph v0.2.1 is a focused follow-up release on top of v0.2.0. It adds query linting, improves query execution correctness, hardens the local RustFS bootstrap flow, and cleans up project config naming. + +## Highlights + +### Query lint and check + +The CLI now ships a first-class query validation surface: + +- `omnigraph query lint` +- `omnigraph query check` + +These commands validate `.gq` files against either an explicit schema file or a local/S3-backed repo schema, emit structured results, and support both human-readable and JSON output. + +### Query execution fixes and aggregate support + +This release includes several improvements in the query engine: + +- aggregate execution support for read queries +- nullable query parameters now accept omission and explicit null for nullable params +- traversal planning and join alignment are more robust for traversal-introduced bindings + +Together, these changes make complex read queries more dependable and easier to author. + +### Better local RustFS startup + +The local RustFS bootstrap is more resilient: + +- detects dirty/stale repo prefixes before blindly reinitializing +- makes bootstrap recovery clearer for persisted local RustFS state +- ships a more generic demo fixture instead of user-specific seed content + +This reduces the most common failure mode in local-first setup. 
+ +### Config terminology cleanup + +`omnigraph.yaml` now uses graph-oriented naming: + +- `graphs:` instead of `targets:` +- `cli.graph` / `server.graph` instead of `target` + +This removes one of the more confusing overloaded terms in the CLI/server config model. + +## Included Changes + +- PR #15: query lint and query check commands +- PR #6: aggregate execution support +- PR #3: nullable query parameter fixes +- PR #16: traversal planning and join-alignment fixes +- PR #13: local RustFS bootstrap recovery hardening +- PR #14: generic bootstrap fixture +- PR #17: config rename from targets to graphs + +## Upgrade Notes + +- If you maintain `.gq` files in-repo, add `omnigraph query lint` to your local validation workflow +- Existing configs must use `graphs:` / `graph:` after this release +- Local RustFS users should prefer the current bootstrap script from `main` or this release rather than older cached copies diff --git a/docs/releases/v0.2.2.md b/docs/releases/v0.2.2.md new file mode 100644 index 0000000..88d086e --- /dev/null +++ b/docs/releases/v0.2.2.md @@ -0,0 +1,29 @@ +# Omnigraph v0.2.2 + +Omnigraph v0.2.2 is a packaging follow-up to v0.2.1. It keeps the CLI and server surface the same, but renames the published runtime crate from `omnigraph` to `omnigraph-engine` so the full crate set can be published cleanly to crates.io. + +## Highlights + +### Published runtime crate rename + +The runtime package is now published as: + +- `omnigraph-engine` + +The in-code Rust library name remains `omnigraph`, so internal imports and code paths stay stable. CLI users are unaffected. + +### Crates.io metadata cleanup + +All published crates now ship repository, homepage, and documentation metadata so the crates.io pages are complete and the release pipeline no longer emits missing-package-metadata warnings. 
+ +## Included Changes + +- rename runtime package from `omnigraph` to `omnigraph-engine` +- bump `omnigraph-engine`, `omnigraph-compiler`, `omnigraph-server`, and `omnigraph-cli` to `0.2.2` +- update dependent manifests and CI package references to the new runtime package name + +## Upgrade Notes + +- Rust consumers should depend on `omnigraph-engine` on crates.io +- Code that imports the library can continue using `omnigraph` as the crate name +- The `omnigraph` CLI binary name is unchanged diff --git a/docs/releases/v0.3.0.md b/docs/releases/v0.3.0.md new file mode 100644 index 0000000..4c900a7 --- /dev/null +++ b/docs/releases/v0.3.0.md @@ -0,0 +1,49 @@ +# Omnigraph v0.3.0 + +Omnigraph v0.3.0 is a feature and security release. It adds an AWS deployment path for the server, hardens bearer-token authentication, introduces a schema inspection endpoint, and ships the CodeBuild-driven image packaging pipeline. + +## Highlights + +### AWS deployment path + +A new `aws` Cargo feature enables an AWS-native bearer-token backend. When compiled with `--features aws` and pointed at an AWS Secrets Manager secret ARN via `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET`, the server fetches and parses bearer tokens directly from Secrets Manager at startup. The token loading path is abstracted behind a `TokenSource` trait so additional backends are easy to add. + +A manually-dispatched Package workflow builds two variants of the server image (default and `--features aws`) via AWS CodeBuild, tags them by source SHA in ECR, and records the digests for downstream deploy automation. + +### Bearer auth hardening + +Bearer tokens are now hashed (SHA-256) at rest inside the server and compared using constant-time equality (`subtle::ConstantTimeEq`). The authenticated actor id is resolved server-side from the hash match — requests can no longer assert their own actor id by setting a header. 
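The hashed-at-rest, constant-time comparison described under "Bearer auth hardening" can be illustrated with a minimal sketch. It is written in Python for illustration only: the server itself is Rust and uses `subtle::ConstantTimeEq`, and the names `hash_token`, `resolve_actor`, and `TOKENS` below are hypothetical, not part of Omnigraph.

```python
import hashlib
import hmac
from typing import Dict, Optional


def hash_token(token: str) -> bytes:
    """Return the SHA-256 digest of a bearer token (the form kept at rest)."""
    return hashlib.sha256(token.encode("utf-8")).digest()


def resolve_actor(presented: str, hashed_tokens: Dict[bytes, str]) -> Optional[str]:
    """Map a presented token to an actor id, or None when nothing matches.

    Each stored digest is compared with hmac.compare_digest, whose running
    time does not depend on how many leading bytes happen to match.
    """
    candidate = hash_token(presented)
    actor = None
    for stored, actor_id in hashed_tokens.items():
        if hmac.compare_digest(stored, candidate):
            actor = actor_id  # keep scanning so the match position is not revealed
    return actor


# Hypothetical token table: only digests are stored, never the raw tokens.
TOKENS = {hash_token("s3cret-alice"): "alice", hash_token("s3cret-bob"): "bob"}
```

The actor id comes from the table entry that matched, never from anything the caller supplies, which is the property the release note describes as server-authoritative.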
+ +### Schema inspection API + +A new `GET /schema` endpoint and matching CLI `schema get` command return the active graph schema as JSON. A static OpenAPI spec is published at `openapi.json` and kept in sync with the server via a CI job. + +### Stricter run-branch hygiene + +Internal `__run__…` branches, used for short-lived write staging, are now filtered out of user-visible branch listings and are deleted on every terminal state transition instead of accumulating over time. + +## Breaking changes + +### Schema state is now required + +The server refuses to open a repo that lacks persisted schema state (`_schema.pg`, `_schema.ir.json`, `__schema_state.json`) or that has non-main public branches left over from earlier versions. Existing repos created with 0.2.x need to be reinitialized (or have their schema state written explicitly) before they can be opened with 0.3.0. + +## Included Changes + +- Add `aws` feature + `SecretsManagerTokenSource` backend +- Extract `TokenSource` trait for bearer token loading +- Harden bearer auth: constant-time compare, SHA-256 hashed at rest, server-authoritative actor id +- Add manually-dispatched Package workflow for CodeBuild image builds (default + aws variants) +- Add `GET /schema` endpoint and `schema get` CLI command +- Ship static `openapi.json` spec with CI auto-sync +- Filter and delete ephemeral `__run__` branches +- Switch Dockerfile base to ECR Public (avoid Docker Hub rate limits) +- Raise `LANCE_MEM_POOL_SIZE` default to 1 GB for stable parallel tests +- Automate Homebrew tap updates on release tags +- Documentation for the AWS build variant and bearer-token sources + +## Upgrade Notes + +- Repos created with 0.2.x must be reinitialized (or have their schema state generated) before they can be opened with 0.3.0 +- Deployments using AWS Secrets Manager for bearer tokens must build the server with `--features aws` and set `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` to the secret ARN +- The default token source (env 
var or JSON file) continues to work unchanged diff --git a/docs/releases/v0.3.1.md b/docs/releases/v0.3.1.md new file mode 100644 index 0000000..1f5d7dc --- /dev/null +++ b/docs/releases/v0.3.1.md @@ -0,0 +1,19 @@ +# Omnigraph v0.3.1 + +Omnigraph v0.3.1 is a performance and operability point release. + +## Highlights + +- **Parallel per-type load writes**: the bulk loader writes to each node/edge table concurrently rather than serially, materially reducing wall-clock time on multi-table loads. +- **`omnigraph optimize` and `omnigraph cleanup` CLI commands**: previously only available via the engine API. `optimize` runs Lance `compact_files()` across every node/edge table; `cleanup` runs Lance `cleanup_old_versions()` with a `--keep`/`--older-than` policy and requires `--confirm` for the destructive form. +- **Dst-id deduplication during edge expand hydration**: avoids redundant lookups when the same destination id appears multiple times in an `Expand` step (#45). + +## Included Changes + +- Parallel per-type load writes (#46) +- `omnigraph optimize` / `cleanup` CLI commands and runtime APIs (#46) +- Dedupe dst ids before hydrating nodes in `execute_expand` (#45) + +## Upgrade Notes + +No breaking changes. Existing v0.3.0 repos can be opened directly with v0.3.1. diff --git a/docs/releases/v0.4.0.md b/docs/releases/v0.4.0.md new file mode 100644 index 0000000..efb2da7 --- /dev/null +++ b/docs/releases/v0.4.0.md @@ -0,0 +1,88 @@ +# Omnigraph v0.4.0 + +Omnigraph v0.4.0 demotes the Run state machine to commit metadata via the +publisher's CAS, fixing a write-cancellation hole and reducing the engine's +surface area. + +## Highlights + +- **Direct-to-target writes**: `mutate_as` and `load` write + directly to the target tables and call + `ManifestBatchPublisher::publish` once at the end with + `expected_table_versions`. No more `__run__` staging branches, no + more `RunRecord` state machine. 
Cross-table OCC is enforced inside the + publisher's row-level CAS on `__manifest`. +- **Cancellation safety by construction**: a dropped mutation future + leaves no graph-level state — only orphaned Lance fragments, reclaimed + by `omnigraph cleanup`. The "zombie run" cascade documented in + `.context/zombie-run-investigation.md` is gone. +- **Read-your-writes inside multi-statement mutations**: a `.gq` query + that inserts and then references a row in the same statement now sees + its own writes via an in-process `MutationStaging` cache, even though + no manifest commit happens between ops. +- **Structured conflict surface**: concurrent writers race through the + publisher's CAS; the loser surfaces as + `ManifestConflictDetails::ExpectedVersionMismatch { table_key, + expected, actual }`. The HTTP server maps this to **409 Conflict** with + a structured `manifest_conflict` body so clients can detect-and-retry + without parsing the message. + +## Removed + +This is a breaking release. Pre-0.4.0 / no SLA. + +- `omnigraph::db::{RunRecord, RunStatus, RunId}` types and the + `_graph_runs.lance` / `_graph_run_actors.lance` Lance datasets. +- Engine APIs `begin_run`, `begin_run_as`, `publish_run`, + `publish_run_as`, `abort_run`, `fail_run`, `terminate_run`, + `list_runs`, `get_run`. +- HTTP endpoints: `GET /runs`, `GET /runs/{run_id}`, `POST + /runs/{run_id}/publish`, `POST /runs/{run_id}/abort`. The + `RunListOutput` and `RunOutput` schemas are removed from the OpenAPI + document. +- CLI subcommands: `omnigraph run list`, `omnigraph run show`, `omnigraph + run publish`, `omnigraph run abort`. Use `omnigraph commit list` + reading the commit graph for audit history. +- Cedar policy actions `run_publish` and `run_abort`. Existing + `policy.yaml` files referencing these actions will fail validation — + remove the rules; the `change` action covers the equivalent gating. 
+ +## Behavior changes + +- `mutate_as` / `load` are now **atomic per query, single publish at the + end**. A failed mutation leaves the target unchanged with no + intermediate manifest commits. +- The `OmniError::manifest_conflict` shape produced by concurrent + writers is now `ExpectedVersionMismatch` (was `MergeConflict::DivergentUpdate` + via the run merge path). Clients that match on the conflict body must + switch to inspecting `manifest_conflict.table_key/expected/actual`. + +## Known limitation + +A multi-statement mutation that writes a Lance fragment in op-N and then +fails in op-N+1 leaves the touched table with Lance HEAD ahead of the +manifest. The next mutation against that table fails with +`ExpectedVersionMismatch`. Most validation runs before any Lance write, +so single-statement mutations are unaffected; the narrow path is +multi-statement queries with late-op failures. Tracked as a follow-up; +see [docs/dev/runs.md](../dev/runs.md#known-limitation-mid-query-partial-failure-on-the-same-table) +for the workaround. + +## Upgrade notes + +- **Stale `__run__*` branches and `_graph_runs.lance`** in legacy v0.3.x + repos are *inert* — the engine no longer reads them — but they remain + on disk until production cleanup. This release deliberately does not touch + legacy bytes. +- The `is_internal_run_branch` predicate is kept as a defense-in-depth + guard against users naming a branch `__run__*`. It will be removed in + a follow-up cleanup. +- External scripts hitting `/runs/*` will now receive 404. Migrate them + to `/commits` for audit history; mutation status is implied by the + HTTP response on `/change` itself. 
+ +## Included Changes + +- Demote Run: write directly to target via publisher +- `ManifestBatchPublisher::publish` accepts per-table + `expected_table_versions` diff --git a/docs/releases/v0.4.1.md b/docs/releases/v0.4.1.md new file mode 100644 index 0000000..78211e4 --- /dev/null +++ b/docs/releases/v0.4.1.md @@ -0,0 +1,142 @@ +# Omnigraph v0.4.1 + +Omnigraph v0.4.1 closes the multi-statement-mutation atomicity gap that +v0.4.0 documented as a known limitation. Inserts and updates now route +through an in-memory `MutationStaging` accumulator and commit via Lance's +two-phase distributed-write API at end-of-query. A failed mid-query op +no longer leaves Lance HEAD drifted on the touched table — the next +mutation proceeds normally. + +## Highlights + +- **Staged-write rewire**: `mutate_as` and `load` (Append / + Merge modes) accumulate insert/update batches into + `MutationStaging.pending` per touched table. No Lance HEAD advance + happens during op execution; one `stage_*` + `commit_staged` per + table runs at end-of-query, then `ManifestBatchPublisher::publish` + commits the manifest atomically. **For op-execution failures** + (validation errors, missing endpoints, parse-time D₂ rejection), Lance + HEAD on every staged table is untouched and the next mutation + proceeds normally. A narrowed residual remains at the + finalize→publisher boundary (multi-table `commit_staged` is not + atomic with the manifest commit) — see [docs/dev/runs.md](../dev/runs.md) + "Finalize → publisher residual" for details. +- **D₂ parse-time rule**: a single mutation query is either + insert/update-only or delete-only. Mixed → rejected with a clear + error directing the caller to split into two queries. Lance 4.0.0 + has no public two-phase delete; deletes still inline-commit, and D₂ + keeps that path safe. 
+- **Read-your-writes via DataFusion `MemTable`**: read sites in + multi-statement mutations consume `TableStore::scan_with_pending`, + which Lance-scans the committed snapshot at the captured + `expected_version` and unions with a DataFusion `MemTable` over the + pending batches. Replaces the previous "reopen at staged Lance + version" pattern. +- **Coordinator swap-restore eliminated** from `mutate_with_current_actor`. + Branch is threaded explicitly through the per-op execution path + (`execute_named_mutation`, `execute_insert`, `execute_update`, + `execute_delete*`, `validate_edge_insert_endpoints`, + `ensure_node_id_exists`). The `swap_coordinator_for_branch` / + `restore_coordinator` API and `CoordinatorRestoreGuard` are removed + from `mutation.rs`. (`merge.rs` keeps its own swap pattern; that's + a separate workflow.) +- **`docs/dev/invariants.md` mutation atomicity / read-your-writes status** + flips from `aspirational/open` to `upheld for inserts/updates`. The within-query read-your-writes + guarantee is now load-bearing for the publisher CAS contract. + +## Behavior changes + +- A failed multi-statement mutation no longer surfaces + `ExpectedVersionMismatch` on the *next* mutation against the same + table. The next call proceeds normally — Lance HEAD on staged + tables is unchanged. +- Mixed insert/update + delete in one query is rejected at parse + time. Existing test queries that mixed both must be split. +- `MutationStaging`'s shape changed: `pending: HashMap` + + `inline_committed: HashMap` replaces the + previous `latest: HashMap`. This is an internal + type; no public API impact. + +## Residual / out of scope + +- **`LoadMode::Overwrite`** keeps the legacy inline-commit path + (truncate-then-append doesn't fit the staged shape). A mid-overwrite + failure can still drift Lance HEAD on a partially-truncated table; + the next overwrite replaces it. Operator-driven, rare. +- **Delete-only multi-statement mutations** still inline-commit per op. 
+ D₂ keeps inserts/updates from coexisting with deletes, so the + inline path remains atomic per op but not per query for delete-only + cascades. Closing this requires Lance to expose + `DeleteJob::execute_uncommitted`; tracked upstream with Lance. +- **`schema_apply`, `branch_merge_internal`, `ensure_indices`** still + use Lance's inline-commit APIs. The two-phase pattern is in + `mutate_as` and `load` only; hoisting it to a storage-trait invariant + covering all writers remains future work. + +## Tests added + +- `tests/runs.rs::partial_failure_leaves_target_queryable_and_unblocks_next_mutation` + (replaces the old `partial_failure_observably_rolls_back_but_blocks_next_mutation_on_same_table`) +- `tests/runs.rs::mutation_rejects_mixed_insert_and_delete_at_parse_time` +- `tests/runs.rs::mixed_insert_and_update_on_same_person_coalesces_to_one_merge` +- `tests/runs.rs::multiple_appends_to_same_edge_coalesce_to_one_append` +- `tests/runs.rs::multi_statement_inserts_publish_exactly_once` +- `tests/runs.rs::load_with_bad_edge_reference_unblocks_next_load` +- `tests/runs.rs::load_with_cardinality_violation_unblocks_next_load` + +## Files changed + +- `crates/omnigraph/src/exec/staging.rs` (NEW) — `MutationStaging`, + `PendingTable`, `PendingMode`, `StagedTablePath`, + `dedupe_merge_batches_by_id`. +- `crates/omnigraph/src/exec/mutation.rs` — D₂ check; per-op + rewires (`execute_insert`, `execute_update`, `execute_delete*`); + branch threading; coordinator-swap removal; helper + `validate_edge_cardinality_with_pending`; helper + `concat_match_batches_to_schema`; `apply_assignments` updated to + copy unassigned blob columns from full-schema scans. +- `crates/omnigraph/src/loader/mod.rs` — `load_jsonl_reader` split: + staged path for Append/Merge, legacy inline-commit path for + Overwrite. Helpers `collect_node_ids_with_pending` and + `validate_edge_cardinality_with_pending_loader`. 
+- `crates/omnigraph/src/table_store.rs` — `scan_with_pending`, + `count_rows_with_pending` (DataFusion `MemTable`-backed union with + Lance scan). +- `Cargo.toml` (workspace) + `crates/omnigraph/Cargo.toml` — added + `datafusion = "52"` direct dep (transitively pulled by Lance + already; required for `MemTable`). +- `docs/dev/runs.md` — removed "Known limitation" section; documented + the new accumulator + D₂ + LoadMode::Overwrite residual. +- `docs/dev/invariants.md` — mutation atomicity / read-your-writes status + flipped to `upheld for inserts/updates`. +- `docs/dev/architecture.md` — added "Mutation atomicity — in-memory + accumulator" subsection; refreshed the engine + state + diagrams to drop `RunRegistry` and add `MutationStaging`. +- `docs/dev/execution.md` — rewrote the mutation flow sequence diagram + for the staged-write path; updated the `LoadMode` table to call + out per-mode commit semantics; rewrote `load` vs `ingest`. +- `docs/user/query-language.md` — documented the D₂ parse-time rule. +- `docs/user/errors.md` — added the D₂ `BadRequest` rejection path. +- `docs/user/storage.md` — dropped the live `_graph_runs.lance` reference + from the layout diagram and prose. +- `docs/user/branches-commits.md` — moved `__run__` to a legacy note; + removed `publish_run` from the publish-trigger list. +- `docs/user/audit.md` — current `_as` API list refreshed; legacy + `RunRecord.actor_id` moved to a historical note. +- `docs/user/constants.md` — marked the run registry / branch-prefix rows + as legacy. +- `docs/user/cli.md` — replaced the legacy `omnigraph run *` quickstart + block with `omnigraph commit list/show`. +- `docs/dev/testing.md` — extended the `runs.rs` row to cover the new + staged-write contract tests; added the `staged_writes.rs` row. +- `AGENTS.md` (CLAUDE.md symlink) — updated the atomic-per-query + description and the L2 capability matrix row. 
+ +## Included Changes + +- Rewire `mutate_as` and `load` via in-memory `MutationStaging` + + `stage_*` / `commit_staged` per touched table at end-of-query. +- (The storage substrate shipped in v0.4.0's PR #67 — `StagedWrite`, + `stage_append`, `stage_merge_insert`, `commit_staged`, + `scan_with_staged`, `count_rows_with_staged` — and is the substrate + this release builds on.) diff --git a/docs/releases/v0.4.2.md b/docs/releases/v0.4.2.md new file mode 100644 index 0000000..bc45716 --- /dev/null +++ b/docs/releases/v0.4.2.md @@ -0,0 +1,115 @@ +# Omnigraph v0.4.2 + +Omnigraph v0.4.2 is a concurrency, admission-control, and release-hygiene +release. It removes the server-global write lock, lets disjoint writers make +progress concurrently, adds per-actor admission limits, hardens branch and +mutation races with snapshot-isolation fences, and documents the release in +public open-source terms. + +## Highlights + +- **Unlocked server engine handle**: the HTTP server now holds the engine behind + a shared handle instead of a server-global write lock. Concurrent handlers can + call engine APIs directly while the engine serializes only the resources that + actually conflict. +- **Engine-owned writer queues**: same `(table, branch)` writers are serialized + by per-table writer queues inside the engine, while disjoint table/branch + writes can run concurrently. This narrows contention without relying on route + handlers to know storage-level ordering rules. +- **Per-actor admission control**: mutating HTTP handlers are gated by a + `WorkloadController` with per-actor in-flight request and estimated-byte + budgets. Rejections use HTTP 429 with `code: too_many_requests` and a + `Retry-After` header, so noisy actors back off without blocking unrelated + actors. +- **Admission coverage for all mutating handlers**: `/change`, `/ingest`, + `/schema/apply`, branch create/delete, and branch merge now flow through the + admission controller. 
Read-only endpoints are not admission-gated. +- **Op-kind-aware version checks**: mutation commit-time drift checks distinguish + append-like inserts from strict update/delete work. Inserts remain permissive + enough for safe concurrent append patterns; updates and deletes get stricter + stale-view rejection. +- **Read-time drift checks for strict mutations**: staged mutations compare the + manifest pin captured when the query opened against the manifest snapshot + captured under table-queue ownership. If a concurrent writer moved the table + after the query read, the stale writer returns a structured + `manifest_conflict` 409 instead of staging work computed against an old + snapshot. +- **Inline-delete recovery coverage**: delete-only mutations still use Lance's + inline delete path, but their recovery sidecar is now written before the + manifest-version rejection path can return. If a delete moves Lance HEAD and a + concurrent manifest update makes the query stale, the next read-write open can + roll the residual back rather than leaving a head-ahead-of-manifest table. +- **Branch-operation race hardening**: branch creation and branch merge avoid + coordinator swap-restore races that could expose the wrong active branch to + concurrent work. Concurrent branch merges are serialized by a merge mutex. +- **Branch-merge target revalidation**: merges re-check target table versions + after acquiring target write queues. A stale merge plan returns a structured + conflict instead of overwriting concurrent target-branch changes or adopting a + source table over newly appended target rows. +- **Schema refresh deadlock fix**: recovery refresh releases the write guard + before schema reload, preventing a refresh/schema-apply deadlock. +- **Lean admission API**: removed the unused global rewrite admission pool, + `service_unavailable` error variant, related 503 documentation, and benchmark + flag. 
The public server surface now reflects only admission behavior that is + wired to handlers. +- **Open-source release hygiene**: this release adds guidance for public-facing + documentation, release notes, and version bumps. Release docs now avoid + private issue tracker references and use stable public descriptions instead. + +## Behavior changes + +- Disjoint mutating HTTP requests can now make progress concurrently instead of + queueing behind one process-wide engine write lock. +- Mutating handlers may return HTTP 429 when an actor exceeds per-actor in-flight + or estimated-byte budgets. Clients should respect `Retry-After` and retry + later. +- Concurrent update/delete and merge races now return structured + `manifest_conflict` 409 responses in more stale-view cases instead of relying + on later publisher-CAS detection or allowing a stale plan to proceed. +- Concurrent branch merge × change on the same target branch may return either + success or a clean 409 conflict, depending on which operation wins the queue. +- `OMNIGRAPH_GLOBAL_REWRITE_MAX` is no longer recognized. Remove it from + deployment manifests; use the per-actor in-flight and byte-budget admission + settings for the currently wired server controls. + +## Upgrade Notes + +- No repository migration is required. Existing v0.4.1 repos can be opened + directly with v0.4.2. +- Clients should treat `manifest_conflict` 409 responses as retryable stale-view + conflicts. This was already the documented contract, but this release uses it + in more concurrent-write paths. +- Clients should handle HTTP 429 from every mutating endpoint, not only + `/change`. Honor the `Retry-After` header. +- Operators should remove stale references to global rewrite admission and 503 + rewrite-pool exhaustion from local runbooks. +- If you maintain public docs or release notes, use public identifiers and + user-facing descriptions rather than private tracker IDs. 
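The client guidance in the v0.4.2 upgrade notes (retry on a `manifest_conflict` 409, back off on a 429 using `Retry-After`) can be sketched roughly as follows. This is an illustrative Python sketch, not an official client. The function name `retry_delay` and the backoff defaults are hypothetical, and it assumes that the 409 body exposes a top-level `manifest_conflict` key and that `Retry-After` is given in whole seconds.

```python
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str], body: Mapping[str, object],
                attempt: int, base: float = 0.1, cap: float = 5.0) -> Optional[float]:
    """Return seconds to wait before retrying, or None if the call should not be retried."""
    backoff = min(cap, base * 2 ** attempt)  # capped exponential backoff
    if status == 429:
        # Assumption: Retry-After carries whole seconds rather than an HTTP date,
        # and the header mapping uses this exact capitalization.
        raw = headers.get("Retry-After", "").strip()
        return float(raw) if raw.isdigit() else backoff
    if status == 409 and "manifest_conflict" in body:
        # Stale-view conflict: the caller should re-read and then retry.
        return backoff
    return None  # any other status is treated as non-retryable in this sketch
```

A caller would sleep for the returned delay and re-issue the request, re-reading any state its mutation depends on before retrying after a 409.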
+ +## Tests added or strengthened + +- Regression tests for update read-your-writes under in-process concurrency. +- HTTP tests for same-key insert snapshots, disjoint `/change` concurrency, and + `/ingest` admission 429 + `Retry-After`. +- Branch-operation regression tests for branch-create swap-restore races, + concurrent `/change` + branch-merge interleavings, branch-merge swap-restore + races, branch-op matrix coverage, and post-reopen consistency. +- Failpoint-backed regression coverage for inline-delete recovery sidecar + creation before version-mismatch rejection. +- Admission tests use injectable `WorkloadController` state instead of mutating + process environment. + +## Included Changes + +- Shared server engine state and per-actor admission on mutating endpoints. +- Per-(table, branch) writer queues and op-kind-aware manifest drift checks. +- Strict read-time version checks for updates/deletes. +- Branch create/merge race hardening and branch-merge target snapshot + revalidation under queue ownership. +- Retry-after support for admission rejections and OpenAPI updates for reachable + 429 responses. +- Actor-isolation benchmark harness updates for the current admission controller. +- Removal of the unwired global rewrite admission / 503 server surface. +- Version bump to `0.4.2` across workspace crates, `Cargo.lock`, and + `openapi.json`. +- Public release-note cleanup and new OSS best-practice guidance in `AGENTS.md`. diff --git a/docs/releases/v0.5.0.md b/docs/releases/v0.5.0.md new file mode 100644 index 0000000..16e284e --- /dev/null +++ b/docs/releases/v0.5.0.md @@ -0,0 +1,171 @@ +# Omnigraph v0.5.0 + +Omnigraph v0.5.0 is a substrate, security, and migration-safety release. 
It +jumps the storage substrate from Lance 4 to Lance 6.0.1 (DataFusion 52 → 53, +Arrow 57 → 58), introduces engine-wide Cedar policy enforcement on every +authoring path, and ships a structured schema-lint v1 chassis with +code-tagged diagnostics, soft drops, and an explicit `--allow-data-loss` +flag for destructive migrations. + +## Highlights + +- **Lance 6.0.1 substrate**: bump from Lance 4.0.0 → 6.0.1, DataFusion 52 → + 53, Arrow 57 → 58. New optimizer rules (vectorized `IN`-list eq kernel, + `PhysicalExprSimplifier`, push-limit-into-hash-join, CASE-NULL shortcut) + reach predicates that flow through the engine. `lance-tokenizer` replaces + tantivy internally; FTS behavior preserved. +- **Cedar policy engine**: a new `omnigraph-policy` crate wires + `Omnigraph::enforce(action, scope, actor)` into every `_as` writer + (`mutate_as`, `load_as`, `apply_schema_as`, `branch_create_as`, + `branch_merge_as`, `branch_delete_as`, plus the load and change + variants). The HTTP server defaults to deny-all when no Cedar policy is + configured; a YAML policy file is required to enable writes. Actor + identity comes only from signed token claims — clients cannot set actor + identity directly. +- **Schema lint v1 chassis**: diagnostics now carry stable codes of the form + `OG-XXX-NNN` instead of free-form messages. `omnigraph schema plan` and + `apply` understand soft drops on properties and types — destructive drops + require the new `--allow-data-loss` flag (Hard mode) at the CLI and an + equivalent JSON flag over HTTP. +- **Structured filter pushdown**: query-language predicates lower to + DataFusion `Expr` and push down through Lance's `Scanner::filter_expr` + instead of being flattened to SQL strings. This unlocks `CompOp::Contains` + pushdown (via `array_has`), which previously fell through to in-memory + post-scan filtering, and lets the DataFusion 53 optimizer rules above act + on our predicates. 
+- **HTTP `allow_data_loss` parity**: the destructive-drop guard now exists + on both the CLI (`--allow-data-loss`) and HTTP (`allow_data_loss: true` in + the schema-apply request body). +- **Inline query strings on CLI and HTTP**: `omnigraph read` / + `omnigraph mutate` and the corresponding HTTP endpoints accept inline + `.gq` source, not just a file path. Easier ad-hoc queries, clearer + request logs. +- **Browser CORS layer**: optional CORS layer on `omnigraph-server` for + browser-based UIs, gated by `OMNIGRAPH_CORS_ORIGINS`. +- **Merge-insert dup-rowid fix**: Lance's `MergeInsertBuilder` could surface + spurious `"Ambiguous merge inserts"` errors on sequential merges against + rows previously rewritten by `merge_insert`. The engine now opts into + `SourceDedupeBehavior::FirstSeen` with a `check_batch_unique_by_keys` + fail-fast precondition that guarantees source-side dedup happens before + Lance sees the batch. +- **Branch-merge error-path recovery**: a branch merge that failed + mid-flight could leave the in-process coordinator pointing at a stale + active branch. The error path now restores the prior coordinator, + matching the success path's invariant. +- **Branch merge with blob columns**: external blob URIs are now + materialized correctly during branch merge instead of being dropped or + pointing at the source branch. +- **Lance API surface guards**: a new test file + (`crates/omnigraph/tests/lance_surface_guards.rs`) pins eight specific + Lance API surfaces (`LanceError::TooMuchWriteContention`, + `ManifestLocation` fields, `MergeInsertBuilder` return shape, + `WriteParams::default`, `compact_files` signature, etc.) so the next + Lance bump fails compile or runtime on any silent drift rather than + producing wrong-state recovery in production. + +## Behavior changes + +- **On-disk format unchanged**: existing v0.4.2 datasets open unchanged. + The Lance file format pin stays at V2_2 (required by Lance's blob v2 + feature). 
+- **`omnigraph-server` defaults to deny-all under `--policy`**: starting a + server with the policy feature enabled but no Cedar YAML policy + configured rejects every write. Operators must supply a policy file to + authorize anything. +- **Schema-lint diagnostics carry stable codes**: messages now lead with + `OG-XXX-NNN`. CI parsers or tooling that keyed off the v0.4.2 free-form + text need to switch to code-based matching. +- **Destructive schema drops require `--allow-data-loss`**: dropping a + property or type returns a structured diagnostic by default. + `omnigraph schema apply --allow-data-loss` (CLI) or + `{"allow_data_loss": true}` (HTTP) opts into Hard mode. +- **`HashJoinExec` null-aware semantics on anti-join**: a side effect of + the DataFusion 53 bump — `NOT IN` semantics under null-valued anti-join + columns are now correct per SQL standard. Queries that depended on the + prior behavior would have been incorrect. + +## Upgrade Notes + +### Migration + +- No data migration. v0.4.2 repos open directly on v0.5.0. + +### Clients + +- HTTP and SDK clients should switch any string-matching schema-lint + parsing to code-based matching against the `OG-XXX-NNN` prefix. +- Clients exercising destructive schema drops (`DropProperty`, `DropType`) + must add the `allow_data_loss` request field (HTTP) or + `--allow-data-loss` flag (CLI). Default is soft-drop-or-reject. +- Clients consuming `mutate_as` / `load_as` / `apply_schema_as` / branch + authoring APIs now flow through the policy enforcer. Anything bypassing + authorization on v0.4.2 will be rejected on v0.5.0 once a policy is + configured. + +### Operators + +- Configure a Cedar policy YAML for production servers before enabling + writes; deny-all is the new default. The `omnigraph policy validate` / + `test` / `explain` CLI commands are unchanged. 
+- Bearer tokens continue to be the actor-identity source; review the + signed-token-claim-only invariant in `docs/dev/invariants.md` if you've + built custom authentication. +- If your local CI uses RustFS for S3-compatible storage testing, our CI + pins `rustfs/rustfs:1.0.0-beta.3` (the last known-good tag before the + upstream credentials-policy change). Mirror the pin or set + `RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true` for the new image + versions. + +## Tests added or strengthened + +- `crates/omnigraph/tests/lance_surface_guards.rs`: 8 named guards pinning + Lance API surfaces against silent drift on future bumps. +- `crates/omnigraph/tests/policy_engine_chassis.rs`: engine-level policy + enforcement coverage; complements the existing HTTP policy tests. +- Policy chassis e2e gap-fills: branch-merge, branch-create, branch-delete + policy paths now have explicit end-to-end tests over HTTP and CLI. +- Merge-pair truth table: exhaustive op-variant matrix for three-way + merge across `noop`, `addNode`, `removeNode`, `addEdge`, `removeEdge`, + `setProperty`, `dropProperty`, `addLabel`, `removeLabel`; the build + fails to compile when a new op variant is added without dispositioning + every pairing. +- Merge-insert: regression for the dup-rowid bug class on the load surface + (`load_merge_repeated_against_overlapping_keys_succeeds`), the update + surface (`second_sequential_update_on_same_row_succeeds`), and the + upstream-Lance-gap canary + (`load_merge_window_2_documents_upstream_lance_gap`). +- Maintenance + destructive-migration coverage: `omnigraph optimize` / + `cleanup` boundary cases, plus schema-apply soft-drop and Hard-mode + paths. +- Stable-row-id preservation across `stage_overwrite`: pins the invariant + that staged overwrites carry stable row IDs through to the committed + fragment set.
+- `CompOp::Contains` pushdown regression + (`ir_filter_with_list_contains_pushes_down`): pins the new structured + Expr pushdown path that retired the in-memory fallback. + +## Included Changes + +- Lance 4 → 6.0.1, DataFusion 52 → 53, Arrow 57 → 58 substrate upgrade. +- `omnigraph-policy` crate with engine-wide Cedar enforcement and + signed-token-claim-only actor identity. +- Schema-lint v1 chassis with `OG-XXX-NNN` codes, soft `DropProperty` / + `DropType` semantics, and `--allow-data-loss` for Hard mode. +- HTTP `allow_data_loss` request field parity with the CLI flag. +- Structured DataFusion `Expr` filter pushdown via + `Scanner::filter_expr`, with `CompOp::Contains` lowered through + `array_has`. +- Inline `.gq` source acceptance on CLI and HTTP read/mutate endpoints. +- Optional CORS layer on `omnigraph-server` for browser UIs. +- Bug fixes: merge-insert dup-rowid (FirstSeen + uniqueness precondition), + branch-merge coordinator restore on error, blob-column materialization + during branch merge. +- New Lance API surface-guard test file as the canary for future Lance + bumps. +- Recovery-sidecar coverage extended across the four write paths + (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, + `ensure_indices`) with failpoint regression tests. +- CI: pinned `rustfs/rustfs:1.0.0-beta.3` after the upstream `:latest` + introduced a credentials-policy change. +- Version bump to `0.5.0` across workspace crates, `Cargo.lock`, + `openapi.json`, and the `AGENTS.md` surveyed version. diff --git a/docs/releases/v0.6.0.md b/docs/releases/v0.6.0.md new file mode 100644 index 0000000..7984056 --- /dev/null +++ b/docs/releases/v0.6.0.md @@ -0,0 +1,141 @@ +# Omnigraph v0.6.0 + +Three pieces of work land in this release: + +1. The **graph terminology rename** (renamed `Repo` → `Graph` across the Cedar resource model, policy API, and query-lint schema source). +2.
**Multi-graph server mode**: one `omnigraph-server` process can now serve 1–10 graphs concurrently behind cluster routes (`/graphs/{graph_id}/...`), with per-graph and server-level Cedar policy, read-only `GET /graphs` enumeration, and CLI parity (`omnigraph graphs list`). +3. **Inline + canonical-named queries and mutations.** New `POST /query` and `POST /mutate` endpoints pair with the CLI's new `-e/--query-string` flag for ad-hoc execution without a temp file. `POST /read` and `POST /change` continue serving indefinitely as deprecated aliases that carry RFC 9745 `Deprecation: true` and RFC 8288 `Link: ; rel="successor-version"` response headers, plus `deprecated: true` in `openapi.json`. Same canonicalization on the CLI: `omnigraph query`, `omnigraph mutate`, and top-level `omnigraph lint` / `omnigraph check` replace `omnigraph read`, `omnigraph change`, and the nested `omnigraph query lint` / `omnigraph query check`. Every deprecated spelling remains a `visible_alias` that warns to stderr once per invocation. + +Runtime add/remove (`POST /graphs`, `DELETE /graphs/{id}`, `omnigraph graphs create`) is **not** in v0.6.0. Operators add or remove graphs by editing `omnigraph.yaml` and restarting. The first cut of `POST /graphs` shipped behind an atomic-YAML-rewrite design that we pulled before release once its concurrency guarantees were challenged (flock-on-renamed-inode race, duplicate-check outside the critical section, and an init-cleanup path that could destroy an existing graph's schema on re-init). The correct fix is a Lance-style cluster catalog (reserve → init → publish with recovery sidecars); that work is deferred. + +## Breaking Changes + +### Graph terminology rename + +- Renamed the Cedar resource entity from `Omnigraph::Repo` to `Omnigraph::Graph`. +- Renamed policy API terminology from `repo_id` to `graph_id` on `PolicyCompiler::compile` (and on the new `PolicyEngine::load_graph` / `PolicyEngine::load_server` loaders described below).
+- Renamed query-lint schema source JSON from `"repo"` to `"graph"` for `schema_source.kind`. + +### Multi-graph server mode + +- **Multi-graph deployments lose flat routes.** Single-graph invocation (`omnigraph-server `) is unchanged: same flat `/snapshot`, `/read`, `/branches`, etc. Multi-graph deployments serve those routes under `/graphs/{graph_id}/...`; bare flat paths return 404 in multi mode. +- **`ServerConfig` shape change** (programmatic embedders only): `ServerConfig { uri, policy_file }` is replaced by `ServerConfig { mode: ServerConfigMode }`, where `ServerConfigMode = Single { uri, policy_file } | Multi { graphs, config_path, server_policy_file }`. Callers that use `load_server_settings` are unaffected; callers that construct `ServerConfig` directly need to wrap their fields in `ServerConfigMode::Single`. +- **`AppState`'s routing surface** is `AppState::routing() -> &GraphRouting`, where `GraphRouting = Single { handle } | Multi { registry, config_path }`. The previous `AppState::uri()`, `AppState::mode()`, `AppState::registry()` accessors and the `ServerMode` enum are gone; embedders read `state.routing()` and match on the arm they need. Per-graph URIs live on `handle.uri`. +- **`AppState::new_multi`** is the new multi-graph constructor. Single-mode `new_*` / `open_*` constructors are unchanged. +- **`AuthenticatedActor(Arc)` → `ResolvedActor { actor_id, tenant_id, scopes, source }`** (programmatic embedders only). The struct shape changes, but the HTTP contract (bearer auth and the bearer-derived-actor-identity guarantee) is unchanged. Cluster-mode call sites construct with `tenant_id: None`, `scopes: vec![Scope::Full]`, `source: AuthSource::Static`. The new fields are forward-compat seams for future multi-tenant and OAuth deployments; they're inert in this release.
+- **`PolicyEngine::load(path, graph_id)` removed** in favor of two kind-typed loaders: `PolicyEngine::load_graph(path, graph_id)` for per-graph policies and `PolicyEngine::load_server(path)` for server-level policies. Each loader rejects rules whose action `resource_kind()` doesn't match the engine kind; operators who put a `graph_list` rule in a per-graph file (or a `read` rule in a server file) now get a load-time error instead of a silently-never-matching rule. +- **`PolicyRequest::actor_id` field removed.** Actor identity is now a separate parameter on `PolicyEngine::authorize(actor_id, &request)`. The type system enforces the server-authoritative-actor invariant: actor identity is always sourced from the bearer-token match resolved at the auth boundary; handlers cannot smuggle identity through the request body. +- **`Omnigraph::init` is strict by default.** Initialization at a URI that already holds schema files now errors with `OmniError::AlreadyInitialized` instead of silently overwriting. Operators who actually want to overwrite use `InitOptions { force: true }` (CLI: `omnigraph init --force`). Closes the destructive-cleanup footgun where a failed re-init would delete an existing graph's schema files. +- **Top-level `policy.file` is rejected in multi-graph server mode.** It remains valid for single-graph / CLI-local policy. Multi-graph deployments must move graph rules to `graphs..policy.file` and server-scoped `graph_list` rules to `server.policy.file`. +- **Open server startup requires explicit opt-in.** A server with no bearer tokens and no policy now refuses to start unless passed `--unauthenticated` or `OMNIGRAPH_UNAUTHENTICATED=1`. +- **Policy requires bearer tokens.** Configuring any policy file without bearer tokens now refuses startup; otherwise every protected request would 401 before Cedar could evaluate it.
+- **Tokens without policy default-deny non-read actions.** Existing authenticated deployments that relied on writes or admin routes without Cedar policy must add policy rules for those actions. +- **`GET /graphs` requires `server.policy.file` in every runtime state.** Even `--unauthenticated` mode keeps server topology closed until the operator explicitly authorizes `graph_list`. + +### Query / mutation rename + +- **`ChangeRequest` field rename**: `query_source` → `query`, `query_name` → `name`. Both legacy names continue to deserialize via `#[serde(alias = "...")]`, so existing clients sending the old JSON keys keep working. CLI remote calls against `/change` still emit the legacy keys verbatim through the `legacy_change_request_body` helper so a newer CLI talking to an older server keeps working byte-for-byte. +- **CLI `omnigraph query lint` / `omnigraph query check`** are now top-level; canonical name is **`omnigraph lint`**. The three deprecated invocations (`omnigraph query lint`, `omnigraph query check`, and bare `omnigraph check`) remain as argv-level shims that rewrite to `omnigraph lint` and print a one-line stderr deprecation warning. `check` is deliberately **not** a clap `visible_alias` on `lint`: two equivalent canonical names would split agent emissions between them depending on training-data drift, so the deprecation pattern (rewrite + warn) gives one unambiguous canonical name in `omnigraph --help`. + +## New + +- **Multi-graph mode**. Invoke with `omnigraph-server --config omnigraph.yaml` where the YAML has a non-empty `graphs:` map and no single-mode selector (no `server.graph`, no CLI `` or `--target`). At startup the server opens every configured graph in parallel (bounded concurrency, fail-fast). +- **`GET /graphs`**. Lists every registered graph, sorted alphabetically by `graph_id`. Auth-required when bearer tokens are configured; Cedar-gated by `PolicyAction::GraphList` against `Omnigraph::Server::"root"`.
Returns 405 in single mode. Server-scoped actions require an explicit `server.policy.file` in every runtime state: the management surface is closed by default even in `--unauthenticated` mode so that server topology is never exposed without operator opt-in. +- **CLI `omnigraph graphs list`**. Mirrors the HTTP surface. Rejects local URI targets with a clear message (for remote multi-graph servers only). +- **CLI `omnigraph init --force`**. Bypasses the strict-init preflight when an operator deliberately wants to recover from orphan schema files. Does NOT purge existing Lance datasets; recursive deletion needs `StorageAdapter::delete_prefix` (deferred; see below). +- **Per-graph Cedar policy**. Each entry in the `graphs:` map can carry a `policy.file` path, loaded at startup via `PolicyEngine::load_graph`. Cedar's `Omnigraph::Graph::""` resource is per-graph; the new `Omnigraph::Server::"root"` resource governs server-level actions. +- **Server-level Cedar policy**. `server.policy.file` in the config governs the `graph_list` action on `Omnigraph::Server::"root"`. Required to expose `GET /graphs` in every runtime state: without a server policy the default-deny posture rejects `graph_list`, including in `--unauthenticated` mode. +- **Cedar action vocabulary**: `graph_list` (server-scoped). Runtime `graph_create` / `graph_delete` are reserved but not shipped; see "Deferred." +- **Canonical graph URI identity.** Server startup normalizes graph root URIs before registry insertion and response output, so aliases such as `/tmp/g`, `/tmp/g/`, and `file:///tmp/g` cannot register as distinct graphs that actually share one Lance root. +- **`POST /query`** and **`POST /mutate`**. Canonical inline endpoints. `/query` rejects mutations with a typed 400 (the D2 rule lives at the URL: read-only contract enforced before execution); body uses the clean `{ query, name, params, branch, snapshot }` shape. `/mutate` accepts the same shape for mutations.
Both available in single mode and per-graph multi mode (`/graphs/{id}/query`, `/graphs/{id}/mutate`). Internal call sites share two helpers (`run_query`, `run_mutate`) that take decoupled args, not request bodies: the seam MR-969's future stored-query handler plugs into. +- **CLI `omnigraph query` / `omnigraph mutate`** as top-level canonical subcommands. Pairs with new top-level **`omnigraph lint` (alias `check`)** so query validation no longer sits under `omnigraph query`. +- **CLI `-e, --query-string `** on both `omnigraph query` and `omnigraph mutate`. 3-way mutex with `--query ` and `--alias `: exactly one is required. Empty string rejected. Suits ad-hoc exploration, REPL workflows, and agent tool-use without temp files. +- **Three-channel deprecation signal on `/read` and `/change`**: OpenAPI `deprecated: true` on the operation (every codegen flags the generated SDK method), RFC 9745 `Deprecation: true` response header, and RFC 8288 `Link: ; rel="successor-version"` (or ``) response header. Auto-discoverable; no SDK breakage. +- **`omnigraph.yaml` `aliases..command`** now accepts `query` and `mutate` as canonical values alongside the legacy `read` and `change`. The internal `AliasCommand` enum retains the legacy variant names so serialized configs stay byte-stable. + +## Configuration + +`omnigraph.yaml` schema additions (all optional, single-mode unaffected): + +```yaml +server: + bind: 0.0.0.0:8080 + policy: + file: ./server-policy.yaml # server-level Cedar (graph_list) + +graphs: + alpha: + uri: s3://tenant-bucket/alpha + policy: + file: ./policies/alpha.yaml # per-graph Cedar + beta: + uri: s3://tenant-bucket/beta + # no per-graph policy → engine-layer enforcement is a no-op +``` + +## Deferred + +- **`POST /graphs` runtime graph creation** and **CLI `omnigraph graphs create`**. Pulled before release after the YAML-rewrite design's correctness story didn't survive review.
A future release will add a managed cluster catalog (Lance-backed reserve → init → publish with recovery sidecars) and re-expose runtime creation on top of it. Until then, operators add graphs by editing `omnigraph.yaml` and restarting. +- **`DELETE /graphs/{id}`**. Never shipped in v0.6.0; deferred with the same cluster-catalog work. +- **`StorageAdapter::delete_prefix`**. The substrate primitive a managed catalog would need. Will land alongside runtime mutation. +- **`omnigraph init --force` purging Lance state.** Today `--force` only bypasses the schema-file preflight; recursive deletion of existing Lance datasets needs `delete_prefix`. +- **`X-Actor-Id` service delegation forwarding**. Needs durable both-actor audit on `_graph_commits.lance`; out of scope. +- **Hot policy reload**. Restart is cheap at N≤10 graphs. + +## User Impact + +- **No on-disk migration is required.** Existing `.omni` graphs from v0.5.0 (and earlier) open cleanly under v0.6.0: Lance datasets, `__manifest`, `_schema.pg`, `_schema.ir.json`, `__schema_state.json`, `_graph_commits.lance`, `_graph_commit_recoveries.lance` all use unchanged formats. No conversion step. +- **Existing single-graph storage upgrades without migration.** Server deployments may need auth/policy config changes: explicitly pass `--unauthenticated` for local open mode, configure tokens when using policy, and add Cedar policy for non-read authenticated actions. +- **Multi-graph adoption is opt-in.** Add a `graphs:` map to `omnigraph.yaml` (and remove `server.graph`) to switch a deployment to multi mode. +- **Cluster routes are breaking for client SDKs targeting multi mode.** Generated clients from previous v0.5.0 OpenAPI specs will hit 404 on flat paths against a multi-mode server. Regenerate against the v0.6.0 `openapi.json`.
+- **Supported YAML policy authoring is unchanged.** The Cedar `Omnigraph::Graph` and `Omnigraph::Server` entities are internally generated by `compile_policy_source`; operator YAML only references actions and groups. +- **Operators with unsupported raw Cedar policy files** should update `Omnigraph::Repo` resource references to `Omnigraph::Graph`. +- **Endpoint and CLI rename is cosmetic on the client side.** Existing callers on `/read`, `/change`, `omnigraph read`, `omnigraph change`, and `omnigraph query lint` keep working; they pick up the `Deprecation` + `Link` headers (or stderr deprecation warning on the CLI) so SDKs and proxies can surface the successor name automatically. New integrations should target the canonical names. ChangeRequest field names migrate at the caller's pace: both `query_source`/`query_name` and `query`/`name` accepted indefinitely. + +## Migration: single → multi + +```yaml +# Before (v0.5.0 single-mode invocation) +server: + graph: my-graph +graphs: + my-graph: + uri: /var/lib/omnigraph/my-graph +policy: + file: ./policy.yaml +``` + +```yaml +# After (v0.6.0 multi-mode: drop `server.graph` and the top-level `policy`) +server: + policy: + file: ./server-policy.yaml # NEW: governs GET /graphs +graphs: + my-graph: + uri: /var/lib/omnigraph/my-graph + policy: + file: ./policy.yaml # MOVED: was top-level +``` + +Same `omnigraph.yaml` file; restart the server. Clients targeting the old flat routes (`/snapshot`, `/read`, …) must update to `/graphs/my-graph/snapshot`, etc. + +To add a new graph after rollout: stop the server, append a new `graphs.` entry, restart. + +## Documentation + +- Public docs, CLI help, examples, server docs, and test helpers now consistently use "graph" for the OmniGraph data artifact. +- GitHub/source repository terminology remains spelled out as "repository" where needed.
+- New: `docs/user/cli.md` documents `omnigraph graphs list`; `docs/user/server.md` documents the multi-graph mode and the cluster route convention; `docs/user/policy.md` documents the per-graph vs server-scoped action distinction. +- New: `docs/user/server.md` documents `POST /query` / `POST /mutate` and the three-channel deprecation signal on `/read` / `/change`. `docs/user/cli.md` documents the `-e/--query-string` flag with examples. `docs/user/cli-reference.md` shows the canonical CLI verbs (`query`, `mutate`, `lint`, `check`) with legacy spellings as visible aliases. +- New: `docs/dev/rfc-001-queries-envelope-mcp.md` is the cross-cutting design doc for the inline / stored query work that started landing in this release. It sequences the v0.6.x patch series (request/response envelope hardening) and the v0.7.0 stored-query + MCP work. + +## Test coverage + +- `GraphId` newtype validation, registry race tests, init failpoints (still reachable from `omnigraph init` CLI). +- Mode-inference four-rule matrix, parallel multi-graph startup, cluster routing. +- Cedar `Server` resource refactor, backwards-compat for graph-only policies, kind-alignment rejection (server actions in graph files / vice versa). +- `GET /graphs` enumeration, 405-in-single-mode, 403-in-Open-mode-without-server-policy, Cedar admin/viewer authorization. +- Cluster routes with inner path params (`/branches/{branch}`, `/commits/{commit_id}`) deserialize correctly under axum 0.8 nested routing. +- Policy-requires-tokens startup invariant enforced uniformly across single and multi mode. +- The bearer-auth-derived-actor-identity regression test (client-supplied identity headers are ignored; the server-resolved actor is the only identity Cedar sees) stays green across the entire refactor. 
+ diff --git a/docs/releases/v0.7.0.md b/docs/releases/v0.7.0.md deleted file mode 100644 index 24cefdf..0000000 --- a/docs/releases/v0.7.0.md +++ /dev/null @@ -1,300 +0,0 @@ -# Omnigraph v0.7.0 - -v0.7.0 is three large arcs in one release. **Operations:** the cluster control -plane moves to object storage and the configuration architecture collapses to two -single-owner surfaces: a cluster can live entirely on an S3-compatible bucket, a -server boots from it with no local files, and the legacy combined `omnigraph.yaml` -is **removed**. **CLI:** the command-line surface is unified and made honest: -embedded and remote runs are one execution path, `load` becomes the single -bulk-write command, every command declares the **capability** it needs (and -rejects flags that don't apply), and the server boots only from a cluster. -**Engine & substrate:** Lance moves to 7.x, traversal/index/recovery internals -get faster and self-healing, and text embedding becomes provider-independent. - -## Highlights - -### Clusters & storage on object storage - -- **Clusters on object storage (`storage:`).** `cluster.yaml` gains an optional - `storage: s3://bucket/prefix` root. Every stored byte (state ledger, lock, - recovery sidecars, approval artifacts, catalog blobs, and the derived graph - roots at `/graphs/.omni`) flows through one storage layer, so - `file://` (the default, byte-compatible with existing clusters) and `s3://` - are a single code path. The ledger's compare-and-swap uses S3 conditional - writes (`If-Match`/`If-None-Match`), verified against AWS, RustFS, and other - S3-compatible stores; the state lock is genuinely cross-machine on object - storage. -- **Config-free serving: `--cluster s3://bucket/prefix`.** The server accepts a - bare storage-root URI and boots from the applied revision on the bucket; the - ledger and catalog are the whole deployment artifact.
Policy bundles serve as - digest-verified *content* from the catalog (never re-read from disk). The - preferred container shape becomes **bucket, no volume** (see - `docs/user/deployment.md`). -- **Cluster-only server.** `omnigraph-server` boots **only** from `--cluster - ` and serves N graphs (N ≥ 1) under cluster routes - (`/graphs/{id}/…`, plus a read-only `GET /graphs` enumeration). The old - single-graph flat-route mode, positional-`` boot, and `omnigraph.yaml` - `graphs:`-map boot are gone; add or remove graphs with `cluster apply` and - restart. -- **Resilient cluster boot with strict opt-out.** Graph-attributed startup - failures now quarantine that graph and let healthy graphs serve; `/graphs` - lists only ready graphs, and quarantined graph routes return 404. Cluster- - global failures still refuse boot, and `--require-all-graphs` (or - `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`) restores fail-fast all-or-nothing startup - for operators who prefer any degraded graph to abort the process. -- **One storage substrate + recovery liveness.** The cluster storage backend and - the engine both go through one `StorageAdapter` (versioned read, conditional - replace/CAS, prefix delete), exercised by a storage fault-injection matrix. - A long-lived server now heals a recoverable write on its *next write* rather - than only at restart. - -### Configuration: two single-owner surfaces - -The legacy combined `omnigraph.yaml` is **removed**. Configuration now lives in -two surfaces with single owners, plus a zero-config tier: - -- **Cluster config (`cluster.yaml` + checkout, team-owned)** declares what the - system *is*: graphs, schemas, stored queries, policies, storage. A server boots - from it via `--cluster`. -- **Per-operator config (`~/.omnigraph/config.yaml`, person-owned)** declares who - *you* are: `operator.actor` (the last hop of the `--as` chain), output - defaults, named servers + clusters, profiles, aliases, and a default scope. - `$OMNIGRAPH_HOME` relocates it.
-- **Credentials keyed by server name.** `omnigraph login ` stores a - bearer token in `~/.omnigraph/credentials` (created `0600`; over-permissive - files refused). Resolution for a request whose URL matches an operator-defined - server: `OMNIGRAPH_TOKEN_` env → the credentials file → the default - `OMNIGRAPH_BEARER_TOKEN`. A token is only ever sent to the server it is keyed to. -- **Operator targeting and aliases.** `--server ` (with `--graph ` for - multi-graph servers) addresses operator-defined endpoints. Operator aliases are - pure, **read-only** *bindings*, each mapping a personal name → (server, graph, stored-query - name, default params), that invoke catalog-owned stored queries; they carry no - query content and a binding to a stored mutation is rejected. -- **Default scopes.** `defaults.server` (served) or `defaults.store` (a zero-flag - *local* default, mutually exclusive with `server`) supply the no-flag scope, - with an optional `default_graph`. `--profile ` / `$OMNIGRAPH_PROFILE` - selects a named scope bundle wholesale; `omnigraph profile list` / - `profile show []` inspect what's defined (read-only). - -### Unified, capability-aware CLI - -- **One bulk-write command: `omnigraph load`.** `load` is now the single data-write - command and works against remote graphs (over HTTP with the same bearer/actor - resolution as every other remote command); it was previously the only data command - forced to open storage directly. `--mode overwrite|append|merge` is **required** - (overwrite is destructive, so there is no default); `--from ` opts into - fork-if-missing for `--branch`. `omnigraph ingest` becomes a **deprecated - alias** (`--from main --mode merge` defaults; one-line stderr warning). -- **No implicit branch forks.** Loading into a branch that does not exist is an - **error** unless `--from ` is given: a typo'd branch name no longer - silently forks `main` and lands your data there. Same rule on the server.
-- **One execution path, embedded ≡ remote.** Every CLI verb runs through one - `GraphClient` with two implementations (embedded engine, HTTP) sharing a single - wire-DTO crate (`omnigraph-api-types`). An executable parity matrix runs every - verb against both and asserts identical results, so local and remote no longer - drift. -- **Declared capabilities + honest addressing.** Every command declares the - **capability** it needs: `any` (run against a graph, served or embedded), - `served` (needs a server), `direct` (direct storage access), `control` - (manage/inspect a cluster), or `local` (no graph); the CLI enforces it. - Wrong-capability addressing now fails loudly with a declared message (e.g. - `--server` on `optimize`) instead of being silently ignored, and a maintenance - verb pointed at a remote target is rejected. `omnigraph --help` groups commands - by capability with a legend. -- **Address cluster graphs for maintenance.** `optimize` / `repair` / `cleanup` - accept `--cluster --graph ` (`--cluster` is a cluster directory, - storage-root URI, or a `clusters:` name from `~/.omnigraph/config.yaml`), - resolving the graph's storage URI from the served cluster state (no need to - hand-type `/graphs/.omni`). `--graph` is the single graph selector - across server and cluster scopes. Conversely, `omnigraph init` **refuses** a - cluster-managed path and points at `cluster apply`: graphs in a cluster are - created with ledger/recovery/approvals, not by hand. `schema apply` refuses a - cluster-managed graph for the same reason (and the server rejects a cluster- - backed schema apply with `409`, pointing at `cluster apply`). -- **Write diagnostics + destructive-write safety (RFC-011 Decision 9).** Every - write (`load`, `mutate`, `branch create|delete|merge`, `schema apply`, - `optimize`, `repair`, `cleanup`) echoes its resolved target + access path to - stderr, e.g.
`omnigraph load → s3://…/knowledge.omni (direct, remote)`, - suppressible with the global `--quiet`. Destructive writes against a - **non-local** scope (`cleanup`, overwrite `load`, `branch delete` against an - `http(s)://` server or `s3://` store/cluster) require explicit consent: the - global `--yes`, an interactive TTY prompt, or, for a non-interactive / - `--json` run, a hard refusal instead of silently proceeding. Local (`file://`) - writes are unaffected. -- **Route alignment: canonical `POST /load`.** The server gains a canonical - `POST /load`; `POST /ingest` is now a deprecated alias that emits RFC 9745 - `Deprecation: true` + RFC 8288 `Link: ; rel="successor-version"` - headers (a sibling-relative reference that resolves under `/graphs/{id}/…`). - The CLI's `load` targets `/load`. -- **Operator aliases get their own namespace (`omnigraph alias `).** A - personal binding to a stored query on a named server is invoked as - `omnigraph alias [args]` (RFC-011 Decision 4), so an alias can never - shadow, or be shadowed by, a built-in verb. `alias` rejects global scope - flags (`--server`/`--graph`/`--store`/`--cluster`/`--profile`/`--as`) its - binding already owns. -- **No-graph addressing lists candidates (RFC-011 Decision 7).** When a scope - has no `--graph` and no `default_graph`, the CLI never silently picks. A - **cluster** scope with exactly one applied graph uses it automatically and - otherwise **lists the candidates** (from the served catalog). A multi-graph - **server** lists the candidates (from `GET /graphs`) and requires `--graph `. -- **Invoke stored queries by name (RFC-011 Decision 3).** `omnigraph query - ` / `mutate ` invoke a stored query **by name** from the served - catalog: `omnigraph query find_people` instead of `--query find.gq --name - find_people`.
The verb asserts the query's kind (an `expect_mutation` flag on - `POST /queries/{name}`: `query ` is rejected with `'' is a - mutation — use omnigraph mutate `, and vice-versa). `.gq` files become - the explicit ad-hoc lane (`-e` / `--query`), with the positional selecting - which query in the source. - -### Engine & substrate - -- **Lance 6.0.1 → 7.0.0.** The columnar substrate is bumped to Lance 7.x with - correct-by-design alignment: the unenforced primary key is immutable once set, - `WriteParams::auto_cleanup` is disabled so version GC stays operator-owned, and - the native namespace/`object_store` 0.13 surface is pinned by surface-guard - tests. No on-disk format change for existing graphs. -- **Indexed graph traversal.** `Expand` can run over a BTREE-indexed path, - asserted semantically equal to the CSR traversal it accelerates. -- **Scalar index coverage + filter literal coercion.** Closes index-coverage gaps - and coerces filter literals correctly, cutting query latency on indexed scans. -- **Index materialization is derived state.** `schema apply` records - `@index`/`@key` *intent* and builds nothing (index-only changes touch no table - data); `load`/`mutate` build inline through one chokepoint but **defer** an - untrainable Vector column as *pending* instead of aborting; `optimize` is the - reconciler that materializes declared-but-missing indexes and folds appended - fragments back into existing ones. -- **Recovery liveness + one storage substrate.** Writers heal a recoverable - write on the *next write* (not only at the next read-write open); a storage - fault-injection matrix exercises the sidecar lifecycle; the cluster and engine - share one `StorageAdapter` over `object_store`. -- **Branch-fork self-heal.** Manifest-unreferenced branch forks are reclaimed - (eager best-effort + a `cleanup` reconciler backstop), so a failed branch-delete - reclaim no longer wedges a reused branch name.
-- **Composite `@unique(a, b)`.** Enforced as a true composite key, with one shared - keying function for intake and branch-merge that fails loudly on an un-keyable - column type rather than silently exempting it. - -### Embeddings: provider-independent (RFC-012) - -- **One client, any provider.** Text embedding moves to a single - provider-independent `EmbeddingConfig` behind a sealed `Provider` enum: - **OpenAI-compatible** (the **OpenRouter** default gateway β€” one key for many - models β€” plus OpenAI-direct and self-hosted endpoints), native **Gemini**, and - a deterministic **Mock**. One client serves both the query path and the offline - `omnigraph embed` CLI, with a per-query deadline and `tracing` observability. - The dead, uncallable compiler-crate OpenAI client (and its `reqwest`/`tokio` - deps) was removed. -- **Same-space guarantee.** `@embed("source", model="…")` records the embedding - identity (model) in the schema IR so it travels with the data; a string - `nearest()` whose resolved embedder model differs from the recorded one is - **rejected with a typed error** instead of silently ranking across vector - spaces. (`@embed` still does no ingest-time embedding β€” deferred to a later - phase.) - -## Breaking & behavior changes - -- **`omnigraph.yaml` is removed.** The CLI and server no longer read it at all; - the `OmnigraphConfig` type, `omnigraph config migrate`, and the deprecation - env vars (`OMNIGRAPH_NO_LEGACY_CONFIG`, `OMNIGRAPH_SUPPRESS_YAML_DEPRECATION`, - `OMNIGRAPH_CONFIG`) are gone. Configure via a team `cluster.yaml` and a - per-operator `~/.omnigraph/config.yaml` (see Upgrade notes). -- **`omnigraph-server` boots only from `--cluster`.** The positional-`` - single-graph boot and the `omnigraph.yaml` `graphs:`-map boot are removed; all - HTTP is under `/graphs/{id}/…` (with flat `/healthz` and the `/graphs` - enumeration). Upgrade deployments to `omnigraph-server --cluster `. 
-- **Default embedding provider flips to OpenRouter.** Embedding is no longer - hardwired to Gemini: the default provider is **OpenAI-compatible via - OpenRouter**, `OMNIGRAPH_GEMINI_BASE_URL` is dropped, and Gemini-direct users - must set `OMNIGRAPH_EMBED_PROVIDER=gemini`. A `nearest("string")` query whose - resolved model differs from a property's recorded `@embed(model=…)` is now a - typed error rather than silent cross-space ranking. -- **`query --alias ` is removed.** Invoke operator aliases via - `omnigraph alias [args]`. -- **`query`/`mutate` no longer take a positional graph URI, `--uri`, or - `--name`** (RFC-011 D3). The positional is now the query name; address the - graph with `--store` (local) / `--server` / `--profile`, and select a query - within an ad-hoc `--query`/`-e` source with the positional (replacing - `--name`). By-name catalog invocation is **served-only** (a bare `--store` has - no catalog β€” use `-e`/`--query` there). Scripts using - `query --query f.gq --name q` become - `query --store --query f.gq q`. -- **Legacy data-plane addressing removed** (#238): `--target`, the positional - `http(s)://`β†’remote dispatch, and `--as` on a served write (the actor is - resolved server-side from the bearer token) no longer exist. -- **`omnigraph load` replaces direct-storage-only loading; `--mode` is required.** - Scripts calling `load` without `--mode` must add one (`overwrite|append|merge`). -- **`omnigraph ingest` is deprecated** (still works; one-line stderr warning). - Use `load --from --mode `. -- **Loading into a missing branch is now an error without `--from`** (CLI and - `POST /load`/`POST /ingest`): a missing branch returns 404 / fails, never an - implicit fork. Pass `--from ` (CLI) or the request `from` field (HTTP) to - fork-if-missing. This affects any workflow that relied on auto-forking. 
-- **Scope flags that can't apply now error instead of being silently ignored.** - `--server` on any direct/control/session command, `--cluster` outside the - cluster-scoped verbs, and `--graph` where no multi-graph scope applies all fail - with a declared message. `--graph` is the single graph selector and is - **accepted** on `optimize` / `repair` / `cleanup` when paired with `--cluster` - (replacing the removed `--cluster-graph`). -- **`schema apply` is refused against a cluster-managed graph.** The CLI signposts - `omnigraph cluster apply`; a cluster-backed server returns `409 Conflict` - (after the Cedar gate, so an unauthorized actor still gets `403`). Cluster - graphs evolve through `cluster apply`, never a direct apply. -- **Storage-plane error text changed.** A maintenance verb pointed at a remote - target now fails with a declared direct-capability message (replacing the older - "only supported against local graph URIs" wording). Error strings are observable - contract (Hyrum); pin against the new text. -- **Non-local destructive writes now require `--yes` in automation.** A - `cleanup` / overwrite-`load` / `branch delete` against an `http(s)://` or - `s3://` target with `--json` (or any non-TTY context) previously executed; - it now **refuses** unless `--yes` is passed. CI scripts that destroy remote - data must add `--yes`. Local (`file://`) writes are unchanged. -- **`omnigraph init` no longer scaffolds a config file,** and **refuses a - cluster-managed storage path** (`/graphs/.omni` under a cluster) β€” - create those graphs with `cluster apply`. -- **`POST /ingest` is deprecated** (kept indefinitely as a shim) and returns - `Deprecation`/`Link` headers. **A v0.7 CLI talks to `POST /load`,** which a - pre-0.7 server does not expose β€” upgrade the server and CLI together, or keep - using `ingest`. 
-- **`ServingPolicy` (cluster crate API) carries verified policy content instead - of a blob path; `read_serving_snapshot` and several cluster command entry points - are now `async`.** -- **`omnigraph --help` reorders commands** (grouped by capability) and **hides - the deprecated `ingest`** from the listing β€” `ingest` still runs. Help text is - observable; this is a deliberate output change. - -## Upgrade notes - -- Existing clusters need no migration: an absent `storage:` key keeps the - config-directory layout byte-for-byte. -- **`omnigraph.yaml` is no longer read.** There is no automated migrate command - in 0.7.0; recreate configuration as a team `cluster.yaml` (graphs, schemas, - stored queries, policies β€” see `docs/user/clusters/`) plus a per-operator - `~/.omnigraph/config.yaml` (identity, servers, credentials, defaults β€” see - `docs/user/cli/reference.md`). -- **`omnigraph-server` now requires `--cluster `** β€” there is no - positional-URI boot. Run `cluster apply` first, then serve the applied revision. -- **Gemini-direct embedding users** set `OMNIGRAPH_EMBED_PROVIDER=gemini` (the - default is now OpenRouter); `OMNIGRAPH_GEMINI_BASE_URL` is removed. -- Audit scripts for two CLI changes: add `--mode` to every `load`, and add - `--from ` anywhere you relied on a missing branch being auto-created. -- Upgrade server and CLI together for the `/load` route (or keep `ingest`). -- Operator setup is three lines: `mkdir -p ~/.omnigraph`, write `operator.actor` - (and `servers:`) into `~/.omnigraph/config.yaml`, then - `echo $TOKEN | omnigraph login `. - -## Internals - -- The cluster, server, and CLI crates were modularized (the ~7.9k-line cluster - `lib.rs` into focused modules; the server and CLI test monoliths into per-area - suites) β€” pure code movement. 
-- The parity matrix (embedded vs remote) is the new referee for CLI behavior; the - OpenAPI drift test guards `openapi.json`; Lance-surface guard tests pin the - upstream APIs the engine depends on (the first smoke check on a Lance bump). -- Gated end-to-end suites run the full cluster lifecycle against a real - S3-compatible store in CI (lock-release regression, config-free boot from a - bare bucket URI). -- The deployment guide gains the bucket-no-volume container recipe for AWS / - S3-compatible object storage. -- `clap` updated to 4.6.1. CI runs the full workspace suite on `main` post-merge - rather than on every PR (faster PR turnaround; the local - `cargo test --workspace --locked` is the pre-merge gate). diff --git a/docs/releases/v0.7.1.md b/docs/releases/v0.7.1.md deleted file mode 100644 index 3497261..0000000 --- a/docs/releases/v0.7.1.md +++ /dev/null @@ -1,67 +0,0 @@ -# Omnigraph v0.7.1 - -A patch release on top of v0.7.0: three correctness fixes (camelCase filters, -cluster-apply crash loops, branch-merge OOM on embedding tables), one CLI -catalog-metadata improvement, and a warm-read performance fix. No breaking -changes, no on-disk format change, and no migration β€” drop-in over v0.7.0. - -## Fixes - -- **camelCase property filters now execute (#283).** A query β€” or a chained - mutation β€” that filtered on a camelCase schema field (e.g. `repoName`) linted - and planned cleanly but failed at run time with `No field named reponame. - Column names are case sensitive.` The identifier's case was destroyed at two - engineβ†’Lance boundaries: the read-filter pushdown built the column with a - case-normalizing constructor, and the pending-batch mutation scan re-parsed - the predicate through a normalizing SQL context. 
Both now preserve case (the - read path uses a case-preserving column reference; the pending scan disables - SQL identifier normalization), so camelCase fields work consistently in read - and write predicates and a camelCase `@index` equality still routes to the - scalar index. The fix is correct-by-construction rather than a per-query - guard; a regression test pins index routing so a silent full-scan fallback - can't slip back in. - -- **`cluster apply` no longer crash-loops a booting server (#284).** Applying a - schema change while a graph had non-main (agent/review) branches, or a - migration that needed a backfill, could throw a freshly-booting - `omnigraph-server --cluster` into an unescapable crash loop. Neither input is - an engine bug β€” the engine rejects both cleanly and before moving any graph - state β€” but `cluster apply` wrote a recovery sidecar before calling the - engine and left it in place on the clean rejection, and the server refuses to - boot while a sidecar is pending. The asymmetric-cleanup path is fixed so a - pre-movement rejection leaves no stale sidecar, breaking the loop. - -- **Branch-merge fast-forward no longer OOMs on embedding tables (#277).** A - branchβ†’main fast-forward merge of a forked, embedding-bearing table - re-derived the whole branch through a single Lance `merge_insert` β€” a - full-outer hash join over the entire delta β€” which exhausted the DataFusion - memory pool on high-dimensional embeddings (e.g. 8k rows Γ— 3072-dim) and hung - or failed the merge. New rows now stream through `stage_append` (no hash - join), only genuinely-changed rows are upserted, embeddings are no longer - stringified to diff them, and index coverage defers to the reconciler, so a - fast-forward merge completes in bounded work. The three-way merge path is - unchanged. 
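The fast-forward fix above separates rows that are new on the branch from rows that genuinely changed. A minimal sketch of that split, assuming rows are plain dicts keyed by id; `plan_fast_forward` and the row shape are hypothetical names for illustration (not the engine's API), everything is held in memory rather than streamed, and deletes are ignored for brevity:

```python
from typing import Any, Dict, Tuple

Row = Dict[str, Any]


def plan_fast_forward(
    target: Dict[str, Row], branch: Dict[str, Row]
) -> Tuple[Dict[str, Row], Dict[str, Row]]:
    """Split a branch's rows into (appends, upserts) relative to the target.

    Hypothetical illustration only: rows absent from the target are appended,
    rows whose content differs are upserted, and unchanged rows are skipped.
    """
    appends: Dict[str, Row] = {}
    upserts: Dict[str, Row] = {}
    for row_id, row in branch.items():
        if row_id not in target:
            appends[row_id] = row  # new row: append it
        elif target[row_id] != row:
            upserts[row_id] = row  # genuinely changed row: upsert it
    return appends, upserts


target = {"a": {"name": "Ada"}, "b": {"name": "Bob"}}
branch = {"a": {"name": "Ada"}, "b": {"name": "Bobby"}, "c": {"name": "Cy"}}
appends, upserts = plan_fast_forward(target, branch)
print(sorted(appends), sorted(upserts))
```

The point of the split is that only the upsert set needs a keyed comparison against existing rows; the append set can be written without one.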
- -## Improvements - -- **`omnigraph queries list` surfaces stored-query `@description` / - `@instruction` (#280).** The CLI now shows a stored query's catalog metadata β€” - what it does and how to invoke it β€” in both human and `--json` output, - matching what `GET /queries` already returned. Previously both fields were - silently dropped on the CLI side. - -- **Warm reads no longer pay an O(history) metadata tax (#268).** Warm reads - used to re-derive per-query metadata (coordinator re-open, `__manifest` + - commit-graph re-scans, per-table re-open, double schema validation) on a cost - that scaled with commit history and never warmed up. A warm same-branch read - now does one cheap version probe, one schema read, and zero table opens on a - warm repeat (warm coordinator reuse, open-by-location+version, validate-once, - held `Dataset` handles + one shared Lance `Session` per graph). This also - closes a commit-DAG fork where a same-branch write after an external commit - could append off a stale cached head. - -## Upgrade notes - -Drop-in over v0.7.0 β€” no configuration, schema, or data changes. Upgrade the -server and CLI together as usual. Graphs created on v0.7.0 read and write -identically on v0.7.1. diff --git a/docs/releases/v0.7.2.md b/docs/releases/v0.7.2.md deleted file mode 100644 index ecf0acf..0000000 --- a/docs/releases/v0.7.2.md +++ /dev/null @@ -1,60 +0,0 @@ -# Omnigraph v0.7.2 - -A patch release over v0.7.1: write-path latency reductions plus three -correctness fixes on the maintenance and recovery paths. No breaking changes, no -on-disk format change, and no migration β€” drop-in over v0.7.1. 
- -## Performance - -- **Write opens go direct, schema validates once (#288, #298).** Write opens - used to route through the per-table Lance namespace catalog, which re-opened - the dataset just to read its location and re-resolved the latest version on - every table open β€” an O(commit-depth) double resolution that dominated write - latency on object stores (~70%). Writes now open each touched data table - directly by its manifest-recorded location (Lance's O(1) version-hint path), - validate the schema contract once per write instead of ~4Γ—, and open each - touched table once instead of 4Γ—. - -- **`optimize` compacts the internal metadata tables (#291).** `optimize` - previously iterated only node/edge tables, so the internal `__manifest`, - `_graph_commits`, and `_graph_commit_actors` tables accumulated one fragment - per commit and were never compacted β€” making every write's metadata scan grow - with commit history. `optimize` now compacts all three, so a periodically - optimized long-lived graph keeps its per-write metadata scan flat in history. - -## Fixes - -- **`optimize` survives a cross-process write race (#297).** A CLI `optimize` - racing a served write on the same table could fail: the in-process write queue - doesn't serialize across processes, so a concurrent insert/delete advancing the - manifest between optimize's compaction and its publish broke the strict - equality CAS. Optimize now reopens-and-replans on a genuine Lance conflict and - fast-forwards its publish monotonically, so a maintenance compaction never - fails a live write. Bounded retry; sustained contention surfaces a loud - conflict rather than dropping work. - -- **`optimize` is non-destructive on upgraded graphs (#291).** A graph created by - a pre-0.7.0 binary carries an on-by-default Lance auto-cleanup config; under it, - optimize's compaction commit could fire Lance's version-GC hook and prune - `__manifest`-pinned versions (breaking snapshots and time travel). 
Optimize now - strips any stale `lance.auto_cleanup.*` config off every table β€” data and - internal β€” before its HEAD-advancing commits, so compaction can never GC pinned - versions. - -- **Recovery converges instead of failing `open` under a concurrent manifest - advance (#296).** The open-time recovery sweep published its roll-forward at the - sidecar's pinned expected version; if another writer advanced the manifest - during the classifyβ†’publish window, the CAS failed and aborted the whole - `Omnigraph::open`. The sweep now treats roll-forward as "the manifest reflects - the sidecar's committed state," not "this sweep won the CAS": on a CAS loss it - re-reads the live manifest and, when the sidecar's intent is already satisfied, - records the recovery and deletes the sidecar idempotently β€” so a concurrent - advance no longer fails the open. (The destructive roll-back twin still defers - to a cross-process lease, as documented.) - -## Upgrade notes - -Drop-in over v0.7.1 β€” no configuration, schema, or data changes. Upgrade the -server and CLI together as usual. Graphs created on v0.7.1 read and write -identically on v0.7.2; the optimize non-destructive fix additionally protects -graphs created by pre-0.7.0 binaries from version GC during compaction. diff --git a/docs/releases/v0.8.0.md b/docs/releases/v0.8.0.md deleted file mode 100644 index f85acdc..0000000 --- a/docs/releases/v0.8.0.md +++ /dev/null @@ -1,61 +0,0 @@ -# Omnigraph v0.8.0 (in progress) - -> Draft release notes for the next minor. The version line in `AGENTS.md` and the -> crate manifests are bumped when this release is cut β€” these notes track the -> user-visible delta as the RFC-013 work lands. - -This release moves the graph commit lineage into `__manifest` (RFC-013 Phase 7) -and ships a **one-time on-disk migration** for existing graphs. 
It is the first -release with an internal-schema change since v0.4.0, so it has an upgrade-order -requirement: read the upgrade notes before rolling it out. - -## Graph lineage now lives in `__manifest` (internal schema v4) - -The graph commit DAG (commits, parents, merge parents, per-branch heads, and the -authoring actor) is now stored in `__manifest` as `graph_commit` / `graph_head` -rows, written in the **same commit (CAS)** as the table-version rows of a graph -publish. Previously the lineage lived in a separate `_graph_commits.lance` -dataset written after the manifest commit, leaving a narrow window where a crash -could land a manifest version with no matching lineage row. Folding the lineage -into the publish closes that gap by construction: a graph commit and its lineage -now land atomically at one manifest version. The in-memory commit graph is a -projection of those manifest rows; `_graph_commits.lance` is retained only as a -carrier for Lance branch refs and no longer receives commit rows. - -This bumps the `__manifest` internal schema stamp from **v3 to v4**. - -## Existing graphs migrate seamlessly on first write - -A graph created by an earlier binary (internal schema v3) keeps its lineage in -`_graph_commits.lance` with none in `__manifest`. On the **first read-write -open**, Omnigraph backfills that lineage into `__manifest` (the `migrate_v3_to_v4` -internal-schema step) and bumps the stamp to v4. The migration: - -- is **per-branch**: each branch backfills on its first write; -- is **idempotent and crash-safe**: the stamp bump is the last step, and the - backfill is keyed on the commit id, so a crash mid-migration re-runs harmlessly - on the next open; -- **preserves all data**: every commit, parent, merge parent, actor, and head is - carried over; commit ids are stable, so existing references still resolve. - -No data is lost and no operator action is required beyond upgrading the binary. 
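The "idempotent and crash-safe" property described above comes from two choices: the backfill is keyed on the commit id, and the stamp bump is the last step. A minimal sketch of that pattern, assuming plain dicts stand in for `_graph_commits.lance` and `__manifest`; `backfill_lineage_sketch`, the dict shapes, and the `crash_after` knob are hypothetical and are not the engine's actual migration code (per-branch handling is omitted):

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-ins: `legacy` models the old lineage rows,
# `manifest` models __manifest with its schema stamp. Not OmniGraph's real types.


def backfill_lineage_sketch(
    legacy: Dict[str, dict], manifest: dict, crash_after: Optional[int] = None
) -> None:
    """Backfill lineage rows keyed by commit id, then bump the stamp last."""
    if manifest["stamp"] >= 4:
        return  # already migrated: nothing to do
    for copied, (commit_id, row) in enumerate(legacy.items()):
        if crash_after is not None and copied >= crash_after:
            raise RuntimeError("simulated crash mid-migration")
        # Keyed write: re-running overwrites the same key instead of duplicating.
        manifest["graph_commit"][commit_id] = dict(row)
    manifest["stamp"] = 4  # last step, so a crash above leaves the stamp at v3


legacy = {"c1": {"parent": None}, "c2": {"parent": "c1"}, "c3": {"parent": "c2"}}
manifest = {"stamp": 3, "graph_commit": {}}

try:
    backfill_lineage_sketch(legacy, manifest, crash_after=2)  # interrupted attempt
except RuntimeError:
    pass
partial_rows = len(manifest["graph_commit"])

backfill_lineage_sketch(legacy, manifest)  # the next open re-runs the backfill
print(partial_rows, manifest["stamp"], len(manifest["graph_commit"]))
```

Because the stamp only moves after every row is written, an interrupted run is indistinguishable from an unmigrated graph plus some already-correct rows, which is why re-running is harmless.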
- -Before its first write migrates the graph, a **read-only** open of a v3 graph -(e.g. `omnigraph commit list`, NDJSON export) still reads correct history via a -transitional fallback that sources the commit DAG from `_graph_commits.lance`; -read-only opens never write, so they never migrate, but they never show an empty -history either. - -## Breaking: upgrade writer binaries first - -Internal schema v4 is a hard version gate. Once a graph has been opened for write -by a v0.8.0 binary, its `__manifest` is stamped v4, and an **older binary will -refuse to open it** (read-write *and* read-only) with an -`upgrade omnigraph before opening this graph` error rather than silently -misreading the new lineage. This is the standard forward-version protection -(same shape as the v1→v2 / v2→v3 steps), now enforced on the read-only path too. - -**Upgrade order:** upgrade every writer (and reader) binary that touches a graph -to v0.8.0 before, or together with, the first write under the new version. A -mixed fleet where an old binary still writes the same graph is unsupported, as -with any internal-schema bump. diff --git a/docs/rfcs/0000-template.md b/docs/rfcs/0000-template.md deleted file mode 100644 index 48f4bda..0000000 --- a/docs/rfcs/0000-template.md +++ /dev/null @@ -1,54 +0,0 @@ -# RFC NNNN: <title> - -| | | -|---|---| -| **Status** | Proposed | -| **Author(s)** | <your name / handle> | -| **Discussion** | <link to the originating Discussion, if any> | -| **Implementation** | <issue/PR links, filled in as work lands> | - -> Status is maintained by maintainers: `Proposed` while the PR is open, -> `Accepted` on merge, `Declined` on close, `Superseded by NNNN` later. - -## Summary - -One paragraph: what this changes, in plain terms. - -## Motivation - -What problem does this solve, and why is it worth the ongoing cost? Tie it to a -concrete need (a Discussion, a recurring issue, a user request). 
Per the -project's first principle, argue the *long-run liability*, not just the -short-term convenience. - -## Guide-level explanation - -Explain the change as you'd teach it to a user or contributor: new commands, -syntax, API shapes, behavior. Examples first. - -## Reference-level design - -The precise design: data structures, IR/AST/planner changes, storage/format -impact, migration path, error behavior. Enough that a reviewer can find the -holes. - -## Invariants & deny-list check - -Which Hard Invariants in [../dev/invariants.md](../dev/invariants.md) does this -touch? Does it brush against any deny-list item, and if so, why is this the -justified exception? State explicitly that no invariant is weakened, or which -Known Gap moves. - -## Drawbacks & alternatives - -What does this cost, what did you reject, and why. "Do nothing" is a valid -alternative to weigh. - -## Reversibility - -Is this reversible? On-disk/wire/format and substrate choices are near-permanent -and demand more evidence; a CLI flag or doc is cheap to undo. Say which this is. - -## Unresolved questions - -What's deliberately left open for review to settle. diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md deleted file mode 100644 index 99cdd76..0000000 --- a/docs/rfcs/README.md +++ /dev/null @@ -1,66 +0,0 @@ -# RFCs - -Substantial changes to OmniGraph (new user-facing surface, format or protocol -changes, anything irreversible or cross-cutting) go through a lightweight RFC -so the design is agreed *as reviewable code* before implementation starts. This -is the public RFC track, open to **anyone, including external contributors**. - -This complements the always-on review bar in -[../dev/invariants.md](../dev/invariants.md): the invariants say *what every -change must respect*; an RFC says *why this particular change is worth making and -how*. 
- -> **Two tracks, don't conflate them.** This `docs/rfcs/` directory is the -> **public contribution** track (anyone authors; maintainers accept). The -> maintainer-internal RFCs under `docs/dev/rfc-00N-*.md` are a separate, -> team-owned track for in-flight internal work. If you're an outside -> contributor, you're in the right place here. - -## When you need one - -- **RFC required:** new query/schema/CLI/HTTP surface; on-disk or wire-format - changes; a new substrate dependency; anything the deny-list in - [../dev/invariants.md](../dev/invariants.md) flags; anything irreversible - ("reversibility shapes evidence demand"). -- **RFC not required:** bug fixes for an `accepted` issue, and the trivial - fast-lane (typos, docs, deps); see [../../CONTRIBUTING.md](../../CONTRIBUTING.md). - -If you're unsure, start a [Discussion](../../../discussions); a maintainer will -tell you whether it needs an RFC. - -## Lifecycle - -``` -Discussion (incubate, get rough consensus) - │ graduate - ▼ -RFC pull request → adds docs/rfcs/NNNN-title.md (Status: Proposed) - │ -maintainer review ──▶ changes requested / declined (PR closed, with rationale) - │ - ▼ -merged == Accepted (the merged file is the durable decision record) - │ - ▼ -Implementation PR(s) reference the accepted RFC -``` - -- **Author:** anyone. **Acceptance:** a maintainer decision, performed by - merging the RFC PR. Declining is closing it with rationale. -- The merged RFC *is* the accepted record; there is no separate sign-off step. -- Later reversals don't edit history: supersede with a new RFC that links back - and flip the old one's `Status` to `Superseded`. - -## Numbering & naming - -- File: `docs/rfcs/NNNN-kebab-title.md`, where `NNNN` is the next free - zero-padded integer (`0001`, `0002`, …). `0000-template.md` is reserved. -- Pick the number when you open the PR; if it collides with another in-flight - RFC, the second to merge bumps theirs. 
- -## Status values - -`Proposed` (open PR) · `Accepted` (merged) · `Declined` (closed) · -`Superseded by NNNN` · `Implemented` (set once the work lands, optional). - -Copy [0000-template.md](0000-template.md) to start. diff --git a/docs/user/audit.md b/docs/user/audit.md new file mode 100644 index 0000000..e8abe5b --- /dev/null +++ b/docs/user/audit.md @@ -0,0 +1,7 @@ +# Audit / Actor tracking + +- `Omnigraph::audit_actor_id: Option<String>` is the actor in effect. +- `_as` variants of every write API let callers override the actor: `mutate_as`, `ingest_as`, `branch_merge_as`, `apply_schema_as`, etc. +- Actor IDs are persisted on `GraphCommit.actor_id` with split storage in `_graph_commit_actors.lance` (the commit graph is split into `_graph_commits.lance` for the linkage and `_graph_commit_actors.lance` for the actor map). +- HTTP server uses the bearer-token actor automatically; CLI uses the local user / explicit env (no implicit actor). +- Pre-v0.4.0 graphs also stored actor IDs on `RunRecord.actor_id` in `_graph_runs.lance` / `_graph_run_actors.lance`. The Run state machine was removed in MR-771; those files are inert post-v0.4.0 and reclaimed by MR-770's production sweep. diff --git a/docs/user/branches-commits.md b/docs/user/branches-commits.md new file mode 100644 index 0000000..de6c653 --- /dev/null +++ b/docs/user/branches-commits.md @@ -0,0 +1,63 @@ +# Branches, Commits, Snapshots + +## L1: Lance per-dataset branches + +Lance supports branching at the dataset level: a branch is a named lineage of versions, and `fork_branch_from_state(source_branch, target_branch, source_version)` creates a copy-on-write fork. + +## L2: Graph-level branches + +OmniGraph builds *graph branches* on top by branching every sub-table coherently: + +- `branch_create(name)` / `branch_create_from(target, name)`: disallowed name `main`; fails if branch exists; ensures the schema-apply lock is idle. 
+- `branch_list()`: returns public branches, **filters internal** `__run__…` and `__schema_apply_lock__` prefixes. +- `branch_delete(name)`: refuses if there are descendants or active runs on the branch; cleans up owned per-branch fragments. +- **Lazy forking**: a branch only forks a sub-table when that sub-table is first mutated on it. Pure-read branches share fragments with their source. +- `sync_branch(branch)`: re-binds the in-memory handle to the latest head of the branch. + +## L2: Commit graph (`db/commit_graph.rs`) + +In-memory shape of a graph commit: + +``` +GraphCommit { + graph_commit_id: ULID, + manifest_branch: Option<String>, + manifest_version: u64, + parent_commit_id: Option<String>, + merged_parent_commit_id: Option<String>, // populated for merge commits + actor_id: Option<String>, // joined in-memory from _graph_commit_actors.lance, NOT a column on _graph_commits.lance + created_at: i64 (microseconds since epoch), +} +``` + +Storage is split across two Lance datasets (both with stable row IDs): + +- `_graph_commits.lance`: every column above *except* `actor_id`. +- `_graph_commit_actors.lance`: optional separate `(graph_commit_id, actor_id)` map, created on demand. The `actor_id` field above is populated by joining this dataset in-memory at load time. + +Notes: + +- Every successful publish (load / change / merge / schema_apply) appends one commit. +- Merge commits have two parents; linear commits have one. +- API: `list_commits(branch)`, `get_commit(id)`, `head_commit_id_for_branch(branch)`. + +## L2: Snapshots & time travel + +- `snapshot()`: current snapshot for the bound branch; cached. +- `snapshot_of(target)`: snapshot at a `ReadTarget` (branch | snapshot id). +- `snapshot_at_version(v: u64)`: historical snapshot from any manifest version. +- `entity_at(table_key, id, version)`: single-entity time travel without building a full snapshot. 
+- A `Snapshot` is a `(version, HashMap<table_key, SubTableEntry>)` pair: cheap to build, with snapshot-isolated cross-table reads. + +## L2: Internal system branches + +Filtered from `branch_list()` but visible to internals: + +- `__schema_apply_lock__`: serializes schema migrations. +- `__run__<run-id>`: legacy from the pre-v0.4.0 Run state machine (removed in MR-771). The branch-name guard predicate `is_internal_run_branch` is kept as defense-in-depth so users cannot create a branch matching the legacy prefix; the filter will be removed once production legacy branches are swept (MR-770). + +## L2: Recovery audit trail + +The four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) protect their multi-table commits with a sidecar at `__recovery/{ulid}.json` written before Phase B and deleted after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the recovery sweep in `crates/omnigraph/src/db/manifest/recovery.rs`: classify per-table state, decide all-or-nothing per sidecar, roll forward / back, record an audit row. + +Audit rows live in `_graph_commit_recoveries.lance` (sibling to `_graph_commits.lance`) and reference the commit graph by `graph_commit_id`. The linked recovery commit is identified by that same `graph_commit_id`, and `actor_id="omnigraph:recovery"` is stored in `_graph_commit_actors.lance` (joined by `graph_commit_id`); `_graph_commits.lance` itself does not carry the `actor_id` column. To find recoveries for a specific original actor: `omnigraph commit list --filter actor=omnigraph:recovery`, then join to `_graph_commit_recoveries.lance` by `graph_commit_id` to read `recovery_for_actor`. Schema: see `crates/omnigraph/src/db/recovery_audit.rs`. 
diff --git a/docs/user/branching/index.md b/docs/user/branching/index.md deleted file mode 100644 index 20ea125..0000000 --- a/docs/user/branching/index.md +++ /dev/null @@ -1,40 +0,0 @@ -# Branches, Commits, Snapshots - -## L1: Lance per-dataset branches - -Lance supports branching at the dataset level: a branch is a named lineage of versions, and a copy-on-write fork creates a new branch from a source branch at a given version. - -## L2: Graph-level branches - -OmniGraph builds *graph branches* on top by branching every sub-table coherently: - -- **Create** (`branch create` / `branch create --from <target>`): the name `main` is disallowed; fails if the branch exists. Atomic: the new branch becomes visible all-or-nothing, so a name never half-exists. -- **List** (`branch list`): returns public branches, **filtering the internal** `__schema_apply_lock__` branch. -- **Delete** (`branch delete`): refuses if there are descendants on the branch, or if it is the current branch. Once deleted, the branch is gone from every snapshot. The owned per-table forks are reclaimed best-effort; if that reclaim hits a transient object-store error, the leftover storage is reclaimed later by the [`cleanup`](../operations/maintenance.md) command. One consequence: if a delete's reclaim fails, reusing that branch name before the next `cleanup` surfaces a clear error pointing at `cleanup`. -- **Lazy forking**: a branch only forks a sub-table when that sub-table is first mutated on it. Pure-read branches share storage with their source. If two writers race to first-write the same branch, the loser gets a retryable "refresh and retry". - -## L2: Commit graph - -Each graph commit carries a ULID id, the manifest branch and version it published, its parent commit (two parents for a merge commit, one for a linear commit), the actor who made it, and a creation timestamp. - -- Every successful publish (load / change / merge / schema apply) appends one commit. 
-- Merge commits have two parents; linear commits have one. -- Inspect history with `commit list` and `commit show`. - -## L2: Snapshots & time travel - -Reading a branch at a past version, or a single entity at a past version, is -covered on the [time travel](time-travel.md) page. Merging branches and the -conflict kinds are on the [merge](merge.md) page. - -## L2: Internal system branches - -- `__schema_apply_lock__`: serializes schema migrations; filtered from `branch list` but used internally. - -## L2: Recovery audit trail - -Interrupted multi-table writes are recovered automatically the next time the graph is opened read-write. Recovery commits are recorded in the audit trail under the actor `omnigraph:recovery`, so you can find them with: - -```bash -omnigraph commit list --filter actor=omnigraph:recovery -``` diff --git a/docs/user/branching/merge.md b/docs/user/branching/merge.md deleted file mode 100644 index cb54ed6..0000000 --- a/docs/user/branching/merge.md +++ /dev/null @@ -1,66 +0,0 @@ -# Merging Branches - -Merging integrates the changes on one branch into another. OmniGraph merges are -**three-way and row-level**: it compares both branches against their common -ancestor and merges each node/edge table row by row, then publishes the result as -**one atomic commit** across the whole graph. - -```bash -omnigraph branch merge review/2026-04-25 --into main s3://bucket/graph.omni -``` - -`branch merge <source> [--into <target>]` merges `<source>` into `<target>` -(default `main`). - -## Outcomes - -A merge resolves to one of three outcomes: - -- **Already up to date**: the target already contains every change on the source; - nothing to do. -- **Fast-forward**: the target has no changes the source lacks, so the target - simply advances to the source. -- **Merged**: both sides diverged; a new merge commit is created with two parents. 
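The three outcomes above can be read as an ancestry test on the commit DAG. A minimal sketch, assuming a commit DAG represented as a dict from commit id to parent ids; `merge_outcome`, `ancestors`, and the commit ids are hypothetical names for illustration, not OmniGraph's API, and the sketch ignores row-level conflict detection entirely:

```python
from typing import Dict, List, Set

# Hypothetical in-memory commit DAG: commit id -> list of parent ids.
CommitDag = Dict[str, List[str]]


def ancestors(dag: CommitDag, head: str) -> Set[str]:
    """Return `head` plus every commit reachable through parent links."""
    seen: Set[str] = set()
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(dag.get(commit, []))
    return seen


def merge_outcome(dag: CommitDag, source_head: str, target_head: str) -> str:
    """Classify a merge of source into target by ancestry alone."""
    if source_head in ancestors(dag, target_head):
        return "already_up_to_date"  # target already contains the source
    if target_head in ancestors(dag, source_head):
        return "fast_forward"  # target has nothing the source lacks
    return "merged"  # both sides diverged from a common ancestor


dag = {"c0": [], "c1": ["c0"], "c2": ["c1"], "c3": ["c1"]}
print(merge_outcome(dag, "c1", "c2"))  # source is behind target
print(merge_outcome(dag, "c2", "c1"))  # target is behind source
print(merge_outcome(dag, "c2", "c3"))  # both sides diverged from c1
```

Only the last case needs the three-way, row-level reconciliation; the first two are decided before any table data is compared.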
- -## Indexes after a merge - -A **fast-forward** merge (the common case — the target had no conflicting -changes, so the source's rows are adopted) does not build or rebuild indexes on -the rows it brings into the target. Newly merged rows (and any index a table does -not yet have) are covered the next time `optimize` runs — indexes are derived -state, and reads stay correct in the meantime via brute-force scan over the -not-yet-covered rows. This keeps a fast-forward merge fast (it never pays an -inline vector/FTS rebuild on the publish path), at the cost of brute-force search -latency on freshly merged rows until the next `optimize`. - -A **three-way** merge (the `Merged` outcome — both branches changed the table and -the rows were reconciled) still rebuilds the table's indexes inline today, as part -of the publish. So a Merged-outcome merge of an embedding-bearing table pays the -index-build cost up front. - -Either way, run `omnigraph optimize` after a large merge to restore (or, for the -fast-forward path, establish) full index coverage. - -## Conflicts - -When both branches changed the same data incompatibly, the merge fails with a -structured list of conflicts (the HTTP server returns `409` with a -`merge_conflicts[]` array). No partial result is published — the merge is -all-or-nothing. The conflict kinds are: - -| Kind | Meaning | -|---|---| -| `DivergentInsert` | The same id was inserted on both branches. | -| `DivergentUpdate` | The same row was updated differently on both branches. | -| `DeleteVsUpdate` | One side deleted a row the other side updated. | -| `OrphanEdge` | An edge references a node the other side deleted. | -| `UniqueViolation` | The merged result would violate a unique constraint. | -| `CardinalityViolation` | The merged result would violate an edge cardinality constraint. | -| `ValueConstraintViolation` | The merged result would violate a value constraint (enum/range).
| - -Each conflict carries the table, the row id (when applicable), the kind, and a -message. Resolve conflicts by reconciling the two branches — typically by making -the conflicting change on one side and re-merging. - -See [branches & commits](index.md) for the branch and commit-DAG model, and -[changes](changes.md) for diffing two branches before you merge. diff --git a/docs/user/branching/time-travel.md b/docs/user/branching/time-travel.md deleted file mode 100644 index e6bd52d..0000000 --- a/docs/user/branching/time-travel.md +++ /dev/null @@ -1,31 +0,0 @@ -# Snapshots & Time Travel - -Every read in OmniGraph happens against a **snapshot** — a consistent, cross-table -view of the graph at one manifest version. A query holds one snapshot for its whole -lifetime, so it never sees a partial write from a concurrent commit (see -[transactions](transactions.md)). - -## Reading the past - -- **Current head** — by default a read targets the current head of the bound branch. -- **By snapshot id** — read a branch or a specific snapshot id (`--snapshot` on - `omnigraph read`). -- **By version** — reconstruct a historical snapshot from any past manifest version. -- **Single entity** — look up one entity at a past version without building a full - snapshot (cheaper when you only need one node or edge). - -Snapshots are cheap to build: a snapshot is just the set of visible sub-table -versions at a manifest version, so cross-table reads stay snapshot-isolated. - -## CLI - -```bash -# Read a query against a past snapshot -omnigraph read --query ./q.gq --name find --snapshot <snapshot-id> s3://bucket/graph.omni -``` - -Time travel composes with branches: every branch has its own version history, and -you can read any branch at any of its past versions. Commits and the commit DAG -that these versions correspond to are described in -[branches & commits](index.md); diffing two versions is on the -[changes](changes.md) page.
diff --git a/docs/user/branching/changes.md b/docs/user/changes.md similarity index 93% rename from docs/user/branching/changes.md rename to docs/user/changes.md index a9bceec..58739e2 100644 --- a/docs/user/branching/changes.md +++ b/docs/user/changes.md @@ -1,6 +1,6 @@ # Change Detection / Diff -Diffing two read targets uses a three-level algorithm: +`changes/mod.rs`. Three-level algorithm: 1. **Manifest diff**: skip sub-tables whose `(table_version, table_branch)` is unchanged. 2. **Lineage check**: diff --git a/docs/user/cli-reference.md b/docs/user/cli-reference.md new file mode 100644 index 0000000..0326e64 --- /dev/null +++ b/docs/user/cli-reference.md @@ -0,0 +1,86 @@ +# CLI Reference (`omnigraph`) + +A reference for the `omnigraph` binary's command surface and `omnigraph.yaml` schema. For a quick-start guide, see [cli.md](cli.md). + +17 top-level command families, 40+ subcommands. All commands accept either a positional `URI`, `--uri`, or a `--target <name>` resolved against `omnigraph.yaml`. + +## Top-level commands + +| Command | Purpose | +|---|---| +| `init` | `--schema <pg>` → initialize a graph (also scaffolds `omnigraph.yaml` if missing) | +| `load` | bulk load a branch (`--mode overwrite\|append\|merge`) | +| `ingest` | branch-creating transactional load (`--from <base>`) | +| `query` (alias: `read`) | run named read query; source via `--query <path>`, `-e`/`--query-string <GQ>`, or `--alias <name>` (exactly one). `read` is the deprecated previous name and prints a one-line warning to stderr | +| `mutate` (alias: `change`) | run mutation query; same `--query` / `-e` / `--alias` mutual-exclusion as `query`.
`change` is the deprecated previous name and prints a one-line warning to stderr | +| `snapshot` | print current snapshot (per-table version + row count) | +| `export` | dump to JSONL on stdout (`--type T`, `--table K` filters) | +| `branch create \| list \| delete \| merge` | branching ops | +| `commit list \| show` | inspect commit graph | +| `run list \| show \| publish \| abort` | transactional run ops | +| `schema plan \| apply \| show (alias: get)` | migrations | +| `lint` (alias: `check`) | offline / graph-backed query validation. Replaces `query lint` / `query check`, which are kept as deprecated argv-level shims that print a one-line warning and rewrite to `omnigraph lint` | +| `optimize` | non-destructive Lance compaction | +| `cleanup --keep N --older-than 7d --confirm` | destructive version GC | +| `embed` | offline JSONL embedding pipeline | +| `policy validate \| test \| explain` | Cedar tooling | +| `version` / `-v` | print `omnigraph 0.3.x` | + +## `omnigraph.yaml` schema + +```yaml +project: { name } +graphs: + <name>: + uri: <local|s3://|http(s)://> + bearer_token_env: <ENV_NAME> +server: + graph: <name> + bind: <ip:port> +cli: + graph: <name> + branch: <name> + output_format: json|jsonl|csv|kv|table + table_max_column_width: 80 + table_cell_layout: truncate|wrap +query: + roots: [<dir>, …] # search path for .gq files +auth: + env_file: ./.env.omni +aliases: + <alias>: + # accepted values: `read` / `query` (read alias), `change` / `mutate` + # (write alias). `query` and `mutate` are recommended; `read` and + # `change` remain accepted forever for back-compat. 
+ command: read|change|query|mutate + query: <path-to-.gq> + name: <query-name> + args: [<positional-name>, …] + graph: <name> + branch: <name> + format: <output-format> +policy: + file: ./policy.yaml +``` + +## Output formats (`query` command, alias: `read`) + +- `json` — pretty-printed object with metadata + rows +- `jsonl` — one metadata line then one JSON object per row +- `csv` — RFC 4180-ish quoting +- `table` — fitted text table, honors `table_max_column_width` + `table_cell_layout` +- `kv` — grouped per-row key/value blocks + +## Param resolution + +Precedence (high to low): explicit `--params` / `--params-file`, alias positional args, `omnigraph.yaml` defaults. JS-safe-integer handling is built in (`is_js_safe_integer_i64`, `JS_MAX_SAFE_INTEGER_U64`) so 64-bit ids round-trip safely through JSON clients. + +## Bearer token resolution (CLI) + +1. `graphs.<name>.bearer_token_env` +2. `OMNIGRAPH_BEARER_TOKEN` global env +3. `auth.env_file` referenced `.env` + +## Duration parsing (cleanup) + +`s | m | h | d | w` units, e.g. `--older-than 7d`. diff --git a/docs/user/cli.md b/docs/user/cli.md new file mode 100644 index 0000000..b6f2c09 --- /dev/null +++ b/docs/user/cli.md @@ -0,0 +1,164 @@ +# CLI Guide + +## Core Graph Flow + +```bash +omnigraph init --schema ./schema.pg ./graph.omni +omnigraph load --data ./data.jsonl --mode overwrite ./graph.omni +omnigraph snapshot ./graph.omni --branch main --json +omnigraph query --uri ./graph.omni --query ./queries.gq --name get_person --params '{"name":"Alice"}' +omnigraph mutate --uri ./graph.omni --query ./queries.gq --name insert_person --params '{"name":"Mina","age":28}' +``` + +`omnigraph query` is the canonical read command (pairs with `POST /query`); +`omnigraph mutate` is the canonical write command (pairs with `POST /mutate`).
+The previous names `omnigraph read` and `omnigraph change` keep working as +visible aliases — invocations emit a one-line deprecation warning to stderr +and otherwise behave identically. See [Deprecated names](#deprecated-names) +for the migration table. + +For ad-hoc reads and mutations (REPLs, AI agents, one-off scripts), pass the +GQ source inline with `-e` / `--query-string` instead of a file path: + +```bash +omnigraph query --uri ./graph.omni \ + -e 'query find($name: String) { match { $p: Person { name: $name } } return { $p.name, $p.age } }' \ + --params '{"name":"Alice"}' + +omnigraph mutate --uri ./graph.omni \ + -e 'query add($name: String, $age: I32) { insert Person { name: $name, age: $age } }' \ + --params '{"name":"Inline","age":42}' +``` + +`-e` is mutually exclusive with `--query <path>` and `--alias <name>`; exactly +one of the three must be provided. The inline source travels through the same +parser, lint, params binding, and commit machinery as a file-based query — +only the source loader changes. + +## Branching And Reviewable Data Flows + +```bash +omnigraph branch create --uri ./graph.omni --from main feature-x +omnigraph branch list --uri ./graph.omni +omnigraph branch merge --uri ./graph.omni feature-x --into main + +omnigraph ingest --data ./batch.jsonl --branch review/import-2026-04-09 ./graph.omni +omnigraph export ./graph.omni --branch main --type Person > people.jsonl +omnigraph commit list ./graph.omni --branch main --json +omnigraph commit show --uri ./graph.omni <commit-id> --json +``` + +## Remote Server Mode + +Serve a graph: + +```bash +omnigraph-server ./graph.omni --bind 127.0.0.1:8080 +``` + +Read through the HTTP API: + +```bash +omnigraph query \ + --target http://127.0.0.1:8080 \ + --query ./queries.gq \ + --name get_person \ + --params '{"name":"Alice"}' +``` + +If the server requires auth, set `OMNIGRAPH_SERVER_BEARER_TOKEN` on the server +and configure the matching `bearer_token_env` in `omnigraph.yaml`.
+ +## Multi-graph servers (v0.6.0+) + +Against a multi-graph server (started with `--config omnigraph.yaml` referencing a non-empty `graphs:` map), use `omnigraph graphs list` to enumerate the registered graphs. The server must configure bearer tokens and `server.policy.file` with a rule that allows `graph_list`; `/graphs` is closed by default even when the server runs with `--unauthenticated`. + +```bash +OMNIGRAPH_BEARER_TOKEN=admin-token \ + omnigraph graphs list --uri http://server.example.com --json +``` + +For config-driven clients, set the remote graph's `bearer_token_env` to an environment variable containing a token whose actor is authorized by `server.policy.file`. + +`list` rejects local URI targets — it's for remote multi-graph servers only. + +Runtime add/remove is **not** in v0.6.0. To add a graph, stop the server, add a `graphs.<id>` entry to `omnigraph.yaml`, then restart. To remove, stop the server, delete the entry, restart. + +Per-graph URLs: hit a graph's cluster route from any subcommand by pointing `--uri` at it: + +```bash +omnigraph read --uri http://server.example.com/graphs/beta --query ./q.gq ... +``` + +## Runs, Policy, And Diagnostics + +```bash +omnigraph lint --query ./queries.gq --schema ./schema.pg --json +omnigraph check --query ./queries.gq ./graph.omni --json + +omnigraph schema plan --schema ./next.pg ./graph.omni --json +omnigraph schema apply --schema ./next.pg ./graph.omni --json +omnigraph policy validate --config ./omnigraph.yaml +omnigraph policy test --config ./omnigraph.yaml +omnigraph policy explain --config ./omnigraph.yaml --actor act-alice --action read --branch main + +omnigraph commit list ./graph.omni --json +omnigraph commit show --uri ./graph.omni <commit-id> --json +``` + +(The legacy `omnigraph run list/show/publish/abort` subcommands were removed in MR-771; mutations and loads publish atomically and the commit graph (`omnigraph commit list`) is the audit surface.)
+ +`query lint` and `query check` are the same command surface. In v1, graph-backed +lint uses local or `s3://` graph URIs; HTTP targets are only supported when you +also pass `--schema`. + +## Config + +`omnigraph.yaml` lets the CLI and server share named graphs, defaults, and +query roots: + +```yaml +graphs: + local: + uri: ./demo.omni + dev: + uri: http://127.0.0.1:8080 + bearer_token_env: OMNIGRAPH_BEARER_TOKEN + +cli: + graph: local + branch: main + +query: + roots: + - queries + - . +``` + +The config file can also define: + +- server bind defaults +- auth env files +- query aliases for common read and change commands +- `policy.file` for Cedar authorization rules + +When policy is enabled, `schema apply` is authorized through the +`schema_apply` action and is typically limited to admins on protected `main`. + +## Deprecated names + +The CLI was renamed to align with the HTTP server's canonical endpoint +names (`POST /query`, `POST /mutate`) and the `query` keyword in the GQ +language. The previous spellings keep working forever; invocations emit a +one-line warning to stderr and otherwise behave identically. + +| Old (deprecated) | New (canonical) | Migration | +|--------------------------|---------------------|----------------------------------------------------------| +| `omnigraph read` | `omnigraph query` | Same flags and behavior. `read` is a visible clap alias. | +| `omnigraph change` | `omnigraph mutate` | Same flags and behavior. `change` is a visible clap alias. | +| `omnigraph query lint` | `omnigraph lint` | Same flags. The argv-level shim rewrites `query lint` to `lint`. | +| `omnigraph query check` | `omnigraph check` | `check` is a visible alias of `omnigraph lint`. | + +The `command:` field in `aliases.<name>` in `omnigraph.yaml` accepts both +`read` / `change` (legacy) and `query` / `mutate` (canonical); the two +spellings are interchangeable on the wire via serde aliases. 
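The migration table above amounts to rewriting the leading command words and emitting one warning. The following Python sketch illustrates that mapping only; the function name and the warning wording are assumptions for illustration, and the actual CLI does this through clap aliases and an argv-level shim whose code is not part of this diff.

```python
import sys

# Deprecated spelling -> canonical spelling, copied from the migration table.
TWO_WORD = {("query", "lint"): "lint", ("query", "check"): "check"}
ONE_WORD = {"read": "query", "change": "mutate"}


def canonicalize(argv):
    """Return argv with a deprecated leading command replaced (sketch only)."""
    argv = list(argv)
    if tuple(argv[:2]) in TWO_WORD:
        old, new = argv[:2], [TWO_WORD[tuple(argv[:2])]]
    elif argv and argv[0] in ONE_WORD:
        old, new = argv[:1], [ONE_WORD[argv[0]]]
    else:
        return argv  # already canonical, nothing to warn about
    # Hypothetical wording: the docs only promise a one-line stderr warning.
    print(f"warning: `{' '.join(old)}` is deprecated; use `{new[0]}`", file=sys.stderr)
    return new + argv[len(old):]
```

Flags and positionals after the command word pass through untouched, which matches the table's "same flags and behavior" promise.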
diff --git a/docs/user/cli/index.md b/docs/user/cli/index.md deleted file mode 100644 index b00d42b..0000000 --- a/docs/user/cli/index.md +++ /dev/null @@ -1,175 +0,0 @@ -# CLI Guide - -## Core Graph Flow - -```bash -omnigraph init --schema schema.pg graph.omni -omnigraph load --data data.jsonl --mode overwrite graph.omni -omnigraph snapshot graph.omni --branch main --json -# Invoke a stored query BY NAME from the catalog (served — addressed by scope): -omnigraph query get_person --params '{"name":"Alice"}' -omnigraph mutate insert_person --params '{"name":"Mina","age":28}' -``` - -`omnigraph query` is the canonical read command (pairs with `POST /query`); -`omnigraph mutate` is the canonical write command (pairs with `POST /mutate`). -The positional argument is the **stored-query name**, invoked from the served -catalog — the graph is addressed by scope (`--server` / `--profile` -/ defaults), and the verb asserts the query's kind (`query` rejects a stored -mutation, and vice-versa). The previous names `omnigraph read` and -`omnigraph change` keep working as visible aliases — invocations emit a one-line -deprecation warning to stderr. See [Deprecated names](#deprecated-names). - -For **ad-hoc** reads and mutations (REPLs, AI agents, one-off scripts, local dev), -pass the GQ source with `-e` / `--query-string` (inline) or `--query <path>` (a -file), and address a graph's storage directly with `--store`. By-name catalog -invocation is served-only — a bare `--store` has no catalog, so it's the ad-hoc -lane: - -```bash -omnigraph query --store graph.omni \ - -e 'query find($name: String) { match { $p: Person { name: $name } } return { $p.name, $p.age } }' \ - --params '{"name":"Alice"}' - -omnigraph mutate --store graph.omni \ - -e 'query add($name: String, $age: I32) { insert Person { name: $name, age: $age } }' \ - --params '{"name":"Inline","age":42}' - -# A multi-query file: the positional selects which query to run.
-omnigraph query --store graph.omni --query queries.gq get_person --params '{"name":"Alice"}' -``` - -`-e` is mutually exclusive with `--query <path>`. With either, the positional -name (optional) selects which query in the source to run. The inline source -travels through the same parser, lint, params binding, and commit machinery as a -file-based query — only the source loader changes. - -## Branching And Reviewable Data Flows - -```bash -omnigraph branch create --uri graph.omni --from main feature-x -omnigraph branch list --uri graph.omni -omnigraph branch merge --uri graph.omni feature-x --into main - -omnigraph load --data batch.jsonl --branch review/import-2026-04-09 --from main --mode merge graph.omni -omnigraph export graph.omni --branch main --type Person > people.jsonl -omnigraph commit list graph.omni --branch main --json -omnigraph commit show --uri graph.omni <commit-id> --json -``` - -## Remote Server Mode - -Serve a cluster-applied graph: - -```bash -omnigraph cluster apply --config ./company-brain -omnigraph-server --cluster ./company-brain --bind 127.0.0.1:8080 -``` - -Read through the HTTP API — invoke a stored query by name from the catalog: - -```bash -omnigraph query get_person \ - --server http://127.0.0.1:8080 \ - --params '{"name":"Alice"}' -``` - -A server is addressed with `--server` (a name from `~/.omnigraph/config.yaml` or a -literal URL); a positional `http(s)://` URI is rejected. If the server requires -auth, set its bearer token and `omnigraph login <server>` (or -`OMNIGRAPH_BEARER_TOKEN`). - -## Multi-graph servers - -A server boots from a cluster directory (`omnigraph-server --cluster <dir>`) and -serves every graph the cluster declares. Use `omnigraph graphs list` to enumerate -them. The cluster's server-level policy must allow `graph_list`; `/graphs` is -closed by default even when the server runs with `--unauthenticated`.
- -```bash -OMNIGRAPH_BEARER_TOKEN=admin-token \ - omnigraph graphs list --server http://server.example.com --json -``` - -For an operator-defined server, store its token with `omnigraph login <name>` (or -`OMNIGRAPH_TOKEN_<NAME>`); the actor must be authorized by the cluster's -server-level policy. - -`list` rejects local (`--store`) targets — it's for remote multi-graph servers only. - -Runtime add/remove via API is not exposed. To add or remove a graph, edit the -cluster's `cluster.yaml`, run `omnigraph cluster apply`, then restart the server. - -Per-graph addressing: select a graph on a multi-graph server with `--graph`: - -```bash -omnigraph query get_person --server http://server.example.com --graph beta --params '{"name":"Ada"}' -``` - -## Runs, Policy, And Diagnostics - -```bash -omnigraph lint --query queries.gq --schema schema.pg --json -omnigraph check --query queries.gq graph.omni --json - -omnigraph schema plan --schema next.pg graph.omni --json -omnigraph schema apply --schema next.pg graph.omni --json -omnigraph policy validate --cluster ./company-brain --graph knowledge -omnigraph policy test --cluster ./company-brain --graph knowledge --tests policy.tests.yaml -omnigraph policy explain --cluster ./company-brain --graph knowledge --actor act-alice --action read --branch main - -omnigraph commit list graph.omni --json -omnigraph commit show --uri graph.omni <commit-id> --json -``` - -(Mutations and loads publish atomically; the commit graph (`omnigraph commit list`) is the audit surface.) - -`query lint` and `query check` are the same command surface. In v1, graph-backed -lint uses local or `s3://` graph URIs; HTTP targets are only supported when you -also pass `--schema`.
- -## Config - -Configuration has two surfaces with single owners (see the -[CLI reference](reference.md#config-surfaces) for the full schema): - -- **`~/.omnigraph/config.yaml`** — your personal operator config: default actor - (`--as`), named servers + credentials, clusters, profiles, aliases, and - default scope (`defaults.server` / `defaults.store` / `default_graph`). It - decides *who you are* and *what you address by default*. -- **`cluster.yaml`** (a team-owned cluster directory) — declares *what the system - is*: graphs, schemas, stored queries, policies, and storage. A server boots - from it (`--cluster <dir>`); see the [cluster guide](../clusters/index.md). - -```yaml -# ~/.omnigraph/config.yaml -operator: - actor: act-andrew -servers: - dev: - url: http://127.0.0.1:8080 -defaults: - server: dev - default_graph: knowledge -``` - -When policy is enabled, `schema apply` is authorized through the -`schema_apply` action and is typically limited to admins on protected `main`. - -## Deprecated names - -The CLI was renamed to align with the HTTP server's canonical endpoint -names (`POST /query`, `POST /mutate`) and the `query` keyword in the GQ -language. The previous spellings keep working forever; invocations emit a -one-line warning to stderr and otherwise behave identically. - -| Old (deprecated) | New (canonical) | Migration | -|--------------------------|---------------------|----------------------------------------------------------| -| `omnigraph read` | `omnigraph query` | Same flags and behavior. `read` is a visible clap alias. | -| `omnigraph change` | `omnigraph mutate` | Same flags and behavior. `change` is a visible clap alias. | -| `omnigraph query lint` | `omnigraph lint` | Same flags. The argv-level shim rewrites `query lint` to `lint`. | -| `omnigraph query check` | `omnigraph check` | `check` is a visible alias of `omnigraph lint`.
| - -The `command:` field in `aliases.<name>` in `~/.omnigraph/config.yaml` accepts -both `read` / `change` (legacy) and `query` / `mutate` (canonical); the two -spellings are interchangeable on the wire via serde aliases. diff --git a/docs/user/cli/reference.md b/docs/user/cli/reference.md deleted file mode 100644 index 3b97800..0000000 --- a/docs/user/cli/reference.md +++ /dev/null @@ -1,237 +0,0 @@ -# CLI Reference (`omnigraph`) - -A reference for the `omnigraph` binary's command surface and the per-operator `~/.omnigraph/config.yaml` schema. For a quick-start guide, see [cli.md](index.md). - -Top-level command families and subcommands. Graph-targeting commands accept a positional `file://`/`s3://` URI, `--server <name|url>` (an operator-defined server from `~/.omnigraph/config.yaml` by name, or a literal `http(s)://` URL, optionally with `--graph <id>` for multi-graph servers; exclusive with a positional URI), `--store <uri>` (a single graph's storage directly), or `--profile <name>` / `$OMNIGRAPH_PROFILE` (a named scope bundle; see [Scopes & profiles](#scopes--profiles)); `cluster` commands use `--config <dir>`, while `policy` and `queries` read a cluster's applied state via `--cluster <dir|uri>`. A remote server is addressed only with `--server` — a positional `http(s)://` URI is rejected. **`query`/`mutate` are the exception**: their positional is a stored-query *name*, not a graph URI, so they address the graph only via `--store`/`--server`/`--profile`/defaults. - -## Top-level commands - -| Command | Purpose | -|---|---| -| `init` | `--schema <pg>` → initialize a graph (start cluster configs from the [cluster.md](../clusters/index.md) quick-start) | -| `load` | bulk load a branch, local or remote (`--mode overwrite\|append\|merge` is **required** — overwrite is destructive, so there is no default).
Without `--from` the target branch must exist; `--from <base>` forks a missing `--branch` from `<base>` first | -| `ingest` | deprecated alias of `load --from <base>` (defaults: `--from main --mode merge`); prints a one-line warning to stderr | -| `query <name>` (alias: `read`) | run a read query. **Catalog lane** (default): `<name>` is a stored query invoked **by name** from the served catalog (served-only — address with `--server`/`--profile`; the verb asserts the query is a read). **Ad-hoc lane**: with `--query <path>` or `-e`/`--query-string <GQ>`, runs that source (the positional `<name>` then selects which query in it). No positional graph URI — address via `--store`/`--server`/`--profile`. `read` is the deprecated previous name (one-line stderr warning) | -| `mutate <name>` (alias: `change`) | run a mutation query; same catalog (by-name, served-only, verb asserts mutation) / ad-hoc (`--query`/`-e`) lanes as `query`. `change` is the deprecated previous name (one-line stderr warning) | -| `alias <name> [args]` | invoke an operator alias — a read-only personal binding (under `aliases:` in `~/.omnigraph/config.yaml`) to a stored query on a named server (replaces the removed `--alias` flag; stored mutations are rejected before execution) | -| `snapshot` | print current snapshot (per-table version + row count) | -| `export` | dump to JSONL on stdout (`--type T`, `--table K` filters) | -| `branch create \| list \| delete \| merge` | branching ops | -| `commit list \| show` | inspect commit graph | -| `schema plan \| apply \| show (alias: get)` | migrations. `apply` refuses a cluster-managed graph (one whose storage is inside a cluster) and points at `cluster apply` — those graphs evolve through the cluster ledger, not a direct apply | -| `lint` (alias: `check`) | offline / graph-backed query validation.
Replaces `query lint` / `query check`, which are kept as deprecated argv-level shims that print a one-line warning and rewrite to `omnigraph lint` | -| `cluster validate \| plan \| apply \| approve \| status \| refresh \| import \| force-unlock` | declarative cluster control plane. `validate` checks a local `cluster.yaml` folder and referenced schema/query/policy files; `plan` diffs it against local JSON state at `__cluster/state.json`, annotates dispositions, and embeds real schema-migration previews; `apply` converges the cluster — stored-query/policy catalog writes (content-addressed under `__cluster/resources/`), graph creates, schema updates (soft drops only; `--as` records the actor), and graph deletes behind a digest-bound approval from `cluster approve <resource> --as <actor>` (`apply`/`approve` default the actor from `~/.omnigraph/config.yaml`'s `operator.actor` when `--as` is omitted); what apply converges is what an `omnigraph-server --cluster <dir>` deployment serves on its next restart (`--cluster` is the server's only boot source — cluster-only); `status` reads the state ledger; `refresh`/`import` explicitly update local JSON state from read-only graph observations; `force-unlock <LOCK_ID>` manually removes a held local state lock by exact id | -| `optimize` | non-destructive Lance compaction (skips tables with `Blob` columns or uncovered drift; `--json` reports `skipped`) | -| `repair [--confirm] [--force]` | preview or explicitly publish uncovered manifest/head drift.
`--confirm` heals verified maintenance drift and exits non-zero if suspicious/unverifiable drift is refused; `--force --confirm` publishes suspicious/unverifiable drift after operator review | -| `cleanup --keep N --older-than 7d --confirm` | destructive version GC (`--confirm` to execute; also needs `--yes` against a non-local `s3://` target — see *Write diagnostics & destructive confirmation*) | -| `embed` | offline JSONL embedding pipeline | -| `policy validate \| test \| explain` | Cedar tooling against a cluster's applied policies (`--cluster <dir>`; `--graph <id>` picks a graph's bundle when several apply). `test` takes `--tests <file>`; `explain` takes `--actor`/`--action`/`--branch`/`--target-branch` | -| `queries list \| validate` | inspect a cluster's applied stored-query registry (`--cluster <dir\|uri>`; `--graph <id>` to scope one graph). `list` prints each query's kind (read/mutation), name, typed params, and `[mcp: …]` exposure; a query's `@description`/`@instruction` are shown as indented `description:` / `instruction:` lines when declared (omitted otherwise). `--json` emits `{name, mcp_expose, tool_name, mutation, params}` plus `description`/`instruction` **only when present** — matching the HTTP `GET /queries` catalog ([server.md](../operations/server.md)). `validate` type-checks the registry and exits non-zero on a broken query | -| `profile list \| show [<name>]` | read-only inspection of `~/.omnigraph/config.yaml` profiles.
`list` shows each profile's binding (server/cluster/store) + default graph and marks the `$OMNIGRAPH_PROFILE`-active one; JSON keeps `binding` and adds `scope_kind`, `target`, `valid`, and `error`; `show` resolves one profile's scope (endpoint + default graph), defaulting to the active profile, else the flat operator defaults | -| `version` / `-v` | print `omnigraph 0.7.x` | - -## Command capabilities - -Every command declares the **capability** it needs — what it requires to reach a graph — which determines the addressing flags that apply: - -- **`any`** — `query`, `mutate`, `load`, `ingest`, `branch *`, `snapshot`, `export`, `commit *`, `schema show`, `schema apply`. Run against a graph **served (via a server) or embedded (direct against a store)**: accept a positional `file://`/`s3://` URI, `--server <name|url>` (+ `--graph <id>` for multi-graph servers), `--store <uri>`, or `--profile <name>`. A remote server is addressed with `--server` — a positional `http(s)://` URI does **not** dispatch to one. -- **`served`** — `graphs list`. Requires a server (accepts `--server` / `--profile`). -- **`direct`** — `init`, `optimize`, `repair`, `cleanup`, `schema plan`, `lint`. Need **direct storage access** (`file://` / `s3://`), never through a server. They accept a positional `URI`, but **not** `--server`, and a remote (`http(s)://`) URI is rejected. `optimize` / `repair` / `cleanup` additionally accept **`--cluster <dir|s3://…> --graph <id>`** (`--cluster` is a cluster directory or storage-root URI, named via `clusters:` in `~/.omnigraph/config.yaml` or a literal root), which resolves the graph's storage URI from the served cluster state (so you needn't know the `<storage>/graphs/<id>.omni` layout). `--graph` is the one graph selector across all scopes — on these three verbs it picks the cluster graph; on the other `direct` verbs it does not apply.
-- **`control`** — `cluster *` via `--config <dir>`; `policy *` and `queries *` via `--cluster <dir|uri>` or a cluster profile. -- **`local`** — `alias`, `embed`, `login`, `logout`, `profile`, `version`. Address no explicit graph scope. - -These restrictions are enforced and reported, not silent: - -- A scope flag on a verb that can't consume it fails loudly rather than being silently dropped — `--server` outside a served scope, `--cluster` outside cluster-scoped verbs, or `--graph` where no multi-graph scope applies, e.g.: ``optimize is a direct (storage-native) command; --server addresses a served graph and does not apply. Pass a storage URI, or --cluster <dir> --graph <id>.`` -- A `direct` verb pointed at a remote URI fails loudly, e.g.: ``optimize is a direct (storage-native) command and needs direct storage access; the resolved target is a remote server (https://…). Pass the graph's file:// or s3:// URI.`` -- A data verb pointed at a positional `http(s)://` URI fails loudly: ``a remote graph must be addressed with --server <url> — a positional (or --uri) http(s):// URL no longer dispatches to a server.`` -- `init` into an **established cluster's** storage layout (`<root>/graphs/<id>.omni` where `<root>` holds `__cluster/state.json`) is refused — graphs in a cluster are created by `cluster apply` (which records ledger / recovery / approvals), not `init`. - -To maintain a server-backed graph, run the `direct` verbs from a host with storage access against the graph's storage URI (a positional URI, or `--cluster … --graph …`), out-of-band from the serving process — there are no server routes for `optimize` / `repair` / `cleanup` by design. - -`omnigraph --help` lists commands with a **capability legend** at the bottom (any / served / direct / control / local).
- -## Write diagnostics & destructive confirmation - -Two global flags make writes self-documenting and guard the dangerous ones: - -- **Every write echoes its resolved target to stderr** — `omnigraph load → s3://acme/brain/graphs/knowledge.omni (direct, remote)` — so you catch a scope that resolved somewhere unexpected (e.g. *prod*) before it lands. Applies to `load`, `ingest`, `mutate`, `branch create|delete|merge`, `schema apply`, `optimize`, `repair`, `cleanup`. The line is stderr, so `--json` consumers reading stdout are unaffected; suppress it with **`--quiet`**. -- **Destructive writes against a non-local scope require confirmation.** `cleanup`, overwrite `load` (`--mode overwrite`), and `branch delete` proceed freely against a local (`file://`) graph, but when the resolved target is **not local** (a served `http(s)://` graph or an `s3://` store/cluster) they require explicit consent: pass **`--yes`** to confirm, an interactive terminal is prompted, and a non-interactive run (no TTY, or `--json`) **refuses with an error** rather than silently destroying. `cleanup` still also requires its existing `--confirm` (preview→execute); `--yes` is the additional non-local consent. - -A "local" target is a bare path or a `file://` URI; `http(s)://`, `s3://`, and other object-store schemes are non-local. 
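The local/non-local rule and the confirmation outcomes described above can be sketched in a few lines. This is an illustrative model only, not the CLI's implementation: the function names `is_local_target` and `confirmation_outcome`, and the outcome strings, are hypothetical.

```python
def is_local_target(target: str) -> bool:
    """Return True for a bare path or a file:// URI, False for any other scheme."""
    if "://" not in target:
        return True  # bare path, no scheme
    scheme = target.split("://", 1)[0].lower()
    return scheme == "file"


def confirmation_outcome(target: str, destructive: bool, yes: bool,
                         interactive: bool, json_output: bool) -> str:
    """Hypothetical helper: returns 'proceed', 'prompt', or 'refuse'."""
    if not destructive or is_local_target(target):
        return "proceed"   # local targets and non-destructive writes go ahead
    if yes:
        return "proceed"   # --yes is the explicit non-local consent
    if interactive and not json_output:
        return "prompt"    # a TTY without --json is asked to confirm
    return "refuse"        # no TTY, or --json: error rather than destroy
```

The sketch covers only the `--yes` gate; `cleanup`'s separate `--confirm` step is not modeled.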
- -## Config surfaces - -Two config surfaces with single owners, plus a zero-config tier: - -| Surface | Owner | Location | Declares | -|---|---|---|---| -| Cluster config | the team, in a repo | `cluster.yaml` + checkout ([cluster-config.md](../clusters/config.md)) | what the system **is**: graphs, schemas, queries, policies, storage | -| Operator config | one person | `~/.omnigraph/config.yaml` (override dir with `$OMNIGRAPH_HOME`) | who **I** am: identity, ergonomics | -| Flags / env | per invocation | β€” | everything, explicitly | - -### `~/.omnigraph/config.yaml` (operator) - -```yaml -operator: - actor: act-andrew # default identity for the --as cascade: --as > operator.actor > none -servers: # operator-owned endpoints; names key the credentials - prod: - url: https://graph.example.com # no tokens in this file, ever -defaults: - output: table # read format default, below --json/--format/alias - server: prod # the everyday SERVED scope when no address is given - # store: file:///data/dev.omni # OR a zero-flag LOCAL default (mutually - # # exclusive with `server`); the local-dev - # # counterpart of `server` - default_graph: knowledge # graph selected in a server/cluster scope -clusters: # admin-only: managed-cluster storage roots. - brain: # the ONLY place a storage root lives in this file. - root: s3://acme/clusters/brain -profiles: # named scope bundles; pick with --profile - staging: { server: staging, default_graph: knowledge } # a served scope - brain-admin: { cluster: brain, default_graph: knowledge } # a direct cluster scope -``` - -Absent file = empty layer. Unknown keys warn and load (a file written for a -newer CLI works on an older one). Override the config directory with -`$OMNIGRAPH_HOME`. - -#### Scopes & profiles - -A command resolves a **scope** β€” a server, a cluster, or a store β€” then selects a -graph in it; the served-vs-direct access path is derived from the scope, not -toggled. 
The scope comes from one of (highest precedence first): an explicit -address (a positional URI, `--server`, or `--store <uri>`); a named -`--profile <name>` (or `$OMNIGRAPH_PROFILE`); or the flat `defaults.server` + -`defaults.default_graph` (a served default) **or** `defaults.store` (a zero-flag -*local* default — mutually exclusive with `defaults.server`). A **profile** binds -exactly one of `server` / `cluster` / `store` plus an optional default graph — -config data, not state: every command resolves its scope fresh, there is no -sticky "current" mode. Inspect what is defined with `omnigraph profile list` and -`omnigraph profile show [<name>]` (read-only). - -- `--store <uri>` addresses a single graph's storage directly (ad-hoc / break-glass). -- A `cluster`-bound profile reaches `optimize` / `repair` / `cleanup` for a managed - graph (resolving its storage root from `clusters:`), the same as - `--cluster <root> --graph <id>`. A `--graph` flag overrides the profile's default. -- A `server`-bound scope on a maintenance verb, or a `cluster`-bound scope on a - data verb, is rejected with a message pointing at the right addressing. -- **No graph selected.** When a scope has no `--graph` and no - `default_graph`, the CLI never silently picks: - - **Cluster scope** — exactly **one** applied graph is used automatically; - **several** errors and lists the candidates (from the served catalog). - - **Server scope** — an `omnigraph-server` is always cluster-backed, so its - `GET /graphs` lists the graphs and you must pass `--graph <id>` (the CLI - lists the candidates if you omit it). It falls back to the bare URL only - when `/graphs` is unavailable: policy-gated, unreachable, or a - non-`omnigraph` endpoint. 
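The precedence order above (explicit address, then profile, then flat defaults) can be modeled as a first-match walk over three layers. The following is a minimal sketch under stated assumptions, not the CLI's code: `resolve_scope`, `single_binding`, and the dict shapes are hypothetical, and for simplicity the same "exactly one binding" check is applied to every layer, although the text only documents `server` and `store` for the flat defaults.

```python
BINDINGS = ("server", "cluster", "store")


def single_binding(layer):
    """Return the (kind, target) pair a layer binds; exactly one is allowed."""
    bound = [(kind, layer[kind]) for kind in BINDINGS if layer.get(kind)]
    if len(bound) != 1:
        raise ValueError("a scope layer must bind exactly one of server/cluster/store")
    return bound[0]


def resolve_scope(explicit=None, profile=None, defaults=None, graph=None):
    """Pick the highest-precedence layer that binds a scope (hypothetical model).

    `graph` stands in for the --graph flag, which overrides a layer's
    default_graph.
    """
    layers = (("explicit", explicit), ("profile", profile), ("defaults", defaults))
    for source, layer in layers:
        if layer and any(layer.get(kind) for kind in BINDINGS):
            kind, target = single_binding(layer)
            return {
                "source": source,
                "kind": kind,
                "target": target,
                "graph": graph or layer.get("default_graph"),
            }
    raise LookupError("no scope: pass an address, a profile, or set operator defaults")
```

A result with `graph` left as `None` corresponds to the "No graph selected" case, which this sketch does not resolve further.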
- -`--target`, `--cluster-graph`, and the positional-`http(s)://`β†’remote dispatch -have been **removed** (`--graph` is now the one graph selector across server and -cluster scopes); operator `defaults`/`--profile` supply the no-flag scope and an -explicit address always wins. - -#### Credentials keyed by server name - -`omnigraph login <name>` stores a bearer token in -`~/.omnigraph/credentials` (created `0600`; group/world-readable files are -refused). Token from `--token`, or β€” preferred, keeps it out of shell -history β€” one line on stdin: `echo $TOKEN | omnigraph login prod`. -`omnigraph logout <name>` removes it (idempotent). - -#### Operator aliases β€” bindings, not content - -An operator alias is a personal name for *invoking a stored query on a -named server* β€” it carries no query content (the stored query in the -catalog is the team's contract; the alias, its defaults, and its name are -yours): - -```yaml -aliases: - triage: - server: intel-dev # names an entry under servers: - graph: spike # optional (multi-graph servers) - query: weekly_triage # the STORED query's name β€” never a file - args: [since] # positional args -> params, in order - params: { limit: 20 } # fixed defaults; positionals/--params win - format: table -``` - -`omnigraph alias triage 2026-06-01` invokes -`POST <server>/graphs/spike/queries/weekly_triage` with the keyed -credential. Aliases live in their own `alias` namespace, -so an alias can never shadow β€” or be shadowed by β€” a built-in verb. (The old -`--alias <name>` flag on `query`/`mutate` was removed.) 
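How an alias turns positional arguments into query parameters can be sketched as a three-layer merge, based on the YAML comments above (`positional args -> params, in order`; `positionals/--params win`) and on the document's "Param resolution" section, which ranks explicit `--params` above alias positionals. The function name `resolve_alias_params` and its arguments are hypothetical, and rejecting surplus positionals is an assumption, not documented behavior.

```python
def resolve_alias_params(arg_names, positionals, fixed_params, explicit_params):
    """Merge alias parameters, lowest precedence first (illustrative only)."""
    if len(positionals) > len(arg_names):
        # Assumption: extra positionals are an error rather than being dropped.
        raise ValueError("more positional arguments than the alias declares")
    params = dict(fixed_params)                 # alias `params:` fixed defaults
    params.update(zip(arg_names, positionals))  # positionals map to `args:` in order
    params.update(explicit_params)              # explicit --params / --params-file win
    return params
```

For the `triage` alias above, `omnigraph alias triage 2026-06-01` would correspond to `resolve_alias_params(["since"], ["2026-06-01"], {"limit": 20}, {})`.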
- -A remote command whose URL prefix-matches an operator server's `url` (the -`gh` host model β€” no flags needed) resolves its token through: - -| Order | Source | -|---|---| -| 1 | `OMNIGRAPH_TOKEN_<NAME>` env (`prod` β†’ `OMNIGRAPH_TOKEN_PROD`) | -| 2 | `[<name>]` section in `~/.omnigraph/credentials` | -| 3 | the default `OMNIGRAPH_BEARER_TOKEN` env | - -A keyed token is only ever sent to the server it is keyed to: a URL matching no -operator server falls back to `OMNIGRAPH_BEARER_TOKEN` alone. - -## Cluster config preview - -```bash -omnigraph cluster validate --config company-brain -omnigraph cluster plan --config company-brain --json -omnigraph cluster apply --config company-brain --json -omnigraph cluster approve graph.<id> --config company-brain --as <actor> -omnigraph cluster status --config company-brain --json -omnigraph cluster refresh --config company-brain --json -omnigraph cluster import --config company-brain --json -omnigraph cluster force-unlock <LOCK_ID> --config company-brain --json -``` - -`--config` is a directory containing `cluster.yaml`; it defaults to `.`. The -config declares graphs, schemas, stored queries, and policy bundle file -references. `cluster plan` reads local JSON state from -`<config-dir>/__cluster/state.json`; a missing file means empty state. Plan, -apply, refresh, and import acquire `__cluster/lock.json` by default and release -it before returning. `cluster apply` converges the cluster to its config in one -ordered run: it creates declared graphs, applies schema updates (soft drops -only β€” see [schema](../schema/index.md)), writes stored-query/policy catalog -resources (content-addressed under `__cluster/resources/`), and executes -approved graph deletes; it requires an existing `state.json` (run `import` -first). Applied state does not serve traffic until an `omnigraph-server ---cluster <dir>` restart picks up the new revision. Standalone schema deletes -remain unsupported and are reported as `deferred` with a warning. 
`cluster -status` reads state only and reports any existing lock metadata. `force-unlock` -removes a lock only when the supplied id exactly matches the lock file. -`refresh` requires an existing `state.json`; `import` creates one only when it -is missing. Both observe declared graphs read-only at -`<config-dir>/graphs/<graph-id>.omni`. External state backends, automatic -stale-lock breaking, `plan --refresh`, pipelines, UI specs, embeddings, -aliases, and bindings are not yet supported. See -[cluster-config.md](../clusters/config.md). - -## Output formats (`query` command, alias: `read`) - -- `json` β€” pretty-printed object with metadata + rows -- `jsonl` β€” one metadata line then one JSON object per row -- `csv` β€” RFC 4180-ish quoting -- `table` β€” fitted text table, honors `table_max_column_width` + `table_cell_layout` -- `kv` β€” grouped per-row key/value blocks - -## Param resolution - -Precedence (high to low): explicit `--params` / `--params-file`, alias positional args. JS-safe-integer handling is built in (`is_js_safe_integer_i64`, `JS_MAX_SAFE_INTEGER_U64`) so 64-bit ids round-trip safely through JSON clients. - -## Bearer token resolution (CLI) - -See **Credentials keyed by server name** above: a remote command resolves its -token via `OMNIGRAPH_TOKEN_<NAME>` env β†’ the `[<name>]` section in -`~/.omnigraph/credentials` β†’ the default `OMNIGRAPH_BEARER_TOKEN` env, and a -keyed token is only ever sent to the server it is keyed to. Plaintext tokens are -never stored in operator config; the removed `omnigraph.yaml` keys -(`graphs.<name>.bearer_token_env`, `auth.env_file`) no longer exist. - -## Duration parsing (cleanup) - -`s | m | h | d | w` units, e.g. `--older-than 7d`. diff --git a/docs/user/clusters/config.md b/docs/user/clusters/config.md deleted file mode 100644 index d9fdc1a..0000000 --- a/docs/user/clusters/config.md +++ /dev/null @@ -1,501 +0,0 @@ -# Cluster Config - -> New to the cluster tooling? 
Start with the operator how-to guide, -> [cluster.md](index.md) β€” this document is the reference. - -Cluster config is the future control-plane configuration surface for a whole -OmniGraph deployment. In this stage, OmniGraph can validate a local -`cluster.yaml` folder, produce a deterministic read-only plan, inspect the -local JSON state ledger, explicitly refresh/import graph observations into -that ledger, manually remove a held local state lock by exact lock id, and -**apply the executable subset of the plan** β€” stored-query and policy-bundle -catalog writes, **graph creation** (a declared graph that does not exist yet -is initialized by apply at the derived root), **schema updates** (soft drops -only), and β€” behind an explicit, digest-bound **approval** β€” **graph -deletion**. It does not perform data-loss schema migrations, start servers, -or run data loads. A server can boot from the applied ledger with -`omnigraph-server --cluster <config-dir | storage-root>`. - -## Commands - -```bash -omnigraph cluster validate --config company-brain -omnigraph cluster plan --config company-brain --json -omnigraph cluster apply --config company-brain --json -omnigraph cluster approve graph.<id> --config company-brain --as <actor> -omnigraph cluster status --config company-brain --json -omnigraph cluster refresh --config company-brain --json -omnigraph cluster import --config company-brain --json -omnigraph cluster force-unlock <LOCK_ID> --config company-brain --json -``` - -`--config` points at a directory, not a file. The directory must contain -`cluster.yaml`. When omitted, it defaults to the current directory. - -## Relationship to `~/.omnigraph/config.yaml` - -`cluster.yaml` and the per-operator `~/.omnigraph/config.yaml` never describe -the same fact. 
The operator config is the permanent **per-operator** layer -(the operator's identity and credential references, named servers/clusters, -profiles, and CLI defaults); `cluster.yaml` is the shared desired state of a -whole deployment, read only by the `cluster` commands via `--config`. - -The exact contract: - -- **Cluster commands read the operator config for exactly one thing**: the - `operator.actor` default used by `apply`/`approve` when `--as` is omitted β€” - operator identity is a per-operator fact. With `--as` present, the operator - config is not needed. Nothing else in it influences a cluster command. -- **No legacy `omnigraph.yaml`**: the CLI does not read `omnigraph.yaml` at - all, and a `--cluster` server reads only the cluster catalog β€” boot is - cluster-only. -- **The other direction is ergonomics, not coupling**: per-operator - data-plane commands address a cluster graph by its derived storage root - (`company-brain/graphs/knowledge.omni`) with `--store <uri>` β€” an ordinary - local path, no special handling. - -## Supported `cluster.yaml` - -The current config surface accepts this resource subset: - -```yaml -version: 1 -metadata: - name: company-brain - -state: - backend: cluster - lock: true - -providers: - embedding: - default: - kind: openai-compatible - base_url: https://openrouter.ai/api/v1 - model: openai/text-embedding-3-large - api_key: ${OPENROUTER_API_KEY} - -graphs: - knowledge: - schema: knowledge.pg - embedding_provider: default - queries: queries/ # discover every `query <name>` in queries/*.gq - -policies: - base: - file: base.policy.yaml - applies_to: [knowledge] -``` - -`queries` is Terraform-shaped β€” the `.gq` files are the declaration. 
Three -forms: - -```yaml -queries: queries/ # directory: top-level *.gq, sorted; every declaration registers -queries: [people.gq, extra/a.gq] # explicit files; every declaration in each -queries: # fine-grained name -> file map - find_experts: - file: knowledge.gq -``` - -Discovery is loud: an unreadable or unparseable `.gq`, or the same query name -declared in two files, fails validation (`query_parse_error`, -`duplicate_query_name`). Each discovered query is still an individually -addressed resource (`query.<graph>.<name>`) with its own plan/apply lifecycle; -the digest is the containing file's hash, so editing a multi-query file -updates all of its queries together. Paths are relative to the config -directory β€” the cluster is one explicit folder, so no `./` prefixes are -needed. - -`providers.embedding.<name>` defines a query-time embedding provider profile -for cluster-served graphs. A graph opts in with `embedding_provider: <name>`; -bare names normalize to `provider.embedding.<name>`. Supported provider -`kind` values are `openai-compatible` (default/OpenRouter-compatible), -`openai` (OpenAI's own host), `gemini`, and `mock`. Real providers require -`api_key: ${ENV_VAR}`; inline secrets are rejected. The env var is resolved -only when a `--cluster` server boots, so `cluster validate`, `plan`, and -`apply` do not need deployment secrets. `mock` is deterministic and does not -require `api_key`. Vector dimensions stay schema-driven by the target -`Vector(N)` column, not the provider profile. - -`storage:` (optional) is the **storage root URI** for everything the cluster -stores β€” the state ledger, lock, content-addressed catalog, recovery -sidecars, approval artifacts, and the derived graph roots -(`<storage>/graphs/<id>.omni`). Absent, it defaults to the config directory -itself (the original layout, byte-compatible with pre-existing clusters). 
-`s3://bucket/prefix` puts the whole cluster on S3-compatible object storage: -the ledger CAS uses conditional writes (verified against AWS S3 semantics and -RustFS), the lock becomes genuinely cross-machine, and graph roots are -engine-native S3 URIs. Credentials are **never** in `cluster.yaml` β€” the -standard `AWS_*` environment contract applies, identical to graph storage. -Declared configuration (`cluster.yaml` and the schema/query/policy sources it -references) always stays in the working tree: config is versioned in git, -state lives in the store β€” the Terraform split. - -`metadata.name` is a display label. `state.backend` may be omitted or set to -`cluster`; external state backends are reserved for a later stage. `state.lock` -defaults to `true`. When enabled, `cluster plan`, `cluster apply`, -`cluster refresh`, and `cluster import` briefly acquire -`<config-dir>/__cluster/lock.json`, then remove it before returning. `cluster status` never acquires the lock; it only reports -whether one is present. `cluster force-unlock` is the only lock-removal command; -it requires the exact lock id and should be run only after confirming no cluster -operation is active. - -## Validation - -`cluster validate` checks: - -- `cluster.yaml` syntax and supported fields -- duplicate YAML keys -- schema, query, and policy file existence -- schema parsing and catalog construction -- stored-query parsing and query-name matching -- stored-query type-checking against the desired schema -- policy `applies_to` graph references -- embedding provider profiles and graph `embedding_provider` references - -Fields reserved for later phases, such as `pipelines`, top-level -`embeddings`, `ui`, `aliases`, and `bindings`, fail with a typed diagnostic -instead of being silently ignored. Under `providers`, only `embedding` is -supported today; other provider namespaces fail as unsupported config. 
- -## Planning - -`cluster plan` first performs validation, then reads local JSON state from: - -```text -<config-dir>/__cluster/state.json -``` - -If the file is missing, the state is treated as empty and every desired -resource is planned as a create. If present, the file must use this shape: - -```json -{ - "version": 1, - "state_revision": 0, - "applied_revision": { - "config_digest": "...", - "resources": { - "schema.knowledge": { "digest": "..." }, - "query.knowledge.find_experts": { "digest": "..." }, - "provider.embedding.default": { - "digest": "...", - "embedding_profile": { - "kind": "openai-compatible", - "base_url": "https://openrouter.ai/api/v1", - "model": "openai/text-embedding-3-large", - "api_key": "${OPENROUTER_API_KEY}" - } - }, - "graph.knowledge": { - "digest": "...", - "embedding_provider": "provider.embedding.default" - }, - "policy.base": { - "digest": "...", - "applies_to": ["cluster", "graph.knowledge"] - } - } - }, - "resource_statuses": { - "graph.knowledge": { - "status": "applied", - "conditions": [], - "message": "optional status detail" - } - }, - "approval_records": {}, - "recovery_records": {}, - "observations": {} -} -``` - -`state_revision`, `resource_statuses`, `approval_records`, `recovery_records`, -and `observations` are optional so earlier state fixtures keep working. -Missing `state_revision` is treated as `0`. Resource status values are -`pending`, `planned`, `applying`, `applied`, `drifted`, `blocked`, or `error`. - -Plan output compares desired resource digests against state resource digests -and reports `create`, `update`, and `delete` changes. It also reports the state -CAS (`sha256:<digest>`) and state revision. `state_observations.locked` means an -existing lock file was observed, along with its metadata (`lock_id`, -`lock_operation`, `lock_created_at`, `lock_pid`, `lock_age_seconds`); a -successful `plan` instead reports `lock_acquired: true` and an -`acquired_lock_id`, then releases the lock before returning. 
The command never -writes `state.json` and does not scan live graphs. Use explicit -`cluster refresh` / `cluster import` when the state ledger should be updated -from live observations. Live drift scans during plan are later-stage work. - -Policy entries additionally record their applied `applies_to` bindings as -normalized typed refs β€” the state ledger is serving-sufficient for the -future server-boot stage. A change to `applies_to` alone (the policy file -digest unchanged) appears in the plan as an Update marked `binding_change` -(human output: `[bindings]`), and as `metadata_change: policy_bindings` in -structured output. Embedding provider entries similarly carry their resolved -profile in the ledger; pre-profile ledgers are backfilled by an Update with -`metadata_change: embedding_profile`. These metadata-only updates apply like -catalog changes and count toward convergence. - -Each plan change carries a `disposition` field β€” an honest preview of what -`cluster apply` will do with it: `applied` (executes β€” graph creates, schema -updates, catalog writes, approved deletes), `derived` (a `graph.<id>` -composite-digest update that converges automatically once its query digests -land), `deferred` (an unsupported change, e.g. a standalone schema delete), or -`blocked` (query/policy gated by an unapplied or missing dependency, with the -condition in `reason`). - -## Apply - -`cluster apply` executes the executable subset of the plan β€” stored-query and -policy-bundle changes, graph creates, and schema updates. There is no confirm -flag: `cluster plan` is the preview, -and apply recomputes the same diff under the state lock before executing, so a -stale preview can never be applied. Apply requires an existing `state.json` -(`state_missing` directs you to `cluster import` first). 
- -For each applied create/update, the resource payload is written -content-addressed into the local catalog: - -```text -<config-dir>/__cluster/resources/query/<graph>/<name>/<digest>.gq -<config-dir>/__cluster/resources/policy/<name>/<digest>.yaml -``` - -Extensions are fixed per kind regardless of the source file's name. Payloads -are written before the state update because `state.json` is the publish point: -if the final CAS-checked state write fails, no success is reported and the -digest-named blobs already written are inert β€” re-running apply is the repair. -Deletes remove the resource from state; their old payload blobs stay on disk -(garbage collection is a later stage). Re-running a converged apply is a no-op: -no state write, no revision change (`state_written: false`). - -**Applied means serving.** A server started with `--cluster <dir>` boots from -the applied revision (see -[Serving from the cluster](#serving-from-the-cluster-the-mode-switch)); it -picks up newly applied state on its next restart. Until that restart, applied -means recorded in the catalog, nothing more. - -### Graph creation - -A `graph.<id>` create (the graph is declared but no root exists) is executed -by apply: the graph is initialized at the derived root - -```text -<config-dir>/graphs/<graph-id>.omni -``` - -with the declared schema, before any catalog writes, so queries and policies -that depend on the new graph apply **in the same run**. Each create is fenced -by a recovery sidecar under `__cluster/recoveries/{ulid}.json`, written before -the init and removed only after the state update lands. 
If apply crashes in -between, the next state-mutating command (`apply`, `refresh`, `import`) runs a -**recovery sweep** that classifies the survivor by observation: an absent root -removes the stale intent; a completed create rolls the cluster state forward -(recorded in the state's `recovery_records`); a partial root reports -`graph_create_incomplete` (status `error` β€” remove the root and re-run apply; -nothing is auto-deleted); unexpected graph content reports -`actual_applied_state_pending` (status `drifted` β€” run `cluster refresh` and -re-plan). While a kept sidecar is pending, that graph's create and its -dependents are blocked with `cluster_recovery_pending`. Read-only commands -(`status`, `plan`) warn about pending sidecars without acting on them. - -**Re-creation is convergence.** If a graph root disappears out-of-band, -`refresh` records the drift and the next `plan` proposes a create β€” and apply -will execute it, producing an **empty** graph at the root. The data was -already lost when the root vanished; the create is visible in the plan -(disposition `applied`) before anything runs. - -### Schema updates - -A `schema.<id>` update (the declared schema differs from what state records) -is executed by apply via the engine's schema-apply, after graph creates and -before catalog writes β€” so a query change that depends on the new schema -applies in the same run. Each schema apply is sidecar-fenced like a create: -pre-operation manifest version recorded, post-operation version written back, -sidecar retired only after the state update lands; the recovery sweep -classifies survivors by schema digest (consistent ledger β†’ retired; completed -on the graph β†’ state rolled forward with an audit entry; anything else β†’ -`drifted`/`actual_applied_state_pending`, kept). - -Migrations run with **soft drops only** β€” a removed property disappears from -the current version while prior versions retain the data (reversible until -`cleanup`). 
Data-loss migrations (`allow_data_loss`) are not reachable from -cluster apply until the approval-artifact stage. Unsupported migrations -(e.g. changing a property's type), engine lock contention, or graphs with -user branches fail loudly as `schema_apply_failed` with the engine's message; -dependent changes are demoted to `blocked` and graph-moving work stops for -the run. These pre-movement failures are checked before the cluster schema -recovery sidecar is created, so they do not leave stale recovery files behind -or brick later server boot. - -`cluster plan` previews schema updates with the engine's real migration plan: -each schema change carries a `migration` field (`supported` + typed steps), -and the human output prints the steps. If the live graph cannot be opened the -preview degrades to the digest diff with a `schema_preview_unavailable` -warning. - -**Drift is converged, not just reported.** A schema changed out-of-band on -the live graph shows up as `drifted` after `refresh`, and the next plan -proposes migrating it back to the declared schema β€” apply executes that like -any other soft migration. Drift correction is gated by the same rules as any -change; nothing about it is hidden (the plan shows the steps, including soft -drops of out-of-band fields). - -**Attribution.** `cluster apply --as <actor>` records the operator identity -in recovery sidecars and audit entries and threads it to the engine's -schema-apply (so commit attribution and Cedar enforcement β€” wherever a policy -checker is installed β€” work unchanged). - -### Approvals and graph deletion - -Deleting a graph is the irreversible tier: it requires a recorded human -decision. 
`cluster plan` lists the gate under `approvals_required` (one gate -per graph β€” the graph-level approval carries its schema and queries); -`cluster approve graph.<id> --as <actor>` writes a digest-bound artifact to - -```text -<config-dir>/__cluster/approvals/<approval-id>.json -``` - -bound to the exact desired config digest and the change's state digest, so -**any config or state drift after approving invalidates the artifact** -automatically (`approval_stale` warning; it never authorizes a different -change). An unapproved delete blocks with `approval_required`. - -An approved delete executes **last** in the apply run: the graph root is -removed recursively, the subtree (graph, schema, its queries) is tombstoned -out of the state ledger with a tombstone observation, and the approval is -consumed β€” recorded in the state's `approval_records` in the same state -update, and the artifact file rewritten with `consumed_at` (the file is never -deleted: the audit fact survives the loss of either store). A failed run -consumes nothing; the approval stays valid for the retry. Catalog blobs of -the deleted graph's queries stay on disk (GC is a later stage). - -Crash recovery for deletes: a completed-but-unrecorded delete is rolled -forward by the sweep (tombstone + approval consumption + audit entry); an -incomplete delete (root still present) is retired with a -`graph_delete_incomplete` warning and simply **re-proposed** β€” prefix removal -is idempotent, so the still-approved retry is the repair. - -Standalone schema deletes are never executed by this stage. They are -reported as `deferred` (warning `apply_unsupported_change`), and query/policy -changes that depend on them are `blocked` (warning `apply_dependency_blocked`, status -`blocked` in state). A partially-applicable plan still exits 0 with warnings; -the JSON `converged` field is the automation signal for "state now matches the -desired revision". 
The applied `config_digest` is only recorded when apply -fully converges. The `graph.<id>` composite digest is recomputed from state's -own schema/query digests after each apply, so applied query changes converge -without graph movement. - -## Serving from the cluster (the mode switch) - -```bash -omnigraph-server --cluster company-brain --bind 0.0.0.0:8080 -``` - -`--cluster <dir>` is an **exclusive boot source** (axiom 15): it cannot -combine with a graph URI or `--config`, and in this mode -`omnigraph.yaml` is never read β€” not for graphs, not for queries, not for -policies. The server serves the **applied revision**: graph roots recorded in -`state.json`, stored-query and policy content from the content-addressed -catalog at the applied digests (re-verified at boot), and policy bundles -wired by their applied `applies_to` bindings β€” `cluster`-bound bundles become -the server-level Cedar engine, graph-bound bundles attach per graph. -Un-applied config drift never leaks into serving; `cluster plan` is where -drift is visible. Routing is always multi-graph (`/graphs/{id}/...`). Bearer -tokens and the bind address stay process-level (flags/env) β€” they are -per-replica facts, not cluster facts. - -Boot is fail-fast for cluster-global readiness failures: missing or -unreadable state, invalid/unattributable recovery sidecars, -missing/tampered shared catalog blobs, policy entries without binding -metadata (pre-binding ledgers β€” re-run `cluster apply`), an empty graph set, -more than one policy bundle binding a single scope (split or merge bundles; -stacked scopes are a later stage), cluster policy problems, or zero healthy -graphs. Valid graph-attributed recovery sidecars, unopenable graph roots, and -stored queries that no longer type-check quarantine that graph instead; the -server logs startup diagnostics, skips the graph's queries and graph-only -policy bindings, and serves any remaining healthy graphs. 
A held state lock -is *not* an error β€” boot reads the atomically-replaced state file without -locking. - -Use `omnigraph-server --require-all-graphs` (or -`OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`) when degraded serving is not acceptable; it -promotes every graph-local quarantine or startup failure back to a boot error. - -Serving is static per process: the server reads the applied revision once at -startup, so picking up newly applied state means restarting it. `GET /graphs` -lists only ready/served graphs; quarantined graphs are omitted and their -routes return 404. Stored queries are all listed in `GET /queries` in cluster -mode (the cluster registry has no expose flag; exposure becomes a policy -decision in a later phase). - -## Status - -`cluster status` reads the same local JSON state ledger and prints what the -ledger says is deployed. It does not validate referenced schema/query/policy -files and does not inspect live graphs. Missing `state.json` succeeds with a -warning; invalid state JSON or an unsupported state version fails. If a lock is -present, status reports its id, operation, creation time, pid, and age. - -Status also verifies the catalog payloads read-only: every query/policy digest -recorded in state is checked against its content-addressed blob under -`__cluster/resources/` (existence and full digest re-hash). A missing or -mismatched blob is reported as a warning (`catalog_payload_missing` / -`catalog_payload_mismatch`); an unreadable blob is an error -(`catalog_payload_read_error`) because an unverifiable catalog must not report -healthy. Status never writes state β€” persisting the `drifted` condition is -refresh's job. The check runs without the state lock, so it is a point-in-time -report. - -## Refresh And Import - -`cluster refresh` updates an existing `state.json` from actual observations. -`cluster import` creates the first `state.json` when the ledger is missing. 
-Both commands open declared graphs read-only at: - -```text -<config-dir>/graphs/<graph-id>.omni -``` - -They observe only branch `main`, recording graph existence, manifest version, -live schema digest, desired schema digest, and schema-match status under -`observations["graph.<id>"]`. Missing graph roots are recorded as drift and -remove the graph/schema digests from state so a later `plan` proposes creates. -Invalid graph roots are recorded as errors; `refresh` persists the error -observation and exits non-zero, while `import` exits non-zero without creating -initial state. - -Refresh also verifies the catalog payloads of every query/policy digest -recorded in state (the same check `cluster status` reports read-only), and -closes the loop: - -- a **missing** or **digest-mismatched** blob marks the resource `drifted` - (condition `payload_missing` / `payload_mismatch`) and removes its digest - from state β€” so the next `cluster plan` proposes a create and the next - `cluster apply` republishes the blob (the self-heal loop, mirroring how a - missing graph root is handled); -- an **unreadable** blob (IO error other than not-found) keeps the digest, - marks the resource `error` (condition `payload_read_error`), and exits - non-zero β€” transient IO must not trigger a spurious republish. - -Upgrade note: a state ledger written before catalog publish existed records -query/policy digests with no blobs on disk; the first refresh after upgrading -flags them all `payload_missing`, and a single `cluster apply` republishes -everything and converges. - -Refresh/import do not observe query or policy resources beyond their catalog -payloads yet. Existing query and policy state digests are preserved on refresh -(unless their payload drifted, above) and are not invented on import. 
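The payload check described in this section can be sketched as a small classifier. This is a minimal illustration under stated assumptions, not OmniGraph's implementation: the function name `check_payload`, the one-file-per-blob layout, and the use of SHA-256 as the digest are hypothetical. Only the condition names (`payload_missing`, `payload_mismatch`, `payload_read_error`) and the keep-or-drop-digest behavior come from the text above.

```python
import hashlib
from pathlib import Path


def check_payload(blob_path: Path, recorded_digest: str) -> dict:
    """Classify one catalog blob following the refresh rules described above.

    Assumptions (not taken from the OmniGraph source): the digest is a
    SHA-256 hex string and each blob is a plain file at blob_path.
    """
    try:
        data = blob_path.read_bytes()
    except FileNotFoundError:
        # Missing blob: drifted, digest dropped so the next plan proposes a create.
        return {"status": "drifted", "condition": "payload_missing", "keep_digest": False}
    except OSError:
        # Any other IO failure: keep the digest and report an error instead,
        # so a transient fault does not trigger a republish.
        return {"status": "error", "condition": "payload_read_error", "keep_digest": True}

    if hashlib.sha256(data).hexdigest() != recorded_digest:
        # Readable but different content: drifted, digest dropped.
        return {"status": "drifted", "condition": "payload_mismatch", "keep_digest": False}
    return {"status": "ok", "condition": None, "keep_digest": True}
```

The ordering of the `except` clauses matters: `FileNotFoundError` is a subclass of `OSError`, so the not-found case has to be handled first to stay distinguishable from other IO errors.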
- -## Force Unlock - -`cluster force-unlock <LOCK_ID>` removes `<config-dir>/__cluster/lock.json` only -when the file exists, is valid version-1 lock JSON, and its `lock_id` exactly -matches the argument. A wrong id, missing lock, invalid lock JSON, or unsupported -lock version exits non-zero and leaves the file untouched. - -This is manual recovery for abandoned local locks. OmniGraph does not perform -PID-liveness checks, TTL expiry, stale-lock breaking, or automatic unlock -today. diff --git a/docs/user/clusters/index.md b/docs/user/clusters/index.md deleted file mode 100644 index 089fd4b..0000000 --- a/docs/user/clusters/index.md +++ /dev/null @@ -1,304 +0,0 @@ -# Operating an OmniGraph Cluster - -This is the operator's guide to the cluster control plane: how to go from an -empty directory to a served deployment, and how to run it day to day β€” -evolving schemas, rotating queries and policies, healing drift, approving -destructive changes, and recovering from crashes. - -It is a **how-to**. The reference for every `cluster.yaml` key, command flag, -state-file field, and diagnostic code is -[cluster-config.md](config.md); the HTTP surface is -[server.md](../operations/server.md). - -## The model in one paragraph - -You declare the entire deployment β€” graphs, schemas, stored queries, Cedar -policies β€” as files in one directory (`cluster.yaml` plus the `.pg`/`.gq`/ -`.yaml` files it references). `cluster apply` converges reality to that -declaration and records what it did in a state ledger -(`__cluster/state.json`); `cluster plan` previews exactly what apply would -do, including real schema-migration steps. A server started with -`omnigraph-server --cluster <dir>` serves what was applied β€” never what is -merely written in config. 
Terraform users will recognize the shape: config -is desired state, the ledger is recorded state, plan is the diff, apply is -the only thing that changes the world, and irreversible changes require an -explicitly recorded approval. - -## 1. Deploy a cluster from zero - -Lay out a config directory: - -``` -company-brain/ -├── cluster.yaml -├── people.pg # schema for the "knowledge" graph -├── queries/ # stored queries: the .gq files ARE the declaration -│ └── people.gq -└── base.policy.yaml # a Cedar policy bundle -``` - -```yaml -# cluster.yaml -version: 1 -# storage: s3://omnigraph-local/clusters/company-brain # optional: put the -# ledger, catalog, and graph data on object storage (default: this folder) -metadata: - name: company-brain -graphs: - knowledge: - schema: people.pg - queries: queries/ # every `query <name>` in queries/*.gq registers -policies: - base: - file: base.policy.yaml - applies_to: [knowledge] # graph-bound; use [cluster] for server-level -``` - -Bring it to life: - -```bash -omnigraph cluster validate --config company-brain # parse + typecheck everything -omnigraph cluster import --config company-brain # create the state ledger -omnigraph cluster plan --config company-brain # preview: what would apply do? -omnigraph cluster apply --config company-brain # converge -``` - -That single `apply` **creates the graph** (at the derived root -`company-brain/graphs/knowledge.omni`), applies its schema, and publishes -the query and policy into the content-addressed catalog -(`__cluster/resources/…`). The output lists every change with its -disposition; `converged: true` means there is nothing left to do, and re-running -`apply` is always safe and idempotent. 
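The "plan is the diff, apply converges" model can be pictured as a comparison of digests keyed by resource id. The sketch below is illustrative only: `plan` and `apply_plan` are hypothetical names, the digest strings are made up, and the real plan additionally carries dispositions, approval gating for deletes, and embedded migration steps.

```python
def plan(desired: dict, recorded: dict) -> list:
    """Diff desired digests against recorded ones, keyed by resource id."""
    changes = []
    for rid in sorted(desired.keys() | recorded.keys()):
        if rid not in recorded:
            changes.append(("Create", rid))
        elif rid not in desired:
            # In the real control plane a delete stays blocked until approved.
            changes.append(("Delete", rid))
        elif desired[rid] != recorded[rid]:
            changes.append(("Update", rid))
    return changes


def apply_plan(desired: dict, recorded: dict) -> dict:
    """Toy apply: the ledger simply takes on the desired digests."""
    return dict(desired)


# Hypothetical digests for the example cluster.
desired = {"graph.knowledge": "d1", "query.knowledge.find_person": "q2"}
recorded = {"graph.knowledge": "d1", "query.knowledge.find_person": "q1"}

print(plan(desired, recorded))   # one Update for the edited query
recorded = apply_plan(desired, recorded)
print(plan(desired, recorded))   # empty list: converged, re-running is a no-op
```

An empty plan after apply is what makes re-running safe in this toy model: the second run has nothing to do.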
- -Load data through the normal graph plane (the control plane manages -*definitions*, not rows): - -```bash -omnigraph load --data seed.jsonl company-brain/graphs/knowledge.omni -``` - -Serve it: - -```bash -OMNIGRAPH_SERVER_BEARER_TOKENS_JSON='{"act-reader":"s3cret"}' \ - omnigraph-server --cluster company-brain --bind 0.0.0.0:8080 -``` - -`--cluster` accepts either a **config directory** (the storage root resolves -through `cluster.yaml`'s `storage:` key) or a **storage-root URI directly** -(`--cluster s3://bucket/prefix`) for config-free serving: a serving box needs -only the URI and credentials, no checkout of the config repo. The ledger and -catalog on the bucket are the deployment artifact. - -`--cluster` is an **exclusive boot source**: it cannot be combined with a -graph URI or `--config`, and `omnigraph.yaml` is never read in -this mode. Routing is always multi-graph: - -```bash -curl -H 'authorization: Bearer s3cret' \ - -X POST http://localhost:8080/graphs/knowledge/queries/find_person \ - -H 'content-type: application/json' -d '{"params":{"name":"Ada"}}' -``` - -Bearer tokens and the bind address are deliberately *not* cluster facts: -they are per-replica, set by flag or environment -(see [server.md](../operations/server.md#modes) for the token sources). - -## 2. The day-2 loop: edit → plan → apply → restart - -Every change follows the same loop, whatever its kind: - -```bash -$EDITOR company-brain/people.pg # or any .gq / policy / cluster.yaml edit -omnigraph cluster plan --config company-brain -omnigraph cluster apply --config company-brain --as andrew -# restart cluster-booted servers to pick it up -``` - -`--as <actor>` attributes the run: it is recorded in recovery sidecars and -audit entries and threaded into the engine's commit history. Set -`operator: { actor: <you> }` in your `~/.omnigraph/config.yaml` to make it the -default when `--as` is omitted (the flag always wins; `approve` requires one -of the two). 
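The actor precedence just described (flag first, then the per-operator config, with `approve` refusing to run unattributed) can be sketched as below. The function `resolve_actor` and its signature are hypothetical, and returning `None` for an unattributed non-approve command is an assumption; the text does not say what happens in that case.

```python
from typing import Optional


def resolve_actor(flag_actor: Optional[str],
                  config_actor: Optional[str],
                  command: str) -> Optional[str]:
    """Pick the actor for a cluster command.

    flag_actor   -- value of --as, or None when the flag is omitted
    config_actor -- operator.actor from ~/.omnigraph/config.yaml, or None
    """
    # The flag always wins; the config value is only a default.
    actor = flag_actor if flag_actor is not None else config_actor
    if actor is None and command == "approve":
        # approve requires one of the two sources.
        raise ValueError("cluster approve needs --as or operator.actor")
    return actor  # may be None for other commands (assumed behavior)
```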
- -What each change kind does: - -| You edit | Plan shows | Apply does | -|---|---|---| -| a `.gq` file or `queries:` entry | `Update query.<g>.<n>` | publishes the new content-addressed blob, updates the ledger | -| a policy file | `Update policy.<n>` | same β€” new blob, ledger update | -| a policy's `applies_to` | `Update policy.<n> [bindings]` | records the new bindings (the file digest is unchanged; bindings are first-class changes) | -| a `.pg` schema | `Update schema.<g>` **with the real migration steps embedded** | runs the engine's schema apply on the live graph β€” soft drops only, sidecar-fenced | -| `graphs:` gains an entry | `Create graph.<g>` (+ schema, queries) | initializes the graph at its derived root; dependents apply in the same run | -| `graphs:` loses an entry | `Delete graph.<g>` β€” **blocked, `approval_required`** | nothing, until approved (see Β§4) | - -Two properties worth internalizing: - -- **One apply, ordered correctly.** Creates run first, then schema - migrations, then catalog writes, then (approved) deletes β€” so a schema - change plus a query that uses the new field converge together in one run. -- **Soft drops only.** A removed schema property disappears from the current - version while prior versions retain the data (reversible until `cleanup`). - Data-loss migrations are not reachable from cluster apply. - -Read the plan before applying when the change is non-trivial β€” for schema -updates it embeds the engine's actual migration plan (`add_property`, -`drop_property [soft]`, `unsupported: …`), so you see data impact before -anything runs. - -## 3. Inspect: status, refresh, drift - -```bash -omnigraph cluster status --config company-brain --json # ledger only, read-only -omnigraph cluster refresh --config company-brain # re-observe live graphs -``` - -`status` never touches the graphs; `refresh` opens them read-only and -records what it finds β€” manifest versions, live schema digests, catalog blob -integrity. 
If someone changed a graph behind the control plane's back (a -direct `omnigraph schema apply`, a tampered catalog file), refresh marks the -resource **`drifted`**. - -**Drift is converged, not just reported.** After a refresh records drift, -the next `plan` proposes migrating the live graph back to the declared -schema (with the steps visible, including the soft drops of out-of-band -fields), and `apply` executes it like any other change. If the out-of-band -change is the one you want, change the *config* to match instead, and apply -converges the ledger. - -## 4. Destructive changes: the approval gate - -Removing a graph from `cluster.yaml` never executes silently: - -```bash -omnigraph cluster apply --config company-brain -# Delete graph.scratch [Blocked: approval_required] - -omnigraph cluster approve graph.scratch --config company-brain --as andrew -# cluster approve: delete graph.scratch approved by andrew (approval 01KT…) - -omnigraph cluster apply --config company-brain --as andrew -# Delete graph.scratch [Applied] ← root removed, subtree tombstoned -``` - -The approval artifact (`__cluster/approvals/<id>.json`) is **digest-bound**: -it authorizes exactly the change you saw when you approved it. Any config or -state movement afterwards invalidates it automatically (`approval_stale` -warning), so a stale approval can never authorize a different delete. One -approval covers the graph's whole subtree (its schema and queries ride -along). Consumed artifacts are kept (rewritten with `consumed_at`) and -summarized in the ledger's `approval_records`, so the audit trail of *who -approved what* survives the loss of either store. - -## 5. When things go wrong - -**Crashes are designed for.** Every graph-moving operation (create, schema -apply, delete) writes a recovery sidecar before acting. 
If an apply dies -mid-run, the next state-mutating command sweeps the sidecars and reconciles: -rolling the ledger forward when the operation completed on the graph, -retiring stale intent when nothing moved, and flagging anything it cannot -verify. You generally fix a crashed run by **running `cluster apply` -again**. - -**A held lock** (a crashed process left `__cluster/lock.json`): - -```bash -omnigraph cluster status --config company-brain # shows the lock holder + id -omnigraph cluster force-unlock <LOCK_ID> --config company-brain -``` - -Force-unlock requires the exact lock id (from status); there is no blind -unlock. - -**A lost or corrupted state ledger**: the cluster is self-describing. -`cluster import` rebuilds `state.json` from the config plus read-only -observation of the live graphs; the next `apply` re-converges onto the same -content-addressed catalog. - -**A server that refuses to boot** with `--cluster` is telling you the -applied revision is not safely servable. 
Each refusal names its remedy: - -| Boot error | Meaning | Remedy | -|---|---|---| -| `cluster_state_missing` | no ledger | `cluster import`, then `apply` | -| `cluster_recovery_pending` | graph was quarantined because an interrupted operation awaits sweep | run `cluster apply` (or any state-mutating command), restart | -| `cluster_no_healthy_graphs` | every applied graph is quarantined or failed startup | sweep/fix the graph-specific failures, then restart | -| `catalog_payload_missing` / `…_digest_mismatch` | catalog blob lost or tampered | `cluster refresh`, then `apply`, restart | -| `policy_bindings_missing` | ledger predates binding metadata | re-run `cluster apply` (backfills), restart | -| `cluster_empty` | applied revision has no graphs | apply a cluster with ≥1 graph | -| multiple bundles bind one scope | serving holds one policy bundle per graph + one server-level | split or merge bundles | - -A held *state lock* is deliberately **not** a boot error: the server reads -the atomically-replaced ledger without locking, so serving never contends -with an in-flight apply. - -When at least one graph is healthy, graph-attributed recovery sidecars and -graph-local startup failures do not block the whole server. The affected -graph is skipped, its graph-only policy bindings and queries are omitted, -and `/graphs` lists only the ready graphs. Pass -`omnigraph-server --require-all-graphs` or set -`OMNIGRAPH_REQUIRE_ALL_GRAPHS=1` to make any such quarantine fail startup. - -## 6. Deployment patterns - -- **Replicas**: any number of `--cluster` servers can serve the same config - directory; boot is read-only. Roll out a change by running `apply` once, then - restarting replicas (serving is static per process; there is no hot - reload yet). Container/cloud recipes (AWS ECS+EFS, Railway volumes): - [deployment.md](../deployment.md#cluster-mode-in-containers-aws-railway). 
-- **The directory is the deployable unit**: config, catalog, ledger, - approvals, and graph data all live under it. Back it up as a whole; - version the *config files* (not `__cluster/` or `graphs/`) in git. -- **CI-driven convergence**: `validate` and `plan --json` are read-only and - safe in pipelines; gate `apply --as ci` on plan review. Approvals are the - human step by design β€” keep `cluster approve` out of automation. -- **`~/.omnigraph/config.yaml` is the per-operator config**: your - `operator.actor` default for `--as`, named servers/clusters, credentials, - profiles, and data-plane ergonomics (address a cluster graph by its derived - root like `company-brain/graphs/knowledge.omni` with `--store` for loads). The - cluster directory's `cluster.yaml` is the **sole deployment declaration** β€” the - server boots from the cluster only. - -## 7. Maintaining a cluster graph - -Storage maintenance (`optimize` / `repair` / `cleanup`) is **not** a control-plane -operation β€” it runs out-of-band, with direct storage access, against the graph's -roots. Address a cluster graph by name instead of hand-typing its storage path: - -```bash -omnigraph optimize --cluster ./company-brain --graph knowledge -omnigraph cleanup --cluster ./company-brain --graph knowledge --keep 10 --confirm -# --cluster also takes the storage-root URI directly (config-free), and a -# `clusters:` name from ~/.omnigraph/config.yaml: -omnigraph optimize --cluster s3://bucket/clusters/company-brain --graph knowledge -``` - -The graph's storage URI is resolved from the **served cluster state** (the same -truth a `--cluster` server boots from); a graph that hasn't been applied yet is -not resolvable. Run these from a host with storage access β€” there are no server -routes for them. Conversely, **`init` refuses** a cluster-managed path: graphs in -a cluster are created by `cluster apply`, not by hand. 
- -If the cluster has exactly **one** applied graph you can omit `--graph`; it is -used automatically. With **several**, omitting `--graph` errors and lists the -candidates; it never picks one for you. - -Against an **`s3://`-backed cluster** the resolved graph storage is non-local, so a -destructive `cleanup` additionally requires **`--yes`** (an interactive prompt -otherwise, refusal without a TTY) on top of `--confirm`; see [cli-reference.md](../cli/reference.md)'s -*Write diagnostics & destructive confirmation*. Every maintenance run also echoes -its resolved target to stderr (suppress with `--quiet`). - -## What the control plane does not do (yet) - -- **No hot reload**: applied changes serve on the next restart. -- **No data operations**: rows move through `omnigraph load / ingest / - mutate` against the graph roots, with branches and merges as usual. -- **Stored-query exposure is all-or-nothing per cluster**: every applied - query is listed and invokable (subject to Cedar `invoke_query`); per-query - exposure policy is a planned phase. -- **Pipelines (ETL)** are a separate project; the `pipelines:` key is - reserved and rejected loudly. - -For the full reference (every key, flag, status, disposition, and -diagnostic), see [cluster-config.md](config.md). diff --git a/docs/user/concepts/index.md b/docs/user/concepts/index.md deleted file mode 100644 index 8bc3d7e..0000000 --- a/docs/user/concepts/index.md +++ /dev/null @@ -1,49 +0,0 @@ -# Concepts - -OmniGraph is a typed property-graph engine built as a coordination layer over the -[Lance](https://lance.org) columnar storage format. It gives you a schema-checked -graph with vector, full-text, and graph queries in one runtime, plus Git-style -branches and commits across the whole graph. - -## The data model - -- A graph has **node types** and **edge types**, declared in a - [schema](../schema/index.md). 
-- Each node type and each edge type is stored as its **own Lance dataset** β€” - columnar, versioned, on local disk or object storage. -- A single `__manifest` table coordinates all of those datasets, so the graph has - one coherent version even though it spans many datasets. - -This split is what lets a graph commit be **atomic across every type at once**: a -publish flips every relevant dataset's version together in one manifest write, so -readers never see a half-applied change. See [storage](storage.md) for the layout. - -## Two layers: inherited vs. added - -Throughout the docs, capabilities are framed as **L1** (inherited from Lance) or -**L2** (added by OmniGraph): - -| | L1 β€” from Lance | L2 β€” added by OmniGraph | -|---|---|---| -| Storage | Columnar Arrow datasets on object storage | Per-type datasets coordinated as one graph | -| Versioning | Per-dataset versions + time travel | [Snapshots](../branching/time-travel.md) across all types at once | -| Branches | Per-dataset branches | [Graph-level branches](../branching/index.md), atomic across types | -| Commits | Per-dataset commits | [Commit DAG](../branching/index.md) for the whole graph; three-way [merge](../branching/merge.md) | -| Indexes | Scalar / vector / full-text indexes | Built per relevant column; graph topology index for traversal | -| Search | Vector + full-text primitives | [`nearest` / `bm25` / `rrf`](../search/index.md) in one query, plus graph traversal | -| Querying | β€” | The [`.gq` query language](../queries/index.md) and [`.pg` schema language](../schema/index.md) | - -## How the pieces fit - -- The **schema** (`.pg`) and **query** (`.gq`) languages are compiled to a typed - intermediate representation. -- The **engine** runs queries and mutations against Lance, coordinates the manifest, - maintains the commit graph, and builds indexes. 
-- The **CLI** ([`omnigraph`](../cli/index.md)) and the - **HTTP server** ([`operations/server.md`](../operations/server.md)) are two front - ends over the same engine, so embedded and remote behavior match. -- [Cedar policy](../operations/policy.md) enforcement is engine-wide β€” every writer - goes through the same authorization gate regardless of front end. - -For deployment-scale topics β€” multi-graph servers, control-plane operations, -recovery β€” see [clusters](../clusters/index.md). diff --git a/docs/user/concepts/storage.md b/docs/user/concepts/storage.md deleted file mode 100644 index e3d9ef1..0000000 --- a/docs/user/concepts/storage.md +++ /dev/null @@ -1,115 +0,0 @@ -# Storage - -## L1 β€” Lance dataset (per node/edge type) - -Every node type and every edge type is its own Lance dataset: - -- **Columnar Arrow storage**: each property is a column; nullable per Arrow schema. -- **Fragments**: data is partitioned into fragments; new writes create new fragments. -- **Manifest versioning**: every commit produces a new dataset version; old versions remain readable. -- **Stable row IDs**: stable row IDs are enabled on every Lance dataset OmniGraph creates β€” node and edge data tables, `__manifest`, the commit-graph datasets, and any future system tables. This is an architectural invariant: the flag is one-way at dataset create, so a future change that introduces a Lance dataset must preserve it. Consequences: `_row_created_at_version` and `_row_last_updated_at_version` are available on every dataset (load-bearing for change-feed validators); indices survive `omnigraph optimize`. Pre-0.4.x graphs created before this code path settled may have datasets without the flag and cannot be retrofitted in place β€” the supported path is dump-and-reload. The rewrite path used by `schema_apply` preserves the flag. -- **Append / delete / `merge_insert`**: native Lance write modes. -- **Per-dataset branches** (Lance native): copy-on-write at the dataset level. 
-- **Object-store agnostic**: file://, s3://, gs://, az://, http (read-only via Lance) β€” OmniGraph wires file:// and s3://. - -## L2 β€” Multi-dataset coordination via `__manifest` - -OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordinated through one append-only manifest table. - -- **Manifest table**: `__manifest/` Lance dataset. -- **Layout**: - - `nodes/{fnv1a64-hex(type_name)}` β€” one Lance dataset per node type - - `edges/{fnv1a64-hex(edge_type_name)}` β€” one Lance dataset per edge type - - `__manifest/` β€” the catalog of all sub-tables and their published versions, **and** the graph commit lineage (RFC-013 Phase 7) - - `_graph_commits.lance` / `_graph_commit_actors.lance` β€” legacy / branch-ref carriers. Since RFC-013 Phase 7 the graph lineage lives in `__manifest` (`graph_commit` / `graph_head` rows, written in the publish CAS); `_graph_commits.lance` no longer receives commit rows, but is retained to carry the Lance branch refs that `create_branch` / `list_branches` / the `cleanup` orphan reconciler operate on. A graph created before Phase 7 (internal schema v3) keeps its lineage here until its first read-write open, which migrates it into `__manifest` via `migrate_v3_to_v4`. - - (legacy `_graph_runs.lance` / `_graph_run_actors.lance` from pre-v0.4.0 graphs are inert; the run state machine was removed. 
The internal schema migration sweeps stale `__run__*` branches on first write-open; the inert dataset bytes themselves remain until a prefix-delete storage primitive lands) -- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`): - - `object_type` ∈ `table | table_version | table_tombstone | graph_commit | graph_head` - - `table_key` ∈ `node:<TypeName> | edge:<EdgeName>` (empty for `graph_commit` / `graph_head` lineage rows) - - `table_branch` is `null` for the main lineage and the branch name otherwise - - **Graph lineage rows** (RFC-013 Phase 7): one immutable `graph_commit` row per commit (`object_id` = the commit ULID; `metadata` JSON carries parent / merged-parent / actor / timestamp) plus one mutable `graph_head:<branch>` pointer per branch (`graph_head:main` for main). The in-memory commit DAG is a projection of these rows. -- **Snapshot reconstruction**: latest visible `table_version` per `(table_key, table_branch)` minus tombstones β€” rows where `object_type = table_tombstone`, whose own `table_version` (acting as the tombstone version) is `>= the entry's table_version`. -- **Atomic publish**: multi-dataset commits publish so that a single write to `__manifest` flips all the new sub-table versions visible at once. -- **Row-level CAS on the merge-insert join key**: `object_id` carries an unenforced-primary-key annotation so Lance's bloom-filter conflict resolver rejects two concurrent commits that land the same `object_id` row. Without this annotation, Lance's transparent rebase would admit silent duplicates from racing publishers. -- **Optimistic concurrency control on publish**: a publish asserts the manifest's current latest non-tombstoned version for each touched table is exactly what the caller observed; mismatches surface as an `ExpectedVersionMismatch` manifest conflict naming the table and the expected/actual versions. 
Concurrent advances surface as a conflict rather than being silently rebased through. - -### Internal schema versioning - -The on-disk shape of `__manifest` is reconciled with the binary via a single version stamp held in the manifest dataset's schema-level metadata. - -- **Graph creation** stamps the current version, so newly initialized graphs never need migration. -- **The open-for-write path** migrates the on-disk stamp before reading state. When the stamp matches the binary, this is a single metadata read with no writes; otherwise the migration walks steps forward (1β†’2, 2β†’3, …) until the stamp matches, then proceeds with the publish. Reads stay side-effect-free. -- **Forward-version protection**: a stamp *higher* than the binary's known version triggers a clear "upgrade omnigraph first" error. An old binary cannot clobber a newer schema by silently treating "unknown stamp" as "missing stamp". -- **Idempotency**: each migration step is safe to re-run. A crash between two metadata updates inside a single step leaves the partial state; the next open re-runs the step and the second update lands. - -| Stamp | Shape change | -|---|---| -| v1 (implicit, pre-stamp) | `__manifest.object_id` had no PK annotation; no row-level CAS protection. | -| v2 | `__manifest.object_id` carries an unenforced-primary-key annotation; row-level CAS engaged. | -| v3 | One-time sweep of legacy `__run__*` staging branches (pre-v0.4.0 Run state machine, removed) off `__manifest`. Runs at read-write open and on publish. | - -## On-disk layout - -A graph on disk is a directory tree of Lance datasets. Each dataset follows the standard Lance layout (`_versions/`, `data/`, `_indices/`, `_refs/`); OmniGraph adds the multi-dataset coordination by keeping `__manifest/` alongside the per-type datasets. 
- -```mermaid -flowchart TB - classDef l1 fill:#fef3e8,stroke:#c46900,color:#000 - classDef l2 fill:#e8f4fd,stroke:#1e6aa8,color:#000 - - graph["graph URI<br/>file:// or s3://bucket/prefix"]:::l2 - - manifest["__manifest/<br/>L2 catalog of sub-tables"]:::l2 - nodes["nodes/{fnv1a64-hex}/<br/>one dataset per node type"]:::l2 - edges["edges/{fnv1a64-hex}/<br/>one dataset per edge type"]:::l2 - cgraph["_graph_commits.lance/<br/>_graph_commit_actors.lance/<br/>_graph_commit_recoveries.lance/"]:::l2 - recovery["__recovery/{ulid}.json<br/>recovery sidecars (transient)"]:::l2 - refs["_refs/branches/{name}.json<br/>graph-level branches"]:::l2 - - graph --> manifest - graph --> nodes - graph --> edges - graph --> cgraph - graph --> recovery - graph --> refs - - subgraph dataset[Inside each Lance dataset β€” L1] - ds_v["_versions/{n}.manifest<br/>per-dataset versions"]:::l1 - ds_data["data/<br/>fragment files (Arrow IPC)"]:::l1 - ds_idx["_indices/{uuid}/<br/>BTREE Β· Inverted FTS Β· IVF/HNSW"]:::l1 - ds_refs["_refs/<br/>per-dataset Lance branches/tags"]:::l1 - ds_tx["_transactions/<br/>commit transaction logs"]:::l1 - end - - nodes -.-> dataset - edges -.-> dataset - manifest -.-> dataset -``` - -**What's where:** - -- **Graph root** is one directory (or S3 prefix). Everything below is part of one OmniGraph graph. -- **`__manifest/`** is a Lance dataset whose rows describe which sub-table version is published at which graph-branch. Reading a snapshot starts here. -- **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe. -- **`_graph_commits.lance`** is an L2 dataset retained only as a branch-ref carrier (and, on a pre-Phase-7 graph, the migration source). 
Since RFC-013 Phase 7 the graph commit DAG lives in `__manifest` as `graph_commit` / `graph_head` rows written in the publish CAS β€” `_graph_commits.lance` and its paired `_graph_commit_actors.lance` no longer receive commit rows. A graph created before Phase 7 (internal schema v3) backfills its lineage into `__manifest` on its first read-write open (`migrate_v3_to_v4`). (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; the internal schema migration sweeps their stale `__run__*` branches, and the dataset bytes are reclaimed once a prefix-delete primitive lands.) -- **`_graph_commit_recoveries.lance`** β€” one row per crash-recovery action. Joined by `graph_commit_id` to the graph commit lineage (the `graph_commit` rows in `__manifest` since RFC-013 Phase 7); the linked commit carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join. -- **`__recovery/{ulid}.json`** β€” transient sidecar files written by a writer before it advances the underlying dataset, deleted once the matching manifest publish succeeds. A sidecar persisting after process exit means the writer crashed mid-commit; the next read-write open processes it. Steady-state directory is empty. -- **`_refs/branches/{name}.json`** is graph-level branch metadata β€” pointers from a branch name to the manifest version it heads. -- **Inside each Lance dataset** (orange): the standard Lance directory layout. `_versions/{n}.manifest` records every commit; `data/` holds the actual Arrow fragments; `_indices/{uuid}/` holds index segments with their own `fragment_bitmap` for partial coverage; `_refs/` holds Lance-native per-dataset branches and tags. 
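The fixed-length, case-safe directory names mentioned in the list above can be illustrated with the standard 64-bit FNV-1a hash. This is a sketch, not the engine's code: the text only says `fnv1a64-hex` of the type name, so hashing the UTF-8 bytes of the name and rendering the result as 16 lowercase, zero-padded hex digits are assumptions, and `dataset_path` is a hypothetical helper.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001B3              # standard FNV 64-bit prime


def fnv1a64_hex(name: str) -> str:
    """Return the FNV-1a 64-bit hash of name as 16 lowercase hex digits."""
    h = FNV64_OFFSET_BASIS
    for byte in name.encode("utf-8"):  # assumption: hash the UTF-8 bytes
        h ^= byte                      # FNV-1a: xor first ...
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # ... then multiply, mod 2**64
    return f"{h:016x}"


def dataset_path(kind: str, type_name: str) -> str:
    """Build a relative path such as nodes/<hash> (hypothetical helper)."""
    return f"{kind}/{fnv1a64_hex(type_name)}"


print(dataset_path("nodes", "Person"))
print(dataset_path("edges", "Knows"))
```

Whatever the type name's length or casing, the directory component is always 16 hex characters, which is the property the layout relies on.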
- -The split β€” L2 owns the cross-dataset catalog; L1 owns the per-dataset internals β€” means that schema work (which adds or removes datasets) updates `__manifest`, while data work (which adds fragments) updates `_versions/` inside the affected dataset and then bumps `__manifest`. - -## URI scheme support - -| Scheme | Backend | Notes | -|---|---|---| -| local path / `file://` | local filesystem | Normalized to absolute paths; relative and dot-segment paths are lexically absolutized | -| `s3://bucket/prefix` | S3 object store | Honors `AWS_ENDPOINT_URL_S3`, `AWS_ALLOW_HTTP`, `AWS_S3_FORCE_PATH_STYLE` | -| `http(s)://host:port` | HTTP client to `omnigraph-server` | Used by CLI as a target, not a storage backend | - -## Object-store env vars (S3-compatible) - -- `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN` -- `AWS_ENDPOINT_URL`, `AWS_ENDPOINT_URL_S3` β€” for MinIO / RustFS / GCS-via-XML -- `AWS_S3_FORCE_PATH_STYLE=true` β€” path-style URLs -- `AWS_ALLOW_HTTP=true` β€” allow plain HTTP (local dev) diff --git a/docs/user/constants.md b/docs/user/constants.md new file mode 100644 index 0000000..527aaea --- /dev/null +++ b/docs/user/constants.md @@ -0,0 +1,22 @@ +# Constants & Tunables (cheat sheet) + +| Name | Value | Where | +|---|---|---| +| `MANIFEST_DIR` | `__manifest` | `db/manifest/layout.rs` | +| Commit graph dir | `_graph_commits.lance` | `db/commit_graph.rs` | +| Run registry dir (legacy, removed MR-771) | `_graph_runs.lance` | inert post-v0.4.0; reclaimed by MR-770 | +| Run branch prefix (legacy, removed MR-771) | `__run__` | filtered by `is_internal_run_branch` defense-in-depth | +| Schema apply lock | `__schema_apply_lock__` | `db/mod.rs` | +| Manifest publisher retry budget | `PUBLISHER_RETRY_BUDGET = 5` | `db/manifest/publisher.rs` | +| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` | `db/manifest/migrations.rs` | +| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | `exec/merge.rs` | +| 
Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | `db/omnigraph/optimize.rs` | +| Graph index cache size | `8` (LRU) | `runtime_cache.rs` | +| Default body limit | `1 MB` | `omnigraph-server/lib.rs` | +| Ingest body limit | `32 MB` | `omnigraph-server/lib.rs` | +| Engine embed model | `gemini-embedding-2-preview` | `omnigraph/embedding.rs` | +| Compiler embed model | `text-embedding-3-small` | `omnigraph-compiler/embedding.rs` | +| Embed timeout | `30 000 ms` | both clients | +| Embed retries | `4` | both clients | +| Embed retry backoff | `200 ms` | both clients | +| LANCE memory pool default | `1 GB` (raised in v0.3.0) | runtime | diff --git a/docs/user/deployment.md b/docs/user/deployment.md index 1772b9a..fc5ee08 100644 --- a/docs/user/deployment.md +++ b/docs/user/deployment.md @@ -13,11 +13,6 @@ Omnigraph supports two broad deployment shapes: The server binary and container image expose the same HTTP surface. -The server has a single **boot source**: a **cluster directory** -(`omnigraph-server --cluster <dir | s3://…>`), which serves the cluster control -plane's applied revision — see -[cluster-config.md](clusters/config.md#serving-from-the-cluster-the-mode-switch). - ## Binary Deployment Build or install: @@ -25,150 +20,64 @@ Build or install: - `omnigraph` - `omnigraph-server` -On Windows, the binaries are `omnigraph.exe` and `omnigraph-server.exe`. - -The server boots from a cluster only (RFC-011) — there is no positional -`<URI>` / single-graph boot. 
Point it at a local cluster directory: +Run against a local graph: ```bash -omnigraph-server --cluster ./company-brain --bind 0.0.0.0:8080 +omnigraph-server ./graph.omni --bind 0.0.0.0:8080 ``` -Or boot config-free from an object-storage-rooted cluster: +Run against an object-store-backed graph: ```bash OMNIGRAPH_SERVER_BEARER_TOKEN="change-me" \ AWS_REGION="us-east-1" \ -omnigraph-server --cluster s3://my-bucket/clusters/company-brain \ +omnigraph-server s3://my-bucket/graphs/example/releases/2026-04-10-v0.1.0 \ --bind 0.0.0.0:8080 ``` -The server serves every graph in the cluster's applied revision under -`/graphs/{id}/...`. See [clusters](clusters/index.md) for authoring and -applying a cluster. +## One-Command Local RustFS Bootstrap -## Cluster Mode in Containers (AWS, Railway) - -A cluster-booted deployment has **two shapes** since the `storage:` root: - -- **Bucket, no volume (preferred for cloud)** — the cluster's ledger, - catalog, and graph data live under an object-storage root - (`storage: s3://bucket/prefix` in `cluster.yaml`). The server boots - **config-free** from the bare URI; the container needs no volume at all: - - ```bash - docker run -d \ - -e OMNIGRAPH_CLUSTER=s3://my-bucket/clusters/company-brain \ - -e AWS_ACCESS_KEY_ID=... -e AWS_SECRET_ACCESS_KEY=... \ - -e OMNIGRAPH_SERVER_BEARER_TOKEN=... \ - -p 8080:8080 <image> - ``` - - Day-2 runs from any operator checkout of the config repo: - `omnigraph cluster apply --config ./company-brain` (the `storage:` key - routes every stored byte to the bucket), then restart the service. The - state lock is genuinely cross-machine on object storage, so CI and - operator shells contend safely. - -- **Volume (file-rooted)** — the original shape: the whole cluster - directory on a mounted volume. 
Still fully supported; the container
Still fully supported; the container - contract: +The easiest local S3-backed deployment path is: ```bash -docker run -d \ - -v /srv/company-brain:/var/lib/omnigraph/cluster \ - -e OMNIGRAPH_CLUSTER=/var/lib/omnigraph/cluster \ - -e OMNIGRAPH_SERVER_BEARER_TOKEN=... \ - -p 8080:8080 <image> +curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/local-rustfs-bootstrap.sh | bash ``` -`OMNIGRAPH_CLUSTER` is the server's only boot source. The image also -ships the `omnigraph` CLI, so the day-2 loop runs in-container: +The bootstrap: -```bash -docker exec -it <container> sh -c \ - 'omnigraph cluster apply --as <you> --config /var/lib/omnigraph/cluster' -# then restart the container to pick up the applied state -``` +- starts a local RustFS-backed object store +- creates a bucket and S3-backed Omnigraph graph +- loads the checked-in context fixture +- starts `omnigraph-server` on `127.0.0.1:8080` -### AWS (ECS/Fargate + EFS) +Supported behavior: -1. Push the image to ECR (the `package.yml` workflow builds it). -2. Create an EFS filesystem; mount it in the task definition at - `/var/lib/omnigraph/cluster`. -3. Task environment: `OMNIGRAPH_CLUSTER=/var/lib/omnigraph/cluster`, bearer - tokens via Secrets Manager/SSM into `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON` - (or the `--features aws` build's native Secrets Manager source). -4. ALB in front for TLS; target the container's 8080 with `/healthz` checks. -5. Day-2: ECS exec into the task β†’ edit/upload config on the volume β†’ - `omnigraph cluster apply --as <you> --config /var/lib/omnigraph/cluster` - β†’ force a new deployment (restart). 
+- downloads the rolling `edge` binary when one exists for the current platform +- otherwise clones `ModernRelay/omnigraph` and builds from source +- reuses an existing RustFS container if it is already running -For a stateless, volume-free deployment, root the cluster on object -storage and boot config-free with -`OMNIGRAPH_CLUSTER=s3://bucket/clusters/<name>` (the bucket-no-volume -shape above) — the simplest AWS architecture. +Useful overrides: -### Railway +- `WORKDIR=/path/to/state` +- `BUCKET=omnigraph-local` +- `PREFIX=graphs/context` +- `RESET_REPO=1` to delete an existing partially initialized graph prefix before recreating it +- `BIND=127.0.0.1:8080` +- `RUSTFS_CONTAINER_NAME=omnigraph-rustfs-demo` -1. Create a service from the image; attach a **volume** mounted at - `/var/lib/omnigraph/cluster`. -2. Variables: `OMNIGRAPH_CLUSTER=/var/lib/omnigraph/cluster`, - `OMNIGRAPH_SERVER_BEARER_TOKEN=<token>`. Railway terminates TLS at its - edge and routes to the exposed 8080. -3. Day-2: `railway shell` (or `railway run`) → `omnigraph cluster apply - --as <you> --config /var/lib/omnigraph/cluster` → redeploy/restart the - service. +The bootstrap expects: -### Constraints (current honest list) +- Docker +- `curl` +- either a matching release asset or a local Rust toolchain plus `git` -- **No hot reload** — applied changes serve on the next restart. -- **Single-writer apply** — run `cluster apply` from one place at a time - (the state lock enforces this; CI or one operator shell, not both). -- **Multi-replica serving off a shared volume (EFS) is documented but - unvalidated** — boot is lock-free read-only so it should compose, but it - is not yet exercised by tests. +If `aws` is not installed, the script attempts a user-local AWS CLI install via +`python3 -m pip`. Docker Desktop or another Docker daemon must already be +running. 
-## Testing against S3 locally - -To exercise the S3 storage path without a cloud account, run any S3-compatible -store in Docker and point the standard `AWS_*` environment at it. RustFS is -shown; MinIO works the same way. - -```bash -docker run -d --name omnigraph-s3 -p 9000:9000 \ - -e RUSTFS_ACCESS_KEY=omnigraph -e RUSTFS_SECRET_KEY=omnigraph \ - -e RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true \ - rustfs/rustfs:latest /data - -export AWS_ACCESS_KEY_ID=omnigraph AWS_SECRET_ACCESS_KEY=omnigraph \ - AWS_REGION=us-east-1 AWS_ENDPOINT_URL_S3=http://127.0.0.1:9000 \ - AWS_ALLOW_HTTP=true AWS_S3_FORCE_PATH_STYLE=true - -# create the bucket once (any S3 client works) -aws --endpoint-url "$AWS_ENDPOINT_URL_S3" s3 mb s3://omnigraph-local -``` - -Now an `s3://…` URI works anywhere a graph or cluster root is expected. Root a -cluster on the bucket and serve it config-free: - -```bash -# cluster.yaml -# version: 1 -# storage: s3://omnigraph-local/clusters/demo -# graphs: { demo: { schema: schema.pg } } - -omnigraph cluster validate --config . -omnigraph cluster import --config . -omnigraph cluster apply --config . --as you -omnigraph load --data seed.jsonl --mode merge \ - s3://omnigraph-local/clusters/demo/graphs/demo.omni -omnigraph-server --cluster s3://omnigraph-local/clusters/demo \ - --bind 127.0.0.1:8080 --unauthenticated -``` - -The same `AWS_*` contract applies to a production object store — swap the -endpoint and credentials. CI exercises this path against containerized RustFS. +If a previous bootstrap left objects behind under the selected `PREFIX` but did +not finish initializing the graph, rerun with `RESET_REPO=1` or choose a new +`PREFIX`. ## Container Deployment @@ -178,44 +87,26 @@ Build the image: docker build -t omnigraph-server:local . ``` -The server boots from a cluster only (RFC-011). 
Run against a cluster -directory on a mounted volume: +Run against a local graph: ```bash docker run --rm -p 8080:8080 \ - -v "$PWD/company-brain:/var/lib/omnigraph/cluster" \ + -v "$PWD/graph.omni:/data/graph.omni" \ omnigraph-server:local \ - --cluster /var/lib/omnigraph/cluster --bind 0.0.0.0:8080 + /data/graph.omni --bind 0.0.0.0:8080 ``` -Run config-free against an object-storage-rooted cluster: +Run against an S3-backed graph: ```bash docker run --rm -p 8080:8080 \ -e OMNIGRAPH_SERVER_BEARER_TOKEN="change-me" \ -e AWS_REGION="us-east-1" \ omnigraph-server:local \ - --cluster s3://my-bucket/clusters/company-brain \ + s3://my-bucket/graphs/example/releases/2026-04-10-v0.1.0 \ --bind 0.0.0.0:8080 ``` -### Container entrypoint env vars - -When no positional args are given, the image entrypoint -(`docker/entrypoint.sh`) builds the server command from env vars: - -| Var | Effect | -|---|---| -| `OMNIGRAPH_CLUSTER` | Cluster boot source — a config directory or a storage-root URI, forwarded as `--cluster`. The only boot source. | -| `OMNIGRAPH_BIND` | Listen address (default `0.0.0.0:8080`). | -| `OMNIGRAPH_REQUIRE_ALL_GRAPHS` | When truthy, forwarded as `--require-all-graphs`: any graph-local quarantine or startup failure aborts cluster boot instead of serving the healthy subset. | - -Per-graph and server-level Cedar policy come from the cluster's applied -revision (authored in `cluster.yaml` and published with `cluster apply`), -not from a separate config file. The cluster docker shapes — volume vs. -config-free object-storage root — are detailed under -[Cluster Mode in Containers](#cluster-mode-in-containers-aws-railway) above. 
- ## Auth The server can run unauthenticated for local development only when explicitly @@ -250,10 +141,8 @@ The server binary ships in two flavors: | **AWS** | `cargo build --release --features aws` | Adds AWS Secrets Manager backend for bearer tokens | Tagged release archives contain the default `omnigraph` and -`omnigraph-server` binaries on macOS / Linux, and `omnigraph.exe` plus -`omnigraph-server.exe` on Windows. AWS-enabled server binaries are built from -source with `cargo build --release --features aws -p omnigraph-server` when -needed. +`omnigraph-server` binaries. AWS-enabled server binaries are built from source +with `cargo build --release --features aws -p omnigraph-server` when needed. The AWS build adds ~150 transitive deps and ~30-60s of first-build compile time. Default builds don't pay that cost. diff --git a/docs/user/embeddings.md b/docs/user/embeddings.md new file mode 100644 index 0000000..382e683 --- /dev/null +++ b/docs/user/embeddings.md @@ -0,0 +1,31 @@ +# Embeddings + +OmniGraph has **two** embedding clients with different defaults and purposes. 
+ +## Compiler-side client (`omnigraph-compiler/src/embedding.rs`) — query-time normalization + +- Default model: `text-embedding-3-small` (OpenAI-style schema) +- Env: `NANOGRAPH_EMBED_MODEL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL` (default `https://api.openai.com/v1`), `NANOGRAPH_EMBEDDINGS_MOCK`, `NANOGRAPH_EMBED_TIMEOUT_MS=30000`, `NANOGRAPH_EMBED_RETRY_ATTEMPTS=4`, `NANOGRAPH_EMBED_RETRY_BACKOFF_MS=200` +- Methods: `embed_text(input, expected_dim)`, `embed_texts(inputs, expected_dim)` +- Mock mode: deterministic FNV-1a + xorshift64 → L2-normalized vectors + +## Engine-side client (`omnigraph/src/embedding.rs`) — runtime ingest + +- Model: `gemini-embedding-2-preview` +- Env: `GEMINI_API_KEY`, `OMNIGRAPH_GEMINI_BASE_URL` (default Google generativelanguage v1beta), `OMNIGRAPH_EMBED_TIMEOUT_MS=30000`, `OMNIGRAPH_EMBED_RETRY_ATTEMPTS=4`, `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS=200`, `OMNIGRAPH_EMBEDDINGS_MOCK` +- Two task types: `embed_query_text` (RETRIEVAL_QUERY) and `embed_document_text` (RETRIEVAL_DOCUMENT) +- Exponential backoff with retryable detection (timeouts, 429, 5xx) + +## Schema integration + +Mark a Vector property with `@embed("source_text_property")`. At ingest, the engine pulls the source text and writes the embedding into the vector column. Stored as L2-normalized FixedSizeList(Float32, dim). + +## CLI `omnigraph embed` (offline file pipeline) + +Operates on **JSONL files** (not on a graph). Three modes (mutually exclusive): + +- (default) `fill_missing` — only embed rows whose target field is empty +- `--reembed-all` — overwrite all +- `--clean` — strip embeddings + +Inputs are either a single seed manifest YAML or `--input/--output/--spec`. Selectors `--type T`, `--select T:field=value` filter rows. Streams JSONL → JSONL.
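The mock mode listed above (deterministic FNV-1a + xorshift64, then L2 normalization) can be pictured with a small sketch. It assumes one plausible reading of that bullet: hash the input text with FNV-1a to seed an xorshift64 generator, draw `dim` values, then scale the vector to unit length. The function names, the shift triple, and the bits-to-float mapping are illustrative assumptions, not the code in `omnigraph-compiler/src/embedding.rs`.

```rust
/// FNV-1a 64-bit hash (standard offset basis and prime).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// One xorshift64 step using the common (13, 7, 17) shift triple.
/// The state must be non-zero, otherwise the generator stays at zero.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Hypothetical mock embedder: the same text and dimension always
/// produce the same L2-normalized vector, with no network call.
fn mock_embed(text: &str, dim: usize) -> Vec<f32> {
    // Seed from the text; setting the low bit rules out the zero state.
    let mut state = fnv1a64(text.as_bytes()) | 1;
    let mut v: Vec<f32> = (0..dim)
        .map(|_| {
            // Keep the top 24 bits and map them into [-1, 1).
            let bits = xorshift64(&mut state) >> 40;
            (bits as f32 / (1u64 << 23) as f32) - 1.0
        })
        .collect();
    // Scale to unit length (L2 normalization).
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    v
}
```

Determinism is the point of a mock like this: tests can compare vectors or run nearest-neighbor queries without an API key, and repeated runs see identical inputs.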
diff --git a/docs/user/operations/errors.md b/docs/user/errors.md similarity index 64% rename from docs/user/operations/errors.md rename to docs/user/errors.md index 85b4fde..fd047eb 100644 --- a/docs/user/operations/errors.md +++ b/docs/user/errors.md @@ -9,10 +9,10 @@ - `Manifest(ManifestError { kind: BadRequest|NotFound|Conflict|Internal, details: Option<ManifestConflictDetails>, … })` - `ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected, actual }` — caller's `expected_table_versions` did not match the manifest's current latest non-tombstoned version (set by `OmniError::manifest_expected_version_mismatch`). - `ManifestConflictDetails::RowLevelCasContention` — Lance row-level CAS rejected the publish because a concurrent writer landed the same `object_id`. Retried internally by the publisher; only surfaces if the retry budget exhausts. - - **D₂ parse-time rejection**: a single mutation query that mixes inserts/updates with deletes errors out *before any I/O* with kind `BadRequest`. Message: `mutation '<name>' on the same query mixes inserts/updates and deletes; split into separate mutations: (1) inserts and updates, then (2) deletes`. See [query-language.md](../queries/index.md) for the rule. + - **D₂ parse-time rejection** (MR-794): a single mutation query that mixes inserts/updates with deletes errors out *before any I/O* with kind `BadRequest`. Message: `mutation '<name>' on the same query mixes inserts/updates and deletes; split into separate mutations: (1) inserts and updates, then (2) deletes`. See [docs/user/query-language.md](query-language.md) for the rule and [docs/dev/runs.md](../dev/runs.md) for the underlying staged-write rationale. - `MergeConflicts(Vec<MergeConflict>)` -Compiler-side `CompilerError` covers parse / catalog / type / storage / plan / execution / arrow / lance / IO / manifest / unique-constraint, each with structured spans (`SourceSpan { start, end }`) for ariadne-style diagnostics. 
The legacy `NanoError` name remains as a deprecated compatibility alias. +Compiler-side `NanoError` covers parse / catalog / type / storage / plan / execution / arrow / lance / IO / manifest / unique-constraint, each with structured spans (`SourceSpan { start, end }`) for ariadne-style diagnostics. ## Result serialization (`omnigraph_compiler::result::QueryResult`) diff --git a/docs/user/index.md b/docs/user/index.md index cabd98a..1b93efa 100644 --- a/docs/user/index.md +++ b/docs/user/index.md @@ -2,68 +2,42 @@ **Audience:** users, CLI users, HTTP clients, and self-hosting operators -This is the public-facing entry point. These docs describe behavior, commands, -configuration, and operational contracts without requiring knowledge of internal -recovery mechanics or contributor-only invariants. They are organized by topic — -start with install, then follow the section that matches your task. +This is the public-facing entry point. These docs should describe behavior, +commands, configuration, and operational contracts without requiring knowledge +of MRs, internal recovery mechanics, or contributor-only invariants. 
-## Start here +## Start Here | Goal | Read | |---|---| | Install OmniGraph | [install.md](install.md) | -| Run the core loop end to end | [quickstart.md](quickstart.md) | -| Understand the model | [concepts/index.md](concepts/index.md) | -| Run the CLI | [cli/index.md](cli/index.md) | -| Look up every CLI flag and config field | [cli/reference.md](cli/reference.md) | +| Run the CLI locally | [cli.md](cli.md) | +| Look up every CLI flag and config field | [cli-reference.md](cli-reference.md) | +| Write schemas | [schema-language.md](schema-language.md) | +| Read schema-lint diagnostic codes | [schema-lint.md](schema-lint.md) | +| Write queries and mutations | [query-language.md](query-language.md) | +| Use embeddings | [embeddings.md](embeddings.md) | -## Schema & queries +## Operate A Graph | Goal | Read | |---|---| -| Write schemas (the `.pg` language) | [schema/index.md](schema/index.md) | -| Read schema-lint diagnostic codes | [schema/lint.md](schema/lint.md) | -| Write queries (the `.gq` language) | [queries/index.md](queries/index.md) | -| Write data — inserts, updates, deletes | [mutations/index.md](mutations/index.md) | -| Use vector / full-text / hybrid search | [search/index.md](search/index.md) | -| Generate embeddings | [search/embeddings.md](search/embeddings.md) | -| Build and use indexes | [search/indexes.md](search/indexes.md) | +| Understand graph layout and URI support | [storage.md](storage.md) | +| Work with branches, commits, and snapshots | [branches-commits.md](branches-commits.md) | +| Coordinate multi-query workflows | [transactions.md](transactions.md) | +| Read diffs and change feeds | [changes.md](changes.md) | +| Build and use indexes | [indexes.md](indexes.md) | +| Compact and clean old versions | [maintenance.md](maintenance.md) | +| Interpret errors and output formats | [errors.md](errors.md) | -## Branching & version control - -| Goal | Read | -|---|---| -| Work with branches and commits | [branching/index.md](branching/index.md) 
| -| Read past versions (time travel) | [branching/time-travel.md](branching/time-travel.md) | -| Merge branches and resolve conflicts | [branching/merge.md](branching/merge.md) | -| Coordinate multi-query workflows | [branching/transactions.md](branching/transactions.md) | -| Read diffs and change feeds | [branching/changes.md](branching/changes.md) | - -## Operations +## Run The Server | Goal | Read | |---|---| | Deploy the binary or container | [deployment.md](deployment.md) | -| Use HTTP endpoints | [operations/server.md](operations/server.md) | -| Compact, repair, and clean old versions | [operations/maintenance.md](operations/maintenance.md) | -| Configure Cedar authorization | [operations/policy.md](operations/policy.md) | -| Track actors and audit behavior | [operations/audit.md](operations/audit.md) | -| Interpret errors and output formats | [operations/errors.md](operations/errors.md) | - -## Clusters - -| Goal | Read | -|---|---| -| Deploy and operate a cluster (how-to) | [clusters/index.md](clusters/index.md) | -| Reference every `cluster.yaml` key and command | [clusters/config.md](clusters/config.md) | - -## Concepts & reference - -| Goal | Read | -|---|---| -| Understand the model and L1/L2 framing | [concepts/index.md](concepts/index.md) | -| Understand graph layout and URI support | [concepts/storage.md](concepts/storage.md) | -| Look up constants and tunables | [reference/constants.md](reference/constants.md) | +| Use HTTP endpoints | [server.md](server.md) | +| Configure Cedar authorization | [policy.md](policy.md) | +| Track actors and audit behavior | [audit.md](audit.md) | ## Releases @@ -72,6 +46,7 @@ changes between versions, not for contributor design history. ## Boundary -User docs focus on stable behavior. If a paragraph needs to explain internal -sidecars, Lance API blockers, or test strategy, it probably belongs in -[docs/dev/index.md](../dev/index.md) or a developer-area document instead. +User docs should focus on stable behavior. 
If a paragraph needs to explain +internal sidecars, Lance API blockers, MR numbers, test strategy, or review +rules, it probably belongs in [docs/dev/index.md](../dev/index.md) or a developer-area document +instead. diff --git a/docs/user/indexes.md b/docs/user/indexes.md new file mode 100644 index 0000000..ce6c728 --- /dev/null +++ b/docs/user/indexes.md @@ -0,0 +1,26 @@ +# Indexes + +## L1 — Lance index types OmniGraph exposes + +| Index | Use | Notes | +|---|---|---| +| **BTREE scalar** | range / equality on any scalar | created on `@key`, `@index(...)`, and on key columns by `ensure_indices()` | +| **Inverted (FTS)** | `search`, `fuzzy`, `match_text`, `bm25` | created on text columns referenced by FTS queries | +| **Vector** | `nearest()` k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as FixedSizeList(Float32, dim) | + +## L2 — OmniGraph orchestration + +- `ensure_indices()` / `ensure_indices_on(branch)` — idempotent build of BTREE + inverted indexes for the current head; safe to re-run. +- Indexes are built on the *branch head* (not on a snapshot), so reads always see the current index state. +- **Lazy branch forking for indexes**: a branch that hasn't mutated a sub-table doesn't need its own index — the main lineage's index is reused until the first write triggers a copy-on-write fork. +- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column. + +## L2 — Graph topology index (`graph_index/mod.rs`) + +This is OmniGraph-specific (not Lance): + +- `TypeIndex`: dense `u32 ↔ String id` mapping per node type. +- `CsrIndex`: Compressed Sparse Row representation of edges per edge type — `offsets[i]..offsets[i+1]` slices into `targets`. +- `GraphIndex { type_indices, csr (out), csc (in) }` — built on demand from a snapshot's edge tables. 
+- Cached in `RuntimeCache::graph_indices` (LRU, max 8 entries, keyed by snapshot id + edge table versions). +- Built only when an `Expand` or `AntiJoin` IR op is present in the lowered query, so pure scans skip it. diff --git a/docs/user/install.md b/docs/user/install.md index 4a11372..ea9fb8c 100644 --- a/docs/user/install.md +++ b/docs/user/install.md @@ -2,29 +2,16 @@ ## Quick Install -macOS / Linux: - ```bash curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | bash ``` -Windows PowerShell: - -```powershell -powershell -NoProfile -ExecutionPolicy Bypass -Command "iwr -UseBasicParsing https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.ps1 | iex" -``` - By default the installer places: - `omnigraph` - `omnigraph-server` -in `~/.local/bin` on macOS / Linux, or: - -- `omnigraph.exe` -- `omnigraph-server.exe` - -in `%USERPROFILE%\.local\bin` on Windows. +in `~/.local/bin`. The default installer is binary-only. It downloads a published release asset, verifies the SHA256 checksum, and unpacks it. It does not build from source. 
@@ -52,13 +39,6 @@ Rolling edge binaries from `main`: curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | RELEASE_CHANNEL=edge bash ``` -Windows rolling edge binaries: - -```powershell -iwr -UseBasicParsing https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.ps1 -OutFile install.ps1 -powershell -NoProfile -ExecutionPolicy Bypass -File .\install.ps1 -ReleaseChannel edge -``` - Install from source: ```bash @@ -73,24 +53,12 @@ Install to a different directory: curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | INSTALL_DIR="$HOME/bin" bash ``` -Windows: - -```powershell -powershell -NoProfile -ExecutionPolicy Bypass -File .\install.ps1 -InstallDir "$env:USERPROFILE\bin" -``` - Install a specific tag: ```bash curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | VERSION=v0.1.0 bash ``` -Windows: - -```powershell -powershell -NoProfile -ExecutionPolicy Bypass -File .\install.ps1 -Version v0.1.0 -``` - Build from a specific git ref: ```bash @@ -99,53 +67,27 @@ curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/ ## Manual Source Build -macOS / Linux: - ```bash cargo build --release --locked -p omnigraph-cli -p omnigraph-server install -m 0755 target/release/omnigraph ~/.local/bin/omnigraph install -m 0755 target/release/omnigraph-server ~/.local/bin/omnigraph-server ``` -Windows: - -```powershell -cargo build --release --locked -p omnigraph-cli -p omnigraph-server -New-Item -ItemType Directory -Force "$env:USERPROFILE\.local\bin" | Out-Null -Copy-Item target\release\omnigraph.exe "$env:USERPROFILE\.local\bin\omnigraph.exe" -Copy-Item target\release\omnigraph-server.exe "$env:USERPROFILE\.local\bin\omnigraph-server.exe" -``` - ## Release Assets Tagged releases are expected to publish: - `omnigraph-linux-x86_64.tar.gz` - `omnigraph-macos-arm64.tar.gz` -- `omnigraph-windows-x86_64.zip` -The macOS / 
Linux archives contain both binaries: +Each archive contains both binaries: - `omnigraph` - `omnigraph-server` -The Windows archive contains: - -- `omnigraph.exe` -- `omnigraph-server.exe` - ## Verify The Install -macOS / Linux: - ```bash omnigraph version omnigraph-server --help ``` - -Windows: - -```powershell -omnigraph.exe version -omnigraph-server.exe --help -``` diff --git a/docs/user/maintenance.md b/docs/user/maintenance.md new file mode 100644 index 0000000..08ae8da --- /dev/null +++ b/docs/user/maintenance.md @@ -0,0 +1,29 @@ +# Maintenance: Optimize & Cleanup + +`db/omnigraph/optimize.rs`. + +## `optimize_all_tables(db)` — non-destructive + +- Lance `compact_files()` on every node + edge table on `main`. +- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests. +- Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8). +- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }]`. + +## `cleanup_all_tables(db, options)` — destructive + +- Lance `cleanup_old_versions()` per table. +- Removes manifests (and their unique fragments) older than the retention policy. +- `CleanupPolicyOptions { keep_versions: Option<u32>, older_than: Option<Duration> }` — at least one is required. +- Returns `[TableCleanupStats { table_key, bytes_removed, old_versions_removed }]`. +- CLI guards with `--confirm`; without it, prints a preview line. +- **Recovery floor:** `--keep < 3` may garbage-collect Lance versions that the open-time recovery sweep needs as a rollback target (the sweep restores to the branch's manifest-pinned table version, which is HEAD-1 in the typical Phase B → Phase C drift case). Default `--keep 10` is safe. + +## Tombstones + +Logical sub-table delete markers in `__manifest`; `tombstone_object_id(table_key, version)` excludes a sub-table version from snapshot reconstruction.
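The two cleanup rules stated above (at least one retention option is required, and keeping fewer than 3 versions risks the recovery rollback target) can be sketched as a pre-flight check. This is a minimal illustration under assumptions: the struct below only mirrors the option shape named in the text, and `check_policy` with its messages is a hypothetical helper, not part of OmniGraph's API.

```rust
use std::time::Duration;

/// Illustrative copy of the option shape named in the text.
#[derive(Debug, Clone, Default)]
struct CleanupPolicyOptions {
    keep_versions: Option<u32>,
    older_than: Option<Duration>,
}

/// Hypothetical pre-flight check: reject an empty policy outright and
/// return warnings for settings that are allowed but risky.
fn check_policy(opts: &CleanupPolicyOptions) -> Result<Vec<String>, String> {
    // At least one of the two retention options must be set.
    if opts.keep_versions.is_none() && opts.older_than.is_none() {
        return Err("cleanup needs keep_versions, older_than, or both".to_string());
    }
    let mut warnings = Vec::new();
    // Below 3 kept versions, the recovery sweep's rollback target may be gone.
    if let Some(keep) = opts.keep_versions {
        if keep < 3 {
            warnings.push(format!(
                "keep_versions={keep} is below the recovery floor of 3"
            ));
        }
    }
    Ok(warnings)
}
```

Separating the hard error from the warning matches the text: an empty policy is invalid, while a low `--keep` is permitted but may remove versions that recovery needs.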
+ +## Internal schema migrations (`db/manifest/migrations.rs`) + +Version evolutions of the on-disk `__manifest` shape are reconciled automatically on the first write under a new binary. `INTERNAL_MANIFEST_SCHEMA_VERSION` declares the shape the binary expects; the on-disk stamp `omnigraph:internal_schema_version` (Lance schema-level metadata) records the on-disk shape. The publisher's open-for-write path calls `migrate_internal_schema` before reading state; reads are side-effect-free. No operator action is required for in-place upgrades. See [storage.md → Internal schema versioning](storage.md) for the full mechanism. + +A binary opening a manifest stamped at a version *higher* than it knows about refuses to publish with a clear "upgrade omnigraph first" error — old binaries cannot clobber a newer schema. diff --git a/docs/user/mutations/index.md b/docs/user/mutations/index.md deleted file mode 100644 index 2602ae5..0000000 --- a/docs/user/mutations/index.md +++ /dev/null @@ -1,52 +0,0 @@ -# Mutations - -Write statements live inside a `query` declaration whose body is one or more -mutation statements (the [query language](../queries/index.md) covers the read -shape and shared declaration syntax). - -``` -query onboard($name: String, $title: String) { - insert Person { name: $name, title: $title } -} -``` - -An edge type is inserted the same way — its endpoint columns are just -properties in the assignment block (`insert WorksAt { person: $p, org: $o }`). - -## Statements - -- `insert <Type> { prop: <value>, … }` -- `update <Type> set { prop: <value>, … } where <prop> <op> <value>` -- `delete <Type> where <prop> <op> <value>` - -`<value>` is a literal, `$param`, or `now()`. - -## Atomicity - -A change query publishes **one commit** at the end of the query. Multiple -insert/update statements accumulate in memory and commit together — a mid-query -failure leaves the graph untouched. 
See [transactions](../branching/transactions.md) -for the per-query atomicity contract and [branches](../branching/index.md) for -multi-query workflows. - -## Inserts/updates and deletes cannot mix in one query - -A single change query must be **either insert/update-only or delete-only**. -Mixing the two is rejected at parse time, before any I/O: - -> `mutation '<name>' on the same query mixes inserts/updates and deletes; split -> into separate mutations: (1) inserts and updates, then (2) deletes.` - -Run two separate queries instead — the inserts/updates first, then the deletes. -The restriction exists because inserts/updates and deletes commit through -different paths today, and mixing them in one query creates ordering hazards -(e.g. a same-row insert-then-delete, or a cascading delete of a just-inserted -edge). Keeping the two kinds in separate queries keeps each one atomic and -correct. - -## Bulk loading - -For loading data from files rather than inline statements, use -[`omnigraph load`](../cli/index.md) (`--mode overwrite|append|merge`) — it is the -single bulk-write command and applies the same schema validation and atomic -publish as inline mutations. diff --git a/docs/user/operations/audit.md b/docs/user/operations/audit.md deleted file mode 100644 index 7e8b24d..0000000 --- a/docs/user/operations/audit.md +++ /dev/null @@ -1,46 +0,0 @@ -# Audit & Actor Tracking - -Every write in OmniGraph records **who made it**. The actor id is persisted on the -graph commit, so the commit history is an audit trail of which actor changed the -graph and when. - -## Where the actor comes from - -The actor is resolved differently depending on the front end, but it always lands -on the commit: - -- **HTTP server** — the actor is resolved **server-side from the bearer token**. A - client cannot set its own actor id; it is derived from the authenticated token. - See [policy](policy.md) for how tokens map to actors. 
-- **CLI / embedded** — the actor is self-declared through one resolution chain: - - 1. `--as <actor>` on the command, - 2. then `operator.actor` in `~/.omnigraph/config.yaml` (see the - [CLI reference](../cli/reference.md)), - 3. otherwise none. - -This difference is intentional: storage credentials imply a self-declared actor, -while a server resolves the actor from a token it trusts. - -## Reading the audit trail - -Actor ids are stored on each commit in the [commit graph](../branching/index.md). -List commits to see who made each change: - -```bash -omnigraph commit list graph.omni -``` - -System-initiated writes use reserved actor ids — for example, automatic recovery -of an interrupted write records `omnigraph:recovery`, so operator changes and -machine repairs are distinguishable in the history: - -```bash -omnigraph commit list --filter actor=omnigraph:recovery graph.omni -``` - -## What is tracked - -Every successful publish — load, change, branch merge, and schema apply — appends a -commit carrying the resolving actor. Because publishes are atomic, the actor on a -commit is exactly the actor responsible for that whole change. diff --git a/docs/user/operations/maintenance.md b/docs/user/operations/maintenance.md deleted file mode 100644 index d8df950..0000000 --- a/docs/user/operations/maintenance.md +++ /dev/null @@ -1,52 +0,0 @@ -# Maintenance: Optimize, Repair & Cleanup - -**Addressing.** `optimize`, `repair`, and `cleanup` are **direct** (storage-native) CLI commands: they run with direct storage access against a positional `file://`/`s3://` URI or **`--cluster <dir|s3://…> --graph <id>`** (which resolves the graph's storage URI from the served cluster state, so you needn't know the `<storage>/graphs/<id>.omni` layout). They never run through a server, and reject `--server` or a remote (`http(s)://`) URI with a declared error. 
There are no server routes for them by design; to maintain a server-backed graph, run them out-of-band against the graph's storage URI. See the *Command capabilities* section of [cli-reference.md](../cli/reference.md). - -## `optimize`: non-destructive - -- Compacts every node + edge table on `main`, then reindexes them, then **publishes the resulting version to the `__manifest`** so the manifest's recorded version tracks the compacted-and-reindexed state. Reads pin the manifest version, so without this publish the work would be invisible to readers *and* would break the version precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually changed. -- Rewrites small fragments into fewer large ones; old fragments remain reachable via older versions until `cleanup` runs. -- **Also compacts the internal system tables** `__manifest`, `_graph_commits`, and `_graph_commit_actors` (RFC-013 step 2), which accumulate one fragment per commit (the actor table only on the authenticated write path, where every commit carries an actor) and otherwise make every write's metadata scan grow with history. These take a simpler path than data tables: they are not `__manifest`-tracked (readers open them at their latest version), so compaction just advances their version in place: **no manifest publish and no recovery sidecar**. (The sidecar-free property is not because it is one commit, since `compact_files` can emit a `ReserveFragments` commit before the `Rewrite`, and the auto-cleanup strip below is a further commit, but because every one of those commits is content-preserving and the table is read at its latest version, so a crash at any point leaves it readable and content-identical and the next `optimize` re-plans.) 
They appear in the returned stats under `table_key` `"__manifest"` / `"_graph_commits"` / `"_graph_commit_actors"` (the latter two only when present). They are **not yet covered by `cleanup`**, so their version chain still grows until the cleanup half lands (it requires a cleanup-resurrection safeguard first); run `optimize` on a cadence to keep per-write metadata scans flat. -- **`optimize` is non-destructive by construction: it never garbage-collects versions, on any table (data or internal).** Compaction rewrites fragments and advances the version; old versions stay reachable until you run `cleanup`. This holds even for a graph created by an older binary that stored an on-by-default Lance `auto_cleanup` hook: `compact_files` / `optimize_indices` commit with the hook enabled and expose no skip override, so before compacting **any** table `optimize` strips its stale `lance.auto_cleanup.*` config first, so Lance's commit-time GC hook cannot fire and silently prune `__manifest`-pinned versions. (Graphs created by current binaries store no such config; the strip is the upgrade-path safety net.) The internal-table path additionally tolerates a concurrent live writer: it runs a **bounded** rebase-and-retry, so transient contention does not fail the operator's `optimize` or the live write, but sustained contention past the retry budget surfaces a loud conflict error rather than looping forever (bounded and observable, not a silent give-up). The data-table path holds the per-table write queue while it compacts, so it does not contend with mutations on that table in the first place. -- **Reindex (index coverage maintenance).** A scalar/FTS/vector index only covers the fragments it was built over. Rows appended after the index was built (e.g. by `load --mode merge`, whose commit does not rebuild an already-existing index) are scanned unindexed, and compaction itself rewrites fragments out of an index's coverage. 
`optimize` runs Lance's incremental `optimize_indices` after compaction to fold those fragments back in (a delta merge, not a full retrain), restoring full coverage so equality/range/traversal predicates stay index-accelerated. This is why a table with **no compaction work but stale index coverage still commits** a new version under `optimize`. Run `optimize` on a cadence at least as frequent as your freshness window so recently-loaded rows do not linger in the unindexed flat-scan tail. -- **Create declared-but-missing indexes (the index reconciler).** `@index`/`@key` declares intent; `schema apply` records it but builds nothing, and `load`/`mutate` defer a column that cannot be built yet (a `Vector` column with no trainable vectors). `optimize` materializes any such declared-but-unbuilt index over the compacted layout, so it is the convergence path for an `@index` added after data exists, or a vector index whose embeddings arrived via a later `embed`. A column still not buildable (no vectors yet) is reported on the table's stat as `pending_indexes` (visible in `--json`), not treated as a failure; the next `optimize` retries. So `optimize` is the single operator-facing index reconciler: it compacts, restores coverage, **and** builds declared-but-missing indexes. -- Each table's compact→reindex→publish serializes with concurrent mutations on the same table. A crash mid-operation is recovered automatically on the next open (both compaction and reindex are content-preserving, so roll-forward is always safe). -- **Requires a recovered graph.** `optimize` refuses (errors) when a pending crash-recovery operation is present: operating on an unrecovered graph could publish a partial write that recovery would roll back. Reopen the graph to run recovery, then re-run `optimize`. 
-- **Uncovered drift is skipped, not interpreted.** If a table's underlying version is ahead of the version recorded in `__manifest` and no crash-recovery record covers that movement, `optimize` reports `skipped: DriftNeedsRepair` with the manifest/head versions and leaves the table untouched. Run `omnigraph repair` to classify and explicitly publish that drift. -- Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8). -- Returns per-table stats: `table_key, fragments_removed, fragments_added, committed, skipped, manifest_version, lance_head_version, pending_indexes` (the last lists any declared `@index` column the reconciler could not build this run, with the reason, e.g. a vector column with no trainable vectors yet). -- **Blob tables are skipped.** A table that declares any `Blob` property is not compacted: it is reported with `skipped: BlobColumnsUnsupportedByLance` (and logged) instead of compacted, and the rest of the sweep proceeds normally. **Reads and writes are unaffected**; only compaction is. Consequence: fragment count and deleted-row space on blob tables are not reclaimed; query results are never affected. A skipped blob table is also **not reindexed** in the same sweep (the skip happens before the reindex step), so its index coverage on appended rows is not refreshed by `optimize` today. - -## `repair`: explicit - -- Handles **uncovered manifest/head drift**: a table's underlying version is ahead of the manifest pin and no crash-recovery record explains the movement. -- Preview by default. `omnigraph repair --json <uri>` reports each table's `classification`, `action`, manifest/head versions, underlying operation names, and any classification error. `--confirm` publishes only verified maintenance drift; if any suspicious or unverifiable table is refused, the CLI prints the per-table output and exits non-zero. `--force --confirm` also publishes suspicious or unverifiable drift after operator review. 
-- Classifies drift by reading the table's transaction history from `manifest_version + 1` through the current head. Only fragment-reservation and rewrite (compaction) operations are verified maintenance. Semantic operations such as append, delete, update, merge, or missing transaction history are not auto-healed. -- Publishes repair by advancing `__manifest` to the existing head; it does **not** rewrite data. If the publish succeeds, normal reads and strict writes use the repaired version. If it fails, no new data-side partial state was created. -- Requires a clean recovery state. A pending crash-recovery operation still belongs to automatic recovery, not manual repair. - -## `cleanup`: destructive - -- Garbage-collects old versions per table. -- Removes versions (and their unique fragments) older than the retention policy. -- Policy options `keep_versions` and `older_than`; at least one is required. -- Returns per-table stats: `table_key, bytes_removed, old_versions_removed, error`. -- **Fault-isolated per table.** A single table's transient failure (version GC or - orphan reclaim) is recorded on that table's stats row (with an `error`) and logged, - and never aborts the healthy tables: cleanup is the convergence - backstop, so it does as much as it can and converges on re-run. The CLI reports - any failed tables; rerun `cleanup` to retry them. -- CLI guards with `--confirm`; without it, prints a preview line. -- **Non-local consent.** Against a non-local target (an `s3://` store/cluster), `cleanup` additionally requires `--yes` on top of `--confirm`: a TTY is prompted, and a non-interactive run (no TTY, or `--json`) refuses rather than destroying. A local (`file://`) target needs only `--confirm`. The same `--yes` gate applies to overwrite `load` and `branch delete`; every maintenance run echoes its resolved target to stderr (suppress with `--quiet`). 
-- **Recovery floor:** `--keep < 3` may garbage-collect versions that crash recovery needs as a rollback target. Default `--keep 10` is safe. -- **Orphaned-branch reconciliation:** before the version GC, cleanup reclaims any per-table or commit-graph branch absent from the manifest branch list. These orphans arise when a `branch_delete` flips the manifest authority but a downstream best-effort reclaim does not complete (see [branches-commits.md](../branching/index.md)). The reconciler is idempotent (it no-ops once nothing is orphaned), runs regardless of the `keep_versions` / `older_than` values (those gate version GC only), and never reclaims `main` or system-branch forks. Reclaimed forks are logged. - -## Tombstones - -Logical sub-table delete markers in `__manifest` that exclude a sub-table version from snapshot reconstruction. - -## Internal schema migrations - -Version evolutions of the on-disk `__manifest` shape are reconciled automatically on the first write under a new binary. An on-disk stamp records the shape; the binary migrates it forward before reading state, and reads are side-effect-free. No operator action is required for in-place upgrades. See [storage.md → Internal schema versioning](../concepts/storage.md) for the full mechanism. - -A binary opening a manifest stamped at a version *higher* than it knows about refuses to publish with a clear "upgrade omnigraph first" error: old binaries cannot clobber a newer schema. diff --git a/docs/user/operations/policy.md b/docs/user/operations/policy.md deleted file mode 100644 index 54fbea5..0000000 --- a/docs/user/operations/policy.md +++ /dev/null @@ -1,155 +0,0 @@ -# Authorization (Cedar policy) - -OmniGraph integrates AWS Cedar (`cedar-policy = 4.9`) for ABAC. - -## Policy actions - -Per-graph actions (bind to `Omnigraph::Graph::"<graph_id>"`): - -1. `read`: query / snapshot / list branches & commits -2. `export`: NDJSON export -3. `change`: mutations -4. 
`schema_apply`: apply schema migrations -5. `branch_create` -6. `branch_delete` -7. `branch_merge` -8. `admin`: reserved for policy-management surfaces (hot reload, audit log, approvals). No call site today. -9. `invoke_query`: gates invoking a server-side stored query (the `queries:` registry). Graph-scoped (like `admin`): per-branch access is enforced by the inner `read` / `change` gate, so a rule that sets `branch_scope` on `invoke_query` is rejected. Coarse in this release: an `invoke_query` allow rule permits any stored query on the graph; a future, additive refinement adds an optional per-query-name scope without changing rules written against the coarse action. Enforced at `POST /queries/{name}` (see [server](server.md)). A stored *mutation* is double-gated: `invoke_query` to reach the tool, plus `change` for the write itself (the engine `_as` writers still enforce per the query body). - -Server-scoped action (v0.6.0+; binds to `Omnigraph::Server::"root"`): - -10. `graph_list`: `GET /graphs` registry enumeration (multi-graph mode) - -Server-scoped actions cannot use `branch_scope` or `target_branch_scope`; they operate on the registry, not on a graph's branches. A rule cannot mix server-scoped and per-graph actions; split into separate rules. (Runtime `graph_create` / `graph_delete` over HTTP are reserved but not shipped; operators add/remove graphs by editing the cluster's `cluster.yaml`, running `omnigraph cluster apply`, and restarting the server.) - -## Scope kinds - -- `branch_scope`: applied to source branch (`read`, `export`, `change`) -- `target_branch_scope`: applied to destination (`schema_apply`, branch ops, run ops) -- `protected_branches`: named list with special rules; rule scopes are `any | protected | unprotected` - -## Per-graph vs. server-level policy - -A server boots from a cluster (`--cluster <dir>`), and the cluster's -`cluster.yaml` declares its policy bundles in a `policies:` section. 
Each bundle -names the scopes it `applies_to`: a graph id (per-graph rules: `read`, `change`, -`branch_*`, `schema_apply`) or the literal `cluster` (server-level rules: -`graph_list`). - -```yaml -# cluster.yaml -policies: - base: - file: base.policy.yaml - applies_to: [cluster, knowledge] # cluster-level + the `knowledge` graph - alpha: - file: policies/alpha.yaml - applies_to: [alpha] # per-graph: alpha only -``` - -A graph with no bundle bound to it has no engine-layer Cedar enforcement. Each -graph's HTTP request flows through its bound bundle; the management endpoint -(`GET /graphs`) flows through the `cluster`-scoped bundle. When no bundle binds -`cluster`, `GET /graphs` is denied in every runtime state, including -`--unauthenticated`; with bearer tokens configured it returns 403 after admission -control because `graph_list` is not a `read`-equivalent action. The operator must -bind a `cluster`-scoped bundle granting `graph_list` to expose `/graphs`. - -Example `cluster`-scoped bundle: - -```yaml -version: 1 -groups: - admins: [act-andrew] -rules: - - id: admins-can-list-graphs - allow: - actors: { group: admins } - actions: [graph_list] -``` - -Each per-graph rule may use at most one of `branch_scope` or -`target_branch_scope`. Server-scoped rules (`graph_list`) take neither; they -have no branch context. - -## Actor for direct-engine writes - -The default actor identity for CLI direct-engine (`--store`) writes is -`operator.actor` in `~/.omnigraph/config.yaml`. Override per-invocation with -`--as <ACTOR>`: `--as` wins, otherwise `operator.actor`, otherwise no actor. -Remote HTTP writes ignore both; they resolve their actor server-side from the -bearer token. (Direct-store access carries no Cedar policy; policy -lives in the cluster/server.) - -## CLI - -Policy tooling reads a cluster's applied policy bundles: pass `--cluster <dir>`, -and `--graph <id>` to pick a graph's bundle when several apply. 
- -- `omnigraph policy validate`: parse + count actors, exit 1 on parse error. -- `omnigraph policy test --tests <file>`: run the declarative cases in `<file>` against the selected bundle, exit 1 on any expectation mismatch. -- `omnigraph policy explain --actor … --action … [--branch …] [--target-branch …]`: show decision and matched rule. -- `omnigraph --as <ACTOR> <subcommand>`: set the actor for the duration of one invocation. Effective for `change`, `load` (and its deprecated `ingest` alias), `branch create|delete|merge`, and `schema apply` against a direct (`--store`) graph. **Rejected** on a served write (`--server`): the actor is bearer-token-resolved server-side, so `--as` can't set it there. - -## Enforcement - -Policy is a property of the **engine**, not the transport. Every mutating -write (`mutate_as`, `load_as` (the deprecated `ingest_as` shims route -through it), `apply_schema_as`, -`branch_create_as`, `branch_create_from_as`, `branch_delete_as`, -`branch_merge_as`) consults the policy gate at the head of the method. -The gate fires identically whether the call -originates from the HTTP server, the CLI, or an embedded SDK consumer. -When no policy is installed (the dev/embedded default) the gate -is a strict no-op; when one is installed and the call site forgets to -thread an actor through, the gate fails closed rather than silently -bypassing. - -## Server runtime states - -The HTTP server classifies its startup configuration into one of three -states based on whether bearer tokens are configured and whether a -policy file is set. The state determines what happens to a request that -reaches the authorization gate without a matching policy permit. - -| State | Tokens | Policy file | Behavior | -|---|---|---|---| -| **Open** | no | no | Every request is permitted. Refuses to start unless `--unauthenticated` or `OMNIGRAPH_UNAUTHENTICATED=1` is set; the operator must explicitly opt in. 
| -| **DefaultDeny** | yes | no | Every authenticated request for an action other than `read` is rejected with HTTP 403. Closes the "tokens but forgot the policy file" trap: an operator who sets up auth and forgot to point at a policy file used to ship the illusion of protection. | -| **PolicyEnabled** | yes | yes | Authenticated requests that reach a configured policy engine are evaluated by Cedar. Server-scoped actions still require a `cluster`-scoped policy bundle. | - -The server refuses to start for the "no tokens, no policy, no flag" cell -and for "policy file, no tokens", instead of silently shipping an open -instance or a policy-protected server that can only 401. - -Server-side, request authorization still runs at the HTTP boundary: -that's where actor identity is resolved from the bearer token and where -admission control / per-actor rate limits live. Engine-layer enforcement -is the **defense in depth** layer: it catches CLI direct-engine writes, -embedded SDK consumers, and any future transport that hasn't (or won't) -re-implement the HTTP boundary's authorization. Both layers consult the same -Cedar policy, so decisions cannot disagree. - -## Coarse vs. fine enforcement - -There are two enforcement points, each with non-overlapping -responsibilities: - -| Layer | Question it answers | Where it fires | -|---|---|---| -| **Engine-layer (coarse)** | Can this actor invoke this action against this branch / branch-transition? | The policy gate at the head of every `_as` writer; one Cedar decision per call. | -| **Query-layer (fine)** | For the rows / types this action actually touches, which can the actor see or modify? | Per-row predicates pushed into the query plan. **Not yet implemented.** | - -The engine-layer gate keeps its resource scope deliberately at branch -granularity (graph, branch, target branch, branch transition). 
-Per-type and per-row authority is the query-layer's job; conflating them -in the engine-layer scope would create two places per-type policy could be -evaluated and a drift surface between them. - -## Actor identity (signed-claim-only) - -The actor identity used for every policy decision comes from the matched bearer token, never from a client-supplied request header, query parameter, or body field. The server resolves the token at the auth middleware boundary, looks up the actor it was minted for, and overwrites whatever the handler may have placed in the policy request. Clients cannot set `actor_id` directly. - -This is intentional. Trusting client-supplied identity for authorization is "asking the attacker if they're an admin"; Supabase's RLS history names the same footgun. The chokepoint lives at the server's auth boundary: a request with `Authorization: Bearer <token-for-actor-A>` plus `X-Actor-Id: actor-B` always evaluates as actor A, never as actor B. - -If you find yourself wanting to let clients override `actor_id` for impersonation, delegation, or service-account flows, that's a feature, but it needs explicit design (e.g., signed delegation claims, an `On-Behalf-Of` audit trail). It is not a convenience knob. diff --git a/docs/user/operations/server.md b/docs/user/operations/server.md deleted file mode 100644 index 3f6bcd0..0000000 --- a/docs/user/operations/server.md +++ /dev/null @@ -1,241 +0,0 @@ -# HTTP Server (`omnigraph-server`) - -Axum 0.8 + tokio + utoipa-generated OpenAPI. **Cluster-only boot**: the server always boots from a cluster (`--cluster <dir | s3://…>`) and serves N graphs (N ≥ 1) under cluster routes. There is no longer a single-graph flat-route mode, no positional `<URI>` boot, no `--target`, and no `omnigraph.yaml`-`graphs:`-map boot. All HTTP is nested under `/graphs/{graph_id}/...`; `/healthz` and the management `/graphs` enumeration stay flat. 
- -## Boot - -### Cluster boot (the only boot) - -```bash -omnigraph-server --cluster <dir | s3://…> --bind 0.0.0.0:8080 -``` - -`omnigraph-server --cluster <dir-or-uri>` boots from the cluster catalog's -**applied revision**. The server resolves that revision into per-graph -startup configs (id, URI, optional per-graph policy, stored-query -registry) plus an optional server-level policy, then opens every -configured graph in parallel at startup (bounded concurrency = 4, -quarantining graph-specific open failures). Routing is always multi-graph: -requests to bare flat protected paths (`/read`, `/snapshot`, …) return -404; the served surface is `/graphs/{graph_id}/...`. See -[cluster-config.md](../clusters/config.md#serving-from-the-cluster-the-mode-switch) -for what is read and the readiness rules. - -Readiness is fail-fast for cluster-global problems: missing or unreadable -state, invalid/unattributable recovery sidecars, unreadable shared catalog -payloads, cluster policy errors, or zero healthy graphs. Graph-attributed -pending recovery sidecars and graph-specific startup failures quarantine -that graph instead; the server logs startup diagnostics and serves the -remaining healthy graphs. `GET /graphs` enumerates ready/served graphs only, -so quarantined graphs are absent and their routes return 404. - -Operators who want the original all-or-nothing boot contract can pass -`--require-all-graphs` or set `OMNIGRAPH_REQUIRE_ALL_GRAPHS=1`. In that mode, -any graph quarantine, graph-open failure, stored-query startup failure, or -embedding-provider resolution failure aborts startup. - -A scheme-qualified argument (`s3://…`) reads the ledger straight from the -storage root, with no local config directory. `--bind`, -`--unauthenticated`, and the bearer-token env vars all apply. 
- -### Stored-query validation at startup - -If a graph declares a `queries:` registry (see [cli-reference](../cli/reference.md)), the server **loads and type-checks every stored query against that graph's live schema at startup**. Query parse/type failures quarantine that graph; if no graph remains healthy, startup refuses. Two MCP-exposed queries claiming the same tool name are likewise graph-local startup failures. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with `omnigraph queries validate`. Discover the stored queries as a typed tool catalog with `GET /queries`, and invoke one over HTTP with `POST /queries/{name}` (both below). - -## Endpoint inventory - -Per-graph endpoints, all nested under `/graphs/{id}/...`. `{id}` is the -graph id from the cluster's applied revision: - -| Method | Path | Auth | Action | -|---|---|---|---| -| GET | `/healthz` | none | n/a | -| GET | `/openapi.json` | none | n/a (strips security if auth disabled; emits the nested cluster paths with `cluster_` operation-id prefix) | -| GET | `/graphs/{id}/snapshot?branch=` | bearer + `read` | snapshot of branch | -| POST | `/graphs/{id}/query` | bearer + `read` | inline read query (canonical; clean field names `query`/`name`; mutations → 400) | -| POST | `/graphs/{id}/read` | bearer + `read` | **deprecated** alias of `/query` (legacy field names `query_source`/`query_name`, byte-stable response; carries `Deprecation: true` + `Link: <query>; rel="successor-version"`) | -| POST | `/graphs/{id}/export` | bearer + `export` | NDJSON stream | -| POST | `/graphs/{id}/mutate` | bearer + `change` | mutation (canonical; `query`/`name`; accepts legacy `query_source`/`query_name` as serde aliases) | -| POST | `/graphs/{id}/change` | bearer + `change` | **deprecated** alias of `/mutate` (carries `Deprecation: true` + `Link: <mutate>; rel="successor-version"`) | -| GET | `/graphs/{id}/queries` | 
bearer + `read` | list the graph's stored queries as a typed tool catalog | -| POST | `/graphs/{id}/queries/{name}` | bearer + `invoke_query` (+ `change` for a stored mutation) | invoke a named query from the `queries:` registry; deny == 404 | -| GET | `/graphs/{id}/schema` | bearer + `read` | get current `.pg` source | -| POST | `/graphs/{id}/schema/apply` | bearer + `schema_apply` (target=`main`) | disabled for cluster-backed serving; returns 409 and points operators at `omnigraph cluster apply` + restart | -| POST | `/graphs/{id}/load` | bearer + `branch_create` (only when `from` is set and the branch is created) + `change` | bulk load (canonical); branch creation is opt-in via `from`; without it a missing `branch` is a 404, never an implicit fork (32 MB body limit) | -| POST | `/graphs/{id}/ingest` | bearer + `branch_create` (only when `from` is set and the branch is created) + `change` | **deprecated** alias of `/load` (carries `Deprecation: true` + `Link: <load>; rel="successor-version"`) (32 MB body limit) | -| GET | `/graphs/{id}/branches` | bearer + `read` | list branches | -| POST | `/graphs/{id}/branches` | bearer + `branch_create` | create | -| DELETE | `/graphs/{id}/branches/{branch}` | bearer + `branch_delete` | delete | -| POST | `/graphs/{id}/branches/merge` | bearer + `branch_merge` | merge `source → target` | -| GET | `/graphs/{id}/commits?branch=` | bearer + `read` | list | -| GET | `/graphs/{id}/commits/{commit_id}` | bearer + `read` | show | - -Server-level management endpoints: - -| Method | Path | Auth | Action | -|---|---|---|---| -| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list ready/served graphs | - -> The per-graph subsections below name routes in shorthand (`GET /queries`, -> `POST /query`, `POST /mutate`, `POST /queries/{name}`); every one is served -> under the `/graphs/{id}/…` prefix shown in the table; only `/graphs` and -> `/healthz` are flat. 
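
The shorthand-to-served-path rule in the note above can be sketched as a small client-side helper. This is a minimal illustration, not part of the server or any SDK: `graph_url`, `FLAT_ROUTES`, and the base URL are hypothetical names, and treating `/openapi.json` as flat is an assumption drawn from the table listing it without a prefix.

```python
from urllib.parse import quote

# Routes served without a /graphs/{id} prefix. "/openapi.json" is an
# assumption based on the endpoint table; the note names only the other two.
FLAT_ROUTES = {"/healthz", "/graphs", "/openapi.json"}


def graph_url(base_url: str, graph_id: str, route: str) -> str:
    """Expand a shorthand route such as "/query" into its served path."""
    base = base_url.rstrip("/")
    if route in FLAT_ROUTES:
        return base + route
    # Percent-encode the graph id so it stays a single path segment.
    return f"{base}/graphs/{quote(graph_id, safe='')}{route}"


print(graph_url("http://localhost:8080", "knowledge", "/query"))
print(graph_url("http://localhost:8080", "knowledge", "/healthz"))
```

Keeping the flat routes in one set means a client built against the shorthand names only needs this one place updated if the served layout changes.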
- -### Stored-query catalog (`GET /queries`) - -List the graph's stored queries as a typed tool catalog, enough for a client (e.g. an MCP server) to register each as a tool without fetching `.gq` source. Each entry: `{ name, tool_name, description, instruction, mutation, params }`, where each param is `{ name, kind, item_kind?, vector_dim?, nullable }`. `kind` is one of `string | bool | int | bigint | float | date | datetime | blob | vector | list` (decomposed so a consumer maps it with a closed `switch`, never re-parsing GQ type spelling). `bigint` (I64/U64), `date`, `datetime`, and `blob` are carried as JSON **strings**: a 64-bit integer loses precision as a JSON number, dates are ISO strings, and a blob is a URI string. - -- **Read-gated** (works in default-deny mode). The catalog is **graph-wide** (branch-independent; `read` is authorized against `main`). -- **Every stored query in the applied registry is listed.** Cluster-served graphs have no per-query expose flag today; every query in the cluster `queries:` registry appears in the catalog. (Per-query exposure may become a Cedar-policy decision in a later release; see [cluster-config](../clusters/config.md).) -- **Not Cedar-filtered per query (yet).** A caller with `read` but not `invoke_query` can *list* a query they can't *invoke* (which would 404). Closing that gap is future per-query authorization; for now the catalog is a discovery surface and `invoke_query` remains the invocation gate. - -### Stored-query invocation (`POST /queries/{name}`) - -Invoke a curated, server-side stored query by **name**: the source comes from the graph's `queries:` registry, so the client never sends `.gq`. The request body itself is optional; omit it for no-param queries, or send `{ "params": { … }, "branch": "main", "snapshot": null }`, where every field is optional and `params` keys match the query's declared parameters. 
The response is the **read envelope** (`ReadOutput`) for a stored read or the **mutation envelope** (`ChangeOutput`) for a stored mutation, serialized untagged, so the wire shape is identical to `/query` / `/mutate`. - -- **Gate:** `invoke_query` (per-graph, graph-scoped) at the boundary. A stored *mutation* is **double-gated**: it also passes the engine's `change` gate, so an actor with `invoke_query` but not `change` gets `403`. -- **Deny == unknown, for callers without `invoke_query`:** for a caller lacking the grant, an `invoke_query` denial and an unknown query name return the **same `404`** (identical body), so the catalog can't be probed. A caller that *holds* `invoke_query` may still get the inner gate's `403` for an existing query it can't `read`/`change` (the double-gate, above), so existence is visible to grant-holders by design. -- **Requires an explicit policy grant when auth is on.** In default-deny mode (bearer tokens but no `policy.file`), only `read` is permitted, so *every* `/queries/{name}` call returns `404` until an `invoke_query` rule is configured. -- A stored mutation cannot target a `snapshot` (`400`); a parameter type error is a structured `400` naming the parameter. - -## Adding and removing graphs - -Runtime add/remove via API is **not** exposed: neither `POST /graphs` -nor `DELETE /graphs/{id}` is implemented. Operators add or remove graphs -by running `cluster apply` against the cluster (which publishes a new -applied revision) and restarting the server so it boots from the new -revision. The server treats the cluster source as operator-owned and -never writes it. - -A future release may introduce a managed registry and re-expose runtime -mutation on top of it. - -## Inline read queries (`POST /query`) - -`POST /query` is the read-only, agent-friendly twin of `POST /read`. 
The -request body uses clean field names that match the CLI `-e` flag and the GQ -`query` keyword: - -```json -{ - "query": "query find($n: String) { match { $p: Person { name: $n } } return { $p.name } }", - "name": "find", - "params": { "n": "Alice" }, - "branch": "main", - "snapshot": null -} -``` - -Response shape is identical to `/read` (`ReadOutput`). If the inline source -contains mutations (`insert` / `update` / `delete`), the request is rejected -with HTTP 400 and an error pointing the caller at `POST /mutate`; the -read-only contract is enforced at the URL. - -`POST /mutate` is the canonical mutation endpoint. It accepts the same clean -field names (`query`, `name`); the legacy field names `query_source` and -`query_name` continue to deserialize as serde aliases so existing clients keep -working without changes. - -## Deprecated names (`/read`, `/change`) - -`POST /read` and `POST /change` are kept for back-compat indefinitely; they -are byte-stable on the request side and otherwise behave identically to -`/query` / `/mutate`. They are flagged as deprecated through three independent -channels: - -- **OpenAPI**: the operations carry `deprecated: true` in `openapi.json`, so - every OpenAPI codegen (typescript-fetch, openapi-generator, oapi-codegen, - …) emits a `@deprecated` marker on the generated SDK method. -- **Response headers (RFC 9745)**: every response carries `Deprecation: true`. -- **Response headers (RFC 8288)**: every response carries a `Link` header - pointing at the canonical successor: - `Link: <query>; rel="successor-version"` for `/read`, and - `Link: <mutate>; rel="successor-version"` for `/change`. SDKs and HTTP - proxies can pick the successor up automatically. - -Migration is purely cosmetic on the client side: swap the URL path, leave -the request body and response handling alone. - -## Streaming - -Only `/export` streams (`application/x-ndjson`, MPSC channel + `Body::from_stream`). Everything else is buffered JSON. 
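
Because `/export` is the one streamed endpoint, a client has to reassemble NDJSON lines across chunk boundaries instead of parsing one buffered body. The following is a minimal sketch of that reassembly under stated assumptions: `iter_ndjson` is a hypothetical helper, the chunks stand in for whatever an HTTP client yields, and the sample rows are invented since the export row shape is not specified here.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed JSON object per newline-delimited line."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A chunk boundary can fall mid-line, so only consume complete lines.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    # Tolerate a final line that has no trailing newline.
    if buffer.strip():
        yield json.loads(buffer)


# Hypothetical rows, split awkwardly across chunks on purpose.
rows = list(iter_ndjson([b'{"id": 1}\n{"id"', b': 2}\n', b'{"id": 3}']))
print(rows)
```

Processing rows as they arrive keeps client memory flat for large exports, which is the point of streaming this endpoint rather than buffering it.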
- -## Error model - -Uniform `ErrorOutput { error, code?, merge_conflicts[], manifest_conflict? }` with `code ∈ unauthorized | forbidden | bad_request | not_found | conflict | too_many_requests | internal`. Merge conflicts attach structured `MergeConflictOutput { table_key, row_id?, kind, message }`. - -`manifest_conflict` is set on **concurrent-write rejections** (HTTP 409): the -caller's pre-write view of one table's manifest version was stale. -`ManifestConflictOutput { table_key, expected, actual }` tells the client -which table to refresh and retry. This is the conflict shape produced by -concurrent `/mutate` (or its `/change` alias), `/load` (or its deprecated -`/ingest` alias) calls landing the same `(table, branch)` race. - -HTTP status codes used: 200, 400, 401, 403, 404, 409, 429, 500. - -## Per-actor admission control - -Disjoint -`(table, branch)` writes from different actors now run concurrently, -guarded only by the engine's per-(table, branch) write queue. To keep -one heavy actor from exhausting shared capacity (Lance I/O, manifest -churn, network), the server gates mutating handlers through per-process -admission limits configured from environment variables: - -| Env var | Default | Purpose | -|---|---|---| -| `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX` | 16 | Concurrent in-flight mutations per actor | -| `OMNIGRAPH_PER_ACTOR_BYTES_MAX` | 4 GiB | In-flight estimated bytes per actor | - -When an actor exceeds its in-flight count or byte budget, the server -returns **HTTP 429 Too Many Requests** with `code: too_many_requests` -and a `Retry-After` header (seconds). The actor should back off; other -actors are unaffected. - -Cedar policy authorization runs **before** admission accounting so -denied requests don't consume admission slots. - -Today admission gates every mutating handler: `/mutate` (and its -deprecated alias `/change`), `/load` (and its deprecated alias `/ingest`), -`/branches/{create,delete,merge}`, -and `/schema/apply`. 
Read-only endpoints (`/snapshot`, `/query`, `/read`, -`/export`, `/branches` GET, `/commits`, `/schema` GET) are not -admission-gated. - -## Body limits - -- Default: 1 MB -- `/load` (and its deprecated `/ingest` alias): 32 MB - -## Auth model (`bearer + SHA-256`) - -- Tokens are SHA-256 hashed on startup; plaintext is never persisted in memory. -- Constant-time comparison. -- Three sources, in precedence: - 1. `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` — AWS Secrets Manager (build with `--features aws`) - 2. `OMNIGRAPH_SERVER_BEARER_TOKENS_FILE` or `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON` — JSON `{actor_id: token, …}` - 3. `OMNIGRAPH_SERVER_BEARER_TOKEN` — single legacy token, actor `default` -- If no tokens are configured, startup refuses unless `--unauthenticated` or - `OMNIGRAPH_UNAUTHENTICATED=1` explicitly opts into open local-dev mode. A - policy file without tokens is also rejected at startup. In open mode - `/openapi.json` strips the security scheme. - -See [deployment.md](../deployment.md) for token-source operational details. - -## Tracing & observability - -- `tower_http::TraceLayer::new_for_http()` -- Policy decisions logged at INFO level with actor, action, branch, decision, matched rule -- Startup logs: token source name, graph URI, bind address -- Graceful SIGINT shutdown - -## Not implemented (by design or "TBD") - -- CORS — not configured; add `tower_http::cors` if needed. -- Rate limiting — per-actor admission control gates `/mutate` (alias - `/change`), `/load` (alias `/ingest`), `/branches/{create,delete,merge}`, - `/schema/apply` (see "Per-actor - admission control" above). No global rate limiter is configured; - add `tower_http::limit` if a graph-wide cap is needed. -- Pagination — none (commits/branches return everything; export streams). -- Runtime graph add/remove — run `cluster apply` and restart. 
diff --git a/docs/user/policy.md b/docs/user/policy.md new file mode 100644 index 0000000..749d3be --- /dev/null +++ b/docs/user/policy.md @@ -0,0 +1,164 @@ +# Authorization (Cedar policy) + +OmniGraph integrates AWS Cedar (`cedar-policy = 4.9`) for ABAC. + +## Policy actions + +Per-graph actions (bind to `Omnigraph::Graph::"<graph_id>"`): + +1. `read` — query / snapshot / list branches & commits +2. `export` — NDJSON export +3. `change` — mutations +4. `schema_apply` — apply schema migrations +5. `branch_create` +6. `branch_delete` +7. `branch_merge` +8. `admin` — reserved for policy-management surfaces (hot reload, audit log, approvals). No call site today; see MR-724 for the reservation rationale. + +Server-scoped action (v0.6.0+; binds to `Omnigraph::Server::"root"`): + +9. `graph_list` — `GET /graphs` registry enumeration (multi-graph mode) + +Server-scoped actions cannot use `branch_scope` or `target_branch_scope` — they operate on the registry, not on a graph's branches. A rule cannot mix server-scoped and per-graph actions; split into separate rules. (Runtime `graph_create` / `graph_delete` are reserved but not shipped in v0.6.0; operators add/remove graphs by editing `omnigraph.yaml` and restarting.) + +## Scope kinds + +- `branch_scope` — applied to source branch (`read`, `export`, `change`) +- `target_branch_scope` — applied to destination (`schema_apply`, branch ops, run ops) +- `protected_branches` — named list with special rules; rule scopes are `any | protected | unprotected` + +## Per-graph vs. 
server-level policy (multi-graph mode) + +In multi mode (`omnigraph.yaml` with a non-empty `graphs:` map), policy files attach at two levels: + +```yaml +server: + policy: + file: ./server-policy.yaml # server-level: graph_list + +graphs: + alpha: + uri: s3://tenant-bucket/alpha + policy: + file: ./policies/alpha.yaml # per-graph: read, change, branch_*, schema_apply + beta: + uri: s3://tenant-bucket/beta + # no per-graph policy → no engine-layer Cedar enforcement on beta +``` + +Top-level `policy.file` is single-graph / CLI-local policy only. Multi-graph +server startup rejects it because applying one graph policy to every configured +graph is ambiguous. Move per-graph rules to `graphs.<graph_id>.policy.file` and +move `graph_list` rules to `server.policy.file`. + +Each graph's HTTP request flows through its own per-graph policy. The management endpoint (`GET /graphs`) flows through the server-level policy. When `server.policy.file` is unset, `GET /graphs` is denied in every runtime state, including `--unauthenticated`; with bearer tokens configured, it returns 403 after admission control because `graph_list` is not a `read`-equivalent action. The operator must explicitly authorize via `server-policy.yaml` to expose `/graphs`. + +Example server-level policy: + +```yaml +version: 1 +groups: + admins: [act-andrew] +rules: + - id: admins-can-list-graphs + allow: + actors: { group: admins } + actions: [graph_list] +``` + +## Configuration + +`omnigraph.yaml`: + +```yaml +policy: + file: ./policy.yaml # Cedar rules + groups + tests: ./policy.tests.yaml # declarative test cases + +cli: + actor: act-andrew # default actor for CLI direct-engine writes +``` + +Each per-graph rule may use at most one of `branch_scope` or `target_branch_scope`. Server-scoped rules (`graph_list`) take neither — they have no branch context. + +`cli.actor` is the default actor identity for CLI direct-engine writes +when `policy.file` is configured. 
Override per-invocation with `--as +<ACTOR>` (top-level flag) — `--as` wins, otherwise `cli.actor` is used, +otherwise no actor. With policy configured and neither set, the +engine-layer footgun guard intentionally denies the write (silent bypass +via "I forgot the actor" is exactly what the guard prevents). Remote +HTTP writes ignore both — they resolve their actor server-side from the +bearer token. + +## CLI + +- `omnigraph policy validate` — parse + count actors, exit 1 on parse error. +- `omnigraph policy test` — run cases in `policy.tests.yaml`, exit 1 on any expectation mismatch. +- `omnigraph policy explain --actor … --action … [--branch …] [--target-branch …]` — show decision and matched rule. +- `omnigraph --as <ACTOR> <subcommand>` — set the actor for the duration of one invocation. Effective for `change`, `load`, `ingest`, `branch create|delete|merge`, and `schema apply` against local URIs. No-op against remote HTTP URIs (actor is bearer-token-resolved server-side). + +## Enforcement + +Policy is a property of the **engine**, not the transport. Every mutating +write — `mutate_as`, `load_as`, `ingest_as`, `apply_schema_as`, +`branch_create_as`, `branch_create_from_as`, `branch_delete_as`, +`branch_merge_as` — calls `Omnigraph::enforce(action, scope, actor)` at +the head of the method. The gate fires identically whether the call +originates from the HTTP server, the CLI, or an embedded SDK consumer. +When no `PolicyChecker` is installed (the dev/embedded default) the gate +is a strict no-op; when one is installed and the call site forgets to +thread an actor through, the gate fails closed rather than silently +bypassing. + +## Server runtime states (MR-723) + +The HTTP server classifies its startup configuration into one of three +states based on whether bearer tokens are configured and whether a +policy file is set. The state determines what happens to a request that +reaches `authorize_request()` without a matching policy permit. 
+ +| State | Tokens | Policy file | Behavior | +|---|---|---|---| +| **Open** | no | no | Every request is permitted. Refuses to start unless `--unauthenticated` or `OMNIGRAPH_UNAUTHENTICATED=1` is set — the operator must explicitly opt in. | +| **DefaultDeny** | yes | no | Every authenticated request for an action other than `read` is rejected with HTTP 403. Closes the "tokens but forgot the policy file" trap — an operator who sets up auth and forgot to point at a policy file used to ship the illusion of protection. | +| **PolicyEnabled** | yes | yes | Authenticated requests that reach a configured policy engine are evaluated by Cedar. Server-scoped actions still require `server.policy.file`. | + +The classifier is `classify_server_runtime_state` in +`crates/omnigraph-server/src/lib.rs`; it returns `Err` for the "no +tokens, no policy, no flag" cell and for "policy file, no tokens" so the +server refuses to start instead of silently shipping an open instance or +a policy-protected server that can only 401. Tests pin every cell of the +matrix and the State-2 deny path. + +Server-side, `authorize_request()` still runs at the HTTP boundary — +that's where actor identity is resolved from the bearer token and where +admission control / per-actor rate limits live. Engine-layer enforcement +is the **defense in depth** layer: it catches CLI direct-engine writes, +embedded SDK consumers, and any future transport that hasn't (or won't) +re-implement HTTP's authorize_request. Both layers consult the same +Cedar policy via the same `PolicyChecker` trait, so decisions cannot +disagree. + +## Coarse vs. fine enforcement + +There are two enforcement points, each with non-overlapping +responsibilities: + +| Layer | Question it answers | Where it fires | +|---|---|---| +| **Engine-layer (coarse)** | Can this actor invoke this action against this branch / branch-transition? 
| `Omnigraph::enforce(action, scope, actor)` at the head of every `_as` writer; one Cedar decision per call. | +| **Query-layer (fine)** | For the rows / types this action actually touches, which can the actor see or modify? | Per-row predicates pushed into DataFusion at plan time. **Not yet implemented — see MR-725.** | + +The engine-layer gate keeps `ResourceScope` deliberately at branch +granularity (`Graph`, `Branch`, `TargetBranch`, `BranchTransition`). +Per-type and per-row authority is the query-layer's job; conflating them +in `ResourceScope` would create two places per-type policy could be +evaluated and a drift surface between them. + +## Actor identity (signed-claim-only) + +The actor identity used for every policy decision comes from the matched bearer token — never from a client-supplied request header, query parameter, or body field. The server resolves the token at the auth middleware boundary, looks up the actor it was minted for, and overwrites whatever the handler may have placed in the policy request. Clients cannot set `actor_id` directly. + +This is intentional. Trusting client-supplied identity for authorization is "asking the attacker if they're an admin" — Supabase's RLS history names the same footgun. The chokepoint lives in `authorize_request` in `crates/omnigraph-server/src/lib.rs` and is named in `docs/dev/invariants.md` Hard Invariant 11. A regression test asserts the contract: a request with `Authorization: Bearer <token-for-actor-A>` plus `X-Actor-Id: actor-B` always evaluates as actor A, never as actor B. + +If you find yourself wanting to let clients override `actor_id` for impersonation, delegation, or service-account flows — that's a feature, but it needs explicit design (e.g., signed delegation claims, an `On-Behalf-Of` audit trail). It is not a convenience knob. 
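
Editorial sketch (not part of the patch): the "Server runtime states" matrix in `docs/user/policy.md` above can be read as a small decision function. The Rust below is a minimal illustration under stated assumptions. The names `classify_sketch` and `permits_without_policy`, the three boolean inputs, and the error strings are all hypothetical; the real `classify_server_runtime_state` may take different inputs and return a different error type. How the opt-in flag interacts with configured tokens is not stated in the text, so the sketch simply ignores the flag in that case.

```rust
/// The three runtime states named in the matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RuntimeState {
    Open,
    DefaultDeny,
    PolicyEnabled,
}

/// Hypothetical classifier: maps (tokens, policy file, opt-in flag) to a state,
/// or to a startup error for the two cells the text says must refuse to start.
fn classify_sketch(
    has_tokens: bool,
    has_policy_file: bool,
    unauthenticated_flag: bool,
) -> Result<RuntimeState, &'static str> {
    match (has_tokens, has_policy_file) {
        (true, true) => Ok(RuntimeState::PolicyEnabled),
        (true, false) => Ok(RuntimeState::DefaultDeny),
        // Assumption: rejected regardless of the opt-in flag.
        (false, true) => Err("policy file configured without bearer tokens"),
        (false, false) if unauthenticated_flag => Ok(RuntimeState::Open),
        (false, false) => Err("no tokens, no policy, no explicit opt-in"),
    }
}

/// What happens to a request that has no matching policy permit.
/// `None` means "ask the policy engine"; the sketch does not model Cedar.
fn permits_without_policy(state: RuntimeState, action: &str) -> Option<bool> {
    match state {
        RuntimeState::Open => Some(true),
        RuntimeState::DefaultDeny => Some(action == "read"),
        RuntimeState::PolicyEnabled => None,
    }
}
```

For example, `classify_sketch(true, false, false)` yields `DefaultDeny`, under which `permits_without_policy(state, "change")` is `Some(false)`; that corresponds to the HTTP 403 row of the table.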
diff --git a/docs/user/queries/index.md b/docs/user/queries/index.md deleted file mode 100644 index c8a70c5..0000000 --- a/docs/user/queries/index.md +++ /dev/null @@ -1,65 +0,0 @@ -# Query Language (`.gq`) - -## Query declarations - -``` -query <name>($p1: T1, $p2: T2?, …) - @description("…") @instruction("…") { - … -} -``` - -Two body shapes: - -- **Read**: `match { … } return { … } [order { … }] [limit N]` — covered on this page. -- **Mutation**: one or more of `insert | update | delete` statements — see [mutations](../mutations/index.md). - -Multi-modal search functions (`nearest`, `bm25`, `rrf`, …) used inside `match`, -`return`, and `order` are documented on the [search](../search/index.md) page. - -Param types reuse all schema scalars; trailing `?` makes a param optional. The compiler reserves `$__nanograph_now` for `now()`. - -## MATCH clauses - -- **Binding**: `$x: NodeType { prop: <literal | $param | now()>, … }` -- **Traversal**: `$src EDGE_NAME { min, max? } $dst` — variable-length paths via hop bounds; default 1..1 if bounds omitted. -- **Filter**: `<expr> <op> <expr>` with operators `>=`, `<=`, `!=`, `>`, `<`, `=`, and string `contains`. -- **Negation**: `not { clause+ }` — desugars to anti-join over the inner pipeline. - -## RETURN clause - -`return { <expr> [as <alias>], … }` with expressions: - -- Variable / property access: `$x`, `$x.prop` -- Literals: string, int, float, bool, list -- `now()` -- Aggregates: `count`, `sum`, `avg`, `min`, `max` -- [Search functions](../search/index.md) (so you can return a score column) -- `AliasRef` — re-use a previous projection alias - -## ORDER & LIMIT - -- `order { <expr> [asc|desc], … }` — supports plain expressions and `nearest(...)`. -- `limit <integer>` — required when there is a `nearest(...)` ordering. 
-- **Total, deterministic order.** Rows with equal user-sort keys are broken by the bound entities' key columns (`<var>.id`, ascending) appended as a final tie-break, so the result is a *total* order — reproducible across runs, and `order … limit N` returns a deterministic top-N even when ties straddle the cutoff. (Aggregate results have no entity-key columns; their group rows are already distinct on the projected group keys.) -- **NULL placement** is *nulls-first ascending, nulls-last descending* (i.e. `nulls_first = !descending`): a NULL sorts as if smaller than any value. - -Write statements (`insert` / `update` / `delete`) are documented on the -[mutations](../mutations/index.md) page. - -## Traversal execution - -Variable-length traversals (`Expand`) are executed one of two ways, chosen per-expand by a cost model over cheap manifest counts (frontier size, edge count, source-vertex count, hops) plus index coverage: selective traversals (small frontier relative to the source set) resolve neighbors from the persisted `src`/`dst` BTREE (one indexed scan per hop); dense / deep / large-frontier traversals — or those whose BTREE coverage is degraded so a full scan would be paid per hop — use an in-memory CSR adjacency index. Both produce identical results. The `OMNIGRAPH_EXPAND_INDEXED_MAX_FRONTIER` / `OMNIGRAPH_EXPAND_INDEXED_MAX_HOPS` ceilings bound the *initial dispatch* frontier/hops (beyond them CSR is always used); the cost model estimates total indexed work as ~`hops × frontier × fanout` and prices dense fan-out toward CSR — they are not a hard per-hop bound. `OMNIGRAPH_TRAVERSAL_MODE=indexed|csr` forces a mode (see [constants](../reference/constants.md)). 
- -## Linting & validation - -Codes seen so far: - -- **Q000** (Error): parse error -- **L201** (Warning): nullable property never set by any UPDATE — "{type}.{prop} exists in schema but no update query sets it" -- (Warning): mutation declares no params — hardcoded mutations are easy to miss -- Plus all type errors from type checking (undefined types, mismatched operators, undefined edges, etc.) - -Lint output reports an overall status, per-query results (name, kind, status, any error and warnings), and structured findings (severity, code, message, and the type/property/query they apply to). - -CLI exits non-zero only on `status = Error`. diff --git a/docs/user/query-language.md b/docs/user/query-language.md new file mode 100644 index 0000000..94528af --- /dev/null +++ b/docs/user/query-language.md @@ -0,0 +1,111 @@ +# Query Language (`.gq`) + +Pest grammar at `crates/omnigraph-compiler/src/query/query.pest`. AST in `query/ast.rs`. Type checker in `query/typecheck.rs`. Lowering in `ir/lower.rs`. + +## Query declarations + +``` +query <name>($p1: T1, $p2: T2?, …) + @description("…") @instruction("…") { + … +} +``` + +Two body shapes: + +- **Read**: `match { … } return { … } [order { … }] [limit N]` +- **Mutation**: one or more of `insert | update | delete` statements + +Param types reuse all schema scalars; trailing `?` makes a param optional. The compiler reserves `$__nanograph_now` for `now()`. + +## MATCH clauses + +- **Binding**: `$x: NodeType { prop: <literal | $param | now()>, … }` +- **Traversal**: `$src EDGE_NAME { min, max? } $dst` — variable-length paths via hop bounds; default 1..1 if bounds omitted. +- **Filter**: `<expr> <op> <expr>` with operators `>=`, `<=`, `!=`, `>`, `<`, `=`, and string `contains`. +- **Negation**: `not { clause+ }` — desugars to anti-join over the inner pipeline. 
+ +## Search clauses (multi-modal) + +Used inside MATCH or as expressions inside RETURN/ORDER: + +| Function | Purpose | Underlying Lance facility | +|---|---|---| +| `nearest($x.vec, $q)` | k-NN vector search (cosine) | Lance vector index (IVF / HNSW) | +| `search(field, q)` | Generic FTS | Inverted index | +| `fuzzy(field, q [, max_edits])` | Levenshtein-tolerant text search | Inverted index | +| `match_text(field, q)` | Pattern match | Inverted index | +| `bm25(field, q)` | BM25 scoring | Inverted index | +| `rrf(rank_a, rank_b [, k])` | Reciprocal Rank Fusion of two rankings (default k=60) | OmniGraph fuses scored rankings | + +`nearest()` requires a `LIMIT`; the compiler resolves the query vector via the param map (or via the runtime embedding client when bound to a text input). + +## RETURN clause + +`return { <expr> [as <alias>], … }` with expressions: + +- Variable / property access: `$x`, `$x.prop` +- Literals: string, int, float, bool, list +- `now()` +- Aggregates: `count`, `sum`, `avg`, `min`, `max` +- All search functions above (so you can return a score column) +- `AliasRef` — re-use a previous projection alias + +## ORDER & LIMIT + +- `order { <expr> [asc|desc], … }` — supports plain expressions and `nearest(...)`. +- `limit <integer>` — required when there is a `nearest(...)` ordering. + +## Mutation statements + +- `insert <Type> { prop: <value>, … }` +- `update <Type> set { prop: <value>, … } where <prop> <op> <value>` +- `delete <Type> where <prop> <op> <value>` + +`<value>` is a literal, `$param`, or `now()`. Multi-statement mutations execute atomically (added in v0.2.0). + +### D₂ — mixed insert/update + delete is rejected at parse time + +A single mutation query must be **either insert/update-only or delete-only**. Mixed → rejected before any I/O with the message: + +> `mutation '<name>' on the same query mixes inserts/updates and deletes; split into separate mutations: (1) inserts and updates, then (2) deletes. 
This restriction lifts when Lance exposes a two-phase delete API (tracked: MR-793 / Lance-upstream).` + +Reason: under the staged-write rewire (MR-794), inserts and updates accumulate in memory and commit at end-of-query, while deletes still inline-commit (Lance 4.0.0 has no public two-phase delete). Mixing creates ordering hazards (same-row insert→delete becomes a no-op because the staged insert isn't visible to delete; cascading deletes of just-inserted edges break referential integrity by silent design). Until Lance exposes `DeleteJob::execute_uncommitted`, the parse-time rejection keeps both paths atomic and correct. See [docs/dev/runs.md](../dev/runs.md) and [docs/dev/invariants.md](../dev/invariants.md). + +## IR (Intermediate Representation) + +`QueryIR { name, params, pipeline: Vec<IROp>, return_exprs, order_by, limit }` + +Pipeline operations: + +- `NodeScan { variable, type_name, filters }` +- `Expand { src_var, dst_var, edge_type, direction (Out|In), dst_type, min_hops, max_hops, dst_filters }` — destination filters are pushed *into* the expand so Lance scalar pushdown can prune. +- `Filter { left, op, right }` +- `AntiJoin { outer_var, inner: Vec<IROp> }` — for `not { … }` + +Lowering: + +1. Partition MATCH clauses (bindings, traversals, filters, negations). +2. Identify "deferred" bindings (a destination of a traversal that has filters) so the Expand can carry the filter as a pushdown. +3. Emit NodeScan for the first binding, then Expand operations, then remaining Filter operations, then AntiJoins for negations. +4. Translate RETURN / ORDER expressions; preserve LIMIT. 
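
Editorial sketch (not part of the patch): the four lowering steps listed above for `docs/user/query-language.md` can be illustrated with a toy pipeline builder. Everything here is an assumption made for illustration: `Clause`, `IrOp`, and `lower_sketch` are hypothetical simplifications, filters are opaque strings, `direction`, `dst_type`, and `outer_var` are omitted, and bindings that are neither first nor a traversal destination are ignored. The real lowering lives in `ir/lower.rs` and is likely richer.

```rust
/// Simplified stand-ins for the pipeline operations listed above.
#[derive(Debug, Clone, PartialEq)]
enum IrOp {
    NodeScan { variable: String, type_name: String, filters: Vec<String> },
    Expand {
        src_var: String,
        dst_var: String,
        edge_type: String,
        min_hops: u32,
        max_hops: u32,
        dst_filters: Vec<String>,
    },
    Filter { expr: String },
    AntiJoin { inner: Vec<IrOp> },
}

/// Hypothetical parsed MATCH clauses.
#[derive(Debug, Clone, PartialEq)]
enum Clause {
    Binding { var: String, type_name: String, filters: Vec<String> },
    Traversal { src: String, edge: String, dst: String, hops: Option<(u32, u32)> },
    Filter(String),
    Not(Vec<Clause>),
}

fn lower_sketch(clauses: &[Clause]) -> Vec<IrOp> {
    let mut pipeline = Vec::new();

    // NodeScan for the first binding only.
    let first = clauses.iter().find_map(|c| match c {
        Clause::Binding { var, type_name, filters } => Some((var, type_name, filters)),
        _ => None,
    });
    if let Some((var, type_name, filters)) = first {
        pipeline.push(IrOp::NodeScan {
            variable: var.clone(),
            type_name: type_name.clone(),
            filters: filters.clone(),
        });
    }

    // Expands next. A "deferred" binding is a traversal destination with
    // filters; its filters ride along on the Expand as a pushdown.
    for c in clauses {
        if let Clause::Traversal { src, edge, dst, hops } = c {
            let dst_filters = clauses
                .iter()
                .find_map(|b| match b {
                    Clause::Binding { var, filters, .. } if var == dst => Some(filters.clone()),
                    _ => None,
                })
                .unwrap_or_default();
            // Hop bounds default to 1..1 when omitted.
            let (min_hops, max_hops) = (*hops).unwrap_or((1, 1));
            pipeline.push(IrOp::Expand {
                src_var: src.clone(),
                dst_var: dst.clone(),
                edge_type: edge.clone(),
                min_hops,
                max_hops,
                dst_filters,
            });
        }
    }

    // Remaining filters, then anti-joins for negations.
    for c in clauses {
        if let Clause::Filter(expr) = c {
            pipeline.push(IrOp::Filter { expr: expr.clone() });
        }
    }
    for c in clauses {
        if let Clause::Not(inner) = c {
            pipeline.push(IrOp::AntiJoin { inner: lower_sketch(inner) });
        }
    }
    pipeline
}
```

The point of the sketch is the emission order: whatever order the clauses appear in the source, the pipeline comes out as scan, expands, filters, anti-joins, and a filtered destination binding never becomes a separate scan.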
+ +## Linting & validation (`query/lint.rs`) + +Codes seen so far: + +- **Q000** (Error): parse error +- **L201** (Warning): nullable property never set by any UPDATE — "{type}.{prop} exists in schema but no update query sets it" +- (Warning): mutation declares no params — hardcoded mutations are easy to miss +- Plus all type errors from `typecheck_query_decl()` (undefined types, mismatched operators, undefined edges, etc.) + +Output: + +``` +QueryLintOutput { status, schema_source, query_path, + queries_processed, errors, warnings, infos, + results: [{ name, kind, status, error?, warnings[] }], + findings: [{ severity, code, message, type_name?, property?, query_names[] }] } +``` + +CLI exits non-zero only on `status = Error`. diff --git a/docs/user/quickstart.md b/docs/user/quickstart.md deleted file mode 100644 index ae98e7c..0000000 --- a/docs/user/quickstart.md +++ /dev/null @@ -1,84 +0,0 @@ -# Quickstart - -This walks the core loop end to end: define a schema, initialize a graph, load -data, query it, and use a branch. It uses a local file-backed graph; swap the -path for an `s3://…` URI to run the same flow against object storage. - -[Install](install.md) the `omnigraph` CLI first. - -## 1. Write a schema - -A schema (`.pg`) declares your node and edge types. Save this as `schema.pg`: - -``` -node Person { - name: String - title: String? -} -``` - -See the [schema language](schema/index.md) for types, constraints, and edges. - -## 2. Initialize the graph - -```bash -omnigraph init --schema schema.pg graph.omni -``` - -`init` creates an empty graph at the given URI with your schema applied. - -## 3. Load data - -`load` is the single bulk-write command. `--mode` is required -(`overwrite | append | merge`): - -```bash -omnigraph load --data people.jsonl --mode overwrite graph.omni -``` - -`people.jsonl` is newline-delimited JSON, one record per line. For finer-grained -or inline writes, see [mutations](mutations/index.md). - -## 4. 
Query - -Write a query (`.gq`) — save as `queries.gq`: - -```gq -query find_people($title: String) { - match { $p: Person { title: $title } } - return { $p.name } -} -``` - -Run it: - -```bash -omnigraph query find_people --query queries.gq \ - --params '{"title":"Engineer"}' --format table --store graph.omni -``` - -The query name is positional; `--query` points at the `.gq` source and -`--store` addresses the graph's storage directly. - -The [query language](queries/index.md) covers `match`/`return`/`order`, and -[search](search/index.md) covers vector and full-text search. - -## 5. Work on a branch - -Branches isolate changes until you merge them — Git-style, across the whole graph: - -```bash -omnigraph branch create review/new-hires graph.omni -omnigraph load --data new-hires.jsonl --mode append --branch review/new-hires graph.omni -# inspect the branch, then integrate it -omnigraph branch merge review/new-hires --into main graph.omni -``` - -See [branches & commits](branching/index.md) and [merging](branching/merge.md). - -## Next steps - -- [CLI reference](cli/reference.md) — every command and flag. -- [Schema language](schema/index.md) and [query language](queries/index.md). -- [Operating a cluster](clusters/index.md) and [running the server](operations/server.md) - for multi-graph, multi-user deployments. 
diff --git a/docs/user/reference/constants.md b/docs/user/reference/constants.md deleted file mode 100644 index 0e9ee22..0000000 --- a/docs/user/reference/constants.md +++ /dev/null @@ -1,42 +0,0 @@ -# Constants & Tunables (cheat sheet) - -| Name | Value | Area | -|---|---|---| -| `MANIFEST_DIR` | `__manifest` | manifest layout | -| Commit graph dir | `_graph_commits.lance` | branch-ref carrier + pre-v4 lineage source (lineage lives in `__manifest` since RFC-013 Phase 7) | -| Run registry dir (legacy, removed) | `_graph_runs.lance` | inert post-v0.4.0; bytes remain until a prefix-delete primitive lands | -| Run branch prefix (legacy, removed) | `__run__` | swept off `__manifest` by the internal schema migration; no longer a reserved name | -| Schema apply lock | `__schema_apply_lock__` | schema apply | -| Manifest publisher retry budget | `PUBLISHER_RETRY_BUDGET = 5` | manifest publish | -| Internal manifest schema version | `INTERNAL_MANIFEST_SCHEMA_VERSION = 4` | manifest migrations (v4 = graph lineage in `__manifest`, RFC-013 Phase 7) | -| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | merge execution | -| Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | optimize/cleanup | -| Lance blob compaction support | `LANCE_SUPPORTS_BLOB_COMPACTION = false` | optimize | -| Graph index cache size | `8` (LRU) | runtime cache | -| Expand indexed-path frontier ceiling | `OMNIGRAPH_EXPAND_INDEXED_MAX_FRONTIER=1024` | traversal | -| Expand indexed-path hop ceiling | `OMNIGRAPH_EXPAND_INDEXED_MAX_HOPS=6` | traversal | -| Expand CSR-build cost factor | `CSR_BUILD_FACTOR = 1.5` | traversal | -| Expand mode override | `OMNIGRAPH_TRAVERSAL_MODE` (`indexed`\|`csr`; unset = cost-based auto) | traversal | -| Default body limit | `1 MB` | HTTP server | -| Load (bulk-write) body limit | `32 MB` | HTTP server (`/load`; shared by the deprecated `/ingest` alias) | -| Default embed provider/model | `openai-compatible` / `openai/text-embedding-3-large` | engine 
embedding | -| OpenAI-direct embed model | `text-embedding-3-large` | engine embedding | -| Gemini-direct embed model | `gemini-embedding-2` | engine embedding | -| Embed deadline | `OMNIGRAPH_EMBED_DEADLINE_MS=60000` | engine embedding | -| Embed timeout | `OMNIGRAPH_EMBED_TIMEOUT_MS=30000` | engine embedding | -| Embed retries | `OMNIGRAPH_EMBED_RETRY_ATTEMPTS=4` | engine embedding | -| Embed retry backoff | `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS=200` | engine embedding | -| LANCE memory pool default | `1 GB` (raised in v0.3.0) | runtime | - -**Expand traversal dispatch.** With `OMNIGRAPH_TRAVERSAL_MODE` unset, the engine -chooses the indexed (per-hop BTREE) vs CSR (whole-graph in-memory) path with a -cost model over cheap manifest counts (frontier size, |E|, source-vertex count, -hops) plus the index-coverage signal: the indexed path is preferred when its -frontier-relative work beats building the CSR (≈ when `hops × frontier` is a -small fraction of the source-vertex set), and CSR is preferred for dense/deep -traversals or when the BTREE coverage is degraded and a full scan would be paid -per hop. The two ceilings bound the **initial dispatch** frontier/hops (beyond -them CSR is always used); they are not a hard per-hop bound — the cost model -*estimates* total indexed work as ~`hops × frontier × fanout`, so dense fan-out is -priced toward CSR rather than capped mid-traversal. The override flag forces a path (the `auto` result is identical either way; -only the path differs). diff --git a/docs/user/schema-language.md b/docs/user/schema-language.md new file mode 100644 index 0000000..4250676 --- /dev/null +++ b/docs/user/schema-language.md @@ -0,0 +1,88 @@ +# Schema Language (`.pg`) + +Pest grammar at `crates/omnigraph-compiler/src/schema/schema.pest`. AST at `schema/ast.rs`. Catalog at `catalog/mod.rs`. + +## Top-level declarations + +- `interface <Name> { property* }` — reusable property contracts. +- `node <Name> [implements <Iface>, ...] 
{ property* | constraint* }` +- `edge <Name>: <FromType> -> <ToType> [@card(min..max)] { property* | constraint* }` +- Comments: line `//` and block `/* … */`. + +## Property declarations + +`<ident>: <TypeRef> [annotation*]` + +## Built-in scalar types + +| Scalar | Arrow type | +|---|---| +| `String` | Utf8 | +| `Blob` | LargeBinary | +| `Bool` | Boolean | +| `I32` / `I64` | Int32 / Int64 | +| `U32` / `U64` | UInt32 / UInt64 | +| `F32` / `F64` | Float32 / Float64 | +| `Date` | Date32 | +| `DateTime` | Date64 | +| `Vector(<dim>)` | FixedSizeList(Float32, dim), `1 ≤ dim ≤ i32::MAX` | +| `[<scalar>]` | List(scalar) | +| `enum(v1, v2, …)` | Utf8 with sorted/dedup'd set of allowed string values | +| `<scalar>?` | Same as scalar but `nullable: true` | + +## Constraints (body level) + +| Constraint | On | Effect | +|---|---|---| +| `@key(p, …)` | node | Primary key; implies index on key columns; `key_property()` returns the first key | +| `@unique(p, …)` | node, edge | Uniqueness across listed columns | +| `@index(p, …)` | node, edge | Build a scalar (BTREE) index on the columns | +| `@range(p, min..max)` | node | Numeric range validation (open ranges allowed) | +| `@check(p, "regex")` | node | Regex pattern validation | +| `@card(min..max?)` | edge | Edge multiplicity — default `0..*`; `0..1`, `1..1`, `1..*`, etc. | + +Edge bodies only allow `@unique` and `@index`. + +## Annotations + +- `@<ident>` or `@<ident>(<literal>)` on any declaration or property. +- Known annotations: + - `@embed` on a Vector property — names the *source* property whose text gets embedded into this vector at ingest (`embed_sources` map in NodeType). + - `@description("…")`, `@instruction("…")` on query declarations (carried through to clients). +- Custom annotations are accepted by the parser and surfaced in catalog metadata; unrecognized annotations don't fail compilation. + +## Catalog construction + +- Pass 0: collect interfaces. 
+- Pass 1: collect nodes, expand `implements`, build constraint and `@embed` mappings, build the Arrow schema for each node table (`id: Utf8` plus all properties; blob columns get `LargeBinary`). +- Pass 2: collect edges, validate that `from_type` / `to_type` exist, normalize edge names case-insensitively for lookup, validate constraints for edges. Edge Arrow schema: `id: Utf8, src: Utf8, dst: Utf8` plus edge properties. + +## Schema IR & stable type IDs + +- `SCHEMA_IR_VERSION = 1` (`catalog/schema_ir.rs`). +- Each interface/node/edge currently gets a `stable_type_id` from a kind+name hash. +- Rename-preserving accepted IDs are an architectural invariant, but the current hash-on-name implementation is a known gap until migration carries IDs across `@rename_from`. +- Serialized as JSON for diff/migration plans. + +## Schema migration planning + +`plan_schema_migration(accepted, desired) -> SchemaMigrationPlan { supported, steps[] }` with step types: + +- `AddType { type_kind, name }` +- `RenameType { type_kind, from, to }` +- `AddProperty { type_kind, type_name, property_name, property_type }` +- `RenameProperty { type_kind, type_name, from, to }` +- `AddConstraint { type_kind, type_name, constraint }` +- `UpdateTypeMetadata { … annotations }` +- `UpdatePropertyMetadata { … annotations }` +- `UnsupportedChange { entity, reason }` (forces `supported=false`) + +`apply_schema()` returns `SchemaApplyResult { supported, applied, manifest_version, steps }` and is gated by an internal `__schema_apply_lock__` system branch so concurrent schema applies serialize. + +## Destructive drops — `--allow-data-loss` + +`DropProperty` and `DropType` steps default to `Soft` mode: the catalog tombstones the entry but the prior column / dataset remains time-travel-reachable via `snapshot_at_version(prev)` until `omnigraph cleanup` runs. Soft drops are reversible. 
+ +Pass `--allow-data-loss` (CLI) or `allow_data_loss: true` (HTTP `POST /schema/apply` body, SDK `SchemaApplyOptions`) to promote every drop in the plan to `Hard` mode. Hard drops run `cleanup_old_versions` on the affected dataset immediately after the manifest publish, making the prior column / dataset unreachable. **Irreversible.** + +The flag is honored uniformly across transports β€” `omnigraph schema apply --allow-data-loss`, `POST /schema/apply { schema_source, allow_data_loss: true }`, and `apply_schema_with_options(.., SchemaApplyOptions { allow_data_loss: true })` produce identical plans and identical effects. diff --git a/docs/user/schema/lint.md b/docs/user/schema-lint.md similarity index 58% rename from docs/user/schema/lint.md rename to docs/user/schema-lint.md index 6635e9f..a1495fd 100644 --- a/docs/user/schema/lint.md +++ b/docs/user/schema-lint.md @@ -2,26 +2,29 @@ The migration planner emits **code-tagged diagnostics** for every schema change it rejects. Codes have the form `OG-XXX-NNN` and identify the rule (not the message); operators reference them in suppression directives, severity overrides, and CI reports. -This page is the catalog of codes shipped today. +This page is the catalog of codes shipped today. The chassis behind it is tracked in [MR-694](https://linear.app/modernrelay/issue/MR-694). -## What's shipped +## What's shipped in v0 -- Stable code attached to every rejection the planner emits (today: 5 of 17 paths β€” the rest are tagged as future work). +- Stable code attached to every rejection the planner emits (today: 5 of 17 paths β€” the rest carry `code: None` and are tagged as future work). - Code appears in the user-visible error message: `[OG-DS-104] removing property 'Person.age' is not supported …`. - CLI `omnigraph schema plan` shows the code on `unsupported change …` lines. +- Tests in `tests/schema_apply.rs` assert on codes, not on free-text prose. 
## What's not shipped yet -- Severity configuration (planned: `lint: { OG-DS-103: error }`). +- Severity configuration in `omnigraph.yaml` (planned: `lint: { OG-DS-103: error }`). - `@allow(OG-XXX-NNN, "rationale")` suppression directives. -- Pre-migration checks (the `migration_check { … }` block). -- The CD / VE / LK / NM families. -- CI integration. -- Cost-class annotations. +- Pre-migration checks (the `migration_check { … }` block β€” MR-941). +- The CD / VE / LK / NM families (MR-942..945). +- CI integration (MR-946). +- Cost-class annotations (MR-944). -## Code catalog +See the parent chassis issue (MR-694) for the design and the per-family sub-issues for what's planned. -The chassis defines ten families. Today only DS and MF have emitted codes. The remaining families are reserved for future releases. +## Code catalog (v0) + +The chassis defines ten families. Today only DS and MF have emitted codes. The remaining families are reserved for future PRs. | Code | Family | Tier | Default severity | Meaning | |---|---|---|---|---| @@ -34,22 +37,24 @@ The chassis defines ten families. Today only DS and MF have emitted codes. The r | `OG-MF-104` | Maybe-fail | validated | error | tighten nullable to non-nullable (reserved) | | `OG-MF-106` | Maybe-fail | destructive | error | narrowing scalar type | +The full code catalog source of truth lives in `crates/omnigraph-compiler/src/lint/codes.rs`. CI-level invariants (uniqueness, format, family coverage) are unit-tested in the same module. 
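As a rough illustration of how a CI script or test helper might consume these codes, the sketch below pulls the code out of a message shaped like `[OG-DS-104] removing property ...`. The regular expression and the `extract_lint_code` helper are assumptions made for illustration, not part of OmniGraph; the two-letter family prefix is inferred from the families listed in this catalog rather than from an official grammar.

```python
import re
from typing import Optional, Tuple

# Assumed shape: "OG-" + two uppercase family letters + "-" + three digits.
# Inferred from examples such as OG-DS-104 and OG-MF-106; not an official grammar.
LINT_CODE = re.compile(r"\[(OG-([A-Z]{2})-(\d{3}))\]")


def extract_lint_code(message: str) -> Optional[Tuple[str, str, int]]:
    """Return (code, family, number) for the first bracketed code, or None."""
    match = LINT_CODE.search(message)
    if match is None:
        # Rejections that do not carry a code yet simply yield None.
        return None
    code, family, number = match.groups()
    return code, family, int(number)


if __name__ == "__main__":
    msg = "[OG-DS-104] removing property 'Person.age' is not supported"
    print(extract_lint_code(msg))
```

Matching on the code rather than the prose keeps such a check stable when the wording of a message changes.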
+ ## Families The ten chassis families: | Prefix | Family | Status | |---|---|---| -| **DS** | Destructive (data-loss) | shipped | -| **MF** | Maybe-fail / data-dependent | shipped | -| **CD** | Constraint deletion (relaxation warning) | planned | +| **DS** | Destructive (data-loss) | shipped, v0 | +| **MF** | Maybe-fail / data-dependent | shipped, v0 | +| **CD** | Constraint deletion (relaxation warning) | tracked in MR-942 | | **BC** | Backward-incompatible (rename) | implicit in `@rename_from`; codify later | -| **NM** | Naming conventions | planned | -| **OW** | Ownership (per-resource Cedar) | planned | -| **NL** | Non-linear (branch-merge divergence) | planned | -| **VE** | Vector / embedding | planned | -| **ED** | Edge / graph topology | planned | -| **LK** | Lock duration / cost | planned | +| **NM** | Naming conventions | tracked in MR-945 | +| **OW** | Ownership (per-resource Cedar) | tracked in MR-722 | +| **NL** | Non-linear (branch-merge divergence) | stubbed in MR-947 | +| **VE** | Vector / embedding | tracked in MR-943 | +| **ED** | Edge / graph topology | tracked in MR-701, MR-943 | +| **LK** | Lock duration / cost | tracked in MR-944 | ## Prior art diff --git a/docs/user/schema/index.md b/docs/user/schema/index.md deleted file mode 100644 index df82a23..0000000 --- a/docs/user/schema/index.md +++ /dev/null @@ -1,79 +0,0 @@ -# Schema Language (`.pg`) - -## Top-level declarations - -- `interface <Name> { property* }` β€” reusable property contracts. -- `node <Name> [implements <Iface>, ...] { property* | constraint* }` -- `edge <Name>: <FromType> -> <ToType> [@card(min..max)] { property* | constraint* }` -- Comments: line `//` and block `/* … */`. 
- -## Property declarations - -`<ident>: <TypeRef> [annotation*]` - -## Built-in scalar types - -| Scalar | Arrow type | -|---|---| -| `String` | Utf8 | -| `Blob` | LargeBinary | -| `Bool` | Boolean | -| `I32` / `I64` | Int32 / Int64 | -| `U32` / `U64` | UInt32 / UInt64 | -| `F32` / `F64` | Float32 / Float64 | -| `Date` | Date32 | -| `DateTime` | Date64 | -| `Vector(<dim>)` | FixedSizeList(Float32, dim), `1 ≀ dim ≀ i32::MAX` | -| `[<scalar>]` | List(scalar) | -| `enum(v1, v2, …)` | Utf8 with sorted/dedup'd set of allowed string values | -| `<scalar>?` | Same as scalar but `nullable: true` | - -## Constraints (body level) - -| Constraint | On | Effect | -|---|---|---| -| `@key(p, …)` | node | Primary key; implies index on key columns; `key_property()` returns the first key | -| `@unique(p, …)` | node, edge | Uniqueness across listed columns | -| `@index(p, …)` | node, edge | Build a scalar (BTREE) index on the columns | -| `@range(p, min..max)` | node | Numeric range validation (open ranges allowed) | -| `@check(p, "regex")` | node | Regex pattern validation | -| `@card(min..max?)` | edge | Edge multiplicity β€” default `0..*`; `0..1`, `1..1`, `1..*`, etc. | - -Edge bodies only allow `@unique` and `@index`. - -## Annotations - -- `@<ident>` or `@<ident>(<literal>)` on any declaration or property. -- Known annotations: - - `@embed("source_property")` on a Vector property β€” records which String property is the embedding source for query-time `nearest($v, "string")` auto-embedding. It is a catalog annotation; it does **not** populate the vector at ingest (supply vectors in load data, or pre-fill via the offline `omnigraph embed` pipeline). An optional `model="…"` kwarg (`@embed("source_property", model="openai/text-embedding-3-large")`) records the embedding model so a `nearest()` query whose embedder uses a different model is rejected loudly; `model` is the only supported kwarg. See [search/embeddings.md](../search/embeddings.md). 
- - `@description("…")`, `@instruction("…")` on query declarations (carried through to clients). -- Custom annotations are accepted by the parser and surfaced in catalog metadata; unrecognized annotations don't fail compilation. - -## Table layout - -- Each node type compiles to a table with an `id: Utf8` column plus all declared properties (blob columns are stored as `LargeBinary`); `implements` clauses expand the interface's properties into the node. -- Each edge type compiles to a table with `id: Utf8, src: Utf8, dst: Utf8` plus the edge's own properties. Edge endpoint types (`from`/`to`) must exist, and edge names are matched case-insensitively. - -## Schema migration planning - -A migration plan compares the accepted schema against the desired one and reports whether the change is supported plus the ordered steps it requires: - -- Add a type -- Rename a type -- Add a property -- Rename a property -- Add a constraint -- Update type or property metadata (annotations) -- Unsupported change (reports the entity and reason; forces the plan to unsupported) - -Applying a plan reports whether it was supported, the steps applied, and the resulting manifest version. Concurrent schema applies serialize so they can't interleave. - -## Destructive drops β€” `--allow-data-loss` - -`DropProperty` and `DropType` steps default to `Soft` mode: the catalog tombstones the entry but the prior column / dataset remains time-travel-reachable via `snapshot_at_version(prev)` until `omnigraph cleanup` runs. Soft drops are reversible. - -Pass `--allow-data-loss` (CLI `schema apply`) or `allow_data_loss: true` (SDK `SchemaApplyOptions`) to promote every drop in the plan to `Hard` mode. Hard drops run `cleanup_old_versions` on the affected dataset immediately after the manifest publish, making the prior column / dataset unreachable. 
**Irreversible.** - -This is the **direct/embedded** schema-apply path β€” `omnigraph schema apply --store …` and the embedded SDK `apply_schema_with_options(.., SchemaApplyOptions { allow_data_loss: true })` produce identical plans and identical effects. - -**Cluster-managed graphs are different.** A graph served from a cluster evolves only through `omnigraph cluster apply`, which performs **soft drops only** (no `allow_data_loss` path), and the HTTP `POST /schema/apply` route is **disabled (returns 409) for cluster-backed serving** β€” see [server](../operations/server.md) and [cluster-config](../clusters/config.md). Direct `schema apply` against a cluster-managed storage path is likewise refused. diff --git a/docs/user/search/embeddings.md b/docs/user/search/embeddings.md deleted file mode 100644 index 11f3540..0000000 --- a/docs/user/search/embeddings.md +++ /dev/null @@ -1,112 +0,0 @@ -# Embeddings - -OmniGraph embeds text through a **single, provider-independent client** resolved from one -`EmbeddingConfig { provider, model, base_url, api_key }`. The same resolved config is used by the query-time -auto-embed of a string in `nearest($v, "string")` and by the offline `omnigraph embed` file pipeline, so -query vectors and document vectors share one model and one vector space. 
- -## Providers - -| `provider` | Wire shape | Use it for | -|---|---|---| -| `openai-compatible` (default) | `POST {base}/embeddings`, bearer auth, `{model, input, dimensions}` | **OpenRouter** (the default gateway β€” one key for many models), OpenAI direct, or a self-hosted endpoint (vLLM / Ollama / LM Studio) | -| `gemini` | `POST {base}/models/{model}:embedContent`, `x-goog-api-key`, with `RETRIEVAL_QUERY` / `RETRIEVAL_DOCUMENT` task types | Reaching Google's `generativelanguage` API directly | -| `mock` | none β€” deterministic offline vectors | Tests and local dev without a key | - -Vectors are stored L2-normalized as `FixedSizeList(Float32, dim)`; the requested output dimension is driven by -the target column width and sent as Gemini `outputDimensionality` / OpenAI `dimensions`. - -## Configuration (cluster) - -Cluster-served graphs can pin their query-time embedder in `cluster.yaml`: - -```yaml -providers: - embedding: - default: - kind: openai-compatible - base_url: https://openrouter.ai/api/v1 - model: openai/text-embedding-3-large - api_key: ${OPENROUTER_API_KEY} - -graphs: - knowledge: - schema: knowledge.pg - embedding_provider: default -``` - -`embedding_provider` references `providers.embedding.<name>`; bare names are -normalized to that typed ref. The server resolves `${ENV_VAR}` only when it -boots from the applied cluster ledger, so `cluster validate`, `plan`, and -`apply` do not need provider secrets. Inline API keys are rejected. `mock` -needs no key. Vector dimensions stay schema-driven by the target `Vector(N)` -column. - -Direct (`--store`) access, embedded callers, and the offline -`omnigraph embed` pipeline use environment configuration unless they inject an -`EmbeddingConfig` directly. 
- -## Configuration (environment) - -| Variable | Meaning | -|---|---| -| `OMNIGRAPH_EMBED_PROVIDER` | `openai-compatible` (default, β†’ OpenRouter) \| `openai` (β†’ OpenAI's own host) \| `gemini` \| `mock` | -| `OMNIGRAPH_EMBED_BASE_URL` | endpoint base; defaults `https://openrouter.ai/api/v1` (`openai-compatible`/unset), `https://api.openai.com/v1` (`openai`), `https://generativelanguage.googleapis.com/v1beta` (`gemini`) | -| `OMNIGRAPH_EMBED_MODEL` | model id; defaults `openai/text-embedding-3-large` (OpenRouter), `text-embedding-3-large` (`openai`), `gemini-embedding-2` (`gemini`) | -| `OPENROUTER_API_KEY` / `OPENAI_API_KEY` | api key for `openai-compatible` (OpenRouter preferred) | -| `GEMINI_API_KEY` | api key for `gemini` | -| `OMNIGRAPH_EMBED_DEADLINE_MS` | total wall-clock budget for one embed call across all retries (default `60000`; `0` = unbounded) | -| `OMNIGRAPH_EMBED_TIMEOUT_MS` | per-request HTTP timeout (default `30000`) | -| `OMNIGRAPH_EMBED_RETRY_ATTEMPTS` / `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS` | retry policy (defaults `4` / `200`) | -| `OMNIGRAPH_EMBEDDINGS_MOCK` | set truthy to force the deterministic mock provider | - -The default zero-config path is OpenRouter: set `OPENROUTER_API_KEY` and run. Reaching Gemini takes -`OMNIGRAPH_EMBED_PROVIDER=gemini` plus `GEMINI_API_KEY`. - -### Behavior notes - -- **Bounded latency.** Each embed call is wrapped in `OMNIGRAPH_EMBED_DEADLINE_MS`, so a degraded - provider cannot hang a read for the full retry envelope. -- **Reuse.** The query path builds the client once per graph handle (on the first `nearest($v, "string")` - that needs embedding) and reuses it, keeping the provider connection pool warm. A graph that never embeds - needs no provider key. -- **Observability.** Embed calls emit `tracing` events under `target = "omnigraph::embedding"` (provider, - model, dim, attempt, elapsed, outcome). - -## `@embed` schema annotation - -Mark a Vector property with `@embed("source_text_property")`. 
This is a **catalog annotation** consumed by the -query typechecker and linter: it records which String property is the embedding source and lets -`nearest($v, "string")` auto-embed a query string for comparison against that vector column. - -Optionally record the model that produced the stored vectors: -`@embed("source_text_property", model="openai/text-embedding-3-large")`. When a model is recorded, a -`nearest($v, "string")` query is **rejected with a typed error** unless the resolved query embedder uses the -same model β€” so stored and query vectors are guaranteed same-space instead of silently ranking across spaces. -To fix a mismatch, set `OMNIGRAPH_EMBED_MODEL` (and the matching provider) to the recorded model, or re-embed. -The recorded model is the literal string, so `openai/text-embedding-3-large` (via OpenRouter) and -`text-embedding-3-large` (OpenAI direct) are distinct identities; use the matching string. Changing a recorded -model is a loud `schema apply` refusal (treat it as a re-embed migration). `@embed` without a model keeps -working with no validation. `model` is the only supported `@embed` argument; any other is a parse error. - -**It does not embed at ingest.** Stored vectors are supplied directly in your load data, or pre-filled by the -offline `omnigraph embed` pipeline below. (Ingest-time execution of `@embed` is a planned enhancement.) - -## CLI `omnigraph embed` (offline file pipeline) - -Operates on **JSONL files** (not on a graph), using the same resolved provider config. Three modes (mutually -exclusive): - -- (default) `fill_missing` β€” only embed rows whose target field is empty -- `--reembed-all` β€” overwrite all -- `--clean` β€” strip embeddings - -Inputs are either a single seed manifest YAML or `--input/--output/--spec`. Selectors `--type T`, `--select T:field=value` filter rows. Streams JSONL β†’ JSONL. - -## Migration - -This release has no backwards-compatibility shim (pre-release). 
The default provider is now OpenRouter, and -the legacy `OMNIGRAPH_GEMINI_BASE_URL` is removed. A graph whose vectors were produced with -`gemini-embedding-2-preview` should either re-embed, or pin the query-time embedder to match by setting -`OMNIGRAPH_EMBED_PROVIDER=gemini` and `OMNIGRAPH_EMBED_MODEL=gemini-embedding-2-preview` (the stored and query -vectors must come from the same model to be comparable). diff --git a/docs/user/search/index.md b/docs/user/search/index.md deleted file mode 100644 index 280e9e8..0000000 --- a/docs/user/search/index.md +++ /dev/null @@ -1,48 +0,0 @@ -# Search - -OmniGraph runs vector, full-text, and hybrid search in the same runtime as graph -traversal β€” a single [query](../queries/index.md) can combine a vector `nearest`, -a `bm25` text score, and an `Expand` traversal. Search functions are used inside -`match` (to filter), or as expressions inside `return` / `order` (to score and -rank). - -## Functions - -| Function | Purpose | Backing index | -|---|---|---| -| `nearest($x.vec, $q)` | k-NN vector search (cosine) | vector index (IVF / HNSW) | -| `search(field, q)` | Generic full-text search | inverted (FTS) index | -| `fuzzy(field, q [, max_edits])` | Levenshtein-tolerant text search | inverted index | -| `match_text(field, q)` | Pattern match | inverted index | -| `bm25(field, q)` | BM25 relevance scoring | inverted index | -| `rrf(rank_a, rank_b [, k])` | Reciprocal Rank Fusion of two rankings (default `k=60`) | fuses scored rankings | - -- `nearest()` requires a `limit`. The query vector is resolved from the param map, - or embedded from a text input at runtime via the configured - [embedding client](embeddings.md). -- Scores and ranks propagate as ordinary columns, so you can `return` a score and - `order` by it. - -## Hybrid ranking with `rrf` - -Reciprocal Rank Fusion combines two independent rankings (typically one vector and -one text) into a single fused ranking, without needing the two score scales to be -comparable. 
Rank each retrieval separately, then fuse: - -```gq -query hybrid($q: String) { - match { $d: Document { } } - return { - $d, - rrf( nearest($d.embedding, $q), bm25($d.body, $q) ) as score - } - order { score desc } - limit 10 -} -``` - -## Indexes and embeddings - -Search functions only work when the backing index exists β€” see -[indexes](indexes.md) for building vector and inverted indexes, and -[embeddings](embeddings.md) for generating the vectors `nearest` searches over. diff --git a/docs/user/search/indexes.md b/docs/user/search/indexes.md deleted file mode 100644 index af8c128..0000000 --- a/docs/user/search/indexes.md +++ /dev/null @@ -1,43 +0,0 @@ -# Indexes - -## L1 β€” Lance index types OmniGraph exposes - -| Index | Use | Notes | -|---|---|---| -| **BTREE scalar** | `=` / range / `IN` / `IS NULL` on a scalar | always on the node `id` and edge `src`/`dst`; and on each one-column `@index`/`@key` property that is an **enum** or an **orderable scalar** (`DateTime`/`Date`/`I32`/`I64`/`U32`/`U64`/`F32`/`F64`/`Bool`) | -| **Inverted (FTS)** | `search`, `fuzzy`, `match_text`, `bm25` | created on **free-text** (non-enum) `String` `@index`/`@key` columns | -| **Vector** | `nearest()` k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as FixedSizeList(Float32, dim) | - -The per-property index a column gets is decided by `node_prop_index_kind` (shared -by the builder and the sidecar-pinning coverage check so they cannot drift): -enums and orderable scalars β†’ BTREE, free-text Strings β†’ FTS, `Vector` β†’ vector, -list/`Blob` columns β†’ none. - -> **Free-text Strings are not equality-indexed.** A non-enum `String` column -> (including a `String @key` slug) gets an FTS inverted index, which Lance does -> **not** consult for `=`/range β€” only for `search`/`match_text`/`bm25`. So an -> equality filter on a free-text String falls back to a full scan. 
If you filter -> a String identifier by equality on a large table, model it so the value is the -> node id, or track it as a follow-up to also build a BTREE on such columns. - -> **Coverage and cost.** Each indexed column adds index files and build time, and -> an index only covers the fragments it was built over. Rows appended after the -> index was built (e.g. by `load --mode merge`) are scanned unindexed until a -> reindex extends coverage; see [maintenance](../operations/maintenance.md) β†’ `optimize`. - -## L2 β€” OmniGraph orchestration - -- **`@index`/`@key` declares intent; the physical index is derived state.** A migration records the declaration in the catalog/IR and never fails on it β€” `schema apply` builds **no** indexes (adding an `@index` to an existing column is a pure metadata change that touches no table data). `load`/`mutate` build declared indexes inline as part of the write, but a column that can't be built yet (a `Vector` column with no trainable vectors β€” IVF k-means needs β‰₯1 vector, e.g. rows loaded before `embed` runs) is left **pending**, not fatal. Reads stay correct meanwhile: a missing/partial index degrades to a scan (vector search to brute-force). A later `ensure_indices`/`optimize` materializes the pending index once it is buildable. This mirrors how LanceDB builds indexes asynchronously and serves unindexed rows by brute-force. -- `ensure_indices()` / `ensure_indices_on(branch)` β€” idempotent build of BTREE + inverted + vector indexes for the current head; safe to re-run; returns the columns it had to defer as pending. `optimize` runs it after compaction, so the maintenance cron is the convergence path for deferred indexes. -- Indexes are built on the *branch head* (not on a snapshot), so reads always see the current index state. -- **Lazy branch forking for indexes**: a branch that hasn't mutated a sub-table doesn't need its own index β€” the main lineage's index is reused until the first write triggers a copy-on-write fork. 
-- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column. - -## L2 β€” Graph topology index - -This is OmniGraph-specific (not Lance): - -- A Compressed Sparse Row (CSR) adjacency representation of edges, with both out- (CSR) and in- (CSC) directions, plus a dense per-node-type id mapping. -- Built on demand from a snapshot's edge tables, **lazily**: only when an `Expand` the planner routes to the CSR path (dense / large frontier) or an `AntiJoin` actually needs it. -- Cached per snapshot (LRU, keyed by snapshot id + edge table versions), so repeat traversals over the same snapshot reuse it. -- Selective `Expand`s resolve neighbors from the persisted `src`/`dst` BTREE instead (one indexed scan per hop) and never trigger the CSR build; see [query-language](../queries/index.md) β†’ Traversal execution. Pure scans, and queries served entirely by the indexed traversal path, skip it. diff --git a/docs/user/server.md b/docs/user/server.md new file mode 100644 index 0000000..6f55e16 --- /dev/null +++ b/docs/user/server.md @@ -0,0 +1,192 @@ +# HTTP Server (`omnigraph-server`) + +Axum 0.8 + tokio + utoipa-generated OpenAPI. **Two modes** (v0.6.0+): single-graph (legacy) and multi-graph (MR-668). Mode is inferred from CLI args + config shape. + +## Modes + +### Single-graph mode (legacy) + +`omnigraph-server <URI>` or `omnigraph-server --target <name> --config omnigraph.yaml`. Routes are flat β€” `/snapshot`, `/read`, `/branches`, etc. Behavior unchanged from v0.6.0. + +### Multi-graph mode (v0.6.0+) + +`omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:` map and **no** single-mode selector (no `server.graph`, no `<URI>`, no `--target`). The server opens every configured graph in parallel at startup (bounded concurrency = 4, fail-fast on the first open error). Routes are nested under `/graphs/{graph_id}/...`. 
Bare flat paths return 404 in multi mode. + +Mode inference (four-rule matrix): + +1. CLI positional `<URI>` → single +2. CLI `--target <name>` → single +3. `server.graph` in config → single +4. `--config` + non-empty `graphs:` + no single-mode selector → **multi** +5. otherwise → error with migration hint + +## Endpoint inventory + +Per-graph endpoints (same body shape across modes; URLs differ): + +| Method | Single-mode path | Multi-mode path | Auth | Action | Handler | +|---|---|---|---|---|---| +| GET | `/healthz` | `/healthz` | none | - | `server_health` | +| GET | `/openapi.json` | `/openapi.json` | none | - | `server_openapi` (strips security if auth disabled; in multi mode emits cluster paths with `cluster_` operation-id prefix) | +| GET | `/snapshot?branch=` | `/graphs/{id}/snapshot?branch=` | bearer + `read` | snapshot of branch | `server_snapshot` | +| POST | `/query` | `/graphs/{id}/query` | bearer + `read` | inline read query (canonical; clean field names `query`/`name`; mutations → 400) | `server_query` | +| POST | `/read` | `/graphs/{id}/read` | bearer + `read` | **deprecated** alias of `/query` (legacy field names `query_source`/`query_name`, byte-stable response; carries `Deprecation: true` + `Link: </query>; rel="successor-version"`) | `server_read` | +| POST | `/export` | `/graphs/{id}/export` | bearer + `export` | NDJSON stream | `server_export` | +| POST | `/mutate` | `/graphs/{id}/mutate` | bearer + `change` | mutation (canonical; `query`/`name`; accepts legacy `query_source`/`query_name` as serde aliases) | `server_mutate` | +| POST | `/change` | `/graphs/{id}/change` | bearer + `change` | **deprecated** alias of `/mutate` (carries `Deprecation: true` + `Link: </mutate>; rel="successor-version"`) | `server_change` | +| GET | `/schema` | `/graphs/{id}/schema` | bearer + `read` | get current `.pg` source | `server_schema_get` | +| POST | `/schema/apply` | `/graphs/{id}/schema/apply` | bearer + `schema_apply` (target=`main`) | 
migrate | `server_schema_apply` | +| POST | `/ingest` | `/graphs/{id}/ingest` | bearer + `branch_create` (if new) + `change` | bulk load | `server_ingest` (32 MB body limit) | +| GET | `/branches` | `/graphs/{id}/branches` | bearer + `read` | list branches | `server_branch_list` | +| POST | `/branches` | `/graphs/{id}/branches` | bearer + `branch_create` | create | `server_branch_create` | +| DELETE | `/branches/{branch}` | `/graphs/{id}/branches/{branch}` | bearer + `branch_delete` | delete | `server_branch_delete` | +| POST | `/branches/merge` | `/graphs/{id}/branches/merge` | bearer + `branch_merge` | merge `source → target` | `server_branch_merge` | +| GET | `/commits?branch=` | `/graphs/{id}/commits?branch=` | bearer + `read` | list | `server_commit_list` | +| GET | `/commits/{commit_id}` | `/graphs/{id}/commits/{commit_id}` | bearer + `read` | show | `server_commit_show` | + +Server-level management endpoints (v0.6.0+): + +| Method | Path | Auth | Action | Handler | +|---|---|---|---|---| +| GET | `/graphs` | bearer + `graph_list` on `Server::"root"` | list registered graphs | `server_graphs_list` (405 in single mode) | + +## Adding and removing graphs (multi mode) + +Runtime add/remove via API is **not** exposed in v0.6.0: neither +`POST /graphs` nor `DELETE /graphs/{id}` is implemented. Operators add +or remove graphs by stopping the server, editing the `graphs:` map in +`omnigraph.yaml`, then restarting. The server treats `omnigraph.yaml` +as operator-owned configuration and never writes it. + +A future release may introduce a managed registry (Lance-backed, +catalog-style: reserve → init → publish with recovery sidecars) and +re-expose runtime mutation on top of it. + +## Inline read queries (`POST /query`) + +`POST /query` is the read-only, agent-friendly twin of `POST /read`. 
The +request body uses clean field names that match the CLI `-e` flag and the GQ +`query` keyword: + +```json +{ + "query": "query find($n: String) { match { $p: Person { name: $n } } return { $p.name } }", + "name": "find", + "params": { "n": "Alice" }, + "branch": "main", + "snapshot": null +} +``` + +Response shape is identical to `/read` (`ReadOutput`). If the inline source +contains mutations (`insert` / `update` / `delete`), the request is rejected +with HTTP 400 and an error pointing the caller at `POST /mutate`; the +read-only contract is enforced at the URL. + +`POST /mutate` is the canonical mutation endpoint. It accepts the same clean +field names (`query`, `name`); the legacy field names `query_source` and +`query_name` continue to deserialize as serde aliases so existing clients keep +working without changes. + +## Deprecated names (`/read`, `/change`) + +`POST /read` and `POST /change` are kept for back-compat indefinitely; they +are byte-stable on the request side and otherwise behave identically to +`/query` / `/mutate`. They are flagged as deprecated through three independent +channels: + +- **OpenAPI**: the operations carry `deprecated: true` in `openapi.json`, so + every OpenAPI codegen (typescript-fetch, openapi-generator, oapi-codegen, + …) emits a `@deprecated` marker on the generated SDK method. +- **Response headers (RFC 9745)**: every response carries `Deprecation: true`. +- **Response headers (RFC 8288)**: every response carries a `Link` header + pointing at the canonical successor: + `Link: </query>; rel="successor-version"` for `/read`, and + `Link: </mutate>; rel="successor-version"` for `/change`. SDKs and HTTP + proxies can pick the successor up automatically. + +Migration is purely cosmetic on the client side: swap the URL path, leave +the request body and response handling alone. + +## Streaming + +Only `/export` streams (`application/x-ndjson`, MPSC channel + `Body::from_stream`). Everything else is buffered JSON. 
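Because `/export` is the only streamed endpoint, a client has to split the body into records itself. The following is a minimal sketch of that step, assuming the body arrives as an iterable of byte chunks whose boundaries do not line up with record boundaries. The `iter_ndjson` helper and the record shapes in the test data are hypothetical; the actual row format of `/export` is not described here.

```python
import json
from typing import Any, Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[Any]:
    """Yield one parsed JSON value per newline-delimited record."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is a run of complete records;
        # the remainder stays buffered until more bytes arrive.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)
    # A final record may arrive without a trailing newline.
    if buffer.strip():
        yield json.loads(buffer)


if __name__ == "__main__":
    # Hypothetical records, split mid-object across two chunks.
    for record in iter_ndjson([b'{"id": "a"}\n{"id"', b': "b"}\n']):
        print(record)
```

Splitting on the raw newline byte before decoding is safe for UTF-8, since that byte never occurs inside a multi-byte sequence.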
+ +## Error model + +Uniform `ErrorOutput { error, code?, merge_conflicts[], manifest_conflict? }` with `code ∈ unauthorized | forbidden | bad_request | not_found | conflict | too_many_requests | internal`. Merge conflicts attach structured `MergeConflictOutput { table_key, row_id?, kind, message }`. + +`manifest_conflict` is set on **publisher CAS rejections** (HTTP 409): the +caller's pre-write view of one table's manifest version was stale. +`ManifestConflictOutput { table_key, expected, actual }` tells the client +which table to refresh and retry. This is the conflict shape produced by +concurrent `/mutate` (or its `/change` alias) or `/ingest` calls landing +the same `(table, branch)` race. + +HTTP status codes used: 200, 400, 401, 403, 404, 409, 429, 500. + +## Per-actor admission control + +Disjoint +`(table, branch)` writes from different actors now run concurrently, +guarded only by the engine's per-(table, branch) write queue. To keep +one heavy actor from exhausting shared capacity (Lance I/O, manifest +churn, network), the server gates mutating handlers through a +`WorkloadController` configured per-process from environment variables: + +| Env var | Default | Purpose | +|---|---|---| +| `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX` | 16 | Concurrent in-flight mutations per actor | +| `OMNIGRAPH_PER_ACTOR_BYTES_MAX` | 4 GiB | In-flight estimated bytes per actor | + +When an actor exceeds its in-flight count or byte budget, the server +returns **HTTP 429 Too Many Requests** with `code: too_many_requests` +and a `Retry-After` header (seconds). The actor should back off; other +actors are unaffected. + +Cedar policy authorization runs **before** admission accounting so +denied requests don't consume admission slots. + +Today admission gates every mutating handler: `/mutate` (and its +deprecated alias `/change`), `/ingest`, `/branches/{create,delete,merge}`, +and `/schema/apply`. 
Read-only endpoints (`/snapshot`, `/query`, `/read`, +`/export`, `/branches` GET, `/commits`, `/schema` GET) are not +admission-gated. + +## Body limits + +- Default: 1 MB +- `/ingest`: 32 MB + +## Auth model (`bearer + SHA-256`) + +- Tokens are SHA-256 hashed on startup; plaintext is never persisted in memory. +- Constant-time comparison via `subtle::ConstantTimeEq`. +- Three sources, in precedence: + 1. `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET`: AWS Secrets Manager (build with `--features aws`) + 2. `OMNIGRAPH_SERVER_BEARER_TOKENS_FILE` or `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON`: JSON `{actor_id: token, …}` + 3. `OMNIGRAPH_SERVER_BEARER_TOKEN`: single legacy token, actor `default` +- If no tokens are configured, startup refuses unless `--unauthenticated` or + `OMNIGRAPH_UNAUTHENTICATED=1` explicitly opts into open local-dev mode. A + policy file without tokens is also rejected at startup. In open mode + `/openapi.json` strips the security scheme. + +See [deployment.md](deployment.md) for token-source operational details. + +## Tracing & observability + +- `tower_http::TraceLayer::new_for_http()` +- Policy decisions logged at INFO level with actor, action, branch, decision, matched rule +- Startup logs: token source name, graph URI, bind address +- Graceful SIGINT shutdown + +## Not implemented (by design or "TBD") + +- CORS: not configured; add `tower_http::cors` if needed. +- Rate limiting: per-actor admission control gates `/mutate` (alias + `/change`), `/ingest`, `/branches/{create,delete,merge}`, + `/schema/apply` (see "Per-actor + admission control" above). No global rate limiter is configured; + add `tower_http::limit` if a graph-wide cap is needed. +- Pagination: none (commits/branches return everything; export streams). +- Runtime graph add/remove: edit `omnigraph.yaml` and restart. 
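The per-actor admission control described earlier answers an over-budget actor with HTTP 429 and a `Retry-After` header in seconds. Below is a minimal client-side sketch of honoring that header. It assumes a 429 means the request was rejected before any work was done, so resending is safe; the `send` callable, the `Response` tuple, and both helper names are hypothetical stand-ins for whatever HTTP client is in use.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape: (status, headers, body).
Response = Tuple[int, Mapping[str, str], bytes]


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as a number of seconds, falling back to `default`."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return max(0.0, float(value))
            except ValueError:
                # Not a plain number of seconds; use the fallback delay.
                return default
    return default


def send_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send` until it returns something other than 429, or attempts run out."""
    response = send()
    for _ in range(max_attempts - 1):
        if response[0] != 429:
            break
        sleep(retry_after_seconds(response[1]))
        response = send()
    return response
```

Passing `sleep` in as a parameter keeps the loop testable without real delays. The last response is returned unchanged when every attempt is throttled, so the caller still sees the 429.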
diff --git a/docs/user/storage.md b/docs/user/storage.md new file mode 100644 index 0000000..c22d4d6 --- /dev/null +++ b/docs/user/storage.md @@ -0,0 +1,115 @@ +# Storage + +## L1 — Lance dataset (per node/edge type) + +Every node type and every edge type is its own Lance dataset: + +- **Columnar Arrow storage**: each property is a column; nullable per Arrow schema. +- **Fragments**: data is partitioned into fragments; new writes create new fragments. +- **Manifest versioning**: every commit produces a new dataset version; old versions remain readable. +- **Stable row IDs**: `enable_stable_row_ids: true` is set on every Lance dataset OmniGraph creates — node and edge data tables, `__manifest`, `_graph_commits.lance`, `_graph_commit_recoveries.lance`, and any future system tables. This is an architectural invariant: the flag is one-way at dataset create per Lance's row-id-lineage spec, so a future change that introduces a Lance dataset must preserve it. Consequences: `_row_created_at_version` and `_row_last_updated_at_version` are available on every dataset (load-bearing for change-feed validators); `CreateIndex × Rewrite` is not a retryable conflict, so indices survive `omnigraph optimize` without needing the Fragment Reuse Index; readers must use a Lance build that recognises the flag (our pinned 4.0.0 is fine). Pre-0.4.x graphs created before this code path settled may have datasets without the flag and cannot be retrofitted in place — the supported path is dump-and-reload. The `stage_overwrite` rewrite path (used by `schema_apply`) preserves the flag through `Operation::Overwrite`; pinned by `stage_overwrite_preserves_stable_row_ids` in `crates/omnigraph/tests/staged_writes.rs`. +- **Append / delete / `merge_insert`**: native Lance write modes. +- **Per-dataset branches** (Lance native): copy-on-write at the dataset level.
+- **Object-store agnostic**: file://, s3://, gs://, az://, http (read-only via Lance) — OmniGraph wires file:// and s3:// (`storage.rs`). + +## L2 — Multi-dataset coordination via `__manifest` + +OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordinated through one append-only manifest table. + +- **Manifest table**: `__manifest/` Lance dataset. +- **Layout** (`db/manifest/layout.rs`, `db/manifest/state.rs`): + - `nodes/{fnv1a64-hex(type_name)}` — one Lance dataset per node type + - `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type + - `__manifest/` — the catalog of all sub-tables and their published versions + - `_graph_commits.lance` / `_graph_commit_actors.lance` — the commit graph and its actor map + - (legacy `_graph_runs.lance` / `_graph_run_actors.lance` from pre-v0.4.0 graphs are inert; the run state machine was removed in MR-771 and these files are cleaned up via MR-770's production sweep) +- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`): + - `object_type` ∈ `table | table_version | table_tombstone` + - `table_key` ∈ `node:<TypeName> | edge:<EdgeName>` + - `table_branch` is `null` for the main lineage and the branch name otherwise +- **Snapshot reconstruction**: latest visible `table_version` per `(table_key, table_branch)` minus tombstones — rows where `object_type = table_tombstone`, whose own `table_version` (acting as the tombstone version) is `>= the entry's table_version`. +- **Atomic publish**: multi-dataset commits publish via a `ManifestBatchPublisher` so a single write to `__manifest` flips all the new sub-table versions visible at once. +- **Row-level CAS on the merge-insert join key**: `object_id` carries `lance-schema:unenforced-primary-key=true` so Lance's bloom-filter conflict resolver rejects two concurrent commits that land the same `object_id` row.
Without this annotation, Lance's transparent rebase would admit silent duplicates of `version:T@v=N` from racing publishers (see `.context/merge-insert-cas-granularity.md`). +- **Optimistic concurrency control on publish**: `ManifestBatchPublisher::publish` accepts an `expected_table_versions: HashMap<table_key, u64>` map. Each entry asserts the manifest's current latest non-tombstoned version for that table is exactly what the caller observed; mismatches surface as `OmniError::Manifest` with `ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected, actual }`. An empty map preserves the legacy "best-effort publish" semantics. The publisher uses `conflict_retries(0)` against Lance and owns retry itself (`PUBLISHER_RETRY_BUDGET = 5`), re-running the pre-check on each iteration so concurrent advances surface as `ExpectedVersionMismatch` rather than being silently rebased through. + +### Internal schema versioning (`db/manifest/migrations.rs`) + +The on-disk shape of `__manifest` is reconciled with the binary via a single stamp + dispatcher. `INTERNAL_MANIFEST_SCHEMA_VERSION` declares the shape this binary writes; the on-disk stamp `omnigraph:internal_schema_version` lives in the manifest dataset's schema-level metadata (Lance `update_schema_metadata`). + +- **`init_manifest_graph`** stamps the current version at creation, so newly initialized graphs never need migration. +- **Publisher open-for-write path** (`load_publish_state`) calls `migrate_internal_schema(&mut dataset)` before reading state. When the on-disk stamp matches the binary, this is a single metadata read with no writes; otherwise the dispatcher walks `match`-arm steps forward (1→2, 2→3, …) until the stamp matches, then proceeds with the publish. Reads stay side-effect-free. +- **Forward-version protection**: a stamp *higher* than the binary's known version triggers a clear "upgrade omnigraph first" error.
An old binary cannot clobber a newer schema by silently treating "unknown stamp" as "missing stamp". +- **Idempotency**: each migration step is safe to re-run. A crash between two metadata updates inside a single step leaves the partial state; the next open re-runs the step and the second update lands. The dispatcher itself is a cheap stamp-read on the steady-state path. + +Adding a new on-disk shape change is one constant bump (`INTERNAL_MANIFEST_SCHEMA_VERSION`), one match arm in `migrate_internal_schema`, and one test. No code outside this module branches on the stamp. + +| Stamp | Shape change | +|---|---| +| v1 (implicit, pre-stamp) | `__manifest.object_id` had no PK annotation; publisher had no row-level CAS protection. | +| v2 | `__manifest.object_id` carries `lance-schema:unenforced-primary-key=true`; row-level CAS engaged. Stamped as `omnigraph:internal_schema_version=2`. | + +## On-disk layout + +A graph on disk is a directory tree of Lance datasets. Each dataset follows the standard Lance layout (`_versions/`, `data/`, `_indices/`, `_refs/`); OmniGraph adds the multi-dataset coordination by keeping `__manifest/` alongside the per-type datasets. 
+ +```mermaid +flowchart TB + classDef l1 fill:#fef3e8,stroke:#c46900,color:#000 + classDef l2 fill:#e8f4fd,stroke:#1e6aa8,color:#000 + + graph["graph URI<br/>file:// or s3://bucket/prefix"]:::l2 + + manifest["__manifest/<br/>L2 catalog of sub-tables"]:::l2 + nodes["nodes/{fnv1a64-hex}/<br/>one dataset per node type"]:::l2 + edges["edges/{fnv1a64-hex}/<br/>one dataset per edge type"]:::l2 + cgraph["_graph_commits.lance/<br/>_graph_commit_actors.lance/<br/>_graph_commit_recoveries.lance/"]:::l2 + recovery["__recovery/{ulid}.json<br/>recovery sidecars (transient)"]:::l2 + refs["_refs/branches/{name}.json<br/>graph-level branches"]:::l2 + + graph --> manifest + graph --> nodes + graph --> edges + graph --> cgraph + graph --> recovery + graph --> refs + + subgraph dataset[Inside each Lance dataset — L1] + ds_v["_versions/{n}.manifest<br/>per-dataset versions"]:::l1 + ds_data["data/<br/>fragment files (Arrow IPC)"]:::l1 + ds_idx["_indices/{uuid}/<br/>BTREE · Inverted FTS · IVF/HNSW"]:::l1 + ds_refs["_refs/<br/>per-dataset Lance branches/tags"]:::l1 + ds_tx["_transactions/<br/>commit transaction logs"]:::l1 + end + + nodes -.-> dataset + edges -.-> dataset + manifest -.-> dataset +``` + +**What's where:** + +- **Graph root** is one directory (or S3 prefix). Everything below is part of one OmniGraph graph. +- **`__manifest/`** is a Lance dataset whose rows describe which sub-table version is published at which graph-branch. Reading a snapshot starts here. +- **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe. +- **`_graph_commits.lance`** is an L2 dataset that records the graph-level commit DAG, with a paired `_graph_commit_actors.lance` for the actor map. (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; MR-770 sweeps these in production.)
+- **`_graph_commit_recoveries.lance`** — one row per recovery sweep action. Joined to `_graph_commits.lance` by `graph_commit_id`; the linked commit row carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join. See `crates/omnigraph/src/db/recovery_audit.rs`. +- **`__recovery/{ulid}.json`** — transient sidecar files written by the four migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`) before Phase B begins, deleted after Phase C succeeds. A sidecar persisting after process exit means the writer crashed in the Phase B → Phase C window; the next `Omnigraph::open` recovery sweep processes it. Steady-state directory is empty. See `crates/omnigraph/src/db/manifest/recovery.rs`. +- **`_refs/branches/{name}.json`** is graph-level branch metadata — pointers from a branch name to the manifest version it heads. +- **Inside each Lance dataset** (orange): the standard Lance directory layout. `_versions/{n}.manifest` records every commit; `data/` holds the actual Arrow fragments; `_indices/{uuid}/` holds index segments with their own `fragment_bitmap` for partial coverage; `_refs/` holds Lance-native per-dataset branches and tags. + +The split — L2 owns the cross-dataset catalog; L1 owns the per-dataset internals — means that schema work (which adds or removes datasets) updates `__manifest`, while data work (which adds fragments) updates `_versions/` inside the affected dataset and then bumps `__manifest`.
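The `nodes/{fnv1a64-hex(type_name)}` and `edges/{fnv1a64-hex(edge_type_name)}` naming described above can be illustrated with the standard 64-bit FNV-1a hash. This is a sketch only: the hash constants are the published FNV-1a values, but hashing the UTF-8 bytes of the type name and rendering the result as 16 lowercase hex digits are assumptions made for illustration, and `dataset_path` is a hypothetical helper. The authoritative rules live in `db/manifest/layout.rs`.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001B3              # standard FNV 64-bit prime
MASK64 = 0xFFFFFFFFFFFFFFFF              # keep the running hash within 64 bits


def fnv1a64(data: bytes) -> int:
    """Standard FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def dataset_path(kind: str, type_name: str) -> str:
    """Hypothetical helper: relative dataset path for a node or edge type.

    Assumes the type name is hashed as UTF-8 bytes and formatted as
    fixed-width lowercase hex; the real layout code may differ.
    """
    if kind not in ("nodes", "edges"):
        raise ValueError(f"kind must be 'nodes' or 'edges', got {kind!r}")
    return f"{kind}/{fnv1a64(type_name.encode('utf-8')):016x}"
```

Whatever the exact encoding, a fixed-width hex digest gives every type a path of the same length and avoids case-folding collisions on case-insensitive filesystems, which matches the "fixed-length and case-safe" rationale stated above.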
+ +## URI scheme support (`storage.rs`) + +| Scheme | Backend | Notes | +|---|---|---| +| local path / `file://` | `LocalStorageAdapter` (tokio) | Normalized to absolute paths | +| `s3://bucket/prefix` | `S3StorageAdapter` (object_store) | Honors `AWS_ENDPOINT_URL_S3`, `AWS_ALLOW_HTTP`, `AWS_S3_FORCE_PATH_STYLE` | +| `http(s)://host:port` | HTTP client to `omnigraph-server` | Used by CLI as a target, not a storage backend | + +## Object-store env vars (S3-compatible) + +- `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN` +- `AWS_ENDPOINT_URL`, `AWS_ENDPOINT_URL_S3` — for MinIO / RustFS / GCS-via-XML +- `AWS_S3_FORCE_PATH_STYLE=true` — path-style URLs +- `AWS_ALLOW_HTTP=true` — allow plain HTTP (local dev) diff --git a/docs/user/branching/transactions.md b/docs/user/transactions.md similarity index 75% rename from docs/user/branching/transactions.md rename to docs/user/transactions.md index 6e6b1c4..e4ed485 100644 --- a/docs/user/branching/transactions.md +++ b/docs/user/transactions.md @@ -2,7 +2,7 @@ OmniGraph does not have `BEGIN` / `COMMIT` / `ROLLBACK`. Branches do that job. This page explains the model, when to use which primitive, and shows worked examples for the patterns that come up most.
-The architectural rule lives in [`docs/dev/invariants.md`](../../dev/invariants.md): +The architectural rule lives in [`docs/dev/invariants.md`](../dev/invariants.md): > **Mutations publish at one boundary.** A `mutate_as` or `load` operation > accumulates constructive writes, commits each touched table at the end, then @@ -47,8 +47,8 @@ query register_employee_with_team($name: String, $age: I32, $team: String) { ``` ```bash -omnigraph change --query mutations.gq --name register_employee_with_team \ - --params '{"name":"Alice","age":30,"team":"Acme"}' graph.omni +omnigraph change --query ./mutations.gq --name register_employee_with_team \ + --params '{"name":"Alice","age":30,"team":"Acme"}' ./graph.omni ``` If the second statement fails (e.g. `Acme` doesn't exist), the publisher never publishes; `Alice` is not in the database. Atomic. @@ -57,10 +57,10 @@ If the second statement fails (e.g. `Acme` doesn't exist), the publisher never p ```bash # Query 1 -omnigraph change --query mutations.gq --name register_employee --params '{"name":"Alice","age":30}' graph.omni +omnigraph change --query ./mutations.gq --name register_employee --params '{"name":"Alice","age":30}' ./graph.omni # Query 2 — runs after Query 1 has already published -omnigraph change --query mutations.gq --name link_to_team --params '{"name":"Alice","team":"Acme"}' graph.omni +omnigraph change --query ./mutations.gq --name link_to_team --params '{"name":"Alice","team":"Acme"}' ./graph.omni ``` These are **two publishes** on `main`. If Query 2 fails, Query 1's effects are already visible. There is no `ROLLBACK` for Query 1. @@ -75,39 +75,39 @@ The pattern when you need to run multiple queries — possibly across multiple c ```bash # Fork a working branch from main. -omnigraph branch create --from main onboarding/2026-04-25 graph.omni +omnigraph branch create --from main onboarding/2026-04-25 ./graph.omni # Run any number of mutations on the branch — each one is its own publish on the branch.
# Concurrent reads of `main` are unaffected. omnigraph change --branch onboarding/2026-04-25 \ - --query mutations.gq --name register_employee \ - --params '{"name":"Alice","age":30}' graph.omni + --query ./mutations.gq --name register_employee \ + --params '{"name":"Alice","age":30}' ./graph.omni omnigraph change --branch onboarding/2026-04-25 \ - --query mutations.gq --name register_employee \ - --params '{"name":"Bob","age":25}' graph.omni + --query ./mutations.gq --name register_employee \ + --params '{"name":"Bob","age":25}' ./graph.omni omnigraph change --branch onboarding/2026-04-25 \ - --query mutations.gq --name link_to_team \ - --params '{"name":"Alice","team":"Acme"}' graph.omni + --query ./mutations.gq --name link_to_team \ + --params '{"name":"Alice","team":"Acme"}' ./graph.omni # Inspect the branch — read queries work just like on main. omnigraph read --branch onboarding/2026-04-25 \ - --query queries.gq --name list_employees graph.omni + --query ./queries.gq --name list_employees ./graph.omni # Happy with what's on the branch? Merge it. This is one atomic publish: # `main` flips to include every commit on the branch. -omnigraph branch merge onboarding/2026-04-25 --into main graph.omni +omnigraph branch merge onboarding/2026-04-25 --into main ./graph.omni # OR: not happy? Throw it away. `main` is untouched. -# omnigraph branch delete onboarding/2026-04-25 graph.omni +# omnigraph branch delete onboarding/2026-04-25 ./graph.omni ``` Properties: - Each query on the branch is its own publisher commit — so they're individually atomic. Per-query CAS works on branches just like on main. - The branch lives on disk. Process crash mid-workflow? Re-open and resume. - Multiple agents can work on different branches in parallel without blocking each other. -- The merge is a three-way merge at the row level.
Conflicts surface as structured merge-conflict kinds (`DivergentInsert`, `DivergentUpdate`, `DeleteVsUpdate`, …) so callers can handle them programmatically. +- The merge is a three-way merge at the row level. Conflicts surface as `OmniError::MergeConflicts(Vec<MergeConflict>)`, with structured kinds (`DivergentInsert`, `DivergentUpdate`, `DeleteVsUpdate`, …) so callers can handle them programmatically. ### 4. Coordinating multiple agents @@ -115,28 +115,28 @@ Two agents writing to the same graph independently: ```bash # Agent A -omnigraph branch create --from main agent-a/work graph.omni -omnigraph change --branch agent-a/work … graph.omni +omnigraph branch create --from main agent-a/work ./graph.omni +omnigraph change --branch agent-a/work … ./graph.omni # … many mutations … -omnigraph branch merge agent-a/work --into main graph.omni +omnigraph branch merge agent-a/work --into main ./graph.omni # Agent B (running concurrently) -omnigraph branch create --from main agent-b/work graph.omni -omnigraph change --branch agent-b/work … graph.omni +omnigraph branch create --from main agent-b/work ./graph.omni +omnigraph change --branch agent-b/work … ./graph.omni # … many mutations … -omnigraph branch merge agent-b/work --into main graph.omni +omnigraph branch merge agent-b/work --into main ./graph.omni ``` Each agent sees a consistent snapshot of `main` at the time it forked. The first merge to `main` lands as a fast-forward (or a no-op if no concurrent change). The second merge runs three-way: rows touched by both branches surface as `MergeConflict`s for the caller to resolve. 
-This is the workflow agentic loops are designed around: **branches are the unit of "an agent's working set."** +This is the workflow MR-797 / agentic loops are designed around: **branches are the unit of "an agent's working set."** ## Failure modes | Scenario | What happens | Caller action | |---|---|---| | Single query fails mid-flight | Publisher never publishes; target unchanged | Read the error, decide whether to retry | -| Concurrent writers race the same `(table, branch)` | Publisher CAS rejects the loser with a version-mismatch conflict | Refresh handle, retry the query | +| Concurrent writers race the same `(table, branch)` | Publisher CAS rejects the loser with `ManifestConflictDetails::ExpectedVersionMismatch` | Refresh handle, retry the query | | Branch with N successful mutations, then merge fails (three-way conflict) | Each individual mutation already committed on the branch; merge surfaces `MergeConflicts` | Inspect, decide whether to keep working on the branch, abandon it (`branch_delete`), or resolve and re-merge | | Process crashes mid-branch-workflow | Each completed mutation on the branch is durable | Re-open the graph, continue where you left off | @@ -161,8 +161,8 @@ This is the workflow agentic loops are designed around: **branches are the unit ## See also -- [`docs/user/branches-commits.md`](index.md) — branch and commit-graph mechanics. -- [`docs/dev/merge.md`](../../dev/merge.md) — three-way merge details and conflict kinds. -- [`docs/user/query-language.md`](../queries/index.md) — `.gq` syntax for the multi-statement queries used above. -- [`docs/dev/writes.md`](../../dev/writes.md) — the per-query commit pipeline that gives single-query atomicity. -- [`docs/dev/invariants.md`](../../dev/invariants.md) — the architectural rule. +- [`docs/user/branches-commits.md`](branches-commits.md) — branch and commit-graph mechanics. +- [`docs/dev/merge.md`](../dev/merge.md) — three-way merge details and conflict kinds.
+- [`docs/user/query-language.md`](query-language.md) — `.gq` syntax for the multi-statement queries used above. +- [`docs/dev/runs.md`](../dev/runs.md) — the per-query commit pipeline that gives single-query atomicity. +- [`docs/dev/invariants.md`](../dev/invariants.md) — the architectural rule. diff --git a/openapi.json b/openapi.json index 7333248..d1fa337 100644 --- a/openapi.json +++ b/openapi.json @@ -7,85 +7,17 @@ "name": "MIT", "identifier": "MIT" }, - "version": "0.7.2" + "version": "0.6.0" }, "paths": { - "/graphs": { - "get": { - "tags": [ - "management" - ], - "summary": "List every graph currently registered with this server (MR-668).", - "description": "Multi-graph mode only. In single mode, the route returns 405 — there's\nno registry to enumerate. Cedar-gated by the server-level policy via\nthe `graph_list` action against `Omnigraph::Server::\"root\"`.\n\nOrder: alphabetical by `graph_id` (server-sorted so clients see\ndeterministic output across requests).", - "operationId": "listGraphs", - "responses": { - "200": { - "description": "List of registered graphs", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/GraphListResponse" - } - } - } - }, - "401": { - "description": "Unauthorized", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - }, - "403": { - "description": "Forbidden", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - }, - "405": { - "description": "Method not allowed (single-graph mode)", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - } - }, - "security": [ - { - "bearer_token": [] - } - ] - } - }, - "/graphs/{graph_id}/branches": { + "/branches": { "get": { "tags": [ "branches" ], "summary": "List all branches.", "description": "Returns branch names sorted alphabetically.
Read-only.", - "operationId": "cluster_listBranches", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "operationId": "listBranches", "responses": { "200": { "description": "List of branches", @@ -130,18 +62,7 @@ ], "summary": "Create a new branch.", "description": "Forks `name` off of `from` (defaults to `main`). The new branch shares\ntable data with its parent until it is mutated. Returns 409 if `name`\nalready exists.", - "operationId": "cluster_createBranch", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "operationId": "createBranch", "requestBody": { "content": { "application/json": { @@ -221,25 +142,14 @@ ] } }, - "/graphs/{graph_id}/branches/merge": { + "/branches/merge": { "post": { "tags": [ "branches" ], "summary": "Merge one branch into another.", "description": "Merges `source` into `target` (defaults to `main`). Outcome is one of\n`already_up_to_date`, `fast_forward`, or `merged`. Returns 409 with the\nlist of conflicts if the merge cannot be completed; the target is left\nunchanged in that case. **Destructive** to `target` on success.", - "operationId": "cluster_mergeBranches", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "operationId": "mergeBranches", "requestBody": { "content": { "application/json": { @@ -319,24 +229,15 @@ ] } }, - "/graphs/{graph_id}/branches/{branch}": { + "/branches/{branch}": { "delete": { "tags": [ "branches" ], "summary": "Delete a branch.", "description": "**Irreversible.** Removes the branch pointer; commits remain reachable\nonly if referenced by another branch. 
Returns 404 if the branch does not\nexist.", - "operationId": "cluster_deleteBranch", + "operationId": "deleteBranch", "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - }, { "name": "branch", "in": "path", @@ -406,25 +307,14 @@ ] } }, - "/graphs/{graph_id}/change": { + "/change": { "post": { "tags": [ "mutations" ], "summary": "**Deprecated** — use [`POST /mutate`](#tag/mutations/operation/mutate) instead.", - "description": "Apply a GQ mutation to a branch. Behavior is unchanged; the route is\nkept indefinitely for back-compat. New integrations should target\n`POST /mutate`, which has identical semantics and a name that pairs\ncleanly with `POST /query`. Responses from this route include\n`Deprecation: true` and `Link: <mutate>; rel=\"successor-version\"`\nheaders per RFC 9745 / RFC 8288 so SDKs and proxies can surface the\nsignal.", - "operationId": "cluster_change", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "description": "Apply a GQ mutation to a branch. Behavior is unchanged; the route is\nkept indefinitely for back-compat. New integrations should target\n`POST /mutate`, which has identical semantics and a name that pairs\ncleanly with `POST /query`.
Responses from this route include\n`Deprecation: true` and `Link: </mutate>; rel=\"successor-version\"`\nheaders per RFC 9745 / RFC 8288 so SDKs and proxies can surface the\nsignal.", + "operationId": "change", "requestBody": { "content": { "application/json": { @@ -437,7 +327,7 @@ }, "responses": { "200": { - "description": "Mutation results (response includes `Deprecation: true` + `Link: <mutate>; rel=\"successor-version\"`)", + "description": "Mutation results (response includes `Deprecation: true` + `Link: </mutate>; rel=\"successor-version\"`)", "content": { "application/json": { "schema": { @@ -505,24 +395,15 @@ ] } }, - "/graphs/{graph_id}/commits": { + "/commits": { "get": { "tags": [ "commits" ], "summary": "List commits.", "description": "Filter by `branch` to get the commits on a single branch (most recent\nfirst); omit to list across all branches. Read-only.", - "operationId": "cluster_listCommits", + "operationId": "listCommits", "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - }, { "name": "branch", "in": "query", @@ -574,24 +455,15 @@ ] } }, - "/graphs/{graph_id}/commits/{commit_id}": { + "/commits/{commit_id}": { "get": { "tags": [ "commits" ], "summary": "Get a single commit.", "description": "Returns the commit's manifest version, parent commit(s), and creation\nmetadata. Read-only.", - "operationId": "cluster_getCommit", + "operationId": "getCommit", "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - }, { "name": "commit_id", "in": "path", @@ -651,25 +523,14 @@ ] } }, - "/graphs/{graph_id}/export": { + "/export": { "post": { "tags": [ "queries" ], "summary": "Stream the contents of a branch as NDJSON.", "description": "Emits one JSON object per line (`application/x-ndjson`). 
Filter with\n`type_names` (node/edge type names) and/or `table_keys`; both empty\nstreams the entire branch. Suitable for large exports — the response is\nstreamed, not buffered. Read-only.", - "operationId": "cluster_export", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "operationId": "export", "requestBody": { "content": { "application/json": { @@ -725,52 +586,21 @@ ] } }, - "/graphs/{graph_id}/ingest": { - "post": { + "/graphs": { + "get": { "tags": [ - "mutations" + "management" ], - "summary": "**Deprecated** — use [`POST /load`](#tag/mutations/operation/load) instead.", - "description": "Bulk-load NDJSON data into a branch. Behavior is unchanged; the route is\nkept indefinitely for back-compat. New integrations should target\n`POST /load`, which has identical semantics. Responses from this route\ninclude `Deprecation: true` and `Link: <load>; rel=\"successor-version\"`\nheaders per RFC 9745 / RFC 8288 so SDKs and proxies can surface the signal.", - "operationId": "cluster_ingest", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], - "requestBody": { - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/IngestRequest" - } - } - }, - "required": true - }, + "summary": "List every graph currently registered with this server (MR-668).", + "description": "Multi-graph mode only. In single mode, the route returns 405 — there's\nno registry to enumerate.
Cedar-gated by the server-level policy via\nthe `graph_list` action against `Omnigraph::Server::\"root\"`.\n\nOrder: alphabetical by `graph_id` (server-sorted so clients see\ndeterministic output across requests).", + "operationId": "listGraphs", "responses": { "200": { - "description": "Load results (response includes `Deprecation: true` + `Link: <load>; rel=\"successor-version\"`)", + "description": "List of registered graphs", "content": { "application/json": { "schema": { - "$ref": "#/components/schemas/IngestOutput" - } - } - } - }, - "400": { - "description": "Bad request", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" + "$ref": "#/components/schemas/GraphListResponse" } } } @@ -795,8 +625,8 @@ } } }, - "429": { - "description": "Per-actor admission cap exceeded; honor `Retry-After` header", + "405": { + "description": "Method not allowed (single-graph mode)", "content": { "application/json": { "schema": { @@ -806,7 +636,6 @@ } } }, - "deprecated": true, "security": [ { "bearer_token": [] @@ -814,25 +643,36 @@ ] } }, - "/graphs/{graph_id}/load": { + "/healthz": { + "get": { + "tags": [ + "health" + ], + "summary": "Liveness probe.", + "description": "Returns server status and version. Unauthenticated; safe to call from any\ncaller. Use this to confirm the server is reachable before invoking other\nendpoints.", + "operationId": "health", + "responses": { + "200": { + "description": "Server is healthy", + "content": { + "application/json": { + "schema": { + "$ref": "#/components/schemas/HealthOutput" + } + } + } + } + } + } + }, + "/ingest": { "post": { "tags": [ "mutations" ], - "summary": "Bulk-load NDJSON data into a branch (canonical load endpoint).", - "description": "`data` is NDJSON with one record per line. `mode` controls behavior on\nexisting rows: `merge` upserts by id (default), `append` blindly inserts,\n`overwrite` replaces table contents. 
Branch creation is opt-in by\npresence of `from`: with `from` set, a missing `branch` is created from\nit; without `from`, `branch` must already exist — a missing branch is a\n404, never an implicit fork. **Destructive** when `mode` is `overwrite`\nor when the load produces conflicting writes.\n\nThe legacy `POST /ingest` route has identical semantics and is kept as a\ndeprecated alias.", - "operationId": "cluster_load", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "summary": "Bulk-ingest NDJSON data into a branch.", + "description": "`data` is NDJSON with one record per line. `mode` controls behavior on\nexisting rows: `merge` upserts by id (default), `append` blindly inserts,\n`overwrite` replaces table contents. If `branch` does not exist it is\ncreated from `from` (defaults to `main`). **Destructive** when `mode` is\n`overwrite` or when ingest produces conflicting writes.", + "operationId": "ingest", "requestBody": { "content": { "application/json": { @@ -845,7 +685,7 @@ }, "responses": { "200": { - "description": "Load results", + "description": "Ingest results", "content": { "application/json": { "schema": { @@ -902,25 +742,14 @@ ] } }, - "/graphs/{graph_id}/mutate": { + "/mutate": { "post": { "tags": [ "mutations" ], "summary": "Apply a GQ mutation to a branch (canonical mutation endpoint).", "description": "Writes to the named `branch` (defaults to `main`). Mutations are atomic\nper call and produce a new commit. Returns counts of nodes and edges\naffected. **Destructive**: on success the branch is updated; rejected\nmutations may still acquire locks briefly. Returns 409 on merge conflict.\n\nPairs with `POST /query` (read-only).
The legacy `POST /change` route\nhas identical semantics and is kept as a deprecated alias.", - "operationId": "cluster_mutate", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "operationId": "mutate", "requestBody": { "content": { "application/json": { @@ -1000,216 +829,14 @@ ] } }, - "/graphs/{graph_id}/queries": { - "get": { - "tags": [ - "queries" - ], - "summary": "List the graph's exposed stored queries as a typed tool catalog.", - "description": "Returns every stored query in the `queries:` registry, each\nwith its MCP tool name, read/mutate flag, description/instruction, and\ntyped parameters — enough for a client to register them as tools without\nfetching `.gq` source. Cluster-served graphs have no per-query expose flag,\nso the catalog lists them all. Read-gated; the catalog is graph-wide (branch\nindependent — `read` is authorized against `main`).
**Not** Cedar-filtered\nper query yet, so it can list a query whose `invoke_query` the caller\nlacks (a known gap until per-query authorization lands).", - "operationId": "cluster_list_queries", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], - "responses": { - "200": { - "description": "Stored-query catalog (every stored query, with typed params)", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/QueriesCatalogOutput" - } - } - } - }, - "401": { - "description": "Unauthorized", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - }, - "403": { - "description": "Forbidden", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - } - }, - "security": [ - { - "bearer_token": [] - } - ] - } - }, - "/graphs/{graph_id}/queries/{name}": { - "post": { - "tags": [ - "queries" - ], - "summary": "Invoke a curated, server-side stored query by name.", - "description": "The query source comes from the graph's `queries:` registry, not the\nrequest body β€” callers send only runtime inputs (`params`, `branch`,\n`snapshot`). Gated by the `invoke_query` Cedar action at the boundary;\na stored *mutation* additionally passes the engine's `change` gate\n(double-gated). An actor **without** `invoke_query` cannot tell a denied\nquery from a missing one β€” both return the same 404, so the catalog\ncan't be probed without the grant. 
Once `invoke_query` is held, the\ninner `read`/`change` gate may surface a 403 for an existing query the\nactor can't run (the intended double-gate signal).", - "operationId": "cluster_invoke_query", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - }, - { - "name": "name", - "in": "path", - "description": "Stored query name (the registry key)", - "required": true, - "schema": { - "type": "string" - } - } - ], - "requestBody": { - "content": { - "application/json": { - "schema": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/InvokeStoredQueryRequest" - } - ] - } - } - } - }, - "responses": { - "200": { - "description": "Read envelope (ReadOutput) or mutation envelope (ChangeOutput), serialized untagged", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/InvokeStoredQueryResponse" - } - } - } - }, - "400": { - "description": "Bad request (param type error; snapshot on a stored mutation)", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - }, - "401": { - "description": "Unauthorized", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - }, - "403": { - "description": "Forbidden (the inner `change` gate for a stored mutation)", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - }, - "404": { - "description": "Unknown stored query, or `invoke_query` denied β€” indistinguishable to a caller without the grant", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - }, - "409": { - "description": "Merge conflict", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - }, - "429": { - 
"description": "Per-actor admission cap exceeded; honor `Retry-After` header", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - }, - "500": { - "description": "Policy evaluation error (a denial is reported as 404, not 500)", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - } - }, - "security": [ - { - "bearer_token": [] - } - ] - } - }, - "/graphs/{graph_id}/query": { + "/query": { "post": { "tags": [ "queries" ], "summary": "Execute an inline read query (friendlier-named alternative to `POST /read`).", "description": "Designed for ad-hoc exploration and AI-agent tool-use: short field\nnames (`query`, `name`) match the CLI `-e` flag and the GQ `query`\nkeyword. Mutations (`insert`/`update`/`delete`) are rejected with 400\n-- use `POST /mutate` (or its deprecated alias `POST /change`) for\nwrite queries. Otherwise behaves identically to `POST /read`: same\ntarget semantics (branch xor snapshot), same Cedar action (Read),\nsame response shape.", - "operationId": "cluster_query", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "operationId": "query", "requestBody": { "content": { "application/json": { @@ -1269,25 +896,14 @@ ] } }, - "/graphs/{graph_id}/read": { + "/read": { "post": { "tags": [ "queries" ], "summary": "**Deprecated** β€” use [`POST /query`](#tag/queries/operation/query) instead.", - "description": "Execute a GQ read query. Behavior is unchanged from prior releases; the\nroute is kept indefinitely for byte-stable back-compat. New integrations\nshould target `POST /query`, which has clean field names (`query` /\n`name`) and a 400-on-mutation guard. 
Responses from this route include\n`Deprecation: true` and `Link: <query>; rel=\"successor-version\"`\nheaders per RFC 9745 / RFC 8288 so SDKs and proxies can surface the\nsignal.", - "operationId": "cluster_read", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "description": "Execute a GQ read query. Behavior is unchanged from prior releases; the\nroute is kept indefinitely for byte-stable back-compat. New integrations\nshould target `POST /query`, which has clean field names (`query` /\n`name`) and a 400-on-mutation guard. Responses from this route include\n`Deprecation: true` and `Link: </query>; rel=\"successor-version\"`\nheaders per RFC 9745 / RFC 8288 so SDKs and proxies can surface the\nsignal.", + "operationId": "read", "requestBody": { "content": { "application/json": { @@ -1300,7 +916,7 @@ }, "responses": { "200": { - "description": "Query results (response includes `Deprecation: true` + `Link: <query>; rel=\"successor-version\"`)", + "description": "Query results (response includes `Deprecation: true` + `Link: </query>; rel=\"successor-version\"`)", "content": { "application/json": { "schema": { @@ -1348,25 +964,14 @@ ] } }, - "/graphs/{graph_id}/schema": { + "/schema": { "get": { "tags": [ "schema" ], "summary": "Read the current schema source.", "description": "Returns the project's schema as a single string in `.pg` source form.\nUseful for clients that want to introspect available types and tables\nbefore constructing GQ queries. 
Read-only.", - "operationId": "cluster_getSchema", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "operationId": "getSchema", "responses": { "200": { "description": "Current schema source", @@ -1406,25 +1011,14 @@ ] } }, - "/graphs/{graph_id}/schema/apply": { + "/schema/apply": { "post": { "tags": [ "mutations" ], "summary": "Apply a schema migration.", - "description": "Cluster-backed servers reject this route with `409 Conflict`; operators\nmust apply schema changes through `omnigraph cluster apply` and restart.\n\nDiffs `schema_source` against the current schema and applies the resulting\nmigration steps (add/drop type, add/drop column, etc.). **Destructive**:\nsome steps drop data. Returns the list of steps applied; if `applied` is\nfalse the diff was unsupported and no changes were made.", - "operationId": "cluster_applySchema", - "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - } - ], + "description": "Diffs `schema_source` against the current schema and applies the resulting\nmigration steps (add/drop type, add/drop column, etc.). **Destructive**:\nsome steps drop data. 
Returns the list of steps applied; if `applied` is\nfalse the diff was unsupported and no changes were made.", + "operationId": "applySchema", "requestBody": { "content": { "application/json": { @@ -1476,16 +1070,6 @@ } } }, - "409": { - "description": "Schema apply is disabled for cluster-backed serving; use `omnigraph cluster apply` and restart", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/ErrorOutput" - } - } - } - }, "429": { "description": "Per-actor admission cap exceeded; honor `Retry-After` header", "content": { @@ -1504,24 +1088,15 @@ ] } }, - "/graphs/{graph_id}/snapshot": { + "/snapshot": { "get": { "tags": [ "snapshots" ], "summary": "Read the current snapshot of a branch.", "description": "Returns the manifest version plus per-table metadata (path, version, row\ncount) for every table on the branch. Defaults to `main` when `branch` is\nomitted. Read-only.", - "operationId": "cluster_getSnapshot", + "operationId": "getSnapshot", "parameters": [ - { - "name": "graph_id", - "in": "path", - "description": "Graph id to route the request to.", - "required": true, - "schema": { - "type": "string" - } - }, { "name": "branch", "in": "query", @@ -1572,28 +1147,6 @@ } ] } - }, - "/healthz": { - "get": { - "tags": [ - "health" - ], - "summary": "Liveness probe.", - "description": "Returns server status and version. Unauthenticated; safe to call from any\ncaller. 
Use this to confirm the server is reachable before invoking other\nendpoints.", - "operationId": "health", - "responses": { - "200": { - "description": "Server is healthy", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/HealthOutput" - } - } - } - } - } - } } }, "components": { @@ -1986,6 +1539,7 @@ "required": [ "uri", "branch", + "base_branch", "branch_created", "mode", "tables" @@ -1998,11 +1552,7 @@ ] }, "base_branch": { - "type": [ - "string", - "null" - ], - "description": "Base branch a fork was requested from (the request's `from`), echoed\neven when the branch already existed. `null` when `from` was absent." + "type": "string" }, "branch": { "type": "string" @@ -2035,7 +1585,7 @@ "string", "null" ], - "description": "Target branch. Defaults to `main`. Without `from`, the branch must\nalready exist β€” a missing branch is a 404, never an implicit fork." + "description": "Target branch. Created from `from` if it does not yet exist. Defaults to `main`." }, "data": { "type": "string", @@ -2047,7 +1597,7 @@ "string", "null" ], - "description": "Parent branch used to create `branch` if it does not exist. Branch\ncreation is opt-in by presence of this field; omit it to require an\nexisting branch." + "description": "Parent branch used to create `branch` if it does not exist. Defaults to `main`." }, "mode": { "oneOf": [ @@ -2078,47 +1628,6 @@ } } }, - "InvokeStoredQueryRequest": { - "type": "object", - "description": "Body for `POST /queries/{name}` β€” invokes the server-side stored query\nnamed in the path. The query source and name come from the registry,\nnever the body; only the runtime inputs are supplied here.", - "properties": { - "branch": { - "type": [ - "string", - "null" - ], - "description": "Branch to run against. Defaults to `main`; for a stored mutation the\nwrite targets this branch." 
- }, - "expect_mutation": { - "type": [ - "boolean", - "null" - ], - "description": "The kind the caller expects (RFC-011 Decision 3): `Some(false)` for\n`omnigraph query <name>`, `Some(true)` for `omnigraph mutate <name>`.\nWhen set and it disagrees with the stored query's actual kind, the\nserver rejects the call (400) so the verb asserts the kind. `None`\n(the default) skips the check β€” preserving older clients and aliases." - }, - "params": { - "description": "JSON object whose keys match the stored query's declared parameters." - }, - "snapshot": { - "type": [ - "string", - "null" - ], - "description": "Snapshot id to read from (read queries only β€” rejected for a stored\nmutation). Mutually exclusive with `branch`." - } - } - }, - "InvokeStoredQueryResponse": { - "oneOf": [ - { - "$ref": "#/components/schemas/ReadOutput" - }, - { - "$ref": "#/components/schemas/ChangeOutput" - } - ], - "description": "Response for `POST /queries/{name}`: the read envelope for a stored\nread, or the mutation envelope for a stored mutation. Serialized\n**untagged**, so the wire shape is exactly [`ReadOutput`] or\n[`ChangeOutput`] β€” classification follows the stored query, not a\nwrapper field." - }, "LoadMode": { "type": "string", "description": "Shadow enum for documenting [`LoadMode`] in the OpenAPI schema.", @@ -2189,120 +1698,6 @@ } } }, - "ParamDescriptor": { - "type": "object", - "description": "One declared parameter of a stored query, projected for the catalog.", - "required": [ - "name", - "kind", - "nullable" - ], - "properties": { - "item_kind": { - "oneOf": [ - { - "type": "null" - }, - { - "$ref": "#/components/schemas/ParamKind", - "description": "Element kind when `kind == list` (always a scalar β€” the grammar\nforbids lists of vectors or nested lists)." 
- } - ] - }, - "kind": { - "$ref": "#/components/schemas/ParamKind" - }, - "name": { - "type": "string" - }, - "nullable": { - "type": "boolean", - "description": "`false` → the caller must supply it; `true` → optional." - }, - "vector_dim": { - "type": [ - "integer", - "null" - ], - "format": "int32", - "description": "Dimension when `kind == vector`.", - "minimum": 0 - } - } - }, - "ParamKind": { - "type": "string", - "description": "The kind of a stored-query parameter, decomposed so a client (e.g. an\nMCP server) can build a typed input schema with a closed `match` and\nnever re-parse omnigraph's type spelling. `bigint`/`date`/`datetime`/\n`blob` are carried as JSON strings on the wire: a 64-bit integer past\n2^53 loses precision as a JSON number, and Date/DateTime are ISO\nstrings, Blob a blob-URI string.", - "enum": [ - "string", - "bool", - "int", - "bigint", - "float", - "date", - "datetime", - "blob", - "vector", - "list" - ] - }, - "QueriesCatalogOutput": { - "type": "object", - "description": "Response for `GET /queries`: every stored query in a graph's\nregistry, each with typed parameters.", - "required": [ - "queries" - ], - "properties": { - "queries": { - "type": "array", - "items": { - "$ref": "#/components/schemas/QueryCatalogEntry" - } - } - } - }, - "QueryCatalogEntry": { - "type": "object", - "description": "One entry in the stored-query catalog (`GET /queries`).", - "required": [ - "name", - "tool_name", - "mutation", - "params" - ], - "properties": { - "description": { - "type": [ - "string", - "null" - ] - }, - "instruction": { - "type": [ - "string", - "null" - ] - }, - "mutation": { - "type": "boolean", - "description": "`true` for a stored mutation → an MCP read-only hint of `false`." - }, - "name": { - "type": "string", - "description": "Registry key / invoke path segment (`POST /queries/{name}`)." 
- }, - "params": { - "type": "array", - "items": { - "$ref": "#/components/schemas/ParamDescriptor" - } - }, - "tool_name": { - "type": "string", - "description": "MCP tool id (the `tool_name` override, else `name`)." - } - } - }, "QueryRequest": { "type": "object", "description": "Inline read-query request for `POST /query`.\n\nFriendlier-named alternative to [`ReadRequest`] for ad-hoc reads and\nAI-agent integration. Mutations are rejected with 400 β€” use `POST\n/mutate` (or its deprecated alias `POST /change`) for write queries.\nField names are deliberately short (`query`, `name`) to match the GQ\nkeyword and the CLI `-e` flag.", diff --git a/scripts/check-agents-md.sh b/scripts/check-agents-md.sh index 02a177a..abc6469 100755 --- a/scripts/check-agents-md.sh +++ b/scripts/check-agents-md.sh @@ -34,15 +34,10 @@ PY canonical=() while IFS= read -r line; do canonical+=("$line") -done < <(find docs -type f -name '*.md' ! -path 'docs/releases/*' ! -path 'docs/internal/*' ! -path 'docs/rfcs/*' | sort) +done < <(find docs -type f -name '*.md' ! -path 'docs/releases/*' ! -path 'docs/internal/*' | sort) if [[ -d docs/releases ]]; then canonical+=("docs/releases/") fi -# RFCs are a growing collection (like releases): represent the directory, not -# every per-RFC file. The dir must be linked from an audience index. 
-if [[ -d docs/rfcs ]]; then - canonical+=("docs/rfcs/") -fi linked=() for index_file in "${index_files[@]}"; do diff --git a/scripts/install.ps1 b/scripts/install.ps1 deleted file mode 100644 index 3bfd0f1..0000000 --- a/scripts/install.ps1 +++ /dev/null @@ -1,151 +0,0 @@ -param( - [string]$RepoSlug = "ModernRelay/omnigraph", - [string]$InstallDir = "$env:USERPROFILE\.local\bin", - [ValidateSet("stable", "edge")] - [string]$ReleaseChannel = "stable", - [string]$Version = "" -) - -$ErrorActionPreference = "Stop" - -$assetName = "omnigraph-windows-x86_64.zip" -$assetStem = "omnigraph-windows-x86_64" -$workDir = Join-Path ([System.IO.Path]::GetTempPath()) ("omnigraph-install-" + [System.Guid]::NewGuid().ToString("N")) -$selectedChannel = "" - -function Write-Log { - param([string]$Message) - Write-Host "==> $Message" -} - -function Get-ReleaseBaseUrl { - param([string]$Channel) - - if ($Version -ne "") { - return "https://github.com/$RepoSlug/releases/download/$Version" - } - - if ($Channel -eq "stable") { - return "https://github.com/$RepoSlug/releases/latest/download" - } - - if ($Channel -eq "edge") { - return "https://github.com/$RepoSlug/releases/download/edge" - } - - throw "unsupported ReleaseChannel '$Channel' (expected stable or edge)" -} - -function Download-ReleaseFiles { - param( - [string]$BaseUrl, - [string]$ArchivePath, - [string]$ChecksumPath - ) - - try { - Invoke-WebRequest -UseBasicParsing -Uri "$BaseUrl/$assetName" -OutFile $ArchivePath - Invoke-WebRequest -UseBasicParsing -Uri "$BaseUrl/$assetStem.sha256" -OutFile $ChecksumPath - return $true - } catch { - return $false - } -} - -function Verify-Checksum { - param( - [string]$ArchivePath, - [string]$ChecksumPath - ) - - $checksumText = (Get-Content -Path $ChecksumPath -Raw).Trim() - $expected = ($checksumText -split "\s+")[0].ToLowerInvariant() - if ($expected -eq "") { - throw "checksum file did not contain a SHA256 digest" - } - - $actual = (Get-FileHash -Path $ArchivePath -Algorithm 
SHA256).Hash.ToLowerInvariant() - if ($actual -ne $expected) { - throw "checksum verification failed for $assetName" - } -} - -function Install-FromDirectory { - param([string]$SourceDir) - - New-Item -ItemType Directory -Force -Path $InstallDir | Out-Null - Copy-Item -Path (Join-Path $SourceDir "omnigraph.exe") -Destination (Join-Path $InstallDir "omnigraph.exe") -Force - Copy-Item -Path (Join-Path $SourceDir "omnigraph-server.exe") -Destination (Join-Path $InstallDir "omnigraph-server.exe") -Force -} - -function Install-FromRelease { - New-Item -ItemType Directory -Force -Path $workDir | Out-Null - - $archivePath = Join-Path $workDir $assetName - $checksumPath = Join-Path $workDir "$assetStem.sha256" - - if ($Version -ne "") { - $script:selectedChannel = $Version - $baseUrl = Get-ReleaseBaseUrl -Channel $ReleaseChannel - Write-Log "Downloading $assetName from $Version" - if (!(Download-ReleaseFiles -BaseUrl $baseUrl -ArchivePath $archivePath -ChecksumPath $checksumPath)) { - throw "no published binary found for $assetName at release $Version" - } - } else { - $script:selectedChannel = $ReleaseChannel - $baseUrl = Get-ReleaseBaseUrl -Channel $selectedChannel - Write-Log "Downloading $assetName from $selectedChannel" - if (!(Download-ReleaseFiles -BaseUrl $baseUrl -ArchivePath $archivePath -ChecksumPath $checksumPath)) { - if ($ReleaseChannel -ne "stable") { - throw "no published binary found for $assetName on channel $ReleaseChannel" - } - - Write-Log "Stable release binaries are not published yet; falling back to edge" - $script:selectedChannel = "edge" - $baseUrl = Get-ReleaseBaseUrl -Channel $selectedChannel - if (!(Download-ReleaseFiles -BaseUrl $baseUrl -ArchivePath $archivePath -ChecksumPath $checksumPath)) { - throw "no published binary found for $assetName on stable or edge; build from source" - } - } - } - - Verify-Checksum -ArchivePath $archivePath -ChecksumPath $checksumPath - - $extractDir = Join-Path $workDir "extract" - New-Item -ItemType Directory 
-Force -Path $extractDir | Out-Null - Expand-Archive -Path $archivePath -DestinationPath $extractDir -Force - Install-FromDirectory -SourceDir $extractDir -} - -function Print-Summary { - $omnigraphPath = Join-Path $InstallDir "omnigraph.exe" - $serverPath = Join-Path $InstallDir "omnigraph-server.exe" - - Write-Host "" - Write-Host "Installed:" - Write-Host " $omnigraphPath" - Write-Host " $serverPath" - Write-Host "" - Write-Host "Verify:" - Write-Host " $omnigraphPath version" - Write-Host " $serverPath --help" - Write-Host "" - - if ($selectedChannel -ne "") { - Write-Host "Installed from release channel: $selectedChannel" - } - - $pathParts = $env:Path -split [System.IO.Path]::PathSeparator - if ($pathParts -notcontains $InstallDir) { - Write-Host "Add $InstallDir to PATH if needed." - } -} - -try { - Install-FromRelease - Print-Summary -} finally { - if (Test-Path $workDir) { - Remove-Item -Path $workDir -Recurse -Force - } -} diff --git a/scripts/local-rustfs-bootstrap.sh b/scripts/local-rustfs-bootstrap.sh new file mode 100755 index 0000000..6327f77 --- /dev/null +++ b/scripts/local-rustfs-bootstrap.sh @@ -0,0 +1,422 @@ +#!/usr/bin/env bash +set -euo pipefail + +REPO_SLUG="${REPO_SLUG:-ModernRelay/omnigraph}" +SOURCE_REF="${SOURCE_REF:-main}" +RELEASE_CHANNEL="${RELEASE_CHANNEL:-edge}" +WORKDIR="${WORKDIR:-$PWD/.omnigraph-rustfs-demo}" +RUSTFS_CONTAINER_NAME="${RUSTFS_CONTAINER_NAME:-omnigraph-rustfs-demo}" +RUSTFS_IMAGE="${RUSTFS_IMAGE:-rustfs/rustfs:latest}" +RUSTFS_DATA_DIR="${RUSTFS_DATA_DIR:-$WORKDIR/rustfs-data}" +BUCKET="${BUCKET:-omnigraph-local}" +PREFIX="${PREFIX:-repos/context}" +BIND="${BIND:-127.0.0.1:8080}" +AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID:-rustfsadmin}" +AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY:-rustfsadmin}" +AWS_REGION="${AWS_REGION:-us-east-1}" +AWS_ENDPOINT_URL="${AWS_ENDPOINT_URL:-http://127.0.0.1:9000}" +AWS_ENDPOINT_URL_S3="${AWS_ENDPOINT_URL_S3:-$AWS_ENDPOINT_URL}" +AWS_ALLOW_HTTP="${AWS_ALLOW_HTTP:-true}" 
+AWS_S3_FORCE_PATH_STYLE="${AWS_S3_FORCE_PATH_STYLE:-true}" +FORCE_BUILD="${FORCE_BUILD:-0}" +RESET_REPO="${RESET_REPO:-0}" + +REPO_URI="s3://$BUCKET/$PREFIX" +SERVER_LOG="$WORKDIR/omnigraph-server.log" +SERVER_PID_FILE="$WORKDIR/omnigraph-server.pid" +BIN_DIR="" +FIXTURE_DIR="" +AWS_BIN="" + +log() { + printf '==> %s\n' "$*" +} + +die() { + printf 'error: %s\n' "$*" >&2 + exit 1 +} + +need_cmd() { + command -v "$1" >/dev/null 2>&1 || die "missing required command: $1" +} + +repo_root_from_shell() { + if [ -f "$PWD/Cargo.toml" ] && [ -f "$PWD/crates/omnigraph/tests/fixtures/context.pg" ]; then + printf '%s\n' "$PWD" + return 0 + fi + + if [ -n "${BASH_SOURCE[0]:-}" ] && [ -f "${BASH_SOURCE[0]}" ]; then + local candidate + candidate="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" + if [ -f "$candidate/Cargo.toml" ] && [ -f "$candidate/crates/omnigraph/tests/fixtures/context.pg" ]; then + printf '%s\n' "$candidate" + return 0 + fi + fi + + return 1 +} + +latest_release_tag() { + local json + json="$(curl -fsSL "https://api.github.com/repos/$REPO_SLUG/releases/latest" 2>/dev/null || true)" + printf '%s' "$json" | sed -n 's/.*"tag_name":[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1 +} + +platform_asset_name() { + local os arch + os="$(uname -s)" + arch="$(uname -m)" + + case "$os/$arch" in + Linux/x86_64) + printf 'omnigraph-linux-x86_64.tar.gz\n' + ;; + Darwin/x86_64) + printf 'omnigraph-macos-x86_64.tar.gz\n' + ;; + Darwin/arm64) + printf 'omnigraph-macos-arm64.tar.gz\n' + ;; + *) + return 1 + ;; + esac +} + +checksum_command() { + if command -v shasum >/dev/null 2>&1; then + printf 'shasum -a 256' + return + fi + + if command -v sha256sum >/dev/null 2>&1; then + printf 'sha256sum' + return + fi + + die "missing checksum tool: expected shasum or sha256sum" +} + +release_base_url() { + case "$RELEASE_CHANNEL" in + stable) + printf 'https://github.com/%s/releases/latest/download\n' "$REPO_SLUG" + ;; + edge) + printf 'https://github.com/%s/releases/download/edge\n' 
"$REPO_SLUG" + ;; + *) + die "unsupported RELEASE_CHANNEL '$RELEASE_CHANNEL' (expected stable or edge)" + ;; + esac +} + +verify_checksum() { + local archive="$1" + local checksum_file="$2" + local expected actual tool + + expected="$(awk '{print $1}' "$checksum_file")" + [ -n "$expected" ] || die "checksum file did not contain a SHA256 digest" + + tool="$(checksum_command)" + actual="$($tool "$archive" | awk '{print $1}')" + + [ "$actual" = "$expected" ] || die "checksum verification failed for $(basename "$archive")" +} + +ensure_aws_cli() { + if command -v aws >/dev/null 2>&1; then + AWS_BIN="$(command -v aws)" + return + fi + + need_cmd python3 + + if ! python3 -m pip --version >/dev/null 2>&1; then + python3 -m ensurepip --upgrade --user >/dev/null 2>&1 || die "aws cli not found and python3 pip bootstrap failed" + fi + + log "Installing a user-local AWS CLI" + python3 -m pip install --user awscli >/dev/null + export PATH="$HOME/.local/bin:$PATH" + + command -v aws >/dev/null 2>&1 || die "aws cli installation succeeded but aws was not found on PATH" + AWS_BIN="$(command -v aws)" +} + +download_fixture_files() { + local ref="$1" + local fixture_target="$WORKDIR/fixtures" + mkdir -p "$fixture_target" + + for file in context.pg context.jsonl; do + curl -fsSL \ + "https://raw.githubusercontent.com/$REPO_SLUG/$ref/crates/omnigraph/tests/fixtures/$file" \ + -o "$fixture_target/$file" || return 1 + done + + FIXTURE_DIR="$fixture_target" +} + +download_release_binaries() { + local asset asset_stem archive_dir archive_path checksum_path base_url + + [ "$FORCE_BUILD" = "1" ] && return 1 + + asset="$(platform_asset_name)" || return 1 + asset_stem="${asset%.tar.gz}" + archive_dir="$WORKDIR/release" + archive_path="$archive_dir/$asset" + checksum_path="$archive_dir/$asset_stem.sha256" + mkdir -p "$archive_dir" "$WORKDIR/bin" + base_url="$(release_base_url)" + + log "Downloading release asset $asset" + curl -fsSL \ + "$base_url/$asset" \ + -o "$archive_path" || return 1 + 
curl -fsSL \ + "$base_url/$asset_stem.sha256" \ + -o "$checksum_path" || return 1 + verify_checksum "$archive_path" "$checksum_path" || return 1 + tar -C "$WORKDIR/bin" -xzf "$archive_path" || return 1 + + BIN_DIR="$WORKDIR/bin" + if [ "$RELEASE_CHANNEL" = "stable" ]; then + local tag + tag="$(latest_release_tag)" + [ -n "$tag" ] || return 1 + download_fixture_files "$tag" || return 1 + else + download_fixture_files "main" || return 1 + fi +} + +build_from_source() { + local repo_root + repo_root="${1:-}" + + if [ -z "$repo_root" ]; then + need_cmd git + need_cmd cargo + + repo_root="$WORKDIR/source" + if [ ! -d "$repo_root/.git" ]; then + log "Cloning $REPO_SLUG at $SOURCE_REF" + git clone --depth 1 --branch "$SOURCE_REF" "https://github.com/$REPO_SLUG.git" "$repo_root" + fi + fi + + need_cmd cargo + log "Building omnigraph binaries from source" + ( + cd "$repo_root" + cargo build --release --locked -p omnigraph-cli -p omnigraph-server + ) + + BIN_DIR="$repo_root/target/release" + FIXTURE_DIR="$repo_root/crates/omnigraph/tests/fixtures" +} + +setup_binaries() { + local repo_root + repo_root="$(repo_root_from_shell || true)" + + if [ -n "${OMNIGRAPH_BIN_DIR:-}" ]; then + BIN_DIR="$OMNIGRAPH_BIN_DIR" + if [ -n "${OMNIGRAPH_FIXTURE_DIR:-}" ]; then + FIXTURE_DIR="$OMNIGRAPH_FIXTURE_DIR" + elif [ -n "$repo_root" ]; then + FIXTURE_DIR="$repo_root/crates/omnigraph/tests/fixtures" + fi + elif ! 
download_release_binaries; then + if [ -n "$repo_root" ]; then + build_from_source "$repo_root" + else + build_from_source + fi + fi + + [ -x "$BIN_DIR/omnigraph" ] || die "omnigraph binary not found in $BIN_DIR" + [ -x "$BIN_DIR/omnigraph-server" ] || die "omnigraph-server binary not found in $BIN_DIR" + [ -f "$FIXTURE_DIR/context.pg" ] || die "context fixture schema not found in $FIXTURE_DIR" + [ -f "$FIXTURE_DIR/context.jsonl" ] || die "context fixture data not found in $FIXTURE_DIR" +} + +start_rustfs() { + mkdir -p "$RUSTFS_DATA_DIR" + + if docker ps --format '{{.Names}}' | grep -qx "$RUSTFS_CONTAINER_NAME"; then + log "Reusing existing RustFS container $RUSTFS_CONTAINER_NAME" + return + fi + + if docker ps -a --format '{{.Names}}' | grep -qx "$RUSTFS_CONTAINER_NAME"; then + log "Removing stopped RustFS container $RUSTFS_CONTAINER_NAME" + docker rm -f "$RUSTFS_CONTAINER_NAME" >/dev/null + fi + + log "Starting RustFS on $AWS_ENDPOINT_URL_S3" + docker run -d \ + --name "$RUSTFS_CONTAINER_NAME" \ + -p 9000:9000 \ + -p 9001:9001 \ + -v "$RUSTFS_DATA_DIR:/data" \ + -e RUSTFS_ACCESS_KEY="$AWS_ACCESS_KEY_ID" \ + -e RUSTFS_SECRET_KEY="$AWS_SECRET_ACCESS_KEY" \ + "$RUSTFS_IMAGE" \ + /data >/dev/null +} + +wait_for_rustfs() { + local attempt + for attempt in $(seq 1 30); do + if "$AWS_BIN" --endpoint-url "$AWS_ENDPOINT_URL_S3" s3api list-buckets >/dev/null 2>&1; then + return + fi + sleep 2 + done + + docker logs "$RUSTFS_CONTAINER_NAME" || true + die "RustFS did not become ready" +} + +ensure_bucket() { + log "Ensuring bucket $BUCKET exists" + "$AWS_BIN" --endpoint-url "$AWS_ENDPOINT_URL_S3" \ + s3api create-bucket --bucket "$BUCKET" >/dev/null 2>&1 || true +} + +graph_prefix_has_objects() { + local key_count + key_count="$("$AWS_BIN" --endpoint-url "$AWS_ENDPOINT_URL_S3" \ + s3api list-objects-v2 \ + --bucket "$BUCKET" \ + --prefix "$PREFIX/" \ + --max-keys 1 \ + --query 'KeyCount' \ + --output text 2>/dev/null || true)" + + [ -n "$key_count" ] && [ "$key_count" != 
"None" ] && [ "$key_count" != "0" ] +} + +reset_graph_prefix() { + log "Removing existing objects under $REPO_URI" + "$AWS_BIN" --endpoint-url "$AWS_ENDPOINT_URL_S3" \ + s3 rm "s3://$BUCKET/$PREFIX" --recursive >/dev/null +} + +initialize_graph() { + if "$BIN_DIR/omnigraph" snapshot "$REPO_URI" --json >/dev/null 2>&1; then + log "Reusing existing graph at $REPO_URI" + return + fi + + if graph_prefix_has_objects; then + if [ "$RESET_REPO" = "1" ]; then + reset_graph_prefix + else + die "found existing objects under $REPO_URI but could not open an Omnigraph graph there. This usually means a previous bootstrap left a partially initialized prefix. Rerun with RESET_REPO=1 to delete that prefix and recreate it, or set PREFIX to a new value." + fi + fi + + log "Initializing graph at $REPO_URI" + "$BIN_DIR/omnigraph" init --schema "$FIXTURE_DIR/context.pg" "$REPO_URI" + + log "Loading context fixture into $REPO_URI" + "$BIN_DIR/omnigraph" load --data "$FIXTURE_DIR/context.jsonl" "$REPO_URI" +} + +start_server() { + mkdir -p "$WORKDIR" + + if [ -f "$SERVER_PID_FILE" ] && kill -0 "$(cat "$SERVER_PID_FILE")" >/dev/null 2>&1; then + log "Stopping existing server process $(cat "$SERVER_PID_FILE")" + kill "$(cat "$SERVER_PID_FILE")" >/dev/null 2>&1 || true + sleep 1 + fi + + log "Starting omnigraph-server on $BIND" + nohup "$BIN_DIR/omnigraph-server" "$REPO_URI" --bind "$BIND" >"$SERVER_LOG" 2>&1 & + echo "$!" 
> "$SERVER_PID_FILE" +} + +wait_for_server() { + local bind_host bind_port health_host base_url + bind_host="${BIND%:*}" + bind_port="${BIND##*:}" + health_host="$bind_host" + if [ "$health_host" = "0.0.0.0" ]; then + health_host="127.0.0.1" + fi + base_url="http://$health_host:$bind_port" + + for _ in $(seq 1 30); do + if curl -fsSL "$base_url/healthz" >/dev/null 2>&1; then + printf '%s\n' "$base_url" + return + fi + sleep 1 + done + + cat "$SERVER_LOG" >&2 || true + die "omnigraph-server did not pass /healthz" +} + +print_summary() { + local base_url="$1" + + cat <<EOF + +Omnigraph local RustFS demo is up. + +Server: + $base_url + +Graph URI: + $REPO_URI + +RustFS console: + http://127.0.0.1:9001 + +Useful commands: + curl -fsSL "$base_url/healthz" + curl -fsSL "$base_url/snapshot?branch=main" + "$BIN_DIR/omnigraph" snapshot "$REPO_URI" --json + tail -f "$SERVER_LOG" + kill \$(cat "$SERVER_PID_FILE") + docker logs -f "$RUSTFS_CONTAINER_NAME" + +EOF +} + +main() { + need_cmd docker + need_cmd curl + docker info >/dev/null 2>&1 || die "docker is installed but the daemon is not reachable; start Docker Desktop or another daemon and rerun" + + export AWS_ACCESS_KEY_ID + export AWS_SECRET_ACCESS_KEY + export AWS_REGION + export AWS_ENDPOINT_URL + export AWS_ENDPOINT_URL_S3 + export AWS_ALLOW_HTTP + export AWS_S3_FORCE_PATH_STYLE + + mkdir -p "$WORKDIR" + + setup_binaries + ensure_aws_cli + start_rustfs + wait_for_rustfs + ensure_bucket + initialize_graph + start_server + print_summary "$(wait_for_server)" +} + +main "$@" diff --git a/scripts/update-homebrew-formula.sh b/scripts/update-homebrew-formula.sh index f2f0df9..90a5dea 100755 --- a/scripts/update-homebrew-formula.sh +++ b/scripts/update-homebrew-formula.sh @@ -64,8 +64,20 @@ cat >"$FORMULA_PATH" <<EOF class Omnigraph < Formula desc "Typed property graph database with Git-style workflows" homepage "https://github.com/${REPO_SLUG}" - version "${VERSION}" license "MIT" + version "${VERSION}" + + on_macos do + 
depends_on arch: :arm64 + url "${MACOS_ARM_URL}" + sha256 "${MACOS_ARM_SHA}" + end + + on_linux do + url "${LINUX_X86_URL}" + sha256 "${LINUX_X86_SHA}" + end + head "https://github.com/${REPO_SLUG}.git", branch: "main" livecheck do @@ -73,21 +85,6 @@ class Omnigraph < Formula regex(/^v?(\\d+(?:\\.\\d+)+)$/i) end - on_macos do - depends_on arch: :arm64 - on_arm do - url "${MACOS_ARM_URL}" - sha256 "${MACOS_ARM_SHA}" - end - end - - on_linux do - on_intel do - url "${LINUX_X86_URL}" - sha256 "${LINUX_X86_SHA}" - end - end - def install bin.install "omnigraph", "omnigraph-server" end diff --git a/skills/omnigraph/SKILL.md b/skills/omnigraph/SKILL.md deleted file mode 100644 index 7bf044a..0000000 --- a/skills/omnigraph/SKILL.md +++ /dev/null @@ -1,414 +0,0 @@ ---- -name: omnigraph -description: Store, retrieve, and query knowledge, memory, and relationships in an Omnigraph graph, and operate a local or remote Omnigraph deployment. Use when the user wants to capture or recall facts, notes, or entities, build or query a knowledge graph or agent memory, or run Omnigraph β€” and whenever you see Omnigraph CLI commands (omnigraph init/query/mutate/load/schema/lint/embed/branch/commit/login/profile/cluster), .pg schema or .gq query files, s3:// graph URIs, bearer-authed graph endpoints, 504 errors, or a cluster.yaml / omnigraph.yaml / ~/.omnigraph/config.yaml. Covers cluster-mode deployments (cluster.yaml plan/apply, omnigraph-server --cluster), the two config surfaces (cluster.yaml + ~/.omnigraph/config.yaml), schema evolution, query linting, data writes (mutate; load needs --mode/--from), branches, embeddings, Cedar policy, and remote ops. Especially important before schema apply (plan first), any load (--mode required), any .gq/.pg edit (lint after), or any remote write (verify via commit list). 
-license: MIT (see LICENSE at repo root)
-compatibility: Requires omnigraph CLI >= 0.7.0; the unified `load`, the two config surfaces (cluster.yaml + ~/.omnigraph/config.yaml), and cluster apply/serve all require 0.7.0.
-metadata:
-  author: ModernRelay
-  version: "0.7.0"
-  repository: https://github.com/ModernRelay/omnigraph
----
-
-# Operating Omnigraph Locally
-
-This skill captures the operational rules for working with a locally or remotely deployed Omnigraph. Follow them when authoring schema, writing queries, loading data, evolving schema, or automating graph operations.
-
-## The Seven Rules
-
-1. **Lint before commit**: `omnigraph lint --schema schema.pg --query queries/foo.gq` validates both sides against each other. No running repo required.
-2. **Plan before apply**: never run `schema apply` without a successful `schema plan` first. Apply is destructive; plan is free. (Cluster mode has the same rule with different verbs: `cluster plan` before `cluster apply`; the plan embeds the engine's real migration steps.)
-3. **Branches are for data; apply is for schema**: review bulk data loads on a feature branch then merge. Schema changes go straight to `main`: in cluster mode edit the `.pg` and run `cluster apply` (a direct `schema apply` **refuses** a cluster-managed graph); `schema plan`/`apply` is for a non-cluster store.
-4. **Pick the right write command**: `mutate` for edits (typechecked, parameterized); `load` for bulk JSONL, local **or** remote, with a **required** `--mode` (`merge` upsert · `append` strict-insert · `overwrite` clean-slate). `load --from <base>` forks a review branch in one shot; bare `load` needs an existing target branch.
-5. **Parameterize everything**: never string-interpolate values into `.gq` bodies or `--params`. Declare `$var: Type` and pass via `--params`.
-6. **Expose agent operations as aliases**, not raw CLI invocations. Aliases decouple the operation name from the query implementation.
-7.
**Verify after every remote write**: compare `commit list --branch main` head before and after. The CLI's exit code is not authoritative on remote graphs; proxies can drop the response while the write commits server-side. See `references/remote-ops.md` for the verification ritual and how to recover from 504s.
-
-## Essentials: Queries, Mutations, Loads
-
-The patterns below cover the daily 80%, enough to write correct `.gq` and JSONL without leaving this file. The long tail (multi-hop, negation, aggregations, hybrid search, every decorator) is in [`references/queries.md`](references/queries.md) and [`references/schema.md`](references/schema.md).
-
-**Comments in `.pg` and `.gq` are `//`, never `#`** (the #1 parse error).
-
-### Read query (`.gq`)
-
-```gq
-query get_signal($slug: String) {
-  match {
-    $s: Signal { slug: $slug }   // inline property filter goes in the match block
-    $s formsPattern $p           // edge FormsPattern declared PascalCase, traversed lowerCamelCase
-  }
-  return { $s.slug, $s.name, $p.slug }
-}
-```
-
-- **Parameterize, never interpolate.** Declare `$var: Type` in the signature; pass via `--params '{"slug":"sig-foo"}'`. An empty signature still needs parens: `query foo() { ... }`.
-- **Edge traversal is lowerCamelCase** even though the schema declares edges PascalCase (`FormsPattern` → `formsPattern`).
-- **List/sort** by appending `order { $s.stagingTimestamp desc } limit 50` after `return`.
-- **Ranking ops (`nearest`/`bm25`/`rrf`) require a trailing `limit N`**; omitting it is a compile error. They live in `order { }`, not as filters. Scope with `match`/filters first, then rank (`order { nearest($d.embedding, $q) } limit 10`).
-
-### Mutation (`.gq`)
-
-There is **no top-level `mutation { }`**: every block is a named `query`; the verb (`insert`/`update`/`delete`) makes it a write. Dispatch with `omnigraph mutate` (not `query`).
-
-```gq
-query add_signal($slug: String, $name: String, $brief: String, $createdAt: DateTime) {
-  insert Signal { slug: $slug, name: $name, brief: $brief,
-                  stagingTimestamp: $createdAt, createdAt: $createdAt, updatedAt: $createdAt }
-}
-query link($from: String, $to: String) { insert FormsPattern { from: $from, to: $to } }
-query retitle($slug: String, $t: String) { update Signal set { name: $t } where slug = $slug }
-query remove($slug: String) { delete Signal where slug = $slug }
-```
-
-- **Every non-nullable property must be supplied** or lint fails (`T12: insert for 'Signal' must provide non-nullable property 'X'`).
-- A single mutation is insert/update-only **or** delete-only, never both (parse-time D₂ rule); split them.
-- Edges have no `@key`: give `from`/`to` slugs; the property block is `{}` when the edge has none.
-
-### Bulk load (JSONL)
-
-```jsonl
-{"type":"Signal","data":{"slug":"sig-foo","name":"Foo","brief":"…","stagingTimestamp":"2026-04-14T00:00:00Z","createdAt":"2026-04-14T00:00:00Z","updatedAt":"2026-04-14T00:00:00Z"}}
-{"edge":"FormsPattern","from":"sig-foo","to":"pat-bar","data":{}}
-```
-
-```bash
-omnigraph load --data seed.jsonl --mode merge $GRAPH   # --mode is REQUIRED (no default)
-omnigraph load --data delta.jsonl --from main --branch review --mode merge $GRAPH   # fork a review branch in one shot
-```
-
-- `--mode`: `merge` (upsert by `@key`) · `append` (fails on collision) · `overwrite` (destructive, staged). `--from <base>` forks a missing `--branch`; bare `load` needs an existing branch. Works local **and** remote.
-- **Date footgun**: `mutate --params` takes ISO strings (`Date` `"2026-04-29"`, `DateTime` `"…T00:00:00Z"`); `load` JSONL takes **integer days since epoch** for `Date` (`20572`) but ISO for `DateTime`.
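
The `Date` encoding difference noted in the bulk-load section is easy to get wrong when generating JSONL from ISO dates. The following is a minimal Python sketch of the conversion, assuming "days since epoch" means whole days since 1970-01-01 (the Unix epoch), which is consistent with the `"2026-04-29"` / `20572` pair quoted in the text. The helper names `date_to_epoch_days` and `signal_row`, and the `publishedOn` field, are hypothetical illustrations and not part of Omnigraph.

```python
import json
from datetime import date

EPOCH = date(1970, 1, 1)  # assumption: "epoch" is the Unix epoch


def date_to_epoch_days(iso_date: str) -> int:
    """Convert an ISO calendar date (YYYY-MM-DD) to whole days since EPOCH."""
    return (date.fromisoformat(iso_date) - EPOCH).days


def signal_row(slug: str, published_on: str, created_at: str) -> str:
    """Build one JSONL node row; `publishedOn` is a hypothetical Date field."""
    data = {
        "slug": slug,
        "publishedOn": date_to_epoch_days(published_on),  # Date -> integer days
        "createdAt": created_at,  # DateTime stays an ISO string
    }
    return json.dumps({"type": "Signal", "data": data})


print(date_to_epoch_days("2026-04-29"))  # 20572
```

For `mutate --params` the same value would stay an ISO string, so a conversion like this belongs only on the JSONL generation path.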
-
-### Dispatching
-
-```bash
-omnigraph alias signal sig-foo   # operator alias → its bound stored query (read or write)
-omnigraph query get_signal --params '{"slug":"sig-foo"}'   # served stored query by name (verb asserts read vs write)
-omnigraph query -e 'query q() { match { $s: Signal } return { $s.slug } limit 5 }'   # ad-hoc/inline (or: --query f.gq <name>)
-omnigraph mutate add_signal --query mutations.gq --params '{"slug":"sig-foo", ...}'   # name positional; ad-hoc file source
-omnigraph lint --schema schema.pg --query queries/foo.gq   # after EVERY .gq/.pg edit (no server needed)
-```
-
-### `.gq` grammar
-
-The non-obvious facts that bite, then the full grammar:
-
-- **Scalar param types**: `String Bool I32 I64 U32 U64 F32 F64 DateTime Date Blob`. Modifiers: `T?` (optional), `[T]` (list), `Vector(N)`. There is **no `Int`**; use `I64`.
-- **A read query needs `match` *and* `return`** (`order`/`limit` optional); a mutation has neither, only `insert`/`update`/`delete`.
-- **`limit` takes an integer literal, not a param**: `limit 50`, never `limit $n`.
-- **Variable-hop traversal**: `$p knows{1,3} $f` (`{1,}` = unbounded).
-- **Literals & calls**: `now()`, `date("2026-04-29")`, `datetime("…T00:00:00Z")`, list `[…]`.
-- **Filters** `= != > < >= <= contains`; **aggregates** `count/sum/avg/min/max` (`count($f) as n`).
-- **Stored-query metadata**: `@description("…")` / `@instruction("…")` may follow the param list.
-- **Casing**: type names uppercase-initial (`Signal`); idents/edges lowercase-initial (`formsPattern`); variables `$`-prefixed. `//` and `/* */` comments only.
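
The casing rule can be made mechanical when queries are generated from a schema. The Python sketch below rests on an assumption that the text only illustrates with a single example (`FormsPattern` to `formsPattern`): that the traversal name is the declared edge name with its first character lowercased. `traversal_name` is a hypothetical helper, not an Omnigraph API, and the rejection of `not` reflects the keyword exclusion on edge identifiers in the grammar.

```python
import re

# Type names are uppercase-initial, then ASCII letters, digits, or underscores.
TYPE_NAME = re.compile(r"[A-Z][A-Za-z0-9_]*")


def traversal_name(edge_type: str) -> str:
    """Lowercase the first character of a PascalCase edge name."""
    if not TYPE_NAME.fullmatch(edge_type):
        raise ValueError(f"not an uppercase-initial type name: {edge_type!r}")
    name = edge_type[0].lower() + edge_type[1:]
    if name == "not":
        # `not` is a keyword in match blocks, so it cannot be an edge identifier.
        raise ValueError("edge name collides with the `not` keyword")
    return name


print(traversal_name("FormsPattern"))  # formsPattern
```

Edge names containing acronyms (for example a leading run of capitals) may follow a different convention; the sketch makes no claim about those.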
- -Authoritative PEG grammar (pest) for `.gq` files ("NanoGraph" is the legacy engine name): - -```pest -// NanoGraph Query Grammar (.gq files) - -WHITESPACE = _{ " " | "\t" | "\r" | "\n" } -COMMENT = _{ LINE_COMMENT | BLOCK_COMMENT } -LINE_COMMENT = _{ "//" ~ (!"\n" ~ ANY)* } -BLOCK_COMMENT = _{ "/*" ~ (!"*/" ~ ANY)* ~ "*/" } - -query_file = { SOI ~ query_decl* ~ EOI } - -query_decl = { - "query" ~ ident ~ "(" ~ param_list? ~ ")" ~ query_annotation* ~ "{" - ~ query_body - ~ "}" -} -query_annotation = { description_annotation | instruction_annotation } -description_annotation = { "@description" ~ "(" ~ string_lit ~ ")" } -instruction_annotation = { "@instruction" ~ "(" ~ string_lit ~ ")" } - -query_body = { read_query_body | mutation_body } -mutation_body = { mutation_stmt+ } -read_query_body = { - match_clause - ~ return_clause - ~ order_clause? - ~ limit_clause? -} - -mutation_stmt = { insert_stmt | update_stmt | delete_stmt } -insert_stmt = { "insert" ~ type_name ~ "{" ~ mutation_assignment+ ~ "}" } -update_stmt = { "update" ~ type_name ~ "set" ~ "{" ~ mutation_assignment+ ~ "}" ~ "where" ~ mutation_predicate } -delete_stmt = { "delete" ~ type_name ~ "where" ~ mutation_predicate } -mutation_assignment = { ident ~ ":" ~ match_value ~ ","? } -mutation_predicate = { ident ~ comp_op ~ match_value } - -param_list = { param ~ ("," ~ param)* } -param = { variable ~ ":" ~ type_ref } - -type_ref = { (list_type | base_type | vector_type) ~ "?"? } -list_type = { "[" ~ base_type ~ "]" } -vector_type = { "Vector" ~ "(" ~ integer ~ ")" } -base_type = { "String" | "Blob" | "Bool" | "I32" | "I64" | "U32" | "U64" | "F32" | "F64" | "DateTime" | "Date" } - -match_clause = { "match" ~ "{" ~ clause+ ~ "}" } - -clause = { negation | binding | traversal | filter | text_search_clause } -text_search_clause = { search_call | fuzzy_call | match_text_call } - -// Binding: $p: Person { name: "Alice" } -binding = { variable ~ ":" ~ type_name ~ ("{" ~ prop_match_list ~ "}")? 
} - -prop_match_list = { prop_match ~ ("," ~ prop_match)* ~ ","? } -prop_match = { ident ~ ":" ~ match_value } -match_value = { literal | variable | now_call } - -// Traversal: $p knows $f -traversal = { variable ~ edge_ident ~ traversal_bounds? ~ variable } -traversal_bounds = { "{" ~ integer ~ "," ~ integer? ~ "}" } - -// Filter: $f.age > 25 -filter = { expr ~ filter_op ~ expr } - -// Negation: not { ... } -negation = { "not" ~ "{" ~ clause+ ~ "}" } - -// Return clause β€” projections separated by commas or newlines -return_clause = { "return" ~ "{" ~ projection+ ~ "}" } -projection = { expr ~ ("as" ~ ident)? ~ ","? } - -// Order clause -order_clause = { "order" ~ "{" ~ ordering ~ ("," ~ ordering)* ~ "}" } -ordering = { nearest_ordering | (expr ~ order_dir?) } -nearest_ordering = { "nearest" ~ "(" ~ prop_access ~ "," ~ expr ~ ")" } -order_dir = { "asc" | "desc" } - -// Limit clause -limit_clause = { "limit" ~ integer } - -// Expressions -expr = { now_call | nearest_ordering | search_call | fuzzy_call | match_text_call | bm25_call | rrf_call | agg_call | prop_access | variable | literal | ident } -now_call = { "now" ~ "(" ~ ")" } -search_call = { "search" ~ "(" ~ expr ~ "," ~ expr ~ ")" } -fuzzy_call = { "fuzzy" ~ "(" ~ expr ~ "," ~ expr ~ ("," ~ expr)? ~ ")" } -match_text_call = { "match_text" ~ "(" ~ expr ~ "," ~ expr ~ ")" } -bm25_call = { "bm25" ~ "(" ~ expr ~ "," ~ expr ~ ")" } -rank_expr = { nearest_ordering | bm25_call } -rrf_call = { "rrf" ~ "(" ~ rank_expr ~ "," ~ rank_expr ~ ("," ~ expr)? ~ ")" } - -prop_access = { variable ~ "." 
~ ident } - -agg_call = { agg_func ~ "(" ~ expr ~ ")" } -agg_func = { "count" | "sum" | "avg" | "min" | "max" } - -comp_op = { ">=" | "<=" | "!=" | ">" | "<" | "=" } -filter_op = { "contains" | comp_op } - -// Terminals -variable = @{ "$" ~ (ident_chars | "_") } -ident_chars = @{ (ASCII_ALPHA_LOWER | "_") ~ (ASCII_ALPHANUMERIC | "_")* } - -// Edge identifier β€” lowercase start, same as ident but used in traversal context -// Must not match keywords -edge_ident = @{ !("not" ~ !ASCII_ALPHANUMERIC) ~ (ASCII_ALPHA_LOWER | "_") ~ (ASCII_ALPHANUMERIC | "_")* } - -type_name = @{ ASCII_ALPHA_UPPER ~ (ASCII_ALPHANUMERIC | "_")* } -ident = @{ (ASCII_ALPHA_LOWER | "_") ~ (ASCII_ALPHANUMERIC | "_")* } - -literal = { list_lit | datetime_lit | date_lit | string_lit | float_lit | integer | bool_lit } -date_lit = { "date" ~ "(" ~ string_lit ~ ")" } -datetime_lit = { "datetime" ~ "(" ~ string_lit ~ ")" } -list_lit = { "[" ~ (literal ~ ("," ~ literal)*)? ~ "]" } -string_lit = @{ "\"" ~ string_char* ~ "\"" } -string_char = @{ !("\"" | "\\") ~ ANY | "\\" ~ ANY } -float_lit = @{ ASCII_DIGIT+ ~ "." ~ ASCII_DIGIT+ } -integer = @{ ASCII_DIGIT+ } -bool_lit = { "true" | "false" } -``` - -## CLI Reference (condensed) - -Notation: `<x>` required Β· `[x]` optional Β· `<a|b>` choice Β· `…` repeatable. - -**Global addressing flags**: `--as <actor>` (direct/`--store` writes only β€” a server resolves the actor from its token), `--server <name|url>`, `--cluster <dir|uri>` (cluster-managed storage, for maintenance), `--graph <id>` (selects the graph within a `--server` or `--cluster` scope), `--profile <name>` (`$OMNIGRAPH_PROFILE`), `--store <uri>`. Data commands also take a positional `file://`/`s3://` URI (`--config <dir>` is for `cluster` commands only). Output: `--json`, or reads take `--format <json|jsonl|csv|kv|table>`. 
**Write guards:** `--yes` skips the confirm prompt for a destructive write (`cleanup`, overwrite `load`, `branch delete`) against a non-local scope (it *refuses* without it when non-TTY or `--json`); `--quiet` suppresses the resolved-target echo. - -**Data plane** β€” `any` (served via `--server`/`--profile`, or direct via `--store`/URI): -- `query` (alias `read`) `<name>` β€” a **served stored query** by name (via `--server`/`--profile`); or ad-hoc `[<name>] (--query <f.gq> | -e '<GQ>')` where `<name>` picks which query in the source. `[--params <json> | --params-file <p>] [--branch <b> | --snapshot <id>] [--format <fmt> | --json]`. No positional URI β€” address via `--server`/`--store`/`--profile`. -- `mutate` (alias `change`) β€” same shape (served stored mutation by `<name>`, or ad-hoc `--query`/`-e`); `[--params …] [--branch <b>] [--json]`. The verb asserts kind: `query`β†’read, `mutate`β†’write (400 on mismatch). -- `alias <name> [args…]` β€” invoke an operator alias's bound stored query (read or write); `[--params … | --params-file <p>] [--format <fmt> | --json]` (server/graph/query come from the binding) -- `load --data <f.jsonl> --mode <overwrite|append|merge> [--branch <b>] [--from <base>] [--json]` β€” `--mode` required; `--from` forks a missing `--branch` -- `snapshot [--branch <b>] [--json]` -- `export [--branch <b>] [--type <T>…] [--table <K>…]` (streams JSONL) -- `branch <create <name> [--from <base>] | list | delete <name> | merge <source> --into <target>> [--json]` -- `commit <list [--branch <b>] | show <commit_id>> [--json]` -- `schema <plan | apply> --schema <f.pg> [--allow-data-loss] [--json]` Β· `schema show` (alias `get`) β€” `apply` **refuses a cluster-managed graph** (evolve those via `cluster apply`) - -**Served only** (needs `--server`/`--profile`): `graphs list [--json]` - -**Direct / storage** β€” reject `--server`; address by positional URI or `--cluster <dir|s3> --graph <id>`: -- `init --schema <f.pg> <uri> [--force]` -- `lint --query 
<f.gq> [--schema <f.pg>] [<uri>] [--json]` β€” offline with `--schema`, graph-backed with a URI -- `optimize [--json]` Β· `repair [--confirm] [--force] [--json]` Β· `cleanup (--keep <N> | --older-than <7d>) --confirm [--json]` -- `queries <validate [<uri>] | list> [--json]` - -**Control plane** β€” cluster (`--config <dir>`, default `.`): -- `cluster <validate | plan | apply | status | refresh | import> [--config <dir>] [--json]` -- `cluster approve <resource> --as <actor> [--config <dir>] [--json]` Β· `cluster force-unlock <lock_id> [--config <dir>] [--json]` - -**Local** (no graph): -- `policy <validate | test --tests <f> | explain --actor <a> --action <act> [--branch <b> | --target-branch <b>]> --cluster <dir> [--graph <id>]` -- `embed --seed <embed.yaml> [--reembed_all | --clean | --select "<Type>:<field>=<value>"]` -- `login <server> [--token <t>]` (prefer piping the token on stdin) Β· `logout <server>` Β· `profile <list | show [<name>]>` Β· `version` - -Pre-0.7.0 spellings (`read`/`change`/`ingest`, `--target`, positional `http://`) β†’ [`references/migrations.md`](references/migrations.md). - -## Five Ontology Design Criteria (Gruber 1993) - -Omnigraph schemas are ontologies. The canonical design criteria from Gruber's *Toward Principles for the Design of Ontologies Used for Knowledge Sharing* (Int. J. Human-Computer Studies 43:907–928) apply directly when authoring `.pg` files. - -1. **Clarity** β€” definitions should communicate intended meaning unambiguously and be independent of social or computational context. In Omnigraph: precise type names, narrow enums over `String`, `@check`/`@range` for stated invariants. A reviewer should understand the domain from the schema alone. -2. **Coherence** β€” inferences sanctioned by the schema must be consistent with the domain modeled. Gruber's trap: defining quantity as a `(magnitude, unit)` pair makes `6 feet β‰  2 yards` even though they describe the same length. 
In Omnigraph: watch for `@card`, `@unique`, and edge directionality that let the schema distinguish things the domain treats as equal. -3. **Extendibility** β€” the schema should support specialization without revising existing definitions. In Omnigraph: prefer interfaces for shared shape, leave enums open where the domain genuinely admits more, model identifiers via mapping functions rather than baking units/formats into the entity. -4. **Minimal encoding bias** β€” representation choices made for notation or implementation convenience leak into the model. In Omnigraph: don't type dates as `String` because the source API returns strings; separate conceptual entities (a publication date, a person) from their surface encoding (a year integer, a name string) when both matter. -5. **Minimal ontological commitment** β€” make as few claims about the world as the use case requires. In Omnigraph: don't add required properties, closed enums, or `@card(1..1)` "in case"; tighten later via `schema plan`/`apply` when a real constraint emerges. Weaker schemas leave consumers room to specialize. - -The criteria trade off against each other β€” Clarity wants tight definitions while Minimal Commitment wants weak ones. Gruber's resolution: *having decided a distinction is worth making, give it the tightest possible definition*. Decide what to model conservatively; once modeled, constrain precisely. - -## Schema Authoring Principles - -Twelve practical rules for `.pg` authoring β€” full text and examples in [`docs/omni-schema.md`](../../docs/omni-schema.md). In short: schema-is-the-contract Β· explicit identity via `@key` Β· model meaning not tables Β· strong intentional types Β· deliberate optionality Β· shared shape in interfaces Β· schema-level constraints (`@unique`/`@index`/`@range`/`@check`/`@card`) Β· search as a schema decision Β· edge semantics matter Β· reviewable schemas Β· intentional migrations (`@rename_from`) Β· domain clarity over ORM habits. 
-
-Design flow: entities → stable keys → relationships worth their own edge → enum candidates → uniqueness/bounds/cardinality → search needs → shared shape into interfaces → evolution plan.
-
-## Provenance Is Structural (Multi-Agent Source of Truth)
-
-When Omnigraph serves as canonical truth across multiple agents, every assertion must answer *who said it, when, based on what evidence*. This is the runtime guarantee Gruber's criteria don't cover: his agents shared vocabulary; ours additionally must share attribution. Provenance belongs in the schema, not in logs.
-
-Without structural provenance, agents cannot reconcile contradictory assertions, retract facts when a source is discredited, replay graph state at a past timestamp, or distinguish high-evidence facts from speculation.
-
-**In Omnigraph:** model provenance as a `Claim`-style interface (or a separate `Claim` node linked to each sourced fact) with required fields: `asserted_by: Actor`, `asserted_at: DateTime`, `evidence_source: Source`, optionally `confidence: F64`. Don't stash provenance into a free-text `source: String` or a `metadata: JSON` dump: structured provenance is queryable, indexable, and migratable; free-form is none of these.
-
-## Storage & Credentials
-
-A graph's bytes live in one of two backends:
-
-- **Local filesystem**: a path or `file://` URI. In cluster mode `storage:` defaults to the config directory, so local dev needs no object store.
-- **S3-compatible object storage**: AWS, Railway, Tigris, etc. (`s3://bucket/prefix`). Authenticate with the standard `AWS_*` environment contract; keep dev creds in a git-ignored `.env.omni` and source it before CLI calls:
-
-```bash
-set -a && source .env.omni && set +a
-```
-
-`init` and `load` write storage directly (bypassing the server); the server reads from it. Validate with `curl http://127.0.0.1:8080/healthz`, then `omnigraph snapshot <graph-uri> --json`.
- -## Project Layout - -### Deployment & access (omnigraph >= 0.7.0) - -- **Cluster deployment β€” the only way to serve.** A `cluster.yaml` declares the - whole deployment (graphs, schemas, stored queries, policies, optional S3 - `storage:` root); `omnigraph cluster apply` converges it and - `omnigraph-server --cluster .` (or `--cluster s3://bucket/prefix`, - config-free) serves it. See `references/cluster.md`. -- **Direct / embedded access β€” no server.** Address a graph's storage directly - with `--store <file://|s3:// uri>` or a positional URI for one-off CLI ops. - There is **no single-graph server mode** β€” the server is cluster-only. - -### The two config surfaces (omnigraph >= 0.7.0) - -Configuration has two single-owner homes (RFC-007/008), plus an -everything-explicit flag/env tier: - -| Surface | Owner | Location | Declares | -|---|---|---|---| -| **Cluster config** | the team, in the repo | `cluster.yaml` + the `.pg`/`.gq`/policy files it references | what the system **is**: graphs, schemas, queries, policies, storage | -| **Operator config** | one person | `~/.omnigraph/config.yaml` (`$OMNIGRAPH_HOME` relocates it) | who **I** am: identity, named servers, output defaults, personal aliases | -| Flags / env | per invocation | β€” | everything, explicitly | - -```yaml -# ~/.omnigraph/config.yaml β€” per operator, never committed -operator: - actor: act-andrew # default --as identity -servers: - intel-dev: - url: https://graph.example.com # no tokens here, ever -defaults: - output: table # read-format default - server: intel-dev # default served scope (or `store: file://…/g.omni` for a local default β€” mutually exclusive) - default_graph: spike # graph within a server/cluster scope -profiles: # optional named scope bundles β€” pick with --profile <name> - staging: { server: intel-staging, default_graph: spike } -aliases: # personal bindings to TEAM stored queries (see references/aliases.md) - triage: { server: intel-dev, graph: spike, query: 
weekly_triage, args: [since] } -``` - -The operator config and credentials are **auto-discovered β€” no flag points at them**: the CLI reads `$OMNIGRAPH_HOME/config.yaml` (default `~/.omnigraph/config.yaml`), and an absent file is just an empty layer (zero-config). `$OMNIGRAPH_HOME` relocates the *directory* only, not a specific file. (`--config`/`$OMNIGRAPH_CONFIG` is a separate flag for the cluster / server config β€” not this.) - -Credentials live outside config: `echo $TOKEN | omnigraph login intel-dev` -writes `~/.omnigraph/credentials` (`0600`); the matching token resolves via -`OMNIGRAPH_TOKEN_INTEL_DEV` or that file. - -**Addressing a graph**: `--store <file://|s3:// uri>` or a positional URI for -direct storage; `--server <name|url>` (+ `--graph <id>`) for a served remote; -`--profile <name>` for a named bundle; else the operator `defaults`. A remote is -addressed with `--server` (a bare `http(s)://` URL is not a graph address). Run -data-plane commands from a graph's project folder so relative `queries/`, -`schema.pg`, and `.env.omni` paths resolve. - -### What to commit - -**Commit:** `schema.pg`, `queries/*.gq`, `cluster.yaml`, `seed.md`, `seed.jsonl`, and the project's `README.md` and `CLAUDE.md`. - -**Ignore:** `.env.omni` (credentials), `.claude/` (local agent state), `*.omni/` (local graph artifacts), `__cluster/` and `graphs/` (cluster state + derived graph roots). - -### Give agents a `CLAUDE.md` - -A per-project `CLAUDE.md` tells coding agents where files live and what conventions matter. Without it, agents re-discover the same things every session. - -## Common Gotchas - -These are the traps most likely to bite. Scan this table before debugging any parse or runtime error. - -| Trap | Symptom | Fix | -|------|---------|-----| -| `#` comments in `.pg` | `parse error: expected schema_file` | Use `//` | -| Standalone `enum Foo { ... 
}` block | `parse error: expected EOI or schema_decl` | Inline: `kind: enum(a, b)` | -| `[Category]` (list of enum) | compile error | Use `[String]`; lists must contain scalars | -| `@embed(text)` without quotes | `unexpected constraint_name` | `@embed("text")` | -| `@unique(src)` on edge without body block | parse error | `@card(1..1) { @unique(src) }` | -| `load --mode merge` after `@embed` source change | stale embeddings | `omnigraph embed --reembed_all` or `load --mode overwrite` | -| `schema apply` with feature branches open | rejected | Merge or delete branches first | -| `nearest(...)` / `bm25(...)` / `rrf(...)` without `limit` | compile error | Add `limit N` | -| Adding non-nullable property without backfill | unsupported migration | Make optional β†’ backfill β†’ tighten in follow-up apply | -| `omnigraph init --json` | `unexpected argument --json` | `init` doesn't support `--json`; drop the flag | -| `omnigraph init` on an already-initialized URI | `AlreadyInitialized` error (v0.6.0+) | `--force` to re-init (skips the schema preflight; does **not** purge data) | -| `schema apply` dropping a property/type | soft-dropped or rejected (no data loss) | add `--allow-data-loss` to actually drop the column | -| Committing `.env.omni` | credential leak | Add `.env*` to `.gitignore` | -| Non-parameterized query values | typecheck surprise, injection risk | Declare `$param: Type` and pass via `--params` | -| Missing required field in `insert` | `T12: insert for 'X' must provide non-nullable property 'Y'` | Accept the param in the mutation signature | -| Long-lived feature branches | merge conflicts, schema apply blocked | Merge promptly; delete when done | -| `mutation { ... }` wrapper in `.gq` | `parse error: expected query_file` at line 1 | Use `query <name>(...) { insert T { ... } }`; there is no top-level `mutation` keyword | -| `--config` placed before subcommand | `unexpected argument --config` | Put `--config` **after** the subcommand (e.g. 
`omnigraph schema show --config X`) | -| Reading a large schema via stdout-capped tool | Truncated, garbled, or duplicated output | `omnigraph schema show > /tmp/schema.pg` first; then read the file with offset/limit | -| `omnigraph load` without `--mode` | error: `--mode` is required | Pass `--mode merge\|append\|overwrite` β€” there is no default (overwrite is destructive, so it is never implicit). `load` works against local and remote URIs | -| Blind retry after 504 | Duplicate Signal/Decision/Claim (append-only types lack `@key` dedup) | `commit list --branch main --json` first; head advanced means it landed; only retry if unchanged | -| `sync_branch()` mentioned in version-drift error | Searching for nonexistent CLI command | Server-internal directive in error text; just retry β€” the next call re-pins to the new head | -| Stale empty branches at `main`'s head | 504-orphaned forks from a timed-out `load --from`; eventually block writes | List branches, find ones at `main`'s `graph_commit_id`, `omnigraph branch delete --config X <name>` | -| `omnigraph schema apply` / `init` on a cluster-managed graph | refused β€” bypasses the cluster ledger | Evolve cluster graphs via `omnigraph cluster apply --config .`; `schema apply`/`init` are for a non-cluster store | -| `omnigraph optimize` against a table with a `Blob` property | table is **skipped**, not failed (Lance blob-v2 compaction bug) | Expected β€” `--json` reports it under `skipped`; non-blob tables still compact | -| `@unique` on a `[List]`/`Blob` column | `load` now errors loudly (was silently un-enforced before #160) | Use `@unique` only on scalar columns (and composite `@unique(a, b)`, now keyed as a true tuple) β€” uniqueness needs a type that reduces to a scalar key | - -## Deep Dives - -- `references/cluster.md` β€” cluster-mode declarative deployments: cluster.yaml, the validate/import/plan/apply loop, approval-gated deletes, `--cluster` serving, the two-file contract, recovery - -For anything beyond 
the basics, load the relevant reference file. Each is self-contained β€” load only what you need. - -| Reference | When to load | -|-----------|--------------| -| [`references/schema.md`](references/schema.md) | Editing `.pg` files, running `schema plan`/`apply`, renaming types, backfilling required fields | -| [`references/queries.md`](references/queries.md) | Writing or linting `.gq` files, search functions, aggregations, multi-hop patterns | -| [`references/data.md`](references/data.md) | Choosing between `mutate` and `load` (required `--mode`, `--from` to fork a review branch); branch review workflow; destructive ops | -| [`references/remote-ops.md`](references/remote-ops.md) | Operating against a remote/CloudFront-fronted graph: 504 verification ritual, version drift, fork-branch 504 fingerprints, append-only retry safety, operator `--server`/`login` targeting | -| [`references/search.md`](references/search.md) | Embeddings, `@embed`, vector/text ranking, scope-then-rank pattern | -| [`references/aliases.md`](references/aliases.md) | Defining aliases for agents, structured output, JSON args | -| [`references/stored-queries.md`](references/stored-queries.md) | Server-side stored-query registry: declared in `cluster.yaml`, `omnigraph queries validate/list`, `GET /graphs/{id}/queries` + `POST /graphs/{id}/queries/{name}`, `invoke_query` Cedar gating | -| [`references/server-policy.md`](references/server-policy.md) | Starting the HTTP server, routes, bearer auth, Cedar policy gating, multi-graph mode | -| [`references/commands.md`](references/commands.md) | `snapshot`, `export`, `commit list/show`, addressing & resolution | -| [`references/migrations.md`](references/migrations.md) | Migrating a pre-0.7.0 setup, or you hit an old config/command/flag/route/error and need its current form | diff --git a/skills/omnigraph/references/aliases.md b/skills/omnigraph/references/aliases.md deleted file mode 100644 index 85dba93..0000000 --- 
a/skills/omnigraph/references/aliases.md +++ /dev/null @@ -1,141 +0,0 @@ -# Aliases & Agent Automation - -## Contents -- What an alias is -- Operator alias schema -- Args binding & JSON-first parsing -- Default to structured output -- Alias naming convention -- Secrets don't belong in aliases -- Example alias set -- Invocation patterns - -How to wire Omnigraph operations for agents and scripts. - -## What an alias is - -An **operator alias** decouples a stable **operation name** from its implementation, so an agent calling `omnigraph alias signal …` keeps working as the query evolves. Aliases live in `~/.omnigraph/config.yaml` and are personal *bindings* to a **stored query on a named server** β€” they carry no query content; the stored query in the cluster catalog is the team's contract. - -```yaml -# ~/.omnigraph/config.yaml -aliases: - triage: - server: intel-dev # an entry under servers: - graph: spike # optional (multi-graph servers) - query: weekly_triage # the STORED query's name β€” never a file - args: [since] # positional args β†’ params, in order - params: { limit: 20 } # fixed defaults; positionals/--params win - format: table -``` - -```bash -omnigraph alias triage 2026-06-01 -# β†’ POST <intel-dev>/graphs/spike/queries/weekly_triage with the keyed credential -``` - -> **Alias vs stored query.** The alias is *yours* (a personal name + defaults); the **stored query** it points at is the *team's* β€” declared in `cluster.yaml`, type-checked and served by the cluster (`GET /graphs/<id>/queries`, `POST /graphs/<id>/queries/<name>`, gated by `invoke_query`). See [`stored-queries.md`](stored-queries.md). 
-## Operator Alias Schema - -```yaml -aliases: - <alias-name>: - server: <server-name> # an entry under servers: in ~/.omnigraph/config.yaml - graph: <graph-id> # optional: for multi-graph servers - query: <stored-query> # the stored query's NAME (never a file path) - args: [<name1>, <name2>] # positional CLI args β†’ named params, in order - params: { <k>: <v> } # fixed default params; positionals / --params win - format: table|kv|csv|jsonl|json # optional: output format -``` - -Dispatch with `omnigraph alias <name> [args]` β€” one subcommand for read **and** write stored queries (a mutation alias is double-gated by `invoke_query` + `change`). Aliases live in their own namespace, so one can never shadow or be shadowed by a built-in verb. - -### `args` bind to query parameters - -If `args: [slug, name, age]`, then: - -```bash -omnigraph alias foo sig-bar "Some Name" 29 -``` - -...maps to `{"slug":"sig-bar","name":"Some Name","age":29}`. - -### Args are JSON-first - -Each arg is parsed as JSON first, then falls back to string: -- `29` β†’ integer -- `"29"` β†’ string -- `true` β†’ boolean -- `Alice` β†’ string (JSON parse fails, falls back) -- `{"x":1}` β†’ object - -Explicit `--params '{...}'` wins on key conflict. - -## Default to Structured Output - -For scripts and agents, prefer `jsonl` or `json`; `table` is for humans. Set a default in `~/.omnigraph/config.yaml`: - -```yaml -defaults: - output: jsonl -``` - -Or per-alias (`format: jsonl`), or per-call (`--format jsonl`). 
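The argument binding described above (JSON-first parsing, positional args mapped onto the alias `args` list, explicit `--params` winning on conflict) can be sketched in a few lines of Python. This is an illustration of the documented rules, not the CLI's actual implementation; the function names `parse_arg` and `bind_params` are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def parse_arg(raw: str) -> Any:
    """Parse one CLI argument as JSON, falling back to the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def bind_params(
    arg_names: List[str],
    positionals: List[str],
    defaults: Optional[Dict[str, Any]] = None,
    explicit: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge parameters: alias defaults < positional args < explicit --params."""
    if len(positionals) > len(arg_names):
        raise ValueError("more positional args than the alias declares")
    params = dict(defaults or {})
    # Positional args bind to the declared names in order.
    for name, raw in zip(arg_names, positionals):
        params[name] = parse_arg(raw)
    # Explicit --params wins on key conflict.
    params.update(explicit or {})
    return params


if __name__ == "__main__":
    print(bind_params(["slug", "name", "age"], ["sig-bar", "Some Name", "29"]))
```

Here `29` becomes an integer because it is valid JSON, while `sig-bar` and `Some Name` fail JSON parsing and stay strings.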
-
-### When to use which
-
-- **`jsonl`** - one JSON object per line, first line is metadata; streams; ideal for agents
-- **`json`** - pretty-printed JSON array; smaller results; human-readable
-- **`kv`** - `key: value` per line; good for single-row lookups
-- **`csv`** - for spreadsheets or line-count-heavy analysis
-- **`table`** - default human view; don't use in automation
-
-## Alias Naming Convention
-
-Short, hyphenated, matches the conceptual operation:
-
-- `signal`, `pattern`, `element` - single lookup (typical pair with `format: kv`)
-- `signals`, `patterns`, `elements` - list
-- `signal-patterns`, `pattern-signals` - traversals
-- `add-signal`, `link-forms-pattern` - mutations
-
-## Secrets Don't Belong in Aliases
-
-Credentials never live in an alias or any config file. For remote servers, `omnigraph login <server>` stores the bearer token in `~/.omnigraph/credentials` (`0600`); for S3-backed storage, AWS creds go in `.env.omni`. Aliases should only contain query names and parameter bindings - never tokens, passwords, or API keys.
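Because `jsonl` output puts a metadata object on the first line, a consuming script has to separate it from the result rows. The sketch below shows one way to do that. The helper name `split_jsonl` and the sample metadata keys are hypothetical, since the metadata shape is not specified here.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def split_jsonl(text: str) -> Tuple[Optional[Dict[str, Any]], List[Dict[str, Any]]]:
    """Return (metadata, rows) from jsonl output whose first line is metadata."""
    objects = [json.loads(line) for line in text.splitlines() if line.strip()]
    if not objects:
        return None, []
    # First object is the metadata line; the rest are result rows.
    return objects[0], objects[1:]


if __name__ == "__main__":
    # Hypothetical sample; real metadata keys may differ.
    sample = '{"query": "recent_signals"}\n{"slug": "sig-a"}\n{"slug": "sig-b"}\n'
    meta, rows = split_jsonl(sample)
    print(meta, len(rows))
```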
-
-## Example Alias Set
-
-```yaml
-# ~/.omnigraph/config.yaml
-servers:
-  intel-dev: { url: https://graph.example.com }
-aliases:
-  # Lookups (kv format for single-row readability)
-  signal: { server: intel-dev, graph: spike, query: get_signal, args: [slug], format: kv }
-  pattern: { server: intel-dev, graph: spike, query: get_pattern, args: [slug], format: kv }
-  # Lists
-  signals: { server: intel-dev, graph: spike, query: recent_signals }
-  # Traversals
-  pattern-signals: { server: intel-dev, graph: spike, query: pattern_signals, args: [slug] }
-  # Mutations (stored mutation; invoke_query + change)
-  add-signal: { server: intel-dev, graph: spike, query: add_signal, args: [slug, name, brief, stagingTimestamp, createdAt, updatedAt] }
-  link-forms-pattern: { server: intel-dev, graph: spike, query: link_signal_forms_pattern, args: [signal, pattern] }
-```
-
-Each `query:` names a stored query the cluster serves - declare them in `cluster.yaml` and `cluster apply` first (see [`stored-queries.md`](stored-queries.md)).
-
-## Invocation Patterns
-
-```bash
-# Invoke an alias (read or write - the bound stored query decides)
-omnigraph alias signal sig-kimi-k25
-omnigraph alias add-signal sig-new "Name" "Brief" \
-  2026-04-14T00:00:00Z 2026-04-14T00:00:00Z 2026-04-14T00:00:00Z
-
-# Override output format
-omnigraph alias signals --format jsonl
-
-# Explicit --params (wins over positional args on key conflict)
-omnigraph alias signal --params '{"slug":"sig-override"}'
-```
-
-The `alias` subcommand carries `--params`/`--params-file`, `--format`/`--json`, and `--config`; the server, graph, and stored-query name come from the binding. For a different server/graph or a branch read, call `query`/`mutate` directly.
diff --git a/skills/omnigraph/references/cluster.md b/skills/omnigraph/references/cluster.md deleted file mode 100644 index 3e9f6e9..0000000 --- a/skills/omnigraph/references/cluster.md +++ /dev/null @@ -1,128 +0,0 @@ -# Cluster Mode β€” Declarative Deployments - -## Contents -- The model -- The loop (validate β†’ import β†’ plan β†’ apply β†’ serve) -- The config contract (`cluster.yaml` vs `~/.omnigraph/config.yaml`) -- Serving (`--cluster`, config-free bucket boot) -- Recovery cheat-sheet - -The cluster control plane (omnigraph >= 0.7.0) manages a whole deployment β€” -graphs, schemas, stored queries, Cedar policies β€” as **declared files in one -directory**, converged Terraform-style. It is the **only way to serve** a -graph (the server is cluster-only); the data-plane operations in the other -references work against the cluster's graphs unchanged. - -## The model - -``` -company-brain/ -β”œβ”€β”€ cluster.yaml # the deployment: graphs, schemas, queries, policies -β”œβ”€β”€ schema.pg -β”œβ”€β”€ queries/*.gq -β”œβ”€β”€ *.policy.yaml -β”œβ”€β”€ graphs/<id>.omni # DERIVED β€” created by apply, never by hand (gitignore) -└── __cluster/ # ledger + catalog + approvals β€” local state (gitignore) -``` - -```yaml -# cluster.yaml -version: 1 -# storage: s3://my-bucket/clusters/company-brain # optional β€” put ledger, -# catalog, and graph roots on S3 object storage (default: this folder) -state: { backend: cluster, lock: true } -graphs: - knowledge: - schema: schema.pg - queries: queries/ # the .gq files ARE the declaration β€” every `query <name>` registers -policies: - base: { file: base.policy.yaml, applies_to: [knowledge] } # or [cluster] for server-level -``` - -`queries` also accepts a file list (`[a.gq, b.gq]`) or a fine-grained -`name: { file: ... }` map. Discovery is loud: unparseable files and duplicate -names across files fail validation. - -## The loop (memorize this) - -```bash -omnigraph cluster validate --config . 
# parse + typecheck everything -omnigraph cluster import --config . # one-time: create the state ledger -omnigraph cluster plan --config . # preview β€” REQUIRED reading before apply -omnigraph cluster apply --config . --as <you> # converge (idempotent) -omnigraph-server --cluster . --bind 127.0.0.1:8080 --unauthenticated # serve (local dev) -``` - -- **`apply` creates graphs** at `graphs/<id>.omni` β€” there is no separate - `omnigraph init` in cluster mode. -- **Schema changes**: edit the `.pg`, `plan` shows the engine's real migration - steps (`add_property`, `drop_property [soft]`, `unsupported: …`), `apply` - migrates the live graph. **Soft drops only** β€” data-loss migrations are not - reachable from cluster apply (prior versions retain dropped columns). -- **Applied = serving on the next server restart.** No hot reload. -- **`storage: s3://bucket/prefix`** (optional) puts the entire cluster β€” state - ledger, lock, content-addressed catalog, recovery sidecars, approval - artifacts, and the derived graph roots (`<storage>/graphs/<id>.omni`) β€” on - S3-compatible object storage. The ledger CAS uses S3 conditional writes and - the lock becomes genuinely cross-machine. Absent, everything defaults to the - config directory (byte-compatible with pre-existing clusters). Credentials - come from the standard `AWS_*` env contract, never `cluster.yaml`. -- **`--as <actor>` attributes every run** (sidecars, audit, engine commits). - Defaults from your operator config's `operator.actor`; required for `approve`. -- **Destructive changes are gated**: removing a graph from `cluster.yaml` - blocks with `approval_required` until - `omnigraph cluster approve graph.<id> --config . --as <you>` records a - digest-bound approval. Any config/state drift after approving invalidates it. -- **Drift**: `cluster refresh` re-observes live graphs and marks out-of-band - changes `drifted`; the next `apply` converges them back to the declaration. 
-- **Data is NOT cluster's job**: rows flow through `omnigraph load / mutate` - against the derived roots, with branches as usual. - -## The config contract (do not blur this) - -| File | Owns | Read by | -|---|---|---| -| `cluster.yaml` | the deployment: graph set, schemas, stored queries, policy bindings, storage | `cluster` commands; the `--cluster` server | -| `~/.omnigraph/config.yaml` | per-operator: identity (`operator.actor`), named `servers:`, output defaults, personal aliases | data-plane CLI commands (tokens live in `~/.omnigraph/credentials` via `omnigraph login`) | - -Cluster commands read the operator config for **exactly one thing**: the actor -default when `--as` is omitted (`--as` > `operator.actor`). A `--cluster` server -reads it for **nothing** β€” boot from cluster state XOR the operator file, never -a merge. -Address a cluster-managed graph's data directly with `--store <storage>/graphs/<id>.omni`, -or via `--server`/aliases against a serving instance β€” that is ergonomics, not -coupling. - -## Serving - -`omnigraph-server --cluster <dir>` is exclusive (cannot combine with a URI, -`--target`, or `--config`), always multi-graph (`/graphs/{id}/...`), and -fail-fast: missing/pending/tampered state refuses boot with a remedy. Every -declared query is exposed (`GET /graphs/<id>/queries`, `POST -/graphs/<id>/queries/<name>`); Cedar bundles attach via `applies_to` -(`cluster` β†’ server-level gate incl. `graph_list`; `graph.<id>` β†’ that -graph's gate incl. `invoke_query`). Bearer tokens and bind stay process-level -(env/flags). - -**Config-free serving.** `--cluster` also accepts the storage-root URI -directly β€” `omnigraph-server --cluster s3://bucket/prefix` boots from the -applied revision on the bucket with **no checkout of the config repo**. The -ledger and catalog on the bucket are the whole deployment artifact; policy -bundles serve as digest-verified content from the catalog. 
The preferred -container shape is **bucket, no volume** (AWS ECS / Railway recipes in the -omnigraph repo's `docs/user/deployment.md`). For a mounted config directory -instead, `OMNIGRAPH_CLUSTER=<dir>` works and the image ships the CLI for -in-container `cluster apply`. - -## Recovery cheat-sheet - -| Symptom | Fix | -|---|---| -| Apply crashed mid-run | run `cluster apply` again β€” sidecars + sweep reconcile | -| Held lock | `cluster status` (shows lock id) β†’ `cluster force-unlock <LOCK_ID> --config .` | -| Lost/corrupt `state.json` | `cluster import` rebuilds from config + live graphs, then `apply` | -| Server refuses to boot | the error names its remedy (usually `cluster refresh` + `apply`, restart) | -| `approval_stale` warning | re-run `cluster approve` β€” the plan changed since you approved | - -Full reference: the omnigraph repo's `docs/user/clusters/index.md` (operator guide) -and `docs/user/clusters/config.md` (every key, flag, and diagnostic). diff --git a/skills/omnigraph/references/commands.md b/skills/omnigraph/references/commands.md deleted file mode 100644 index a76844b..0000000 --- a/skills/omnigraph/references/commands.md +++ /dev/null @@ -1,237 +0,0 @@ -# Reference Commands - -## Contents -- Inspect state (snapshot, export) -- Branches Β· commits Β· graphs -- Schema Β· lint Β· embed Β· init -- Load (bulk JSONL) -- Query / mutate -- Maintenance (optimize, cleanup) -- Stored queries -- Operator config & credentials -- Config resolution order -- Output formats Β· health check -- Cluster control plane - -Commands you'll reach for but don't need best-practice rules around. Quick syntax reference. - -## Inspect State - -### `snapshot` β€” tables + row counts - -```bash -omnigraph snapshot $REPO --branch main --json -``` - -Returns the manifest: all node/edge tables with row counts and versions. Use this to verify a load succeeded or to see what types exist. 
- -### `export` β€” full JSONL dump - -```bash -omnigraph export $REPO --branch main > graph.jsonl -``` - -Streams all nodes and edges as JSONL. The right tool for large-snapshot inspection. Don't try to page through the whole graph with read queries. - -Filter by type: - -```bash -omnigraph export $REPO --branch main --type Signal > signals.jsonl -``` - -## Branches - -```bash -omnigraph branch create --from main <branch-name> --store $REPO -omnigraph branch list --store $REPO -omnigraph branch merge <branch-name> --into main --store $REPO -omnigraph branch delete <branch-name> --store $REPO -``` - -All support `--json`. - -## Commits (History) - -```bash -omnigraph commit list $REPO --branch main -omnigraph commit show $REPO <commit-id> -``` - -Inspect graph history. Useful for "what changed between these two points" investigation. - -## Graphs (multi-graph servers) - -```bash -omnigraph graphs list --config X --json -``` - -Lists the graphs a multi-graph server serves. Remote servers only (rejects local URIs); the server must expose `GET /graphs` via `server.policy.file`. See `references/server-policy.md`. - -## Schema - -```bash -omnigraph schema plan --schema next.pg $REPO --json -omnigraph schema apply --schema next.pg $REPO -``` - -See `references/schema.md` for the full workflow. - -## Lint - -```bash -omnigraph lint --schema schema.pg --query queries/foo.gq --json -# or against a live repo: -omnigraph lint --query queries/foo.gq $REPO --json -``` - -`lint` is the single query-validation command. See `references/queries.md`. - -## Embed - -```bash -omnigraph embed --seed embed-config.yaml # fill missing -omnigraph embed --seed embed-config.yaml --reembed_all # regenerate all -omnigraph embed --seed embed-config.yaml --clean # delete -omnigraph embed --seed embed-config.yaml --select "Type:field=value" -``` - -See `references/search.md`. 
- -## Init - -```bash -omnigraph init --schema schema.pg $REPO -``` - -Creates a new graph at `$REPO` with the given schema. Declare the deployment in a `cluster.yaml` (see `references/cluster.md`). - -**Strict by default (v0.6.0+):** `init` against a URI that already holds schema files errors with `AlreadyInitialized` instead of silently overwriting. Use `omnigraph init --force` to re-init deliberately. `--force` only skips the schema-file preflight β€” it does **not** purge existing Lance datasets. - -**Note:** `init` does not accept `--json`. Drop the flag if you see `unexpected argument --json`. - -## Load (bulk JSONL) - -```bash -# bare load: operates on an existing branch (default main); --mode is required -omnigraph load --data seed.jsonl --mode merge $REPO - -# --from forks a missing branch from <base>, then loads onto it (one-shot review branch) -omnigraph load --data delta.jsonl --branch feature-x --from main --mode merge $REPO -``` - -`--mode` is **required** (no default): `merge`, `append`, or `overwrite`. `load` works against local **and** remote URIs. See `references/data.md`. - -## Query / Mutate - -```bash -omnigraph query get_signal --query queries/signals.gq --params '{"slug":"sig-foo"}' # ad-hoc file; <name> is positional -omnigraph query get_signal --server intel-dev --params '{"slug":"sig-foo"}' # served stored query by name -omnigraph mutate add_signal --query queries/mutations.gq --params '{"slug":"sig-foo",...}' -``` - -With aliases: - -```bash -omnigraph alias signal sig-foo -omnigraph alias add-signal sig-foo "Name" "Brief" 2026-04-14T00:00:00Z 2026-04-14T00:00:00Z 2026-04-14T00:00:00Z -``` - -> `query` and `mutate` also accept inline source via `-e/--query-string '<gq>'` instead of `--query <file>`. - -## Maintenance: Optimize & Cleanup (v0.6.1) - -### `optimize` β€” non-destructive Lance compaction - -```bash -omnigraph optimize $REPO --json -``` - -Compacts fragments and reclaims deleted-row space. 
Non-destructive β€” safe to run any time. **Skips tables with a `Blob` property** (Lance blob-v2 compaction decode bug); skipped tables are reported in the `skipped` field of `--json` output and in logs. Non-blob tables compact normally. Blob-table fragment count won't shrink until the upstream Lance fix lands β€” reads/writes are unaffected. - -### `cleanup` β€” destructive version GC - -```bash -omnigraph cleanup $REPO --keep 5 --older-than 7d --confirm -``` - -Garbage-collects old table versions, dropping time-travel reachability for anything pruned. **Destructive** β€” requires `--confirm`. Duration units for `--older-than`: `s`, `m`, `h`, `d`, `w`. Also reconciles orphaned per-table forks left by an interrupted `branch delete`. - -## Stored Queries (v0.6.1) - -```bash -omnigraph queries validate # type-check the stored-query registry vs the live schema (offline; exits non-zero on drift) -omnigraph queries list # list registry query names, MCP exposure, and typed params -``` - -`validate` opens the addressed graph and type-checks every applied stored query against the live schema β€” catches drift without restarting the server. `list` prints that graph's registry. Address the graph with `--store <uri>` or a positional URI. Distinct from `lint` (which validates a single `.gq` file). See `references/stored-queries.md`. - -## Operator Config & Credentials - -```bash -echo "$TOKEN" | omnigraph login <server> # store a bearer token in ~/.omnigraph/credentials (0600) -omnigraph logout <server> # remove it (idempotent) -``` - -The operator config and `~/.omnigraph/credentials` are **auto-discovered β€” there is no flag to point at them.** `$OMNIGRAPH_HOME` relocates the `~/.omnigraph` *directory* (mainly for test isolation), and an absent file is just an empty layer (zero-config). Separately, `$OMNIGRAPH_CONFIG` stands in for the `--config` flag β€” which targets the **cluster directory / server config**, never the operator config. 
See SKILL.md β†’ *The two config surfaces*. - -## Addressing a Graph - -How the CLI resolves which graph a data command (`query`, `mutate`, `load`, `branch`, …) runs against. A remote is addressed with `--server` (a bare `http(s)://` URL is not a graph address). - -Precedence (highest first): - -1. **`--store <uri>`** or a **positional `file://`/`s3://` URI** β€” direct storage access (bypasses any server; no catalog, so stored-query *names* don't resolve). `--store` is exclusive with a positional URI and with `--server`. -2. **`--server <name|url>`** (+ `--graph <id>` for a multi-graph server) β€” served/remote. A name resolves from `servers:` in `~/.omnigraph/config.yaml`; a literal `http(s)://` URL also works. -3. **`--profile <name>`** (or `$OMNIGRAPH_PROFILE`) β€” a named scope bundle from `profiles:` in the operator config (binds one of server/cluster/store + a default graph). -4. **Operator defaults** β€” `defaults.server` + `defaults.default_graph`, or `defaults.store` for a zero-flag local scope (mutually exclusive with `defaults.server`). - -Control-plane commands use `--config <dir>` (cluster); maintenance against a cluster-managed graph uses `--cluster <dir|s3://> --graph <id>`. Each command declares a **capability** β€” `any` / `served` / `direct` / `control` / `local` β€” shown in `omnigraph --help`; mis-addressing (e.g. `--server` on a `direct` verb, or a remote URI to `optimize`) fails loudly. - -For query source (`query`/`mutate`): - -1. **`--query <file>`** or **`-e/--query-string '<gq>'`** β€” exactly one (operator aliases are invoked via the separate `alias` subcommand) -2. Relative `--query` paths resolve through **`query.roots`** in config - -For params: - -1. **Explicit `--params '{...}'`** wins on key conflict -2. 
**Positional alias args** map to alias `args` list - -## Output Formats - -`--format <fmt>` on query/mutate: - -- `table` (default) β€” human-readable -- `kv` β€” `key: value` per line; good for single rows -- `csv` β€” comma-separated -- `jsonl` β€” NDJSON, one per line, with metadata line first -- `json` β€” pretty JSON array - -For admin commands (branch, commit, schema, policy): use `--json` for structured output, otherwise human text. - -## Health Check - -```bash -curl http://127.0.0.1:8080/healthz -``` - -Returns `200 OK` if the server is up. - -## Cluster Control Plane (omnigraph >= 0.7.0) - -```bash -omnigraph cluster validate --config <dir> # parse + typecheck the declaration -omnigraph cluster import --config <dir> # one-time: create the state ledger -omnigraph cluster plan --config <dir> [--json] # preview (schema changes show migration steps) -omnigraph cluster apply --config <dir> --as <actor> # converge; idempotent -omnigraph cluster approve <resource> --config <dir> --as <actor> # gate destructive changes (graph deletes) -omnigraph cluster status --config <dir> [--json] # read the ledger (read-only) -omnigraph cluster refresh --config <dir> # re-observe live graphs; flags drift -omnigraph cluster force-unlock <LOCK_ID> --config <dir> # clear a crashed run's lock (exact id from status) -``` - -Topology rule: `omnigraph schema apply` and `omnigraph init` **refuse a -cluster-managed graph** β€” in a cluster their jobs belong to `cluster apply`. -Data commands (`load`, `mutate`, branches) work either way β€” point them at the -derived root (`<dir>/graphs/<id>.omni`, or `<storage>/graphs/<id>.omni` for an -S3-backed cluster). See `references/cluster.md`. 
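The `--older-than` units listed under `cleanup` (`s`, `m`, `h`, `d`, `w`) are easy to validate in a wrapper script before invoking a destructive command. The sketch below assumes the value is a whole number followed by exactly one unit letter, as in `7d`; that grammar is an assumption, and `parse_duration` is a hypothetical helper, not part of the CLI.

```python
import re
from datetime import timedelta

# Unit letters from the cleanup docs, mapped to timedelta keyword names.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_duration(text: str) -> timedelta:
    """Parse values like '7d' or '12h' (assumed form: <integer><unit>)."""
    match = re.fullmatch(r"(\d+)([smhdw])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


if __name__ == "__main__":
    print(parse_duration("7d"))
```

Rejecting anything outside the listed units up front avoids passing a malformed retention window to a command that needs `--confirm`.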
diff --git a/skills/omnigraph/references/data.md b/skills/omnigraph/references/data.md deleted file mode 100644 index f553270..0000000 --- a/skills/omnigraph/references/data.md +++ /dev/null @@ -1,175 +0,0 @@ -# Data Changes & Branches - -## Contents -- Choose the right write command -- `mutate` β€” single edits -- `load` β€” bulk JSONL (`--mode`, `--from`) -- Branches: review before merge -- Destructive ops go through a branch -- Branch commands -- Inspecting state after changes - -How to modify data safely in Omnigraph. - -## Choose the Right Write Command - -`load` is the one bulk-JSONL command β€” local **or** remote, against any -existing branch, with a **required** `--mode`. `mutate` is for single typed -edits. - -| Task | Command | Why | -|------|---------|-----| -| Add/update a single entity | `mutate` with a named mutation | typechecked, parameterized, auditable | -| Bulk upsert by `@key` | `load --mode merge` | preserves rows not in the file | -| Additive-only bulk | `load --mode append` | fails on key collision | -| Clean-slate reseed | `load --mode overwrite` | **destructive** β€” wipes the branch | -| Bulk load onto a fresh review branch | `load --from main --mode merge --branch <name>` | forks `<name>` from `main`, loads onto it, leaves it for review | - -> **`--mode` is required** β€” there is no default. Overwrite is destructive, so -> the CLI never picks a mode for you. -> -> **Local and remote are one command.** `load` works against a local repo URI -> (writing storage directly) *and* a remote `omnigraph-server` endpoint (the -> server orchestrates the write and publishes one atomic commit). See -> [`references/remote-ops.md`](remote-ops.md) for remote-specific concerns -> (504 handling, write-verification ritual). 
- -## `mutate` β€” Single Edits - -Goes through the running server (the configured default graph, or an alias): - -```bash -omnigraph mutate add_signal \ - --query mutations.gq \ - --params '{"slug":"sig-foo","name":"Foo","brief":"...","stagingTimestamp":"2026-04-14T00:00:00Z","createdAt":"2026-04-14T00:00:00Z","updatedAt":"2026-04-14T00:00:00Z"}' -``` - -Or via an alias: - -```bash -omnigraph alias add-signal sig-foo "Foo" "..." 2026-04-14T00:00:00Z 2026-04-14T00:00:00Z 2026-04-14T00:00:00Z -``` - -Prefer `mutate` for interactive edits, mutations called from agents, and anything you want typechecked at call time. - -## `load` β€” Bulk JSONL - -JSONL format: - -```jsonl -{"type":"Signal","data":{"id":"sig-foo","slug":"sig-foo","name":"Foo","brief":"...","stagingTimestamp":"2026-04-14T00:00:00Z","createdAt":"2026-04-14T00:00:00Z","updatedAt":"2026-04-14T00:00:00Z"}} -{"edge":"FormsPattern","from":"sig-foo","to":"pat-bar","data":{}} -``` - -- Nodes: `{"type":"<NodeType>","data":{...props...}}` β€” `id` equals `slug` -- Edges: `{"edge":"<EdgeType>","from":"<src_slug>","to":"<dst_slug>","data":{...edge_props...}}` - -Load command: - -```bash -omnigraph load --data seed.jsonl --mode merge s3://my-bucket/repos/spike-intel -``` - -`--from <base>` forks a missing `--branch` from `<base>` before loading (the -one-shot review-branch flow below). Without `--from`, the target `--branch` -(default `main`) must already exist. - -### `--mode` semantics - -- **`overwrite`** (destructive) β€” replaces every node/edge table on the branch with the file's contents. **Staged**: the loader validates node/edge constraints, referential integrity, and edge cardinality *before* any data moves, so a bad file fails before touching the branch. Safe on a **first** load; risky afterward. Don't run it against `main` in production without a branch backup path. -- **`merge`** (upsert) β€” for each row, insert if `@key` is new, update if it exists. Rows not in the file are preserved. 
The safe default for incremental bulk updates. -- **`append`** (strict insert) β€” fails on key collision. Use when you're certain every row is new. - -### `merge` does NOT recompute embeddings - -If you change seed rows that feed into `@embed("source")` via `load --mode merge`, the source field updates but the embedding stays stale. - -**Fix:** run `omnigraph embed --reembed_all` after, or use `load --mode overwrite` once (which re-triggers embedding on load). - -### `overwrite` is destructive - -Wipes the entire branch's data for every node and edge type. Use only for: -- First-time seed -- Intentional full reseed on a feature branch -- Recovery scenarios - -Never on `main` without a branch backup. - -## Branches: Review Before Merge - -Branches exist for **data review**, not schema changes. Schema goes straight to `main` via `plan` + `apply`. - -### The review loop - -```bash -REPO=s3://my-bucket/repos/spike-intel - -# 1. Create feature branch from main -omnigraph branch create --from main staging-2026-04-14 --store $REPO - -# 2. Load delta onto the branch (merge mode is typical for review) -omnigraph load --data delta.jsonl --branch staging-2026-04-14 --mode merge $REPO - -# 3. Verify on the branch (reads can target --branch or --snapshot) -omnigraph query recent_signals --query queries/signals.gq --branch staging-2026-04-14 --store $REPO - -# 4. Merge to main when happy -omnigraph branch merge staging-2026-04-14 --into main --store $REPO - -# 5. Optionally delete the branch -omnigraph branch delete staging-2026-04-14 --store $REPO -``` - -### Fork a branch in one shot with `--from` - -- Bare `load` operates on an existing branch (default `main`). -- `load --from main --branch <name>` forks `<name>` from `main`, loads onto it, and leaves it for review β€” the whole review-branch flow in one command. - -Use `--from` for anything you want reviewed before it touches `main`. - -### Keep branches short-lived - -Long-lived branches compound merge risk. 
The usual flow is: create β†’ load β†’ verify β†’ merge β†’ delete, all in the same session. A week-old feature branch is a yellow flag. - -### Schema apply blocks non-main branches - -`omnigraph schema apply` rejects the request if any non-main branches exist. Merge or delete them first. This is enforced β€” it's not just a guideline. - -## Destructive Ops Go Through a Branch - -For any bulk load that could disrupt downstream queries (overwriting a heavily-referenced node type, removing edges en masse, reseeding a core table), use a feature branch: - -```bash -omnigraph load --data risky.jsonl --branch recovery-2026-04-14 \ - --from main --mode overwrite $REPO -# inspect, diff, verify reads -omnigraph branch merge recovery-2026-04-14 --into main --store $REPO -``` - -## Branch Commands (quick reference) - -```bash -omnigraph branch create --from main <branch-name> --store $REPO -omnigraph branch list --store $REPO -omnigraph branch merge <branch-name> --into main --store $REPO -omnigraph branch delete <branch-name> --store $REPO -``` - -All support `--json` for automation-friendly output. Address the graph with a -positional `file://`/`s3://` URI (shown), `--store <uri>`, or `--server <name>`. - -## Inspecting State After Changes - -```bash -omnigraph snapshot $REPO --branch main --json # tables + row counts -omnigraph export $REPO --branch main > graph.jsonl # full JSONL dump -omnigraph commit list $REPO --branch main --json # history -``` - -`export` is the right tool for large-snapshot inspection β€” don't try to page through the whole graph with read queries. - -> **Cluster note:** everything in this file applies unchanged in cluster -> deployments β€” the control plane owns schema/queries/policies; rows, loads, -> and branches stay on the data plane against the derived graph roots -> (`<dir>/graphs/<id>.omni`, or `<storage>/graphs/<id>.omni` for an S3-backed -> cluster). 
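The JSONL row shapes described in the `load` section (node rows with `type` and `data`, edge rows with `edge`, `from`, `to`, and `data`) can be generated programmatically rather than written by hand. This is a minimal sketch under the stated convention that a node's `id` equals its `slug`; `node_row` and `edge_row` are hypothetical helper names.

```python
import json
from typing import Any


def node_row(node_type: str, slug: str, **props: Any) -> str:
    """Build one node line; id mirrors slug, per the documented convention."""
    data = {"id": slug, "slug": slug, **props}
    return json.dumps({"type": node_type, "data": data})


def edge_row(edge_type: str, src_slug: str, dst_slug: str, **props: Any) -> str:
    """Build one edge line addressed by source and destination slugs."""
    return json.dumps(
        {"edge": edge_type, "from": src_slug, "to": dst_slug, "data": props}
    )


if __name__ == "__main__":
    lines = [
        node_row("Signal", "sig-foo", name="Foo", brief="..."),
        edge_row("FormsPattern", "sig-foo", "pat-bar"),
    ]
    print("\n".join(lines))
```

The output can be written to a file and passed to `omnigraph load --data <file> --mode merge` against a review branch.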
diff --git a/skills/omnigraph/references/migrations.md b/skills/omnigraph/references/migrations.md deleted file mode 100644 index 9aca605..0000000 --- a/skills/omnigraph/references/migrations.md +++ /dev/null @@ -1,65 +0,0 @@ -# Migration & Deprecations (pre-0.7.0 β†’ 0.7.0) - -The rest of this skill teaches the **current 0.7.0 surface only**. Consult this page solely when you meet an old config file, command, flag, route, or error and need its current form. Pre-0.7.0 spellings keep working as deprecated aliases (they print a warning) unless marked **removed**. - -## Config files - -| Before (pre-0.7.0) | Now (0.7.0) | -|---|---| -| `omnigraph.yaml` (one combined file) | **`cluster.yaml`** (team deployment) + **`~/.omnigraph/config.yaml`** (operator) | -| `cli.actor` | `operator.actor` | -| `cli.graph` / `server.graph` | `defaults.default_graph` (+ `defaults.server`) | -| `targets:` / `target:` | `graphs:` / `graph:` | -| `omnigraph init` scaffolds `omnigraph.yaml` | `init` scaffolds nothing β€” start a `cluster.yaml` from [`cluster.md`](cluster.md) | - -- **`omnigraph.yaml` is fully removed in 0.7.0** β€” no CLI command or server reads it, and there is **no `config migrate`**. Move team settings to `cluster.yaml` and personal settings (identity, `servers:`, `defaults:`, `aliases:`) to `~/.omnigraph/config.yaml` by hand. - -## CLI addressing (RFC-011) - -| Before | Now | -|---|---| -| `--target <name>` | **removed** β€” use `--server <name\|url>`, `--store <uri>`, or `--profile <name>` (SKILL.md β†’ *Addressing a graph*) | -| positional `http(s)://` URL β†’ a server | **removed** β€” address a remote with `--server <url>` | -| `--as` on a served (remote) write | no-op β€” the server resolves the actor from the bearer token (`--as` applies to direct `--store` writes) | -| `--cluster-graph <id>` | **removed** β€” `--cluster <dir\|uri>` is a global scope; pick the graph with `--graph <id>`. 
`--graph` now selects within a `--server` *or* `--cluster` scope | -| `query`/`mutate` `--name <q>` + positional graph URI / `--uri` | **removed**: the query name is the **positional** (`omnigraph query <name>`): a bare `<name>` invokes a served stored query (kind-asserted), `--query`/`-e` is the ad-hoc lane. Address the graph via `--server`/`--store`/`--profile` (not a positional URI on query/mutate) | - -## Server boot & schema (RFC-011) - -| Before | Now | -|---|---| -| `omnigraph-server <URI>` / `--config omnigraph.yaml` / `--target` / single-graph flat routes | **removed**: the server is **cluster-only**: `omnigraph-server --cluster <dir\|s3://>`; all HTTP is nested under `/graphs/<id>/...` (flat routes → 404) | -| `omnigraph schema apply` on a cluster-managed graph | **refused**: evolve cluster graphs via `cluster apply` (the ledger). `schema apply` still works on a non-cluster store or via `--server` | -| `policy …` / `queries validate` via `--config omnigraph.yaml` | `policy validate\|test\|explain` reads `--cluster <dir>` (+ `--graph`); `queries validate` takes the store URI | - -## CLI verbs - -| Before | Now | -|---|---| -| `omnigraph ingest …` | `omnigraph load --from main --mode merge …` | -| `omnigraph read` | `omnigraph query` | -| `omnigraph change` | `omnigraph mutate` | -| `omnigraph query lint` / `query check` | `omnigraph lint` | -| `omnigraph query --alias <n>` / `mutate --alias <n>` | `omnigraph alias <n>` (dedicated subcommand; the `--alias` flag was removed) | - -## HTTP routes - -| Before | Now | -|---|---| -| `POST /ingest` | `POST /load` | -| `POST /read` | `POST /query` | -| `POST /change` | `POST /mutate` | - -The old routes remain as **deprecated aliases** (retained indefinitely), carrying `Deprecation: true` + `Link: <successor>` response headers.
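
As a rough illustration of the rename tables above, legacy verb and route names can be rewritten mechanically. The sketch below is a hypothetical helper: `LEGACY_VERBS`, `modernize_verb`, and `modernize_route` are illustrative names, not part of Omnigraph. It only maps names, so a legacy `ingest` call still needs the `--from main --mode merge` flags listed in the CLI verbs table.

```python
# Hypothetical helper, not an Omnigraph API: rewrite pre-0.7.0 verb and
# route names to their current spellings, using the tables above.
LEGACY_VERBS = {"ingest": "load", "read": "query", "change": "mutate"}


def modernize_verb(verb: str) -> str:
    """Return the current verb for a legacy one; current verbs pass through."""
    return LEGACY_VERBS.get(verb, verb)


def modernize_route(path: str) -> str:
    """Rewrite the last path segment if it is a deprecated route alias."""
    head, sep, last = path.rpartition("/")
    return f"{head}{sep}{modernize_verb(last)}"


if __name__ == "__main__":
    for old in ("/ingest", "/graphs/spike/read", "/graphs/spike/change"):
        print(old, "->", modernize_route(old))
```

Only the final path segment is rewritten, so a graph id that happens to be named `read` or `change` is left alone.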
- -## Server token resolution - -| Before | Now | -|---|---| -| `graphs.<name>.bearer_token_env` in `omnigraph.yaml` | `omnigraph login <server>` → `~/.omnigraph/credentials`, or `OMNIGRAPH_TOKEN_<NAME>` | - -The client bearer token now comes only from `OMNIGRAPH_TOKEN_<NAME>` or the credentials file; the `omnigraph.yaml` `bearer_token_env` chain is gone with the file. - -## Older removals (still worth knowing) - -- The transactional **Run** state machine, its `/runs` routes, and the `run_publish` / `run_abort` Cedar actions were **removed in v0.4.0**. Writes publish directly: use `GET /commits` for history and the `change` action for write gating; `/runs` returns 404. diff --git a/skills/omnigraph/references/queries.md b/skills/omnigraph/references/queries.md deleted file mode 100644 index f9f84e0..0000000 --- a/skills/omnigraph/references/queries.md +++ /dev/null @@ -1,302 +0,0 @@ -# Query Authoring & Linting - -## Contents -- File organization -- Linting -- Parameterization -- Query structure -- Search functions -- Aggregations -- Filter operators -- Mutations -- Naming convention -- Aliases over raw queries - -Writing `.gq` query files in Omnigraph. - -## File Organization - -- One `.gq` file per primary node type (`signals.gq`, `patterns.gq`, `elements.gq`) -- One `mutations.gq` file for all insert/update/delete queries -- Put query files in `queries/`; cluster mode discovers `queries/*.gq` automatically - -## Linting - -```bash -omnigraph lint --schema schema.pg --query queries/signals.gq -``` - -Or (lint against a live repo): - -```bash -omnigraph lint --query queries/signals.gq s3://bucket/repo -``` - -Lint returns: -- `"status": "ok"`: all queries passed -- `"errors": N`: count of type errors (exit 1 when nonzero) -- `"warnings": N`: count of drift warnings - -Run lint after every `.gq` or `.pg` edit. Wire into precommit.
- -## Parameterization - -### Always declare typed parameters - -```gq -query get_signal($slug: String) { - match { $s: Signal { slug: $slug } } - return { $s.slug, $s.name } -} -``` - -Never string-interpolate values into query bodies. Pass them via `--params`: - -```bash -omnigraph query get_signal --query signals.gq --params '{"slug":"sig-foo"}' -``` - -The compiler typechecks parameter values against declared types. - -> For one-off/ad-hoc execution, pass the query inline instead of a file with `-e/--query-string` (v0.6.0+): `omnigraph query -e 'query q($slug: String){ match { $s: Signal { slug: $slug } } return { $s.name } }' --params '{"slug":"sig-foo"}'` (and `omnigraph mutate -e '...'`). `-e` is mutually exclusive with `--query <file>`; exactly one of the two is required. (Operator aliases are invoked via the separate `omnigraph alias <name>` subcommand.) - -## Query Structure - -### Match → Return → Order → Limit - -```gq -query recent_signals() { - match { - $s: Signal - } - return { $s.slug, $s.name, $s.stagingTimestamp } - order { $s.stagingTimestamp desc } - limit 50 -} -``` - -### Edge traversal (lowerCamelCase) - -Schema edges are PascalCase; traversal uses lowerCamelCase: - -```gq -match { - $s: Signal { slug: $slug } - $s formsPattern $p // edge FormsPattern: Signal -> Pattern -} -``` - -### Multi-hop - -Chain traversal clauses: - -```gq -query friends_of_friends($name: String) { - match { - $p: Person { name: $name } - $p knows $mid - $mid knows $fof - } - return { $fof.name } -} -``` - -### Reverse traversal - -Flip the subject/object: - -```gq -query employees_of($company: String) { - match { - $c: Company { name: $company } - $p worksAt $c - } - return { $p.name } -} -``` - -### Negation - -```gq -query orphan_signals() { - match { - $s: Signal - not { $s formsPattern $_ } - } - return { $s.slug } -} -``` - -## Search Functions - -### Text search - -```gq -match { - $d: Doc - search($d.title, $q) // full-text on @index'd String -} -```
- -```gq -match { - $d: Doc - fuzzy($d.title, $q, 2) // fuzzy match, max 2 edits -} -``` - -```gq -match { - $d: Doc - match_text($d.body, $q) // phrase match -} -``` - -### Vector/ranking (require `limit`) - -```gq -query vector_search($q: Vector(3072)) { - match { $d: Doc } - return { $d.slug, $d.title } - order { nearest($d.embedding, $q) } - limit 10 -} -``` - -`nearest`, `bm25`, and `rrf` are ranking operators, not filters. Every query using them **must** end with `limit N`; omitting it is a compile error. - -### Hybrid (reciprocal rank fusion) - -```gq -query hybrid_search($vq: Vector(3072), $tq: String) { - match { $d: Doc } - return { $d.slug, $d.title } - order { rrf(nearest($d.embedding, $vq), bm25($d.title, $tq)) } - limit 10 -} -``` - -## Aggregations - -```gq -query friend_counts() { - match { - $p: Person - $p knows $f - } - return { - $p.name - count($f) as friends - } - order { friends desc } - limit 20 -} -``` - -Supported: `count`, `sum`, `avg`, `min`, `max`. Grouping is implicit on non-aggregated return fields. - -## Filter Operators - -`=`, `!=`, `>`, `<`, `>=`, `<=`, `contains` - -```gq -match { - $p: Person - $p.age > 30 - $p.name contains "Al" -} -``` - -## Mutations - -> **No top-level `mutation { ... }` wrapper.** Agents trained on GraphQL reflexively write `mutation { insert T { ... } }`; that fails the parser at character 1 with `parse error: expected query_file`. Every executable block in a `.gq` file is a named `query`; the body's verb (`insert` / `update` / `delete`) determines whether it's a write. Dispatch via `omnigraph mutate` (not `query`).
- -### Insert - -```gq -query add_signal($slug: String, $name: String, $brief: String, - $stagingTimestamp: DateTime, $createdAt: DateTime, $updatedAt: DateTime) { - insert Signal { - slug: $slug, - name: $name, - brief: $brief, - stagingTimestamp: $stagingTimestamp, - createdAt: $createdAt, - updatedAt: $updatedAt - } -} -``` - -**Every non-nullable property must be provided.** Lint catches missing ones as: - -``` -error: T12: insert for 'Signal' must provide non-nullable property 'brief' -``` - -### Insert edge - -```gq -query link_signal_forms_pattern($signal: String, $pattern: String) { - insert FormsPattern { from: $signal, to: $pattern } -} -``` - -Edge `data` block is `{}` if the edge has no properties; just specify `from` and `to` slugs. - -### Update - -```gq -query retitle_signal($slug: String, $new_title: String) { - update Signal set { name: $new_title } where slug = $slug -} -``` - -### Delete - -```gq -query remove_signal($slug: String) { - delete Signal where slug = $slug -} -``` - -### Multi-statement - -```gq -query add_and_link($slug: String, $pattern: String, $createdAt: DateTime, $updatedAt: DateTime) { - insert Signal { slug: $slug, name: $slug, brief: $slug, - stagingTimestamp: $createdAt, createdAt: $createdAt, updatedAt: $updatedAt } - insert FormsPattern { from: $slug, to: $pattern } -} -``` - -There's no `upsert` keyword at the query level; use `load --mode merge` for bulk upsert. - -> **Insert/update-only OR delete-only (the D₂ rule).** A single mutation query may contain inserts and updates, **or** deletes, never both. Mixing a `delete` with an `insert`/`update` in the same query is rejected at parse time. (Inserts/updates go through a staged two-phase publish; deletes inline-commit, because omnigraph doesn't yet use Lance's two-phase delete API (it shipped in Lance 7.0.0 but isn't wired in), so they can't share one atomic statement.) Split a delete-then-insert into two separate mutations.
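
The D₂ rule above can be checked before a query ever reaches the parser. What follows is a minimal sketch under loud assumptions: `mutation_kinds` and `violates_mixed_rule` are hypothetical names, and the keyword scan is deliberately naive (it does not parse `.gq`, so a verb appearing inside a comment or string literal would be counted). The authoritative check remains Omnigraph's own parser.

```python
import re

# Hypothetical pre-flight check; the real rule is enforced by the parser.
# Naive on purpose: it scans for the three write verbs as whole words.
_WRITE_VERB = re.compile(r"\b(insert|update|delete)\b")


def mutation_kinds(query_body: str) -> set[str]:
    """Collect the write verbs that appear in a query body."""
    return set(_WRITE_VERB.findall(query_body))


def violates_mixed_rule(query_body: str) -> bool:
    """True when a delete shares a query with an insert or an update."""
    kinds = mutation_kinds(query_body)
    return "delete" in kinds and bool(kinds & {"insert", "update"})


if __name__ == "__main__":
    mixed = "delete Signal where slug = $slug insert Signal { slug: $slug }"
    print(violates_mixed_rule(mixed))
```

A query flagged this way would be split into a delete-only mutation followed by an insert/update-only mutation, as the rule describes.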
- -### Date and DateTime values - -Date format is asymmetric between `mutate` (parameter values) and `load` (JSONL): - -| Path | Date | DateTime | -|---|---|---| -| `mutate --params` | ISO string `"2026-04-29"` | ISO string `"2026-04-29T10:00:00Z"` | -| `load` JSONL | Integer days since epoch `20572` | ISO string `"2026-04-29T10:00:00Z"` | - -Compute the integer-days form for a given date `d`: - -```python -import datetime -(d - datetime.date(1970, 1, 1)).days # d is the date you're loading, not today() -``` - -This asymmetry is one of the most common silent type errors when bulk-loading data prepared for one path through the other. - -## Naming Convention - -`verb_object`: -- `get_signal`, `recent_signals`, `search_signals` -- `signal_patterns`, `signal_elements` (traversal queries) -- `add_signal`, `link_signal_forms_pattern` (mutations) - -## Aliases Over Raw Queries - -For anything an agent or script will call repeatedly, define an operator alias. See `references/aliases.md`. diff --git a/skills/omnigraph/references/remote-ops.md b/skills/omnigraph/references/remote-ops.md deleted file mode 100644 index e956dd7..0000000 --- a/skills/omnigraph/references/remote-ops.md +++ /dev/null @@ -1,142 +0,0 @@ -# Remote Graph Operations - -## Contents -- What's different about remote -- Verify after every write -- 504 Gateway Timeout -- Fork-branch 504 fingerprint -- Targeting a remote graph (`--server`, `login`) -- Version drift / `sync_branch()` -- `manifest_conflict` 409 -- 429 Too Many Requests -- Duplicate risk on blind retry -- Reading large schemas safely -- Prevention checklist - -When the graph URI is a remote endpoint (`omnigraph-server` behind ALB / CloudFront, bearer-authenticated) instead of a local S3 path, several CLI behaviors change in ways the local-storage workflow never exposes. This reference covers the failures and operational rituals specific to remote graphs. - -## What's different about remote - -A remote graph runs server-side.
Every write executes on the server (staged per touched table, then published atomically as a **single manifest commit** guarded by a compare-and-swap on expected table versions) and is gated by a connection-level idle timeout (CloudFront defaults to ~30s). There is no separate "run" object to poll; write status is implied by the HTTP response (and verifiable via `commit list`). The local CLI is a thin client; it never sees the commit happen, only the HTTP response. That asymmetry is the root of every gotcha below. - -| Local repo | Remote repo | -|---|---| -| CLI writes S3 directly | Server executes the write, publishes one atomic manifest commit | -| No connection timeout | ~30s idle timeout (CloudFront) | -| No admission control | Per-actor `429` + `Retry-After` on writes | -| `load` writes S3-backed storage directly | `load` is server-orchestrated: same command, one atomic commit | -| CLI exit code is authoritative | CLI exit code can lie; verify via `commit list` | - -## Verify after every write - -The CLI's exit code is **not authoritative on remote graphs**. The proxy can drop a response after the server has already committed. Always verify by comparing `main`'s head: - -```bash -HEAD_BEFORE=$(omnigraph commit list --config X --branch main --json | jq -r '.commits[0].graph_commit_id') - -# … run your load / mutate … - -HEAD_AFTER=$(omnigraph commit list --config X --branch main --json | jq -r '.commits[0].graph_commit_id') - -if [[ "$HEAD_BEFORE" != "$HEAD_AFTER" ]]; then - echo "landed" -else - echo "did NOT land - safe to retry" -fi -``` - -For a `load --from` that forks a review branch, also compare the new branch head's `graph_commit_id` against `main`'s.
**Identical means the load didn't land and an empty fork was left behind.** - -For pointed verification of a single record: - -```bash -omnigraph export --config X --type <NodeType> | grep <slug> -omnigraph export --config X --type <EdgeType> | grep <slug> -``` - -## 504 Gateway Timeout: response lost, write status unknown - -A 504 from the proxy means the server didn't respond within the idle timeout. Two server-side outcomes are possible, and **the 504 alone cannot distinguish them**: - -1. **Write completed and published**: landed, `main`'s head advanced. Common for small mutations finishing just past the 30s edge. -2. **Write still in progress**: will publish or fail soon. Re-check after a minute. - -Always verify via `commit list` before retrying. Blind retry on append-only types creates duplicates. - -## Fork-branch 504 fingerprint - -`load --from <base>` creates the branch **before** loading data. A timed-out fork-load where the data didn't land leaves an empty branch at `<base>`'s head. Stale numbered branches (`feature-v2`, `-v3`, `-v4` …) all sitting at the same `graph_commit_id` as `main` are the fingerprint of prior 504-blocked attempts. - -Find them by comparing each branch's head against `main`'s in `omnigraph branch list --config X --json`, then delete the empty ones. - -## Targeting a remote graph: `--server` and `login` - -`load`, `query`, and `mutate` all run against a remote `omnigraph-server` endpoint; there is no local-only restriction as of 0.7.0. Address an operator-defined server by name instead of pasting URLs and juggling tokens: - -```bash -echo "$TOKEN" | omnigraph login intel-dev # stores it in ~/.omnigraph/credentials (0600) -omnigraph load --server intel-dev --graph spike \ - --data delta.jsonl --from main --mode merge --branch staging -``` - -`--server <name>` resolves the URL from `~/.omnigraph/config.yaml` and the token via `OMNIGRAPH_TOKEN_<NAME>` or the credentials file. A token is only ever sent to the server it is keyed to.
`--graph <id>` selects the graph on a multi-graph server. - -## Version drift / `sync_branch()` - -``` -version drift on node:<Type>: snapshot pinned vN but dataset is at vM - call sync_branch() and retry -``` - -- `sync_branch()` is **not a CLI command**; it's a server-internal directive that leaked into the error text. Don't go looking for it. -- Cause: another actor committed to `main` between your CLI's snapshot pin and your `mutate` attempt. -- Usually self-resolves on retry: the next call re-pins. -- Calling `omnigraph snapshot` does **not** reliably re-pin for subsequent `mutate`s in the same session. -- If persistent, fall back to `load --from main` onto a fresh branch; a forked branch doesn't suffer from concurrent-commit drift on `main`. -- The cleaner, modern form of this conflict is a structured `manifest_conflict` **409** (see below). - -## `manifest_conflict` 409: stale snapshot, retry - -When another actor commits to the same branch between your query's snapshot pin and your write, the server returns a structured **`manifest_conflict` 409** carrying `table_key` / `expected` / `actual`, rather than silently overwriting. Since v0.4.2 this is the form most concurrent update/delete/merge races take. - -- **Retry it.** A 409 means your write was computed against a stale view and was rejected *before* committing, so there is no partial state and no duplicate risk. Re-issue the same call; it re-pins to the new head. -- Concurrent `mutate` × branch-merge on the same target branch resolves to either success or a clean 409 depending on who wins the server's per-table queue; both outcomes are safe. - -## 429 Too Many Requests: back off, then retry - -The server applies **per-actor admission control** to every mutating endpoint (`mutate` / `load` / `schema apply` / branch create·delete·merge).
An actor that exceeds its in-flight-request or estimated-byte budget gets a structured **HTTP 429** (`code: too_many_requests`) with a `Retry-After` header, instead of blocking unrelated actors behind a global lock. - -- This is **not** a failed write: the write never started. Honor `Retry-After` and retry; it is always safe (no partial write, no duplicate risk). -- It's per-actor, so one noisy automation can't starve others. If you hit it constantly, batch less aggressively or space your calls out. -- Read-only endpoints are not admission-gated. - -## Duplicate risk on blind retry - -After a 504, never retry without verifying first. Different node kinds have different retry semantics: - -| Kind | Retry safety | -|---|---| -| Pointer nodes (`Org`, `Person`, `Opportunity`, `Channel`, `Actor`, `ActionItem`, `Artifact`, `Meeting`, `Technology`, `Campaign`, `UseCase`) | ✓ Idempotent: `@key` upserts dedupe | -| Append-only nodes (`Signal`, `Claim`, `Decision`, `Event`, `Interaction`, `MarketingElement`, `Policy`, `Outcome`) | ✗ Duplicates on retry: verify before retrying | -| Edges | ⚠ No `@key`. Verify via `export --type <EdgeName>` + grep. Some simple edges dedupe server-side; don't rely on it. | - -## Reading large schemas safely - -Remote schemas can be large (tens of KB). Tools that cap stdout (~50KB is common) will truncate or duplicate the output silently, leading to memory-based answers from agents that look correct but reference nonexistent fields. - -Always redirect to a file before reading: - -```bash -omnigraph schema show --config X > /tmp/schema.pg -wc -l /tmp/schema.pg -``` - -Then read the file with offset/limit, not via piped stdout. - -## Prevention checklist - -- Keep mutations small. Single-node inserts finish well under the timeout. -- Prefer `mutate` over `load` for ≤ a handful of records. -- Always run `commit list` after a 504 before deciding to retry.
-- For destructive or large-batch work, use `load --from main` onto a feature branch and verify the branch head before merging. -- Read large schemas via file redirect, not piped stdout. -- A `429` (throttle) or a `manifest_conflict` `409` (stale snapshot) is always safe to retry: the write never committed. Honor `Retry-After` on a 429. diff --git a/skills/omnigraph/references/schema.md b/skills/omnigraph/references/schema.md deleted file mode 100644 index b30745b..0000000 --- a/skills/omnigraph/references/schema.md +++ /dev/null @@ -1,192 +0,0 @@ -# Schema Authoring & Evolution - -## Contents -- Authoring (.pg files) -- Evolution (schema plan/apply) -- Supported types -- Decorators (quick reference) -- Interfaces -- Design principles -- Schema evolution in cluster mode - -How to write and evolve `.pg` schemas in Omnigraph. - -## Authoring (.pg files) - -### Use `//` for comments - -Not `#`. The compiler rejects `#` with a parse error that looks like: - -``` -parse error: expected schema_file -``` - -### Enums are inline, not standalone - -The compiler does **not** accept top-level `enum Foo { ... }` blocks. Put the values inline on the property: - -```pg -kind: enum(product, technology, framework, concept, ops) @index -``` - -If the same enum appears on multiple nodes, duplicate it inline; there's no shared enum type. - -### Lists contain scalars only - -`[String]` and `[I32]` are fine. `[Category]` (a list of enum values) is **not** supported. Use `[String]` with query-side filtering, or use a single-valued enum property if one value is enough. - -### `@embed` takes a quoted string - -```pg -embedding: Vector(3072) @embed("text") @index -``` - -Not `@embed(text)`. The source property name is a string literal.
- -### Edge constraints go inside a body block - -`@unique(src, dst)` on an edge goes inside `{ }`, after `@card(...)`: - -```pg -edge PartOfArtifact: Chunk -> InformationArtifact @card(1..1) { - @unique(src) -} -``` - -### Lint after every edit - -```bash -omnigraph lint --schema schema.pg --query queries/signals.gq -``` - -This validates the schema **and** the queries against it. No running repo required. Wire it into a precommit hook. - -## Evolution (schema plan/apply) - -### Plan before apply, always - -```bash -omnigraph schema plan --schema next.pg s3://bucket/repo --json -# inspect "supported": true|false and the step list -omnigraph schema apply --schema next.pg s3://bucket/repo -``` - -If `supported: false`, fix the source before applying. Plan is free; run it as often as needed. - -Plan/apply diagnostics carry stable codes of the form **`OG-XXX-NNN`** (since v0.5.0); match on the code, not the free-form message text. - -**Destructive drops are gated (since v0.5.0).** Dropping a property or type is a soft drop by default (or rejected); to actually lose data you must opt in: - -```bash -omnigraph schema apply --schema next.pg s3://bucket/repo --allow-data-loss -``` - -Over HTTP the equivalent is `{"allow_data_loss": true}` in the schema-apply body. Without the flag, a destructive drop returns a structured diagnostic instead of silently deleting columns. - -### Apply is main-only - -`omnigraph schema apply` is rejected while any non-`main` branches exist. Delete or merge feature branches first. This is deliberate: schema changes don't go through review branches. They go straight to main via `plan` + `apply`. - -### Rename, don't replace - -Use `@rename_from(...)` on renames so the planner emits a rename step (preserves data), not a drop+add pair (loses data): - -```pg -node Account @rename_from("User") { - full_name: String @rename_from("name") -} -``` - -Works on node types, edge types, and properties.
- -### Required properties need a backfill plan - -Adding a non-nullable property to an existing node is rejected as unsupported. Pattern: - -1. Add as optional: `new_prop: String?` -2. Apply -3. Backfill via a `mutate` or `load --mode merge` -4. Tighten to required in a follow-up apply: `new_prop: String` - -### Keep `@key` stable - -Changing the key field is effectively a replace: it invalidates every external reference to the node. Treat identity changes as deliberate, multi-step migrations, not casual field renames. - -### `schema apply` blocks writes while running - -No concurrent mutations during an apply. Plan for a short read-only window. - -## Supported Types - -- **Scalars:** `String`, `Bool`, `I32`, `I64`, `U32`, `U64`, `F32`, `F64`, `Date`, `DateTime`, `Blob` -- **Collections:** `Vector(N)` (fixed-size float vector), `[ScalarType]` (list of scalar) -- **Enums:** `enum(value1, value2, ...)`, inline only; values can contain alphanumerics, underscores, hyphens -- **Optional:** any type + `?` suffix (`String?`, `[I32]?`, `Vector(4)?`) - -## Decorators (quick reference) - -**Property-level:** -- `@key`: primary key (implies index; usually one per node) -- `@unique`: uniqueness constraint -- `@index`: query optimization -- `@range(min, max)`: numeric bounds (open ranges allowed) -- `@check(prop, "regex")`: regex pattern validation on a String property -- `@embed("source_prop")`: embed from a String source into a Vector property -- `@description("...")`: metadata (no migration impact) -- `@instruction("...")`: semantic hint for LLMs/operators - -**Edge-level:** -- `@card(min..max)`: edge cardinality (default: `0..*`) - -**Type-level (nodes/edges/properties):** -- `@rename_from("OldName")`: migration-aware rename - -**Group-level (inside body block):** -- `@unique(prop1, prop2)`: composite uniqueness, enforced as a true tuple key at both intake and merge (works on edges too: `@unique(src, dst)`).
Columns must reduce to a scalar key: `@unique` on a `[List]`/`Blob` column is rejected loudly at `load` (it used to be silently un-enforced; fixed in #160). -- `@index(prop1, prop2)`: composite index - -## Interfaces - -Supported but rarely used. Declare shared property contracts and node types implement them: - -```pg -interface Searchable { - title: String @index - embedding: Vector(3072) @embed("title") -} - -node Doc implements Searchable { - slug: String @key - body: String -} -``` - -Most schemas are fine without interfaces. Reach for them only when 3+ node types need to share a property contract. - -## Design Principles (brief) - -- **Identity is explicit**: use `@key` on a semantic slug, not internal row IDs -- **Narrow types**: `Date` over `String` for dates, `enum` over `String` for lifecycle states -- **Edge semantics matter**: prefer `AuthoredBy` over `RelatedTo` -- **Constraints live in the schema**: `@unique`, `@range`, `@card` keep invariants out of application code -- **Schemas are reviewable**: clear names, explicit enums, obvious keys - -## Schema Evolution in Cluster Mode - -In a cluster deployment there is **no direct `omnigraph schema apply`**: the -schema is declared (`graphs.<id>.schema:` in `cluster.yaml`) and converged: - -```bash -$EDITOR schema.pg -omnigraph cluster plan --config . # shows the engine's migration steps -omnigraph cluster apply --config . --as <you> -# restart the --cluster server to serve the new shape -``` - -Differences from direct `schema apply` (on a non-cluster store): **soft drops -only** (`--allow-data-loss` is not reachable from cluster apply; prior versions -retain dropped columns), -and out-of-band schema changes on the live graph are *drift*: `cluster -refresh` flags them and the next `apply` converges the graph back to the -declared schema. Everything else in this file (`@rename_from`, backfills, -linting, enum discipline) applies unchanged to the `.pg` you edit.
diff --git a/skills/omnigraph/references/search.md b/skills/omnigraph/references/search.md deleted file mode 100644 index 53397ab..0000000 --- a/skills/omnigraph/references/search.md +++ /dev/null @@ -1,150 +0,0 @@ -# Search & Embeddings - -## Contents -- Embeddings are schema-declared -- Generating embeddings -- Embeddings + `load --mode merge` interaction -- Search functions in queries -- The key pattern: scope first, rank second -- Model / config - -Vector embeddings and text search in Omnigraph. - -## Embeddings are Schema-Declared - -```pg -node Chunk { - text: String - chunk_index: I32 - embedding: Vector(3072) @embed("text") @index - createdAt: DateTime -} -``` - -- `Vector(N)`: fixed-size float vector -- `@embed("source_prop")`: what text field to embed from (quoted string) -- `@index`: enables vector search on this field - -The schema says **where** embeddings live and **what** they come from. Queries don't recompute; they read. - -## Generating Embeddings - -### First time / refresh missing - -```bash -omnigraph embed --seed embed-config.yaml -``` - -Default mode is `fill_missing`: it only generates embeddings for rows without one. - -### Re-embed everything - -```bash -omnigraph embed --seed embed-config.yaml --reembed_all -``` - -Use when: -- You changed the source field: `@embed("body")` → `@embed("title")` -- You mutated text at scale and need fresh embeddings -- You switched embedding models (rare) - -### Selective refresh - -```bash -omnigraph embed --seed embed-config.yaml --select "Chunk:chunk_index=42" -``` - -Regenerate only rows matching the selector. - -### Clean (delete) embeddings - -```bash -omnigraph embed --seed embed-config.yaml --clean -``` - -## Embeddings + `load --mode merge` Interaction - -**`load --mode merge` does NOT recompute embeddings.** - -If you update rows whose source fields feed into `@embed(...)`, the source updates but the embedding stays stale. - -Two fixes: -1.
Run `omnigraph embed --reembed_all` after the merge -2. Use `load --mode overwrite` instead, which re-triggers embedding on load - -## Search Functions in Queries - -All ranking functions require `limit N`: they're order operators, not filters. - -### Vector similarity - -```gq -query nearest_chunks($q: Vector(3072)) { - match { $c: Chunk } - return { $c.text } - order { nearest($c.embedding, $q) } - limit 10 -} -``` - -### BM25 text ranking - -```gq -query top_titles($q: String) { - match { $d: Doc } - return { $d.slug, $d.title } - order { bm25($d.title, $q) } - limit 10 -} -``` - -### Hybrid (Reciprocal Rank Fusion) - -```gq -query hybrid($vq: Vector(3072), $tq: String) { - match { $d: Doc } - return { $d.slug, $d.title } - order { rrf(nearest($d.embedding, $vq), bm25($d.title, $tq)) } - limit 10 -} -``` - -### Text filter (not ranking; no `limit` required) - -```gq -match { - $d: Doc - search($d.title, $q) // full-text filter - fuzzy($d.title, $q, 2) // fuzzy filter, max 2 edits - match_text($d.body, $q) // phrase filter -} -``` - -## The Key Pattern: Scope First, Rank Second - -Filter with graph traversal before invoking vector or text ranking. Ranking over a narrow set is both cheaper and more relevant. - -```gq -query related_chunks($artifact_slug: String, $q: Vector(3072)) { - match { - $a: InformationArtifact { slug: $artifact_slug } - $c partOfArtifact $a // scope: only this artifact's chunks - } - return { $c.text } - order { nearest($c.embedding, $q) } // rank: vector similarity within scope - limit 10 -} -``` - -Don't rank over the entire chunk set if you know a traversal can narrow it first.
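
For intuition about what a hybrid `rrf(...)` ordering does with its two inputs, here is a sketch of textbook reciprocal rank fusion in Python. Everything in it is an assumption for illustration: the function name `rrf`, the list-of-ids input shape, and the constant `k = 60` (the value commonly used in the literature) are not taken from Omnigraph, whose internal constant and tie-breaking are not documented here.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank).

    Ranks start at 1. An id missing from a list contributes nothing for it.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the result is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    by_vector = ["a", "b", "c"]  # e.g. ids ordered by vector similarity
    by_text = ["b", "c", "a"]    # e.g. ids ordered by BM25
    print(rrf([by_vector, by_text]))
```

Because only ranks enter the score, the two inputs never need comparable scales, which is why a vector distance and a BM25 score can be fused at all. It also shows why scoping first helps: fewer candidates reach each ranked list.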
- -## Model / Config - -Omnigraph uses **two distinct embedding clients** (don't conflate them): - -| Client | When it runs | Default model | Configured via | -|--------|--------------|---------------|----------------| -| **Engine / load-time** | At load, when an `@embed("source")` field is populated (and `omnigraph embed`) | `gemini-embedding-2-preview` (3072-dim) | `GEMINI_API_KEY`, `OMNIGRAPH_GEMINI_BASE_URL`, `OMNIGRAPH_EMBED_*`, `OMNIGRAPH_EMBEDDINGS_MOCK` | -| **Compiler / query-time** | When a query passes a *string* to a ranking op (e.g. `nearest($c.embedding, "some text")`) and the server auto-embeds it | `text-embedding-3-small` (OpenAI-style) | `NANOGRAPH_EMBED_MODEL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL`, `NANOGRAPH_EMBEDDINGS_MOCK` | - -The vector stored in the schema is produced by the **load-time (engine)** client, so `Vector(N)` must match that model's output dimension: `Vector(3072)` for `gemini-embedding-2-preview`. If you point the query-time client at a model with a different dimension than your stored vectors, similarity search returns garbage or errors, so keep both sides on the same dimension. Vectors are stored L2-normalized. diff --git a/skills/omnigraph/references/server-policy.md b/skills/omnigraph/references/server-policy.md deleted file mode 100644 index 225c708..0000000 --- a/skills/omnigraph/references/server-policy.md +++ /dev/null @@ -1,224 +0,0 @@ -# HTTP Server & Cedar Policy - -## Contents -- Starting the server (boot sources) -- HTTP routes -- Auth -- Setup operations bypass the server -- Cedar policy -- Multi-graph mode -- Server + policy together -- Cluster-booted servers - -How to run `omnigraph-server` and gate operations with Cedar policies. - -## Starting the Server - -The server is the canonical runtime entry point: all CLI queries, mutations, and admin ops go through it. **Boot is cluster-only** (RFC-011): the server boots from a cluster and serves N graphs (N ≥ 1) under nested routes.
There is **no** single-graph / bare-URI / `omnigraph.yaml` boot. - -```bash -omnigraph-server --cluster ./company-brain --bind 127.0.0.1:8080 # a config directory … -omnigraph-server --cluster s3://bucket/prefix --bind 0.0.0.0:8080 # … or a storage-root URI (config-free) -``` - -`--cluster` boots from the cluster's applied revision (see *Cluster-Booted Servers* below). Run it in a separate terminal or background process. - -## HTTP Routes - -All per-graph routes are nested under `/graphs/{id}/...` (`{id}` = a graph id from the applied cluster); bare flat paths (`/query`, `/snapshot`, …) return **404**. `/healthz` and `/graphs` stay flat. - -| Route | Purpose | -|-------|---------| -| `GET /healthz` | liveness probe (flat) | -| `GET /graphs` | enumerate served graphs (flat; `graph_list`-gated) | -| `GET /graphs/{id}/snapshot?branch=` | table state + row counts | -| `POST /graphs/{id}/query` | read query (canonical; `/read` = deprecated alias) | -| `POST /graphs/{id}/mutate` | mutation (`/change` = deprecated alias) | -| `POST /graphs/{id}/load` | bulk JSONL load, 32 MB; branch creation opt-in via `from` (`/ingest` = deprecated alias) | -| `POST /graphs/{id}/export` | NDJSON stream of a branch | -| `GET /graphs/{id}/queries` · `POST /graphs/{id}/queries/{name}` | stored-query catalog (`read`) + invocation (`invoke_query`, +`change` for a stored mutation; deny == 404) | -| `GET /graphs/{id}/schema` · `POST /graphs/{id}/schema/apply` | read `.pg` · migrate (`schema_apply`) | -| `GET/POST /graphs/{id}/branches` · `DELETE …/branches/{b}` · `POST …/branches/merge` | branch ops | -| `GET /graphs/{id}/commits?branch=` · `…/commits/{commit_id}` | history | - -Read routes take `?branch=main` or `?snapshot=<id>`. Writes publish directly and commit atomically via `__manifest`; use the commits route for write/audit history. - -## Auth - -Set bearer tokens on the server process.
Three sources, in precedence: `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` (AWS Secrets Manager) β†’ `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON`/`_FILE` (JSON `{actor_id: token}`) β†’ `OMNIGRAPH_SERVER_BEARER_TOKEN` (single token, actor `default`): - -```bash -OMNIGRAPH_SERVER_BEARER_TOKENS_JSON='{"act-reader":"s3cret"}' \ - omnigraph-server --cluster ./company-brain --bind 0.0.0.0:8080 -``` - -On the client side (0.7.0), register the server once and store its token out of band: - -```bash -echo "s3cret" | omnigraph login remote # β†’ ~/.omnigraph/credentials (0600) -omnigraph query get_signal --server remote --graph spike --params '{"slug":"sig-foo"}' -``` - -`--server remote` resolves the URL from `~/.omnigraph/config.yaml`'s `servers:` and the token via `OMNIGRAPH_TOKEN_REMOTE` or the credentials file. A token is only ever sent to the server it is keyed to. - -### Running without auth requires an explicit opt-in - -You can no longer just "leave auth off." Since v0.6.0 the server **refuses to start** when it has neither bearer tokens nor a policy file, unless you explicitly opt in: - -```bash -omnigraph-server --cluster . --unauthenticated -# or: OMNIGRAPH_UNAUTHENTICATED=1 omnigraph-server --cluster . -``` - -This is a guardrail against accidentally shipping an open server. For pure local dev, pass `--unauthenticated` deliberately. - -## Setup Operations Bypass the Server - -`init` and **local** `load` write storage directly β€” they don't go through the server (a **remote** `load` is server-orchestrated, POSTing `/load`). Pass the repo URI: - -```bash -omnigraph init --schema schema.pg s3://my-bucket/repos/<name> -omnigraph load --data seed.jsonl --mode overwrite s3://my-bucket/repos/<name> -``` - -Everything else β€” `query`, `mutate`, `snapshot`, `schema plan/apply`, `branch`, `commit` β€” goes through the running server. - -## Cedar Policy - -Omnigraph can gate sensitive actions with [Cedar](https://www.cedarpolicy.com/) policies. 
-
-### Default-deny posture
-
-Policy is enforced engine-wide (every authoring path calls the same gate), and the default is **closed**, not open:
-
-| Server state | Bearer tokens | Policy file | Behavior |
-|---|---|---|---|
-| **Open** | no | no | Every request permitted, but the server refuses to start without `--unauthenticated` / `OMNIGRAPH_UNAUTHENTICATED=1`. |
-| **DefaultDeny** | yes | no | Every authenticated request for an action other than `read` is rejected (HTTP 403). "Tokens but forgot the policy file" no longer ships the illusion of protection. |
-| **PolicyEnabled** | yes | yes | Requests are evaluated against your Cedar rules. |
-
-So configuring a policy file is what *enables* writes; there is no "permit everything by default" mode once tokens are set.
-
-### Gated actions
-
-Per-graph actions (evaluated against the graph being addressed):
-
-| Action | Protects |
-|--------|----------|
-| `read` | query execution |
-| `export` | data export |
-| `change` | mutations |
-| `invoke_query` | stored-query invocation via `POST /graphs/{id}/queries/{name}` (graph-scoped, not branch-scoped). A stored **mutation** is double-gated: it also passes `change`. For a caller without the grant, a denial and an unknown query name both return the same **404** so the catalog can't be probed. |
-| `schema_apply` | schema migrations |
-| `branch_create` | branch creation |
-| `branch_delete` | branch deletion |
-| `branch_merge` | merges (especially into protected branches) |
-
-`admin` exists but is reserved (no call site yet, so don't write rules for it). A server-scoped `graph_list` action gates `GET /graphs`; declare it in a `[cluster]`-scoped bundle.
-
-For any shared repo, gate at least `schema_apply` and `branch_merge`.
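The default-deny posture table above can be read as a small decision function. The following is a minimal Python sketch of that table, not Omnigraph's implementation: `Posture`, `resolve_posture`, and `posture_decision` are hypothetical names introduced for illustration, and the "policy file but no tokens" combination is deliberately left unhandled because the table does not describe it.

```python
from enum import Enum
from typing import Optional


class Posture(Enum):
    """The three server states from the posture table."""
    OPEN = "Open"
    DEFAULT_DENY = "DefaultDeny"
    POLICY_ENABLED = "PolicyEnabled"


def resolve_posture(has_tokens: bool, has_policy: bool,
                    unauthenticated: bool = False) -> Posture:
    """Map (bearer tokens, policy file) to a posture, per the table."""
    if not has_tokens and not has_policy:
        # The server refuses to start unless the operator opts in explicitly.
        if not unauthenticated:
            raise RuntimeError("refusing to start: no tokens and no policy file")
        return Posture.OPEN
    if has_tokens and not has_policy:
        return Posture.DEFAULT_DENY
    if has_tokens and has_policy:
        return Posture.POLICY_ENABLED
    # Assumption: the table has no row for "policy file without tokens".
    raise NotImplementedError("combination not covered by the posture table")


def posture_decision(posture: Posture, action: str) -> Optional[bool]:
    """Return True/False when the posture alone decides the request,
    or None when the Cedar rules have to be evaluated."""
    if posture is Posture.OPEN:
        return True              # every request permitted
    if posture is Posture.DEFAULT_DENY:
        return action == "read"  # anything other than `read` is rejected
    return None                  # PolicyEnabled: defer to the Cedar rules
```

Under these assumptions, `posture_decision(resolve_posture(True, False), "change")` is `False`, which is the "tokens but forgot the policy file" case the table calls out.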
-
-### Where policy is declared
-
-Cedar bundles are declared in `cluster.yaml` and attach via `applies_to`: `[cluster]` is the server-level engine (gates `graph_list` / `GET /graphs`); `[<graph-id>]` is that graph's engine (gates `invoke_query`, `read`, `change`, `branch_*`, `schema_apply`). `cluster apply` publishes them and the `--cluster` server enforces the applied revision. The `policy.yaml` rule format (below) is the bundle content.
-
-### `policy.yaml` shape
-
-The policy model is **allow-only**: every rule is a `permit`. You grant capabilities to groups; anything ungranted is denied by default. There is **no `deny` / `effect` key**; to forbid something, simply don't grant it.
-
-```yaml
-version: 1   # required; must be 1
-
-groups:
-  admins: [act-alice, act-bob]
-  team:   [act-carol, act-dan]
-
-protected_branches:
-  - main
-
-rules:
-  - id: admins-can-apply-schema          # rules use `id`, not `name`
-    allow:                               # required `allow:` block
-      actors: { group: admins }          # references a group by name
-      actions: [schema_apply]
-      target_branch_scope: protected
-
-  - id: team-can-merge-to-protected
-    allow:
-      actors: { group: team }
-      actions: [branch_merge]
-      target_branch_scope: protected
-
-  - id: team-can-read-write-unprotected
-    allow:
-      actors: { group: team }
-      actions: [read, change]
-      branch_scope: unprotected
-```
-
-To "block unreviewed schema applies," you don't write a deny rule; you just don't grant `schema_apply` to that group. Default-deny does the rest.
-
-Scope rules (a rule's `allow` block may use **at most one**):
-
-- `branch_scope: any | protected | unprotected` applies to `read`, `export`, `change` (matches the source branch).
-- `target_branch_scope: any | protected | unprotected` applies to `schema_apply`, `branch_create`, `branch_delete`, `branch_merge` (matches the destination branch).
-
-### Validate, test, explain
-
-```bash
-# Compile Cedar + check the cluster's applied policies
-omnigraph policy validate --cluster .
-
-# Run declarative test cases
-omnigraph policy test --cluster . --tests policy.tests.yaml
-
-# Debug a single decision
-omnigraph policy explain \
-  --actor act-alice \
-  --action schema_apply \
-  --target-branch main \
-  --cluster .
-```
-
-### Test cases (`policy.tests.yaml`)
-
-```yaml
-version: 1   # required; must be 1
-cases:
-  - id: alice-can-apply-schema   # cases use `id`, not `name`
-    actor: act-alice
-    action: schema_apply
-    target_branch: main          # schema_apply is target-branch scoped
-    expect: allow                # `allow` / `deny` (not `permit`)
-
-  - id: random-user-cannot-merge-to-main
-    actor: act-random
-    action: branch_merge
-    target_branch: main
-    expect: deny
-```
-
-Run `policy test` after every policy edit. Tests are cheap.
-
-## Multi-graph serving
-
-A `--cluster` server serves every graph in the applied cluster, each under `/graphs/{id}/...`. `GET /graphs` enumerates them (sorted by id), gated by the cluster-level `graph_list` action; even under `--unauthenticated`, topology stays closed until a `[cluster]` policy grants it. `omnigraph graphs list` mirrors it (remote servers only).
-
-Policy attaches at two levels via `cluster.yaml` `applies_to`:
-- `[<graph-id>]`: per-graph rules (`read`, `change`, `branch_*`, `schema_apply`, `invoke_query`).
-- `[cluster]`: server-level rules (`graph_list`).
-
-There is no runtime add/remove of graphs: edit `cluster.yaml`, run `cluster apply`, and restart.
-
-## Server + Policy Together
-
-When the server is running with a policy file:
-1. Every request resolves the actor from the bearer token (the client cannot set actor identity) and checks it against Cedar rules.
-2. Unauthorized requests return `403 Forbidden`.
-3. The CLI doesn't bypass policy when it connects over HTTP; it's enforced at the server. Enforcement is also engine-wide, so CLI direct-engine writes and embedded SDK consumers hit the same gate.
-
-Setup ops (`init`, `load`) write storage directly. With a policy configured they still flow through the engine-layer enforce gate for the actor you pass via `--as` (or `operator.actor` in `~/.omnigraph/config.yaml`); gate the raw storage layer too (S3 bucket ACLs, object locks) if the bucket is shared.
-
-## Cluster-Booted Servers
-
-`omnigraph-server --cluster <dir|s3://>` is the only boot source (covered above). It serves the cluster's **applied revision**: `cluster apply` changes take effect on the next restart (no hot reload), and boot is fail-fast with named remedies for missing/pending/tampered state. Bearer tokens and bind stay process-level (env/flags). See `references/cluster.md`.
diff --git a/skills/omnigraph/references/stored-queries.md b/skills/omnigraph/references/stored-queries.md
deleted file mode 100644
index 02aaf75..0000000
--- a/skills/omnigraph/references/stored-queries.md
+++ /dev/null
@@ -1,54 +0,0 @@
-# Stored-Query Registries
-
-A **stored query** is a `.gq` query that the *server* loads, type-checks at startup, and exposes by name, without ever accepting ad-hoc query source from the client. It's how you publish a vetted, typed query surface to remote callers and MCP tools.
-
-This is a server-side feature introduced in **v0.6.1**. It is distinct from CLI `aliases:` (see [`aliases.md`](aliases.md)): an alias is local client ergonomics; a stored query is a server-published, policy-gated endpoint.
-
-## Declaring stored queries (`cluster.yaml`)
-
-Stored queries are declared in the cluster's `cluster.yaml`; every `query <name>` in the listed `.gq` files registers:
-
-```yaml
-graphs:
-  <id>:
-    schema: schema.pg
-    queries: queries/        # discover every `query <name>` in queries/*.gq
-```
-
-`queries` also accepts an explicit file list (`[a.gq, b.gq]`) or a fine-grained `name: { file: … }` map; an unparseable `.gq` or a duplicate query name across files fails `cluster validate`. `cluster apply` publishes them to the content-addressed catalog, and the `--cluster` server type-checks and serves every applied query. Every applied query is listed (per-query `mcp:`/expose flags are a planned phase).
-
-## CLI
-
-```bash
-omnigraph queries validate   # type-check every stored query against the live schema (offline; opens the graph; exits non-zero on drift)
-omnigraph queries list       # print the addressed graph's registry: query names and typed params
-```
-
-- `validate` catches schema drift **without restarting the server**; run it after a `schema apply` or before deploying a config change. The server also runs this check at startup and **refuses to boot** on drift or on a duplicate MCP tool name.
-- `validate` opens the graph (address with `--store <uri>` or a positional URI); `list` reads the addressed graph's catalog.
-- `queries` is distinct from `lint`: `lint` validates a single `.gq` file you point it at; `queries validate` validates the registry the server will actually serve.
-
-## HTTP surface
-
-| Route | Gate | Purpose |
-|-------|------|---------|
-| `GET /graphs/{id}/queries` | `read` | Typed tool catalog of the served queries. Graph-wide (branch-independent; `read` authorized against `main`). |
-| `POST /graphs/{id}/queries/{name}` | `invoke_query` (+ `change` for a stored mutation) | Invoke a named query. Body carries params only, **never** `.gq` source. A stored mutation cannot target a `snapshot` (`400`); a param type error is a structured `400` naming the param. |
-
-`?branch=` / `?snapshot=` query params apply to `POST /graphs/{id}/queries/{name}` reads; branch/snapshot access stays enforced by the inner `read`/`change` gate (`invoke_query` itself is graph-scoped, not branch-scoped).
-
-## Policy gating (`invoke_query`)
-
-- **`invoke_query`** is a per-graph Cedar action gating the whole stored-query invocation surface. Grant it like any other action (see [`server-policy.md`](server-policy.md)).
-- **Stored mutations are double-gated:** the caller needs `invoke_query` to reach the query **and** `change` for the write. An actor with `invoke_query` but not `change` gets `403` on a stored mutation.
-- **Deny == unknown:** for a caller *lacking* `invoke_query`, a denial and an unknown query name return the **same 404** (identical body), so the catalog can't be probed. A caller who *holds* `invoke_query` may still get a `403` from the inner gate for a query it can't `read`/`change`, so existence is visible to grant-holders by design.
-- **Default-deny mode** (bearer tokens, no `policy.file`) permits only `read`, so *every* `/graphs/{id}/queries/{name}` call returns `404` until an `invoke_query` rule is configured.
-
-## MCP exposure
-
-Every applied query is listed in `GET /graphs/{id}/queries` as a typed MCP tool. Per-query exposure controls (`mcp.expose`, `tool_name`) are a planned phase; there is no per-query `mcp:` flag in cluster mode today.
-
-## Note on per-query authorization
-
-The catalog is **not** Cedar-filtered per query yet: a caller with `read` but not `invoke_query` can *list* a query it cannot *invoke* (invocation would 404). Per-query authorization is future work; for now the catalog is a discovery surface and `invoke_query` is the invocation gate.
-
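The status-code rules described under "Policy gating (`invoke_query`)" can be summarised as one small function. This is a hedged Python sketch of those rules only, not the server's code: `invoke_status` and the `add_signal` query name are hypothetical, the `200` success code is an assumption, and the `400` cases (snapshot-targeted mutations, param type errors) are left out because the text does not say how they order against the policy checks.

```python
from __future__ import annotations


def invoke_status(granted: set[str], catalog: dict[str, bool], name: str) -> int:
    """Sketch of the documented outcome of POST /graphs/{id}/queries/{name}.

    granted: Cedar actions the caller holds on this graph.
    catalog: stored-query name -> True if the query is a stored mutation.
    """
    if "invoke_query" not in granted:
        # Deny == unknown: without the grant, a denial and a missing
        # query look identical, so the catalog cannot be probed.
        return 404
    if name not in catalog:
        return 404
    # Inner gate: a stored mutation needs `change`, a read query needs `read`.
    inner_action = "change" if catalog[name] else "read"
    if inner_action not in granted:
        return 403
    return 200  # assumption: the success status is not stated in the text


# Hypothetical catalog: one read query and one stored mutation.
CATALOG = {"get_signal": False, "add_signal": True}
```

With this catalog, a caller holding only `read` gets `404` for both `get_signal` and a nonexistent name, while a caller holding `invoke_query` and `read` gets `403` on `add_signal`, matching the double-gating rule.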