Commit graph

23 commits

Author SHA1 Message Date
Ragnor Comerford
5dd9a9f463
Merge remote-tracking branch 'origin/main' into ragnorc/optimize-reconciles-drift
# Conflicts:
#	docs/dev/testing.md
2026-06-08 23:04:09 +02:00
Andrew Altshuler
ce150fb0ca
docs(testing): fix stale optimize test name in maintenance.rs row (#148)
The maintenance.rs row referenced `optimize_reconciles_preexisting_manifest_head_drift`,
which never existed (leftover from the reconcile-drift heuristic removed in #141).
The actual second test is `optimize_defers_when_recovery_sidecar_is_pending`.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 22:19:21 +03:00
Ragnor Comerford
20ef0e90a4
docs(invariants): note the non-atomic manifest->commit-graph publish gap
Every graph publish commits __manifest then appends _graph_commits as two
separate writes; a crash between them leaves the manifest ahead of the commit
DAG. Live reads + durability are unaffected (reads resolve via the manifest) and
recovery does not repair it; impact is bounded to commit history / time-travel
by commit id / merge-base completeness. Pre-existing across all publishes, not
the optimize reconcile specifically. Documented as a Known Gap; the fix is a
commit-graph reconcilable from the manifest, not a recovery sidecar.
2026-06-08 16:07:24 +02:00
Ragnor Comerford
2193a24641
docs(optimize): document the pre-existing-drift reconcile
- maintenance.md: add the reconcile bullet (metadata-only manifest catch-up for
  a table with HEAD>manifest and an empty plan), and correct the 'requires a
  recovered graph' note — that guard is what makes the reconcile safe (no
  sidecar-covered drift in flight), not a claim that no drift exists.
- AGENTS.md: restore the reconcile mention in the Compaction capability row.
- testing.md: the maintenance.rs row lists all three optimize tests (publishes /
  reconciles / defers); fix the stale failpoints ensure_indices test name
  (recovery_rolls_forward_ensure_indices_on_feature_branch).
2026-06-08 14:17:23 +02:00
Ragnor Comerford
e62d9166fb
fix: optimize publishes compaction; recovery roll-back converges manifest (#141)
* test(optimize): cover manifest publish + HEAD-drift reconcile

Red against the pre-fix optimize, which ran compact_files without
publishing the compacted version to __manifest:

- maintenance: optimize must publish so the manifest table_version
  tracks the compacted Lance HEAD and a later schema apply succeeds;
  and must reconcile a pre-existing manifest-behind-HEAD drift (forged
  via raw Lance compaction) so strict writes commit again.
- end_to_end + composite_flow: post-optimize query / strict update /
  reopen in the full lifecycle (the canonical flow previously omitted
  post-optimize writes as a documented "known limitation").
- failpoints: a crash between compaction and the manifest publish rolls
  forward on next open.

* fix(optimize): publish compaction to manifest and reconcile HEAD drift

optimize ran Lance compact_files without publishing the new version to
__manifest, so the manifest table_version lagged the Lance HEAD: reads
stayed pinned to the pre-compaction version, and the next schema apply or
strict update/delete failed its HEAD-vs-manifest precondition with
"stale view ... refresh and retry" (open-time recovery rollback inflated
the gap on retry).

optimize now publishes each compacted table's version under the
per-(table, main) write queue, guarded by a manifest CAS and a
SidecarKind::Optimize recovery sidecar (loose-match; roll-forward is safe
because compaction is content-preserving). When a table has nothing left
to compact but its Lance HEAD is already ahead of the manifest pin
(pre-fix drift, or a recovery restore commit), optimize reconciles the
manifest forward to HEAD (metadata-only, no sidecar). Caches and the
CSR/CSC graph index are invalidated after a publish.

Docs updated (maintenance, storage, branches-commits, writes, testing).

* test(recovery): rollback convergence + optimize-defer regressions

Red against the current code, landed before the fix:
- recovery: after the open-time sweep rolls a sidecar back, the manifest
  must track Lance HEAD (no residual drift) so a follow-up schema apply
  succeeds — the original "+1 per retry" loop. Today roll-back restores
  without publishing, so the manifest lags HEAD and the apply fails its
  HEAD-vs-manifest precondition.
- maintenance: optimize must refuse while a recovery sidecar is pending —
  operating on an unrecovered graph could publish a partial write the
  sweep would roll back.

Also removes optimize_reconciles_preexisting_manifest_head_drift: the
ad-hoc drift reconcile it covered is replaced by recovery-side convergence.

* fix(recovery): converge manifest on roll-back; optimize defers on pending recovery

Root of PR #141's review findings and the original "+1 per retry" loop:
a Lance HEAD ahead of the manifest was ambiguous (benign content-preserving
drift vs. a partial write a sidecar will roll back), and optimize's reconcile
guessed it benign. Close the class instead of guessing:

- Recovery roll-back now PUBLISHES the restored version (via a
  push_table_update_at_head helper shared with roll-forward), so the manifest
  tracks the Lance HEAD after recovery — symmetric with roll-forward. This
  fixes the +1 loop (after one roll-back the retry's HEAD-vs-manifest
  precondition passes) and removes the only remaining source of orphaned
  drift. The audit still records the logical rolled-back-to version; the
  manifest is published at the restore commit (identical content).
- optimize drops the ad-hoc drift reconcile and instead REFUSES when a
  __recovery sidecar is pending, so it only ever operates on a recovered
  graph (manifest == HEAD); its compaction publish can no longer commit a
  partial write. With the reconcile gone, the blob-skip-vs-reconcile gap is
  moot.

Updates the rollback recovery-test helper (manifest == HEAD after roll-back),
the failpoints assertions, and the user/dev docs.

* test(recovery): fix rollback assertion for manifest convergence

The roll-back-publishes change makes the manifest version advance after a
SchemaApply roll-back (to the old-schema content), so the
schema_apply_without_schema_staging_rolls_back_on_next_open assertion must
be `version > pre`, not `version == pre`. This update was dropped during
the commit churn and surfaced as a CI Test Workspace failure; the
old-schema-preserved intent stays covered by count_rows + _schema.pg + the
RolledBack convergence invariant.
2026-06-08 02:50:12 +03:00
Ragnor Comerford
54842808db
feat(engine): sweep & remove legacy __run__ branch guard (MR-770) (#132)
* feat(engine): sweep legacy __run__ branches via v2→v3 manifest migration

Pre-v0.4.0 graphs can carry stale `__run__<id>` staging branches on the
`__manifest` dataset, left by the Run state machine removed in MR-771. Lance's
`list_branches` still enumerates them, so they leak into `branch_list()` and
count as blocking branches at schema-apply time.

Add a one-time `migrate_v2_to_v3` arm to the internal-schema dispatcher: on the
first read-write open it enumerates `__manifest` branches, deletes every
`__run__*` ref, and bumps the stamp to 3. Idempotent under retry (re-enumerates
fresh each run). The `"__run__"` prefix is inlined so the migration does not
depend on the run_registry guard that MR-770 removes next.

This is the prerequisite sweep; the guard removal follows in the next commit.

* refactor(engine): remove the legacy __run__ branch guard (MR-770)

With the v2→v3 migration sweeping stale `__run__*` branches off `__manifest`
on first read-write open, the defense-in-depth `is_internal_run_branch` guard
is no longer needed.

- delete `db/run_registry.rs`; drop the module + re-export from `db/mod.rs`
- collapse `is_internal_system_branch` to the schema-apply-lock check only
- `ensure_public_branch_ref`: drop the run-ref rejection; `__run__*` is now an
  ordinary branch name
- `branch_merge`: reject `is_internal_system_branch` (was run-only) so the
  schema-apply lock is rejected consistently with create/delete — a small,
  deliberate tightening
- update the inline schema-apply test + the writes integration tests
  (`public_branch_apis_reject_internal_run_refs` →
  `public_branch_apis_reject_internal_system_refs`, which also asserts
  `__run__*` now creates successfully)
- docs: flip the "pending production sweep / defense-in-depth" notes to
  "auto-swept by the v2→v3 migration"; document the read-only-open limitation

Known residual: the inert `_graph_runs.lance` / `_graph_run_actors.lance` bytes
remain until a `StorageAdapter::delete_prefix` primitive lands.

* fix(engine): run __run__ sweep at Omnigraph::open, not only on publish

Review (PR #132) caught a regression: removing __run__ from
`is_internal_system_branch` exposed legacy `__run__*` branches to the
schema-apply blocking-branch checks (schema_apply.rs:104 and :778) and to
`branch_list()`, but the v2→v3 sweep ran only inside the publisher's
`load_publish_state`. On a pre-v0.4.0 graph whose first write is a schema
apply, the blocking-branch check fires before any publish, so apply failed
with "found non-main branches: __run__…". The same lazy timing also created a
reverse hazard: a user-created `__run__*` branch on a still-v2 graph could be
deleted by the first publish's sweep.

Fix: run the internal-schema migration in `Omnigraph::open(ReadWrite)` (new
`manifest::migrate_on_open`), before the coordinator reads branch state. The
sweep now lands before any branch-observing code, and a graph is stamped v3 at
open — so the one-time sweep can never catch a legitimately-created branch.
Both checks and `branch_list` see the swept graph; correct by construction for
every write path.

Accepted residual: a read-only open of an unmigrated legacy graph still lists
`__run__*` (read-only opens must not write, so they can't sweep). Documented.

Regression test `legacy_run_branch_is_swept_on_open_and_does_not_block_schema_apply`
confirmed RED before the fix (panicked on the branch_list leak assertion) and
GREEN after. Also updates the stale schema_apply.rs comment, the writes.md
"Migration code" section, and adds the v3 row to storage.md's migration table.

* test(engine): sweep multiple legacy __run__ branches; doc nit

Strengthen the v2→v3 migration test to synthesize three `__run__*` branches
(a real legacy graph accumulates one per run) so the migration's delete loop
is exercised on a single reused dataset handle, not just a single branch.
Confirms multi-branch deletion is safe.

Also drop a stale "active runs" reference from the branch_delete doc line.

* fix(engine): force-delete in __run__ sweep for concurrency safety

`migrate_v2_to_v3` ran `Dataset::delete_branch` (= `branches().delete(.., false)`),
which errors "BranchContents not found" if the branch is already gone. Since the
sweep now runs in `Omnigraph::open(ReadWrite)`, two processes opening the same
legacy v2 graph concurrently would race: one wins each delete, the other's open
fails. The migration only claimed idempotency under *sequential* retry.

Switch to `Dataset::force_delete_branch` (= `delete(.., true)`), Lance's
documented path for cleaning up zombie branches, which tolerates an
already-absent branch. The sweep is now idempotent under concurrent runners and
robust to partial/zombie state. Found in self-review; no behavior change for the
common single-open path.

* docs(release): note MR-770 __run__ cleanup in v0.6.1

* docs(branches): reconcile branch cleanup semantics
2026-06-07 18:33:14 +03:00
Andrew Altshuler
fd8e078a77
ci(codeowners): add aaltshuler to engineering role (#147)
Restores aaltshuler as an `engineering` code-owner (removed in #142), so
`crates/**` and repo-infra PRs have a second reviewer besides the sole
owner ragnorc — unblocking review of author-ragnorc PRs (e.g. #132) that
ragnorc cannot self-approve.

Edited the source of truth (.github/codeowners-roles.yml) and re-rendered
.github/CODEOWNERS + the docs/dev/codeowners.md tables via
.github/scripts/render-codeowners.py, per the documented flow.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 18:05:01 +03:00
Andrew Altshuler
343f1f17ed
governance: external contribution model (issues/discussions/RFCs/PRs) (#143)
Formalize the public contribution surface. Maintainers keep a separate internal
process and are exempt from the intake gates; everyone stays bound by review,
CODEOWNERS, and branch protection.

Model:
- Issues = problem reports only (bug form + config.yml redirects ideas to
  Discussions and disables blank issues).
- Discussions = ideas + RFC incubation.
- RFCs = anyone (incl. external) authors docs/rfcs/NNNN-*.md; a maintainer
  merging it is acceptance. Distinct from the maintainer-internal
  docs/dev/rfc-00N-* track.
- PRs = link an `accepted` issue or accepted RFC, or use the trivial fast-lane
  (typos/docs/deps). Enforced softly to start (template + review).

Adds GOVERNANCE.md, rewrites CONTRIBUTING.md, adds docs/rfcs/ (README +
template), .github issue/PR/discussion templates. Wires docs/rfcs/ into the
doc-link checker (excluded like releases; linked from docs/dev/index.md).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 23:58:08 +03:00
Andrew Altshuler
c7365bf8ef
ci(codeowners): un-trap required checks, auto-render, generate owner tables (#142)
The CODEOWNERS required checks blocked every PR — the real root cause was a
name mismatch, compounded by a path filter:

- branch-protection.json required the contexts `CODEOWNERS / drift` and
  `CODEOWNERS / noedit` (the GitHub UI "workflow / job-id" display form), but
  the jobs report check-run names from their `name:` fields — "CODEOWNERS
  matches source" / "CODEOWNERS not hand-edited". The required contexts
  therefore never matched any reported check and sat permanently pending.
- The workflow was also path-filtered to CODEOWNERS files, so it didn't even
  run for most PRs.

Net effect: with both required checks unsatisfiable, every PR could only land
via admin override (e.g. #140).

Fixes:
- A: drop the `paths:` filter so the workflow runs on every PR and both
  required contexts always report.
- name fix: point branch-protection.json at the actual job names verbatim, and
  add a doc note that the contexts must equal the job `name:` values.
- B: the `drift` job now re-renders and, on same-repo PRs, auto-commits the
  regenerated artifacts back to the branch (mirrors the openapi.json job in
  ci.yml); forks / manual runs strict-check instead. Contributors no longer
  run the script by hand.
- D: render-codeowners.py also generates a "who owns what" path->owners +
  roles table spliced into docs/dev/codeowners.md between markers, so the
  human-readable view never drifts. Idempotent; CODEOWNERS output unchanged.
- docs: correct the stale `enforce_admins: true` line (JSON and live are
  false).

NOTE: the branch-protection.json change only takes effect after an admin runs
`./scripts/apply-branch-protection.sh` (deliberate manual step, per
docs/dev/branch-protection.md). Until then `main` still requires the old
mismatched contexts, so this PR itself needs an admin-override merge — the last
one that should be necessary.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 18:09:47 +03:00
Ragnor Comerford
d54bccb940
fix(optimize): skip blob-bearing tables to avoid Lance compaction crash (#138)
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / Test Windows release binaries (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
* test(optimize): pin Lance blob-column compaction failure as a surface guard

Lance compact_files mis-decodes blob-v2 columns under its forced BlobHandling::AllBinary read ("more fields in the schema than provided column indices"), failing even a pristine uniform-V2_2 multi-fragment blob table; reads use descriptor handling and are unaffected.

Guard 10 reproduces this and is self-retiring: it turns red on the Lance bump that fixes the bug, forcing LANCE_SUPPORTS_BLOB_COMPACTION to flip.

* fix(optimize): skip blob-bearing tables instead of crashing compaction

omnigraph optimize aborted the whole sweep when any node/edge table had a Blob property: Lance compact_files cannot decode blob-v2 columns under AllBinary (the column-index error pinned by the surface guard). Skip blob-bearing tables behind a LANCE_SUPPORTS_BLOB_COMPACTION gate and report them via TableOptimizeStats.skipped / SkipReason (surfaced in the CLI and a tracing::warn) instead of erroring, which also isolates the failure so the other tables still compact.

Reads/writes are unaffected; only fragment/space reclamation on blob tables is deferred until the upstream Lance fix. Adds a maintenance.rs regression test (validated red with the column-index symptom before the fix, green after), a concise v0.6.1 release note, and updates docs (maintenance, cli-reference, AGENTS capability matrix, invariants Known Gaps, lance.md audit, constants).

* refactor(optimize): make TableOptimizeStats and SkipReason non_exhaustive

Both are returned result types, never built by callers, so #[non_exhaustive] makes this the last field/variant addition that can break downstream literal construction and keeps future ones non-breaking (review feedback on the public-field addition). The v0.6.1 Compatibility Notes call out the source-level change.

Also drops the now-stale "RED today / GREEN after the fix lands" narration in the optimize_skips_blob_table_and_reports_skip test (historical regression context now that the fix is in this branch), and folds in the expanded v0.6.1 release note.

* chore(release): bump workspace to v0.6.1

Coherent version bump to accompany the v0.6.1 release note: all five crate manifests + path-dependency constraints, Cargo.lock, the AGENTS.md surveyed-version line, and openapi.json info.version move 0.6.0 -> 0.6.1. Matches the established release pattern (#118 landed the v0.6.0 note + bump together) and resolves the Codex/Devin review flag that a v0.6.1 note without a bump leaves CARGO_PKG_VERSION reporting 0.6.0 and mixed package versions.
2026-06-02 17:12:00 +02:00
Ragnor Comerford
3c2b1b8051
Stored-query registry foundation + config/CLI RFC-002 (#128)
* MR-969: add stored-query registry config surface

Introduce the `queries:` block in omnigraph.yaml — an inline
`name -> entry` map of stored queries, per-graph
(`graphs.<id>.queries`) and top-level for single-graph mode, mirroring
how `policy` is wired in both modes. Each entry points at a `.gq` file
and carries optional MCP exposure settings (`expose`, `tool_name`),
defaulting to not-exposed.

Additive: absent `queries:` leaves current behavior unchanged.

- QueryEntry { file, mcp: McpSettings { expose, tool_name } }
- `queries` field on TargetConfig + OmnigraphConfig (serde default)
- query_entries() / target_query_entries() accessors
- resolve_query_file() — base_dir-relative `.gq` path resolution
- round-trip + absent-block tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add stored-query registry loader and GraphHandle wiring

Add a `queries` module: QueryRegistry loads each declared `.gq` entry,
parses it, and selects the query whose symbol matches the manifest key,
asserting the two agree (key == `query <name>` symbol). Identity is the
query name; a key/symbol mismatch is a load-time error. Errors are
collected, not fail-fast, so a bad registry surfaces every broken entry
at once. Schema type-checking is deliberately left to a separate pass so
the loader stays callable without an open engine.

Thread an `Option<Arc<QueryRegistry>>` through GraphHandle alongside the
per-graph policy; the URI-canonicalizing clone propagates it. Production
openers default to None for now — the boot path loads and attaches the
registry in a later change.

- QueryRegistry::{from_specs, load, lookup, iter}; StoredQuery::is_mutation
- GraphHandle.queries field, propagated on canonical clone
- registry unit tests: identity match/mismatch, multi-query selection,
  per-entry parse errors, error collection, mutation classification

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: add RFC-002 config & CLI architecture

Layered config (user-global ~/.config/omnigraph/ + per-project), a
unifying `target` abstraction resolving to (locus, graph, sub-state,
credential) with embedded-URI XOR remote-server loci, multi-server ×
multi-graph client targeting, credentials by-reference, and the
file-naming decision: project and server config are one artifact
(`omnigraph.yaml`); the only differently-named file is the user-global
`config.yaml`, split by scope not role. Includes the 12-factor bind
portability rule (prefer --bind/OMNIGRAPH_BIND over a committed
server.bind) and the defined-locally / invoked-remotely model for
stored queries. Derived from first principles working backwards from
what the engine enables; validated against kube/Helix/git/compose.

Linked from docs/dev/index.md. Proposed; phased rollout for the
MR-973/974/981 family.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add check() to validate stored queries against the live schema

A pure check(registry, catalog) that type-checks every stored query via
the same typecheck_query_decl the engine runs for inline queries — no
parallel implementation. Failures are collected, not fail-fast, so an
operator sees every broken query (e.g. a type/property a migration
renamed or removed) in one pass. Breakages are fatal (the boot path will
refuse to start); warnings are advisory.

Pure over (registry, catalog) so it is callable both at boot (engine
catalog) and offline from the CLI without an open engine.

Advisory lint: an mcp.expose:true query that declares a Vector(N)
parameter warns — an LLM cannot supply a raw embedding vector; such a
query should take a String parameter and embed server-side. Warns
rather than rejects, since service-to-service callers may pass vectors.

- CheckReport { breakages, warnings }; has_breakages / is_clean
- tests: valid query, unknown type, unknown property, collect-not-fail-fast,
  vector-param-exposed warns, unexposed silent

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Drop internal plan-label refs from stored-query config comments

Doc comments referenced sequencing labels ("C2") that mean nothing to a
reader; reword to describe the behavior directly. Comment-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: reconcile aliases with the role model in RFC-002

Place the existing client-only `aliases:` block in the client/server
role split: aliases are client-role (CLI, embedded, ungated) and may
live in both user-global and project config; `queries:` is server-role
(deployment manifest only). They overlap as "name -> .gq"; `queries:` is
the superset, and the end-state subsumes aliases (definition -> queries,
target/branch/format -> client invocation context, positional args ->
CLI sugar). v1 keeps aliases unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: make RFC-002 config global-first, project-optional

The global user config is the primary, self-sufficient default; the
CLI works from any directory with no project file (the kubectl/aws/gh
posture), a deliberate flip from today's project-anchored behavior.
The project omnigraph.yaml becomes an optional repo-scoped override and
the deployment manifest. Uniform schema, both layers optional; global
can hold any section including a personal server's graphs/queries.
Additive: project still overrides global; the flip adds a fallback
layer below the project file rather than removing it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: justify XDG ~/.config/omnigraph over legacy ~/.omnigraph in RFC-002

Make the rationale explicit: XDG-first because OmniGraph is a client
that will cache remote catalogs and keep session state alongside
secrets, and XDG separates config / cache / state into distinct dirs
(clear cache without touching creds; backups skip cache) whereas a
single ~/.omnigraph/ mixes them. Honor ~/.omnigraph/ as a fallback for
the peer-group (aws/kube/docker/helix) expectation. Add XDG_CACHE_HOME
/ XDG_STATE_HOME to the override precedence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: build RFC-002 credentials on the existing env-file mechanism

OmniGraph already has credentials-by-reference: bearer_token_env names
the env var, and auth.env_file is a git-ignored dotenv the CLI
auto-loads (real env vars win), resolved via resolve_remote_bearer_token.
The RFC's proposed credentials.yaml + token_env were redundant parallel
inventions. Reconcile: reuse bearer_token_env (extend to
servers.<name>) and auth.env_file (add a global ~/.config/omnigraph/.env
layered under the project .env.omni); OS keychain is an additive future
resolver. No new credentials.yaml. Updated summary, non-goals,
background, file-naming, credentials, example, login, migration, rollout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: use single ~/.omnigraph dir (Helix-style), not XDG, in RFC-002

Reverse the earlier XDG-first call. The prior argument rested on a false
dichotomy (single-dir => mixed config/cache/state); in fact the peer
tools (aws, kube, helix) achieve separation via SUBDIRECTORIES inside
one ~/.tool/ dir (~/.aws/sso/cache/, ~/.kube/cache/), getting cache
hygiene AND one discoverable place. So everything goes under
~/.omnigraph/: config.yaml, credentials (dotenv, 0600), cache/, state/.
Lower cognitive load, matches what DB/cloud-CLI users expect, matches
Helix. OMNIGRAPH_HOME overrides; $XDG_CONFIG_HOME optionally honored but
~/.omnigraph/ is canonical. Updated all paths, the rationale paragraph,
the file-naming table (added a cache/state row), and env precedence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: reconcile RFC-002 with shipped/planned CLI tickets

Align with reality found in existing tickets:
- Noun is graph/graphs, not target/targets (MR-603 done renamed the
  config key targets->graphs, flag --graph). Use graphs:/--graph; an
  entry is embedded (uri) XOR remote (server + remote graph name).
- ~/.omnigraph/ confirmed by MR-581 (og template pull, done) which
  already quick-starts templates there.
- Templates already exist (MR-581/MR-531) — not invented here.
- The init family is already specced (init, quickstart MR-973, serve
  MR-970, prune MR-972, mcp install MR-974, agent-mode MR-981); this
  RFC only adds the user route (~/.omnigraph/config.yaml + login).
- aliases: -> operations: planned (MR-839).
- bearer_token_env gap tracked in MR-971.
- query lint/check already exist (MR-639) — registry validator must not
  collide with the singular `query check`.
Add a Reconciliation section; fix the canonical example to graphs:/--graph.
Also: merge semantics refined (deep-merge settings, replace named
entries, replace lists, config view --resolved --show-origin).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: correct stale-ticket claims and fold init/bootstrap design into RFC-002

Verify against code, not ticket statuses (MR-581 is marked done but is
stale/unbuilt): no ~/.omnigraph usage, no template/serve/quickstart/
prune/login commands exist; config still uses aliases: (no operations:).
So ~/.omnigraph/ stands on peer-convention merits alone, and templates
are a design question, not a foothold. Add §7.5: the three-tier init
model (user route = login + ~/.omnigraph/config.yaml; thin project init;
fat quickstart + templates) with first-principles positions (split
init/login, in-place refuse-if-exists, interactive vs --auto/agent-mode,
--template flag, secrets-on-scaffold gitignore rule). This RFC owns only
the user route; the rest are sibling tickets (MR-973/970/972/974/981).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: breadboard + slice Shape A in RFC-002

Add the implementation breadboard (places P1-P5, affordances N1-N14 with
NEW markers, mermaid) and five vertical slices for the selected config/
CLI/init shape: V1 global layer + merge engine + config view; V2 remote
graphs + HTTP-client path + credential resolution; V3 omnigraph login;
V4 init-hardening + quickstart + templates (rides MR-970); V5 agent-mode
(MR-981). Rollout reordered to the slice sequence; spikes X1-X4 gate
their owning slice. V1-V2 close the substantive client->server gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add InvokeQuery Cedar action (coarse, graph-scoped)

A per-graph, branch-scoped action that gates invoking a server-side
stored query by name. Coarse for now: an `invoke_query` allow rule
permits any stored query on the graph; a future, additive refinement
adds an optional per-query-name scope without changing rules written
against the coarse action. Enforcement is at the HTTP boundary; the
engine `_as` writers still enforce read/change per the query body, so a
stored mutation is double-gated (invoke_query to reach the tool, change
for the write). No call site yet — the invocation handler wires it in a
later change (same pattern as Admin/GraphList added ahead of consumers).

- variant + as_str/resource_kind(Graph)/FromStr/uses_branch_scope
- Cedar schema: invoke_query appliesTo Graph
- tests: per-graph allow/deny, branch-scope accepted

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Load and type-check stored queries at server boot, refusing breakage

At startup the server now loads each graph's stored-query registry,
type-checks every query against that graph's live schema, and refuses to
boot if any query references a type/property the schema doesn't have
(same posture as bad policy YAML) — so schema drift surfaces at the
deploy boundary, not silently at invocation. Non-blocking warnings are
logged. The validated registry is attached to the GraphHandle (the two
production sites previously held `queries: None`).

Loading (parse + key==symbol identity) happens at settings-build time
where the config is in scope; the schema type-check happens after each
engine opens (single mode in `open_single_with_queries`, multi mode in
`open_single_graph`). `open_with_bearer_tokens_and_policy` delegates
with an empty registry so its 18 test callers are unchanged; the public
`new_*` constructors are unchanged (only the private build path threads
the registry).

- ServerConfigMode::Single / GraphStartupConfig carry the loaded registry
- boot tests: valid registry boots; type-broken query refuses boot + names it

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add `omnigraph queries validate` and `queries list` CLI

`queries validate` type-checks the stored-query registry against the
live schema offline — it opens the selected graph, runs the same
check() the server runs at boot, prints breakages/warnings (human or
--json), and exits non-zero on any breakage — so an operator can catch
a query broken by a schema change without restarting the server.
`queries list` prints each registered query's name, MCP exposure, and
typed params.

Named `validate` (not `check`) to avoid overlap with the existing
`omnigraph lint` — `query check`/`query lint` are already deprecated
argv-shims to `lint`. Registry entries resolve like the server: a named
graph uses its per-graph `queries:`; otherwise the top-level one.

- Queries subcommand group; reuses QueryRegistry::load + check from
  omnigraph-server; local-only (needs the schema), mirrors lint
- tests: clean registry exits 0, broken query exits non-zero + names it,
  list shows the query and its typed params

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Route registry selection through one shared query_entries_for

The "which queries: block applies for graph X" rule existed twice — the
server boot path and the CLI's registry_entries — and had already drifted:
the CLI carried an unreachable unwrap_or_else fallback the server lacked.

Add OmnigraphConfig::query_entries_for(graph: Option<&str>) as the single
definition (named graph -> its per-graph block; otherwise top-level) and
route all three sites through it: server single mode, server multi-graph
loop, and the CLI. The CLI's dead fallback arm is deleted; CLI and server
now resolve identically by construction.

No behavior change. Extends the config round-trip test to pin the selector,
including the unknown-name -> top-level fallback the deleted CLI arm covered.

* Funnel registry validation through one validate_and_attach gate

The check -> refuse-on-breakage -> log-warnings -> empty->None block was
copy-pasted across both open paths (single mode and the multi-graph
per-graph open), differing only by the graph label. A third opener could
attach a registry that was never schema-checked.

Extract validate_and_attach(queries, catalog, label) -> Option<Arc<..>> as
the single gate both paths call, so attaching an unchecked registry is no
longer expressible. The catalog handle is an owned Arc, so calling it
before the multi-mode policy match (which rebinds db) is borrow-clean.

No behavior change. Adds a direct unit test of the helper (empty / clean /
breakage incl. the graph label in the message) — covering the multi-graph
path's logic, which previously had no boot-refusal coverage.

* Resolve param types structurally in the MCP vector lint

The exposed-query advisory detected vector params with
type_name.starts_with("Vector(") — a second copy of the compiler's own
ScalarType::from_str_name vector parsing that could drift from it.

Key the lint off PropType::from_param_type_name + ScalarType::Vector(_)
instead, the one canonical resolver the type system already uses. Any
future param-suppliability lint now reads the structured type rather than
scanning the surface string.

Behavior-preserving: the grammar forbids list-of-vector params
(list_type = "[" base_type "]", and base_type excludes Vector), so the only
input where the structured and string checks could differ is unparseable.
Adds a guard test that an exposed String param does not false-trigger the
warning.

* Refuse duplicate MCP tool names across exposed stored queries

The effective MCP tool name (explicit tool_name, else the query name) is a
second identity namespace beside the registry key, but nothing enforced it
unique — two exposed queries could claim one catalog key, and each consumer
re-derived the name ad hoc.

Add StoredQuery::effective_tool_name() as the one definition, and a
load-time uniqueness pass in from_specs over exposed queries: a collision is
a collected LoadError naming the loser and the winner. Scoped to exposed
queries (unexposed have no MCP tool); deterministic over the BTreeMap so the
first-declared wins and the error order is stable.

New (rare) refusal: a config with colliding exposed tool names now fails
`omnigraph queries validate` offline and refuses server boot, the same
posture as a malformed registry. Release-note-worthy.

Test-first: duplicate_exposed_tool_name_is_a_load_error (red before the
pass, green after) + a CLI offline test; the unexposed sibling pins the
exposed-only scope; effective_tool_name asserts folded into the load test.

* docs: document the queries registry, CLI, and invoke_query action

The stored-query surface shipped without user docs. Add it, per the same-PR
maintenance contract:

- policy.md: invoke_query as per-graph action #10 (branch-scoped), with the
  double-gating note; renumber graph_list; add it to the branch_scope list.
- cli-reference.md: the `queries validate | list` command, and the
  `queries:` config block (per-graph + top-level) with mcp.expose/tool_name
  and the tool-name uniqueness rule.
- server.md: boot-time stored-query type-check (refuse on breakage), noting
  invocation over HTTP/MCP is not yet exposed.

* Add POST /queries/{name} stored-query invocation handler

Invoke a curated server-side stored query by name: source + name come from
the per-graph queries: registry, the client sends only runtime inputs
(params, branch, snapshot). Gated by the invoke_query Cedar action at the
boundary; the handler delegates to the existing run_query/run_mutate, whose
inner Read/Change enforce still runs — so a stored mutation is double-gated
(invoke_query to reach the tool, change for the write).

- InvokeStoredQueryRequest + an untagged InvokeStoredQueryResponse
  { Read(ReadOutput), Change(ChangeOutput) } → one Json<_> return type and a
  oneOf 200 schema (a correct contract, not a wrong-but-simple one).
- Route lives in per_graph_protected → single-mode /queries/{name} and
  multi-mode /graphs/{id}/queries/{name} for free.
- Deny == unknown: an invoke_query denial and a missing query both return the
  same 404, so the catalog can't be probed by an unauthorized caller.
- OpenAPI regenerated; tests cover read, mutation double-gate (403 vs 200),
  bad-param 400, and the identical-404 deny path.

Completes the MR-969 V1 invocation slice (registry + /queries/{name} + invoke_query).

* docs: stored-query invocation endpoint; flip the not-yet-exposed caveat

Now that POST /queries/{name} ships (C7), document it: add the endpoint to
server.md's inventory + an invocation section (body, untagged read/mutate
envelope, invoke_query gate, double-gated mutations, deny == 404), and flip
the startup note that said invocation was not yet exposed. In policy.md,
replace "no invocation call site yet" on the invoke_query action with a
pointer to the endpoint.

* Scope the stored-query 404-hiding claim to non-invoke_query callers

Review found the deny==404 catalog-hiding was overstated as a contract: it
holds only at the outer invoke_query gate. A caller that HOLDS invoke_query
but lacks read/change gets the inner gate's 403 for an existing query vs 404
for an unknown one — so existence is visible to grant-holders by design (the
intended double-gate). The handler docstring, OpenAPI 404 description, and
server.md all claimed the 404 was airtight against any denied actor.

Correct the wording in all three (no behavior change) and add the missing
symmetric test (invoke_query but no read -> 403 for an existing query, 404
for unknown) so the actual contract is pinned. Also document that in
default-deny mode (tokens, no policy) every invocation 404s until an
invoke_query rule is configured.

Nits: the from_specs collision comment said "first declared wins" but it is
lexicographically-first by name (BTreeMap); the effective_tool_name docstring
overclaimed the CLI display routes through it (it resolves the rule on its
own output DTO).

* Default mcp.expose to true (the manifest entry is the opt-in)

expose controls MCP-catalog membership only — it is not an authorization
gate (invocation is gated by invoke_query regardless). So requiring a
per-query mcp.expose: true was friction with no safety benefit: a
non-exposed query is still HTTP-invocable by name. Flip the default so
declaring a query in the manifest exposes it to the agent tool catalog by
default; expose: false is the escape hatch for service-only queries.

Both the absent-mcp path (Default impl) and the present-but-no-expose path
(serde default fn) now yield true. Doc comments + cli-reference updated; the
config round-trip test asserts the new default.

* Add GET /queries stored-query catalog endpoint

List a graph's mcp.expose stored queries as a typed tool catalog so a client
(the MCP server) can register them as tools without fetching .gq source.
Each entry carries name, MCP tool_name, description/instruction, a
read/mutate flag, and decomposed typed params (kind enum: string|bool|int|
bigint|float|date|datetime|blob|vector|list, plus item_kind for lists and
vector_dim) — so the consumer builds an input schema with a closed match and
never re-parses omnigraph type spelling. I64/U64 are bigint (string on the
wire): a JSON number loses precision past 2^53 and the engine already accepts
decimal strings.

Read-gated (works in default-deny; the catalog is graph-wide, authorized
against main). NOT Cedar-filtered per query yet — a reader can list a query
whose invoke_query they lack (documented gap until per-query authz lands);
invocation stays invoke_query-gated + deny==404.

- api: QueriesCatalogOutput / QueryCatalogEntry / ParamDescriptor / ParamKind
  + query_catalog_entry (reuses PropType::from_param_type_name; scalar_kind is
  exhaustive, so a new ScalarType is a compile error here until catalogued).
- GET /queries route in per_graph_protected (→ /graphs/{id}/queries in multi
  mode); OpenAPI regenerated; path allowlists updated.
- Tests: projection unit (every kind, list, vector, nullable, mutation,
  empty) + handler (exposed-only filter, read-gate probe-oracle, empty
  registry).

* docs: GET /queries stored-query catalog endpoint

Document the catalog: the endpoint table row (GET /queries, read-gated), a
catalog section (typed-param kind enum, bigint/date/datetime/blob-as-string,
graph-wide/branch-independent, mcp.expose default true, the read-gated
probe-oracle gap), and flip the startup note now that the catalog ships.

* Collect file-I/O and parse errors in QueryRegistry::load in one pass

load() early-returned on any unreadable .gq file, masking parse / identity /
tool-name-collision errors in the OTHER (readable) files — so an operator
fixed the missing file, restarted, and only then saw the next broken query.
Now it collects I/O errors but still runs from_specs on the readable specs
and returns the union, so every broken entry surfaces at once (matching the
collected-errors contract the rest of the registry already follows).

Safe: from_specs' tool-name collision check runs over loaded queries only, so
dropping an I/O-failed entry can only under-report a collision, never invent
one. I/O errors are ordered first (BTreeMap key order), then spec errors.

Adds a load-level test (tempdir: a valid, a missing, and a parse-broken .gq)
asserting all three surface in one Err — confirmed red before the fix.

* Make invoke_query graph-scoped (one branch authority)

invoke_query gates reaching the curated stored-query surface — a graph-level
capability. Per-branch/snapshot access is already enforced by the inner
read/change gate in run_query/run_mutate (authorized against the resolved
branch), so branch-scoping the outer gate was redundant AND wrong for snapshot
reads (it defaulted to main). Drop the branch dimension: remove InvokeQuery
from uses_branch_scope (it joins admin as graph-scoped) and authorize the
boundary gate with branch: None.

Lossless: an actor confined to branch X by their read/change rules can still
only invoke a stored query that touches X. A rule that sets branch_scope on
invoke_query is now rejected by validate() — write invoke_query in its own
rule.

Ripple (atomic): restructure the server invoke fixture so invoke_query sits in
its own branch_scope-free rule; invert invoke_query_is_branch_scoped ->
invoke_query_rejects_branch_scope; the per-graph authorize test uses
branch: None; docs (policy.md, server.md, the InvokeQuery doc). No wire/OpenAPI
change.

* Resolve graph config by identity, not server mode

Which policy/queries block applies for a graph was decided three different,
mode-dependent ways: single-mode boot used top-level even for a named graph;
multi-mode used per-graph (and silently ignored a top-level queries block); the
CLI used per-graph for a named target. So `queries validate --target prod`
could check a different registry than the single-mode server loaded, and a
named graph's per-graph policy/queries were silently shadowed.

Make config a function of graph IDENTITY: a graph served by NAME
(--target/server.graph, a graphs: entry) uses its own graphs.<name>.{policy,
queries}; a bare URI is anonymous and uses top-level. One rule, applied by
single-mode boot, multi-mode boot, and the CLI — so they can't diverge and the
CLI predicts the server exactly.

No silent ignore: serving a named graph while a top-level policy/queries block
is populated now refuses boot, naming the block (the multi-mode top-level-policy
bail, extended to queries and to single-mode-named). The CLI's `queries
validate` derives the schema URI and the registry from ONE selection, and a
positional URI forces anonymous (ignoring cli.graph) so the two can't come from
different graphs.

BREAKING (released behavior): single mode by name (--target/server.graph) with
top-level policy/queries previously used top-level; it now uses the per-graph
block and refuses boot if top-level is also populated. Bare-URI single mode is
unchanged. Loud, with migration text pointing at graphs.<name>.

- config: resolve_policy_file_for (policy sibling of query_entries_for, no
  top-level fallback) + populated_top_level_blocks for the coherence check.
- characterization tests (single-mode named -> per-graph; named + top-level ->
  bail; multi-mode top-level queries -> bail; CLI positional-URI -> top-level).
- docs: policy.md, server.md, cli-reference.md.

* docs: RFC-002 credentials keyed by server name (keychain/profile/env)

Reworks the RFC's credentials model: secrets are keyed by server name — OS
keychain `omnigraph:<server>` (preferred) -> a `[<server>]` profile in
`~/.omnigraph/credentials` -> `OMNIGRAPH_TOKEN[_<SERVER>]` env (CI), the
AWS/gh/kube model. `servers.<name>` is endpoint-only by default but may carry
an explicit, secret-free `auth: { token: { env|file|command|keychain } }`
source. The shipped `bearer_token_env` + `.env.omni` dotenv remain a legacy
compat path; no `credentials.yaml`.

* docs: RFC-002 — typed graph locator (storage/server/graph_id), not a uri string

Add §1.1: the resolved graph address is a typed GraphLocator
(Embedded{storage} | Remote{server, graph_id}), not a flat uri: String.
Diagnoses the string model's cost in the code today (~16 is_remote_uri forks,
TargetConfig can't express multi-server x multi-graph, the CLI bails on remote,
the ts SDK models baseUrl+graphId separately) and settles the YAML naming so
the key names the locus:

- storage: (embedded) — shipped uri: is a deprecated alias
- server: + graph_id: (remote) — graph_id defaults to the entry key
- storage xor server, reject both/neither (no silent ambiguity)

Kills the graphs:/graph: collision and the uri:-might-be-a-server ambiguity.
Updates the §1/§8 examples and the entry-shape notes to the new naming.

* Test: queries list must reject an unknown --target

queries list opens no graph URI, so unknown-graph validation does not ride
along on resolve_target_uri the way it does for every other command. The new
test reproduces the gap: with an unknown --target the command currently exits 0
and prints the (empty) top-level registry instead of erroring like the
URI-resolving commands do. Fails against current code; the fix follows.

* Validate the graph selection in queries list

Graph-existence validation was a side effect of URI resolution: every
URI-resolving command rejects an unknown --target via resolve_target_uri, but
queries list opens no URI, so query_entries_for(Some(unknown)) silently fell
back to the top-level registry and showed the wrong (or empty) catalog.

Make membership a property of the selection: add the fallible
resolve_graph_selection alongside the infallible query_entries_for (a known
name passes through, an unknown name errors with the same message as
resolve_target_uri, None stays anonymous), and validate the selection in
execute_queries_list. query_entries_for is unchanged — server boot's bare-URI
path still needs its None -> top-level arm.

* Surface policy-engine errors from stored-query invoke

The invoke handler mapped every authorize_request failure to 404 ('stored
query not found'), which collapsed the authorization decision (deny -> 403)
together with operational failures (no actor -> 401, Cedar evaluation error ->
500). A real policy-engine 500 was hidden as a missing query.

Separate the two concerns instead of sniffing the masked status. Extract
authorize() returning an Authz { Allowed, Denied(msg) } decision and reserve
Err for operational failures only; authorize_request becomes a thin wrapper
that maps Denied -> 403, so the 16 deny-as-403 callers are unchanged. The
invoke handler now matches the decision directly: a denial stays 404 (deny ==
missing, so the catalog can't be probed without the grant), while a 401/500
propagates with its true status.

500 is now a reachable outcome on POST /queries/{name}; document it in the
endpoint responses and regenerate openapi.json.

* Extract the named-graph/top-level coherence rule into one helper

The rule 'a named graph uses its own graphs.<name> block, so a populated
top-level block is a config error' lived inline in single-mode server boot.
Extract it to OmnigraphConfig::ensure_top_level_blocks_honored so the same
definition can be shared by the CLI selection gate (next commit) and the two
can't drift. Boot calls the helper; the message is reworded context-neutral
(drops 'serving') so it reads correctly from both boot and the CLI.

Behavior-preserving: multi-graph mode keeps its own unconditional check, and
single_mode_named_graph_rejects_top_level_blocks still passes.

* Test: queries validate/list must reject a named graph with a top-level block

Server boot refuses a config where a graph is selected by name yet a top-level
queries:/policy.file block is populated (the block would be silently ignored).
The CLI's queries validate/list resolve the same named selection but skip that
coherence check, so they give a false green / list the per-graph block. The new
test reproduces it: validate prints OK and list succeeds where boot would
refuse. Fails against current code; the fix follows.

* Enforce top-level coherence in the single CLI selection gate

queries validate validated graph membership only as a side effect of URI
resolution and queries list only via resolve_graph_selection's membership
check; neither applied the named-graph/top-level coherence rule server boot
enforces, so both gave a false green on a config boot refuses.

Fold ensure_top_level_blocks_honored into resolve_graph_selection so it is the
single gate that returns only valid + server-coherent selections, and route
resolve_selected_graph (queries validate) through it; queries list already
calls the gate. A named graph with a populated top-level block now errors in
both commands, matching boot. A positional URI stays anonymous (top-level
honored), so queries_validate_positional_uri_ignores_default_graph is
unaffected.

* docs: RFC-003 — MCP server surface for omnigraph-server

Detailed MCP-transport design for the stored-query/MCP work, building on the
shipped #128 registry. Corrects the draft against the branch head: the coarse
invoke_query gate + 404 denial-masking are already wired (server_invoke_query),
so per-query invoke_query scope (PolicyRequest has no query-name dimension yet)
is the real prerequisite; positions the doc as superseding rfc-001's MCP
transport (/mcp/tools+/mcp/invoke) and reconciles the shipped mcp.expose YAML
form and the schema-introspection non-goal; grounds the parity surface in the
actual omnigraph-ts package (13 tools with read/change ids, 2 resources).

* docs(config): clarify graph config boundaries

* fix(config): enforce graph-scoped policies and query validation

* fix(cli): require graph selection for scoped query registries

* fix(server): preserve named graph id in single mode policy

* fix(cli): share graph identity for policy resolution

* test(cli): cover policy tooling server graph selection

* fix(cli): honor server graph for policy tooling

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 22:50:31 +02:00
Ragnor Comerford
353c0c876a
fix(branch): make branch delete correct under partial failure (#137)
* test(lance): pin force_delete_branch surface guard

Pin the Lance 6.0.1 force_delete_branch behavior the branch-delete
single-authority redesign relies on: plain delete_branch errors on a
missing ref, force_delete_branch removes an existing forked branch, and
the local-store quirk where force_delete on a fully-absent branch still
errors (worked around by the upcoming TableStore::force_delete_branch).

Re-pin the docs/dev/lance.md alignment stanza (9 guards; 4 runtime).

* feat(storage): add force branch-delete to TableStore + CommitGraph

Add TableStore::force_delete_branch and CommitGraph::force_delete_branch
(idempotent: tolerate an already-absent branch via Lance RefNotFound /
NotFound), plus CommitGraph::list_branches for the cleanup reconciler to
diff against the manifest authority. RefConflict (referencing
descendants) is still surfaced. Unused until the branch-delete rewire.

* test(maintenance): red — cleanup reconciles orphaned branch forks

Forge a Lance branch on the Person table that the manifest never
references (a zombie fork from an incomplete prior delete) and assert
cleanup reclaims it while leaving main intact. Fails today: cleanup does
not yet reconcile orphaned forks. Goes green with the next commit.

* fix(maintenance): reconcile orphaned branch forks in cleanup

Add reconcile_orphaned_branches: force_delete_branch every per-table and
commit-graph Lance branch absent from the manifest branch set (the
authority), children-before-parents. Folded into cleanup_all_tables,
runs before version GC. Idempotent and authority-derived; no-ops once
nothing is orphaned, and would harmlessly find nothing if a future Lance
atomic multi-dataset branch op prevented orphans. Adds TableStore::list_branches
and exposes graph_commits_uri(pub crate). Turns the maintenance red test green.

* test(failpoints): red — branch_delete partial failure converges

Add the branch_delete.before_table_cleanup failpoint hook (inert without
the feature) and a regression test: a cleanup-step failure after the
manifest authority flip must leave branch_delete returning Ok, the branch
gone, the orphan stranded, then reclaimed by cleanup, and the name
reusable. Fails today: cleanup_deleted_branch_tables propagates the error
as a hard failure. Goes green with the next commit.

* fix(branch): best-effort fork reclaim after the manifest flip

Make branch_delete treat per-table forks and the commit-graph branch as
derived state reclaimed best-effort with force_delete_branch after the
manifest authority flip. A reclaim failure (transient error, or the
branch_delete.before_table_cleanup failpoint) is logged via tracing::warn
and swallowed: the branch is already gone and the cleanup reconciler
converges the orphan. cleanup_deleted_branch_tables no longer returns an
error or blocks the call. Turns the partial-failure recovery test green.

* test(failpoints): red — recreate over orphaned fork is actionable

After a partial-failure delete leaves a fork orphaned, recreating the
branch name and writing to the previously-forked table before cleanup
runs currently surfaces the opaque ExpectedVersionMismatch ("stale view
... expected manifest table version N"). Assert instead a clear error
pointing the user at cleanup. Goes green with the next commit.

* fix(branch): actionable orphan-collision error in fork_branch_from_state

When a fork's create_branch collides with an existing target ref, reuse
it only if its head matches source_version (a legitimate concurrent
first-write). A version mismatch means a zombie fork from an incomplete
prior delete: return a manifest_conflict pointing the user at
`omnigraph cleanup`, instead of the opaque ExpectedVersionMismatch.
Turns the recreate-over-orphan red test green.

* docs(invariants): single-authority branch-lifecycle + Lance forward-compat

Record branch delete in the Current Truth Matrix: manifest is the single
authority flipped atomically first, per-table forks + commit-graph branch
are derived state reclaimed best-effort with the cleanup reconciler as
backstop, and reusing a name whose reclaim failed surfaces an actionable
error. Note the reconciler is authority-derived and degrades to a no-op
under a future Lance atomic multi-dataset branch op, the same shape as
invariant 7.

* test(failpoints): red — cleanup isolates a single-table failure

Add the cleanup.table_gc failpoint hook (inert without the feature) and
an error: Option<String> field on TableCleanupStats (mechanical, always
None for now). Regression test: a one-shot version-GC failure for one
table must not abort the whole cleanup — assert cleanup still succeeds,
surfaces the failure per-table in stats, and the independent reconcile
pass still reclaimed an orphan. Fails today: the version-GC collect
aborts on the first table error. Goes green with the next commit.

* fix(maintenance): fault-isolate cleanup per table

Make the cleanup sweep do as much as it can and converge on re-run
instead of aborting wholesale on one table's transient error
(invariant 13). The version-GC loop now records a per-table failure on
its stats row (error: Some) and logs it rather than collecting into a
Result that aborts; reconcile_orphaned_branches isolates per-table and
commit-graph failures into BranchReconcileStats.failures. The CLI reports
any failed tables and tells the user to rerun cleanup. Addresses the
Devin review finding. Turns the single-table-failure test green.

* test(failpoints): red — branch_create heals commit-graph zombie + is atomic

Add the branch_delete.before_commit_graph_reclaim failpoint hook and two
regression tests: (a) recreating a name whose delete left a commit-graph
zombie must succeed (today it dies on Lance's internal Clone error), and
(b) branch_create must roll back the manifest branch when the derived
commit-graph branch fails (today it leaves the manifest branch created
while returning Err). Both fail now; green with the next commit. The
existing branch_create_failpoint_triggers test still passes.

* fix(branch): make branch_create atomic + heal commit-graph zombie

branch_create now flips the manifest authority first, then creates the
derived commit-graph branch in create_commit_graph_branch, force-dropping
any orphaned commit-graph ref left by an incomplete prior delete (the
manifest branch is fresh, so a same-named commit-graph branch is provably
a zombie). If commit-graph creation fails, the manifest branch is rolled
back so the name never half-exists. Addresses the Codex review finding.
Turns the two branch_create red tests green; existing tests unaffected.

* test(failpoints): red — fork collision misclassifies live concurrent fork

Add the fork.before_classify failpoint hook and a concurrency test: when
a concurrent first-write legitimately wins the fork race, the loser must
get a retryable refresh-and-retry, not the misleading run-cleanup orphan
error. Today the version-comparison misclassifies the live fork as an
orphan (the Cursor finding). Goes green with the next commit.

* fix(branch): manifest-arbitrated fork-collision classification

Classify a fork collision by the manifest authority instead of comparing
Lance branch versions. Before forking, open_owned_dataset_for_branch_write
re-reads the live manifest: if the table is already forked on the active
branch, a concurrent first-write won and the loser gets a retryable
refresh-and-retry (not a misleading orphan error). fork_branch_from_state
no longer guesses from versions — a create collision past that check is
an orphan, so it returns the actionable cleanup error. Addresses the
Cursor finding; turns the live-concurrent-fork test green, zombie path
unchanged.

* test(failpoints): close branch-lifecycle test gaps

Three coverage additions for the branch-delete work (behavior already
correct; these lock it in and catch regressions):

- cleanup_isolates_reconcile_failure: inject a force-delete failure into
  the reconcile loop (new cleanup.reconcile_fork hook) and assert the
  sweep continues + converges on re-run. Directly covers the reconcile
  loop the Devin finding was about (previously only version-GC was).
- cleanup_reclaims_orphaned_commit_graph_branch: forge a commit-graph
  orphan via the delete reclaim failpoint and assert cleanup's
  reconcile_commit_graph_orphans drops it (previously untested).
- fork_collision_with_live_concurrent_fork_is_retryable: replace the
  fixed 300ms sleep with a deterministic readiness signal (cfg_callback +
  compare_exchange atomics) so the two-writer ordering can't flake.

Full failpoints suite 31/0.
2026-06-01 13:28:38 +02:00
Ragnor Comerford
2d5c4b1202
docs: rename runs.md/runs.rs → writes and repoint all references (#131)
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / Test Windows release binaries (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
The Run state machine was removed in MR-771 (v0.4.0); `docs/dev/runs.md`
and `crates/omnigraph/tests/runs.rs` have since documented and tested the
direct-publish write path, so the "runs" name was misleading.

- git mv docs/dev/runs.md → docs/dev/writes.md (reframe H1 + intro;
  keep MR-771 history note)
- git mv crates/omnigraph/tests/runs.rs → tests/writes.rs (reframe header)
- repoint every runs.md / runs.rs reference across docs, AGENTS.md, and
  source comments
- fix four pre-existing broken `docs/runs.md` links (the file never lived
  at that path) to `docs/dev/writes.md`
- fix the stale v0.4.0 anchor to the live section

No behavior change: every source edit is a comment. Engine builds and the
renamed test passes 25/25; scripts/check-agents-md.sh passes.

The run-removal cleanup itself (run_registry.rs guard, __run__ prefix) is
deferred to MR-770.
2026-05-30 23:20:56 +02:00
Ragnor Comerford
24413844ae
Add Windows release binaries (#127)
* Add Windows release binaries

* Fix Windows installer downloads
2026-05-30 14:23:40 +02:00
Ragnor Comerford
50910b3753
docs: align release artifact docs 2026-05-29 14:04:16 +02:00
devin-ai-integration[bot]
1a4d2cee97
feat: inline query strings in CLI and HTTP server (#110)
* feat(MR-656): inline query strings in CLI and HTTP server

CLI:
- Add -e / --query-string <STRING> to omnigraph read and omnigraph change
- Exactly one of --query, --query-string, --alias is required (3-way XOR)
- Empty --query-string is rejected with a clear error

HTTP:
- New POST /query (read-only, clean field names: query/name/params/branch/snapshot)
- Mutations on /query are rejected with 400 -- use POST /change instead
- ChangeRequest fields polished: query (alias query_source), name (alias query_name)
- POST /read and POST /change remain byte-compatible for existing clients

Tests:
- cli.rs: -e happy-path on read/change, mutex error vs --query, empty -e rejected
- system_local.rs: inline -e read and -e change exercise the local flow
- system_remote.rs: inline -e read/change over HTTP plus direct /query 200/400
- server.rs: /query 200, /query 400 on mutation, /change legacy field alias
- openapi.rs: new /query path, QueryRequest schema, ChangeRequest field-name polish

Docs: cli.md (-e examples), cli-reference.md (read/change rows), server.md (/query)
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* feat(MR-656): rename read/change to query/mutate with deprecation signals

HTTP server:
- Add POST /mutate as canonical write endpoint (pairs with POST /query).
- Mark POST /read and POST /change as deprecated. Three-channel signal:
  * OpenAPI: `deprecated: true` on the operation (every codegen flags
    the generated SDK method).
  * RFC 9745: response `Deprecation: true` header on every response.
  * RFC 8288: response `Link: </successor>; rel="successor-version"`
    pointing at /query and /mutate respectively.
- Share business logic across /mutate and /change via run_mutate(); the
  /change wrapper is the only place that adds the deprecation headers.
- ChangeRequest field aliases (query_source/query_name) preserved.
- AliasCommand serde now accepts `query`/`mutate` alongside `read`/`change`.

CLI:
- Promote `omnigraph query` / `omnigraph mutate` to top-level canonical
  subcommands (clap visible_alias keeps `omnigraph read` / `omnigraph
  change` working forever).
- Promote `omnigraph lint` / `omnigraph check` to top-level (was nested
  under `omnigraph query lint`, which is now a deprecated argv shim that
  rewrites to the canonical form).
- Argv-level preprocessing prints a one-line deprecation warning to
  stderr when any legacy spelling is used. Canonical names are silent.

Tests:
- Server: /mutate works, /change emits Deprecation+Link headers, /read
  emits Deprecation+Link headers, /query carries no deprecation signal.
- OpenAPI: /read and /change flagged deprecated; /query and /mutate not.
- CLI: canonical `lint` matches deprecated `query lint` / `query check`
  output; `read` / `change` print deprecation warnings.

Docs:
- cli.md: new canonical examples; "Deprecated names" migration table.
- cli-reference.md: top-level table updated; aliases.<name>.command
  accepts both legacy and canonical spellings.
- server.md: endpoint inventory shows /query and /mutate as canonical
  and /read and /change as deprecated; dedicated section explains the
  three-channel deprecation signal.
- og-cheet-sheet.md: use new `omnigraph lint` / `omnigraph check`.
- openapi.json regenerated.

Migration is purely cosmetic — every deprecated form continues to work
indefinitely; only the spelling changes.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* fix(MR-656): address Devin Review findings on /query and /change

Two issues raised by Devin Review on PR #110:

1. `POST /query` mutation-rejection error pointed at the deprecated
   `/change` endpoint instead of the canonical `/mutate`. Fixed in
   three places: the runtime error message in `server_query`, the
   utoipa 400-response description, and the handler doc comment. The
   `QueryRequest` schema docstrings in `api.rs` got the same update so
   the openapi.json bodies match. Server and openapi tests updated.

2. `execute_change_remote` serialized `ChangeRequest` directly, which
   emits the new canonical field names `query` / `name` on the wire.
   `#[serde(alias = "query_source")]` only affects deserialization, so
   a newer CLI talking to an older server would have its `/change`
   POST body fail with "missing field: query_source". Fixed by
   extracting a `legacy_change_request_body` helper that hand-rolls
   the JSON with the legacy keys (`query_source` / `query_name`), the
   same byte-stable contract `execute_read_remote` already uses
   against `/read`. Added two unit tests on the helper to lock the
   wire shape in.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* docs(dev): RFC 001 — inline + stored queries, envelope, MCP

Tracked artifact consolidating the design across MR-656 (this branch),
MR-976 (Phase 1 envelope hardening parent, with MR-977/978/979/980
sub-issues), and MR-969 (stored queries + MCP).

Sections:

* Two paths, one engine — inline `/query` + `/mutate` (this PR) coexist
  with stored `/queries/{name}` (MR-969). Same `run_query` / `run_mutate`
  backend (the fold-in landed in the previous commit).
* Request envelope ("before") — Idempotency-Key, If-Match, X-Deadline,
  X-Trace-Id, expect, dry_run, fields. Phase 1 ships the load-bearing
  subset on `/mutate`.
* Response envelope ("after") — audit_id, snapshot_id, commit_id, stats,
  warnings. Closes the provenance loop today's `ChangeOutput` leaves
  open.
* `.gq` pragmas — `@description`, `@returns`, `@mcp`. Source-of-truth
  for the stored-query agent contract; no separate YAML registry.
* Multi-graph MCP — per-graph `/graphs/{id}/mcp/tools` + `/mcp/invoke`.
  Token binds to one graph by default; cross-graph agents loop.
* Cedar split — `read`/`change` for inline, `invoke_query` for stored.
  Operators deny ad-hoc for agent groups while keeping curated tool
  list open.
* Rejected alternatives — per-env override files, compiled bundles,
  tool-name prefixing across graphs, body-field graph dispatch.

Index entry added under "Active Implementation Plans" so future agents
land on the RFC before touching queries / mutations / envelope code.
`scripts/check-agents-md.sh` clean (35 links, 34 docs).

* docs(server): clarify why run_query lacks AppState parameter

run_mutate takes state for workload admission; run_query doesn't because
reads aren't admission-gated today. Mark the asymmetry as intentional and
flag the two future events that would grow the signature: Phase 1's
`expect: { max_rows_scanned: N }` budget (MR-976) or per-actor admission
extending to stored-read invocations (MR-969). Prevents the natural
"make these symmetrical" follow-up.

* refactor(server): run_query / run_mutate take &ResolvedActor

Replace `Option<Extension<ResolvedActor>>` in the helpers with
`Option<&ResolvedActor>`. Saves MR-969's stored-query handler from
wrapping a bare actor in axum's `Extension(...)` before calling.
Handler signatures (`server_query`, `server_read`, `server_mutate`,
`server_change`) keep `Option<Extension<ResolvedActor>>` because that
is what axum injects, and unwrap at the call site with
`actor.as_ref().map(|Extension(actor)| actor)`.

Net: -13/+10 LOC, 89/0 server tests pass.

* docs(releases): v0.6.0 — describe inline + canonical-named queries (MR-656)

Extend the v0.6.0 release notes to cover the third piece of work landing
alongside the graph terminology rename and multi-graph server mode:
canonical-named `POST /query` and `POST /mutate` endpoints, the CLI's
new `-e/--query-string` flag, the top-level promotion of `lint` /
`check`, and the three-channel deprecation signal on `/read` and
`/change` (OpenAPI `deprecated: true` + RFC 9745 + RFC 8288).

Additions:

* Top blurb: "Two pieces" -> "Three pieces" with a bullet describing
  the rename + inline flow.
* Breaking Changes: new "Query / mutation rename" subsection covering
  the `ChangeRequest` field rename (with the back-compat serde aliases
  and the CLI's `legacy_change_request_body` byte-stable wire helper)
  and the `omnigraph query lint` -> `omnigraph lint` move.
* New: 5 bullets — the two endpoints, the CLI subcommands, the `-e`
  flag, the deprecation signal channels, the widened `aliases.<name>.command`
  vocabulary.
* User Impact: one bullet making explicit that the rename is cosmetic
  on the client side and migration is voluntary.
* Documentation: pointers to the updated `server.md` / `cli.md` /
  `cli-reference.md` and the new `docs/dev/rfc-001-queries-envelope-mcp.md`.

+15/-1 lines. `./scripts/check-agents-md.sh` clean.

* refactor(cli): demote `check` from visible_alias to deprecation shim

`omnigraph check` was a clap `visible_alias` on `lint`, advertised in
`--help` as an equivalent canonical name. Per MR-981 §6 (long-form
flags as canonical, short forms as visible aliases), visible aliases
on subcommand names hurt agent CX: agents emit either spelling
depending on training-data drift, and there's no length signal
pointing at the canonical name.

Changes:

* Remove `#[command(visible_alias = "check")]` from the `Lint` variant.
  `omnigraph --help` now shows only `lint`.
* Add bare `check` to `rewrite_deprecated_argv` so `omnigraph check
  <args>` still works — it rewrites to `omnigraph lint <args>` and
  emits a one-line stderr deprecation warning, matching the existing
  pattern for `read` / `change` / `query lint` / `query check`.
* Fix the nested `query check` shim to substitute `check` -> `lint` in
  the rewritten argv (previously it relied on `check` being a
  visible_alias to reach the `Lint` variant).
* New test `deprecated_check_top_level_rewrites_to_lint` covers: bare
  `check` produces identical stdout to `lint`, emits the deprecation
  warning, and `check` does NOT appear as an alias in `omnigraph
  --help`.
* Release notes updated to reflect the deprecation-shim treatment and
  cross-reference MR-981 §6 reasoning.

Cargo / Go users typing `check` still work indefinitely; one stderr
nudge per invocation teaches the canonical name. Agents see only
`lint` in `--help --json` so they emit one canonical form.

67/0 omnigraph-cli tests pass; 39 workspace test suites green.

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-authored-by: Ragnor Comerford <hello@ragnor.co>
2026-05-29 13:41:54 +02:00
Ragnor Comerford
fd41f798b7
chore(codeowners): remove aaltshuler as owner 2026-05-28 11:41:38 +02:00
Ragnor Comerford
972a6e047b
chore(codeowners): add ragnorc as engineering owner 2026-05-28 11:20:51 +02:00
Ragnor Comerford
cc2412dc65
Rename repo terminology to graph (#118)
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
2026-05-24 16:46:00 +01:00
Andrew Altshuler
3551e0d40e
chore(lance): bump 4.0.0 → 6.0.1 (DataFusion 52→53, Arrow 57→58) (#111)
* tests: add lance_surface_guards pre-flight pins for the v6 bump

Land 8 named guards in a new test file that pin Lance API surfaces
OmniGraph relies on. Each guard turns a silent-break risk (variant
rename, struct restructure, async-flip) into a red CI bar instead of
runtime drift.

Guards (mapped to the silent-break inventory from the v6 migration plan):

  Runtime (#[tokio::test]):
  1. lance_error_too_much_write_contention_variant_exists — pins the
     variant referenced by db/manifest/publisher.rs::map_lance_publish_error.
  2. manifest_location_field_shape — pins .path/.size/.e_tag/.naming_scheme
     types and ManifestLocation accessor returning &Self (the access
     pattern at db/manifest/metadata.rs:84-88).
  6. write_params_default_does_not_set_storage_version — confirms our
     explicit V2_2 pin remains load-bearing (blob v2 requirement).

  Compile-only async fns (#[allow(...)] + unimplemented!() placeholders;
  never run, but cargo build --tests enforces the API shape):
  3. checkout_version + restore chain — pins the recovery rollback hammer
     at db/manifest/recovery.rs:505-522.
  4. DatasetBuilder::from_namespace().with_branch().with_version().load()
     — pins the namespace builder chain at db/manifest/namespace.rs:162-174.
  5. MergeInsertBuilder fluent chain — pins the manifest CAS at
     db/manifest/publisher.rs:370-391, including the return shape
     (Arc<Dataset>, MergeStats).
  7. compact_files(&mut ds, CompactionOptions, None) — pins
     db/omnigraph/optimize.rs:107.
  8. DeleteResult { new_dataset, num_deleted_rows } — pins the inline
     delete result shape (MR-A will repurpose this guard to the staged
     two-phase variant once Lance #6658 migration lands).

This is commit 1 of the chore/lance-6.0.1 migration. Cargo bump
follows in commit 2 (will trigger the guards under v6 if any surface
drifted).

Per the migration plan at ~/.claude/plans/shimmering-percolating-duckling.md
(written this session). Two guards from the plan deferred to follow-up:
  - manifest_cas_returns_row_level_contention_variant (full publisher
    race integration test — needs harness scaffolding)
  - table_version_metadata_byte_compatible_with_v4 (TableVersionMetadata
    is pub(crate); requires test reach extension).

Verified on v4: cargo test -p omnigraph-engine --test lance_surface_guards
passes 3/3 runtime tests; cargo build -p omnigraph-engine --tests
compiles all 5 compile-only guards clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump Lance 4.0.0 → 6.0.1, DataFusion 52 → 53, Arrow 57 → 58

The Cargo bump itself. Source is intentionally untouched — this commit
will not compile. The compile errors are the work-list for subsequent
commits on this branch.

Lance updates: lance + 7 sub-crates 4.0.0 → 6.0.1. Transitive churn:
  + lance-tokenizer v6.0.1 (vendored tokenizer per Lance PR #6512)
  + object_store 0.13.x (Lance 6 brings it transitively; our explicit
    pin stays at 0.12.5 for now — revisit in stages if diamond bites)
  - tantivy* crates (replaced by lance-tokenizer)

Compile error landscape on this commit (11 errors):
  • 1× E0432: `lance_index::DatasetIndexExt` import (Lance PR #6280
    moved it to lance::index). Sites: table_store.rs:20,
    db/manifest.rs:37 (the second site was missed by the pre-flight
    inventory).
  • 8× E0599: `create_index_builder` / `load_indices` missing on
    `lance::Dataset` — all downstream of the DatasetIndexExt move.
    Once the import is corrected on table_store.rs and db/manifest.rs,
    these resolve automatically.
  • 2× E0063: missing field `is_only_declared` in `DescribeTableResponse`
    initializer at db/manifest/namespace.rs:221, 364. New Lance
    namespace field per the v5 namespace restructure (PR #6186).

Surface guards (lance_surface_guards.rs, commit d571fa8) all still
compile + the 3 runtime ones pass on v6 — none of the silent-break
surfaces drifted. That's the load-bearing observation: the publisher
CAS chain, ManifestLocation field shape, checkout_version/restore,
DatasetBuilder fluent chain, MergeInsertBuilder return shape,
WriteParams::default, compact_files signature, and DeleteResult
fields are all v6-stable.

Next commits address the 11 errors per the migration plan stages
3-8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* imports: move DatasetIndexExt to lance::index (Lance PR #6280)

Lance 5.0 (PR #6280) moved `DatasetIndexExt` out of `lance-index` into
`lance::index`. `is_system_index` and `IndexType` stayed in `lance-index`.

Mechanical update of 6 import sites:
  crates/omnigraph/src/table_store.rs:20 — split into two `use` lines
  crates/omnigraph-server/tests/server.rs:10 — was traits::DatasetIndexExt
  crates/omnigraph/tests/search.rs:6
  crates/omnigraph/tests/branching.rs:7
  crates/omnigraph/tests/failpoints.rs:467
  crates/omnigraph-cli/tests/cli.rs:3 — was traits::DatasetIndexExt

All 9 E0599 cascading errors on .create_index_builder / .load_indices
resolve once the trait is back in scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* namespace: add is_only_declared field to DescribeTableResponse

Lance namespace 6.0.0 added `is_only_declared: Option<bool>` to
`DescribeTableResponse` (lance-namespace-reqwest-client 0.7+ via the
v5.0 namespace API restructure, Lance PR #6186). Set to `Some(false)`
because every table BranchManifestNamespace returns from describe_table
is materialized — the manifest snapshot only includes entries for
tables we've already opened via Dataset::open.

Two sites in db/manifest/namespace.rs (BranchManifestNamespace +
StagedTableNamespace impls of LanceNamespace::describe_table).

Closes the last two compile errors from the v6 bump in the engine lib.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cargo: add lance to omnigraph-cli + omnigraph-server dev-deps

Stage 3 moved DatasetIndexExt imports from `lance-index` to `lance::index`
in the cli and server test crates. Both crates only had `lance-index`
in their dev-dependencies; add `lance` alongside so the new path
resolves.

This is the last compile-error fix from the v6 bump — `cargo build
--workspace --tests` is now green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: refresh Lance alignment audit for v6.0.1; bump surveyed version

Per CLAUDE.md maintenance rule 2 (same-PR docs):

- docs/dev/lance.md: replace the v4.0.1 alignment audit stanza with
  the v6.0.1 audit. Captures every v5/v6 finding from this PR (the
  DatasetIndexExt move, DescribeTableResponse.is_only_declared,
  MergeInsertBuilder return shape, ManifestLocation field shape,
  LanceFileVersion::default flip, file-reader async, tokenizer
  vendor, Lance #6658/#6666/#6877 status). Cross-references each
  guard in tests/lance_surface_guards.rs.

- AGENTS.md: bump "Storage substrate: Lance 4.x" → "Lance 6.x".
  Note: surveyed crate version stays at 0.4.2 — substrate version
  bumps are independent of OmniGraph's release version.

- crates/omnigraph/src/storage_layer.rs: update the trait module-level
  doc-comment to reflect that Lance #6658 closed 2026-05-14 and
  delete_where two-phase migration is MR-A (the next follow-up).
  #6666 stays open; create_vector_index inline residual stays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tests: silence clippy::diverging_sub_expression on compile-only guards

The five `_compile_*` async fns in lance_surface_guards.rs use
`let ds: Dataset = unimplemented!()` as a placeholder so type inference
can chase the method chain we want to pin, without ever running the
function. Clippy's `diverging_sub_expression` lint flags this pattern
because the RHS diverges; that's the entire point. Added to the
per-fn `#[allow(...)]` list, alongside dead_code / unreachable_code /
unused_variables / unused_mut already there.

No behavior change. cargo test -p omnigraph-engine --test
lance_surface_guards still 3/3 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: correct #6658 status — closed but API ships in Lance v7.x, not v6.0.1

The audit stanza in docs/dev/lance.md and the storage_layer.rs trait
doc-comment both implied the public DeleteBuilder::execute_uncommitted
API shipped with Lance 6.0.1. It did not. Issue #6658 closed
2026-05-14, but binary search across the release stream confirms:

  v6.0.1             no pub async fn execute_uncommitted on DeleteBuilder
  v6.1.0-rc.1       
  v7.0.0-beta.5     
  v7.0.0-beta.10     first appearance
  v7.0.0-rc.1       

So MR-A (delete two-phase migration) is gated on the Lance v7.x bump,
not on this PR. v7.0.0-rc.1 dropped 2026-05-21; GA likely within a
week.

No behavior change. Doc-only correction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(lib): bump recursion_limit to 256 — Lance 6 trait depth on Linux

Lance 6's heavier trait surface around futures/streams in storage_layer.rs's
staged-write API pushes the rustc trait-resolution recursion limit past
the default 128 on Linux builds. CI on PR #111 surfaced this in both
`Test Workspace` and `Test omnigraph-server --features aws`:

  error: queries overflow the depth limit!
    = help: consider increasing the recursion limit by adding a
      `#![recursion_limit = "256"]` attribute to your crate (`omnigraph`)
    = note: query depth increased by 130 when computing layout of
      `{async block@crates/omnigraph/src/storage_layer.rs:697:5: 697:10}`

(The async block is `stage_create_btree_index`'s body — its return type
is several layers of `impl Future<Output=Result<StagedHandle>>` deep on
top of Lance's own builder return types.)

Local macOS builds happened to short-circuit before tripping the limit,
which is why this didn't surface during the v6 bump sequence. The fix
rustc itself suggests is one line at the crate root.

No behavior change. Revisit if a future Lance bump stops needing it.

Verified: `cargo build --locked -p omnigraph-server --features aws`
compiles clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:42:29 +01:00
Andrew Altshuler
e98347eb7b
schema-lint chassis v1.0: DropProperty Soft + code-tagged diagnostics (MR-694) (#90)
* schema-lint chassis v1 (WIP): tier surfacing + plan doc

First commit of the chassis v1 branch. Lands a small, foundational
slice without behavior change, plus a planning doc that lays out the
remaining 7 commits in sequence so the PR can be reviewed
incrementally.

This commit:

- Adds SchemaMigrationStep::diagnostic() returning the full
  &'static DiagnosticCode (family + tier + severity) for
  UnsupportedChange steps with codes. Renderers can now reach the
  tier without re-implementing the code → tier lookup.

- CLI `omnigraph schema plan` output now displays tier alongside
  code:

    unsupported change on node:Person.age [OG-DS-104, destructive]:
        removing property 'Person.age' is not supported in schema
        migration v1

  Operators see at-a-glance the kind of risk each rejection
  represents — not just the rule identifier.

- No behavior change. All 11 existing schema_apply tests still pass.

Planning doc at docs/schema-lint-v1-plan.md tracks the 7 remaining
commits to bring v1 to feature-complete:

  1. (this commit) Tier surfacing in plan output.
  2. Soft / Hard mode enum on drop steps.
  3. Tombstone fields on catalog IR.
  4. Planner emits DropProperty { Soft } by default.
  5. Apply path implements Soft mode.
  6. Convert PR #62 destructive-rejection tests.
  7. --allow-data-loss flag + Hard mode.
  8. (optional) Tombstone unhide / restore command.

Delete the planning doc when v1 lands. Intentionally checked in to
the WIP branch so the scope is reviewable; not intended as a
permanent doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* schema-lint v1 commit 2: DropMode + dormant Drop* variants

Second commit of the chassis v1 branch. Lands the type-level shape
of soft/hard drops without wiring them up. Variants are reachable
from emitters but the planner doesn't produce them yet; the apply
path returns an explicit not-yet-implemented error if one shows up
via deserialization.

Added:

- `DropMode { Soft, Hard }` — orthogonal to `SafetyTier`. Tier
  classifies the rule's risk class; mode is the operator's intent
  for data treatment.
    - `Soft` → catalog tombstone, data retained. Tier: safe.
    - `Hard` → Lance-level removal. Tier: destructive; will require
      --allow-data-loss to apply (commit 7).

- `SchemaMigrationStep::DropType { type_kind, name, mode }` and
  `SchemaMigrationStep::DropProperty { type_kind, type_name,
  property_name, mode }` variants.

- Re-export `DropMode` from `omnigraph_compiler::DropMode` so
  downstream crates don't reach into the catalog submodule.

- CLI `render_schema_plan_step` arms for both variants, surfacing
  the mode in plan output: `drop property 'Person.age' of node
  'Person' (soft mode)`.

- `apply_schema_with_lock` exhaustive match arm for the two new
  variants that returns `manifest_internal` with a clear
  not-yet-implemented message. If a SchemaIR JSON containing
  Drop{Type,Property} arrives (e.g. from a future tool or hand-
  written), the apply path fails explicitly rather than silently
  misclassifying.

- Two new in-source tests:
    - `drop_steps_round_trip_through_serde` — pins the wire shape
      for all four (variant × mode) combinations.
    - `drop_mode_serde_uses_snake_case` — pins external-tool-
      friendly serialization (`"soft"` / `"hard"`).

Build: clean, only pre-existing warnings.
Tests:
- omnigraph-compiler schema_plan: 6/6 (4 existing + 2 new).
- omnigraph-engine schema_apply: 11/11 (unchanged — planner still
  emits UnsupportedChange for removal paths).

Next commit (commit 3 per docs/schema-lint-v1-plan.md): add the
`tombstoned: bool` fields to NodeIR / EdgeIR / PropertyIR for the
catalog representation of soft-mode tombstones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plan doc: reframe v1 around Lance native drop_columns

After a substrate audit of the Lance data-evolution guide on
2026-05-13, the v1 plan was simplified. Two key findings:

1. Lance's `drop_columns()` is already metadata-only and reversible
   via time travel until cleanup. No need for a parallel
   `tombstoned: bool` field in our catalog IR — Lance's version
   graph IS the tombstone.

2. The full schema_apply substrate migration (add_columns,
   drop_columns, alter_columns vs. stage_overwrite across all step
   types) is consolidated in MR-948 as a sibling issue. v1 only
   uses the relevant slice (drop_columns for OG-DS-1XX).

Net plan changes:

- Commit 3 (original): tombstone fields on catalog IR → dropped.
  No catalog IR change needed. The Lance drop_columns commit IS the
  tombstone.

- Commit 5 (original): apply path writes tombstoned: true → replaced
  with: apply path calls Dataset::drop_columns([name]).

- Commit 7 Hard mode: stage_overwrite removing the column → replaced
  with: drop_columns + compact_files + cleanup_old_versions. Same
  APIs omnigraph cleanup already uses.

- Commit 8 (original): omnigraph schema unhide → dropped. Time
  travel is the undo (omnigraph snapshot --at <commit>).

Net result: 8 commits → 5 commits. ~250 LoC less surface. More
substrate-aligned.

The chassis types from commit 2 (DropMode enum, DropType /
DropProperty variants) are kept exactly as designed; only the
implementation strategy changed.

The Lance docs quote is included in the doc so future readers see
the substrate behavior cited verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* schema-lint v1 commit 3: emit + apply DropProperty { Soft }

Wire the dormant DropProperty variant end-to-end for the Soft case.
Per docs/schema-lint-v1-plan.md, commit #3 of the schema-lint chassis
v1 series (MR-694).

Planner (schema_plan.rs):
- plan_properties: emit DropProperty { type_kind, type_name,
  property_name, mode: Soft } instead of UnsupportedChange when a
  property exists in accepted but not in desired. Plan is now
  supported = true for drop-only changes.

Apply (schema_apply.rs):
- Route DropProperty { Soft } through rewritten_tables. The existing
  batch_for_schema_apply_rewrite path already iterates the *target*
  schema fields, so a property absent from desired_catalog is
  naturally projected away. The prior Lance version retains the
  dropped column for time-travel reversibility (until cleanup runs).
- DropType still errors (lands in commit #4 with different mechanics:
  __manifest entry removal instead of column projection).
- DropProperty { Hard } still errors (lands in commit #5 with
  --allow-data-loss CLI flag + immediate compact_files +
  cleanup_old_versions).

Tests:
- Planner unit test plan_emits_soft_drop_for_removed_nullable_property
  asserts the variant emission + supported = true + no UnsupportedChange.
- Integration test apply_schema_drops_a_nullable_property_softly_
  preserves_prior_version (replaces the former
  apply_schema_rejects_dropping_a_property_with_data) asserts:
  (a) plan contains DropProperty { Soft }
  (b) apply succeeds + manifest advances + row count unchanged
  (c) current dataset schema lacks the dropped column
  (d) snapshot_at_version(pre_drop) still has the dropped column
  (e) reopen consistency — drop preserved across engine restart

Recovery: rides on SidecarKind::SchemaApply per MR-847. No new
sidecar kind needed; the entire apply path is already sidecar-wrapped.

Substrate alignment: this commit uses the stage_overwrite full-rewrite
path (full_rewrite cost class) rather than Lance native drop_columns
(catalog_only cost class). MR-948 is the follow-up substrate-alignment
refactor that introduces a LanceColumnOp surface and switches the
metadata-only case onto drop_columns. Functional outcome is identical;
cost-class improvement deferred.

Test results:
- cargo test -p omnigraph-compiler --lib: 238 passed
- cargo test -p omnigraph-engine --test schema_apply: 11 passed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: move schema-lint-v1-plan into docs/dev/ + add to index

Post-rebase fixup for the docs split (#93). The plan doc was added
to docs/ at the top level before main reorganized to docs/{user,dev}/.
This moves it into docs/dev/ and adds an entry to docs/dev/index.md
under a new "Active Implementation Plans" section so the
check-agents-md.sh link check passes.

Per the original commit message (617a77d), the plan doc is intentionally
temporary — it will be deleted when v1 lands.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 16:30:03 +03:00
Andrew Altshuler
0de5f69d86
docs: drop npx mdrip; use curl | pandoc for full-page fetches (#97)
The previous "fetch the full page" recommendation in AGENTS.md and
docs/dev/lance.md pointed at an unknown-author npm CLI that, on consent,
wrote agent-targeted content into AGENTS.md and modified .gitignore /
tsconfig.json. Source audit was clean of malicious code but the
self-perpetuating prompt-injection pattern combined with a single
maintainer and ~21 downloads/day made it not worth the risk. Switched
to the curl + pandoc command already documented as the no-tool option.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 16:06:24 +03:00
Andrew Altshuler
60eee78465
docs: split user and developer docs (#93) 2026-05-15 03:45:22 +03:00