Commit graph

145 commits

Author SHA1 Message Date
aaltshuler
08c9b03d40 test(cli): the embedded/remote parity matrix (RFC-009 Phase 1)
The referee before any unification moves: every forked verb runs once
against the local graph and once against a spawned server on a twin copy
of the same fixture, with the SAME actor (--as locally; bearer-resolved
remotely) and the SAME Cedar bundle on both arms — like-for-like
enforcement is part of the harness (a tokens-only server is default-deny
by design; comparing that against a bare local arm measures
configuration, not the fork). Declared-volatile fields (ids, wall-clock,
transport locations) scrub to placeholders; everything else must match
exactly, and exit codes must match for shared failures.

Headline result: 11 rows green with an EMPTY divergence ledger — the
arms agree on every verb today. The ledger (KNOWN_DIVERGENCES) exists so
any future divergence is pinned or filed, never silently repaired;
repairs are Phase 3's job, gated by this referee staying green.

One engine observation surfaced and filed (#207): inline execution with
a declared-but-unbound param matches ALL rows on both arms, while the
stored-query invoke path hard-errors — a cross-path asymmetry the matrix
pins as agreeing behavior pending a deliberate fix. Documented
exclusions (graphs list, ingest/load-over-/ingest, storage-plane verbs)
map to RFC-009 Phases 4-5.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:50:46 +03:00
Andrew Altshuler
e0d80c0062
Merge pull request #206 from ModernRelay/rfc/unify-access-paths
docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus
2026-06-12 17:38:22 +03:00
aaltshuler
9002cfd5b9 docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus
Adopts the unify-embedded/remote draft as RFC-009 with three alignment
amendments: (1) the promised 'companion config-authority RFC' is RFC-008,
already landed through stage 4 — referenced, not re-proposed; (2) open
question 3 is answered by the two-surface architecture (embedded graphs
list enumerates the cluster catalog via read_serving_snapshot, never
omnigraph.yaml); (3) Phase 2 salvages PR #139's reviewed-clean
omnigraph-api-types extraction instead of rebuilding. Adds the
cycle's two no-referee bugs (alias positional, write-if-absent flush) as
concrete parity-matrix motivation, and RFC-007's addressing/credential
chains as RemoteClient constructor inputs.

Corpus alignment: RFC-002's header now maps each of its pieces to the
successor that landed or superseded it (007/008/009) with a do-not-
implement-from-here-unchecked warning; RFC-007 gains the RFC-009
relationship; RFC-008 stage 5 notes the Phases-4/5 easing; dev index row.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:33:11 +03:00
aaltshuler
98c6147c38 docs(testing): bring the test map up to release truth
Lands an orphaned-but-accurate working-tree edit (the engine table rows
for forbidden_apis.rs, lance_surface_guards.rs, traversal_indexed,
proptest_equivalence, ordering, literal_filters, policy_engine_chassis —
all real files; 21 -> 28 count) and replaces the stale pre-modularization
crate rows: the CLI and server entries now describe the per-area suites
(#192/#193 splits) plus this cycle's additions (RFC-008 deprecation
coverage, keyed-credential auth, hermetic OMNIGRAPH_HOME harness, the
bucket-gated s3 suites).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:12:33 +03:00
aaltshuler
dedd647cde release: bump workspace to 0.7.0
All six crate manifests + their path-dependency constraints, Cargo.lock,
the regenerated openapi.json version metadata, AGENTS.md's surveyed
version, and the v0.7.0 release notes (object-storage clusters,
config-free --cluster serving, the operator config surface, keyed
credentials, operator targeting/aliases, and the omnigraph.yaml
deprecation stages).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 14:12:33 +03:00
aaltshuler
b24bb16d0c ci(codeowners): restore ragnorc to engineering and docs roles
Re-adds ragnorc to both roles in the source of truth and regenerates
CODEOWNERS + the ownership tables. This also resolves the standing
inconsistency from #169: branch-protection.json's
bypass_pull_request_allowances still listed ragnorc after his codeowners
removal — the two lists are in sync again (no protection change needed).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 13:45:33 +03:00
aaltshuler
4c50170c77 feat(config): OMNIGRAPH_NO_LEGACY_CONFIG strict mode (RFC-008 stage 4)
Opt-in: with the env set, loading a legacy omnigraph.yaml is a hard
error pointing at config migrate — the regression guard for migrated
teams (a stray legacy file would otherwise silently outrank operator
config during the window) and the rehearsal for stage 5's removal.
Strict refuses the FILE, never its absence: flag-less invocations on
migrated setups are untouched. Inert unless set.

The RFC's stages-1-3-then-4 release gap collapsed honestly: no version
boundary was crossed between them, so all four ship in the same release
(noted in the RFC).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 00:03:10 +03:00
aaltshuler
5328c91341 refactor(cli): drop cluster init — no replacement scaffold
Andrew's call, and the right one by the repo's own lens: a minimal
cluster.yaml is five lines; a generator is a second copy of the schema to
keep in sync forever, emitting a file that is unusable until hand-edited
anyway (graphs: {} cannot apply or serve). Terraform has no config
scaffolder either. New users copy from the cluster quick-start; migrants
get a ready-to-review cluster.yaml from config migrate. RFC-008 stage 3
becomes purely subtractive.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 23:45:18 +03:00
aaltshuler
3adbc65af2 docs(cli): config migrate, cluster init, the legacy-file deprecation notice
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 23:37:12 +03:00
aaltshuler
dc91c55970 feat(cli): operator aliases — pure bindings invoking stored queries (RFC-007 PR 3, part 2)
aliases: in the operator config bind a personal name to (server, graph,
stored-query NAME, positional arg mapping, fixed param defaults, format)
— zero content, per the ratified bindings-not-content model. Invocation
goes through the server's stored-query endpoint (POST
{base}/graphs/{g}/queries/{name}) with the keyed credential resolving via
the ordinary URL match; param precedence --params > positionals > fixed
defaults; the result renders through the existing format cascade with the
alias's format as its hop. A legacy omnigraph.yaml alias with the same
name wins during the RFC-008 window, with a warning naming both.

E2e (spawned policy-gated server, invoke_query granted via a per-graph
bundle): the alias invokes with name + one positional and nothing else —
server, graph, query, and token all from the operator layer; --server/
--graph explicit targeting; unknown --server lists defined names;
--server exclusive with a positional URI.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 22:25:42 +03:00
aaltshuler
65160cc060 docs(rfc): aliases are bindings, not content — the ratified alias model
RFC-007 §D2 gains the model the alias design reasoned through: stored
queries are content + its canonical team-owned name; legacy
omnigraph.yaml aliases conflate a personal name with a local-file content
pointer (the muddle RFC-008 retires); operator aliases are pure bindings
(server, graph, stored-query NAME, arg mapping, defaults) — an alias that
carries content competes with the catalog, one that references a name
composes with it. The three senses of 'global' are resolved explicitly:
cross-graph globality is strengthened (one $HOME file vs per-directory),
team-shared shorthand is deliberately NOT an alias mechanism (the shared
name IS the catalog name), cross-machine follows the dotfile. Collision
rule: legacy wins during the RFC-008 window, with a warning.

RFC-008's migration row for aliases sharpens accordingly: a legacy alias
splits — content to the catalog (via cluster apply), binding to the
operator layer; config migrate proposes both halves.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 22:15:19 +03:00
aaltshuler
a819ab500e feat(cli): keyed credentials — servers:, the token chain, login/logout (RFC-007 PR 2)
The operator config gains servers: (name -> url; never a token). A remote
command whose URL prefix-matches an operator server resolves its bearer
token through the keyed chain first — OMNIGRAPH_TOKEN_<NAME> env, then the
[<name>] section of ~/.omnigraph/credentials (created 0600 via temp+rename,
#139 finding 7; group/world-readable files refused loudly) — falling
through to the legacy chain unchanged. URL keying makes §D5 rule 3
structural: a token is only ever sent to the server it is keyed to.
Longest-prefix matching with a path-boundary check (http://h:8080 never
matches http://h:8080-evil). Inserting the keyed hop above the legacy chain
is safe by construction — no existing setup can have servers: defined.

omnigraph login <name> stores/rotates one section (token from --token or
one stdin line — the pipe flow keeps secrets out of shell history);
omnigraph logout removes it, idempotently; logging in before declaring the
server warns instead of failing (the gh model).

Coverage: URL-match/no-substring-trap, credentials round-trip preserving
sibling sections, 0600 write + over-permissive refusal, env-name mapping;
the legacy resolve test is now hermetic against a real ~/.omnigraph and
asserts byte-identical legacy behavior with no servers defined; one
spawned-binary e2e walks the whole lifecycle against an authed server:
refusal -> wrong-token login (stdin) -> rotate (--token) -> authorized read
-> env-beats-file -> non-matching-URL negative -> logout revokes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 21:24:51 +03:00
aaltshuler
9427fb510e docs(cli): the two config surfaces + the operator file reference
cli-reference.md gains the config-surfaces table (cluster / operator /
flags-env, with omnigraph.yaml marked as the legacy combined file per
RFC-008) and the operator config.yaml reference; audit.md documents the
unified actor chain.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 20:32:04 +03:00
aaltshuler
08ce8dc34d docs(rfc): align RFC-007 with RFC-008's two-surface architecture
RFC-007 now speaks the end-state language throughout: the operator surface
is one half of the two-surface split (cluster config / operator config),
not a layer over a living omnigraph.yaml. The precedence cascade drops the
project layer (cluster config carries no operator-resolvable keys — a
checkout can never supply identity); legacy omnigraph.yaml appears only as
the RFC-008 deprecation-window slot. The trust boundary is restated as
closed-by-construction in the end state, with the rules governing the
window. PR 3 becomes operator targeting (--server + operator aliases — the
replacement RFC-008 needs before legacy aliases migrate), and the schema
example gains the aliases block.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 19:54:34 +03:00
aaltshuler
320311e759 docs(rfc): RFC-008 — deprecate omnigraph.yaml, one concern per config surface
The file is three unrelated concerns wearing one filename — server
deployment config, project/CLI conveniences, operator identity — and the
mixture is the root cause of a recurring problem class (per-operator
copies of project files, checkout-supplied credential redirection, init
scaffold pollution). End state: two single-owner surfaces — cluster
config (team, repo) and operator config (person, $HOME) — plus the
zero-config flags/env tier.

Complete key-by-key migration map over the verified OmnigraphConfig
surface; staged retirement per the repo's Hyrum rules (warn with per-key
guidance -> `config migrate` tool -> stop scaffolding -> opt-in strict ->
removal at the next major). RFC-007's project-layer framing is amended to
transitional accordingly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 19:33:19 +03:00
aaltshuler
d531f60999 docs(rfc): RFC-007 — per-operator config, the operator slice of RFC-002
Terraform-style operator/project split: ~/.omnigraph/config.yaml for
identity (operator.actor in the --as cascade), credentials keyed by
server name (env -> 0600 credentials file; no inline secrets), and
operator-owned named servers that project configs reference but cannot
redefine. Explicitly a staged subset of RFC-002: adopts its settled
decisions (one dir, keyed credentials, env precedence), defers
GraphLocator/use/state-layer, and encodes the ten confirmed PR #139
findings as design rules (compat shims, key-level merges, atomic writes,
the project-layer trust boundary).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 18:29:55 +03:00
aaltshuler
8d7aed065f test(cluster,server): gated object-storage cluster e2e + CI wiring + docs
s3_cluster.rs runs the full control-plane lifecycle against a real
bucket (CI: containerized RustFS; locally the RustFS binary): import →
lock released (pins the drop-time release regression caught on the first
live smoke) → apply (graph roots + catalog on the bucket, nothing local)
→ serving snapshots from both the config dir and the bare URI → schema
evolution → approved delete (prefix removal) → empty-cluster refusal.
The server suite gains the config-free boot test: --cluster s3://… with
zero local files serves a stored query over HTTP.

CI: the rustfs job runs both suites; the classify filter covers the
cluster store/serve modules and the new test files. The server smoke
drops its name filter — every test in the s3 target is bucket-gated, and
a filter matching nothing passes vacuously (which silently ran zero
tests for a while).

Docs: deployment.md gains the Bucket-no-volume shape as the preferred
cloud deployment; cluster.md/server.md document --cluster <uri>;
testing.md maps the new suite.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 15:56:40 +03:00
aaltshuler
8dc2f15255 feat(cluster): the storage: root — state, catalog, and graph roots relocatable
cluster.yaml gains an optional storage: URI deciding where everything the
cluster STORES lives: the state ledger, lock, content-addressed catalog,
recovery sidecars, approval artifacts, and the derived graph roots
(<storage>/graphs/<id>.omni). Absent, it defaults to the config directory
itself — the original layout, byte-compatible, so pre-existing clusters and
the whole test suite are untouched. Declared configuration always stays in
the working tree (Terraform's config-local/state-remote split); credentials
are env-only, never in cluster.yaml.

Every command resolves its store from the declared root (a bad root is a
loud invalid_storage_root). Graph-root derivation, the delete executor
(prefix delete via the adapter), the sweep's existence probes, the catalog
payload write/verify/read paths, and the serving snapshot all flow through
ClusterStore — the last raw-fs holdouts for stored state are gone, and the
deny-list gains the rule that keeps it that way.

Tests: default-layout byte-compat, a file:// root relocating the entire
cluster (ledger+catalog+graphs under the new root, nothing under the config
dir, serving snapshot follows), invalid-root validation. 98 in-crate + 9
failpoints + full workspace gate green. The s3:// flavor lands with PR 3's
gated RustFS e2e.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 14:28:04 +03:00
aaltshuler
fa6af775c1 feat(cli)!: unified load command; deprecate ingest as an alias
omnigraph load is now the single data-write command:
- works against remote graphs (POSTs the server's /ingest endpoint with the
  same bearer/actor resolution as other remote commands) — previously load
  was the only data command forced to open Lance storage directly
- --from <base> opts into fork-if-missing for --branch (the former ingest
  semantics); without --from a missing branch is an error, never a fork
- --mode is now required: overwrite is destructive, so there is no implicit
  default (the old silent default was overwrite)
- output gains base_branch/branch_created (and table sums on remote loads)

omnigraph ingest stays as a deprecated alias (defaults preserved: --from
main --mode merge) that prints a one-line warning to stderr, matching the
read/change deprecation convention; removal in a later release.

Docs updated in the same change: cli.md, cli-reference.md, policy.md,
audit.md, execution.md (unified load section), AGENTS.md quick-flow,
README.md.

BREAKING CHANGE: scripts running omnigraph load without --mode must now
pass it explicitly (previously defaulted to the destructive overwrite).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 04:18:00 +03:00
aaltshuler
90676ef52f feat(server)!: POST /ingest forks only when 'from' is present
Branch creation becomes opt-in by presence of the request's 'from' field.
Previously the handler defaulted from to 'main' and always auto-created a
missing branch — a typo'd branch name silently forked main and landed the
data there, with the client none the wiser. Now a request without 'from'
against a missing branch returns 404 branch-not-found and creates nothing;
with 'from' set, fork-if-missing behaves as before. The BranchCreate
authority is only consulted when a fork will actually happen.

The handler calls the unified load_as directly (the deprecated ingest_as
shim is no longer used in the server). IngestOutput.base_branch becomes
nullable: it echoes the request's 'from' and is null when absent. OpenAPI
regenerated; the CLI's local ingest arm moves to load_file_as + the new
converter shape.

BREAKING CHANGE: clients that relied on implicit fork-from-main with 'from'
omitted must now pass from='main' explicitly. IngestOutput.base_branch is
now nullable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 04:05:29 +03:00
aaltshuler
43d4e89fde docs(execution): Overwrite loads are staged since MR-793, not inline-commit
The LoadMode table still described Overwrite as an inline-commit-per-type
residual with a partial-truncation failure window. Since MR-793 Phase 2,
Overwrite goes through the same MutationStaging accumulator as Append/Merge,
staged as a Lance Operation::Overwrite transaction via stage_overwrite
(table_store.rs) and committed with commit_staged + publisher CAS — a
mid-load failure leaves Lance HEAD untouched in all three modes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 03:44:02 +03:00
aaltshuler
44b5866516 docs: drop ./ path prefixes; document query discovery
Paths in cluster.yaml and command examples are relative to one explicit
config folder (Terraform-shaped) — the ./ prefixes were noise and are gone
across the user docs (109 instances; ../ links and ./scripts executables
untouched). The cluster docs now present directory discovery as the primary
queries form with the list and map forms documented alongside.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 01:33:30 +03:00
Andrew Altshuler
c3ff076e89
Merge pull request #181 from ModernRelay/feat/container-cluster-mode
feat(docker): cluster-mode container + AWS/Railway recipes
2026-06-10 23:57:34 +03:00
aaltshuler
f165145b63 docs(deploy): address review — consistent placeholders, complete ECS command
The ECS day-2 apply gains its required --config flag (the image ships no
omnigraph.yaml, so the CLI cannot locate the cluster dir without it), and
the docker-exec example uses the <you> placeholder convention instead of a
real-looking actor name.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:54:26 +03:00
aaltshuler
6b3ae7ac79 docs(deploy): AWS and Railway cluster-mode recipes
The container contract (OMNIGRAPH_CLUSTER + mounted volume + token env),
ECS/Fargate+EFS and Railway-volume walkthroughs, the in-container day-2
loop, and the honest constraints list (volume mandatory, no hot reload,
single-writer apply, shared-volume replicas unvalidated). Operator guide
links the recipes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:45:30 +03:00
aaltshuler
99f7f36864 docs(cluster): the precise omnigraph.yaml contract
The 'Relationship to omnigraph.yaml' section becomes the exact rule set:
cluster commands read the per-operator config for exactly one thing (the
cli.actor default when --as is omitted), a --cluster server reads it for
nothing, and pointing data-plane targets at derived roots is ergonomics,
not coupling. Operator guide and CLI reference updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:30:40 +03:00
aaltshuler
97eb65e921 docs(cluster): operator how-to guide for deploying and managing clusters
New docs/user/cluster.md — the practical companion to cluster-config.md's
reference: zero-to-served walkthrough (validate/import/plan/apply, derived
roots, data loading, --cluster serving), the day-2 edit->plan->apply->restart
loop with a per-change-kind table, drift observation and convergence, the
approval gate for destructive changes, crash/lock/lost-ledger recovery, the
boot-refusal table with remedies, deployment patterns (replicas, backup
unit, CI gating), and the explicit not-yet list (hot reload, S3-hosted
cluster dirs, per-query exposure, pipelines). Linked from the user index,
the agent guide's topic map, and cross-linked from the reference.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:10:19 +03:00
aaltshuler
d8354ac213 test(cli): address review — assert schema-show success, document exit-code stance, add e2e opt-out
- The drift-heal verification now asserts `schema show` succeeded and
  produced a schema before checking the rogue field's absence (a failed
  command previously made the negative assertion vacuously pass).
- cluster_cli documents why it deliberately does not assert exit codes
  (blocked applies exit non-zero by contract while emitting the structured
  output callers assert on).
- The comprehensive lifecycle e2es honor OMNIGRAPH_SKIP_SYSTEM_E2E=1
  (graceful skip-with-message, the S3-gate pattern) for constrained
  sandboxes; requirements + suppression documented in testing.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:05:12 +03:00
aaltshuler
7d70811df1 test(cli): comprehensive full-cycle cluster e2e with a live server
Two system tests composing the whole Phase 1-5 surface with real binaries:

- local_cluster_full_lifecycle_declare_serve_evolve_delete: declare two
  graphs -> one apply creates and converges them -> the --cluster server
  serves both stored queries -> schema+query evolve in one apply (migration
  previewed in plan) -> restart serves the new shape -> out-of-band schema
  drift observed by refresh and converged back by apply (rogue field
  soft-dropped) -> approved graph delete -> restart serves the survivor and
  404s the tombstoned graph -> final plan empty. Catches composition
  regressions where each stage passes its own tests but the lifecycle
  breaks (the composite_flow.rs principle at the control-plane level).

- local_cluster_serving_enforces_applied_policy_bindings: applied policy
  bundles gate serving per their bindings over HTTP with bearer-resolved
  actors — the cluster-bound bundle owns graph_list (admin 200, reader 403,
  anonymous 401), the graph-bound bundle owns invoke_query (reader gets
  rows; denied invocation is the documented anti-probing 404).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:07:29 +03:00
aaltshuler
711865e6f1 docs(cluster,server): the Phase 5 mode switch; retire applied-not-serving caveats
The standing caveat ('applied means recorded in the cluster catalog —
nothing more; the server still boots from omnigraph.yaml') retires: cluster
docs gain the 'Serving from the cluster' section (exclusivity, applied-
revision serving, fail-fast readiness, restart-to-pick-up, expose-all
bridge), server.md gains mode-inference rule 0 and the cluster-booted multi
mode, deployment.md the boot-source choice, and the CLI's apply note plus
the cli-reference cluster row (stale back to Stage 3A) now describe the full
convergence surface. RFC-005 flips to Landed with four implementation
deviations recorded.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:56:54 +03:00
aaltshuler
6c98560dde docs(cluster): document policy binding metadata (5A)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:30:57 +03:00
Andrew Altshuler
3e8f103804
docs(cluster): RFC-005 — server boots from cluster state (Phase 5 design) (#174)
The axiom-15 mode switch: omnigraph-server --cluster <dir> (mutually
exclusive with uri/--target/--config, zero omnigraph.yaml reads) serves the
APPLIED revision — graph set from state, query/policy content from the
content-addressed catalog at applied digests, cluster-scoped policy bundles
as the server-level Cedar engine. The load-bearing finding: state is not yet
serving-sufficient (policy applies_to bindings live only in cluster.yaml), so
slice 5A records binding metadata into the applied revision at apply time —
without it, boot-from-state silently becomes the merged read axiom 15
forbids. Fail-fast readiness table (missing state, pending sidecars, missing
blobs, unbound policies all refuse boot with remedies), the expose-all
mcp.expose bridge with its Phase 6 sunset, the operator migration path (exit
criterion 7), and 5A/5B/5C sequencing. The existing boot pipeline
(GraphStartupConfig -> registry -> routing/auth) is reused as-is — a new
source, not a new pipeline.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:22:12 +03:00
Andrew Altshuler
61da7bf406
docs(cluster): descope ETL pipelines to a separate project; keep the socket (#172)
Pipelines (scheduler, connectors, mapping, idempotency, run ledger) leave the
cluster control-plane rollout and become their own project with their own
RFC. This rollout guarantees only the socket, all of which already exists and
is enforced: the pipelines: config field is reserved (typed
future_phase_field rejection, test-covered), the pipeline.<name> typed
address and Pipeline resource kind are reserved in the resource model, and
axiom 13 fixes the contract any future implementation must satisfy
(definition reconciled, execution data-plane, fan-out statusful). The ETL
section in the high-level spec stands as the requirements record for that
project; exit criterion 9 defers to its RFC.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:53:16 +03:00
aaltshuler
c949a2b717 docs(cluster): document Stage 4C — Phase 4 complete
Approvals + gated graph deletion in the user docs, the approve command in the
CLI reference, RFC-004 flipped to Landed with its three implementation
deviations recorded (row-8 retire-and-repropose, --as instead of --actor/--by,
consumed artifacts rewritten in place rather than moved).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:44:12 +03:00
aaltshuler
f217352c93 docs(cluster): document Stage 4B schema apply
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:14:20 +03:00
aaltshuler
cb6c67f196 docs(cluster): document Stage 4A graph create
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 05:00:42 +03:00
Andrew Altshuler
26b26999fd
ci(codeowners): aaltshuler owns all paths; remove ragnorc (#169)
Engineering and docs roles both resolve to @aaltshuler; every path
(catch-all, crates/**, docs/**, repo-level docs) now requires their review.
CODEOWNERS and the doc tables regenerated from codeowners-roles.yml via
render-codeowners.py.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:34:17 +03:00
Andrew Altshuler
58c66a54a2
docs(cluster): RFC-004 — graph & schema apply design (Phase 4) (#168)
* docs(cluster): RFC-004 — graph & schema apply design (Phase 4)

The design the implementation spec's exit criteria require before
graph-moving cluster apply ships. Core positions:

- Cluster recovery is roll-forward-only: the engine's own sidecars make every
  graph-level operation atomic within the graph, so the cluster never rolls a
  graph back — its sidecars (__cluster/recoveries/{ulid}.json) classify and
  record, converging the ledger to observable reality (axiom 5) or surfacing
  a loud pending-repair condition. Eight-row decision matrix, every row
  testable with the Stage 3B failpoint harness.
- Irreversible operations (graph delete, allow_data_loss schema apply)
  consume digest-bound approval artifacts written by a new cluster approve
  command and retired into state.approval_records (axiom 11). A stale
  approval can never authorize a different change.
- cluster apply gains an actor, threaded to apply_schema_as so engine Cedar
  enforcement and commit attribution work unchanged; the cluster adds no
  policy engine of its own.
- Deterministic ordering (creates -> schema applies -> catalog -> deletes),
  per-resource apply groups, cross-graph atomicity explicitly not promised.
- Staged 4A graph create / 4B schema apply / 4C graph delete, each gated on
  per-matrix-row failpoint tests.

Answers exit criteria 2 and 4 fully, 1/5/6 partially; 3/7/8/9 deferred to
their phases (coverage table in the RFC). Linked from the dev index and the
implementation spec's Phase 4 section.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs(cluster): RFC-004 review fixes — graph_delete sweep rows, state_cas_base contract

Two greptile findings: (1) D3 row 2 could not be evaluated for graph_delete
(no manifest to version-check after prefix removal) and 'root absent, state
already tombstoned' fell into the stale row — split into rows 7 (delete's
analog of row 2) and 7b (the roll-forward), with expected_manifest_version
documented as always null for the delete kind. (2) state_cas_base is now
explicitly audit/diagnostics-only — the sweep never consults it; independent
state mutations are handled by the ordinary CAS like any concurrent write.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:34:14 +03:00
aaltshuler
50543a8ce0 docs(cluster): record Stage 3B failpoint + verification coverage
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:15:13 +03:00
aaltshuler
15868972ff feat(cluster): verify catalog payload blobs in status and refresh
Closes the Stage 3A product gap where a deleted or corrupted blob under
__cluster/resources/ went unnoticed forever (status reported converged and
apply could not repair it because the digests matched).

verify_catalog_payloads checks every query/policy digest in state against its
content-addressed blob (existence + full sha256 re-hash; graph/schema/unknown
addresses have no payloads and are skipped). status reports findings read-only
(warnings catalog_payload_missing/_mismatch; error catalog_payload_read_error
— an unverifiable catalog must not report healthy). refresh closes the
self-heal loop: missing/mismatched blobs mark the resource drifted and remove
its digest from state so the next plan proposes a create and the next apply
republishes; unreadable blobs keep the digest (no spurious republish), mark
error, and exit non-zero. Verification runs before graph observation so the
recomputed graph composite already excludes removed query digests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:07:08 +03:00
Andrew Altshuler
b6d228ff54
test(cli): cluster e2e hardening — lost-state recovery, out-of-band drift, root destruction, multi-graph convergence (#166)
Four lifecycle compositions over the spawned binary that pin spec claims no
single-command test proves:

- Lost ledger: delete state.json -> re-import from the live graph -> re-apply
  converges onto the same content-addressed blobs (axiom 5's reconstructable-
  state resilience edge, end to end).
- Out-of-band schema apply (the Sarah/Bob violation): refresh marks
  graph/schema Drifted with schema_mismatch, status and plan surface it, and
  cluster apply refuses to silently correct it — state keeps the LIVE schema
  digest (drift correction is gated, axiom 8).
- Destroyed graph root: refresh records graph_missing drift and drops
  graph/schema digests while preserving query/policy; plan proposes deferred
  creates only; apply moves nothing and the catalog stays intact.
- Two graphs (one live, one not yet created) + a graph-spanning policy + a
  cluster-scoped policy: a single apply yields all four dispositions at once
  (applied/derived/deferred/blocked, deterministically ordered), then the
  second graph appears, refresh observes it, and apply converges.

Helpers: init_named_cluster_graph generalizes init_cluster_derived_graph;
write_multi_graph_cluster_fixture builds the two-graph config.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 00:59:20 +03:00
aaltshuler
7f3ecf282a Merge origin/main (#164 axiom-15 docs, #86 TableStorage migration) into feat/cluster-apply-stage3a
Clean auto-merge; also fix the stale 'Stage 2C accepts' line in
cluster-config.md to Stage 3A.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 00:45:44 +03:00
aaltshuler
69b63c33ac Merge remote-tracking branch 'origin/main' into feat/cluster-apply-stage3a 2026-06-10 00:45:03 +03:00
Andrew Altshuler
cec65b8ef8
docs(cluster): axiom 15 — single ownership, mode-switch migration, per-operator layer (#164)
Encode the omnigraph.yaml ↔ cluster.yaml coexistence rules that were implicit
across the specs:

- cluster-axioms.md: new axiom 15 — every fact has exactly one owner at a time;
  coexistence is a mode switch, never a merge; omnigraph.yaml's job description
  shrinks to the permanent per-operator layer. Added review-tension bullet.
- cluster-config-specs.md: "Migration model" subsection (three coexistence
  windows: no-conflict, Phase-5 mode switch, bridges-with-sunsets) and a
  "per-operator layer" completeness table (connection, credential reference,
  active context, ergonomics, personal aliases) with its global-config-dir
  destination per the RFC-002 direction.
- cluster-config-implementation-spec.md: Compatibility Stance #7–#9 (single
  ownership, shrinking role, bridges carry sunsets); Phase 5 boot is an
  exclusive XOR mode switch; fixed the duplicated recoveries/recovery dirs in
  the Phase-1 storage layout.
- docs/user/cluster-config.md: "Relationship to omnigraph.yaml" section in
  current-reality terms (cluster catalog is inspectable, not live).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 00:44:51 +03:00
devin-ai-integration[bot]
2c578a60b2
(feat) convert engine call sites to &dyn TableStorage; demote legacy TableStore methods to pub(crate) (#86)
* MR-854: convert engine call sites to &dyn TableStorage; demote legacy methods

Phase 1b: every db.table_store.X(...) call site converts to
db.storage().X(...), reaching the storage layer through the sealed
TableStorage trait (returns &dyn TableStorage). Opaque SnapshotHandle
and StagedHandle replace bare lance::Dataset and Transaction in the
threaded values.

Phase 9: the inherent inline-commit methods on TableStore
(append_batch, merge_insert_batch{,es}, overwrite_batch,
create_btree_index, create_inverted_index) demote from pub to
pub(crate). Their only remaining direct users are table_store.rs
itself and the bulk loader's LoadMode::{Append, Overwrite, Merge}
concurrent fast-paths in loader::write_batch_to_dataset (no
two-phase shape in Lance 4.0.0 — closes after lance#6658 and #6666).

Docs:
- invariants.md \u00a7VI.23: drop "at the writer-trait surface"
  qualifier; staged primitives are now the only engine surface.
- runs.md: residual matrix shrinks to delete_where and
  create_vector_index (the two upstream-blocked residuals).
- forbidden_apis.rs: replace transitional language with the
  current allow-list shape (table_store.rs + loader concurrent
  fast-path only).

Files touched:
- changes/mod.rs, db/omnigraph.rs (+export/optimize/schema_apply/
  table_ops.rs), exec/{merge,mod,mutation,staging}.rs,
  loader/mod.rs, storage_layer.rs, table_store.rs,
  tests/forbidden_apis.rs, docs/{invariants,runs}.md.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* MR-854: replace test-only inline-commit append callers with local Lance helpers

After demoting TableStore::append_batch from pub to pub(crate), the
integration tests in tests/recovery.rs and tests/staged_writes.rs
that previously called store.append_batch(...) directly to simulate
HEAD-ahead-of-manifest drift can no longer access the inherent
method. Replace those calls with small in-test helpers that do a raw
Dataset::append (the same body the inherent method runs).

- tests/helpers/mod.rs gains lance_append_inline (shared helper).
- tests/staged_writes.rs gets a file-local lance_append_inline_local
  (staged_writes.rs does not import helpers::).
- tests/recovery.rs drops the unused TableStore import in the one
  function whose store binding became unused after the conversion.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* MR-854: retrigger CI for flaky Test Workspace job

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* MR-854: convert remaining table_store call sites in export.rs / read_blob

Two leftover `self.table_store.X` / `db.table_store.X` call sites were
missed in the initial sweep — flagged by Devin Review on PR #86. Both
now go through the trait surface:

- `entity_from_snapshot` (db/omnigraph/export.rs): switch from
  `db.table_store.open_snapshot_table` + `db.table_store.scan` to
  `db.storage().open_snapshot_at_table` + `db.storage().scan`.

- `read_blob` (db/omnigraph.rs): replace
  `snapshot.open(table_key)` + `self.table_store.first_row_id_for_filter`
  with `self.storage().open_snapshot_at_table` +
  `self.storage().first_row_id_for_filter`. The follow-up
  `take_blobs` call still needs an `Arc<Dataset>` (it's a Lance blob
  accessor not surfaced through the trait), so we hand off via
  `SnapshotHandle::into_arc()` with a comment.

After this commit, no engine code outside `table_store.rs` reaches the
inherent `TableStore` API — the docs/runs.md and docs/invariants.md
claim is now uniformly true.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* MR-854: post-rebase doc fixes (Lance 6.0.1, MR-A framing, into_dataset note)

Reviewer feedback on the rebased PR:

* docs/dev/writes.md residuals matrix: drop demoted methods from the trait-surface table (now `pub(crate)`); keep only the two genuine trait-surface residuals (`delete_where`, `create_vector_index`); reframe under MR-A (Lance v7.x bump) per docs/dev/lance.md.

* tests/forbidden_apis.rs: update transitional allow-list header to (a) drop the truncate_table mislabel (truncate_table is a Lance Dataset method, not a TableStore method — overwrite_batch's internal call), (b) reframe trait-surface residuals under MR-A / Lance #6666.

* crates/omnigraph/src/storage_layer.rs::SnapshotHandle::{into_arc, into_dataset}: add single-ref invariant doc — both consume Arc via try_unwrap-or-clone; sibling SnapshotHandle clones across an await point force a deep Dataset clone.

* Replace lance-4.0.0 version refs with lance-6.0.1 in active source/test/dev-doc comments (storage_layer.rs, table_store.rs, table_ops.rs, schema_apply.rs, merge.rs, recovery.rs, staged_writes.rs, consistency.rs, docs/dev/execution.md, docs/user/query-language.md). Historical refs in docs/releases/v0.4.1.md and the canonical "Lance 4.0.0 → 6.0.1 migration" line in docs/dev/lance.md left intact.

No engine code changes.

* MR-854: update docs/dev/invariants.md Storage trait row + gap entry

Reviewer feedback: the docs reorg landed; the invariant row now lives in
docs/dev/invariants.md with stable headings (no more numbered §VI.23).

Update two pieces to reflect MR-854 completion:

* Status table 'Storage trait' row: was 'full call-site migration ... incomplete';
  now 'engine call sites all route through db.storage() (MR-854); inline-commit
  inherent methods are pub(crate)-demoted; capability/stat surfaces are roadmap'.

* 'Known Gaps' 'Storage abstraction' entry: was 'older inherent TableStore call
  sites and inline residuals remain'; now names the closed scope (MR-854 — call
  sites migrated, methods demoted, loader fast-paths) and the remaining
  trait-surface residuals under MR-A (Lance v7.x bump) and Lance #6666.

Cross-links to docs/dev/lance.md and docs/dev/writes.md so the framing stays
co-located with the canonical Lance surface tracking.

* MR-854: remove dead inline-commit methods from the storage surface

The loader concurrent fast-path (write_batch_to_dataset) is only reached
for LoadMode::Overwrite — Append/Merge route through MutationStaging — so
its Append/Merge arms were unreachable. Collapse it to overwrite-only and
drop the now-unused mode params, which removes the only callers of:

- TableStorage::append_batch + TableStorage::merge_insert_batches (trait)
- TableStore::merge_insert_batch + merge_insert_batches (inherent)

create_btree_index / create_inverted_index had zero callers anywhere
(scalar index builds use the stage_* primitives). Remove both from the
trait and the inherent impl.

Inherent append_batch stays pub(crate): overwrite_batch and recovery
tests use it. Migrate the one trait-append_batch test caller
(seed_person_row) to stage_append + commit_staged. The merge_insert
FirstSeen-workaround rationale moves from the deleted merge_insert_batch
into stage_merge_insert (now the sole merge path). No behavior change.

Also corrects the inaccurate loader residual comment (the prior text
blamed Lance #6658/#6666, which are the delete and vector-index issues,
for keeping overwrite inline; a stage_overwrite primitive already exists
and schema_apply uses it).

* MR-854: seal db.storage() to staged-only; move residuals to InlineCommitResidual

Split the three remaining inline-commit writes (overwrite_batch,
delete_where, create_vector_index) off the TableStorage trait onto a new
sealed InlineCommitResidual trait, reachable only via the explicit
Omnigraph::storage_inline_residual() accessor. db.storage() now exposes
only staged primitives + reads, so engine code cannot couple a write
with a Lance HEAD advance through the default surface — MR-793 acceptance
§1 ("no public method commits as a side effect of writing") now holds by
construction, not by review + naming.

Call sites moved to storage_inline_residual(): loader overwrite
fast-path, the three mutation delete_where paths, the branch-merge
delete, and the vector-index build. Impl bodies are unchanged (same
delegation to the pub(crate) inherent methods); this is a pure surface
reshape with no behavior change.

The residual trait holds two genuinely upstream-blocked methods
(delete_where -> Lance #6658/v7.x, create_vector_index -> Lance #6666)
plus overwrite_batch, kept for the loader's cross-table bulk-overwrite
concurrency until its staged migration lands (tracked follow-up).

* MR-854 docs: describe the staged-only seal; fix stale Lance index URLs

- writes.md / invariants.md / AGENTS.md: the inline-commit residuals now
  live on InlineCommitResidual behind db.storage_inline_residual(), so
  acceptance §1 holds by construction rather than 'option (b)' per-method
  enumeration. Drop the inaccurate 'until Lance exposes
  Operation::Overwrite { fragments }' claim (that op exists; stage_overwrite
  already builds it) and reframe overwrite_batch as a removable legacy
  residual gated on the loader's bulk-overwrite concurrency.
- forbidden_apis.rs: rewrite the allow-list doc for the split surface.
- lance.md: the index spec pages moved from /format/table/index/ to
  /format/index/ in Lance 6.x (the old paths 404). Fix all 13 URLs.

* MR-854: fix stale lance-4.0.0 comment refs flagged in review

Addresses greptile (exec/merge.rs) and aaltshuler's stale-version blocker:
update lance-4.0.0 -> 6.0.1 in the comment/doc refs within this PR's
footprint (exec/merge.rs, exec/mutation.rs, docs/dev/writes.md). Also
corrects exec/merge.rs to cite lance#6666 (not #6658) for
build_index_metadata_from_segments — that is the vector-index segment-commit
API; #6658 is the two-phase delete. (Pre-existing 4.0.0 refs in untouched
files like architecture.md/storage.md are main's incomplete migration
cleanup, left out of scope.)

* fix(storage): stage loader overwrites

* fix(storage): stage empty schema rewrites

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-authored-by: Ragnor Comerford <hello@ragnor.co>
2026-06-09 23:03:08 +02:00
aaltshuler
40a21e4e77 docs(cluster): document Stage 3A config-only cluster apply
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-09 23:36:33 +03:00
Andrew Altshuler
171a8c5d13
docs(releases): attribute the __run__ sweep (MR-770) to v0.6.2, not v0.6.1 (#161)
The v0.6.2 notes omitted the MR-770 `__run__` cleanup entirely, and the v0.6.1
notes wrongly claimed it shipped in v0.6.1. The code (the `migrate_v2_to_v3`
`__manifest` sweep + `is_internal_run_branch`/`run_registry.rs` removal) first
appears at the v0.6.2 tag via #132 and is absent at v0.6.1.

- v0.6.2: add the MR-770 highlight, correct the manifest-stamp note to describe
  the v2→v3 auto-migration on first read-write open (with the read-only caveat),
  and mention the cleanup in the intro.
- v0.6.1: replace the two over-claiming `__run__` lines with corrections that
  point to v0.6.2.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 22:31:50 +03:00
aaltshuler
89b876c797 Add cluster state lock recovery 2026-06-09 22:31:46 +03:00
aaltshuler
d00d42274e Implement cluster refresh and import 2026-06-09 21:17:23 +03:00
Ragnor Comerford
e0d88d1295
fix(unique): collision-free tuple key shared by intake and merge, loud on un-keyable types (#160)
* fix(unique): collision-free tuple key shared by intake and merge, loud on un-keyable types

Hardening on top of #133. That PR introduced a shared
`loader::composite_unique_key(parts)` joining per-column scalars with U+001F
and routed both intake and branch-merge through it, closing the original
'|' vs U+001F separator drift. This takes the shared keying the rest of the
way to correct-by-design:

- Collision-free by construction: the key is now the tuple of per-column
  scalar strings (Vec<String>) keyed directly, no separator, so no data value
  (not even a literal U+001F) can forge a collision.
- One scalar converter across both paths: intake used an explicit type-match,
  merge used Arrow's array_value_to_string. Both now derive the key through
  composite_unique_key(group_columns, row), so they can't drift on conversion.
- Loud on un-keyable types: the scalar converter returned None for any Arrow
  type it didn't recognize, and the caller treated None as null-exempt, so a
  @unique on a column type it couldn't reduce (list, blob) was silently
  un-enforced. It now returns Err, surfacing the constraint it can't enforce
  instead of weakening it in silence.

Tests:
- consistency::composite_unique_key_is_consistent_across_intake_and_merge pins
  that intake and merge key the tuple identically (load-on-branch then merge
  of values containing '|').
- loader unit tests pin tuple keying + null exemption and the loud error on an
  un-keyable (binary) column.

Docs: invariants truth-matrix updated; stale loader/mod.rs line pointers fixed.
Scope unchanged: intra-batch / merge-candidate-set only; cross-version
uniqueness against committed rows stays a documented gap.

* fix(unique): cover all string encodings; make format_tuple private (PR #160 review)

Addresses two Greptile P2 comments on PR #160:

- unique_key_scalar handled only StringArray (Utf8). The loud-on-unknown-type
  behavior turned any legal string column that read back as LargeUtf8 or
  Utf8View into a hard write failure (the old code silently returned None). Add
  LargeStringArray and StringViewArray arms so a legal string column is keyable
  in every physical Arrow encoding; the Err path now fires only for a genuinely
  un-keyable logical type (list/blob/vector), never a legal value in an
  unenumerated encoding.
- format_tuple was pub(crate) but only used within loader/mod.rs; make it a
  private fn (matches the old format_unique_columns it replaced, minimal
  exposed surface).

New unit test unique_key_scalar_handles_all_string_encodings pins that Utf8 /
LargeUtf8 / Utf8View all render rather than error.
2026-06-09 19:28:21 +02:00