omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-15 01:55:13 +02:00

Author	SHA1	Message	Date
Andrew Altshuler	8726ca92ec	feat: canonical POST /load, deprecate /ingest (RFC-009 Phase 5) (#222 ) * feat(server): canonical POST /load, deprecate /ingest (RFC-009 Phase 5) The CLI's non-deprecated `load` verb rode the deprecated `/ingest` route, so `/ingest`'s eventual removal would silently break it. Add a canonical `/load`, mirroring the shipped `/mutate`↔`/change` and `/query`↔`/read` pattern. - Extract `server_ingest`'s body into a shared `run_ingest` (branch-exists / fork-if-`from`, Cedar auth, admission, `load_as`, `IngestOutput` mapping). - `server_load` (canonical) → `run_ingest`, `Json<IngestOutput>`. - `server_ingest` (deprecated) → `run_ingest` + `#[deprecated]` + RFC 9745/8288 `Deprecation: true` / `Link: </load>; rel="successor-version"` headers. - Router mounts `/load` (same 32 MB body limit) beside `/ingest`; OpenAPI `paths(...)` gains `server_load` and flags `server_ingest` deprecated. `/load` reuses `IngestRequest`/`IngestOutput`, exactly as canonical `/mutate` reuses `Change` — a DTO rename is a separate, larger change (out of scope). openapi.json regenerated. Tests: openapi `/load` present + not deprecated, `/ingest` deprecated, `/load` bearer-secured; data_routes `/load` happy path + `/ingest` deprecation headers. Existing `/ingest` route tests stay green (the shim is unchanged). Docs: server.md endpoint table; RFC-009 Phase 5 marked landed (incl. the hand-mount-vs-utoipa-axum registration finding). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> feat(cli): point remote load at /load (RFC-009 Phase 5) `GraphClient::load`'s remote arm now POSTs to the canonical `/load` route instead of the deprecated `/ingest`; the deprecated `ingest` verb keeps riding `/ingest`. `parity_load` exercises `/load` on the remote arm (its documented flip); the matrix exclusions comment is updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 03:32:16 +03:00
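The deprecation shim described in this commit attaches two response headers to the old route. A minimal sketch of building that header pair, assuming a hypothetical helper name (`deprecation_headers`) and plain string pairs rather than the server's real header types:

```rust
/// Hypothetical helper (not from the repository): builds the two headers the
/// commit message names for a deprecated route, given its successor path.
pub fn deprecation_headers(successor_path: &str) -> Vec<(String, String)> {
    vec![
        // Marks the route as deprecated, as quoted in the commit message.
        ("Deprecation".to_string(), "true".to_string()),
        // Points clients at the canonical replacement route.
        (
            "Link".to_string(),
            format!("<{successor_path}>; rel=\"successor-version\""),
        ),
    ]
}
```

Calling `deprecation_headers("/load")` yields the `Deprecation: true` and `Link: </load>; rel="successor-version"` pair quoted above; how the real handler attaches them to the response is not shown here.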
Andrew Altshuler	6144bb18d6	feat(cli): cluster-managed maintenance addressing + init signpost (RFC-010 Slice 3) (#221) * feat(cluster): cluster_root_for_graph_uri detection helper (RFC-010 Slice 3) Public helper the CLI uses to refuse `init` into a cluster-managed location: given a graph storage URI of the cluster layout (`<root>/graphs/<id>.omni`), return the cluster root if `<root>` holds `__cluster/state.json`, else None. Cheap by construction — a URI that doesn't match the `<root>/graphs/<id>.omni` shape returns None with zero I/O, so ordinary `init` targets never probe storage. Works for file:// and s3:// via the storage adapter. Adds two ClusterStore accessors (`display_root`, `has_state`). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli): cluster-managed maintenance addressing + init signpost (RFC-010 Slice 3) Two cluster-graph-aware CLI behaviors, sharing the cluster-resolution path. Maintenance addressing. `optimize`/`repair`/`cleanup` gain `--cluster <dir|s3://…> --cluster-graph <id>`, which resolves the graph's storage URI from the served cluster snapshot (the same truth a `--cluster` server boots from — `read_serving_snapshot`) and opens it embedded. The operator no longer hand-types `<storage>/graphs/<id>.omni`. A distinct flag is required because the global `--graph` is `requires = server` and means a remote multi-graph id. clap enforces both-or-neither and exclusion with the positional URI / `--target`; an unserved graph errors loudly, pointing at `cluster apply`. init signpost. `init` refuses a cluster-managed positional path (the `<root>/graphs/<id>.omni` layout where `<root>` holds `__cluster/state.json`, detected by `cluster_root_for_graph_uri`) and points at `cluster apply` — graphs in an established cluster are created with ledger/recovery/approvals, not by hand. The check is gated on the path shape, so ordinary `init` does no extra I/O and existing pre-apply cluster-graph inits are unaffected. 
planes guard remediation now also mentions `--cluster … --cluster-graph …` (the two Slice-1 guard-string tests track it). Docs updated (cli-reference Command planes, maintenance.md, cluster.md §7); the stale "no S3-hosted cluster directories" limitation is dropped (RFC-006 landed it). Tests (cli_cluster.rs, reusing the apply-a-cluster fixture): resolve by id, unknown-id error, `--cluster` requires `--cluster-graph`, init refusal + signpost, and ordinary init still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> fix(cli): resolve cluster graphs from the state ledger, not the serving snapshot Addresses the Greptile review on #221. `read_serving_snapshot` does all-or-nothing serving validation — recovery-sidecar checks plus a digest verify of every catalog payload (query .gq, policy blobs). Using it to resolve a maintenance target coupled `optimize`/`repair`/`cleanup` to the readiness of unrelated resources: a single corrupt policy blob, or a pending recovery sweep, would block the command before it could touch the graph — worst for `repair`, the tool you reach for when the cluster is degraded*. Add `omnigraph_cluster::resolve_graph_storage_uri(cluster, graph_id)`: read the state ledger, confirm the graph is in the applied revision, return `graph_root(id)` — the URI is deterministically derivable, no catalog validation. The CLI's cluster resolver now calls it. Test: `optimize --cluster … --cluster-graph …` still resolves after the catalog payloads (`__cluster/resources/`) are removed — the ledger-only path is not blocked by degraded/unrelated catalog state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 02:52:21 +03:00
Andrew Altshuler	d6cf5b298c	feat(cli): plane-grouped --help + clap 4.6.1 (RFC-010 Slice 2) (#220 ) * chore(deps): bump clap to 4.6.1 Workspace constraint "4" → "4.6" so the resolver picks up the 4.6 line (a plain `cargo update` stayed on 4.5.x). clap 4.5.58 → 4.6.1 (clap_builder 4.6.0, clap_derive 4.6.1). Minor bump, no API breakage; the workspace builds and all CLI suites pass unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli): group --help by plane (RFC-010 Slice 2) Slice 1 declared the planes (the command_plane table + the wrong-plane guard); this makes them visible in `--help`. clap can't print labeled heading rows between subcommand groups (verified against the source — help_heading is args-only, {subcommands} is one flat block), so per the chosen approach: cluster + legend. - Reorder the `Command` enum into plane bands (clap lists subcommands in declaration order): data (query, mutate, load, branch, snapshot, export, commit, schema, graphs) → storage/local-graph ops (init, optimize, repair, cleanup, lint, queries) → control (cluster) → session (policy, embed, login, logout, config, version). No magic display_order numbers — the source order IS the help order, with band comments for readers. The band placement matches `command_plane` (lint/queries are storage-plane: they reject --server), so the help grouping and the guard agree. - Add an `after_help` legend on `Cli` naming the planes. Written to describe the planes (not enumerate every command) so it doesn't drift. Help-polish (post-review): hide the deprecated `ingest` from the list (still a valid command); trim the long `login` and `--as` descriptions to one line each so the columns don't blow up. The behavioral source of truth for planes stays `planes::command_plane`; this ordering is its cosmetic counterpart. Test: `help_groups_commands_by_plane` pins the legend phrase + the cluster ordering (query < optimize < cluster). Doc: a line under cli-reference's Command planes section. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli): qualify mixed-plane commands in the --help legend Addresses the Greptile P2 on #220: the legend placed `schema` entirely in Data and `queries` entirely in Storage, but per `command_plane` the subcommands differ — `schema plan` is storage-plane (rejects --server) and `queries list` is session (no graph). A user reading the legend then running `schema plan --server` would hit a rejection contradicting it. The Commands list is one entry per top-level command (necessarily coarse), so the legend carries the nuance: `schema [plan: storage]` and `queries [list: session]`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 01:49:40 +03:00
Andrew Altshuler	4187d56f8a	fix(cli): align lint plane label + document the plane model (RFC-010 follow-up) (#218) Addresses the Greptile review on #217: P1 — `lint` reported two different names. `command_label` returns `lint`, but `execute_query_lint` passed `"query lint"` as the resolver operation string, so `lint --server` said `lint` while `lint <https>` said `query lint`. Both were pinned by tests. `query lint` is the deprecated alias (argv-rewritten to `lint`), so the canonical name is `lint`: switch both user-facing strings in `execute_query_lint` (the storage-plane bail label and the requires-schema-or-target usage message) to `lint`, and update the two pinned assertions in `cli_data.rs`. P2 — user-doc debt (AGENTS.md rule 1: error text is observable behavior). Document the plane model in `cli-reference.md` (new Command planes section: data vs storage/maintenance vs control, which addressing flags apply, and the declared wrong-plane / remote-target errors), and add an addressing note to `maintenance.md` cross-referencing it. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 22:58:51 +03:00
Andrew Altshuler	106356ab25	feat(cli): RFC-010 Slice 1 — declared plane capability surface + honest addressing (#217 ) * feat(cli): declared plane capability surface + wrong-plane guard (RFC-010 Slice 1) New `planes.rs` is the single source of truth for which plane each subcommand belongs to (Data / Storage / Control / Session). `command_plane` is an exhaustive match — adding a `Command` variant is a compile error until its plane is declared, so the surface cannot silently drift from the command set. It descends into the nested enums where the plane differs per subcommand (`schema plan` is storage while `schema show/apply` are data; `queries validate` opens the graph while `queries list` reads only config). `guard_addressing` runs once in `main` before dispatch: the data-plane addressing flags `--server`/`--graph` on any non-data verb now fail with one declared, pinned error instead of being silently ignored (`optimize --server prod` previously dropped `--server`). `init`'s message drops the `--target` half since it takes only a positional URI today. Test: `cli_schema_config::schema_plan_with_server_flag_errors_wrong_plane` pins the per-subcommand label, proving the guard descends into the nested enum. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli): storage-plane verbs fail loudly on a remote target (RFC-010 Slice 1) `optimize`/`repair`/`cleanup` switch from `resolve_uri` to `resolve_local_uri`, so a `--target` (or positional URI) that resolves to a remote server now fails with a declared storage-plane message instead of whatever `Omnigraph::open` said about an `http(s)://` URI. The `resolve_local_graph` bail is reworded to that storage-plane message, so every storage verb already on the local resolver (`schema plan`, `queries validate`, `lint`) speaks with one voice. 
Net: `optimize --target knowledge` resolves to the graph's storage URI and runs embedded; `optimize --target prod` (remote) fails loudly; `optimize --server` is caught earlier by the guard. Positional-URI invocations are unchanged. Tests (pinned strings, per RFC-010's test plan): optimize happy path on a local graph, `optimize --server` wrong-plane error, `optimize <https>` storage-plane error; the existing `query_lint_rejects_http_targets_without_schema` assertion is updated to the new shared message. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 22:45:58 +03:00
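The exhaustive-match idea behind `command_plane` and `guard_addressing` can be sketched as follows. This is a simplified illustration, not the repository's `planes.rs`: the `Command` and `SchemaCommand` enums are cut-down stand-ins, the guard takes the `--server` value as a plain `Option<&str>`, and the error wording is invented for the example.

```rust
// Hypothetical, cut-down stand-ins for the CLI's real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Plane {
    Data,
    Storage,
    Control,
    Session,
}

#[derive(Debug)]
pub enum SchemaCommand {
    Plan,
    Show,
    Apply,
}

#[derive(Debug)]
pub enum Command {
    Query,
    Optimize,
    Cluster,
    Version,
    Schema(SchemaCommand),
}

/// Exhaustive match with no `_` arm: adding a `Command` variant is a compile
/// error until its plane is declared here. The match descends into the nested
/// enum because the plane differs per `schema` subcommand.
pub fn command_plane(cmd: &Command) -> Plane {
    match cmd {
        Command::Query => Plane::Data,
        Command::Optimize => Plane::Storage,
        Command::Cluster => Plane::Control,
        Command::Version => Plane::Session,
        Command::Schema(SchemaCommand::Plan) => Plane::Storage,
        Command::Schema(SchemaCommand::Show | SchemaCommand::Apply) => Plane::Data,
    }
}

/// Runs once before dispatch: a data-plane addressing flag on a non-data verb
/// fails loudly instead of being silently ignored. The message text is
/// illustrative only, not the pinned string from the repository.
pub fn guard_addressing(cmd: &Command, server_flag: Option<&str>) -> Result<(), String> {
    let plane = command_plane(cmd);
    if server_flag.is_some() && plane != Plane::Data {
        return Err(format!(
            "--server applies only to data-plane commands; {cmd:?} is on the {plane:?} plane"
        ));
    }
    Ok(())
}
```

With this shape, `guard_addressing(&Command::Optimize, Some("prod"))` returns an error, while the same flag on `Command::Query` passes. The missing wildcard arm is the design point: the compiler, not a test, keeps the plane table in step with the command set.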
Andrew Altshuler	2ddb88fad9	docs(rfc): RFC-010 — apply verification-comment current-state fixups (#215) Folds in the Codex verification review (kept verbatim with per-point Resolution notes): - `graphs list` is marked remote-only today in the current-state table (the embedded arm bails; it rides GraphClient only to share the resolver). - `init` is noted as positional-URI-only today (no `--target`); adding `--target` to init is part of the proposal, entangled with the init→cluster apply signpost, not current state. - Validated-fact #1 now describes the post-collapse reality (`GraphClient::resolve*`; only the two factories call `apply_server_flag`), dropping the stale "16 call sites" count. - The Authority rule carries a flag-shape caveat: `--graph` is already a global flag requiring `--server`, so the cluster-managed resolver and its flag shape are deferred to a later slice; the illustrative `--cluster <dir> --graph <id>` spelling is marked not-final. Docs-only; no code change. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 22:24:09 +03:00
Andrew Altshuler	be4f29c0d0	docs(rfc): RFC-010 — restructure the CLI around explicit planes (#214 ) The CLI silently spans three planes (data / storage-maintenance / control) and forces the operator to name a graph differently per plane: the graph you query as `--server prod --graph knowledge` you must maintain as `s3://bucket/knowledge.omni`. Plane restrictions (graphs list is server-only, optimize is storage-only) are accidental — discovered by hitting a cryptic error, not declared. RFC-010 proposes: one graph-addressing model across every verb, a declared per-subcommand capability surface (expanding RFC-009 Phase 4), and plane-grouped --help. Storage maintenance stays off the wire deliberately (no HTTP routes for optimize/cleanup/repair). CLI-internal only — no engine, server, or wire change. Incorporates the Codex review thread (kept verbatim with per-point Resolution notes): sharpened resolver authority rule (operator/legacy target must be direct storage; cluster-managed graphs via explicit --cluster --graph), per-subcommand capability table (schema plan vs show/apply, queries validate vs list, session/tooling classified), graphs list aligned to RFC-009's both-later target, init promoted to an explicit cluster-apply signpost, and a Test plan that extends the existing CLI suites and pins the new wrong-plane error strings. Linked from docs/dev/index.md. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 21:03:56 +03:00
Andrew Altshuler	45500a690a	refactor(cli): collapse export + graphs-list onto GraphClient (RFC-009 Phase 3c) (#213 ) The last two embedded-vs-remote forks move onto the enum, so every such `if` in the CLI now lives in client.rs — the point of the refactor. - `export<W: Write>`: the streaming verb 3b deferred (writes to a writer, chunks the HTTP response body, rather than returning a DTO). Embedded calls db.export_jsonl_to_writer; Remote streams the chunked body through. Opens WITHOUT policy (like reads), so it routes via resolve(). - `list_graphs`: remote-only by design (no local enumeration endpoint), so the Embedded arm keeps the loud "requires a remote multi-graph server" bail verbatim. Routing it through the enum still buys the shared resolve() addressing/token preamble the arm hand-rolled. Retire the now-orphaned execute_export_to_writer / execute_export_remote_to_writer pair, and sweep two pre-existing dead fns while in the files: inferred_config_path (helpers.rs) and yaml_string (output.rs, shadowed by test-local copies). parity_matrix gains one row, parity_export — the single intended matrix change in this phase. Export is a JSONL stream, not a single --json doc, so it compares the two arms' output line-wise (sorted; twin graphs are byte-copies so rows need no scrubbing). graphs-list gets no row: its remote-only behavior is a documented exclusion, not an equality case. Full workspace tests pass; all 12 parity rows green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 21:03:45 +03:00
Andrew Altshuler	d32c1ac191	refactor(cli): collapse write/query forks onto GraphClient (RFC-009 Phase 3b) (#211 ) Phase 3a put the GraphClient enum in place and collapsed the five uniform read forks. 3b folds the remaining data-plane forks onto the same enum: load, ingest, mutate, query, branch create/delete/merge, and schema apply. The wrinkle 3a deferred was the local policy attachment. Reads and query open the local engine without a policy; writes open through open_local_db_with_policy and attribute a resolved actor. So the Embedded variant grows an optional policy context (graph/actor) filled by a second factory, resolve_with_policy; resolve() leaves it empty. open_embedded picks the open path from whether the context is present, preserving both of today's behaviors exactly. query still uses resolve() (no policy), as the read path did. apply_schema takes the catalog-validator closure as impl FnOnce(&Catalog) — the embedded arm runs it inside apply_schema_as_with_catalog_check, the remote arm ignores it (the server runs its own check). That non-object-safe closure is why GraphClient is an enum, not a trait. The stored-query registry is still built caller-side and only for the local path. load and ingest stay separate methods: same operation, but load surfaces the CLI LoadOutput (two distinct per-arm mappings preserved) while ingest surfaces the wire IngestOutput. The now-fully-dead execute_read/ execute_read_remote and execute_change/execute_change_remote pairs are retired (legacy_change_request_body stays — client.rs uses it); the export pair remains for 3c. The Phase-1 parity matrix is unchanged and green; full workspace tests pass. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 19:25:57 +03:00
Andrew Altshuler	81b66f9427	ci: run Test Workspace only on main, not on pull requests (#212 ) The full workspace + failpoints suite was the slowest PR gate (~15min warm, up to the 75min cold ceiling) and dominated PR turnaround. Gate the `test` job with `if: github.event_name != 'pull_request'` so it runs only on push to `main` (post-merge), on `v*` tags, and on manual `workflow_dispatch`. `RustFS S3 Integration` needs `test`, so it becomes push-/dispatch-only by the same cascade. Drop `Test Workspace` from the required-check list in branch-protection.json: a required context that never reports on PRs (the job no longer runs there) would leave every PR permanently pending — the job-never-reports trap the policy already documents. Trade-off accepted deliberately (chosen by the maintainer): a regression the suite would catch now lands on `main` and reddens the post-merge run instead of being blocked pre-merge, so `main` can briefly break. Mitigations documented in ci.md: run `cargo test --workspace --locked` locally before merging non-trivial changes (or trigger the workflow on your branch via workflow_dispatch), and regenerate openapi.json locally for server/API changes (the auto-regen step lived in the now-PR-skipped test job). The fast PR gates remain: Classify Changes, Check AGENTS.md Links, the AWS-feature build/test, and the two CODEOWNERS checks. NOTE: an admin must run ./scripts/apply-branch-protection.sh after this merges, or GitHub keeps requiring the now-unreported Test Workspace context. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-13 19:23:41 +03:00
Andrew Altshuler	7bfe9c6d69	Merge pull request #210 from ModernRelay/refactor/graph-client-reads refactor(cli): GraphClient enum + read verbs (RFC-009 Phase 3a)	2026-06-13 18:02:13 +03:00
aaltshuler	25d74d689d	refactor(cli): GraphClient enum + read verbs (RFC-009 Phase 3a) The embedded-vs-remote split gets one home: a GraphClient enum (Embedded { uri } | Remote { http, base_url, token }) with a resolve() factory that absorbs the shared preamble (apply_server_flag -> token -> URI/remoteness) and a verb method per command. The five uniform read forks — branch list, commit list, commit show, schema show, snapshot — collapse from per-command if-graph-is-remote else to one line each (main.rs: -113/+47). Behavior identical per verb (local reads still open WITHOUT policy, as today); the Phase-1 parity matrix is the referee and passes textually unchanged. Enum, not the RFC trait: only two variants ever, and inherent async methods avoid async_trait boxing and the apply_schema closure that is not object-safe (3b) — same one-body-two-impls collapse, less ceremony. Scope: the uniform reads only. The query verb (policy-open + operator-alias early-return + param merge) joins the write verbs in 3b; export/streaming and graphs-list in 3c, where the now-shared execute_*_remote/execute_* pairs get retired. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 17:44:49 +03:00
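The enum-with-verb-methods collapse described in this commit can be illustrated with a small synchronous sketch. It is not the repository's `GraphClient`: the real type has async methods and an `http` client field, and `resolve` here decides remoteness from the URL scheme alone. The `branch_list_plan` method and the `/branches` route are hypothetical names used only for illustration.

```rust
// Hypothetical, synchronous sketch of the two-variant client.
#[derive(Debug, PartialEq, Eq)]
pub enum GraphClient {
    Embedded { uri: String },
    Remote { base_url: String, token: Option<String> },
}

impl GraphClient {
    /// Shared preamble: decide remoteness once, so individual commands no
    /// longer carry their own "is this graph remote?" branch.
    pub fn resolve(target: &str, token: Option<String>) -> GraphClient {
        if target.starts_with("http://") || target.starts_with("https://") {
            GraphClient::Remote {
                base_url: target.trim_end_matches('/').to_string(),
                token,
            }
        } else {
            GraphClient::Embedded {
                uri: target.to_string(),
            }
        }
    }

    /// One verb method; the per-arm fork lives here instead of in each
    /// command. It returns a description of the action rather than doing I/O.
    pub fn branch_list_plan(&self) -> String {
        match self {
            GraphClient::Embedded { uri } => {
                format!("open {uri} locally and list branches")
            }
            GraphClient::Remote { base_url, .. } => {
                // `/branches` is an assumed route name, not taken from the source.
                format!("GET {base_url}/branches")
            }
        }
    }
}
```

A command body then reduces to `GraphClient::resolve(target, token).branch_list_plan()`. Inherent methods on an enum also allow generic or closure-taking verbs, which a trait object could not carry; that is the object-safety point the commit message makes about `apply_schema`.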
Andrew Altshuler	e4334deb14	Merge pull request #209 from ModernRelay/refactor/api-types-crate refactor(api): extract omnigraph-api-types crate (RFC-009 Phase 2)	2026-06-13 17:32:34 +03:00
aaltshuler	3e2502c35e	docs: omnigraph-api-types in the crate list; RFC-009 Phase 2 outcome Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 17:10:00 +03:00
aaltshuler	adbb2a181c	refactor(cli): consume omnigraph-api-types directly; unify the load mapping The CLI's wire-DTO imports repoint from omnigraph_server::api to omnigraph-api-types (the server's other exports — queries registry, config types — still come from omnigraph-server). The local Load arm's inline LoadOutput hand-construction in main.rs is extracted into load_output_from_result next to load_output_from_tables in output.rs, so both '-> LoadOutput' mappings (engine LoadResult for local, wire IngestOutput for remote) live in one place. Deviation from the plan, with reason: LoadOutput stays CLI-side rather than moving into the wire-DTO crate — it is a rendered CLI output type, not an HTTP wire DTO, and its mapping consumes a CLI clap type (CliLoadMode). The shared crate stays strictly wire DTOs. Shapes unchanged: the parity matrix passes textually unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 17:05:32 +03:00
aaltshuler	4821e7208f	refactor(api): extract omnigraph-api-types crate (RFC-009 Phase 2) The HTTP wire DTOs and their engine-result -> DTO mappings move from omnigraph-server's api module into a new omnigraph-api-types crate that both server and CLI can depend on (engine must not — DAG: api-types -> engine, never the reverse). The crate holds plain serde/utoipa types only; the transport-coupled error->status mapping stays in the server (lib.rs/ handlers). The one server-runtime coupling (query_catalog_entry, which maps a StoredQuery — not a wire type) stays behind in api.rs, now calling the crate's pub param_descriptor. api.rs becomes a thin `pub use omnigraph_api_types::*` re-export, so every omnigraph_server::api::Foo path (handlers, the OpenApi schema list, CLI imports) resolves unchanged. openapi.json regenerates BYTE-IDENTICAL (the Phase-2 referee: 77 openapi tests green, zero diff). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 17:03:20 +03:00
Andrew Altshuler	82dda296d1	Merge pull request #208 from ModernRelay/test/parity-matrix test(cli): embedded/remote parity matrix (RFC-009 Phase 1)	2026-06-13 16:53:54 +03:00
Ragnor Comerford	446b46d548	Recovery liveness, storage fault-injection matrix, and one storage implementation over object_store (#203 ) * test(engine): pin the long-lived-handle heal contract for sidecar-covered drift A Phase B -> Phase C failure (commit_staged advanced Lance HEAD, manifest publish did not land, recovery sidecar persists) currently wedges every subsequent staged write on the same engine handle: the commit-time drift guard rejects with 'run omnigraph repair', but repair itself refuses while a recovery sidecar is pending, so a long-lived server can only recover by restart. The documented contract (writes.md 'Long-running servers', invariants.md invariant 5) says refresh-time roll-forward closes this residual without restart -- but no write path runs it. Two red tests pin the intended contract at the write entry points: a follow-up load (the POST /ingest shape: shared handle, no reopen) and a follow-up mutation must heal roll-forward-eligible sidecars in-process and then succeed. Currently failing with: table 'node:Company' has Lance HEAD version 2 ahead of manifest version 1; run `omnigraph repair` before writing The fix lands in the next commit. * fix(engine): heal pending recovery sidecars at the staged-write entry points Close the long-lived-process gap in the recovery protocol: a Phase B -> Phase C residual (per-table commit_staged landed, manifest publish did not, sidecar persists) previously recovered only at the next ReadWrite open or via an explicit refresh() that no production write path called, so a long-lived server wedged every subsequent write on the commit-time drift guard until restart. 
New recovery::heal_pending_sidecars_roll_forward: - one list_dir of __recovery/ at write entry (empty -> immediate return, the steady state), so the per-write cost is one storage list; - per sidecar, acquires the same per-(table_key, table_branch) write queues every sidecar writer holds from before write_sidecar until after delete_sidecar, then re-checks sidecar existence -- this serializes the heal against live writers instead of rolling an in-flight sidecar forward from under its writer (which would fail that writer's publish CAS spuriously). Lock order queues -> coordinator matches every writer's commit->publish path. This is the queue-acquisition design recovery.rs and write_queue.rs already documented for in-process recovery; - processes in RollForwardOnly mode: the common residual rolls forward in-process; rollback-eligible sidecars still defer to the next ReadWrite open (Dataset::restore is unsafe under concurrency). Wire it into load_as and mutate_as (before the inline delete path can advance any HEAD), and rebase Omnigraph::refresh onto the same helper so refresh stops racing live writers' sidecars. The maintenance entry points (apply_schema_as, branch_merge_as, ensure_indices) intentionally keep their strict fail-loud preconditions for now; wiring the same heal there is a follow-up with its own tests. Turns the previous commit's two red tests green. * fix(engine): name the right recovery path in the commit-time drift guard The drift guard's 'run omnigraph repair before writing' advice is a dead end when the drift is covered by a pending recovery sidecar: repair refuses while a sidecar is pending. With the write-entry heal in place, reaching this guard with sidecar-covered drift means the heal deferred it (rollback-eligible), and the actual recovery path is a read-write reopen. 
Distinguish the two classes on the error path only (one sidecar list, after the conflict is already certain); a listing failure falls back to the uncovered-drift wording rather than masking the conflict. Pinned by extending refresh_defers_rollback_eligible_sidecar_to_next_open with a write attempt against the deferred sidecar. * docs: write-entry in-process sidecar heal — contract and coverage Update the recovery contract docs to match the previous two commits: invariant 5 now states that the staged-write entry points and refresh run in-process roll-forward recovery (long-lived processes converge on the next write, not at restart); writes.md 'Long-running servers' describes the heal's queue-acquisition concurrency contract, the improved drift-guard error, and the entry points that intentionally do not heal yet; testing.md indexes the new failpoint tests; AGENTS.md capability matrix drops the claim that in-process recovery is entirely future work (only the rollback path remains with the background reconciler). * test(engine): pin the entry heal contract for schema apply and branch merge Without the write-entry heal, the two maintenance writers do worse than wedge on sidecar-covered drift -- they proceed and decide its fate implicitly: - schema apply re-plans table rewrites from the manifest pin, orphaning the drifted Phase-B commit (its rows silently vanish from the rewritten table) while the stale sidecar lingers to misclassify against the post-apply pins; - branch merge publishes over the drift, making the failed writer's commit visible as an unattributed side effect (no recovery audit row), and leaves the stale sidecar behind. Two red tests pin the intended contract: both entry points heal the sidecar first (attributed roll-forward), then run on the converged state. Currently failing on the stale-sidecar / dropped-rows assertions; the fix lands in the next commit. 
* fix(engine): heal pending recovery sidecars at the schema-apply and branch-merge entries Extend the write-entry heal to the remaining two write entry points. Unlike load/mutate (which wedge on the drift guard), these proceeded over sidecar-covered drift and decided its fate implicitly: - schema apply re-planned table rewrites from the manifest pin, orphaning the drifted Phase-B commit -- its rows silently vanished from the rewritten table -- while the stale sidecar lingered to misclassify against the post-apply pins; - branch merge published over the drift, making the failed writer's commit visible without a recovery audit row, and left the stale sidecar behind. Both now run the same queue-serialized roll-forward heal at entry, before their own sidecar exists, so recovery is attributed (audit row) and deterministic. ensure_indices stays heal-free: it runs inside the load / schema-apply flows after their entry heal. Turns the previous commit's two red tests green. Docs updated in the same change (invariant 5, writes.md, testing.md, AGENTS.md). * test(engine): pin Phase A sidecar-write failure semantics Storage fault-injection matrix, row 1: a sidecar PUT failure (S3 PutObject / fs write) in Phase A. New failpoint recovery.sidecar_write at the top of write_sidecar -- the single choke point all five sidecar writers go through -- models the storage error backend-generically. Also adds the other three storage-fault failpoints used by the following commits (recovery.sidecar_delete, recovery.sidecar_list, recovery.record_audit); each is a no-op without the failpoints feature. Pinned contract: every writer writes its sidecar BEFORE its first HEAD-advancing commit, so a put failure aborts with zero drift (no sidecar, Lance HEAD == manifest pin, no rows) and a transient fault never wedges the graph -- the same handle writes/merges normally once it clears. 
Covered for load (the staging writer) and branch_merge (the multi-table writer, forced onto the RewriteMerged path by diverging both sides). * test(engine): pin Phase D delete, list, and audit-append storage-fault semantics Storage fault-injection matrix, rows 2/3/5, plus the real-backend run: - recovery.sidecar_delete: a Phase D delete failure (S3 DeleteObject) must NOT fail the user's write -- the manifest publish already landed, so the caller's data is durable. The swallowed failure leaves a stale sidecar; the next write's entry heal consumes it via the stale-sidecar audit-recovery path (RolledForward, attributed). - recovery.sidecar_list: a __recovery/ list failure (S3 ListObjectsV2) is loud at every consumer -- the write-entry heal fails the write and the open-time sweep fails the open. Silently skipping recovery over a pending sidecar would be consumer tolerance of drift. Once the fault clears, open recovers the pending sidecar normally. - recovery.record_audit: an audit write failure after the roll-forward's manifest publish aborts that recovery attempt and keeps the sidecar; re-entry detects the already-published manifest, records exactly ONE RolledForward audit row, and converges -- the retry tolerance documented on record_audit, exercised end-to-end. - s3_load_recovers_after_publisher_failure_without_reopen: the same-handle heal scenario on a real bucket (gated on OMNIGRAPH_S3_TEST_BUCKET, skips locally), exercising sidecar put/list/delete through S3StorageAdapter instead of the local-FS adapter. CI wiring lands in a follow-up commit. 
* test(engine): refuse corrupt recovery sidecars loudly Storage fault-injection matrix, row 4 (no failpoint needed -- the corrupt file is written by hand, sibling to the unknown-schema-version refusal test): a truncated/garbage __recovery/{ulid}.json must be refused loudly by both the write-entry heal (the write fails naming the parse error) and the open-time sweep (ReadWrite open fails naming the file), with the file left on disk for operator inspection. Read-only opens still work -- the sweep is skipped there. * test(engine): run the S3 sidecar-lifecycle coverage in CI + document the fault matrix - ci.yml rustfs_integration: new step running the bucket-gated failpoints tests (name filter s3_) against the RustFS container, so sidecar put/list/delete are exercised through S3StorageAdapter on every storage-affecting PR. - writes.md: sidecar I/O failure semantics -- Phase A put failure aborts with zero drift; Phase D delete failure is swallowed (write already durable) and healed by the next write; list failures are loud at heal and open; corrupt sidecars are refused with the file kept for inspection; audit-append failures are retried to exactly one audit row. - testing.md: index the storage-fault matrix in the failpoints.rs row and the new RustFS CI line. * test(engine): pin read-visibility of acknowledged local if-absent writes The cluster lib test import_missing_state_creates_state_with_graph_observation flakes at ~50% under full-workspace load ('EOF while parsing a value' reading back the state.json its own import just acknowledged). Root cause is in the engine's local storage adapter: write_text_if_absent writes through a buffered tokio::fs::File and returns when write_all resolves -- which, per tokio's documented File semantics, means the bytes reached tokio's internal buffer, not the file. The actual write completes in a background blocking task after drop, so a caller that acknowledges success and reads the object back can see an empty or partial file. 
Under load the window widens; the red run fails at iteration 0 with 0 of 8192 bytes on disk. The regression test pins the contract at the adapter boundary: when write_text_if_absent resolves, the full contents are visible to any reader; a losing second claim leaves the winner's object untouched. The fix lands in the next commit. * fix(engine): publish local storage writes with atomic visibility Close the class, not the instance. The local adapter admitted three ways for a reader to observe a write that was acknowledged or visible before its bytes were complete: 1. write_text_if_absent acknowledged success when the buffered tokio::fs::File write_all resolved -- i.e. when the bytes reached tokio's internal buffer, not the file. A caller reading back its own acknowledged write could see an empty object (the ~50% cluster import flake under full-workspace load; the regression test failed at iteration 0 with 0 of 8192 bytes visible). 2. The same call published its CLAIM (create_new) before its CONTENT, so concurrent readers saw an empty claimed file in the window. 3. write_text (plain tokio::fs::write) exposed truncated content mid-replace -- silently falsifying write_sidecar's 'readers either see the complete sidecar or none' contract on local FS (true on S3, where PutObject is atomic). A flush in write_text_if_absent would have fixed only (1). Instead, both local write paths now publish complete temp files atomically: rename for replace (write_text -- the idiom write_text_if_match already used) and hard_link for no-replace (write_text_if_absent -- link fails AlreadyExists, so exactly one of N concurrent claimants wins and the winner's object is fully readable at the instant it becomes visible). The local adapter now honors the same object-level atomic-visibility contract as the S3 adapter, which is what every caller (recovery sidecar protocol, cluster state CAS) was written against. Crash-orphaned .tmp. 
files are inert: the sidecar sweep filters to .json, and cluster state reads address state.json by name. fsync/durability policy is unchanged (no fsync before, none now); this fix is about visibility ordering, not power-loss durability. Pre-existing on main (landed with the multi-graph server mode change, PR #119); surfaced by this branch's heal work only because one extra list_dir per write shifted test timing. Cluster lib suite: 12/25 failures before, 0/25 after. Turns the previous commit's red test green. * refactor(engine): one storage implementation over object_store for every backend Collapse LocalStorageAdapter (hand-rolled tokio::fs) and S3StorageAdapter into a single ObjectStorageAdapter backed by Arc<dyn object_store::ObjectStore> -- LocalFileSystem for local URIs, the existing AmazonS3 build for s3://, plus a pub in_memory() constructor (full contract including TRUE conditional updates; the in-memory test backend testing.md asked for at the adapter level). Why: the acknowledged-before-visible bug showed the two-impl shape has no referee -- one prose contract, two independent answers. Upstream LocalFileSystem::put_opts is byte-for-byte the staged-temp+rename/hard_link idiom that fix converged on, and Lance's own commit protocol is built on the same primitives (put-if-not-exists / rename-if-not-exists), so the substrate-aligned move is to stop hand-rolling it. The per-backend residue shrinks to a UriCodec (URI <-> object path) and one capability flag. Semantics preserved by construction, with three deliberate deltas: - exists() is now object-store-semantics everywhere (head + non-empty prefix fallback): an EMPTY local directory no longer 'exists'. The only dir-shaped caller (_graph_commits.lance probes) self-heals via ensure_commit_graph_initialized where it previously wedged loudly. - A directory at an object path reads as NotFound, not as an IO error ('only objects exist'). 
The cluster unreadable-payload test used a same-named directory as a portable non-NotFound trigger; it now uses chmod 000, which still models genuine transient IO. - write_text_if_match keeps content-token semantics on local (PutMode::Update is NotImplemented upstream for LocalFileSystem in 0.12.5 and 0.13.2); the capability flag gates the token SOURCE in read_text_versioned too -- an ETag token with content-compare writes would lose every CAS. delete_prefix keeps a local remove_dir_all branch: directories are a local-FS concept, and list+delete would leave empty skeletons that cluster graph_root_exists (raw Path::exists) reports as still present. LocalStorageAdapter remains as a delegating shim so the pinned contract tests gate this swap textually unchanged; the shim and the test parameterization over local + in-memory land next. Cargo gains the explicit 'fs' feature (already transitively enabled by lance). * test(engine): one executable storage contract, run against every backend Remove the LocalStorageAdapter delegation shim and migrate its construction sites to ObjectStorageAdapter::local(). Replace the per-backend duplicated tests with a single contract_suite asserting the trait's promises (atomic replace, exists incl. the dataset-root prefix probe, one-winner if_absent, versioned CAS with loud CAS-lost, rename, list round-trip with no sibling-prefix bleed, idempotent delete/delete_prefix), run against the local backend and the new in-memory backend -- which implements true conditional updates, so the strong-CAS path is exercised without a bucket. The bucket-gated S3 variant already exists (s3_adapter_conditional_writes_contract). 
New local-specific pins for the deliberate semantic edges of the collapse: empty directories are not objects (exists=false; the Lance dataset-root probe shape is the non-empty case), file://-anchored and spaces-in-path list output round-trips byte-identically into read_text, dot-segment paths are lexically absolutized (the CLI's ./graph.omni shape), and upstream rename creating missing destination parents. The acknowledged-write visibility regression test stays, now documenting that the cross-API std::fs read-back is the point. * refactor(cluster): drop put_json's per-backend atomicity branch The local temp+rename dance predates the storage adapter guaranteeing atomic visibility; now that write_text publishes via a staged temp + rename on the filesystem (and a single atomic PUT on object stores) by contract, the branch duplicated upstream behavior. One call, both backends. * docs: storage adapter collapse — contract, in-memory backend, local CAS gap - testing.md: the 'no MemStorage backend' note is half-closed — ObjectStorageAdapter::in_memory() covers the text-object layer with the full contract (true conditional updates); Lance datasets bypass the adapter, so the engine substrate ask stays open. - invariants.md: truth-matrix Tests row updated; new Known Gap for local write_text_if_match (upstream PutMode::Update is unimplemented for LocalFileSystem; content-token emulation is safe only under the cluster lock protocol — close before admitting a lock-free caller). - writes.md: backend notes for the unified adapter (name#N staging residue invisible to the sweep, backend-wrapped error text with exists()-probing for missing-vs-error, loud permission failures). 
* docs: finish renaming the storage adapters in user docs and test comments storage.md's URI-scheme table and the S3 failpoint test's doc comment still named the deleted LocalStorageAdapter/S3StorageAdapter; both now describe the unified ObjectStorageAdapter over object_store, including the relative-path absolutization note for local URIs. * test(engine): pin branch-awareness of the drift guard's recovery advice A pending sidecar on ANOTHER branch does not cover this branch's drift: with a deferred feature-branch sidecar on disk and genuinely uncovered drift on main, the main write's error must still point at omnigraph repair -- a read-write reopen recovers the sidecar but cannot repair main's uncovered drift. Currently red: the guard matches sidecar pins by table_key only, so the feature sidecar flips main's advice to the reopen path. Fix in the next commit. Surfaced by external review of the drift-guard change. * fix(engine): branch-aware sidecar matching in the drift guard's advice The commit-time drift guard's sidecar-covered check matched pins by table_key alone, so a pending sidecar on another branch flipped this branch's uncovered-drift advice from 'run omnigraph repair' to the reopen path -- and a reopen recovers that sidecar but cannot repair this branch's drift. Compare the pin's table_branch too. Turns the previous commit's red test green. Surfaced by external review of the drift-guard change. * test(engine): pin heal non-interference with a live schema apply The write-entry heal's schema-staging reconcile runs before any queue acquisition, so a load on the same handle, overlapping a schema apply parked between its staging write and manifest commit, promotes the apply's staging files (new catalog live against the old manifest), classifies the LIVE apply's sidecar, and publishes its registrations out from under it. The resumed apply then collides with its own stolen commit. 
Currently red with: Lance("Concurrent modification: table version 3 already exists for node:Tag") The fix (per-sidecar reconcile under the sidecar's write-queue guards, plus a serialization key the schema-apply writer and the heal both acquire) lands in the next commit. Surfaced by external review of the write-entry heal. * fix(engine): serialize the heal's schema-staging reconcile with live schema applies The write-entry heal ran recover_schema_state_files up front, before acquiring any queue guards. Overlapping a live schema apply parked between its staging write and manifest commit, the heal promoted the apply's staging files (new catalog live against the old manifest), classified the LIVE apply's sidecar, and published its registrations -- the resumed apply then collided with its own stolen commit. Correct by construction: - New schema-apply serialization queue key, acquired by the schema-apply writer (alongside its per-table keys) from before write_sidecar until after delete_sidecar. Per-table keys alone don't cover a registration-only migration, which pins no existing tables but has a sidecar and staging files on disk. - The heal reconciles schema staging lazily, PER SchemaApply sidecar, after acquiring that sidecar's guards (including the serialization key) and re-confirming the sidecar exists -- a sidecar that survives the queue wait belongs to a dead writer, so the reconcile can no longer race a live apply. Recomputing per sidecar also removes the staleness of one up-front result across a multi-sidecar pass. - Omnigraph::refresh drops its up-front reconcile-and-pass-through (same race, and a pre-promoted result would make the heal's guarded reconcile see clean staging and wrongly defer the sidecar): it now reconciles standalone only when NO sidecar exists -- which cannot race a live apply, whose sidecar always precedes its staging files -- and otherwise defers entirely to the heal. 
The open-time sweep keeps its precomputed reconcile: open has no concurrent writers. Turns the previous commit's red test green. Surfaced by external review of the write-entry heal. Self-audit addendum folded in: refresh's no-sidecar gate had a TOCTOU (a live apply could write its sidecar + staging between the empty check and the reconcile) — the standalone reconcile now holds the serialization key across the list-then-reconcile pair. The remaining residual is cross-process only (in-process queues cannot serialize against a writer in another process; the open-time sweep has the same pre-existing exposure) and is now an explicit Known Gap in invariants.md rather than an implicit one. * test(engine): pin catalog reload after the heal recovers a schema apply When the write-entry heal rolls a crashed apply's SchemaApply sidecar forward on the same handle, disk and manifest move to the new schema (staging promoted, registrations published) but the handle's in-memory schema_source/catalog do not. Subsequent writes then validate against the stale catalog and reject rows of types the graph already has. Currently red with: record 1: unknown node type 'Tag' refresh() reloads after its heal; the write entry points must too. Fix in the next commit. Surfaced by external review of the write-entry heal. * fix(engine): reload the in-memory catalog after the heal recovers a schema apply heal_pending_recovery_sidecars refreshed the coordinator and invalidated the runtime cache after processing sidecars, but never reloaded schema_source/catalog — so a write whose entry heal rolled a crashed SchemaApply sidecar forward proceeded to validate against the OLD schema while disk and manifest were already on the new one. reload_schema_if_source_changed is the same post-heal step refresh() already runs; it no-ops on the (overwhelmingly common) non-schema heal because the on-disk source is unchanged. Turns the previous commit's red test green. 
Surfaced by external review of the write-entry heal. * test(engine): pin that a deleted-branch sidecar cannot wedge the graph A rollback-eligible sidecar pinned to a branch is deferred by every roll-forward-only pass; if the branch is then deleted, the sidecar survives, referencing a branch with no manifest tree. The heal (every write entry) and the open-time sweep (every ReadWrite open) both fail opening the dead branch, and repair refuses while a sidecar is pending -- a terminal read-only state with manual sidecar surgery as the only exit. Currently red with: Lance("Not found: .../__manifest/tree/feature/_versions") The branch's tree and forks are already reclaimed, so the pinned drift is unreachable and the sidecar is provably moot; the fix classifies it as an orphaned-branch terminal state (audit + discard) in both passes. Surfaced by review (P1, verified by repro). * fix(engine): classify deleted-branch sidecars as orphaned instead of wedging A deferred (rollback-eligible) sidecar pinned to a branch survives branch_delete; both the write-entry heal and the open-time sweep then failed unconditionally opening the dead branch -- every write and every ReadWrite open errored, and repair refuses while a sidecar pends. Terminal state, manual sidecar surgery the only exit. The branch's tree and per-table forks are already reclaimed at delete, so the drift the sidecar pins is unreachable and the sidecar is provably moot. Both passes now check the sidecar's branch against the manifest's branch list (the authority -- deliberately NOT inferred from a Not-found on open, which could be a transient storage error masking real recovery intent) and discard orphans with an OrphanedBranchDiscarded audit row, commit appended on main since the sidecar's own branch no longer has a commit graph. The open-time half is pre-existing; the write-entry heal made it hot. Turns the previous commit's red test green. Surfaced by review (P1, verified by repro). 
* chore: harden review nits — vacuous CI filter, root-runner skip, liveness note - ci.yml: the RustFS sidecar-lifecycle step now fails loudly if the 's3_' name filter matches zero tests (cargo passes vacuously on an empty filter; the step exists specifically to prove S3 sidecar I/O coverage). The pre-existing CLI smoke step has the same shape and is left for a follow-up. - cluster unreadable-payload test: cfg(unix) + a skip-with-log when running as root (mode 000 is still readable to root, common in container dev runners), so the test degrades instead of failing. - refresh: document the one-pass-late convergence for legacy staging residue while non-SchemaApply sidecars pend, so nobody 'fixes' it by re-running the reconcile unserialized — the exact race the serialization key closes. * test(engine): pin orphan-discard idempotency across a delete fault discard_orphaned_branch_sidecar writes its audit row and main commit before deleting the sidecar; a Phase D delete fault leaves the sidecar on disk with the audit already durable, and the retry repeated the whole path -- a second OrphanedBranchDiscarded audit row (and commit) for the same operation. Currently red: 2 rows after one fault + retry. The retry must only finish the delete. Fix next. Also promotes the recovery-audit kinds reader into the shared test helpers (it was recovery.rs-local). Surfaced by external review of the orphan-discard fix. * fix(engine): orphan-discard idempotency + heal reports acted-vs-deferred Two review findings on the recovery surface: - discard_orphaned_branch_sidecar now checks the audit table for an existing (operation_id, OrphanedBranchDiscarded) row before appending the commit + audit pair, so a Phase D delete fault retries ONLY the delete instead of duplicating audit rows and commit-graph entries. Cold path: the list scan runs only when an orphaned sidecar exists. Turns the previous commit's red test green (exactly one audit row across fault + retry). 
- process_sidecar returns whether durable state changed; the heal sets processed_any only for sidecars that were actually rolled forward / rolled back / audit-recovered (orphan discards count). Deferred sidecars (rollback-eligible, invariant-violating, unpromoted SchemaApply) no longer trigger a per-write schema reload + full runtime-cache invalidation while they pend -- the cache is snapshot-keyed so this was waste, not corruption, but it was paid on every write until reopen. Acted-paths' processed=true remains pinned by load_after_schema_apply_phase_b_failure_uses_recovered_catalog (the reload depends on it). Surfaced by external review. * test(engine): pin the orphan-discard audit-append fault leg as documented tolerance The orphan discard's commit append and audit append are two writes; a failure between them leaves a recovery commit with no audit row, and the retry (keyed on the audit row, the operator-facing record) appends a second commit before the audit lands. This is the same not-atomic-pair-write tolerance record_audit documents and the manifest->commit-graph Known Gap covers for every publish: bounded commit-graph noise, audit row exactly-once under clean failures. Keying idempotency on commit rows instead would need an operation_id column on _graph_commits, and audit-before-commit would dangle the graph_commit_id join -- both worse than the documented residual. Make the tolerance explicit instead of implicit: docstring names the window, a failpoint sits inside it, and the new test pins convergence across the fault (sidecar consumed, exactly one audit row), completing the orphan-discard fault matrix alongside the delete-fault leg. Surfaced by external review of the orphan-discard idempotency. 
* test(engine): pin honest drift-guard advice when sidecar listing fails The guard's unwrap_or(false) conflated 'classified as uncovered' with 'could not classify': a transient list fault on the guard's second list (the entry heal's first list having succeeded) confidently routed the operator to omnigraph repair even when the heal had just deferred a rollback-eligible sidecar -- and repair refuses while a sidecar is pending. Currently red: the error says 'run omnigraph repair' with no mention of the reopen path. The fix names both paths plus the failure cause when classification is impossible. Surfaced by external review of the drift-guard fallback. * fix(engine): admit ambiguity in the drift guard when sidecar listing fails Replace the unwrap_or(false) fallback with a tri-state: covered -> reopen advice; uncovered -> repair advice; listing FAILED -> say the drift could not be classified, name the cause, and give both paths in order ('run repair, or reopen read-write if repair reports a pending sidecar'). The old fallback confidently routed a transient list fault to repair, which refuses while a sidecar is pending -- a self-correcting but pointless detour. The conflict itself is still always raised; only the advice degrades honestly. Turns the previous commit's red test green. Surfaced by external review of the drift-guard fallback.	2026-06-13 11:20:08 +02:00
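The tri-state that replaces `unwrap_or(false)` can be pictured roughly as follows. This is a minimal sketch, not the engine's code: `DriftCoverage`, `classify`, and `advice` are hypothetical names, and the listing result is modeled as a plain `Result<bool, String>`.

```rust
// Hypothetical illustration of the tri-state described in the commit above.
// None of these names come from the omnigraph codebase.

/// Answer to "does a pending sidecar cover this drift?".
#[derive(Debug, Clone, PartialEq, Eq)]
enum DriftCoverage {
    /// A matching sidecar is pending: a read-write reopen recovers it.
    Covered,
    /// No sidecar covers the drift: only repair can help.
    Uncovered,
    /// The sidecar listing itself failed, so neither answer is known.
    Unclassified { cause: String },
}

/// Map the listing outcome to a coverage state without collapsing
/// "listing failed" into "uncovered" (the old unwrap_or(false) behavior).
fn classify(listing: Result<bool, String>) -> DriftCoverage {
    match listing {
        Ok(true) => DriftCoverage::Covered,
        Ok(false) => DriftCoverage::Uncovered,
        Err(cause) => DriftCoverage::Unclassified { cause },
    }
}

/// Operator-facing advice; the ambiguous case names the cause and both paths.
fn advice(coverage: &DriftCoverage) -> String {
    match coverage {
        DriftCoverage::Covered => "reopen read-write to recover the pending sidecar".to_string(),
        DriftCoverage::Uncovered => "run omnigraph repair".to_string(),
        DriftCoverage::Unclassified { cause } => format!(
            "drift could not be classified ({cause}): run omnigraph repair, \
             or reopen read-write if repair reports a pending sidecar"
        ),
    }
}
```

Calling `advice(&classify(Err("list failed".to_string())))` yields a message that mentions the cause and both recovery paths, whereas the two classified states each give a single instruction.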
aaltshuler	08c9b03d40	test(cli): the embedded/remote parity matrix (RFC-009 Phase 1) The referee before any unification moves: every forked verb runs once against the local graph and once against a spawned server on a twin copy of the same fixture, with the SAME actor (--as locally; bearer-resolved remotely) and the SAME Cedar bundle on both arms — like-for-like enforcement is part of the harness (a tokens-only server is default-deny by design; comparing that against a bare local arm measures configuration, not the fork). Declared-volatile fields (ids, wall-clock, transport locations) scrub to placeholders; everything else must match exactly, and exit codes must match for shared failures. Headline result: 11 rows green with an EMPTY divergence ledger — the arms agree on every verb today. The ledger (KNOWN_DIVERGENCES) exists so any future divergence is pinned or filed, never silently repaired; repairs are Phase 3's job, gated by this referee staying green. One engine observation surfaced and filed (#207): inline execution with a declared-but-unbound param matches ALL rows on both arms, while the stored-query invoke path hard-errors — a cross-path asymmetry the matrix pins as agreeing behavior pending a deliberate fix. Documented exclusions (graphs list, ingest/load-over-/ingest, storage-plane verbs) map to RFC-009 Phases 4-5. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 17:50:46 +03:00
Andrew Altshuler	e0d80c0062	Merge pull request #206 from ModernRelay/rfc/unify-access-paths docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus	2026-06-12 17:38:22 +03:00
Andrew Altshuler	0c43b4efa0	Merge pull request #202 from ModernRelay/release/v0.7.0-prep release: bump workspace to 0.7.0	2026-06-12 17:38:17 +03:00
aaltshuler	9002cfd5b9	docs(rfc): RFC-009 -- unify CLI access paths; align the RFC corpus Adopts the unify-embedded/remote draft as RFC-009 with three alignment amendments: (1) the promised 'companion config-authority RFC' is RFC-008, already landed through stage 4 -- referenced, not re-proposed; (2) open question 3 is answered by the two-surface architecture (embedded graphs list enumerates the cluster catalog via read_serving_snapshot, never omnigraph.yaml); (3) Phase 2 salvages PR #139's reviewed-clean omnigraph-api-types extraction instead of rebuilding. Adds the cycle's two no-referee bugs (alias positional, write-if-absent flush) as concrete parity-matrix motivation, and RFC-007's addressing/credential chains as RemoteClient constructor inputs. Corpus alignment: RFC-002's header now maps each of its pieces to the successor that landed or superseded it (007/008/009) with a do-not-implement-from-here-unchecked warning; RFC-007 gains the RFC-009 relationship; RFC-008 stage 5 notes the Phases-4/5 easing; dev index row. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 17:33:11 +03:00
aaltshuler	98c6147c38	docs(testing): bring the test map up to release truth Lands an orphaned-but-accurate working-tree edit (the engine table rows for forbidden_apis.rs, lance_surface_guards.rs, traversal_indexed, proptest_equivalence, ordering, literal_filters, policy_engine_chassis — all real files; 21 -> 28 count) and replaces the stale pre-modularization crate rows: the CLI and server entries now describe the per-area suites (#192/#193 splits) plus this cycle's additions (RFC-008 deprecation coverage, keyed-credential auth, hermetic OMNIGRAPH_HOME harness, the bucket-gated s3 suites). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 14:12:33 +03:00
aaltshuler	dedd647cde	release: bump workspace to 0.7.0 All six crate manifests + their path-dependency constraints, Cargo.lock, the regenerated openapi.json version metadata, AGENTS.md's surveyed version, and the v0.7.0 release notes (object-storage clusters, config-free --cluster serving, the operator config surface, keyed credentials, operator targeting/aliases, and the omnigraph.yaml deprecation stages). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 14:12:33 +03:00
Andrew Altshuler	c94ee2572f	Merge pull request #205 from ModernRelay/ci/restore-ragnorc-codeowners ci(codeowners): restore ragnorc to engineering and docs roles	2026-06-12 14:12:18 +03:00
Andrew Altshuler	13ceab3336	Merge pull request #204 from ModernRelay/fix/local-write-if-absent-flush fix(storage): flush before acking in local write_text_if_absent	2026-06-12 14:12:14 +03:00
aaltshuler	b24bb16d0c	ci(codeowners): restore ragnorc to engineering and docs roles Re-adds ragnorc to both roles in the source of truth and regenerates CODEOWNERS + the ownership tables. This also resolves the standing inconsistency from #169: branch-protection.json's bypass_pull_request_allowances still listed ragnorc after his codeowners removal — the two lists are in sync again (no protection change needed). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 13:45:33 +03:00
aaltshuler	aabb3dca2e	fix(storage): flush before acking in local write_text_if_absent tokio's async File buffers writes internally: write_all only fills the buffer, and the actual OS write happens in a background task after drop — so write_text_if_absent could return Ok(true) with the file created but still EMPTY, and an immediate reader saw EOF. Caught twice in CI as 'EOF while parsing a value' reading state.json right after cluster import (the cluster's first state-write routes here since the storage port); also an invariant-6 violation (acknowledged before the write reached the OS). The other local write paths use tokio::fs::write, which flushes internally — this was the one miss. Fix: flush().await before Ok, with the same remove-on-failure cleanup as the write itself. Regression test is a best-effort tight loop (the window is timing-dependent; the two CI failures are the recorded red) asserting read-after-ack never sees a short file. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 13:44:51 +03:00
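The acknowledge-after-flush rule can be sketched with the standard library alone. This is an analogy under stated assumptions, not the adapter's code: `std::io::BufWriter` stands in for tokio's internally buffered `File`, the function name merely mirrors the one in the commit message, and the synchronous `flush()` plays the role of `flush().await`.

```rust
use std::fs::OpenOptions;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Sketch only: create `path` if absent and return Ok(true) once the bytes
/// have left the userspace buffer. Returns Ok(false) if the file already exists.
fn write_text_if_absent(path: &Path, contents: &str) -> io::Result<bool> {
    // create_new is the "claim": it fails with AlreadyExists for a losing writer.
    let file = match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(file) => file,
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => return Ok(false),
        Err(e) => return Err(e),
    };

    // BufWriter models a buffered writer: write_all alone may only fill the buffer.
    let mut writer = BufWriter::new(file);
    let result = writer
        .write_all(contents.as_bytes())
        // Flush BEFORE acknowledging, so a reader that runs right after Ok(true)
        // cannot observe an empty or short file.
        .and_then(|_| writer.flush());

    if let Err(e) = result {
        // Same cleanup for a failed write and a failed flush: drop the claim.
        drop(writer);
        let _ = std::fs::remove_file(path);
        return Err(e);
    }
    Ok(true)
}
```

A flush hands the bytes to the OS; it is not an fsync, so this sketch says nothing about power-loss durability.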
Andrew Altshuler	867138499e	Merge pull request #200 from ModernRelay/feat/no-legacy-config-strict feat(config): OMNIGRAPH_NO_LEGACY_CONFIG strict mode (RFC-008 stage 4)	2026-06-12 00:15:20 +03:00
aaltshuler	4c50170c77	feat(config): OMNIGRAPH_NO_LEGACY_CONFIG strict mode (RFC-008 stage 4) Opt-in: with the env set, loading a legacy omnigraph.yaml is a hard error pointing at config migrate — the regression guard for migrated teams (a stray legacy file would otherwise silently outrank operator config during the window) and the rehearsal for stage 5's removal. Strict refuses the FILE, never its absence: flag-less invocations on migrated setups are untouched. Inert unless set. The RFC's stages-1-3-then-4 release gap collapsed honestly: no version boundary was crossed between them, so all four ship in the same release (noted in the RFC). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 00:03:10 +03:00
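The "refuses the FILE, never its absence" rule reduces to a small decision table. The sketch below is illustrative only: both function names are hypothetical, the error text is invented, and treating any value of the variable as "set" is an assumption the commit message does not confirm.

```rust
use std::path::Path;

/// Hypothetical helper: strict mode is on when the variable is present at all.
/// (Assumption: the real check may inspect the value.)
fn strict_mode_enabled() -> bool {
    std::env::var_os("OMNIGRAPH_NO_LEGACY_CONFIG").is_some()
}

/// Hypothetical gate run when config loading has (or has not) found a legacy file.
/// `legacy_file` is Some only if a legacy omnigraph.yaml was actually located.
fn check_legacy_config(strict: bool, legacy_file: Option<&Path>) -> Result<(), String> {
    match (strict, legacy_file) {
        // Strict mode + a legacy file on the load path: hard error pointing at migration.
        (true, Some(path)) => Err(format!(
            "legacy config {} refused in strict mode; run `omnigraph config migrate`",
            path.display()
        )),
        // No legacy file (migrated setup) or strict mode off: nothing to refuse.
        _ => Ok(()),
    }
}
```

Under these assumptions a caller would pass `strict_mode_enabled()` as the first argument; keeping the environment read separate from the decision keeps the decision testable without mutating process state.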
Andrew Altshuler	108d2defa6	Merge pull request #199 from ModernRelay/feat/yaml-deprecation-stages feat(cli,config): RFC-008 stages 1–3 — deprecate omnigraph.yaml (warnings, config migrate, scaffold flip)	2026-06-11 23:55:16 +03:00
aaltshuler	5328c91341	refactor(cli): drop cluster init — no replacement scaffold Andrew's call, and the right one by the repo's own lens: a minimal cluster.yaml is five lines; a generator is a second copy of the schema to keep in sync forever, emitting a file that is unusable until hand-edited anyway (graphs: {} cannot apply or serve). Terraform has no config scaffolder either. New users copy from the cluster quick-start; migrants get a ready-to-review cluster.yaml from config migrate. RFC-008 stage 3 becomes purely subtractive. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 23:45:18 +03:00
aaltshuler	3adbc65af2	docs(cli): config migrate, cluster init, the legacy-file deprecation notice Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 23:37:12 +03:00
aaltshuler	5ba9656666	feat(cli): init stops scaffolding omnigraph.yaml; cluster init replaces it (RFC-008 stage 3) omnigraph init no longer writes a legacy config into cwd (the source of the earlier test-pollution bug, and a scaffold for a deprecated file); the scaffolder is deleted. omnigraph cluster init scaffolds the replacement: a minimal valid cluster.yaml (version: 1, optional metadata.name / storage:, a commented graphs example), refusing to overwrite. The scaffold validates clean via cluster validate in the e2e. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 23:34:04 +03:00
aaltshuler	cd1f175396	feat(cli): omnigraph config migrate — the RFC-008 split (stage 2) Reads a legacy omnigraph.yaml and produces the three-section split: team half as a ready-to-review cluster.yaml proposal (graphs with TODO schema pointers — the legacy file never knew schemas — per-graph queries directories, policies with applies_to bindings), personal half as an operator-config merge (actor, output/table defaults — OperatorDefaults gains the two table keys with their cascade hops — remote graphs with bearer_token_env become servers entries plus a printed login step, and legacy aliases split per the RFC: content to the catalog as a manual step, binding to an operator alias), plus a dropped-keys section with reasons. Touches nothing without --write; with it, the operator merge is key-level (existing entries always win; prior file backed up), and cluster.yaml is emitted only when absent (else cluster.yaml.proposed). --json emits the report structurally. The completeness contract is a unit test: every top-level key of the legacy schema must classify somewhere, or the RFC-008 map has a bug. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 23:32:05 +03:00
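The key-level merge with "existing entries always win" can be sketched over a flat string map. This is a simplified illustration: `merge_operator_config` is a hypothetical name, and the real operator config is presumably structured YAML, not a `BTreeMap<String, String>`.

```rust
use std::collections::BTreeMap;

/// Hypothetical sketch: merge migrated keys into an existing operator config.
/// Keys already present are left untouched; the names of skipped keys are
/// returned so a report can list them.
fn merge_operator_config(
    existing: &mut BTreeMap<String, String>,
    migrated: BTreeMap<String, String>,
) -> Vec<String> {
    let mut skipped = Vec::new();
    for (key, value) in migrated {
        if existing.contains_key(&key) {
            // Existing entries always win: never overwrite the operator's value.
            skipped.push(key);
        } else {
            existing.insert(key, value);
        }
    }
    skipped
}
```

Returning the skipped keys is a design choice for this sketch only; it makes the no-overwrite rule observable to the caller.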
aaltshuler	c89d268b23	feat(config): per-key deprecation warnings on legacy omnigraph.yaml load (RFC-008 stage 1) Loading a legacy file (flag, env, or cwd-found — never on defaults) emits one stderr block listing each key actually present with its destination from RFC-008's migration map — the map applied to YOUR file, not a generic banner. Once per process; both binaries warn (cluster-mode boots never reach load_config, silent by construction); suppressible via OMNIGRAPH_SUPPRESS_YAML_DEPRECATION=1 for CI logs during the window. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 23:28:33 +03:00
Andrew Altshuler	588b0c1b6c	Merge pull request #198 from ModernRelay/feat/operator-targeting feat(cli): operator targeting — --server + aliases as pure bindings (RFC-007 PR 3)	2026-06-11 22:54:50 +03:00
aaltshuler	20ddfc61c1	fix(cli): reclaim the hidden legacy-uri positional for operator aliases Caught on the live smoke: with --alias, the first bare CLI arg lands in the hidden legacy_uri positional, so an operator alias's positional param never bound ('parameter not provided' from the server). An operator alias always knows its target, so the existing normalize_legacy_alias_uri reclaims the swallowed positional as the first alias arg — same rule the legacy path already applies. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:29:57 +03:00
aaltshuler	dc91c55970	feat(cli): operator aliases — pure bindings invoking stored queries (RFC-007 PR 3, part 2) aliases: in the operator config bind a personal name to (server, graph, stored-query NAME, positional arg mapping, fixed param defaults, format) — zero content, per the ratified bindings-not-content model. Invocation goes through the server's stored-query endpoint (POST {base}/graphs/{g}/queries/{name}) with the keyed credential resolving via the ordinary URL match; param precedence --params > positionals > fixed defaults; the result renders through the existing format cascade with the alias's format as its hop. A legacy omnigraph.yaml alias with the same name wins during the RFC-008 window, with a warning naming both. E2e (spawned policy-gated server, invoke_query granted via a per-graph bundle): the alias invokes with name + one positional and nothing else — server, graph, query, and token all from the operator layer; --server/ --graph explicit targeting; unknown --server lists defined names; --server exclusive with a positional URI. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:25:42 +03:00
aaltshuler	2b33ab64f2	feat(cli): --server <name> targeting (RFC-007 PR 3, part 1) Global flags --server (operator-defined server name) and --graph (graph id on a multi-graph server, requires --server) resolve to the effective remote URI through one helper and feed the ordinary uri slot — graph resolution and the PR-2 keyed-token URL match work unchanged; the flag is sugar for a URI the operator already owns. Exclusive with a positional URI and --target (loud error, never silent precedence). Unknown names fail listing the servers that ARE defined. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:19:25 +03:00
aaltshuler	65160cc060	docs(rfc): aliases are bindings, not content — the ratified alias model RFC-007 §D2 gains the model the alias design reasoned through: stored queries are content + its canonical team-owned name; legacy omnigraph.yaml aliases conflate a personal name with a local-file content pointer (the muddle RFC-008 retires); operator aliases are pure bindings (server, graph, stored-query NAME, arg mapping, defaults) — an alias that carries content competes with the catalog, one that references a name composes with it. The three senses of 'global' are resolved explicitly: cross-graph globality is strengthened (one $HOME file vs per-directory), team-shared shorthand is deliberately NOT an alias mechanism (the shared name IS the catalog name), cross-machine follows the dotfile. Collision rule: legacy wins during the RFC-008 window, with a warning. RFC-008's migration row for aliases sharpens accordingly: a legacy alias splits — content to the catalog (via cluster apply), binding to the operator layer; config migrate proposes both halves. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 22:15:19 +03:00
Andrew Altshuler	b6ebe6cbe5	Merge pull request #197 from ModernRelay/feat/operator-keyed-credentials feat(cli): keyed credentials — servers:, the token chain, omnigraph login (RFC-007 PR 2)	2026-06-11 21:44:58 +03:00
aaltshuler	a819ab500e	feat(cli): keyed credentials — servers:, the token chain, login/logout (RFC-007 PR 2) The operator config gains servers: (name -> url; never a token). A remote command whose URL prefix-matches an operator server resolves its bearer token through the keyed chain first — OMNIGRAPH_TOKEN_<NAME> env, then the [<name>] section of ~/.omnigraph/credentials (created 0600 via temp+rename, #139 finding 7; group/world-readable files refused loudly) — falling through to the legacy chain unchanged. URL keying makes §D5 rule 3 structural: a token is only ever sent to the server it is keyed to. Longest-prefix matching with a path-boundary check (http://h:8080 never matches http://h:8080-evil). Inserting the keyed hop above the legacy chain is safe by construction — no existing setup can have servers: defined. omnigraph login <name> stores/rotates one section (token from --token or one stdin line — the pipe flow keeps secrets out of shell history); omnigraph logout removes it, idempotently; logging in before declaring the server warns instead of failing (the gh model). Coverage: URL-match/no-substring-trap, credentials round-trip preserving sibling sections, 0600 write + over-permissive refusal, env-name mapping; the legacy resolve test is now hermetic against a real ~/.omnigraph and asserts byte-identical legacy behavior with no servers defined; one spawned-binary e2e walks the whole lifecycle against an authed server: refusal -> wrong-token login (stdin) -> rotate (--token) -> authorized read -> env-beats-file -> non-matching-URL negative -> logout revokes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 21:24:51 +03:00
Andrew Altshuler	5db42fb660	Merge pull request #196 from ModernRelay/feat/operator-config-identity feat(cli): operator config surface — identity + output defaults (RFC-007 PR 1)	2026-06-11 21:01:58 +03:00
Andrew Altshuler	d5d703fccc	Merge pull request #195 from ModernRelay/rfc/operator-config docs(rfc): RFC-007 + RFC-008 — the config architecture pair (operator layer; deprecate omnigraph.yaml)	2026-06-11 21:01:54 +03:00
aaltshuler	9427fb510e	docs(cli): the two config surfaces + the operator file reference cli-reference.md gains the config-surfaces table (cluster / operator / flags-env, with omnigraph.yaml marked as the legacy combined file per RFC-008) and the operator config.yaml reference; audit.md documents the unified actor chain. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 20:32:04 +03:00
aaltshuler	be4bd46212	feat(cli): the operator config surface — identity and output defaults (RFC-007 PR 1) ~/.omnigraph/config.yaml joins the resolution chains as the operator surface: operator.actor becomes the last hop of THE actor chain (--as > legacy cli.actor during the RFC-008 window > operator.actor > none, one implementation for direct-engine and cluster commands alike) and defaults.output joins the read-format cascade below every more-specific source. Discovery honors $OMNIGRAPH_HOME (tilde-expanded, #139 finding 9); an absent file is an empty layer; unknown keys WARN and load (a file written for later slices must not break this CLI); malformed YAML is a loud error. The module is CLI-only — the server never reads operator config (invariant 11 by construction). $OMNIGRAPH_CONFIG becomes a first-class stand-in for --config in load_config (flag > env > ./omnigraph.yaml), one meaning in both binaries. The test harness pins hermeticity: spawned binaries get a nonexistent OMNIGRAPH_HOME by default so no test ever reads the developer's real operator config. New coverage: loader unit tests, the env-precedence matrix on load_config_in, and spawned-binary e2es for the actor chain (operator wins with no flag/legacy key; legacy outranks it; --as wins) and the format cascade. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 20:29:02 +03:00
aaltshuler	08ce8dc34d	docs(rfc): align RFC-007 with RFC-008's two-surface architecture RFC-007 now speaks the end-state language throughout: the operator surface is one half of the two-surface split (cluster config / operator config), not a layer over a living omnigraph.yaml. The precedence cascade drops the project layer (cluster config carries no operator-resolvable keys — a checkout can never supply identity); legacy omnigraph.yaml appears only as the RFC-008 deprecation-window slot. The trust boundary is restated as closed-by-construction in the end state, with the rules governing the window. PR 3 becomes operator targeting (--server + operator aliases — the replacement RFC-008 needs before legacy aliases migrate), and the schema example gains the aliases block. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 19:54:34 +03:00
aaltshuler	320311e759	docs(rfc): RFC-008 — deprecate omnigraph.yaml, one concern per config surface The file is three unrelated concerns wearing one filename — server deployment config, project/CLI conveniences, operator identity — and the mixture is the root cause of a recurring problem class (per-operator copies of project files, checkout-supplied credential redirection, init scaffold pollution). End state: two single-owner surfaces — cluster config (team, repo) and operator config (person, $HOME) — plus the zero-config flags/env tier. Complete key-by-key migration map over the verified OmnigraphConfig surface; staged retirement per the repo's Hyrum rules (warn with per-key guidance -> `config migrate` tool -> stop scaffolding -> opt-in strict -> removal at the next major). RFC-007's project-layer framing is amended to transitional accordingly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 19:33:19 +03:00
aaltshuler	d531f60999	docs(rfc): RFC-007 — per-operator config, the operator slice of RFC-002 Terraform-style operator/project split: ~/.omnigraph/config.yaml for identity (operator.actor in the --as cascade), credentials keyed by server name (env -> 0600 credentials file; no inline secrets), and operator-owned named servers that project configs reference but cannot redefine. Explicitly a staged subset of RFC-002: adopts its settled decisions (one dir, keyed credentials, env precedence), defers GraphLocator/use/state-layer, and encodes the ten confirmed PR #139 findings as design rules (compat shims, key-level merges, atomic writes, the project-layer trust boundary). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 18:29:55 +03:00

516 commits