mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-18 02:24:27 +02:00
305 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
b183db078f
|
Index materialization is derived state: defer off the write path, reconcile via optimize (iss-848) (#246)
* test(engine): reproduce empty-table Vector @index aborting schema apply
A Vector (IVF) index trains k-means centroids over the column, so Lance
cannot build it on 0 vectors ("Creating empty vector indices with
train=False is not yet implemented"). schema apply reconciles a table's
whole index set whenever any @index on it changes, so adding an unrelated
scalar @index materializes the dormant empty vector index and aborts the
entire migration (all-or-nothing).
This regression test inits a 0-row Doc with a Vector @index, adds a scalar
@index, and asserts the apply succeeds (then loads one embedded row and
asserts the deferred index materializes). It fails today at the apply step
with the vector-index abort; the fix lands in the next commit.
Refs dev-graph iss-empty-vector-index-schema-apply, iss-848.
* fix(engine): defer Vector @index on an empty table instead of aborting schema apply
build_indices_on_dataset_for_catalog materialized a declared Vector @index
unconditionally. On a 0-row table Lance cannot train the IVF index
("Creating empty vector indices with train=False is not yet implemented"),
so any later migration that touches the table (e.g. adding an unrelated
scalar @index, which reconciles the table's whole index set) aborted the
entire migration on the dormant vector index — all-or-nothing.
Guard the vector arm with a row-count check, matching the guard
ensure_indices_for_branch and the branch-merge rebuild already use: an
untrainable column becomes a pending index that a later ensure_indices /
optimize materializes once the table has rows. Reads stay correct meanwhile
(vector search degrades to a brute-force scan).
Stop-gap: the residual rows-present-but-vectors-null window and the full
decoupling (intent recorded at apply, an idempotent coverage reconciler)
are dev-graph iss-848. Turns the green half of the regression test added in
the previous commit.
Refs dev-graph iss-empty-vector-index-schema-apply, iss-848, iss-687.
* docs(invariants): record the logical-contract-over-physical-state principle
The bug class behind the empty-table vector-index abort (and the schema-apply
vs optimize version drift) is one shape: a physical operation allowed to fail
a logical one. Several hard invariants (2, 5, 7, 13) and deny-list items are
already instances of this, but the unifying rule was never written down.
Add it to docs/dev/invariants.md as a "Governing principle" section above the
hard invariants, naming which invariants and deny-list items instantiate it
and the smell to watch for (a logical operation gated on a physical fact).
Add a one-line always-on rule (7) in AGENTS.md so it stays in working memory,
with the qualifier that genuine logical conflicts still fail loudly — the
licence to lag covers physical convergence, not correctness.
Audience-neutral: no private ticket refs. check-agents-md.sh passes.
* test(engine): index build must tolerate rows with null vectors (load-before-embed)
Loading rows whose vector column is null into a `Vector @index` table fails
today: build_indices (reached via the loader's prepare_updates_for_commit)
calls create_vector_index, and Lance's IVF KMeans errors "cannot train 1
centroids with 0 vectors". The same abort hits ensure_indices/optimize/schema
apply/merge, since they all funnel through build_indices_on_dataset_for_catalog.
This test loads two null-embedding rows and calls ensure_indices; it must not
abort (the untrainable vector column is deferred, sibling indexes still build).
Fails today at the load step; fixed in the next commit.
Refs dev-graph iss-848, iss-empty-vector-index-schema-apply.
* fix(engine): defer unbuildable index columns instead of aborting the write path
build_indices_on_dataset_for_catalog is the chokepoint every write path funnels
through (load/mutate via prepare_updates_for_commit, schema apply, ensure_indices,
optimize, branch merge). Its vector arm called create_vector_index
unconditionally, so a column with no trainable vectors yet — an empty table, or
rows loaded before `embed` populates them — aborted the whole operation with
Lance's IVF KMeans error.
Fault-isolate the vector build: on failure, record the column as a PendingIndex
(table, column, reason), log it, and continue building the sibling indexes; a
later ensure_indices/optimize materializes it once the column is trainable, and
reads use brute-force meanwhile. Manifest/CAS/IO errors at the publish boundary
still propagate. Isolating at the single chokepoint realizes the governing
principle (physical index state never fails a logical operation) for every write
path, and supersedes the earlier symptomatic count_rows==0 stop-gap (removed) —
closing the residual rows-present-but-vectors-null window it left open.
Surfacing pending index status rather than failing is the database norm
(Postgres indisvalid, LanceDB list_indices). ensure_indices and the build_indices
wrappers now return Vec<PendingIndex>; optimize surfaces it in a later commit.
Refs dev-graph iss-848, iss-951 (vector index stays inline-commit until lance#6666).
* test(engine): index-only schema apply must not touch table data
Adding an @index to an existing column should be a pure metadata change once
index materialization moves to the reconciler (iss-848): the apply records the
intent in the catalog/IR but builds nothing inline, so the table's manifest
version is unchanged. Today the indexed_tables block builds the index inline
and bumps the version (4 -> 5). Fixed in the next commit.
Refs dev-graph iss-848.
* fix(engine): schema apply records index intent only; index-only apply is metadata
Schema apply no longer builds indexes inline. The four build_indices calls
(added/renamed/rewritten/index-only tables) are removed; the @index/@key intent
is already persisted in the catalog/IR the apply writes, and the physical index
is materialized off the critical path by ensure_indices/optimize (iss-848).
Concretely:
- AddConstraint (an @index addition — every other added constraint plans as
UnsupportedChange) becomes a pure metadata step alongside the metadata-only
steps: it touches no table data, so the table version is unchanged.
- added/renamed/rewritten tables still write their data; only the trailing
index build is gone. The rewritten table's coverage is restored later by
optimize_indices.
- recovery_pins drops index-only tables (they no longer advance Lance HEAD) and
keeps rewritten tables; their post_commit_pin = expected+1 is now exact (one
rewrite commit), strengthening recovery classification.
- the now-orphaned Omnigraph::build_indices_on_dataset_for_catalog wrapper is
removed.
A migration can no longer abort on an index build, for any index type at any
cardinality. Turns the green half of index_only_constraint_apply_touches_no_table_data.
Refs dev-graph iss-848.
* test(engine): optimize must converge a declared-but-unbuilt index
After iss-848, adding an @index post-data is a metadata-only apply that defers
the physical build, so the column is declared-indexed but unbuilt (reads scan).
`optimize` — the operator's cron reconciler — must materialize it. Today optimize
only maintains coverage of EXISTING indexes (optimize_indices) and never creates
missing ones, so the rank BTREE stays Degraded after optimize. Fixed next commit.
Refs dev-graph iss-848.
* fix(engine): optimize materializes declared-but-unbuilt indexes (the reconciler)
`omnigraph optimize` is the operator's cron reconciler. It already compacts and
folds new fragments into EXISTING indexes (optimize_indices); now it also builds
declared-but-missing indexes, so the indexes schema apply / load defer (iss-848)
converge on the next optimize.
Done inside optimize_one_table (not by composing the all-tables ensure_indices,
which is drift-blind and would re-publish the uncovered HEAD>manifest drift that
optimize deliberately skips): after the per-table drift/blob skips and under the
queue + Optimize sidecar already held, a needs_index_create gate (reusing
needs_index_work_node/edge — "declared index missing AND row_count > 0", so empty
tables stay no-ops) admits index-only work, and Phase B builds the missing index
over the just-compacted layout via the build chokepoint. An untrainable vector
column fault-isolates into the new TableOptimizeStats.pending_indexes (the
list_indices/indisvalid analog operators read), not a failure. committed now
reflects index commits, so the existing post-publish cache invalidation covers
them. LanceDB's optimize only maintains existing indexes; creating
declared-but-missing ones is the L2 behavior omnigraph's declarative @index needs.
Turns the green half of optimize_materializes_index_declared_but_unbuilt.
Refs dev-graph iss-848.
* docs: index materialization is deferred to the reconciler (iss-848)
Update the index-lifecycle docs to reflect the new contract: @index/@key
declares intent and the physical index is derived state that never fails a
logical operation. Schema apply builds nothing (records intent only);
load/mutate build inline through one chokepoint that defers an untrainable
Vector column as pending; optimize/ensure_indices is the reconciler that
creates declared-but-missing indexes and maintains coverage, reporting
still-pending columns.
Touches: dev/invariants.md (truth-matrix Index-lifecycle row), AGENTS.md
(capability matrix), user/search/indexes.md (L2 orchestration), user/operations/
maintenance.md (optimize reconciler bullet), dev/testing.md (new tests).
* test(server): schema_apply_route_can_add_index reflects deferred index build
iss-848 made schema apply record @index intent without building the physical
index inline. The route test asserted the index count increased after apply;
on an empty graph it now stays unchanged (the build is deferred to
ensure_indices/optimize). Assert the new contract: apply succeeds and the
physical index count is unchanged.
* fix(engine): precheck vector trainability — don't pin or swallow (PR review)
Two issues Cursor Bugbot caught in the chokepoint fault-isolation:
1. (HIGH) Pending vector pins roll back siblings. needs_index_work_node counted
a missing vector index as work whenever the table had rows, so a column with
no trainable vectors got pinned in the EnsureIndices recovery sidecar — but
the build deferred it (zero commit). On a crash before manifest publish the
classifier sees NoMovement and the all-or-nothing decision (recovery.rs
decide()) rolls back the WHOLE sidecar, undoing a sibling table's committed
index work.
2. (MED) Vector build swallowed fatal errors. The match arm converted every
create_vector_index error into a deferred PendingIndex, hiding genuine
I/O/manifest/Lance failures as "pending".
Fix both with one trainability precheck (vector_column_trainable: >=1 non-null
vector, the ivf_flat(1) minimum) used identically by needs_index_work_node and
the build arm: an untrainable column is never counted as work (so never pinned —
no zero-commit pin) and never attempted (so it can't fail); only a trainable
column is built, and then any error PROPAGATES (stays fatal). The deferred
column is still recorded as a PendingIndex with a clear reason.
Refs dev-graph iss-848.
* feat(cli): surface pending index column + reason in optimize output (PR review)
Codex (P2): pending_indexes was documented as visible in `optimize --json` but
the CLI projection never emitted it — operators would lose the only signal that
optimize has deferred index work. Greptile (P2): the stat dropped the reason, so
operators saw which column was stuck, not why.
Carry the reason: TableOptimizeStats.pending_indexes is now Vec<PendingIndex>
(column + reason), and `omnigraph optimize --json` emits {column, reason} per
pending index; human output prints a "↳ index pending on '<col>': <reason>" line.
Refs dev-graph iss-848.
* test: align CLI index-add test with deferred build; cover post-rename reconcile
- schema_apply_json_adds_index_for_existing_property (cli_schema_config.rs): the
CLI analog of the server test — asserted the index count grew after apply;
under iss-848 the apply defers the build, so the count is unchanged on an
empty graph. Assert the deferred contract. (The only full-suite failure.)
- optimize_materializes_index_after_type_rename (maintenance.rs, new): covers
the gap Greptile flagged — a RenameType writes the renamed table with rows but
no indexes (inline build removed in Commit B); assert the rank index is
Degraded post-rename and Indexed after optimize reconciles it.
Refs dev-graph iss-848.
* test(engine): in-source apply tests reflect deferred index materialization
The two db::omnigraph in-source unit tests asserted the old "schema apply builds
/ preserves indexes inline" behavior (the only remaining full-suite failures):
- test_apply_schema_defers_index_then_reconciler_builds_it (was
test_apply_schema_adds_index_for_existing_property): apply records the @index
intent but builds nothing; assert the BTREE on `age` is absent after apply and
present after ensure_indices. (Uses `age`, unindexed in TEST_SCHEMA — `name
@key` is already FTS-indexed at seed.)
- test_apply_schema_rewrite_defers_index_then_reconciler_restores (was
test_apply_schema_rewrite_preserves_existing_indices): an AddProperty rewrite
no longer rebuilds indexes inline; assert ensure_indices restores id BTREE +
name FTS after the rewrite.
Verified by grep that these + the server/CLI tests are the complete set of
"apply builds an index" assertions; all other index-presence tests run after
load/ensure_indices/primitives, which still build.
Refs dev-graph iss-848.
* fix(engine): optimize always reports pending indexes, not only on create-work (PR review)
Cursor Bugbot (MED): pending_indexes was filled only when needs_index_create was
true, but the vector trainability precheck makes needs_index_work_node exclude an
untrainable Vector column. So a table whose sole missing index is untrainable, but
which optimize still compacts or reindexes, returned an empty pending_indexes —
contradicting the documented operator contract for deferred columns.
Run the (idempotent) build chokepoint unconditionally once past the no-op gate,
rather than gating it on needs_index_create. It skips existing indexes, builds
any buildable missing one, and reports an untrainable column as pending whether
the table entered for compaction, reindex, or index creation. needs_index_create
still gates the no-op decision (so an index-only table still enters the path).
Refs dev-graph iss-848.
* test(engine): reframe staged-BTREE-failure failpoint onto the reconciler path
ensure_indices_stage_btree_failure_leaves_existing_tables_writable fired
`ensure_indices.post_stage_pre_commit_btree` and expected `apply_schema` (adding
a type) to fail mid-BTREE-build. iss-848 removed apply's inline index build, so
that apply now succeeds and the test's unwrap_err panicked — it exercised a
removed code path.
Reframe onto where BTREE builds happen now: seed Person, add an `@index` on
`age` (apply records intent, defers the build), then `ensure_indices` builds the
deferred BTREE and the failpoint fires between stage and commit. Person's HEAD
is unchanged (no drift) and its EnsureIndices sidecar pins NoMovement; a write to
a different, unpinned table (Company) is unaffected (mutations/loads heal
roll-forward and proceed, unlike optimize/repair which refuse on a pending
sidecar). Preserves the original coverage (staged-index stage failure leaves
other tables writable, no drift) in the new architecture.
Refs dev-graph iss-848.
* feat(server): converge deferred indexes promptly after schema apply (iss-848)
Schema apply records @index intent but defers the physical build. On a
long-lived server, spawn a detached best-effort ensure_indices after a
successful apply so the indexes converge promptly instead of waiting for the
operator's next optimize. Fire-and-forget: it never blocks or fails the apply
response, and a failure is logged (the index still converges on the next
optimize). Guarded on result.applied. The CLI is one-shot, so it has no
equivalent; its convergence path is the optimize cadence.
handle.engine is already an Arc, so the spawn takes an owned clone. Convergence
itself is covered by the engine ensure_indices/optimize tests; the existing
empty-graph schema-apply route tests confirm the response is unaffected (the
spawn is a read-only no-op on an empty table).
Refs dev-graph iss-848.
* docs(maintenance): list pending_indexes in optimize per-table stats (consistency)
|
||
|
|
625ae7c208
|
feat(cli): defaults.store — a zero-flag local default scope (RFC-011) (#249)
Operator config gains defaults.store (a file:///s3:// graph storage URI), the local-dev counterpart of defaults.server + default_graph. Mutually exclusive with defaults.server, and a store cannot carry default_graph (both refused at load). The zero-flag local default that survives the upcoming removal of omnigraph.yaml's cli.graph. Additive, non-breaking. |
||
|
|
9ef5f90991
|
feat(cli)!: query/mutate invoke stored queries by name; server kind-assert (RFC-011 D3) (#247)
omnigraph query <name> / mutate <name> invoke a stored query by name from the served catalog (served-only). The verb asserts kind via a new expect_mutation on POST /queries/{name} (400 on mismatch). -e/--query + --store is the ad-hoc lane; the positional selects within the source (replacing --name). The bare positional graph URI, --uri, and --name are removed from query/mutate.
|
||
|
|
1bc0ea6b51
|
feat(cli): no-default-graph errors list candidate graphs (RFC-011 D7) (#245)
When a server/cluster scope resolves with no --graph and no default_graph, the CLI auto-uses a sole graph (cluster) or errors listing the candidate graph ids (cluster catalog; multi-graph server via best-effort GET /graphs), never a silent pick. GraphClient::resolve becomes async; flat/single-graph servers and happy paths are unaffected. |
||
|
|
b395757e21
|
feat(cli): alias subcommand; remove --alias flag (RFC-011 D4) (#244)
Operator aliases move from the --alias flag on query/mutate to a dedicated 'omnigraph alias <name> [args]' subcommand, so an alias can never shadow or be shadowed by a built-in verb. Unknown name errors listing defined aliases. Removes the legacy alias machinery from query/mutate (net -156 lines); legacy omnigraph.yaml aliases lose their CLI entry point. |
||
|
|
2ed05d2cb1
|
feat(cli): --quiet/--yes globals; echo resolved write target; gate non-local destructive writes (#243)
RFC-011 Decision 9. Writes echo their resolved target + access path to stderr (suppress with --quiet). Destructive writes (cleanup, overwrite load, branch delete) against a non-local scope require consent: --yes, a TTY prompt, or a hard refusal for non-TTY/--json runs. Local file:// writes unaffected. |
||
|
|
a09045028f
|
feat(cli)!: unify graph selection under --graph; --cluster is a global scope; remove --cluster-graph (#241)
RFC-011: --graph is the single graph selector across server and cluster scopes; --cluster becomes a global scope primitive; --cluster-graph removed. Maintenance dispatch unified through resolve_scope. Wrong-address guard validates each scope flag against the verb it can consume. |
||
|
|
c3d7639377
|
test(engine): pin Lance 7 immutable-PK behavior + sharpen native-namespace alignment notes (#240)
* test(engine): pin Lance 7 immutable-PK behavior + sharpen native-namespace alignment notes Follow-up polish to the Lance 7.0.0 alignment (the immutable-PK migration fix and the realigned native-namespace surface test). Two precision nits, no behavior change: 1. Pin the upstream behavior we now depend on. Lance 7 makes the unenforced PK immutable once set (`lance::dataset::transaction`): re-applying the reserved `lance-schema:unenforced-primary-key` key — even with the same value — errors "cannot be changed once set". That is exactly what broke `migrate_v1_to_v2`'s crash-idempotency and forced its field-guard. Add `lance_surface_guards.rs::unenforced_primary_key_is_immutable_once_set` so a future Lance bump that relaxes immutability turns red, prompting re-evaluation of the migration guard. (Matches the "first smoke check on a Lance bump" discipline in docs/dev/lance.md.) 2. Clarify that the native `DirectoryNamespace` decoupling is contingent on omnigraph's legacy boolean PK key, not an unconditional v7 property: with the position key the native namespace would still read the manifest. omnigraph keeps the boolean key deliberately — Lance honors it permanently (maps to PK position 0) and one uniform on-disk format beats a new-vs-old split, since existing graphs can't be re-keyed under the same immutability rule. Updated the test comment and the lance.md stanza; also corrected the stale `is_empty()` description of the migration guard (it now matches on the specific PK field). * test(engine): make the immutable-PK guard's red-bar diagnostic fire in every change-shape Review follow-up: the guard's re-set assertion chained `.unwrap().await.unwrap_err()`, which only surfaces the actionable "Lance no longer rejects re-setting the unenforced PK" message when immutability is enforced on the async commit path and still returns an error. Two other change-shapes would panic generically instead, defeating the guard's purpose: - if Lance moves the check to the sync validation stage, the first `.unwrap()` panics with a bare "unwrap() on Err"; - if Lance relaxes immutability so the re-set succeeds, `.unwrap_err()` panics with a bare "unwrap_err() on Ok". Normalize the sync `.update()` result and the async `.await` into one `Result` and assert on it, so the diagnostic fires whichever stage enforces (or relaxes) the rule. |
||
|
|
9a607abdec
|
fix(engine): v1→v2 migration verifies the unenforced PK is object_id, not just non-empty (#239)
Greptile follow-up (#236): `migrate_v1_to_v2` guarded the field-set with `unenforced_primary_key().is_empty()`, which skips the set whenever *any* field is the PK — including the (corrupt/unexpected) case where a field other than `object_id` carries it. That would silently leave merge-insert row-level CAS keyed on the wrong column, and Lance 7 forbids changing the PK afterward. Match on the specific PK field instead: `["object_id"]` is the idempotent crash-recovery no-op, `[]` sets it (the genuine pre-v0.4.0 first migration), and any other PK refuses loudly. Defensive — Lance won't let a fresh graph reach the error branch — but correct by construction. The idempotent re-entry path stays covered by test_publish_migrates_pre_stamp_manifest_to_current_version (28 manifest tests green). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> |
||
|
|
bc2a989a7b
|
feat(cli)!: remove legacy data-plane addressing (--target, positional http→remote, --as-on-served) (#238)
* feat(cli): --server accepts a literal URL (RFC-011 Decision 2) `resolve_server_flag` now treats a `--server` value containing `://` as a literal base URL (trailing slash trimmed; `--graph` appends `/graphs/<id>`), bypassing the operator-config `servers:` registry; a bare name still resolves through the registry. This is the replacement the upcoming `--uri http(s)://` deprecation points at, and a small ergonomic win on its own (`--server https://host` with no config entry). Token resolution for a literal-URL server falls to the legacy OMNIGRAPH_BEARER_TOKEN chain, same as a positional URL today. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(cli): address the parity-matrix arms with global --store/--server flags Prep for removing the positional-http→remote dispatch. The parity harness addressed both arms with a positional graph right after the verb (`omnigraph <verb> <addr> <args…>`), which only parses for top-level verbs — for nested subcommands (`schema show`, `branch list`, …) the address landed in the subcommand slot and BOTH arms failed identically, so the test passed vacuously (matching exit codes, never comparing output). Address both arms with the global flags instead — local `--store <graph>` (embedded), remote `--server <url>` (served) — appended after the verb + args, valid regardless of nesting. The previously-vacuous nested-verb parity checks now actually compare embedded vs remote (and pass — parity holds), and the remote arm no longer relies on the positional-URL dispatch that's about to be removed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli)!: --as on a served write is a hard error (was a silent no-op) A served write resolves the actor server-side from the bearer token, so `--as` could never set identity there — it was silently ignored. It now errors (in the remote write factory, before any HTTP call), pointing the user at removing `--as` or writing directly with `--store`. Reads don't carry `--as`, so this is write-path only. BREAKING for any script that passed `--as` to a remote write (it was a no-op, so behavior is unchanged except the now-explicit error). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli)!: a positional/--uri http(s):// URL no longer dispatches to a server Remote graphs must be addressed with `--server <url>` (or a named server / a profile binding one). A positional or `--uri` `http(s)://` URL on a data verb now errors instead of silently routing to the remote HTTP client — the scheme no longer carries transport semantics. The discriminator is `via_server`: a remote URL produced by a server scope is fine; a remote URL from a positional/`--uri` source is rejected (`reject_positional_remote` in both GraphClient factories). Storage verbs are unaffected — they already reject remote URIs through `resolve_local_graph` with the existing "direct (storage-native)" error. Migrated the gh-host keyed-credential system test to `--server <url>` (the literal URL still prefix-matches the operator server for token resolution). BREAKING: scripts addressing a server by a bare URL must switch to `--server <url>`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli)!: remove the --target flag (use --store / --profile / --server) Removes the legacy named-graph flag and threads its parameter out of the whole resolver chain. `--target` resolved a graph name through `omnigraph.yaml`'s `graphs:` map; its replacements (`--store <uri>`, `--profile <name>`, `--server <name>`) all ship. - Drops the 22 `target` clap fields + the `--cluster` exclusion that named it. - Threads `target`/`cli_target` out of `resolve_uri`/`resolve_cli_graph`/ `resolve_local_graph`/`resolve_local_uri`/`resolve_storage_uri`/ `resolve_remote_bearer_token`/`apply_server_flag`/`execute_query_lint`/ `resolve_selected_graph`/`resolve_registry_selection_for_list`/ `execute_queries_{validate,list}`, the two `GraphClient` factories, and `ScopeFlags`/`ResolvedScope`. - Keeps the shared `OmnigraphConfig::resolve_target_uri` 3-arg (server boot uses it); the CLI passes None for the explicit-target arm. The `cli.graph` default (omnigraph.yaml bare-command fallback) is unchanged — its removal belongs to the omnigraph.yaml excision. - Operator/file aliases that bind a `graph` name still work: the name is now resolved to a URI inline (a positional URI wins). - Error messages and `--graph`/`--server`/`--store` help text no longer name `--target`; the queries-list selection hint points at `cli.graph`. BREAKING. Tests updated (named-target resolution rewritten onto `cli.graph`; positional-URI tests unchanged). Full omnigraph-cli suite green (228). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(cli): drop --target and positional-http addressing; --as-on-served is an error Update the user docs for the legacy data-plane addressing removals: - the CLI `--target` flag is gone — address graphs with a positional URI, `--store`, `--profile`, or `--server <name|url>`; - a positional `http(s)://` URI no longer dispatches to a server (use `--server`); - `--as` on a served write is now rejected (was a silent no-op). Touches cli/reference.md (addressing intro, capability table, error examples, scopes), cli/index.md (the remote-read example → --server), operations/maintenance + policy, and the cluster docs' data-plane load guidance. The server's own `--target` boot flag is unchanged (server.md untouched). Also fixes a pre-existing broken maintenance link in search/indexes.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(cli): --store is loudly exclusive with a positional URI / --server; test graphs→Served Address two Greptile findings on the RFC-011 slices: - Slice A (P1): `--store` combined with a positional URI silently dropped the URI (`scope.rs` did `store.or(uri)`); `--store` + `--server` errored with a misleading "positional URI" message. Now both combinations fail loudly with a declared `--store is exclusive with a positional URI and --server` error. - Slice B (P2): the `command_capability` unit test never exercised the one Data→Served refinement (`graphs`); added the assertion so deleting that guard can't pass silently. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> |
||
|
|
7eeced3e88
|
feat(cli): RFC-011 Slice B — capability vocabulary (any/served/direct/control/local) (#237)
* feat(cli): RFC-011 Slice B — capability vocabulary (any/served/direct/control/local)
User-facing CLI errors and --help now speak a single "capability" vocabulary —
what a command needs — instead of the internal four-plane jargon. Behavior is
unchanged: the --server/--graph allow set is identical (the served-graph
capabilities `any` ∪ `served` = the old `Data` plane, since `graphs` was already
allowed). Only error text and the --help legend change.
- planes.rs: add `Capability { Any, Served, Direct, Control, Local }` derived from
the existing exhaustive `command_plane` classifier (which stays as the drift
guard) plus the one Data→Served refinement (`graphs`). `guard_addressing` now
allows `--server`/`--graph` on `{Any, Served}` and rejects elsewhere with a
capability-worded message. The mapping reflects *current* behavior (`queries
list` → Local, `queries validate` → Direct); it converges to the RFC end-state
table when later slices re-route those verbs.
- scope.rs: `resolve_scope` takes `Capability` instead of `Plane`, so the whole
addressing path speaks one vocabulary; call sites in client.rs (Any) and the 3
maintenance verbs in main.rs (Direct) updated.
- helpers.rs: the storage-direct remote rejection reworded to "direct
(storage-native) command".
- cli.rs: the --help legend is now "COMMANDS BY CAPABILITY".
- Tests: the 5 assertions pinning the old plane text updated; added planes.rs unit
tests proving the allow set is exactly {Any, Served} (behavior-preservation),
the per-verb mapping, and distinct capability phrases.
Full omnigraph-cli suite: 225 green (222 + 3 new), zero behavior-test changes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* docs(cli): capability vocabulary in the CLI reference + maintenance addressing
Rename the reference's "Command planes" section to "Command capabilities"
(any/served/direct/control/local), reword the error examples, and update the
maintenance doc's addressing note + its section cross-link to match Slice B.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
|
||
|
|
a4d08a4184
|
feat(cli): RFC-011 Slice A — additive scope/profile addressing (#235)
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
* feat(cli): RFC-011 Slice A — operator-config scope structs (profiles/clusters/defaults)
Additive operator-config surface for the RFC-011 scope model. No behavior
change yet — these structs are parsed but not consumed until the scope
resolver lands.
- OperatorConfig gains `profiles:` (name → OperatorProfile) and `clusters:`
(name → OperatorCluster { root }) — the latter the only place a storage
root appears in operator config (RFC-011 storage-root rule).
- OperatorDefaults gains `server` and `default_graph` (the flat-default scope).
- OperatorProfile binds one of {server, cluster, store} + default_graph;
`binding()` validates exactly-one on use and returns a ScopeBinding.
- Accessors profile()/cluster_root()/default_server()/default_graph();
unknown-key warnings extended to the new blocks (forward-compat preserved —
old configs still load, new keys are no longer "unknown").
Tests: parse profiles/clusters/scope-defaults, binding rejects zero/multiple
entities, unknown keys in a profile warn.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(cli): RFC-011 Slice A — scope resolver + --profile/--store, wired (additive)
Translate the new scope inputs into the existing addressing tuple, in front of
the unchanged resolvers. Purely additive: an explicit address
(--uri/--target/--server/--store) passes straight through, so every existing
invocation is byte-for-byte unchanged.
- scope.rs: resolve_scope() with the RFC-011 precedence (explicit > --profile /
OMNIGRAPH_PROFILE > flat defaults.server), producing the effective
(server, graph, uri, target) for data verbs and (cluster, cluster_graph) for
maintenance. Plane×scope capability check (server scope rejected on a
maintenance verb; cluster scope rejected on a data verb; store rejects --graph)
fires only on the new paths. 9 unit tests.
- cli.rs: global --profile <NAME> and --store <URI>. (--graph keeps
requires=server for now; profile/default graph comes from default_graph —
profile+--graph override is deferred to the --cluster-graph rework.)
- client.rs: the two GraphClient factories call resolve_scope (Plane::Data) up
front; the explicit branch reproduces today's behavior exactly.
- main.rs: the 15 data call sites forward --profile/--store; the 3 maintenance
verbs consult the scope (Plane::Storage) only when no explicit per-command
address is given, so cluster-binding profiles and --store reach
optimize/repair/cleanup.
Verified: the full omnigraph-cli suite (221 tests) stays green untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test+docs(cli): RFC-011 Slice A — end-to-end scope test + reference docs
- cli_data.rs: prove --store and a --profile store binding drive a read
identically to the legacy positional URI (the additive-coexistence contract),
end to end against a local graph (no server needed).
- cli/reference.md: document profiles/clusters/defaults.server/default_graph,
the --profile/--store flags, and a "Scopes & profiles" section; note the model
coexists with legacy addressing (nothing removed yet).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
|
||
|
|
ceb37dd4cb
|
fix(engine): close the 2 Lance 7.0.0 alignment failures (immutable PK + native namespace) (#236)
* fix(engine): make the v1→v2 manifest migration idempotent under Lance 7's immutable unenforced primary key Lance 7 (dataset/transaction.rs) makes the unenforced primary key immutable once set: any write touching the reserved `lance-schema:unenforced-primary-key` field metadata after the PK is set errors "cannot be changed once set" — even re-applying the same value. `migrate_v1_to_v2` previously relied on the old Lance 6 idempotency (re-applying the annotation was a no-op-ish bump), which it needs for crash-recovery: a v1 graph that crashes after the field-set but before the stamp bump re-enters the migration with the PK already present. Under Lance 7 that re-entry now errors, so a real pre-v0.4.0 graph crashing in that window could never complete its migration. Guard the field-set with `schema().unenforced_primary_key().is_empty()` so a genuine first-set still runs but a re-set is skipped — restoring crash-idempotency by construction. (Fresh graphs bake the PK into manifest_schema() at init and never run this migration.) The existing test_publish_migrates_pre_stamp_manifest_to_current_version is the regression guard: red under Lance 7 before this change, green after. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(engine): realign the native-namespace surface guard to Lance 7 (TableNotFound) `test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path` pokes Lance's NATIVE DirectoryNamespace (not omnigraph's production write path, which is the manifest merge_insert publisher) to document that it cannot replace omnigraph's authority. Lance 7's DirectoryNamespace routes list/describe/create_table_version through `check_table_status`, which now reports an omnigraph-manifest-tracked table as absent — so all three return TableNotFound for `node:Person` (observed). The native namespace is now fully decoupled from omnigraph's manifest: it cannot enumerate, inspect, or publish over omnigraph's tables. This strengthens the guard's thesis. Realigned the assertions to the v7 behavior and kept the authority check (omnigraph's refresh ignores the direct append; row_count stays 0). Test-only; no production impact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(lance): document the 2 runtime behavior changes in the 7.0.0 alignment stanza The #229 stanza verified a clean engine *build* but not the test suite, and claimed "no Lance API surface omnigraph uses changed." Two runtime behaviors did, caught only by the full test suite: - the unenforced primary key is immutable once set in v7 (transaction.rs) — broke the v1→v2 manifest migration's crash-idempotency; fixed by an is-set guard; - the native DirectoryNamespace returns TableNotFound for omnigraph manifest-tracked tables (dir.rs) — test-only; the surface guard was realigned. Corrects the over-broad "no surface changed" claim, adds both findings, and notes the lesson: a clean build is not a clean alignment — run cargo test --workspace before declaring a Lance bump done. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> |
||
|
|
67baf615d9
|
build(deps): bump Lance 6.0.1 → 7.0.0 (correct-by-design substrate alignment) (#229)
* build(deps): bump Lance 6.0.1 → 7.0.0 (object_store 0.13.2, roaring 0.11.4) Arrow stays 58 and DataFusion stays 53 (no change). The only transitive bump is object_store 0.12.5 → 0.13.2. 141 upstream commits reviewed; no fixes lost (the 6.0.x release-branch backports are all forward-ported into 7.0.0). - object_store 0.13 moved get/put/head/rename/delete behind a new ObjectStoreExt trait (list/list_with_delimiter/put_opts stay on the core trait). Add `use object_store::ObjectStoreExt` in storage.rs and db/manifest/namespace.rs; no call-site changes. Mirrors Lance's own migration in PR #6672. - roaring pinned to 0.11.4 (cargo update -p roaring --precise 0.11.4). Lance 7.0.0's UpdatedFragmentOffsets newtype (lance#6650) derives Eq over HashMap<u64, RoaringBitmap>, which needs RoaringBitmap: Eq, added in roaring 0.11.4; the loose `roaring = "0.11"` constraint otherwise resolves 0.11.3 and lance itself fails to compile. - lance#6774: merge-insert INSERT rows now stamp _row_created_at_version with the commit version (was a fallback of 1). Flip the lance_version_columns assertion to `== v2` and correct the changes/mod.rs rationale comment. Production change-detection keys on _row_last_updated_at_version + ID membership, so its logic is unaffected. Refs lance#6650, lance#6774, lance#6672. * fix(storage): pin WriteParams::auto_cleanup = None (lance#6755 default flip) lance#6755 flipped the WriteParams::auto_cleanup default from on (a full cleanup pass every 20th commit) to None. On 6.0.1 the on-by-default hook could silently GC versions that __manifest pins for snapshots/time-travel. OmniGraph owns cleanup explicitly (optimize.rs::cleanup_all_tables) and never set auto_cleanup, so it was relying on a default that is both wrong for our snapshot model and now changed upstream. Pin auto_cleanup: None explicitly at all 11 production WriteParams sites (table_store ×6, commit_graph ×2, recovery_audit ×1, manifest/graph ×2 — the __manifest + sub-table Create paths). Removes the dependency on a default-flag value and locks in the snapshot-safe behavior regardless of future upstream re-flips. Refs lance#6755. * test(lance): pin BTREE range-boundary correctness (lance#6796) lance#6796 (issue #6792) fixed a BTREE scalar-index range-query bound inclusiveness bug: `x <= hi AND x > lo` returned the wrong boundary row. Add lance_surface_guards.rs::btree_range_query_boundary_is_correct, which reproduces the exact #6792 shape (5 rows + an explicit BTREE drives the index path even on tiny data) and pins the corrected inclusive-<= / exclusive-> semantics. It turns red if a future Lance regression reintroduces the bug. OmniGraph today builds BTREE only on string @key columns and queries them by equality/IN, so its current patterns do not hit this; the guard protects any future BTREE-range path (BTREE-on-properties, range-on-key). Refs lance#6796. * docs(dev): align Lance docs + invariants to 7.0.0 - docs/dev/lance.md: new 2026-06-14 alignment stanza for the 6.0.1 → 7.0.0 bump (object_store ObjectStoreExt move, roaring 0.11.4, #6774/#6796/#6755 behavior, #6658 shipped → MR-A unblocked but separate, #6666 + blob compaction still open); prior 6.0.1 stanza demoted to historical. - AGENTS.md: storage substrate 6.x → 7.x (line + architecture diagram). - docs/dev/invariants.md: deletes/vector known gap updated — the staged two-phase delete API (lance#6658) now exists and MR-A is unblocked, but delete_where stays inline and D2 stays in place until the migration lands; create_vector_index still gated on lance#6666. * fix(storage): skip Lance auto-cleanup on commit paths for legacy datasets Addresses PR #229 review (Codex P1). `WriteParams::auto_cleanup` is create-time config with no effect on existing datasets (Lance write.rs docs), so the previous `auto_cleanup: None` change alone did NOT protect graphs created before the v7 bump: 6.0.1 defaulted auto_cleanup ON, leaving `lance.auto_cleanup.*` config on those datasets, and Lance's per-commit hook (io/commit.rs: `if !commit_config.skip_auto_cleanup`) fires off that stored config — so omnigraph's own writes would GC versions the __manifest pins for snapshots/time-travel. Skip the hook on every commit path, covering new and legacy datasets alike: - commit_staged: CommitBuilder::with_skip_auto_cleanup(true) — the staged data path. - __manifest publisher: MergeInsertBuilder::skip_auto_cleanup(true). - all 11 WriteParams: skip_auto_cleanup: true (direct Dataset::write/append paths; auto_cleanup: None retained so new datasets store no cleanup config at all). Tests: - lance_surface_guards::skip_auto_cleanup_suppresses_version_gc — substrate: negative control (config GCs v1 without skip) + with-skip survival. - staged_writes::commit_staged_skips_auto_cleanup_so_pinned_versions_survive — omnigraph usage: commit_staged on a legacy-config dataset preserves the pinned create version. Refs lance#6755. * test(lance): assert created_at-preserved + updated_at-bumped on merge_insert UPDATE Addresses PR #229 review follow-up. `lance_merge_insert_update_preserves_created_at_version` documented (in a comment) that a merge_insert UPDATE preserves created_at and bumps updated_at, but only asserted the value change — leaving the change-feed invariant unguarded. Add the two missing assertions: - bob created_at == v1 (preserved across UPDATE; what the test name promises; lance#6774 only changed INSERT-row stamping). - bob updated_at == v2 (bumped to the commit version) — the invariant OmniGraph's insert/update classification relies on (changes/mod.rs keys on _row_last_updated_at_version). A regression here would silently drop updates from the diff/change feed. |
||
|
|
7963499995
|
fix(cli): unify remote URL builder, fix branch delete //branches 404 (#230)
* test(cli): reproduce branch-delete //branches 404 (failing) Regression test for the `branch delete` 404 over a multi-graph `--server`/`--graph` target: the composed URL must be `<base>/branches/<name>` with no empty `//` segment. Fails against the current `remote_branch_url`, which appends a trailing slash before extending path segments and so emits `…/graphs/p9-os//branches/tmpbranch`. The next commit fixes it. left: "http://host/graphs/p9-os//branches/tmpbranch" right: "http://host/graphs/p9-os/branches/tmpbranch" * fix(cli): unify remote URL builder, close the //branches 404 class Correct-by-design fix for the failing test in the previous commit. The bug was not specific to `branch delete`: URL assembly was scattered across a string-concat `remote_url`, a url-crate `remote_branch_url`, and several `format!` interpolations that left dynamic path/query components un-encoded (commit id in the path, branch in the query string). `branch delete` was the instance that surfaced because it is the only verb that puts a dynamic value in the path. Replace both helpers with one `remote_url(base, segments, query)` that every remote call routes through. Callers pass structured segments and query pairs, so trailing-slash normalization (pop_if_empty) and per-segment / per-value percent-encoding live in one place. A stray `//` or an un-encoded dynamic component is no longer representable, closing the whole class rather than the reported instance. Migrates the previous commit's failing test to the new builder and adds the single-graph, trailing-slash, slash-in-name, commit-id-path, and query-value cases (the last two cover the previously latent siblings). All 16 callsites migrated; `remote_branch_url` removed. |
||
|
|
1bed998052
|
fix(engine): scalar index coverage + filter literal coercion (query latency) (#216)
* fix(engine): lower date/datetime filter literals as typed Arrow scalars `literal_to_expr` lowered `Date`/`DateTime` query literals as Utf8 strings, relying on DataFusion implicit casts. Against a physical `Date32`/`Date64` column that can coerce the column side (`CAST(col AS Utf8)`), which defeats a scalar BTREE and degrades the scan to a full filtered read. Lower to typed `Date32`/`Date64` scalars instead (reusing the loader's `parse_date32_literal`/`parse_date64_literal`, already used by the in-memory comparison arm), so the predicate stays a direct column comparison and the index is used. Malformed literals fall back to the Utf8 string so pushdown behavior never regresses. Tests: unit goldens asserting the lowered literal is typed (red before, green after) + inline-binding pushdown equality in literal_filters confirming the epoch conversion selects the right rows. * fix(engine): build scalar BTREE for enum and orderable-scalar @index columns `build_indices_on_dataset_for_catalog` only handled `String` (-> FTS) and `Vector` (-> vector). Enums are physically `String`, so an enum `@index` column (e.g. `status`) got an FTS inverted index, which Lance never consults for `=`; and `DateTime`/`Date`/numeric/`Bool` `@index` columns fell through and built nothing. Both meant equality/range filters degraded to full scans with `indices_loaded=0`. Dispatch index kind by property type via a shared `node_prop_index_kind`: enum + orderable scalar -> BTREE, free-text String -> FTS, Vector -> vector, list/Blob -> none. The helper is shared by the builder and `needs_index_work_node` so they cannot drift — the latter decides recovery- sidecar pinning, and under-reporting would leave a HEAD-advancing index build uncovered (invariant 5). Tests: scalar_indexes.rs asserts enum/DateTime/numeric @index columns report `IndexCoverage::Indexed` while free-text String/un-annotated columns stay `Degraded` (negative control). Docs: docs/user/indexes.md. * feat(engine): reindex in optimize to keep index coverage current A scalar/FTS/vector index only covers the fragments it was built over. Rows appended after the build (e.g. `ingest --mode merge`, whose commit does not rebuild an existing index) are scanned unindexed, and `compact_files` rewrites fragments out of coverage. Nothing folded them back in, so coverage decayed as the graph grew — even the id/src/dst BTREEs that power traversal. `optimize_one_table` now runs Lance `optimize_indices` after `compact_files` (incremental merge, not retrain — the same compact->optimize_indices sequence LanceDB's `optimize()` uses) and enters the publish path on compaction work OR stale index coverage (new `TableStore::has_unindexed_fragments`, reusing the fragment_bitmap logic). `optimize_indices` is a committing call with no uncommitted variant in lance-6.0.1, so it is an inline-commit residual covered by the existing `SidecarKind::Optimize` recovery sidecar spanning both ops. Blob-bearing tables are still skipped (the Lance blob-compaction bug is compaction-specific; reindex-for-blob deferred as a noted follow-up). Tests: maintenance.rs asserts an appended fragment is uncovered before and covered after optimize, and idempotency holds (second pass is a no-op). lance_surface_guards pins the `optimize_indices` signature and its incremental- coverage behavior. The existing optimize Phase-B recovery failpoint now also exercises a crash after reindex. Docs: maintenance.md, writes.md, invariants.md, lance.md, AGENTS.md. * fix(engine): coerce pushdown filter literals to the column type Filter literals were pushed to Lance in their natural Arrow type (every integer Int64, every float Float64). Against a narrower indexed column DataFusion widens to the literal's type and casts the COLUMN (`CAST(n32 AS Int64)`), which defeats the scalar BTREE and degrades to a full filtered read. A physical-plan probe confirms it: an Int32 column filtered by an i32 literal uses `ScalarIndexQuery`; by an i64 literal it does not. Thread the scan's `arrow_schema` through `build_lance_filter_expr` -> `ir_filter_to_expr` and coerce each literal operand to the opposite column's exact Arrow type, reusing `projection::literal_to_array` + `arrow_cast` (the same path the in-memory arm uses, so the two arms agree). Coercion never demotes a filter to None: on failure it falls back to the natural literal, because a node scan has no in-memory fallback for inline filters. Supersedes the date-specific change in |
||
|
|
d46e50dd6d
|
docs(user): restructure user docs into topic sections (Phase 1) (#223)
Move the 23 flat docs/user/*.md files into topic subdirectories so the user guide is organized by area (schema, queries, search, branching, cli, operations, clusters, concepts, reference) instead of a flat list. This is a pure structural move — whole files relocated, every cross-doc link recomputed, no prose rewrites or content splits (those follow in Phase 2). - 19 `git mv`s (install.md, deployment.md stay top-level); history preserved (renames detected at 92–100% similarity). - All intra-doc links, AGENTS.md's topic table (52 pointers), and the docs/dev + docs/releases back-links recomputed via relpath from each file's new location. - docs/user/index.md rewritten as a sectioned nav hub. - Fixed 5 doc-path references in Rust (comments + two user-facing server settings error strings) to point at the new locations. Verified: zero broken .md links across tracked docs; check-agents-md.sh green (with the untracked scratch docs set aside); touched crates build. Note: the public site (omnigraph-web) imports docs/ via a flat-only script; its import-docs.mjs needs a subdir-aware update before the next re-sync. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> |
||
|
|
8726ca92ec
|
feat: canonical POST /load, deprecate /ingest (RFC-009 Phase 5) (#222)
* feat(server): canonical POST /load, deprecate /ingest (RFC-009 Phase 5) The CLI's non-deprecated `load` verb rode the deprecated `/ingest` route, so `/ingest`'s eventual removal would silently break it. Add a canonical `/load`, mirroring the shipped `/mutate`↔`/change` and `/query`↔`/read` pattern. - Extract `server_ingest`'s body into a shared `run_ingest` (branch-exists / fork-if-`from`, Cedar auth, admission, `load_as`, `IngestOutput` mapping). - `server_load` (canonical) → `run_ingest`, `Json<IngestOutput>`. - `server_ingest` (deprecated) → `run_ingest` + `#[deprecated]` + RFC 9745/8288 `Deprecation: true` / `Link: </load>; rel="successor-version"` headers. - Router mounts `/load` (same 32 MB body limit) beside `/ingest`; OpenAPI `paths(...)` gains `server_load` and flags `server_ingest` deprecated. `/load` reuses `IngestRequest`/`IngestOutput`, exactly as canonical `/mutate` reuses `Change*` — a DTO rename is a separate, larger change (out of scope). openapi.json regenerated. Tests: openapi `/load` present + not deprecated, `/ingest` deprecated, `/load` bearer-secured; data_routes `/load` happy path + `/ingest` deprecation headers. Existing `/ingest` route tests stay green (the shim is unchanged). Docs: server.md endpoint table; RFC-009 Phase 5 marked landed (incl. the hand-mount-vs-utoipa-axum registration finding). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli): point remote load at /load (RFC-009 Phase 5) `GraphClient::load`'s remote arm now POSTs to the canonical `/load` route instead of the deprecated `/ingest`; the deprecated `ingest` verb keeps riding `/ingest`. `parity_load` exercises `/load` on the remote arm (its documented flip); the matrix exclusions comment is updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> |
||
|
|
6144bb18d6
|
feat(cli): cluster-managed maintenance addressing + init signpost (RFC-010 Slice 3) (#221)
* feat(cluster): cluster_root_for_graph_uri detection helper (RFC-010 Slice 3) Public helper the CLI uses to refuse `init` into a cluster-managed location: given a graph storage URI of the cluster layout (`<root>/graphs/<id>.omni`), return the cluster root if `<root>` holds `__cluster/state.json`, else None. Cheap by construction — a URI that doesn't match the `<root>/graphs/<id>.omni` shape returns None with zero I/O, so ordinary `init` targets never probe storage. Works for file:// and s3:// via the storage adapter. Adds two ClusterStore accessors (`display_root`, `has_state`). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli): cluster-managed maintenance addressing + init signpost (RFC-010 Slice 3) Two cluster-graph-aware CLI behaviors, sharing the cluster-resolution path. Maintenance addressing. `optimize`/`repair`/`cleanup` gain `--cluster <dir|s3://…> --cluster-graph <id>`, which resolves the graph's storage URI from the served cluster snapshot (the same truth a `--cluster` server boots from — `read_serving_snapshot*`) and opens it embedded. The operator no longer hand-types `<storage>/graphs/<id>.omni`. A distinct flag is required because the global `--graph` is `requires = server` and means a remote multi-graph id. clap enforces both-or-neither and exclusion with the positional URI / `--target`; an unserved graph errors loudly, pointing at `cluster apply`. init signpost. `init` refuses a cluster-managed positional path (the `<root>/graphs/<id>.omni` layout where `<root>` holds `__cluster/state.json`, detected by `cluster_root_for_graph_uri`) and points at `cluster apply` — graphs in an established cluster are created with ledger/recovery/approvals, not by hand. The check is gated on the path shape, so ordinary `init` does no extra I/O and existing pre-apply cluster-graph inits are unaffected. planes guard remediation now also mentions `--cluster … --cluster-graph …` (the two Slice-1 guard-string tests track it). Docs updated (cli-reference Command planes, maintenance.md, cluster.md §7); the stale "no S3-hosted cluster directories" limitation is dropped (RFC-006 landed it). Tests (cli_cluster.rs, reusing the apply-a-cluster fixture): resolve by id, unknown-id error, `--cluster` requires `--cluster-graph`, init refusal + signpost, and ordinary init still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(cli): resolve cluster graphs from the state ledger, not the serving snapshot Addresses the Greptile review on #221. `read_serving_snapshot*` does all-or-nothing serving validation — recovery-sidecar checks plus a digest verify of every catalog payload (query .gq, policy blobs). Using it to resolve a maintenance target coupled `optimize`/`repair`/`cleanup` to the readiness of unrelated resources: a single corrupt policy blob, or a pending recovery sweep, would block the command before it could touch the graph — worst for `repair`, the tool you reach for *when the cluster is degraded*. Add `omnigraph_cluster::resolve_graph_storage_uri(cluster, graph_id)`: read the state ledger, confirm the graph is in the applied revision, return `graph_root(id)` — the URI is deterministically derivable, no catalog validation. The CLI's cluster resolver now calls it. Test: `optimize --cluster … --cluster-graph …` still resolves after the catalog payloads (`__cluster/resources/`) are removed — the ledger-only path is not blocked by degraded/unrelated catalog state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> |
||
|
|
d6cf5b298c
|
feat(cli): plane-grouped --help + clap 4.6.1 (RFC-010 Slice 2) (#220)
* chore(deps): bump clap to 4.6.1
Workspace constraint "4" → "4.6" so the resolver picks up the 4.6 line
(a plain `cargo update` stayed on 4.5.x). clap 4.5.58 → 4.6.1
(clap_builder 4.6.0, clap_derive 4.6.1). Minor bump, no API breakage; the
workspace builds and all CLI suites pass unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(cli): group --help by plane (RFC-010 Slice 2)
Slice 1 declared the planes (the command_plane table + the wrong-plane
guard); this makes them visible in `--help`. clap can't print labeled
heading rows between subcommand groups (verified against the source —
help_heading is args-only, {subcommands} is one flat block), so per the
chosen approach: cluster + legend.
- Reorder the `Command` enum into plane bands (clap lists subcommands in
declaration order): data (query, mutate, load, branch, snapshot, export,
commit, schema, graphs) → storage/local-graph ops (init, optimize,
repair, cleanup, lint, queries) → control (cluster) → session (policy,
embed, login, logout, config, version). No magic display_order numbers —
the source order IS the help order, with band comments for readers. The
band placement matches `command_plane` (lint/queries are storage-plane:
they reject --server), so the help grouping and the guard agree.
- Add an `after_help` legend on `Cli` naming the planes. Written to
describe the planes (not enumerate every command) so it doesn't drift.
Help-polish (post-review): hide the deprecated `ingest` from the list
(still a valid command); trim the long `login` and `--as` descriptions to
one line each so the columns don't blow up.
The behavioral source of truth for planes stays `planes::command_plane`;
this ordering is its cosmetic counterpart.
Test: `help_groups_commands_by_plane` pins the legend phrase + the cluster
ordering (query < optimize < cluster). Doc: a line under cli-reference's
*Command planes* section.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(cli): qualify mixed-plane commands in the --help legend
Addresses the Greptile P2 on #220: the legend placed `schema` entirely in
Data and `queries` entirely in Storage, but per `command_plane` the
subcommands differ — `schema plan` is storage-plane (rejects --server) and
`queries list` is session (no graph). A user reading the legend then running
`schema plan --server` would hit a rejection contradicting it. The Commands
list is one entry per top-level command (necessarily coarse), so the legend
carries the nuance: `schema [plan: storage]` and `queries [list: session]`.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
|
||
|
|
4187d56f8a
|
fix(cli): align lint plane label + document the plane model (RFC-010 follow-up) (#218)
Addresses the Greptile review on #217: P1 — `lint` reported two different names. `command_label` returns `lint`, but `execute_query_lint` passed `"query lint"` as the resolver operation string, so `lint --server` said `lint` while `lint <https>` said `query lint`. Both were pinned by tests. `query lint` is the *deprecated* alias (argv-rewritten to `lint`), so the canonical name is `lint`: switch both user-facing strings in `execute_query_lint` (the storage-plane bail label and the requires-schema-or-target usage message) to `lint`, and update the two pinned assertions in `cli_data.rs`. P2 — user-doc debt (AGENTS.md rule 1: error text is observable behavior). Document the plane model in `cli-reference.md` (new *Command planes* section: data vs storage/maintenance vs control, which addressing flags apply, and the declared wrong-plane / remote-target errors), and add an addressing note to `maintenance.md` cross-referencing it. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> |
||
|
|
106356ab25
|
feat(cli): RFC-010 Slice 1 — declared plane capability surface + honest addressing (#217)
* feat(cli): declared plane capability surface + wrong-plane guard (RFC-010 Slice 1) New `planes.rs` is the single source of truth for which plane each subcommand belongs to (Data / Storage / Control / Session). `command_plane` is an exhaustive match — adding a `Command` variant is a compile error until its plane is declared, so the surface cannot silently drift from the command set. It descends into the nested enums where the plane differs per subcommand (`schema plan` is storage while `schema show/apply` are data; `queries validate` opens the graph while `queries list` reads only config). `guard_addressing` runs once in `main` before dispatch: the data-plane addressing flags `--server`/`--graph` on any non-data verb now fail with one declared, pinned error instead of being silently ignored (`optimize --server prod` previously dropped `--server`). `init`'s message drops the `--target` half since it takes only a positional URI today. Test: `cli_schema_config::schema_plan_with_server_flag_errors_wrong_plane` pins the per-subcommand label, proving the guard descends into the nested enum. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli): storage-plane verbs fail loudly on a remote target (RFC-010 Slice 1) `optimize`/`repair`/`cleanup` switch from `resolve_uri` to `resolve_local_uri`, so a `--target` (or positional URI) that resolves to a remote server now fails with a declared storage-plane message instead of whatever `Omnigraph::open` said about an `http(s)://` URI. The `resolve_local_graph` bail is reworded to that storage-plane message, so every storage verb already on the local resolver (`schema plan`, `queries validate`, `lint`) speaks with one voice. Net: `optimize --target knowledge` resolves to the graph's storage URI and runs embedded; `optimize --target prod` (remote) fails loudly; `optimize --server` is caught earlier by the guard. Positional-URI invocations are unchanged. Tests (pinned strings, per RFC-010's test plan): optimize happy path on a local graph, `optimize --server` wrong-plane error, `optimize <https>` storage-plane error; the existing `query_lint_rejects_http_targets_without_schema` assertion is updated to the new shared message. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> |
||
|
|
45500a690a
|
refactor(cli): collapse export + graphs-list onto GraphClient (RFC-009 Phase 3c) (#213)
The last two embedded-vs-remote forks move onto the enum, so every such `if` in the CLI now lives in client.rs — the point of the refactor. - `export<W: Write>`: the streaming verb 3b deferred (writes to a writer, chunks the HTTP response body, rather than returning a DTO). Embedded calls db.export_jsonl_to_writer; Remote streams the chunked body through. Opens WITHOUT policy (like reads), so it routes via resolve(). - `list_graphs`: remote-only by design (no local enumeration endpoint), so the Embedded arm keeps the loud "requires a remote multi-graph server" bail verbatim. Routing it through the enum still buys the shared resolve() addressing/token preamble the arm hand-rolled. Retire the now-orphaned execute_export_to_writer / execute_export_remote_to_writer pair, and sweep two pre-existing dead fns while in the files: inferred_config_path (helpers.rs) and yaml_string (output.rs, shadowed by test-local copies). parity_matrix gains one row, parity_export — the single intended matrix change in this phase. Export is a JSONL stream, not a single --json doc, so it compares the two arms' output line-wise (sorted; twin graphs are byte-copies so rows need no scrubbing). graphs-list gets no row: its remote-only behavior is a documented exclusion, not an equality case. Full workspace tests pass; all 12 parity rows green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> |
||
|
|
d32c1ac191
|
refactor(cli): collapse write/query forks onto GraphClient (RFC-009 Phase 3b) (#211)
Phase 3a put the GraphClient enum in place and collapsed the five uniform read forks. 3b folds the remaining data-plane forks onto the same enum: load, ingest, mutate, query, branch create/delete/merge, and schema apply. The wrinkle 3a deferred was the local policy attachment. Reads and query open the local engine without a policy; writes open through open_local_db_with_policy and attribute a resolved actor. So the Embedded variant grows an optional policy context (graph/actor) filled by a second factory, resolve_with_policy; resolve() leaves it empty. open_embedded picks the open path from whether the context is present, preserving both of today's behaviors exactly. query still uses resolve() (no policy), as the read path did. apply_schema takes the catalog-validator closure as impl FnOnce(&Catalog) — the embedded arm runs it inside apply_schema_as_with_catalog_check, the remote arm ignores it (the server runs its own check). That non-object-safe closure is why GraphClient is an enum, not a trait. The stored-query registry is still built caller-side and only for the local path. load and ingest stay separate methods: same operation, but load surfaces the CLI LoadOutput (two distinct per-arm mappings preserved) while ingest surfaces the wire IngestOutput. The now-fully-dead execute_read/ execute_read_remote and execute_change/execute_change_remote pairs are retired (legacy_change_request_body stays — client.rs uses it); the export pair remains for 3c. The Phase-1 parity matrix is unchanged and green; full workspace tests pass. Co-authored-by: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
25d74d689d |
refactor(cli): GraphClient enum + read verbs (RFC-009 Phase 3a)
The embedded-vs-remote split gets one home: a GraphClient enum
(Embedded { uri } | Remote { http, base_url, token }) with a resolve()
factory that absorbs the shared preamble (apply_server_flag -> token ->
URI/remoteness) and a verb method per command. The five uniform read
forks — branch list, commit list, commit show, schema show, snapshot —
collapse from per-command if-graph-is-remote else to one line each
(main.rs: -113/+47). Behavior identical per verb (local reads still open
WITHOUT policy, as today); the Phase-1 parity matrix is the referee and
passes textually unchanged.
Enum, not the RFC trait: only two variants ever, and inherent async
methods avoid async_trait boxing and the apply_schema closure that is not
object-safe (3b) — same one-body-two-impls collapse, less ceremony.
Scope: the uniform reads only. The query verb (policy-open + operator-
alias early-return + param merge) joins the write verbs in 3b;
export/streaming and graphs-list in 3c, where the now-shared
execute_*_remote/execute_* pairs get retired.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
||
|
|
adbb2a181c |
refactor(cli): consume omnigraph-api-types directly; unify the load mapping
The CLI's wire-DTO imports repoint from omnigraph_server::api to omnigraph-api-types (the server's other exports — queries registry, config types — still come from omnigraph-server). The local Load arm's inline LoadOutput hand-construction in main.rs is extracted into load_output_from_result next to load_output_from_tables in output.rs, so both '-> LoadOutput' mappings (engine LoadResult for local, wire IngestOutput for remote) live in one place. Deviation from the plan, with reason: LoadOutput stays CLI-side rather than moving into the wire-DTO crate — it is a rendered CLI output type, not an HTTP wire DTO, and its mapping consumes a CLI clap type (CliLoadMode). The shared crate stays strictly wire DTOs. Shapes unchanged: the parity matrix passes textually unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
4821e7208f |
refactor(api): extract omnigraph-api-types crate (RFC-009 Phase 2)
The HTTP wire DTOs and their engine-result -> DTO mappings move from omnigraph-server's api module into a new omnigraph-api-types crate that both server and CLI can depend on (engine must not — DAG: api-types -> engine, never the reverse). The crate holds plain serde/utoipa types only; the transport-coupled error->status mapping stays in the server (lib.rs/ handlers). The one server-runtime coupling (query_catalog_entry, which maps a StoredQuery — not a wire type) stays behind in api.rs, now calling the crate's pub param_descriptor. api.rs becomes a thin `pub use omnigraph_api_types::*` re-export, so every omnigraph_server::api::Foo path (handlers, the OpenApi schema list, CLI imports) resolves unchanged. openapi.json regenerates BYTE-IDENTICAL (the Phase-2 referee: 77 openapi tests green, zero diff). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
82dda296d1
|
Merge pull request #208 from ModernRelay/test/parity-matrix
test(cli): embedded/remote parity matrix (RFC-009 Phase 1) |
||
|
|
446b46d548
|
Recovery liveness, storage fault-injection matrix, and one storage implementation over object_store (#203)
* test(engine): pin the long-lived-handle heal contract for sidecar-covered drift
A Phase B -> Phase C failure (commit_staged advanced Lance HEAD, manifest
publish did not land, recovery sidecar persists) currently wedges every
subsequent staged write on the same engine handle: the commit-time drift
guard rejects with 'run omnigraph repair', but repair itself refuses
while a recovery sidecar is pending, so a long-lived server can only
recover by restart. The documented contract (writes.md 'Long-running
servers', invariants.md invariant 5) says refresh-time roll-forward
closes this residual without restart -- but no write path runs it.
Two red tests pin the intended contract at the write entry points:
a follow-up load (the POST /ingest shape: shared handle, no reopen)
and a follow-up mutation must heal roll-forward-eligible sidecars
in-process and then succeed.
Currently failing with:
table 'node:Company' has Lance HEAD version 2 ahead of manifest
version 1; run `omnigraph repair` before writing
The fix lands in the next commit.
* fix(engine): heal pending recovery sidecars at the staged-write entry points
Close the long-lived-process gap in the recovery protocol: a Phase B ->
Phase C residual (per-table commit_staged landed, manifest publish did
not, sidecar persists) previously recovered only at the next ReadWrite
open or via an explicit refresh() that no production write path called,
so a long-lived server wedged every subsequent write on the commit-time
drift guard until restart.
New recovery::heal_pending_sidecars_roll_forward:
- one list_dir of __recovery/ at write entry (empty -> immediate
return, the steady state), so the per-write cost is one storage list;
- per sidecar, acquires the same per-(table_key, table_branch) write
queues every sidecar writer holds from before write_sidecar until
after delete_sidecar, then re-checks sidecar existence -- this
serializes the heal against live writers instead of rolling an
in-flight sidecar forward from under its writer (which would fail
that writer's publish CAS spuriously). Lock order queues ->
coordinator matches every writer's commit->publish path. This is the
queue-acquisition design recovery.rs and write_queue.rs already
documented for in-process recovery;
- processes in RollForwardOnly mode: the common residual rolls forward
in-process; rollback-eligible sidecars still defer to the next
ReadWrite open (Dataset::restore is unsafe under concurrency).
Wire it into load_as and mutate_as (before the inline delete path can
advance any HEAD), and rebase Omnigraph::refresh onto the same helper
so refresh stops racing live writers' sidecars.
The maintenance entry points (apply_schema_as, branch_merge_as,
ensure_indices) intentionally keep their strict fail-loud preconditions
for now; wiring the same heal there is a follow-up with its own tests.
Turns the previous commit's two red tests green.
* fix(engine): name the right recovery path in the commit-time drift guard
The drift guard's 'run omnigraph repair before writing' advice is a
dead end when the drift is covered by a pending recovery sidecar:
repair refuses while a sidecar is pending. With the write-entry heal in
place, reaching this guard with sidecar-covered drift means the heal
deferred it (rollback-eligible), and the actual recovery path is a
read-write reopen. Distinguish the two classes on the error path only
(one sidecar list, after the conflict is already certain); a listing
failure falls back to the uncovered-drift wording rather than masking
the conflict.
Pinned by extending refresh_defers_rollback_eligible_sidecar_to_next_open
with a write attempt against the deferred sidecar.
* docs: write-entry in-process sidecar heal — contract and coverage
Update the recovery contract docs to match the previous two commits:
invariant 5 now states that the staged-write entry points and refresh
run in-process roll-forward recovery (long-lived processes converge on
the next write, not at restart); writes.md 'Long-running servers'
describes the heal's queue-acquisition concurrency contract, the
improved drift-guard error, and the entry points that intentionally do
not heal yet; testing.md indexes the new failpoint tests; AGENTS.md
capability matrix drops the claim that in-process recovery is entirely
future work (only the rollback path remains with the background
reconciler).
* test(engine): pin the entry heal contract for schema apply and branch merge
Without the write-entry heal, the two maintenance writers do worse than
wedge on sidecar-covered drift -- they proceed and decide its fate
implicitly:
- schema apply re-plans table rewrites from the manifest pin, orphaning
the drifted Phase-B commit (its rows silently vanish from the
rewritten table) while the stale sidecar lingers to misclassify
against the post-apply pins;
- branch merge publishes over the drift, making the failed writer's
commit visible as an unattributed side effect (no recovery audit
row), and leaves the stale sidecar behind.
Two red tests pin the intended contract: both entry points heal the
sidecar first (attributed roll-forward), then run on the converged
state. Currently failing on the stale-sidecar / dropped-rows
assertions; the fix lands in the next commit.
* fix(engine): heal pending recovery sidecars at the schema-apply and branch-merge entries
Extend the write-entry heal to the remaining two write entry points.
Unlike load/mutate (which wedge on the drift guard), these proceeded
over sidecar-covered drift and decided its fate implicitly:
- schema apply re-planned table rewrites from the manifest pin,
orphaning the drifted Phase-B commit -- its rows silently vanished
from the rewritten table -- while the stale sidecar lingered to
misclassify against the post-apply pins;
- branch merge published over the drift, making the failed writer's
commit visible without a recovery audit row, and left the stale
sidecar behind.
Both now run the same queue-serialized roll-forward heal at entry,
before their own sidecar exists, so recovery is attributed (audit row)
and deterministic. ensure_indices stays heal-free: it runs inside the
load / schema-apply flows after their entry heal.
Turns the previous commit's two red tests green. Docs updated in the
same change (invariant 5, writes.md, testing.md, AGENTS.md).
* test(engine): pin Phase A sidecar-write failure semantics
Storage fault-injection matrix, row 1: a sidecar PUT failure (S3
PutObject / fs write) in Phase A. New failpoint recovery.sidecar_write
at the top of write_sidecar -- the single choke point all five sidecar
writers go through -- models the storage error backend-generically.
Also adds the other three storage-fault failpoints used by the
following commits (recovery.sidecar_delete, recovery.sidecar_list,
recovery.record_audit); each is a no-op without the failpoints feature.
Pinned contract: every writer writes its sidecar BEFORE its first
HEAD-advancing commit, so a put failure aborts with zero drift (no
sidecar, Lance HEAD == manifest pin, no rows) and a transient fault
never wedges the graph -- the same handle writes/merges normally once
it clears. Covered for load (the staging writer) and branch_merge (the
multi-table writer, forced onto the RewriteMerged path by diverging
both sides).
* test(engine): pin Phase D delete, list, and audit-append storage-fault semantics
Storage fault-injection matrix, rows 2/3/5, plus the real-backend run:
- recovery.sidecar_delete: a Phase D delete failure (S3 DeleteObject)
must NOT fail the user's write -- the manifest publish already
landed, so the caller's data is durable. The swallowed failure
leaves a stale sidecar; the next write's entry heal consumes it via
the stale-sidecar audit-recovery path (RolledForward, attributed).
- recovery.sidecar_list: a __recovery/ list failure (S3 ListObjectsV2)
is loud at every consumer -- the write-entry heal fails the write
and the open-time sweep fails the open. Silently skipping recovery
over a pending sidecar would be consumer tolerance of drift. Once
the fault clears, open recovers the pending sidecar normally.
- recovery.record_audit: an audit write failure after the
roll-forward's manifest publish aborts that recovery attempt and
keeps the sidecar; re-entry detects the already-published manifest,
records exactly ONE RolledForward audit row, and converges -- the
retry tolerance documented on record_audit, exercised end-to-end.
- s3_load_recovers_after_publisher_failure_without_reopen: the
same-handle heal scenario on a real bucket (gated on
OMNIGRAPH_S3_TEST_BUCKET, skips locally), exercising sidecar
put/list/delete through S3StorageAdapter instead of the local-FS
adapter. CI wiring lands in a follow-up commit.
* test(engine): refuse corrupt recovery sidecars loudly
Storage fault-injection matrix, row 4 (no failpoint needed -- the
corrupt file is written by hand, sibling to the unknown-schema-version
refusal test): a truncated/garbage __recovery/{ulid}.json must be
refused loudly by both the write-entry heal (the write fails naming
the parse error) and the open-time sweep (ReadWrite open fails naming
the file), with the file left on disk for operator inspection.
Read-only opens still work -- the sweep is skipped there.
* test(engine): run the S3 sidecar-lifecycle coverage in CI + document the fault matrix
- ci.yml rustfs_integration: new step running the bucket-gated
failpoints tests (name filter s3_) against the RustFS container, so
sidecar put/list/delete are exercised through S3StorageAdapter on
every storage-affecting PR.
- writes.md: sidecar I/O failure semantics -- Phase A put failure
aborts with zero drift; Phase D delete failure is swallowed (write
already durable) and healed by the next write; list failures are
loud at heal and open; corrupt sidecars are refused with the file
kept for inspection; audit-append failures are retried to exactly
one audit row.
- testing.md: index the storage-fault matrix in the failpoints.rs row
and the new RustFS CI line.
* test(engine): pin read-visibility of acknowledged local if-absent writes
The cluster lib test import_missing_state_creates_state_with_graph_-
observation flakes at ~50% under full-workspace load ('EOF while
parsing a value' reading back the state.json its own import just
acknowledged). Root cause is in the engine's local storage adapter:
write_text_if_absent writes through a buffered tokio::fs::File and
returns when write_all resolves -- which, per tokio's documented File
semantics, means the bytes reached tokio's internal buffer, not the
file. The actual write completes in a background blocking task after
drop, so a caller that acknowledges success and reads the object back
can see an empty or partial file. Under load the window widens; the
red run fails at iteration 0 with 0 of 8192 bytes on disk.
The regression test pins the contract at the adapter boundary: when
write_text_if_absent resolves, the full contents are visible to any
reader; a losing second claim leaves the winner's object untouched.
The fix lands in the next commit.
* fix(engine): publish local storage writes with atomic visibility
Close the class, not the instance. The local adapter admitted three
ways for a reader to observe a write that was acknowledged or visible
before its bytes were complete:
1. write_text_if_absent acknowledged success when the buffered
tokio::fs::File write_all resolved -- i.e. when the bytes reached
tokio's internal buffer, not the file. A caller reading back its own
acknowledged write could see an empty object (the ~50% cluster
import flake under full-workspace load; the regression test failed
at iteration 0 with 0 of 8192 bytes visible).
2. The same call published its CLAIM (create_new) before its CONTENT,
so concurrent readers saw an empty claimed file in the window.
3. write_text (plain tokio::fs::write) exposed truncated content
mid-replace -- silently falsifying write_sidecar's 'readers either
see the complete sidecar or none' contract on local FS (true on S3,
where PutObject is atomic).
A flush in write_text_if_absent would have fixed only (1). Instead,
both local write paths now publish complete temp files atomically:
rename for replace (write_text -- the idiom write_text_if_match
already used) and hard_link for no-replace (write_text_if_absent --
link fails AlreadyExists, so exactly one of N concurrent claimants
wins and the winner's object is fully readable at the instant it
becomes visible). The local adapter now honors the same object-level
atomic-visibility contract as the S3 adapter, which is what every
caller (recovery sidecar protocol, cluster state CAS) was written
against. Crash-orphaned *.tmp.* files are inert: the sidecar sweep
filters to .json, and cluster state reads address state.json by name.
fsync/durability policy is unchanged (no fsync before, none now);
this fix is about visibility ordering, not power-loss durability.
Pre-existing on main (landed with the multi-graph server mode change,
PR #119); surfaced by this branch's heal work only because one extra
list_dir per write shifted test timing. Cluster lib suite: 12/25
failures before, 0/25 after. Turns the previous commit's red test
green.
* refactor(engine): one storage implementation over object_store for every backend
Collapse LocalStorageAdapter (hand-rolled tokio::fs) and
S3StorageAdapter into a single ObjectStorageAdapter backed by
Arc<dyn object_store::ObjectStore> -- LocalFileSystem for local URIs,
the existing AmazonS3 build for s3://, plus a pub in_memory()
constructor (full contract including TRUE conditional updates; the
in-memory test backend testing.md asked for at the adapter level).
Why: the acknowledged-before-visible bug showed the two-impl shape has
no referee -- one prose contract, two independent answers. Upstream
LocalFileSystem::put_opts is byte-for-byte the staged-temp+rename/
hard_link idiom that fix converged on, and Lance's own commit protocol
is built on the same primitives (put-if-not-exists / rename-if-not-
exists), so the substrate-aligned move is to stop hand-rolling it.
The per-backend residue shrinks to a UriCodec (URI <-> object path)
and one capability flag.
Semantics preserved by construction, with three deliberate deltas:
- exists() is now object-store-semantics everywhere (head + non-empty
prefix fallback): an EMPTY local directory no longer 'exists'. The
only dir-shaped caller (_graph_commits.lance probes) self-heals via
ensure_commit_graph_initialized where it previously wedged loudly.
- A directory at an object path reads as NotFound, not as an IO error
('only objects exist'). The cluster unreadable-payload test used a
same-named directory as a portable non-NotFound trigger; it now uses
chmod 000, which still models genuine transient IO.
- write_text_if_match keeps content-token semantics on local
(PutMode::Update is NotImplemented upstream for LocalFileSystem in
0.12.5 and 0.13.2); the capability flag gates the token SOURCE in
read_text_versioned too -- an ETag token with content-compare writes
would lose every CAS.
delete_prefix keeps a local remove_dir_all branch: directories are a
local-FS concept, and list+delete would leave empty skeletons that
cluster graph_root_exists (raw Path::exists) reports as still present.
LocalStorageAdapter remains as a delegating shim so the pinned
contract tests gate this swap textually unchanged; the shim and the
test parameterization over local + in-memory land next. Cargo gains
the explicit 'fs' feature (already transitively enabled by lance).
* test(engine): one executable storage contract, run against every backend
Remove the LocalStorageAdapter delegation shim and migrate its
construction sites to ObjectStorageAdapter::local(). Replace the
per-backend duplicated tests with a single contract_suite asserting
the trait's promises (atomic replace, exists incl. the dataset-root
prefix probe, one-winner if_absent, versioned CAS with loud CAS-lost,
rename, list round-trip with no sibling-prefix bleed, idempotent
delete/delete_prefix), run against the local backend and the new
in-memory backend -- which implements true conditional updates, so the
strong-CAS path is exercised without a bucket. The bucket-gated S3
variant already exists (s3_adapter_conditional_writes_contract).
New local-specific pins for the deliberate semantic edges of the
collapse: empty directories are not objects (exists=false; the Lance
dataset-root probe shape is the non-empty case), file://-anchored and
spaces-in-path list output round-trips byte-identically into
read_text, dot-segment paths are lexically absolutized (the CLI's
./graph.omni shape), and upstream rename creating missing destination
parents. The acknowledged-write visibility regression test stays, now
documenting that the cross-API std::fs read-back is the point.
* refactor(cluster): drop put_json's per-backend atomicity branch
The local temp+rename dance predates the storage adapter guaranteeing
atomic visibility; now that write_text publishes via a staged temp +
rename on the filesystem (and a single atomic PUT on object stores) by
contract, the branch duplicated upstream behavior. One call, both
backends.
* docs: storage adapter collapse — contract, in-memory backend, local CAS gap
- testing.md: the 'no MemStorage backend' note is half-closed —
ObjectStorageAdapter::in_memory() covers the text-object layer with
the full contract (true conditional updates); Lance datasets bypass
the adapter, so the engine substrate ask stays open.
- invariants.md: truth-matrix Tests row updated; new Known Gap for
local write_text_if_match (upstream PutMode::Update is unimplemented
for LocalFileSystem; content-token emulation is safe only under the
cluster lock protocol — close before admitting a lock-free caller).
- writes.md: backend notes for the unified adapter (name#N staging
residue invisible to the sweep, backend-wrapped error text with
exists()-probing for missing-vs-error, loud permission failures).
* docs: finish renaming the storage adapters in user docs and test comments
storage.md's URI-scheme table and the S3 failpoint test's doc comment
still named the deleted LocalStorageAdapter/S3StorageAdapter; both now
describe the unified ObjectStorageAdapter over object_store, including
the relative-path absolutization note for local URIs.
* test(engine): pin branch-awareness of the drift guard's recovery advice
A pending sidecar on ANOTHER branch does not cover this branch's
drift: with a deferred feature-branch sidecar on disk and genuinely
uncovered drift on main, the main write's error must still point at
omnigraph repair -- a read-write reopen recovers the sidecar but
cannot repair main's uncovered drift. Currently red: the guard
matches sidecar pins by table_key only, so the feature sidecar flips
main's advice to the reopen path. Fix in the next commit.
Surfaced by external review of the drift-guard change.
* fix(engine): branch-aware sidecar matching in the drift guard's advice
The commit-time drift guard's sidecar-covered check matched pins by
table_key alone, so a pending sidecar on another branch flipped this
branch's uncovered-drift advice from 'run omnigraph repair' to the
reopen path -- and a reopen recovers that sidecar but cannot repair
this branch's drift. Compare the pin's table_branch too. Turns the
previous commit's red test green.
Surfaced by external review of the drift-guard change.
* test(engine): pin heal non-interference with a live schema apply
The write-entry heal's schema-staging reconcile runs before any queue
acquisition, so a load on the same handle, overlapping a schema apply
parked between its staging write and manifest commit, promotes the
apply's staging files (new catalog live against the old manifest),
classifies the LIVE apply's sidecar, and publishes its registrations
out from under it. The resumed apply then collides with its own stolen
commit. Currently red with:
Lance("Concurrent modification: table version 3 already exists for
node:Tag")
The fix (per-sidecar reconcile under the sidecar's write-queue guards,
plus a serialization key the schema-apply writer and the heal both
acquire) lands in the next commit.
Surfaced by external review of the write-entry heal.
* fix(engine): serialize the heal's schema-staging reconcile with live schema applies
The write-entry heal ran recover_schema_state_files up front, before
acquiring any queue guards. Overlapping a live schema apply parked
between its staging write and manifest commit, the heal promoted the
apply's staging files (new catalog live against the old manifest),
classified the LIVE apply's sidecar, and published its registrations —
the resumed apply then collided with its own stolen commit.
Correct by construction:
- New schema-apply serialization queue key, acquired by the schema-
apply writer (alongside its per-table keys) from before write_sidecar
until after delete_sidecar. Per-table keys alone don't cover a
registration-only migration, which pins no existing tables but has a
sidecar and staging files on disk.
- The heal reconciles schema staging lazily, PER SchemaApply sidecar,
after acquiring that sidecar's guards (including the serialization
key) and re-confirming the sidecar exists — a sidecar that survives
the queue wait belongs to a dead writer, so the reconcile can no
longer race a live apply. Recomputing per sidecar also removes the
staleness of one up-front result across a multi-sidecar pass.
- Omnigraph::refresh drops its up-front reconcile-and-pass-through
(same race, and a pre-promoted result would make the heal's guarded
reconcile see clean staging and wrongly defer the sidecar): it now
reconciles standalone only when NO sidecar exists — which cannot
race a live apply, whose sidecar always precedes its staging files —
and otherwise defers entirely to the heal.
The open-time sweep keeps its precomputed reconcile: open has no
concurrent writers. Turns the previous commit's red test green.
Surfaced by external review of the write-entry heal.
Self-audit addendum folded in: refresh's no-sidecar gate had a TOCTOU
(a live apply could write its sidecar + staging between the empty
check and the reconcile) — the standalone reconcile now holds the
serialization key across the list-then-reconcile pair. The remaining
residual is cross-process only (in-process queues cannot serialize
against a writer in another process; the open-time sweep has the same
pre-existing exposure) and is now an explicit Known Gap in
invariants.md rather than an implicit one.
* test(engine): pin catalog reload after the heal recovers a schema apply
When the write-entry heal rolls a crashed apply's SchemaApply sidecar
forward on the same handle, disk and manifest move to the new schema
(staging promoted, registrations published) but the handle's in-memory
schema_source/catalog do not. Subsequent writes then validate against
the stale catalog and reject rows of types the graph already has.
Currently red with:
record 1: unknown node type 'Tag'
refresh() reloads after its heal; the write entry points must too.
Fix in the next commit.
Surfaced by external review of the write-entry heal.
* fix(engine): reload the in-memory catalog after the heal recovers a schema apply
heal_pending_recovery_sidecars refreshed the coordinator and
invalidated the runtime cache after processing sidecars, but never
reloaded schema_source/catalog — so a write whose entry heal rolled a
crashed SchemaApply sidecar forward proceeded to validate against the
OLD schema while disk and manifest were already on the new one.
reload_schema_if_source_changed is the same post-heal step refresh()
already runs; it no-ops on the (overwhelmingly common) non-schema heal
because the on-disk source is unchanged. Turns the previous commit's
red test green.
Surfaced by external review of the write-entry heal.
* test(engine): pin that a deleted-branch sidecar cannot wedge the graph
A rollback-eligible sidecar pinned to a branch is deferred by every
roll-forward-only pass; if the branch is then deleted, the sidecar
survives, referencing a branch with no manifest tree. The heal (every
write entry) and the open-time sweep (every ReadWrite open) both fail
opening the dead branch, and repair refuses while a sidecar is pending
-- a terminal read-only state with manual sidecar surgery as the only
exit. Currently red with:
Lance("Not found: .../__manifest/tree/feature/_versions")
The branch's tree and forks are already reclaimed, so the pinned drift
is unreachable and the sidecar is provably moot; the fix classifies it
as an orphaned-branch terminal state (audit + discard) in both passes.
Surfaced by review (P1, verified by repro).
* fix(engine): classify deleted-branch sidecars as orphaned instead of wedging
A deferred (rollback-eligible) sidecar pinned to a branch survives
branch_delete; both the write-entry heal and the open-time sweep then
failed unconditionally opening the dead branch -- every write and
every ReadWrite open errored, and repair refuses while a sidecar
pends. Terminal state, manual sidecar surgery the only exit.
The branch's tree and per-table forks are already reclaimed at delete,
so the drift the sidecar pins is unreachable and the sidecar is
provably moot. Both passes now check the sidecar's branch against the
manifest's branch list (the authority -- deliberately NOT inferred
from a Not-found on open, which could be a transient storage error
masking real recovery intent) and discard orphans with an
OrphanedBranchDiscarded audit row, commit appended on main since the
sidecar's own branch no longer has a commit graph.
The open-time half is pre-existing; the write-entry heal made it hot.
Turns the previous commit's red test green.
Surfaced by review (P1, verified by repro).
* chore: harden review nits — vacuous CI filter, root-runner skip, liveness note
- ci.yml: the RustFS sidecar-lifecycle step now fails loudly if the
's3_' name filter matches zero tests (cargo passes vacuously on an
empty filter; the step exists specifically to prove S3 sidecar I/O
coverage). The pre-existing CLI smoke step has the same shape and is
left for a follow-up.
- cluster unreadable-payload test: cfg(unix) + a skip-with-log when
running as root (mode 000 is still readable to root, common in
container dev runners), so the test degrades instead of failing.
- refresh: document the one-pass-late convergence for legacy staging
residue while non-SchemaApply sidecars pend, so nobody 'fixes' it by
re-running the reconcile unserialized — the exact race the
serialization key closes.
* test(engine): pin orphan-discard idempotency across a delete fault
discard_orphaned_branch_sidecar writes its audit row and main commit
before deleting the sidecar; a Phase D delete fault leaves the sidecar
on disk with the audit already durable, and the retry repeated the
whole path -- a second OrphanedBranchDiscarded audit row (and commit)
for the same operation. Currently red: 2 rows after one fault + retry.
The retry must only finish the delete. Fix next.
Also promotes the recovery-audit kinds reader into the shared test
helpers (it was recovery.rs-local).
Surfaced by external review of the orphan-discard fix.
* fix(engine): orphan-discard idempotency + heal reports acted-vs-deferred
Two review findings on the recovery surface:
- discard_orphaned_branch_sidecar now checks the audit table for an
existing (operation_id, OrphanedBranchDiscarded) row before appending
the commit + audit pair, so a Phase D delete fault retries ONLY the
delete instead of duplicating audit rows and commit-graph entries.
Cold path: the list scan runs only when an orphaned sidecar exists.
Turns the previous commit's red test green (exactly one audit row
across fault + retry).
- process_sidecar returns whether durable state changed; the heal sets
processed_any only for sidecars that were actually rolled forward /
rolled back / audit-recovered (orphan discards count). Deferred
sidecars (rollback-eligible, invariant-violating, unpromoted
SchemaApply) no longer trigger a per-write schema reload + full
runtime-cache invalidation while they pend -- the cache is
snapshot-keyed so this was waste, not corruption, but it was paid on
every write until reopen. Acted-paths' processed=true remains pinned
by load_after_schema_apply_phase_b_failure_uses_recovered_catalog
(the reload depends on it).
Surfaced by external review.
* test(engine): pin the orphan-discard audit-append fault leg as documented tolerance
The orphan discard's commit append and audit append are two writes; a
failure between them leaves a recovery commit with no audit row, and
the retry (keyed on the audit row, the operator-facing record) appends
a second commit before the audit lands. This is the same
not-atomic-pair-write tolerance record_audit documents and the
manifest->commit-graph Known Gap covers for every publish: bounded
commit-graph noise, audit row exactly-once under clean failures.
Keying idempotency on commit rows instead would need an operation_id
column on _graph_commits, and audit-before-commit would dangle the
graph_commit_id join -- both worse than the documented residual.
Make the tolerance explicit instead of implicit: docstring names the
window, a failpoint sits inside it, and the new test pins convergence
across the fault (sidecar consumed, exactly one audit row), completing
the orphan-discard fault matrix alongside the delete-fault leg.
Surfaced by external review of the orphan-discard idempotency.
* test(engine): pin honest drift-guard advice when sidecar listing fails
The guard's unwrap_or(false) conflated 'classified as uncovered' with
'could not classify': a transient list fault on the guard's second
list (the entry heal's first list having succeeded) confidently routed
the operator to omnigraph repair even when the heal had just deferred
a rollback-eligible sidecar -- and repair refuses while a sidecar is
pending. Currently red: the error says 'run omnigraph repair' with no
mention of the reopen path. The fix names both paths plus the failure
cause when classification is impossible.
Surfaced by external review of the drift-guard fallback.
* fix(engine): admit ambiguity in the drift guard when sidecar listing fails
Replace the unwrap_or(false) fallback with a tri-state: covered ->
reopen advice; uncovered -> repair advice; listing FAILED -> say the
drift could not be classified, name the cause, and give both paths in
order ('run repair, or reopen read-write if repair reports a pending
sidecar'). The old fallback confidently routed a transient list fault
to repair, which refuses while a sidecar is pending -- a self-
correcting but pointless detour. The conflict itself is still always
raised; only the advice degrades honestly. Turns the previous commit's
red test green.
Surfaced by external review of the drift-guard fallback.
|
||
|
|
08c9b03d40 |
test(cli): the embedded/remote parity matrix (RFC-009 Phase 1)
The referee before any unification moves: every forked verb runs once against the local graph and once against a spawned server on a twin copy of the same fixture, with the SAME actor (--as locally; bearer-resolved remotely) and the SAME Cedar bundle on both arms — like-for-like enforcement is part of the harness (a tokens-only server is default-deny by design; comparing that against a bare local arm measures configuration, not the fork). Declared-volatile fields (ids, wall-clock, transport locations) scrub to placeholders; everything else must match exactly, and exit codes must match for shared failures. Headline result: 11 rows green with an EMPTY divergence ledger — the arms agree on every verb today. The ledger (KNOWN_DIVERGENCES) exists so any future divergence is pinned or filed, never silently repaired; repairs are Phase 3's job, gated by this referee staying green. One engine observation surfaced and filed (#207): inline execution with a declared-but-unbound param matches ALL rows on both arms, while the stored-query invoke path hard-errors — a cross-path asymmetry the matrix pins as agreeing behavior pending a deliberate fix. Documented exclusions (graphs list, ingest/load-over-/ingest, storage-plane verbs) map to RFC-009 Phases 4-5. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
dedd647cde |
release: bump workspace to 0.7.0
All six crate manifests + their path-dependency constraints, Cargo.lock, the regenerated openapi.json version metadata, AGENTS.md's surveyed version, and the v0.7.0 release notes (object-storage clusters, config-free --cluster serving, the operator config surface, keyed credentials, operator targeting/aliases, and the omnigraph.yaml deprecation stages). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
aabb3dca2e |
fix(storage): flush before acking in local write_text_if_absent
tokio's async File buffers writes internally: write_all only fills the buffer, and the actual OS write happens in a background task after drop — so write_text_if_absent could return Ok(true) with the file created but still EMPTY, and an immediate reader saw EOF. Caught twice in CI as 'EOF while parsing a value' reading state.json right after cluster import (the cluster's first state-write routes here since the storage port); also an invariant-6 violation (acknowledged before the write reached the OS). The other local write paths use tokio::fs::write, which flushes internally — this was the one miss. Fix: flush().await before Ok, with the same remove-on-failure cleanup as the write itself. Regression test is a best-effort tight loop (the window is timing-dependent; the two CI failures are the recorded red) asserting read-after-ack never sees a short file. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
4c50170c77 |
feat(config): OMNIGRAPH_NO_LEGACY_CONFIG strict mode (RFC-008 stage 4)
Opt-in: with the env set, loading a legacy omnigraph.yaml is a hard error pointing at config migrate — the regression guard for migrated teams (a stray legacy file would otherwise silently outrank operator config during the window) and the rehearsal for stage 5's removal. Strict refuses the FILE, never its absence: flag-less invocations on migrated setups are untouched. Inert unless set. The RFC's stages-1-3-then-4 release gap collapsed honestly: no version boundary was crossed between them, so all four ship in the same release (noted in the RFC). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
5328c91341 |
refactor(cli): drop cluster init — no replacement scaffold
Andrew's call, and the right one by the repo's own lens: a minimal
cluster.yaml is five lines; a generator is a second copy of the schema to
keep in sync forever, emitting a file that is unusable until hand-edited
anyway (graphs: {} cannot apply or serve). Terraform has no config
scaffolder either. New users copy from the cluster quick-start; migrants
get a ready-to-review cluster.yaml from config migrate. RFC-008 stage 3
becomes purely subtractive.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
||
|
|
5ba9656666 |
feat(cli): init stops scaffolding omnigraph.yaml; cluster init replaces it (RFC-008 stage 3)
omnigraph init no longer writes a legacy config into cwd (the source of the earlier test-pollution bug, and a scaffold for a deprecated file); the scaffolder is deleted. omnigraph cluster init scaffolds the replacement: a minimal valid cluster.yaml (version: 1, optional metadata.name / storage:, a commented graphs example), refusing to overwrite. The scaffold validates clean via cluster validate in the e2e. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
cd1f175396 |
feat(cli): omnigraph config migrate — the RFC-008 split (stage 2)
Reads a legacy omnigraph.yaml and produces the three-section split: team half as a ready-to-review cluster.yaml proposal (graphs with TODO schema pointers — the legacy file never knew schemas — per-graph queries directories, policies with applies_to bindings), personal half as an operator-config merge (actor, output/table defaults — OperatorDefaults gains the two table keys with their cascade hops — remote graphs with bearer_token_env become servers entries plus a printed login step, and legacy aliases split per the RFC: content to the catalog as a manual step, binding to an operator alias), plus a dropped-keys section with reasons. Touches nothing without --write; with it, the operator merge is key-level (existing entries always win; prior file backed up), and cluster.yaml is emitted only when absent (else cluster.yaml.proposed). --json emits the report structurally. The completeness contract is a unit test: every top-level key of the legacy schema must classify somewhere, or the RFC-008 map has a bug. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
c89d268b23 |
feat(config): per-key deprecation warnings on legacy omnigraph.yaml load (RFC-008 stage 1)
Loading a legacy file (flag, env, or cwd-found — never on defaults) emits one stderr block listing each key actually present with its destination from RFC-008's migration map — the map applied to YOUR file, not a generic banner. Once per process; both binaries warn (cluster-mode boots never reach load_config, silent by construction); suppressible via OMNIGRAPH_SUPPRESS_YAML_DEPRECATION=1 for CI logs during the window. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
20ddfc61c1 |
fix(cli): reclaim the hidden legacy-uri positional for operator aliases
Caught on the live smoke: with --alias, the first bare CLI arg lands in
the hidden legacy_uri positional, so an operator alias's positional param
never bound ('parameter not provided' from the server). An operator alias
always knows its target, so the existing normalize_legacy_alias_uri
reclaims the swallowed positional as the first alias arg — same rule the
legacy path already applies.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
||
|
|
dc91c55970 |
feat(cli): operator aliases — pure bindings invoking stored queries (RFC-007 PR 3, part 2)
aliases: in the operator config bind a personal name to (server, graph,
stored-query NAME, positional arg mapping, fixed param defaults, format)
— zero content, per the ratified bindings-not-content model. Invocation
goes through the server's stored-query endpoint (POST
{base}/graphs/{g}/queries/{name}) with the keyed credential resolving via
the ordinary URL match; param precedence --params > positionals > fixed
defaults; the result renders through the existing format cascade with the
alias's format as its hop. A legacy omnigraph.yaml alias with the same
name wins during the RFC-008 window, with a warning naming both.
E2e (spawned policy-gated server, invoke_query granted via a per-graph
bundle): the alias invokes with name + one positional and nothing else —
server, graph, query, and token all from the operator layer; --server/
--graph explicit targeting; unknown --server lists defined names;
--server exclusive with a positional URI.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
||
|
|
2b33ab64f2 |
feat(cli): --server <name> targeting (RFC-007 PR 3, part 1)
Global flags --server (operator-defined server name) and --graph (graph id on a multi-graph server, requires --server) resolve to the effective remote URI through one helper and feed the ordinary uri slot — graph resolution and the PR-2 keyed-token URL match work unchanged; the flag is sugar for a URI the operator already owns. Exclusive with a positional URI and --target (loud error, never silent precedence). Unknown names fail listing the servers that ARE defined. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
a819ab500e |
feat(cli): keyed credentials — servers:, the token chain, login/logout (RFC-007 PR 2)
The operator config gains servers: (name -> url; never a token). A remote command whose URL prefix-matches an operator server resolves its bearer token through the keyed chain first — OMNIGRAPH_TOKEN_<NAME> env, then the [<name>] section of ~/.omnigraph/credentials (created 0600 via temp+rename, #139 finding 7; group/world-readable files refused loudly) — falling through to the legacy chain unchanged. URL keying makes §D5 rule 3 structural: a token is only ever sent to the server it is keyed to. Longest-prefix matching with a path-boundary check (http://h:8080 never matches http://h:8080-evil). Inserting the keyed hop above the legacy chain is safe by construction — no existing setup can have servers: defined. omnigraph login <name> stores/rotates one section (token from --token or one stdin line — the pipe flow keeps secrets out of shell history); omnigraph logout removes it, idempotently; logging in before declaring the server warns instead of failing (the gh model). Coverage: URL-match/no-substring-trap, credentials round-trip preserving sibling sections, 0600 write + over-permissive refusal, env-name mapping; the legacy resolve test is now hermetic against a real ~/.omnigraph and asserts byte-identical legacy behavior with no servers defined; one spawned-binary e2e walks the whole lifecycle against an authed server: refusal -> wrong-token login (stdin) -> rotate (--token) -> authorized read -> env-beats-file -> non-matching-URL negative -> logout revokes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
be4bd46212 |
feat(cli): the operator config surface — identity and output defaults (RFC-007 PR 1)
~/.omnigraph/config.yaml joins the resolution chains as the operator surface: operator.actor becomes the last hop of THE actor chain (--as > legacy cli.actor during the RFC-008 window > operator.actor > none, one implementation for direct-engine and cluster commands alike) and defaults.output joins the read-format cascade below every more-specific source. Discovery honors $OMNIGRAPH_HOME (tilde-expanded, #139 finding 9); an absent file is an empty layer; unknown keys WARN and load (a file written for later slices must not break this CLI); malformed YAML is a loud error. The module is CLI-only — the server never reads operator config (invariant 11 by construction). $OMNIGRAPH_CONFIG becomes a first-class stand-in for --config in load_config (flag > env > ./omnigraph.yaml), one meaning in both binaries. The test harness pins hermeticity: spawned binaries get a nonexistent OMNIGRAPH_HOME by default so no test ever reads the developer's real operator config. New coverage: loader unit tests, the env-precedence matrix on load_config_in, and spawned-binary e2es for the actor chain (operator wins with no flag/legacy key; legacy outranks it; --as wins) and the format cascade. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
8d7aed065f |
test(cluster,server): gated object-storage cluster e2e + CI wiring + docs
s3_cluster.rs runs the full control-plane lifecycle against a real bucket (CI: containerized RustFS; locally the RustFS binary): import → lock released (pins the drop-time release regression caught on the first live smoke) → apply (graph roots + catalog on the bucket, nothing local) → serving snapshots from both the config dir and the bare URI → schema evolution → approved delete (prefix removal) → empty-cluster refusal. The server suite gains the config-free boot test: --cluster s3://… with zero local files serves a stored query over HTTP. CI: the rustfs job runs both suites; the classify filter covers the cluster store/serve modules and the new test files. The server smoke drops its name filter — every test in the s3 target is bucket-gated, and a filter matching nothing passes vacuously (which silently ran zero tests for a while). Docs: deployment.md gains the Bucket-no-volume shape as the preferred cloud deployment; cluster.md/server.md document --cluster <uri>; testing.md maps the new suite. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
58855c0a7c |
feat(cluster,server): inline policy content + config-free --cluster URI boot
Two serving changes that complete RFC-006's read side:
ServingPolicy carries the policy bundle CONTENT (digest-verified at
snapshot read) instead of a blob path — the catalog may live on object
storage, and the server must not re-read mutable state after the
snapshot. The server grows a PolicySource enum: File for omnigraph.yaml
deployments (unchanged), Inline for cluster boots, wired through
PolicyEngine::load_{graph,server}_from_source.
read_serving_snapshot_from_storage(uri) reads the applied revision
straight from a storage root, and --cluster accepts a scheme-qualified
URI (s3://bucket/prefix): config-free serving — a serving box needs only
the URI and credentials; the ledger and catalog on the bucket ARE the
deployment artifact. Bare paths keep the config-directory behavior.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
||
|
|
7af3697397
|
Merge pull request #193 from ModernRelay/refactor/cli-modularize
refactor(cli): modularize main.rs and the test monolith — pure code movement |
||
|
|
d5e75df272 |
refactor(cli): split the test monolith into command-area suites
tests/cli.rs (4,548 lines, 112 tests) becomes five area files — cli_cluster (24), cli_cluster_e2e (10, the spawned-binary lifecycle compositions), cli_data (49), cli_schema_config (16), cli_queries (13) — with the file-local helpers joining the existing tests/support harness. Verbatim moves + visibility bumps; 161 crate tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
916015c416 |
refactor(cli): split main.rs into cli/helpers/output modules
Verbatim moves: the clap surface (every command/subcommand/arg struct) to cli.rs, resolution helpers (config/actor/graph/branch/query, remote HTTP, env/token, scaffolding) to helpers.rs, human/JSON formatting to output.rs, the in-source test mod to main_tests.rs via #[path]. main.rs (1,184 lines) keeps main() and the dispatch match. Visibility bumps only; 22 binary tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
127440d873 |
refactor(server): split lib.rs into handlers and settings modules
Verbatim moves: route handlers + bearer-auth middleware + per-request authorization + the cluster-prefix OpenAPI rewrite go to handlers.rs; settings resolution (omnigraph.yaml/CLI/env, mode inference, bearer-token sources, runtime-state classification) and its in-source test mod go to settings.rs. lib.rs (1,158 lines) keeps the public types, app/router assembly, and serve(). The ApiDoc derive references handlers::-qualified paths; the one multi-line utoipa attribute the cut orphaned was relocated with its handler. 289 crate tests green, OpenAPI drift check included. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
b036073ec6 |
refactor(server): split the test monolith into area suites
tests/server.rs (6,517 lines, 110 tests) becomes seven area files — auth_policy, data_routes, schema_routes, stored_queries, multi_graph, boot_settings, s3 — with shared helpers in tests/support/mod.rs. Verbatim moves + visibility bumps (pub on helpers, pub(super)->pub inside the matrix harness); cargo fix stripped the per-file unused imports. All 110 tests pass in their new homes (289 across the crate including lib and openapi). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
f6ae3e4fa3 |
fix(cluster): lock release must complete before a CLI process exits
Caught by the first live s3 smoke: StateLockGuard's spawned async delete dies with the runtime when a short-lived CLI process exits right after the command — import's lock survived into the next command as state_lock_held. On the multi-thread runtime (the CLI, and the gated s3 tests) block_in_place waits for the delete to complete; current-thread runtimes keep the spawn fallback with force-unlock as the documented recovery, same as a crash. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |