Conflict resolutions:
- cli/main.rs: keep the omnigraph_api_types import (branch extraction) and
add main's omnigraph_cluster import; test-import list takes the branch's
ResolvedCliGraph/is_remote_uri additions
- Cargo manifests: union of deps — branch's config/queries/api-types crates
plus main's cluster crate; all versions unified at 0.6.2 (new crates bumped
from 0.6.1)
- cli/Cargo.toml: omnigraph-server dep stays dropped (branch decision; CLI
references it only in comments and the test harness binary spawn)
- AGENTS.md: 0.6.2 + union crate list
- cli-reference.md: branch's --graph/layered-config sections + main's
cluster/repair rows
- Cargo.lock regenerated
cargo test --workspace --locked passes; scripts/check-agents-md.sh passes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(unique): collision-free tuple key shared by intake and merge, loud on un-keyable types
Hardening on top of #133. That PR introduced a shared
`loader::composite_unique_key(parts)` joining per-column scalars with U+001F
and routed both intake and branch-merge through it, closing the original
'|' vs U+001F separator drift. This takes the shared keying the rest of the
way to correct-by-design:
- Collision-free by construction: the key is now the tuple of per-column
scalar strings (Vec<String>) keyed directly, no separator, so no data value
(not even a literal U+001F) can forge a collision.
- One scalar converter across both paths: intake used an explicit type-match,
merge used Arrow's array_value_to_string. Both now derive the key through
composite_unique_key(group_columns, row), so they can't drift on conversion.
- Loud on un-keyable types: the scalar converter returned None for any Arrow
type it didn't recognize, and the caller treated None as null-exempt, so a
@unique on a column type it couldn't reduce (list, blob) was silently
un-enforced. It now returns Err, surfacing the constraint it can't enforce
instead of weakening it in silence.
Tests:
- consistency::composite_unique_key_is_consistent_across_intake_and_merge pins
that intake and merge key the tuple identically (load-on-branch then merge
of values containing '|').
- loader unit tests pin tuple keying + null exemption and the loud error on an
un-keyable (binary) column.
Docs: invariants truth-matrix updated; stale loader/mod.rs line pointers fixed.
Scope unchanged: intra-batch / merge-candidate-set only; cross-version
uniqueness against committed rows stays a documented gap.
* fix(unique): cover all string encodings; make format_tuple private (PR #160 review)
Addresses two Greptile P2 comments on PR #160:
- unique_key_scalar handled only StringArray (Utf8). The loud-on-unknown-type
behavior turned any legal string column that read back as LargeUtf8 or
Utf8View into a hard write failure (the old code silently returned None). Add
LargeStringArray and StringViewArray arms so a legal string column is keyable
in every physical Arrow encoding; the Err path now fires only for a genuinely
un-keyable logical type (list/blob/vector), never a legal value in an
unenumerated encoding.
- format_tuple was pub(crate) but only used within loader/mod.rs; make it a
private fn (matches the old format_unique_columns it replaced, minimal
exposed surface).
New unit test unique_key_scalar_handles_all_string_encodings pins that Utf8 /
LargeUtf8 / Utf8View all render rather than error.
* perf(engine): route Expand node hydration through the id BTREE via structured filter
hydrate_nodes built an `id IN (...)` SQL string applied via Scanner::filter,
which DataFusion evaluates with InListEval (O(N×M)) rather than using the id
BTREE scalar index — measured at 72× the indexed cost on a 100k-node hop
(MR-376). Build the id IN-list as a structured DataFusion Expr, AND it with
the pushable destination filters, and apply via Scanner::filter_expr (the same
path execute_node_scan already uses); Lance then compiles it to
scalar-index-search -> take.
Destination-filter pushability is now decided by ir_filter_to_expr (structured)
instead of ir_filter_to_sql, so list-contains (array_has) pushes down too.
Removes the now-dead string-filter helpers build_lance_filter, ir_filter_to_sql,
and ir_expr_to_sql; literal_to_sql stays (still used by the mutation delete path).
* feat(engine): add TableStore::scan_edges_by_endpoint for indexed neighbor lookup
Static helper returning edge rows that match a set of endpoint keys on src/dst,
projected to [key_col, opposite_col], via a structured `key_col IN (keys)`
filter_expr. Lance routes it through the persisted BTREE on the endpoint column
(index-search -> take), so cost scales with the frontier size rather than |E|.
Unused until execute_expand's indexed mode lands; isolated in its own commit so
the storage-layer primitive is reviewable on its own.
* feat(engine): add BTREE-indexed Expand traversal path
Split execute_expand into a dispatcher over execute_expand_csr (the existing
in-memory CSR BFS, unchanged) and a new execute_expand_indexed that serves each
hop by batching the frontier into one scan_edges_by_endpoint call against the
persisted src/dst BTREE (index-search -> take), then fans out per source row.
Both share expand_hydrate_and_align — the destination hydration + alignment +
hconcat + in-memory non-pushable filters — which now aligns by string id (a
HashMap) instead of a dense row-id vec, so one tail serves both modes.
Mode selection is OMNIGRAPH_TRAVERSAL_MODE for now (default csr); the
frontier-size auto policy and lazy CSR build follow. AntiJoin stays on CSR.
tests/traversal_indexed.rs (its own #[serial] binary, so env writes never race a
reader) asserts the indexed path matches CSR for one-hop, multi-hop, cross-type,
and no-match cases, and that a freshly-appended unindexed edge is still found
(partial index coverage — fast_search=false unindexed-fragment scan).
* feat(engine): frontier-size Expand dispatcher + lazy CSR build
Replace the env-only mode switch with an auto policy: Expand uses the
BTREE-indexed path when the source frontier is small and the hop count bounded
(OMNIGRAPH_EXPAND_INDEXED_MAX_FRONTIER=1024, OMNIGRAPH_EXPAND_INDEXED_MAX_HOPS=6),
else the in-memory CSR. OMNIGRAPH_TRAVERSAL_MODE=indexed|csr still forces a mode.
Make the CSR index lazy: thread a GraphIndexHandle (memoizing OnceCell over a
Cached/Direct/None builder) through execute_query/execute_pipeline/
execute_rrf_query/execute_anti_join instead of a pre-built Option<&GraphIndex>.
A query served entirely by the indexed path with no AntiJoin never pays the
O(|E|) CSR build — the perf win of Tier 3. AntiJoin still realizes the index
(its negation uses CSR has_neighbors).
Net effect: selective traversals (the common case) skip the whole-graph CSR
build and resolve neighbors from the persisted, incrementally-maintained
src/dst BTREE. Existing traversal/aggregation/end_to_end/search suites now run
the indexed path by default and stay green.
Docs: constants.md (new env knobs), query-language.md (Expand dual path),
indexes.md (graph index is lazy + the indexed alternative).
* test(engine): bench indexed vs CSR selective traversal
Add a selective single-source knows{1,2} comparison to bench_expand: per growing
|E|, time the cold query in csr vs indexed mode (fresh db each, so CSR pays its
O(|E|) build) and assert both modes return identical rows — a guard against the
scalar-index physical_rows silent fallback dropping unindexed-fragment rows. The
existing dense hop1/2/3 latency bench is unchanged.
* feat(engine): surface silent scalar-index fallback in indexed traversal (C6)
Add TableStore::key_column_index_coverage — a metadata-only check (no IO) of
whether a `key_col IN (...)` scan will be served by the persisted BTREE or
silently fall back to a full filtered scan, mirroring Lance's own decision:
no BTREE on the column, or any fragment missing physical_rows (which disables
scalar indices for the whole scan, lance dataset/scanner.rs create_filter_plan).
execute_expand_indexed calls it once per traversal and tracing::warn!s on
Degraded, so the perf cliff is observable instead of hidden behind a bench oracle.
Detection-only: results are correct either way (the scan returns all rows). Closes
the "no silent failures" gap the traversal best-practice audit flagged as the top
deviation, and adds an IndexCoverage value a future cost-based planner can consume.
* perf(engine): dense-id BFS on the indexed traversal path (C3)
execute_expand_indexed ran its per-source BFS in string space
(Vec<HashSet<String>>, HashMap<String,Vec<String>>, ~4 String clones per neighbor
occurrence). Intern node ids to u32 once via a per-traversal TypeIndex (no
GraphIndex/CSR build — laziness preserved) and run visited/seen/frontier/
neighbor-map in dense u32 space, mirroring the CSR path; de-intern only for the
per-hop IN-list and the emitted dst ids handed to the hydrate+align tail.
Behavior-preserving — the traversal_indexed CSR-vs-indexed equivalence tests are
the guard (results are identical, the key type just changes String -> u32).
* refactor(engine): thread the opened edge dataset into indexed Expand
Hoist the edge-dataset open and the C6 index-coverage warning out of
execute_expand_indexed into execute_expand, threading the opened dataset in
as a parameter so it is opened exactly once. Extract the endpoint-column
mapping (endpoint_columns) and the coverage warning (warn_on_degraded_coverage)
as helpers.
Behavior-preserving: same dataset, same warning, same dispatch decision. This
only relocates the open so the upcoming cost-based chooser can consult index
coverage before dispatch without opening the dataset twice.
* feat(engine): cost-based Expand dispatch chooser (C5)
Replace the fixed frontier<=1024 && hops<=6 dispatch threshold with a pure,
IO-free cost model. choose_expand_mode compares the indexed path's
frontier-relative work (hops * frontier * fanout, or hops * |E| when BTREE
coverage is degraded) against the cost of building the whole-graph CSR
(BUILD_FACTOR * |E|), from cheap manifest row counts. Under good coverage this
reduces to a selectivity ratio independent of |E|, preserving the flat-in-|E|
indexed win for selective traversals while routing dense / deep / high-fanout
or degraded-and-expensive traversals to CSR.
execute_expand decides cardinality-first and only opens the edge dataset to
confirm coverage when it leans indexed (no open on a clearly-CSR traversal).
The two env knobs become hard ceilings layered on the model; the
OMNIGRAPH_TRAVERSAL_MODE override still forces a path; the chosen mode is
traced. Results are unchanged across modes — only the path differs.
Adds inline crossover unit tests and extends the traversal_indexed both_modes
harness with an auto pass asserting the chooser is result-preserving across
every traversal shape. Documents the new flag semantics in
docs/user/{constants,query-language}.md.
* test(engine): pin Lance scalar-index coverage + system-column/deletion-metadata surface
Add three Lance surface guards de-risking a future persisted-adjacency cache:
- a compile-only guard pinning the fragment physical_rows + index-detail
surface that key_column_index_coverage mirrors (the C6 fallback);
- a runtime probe confirming a scalar BTREE on the system column
_row_last_updated_at_version is not buildable via the normal create-index
path (the column is not in the user schema), so a version-column range delta
is not viable as drafted;
- a runtime probe confirming per-fragment deletion metadata
(deletion_file.num_deleted_rows) is available as cheap O(fragments) metadata,
the primitive a fragment-coverage delete model would rely on.
The probes turn the two largest substrate assumptions into green/red CI facts
before any cache work begins.
* test(engine): regression for cross-type id-collision in indexed traversal
A node id is unique only within a type, so a Person and a Company can share an
id string. A variable-length traversal over a cross-type edge (WorksAt) must
structurally stop after one hop. This test builds a graph where 'shared' is both
a Person and a Company id and asserts worksAt{1,2} returns only the one-hop
company. It fails today: the indexed path's single string interner de-interns
the hop-1 Company id back to the colliding Person id and runs a hop-2 scan that
matches that Person's edges, emitting a spurious second-hop company (indexed
["other","shared"] vs csr ["shared"]).
* fix(engine): structurally cap cross-type Expand at one hop
A cross-type edge cannot chain (e.g. a Company is not a WorksAt source), so a
variable-length traversal over one is structurally single-hop. Both traversal
paths now enforce this by capping max hops at 1 when from_type != to_type,
instead of relying on the hop-2 scan returning empty.
That reliance was a correctness hole on the indexed path: it interns every
endpoint string into one dense id space, so a cross-type id-string collision (a
Person and a Company sharing an id) let hop 2 de-intern a destination id back to
the colliding source-type id and match its edges, emitting rows the CSR path
never produces. With the cap the cross-type second-hop scan never runs, so the
shared interner can no longer alias across types. Turns the regression test
green (indexed == csr == ["shared"]).
* perf(engine): set-oriented filtered anti-join, remove per-row dispatch
execute_anti_join's filtered slow path sliced the outer batch to one row at a
time and re-ran the inner pipeline per row, so each 1-row inner Expand dispatched
to the indexed path — one Lance scan per outer row, while the CSR realized up
front sat unused.
Replace it with a set-oriented anti-semi-join: tag each outer row with a
synthetic index column, run the inner pipeline once over the whole frontier (the
tag survives Expand's hconcat and Filter's row-drop), then exclude outer rows
whose tag survived. The inner Expand now runs as a single set-at-a-time traversal
over the full frontier; config is read once per operator, not per row (the env
nit is mooted). A produced-but-untagged inner batch fails loudly rather than
silently keeping every row. Results are unchanged (the predicated-negation tests
exercise the path over a multi-row outer with dst-filters).
* test(engine): drop flaky wall-clock budget from the merge truth table
The 30s wall-clock assertion in merge_pair_truth_table flakes under parallel
test load: it tripped at ~31s in the full --test-threads=4 gate while passing at
~20s in isolation. A fixed time budget in a correctness test depends on machine
and parallelism, not correctness; elapsed is still logged for visibility, and a
real merge-perf regression belongs in a bench. The cell-count correctness
assertions (81 / 36 / 45) are unchanged.
* fix(engine): total deterministic ORDER via entity-key tie-break + NULL contract
apply_ordering used an unstable lexsort with no tie-break, so rows with equal
user-sort keys came out in a run-dependent order (the input order depends on
scan parallelism / upstream hashing) — making ORDER ... LIMIT non-deterministic,
a latent deny-list violation (no nondeterministic result ordering).
Append the bound entities' key columns (<var>.id, unique per row) in canonical
name-sorted order as ascending tie-breaks, giving a total, reproducible order
(and a deterministic top-N when ties straddle the LIMIT cutoff). NULL placement
(nulls_first = !descending) is unchanged and now documented as the contract.
New tests/ordering.rs locks descending, multi-key precedence, the deterministic
key tie-break (data loaded in a different order than the expected output, so it
proves the tie sorts by key not by load order), and NULL placement under ASC/DESC.
docs/user/query-language.md documents the total-order + NULL contract.
* test(engine): property-based query-correctness invariants over generated graphs
Adds a proptest harness (new dev-dep) that generates small graphs whose Person
and Company keys are drawn from a shared 5-key alphabet, so cross-type id
collisions, cycles, and self-loops arise by search rather than from one
hand-built fixture. Three invariants:
- prop_expand_indexed_eq_csr: csr == indexed == auto over knows{1,3} (same-type,
cycles) and worksAt{1,2} (cross-type, collision-prone) from every start.
- prop_results_subset_of_existing_nodes: no phantom rows (catches over-emission
even if both modes are wrong identically).
- prop_antijoin_partitions_persons: not{worksAt} and its complement are disjoint
and cover all persons.
Verified the guard bites: neutering the cross-type hop cap makes
prop_expand_indexed_eq_csr fail and proptest shrinks it to persons["c","e"] /
companies["b","c"] — the cross-type collision class the hand-built fixture
only sampled once. Tests are sync + #[serial] (per-case runtime; the mode test
writes OMNIGRAPH_TRAVERSAL_MODE).
* test(engine): cover cycle/self-loop termination + nested anti-join (C5 edge cases)
- variable_hops_terminate_and_dedup_on_cycle: a 3-cycle a->b->c->a traversed with
knows{1,5} (ceiling above the cycle length) terminates and emits each node once
(the c->a back-edge hits the seeded source); both_modes confirms indexed == csr.
Uses a bounded range deliberately — unbounded {1,} is a typecheck error, not a
runtime path.
- variable_hops_handle_self_loop: a->a self-loop does not loop forever and does
not re-emit the seeded source.
- nested_anti_join_double_negation: not { worksAt; not { name = Acme } } recurses
through execute_pipeline, yielding [Alice,Charlie,Diana] (people with no non-Acme
employer) — distinct from plain unemployed [Charlie,Diana].
* test(engine): execution goldens for typed-literal filters (C4 gap #4)
New literal_filters.rs covers filtering by F64/F32/Bool/Date/DateTime LITERALS
across both arms: standalone comparisons ($m.score > 1.5, $m.ratio <= 0.25,
$m.active = true, $m.born >= date(...), $m.seen < datetime(...)) exercise the
in-memory comparison path, and inline bindings (Metric { active: true },
Metric { score: 3.0 }) exercise Lance filter_expr pushdown. Seeds partition each
predicate so a dropped/miscast filter returns all rows. (Param-bound scalars and
list-column contains are covered elsewhere.)
* test(engine): full rank-order goldens for nearest + bm25 (gap #2)
Existing search tests stopped at top-1 (nearest) or non-empty (bm25), so a
regression corrupting ranks 2..k or reversing the sort direction passed CI
silently. Pin the FULL ordered slug list: nearest([0.1,0.2,0.3,0.4]) ->
[ml-intro, nlp-guide, rl-intro] (ml-intro exact at dist 0, rest by ascending
L2); bm25(Learning) -> [rl-intro, ml-intro, dl-basics] (descending score).
nearest/bm25 skip apply_ordering (is_search_ordered) and return Lance native
order, so result_slugs row order == rank order; values resolved by running and
confirmed stable across runs.
* test(engine): search fuzzy/match_text characterization + RRF non-default pairings
- match_text_matches_exact_set_excludes_unrelated: match_text(body,'neural') ==
[dl-basics] exactly (not just contains).
- fuzzy_does_not_match_under_default_tokenizer: characterizes that fuzzy() is
inert with the default tokenizer here (search/match_text work, fuzzy returns
nothing); turns red — to be promoted to a real golden — if fuzzy starts matching.
- rrf_fuses_two_fts_fields / rrf_fuses_two_vector_queries: RRF fuses arms other
than the default nearest+bm25 (bm25 title+body; two vector queries), proving
primary_var resolves and fusion runs. New fixtures/search.gq queries +
two_vector_params helper. Orders resolved by running, confirmed stable.
* test(engine): anti-join fast-vs-slow path equivalence harness
anti_join_fast_and_slow_paths_agree: the CSR has_neighbors fast path
(not { $p worksAt $_ }) and the set-oriented inner-pipeline replay (same
negation forced slow by an always-true $c.name != "" dst filter) must produce
the same result ([Charlie, Diana]). Closes the second real engine fork explicitly.
* test(engine): regression for nested slow-path anti-join tag collision
A nested not { ... not { ... } } where both levels hit the set-oriented slow
path collides on the fixed __antijoin_outer_row correlation column: the inner
call appends a duplicate, and column_by_name reads the OUTER tag. Fan-out (p1
works at two companies) makes inner row indices diverge from outer tags, so the
bug returns the wrong person set. Fails on current code (left ["p2","p4"] vs
right ["p3","p4"]).
* fix(engine): collision-free anti-join correlation tag for nested negation
The set-oriented anti-join tagged the outer batch with a fixed column name and
read it back by name. Under a nested slow-path anti-join the enclosing tag rides
through the inner pipeline, so the inner call produced a duplicate field; Arrow
permits duplicate names and column_by_name returns the first, so the inner
negation mis-correlated against the outer row indices.
Choose a tag name not already present in the batch (suffix-incremented), so each
nesting level reads its own correlation column. Turns the fan-out regression
green; the existing nested/fast-vs-slow/proptest anti-join invariants still pass.
* fix(engine): cap cross-type hops in the Expand cost model
gather_cost_inputs fed the requested max_hops into choose_expand_mode even though
execute_expand_indexed runs at most one hop for a cross-type edge. So a cross-type
variable-length expand (e.g. worksAt{1,5}) had its indexed cost scaled by 5 while
only one hop runs, skewing the chooser toward CSR (an unnecessary whole-graph
build) near the crossover. Results were unaffected (modes are equivalent); this
is a plan-accuracy fix.
Add cost_effective_hops(requested, same_type) — caps to 1 for cross-type — and
apply it in gather_cost_inputs so the estimate matches what executes. Unit test
covers the cap and the crossover consequence (capped 1 hop stays indexed where
the requested 5 would have flipped to CSR).
* perf(engine): realize anti-join CSR lazily + reuse a warm CSR in the chooser
Two CSR build/reuse fixes flagged on the set-oriented anti-join work (results
unchanged — plan/perf accuracy):
- execute_anti_join called graph_index.get() (the O(|E|) whole-graph CSR build)
unconditionally, but only the bulk fast path consumes it; a filtered/nested
slow-path anti-join's inner Expand picks its own access path. Gate the build on
a pure shape predicate (bulk_anti_join_applies) so a selective anti-join over a
large graph no longer pays a build it won't use.
- gather_cost_inputs hardcoded csr_cached=false, so once an earlier op realized
the CSR, later Expands still cost it as a cold build and could pick per-hop
indexed scans over reusing the warm in-memory CSR. Add GraphIndexHandle::
is_built() and thread it through so the chooser reuses a materialized CSR.
Anti-join, cross-type, proptest-equivalence, and chooser unit tests stay green.
* test(engine): RAII traversal-mode guard in proptest equivalence
prop_expand_indexed_eq_csr set/cleared OMNIGRAPH_TRAVERSAL_MODE manually; a panic
between set and clear (e.g. a query unwrap on a generated case) would leak the
forced mode into proptest's shrink/subsequent cases and mask the divergence under
test. Replace with a ModeGuard that clears on drop (including on unwind), scoping
the forced mode to a single query.
* test(engine): regression for multi-hop anti-join hop bounds
The bulk anti-join fast path answers via has_neighbors (one-hop existence), so
not { $p knows{2,2} $x } wrongly drops a node with a 1-hop neighbor but no
2-hop path. On a->b (sink) and c->d->e, only c has a 2-hop path; the query should
keep [a,b,d,e]. Fails on current code (left ["b","e"] — only the sinks).
* fix(engine): restrict anti-join bulk fast path to one-hop expands
bulk_anti_join_applies accepted any single Expand, but try_bulk_anti_join_mask
decides via the CSR has_neighbors one-hop existence check — wrong for multi-hop
negations. Require min_hops==1 && max_hops==1 in the predicate; anything else
falls to the slow path, whose inner Expand runs the real bounded traversal.
Turns the multi-hop regression green; one-hop anti-joins unchanged.
* fix(engine): IndexCoverage reports Degraded for uncovered fragments
key_column_index_coverage checked BTREE-exists + physical_rows but not that the
index actually covers the current fragments. Since edge-index creation is skipped
once a BTREE exists, fragments appended later stay unindexed while coverage still
reported Indexed — so the cost chooser priced a partly-full scan as fully indexed.
Compare the BTREE's fragment_bitmap (public on lance_table IndexMetadata) against
the dataset's current fragment ids; report Degraded when any are uncovered. A None
bitmap means Lance can't report coverage — don't over-degrade. Results are
unaffected (the scan returns unindexed-fragment rows either way); this corrects
the cost signal.
Test: a freshly-loaded edge BTREE is Indexed; after appending an edge the new
fragment is uncovered → Degraded. Surface guard pins IndexMetadata.fragment_bitmap.
* docs: clarify the Expand frontier ceiling bounds the initial dispatch frontier
The cap is applied at dispatch on the initial frontier; per-hop fan-out
(union_dense) is not hard-capped. Correct the constants.md and query-language.md
claims: the ceilings bound the initial-dispatch frontier/hops, the cost model
estimates total indexed work as ~hops*frontier*fanout (pricing dense fan-out
toward CSR), and per-hop work is not a hard bound. Drops the overstated 'hard
caps bound indexed work' / 'cost ∝ frontier' wording.
* fix(loader): enforce composite @unique(a, b) as a true composite key
Node/edge composite uniqueness constraints were flattened into a single
list of property names, so @unique(a, b) was enforced as independent
single-field checks @unique(a) AND @unique(b) at intake. Preserve the
constraint grouping and check each group as a composite key, mirroring
the merge-path enforcement. Error messages now name the full composite.
MR-983
* docs: clarify unit-separator comment in composite unique check
* docs: fix separator reference in composite unique comment (merge.rs also uses U+001F)
* fix(merge): align composite @unique key separator with intake (U+001F)
The branch-merge path (update_unique_constraints) joined composite key
columns with '|', while intake joins with U+001F. The same @unique(a, b)
was keyed two different ways, and '|'-join can raise phantom merge
conflicts for values containing '|' (e.g. ('x|y','z') vs ('x','y|z')).
Factor the tuple-join into one shared helper (loader::composite_unique_key)
so the intake and merge paths cannot drift again. Add branching regression
tests for edge @unique(src, dst) on the merge path.
Refs MR-983.
---------
Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-authored-by: Andrew Altshuler <andrew@collectivelab.io>
* docs(invariants): note the non-atomic manifest->commit-graph publish gap
Every graph publish commits __manifest then appends _graph_commits as two
separate writes; a crash between them leaves the manifest ahead of the commit
DAG. Live reads + durability are unaffected (reads resolve via the manifest) and
recovery does not repair it; impact is bounded to commit history / time-travel
by commit id / merge-base completeness. Pre-existing across all publishes, not
the optimize reconcile specifically. Documented as a Known Gap; the fix is a
commit-graph reconcilable from the manifest, not a recovery sidecar.
* fix(maintenance): route uncovered drift through repair
* fix(maintenance): harden repair review feedback
* test(optimize): cover manifest publish + HEAD-drift reconcile
Red against the pre-fix optimize, which ran compact_files without
publishing the compacted version to __manifest:
- maintenance: optimize must publish so the manifest table_version
tracks the compacted Lance HEAD and a later schema apply succeeds;
and must reconcile a pre-existing manifest-behind-HEAD drift (forged
via raw Lance compaction) so strict writes commit again.
- end_to_end + composite_flow: post-optimize query / strict update /
reopen in the full lifecycle (the canonical flow previously omitted
post-optimize writes as a documented "known limitation").
- failpoints: a crash between compaction and the manifest publish rolls
forward on next open.
* fix(optimize): publish compaction to manifest and reconcile HEAD drift
optimize ran Lance compact_files without publishing the new version to
__manifest, so the manifest table_version lagged the Lance HEAD: reads
stayed pinned to the pre-compaction version, and the next schema apply or
strict update/delete failed its HEAD-vs-manifest precondition with
"stale view ... refresh and retry" (open-time recovery rollback inflated
the gap on retry).
optimize now publishes each compacted table's version under the
per-(table, main) write queue, guarded by a manifest CAS and a
SidecarKind::Optimize recovery sidecar (loose-match; roll-forward is safe
because compaction is content-preserving). When a table has nothing left
to compact but its Lance HEAD is already ahead of the manifest pin
(pre-fix drift, or a recovery restore commit), optimize reconciles the
manifest forward to HEAD (metadata-only, no sidecar). Caches and the
CSR/CSC graph index are invalidated after a publish.
Docs updated (maintenance, storage, branches-commits, writes, testing).
* test(recovery): rollback convergence + optimize-defer regressions
Red against the current code, landed before the fix:
- recovery: after the open-time sweep rolls a sidecar back, the manifest
must track Lance HEAD (no residual drift) so a follow-up schema apply
succeeds — the original "+1 per retry" loop. Today roll-back restores
without publishing, so the manifest lags HEAD and the apply fails its
HEAD-vs-manifest precondition.
- maintenance: optimize must refuse while a recovery sidecar is pending —
operating on an unrecovered graph could publish a partial write the
sweep would roll back.
Also removes optimize_reconciles_preexisting_manifest_head_drift: the
ad-hoc drift reconcile it covered is replaced by recovery-side convergence.
* fix(recovery): converge manifest on roll-back; optimize defers on pending recovery
Root of PR #141's review findings and the original "+1 per retry" loop:
a Lance HEAD ahead of the manifest was ambiguous (benign content-preserving
drift vs. a partial write a sidecar will roll back), and optimize's reconcile
guessed it benign. Close the class instead of guessing:
- Recovery roll-back now PUBLISHES the restored version (via a
push_table_update_at_head helper shared with roll-forward), so the manifest
tracks the Lance HEAD after recovery — symmetric with roll-forward. This
fixes the +1 loop (after one roll-back the retry's HEAD-vs-manifest
precondition passes) and removes the only remaining source of orphaned
drift. The audit still records the logical rolled-back-to version; the
manifest is published at the restore commit (identical content).
- optimize drops the ad-hoc drift reconcile and instead REFUSES when a
__recovery sidecar is pending, so it only ever operates on a recovered
graph (manifest == HEAD); its compaction publish can no longer commit a
partial write. With the reconcile gone, the blob-skip-vs-reconcile gap is
moot.
Updates the rollback recovery-test helper (manifest == HEAD after roll-back),
the failpoints assertions, and the user/dev docs.
* test(recovery): fix rollback assertion for manifest convergence
The roll-back-publishes change makes the manifest version advance after a
SchemaApply roll-back (to the old-schema content), so the
schema_apply_without_schema_staging_rolls_back_on_next_open assertion must
be `version > pre`, not `version == pre`. This update was dropped during
the commit churn and surfaced as a CI Test Workspace failure; the
old-schema-preserved intent stays covered by count_rows + _schema.pg + the
RolledBack convergence invariant.
The loader read input line-by-line (reader.lines() + serde_json::from_str per line), so any delta where a JSON object spanned multiple lines failed with 'invalid JSON on line 1: EOF while parsing an object'. Compact JSONL worked; pretty-printed JSON never did.
Switch to a streaming value deserializer (Deserializer::from_reader().into_iter::<Value>()), which treats any whitespace (including newlines inside objects) as a separator — so both compact JSONL and pretty-printed JSON load. Error labels switch from line numbers to record numbers (line numbers are meaningless once objects span lines).
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com>
* feat(engine): sweep legacy __run__ branches via v2→v3 manifest migration
Pre-v0.4.0 graphs can carry stale `__run__<id>` staging branches on the
`__manifest` dataset, left by the Run state machine removed in MR-771. Lance's
`list_branches` still enumerates them, so they leak into `branch_list()` and
count as blocking branches at schema-apply time.
Add a one-time `migrate_v2_to_v3` arm to the internal-schema dispatcher: on the
first read-write open it enumerates `__manifest` branches, deletes every
`__run__*` ref, and bumps the stamp to 3. Idempotent under retry (re-enumerates
fresh each run). The `"__run__"` prefix is inlined so the migration does not
depend on the run_registry guard that MR-770 removes next.
This is the prerequisite sweep; the guard removal follows in the next commit.
* refactor(engine): remove the legacy __run__ branch guard (MR-770)
With the v2→v3 migration sweeping stale `__run__*` branches off `__manifest`
on first read-write open, the defense-in-depth `is_internal_run_branch` guard
is no longer needed.
- delete `db/run_registry.rs`; drop the module + re-export from `db/mod.rs`
- collapse `is_internal_system_branch` to the schema-apply-lock check only
- `ensure_public_branch_ref`: drop the run-ref rejection; `__run__*` is now an
ordinary branch name
- `branch_merge`: reject `is_internal_system_branch` (was run-only) so the
schema-apply lock is rejected consistently with create/delete — a small,
deliberate tightening
- update the inline schema-apply test + the writes integration tests
(`public_branch_apis_reject_internal_run_refs` →
`public_branch_apis_reject_internal_system_refs`, which also asserts
`__run__*` now creates successfully)
- docs: flip the "pending production sweep / defense-in-depth" notes to
"auto-swept by the v2→v3 migration"; document the read-only-open limitation
Known residual: the inert `_graph_runs.lance` / `_graph_run_actors.lance` bytes
remain until a `StorageAdapter::delete_prefix` primitive lands.
* fix(engine): run __run__ sweep at Omnigraph::open, not only on publish
Review (PR #132) caught a regression: removing __run__ from
`is_internal_system_branch` exposed legacy `__run__*` branches to the
schema-apply blocking-branch checks (schema_apply.rs:104 and :778) and to
`branch_list()`, but the v2→v3 sweep ran only inside the publisher's
`load_publish_state`. On a pre-v0.4.0 graph whose first write is a schema
apply, the blocking-branch check fires before any publish, so apply failed
with "found non-main branches: __run__…". The same lazy timing also created a
reverse hazard: a user-created `__run__*` branch on a still-v2 graph could be
deleted by the first publish's sweep.
Fix: run the internal-schema migration in `Omnigraph::open(ReadWrite)` (new
`manifest::migrate_on_open`), before the coordinator reads branch state. The
sweep now lands before any branch-observing code, and a graph is stamped v3 at
open — so the one-time sweep can never catch a legitimately-created branch.
Both checks and `branch_list` see the swept graph; correct by construction for
every write path.
Accepted residual: a read-only open of an unmigrated legacy graph still lists
`__run__*` (read-only opens must not write, so they can't sweep). Documented.
Regression test `legacy_run_branch_is_swept_on_open_and_does_not_block_schema_apply`
confirmed RED before the fix (panicked on the branch_list leak assertion) and
GREEN after. Also updates the stale schema_apply.rs comment, the writes.md
"Migration code" section, and adds the v3 row to storage.md's migration table.
* test(engine): sweep multiple legacy __run__ branches; doc nit
Strengthen the v2→v3 migration test to synthesize three `__run__*` branches
(a real legacy graph accumulates one per run) so the migration's delete loop
is exercised on a single reused dataset handle, not just a single branch.
Confirms multi-branch deletion is safe.
Also drop a stale "active runs" reference from the branch_delete doc line.
* fix(engine): force-delete in __run__ sweep for concurrency safety
`migrate_v2_to_v3` ran `Dataset::delete_branch` (= `branches().delete(.., false)`),
which errors "BranchContents not found" if the branch is already gone. Since the
sweep now runs in `Omnigraph::open(ReadWrite)`, two processes opening the same
legacy v2 graph concurrently would race: one wins each delete, the other's open
fails. The migration only claimed idempotency under *sequential* retry.
Switch to `Dataset::force_delete_branch` (= `delete(.., true)`), Lance's
documented path for cleaning up zombie branches, which tolerates an
already-absent branch. The sweep is now idempotent under concurrent runners and
robust to partial/zombie state. Found in self-review; no behavior change for the
common single-open path.
* docs(release): note MR-770 __run__ cleanup in v0.6.1
* docs(branches): reconcile branch cleanup semantics
Merge `version` now follows the highest loaded-from-file layer (project over global),
like every other scalar, instead of 'any layer that set it' — so a legacy project under
a v1 global reports the project's (legacy) version. Document the config view --resolved
output, the --resolved/--show-origin exclusivity, and the OMNIGRAPH_CONFIG-missing error
in cli-reference.md.
cli()/cli_process() set OMNIGRAPH_HOME but inherited OMNIGRAPH_CONFIG (which wins over
HOME) and XDG_CONFIG_HOME, so a developer/CI with those exported got non-hermetic tests.
env_remove both so the harness fully owns the config-resolution env; tests that exercise
global-first still set OMNIGRAPH_CONFIG explicitly per-invocation.
config view --resolved --show-origin silently dropped --show-origin; add clap
conflicts_with so the parser rejects the combo loudly. Remove the unused `server`
field from ActiveContext (parsed, always written None, never read) — server-qualified
active context is a V2/V3 concern; don't ship an unfulfilled field.
An explicitly-set OMNIGRAPH_CONFIG pointing at a missing path was silently dropped
(no global layer, confusing downstream 'graph not found'). Bail naming the file, so an
explicitly-named global config is required like --config; the default ~/.omnigraph path
stays optional.
resolve_all_paths_in_place absolutized the derived entry.uri but left the raw
Storage::Bare/Block uri relative, so config view showed `uri: /abs` beside
`storage: ./rel` and a future reader of storage.uri (V2 region/endpoint threading)
would mis-resolve. Resolve the storage uri too. The eager test now exercises the
storage block form — the coverage gap that let this slip.
config view --resolved hand-listed locator fields and dropped an embedded graph's
region/endpoint/policy_file/selected — exactly the fields one debugs an S3 endpoint or
per-graph policy with. Derive Serialize on GraphLocator (internally tagged `kind:`) and
serialize+prune in print_resolved_locator, so every field is reported and a new field
can't be forgotten by hand.
merge_layers destructured OmnigraphConfig with a `..` rest-pattern, so a config
field added later would be silently dropped from the cross-layer fold (a drift trap).
Bind every field explicitly (intentionally-ignored ones as `_`) so a new field fails
to compile here until it is dispositioned — the compiler is the guard.
Wrap the global/project layer loads with file context (in global config <path> /
in project config <path>) so a strict v1 error names which file is at fault — a
loud, attributable failure across layers (invariant 13). The single-layer
load_config_in path keeps its raw message (one obvious file). Pairs with the
layer-prefixed deprecation warnings the layered loader already surfaces.
Add omnigraph use <graph>: validate the graph resolves in the merged config (loud
fail otherwise), then write ~/.omnigraph/state/active.yaml ({graph}). The layered
loader reads it as the State layer between Global and Project, so a bare command
targets the active graph — Project still overrides State, State overrides Global.
The State layer is synthetic (sets defaults.graph only) and raises no version/legacy
warnings.
Add config view [--resolved] [--show-origin] [--json] [<graph>]: prints the merged
layered config (null/empty values pruned for readability), or with --show-origin the
layer each field came from (sorted, deterministic), or with --resolved <graph> the
typed GraphLocator (embedded vs remote). Suppress the always-empty legacy
project:/server: blocks on serialize.
Add load_layered_config: load the global ~/.omnigraph/config.yaml layer under the
project ./omnigraph.yaml (or --config) layer and merge them, returning the merged
config, its provenance, and per-layer deprecation warnings. The CLI's load_cli_config
now uses it — so a graph/server/defaults defined only in the global config is usable
from any directory with no project file. The server stays single-layer (a deployment
manifest must not pick up ambient $HOME state).
Global and cwd-default files are optional (absent = no layer); an explicit project
--config still errors if missing. base_dir stays the highest loaded layer's config
dir so relative ad-hoc --query paths resolve as before. The CLI test harness pins
OMNIGRAPH_HOME to an empty temp dir so tests never read the developer's real global
config.
Add merge_layers(Vec<LoadedLayer>) -> (OmnigraphConfig, Provenance): folds parsed
config layers low->high into one merged config plus a dotted-path origin map.
Settings-objects deep-merge per leaf (a higher layer's Some/non-empty wins, None
inherits); named-resource maps (servers/graphs/aliases/queries) union by key with
a higher layer's entry replacing the lower wholesale (no intra-entry bleed); lists
and scalars replace. Provenance is a sorted BTreeMap side-table, so the merged
OmnigraphConfig shape and all its accessors stay unchanged — only config view reads
it. Pure structure: every layer is already version-gated and path-resolved before
merge. Not yet wired into loading.
Add resolve_all_paths_in_place, run at the end of load_single_layer (after
normalize_graphs fills entry.uri), rewriting every relative path/URI field —
graph uris, per-graph + top-level policy/queries files, serve.policy, auth.env_file,
query.roots — to absolute against that layer's base_dir. A scheme URI passes
through. This makes a layer self-contained so the upcoming merge composes
absolute-only configs and never mis-resolves a path against the wrong layer's dir.
Factor the join logic into resolve_uri_against/resolve_path_against, reused by the
existing resolve_config_* methods. Behavior-preserving: the per-field resolvers
pass already-absolute values through to the same result.
Add global_config_dir()/global_config_file() — RFC-002 §5 global-first resolution:
OMNIGRAPH_CONFIG (explicit file) > OMNIGRAPH_HOME (dir) > $XDG_CONFIG_HOME/omnigraph
> ~/.omnigraph, with config.yaml as the file. The resolution is factored through
an injected-env form so it is hermetically testable without mutating process env.
Not yet wired into loading — that is the layered loader.
Split the body of load_config_in into a load_single_layer(cwd, path) helper that
loads and fully processes one config file (version-gating, legacy scan, normalize)
— the seam the upcoming layered loader needs — while load_config_in keeps its exact
single-layer semantics (an explicit --config still errors on a missing file).
Add the precedence-ordered Layer enum (Default < Global < State < Project) and a
Provenance side-table newtype for the merge engine, plus the dirs dependency
(already in the lock tree via lance, so zero new crates). No behavior change.
Top-level `policy:`/`queries:` (the single-graph-mode blocks) are removed under
`version: 1` — policy and stored queries belong on the owning `graphs.<name>`
entry. Both are rejected at load (pointing at the per-graph block) and honored-
but-warned under the legacy schema; the per-graph `graphs.<name>.policy`/`queries`
blocks are nested, so the key scan leaves them untouched. Drops the now-illegal
empty top-level `policy: {}` from the shared CLI test helpers.
Introduces the `serve:` host-role block in its RFC-002 shape — `graphs:` is a
served-set list, replacing the single `server.graph` scalar. The legacy
`server:` block is folded into `serve:` under the legacy schema and rejected
under `version: 1` (pointing at the new spelling, noting the scalar->list
change). Serving a true subset (`serve.graphs` with more than one entry) is
rejected with a route-unification hint; one entry (single-graph) and none
(serve all) preserve today's behavior.
Renames the accessors (server_* -> serve_*) and repoints the server boot call
sites; migrates the init scaffold to `serve:`. `servers:` (the remote endpoint
map) is deliberately unaffected — the key scan is exact-match.
`defaults:` is the canonical CLI/client-defaults block. The legacy spelling
`cli:` is accepted as a serde alias and honored under the legacy schema, but
rejected under `version: 1` (pointing at the new spelling) and flagged by a
deprecation warning. Generalizes the version-gated key scan into
`legacy_top_level_keys`, which now drives both the v1 rejection and the legacy
warnings via a shared migration-hint table. Renames the config accessors
(cli_* -> default_*) and repoints the CLI call sites; migrates the init
scaffold, the example config, and the shared test helpers to `defaults:`.
Add a raw top-level key-presence scan (reject_legacy_top_level_keys_under_v1)
so the v1 schema can reject known-but-removed legacy keys that serde_ignored
cannot surface (they stay struct fields for legacy parsing). The first such
key is `project:` — it has no consumer; it is rejected under `version: 1`
(naming the key) and stays honored-but-warned under the legacy schema.
Drop `project:` from the `omnigraph init` scaffold so generated configs load
clean under v1.
L6 renamed the flag but left stragglers. Two were functional: the
`resolve_target_uri` "URI must be provided …" error still named `--target`, and
`docker/entrypoint.sh` passed `--target` to omnigraph-server (which now only
accepts `--graph`) — so the container failed to boot. Both fixed (plus the
entrypoint smoke test's expected args). The rest are code comments across the
config/server/cli crates and tests, and the cheat sheet, swept `--target` →
`--graph`. `--target-branch` (policy explain) is a distinct flag and untouched;
past release notes keep `--target` (accurate for those versions).
Surface the config deprecation notices at load: the CLI prints them via
`eprintln!` in `load_cli_config`; the server logs them via `tracing::warn!` in
`load_server_settings` (a no-op under tests with no subscriber). `omnigraph init`
now scaffolds a `version: 1` config using `storage:` instead of the legacy `uri:`.
Migrate the shared test fixtures (`support::local_yaml_config` →
`version: 1` + `storage:`; `remote_yaml_config` → `version: 1` + `servers:`/`server:`)
to the current schema so they don't trip the new warnings; the resolved `.uri` is
unchanged, so the tests behave identically. The no-`version:` notice is gated on a
loaded config file, so commands with a bare URI and no `omnigraph.yaml` stay quiet
(verified by the existing command-deprecation tests, which still pass). Extends the
`init` test for the scaffold shape and adds a legacy-config-warns test.
Add a pure `OmnigraphConfig::deprecation_warnings() -> Vec<String>` so the CLI and
server can surface load-time migration notices while the config crate stays
stdio-agnostic. It flags a config with no `version:` (the legacy schema — gated on
a new `loaded_from_file` so there's nothing to warn about when no `omnigraph.yaml`
exists) and any graph still using the legacy `uri:` key (vs `storage:`/`server:`).
Unused until the next commit wires the emit sites. 4 config tests added.
The graph-selection flag is now `--graph` (the canonical noun, matching
`graphs:`/`cli.graph`) with NO `--target` compat alias — a clean break, to avoid
two-names-for-one-thing confusion. Applied across the 21 `omnigraph` subcommands
and the `omnigraph-server` binary (the field stays `target` internally;
`value_name = "GRAPH"` keeps help text clean). Updates the user-facing "no graph to
serve" and stored-query selection-hint error messages, plus the two cli.rs tests
that drove `--target`.
Breaking: `--target` is removed (now an unknown-flag error); use `--graph`. `uri`
is unchanged (positional or `--uri` per command — deliberately not folded into a
shared arg, since the two uri patterns can't share one flattened definition without
a breaking uri-form change). `--target-branch` (policy explain) is unrelated and
untouched. Code comments and docs still referencing `--target` are swept in L8 with
the docs.
omnigraph-server serves embedded graphs only (RFC-002 §3). Classify each served
graph with the typed `config.resolve_graph(...).is_remote()` — the same locator
classifier the CLI dispatches on — and `bail!` early with a clear "embedded
graphs only" error if a graph sets `server:` or a remote `http(s)://` `uri:`,
instead of accepting it and failing confusingly later at `Omnigraph::open`.
Both modes covered: Single (hoist `selected` above the URI resolution and check
before `cli_uri` is moved) and Multi (per entry, after the `GraphId` check). A new
`ensure_embedded` helper keeps the message in one place. No config-crate change
(reuses existing public `resolve_graph`/`is_remote`); embedded `storage:`/`uri:`
entries pass through unchanged (positive guard test added).
Updates the `server_settings_can_resolve_named_target` lib test, which asserted
the old behavior (server resolves a remote named target); it now uses an embedded
target — the remote case is rejected and covered by
`single_mode_rejects_named_remote_graph`. Turns the previous commit's four red
tests green.
omnigraph-server serves embedded graphs only (RFC-002 §3); a graph with
`server:` or a remote `http(s)://` `uri:` must be refused at config load. These
four boot tests (multi `server:`, multi remote `uri:`, single positional URI,
single named remote via server.graph) are RED today: load_server_settings
returns Ok for a remote entry — it builds a `Single`/`Multi { uri: "https://…" }`
and only fails later at `Omnigraph::open`. The next commit gates this.
Convert the 6 commands that still scheme-sniffed a raw URI — commit list, commit
show, schema show, snapshot, export, graphs list — from `resolve_uri` +
`is_remote_uri(&uri)` to `resolve_cli_graph` + `graph.is_remote()`, matching the
pattern the other 8 commands already use (`let uri = graph.uri.clone()`). The
scheme sniff is no longer a dispatch decision; `is_remote_uri` now survives only
inside `normalize_policy_graph_uri` (Cedar id), and `resolve_uri` only inside
`resolve_cli_graph` and the local-only optimize/cleanup paths.
Behavior-preserving on the open path: local branches keep plain `Omnigraph::open`
(these are reads — unlike writes, they don't attach policy). One intentional
consistency delta: routing through `resolve_cli_graph` means these commands now
apply the same top-level-block coherence check every other command applies — but
only for a named-graph invocation with a populated top-level `policy`/`queries`
block; positional invocations are byte-identical. No HTTP/API change.
Add an equivalence test asserting `ResolvedCliGraph::is_remote()` (the typed
locator discriminant) matches the old `is_remote_uri(graph.uri)` for every
address shape that flows through `resolve_cli_graph`: a `server:` graph, a legacy
remote `uri:` (with the `/graphs/{gid}` suffix), an embedded `uri:`, and positional
http(s)/s3 URIs. Extend the two existing graph_identity tests with `is_remote()`
assertions. `srv/gid` is omitted — it bails at `resolve_graph_selection` and is
unwired until V2.
Make embedded-vs-remote a typed discriminant: `ResolvedCliGraph` now carries the
`GraphLocator` from `config.resolve_graph` (replacing the `is_remote: bool`
scheme-sniff field), exposed via an `is_remote()` accessor. The 9 dispatch reads
switch from `graph.is_remote` to `graph.is_remote()`.
Behavior-identical: the engine `uri`, Cedar `graph_id`, `selected`, and
`policy_file` keep their exact current computation (`resolve_target_uri` /
engine-normalized `normalize_policy_graph_uri` / `resolve_graph_selection`
coherence check) — the locator's `uri`/`graph_id`/`endpoint` DIVERGE from those
(base_dir-join, `file://` retention) and are V2 fields, so only `is_remote()` is
read here. A doc comment records that trap.
`is_remote_uri`/`resolve_uri` and the 11 raw-uri dispatch sites are untouched
(their conversion + deletion is L4). No HTTP/API change; openapi.json unchanged.
`#[serde(deny_unknown_fields)]` was applied unconditionally as a struct
attribute, so the `version` discriminator did not actually gate strictness: a
legacy (no-`version:`) config was wrongly rejected for any unrecognized key,
contradicting the documented contract (RFC-002 §3: no version = legacy-lenient,
`version: 1` = strict) and the code's own comments.
Make strictness a function of `version`, decided at one point in
`load_config_in`: parse once via `serde_ignored` (collecting ignored fields at
all depths), then reject unknowns only under `version: 1`; legacy tolerates them
(restoring the pre-existing behavior). Drop `deny_unknown_fields` from the two
legacy-spanning structs (`OmnigraphConfig`, `TargetConfig`) and keep it on the
v1-only typed blocks (`StorageBlock`, `ServerEntry`), which have no legacy form.
This removes the double-parse (`check_config_version` is gone — the version is
read from the parsed struct) and makes v1 strictness comprehensive: a misspelled
nested key (e.g. under `cli:`) now errors instead of being silently dropped.
Turns the previous commit's regression test green and pins v1 comprehensive
strictness.
A config without a `version:` is the legacy/lenient schema (RFC-002): an
unrecognized top-level key must be tolerated, not rejected. This test pins that
contract and fails today because `deny_unknown_fields` is applied unconditionally
(struct-wide), so the `version` discriminator does not actually gate strictness.
Intentionally red — goes green in the next commit, which gates unknown-field
strictness on `version`.
Addresses a review finding: cargo fmt --check was failing on the V0/L1/L2 crates (and some pre-existing engine drift). cargo fmt --all --check is now clean. No behavior change.
Add the typed graph schema behind the version gate: per-graph storage: (string or block) XOR server:+graph_id:, a servers: map, and resolve_graph -> typed GraphLocator (local alias -> srv/gid qualified -> cli.graph default). uri stays a defaulted String filled by normalize_graphs from storage/server, so all existing .uri readers are unchanged and this commit is config-only (server reject-Remote is L5). deny_unknown_fields rejects typos; Storage has a hand-rolled string-or-block Deserialize for precise unknown-field errors. region/endpoint parsed but not yet engine-threaded (V2); GraphLocator.graph_id carried but unused on the wire until V2. 26 config tests pass; cargo test --workspace --locked green (per review).
load_config reads the optional top-level 'version:' discriminator and rejects unsupported versions (>1) before parsing; no-version (legacy) and 'version: 1' both still parse via the existing lenient struct. Forward-compat gate (RFC-002 §3) added as a no-op so the typed GraphLocator schema can tighten in following commits without breaking legacy files. 14 config tests pass (incl. version:1 parses, version:2 rejected).
Repoint CLI imports to the extracted crates: api DTOs -> omnigraph-api-types, QueryRegistry/check -> omnigraph-queries, config types -> omnigraph-config, and Policy* -> omnigraph-policy directly (no longer via the server re-export shim). Remove omnigraph-server from the CLI manifest. The CLI no longer pulls Axum/tower/utoipa-axum: 'cargo tree -p omnigraph-cli -i omnigraph-server' and '-i axum' both report not-in-graph. No behavior change (CLI compiles; no test churn — CLI tests import none of the moved symbols).
Move api.rs (HTTP request/response DTOs + OpenAPI schemas) into omnigraph-api-types (deps: engine, compiler, queries, serde, serde_json, utoipa). Repoint crate::queries::StoredQuery -> omnigraph_queries. Relocate the two catalog-projection tests here, where query_catalog_entry lives, eliminating the queries<->api dev-dep cycle (no cycle, tests at their boundary). Server aliases via 'pub use omnigraph_api_types as api;'. openapi.json byte-identical (drift test passes; no regen). 2 api-types tests pass; server+cli compile. No behavior change.
Move queries.rs (stored-query registry + validation) into omnigraph-queries (deps: omnigraph-config, omnigraph-compiler; no serde). Repoint crate::config -> omnigraph_config. The two catalog-projection tests (which exercise api::query_catalog_entry) are removed here and relocated to omnigraph-api-types in the next commit, where that function lives. Server aliases via 'pub use omnigraph_queries as queries;'. 16 queries tests pass; server+cli compile. No behavior change.