* Parallel per-type load writes + omnigraph optimize/cleanup CLI
## MR-677.3 — parallel per-type load writes
The load path already groups records into one RecordBatch per type and
makes one Lance commit per table (loader::mod.rs:249-..), but those
commits ran sequentially. Wrap the node and edge write loops in
`futures::stream::StreamExt::buffered(N)` through a new helper,
`write_batches_concurrently`. Concurrency is tunable via
`OMNIGRAPH_LOAD_CONCURRENCY` (default 8).
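A rough, std-only sketch of the bounded-concurrency idea. The real change uses `futures` with `buffered(N)`; this analogue swaps in scoped threads so it runs without extra crates. `resolve_concurrency` and `run_bounded` are hypothetical names, and falling back to 8 on zero or unparsable values is an assumption. Like `buffered`, it returns results in input order.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Default from the commit text.
const DEFAULT_LOAD_CONCURRENCY: usize = 8;

/// Resolve the limit from a raw env value (hypothetical helper).
/// Assumption: unset, zero, or unparsable values fall back to the default.
fn resolve_concurrency(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_LOAD_CONCURRENCY)
}

/// Run `work` over `items` with at most `limit` workers, returning the
/// results in input order (thread-based stand-in for `buffered(N)`).
fn run_bounded<T, R, F>(items: Vec<T>, limit: usize, work: F) -> Vec<R>
where
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    let total = items.len();
    let inputs: Vec<Mutex<Option<T>>> =
        items.into_iter().map(|t| Mutex::new(Some(t))).collect();
    let outputs: Vec<Mutex<Option<R>>> = (0..total).map(|_| Mutex::new(None)).collect();
    let next = AtomicUsize::new(0);

    thread::scope(|scope| {
        for _ in 0..limit.max(1).min(total) {
            scope.spawn(|| loop {
                // Each worker claims the next unprocessed index.
                let i = next.fetch_add(1, Ordering::SeqCst);
                if i >= total {
                    break;
                }
                let item = inputs[i].lock().unwrap().take().unwrap();
                let out = work(item);
                *outputs[i].lock().unwrap() = Some(out);
            });
        }
    });

    outputs
        .into_iter()
        .map(|slot| slot.into_inner().unwrap().unwrap())
        .collect()
}
```

A caller would pass `resolve_concurrency(std::env::var("OMNIGRAPH_LOAD_CONCURRENCY").ok().as_deref())` as the limit and one item per table.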
## MR-676 — `omnigraph optimize` and `omnigraph cleanup`
New CLI subcommands that walk every node + edge table in the repo:
- `omnigraph optimize <uri>` — runs Lance `compact_files` on each
table to merge small fragments into fewer larger ones.
- `omnigraph cleanup <uri> --keep N | --older-than 7d --confirm` —
runs Lance `cleanup_old_versions` to prune historical manifests +
unique fragments. Requires `--confirm` because it's destructive.
Supports count-based retention, time-based retention, or both
(AND'd together). Time uses chrono `DateTime<Utc>` (added as a
workspace dep, default-features off).
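One possible reading of the AND'd retention rule, sketched with plain integers instead of chrono and Lance types. `Version`, `Retention`, and `versions_to_remove` are hypothetical names, not the `CleanupPolicyOptions` API. The assumption is that a version is pruned only when it falls outside the newest `keep` versions and is also older than the cutoff.

```rust
/// One historical table version; `timestamp` is seconds since the epoch.
#[derive(Debug, Clone, PartialEq)]
struct Version {
    id: u64,
    timestamp: u64,
}

/// Hypothetical stand-in for the retention options.
struct Retention {
    /// Keep the newest N versions (count-based).
    keep: Option<usize>,
    /// Only prune versions strictly older than this timestamp (time-based).
    older_than: Option<u64>,
}

/// Return the ids to prune, newest first. With both criteria set, a
/// version must satisfy both (assumed meaning of "AND'd together").
fn versions_to_remove(mut versions: Vec<Version>, policy: &Retention) -> Vec<u64> {
    // With no criteria at all, prune nothing.
    if policy.keep.is_none() && policy.older_than.is_none() {
        return Vec::new();
    }
    versions.sort_by(|a, b| b.id.cmp(&a.id)); // newest first
    versions
        .iter()
        .enumerate()
        .filter(|(rank, v)| {
            let beyond_keep = policy.keep.map_or(true, |k| *rank >= k);
            let old_enough = policy.older_than.map_or(true, |cutoff| v.timestamp < cutoff);
            beyond_keep && old_enough
        })
        .map(|(_, v)| v.id)
        .collect()
}
```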
Both commands run their per-table loops in parallel (8-way bounded,
`OMNIGRAPH_MAINTENANCE_CONCURRENCY` env override). Smoke-tested
against the 114-table prod graph: `optimize` went from 7m15s sequential
to 1m28s parallel, and `cleanup --keep 1` removed 137 historical versions
across 114 tables in 1m57s without disrupting `/healthz` or query
responses.
Public API on `Omnigraph`:
    pub async fn optimize(&mut self) -> Result<Vec<TableOptimizeStats>>
    pub async fn cleanup(&mut self, opts: CleanupPolicyOptions)
        -> Result<Vec<TableCleanupStats>>
All 10 existing loader tests still pass.
Closes MR-676.
Partially addresses MR-677: this covers the .3 piece (parallel by type).
MR-677.1 applies to the `omnigraph embed` path rather than load, since
load doesn't call Gemini directly, and MR-677.2 was already in place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: regenerate openapi.json
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Files where inline tests crowded out production code (test/prod ratio
≥ 0.8) move to sibling files via `#[path]`. Files where production
dominates (query_input.rs, schema_plan.rs) stay inline — extracting
would add noise, not reduce it.
- ir/lower.rs: 1239 → 577 lines (ratio 1.15)
- catalog/mod.rs: 594 → 326 lines (ratio 0.83)
- query/lint.rs: 562 → 314 lines (ratio 0.80)
catalog/tests.rs uses the shorter name since it's inside a module
directory (no ambiguity with filename).
All 229 compiler tests green, identical count to before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
typecheck.rs, schema/parser.rs, and query/parser.rs each had
~1000-line inline `mod tests` blocks that overshadowed the production
code in the file. Move each to a sibling `*_tests.rs` using
`#[path = "..."] mod tests;`.
- typecheck.rs: 2865 → 1708 lines; typecheck_tests.rs: 1156 lines
- schema/parser.rs: 1950 → 994 lines; parser_tests.rs: 955 lines
- query/parser.rs: 1737 → 803 lines; parser_tests.rs: 933 lines
No visibility change — the sibling module still has `use super::*`
access to crate-privates. No semantic edits beyond de-indenting by
4 spaces (mechanical). All 229 compiler tests green, identical
count to before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Unit tests covering gaps identified by a systematic matrix of
topology (fan-out, fan-in, cycle) × deferral × filter type × direction.
New unit tests:
- fan-out: one root fans to two deferred destinations via different edges
- fan-in: two sources converge on one destination via reverse expand
- cycle: deferred binding + genuine cycle-closing on return edge
- multiple filters on single deferred binding (name + age)
- param filter on deferred binding (IRExpr::Param in dst_filters)
- negation with inner binding (documents current NodeScan+cycle-close behavior)
New integration tests:
- fan-out projection (friend × company cross-product per source)
- deferred filter matching nothing (empty result propagation)
- negation with inner destination binding filter
Also: guard anti-join fast path against non-empty dst_filters. The bulk
CSR existence check only tests neighbor existence, not destination
properties — it must fall back to the slow path when dst_filters are
present to avoid false negatives.
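A toy illustration of why the guard matters, assuming a single numeric destination filter. `Graph`, `anti_join_keeps`, and the `age` property are hypothetical, and the adjacency map only stands in for the CSR index. With a filter present, a neighbor that exists but fails the filter must not count as a match.

```rust
use std::collections::HashMap;

struct Graph {
    /// Adjacency list standing in for the CSR index: source id -> neighbor ids.
    neighbors: HashMap<u32, Vec<u32>>,
    /// Destination property used by the filter in this sketch.
    age: HashMap<u32, u32>,
}

/// Returns true when `src` is KEPT by the anti-join, i.e. when no
/// neighbor satisfies the inner pattern.
fn anti_join_keeps(graph: &Graph, src: u32, min_age: Option<u32>) -> bool {
    let empty = Vec::new();
    let nbrs = graph.neighbors.get(&src).unwrap_or(&empty);
    match min_age {
        // Fast path: no destination filters, so neighbor existence decides.
        None => nbrs.is_empty(),
        // Slow path: a neighbor only counts if it passes the filter.
        Some(min) => !nbrs
            .iter()
            .any(|n| graph.age.get(n).map_or(false, |a| *a >= min)),
    }
}
```

If the fast path were used while a filter is present, source 1 in the test data would be dropped even though its only neighbor fails the filter.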
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The anonymous wildcard variable _ was included as a regular node in the
undirected adjacency graph used for component analysis. When multiple
traversals referenced $_, it falsely bridged otherwise-independent
components, causing bindings in separate components to be deferred.
The deferred binding would never be introduced (since _ is never added
to bound_vars), leading to silently dropped traversals.
Fix: skip edges involving _ when building the adjacency graph.
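A minimal sketch of the fix, assuming variables are plain strings and the wildcard is spelled `_`. `build_adjacency` and `same_component` are hypothetical names for illustration only.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Build the undirected adjacency used for component analysis,
/// skipping any traversal that touches the anonymous wildcard.
fn build_adjacency(traversals: &[(&str, &str)]) -> BTreeMap<String, BTreeSet<String>> {
    let mut adj: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for &(src, dst) in traversals {
        if src == "_" || dst == "_" {
            continue; // the wildcard must not bridge components
        }
        adj.entry(src.to_string()).or_default().insert(dst.to_string());
        adj.entry(dst.to_string()).or_default().insert(src.to_string());
    }
    adj
}

/// Depth-first reachability check between two variables.
fn same_component(adj: &BTreeMap<String, BTreeSet<String>>, a: &str, b: &str) -> bool {
    let mut seen = BTreeSet::new();
    let mut stack = vec![a.to_string()];
    while let Some(node) = stack.pop() {
        if node == b {
            return true;
        }
        if !seen.insert(node.clone()) {
            continue;
        }
        if let Some(next) = adj.get(&node) {
            stack.extend(next.iter().cloned());
        }
    }
    false
}
```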
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The retain-based loop swallowed catalog.lookup_edge_by_name errors by
keeping the traversal for the next pass, where it could never succeed.
This caused the no-progress break to fire, silently dropping the
traversal and producing incorrect query results with missing joins.
Replaced `retain` with a manual for-loop that propagates errors via `?`.
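The shape of the change, sketched with a stubbed lookup. A `retain` closure has to return `bool`, so it cannot use `?` and can only keep or drop an item. A plain loop can return the error to the caller. `LowerError`, the stub body of `lookup_edge_by_name`, and `resolve_pass` are assumptions made for illustration.

```rust
#[derive(Debug, PartialEq)]
enum LowerError {
    UnknownEdge(String),
}

/// Hypothetical stand-in for the catalog lookup named in the commit.
fn lookup_edge_by_name(name: &str) -> Result<u32, LowerError> {
    match name {
        "knows" => Ok(1),
        "worksAt" => Ok(2),
        other => Err(LowerError::UnknownEdge(other.to_string())),
    }
}

/// One pass over pending traversals: a failed lookup aborts the pass
/// instead of leaving the traversal queued for a retry that cannot succeed.
fn resolve_pass(pending: &[&str]) -> Result<Vec<u32>, LowerError> {
    let mut resolved = Vec::new();
    for name in pending {
        resolved.push(lookup_edge_by_name(name)?);
    }
    Ok(resolved)
}
```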
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The iterative lowering now handles traversals declared in non-topological
order (e.g. `$b worksAt $c` before `$a knows $b`). Each pass processes
traversals that have at least one bound endpoint, repeating until all are
consumed. Caught during self-review.
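A simplified sketch of the multi-pass ordering, assuming traversals are `(src, dst)` name pairs and a single root binding is already bound. `order_traversals` is a hypothetical name, and the real lowering emits IR ops rather than returning a reordered list.

```rust
use std::collections::HashSet;

/// Order traversals so that each one has at least one bound endpoint
/// at the time it is processed.
fn order_traversals<'a>(
    root: &'a str,
    traversals: &[(&'a str, &'a str)],
) -> Vec<(&'a str, &'a str)> {
    let mut bound: HashSet<&str> = HashSet::from([root]);
    let mut pending: Vec<(&str, &str)> = traversals.to_vec();
    let mut ordered = Vec::new();

    while !pending.is_empty() {
        let before = pending.len();
        let mut deferred = Vec::new();
        for (src, dst) in pending {
            if bound.contains(src) || bound.contains(dst) {
                bound.insert(src);
                bound.insert(dst);
                ordered.push((src, dst));
            } else {
                deferred.push((src, dst)); // retry on the next pass
            }
        }
        if deferred.len() == before {
            break; // no progress: the rest is disconnected from the root
        }
        pending = deferred;
    }
    ordered
}
```

For the example above, `$b worksAt $c` is deferred on the first pass and picked up on the second, once `$a knows $b` has bound `$b`.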
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The IR lowering previously emitted independent NodeScans for every binding
in a match clause, even when bindings were connected by traversals. This
created O(N×M) cross-joins followed by cycle-closing filters — correct but
extremely slow for large datasets.
Two changes fix this by design:
1. **Deferred bindings** — When multiple bindings are connected by
traversals, only the first-declared binding gets a NodeScan. The rest
are introduced by Expand operations, eliminating cross-joins entirely.
2. **Filter fusion into Expand** — Deferred binding filters are attached
directly to IROp::Expand (new `dst_filters` field) and pushed into
Lance SQL during hydrate_nodes(), so the storage layer skips
non-matching rows. Non-pushable filters (list-contains, FTS) fall back
to in-memory application after hconcat.
For a query like:
match { $p: Person $p worksAt $c $c: Company { name: "Acme" } }
Old plan: NodeScan($p) → NodeScan($c) → cross-join → Expand(__temp) → cycle-close
New plan: NodeScan($p) → Expand($p→$c, Lance SQL: id IN (...) AND name='Acme')
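To make the pushed-down predicate concrete, here is a hedged sketch of how the `id IN (...) AND name='Acme'` string could be assembled. `pushdown_predicate` is a hypothetical helper, numeric ids and equality-only filters are assumptions, and the real code may build its filter differently.

```rust
/// Build a SQL-style predicate from candidate destination ids plus
/// equality filters on string columns. Returns `None` when there are no
/// candidate ids, since there is nothing to hydrate in that case.
fn pushdown_predicate(ids: &[u64], eq_filters: &[(&str, &str)]) -> Option<String> {
    if ids.is_empty() {
        return None;
    }
    let id_list = ids
        .iter()
        .map(|id| id.to_string())
        .collect::<Vec<_>>()
        .join(", ");
    let mut clauses = vec![format!("id IN ({})", id_list)];
    for (column, value) in eq_filters {
        // Double any single quote so the literal cannot terminate early.
        clauses.push(format!("{} = '{}'", column, value.replace('\'', "''")));
    }
    Some(clauses.join(" AND "))
}
```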
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The early return at line 273 for None/Value::Null params was skipping
the null-fill loop, leaving declared nullable params absent from the
map. Downstream code would then error with "parameter not provided".
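A sketch of the intended behavior after the fix, using simplified stand-in types. `Value`, `ParamDecl`, and `bind_params` are hypothetical, and the exact error text is an assumption. The point is that a missing params object is treated as empty, so the null-fill loop still runs.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Str(String),
}

/// Hypothetical parameter declaration; `nullable` corresponds to `?`.
struct ParamDecl {
    name: &'static str,
    nullable: bool,
}

/// Bind declared params: absent nullable params become `Value::Null`,
/// absent required params are an error.
fn bind_params(
    decls: &[ParamDecl],
    input: Option<HashMap<String, Value>>,
) -> Result<HashMap<String, Value>, String> {
    // No early return on `None`: fall through with an empty map instead.
    let mut bound = input.unwrap_or_default();
    for decl in decls {
        if bound.contains_key(decl.name) {
            continue;
        }
        if decl.nullable {
            bound.insert(decl.name.to_string(), Value::Null);
        } else {
            return Err(format!("parameter not provided: {}", decl.name));
        }
    }
    Ok(bound)
}
```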
https://claude.ai/code/session_014oGFKL7EVg1b2cyPgt9Gne
Parameters declared with `?` (e.g. `$changelogUrl: String?`) now correctly
accept omission or explicit null in JSON input instead of requiring empty
strings as a workaround. Adds `Literal::Null` variant and threads it through
parameter parsing, type-checking, and Arrow array conversion.
https://claude.ai/code/session_014oGFKL7EVg1b2cyPgt9Gne
Add runtime support for aggregate functions (count, sum, avg, min, max)
with GROUP BY semantics, built on a single wide RecordBatch that
eliminates correlation tracking by construction.
Execution engine (exec/query.rs):
- Replace HashMap<String, RecordBatch> with Option<RecordBatch> where
columns are prefixed as <variable>.<property>
- NodeScan prefixes columns and cross-joins with existing batch
- Expand collects (src_row, dst_id) pairs, takes wide batch rows,
appends prefixed destination columns via hconcat
- Filter applies single mask to entire wide batch
- AntiJoin: fast-path returns BooleanArray mask; slow-path slices
one row for inner pipeline execution
Projection engine (exec/projection.rs):
- aggregate_return groups rows by non-aggregate key columns using
length-prefixed string encoding, computes per-group aggregates
- SUM accumulates into f64 to avoid integer overflow
- MIN/MAX support both numeric and string types
- Empty input returns count=0, others=null
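A small sketch of the grouping idea, assuming string key columns and integer values. `group_key` and `count_and_sum` are hypothetical, and the `len:` prefix format is an assumption about what "length-prefixed" means here. The prefix keeps `("a", "bc")` and `("ab", "c")` from colliding, and SUM accumulates into `f64`.

```rust
use std::collections::BTreeMap;

/// Encode key columns with a length prefix so concatenation is unambiguous.
fn group_key(parts: &[&str]) -> String {
    let mut key = String::new();
    for p in parts {
        key.push_str(&p.len().to_string());
        key.push(':');
        key.push_str(p);
    }
    key
}

/// Per-group COUNT and SUM; each row is (key columns, value).
fn count_and_sum(rows: &[(Vec<&str>, i64)]) -> BTreeMap<String, (u64, f64)> {
    let mut groups: BTreeMap<String, (u64, f64)> = BTreeMap::new();
    for (keys, value) in rows {
        let entry = groups.entry(group_key(keys)).or_insert((0, 0.0));
        entry.0 += 1;
        // Accumulate as f64 so large integer inputs cannot overflow.
        entry.1 += *value as f64;
    }
    groups
}
```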
Compiler (typecheck.rs):
- T8: split MIN/MAX from SUM/AVG — allow string arguments
- T9: non-aggregate expressions in aggregate queries must be
property accesses or variables
- SUM type inference returns Float64 (matching runtime)
Tests: 8 new integration tests covering grouped count, global count,
sum/avg/min/max per company, aggregate+order+limit, string min/max,
multi-hop aggregates, and edge cases.
https://claude.ai/code/session_019o5NRyYomgETFyd7hpiLey
Allow mutation queries to contain multiple sequential statements that
execute atomically within a single transactional run. This enables
patterns like inserting a node and its edges in one query:
    query add_and_link($name: String, $age: I32, $friend: String) {
        insert Person { name: $name, age: $age }
        insert Knows { from: $name, to: $friend }
    }
Changes span the full compiler-to-execution pipeline:
- Grammar: mutation_body = { mutation_stmt+ }
- AST: QueryDecl.mutations: Vec<Mutation>
- IR: MutationIR.ops: Vec<MutationOpIR>
- Execution: loop over ops, accumulate affected counts
Cross-statement visibility works because each statement's commit_updates
advances the manifest state, so subsequent statements see prior writes.
Atomicity comes from the existing run mechanism (begin_run/publish_run).
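An in-memory analogy for the execution loop, not the actual run mechanism. `Store`, `Op`, and `run_mutation` are hypothetical, the clone-then-swap stands in for begin_run/publish_run, and rejecting edges with missing endpoints is an assumption added to give the sketch a failure case. Later ops see earlier writes because they all act on the same staged state.

```rust
use std::collections::BTreeSet;

enum Op<'a> {
    InsertNode(&'a str),
    InsertEdge(&'a str, &'a str),
}

#[derive(Default, Clone)]
struct Store {
    nodes: BTreeSet<String>,
    edges: BTreeSet<(String, String)>,
}

/// Apply all ops to a staged copy, then publish only if every op succeeded.
fn run_mutation(store: &mut Store, ops: &[Op]) -> Result<usize, String> {
    let mut staged = store.clone(); // analogous to beginning a run
    let mut affected = 0;
    for op in ops {
        match op {
            Op::InsertNode(name) => {
                staged.nodes.insert(name.to_string());
            }
            Op::InsertEdge(from, to) => {
                // Sees nodes inserted by earlier statements in this run.
                if !staged.nodes.contains(*from) || !staged.nodes.contains(*to) {
                    return Err(format!("missing endpoint for edge {} -> {}", from, to));
                }
                staged.edges.insert((from.to_string(), to.to_string()));
            }
        }
        affected += 1;
    }
    *store = staged; // analogous to publishing the run
    Ok(affected)
}
```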
https://claude.ai/code/session_01E4VG2WXrZW8aeXFiqr8NwF