load_jsonl_reader dispatches on mode: Append/Merge use the
MutationStaging accumulator (per-type batch staging, single stage_* +
commit_staged per touched table at end-of-load, publisher CAS).
Overwrite keeps the legacy concurrent inline-commit path
(truncate-then-append doesn't fit the staged shape; overwrite has no
in-flight read-your-writes requirement).
* New helpers collect_node_ids_with_pending and
validate_edge_cardinality_with_pending_loader — loader analogs
of the engine's pending-aware validators.
* Phase 2c (RI) and Phase 3 (cardinality) consult pending batches
for Append/Merge so a mid-load failure aborts the load before any
Lance write reaches HEAD.
A failed Append/Merge load no longer advances Lance HEAD on staged
tables — the next load on the same tables proceeds normally with no
ExpectedVersionMismatch. Overwrite mode's drift residual is unchanged
from today's behavior; documented in docs/runs.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Replace mutation.rs's MutationStaging.latest with the new
pending + inline_committed shape from exec::staging. Inserts
and updates push batches into pending; deletes still inline-commit
via record_inline.
* Rewrite execute_insert, execute_update, execute_delete*,
validate_edge_insert_endpoints, ensure_node_id_exists for the
new shape. Edge cardinality validates against committed scan +
in-memory pending walk (validate_edge_cardinality_with_pending).
* D₂ parse-time check: a query is either insert/update-only or
delete-only. Mixed → reject before any I/O.
* Drop CoordinatorRestoreGuard and the swap_coordinator_for_branch /
restore_coordinator dance from mutate_with_current_actor. Branch
is threaded explicitly through execute_named_mutation and the
per-op functions. (merge.rs keeps its own swap pattern.)
* apply_assignments updated to copy unassigned blob columns from
the scan when present, enabling full-schema update batches; for
blob-bearing tables we still project away the blob columns at scan
time (Lance's filter pushdown panics otherwise) and accept the
narrow-schema output for the v1 path.
A failed mid-query op no longer advances Lance HEAD on staged tables —
the next mutation proceeds normally with no ExpectedVersionMismatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the scaffolding for the in-memory staged-write rewire — no behavior
change yet:
* New crates/omnigraph/src/exec/staging.rs with MutationStaging,
PendingTable, PendingMode, StagedTablePath, plus the end-of-query
finalize() that issues one stage_* + commit_staged per pending
table (Merge mode dedupes by id, last-write-wins).
* TableStore::scan_with_pending and count_rows_with_pending helpers —
Lance scan committed + DataFusion MemTable scan pending, concat.
Sidesteps the Scanner::with_fragments filter-pushdown limitation
documented on scan_with_staged.
* Add datafusion = "52" to workspace + omnigraph-engine deps for
MemTable (transitively pulled by Lance already).
Engine code still uses the legacy MutationStaging shape; the rewire
lands in subsequent commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI failures, both addressed:
(1) u32/u64 type mismatch in stage_append (compile error):
ds.manifest.max_fragment_id is Option<u32>, but Lance's Fragment::id
and the commit-time renumbering counter in
Transaction::fragments_with_ids operate on u64. Cast max_fragment_id
to u64 before the arithmetic.
(2) scan_with_staged_pushes_filter_through_committed_and_staged failed
because Lance's stats-based fragment pruning drops uncommitted staged
fragments from filtered scans — they lack the per-column statistics
that committed fragments carry. With filter `age >= 30` and a staged
dave (age=35), dave is silently absent from the result.
scanner.use_stats(false) does not bypass this in lance 4.0.0
(verified locally).
Rather than chase Lance internals further, document the limitation:
- stage_merge_insert / scan_with_staged docstring updated to flag the
filter contract as incomplete on staged fragments.
- Test renamed to scan_with_staged_with_filter_silently_drops_staged_rows
and flipped to assert the actual behavior, with a clear note pointing
at the design pivot (.context/mr-794-step2-design.md §1.1) and
instructions for whoever sees the assertion fail in the future.
- Test also asserts that unfiltered scan_with_staged returns all rows —
confirms the issue is specifically filter pushdown, not fragment
scanning per se.
The engine's MR-794 step 2+ design (in-memory pending-batch
accumulation + DataFusion MemTable for read-your-writes) sidesteps
this entirely; production code is unaffected. scan_with_staged stays
on the public surface for primitive-level testing and for callers
that don't need filter pushdown.
All 8 staged_writes tests + 10 runs + 63 end_to_end + consistency
green locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI exposed the actual root cause behind the three staged_writes
test failures: Lance's InsertBuilder::execute_uncommitted produces
fragments with id=0 as a "Temporary ID" (lance-4.0.0
dataset/write.rs:1044, with the assertion at line 1712). Real IDs
get assigned at commit time by Transaction::fragments_with_ids
(transaction.rs:1456). Because we expose pre-commit fragments to
scan_with_staged via Scanner::with_fragments, two fragments collide
on id=0 in the combined list — the staged fragment with the seed
fragment, or two staged fragments with each other.
Lance's scanner mishandles the collision. Symptoms observed in
the three failing tests:
- chained_stage_appends: only 1 distinct _rowid (other fragments
silently dropped)
- count_rows_with_staged_matches_scan: range overflow ("Invalid read
of range 0..2 for fragment 0 with 1 addressable rows")
- scan_with_staged_pushes_filter: duplicate carol + missing dave
(one fragment read twice, the other not at all)
Fix: assign real fragment IDs in stage_append, mirroring Lance's
commit-time logic. Use ds.manifest.max_fragment_id + 1 as the base,
incremented by the prior_stages fragment count so chained
stage_appends produce distinct IDs. The row_id_meta assignment
stays — both are needed for the scanner to correctly map row IDs
through the combined fragment list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups to the staged-writes primitives, all caught by the
"are we missing tests?" review:
(1) Path A row-ID threading (Gap 1, real bug):
stage_append now takes prior_stages: &[StagedWrite] and offsets the
assigned row IDs by the sum of prior stages' physical_rows. Without
this, two stage_appends against the same dataset both started at
ds.manifest.next_row_id, producing fragments with overlapping _rowid
ranges. This would have fired in Step 2+ on any multi-statement
mutation like `insert Knows ...; insert Knows ...` (multiple appends
to the same edge table — allowed under D₂′). The slice mirrors
scan_with_staged's API shape; the same slice is passed to both stage
and scan. Documented contract: only stage_append results in
prior_stages (D₂′ guarantees this upstream).
(2) commit_staged round-trip tests (Gap 2):
Two tests covering stage_append + commit_staged and stage_merge_insert
+ commit_staged. Validate that Lance's commit-time row-ID assignment
works correctly even after our pre-commit row_id_meta assignment in
the append path — the two assignments diverge but neither is observed
across the boundary.
(3) Filter pushdown test (Gap 3):
scan_with_staged with a SQL filter applies it across both committed
and staged fragments. Validates the MR-794 ticket's claim that Lance's
with_fragments preserves filter/vector/FTS pushdown (Lance tests
test_scalar_index_respects_fragment_list etc.).
Also adds chained_stage_appends_have_distinct_row_ids which directly
demonstrates the Gap 1 fix by projecting _rowid and asserting no
duplicates across 1 committed + 2 staged rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI exposed a real Step 1 bug surfaced by the new staged_writes tests:
stage_append → scan_with_staged fails on stable_row_id datasets with
"Missing row id meta" (lance-4.0.0/src/dataset/rowids.rs:22).
Root cause: InsertBuilder::execute_uncommitted produces fragments with
row_id_meta = None. Lance's commit phase normally populates it via
Transaction::assign_row_ids, but scan_with_staged reads the staged
fragments BEFORE commit. MergeInsertBuilder::execute_uncommitted dodges
this by populating row_id_meta inline (transaction.rs:1618) — that's
why the two merge-side tests in tests/staged_writes.rs passed and the
two append-side tests failed.
The bug was always present in the primitive — PR #66 shipped it the
same way. PR #66 had no tests calling stage_append, so neither CI nor
the bot reviewers caught it. Step 2+ would have hit it on the first
mutation that did "insert + insert with FK validation," but the failure
would have looked like a MutationStaging wiring bug; localizing it
here saves the next session the chase.
Fix: assign row_id_meta on the cloned fragments returned in
StagedWrite.new_fragments. Mirrors the relevant arm of Lance's
Transaction::assign_row_ids (transaction.rs:2682) for the
row_id_meta = None case. The transaction's internal fragment copy stays
untouched — Lance assigns its own IDs at commit time, and the two ID
assignments don't have to agree because no caller threads _rowid from
the staged scan into the commit path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failed compiling tests/staged_writes.rs — `.len()` is on the Array
trait, not on the concrete StringArray/Int32Array types. Add the
trait import.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex flagged that combine_committed_with_staged can return duplicates
on chained stage_merge_inserts: each call's MergeInsertBuilder runs
against the committed view (it does not see prior staged fragments), so
two staged merges whose source rows share keys both produce
Operation::Update transactions whose new_fragments contain the shared
row. The combined scan returns it twice.
The bug is intrinsic to Lance's API: there is no public way to make
MergeInsertBuilder see uncommitted fragments. Fixing the primitive
itself requires either a Lance API extension or in-memory pre-merge
logic, neither in scope for v1.
The v1 fix is a parse-time companion (D₂′) added with the engine rewire
in MR-794 step 2+: per touched table, ops must be all stage_append OR
exactly one stage_merge_insert. Multi-table queries and append-chains
remain safe; only chained merges on a single table are rejected.
This commit:
- Documents the contract on stage_merge_insert and
combine_committed_with_staged so callers know the invariant the
primitive relies on.
- Adds tests/staged_writes.rs with four primitive-level tests:
- stage_append + scan_with_staged shows committed + staged
- stage_merge_insert dedupes superseded committed fragments
(regression for the removed_fragment_ids fix that PR #66's
730631c added)
- count_rows_with_staged matches scan
- chained stage_merge_insert with shared key documents the
duplicate-row behavior; assertion pins it so a future change either
preserves the contract or consciously fixes it (and updates the test)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three independent automated reviews (Cubic P1, Cursor High, Codex P1)
flagged a real correctness bug in stage_merge_insert: Operation::Update
returns three fields — removed_fragment_ids, updated_fragments,
new_fragments — and we were collecting only the latter two into
StagedWrite.new_fragments while discarding removed_fragment_ids.
That breaks read-your-writes for any merge_insert that rewrites an
existing fragment: scan_with_staged combines the dataset's full committed
manifest with the staged new_fragments, so the *original* committed
fragment (which the rewrite supersedes) and its rewritten version both
end up in the Scanner's fragment list. Result: duplicate rows.
Fix:
- StagedWrite gains `removed_fragment_ids: Vec<u64>` populated from
Operation::Update; empty for Operation::Append (which never supersedes
existing fragments).
- scan_with_staged / count_rows_with_staged take `&[StagedWrite]` instead
of `&[Fragment]` so they have access to both fields.
- A new `combine_committed_with_staged` helper composes the visible
fragment list as `committed - removed + new`, deduping by fragment ID.
Also addresses cubic's P3 doc-fab note: the StagedWrite doc comment
claimed the type was "used by MutationStaging and the loader" but those
callers don't exist in this PR (they're MR-794 step 2+). Reword to
"defined here for later integration" so the doc doesn't lie about the
current state.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Lance's distributed-write API splits "write fragment files" from "advance
HEAD": write_fragments returns a Transaction with FragmentMetadata; a
later CommitBuilder::execute(transaction) commits via the manifest CAS.
The same shape exists for merge_insert via
MergeInsertBuilder::execute_uncommitted. Scanner::with_fragments(staged)
lets in-flight reads see uncommitted staged data.
Adds wrappers for these primitives:
- StagedWrite carries the uncommitted Transaction plus the new Fragments
(extracted for read-your-writes via Scanner::with_fragments).
- TableStore::stage_append wraps InsertBuilder::execute_uncommitted.
- TableStore::stage_merge_insert wraps MergeInsertBuilder::execute_uncommitted.
- TableStore::commit_staged wraps CommitBuilder::execute.
- TableStore::scan_with_staged / count_rows_with_staged thread the staged
fragments into a Scanner alongside the dataset's committed fragments.
The MutationStaging integration that uses these primitives is the next
step in MR-794 — it requires a coordinated rewrite of execute_insert /
execute_update / execute_delete plus the load_jsonl_reader path, plus
end-of-query commit logic. Doc comment on MutationStaging is updated to
reference MR-794 and these primitives so the followup is well-anchored.
The current MR-771 limitation in docs/runs.md ("mid-query partial failure
leaves Lance HEAD ahead of __manifest") still applies until the
follow-up lands; the primitives are the building blocks but not yet the
fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cubic correctly flagged that the assertion `!branches_after.iter().any(|b| b.starts_with("__run__"))` is vacuous because `branch_list()` already filters `__run__*` via `is_internal_system_branch`. The real structural property (no `__run__` branches can ever be created) is enforced by MR-771's deletion of `begin_run` etc. — that's a build-time invariant, not a runtime one.
Drop the vacuous assertion; document why. The remaining checks (public branch list unchanged + `_graph_runs.lance` never reappears) cover the actual cancel-safety properties.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three fixes from automated PR review on #65:
1. Internal-branch guard in mutation/load (Cursor Bugbot, Medium).
Pre-MR-771 the begin_run path called ensure_public_branch_ref;
the direct-publish replacements only normalized the name. A caller
passing __run__* or __schema_apply_lock__ verbatim could write
directly to a system branch. Re-add the explicit guard at the
public write boundary in mutate_with_current_actor and load.
2. Panic-safe coordinator restoration (Cursor Bugbot, High).
The previous swap-and-restore pattern would skip restore_coordinator
if execute_named_mutation panicked, leaving the handle pinned to
the wrong branch indefinitely. Replace with a CoordinatorRestoreGuard
RAII type that captures the previous coordinator on swap and
restores it in Drop.
3. Flaky cancel-safety test (cubic, P2).
tests/runs.rs::cancelled_mutation_future_leaves_no_state asserted
manifest version equality after handle.abort(), but abort races
the spawned task. Re-frame around what actually defines cancel
safety: no __run__* branches, no _graph_runs.lance, no synthesized
public branches.
The fourth comment (Codex P1: branch_delete losing its in-flight
write barrier) is bigger in scope — fits in the MR-794 storage-trait
staging story rather than a hotfix here. Tracked there.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The on-disk shape of `__manifest` is reconciled with the binary via a single
stamp + dispatcher in `db/manifest/migrations.rs`:
- `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` declares the shape this binary writes.
- The on-disk stamp `omnigraph:internal_schema_version` lives in the manifest
dataset's schema-level metadata (Lance `update_schema_metadata`).
- `migrate_internal_schema(&mut dataset)` walks `match`-arm steps forward from
the on-disk stamp until it matches the binary, then returns. Idempotent.
- `init_manifest_repo` stamps the current version at creation; the publisher's
open-for-write path runs pending migrations before reading state. Reads
stay side-effect-free.
- Forward-version protection: a stamp higher than the binary's known version
triggers a clear "upgrade omnigraph first" error so an old binary cannot
clobber a newer schema.
Self-heals existing pre-MR-766 deployments by auto-applying the v1→v2 step:
the `lance-schema:unenforced-primary-key` annotation on `__manifest.object_id`
that engages Lance's row-level CAS at commit time. New repos created via
`init` are stamped at v2 immediately and don't need migration.
Adding a future on-disk shape change is one constant bump, one match arm in
`migrate_internal_schema`, and one test — no new branches in unrelated code
paths. Code outside the migration module never inspects the stamp.
New tests in `manifest/tests.rs`:
- `test_init_stamps_internal_schema_version`
- `test_publish_migrates_pre_stamp_manifest_to_current_version`
- `test_publish_rejects_manifest_stamped_at_future_version`
Docs: `docs/storage.md`, `docs/maintenance.md`, `docs/constants.md` updated
per the AGENTS.md maintenance contract.
Layered approach selected by the CAS-granularity investigation
(.context/merge-insert-cas-granularity.md):
- Annotate __manifest.object_id with `lance-schema:unenforced-primary-key`,
enabling Lance row-level CAS via the bloom-filter conflict resolver.
Closes a latent silent-duplicate bug where two concurrent publishes of
the same `version:T@v=N+1` row could both land in disjoint fragments.
- Extend `ManifestBatchPublisher::publish` with `expected_table_versions:
&HashMap<String, u64>`. Empty map preserves today's behavior; populated
map asserts the manifest's latest non-tombstoned version per table
matches the caller's view. Mismatches surface as a typed
`ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected,
actual }` so callers can match without parsing strings.
- Set `merge_builder.conflict_retries(0)` so Lance's transparent rebase
cannot silently break the OCC contract; retries are owned by the
publisher loop, where each attempt re-runs `load_publish_state` and
the expected-version pre-check.
- Surface `ManifestCoordinator::commit_with_expected` for the callers
that need strict OCC (the run-demotion ticket); existing `commit` and
`commit_changes` paths are unaffected.
New tests in `manifest/tests.rs` cover: matching expected versions,
stale expected with typed details, drift on an untouched expected
table, unknown expected table (actual=0), and the headline case of two
concurrent publishes with overlapping expected versions where exactly
one succeeds.
Several validators were defined but only called from a subset of write
paths, so writes that violated @unique, @range, @check, enum, or
@cardinality constraints could silently succeed and corrupt data.
Adds two new helpers in loader/mod.rs:
- validate_enum_constraints — batch-level enum check, scans Arrow
string columns (and list-of-string columns) for values outside the
declared set
- enforce_unique_constraints_intra_batch — single-batch duplicate
detection over named columns; partial enforcement (does not check
against committed rows yet — cross-batch enforcement is a separate
effort)
Wires the validators into:
- load_jsonl_reader nodes (alongside the existing
validate_value_constraints call) and edges (which had no enum or
unique check at all)
- exec/mutation.rs node insert, edge insert, and update paths
- mutation edge insert now also calls validate_edge_cardinality after
the row lands but before the manifest commit, matching the loader's
Phase 3 behavior
A new tests/validators.rs suite asserts rejection on every entry path
for invalid enum values, @range violations, intra-batch @unique
duplicates, and edge @card excesses.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema apply previously committed the manifest before writing the schema
source and IR contract files. A crash in that window left the manifest
pointing at the new schema while _schema.pg, _schema.ir.json, and
__schema_state.json still reflected the old one — a silent inconsistency
that subsequent reads hit as type errors.
Reorders the apply: write to staging filenames first, commit the
manifest, then atomically rename staging → final. On open, a recovery
sweep reconciles any leftover staging files against the manifest's table
set: pre-commit crashes get the staging files deleted, post-commit
crashes get the renames completed (idempotent — handles partial
renames). Property-only migrations where both schemas imply the same
table set return an operator-actionable error rather than guessing.
Adds rename_text + delete to StorageAdapter (atomic on local FS via
tokio::fs::rename; copy + delete on S3 — recovery is tolerant of the
non-atomic case). Failpoints test coverage at both crash boundaries plus
a partial-rename scenario.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both Cursor Bugbot and Cubic flagged that the inbound `headers().get(...)`
call constructed `HeaderName::from_static("x-request-id")` inline instead
of reusing the `X_REQUEST_ID` constant defined at the top of the file.
The two were already kept in sync by both being `from_static("x-request-id")`,
but a future rename would have to touch both sites or risk silent drift
between read and write.
Also drops the now-unused `header` module import.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-request ULID minted at the edge, exposed in request extensions and
on the response header. Caller-supplied X-Request-Id is echoed when
well-formed (1..=128 ASCII printable characters); otherwise rejected
and replaced with a fresh ULID so the value is always safe to log.
Companion to the TypeScript SDK redesign — clients now correlate logs
across the wire by reading X-Request-Id from response headers (and the
SDK already surfaces it on every OmnigraphError as `requestId`).
No spec change required; the header is a transport-layer concern.
Tests:
- mint a ULID when no header is provided
- echo a valid caller-supplied id
- reject overlong header (200 chars), mint a fresh ULID
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add operation descriptions and examples to utoipa annotations so the
generated TypeScript SDK has rich JSDoc, and so future Python/Go SDKs
and any /openapi.json docs UI benefit from the same effort.
- Doc comments on all 18 handlers (utoipa picks up summary/description)
- #[schema(example = ...)] on free-text fields (query_source,
schema_source, NDJSON data) and i64 timestamps
- Destructive/irreversible warnings on change, applySchema, ingest,
mergeBranches, deleteBranch, publishRun, abortRun
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Parallel per-type load writes + omnigraph optimize/cleanup CLI
## MR-677.3 — parallel per-type load writes
The load path already groups records into one RecordBatch per type and
makes one Lance commit per table (loader::mod.rs:249-..), but those
commits ran sequentially. Wrap node and edge write loops in
`futures::stream::buffered(N)` against a new helper
`write_batches_concurrently`. Concurrency tunable via
`OMNIGRAPH_LOAD_CONCURRENCY` (default 8).
## MR-676 — `omnigraph optimize` and `omnigraph cleanup`
New CLI subcommands that walk every node + edge table in the repo:
- `omnigraph optimize <uri>` — runs Lance `compact_files` on each
table to merge small fragments into fewer larger ones.
- `omnigraph cleanup <uri> --keep N | --older-than 7d --confirm` —
runs Lance `cleanup_old_versions` to prune historical manifests +
unique fragments. Requires `--confirm` because it's destructive.
Supports both count-based and time-based retention (or both AND'd
together). Time uses chrono `DateTime<Utc>` (added as a workspace
dep, default-features off).
Both commands run their per-table loops in parallel (8-way bounded,
`OMNIGRAPH_MAINTENANCE_CONCURRENCY` env override). Smoke-tested
against the 114-table prod graph: optimize went 7m15s sequential
→ 1m28s parallel. cleanup --keep 1 removed 137 historical versions
across 114 tables in 1m57s without disrupting `/healthz` or query
responses.
Public API on `Omnigraph`:
pub async fn optimize(&mut self) -> Result<Vec<TableOptimizeStats>>
pub async fn cleanup(&mut self, opts: CleanupPolicyOptions)
-> Result<Vec<TableCleanupStats>>
All 10 existing loader tests still pass.
Closes MR-676.
Partially addresses MR-677 (the .3 — parallel by type — piece;
MR-677.1 is for the `omnigraph embed` path, not load, since load
doesn't call Gemini directly. .2 was already in place).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: regenerate openapi.json
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
BFS now emits Vec<u32> dense ids directly with HashSet<u32> per-source
dedup. Only the deduped set is stringified for Lance's IN-list. The
post-hydrate alignment uses a dense-indexed Vec<Option<u32>> instead of
HashMap<&str, usize>, giving O(1) lookup without repeated string hashing.
End-to-end on the bench_expand harness (release, M-series):
query baseline after speedup
1k hop3 460.2 ms 23.7 ms 19x
10k hop2 4.21 s 139.9 ms 30x
10k hop3 40.59 s 898.5 ms 45x
30k hop2 11.71 s 490.2 ms 24x
30k hop3 197.38 s 3.22 s 61x
The cost lived in stringifying every (src,dst) pair and re-hashing the
strings during alignment; once dense ids stay dense, the BFS inner loop
and the final fan-out both collapse to integer ops.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove vestigial code left from removed hasher variants: unused
BuildHasherDefault import, PhantomData suppression line, orphan planning
comments for Variant C/E. Also drop an unused `mut` on the PRNG closure
binding. No behavior change; compiles warning-free.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The BFS in execute_expand emits one (src_idx, dst_id) pair per edge, so
dst_id_list contains heavy duplication when multi-hop traversals revisit
the same destination nodes. hydrate_nodes then built an
"id IN ('a', 'b', ...)" filter from the full list, passing it verbatim
to Lance. On a 30k-node Person graph, a 3-hop query produced a 15.4M-
entry IN-list against a 30k-row target — 512x more entries than unique
ids.
Deduplicate before the Lance scan; the post-hydrate alignment HashMap
already fans results back out to the original (src, dst) pairs, so
output is bit-identical.
Bench numbers (crates/omnigraph/examples/bench_expand.rs, min of 2-3
runs, release build):
query before after speedup
1k hop3 460 ms 28 ms 16x
10k hop2 4.21 s 188 ms 22x
10k hop3 40.59 s 1.30 s 31x
30k hop2 11.71 s 678 ms 17x
30k hop3 197.38 s 4.86 s 41x
All existing omnigraph-engine tests pass (72/72, 0 failures).
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Run branches are transactional scaffolding — the durable audit lives
on RunRecord. Invariant: every terminal state (Published, Aborted,
Failed) deletes the __run__ branch.
- Add `terminate_run` helper: appends terminal RunRecord, then
deletes the run branch. Delete errors are swallowed — the record
is authoritative; `cleanup_terminal_run_branches_for_target`
retries on later `branch_delete` of the target.
- Wire into `publish_run_as`, `abort_run`, `fail_run`.
- Include `Failed` in the cleanup filter (was `Published | Aborted`
only) for legacy-repo GC during branch_delete.
- Cleanup now checks `coordinator.all_branches()` first to skip
branches already deleted by a concurrent handle — avoids Lance
NotFound when two handles publish/clean up independently.
- Drop `Failed` from `ensure_branch_delete_safe` — post-fix, Failed
means the branch is already gone, so there's no reason to block
target deletion (MR-674 "Downstream effects").
Tests:
- New regression: `run_branches_do_not_accumulate_across_repeated_loads`
— 10 loads + 1 abort → `branch_list() == ["main"]`.
- New `failed_load_deletes_run_branch` asserts Failed path cleans up.
- Rename `abort_run_keeps_target_unchanged_and_preserves_hidden_branch_for_inspection`
→ `abort_run_leaves_target_unchanged_and_deletes_run_branch`, invert
the hidden-branch assertion.
- Rewrite `public_{load,mutation}_preserves_staged_edge_ids_on_publish`
to capture staged IDs before publish instead of inspecting the run
branch after (branch is gone now).
- Update MR-670 regression test to assert the run branch is *absent*
after publish.
Deferred to follow-up: `--keep-run-branch` debug flag, `omnigraph run gc`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Files where inline tests crowded out production code (test/prod ratio
≥ 0.8) move to sibling files via `#[path]`. Files where production
dominates (query_input.rs, schema_plan.rs) stay inline — extracting
would add noise, not reduce it.
- ir/lower.rs: 1239 → 577 lines (ratio 1.15)
- catalog/mod.rs: 594 → 326 lines (ratio 0.83)
- query/lint.rs: 562 → 314 lines (ratio 0.80)
catalog/tests.rs uses the shorter name since it's inside a module
directory (no ambiguity with filename).
All 229 compiler tests green, identical count to before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
typecheck.rs, schema/parser.rs, and query/parser.rs each had
~1000-line inline `mod tests` blocks that overshadowed the production
code in the file. Move each to a sibling `*_tests.rs` using
`#[path = "..."] mod tests;`.
- typecheck.rs: 2865 → 1708 lines; typecheck_tests.rs: 1156 lines
- schema/parser.rs: 1950 → 994 lines; parser_tests.rs: 955 lines
- query/parser.rs: 1737 → 803 lines; parser_tests.rs: 933 lines
No visibility change — the sibling module still has `use super::*`
access to crate-privates. No semantic edits beyond de-indenting by
4 spaces (mechanical). All 229 compiler tests green, identical
count to before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The inline `mod tests` in crates/omnigraph/src/db/omnigraph.rs had grown
to ~620 lines, mixing tests that need crate-private access with tests
that only exercise the public API. Splits the latter out.
- tests/lifecycle.rs: 10 init/open/snapshot/drift tests
- tests/schema_apply.rs: 5 plan/apply tests
- omnigraph.rs: 10 tests remain inline because they use
db.coordinator, db.table_store(), ManifestCoordinator,
SCHEMA_APPLY_LOCK_BRANCH, or is_internal_run_branch — all
crate-private and intentionally kept so.
No behavior change. Zero semantic edits to the tests themselves beyond
replacing db.snapshot() (pub(crate)) with snapshot_main helper at
integration-test boundaries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Published `__run__` branches are intentionally retained after publish
for post-publish inspection (runs.rs tests verify edge IDs match
between run branch and main). `apply_schema` was counting them as
"non-main" branches and refusing to run — permanently blocking schema
evolution after any load or change, with no CLI recovery path
(`branch_delete` rejects internal refs, `run abort` rejects Published
runs).
Fix: `apply_schema` filters `is_internal_system_branch` (covers both
`__run__*` and the schema-apply lock) rather than just the lock.
Run branches remain available for inspection.
Regression: test_apply_schema_succeeds_after_load_creates_published_run_branch
pins that schema apply succeeds after a load even while the run
branch is still present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The separate openapi-sync workflow was duplicating the workspace build
(~15 min cold-cache compile), paying the cost twice per PR. Fold the
regen + auto-commit into the existing test job: one compile, shared
rust-cache, same drift-check semantics.
- Same-repo PRs: OMNIGRAPH_UPDATE_OPENAPI=1 during the test run, then
commit the regenerated spec back to the PR branch
- Fork PRs / pushes: env var empty, test stays in strict drift-check mode
- openapi_spec_is_up_to_date treats empty env value as unset, so the
conditional workflow env expression works
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces an opt-in AWS Secrets Manager backend for bearer tokens,
behind the `aws` Cargo feature. Default builds (on-prem, local dev)
don't pull in the AWS SDK and don't pay its compile cost.
- New Cargo feature `aws` gates the `aws-config` + `aws-sdk-secretsmanager`
optional deps. Default features remain empty.
- New `auth::aws::SecretsManagerTokenSource` implements `TokenSource` by
fetching a JSON `{"actor_id": "token", ...}` payload from a named
Secrets Manager secret. Credentials resolve via the AWS default chain
(env, shared config, IMDSv2 instance role, ECS task role) so no
explicit plumbing is needed under an IAM role.
- New `resolve_token_source()` dispatches based on the
`OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` env var. If the var is set
but the binary was built without `--features aws`, returns a clear
rebuild instruction rather than silently falling back.
- `serve()` now uses `resolve_token_source()` and logs which source was
selected at startup.
- `parse_json_secret_payload()` is factored out as a free function so
the payload validation (trim whitespace, reject blank actor/token,
reject non-object) is unit-testable without the AWS SDK.
- New CI job `test_aws_feature` builds + tests with `--features aws`.
Not in this PR (follow-ups):
- Background refresh loop for rotation. `SecretsManagerTokenSource`
advertises `supports_refresh: true` but the AppState-level refresh
task isn't wired yet.
- Config-YAML dispatch (today the AWS source is selected via env var
only; eventually `server.bearer_tokens.source` in `omnigraph.yaml`).
Tests:
- Default-feature build: 33 lib + 41 integration + 64 openapi.
- `--features aws` build: 32 lib (one test is cfg-gated) + 41 + 64.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure refactor. No behavior change. Introduces a TokenSource trait so
additional backends (AWS Secrets Manager, Vault, etc.) can plug in
behind feature flags without touching the server wiring.
- New module crates/omnigraph-server/src/auth.rs with the TokenSource
trait and a single EnvOrFileTokenSource implementation that delegates
to the existing server_bearer_tokens_from_env() function.
- serve() now constructs EnvOrFileTokenSource and calls load() instead
of calling the free function directly.
- The trait has a supports_refresh() hook (false for env/file) for
future implementations that can rotate without restart.
- async-trait added to omnigraph-server deps; it's already in the
workspace.
Tests:
- Unit tests in auth.rs covering load paths and the default supports_refresh
/ name values.
- Existing 128 tests (lib + integration + openapi) pass unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes two live authz bugs in omnigraph-server:
- Bearer-token lookup previously used HashMap::get, which compares keys with
Eq and short-circuits on the first differing byte — a network-observable
timing oracle for brute-forcing tokens. Tokens are now stored as SHA-256
digests and compared with subtle::ConstantTimeEq, iterating every entry
unconditionally so total work is independent of which slot matches. Raw
token bytes no longer live in server memory after startup.
- authorize_request now overwrites PolicyRequest.actor_id from the
authenticated session instead of trusting the handler-supplied field,
which previously defaulted to "" via unwrap_or_default(). The empty
string can no longer reach Cedar as a policy subject even if a future
refactor drops the None check.
External API of AppState constructors is unchanged — tokens still enter as
Vec<(String, String)> and are hashed on the way in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review feedback on #23, applied on top of the original commit:
- Rename the CLI subcommand from `schema get` to `schema show` to match
the existing `run show` / `commit show` convention. A `#[command(alias
= "get")]` preserves muscle memory for anyone who already typed `get`.
- Rename `SchemaGetOutput` → `SchemaOutput` and its field `source` →
`schema_source`, so the get response and the apply request use the
same field name for the same concept.
- Use `println!` instead of `print!` in the CLI so the shell prompt
doesn't land on the last line of schema output.
- Add three integration tests on `/schema`: happy path (no auth),
401 when bearer is required but missing, 403 when the policy grants
the actor branch_create but not read.
Follow-ups left for a separate PR: include `schema_ir_hash` and
`schema_identity_version` in the response payload so clients can do
drift detection and the server can set an ETag; and a fast-path local
read that skips `Omnigraph::open()` when only the schema source is
needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce SDK generation scaffolding: commit a static openapi.json
extracted from the Utoipa annotations via a golden-file test, add
Stainless workspace/config for TypeScript and Python SDKs, and clean
up operation IDs for ergonomic generated method names.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exposes the existing schema_source() method via a new `omnigraph schema get`
CLI subcommand and a `GET /schema` API endpoint, allowing users to retrieve
the current accepted schema from any graph repository.
https://claude.ai/code/session_01UYybeBQks3fz3RJrTHtwQw
Unit tests covering gaps identified by systematic matrix of:
topology (fan-out, fan-in, cycle) × deferral × filter type × direction.
New unit tests:
- fan-out: one root fans to two deferred destinations via different edges
- fan-in: two sources converge on one destination via reverse expand
- cycle: deferred binding + genuine cycle-closing on return edge
- multiple filters on single deferred binding (name + age)
- param filter on deferred binding (IRExpr::Param in dst_filters)
- negation with inner binding (documents current NodeScan+cycle-close behavior)
New integration tests:
- fan-out projection (friend × company cross-product per source)
- deferred filter matching nothing (empty result propagation)
- negation with inner destination binding filter
Also: guard anti-join fast path against non-empty dst_filters. The bulk
CSR existence check only tests neighbor existence, not destination
properties — it must fall back to the slow path when dst_filters are
present to avoid false negatives.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The anonymous wildcard variable _ was included as a regular node in the
undirected adjacency graph used for component analysis. When multiple
traversals referenced $_, it falsely bridged otherwise-independent
components, causing bindings in separate components to be deferred.
The deferred binding would never be introduced (since _ is never added
to bound_vars), leading to silently dropped traversals.
Fix: skip edges involving _ when building the adjacency graph.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>