Lakehouse-native graph engine with git-style workflows https://omnigraph.dev
Ragnor Comerford 1c5cb8741e
feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299)
* docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1)

3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener
unification (pinned-base opener, shared Session, write-local handle cache,
strict-op conflict-timing move) to step 5. The greptile comments on the #295
spec were moot for #298 (none of those constructs were built) but are
load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not
RefCell); (2) the strict-op timing move needs an explicit retry contract — txn
discarded after any commit, retry re-opens a fresh base — which is the SAME
contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test
must advance HEAD externally then assert pinned-base, not the trivial HEAD==base.

* feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7)

Graph lineage no longer lives in a second write to _graph_commits.lance. Each
commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest
merge-insert as the table-version rows (one atomic version), and CommitGraph reads
its cache from the manifest projection (read_graph_lineage). _graph_commits.lance no
longer receives commit rows (it remains only as a Lance branch-ref carrier).

Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor,
merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside
the publisher retry loop the parent is resolved per attempt from the just-loaded
branch-scoped manifest (the should_replace_head winner over the visible graph_commit
rows — branch-correct by Lance branch isolation; the graph_head row is written for
forward-compat + the §7.1 contention point but is not the parent source, so a
freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry
re-reads the advanced head → correct new parent; the commit_id is stable across
retries.
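
A minimal sketch of the per-attempt parent resolution described above, using an in-memory stand-in for the branch-scoped manifest. Every type and function here (`Manifest`, `try_commit`, `publish`, the `between_load_and_cas` hook) is hypothetical and simplified; the real publisher works through Lance merge-insert, and the head is picked by `should_replace_head` rather than "last row".

```rust
// Hypothetical, simplified model; not the engine's real types.
#[derive(Debug, Clone, PartialEq)]
struct GraphCommit {
    id: String,
    parent: Option<String>,
}

struct LineageIntent {
    graph_commit_id: String, // minted once, before the first attempt
}

/// In-memory stand-in for the branch-scoped manifest.
struct Manifest {
    version: u64,
    commits: Vec<GraphCommit>,
}

impl Manifest {
    /// Assumption: the newest row is the head.
    fn head(&self) -> Option<String> {
        self.commits.last().map(|c| c.id.clone())
    }

    /// Compare-and-swap: append only if the caller saw the current version.
    fn try_commit(&mut self, seen_version: u64, row: GraphCommit) -> bool {
        if seen_version != self.version {
            return false;
        }
        self.commits.push(row);
        self.version += 1;
        true
    }
}

/// `between_load_and_cas` lets a caller simulate a concurrent writer.
fn publish(
    manifest: &mut Manifest,
    intent: &LineageIntent,
    budget: usize,
    mut between_load_and_cas: impl FnMut(&mut Manifest, usize),
) -> Option<GraphCommit> {
    for attempt in 0..budget {
        // Re-load on every attempt so a retry sees the advanced head.
        let seen_version = manifest.version;
        let parent = manifest.head();
        between_load_and_cas(manifest, attempt);
        // The commit id comes from the intent, so it is stable across retries.
        let row = GraphCommit { id: intent.graph_commit_id.clone(), parent };
        if manifest.try_commit(seen_version, row.clone()) {
            return Some(row);
        }
    }
    None // budget exhausted
}
```

If a rival commit lands between the load and the CAS on the first attempt, the second attempt re-reads the head and parents the same commit id onto the rival.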

Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/race):
- manifest→commit-graph atomicity (no crash window between manifest + lineage),
- commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write
  commit_graph.refresh() is gone).

Recovery, branch-merge, and genesis route their lineage through the same CAS
(merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the
recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write).
The dead _graph_commits write helpers (append_commit/_merge/_actor) are
#[allow(dead_code)] (the actor sidecar table is still enumerated by optimize).

Verified (sequential): build clean; the new lineage_projection gate (manifest-only —
_graph_commits/_actors have 0 rows; full lineage reconstructs via the projection);
branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/
changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the
now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold
values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra
write). invariants.md marks both gaps CLOSED.

PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage
5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 +
migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full
doc-sync of storage.md/architecture.md/writes.md.

* feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4)

The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a
pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in
__manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade
seamlessly and not break reads.

- Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the
  3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors,
  emits one graph_commit row per commit + exactly one graph_head:<branch> for the head
  (should_replace_head winner, deterministic id-sort — no hash-map-order in migration
  output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard
  first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters
  at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per
  branch: migrate_on_open backfills main; load_publish_state backfills each branch on
  its first write (root_uri/branch threaded through migrate_internal_schema).
- v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the
  (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY
  open of an un-migrated graph reads correct history with no write. Correctness catch:
  the legacy _graph_commit_actors.lance was never branched, so the fallback reads it
  FLAT (no branch checkout) while checking out the branch only on the commits dataset.
- Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the
  same upgrade error (future-proofing the next format bump; the write path already
  refused via migrate_internal_schema).
- Docs: storage/architecture/writes/invariants/constants updated to manifest-stored
  lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break,
  data preserved, upgrade writers first).

6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both
modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite +
failpoints (59) + cargo test --workspace --locked green; check-agents-md passes.
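
The guard-first, stamp-last ordering can be sketched as follows. This is an illustration over a hypothetical in-memory `Graph`, not the engine's migration code; the `crash_before_stamp` flag exists only to simulate the crash window.

```rust
// Hypothetical in-memory stand-in for a graph's on-disk state.
struct Graph {
    stamp: u32,
    legacy_commits: Vec<String>,   // stands in for _graph_commits.lance
    manifest_lineage: Vec<String>, // stands in for graph_commit rows in __manifest
}

/// Returns Err to simulate a crash after the backfill but before the stamp.
fn migrate_v3_to_v4(g: &mut Graph, crash_before_stamp: bool) -> Result<(), &'static str> {
    if g.stamp >= 4 {
        return Ok(()); // already migrated
    }
    // Idempotency guard: if lineage is already present, only the stamp is missing.
    if g.manifest_lineage.is_empty() {
        let mut rows = g.legacy_commits.clone();
        rows.sort(); // deterministic id-sort, no hash-map order in the output
        g.manifest_lineage = rows;
    }
    if crash_before_stamp {
        return Err("simulated crash before set_stamp");
    }
    g.stamp = 4; // the stamp is written LAST
    Ok(())
}
```

Because the stamp is written last, a crashed run leaves the graph at v3 with lineage already backfilled, and the next run only has to stamp.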

* test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7)

Two (or N) writers committing disjoint tables on one branch still share the
mutable `graph_head:<branch>` manifest row, so the only row-level CAS
contention is that row. The contract — exactly one writer wins each CAS round;
the loser retries inside the publisher, re-resolves its parent off the
freshly-advanced head, and re-commits, so every writer lands and the
graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance
test. This adds it.

- concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint
  writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG
  is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits
  share a parent, the head is the unique non-parent).
- n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint
  writers each with an app-level retry loop (the publisher's internal budget
  can be exhausted under contention); all converge to one linear chain of 8.
- concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real
  object store (true conditional-put CAS), bucket-gated.

Cites both tests from the §7.1 contention note in invariants.md.
Test-only; no production change.
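
For illustration, the three linearity conditions named above (exactly one genesis, no two commits sharing a parent, a unique non-parent head) could be checked like this. The `(id, parent)` tuple shape and the function name are assumptions made for the sketch, not the test's actual helpers.

```rust
use std::collections::HashSet;

/// Each entry is (commit id, optional parent id).
fn is_linear_chain(commits: &[(&str, Option<&str>)]) -> bool {
    // Exactly one commit without a parent.
    let genesis_count = commits.iter().filter(|(_, p)| p.is_none()).count();

    // No two commits share a parent: every parent id inserts exactly once.
    let mut parents = HashSet::new();
    let no_shared_parent = commits
        .iter()
        .filter_map(|(_, p)| *p)
        .all(|p| parents.insert(p));

    // The head is the unique commit that is nobody's parent.
    let heads = commits.iter().filter(|(id, _)| !parents.contains(id)).count();

    genesis_count == 1 && no_shared_parent && heads == 1
}
```

These three conditions alone do not rule out a detached cycle next to the chain; a stricter check would also walk from the head back to genesis and compare the walk length with the commit count.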

* perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2)

Each lineage publish scanned `__manifest` twice: `load_publish_state` read
table state via one scan, then `resolve_lineage_rows` did a second full
`read_graph_lineage` scan only to find the parent commit. Fold the
`graph_commit` extraction into the existing scan.

- `read_manifest_scan` gains a `collect_lineage` flag. The publish path
  (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the
  table-state hot path leaves them in the forward-compat skip arm, so it never
  pays the O(commits) lineage JSON decode (it also skips reading the
  `object_id` column entirely). One shared `decode_graph_commit_row` serves
  both the folded path and the standalone `read_graph_lineage`, so the two
  cannot drift.
- `resolve_lineage_rows` is now sync and takes the already-parsed rows; the
  per-attempt re-read is preserved because `load_publish_state` runs once per
  CAS attempt, so a retry still re-parents off the advanced head.
- `load_publish_state` returns a named `LoadedPublishState` instead of a
  four-tuple; the thin `read_registered_table_locations` /
  `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes
  `#[cfg(test)]`: the fold removes its last production caller, leaving only the
  test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`),
  so gating it keeps it from becoming dead code in non-test builds.

Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads
54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips
if a publish-path scan is re-added, and its calibration comment is corrected.
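
A rough sketch of the single-pass idea, with a hypothetical row enum standing in for decoded `__manifest` rows. Only the `collect_lineage` flag name comes from the text above; the row shapes and `ScanResult` are assumptions.

```rust
// Hypothetical decoded manifest rows.
enum ManifestRow {
    TableVersion { table: String, version: u64 },
    GraphCommit { id: String },
    Other, // forward-compat rows the reader does not understand
}

#[derive(Default)]
struct ScanResult {
    tables: Vec<(String, u64)>,
    lineage: Vec<String>,
}

/// One pass over the rows; lineage is collected only when asked for.
fn read_manifest_scan(rows: &[ManifestRow], collect_lineage: bool) -> ScanResult {
    let mut out = ScanResult::default();
    for row in rows {
        match row {
            ManifestRow::TableVersion { table, version } => {
                out.tables.push((table.clone(), *version));
            }
            // Publish path: pick up graph_commit rows in the same pass.
            ManifestRow::GraphCommit { id } if collect_lineage => {
                out.lineage.push(id.clone());
            }
            // Hot path: graph_commit rows fall into the skip arm, undecoded.
            _ => {}
        }
    }
    out
}
```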

* test(engine): red — transient legacy-open failure silently completes the v3→v4 migration

A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in
`_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into
`__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY
`Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or
corrupt open during the one-time migration backfills nothing and still stamps
v4 — orphaning the real lineage permanently (the migration runs once; the v3
fallback is then disabled).

Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found
Lance error at the legacy open, and a fault-injection regression test in the
`failpoints` binary. Against the current swallow the migration completes anyway,
so the test fails on its "migration must abort" assertion — the predicted
symptom. The fix follows in the next commit.

Test support reachable from the `failpoints` integration binary (it compiles the
crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader
are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release
builds. Failpoint tests stay in the integration binary because the fail registry
is process-global.

* fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration

`read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache
(`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The
v3→v4 internal-schema migration reads this once before stamping internal-schema
v4; a transient or corrupt open therefore backfilled nothing and stamped v4
anyway, orphaning the graph's real lineage permanently (the migration runs once,
and the stamp-gated v3 fallback is disabled at v4). This is the "no silent
failures" deny-list violation, and realistic on object storage.

Both opens now match the not-found variants — Lance maps an object-store NotFound
to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and
propagate anything else as a loud error. The two arms share the variant contract
but carry different rationale (commits-absent is the legitimate empty signal;
actor-sidecar-absent is benign, but a corrupt actor open silently wiping
authorship before stamping v4 is the same loss hole), commented at each site.

Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant`
guard (turns red if a Lance bump changes the absence variant) and greens the
fault-injection regression test from the previous commit.
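
The classification rule can be sketched independently of Lance. `OpenError` below is a hypothetical stand-in for the Lance error type, with `NotFound` playing the role of the absence variant; the function name is also made up for the example.

```rust
// Hypothetical stand-in for the storage error type.
#[derive(Debug, PartialEq)]
enum OpenError {
    NotFound,
    Other(String),
}

/// Treat only a typed not-found as "no legacy data"; propagate everything else.
fn legacy_rows_or_empty(
    open: Result<Vec<String>, OpenError>,
) -> Result<Vec<String>, OpenError> {
    match open {
        Ok(rows) => Ok(rows),
        // Benign: the legacy dataset never existed.
        Err(OpenError::NotFound) => Ok(Vec::new()),
        // Transient or corrupt open: fail loudly so the migration aborts
        // before it can stamp v4 over missing lineage.
        Err(other) => Err(other),
    }
}
```

The earlier behavior corresponds to collapsing the last two arms into one `Err(_) => Ok(Vec::new())`.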

* test(engine): cover the per-branch v3→v4 migration against a real Lance branch

`seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one)
to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the
production per-branch migration path — `read_legacy_commit_cache` checking out a
real Lance branch, and a branch-scoped `__manifest` — was never exercised.

Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance
branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits
main's stripped v3 state), and a test that migrates the BRANCH and asserts the
branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit,
`graph_head:feature` → branch commit, parents + actors intact) with main's
`__manifest` untouched.

This empirically resolves the open question behind the merge robustness work: the
fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but
`__manifest` is Lance-branched per graph-branch, so a branch reads only its own
lineage — the test confirms migrating one branch does not leak into another. No
branch filter is needed.

* refactor(engine): type the lineage-backfill merge conflict via the publisher classifier

`state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest`
merge-insert) stringified its `execute_reader` error, discarding the Lance
variant. Route it through the publisher's `map_lance_publish_error` (now
`pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the
SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's
own retry consumes — one vocabulary, no raw-Lance matching in the migration.

Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that
classifier also matches `CommitConflict`/`RetryableCommitConflict` from the
compaction commit path, which a row-level merge-insert never emits. Cross-linked
with a comment at both sites.

Behavior-preserving: the only path that changes is the error TYPE on a CAS loss
(previously an opaque `Lance` string, now a typed conflict); no success/failure
outcome changes. The bounded re-open retry that consumes the new type lands next.

* test(engine): red — concurrent v3→v4 migrations error instead of converging

`migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it.
`merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no
app-level retry, so when two processes open the same legacy graph at once the
backfill's row-level CAS loser errors the whole open instead of converging.

The test opens two `__manifest` handles at the same pre-migration (v3,
empty-lineage) HEAD and runs both `migrate_internal_schema` calls under
`tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the
current code the loser fails with `RowLevelCasContention` ("Attempted 0
retries.") — the predicted symptom — so the "both must converge" assertion
panics. The bounded re-open retry that makes both converge lands next.

* fix(engine): make the v3→v4 lineage backfill converge under concurrent runners

`migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes
(or open-for-write handles) opening the same legacy graph at once both reach the
backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the
row-level CAS loser error the whole open instead of converging.

Two contention points, both now handled all-or-nothing:

1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read
   legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss
   re-opens the manifest past the winner's (atomic) commit and re-loops; the
   fast-path re-read then sees the winner's lineage and stamps. On budget
   exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's
   OUTER retry loop completes it. The retry decision reuses the publisher's
   `is_retryable_publish_conflict` so the two stay in lockstep.

2. The terminal stamp bump. Making the merge loser converge newly lets BOTH
   runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the
   loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the
   merge loop doesn't catch it). This surfaced only under the concurrent
   full-suite run, not the isolated test. Both write the SAME value, so the
   conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp
   already reached the target, succeeds; else re-applies (bounded).

Greens the race test from the previous commit (3x isolated, 5x full-suite, no
flake). The new `IncompatibleTransaction` match is pinned by
`lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`.

* fix(engine): refuse a future internal-schema stamp on the branch read path

`load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp —
`< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection —
but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be
misread by the projection rather than rejected.

Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`)
right after the branch stamp is read, mirroring the main read path's
`refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole:
migrations run main-first (main migrates on open; each branch on its first write),
so main's stamp is always >= every branch's and the main path refuses first. The
guard closes the gap if that ordering invariant is ever weakened.

Tested by force-stamping a real branch past CURRENT and asserting the branch read
refuses with the upgrade error (the test misreads via the projection — returns Ok
— without the guard, confirmed by removing it).

* docs(rfc-013): record the v3→v4 migration robustness fixes

invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is
loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded
re-open retry on the merge CAS + idempotent stamp bump), and that the branch read
path refuses a `> CURRENT` stamp.

lance.md: note the two new surface guards the migration depends on
(`dataset_open_missing_returns_not_found_variant`,
`lance_error_incompatible_transaction_variant_exists`).

testing.md: note the migration fault-injection test in the failpoints row.

* refactor: remove dead code and silence warnings across engine + cluster

Dead-code sweep follow-up to the RFC-013 stack. No behavior change.

- engine: delete the orphaned `validate_edge_cardinality` — the load path uses
  `validate_edge_cardinality_with_pending_loader` for every mode (including
  Overwrite, which it treats as the replacement table image), so the old
  standalone validator had no caller — and correct its sibling's now-stale doc
  reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline-
  commit residual kept only for recovery test setup, with no non-test caller.
- cluster: drop unused imports in `lib.rs`, delete the unused
  `ClusterStore::payload_display`, and raise `LiveGraphObservation` /
  `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions
  that return them.

Both lib crates now build warning-free.

* fix(engine): match Lance's typed DatasetAlreadyExists, not the message string

The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and
`db/recovery_audit.rs` classified the "already exists" race by
`err.to_string().contains("Dataset already exists")` — a Lance display string,
not an API contract. A wording change upstream would silently break the fallback
(a re-create would error instead of opening the existing table). Match the typed
`lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline
as the v3→v4 migration's not-found classifier — pinned by the new
`lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists`
guard so a Lance rename turns red instead of silently regressing.

* refactor(engine): consolidate now_micros into one crate::db helper

Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit,
graph_coordinator, manifest/graph) had already drifted: three mapped the
clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit
used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with
one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest`
variant), and repoint the eight call sites at `crate::db::now_micros()`. No
test asserts on the failure message, so unifying the variant is behavior-safe;
the timestamp-mapping contract can no longer fork across the rows it stamps.
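
A sketch of what such a shared helper might look like using only the standard library. The `String` error is a placeholder for the crate's own error type, and the overflow check is an addition for the example rather than something the commit describes.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Microseconds since the Unix epoch.
/// Fails if the system clock reads earlier than the epoch.
fn now_micros() -> Result<i64, String> {
    let elapsed = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map_err(|e| format!("system clock is before UNIX_EPOCH: {e}"))?;
    // as_micros() returns u128; convert with a checked cast.
    i64::try_from(elapsed.as_micros())
        .map_err(|e| format!("timestamp does not fit in i64: {e}"))
}
```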

* refactor(engine): drop the dead snapshot param from roll_back_sidecar

`roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with
`let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a
recovery-commit lineage row), so the snapshot is never read to decide whether
to skip a publish. Remove the parameter, the two call-site arguments, and the
suppressor. A signature must not advertise inputs it does not consume. The
`Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and
`record_audit_recovery_rollforward` still take it.

* test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref

A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the
`_graph_commits.lance` branch ref is a derived artifact. An interrupted
fork-reclaim or a `cleanup` race can drop that derived ref while the manifest
lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail
a logical read of the lineage.

This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in
`__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then
asserts the branch reads (`open_at_branch` / list-commits / `merge_base`)
succeed from `__manifest` while a write that needs the derived ref
(`create_branch`) fails loudly with the typed actionable error.

Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on
the missing ref errors `OmniError::Lance` (Lance "Not found:
_graph_commits.lance/tree/feature/_versions"), wedging the logical read.

* fix(engine): read manifest lineage independent of the derived _graph_commits ref

`CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the
`_graph_commits.lance` branch ref before reading lineage — so a missing derived
ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's
commit-list / merge-base / snapshot resolution even though the lineage is
readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a
derived/physical artifact failing a logical read — invariants 7 and 15.

Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`).
`open_at_branch` and `refresh` check out the derived ref best-effort: a typed
not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read
re-syncs from `__manifest`; any other open error still propagates. The manifest
existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`,
so a truly absent branch still fails loudly at the manifest. `create_branch`
(the only writer that forks a ref) and the folded-in version lookup return a
loud, actionable error on `None`, deferring repair to `cleanup`'s existing
orphan reconciler rather than inlining a write on a read-side refresh. Reads
(`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle.
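
An illustrative sketch of the optional-handle shape, with strings standing in for dataset handles. The struct fields, `CheckoutError`, and the error text are assumptions; only the method names echo the ones mentioned above.

```rust
// Hypothetical stand-in for the checkout error.
#[derive(Debug, PartialEq)]
enum CheckoutError {
    RefNotFound,
    Other(String),
}

struct CommitGraph {
    commits_handle: Option<String>, // derived ref; may be missing
    cache: Vec<String>,             // lineage loaded from the manifest
}

fn open_at_branch(
    checkout: Result<String, CheckoutError>,
    manifest_lineage: Vec<String>,
) -> Result<CommitGraph, CheckoutError> {
    // Best-effort: a typed not-found yields None, anything else propagates.
    let commits_handle = match checkout {
        Ok(handle) => Some(handle),
        Err(CheckoutError::RefNotFound) => None,
        Err(other) => return Err(other),
    };
    Ok(CommitGraph { commits_handle, cache: manifest_lineage })
}

impl CommitGraph {
    /// Reads never touch the derived handle.
    fn head_commit(&self) -> Option<&str> {
        self.cache.last().map(String::as_str)
    }

    /// The writer that needs the derived ref fails loudly when it is absent.
    fn create_branch(&self, name: &str) -> Result<String, String> {
        match &self.commits_handle {
            Some(handle) => Ok(format!("{handle}/tree/{name}")),
            None => Err(format!(
                "commit-graph ref is missing; run cleanup before forking '{name}'"
            )),
        }
    }
}
```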

Greens the wedge regression from the preceding commit.

* fix(engine): v3→v4 retry loops return retryable contention on exhaustion

`commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET`
(6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST
iteration's `IncompatibleTransaction` fell through to
`Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the
intended `RowLevelCasContention`, and the post-loop contention return was dead
code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`,
so under sustained concurrent v3→v4 migration the one-time stamp bump could fail
instead of converging, defeating the idempotency the migration is supposed to add.

Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the
retryable variant is always handled inside the loop (re-open + same-value check +
retry), so it can never reach the stringifying catch-all, and the post-loop is the
SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)`
arm now catches only genuine non-contention errors. Apply the same range alignment
to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its
`Err(err)` returns the already-typed contention — but it carried the identical
off-by-one structure the stamp loop was copied from; aligning both stops the next
copy from re-introducing it).
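
The corrected loop shape, sketched with closures in place of the real stamp write and re-read. The error enums and the function signature are hypothetical; the budget of 5 is inferred from the "6 iterations" of `0..=STAMP_RETRY_BUDGET` mentioned above and should be treated as an assumption.

```rust
const STAMP_RETRY_BUDGET: usize = 5; // assumed value

#[derive(Debug, PartialEq)]
enum StampError {
    IncompatibleTransaction,
    Other(String),
}

#[derive(Debug, PartialEq)]
enum MigrateError {
    RowLevelCasContention, // retryable by the outer publisher loop
    Fatal(String),
}

fn commit_stamp_idempotently(
    target: u32,
    mut try_stamp: impl FnMut() -> Result<(), StampError>,
    mut read_stamp: impl FnMut() -> u32,
) -> Result<(), MigrateError> {
    for _attempt in 0..STAMP_RETRY_BUDGET {
        match try_stamp() {
            Ok(()) => return Ok(()),
            // Unguarded arm: the retryable variant is always handled here,
            // so it can never reach the catch-all below.
            Err(StampError::IncompatibleTransaction) => {
                if read_stamp() >= target {
                    return Ok(()); // a concurrent winner wrote the same value
                }
                // Otherwise fall through and re-apply on the next iteration.
            }
            Err(StampError::Other(msg)) => return Err(MigrateError::Fatal(msg)),
        }
    }
    // The single reachable exhaustion path: typed, retryable contention.
    Err(MigrateError::RowLevelCasContention)
}
```

With the earlier `0..=BUDGET` range plus an `attempt < BUDGET` guard on the retryable arm, the final iteration would instead have fallen into the catch-all and returned the non-retryable error.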

Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent
winner stamps the same value, so the re-read returns Ok on the first retry — so a
new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to
lose, driving exhaustion deterministically. Against the pre-fix loop the new
`v4_stamp_exhaustion_returns_retryable_contention` test goes red with
`Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it
asserts the typed `RowLevelCasContention`. Found by automated review on #299.

* feat(engine): minimum-supported internal-schema floor + retirement tripwire

The internal-schema migration chain (`migrate_internal_schema`) had a too-new
ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers
it needs stay forever — the pile grows by one migration + readers + tests every
schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure
no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as
the oldest stamp this binary opens; raising it is how the chain sheds old code.

Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported`
checking both bounds, so the floor lands at all three stamp-enforcement sites —
the write-path migrate dispatcher, the read-only open guard, and the branch
lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A
hand-wired floor twin would have had to touch each site, and the branch-read path
is easy to miss; one combined guard cannot half-enforce. Rename the read-only
wrapper `refuse_if_internal_schema_unsupported` to match.

A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE
== MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the
now-dead migration arm (or vice versa) — stronger than a runtime test, impossible
to skip, and it doubles as the use that keeps the mirror const live.
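
A small sketch of the two-sided guard and the compile-time tripwire together. The constant names follow the text; the values (floor 1, current 4) are the ones this change describes, and the `String` error is a placeholder for the real upgrade error.

```rust
const MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION: u32 = 1;
const INTERNAL_MANIFEST_SCHEMA_VERSION: u32 = 4;
// Source version of the oldest migration arm still registered.
const LOWEST_REGISTERED_MIGRATION_SOURCE: u32 = 1;

// Evaluated at compile time: the build fails if the floor and the oldest
// migration arm drift apart.
const _: () = assert!(
    LOWEST_REGISTERED_MIGRATION_SOURCE == MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION
);

/// Accepts exactly the closed range [MIN, CURRENT].
fn refuse_if_stamp_unsupported(stamp: u32) -> Result<(), String> {
    if stamp < MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION {
        return Err(format!("internal schema v{stamp} is below the supported floor"));
    }
    if stamp > INTERNAL_MANIFEST_SCHEMA_VERSION {
        return Err(format!("internal schema v{stamp} is newer than this binary; upgrade"));
    }
    Ok(())
}
```

Raising the floor constant without deleting the matching migration arm (and updating the mirror constant) would stop the crate from compiling, which is the point of the tripwire.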

Tests: a sub-floor graph is refused in both open modes (twin of
`future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly
[MIN, CURRENT]. No behavior change for any real graph. The retirement runbook
lives on the `MIN_SUPPORTED` doc-comment + invariants.md.

* fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit

Three review-surfaced fixes on the RFC-013 Phase 7 path.

Publisher retry vs migration contention: `publish()` propagated a
`load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced
by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish
instead of being retried — only `merge_rows` conflicts hit the retry. This
contradicted the migration's own design, which returns that typed error
EXPECTING the publisher to re-run the load (by which point a concurrent winner
has usually finished the migration, so the next scan is a no-op). Route a
retryable load error through the same retry path as a retryable `merge_rows`
conflict. Regression test (failpoints): a one-shot retryable contention injected
into `load_publish_state` now commits via the retry; red without the fix (the
write fails with the injected contention).

Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the
branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can
advance `graph_head` past the recovery commit between the winner's publish and
this read — attributing the audit to a later, wrong commit. Use the latest
`RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is
the recovery commit by construction. The audit's actor was already correct (it
comes from `sidecar.actor_id`, not the commit).
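
To illustrate why the branch HEAD is the wrong anchor, here is a sketch that selects the newest commit authored by the recovery actor. The `Commit` struct and the use of `created_at` as the ordering key are assumptions; the actor string is the `omnigraph:recovery` value mentioned earlier in this message.

```rust
const RECOVERY_ACTOR: &str = "omnigraph:recovery";

// Hypothetical, trimmed-down commit row.
struct Commit {
    id: &'static str,
    actor: &'static str,
    created_at: i64,
}

/// Latest commit authored by the recovery actor, if any.
fn latest_recovery_commit(commits: &[Commit]) -> Option<&Commit> {
    commits
        .iter()
        .filter(|c| c.actor == RECOVERY_ACTOR)
        .max_by_key(|c| c.created_at)
}
```

If a user write lands after the recovery commit, the last commit in the list is the user's, while this lookup still returns the recovery commit.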

Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward`
(removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to
delete the sidecar.
2026-06-25 13:55:34 +02:00
| Name | Last commit | Date |
| --- | --- | --- |
| .cargo | Raise LANCE_MEM_POOL_SIZE to 1 GB in .cargo/config.toml | 2026-04-19 22:27:49 +03:00 |
| .context | Investigate Lance MergeInsertBuilder CAS granularity (MR-766 prereq) | 2026-04-28 23:30:17 +00:00 |
| .github | write-path cost gate + opener bypass (#288) | 2026-06-20 13:31:15 +02:00 |
| assets | docs(readme): drop em-dashes, Cursor→Codex, rename agent section (#274) | 2026-06-17 02:36:14 +03:00 |
| crates | feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299) | 2026-06-25 13:55:34 +02:00 |
| docker | fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284) | 2026-06-19 03:34:15 +03:00 |
| docs | feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299) | 2026-06-25 13:55:34 +02:00 |
| scripts | docs: onboarding-first README + in-repo agent skill + drop RustFS script (#257) | 2026-06-16 11:48:13 +02:00 |
| skills/omnigraph | docs: onboarding-first README + in-repo agent skill + drop RustFS script (#257) | 2026-06-16 11:48:13 +02:00 |
| .dockerignore | feat(docker): cluster-mode entrypoint and the CLI in the image | 2026-06-10 22:44:54 +03:00 |
| .gitignore | release: v0.5.0 (#115) | 2026-05-23 13:59:42 +01:00 |
| AGENTS.md | feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299) | 2026-06-25 13:55:34 +02:00 |
| Cargo.lock | release: v0.7.2 (#301) | 2026-06-25 09:08:12 +02:00 |
| Cargo.toml | build(deps): bump Lance 6.0.1 → 7.0.0 (correct-by-design substrate alignment) (#229) | 2026-06-14 20:42:24 +02:00 |
| CLAUDE.md | Add AGENTS.md as canonical agent guide; symlink CLAUDE.md to it | 2026-04-28 23:10:09 +02:00 |
| CODE_OF_CONDUCT.md | Initial public Omnigraph repository | 2026-04-10 20:49:41 +03:00 |
| CONTRIBUTING.md | chore: remove CODEOWNERS chassis and the code-owner review gate | 2026-06-18 02:55:27 +03:00 |
| Dockerfile | feat(docker): cluster-mode entrypoint and the CLI in the image | 2026-06-10 22:44:54 +03:00 |
| GOVERNANCE.md | chore: remove CODEOWNERS chassis and the code-owner review gate | 2026-06-18 02:55:27 +03:00 |
| LICENSE | Initial public Omnigraph repository | 2026-04-10 20:49:41 +03:00 |
| og-cheet-sheet.md | feat: inline query strings in CLI and HTTP server (#110) | 2026-05-29 13:41:54 +02:00 |
| omnigraph.example.yaml | example config: use graphs / cli.graph, matching the MR-603 rename | 2026-04-18 23:40:35 +03:00 |
| openapi.json | release: v0.7.2 (#301) | 2026-06-25 09:08:12 +02:00 |
| README.md | docs(readme): drop em-dashes, Cursor→Codex, rename agent section (#274) | 2026-06-17 02:36:14 +03:00 |
| rust-toolchain.toml | Initial public Omnigraph repository | 2026-04-10 20:49:41 +03:00 |
| SECURITY.md | Initial public Omnigraph repository | 2026-04-10 20:49:41 +03:00 |

OMNIGRAPH

Lakehouse graph database for context assembly & multi-agent coordination
Multimodal retrieval · Git-style branching · object-storage native

Quickstart  ·  Docs  ·  Cookbooks  ·  CLI

License: MIT


Omnigraph is the operational state and coordination layer for fleets of agents.
Run it as a server, declared as code; hundreds of agents operate and enrich the graph on parallel isolated branches, and every change is reviewed and merged safely.

Key capabilities

| Capability | What it gives you |
| --- | --- |
| Declared as code | A cluster.yaml declares graphs, schemas, stored queries, embedding providers, and policies; cluster apply converges it and omnigraph-server brings every graph online at /graphs/{id}/…. |
| Built for fleets of agents | Hundreds of agents enrich the graph on parallel isolated branches; changes are reviewed and merged safely, Git-style, across the whole graph. |
| Multimodal retrieval | Graph traversal + vector ANN + full-text + Reciprocal Rank Fusion in one query runtime, for context assembly. |
| Security as code | Cedar policy enforced server-side on every mutation, per-graph and server-wide; bearer auth; actor/audit tracking. |
| Runs on your infrastructure | Any S3-compatible object store: on-prem via RustFS / MinIO, or AWS S3 / R2 / GCS. VPC, on-prem, hybrid; your data never leaves your store. |
| Open, versioned storage | Lance columnar format: branchable, time-travelable, with native blob-as-data (docs, images, video). |
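
To make the Reciprocal Rank Fusion step in the multimodal retrieval row concrete, here is a minimal sketch of the textbook algorithm in Python. It is not Omnigraph's implementation: the function name, the document ids, and the smoothing constant k=60 (a commonly used default) are all assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a vector search and a full-text search.
vector_hits = ["doc-a", "doc-b", "doc-c"]
text_hits = ["doc-b", "doc-d", "doc-a"]

fused = reciprocal_rank_fusion([vector_hits, text_hits])
print(fused)
```

Because the fusion uses ranks rather than raw scores, the vector and full-text rankers do not need comparable score scales.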

What you can build

| Use case | What it's for |
| --- | --- |
| Company brain | Org knowledge unified into one graph every agent can query |
| Agentic memory | Durable, versioned memory: a branch per agent or per task, merged on review |
| Context graph | Decision traces and codified tribal knowledge for retrieval |
| Dev graph | Issues & dependency model that coding agents read and write |
| R&D / ML data layer | Experiments and trials written into branches, versioned for training & eval |

Install

curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | bash

This installs omnigraph (CLI) and omnigraph-server into ~/.local/bin from published release binaries. Or with Homebrew:

brew tap ModernRelay/tap
brew install ModernRelay/tap/omnigraph

Set it up with an AI agent

Omnigraph is built to be run by coding agents. Two ways in:

Teach your agent the playbook. This repo ships the omnigraph agent skill: the operational playbook covering cluster mode, the two config surfaces, schema evolution, query linting, data writes, branches, Cedar policy, and the common gotchas.

npx skills add ModernRelay/omnigraph@omnigraph

Or have an agent set it up from scratch. Paste this into Claude Code, Codex, or any agent that can read a URL and run a shell command:

Help me set up Omnigraph

1. Read the docs at https://github.com/ModernRelay/omnigraph, starting with
   docs/user/clusters/index.md, then docs/user/deployment.md.
2. Skim the starter graphs and seed data in the cookbooks:
   https://github.com/ModernRelay/omnigraph-cookbooks
3. Ask me what I want to build (company brain, agent memory, dev graph,
   research / R&D layer, …). Then stand up a cluster for it, load a little
   data, and run a query so I can see it working.

For ready-to-run graphs with real seed data (company brain, VC operating system, pharma & industry intel), ModernRelay/omnigraph-cookbooks is the fastest way to see Omnigraph shaped to a real domain.

Deploy

A deployment is a cluster: a multigraph config directory that declares its graphs, schemas, stored queries, and policies as code. You manage it Terraform-style: cluster plan previews the diff, cluster apply converges it. omnigraph-server then boots from the cluster and brings every graph online at /graphs/{id}/…, each behind its own policy.

1. Declare the cluster.

company-brain/
├── cluster.yaml
├── people.pg          # schema for the "knowledge" graph
├── queries/           # stored queries: the .gq files ARE the declaration
│   └── people.gq
└── base.policy.yaml   # a Cedar policy bundle

# cluster.yaml
version: 1
metadata:
  name: company-brain
storage: s3://company/clusters/company-brain   # ledger, catalog, and graph data live here
graphs:
  knowledge:
    schema: people.pg
    queries: queries/                          # every `query <name>` in queries/*.gq registers
policies:
  base:
    file: base.policy.yaml
    applies_to: [knowledge]                    # graph-bound; use [cluster] for server-level
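
The applies_to binding above lends itself to a simple consistency check: every target should be either a declared graph or the cluster scope. The Python sketch below shows one such check over the already-parsed config. It is an illustration only; check_policy_bindings is a hypothetical helper, and what omnigraph cluster validate actually checks is not described here.

```python
def check_policy_bindings(cluster):
    """Return error strings for policies bound to graphs that are not declared."""
    declared = set(cluster.get("graphs", {}))
    errors = []
    for name, policy in cluster.get("policies", {}).items():
        for target in policy.get("applies_to", []):
            # "cluster" is the server-level scope mentioned in the YAML comment.
            if target != "cluster" and target not in declared:
                errors.append(f"policy {name!r}: unknown graph {target!r}")
    return errors


# The cluster.yaml above, written out as the dict a YAML parser would produce.
cluster = {
    "version": 1,
    "metadata": {"name": "company-brain"},
    "storage": "s3://company/clusters/company-brain",
    "graphs": {"knowledge": {"schema": "people.pg", "queries": "queries/"}},
    "policies": {"base": {"file": "base.policy.yaml", "applies_to": ["knowledge"]}},
}

# Same config with a deliberate typo in the graph name.
cluster_typo = {
    **cluster,
    "policies": {"base": {"file": "base.policy.yaml", "applies_to": ["knowlege"]}},
}

ok_errors = check_policy_bindings(cluster)
typo_errors = check_policy_bindings(cluster_typo)
print(ok_errors, typo_errors)
```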

2. Stand up your object store. On-prem, run RustFS (or MinIO); Omnigraph writes Lance to it over the standard S3 API. In the cloud, point the same AWS_* environment variables at S3 / R2 / GCS instead.

3. Converge and run. apply creates each graph, applies its schema, and publishes queries and policies into the content-addressed catalog. It is idempotent; re-running is always safe.

omnigraph cluster validate   # parse + typecheck everything
omnigraph cluster plan       # preview what apply would do
omnigraph cluster apply      # converge

# Boot the server from the cluster dir; storage resolves through cluster.yaml
omnigraph-server --cluster company-brain --bind 0.0.0.0:8080

See the cluster guide for the day-2 loop (edit → plan → apply → restart), approval gates for destructive changes, drift inspection, and recovery; the deployment guide for containers, AWS/Railway, auth, and the full AWS_* contract.
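
The plan and apply steps follow the usual desired-state pattern: diff what is declared against what is recorded, then apply only the difference, which is why a second run is a no-op. The toy Python model below illustrates that idea and nothing more. The plan and apply functions, the in-memory catalog, and the placeholder file contents are assumptions, and the sketch omits the approval gates for destructive changes.

```python
import hashlib


def digest(content):
    """Identify content by its SHA-256 (the hash choice is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan(desired, catalog):
    """Preview the changes that would turn `catalog` into `desired`."""
    changes = []
    for name in sorted(desired.keys() | catalog.keys()):
        want = digest(desired[name]) if name in desired else None
        have = catalog.get(name)
        if have is None:
            changes.append(("create", name))
        elif want is None:
            changes.append(("delete", name))
        elif want != have:
            changes.append(("update", name))
    return changes


def apply(desired, catalog):
    """Converge `catalog` on `desired` by executing the plan."""
    for action, name in plan(desired, catalog):
        if action == "delete":
            del catalog[name]
        else:
            catalog[name] = digest(desired[name])
    return catalog


# Placeholder file contents, not real query or policy syntax.
desired = {
    "queries/people.gq": "placeholder query text",
    "base.policy.yaml": "placeholder policy text",
}
catalog = {}

first_plan = plan(desired, catalog)
apply(desired, catalog)
second_plan = plan(desired, catalog)  # empty: nothing left to converge
print(first_plan, second_plan)
```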

Query and mutate

Set a default server and graph once in ~/.omnigraph/config.yaml, and the everyday commands stay short. Stored queries and mutations run by name:

omnigraph query  search_docs --params '{"q":"AI safety"}'
omnigraph mutate add_person  --params '{"name":"Mina"}'

# Branch, review, merge across the whole graph; agents write in isolation
omnigraph branch create --from main agent/ingest-42
omnigraph branch merge  agent/ingest-42 --into main

An alias is shorter still: bind a server, graph, and stored query to one name, then omnigraph alias triage runs it. For an ad-hoc target, any command still takes --server <name|url> --graph <id> (or --store <uri> for a local graph). See the CLI reference.

Security & governance

  • Engine-wide enforcement: every write path goes through the same Cedar gate, so the HTTP server, the CLI, and the embedded SDK obey identical rules.
  • Declared in the cluster: a policy bundle is bound to graphs (or the whole server) via policies:applies_to.
  • Scoped: rules apply per graph, per branch, or server-wide.
  • No plaintext tokens: bearer tokens are hashed at startup and compared in constant time.
  • Forge-proof identity: the actor is resolved server-side from the token; clients can't set it.

See the policy guide.
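
The last two points, hashed tokens compared in constant time and an actor resolved server-side, can be sketched in a few lines of Python. This shows the general technique, not Omnigraph's code: the TokenRegistry class, the choice of SHA-256, and the sample tokens are assumptions.

```python
import hashlib
import hmac


class TokenRegistry:
    """Keeps only token digests in memory and maps a presented token to an actor."""

    def __init__(self, tokens_by_actor):
        # Hash every configured token once at startup; plaintext is not retained.
        self._digests = {
            actor: hashlib.sha256(token.encode("utf-8")).digest()
            for actor, token in tokens_by_actor.items()
        }

    def resolve_actor(self, presented_token):
        """Return the actor that owns the token, or None if no token matches."""
        presented = hashlib.sha256(presented_token.encode("utf-8")).digest()
        found = None
        # Check every entry, without an early exit, using a constant-time compare.
        for actor, stored in self._digests.items():
            if hmac.compare_digest(stored, presented):
                found = actor
        return found


# Hypothetical tokens for two agents.
registry = TokenRegistry({"agent-a": "s3cret-a", "agent-b": "s3cret-b"})
actor = registry.resolve_actor("s3cret-b")
unknown = registry.resolve_actor("not-a-token")
print(actor, unknown)
```

The caller never supplies an actor name; identity comes only from the token lookup, which is the property the forge-proof identity bullet describes.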

Clients & SDKs

| Client | Use it for | Where |
| --- | --- | --- |
| TypeScript SDK | typed access from Node / TS | @modernrelay/omnigraph · source |
| MCP server | bridge Omnigraph to LLM hosts (Claude, Codex, …) | @modernrelay/omnigraph-mcp |
| HTTP / OpenAPI | any language, the wire contract | the server's OpenAPI spec |
| Python SDK | typed access from Python | coming soon |

Both npm packages are versioned in lockstep with omnigraph-server.

Local quick test (no server)

A one-minute setup to try it: an embedded, local, file-backed graph (no server, no object store). It is meant for development and experiments; production is the deployed cluster above.

cat > schema.pg <<'PG'
node Signal  { slug: String @key, title: String }
node Pattern { slug: String @key, name: String }
edge Indicates: Signal -> Pattern
PG
printf '%s\n' \
  '{"type":"Signal","data":{"slug":"s1","title":"OSS model adoption surging"}}' \
  '{"type":"Pattern","data":{"slug":"p1","name":"adoption"}}' \
  '{"edge":"Indicates","from":"s1","to":"p1"}' > data.jsonl

omnigraph init  --schema schema.pg ./graph.omni
omnigraph load  --data data.jsonl --mode overwrite --store ./graph.omni

# "What pattern does signal s1 indicate?"
omnigraph query --store ./graph.omni \
  -e 'query indicates() { match { $s: Signal { slug: "s1" }  $s indicates $p } return { $p.name } }'
# → adoption
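
The seed file in the example above is JSON Lines: one record per line, with node records shaped as type plus data and edge records as edge, from, and to, where from and to carry the @key slug values. Larger seed files are easier to generate than to write by hand, so here is a minimal Python sketch that reproduces the same three lines. The helper names are hypothetical, and only the record shapes visible in the example are assumed.

```python
import json


def node_record(node_type, **props):
    """Build a node record in the {"type": ..., "data": {...}} shape shown above."""
    return {"type": node_type, "data": props}


def edge_record(edge_type, src_key, dst_key):
    """Build an edge record that refers to its endpoints by key."""
    return {"edge": edge_type, "from": src_key, "to": dst_key}


def to_jsonl(records):
    """Serialize records as compact JSON, one per line, with a trailing newline."""
    lines = [json.dumps(record, separators=(",", ":")) for record in records]
    return "\n".join(lines) + "\n"


records = [
    node_record("Signal", slug="s1", title="OSS model adoption surging"),
    node_record("Pattern", slug="p1", name="adoption"),
    edge_record("Indicates", "s1", "p1"),
]

jsonl = to_jsonl(records)
print(jsonl, end="")
```

Writing the string to data.jsonl gives the same content as the printf command in the quick test.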

Docs

Build And Test

cargo build --workspace
cargo test  --workspace

Notes:

  • Rust stable toolchain, edition 2024
  • CI runs cargo test --workspace --locked
  • Full CI and some local test flows require protobuf-compiler
  • S3 integration tests expect an S3-compatible endpoint such as RustFS

Workspace Crates

  • crates/omnigraph-compiler: shared schema/query parser, typechecker, catalog, and IR lowering (zero Lance dependency)
  • crates/omnigraph (package omnigraph-engine): storage/runtime, branching, merge, change detection, query execution, and embeddings
  • crates/omnigraph-policy: Cedar policy compilation and enforcement
  • crates/omnigraph-api-types: shared HTTP wire DTOs used by both the server and the CLI
  • crates/omnigraph-cluster: cluster config validation, planning, and apply (the control plane)
  • crates/omnigraph-server: Axum HTTP server, cluster-first, runs N graphs under /graphs/{id}/…
  • crates/omnigraph-cli: CLI for graph lifecycle, query/mutate, branch/commit/merge, schema/lint, snapshot/export, cluster control, policy/queries, profiles, and maintenance

Contributing

Please open an issue, spec, or design discussion before sending large code changes. Design feedback and concrete problem statements are the fastest way to collaborate on the roadmap.

Community

Join the Omnigraph Slack community to ask questions, share feedback, and follow development.