Lakehouse-native graph engine with git-style workflows https://omnigraph.dev
Find a file
Ragnor Comerford 7168ee0ed0
fix(engine): stop branch-merge fast-forward OOM on embedding tables (#277)
* fix(engine): stop branch-merge fast-forward OOM on embedding tables

A branch→main fast-forward merge of a forked, embedding-bearing table
re-derived the whole branch row-by-row: it lumped new + changed rows into
one Lance `merge_insert`, i.e. a full-outer hash join over the entire
delta that exhausts the DataFusion memory pool (8k rows × 3072-dim →
`Resources exhausted: 188MB HashJoinInput, 100MB pool`), so the merge
hung/failed instead of completing.

Fix the data path on existing, substrate-supported primitives:

- Adopt-with-delta split: new rows → `stage_append` (a streaming
  `Operation::Append`, no hash join), only genuinely-changed rows →
  a bounded `stage_merge_insert`, deletes inline. New `AdoptDelta` /
  `compute_adopt_delta` / `publish_adopted_delta` replace the combined
  `compute_source_delta` path; the three-way merge path is untouched.
- Stream the append via `stage_append_stream` →
  `execute_uncommitted_stream` (the substrate-blessed bulk-append path),
  removing the `Vec`+`concat` full-delta materialization. Blob-aware via
  `scan_stream_for_rewrite`. Exposed on the sealed `TableStorage` trait.
- Lazy row-signature: stop stringifying every row's embedding eagerly;
  compute the signature only for the `(Some,Some)` changed-candidate arm.
- Index coverage is reconciler-owned: the adopt path no longer rebuilds
  vector/FTS indexes inline; `optimize`/`ensure_indices` folds the new
  rows in (reads stay correct via brute-force tail). Post-merge
  index-coverage contract documented in docs/user/branching/merge.md.
- Recovery pin: new `CandidateTableState::AdoptWithDelta` is classified
  and pinned so the append's HEAD advance is sidecar-covered
  (invariant 5); the `BranchMerge` sidecar's loose classification covers
  the two-commit shape.

The regression gate is structural, not a brittle size threshold: task-local
write probes assert an append-only fast-forward merge does 0
`stage_merge_insert` (the OOM hash join), appends via `stage_append`, and
streams (0 whole-delta materialization). Plus functional correctness,
blob round-trip, index-defer, and a Phase-B failpoint recovery test.

Residual: the classify-time staging round-trip is still O(N) in memory
(architecturally required for the all-or-nothing multi-table publish);
bounding it fully is the fragment-adopt follow-up.

* test(engine): partial branch-merge Phase B must roll back (RED regression)

A branch-merge per-table publish is a multi-commit sequence — adopt:
append → upsert → delete; three-way: merge_insert → delete → index — each
step advancing Lance HEAD before the single manifest publish. Add four
failpoint sites at those windows and four regression tests (mixed delta:
a fresh id, a modified base id, a removed base id) asserting that a crash
mid-sequence rolls the whole merge BACK on the next open and a re-run
re-applies the full delta.

RED against current code: the loose `BranchMerge` classification rolls any
`lance_head > manifest_pinned` forward, so the partial is published and the
merge recorded — the rolled-back-to-base assertion fails with the partial
state visible (e.g. bob appended, dave not deleted). The fix lands next.

The failpoint sites are no-ops unless the `failpoints` feature activates them.

* fix(engine): roll back partial branch-merge Phase B (recovery WAL confirmation)

A branch-merge publishes each table with several Lance commits (adopt:
append → upsert → delete; three-way: merge_insert → delete → index), then
one manifest publish makes them atomic. Recovery classified `BranchMerge`
loosely: any `lance_head > manifest_pinned` with a matching CAS pin rolled
*forward* to the observed HEAD. So a crash mid-sequence published a partial
delta (e.g. the append without its sibling upsert/delete) and recorded the
merge as complete — silent data loss; a re-merge sees "already up to date"
and never repairs it.

Fix: make the recovery sidecar a two-phase WAL for `BranchMerge`. After the
whole per-table publish loop completes, stamp each pin's `confirmed_version`
with its exact achieved Lance version (a second sidecar write), then publish
the manifest. Recovery now:

- rolls FORWARD only to a pin's `confirmed_version` (set ⇒ Phase B finished);
- rolls BACK (`TableClassification::IncompletePhaseB`) when the HEAD moved but
  no confirmation was recorded ⇒ a partial publish ⇒ all-or-nothing restore to
  the manifest pin, so a re-run re-applies the full delta.

Scope: `BranchMerge` only. Other loose writers (`SchemaApply`,
`EnsureIndices`, `Optimize`) keep the loose roll-forward — their drift is
derived state (index coverage, compaction) a partial roll-forward never
corrupts, so confirmation would be cost without benefit.

This is the write-ahead intent record + idempotent roll-forward that the
fast-forward-main commit model requires to be crash-atomic across N tables;
version-recorded (not phase-count-derived), so it survives later changes to
the per-table commit sequence.

Regression tests (failpoints): four partial-window crashes — adopt
after-append / after-upsert, three-way after-merge / after-delete — each with
a mixed delta (new id, modified id, removed id) now roll the whole merge back;
the existing complete-Phase-B tests still roll forward.

* fix(engine): scope merge index docs to fast-forward; record append probe after write

Two PR-review fixes:

- docs(merge): the "a merge does not build indexes inline" note only holds for
  the fast-forward / adopt path (deferred to the reconciler). The three-way
  `Merged` path still rebuilds indexes inline in its publish, so a
  Merged-outcome merge of an embedding table pays the build up front. Scope the
  doc so a Merged-outcome user isn't surprised or led to skip a post-merge
  optimize.

- `stage_append` recorded its instrumentation probe before the fallible
  `execute_uncommitted`, so a failed staging write left the call/row counters
  inflated — and diverged from `stage_append_stream`, which records after the
  transaction is built. Record after the write succeeds.

* fix(engine): record stage_merge_insert / vector-index probes after write too

The prior commit moved `stage_append`'s instrumentation probe to after the
write, but left the two sibling write primitives with the identical ordering
bug: `stage_merge_insert` recorded before `execute_uncommitted`, and
`create_vector_index` before the index build. A failed write on either would
inflate the probe counter. Move both to record only after the write succeeds,
so all write-primitive probes share one rule (record-after-success) — closing
the class rather than the single instance the review flagged.

* docs(engine): mark the fragment-adopt excision boundary in the merge code

Comment the transitional row-level merge code so a future fragment-adopt
implementation (Lance branch-merge/rebase #7263 + UUID branch paths #7185)
knows exactly what it deletes and what it keeps:

- `AdoptDelta` / `compute_adopt_delta` / `publish_adopted_delta` — the row-level
  re-derivation; removed wholesale when a fast-forward merge becomes a fragment
  graft (adopt the source table version's fragments + indexes by reference).
- `stage_append_stream` — its only caller is that merge append; dead with it
  unless re-adopted as a general bulk-append path.
- `confirm_sidecar_phase_b` — the boundary marker: this SURVIVES. The recovery
  sidecar is the cross-table WAL a fast-forward-main commit still needs; only the
  within-table multi-commit reason for `IncompletePhaseB` narrows once each table
  is a single graft commit. Keep the sidecar; only simplify the classifier.

Comments only; no behavior change.

* test(engine): pre-upgrade v1 branch-merge sidecar must roll forward (RED)

Phase-B confirmation made the recovery classifier strict for every BranchMerge
sidecar — including ones written by a binary that predates confirmation. A
pre-upgrade crash in the Phase-B→C gap can leave such a sidecar over a COMPLETED
merge; the new classifier reads its absent confirmed_version as a partial and
rolls it back, silently discarding the finished merge (greptile P1 / Cursor High).

This regression test synthesizes that sidecar realistically: crash after Phase B
(real sidecar + advanced Lance HEAD), downgrade the on-disk JSON to the
pre-confirmation v1 shape (schema_version=1, strip confirmed_version), reopen.
RED: the merge rolls back, `bob` is discarded (left ["alice"], want
["alice","bob"]). The versioning fix lands next.

* fix(engine): version the recovery sidecar; read pre-confirmation merges as loose

Phase-B confirmation changed how a BranchMerge sidecar's absent confirmed_version
is interpreted (roll forward → roll back) without versioning the artifact, so the
new classifier silently discarded completed pre-upgrade merges (greptile P1 /
Cursor High). A capability flag would not fix the symmetric direction — keeping
schema_version=1, an OLD binary reading a NEW sidecar sails through its
already-shipped strict gate, ignores the unknown flag, and applies loose
semantics to a new partial → the same data loss on downgrade. Use the versioning
system instead.

- Bump SIDECAR_SCHEMA_VERSION 1 → 2; add a fixed CONFIRMATION_SCHEMA_VERSION = 2
  (the generation at which confirmation shipped — pinned, so a later v3 keeps v2
  confirmation-aware).
- Make the read gate version-aware (`parse_sidecar`): refuse only versions NEWER
  than this binary; accept and interpret older ones with their original
  semantics — no operator toil draining pre-upgrade sidecars. Rename
  `SidecarSchemaError.supported_version` → `max_supported_version` and reword.
- Dispatch classification by version: the strict BranchMerge confirmation path is
  gated on `schema_version >= CONFIRMATION_SCHEMA_VERSION`; a v1 BranchMerge
  sidecar falls through to the existing loose roll-forward. Thread
  `sidecar.schema_version` from `process_sidecar`.

This is bidirectionally safe: a new binary interprets v1 (loose) and v2 (strict)
and refuses the future; an old binary's `!= 1` gate already refuses v2, so it
never misreads a new sidecar. The flag was an additive-field pattern misapplied
to a semantics change; versioning is the correct mechanism.

Honest residual (any approach): an old *partial* sidecar still rolls forward —
v1 carries no confirmation, so partialness is undetectable in it. The fix stops
us from interpreting old sidecars under new rules; it can't retrofit information
they never had.

* fix(engine): harden recovery — mode resolver, loud divergence check, publish classified version

Three correct-by-design fixes from the holistic review of the recovery path, all
in recovery.rs (each closes a class, not an instance):

A. Resolve the classification mode from `(kind, schema_version)` once, instead of
   a kind×version match accreting fall-through guards in `classify_table`. New
   `ClassificationMode { Strict, Loose, Confirmed }` + an exhaustive
   `SidecarKind::classification_mode` — adding a writer kind or version floor is
   now one arm in the resolver (the compiler forces it), not a guard threaded
   through the classifier. No behavior change; existing classify/decide tests are
   the guard.

B. `confirm_sidecar_phase_b` now errors loudly when a pinned table has no achieved
   version in the publish `updates`, instead of silently skipping it (which left
   the pin unconfirmed → `IncompletePhaseB` → a silent rollback of a COMPLETE
   merge). Guards the implicit `pins ⊆ updates` invariant against a future
   divergence between the two filters (invariants 9/13). + a unit test.

C. Recovery roll-forward publishes the version classification OBSERVED
   (`state.lance_head`), not a fresh HEAD re-read at publish time. For a Confirmed
   pin classify already validated `lance_head == confirmed_version`, so this
   publishes the recorded WAL intent by construction and closes the
   classify→publish re-derivation/TOCTOU for every writer (invariant 15).
   `push_table_update_at_head` → `push_table_update(target_version: Option<u64>)`:
   roll-forward pins the classified version; roll-back keeps `None` (publishes the
   restore commit it just made). In-scope behavior is preserved, so the existing
   roll-forward integration tests are the guard; the drift-hardening is
   correct-by-construction (deterministic mid-sweep drift injection isn't feasible
   — a sync failpoint can't do an async Lance write).
2026-06-19 00:15:06 +02:00
.cargo Raise LANCE_MEM_POOL_SIZE to 1 GB in .cargo/config.toml 2026-04-19 22:27:49 +03:00
.context Investigate Lance MergeInsertBuilder CAS granularity (MR-766 prereq) 2026-04-28 23:30:17 +00:00
.github chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
assets docs(readme): drop em-dashes, Cursor→Codex, rename agent section (#274) 2026-06-17 02:36:14 +03:00
crates fix(engine): stop branch-merge fast-forward OOM on embedding tables (#277) 2026-06-19 00:15:06 +02:00
docker feat(docker): cluster-mode entrypoint and the CLI in the image 2026-06-10 22:44:54 +03:00
docs fix(engine): stop branch-merge fast-forward OOM on embedding tables (#277) 2026-06-19 00:15:06 +02:00
scripts docs: onboarding-first README + in-repo agent skill + drop RustFS script (#257) 2026-06-16 11:48:13 +02:00
skills/omnigraph docs: onboarding-first README + in-repo agent skill + drop RustFS script (#257) 2026-06-16 11:48:13 +02:00
.dockerignore feat(docker): cluster-mode entrypoint and the CLI in the image 2026-06-10 22:44:54 +03:00
.gitignore release: v0.5.0 (#115) 2026-05-23 13:59:42 +01:00
AGENTS.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
Cargo.lock perf(engine): remove the per-query metadata re-derivation tax on warm reads (#268) 2026-06-17 13:25:20 +02:00
Cargo.toml build(deps): bump Lance 6.0.1 → 7.0.0 (correct-by-design substrate alignment) (#229) 2026-06-14 20:42:24 +02:00
CLAUDE.md Add AGENTS.md as canonical agent guide; symlink CLAUDE.md to it 2026-04-28 23:10:09 +02:00
CODE_OF_CONDUCT.md Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00
CONTRIBUTING.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
Dockerfile feat(docker): cluster-mode entrypoint and the CLI in the image 2026-06-10 22:44:54 +03:00
GOVERNANCE.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
LICENSE Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00
og-cheet-sheet.md feat: inline query strings in CLI and HTTP server (#110) 2026-05-29 13:41:54 +02:00
omnigraph.example.yaml example config: use graphs / cli.graph, matching the MR-603 rename 2026-04-18 23:40:35 +03:00
openapi.json [codex] fix RFC-011 follow-up regressions (#258) 2026-06-16 03:11:43 +03:00
README.md docs(readme): drop em-dashes, Cursor→Codex, rename agent section (#274) 2026-06-17 02:36:14 +03:00
rust-toolchain.toml Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00
SECURITY.md Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00

OMNIGRAPH

Lakehouse graph database for context assembly & multi-agent coordination
Multimodal retrieval · Git-style branching · object-storage native

Quickstart  ·  Docs  ·  Cookbooks  ·  CLI

License: MIT crates.io Rust


Omnigraph is the operational state and coordination layer for fleets of agents.
Run it as a server, declared as code; hundreds of agents operate and enrich the graph on parallel isolated branches, and every change is reviewed and merged safely.

Key capabilities

Capability What it gives you
Declared as code A cluster.yaml declares graphs, schemas, stored queries, embedding providers, and policies; cluster apply converges it and omnigraph-server brings every graph online at /graphs/{id}/….
Built for fleets of agents Hundreds of agents enrich the graph on parallel isolated branches; changes are reviewed and merged safely, Git-style, across the whole graph.
Multimodal retrieval Graph traversal + vector ANN + full-text + Reciprocal Rank Fusion in one query runtime, for context assembly.
Security as code Cedar policy enforced server-side on every mutation, per-graph and server-wide; bearer auth; actor/audit tracking.
Runs on your infrastructure Any S3-compatible object store: on-prem via RustFS / MinIO, or AWS S3 / R2 / GCS. VPC, on-prem, hybrid; your data never leaves your store.
Open, versioned storage Lance columnar format: branchable, time-travelable, with native blob-as-data (docs, images, video).

What you can build

Use case What it's for
Company brain Org knowledge unified into one graph every agent can query
Agentic memory Durable, versioned memory: a branch per agent or per task, merged on review
Context graph Decision traces and codified tribal knowledge for retrieval
Dev graph Issues & dependency model that coding agents read and write
R&D / ML data layer Experiments and trials written into branches, versioned for training & eval

Install

curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | bash

This installs omnigraph (CLI) and omnigraph-server into ~/.local/bin from published release binaries. Or with Homebrew:

brew tap ModernRelay/tap
brew install ModernRelay/tap/omnigraph

Set it up with an AI agent

Omnigraph is built to be run by coding agents. Two ways in:

Teach your agent the playbook. This repo ships the omnigraph agent skill: the operational playbook covering cluster mode, the two config surfaces, schema evolution, query linting, data writes, branches, Cedar policy, and the common gotchas.

npx skills add ModernRelay/omnigraph@omnigraph

Or have an agent set it up from scratch. Paste this into Claude Code, Codex, or any agent that can read a URL and run a shell command:

Help me set up Omnigraph

1. Read the docs at https://github.com/ModernRelay/omnigraph, starting with
   docs/user/clusters/index.md, then docs/user/deployment.md.
2. Skim the starter graphs and seed data in the cookbooks:
   https://github.com/ModernRelay/omnigraph-cookbooks
3. Ask me what I want to build (company brain, agent memory, dev graph,
   research / R&D layer, …). Then stand up a cluster for it, load a little
   data, and run a query so I can see it working.

For ready-to-run graphs with real seed data (company brain, VC operating system, pharma & industry intel), ModernRelay/omnigraph-cookbooks is the fastest way to see Omnigraph shaped to a real domain.

Deploy

A deployment is a cluster: a multigraph config directory that declares its graphs, schemas, stored queries, and policies as code. You manage it Terraform-style: cluster plan previews the diff, cluster apply converges it. omnigraph-server then boots from the cluster and brings every graph online at /graphs/{id}/…, each behind its own policy.

1. Declare the cluster.

company-brain/
├── cluster.yaml
├── people.pg          # schema for the "knowledge" graph
├── queries/           # stored queries: the .gq files ARE the declaration
│   └── people.gq
└── base.policy.yaml   # a Cedar policy bundle
# cluster.yaml
version: 1
metadata:
  name: company-brain
storage: s3://company/clusters/company-brain   # ledger, catalog, and graph data live here
graphs:
  knowledge:
    schema: people.pg
    queries: queries/                          # every `query <name>` in queries/*.gq registers
policies:
  base:
    file: base.policy.yaml
    applies_to: [knowledge]                    # graph-bound; use [cluster] for server-level

2. Stand up your object store. On-prem, run RustFS (or MinIO); Omnigraph writes Lance to it over the standard S3 API. In the cloud, point the same AWS_* env at S3 / R2 / GCS instead.

3. Converge and run. apply creates each graph, applies its schema, and publishes queries and policies into the content-addressed catalog. It is idempotent; re-running is always safe.

omnigraph cluster validate   # parse + typecheck everything
omnigraph cluster plan       # preview what apply would do
omnigraph cluster apply      # converge

# Boot the server from the cluster dir; storage resolves through cluster.yaml
omnigraph-server --cluster company-brain --bind 0.0.0.0:8080

See the cluster guide for the day-2 loop (edit → plan → apply → restart), approval gates for destructive changes, drift inspection, and recovery; the deployment guide for containers, AWS/Railway, auth, and the full AWS_* contract.

Query and mutate

Set a default server and graph once in ~/.omnigraph/config.yaml, and the everyday commands stay short. Stored queries and mutations run by name:

omnigraph query  search_docs --params '{"q":"AI safety"}'
omnigraph mutate add_person  --params '{"name":"Mina"}'

# Branch, review, merge across the whole graph; agents write in isolation
omnigraph branch create --from main agent/ingest-42
omnigraph branch merge  agent/ingest-42 --into main

An alias is shorter still: bind a server, graph, and stored query to one name, then omnigraph alias triage runs it. For an ad-hoc target, any command still takes --server <name|url> --graph <id> (or --store <uri> for a local graph). See the CLI reference.

Security & governance

  • Engine-wide enforcement: every write path goes through the same Cedar gate, so the HTTP server, the CLI, and the embedded SDK obey identical rules.
  • Declared in the cluster: a policy bundle is bound to graphs (or the whole server) via policies:applies_to.
  • Scoped: rules apply per graph, per branch, or server-wide.
  • No plaintext tokens: bearer tokens are hashed at startup and compared in constant time.
  • Forge-proof identity: the actor is resolved server-side from the token; clients can't set it.

See the policy guide.

Clients & SDKs

Client Use it for Where
TypeScript SDK typed access from Node / TS @modernrelay/omnigraph · source
MCP server bridge Omnigraph to LLM hosts (Claude, Codex, …) @modernrelay/omnigraph-mcp
HTTP / OpenAPI any language, the wire contract the server's OpenAPI spec
Python SDK typed access from Python coming soon

Both npm packages are versioned in lockstep with omnigraph-server.

Local quick test (no server)

1-min setup to try it: an embedded, local file-backed graph (no server, no object store). For dev and experiments; production is the deployed cluster above.

cat > schema.pg <<'PG'
node Signal  { slug: String @key, title: String }
node Pattern { slug: String @key, name: String }
edge Indicates: Signal -> Pattern
PG
printf '%s\n' \
  '{"type":"Signal","data":{"slug":"s1","title":"OSS model adoption surging"}}' \
  '{"type":"Pattern","data":{"slug":"p1","name":"adoption"}}' \
  '{"edge":"Indicates","from":"s1","to":"p1"}' > data.jsonl

omnigraph init  --schema schema.pg ./graph.omni
omnigraph load  --data data.jsonl --mode overwrite --store ./graph.omni

# "What pattern does signal s1 indicate?"
omnigraph query --store ./graph.omni \
  -e 'query indicates() { match { $s: Signal { slug: "s1" }  $s indicates $p } return { $p.name } }'
# → adoption

Docs

Build And Test

cargo build --workspace
cargo test  --workspace

Notes:

  • Rust stable toolchain, edition 2024
  • CI runs cargo test --workspace --locked
  • Full CI and some local test flows require protobuf-compiler
  • S3 integration tests expect an S3-compatible endpoint such as RustFS

Workspace Crates

  • crates/omnigraph-compiler: shared schema/query parser, typechecker, catalog, and IR lowering (zero Lance dependency)
  • crates/omnigraph (package omnigraph-engine): storage/runtime, branching, merge, change detection, query execution, and embeddings
  • crates/omnigraph-policy: Cedar policy compilation and enforcement
  • crates/omnigraph-api-types: shared HTTP wire DTOs used by both the server and the CLI
  • crates/omnigraph-cluster: cluster config validation, planning, and apply (the control plane)
  • crates/omnigraph-server: Axum HTTP server, cluster-first, runs N graphs under /graphs/{id}/…
  • crates/omnigraph-cli: CLI for graph lifecycle, query/mutate, branch/commit/merge, schema/lint, snapshot/export, cluster control, policy/queries, profiles, and maintenance

Contributing

Please open an issue, spec, or design discussion before sending large code changes. Design feedback and concrete problem statements are the fastest way to collaborate on the roadmap.

Community

Join the Omnigraph Slack community to ask questions, share feedback, and follow development.