Lakehouse-native graph engine with git-style workflows https://omnigraph.dev
Find a file
Ragnor Comerford 6d4606a830
fix(engine): optimize survives a cross-process write race on the same table (#297)
* test(engine): cross-process optimize-vs-write race — RED

Two regression tests for the prod bug: a direct `optimize` process racing a
served write on the same table fails, because the in-process write queue does
not serialize across processes and the data-table optimize path has no retry.

- optimize_survives_concurrent_insert_advancing_manifest: a concurrent insert
  advances the manifest while optimize is paused between compact and publish;
  optimize's equality-CAS publish then fails "expected X but current Y".
- optimize_survives_concurrent_delete_before_compaction: a concurrent delete
  commits before optimize compacts; Lance rebases the compaction past it
  cleanly, so optimize again fails the publish CAS (the genuine Lance
  Rewrite-vs-Rewrite overlap is rarer and shares the internal path's retry).

Both fail today with ExpectedVersionMismatch. Adds the `optimize.before_compact`
failpoint seam + a wait_for_sidecar helper; serializes the optimize failpoint
tests (shared failpoint name). The fix lands next.

* fix(engine): optimize survives a cross-process write race on the same table

The data-table optimize path trusted the in-process write queue and skipped a
retry, so a CLI `optimize` racing a served write (separate processes = separate
queues) failed: either the Lance Rewrite lost ("preempted by concurrent Update")
or the manifest publish lost the strict equality CAS ("expected X but current Y").

Unify both compaction paths on the internal path's reopen+replan shape, with a
two-level retry that matches the two failure points:

- Outer loop (reopen+replan): a genuine Lance Rewrite-vs-Update/Delete same-
  fragment conflict means our compaction did not commit — reopen at the new HEAD
  and re-plan. Lance rebases the common disjoint case (a concurrent insert/delete
  on other fragments) for free, so this fires only on real overlap.
- Inner loop (Phase C, monotonic publish): the manifest advanced between our
  compaction and our publish. The compaction is already committed at Lance HEAD N,
  so we must NOT reopen (that trips the HEAD>manifest drift guard on our own work).
  Re-read the current manifest version C: if C >= N the manifest already includes
  our compaction (versions are linear) — no-op; else fast-forward to N. Monotonic,
  not the strict equality CAS that manufactured the conflict.

The Phase-A sidecar is written once and reused across reopen attempts (every
Phase-B commit is content-preserving, so recovery rolls the observed HEAD forward
or safely rolls the compaction back). The in-process queue is kept — it is now an
in-process contention reducer, not the cross-process correctness guard. Shares the
COMPACTION_RETRY_BUDGET constant + is_retryable_lance_conflict with the internal
path; adds is_retryable_manifest_conflict for the publish loop. No writer_epoch.

Turns the prior commit's two race tests green.

* docs(rfc-013): two-op-class principle + the found+fixed optimize-vs-write race

§6.6 records the maintenance vs logical op-class distinction (maintenance commutes
→ Lance rebase + reopen/replan + monotonic manifest fast-forward, no writer_epoch;
logical → strict cross-process OCC + epoch) and the prod optimize-vs-served-write
race that motivated it, now landed. Adds the matching mechanic row to §4.2.

* fix(engine): retry must not misclassify optimize's own HEAD drift

Review catch on the cross-process optimize fix: the outer retry loop re-ran the
`lance_head > manifest` drift guard every iteration. After a partial Phase-B commit
(the auto_cleanup strip or compaction commits, then a later op hits a retryable
conflict), the reopened attempt saw HEAD ahead of the manifest — from OUR own
sidecar-covered work, not an external writer — and deleted the sidecar + returned
`skipped_for_drift`, stranding uncovered drift that then needs `repair`.

Track `head_advanced` (did one of our Phase-B ops already commit). The drift guard
now fires only when `!head_advanced` (genuine pre-existing external drift); once we
have advanced HEAD, a reopened HEAD>manifest is our work that the monotonic publish
fast-forwards. The no-op early-return likewise publishes prior committed work instead
of dropping it when `head_advanced`.

Regression test `optimize_retry_does_not_misclassify_own_head_drift` injects one
retryable reindex conflict after the compaction commits (new `optimize.inject_
reindex_conflict` seam); red→green verified by negative control (reverting the gate
reproduces `skipped_for_drift: Some(DriftNeedsRepair)`).

Also de-flake `optimize_survives_concurrent_insert_advancing_manifest`: pause at
`before_compact` (not post-compact) so the concurrent insert lands while HEAD==
manifest — otherwise it could race optimize's committed-but-unpublished compaction
and hit the write-path "HEAD ahead of manifest" guard.

* fix(engine): optimize publish converges on retry-budget exhaustion

Review catch (greptile): the monotonic Phase-C publish loop returned an error on its
final iteration's retryable manifest conflict, even though that conflict can itself
mean a concurrent writer published a version that already includes our (content-
preserving) compaction — i.e. the postcondition ("the manifest reflects our
compaction") is already met. Recovery covered it (no data loss), but the operator
saw a spurious error and had to re-run.

Restructure the loop to re-read `current` on every retryable conflict and, on budget
exhaustion, do a final `current >= state.version` convergence check before surfacing
the error — the §6.6 "postcondition is the state, not winning the CAS" principle.
Factor the repeated current-version read into `current_manifest_version`.
2026-06-22 13:05:28 +02:00
.cargo Raise LANCE_MEM_POOL_SIZE to 1 GB in .cargo/config.toml 2026-04-19 22:27:49 +03:00
.context Investigate Lance MergeInsertBuilder CAS granularity (MR-766 prereq) 2026-04-28 23:30:17 +00:00
.github write-path cost gate + opener bypass (#288) 2026-06-20 13:31:15 +02:00
assets docs(readme): drop em-dashes, Cursor→Codex, rename agent section (#274) 2026-06-17 02:36:14 +03:00
crates fix(engine): optimize survives a cross-process write race on the same table (#297) 2026-06-22 13:05:28 +02:00
docker fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284) 2026-06-19 03:34:15 +03:00
docs fix(engine): optimize survives a cross-process write race on the same table (#297) 2026-06-22 13:05:28 +02:00
scripts docs: onboarding-first README + in-repo agent skill + drop RustFS script (#257) 2026-06-16 11:48:13 +02:00
skills/omnigraph docs: onboarding-first README + in-repo agent skill + drop RustFS script (#257) 2026-06-16 11:48:13 +02:00
.dockerignore feat(docker): cluster-mode entrypoint and the CLI in the image 2026-06-10 22:44:54 +03:00
.gitignore release: v0.5.0 (#115) 2026-05-23 13:59:42 +01:00
AGENTS.md release: v0.7.1 (#290) 2026-06-19 23:12:44 +03:00
Cargo.lock release: v0.7.1 (#290) 2026-06-19 23:12:44 +03:00
Cargo.toml build(deps): bump Lance 6.0.1 → 7.0.0 (correct-by-design substrate alignment) (#229) 2026-06-14 20:42:24 +02:00
CLAUDE.md Add AGENTS.md as canonical agent guide; symlink CLAUDE.md to it 2026-04-28 23:10:09 +02:00
CODE_OF_CONDUCT.md Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00
CONTRIBUTING.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
Dockerfile feat(docker): cluster-mode entrypoint and the CLI in the image 2026-06-10 22:44:54 +03:00
GOVERNANCE.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
LICENSE Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00
og-cheet-sheet.md feat: inline query strings in CLI and HTTP server (#110) 2026-05-29 13:41:54 +02:00
omnigraph.example.yaml example config: use graphs / cli.graph, matching the MR-603 rename 2026-04-18 23:40:35 +03:00
openapi.json docs(user): coherence cleanup aligned with 0.7.1 (#293) 2026-06-21 00:02:34 +03:00
README.md docs(readme): drop em-dashes, Cursor→Codex, rename agent section (#274) 2026-06-17 02:36:14 +03:00
rust-toolchain.toml Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00
SECURITY.md Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00

OMNIGRAPH

Lakehouse graph database for context assembly & multi-agent coordination
Multimodal retrieval · Git-style branching · object-storage native

Quickstart  ·  Docs  ·  Cookbooks  ·  CLI

License: MIT crates.io Rust


Omnigraph is the operational state and coordination layer for fleets of agents.
Run it as a server, declared as code; hundreds of agents operate and enrich the graph on parallel isolated branches, and every change is reviewed and merged safely.

Key capabilities

Capability What it gives you
Declared as code A cluster.yaml declares graphs, schemas, stored queries, embedding providers, and policies; cluster apply converges it and omnigraph-server brings every graph online at /graphs/{id}/….
Built for fleets of agents Hundreds of agents enrich the graph on parallel isolated branches; changes are reviewed and merged safely, Git-style, across the whole graph.
Multimodal retrieval Graph traversal + vector ANN + full-text + Reciprocal Rank Fusion in one query runtime, for context assembly.
Security as code Cedar policy enforced server-side on every mutation, per-graph and server-wide; bearer auth; actor/audit tracking.
Runs on your infrastructure Any S3-compatible object store: on-prem via RustFS / MinIO, or AWS S3 / R2 / GCS. VPC, on-prem, hybrid; your data never leaves your store.
Open, versioned storage Lance columnar format: branchable, time-travelable, with native blob-as-data (docs, images, video).

What you can build

Use case What it's for
Company brain Org knowledge unified into one graph every agent can query
Agentic memory Durable, versioned memory: a branch per agent or per task, merged on review
Context graph Decision traces and codified tribal knowledge for retrieval
Dev graph Issues & dependency model that coding agents read and write
R&D / ML data layer Experiments and trials written into branches, versioned for training & eval

Install

curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | bash

This installs omnigraph (CLI) and omnigraph-server into ~/.local/bin from published release binaries. Or with Homebrew:

brew tap ModernRelay/tap
brew install ModernRelay/tap/omnigraph

Set it up with an AI agent

Omnigraph is built to be run by coding agents. Two ways in:

Teach your agent the playbook. This repo ships the omnigraph agent skill: the operational playbook covering cluster mode, the two config surfaces, schema evolution, query linting, data writes, branches, Cedar policy, and the common gotchas.

npx skills add ModernRelay/omnigraph@omnigraph

Or have an agent set it up from scratch. Paste this into Claude Code, Codex, or any agent that can read a URL and run a shell command:

Help me set up Omnigraph

1. Read the docs at https://github.com/ModernRelay/omnigraph, starting with
   docs/user/clusters/index.md, then docs/user/deployment.md.
2. Skim the starter graphs and seed data in the cookbooks:
   https://github.com/ModernRelay/omnigraph-cookbooks
3. Ask me what I want to build (company brain, agent memory, dev graph,
   research / R&D layer, …). Then stand up a cluster for it, load a little
   data, and run a query so I can see it working.

For ready-to-run graphs with real seed data (company brain, VC operating system, pharma & industry intel), ModernRelay/omnigraph-cookbooks is the fastest way to see Omnigraph shaped to a real domain.

Deploy

A deployment is a cluster: a multigraph config directory that declares its graphs, schemas, stored queries, and policies as code. You manage it Terraform-style: cluster plan previews the diff, cluster apply converges it. omnigraph-server then boots from the cluster and brings every graph online at /graphs/{id}/…, each behind its own policy.

1. Declare the cluster.

company-brain/
├── cluster.yaml
├── people.pg          # schema for the "knowledge" graph
├── queries/           # stored queries: the .gq files ARE the declaration
│   └── people.gq
└── base.policy.yaml   # a Cedar policy bundle
# cluster.yaml
version: 1
metadata:
  name: company-brain
storage: s3://company/clusters/company-brain   # ledger, catalog, and graph data live here
graphs:
  knowledge:
    schema: people.pg
    queries: queries/                          # every `query <name>` in queries/*.gq registers
policies:
  base:
    file: base.policy.yaml
    applies_to: [knowledge]                    # graph-bound; use [cluster] for server-level

2. Stand up your object store. On-prem, run RustFS (or MinIO); Omnigraph writes Lance to it over the standard S3 API. In the cloud, point the same AWS_* env at S3 / R2 / GCS instead.

3. Converge and run. apply creates each graph, applies its schema, and publishes queries and policies into the content-addressed catalog. It is idempotent; re-running is always safe.

omnigraph cluster validate   # parse + typecheck everything
omnigraph cluster plan       # preview what apply would do
omnigraph cluster apply      # converge

# Boot the server from the cluster dir; storage resolves through cluster.yaml
omnigraph-server --cluster company-brain --bind 0.0.0.0:8080

See the cluster guide for the day-2 loop (edit → plan → apply → restart), approval gates for destructive changes, drift inspection, and recovery; the deployment guide for containers, AWS/Railway, auth, and the full AWS_* contract.

Query and mutate

Set a default server and graph once in ~/.omnigraph/config.yaml, and the everyday commands stay short. Stored queries and mutations run by name:

omnigraph query  search_docs --params '{"q":"AI safety"}'
omnigraph mutate add_person  --params '{"name":"Mina"}'

# Branch, review, merge across the whole graph; agents write in isolation
omnigraph branch create --from main agent/ingest-42
omnigraph branch merge  agent/ingest-42 --into main

An alias is shorter still: bind a server, graph, and stored query to one name, then omnigraph alias triage runs it. For an ad-hoc target, any command still takes --server <name|url> --graph <id> (or --store <uri> for a local graph). See the CLI reference.

Security & governance

  • Engine-wide enforcement: every write path goes through the same Cedar gate, so the HTTP server, the CLI, and the embedded SDK obey identical rules.
  • Declared in the cluster: a policy bundle is bound to graphs (or the whole server) via policies:applies_to.
  • Scoped: rules apply per graph, per branch, or server-wide.
  • No plaintext tokens: bearer tokens are hashed at startup and compared in constant time.
  • Forge-proof identity: the actor is resolved server-side from the token; clients can't set it.

See the policy guide.

Clients & SDKs

Client Use it for Where
TypeScript SDK typed access from Node / TS @modernrelay/omnigraph · source
MCP server bridge Omnigraph to LLM hosts (Claude, Codex, …) @modernrelay/omnigraph-mcp
HTTP / OpenAPI any language, the wire contract the server's OpenAPI spec
Python SDK typed access from Python coming soon

Both npm packages are versioned in lockstep with omnigraph-server.

Local quick test (no server)

1-min setup to try it: an embedded, local file-backed graph (no server, no object store). For dev and experiments; production is the deployed cluster above.

cat > schema.pg <<'PG'
node Signal  { slug: String @key, title: String }
node Pattern { slug: String @key, name: String }
edge Indicates: Signal -> Pattern
PG
printf '%s\n' \
  '{"type":"Signal","data":{"slug":"s1","title":"OSS model adoption surging"}}' \
  '{"type":"Pattern","data":{"slug":"p1","name":"adoption"}}' \
  '{"edge":"Indicates","from":"s1","to":"p1"}' > data.jsonl

omnigraph init  --schema schema.pg ./graph.omni
omnigraph load  --data data.jsonl --mode overwrite --store ./graph.omni

# "What pattern does signal s1 indicate?"
omnigraph query --store ./graph.omni \
  -e 'query indicates() { match { $s: Signal { slug: "s1" }  $s indicates $p } return { $p.name } }'
# → adoption

Docs

Build And Test

cargo build --workspace
cargo test  --workspace

Notes:

  • Rust stable toolchain, edition 2024
  • CI runs cargo test --workspace --locked
  • Full CI and some local test flows require protobuf-compiler
  • S3 integration tests expect an S3-compatible endpoint such as RustFS

Workspace Crates

  • crates/omnigraph-compiler: shared schema/query parser, typechecker, catalog, and IR lowering (zero Lance dependency)
  • crates/omnigraph (package omnigraph-engine): storage/runtime, branching, merge, change detection, query execution, and embeddings
  • crates/omnigraph-policy: Cedar policy compilation and enforcement
  • crates/omnigraph-api-types: shared HTTP wire DTOs used by both the server and the CLI
  • crates/omnigraph-cluster: cluster config validation, planning, and apply (the control plane)
  • crates/omnigraph-server: Axum HTTP server, cluster-first, runs N graphs under /graphs/{id}/…
  • crates/omnigraph-cli: CLI for graph lifecycle, query/mutate, branch/commit/merge, schema/lint, snapshot/export, cluster control, policy/queries, profiles, and maintenance

Contributing

Please open an issue, spec, or design discussion before sending large code changes. Design feedback and concrete problem statements are the fastest way to collaborate on the roadmap.

Community

Join the Omnigraph Slack community to ask questions, share feedback, and follow development.