Lakehouse-native graph engine with git-style workflows https://omnigraph.dev
Find a file
Ragnor Comerford f2b792e0ae
(feat): compact the internal manifest/commit-graph tables in optimize (#291)
* feat(engine): compact the internal __manifest/_graph_commits tables in optimize

`optimize` iterated node/edge catalog tables only, so the two internal system
tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and
were never compacted -- making every write's metadata scan O(fragments), which
grows forever on a long-lived graph (RFC-013 step 2).

`optimize_all_tables` now also compacts both internal tables via a new
`compact_internal_table`. They are not catalog-tracked (readers open them at
their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`:
compact in place, no manifest publish (nothing to publish to), no recovery
sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no
optimize_indices (they carry no Lance index, only object_id's unenforced-PK
metadata). No application lock: Lance's compact_files auto-retries its Rewrite
against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append
is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a
coordinator refresh afterwards makes the warm handle observe the compacted HEAD.

Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits
into __manifest -- a one-call throwaway for the full interim win; __manifest
compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup
(version GC) of the internal tables is deliberately NOT included here: it needs
the Q8 cleanup-resurrection watermark first (deferred).

maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds
optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar,
graph coherent for reads + strict writes after).

* test(engine): un-ignore the internal-table scan LOCK (step 2 acceptance)

`internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance
gate staged in PR #288. With internal-table compaction landed, a write's
__manifest/_graph_commits scan is flat in commit-history depth on a compacted
graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the
pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before
measuring and runs green every-PR.

* docs: RFC-013 step 2 internal-table compaction landed

- invariants.md: close the compaction half of the read-path-rederivation known
  gap (optimize now compacts the internal tables; cleanup half still deferred).
- maintenance.md: optimize covers __manifest/_graph_commits (no publish, no
  sidecar); not yet in cleanup.
- rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8
  watermark, deferred — debated; MTT-overlap + hot-path liability).
- testing.md: the internal-table LOCK is now green every-PR.

* fix(engine): guard absent _graph_commits + always compact internal tables

Addresses PR #291 review findings:

- Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction,
  but a graph can validly have none (the coordinator opens it as `Option`, gated on
  `storage.exists`, for graphs predating the commit graph). `Dataset::open` on the
  absent table errored and failed the whole optimize. Guard the `_graph_commits`
  compaction with the same `storage_adapter().exists()` check the coordinator uses;
  `__manifest` always exists so it stays unguarded. Regression test
  `optimize_tolerates_absent_graph_commits_table` (empty graph so no publish
  recreates the table before the guard).

- Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table
  compaction for a schema with no node/edge types. Removed it so the internal
  tables are compacted regardless of the data-table set.

- Codex (auto-cleanup, P1): documented — `compact_files` commits with a default
  `CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override,
  so on a graph storing an *on* auto_cleanup config the commit would fire version
  GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are
  safe; the only exposure is pre-fix upgraded graphs, identical to the existing
  data-table optimize path, with step 2b's watermark as the comprehensive guard.
  Added a comment in `compact_internal_table` recording this.

* fix(engine): retry publish on RetryableCommitConflict (compaction vs publish)

Step 2 compacts `__manifest` with no app-level lock (Lance OCC arbitrates,
validated against LanceDB + the lance-7.0.0 conflict resolver). compact_files'
`Operation::Rewrite` auto-retries 20x (CommitConfig default num_retries=20), so a
live publish usually wins the race and the compaction rebases. But the publish
runs its merge-insert with conflict_retries(0) = one rebase attempt; if the
compaction commits first AND the merge touched a fragment the Rewrite rewrote,
Lance preempts the publish with `Error::RetryableCommitConflict` — a DIFFERENT
variant from the row-level `TooMuchWriteContention` the publisher already retries.
Left unhandled, that surfaces a transient error to the caller, i.e. a maintenance
compaction (physical op) failing a live write (logical op) — invariant 7.

Map `LanceError::RetryableCommitConflict` to a new
`ManifestConflictDetails::RetryableCommitConflict` and treat it as retryable in the
publisher's outer loop (reload fresh state + re-merge), alongside
RowLevelCasContention. `ExpectedVersionMismatch` still propagates (a genuine
expectation break must not be blindly retried). This also hardens multi-process
concurrent writers generally, not just compaction.

Normal publishes are insert-only (new object_ids -> new fragments, disjoint from
rewritten old ones), so the conflict is rare; the guard covers the
same-fragment-update edge and multi-process writers. Unit tests in publisher.rs
pin the mapping + the retry-predicate contract.

* revert: publisher RetryableCommitConflict handling (it was the wrong side)

Reverts d138902e. Validated against lance-7.0.0: the publisher's merge-insert runs
with conflict_retries(0), and execute_with_retry converts an exhausted retryable
commit conflict to TooMuchWriteContention before the caller sees it
(write/retry.rs ~95-130). So map_lance_publish_error NEVER receives
RetryableCommitConflict from merge_rows — it receives TooMuchWriteContention, which
the publisher already maps to RowLevelCasContention and retries. The reverted
mapping was therefore dead on the real path and its unit test was synthetic.

The actual exposure is the *compaction* side: compact_files -> commit_compaction ->
apply_commit directly (no execute_with_retry), so a Rewrite-vs-Merge check_txn
conflict propagates raw and optimize can fail on a live graph. That is fixed
app-side in compact_internal_table in the following commit.

* fix(engine): make internal-table compaction correct by construction

Address three findings from review of the step-2 internal-table compaction:

- Non-destructive by construction: before compacting an internal table,
  strip any stored `lance.auto_cleanup.*` config off it. `compact_files`
  commits with a default `CommitConfig` (skip_auto_cleanup=false) and
  `CompactionOptions` exposes no override, so on a graph created by an older
  binary (on-by-default GC hook) the compaction commit would fire Lance's
  auto-cleanup and silently prune `__manifest`-pinned versions. Current
  binaries store no such config; the strip is the upgrade-path safety net so
  `optimize` can never GC versions.

- App-level compaction retry: `compact_files` does NOT auto-retry a semantic
  conflict against a concurrent live writer (Rewrite vs Update/Merge/Delete
  propagates raw from apply_commit; Lance prescribes app-rerun). Wrap the
  internal-table compaction in a bounded retry loop that reopens fresh and
  replans on a retryable Lance conflict, so a maintenance compaction (a
  physical op) never fails a live write (a logical op) — invariant 7.

- Compact all three internal tables, not two: `_graph_commit_actors` grows
  one fragment per commit on the authenticated write path, the same O(depth)
  scan as `__manifest`/`_graph_commits`. Drive the sweep from one
  source-of-truth list with per-table existence guards (the two commit-graph
  tables are optional). Make `graph_commit_actors_uri` pub(crate).

Tests: the `internal_table_scans_are_flat_in_history` LOCK now runs the
authenticated (actorful) write path so it covers `_graph_commit_actors` via
the shared commit-graph IO wrapper (new `commit_many_as`/`measure_insert_as`
helpers); `optimize_clears_stale_auto_cleanup_and_preserves_versions` pins
the non-destructive guarantee (config cleared + no version GC); a unit test
pins the retryable-conflict classifier; the empty-graph stats count is 7
(the actor table is created at init).

* docs: internal-table compaction covers all 3 tables, non-destructive, retried

Sync the RFC-013 step-2a section and the maintenance guide with the
correctness-by-design refinements:

- optimize compacts `__manifest`, `_graph_commits`, AND `_graph_commit_actors`
  (the actor table grows on the authenticated write path).
- optimize is non-destructive by construction — it never GCs versions, and
  strips stale `lance.auto_cleanup.*` config so an upgraded graph's commit-time
  GC hook cannot fire during compaction.
- internal-table compaction rebases and retries against concurrent live
  writers rather than failing the operator's optimize or the live write.
- the cost LOCK is the authenticated-path acceptance test.

* fix(engine): refresh coordinator after a config-strip with no compaction work

`compact_internal_table` returns early when `plan_compaction` finds no work,
but `clear_stale_auto_cleanup_config` may have already committed a config-strip
that advanced Lance HEAD. The early return skipped the coordinator refresh that
the successful-compaction path performs, leaving warm `__manifest`/commit-graph
handles pinned to the pre-strip version until the next read's version probe
healed them. No correctness bug (the probe self-heals, and a stale-handle write
would retry via publisher CAS), but the refresh makes coherence deterministic
rather than probe-dependent. Refresh iff the config-strip actually committed.

* docs(engine): correct compact_internal_table doc — compact_files does not auto-retry

The function doc claimed "Lance's compact_files auto-retries its Operation::Rewrite
against any concurrent writer" — wrong, and contradicting the is_retryable_lance_conflict
doc just below it and the explicit retry loop that exists precisely because compact_files
does NOT auto-retry semantic conflicts (Rewrite vs Update/Merge/Delete propagates raw
through apply_commit). Also move the orphaned description from above the retry-budget
const onto the function, and include the third internal table.

* test(engine): optimize must clear stale auto_cleanup on DATA tables too (red)

Regression test for a destructive bug on the data-table optimize path: on an
upgraded graph whose node/edge table still carries pre-v7 lance.auto_cleanup.*
config, `optimize`'s compact_files/optimize_indices commits fire Lance's version
GC and prune __manifest-pinned data-table versions. Mirrors the internal-table
auto_cleanup test on a Person table (force-repair realigns the config-induced
drift so optimize doesn't skip the table). Red against the current code: the
data-table path does not strip the config. The fix lands in the next commit.

* fix(engine): clear stale auto_cleanup on the data-table optimize path too

The auto_cleanup scrub previously only protected the internal tables; the
data-table path (optimize_one_table) ran compact_files/optimize_indices with a
default CommitConfig (skip_auto_cleanup=false) and no override, so on an upgraded
graph those commits could fire Lance's version-GC hook and prune __manifest-pinned
node/edge versions — making the "non-destructive" contract false for data tables.
Strip the config before the HEAD-advancing commits, capturing version_before first
so the strip's own commit still triggers the Phase-C manifest publish (no uncovered
drift). No retry loop needed: the data-table path holds the per-table write queue.
Covered by the existing Optimize recovery sidecar. Turns the prior commit's test green.

Also: switch clear_stale_auto_cleanup_config off the deprecated delete_config_keys
to update_config(None values), and correct two now-inaccurate doc comments —
compaction is "one or more content-preserving commits" (compact_files can emit a
ReserveFragments before the Rewrite), not "a single atomic commit"; the sidecar-free
property rests on content-preservation + read-at-HEAD, not single-commit atomicity.

* docs: optimize is non-destructive on all tables; correct atomicity/retry claims

- non-destructive guarantee now spans data + internal tables (the auto_cleanup
  strip runs on both paths), not just the internal ones.
- "single atomic Lance commit" was inaccurate: compaction can emit a
  ReserveFragments commit before the Rewrite; the no-sidecar property rests on
  content-preservation + read-at-HEAD, not single-commit atomicity.
- "retries rather than failing" softened to the truth: a *bounded* retry on the
  internal path; sustained contention surfaces a loud conflict error (bounded +
  observable, not an infinite loop). The data path holds the per-table queue and
  never contends.
2026-06-21 16:38:20 +02:00
.cargo Raise LANCE_MEM_POOL_SIZE to 1 GB in .cargo/config.toml 2026-04-19 22:27:49 +03:00
.context Investigate Lance MergeInsertBuilder CAS granularity (MR-766 prereq) 2026-04-28 23:30:17 +00:00
.github write-path cost gate + opener bypass (#288) 2026-06-20 13:31:15 +02:00
assets docs(readme): drop em-dashes, Cursor→Codex, rename agent section (#274) 2026-06-17 02:36:14 +03:00
crates (feat): compact the internal manifest/commit-graph tables in optimize (#291) 2026-06-21 16:38:20 +02:00
docker fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284) 2026-06-19 03:34:15 +03:00
docs (feat): compact the internal manifest/commit-graph tables in optimize (#291) 2026-06-21 16:38:20 +02:00
scripts docs: onboarding-first README + in-repo agent skill + drop RustFS script (#257) 2026-06-16 11:48:13 +02:00
skills/omnigraph docs: onboarding-first README + in-repo agent skill + drop RustFS script (#257) 2026-06-16 11:48:13 +02:00
.dockerignore feat(docker): cluster-mode entrypoint and the CLI in the image 2026-06-10 22:44:54 +03:00
.gitignore release: v0.5.0 (#115) 2026-05-23 13:59:42 +01:00
AGENTS.md release: v0.7.1 (#290) 2026-06-19 23:12:44 +03:00
Cargo.lock release: v0.7.1 (#290) 2026-06-19 23:12:44 +03:00
Cargo.toml build(deps): bump Lance 6.0.1 → 7.0.0 (correct-by-design substrate alignment) (#229) 2026-06-14 20:42:24 +02:00
CLAUDE.md Add AGENTS.md as canonical agent guide; symlink CLAUDE.md to it 2026-04-28 23:10:09 +02:00
CODE_OF_CONDUCT.md Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00
CONTRIBUTING.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
Dockerfile feat(docker): cluster-mode entrypoint and the CLI in the image 2026-06-10 22:44:54 +03:00
GOVERNANCE.md chore: remove CODEOWNERS chassis and the code-owner review gate 2026-06-18 02:55:27 +03:00
LICENSE Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00
og-cheet-sheet.md feat: inline query strings in CLI and HTTP server (#110) 2026-05-29 13:41:54 +02:00
omnigraph.example.yaml example config: use graphs / cli.graph, matching the MR-603 rename 2026-04-18 23:40:35 +03:00
openapi.json docs(user): coherence cleanup aligned with 0.7.1 (#293) 2026-06-21 00:02:34 +03:00
README.md docs(readme): drop em-dashes, Cursor→Codex, rename agent section (#274) 2026-06-17 02:36:14 +03:00
rust-toolchain.toml Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00
SECURITY.md Initial public Omnigraph repository 2026-04-10 20:49:41 +03:00

OMNIGRAPH

Lakehouse graph database for context assembly & multi-agent coordination
Multimodal retrieval · Git-style branching · object-storage native

Quickstart  ·  Docs  ·  Cookbooks  ·  CLI

License: MIT crates.io Rust


Omnigraph is the operational state and coordination layer for fleets of agents.
Run it as a server, declared as code; hundreds of agents operate and enrich the graph on parallel isolated branches, and every change is reviewed and merged safely.

Key capabilities

Capability What it gives you
Declared as code A cluster.yaml declares graphs, schemas, stored queries, embedding providers, and policies; cluster apply converges it and omnigraph-server brings every graph online at /graphs/{id}/….
Built for fleets of agents Hundreds of agents enrich the graph on parallel isolated branches; changes are reviewed and merged safely, Git-style, across the whole graph.
Multimodal retrieval Graph traversal + vector ANN + full-text + Reciprocal Rank Fusion in one query runtime, for context assembly.
Security as code Cedar policy enforced server-side on every mutation, per-graph and server-wide; bearer auth; actor/audit tracking.
Runs on your infrastructure Any S3-compatible object store: on-prem via RustFS / MinIO, or AWS S3 / R2 / GCS. VPC, on-prem, hybrid; your data never leaves your store.
Open, versioned storage Lance columnar format: branchable, time-travelable, with native blob-as-data (docs, images, video).

What you can build

Use case What it's for
Company brain Org knowledge unified into one graph every agent can query
Agentic memory Durable, versioned memory: a branch per agent or per task, merged on review
Context graph Decision traces and codified tribal knowledge for retrieval
Dev graph Issues & dependency model that coding agents read and write
R&D / ML data layer Experiments and trials written into branches, versioned for training & eval

Install

curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | bash

This installs omnigraph (CLI) and omnigraph-server into ~/.local/bin from published release binaries. Or with Homebrew:

brew tap ModernRelay/tap
brew install ModernRelay/tap/omnigraph

Set it up with an AI agent

Omnigraph is built to be run by coding agents. Two ways in:

Teach your agent the playbook. This repo ships the omnigraph agent skill: the operational playbook covering cluster mode, the two config surfaces, schema evolution, query linting, data writes, branches, Cedar policy, and the common gotchas.

npx skills add ModernRelay/omnigraph@omnigraph

Or have an agent set it up from scratch. Paste this into Claude Code, Codex, or any agent that can read a URL and run a shell command:

Help me set up Omnigraph

1. Read the docs at https://github.com/ModernRelay/omnigraph, starting with
   docs/user/clusters/index.md, then docs/user/deployment.md.
2. Skim the starter graphs and seed data in the cookbooks:
   https://github.com/ModernRelay/omnigraph-cookbooks
3. Ask me what I want to build (company brain, agent memory, dev graph,
   research / R&D layer, …). Then stand up a cluster for it, load a little
   data, and run a query so I can see it working.

For ready-to-run graphs with real seed data (company brain, VC operating system, pharma & industry intel), ModernRelay/omnigraph-cookbooks is the fastest way to see Omnigraph shaped to a real domain.

Deploy

A deployment is a cluster: a multigraph config directory that declares its graphs, schemas, stored queries, and policies as code. You manage it Terraform-style: cluster plan previews the diff, cluster apply converges it. omnigraph-server then boots from the cluster and brings every graph online at /graphs/{id}/…, each behind its own policy.

1. Declare the cluster.

company-brain/
├── cluster.yaml
├── people.pg          # schema for the "knowledge" graph
├── queries/           # stored queries: the .gq files ARE the declaration
│   └── people.gq
└── base.policy.yaml   # a Cedar policy bundle
# cluster.yaml
version: 1
metadata:
  name: company-brain
storage: s3://company/clusters/company-brain   # ledger, catalog, and graph data live here
graphs:
  knowledge:
    schema: people.pg
    queries: queries/                          # every `query <name>` in queries/*.gq registers
policies:
  base:
    file: base.policy.yaml
    applies_to: [knowledge]                    # graph-bound; use [cluster] for server-level

2. Stand up your object store. On-prem, run RustFS (or MinIO); Omnigraph writes Lance to it over the standard S3 API. In the cloud, point the same AWS_* env at S3 / R2 / GCS instead.

3. Converge and run. apply creates each graph, applies its schema, and publishes queries and policies into the content-addressed catalog. It is idempotent; re-running is always safe.

omnigraph cluster validate   # parse + typecheck everything
omnigraph cluster plan       # preview what apply would do
omnigraph cluster apply      # converge

# Boot the server from the cluster dir; storage resolves through cluster.yaml
omnigraph-server --cluster company-brain --bind 0.0.0.0:8080

See the cluster guide for the day-2 loop (edit → plan → apply → restart), approval gates for destructive changes, drift inspection, and recovery; the deployment guide for containers, AWS/Railway, auth, and the full AWS_* contract.

Query and mutate

Set a default server and graph once in ~/.omnigraph/config.yaml, and the everyday commands stay short. Stored queries and mutations run by name:

omnigraph query  search_docs --params '{"q":"AI safety"}'
omnigraph mutate add_person  --params '{"name":"Mina"}'

# Branch, review, merge across the whole graph; agents write in isolation
omnigraph branch create --from main agent/ingest-42
omnigraph branch merge  agent/ingest-42 --into main

An alias is shorter still: bind a server, graph, and stored query to one name, then omnigraph alias triage runs it. For an ad-hoc target, any command still takes --server <name|url> --graph <id> (or --store <uri> for a local graph). See the CLI reference.

Security & governance

  • Engine-wide enforcement: every write path goes through the same Cedar gate, so the HTTP server, the CLI, and the embedded SDK obey identical rules.
  • Declared in the cluster: a policy bundle is bound to graphs (or the whole server) via policies:applies_to.
  • Scoped: rules apply per graph, per branch, or server-wide.
  • No plaintext tokens: bearer tokens are hashed at startup and compared in constant time.
  • Forge-proof identity: the actor is resolved server-side from the token; clients can't set it.

See the policy guide.

Clients & SDKs

Client Use it for Where
TypeScript SDK typed access from Node / TS @modernrelay/omnigraph · source
MCP server bridge Omnigraph to LLM hosts (Claude, Codex, …) @modernrelay/omnigraph-mcp
HTTP / OpenAPI any language, the wire contract the server's OpenAPI spec
Python SDK typed access from Python coming soon

Both npm packages are versioned in lockstep with omnigraph-server.

Local quick test (no server)

1-min setup to try it: an embedded, local file-backed graph (no server, no object store). For dev and experiments; production is the deployed cluster above.

cat > schema.pg <<'PG'
node Signal  { slug: String @key, title: String }
node Pattern { slug: String @key, name: String }
edge Indicates: Signal -> Pattern
PG
printf '%s\n' \
  '{"type":"Signal","data":{"slug":"s1","title":"OSS model adoption surging"}}' \
  '{"type":"Pattern","data":{"slug":"p1","name":"adoption"}}' \
  '{"edge":"Indicates","from":"s1","to":"p1"}' > data.jsonl

omnigraph init  --schema schema.pg ./graph.omni
omnigraph load  --data data.jsonl --mode overwrite --store ./graph.omni

# "What pattern does signal s1 indicate?"
omnigraph query --store ./graph.omni \
  -e 'query indicates() { match { $s: Signal { slug: "s1" }  $s indicates $p } return { $p.name } }'
# → adoption

Docs

Build And Test

cargo build --workspace
cargo test  --workspace

Notes:

  • Rust stable toolchain, edition 2024
  • CI runs cargo test --workspace --locked
  • Full CI and some local test flows require protobuf-compiler
  • S3 integration tests expect an S3-compatible endpoint such as RustFS

Workspace Crates

  • crates/omnigraph-compiler: shared schema/query parser, typechecker, catalog, and IR lowering (zero Lance dependency)
  • crates/omnigraph (package omnigraph-engine): storage/runtime, branching, merge, change detection, query execution, and embeddings
  • crates/omnigraph-policy: Cedar policy compilation and enforcement
  • crates/omnigraph-api-types: shared HTTP wire DTOs used by both the server and the CLI
  • crates/omnigraph-cluster: cluster config validation, planning, and apply (the control plane)
  • crates/omnigraph-server: Axum HTTP server, cluster-first, runs N graphs under /graphs/{id}/…
  • crates/omnigraph-cli: CLI for graph lifecycle, query/mutate, branch/commit/merge, schema/lint, snapshot/export, cluster control, policy/queries, profiles, and maintenance

Contributing

Please open an issue, spec, or design discussion before sending large code changes. Design feedback and concrete problem statements are the fastest way to collaborate on the roadmap.

Community

Join the Omnigraph Slack community to ask questions, share feedback, and follow development.