omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-24 02:38:06 +02:00

Lakehouse-native graph engine with git-style workflows https://omnigraph.dev

Ragnor Comerford f2b792e0ae (feat): compact the internal manifest/commit-graph tables in optimize (#291)	2026-06-21 16:38:20 +02:00
.cargo	Raise LANCE_MEM_POOL_SIZE to 1 GB in .cargo/config.toml	2026-04-19 22:27:49 +03:00
.context	Investigate Lance MergeInsertBuilder CAS granularity (MR-766 prereq)	2026-04-28 23:30:17 +00:00
.github	write-path cost gate + opener bypass (#288)	2026-06-20 13:31:15 +02:00
assets	docs(readme): drop em-dashes, Cursor→Codex, rename agent section (#274)	2026-06-17 02:36:14 +03:00
crates	(feat): compact the internal manifest/commit-graph tables in optimize (#291)	2026-06-21 16:38:20 +02:00
docker	fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284)	2026-06-19 03:34:15 +03:00
docs	(feat): compact the internal manifest/commit-graph tables in optimize (#291)	2026-06-21 16:38:20 +02:00
scripts	docs: onboarding-first README + in-repo agent skill + drop RustFS script (#257)	2026-06-16 11:48:13 +02:00
skills/omnigraph	docs: onboarding-first README + in-repo agent skill + drop RustFS script (#257)	2026-06-16 11:48:13 +02:00
.dockerignore	feat(docker): cluster-mode entrypoint and the CLI in the image	2026-06-10 22:44:54 +03:00
.gitignore	release: v0.5.0 (#115)	2026-05-23 13:59:42 +01:00
AGENTS.md	release: v0.7.1 (#290)	2026-06-19 23:12:44 +03:00
Cargo.lock	release: v0.7.1 (#290)	2026-06-19 23:12:44 +03:00
Cargo.toml	build(deps): bump Lance 6.0.1 → 7.0.0 (correct-by-design substrate alignment) (#229)	2026-06-14 20:42:24 +02:00
CLAUDE.md	Add AGENTS.md as canonical agent guide; symlink CLAUDE.md to it	2026-04-28 23:10:09 +02:00
CODE_OF_CONDUCT.md	Initial public Omnigraph repository	2026-04-10 20:49:41 +03:00
CONTRIBUTING.md	chore: remove CODEOWNERS chassis and the code-owner review gate	2026-06-18 02:55:27 +03:00
Dockerfile	feat(docker): cluster-mode entrypoint and the CLI in the image	2026-06-10 22:44:54 +03:00
GOVERNANCE.md	chore: remove CODEOWNERS chassis and the code-owner review gate	2026-06-18 02:55:27 +03:00
LICENSE	Initial public Omnigraph repository	2026-04-10 20:49:41 +03:00
og-cheet-sheet.md	feat: inline query strings in CLI and HTTP server (#110)	2026-05-29 13:41:54 +02:00
omnigraph.example.yaml	example config: use graphs / cli.graph, matching the MR-603 rename	2026-04-18 23:40:35 +03:00
openapi.json	docs(user): coherence cleanup aligned with 0.7.1 (#293)	2026-06-21 00:02:34 +03:00
README.md	docs(readme): drop em-dashes, Cursor→Codex, rename agent section (#274)	2026-06-17 02:36:14 +03:00
rust-toolchain.toml	Initial public Omnigraph repository	2026-04-10 20:49:41 +03:00
SECURITY.md	Initial public Omnigraph repository	2026-04-10 20:49:41 +03:00

README.md

OMNIGRAPH

Lakehouse graph database for context assembly & multi-agent coordination
Multimodal retrieval · Git-style branching · object-storage native

Quickstart · Docs · Cookbooks · CLI

Omnigraph is the operational state and coordination layer for fleets of agents.
Run it as a server, declared as code; hundreds of agents operate and enrich the graph on parallel isolated branches, and every change is reviewed and merged safely.

Key capabilities

Capability	What it gives you
Declared as code	A `cluster.yaml` declares graphs, schemas, stored queries, embedding providers, and policies; `cluster apply` converges it and `omnigraph-server` brings every graph online at `/graphs/{id}/…`.
Built for fleets of agents	Hundreds of agents enrich the graph on parallel isolated branches; changes are reviewed and merged safely, Git-style, across the whole graph.
Multimodal retrieval	Graph traversal + vector ANN + full-text + Reciprocal Rank Fusion in one query runtime, for context assembly.
Security as code	Cedar policy enforced server-side on every mutation, per-graph and server-wide; bearer auth; actor/audit tracking.
Runs on your infrastructure	Any S3-compatible object store: on-prem via RustFS / MinIO, or AWS S3 / R2 / GCS. VPC, on-prem, hybrid; your data never leaves your store.
Open, versioned storage	`Lance` columnar format: branchable, time-travelable, with native blob-as-data (docs, images, video).
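
Reciprocal Rank Fusion, named in the table above, is commonly defined as summing 1/(k + rank) for each document across the ranked lists being fused. The sketch below illustrates that general formula on plain text files. It is not Omnigraph's implementation: the `rrf` helper name, the one-file-per-retriever input, and the default k = 60 (a conventional choice) are all assumptions made for illustration.

```shell
# Sketch only: generic Reciprocal Rank Fusion over ranked lists.
# Each input file is one retriever's results: one document id per line, best first,
# with no id repeated inside a single file.
# `rrf` is a hypothetical helper; k defaults to 60 and can be overridden with RRF_K.
rrf() {
  awk -v k="${RRF_K:-60}" '
    # FNR is the line number within the current file, i.e. the 1-based rank.
    NF { score[$1] += 1 / (k + FNR) }
    END { for (doc in score) printf "%.6f %s\n", score[doc], doc }
  ' "$@" | LC_ALL=C sort -k1,1nr -k2,2
}
```

For example, `rrf vector.txt fulltext.txt | head -n 5` would print the five highest fused scores, where the two file names stand in for whatever ranked lists you have exported.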

What you can build

Use case	What it's for
Company brain	Org knowledge unified into one graph every agent can query
Agentic memory	Durable, versioned memory: a branch per agent or per task, merged on review
Context graph	Decision traces and codified tribal knowledge for retrieval
Dev graph	Issues & dependency model that coding agents read and write
R&D / ML data layer	Experiments and trials written into branches, versioned for training & eval

Install

curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | bash

This installs omnigraph (CLI) and omnigraph-server into ~/.local/bin from published release binaries. Or with Homebrew:

brew tap ModernRelay/tap
brew install ModernRelay/tap/omnigraph
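
If you used the curl installer, the binaries land in ~/.local/bin, which is not on PATH in every shell setup. A small check, written as a sketch (the `on_path` helper is a hypothetical name, not part of Omnigraph):

```shell
# Sketch: report whether a directory is one of the entries in PATH.
# `on_path` is a hypothetical helper for illustration.
on_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if on_path "$HOME/.local/bin"; then
  echo "~/.local/bin is on PATH"
else
  echo "add to your shell profile: export PATH=\"\$HOME/.local/bin:\$PATH\""
fi
```

Wrapping PATH in colons on both sides lets the pattern match the first and last entries the same way as the middle ones.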

Set it up with an AI agent

Omnigraph is built to be run by coding agents. Two ways in:

Teach your agent the playbook. This repo ships the omnigraph agent skill: the operational playbook covering cluster mode, the two config surfaces, schema evolution, query linting, data writes, branches, Cedar policy, and the common gotchas.

npx skills add ModernRelay/omnigraph@omnigraph

Or have an agent set it up from scratch. Paste this into Claude Code, Codex, or any agent that can read a URL and run a shell command:

Help me set up Omnigraph

1. Read the docs at https://github.com/ModernRelay/omnigraph, starting with
   docs/user/clusters/index.md, then docs/user/deployment.md.
2. Skim the starter graphs and seed data in the cookbooks:
   https://github.com/ModernRelay/omnigraph-cookbooks
3. Ask me what I want to build (company brain, agent memory, dev graph,
   research / R&D layer, …). Then stand up a cluster for it, load a little
   data, and run a query so I can see it working.

For ready-to-run graphs with real seed data (company brain, VC operating system, pharma & industry intel), ModernRelay/omnigraph-cookbooks is the fastest way to see Omnigraph shaped to a real domain.

Deploy

A deployment is a cluster: a multigraph config directory that declares its graphs, schemas, stored queries, and policies as code. You manage it Terraform-style: cluster plan previews the diff, cluster apply converges it. omnigraph-server then boots from the cluster and brings every graph online at /graphs/{id}/…, each behind its own policy.

1. Declare the cluster.

company-brain/
├── cluster.yaml
├── people.pg          # schema for the "knowledge" graph
├── queries/           # stored queries: the .gq files ARE the declaration
│   └── people.gq
└── base.policy.yaml   # a Cedar policy bundle

# cluster.yaml
version: 1
metadata:
  name: company-brain
storage: s3://company/clusters/company-brain   # ledger, catalog, and graph data live here
graphs:
  knowledge:
    schema: people.pg
    queries: queries/                          # every `query <name>` in queries/*.gq registers
policies:
  base:
    file: base.policy.yaml
    applies_to: [knowledge]                    # graph-bound; use [cluster] for server-level

2. Stand up your object store. On-prem, run RustFS (or MinIO); Omnigraph writes Lance to it over the standard S3 API. In the cloud, point the same AWS_* env at S3 / R2 / GCS instead.

3. Converge and run. apply creates each graph, applies its schema, and publishes queries and policies into the content-addressed catalog. It is idempotent; re-running is always safe.

omnigraph cluster validate   # parse + typecheck everything
omnigraph cluster plan       # preview what apply would do
omnigraph cluster apply      # converge

# Boot the server from the cluster dir; storage resolves through cluster.yaml
omnigraph-server --cluster company-brain --bind 0.0.0.0:8080

See the cluster guide for the day-2 loop (edit → plan → apply → restart), approval gates for destructive changes, drift inspection, and recovery; the deployment guide for containers, AWS/Railway, auth, and the full AWS_* contract.
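
The validate, plan, apply sequence above can be wrapped so that a failed check stops the run before anything is applied. A minimal sketch using only the three commands shown in this README; the `converge` function name is hypothetical, and restarting `omnigraph-server` afterwards is left out because it depends on how you deploy:

```shell
# Sketch: run the cluster commands from this README in order,
# stopping at the first failure. `converge` is a hypothetical wrapper name.
converge() {
  omnigraph cluster validate || return 1   # parse + typecheck everything
  omnigraph cluster plan     || return 1   # preview what apply would do
  omnigraph cluster apply                  # converge; the README describes apply as idempotent
}
```

Run `converge` from the same place you would run the individual commands. Because apply is described as idempotent, re-running the wrapper after a partial failure should be safe.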

Query and mutate

Set a default server and graph once in ~/.omnigraph/config.yaml, and the everyday commands stay short. Stored queries and mutations run by name:

omnigraph query  search_docs --params '{"q":"AI safety"}'
omnigraph mutate add_person  --params '{"name":"Mina"}'

# Branch, review, merge across the whole graph; agents write in isolation
omnigraph branch create --from main agent/ingest-42
omnigraph branch merge  agent/ingest-42 --into main

An alias is shorter still: bind a server, graph, and stored query to one name, then omnigraph alias triage runs it. For an ad-hoc target, any command still takes --server <name|url> --graph <id> (or --store <uri> for a local graph). See the CLI reference.

Security & governance

Engine-wide enforcement: every write path goes through the same Cedar gate, so the HTTP server, the CLI, and the embedded SDK obey identical rules.
Declared in the cluster: a policy bundle is bound to graphs (or the whole server) via policies: → applies_to.
Scoped: rules apply per graph, per branch, or server-wide.
No plaintext tokens: bearer tokens are hashed at startup and compared in constant time.
Forge-proof identity: the actor is resolved server-side from the token; clients can't set it.

See the policy guide.

Clients & SDKs

Client	Use it for	Where
TypeScript SDK	typed access from Node / TS	`@modernrelay/omnigraph` · source
MCP server	bridge Omnigraph to LLM hosts (Claude, Codex, …)	`@modernrelay/omnigraph-mcp`
HTTP / OpenAPI	any language, the wire contract	the server's OpenAPI spec
Python SDK	typed access from Python	coming soon

Both npm packages are versioned in lockstep with omnigraph-server.

Local quick test (no server)

1-min setup to try it: an embedded, local file-backed graph (no server, no object store). For dev and experiments; production is the deployed cluster above.

cat > schema.pg <<'PG'
node Signal  { slug: String @key, title: String }
node Pattern { slug: String @key, name: String }
edge Indicates: Signal -> Pattern
PG
printf '%s\n' \
  '{"type":"Signal","data":{"slug":"s1","title":"OSS model adoption surging"}}' \
  '{"type":"Pattern","data":{"slug":"p1","name":"adoption"}}' \
  '{"edge":"Indicates","from":"s1","to":"p1"}' > data.jsonl

omnigraph init  --schema schema.pg ./graph.omni
omnigraph load  --data data.jsonl --mode overwrite --store ./graph.omni

# "What pattern does signal s1 indicate?"
omnigraph query --store ./graph.omni \
  -e 'query indicates() { match { $s: Signal { slug: "s1" }  $s indicates $p } return { $p.name } }'
# → adoption

Docs

Cluster guide · Deployment guide · CLI reference
Schema · Queries · Search · Policy

Build And Test

cargo build --workspace
cargo test  --workspace

Notes:

Rust stable toolchain, edition 2024
CI runs cargo test --workspace --locked
Full CI and some local test flows require protobuf-compiler
S3 integration tests expect an S3-compatible endpoint such as RustFS
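
Before running the full test suite, it can help to confirm the required tools are installed. The sketch below assumes that the protobuf-compiler requirement is satisfied by a `protoc` binary on PATH, which is what that package typically provides; `preflight` is a hypothetical helper name.

```shell
# Sketch: fail early if any required tool is missing from PATH.
# `preflight` is a hypothetical helper; pass it the tool names to check.
preflight() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (assumes protobuf-compiler installs `protoc`):
#   preflight cargo protoc && cargo test --workspace --locked
```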

Workspace Crates

crates/omnigraph-compiler: shared schema/query parser, typechecker, catalog, and IR lowering (zero Lance dependency)
crates/omnigraph (package omnigraph-engine): storage/runtime, branching, merge, change detection, query execution, and embeddings
crates/omnigraph-policy: Cedar policy compilation and enforcement
crates/omnigraph-api-types: shared HTTP wire DTOs used by both the server and the CLI
crates/omnigraph-cluster: cluster config validation, planning, and apply (the control plane)
crates/omnigraph-server: Axum HTTP server, cluster-first, runs N graphs under /graphs/{id}/…
crates/omnigraph-cli: CLI for graph lifecycle, query/mutate, branch/commit/merge, schema/lint, snapshot/export, cluster control, policy/queries, profiles, and maintenance

Contributing

Please open an issue, spec, or design discussion before sending large code changes. Design feedback and concrete problem statements are the fastest way to collaborate on the roadmap.

Community

Join the Omnigraph Slack community to ask questions, share feedback, and follow development.