Merge pull request #60 from ModernRelay/ragnorc/omnigraph-spec

Add AGENTS.md (map) + docs/ knowledge base + CI link check
This commit is contained in:
Ragnor Comerford 2026-04-29 00:15:19 +02:00 committed by GitHub
commit a9430978fb
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
27 changed files with 1610 additions and 0 deletions

View file

@ -99,6 +99,18 @@ jobs:
echo "run_full_ci=$run_full_ci" >> "$GITHUB_OUTPUT"
echo "run_rustfs_ci=$run_rustfs_ci" >> "$GITHUB_OUTPUT"
check_agents_md:
name: Check AGENTS.md Links
runs-on: ubuntu-latest
permissions:
contents: read
steps:
- name: Checkout source
uses: actions/checkout@v5.0.1
- name: Verify AGENTS.md ↔ docs/ cross-links
run: bash scripts/check-agents-md.sh
test:
name: Test Workspace
needs: classify_changes

220
AGENTS.md Normal file
View file

@ -0,0 +1,220 @@
# OmniGraph — Agent Guide
This file is the always-on map for AI coding agents (Claude Code, Codex, Cursor, Cline) working in this repo. It is loaded into context on every turn, so it stays as a **map plus the rules and invariants that need to be in scope at all times** — the encyclopedia content lives under [`docs/`](docs/). When you need depth, follow a pointer.
**Required reading every session, every change:**
1. **[docs/invariants.md](docs/invariants.md)** — the architectural invariants and §IX deny-list. Apply to every PR, not only architecture work.
2. **[docs/lance.md](docs/lance.md)** — the curated index of upstream Lance docs. **Consult it before every task** to identify which Lance pages are relevant to what you're about to do, then fetch those upstream URLs before grepping our code or guessing. Lance is the substrate; behavior is documented there, not here.
3. **[docs/testing.md](docs/testing.md)** — the test-coverage map. **Always check what already covers your change before writing a new test.** Extending an existing test (an assertion, a fixture row, a parameterization) is preferred over a duplicated `init_and_load()` block. Walk the before-every-task checklist to identify existing coverage, run those tests as a clean baseline, and only add a new test fn or file when no existing one owns the area.
Tools that support `@`-imports (Claude Code) auto-include all three files via the imports below — note these must sit at column 0 (not inside a blockquote) for the parser to recognize them. Other agents (Codex, Cursor, Cline, …) must open them explicitly at the start of each session.
@docs/invariants.md
@docs/lance.md
@docs/testing.md
`CLAUDE.md` is a symlink to this file — there is exactly one source of truth. Edit `AGENTS.md`.
**Version surveyed:** 0.3.1
**Workspace crates:** `omnigraph-compiler`, `omnigraph` (engine), `omnigraph-cli`, `omnigraph-server`
**Storage substrate:** Lance 4.x (columnar, versioned, branchable)
**License:** MIT
**Toolchain:** Rust stable, edition 2024
---
## Start here — what is this?
OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets. Highlights:
- **Storage**: per node/edge type a separate Lance dataset; multi-dataset commits coordinated atomically through one `__manifest` table.
- **Languages**: a `.pg` schema language and a `.gq` query language, both Pest-based, with a typed IR.
- **Multi-modal querying**: vector ANN (`nearest`), full-text (`search`/`fuzzy`/`match_text`/`bm25`), Reciprocal Rank Fusion (`rrf`), and graph traversal (`Expand`, anti-join `not { … }`) in one runtime.
- **Branches and commits across the whole graph**: Git-style — every successful publish appends to a commit DAG; merges are three-way at the row level.
- **Transactional runs**: ephemeral `__run__<id>` branches for isolated mutation, fast-path or merge-path publish.
- **HTTP server**: Axum + utoipa OpenAPI, bearer auth (SHA-256 hashed, optional AWS Secrets Manager), Cedar policy gating.
- **CLI** driven by a single `omnigraph.yaml`; multi-format output (json/jsonl/csv/kv/table).
Throughout the docs, capabilities are split into **L1 — Inherited from Lance** vs **L2 — Added by OmniGraph**.
---
## Architecture at a glance
```
CLI (omnigraph) HTTP Server (omnigraph-server, Axum)
│ │
└─────────────┬──────────────┘
omnigraph-compiler ── Pest grammars, catalog, IR, lowering, lint, migration plan
omnigraph (engine) ── ManifestRepo, CommitGraph, RunRegistry, GraphIndex (CSR/CSC), exec
Lance 4.x ── columnar Arrow, fragments, per-dataset versions/branches, indexes
Object store (file / s3 / RustFS / MinIO / S3-compat)
```
Full diagram and concurrency model: [docs/architecture.md](docs/architecture.md).
---
## Where to find each topic
| Area | Read |
|---|---|
| **Architectural invariants & deny-list (read before any non-trivial proposal or review)** | **[docs/invariants.md](docs/invariants.md)** |
| **Lance docs index — fetch upstream Lance docs by problem domain** | **[docs/lance.md](docs/lance.md)** |
| **Test coverage map — what's covered, what helpers to reuse, before-every-task checklist** | **[docs/testing.md](docs/testing.md)** |
| Architecture, L1/L2 framing, concurrency model | [docs/architecture.md](docs/architecture.md) |
| Storage layout, `__manifest` schema, URI schemes, S3 env vars | [docs/storage.md](docs/storage.md) |
| `.pg` schema language, types, constraints, annotations, migration planning | [docs/schema-language.md](docs/schema-language.md) |
| `.gq` query language, MATCH/RETURN/ORDER, search funcs, mutations, IR ops, lint codes | [docs/query-language.md](docs/query-language.md) |
| Indexes (BTREE / inverted / vector / graph topology) | [docs/indexes.md](docs/indexes.md) |
| Embeddings (compiler + engine clients, env vars, `@embed`) | [docs/embeddings.md](docs/embeddings.md) |
| Branches, commit graph, snapshots, system branches | [docs/branches-commits.md](docs/branches-commits.md) |
| Runs (transactional graph mutations, `__run__<id>`, publish paths) | [docs/runs.md](docs/runs.md) |
| Three-way merge and conflict kinds | [docs/merge.md](docs/merge.md) |
| Diff / change feed (`diff_between`, `diff_commits`) | [docs/changes.md](docs/changes.md) |
| Query execution, mutation execution, bulk loader, `load` vs `ingest` | [docs/execution.md](docs/execution.md) |
| `optimize` (compaction) and `cleanup` (version GC) | [docs/maintenance.md](docs/maintenance.md) |
| Cedar policy actions, scopes, CLI | [docs/policy.md](docs/policy.md) |
| HTTP server endpoints, auth, error model, body limits | [docs/server.md](docs/server.md) |
| CLI quick-start | [docs/cli.md](docs/cli.md) |
| CLI command surface and `omnigraph.yaml` schema | [docs/cli-reference.md](docs/cli-reference.md) |
| Audit / actor tracking | [docs/audit.md](docs/audit.md) |
| Error taxonomy and result serialization | [docs/errors.md](docs/errors.md) |
| Install (binary / Homebrew / source / channels) | [docs/install.md](docs/install.md) |
| Deployment (binary / container / RustFS bootstrap / auth / build variants) | [docs/deployment.md](docs/deployment.md) |
| CI / release workflows | [docs/ci.md](docs/ci.md) |
| Constants & tunables cheat sheet | [docs/constants.md](docs/constants.md) |
| Per-version release notes | [docs/releases/](docs/releases/) |
---
## Always-on rules (load these into your working memory)
These are architectural rules that need to be in scope on every change. They're framed at the level that survives renames and refactors — the deeper implementation specifics (function names, lock names, branch-prefix conventions, enforcement points) live in the per-area docs and may evolve. The full architectural invariants and deny-list are in [docs/invariants.md](docs/invariants.md); §IX (deny-list) is the fastest first-pass when reviewing any change.
1. **Multi-dataset publish is atomic across the whole graph.** A graph commit flips every relevant sub-table version visible together, in one manifest write. Don't introduce code paths that publish per sub-table outside the unified publish path — that loses cross-table snapshot isolation.
2. **Snapshot isolation per query.** A query holds one snapshot for its lifetime. Don't re-read the current head mid-query.
3. **Mutations are atomic at the commit boundary.** Multi-statement change queries publish one commit. Don't commit per-statement.
4. **Bearer-token plaintext never persists in process memory.** Tokens are hashed at startup; auth uses constant-time comparison; the actor id is server-resolved from the hash match and must not be settable by the client.
5. **Reads always see the current index state for the branch they're reading.** Indexes track the branch head, not historical snapshots. If you change index lifecycle, preserve this guarantee.
6. **Stable type IDs survive renames.** Schema migration relies on identity that's stable across rename — don't mint new IDs on rename.
### Deny-list (fast-pass review filter — full reasoning in [docs/invariants.md §IX](docs/invariants.md))
If a proposal fits one of these, the burden is on the proposer to justify why this case is the exception:
- Synchronous-inline index updates for indexes expensive to build (vector ANN, FTS) — use the reconciler pattern.
- Custom WAL / transaction manager / buffer pool — Lance owns these.
- Job queue for state derivable from manifest — reconciler pattern instead.
- Per-feature lowering for shapes that share a structure (interfaces, wildcards, alternation) — use one mechanism.
- Eager materialization of cross-products in multi-hop — factorize; flatten only when needed.
- Ad-hoc IN-list filtering when SIP fits.
- String-flattened SQL filter generation when structured pushdown is available.
- In-process-only `Dataset` impls — `Send + Sync`, remote descriptors.
- Cost-blind plan choice — lowering-order execution is not a planner.
- Hidden statistics — if a metric matters for plan choice, it must be exposed through the trait surface.
- Side-channels for query semantics — search modes, mutations, polymorphism are first-class IR concepts.
- Discarding rank in retrieval — score and rank propagate as columns.
- State that drifts from the manifest — derive from observable state.
- Cloud-only correctness fixes — correctness is always OSS.
- Forking the codebase for Cloud — trait-extension only.
- Hand-rolling something Lance already does — check the spec first.
- Mutating in place state that should be immutable (Lance fragments, index segments) — new segments instead.
- Silent failures — OOM, timeout, partial result must all be surfaced and bounded.
---
## Quick-reference flows
```bash
# Initialize an S3-backed repo
omnigraph init --schema ./schema.pg s3://my-bucket/repo.omni
# Bulk load
omnigraph load --data ./seed.jsonl --mode overwrite s3://my-bucket/repo.omni
# Branch + ingest a review batch
omnigraph branch create --from main review/2026-04-25 s3://my-bucket/repo.omni
omnigraph ingest --branch review/2026-04-25 --data ./batch.jsonl s3://my-bucket/repo.omni
# Run a hybrid (vector + BM25) query
omnigraph read --query ./queries.gq --name find_similar \
--params '{"q":"trends in AI safety"}' --format table s3://my-bucket/repo.omni
# Plan + apply schema migration
omnigraph schema plan --schema ./next.pg s3://my-bucket/repo.omni
omnigraph schema apply --schema ./next.pg s3://my-bucket/repo.omni --json
# Merge review branch back
omnigraph branch merge review/2026-04-25 --into main s3://my-bucket/repo.omni
# Compact + GC (preview, then confirm)
omnigraph optimize s3://my-bucket/repo.omni
omnigraph cleanup --keep 10 --older-than 7d s3://my-bucket/repo.omni
omnigraph cleanup --keep 10 --older-than 7d --confirm s3://my-bucket/repo.omni
# Stand up the HTTP server (token from env)
OMNIGRAPH_SERVER_BEARER_TOKEN=xxxx \
omnigraph-server s3://my-bucket/repo.omni --bind 0.0.0.0:8080
# Cedar policy explain
omnigraph policy explain --actor act-alice --action change --branch main
```
---
## Capability matrix — "Lens by default vs. added by OmniGraph"
| Capability | L1 (Lance default) | L2 (OmniGraph adds) |
|---|---|---|
| Columnar storage on object store | ✅ Arrow/Lance | URI normalization, S3 env-var plumbing |
| Per-dataset versioning + time travel | ✅ | `snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables |
| Per-dataset branches | ✅ | **Graph-level** branches (atomic across all sub-tables), lazy fork, system branch filtering |
| Atomic single-dataset commits | ✅ | **Atomic multi-dataset publish** via `__manifest` + `ManifestBatchPublisher` |
| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency |
| Cleanup (`cleanup_old_versions`) | ✅ | `omnigraph cleanup` with `--keep` / `--older-than` policy |
| BTREE / inverted (FTS) / vector indexes | ✅ | `ensure_indices` builds them on every relevant column; idempotent; lazy across branches |
| `merge_insert` upsert | ✅ | `LoadMode::Merge`, mutation `update`/`insert`/`delete` lowering |
| Vector search | ✅ | `nearest()` query op; embedding pipeline (Gemini / OpenAI clients); `@embed` in schema |
| Full-text search | ✅ | `search/fuzzy/match_text/bm25` query ops |
| Hybrid ranking | — | `rrf(...)` Reciprocal Rank Fusion in one runtime |
| Graph traversal | — | CSR/CSC topology index, `Expand` IR op, variable-length hops, `not { }` anti-join |
| Schema language | — | `.pg` + Pest grammar + catalog + interfaces + constraints + annotations |
| Query language | — | `.gq` + Pest grammar + IR + lowering + linter |
| Schema migration planning | — | `plan_schema_migration` + `apply_schema` step types + `__schema_apply_lock__` |
| Commit graph (DAG) across whole repo | — | `_graph_commits.lance` with linear + merge parents, ULID ids, actor map |
| Transactional runs | — | `_graph_runs.lance`, `__run__<id>` ephemeral branches, fast-path & merge-path publish |
| Three-way row-level merge | — | `OrderedTableCursor` + `StagedTableWriter`, structured `MergeConflictKind` |
| Change feeds | — | `diff_between` / `diff_commits` with manifest fast path + ID streaming |
| Cedar policy | — | 10 actions, branch / target_branch / protected scopes, validate/test/explain CLI |
| HTTP server | — | Axum, OpenAPI via utoipa, bearer auth (SHA-256, AWS Secrets Manager option), policy gating, NDJSON streaming export |
| CLI with config | — | `omnigraph.yaml`, aliases, multi-format output (json/jsonl/csv/kv/table) |
| Audit / actor tracking | — | `_as` write APIs + actor maps in commit & run datasets |
| Local RustFS bootstrap | — | `scripts/local-rustfs-bootstrap.sh` one-shot S3-backed dev environment |
---
## Maintenance contract for agents
When you change something user-visible, **update the relevant `docs/<area>.md` in the same change**. Pointers from this file to that doc must keep working — CI enforces cross-link integrity via `scripts/check-agents-md.sh`.
When proposing or reviewing a non-trivial change, walk [docs/invariants.md](docs/invariants.md) — at minimum the §IX deny-list and §X review checklist. Add to the deny-list when a new anti-pattern surfaces; relaxing an invariant requires the same review process as code.
Rules:
1. **Update in the same PR.** New endpoint, query function, CLI flag, env var, constant, schema construct, or invariant: update both the source code and the doc in the same change. Never split documentation drift into a follow-up.
2. **Bump version on release.** When a release boundary crosses (e.g. v0.3.1 → v0.3.2), update the version line at the top of this file and add a `docs/releases/<version>.md` describing the user-visible delta. Update [docs/architecture.md](docs/architecture.md) only if the architecture itself changed.
3. **Don't lie.** If a section becomes wrong but you can't rewrite it fully right now, replace the wrong line with `*(stale — needs update after <change>)*` rather than leaving silently incorrect text. Then fix it ASAP.
4. **Re-verify before recommending.** If you cite a flag, env var, endpoint, or constant to the user or in code, grep for it in source first. Memory and docs go stale; the code is authoritative.
5. **Keep AGENTS.md a map, not an encyclopedia.** New deep content goes into `docs/`. Add an entry to "Where to find each topic" instead of pasting prose into this file. The "Always-on rules" section is the exception — it's for invariants that should always be in scope.
6. **Re-read on schema/query/IR changes.** Edits to `schema.pest`, `query.pest`, `ir/lower.rs`, `query/typecheck.rs`, or `query/lint.rs` should trigger a re-read of [docs/schema-language.md](docs/schema-language.md), [docs/query-language.md](docs/query-language.md), and [docs/execution.md](docs/execution.md) to confirm they still describe reality.
CI check: `scripts/check-agents-md.sh` verifies that every `docs/*.md` link in this file resolves and that every doc in the canonical set is linked. Run it locally before opening a PR if you've moved or renamed docs.

1
CLAUDE.md Symbolic link
View file

@ -0,0 +1 @@
AGENTS.md

69
docs/architecture.md Normal file
View file

@ -0,0 +1,69 @@
# Architecture
OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a single `omnigraph.yaml`.
## Stack
```
┌──────────────────────────────────────────────────────────────────┐
│ CLI (omnigraph) HTTP Server (omnigraph-server, Axum) │
│ - 13 cmd families - REST + OpenAPI │
│ - Aliases, configs - Bearer auth + Cedar policy │
└──────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────────┐
│ omnigraph-compiler │
│ - Pest grammars: schema.pest, query.pest │
│ - Catalog (Node/Edge/Interface types) │
│ - IR + lowering (NodeScan / Expand / Filter / AntiJoin) │
│ - Schema migration planner │
│ - Embedding client (OpenAI-style for query-time normalization) │
└──────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────────┐
│ omnigraph (engine) │
│ - GraphCoordinator + ManifestRepo (__manifest) │
│ - CommitGraph (_graph_commits.lance) │
│ - RunRegistry (_graph_runs.lance, __run__ branches) │
│ - GraphIndex (CSR/CSC) + RuntimeCache (LRU 8) │
│ - exec::query / mutation / merge │
│ - Embedding client (Gemini for runtime ingest) │
└──────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────────┐
│ Lance 4.x (per-table dataset) │
│ - Columnar (Arrow) storage, fragments │
│ - Manifest versions per dataset │
│ - Per-dataset branches (copy-on-write) │
│ - Indexes: BTREE, Inverted (FTS/BM25), IVF/HNSW vector │
│ - merge_insert (upsert), append, delete │
│ - compact_files, cleanup_old_versions │
└──────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────────┐
│ Object store: local FS, S3, RustFS, MinIO, S3-compatible │
└──────────────────────────────────────────────────────────────────┘
```
## L1 / L2 framing
Throughout the docs, capabilities are split into:
- **L1 — Inherited from Lance**: what OmniGraph gets "for free" by sitting on top of the Lance dataset format (columnar Arrow storage, per-dataset versions and branches, index types, `merge_insert`, `compact_files` / `cleanup_old_versions`).
- **L2 — Added by OmniGraph**: typing (schema language), graph semantics, multi-dataset coordination via `__manifest`, graph-level branches and commits, the `.gq` query language and IR, the topology index, the HTTP server, Cedar policy, the CLI.
## Concurrency model
- **MVCC**: every Lance write bumps a per-dataset version; the OmniGraph manifest version coordinates which sub-table versions are visible together.
- **Snapshot isolation**: a query holds one `Snapshot` for its lifetime; concurrent writes don't leak in.
- **Cross-branch isolation**: copy-on-write means readers and writers on different branches don't block each other.
- **Run isolation**: each transactional run lives on its own `__run__<id>` branch.
- **Schema-apply lock**: `__schema_apply_lock__` system branch serializes schema migrations.
- **Fail-points** (`failpoints` cargo feature): `failpoints::maybe_fail("operation.step")?` in `branch_create`, publish, etc., for deterministic failure injection in tests.
## Workspace crates
- `omnigraph-compiler` — schema and query grammars, catalog, IR, lowering, type checker, lint, migration planner, OpenAI-style embedding client.
- `omnigraph` (engine, published as `omnigraph-engine` on crates.io since v0.2.2) — the Lance-backed runtime: manifest, commit graph, run registry, snapshot, exec, merge, loader, Gemini embedding client.
- `omnigraph-cli` — the `omnigraph` binary.
- `omnigraph-server` — the `omnigraph-server` binary (Axum HTTP server).

6
docs/audit.md Normal file
View file

@ -0,0 +1,6 @@
# Audit / Actor tracking
- `Omnigraph::audit_actor_id: Option<String>` is the actor in effect.
- `_as` variants of every write API let callers override the actor: `begin_run_as`, `publish_run_as`, `ingest_as`, `mutate_as`, `branch_merge_as`, etc.
- Actor IDs are persisted both on `RunRecord.actor_id` and on `GraphCommit.actor_id`, with optional split storage in `_graph_commit_actors.lance` and `_graph_run_actors.lance`.
- HTTP server uses the bearer-token actor automatically; CLI uses the local user / explicit env (no implicit actor).

57
docs/branches-commits.md Normal file
View file

@ -0,0 +1,57 @@
# Branches, Commits, Snapshots
## L1 — Lance per-dataset branches
Lance supports branching at the dataset level: a branch is a named lineage of versions, and `fork_branch_from_state(source_branch, target_branch, source_version)` creates a copy-on-write fork.
## L2 — Graph-level branches
OmniGraph builds *graph branches* on top by branching every sub-table coherently:
- `branch_create(name)` / `branch_create_from(target, name)` — disallowed name `main`; fails if branch exists; ensures the schema-apply lock is idle.
- `branch_list()` — returns public branches, **filters internal** `__run__…` and `__schema_apply_lock__` prefixes.
- `branch_delete(name)` — refuses if there are descendants or active runs on the branch; cleans up owned per-branch fragments.
- **Lazy forking**: a branch only forks a sub-table when that sub-table is first mutated on it. Pure-read branches share fragments with their source.
- `sync_branch(branch)` — re-binds the in-memory handle to the latest head of the branch.
## L2 — Commit graph (`db/commit_graph.rs`)
In-memory shape of a graph commit:
```
GraphCommit {
graph_commit_id: ULID,
manifest_branch: Option<String>,
manifest_version: u64,
parent_commit_id: Option<String>,
merged_parent_commit_id: Option<String>, // populated for merge commits
actor_id: Option<String>, // joined in-memory from _graph_commit_actors.lance, NOT a column on _graph_commits.lance
created_at: i64 (microseconds since epoch),
}
```
Storage is split across two Lance datasets (both with stable row IDs):
- `_graph_commits.lance` — every column above *except* `actor_id`.
- `_graph_commit_actors.lance` — optional separate `(graph_commit_id, actor_id)` map, created on demand. The `actor_id` field above is populated by joining this dataset in-memory at load time.
Notes:
- Every successful publish (load / change / merge / schema_apply / publish_run) appends one commit.
- Merge commits have two parents; linear commits have one.
- API: `list_commits(branch)`, `get_commit(id)`, `head_commit_id_for_branch(branch)`.
## L2 — Snapshots & time travel
- `snapshot()` — current snapshot for the bound branch; cached.
- `snapshot_of(target)` — snapshot at a `ReadTarget` (branch | snapshot id).
- `snapshot_at_version(v: u64)` — historical snapshot from any manifest version.
- `entity_at(table_key, id, version)` — single-entity time travel without building a full snapshot.
- A `Snapshot` is a `(version, HashMap<table_key, SubTableEntry>)` — cheap to build, snapshot-isolated cross-table reads.
## L2 — Internal system branches
Filtered from `branch_list()` but visible to internals:
- `__run__<run-id>` — ephemeral isolation branch for a transactional run.
- `__schema_apply_lock__` — serializes schema migrations.

24
docs/changes.md Normal file
View file

@ -0,0 +1,24 @@
# Change Detection / Diff
`changes/mod.rs`. Three-level algorithm:
1. **Manifest diff**: skip sub-tables whose `(table_version, table_branch)` is unchanged.
2. **Lineage check**:
- Same branch lineage → fast path: use the per-row `_row_last_updated_at_version` column to classify Insert/Update/Delete.
- Different lineages → ID-based streaming comparison.
3. **Row-level diff**: streaming, no full materialization.
## Public API
- `diff_between(from: ReadTarget, to: ReadTarget, filter: Option<ChangeFilter>) -> ChangeSet`
- `diff_commits(from_commit_id, to_commit_id, filter)` — cross-branch safe.
## Types
```
ChangeOp: Insert | Update | Delete
EntityKind: Node | Edge
EntityChange { table_key, kind, type_name, id, op, manifest_version, endpoints?: {src, dst} }
ChangeFilter { kinds?, type_names?, ops? }
ChangeSet { from_version, to_version, branch?, changes[], stats }
```

10
docs/ci.md Normal file
View file

@ -0,0 +1,10 @@
# CI / Release Workflows
`.github/workflows/`:
- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repo PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_repo_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.
- **release-edge.yml**: on every push to main, retags `edge`, builds Linux/macOS-Intel/macOS-arm64 archives + sha256, publishes a rolling prerelease.
- **release.yml**: on `v*` tags, builds the 3-platform matrix and updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by pushing the regenerated formula to `ModernRelay/homebrew-tap`.
- **package.yml**: manual ECR image build; emits two image tags per commit (`<sha>`, `<sha>-aws`) via CodeBuild.

83
docs/cli-reference.md Normal file
View file

@ -0,0 +1,83 @@
# CLI Reference (`omnigraph`)
A reference for the `omnigraph` binary's command surface and `omnigraph.yaml` schema. For a quick-start guide, see [cli.md](cli.md).
17 top-level command families, 40+ subcommands. All commands accept either a positional `URI`, `--uri`, or a `--target <name>` resolved against `omnigraph.yaml`.
## Top-level commands
| Command | Purpose |
|---|---|
| `init` | `--schema <pg>` → initialize a repo (also scaffolds `omnigraph.yaml` if missing) |
| `load` | bulk load a branch (`--mode overwrite\|append\|merge`) |
| `ingest` | branch-creating transactional load (`--from <base>`) |
| `read` | run named query (params via `--params`, `--params-file`, or alias args) |
| `change` | run mutation query |
| `snapshot` | print current snapshot (per-table version + row count) |
| `export` | dump to JSONL on stdout (`--type T`, `--table K` filters) |
| `branch create \| list \| delete \| merge` | branching ops |
| `commit list \| show` | inspect commit graph |
| `run list \| show \| publish \| abort` | transactional run ops |
| `schema plan \| apply \| show (alias: get)` | migrations |
| `query lint \| check` | offline / repo-backed validation |
| `optimize` | non-destructive Lance compaction |
| `cleanup --keep N --older-than 7d --confirm` | destructive version GC |
| `embed` | offline JSONL embedding pipeline |
| `policy validate \| test \| explain` | Cedar tooling |
| `version` / `-v` | print `omnigraph 0.3.x` |
## `omnigraph.yaml` schema
```yaml
project: { name }
graphs:
<name>:
uri: <local|s3://|http(s)://>
bearer_token_env: <ENV_NAME>
server:
graph: <name>
bind: <ip:port>
cli:
graph: <name>
branch: <name>
output_format: json|jsonl|csv|kv|table
table_max_column_width: 80
table_cell_layout: truncate|wrap
query:
roots: [<dir>, …] # search path for .gq files
auth:
env_file: ./.env.omni
aliases:
<alias>:
command: read|change
query: <path-to-.gq>
name: <query-name>
args: [<positional-name>, …]
graph: <name>
branch: <name>
format: <output-format>
policy:
file: ./policy.yaml
```
## Output formats (read command)
- `json` — pretty-printed object with metadata + rows
- `jsonl` — one metadata line then one JSON object per row
- `csv` — RFC 4180-ish quoting
- `table` — fitted text table, honors `table_max_column_width` + `table_cell_layout`
- `kv` — grouped per-row key/value blocks
## Param resolution
Precedence (high to low): explicit `--params` / `--params-file`, alias positional args, `omnigraph.yaml` defaults. JS-safe-integer handling is built in (`is_js_safe_integer_i64`, `JS_MAX_SAFE_INTEGER_U64`) so 64-bit ids round-trip safely through JSON clients.
## Bearer token resolution (CLI)
1. `graphs.<name>.bearer_token_env`
2. `OMNIGRAPH_BEARER_TOKEN` global env
3. `auth.env_file` referenced `.env`
## Duration parsing (cleanup)
`s | m | h | d | w` units, e.g. `--older-than 7d`.

20
docs/constants.md Normal file
View file

@ -0,0 +1,20 @@
# Constants & Tunables (cheat sheet)
| Name | Value | Where |
|---|---|---|
| `MANIFEST_DIR` | `__manifest` | `db/manifest/layout.rs` |
| Commit graph dir | `_graph_commits.lance` | `db/commit_graph.rs` |
| Run registry dir | `_graph_runs.lance` | `db/run_registry.rs` |
| Run branch prefix | `__run__` | `db/run_registry.rs` |
| Schema apply lock | `__schema_apply_lock__` | `db/mod.rs` |
| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | `exec/merge.rs` |
| Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | `db/omnigraph/optimize.rs` |
| Graph index cache size | `8` (LRU) | `runtime_cache.rs` |
| Default body limit | `1 MB` | `omnigraph-server/lib.rs` |
| Ingest body limit | `32 MB` | `omnigraph-server/lib.rs` |
| Engine embed model | `gemini-embedding-2-preview` | `omnigraph/embedding.rs` |
| Compiler embed model | `text-embedding-3-small` | `omnigraph-compiler/embedding.rs` |
| Embed timeout | `30 000 ms` | both clients |
| Embed retries | `4` | both clients |
| Embed retry backoff | `200 ms` | both clients |
| LANCE memory pool default | `1 GB` (raised in v0.3.0) | runtime |

31
docs/embeddings.md Normal file
View file

@ -0,0 +1,31 @@
# Embeddings
OmniGraph has **two** embedding clients with different defaults and purposes.
## Compiler-side client (`omnigraph-compiler/src/embedding.rs`) — query-time normalization
- Default model: `text-embedding-3-small` (OpenAI-style schema)
- Env: `NANOGRAPH_EMBED_MODEL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL` (default `https://api.openai.com/v1`), `NANOGRAPH_EMBEDDINGS_MOCK`, `NANOGRAPH_EMBED_TIMEOUT_MS=30000`, `NANOGRAPH_EMBED_RETRY_ATTEMPTS=4`, `NANOGRAPH_EMBED_RETRY_BACKOFF_MS=200`
- Methods: `embed_text(input, expected_dim)`, `embed_texts(inputs, expected_dim)`
- Mock mode: deterministic FNV-1a + xorshift64 → L2-normalized vectors
## Engine-side client (`omnigraph/src/embedding.rs`) — runtime ingest
- Model: `gemini-embedding-2-preview`
- Env: `GEMINI_API_KEY`, `OMNIGRAPH_GEMINI_BASE_URL` (default Google generativelanguage v1beta), `OMNIGRAPH_EMBED_TIMEOUT_MS=30000`, `OMNIGRAPH_EMBED_RETRY_ATTEMPTS=4`, `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS=200`, `OMNIGRAPH_EMBEDDINGS_MOCK`
- Two task types: `embed_query_text` (RETRIEVAL_QUERY) and `embed_document_text` (RETRIEVAL_DOCUMENT)
- Exponential backoff with retryable detection (timeouts, 429, 5xx)
## Schema integration
Mark a Vector property with `@embed("source_text_property")`. At ingest, the engine pulls the source text and writes the embedding into the vector column. Stored as L2-normalized FixedSizeList(Float32, dim).
## CLI `omnigraph embed` (offline file pipeline)
Operates on **JSONL files** (not on a repo). Three modes (mutually exclusive):
- (default) `fill_missing` — only embed rows whose target field is empty
- `--reembed-all` — overwrite all
- `--clean` — strip embeddings
Inputs are either a single seed manifest YAML or `--input/--output/--spec`. Selectors `--type T`, `--select T:field=value` filter rows. Streams JSONL → JSONL.

21
docs/errors.md Normal file
View file

@ -0,0 +1,21 @@
# Errors and Result Serialization
## Error taxonomy (`omnigraph::error::OmniError`)
- `Compiler(...)` — schema/query parse/typecheck errors
- `Lance(String)` — storage layer
- `DataFusion(String)` — execution layer
- `Io(io::Error)`
- `Manifest(ManifestError { kind: BadRequest|NotFound|Conflict|Internal, … })`
- `MergeConflicts(Vec<MergeConflict>)`
Compiler-side `NanoError` covers parse / catalog / type / storage / plan / execution / arrow / lance / IO / manifest / unique-constraint, each with structured spans (`SourceSpan { start, end }`) for ariadne-style diagnostics.
## Result serialization (`omnigraph_compiler::result::QueryResult`)
- `to_arrow_ipc()` — efficient binary
- `to_sdk_json()` — JS-safe JSON (large i64 wrapped in metadata)
- `to_rust_json()` — Rust-friendly JSON
- `batches()` — direct Arrow `RecordBatch` access
Mutation results: `{ affectedNodes: usize, affectedEdges: usize }` (also exposed as a tiny Arrow batch).

76
docs/execution.md Normal file
View file

@ -0,0 +1,76 @@
# Query Execution, Mutations, and Loading
## Query execution (`exec/query.rs`)
Pipeline:
1. Parse + typecheck via `omnigraph-compiler`.
2. Lower to IR.
3. If `Expand` or `AntiJoin` is present, build (or fetch from `RuntimeCache`) a `GraphIndex`.
4. Run `execute_query` against the snapshot.
### Multi-modal search modes (`SearchMode`)
The executor recognizes three modes that may be combined in a single query:
- **`nearest`** — vector ANN (uses Lance vector index; `LIMIT` required).
- **`bm25`** — BM25 over an inverted index.
- **`rrf`** — Reciprocal Rank Fusion of two rankings, with k (default 60).
Hybrid example: `order { rrf(nearest($d.embedding, $q), bm25($d.body, $q_text)) desc } limit 20`.
### Joins / set operations
- Joins are implicit: MATCH bindings + traversals are implemented as scans + CSR/CSC lookups.
- `not { … }` lowers to an `AntiJoin` over the inner pipeline.
### Scoped reads
- `query(target, source, name, params)` — at any branch or snapshot.
- `run_query_at(version, …)` — direct historical query at a manifest version.
### Concurrency
- Snapshot isolation per query: all reads inside a query use the same `Snapshot`.
- Readers and writers on different branches don't block each other.
## Mutation execution (`exec/mutation.rs`)
Resolves expression values to literals, converts to typed Arrow arrays (`literal_to_typed_array(lit, DataType, num_rows)`), then writes:
- `insert` → Lance `WriteMode::Append`
- `update` → Lance `merge_insert(WhenMatched::Update)`
- `delete` → Lance `merge_insert(WhenMatched::Delete)` (logical) or filtered overwrite.
Multi-statement mutations are atomic at the manifest commit boundary.
## Bulk loader (`loader/mod.rs`)
- **JSONL only** in v1, with two record shapes:
- Node: `{"type":"NodeType", "data":{…}}`
- Edge: `{"edge":"EdgeType", "from":"src_id", "to":"dst_id", "data":{…}}`
- Lines starting with `//` are treated as comments.
- Schema validation on every row (typecheck, required props, blob base64 decoding).
- Edge endpoint resolution by node `@key`.
## Load modes (`LoadMode`)
| Mode | Semantics |
|---|---|
| `Overwrite` | Replace all data in the target tables on the branch |
| `Append` | Strict insert; duplicates error |
| `Merge` | Upsert by id (`merge_insert`) |
## `load` vs `ingest`
- `load(branch, data, mode)` — direct load to a branch.
- `ingest(branch, from, data, mode)` — branch-creating, transactional load:
1. If target advanced since the run started, fork a fresh run branch from `from`.
2. Load into the run branch (Append).
3. If target hasn't moved, fast-publish; otherwise abort.
- Returns `IngestResult { branch, base_branch, branch_created, mode, tables[] }`.
- `ingest_as(actor_id)` records the actor on the resulting commit.
## Embeddings during load
If a node type has `@embed` properties, the loader calls the engine embedding client (Gemini, RETRIEVAL_DOCUMENT) per row to populate the vector column. See [embeddings.md](embeddings.md).

26
docs/indexes.md Normal file
View file

@ -0,0 +1,26 @@
# Indexes
## L1 — Lance index types OmniGraph exposes
| Index | Use | Notes |
|---|---|---|
| **BTREE scalar** | range / equality on any scalar | created on `@key`, `@index(...)`, and on key columns by `ensure_indices()` |
| **Inverted (FTS)** | `search`, `fuzzy`, `match_text`, `bm25` | created on text columns referenced by FTS queries |
| **Vector** | `nearest()` k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as FixedSizeList(Float32, dim) |
## L2 — OmniGraph orchestration
- `ensure_indices()` / `ensure_indices_on(branch)` — idempotent build of BTREE + inverted indexes for the current head; safe to re-run.
- Indexes are built on the *branch head* (not on a snapshot), so reads always see the current index state.
- **Lazy branch forking for indexes**: a branch that hasn't mutated a sub-table doesn't need its own index — the main lineage's index is reused until the first write triggers a copy-on-write fork.
- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column.
## L2 — Graph topology index (`graph_index/mod.rs`)
This is OmniGraph-specific (not Lance):
- `TypeIndex`: dense `u32 ↔ String id` mapping per node type.
- `CsrIndex`: Compressed Sparse Row representation of edges per edge type — `offsets[i]..offsets[i+1]` slices into `targets`.
- `GraphIndex { type_indices, csr (out), csc (in) }` — built on demand from a snapshot's edge tables.
- Cached in `RuntimeCache::graph_indices` (LRU, max 8 entries, keyed by snapshot id + edge table versions).
- Built only when an `Expand` or `AntiJoin` IR op is present in the lowered query, so pure scans skip it.

162
docs/invariants.md Normal file
View file

@ -0,0 +1,162 @@
# Architectural Invariants & Policies
**Type:** Reference / standing document
**Status:** Living — updated as decisions accrue
**Audience:** anyone proposing, reviewing, or implementing a change to any part of OmniGraph
This document captures the invariants that hold across all of OmniGraph — storage, engine, server, schema, indexing, observability, and the OSS / Cloud product split. New RFCs, designs, and implementations are checked against this list.
These are not query-engine-specific. They apply to every layer.
> **Note on numbering.** §IV (Additivity / migration) and §VIII (OSS / Cloud kernel-product split) are referenced from §X but not yet drafted in this revision. Pending upstream fill-in.
## How to use
- **Writing an RFC or design proposal:** walk through the relevant sections and state how the proposal upholds each invariant — or why a documented exception is justified.
- **Reviewing a PR or design:** scan for invariants the change might violate. The deny-list (§IX) is the fastest first pass.
- **Debating a tradeoff:** invoke the relevant invariant and check whether the tradeoff respects it. Invariants here are load-bearing; breaking one is rare and requires explicit justification in the proposal.
- **Updating this document:** add to the deny-list freely. Removing or relaxing an invariant requires the same review process as any other architectural decision.
## I. Substrate boundaries — what we don't build
The most important question for any new component is "does the substrate already do this?" If yes, we don't.
1. **Lance is the storage substrate.** We do not build a competing storage format. We build above Lance via a trait. Where we want format-level behavior Lance doesn't have, we propose it upstream or work around it.
*Check:* Does this proposal introduce a parallel storage format, custom on-disk pages, custom serialization?
2. **No own-format WAL or transaction manager.** Lance manifest MVCC plus (eventually) MemWAL is the durability story. We do not write durability code.
*Check:* Does this proposal track its own write log, recovery state, or transaction journal?
3. **No own buffer pool.** Lance + `object_store` + Lance's scan-aware cache cover our needs.
*Check:* Does this proposal cache Arrow batches or pages outside Lance's cache?
4. **The runtime substrate provides relational machinery.** Whether we choose DataFusion (current prior) or a custom executor, we do not rebuild joins, aggregations, parallelism, spill. We build only graph- or multi-modal-shaped operators on top.
*Check:* Are we extending the substrate via traits, or reimplementing parts of it?
5. **Don't replace Lance's index lifecycle.** New fragments enter without index coverage; reads union indexed and scan paths via `fragment_bitmap`. Our reconciler observes the same primitive; we don't replicate it.
*Check:* Does this proposal maintain its own version of "what's indexed and what isn't" parallel to Lance's?
## II. Layering — the seams that hold
6. **The IR is the contract between frontend and backend.** Frontends emit IR; planner / executor consume it. No frontend logic leaks downward; no executor concerns leak upward.
*Check:* Does the proposal add to the IR, or to a layer? If to a layer, does it cross another layer's concern?
7. **Capabilities and statistics flow upward; data flows downward.** Lower layers expose what they can do (capabilities) and what they know (statistics). Upper layers consume both. Methods alone are insufficient — methods without capability advertisement force one-size-fits-all plans.
*Check:* When adding a method to a layer trait, did we also expose the capability so the planner can reason about it?
8. **One trait boundary per layer.** Crossing a layer means going through its trait. Direct calls to lower-layer concrete types from upper layers are forbidden.
*Check:* Does this code call `lance::Dataset` directly outside engine-storage? Call planner internals from the executor?
9. **No god modules.** Single-module concerns: storage, IR, planner, executor, frontend, reconciler, schema, policy. Each crate has a reference test suite that runs without the others.
*Check:* Does this PR add a concern to a crate that already owns a different one?
## III. Distributability — kernel stays remote-friendly
These are technical constraints, independent of whether we ship a distributed product. They preserve the architectural seam.
10. Storage trait is `Send + Sync`. No in-process-only assumptions in `Dataset` impls.
11. `Dataset` impls accept remote descriptors (URI, snapshot ref, fragment ID) without requiring an open in-process Lance handle.
12. **IR is location-neutral.** No IR operator embeds an assumption about where data lives.
13. Cost model accepts a network-cost term as a future additive component. No place hard-codes "all cost is local I/O."
14. Reconciler trait admits alternate implementations. In-process tokio is the OSS default; a separable worker fleet is the distributed shape.
## IV. *(Additivity / migration — placeholder, not yet drafted)*
## V. Honesty — cost, observability, calibration
20. **Estimate-vs-actual logging on every estimator.** Cost models drift; calibration is a continuous process, not a one-off.
21. **Coverage and lag are first-class metrics.** Index coverage, reconciler lag, cost-model accuracy — surfaced through the storage trait's `capabilities()` and a unified observability API.
22. **Honest failure modes.** Cost-model misses degrade gracefully (spill, partial-result, bounded abort). No silent OOM.
23. **Per-query budgets propagate through operators.** Memory cap, wall-clock timeout, max-rows-scanned, max-fragments-scanned. Operators respect them; budgets exposed via explain.
24. **Plans are explainable.** Every executed query can be inspected as IR + physical plan + cost annotations. No "you'd have to read the source to know what this does."
## VI. Patterns — use the unified mechanism
When two features look similar, they probably share a mechanism. Use it.
25. **Reconciler pattern for derivable state.** Index coverage, statistics, anything derivable from manifest state — reconciled, not job-queued.
26. **Polymorphism via Union, not per-feature lowering.** Interfaces / wildcards / alternation on nodes and edges share one IR (`Polymorphism<T>`) and one lowering (Union of per-type concrete plans).
27. **Mutations wrap read subplans.** Insert / Update / Delete / Merge are operators that consume read-shaped subplans. Same planner, same cost model, same storage trait.
28. **SIP for cross-operator selectivity propagation.** Producers publish ID bitmaps; downstream scans consume them through structured pushdown. Don't ad-hoc IN-lists.
29. **Factorize multi-hop, flatten only at projection.** Lists carry multiplicity through intermediate operators. Flatten is inserted by the planner where required, not eagerly.
30. **Stable row IDs as dense graph IDs.** Don't maintain parallel string→u32 maps. Lance's stable row IDs are the substrate's identity layer; we use them directly.
31. **Rank and score are columns.** Retrieval operators emit `_score`, `_rank`. Fusion operators consume rank-bearing batches. Don't discard rank in pipeline-twice merges.
32. **Policy as predicates.** Authorization decisions are filter expressions injected into the planner, not enforcement at the API boundary.
## VII. Quality gates — every change passes
33. **Tests at every boundary.** `MemStorage` for engine tests; planner-only tests; executor-only tests with a stub storage. No layer tested only via end-to-end.
34. **Reference implementation per trait.** Every trait has a primary impl (Lance for storage) and at least a test impl.
35. **Documented capability surface.** New capabilities are documented with what they advertise, who consumes them, how the planner uses them.
36. **Benchmark before optimization.** New optimizations land with a benchmark that motivates them; if the motivating workload doesn't exist, the feature waits.
## VIII. *(OSS / Cloud kernel-product split — placeholder, not yet drafted)*
## IX. Anti-patterns — deny list
If a proposal fits one of these, the burden is on the proposer to justify why this case is the exception.
- Synchronous-inline index updates for indexes expensive to build (vector ANN, FTS). Reconciler pattern instead.
- Custom WAL / transaction manager / buffer pool. Lance owns these.
- Job queue for state derivable from manifest. Reconciler pattern instead.
- Per-feature lowering for shapes that share a structure (interfaces, wildcards, alternation). Use one mechanism.
- Eager materialization of cross-products in multi-hop. Factorize; flatten only when needed.
- Ad-hoc IN-list filtering when SIP fits.
- String-flattened SQL filter generation when structured pushdown is available.
- In-process-only `Dataset` impls. `Send + Sync`, remote descriptors.
- Cost-blind plan choice. Lowering-order execution is not a planner.
- Hidden statistics. If a metric matters for plan choice, it must be exposed through the trait surface.
- Side-channels for query semantics. Search modes, mutations, polymorphism — all first-class IR concepts.
- Discarding rank in retrieval. Score and rank propagate as columns.
- State that drifts from the manifest. Derive from observable state.
- Cloud-only correctness fixes. Correctness is always OSS.
- Forking the codebase for Cloud. Trait-extension only.
- Hand-rolling something Lance already does. Check the spec first.
- Mutating in place state that should be immutable (Lance fragments, index segments). New segments instead.
- Silent failures. OOM, timeout, partial result — all surfaced and bounded.
## X. Review checklist (use against any non-trivial change)
Print this when reviewing an RFC or PR. Each line is **yes / no / N/A**.
- Does it respect the substrate? (§I)
- Does it cross only one trait boundary per layer? (§II)
- Are capabilities and stats exposed for any new behavior? (§II.7)
- Storage trait stays `Send + Sync` and remote-friendly? (§III)
- Additive, not rewrite? Feature-flagged where behavior changes? (§IV)
- Any new estimator has estimate-vs-actual logging? (§V.20)
- Coverage / lag / budget metrics surfaced? (§V.2123)
- Failure modes graceful, bounded, observable? (§V.22)
- Reuses an existing pattern (reconciler, Union, mutation-wrap-read, SIP, factorize) where applicable? (§VI)
- Tests at every boundary, not just end-to-end? (§VII.33)
- Reference impl + test impl for any new trait? (§VII.34)
- If commercial-relevant: kernel/product split preserved? (§VIII)
- None of the deny-list patterns apply? (§IX)
## XI. Living document policy
This document is updated when:
- A new architectural decision establishes a new invariant — add it.
- An existing invariant is challenged and either reaffirmed (with the case sharpened) or revised (with explicit migration of any affected code).
- A new anti-pattern surfaces in review and deserves a place on the deny-list — add it.
Updates require the same review process as code. Adding to the deny-list (§IX) is cheap; removing or relaxing an invariant (§IVIII) requires explicit justification in the proposal.
When an invariant is contested in the moment, the resolution path is: (a) state the case in the relevant RFC or PR; (b) link it from this document; (c) update this document if the resolution changes the rule.
## XII. Source / origin
These invariants were extracted from the architectural decisions in:
- **MR-737** — Query Engine v2 RFC (the kernel scope and seams)
- **MR-738** — OSS / Cloud strategy (the commercial overlay)
- The schema migration program (**MR-694** family — additive evolution, safety tiers)
- The policy program (**MR-722 / MR-725** — predicate pushdown)
- The reconciler / index-lifecycle work (**MR-737 §5.16, MR-688, MR-679, MR-680**)
- The factorization and SIP work (**MR-737 §5.2, §5.3** — Kuzu / Ladybug inspiration)
- The polymorphic-bindings framing (**MR-737 §5.13** — one mechanism for eight cells)
General precedent: Lance + LanceDB Enterprise architecture; ClickHouse merge subsystem; Kubernetes controllers; Postgres autovacuum; the FDAL stack (Flight + DataFusion + Arrow + Lance).
Adding a new invariant here means we've learned something — either from a hard call we made and want to preserve, or from a mistake we don't want to repeat. Both are worth recording.

157
docs/lance.md Normal file
View file

@ -0,0 +1,157 @@
# Lance Docs Index (for OmniGraph agents)
OmniGraph sits on top of Lance. Many problems — index lifecycle, branching, transactions, fragments, compaction, vector/FTS internals — are answered upstream in Lance's docs, not in this repo.
This file is the curated entry point. **When you hit a Lance-shaped problem, find the matching topic below and fetch the listed URL(s) before guessing.** Don't grep our codebase for behavior that is documented authoritatively in Lance.
Base URL: `https://lance.org`. Use `WebFetch` (or your tool's equivalent) on the full URLs. Keep this index curated to relevant material — the upstream sitemap has hundreds of URLs (notably the Namespace REST API model surface, Spark/Trino/Databricks integrations) that we don't use.
> **Substrate boundary check.** Before fetching, recall [docs/invariants.md §I](invariants.md): if Lance already does the thing, we don't reimplement it. The most common reason to read these docs is to confirm a substrate behavior, not to learn what to clone.
## Quick-start (read these once per project)
| Read when | URL |
|---|---|
| Onboarding to Lance — concepts in 10 min | https://lance.org/quickstart/ |
| Onboarding to vector search | https://lance.org/quickstart/vector-search/ |
| Onboarding to full-text search | https://lance.org/quickstart/full-text-search/ |
| Onboarding to versioning / time travel | https://lance.org/quickstart/versioning/ |
| Lance's own AGENTS.md (its agent guide) | https://lance.org/format/AGENTS/ |
## By problem domain
### Storage format & file layout
Touching `db/manifest`, fragment lifecycle, dataset reconstruction, or anything that reads/writes raw Lance state.
| Topic | URL |
|---|---|
| Lance file format overview | https://lance.org/format/ |
| File-level format spec | https://lance.org/format/file/ |
| File encoding | https://lance.org/format/file/encoding/ |
| File-level versioning | https://lance.org/format/file/versioning/ |
| Table layout (fragments, manifest) | https://lance.org/format/table/layout/ |
| Table schema metadata | https://lance.org/format/table/schema/ |
| Table-level versioning | https://lance.org/format/table/versioning/ |
| Transactions (commit semantics, conflict types) | https://lance.org/format/table/transaction/ |
| MemWAL (durability story) | https://lance.org/format/table/mem_wal/ |
| Row-ID lineage (stable row IDs) | https://lance.org/format/table/row_id_lineage/ |
| Branches & tags (Lance native) | https://lance.org/format/table/branch_tag/ |
### Branching / tags / time travel
Touching graph-level branches, snapshots, run isolation, the commit graph.
| Topic | URL |
|---|---|
| Branch & tag format | https://lance.org/format/table/branch_tag/ |
| Tags & branches operational guide | https://lance.org/guide/tags_and_branches/ |
| Versioning quick-start | https://lance.org/quickstart/versioning/ |
| Table-level versioning spec | https://lance.org/format/table/versioning/ |
### Indexes
Adding/changing index types, fixing coverage, debugging FTS or vector recall, designing the reconciler.
| Topic | URL |
|---|---|
| Index spec overview | https://lance.org/format/table/index/ |
| BTREE scalar index | https://lance.org/format/table/index/scalar/btree/ |
| Bitmap scalar index | https://lance.org/format/table/index/scalar/bitmap/ |
| Bloom-filter scalar index | https://lance.org/format/table/index/scalar/bloom_filter/ |
| Label-list scalar index | https://lance.org/format/table/index/scalar/label_list/ |
| Zone-map scalar index | https://lance.org/format/table/index/scalar/zonemap/ |
| R-Tree scalar index (spatial) | https://lance.org/format/table/index/scalar/rtree/ |
| Full-text search (FTS) index | https://lance.org/format/table/index/scalar/fts/ |
| N-gram scalar index | https://lance.org/format/table/index/scalar/ngram/ |
| Vector index | https://lance.org/format/table/index/vector/ |
| Fragment-reuse system index | https://lance.org/format/table/index/system/frag_reuse/ |
| MemWAL system index | https://lance.org/format/table/index/system/mem_wal/ |
| HNSW Rust example | https://lance.org/examples/rust/hnsw/ |
| Distributed indexing | https://lance.org/guide/distributed_indexing/ |
| Tokenizer (FTS, n-gram) | https://lance.org/guide/tokenizer/ |
### Reads & writes
Touching the bulk loader, mutation execution, `merge_insert`, `WriteMode` selection.
| Topic | URL |
|---|---|
| Read-and-write guide | https://lance.org/guide/read_and_write/ |
| Distributed write | https://lance.org/guide/distributed_write/ |
| Rust example: write & read a dataset | https://lance.org/examples/rust/write_read_dataset/ |
### Schema evolution
Touching `apply_schema`, the migration planner, additive evolution.
| Topic | URL |
|---|---|
| Data-evolution guide | https://lance.org/guide/data_evolution/ |
| Migration guide | https://lance.org/guide/migration/ |
### Object store / S3
Touching `storage.rs`, S3-compatible backends (RustFS, MinIO), env vars.
| Topic | URL |
|---|---|
| Object-store guide | https://lance.org/guide/object_store/ |
### Data types
Touching schema-language scalar mappings, blob columns, JSON, list columns.
| Topic | URL |
|---|---|
| Data types overview | https://lance.org/guide/data_types/ |
| Arrays / list types | https://lance.org/guide/arrays/ |
| Blobs (LargeBinary) | https://lance.org/guide/blob/ |
| JSON | https://lance.org/guide/json/ |
### Performance & tuning
Optimizing scans, fragment counts, cache behavior, memory pool sizing.
| Topic | URL |
|---|---|
| Performance guide | https://lance.org/guide/performance/ |
### Compaction & cleanup
Touching `omnigraph optimize` / `cleanup`, the underlying `compact_files` / `cleanup_old_versions`.
| Topic | URL |
|---|---|
| Read-and-write guide (covers `compact_files`, `cleanup_old_versions`) | https://lance.org/guide/read_and_write/ |
| Performance (compaction tradeoffs) | https://lance.org/guide/performance/ |
| Fragment-reuse index | https://lance.org/format/table/index/system/frag_reuse/ |
### DataFusion integration
The runtime substrate that may carry our query execution. See [docs/invariants.md §I.4](invariants.md): we don't rebuild relational machinery.
| Topic | URL |
|---|---|
| DataFusion integration | https://lance.org/integrations/datafusion/ |
### SDK reference
Looking up a specific Rust API (signature, return type, error variant).
| Topic | URL |
|---|---|
| SDK docs landing | https://lance.org/sdk_docs/ |
## What's not in this index (and why)
- **Namespace REST API model surface** (`/format/namespace/client/operations/models/...`) — hundreds of REST schema docs for the Lance Namespace catalog API. Omnigraph does not run a Lance Namespace server, so these are not reachable from our problem space.
- **Spark / Trino / Databricks / Dataproc / Hive / Glue / Polaris / Iceberg / Unity / OneLake / Gravitino integrations** — not part of OmniGraph's deployment surface.
- **Python / TF / PyTorch / Hugging Face / Ray integrations** — OmniGraph is Rust-only; Python notebooks aren't relevant.
- **Community / governance / release / voting / PMC pages** — meta, not technical.
If a future need pulls one of these into scope, add a row to the matching domain section above and link it from `AGENTS.md`'s topic index.
## Maintenance
When Lance ships a major release that changes any of the above (file format bump, new index type, transaction semantics change, new branching primitive), refresh this index in the same change as the omnigraph upgrade. Stale Lance pointers are worse than no pointers.

22
docs/maintenance.md Normal file
View file

@ -0,0 +1,22 @@
# Maintenance: Optimize & Cleanup
`db/omnigraph/optimize.rs`.
## `optimize_all_tables(db)` — non-destructive
- Lance `compact_files()` on every node + edge table on `main`.
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests.
- Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8).
- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }]`.
## `cleanup_all_tables(db, options)` — destructive
- Lance `cleanup_old_versions()` per table.
- Removes manifests (and their unique fragments) older than the retention policy.
- `CleanupPolicyOptions { keep_versions: Option<u32>, older_than: Option<Duration> }` — at least one is required.
- Returns `[TableCleanupStats { table_key, bytes_removed, old_versions_removed }]`.
- CLI guards with `--confirm`; without it, prints a preview line.
## Tombstones
Logical sub-table delete markers in `__manifest`; `tombstone_object_id(table_key, version)` excludes a sub-table version from snapshot reconstruction.

30
docs/merge.md Normal file
View file

@ -0,0 +1,30 @@
# Merging (three-way) and Conflicts
`exec/merge.rs`.
## Strategy
Ordered, row-by-row cursor merge:
- `OrderedTableCursor` scans each table sorted by `id` and supports peek/pop matching.
- `StagedTableWriter` buffers `MERGE_STAGE_BATCH_ROWS = 8192` rows into a temp Lance dataset (`OMNIGRAPH_MERGE_STAGING_DIR`).
- The merge runs per sub-table; results are published as one atomic manifest update.
## Outcome enum
`MergeOutcome { AlreadyUpToDate | FastForward | Merged }`
## Conflict types (`error.rs`)
```
MergeConflictKind:
DivergentInsert // same id inserted on both branches
DivergentUpdate // updated differently on both branches
DeleteVsUpdate // one side deletes, other updates
OrphanEdge // edge references a node deleted by the other side
UniqueViolation
CardinalityViolation
ValueConstraintViolation
```
Returned as `OmniError::MergeConflicts(Vec<MergeConflict { table_key, row_id?, kind, message }>)`. The HTTP server surfaces this as a 409 with structured `merge_conflicts[]` (top 3 + "+N more").

44
docs/policy.md Normal file
View file

@ -0,0 +1,44 @@
# Authorization (Cedar policy)
OmniGraph integrates AWS Cedar (`cedar-policy = 4.9`) for ABAC.
## Policy actions
1. `read` — query / snapshot / list branches & commits
2. `export` — NDJSON export
3. `change` — mutations
4. `schema_apply` — apply schema migrations
5. `branch_create`
6. `branch_delete`
7. `branch_merge`
8. `run_publish`
9. `run_abort`
10. `admin` — reserved
## Scope kinds
- `branch_scope` — applied to source branch (`read`, `export`, `change`)
- `target_branch_scope` — applied to destination (`schema_apply`, branch ops, run ops)
- `protected_branches` — named list with special rules; rule scopes are `any | protected | unprotected`
## Configuration
`omnigraph.yaml`:
```yaml
policy:
file: ./policy.yaml # Cedar rules + groups
tests: ./policy.tests.yaml # declarative test cases
```
Each rule must use exactly one of `branch_scope` or `target_branch_scope`.
## CLI
- `omnigraph policy validate` — parse + count actors, exit 1 on parse error.
- `omnigraph policy test` — run cases in `policy.tests.yaml`, exit 1 on any expectation mismatch.
- `omnigraph policy explain --actor … --action … [--branch …] [--target-branch …]` — show decision and matched rule.
## Server enforcement
Every mutating endpoint calls `authorize_request()` *before* the handler runs; decisions are logged with actor / action / branch / outcome / matched rule.

103
docs/query-language.md Normal file
View file

@ -0,0 +1,103 @@
# Query Language (`.gq`)
Pest grammar at `crates/omnigraph-compiler/src/query/query.pest`. AST in `query/ast.rs`. Type checker in `query/typecheck.rs`. Lowering in `ir/lower.rs`.
## Query declarations
```
query <name>($p1: T1, $p2: T2?, …)
@description("…") @instruction("…") {
}
```
Two body shapes:
- **Read**: `match { … } return { … } [order { … }] [limit N]`
- **Mutation**: one or more of `insert | update | delete` statements
Param types reuse all schema scalars; trailing `?` makes a param optional. The compiler reserves `$__nanograph_now` for `now()`.
## MATCH clauses
- **Binding**: `$x: NodeType { prop: <literal | $param | now()>, … }`
- **Traversal**: `$src EDGE_NAME { min, max? } $dst` — variable-length paths via hop bounds; default 1..1 if bounds omitted.
- **Filter**: `<expr> <op> <expr>` with operators `>=`, `<=`, `!=`, `>`, `<`, `=`, and string `contains`.
- **Negation**: `not { clause+ }` — desugars to anti-join over the inner pipeline.
## Search clauses (multi-modal)
Used inside MATCH or as expressions inside RETURN/ORDER:
| Function | Purpose | Underlying Lance facility |
|---|---|---|
| `nearest($x.vec, $q)` | k-NN vector search (cosine) | Lance vector index (IVF / HNSW) |
| `search(field, q)` | Generic FTS | Inverted index |
| `fuzzy(field, q [, max_edits])` | Levenshtein-tolerant text search | Inverted index |
| `match_text(field, q)` | Pattern match | Inverted index |
| `bm25(field, q)` | BM25 scoring | Inverted index |
| `rrf(rank_a, rank_b [, k])` | Reciprocal Rank Fusion of two rankings (default k=60) | OmniGraph fuses scored rankings |
`nearest()` requires a `LIMIT`; the compiler resolves the query vector via the param map (or via the runtime embedding client when bound to a text input).
## RETURN clause
`return { <expr> [as <alias>], … }` with expressions:
- Variable / property access: `$x`, `$x.prop`
- Literals: string, int, float, bool, list
- `now()`
- Aggregates: `count`, `sum`, `avg`, `min`, `max`
- All search functions above (so you can return a score column)
- `AliasRef` — re-use a previous projection alias
## ORDER & LIMIT
- `order { <expr> [asc|desc], … }` — supports plain expressions and `nearest(...)`.
- `limit <integer>` — required when there is a `nearest(...)` ordering.
## Mutation statements
- `insert <Type> { prop: <value>, … }`
- `update <Type> set { prop: <value>, … } where <prop> <op> <value>`
- `delete <Type> where <prop> <op> <value>`
`<value>` is a literal, `$param`, or `now()`. Multi-statement mutations execute atomically (added in v0.2.0).
## IR (Intermediate Representation)
`QueryIR { name, params, pipeline: Vec<IROp>, return_exprs, order_by, limit }`
Pipeline operations:
- `NodeScan { variable, type_name, filters }`
- `Expand { src_var, dst_var, edge_type, direction (Out|In), dst_type, min_hops, max_hops, dst_filters }` — destination filters are pushed *into* the expand so Lance scalar pushdown can prune.
- `Filter { left, op, right }`
- `AntiJoin { outer_var, inner: Vec<IROp> }` — for `not { … }`
Lowering:
1. Partition MATCH clauses (bindings, traversals, filters, negations).
2. Identify "deferred" bindings (a destination of a traversal that has filters) so the Expand can carry the filter as a pushdown.
3. Emit NodeScan for the first binding, then Expand operations, then remaining Filter operations, then AntiJoins for negations.
4. Translate RETURN / ORDER expressions; preserve LIMIT.
## Linting & validation (`query/lint.rs`)
Codes seen so far:
- **Q000** (Error): parse error
- **L201** (Warning): nullable property never set by any UPDATE — "{type}.{prop} exists in schema but no update query sets it"
- (Warning): mutation declares no params — hardcoded mutations are easy to miss
- Plus all type errors from `typecheck_query_decl()` (undefined types, mismatched operators, undefined edges, etc.)
Output:
```
QueryLintOutput { status, schema_source, query_path,
queries_processed, errors, warnings, infos,
results: [{ name, kind, status, error?, warnings[] }],
findings: [{ severity, code, message, type_name?, property?, query_names[] }] }
```
CLI exits non-zero only on `status = Error`.

19
docs/releases/v0.3.1.md Normal file
View file

@ -0,0 +1,19 @@
# Omnigraph v0.3.1
Omnigraph v0.3.1 is a performance and operability point release.
## Highlights
- **Parallel per-type load writes**: the bulk loader writes to each node/edge table concurrently rather than serially, materially reducing wall-clock time on multi-table loads.
- **`omnigraph optimize` and `omnigraph cleanup` CLI commands**: previously only available via the engine API. `optimize` runs Lance `compact_files()` across every node/edge table; `cleanup` runs Lance `cleanup_old_versions()` with a `--keep`/`--older-than` policy and requires `--confirm` for the destructive form.
- **Dst-id deduplication during edge expand hydration**: avoids redundant lookups when the same destination id appears multiple times in an `Expand` step (#45).
## Included Changes
- Parallel per-type load writes (#46)
- `omnigraph optimize` / `cleanup` CLI commands and runtime APIs (#46)
- Dedupe dst ids before hydrating nodes in `execute_expand` (#45)
## Upgrade Notes
No breaking changes. Existing v0.3.0 repos can be opened directly with v0.3.1.

38
docs/runs.md Normal file
View file

@ -0,0 +1,38 @@
# Runs (transactional graph mutations)
`db/run_registry.rs` + run lifecycle in `db/omnigraph.rs`. Stored in `_graph_runs.lance` and `_graph_run_actors.lance`.
## RunRecord
```
RunRecord {
run_id: RunId (ULID),
target_branch: String, // where the run will publish
run_branch: "__run__<id>", // ephemeral isolation branch
base_snapshot_id: String,
base_manifest_version: u64,
operation_hash: Option<String>, // idempotency key
actor_id: Option<String>,
status: Running | Published | Failed | Aborted,
published_snapshot_id: Option<String>,
created_at, updated_at: i64 (microseconds),
}
```
## Lifecycle
1. `begin_run(target_branch, op_hash)` / `begin_run_as(target_branch, op_hash, actor_id)` — forks `__run__<id>` from the target's current head, appends a `RunRecord`.
2. Mutations on `run_branch` (via the normal write APIs) — isolated from concurrent activity on the target.
3. `publish_run(id)` / `publish_run_as(id, actor)`:
- **Fast path**: if the target hasn't moved since `base_snapshot_id`, promote the run snapshot directly.
- **Merge path**: if it has moved, perform a three-way merge (see [merge.md](merge.md)) into the target.
- On success: `status = Published`, `published_snapshot_id` set, run branch cleaned up asynchronously.
4. `abort_run(id)` / `fail_run(id)` — terminal; cleans up run branch best-effort.
## Idempotency
`operation_hash` is an optional field clients can use to detect a duplicate `begin_run` retry.
## Cleanup
`cleanup_terminal_run_branches_for_target(branch)` is called as branches change; failures are swallowed (lazy cleanup on next branch op).

79
docs/schema-language.md Normal file
View file

@ -0,0 +1,79 @@
# Schema Language (`.pg`)
Pest grammar at `crates/omnigraph-compiler/src/schema/schema.pest`. AST at `schema/ast.rs`. Catalog at `catalog/mod.rs`.
## Top-level declarations
- `interface <Name> { property* }` — reusable property contracts.
- `node <Name> [implements <Iface>, ...] { property* | constraint* }`
- `edge <Name>: <FromType> -> <ToType> [@card(min..max)] { property* | constraint* }`
- Comments: line `//` and block `/* … */`.
## Property declarations
`<ident>: <TypeRef> [annotation*]`
## Built-in scalar types
| Scalar | Arrow type |
|---|---|
| `String` | Utf8 |
| `Blob` | LargeBinary |
| `Bool` | Boolean |
| `I32` / `I64` | Int32 / Int64 |
| `U32` / `U64` | UInt32 / UInt64 |
| `F32` / `F64` | Float32 / Float64 |
| `Date` | Date32 |
| `DateTime` | Date64 |
| `Vector(<dim>)` | FixedSizeList(Float32, dim), `1 ≤ dim ≤ i32::MAX` |
| `[<scalar>]` | List(scalar) |
| `enum(v1, v2, …)` | Utf8 with sorted/dedup'd set of allowed string values |
| `<scalar>?` | Same as scalar but `nullable: true` |
## Constraints (body level)
| Constraint | On | Effect |
|---|---|---|
| `@key(p, …)` | node | Primary key; implies index on key columns; `key_property()` returns the first key |
| `@unique(p, …)` | node, edge | Uniqueness across listed columns |
| `@index(p, …)` | node, edge | Build a scalar (BTREE) index on the columns |
| `@range(p, min..max)` | node | Numeric range validation (open ranges allowed) |
| `@check(p, "regex")` | node | Regex pattern validation |
| `@card(min..max?)` | edge | Edge multiplicity — default `0..*`; `0..1`, `1..1`, `1..*`, etc. |
Edge bodies only allow `@unique` and `@index`.
## Annotations
- `@<ident>` or `@<ident>(<literal>)` on any declaration or property.
- Known annotations:
- `@embed` on a Vector property — names the *source* property whose text gets embedded into this vector at ingest (`embed_sources` map in NodeType).
- `@description("…")`, `@instruction("…")` on query declarations (carried through to clients).
- Custom annotations are accepted by the parser and surfaced in catalog metadata; unrecognized annotations don't fail compilation.
## Catalog construction
- Pass 0: collect interfaces.
- Pass 1: collect nodes, expand `implements`, build constraint and `@embed` mappings, build the Arrow schema for each node table (`id: Utf8` plus all properties; blob columns get `LargeBinary`).
- Pass 2: collect edges, validate that `from_type` / `to_type` exist, normalize edge names case-insensitively for lookup, validate constraints for edges. Edge Arrow schema: `id: Utf8, src: Utf8, dst: Utf8` plus edge properties.
## Schema IR & stable type IDs
- `SCHEMA_IR_VERSION = 1` (`catalog/schema_ir.rs`).
- Each interface/node/edge gets a `stable_type_id` (kind+name hashed) so renames can be tracked.
- Serialized as JSON for diff/migration plans.
## Schema migration planning
`plan_schema_migration(accepted, desired) -> SchemaMigrationPlan { supported, steps[] }` with step types:
- `AddType { type_kind, name }`
- `RenameType { type_kind, from, to }`
- `AddProperty { type_kind, type_name, property_name, property_type }`
- `RenameProperty { type_kind, type_name, from, to }`
- `AddConstraint { type_kind, type_name, constraint }`
- `UpdateTypeMetadata { … annotations }`
- `UpdatePropertyMetadata { … annotations }`
- `UnsupportedChange { entity, reason }` (forces `supported=false`)
`apply_schema()` returns `SchemaApplyResult { supported, applied, manifest_version, steps }` and is gated by an internal `__schema_apply_lock__` system branch so concurrent schema applies serialize.

68
docs/server.md Normal file
View file

@ -0,0 +1,68 @@
# HTTP Server (`omnigraph-server`)
Axum 0.8 + tokio + utoipa-generated OpenAPI. Single repo per process; deploy multiple processes for multi-tenant.
## Endpoint inventory
| Method | Path | Auth | Action | Handler |
|---|---|---|---|---|
| GET | `/healthz` | none | — | `server_health` |
| GET | `/openapi.json` | none | — | `server_openapi` (strips security if auth disabled) |
| GET | `/snapshot?branch=` | bearer + `read` | snapshot of branch | `server_snapshot` |
| POST | `/read` | bearer + `read` | run named query | `server_read` |
| POST | `/export` | bearer + `export` | NDJSON stream | `server_export` |
| POST | `/change` | bearer + `change` | mutation | `server_change` |
| GET | `/schema` | bearer + `read` | get current `.pg` source | `server_schema_get` |
| POST | `/schema/apply` | bearer + `schema_apply` (target=`main`) | migrate | `server_schema_apply` |
| POST | `/ingest` | bearer + `branch_create` (if new) + `change` | bulk load | `server_ingest` (32 MB body limit) |
| GET | `/branches` | bearer + `read` | list branches | `server_branch_list` |
| POST | `/branches` | bearer + `branch_create` | create | `server_branch_create` |
| DELETE | `/branches/{branch}` | bearer + `branch_delete` | delete | `server_branch_delete` |
| POST | `/branches/merge` | bearer + `branch_merge` | merge `source → target` | `server_branch_merge` |
| GET | `/runs` | bearer + `read` | list | `server_run_list` |
| GET | `/runs/{run_id}` | bearer + `read` | show | `server_run_show` |
| POST | `/runs/{run_id}/publish` | bearer + `run_publish` | publish | `server_run_publish` |
| POST | `/runs/{run_id}/abort` | bearer + `run_abort` | abort | `server_run_abort` |
| GET | `/commits?branch=` | bearer + `read` | list | `server_commit_list` |
| GET | `/commits/{commit_id}` | bearer + `read` | show | `server_commit_show` |
## Streaming
Only `/export` streams (`application/x-ndjson`, MPSC channel + `Body::from_stream`). Everything else is buffered JSON.
## Error model
Uniform `ErrorOutput { error, code?, merge_conflicts[] }` with `code ∈ unauthorized | forbidden | bad_request | not_found | conflict | internal`. Merge conflicts attach structured `MergeConflictOutput { table_key, row_id?, kind, message }`.
HTTP status codes used: 200, 400, 401, 403, 404, 409, 500.
## Body limits
- Default: 1 MB
- `/ingest`: 32 MB
## Auth model (`bearer + SHA-256`)
- Tokens are SHA-256 hashed on startup; plaintext is never persisted in memory.
- Constant-time comparison via `subtle::ConstantTimeEq`.
- Three sources, in precedence:
1. `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` — AWS Secrets Manager (build with `--features aws`)
2. `OMNIGRAPH_SERVER_BEARER_TOKENS_FILE` or `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON` — JSON `{actor_id: token, …}`
3. `OMNIGRAPH_SERVER_BEARER_TOKEN` — single legacy token, actor `default`
- If no tokens configured, server runs unauthenticated (local dev) and `/openapi.json` strips the security scheme.
See [deployment.md](deployment.md) for token-source operational details.
## Tracing & observability
- `tower_http::TraceLayer::new_for_http()`
- Policy decisions logged at INFO level with actor, action, branch, decision, matched rule
- Startup logs: token source name, repo URI, bind address
- Graceful SIGINT shutdown
## Not implemented (by design or "TBD")
- CORS — not configured; add `tower_http::cors` if needed.
- Rate limiting — none.
- Pagination — none (commits/branches/runs return everything; export streams).
- Multi-tenant routing — one repo per process.

46
docs/storage.md Normal file
View file

@ -0,0 +1,46 @@
# Storage
## L1 — Lance dataset (per node/edge type)
Every node type and every edge type is its own Lance dataset:
- **Columnar Arrow storage**: each property is a column; nullable per Arrow schema.
- **Fragments**: data is partitioned into fragments; new writes create new fragments.
- **Manifest versioning**: every commit produces a new dataset version; old versions remain readable.
- **Stable row IDs**: enabled by OmniGraph for the commit-graph and run-registry datasets so durable references survive compaction.
- **Append / delete / `merge_insert`**: native Lance write modes.
- **Per-dataset branches** (Lance native): copy-on-write at the dataset level.
- **Object-store agnostic**: file://, s3://, gs://, az://, http (read-only via Lance) — OmniGraph wires file:// and s3:// (`storage.rs`).
## L2 — Multi-dataset coordination via `__manifest`
OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordinated through one append-only manifest table.
- **Manifest table**: `__manifest/` Lance dataset.
- **Layout** (`db/manifest/layout.rs`, `db/manifest/state.rs`):
- `nodes/{fnv1a64-hex(type_name)}` — one Lance dataset per node type
- `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type
- `__manifest/` — the catalog of all sub-tables and their published versions
- `_graph_commits.lance` / `_graph_commit_actors.lance` — the commit graph and its actor map
- `_graph_runs.lance` / `_graph_run_actors.lance` — the run registry and its actor map
- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`):
- `object_type``table | table_version | table_tombstone`
- `table_key``node:<TypeName> | edge:<EdgeName>`
- `table_branch` is `null` for the main lineage and the branch name otherwise
- **Snapshot reconstruction**: latest visible `table_version` per `(table_key, table_branch)` minus tombstones — rows where `object_type = table_tombstone`, whose own `table_version` (acting as the tombstone version) is `>= the entry's table_version`.
- **Atomic publish**: multi-dataset commits publish via a `ManifestBatchPublisher` so a single write to `__manifest` flips all the new sub-table versions visible at once.
## URI scheme support (`storage.rs`)
| Scheme | Backend | Notes |
|---|---|---|
| local path / `file://` | `LocalStorageAdapter` (tokio) | Normalized to absolute paths |
| `s3://bucket/prefix` | `S3StorageAdapter` (object_store) | Honors `AWS_ENDPOINT_URL_S3`, `AWS_ALLOW_HTTP`, `AWS_S3_FORCE_PATH_STYLE` |
| `http(s)://host:port` | HTTP client to `omnigraph-server` | Used by CLI as a target, not a storage backend |
## Object-store env vars (S3-compatible)
- `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`
- `AWS_ENDPOINT_URL`, `AWS_ENDPOINT_URL_S3` — for MinIO / RustFS / GCS-via-XML
- `AWS_S3_FORCE_PATH_STYLE=true` — path-style URLs
- `AWS_ALLOW_HTTP=true` — allow plain HTTP (local dev)

109
docs/testing.md Normal file
View file

@ -0,0 +1,109 @@
# Testing
This file is the always-on map of the test surface. **Consult it before every task** so you know what tests already cover the area you're about to change, what helpers to reuse, and where a new test belongs. The architectural invariant *"tests at every boundary, not just end-to-end"* lives in [docs/invariants.md §VII.33](invariants.md).
## Where tests live, per crate
| Crate | Path | Style |
|---|---|---|
| `omnigraph` (engine) | `crates/omnigraph/tests/` | Integration tests (15 files), fixture-driven, share `tests/helpers/mod.rs` |
| `omnigraph-cli` | `crates/omnigraph-cli/tests/` | `cli.rs` (unit-ish), `system_local.rs`, `system_remote.rs`, share `tests/support/mod.rs` |
| `omnigraph-server` | `crates/omnigraph-server/tests/` | `server.rs` (HTTP-level), `openapi.rs` (OpenAPI drift / regeneration) |
| `omnigraph-compiler` | mostly in-source `#[cfg(test)] mod tests` | Parser, type-checker, IR lowering, lint |
The engine's `tests/` is the principal coverage surface; most graph-shaped behavior is exercised there.
## Engine integration tests (`crates/omnigraph/tests/`)
| File | Covers |
|---|---|
| `end_to_end.rs` | Full init → load → query/mutate flow |
| `branching.rs` | Branch create / list / delete, lazy fork |
| `runs.rs` | Transactional runs (begin/publish/abort), idempotency |
| `lifecycle.rs` | Repo lifecycle, schema state |
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
| `changes.rs` | `diff_between` / `diff_commits` |
| `consistency.rs` | Cross-table snapshot isolation, atomic publish |
| `schema_apply.rs` | Migration plan + apply, schema-apply lock |
| `search.rs` | FTS / vector / hybrid (`bm25`, `nearest`, `rrf`) |
| `traversal.rs` | `Expand`, variable-length hops, anti-join |
| `aggregation.rs` | `count`, `sum`, `avg`, `min`, `max` |
| `export.rs` | NDJSON streaming export filters |
| `s3_storage.rs` | S3-backed repo (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) |
| `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature) |
## Fixtures
`crates/omnigraph/tests/fixtures/` holds the canonical schema (`.pg`), seed data (`.jsonl`), and queries (`.gq`) shared across tests. Reuse these before inventing new ones — the helpers harness already knows how to load them.
## Test helpers
- **Engine**`crates/omnigraph/tests/helpers/mod.rs`: `init_and_load()` (bootstrap a temp repo + load standard fixture), `snapshot_main()`, `snapshot_branch()`, query/mutation runners, row collection and counting. Use these instead of hand-rolling.
- **CLI**`crates/omnigraph-cli/tests/support/mod.rs`: `Command`-style wrapper for invoking `omnigraph`, server-process spawning, fixture resolution, output assertion helpers.
- **Server** — no shared helpers; server tests call the `Omnigraph` engine API directly and exercise endpoints over the wire.
> Note: there is **no `MemStorage` or in-memory backend** today. Tests use `tempfile::tempdir()` for local FS. If you find yourself needing one for layer isolation, that's an architectural ask — see [docs/invariants.md §VII.34](invariants.md) (reference impl + test impl per trait).
## Failpoints (fault injection)
- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` (in `crates/omnigraph/Cargo.toml`).
- Wrapper: `crates/omnigraph/src/failpoints.rs` exposes `maybe_fail("name")` and `ScopedFailPoint` for tests.
- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, etc.).
- Activated tests: `crates/omnigraph/tests/failpoints.rs`. Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints`.
## RustFS / S3 integration
CI runs three S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml``rustfs_integration` job):
- `cargo test -p omnigraph-engine --test s3_storage`
- `cargo test -p omnigraph-server --test server server_opens_s3_repo_directly_and_serves_snapshot_and_read`
- `cargo test -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow`
Locally, set `OMNIGRAPH_S3_TEST_BUCKET` (and the usual `AWS_*` vars including `AWS_ENDPOINT_URL_S3` for non-AWS) before running. Without those, S3 tests skip gracefully.
## OpenAPI drift
`crates/omnigraph-server/tests/openapi.rs` regenerates `openapi.json` and diffs against the checked-in copy. CI auto-commits the regeneration on same-repo PRs and otherwise runs in strict-check mode (env: `OMNIGRAPH_UPDATE_OPENAPI`).
## Examples & benches
- `crates/omnigraph/examples/bench_expand.rs` — runnable example (not part of CI).
- No `benches/` directories. The architectural rule [docs/invariants.md §VII.36](invariants.md) requires benchmark motivation before optimization, so add `benches/` per crate when you ship a perf-driven change.
## Coverage tooling — what's missing
There is **no** coverage tooling in the repo today: no `tarpaulin.toml`, no `codecov.yml`, no coverage CI step. If you want to know whether your change is covered, the answer comes from reading and running the relevant integration tests, not from a tool.
If introducing coverage tooling is in scope for your task, the natural first step is `cargo-llvm-cov` wired into a separate CI job, and a per-crate threshold rather than a global one.
## First principle: check what already covers it
**Before writing any new test, check whether an existing test already covers the case.** The cost of duplicating coverage is high: more code to read, more places to keep in sync when behavior changes, and more drift when one copy lags. The cost of *extending* an existing test is usually one extra assertion or one extra fixture row.
How to check:
1. **Map the change to an area** — use the engine integration-test table above (`branching.rs`, `runs.rs`, `search.rs`, etc.). The filename usually names the area.
2. **Open the file and skim every test fn name.** Test fn names are the index — read them all, not just the first few.
3. **Grep for the symbol or path you're changing.** `rg <FunctionName>` or `rg <enum_variant>` across all `tests/` directories surfaces existing coverage you might miss.
4. **Decide one of three outcomes**, in this order of preference:
- *Existing test already asserts the new behavior* → no new test needed; this PR is a refactor or no-op behaviorally. Confirm by running the existing test against the change.
- *Existing test covers the area but not your case***add an assertion or a fixture row to the existing test**, don't write a new function with `init_and_load()` again.
- *No existing coverage in any test file* → only then write a new test; put it in the file that owns the area, or open a new file only if the area itself is new.
Three duplicated `init_and_load() → run_query → assert_eq` blocks where one parameterized test would do is the most common form of test rot in this repo. Don't add to it.
## Before-every-task checklist
When you pick up any change, walk through this:
1. **Find existing coverage** (per the principle above). Don't just look at the first test file by name — grep for the symbol you're touching across every crate's `tests/`.
2. **Run those tests locally before editing.** `cargo test --workspace --locked` for the broad pass; `-p <crate> --test <file>` for a focused loop. Confirm a clean baseline.
3. **Decide extend-vs-new** explicitly. If you can extend an existing test (assertion, fixture row, parameterization), do that. Only add a new test fn or new file if no existing one owns the area.
4. **Reuse the helpers.** `init_and_load()`, fixture files, the CLI `support` harness — re-use them. Don't bootstrap a fresh repo by hand if a helper exists.
5. **Mind the boundary.** Per [docs/invariants.md §VII.33](invariants.md), test at the layer the change lives at — planner-level changes deserve planner-level tests, not just end-to-end.
6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.
8. **Verify your change makes an existing test fail before it makes the new one pass.** If you can break the code without breaking a test, your coverage gap is the problem to fix first.
When in doubt, re-read [docs/invariants.md §VII](invariants.md) — quality gates apply to every change.

77
scripts/check-agents-md.sh Executable file
View file

@ -0,0 +1,77 @@
#!/usr/bin/env bash
# Verify that AGENTS.md and docs/ stay in sync.
#
# Two checks:
# 1. Every docs/*.md path linked from AGENTS.md exists on disk.
# 2. Every doc in the canonical set is linked from AGENTS.md.
#
# Exit non-zero on any drift.
set -euo pipefail
repo_root="$(cd "$(dirname "$0")/.." && pwd)"
cd "$repo_root"
agents_file="AGENTS.md"
if [[ ! -f "$agents_file" ]]; then
echo "error: $agents_file not found" >&2
exit 1
fi
# Canonical set: every docs/*.md (top-level), plus the releases/ index dir if present.
canonical=()
while IFS= read -r line; do
canonical+=("$line")
done < <(find docs -mindepth 1 -maxdepth 1 -type f -name '*.md' | sort)
if [[ -d docs/releases ]]; then
canonical+=("docs/releases/")
fi
# Extract docs/ links from AGENTS.md (markdown link form: (docs/...))
linked=()
while IFS= read -r line; do
linked+=("$line")
done < <(grep -oE '\(docs/[^)]+\)' "$agents_file" | sed -E 's/^\(|\)$//g' | sort -u)
fail=0
# Check 1: every linked path exists.
for link in "${linked[@]}"; do
# Strip in-page anchors like #foo
path="${link%%#*}"
if [[ "$path" == */ ]]; then
if [[ ! -d "$path" ]]; then
echo "error: AGENTS.md links to missing directory: $path" >&2
fail=1
fi
else
if [[ ! -f "$path" ]]; then
echo "error: AGENTS.md links to missing file: $path" >&2
fail=1
fi
fi
done
# Check 2: every canonical doc is linked at least once.
for doc in "${canonical[@]}"; do
found=0
for link in "${linked[@]}"; do
path="${link%%#*}"
if [[ "$path" == "$doc" ]]; then
found=1
break
fi
done
if [[ "$found" -eq 0 ]]; then
echo "error: doc not linked from AGENTS.md: $doc" >&2
fail=1
fi
done
if [[ "$fail" -ne 0 ]]; then
echo >&2
echo "AGENTS.md / docs/ are out of sync. Either update AGENTS.md links or rename/remove the doc." >&2
exit 1
fi
echo "AGENTS.md ↔ docs/ links OK (${#linked[@]} links, ${#canonical[@]} docs)."