Refactor AGENTS.md from encyclopedia to map; move spec into docs/

Splits the 990-line AGENTS.md into a 184-line map (architecture,
where-to-find index, always-on invariants, capability matrix,
maintenance contract) plus 18 new docs/*.md files holding the deep
content per topic (storage, schema and query languages, indexes,
embeddings, branches/commits, runs, merge, changes, execution, policy,
server, CLI reference, audit, errors, CI, constants, v0.3.1 notes).

Adds scripts/check-agents-md.sh and a check_agents_md CI job that
verifies every docs/ link in AGENTS.md resolves and every doc in the
canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Ragnor Comerford 2026-04-28 23:31:08 +02:00
parent cfea41e942
commit a335d98854
No known key found for this signature in database
23 changed files with 1069 additions and 924 deletions

View file

@ -99,6 +99,18 @@ jobs:
echo "run_full_ci=$run_full_ci" >> "$GITHUB_OUTPUT"
echo "run_rustfs_ci=$run_rustfs_ci" >> "$GITHUB_OUTPUT"
check_agents_md:
name: Check AGENTS.md Links
runs-on: ubuntu-latest
permissions:
contents: read
steps:
- name: Checkout source
uses: actions/checkout@v5.0.1
- name: Verify AGENTS.md ↔ docs/ cross-links
run: bash scripts/check-agents-md.sh
test:
name: Test Workspace
needs: classify_changes

1038
AGENTS.md

File diff suppressed because it is too large Load diff

69
docs/architecture.md Normal file
View file

@ -0,0 +1,69 @@
# Architecture
OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a single `omnigraph.yaml`.
## Stack
```
┌──────────────────────────────────────────────────────────────────┐
│ CLI (omnigraph) HTTP Server (omnigraph-server, Axum) │
│ - 13 cmd families - REST + OpenAPI │
│ - Aliases, configs - Bearer auth + Cedar policy │
└──────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────────┐
│ omnigraph-compiler │
│ - Pest grammars: schema.pest, query.pest │
│ - Catalog (Node/Edge/Interface types) │
│ - IR + lowering (NodeScan / Expand / Filter / AntiJoin) │
│ - Schema migration planner │
│ - Embedding client (OpenAI-style for query-time normalization) │
└──────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────────┐
│ omnigraph (engine) │
│ - GraphCoordinator + ManifestRepo (__manifest) │
│ - CommitGraph (_graph_commits.lance) │
│ - RunRegistry (_graph_runs.lance, __run__ branches) │
│ - GraphIndex (CSR/CSC) + RuntimeCache (LRU 8) │
│ - exec::query / mutation / merge │
│ - Embedding client (Gemini for runtime ingest) │
└──────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────────┐
│ Lance 4.x (per-table dataset) │
│ - Columnar (Arrow) storage, fragments │
│ - Manifest versions per dataset │
│ - Per-dataset branches (copy-on-write) │
│ - Indexes: BTREE, Inverted (FTS/BM25), IVF/HNSW vector │
│ - merge_insert (upsert), append, delete │
│ - compact_files, cleanup_old_versions │
└──────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────────┐
│ Object store: local FS, S3, RustFS, MinIO, S3-compatible │
└──────────────────────────────────────────────────────────────────┘
```
## L1 / L2 framing
Throughout the docs, capabilities are split into:
- **L1 — Inherited from Lance**: what OmniGraph gets "for free" by sitting on top of the Lance dataset format (columnar Arrow storage, per-dataset versions and branches, index types, `merge_insert`, `compact_files` / `cleanup_old_versions`).
- **L2 — Added by OmniGraph**: typing (schema language), graph semantics, multi-dataset coordination via `__manifest`, graph-level branches and commits, the `.gq` query language and IR, the topology index, the HTTP server, Cedar policy, the CLI.
## Concurrency model
- **MVCC**: every Lance write bumps a per-dataset version; the OmniGraph manifest version coordinates which sub-table versions are visible together.
- **Snapshot isolation**: a query holds one `Snapshot` for its lifetime; concurrent writes don't leak in.
- **Cross-branch isolation**: copy-on-write means readers and writers on different branches don't block each other.
- **Run isolation**: each transactional run lives on its own `__run__<id>` branch.
- **Schema-apply lock**: `__schema_apply_lock__` system branch serializes schema migrations.
- **Fail-points** (`failpoints` cargo feature): `failpoints::maybe_fail("operation.step")?` in `branch_create`, publish, etc., for deterministic failure injection in tests.
## Workspace crates
- `omnigraph-compiler` — schema and query grammars, catalog, IR, lowering, type checker, lint, migration planner, OpenAI-style embedding client.
- `omnigraph` (engine, published as `omnigraph-engine` on crates.io since v0.2.2) — the Lance-backed runtime: manifest, commit graph, run registry, snapshot, exec, merge, loader, Gemini embedding client.
- `omnigraph-cli` — the `omnigraph` binary.
- `omnigraph-server` — the `omnigraph-server` binary (Axum HTTP server).

6
docs/audit.md Normal file
View file

@ -0,0 +1,6 @@
# Audit / Actor tracking
- `Omnigraph::audit_actor_id: Option<String>` is the actor in effect.
- `_as` variants of every write API let callers override the actor: `begin_run_as`, `publish_run_as`, `ingest_as`, `mutate_as`, `branch_merge_as`, etc.
- Actor IDs are persisted both on `RunRecord.actor_id` and on `GraphCommit.actor_id`, with optional split storage in `_graph_commit_actors.lance` and `_graph_run_actors.lance`.
- HTTP server uses the bearer-token actor automatically; CLI uses the local user / explicit env (no implicit actor).

51
docs/branches-commits.md Normal file
View file

@ -0,0 +1,51 @@
# Branches, Commits, Snapshots
## L1 — Lance per-dataset branches
Lance supports branching at the dataset level: a branch is a named lineage of versions, and `fork_branch_from_state(source_branch, target_branch, source_version)` creates a copy-on-write fork.
## L2 — Graph-level branches
OmniGraph builds *graph branches* on top by branching every sub-table coherently:
- `branch_create(name)` / `branch_create_from(target, name)` — disallowed name `main`; fails if branch exists; ensures the schema-apply lock is idle.
- `branch_list()` — returns public branches, **filters internal** `__run__…` and `__schema_apply_lock__` prefixes.
- `branch_delete(name)` — refuses if there are descendants or active runs on the branch; cleans up owned per-branch fragments.
- **Lazy forking**: a branch only forks a sub-table when that sub-table is first mutated on it. Pure-read branches share fragments with their source.
- `sync_branch(branch)` — re-binds the in-memory handle to the latest head of the branch.
## L2 — Commit graph (`db/commit_graph.rs`)
Stored as a Lance dataset `_graph_commits.lance` (with stable row IDs):
```
GraphCommit {
graph_commit_id: ULID,
manifest_branch: Option<String>,
manifest_version: u64,
parent_commit_id: Option<String>,
merged_parent_commit_id: Option<String>, // populated for merge commits
actor_id: Option<String>,
created_at: i64 (microseconds since epoch),
}
```
- Every successful publish (load / change / merge / schema_apply / publish_run) appends one commit.
- Merge commits have two parents; linear commits have one.
- `_graph_commit_actors.lance` — optional separate actor map (created on demand).
- API: `list_commits(branch)`, `get_commit(id)`, `head_commit_id_for_branch(branch)`.
## L2 — Snapshots & time travel
- `snapshot()` — current snapshot for the bound branch; cached.
- `snapshot_of(target)` — snapshot at a `ReadTarget` (branch | snapshot id).
- `snapshot_at_version(v: u64)` — historical snapshot from any manifest version.
- `entity_at(table_key, id, version)` — single-entity time travel without building a full snapshot.
- A `Snapshot` is a `(version, HashMap<table_key, SubTableEntry>)` — cheap to build, snapshot-isolated cross-table reads.
## L2 — Internal system branches
Filtered from `branch_list()` but visible to internals:
- `__run__<run-id>` — ephemeral isolation branch for a transactional run.
- `__schema_apply_lock__` — serializes schema migrations.

24
docs/changes.md Normal file
View file

@ -0,0 +1,24 @@
# Change Detection / Diff
`changes/mod.rs`. Three-level algorithm:
1. **Manifest diff**: skip sub-tables whose `(table_version, table_branch)` is unchanged.
2. **Lineage check**:
- Same branch lineage → fast path: use the per-row `_row_last_updated_at_version` column to classify Insert/Update/Delete.
- Different lineages → ID-based streaming comparison.
3. **Row-level diff**: streaming, no full materialization.
## Public API
- `diff_between(from: ReadTarget, to: ReadTarget, filter: Option<ChangeFilter>) -> ChangeSet`
- `diff_commits(from_commit_id, to_commit_id, filter)` — cross-branch safe.
## Types
```
ChangeOp: Insert | Update | Delete
EntityKind: Node | Edge
EntityChange { table_key, kind, type_name, id, op, manifest_version, endpoints?: {src, dst} }
ChangeFilter { kinds?, type_names?, ops? }
ChangeSet { from_version, to_version, branch?, changes[], stats }
```

10
docs/ci.md Normal file
View file

@ -0,0 +1,10 @@
# CI / Release Workflows
`.github/workflows/`:
- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repo PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_repo_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.
- **release-edge.yml**: on every push to main, retags `edge`, builds Linux/macOS-Intel/macOS-arm64 archives + sha256, publishes a rolling prerelease.
- **release.yml**: on `v*` tags, builds the 3-platform matrix and updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by force-pushing the regenerated formula to `ModernRelay/homebrew-tap`.
- **package.yml**: manual ECR image build; emits two image tags per commit (`<sha>`, `<sha>-aws`) via CodeBuild.

83
docs/cli-reference.md Normal file
View file

@ -0,0 +1,83 @@
# CLI Reference (`omnigraph`)
A reference for the `omnigraph` binary's command surface and `omnigraph.yaml` schema. For a quick-start guide, see [cli.md](cli.md).
13 top-level command families, 40+ subcommands. All commands accept either a positional `URI`, `--uri`, or a `--target <name>` resolved against `omnigraph.yaml`.
## Top-level commands
| Command | Purpose |
|---|---|
| `init` | `--schema <pg>` → initialize a repo (also scaffolds `omnigraph.yaml` if missing) |
| `load` | bulk load a branch (`--mode overwrite\|append\|merge`) |
| `ingest` | branch-creating transactional load (`--from <base>`) |
| `read` | run named query (params via `--params`, `--params-file`, or alias args) |
| `change` | run mutation query |
| `snapshot` | print current snapshot (per-table version + row count) |
| `export` | dump to JSONL on stdout (`--type T`, `--table K` filters) |
| `branch create \| list \| delete \| merge` | branching ops |
| `commit list \| show` | inspect commit graph |
| `run list \| show \| publish \| abort` | transactional run ops |
| `schema plan \| apply \| show (alias: get)` | migrations |
| `query lint \| check` | offline / repo-backed validation |
| `optimize` | non-destructive Lance compaction |
| `cleanup --keep N --older-than 7d --confirm` | destructive version GC |
| `embed` | offline JSONL embedding pipeline |
| `policy validate \| test \| explain` | Cedar tooling |
| `version` / `-v` | print `omnigraph 0.3.x` |
## `omnigraph.yaml` schema
```yaml
project: { name }
graphs:
<name>:
uri: <local|s3://|http(s)://>
bearer_token_env: <ENV_NAME>
server:
graph: <name>
bind: <ip:port>
cli:
graph: <name>
branch: <name>
output_format: json|jsonl|csv|kv|table
table_max_column_width: 80
table_cell_layout: truncate|wrap
query:
roots: [<dir>, …] # search path for .gq files
auth:
env_file: ./.env.omni
aliases:
<alias>:
command: read|change
query: <path-to-.gq>
name: <query-name>
args: [<positional-name>, …]
graph: <name>
branch: <name>
format: <output-format>
policy:
file: ./policy.yaml
```
## Output formats (read command)
- `json` — pretty-printed object with metadata + rows
- `jsonl` — one metadata line then one JSON object per row
- `csv` — RFC 4180-ish quoting
- `table` — fitted text table, honors `table_max_column_width` + `table_cell_layout`
- `kv` — grouped per-row key/value blocks
## Param resolution
Precedence (high to low): explicit `--params` / `--params-file`, alias positional args, `omnigraph.yaml` defaults. JS-safe-integer handling is built in (`is_js_safe_integer_i64`, `JS_MAX_SAFE_INTEGER_U64`) so 64-bit ids round-trip safely through JSON clients.
## Bearer token resolution (CLI)
1. `graphs.<name>.bearer_token_env`
2. `OMNIGRAPH_BEARER_TOKEN` global env
3. `auth.env_file` referenced `.env`
## Duration parsing (cleanup)
`s | m | h | d | w` units, e.g. `--older-than 7d`.

20
docs/constants.md Normal file
View file

@ -0,0 +1,20 @@
# Constants & Tunables (cheat sheet)
| Name | Value | Where |
|---|---|---|
| `MANIFEST_DIR` | `__manifest` | `db/manifest/layout.rs` |
| Commit graph dir | `_graph_commits.lance` | `db/commit_graph.rs` |
| Run registry dir | `_graph_runs.lance` | `db/run_registry.rs` |
| Run branch prefix | `__run__` | `db/run_registry.rs` |
| Schema apply lock | `__schema_apply_lock__` | `db/mod.rs` |
| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | `exec/merge.rs` |
| Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | `db/omnigraph/optimize.rs` |
| Graph index cache size | `8` (LRU) | `runtime_cache.rs` |
| Default body limit | `1 MB` | `omnigraph-server/lib.rs` |
| Ingest body limit | `32 MB` | `omnigraph-server/lib.rs` |
| Engine embed model | `gemini-embedding-2-preview` | `omnigraph/embedding.rs` |
| Compiler embed model | `text-embedding-3-small` | `omnigraph-compiler/embedding.rs` |
| Embed timeout | `30 000 ms` | both clients |
| Embed retries | `4` | both clients |
| Embed retry backoff | `200 ms` | both clients |
| LANCE memory pool default | `1 GB` (raised in v0.3.0) | runtime |

31
docs/embeddings.md Normal file
View file

@ -0,0 +1,31 @@
# Embeddings
OmniGraph has **two** embedding clients with different defaults and purposes.
## Compiler-side client (`omnigraph-compiler/src/embedding.rs`) — query-time normalization
- Default model: `text-embedding-3-small` (OpenAI-style schema)
- Env: `NANOGRAPH_EMBED_MODEL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL` (default `https://api.openai.com/v1`), `NANOGRAPH_EMBEDDINGS_MOCK`, `NANOGRAPH_EMBED_TIMEOUT_MS=30000`, `NANOGRAPH_EMBED_RETRY_ATTEMPTS=4`, `NANOGRAPH_EMBED_RETRY_BACKOFF_MS=200`
- Methods: `embed_text(input, expected_dim)`, `embed_texts(inputs, expected_dim)`
- Mock mode: deterministic FNV-1a + xorshift64 → L2-normalized vectors
## Engine-side client (`omnigraph/src/embedding.rs`) — runtime ingest
- Model: `gemini-embedding-2-preview`
- Env: `GEMINI_API_KEY`, `OMNIGRAPH_GEMINI_BASE_URL` (default Google generativelanguage v1beta), `OMNIGRAPH_EMBED_TIMEOUT_MS=30000`, `OMNIGRAPH_EMBED_RETRY_ATTEMPTS=4`, `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS=200`, `OMNIGRAPH_EMBEDDINGS_MOCK`
- Two task types: `embed_query_text` (RETRIEVAL_QUERY) and `embed_document_text` (RETRIEVAL_DOCUMENT)
- Exponential backoff with retryable detection (timeouts, 429, 5xx)
## Schema integration
Mark a Vector property with `@embed("source_text_property")`. At ingest, the engine pulls the source text and writes the embedding into the vector column. Stored as L2-normalized FixedSizeList(Float32, dim).
## CLI `omnigraph embed` (offline file pipeline)
Operates on **JSONL files** (not on a repo). Three modes (mutually exclusive):
- (default) `fill_missing` — only embed rows whose target field is empty
- `--reembed-all` — overwrite all
- `--clean` — strip embeddings
Inputs are either a single seed manifest YAML or `--input/--output/--spec`. Selectors `--type T`, `--select T:field=value` filter rows. Streams JSONL → JSONL.

21
docs/errors.md Normal file
View file

@ -0,0 +1,21 @@
# Errors and Result Serialization
## Error taxonomy (`omnigraph::error::OmniError`)
- `Compiler(...)` — schema/query parse/typecheck errors
- `Lance(String)` — storage layer
- `DataFusion(String)` — execution layer
- `Io(io::Error)`
- `Manifest(ManifestError { kind: BadRequest|NotFound|Conflict|Internal, … })`
- `MergeConflicts(Vec<MergeConflict>)`
Compiler-side `NanoError` covers parse / catalog / type / plan / execution / arrow / lance / IO / manifest / unique-constraint, each with structured spans (`SourceSpan { start, end }`) for ariadne-style diagnostics.
## Result serialization (`omnigraph_compiler::result::QueryResult`)
- `to_arrow_ipc()` — efficient binary
- `to_sdk_json()` — JS-safe JSON (large i64 wrapped in metadata)
- `to_rust_json()` — Rust-friendly JSON
- `batches()` — direct Arrow `RecordBatch` access
Mutation results: `{ affectedNodes: usize, affectedEdges: usize }` (also exposed as a tiny Arrow batch).

76
docs/execution.md Normal file
View file

@ -0,0 +1,76 @@
# Query Execution, Mutations, and Loading
## Query execution (`exec/query.rs`)
Pipeline:
1. Parse + typecheck via `omnigraph-compiler`.
2. Lower to IR.
3. If `Expand` or `AntiJoin` is present, build (or fetch from `RuntimeCache`) a `GraphIndex`.
4. Run `execute_query` against the snapshot.
### Multi-modal search modes (`SearchMode`)
The executor recognizes three modes that may be combined in a single query:
- **`nearest`** — vector ANN (uses Lance vector index; `LIMIT` required).
- **`bm25`** — BM25 over an inverted index.
- **`rrf`** — Reciprocal Rank Fusion of two rankings, with k (default 60).
Hybrid example: `order { rrf(nearest($d.embedding, $q), bm25($d.body, $q_text)) desc } limit 20`.
### Joins / set operations
- Joins are implicit: MATCH bindings + traversals are implemented as scans + CSR/CSC lookups.
- `not { … }` lowers to an `AntiJoin` over the inner pipeline.
### Scoped reads
- `query(target, source, name, params)` — at any branch or snapshot.
- `run_query_at(version, …)` — direct historical query at a manifest version.
### Concurrency
- Snapshot isolation per query: all reads inside a query use the same `Snapshot`.
- Readers and writers on different branches don't block each other.
## Mutation execution (`exec/mutation.rs`)
Resolves expression values to literals, converts to typed Arrow arrays (`literal_to_typed_array(lit, DataType, num_rows)`), then writes:
- `insert` → Lance `WriteMode::Append`
- `update` → Lance `merge_insert(WhenMatched::Update)`
- `delete` → Lance `merge_insert(WhenMatched::Delete)` (logical) or filtered overwrite.
Multi-statement mutations are atomic at the manifest commit boundary.
## Bulk loader (`loader/mod.rs`)
- **JSONL only** in v1, with two record shapes:
- Node: `{"type":"NodeType", "data":{…}}`
- Edge: `{"edge":"EdgeType", "from":"src_id", "to":"dst_id", "data":{…}}`
- Lines starting with `//` are treated as comments.
- Schema validation on every row (typecheck, required props, blob base64 decoding).
- Edge endpoint resolution by node `@key`.
## Load modes (`LoadMode`)
| Mode | Semantics |
|---|---|
| `Overwrite` | Replace all data in the target tables on the branch |
| `Append` | Strict insert; duplicates error |
| `Merge` | Upsert by id (`merge_insert`) |
## `load` vs `ingest`
- `load(branch, data, mode)` — direct load to a branch.
- `ingest(branch, from, data, mode)` — branch-creating, transactional load:
1. If target advanced since the run started, fork a fresh run branch from `from`.
2. Load into the run branch (Append).
3. If target hasn't moved, fast-publish; otherwise abort.
- Returns `IngestResult { branch, base_branch, branch_created, mode, tables[] }`.
- `ingest_as(actor_id)` records the actor on the resulting commit.
## Embeddings during load
If a node type has `@embed` properties, the loader calls the engine embedding client (Gemini, RETRIEVAL_DOCUMENT) per row to populate the vector column. See [embeddings.md](embeddings.md).

26
docs/indexes.md Normal file
View file

@ -0,0 +1,26 @@
# Indexes
## L1 — Lance index types OmniGraph exposes
| Index | Use | Notes |
|---|---|---|
| **BTREE scalar** | range / equality on any scalar | created on `@key`, `@index(...)`, and on key columns by `ensure_indices()` |
| **Inverted (FTS)** | `search`, `fuzzy`, `match_text`, `bm25` | created on text columns referenced by FTS queries |
| **Vector** | `nearest()` k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as FixedSizeList(Float32, dim) |
## L2 — OmniGraph orchestration
- `ensure_indices()` / `ensure_indices_on(branch)` — idempotent build of BTREE + inverted indexes for the current head; safe to re-run.
- Indexes are built on the *branch head* (not on a snapshot), so reads always see the current index state.
- **Lazy branch forking for indexes**: a branch that hasn't mutated a sub-table doesn't need its own index — the main lineage's index is reused until the first write triggers a copy-on-write fork.
- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column.
## L2 — Graph topology index (`graph_index/mod.rs`)
This is OmniGraph-specific (not Lance):
- `TypeIndex`: dense `u32 ↔ String id` mapping per node type.
- `CsrIndex`: Compressed Sparse Row representation of edges per edge type — `offsets[i]..offsets[i+1]` slices into `targets`.
- `GraphIndex { type_indices, csr (out), csc (in) }` — built on demand from a snapshot's edge tables.
- Cached in `RuntimeCache::graph_indices` (LRU, max 8 entries, keyed by snapshot id + edge table versions).
- Built only when an `Expand` or `AntiJoin` IR op is present in the lowered query, so pure scans skip it.

22
docs/maintenance.md Normal file
View file

@ -0,0 +1,22 @@
# Maintenance: Optimize & Cleanup
`db/omnigraph/optimize.rs`.
## `optimize_all_tables(db)` — non-destructive
- Lance `compact_files()` on every node + edge table on `main`.
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests.
- Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8).
- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }]`.
## `cleanup_all_tables(db, options)` — destructive
- Lance `cleanup_old_versions()` per table.
- Removes manifests (and their unique fragments) older than the retention policy.
- `CleanupPolicyOptions { keep_versions: Option<u32>, older_than: Option<Duration> }` — at least one is required.
- Returns `[TableCleanupStats { table_key, bytes_removed, old_versions_removed }]`.
- CLI guards with `--confirm`; without it, prints a preview line.
## Tombstones
Logical sub-table delete markers in `__manifest`; `tombstone_object_id(table_key, version)` excludes a sub-table version from snapshot reconstruction.

30
docs/merge.md Normal file
View file

@ -0,0 +1,30 @@
# Merging (three-way) and Conflicts
`exec/merge.rs`.
## Strategy
Ordered, row-by-row cursor merge:
- `OrderedTableCursor` scans each table sorted by `id` and supports peek/pop matching.
- `StagedTableWriter` buffers `MERGE_STAGE_BATCH_ROWS = 8192` rows into a temp Lance dataset (`OMNIGRAPH_MERGE_STAGING_DIR`).
- The merge runs per sub-table; results are published as one atomic manifest update.
## Outcome enum
`MergeOutcome { AlreadyUpToDate | FastForward | Merged }`
## Conflict types (`error.rs`)
```
MergeConflictKind:
DivergentInsert // same id inserted on both branches
DivergentUpdate // updated differently on both branches
DeleteVsUpdate // one side deletes, other updates
OrphanEdge // edge references a node deleted by the other side
UniqueViolation
CardinalityViolation
ValueConstraintViolation
```
Returned as `OmniError::MergeConflicts(Vec<MergeConflict { table_key, row_id?, kind, message }>)`. The HTTP server surfaces this as a 409 with structured `merge_conflicts[]` (top 3 + "+N more").

44
docs/policy.md Normal file
View file

@ -0,0 +1,44 @@
# Authorization (Cedar policy)
OmniGraph integrates AWS Cedar (`cedar-policy = 4.9`) for ABAC.
## Policy actions
1. `read` — query / snapshot / list branches & commits
2. `export` — NDJSON export
3. `change` — mutations
4. `schema_apply` — apply schema migrations
5. `branch_create`
6. `branch_delete`
7. `branch_merge`
8. `run_publish`
9. `run_abort`
10. `admin` — reserved
## Scope kinds
- `branch_scope` — applied to source branch (`read`, `export`, `change`)
- `target_branch_scope` — applied to destination (`schema_apply`, branch ops, run ops)
- `protected_branches` — named list with special rules; rule scopes are `any | protected | unprotected`
## Configuration
`omnigraph.yaml`:
```yaml
policy:
file: ./policy.yaml # Cedar rules + groups
tests: ./policy.tests.yaml # declarative test cases
```
Each rule must use exactly one of `branch_scope` or `target_branch_scope`.
## CLI
- `omnigraph policy validate` — parse + count actors, exit 1 on parse error.
- `omnigraph policy test` — run cases in `policy.tests.yaml`, exit 1 on any expectation mismatch.
- `omnigraph policy explain --actor … --action … [--branch …] [--target-branch …]` — show decision and matched rule.
## Server enforcement
Every mutating endpoint calls `authorize_request()` *before* the handler runs; decisions are logged with actor / action / branch / outcome / matched rule.

103
docs/query-language.md Normal file
View file

@ -0,0 +1,103 @@
# Query Language (`.gq`)
Pest grammar at `crates/omnigraph-compiler/src/query/query.pest`. AST in `query/ast.rs`. Type checker in `query/typecheck.rs`. Lowering in `ir/lower.rs`.
## Query declarations
```
query <name>($p1: T1, $p2: T2?, …)
@description("…") @instruction("…") {
}
```
Two body shapes:
- **Read**: `match { … } return { … } [order { … }] [limit N]`
- **Mutation**: one or more of `insert | update | delete` statements
Param types reuse all schema scalars; trailing `?` makes a param optional. The compiler reserves `$__nanograph_now` for `now()`.
## MATCH clauses
- **Binding**: `$x: NodeType { prop: <literal | $param | now()>, … }`
- **Traversal**: `$src EDGE_NAME { min, max? } $dst` — variable-length paths via hop bounds; default 1..1 if bounds omitted.
- **Filter**: `<expr> <op> <expr>` with operators `>=`, `<=`, `!=`, `>`, `<`, `=`, and string `contains`.
- **Negation**: `not { clause+ }` — desugars to anti-join over the inner pipeline.
## Search clauses (multi-modal)
Used inside MATCH or as expressions inside RETURN/ORDER:
| Function | Purpose | Underlying Lance facility |
|---|---|---|
| `nearest($x.vec, $q)` | k-NN vector search (cosine) | Lance vector index (IVF / HNSW) |
| `search(field, q)` | Generic FTS | Inverted index |
| `fuzzy(field, q [, max_edits])` | Levenshtein-tolerant text search | Inverted index |
| `match_text(field, q)` | Pattern match | Inverted index |
| `bm25(field, q)` | BM25 scoring | Inverted index |
| `rrf(rank_a, rank_b [, k])` | Reciprocal Rank Fusion of two rankings (default k=60) | OmniGraph fuses scored rankings |
`nearest()` requires a `LIMIT`; the compiler resolves the query vector via the param map (or via the runtime embedding client when bound to a text input).
## RETURN clause
`return { <expr> [as <alias>], … }` with expressions:
- Variable / property access: `$x`, `$x.prop`
- Literals: string, int, float, bool, list
- `now()`
- Aggregates: `count`, `sum`, `avg`, `min`, `max`
- All search functions above (so you can return a score column)
- `AliasRef` — re-use a previous projection alias
## ORDER & LIMIT
- `order { <expr> [asc|desc], … }` — supports plain expressions and `nearest(...)`.
- `limit <integer>` — required when there is a `nearest(...)` ordering.
## Mutation statements
- `insert <Type> { prop: <value>, … }`
- `update <Type> set { prop: <value>, … } where <prop> <op> <value>`
- `delete <Type> where <prop> <op> <value>`
`<value>` is a literal, `$param`, or `now()`. Multi-statement mutations execute atomically (added in v0.2.0).
## IR (Intermediate Representation)
`QueryIR { name, params, pipeline: Vec<IROp>, return_exprs, order_by, limit }`
Pipeline operations:
- `NodeScan { variable, type_name, filters }`
- `Expand { src_var, dst_var, edge_type, direction (Out|In), dst_type, min_hops, max_hops, dst_filters }` — destination filters are pushed *into* the expand so Lance scalar pushdown can prune.
- `Filter { left, op, right }`
- `AntiJoin { outer_var, inner: Vec<IROp> }` — for `not { … }`
Lowering:
1. Partition MATCH clauses (bindings, traversals, filters, negations).
2. Identify "deferred" bindings (a destination of a traversal that has filters) so the Expand can carry the filter as a pushdown.
3. Emit NodeScan for the first binding, then Expand operations, then remaining Filter operations, then AntiJoins for negations.
4. Translate RETURN / ORDER expressions; preserve LIMIT.
## Linting & validation (`query/lint.rs`)
Codes seen so far:
- **Q000** (Error): parse error
- **L201** (Warning): nullable property never set by any UPDATE — "{type}.{prop} exists in schema but no update query sets it"
- (Warning): mutation declares no params — hardcoded mutations are easy to miss
- Plus all type errors from `typecheck_query_decl()` (undefined types, mismatched operators, undefined edges, etc.)
Output:
```
QueryLintOutput { status, schema_source, query_path,
queries_processed, errors, warnings, infos,
results: [{ name, kind, status, error?, warnings[] }],
findings: [{ severity, code, message, type_name?, property?, query_names[] }] }
```
CLI exits non-zero only on `status = Error`.

19
docs/releases/v0.3.1.md Normal file
View file

@ -0,0 +1,19 @@
# Omnigraph v0.3.1
Omnigraph v0.3.1 is a performance and operability point release.
## Highlights
- **Parallel per-type load writes**: the bulk loader writes to each node/edge table concurrently rather than serially, materially reducing wall-clock time on multi-table loads.
- **`omnigraph optimize` and `omnigraph cleanup` CLI commands**: previously only available via the engine API. `optimize` runs Lance `compact_files()` across every node/edge table; `cleanup` runs Lance `cleanup_old_versions()` with a `--keep`/`--older-than` policy and requires `--confirm` for the destructive form.
- **Dst-id deduplication during edge expand hydration**: avoids redundant lookups when the same destination id appears multiple times in an `Expand` step (#45).
## Included Changes
- Parallel per-type load writes (#46)
- `omnigraph optimize` / `cleanup` CLI commands and runtime APIs (#46)
- Dedupe dst ids before hydrating nodes in `execute_expand` (#45)
## Upgrade Notes
No breaking changes. Existing v0.3.0 repos can be opened directly with v0.3.1.

38
docs/runs.md Normal file
View file

@ -0,0 +1,38 @@
# Runs (transactional graph mutations)
`db/run_registry.rs` + run lifecycle in `db/omnigraph.rs`. Stored in `_graph_runs.lance` and `_graph_run_actors.lance`.
## RunRecord
```
RunRecord {
run_id: RunId (ULID),
target_branch: String, // where the run will publish
run_branch: "__run__<id>", // ephemeral isolation branch
base_snapshot_id: String,
base_manifest_version: u64,
operation_hash: Option<String>, // idempotency key
actor_id: Option<String>,
status: Running | Published | Failed | Aborted,
published_snapshot_id: Option<String>,
created_at, updated_at: i64 (microseconds),
}
```
## Lifecycle
1. `begin_run(target_branch, op_hash)` / `begin_run_as(target_branch, op_hash, actor_id)` — forks `__run__<id>` from the target's current head, appends a `RunRecord`.
2. Mutations on `run_branch` (via the normal write APIs) — isolated from concurrent activity on the target.
3. `publish_run(id)` / `publish_run_as(id, actor)`:
- **Fast path**: if the target hasn't moved since `base_snapshot_id`, promote the run snapshot directly.
- **Merge path**: if it has moved, perform a three-way merge (see [merge.md](merge.md)) into the target.
- On success: `status = Published`, `published_snapshot_id` set, run branch cleaned up asynchronously.
4. `abort_run(id)` / `fail_run(id)` — terminal; cleans up run branch best-effort.
## Idempotency
`operation_hash` is an optional field clients can use to detect a duplicate `begin_run` retry.
## Cleanup
`cleanup_terminal_run_branches_for_target(branch)` is called as branches change; failures are swallowed (lazy cleanup on next branch op).

79
docs/schema-language.md Normal file
View file

@ -0,0 +1,79 @@
# Schema Language (`.pg`)
Pest grammar at `crates/omnigraph-compiler/src/schema/schema.pest`. AST at `schema/ast.rs`. Catalog at `catalog/mod.rs`.
## Top-level declarations
- `interface <Name> { property* }` — reusable property contracts.
- `node <Name> [implements <Iface>, ...] { property* | constraint* }`
- `edge <Name>: <FromType> -> <ToType> [@card(min..max)] { property* | constraint* }`
- Comments: line `//` and block `/* … */`.
## Property declarations
`<ident>: <TypeRef> [annotation*]`
## Built-in scalar types
| Scalar | Arrow type |
|---|---|
| `String` | Utf8 |
| `Blob` | LargeBinary |
| `Bool` | Boolean |
| `I32` / `I64` | Int32 / Int64 |
| `U32` / `U64` | UInt32 / UInt64 |
| `F32` / `F64` | Float32 / Float64 |
| `Date` | Date32 |
| `DateTime` | Date64 |
| `Vector(<dim>)` | FixedSizeList(Float32, dim), `1 ≤ dim ≤ i32::MAX` |
| `[<scalar>]` | List(scalar) |
| `enum(v1, v2, …)` | Utf8 with sorted/dedup'd set of allowed string values |
| `<scalar>?` | Same as scalar but `nullable: true` |
## Constraints (body level)
| Constraint | On | Effect |
|---|---|---|
| `@key(p, …)` | node | Primary key; implies index on key columns; `key_property()` returns the first key |
| `@unique(p, …)` | node, edge | Uniqueness across listed columns |
| `@index(p, …)` | node, edge | Build a scalar (BTREE) index on the columns |
| `@range(p, min..max)` | node | Numeric range validation (open ranges allowed) |
| `@check(p, "regex")` | node | Regex pattern validation |
| `@card(min..max?)` | edge | Edge multiplicity — default `0..*`; `0..1`, `1..1`, `1..*`, etc. |
Edge bodies only allow `@unique` and `@index`.
## Annotations
- `@<ident>` or `@<ident>(<literal>)` on any declaration or property.
- Known annotations:
- `@embed` on a Vector property — names the *source* property whose text gets embedded into this vector at ingest (`embed_sources` map in NodeType).
- `@description("…")`, `@instruction("…")` on query declarations (carried through to clients).
- Custom annotations are accepted by the parser and surfaced in catalog metadata; unrecognized annotations don't fail compilation.
## Catalog construction
- Pass 0: collect interfaces.
- Pass 1: collect nodes, expand `implements`, build constraint and `@embed` mappings, build the Arrow schema for each node table (`id: Utf8` plus all properties; blob columns get `LargeBinary`).
- Pass 2: collect edges, validate that `from_type` / `to_type` exist, normalize edge names case-insensitively for lookup, validate constraints for edges. Edge Arrow schema: `id: Utf8, src: Utf8, dst: Utf8` plus edge properties.
## Schema IR & stable type IDs
- `IR_VERSION = 1` (`catalog/schema_ir.rs`).
- Each interface/node/edge gets a `stable_type_id` (kind+name hashed) so renames can be tracked.
- Serialized as JSON for diff/migration plans.
## Schema migration planning
`plan_schema_migration(accepted, desired) -> SchemaMigrationPlan { supported, steps[] }` with step types:
- `AddType { type_kind, name }`
- `RenameType { type_kind, from, to }`
- `AddProperty { type_kind, type_name, property_name, property_type }`
- `RenameProperty { type_kind, type_name, from, to }`
- `AddConstraint { type_kind, type_name, constraint }`
- `UpdateTypeMetadata { … annotations }`
- `UpdatePropertyMetadata { … annotations }`
- `UnsupportedChange { entity, reason }` (forces `supported=false`)
`apply_schema()` returns `SchemaApplyResult { supported, applied, manifest_version, steps }` and is gated by an internal `__schema_apply_lock__` system branch so concurrent schema applies serialize.

68
docs/server.md Normal file
View file

@ -0,0 +1,68 @@
# HTTP Server (`omnigraph-server`)
Axum 0.8 + tokio + utoipa-generated OpenAPI. Single repo per process; deploy multiple processes for multi-tenant.
## Endpoint inventory
| Method | Path | Auth | Action | Handler |
|---|---|---|---|---|
| GET | `/healthz` | none | — | `server_health` |
| GET | `/openapi.json` | none | — | `server_openapi` (strips security if auth disabled) |
| GET | `/snapshot?branch=` | bearer + `read` | snapshot of branch | `server_snapshot` |
| POST | `/read` | bearer + `read` | run named query | `server_read` |
| POST | `/export` | bearer + `export` | NDJSON stream | `server_export` |
| POST | `/change` | bearer + `change` | mutation | `server_change` |
| GET | `/schema` | bearer + `read` | get current `.pg` source | `server_schema_get` |
| POST | `/schema/apply` | bearer + `schema_apply` (target=`main`) | migrate | `server_schema_apply` |
| POST | `/ingest` | bearer + `branch_create` (if new) + `change` | bulk load | `server_ingest` (32 MB body limit) |
| GET | `/branches` | bearer + `read` | list branches | `server_branch_list` |
| POST | `/branches` | bearer + `branch_create` | create | `server_branch_create` |
| DELETE | `/branches/{branch}` | bearer + `branch_delete` | delete | `server_branch_delete` |
| POST | `/branches/merge` | bearer + `branch_merge` | merge `source → target` | `server_branch_merge` |
| GET | `/runs` | bearer + `read` | list | `server_run_list` |
| GET | `/runs/{run_id}` | bearer + `read` | show | `server_run_show` |
| POST | `/runs/{run_id}/publish` | bearer + `run_publish` | publish | `server_run_publish` |
| POST | `/runs/{run_id}/abort` | bearer + `run_abort` | abort | `server_run_abort` |
| GET | `/commits?branch=` | bearer + `read` | list | `server_commit_list` |
| GET | `/commits/{commit_id}` | bearer + `read` | show | `server_commit_show` |
## Streaming
Only `/export` streams (`application/x-ndjson`, MPSC channel + `Body::from_stream`). Everything else is buffered JSON.
## Error model
Uniform `ErrorOutput { error, code?, merge_conflicts[] }` with `code ∈ unauthorized | forbidden | bad_request | not_found | conflict | internal`. Merge conflicts attach structured `MergeConflictOutput { table_key, row_id?, kind, message }`.
HTTP status codes used: 200, 400, 401, 403, 404, 409, 500.
## Body limits
- Default: 1 MB
- `/ingest`: 32 MB
## Auth model (`bearer + SHA-256`)
- Tokens are SHA-256 hashed on startup; plaintext is never persisted in memory.
- Constant-time comparison via `subtle::ConstantTimeEq`.
- Three sources, in precedence:
1. `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` — AWS Secrets Manager (build with `--features aws`)
2. `OMNIGRAPH_SERVER_BEARER_TOKENS_FILE` or `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON` — JSON `{actor_id: token, …}`
3. `OMNIGRAPH_SERVER_BEARER_TOKEN` — single legacy token, actor `default`
- If no tokens configured, server runs unauthenticated (local dev) and `/openapi.json` strips the security scheme.
See [deployment.md](deployment.md) for token-source operational details.
## Tracing & observability
- `tower_http::TraceLayer::new_for_http()`
- Policy decisions logged at INFO level with actor, action, branch, decision, matched rule
- Startup logs: token source name, repo URI, bind address
- Graceful SIGINT shutdown
## Not implemented (by design or "TBD")
- CORS — not configured; add `tower_http::cors` if needed.
- Rate limiting — none.
- Pagination — none (commits/branches/runs return everything; export streams).
- Multi-tenant routing — one repo per process.

46
docs/storage.md Normal file
View file

@ -0,0 +1,46 @@
# Storage
## L1 — Lance dataset (per node/edge type)
Every node type and every edge type is its own Lance dataset:
- **Columnar Arrow storage**: each property is a column; nullable per Arrow schema.
- **Fragments**: data is partitioned into fragments; new writes create new fragments.
- **Manifest versioning**: every commit produces a new dataset version; old versions remain readable.
- **Stable row IDs**: enabled by OmniGraph for the commit-graph and run-registry datasets so durable references survive compaction.
- **Append / delete / `merge_insert`**: native Lance write modes.
- **Per-dataset branches** (Lance native): copy-on-write at the dataset level.
- **Object-store agnostic**: file://, s3://, gs://, az://, http (read-only via Lance) — OmniGraph wires file:// and s3:// (`storage.rs`).
## L2 — Multi-dataset coordination via `__manifest`
OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordinated through one append-only manifest table.
- **Manifest table**: `__manifest/` Lance dataset.
- **Layout** (`db/manifest/layout.rs`, `db/manifest/state.rs`):
- `nodes/{fnv1a64-hex(type_name)}` — one Lance dataset per node type
- `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type
- `__manifest/` — the catalog of all sub-tables and their published versions
- `_graph_commits.lance` / `_graph_commit_actors.lance` — the commit graph and its actor map
- `_graph_runs.lance` / `_graph_run_actors.lance` — the run registry and its actor map
- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`):
- `object_type``table | table_version | table_tombstone`
- `table_key``node:<TypeName> | edge:<EdgeName>`
- `table_branch` is `null` for the main lineage and the branch name otherwise
- **Snapshot reconstruction**: latest visible `table_version` per `(table_key, table_branch)` minus tombstones whose `tombstone_version >= table_version`.
- **Atomic publish**: multi-dataset commits publish via a `ManifestBatchPublisher` so a single write to `__manifest` flips all the new sub-table versions visible at once.
## URI scheme support (`storage.rs`)
| Scheme | Backend | Notes |
|---|---|---|
| local path / `file://` | `LocalStorageAdapter` (tokio) | Normalized to absolute paths |
| `s3://bucket/prefix` | `S3StorageAdapter` (object_store) | Honors `AWS_ENDPOINT_URL_S3`, `AWS_ALLOW_HTTP`, `AWS_S3_FORCE_PATH_STYLE` |
| `http(s)://host:port` | HTTP client to `omnigraph-server` | Used by CLI as a target, not a storage backend |
## Object-store env vars (S3-compatible)
- `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`
- `AWS_ENDPOINT_URL`, `AWS_ENDPOINT_URL_S3` — for MinIO / RustFS / GCS-via-XML
- `AWS_S3_FORCE_PATH_STYLE=true` — path-style URLs
- `AWS_ALLOW_HTTP=true` — allow plain HTTP (local dev)

77
scripts/check-agents-md.sh Executable file
View file

@ -0,0 +1,77 @@
#!/usr/bin/env bash
# Verify that AGENTS.md and docs/ stay in sync.
#
# Two checks:
# 1. Every docs/*.md path linked from AGENTS.md exists on disk.
# 2. Every doc in the canonical set is linked from AGENTS.md.
#
# Exit non-zero on any drift.
set -euo pipefail
repo_root="$(cd "$(dirname "$0")/.." && pwd)"
cd "$repo_root"
agents_file="AGENTS.md"
if [[ ! -f "$agents_file" ]]; then
echo "error: $agents_file not found" >&2
exit 1
fi
# Canonical set: every docs/*.md (top-level), plus the releases/ index dir if present.
canonical=()
while IFS= read -r line; do
canonical+=("$line")
done < <(find docs -mindepth 1 -maxdepth 1 -type f -name '*.md' | sort)
if [[ -d docs/releases ]]; then
canonical+=("docs/releases/")
fi
# Extract docs/ links from AGENTS.md (markdown link form: (docs/...))
linked=()
while IFS= read -r line; do
linked+=("$line")
done < <(grep -oE '\(docs/[^)]+\)' "$agents_file" | sed -E 's/^\(|\)$//g' | sort -u)
fail=0
# Check 1: every linked path exists.
for link in "${linked[@]}"; do
# Strip in-page anchors like #foo
path="${link%%#*}"
if [[ "$path" == */ ]]; then
if [[ ! -d "$path" ]]; then
echo "error: AGENTS.md links to missing directory: $path" >&2
fail=1
fi
else
if [[ ! -f "$path" ]]; then
echo "error: AGENTS.md links to missing file: $path" >&2
fail=1
fi
fi
done
# Check 2: every canonical doc is linked at least once.
for doc in "${canonical[@]}"; do
found=0
for link in "${linked[@]}"; do
path="${link%%#*}"
if [[ "$path" == "$doc" ]]; then
found=1
break
fi
done
if [[ "$found" -eq 0 ]]; then
echo "error: doc not linked from AGENTS.md: $doc" >&2
fail=1
fi
done
if [[ "$fail" -ne 0 ]]; then
echo >&2
echo "AGENTS.md / docs/ are out of sync. Either update AGENTS.md links or rename/remove the doc." >&2
exit 1
fi
echo "AGENTS.md ↔ docs/ links OK (${#linked[@]} links, ${#canonical[@]} docs)."