diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 3893efe..4cae31c 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -99,6 +99,18 @@ jobs: echo "run_full_ci=$run_full_ci" >> "$GITHUB_OUTPUT" echo "run_rustfs_ci=$run_rustfs_ci" >> "$GITHUB_OUTPUT" + check_agents_md: + name: Check AGENTS.md Links + runs-on: ubuntu-latest + permissions: + contents: read + steps: + - name: Checkout source + uses: actions/checkout@v5.0.1 + + - name: Verify AGENTS.md ↔ docs/ cross-links + run: bash scripts/check-agents-md.sh + test: name: Test Workspace needs: classify_changes diff --git a/AGENTS.md b/AGENTS.md index d1423fe..3760d09 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,923 +1,104 @@ # OmniGraph — Agent Guide -> **Maintenance contract for agents:** This file is the canonical, in-repo orientation for any AI coding agent working in this repository. **You MUST keep it up to date.** Whenever you change schema, query language, IR, storage layout, server endpoints, CLI surface, policy actions, defaults, constants, deployment, or release history, update the relevant section of this file in the same change. If a section becomes stale, fix it — do not leave drift. When unsure whether a change is user-visible enough to document here, err on the side of updating. +This file is the always-on map for AI coding agents (Claude Code, Codex, Cursor, Cline) working in this repo. It is loaded into context on every turn, so it stays as a **map plus the rules and invariants that need to be in scope at all times** — the encyclopedia content lives under [`docs/`](docs/). When you need depth, follow a pointer. -**Version surveyed:** 0.3.1 (branch `ragnorc/omnigraph-spec`) +`CLAUDE.md` is a symlink to this file — there is exactly one source of truth. Edit `AGENTS.md`. + +**Version surveyed:** 0.3.1 **Workspace crates:** `omnigraph-compiler`, `omnigraph` (engine), `omnigraph-cli`, `omnigraph-server` **Storage substrate:** Lance 4.x (columnar, versioned, branchable) **License:** MIT **Toolchain:** Rust stable, edition 2024 -> Convention: each feature is split into **L1 — Inherited from Lance** (capabilities OmniGraph gets "for free" by sitting on top of the Lance dataset format) and **L2 — Added by OmniGraph** (typing, graph semantics, coordination across many datasets, server, CLI, policy, etc.). When a section only exists at one layer, only that layer is shown. +--- + +## Start here — what is this? + +OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets. Highlights: + +- **Storage**: per node/edge type a separate Lance dataset; multi-dataset commits coordinated atomically through one `__manifest` table. +- **Languages**: a `.pg` schema language and a `.gq` query language, both Pest-based, with a typed IR. +- **Multi-modal querying**: vector ANN (`nearest`), full-text (`search`/`fuzzy`/`match_text`/`bm25`), Reciprocal Rank Fusion (`rrf`), and graph traversal (`Expand`, anti-join `not { … }`) in one runtime. +- **Branches and commits across the whole graph**: Git-style — every successful publish appends to a commit DAG; merges are three-way at the row level. +- **Transactional runs**: ephemeral `__run__` branches for isolated mutation, fast-path or merge-path publish. +- **HTTP server**: Axum + utoipa OpenAPI, bearer auth (SHA-256 hashed, optional AWS Secrets Manager), Cedar policy gating. +- **CLI** driven by a single `omnigraph.yaml`; multi-format output (json/jsonl/csv/kv/table). + +Throughout the docs, capabilities are split into **L1 — Inherited from Lance** vs **L2 — Added by OmniGraph**. --- -## 0. Architecture at a Glance +## Architecture at a glance ``` -┌──────────────────────────────────────────────────────────────────┐ -│ CLI (omnigraph) HTTP Server (omnigraph-server, Axum) │ -│ - 13 cmd families - REST + OpenAPI │ -│ - Aliases, configs - Bearer auth + Cedar policy │ -└──────────────────────────────┬───────────────────────────────────┘ - │ -┌──────────────────────────────▼───────────────────────────────────┐ -│ omnigraph-compiler │ -│ - Pest grammars: schema.pest, query.pest │ -│ - Catalog (Node/Edge/Interface types) │ -│ - IR + lowering (NodeScan / Expand / Filter / AntiJoin) │ -│ - Schema migration planner │ -│ - Embedding client (OpenAI-style for query-time normalization) │ -└──────────────────────────────┬───────────────────────────────────┘ - │ -┌──────────────────────────────▼───────────────────────────────────┐ -│ omnigraph (engine) │ -│ - GraphCoordinator + ManifestRepo (__manifest) │ -│ - CommitGraph (_graph_commits.lance) │ -│ - RunRegistry (_graph_runs.lance, __run__ branches) │ -│ - GraphIndex (CSR/CSC) + RuntimeCache (LRU 8) │ -│ - exec::query / mutation / merge │ -│ - Embedding client (Gemini for runtime ingest) │ -└──────────────────────────────┬───────────────────────────────────┘ - │ -┌──────────────────────────────▼───────────────────────────────────┐ -│ Lance 4.x (per-table dataset) │ -│ - Columnar (Arrow) storage, fragments │ -│ - Manifest versions per dataset │ -│ - Per-dataset branches (copy-on-write) │ -│ - Indexes: BTREE, Inverted (FTS/BM25), IVF/HNSW vector │ -│ - merge_insert (upsert), append, delete │ -│ - compact_files, cleanup_old_versions │ -└──────────────────────────────┬───────────────────────────────────┘ - │ -┌──────────────────────────────▼───────────────────────────────────┐ -│ Object store: local FS, S3, RustFS, MinIO, S3-compatible │ -└──────────────────────────────────────────────────────────────────┘ +CLI (omnigraph) HTTP Server (omnigraph-server, Axum) + │ │ + └─────────────┬──────────────┘ + ▼ + omnigraph-compiler ── Pest grammars, catalog, IR, lowering, lint, migration plan + │ + ▼ + omnigraph (engine) ── ManifestRepo, CommitGraph, RunRegistry, GraphIndex (CSR/CSC), exec + │ + ▼ + Lance 4.x ── columnar Arrow, fragments, per-dataset versions/branches, indexes + │ + ▼ + Object store (file / s3 / RustFS / MinIO / S3-compat) ``` -**One-liner:** OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a single `omnigraph.yaml`. +Full diagram and concurrency model: [docs/architecture.md](docs/architecture.md). --- -## 1. Storage Architecture +## Where to find each topic -### 1.1 L1 — Lance dataset (per node/edge type) -Every node type and every edge type is its own Lance dataset: -- **Columnar Arrow storage**: each property is a column; nullable per Arrow schema. -- **Fragments**: data is partitioned into fragments; new writes create new fragments. -- **Manifest versioning**: every commit produces a new dataset version; old versions remain readable. -- **Stable row IDs**: enabled by OmniGraph for the commit-graph and run-registry datasets so durable references survive compaction. -- **Append / delete / `merge_insert`**: native Lance write modes. -- **Per-dataset branches** (Lance native): copy-on-write at the dataset level. -- **Object-store agnostic**: file://, s3://, gs://, az://, http (read-only via Lance) — OmniGraph wires file:// and s3:// (`storage.rs`). - -### 1.2 L2 — Multi-dataset coordination via `__manifest` -OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordinated through one append-only manifest table. - -- **Manifest table**: `__manifest/` Lance dataset. -- **Layout** (`db/manifest/layout.rs`, `db/manifest/state.rs`): - - `nodes/{fnv1a64-hex(type_name)}` — one Lance dataset per node type - - `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type - - `__manifest/` — the catalog of all sub-tables and their published versions - - `_graph_commits.lance` / `_graph_commit_actors.lance` — the commit graph and its actor map - - `_graph_runs.lance` / `_graph_run_actors.lance` — the run registry and its actor map -- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`): - - `object_type` ∈ `table | table_version | table_tombstone` - - `table_key` ∈ `node: | edge:` - - `table_branch` is `null` for the main lineage and the branch name otherwise -- **Snapshot reconstruction**: latest visible `table_version` per `(table_key, table_branch)` minus tombstones whose `tombstone_version >= table_version`. -- **Atomic publish**: multi-dataset commits publish via a `ManifestBatchPublisher` so a single write to `__manifest` flips all the new sub-table versions visible at once. - -### 1.3 URI scheme support (`storage.rs`) -| Scheme | Backend | Notes | -|---|---|---| -| local path / `file://` | `LocalStorageAdapter` (tokio) | Normalized to absolute paths | -| `s3://bucket/prefix` | `S3StorageAdapter` (object_store) | Honors `AWS_ENDPOINT_URL_S3`, `AWS_ALLOW_HTTP`, `AWS_S3_FORCE_PATH_STYLE` | -| `http(s)://host:port` | HTTP client to `omnigraph-server` | Used by CLI as a target, not a storage backend | - -### 1.4 Object-store env vars (S3-compatible) -- `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN` -- `AWS_ENDPOINT_URL`, `AWS_ENDPOINT_URL_S3` — for MinIO / RustFS / GCS-via-XML -- `AWS_S3_FORCE_PATH_STYLE=true` — path-style URLs -- `AWS_ALLOW_HTTP=true` — allow plain HTTP (local dev) - ---- - -## 2. Schema Language (`.pg`) - -Pest grammar at `crates/omnigraph-compiler/src/schema/schema.pest`. AST at `schema/ast.rs`. Catalog at `catalog/mod.rs`. - -### 2.1 Top-level declarations -- `interface { property* }` — reusable property contracts. -- `node [implements , ...] { property* | constraint* }` -- `edge : -> [@card(min..max)] { property* | constraint* }` -- Comments: line `//` and block `/* … */`. - -### 2.2 Property declarations -`: [annotation*]` - -### 2.3 Built-in scalar types -| Scalar | Arrow type | +| Area | Read | |---|---| -| `String` | Utf8 | -| `Blob` | LargeBinary | -| `Bool` | Boolean | -| `I32` / `I64` | Int32 / Int64 | -| `U32` / `U64` | UInt32 / UInt64 | -| `F32` / `F64` | Float32 / Float64 | -| `Date` | Date32 | -| `DateTime` | Date64 | -| `Vector()` | FixedSizeList(Float32, dim), `1 ≤ dim ≤ i32::MAX` | -| `[]` | List(scalar) | -| `enum(v1, v2, …)` | Utf8 with sorted/dedup'd set of allowed string values | -| `?` | Same as scalar but `nullable: true` | - -### 2.4 Constraints (body level) -| Constraint | On | Effect | -|---|---|---| -| `@key(p, …)` | node | Primary key; implies index on key columns; `key_property()` returns the first key | -| `@unique(p, …)` | node, edge | Uniqueness across listed columns | -| `@index(p, …)` | node, edge | Build a scalar (BTREE) index on the columns | -| `@range(p, min..max)` | node | Numeric range validation (open ranges allowed) | -| `@check(p, "regex")` | node | Regex pattern validation | -| `@card(min..max?)` | edge | Edge multiplicity — default `0..*`; `0..1`, `1..1`, `1..*`, etc. | - -Edge bodies only allow `@unique` and `@index`. - -### 2.5 Annotations -- `@` or `@()` on any declaration or property. -- Known annotations: - - `@embed` on a Vector property — names the *source* property whose text gets embedded into this vector at ingest (`embed_sources` map in NodeType). - - `@description("…")`, `@instruction("…")` on query declarations (carried through to clients). -- Custom annotations are accepted by the parser and surfaced in catalog metadata; unrecognized annotations don't fail compilation. - -### 2.6 Catalog construction -- Pass 0: collect interfaces. -- Pass 1: collect nodes, expand `implements`, build constraint and `@embed` mappings, build the Arrow schema for each node table (`id: Utf8` plus all properties; blob columns get `LargeBinary`). -- Pass 2: collect edges, validate that `from_type` / `to_type` exist, normalize edge names case-insensitively for lookup, validate constraints for edges. Edge Arrow schema: `id: Utf8, src: Utf8, dst: Utf8` plus edge properties. - -### 2.7 Schema IR & stable type IDs -- `IR_VERSION = 1` (`catalog/schema_ir.rs`). -- Each interface/node/edge gets a `stable_type_id` (kind+name hashed) so renames can be tracked. -- Serialized as JSON for diff/migration plans. - -### 2.8 Schema migration planning -`plan_schema_migration(accepted, desired) -> SchemaMigrationPlan { supported, steps[] }` with step types: -- `AddType { type_kind, name }` -- `RenameType { type_kind, from, to }` -- `AddProperty { type_kind, type_name, property_name, property_type }` -- `RenameProperty { type_kind, type_name, from, to }` -- `AddConstraint { type_kind, type_name, constraint }` -- `UpdateTypeMetadata { … annotations }` -- `UpdatePropertyMetadata { … annotations }` -- `UnsupportedChange { entity, reason }` (forces `supported=false`) - -`apply_schema()` returns `SchemaApplyResult { supported, applied, manifest_version, steps }` and is gated by an internal `__schema_apply_lock__` system branch so concurrent schema applies serialize. +| Architecture, L1/L2 framing, concurrency model | [docs/architecture.md](docs/architecture.md) | +| Storage layout, `__manifest` schema, URI schemes, S3 env vars | [docs/storage.md](docs/storage.md) | +| `.pg` schema language, types, constraints, annotations, migration planning | [docs/schema-language.md](docs/schema-language.md) | +| `.gq` query language, MATCH/RETURN/ORDER, search funcs, mutations, IR ops, lint codes | [docs/query-language.md](docs/query-language.md) | +| Indexes (BTREE / inverted / vector / graph topology) | [docs/indexes.md](docs/indexes.md) | +| Embeddings (compiler + engine clients, env vars, `@embed`) | [docs/embeddings.md](docs/embeddings.md) | +| Branches, commit graph, snapshots, system branches | [docs/branches-commits.md](docs/branches-commits.md) | +| Runs (transactional graph mutations, `__run__`, publish paths) | [docs/runs.md](docs/runs.md) | +| Three-way merge and conflict kinds | [docs/merge.md](docs/merge.md) | +| Diff / change feed (`diff_between`, `diff_commits`) | [docs/changes.md](docs/changes.md) | +| Query execution, mutation execution, bulk loader, `load` vs `ingest` | [docs/execution.md](docs/execution.md) | +| `optimize` (compaction) and `cleanup` (version GC) | [docs/maintenance.md](docs/maintenance.md) | +| Cedar policy actions, scopes, CLI | [docs/policy.md](docs/policy.md) | +| HTTP server endpoints, auth, error model, body limits | [docs/server.md](docs/server.md) | +| CLI quick-start | [docs/cli.md](docs/cli.md) | +| CLI command surface and `omnigraph.yaml` schema | [docs/cli-reference.md](docs/cli-reference.md) | +| Audit / actor tracking | [docs/audit.md](docs/audit.md) | +| Error taxonomy and result serialization | [docs/errors.md](docs/errors.md) | +| Install (binary / Homebrew / source / channels) | [docs/install.md](docs/install.md) | +| Deployment (binary / container / RustFS bootstrap / auth / build variants) | [docs/deployment.md](docs/deployment.md) | +| CI / release workflows | [docs/ci.md](docs/ci.md) | +| Constants & tunables cheat sheet | [docs/constants.md](docs/constants.md) | +| Per-version release notes | [docs/releases/](docs/releases/) | --- -## 3. Query Language (`.gq`) +## Always-on rules (load these into your working memory) -Pest grammar at `crates/omnigraph-compiler/src/query/query.pest`. AST in `query/ast.rs`. Type checker in `query/typecheck.rs`. Lowering in `ir/lower.rs`. +These invariants need to be in scope on every change — they're the ones that quietly break if forgotten. -### 3.1 Query declarations -``` -query ($p1: T1, $p2: T2?, …) - @description("…") @instruction("…") { - … -} -``` -Two body shapes: -- **Read**: `match { … } return { … } [order { … }] [limit N]` -- **Mutation**: one or more of `insert | update | delete` statements - -Param types reuse all schema scalars; trailing `?` makes a param optional. The compiler reserves `$__nanograph_now` for `now()`. - -### 3.2 MATCH clauses -- **Binding**: `$x: NodeType { prop: , … }` -- **Traversal**: `$src EDGE_NAME { min, max? } $dst` — variable-length paths via hop bounds; default 1..1 if bounds omitted. -- **Filter**: ` ` with operators `>=`, `<=`, `!=`, `>`, `<`, `=`, and string `contains`. -- **Negation**: `not { clause+ }` — desugars to anti-join over the inner pipeline. - -### 3.3 Search clauses (multi-modal) -Used inside MATCH or as expressions inside RETURN/ORDER: -| Function | Purpose | Underlying Lance facility | -|---|---|---| -| `nearest($x.vec, $q)` | k-NN vector search (cosine) | Lance vector index (IVF / HNSW) | -| `search(field, q)` | Generic FTS | Inverted index | -| `fuzzy(field, q [, max_edits])` | Levenshtein-tolerant text search | Inverted index | -| `match_text(field, q)` | Pattern match | Inverted index | -| `bm25(field, q)` | BM25 scoring | Inverted index | -| `rrf(rank_a, rank_b [, k])` | Reciprocal Rank Fusion of two rankings (default k=60) | OmniGraph fuses scored rankings | - -`nearest()` requires a `LIMIT`; the compiler resolves the query vector via the param map (or via the runtime embedding client when bound to a text input). - -### 3.4 RETURN clause -`return { [as ], … }` with expressions: -- Variable / property access: `$x`, `$x.prop` -- Literals: string, int, float, bool, list -- `now()` -- Aggregates: `count`, `sum`, `avg`, `min`, `max` -- All search functions above (so you can return a score column) -- `AliasRef` — re-use a previous projection alias - -### 3.5 ORDER & LIMIT -- `order { [asc|desc], … }` — supports plain expressions and `nearest(...)`. -- `limit ` — required when there is a `nearest(...)` ordering. - -### 3.6 Mutation statements -- `insert { prop: , … }` -- `update set { prop: , … } where ` -- `delete where ` - -`` is a literal, `$param`, or `now()`. Multi-statement mutations execute atomically (added in v0.2.0). - -### 3.7 IR (Intermediate Representation) -`QueryIR { name, params, pipeline: Vec, return_exprs, order_by, limit }` - -Pipeline operations: -- `NodeScan { variable, type_name, filters }` -- `Expand { src_var, dst_var, edge_type, direction (Out|In), dst_type, min_hops, max_hops, dst_filters }` — destination filters are pushed *into* the expand so Lance scalar pushdown can prune. -- `Filter { left, op, right }` -- `AntiJoin { outer_var, inner: Vec }` — for `not { … }` - -Lowering: -1. Partition MATCH clauses (bindings, traversals, filters, negations). -2. Identify "deferred" bindings (a destination of a traversal that has filters) so the Expand can carry the filter as a pushdown. -3. Emit NodeScan for the first binding, then Expand operations, then remaining Filter operations, then AntiJoins for negations. -4. Translate RETURN / ORDER expressions; preserve LIMIT. - -### 3.8 Linting & validation (`query/lint.rs`) -Codes seen so far: -- **Q000** (Error): parse error -- **L201** (Warning): nullable property never set by any UPDATE — "{type}.{prop} exists in schema but no update query sets it" -- (Warning): mutation declares no params — hardcoded mutations are easy to miss -- Plus all type errors from `typecheck_query_decl()` (undefined types, mismatched operators, undefined edges, etc.) - -Output: -``` -QueryLintOutput { status, schema_source, query_path, - queries_processed, errors, warnings, infos, - results: [{ name, kind, status, error?, warnings[] }], - findings: [{ severity, code, message, type_name?, property?, query_names[] }] } -``` -CLI exits non-zero only on `status = Error`. +1. **`__manifest` is the atomic-publish boundary.** Multi-dataset commits flip via a single `ManifestBatchPublisher` write. Don't introduce code paths that publish per sub-table outside the batch publisher — you'll lose snapshot isolation across tables. +2. **`nearest($x.vec, $q)` requires a `LIMIT`.** The compiler enforces it, but if you're touching the query lowering or executor, don't break this rule. ANN without a limit is unbounded. +3. **Snapshot isolation per query.** A query holds one `Snapshot` for its lifetime. Don't read against `db.head()` mid-query; use the snapshot bound at lowering time. +4. **Run isolation lives on `__run__` branches.** Mutations inside `begin_run` … `publish_run` must go through `run_branch`, not `target_branch`. Publish picks fast-path (target unmoved) or merge-path (three-way). +5. **Schema apply is serialized via `__schema_apply_lock__`.** Concurrent `apply_schema` is not safe. Don't bypass the lock. +6. **`branch_list()` filters internal branches.** `__run__…` and `__schema_apply_lock__` must not appear in user-visible listings, exports, or policy-scoped operations. If you add a new system branch, follow the `__name__` prefix convention and add it to the filter. +7. **Bearer-token plaintext never persists in process memory.** Tokens are SHA-256 hashed at startup; comparison uses `subtle::ConstantTimeEq`. The actor id is server-resolved from the hash match — it must not be settable by the client. +8. **Mutations are atomic at the manifest commit boundary.** Multi-statement `change` queries publish one commit. Don't commit per-statement. +9. **Indexes are built on the branch head, not on a snapshot.** Reads always see the current index state. Lazy fork: a branch that hasn't mutated a sub-table reuses the source's index until the first write. +10. **Stable type IDs survive renames.** Schema migration uses `stable_type_id` (kind+name hashed at first sight). Don't mint new IDs on rename. --- -## 4. Indexes (multi-modal lens) - -### 4.1 L1 — Lance index types OmniGraph exposes -| Index | Use | Notes | -|---|---|---| -| **BTREE scalar** | range / equality on any scalar | created on `@key`, `@index(...)`, and on key columns by `ensure_indices()` | -| **Inverted (FTS)** | `search`, `fuzzy`, `match_text`, `bm25` | created on text columns referenced by FTS queries | -| **Vector** | `nearest()` k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as FixedSizeList(Float32, dim) | - -### 4.2 L2 — OmniGraph orchestration -- `ensure_indices()` / `ensure_indices_on(branch)` — idempotent build of BTREE + inverted indexes for the current head; safe to re-run. -- Indexes are built on the *branch head* (not on a snapshot), so reads always see the current index state. -- **Lazy branch forking for indexes**: a branch that hasn't mutated a sub-table doesn't need its own index — the main lineage's index is reused until the first write triggers a copy-on-write fork. -- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column. - -### 4.3 L2 — Graph topology index (`graph_index/mod.rs`) -This is OmniGraph-specific (not Lance): -- `TypeIndex`: dense `u32 ↔ String id` mapping per node type. -- `CsrIndex`: Compressed Sparse Row representation of edges per edge type — `offsets[i]..offsets[i+1]` slices into `targets`. -- `GraphIndex { type_indices, csr (out), csc (in) }` — built on demand from a snapshot's edge tables. -- Cached in `RuntimeCache::graph_indices` (LRU, max 8 entries, keyed by snapshot id + edge table versions). -- Built only when an `Expand` or `AntiJoin` IR op is present in the lowered query, so pure scans skip it. - ---- - -## 5. Embeddings - -OmniGraph has **two** embedding clients with different defaults and purposes: - -### 5.1 Compiler-side client (`omnigraph-compiler/src/embedding.rs`) — query-time normalization -- Default model: `text-embedding-3-small` (OpenAI-style schema) -- Env: `NANOGRAPH_EMBED_MODEL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL` (default `https://api.openai.com/v1`), `NANOGRAPH_EMBEDDINGS_MOCK`, `NANOGRAPH_EMBED_TIMEOUT_MS=30000`, `NANOGRAPH_EMBED_RETRY_ATTEMPTS=4`, `NANOGRAPH_EMBED_RETRY_BACKOFF_MS=200` -- Methods: `embed_text(input, expected_dim)`, `embed_texts(inputs, expected_dim)` -- Mock mode: deterministic FNV-1a + xorshift64 → L2-normalized vectors - -### 5.2 Engine-side client (`omnigraph/src/embedding.rs`) — runtime ingest -- Model: `gemini-embedding-2-preview` -- Env: `GEMINI_API_KEY`, `OMNIGRAPH_GEMINI_BASE_URL` (default Google generativelanguage v1beta), `OMNIGRAPH_EMBED_TIMEOUT_MS=30000`, `OMNIGRAPH_EMBED_RETRY_ATTEMPTS=4`, `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS=200`, `OMNIGRAPH_EMBEDDINGS_MOCK` -- Two task types: `embed_query_text` (RETRIEVAL_QUERY) and `embed_document_text` (RETRIEVAL_DOCUMENT) -- Exponential backoff with retryable detection (timeouts, 429, 5xx) - -### 5.3 Schema integration -Mark a Vector property with `@embed("source_text_property")`. At ingest, the engine pulls the source text and writes the embedding into the vector column. Stored as L2-normalized FixedSizeList(Float32, dim). - -### 5.4 CLI `omnigraph embed` (offline file pipeline) -Operates on **JSONL files** (not on a repo). Three modes (mutually exclusive): -- (default) `fill_missing` — only embed rows whose target field is empty -- `--reembed-all` — overwrite all -- `--clean` — strip embeddings - -Inputs are either a single seed manifest YAML or `--input/--output/--spec`. Selectors `--type T`, `--select T:field=value` filter rows. Streams JSONL → JSONL. - ---- - -## 6. Branching, Commits, Snapshots - -### 6.1 L1 — Lance per-dataset branches -Lance supports branching at the dataset level: a branch is a named lineage of versions, and `fork_branch_from_state(source_branch, target_branch, source_version)` creates a copy-on-write fork. - -### 6.2 L2 — Graph-level branches -OmniGraph builds *graph branches* on top by branching every sub-table coherently: -- `branch_create(name)` / `branch_create_from(target, name)` — disallowed name `main`; fails if branch exists; ensures the schema-apply lock is idle. -- `branch_list()` — returns public branches, **filters internal** `__run__…` and `__schema_apply_lock__` prefixes. -- `branch_delete(name)` — refuses if there are descendants or active runs on the branch; cleans up owned per-branch fragments. -- **Lazy forking**: a branch only forks a sub-table when that sub-table is first mutated on it. Pure-read branches share fragments with their source. -- `sync_branch(branch)` — re-binds the in-memory handle to the latest head of the branch. - -### 6.3 L2 — Commit graph (`db/commit_graph.rs`) -Stored as a Lance dataset `_graph_commits.lance` (with stable row IDs): -``` -GraphCommit { - graph_commit_id: ULID, - manifest_branch: Option, - manifest_version: u64, - parent_commit_id: Option, - merged_parent_commit_id: Option, // populated for merge commits - actor_id: Option, - created_at: i64 (microseconds since epoch), -} -``` -- Every successful publish (load / change / merge / schema_apply / publish_run) appends one commit. -- Merge commits have two parents; linear commits have one. -- `_graph_commit_actors.lance` — optional separate actor map (created on demand). -- API: `list_commits(branch)`, `get_commit(id)`, `head_commit_id_for_branch(branch)`. - -### 6.4 L2 — Snapshots & time travel -- `snapshot()` — current snapshot for the bound branch; cached. -- `snapshot_of(target)` — snapshot at a `ReadTarget` (branch | snapshot id). -- `snapshot_at_version(v: u64)` — historical snapshot from any manifest version. -- `entity_at(table_key, id, version)` — single-entity time travel without building a full snapshot. -- A `Snapshot` is a `(version, HashMap)` — cheap to build, snapshot-isolated cross-table reads. - -### 6.5 L2 — Internal system branches -Filtered from `branch_list()` but visible to internals: -- `__run__` — ephemeral isolation branch for a transactional run. -- `__schema_apply_lock__` — serializes schema migrations. - ---- - -## 7. Runs (transactional graph mutations) - -`db/run_registry.rs` + run lifecycle in `db/omnigraph.rs`. Stored in `_graph_runs.lance` and `_graph_run_actors.lance`. - -### 7.1 RunRecord -``` -RunRecord { - run_id: RunId (ULID), - target_branch: String, // where the run will publish - run_branch: "__run__", // ephemeral isolation branch - base_snapshot_id: String, - base_manifest_version: u64, - operation_hash: Option, // idempotency key - actor_id: Option, - status: Running | Published | Failed | Aborted, - published_snapshot_id: Option, - created_at, updated_at: i64 (microseconds), -} -``` - -### 7.2 Lifecycle -1. `begin_run(target_branch, op_hash)` / `begin_run_as(target_branch, op_hash, actor_id)` — forks `__run__` from the target's current head, appends a `RunRecord`. -2. Mutations on `run_branch` (via the normal write APIs) — isolated from concurrent activity on the target. -3. `publish_run(id)` / `publish_run_as(id, actor)`: - - **Fast path**: if the target hasn't moved since `base_snapshot_id`, promote the run snapshot directly. - - **Merge path**: if it has moved, perform a three-way merge (see §8) into the target. - - On success: `status = Published`, `published_snapshot_id` set, run branch cleaned up asynchronously. -4. `abort_run(id)` / `fail_run(id)` — terminal; cleans up run branch best-effort. - -### 7.3 Idempotency -`operation_hash` is an optional field clients can use to detect a duplicate `begin_run` retry. - -### 7.4 Cleanup -`cleanup_terminal_run_branches_for_target(branch)` is called as branches change; failures are swallowed (lazy cleanup on next branch op). - ---- - -## 8. Merging (three-way) and Conflicts - -`exec/merge.rs`. - -### 8.1 Strategy -Ordered, row-by-row cursor merge: -- `OrderedTableCursor` scans each table sorted by `id` and supports peek/pop matching. -- `StagedTableWriter` buffers `MERGE_STAGE_BATCH_ROWS = 8192` rows into a temp Lance dataset (`OMNIGRAPH_MERGE_STAGING_DIR`). -- The merge runs per sub-table; results are published as one atomic manifest update. - -### 8.2 Outcome enum -`MergeOutcome { AlreadyUpToDate | FastForward | Merged }` - -### 8.3 Conflict types (`error.rs`) -``` -MergeConflictKind: - DivergentInsert // same id inserted on both branches - DivergentUpdate // updated differently on both branches - DeleteVsUpdate // one side deletes, other updates - OrphanEdge // edge references a node deleted by the other side - UniqueViolation - CardinalityViolation - ValueConstraintViolation -``` -Returned as `OmniError::MergeConflicts(Vec)`. The HTTP server surfaces this as a 409 with structured `merge_conflicts[]` (top 3 + "+N more"). - ---- - -## 9. Change Detection / Diff - -`changes/mod.rs`. Three-level algorithm: -1. **Manifest diff**: skip sub-tables whose `(table_version, table_branch)` is unchanged. -2. **Lineage check**: - - Same branch lineage → fast path: use the per-row `_row_last_updated_at_version` column to classify Insert/Update/Delete. - - Different lineages → ID-based streaming comparison. -3. **Row-level diff**: streaming, no full materialization. - -Public API: -- `diff_between(from: ReadTarget, to: ReadTarget, filter: Option) -> ChangeSet` -- `diff_commits(from_commit_id, to_commit_id, filter)` — cross-branch safe. - -Types: -``` -ChangeOp: Insert | Update | Delete -EntityKind: Node | Edge -EntityChange { table_key, kind, type_name, id, op, manifest_version, endpoints?: {src, dst} } -ChangeFilter { kinds?, type_names?, ops? } -ChangeSet { from_version, to_version, branch?, changes[], stats } -``` - ---- - -## 10. Query Execution - -`exec/query.rs`. Pipeline: -1. Parse + typecheck via `omnigraph-compiler`. -2. Lower to IR. -3. If `Expand` or `AntiJoin` is present, build (or fetch from `RuntimeCache`) a `GraphIndex`. -4. Run `execute_query` against the snapshot. - -### 10.1 Multi-modal search modes (`SearchMode`) -The executor recognizes three modes that may be combined in a single query: -- **`nearest`** — vector ANN (uses Lance vector index; `LIMIT` required). -- **`bm25`** — BM25 over an inverted index. -- **`rrf`** — Reciprocal Rank Fusion of two rankings, with k (default 60). - -Hybrid example: `order { rrf(nearest($d.embedding, $q), bm25($d.body, $q_text)) desc } limit 20`. - -### 10.2 Joins / set operations -- Joins are implicit: MATCH bindings + traversals are implemented as scans + CSR/CSC lookups. -- `not { … }` lowers to an `AntiJoin` over the inner pipeline. - -### 10.3 Scoped reads -- `query(target, source, name, params)` — at any branch or snapshot. -- `run_query_at(version, …)` — direct historical query at a manifest version. - -### 10.4 Concurrency -- Snapshot isolation per query: all reads inside a query use the same `Snapshot`. -- Readers and writers on different branches don't block each other. - ---- - -## 11. Mutations & Loading - -### 11.1 Mutation execution (`exec/mutation.rs`) -Resolves expression values to literals, converts to typed Arrow arrays (`literal_to_typed_array(lit, DataType, num_rows)`), then writes: -- `insert` → Lance `WriteMode::Append` -- `update` → Lance `merge_insert(WhenMatched::Update)` -- `delete` → Lance `merge_insert(WhenMatched::Delete)` (logical) or filtered overwrite. - -Multi-statement mutations are atomic at the manifest commit boundary. - -### 11.2 Bulk loader (`loader/mod.rs`) -- **JSONL only** in v1, with two record shapes: - - Node: `{"type":"NodeType", "data":{…}}` - - Edge: `{"edge":"EdgeType", "from":"src_id", "to":"dst_id", "data":{…}}` -- Lines starting with `//` are treated as comments. -- Schema validation on every row (typecheck, required props, blob base64 decoding). -- Edge endpoint resolution by node `@key`. - -### 11.3 Load modes (`LoadMode`) -| Mode | Semantics | -|---|---| -| `Overwrite` | Replace all data in the target tables on the branch | -| `Append` | Strict insert; duplicates error | -| `Merge` | Upsert by id (`merge_insert`) | - -### 11.4 `load` vs `ingest` -- `load(branch, data, mode)` — direct load to a branch. -- `ingest(branch, from, data, mode)` — branch-creating, transactional load: - 1. If target advanced since the run started, fork a fresh run branch from `from`. - 2. Load into the run branch (Append). - 3. If target hasn't moved, fast-publish; otherwise abort. -- Returns `IngestResult { branch, base_branch, branch_created, mode, tables[] }`. -- `ingest_as(actor_id)` records the actor on the resulting commit. - -### 11.5 Embeddings during load -If a node type has `@embed` properties, the loader calls the engine embedding client (Gemini, RETRIEVAL_DOCUMENT) per row to populate the vector column. - ---- - -## 12. Maintenance: Optimize & Cleanup - -`db/omnigraph/optimize.rs`. - -### 12.1 `optimize_all_tables(db)` — non-destructive -- Lance `compact_files()` on every node + edge table on `main`. -- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests. -- Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8). -- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }]`. - -### 12.2 `cleanup_all_tables(db, options)` — destructive -- Lance `cleanup_old_versions()` per table. -- Removes manifests (and their unique fragments) older than the retention policy. -- `CleanupPolicyOptions { keep_versions: Option, older_than: Option }` — at least one is required. -- Returns `[TableCleanupStats { table_key, bytes_removed, old_versions_removed }]`. -- CLI guards with `--confirm`; without it, prints a preview line. - -### 12.3 Tombstones -Logical sub-table delete markers in `__manifest`; `tombstone_object_id(table_key, version)` excludes a sub-table version from snapshot reconstruction. - ---- - -## 13. Authorization (Cedar policy) - -OmniGraph integrates AWS Cedar (`cedar-policy = 4.9`) for ABAC. - -### 13.1 Policy actions -1. `read` — query / snapshot / list branches & commits -2. `export` — NDJSON export -3. `change` — mutations -4. `schema_apply` — apply schema migrations -5. `branch_create` -6. `branch_delete` -7. `branch_merge` -8. `run_publish` -9. `run_abort` -10. `admin` — reserved - -### 13.2 Scope kinds -- `branch_scope` — applied to source branch (`read`, `export`, `change`) -- `target_branch_scope` — applied to destination (`schema_apply`, branch ops, run ops) -- `protected_branches` — named list with special rules; rule scopes are `any | protected | unprotected` - -### 13.3 Configuration -`omnigraph.yaml`: -```yaml -policy: - file: ./policy.yaml # Cedar rules + groups - tests: ./policy.tests.yaml # declarative test cases -``` -Each rule must use exactly one of `branch_scope` or `target_branch_scope`. - -### 13.4 CLI -- `omnigraph policy validate` — parse + count actors, exit 1 on parse error. -- `omnigraph policy test` — run cases in `policy.tests.yaml`, exit 1 on any expectation mismatch. -- `omnigraph policy explain --actor … --action … [--branch …] [--target-branch …]` — show decision and matched rule. - -### 13.5 Server enforcement -Every mutating endpoint calls `authorize_request()` *before* the handler runs; decisions are logged with actor / action / branch / outcome / matched rule. - ---- - -## 14. HTTP Server (`omnigraph-server`) - -Axum 0.8 + tokio + utoipa-generated OpenAPI. Single repo per process; deploy multiple processes for multi-tenant. - -### 14.1 Endpoint inventory - -| Method | Path | Auth | Action | Handler | -|---|---|---|---|---| -| GET | `/healthz` | none | — | `server_health` | -| GET | `/openapi.json` | none | — | `server_openapi` (strips security if auth disabled) | -| GET | `/snapshot?branch=` | bearer + `read` | snapshot of branch | `server_snapshot` | -| POST | `/read` | bearer + `read` | run named query | `server_read` | -| POST | `/export` | bearer + `export` | NDJSON stream | `server_export` | -| POST | `/change` | bearer + `change` | mutation | `server_change` | -| GET | `/schema` | bearer + `read` | get current `.pg` source | `server_schema_get` | -| POST | `/schema/apply` | bearer + `schema_apply` (target=`main`) | migrate | `server_schema_apply` | -| POST | `/ingest` | bearer + `branch_create` (if new) + `change` | bulk load | `server_ingest` (32 MB body limit) | -| GET | `/branches` | bearer + `read` | list branches | `server_branch_list` | -| POST | `/branches` | bearer + `branch_create` | create | `server_branch_create` | -| DELETE | `/branches/{branch}` | bearer + `branch_delete` | delete | `server_branch_delete` | -| POST | `/branches/merge` | bearer + `branch_merge` | merge `source → target` | `server_branch_merge` | -| GET | `/runs` | bearer + `read` | list | `server_run_list` | -| GET | `/runs/{run_id}` | bearer + `read` | show | `server_run_show` | -| POST | `/runs/{run_id}/publish` | bearer + `run_publish` | publish | `server_run_publish` | -| POST | `/runs/{run_id}/abort` | bearer + `run_abort` | abort | `server_run_abort` | -| GET | `/commits?branch=` | bearer + `read` | list | `server_commit_list` | -| GET | `/commits/{commit_id}` | bearer + `read` | show | `server_commit_show` | - -### 14.2 Streaming -Only `/export` streams (`application/x-ndjson`, MPSC channel + `Body::from_stream`). Everything else is buffered JSON. - -### 14.3 Error model -Uniform `ErrorOutput { error, code?, merge_conflicts[] }` with `code ∈ unauthorized | forbidden | bad_request | not_found | conflict | internal`. Merge conflicts attach structured `MergeConflictOutput { table_key, row_id?, kind, message }`. - -HTTP status codes used: 200, 400, 401, 403, 404, 409, 500. - -### 14.4 Body limits -- Default: 1 MB -- `/ingest`: 32 MB - -### 14.5 Auth model (`bearer + SHA-256`) -- Tokens are SHA-256 hashed on startup; plaintext is never persisted in memory. -- Constant-time comparison via `subtle::ConstantTimeEq`. -- Three sources, in precedence: - 1. `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` — AWS Secrets Manager (build with `--features aws`) - 2. `OMNIGRAPH_SERVER_BEARER_TOKENS_FILE` or `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON` — JSON `{actor_id: token, …}` - 3. `OMNIGRAPH_SERVER_BEARER_TOKEN` — single legacy token, actor `default` -- If no tokens configured, server runs unauthenticated (local dev) and `/openapi.json` strips the security scheme. - -### 14.6 Tracing & observability -- `tower_http::TraceLayer::new_for_http()` -- Policy decisions logged at INFO level with actor, action, branch, decision, matched rule -- Startup logs: token source name, repo URI, bind address -- Graceful SIGINT shutdown - -### 14.7 Not implemented (by design or "TBD") -- CORS — not configured; add `tower_http::cors` if needed. -- Rate limiting — none. -- Pagination — none (commits/branches/runs return everything; export streams). -- Multi-tenant routing — one repo per process. - ---- - -## 15. CLI (`omnigraph`) - -13 top-level command families, 40+ subcommands. All commands accept either a positional `URI`, `--uri`, or a `--target ` resolved against `omnigraph.yaml`. - -### 15.1 Top-level commands -| Command | Purpose | -|---|---| -| `init` | `--schema ` → initialize a repo (also scaffolds `omnigraph.yaml` if missing) | -| `load` | bulk load a branch (`--mode overwrite|append|merge`) | -| `ingest` | branch-creating transactional load (`--from `) | -| `read` | run named query (params via `--params`, `--params-file`, or alias args) | -| `change` | run mutation query | -| `snapshot` | print current snapshot (per-table version + row count) | -| `export` | dump to JSONL on stdout (`--type T`, `--table K` filters) | -| `branch create | list | delete | merge` | branching ops | -| `commit list | show` | inspect commit graph | -| `run list | show | publish | abort` | transactional run ops | -| `schema plan | apply | show (alias: get)` | migrations | -| `query lint | check` | offline / repo-backed validation | -| `optimize` | non-destructive Lance compaction | -| `cleanup --keep N --older-than 7d --confirm` | destructive version GC | -| `embed` | offline JSONL embedding pipeline | -| `policy validate | test | explain` | Cedar tooling | -| `version` / `-v` | print `omnigraph 0.3.x` | - -### 15.2 `omnigraph.yaml` schema -```yaml -project: { name } -graphs: - : - uri: - bearer_token_env: -server: - graph: - bind: -cli: - graph: - branch: - output_format: json|jsonl|csv|kv|table - table_max_column_width: 80 - table_cell_layout: truncate|wrap -query: - roots: [, …] # search path for .gq files -auth: - env_file: ./.env.omni -aliases: - : - command: read|change - query: - name: - args: [, …] - graph: - branch: - format: -policy: - file: ./policy.yaml -``` - -### 15.3 Output formats (read command) -- `json` — pretty-printed object with metadata + rows -- `jsonl` — one metadata line then one JSON object per row -- `csv` — RFC 4180-ish quoting -- `table` — fitted text table, honors `table_max_column_width` + `table_cell_layout` -- `kv` — grouped per-row key/value blocks - -### 15.4 Param resolution -Precedence (high to low): explicit `--params` / `--params-file`, alias positional args, omnigraph.yaml defaults. JS-safe-integer handling is built in (`is_js_safe_integer_i64`, `JS_MAX_SAFE_INTEGER_U64`) so 64-bit ids round-trip safely through JSON clients. - -### 15.5 Bearer token resolution (CLI) -1. `graphs..bearer_token_env` -2. `OMNIGRAPH_BEARER_TOKEN` global env -3. `auth.env_file` referenced `.env` - -### 15.6 Duration parsing (cleanup) -`s | m | h | d | w` units, e.g. `--older-than 7d`. - ---- - -## 16. Audit / Actor tracking - -- `Omnigraph::audit_actor_id: Option` is the actor in effect. -- `_as` variants of every write API let callers override the actor: `begin_run_as`, `publish_run_as`, `ingest_as`, `mutate_as`, `branch_merge_as`, etc. -- Actor IDs are persisted both on `RunRecord.actor_id` and on `GraphCommit.actor_id`, with optional split storage in `_graph_commit_actors.lance` and `_graph_run_actors.lance`. -- HTTP server uses the bearer-token actor automatically; CLI uses the local user / explicit env (no implicit actor). - ---- - -## 17. Concurrency Model - -- **MVCC**: every Lance write bumps a per-dataset version; the OmniGraph manifest version coordinates which sub-table versions are visible together. -- **Snapshot isolation**: a query holds one `Snapshot` for its lifetime; concurrent writes don't leak in. -- **Cross-branch isolation**: copy-on-write means readers and writers on different branches don't block each other. -- **Run isolation**: each transactional run lives on its own `__run__` branch. -- **Schema-apply lock**: `__schema_apply_lock__` system branch serializes schema migrations. -- **Fail-points** (`failpoints` cargo feature): `failpoints::maybe_fail("operation.step")?` in `branch_create`, publish, etc., for deterministic failure injection in tests. - ---- - -## 18. Error Taxonomy - -`omnigraph::error::OmniError`: -- `Compiler(...)` — schema/query parse/typecheck errors -- `Lance(String)` — storage layer -- `DataFusion(String)` — execution layer -- `Io(io::Error)` -- `Manifest(ManifestError { kind: BadRequest|NotFound|Conflict|Internal, … })` -- `MergeConflicts(Vec)` - -Compiler-side `NanoError` covers parse / catalog / type / plan / execution / arrow / lance / IO / manifest / unique-constraint, each with structured spans (`SourceSpan { start, end }`) for ariadne-style diagnostics. - ---- - -## 19. Result Serialization - -`omnigraph_compiler::result::QueryResult`: -- `to_arrow_ipc()` — efficient binary -- `to_sdk_json()` — JS-safe JSON (large i64 wrapped in metadata) -- `to_rust_json()` — Rust-friendly JSON -- `batches()` — direct Arrow `RecordBatch` access - -Mutation results: `{ affectedNodes: usize, affectedEdges: usize }` (also exposed as a tiny Arrow batch). - ---- - -## 20. Deployment - -### 20.1 Binary install -- `curl -fsSL https://raw.githubusercontent.com/ModernRelay/omnigraph/main/scripts/install.sh | bash` - - `RELEASE_CHANNEL=stable|edge`, `VERSION=v0.x.y`, `INSTALL_DIR=~/.local/bin` - - SHA-256 checksum verification - - Auto-fallback stable → edge if a tagged release isn't published yet -- Homebrew: `brew tap ModernRelay/tap && brew install ModernRelay/tap/omnigraph` -- Source: `scripts/install-source.sh` (clones + cargo build) -- Manual: `cargo build --release --locked -p omnigraph-cli -p omnigraph-server` - -### 20.2 Release artifacts (per release) -- `omnigraph-linux-x86_64.tar.gz` -- `omnigraph-macos-x86_64.tar.gz` -- `omnigraph-macos-arm64.tar.gz` -- Each archive contains both `omnigraph` and `omnigraph-server`, plus a `.sha256` sidecar. - -### 20.3 Container -- `docker build -t omnigraph-server:local .` -- Base: `public.ecr.aws/debian/debian:bookworm-slim` (avoids Docker Hub rate limits — switched in v0.3.0) -- Entrypoint env: `OMNIGRAPH_TARGET_URI` or `OMNIGRAPH_CONFIG` (+ optional `OMNIGRAPH_TARGET`), `OMNIGRAPH_BIND` (default `0.0.0.0:8080`) -- Healthcheck: `curl http://localhost:8080/healthz`, 30 s interval / 5 s timeout / 10 s start-period / 3 retries. - -### 20.4 Local RustFS one-shot bootstrap -`scripts/local-rustfs-bootstrap.sh` (~422 lines): -- Starts RustFS in Docker on `127.0.0.1:9000` -- Creates a bucket -- Initializes an S3-backed repo (`s3://…`) -- Loads the checked-in fixture -- Launches `omnigraph-server` on `127.0.0.1:8080` -- Tunables: `WORKDIR`, `RUSTFS_CONTAINER_NAME`, `BUCKET`, `PREFIX`, `RESET_REPO`, `BIND`, `FORCE_BUILD`, AWS creds. -- Prefers rolling `edge` binaries; falls back to source build. - -### 20.5 Build variants -| Variant | Build | Backend extras | -|---|---|---| -| default | `cargo build --release` | local FS, S3, JSON-file/env tokens | -| aws | `cargo build --release --features aws` | + AWS Secrets Manager bearer-token backend | - -The aws variant pulls ~150 transitive dependencies (~30–60 s compile time). - ---- - -## 21. CI Matrix - -`.github/workflows/`: -- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repo PRs. -- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest. -- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_repo_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`. -- **release-edge.yml**: on every push to main, retags `edge`, builds Linux/macOS-Intel/macOS-arm64 archives + sha256, publishes a rolling prerelease. -- **release.yml**: on `v*` tags, builds the 3-platform matrix and updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by force-pushing the regenerated formula to `ModernRelay/homebrew-tap`. -- **package.yml**: manual ECR image build; emits two image tags per commit (``, `-aws`) via CodeBuild. - ---- - -## 22. Release History (highlights) - -### v0.2.0 — operability -- Schema planning + apply (CLI + `POST /schema/apply`) -- Concurrent-write blocking on schema apply, index preservation, source-head verification -- Multi-statement atomic mutations -- `/openapi.json` endpoint -- Streamed `/export`, load summaries - -### v0.2.1 — query fixes -- `omnigraph query lint` / `query check` -- Aggregate execution -- Nullable-param omission and explicit null -- Traversal planning + join alignment robustness -- Config terminology: `targets:` → `graphs:` - -### v0.2.2 — packaging -- Crate rename: `omnigraph` → `omnigraph-engine` on crates.io (CLI surface unchanged) - -### v0.3.0 — AWS + security hardening -- `--features aws` build + AWS Secrets Manager bearer-token backend -- SHA-256 token hashing, constant-time comparison, server-authoritative actor IDs -- `GET /schema` + `omnigraph schema get` -- Static OpenAPI spec auto-synced by CI -- Stricter run-branch hygiene (`__run__…` filtered + auto-deleted) -- Dockerfile base → ECR Public (`public.ecr.aws/debian/debian`) -- LANCE memory pool default raised to 1 GB -- Homebrew tap update automation -- **Breaking**: pre-0.3 repos must be reinitialized (schema state changes) - -### v0.3.1 — current -- Parallel per-type load writes -- `omnigraph optimize` / `cleanup` CLI commands and runtime APIs -- Dst-id deduplication during edge expand hydration - ---- - -## 23. Capability Matrix — "Lens by default vs. added by OmniGraph" - -| Capability | L1 (Lance default) | L2 (OmniGraph adds) | -|---|---|---| -| Columnar storage on object store | ✅ Arrow/Lance | URI normalization, S3 env-var plumbing | -| Per-dataset versioning + time travel | ✅ | `snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables | -| Per-dataset branches | ✅ | **Graph-level** branches (atomic across all sub-tables), lazy fork, system branch filtering | -| Atomic single-dataset commits | ✅ | **Atomic multi-dataset publish** via `__manifest` + `ManifestBatchPublisher` | -| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency | -| Cleanup (`cleanup_old_versions`) | ✅ | `omnigraph cleanup` with `--keep` / `--older-than` policy | -| BTREE / inverted (FTS) / vector indexes | ✅ | `ensure_indices` builds them on every relevant column; idempotent; lazy across branches | -| `merge_insert` upsert | ✅ | `LoadMode::Merge`, mutation `update`/`insert`/`delete` lowering | -| Vector search | ✅ | `nearest()` query op; embedding pipeline (Gemini / OpenAI clients); `@embed` in schema | -| Full-text search | ✅ | `search/fuzzy/match_text/bm25` query ops | -| Hybrid ranking | — | `rrf(...)` Reciprocal Rank Fusion in one runtime | -| Graph traversal | — | CSR/CSC topology index, `Expand` IR op, variable-length hops, `not { }` anti-join | -| Schema language | — | `.pg` + Pest grammar + catalog + interfaces + constraints + annotations | -| Query language | — | `.gq` + Pest grammar + IR + lowering + linter | -| Schema migration planning | — | `plan_schema_migration` + `apply_schema` step types + `__schema_apply_lock__` | -| Commit graph (DAG) across whole repo | — | `_graph_commits.lance` with linear + merge parents, ULID ids, actor map | -| Transactional runs | — | `_graph_runs.lance`, `__run__` ephemeral branches, fast-path & merge-path publish | -| Three-way row-level merge | — | `OrderedTableCursor` + `StagedTableWriter`, structured `MergeConflictKind` | -| Change feeds | — | `diff_between` / `diff_commits` with manifest fast path + ID streaming | -| Cedar policy | — | 10 actions, branch / target_branch / protected scopes, validate/test/explain CLI | -| HTTP server | — | Axum, OpenAPI via utoipa, bearer auth (SHA-256, AWS Secrets Manager option), policy gating, NDJSON streaming export | -| CLI with config | — | `omnigraph.yaml`, aliases, multi-format output (json/jsonl/csv/kv/table) | -| Audit / actor tracking | — | `_as` write APIs + actor maps in commit & run datasets | -| Local RustFS bootstrap | — | `scripts/local-rustfs-bootstrap.sh` one-shot S3-backed dev environment | - ---- - -## 24. Constants & Tunables (cheat sheet) - -| Name | Value | Where | -|---|---|---| -| `MANIFEST_DIR` | `__manifest` | `db/manifest/layout.rs` | -| Commit graph dir | `_graph_commits.lance` | `db/commit_graph.rs` | -| Run registry dir | `_graph_runs.lance` | `db/run_registry.rs` | -| Run branch prefix | `__run__` | `db/run_registry.rs` | -| Schema apply lock | `__schema_apply_lock__` | `db/mod.rs` | -| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | `exec/merge.rs` | -| Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | `db/omnigraph/optimize.rs` | -| Graph index cache size | `8` (LRU) | `runtime_cache.rs` | -| Default body limit | `1 MB` | `omnigraph-server/lib.rs` | -| Ingest body limit | `32 MB` | `omnigraph-server/lib.rs` | -| Engine embed model | `gemini-embedding-2-preview` | `omnigraph/embedding.rs` | -| Compiler embed model | `text-embedding-3-small` | `omnigraph-compiler/embedding.rs` | -| Embed timeout | `30 000 ms` | both clients | -| Embed retries | `4` | both clients | -| Embed retry backoff | `200 ms` | both clients | -| LANCE memory pool default | `1 GB` (raised in v0.3.0) | runtime | - ---- - -## 25. Quick-reference: typical flows +## Quick-reference flows ```bash # Initialize an S3-backed repo @@ -956,39 +137,48 @@ omnigraph policy explain --actor act-alice --action change --branch main --- -## 26. Keeping this file current (agent responsibilities) +## Capability matrix — "Lens by default vs. added by OmniGraph" -When you make a change in this repo, also update AGENTS.md if the change touches any of: - -- **Schema language** (`.pg` grammar, scalar types, constraints, annotations) → §2 -- **Query language** (`.gq` grammar, MATCH/RETURN/ORDER, search funcs, mutations, IR ops, lint codes) → §3 -- **Indexes** (BTREE / inverted / vector / graph topology) → §4 -- **Embeddings** (clients, models, env vars, `@embed`) → §5 -- **Branches / commits / snapshots / runs / merge / diff** → §6–§9 -- **Storage layout, manifest schema, URI schemes, env vars** → §1 -- **Mutations / loader / load modes / ingest** → §11 -- **Optimize / cleanup / tombstones** → §12 -- **Cedar policy actions, scopes, CLI** → §13 -- **HTTP server** (endpoint inventory, body limits, auth, error model) → §14 -- **CLI** (commands, `omnigraph.yaml`, output formats, param/token resolution) → §15 -- **Audit / actor tracking** → §16 -- **Concurrency, error taxonomy, result serialization** → §17–§19 -- **Deployment / CI / release** → §20–§22 -- **Constants and tunables** → §24 -- **Capability matrix and example flows** → §23, §25 - -Rules of engagement: - -1. **Update in the same change.** If a PR adds a new endpoint, query function, CLI flag, env var, constant, or schema construct, the same PR must update the corresponding section here. Never split documentation drift into a follow-up. -2. **Bump the version surveyed line** at the top when a change crosses a release boundary (e.g. v0.3.1 → v0.3.2). Add a new bullet under §22 Release History describing the user-visible delta. -3. **Don't lie.** If a section is now wrong but you don't have time to rewrite it fully, replace the wrong line with `*(stale — needs update after )*` rather than leaving silently incorrect text. Then fix it ASAP. -4. **Match section numbering.** Don't renumber sections; new content goes into the closest existing section, or — if it doesn't fit — as a new sub-heading inside one. Renumbering breaks every cross-reference in this file and in linked CLAUDE.md. -5. **Tables before prose.** Endpoint lists, env vars, constants, and step types are easier to keep current as tables. Prefer tables when adding new entries. -6. **Cross-check on schema/query/IR changes.** Any edit to `schema.pest`, `query.pest`, `ir/lower.rs`, `query/typecheck.rs`, or `query/lint.rs` should trigger a re-read of §2, §3, and §10 to confirm they still describe reality. -7. **Re-verify before recommending.** If you cite a flag, env var, or endpoint to the user, grep for it in the source first. This file is a living guide, not a frozen contract. - -> The CLAUDE.md file in this repo is a symlink to AGENTS.md — there is exactly one source of truth. Do not edit CLAUDE.md directly; edit AGENTS.md and the link will reflect the change. +| Capability | L1 (Lance default) | L2 (OmniGraph adds) | +|---|---|---| +| Columnar storage on object store | ✅ Arrow/Lance | URI normalization, S3 env-var plumbing | +| Per-dataset versioning + time travel | ✅ | `snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables | +| Per-dataset branches | ✅ | **Graph-level** branches (atomic across all sub-tables), lazy fork, system branch filtering | +| Atomic single-dataset commits | ✅ | **Atomic multi-dataset publish** via `__manifest` + `ManifestBatchPublisher` | +| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency | +| Cleanup (`cleanup_old_versions`) | ✅ | `omnigraph cleanup` with `--keep` / `--older-than` policy | +| BTREE / inverted (FTS) / vector indexes | ✅ | `ensure_indices` builds them on every relevant column; idempotent; lazy across branches | +| `merge_insert` upsert | ✅ | `LoadMode::Merge`, mutation `update`/`insert`/`delete` lowering | +| Vector search | ✅ | `nearest()` query op; embedding pipeline (Gemini / OpenAI clients); `@embed` in schema | +| Full-text search | ✅ | `search/fuzzy/match_text/bm25` query ops | +| Hybrid ranking | — | `rrf(...)` Reciprocal Rank Fusion in one runtime | +| Graph traversal | — | CSR/CSC topology index, `Expand` IR op, variable-length hops, `not { }` anti-join | +| Schema language | — | `.pg` + Pest grammar + catalog + interfaces + constraints + annotations | +| Query language | — | `.gq` + Pest grammar + IR + lowering + linter | +| Schema migration planning | — | `plan_schema_migration` + `apply_schema` step types + `__schema_apply_lock__` | +| Commit graph (DAG) across whole repo | — | `_graph_commits.lance` with linear + merge parents, ULID ids, actor map | +| Transactional runs | — | `_graph_runs.lance`, `__run__` ephemeral branches, fast-path & merge-path publish | +| Three-way row-level merge | — | `OrderedTableCursor` + `StagedTableWriter`, structured `MergeConflictKind` | +| Change feeds | — | `diff_between` / `diff_commits` with manifest fast path + ID streaming | +| Cedar policy | — | 10 actions, branch / target_branch / protected scopes, validate/test/explain CLI | +| HTTP server | — | Axum, OpenAPI via utoipa, bearer auth (SHA-256, AWS Secrets Manager option), policy gating, NDJSON streaming export | +| CLI with config | — | `omnigraph.yaml`, aliases, multi-format output (json/jsonl/csv/kv/table) | +| Audit / actor tracking | — | `_as` write APIs + actor maps in commit & run datasets | +| Local RustFS bootstrap | — | `scripts/local-rustfs-bootstrap.sh` one-shot S3-backed dev environment | --- -*End of guide — sections 0–25 reflect code at branch `ragnorc/omnigraph-spec`, tag-equivalent v0.3.1. Section 26 is the maintenance contract for future agents.* +## Maintenance contract for agents + +When you change something user-visible, **update the relevant `docs/.md` in the same change**. Pointers from this file to that doc must keep working — CI enforces cross-link integrity via `scripts/check-agents-md.sh`. + +Rules: + +1. **Update in the same PR.** New endpoint, query function, CLI flag, env var, constant, schema construct, or invariant: update both the source code and the doc in the same change. Never split documentation drift into a follow-up. +2. **Bump version on release.** When a release boundary crosses (e.g. v0.3.1 → v0.3.2), update the version line at the top of this file and add a `docs/releases/.md` describing the user-visible delta. Update [docs/architecture.md](docs/architecture.md) only if the architecture itself changed. +3. **Don't lie.** If a section becomes wrong but you can't rewrite it fully right now, replace the wrong line with `*(stale — needs update after )*` rather than leaving silently incorrect text. Then fix it ASAP. +4. **Re-verify before recommending.** If you cite a flag, env var, endpoint, or constant to the user or in code, grep for it in source first. Memory and docs go stale; the code is authoritative. +5. **Keep AGENTS.md a map, not an encyclopedia.** New deep content goes into `docs/`. Add an entry to "Where to find each topic" instead of pasting prose into this file. The "Always-on rules" section is the exception — it's for invariants that should always be in scope. +6. **Re-read on schema/query/IR changes.** Edits to `schema.pest`, `query.pest`, `ir/lower.rs`, `query/typecheck.rs`, or `query/lint.rs` should trigger a re-read of [docs/schema-language.md](docs/schema-language.md), [docs/query-language.md](docs/query-language.md), and [docs/execution.md](docs/execution.md) to confirm they still describe reality. + +CI check: `scripts/check-agents-md.sh` verifies that every `docs/*.md` link in this file resolves and that every doc in the canonical set is linked. Run it locally before opening a PR if you've moved or renamed docs. diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 0000000..4fc8867 --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,69 @@ +# Architecture + +OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a single `omnigraph.yaml`. + +## Stack + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ CLI (omnigraph) HTTP Server (omnigraph-server, Axum) │ +│ - 13 cmd families - REST + OpenAPI │ +│ - Aliases, configs - Bearer auth + Cedar policy │ +└──────────────────────────────┬───────────────────────────────────┘ + │ +┌──────────────────────────────▼───────────────────────────────────┐ +│ omnigraph-compiler │ +│ - Pest grammars: schema.pest, query.pest │ +│ - Catalog (Node/Edge/Interface types) │ +│ - IR + lowering (NodeScan / Expand / Filter / AntiJoin) │ +│ - Schema migration planner │ +│ - Embedding client (OpenAI-style for query-time normalization) │ +└──────────────────────────────┬───────────────────────────────────┘ + │ +┌──────────────────────────────▼───────────────────────────────────┐ +│ omnigraph (engine) │ +│ - GraphCoordinator + ManifestRepo (__manifest) │ +│ - CommitGraph (_graph_commits.lance) │ +│ - RunRegistry (_graph_runs.lance, __run__ branches) │ +│ - GraphIndex (CSR/CSC) + RuntimeCache (LRU 8) │ +│ - exec::query / mutation / merge │ +│ - Embedding client (Gemini for runtime ingest) │ +└──────────────────────────────┬───────────────────────────────────┘ + │ +┌──────────────────────────────▼───────────────────────────────────┐ +│ Lance 4.x (per-table dataset) │ +│ - Columnar (Arrow) storage, fragments │ +│ - Manifest versions per dataset │ +│ - Per-dataset branches (copy-on-write) │ +│ - Indexes: BTREE, Inverted (FTS/BM25), IVF/HNSW vector │ +│ - merge_insert (upsert), append, delete │ +│ - compact_files, cleanup_old_versions │ +└──────────────────────────────┬───────────────────────────────────┘ + │ +┌──────────────────────────────▼───────────────────────────────────┐ +│ Object store: local FS, S3, RustFS, MinIO, S3-compatible │ +└──────────────────────────────────────────────────────────────────┘ +``` + +## L1 / L2 framing + +Throughout the docs, capabilities are split into: + +- **L1 — Inherited from Lance**: what OmniGraph gets "for free" by sitting on top of the Lance dataset format (columnar Arrow storage, per-dataset versions and branches, index types, `merge_insert`, `compact_files` / `cleanup_old_versions`). +- **L2 — Added by OmniGraph**: typing (schema language), graph semantics, multi-dataset coordination via `__manifest`, graph-level branches and commits, the `.gq` query language and IR, the topology index, the HTTP server, Cedar policy, the CLI. + +## Concurrency model + +- **MVCC**: every Lance write bumps a per-dataset version; the OmniGraph manifest version coordinates which sub-table versions are visible together. +- **Snapshot isolation**: a query holds one `Snapshot` for its lifetime; concurrent writes don't leak in. +- **Cross-branch isolation**: copy-on-write means readers and writers on different branches don't block each other. +- **Run isolation**: each transactional run lives on its own `__run__` branch. +- **Schema-apply lock**: `__schema_apply_lock__` system branch serializes schema migrations. +- **Fail-points** (`failpoints` cargo feature): `failpoints::maybe_fail("operation.step")?` in `branch_create`, publish, etc., for deterministic failure injection in tests. + +## Workspace crates + +- `omnigraph-compiler` — schema and query grammars, catalog, IR, lowering, type checker, lint, migration planner, OpenAI-style embedding client. +- `omnigraph` (engine, published as `omnigraph-engine` on crates.io since v0.2.2) — the Lance-backed runtime: manifest, commit graph, run registry, snapshot, exec, merge, loader, Gemini embedding client. +- `omnigraph-cli` — the `omnigraph` binary. +- `omnigraph-server` — the `omnigraph-server` binary (Axum HTTP server). diff --git a/docs/audit.md b/docs/audit.md new file mode 100644 index 0000000..591c016 --- /dev/null +++ b/docs/audit.md @@ -0,0 +1,6 @@ +# Audit / Actor tracking + +- `Omnigraph::audit_actor_id: Option` is the actor in effect. +- `_as` variants of every write API let callers override the actor: `begin_run_as`, `publish_run_as`, `ingest_as`, `mutate_as`, `branch_merge_as`, etc. +- Actor IDs are persisted both on `RunRecord.actor_id` and on `GraphCommit.actor_id`, with optional split storage in `_graph_commit_actors.lance` and `_graph_run_actors.lance`. +- HTTP server uses the bearer-token actor automatically; CLI uses the local user / explicit env (no implicit actor). diff --git a/docs/branches-commits.md b/docs/branches-commits.md new file mode 100644 index 0000000..eaa6cab --- /dev/null +++ b/docs/branches-commits.md @@ -0,0 +1,51 @@ +# Branches, Commits, Snapshots + +## L1 — Lance per-dataset branches + +Lance supports branching at the dataset level: a branch is a named lineage of versions, and `fork_branch_from_state(source_branch, target_branch, source_version)` creates a copy-on-write fork. + +## L2 — Graph-level branches + +OmniGraph builds *graph branches* on top by branching every sub-table coherently: + +- `branch_create(name)` / `branch_create_from(target, name)` — disallowed name `main`; fails if branch exists; ensures the schema-apply lock is idle. +- `branch_list()` — returns public branches, **filters internal** `__run__…` and `__schema_apply_lock__` prefixes. +- `branch_delete(name)` — refuses if there are descendants or active runs on the branch; cleans up owned per-branch fragments. +- **Lazy forking**: a branch only forks a sub-table when that sub-table is first mutated on it. Pure-read branches share fragments with their source. +- `sync_branch(branch)` — re-binds the in-memory handle to the latest head of the branch. + +## L2 — Commit graph (`db/commit_graph.rs`) + +Stored as a Lance dataset `_graph_commits.lance` (with stable row IDs): + +``` +GraphCommit { + graph_commit_id: ULID, + manifest_branch: Option, + manifest_version: u64, + parent_commit_id: Option, + merged_parent_commit_id: Option, // populated for merge commits + actor_id: Option, + created_at: i64 (microseconds since epoch), +} +``` + +- Every successful publish (load / change / merge / schema_apply / publish_run) appends one commit. +- Merge commits have two parents; linear commits have one. +- `_graph_commit_actors.lance` — optional separate actor map (created on demand). +- API: `list_commits(branch)`, `get_commit(id)`, `head_commit_id_for_branch(branch)`. + +## L2 — Snapshots & time travel + +- `snapshot()` — current snapshot for the bound branch; cached. +- `snapshot_of(target)` — snapshot at a `ReadTarget` (branch | snapshot id). +- `snapshot_at_version(v: u64)` — historical snapshot from any manifest version. +- `entity_at(table_key, id, version)` — single-entity time travel without building a full snapshot. +- A `Snapshot` is a `(version, HashMap)` — cheap to build, snapshot-isolated cross-table reads. + +## L2 — Internal system branches + +Filtered from `branch_list()` but visible to internals: + +- `__run__` — ephemeral isolation branch for a transactional run. +- `__schema_apply_lock__` — serializes schema migrations. diff --git a/docs/changes.md b/docs/changes.md new file mode 100644 index 0000000..58739e2 --- /dev/null +++ b/docs/changes.md @@ -0,0 +1,24 @@ +# Change Detection / Diff + +`changes/mod.rs`. Three-level algorithm: + +1. **Manifest diff**: skip sub-tables whose `(table_version, table_branch)` is unchanged. +2. **Lineage check**: + - Same branch lineage → fast path: use the per-row `_row_last_updated_at_version` column to classify Insert/Update/Delete. + - Different lineages → ID-based streaming comparison. +3. **Row-level diff**: streaming, no full materialization. + +## Public API + +- `diff_between(from: ReadTarget, to: ReadTarget, filter: Option) -> ChangeSet` +- `diff_commits(from_commit_id, to_commit_id, filter)` — cross-branch safe. + +## Types + +``` +ChangeOp: Insert | Update | Delete +EntityKind: Node | Edge +EntityChange { table_key, kind, type_name, id, op, manifest_version, endpoints?: {src, dst} } +ChangeFilter { kinds?, type_names?, ops? } +ChangeSet { from_version, to_version, branch?, changes[], stats } +``` diff --git a/docs/ci.md b/docs/ci.md new file mode 100644 index 0000000..3424830 --- /dev/null +++ b/docs/ci.md @@ -0,0 +1,10 @@ +# CI / Release Workflows + +`.github/workflows/`: + +- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repo PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`). +- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest. +- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_repo_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`. +- **release-edge.yml**: on every push to main, retags `edge`, builds Linux/macOS-Intel/macOS-arm64 archives + sha256, publishes a rolling prerelease. +- **release.yml**: on `v*` tags, builds the 3-platform matrix and updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by force-pushing the regenerated formula to `ModernRelay/homebrew-tap`. +- **package.yml**: manual ECR image build; emits two image tags per commit (``, `-aws`) via CodeBuild. diff --git a/docs/cli-reference.md b/docs/cli-reference.md new file mode 100644 index 0000000..dee5cfa --- /dev/null +++ b/docs/cli-reference.md @@ -0,0 +1,83 @@ +# CLI Reference (`omnigraph`) + +A reference for the `omnigraph` binary's command surface and `omnigraph.yaml` schema. For a quick-start guide, see [cli.md](cli.md). + +13 top-level command families, 40+ subcommands. All commands accept either a positional `URI`, `--uri`, or a `--target ` resolved against `omnigraph.yaml`. + +## Top-level commands + +| Command | Purpose | +|---|---| +| `init` | `--schema ` → initialize a repo (also scaffolds `omnigraph.yaml` if missing) | +| `load` | bulk load a branch (`--mode overwrite\|append\|merge`) | +| `ingest` | branch-creating transactional load (`--from `) | +| `read` | run named query (params via `--params`, `--params-file`, or alias args) | +| `change` | run mutation query | +| `snapshot` | print current snapshot (per-table version + row count) | +| `export` | dump to JSONL on stdout (`--type T`, `--table K` filters) | +| `branch create \| list \| delete \| merge` | branching ops | +| `commit list \| show` | inspect commit graph | +| `run list \| show \| publish \| abort` | transactional run ops | +| `schema plan \| apply \| show (alias: get)` | migrations | +| `query lint \| check` | offline / repo-backed validation | +| `optimize` | non-destructive Lance compaction | +| `cleanup --keep N --older-than 7d --confirm` | destructive version GC | +| `embed` | offline JSONL embedding pipeline | +| `policy validate \| test \| explain` | Cedar tooling | +| `version` / `-v` | print `omnigraph 0.3.x` | + +## `omnigraph.yaml` schema + +```yaml +project: { name } +graphs: + : + uri: + bearer_token_env: +server: + graph: + bind: +cli: + graph: + branch: + output_format: json|jsonl|csv|kv|table + table_max_column_width: 80 + table_cell_layout: truncate|wrap +query: + roots: [, …] # search path for .gq files +auth: + env_file: ./.env.omni +aliases: + : + command: read|change + query: + name: + args: [, …] + graph: + branch: + format: +policy: + file: ./policy.yaml +``` + +## Output formats (read command) + +- `json` — pretty-printed object with metadata + rows +- `jsonl` — one metadata line then one JSON object per row +- `csv` — RFC 4180-ish quoting +- `table` — fitted text table, honors `table_max_column_width` + `table_cell_layout` +- `kv` — grouped per-row key/value blocks + +## Param resolution + +Precedence (high to low): explicit `--params` / `--params-file`, alias positional args, `omnigraph.yaml` defaults. JS-safe-integer handling is built in (`is_js_safe_integer_i64`, `JS_MAX_SAFE_INTEGER_U64`) so 64-bit ids round-trip safely through JSON clients. + +## Bearer token resolution (CLI) + +1. `graphs..bearer_token_env` +2. `OMNIGRAPH_BEARER_TOKEN` global env +3. `auth.env_file` referenced `.env` + +## Duration parsing (cleanup) + +`s | m | h | d | w` units, e.g. `--older-than 7d`. diff --git a/docs/constants.md b/docs/constants.md new file mode 100644 index 0000000..b10aba0 --- /dev/null +++ b/docs/constants.md @@ -0,0 +1,20 @@ +# Constants & Tunables (cheat sheet) + +| Name | Value | Where | +|---|---|---| +| `MANIFEST_DIR` | `__manifest` | `db/manifest/layout.rs` | +| Commit graph dir | `_graph_commits.lance` | `db/commit_graph.rs` | +| Run registry dir | `_graph_runs.lance` | `db/run_registry.rs` | +| Run branch prefix | `__run__` | `db/run_registry.rs` | +| Schema apply lock | `__schema_apply_lock__` | `db/mod.rs` | +| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | `exec/merge.rs` | +| Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | `db/omnigraph/optimize.rs` | +| Graph index cache size | `8` (LRU) | `runtime_cache.rs` | +| Default body limit | `1 MB` | `omnigraph-server/lib.rs` | +| Ingest body limit | `32 MB` | `omnigraph-server/lib.rs` | +| Engine embed model | `gemini-embedding-2-preview` | `omnigraph/embedding.rs` | +| Compiler embed model | `text-embedding-3-small` | `omnigraph-compiler/embedding.rs` | +| Embed timeout | `30 000 ms` | both clients | +| Embed retries | `4` | both clients | +| Embed retry backoff | `200 ms` | both clients | +| LANCE memory pool default | `1 GB` (raised in v0.3.0) | runtime | diff --git a/docs/embeddings.md b/docs/embeddings.md new file mode 100644 index 0000000..596a6a0 --- /dev/null +++ b/docs/embeddings.md @@ -0,0 +1,31 @@ +# Embeddings + +OmniGraph has **two** embedding clients with different defaults and purposes. + +## Compiler-side client (`omnigraph-compiler/src/embedding.rs`) — query-time normalization + +- Default model: `text-embedding-3-small` (OpenAI-style schema) +- Env: `NANOGRAPH_EMBED_MODEL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL` (default `https://api.openai.com/v1`), `NANOGRAPH_EMBEDDINGS_MOCK`, `NANOGRAPH_EMBED_TIMEOUT_MS=30000`, `NANOGRAPH_EMBED_RETRY_ATTEMPTS=4`, `NANOGRAPH_EMBED_RETRY_BACKOFF_MS=200` +- Methods: `embed_text(input, expected_dim)`, `embed_texts(inputs, expected_dim)` +- Mock mode: deterministic FNV-1a + xorshift64 → L2-normalized vectors + +## Engine-side client (`omnigraph/src/embedding.rs`) — runtime ingest + +- Model: `gemini-embedding-2-preview` +- Env: `GEMINI_API_KEY`, `OMNIGRAPH_GEMINI_BASE_URL` (default Google generativelanguage v1beta), `OMNIGRAPH_EMBED_TIMEOUT_MS=30000`, `OMNIGRAPH_EMBED_RETRY_ATTEMPTS=4`, `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS=200`, `OMNIGRAPH_EMBEDDINGS_MOCK` +- Two task types: `embed_query_text` (RETRIEVAL_QUERY) and `embed_document_text` (RETRIEVAL_DOCUMENT) +- Exponential backoff with retryable detection (timeouts, 429, 5xx) + +## Schema integration + +Mark a Vector property with `@embed("source_text_property")`. At ingest, the engine pulls the source text and writes the embedding into the vector column. Stored as L2-normalized FixedSizeList(Float32, dim). + +## CLI `omnigraph embed` (offline file pipeline) + +Operates on **JSONL files** (not on a repo). Three modes (mutually exclusive): + +- (default) `fill_missing` — only embed rows whose target field is empty +- `--reembed-all` — overwrite all +- `--clean` — strip embeddings + +Inputs are either a single seed manifest YAML or `--input/--output/--spec`. Selectors `--type T`, `--select T:field=value` filter rows. Streams JSONL → JSONL. diff --git a/docs/errors.md b/docs/errors.md new file mode 100644 index 0000000..bd8a2df --- /dev/null +++ b/docs/errors.md @@ -0,0 +1,21 @@ +# Errors and Result Serialization + +## Error taxonomy (`omnigraph::error::OmniError`) + +- `Compiler(...)` — schema/query parse/typecheck errors +- `Lance(String)` — storage layer +- `DataFusion(String)` — execution layer +- `Io(io::Error)` +- `Manifest(ManifestError { kind: BadRequest|NotFound|Conflict|Internal, … })` +- `MergeConflicts(Vec)` + +Compiler-side `NanoError` covers parse / catalog / type / plan / execution / arrow / lance / IO / manifest / unique-constraint, each with structured spans (`SourceSpan { start, end }`) for ariadne-style diagnostics. + +## Result serialization (`omnigraph_compiler::result::QueryResult`) + +- `to_arrow_ipc()` — efficient binary +- `to_sdk_json()` — JS-safe JSON (large i64 wrapped in metadata) +- `to_rust_json()` — Rust-friendly JSON +- `batches()` — direct Arrow `RecordBatch` access + +Mutation results: `{ affectedNodes: usize, affectedEdges: usize }` (also exposed as a tiny Arrow batch). diff --git a/docs/execution.md b/docs/execution.md new file mode 100644 index 0000000..3b65217 --- /dev/null +++ b/docs/execution.md @@ -0,0 +1,76 @@ +# Query Execution, Mutations, and Loading + +## Query execution (`exec/query.rs`) + +Pipeline: + +1. Parse + typecheck via `omnigraph-compiler`. +2. Lower to IR. +3. If `Expand` or `AntiJoin` is present, build (or fetch from `RuntimeCache`) a `GraphIndex`. +4. Run `execute_query` against the snapshot. + +### Multi-modal search modes (`SearchMode`) + +The executor recognizes three modes that may be combined in a single query: + +- **`nearest`** — vector ANN (uses Lance vector index; `LIMIT` required). +- **`bm25`** — BM25 over an inverted index. +- **`rrf`** — Reciprocal Rank Fusion of two rankings, with k (default 60). + +Hybrid example: `order { rrf(nearest($d.embedding, $q), bm25($d.body, $q_text)) desc } limit 20`. + +### Joins / set operations + +- Joins are implicit: MATCH bindings + traversals are implemented as scans + CSR/CSC lookups. +- `not { … }` lowers to an `AntiJoin` over the inner pipeline. + +### Scoped reads + +- `query(target, source, name, params)` — at any branch or snapshot. +- `run_query_at(version, …)` — direct historical query at a manifest version. + +### Concurrency + +- Snapshot isolation per query: all reads inside a query use the same `Snapshot`. +- Readers and writers on different branches don't block each other. + +## Mutation execution (`exec/mutation.rs`) + +Resolves expression values to literals, converts to typed Arrow arrays (`literal_to_typed_array(lit, DataType, num_rows)`), then writes: + +- `insert` → Lance `WriteMode::Append` +- `update` → Lance `merge_insert(WhenMatched::Update)` +- `delete` → Lance `merge_insert(WhenMatched::Delete)` (logical) or filtered overwrite. + +Multi-statement mutations are atomic at the manifest commit boundary. + +## Bulk loader (`loader/mod.rs`) + +- **JSONL only** in v1, with two record shapes: + - Node: `{"type":"NodeType", "data":{…}}` + - Edge: `{"edge":"EdgeType", "from":"src_id", "to":"dst_id", "data":{…}}` +- Lines starting with `//` are treated as comments. +- Schema validation on every row (typecheck, required props, blob base64 decoding). +- Edge endpoint resolution by node `@key`. + +## Load modes (`LoadMode`) + +| Mode | Semantics | +|---|---| +| `Overwrite` | Replace all data in the target tables on the branch | +| `Append` | Strict insert; duplicates error | +| `Merge` | Upsert by id (`merge_insert`) | + +## `load` vs `ingest` + +- `load(branch, data, mode)` — direct load to a branch. +- `ingest(branch, from, data, mode)` — branch-creating, transactional load: + 1. If target advanced since the run started, fork a fresh run branch from `from`. + 2. Load into the run branch (Append). + 3. If target hasn't moved, fast-publish; otherwise abort. +- Returns `IngestResult { branch, base_branch, branch_created, mode, tables[] }`. +- `ingest_as(actor_id)` records the actor on the resulting commit. + +## Embeddings during load + +If a node type has `@embed` properties, the loader calls the engine embedding client (Gemini, RETRIEVAL_DOCUMENT) per row to populate the vector column. See [embeddings.md](embeddings.md). diff --git a/docs/indexes.md b/docs/indexes.md new file mode 100644 index 0000000..ce6c728 --- /dev/null +++ b/docs/indexes.md @@ -0,0 +1,26 @@ +# Indexes + +## L1 — Lance index types OmniGraph exposes + +| Index | Use | Notes | +|---|---|---| +| **BTREE scalar** | range / equality on any scalar | created on `@key`, `@index(...)`, and on key columns by `ensure_indices()` | +| **Inverted (FTS)** | `search`, `fuzzy`, `match_text`, `bm25` | created on text columns referenced by FTS queries | +| **Vector** | `nearest()` k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as FixedSizeList(Float32, dim) | + +## L2 — OmniGraph orchestration + +- `ensure_indices()` / `ensure_indices_on(branch)` — idempotent build of BTREE + inverted indexes for the current head; safe to re-run. +- Indexes are built on the *branch head* (not on a snapshot), so reads always see the current index state. +- **Lazy branch forking for indexes**: a branch that hasn't mutated a sub-table doesn't need its own index — the main lineage's index is reused until the first write triggers a copy-on-write fork. +- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column. + +## L2 — Graph topology index (`graph_index/mod.rs`) + +This is OmniGraph-specific (not Lance): + +- `TypeIndex`: dense `u32 ↔ String id` mapping per node type. +- `CsrIndex`: Compressed Sparse Row representation of edges per edge type — `offsets[i]..offsets[i+1]` slices into `targets`. +- `GraphIndex { type_indices, csr (out), csc (in) }` — built on demand from a snapshot's edge tables. +- Cached in `RuntimeCache::graph_indices` (LRU, max 8 entries, keyed by snapshot id + edge table versions). +- Built only when an `Expand` or `AntiJoin` IR op is present in the lowered query, so pure scans skip it. diff --git a/docs/maintenance.md b/docs/maintenance.md new file mode 100644 index 0000000..a222014 --- /dev/null +++ b/docs/maintenance.md @@ -0,0 +1,22 @@ +# Maintenance: Optimize & Cleanup + +`db/omnigraph/optimize.rs`. + +## `optimize_all_tables(db)` — non-destructive + +- Lance `compact_files()` on every node + edge table on `main`. +- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests. +- Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8). +- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }]`. + +## `cleanup_all_tables(db, options)` — destructive + +- Lance `cleanup_old_versions()` per table. +- Removes manifests (and their unique fragments) older than the retention policy. +- `CleanupPolicyOptions { keep_versions: Option, older_than: Option }` — at least one is required. +- Returns `[TableCleanupStats { table_key, bytes_removed, old_versions_removed }]`. +- CLI guards with `--confirm`; without it, prints a preview line. + +## Tombstones + +Logical sub-table delete markers in `__manifest`; `tombstone_object_id(table_key, version)` excludes a sub-table version from snapshot reconstruction. diff --git a/docs/merge.md b/docs/merge.md new file mode 100644 index 0000000..9895092 --- /dev/null +++ b/docs/merge.md @@ -0,0 +1,30 @@ +# Merging (three-way) and Conflicts + +`exec/merge.rs`. + +## Strategy + +Ordered, row-by-row cursor merge: + +- `OrderedTableCursor` scans each table sorted by `id` and supports peek/pop matching. +- `StagedTableWriter` buffers `MERGE_STAGE_BATCH_ROWS = 8192` rows into a temp Lance dataset (`OMNIGRAPH_MERGE_STAGING_DIR`). +- The merge runs per sub-table; results are published as one atomic manifest update. + +## Outcome enum + +`MergeOutcome { AlreadyUpToDate | FastForward | Merged }` + +## Conflict types (`error.rs`) + +``` +MergeConflictKind: + DivergentInsert // same id inserted on both branches + DivergentUpdate // updated differently on both branches + DeleteVsUpdate // one side deletes, other updates + OrphanEdge // edge references a node deleted by the other side + UniqueViolation + CardinalityViolation + ValueConstraintViolation +``` + +Returned as `OmniError::MergeConflicts(Vec)`. The HTTP server surfaces this as a 409 with structured `merge_conflicts[]` (top 3 + "+N more"). diff --git a/docs/policy.md b/docs/policy.md new file mode 100644 index 0000000..746e35f --- /dev/null +++ b/docs/policy.md @@ -0,0 +1,44 @@ +# Authorization (Cedar policy) + +OmniGraph integrates AWS Cedar (`cedar-policy = 4.9`) for ABAC. + +## Policy actions + +1. `read` — query / snapshot / list branches & commits +2. `export` — NDJSON export +3. `change` — mutations +4. `schema_apply` — apply schema migrations +5. `branch_create` +6. `branch_delete` +7. `branch_merge` +8. `run_publish` +9. `run_abort` +10. `admin` — reserved + +## Scope kinds + +- `branch_scope` — applied to source branch (`read`, `export`, `change`) +- `target_branch_scope` — applied to destination (`schema_apply`, branch ops, run ops) +- `protected_branches` — named list with special rules; rule scopes are `any | protected | unprotected` + +## Configuration + +`omnigraph.yaml`: + +```yaml +policy: + file: ./policy.yaml # Cedar rules + groups + tests: ./policy.tests.yaml # declarative test cases +``` + +Each rule must use exactly one of `branch_scope` or `target_branch_scope`. + +## CLI + +- `omnigraph policy validate` — parse + count actors, exit 1 on parse error. +- `omnigraph policy test` — run cases in `policy.tests.yaml`, exit 1 on any expectation mismatch. +- `omnigraph policy explain --actor … --action … [--branch …] [--target-branch …]` — show decision and matched rule. + +## Server enforcement + +Every mutating endpoint calls `authorize_request()` *before* the handler runs; decisions are logged with actor / action / branch / outcome / matched rule. diff --git a/docs/query-language.md b/docs/query-language.md new file mode 100644 index 0000000..f90fe60 --- /dev/null +++ b/docs/query-language.md @@ -0,0 +1,103 @@ +# Query Language (`.gq`) + +Pest grammar at `crates/omnigraph-compiler/src/query/query.pest`. AST in `query/ast.rs`. Type checker in `query/typecheck.rs`. Lowering in `ir/lower.rs`. + +## Query declarations + +``` +query ($p1: T1, $p2: T2?, …) + @description("…") @instruction("…") { + … +} +``` + +Two body shapes: + +- **Read**: `match { … } return { … } [order { … }] [limit N]` +- **Mutation**: one or more of `insert | update | delete` statements + +Param types reuse all schema scalars; trailing `?` makes a param optional. The compiler reserves `$__nanograph_now` for `now()`. + +## MATCH clauses + +- **Binding**: `$x: NodeType { prop: , … }` +- **Traversal**: `$src EDGE_NAME { min, max? } $dst` — variable-length paths via hop bounds; default 1..1 if bounds omitted. +- **Filter**: ` ` with operators `>=`, `<=`, `!=`, `>`, `<`, `=`, and string `contains`. +- **Negation**: `not { clause+ }` — desugars to anti-join over the inner pipeline. + +## Search clauses (multi-modal) + +Used inside MATCH or as expressions inside RETURN/ORDER: + +| Function | Purpose | Underlying Lance facility | +|---|---|---| +| `nearest($x.vec, $q)` | k-NN vector search (cosine) | Lance vector index (IVF / HNSW) | +| `search(field, q)` | Generic FTS | Inverted index | +| `fuzzy(field, q [, max_edits])` | Levenshtein-tolerant text search | Inverted index | +| `match_text(field, q)` | Pattern match | Inverted index | +| `bm25(field, q)` | BM25 scoring | Inverted index | +| `rrf(rank_a, rank_b [, k])` | Reciprocal Rank Fusion of two rankings (default k=60) | OmniGraph fuses scored rankings | + +`nearest()` requires a `LIMIT`; the compiler resolves the query vector via the param map (or via the runtime embedding client when bound to a text input). + +## RETURN clause + +`return { [as ], … }` with expressions: + +- Variable / property access: `$x`, `$x.prop` +- Literals: string, int, float, bool, list +- `now()` +- Aggregates: `count`, `sum`, `avg`, `min`, `max` +- All search functions above (so you can return a score column) +- `AliasRef` — re-use a previous projection alias + +## ORDER & LIMIT + +- `order { [asc|desc], … }` — supports plain expressions and `nearest(...)`. +- `limit ` — required when there is a `nearest(...)` ordering. + +## Mutation statements + +- `insert { prop: , … }` +- `update set { prop: , … } where ` +- `delete where ` + +`` is a literal, `$param`, or `now()`. Multi-statement mutations execute atomically (added in v0.2.0). + +## IR (Intermediate Representation) + +`QueryIR { name, params, pipeline: Vec, return_exprs, order_by, limit }` + +Pipeline operations: + +- `NodeScan { variable, type_name, filters }` +- `Expand { src_var, dst_var, edge_type, direction (Out|In), dst_type, min_hops, max_hops, dst_filters }` — destination filters are pushed *into* the expand so Lance scalar pushdown can prune. +- `Filter { left, op, right }` +- `AntiJoin { outer_var, inner: Vec }` — for `not { … }` + +Lowering: + +1. Partition MATCH clauses (bindings, traversals, filters, negations). +2. Identify "deferred" bindings (a destination of a traversal that has filters) so the Expand can carry the filter as a pushdown. +3. Emit NodeScan for the first binding, then Expand operations, then remaining Filter operations, then AntiJoins for negations. +4. Translate RETURN / ORDER expressions; preserve LIMIT. + +## Linting & validation (`query/lint.rs`) + +Codes seen so far: + +- **Q000** (Error): parse error +- **L201** (Warning): nullable property never set by any UPDATE — "{type}.{prop} exists in schema but no update query sets it" +- (Warning): mutation declares no params — hardcoded mutations are easy to miss +- Plus all type errors from `typecheck_query_decl()` (undefined types, mismatched operators, undefined edges, etc.) + +Output: + +``` +QueryLintOutput { status, schema_source, query_path, + queries_processed, errors, warnings, infos, + results: [{ name, kind, status, error?, warnings[] }], + findings: [{ severity, code, message, type_name?, property?, query_names[] }] } +``` + +CLI exits non-zero only on `status = Error`. diff --git a/docs/releases/v0.3.1.md b/docs/releases/v0.3.1.md new file mode 100644 index 0000000..1f5d7dc --- /dev/null +++ b/docs/releases/v0.3.1.md @@ -0,0 +1,19 @@ +# Omnigraph v0.3.1 + +Omnigraph v0.3.1 is a performance and operability point release. + +## Highlights + +- **Parallel per-type load writes**: the bulk loader writes to each node/edge table concurrently rather than serially, materially reducing wall-clock time on multi-table loads. +- **`omnigraph optimize` and `omnigraph cleanup` CLI commands**: previously only available via the engine API. `optimize` runs Lance `compact_files()` across every node/edge table; `cleanup` runs Lance `cleanup_old_versions()` with a `--keep`/`--older-than` policy and requires `--confirm` for the destructive form. +- **Dst-id deduplication during edge expand hydration**: avoids redundant lookups when the same destination id appears multiple times in an `Expand` step (#45). + +## Included Changes + +- Parallel per-type load writes (#46) +- `omnigraph optimize` / `cleanup` CLI commands and runtime APIs (#46) +- Dedupe dst ids before hydrating nodes in `execute_expand` (#45) + +## Upgrade Notes + +No breaking changes. Existing v0.3.0 repos can be opened directly with v0.3.1. diff --git a/docs/runs.md b/docs/runs.md new file mode 100644 index 0000000..5150ec5 --- /dev/null +++ b/docs/runs.md @@ -0,0 +1,38 @@ +# Runs (transactional graph mutations) + +`db/run_registry.rs` + run lifecycle in `db/omnigraph.rs`. Stored in `_graph_runs.lance` and `_graph_run_actors.lance`. + +## RunRecord + +``` +RunRecord { + run_id: RunId (ULID), + target_branch: String, // where the run will publish + run_branch: "__run__", // ephemeral isolation branch + base_snapshot_id: String, + base_manifest_version: u64, + operation_hash: Option, // idempotency key + actor_id: Option, + status: Running | Published | Failed | Aborted, + published_snapshot_id: Option, + created_at, updated_at: i64 (microseconds), +} +``` + +## Lifecycle + +1. `begin_run(target_branch, op_hash)` / `begin_run_as(target_branch, op_hash, actor_id)` — forks `__run__` from the target's current head, appends a `RunRecord`. +2. Mutations on `run_branch` (via the normal write APIs) — isolated from concurrent activity on the target. +3. `publish_run(id)` / `publish_run_as(id, actor)`: + - **Fast path**: if the target hasn't moved since `base_snapshot_id`, promote the run snapshot directly. + - **Merge path**: if it has moved, perform a three-way merge (see [merge.md](merge.md)) into the target. + - On success: `status = Published`, `published_snapshot_id` set, run branch cleaned up asynchronously. +4. `abort_run(id)` / `fail_run(id)` — terminal; cleans up run branch best-effort. + +## Idempotency + +`operation_hash` is an optional field clients can use to detect a duplicate `begin_run` retry. + +## Cleanup + +`cleanup_terminal_run_branches_for_target(branch)` is called as branches change; failures are swallowed (lazy cleanup on next branch op). diff --git a/docs/schema-language.md b/docs/schema-language.md new file mode 100644 index 0000000..64d7eaa --- /dev/null +++ b/docs/schema-language.md @@ -0,0 +1,79 @@ +# Schema Language (`.pg`) + +Pest grammar at `crates/omnigraph-compiler/src/schema/schema.pest`. AST at `schema/ast.rs`. Catalog at `catalog/mod.rs`. + +## Top-level declarations + +- `interface { property* }` — reusable property contracts. +- `node [implements , ...] { property* | constraint* }` +- `edge : -> [@card(min..max)] { property* | constraint* }` +- Comments: line `//` and block `/* … */`. + +## Property declarations + +`: [annotation*]` + +## Built-in scalar types + +| Scalar | Arrow type | +|---|---| +| `String` | Utf8 | +| `Blob` | LargeBinary | +| `Bool` | Boolean | +| `I32` / `I64` | Int32 / Int64 | +| `U32` / `U64` | UInt32 / UInt64 | +| `F32` / `F64` | Float32 / Float64 | +| `Date` | Date32 | +| `DateTime` | Date64 | +| `Vector()` | FixedSizeList(Float32, dim), `1 ≤ dim ≤ i32::MAX` | +| `[]` | List(scalar) | +| `enum(v1, v2, …)` | Utf8 with sorted/dedup'd set of allowed string values | +| `?` | Same as scalar but `nullable: true` | + +## Constraints (body level) + +| Constraint | On | Effect | +|---|---|---| +| `@key(p, …)` | node | Primary key; implies index on key columns; `key_property()` returns the first key | +| `@unique(p, …)` | node, edge | Uniqueness across listed columns | +| `@index(p, …)` | node, edge | Build a scalar (BTREE) index on the columns | +| `@range(p, min..max)` | node | Numeric range validation (open ranges allowed) | +| `@check(p, "regex")` | node | Regex pattern validation | +| `@card(min..max?)` | edge | Edge multiplicity — default `0..*`; `0..1`, `1..1`, `1..*`, etc. | + +Edge bodies only allow `@unique` and `@index`. + +## Annotations + +- `@` or `@()` on any declaration or property. +- Known annotations: + - `@embed` on a Vector property — names the *source* property whose text gets embedded into this vector at ingest (`embed_sources` map in NodeType). + - `@description("…")`, `@instruction("…")` on query declarations (carried through to clients). +- Custom annotations are accepted by the parser and surfaced in catalog metadata; unrecognized annotations don't fail compilation. + +## Catalog construction + +- Pass 0: collect interfaces. +- Pass 1: collect nodes, expand `implements`, build constraint and `@embed` mappings, build the Arrow schema for each node table (`id: Utf8` plus all properties; blob columns get `LargeBinary`). +- Pass 2: collect edges, validate that `from_type` / `to_type` exist, normalize edge names case-insensitively for lookup, validate constraints for edges. Edge Arrow schema: `id: Utf8, src: Utf8, dst: Utf8` plus edge properties. + +## Schema IR & stable type IDs + +- `IR_VERSION = 1` (`catalog/schema_ir.rs`). +- Each interface/node/edge gets a `stable_type_id` (kind+name hashed) so renames can be tracked. +- Serialized as JSON for diff/migration plans. + +## Schema migration planning + +`plan_schema_migration(accepted, desired) -> SchemaMigrationPlan { supported, steps[] }` with step types: + +- `AddType { type_kind, name }` +- `RenameType { type_kind, from, to }` +- `AddProperty { type_kind, type_name, property_name, property_type }` +- `RenameProperty { type_kind, type_name, from, to }` +- `AddConstraint { type_kind, type_name, constraint }` +- `UpdateTypeMetadata { … annotations }` +- `UpdatePropertyMetadata { … annotations }` +- `UnsupportedChange { entity, reason }` (forces `supported=false`) + +`apply_schema()` returns `SchemaApplyResult { supported, applied, manifest_version, steps }` and is gated by an internal `__schema_apply_lock__` system branch so concurrent schema applies serialize. diff --git a/docs/server.md b/docs/server.md new file mode 100644 index 0000000..3c6242d --- /dev/null +++ b/docs/server.md @@ -0,0 +1,68 @@ +# HTTP Server (`omnigraph-server`) + +Axum 0.8 + tokio + utoipa-generated OpenAPI. Single repo per process; deploy multiple processes for multi-tenant. + +## Endpoint inventory + +| Method | Path | Auth | Action | Handler | +|---|---|---|---|---| +| GET | `/healthz` | none | — | `server_health` | +| GET | `/openapi.json` | none | — | `server_openapi` (strips security if auth disabled) | +| GET | `/snapshot?branch=` | bearer + `read` | snapshot of branch | `server_snapshot` | +| POST | `/read` | bearer + `read` | run named query | `server_read` | +| POST | `/export` | bearer + `export` | NDJSON stream | `server_export` | +| POST | `/change` | bearer + `change` | mutation | `server_change` | +| GET | `/schema` | bearer + `read` | get current `.pg` source | `server_schema_get` | +| POST | `/schema/apply` | bearer + `schema_apply` (target=`main`) | migrate | `server_schema_apply` | +| POST | `/ingest` | bearer + `branch_create` (if new) + `change` | bulk load | `server_ingest` (32 MB body limit) | +| GET | `/branches` | bearer + `read` | list branches | `server_branch_list` | +| POST | `/branches` | bearer + `branch_create` | create | `server_branch_create` | +| DELETE | `/branches/{branch}` | bearer + `branch_delete` | delete | `server_branch_delete` | +| POST | `/branches/merge` | bearer + `branch_merge` | merge `source → target` | `server_branch_merge` | +| GET | `/runs` | bearer + `read` | list | `server_run_list` | +| GET | `/runs/{run_id}` | bearer + `read` | show | `server_run_show` | +| POST | `/runs/{run_id}/publish` | bearer + `run_publish` | publish | `server_run_publish` | +| POST | `/runs/{run_id}/abort` | bearer + `run_abort` | abort | `server_run_abort` | +| GET | `/commits?branch=` | bearer + `read` | list | `server_commit_list` | +| GET | `/commits/{commit_id}` | bearer + `read` | show | `server_commit_show` | + +## Streaming + +Only `/export` streams (`application/x-ndjson`, MPSC channel + `Body::from_stream`). Everything else is buffered JSON. + +## Error model + +Uniform `ErrorOutput { error, code?, merge_conflicts[] }` with `code ∈ unauthorized | forbidden | bad_request | not_found | conflict | internal`. Merge conflicts attach structured `MergeConflictOutput { table_key, row_id?, kind, message }`. + +HTTP status codes used: 200, 400, 401, 403, 404, 409, 500. + +## Body limits + +- Default: 1 MB +- `/ingest`: 32 MB + +## Auth model (`bearer + SHA-256`) + +- Tokens are SHA-256 hashed on startup; plaintext is never persisted in memory. +- Constant-time comparison via `subtle::ConstantTimeEq`. +- Three sources, in precedence: + 1. `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` — AWS Secrets Manager (build with `--features aws`) + 2. `OMNIGRAPH_SERVER_BEARER_TOKENS_FILE` or `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON` — JSON `{actor_id: token, …}` + 3. `OMNIGRAPH_SERVER_BEARER_TOKEN` — single legacy token, actor `default` +- If no tokens configured, server runs unauthenticated (local dev) and `/openapi.json` strips the security scheme. + +See [deployment.md](deployment.md) for token-source operational details. + +## Tracing & observability + +- `tower_http::TraceLayer::new_for_http()` +- Policy decisions logged at INFO level with actor, action, branch, decision, matched rule +- Startup logs: token source name, repo URI, bind address +- Graceful SIGINT shutdown + +## Not implemented (by design or "TBD") + +- CORS — not configured; add `tower_http::cors` if needed. +- Rate limiting — none. +- Pagination — none (commits/branches/runs return everything; export streams). +- Multi-tenant routing — one repo per process. diff --git a/docs/storage.md b/docs/storage.md new file mode 100644 index 0000000..1c8dd76 --- /dev/null +++ b/docs/storage.md @@ -0,0 +1,46 @@ +# Storage + +## L1 — Lance dataset (per node/edge type) + +Every node type and every edge type is its own Lance dataset: + +- **Columnar Arrow storage**: each property is a column; nullable per Arrow schema. +- **Fragments**: data is partitioned into fragments; new writes create new fragments. +- **Manifest versioning**: every commit produces a new dataset version; old versions remain readable. +- **Stable row IDs**: enabled by OmniGraph for the commit-graph and run-registry datasets so durable references survive compaction. +- **Append / delete / `merge_insert`**: native Lance write modes. +- **Per-dataset branches** (Lance native): copy-on-write at the dataset level. +- **Object-store agnostic**: file://, s3://, gs://, az://, http (read-only via Lance) — OmniGraph wires file:// and s3:// (`storage.rs`). + +## L2 — Multi-dataset coordination via `__manifest` + +OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordinated through one append-only manifest table. + +- **Manifest table**: `__manifest/` Lance dataset. +- **Layout** (`db/manifest/layout.rs`, `db/manifest/state.rs`): + - `nodes/{fnv1a64-hex(type_name)}` — one Lance dataset per node type + - `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type + - `__manifest/` — the catalog of all sub-tables and their published versions + - `_graph_commits.lance` / `_graph_commit_actors.lance` — the commit graph and its actor map + - `_graph_runs.lance` / `_graph_run_actors.lance` — the run registry and its actor map +- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`): + - `object_type` ∈ `table | table_version | table_tombstone` + - `table_key` ∈ `node: | edge:` + - `table_branch` is `null` for the main lineage and the branch name otherwise +- **Snapshot reconstruction**: latest visible `table_version` per `(table_key, table_branch)` minus tombstones whose `tombstone_version >= table_version`. +- **Atomic publish**: multi-dataset commits publish via a `ManifestBatchPublisher` so a single write to `__manifest` flips all the new sub-table versions visible at once. + +## URI scheme support (`storage.rs`) + +| Scheme | Backend | Notes | +|---|---|---| +| local path / `file://` | `LocalStorageAdapter` (tokio) | Normalized to absolute paths | +| `s3://bucket/prefix` | `S3StorageAdapter` (object_store) | Honors `AWS_ENDPOINT_URL_S3`, `AWS_ALLOW_HTTP`, `AWS_S3_FORCE_PATH_STYLE` | +| `http(s)://host:port` | HTTP client to `omnigraph-server` | Used by CLI as a target, not a storage backend | + +## Object-store env vars (S3-compatible) + +- `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN` +- `AWS_ENDPOINT_URL`, `AWS_ENDPOINT_URL_S3` — for MinIO / RustFS / GCS-via-XML +- `AWS_S3_FORCE_PATH_STYLE=true` — path-style URLs +- `AWS_ALLOW_HTTP=true` — allow plain HTTP (local dev) diff --git a/scripts/check-agents-md.sh b/scripts/check-agents-md.sh new file mode 100755 index 0000000..95ae05f --- /dev/null +++ b/scripts/check-agents-md.sh @@ -0,0 +1,77 @@ +#!/usr/bin/env bash +# Verify that AGENTS.md and docs/ stay in sync. +# +# Two checks: +# 1. Every docs/*.md path linked from AGENTS.md exists on disk. +# 2. Every doc in the canonical set is linked from AGENTS.md. +# +# Exit non-zero on any drift. + +set -euo pipefail + +repo_root="$(cd "$(dirname "$0")/.." && pwd)" +cd "$repo_root" + +agents_file="AGENTS.md" +if [[ ! -f "$agents_file" ]]; then + echo "error: $agents_file not found" >&2 + exit 1 +fi + +# Canonical set: every docs/*.md (top-level), plus the releases/ index dir if present. +canonical=() +while IFS= read -r line; do + canonical+=("$line") +done < <(find docs -mindepth 1 -maxdepth 1 -type f -name '*.md' | sort) +if [[ -d docs/releases ]]; then + canonical+=("docs/releases/") +fi + +# Extract docs/ links from AGENTS.md (markdown link form: (docs/...)) +linked=() +while IFS= read -r line; do + linked+=("$line") +done < <(grep -oE '\(docs/[^)]+\)' "$agents_file" | sed -E 's/^\(|\)$//g' | sort -u) + +fail=0 + +# Check 1: every linked path exists. +for link in "${linked[@]}"; do + # Strip in-page anchors like #foo + path="${link%%#*}" + if [[ "$path" == */ ]]; then + if [[ ! -d "$path" ]]; then + echo "error: AGENTS.md links to missing directory: $path" >&2 + fail=1 + fi + else + if [[ ! -f "$path" ]]; then + echo "error: AGENTS.md links to missing file: $path" >&2 + fail=1 + fi + fi +done + +# Check 2: every canonical doc is linked at least once. +for doc in "${canonical[@]}"; do + found=0 + for link in "${linked[@]}"; do + path="${link%%#*}" + if [[ "$path" == "$doc" ]]; then + found=1 + break + fi + done + if [[ "$found" -eq 0 ]]; then + echo "error: doc not linked from AGENTS.md: $doc" >&2 + fail=1 + fi +done + +if [[ "$fail" -ne 0 ]]; then + echo >&2 + echo "AGENTS.md / docs/ are out of sync. Either update AGENTS.md links or rename/remove the doc." >&2 + exit 1 +fi + +echo "AGENTS.md ↔ docs/ links OK (${#linked[@]} links, ${#canonical[@]} docs)."