mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-09 01:35:18 +02:00
Refactor AGENTS.md from encyclopedia to map; move spec into docs/
Splits the 990-line AGENTS.md into a 184-line map (architecture, where-to-find index, always-on invariants, capability matrix, maintenance contract) plus 18 new docs/*.md files holding the deep content per topic (storage, schema and query languages, indexes, embeddings, branches/commits, runs, merge, changes, execution, policy, server, CLI reference, audit, errors, CI, constants, v0.3.1 notes). Adds scripts/check-agents-md.sh and a check_agents_md CI job that verifies every docs/ link in AGENTS.md resolves and every doc in the canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
parent
cfea41e942
commit
a335d98854
23 changed files with 1069 additions and 924 deletions
12
.github/workflows/ci.yml
vendored
|
|
@@ -99,6 +99,18 @@ jobs:
|
|||
echo "run_full_ci=$run_full_ci" >> "$GITHUB_OUTPUT"
|
||||
echo "run_rustfs_ci=$run_rustfs_ci" >> "$GITHUB_OUTPUT"
|
||||
|
||||
check_agents_md:
|
||||
name: Check AGENTS.md Links
|
||||
runs-on: ubuntu-latest
|
||||
permissions:
|
||||
contents: read
|
||||
steps:
|
||||
- name: Checkout source
|
||||
uses: actions/checkout@v5.0.1
|
||||
|
||||
- name: Verify AGENTS.md ↔ docs/ cross-links
|
||||
run: bash scripts/check-agents-md.sh
|
||||
|
||||
test:
|
||||
name: Test Workspace
|
||||
needs: classify_changes
|
||||
|
|
|
|||
69
docs/architecture.md
Normal file
|
|
@@ -0,0 +1,69 @@
|
|||
# Architecture
|
||||
|
||||
OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a single `omnigraph.yaml`.
|
||||
|
||||
## Stack
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ CLI (omnigraph) HTTP Server (omnigraph-server, Axum) │
|
||||
│ - 13 cmd families - REST + OpenAPI │
|
||||
│ - Aliases, configs - Bearer auth + Cedar policy │
|
||||
└──────────────────────────────┬───────────────────────────────────┘
|
||||
│
|
||||
┌──────────────────────────────▼───────────────────────────────────┐
|
||||
│ omnigraph-compiler │
|
||||
│ - Pest grammars: schema.pest, query.pest │
|
||||
│ - Catalog (Node/Edge/Interface types) │
|
||||
│ - IR + lowering (NodeScan / Expand / Filter / AntiJoin) │
|
||||
│ - Schema migration planner │
|
||||
│ - Embedding client (OpenAI-style for query-time normalization) │
|
||||
└──────────────────────────────┬───────────────────────────────────┘
|
||||
│
|
||||
┌──────────────────────────────▼───────────────────────────────────┐
|
||||
│ omnigraph (engine) │
|
||||
│ - GraphCoordinator + ManifestRepo (__manifest) │
|
||||
│ - CommitGraph (_graph_commits.lance) │
|
||||
│ - RunRegistry (_graph_runs.lance, __run__ branches) │
|
||||
│ - GraphIndex (CSR/CSC) + RuntimeCache (LRU 8) │
|
||||
│ - exec::query / mutation / merge │
|
||||
│ - Embedding client (Gemini for runtime ingest) │
|
||||
└──────────────────────────────┬───────────────────────────────────┘
|
||||
│
|
||||
┌──────────────────────────────▼───────────────────────────────────┐
|
||||
│ Lance 4.x (per-table dataset) │
|
||||
│ - Columnar (Arrow) storage, fragments │
|
||||
│ - Manifest versions per dataset │
|
||||
│ - Per-dataset branches (copy-on-write) │
|
||||
│ - Indexes: BTREE, Inverted (FTS/BM25), IVF/HNSW vector │
|
||||
│ - merge_insert (upsert), append, delete │
|
||||
│ - compact_files, cleanup_old_versions │
|
||||
└──────────────────────────────┬───────────────────────────────────┘
|
||||
│
|
||||
┌──────────────────────────────▼───────────────────────────────────┐
|
||||
│ Object store: local FS, S3, RustFS, MinIO, S3-compatible │
|
||||
└──────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## L1 / L2 framing
|
||||
|
||||
Throughout the docs, capabilities are split into:
|
||||
|
||||
- **L1 — Inherited from Lance**: what OmniGraph gets "for free" by sitting on top of the Lance dataset format (columnar Arrow storage, per-dataset versions and branches, index types, `merge_insert`, `compact_files` / `cleanup_old_versions`).
|
||||
- **L2 — Added by OmniGraph**: typing (schema language), graph semantics, multi-dataset coordination via `__manifest`, graph-level branches and commits, the `.gq` query language and IR, the topology index, the HTTP server, Cedar policy, the CLI.
|
||||
|
||||
## Concurrency model
|
||||
|
||||
- **MVCC**: every Lance write bumps a per-dataset version; the OmniGraph manifest version coordinates which sub-table versions are visible together.
|
||||
- **Snapshot isolation**: a query holds one `Snapshot` for its lifetime; concurrent writes don't leak in.
|
||||
- **Cross-branch isolation**: copy-on-write means readers and writers on different branches don't block each other.
|
||||
- **Run isolation**: each transactional run lives on its own `__run__<id>` branch.
|
||||
- **Schema-apply lock**: `__schema_apply_lock__` system branch serializes schema migrations.
|
||||
- **Fail-points** (`failpoints` cargo feature): `failpoints::maybe_fail("operation.step")?` in `branch_create`, publish, etc., for deterministic failure injection in tests.
|
||||
|
||||
## Workspace crates
|
||||
|
||||
- `omnigraph-compiler` — schema and query grammars, catalog, IR, lowering, type checker, lint, migration planner, OpenAI-style embedding client.
|
||||
- `omnigraph` (engine, published as `omnigraph-engine` on crates.io since v0.2.2) — the Lance-backed runtime: manifest, commit graph, run registry, snapshot, exec, merge, loader, Gemini embedding client.
|
||||
- `omnigraph-cli` — the `omnigraph` binary.
|
||||
- `omnigraph-server` — the `omnigraph-server` binary (Axum HTTP server).
|
||||
6
docs/audit.md
Normal file
|
|
@@ -0,0 +1,6 @@
|
|||
# Audit / Actor tracking
|
||||
|
||||
- `Omnigraph::audit_actor_id: Option<String>` is the actor in effect.
|
||||
- `_as` variants of every write API let callers override the actor: `begin_run_as`, `publish_run_as`, `ingest_as`, `mutate_as`, `branch_merge_as`, etc.
|
||||
- Actor IDs are persisted both on `RunRecord.actor_id` and on `GraphCommit.actor_id`, with optional split storage in `_graph_commit_actors.lance` and `_graph_run_actors.lance`.
|
||||
- HTTP server uses the bearer-token actor automatically; CLI uses the local user / explicit env (no implicit actor).
|
||||
51
docs/branches-commits.md
Normal file
|
|
@@ -0,0 +1,51 @@
|
|||
# Branches, Commits, Snapshots
|
||||
|
||||
## L1 — Lance per-dataset branches
|
||||
|
||||
Lance supports branching at the dataset level: a branch is a named lineage of versions, and `fork_branch_from_state(source_branch, target_branch, source_version)` creates a copy-on-write fork.
|
||||
|
||||
## L2 — Graph-level branches
|
||||
|
||||
OmniGraph builds *graph branches* on top by branching every sub-table coherently:
|
||||
|
||||
- `branch_create(name)` / `branch_create_from(target, name)` — rejects the name `main`, fails if the branch already exists, and requires the schema-apply lock to be idle.
|
||||
- `branch_list()` — returns public branches, **filters internal** `__run__…` and `__schema_apply_lock__` prefixes.
|
||||
- `branch_delete(name)` — refuses if there are descendants or active runs on the branch; cleans up owned per-branch fragments.
|
||||
- **Lazy forking**: a branch only forks a sub-table when that sub-table is first mutated on it. Pure-read branches share fragments with their source.
|
||||
- `sync_branch(branch)` — re-binds the in-memory handle to the latest head of the branch.
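
The filtering done by `branch_list()` can be sketched as a prefix check. This is a minimal illustration, not the engine's code: the constant names and the helper functions `is_internal_branch` and `public_branches` are hypothetical, and only the two prefixes come from the text above.

```rust
// Prefixes taken from the docs; the constant names are assumptions.
const RUN_BRANCH_PREFIX: &str = "__run__";
const SCHEMA_APPLY_LOCK_BRANCH: &str = "__schema_apply_lock__";

/// True for branches that the public listing should hide.
pub fn is_internal_branch(name: &str) -> bool {
    name.starts_with(RUN_BRANCH_PREFIX) || name.starts_with(SCHEMA_APPLY_LOCK_BRANCH)
}

/// Keep only user-visible branches, preserving input order.
pub fn public_branches<'a>(all: &[&'a str]) -> Vec<&'a str> {
    all.iter()
        .copied()
        .filter(|name| !is_internal_branch(name))
        .collect()
}
```

Calling `public_branches(&["main", "feature", "__run__01H"])` would return only `main` and `feature` under these assumptions.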
|
||||
|
||||
## L2 — Commit graph (`db/commit_graph.rs`)
|
||||
|
||||
Stored as a Lance dataset `_graph_commits.lance` (with stable row IDs):
|
||||
|
||||
```
|
||||
GraphCommit {
|
||||
graph_commit_id: ULID,
|
||||
manifest_branch: Option<String>,
|
||||
manifest_version: u64,
|
||||
parent_commit_id: Option<String>,
|
||||
merged_parent_commit_id: Option<String>, // populated for merge commits
|
||||
actor_id: Option<String>,
|
||||
created_at: i64, // microseconds since epoch
|
||||
}
|
||||
```
|
||||
|
||||
- Every successful publish (load / change / merge / schema_apply / publish_run) appends one commit.
|
||||
- Merge commits have two parents; linear commits have one.
|
||||
- `_graph_commit_actors.lance` — optional separate actor map (created on demand).
|
||||
- API: `list_commits(branch)`, `get_commit(id)`, `head_commit_id_for_branch(branch)`.
|
||||
|
||||
## L2 — Snapshots & time travel
|
||||
|
||||
- `snapshot()` — current snapshot for the bound branch; cached.
|
||||
- `snapshot_of(target)` — snapshot at a `ReadTarget` (branch | snapshot id).
|
||||
- `snapshot_at_version(v: u64)` — historical snapshot from any manifest version.
|
||||
- `entity_at(table_key, id, version)` — single-entity time travel without building a full snapshot.
|
||||
- A `Snapshot` is a `(version, HashMap<table_key, SubTableEntry>)` — cheap to build, snapshot-isolated cross-table reads.
|
||||
|
||||
## L2 — Internal system branches
|
||||
|
||||
Filtered from `branch_list()` but visible to internals:
|
||||
|
||||
- `__run__<run-id>` — ephemeral isolation branch for a transactional run.
|
||||
- `__schema_apply_lock__` — serializes schema migrations.
|
||||
24
docs/changes.md
Normal file
|
|
@@ -0,0 +1,24 @@
|
|||
# Change Detection / Diff
|
||||
|
||||
Implemented in `changes/mod.rs` as a three-level algorithm:
|
||||
|
||||
1. **Manifest diff**: skip sub-tables whose `(table_version, table_branch)` is unchanged.
|
||||
2. **Lineage check**:
|
||||
- Same branch lineage → fast path: use the per-row `_row_last_updated_at_version` column to classify Insert/Update/Delete.
|
||||
- Different lineages → ID-based streaming comparison.
|
||||
3. **Row-level diff**: streaming, no full materialization.
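
The first level can be illustrated with a small sketch that compares two manifests and keeps only the sub-tables whose `(table_version, table_branch)` pair differs. The `TableState` struct, the `changed_tables` function, and the example table keys are hypothetical stand-ins, not the types in `changes/mod.rs`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one manifest entry.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct TableState {
    pub table_version: u64,
    pub table_branch: Option<String>,
}

/// Level 1 of the diff: return the keys of sub-tables whose (version, branch)
/// pair differs between two manifests, including tables present on one side only.
pub fn changed_tables(
    from: &BTreeMap<String, TableState>,
    to: &BTreeMap<String, TableState>,
) -> Vec<String> {
    let mut changed = Vec::new();
    for (key, to_state) in to {
        // Unchanged sub-tables are skipped and never reach the row-level diff.
        if from.get(key) != Some(to_state) {
            changed.push(key.clone());
        }
    }
    for key in from.keys() {
        if !to.contains_key(key) {
            changed.push(key.clone());
        }
    }
    changed.sort();
    changed
}
```

Only the tables returned here would need the lineage check and the streaming row-level comparison.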
|
||||
|
||||
## Public API
|
||||
|
||||
- `diff_between(from: ReadTarget, to: ReadTarget, filter: Option<ChangeFilter>) -> ChangeSet`
|
||||
- `diff_commits(from_commit_id, to_commit_id, filter)` — cross-branch safe.
|
||||
|
||||
## Types
|
||||
|
||||
```
|
||||
ChangeOp: Insert | Update | Delete
|
||||
EntityKind: Node | Edge
|
||||
EntityChange { table_key, kind, type_name, id, op, manifest_version, endpoints?: {src, dst} }
|
||||
ChangeFilter { kinds?, type_names?, ops? }
|
||||
ChangeSet { from_version, to_version, branch?, changes[], stats }
|
||||
```
|
||||
10
docs/ci.md
Normal file
|
|
@@ -0,0 +1,10 @@
|
|||
# CI / Release Workflows
|
||||
|
||||
`.github/workflows/`:
|
||||
|
||||
- **ci.yml**: skips text-only changes; otherwise runs `cargo test --workspace --locked` on ubuntu-latest with the protobuf compiler. An OpenAPI-drift check auto-commits the regenerated `openapi.json` for same-repo PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
|
||||
- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
|
||||
- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_repo_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.
|
||||
- **release-edge.yml**: on every push to main, retags `edge`, builds Linux/macOS-Intel/macOS-arm64 archives + sha256, publishes a rolling prerelease.
|
||||
- **release.yml**: on `v*` tags, builds the 3-platform matrix and updates the Homebrew tap (`scripts/update-homebrew-formula.sh`) by force-pushing the regenerated formula to `ModernRelay/homebrew-tap`.
|
||||
- **package.yml**: manual ECR image build; emits two image tags per commit (`<sha>`, `<sha>-aws`) via CodeBuild.
|
||||
83
docs/cli-reference.md
Normal file
|
|
@@ -0,0 +1,83 @@
|
|||
# CLI Reference (`omnigraph`)
|
||||
|
||||
A reference for the `omnigraph` binary's command surface and `omnigraph.yaml` schema. For a quick-start guide, see [cli.md](cli.md).
|
||||
|
||||
13 top-level command families, 40+ subcommands. All commands accept either a positional `URI`, `--uri`, or a `--target <name>` resolved against `omnigraph.yaml`.
|
||||
|
||||
## Top-level commands
|
||||
|
||||
| Command | Purpose |
|
||||
|---|---|
|
||||
| `init` | `--schema <pg>` → initialize a repo (also scaffolds `omnigraph.yaml` if missing) |
|
||||
| `load` | bulk load a branch (`--mode overwrite\|append\|merge`) |
|
||||
| `ingest` | branch-creating transactional load (`--from <base>`) |
|
||||
| `read` | run named query (params via `--params`, `--params-file`, or alias args) |
|
||||
| `change` | run mutation query |
|
||||
| `snapshot` | print current snapshot (per-table version + row count) |
|
||||
| `export` | dump to JSONL on stdout (`--type T`, `--table K` filters) |
|
||||
| `branch create \| list \| delete \| merge` | branching ops |
|
||||
| `commit list \| show` | inspect commit graph |
|
||||
| `run list \| show \| publish \| abort` | transactional run ops |
|
||||
| `schema plan \| apply \| show (alias: get)` | migrations |
|
||||
| `query lint \| check` | offline / repo-backed validation |
|
||||
| `optimize` | non-destructive Lance compaction |
|
||||
| `cleanup --keep N --older-than 7d --confirm` | destructive version GC |
|
||||
| `embed` | offline JSONL embedding pipeline |
|
||||
| `policy validate \| test \| explain` | Cedar tooling |
|
||||
| `version` / `-v` | print `omnigraph 0.3.x` |
|
||||
|
||||
## `omnigraph.yaml` schema
|
||||
|
||||
```yaml
|
||||
project: { name }
|
||||
graphs:
|
||||
<name>:
|
||||
uri: <local|s3://|http(s)://>
|
||||
bearer_token_env: <ENV_NAME>
|
||||
server:
|
||||
graph: <name>
|
||||
bind: <ip:port>
|
||||
cli:
|
||||
graph: <name>
|
||||
branch: <name>
|
||||
output_format: json|jsonl|csv|kv|table
|
||||
table_max_column_width: 80
|
||||
table_cell_layout: truncate|wrap
|
||||
query:
|
||||
roots: [<dir>, …] # search path for .gq files
|
||||
auth:
|
||||
env_file: ./.env.omni
|
||||
aliases:
|
||||
<alias>:
|
||||
command: read|change
|
||||
query: <path-to-.gq>
|
||||
name: <query-name>
|
||||
args: [<positional-name>, …]
|
||||
graph: <name>
|
||||
branch: <name>
|
||||
format: <output-format>
|
||||
policy:
|
||||
file: ./policy.yaml
|
||||
```
|
||||
|
||||
## Output formats (read command)
|
||||
|
||||
- `json` — pretty-printed object with metadata + rows
|
||||
- `jsonl` — one metadata line then one JSON object per row
|
||||
- `csv` — RFC 4180-ish quoting
|
||||
- `table` — fitted text table, honors `table_max_column_width` + `table_cell_layout`
|
||||
- `kv` — grouped per-row key/value blocks
|
||||
|
||||
## Param resolution
|
||||
|
||||
Precedence (high to low): explicit `--params` / `--params-file`, alias positional args, `omnigraph.yaml` defaults. JS-safe-integer handling is built in (`is_js_safe_integer_i64`, `JS_MAX_SAFE_INTEGER_U64`) so 64-bit ids round-trip safely through JSON clients.
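
As an illustration of the JS-safe-integer check, the sketch below re-implements the two names mentioned above. The bodies are assumptions (the real signatures may differ), and `to_json_scalar` is a hypothetical helper that falls back to a quoted string, which may not match how the SDK JSON actually encodes large values.

```rust
/// 2^53 - 1, the largest integer a JavaScript `Number` represents exactly.
pub const JS_MAX_SAFE_INTEGER_U64: u64 = 9_007_199_254_740_991;

/// True when `value` survives a round trip through a JavaScript `Number`.
pub fn is_js_safe_integer_i64(value: i64) -> bool {
    value.unsigned_abs() <= JS_MAX_SAFE_INTEGER_U64
}

/// Hypothetical helper: emit safe values as JSON numbers and everything else
/// as a quoted string so no digits are lost on the client.
pub fn to_json_scalar(value: i64) -> String {
    if is_js_safe_integer_i64(value) {
        value.to_string()
    } else {
        format!("\"{value}\"")
    }
}
```

The point of the check is that any 64-bit id with magnitude above 2^53 - 1 would be silently rounded if sent as a plain JSON number to a JavaScript client.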
|
||||
|
||||
## Bearer token resolution (CLI)
|
||||
|
||||
1. `graphs.<name>.bearer_token_env`
|
||||
2. `OMNIGRAPH_BEARER_TOKEN` global env
|
||||
3. The `.env` file referenced by `auth.env_file`
|
||||
|
||||
## Duration parsing (cleanup)
|
||||
|
||||
`s | m | h | d | w` units, e.g. `--older-than 7d`.
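
A parser for this unit grammar might look like the following sketch. The function name `parse_duration`, its error type, and the restriction to a single integer followed by one unit are assumptions; the CLI's actual parser may accept more forms.

```rust
use std::time::Duration;

/// Parse strings such as "7d" or "90m" into a `Duration`.
/// Accepts one unsigned integer followed by one of s, m, h, d, w.
pub fn parse_duration(input: &str) -> Result<Duration, String> {
    let input = input.trim();
    let unit = input.chars().last().ok_or("empty duration")?;
    let digits = &input[..input.len() - unit.len_utf8()];
    let seconds_per_unit: u64 = match unit {
        's' => 1,
        'm' => 60,
        'h' => 3_600,
        'd' => 86_400,
        'w' => 604_800,
        other => return Err(format!("unknown unit '{other}'")),
    };
    let count: u64 = digits
        .parse()
        .map_err(|_| format!("invalid number '{digits}'"))?;
    // Guard against overflow for absurdly large counts.
    count
        .checked_mul(seconds_per_unit)
        .map(Duration::from_secs)
        .ok_or_else(|| "duration overflow".to_string())
}
```

With this sketch, `parse_duration("7d")` yields a `Duration` of 604,800 seconds, and a bare unit such as `"d"` is rejected.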
|
||||
20
docs/constants.md
Normal file
|
|
@@ -0,0 +1,20 @@
|
|||
# Constants & Tunables (cheat sheet)
|
||||
|
||||
| Name | Value | Where |
|
||||
|---|---|---|
|
||||
| `MANIFEST_DIR` | `__manifest` | `db/manifest/layout.rs` |
|
||||
| Commit graph dir | `_graph_commits.lance` | `db/commit_graph.rs` |
|
||||
| Run registry dir | `_graph_runs.lance` | `db/run_registry.rs` |
|
||||
| Run branch prefix | `__run__` | `db/run_registry.rs` |
|
||||
| Schema apply lock | `__schema_apply_lock__` | `db/mod.rs` |
|
||||
| Merge stage batch | `MERGE_STAGE_BATCH_ROWS = 8192` | `exec/merge.rs` |
|
||||
| Maintenance concurrency | `OMNIGRAPH_MAINTENANCE_CONCURRENCY=8` | `db/omnigraph/optimize.rs` |
|
||||
| Graph index cache size | `8` (LRU) | `runtime_cache.rs` |
|
||||
| Default body limit | `1 MB` | `omnigraph-server/lib.rs` |
|
||||
| Ingest body limit | `32 MB` | `omnigraph-server/lib.rs` |
|
||||
| Engine embed model | `gemini-embedding-2-preview` | `omnigraph/embedding.rs` |
|
||||
| Compiler embed model | `text-embedding-3-small` | `omnigraph-compiler/embedding.rs` |
|
||||
| Embed timeout | `30 000 ms` | both clients |
|
||||
| Embed retries | `4` | both clients |
|
||||
| Embed retry backoff | `200 ms` | both clients |
|
||||
| LANCE memory pool default | `1 GB` (raised in v0.3.0) | runtime |
|
||||
31
docs/embeddings.md
Normal file
|
|
@@ -0,0 +1,31 @@
|
|||
# Embeddings
|
||||
|
||||
OmniGraph has **two** embedding clients with different defaults and purposes.
|
||||
|
||||
## Compiler-side client (`omnigraph-compiler/src/embedding.rs`) — query-time normalization
|
||||
|
||||
- Default model: `text-embedding-3-small` (OpenAI-style schema)
|
||||
- Env: `NANOGRAPH_EMBED_MODEL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL` (default `https://api.openai.com/v1`), `NANOGRAPH_EMBEDDINGS_MOCK`, `NANOGRAPH_EMBED_TIMEOUT_MS=30000`, `NANOGRAPH_EMBED_RETRY_ATTEMPTS=4`, `NANOGRAPH_EMBED_RETRY_BACKOFF_MS=200`
|
||||
- Methods: `embed_text(input, expected_dim)`, `embed_texts(inputs, expected_dim)`
|
||||
- Mock mode: deterministic FNV-1a + xorshift64 → L2-normalized vectors
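
The mock mode can be pictured as hashing the text to a seed, stretching the seed with a PRNG, and normalizing the result. The sketch below uses the standard 64-bit FNV-1a constants and one common xorshift64 shift triple; the function names, the mapping from bits to floats, and the seeding details are assumptions and will not reproduce the vectors of the real client.

```rust
/// 64-bit FNV-1a hash of the input bytes (standard offset basis and prime).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// One xorshift64 step using the (13, 7, 17) shift triple.
fn xorshift64(mut x: u64) -> u64 {
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    x
}

/// Deterministic pseudo-embedding: same text and dim always give the same unit vector.
pub fn mock_embedding(text: &str, dim: usize) -> Vec<f32> {
    // xorshift64 is stuck at zero forever, so force a non-zero seed.
    let mut state = fnv1a64(text.as_bytes()) | 1;
    let mut v: Vec<f32> = (0..dim)
        .map(|_| {
            state = xorshift64(state);
            // Map the top 24 bits to the range [-1, 1).
            ((state >> 40) as f32 / (1u64 << 23) as f32) - 1.0
        })
        .collect();
    // L2-normalize so cosine similarity reduces to a dot product.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut v {
            *x /= norm;
        }
    }
    v
}
```

Determinism is what makes such a mock useful in tests: the same input text maps to the same vector on every run, with no network call.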
|
||||
|
||||
## Engine-side client (`omnigraph/src/embedding.rs`) — runtime ingest
|
||||
|
||||
- Model: `gemini-embedding-2-preview`
|
||||
- Env: `GEMINI_API_KEY`, `OMNIGRAPH_GEMINI_BASE_URL` (default Google generativelanguage v1beta), `OMNIGRAPH_EMBED_TIMEOUT_MS=30000`, `OMNIGRAPH_EMBED_RETRY_ATTEMPTS=4`, `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS=200`, `OMNIGRAPH_EMBEDDINGS_MOCK`
|
||||
- Two task types: `embed_query_text` (RETRIEVAL_QUERY) and `embed_document_text` (RETRIEVAL_DOCUMENT)
|
||||
- Exponential backoff with retryable detection (timeouts, 429, 5xx)
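
The retry behaviour can be sketched as two small functions. The doubling schedule and the names `backoff_delay` and `is_retryable` are assumptions; the docs only state the base backoff (200 ms), the attempt count (4), and the retryable conditions.

```rust
use std::time::Duration;

/// Delay before the `retry`-th retry (0-based): base * 2^retry, saturating on overflow.
/// The doubling factor is an assumption about what "exponential" means here.
pub fn backoff_delay(base: Duration, retry: u32) -> Duration {
    base.saturating_mul(2u32.saturating_pow(retry))
}

/// Timeouts, HTTP 429, and any 5xx status are treated as retryable.
pub fn is_retryable(status: Option<u16>, timed_out: bool) -> bool {
    timed_out || matches!(status, Some(429) | Some(500..=599))
}
```

With a 200 ms base, this schedule would wait 200 ms, 400 ms, 800 ms, and 1600 ms before successive retries.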
|
||||
|
||||
## Schema integration
|
||||
|
||||
Mark a Vector property with `@embed("source_text_property")`. At ingest, the engine pulls the source text and writes the embedding into the vector column. The result is stored as an L2-normalized `FixedSizeList(Float32, dim)`.
|
||||
|
||||
## CLI `omnigraph embed` (offline file pipeline)
|
||||
|
||||
Operates on **JSONL files** (not on a repo). Three modes (mutually exclusive):
|
||||
|
||||
- (default) `fill_missing` — only embed rows whose target field is empty
|
||||
- `--reembed-all` — overwrite all
|
||||
- `--clean` — strip embeddings
|
||||
|
||||
Inputs are either a single seed manifest YAML or `--input/--output/--spec`. Selectors `--type T`, `--select T:field=value` filter rows. Streams JSONL → JSONL.
|
||||
21
docs/errors.md
Normal file
|
|
@@ -0,0 +1,21 @@
|
|||
# Errors and Result Serialization
|
||||
|
||||
## Error taxonomy (`omnigraph::error::OmniError`)
|
||||
|
||||
- `Compiler(...)` — schema/query parse/typecheck errors
|
||||
- `Lance(String)` — storage layer
|
||||
- `DataFusion(String)` — execution layer
|
||||
- `Io(io::Error)`
|
||||
- `Manifest(ManifestError { kind: BadRequest|NotFound|Conflict|Internal, … })`
|
||||
- `MergeConflicts(Vec<MergeConflict>)`
|
||||
|
||||
Compiler-side `NanoError` covers parse / catalog / type / plan / execution / arrow / lance / IO / manifest / unique-constraint, each with structured spans (`SourceSpan { start, end }`) for ariadne-style diagnostics.
|
||||
|
||||
## Result serialization (`omnigraph_compiler::result::QueryResult`)
|
||||
|
||||
- `to_arrow_ipc()` — efficient binary
|
||||
- `to_sdk_json()` — JS-safe JSON (large i64 wrapped in metadata)
|
||||
- `to_rust_json()` — Rust-friendly JSON
|
||||
- `batches()` — direct Arrow `RecordBatch` access
|
||||
|
||||
Mutation results: `{ affectedNodes: usize, affectedEdges: usize }` (also exposed as a tiny Arrow batch).
|
||||
76
docs/execution.md
Normal file
|
|
@@ -0,0 +1,76 @@
|
|||
# Query Execution, Mutations, and Loading
|
||||
|
||||
## Query execution (`exec/query.rs`)
|
||||
|
||||
Pipeline:
|
||||
|
||||
1. Parse + typecheck via `omnigraph-compiler`.
|
||||
2. Lower to IR.
|
||||
3. If `Expand` or `AntiJoin` is present, build (or fetch from `RuntimeCache`) a `GraphIndex`.
|
||||
4. Run `execute_query` against the snapshot.
|
||||
|
||||
### Multi-modal search modes (`SearchMode`)
|
||||
|
||||
The executor recognizes three modes that may be combined in a single query:
|
||||
|
||||
- **`nearest`** — vector ANN (uses Lance vector index; `LIMIT` required).
|
||||
- **`bm25`** — BM25 over an inverted index.
|
||||
- **`rrf`** — Reciprocal Rank Fusion of two rankings, with a smoothing constant `k` (default 60).
|
||||
|
||||
Hybrid example: `order { rrf(nearest($d.embedding, $q), bm25($d.body, $q_text)) desc } limit 20`.
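
Reciprocal Rank Fusion itself is simple: each document scores the sum of `1 / (k + rank)` over the rankings it appears in, with ranks starting at 1. The sketch below shows the textbook formula over two best-first id lists; the function name `rrf_fuse` and the tie-break by id are assumptions, not the executor's implementation.

```rust
use std::collections::HashMap;

/// Fuse two best-first rankings of document ids with Reciprocal Rank Fusion.
/// Returns (id, fused score) pairs, highest score first.
pub fn rrf_fuse(rank_a: &[&str], rank_b: &[&str], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in [rank_a, rank_b] {
        for (i, id) in ranking.iter().enumerate() {
            // Ranks are 1-based, so the top hit contributes 1 / (k + 1).
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    fused
}
```

A document that appears near the top of both rankings outranks one that is first in only one of them, which is why RRF works without calibrating vector distances against BM25 scores.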
|
||||
|
||||
### Joins / set operations
|
||||
|
||||
- Joins are implicit: MATCH bindings + traversals are implemented as scans + CSR/CSC lookups.
|
||||
- `not { … }` lowers to an `AntiJoin` over the inner pipeline.
|
||||
|
||||
### Scoped reads
|
||||
|
||||
- `query(target, source, name, params)` — at any branch or snapshot.
|
||||
- `run_query_at(version, …)` — direct historical query at a manifest version.
|
||||
|
||||
### Concurrency
|
||||
|
||||
- Snapshot isolation per query: all reads inside a query use the same `Snapshot`.
|
||||
- Readers and writers on different branches don't block each other.
|
||||
|
||||
## Mutation execution (`exec/mutation.rs`)
|
||||
|
||||
Resolves expression values to literals, converts to typed Arrow arrays (`literal_to_typed_array(lit, DataType, num_rows)`), then writes:
|
||||
|
||||
- `insert` → Lance `WriteMode::Append`
|
||||
- `update` → Lance `merge_insert(WhenMatched::Update)`
|
||||
- `delete` → Lance `merge_insert(WhenMatched::Delete)` (logical) or filtered overwrite.
|
||||
|
||||
Multi-statement mutations are atomic at the manifest commit boundary.
|
||||
|
||||
## Bulk loader (`loader/mod.rs`)
|
||||
|
||||
- **JSONL only** in v1, with two record shapes:
|
||||
- Node: `{"type":"NodeType", "data":{…}}`
|
||||
- Edge: `{"edge":"EdgeType", "from":"src_id", "to":"dst_id", "data":{…}}`
|
||||
- Lines starting with `//` are treated as comments.
|
||||
- Schema validation on every row (typecheck, required props, blob base64 decoding).
|
||||
- Edge endpoint resolution by node `@key`.
|
||||
|
||||
## Load modes (`LoadMode`)
|
||||
|
||||
| Mode | Semantics |
|
||||
|---|---|
|
||||
| `Overwrite` | Replace all data in the target tables on the branch |
|
||||
| `Append` | Strict insert; duplicate ids raise an error |
|
||||
| `Merge` | Upsert by id (`merge_insert`) |
|
||||
|
||||
## `load` vs `ingest`
|
||||
|
||||
- `load(branch, data, mode)` — direct load to a branch.
|
||||
- `ingest(branch, from, data, mode)` — branch-creating, transactional load:
|
||||
1. If target advanced since the run started, fork a fresh run branch from `from`.
|
||||
2. Load into the run branch (Append).
|
||||
3. If target hasn't moved, fast-publish; otherwise abort.
|
||||
- Returns `IngestResult { branch, base_branch, branch_created, mode, tables[] }`.
|
||||
- `ingest_as(actor_id)` records the actor on the resulting commit.
|
||||
|
||||
## Embeddings during load
|
||||
|
||||
If a node type has `@embed` properties, the loader calls the engine embedding client (Gemini, RETRIEVAL_DOCUMENT) per row to populate the vector column. See [embeddings.md](embeddings.md).
|
||||
26
docs/indexes.md
Normal file
|
|
@@ -0,0 +1,26 @@
|
|||
# Indexes
|
||||
|
||||
## L1 — Lance index types OmniGraph exposes
|
||||
|
||||
| Index | Use | Notes |
|
||||
|---|---|---|
|
||||
| **BTREE scalar** | range / equality on any scalar | created on `@key`, `@index(...)`, and on key columns by `ensure_indices()` |
|
||||
| **Inverted (FTS)** | `search`, `fuzzy`, `match_text`, `bm25` | created on text columns referenced by FTS queries |
|
||||
| **Vector** | `nearest()` k-NN | Lance picks the IVF_PQ or HNSW family by configuration; OmniGraph stores vectors as `FixedSizeList(Float32, dim)` |
|
||||
|
||||
## L2 — OmniGraph orchestration
|
||||
|
||||
- `ensure_indices()` / `ensure_indices_on(branch)` — idempotent build of BTREE + inverted indexes for the current head; safe to re-run.
|
||||
- Indexes are built on the *branch head* (not on a snapshot), so reads always see the current index state.
|
||||
- **Lazy branch forking for indexes**: a branch that hasn't mutated a sub-table doesn't need its own index — the main lineage's index is reused until the first write triggers a copy-on-write fork.
|
||||
- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column.
|
||||
|
||||
## L2 — Graph topology index (`graph_index/mod.rs`)
|
||||
|
||||
This is OmniGraph-specific (not Lance):
|
||||
|
||||
- `TypeIndex`: dense `u32 ↔ String id` mapping per node type.
|
||||
- `CsrIndex`: Compressed Sparse Row representation of edges per edge type — `offsets[i]..offsets[i+1]` slices into `targets`.
|
||||
- `GraphIndex { type_indices, csr (out), csc (in) }` — built on demand from a snapshot's edge tables.
|
||||
- Cached in `RuntimeCache::graph_indices` (LRU, max 8 entries, keyed by snapshot id + edge table versions).
|
||||
- Built only when an `Expand` or `AntiJoin` IR op is present in the lowered query, so pure scans skip it.
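
The CSR layout described above can be sketched in a few lines. The struct name follows the text, but the field types, the `from_edges` constructor, and the `neighbors` accessor are assumptions made for illustration; node ids are assumed to be the dense `u32` ids produced by a `TypeIndex`.

```rust
/// Compressed Sparse Row adjacency for one edge type.
/// Node `i` owns the slice `targets[offsets[i]..offsets[i + 1]]`.
pub struct CsrIndex {
    offsets: Vec<usize>,
    targets: Vec<u32>,
}

impl CsrIndex {
    /// Build from (src, dst) pairs over dense node ids `0..num_nodes`.
    /// Panics if an edge references a source id outside that range.
    pub fn from_edges(num_nodes: usize, edges: &[(u32, u32)]) -> Self {
        // Count out-degree per source, shifted by one so a prefix sum yields offsets.
        let mut offsets = vec![0usize; num_nodes + 1];
        for &(src, _) in edges {
            offsets[src as usize + 1] += 1;
        }
        for i in 0..num_nodes {
            offsets[i + 1] += offsets[i];
        }
        // Scatter destinations into place with one moving write cursor per source.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        for &(src, dst) in edges {
            targets[cursor[src as usize]] = dst;
            cursor[src as usize] += 1;
        }
        CsrIndex { offsets, targets }
    }

    /// Out-neighbors of `node` as a contiguous slice, with no per-node allocation.
    pub fn neighbors(&self, node: u32) -> &[u32] {
        let i = node as usize;
        &self.targets[self.offsets[i]..self.offsets[i + 1]]
    }
}
```

The incoming-edge (CSC) side can be built with the same routine by swapping each pair to `(dst, src)`.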
|
||||
22
docs/maintenance.md
Normal file
|
|
@@ -0,0 +1,22 @@
|
|||
# Maintenance: Optimize & Cleanup
|
||||
|
||||
Implemented in `db/omnigraph/optimize.rs`.
|
||||
|
||||
## `optimize_all_tables(db)` — non-destructive
|
||||
|
||||
- Lance `compact_files()` on every node + edge table on `main`.
|
||||
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests.
|
||||
- Bounded by `OMNIGRAPH_MAINTENANCE_CONCURRENCY` (default 8).
|
||||
- Returns `[TableOptimizeStats { table_key, fragments_removed, fragments_added, committed }]`.
|
||||
|
||||
## `cleanup_all_tables(db, options)` — destructive
|
||||
|
||||
- Lance `cleanup_old_versions()` per table.
|
||||
- Removes manifests (and their unique fragments) older than the retention policy.
|
||||
- `CleanupPolicyOptions { keep_versions: Option<u32>, older_than: Option<Duration> }` — at least one is required.
|
||||
- Returns `[TableCleanupStats { table_key, bytes_removed, old_versions_removed }]`.
|
||||
- CLI guards with `--confirm`; without it, prints a preview line.
|
||||
|
||||
## Tombstones
|
||||
|
||||
Tombstones are logical sub-table delete markers in `__manifest`; `tombstone_object_id(table_key, version)` excludes a sub-table version from snapshot reconstruction.
|
||||
30
docs/merge.md
Normal file
|
|
@@ -0,0 +1,30 @@
|
|||
# Merging (three-way) and Conflicts
|
||||
|
||||
Implemented in `exec/merge.rs`.
|
||||
|
||||
## Strategy
|
||||
|
||||
Ordered, row-by-row cursor merge:
|
||||
|
||||
- `OrderedTableCursor` scans each table sorted by `id` and supports peek/pop matching.
|
||||
- `StagedTableWriter` buffers `MERGE_STAGE_BATCH_ROWS = 8192` rows into a temp Lance dataset (`OMNIGRAPH_MERGE_STAGING_DIR`).
|
||||
- The merge runs per sub-table; results are published as one atomic manifest update.
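
The peek/pop matching of two id-sorted cursors can be sketched with `Peekable` iterators. This is a two-way alignment only, shown to illustrate the cursor mechanics; the `Side` enum and `align_sorted` function are hypothetical and much simpler than `OrderedTableCursor`.

```rust
use std::cmp::Ordering;

#[derive(Debug, PartialEq, Eq)]
pub enum Side {
    LeftOnly,
    RightOnly,
    Both,
}

/// Walk two id-sorted inputs in lockstep, popping whichever side holds the smaller id.
pub fn align_sorted<'a>(left: &[&'a str], right: &[&'a str]) -> Vec<(&'a str, Side)> {
    let mut l = left.iter().copied().peekable();
    let mut r = right.iter().copied().peekable();
    let mut out = Vec::new();
    loop {
        // Peek at both heads without consuming them, then decide which to pop.
        match (l.peek().copied(), r.peek().copied()) {
            (None, None) => break,
            (Some(a), None) => {
                out.push((a, Side::LeftOnly));
                l.next();
            }
            (None, Some(b)) => {
                out.push((b, Side::RightOnly));
                r.next();
            }
            (Some(a), Some(b)) => match a.cmp(b) {
                Ordering::Less => {
                    out.push((a, Side::LeftOnly));
                    l.next();
                }
                Ordering::Greater => {
                    out.push((b, Side::RightOnly));
                    r.next();
                }
                Ordering::Equal => {
                    out.push((a, Side::Both));
                    l.next();
                    r.next();
                }
            },
        }
    }
    out
}
```

Because both inputs are sorted, the walk is a single streaming pass and never needs either table fully in memory.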
|
||||
|
||||
## Outcome enum
|
||||
|
||||
`MergeOutcome { AlreadyUpToDate | FastForward | Merged }`
|
||||
|
||||
## Conflict types (`error.rs`)
|
||||
|
||||
```
|
||||
MergeConflictKind:
|
||||
DivergentInsert // same id inserted on both branches
|
||||
DivergentUpdate // updated differently on both branches
|
||||
DeleteVsUpdate // one side deletes, other updates
|
||||
OrphanEdge // edge references a node deleted by the other side
|
||||
UniqueViolation
|
||||
CardinalityViolation
|
||||
ValueConstraintViolation
|
||||
```
|
||||
|
||||
Returned as `OmniError::MergeConflicts(Vec<MergeConflict { table_key, row_id?, kind, message }>)`. The HTTP server surfaces this as a 409 with structured `merge_conflicts[]` (top 3 + "+N more").
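
How the first three conflict kinds arise can be sketched as a per-row three-way comparison of base, ours, and theirs. The `RowOutcome` enum and `merge_row` function are hypothetical, rows are reduced to a single string value, and treating identical changes on both sides as agreement is an assumption that the real merge may not share.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RowOutcome<'a> {
    /// Resolved value for the row; `None` means the row is absent after the merge.
    Keep(Option<&'a str>),
    DivergentInsert,
    DivergentUpdate,
    DeleteVsUpdate,
}

/// Three-way merge of one row, where `None` means "row does not exist on that side".
pub fn merge_row<'a>(
    base: Option<&'a str>,
    ours: Option<&'a str>,
    theirs: Option<&'a str>,
) -> RowOutcome<'a> {
    if ours == theirs {
        return RowOutcome::Keep(ours); // both sides agree
    }
    if ours == base {
        return RowOutcome::Keep(theirs); // only theirs changed
    }
    if theirs == base {
        return RowOutcome::Keep(ours); // only ours changed
    }
    // Both sides changed the row, and in different ways.
    match (base, ours, theirs) {
        (None, _, _) => RowOutcome::DivergentInsert,
        (Some(_), None, _) | (Some(_), _, None) => RowOutcome::DeleteVsUpdate,
        _ => RowOutcome::DivergentUpdate,
    }
}
```

`OrphanEdge` and the constraint violations are not row-local, so they would need checks across tables and are outside this sketch.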
|
||||
44
docs/policy.md
Normal file
|
|
@@ -0,0 +1,44 @@
|
|||
# Authorization (Cedar policy)
|
||||
|
||||
OmniGraph integrates AWS Cedar (`cedar-policy = 4.9`) for attribute-based access control (ABAC).
|
||||
|
||||
## Policy actions
|
||||
|
||||
1. `read` — query / snapshot / list branches & commits
|
||||
2. `export` — NDJSON export
|
||||
3. `change` — mutations
|
||||
4. `schema_apply` — apply schema migrations
|
||||
5. `branch_create`
|
||||
6. `branch_delete`
|
||||
7. `branch_merge`
|
||||
8. `run_publish`
|
||||
9. `run_abort`
|
||||
10. `admin` — reserved
|
||||
|
||||
## Scope kinds
|
||||
|
||||
- `branch_scope` — applied to source branch (`read`, `export`, `change`)
|
||||
- `target_branch_scope` — applied to destination (`schema_apply`, branch ops, run ops)
|
||||
- `protected_branches` — named list with special rules; rule scopes are `any | protected | unprotected`
|
||||
|
||||
## Configuration
|
||||
|
||||
`omnigraph.yaml`:
|
||||
|
||||
```yaml
|
||||
policy:
|
||||
file: ./policy.yaml # Cedar rules + groups
|
||||
tests: ./policy.tests.yaml # declarative test cases
|
||||
```
|
||||
|
||||
Each rule must use exactly one of `branch_scope` or `target_branch_scope`.
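
The exactly-one-scope rule is an exclusive-or over two optional fields. The sketch below assumes a hypothetical `Rule` struct with both scopes as `Option<String>`; the real policy types and error messages are not shown in these docs.

```rust
/// Hypothetical rule shape: only the two scope fields matter for this check.
pub struct Rule {
    pub branch_scope: Option<String>,
    pub target_branch_scope: Option<String>,
}

/// Accept a rule only when exactly one of the two scopes is set.
pub fn validate_scopes(rule: &Rule) -> Result<(), String> {
    match (&rule.branch_scope, &rule.target_branch_scope) {
        (Some(_), None) | (None, Some(_)) => Ok(()),
        (Some(_), Some(_)) => {
            Err("rule sets both branch_scope and target_branch_scope".to_string())
        }
        (None, None) => {
            Err("rule sets neither branch_scope nor target_branch_scope".to_string())
        }
    }
}
```

A check like this would naturally run as part of `omnigraph policy validate`, though where the real validation happens is not stated here.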
|
||||
|
||||
## CLI
|
||||
|
||||
- `omnigraph policy validate` — parse + count actors, exit 1 on parse error.
|
||||
- `omnigraph policy test` — run cases in `policy.tests.yaml`, exit 1 on any expectation mismatch.
|
||||
- `omnigraph policy explain --actor … --action … [--branch …] [--target-branch …]` — show decision and matched rule.
|
||||
|
||||
## Server enforcement
|
||||
|
||||
Every mutating endpoint calls `authorize_request()` *before* the handler runs; decisions are logged with actor / action / branch / outcome / matched rule.
|
||||
103
docs/query-language.md
Normal file
|
|
@@ -0,0 +1,103 @@
|
|||
# Query Language (`.gq`)
|
||||
|
||||
Pest grammar at `crates/omnigraph-compiler/src/query/query.pest`. AST in `query/ast.rs`. Type checker in `query/typecheck.rs`. Lowering in `ir/lower.rs`.
|
||||
|
||||
## Query declarations
|
||||
|
||||
```
|
||||
query <name>($p1: T1, $p2: T2?, …)
|
||||
@description("…") @instruction("…") {
|
||||
…
|
||||
}
|
||||
```
|
||||
|
||||
Two body shapes:
|
||||
|
||||
- **Read**: `match { … } return { … } [order { … }] [limit N]`
|
||||
- **Mutation**: one or more of `insert | update | delete` statements
|
||||
|
||||
Param types reuse all schema scalars; trailing `?` makes a param optional. The compiler reserves `$__nanograph_now` for `now()`.
|
||||
|
||||
## MATCH clauses
|
||||
|
||||
- **Binding**: `$x: NodeType { prop: <literal | $param | now()>, … }`
|
||||
- **Traversal**: `$src EDGE_NAME { min, max? } $dst` — variable-length paths via hop bounds; default 1..1 if bounds omitted.
|
||||
- **Filter**: `<expr> <op> <expr>` with operators `>=`, `<=`, `!=`, `>`, `<`, `=`, and string `contains`.
|
||||
- **Negation**: `not { clause+ }` — desugars to anti-join over the inner pipeline.
|
||||
|
||||
## Search clauses (multi-modal)

Used inside MATCH or as expressions inside RETURN/ORDER:

| Function | Purpose | Underlying Lance facility |
|---|---|---|
| `nearest($x.vec, $q)` | k-NN vector search (cosine) | Lance vector index (IVF / HNSW) |
| `search(field, q)` | Generic FTS | Inverted index |
| `fuzzy(field, q [, max_edits])` | Levenshtein-tolerant text search | Inverted index |
| `match_text(field, q)` | Pattern match | Inverted index |
| `bm25(field, q)` | BM25 scoring | Inverted index |
| `rrf(rank_a, rank_b [, k])` | Reciprocal Rank Fusion of two rankings (default k=60) | OmniGraph fuses scored rankings |

`nearest()` requires a `LIMIT`; the compiler resolves the query vector via the param map (or via the runtime embedding client when bound to a text input).

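To make the `rrf(rank_a, rank_b [, k])` row concrete, here is a minimal sketch of standard Reciprocal Rank Fusion over two best-first id lists. Each id scores the sum of `1 / (k + rank)` over the lists it appears in, with ranks starting at 1. The function name `rrf_fuse`, the slice-of-ids input, and the tie-break by id are assumptions for illustration; they are not taken from the OmniGraph source.

```rust
use std::collections::HashMap;

/// Fuse two rankings (best result first) with Reciprocal Rank Fusion.
/// Hypothetical helper: the name and signature are illustrative only.
fn rrf_fuse(rank_a: &[&str], rank_b: &[&str], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in [rank_a, rank_b] {
        for (i, id) in ranking.iter().enumerate() {
            // `i` is 0-based, so the 1-based rank is `i + 1`.
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + (i as f64) + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties are broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    fused
}
```

With `k = 60.0`, an id that ranks well in both lists beats an id that tops only one of them, which is the point of fusing a vector ranking with a text ranking.
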
## RETURN clause

`return { <expr> [as <alias>], … }` with expressions:

- Variable / property access: `$x`, `$x.prop`
- Literals: string, int, float, bool, list
- `now()`
- Aggregates: `count`, `sum`, `avg`, `min`, `max`
- All search functions above (so you can return a score column)
- `AliasRef` — re-use a previous projection alias

## ORDER & LIMIT

- `order { <expr> [asc|desc], … }` — supports plain expressions and `nearest(...)`.
- `limit <integer>` — required when there is a `nearest(...)` ordering.

## Mutation statements

- `insert <Type> { prop: <value>, … }`
- `update <Type> set { prop: <value>, … } where <prop> <op> <value>`
- `delete <Type> where <prop> <op> <value>`

`<value>` is a literal, `$param`, or `now()`. Multi-statement mutations execute atomically (added in v0.2.0).

## IR (Intermediate Representation)

`QueryIR { name, params, pipeline: Vec<IROp>, return_exprs, order_by, limit }`

Pipeline operations:

- `NodeScan { variable, type_name, filters }`
- `Expand { src_var, dst_var, edge_type, direction (Out|In), dst_type, min_hops, max_hops, dst_filters }` — destination filters are pushed *into* the expand so Lance scalar pushdown can prune.
- `Filter { left, op, right }`
- `AntiJoin { outer_var, inner: Vec<IROp> }` — for `not { … }`

Lowering:

1. Partition MATCH clauses (bindings, traversals, filters, negations).
2. Identify "deferred" bindings (a destination of a traversal that has filters) so the Expand can carry the filter as a pushdown.
3. Emit NodeScan for the first binding, then Expand operations, then remaining Filter operations, then AntiJoins for negations.
4. Translate RETURN / ORDER expressions; preserve LIMIT.

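The emission order in the lowering steps can be sketched as follows. This is a simplified model under stated assumptions: `Clause`, `Op`, and `lower` are hypothetical stand-ins (filters are plain strings, hop bounds and direction are omitted), not the types in `ir/lower.rs`. It shows one NodeScan for the first binding, Expands that carry the destination binding's filters, then Filters, then AntiJoins.

```rust
/// Simplified stand-ins for MATCH clauses (hypothetical shapes).
#[derive(Debug, Clone, PartialEq)]
enum Clause {
    Binding { var: String, type_name: String, filters: Vec<String> },
    Traversal { src: String, edge: String, dst: String },
    Filter(String),
    Not(Vec<Clause>),
}

/// Simplified stand-ins for the IR pipeline operations.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    NodeScan { var: String, type_name: String, filters: Vec<String> },
    Expand { src: String, edge: String, dst: String, dst_filters: Vec<String> },
    Filter(String),
    AntiJoin(Vec<Op>),
}

fn lower(clauses: &[Clause]) -> Vec<Op> {
    let mut ops = Vec::new();

    // NodeScan for the first binding in the clause list.
    let first = clauses.iter().find_map(|c| match c {
        Clause::Binding { var, type_name, filters } => Some((var, type_name, filters)),
        _ => None,
    });
    if let Some((var, type_name, filters)) = first {
        ops.push(Op::NodeScan {
            var: var.clone(),
            type_name: type_name.clone(),
            filters: filters.clone(),
        });
    }

    // One Expand per traversal; a "deferred" destination binding contributes
    // its filters as `dst_filters` instead of becoming a separate scan.
    for c in clauses {
        if let Clause::Traversal { src, edge, dst } = c {
            let dst_filters = clauses
                .iter()
                .find_map(|b| match b {
                    Clause::Binding { var, filters, .. } if var == dst => Some(filters.clone()),
                    _ => None,
                })
                .unwrap_or_default();
            ops.push(Op::Expand { src: src.clone(), edge: edge.clone(), dst: dst.clone(), dst_filters });
        }
    }

    // Remaining filters, then anti-joins over the recursively lowered inner pipeline.
    for c in clauses {
        if let Clause::Filter(expr) = c {
            ops.push(Op::Filter(expr.clone()));
        }
    }
    for c in clauses {
        if let Clause::Not(inner) = c {
            ops.push(Op::AntiJoin(lower(inner)));
        }
    }
    ops
}
```

Folding the destination filters into `Expand` is what lets the storage layer prune rows during the hop rather than after it.
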
## Linting & validation (`query/lint.rs`)

Codes seen so far:

- **Q000** (Error): parse error
- **L201** (Warning): nullable property never set by any UPDATE — "{type}.{prop} exists in schema but no update query sets it"
- (Warning): mutation declares no params — hardcoded mutations are easy to miss
- Plus all type errors from `typecheck_query_decl()` (undefined types, mismatched operators, undefined edges, etc.)

Output:

```
QueryLintOutput { status, schema_source, query_path,
                  queries_processed, errors, warnings, infos,
                  results: [{ name, kind, status, error?, warnings[] }],
                  findings: [{ severity, code, message, type_name?, property?, query_names[] }] }
```

CLI exits non-zero only on `status = Error`.
19
docs/releases/v0.3.1.md
Normal file
@ -0,0 +1,19 @@
# Omnigraph v0.3.1

Omnigraph v0.3.1 is a performance and operability point release.

## Highlights

- **Parallel per-type load writes**: the bulk loader writes to each node/edge table concurrently rather than serially, materially reducing wall-clock time on multi-table loads.
- **`omnigraph optimize` and `omnigraph cleanup` CLI commands**: previously only available via the engine API. `optimize` runs Lance `compact_files()` across every node/edge table; `cleanup` runs Lance `cleanup_old_versions()` with a `--keep`/`--older-than` policy and requires `--confirm` for the destructive form.
- **Dst-id deduplication during edge expand hydration**: avoids redundant lookups when the same destination id appears multiple times in an `Expand` step (#45).

## Included Changes

- Parallel per-type load writes (#46)
- `omnigraph optimize` / `cleanup` CLI commands and runtime APIs (#46)
- Dedupe dst ids before hydrating nodes in `execute_expand` (#45)

## Upgrade Notes

No breaking changes. Existing v0.3.0 repos can be opened directly with v0.3.1.
38
docs/runs.md
Normal file
@ -0,0 +1,38 @@
# Runs (transactional graph mutations)

`db/run_registry.rs` + run lifecycle in `db/omnigraph.rs`. Stored in `_graph_runs.lance` and `_graph_run_actors.lance`.

## RunRecord

```
RunRecord {
    run_id: RunId (ULID),
    target_branch: String,           // where the run will publish
    run_branch: "__run__<id>",       // ephemeral isolation branch
    base_snapshot_id: String,
    base_manifest_version: u64,
    operation_hash: Option<String>,  // idempotency key
    actor_id: Option<String>,
    status: Running | Published | Failed | Aborted,
    published_snapshot_id: Option<String>,
    created_at, updated_at: i64 (microseconds),
}
```

## Lifecycle

1. `begin_run(target_branch, op_hash)` / `begin_run_as(target_branch, op_hash, actor_id)` — forks `__run__<id>` from the target's current head, appends a `RunRecord`.
2. Mutations on `run_branch` (via the normal write APIs) — isolated from concurrent activity on the target.
3. `publish_run(id)` / `publish_run_as(id, actor)`:
   - **Fast path**: if the target hasn't moved since `base_snapshot_id`, promote the run snapshot directly.
   - **Merge path**: if it has moved, perform a three-way merge (see [merge.md](merge.md)) into the target.
   - On success: `status = Published`, `published_snapshot_id` set, run branch cleaned up asynchronously.
4. `abort_run(id)` / `fail_run(id)` — terminal; cleans up run branch best-effort.

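The publish decision in step 3 comes down to one comparison, sketched below. `RunStatus` mirrors the statuses in `RunRecord`; `PublishPath` and `choose_publish_path` are hypothetical names, and refusing to publish any run that is not `Running` is an assumption of this sketch rather than documented behaviour.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum RunStatus {
    Running,
    Published,
    Failed,
    Aborted,
}

#[derive(Debug, PartialEq)]
enum PublishPath {
    /// Target head still equals the run's base snapshot: promote directly.
    FastPath,
    /// Target moved since the run began: a three-way merge is needed.
    MergePath,
}

/// Returns `None` when the run is not `Running` (assumed precondition).
fn choose_publish_path(
    status: RunStatus,
    base_snapshot_id: &str,
    target_head_snapshot_id: &str,
) -> Option<PublishPath> {
    if status != RunStatus::Running {
        return None;
    }
    if base_snapshot_id == target_head_snapshot_id {
        Some(PublishPath::FastPath)
    } else {
        Some(PublishPath::MergePath)
    }
}
```
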
## Idempotency

`operation_hash` is an optional field clients can use to detect a duplicate `begin_run` retry.

## Cleanup

`cleanup_terminal_run_branches_for_target(branch)` is called as branches change; failures are swallowed (lazy cleanup on next branch op).
79
docs/schema-language.md
Normal file
@ -0,0 +1,79 @@
# Schema Language (`.pg`)

Pest grammar at `crates/omnigraph-compiler/src/schema/schema.pest`. AST at `schema/ast.rs`. Catalog at `catalog/mod.rs`.

## Top-level declarations

- `interface <Name> { property* }` — reusable property contracts.
- `node <Name> [implements <Iface>, ...] { property* | constraint* }`
- `edge <Name>: <FromType> -> <ToType> [@card(min..max)] { property* | constraint* }`
- Comments: line `//` and block `/* … */`.

## Property declarations

`<ident>: <TypeRef> [annotation*]`

## Built-in scalar types

| Scalar | Arrow type |
|---|---|
| `String` | Utf8 |
| `Blob` | LargeBinary |
| `Bool` | Boolean |
| `I32` / `I64` | Int32 / Int64 |
| `U32` / `U64` | UInt32 / UInt64 |
| `F32` / `F64` | Float32 / Float64 |
| `Date` | Date32 |
| `DateTime` | Date64 |
| `Vector(<dim>)` | FixedSizeList(Float32, dim), `1 ≤ dim ≤ i32::MAX` |
| `[<scalar>]` | List(scalar) |
| `enum(v1, v2, …)` | Utf8 with sorted/dedup'd set of allowed string values |
| `<scalar>?` | Same as scalar but `nullable: true` |

## Constraints (body level)

| Constraint | On | Effect |
|---|---|---|
| `@key(p, …)` | node | Primary key; implies index on key columns; `key_property()` returns the first key |
| `@unique(p, …)` | node, edge | Uniqueness across listed columns |
| `@index(p, …)` | node, edge | Build a scalar (BTREE) index on the columns |
| `@range(p, min..max)` | node | Numeric range validation (open ranges allowed) |
| `@check(p, "regex")` | node | Regex pattern validation |
| `@card(min..max?)` | edge | Edge multiplicity — default `0..*`; `0..1`, `1..1`, `1..*`, etc. |

Edge bodies only allow `@unique` and `@index`.

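As a small illustration of the `@card(min..max?)` bounds, the sketch below models a cardinality with an optional upper bound and checks an edge count against it. The `Card` struct, the `allows` method, and the idea that the count is taken per source node are assumptions of this sketch; the table only states the bounds syntax and the `0..*` default.

```rust
/// Cardinality bounds; `max: None` models the open upper bound written `*`.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Card {
    min: u32,
    max: Option<u32>,
}

/// The documented default, `0..*`.
const DEFAULT_CARD: Card = Card { min: 0, max: None };

impl Card {
    /// True when `edge_count` falls inside `min..max` (inclusive on both ends).
    fn allows(&self, edge_count: u32) -> bool {
        edge_count >= self.min && self.max.map_or(true, |max| edge_count <= max)
    }
}
```

Under this model `1..1` accepts exactly one edge, while `0..1` accepts zero or one.
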
## Annotations

- `@<ident>` or `@<ident>(<literal>)` on any declaration or property.
- Known annotations:
  - `@embed` on a Vector property — names the *source* property whose text gets embedded into this vector at ingest (`embed_sources` map in NodeType).
  - `@description("…")`, `@instruction("…")` on query declarations (carried through to clients).
- Custom annotations are accepted by the parser and surfaced in catalog metadata; unrecognized annotations don't fail compilation.

## Catalog construction

- Pass 0: collect interfaces.
- Pass 1: collect nodes, expand `implements`, build constraint and `@embed` mappings, build the Arrow schema for each node table (`id: Utf8` plus all properties; blob columns get `LargeBinary`).
- Pass 2: collect edges, validate that `from_type` / `to_type` exist, normalize edge names case-insensitively for lookup, validate constraints for edges. Edge Arrow schema: `id: Utf8, src: Utf8, dst: Utf8` plus edge properties.

## Schema IR & stable type IDs

- `IR_VERSION = 1` (`catalog/schema_ir.rs`).
- Each interface/node/edge gets a `stable_type_id` (kind+name hashed) so renames can be tracked.
- Serialized as JSON for diff/migration plans.

## Schema migration planning

`plan_schema_migration(accepted, desired) -> SchemaMigrationPlan { supported, steps[] }` with step types:

- `AddType { type_kind, name }`
- `RenameType { type_kind, from, to }`
- `AddProperty { type_kind, type_name, property_name, property_type }`
- `RenameProperty { type_kind, type_name, from, to }`
- `AddConstraint { type_kind, type_name, constraint }`
- `UpdateTypeMetadata { … annotations }`
- `UpdatePropertyMetadata { … annotations }`
- `UnsupportedChange { entity, reason }` (forces `supported=false`)

`apply_schema()` returns `SchemaApplyResult { supported, applied, manifest_version, steps }` and is gated by an internal `__schema_apply_lock__` system branch so concurrent schema applies serialize.
68
docs/server.md
Normal file
@ -0,0 +1,68 @@
# HTTP Server (`omnigraph-server`)

Axum 0.8 + tokio + utoipa-generated OpenAPI. Single repo per process; deploy multiple processes for multi-tenant.

## Endpoint inventory

| Method | Path | Auth | Action | Handler |
|---|---|---|---|---|
| GET | `/healthz` | none | — | `server_health` |
| GET | `/openapi.json` | none | — | `server_openapi` (strips security if auth disabled) |
| GET | `/snapshot?branch=` | bearer + `read` | snapshot of branch | `server_snapshot` |
| POST | `/read` | bearer + `read` | run named query | `server_read` |
| POST | `/export` | bearer + `export` | NDJSON stream | `server_export` |
| POST | `/change` | bearer + `change` | mutation | `server_change` |
| GET | `/schema` | bearer + `read` | get current `.pg` source | `server_schema_get` |
| POST | `/schema/apply` | bearer + `schema_apply` (target=`main`) | migrate | `server_schema_apply` |
| POST | `/ingest` | bearer + `branch_create` (if new) + `change` | bulk load | `server_ingest` (32 MB body limit) |
| GET | `/branches` | bearer + `read` | list branches | `server_branch_list` |
| POST | `/branches` | bearer + `branch_create` | create | `server_branch_create` |
| DELETE | `/branches/{branch}` | bearer + `branch_delete` | delete | `server_branch_delete` |
| POST | `/branches/merge` | bearer + `branch_merge` | merge `source → target` | `server_branch_merge` |
| GET | `/runs` | bearer + `read` | list | `server_run_list` |
| GET | `/runs/{run_id}` | bearer + `read` | show | `server_run_show` |
| POST | `/runs/{run_id}/publish` | bearer + `run_publish` | publish | `server_run_publish` |
| POST | `/runs/{run_id}/abort` | bearer + `run_abort` | abort | `server_run_abort` |
| GET | `/commits?branch=` | bearer + `read` | list | `server_commit_list` |
| GET | `/commits/{commit_id}` | bearer + `read` | show | `server_commit_show` |

## Streaming

Only `/export` streams (`application/x-ndjson`, MPSC channel + `Body::from_stream`). Everything else is buffered JSON.

## Error model

Uniform `ErrorOutput { error, code?, merge_conflicts[] }` with `code ∈ unauthorized | forbidden | bad_request | not_found | conflict | internal`. Merge conflicts attach structured `MergeConflictOutput { table_key, row_id?, kind, message }`.

HTTP status codes used: 200, 400, 401, 403, 404, 409, 500.

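The six `code` values line up with the six non-200 status codes listed above. The sketch below spells out the conventional pairing; the function name `http_status` is hypothetical, and the pairing itself is inferred from standard HTTP semantics rather than read from the server source.

```rust
/// Map an `ErrorOutput.code` string to the HTTP status it conventionally pairs with.
/// Returns `None` for codes outside the documented set.
fn http_status(code: &str) -> Option<u16> {
    match code {
        "bad_request" => Some(400),
        "unauthorized" => Some(401),
        "forbidden" => Some(403),
        "not_found" => Some(404),
        "conflict" => Some(409),
        "internal" => Some(500),
        _ => None,
    }
}
```

A client can use such a table to cross-check that the body `code` and the response status agree before acting on `merge_conflicts`.
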
## Body limits

- Default: 1 MB
- `/ingest`: 32 MB

## Auth model (`bearer + SHA-256`)

- Tokens are SHA-256 hashed on startup; plaintext is never persisted in memory.
- Constant-time comparison via `subtle::ConstantTimeEq`.
- Three sources, in order of precedence:
  1. `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` — AWS Secrets Manager (build with `--features aws`)
  2. `OMNIGRAPH_SERVER_BEARER_TOKENS_FILE` or `OMNIGRAPH_SERVER_BEARER_TOKENS_JSON` — JSON `{actor_id: token, …}`
  3. `OMNIGRAPH_SERVER_BEARER_TOKEN` — single legacy token, actor `default`
- If no tokens configured, server runs unauthenticated (local dev) and `/openapi.json` strips the security scheme.

See [deployment.md](deployment.md) for token-source operational details.

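The precedence among token sources can be sketched as a first-match lookup over the environment. `TokenSource` and `pick_token_source` are hypothetical names. Two further assumptions are made here that the text does not settle: the `_FILE` variable is checked before `_JSON` within the second tier, and an empty value counts as unset.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum TokenSource {
    AwsSecret,
    File,
    Json,
    SingleToken,
    /// No tokens configured: the server runs unauthenticated.
    Unauthenticated,
}

/// Pick the highest-precedence token source present in `env`.
fn pick_token_source(env: &HashMap<&str, &str>) -> TokenSource {
    // Assumption: an empty value is treated the same as a missing variable.
    let is_set = |name: &str| env.get(name).map_or(false, |v| !v.is_empty());

    if is_set("OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET") {
        TokenSource::AwsSecret
    } else if is_set("OMNIGRAPH_SERVER_BEARER_TOKENS_FILE") {
        // Assumption: FILE wins over JSON when both are set.
        TokenSource::File
    } else if is_set("OMNIGRAPH_SERVER_BEARER_TOKENS_JSON") {
        TokenSource::Json
    } else if is_set("OMNIGRAPH_SERVER_BEARER_TOKEN") {
        TokenSource::SingleToken
    } else {
        TokenSource::Unauthenticated
    }
}
```
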
## Tracing & observability

- `tower_http::TraceLayer::new_for_http()`
- Policy decisions logged at INFO level with actor, action, branch, decision, matched rule
- Startup logs: token source name, repo URI, bind address
- Graceful SIGINT shutdown

## Not implemented (by design or "TBD")

- CORS — not configured; add `tower_http::cors` if needed.
- Rate limiting — none.
- Pagination — none (commits/branches/runs return everything; export streams).
- Multi-tenant routing — one repo per process.
46
docs/storage.md
Normal file
@ -0,0 +1,46 @@
# Storage

## L1 — Lance dataset (per node/edge type)

Every node type and every edge type is its own Lance dataset:

- **Columnar Arrow storage**: each property is a column; nullable per Arrow schema.
- **Fragments**: data is partitioned into fragments; new writes create new fragments.
- **Manifest versioning**: every commit produces a new dataset version; old versions remain readable.
- **Stable row IDs**: enabled by OmniGraph for the commit-graph and run-registry datasets so durable references survive compaction.
- **Append / delete / `merge_insert`**: native Lance write modes.
- **Per-dataset branches** (Lance native): copy-on-write at the dataset level.
- **Object-store agnostic**: file://, s3://, gs://, az://, http (read-only via Lance) — OmniGraph wires file:// and s3:// (`storage.rs`).

## L2 — Multi-dataset coordination via `__manifest`

OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordinated through one append-only manifest table.

- **Manifest table**: `__manifest/` Lance dataset.
- **Layout** (`db/manifest/layout.rs`, `db/manifest/state.rs`):
  - `nodes/{fnv1a64-hex(type_name)}` — one Lance dataset per node type
  - `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type
  - `__manifest/` — the catalog of all sub-tables and their published versions
  - `_graph_commits.lance` / `_graph_commit_actors.lance` — the commit graph and its actor map
  - `_graph_runs.lance` / `_graph_run_actors.lance` — the run registry and its actor map
- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`):
  - `object_type` ∈ `table | table_version | table_tombstone`
  - `table_key` ∈ `node:<TypeName> | edge:<EdgeName>`
  - `table_branch` is `null` for the main lineage and the branch name otherwise
- **Snapshot reconstruction**: latest visible `table_version` per `(table_key, table_branch)` minus tombstones whose `tombstone_version >= table_version`.
- **Atomic publish**: multi-dataset commits publish via a `ManifestBatchPublisher` so a single write to `__manifest` flips all the new sub-table versions visible at once.

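The `fnv1a64-hex(...)` directory names in the layout can be reproduced with the standard 64-bit FNV-1a hash. The sketch below uses the published FNV offset basis and prime; hashing the UTF-8 bytes of the type name and rendering the result as 16 lowercase hex digits are assumptions, as is the helper name `node_table_path`.

```rust
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical helper: build the dataset directory for a node type,
/// assuming zero-padded lowercase hex of the hash of the UTF-8 name.
fn node_table_path(type_name: &str) -> String {
    format!("nodes/{:016x}", fnv1a64(type_name.as_bytes()))
}
```

Hashing the name keeps directory names fixed-width and free of characters that are awkward in object-store keys, at the cost of needing the manifest to map them back to `table_key`.
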
## URI scheme support (`storage.rs`)

| Scheme | Backend | Notes |
|---|---|---|
| local path / `file://` | `LocalStorageAdapter` (tokio) | Normalized to absolute paths |
| `s3://bucket/prefix` | `S3StorageAdapter` (object_store) | Honors `AWS_ENDPOINT_URL_S3`, `AWS_ALLOW_HTTP`, `AWS_S3_FORCE_PATH_STYLE` |
| `http(s)://host:port` | HTTP client to `omnigraph-server` | Used by CLI as a target, not a storage backend |

## Object-store env vars (S3-compatible)

- `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`
- `AWS_ENDPOINT_URL`, `AWS_ENDPOINT_URL_S3` — for MinIO / RustFS / GCS-via-XML
- `AWS_S3_FORCE_PATH_STYLE=true` — path-style URLs
- `AWS_ALLOW_HTTP=true` — allow plain HTTP (local dev)
77
scripts/check-agents-md.sh
Executable file
@ -0,0 +1,77 @@
#!/usr/bin/env bash
# Verify that AGENTS.md and docs/ stay in sync.
#
# Two checks:
#   1. Every docs/*.md path linked from AGENTS.md exists on disk.
#   2. Every doc in the canonical set is linked from AGENTS.md.
#
# Exit non-zero on any drift.

set -euo pipefail

repo_root="$(cd "$(dirname "$0")/.." && pwd)"
cd "$repo_root"

agents_file="AGENTS.md"
if [[ ! -f "$agents_file" ]]; then
  echo "error: $agents_file not found" >&2
  exit 1
fi

# Canonical set: every docs/*.md (top-level), plus the releases/ index dir if present.
canonical=()
while IFS= read -r line; do
  canonical+=("$line")
done < <(find docs -mindepth 1 -maxdepth 1 -type f -name '*.md' | sort)
if [[ -d docs/releases ]]; then
  canonical+=("docs/releases/")
fi

# Extract docs/ links from AGENTS.md (markdown link form: (docs/...))
linked=()
while IFS= read -r line; do
  linked+=("$line")
done < <(grep -oE '\(docs/[^)]+\)' "$agents_file" | sed -E 's/^\(|\)$//g' | sort -u)

fail=0

# Check 1: every linked path exists.
for link in "${linked[@]}"; do
  # Strip in-page anchors like #foo
  path="${link%%#*}"
  if [[ "$path" == */ ]]; then
    if [[ ! -d "$path" ]]; then
      echo "error: AGENTS.md links to missing directory: $path" >&2
      fail=1
    fi
  else
    if [[ ! -f "$path" ]]; then
      echo "error: AGENTS.md links to missing file: $path" >&2
      fail=1
    fi
  fi
done

# Check 2: every canonical doc is linked at least once.
for doc in "${canonical[@]}"; do
  found=0
  for link in "${linked[@]}"; do
    path="${link%%#*}"
    if [[ "$path" == "$doc" ]]; then
      found=1
      break
    fi
  done
  if [[ "$found" -eq 0 ]]; then
    echo "error: doc not linked from AGENTS.md: $doc" >&2
    fail=1
  fi
done

if [[ "$fail" -ne 0 ]]; then
  echo >&2
  echo "AGENTS.md / docs/ are out of sync. Either update AGENTS.md links or rename/remove the doc." >&2
  exit 1
fi

echo "AGENTS.md ↔ docs/ links OK (${#linked[@]} links, ${#canonical[@]} docs)."