mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-09 01:35:18 +02:00

Ragnor Comerford b6440d6b17

Add docs/lance.md — task-organized index of Lance upstream docs

Curates the Lance documentation site (lance.org) into a problem-domain
index so agents fetch the right page when working on Lance-touching
code instead of guessing or grepping our codebase. Organized by topic:
storage format & file layout, branching/tags/time travel, indexes
(scalar + system + vector), reads/writes, schema evolution, object
store, data types, performance, compaction, DataFusion integration,
SDK reference, plus quick-starts and the upstream AGENTS.md.

Skips ~200 irrelevant URLs from the upstream sitemap (Namespace REST
API model surface, Spark/Trino/Databricks/etc. integrations,
Python/Ray/HuggingFace docs, community pages) since omnigraph is
Rust-only and doesn't run a Lance Namespace catalog.

AGENTS.md surfaces it in the topic index and adds a directive: "when
you hit a Lance-shaped problem, consult docs/lance.md and fetch the
upstream URL before guessing."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-04-28 23:48:28 +02:00

15 KiB


OmniGraph — Agent Guide

This file is the always-on map for AI coding agents (Claude Code, Codex, Cursor, Cline) working in this repo. It is loaded into context on every turn, so it stays as a map plus the rules and invariants that need to be in scope at all times — the encyclopedia content lives under docs/. When you need depth, follow a pointer.

Required reading every session: docs/invariants.md. Load this in full before proposing, reviewing, or implementing any change — the §IX deny-list and §X review checklist apply to every PR, not only architecture work. Tools that support @-imports (Claude Code) auto-include it via the import below; other agents must open it explicitly at the start of each session.

@docs/invariants.md

When you hit a Lance-shaped problem (file format, fragments, indexes, transactions, branches/tags, compaction, schema evolution, vector / FTS internals): consult docs/lance.md and fetch the listed upstream URL before grepping our code or guessing. Lance is the substrate; behavior is documented there, not here.

CLAUDE.md is a symlink to this file — there is exactly one source of truth. Edit AGENTS.md.

Version surveyed: 0.3.1
Workspace crates: omnigraph-compiler, omnigraph (engine), omnigraph-cli, omnigraph-server
Storage substrate: Lance 4.x (columnar, versioned, branchable)
License: MIT
Toolchain: Rust stable, edition 2024

Start here — what is this?

OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets. Highlights:

Storage: per node/edge type a separate Lance dataset; multi-dataset commits coordinated atomically through one __manifest table.
Languages: a .pg schema language and a .gq query language, both Pest-based, with a typed IR.
Multi-modal querying: vector ANN (nearest), full-text (search/fuzzy/match_text/bm25), Reciprocal Rank Fusion (rrf), and graph traversal (Expand, anti-join not { … }) in one runtime.
Branches and commits across the whole graph: Git-style — every successful publish appends to a commit DAG; merges are three-way at the row level.
Transactional runs: ephemeral __run__<id> branches for isolated mutation, fast-path or merge-path publish.
HTTP server: Axum + utoipa OpenAPI, bearer auth (SHA-256 hashed, optional AWS Secrets Manager), Cedar policy gating.
CLI driven by a single omnigraph.yaml; multi-format output (json/jsonl/csv/kv/table).
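For readers unfamiliar with the Reciprocal Rank Fusion mentioned in the multi-modal bullet above, here is a minimal sketch of the textbook formula, score(d) = Σ 1 / (k + rank_i(d)). It is not OmniGraph's implementation: the function name `rrf_fuse`, the string ids, and the choice of `k` are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists with Reciprocal Rank Fusion.
/// Each input list is ordered best-first; ranks are 1-based.
/// `k` dampens the weight of top ranks (60 is a commonly used value; an assumption here).
/// Hypothetical helper, not part of the OmniGraph API.
pub fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in rankings {
        for (i, id) in list.iter().enumerate() {
            let rank = (i + 1) as f64;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by descending score; break ties by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&[vector_hits, text_hits], 60.0)` with one list from a vector search and one from a full-text search returns ids ordered by combined score. The score is returned alongside each id, so rank information is not discarded.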

Throughout the docs, capabilities are split into L1 — Inherited from Lance vs L2 — Added by OmniGraph.

Architecture at a glance

CLI (omnigraph)        HTTP Server (omnigraph-server, Axum)
        │                            │
        └─────────────┬──────────────┘
                      ▼
           omnigraph-compiler  ── Pest grammars, catalog, IR, lowering, lint, migration plan
                      │
                      ▼
           omnigraph (engine)  ── ManifestRepo, CommitGraph, RunRegistry, GraphIndex (CSR/CSC), exec
                      │
                      ▼
              Lance 4.x         ── columnar Arrow, fragments, per-dataset versions/branches, indexes
                      │
                      ▼
        Object store (file / s3 / RustFS / MinIO / S3-compat)

Full diagram and concurrency model: docs/architecture.md.

Where to find each topic

Area	Read
Architectural invariants & deny-list (read before any non-trivial proposal or review)	docs/invariants.md
Lance docs index — fetch upstream Lance docs by problem domain	docs/lance.md
Architecture, L1/L2 framing, concurrency model	docs/architecture.md
Storage layout, `__manifest` schema, URI schemes, S3 env vars	docs/storage.md
`.pg` schema language, types, constraints, annotations, migration planning	docs/schema-language.md
`.gq` query language, MATCH/RETURN/ORDER, search funcs, mutations, IR ops, lint codes	docs/query-language.md
Indexes (BTREE / inverted / vector / graph topology)	docs/indexes.md
Embeddings (compiler + engine clients, env vars, `@embed`)	docs/embeddings.md
Branches, commit graph, snapshots, system branches	docs/branches-commits.md
Runs (transactional graph mutations, `__run__<id>`, publish paths)	docs/runs.md
Three-way merge and conflict kinds	docs/merge.md
Diff / change feed (`diff_between`, `diff_commits`)	docs/changes.md
Query execution, mutation execution, bulk loader, `load` vs `ingest`	docs/execution.md
`optimize` (compaction) and `cleanup` (version GC)	docs/maintenance.md
Cedar policy actions, scopes, CLI	docs/policy.md
HTTP server endpoints, auth, error model, body limits	docs/server.md
CLI quick-start	docs/cli.md
CLI command surface and `omnigraph.yaml` schema	docs/cli-reference.md
Audit / actor tracking	docs/audit.md
Error taxonomy and result serialization	docs/errors.md
Install (binary / Homebrew / source / channels)	docs/install.md
Deployment (binary / container / RustFS bootstrap / auth / build variants)	docs/deployment.md
CI / release workflows	docs/ci.md
Constants & tunables cheat sheet	docs/constants.md
Per-version release notes	docs/releases/

Always-on rules (load these into your working memory)

These are architectural rules that need to be in scope on every change. They're framed at the level that survives renames and refactors — the deeper implementation specifics (function names, lock names, branch-prefix conventions, enforcement points) live in the per-area docs and may evolve. The full architectural invariants and deny-list are in docs/invariants.md; §IX (deny-list) is the fastest first-pass when reviewing any change.

Multi-dataset publish is atomic across the whole graph. A graph commit makes every relevant sub-table version visible together, in one manifest write. Don't introduce code paths that publish per sub-table outside the unified publish path; doing so loses cross-table snapshot isolation.
Snapshot isolation per query. A query holds one snapshot for its lifetime. Don't re-read the current head mid-query.
Mutations are atomic at the commit boundary. Multi-statement change queries publish one commit. Don't commit per-statement.
Bearer-token plaintext never persists in process memory. Tokens are hashed at startup; auth uses constant-time comparison; the actor id is server-resolved from the hash match and must not be settable by the client.
Reads always see the current index state for the branch they're reading. Indexes track the branch head, not historical snapshots. If you change index lifecycle, preserve this guarantee.
Stable type IDs survive renames. Schema migration relies on identity that's stable across rename — don't mint new IDs on rename.
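The first two rules can be pictured with a small in-memory model. This is a sketch under stated assumptions, not the engine's code: `Manifest`, `Snapshot`, and the table names are hypothetical stand-ins, and the real manifest is a Lance table rather than a lock-protected map. The point is only the shape: a publish swaps the whole table-to-version map in one write, and a query holds one immutable snapshot for its lifetime.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// One immutable graph snapshot: sub-table name -> dataset version.
pub type Snapshot = Arc<HashMap<String, u64>>;

/// Hypothetical in-memory stand-in for the manifest head.
pub struct Manifest {
    head: RwLock<Snapshot>,
}

impl Manifest {
    pub fn new() -> Self {
        Manifest { head: RwLock::new(Arc::new(HashMap::new())) }
    }

    /// A query calls this once and keeps the result for its whole lifetime,
    /// so later publishes never change what it sees.
    pub fn snapshot(&self) -> Snapshot {
        let guard = self.head.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Publish new versions for several sub-tables in a single head swap,
    /// so readers observe either all of the updates or none of them.
    pub fn publish(&self, updates: &[(&str, u64)]) {
        let mut head = self.head.write().unwrap();
        let mut next: HashMap<String, u64> = (**head).clone();
        for (table, version) in updates {
            next.insert((*table).to_string(), *version);
        }
        *head = Arc::new(next);
    }
}
```

A reader that took its snapshot before a publish keeps seeing the old versions of every sub-table. A reader that starts afterwards sees all of the new ones, and no reader sees a mix.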

Deny-list (fast-pass review filter — full reasoning in docs/invariants.md §IX)

If a proposal fits one of these, the burden is on the proposer to justify why this case is the exception:

Synchronous-inline index updates for indexes expensive to build (vector ANN, FTS) — use the reconciler pattern.
Custom WAL / transaction manager / buffer pool — Lance owns these.
Job queue for state derivable from manifest — reconciler pattern instead.
Per-feature lowering for shapes that share a structure (interfaces, wildcards, alternation) — use one mechanism.
Eager materialization of cross-products in multi-hop — factorize; flatten only when needed.
Ad-hoc IN-list filtering when SIP fits.
String-flattened SQL filter generation when structured pushdown is available.
In-process-only Dataset impls — Send + Sync, remote descriptors.
Cost-blind plan choice — lowering-order execution is not a planner.
Hidden statistics — if a metric matters for plan choice, it must be exposed through the trait surface.
Side-channels for query semantics — search modes, mutations, polymorphism are first-class IR concepts.
Discarding rank in retrieval — score and rank propagate as columns.
State that drifts from the manifest — derive from observable state.
Cloud-only correctness fixes — correctness is always OSS.
Forking the codebase for Cloud — trait-extension only.
Hand-rolling something Lance already does — check the spec first.
Mutating in place state that should be immutable (Lance fragments, index segments) — new segments instead.
Silent failures — OOM, timeout, partial result must all be surfaced and bounded.
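Several deny-list entries point at the reconciler pattern. As a rough illustration only, a reconciler derives its work list by comparing observable state instead of consuming a job queue. The types and the notion of an "indexed at version" map below are hypothetical; the actual mechanism is described in docs/invariants.md and docs/indexes.md.

```rust
use std::collections::BTreeMap;

/// Work the reconciler would schedule for one table (hypothetical shape).
#[derive(Debug, PartialEq, Eq)]
pub struct IndexWork {
    pub table: String,
    /// Version the existing index covers; `None` means no index has been built yet.
    pub from_version: Option<u64>,
    pub to_version: u64,
}

/// Compare manifest head versions against the version each index covers
/// and return the tables whose index is missing or behind.
/// Nothing is stored between calls: the result is a pure function of the inputs.
pub fn reconcile(
    manifest_heads: &BTreeMap<String, u64>,
    indexed_at: &BTreeMap<String, u64>,
) -> Vec<IndexWork> {
    manifest_heads
        .iter()
        .filter_map(|(table, &head)| {
            let covered = indexed_at.get(table).copied();
            match covered {
                Some(v) if v >= head => None, // index already up to date
                _ => Some(IndexWork {
                    table: table.clone(),
                    from_version: covered,
                    to_version: head,
                }),
            }
        })
        .collect()
}
```

The work list is recomputed from the two maps on every pass, so it cannot drift from the manifest. A crashed or skipped pass is repaired by simply running the function again.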

Quick-reference flows

# Initialize an S3-backed repo
omnigraph init --schema ./schema.pg s3://my-bucket/repo.omni

# Bulk load
omnigraph load --data ./seed.jsonl --mode overwrite s3://my-bucket/repo.omni

# Branch + ingest a review batch
omnigraph branch create --from main review/2026-04-25 s3://my-bucket/repo.omni
omnigraph ingest --branch review/2026-04-25 --data ./batch.jsonl s3://my-bucket/repo.omni

# Run a hybrid (vector + BM25) query
omnigraph read --query ./queries.gq --name find_similar \
  --params '{"q":"trends in AI safety"}' --format table s3://my-bucket/repo.omni

# Plan + apply schema migration
omnigraph schema plan  --schema ./next.pg s3://my-bucket/repo.omni
omnigraph schema apply --schema ./next.pg s3://my-bucket/repo.omni --json

# Merge review branch back
omnigraph branch merge review/2026-04-25 --into main s3://my-bucket/repo.omni

# Compact + GC (preview, then confirm)
omnigraph optimize s3://my-bucket/repo.omni
omnigraph cleanup  --keep 10 --older-than 7d s3://my-bucket/repo.omni
omnigraph cleanup  --keep 10 --older-than 7d --confirm s3://my-bucket/repo.omni

# Stand up the HTTP server (token from env)
OMNIGRAPH_SERVER_BEARER_TOKEN=xxxx \
  omnigraph-server s3://my-bucket/repo.omni --bind 0.0.0.0:8080

# Cedar policy explain
omnigraph policy explain --actor act-alice --action change --branch main

Capability matrix: "Lance by default vs. added by OmniGraph"

Capability	L1 (Lance default)	L2 (OmniGraph adds)
Columnar storage on object store	✅ Arrow/Lance	URI normalization, S3 env-var plumbing
Per-dataset versioning + time travel	✅	`snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables
Per-dataset branches	✅	Graph-level branches (atomic across all sub-tables), lazy fork, system branch filtering
Atomic single-dataset commits	✅	Atomic multi-dataset publish via `__manifest` + `ManifestBatchPublisher`
Compaction (`compact_files`)	✅	`omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency
Cleanup (`cleanup_old_versions`)	✅	`omnigraph cleanup` with `--keep` / `--older-than` policy
BTREE / inverted (FTS) / vector indexes	✅	`ensure_indices` builds them on every relevant column; idempotent; lazy across branches
`merge_insert` upsert	✅	`LoadMode::Merge`, mutation `update`/`insert`/`delete` lowering
Vector search	✅	`nearest()` query op; embedding pipeline (Gemini / OpenAI clients); `@embed` in schema
Full-text search	✅	`search/fuzzy/match_text/bm25` query ops
Hybrid ranking	—	`rrf(...)` Reciprocal Rank Fusion in one runtime
Graph traversal	—	CSR/CSC topology index, `Expand` IR op, variable-length hops, `not { }` anti-join
Schema language	—	`.pg` + Pest grammar + catalog + interfaces + constraints + annotations
Query language	—	`.gq` + Pest grammar + IR + lowering + linter
Schema migration planning	—	`plan_schema_migration` + `apply_schema` step types + `__schema_apply_lock__`
Commit graph (DAG) across whole repo	—	`_graph_commits.lance` with linear + merge parents, ULID ids, actor map
Transactional runs	—	`_graph_runs.lance`, `__run__<id>` ephemeral branches, fast-path & merge-path publish
Three-way row-level merge	—	`OrderedTableCursor` + `StagedTableWriter`, structured `MergeConflictKind`
Change feeds	—	`diff_between` / `diff_commits` with manifest fast path + ID streaming
Cedar policy	—	10 actions, branch / target_branch / protected scopes, validate/test/explain CLI
HTTP server	—	Axum, OpenAPI via utoipa, bearer auth (SHA-256, AWS Secrets Manager option), policy gating, NDJSON streaming export
CLI with config	—	`omnigraph.yaml`, aliases, multi-format output (json/jsonl/csv/kv/table)
Audit / actor tracking	—	`_as` write APIs + actor maps in commit & run datasets
Local RustFS bootstrap	—	`scripts/local-rustfs-bootstrap.sh` one-shot S3-backed dev environment
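The "CSR/CSC topology index" row refers to the Compressed Sparse Row layout (and its column-wise twin). The sketch below shows the generic data structure only; the `Csr` type, its field names, and the dense `u32` node ids are assumptions for illustration and do not describe how `GraphIndex` is actually built.

```rust
/// Compressed Sparse Row adjacency: the out-neighbors of node `u` are
/// `targets[offsets[u]..offsets[u + 1]]`.
pub struct Csr {
    offsets: Vec<usize>,
    targets: Vec<u32>,
}

impl Csr {
    /// Build from an edge list over dense node ids `0..num_nodes`.
    pub fn from_edges(num_nodes: usize, edges: &[(u32, u32)]) -> Self {
        // Count the out-degree of each source, shifted by one slot,
        // then turn the counts into row start offsets with a prefix sum.
        let mut offsets = vec![0usize; num_nodes + 1];
        for &(src, _) in edges {
            offsets[src as usize + 1] += 1;
        }
        for i in 0..num_nodes {
            offsets[i + 1] += offsets[i];
        }
        // Scatter each target into its row, tracking one write cursor per row.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        for &(src, dst) in edges {
            let slot = &mut cursor[src as usize];
            targets[*slot] = dst;
            *slot += 1;
        }
        Csr { offsets, targets }
    }

    /// Out-neighbors of `node` as a contiguous slice.
    pub fn neighbors(&self, node: u32) -> &[u32] {
        let u = node as usize;
        &self.targets[self.offsets[u]..self.offsets[u + 1]]
    }
}
```

Building the same structure from the reversed edge list `(dst, src)` yields the CSC view, which answers "who points at this node" with the same slice lookup.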

Maintenance contract for agents

When you change something user-visible, update the relevant docs/<area>.md in the same change. Pointers from this file to that doc must keep working — CI enforces cross-link integrity via scripts/check-agents-md.sh.

When proposing or reviewing a non-trivial change, walk docs/invariants.md — at minimum the §IX deny-list and §X review checklist. Add to the deny-list when a new anti-pattern surfaces; relaxing an invariant requires the same review process as code.

Rules:

Update in the same PR. New endpoint, query function, CLI flag, env var, constant, schema construct, or invariant: update both the source code and the doc in the same change. Never split documentation drift into a follow-up.
Bump version on release. When a release boundary is crossed (e.g. v0.3.1 → v0.3.2), update the version line at the top of this file and add a docs/releases/<version>.md describing the user-visible delta. Update docs/architecture.md only if the architecture itself changed.
Don't lie. If a section becomes wrong but you can't rewrite it fully right now, replace the wrong line with *(stale — needs update after <change>)* rather than leaving silently incorrect text. Then fix it ASAP.
Re-verify before recommending. If you cite a flag, env var, endpoint, or constant to the user or in code, grep for it in source first. Memory and docs go stale; the code is authoritative.
Keep AGENTS.md a map, not an encyclopedia. New deep content goes into docs/. Add an entry to "Where to find each topic" instead of pasting prose into this file. The "Always-on rules" section is the exception — it's for invariants that should always be in scope.
Re-read on schema/query/IR changes. Edits to schema.pest, query.pest, ir/lower.rs, query/typecheck.rs, or query/lint.rs should trigger a re-read of docs/schema-language.md, docs/query-language.md, and docs/execution.md to confirm they still describe reality.

CI check: scripts/check-agents-md.sh verifies that every docs/*.md link in this file resolves and that every doc in the canonical set is linked. Run it locally before opening a PR if you've moved or renamed docs.
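The two properties the CI check is described as enforcing (every link resolves, every canonical doc is linked) amount to two set differences. The sketch below is a hypothetical model of that logic and is not a transcription of scripts/check-agents-md.sh, whose contents are not shown here. The function names and the tokenization rule are assumptions.

```rust
use std::collections::BTreeSet;

/// Collect every `docs/<name>.md` path mentioned in `text`.
/// Tokens are split on any character that cannot appear in such a path,
/// and a trailing sentence period is trimmed.
pub fn linked_docs(text: &str) -> BTreeSet<String> {
    text.split(|c: char| !(c.is_ascii_alphanumeric() || matches!(c, '/' | '.' | '_' | '-')))
        .map(|tok| tok.trim_end_matches('.'))
        .filter(|tok| tok.starts_with("docs/") && tok.ends_with(".md"))
        .map(str::to_string)
        .collect()
}

/// Returns (links that do not resolve, canonical docs that are never linked).
/// `canonical` stands in for the set of doc files that exist in the repo.
pub fn check_links(text: &str, canonical: &BTreeSet<String>) -> (Vec<String>, Vec<String>) {
    let linked = linked_docs(text);
    let broken = linked.difference(canonical).cloned().collect();
    let unlinked = canonical.difference(&linked).cloned().collect();
    (broken, unlinked)
}
```

Both returned lists being empty corresponds to a passing check. A non-empty first list means a pointer in this file went stale, and a non-empty second list means a doc exists that the map does not mention.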
