omnigraph/docs/storage.md
Ragnor Comerford 6f25c4f9f8
Address reviewer feedback (Cursor + cubic) on PR #60
All eight comments verified against source and applied:

- AGENTS.md: pull @docs/{invariants,lance,testing}.md imports out of
  the markdown blockquote. Claude Code's @-import parser expects @ at
  column 0; the leading "> " of a blockquote silently broke
  recognition, so the claimed auto-include did nothing. (Cursor,
  Medium severity.)
- docs/cli-reference.md: command-family count 13 → 17. The current
  enum Command in crates/omnigraph-cli/src/main.rs has 17 top-level
  variants. (cubic P2.)
- docs/ci.md: Homebrew tap update is a regular `git push`, not a
  force-push (release.yml:117 is `git push origin HEAD:main`). (cubic
  P2.)
- docs/errors.md: add the Storage variant to the NanoError list — it
  exists at error.rs:88-89 but the doc enumerated only 10 of 11.
  (cubic P2.)
- docs/storage.md: clarify tombstone semantics. There is no
  tombstone_version column; state.rs:180 reads the tombstone version
  from the table_version column on rows where object_type =
  table_tombstone. (cubic P2.)
- docs/branches-commits.md: split the GraphCommit pseudo-struct from
  the underlying storage. actor_id is joined in-memory from
  _graph_commit_actors.lance, not a column on _graph_commits.lance.
  (cubic P2.)
- docs/schema-language.md: rename IR_VERSION to SCHEMA_IR_VERSION to
  match the actual constant name in catalog/schema_ir.rs:11.
  (cubic P3.)
- docs/testing.md: engine integration test count 16 → 15 (matches
  `ls crates/omnigraph/tests/*.rs`). (cubic P3.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-29 00:09:06 +02:00

3 KiB

Storage

L1 — Lance dataset (per node/edge type)

Every node type and every edge type is its own Lance dataset:

  • Columnar Arrow storage: each property is a column; nullable per Arrow schema.
  • Fragments: data is partitioned into fragments; new writes create new fragments.
  • Manifest versioning: every commit produces a new dataset version; old versions remain readable.
  • Stable row IDs: enabled by OmniGraph for the commit-graph and run-registry datasets so durable references survive compaction.
  • Append / delete / merge_insert: native Lance write modes.
  • Per-dataset branches (Lance native): copy-on-write at the dataset level.
  • Object-store agnostic: file://, s3://, gs://, az://, http (read-only via Lance) — OmniGraph wires file:// and s3:// (storage.rs).

L2 — Multi-dataset coordination via __manifest

OmniGraph is not a single Lance dataset; it is a graph of datasets coordinated through one append-only manifest table.

  • Manifest table: __manifest/ Lance dataset.
  • Layout (db/manifest/layout.rs, db/manifest/state.rs):
    • nodes/{fnv1a64-hex(type_name)} — one Lance dataset per node type
    • edges/{fnv1a64-hex(edge_type_name)} — one Lance dataset per edge type
    • __manifest/ — the catalog of all sub-tables and their published versions
    • _graph_commits.lance / _graph_commit_actors.lance — the commit graph and its actor map
    • _graph_runs.lance / _graph_run_actors.lance — the run registry and its actor map
  • Manifest row schema (object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count):
    • object_type: table | table_version | table_tombstone
    • table_key: node:<TypeName> | edge:<EdgeName>
    • table_branch is null for the main lineage and the branch name otherwise
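  • Snapshot reconstruction: for each (table_key, table_branch), take the latest visible table_version, then drop entries hidden by a tombstone. A tombstone is a row with object_type = table_tombstone. There is no separate tombstone_version column: the tombstone row's own table_version acts as the tombstone version, and it hides an entry when that value is >= the entry's table_version.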
  • Atomic publish: multi-dataset commits publish via a ManifestBatchPublisher so a single write to __manifest flips all the new sub-table versions visible at once.
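
The per-type dataset paths in the layout above can be illustrated with a small sketch. FNV-1a (64-bit) is a standard hash with well-known constants, but everything else here is an assumption for illustration: the `TableKind` enum and `dataset_path` helper are hypothetical names, and the lowercase, zero-padded 16-digit hex format is a guess at what `fnv1a64-hex` means. The actual `layout.rs` may differ.

```rust
/// 64-bit FNV-1a over the UTF-8 bytes of `input` (standard constants).
fn fnv1a64(input: &str) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for byte in input.as_bytes() {
        // FNV-1a: XOR the byte in first, then multiply by the prime.
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Hypothetical: which top-level directory a sub-table lives under.
enum TableKind {
    Node,
    Edge,
}

/// Hypothetical helper: relative dataset path for a node or edge type.
/// Assumes lowercase, zero-padded hex; not confirmed by the source.
fn dataset_path(kind: TableKind, type_name: &str) -> String {
    let dir = match kind {
        TableKind::Node => "nodes",
        TableKind::Edge => "edges",
    };
    format!("{}/{:016x}", dir, fnv1a64(type_name))
}
```

Hashing the type name keeps directory names fixed-width and free of characters that are awkward in object-store keys, at the cost of paths that are not human-readable without the manifest.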
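
The snapshot rule can be sketched as a fold over manifest rows. This is a simplified in-memory model, not the code in `state.rs`: the `ManifestRow` struct keeps only the columns the rule needs, `snapshot` is a hypothetical function name, and the sketch assumes tombstones apply per (table_key, table_branch) and ignores any inheritance between branches.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum ObjectType {
    Table,
    TableVersion,
    TableTombstone,
}

/// Simplified manifest row: only the columns the snapshot rule needs.
#[derive(Clone, Debug)]
struct ManifestRow {
    object_type: ObjectType,
    table_key: String,            // e.g. "node:Person" or "edge:Knows"
    table_version: u64,           // doubles as the tombstone version
    table_branch: Option<String>, // None = main lineage
}

type Lineage = (String, Option<String>);

/// Latest visible version per (table_key, table_branch), minus tombstones.
fn snapshot(rows: &[ManifestRow]) -> HashMap<Lineage, u64> {
    let mut latest: HashMap<Lineage, u64> = HashMap::new();
    let mut tombstone: HashMap<Lineage, u64> = HashMap::new();
    for row in rows {
        let key = (row.table_key.clone(), row.table_branch.clone());
        // Track the highest version seen for entries and for tombstones.
        let slot = match row.object_type {
            ObjectType::TableVersion => latest.entry(key).or_insert(0),
            ObjectType::TableTombstone => tombstone.entry(key).or_insert(0),
            ObjectType::Table => continue,
        };
        *slot = (*slot).max(row.table_version);
    }
    // A tombstone hides an entry when tombstone version >= entry version.
    latest.retain(|key, version| match tombstone.get(key) {
        Some(t) => *t < *version,
        None => true,
    });
    latest
}
```

Because the manifest is append-only, a reader never has to mutate anything to compute this view. Under this model, a table that is tombstoned at version 2 and later republished at version 3 becomes visible again.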

URI scheme support (storage.rs)

| Scheme | Backend | Notes |
| --- | --- | --- |
| local path / file:// | LocalStorageAdapter (tokio) | Normalized to absolute paths |
| s3://bucket/prefix | S3StorageAdapter (object_store) | Honors AWS_ENDPOINT_URL_S3, AWS_ALLOW_HTTP, AWS_S3_FORCE_PATH_STYLE |
| http(s)://host:port | HTTP client to omnigraph-server | Used by CLI as a target, not a storage backend |
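
The scheme dispatch in the table can be sketched as a small classifier. The `Target` enum and `classify` function are hypothetical names, not the API in `storage.rs`, and the absolute-path step is only an approximation of whatever normalization the real adapter performs (it joins relative paths onto the current working directory and does not resolve symlinks or `..`).

```rust
use std::path::{Path, PathBuf};

/// Hypothetical classification of a user-supplied URI.
#[derive(Debug, PartialEq)]
enum Target {
    Local { path: PathBuf },
    S3 { bucket: String, prefix: String },
    Server { base_url: String },
}

/// Join a relative path onto the current directory; leave absolute paths alone.
fn to_absolute(path: &str) -> PathBuf {
    let p = Path::new(path);
    if p.is_absolute() {
        return p.to_path_buf();
    }
    std::env::current_dir()
        .map(|cwd| cwd.join(p))
        .unwrap_or_else(|_| p.to_path_buf())
}

fn classify(uri: &str) -> Target {
    if let Some(rest) = uri.strip_prefix("s3://") {
        // Everything before the first '/' is the bucket; the rest is the prefix.
        let (bucket, prefix) = rest.split_once('/').unwrap_or((rest, ""));
        return Target::S3 {
            bucket: bucket.to_string(),
            prefix: prefix.to_string(),
        };
    }
    if uri.starts_with("http://") || uri.starts_with("https://") {
        // A server target for the CLI, not a storage backend.
        return Target::Server { base_url: uri.to_string() };
    }
    // Bare paths and file:// URIs both map to local storage.
    let path = uri.strip_prefix("file://").unwrap_or(uri);
    Target::Local { path: to_absolute(path) }
}
```

Keeping the HTTP case as a separate variant mirrors the table's distinction: an `http(s)://` address selects a remote omnigraph-server, so no storage adapter is constructed for it.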

Object-store env vars (S3-compatible)

  • AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN
  • AWS_ENDPOINT_URL, AWS_ENDPOINT_URL_S3 — for MinIO / RustFS / GCS-via-XML
  • AWS_S3_FORCE_PATH_STYLE=true — path-style URLs
  • AWS_ALLOW_HTTP=true — allow plain HTTP (local dev)
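
As a rough illustration of how these variables might be gathered, the sketch below reads them through an injected lookup function so it can be exercised without touching the process environment. The `S3Env` struct and `s3_env_from` function are hypothetical, credentials are deliberately left out, and two behaviors are assumptions rather than documented OmniGraph semantics: that `AWS_ENDPOINT_URL_S3` takes precedence over `AWS_ENDPOINT_URL` (the usual AWS SDK convention), and that flags are enabled only by the literal value `true`, compared case-insensitively.

```rust
/// Hypothetical view of the non-secret S3 settings listed above.
#[derive(Debug, Default, PartialEq)]
struct S3Env {
    region: Option<String>,
    endpoint: Option<String>,
    force_path_style: bool,
    allow_http: bool,
}

/// Build the settings from any name -> value lookup.
fn s3_env_from(lookup: impl Fn(&str) -> Option<String>) -> S3Env {
    // Assumption: a flag is on only when its value is "true" (any case).
    let flag = |name: &str| {
        lookup(name).map_or(false, |v| v.eq_ignore_ascii_case("true"))
    };
    S3Env {
        region: lookup("AWS_REGION"),
        // Assumption: the service-specific endpoint wins over the global one.
        endpoint: lookup("AWS_ENDPOINT_URL_S3").or_else(|| lookup("AWS_ENDPOINT_URL")),
        force_path_style: flag("AWS_S3_FORCE_PATH_STYLE"),
        allow_http: flag("AWS_ALLOW_HTTP"),
    }
}
```

In a real process the lookup would typically be `|name| std::env::var(name).ok()`. For a local MinIO-style setup one would expect `endpoint` to be a plain `http://` URL with both `force_path_style` and `allow_http` set.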