omnigraph/docs/user/storage.md
Ragnor Comerford e62d9166fb
fix: optimize publishes compaction; recovery roll-back converges manifest (#141)
2026-06-08 02:50:12 +03:00


Storage

L1 — Lance dataset (per node/edge type)

Every node type and every edge type is its own Lance dataset:

  • Columnar Arrow storage: each property is a column; nullable per Arrow schema.
  • Fragments: data is partitioned into fragments; new writes create new fragments.
  • Manifest versioning: every commit produces a new dataset version; old versions remain readable.
  • Stable row IDs: enable_stable_row_ids: true is set on every Lance dataset OmniGraph creates — node and edge data tables, __manifest, _graph_commits.lance, _graph_commit_recoveries.lance, and any future system tables. This is an architectural invariant: the flag is one-way at dataset create per Lance's row-id-lineage spec, so a future change that introduces a Lance dataset must preserve it. Consequences: _row_created_at_version and _row_last_updated_at_version are available on every dataset (load-bearing for change-feed validators); CreateIndex × Rewrite is not a retryable conflict, so indices survive omnigraph optimize without needing the Fragment Reuse Index; readers must use a Lance build that recognises the flag (our pinned 4.0.0 is fine). Pre-0.4.x graphs created before this code path settled may have datasets without the flag and cannot be retrofitted in place — the supported path is dump-and-reload. The stage_overwrite rewrite path (used by schema_apply) preserves the flag through Operation::Overwrite; pinned by stage_overwrite_preserves_stable_row_ids in crates/omnigraph/tests/staged_writes.rs.
  • Append / delete / merge_insert: native Lance write modes.
  • Per-dataset branches (Lance native): copy-on-write at the dataset level.
  • Object-store agnostic: file://, s3://, gs://, az://, http (read-only via Lance) — OmniGraph wires file:// and s3:// (storage.rs).

L2 — Multi-dataset coordination via __manifest

OmniGraph is not a single Lance dataset; it is a graph of datasets coordinated through one append-only manifest table.

  • Manifest table: __manifest/ Lance dataset.
  • Layout (db/manifest/layout.rs, db/manifest/state.rs):
    • nodes/{fnv1a64-hex(type_name)} — one Lance dataset per node type
    • edges/{fnv1a64-hex(edge_type_name)} — one Lance dataset per edge type
    • __manifest/ — the catalog of all sub-tables and their published versions
    • _graph_commits.lance / _graph_commit_actors.lance — the commit graph and its actor map
    • (legacy _graph_runs.lance / _graph_run_actors.lance from pre-v0.4.0 graphs are inert; the run state machine was removed in MR-771. The v2→v3 manifest migration sweeps stale __run__* branches on first write-open; the inert dataset bytes themselves remain until a delete_prefix storage primitive lands)
  • Manifest row schema (object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count):
    • object_type: table | table_version | table_tombstone
    • table_key: node:<TypeName> | edge:<EdgeName>
    • table_branch is null for the main lineage and the branch name otherwise
  • Snapshot reconstruction: for each (table_key, table_branch), take the latest visible table_version, excluding entries hidden by a tombstone. A tombstone is a row with object_type = table_tombstone; its own table_version acts as the tombstone version and hides every entry whose table_version is less than or equal to it.
  • Atomic publish: multi-dataset commits publish via a ManifestBatchPublisher so a single write to __manifest flips all the new sub-table versions visible at once.
  • Row-level CAS on the merge-insert join key: object_id carries lance-schema:unenforced-primary-key=true so Lance's bloom-filter conflict resolver rejects two concurrent commits that land the same object_id row. Without this annotation, Lance's transparent rebase would admit silent duplicates of version:T@v=N from racing publishers (see .context/merge-insert-cas-granularity.md).
  • Optimistic concurrency control on publish: ManifestBatchPublisher::publish accepts an expected_table_versions: HashMap<table_key, u64> map. Each entry asserts the manifest's current latest non-tombstoned version for that table is exactly what the caller observed; mismatches surface as OmniError::Manifest with ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected, actual }. An empty map preserves the legacy "best-effort publish" semantics. The publisher uses conflict_retries(0) against Lance and owns retry itself (PUBLISHER_RETRY_BUDGET = 5), re-running the pre-check on each iteration so concurrent advances surface as ExpectedVersionMismatch rather than being silently rebased through.
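The snapshot rule and the expected-version pre-check described above can be sketched in memory. This is a minimal illustration under stated assumptions, not the OmniGraph implementation: `ManifestRow` is trimmed to four columns, `snapshot` and `check_expected` are hypothetical names, the branch filter matches rows exactly (it does not model any branch inheritance), and `actual` is modelled as `Option<u64>` because the real field type is not shown here.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum ObjectType {
    Table,
    TableVersion,
    TableTombstone,
}

/// Trimmed-down manifest row (assumption: the real schema has more columns).
#[derive(Debug, Clone)]
struct ManifestRow {
    object_type: ObjectType,
    table_key: String,
    table_branch: Option<String>, // None = main lineage
    table_version: u64,
}

/// Latest visible version per table_key on one branch, minus tombstoned entries.
fn snapshot(rows: &[ManifestRow], branch: Option<&str>) -> HashMap<String, u64> {
    // Highest tombstone version seen per table_key on this branch.
    let mut tombstones: HashMap<&str, u64> = HashMap::new();
    for r in rows {
        if r.table_branch.as_deref() == branch && r.object_type == ObjectType::TableTombstone {
            let t = tombstones.entry(r.table_key.as_str()).or_insert(r.table_version);
            *t = (*t).max(r.table_version);
        }
    }

    let mut latest: HashMap<String, u64> = HashMap::new();
    for r in rows {
        if r.table_branch.as_deref() != branch || r.object_type != ObjectType::TableVersion {
            continue;
        }
        // An entry is hidden when a tombstone version is >= the entry's version.
        let hidden = tombstones
            .get(r.table_key.as_str())
            .map_or(false, |t| *t >= r.table_version);
        if hidden {
            continue;
        }
        let v = latest.entry(r.table_key.clone()).or_insert(r.table_version);
        *v = (*v).max(r.table_version);
    }
    latest
}

/// Hypothetical stand-in for ManifestConflictDetails::ExpectedVersionMismatch.
#[derive(Debug, PartialEq)]
struct ExpectedVersionMismatch {
    table_key: String,
    expected: u64,
    actual: Option<u64>,
}

/// Pre-check: every expected entry must equal the current visible version.
fn check_expected(
    current: &HashMap<String, u64>,
    expected: &HashMap<String, u64>,
) -> Result<(), ExpectedVersionMismatch> {
    for (key, want) in expected {
        let actual = current.get(key).copied();
        if actual != Some(*want) {
            return Err(ExpectedVersionMismatch {
                table_key: key.clone(),
                expected: *want,
                actual,
            });
        }
    }
    Ok(())
}
```

An empty `expected` map passes trivially, which mirrors the best-effort behaviour described for an empty map. A retrying publisher would re-run `snapshot` and `check_expected` on each attempt so that a concurrent advance is reported instead of rebased through.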

Internal schema versioning (db/manifest/migrations.rs)

The on-disk shape of __manifest is reconciled with the binary via a single stamp + dispatcher. INTERNAL_MANIFEST_SCHEMA_VERSION declares the shape this binary writes; the on-disk stamp omnigraph:internal_schema_version lives in the manifest dataset's schema-level metadata (Lance update_schema_metadata).

  • init_manifest_graph stamps the current version at creation, so newly initialized graphs never need migration.
  • Publisher open-for-write path (load_publish_state) calls migrate_internal_schema(&mut dataset) before reading state. When the on-disk stamp matches the binary, this is a single metadata read with no writes; otherwise the dispatcher walks match-arm steps forward (1→2, 2→3, …) until the stamp matches, then proceeds with the publish. Reads stay side-effect-free.
  • Forward-version protection: a stamp higher than the binary's known version triggers a clear "upgrade omnigraph first" error. An old binary cannot clobber a newer schema by silently treating "unknown stamp" as "missing stamp".
  • Idempotency: each migration step is safe to re-run. A crash between two metadata updates inside a single step leaves the partial state; the next open re-runs the step and the second update lands. The dispatcher itself is a cheap stamp-read on the steady-state path.

Adding a new on-disk shape change is one constant bump (INTERNAL_MANIFEST_SCHEMA_VERSION), one match arm in migrate_internal_schema, and one test. No code outside this module branches on the stamp.
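The stamp-and-dispatcher pattern can be sketched as a loop over match arms. This is a hedged, in-memory illustration only: `ManifestState`, `MigrateError`, and the `PK_FLAG` key are hypothetical stand-ins (the real primary-key annotation lives on the `object_id` field, and the real function operates on a Lance dataset), and the return value (number of steps run) is an assumption made for demonstration.

```rust
use std::collections::HashMap;

const STAMP_KEY: &str = "omnigraph:internal_schema_version";
const INTERNAL_MANIFEST_SCHEMA_VERSION: u32 = 3;
// Hypothetical marker standing in for the object_id primary-key annotation.
const PK_FLAG: &str = "demo:object_id_pk_annotated";

#[derive(Debug, PartialEq)]
enum MigrateError {
    /// On-disk stamp is newer than this binary knows: upgrade first.
    UpgradeRequired { on_disk: u32, known: u32 },
    BadStamp(String),
}

/// In-memory stand-in for the manifest dataset (assumption).
struct ManifestState {
    metadata: HashMap<String, String>,
    branches: Vec<String>,
}

fn read_stamp(metadata: &HashMap<String, String>) -> Result<u32, MigrateError> {
    match metadata.get(STAMP_KEY) {
        None => Ok(1), // v1 is implicit (pre-stamp)
        Some(raw) => raw
            .parse::<u32>()
            .map_err(|_| MigrateError::BadStamp(raw.clone())),
    }
}

/// Walks the stamp forward one step at a time; returns how many steps ran.
fn migrate_internal_schema(state: &mut ManifestState) -> Result<u32, MigrateError> {
    let mut steps_run = 0;
    loop {
        let stamp = read_stamp(&state.metadata)?;
        if stamp > INTERNAL_MANIFEST_SCHEMA_VERSION {
            return Err(MigrateError::UpgradeRequired {
                on_disk: stamp,
                known: INTERNAL_MANIFEST_SCHEMA_VERSION,
            });
        }
        let next: u32 = match stamp {
            // Steady state: a single stamp read and no writes.
            INTERNAL_MANIFEST_SCHEMA_VERSION => return Ok(steps_run),
            1 => {
                // 1 -> 2: engage row-level CAS (idempotent insert).
                state.metadata.insert(PK_FLAG.to_string(), "true".to_string());
                2
            }
            2 => {
                // 2 -> 3: sweep legacy __run__* branches (idempotent retain).
                state.branches.retain(|b| !b.starts_with("__run__"));
                3
            }
            other => return Err(MigrateError::BadStamp(other.to_string())),
        };
        // The stamp is written last, so a crash mid-step re-runs the step.
        state.metadata.insert(STAMP_KEY.to_string(), next.to_string());
        steps_run += 1;
    }
}
```

Writing the stamp only after the step body is what makes a re-run after a crash safe in this sketch: each body is idempotent, and an unstamped partial step is simply repeated on the next open.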

| Stamp | Shape change |
|---|---|
| v1 (implicit, pre-stamp) | __manifest.object_id had no PK annotation; publisher had no row-level CAS protection. |
| v2 | __manifest.object_id carries lance-schema:unenforced-primary-key=true; row-level CAS engaged. Stamped as omnigraph:internal_schema_version=2. |
| v3 | One-time sweep of legacy __run__* staging branches (pre-v0.4.0 Run state machine, removed MR-771) off __manifest. Runs at Omnigraph::open(ReadWrite) and on publish. Stamped as omnigraph:internal_schema_version=3. |

On-disk layout

A graph on disk is a directory tree of Lance datasets. Each dataset follows the standard Lance layout (_versions/, data/, _indices/, _refs/); OmniGraph adds the multi-dataset coordination by keeping __manifest/ alongside the per-type datasets.

```mermaid
flowchart TB
    classDef l1 fill:#fef3e8,stroke:#c46900,color:#000
    classDef l2 fill:#e8f4fd,stroke:#1e6aa8,color:#000

    graph["graph URI<br/>file:// or s3://bucket/prefix"]:::l2

    manifest["__manifest/<br/>L2 catalog of sub-tables"]:::l2
    nodes["nodes/{fnv1a64-hex}/<br/>one dataset per node type"]:::l2
    edges["edges/{fnv1a64-hex}/<br/>one dataset per edge type"]:::l2
    cgraph["_graph_commits.lance/<br/>_graph_commit_actors.lance/<br/>_graph_commit_recoveries.lance/"]:::l2
    recovery["__recovery/{ulid}.json<br/>recovery sidecars (transient)"]:::l2
    refs["_refs/branches/{name}.json<br/>graph-level branches"]:::l2

    graph --> manifest
    graph --> nodes
    graph --> edges
    graph --> cgraph
    graph --> recovery
    graph --> refs

    subgraph dataset[Inside each Lance dataset — L1]
        ds_v["_versions/{n}.manifest<br/>per-dataset versions"]:::l1
        ds_data["data/<br/>fragment files (Arrow IPC)"]:::l1
        ds_idx["_indices/{uuid}/<br/>BTREE · Inverted FTS · IVF/HNSW"]:::l1
        ds_refs["_refs/<br/>per-dataset Lance branches/tags"]:::l1
        ds_tx["_transactions/<br/>commit transaction logs"]:::l1
    end

    nodes -.-> dataset
    edges -.-> dataset
    manifest -.-> dataset
```

What's where:

  • Graph root is one directory (or S3 prefix). Everything below is part of one OmniGraph graph.
  • __manifest/ is a Lance dataset whose rows describe which sub-table version is published at which graph-branch. Reading a snapshot starts here.
  • nodes/ and edges/ are sibling directories holding one Lance dataset per declared type. Names are fnv1a64-hex of the type name to keep paths fixed-length and case-safe.
  • _graph_commits.lance is an L2 dataset that records the graph-level commit DAG, with a paired _graph_commit_actors.lance for the actor map. (Pre-v0.4.0 graphs also have inert _graph_runs.lance / _graph_run_actors.lance from the removed Run state machine; the v2→v3 migration sweeps their stale __run__* branches, and the dataset bytes are reclaimed once delete_prefix lands.)
  • _graph_commit_recoveries.lance — one row per recovery sweep action. Joined to _graph_commits.lance by graph_commit_id; the linked commit row carries actor_id=omnigraph:recovery. Operators correlate recoveries with the original mutations they rolled forward / back via this join. See crates/omnigraph/src/db/recovery_audit.rs.
  • __recovery/{ulid}.json — transient sidecar files written by the five migrated writers (MutationStaging::finalize, schema_apply, branch_merge, ensure_indices, optimize_all_tables) before Phase B begins, deleted after Phase C succeeds. A sidecar persisting after process exit means the writer crashed in the Phase B → Phase C window; the next Omnigraph::open recovery sweep processes it. Steady-state directory is empty. See crates/omnigraph/src/db/manifest/recovery.rs.
  • _refs/branches/{name}.json is graph-level branch metadata — pointers from a branch name to the manifest version it heads.
  • Inside each Lance dataset (orange): the standard Lance directory layout. _versions/{n}.manifest records every commit; data/ holds the actual Arrow fragments; _indices/{uuid}/ holds index segments with their own fragment_bitmap for partial coverage; _refs/ holds Lance-native per-dataset branches and tags.
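The fixed-length, case-safe directory names mentioned above can be illustrated with a plain FNV-1a 64-bit hash. This is a sketch under assumptions: it hashes the UTF-8 bytes of the type name as given and renders 16 lowercase hex digits; whether OmniGraph normalizes the name first or formats the digest differently is not stated here, and `node_dataset_path` / `edge_dataset_path` are hypothetical helper names.

```rust
// Standard FNV-1a 64-bit parameters.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// FNV-1a: XOR each byte into the hash, then multiply by the prime.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Zero-padded, lowercase, always 16 characters (assumed formatting).
fn fnv1a64_hex(name: &str) -> String {
    format!("{:016x}", fnv1a64(name.as_bytes()))
}

/// Hypothetical helpers mirroring the nodes/{..} and edges/{..} layout.
fn node_dataset_path(type_name: &str) -> String {
    format!("nodes/{}", fnv1a64_hex(type_name))
}

fn edge_dataset_path(edge_type_name: &str) -> String {
    format!("edges/{}", fnv1a64_hex(edge_type_name))
}
```

Because the directory name is a digest, it has the same length for every type name and contains only `0-9a-f`, so it behaves the same on case-insensitive file systems and object stores.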

The split — L2 owns the cross-dataset catalog; L1 owns the per-dataset internals — means that schema work (which adds or removes datasets) updates __manifest, while data work (which adds fragments) updates _versions/ inside the affected dataset and then bumps __manifest.

URI scheme support (storage.rs)

| Scheme | Backend | Notes |
|---|---|---|
| local path / file:// | LocalStorageAdapter (tokio) | Normalized to absolute paths |
| s3://bucket/prefix | S3StorageAdapter (object_store) | Honors AWS_ENDPOINT_URL_S3, AWS_ALLOW_HTTP, AWS_S3_FORCE_PATH_STYLE |
| http(s)://host:port | HTTP client to omnigraph-server | Used by CLI as a target, not a storage backend |
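The scheme dispatch in the table can be sketched as a small classifier. This is an illustration only: `Target` and `classify` are hypothetical names, the current directory is passed in explicitly to keep the function deterministic, and the actual parsing in storage.rs may differ (for example in how it handles `file://` hosts or trailing slashes).

```rust
use std::path::{Path, PathBuf};

/// Hypothetical classification of a graph URI.
#[derive(Debug, PartialEq)]
enum Target {
    /// Local path or file://, normalized to an absolute path.
    Local(PathBuf),
    /// s3://bucket/prefix
    S3 { bucket: String, prefix: String },
    /// http(s)://host:port, a server target rather than a storage backend.
    Server(String),
}

fn classify(uri: &str, cwd: &Path) -> Target {
    if let Some(rest) = uri.strip_prefix("s3://") {
        // Everything before the first '/' is the bucket; the rest is the prefix.
        let (bucket, prefix) = rest.split_once('/').unwrap_or((rest, ""));
        return Target::S3 {
            bucket: bucket.to_string(),
            prefix: prefix.to_string(),
        };
    }
    if uri.starts_with("http://") || uri.starts_with("https://") {
        return Target::Server(uri.to_string());
    }
    // Anything else is treated as a local path, with or without file://.
    let raw = uri.strip_prefix("file://").unwrap_or(uri);
    let path = Path::new(raw);
    let absolute = if path.is_absolute() {
        path.to_path_buf()
    } else {
        cwd.join(path)
    };
    Target::Local(absolute)
}
```

A caller would typically pass `std::env::current_dir()?` as `cwd`; taking it as a parameter keeps relative-path normalization easy to test.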

Object-store env vars (S3-compatible)

  • AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN
  • AWS_ENDPOINT_URL, AWS_ENDPOINT_URL_S3 — for MinIO / RustFS / GCS-via-XML
  • AWS_S3_FORCE_PATH_STYLE=true — path-style URLs
  • AWS_ALLOW_HTTP=true — allow plain HTTP (local dev)
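As a rough sketch of how these variables might be collected into one configuration value, the function below reads them through an injected lookup. The `S3Env` struct and `s3_env_from` are hypothetical; the precedence of AWS_ENDPOINT_URL_S3 over AWS_ENDPOINT_URL and the case-insensitive `true` parsing are assumptions that follow common AWS tooling conventions, not confirmed OmniGraph behaviour. Credentials are omitted for brevity.

```rust
/// Hypothetical view of the S3-related environment (credentials omitted).
#[derive(Debug, PartialEq, Default)]
struct S3Env {
    region: Option<String>,
    endpoint: Option<String>,
    force_path_style: bool,
    allow_http: bool,
}

/// Builds the config from any lookup function, e.g. `|k| std::env::var(k).ok()`.
fn s3_env_from(get: impl Fn(&str) -> Option<String>) -> S3Env {
    // Assumption: a flag is on only when its value is "true" (any case).
    let flag = |name: &str| get(name).map_or(false, |v| v.eq_ignore_ascii_case("true"));
    S3Env {
        region: get("AWS_REGION"),
        // Assumption: the service-specific endpoint wins over the general one.
        endpoint: get("AWS_ENDPOINT_URL_S3").or_else(|| get("AWS_ENDPOINT_URL")),
        force_path_style: flag("AWS_S3_FORCE_PATH_STYLE"),
        allow_http: flag("AWS_ALLOW_HTTP"),
    }
}
```

For a local MinIO or RustFS instance, a setup along the lines of an `http://` endpoint plus AWS_ALLOW_HTTP=true and AWS_S3_FORCE_PATH_STYLE=true would populate all three non-default fields.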