mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-30 02:49:39 +02:00
feat(engine): retire commit-graph tables (#311)
* docs(dev): write-latency roadmap (validated cost model + layered fix)
Records the validated 6-LIST warm-write cost model, the two root causes
(un-GC'd _versions/; re-resolving latest by listing), and the layered fix
(GC + capture-once reuse), plus how commit-graph-table retirement feeds in.
Linked from docs/dev/index.md next to the RFC-013 docs.
* feat(engine)!: strand storage versioning — one internal-schema version, no in-place migration
Set MIN_SUPPORTED == CURRENT == 4: this binary reads exactly one `__manifest`
internal-schema version and refuses any older graph on open with a
rebuild-via-export/import message, instead of migrating it in place. Storage
format changes become a deliberate cutover, not a permanently-carried in-place
migration — the pre-release "complexity must be earned" contract.
Delete the entire in-place migration apparatus and everything that existed only
to support it: the `migrate_vN` arms + dispatcher + stamp-bump helpers + the
schema-version-floor tripwire; `migrate_on_open` (both open modes now refuse);
the legacy `_graph_commits.lance` readers + the v3 test fixtures + migration
tests + `migration.v3_to_v4.*` failpoints + the two surface guards that pinned
Lance variants only the migration matched on; and `state::merge_lineage_rows`.
Keep `read_stamp` / `stamp_current_version` / `set_stamp` /
`refuse_if_stamp_unsupported` — the seam a future one-shot converter plugs into.
`load_commit_cache_for_branch` now reads the `__manifest` projection
unconditionally (sub-v4 graphs are refused at open). Adds
`sub_current_graph_is_refused_on_open_with_rebuild_hint`.
The commit-graph TABLES are still created/used as branch-ref ledgers — their
retirement (CommitGraph -> pure `__manifest` projection) is the next commit.
BREAKING CHANGE: a graph created by omnigraph <= 0.7.2 (internal schema v3) is
refused on open. Rebuild it: `omnigraph export` with the old release, then
`omnigraph init` + `omnigraph load` with this one. Data, vectors, and blobs are
preserved; commit history and branches are not.
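
The open-time gate described in this commit can be sketched in a few lines. This is a minimal illustration, not the engine's code: only the name `refuse_if_stamp_unsupported` and the `MIN_SUPPORTED == CURRENT == 4` relationship come from the commit text, while the `StampError` type, the `u32` stamp, and the signature are assumptions made for the sketch.

```rust
/// Assumed constants mirroring the commit text: exactly one supported version.
pub const CURRENT: u32 = 4;
pub const MIN_SUPPORTED: u32 = 4;

/// Hypothetical error type; the real engine's error shape is not shown in the commit.
#[derive(Debug, PartialEq, Eq)]
pub enum StampError {
    /// The graph is older than this binary reads: rebuild via export/import.
    TooOld { stamp: u32, reads: u32 },
    /// The graph is newer than this binary reads: upgrade the binary instead.
    TooNew { stamp: u32, reads: u32 },
}

/// Refuses any stamp outside the single supported version.
/// No migration arm exists: a mismatch is always an error.
pub fn refuse_if_stamp_unsupported(stamp: u32) -> Result<(), StampError> {
    if stamp < MIN_SUPPORTED {
        Err(StampError::TooOld { stamp, reads: CURRENT })
    } else if stamp > CURRENT {
        Err(StampError::TooNew { stamp, reads: CURRENT })
    } else {
        Ok(())
    }
}
```

Because the floor and the ceiling are the same constant, the function has no "migrate" branch at all, which is the point of the strand model: a format change is a cutover, not an in-place upgrade.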
* feat(engine)!: retire `_graph_commits.lance` / `_graph_commit_actors.lance` — CommitGraph is a pure `__manifest` projection
Since RFC-013 Phase 7, graph lineage lives in `__manifest` (`graph_commit` /
`graph_head` rows) and branch authority is `__manifest` (branch create forks it
first). The two commit-graph datasets were vestigial: `_graph_commit_actors.lance`
was never written or read; `_graph_commits.lance` carried zero commit rows and
only mirrored the manifest's branch refs (a deny-list "parallel copy"). Retire
both.
- `CommitGraph` collapses to a pure projection: drops its Lance dataset handles
(`dataset`/`actor_dataset`) and all branch methods; `open`/`open_at_branch`/
`refresh`/`init` open NO dataset, building the cache from
`ManifestCoordinator::read_graph_lineage_at`. Removes ~1.4s of cold-open
dataset opens.
- `graph_coordinator`: `commit_graph` is now non-`Option` (always a valid
projection). `branch_create`/`branch_delete` go through `ManifestCoordinator`
only — a single atomic op, replacing the two-step manifest-fork +
commit-graph-fork + rollback. Deleted `create_commit_graph_branch`,
`reclaim_commit_graph_branch`, `ensure_commit_graph_initialized`, and every
`storage.exists(_graph_commits.lance)` gate.
- `optimize`: dropped `reconcile_commit_graph_orphans` and the two tables from
the internal-table compaction set (now `__manifest` only).
- `instrumentation`: `INTERNAL_TABLE_DIRS` no longer lists the two tables.
- Fresh graphs create neither table; `lineage_projection.rs` now asserts both
`.lance` dirs are absent. Deleted the obsolete commit-graph-branch-race
failpoint tests + their failpoint names, and updated the `maintenance`
optimize tests (one internal table, not three).
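
To illustrate what "pure projection" means here, the following sketch builds an in-memory commit cache from lineage rows and holds no dataset handle. The row shape (`LineageRow`), the field names, and the helper methods are hypothetical; the commit only states that the cache is built from `graph_commit` / `graph_head` rows read through `ManifestCoordinator::read_graph_lineage_at`.

```rust
use std::collections::HashMap;

/// Hypothetical shape of the lineage rows read from `__manifest`.
pub enum LineageRow {
    GraphCommit { id: String, parents: Vec<String> },
    GraphHead { branch: String, commit_id: String },
}

/// In-memory projection: no dataset handles, only a cache built from rows.
#[derive(Debug, Default)]
pub struct CommitGraph {
    parents: HashMap<String, Vec<String>>,
    heads: HashMap<String, String>,
}

impl CommitGraph {
    /// Builds the cache in one pass over the rows; nothing is opened or written.
    pub fn from_lineage(rows: impl IntoIterator<Item = LineageRow>) -> Self {
        let mut graph = CommitGraph::default();
        for row in rows {
            match row {
                LineageRow::GraphCommit { id, parents } => {
                    graph.parents.insert(id, parents);
                }
                LineageRow::GraphHead { branch, commit_id } => {
                    graph.heads.insert(branch, commit_id);
                }
            }
        }
        graph
    }

    /// Returns the head commit id of a branch, if the branch is known.
    pub fn head(&self, branch: &str) -> Option<&str> {
        self.heads.get(branch).map(String::as_str)
    }

    /// Walks first parents from a branch head back to the root.
    /// Assumes the commit graph is acyclic.
    pub fn first_parent_history(&self, branch: &str) -> Vec<&str> {
        let mut out = Vec::new();
        let mut cur = self.head(branch);
        while let Some(id) = cur {
            out.push(id);
            cur = self.parents.get(id).and_then(|p| p.first()).map(String::as_str);
        }
        out
    }
}
```

Since the struct owns only maps, rebuilding it on `refresh` is a matter of re-reading the rows, which is consistent with the commit's claim that no dataset open remains on this path.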
Review-pass fixes folded in:
- Removed two stale `omnigraph.rs` in-source tests the prior run missed (a
disk-full link failure masked them), including one asserting that `open` probes
`_graph_commits.lance` (the exists-gate this commit removes).
- Corrected src comments referencing deleted code (`migrate_v3_to_v4`,
`append_commit`/`append_merge_commit`, the three-internal-table list,
the `_graph_commits` reconcile owner) in publisher/recovery/optimize/recovery_audit.
- Narrowed `set_stamp_for_test` to `cfg(test)` (its only caller is the refusal
test) — removes a dead-code warning in the failpoints build.
Branch create/delete atomicity improves (single atomic `__manifest` op). No
other behavior change for reads or branches.
Follow-up (separate commit): the now-always-0 `IoCounts::commit_graph_reads` test
counter + its `IOTracker`, threaded through ~11 cost-test files.
* feat: surface the internal-schema (storage-format) version to operators
After stranding storage versioning (a sub-v4 graph is refused on open), operators
could only discover the storage-format version by hitting a refusal. Surface it:
- `omnigraph version` prints an `internal-schema <N>` line (the binary's CURRENT
storage-format version).
- `omnigraph snapshot` includes `internal_schema_version` — the GRAPH's per-branch
on-disk stamp, read via the new `Omnigraph::internal_schema_version_of`.
- `GET /healthz` includes `internal_schema_version` (server-scoped: the binary's
CURRENT, alongside `version`/`source_version`).
Wire: re-expose `INTERNAL_MANIFEST_SCHEMA_VERSION` as `pub` on `db::manifest`;
add `internal_schema_version: u32` to `SnapshotOutput` + `HealthOutput`;
`snapshot_payload` takes the per-graph version (the `Snapshot` does not carry it),
threaded through the embedded CLI + server snapshot callers. `openapi.json`
regenerated (two added int32 properties). Extends the existing healthz / snapshot /
version tests.
* docs(engine): gate internal-schema version at the graph level; record the per-branch read gap
PR reviewers flagged that the open path validates only main's internal-schema stamp, so a branch read could decode a branch stamped outside this binary's range. The stamp is a graph-wide storage-format property (the upgrade path is a whole-graph export/import), so with one binary version every branch is always CURRENT; divergence needs concurrent multi-version writers, an unsupported topology already in one-winner-CAS territory. Gating per-branch would add a second __manifest open per non-main branch read to defend a state we do not support, unearned complexity that regresses the warm-read budget.
Keep the graph-level gate, document it at the code site (refuse_if_internal_schema_unsupported), and record the read-only residual hole as a known gap in invariants.md to close only when multi-version write topologies become supported. Also clarify the sub-floor rebuild message to say "export with the older omnigraph binary that created it."
No behavior change: HEAD already gated at the graph level.
* test(cost): remove the dead commit_graph_reads IO counter
Phase B retired _graph_commits.lance / _graph_commit_actors.lance, so no commit-graph dataset is opened and the commit_graph IOTracker term is structurally always 0. Remove IoCounts::commit_graph_reads, its total_reads() term, the commit_graph IOTracker in OpProbes, and the now-dead commit_graph_wrapper field on QueryIoProbes (it had no accessor — nothing ever attached it). Drop the 7 trivially-true assert_eq!(commit_graph_reads, 0) checks in warm_read_cost.rs and the debug-print refs in write_cost{,_s3}.rs.
Lineage and actor rows now live in __manifest (RFC-013 Phase 7), so the internal_table_scans_are_flat_in_history gate folds into the single manifest_reads flat-assertion — the manifest scan already covers them. Harness-only; no production runtime impact.
* docs: align with the commit-graph retirement + strand storage versioning
Update the always-loaded and user-facing docs to match the landed state: graph lineage lives in __manifest, the _graph_commits.lance / _graph_commit_actors.lance tables are retired, and storage is strict-single-version (no in-place migration — a sub-CURRENT graph is refused with an export/import rebuild).
Fixed stale claims in invariants.md (the migration/atomicity known-gap entry, the Truth Matrix branch-delete row, the read-path/optimize internal-table scope), lance.md (the migrate_v1_to_v2 PK bullet now reflects init-time set; removed the two deleted v3->v4 migration surface guards), testing.md (dropped the deleted migration failpoint tests; manifest-only internal-table term), writes.md (rewrote the Migration-code section to the strand model), storage.md / maintenance.md / constants.md (retired tables out of the layout, internal-table compaction scope, and the constants cheat-sheet), and AGENTS.md. Marked the retirement DONE in the RFC-013 handoff/roadmap and banner-noted the historical RFC analysis.
Added docs/user/operations/upgrade.md (the export/import rebuild recipe) and docs/dev/versioning.md (the four-axis compatibility policy: release lockstep / wire additive / storage strict-single-version / Lance pinned), cross-linked from the audience indexes and the AGENTS.md topic map, and rewrote the in-progress v0.8.0 release note for the strand model + version surfacing. check-agents-md.sh passes (65 links, 62 docs).
* test(manifest): cover the v3-refusal→export/import rebuild cycle and branch stamp inheritance
Two coverage additions from PR review (P1):
(a) sub_current_graph_is_refused_then_rebuilt_via_export_import — the full operator narrative in one flow: load → export → a sub-CURRENT graph (stamp rewound below CURRENT) is refused with the export nudge → fresh init + load(export) → data present and the rebuilt graph opens. The refusal is stamp-only (read before any data), so a stamp-rewound graph is a faithful stand-in for a real older-release graph without a second binary; vector/blob fidelity stays covered by tests/export.rs.
(b) branch_inherits_main_internal_schema_stamp — proves a branch cannot diverge from main's stamp under single-binary operation (create_branch forks main's __manifest, the publisher does not re-stamp), which is why the graph-level (main-only) stamp gate is sufficient for supported inputs. A divergent branch stamp needs concurrent multi-version writers, the unsupported topology recorded as a known gap.
This commit is contained in:
parent
0dcdcf5a9d
commit
7779b72446
53 changed files with 903 additions and 3324 deletions
@@ -6,7 +6,7 @@
- Compacts every node + edge table on `main`, then reindexes them, then **publishes the resulting version to the `__manifest`** so the manifest's recorded version tracks the compacted-and-reindexed state. Reads pin the manifest version, so without this publish the work would be invisible to readers *and* would break the version precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually changed.
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older versions until `cleanup` runs.
- **Also compacts the internal system tables** `__manifest`, `_graph_commits`, and `_graph_commit_actors` (RFC-013 step 2), which accumulate one fragment per commit (the actor table only on the authenticated write path, where every commit carries an actor) and otherwise make every write's metadata scan grow with history. These take a simpler path than data tables: they are not `__manifest`-tracked (readers open them at their latest version), so compaction just advances their version in place — **no manifest publish and no recovery sidecar**. (The sidecar-free property is not because it is one commit — `compact_files` can emit a `ReserveFragments` commit before the `Rewrite`, and the auto-cleanup strip below is a further commit — but because every one of those commits is content-preserving and the table is read at its latest version, so a crash at any point leaves it readable and content-identical and the next `optimize` re-plans.) They appear in the returned stats under `table_key` `"__manifest"` / `"_graph_commits"` / `"_graph_commit_actors"` (the latter two only when present). They are **not yet covered by `cleanup`**, so their version chain still grows until the cleanup half lands (it requires a cleanup-resurrection safeguard first); run `optimize` on a cadence to keep per-write metadata scans flat.
- **Also compacts the internal `__manifest` table** (RFC-013 step 2), which accumulates one fragment per commit — it now carries the graph lineage and actor rows inline (RFC-013 Phase 7: `graph_commit` / `graph_head` rows), so on the authenticated write path every commit's actor lands here too — and otherwise makes every write's metadata scan grow with history. (The `_graph_commits.lance` / `_graph_commit_actors.lance` tables are retired, so there is no separate lineage table to compact.) It takes a simpler path than data tables: `__manifest` is read at its latest version, so compaction just advances its version in place — **no manifest publish and no recovery sidecar**. (The sidecar-free property is not because it is one commit — `compact_files` can emit a `ReserveFragments` commit before the `Rewrite`, and the auto-cleanup strip below is a further commit — but because every one of those commits is content-preserving and the table is read at its latest version, so a crash at any point leaves it readable and content-identical and the next `optimize` re-plans.) It appears in the returned stats under `table_key` `"__manifest"`. It is **not yet covered by `cleanup`**, so its version chain still grows until the cleanup half lands (it requires a cleanup-resurrection safeguard first); run `optimize` on a cadence to keep per-write metadata scans flat.
- **`optimize` is non-destructive by construction — it never garbage-collects versions, on any table (data or internal).** Compaction rewrites fragments and advances the version; old versions stay reachable until you run `cleanup`. This holds even for a graph created by an older binary that stored an on-by-default Lance `auto_cleanup` hook: `compact_files` / `optimize_indices` commit with the hook enabled and expose no skip override, so before compacting **any** table `optimize` strips its stale `lance.auto_cleanup.*` config first, so Lance's commit-time GC hook cannot fire and silently prune `__manifest`-pinned versions. (Graphs created by current binaries store no such config; the strip is the upgrade-path safety net.) The internal-table path additionally tolerates a concurrent live writer: it runs a **bounded** rebase-and-retry, so transient contention does not fail the operator's `optimize` or the live write — but sustained contention past the retry budget surfaces a loud conflict error rather than looping forever (bounded and observable, not a silent give-up). The data-table path holds the per-table write queue while it compacts, so it does not contend with mutations on that table in the first place.
- **Reindex (index coverage maintenance).** A scalar/FTS/vector index only covers the fragments it was built over. Rows appended after the index was built (e.g. by `load --mode merge`, whose commit does not rebuild an already-existing index) are scanned unindexed, and compaction itself rewrites fragments out of an index's coverage. `optimize` runs Lance's incremental `optimize_indices` after compaction to fold those fragments back in (a delta merge, not a full retrain), restoring full coverage so equality/range/traversal predicates stay index-accelerated. This is why a table with **no compaction work but stale index coverage still commits** a new version under `optimize`. Run `optimize` on a cadence at least as frequent as your freshness window so recently-loaded rows do not linger in the unindexed flat-scan tail.
- **Create declared-but-missing indexes (the index reconciler).** `@index`/`@key` declares intent; `schema apply` records it but builds nothing, and `load`/`mutate` defer a column that cannot be built yet (a `Vector` column with no trainable vectors). `optimize` materializes any such declared-but-unbuilt index over the compacted layout — so it is the convergence path for an `@index` added after data exists, or a vector index whose embeddings arrived via a later `embed`. A column still not buildable (no vectors yet) is reported on the table's stat as `pending_indexes` (visible in `--json`), not treated as a failure; the next `optimize` retries. So `optimize` is the single operator-facing index reconciler: it compacts, restores coverage, **and** builds declared-but-missing indexes.
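
The bounded rebase-and-retry behavior described for the internal-table path can be sketched as a loop with a fixed attempt budget. This is an illustration under assumptions: `CommitError`, `commit_with_bounded_retry`, and the closure-based shape are hypothetical names, and the real rebase step is reduced to "call the attempt again".

```rust
/// Hypothetical error type: a conflict is retryable, anything else is not.
#[derive(Debug, PartialEq, Eq)]
pub enum CommitError {
    Conflict,
    Other(String),
}

/// Runs `attempt` up to `max_attempts` times.
/// Returns the 1-based attempt number that succeeded, or a loud
/// `Conflict` error once the budget is exhausted (never loops forever).
pub fn commit_with_bounded_retry<F>(max_attempts: u32, mut attempt: F) -> Result<u32, CommitError>
where
    F: FnMut(u32) -> Result<(), CommitError>,
{
    for n in 1..=max_attempts {
        match attempt(n) {
            Ok(()) => return Ok(n),
            // A concurrent writer won: rebase onto the latest version and retry.
            Err(CommitError::Conflict) => continue,
            // Non-conflict failures are surfaced immediately.
            Err(other) => return Err(other),
        }
    }
    Err(CommitError::Conflict)
}
```

Transient contention is absorbed inside the loop, while sustained contention falls through to the final `Err`, matching the "bounded and observable, not a silent give-up" wording.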
84
docs/user/operations/upgrade.md
Normal file
@@ -0,0 +1,84 @@
# Upgrading across a storage-format change (export / import)

Omnigraph storage is **strict-single-version**: a binary reads exactly one
internal-schema (storage-format) version. There is no in-place migration. When a
release changes the internal schema, a graph created by an older release is
**refused on open** with a message that points here, and you move it forward by
rebuilding it: export with the old binary, then `init` + `load` with the new one.

This is a deliberate pre-release design choice. The rationale (lower long-term
liability than carrying in-place migration code for a format that is still
changing) is in [docs/dev/versioning.md](../../dev/versioning.md).

## How you know you need this

Opening a graph whose stamp is below the binary's version fails with:

```
__manifest is stamped at internal schema vN, but this omnigraph reads only vM.
This graph was created by an older omnigraph release; rebuild it: run `omnigraph
export` with the older omnigraph binary that created it, then `omnigraph init` +
`omnigraph load` with this one. (Data, vectors, and blobs are preserved; commit
history and branches are not.)
```

You can also check versions before you hit a refusal:

- `omnigraph version`: the internal-schema version the binary reads (the `internal-schema <N>` line).
- `omnigraph snapshot <graph>`: the graph's on-disk `internal_schema_version`.

If the graph's stamp is **higher** than the binary's, the binary is too old;
upgrade omnigraph rather than rebuilding the graph.

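
The pre-check above amounts to comparing two integers. The sketch below shows one way to script it, assuming the `internal-schema <N>` line is whitespace-separated and that the graph's `internal_schema_version` has already been read as a number; `parse_internal_schema`, `advise`, and `Advice` are hypothetical names for this illustration.

```rust
use std::cmp::Ordering;

/// What an operator should do, given the two versions.
#[derive(Debug, PartialEq, Eq)]
pub enum Advice {
    OpenNormally,
    RebuildGraph,
    UpgradeBinary,
}

/// Extracts N from the first `internal-schema <N>` line of `omnigraph version` output.
/// Returns None if no such line exists or N is not an unsigned integer.
pub fn parse_internal_schema(version_output: &str) -> Option<u32> {
    version_output.lines().find_map(|line| {
        let mut parts = line.split_whitespace();
        if parts.next()? == "internal-schema" {
            parts.next()?.parse::<u32>().ok()
        } else {
            None
        }
    })
}

/// Compares the graph's on-disk stamp with the version the binary reads.
pub fn advise(graph_stamp: u32, binary_reads: u32) -> Advice {
    match graph_stamp.cmp(&binary_reads) {
        // Older graph: export with the old binary, then init + load with the new one.
        Ordering::Less => Advice::RebuildGraph,
        Ordering::Equal => Advice::OpenNormally,
        // Newer graph: the binary is too old, so upgrade omnigraph.
        Ordering::Greater => Advice::UpgradeBinary,
    }
}
```

Keeping the parse and the comparison separate lets the same `advise` function be fed from either the CLI output or the `/healthz` field.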
## What is preserved (and what is not)

| Preserved | Not preserved |
|---|---|
| All node and edge rows | Commit history (the graph DAG starts fresh) |
| Vector columns (embeddings round-trip verbatim) | Branches (export is a single-branch snapshot) |
| Blob columns | Snapshot/time-travel history of the old graph |
| The schema (re-applied at `init`) | |

The rebuilt graph is a faithful copy of the exported branch's **current state**.
There is no supported path today for carrying history or multiple branches
forward; the workaround is to export each branch you care about separately.

## The recipe

Use the **old** binary for the export steps and the **new** binary for init/load.
Keep them as separate executables (for example a downloaded release archive) so you
can run both.

```bash
# 1. With the OLD binary: capture the schema and the data.
old-omnigraph schema show s3://bucket/graph.omni > schema.pg
old-omnigraph export s3://bucket/graph.omni > graph.jsonl

# 2. With the NEW binary: create a fresh graph and load the data.
omnigraph init --schema schema.pg s3://bucket/graph-v2.omni
omnigraph load --mode overwrite --data graph.jsonl s3://bucket/graph-v2.omni

# 3. With the NEW binary: verify.
omnigraph snapshot s3://bucket/graph-v2.omni   # internal_schema_version is current
omnigraph version                              # prints the binary's internal-schema version
```

`omnigraph export` writes a full JSONL snapshot (one row per node/edge, all
columns including vectors and blobs) of the chosen branch (default `main`; pass
`--branch` for another) to stdout. `omnigraph load --mode overwrite` replaces the
target graph's contents with that snapshot.

Once you have verified the rebuilt graph, retire the old one. If you rebuild
in place (same URI), export to a side location first and only overwrite after the
new graph verifies.

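
Before retiring the old graph, one cheap sanity check is to count the rows in the export, since the file holds one row per node or edge. The helper below is a format-agnostic sketch: `count_jsonl_rows` is a hypothetical name, and it counts non-empty lines without parsing the JSON, so it makes no assumption about the row schema.

```rust
use std::io::{self, BufRead};

/// Counts non-empty lines in a JSONL stream (one row per node/edge).
/// Blank lines are skipped; the JSON itself is not validated.
pub fn count_jsonl_rows<R: BufRead>(reader: R) -> io::Result<usize> {
    let mut rows = 0;
    for line in reader.lines() {
        if !line?.trim().is_empty() {
            rows += 1;
        }
    }
    Ok(rows)
}
```

Wrapping a `std::fs::File` for `graph.jsonl` in a `std::io::BufReader` gives a reader this function accepts; the resulting count can then be compared with a second export taken from the rebuilt graph.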
## Notes

- **Upgrade the whole fleet together.** A mixed fleet where an old binary still
writes a graph a newer binary has stamped is unsupported, as with any
internal-schema bump.
- **Embeddings are not recomputed.** Export carries the stored vectors verbatim, so
a load does not re-run the embedding pipeline. If you changed the embedding model,
re-embed after loading.
- **Server deployments**: take the graph out of the serving set, rebuild it offline
with the CLI, then point the cluster at the rebuilt graph (`cluster apply`).