omnigraph/docs/dev/index.md
Ragnor Comerford 7779b72446
feat(engine): retire commit-graph tables (#311)
* docs(dev): write-latency roadmap (validated cost model + layered fix)

Records the validated 6-LIST warm-write cost model, the two root causes
(un-GC'd _versions/; re-resolving latest by listing), and the layered fix
(GC + capture-once reuse), plus how commit-graph-table retirement feeds in.
Linked from docs/dev/index.md next to the RFC-013 docs.

* feat(engine)!: strand storage versioning — one internal-schema version, no in-place migration

Set MIN_SUPPORTED == CURRENT == 4: this binary reads exactly one `__manifest`
internal-schema version and refuses any older graph on open with a
rebuild-via-export/import message, instead of migrating it in place. Storage
format changes become a deliberate cutover, not a permanently-carried in-place
migration — the pre-release "complexity must be earned" contract.

Delete the entire in-place migration apparatus and everything that existed only
to support it: the `migrate_vN` arms + dispatcher + stamp-bump helpers + the
schema-version-floor tripwire; `migrate_on_open` (both open modes now refuse);
the legacy `_graph_commits.lance` readers + the v3 test fixtures + migration
tests + `migration.v3_to_v4.*` failpoints + the two surface guards that pinned
Lance variants only the migration matched on; and `state::merge_lineage_rows`.
Keep `read_stamp` / `stamp_current_version` / `set_stamp` /
`refuse_if_stamp_unsupported` — the seam a future one-shot converter plugs into.
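As a rough illustration of the gate described above, here is a minimal sketch of a strict single-version check. The constant names and the value 4 come from this commit message; the `StampError` type, the function signature, the message wording, and the too-new arm are assumptions for illustration, not the engine's actual code.

```rust
use std::fmt;

// Values taken from the commit text: MIN_SUPPORTED == CURRENT == 4.
const MIN_SUPPORTED: u32 = 4;
const CURRENT: u32 = 4;

/// Hypothetical error type; the real engine's error taxonomy may differ.
#[derive(Debug, PartialEq)]
enum StampError {
    /// The on-disk stamp is below the supported floor (older release).
    TooOld { found: u32 },
    /// The on-disk stamp is above what this binary understands (assumed case).
    TooNew { found: u32 },
}

impl fmt::Display for StampError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StampError::TooOld { found } => write!(
                f,
                "internal schema v{} is below supported v{}: export with the older omnigraph binary that created it, then init + load",
                found, MIN_SUPPORTED
            ),
            StampError::TooNew { found } => write!(
                f,
                "internal schema v{} is newer than this binary's v{}",
                found, CURRENT
            ),
        }
    }
}

/// Refuses any stamp outside the single supported version; never migrates.
fn refuse_if_stamp_unsupported(stamp: u32) -> Result<(), StampError> {
    if stamp < MIN_SUPPORTED {
        Err(StampError::TooOld { found: stamp })
    } else if stamp > CURRENT {
        Err(StampError::TooNew { found: stamp })
    } else {
        Ok(())
    }
}
```

Because the check only inspects the stamp, it can run before any data is read, which is what lets a future one-shot converter plug in at the same seam.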

`load_commit_cache_for_branch` now reads the `__manifest` projection
unconditionally (sub-v4 graphs are refused at open). Adds
`sub_current_graph_is_refused_on_open_with_rebuild_hint`.

The commit-graph TABLES are still created/used as branch-ref ledgers — their
retirement (CommitGraph -> pure `__manifest` projection) is the next commit.

BREAKING CHANGE: a graph created by omnigraph <= 0.7.2 (internal schema v3) is
refused on open. Rebuild it: `omnigraph export` with the old release, then
`omnigraph init` + `omnigraph load` with this one. Data, vectors, and blobs are
preserved; commit history and branches are not.

* feat(engine)!: retire `_graph_commits.lance` / `_graph_commit_actors.lance` — CommitGraph is a pure `__manifest` projection

Since RFC-013 Phase 7, graph lineage lives in `__manifest` (`graph_commit` /
`graph_head` rows) and branch authority is `__manifest` (branch create forks it
first). The two commit-graph datasets were vestigial: `_graph_commit_actors.lance`
was never written or read; `_graph_commits.lance` carried zero commit rows and
only mirrored the manifest's branch refs (a deny-list "parallel copy"). Retire
both.

- `CommitGraph` collapses to a pure projection: drops its Lance dataset handles
  (`dataset`/`actor_dataset`) and all branch methods; `open`/`open_at_branch`/
  `refresh`/`init` open NO dataset, building the cache from
  `ManifestCoordinator::read_graph_lineage_at`. Removes ~1.4s of cold-open
  dataset opens.
- `graph_coordinator`: `commit_graph` is now non-`Option` (always a valid
  projection). `branch_create`/`branch_delete` go through `ManifestCoordinator`
  only — a single atomic op, replacing the two-step manifest-fork +
  commit-graph-fork + rollback. Deleted `create_commit_graph_branch`,
  `reclaim_commit_graph_branch`, `ensure_commit_graph_initialized`, and every
  `storage.exists(_graph_commits.lance)` gate.
- `optimize`: dropped `reconcile_commit_graph_orphans` and the two tables from
  the internal-table compaction set (now `__manifest` only).
- `instrumentation`: `INTERNAL_TABLE_DIRS` no longer lists the two tables.
- Fresh graphs create neither table; `lineage_projection.rs` now asserts both
  `.lance` dirs are absent. Deleted the obsolete commit-graph-branch-race
  failpoint tests + their failpoint names, and updated the `maintenance`
  optimize tests (one internal table, not three).
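The "pure projection" idea in the list above can be sketched as an in-memory cache built only from lineage rows, with no dataset handle of its own. The `graph_commit` / `graph_head` row kinds are named in this commit; the `LineageRow` enum, its fields, and the `CommitGraph` methods below are hypothetical stand-ins, not the engine's real types.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical shape of the lineage rows read from `__manifest`.
#[derive(Debug, Clone, PartialEq)]
enum LineageRow {
    GraphCommit { id: String, parents: Vec<String> },
    GraphHead { branch: String, commit_id: String },
}

/// A projection: holds only derived state, no storage handles.
#[derive(Debug, Default)]
struct CommitGraph {
    parents: HashMap<String, Vec<String>>,
    heads: HashMap<String, String>,
}

impl CommitGraph {
    /// Builds the cache from rows the caller already read from the manifest.
    fn from_lineage_rows(rows: &[LineageRow]) -> Self {
        let mut graph = CommitGraph::default();
        for row in rows {
            match row {
                LineageRow::GraphCommit { id, parents } => {
                    graph.parents.insert(id.clone(), parents.clone());
                }
                LineageRow::GraphHead { branch, commit_id } => {
                    graph.heads.insert(branch.clone(), commit_id.clone());
                }
            }
        }
        graph
    }

    /// Returns the head commit id of a branch, if the branch is known.
    fn head_of(&self, branch: &str) -> Option<&str> {
        self.heads.get(branch).map(String::as_str)
    }

    /// Walks parent links depth-first; a commit counts as its own ancestor.
    fn is_ancestor(&self, ancestor: &str, descendant: &str) -> bool {
        let mut stack = vec![descendant];
        let mut seen = HashSet::new();
        while let Some(id) = stack.pop() {
            if id == ancestor {
                return true;
            }
            if !seen.insert(id) {
                continue; // already visited via another merge parent
            }
            if let Some(parents) = self.parents.get(id) {
                stack.extend(parents.iter().map(String::as_str));
            }
        }
        false
    }
}
```

With this shape, a refresh amounts to rebuilding the struct from a fresh manifest read, so there is no second copy of branch state to reconcile or roll back.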

Review-pass fixes folded in:
- Removed two stale `omnigraph.rs` in-source tests the prior run missed because
  a disk-full link failure masked them, including one asserting that `open`
  probes `_graph_commits.lance` (the exists-gate this commit removes).
- Corrected src comments referencing deleted code (`migrate_v3_to_v4`,
  `append_commit`/`append_merge_commit`, the three-internal-table list,
  the `_graph_commits` reconcile owner) in publisher/recovery/optimize/recovery_audit.
- Narrowed `set_stamp_for_test` to `cfg(test)` (its only caller is the refusal
  test) — removes a dead-code warning in the failpoints build.

Branch create/delete atomicity improves (single atomic `__manifest` op). No
other behavior change for reads or branches.

Follow-up (separate commit): the now-always-0 `IoCounts::commit_graph_reads` test
counter + its `IOTracker`, threaded through ~11 cost-test files.

* feat: surface the internal-schema (storage-format) version to operators

After stranding storage versioning (a sub-v4 graph is refused on open), operators
could only discover the storage-format version by hitting a refusal. Surface it:

- `omnigraph version` prints an `internal-schema <N>` line (the binary's CURRENT
  storage-format version).
- `omnigraph snapshot` includes `internal_schema_version` — the GRAPH's per-branch
  on-disk stamp, read via the new `Omnigraph::internal_schema_version_of`.
- `GET /healthz` includes `internal_schema_version` (server-scoped: the binary's
  CURRENT, alongside `version`/`source_version`).

Wire: re-expose `INTERNAL_MANIFEST_SCHEMA_VERSION` as `pub` on `db::manifest`;
add `internal_schema_version: u32` to `SnapshotOutput` + `HealthOutput`;
`snapshot_payload` takes the per-graph version (the `Snapshot` does not carry it),
threaded through the embedded CLI + server snapshot callers. `openapi.json`
regenerated (two added int32 properties). Extends the existing healthz / snapshot /
version tests.
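The distinction between the binary-scoped and graph-scoped values can be sketched as follows. `INTERNAL_MANIFEST_SCHEMA_VERSION`, `SnapshotOutput`, `HealthOutput`, `snapshot_payload`, and the `internal_schema_version: u32` field are named in this commit; every other field, the `health_payload` and `version_line` helpers, and all signatures are assumptions made for the example.

```rust
/// The binary's CURRENT storage-format version (4 per this commit series).
pub const INTERNAL_MANIFEST_SCHEMA_VERSION: u32 = 4;

/// Hypothetical subset of the health payload.
#[derive(Debug, PartialEq)]
struct HealthOutput {
    version: String,
    internal_schema_version: u32,
}

/// Hypothetical subset of the snapshot payload.
#[derive(Debug, PartialEq)]
struct SnapshotOutput {
    branch: String,
    internal_schema_version: u32,
}

/// Line added to `omnigraph version` output: reports the binary's version.
fn version_line() -> String {
    format!("internal-schema {}", INTERNAL_MANIFEST_SCHEMA_VERSION)
}

/// Server-scoped: always the binary's CURRENT, independent of any graph.
fn health_payload(release: &str) -> HealthOutput {
    HealthOutput {
        version: release.to_string(),
        internal_schema_version: INTERNAL_MANIFEST_SCHEMA_VERSION,
    }
}

/// Graph-scoped: the caller passes the on-disk stamp it read for the branch,
/// because the snapshot itself does not carry it.
fn snapshot_payload(branch: &str, graph_stamp: u32) -> SnapshotOutput {
    SnapshotOutput {
        branch: branch.to_string(),
        internal_schema_version: graph_stamp,
    }
}
```

Passing the stamp in as a parameter mirrors the note that `snapshot_payload` takes the per-graph version rather than deriving it from the `Snapshot`.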

* docs(engine): gate internal-schema version at the graph level; record the per-branch read gap

PR reviewers flagged that the open path validates only main's internal-schema stamp, so a branch read could decode a branch stamped outside this binary's range. The stamp is a graph-wide storage-format property (the upgrade path is a whole-graph export/import), so with one binary version every branch is always CURRENT; divergence needs concurrent multi-version writers, an unsupported topology already in one-winner-CAS territory. Gating per-branch would add a second __manifest open per non-main branch read to defend a state we do not support, unearned complexity that regresses the warm-read budget.

Keep the graph-level gate, document it at the code site (refuse_if_internal_schema_unsupported), and record the read-only residual hole as a known gap in invariants.md to close only when multi-version write topologies become supported. Also clarify the sub-floor rebuild message to say "export with the older omnigraph binary that created it."

No behavior change: HEAD already gated at the graph level.

* test(cost): remove the dead commit_graph_reads IO counter

Phase B retired _graph_commits.lance / _graph_commit_actors.lance, so no commit-graph dataset is opened and the commit_graph IOTracker term is structurally always 0. Remove IoCounts::commit_graph_reads, its total_reads() term, the commit_graph IOTracker in OpProbes, and the now-dead commit_graph_wrapper field on QueryIoProbes (it had no accessor — nothing ever attached it). Drop the 7 trivially-true assert_eq!(commit_graph_reads, 0) checks in warm_read_cost.rs and the debug-print refs in write_cost{,_s3}.rs.

Lineage and actor rows now live in __manifest (RFC-013 Phase 7), so the internal_table_scans_are_flat_in_history gate folds into the single manifest_reads flat-assertion — the manifest scan already covers them. Harness-only; no production runtime impact.
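A minimal sketch of what the counter looks like once the commit-graph term is gone, and of a "flat in history" check folded into the manifest term. `IoCounts`, `total_reads()`, and `manifest_reads` are named in this commit; the `table_reads` field and the `manifest_reads_are_flat` helper are hypothetical and only illustrate the idea.

```rust
/// Per-operation read counts collected by the test harness.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct IoCounts {
    /// Reads against `__manifest`, which now also covers lineage and actor rows.
    manifest_reads: u64,
    /// Hypothetical term for reads against user data tables.
    table_reads: u64,
}

impl IoCounts {
    /// Sum of all tracked terms; there is no commit-graph term any more.
    fn total_reads(&self) -> u64 {
        self.manifest_reads + self.table_reads
    }
}

/// Flatness gate: manifest reads must not grow as commit history grows.
fn manifest_reads_are_flat(short_history: &IoCounts, long_history: &IoCounts) -> bool {
    short_history.manifest_reads == long_history.manifest_reads
}
```

A cost test would measure the same operation against a short and a long history and assert that the helper returns true.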

* docs: align with the commit-graph retirement + strand storage versioning

Update the always-loaded and user-facing docs to match the landed state: graph lineage lives in __manifest, the _graph_commits.lance / _graph_commit_actors.lance tables are retired, and storage is strict-single-version (no in-place migration — a sub-CURRENT graph is refused with an export/import rebuild).

Fixed stale claims in invariants.md (the migration/atomicity known-gap entry, the Truth Matrix branch-delete row, the read-path/optimize internal-table scope), lance.md (the migrate_v1_to_v2 PK bullet now reflects init-time set; removed the two deleted v3->v4 migration surface guards), testing.md (dropped the deleted migration failpoint tests; manifest-only internal-table term), writes.md (rewrote the Migration-code section to the strand model), storage.md / maintenance.md / constants.md (retired tables out of the layout, internal-table compaction scope, and the constants cheat-sheet), and AGENTS.md. Marked the retirement DONE in the RFC-013 handoff/roadmap and banner-noted the historical RFC analysis.

Added docs/user/operations/upgrade.md (the export/import rebuild recipe) and docs/dev/versioning.md (the four-axis compatibility policy: release lockstep / wire additive / storage strict-single-version / Lance pinned), cross-linked from the audience indexes and the AGENTS.md topic map, and rewrote the in-progress v0.8.0 release note for the strand model + version surfacing. check-agents-md.sh passes (65 links, 62 docs).

* test(manifest): cover the v3-refusal→export/import rebuild cycle and branch stamp inheritance

Two coverage additions from PR review (P1):

(a) sub_current_graph_is_refused_then_rebuilt_via_export_import — the full operator narrative in one flow: load → export → a sub-CURRENT graph (stamp rewound below CURRENT) is refused with the export nudge → fresh init + load(export) → data present and the rebuilt graph opens. The refusal is stamp-only (read before any data), so a stamp-rewound graph is a faithful stand-in for a real older-release graph without a second binary; vector/blob fidelity stays covered by tests/export.rs.

(b) branch_inherits_main_internal_schema_stamp — proves a branch cannot diverge from main's stamp under single-binary operation (create_branch forks main's __manifest, the publisher does not re-stamp), which is why the graph-level (main-only) stamp gate is sufficient for supported inputs. A divergent branch stamp needs concurrent multi-version writers, the unsupported topology recorded as a known gap.
2026-06-28 16:49:49 +02:00


Developer Docs

Audience: contributors, maintainers, and coding agents

This is the contributor-facing entry point. These docs explain architecture, invariants, implementation contracts, test ownership, and upstream Lance constraints. User-facing behavior should still be documented through docs/user/index.md and the relevant public reference docs.

Required For Every Non-Trivial Change

| Need | Read |
| --- | --- |
| Architectural rules, known gaps, deny-list | invariants.md |
| Versioning & compatibility policy (release / wire / storage / Lance) | versioning.md |
| Upstream Lance source-of-truth index | lance.md |
| Existing test coverage and test placement | testing.md |

Architecture And Storage

| Area | Read |
| --- | --- |
| System structure, L1/L2 framing, component diagrams | architecture.md |
| On-disk layout, manifest schema, URI behavior | storage.md |
| Direct-publish writes, D2, staged writes, recovery sidecars | writes.md |
| Query execution, mutation execution, loader flow | execution.md |
| Index lifecycle and graph topology indexes | indexes.md |
| Branch and commit internals | branches-commits.md |
| Three-way merge implementation and conflicts | merge.md |
| Diff/change-feed implementation | changes.md |
| Branch protection policy | branch-protection.md |

Language, Runtime, And Boundaries

| Area | Read |
| --- | --- |
| Schema grammar, catalog, migration planner | schema-language.md |
| Query grammar, IR, lints, mutation restrictions | query-language.md |
| Embedding client and @embed integration | embeddings.md |
| Cedar policy surface and server gating | policy.md |
| Server auth, OpenAPI, endpoint handlers | server.md |
| Error taxonomy and serialization | errors.md |
| Constants and tunables | constants.md |
| Transaction model public contract | transactions.md |
| User-doc coherence cleanup ledger | docs-issues.md |

Project Operations

| Area | Read |
| --- | --- |
| CI and release workflows | ci.md |
| Install and deployment packaging | install.md, deployment.md |
| Release history | releases/ |

Contribution & Governance

| Area | Read |
| --- | --- |
| How to contribute (external) | CONTRIBUTING.md |
| Governance model, roles, decision authority | GOVERNANCE.md |
| Public contribution RFC track | rfcs/ |

The docs/rfcs/ track is the public, externally-authorable RFC process. The maintainer/internal RFCs below (rfc-00N-*.md) are a separate, team-owned track; don't conflate the two.

Case Studies

Worked write-ups of specific bugs — root cause, fix, and the reasoning that ruled out the tempting-but-wrong alternatives. Read these for the debugging pattern, not just the outcome.

| Area | Read |
| --- | --- |
| camelCase property filters lowercased at runtime (#283) — two engine→Lance boundaries, two different fixes | bug-case-fix.md |

Active Implementation Plans

Working documents for in-flight feature work. Removed when the work lands.

| Area | Read |
| --- | --- |
| Schema-lint chassis v1 (MR-694) — --allow-data-loss, soft/hard drops | schema-lint-v1-plan.md |
| Inline + stored queries, request/response envelope, MCP (MR-656 / MR-976 / MR-969) | rfc-001-queries-envelope-mcp.md |
| Config & CLI architecture — layered config, client targeting, file naming (MR-973 / MR-974 / MR-981) | rfc-002-config-cli-architecture.md |
| MCP server surface — full tool parity, stored queries, modular auth (MR-969 / MR-956 / MR-974) | rfc-003-mcp-server-surface.md |
| Future cluster control plane — declarative as-code config, JSON state ledger, reconciler | cluster-config-specs.md, cluster-axioms.md, cluster-config-implementation-spec.md |
| Cluster graph & schema apply — Phase 4 sidecars, roll-forward recovery, approval artifacts | rfc-004-cluster-graph-schema-apply.md |
| Server boots from cluster state — Phase 5 mode switch, applied-revision serving | rfc-005-server-cluster-boot.md |
| Per-operator config — ~/.omnigraph/ identity, keyed credentials, named servers (the operator slice of RFC-002) | rfc-007-operator-config.md |
| Deprecate omnigraph.yaml — one concern per config surface; key-by-key migration map and staged retirement | rfc-008-deprecate-omnigraph-yaml.md |
| Unify CLI embedded/remote access paths — parity referee, shared wire-DTO crate, GraphClient trait, declared plane capabilities | rfc-009-unify-access-paths.md |
| Restructure the CLI around explicit planes — one graph-addressing model, declared capability surface, plane-grouped help (expands RFC-009 Phase 4) | rfc-010-cli-planes-restructure.md |
| CLI refactoring — one addressing & config model post-omnigraph.yaml: scope + --graph + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | rfc-011-cli-refactoring.md |
| Provider-independent embedding configuration — one resolved EmbeddingConfig + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | rfc-012-embedding-provider-config.md |
| Write-path latency — capture-once WriteTxn, version-pinned opens, one GraphPublishAuthority fed declarative PublishPlans, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (iss-write-s3-roundtrip-amplification, iss-991) | rfc-013-write-path-latency.md |
| RFC-013 handoff — current-state map, latest validation, and concrete next actions for finishing write-path latency and correctness work | handoff-rfc-013-write-path.md |
| Write-latency roadmap — validated cost model (the 6-LIST warm-write trace), the two root causes (un-GC'd _versions/; re-resolving latest by listing), and the layered fix (GC + capture-once reuse); how commit-graph-table retirement feeds in | write-latency-roadmap.md |

Boundary

Developer docs may mention implementation details, stale gaps, upstream Lance blockers, and review rules. User docs should not require that context unless the detail changes the public contract.