docs: add Mermaid architecture diagrams across architecture / storage / execution

Replace the single ASCII stack in docs/architecture.md with a hierarchy of Mermaid diagrams that show the system from external context down to the component level. Add an on-disk layout diagram in docs/storage.md and two sequence diagrams (read query, mutation) in docs/execution.md so readers can navigate from "what is OmniGraph" to "how does a query run" without opening source.

Static structure (docs/architecture.md):
- System context: agents/clients, embedding providers, Cedar, object store.
- Layer view: eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling via classDef, replacing the pre-existing ASCII art.
- Component zoom-ins: compiler, engine, storage trait, index lifecycle, server/CLI. Each zoom-in cites file:line entry points. Aspirational shapes (storage trait, full reconciler) are visually marked and pointed at the relevant invariants.md section so readers see the intended seam without thinking it's already implemented.

On-disk layout (docs/storage.md):
- Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance, _graph_runs.lance, _refs/branches/ down into Lance's per-dataset internals (_versions/, data/, _indices/, _refs/, _transactions/).
- Annotated with the actual filenames so readers can `ls` the same paths.
- Slots in below the existing __manifest CAS / OCC / migration prose; does not move or rewrite that content.

Runtime flows (docs/execution.md):
- Read flow sequence: client → Omnigraph::query → typecheck → lower → execute_query → table_store → Lance scanner → RecordBatch stream.
- Mutation flow sequence: Omnigraph::mutate → resolve literals → Lance write op (Append / merge_insert) → ManifestRepo::commit → __manifest upsert.
- Both diagrams are followed by a "Code paths" block with verified file:line citations so readers can navigate from diagram element to source in one step.

Conventions established (this is the first Mermaid in the repo):
- L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed.
- Diagram size cap ~9 elements; more detail goes in a sub-diagram.
- Diagrams paired with prose; code-path citations follow each diagram.
- Consistent vocabulary across diagrams: frontend / compiler / engine / storage trait / Lance / object store. No accidental synonyms.

Subsequent PRs will add flow diagrams for schema apply, branch + merge, run isolation, index reconcile, and the embedding pipeline in the same conventions.
2026-06-24 02:38:06 +02:00 · 2026-04-29 16:58:56 +02:00 · 2026-04-29 16:58:56 +02:00 · 64b9d56476
commit 64b9d56476
parent 4e5374a85e
3 changed files with 376 additions and 39 deletions
--- a/docs/storage.md
+++ b/docs/storage.md
@@ -48,6 +48,55 @@ Adding a new on-disk shape change is one constant bump (`INTERNAL_MANIFEST_SCHEM
 | v1 (implicit, pre-stamp) | `__manifest.object_id` had no PK annotation; publisher had no row-level CAS protection. |
 | v2 | `__manifest.object_id` carries `lance-schema:unenforced-primary-key=true`; row-level CAS engaged. Stamped as `omnigraph:internal_schema_version=2`. |

+## On-disk layout
+
+A repo on disk is a directory tree of Lance datasets. Each dataset follows the standard Lance layout (`_versions/`, `data/`, `_indices/`, `_refs/`, `_transactions/`); OmniGraph adds multi-dataset coordination by keeping `__manifest/` alongside the per-type datasets.
+
+```mermaid
+flowchart TB
+    classDef l1 fill:#fef3e8,stroke:#c46900,color:#000
+    classDef l2 fill:#e8f4fd,stroke:#1e6aa8,color:#000
+
+    repo["repo URI<br/>file:// or s3://bucket/prefix"]:::l2
+
+    manifest["__manifest/<br/>L2 catalog of sub-tables"]:::l2
+    nodes["nodes/{fnv1a64-hex}/<br/>one dataset per node type"]:::l2
+    edges["edges/{fnv1a64-hex}/<br/>one dataset per edge type"]:::l2
+    cgraph["_graph_commits.lance/<br/>_graph_commit_actors.lance/"]:::l2
+    runs["_graph_runs.lance/<br/>_graph_run_actors.lance/"]:::l2
+    refs["_refs/branches/{name}.json<br/>graph-level branches"]:::l2
+
+    repo --> manifest
+    repo --> nodes
+    repo --> edges
+    repo --> cgraph
+    repo --> runs
+    repo --> refs
+
+    subgraph dataset[Inside each Lance dataset — L1]
+        ds_v["_versions/{n}.manifest<br/>per-dataset versions"]:::l1
+        ds_data["data/<br/>fragment data files (Lance format)"]:::l1
+        ds_idx["_indices/{uuid}/<br/>BTREE · Inverted FTS · IVF/HNSW"]:::l1
+        ds_refs["_refs/<br/>per-dataset Lance branches/tags"]:::l1
+        ds_tx["_transactions/<br/>commit transaction logs"]:::l1
+    end
+
+    nodes -.-> dataset
+    edges -.-> dataset
+    manifest -.-> dataset
+```
+
+**What's where:**
+
+- **Repo root** is one directory (or S3 prefix). Everything below is part of one OmniGraph repo.
+- **`__manifest/`** is a Lance dataset whose rows record which version of each sub-table is published on which graph branch. Reading a snapshot starts here.
+- **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Each directory name is the hex-encoded FNV-1a 64-bit hash (`fnv1a64-hex`) of the type name, which keeps paths fixed-length and case-safe.
+- **`_graph_commits.lance` / `_graph_runs.lance`** are L2 datasets that record the graph-level commit DAG and the run registry, respectively. Each has a companion dataset for the actor map: `_graph_commit_actors.lance` and `_graph_run_actors.lance`.
+- **`_refs/branches/{name}.json`** is graph-level branch metadata — pointers from a branch name to the manifest version it heads.
+- **Inside each Lance dataset** (orange): the standard Lance directory layout. `_versions/{n}.manifest` records every commit; `data/` holds the fragment data files; `_indices/{uuid}/` holds index segments, each with its own `fragment_bitmap` for partial coverage; `_refs/` holds Lance-native per-dataset branches and tags; `_transactions/` holds the commit transaction logs.
+
+The split — L2 owns the cross-dataset catalog; L1 owns the per-dataset internals — means that schema work (which adds or removes datasets) updates `__manifest`, while data work (which adds fragments) updates `_versions/` inside the affected dataset and then bumps `__manifest`.
+
 ## URI scheme support (`storage.rs`)

 | Scheme | Backend | Notes |