mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-09 01:35:18 +02:00

Ragnor Comerford 64b9d56476

docs: add Mermaid architecture diagrams across architecture / storage / execution

Replace the single ASCII stack in docs/architecture.md with a hierarchy of
Mermaid diagrams that show the system from external context down to the
component level. Add an on-disk layout diagram in docs/storage.md and two
sequence diagrams (read query, mutation) in docs/execution.md so readers
can navigate from "what is OmniGraph" to "how does a query run" without
opening source.

Static structure (docs/architecture.md):

- System context — agents/clients, embedding providers, Cedar, object store.
- Layer view — eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling
  via classDef, replacing the pre-existing ASCII art.
- Component zoom-ins — compiler, engine, storage trait, index lifecycle,
  server/CLI. Each zoom-in cites file:line entry points.

Aspirational shapes (storage trait, full reconciler) are visually marked
and pointed at the relevant invariants.md section so readers see the
intended seam without thinking it's already implemented.

On-disk layout (docs/storage.md):

- Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance,
  _graph_runs.lance, _refs/branches/ down into Lance's per-dataset
  internals (_versions/, data/, _indices/, _refs/, _transactions/).
- Annotated with the actual filenames so readers can `ls` the same paths.
- Slots in below the existing __manifest CAS / OCC / migration prose; does
  not move or rewrite that content.

Runtime flows (docs/execution.md):

- Read flow sequence: client → Omnigraph::query → typecheck → lower →
  execute_query → table_store → Lance scanner → RecordBatch stream.
- Mutation flow sequence: Omnigraph::mutate → resolve literals →
  Lance write op (Append / merge_insert) → ManifestRepo::commit →
  __manifest upsert.
- Both diagrams are followed by a "Code paths" block with verified
  file:line citations so readers can navigate from diagram element to
  source in one step.

Conventions established (this is the first Mermaid in the repo):

- L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed.
- Diagram size cap ~9 elements; more detail goes in a sub-diagram.
- Diagrams paired with prose; code-path citations follow each diagram.
- Consistent vocabulary across diagrams: frontend / compiler / engine /
  storage trait / Lance / object store. No accidental synonyms.

Subsequent PRs will add flow diagrams for schema apply, branch + merge,
run isolation, index reconcile, and the embedding pipeline in the same
conventions.

2026-04-29 16:58:56 +02:00


Query Execution, Mutations, and Loading

Query execution (`exec/query.rs`)

Pipeline:

1. Parse and typecheck via `omnigraph-compiler`.
2. Lower to IR.
3. If `Expand` or `AntiJoin` is present, build a `GraphIndex` (or fetch it from `RuntimeCache`).
4. Run `execute_query` against the snapshot.
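
The decision in the third step can be sketched as a scan over the lowered pipeline. This is a minimal illustration, not the engine's code: the `IrOp` enum and `needs_graph_index` are hypothetical names, and only the `Expand` and `AntiJoin` variant names come from the text above.

```rust
// Hypothetical stand-in for the compiler's IR operators. Only the variant
// names `Expand` and `AntiJoin` are taken from the document.
#[derive(Debug, Clone, PartialEq)]
enum IrOp {
    NodeScan { type_name: String },
    Expand { edge_type: String },
    AntiJoin { inner: Vec<IrOp> },
    Project,
}

/// Returns true when the pipeline contains an operator that needs graph
/// topology, i.e. when a GraphIndex would have to be built or fetched.
fn needs_graph_index(pipeline: &[IrOp]) -> bool {
    pipeline
        .iter()
        .any(|op| matches!(op, IrOp::Expand { .. } | IrOp::AntiJoin { .. }))
}
```

Under these assumptions, a pipeline of `NodeScan` followed by `Project` skips the index, while any pipeline containing an `Expand` or `AntiJoin` triggers it.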

Read flow — sequence

sequenceDiagram
    autonumber
    participant client as Client
    participant og as Omnigraph::query<br/>(query.rs:7)
    participant cmp as omnigraph-compiler
    participant exec as execute_query<br/>(query.rs:347)
    participant gi as GraphIndex<br/>(RuntimeCache)
    participant ts as table_store
    participant lance as Lance scanner

    client->>og: query(target, source, name, params)
    og->>og: ensure_schema_state_valid()<br/>resolve target → snapshot
    og->>cmp: parse + typecheck_query (typecheck.rs:83)
    cmp-->>og: CheckedQuery
    og->>cmp: lower_query (lower.rs:11)
    cmp-->>og: QueryIR (pipeline of IROp)
    og->>exec: extract_search_mode + dispatch (query.rs:110)
    exec->>gi: build / fetch GraphIndex<br/>(if Expand or AntiJoin)
    gi-->>exec: CSR / CSC topology
    loop for each IROp in pipeline
        exec->>ts: scan with predicate / SIP
        ts->>lance: filter · nearest · full_text_search
        lance-->>ts: Stream of RecordBatch
        ts-->>exec: RecordBatch stream
        exec->>exec: factorize · expand · fuse · project
    end
    exec-->>og: QueryResult (RecordBatches)
    og-->>client: serialized result

Code paths:

- Entry: `Omnigraph::query` at `crates/omnigraph/src/exec/query.rs:7`
- Search-mode extraction: `extract_search_mode` at `crates/omnigraph/src/exec/query.rs:110`
- Pipeline runner: `execute_query` at `crates/omnigraph/src/exec/query.rs:347`
- RRF fan-out: `execute_rrf_query` at `crates/omnigraph/src/exec/query.rs:393`
- Per-source-row BFS: `execute_expand` at `crates/omnigraph/src/exec/query.rs:675`
- Lance scan + pushdown: `execute_node_scan` at `crates/omnigraph/src/exec/query.rs:1027`
- Filter → SQL pushdown: `build_lance_filter` at `crates/omnigraph/src/exec/query.rs:1158`

Multi-modal search modes (`SearchMode`)

The executor recognizes three modes that may be combined in a single query:

- `nearest`: vector ANN (uses a Lance vector index; LIMIT required).
- `bm25`: BM25 over an inverted index.
- `rrf`: Reciprocal Rank Fusion of two rankings, with `k` (default 60).

Hybrid example: `order { rrf(nearest($d.embedding, $q), bm25($d.body, $q_text)) desc } limit 20`.
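
For readers unfamiliar with Reciprocal Rank Fusion, the standard formula scores each document as the sum of `1 / (k + rank)` over every ranking it appears in. The sketch below shows that formula with `k = 60`. It assumes 1-based ranks and a lexicographic tie-break; neither detail is confirmed for OmniGraph's `execute_rrf_query`, and the function name `rrf` is illustrative.

```rust
use std::collections::HashMap;

/// Fuse several best-first rankings with Reciprocal Rank Fusion.
/// Assumption: ranks are 1-based; ties are broken by id for determinism.
fn rrf(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (i, id) in ranking.iter().enumerate() {
            let rank = (i + 1) as f64;
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first.
    fused.sort_by(|a, b| {
        b.1.partial_cmp(&a.1)
            .unwrap()
            .then_with(|| a.0.cmp(&b.0))
    });
    fused
}
```

With a vector ranking `[a, b, c]` and a BM25 ranking `[b, d, a]`, document `b` comes out first because it ranks highly in both lists, followed by `a`, `d`, and `c`.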

Joins / set operations

- Joins are implicit: `MATCH` bindings + traversals are implemented as scans + CSR/CSC lookups.
- `not { … }` lowers to an `AntiJoin` over the inner pipeline.
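
To make the two ideas concrete, here is a small sketch of a CSR (compressed sparse row) adjacency structure and an anti-join over node ids. It is a generic illustration under assumed names (`Csr`, `anti_join`) and dense `u32` node ids; the layout of OmniGraph's `GraphIndex` is not described in this document.

```rust
use std::collections::HashSet;

/// CSR adjacency: neighbors of node `n` live in
/// `targets[offsets[n]..offsets[n + 1]]`.
struct Csr {
    offsets: Vec<usize>,
    targets: Vec<u32>,
}

impl Csr {
    /// Build from (src, dst) pairs; node ids are assumed to be 0..num_nodes.
    fn from_edges(num_nodes: usize, edges: &[(u32, u32)]) -> Csr {
        let mut offsets = vec![0usize; num_nodes + 1];
        for &(src, _) in edges {
            offsets[src as usize + 1] += 1; // count out-degree
        }
        for i in 0..num_nodes {
            offsets[i + 1] += offsets[i]; // prefix sum into start offsets
        }
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        for &(src, dst) in edges {
            targets[cursor[src as usize]] = dst;
            cursor[src as usize] += 1;
        }
        Csr { offsets, targets }
    }

    /// Traversal step: a slice lookup, no hashing or scanning.
    fn neighbors(&self, node: u32) -> &[u32] {
        let n = node as usize;
        &self.targets[self.offsets[n]..self.offsets[n + 1]]
    }
}

/// Anti-join: keep outer rows that have no match in the inner result.
fn anti_join(outer: &[u32], inner_matches: &HashSet<u32>) -> Vec<u32> {
    outer
        .iter()
        .copied()
        .filter(|id| !inner_matches.contains(id))
        .collect()
}
```

A query of the shape "nodes with no outgoing edge" would, in this sketch, collect every node whose `neighbors` slice is non-empty into the inner set and then anti-join the full node list against it.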

Scoped reads

- `query(target, source, name, params)`: at any branch or snapshot.
- `run_query_at(version, …)`: direct historical query at a manifest version.

Concurrency

Snapshot isolation per query: all reads inside a query use the same Snapshot.
Readers and writers on different branches don't block each other.
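
Snapshot isolation of this kind is commonly built on immutable, reference-counted snapshots: a reader pins one snapshot for the whole query, and a writer publishes a new one without touching the old. The sketch below shows that general pattern only. `Branch` and `Snapshot` here are hypothetical types, not OmniGraph's, and the real engine resolves snapshots through the manifest rather than an in-memory lock.

```rust
use std::sync::{Arc, RwLock};

/// Immutable view of the data at one version (hypothetical type).
#[derive(Debug)]
struct Snapshot {
    version: u64,
    rows: Vec<&'static str>,
}

/// A branch head that readers can pin and writers can replace.
struct Branch {
    head: RwLock<Arc<Snapshot>>,
}

impl Branch {
    fn new(rows: Vec<&'static str>) -> Branch {
        Branch {
            head: RwLock::new(Arc::new(Snapshot { version: 1, rows })),
        }
    }

    /// Pin the current snapshot; the lock is held only for the clone.
    fn snapshot(&self) -> Arc<Snapshot> {
        let guard = self.head.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Publish a new snapshot; previously pinned snapshots are unaffected.
    fn commit(&self, rows: Vec<&'static str>) -> u64 {
        let mut head = self.head.write().unwrap();
        let next = head.version + 1;
        *head = Arc::new(Snapshot { version: next, rows });
        next
    }
}
```

A query that called `snapshot()` before a commit keeps seeing version 1 until it finishes, while new queries see version 2.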

Mutation execution (`exec/mutation.rs`)

The mutation executor resolves expression values to literals, converts them to typed Arrow arrays via `literal_to_typed_array(lit, DataType, num_rows)`, then writes:

- insert → Lance `WriteMode::Append`
- update → Lance `merge_insert(WhenMatched::Update)`
- delete → Lance `merge_insert(WhenMatched::Delete)` (logical) or filtered overwrite.
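
The signature `literal_to_typed_array(lit, DataType, num_rows)` suggests that one literal is broadcast into a column of `num_rows` values of the target type. The sketch below illustrates that reading with plain vectors instead of Arrow arrays. Everything in it is an assumption: `Literal`, `DataType`, `TypedArray`, and `broadcast_literal` are stand-ins, and the real function may handle more types and coercions.

```rust
// Stand-in types; the real code uses Arrow's DataType and array types.
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Int(i64),
    Str(String),
    Null,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

#[derive(Debug, PartialEq)]
enum TypedArray {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

/// Repeat one literal `num_rows` times as a column of type `ty`.
/// A literal that does not fit the column type is rejected.
fn broadcast_literal(lit: &Literal, ty: DataType, num_rows: usize) -> Result<TypedArray, String> {
    match (lit, ty) {
        (Literal::Int(v), DataType::Int64) => Ok(TypedArray::Int64(vec![Some(*v); num_rows])),
        (Literal::Str(s), DataType::Utf8) => Ok(TypedArray::Utf8(vec![Some(s.clone()); num_rows])),
        (Literal::Null, DataType::Int64) => Ok(TypedArray::Int64(vec![None; num_rows])),
        (Literal::Null, DataType::Utf8) => Ok(TypedArray::Utf8(vec![None; num_rows])),
        (other, ty) => Err(format!("cannot store {:?} in a {:?} column", other, ty)),
    }
}
```

An update that sets a status property to a fixed string across three matched rows would, under this reading, produce a three-element string column before it reaches the `merge_insert` call.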

Multi-statement mutations are atomic at the manifest commit boundary.

Mutation flow — sequence

sequenceDiagram
    autonumber
    participant client as Client
    participant og as Omnigraph::mutate<br/>(mutation.rs:511)
    participant cmp as omnigraph-compiler
    participant ts as table_store
    participant lance as Lance dataset
    participant mr as ManifestRepo<br/>(manifest.rs:280)
    participant manifest as __manifest/

    client->>og: mutate(target, source, name, params)
    og->>cmp: parse + typecheck_query
    cmp-->>og: CheckedQuery (Mutation IR)
    og->>og: resolve expression literals<br/>literal_to_typed_array(lit, type, n)
    loop for each mutation statement
        alt insert
            og->>ts: append RecordBatches
            ts->>lance: WriteMode::Append → new fragment(s)
        else update
            og->>ts: merge_insert keyed by id
            ts->>lance: merge_insert(WhenMatched::Update)
        else delete
            og->>ts: merge_insert with delete predicate
            ts->>lance: merge_insert(WhenMatched::Delete)
        end
        lance-->>ts: new dataset version
        ts-->>og: SubTableUpdate (key, version, row_count)
    end
    og->>mr: commit(updates)
    mr->>manifest: append rows<br/>(table_version per sub-table)
    manifest-->>mr: new graph-manifest version
    mr-->>og: graph version
    og-->>client: MutationResult

Code paths:

- Entry: `Omnigraph::mutate` at `crates/omnigraph/src/exec/mutation.rs:511`
- Actor-attributed variant: `Omnigraph::mutate_as` at `crates/omnigraph/src/exec/mutation.rs:522`
- Manifest commit: `ManifestRepo::commit` at `crates/omnigraph/src/db/manifest.rs:280`

The whole mutation — every statement, every affected sub-table — publishes through one call to ManifestRepo::commit. That single append to __manifest is what gives multi-statement mutations their atomicity guarantee (per docs/invariants.md §VI.26).
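
The single-commit idea can be sketched as "stage per-table updates, then publish them in one step". The example below is a simplified in-memory model, not `ManifestRepo`: the `Manifest` struct is hypothetical, the `SubTableUpdate` fields mirror the `(key, version, row_count)` tuple in the diagram, and the stale-version check is an assumption added to show the all-or-nothing behavior.

```rust
use std::collections::BTreeMap;

/// One sub-table's new state, produced by a Lance write.
#[derive(Debug, Clone, PartialEq)]
struct SubTableUpdate {
    key: String,
    version: u64,
    row_count: u64,
}

/// Simplified manifest: graph version plus (table_version, row_count) per key.
#[derive(Debug, Default)]
struct Manifest {
    graph_version: u64,
    tables: BTreeMap<String, (u64, u64)>,
}

impl Manifest {
    /// Publish every update or none of them.
    fn commit(&mut self, updates: &[SubTableUpdate]) -> Result<u64, String> {
        // Validate first so a rejected update leaves the manifest untouched.
        // (Assumed rule: a table version must move forward.)
        for u in updates {
            if let Some(&(current, _)) = self.tables.get(&u.key) {
                if u.version <= current {
                    return Err(format!("stale version for {}", u.key));
                }
            }
        }
        for u in updates {
            self.tables.insert(u.key.clone(), (u.version, u.row_count));
        }
        // One graph-version bump covers all affected sub-tables.
        self.graph_version += 1;
        Ok(self.graph_version)
    }
}
```

Readers that resolve table versions through the manifest see either the old graph version or the new one, never a state where only some statements of the mutation are visible.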

Bulk loader (`loader/mod.rs`)

- JSONL only in v1, with two record shapes:
  - Node: `{"type":"NodeType", "data":{…}}`
  - Edge: `{"edge":"EdgeType", "from":"src_id", "to":"dst_id", "data":{…}}`
- Lines starting with `//` are treated as comments.
- Schema validation on every row (typecheck, required props, blob base64 decoding).
- Edge endpoint resolution by node `@key`.
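
A first pass over such a file might classify each line before any schema validation. The sketch below does that with the standard library only. It is deliberately crude and rests on stated assumptions: it looks at the first key of the object instead of parsing JSON (so it depends on the key order shown in the record shapes above), it tolerates leading whitespace, and it treats blank lines as skippable, none of which is confirmed for the real loader. `LineKind`, `first_key`, and `classify` are illustrative names.

```rust
#[derive(Debug, PartialEq)]
enum LineKind {
    Blank,
    Comment,
    Node,
    Edge,
    Unknown,
}

/// Extract the first key of a JSON object without a full parse.
/// Assumes the line looks like `{"key": ...`.
fn first_key(line: &str) -> Option<&str> {
    let rest = line.trim_start().strip_prefix('{')?;
    let rest = rest.trim_start().strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(&rest[..end])
}

/// Classify one JSONL line by comment marker or leading key.
fn classify(line: &str) -> LineKind {
    let trimmed = line.trim();
    if trimmed.is_empty() {
        return LineKind::Blank;
    }
    if trimmed.starts_with("//") {
        return LineKind::Comment;
    }
    match first_key(trimmed) {
        Some("type") => LineKind::Node,
        Some("edge") => LineKind::Edge,
        _ => LineKind::Unknown,
    }
}
```

A production loader would hand `Node` and `Edge` lines to a real JSON parser and report `Unknown` lines with their line number.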

Load modes (`LoadMode`)

| Mode | Semantics |
| --- | --- |
| `Overwrite` | Replace all data in the target tables on the branch |
| `Append` | Strict insert; duplicates error |
| `Merge` | Upsert by id (`merge_insert`) |
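
The three modes can be modeled on an in-memory table keyed by id. This is a sketch of the semantics in the table above, not the loader's code: `Table` and `apply_load` are hypothetical, rows are reduced to `(id, payload)` string pairs, and the choice to reject an `Append` batch before writing anything is an assumption.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy)]
enum LoadMode {
    Overwrite,
    Append,
    Merge,
}

/// Toy table: id -> payload.
type Table = BTreeMap<String, String>;

fn apply_load(table: &mut Table, rows: &[(&str, &str)], mode: LoadMode) -> Result<(), String> {
    match mode {
        LoadMode::Overwrite => {
            // Replace all existing data with the incoming rows.
            table.clear();
            for (id, v) in rows {
                table.insert(id.to_string(), v.to_string());
            }
        }
        LoadMode::Append => {
            // Strict insert: stage first so a duplicate leaves the table untouched.
            let mut staged = Table::new();
            for (id, v) in rows {
                if table.contains_key(*id) || staged.insert(id.to_string(), v.to_string()).is_some() {
                    return Err(format!("duplicate id: {id}"));
                }
            }
            table.extend(staged);
        }
        LoadMode::Merge => {
            // Upsert: existing ids are updated, new ids are inserted.
            for (id, v) in rows {
                table.insert(id.to_string(), v.to_string());
            }
        }
    }
    Ok(())
}
```

Loading the same file twice succeeds under `Merge` and `Overwrite` but fails on the second run under `Append`, which is the practical difference to keep in mind when choosing a mode.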

`load` vs `ingest`

- `load(branch, data, mode)`: direct load to a branch.
- `ingest(branch, from, data, mode)`: branch-creating, transactional load:
  1. If target advanced since the run started, fork a fresh run branch from the base branch (`from`).
  2. Load into the run branch (Append).
  3. If target hasn't moved, fast-publish; otherwise abort.

  Returns `IngestResult { branch, base_branch, branch_created, mode, tables[] }`.
- `ingest_as(actor_id)` records the actor on the resulting commit.

Embeddings during load

If a node type has @embed properties, the loader calls the engine embedding client (Gemini, RETRIEVAL_DOCUMENT) per row to populate the vector column. See embeddings.md.
