mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-30 02:49:39 +02:00
docs(user): restructure user docs into topic sections (Phase 1) (#223)
Move the 23 flat docs/user/*.md files into topic subdirectories so the user guide is organized by area (schema, queries, search, branching, cli, operations, clusters, concepts, reference) instead of a flat list. This is a pure structural move — whole files relocated, every cross-doc link recomputed, no prose rewrites or content splits (those follow in Phase 2). - 19 `git mv`s (install.md, deployment.md stay top-level); history preserved (renames detected at 92–100% similarity). - All intra-doc links, AGENTS.md's topic table (52 pointers), and the docs/dev + docs/releases back-links recomputed via relpath from each file's new location. - docs/user/index.md rewritten as a sectioned nav hub. - Fixed 5 doc-path references in Rust (comments + two user-facing server settings error strings) to point at the new locations. Verified: zero broken .md links across tracked docs; check-agents-md.sh green (with the untracked scratch docs set aside); touched crates build. Note: the public site (omnigraph-web) imports docs/ via a flat-only script; its import-docs.mjs needs a subdir-aware update before the next re-sync. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
8726ca92ec
commit
d46e50dd6d
33 changed files with 126 additions and 109 deletions
31
docs/user/search/embeddings.md
Normal file
31
docs/user/search/embeddings.md
Normal file
|
|
@ -0,0 +1,31 @@
|
|||
# Embeddings
|
||||
|
||||
OmniGraph has **two** embedding clients with different defaults and purposes.
|
||||
|
||||
## Compiler-side client (`omnigraph-compiler/src/embedding.rs`) — query-time normalization
|
||||
|
||||
- Default model: `text-embedding-3-small` (OpenAI-style schema)
|
||||
- Env: `NANOGRAPH_EMBED_MODEL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL` (default `https://api.openai.com/v1`), `NANOGRAPH_EMBEDDINGS_MOCK`, `NANOGRAPH_EMBED_TIMEOUT_MS=30000`, `NANOGRAPH_EMBED_RETRY_ATTEMPTS=4`, `NANOGRAPH_EMBED_RETRY_BACKOFF_MS=200`
|
||||
- Methods: `embed_text(input, expected_dim)`, `embed_texts(inputs, expected_dim)`
|
||||
- Mock mode: deterministic FNV-1a + xorshift64 → L2-normalized vectors
|
||||
|
||||
## Engine-side client (`omnigraph/src/embedding.rs`) — runtime ingest
|
||||
|
||||
- Model: `gemini-embedding-2-preview`
|
||||
- Env: `GEMINI_API_KEY`, `OMNIGRAPH_GEMINI_BASE_URL` (default Google generativelanguage v1beta), `OMNIGRAPH_EMBED_TIMEOUT_MS=30000`, `OMNIGRAPH_EMBED_RETRY_ATTEMPTS=4`, `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS=200`, `OMNIGRAPH_EMBEDDINGS_MOCK`
|
||||
- Two task types: `embed_query_text` (RETRIEVAL_QUERY) and `embed_document_text` (RETRIEVAL_DOCUMENT)
|
||||
- Exponential backoff with retryable detection (timeouts, 429, 5xx)
|
||||
|
||||
## Schema integration
|
||||
|
||||
Mark a Vector property with `@embed("source_text_property")`. At ingest, the engine pulls the source text and writes the embedding into the vector column. Stored as L2-normalized FixedSizeList(Float32, dim).
|
||||
|
||||
## CLI `omnigraph embed` (offline file pipeline)
|
||||
|
||||
Operates on **JSONL files** (not on a graph). Three modes (mutually exclusive):
|
||||
|
||||
- (default) `fill_missing` — only embed rows whose target field is empty
|
||||
- `--reembed-all` — overwrite all
|
||||
- `--clean` — strip embeddings
|
||||
|
||||
Inputs are either a single seed manifest YAML or `--input/--output/--spec`. Selectors `--type T`, `--select T:field=value` filter rows. Streams JSONL → JSONL.
|
||||
26
docs/user/search/indexes.md
Normal file
26
docs/user/search/indexes.md
Normal file
|
|
@ -0,0 +1,26 @@
|
|||
# Indexes
|
||||
|
||||
## L1 — Lance index types OmniGraph exposes
|
||||
|
||||
| Index | Use | Notes |
|
||||
|---|---|---|
|
||||
| **BTREE scalar** | range / equality on any scalar | created on `@key`, `@index(...)`, and on key columns by `ensure_indices()` |
|
||||
| **Inverted (FTS)** | `search`, `fuzzy`, `match_text`, `bm25` | created on text columns referenced by FTS queries |
|
||||
| **Vector** | `nearest()` k-NN | Lance picks IVF_PQ vs HNSW family by configuration; OmniGraph stores as FixedSizeList(Float32, dim) |
|
||||
|
||||
## L2 — OmniGraph orchestration
|
||||
|
||||
- `ensure_indices()` / `ensure_indices_on(branch)` — idempotent build of BTREE + inverted indexes for the current head; safe to re-run.
|
||||
- Indexes are built on the *branch head* (not on a snapshot), so reads always see the current index state.
|
||||
- **Lazy branch forking for indexes**: a branch that hasn't mutated a sub-table doesn't need its own index — the main lineage's index is reused until the first write triggers a copy-on-write fork.
|
||||
- Vector index parameters (metric, nlist, nprobe, etc.) are not exposed in the schema; they default at the Lance layer and are picked up automatically when an index is asked for on a Vector column.
|
||||
|
||||
## L2 — Graph topology index (`graph_index/mod.rs`)
|
||||
|
||||
This is OmniGraph-specific (not Lance):
|
||||
|
||||
- `TypeIndex`: dense `u32 ↔ String id` mapping per node type.
|
||||
- `CsrIndex`: Compressed Sparse Row representation of edges per edge type — `offsets[i]..offsets[i+1]` slices into `targets`.
|
||||
- `GraphIndex { type_indices, csr (out), csc (in) }` — built on demand from a snapshot's edge tables, **lazily**: only when an `Expand` the planner routes to the CSR path (dense / large frontier) or an `AntiJoin` actually needs it.
|
||||
- Cached in `RuntimeCache::graph_indices` (LRU, max 8 entries, keyed by snapshot id + edge table versions).
|
||||
- Selective `Expand`s resolve neighbors from the persisted `src`/`dst` BTREE instead (one indexed scan per hop) and never trigger the CSR build; see [query-language](../queries/index.md) → Expand. Pure scans, and queries served entirely by the indexed traversal path, skip it.
|
||||
Loading…
Add table
Add a link
Reference in a new issue