omnigraph/docs/embeddings.md

# Embeddings

OmniGraph has **two** embedding clients with different defaults and purposes.

## Compiler-side client (`omnigraph-compiler/src/embedding.rs`) — query-time normalization

- Default model: `text-embedding-3-small` (OpenAI-style schema)
- Env: `NANOGRAPH_EMBED_MODEL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL` (default `https://api.openai.com/v1`), `NANOGRAPH_EMBEDDINGS_MOCK`, `NANOGRAPH_EMBED_TIMEOUT_MS=30000`, `NANOGRAPH_EMBED_RETRY_ATTEMPTS=4`, `NANOGRAPH_EMBED_RETRY_BACKOFF_MS=200`
- Methods: `embed_text(input, expected_dim)`, `embed_texts(inputs, expected_dim)`
- Mock mode: deterministic FNV-1a + xorshift64 → L2-normalized vectors

## Engine-side client (`omnigraph/src/embedding.rs`) — runtime ingest

- Model: `gemini-embedding-2-preview`
- Env: `GEMINI_API_KEY`, `OMNIGRAPH_GEMINI_BASE_URL` (default Google generativelanguage v1beta), `OMNIGRAPH_EMBED_TIMEOUT_MS=30000`, `OMNIGRAPH_EMBED_RETRY_ATTEMPTS=4`, `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS=200`, `OMNIGRAPH_EMBEDDINGS_MOCK`
- Two task types: `embed_query_text` (RETRIEVAL_QUERY) and `embed_document_text` (RETRIEVAL_DOCUMENT)
- Exponential backoff with retryable detection (timeouts, 429, 5xx)

## Schema integration

Mark a Vector property with `@embed("source_text_property")`. At ingest, the engine pulls the source text and writes the embedding into the vector column. Stored as L2-normalized FixedSizeList(Float32, dim).

## CLI `omnigraph embed` (offline file pipeline)

Operates on **JSONL files** (not on a repo). Three modes (mutually exclusive):

- (default) `fill_missing` — only embed rows whose target field is empty
- `--reembed-all` — overwrite all
- `--clean` — strip embeddings

Inputs are either a single seed manifest YAML or `--input/--output/--spec`. Selectors `--type T`, `--select T:field=value` filter rows. Streams JSONL → JSONL.
Refactor AGENTS.md from encyclopedia to map; move spec into docs/ Splits the 990-line AGENTS.md into a 184-line map (architecture, where-to-find index, always-on invariants, capability matrix, maintenance contract) plus 18 new docs/*.md files holding the deep content per topic (storage, schema and query languages, indexes, embeddings, branches/commits, runs, merge, changes, execution, policy, server, CLI reference, audit, errors, CI, constants, v0.3.1 notes). Adds scripts/check-agents-md.sh and a check_agents_md CI job that verifies every docs/ link in AGENTS.md resolves and every doc in the canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> 2026-04-28 23:31:08 +02:00			`# Embeddings`

			`OmniGraph has two embedding clients with different defaults and purposes.`

			## Compiler-side client (`omnigraph-compiler/src/embedding.rs`) — query-time normalization

			- Default model: `text-embedding-3-small` (OpenAI-style schema)
			- Env: `NANOGRAPH_EMBED_MODEL`, `OPENAI_API_KEY`, `OPENAI_BASE_URL` (default `https://api.openai.com/v1`), `NANOGRAPH_EMBEDDINGS_MOCK`, `NANOGRAPH_EMBED_TIMEOUT_MS=30000`, `NANOGRAPH_EMBED_RETRY_ATTEMPTS=4`, `NANOGRAPH_EMBED_RETRY_BACKOFF_MS=200`
			- Methods: `embed_text(input, expected_dim)`, `embed_texts(inputs, expected_dim)`
			`- Mock mode: deterministic FNV-1a + xorshift64 → L2-normalized vectors`

			## Engine-side client (`omnigraph/src/embedding.rs`) — runtime ingest

			- Model: `gemini-embedding-2-preview`
			- Env: `GEMINI_API_KEY`, `OMNIGRAPH_GEMINI_BASE_URL` (default Google generativelanguage v1beta), `OMNIGRAPH_EMBED_TIMEOUT_MS=30000`, `OMNIGRAPH_EMBED_RETRY_ATTEMPTS=4`, `OMNIGRAPH_EMBED_RETRY_BACKOFF_MS=200`, `OMNIGRAPH_EMBEDDINGS_MOCK`
			- Two task types: `embed_query_text` (RETRIEVAL_QUERY) and `embed_document_text` (RETRIEVAL_DOCUMENT)
			`- Exponential backoff with retryable detection (timeouts, 429, 5xx)`

			`## Schema integration`

			Mark a Vector property with `@embed("source_text_property")`. At ingest, the engine pulls the source text and writes the embedding into the vector column. Stored as L2-normalized FixedSizeList(Float32, dim).

			## CLI `omnigraph embed` (offline file pipeline)

			`Operates on JSONL files (not on a repo). Three modes (mutually exclusive):`

			- (default) `fill_missing` — only embed rows whose target field is empty
			- `--reembed-all` — overwrite all
			- `--clean` — strip embeddings

			Inputs are either a single seed manifest YAML or `--input/--output/--spec`. Selectors `--type T`, `--select T:field=value` filter rows. Streams JSONL → JSONL.