diff --git a/docs-site/content/docs/concepts/meta.json b/docs-site/content/docs/concepts/meta.json
index 72c0a407..e82721b7 100644
--- a/docs-site/content/docs/concepts/meta.json
+++ b/docs-site/content/docs/concepts/meta.json
@@ -1,5 +1,5 @@
 {
   "title": "Concepts",
   "defaultOpen": true,
-  "pages": ["the-context-layer", "semantic-layer-internals", "context-as-code"]
+  "pages": ["the-context-layer", "semantic-layer-internals", "wiki-retrieval", "context-as-code"]
 }
diff --git a/docs-site/content/docs/concepts/the-context-layer.mdx b/docs-site/content/docs/concepts/the-context-layer.mdx
index af2a6bdb..b88546c2 100644
--- a/docs-site/content/docs/concepts/the-context-layer.mdx
+++ b/docs-site/content/docs/concepts/the-context-layer.mdx
@@ -195,6 +195,10 @@ wiki pages are written and prunes `sl_refs` during ingest when their target
 sources are deleted or their measures are renamed - so a stale page can never
 quietly route an agent to a definition that no longer exists.
 
+For how the hybrid search pipeline ranks pages, how `[[wikilinks]]` extend
+the graph, and how ingest authors pages from evidence, read
+[Wiki retrieval](/docs/concepts/wiki-retrieval).
+
 The split between the two pillars is sharp:
 
 | Put it in YAML | Put it in Markdown |
diff --git a/docs-site/content/docs/concepts/wiki-retrieval.mdx b/docs-site/content/docs/concepts/wiki-retrieval.mdx
new file mode 100644
index 00000000..32e1bf61
--- /dev/null
+++ b/docs-site/content/docs/concepts/wiki-retrieval.mdx
@@ -0,0 +1,280 @@
+---
+title: Wiki retrieval
+description: How ktx ranks wiki pages with hybrid search, links them into a graph, and keeps both sides anchored to evidence.
+---
+
+The wiki is the prose half of the context layer. Agents reach it two ways:
+they search for a page, then follow references inside the pages they
+already opened. This page covers how both work.
+
+- The wiki page contract that retrieval and validation depend on.
+- The hybrid search pipeline that turns a question into ranked pages.
+- The reference graph agents traverse without rerunning search.
+- How pages get authored from evidence, and how broken edges get pruned.
+
+## The wiki page contract
+
+A wiki page is a Markdown file with a YAML frontmatter block. Frontmatter
+carries metadata; the prose below it is free-form. Keys are flat tokens
+(`revenue`, `mart_account_segments`), not paths, so every page is
+addressable as `[[key]]` from any other page.
+
+```markdown
+# wiki/global/revenue.md
+---
+summary: Paid order value after refunds
+tags: [finance, orders]
+sl_refs: [warehouse.orders]
+refs: [segment-classification]
+usage_mode: auto
+---
+
+Revenue is paid order amount after refund adjustments.
+
+Use `orders.total_revenue` for recognized order value and
+`orders.order_count` for paid order volume.
+```
+
+| Field | Purpose |
+|-------|---------|
+| `summary` | One-line description shown in search results and the agent's knowledge index |
+| `tags` | Topic labels mixed into the search text and used for filtering |
+| `refs` | Outgoing edges to other wiki pages by key |
+| `sl_refs` | Outgoing edges to semantic-layer sources by `connection.source` name |
+| `usage_mode` | `always`, `auto`, or `never` - whether the agent must, may, or must not surface this page |
+| `source` | Where the page came from when authored by ingest (e.g. `historic-sql`, `dbt`) |
+| `usage` | Stats attached to historic-SQL pattern pages: executions, distinct users, runtime percentiles, error rate |
+
+Pages live under two scopes. `wiki/global/*.md` is the team's shared
+context; `wiki/user//*.md` is per-agent scratch space that shadows
+global pages with the same key.
+
+## What retrieval does
+
+A wiki search runs the same ordered steps every time.
+
+1. **Normalize the query.** Lowercase, tokenize, deduplicate terms.
+2. **Score in three lanes.** Lexical (SQLite FTS5 bm25), semantic
+   (cosine similarity over embeddings), and token (term-overlap fallback)
+   each rank every page independently.
+3. **Fuse with Reciprocal Rank Fusion.** Each lane contributes
+   `weight / (60 + rank)` to a candidate's score. Lanes that fail or
+   skip are dropped, not zeroed.
+4. **Order and trim.** Sort by fused score, then by how many lanes
+   matched, then by id for stable tie-breaks. Return the top `limit`
+   results with their summaries.
+5. **Hydrate on demand.** The agent calls `wiki_read` to load full
+   bodies for the few pages that look relevant.
+
+**Figure: Hybrid retrieval. Three lanes, one ranking.**
+
+| Lane | Engine | Weight | Role |
+|------|--------|--------|------|
+| lexical | sqlite fts5 / bm25 | 1.5 | Matches stems and phrases. Strong on the exact terms the team already uses. |
+| semantic | cosine over embeddings | 2 | Catches synonyms and paraphrases the lexical lane misses. |
+| token | term-overlap fallback | 0.75 | Always available, so short queries still produce candidates. |
+
+Reciprocal Rank Fusion: `score = Σ weight / (60 + rank)`. Pages that rank
+well in multiple lanes outscore pages that rank well in only one.
+
+Defaults are tunable. Lane weights and the RRF constant K are
+configuration, not assumptions.
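The fusion and ordering steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the ktx implementation: the function name `fuse`, the dict-of-lists input shape, and the choice of 1-based ranks are assumptions made for the example. The lane weights and the constant 60 are the defaults quoted on this page.

```python
from collections import defaultdict

# Defaults quoted in the page; treated here as plain configuration.
LANE_WEIGHTS = {"lexical": 1.5, "semantic": 2.0, "token": 0.75}
RRF_K = 60


def fuse(lane_rankings, weights=LANE_WEIGHTS, k=RRF_K, limit=10):
    """Fuse per-lane rankings (best first) with weighted Reciprocal Rank Fusion.

    lane_rankings maps a lane name to an ordered list of page ids, or to
    None when the lane failed or was skipped. Such lanes are dropped.
    Ranks are assumed to start at 1 (an assumption, not stated in the docs).
    """
    scores = defaultdict(float)
    matched = defaultdict(int)
    for lane, ranking in lane_rankings.items():
        if ranking is None:
            continue  # dropped lane: contributes nothing
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += weights[lane] / (k + rank)
            matched[page_id] += 1
    # Sort by fused score, then by number of matching lanes, then by id.
    ordered = sorted(scores, key=lambda p: (-scores[p], -matched[p], p))
    return [(p, scores[p], matched[p]) for p in ordered[:limit]]


if __name__ == "__main__":
    results = fuse({
        "lexical": ["revenue", "orders"],
        "semantic": ["orders", "revenue"],
        "token": None,  # lane skipped
    })
    for page_id, score, lanes in results:
        print(page_id, round(score, 6), lanes)
```

In this example `orders` ends up ahead of `revenue` because the semantic lane carries the larger weight, which is the behaviour the weighting is meant to produce.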
+
+The text each lane scores is built deterministically: page key, summary,
+body, and tags concatenated in that order. A precise summary and the
+right tags make a page reachable before its body matches anything.
+
+## The page graph
+
+Two frontmatter fields and one inline syntax turn the wiki into a graph
+the agent traverses without re-running search.
+
+| Edge | Source | Target |
+|------|--------|--------|
+| `sl_refs: [warehouse.orders]` | Frontmatter | Semantic source by name |
+| `refs: [segment-classification]` | Frontmatter | Another wiki page by key |
+| `[[segment-classification]]` | Inline in body | Another wiki page by key |
+
+`refs` stays in the prose layer; `sl_refs` crosses into the executable
+half of the context layer. Inline `[[wikilinks]]` are extracted from
+page bodies at validation time and treated as declared `refs`.
+
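A minimal sketch of how inline links could be folded into the declared `refs` is shown here. The helper name `outgoing_edges` and the regular expression are hypothetical, and the sketch assumes that keys contain only letters, digits, underscores, and hyphens. It does not try to skip links that appear inside code fences.

```python
import re

# Assumed link syntax: [[key]], where key is a flat token.
WIKILINK = re.compile(r"\[\[([A-Za-z0-9_-]+)\]\]")


def outgoing_edges(frontmatter, body):
    """Return (page_refs, sl_refs) for one wiki page.

    page_refs merges the frontmatter `refs` with inline [[wikilinks]],
    keeping first-seen order and dropping duplicates. sl_refs is passed
    through unchanged because it points at semantic sources, not pages.
    """
    page_refs = list(frontmatter.get("refs", []))
    for key in WIKILINK.findall(body):
        if key not in page_refs:
            page_refs.append(key)
    return page_refs, list(frontmatter.get("sl_refs", []))


if __name__ == "__main__":
    fm = {"refs": ["segment-classification"], "sl_refs": ["warehouse.orders"]}
    text = "See [[segment-classification]] and [[mart_account_segments]]."
    print(outgoing_edges(fm, text))
```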
+**Figure: Anatomy of a traversal. Edges to prose, edges to SQL.**
+
+| Node | Kind | File | Details |
+|------|------|------|---------|
+| `revenue` | Wiki page | `wiki/global/revenue.md` | declares `sl_refs`: warehouse.orders; `refs`: segment-classification |
+| `segment-classification` | Wiki page | `wiki/global/segment-classification.md` | declares `sl_refs`: warehouse.customers |
+| `warehouse.orders` | Semantic source | `semantic-layer/warehouse/orders.yaml` | grain: order_id · measure: total_revenue |
+| `warehouse.customers` | Semantic source | `semantic-layer/warehouse/customers.yaml` | grain: customer_id · dim: segment |
+
+Edges drawn: revenue → warehouse.orders (`sl_refs`); revenue → segment-classification (`refs`).
+
+## Keeping the graph live
+
+A page that references a deleted source is worse than no reference at
+all - it sends the agent confidently to a definition that no longer
+exists. **ktx** prevents that with three layered checks:
+
+- **At write time.** Every `refs` entry and `[[wikilink]]` is validated
+  against the pages visible in the current scope. A write that targets
+  a missing page is rejected before any file changes.
+- **At ingest time.** Adapters prune `sl_refs` when the target source
+  is deleted, mark stale pattern pages with `stale_since`, and set
+  `archived_since` on retired pages instead of removing them silently.
+- **At session end.** Every page touched by an ingest run is re-scanned
+  for references that resolved at write time but no longer point at
+  a live target. Dangling pairs are reported so the next iteration can
+  fix them.
+
+## Where the pages come from
+
+**ktx** writes wiki pages from evidence, not free invention. Each input
+contributes a different kind of page, and accepted edits feed the next
+ingest as input.
+
+| Evidence | What it produces |
+|----------|------------------|
+| Schema scans | One page per material table, with grain, columns, and known constraints |
+| Query history | Pattern pages with `usage` frontmatter for executions, distinct users, runtime percentiles, and error rate |
+| dbt manifests | Pages per model, exposure, and test, with `sl_refs` to the matching semantic source |
+| MetricFlow, Looker, Metabase | Pages per metric, explore, or saved question, linked back to the source artifact |
+| Notion, docs, analyst notes | Pages preserving business definitions, policies, and incident write-ups |
+| Agent and analyst edits | First-class input to the next ingest, not a fork |
+
+Provenance stays with the page. Ingested pages keep HTML comments like
+`` inline, so a reviewer can
+walk from the prose back to the artifact that produced it.
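The write-time check described under "Keeping the graph live" could look roughly like the sketch below. The names `missing_refs`, `check_write`, and `MissingReferenceError` are hypothetical, and visibility is modeled as the union of global and user-scope keys, which is an assumption about how scope shadowing resolves rather than documented behaviour.

```python
class MissingReferenceError(ValueError):
    """Raised when a page write points at keys that are not visible."""


def missing_refs(refs, global_keys, user_keys=()):
    """Return the refs that do not resolve to a visible page key.

    Assumption: a key is visible if it exists in either the global scope
    or the current user scope.
    """
    visible = set(global_keys) | set(user_keys)
    return [key for key in refs if key not in visible]


def check_write(page_key, refs, global_keys, user_keys=()):
    """Reject the write before any file changes if a reference is missing."""
    missing = missing_refs(refs, global_keys, user_keys)
    if missing:
        raise MissingReferenceError(
            f"{page_key}: unresolved references {missing}"
        )


if __name__ == "__main__":
    pages = {"revenue", "segment-classification"}
    check_write("revenue", ["segment-classification"], pages)  # passes
    try:
        check_write("revenue", ["churn"], pages)
    except MissingReferenceError as err:
        print("rejected:", err)
```

Raising before touching the filesystem mirrors the stated guarantee that a rejected write leaves no partial changes behind.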
+
+## Agent usage notes
+
+Point an agent at this page when it needs to explain why a wiki search
+returned the pages it did, why a write was rejected, or how the wiki
+stays in step with the semantic layer.
+
+| Agent task | Relevant section | Next page |
+|------------|------------------|-----------|
+| Explain why two searches return different pages for the same query | What retrieval does | [ktx wiki](/docs/cli-reference/ktx-wiki) |
+| Decide whether to add a `refs` or `sl_refs` entry | The page graph | [Writing Context](/docs/guides/writing-context) |
+| Repair a wiki write rejected for missing references | Keeping the graph live | [Writing Context](/docs/guides/writing-context) |
+| Describe how historic SQL becomes a wiki page | Where the pages come from | [Building Context](/docs/guides/building-context) |
+| Explain raw-source provenance comments | Where the pages come from | [Context as Code](/docs/concepts/context-as-code) |