From 78a2a643bcaadb5d5171c151202f9167122af143 Mon Sep 17 00:00:00 2001
From: Luca Martial
Date: Sat, 16 May 2026 09:18:10 -0700
Subject: [PATCH] docs: restore semantic internals diagrams

---
 .../concepts/semantic-layer-internals.mdx | 417 +++++++++++++-----
 1 file changed, 307 insertions(+), 110 deletions(-)

diff --git a/docs-site/content/docs/concepts/semantic-layer-internals.mdx b/docs-site/content/docs/concepts/semantic-layer-internals.mdx
index 3e9d7cd7..a6d4b640 100644
--- a/docs-site/content/docs/concepts/semantic-layer-internals.mdx
+++ b/docs-site/content/docs/concepts/semantic-layer-internals.mdx
@@ -4,128 +4,309 @@ description: How KTX uses join graphs, grain, and relationship metadata to turn
 ---

 KTX is a context layer for agents. This page focuses on the semantic execution
-subsystem: the part that turns reviewed YAML context into safe SQL.
+layer: the subsystem that turns reviewed context into safe SQL.

-Read it as a pipeline:
+Read it as four mechanics:

-```text
-context files + warehouse evidence
-    |
-    v
-join graph with grain and relationship metadata
-    |
-    v
-fan-out checks + aggregate-locality planning
-    |
-    v
-canonical SQL -> dialect SQL
-```
+- context files feed the semantic engine;
+- evidence becomes a join graph with grain and relationship metadata;
+- review keeps the graph current;
+- query planning avoids fan-out and ambiguous joins.

-## Where it fits
+## Where the semantic layer fits

-The semantic layer is not the whole product. It is the engine that makes KTX
-context actionable for SQL generation.
+The semantic layer is the engine that makes KTX context actionable for SQL
+generation. It uses source YAML, wiki context, scan evidence, and provenance.
-| Input | Used for |
-|-------|----------|
-| `semantic-layer/` | Sources, columns, joins, grain, measures, filters, and segments |
-| `wiki/` | Business definitions, caveats, and metric explanations |
-| `raw-sources/` | Schema scans, imported metadata, keys, and relationship evidence |
-| Provenance | Ingest decisions, review history, and replay context |
+[Diagram: Context inputs, Semantic layer engine, Agent workflows]
+Context inputs:
+- semantic-layer/: source YAML, measures, joins, grain
+- wiki/: business rules, definitions, caveats
+- raw-sources/: schema scans, keys, imported metadata
+- provenance: ingest decisions and review history

-Agents use the result to:

-- search semantic sources and wiki pages;
-- compile trusted SQL instead of guessing joins;
-- explain metric meaning and provenance;
-- patch YAML or Markdown and validate the diff.
+Semantic layer engine:
+- Join graph: sources as nodes, joins as typed edges
+- Grain: row identity before aggregation
+- Measures: verified formulas and filters
+- Relationships: many_to_one, one_to_many, one_to_one
+- Safe query planning before SQL is generated.
+Agent workflows:
+- Search sources and wiki pages
+- Compile trusted SQL
+- Explain metrics and provenance
+- Patch files and validate review

 ## Join graph

-A semantic source is a node. A join is a typed edge with a condition and a
-relationship. The graph lets KTX choose valid paths and detect row-multiplying
-paths before SQL is generated.
+A semantic source is a node. A join is a typed edge. KTX uses the graph to
+choose valid paths and detect row-multiplying joins before SQL is generated.

-```text
-customers <- many_to_one <- orders -> one_to_many -> order_items
-grain: customer_id         grain: order_id          grain: order_id, line_id
-```
+| Relationship | Planning impact |
+|--------------|-----------------|
+| `many_to_one` | Usually safe for adding dimensions |
+| `one_to_many` | Can multiply measures and trigger fan-out handling |
+| `one_to_one` | Usually safe when keys are correct |
+| Equal-cost paths | Ambiguous unless aliases or explicit joins disambiguate |

-| Relationship | What it means | Planning impact |
-|--------------|---------------|-----------------|
-| `many_to_one` | Many fact rows point to one dimension row | Usually safe for adding dimensions |
-| `one_to_many` | One row expands into many child rows | Can multiply measures and trigger fan-out handling |
-| `one_to_one` | Both sides preserve row identity | Usually safe when keys are correct |
-| Ambiguous path | Multiple equal-cost paths connect sources | Requires aliases or a safer explicit path |
+[Diagram: join graph example]
+customers (grain: customer_id)
+orders (grain: order_id)
+order_items (grain: order_id, line_id)
+orders -> customers: many_to_one
+orders -> order_items: one_to_many
+Example: `refunds` joins to `orders`. Used carefully, it explains net revenue. Joined naively, it can duplicate order-level measures.

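The join graph described in this section can be sketched in a few lines of Python. This is an illustrative sketch, not KTX's implementation: the `JoinGraph` class, its method names, and the dict-based storage are assumptions made for the example. Only the source names, grains, and relationship types come from the page. The search returns the first shortest path and does not try to detect equal-cost ambiguity.

```python
from collections import deque

# Hypothetical sketch: class and method names are illustrative, not KTX's API.
REVERSE = {
    "many_to_one": "one_to_many",
    "one_to_many": "many_to_one",
    "one_to_one": "one_to_one",
}


class JoinGraph:
    def __init__(self):
        self.grain = {}  # source name -> tuple of grain columns
        self.edges = {}  # source name -> {neighbor: relationship}

    def add_source(self, name, grain):
        self.grain[name] = tuple(grain)
        self.edges.setdefault(name, {})

    def add_join(self, left, right, relationship):
        # Store both directions so planning can traverse either way.
        self.edges[left][right] = relationship
        self.edges[right][left] = REVERSE[relationship]

    def path(self, start, goal):
        """Breadth-first search for one shortest join path."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            nodes = queue.popleft()
            if nodes[-1] == goal:
                return nodes
            for nxt in self.edges[nodes[-1]]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nodes + [nxt])
        return None

    def fans_out(self, nodes):
        """True if any hop on the path can multiply rows."""
        return any(
            self.edges[a][b] == "one_to_many"
            for a, b in zip(nodes, nodes[1:])
        )


graph = JoinGraph()
graph.add_source("customers", ["customer_id"])
graph.add_source("orders", ["order_id"])
graph.add_source("order_items", ["order_id", "line_id"])
graph.add_join("orders", "customers", "many_to_one")
graph.add_join("orders", "order_items", "one_to_many")

to_customers = graph.path("orders", "customers")
to_items = graph.path("orders", "order_items")
print(to_customers, graph.fans_out(to_customers))
print(to_items, graph.fans_out(to_items))
```

Storing the reverse relationship on every join is what lets a path from `customers` to `orders` be flagged as row-multiplying even though the join was declared in the other direction.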
 The graph is bidirectional for planning. If `orders -> customers` is
-`many_to_one`, the reverse path is `one_to_many`; KTX keeps that distinction
-instead of treating every join as neutral.
+`many_to_one`, the reverse path is `one_to_many`.

-## How KTX builds the graph
+## Building and maintaining the graph

-KTX starts from evidence, then writes reviewable source YAML. The accepted graph
-is the plain-file diff your team approves.
+KTX starts from evidence, writes reviewable source YAML, and treats the merged
+diff as the accepted graph.

 | Evidence | What it contributes |
 |----------|---------------------|
-| Declared primary keys | Initial row grain for each source |
-| Declared foreign keys | Formal join candidates and relationship direction |
-| Inferred relationships | Useful edges when warehouses lack constraints |
-| dbt, MetricFlow, and LookML imports | Existing metrics, dimensions, entities, explores, and joins |
-| Query history | Join and filter patterns agents should respect |
+| Declared primary keys | Initial row grain |
+| Declared foreign keys | Formal join candidates |
+| Inferred relationships | Edges when warehouses lack constraints |
+| dbt, MetricFlow, and LookML imports | Existing metrics, dimensions, explores, and joins |
+| Query history | Real join and filter patterns |
 | Analyst review | Final authority before context is merged |

-## Maintenance loop
+[Diagram: Semantic maintenance loop]
+Every accepted correction becomes input to the next graph build.

-Semantic correctness changes when schemas, metrics, and business definitions
-change. KTX keeps that loop explicit.
+reviewed context: The accepted graph becomes the starting point for the next build.

-```text
-ingest evidence
-    |
-    v
-draft YAML diff
-    |
-    v
-validate relationships and query shapes
-    |
-    v
-analyst review
-    |
-    v
-agent use
-    |
-    v
-corrections become new evidence
-```
-
-This matters when a source gains a key, a metric changes definition, or an
-analyst corrects a relationship. The next agent starts from the reviewed
-context, not a hidden runtime state.
+Step 1, ingest evidence: scan schemas, imports, and accepted files
+Step 2, YAML diff: draft source, join, grain, and measure changes
+Step 3, validation: check relationships, syntax, and unsafe query shapes
+Step 4, analyst review: accept, edit, or reject generated context
+Step 5, agent use: serve context to search, explain, and query
+Step 6, corrections: agent and analyst fixes become new evidence

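One check that could plausibly run in the validation step is confirming that a declared grain really identifies rows. The sketch below is an assumption about what such a check might look like, not a description of KTX's validator: the `grain_is_unique` helper is hypothetical, the table contents are invented, and SQLite stands in for the warehouse.

```python
import sqlite3

# Hypothetical validation helper; KTX's real checks may differ.
def grain_is_unique(conn, table, grain_cols):
    """True if no two rows share the declared grain columns.

    Table and column names are interpolated directly, so this sketch
    assumes they come from trusted, reviewed context files.
    """
    cols = ", ".join(grain_cols)
    query = (
        f"select count(*) from ("
        f"select {cols} from {table} group by {cols} having count(*) > 1)"
    )
    return conn.execute(query).fetchone()[0] == 0


conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table order_items (order_id integer, line_id integer, sku text);
    insert into order_items values (1, 1, 'a'), (1, 2, 'b'), (2, 1, 'a');
""")

# The composite grain holds; order_id alone does not identify a row.
print(grain_is_unique(conn, "order_items", ["order_id", "line_id"]))
print(grain_is_unique(conn, "order_items", ["order_id"]))
```

A failed check of this kind would be a reason to send the draft YAML back for correction before analyst review.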
 ## Modeling problems

 Fan-out is the classic failure mode: an order-level measure joins to line-item
-rows before aggregation, so one order becomes many rows and revenue is counted
-more than once.
+rows before aggregation, so one order becomes many rows.

 | Problem | What happens | How KTX handles it |
 |---------|--------------|--------------------|
-| Order measure joins to `order_items` | `orders.revenue` repeats once per item | Detect the `one_to_many` path and pre-aggregate the order measure |
-| Two fact sources share `customers` | Measures multiply across a shared dimension | Treat it as a chasm trap and plan each fact locally |
-| Filter crosses a `one_to_many` path | Filtering after the join changes measure grain | Reject or localize the filter |
-| Equal-cost paths connect the same sources | Join choice is ambiguous | Prefer safer paths or require aliases |
-
-Many-to-many questions usually appear as multiple `one_to_many` paths or
-independent fact sources. KTX treats those shapes as fan-out or chasm risks
-unless the query can be planned at a safe grain.
+| Order measure joins to `order_items` | `orders.revenue` repeats once per item | Detect `one_to_many` and pre-aggregate |
+| Two fact sources share `customers` | Measures multiply across the shared dimension | Treat as a chasm trap and plan each fact locally |
+| Filter crosses `one_to_many` | Filtering changes measure grain | Reject or localize the filter |
+| Equal-cost paths connect sources | Join choice is ambiguous | Prefer safer paths or require aliases |

 ## Execution planning

 The planner resolves sources, chooses a join tree, checks relationship paths,
-and decides whether the query can use a simple shape or needs aggregate
-locality.
+and picks a simple or aggregate-locality SQL shape.

 | Naive SQL shape | Semantic-layer SQL shape |
 |-----------------|--------------------------|
@@ -134,33 +315,49 @@ locality.
 | Trust the shortest textual join path | Prefer safe relationship paths and reject disconnected sources |
 | Let dimension grain differ across facts | Raise when asymmetric dimensions would fan out another measure |

-Unsafe shape:
-
-```sql
-select customers.segment, sum(orders.amount)
-from orders
-join order_items on order_items.order_id = orders.id
-join customers on customers.id = orders.customer_id
-group by customers.segment;
-```
-
-KTX shape:
-
-```sql
-with orders_agg as (
-  select customer_id, sum(amount) as revenue
+[Diagram: Unsafe shape and KTX shape]
+Unsafe shape:
+{`orders
+  join order_items
+  join customers
+group by customer_segment
+sum(orders.amount)`}
+The order measure is exposed to line-item fan-out before aggregation.
+KTX shape:
+{`orders_agg as (
+  select customer_id, sum(amount) revenue
   from orders
   group by customer_id
 )
-select customers.segment, sum(orders_agg.revenue)
+select customers.segment, sum(revenue)
 from orders_agg
-join customers on customers.id = orders_agg.customer_id
-group by customers.segment;
-```
+join customers`}
+KTX pre-aggregates fact measures at their own grain before joining dimensions.

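The difference between the two shapes is easy to reproduce on toy data. The sketch below runs both queries against an in-memory SQLite database. The SQL follows the full statements from the previous version of the page; the table contents are invented for illustration, and SQLite is an assumption standing in for whatever dialect KTX targets.

```python
import sqlite3

# Toy data: table contents are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (id integer primary key, segment text);
    create table orders (id integer primary key, customer_id integer, amount real);
    create table order_items (order_id integer, line_id integer);

    insert into customers values (1, 'smb'), (2, 'enterprise');
    insert into orders values (10, 1, 100.0), (11, 2, 50.0);
    -- Order 10 has three line items, order 11 has one.
    insert into order_items values (10, 1), (10, 2), (10, 3), (11, 1);
""")

# Unsafe shape: the order measure is summed after the one_to_many join,
# so each order amount repeats once per line item.
unsafe = conn.execute("""
    select customers.segment, sum(orders.amount)
    from orders
    join order_items on order_items.order_id = orders.id
    join customers on customers.id = orders.customer_id
    group by customers.segment
    order by customers.segment
""").fetchall()

# Pre-aggregated shape: orders are summed at their own grain first,
# then joined to the dimension.
safe = conn.execute("""
    with orders_agg as (
        select customer_id, sum(amount) as revenue
        from orders
        group by customer_id
    )
    select customers.segment, sum(orders_agg.revenue)
    from orders_agg
    join customers on customers.id = orders_agg.customer_id
    group by customers.segment
    order by customers.segment
""").fetchall()

print("unsafe:", unsafe)
print("safe:  ", safe)
```

In this data the `smb` segment has a single order of 100.0 with three line items, so the unsafe query reports 300.0 for that segment while the pre-aggregated query reports 100.0.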
 The result is structured planning: validated sources, typed relationships,
-graph search, fan-out detection, aggregate locality, and final dialect
-transpilation.
+graph search, fan-out detection, aggregate locality, and dialect transpilation.

 ## Agent usage notes

@@ -173,4 +370,4 @@ as unsafe.
 | Explain why KTX asks for `grain` and relationship types | Join graph | [Writing Context](/docs/guides/writing-context) |
 | Diagnose duplicated measures after a join | Modeling problems | [ktx sl](/docs/cli-reference/ktx-sl) |
 | Explain safe SQL generation | Execution planning | [ktx sl](/docs/cli-reference/ktx-sl) |
-| Describe how semantic context stays current | Maintenance loop | [Context as Code](/docs/concepts/context-as-code) |
+| Describe how semantic context stays current | Building and maintaining the graph | [Context as Code](/docs/concepts/context-as-code) |