diff --git a/docs-site/content/docs/concepts/semantic-layer-internals.mdx b/docs-site/content/docs/concepts/semantic-layer-internals.mdx
index c48428e6..3e9d7cd7 100644
--- a/docs-site/content/docs/concepts/semantic-layer-internals.mdx
+++ b/docs-site/content/docs/concepts/semantic-layer-internals.mdx
@@ -3,396 +3,174 @@ title: Semantic Layer Internals
 description: How KTX uses join graphs, grain, and relationship metadata to turn context into safe SQL.
 ---
-KTX is a context layer for agents. This page focuses on one internal subsystem:
-the semantic execution layer that turns reviewed context into safe SQL.
+KTX is a context layer for agents. This page focuses on the semantic execution
+subsystem: the part that turns reviewed YAML context into safe SQL.
-The semantic layer is important, but it is not the whole product. KTX also
-handles schema evidence, wiki context, provenance, validation, and agent
-workflows around those files.
+Read it as a pipeline:
-Read the page as a pipeline:
+```text
+context files + warehouse evidence
+        |
+        v
+join graph with grain and relationship metadata
+        |
+        v
+fan-out checks + aggregate-locality planning
+        |
+        v
+canonical SQL -> dialect SQL
+```
-- context inputs feed the semantic engine;
-- evidence becomes a join graph with grain and relationship metadata;
-- review and corrections keep that graph current;
-- the execution engine uses the graph to avoid fan-out and ambiguous joins.
+## Where it fits
-## Where the semantic layer fits
+The semantic layer is not the whole product. It is the engine that makes KTX
+context actionable for SQL generation.
-The semantic layer is not a separate product category inside KTX. It is the
-engine that makes the rest of the context actionable for SQL generation.
+| Input | Used for |
+|-------|----------|
+| `semantic-layer/` | Sources, columns, joins, grain, measures, filters, and segments |
+| `wiki/` | Business definitions, caveats, and metric explanations |
+| `raw-sources/` | Schema scans, imported metadata, keys, and relationship evidence |
+| Provenance | Ingest decisions, review history, and replay context |
-
- {"Context inputs"}
- semantic-layer/ {"source YAML, measures, joins, grain"}
- wiki/ {"business rules, definitions, caveats"}
- raw-sources/ {"schema scans, keys, imported metadata"}
- provenance {"ingest decisions and review history"}
-
+Agents use the result to:
-
+- search semantic sources and wiki pages;
+- compile trusted SQL instead of guessing joins;
+- explain metric meaning and provenance;
+- patch YAML or Markdown and validate the diff.
-
- {"Semantic layer engine"}
- Join graph {"sources as nodes, joins as typed edges"}
- Grain {"row identity before aggregation"}
- Measures {"verified formulas and filters"}
- Relationships {"many_to_one, one_to_many, one_to_one"}
- {"Safe query planning before SQL is generated."}
+## Join graph
-
+A semantic source is a node. A join is a typed edge with a condition and a
+relationship. The graph lets KTX choose valid paths and detect row-multiplying
+paths before SQL is generated.
-
- {"Agent workflows"}
- {"Search sources and wiki pages"}
- {"Compile trusted SQL"}
- {"Explain metrics and provenance"}
- {"Patch files and validate review"}
+```text
+customers <- many_to_one <- orders -> one_to_many -> order_items
+grain: customer_id          grain: order_id          grain: order_id, line_id
+```
-## The join graph KTX builds
-
-A semantic source is a node. A join is an edge with a join condition and a
-relationship type. The graph lets KTX choose valid paths, reject unsafe paths,
-and reason about whether a join preserves or multiplies rows before SQL is
-generated.
-
-- `many_to_one` paths are usually safe for adding dimensions.
-- `one_to_many` paths can multiply fact rows and trigger fan-out handling.
-- Equal-cost paths can be ambiguous, so aliases and explicit joins matter.
-
- customers (grain: customer_id)
- orders (grain: order_id)
- order_items (grain: order_id, line_id)
-orders -> customers: many_to_one
-orders -> order_items: one_to_many
- {"Example: "} {"refunds joins to orders. Used carefully, it explains net revenue. Joined naively, it can duplicate order-level measures."}
+| Relationship | What it means | Planning impact |
+|--------------|---------------|-----------------|
+| `many_to_one` | Many fact rows point to one dimension row | Usually safe for adding dimensions |
+| `one_to_many` | One row expands into many child rows | Can multiply measures and trigger fan-out handling |
+| `one_to_one` | Both sides preserve row identity | Usually safe when keys are correct |
+| Ambiguous path | Multiple equal-cost paths connect sources | Requires aliases or a safer explicit path |
 The graph is bidirectional for planning. If `orders -> customers` is
 `many_to_one`, the reverse path is `one_to_many`; KTX keeps that distinction
-instead of treating every join as a neutral edge.
+instead of treating every join as neutral.
 ## How KTX builds the graph
-KTX starts from evidence, not a blank modeling canvas. Database scans and
-analytics-tool imports create source definitions that an analyst can review.
+KTX starts from evidence, then writes reviewable source YAML. The accepted graph
+is the plain-file diff your team approves.
 | Evidence | What it contributes |
-|---|---|
+|----------|---------------------|
 | Declared primary keys | Initial row grain for each source |
 | Declared foreign keys | Formal join candidates and relationship direction |
 | Inferred relationships | Useful edges when warehouses lack constraints |
 | dbt, MetricFlow, and LookML imports | Existing metrics, dimensions, entities, explores, and joins |
-| Query history | Real join and filter patterns agents should respect |
-| Analyst review | The final authority before context is merged |
+| Query history | Join and filter patterns agents should respect |
+| Analyst review | Final authority before context is merged |
-Generated YAML is intentionally reviewable. KTX can draft joins and measures,
-but the accepted semantic layer is still the plain-file diff your team approves.
+## Maintenance loop
-## How KTX keeps the graph current
+Semantic correctness changes when schemas, metrics, and business definitions
+change. KTX keeps that loop explicit.
-The semantic layer changes as schemas, metrics, and business rules change. KTX
-keeps that loop explicit instead of hiding it behind a remote runtime.
+```text
+ingest evidence
+        |
+        v
+draft YAML diff
+        |
+        v
+validate relationships and query shapes
+        |
+        v
+analyst review
+        |
+        v
+agent use
+        |
+        v
+corrections become new evidence
+```
-
- {"Semantic maintenance loop"}
- {"Every accepted correction becomes input to the next graph build."}
-
+This matters when a source gains a key, a metric changes definition, or an
+analyst corrects a relationship. The next agent starts from the reviewed
+context, not a hidden runtime state.
-
- {"reviewed context"}
- {"The accepted graph becomes the starting point for the next build."}
-
+## Modeling problems
-
- {"Step 1"} {"ingest evidence"} {"scan schemas, imports, and accepted files"}
- {"Step 2"} {"YAML diff"} {"draft source, join, grain, and measure changes"}
- {"Step 3"} {"validation"} {"check relationships, syntax, and unsafe query shapes"}
- {"Step 4"} {"analyst review"} {"accept, edit, or reject generated context"}
- {"Step 5"} {"agent use"} {"serve context to search, explain, and query"}
- {"Step 6"} {"corrections"} {"agent and analyst fixes become new evidence"}
-
+Fan-out is the classic failure mode: an order-level measure joins to line-item
+rows before aggregation, so one order becomes many rows and revenue is counted
+more than once.
-This matters because semantic correctness is not static. If a source gains a
-new key, a metric changes definition, or an analyst corrects a relationship,
-the next agent gets that reviewed context.
-
-## The modeling problem the graph solves
-
-Fan-out is the classic failure mode. If an order-level measure is joined to
-line-item rows before aggregation, one order can become many rows and revenue
-can be counted more than once.
-
-| Problem | What happens | How KTX avoids it |
-|---|---|---|
+| Problem | What happens | How KTX handles it |
+|---------|--------------|--------------------|
 | Order measure joins to `order_items` | `orders.revenue` repeats once per item | Detect the `one_to_many` path and pre-aggregate the order measure |
-| Two independent fact sources share `customers` | Measures from each fact table multiply across the shared dimension | Treat it as a chasm trap and use aggregate-locality planning |
-| Filter lives only across a `one_to_many` path | Filtering after the join changes the measure grain | Reject or localize the filter instead of silently producing unsafe SQL |
-| Multiple equal-cost paths connect the same sources | The join path is ambiguous | Prefer safer paths and use aliases to disambiguate repeated joins |
+| Two fact sources share `customers` | Measures multiply across a shared dimension | Treat it as a chasm trap and plan each fact locally |
+| Filter crosses a `one_to_many` path | Filtering after the join changes measure grain | Reject or localize the filter |
+| Equal-cost paths connect the same sources | Join choice is ambiguous | Prefer safer paths or require aliases |
-Many-to-many questions usually show up as multiple one-to-many paths or
+Many-to-many questions usually appear as multiple `one_to_many` paths or
 independent fact sources. KTX treats those shapes as fan-out or chasm risks
 unless the query can be planned at a safe grain.
-## How the execution engine uses the graph
+## Execution planning
-The planner resolves the sources in a semantic query, chooses a join tree, and
-checks whether any requested dimension or filter crosses a row-multiplying
-edge. The SQL generator then chooses the simple path or the aggregate-locality
-path.
+The planner resolves sources, chooses a join tree, checks relationship paths,
+and decides whether the query can use a simple shape or needs aggregate
+locality.
 | Naive SQL shape | Semantic-layer SQL shape |
-|---|---|
-| Join facts and dimensions first, then aggregate | Aggregate each fact source at its own grain, then join the results |
+|-----------------|--------------------------|
+| Join facts and dimensions first, then aggregate | Aggregate each fact source at its own grain, then join results |
 | Put every filter in one outer `WHERE` clause | Keep measure filters with the measure source when locality is needed |
 | Trust the shortest textual join path | Prefer safe relationship paths and reject disconnected sources |
 | Let dimension grain differ across facts | Raise when asymmetric dimensions would fan out another measure |
-
- {"Unsafe shape"}
-{`orders
-  join order_items
-  join customers
-group by customer_segment
-sum(orders.amount)`}
- {"The order measure is exposed to line-item fan-out before aggregation."}
- {"KTX shape"}
-{`orders_agg as (
-  select customer_id, sum(amount) revenue
+Unsafe shape:
+
+```sql
+select customers.segment, sum(orders.amount)
+from orders
+join order_items on order_items.order_id = orders.id
+join customers on customers.id = orders.customer_id
+group by customers.segment;
+```
+
+KTX shape:
+
+```sql
+with orders_agg as (
+  select customer_id, sum(amount) as revenue
   from orders
   group by customer_id
 )
-select customers.segment, sum(revenue)
+select customers.segment, sum(orders_agg.revenue)
 from orders_agg
-join customers`}
- {"KTX pre-aggregates fact measures at their own grain before joining dimensions."}
+join customers on customers.id = orders_agg.customer_id
+group by customers.segment;
+```
-The result is not magic. It is structured planning: validated sources, typed
-relationships, graph search, fan-out detection, aggregate locality, and final
-dialect transpilation.
+The result is structured planning: validated sources, typed relationships,
+graph search, fan-out detection, aggregate locality, and final dialect
+transpilation.
-## What this means for agents
+## Agent usage notes
-KTX gives agents a semantic surface they can inspect and improve, not just a
-folder of notes.
+Use this page when an agent needs to explain how KTX turns reviewed semantic
+context into SQL, why relationship metadata matters, or why a query was rejected
+as unsafe.
-- Search semantic sources and related wiki pages before writing SQL.
-- Compile SQL through `ktx sl query` instead of guessing joins.
-- Validate semantic-layer changes before review.
-- Patch YAML and Markdown files in git.
-- Explain metric meaning and provenance from the same accepted context.
-
-Next, read [Writing Context](/docs/guides/writing-context) for the YAML editing
-workflow or [ktx sl](/docs/cli-reference/ktx-sl) for the command reference.
+| Agent task | Relevant section | Next page |
+|------------|------------------|-----------|
+| Explain why KTX asks for `grain` and relationship types | Join graph | [Writing Context](/docs/guides/writing-context) |
+| Diagnose duplicated measures after a join | Modeling problems | [ktx sl](/docs/cli-reference/ktx-sl) |
+| Explain safe SQL generation | Execution planning | [ktx sl](/docs/cli-reference/ktx-sl) |
+| Describe how semantic context stays current | Maintenance loop | [Context as Code](/docs/concepts/context-as-code) |
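
The "Unsafe shape" and "KTX shape" queries added in this diff can be compared on a tiny in-memory dataset. The sketch below is illustrative only: it uses Python's built-in `sqlite3` rather than KTX, and the sample rows (two customers, two orders, four line items) are hypothetical. It shows the fan-out effect the page describes, not how KTX plans or generates SQL.

```python
import sqlite3

# Hypothetical sample data; table and column names mirror the SQL in the diff.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers (id integer primary key, segment text);
create table orders (id integer primary key, customer_id integer, amount real);
create table order_items (order_id integer, line_id integer);
insert into customers values (1, 'enterprise'), (2, 'smb');
insert into orders values (10, 1, 100.0), (11, 2, 50.0);
-- order 10 has three line items, order 11 has one
insert into order_items values (10, 1), (10, 2), (10, 3), (11, 1);
""")

# Unsafe shape: the order-level measure is joined to line items before aggregation.
unsafe_sql = """
select customers.segment, sum(orders.amount)
from orders
join order_items on order_items.order_id = orders.id
join customers on customers.id = orders.customer_id
group by customers.segment
"""

# Pre-aggregated shape: orders are summed at their own grain, then joined to the dimension.
safe_sql = """
with orders_agg as (
  select customer_id, sum(amount) as revenue
  from orders
  group by customer_id
)
select customers.segment, sum(orders_agg.revenue)
from orders_agg
join customers on customers.id = orders_agg.customer_id
group by customers.segment
"""

unsafe = dict(conn.execute(unsafe_sql).fetchall())
safe = dict(conn.execute(safe_sql).fetchall())

print("unsafe:", unsafe)
print("safe:  ", safe)
```

With this data the unsafe query reports 300.0 for `enterprise`, because order 10 is repeated once per line item, while the pre-aggregated query reports the true 100.0. Both agree on `smb`, which has a single line item.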