mirror of
https://github.com/Kaelio/ktx.git
synced 2026-06-07 07:55:13 +02:00
* docs: rewrite Semantic Querying concept with imperative-vs-declarative diagram
Reframe semantic-layer-internals.mdx around the contract the semantic
layer offers an agent: declare what you want (a Semantic Query), KTX
figures out how to compute it. Replaces the old "Context-Aware SQL"
framing with a clear imperative-vs-declarative narrative.
Adds a React Flow component (semantic-layer-flow.tsx) that contrasts a
buggy 4-table agent-authored SQL (chasm trap, LEFT-JOIN-in-WHERE,
hardcoded DATE_TRUNC) against the chasm-safe per-fact CTE SQL the
planner actually emits, including the outer GROUP BY over the requested
dimensions. Both lanes converge into a shared warehouse node and each
SQL card now has parallel bullet notes (failures on the left, KTX
behavior on the right).
Side fixes bundled in:
- include the /ktx basePath in the favicon metadata so the icon resolves
under the production prefix
- migrate docs-site/middleware.ts to docs-site/proxy.ts (Next 16 rename)
- redirect / to /ktx/docs/getting-started/introduction so the apex docs
URL works
- add tests covering the apex redirect, the favicon basePath, and the
middleware-to-proxy rename
- propagate the Semantic Query terminology across the ktx-sl CLI
reference, the context-layer concept page, and the agent-clients /
primary-sources integration pages
* Fix CI dead-code failures
* docs-site: polish semantic-layer-internals code blocks and flow diagram
- Make CodeBlock a server component so children traverse synchronously
under React 19 RSC streaming; previously extractText returned "" in
dev SSR, leaving code blocks empty.
- Add custom JSON/YAML/SQL/code-like tokenizers with theme-aware token
classes; drop the colored file-glyph dot and gradient tab-head.
- Tighten tab-head: subtle grey background, smaller monospace filename
in muted grey, smaller rectangular language pill placed to the left
of the filename.
- Polish the React Flow semantic-layer diagram (controls, fit-view
padding, edge types).
* docs-site: annotate imperative SQL, add section anchor, drop ClickHouse
- Wire numbered red badges to each problematic span in the "Without KTX"
SQL with hover sync between SQL gutter, lines, and the notes list.
- Add #imperative-vs-declarative anchor on the flow section header so
the eyebrow link is shareable; reveals a # glyph on hover/focus.
- Align the compiled-SQL note dots to the first-line midpoint
(mt-[6px] instead of mt-1) so 4px dots sit at y=8 in a 16px line.
- Remove all ClickHouse references from docs-site (primary-sources,
quickstart, ktx-setup, contributing, agents-setup, mechanics test,
warehouse drivers in the flow diagram).
* test: drop ClickHouse contributing-docs assertion
Align the workspace-package mirror test with the ClickHouse removal
from docs-site (75907eb). The connector-clickhouse package still
exists in packages/, but contributing.mdx no longer lists it, so the
test that mirrored docs against the workspace was failing.
340 lines
16 KiB
Text
340 lines
16 KiB
Text
---
|
|
title: Semantic Querying
|
|
description: How KTX compiles a short Semantic Query into safe, dialect-correct SQL using a reviewed join graph.
|
|
---
|
|
|
|
import { SemanticLayerFlow } from "@/components/semantic-layer-flow";
|
|
|
|
KTX's semantic layer is a compiler that turns intent into SQL. The agent
|
|
declares _what_ it wants — measures, dimensions, filters — in a small
|
|
Semantic Query. KTX figures out the _how_: which tables to join, what
|
|
grain to aggregate at, how to keep fan-out from inflating measures, and
|
|
what dialect the warehouse speaks.
|
|
|
|
This page covers four mechanics:
|
|
|
|
- The Semantic Query contract agents send to the compiler.
|
|
- The planner steps that turn a Semantic Query into SQL.
|
|
- The join graph that backs those steps, and how it's built.
|
|
- The fan-out failure mode the compiler is designed to prevent.
|
|
|
|
## Imperative SQL vs declarative Semantic Querying
|
|
|
|
Writing analytics SQL is imperative work. Every question forces the
|
|
agent to hold two things in mind at once: _what_ it wants — a measure, a
|
|
slice, a filter — and _how_ to compute it: which tables to join, which
|
|
key links them, what grain to aggregate at, how to keep one fact from
|
|
inflating another, and what dialect the warehouse speaks. Plumbing on
|
|
top of intent, every query.
|
|
|
|
KTX's semantic layer separates those concerns:
|
|
|
|
- **You and KTX maintain the how.** Sources, joins, grain, measures, and
|
|
segments live in reviewable YAML — the analytical contract the team
|
|
agrees on, version-controlled.
|
|
- **The agent declares the what.** It sends a Semantic Query and trusts
|
|
the compiler to produce safe SQL.
|
|
|
|
The agent stops reasoning about plumbing. It states intent. KTX turns
|
|
that into SQL the warehouse can run.
|
|
|
|
<SemanticLayerFlow />
|
|
|
|
## The Semantic Query contract
|
|
|
|
A Semantic Query is the JSON payload the agent sends. Every field is optional
|
|
except `measures`, and column references are fully qualified
|
|
(`source.column`) so the compiler never has to guess where a name came
|
|
from.
|
|
|
|
Notice what's _not_ in the payload: no `FROM`, no `JOIN`, no `GROUP BY`,
|
|
no `WITH`. The agent states what it wants. KTX picks the join path, the
|
|
grain, the SQL shape, and the dialect.
|
|
|
|
| Field | Purpose |
|
|
|-------|---------|
|
|
| `measures` | Names of pre-defined measures, or inline expressions like `sum(orders.amount)` |
|
|
| `dimensions` | Columns to group by, optionally with a `granularity` for time fields |
|
|
| `filters` | Row-level predicates, classified into `WHERE` or `HAVING` at planning time |
|
|
| `segments` | Named filter sets defined on a source, applied as additional predicates |
|
|
| `order_by` | Sort fields with optional direction |
|
|
| `limit` | Row cap on the result |
|
|
|
|
A typical agent call looks like this:
|
|
|
|
```json
|
|
{
|
|
"measures": ["orders.revenue", "tickets.ticket_count"],
|
|
"dimensions": ["customers.segment"],
|
|
"filters": ["orders.created_at >= '2025-01-01'"],
|
|
"limit": 1000
|
|
}
|
|
```
|
|
|
|
That payload is enough for KTX to plan and compile. The agent never
|
|
authors a join, a CTE, or a dialect-specific cast.
|
|
|
|
## What the planner does
|
|
|
|
The planner is a deterministic pipeline. Each Semantic Query runs through the
|
|
same ordered steps before any SQL is emitted.
|
|
|
|
1. **Resolve refs.** Qualify bare column names, look up pre-defined
|
|
measure expressions, and classify each measure as raw or derived.
|
|
2. **Pick an anchor and build the join tree.** Choose the largest measure
|
|
source as the root, then run a shortest-path search across the typed
|
|
join graph to reach every required source.
|
|
3. **Detect fan-out.** Group measures by their owning source. If more
|
|
than one group exists, the planner marks the query as a chasm trap
|
|
and switches to aggregate-locality compilation.
|
|
4. **Classify filters.** Split predicates into row-level (`WHERE`) and
|
|
aggregate-level (`HAVING`) based on whether they reference a measure.
|
|
5. **Generate SQL.** Emit Postgres-shaped SQL with the right shape:
|
|
single-source aggregation when the query is safe, per-source CTEs
|
|
when fan-out is present.
|
|
6. **Transpile to the target dialect.** Run the result through `sqlglot`
|
|
so the warehouse receives syntax it understands.
|
|
|
|
The output is the SQL string, the resolved plan, and any warnings
|
|
surfaced during planning.
|
|
|
|
## The join graph
|
|
|
|
A semantic source is a node. A declared join is a typed edge. The graph
|
|
is bidirectional: every forward edge has a reverse with the relationship
|
|
inverted, so the planner can traverse from any anchor.
|
|
|
|
| Relationship | Planning impact |
|
|
|--------------|-----------------|
|
|
| `many_to_one` | Safe direction for adding dimensions |
|
|
| `one_to_many` | Multiplies measures and triggers fan-out handling |
|
|
| `one_to_one` | Safe in either direction when keys match |
|
|
| Equal-cost paths | Treated as ambiguous; aliases or explicit joins resolve them |
|
|
|
|
<figure
|
|
className="not-prose my-8 overflow-hidden rounded-lg border border-fd-border bg-fd-card p-4 shadow-sm"
|
|
aria-label="Example semantic join graph"
|
|
>
|
|
<div className="grid gap-3 md:grid-cols-[1fr_1fr_1fr]">
|
|
<div className="rounded-md border border-fd-border bg-fd-background px-4 py-3">
|
|
<p className="text-sm font-semibold text-fd-foreground">{"customers"}</p>
|
|
<p className="mt-1 text-xs text-fd-muted-foreground">{"grain: customer_id"}</p>
|
|
</div>
|
|
<div className="rounded-md border-2 border-fd-primary bg-fd-background px-4 py-3">
|
|
<p className="text-sm font-semibold text-fd-foreground">{"orders"}</p>
|
|
<p className="mt-1 text-xs text-fd-muted-foreground">{"grain: order_id"}</p>
|
|
</div>
|
|
<div className="rounded-md border border-fd-border bg-fd-background px-4 py-3">
|
|
<p className="text-sm font-semibold text-fd-foreground">{"order_items"}</p>
|
|
<p className="mt-1 text-xs text-fd-muted-foreground">{"grain: order_id, line_id"}</p>
|
|
</div>
|
|
</div>
|
|
<div className="my-3 grid gap-2 text-center text-xs font-medium text-fd-muted-foreground md:grid-cols-[1fr_1fr]">
|
|
<div>{"orders -> customers: many_to_one"}</div>
|
|
<div>{"orders -> order_items: one_to_many"}</div>
|
|
</div>
|
|
<figcaption className="mt-4 border-t border-fd-border pt-3 text-left text-xs leading-5 text-fd-muted-foreground">
|
|
<span className="font-medium text-fd-foreground">{"Example: "}</span>
|
|
{"refunds joins to orders. Used carefully, it explains net revenue. Joined naively, it duplicates order-level measures."}
|
|
</figcaption>
|
|
</figure>
|
|
|
|
Edges and grain come from your YAML. The compiler treats them as fact,
|
|
not a guess.
|
|
|
|
```yaml
|
|
# semantic-layer/warehouse/orders.yaml
|
|
name: orders
|
|
table: public.orders
|
|
grain: [order_id]
|
|
joins:
|
|
- to: customers
|
|
on: customer_id = customers.id
|
|
relationship: many_to_one
|
|
- to: order_items
|
|
on: id = order_items.order_id
|
|
relationship: one_to_many
|
|
measures:
|
|
- name: revenue
|
|
expr: sum(case when status != 'refunded' then amount end)
|
|
```
|
|
|
|
## Building and maintaining the graph
|
|
|
|
KTX builds the graph from evidence and accepted edits, not from runtime
|
|
inference. Each input contributes a different kind of authority.
|
|
|
|
| Evidence | What it contributes |
|
|
|----------|---------------------|
|
|
| Declared primary keys | Initial row grain |
|
|
| Declared foreign keys | Formal join candidates |
|
|
| Inferred relationships | Edges when the warehouse lacks constraints |
|
|
| dbt, MetricFlow, and LookML imports | Existing metrics, dimensions, explores, and joins |
|
|
| Query history | Real join and filter patterns from analyst SQL |
|
|
| Analyst review | Final authority before context is merged |
|
|
|
|
<div
|
|
className="not-prose my-8 overflow-hidden rounded-lg border border-fd-border bg-fd-card shadow-sm"
|
|
aria-label="Semantic layer maintenance loop"
|
|
>
|
|
<div className="border-b border-fd-border bg-fd-muted/35 px-4 py-3">
|
|
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
|
|
{"Semantic maintenance loop"}
|
|
</p>
|
|
<p className="mt-1 text-sm leading-6 text-fd-muted-foreground">
|
|
{"Every accepted correction becomes input to the next graph build."}
|
|
</p>
|
|
</div>
|
|
<div className="p-4">
|
|
<div className="-mx-4 overflow-x-auto px-4">
|
|
<div className="relative mx-auto h-[460px] w-[720px] max-w-none md:w-full md:max-w-[760px]">
|
|
<svg
|
|
aria-hidden="true"
|
|
className="absolute inset-0 h-full w-full text-fd-primary"
|
|
fill="none"
|
|
viewBox="0 0 760 460"
|
|
>
|
|
<g
|
|
stroke="currentColor"
|
|
strokeLinecap="round"
|
|
strokeLinejoin="round"
|
|
strokeOpacity="0.68"
|
|
strokeWidth="2.5"
|
|
>
|
|
<path d="M 352 80 H 384" />
|
|
<path d="M 600 80 H 668 V 150" />
|
|
<path d="M 632 284 V 378 H 626" />
|
|
<path d="M 408 378 H 376" />
|
|
<path d="M 160 378 H 96 V 308" />
|
|
<path d="M 128 172 V 80 H 140" />
|
|
</g>
|
|
<g fill="currentColor" fillOpacity="0.96" stroke="none">
|
|
<polygon points="0,0 -14,-7 -14,7" transform="translate(398 80)" />
|
|
<polygon points="0,0 -14,-7 -14,7" transform="translate(668 164) rotate(90)" />
|
|
<polygon points="0,0 -14,-7 -14,7" transform="translate(612 378) rotate(180)" />
|
|
<polygon points="0,0 -14,-7 -14,7" transform="translate(362 378) rotate(180)" />
|
|
<polygon points="0,0 -14,-7 -14,7" transform="translate(96 294) rotate(270)" />
|
|
<polygon points="0,0 -14,-7 -14,7" transform="translate(154 80)" />
|
|
</g>
|
|
</svg>
|
|
|
|
<div className="absolute left-1/2 top-1/2 flex h-32 w-56 -translate-x-1/2 -translate-y-1/2 flex-col items-center justify-center rounded-md border border-fd-primary/50 bg-fd-background px-4 py-4 text-center shadow-sm">
|
|
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-primary">
|
|
{"reviewed context"}
|
|
</p>
|
|
<p className="mt-2 text-sm font-semibold leading-6 text-fd-foreground">
|
|
{"The accepted graph becomes the starting point for the next build."}
|
|
</p>
|
|
</div>
|
|
|
|
<div className="absolute left-[160px] top-6 h-28 w-48 rounded-md border-2 border-fd-primary bg-fd-background px-4 py-3 text-sm shadow-sm">
|
|
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
|
|
{"Step 1"}
|
|
</p>
|
|
<p className="mt-1 font-semibold text-fd-foreground">{"ingest evidence"}</p>
|
|
<p className="mt-2 text-xs leading-5 text-fd-muted-foreground">
|
|
{"scan schemas, imports, and accepted files"}
|
|
</p>
|
|
</div>
|
|
<div className="absolute left-[408px] top-6 h-28 w-48 rounded-md border border-fd-border bg-fd-background px-4 py-3 text-sm shadow-sm">
|
|
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
|
|
{"Step 2"}
|
|
</p>
|
|
<p className="mt-1 font-semibold text-fd-foreground">{"YAML diff"}</p>
|
|
<p className="mt-2 text-xs leading-5 text-fd-muted-foreground">
|
|
{"draft source, join, grain, and measure changes"}
|
|
</p>
|
|
</div>
|
|
<div className="absolute left-[536px] top-[172px] h-28 w-48 rounded-md border border-fd-border bg-fd-background px-4 py-3 text-sm shadow-sm">
|
|
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
|
|
{"Step 3"}
|
|
</p>
|
|
<p className="mt-1 font-semibold text-fd-foreground">{"validation"}</p>
|
|
<p className="mt-2 text-xs leading-5 text-fd-muted-foreground">
|
|
{"check relationships, syntax, and unsafe query shapes"}
|
|
</p>
|
|
</div>
|
|
<div className="absolute left-[408px] top-[322px] h-28 w-48 rounded-md border border-fd-border bg-fd-background px-4 py-3 text-sm shadow-sm">
|
|
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
|
|
{"Step 4"}
|
|
</p>
|
|
<p className="mt-1 font-semibold text-fd-foreground">{"analyst review"}</p>
|
|
<p className="mt-2 text-xs leading-5 text-fd-muted-foreground">
|
|
{"accept, edit, or reject generated context"}
|
|
</p>
|
|
</div>
|
|
<div className="absolute left-[160px] top-[322px] h-28 w-48 rounded-md border border-fd-border bg-fd-background px-4 py-3 text-sm shadow-sm">
|
|
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
|
|
{"Step 5"}
|
|
</p>
|
|
<p className="mt-1 font-semibold text-fd-foreground">{"agent use"}</p>
|
|
<p className="mt-2 text-xs leading-5 text-fd-muted-foreground">
|
|
{"serve context to search, explain, and query"}
|
|
</p>
|
|
</div>
|
|
<div className="absolute left-8 top-[172px] h-28 w-48 rounded-md border border-fd-primary/70 bg-fd-background px-4 py-3 text-sm shadow-sm">
|
|
<p className="text-[11px] font-semibold uppercase tracking-wide text-fd-muted-foreground">
|
|
{"Step 6"}
|
|
</p>
|
|
<p className="mt-1 font-semibold text-fd-foreground">{"corrections"}</p>
|
|
<p className="mt-2 text-xs leading-5 text-fd-muted-foreground">
|
|
{"agent and analyst fixes become new evidence"}
|
|
</p>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
</div>
|
|
|
|
## Fan-out and aggregate locality
|
|
|
|
Fan-out is the classic analytics failure mode. Two fact tables join to a
|
|
shared dimension. A naive query joins them all together first, so each
|
|
row from one fact is multiplied by the matching rows from the other.
|
|
Measures duplicate, numbers go wrong, and the agent doesn't notice.
|
|
|
|
KTX's planner detects the shape by grouping measures by their owning
|
|
source. If more than one source contributes raw measures, the generator
|
|
switches to aggregate locality: each fact is pre-aggregated at its own
|
|
grain inside a CTE, and the CTEs are joined back to the dimension at the
|
|
end.
|
|
|
|
| Naive SQL shape | Semantic-layer SQL shape |
|
|
|-----------------|--------------------------|
|
|
| Join facts and dimensions first, then aggregate | Aggregate each fact at its own grain, then join |
|
|
| Put every filter in one outer `WHERE` clause | Keep measure filters with the measure source |
|
|
| Trust the shortest textual join path | Prefer typed safe paths, reject disconnected sources |
|
|
| Let dimension grain differ across facts | Raise when an asymmetric dimension would fan out another measure |
|
|
|
|
The result is the same analyst answer, computed with the join shape an
|
|
analyst would have written by hand.
|
|
|
|
## Where the context comes from
|
|
|
|
The planner is only as good as the YAML it reads. KTX builds and
|
|
maintains that YAML for you.
|
|
|
|
- `raw-sources/<connection>/` holds scan evidence from your warehouse:
|
|
schemas, columns, keys, samples, and observed usage patterns.
|
|
- `wiki/` holds business language, definitions, and caveats. The
|
|
planner doesn't read wiki at compile time, but the agent does, so
|
|
measure names and dimensions stay anchored to terms the team uses.
|
|
- `semantic-layer/<connection>/` holds the structured sources, joins,
|
|
grain, measures, and segments the planner actually compiles against.
|
|
|
|
Every accepted edit flows back into the next ingest, so the graph stays
|
|
current as the warehouse changes.
|
|
|
|
## Agent usage notes
|
|
|
|
Point an agent at this page when it needs to explain why KTX asks for
|
|
grain, why a query was rejected as unsafe, or why the compiled SQL looks
|
|
different from what the agent first proposed.
|
|
|
|
| Agent task | Relevant section | Next page |
|
|
|------------|------------------|-----------|
|
|
| Explain the Semantic Query shape | The Semantic Query contract | [ktx sl](/docs/cli-reference/ktx-sl) |
|
|
| Describe what the planner does between query and SQL | What the planner does | [ktx sl](/docs/cli-reference/ktx-sl) |
|
|
| Explain why KTX asks for grain and relationship types | The join graph | [Writing context](/docs/guides/writing-context) |
|
|
| Diagnose duplicated measures after a join | Fan-out and aggregate locality | [ktx sl](/docs/cli-reference/ktx-sl) |
|
|
| Describe how semantic context stays current | Building and maintaining the graph | [Context as code](/docs/concepts/context-as-code) |
|