2026-05-10 23:12:26 +02:00
---
name: sl
2026-05-14 12:43:14 -04:00
description: KTX's semantic layer - a structured catalog of sources (tables/views), measures, joins, and segments expressed as YAML. Covers the schema and how to query it via `sl_query` . Use when the task involves querying pre-defined metrics (ARR, churn, retention, LTV, MAU) or reading SL source YAML to understand the catalog. Capture is handled by the `sl_capture` skill (memory-agent only).
2026-05-10 23:12:26 +02:00
---
# Semantic Layer
2026-05-10 23:51:24 +02:00
KTX's semantic layer (SL) is a structured catalog. Each **source** represents a table, a SQL view, or an overlay that enriches a manifest-backed table with measures, computed columns, joins, and named segments. The catalog is the single source of truth for reusable business metrics.
2026-05-10 23:12:26 +02:00
This skill covers two parts:
2026-05-14 12:43:14 -04:00
- **Part 1** - Schema reference (what an SL source looks like).
- **Part 2** - Querying via `sl_query` .
2026-05-10 23:12:26 +02:00
2026-05-14 12:43:14 -04:00
Capture (when and how to add new patterns to the SL) is a separate concern handled by the memory-agent - see the `sl_capture` skill if you are running in capture mode. The research agent **reads** and **queries** the SL via the tools described here; it does not write to it.
2026-05-10 23:12:26 +02:00
2026-05-13 13:43:23 +02:00
For capture-time identifier verification, load `sl_capture` . Synthesis writer
skills must verify warehouse identifiers with `discover_data` ,
`entity_details` , and `sql_execution` before emitting table or column names.
2026-05-10 23:12:26 +02:00
---
2026-05-14 12:43:14 -04:00
## Part 1 - Schema reference
2026-05-10 23:12:26 +02:00
An SL source is a YAML file at `semantic-layer/<connectionId>/<source_name>.yaml` . There are three flavors:
### Overlay sources
Enrich a manifest-backed table with measures, computed columns, joins, and segments. No `table` or `sql` field. The base table's columns and grain are inherited from the manifest.
```yaml
name: fct_orders # must match an existing manifest table
2026-05-13 15:55:00 +02:00
descriptions:
user: "Overlay adding business measures to the orders fact table."
2026-05-10 23:12:26 +02:00
measures:
- name: total_revenue
expr: sum(amount)
2026-05-14 12:43:14 -04:00
description: Total order revenue - filter by status or region at query time
2026-05-10 23:12:26 +02:00
columns: # computed dimensions only
- name: is_large_order
type: boolean
expr: "amount > 1000"
fix(context): merge overlay columns onto manifest columns by name (#94)
* fix(context): merge overlay columns onto manifest columns by name
composeOverlay was appending overlay columns to the manifest column list,
producing duplicate entries when dbt/metabase overlays declared a column
just to attach descriptions. The duplicates carried no `type`, so the
pydantic SourceDefinition rejected them at semantic-query time and broke
`ktx sl query` for every overlay-backed measure. Now overlay columns
match base columns by name (case-insensitive): same-name entries merge
onto the manifest (overlay fields win, type/role fall back to the base,
descriptions merge per source key) and only new names append.
* refactor(sl): split overlay columns from column_overrides and enforce TS/Python wire contract
Overlay sources now have two distinct collections: `columns:` for computed
columns (requiring `expr` + `type`) and `column_overrides:` for metadata
patches to inherited manifest columns. Composing or loading an overlay that
mixes the two — or references an unknown column — fails with a typed error.
Introduce `ResolvedSemanticLayerSource` / `resolvedSourceSchema` /
`toResolvedWire` as the strict shape sent to the Python engine, and add a
schema contract test that diffs Zod against the Pydantic JSON schema dumped
by `python -m semantic_layer dump-schema`. `SourceDefinition` is now
`extra="forbid"` on the Python side.
`loadAllSources` surfaces per-file load errors instead of swallowing them,
so validation/query paths can report manifest shard parse failures.
* fix(context): make scan description generation resilient and quiet
A transient sampleTable failure during ingest used to take out every
table in a connection: generateTableDescription returned a hardcoded
'Table not found' string into descriptions.ai, and KtxDescriptionGenerator
was constructed without a logger, so the failure left no trail anywhere.
- sampleTable / sampleColumn calls retry 3x with 200/400/800ms backoff,
honouring KtxScanContext.signal via a new KtxAbortedError.
- On retry exhaustion or missing capability, table generation falls back
to a metadata-only prompt built from column name / native type / comment
/ rawDescriptions. The column path follows the same rule -- call the
LLM when any of samples or rawDescriptions are available; skip only
when both are absent.
- Logger is now threaded from KtxScanContext into the generator. Failures
emit structured KtxScanWarning entries (new description_fallback_used
code, plus existing sampling_failed / enrichment_failed /
connector_capability_missing). ktx scan groups warnings by code so a
batch of identical failures collapses to one summary line plus sample.
- Returns null on failure instead of the 'Table not found' sentinel; the
manifest writer's existing guard already skips empty descriptions, so
schema YAML no longer carries misleading text. SCAN_MANAGED_DESCRIPTION_KEYS
already strips stale 'ai' on merge, so existing YAML clears on next run.
Also suppress AI SDK v6 'system in messages' warning: pull system messages
out of KtxMessageBuilder.wrapSimple's output via a new splitKtxSystemMessages
helper and pass them top-level to generateText (preserves cacheControl
providerOptions on the SystemModelMessage). Agent-runner's local
splitSystemPromptMessages dedupes onto the shared helper.
* test(docs): align examples-docs assertions with revamped docs
PR #103 (setup/guide doc revamp) reworded several CLI examples and
connection labels; the assertions in scripts/examples-docs.test.mjs
still referenced the pre-revamp wording and were failing in CI on main.
Update the regexes to match the post-revamp content:
- drop the `--json` flag from the sl-query example expectation
- move the `Driver:` / `Status: ok` probe to the connection reference,
which is where that output now lives (driver id is lowercase
`postgres`, not the display name `PostgreSQL`)
- drop the obsolete `Install \`uv\`...` troubleshooting line
- accept `<connectionId>` everywhere; the docs no longer use the
hyphenated `<connection-id>` form
- match the `warehouse` connection id used in the quickstart instead of
the `postgres-warehouse` id only used in the README and setup ref
* fix(sl): skip TS/Python schema contract test when uv is unavailable
The TypeScript checks CI job does not install uv or Python, so the
module-level `execFileSync('uv', ...)` in schemas.contract.test.ts threw
ENOENT and failed the suite. Wrap the schema dump in a try/catch and
guard the describe block with `describe.skipIf` so the test skips in
environments without uv. Local dev and any CI job that has uv on PATH
still runs the cross-language contract assertion.
2026-05-15 02:11:04 +02:00
column_overrides: # metadata patches for inherited columns
- name: status
descriptions:
user: "Order lifecycle status."
2026-05-10 23:12:26 +02:00
segments:
- name: paid_non_refunded
expr: "is_paid = true AND is_refunded = false"
joins:
- to: customers
on: "customer_id = customers.id"
relationship: many_to_one
```
Rules:
2026-05-14 12:43:14 -04:00
- Do **not** repeat base-table columns, grain, `table` , or `source_type` in an overlay - those are inherited.
2026-05-10 23:12:26 +02:00
- Overlay columns MUST be computed (`expr` + `type` ).
fix(context): merge overlay columns onto manifest columns by name (#94)
* fix(context): merge overlay columns onto manifest columns by name
composeOverlay was appending overlay columns to the manifest column list,
producing duplicate entries when dbt/metabase overlays declared a column
just to attach descriptions. The duplicates carried no `type`, so the
pydantic SourceDefinition rejected them at semantic-query time and broke
`ktx sl query` for every overlay-backed measure. Now overlay columns
match base columns by name (case-insensitive): same-name entries merge
onto the manifest (overlay fields win, type/role fall back to the base,
descriptions merge per source key) and only new names append.
* refactor(sl): split overlay columns from column_overrides and enforce TS/Python wire contract
Overlay sources now have two distinct collections: `columns:` for computed
columns (requiring `expr` + `type`) and `column_overrides:` for metadata
patches to inherited manifest columns. Composing or loading an overlay that
mixes the two — or references an unknown column — fails with a typed error.
Introduce `ResolvedSemanticLayerSource` / `resolvedSourceSchema` /
`toResolvedWire` as the strict shape sent to the Python engine, and add a
schema contract test that diffs Zod against the Pydantic JSON schema dumped
by `python -m semantic_layer dump-schema`. `SourceDefinition` is now
`extra="forbid"` on the Python side.
`loadAllSources` surfaces per-file load errors instead of swallowing them,
so validation/query paths can report manifest shard parse failures.
* fix(context): make scan description generation resilient and quiet
A transient sampleTable failure during ingest used to take out every
table in a connection: generateTableDescription returned a hardcoded
'Table not found' string into descriptions.ai, and KtxDescriptionGenerator
was constructed without a logger, so the failure left no trail anywhere.
- sampleTable / sampleColumn calls retry 3x with 200/400/800ms backoff,
honouring KtxScanContext.signal via a new KtxAbortedError.
- On retry exhaustion or missing capability, table generation falls back
to a metadata-only prompt built from column name / native type / comment
/ rawDescriptions. The column path follows the same rule -- call the
LLM when any of samples or rawDescriptions are available; skip only
when both are absent.
- Logger is now threaded from KtxScanContext into the generator. Failures
emit structured KtxScanWarning entries (new description_fallback_used
code, plus existing sampling_failed / enrichment_failed /
connector_capability_missing). ktx scan groups warnings by code so a
batch of identical failures collapses to one summary line plus sample.
- Returns null on failure instead of the 'Table not found' sentinel; the
manifest writer's existing guard already skips empty descriptions, so
schema YAML no longer carries misleading text. SCAN_MANAGED_DESCRIPTION_KEYS
already strips stale 'ai' on merge, so existing YAML clears on next run.
Also suppress AI SDK v6 'system in messages' warning: pull system messages
out of KtxMessageBuilder.wrapSimple's output via a new splitKtxSystemMessages
helper and pass them top-level to generateText (preserves cacheControl
providerOptions on the SystemModelMessage). Agent-runner's local
splitSystemPromptMessages dedupes onto the shared helper.
* test(docs): align examples-docs assertions with revamped docs
PR #103 (setup/guide doc revamp) reworded several CLI examples and
connection labels; the assertions in scripts/examples-docs.test.mjs
still referenced the pre-revamp wording and were failing in CI on main.
Update the regexes to match the post-revamp content:
- drop the `--json` flag from the sl-query example expectation
- move the `Driver:` / `Status: ok` probe to the connection reference,
which is where that output now lives (driver id is lowercase
`postgres`, not the display name `PostgreSQL`)
- drop the obsolete `Install \`uv\`...` troubleshooting line
- accept `<connectionId>` everywhere; the docs no longer use the
hyphenated `<connection-id>` form
- match the `warehouse` connection id used in the quickstart instead of
the `postgres-warehouse` id only used in the README and setup ref
* fix(sl): skip TS/Python schema contract test when uv is unavailable
The TypeScript checks CI job does not install uv or Python, so the
module-level `execFileSync('uv', ...)` in schemas.contract.test.ts threw
ENOENT and failed the suite. Wrap the schema dump in a try/catch and
guard the describe block with `describe.skipIf` so the test skips in
environments without uv. Local dev and any CI job that has uv on PATH
still runs the cross-language contract assertion.
2026-05-15 02:11:04 +02:00
- Use `column_overrides` to add descriptions or metadata to inherited manifest columns. Do not put `type` or `expr` in `column_overrides` .
2026-05-10 23:12:26 +02:00
- `exclude_columns` hides specific manifest columns; `disable_joins` suppresses specific auto-detected joins.
### Standalone table sources
Self-contained; own their schema. Has `source_type: table` and `table:` .
```yaml
name: account_health_scores
source_type: table
table: "analytics.account_health_scores"
grain: [account_id, snapshot_date]
columns:
- name: account_id
type: string
- name: snapshot_date
type: time
role: time
- name: health_score
type: number
measures:
- name: avg_health_score
expr: avg(health_score)
```
### Standalone SQL sources
Self-contained; schema derived from a SQL query. Has `source_type: sql` and `sql:` .
```yaml
name: monthly_cancellations
source_type: sql
sql: |
SELECT
date_trunc('month', cancelled_at) AS month,
customer_id,
plan_name,
mrr_amount
FROM subscriptions
WHERE status = 'cancelled'
grain: [customer_id, month]
columns:
- name: month
type: time
role: time
- name: customer_id
type: string
- name: plan_name
type: string
- name: mrr_amount
type: number
measures:
- name: cancellation_count
expr: count(*)
```
2026-05-14 12:43:14 -04:00
An SQL source is a one-shot answer: the aggregation is frozen, callers cannot re-group or re-filter by columns the SQL has collapsed, and the source is disconnected from the join graph. Prefer overlays + measures over SQL sources when possible - the `sl_capture` skill covers when SQL is justified.
2026-05-10 23:12:26 +02:00
### Columns
fix(context): merge overlay columns onto manifest columns by name (#94)
* fix(context): merge overlay columns onto manifest columns by name
composeOverlay was appending overlay columns to the manifest column list,
producing duplicate entries when dbt/metabase overlays declared a column
just to attach descriptions. The duplicates carried no `type`, so the
pydantic SourceDefinition rejected them at semantic-query time and broke
`ktx sl query` for every overlay-backed measure. Now overlay columns
match base columns by name (case-insensitive): same-name entries merge
onto the manifest (overlay fields win, type/role fall back to the base,
descriptions merge per source key) and only new names append.
* refactor(sl): split overlay columns from column_overrides and enforce TS/Python wire contract
Overlay sources now have two distinct collections: `columns:` for computed
columns (requiring `expr` + `type`) and `column_overrides:` for metadata
patches to inherited manifest columns. Composing or loading an overlay that
mixes the two — or references an unknown column — fails with a typed error.
Introduce `ResolvedSemanticLayerSource` / `resolvedSourceSchema` /
`toResolvedWire` as the strict shape sent to the Python engine, and add a
schema contract test that diffs Zod against the Pydantic JSON schema dumped
by `python -m semantic_layer dump-schema`. `SourceDefinition` is now
`extra="forbid"` on the Python side.
`loadAllSources` surfaces per-file load errors instead of swallowing them,
so validation/query paths can report manifest shard parse failures.
* fix(context): make scan description generation resilient and quiet
A transient sampleTable failure during ingest used to take out every
table in a connection: generateTableDescription returned a hardcoded
'Table not found' string into descriptions.ai, and KtxDescriptionGenerator
was constructed without a logger, so the failure left no trail anywhere.
- sampleTable / sampleColumn calls retry 3x with 200/400/800ms backoff,
honouring KtxScanContext.signal via a new KtxAbortedError.
- On retry exhaustion or missing capability, table generation falls back
to a metadata-only prompt built from column name / native type / comment
/ rawDescriptions. The column path follows the same rule -- call the
LLM when any of samples or rawDescriptions are available; skip only
when both are absent.
- Logger is now threaded from KtxScanContext into the generator. Failures
emit structured KtxScanWarning entries (new description_fallback_used
code, plus existing sampling_failed / enrichment_failed /
connector_capability_missing). ktx scan groups warnings by code so a
batch of identical failures collapses to one summary line plus sample.
- Returns null on failure instead of the 'Table not found' sentinel; the
manifest writer's existing guard already skips empty descriptions, so
schema YAML no longer carries misleading text. SCAN_MANAGED_DESCRIPTION_KEYS
already strips stale 'ai' on merge, so existing YAML clears on next run.
Also suppress AI SDK v6 'system in messages' warning: pull system messages
out of KtxMessageBuilder.wrapSimple's output via a new splitKtxSystemMessages
helper and pass them top-level to generateText (preserves cacheControl
providerOptions on the SystemModelMessage). Agent-runner's local
splitSystemPromptMessages dedupes onto the shared helper.
* test(docs): align examples-docs assertions with revamped docs
PR #103 (setup/guide doc revamp) reworded several CLI examples and
connection labels; the assertions in scripts/examples-docs.test.mjs
still referenced the pre-revamp wording and were failing in CI on main.
Update the regexes to match the post-revamp content:
- drop the `--json` flag from the sl-query example expectation
- move the `Driver:` / `Status: ok` probe to the connection reference,
which is where that output now lives (driver id is lowercase
`postgres`, not the display name `PostgreSQL`)
- drop the obsolete `Install \`uv\`...` troubleshooting line
- accept `<connectionId>` everywhere; the docs no longer use the
hyphenated `<connection-id>` form
- match the `warehouse` connection id used in the quickstart instead of
the `postgres-warehouse` id only used in the README and setup ref
* fix(sl): skip TS/Python schema contract test when uv is unavailable
The TypeScript checks CI job does not install uv or Python, so the
module-level `execFileSync('uv', ...)` in schemas.contract.test.ts threw
ENOENT and failed the suite. Wrap the schema dump in a try/catch and
guard the describe block with `describe.skipIf` so the test skips in
environments without uv. Local dev and any CI job that has uv on PATH
still runs the cross-language contract assertion.
2026-05-15 02:11:04 +02:00
Every standalone column requires `name` and `type` . Overlays have computed columns in `columns:` and manifest column metadata patches in `column_overrides:` .
2026-05-10 23:12:26 +02:00
- `type` : one of `string` , `number` , `boolean` , `time` . Map LookML `date` /`datetime` /`timestamp` → `time` . Map LookML `yesno` → `boolean` .
- `role` (optional): `time` enables time-granularity queries (month, week, day). `default` is the implicit fallback.
- `visibility` (optional): `public` , `internal` , or `hidden` .
- `expr` (optional for standalone, required for overlay columns): SQL expression that computes the value. Expanded by sqlglot before generating SQL, so you can reference other columns on the same source.
### Grain
2026-05-14 12:43:14 -04:00
`grain: [col_a, col_b]` - the set of columns that uniquely identify one row. The query engine uses grain to prevent fan-out in joins. Overlays inherit grain from the manifest unless they override.
2026-05-10 23:12:26 +02:00
### Joins
```yaml
joins:
- to: customers # target source name
on: "customer_id = customers.id" # local_col = TARGET.target_col
relationship: many_to_one # or one_to_many, one_to_one
2026-05-14 12:43:14 -04:00
alias: primary_customer # optional - lets you join the same target twice
2026-05-10 23:12:26 +02:00
```
- `on` format: `local_col = TARGET.target_col` . Always qualify the right side with the target source name.
- `relationship` is the cardinality **from this source to the target** . Most joins are `many_to_one` (FK → PK on the parent).
### Measures
```yaml
measures:
- name: total_arr
expr: sum(arr_amount)
2026-05-14 12:43:14 -04:00
description: Sum of ARR - filter by plan_name at query time
2026-05-10 23:12:26 +02:00
filter: "is_active = true"
segments: [paid_non_refunded]
```
- `name` (required, snake_case).
2026-05-14 12:43:14 -04:00
- `expr` (required): any valid SQL aggregate - `sum(x)` , `count(*)` , `count(distinct user_id)` , `avg(score)` .
2026-05-10 23:12:26 +02:00
- `description` (required on capture): what the measure computes and how to use it.
- `filter` (optional): SQL predicate applied as a WHERE clause specific to this measure.
- `segments` (optional): names of segments defined on the same source. The engine AND-composes each segment's `expr` into this measure's effective filter.
Use `safe_divide(num, den)` for ratio measures to avoid division by zero.
### Segments
```yaml
segments:
- name: paid_non_refunded
expr: "is_paid = true AND is_refunded = false"
description: Orders that were paid and not refunded
```
2026-05-14 12:43:14 -04:00
Named, reusable boolean predicates scoped to one source. Reference by bare name in a measure's `segments: []` , or by dotted form `source.segment_name` in an `sl_query` . Segments are predicates only - they are NOT selectable as dimensions. If you need to group by the predicate, add a `columns[]` entry instead.
2026-05-10 23:12:26 +02:00
### Cross-references with the wiki
2026-05-14 12:43:14 -04:00
The reverse edge (wiki pages that cite this source) is derived automatically from each wiki's `sl_refs:` - you don't emit anything on the SL side. Author the edge once on the wiki via `sl_refs:` ; the post-write reconciler populates the knowledge↔SL index.
2026-05-10 23:12:26 +02:00
---
2026-05-14 12:43:14 -04:00
## Part 2 - Querying via `sl_query`
2026-05-10 23:12:26 +02:00
2026-05-14 00:57:51 +02:00
The `sl_query` tool generates correct SQL from a structured query. It handles joins, fan-out prevention, aggregation correctness, and filter classification automatically. Prefer it over writing raw SQL whenever the SL has the relevant sources.
2026-05-10 23:12:26 +02:00
2026-05-14 00:57:51 +02:00
### When to prefer sl_query over raw SQL
2026-05-10 23:12:26 +02:00
- A pre-defined measure already exists (`source.measure_name` appears in the catalog).
2026-05-14 12:43:14 -04:00
- The question combines fields from multiple sources - the engine resolves the join path automatically.
- The question asks for a standard metric (revenue, ARR, churn, retention, LTV, conversion, MAU, etc.) - even if no pre-defined measure exists, a runtime aggregation over a catalog column is usually correct.
2026-05-10 23:12:26 +02:00
Use raw SQL (`sql_execution` ) only when:
- The computation requires multi-step CTEs whose intermediate grain is not a column in any source.
- The question explicitly asks for a one-off exploration that will never be asked again.
### Input shape
```json
{
"connectionId": "uuid-of-the-connection",
2026-05-14 00:57:51 +02:00
"measures": ["orders.total_revenue", "sum(orders.amount)"],
"dimensions": ["customers.segment", { "field": "orders.created_at", "granularity": "month" }],
"filters": ["orders.status != 'cancelled'", "orders.total_revenue > 10000"],
"segments": ["orders.paid_non_refunded"],
"order_by": [{ "field": "orders.created_at", "direction": "desc" }],
"limit": 1000
2026-05-10 23:12:26 +02:00
}
```
- **`measures` **: mix pre-defined refs (`source.measure` ) and runtime aggregations (`sum(source.column)` ).
- **`dimensions` **: column refs or `{ field, granularity }` objects for time grains (`day` , `week` , `month` , `quarter` , `year` ).
- **`filters` **: free-form SQL predicates. The engine auto-classifies each as WHERE or HAVING based on whether it references an aggregated measure.
2026-05-14 12:43:14 -04:00
- **`segments` **: dotted `source.segment_name` . Each segment is AND-ed into the effective filter of every measure whose base source matches. Segments never become a global WHERE - use `filters` for cross-source predicates.
2026-05-10 23:12:26 +02:00
- **`order_by` **: string or `{ field, direction }` . Direction defaults to `asc` .
- **`limit` **: integer row cap.
### Join resolution
2026-05-14 12:43:14 -04:00
You don't specify a base table. The engine infers the set of sources needed from the fields you reference and resolves the shortest join path through the catalog's declared joins. If no path exists between two sources, the query fails with a path-not-found error - check `discover_data` or `sl_discover` to see which sources are connected.
2026-05-10 23:12:26 +02:00
### Worked examples
2026-05-14 12:43:14 -04:00
Cross-source query - engine resolves `account_health_scores → accounts ← opportunities` automatically:
2026-05-10 23:12:26 +02:00
```json
{
"measures": ["account_health_scores.avg_health_score"],
"dimensions": ["opportunities.stage"],
"filters": ["opportunities.stage != 'Closed Won'"]
}
```
Monthly ARR trend with a segment:
```json
{
"measures": ["subscriptions.arr"],
"dimensions": [{ "field": "subscriptions.month", "granularity": "month" }],
"segments": ["subscriptions.paid_non_refunded"],
"order_by": [{ "field": "subscriptions.month", "direction": "asc" }]
}
```
Multi-source with runtime aggregation:
```json
{
"measures": ["sum(orders.amount)", "count(support_tickets.ticket_id)"],
"dimensions": ["customers.segment"]
}
```