16 KiB
| name | description | callers | |
|---|---|---|---|
| sl_capture | How to capture new reusable patterns into KTX's semantic layer — when a measure, segment, or join belongs in the catalog and how to write it generically so it stays small and useful over time. Loaded by the post-turn memory-agent only. The research agent does not write to the SL. |
|
Semantic Layer — Capture
This skill covers when and how to capture new patterns into the semantic layer. For schema reference and query grammar, load the sl skill first.
When the current turn produces a reusable pattern (business metric, derived view, join pattern, computed dimension), capture it so future queries can reach for it instead of rediscovering it.
SQL dialect
The user-facing prompt includes a Warehouse: line under the SL Sources index
(e.g. Warehouse: BIGQUERY). All expr strings — measure expressions, segment
predicates, computed-column SQL — execute on that warehouse and must use its
syntax. Date arithmetic in particular varies by dialect:
- BigQuery:
transaction_date >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)(when the column isTIMESTAMP);event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)(whenDATE). - Postgres / Redshift:
transaction_date >= current_date - interval '90 days'. - Snowflake:
transaction_date >= dateadd(day, -90, current_timestamp()).
Match the column's manifest type (type: time → TIMESTAMP/DATETIME on the
warehouse) — comparing TIMESTAMP to a DATE-arithmetic result fails on
BigQuery. After every sl_edit_source/sl_write_source, the inline validator runs a
LIMIT 1 warehouse probe per measure and surfaces dialect mismatches; if
you see an error trailer, fix the expression and retry rather than leaving
the source for the post-squash gate to revert.
What's worth capturing
- Business metric aggregations (ARR, MRR, revenue, churn, retention, conversion, LTV, CAC).
- Derived calculations combining multiple signals (risk scores, health scores, composite KPIs).
- Multi-table join patterns producing a reusable analytical view.
- Computed categories or flags useful as reusable dimensions (
case when num_protocols >= 3 then 'power' else 'regular' end). - Missing joins between two sources that both exist but aren't connected in the join graph.
Skip:
- Simple
SELECT * LIMIT 10previews. - Trivial
COUNT(*)on one table with no business filtering. - One-off ad-hoc explorations unlikely to repeat.
- Equivalent measures that already exist (cite the existing one as
source.measure_name).
When in doubt, capture. Measures are easy to remove but impossible to recover from a lost conversation.
Generalization rules
The SL must stay small and general over time. Before adding a measure, decide whether it belongs as a generic pattern or a specific constant.
Prefer one generic measure with query-time filters over N hardcoded variants.
Anti-pattern:
- name: revenue_us_region
expr: sum(case when region = 'US' then amount end)
- name: revenue_eu_region
expr: sum(case when region = 'EU' then amount end)
Preferred:
- name: total_revenue
expr: sum(amount)
Callers filter region = 'US' at semantic_query time.
Bake constants in only when the filter has named business meaning that won't change (enterprise_arr for a contractually defined tier), cannot be expressed via the source's dimensions, or comes from a regulated/fixed list.
Time anchors and value lists belong in callers' filters, not in measure expressions or source SQL.
- Anti-pattern (date anchor inlined):
expr: count(distinct case when transaction_date >= '2026-04-12' then customer_id end)— the date will need editing every time the question shifts, and every reader has to discover it. - Anti-pattern (value list inlined in source SQL):
WHERE product_category_1 IN ('Testosterone', 'Weight Loss', …)— locks the source to today's catalog and blocks callers from broadening or narrowing. - Preferred: a generic measure (
count(distinct customer_id)) plus either a named segment that captures the meaning of the anchor (gh_new_products_since_launch) or a query-time filter. Callers compose; the source stays small. - A date is durable to bake in only when it represents a regulatory cutover, a contractually fixed boundary, or a one-time event that reshapes how the source itself is read.
If you create a segment whose expr matches a measure's filter, the measure MUST reference the segment via segments: [segment_name] rather than re-inlining the predicate. This is the canonical pattern even with a single measure — duplicating the predicate inline defeats the purpose of naming it.
Anti-pattern:
segments:
- name: engaged_subscriber
expr: "is_paid = true AND <date-window-90-days-on-transaction_date>"
measures:
- name: engaged_subscriber_count
expr: "count(distinct case when is_paid = true and transaction_date >= current_date - interval '90 day' then admin_user_id end)"
Preferred:
segments:
- name: engaged_subscriber
expr: "is_paid = true AND <date-window-90-days-on-transaction_date>"
measures:
- name: engaged_subscriber_count
expr: "count(distinct admin_user_id)"
segments: [engaged_subscriber]
Use computed dimensions for derived categories. A flag like is_power_user belongs on columns[] with expr, not inlined into every measure.
Extract repeated filter bundles into named segments. If the same predicate appears on multiple measures of the same source, lift it to a segments[] entry and have each measure reference it. One edit updates every measure that depends on it.
Never write a standalone file on a manifest-backed name. If sl_discover({ tableName }) finds an existing schema for that name, you MUST write an overlay (name: + measures:/segments:/description: only — no sql:, table:, grain:, columns:, joins:). A standalone with sql: or table: on a manifest-backed name clobbers the inherited columns and joins; sl_write_source and sl_validate both reject this shape with a clear fix hint. Always run sl_discover before your first write on any existing name.
Prefer overlay decomposition over standalone SQL sources. Before reaching for source_type: sql, check whether the metric decomposes into measures on existing overlays (including cross-source derived measures). Use source_type: sql only when:
- The metric requires per-user/per-entity derivation that cannot be expressed as a single
expr(e.g.,EXISTSover a time-windowed subset), OR - The metric requires multi-step CTEs whose intermediate grain is not a column in any existing source.
When an sql source is unavoidable, note in its description which SL gap forced the choice so it can be retired once the primitive ships. It must target a name NOT in the manifest — pick a distinct one (e.g. mrr_waterfall_rollup, not fct_orders).
Slim standalone sources via inherits_columns_from
When a standalone SQL source filters or projects from a single manifest-backed base table (the common pattern for derived views like aav_consignments over MARTS.CONSIGNMENTS), set inherits_columns_from: to the base table's manifest key and list only column names in columns:. Compose-time enrichment fills type, descriptions, and role from the matching manifest column.
Discover the manifest key with sl_discover — pass the bare name (CONSIGNMENTS), the fully-qualified path (ANALYTICS.MARTS.CONSIGNMENTS), or any suffix; the tool resolves all forms and prints the canonical key in its output.
name: aav_consignments
description: AAV consignments — filtered view of MARTS.CONSIGNMENTS for the auto-auction-vaulting channel.
source_type: sql
sql: |
SELECT CONSIGNED_ITEM_ID, CASH_ADV_AMOUNT, ALT_VALUE_COMBINED, my_derived_flag
FROM MARTS.CONSIGNMENTS
WHERE IS_AUTO_AUCTION_VAULTING_SUBMISSION = TRUE
AND IS_CARD_SHOW_SUBMISSION = FALSE
AND CONSIGNMENT_CANCELED_FLAG = FALSE
inherits_columns_from: CONSIGNMENTS
grain: [CONSIGNED_ITEM_ID]
columns:
- { name: CONSIGNED_ITEM_ID } # type/description inherited from manifest
- { name: CASH_ADV_AMOUNT }
- { name: ALT_VALUE_COMBINED }
- { name: my_derived_flag, type: boolean, expr: "CASH_ADV_AMOUNT > 0", description: "Computed locally — has any cash advance." }
measures:
- name: total_cash_advance
expr: sum(CASH_ADV_AMOUNT)
Rules:
- Inheritance fills only blank fields. If you set a
descriptionlocally, it wins — useful when the base description is misleading in the filtered view. - A column not in the manifest (a derived/aliased column, or one from a different table in a
JOIN) needs its owntypeanddescriptiondeclared. - If
inherits_columns_fromdoesn't resolve, the source still loads, but every column without a type triggers a validator error on the warehouse probe —sl_discoverfirst to confirm the key. - Don't use
inherits_columns_fromfor sources backed bytable:(those should be overlays — see the rule against shadowing the manifest above).
Refinement — replace, don't append
When the user corrects a prior answer, the existing measure is wrong by the user's own standard. Replace it, don't add a parallel measure.
Signals that the current turn is a refinement:
- "no, I meant...", "actually use X", "exclude Y", "wait, by X I mean Z".
- Pushback on a prior result ("that's wrong because...", "this should be higher").
- Redefinition of a term used in an existing measure.
Distinguishing question: would the prior measure still be correct for someone else asking the prior question? If no → replace. If yes → add.
Edit SL vs document in wiki
If the user explicitly names an SL artifact and asks to change it, the primary action is always an SL tool call. Examples:
- "edit the source", "edit the YAML", "edit
fct_intakes.yaml" →sl_edit_sourceorsl_write_source. - "refine the measure", "change the filter on
active_users", "fix the expr", "addis_test = false" →sl_edit_sourceon the source that owns the measure. - "don't create a new one, update the existing" →
sl_edit_source(neversl_write_sourcewith a new source name; neverwiki_writeas the only action).
A wiki update may ALSO make sense in the same turn (owner note, lineage, caveat), but it is never a substitute for editing the YAML when the user's request is about changing the measure/source definition itself.
Wiki-only is correct when the user is documenting about the measure (definition in business terms, owner, policy, glossary, examples of when to use it) without changing its SQL expression or filters.
Tool sequence
sl_discover— see what source files exist.sl_discover({ tableName })— REQUIRED before the first write on any name. Shows columns/joins/grain from the manifest. If the call returns a schema, you MUST write an overlay, not a standalone. Skipping this is the #1 cause of accidentally shadowing the manifest.sl_read_source({ sourceName })— read the raw YAML before editing.- For modifications:
sl_edit_source({ sourceName, old_string, new_string })with exact-string replacements.old_stringmust match exactly and be unique in the file. - For new sources or full rewrites:
sl_write_source({ sourceName, content })with the full YAML content. - For join discovery:
sql_execution({ sql })to verify the join key exists in both tables and assess cardinality before declaring the join. - Cross-reference knowledge: author the edge once on the wiki side via
sl_refs: [source_name]in the page's front-matter. The reverse edge (wiki pages that cite an SL source) is derived automatically by the reconciler — do not add aknowledge_refs:field to SL YAMLs. sl_validate— run after writing or editing to surface schema issues, duplicate measure names, and cross-source validation errors. Read-only; the writes are already committed (the squash-at-end flow will collapse them into one commit).
Editing patterns
sl_edit_sourceis the workhorse for additive changes: add a measure, add a join, tweak a description, replace a filter. Cheap, targeted, preserves the rest of the file.sl_write_sourceis for brand-new sources or when the entire file needs restructuring. It overwrites the file completely.- Do NOT modify existing measures or their descriptions unless the current turn explicitly corrects them.
Worked example — additive overlay
Conversation:
- User: "What was the average order value last quarter?"
- Assistant fell back to SQL:
SELECT AVG(amount) FROM orders WHERE order_date >= ...
Existing index: orders [measures=0, joins=0] — candidate for enrichment.
sl_discover()
→ orders.yaml does not exist yet
sl_discover({ tableName: "orders" })
→ see grain, columns, no current overlay
sl_write_source({
sourceName: "orders",
content: "name: orders\nmeasures:\n - name: avg_order_value\n expr: avg(amount)\n description: Mean order transaction amount — filter by product_category at query time\n"
})
sl_validate()
→ clean
The overlay only contains name and measures — no columns, grain, or table. Those are inherited from the manifest.
Worked example — refinement (replace)
Prior turn:
- [user] "How many active users do we have per region?"
- [assistant] "… used
count(*) filter: last_login_at > now() - interval '30 days'"
Current user: "Wait, by 'active' I mean users who have placed an order in the last 30 days, not just logged in."
The existing users.active_count measure is wrong by the new definition.
sl_read_source({ sourceName: "users" })
→ see the wrong measure
sl_edit_source({
sourceName: "users",
yaml_edits: [{
oldText: " - name: active_count\n expr: \"count(*)\"\n filter: \"last_login_at > now() - interval '30 days'\"\n description: Users who logged in within the last 30 days",
newText: " - name: active_count\n expr: \"count(distinct case when last_order_at > now() - interval '30 days' then user_id end)\"\n description: Users with at least one order in the last 30 days"
}]
})
sl_validate()
If you only added a new measure, the old incorrect active_count would stay and future queries would keep answering the wrong question.
Worked example — new join
Prior turn: user asked to correlate LTV with protocol count; assistant joined fct_orders with fct_mau_multiprotocol on admin_user_id in raw SQL.
sl_read_source({ sourceName: "fct_orders" })
→ no joins section yet
sql_execution({
sql: "SELECT COUNT(*), COUNT(DISTINCT a.admin_user_id) FROM fct_orders a JOIN fct_mau_multiprotocol b ON a.admin_user_id = b.admin_user_id LIMIT 1"
})
→ confirms cardinality (many orders per MAU row = many_to_one)
sl_edit_source({
sourceName: "fct_orders",
yaml_edits: [{
oldText: "measures:",
newText: "joins:\n - to: fct_mau_multiprotocol\n on: admin_user_id = fct_mau_multiprotocol.admin_user_id\n relationship: many_to_one\nmeasures:"
}]
})
sl_validate()
Always verify joins with sql_execution before adding them.
Rules recap
- Read existing sources before editing (
sl_read_sourceorsl_discover). - Prefer overlays over standalone sources on manifest-backed tables.
- Prefer generic measures + query-time filters over per-value variants.
- Time anchors and value lists belong in callers' filters, not in measure expressions.
- A measure whose filter matches a segment MUST reference the segment via
segments: [name]. - Extract repeated predicates into named segments.
- Use computed dimensions for derived categories.
- When the user corrects a prior answer, replace — don't append.
- Always run
sl_validateafter writing to surface issues. - If nothing is worth capturing, respond without calling any SL write tool.