apunkt/ktx

Fork 0

mirror of https://github.com/Kaelio/ktx.git synced 2026-06-10 08:05:14 +02:00

Andrey Avtomonov 3ce510b55b rename klo to ktx

2026-05-10 23:51:24 +02:00

16 KiB

Raw Permalink Blame History

name

description

callers

sl_capture

How to capture new reusable patterns into KTX's semantic layer — when a measure, segment, or join belongs in the catalog and how to write it generically so it stays small and useful over time. Loaded by the post-turn memory-agent only. The research agent does not write to the SL.

memory_agent

Semantic Layer — Capture

This skill covers when and how to capture new patterns into the semantic layer. For schema reference and query grammar, load the sl skill first.

When the current turn produces a reusable pattern (business metric, derived view, join pattern, computed dimension), capture it so future queries can reach for it instead of rediscovering it.

SQL dialect

The user-facing prompt includes a Warehouse: line under the SL Sources index (e.g. Warehouse: BIGQUERY). All expr strings — measure expressions, segment predicates, computed-column SQL — execute on that warehouse and must use its syntax. Date arithmetic in particular varies by dialect:

BigQuery: transaction_date >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY) (when the column is TIMESTAMP); event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY) (when DATE).
Postgres / Redshift: transaction_date >= current_date - interval '90 days'.
Snowflake: transaction_date >= dateadd(day, -90, current_timestamp()).

Match the column's manifest type (type: time → TIMESTAMP/DATETIME on the warehouse) — comparing TIMESTAMP to a DATE-arithmetic result fails on BigQuery. After every sl_edit_source/sl_write_source, the inline validator runs a LIMIT 1 warehouse probe per measure and surfaces dialect mismatches; if you see an error trailer, fix the expression and retry rather than leaving the source for the post-squash gate to revert.

What's worth capturing

Business metric aggregations (ARR, MRR, revenue, churn, retention, conversion, LTV, CAC).
Derived calculations combining multiple signals (risk scores, health scores, composite KPIs).
Multi-table join patterns producing a reusable analytical view.
Computed categories or flags useful as reusable dimensions (case when num_protocols >= 3 then 'power' else 'regular' end).
Missing joins between two sources that both exist but aren't connected in the join graph.

Skip:

Simple SELECT * LIMIT 10 previews.
Trivial COUNT(*) on one table with no business filtering.
One-off ad-hoc explorations unlikely to repeat.
Equivalent measures that already exist (cite the existing one as source.measure_name).

When in doubt, capture. Measures are easy to remove but impossible to recover from a lost conversation.

Generalization rules

The SL must stay small and general over time. Before adding a measure, decide whether it belongs as a generic pattern or a specific constant.

Prefer one generic measure with query-time filters over N hardcoded variants.

Anti-pattern:

- name: revenue_us_region
  expr: sum(case when region = 'US' then amount end)
- name: revenue_eu_region
  expr: sum(case when region = 'EU' then amount end)

Preferred:

- name: total_revenue
  expr: sum(amount)

Callers filter region = 'US' at semantic_query time.

Bake constants in only when the filter has named business meaning that won't change (enterprise_arr for a contractually defined tier), cannot be expressed via the source's dimensions, or comes from a regulated/fixed list.

Time anchors and value lists belong in callers' filters, not in measure expressions or source SQL.

Anti-pattern (date anchor inlined): expr: count(distinct case when transaction_date >= '2026-04-12' then customer_id end) — the date will need editing every time the question shifts, and every reader has to discover it.
Anti-pattern (value list inlined in source SQL): WHERE product_category_1 IN ('Testosterone', 'Weight Loss', …) — locks the source to today's catalog and blocks callers from broadening or narrowing.
Preferred: a generic measure (count(distinct customer_id)) plus either a named segment that captures the meaning of the anchor (gh_new_products_since_launch) or a query-time filter. Callers compose; the source stays small.
A date is durable to bake in only when it represents a regulatory cutover, a contractually fixed boundary, or a one-time event that reshapes how the source itself is read.

If you create a segment whose expr matches a measure's filter, the measure MUST reference the segment via segments: [segment_name] rather than re-inlining the predicate. This is the canonical pattern even with a single measure — duplicating the predicate inline defeats the purpose of naming it.

Anti-pattern:

segments:
  - name: engaged_subscriber
    expr: "is_paid = true AND <date-window-90-days-on-transaction_date>"
measures:
  - name: engaged_subscriber_count
    expr: "count(distinct case when is_paid = true and transaction_date >= current_date - interval '90 day' then admin_user_id end)"

Preferred:

segments:
  - name: engaged_subscriber
    expr: "is_paid = true AND <date-window-90-days-on-transaction_date>"
measures:
  - name: engaged_subscriber_count
    expr: "count(distinct admin_user_id)"
    segments: [engaged_subscriber]

Use computed dimensions for derived categories. A flag like is_power_user belongs on columns[] with expr, not inlined into every measure.

Extract repeated filter bundles into named segments. If the same predicate appears on multiple measures of the same source, lift it to a segments[] entry and have each measure reference it. One edit updates every measure that depends on it.

Never write a standalone file on a manifest-backed name. If sl_discover({ tableName }) finds an existing schema for that name, you MUST write an overlay (name: + measures:/segments:/description: only — no sql:, table:, grain:, columns:, joins:). A standalone with sql: or table: on a manifest-backed name clobbers the inherited columns and joins; sl_write_source and sl_validate both reject this shape with a clear fix hint. Always run sl_discover before your first write on any existing name.

Prefer overlay decomposition over standalone SQL sources. Before reaching for source_type: sql, check whether the metric decomposes into measures on existing overlays (including cross-source derived measures). Use source_type: sql only when:

The metric requires per-user/per-entity derivation that cannot be expressed as a single expr (e.g., EXISTS over a time-windowed subset), OR
The metric requires multi-step CTEs whose intermediate grain is not a column in any existing source.

When an sql source is unavoidable, note in its description which SL gap forced the choice so it can be retired once the primitive ships. It must target a name NOT in the manifest — pick a distinct one (e.g. mrr_waterfall_rollup, not fct_orders).

Slim standalone sources via `inherits_columns_from`

When a standalone SQL source filters or projects from a single manifest-backed base table (the common pattern for derived views like aav_consignments over MARTS.CONSIGNMENTS), set inherits_columns_from: to the base table's manifest key and list only column names in columns:. Compose-time enrichment fills type, descriptions, and role from the matching manifest column.

Discover the manifest key with sl_discover — pass the bare name (CONSIGNMENTS), the fully-qualified path (ANALYTICS.MARTS.CONSIGNMENTS), or any suffix; the tool resolves all forms and prints the canonical key in its output.

name: aav_consignments
description: AAV consignments — filtered view of MARTS.CONSIGNMENTS for the auto-auction-vaulting channel.
source_type: sql
sql: |
  SELECT CONSIGNED_ITEM_ID, CASH_ADV_AMOUNT, ALT_VALUE_COMBINED, my_derived_flag
  FROM MARTS.CONSIGNMENTS
  WHERE IS_AUTO_AUCTION_VAULTING_SUBMISSION = TRUE
    AND IS_CARD_SHOW_SUBMISSION = FALSE
    AND CONSIGNMENT_CANCELED_FLAG = FALSE
inherits_columns_from: CONSIGNMENTS
grain: [CONSIGNED_ITEM_ID]
columns:
  - { name: CONSIGNED_ITEM_ID }      # type/description inherited from manifest
  - { name: CASH_ADV_AMOUNT }
  - { name: ALT_VALUE_COMBINED }
  - { name: my_derived_flag, type: boolean, expr: "CASH_ADV_AMOUNT > 0", description: "Computed locally — has any cash advance." }
measures:
  - name: total_cash_advance
    expr: sum(CASH_ADV_AMOUNT)

Rules:

Inheritance fills only blank fields. If you set a description locally, it wins — useful when the base description is misleading in the filtered view.
A column not in the manifest (a derived/aliased column, or one from a different table in a JOIN) needs its own type and description declared.
If inherits_columns_from doesn't resolve, the source still loads, but every column without a type triggers a validator error on the warehouse probe — sl_discover first to confirm the key.
Don't use inherits_columns_from for sources backed by table: (those should be overlays — see the rule against shadowing the manifest above).

Refinement — replace, don't append

When the user corrects a prior answer, the existing measure is wrong by the user's own standard. Replace it, don't add a parallel measure.

Signals that the current turn is a refinement:

"no, I meant...", "actually use X", "exclude Y", "wait, by X I mean Z".
Pushback on a prior result ("that's wrong because...", "this should be higher").
Redefinition of a term used in an existing measure.

Distinguishing question: would the prior measure still be correct for someone else asking the prior question? If no → replace. If yes → add.

Edit SL vs document in wiki

If the user explicitly names an SL artifact and asks to change it, the primary action is always an SL tool call. Examples:

"edit the source", "edit the YAML", "edit fct_intakes.yaml" → sl_edit_source or sl_write_source.
"refine the measure", "change the filter on active_users", "fix the expr", "add is_test = false" → sl_edit_source on the source that owns the measure.
"don't create a new one, update the existing" → sl_edit_source (never sl_write_source with a new source name; never wiki_write as the only action).

A wiki update may ALSO make sense in the same turn (owner note, lineage, caveat), but it is never a substitute for editing the YAML when the user's request is about changing the measure/source definition itself.

Wiki-only is correct when the user is documenting about the measure (definition in business terms, owner, policy, glossary, examples of when to use it) without changing its SQL expression or filters.

Tool sequence

sl_discover — see what source files exist.
sl_discover({ tableName }) — REQUIRED before the first write on any name. Shows columns/joins/grain from the manifest. If the call returns a schema, you MUST write an overlay, not a standalone. Skipping this is the #1 cause of accidentally shadowing the manifest.
sl_read_source({ sourceName }) — read the raw YAML before editing.
For modifications: sl_edit_source({ sourceName, old_string, new_string }) with exact-string replacements. old_string must match exactly and be unique in the file.
For new sources or full rewrites: sl_write_source({ sourceName, content }) with the full YAML content.
For join discovery: sql_execution({ sql }) to verify the join key exists in both tables and assess cardinality before declaring the join.
Cross-reference knowledge: author the edge once on the wiki side via sl_refs: [source_name] in the page's front-matter. The reverse edge (wiki pages that cite an SL source) is derived automatically by the reconciler — do not add a knowledge_refs: field to SL YAMLs.
sl_validate — run after writing or editing to surface schema issues, duplicate measure names, and cross-source validation errors. Read-only; the writes are already committed (the squash-at-end flow will collapse them into one commit).

Editing patterns

sl_edit_source is the workhorse for additive changes: add a measure, add a join, tweak a description, replace a filter. Cheap, targeted, preserves the rest of the file.
sl_write_source is for brand-new sources or when the entire file needs restructuring. It overwrites the file completely.
Do NOT modify existing measures or their descriptions unless the current turn explicitly corrects them.

Worked example — additive overlay

Conversation:

User: "What was the average order value last quarter?"
Assistant fell back to SQL: SELECT AVG(amount) FROM orders WHERE order_date >= ...

Existing index: orders [measures=0, joins=0] — candidate for enrichment.

sl_discover()
  → orders.yaml does not exist yet
sl_discover({ tableName: "orders" })
  → see grain, columns, no current overlay
sl_write_source({
  sourceName: "orders",
  content: "name: orders\nmeasures:\n  - name: avg_order_value\n    expr: avg(amount)\n    description: Mean order transaction amount — filter by product_category at query time\n"
})
sl_validate()
  → clean

The overlay only contains name and measures — no columns, grain, or table. Those are inherited from the manifest.

Worked example — refinement (replace)

Prior turn:

[user] "How many active users do we have per region?"
[assistant] "… used count(*) filter: last_login_at > now() - interval '30 days'"

Current user: "Wait, by 'active' I mean users who have placed an order in the last 30 days, not just logged in."

The existing users.active_count measure is wrong by the new definition.

sl_read_source({ sourceName: "users" })
  → see the wrong measure
sl_edit_source({
  sourceName: "users",
  yaml_edits: [{
    oldText: "  - name: active_count\n    expr: \"count(*)\"\n    filter: \"last_login_at > now() - interval '30 days'\"\n    description: Users who logged in within the last 30 days",
    newText: "  - name: active_count\n    expr: \"count(distinct case when last_order_at > now() - interval '30 days' then user_id end)\"\n    description: Users with at least one order in the last 30 days"
  }]
})
sl_validate()

If you only added a new measure, the old incorrect active_count would stay and future queries would keep answering the wrong question.

Worked example — new join

Prior turn: user asked to correlate LTV with protocol count; assistant joined fct_orders with fct_mau_multiprotocol on admin_user_id in raw SQL.

sl_read_source({ sourceName: "fct_orders" })
  → no joins section yet
sql_execution({
  sql: "SELECT COUNT(*), COUNT(DISTINCT a.admin_user_id) FROM fct_orders a JOIN fct_mau_multiprotocol b ON a.admin_user_id = b.admin_user_id LIMIT 1"
})
  → confirms cardinality (many orders per MAU row = many_to_one)
sl_edit_source({
  sourceName: "fct_orders",
  yaml_edits: [{
    oldText: "measures:",
    newText: "joins:\n  - to: fct_mau_multiprotocol\n    on: admin_user_id = fct_mau_multiprotocol.admin_user_id\n    relationship: many_to_one\nmeasures:"
  }]
})
sl_validate()

Always verify joins with sql_execution before adding them.

Rules recap

Read existing sources before editing (sl_read_source or sl_discover).
Prefer overlays over standalone sources on manifest-backed tables.
Prefer generic measures + query-time filters over per-value variants.
Time anchors and value lists belong in callers' filters, not in measure expressions.
A measure whose filter matches a segment MUST reference the segment via segments: [name].
Extract repeated predicates into named segments.
Use computed dimensions for derived categories.
When the user corrects a prior answer, replace — don't append.
Always run sl_validate after writing to surface issues.
If nothing is worth capturing, respond without calling any SL write tool.

16 KiB Raw Permalink Blame History