ktx/packages/context/skills/historic_sql_ingest/SKILL.md
2026-05-10 23:12:26 +02:00

8.5 KiB

name description callers
historic_sql_ingest Convert one full-tier historic-SQL template WorkUnit into a canonical query knowledge page, linked SL refs, and optional semantic-layer proposals.
memory_agent

Historic SQL Ingest

Use this skill when the WorkUnit contains files under raw-sources/<connectionId>/historic-sql/<syncId>/templates/<templateId>/.

Read exactly one historic-SQL template WorkUnit. Each WorkUnit represents one staged template or categorical sub-cluster that already survived full-tier page triage. It is not an intent cluster.

Input Shape

The WorkUnit normally exposes:

  • metadata.json in rawFiles.
  • page.md in rawFiles.
  • usage.json in dependencyPaths.
  • manifest.json in dependencyPaths.
  • peerFileIndex containing sibling templates that you cannot read.

metadata.json has the stable identity:

{
  "id": "fp_1",
  "title": "snowflake - analytics.orders [fp_1]",
  "path": "templates/fp_1/page.md",
  "objectType": "historic_sql_template",
  "lastEditedAt": null,
  "properties": {
    "fingerprint": "fp_1",
    "sub_cluster_id": null,
    "dialect": "snowflake",
    "tables_touched": ["analytics.orders"],
    "literal_slots": [
      { "position": 1, "type": "string", "classification": "constant" },
      { "position": 2, "type": "date", "classification": "runtime" }
    ],
    "triage_signals": {
      "executions_bucket": "high",
      "distinct_users_bucket": "team",
      "error_rate_bucket": "ok",
      "recency_bucket": "active",
      "service_account_only": "false",
      "slot_summary": "1 constant, 1 runtime"
    }
  }
}

page.md contains mechanically generated normalized SQL and touched tables:

# fp_1

## Normalized SQL
SELECT date_trunc(?, created_at), count(*) FROM analytics.orders WHERE status = ? AND created_at >= ? GROUP BY 1

## Tables touched
- analytics.orders

usage.json contains volatile stats, literal top values, and redacted samples. Use it for intent inference and usage summaries. Do not treat usage-only drift as a reason to group this template with siblings.

Required Workflow

  1. Read the WorkUnit section in the prompt first.
  2. Call read_raw_file for metadata.json, page.md, usage.json, and manifest.json.
  3. Confirm metadata.objectType === "historic_sql_template". If it is not, call emit_unmapped_fallback with reason: "parse_error", fallback: "flagged", and the metadata.json raw path.
  4. Extract fingerprint, sub_cluster_id, dialect, tables_touched, literal_slots, normalized SQL, usage stats, top literal values, and sample timestamps.
  5. Infer one canonical query intent from this template only. Use table names, selected expressions, aggregations, joins, grouping, constant literal slots, and repeated successful samples. Runtime literal slots are parameters, not fixed business rules.
  6. Build a short intent slug in kebab-case. Use queries/<intent_slug> as the wiki key.
  7. Search existing knowledge with wiki_search using the intent phrase and the primary table. Prefer updating an existing queries/... page when it is the same intent.
  8. Discover touched tables with sl_discover. Add cleanly matched source names to sl_refs. If a table does not map cleanly, keep it in the page body and do not include it in sl_refs.
  9. Write or update the query page with wiki_write.
  10. Apply the SL proposal threshold below. If it passes and a useful generic measure, segment, join, or overlay is clear, update the semantic layer and run sl_validate.
  11. Exit without reading peer files or grouping sibling templates.

Wiki Page Shape

Use wiki_write for pages. Emit the spec frontmatter fields directly on the query page.

Use this shape:

{
  "key": "queries/<intent_slug>",
  "summary": "<one sentence canonical intent>",
  "tags": ["historic-sql", "query-pattern"],
  "sl_refs": ["<clean_source_name>"],
  "source": "historic-sql",
  "intent": "<human-readable canonical intent>",
  "tables": ["<tables_touched>"],
  "representative_sql": "<parameterized representative SQL>",
  "usage": {
    "executions": 47812,
    "distinct_users": 12,
    "first_seen": "2026-02-01",
    "last_seen": "2026-04-30",
    "p50_runtime_ms": 320,
    "p95_runtime_ms": 1180,
    "error_rate": 0.0007
  },
  "fingerprints": ["<fingerprint or sub-cluster id>"],
  "content": "## <Canonical Intent Title>\n\n### Parameters\n- <constant/runtime/categorical slot notes>\n\n### When To Use\n- <concise reusable guidance>\n\n### Caveats\n- <redaction, service-account, low-confidence, or mapping notes if present>"
}

For Snowflake templates include usage.rows_produced when present in usage.json; for BigQuery v1 omit usage.rows_produced.

The key: "queries/<intent_slug>" value writes to knowledge/global/queries/<intent_slug>.md during external ingest because bundle ingests write global wiki pages.

Representative SQL Rules

  • Start from normalized SQL in page.md.
  • For constant slots, use the dominant usage.literal_slots[].top_values[0][0] when it has definitional meaning. Quote string and date values in the representative SQL.
  • For runtime slots, render named parameters such as :start_date, :as_of, :status, or :threshold.
  • For categorical slots, document the known categories and write this WorkUnit's sub-cluster value when sub_cluster_id is present.
  • Preserve the warehouse dialect named by metadata.properties.dialect.
  • Do not copy sample bound_sql into the wiki unless it is visibly redacted and safer than the normalized SQL. Prefer normalized SQL plus parameter notes.

SL Proposal Threshold

Only propose semantic-layer changes when all are true:

  1. This WorkUnit reached Stage 3 full tier. The runner normally guarantees this, but treat executions_bucket=low plus distinct_users_bucket=solo or service_account_only=true as a reason to write wiki only.
  2. At least one literal_slots[] entry has classification: "constant" and the value has durable business meaning, such as a status, plan tier, channel, threshold, or fixed category.
  3. Every table in tables_touched maps cleanly through sl_discover to an existing SL source.

When the threshold passes:

  • Call sl_read_source before editing an existing source.
  • Prefer adding a measure, segment, computed dimension, join, or manifest-backed overlay over creating a standalone SQL source.
  • Use sl_write_source for a manifest-backed overlay only with name: plus additive fields such as measures:, segments:, description:, or joins:. Do not include sql:, table:, grain:, or columns: on manifest-backed overlays.
  • Use sl_edit_source for targeted edits when the source file already exists.
  • Run sl_validate after every SL write or edit.
  • Keep runtime parameters as caller filters. Do not bake dates, user ids, ids, search strings, or other runtime slots into SL measures.

When the threshold does not pass, write the wiki page and set sl_refs for any cleanly discovered touched tables. A wiki-only result is valid.

Intent Inference Guidance

Prefer canonical intent names that describe the business question, not the SQL shape:

  • Good: queries/monthly-paid-order-count
  • Good: queries/enterprise-contract-renewal-risk
  • Good: queries/support-ticket-first-response-time
  • Weak: queries/fp-1
  • Weak: queries/count-orders-group-by-date

Use the SQL shape to infer intent:

  • COUNT, SUM, AVG, MIN, MAX, GROUP BY, and date truncation usually indicate metrics or rollups.
  • Constant slots often name segments, statuses, tiers, regions, or thresholds.
  • Runtime slots usually represent time windows, selected entities, or caller filters.
  • Repeated successful samples from multiple human users make the page more durable.
  • High error rates, service-account-only use, or old last_seen values belong in caveats.

Boundaries

  • Do not group sibling templates. Stage 4 historic_sql_curator owns cross-template clustering and dedupe.
  • Do not read paths listed only in peerFileIndex.
  • Do not create or update historic_sql_curator.
  • Do not call context_candidate_write; historic-SQL Stage 3 writes final wiki and optional SL artifacts directly.
  • Do not invent joins, measures, or definitions that are not supported by the normalized SQL, touched tables, literal slots, or existing SL sources.
  • Do not copy unredacted sample bound_sql, user emails, account ids, tokens, or free-text literal values into wiki or SL output.
  • Do not write SL changes when any touched table lacks a clean SL mapping.
  • Do not finish after only an SL write. Always write or update the query knowledge page first so the canonical SQL pattern is searchable.