15 KiB
| name | description | callers | |
|---|---|---|---|
| metabase_ingest | Convert Metabase questions, models, and metrics into KTX Semantic Layer source definitions. Covers result-metadata to KSL column type mapping, FK/PK detection, near-duplicate deduplication, pre-aggregation decomposition, join-graph connectivity, and how to react to priorProvenance from earlier ingest syncs. Load when the WorkUnit contains `cards/<id>.json` files under a Metabase bundle. |
|
Metabase to KTX Semantic Layer
Each WorkUnit represents one Metabase collection's cards for one Metabase database (mapped to exactly one KTX connection). Every cards/<id>.json file carries the resolved SQL, result_metadata, card type, collection path, and referenced-card ids. The WU's sync-config.json tells you which sync mode is active and which selections apply. databases/<id>.json tells you the target KTX connection.
Context format
Each card JSON looks like:
{
"metabaseId": 7,
"name": "Daily orders",
"description": "Orders by day",
"type": "model",
"databaseId": 42,
"collectionId": 5,
"resolvedSql": "SELECT ...",
"templateTags": [{"name": "ref", "type": "card", "cardReference": 10}],
"resultMetadata": [
{"name": "day", "base_type": "type/DateTime", "semantic_type": "type/CreationTimestamp"},
{"name": "order_count", "base_type": "type/Integer"}
],
"collectionPath": ["Data", "Orders Team"],
"referencedCardIds": [10]
}
Use resultMetadata to:
- Map
base_typeto KSL column type:type/Integer,type/Float,type/Decimal,type/BigInteger→number;type/Text,type/TextLike→string;type/DateTime,type/Date,type/DateTimeWithTZ→time;type/Boolean→boolean. - Identify grain candidates: columns with
semantic_type: type/PK. - Identify join candidates: columns with
semantic_type: type/FKplusfk_target_field_id. - Identify time columns:
semantic_type: type/CreationTimestamportype/UpdatedTimestamp→ setrole: time. - Use
display_namefor measure descriptions when available.
Additional card metadata
parameters: list of card-level parameters with widget types and defaults. When SQL resolution fell back to unresolved SQL, use this to drive Step A of the SQL-translation workflow (drop optional clauses): knowing each{{ var }}istype: "date/range"vstype: "category"tells you what kind of clause it is.resultMetadata[i].field_ref: Metabase's canonical reference to the source warehouse field. Shape["field", <field_id>, <options>]. When this is set, the column maps directly to a warehouse field, which is useful for declaring joins from FK metadata without re-parsing SQL.lastRunAt: ISO timestamp of the card's last execution. If null or very old, the card may be dead; prefer skipping over creating a source.dashboardCount: number of dashboards referencing the card. Cards withdashboardCount: 0and a stalelastRunAtare strong skip signals.
Decision tree
For each card:
- Analyze
resolvedSql+resultMetadata: identify base tables, aggregations, joins, filters, column types. - Check
sl_discoverandsl_read_sourcefor existing sources that overlap. - Decide:
- Simple aggregation on a table that already has a source →
sl_edit_sourceto add a measure. - Join between tables that should be linked in the SL graph →
sl_edit_sourceto add a join. - Complex derived SQL (CTEs, multi-layer aggregation, scoring models) →
sl_write_sourcewithsource_type: sql. When the SQL projects/filters from a single manifest-backed base table, setinherits_columns_from: <manifest_key>so columns inherit type and description from the manifest — seesl_captureskill for the slim form. Usesl_discoverto discover the manifest key from the table reference in the SQL (it acceptsMARTS.CONSIGNMENTS,ANALYTICS.MARTS.CONSIGNMENTS, orCONSIGNMENTS). - New base table not yet in the semantic layer →
sl_write_sourcewithsource_type: table. - Trivial query (
SELECT *, simpleCOUNT(*)with no business logic) → do nothing; the runner will record this card asaction_type='skipped'. - Duplicate of an existing measure → same as trivial; do nothing for this card.
- Simple aggregation on a table that already has a source →
Manifest-only names need an overlay first. If sl_discover shows a source name with Type: table but sl_read_source returns "Source not found", the source lives only in the schema manifest (no standalone overlay yet). sl_edit_source cannot edit manifest-only names — you must bootstrap an overlay with sl_write_source using the overlay shape:
name: <SOURCE_NAME>
measures:
- name: <measure_name>
expr: "<expression>"
Overlay shape: name: plus any of measures:, segments:, description:, joins:, disable_joins:. Never include sql:, table:, grain:, or columns: on a manifest-backed name — those would shadow the manifest's schema and drop its joins. Overlay joins: are merged additively with the manifest's joins (deduped by to + on); use disable_joins: ["<on-clause>"] to suppress a specific manifest join. After the overlay exists, use sl_edit_source for further tweaks. See sl_capture skill for the canonical overlay rule.
Join discovery: When your card's SQL references warehouse tables (e.g. in FROM or JOIN clauses), call sl_discover({ query: '<table>' }) before writing. The matching manifest entry's name is the value you put in joins: [- to: <name>]. Use many_to_one for FK-to-dimension joins, one_to_many for the reverse.
priorProvenance
If the WU prompt includes a priorProvenance section for a card, it tells you what happened on prior ingest syncs. Treat it as advisory:
action_type: source_createdon source X → prefer editing X withsl_edit_sourcerather than writing a new source.action_type: measure_addedon source X → you already contributed to X; add only measures that aren't present.action_type: subsumedormerged→ this card was folded into another source last time; unless its SQL has changed structurally, keep it subsumed (no new write).action_type: skipped→ last time we decided not to ingest this card; re-read the SQL and confirm the decision still holds. If the card now has non-trivial business logic, ingest it.
Deduplication
Before writing, scan all cards in this WU for near-duplicate groups — cards whose resolvedSql shares the same CTEs, base tables, joins, and aggregation structure but differs only in:
- Trailing filters (e.g.
date_trunc(week, date)vsdate_trunc(month, date)). - Minor
WHEREclause variations. - Column aliases or output column subsets.
- Aggregation granularity (daily vs weekly vs monthly).
When you find a group of near-duplicates:
- Create ONE generalized source from the most comprehensive card in the group.
- Strip card-specific trailing filters from the SQL so the source covers all variants (e.g. keep daily grain instead of filtering to week/month).
- If each card had a distinct measure or filter, add them as separate measures on the single source.
- For all cards except the canonical one, do nothing — they'll be recorded as
action_type='skipped'automatically by the runner.
Do NOT merge cards with fundamentally different business logic, even if they share CTEs.
Pre-aggregation decomposition
When a card's resolvedSql contains GROUP BY with aggregation functions (SUM, COUNT, AVG, …):
- Detect: simple aggregation on base tables/joins —
SELECTwithGROUP BY, no complex CTEs or window functions. - Decompose: strip the
GROUP BYand aggregation functions. KeepFROM,JOIN, andWHEREintact. - Expose row-level columns: include the grouped-by columns AND the raw columns being aggregated (e.g.
money_outinstead ofSUM(money_out) AS total_money_out). - Define aggregations as measures: convert each aggregation into a KSL measure (e.g.
sum(money_out)). - Add joins: with FK columns now exposed, declare joins to dimension sources.
Exception: keep the pre-aggregated SQL when the query involves multi-CTE pipelines, window functions, or recursive logic where decomposition would lose business logic.
SQL translation from raw native to KSL
Every card carries a resolvedSql field. Check the staged card's resolutionStatus first:
resolutionStatus: "resolved"—{{#N}}references are inlined and[[ ... ]]optional clauses have been dropped locally. If the resolved SQL contains no other parameters the SQL is executable as-is. If the card had required (non-bracketed){{ var }}placeholders, the SQL is prefixed with a placeholder-warning comment block listing every dummy substitution Metabase made — see "Step A" below.resolutionStatus: "fallback"— Metabase failed to resolve. The SQL still contains{{#N}},{{#N-name}} alias,{{ var }}, and[[ ... ]]syntax. Do the translation steps below before writing a source.
Step A — Handle dummy-substituted placeholders (resolved cards only)
When a card has a required {{ var }} outside any [[ ]] block, the resolver substitutes a dummy value purely so Metabase's parser will accept the query. The resulting SQL is prefixed with a comment like:
-- PLACEHOLDER_WARNING: this SQL was extracted from a Metabase card with
-- unbound template parameters. The placeholders below were substituted with DUMMY
-- values to satisfy Metabase's parser — they DO NOT represent intended filters.
-- Drop the corresponding clauses (or expose them as runtime SL filters) before
-- persisting this SQL as a semantic-layer source.
-- {{ auction_end }} (type=dimension, widget=date/all-options) → '2020-01-01~2020-12-31'
-- {{ status }} (type=text) → 'placeholder'
SELECT ...
WHERE start_date >= '2020-01-01' AND start_date < '2021-01-01' AND status = 'placeholder'
For each listed placeholder: locate the WHERE clause(s) in the SQL that reference the dummy literal and drop them, then strip the warning comment. SL chat-time filters compose narrowing predicates dynamically, so the source should represent the unfiltered dataset.
For fallback cards, dropping is simpler — the SQL still has the [[ ... ]] brackets and {{ var }} placeholders intact:
-- before:
WHERE 1=1
[[AND {{ auction_end }} ]]
[[AND status = {{ status }} ]]
-- after:
WHERE 1=1
Step B — Inline {{#N}} references (fallback cards only)
Resolved cards already have {{#N}} inlined for you. For fallback cards, each {{#N}} (or {{#N-some-slug}}) in the SQL refers to another card's resolvedSql. The referenced card is in the WU's rawFiles or dependencyPaths. Read it with read_raw_file, then inline its SQL.
If the reference has an alias (from {{#5996-listing-interactions}} tb), the outer SQL probably uses that alias (select tb.* ..., tb.column_name, etc.). When you inline, you must EITHER:
- Pick a single base table inside the inlined SQL and rename its alias to the outer alias. Useful when the inlined card is
SELECT * FROM listings JOIN ...— set the LISTINGS alias totbandtb.*keeps working in the outer query. - Replace the outer alias references with explicit columns from the inlined SQL. Useful when the inlined card has multiple JOINs and
tb.*is ambiguous.
Never leave the outer alias dangling: after inlining, grep your SQL for the outer alias name and rewrite or remove every reference. A leftover tb.* with no tb table is the most common failure mode here.
Step C — Inlining cleanup checklist
After Steps A and B, your SQL must:
- Contain no placeholder-warning comment, no
{{,}},[[, or]]characters anywhere. - Reference no aliases that aren't defined inside the SQL itself.
- Be valid as a standalone subquery (the validator runs
SELECT * FROM (your_sql) LIMIT 1).
If resolutionStatus: "fallback" and the SQL is still complex enough that you can't confidently translate it, skip the card rather than writing broken SQL. Call emit_unmapped_fallback with the staged card path as rawPath, reason: "metabase_sql_untranslated", and fallback: "flagged".
Join-graph connectivity
For source_type: table:
- Use FK columns (
semantic_type: type/FK) to declaremany_to_onejoins to dimension sources. - Match column names ending in
_idagainst existing sources' grain columns.
For source_type: sql:
- The validator parses your SQL and rejects the write if any FROM/JOIN table has a manifest entry that you did not declare in
joins:. The error names every missing join target — declare amany_to_onejoin for each and reissue. - Tables outside the manifest (schemas not covered by this connection — e.g.
staging.*referenced from a MARTS source) are not flagged. For those, write a single-linewiki_writewith keyunmapped-table-<table_name>so the gap is documented, then callemit_unmapped_fallbackwith the staged card path asrawPath,reason: "table_outside_manifest", andfallback: "wiki_only".
Joins on manifest-backed names compose: the manifest's joins are inherited automatically, and any overlay joins: are merged on top (deduped by to + on). Use disable_joins: ["<on-clause>"] in the overlay to suppress a specific manifest join. If sl_discover shows a manifest-backed source with Joins: 0 and the warehouse FK metadata is genuinely absent, declaring application-level joins via the overlay is fair game — bootstrap with sl_write_source (overlay shape above), then refine via sl_edit_source.
Cross-card references ({{#N}})
Resolved cards (resolutionStatus: "resolved") have these inlined for you. Unresolved cards (resolutionStatus: "fallback") need manual handling — see "SQL translation from raw native to KSL" above.
Provenance markers
Every SL source and wiki page you write carries HTML-comment provenance tags pointing to the cards/<id>.json files they derive from:
# <!-- from: raw-sources/<connId>/metabase/<syncId>/cards/7.json -->
name: orders
...
If a source is derived from multiple cards (e.g. a generalized source for a near-duplicate group), emit one tag per contributing card.
Quality standards
Source definitions must follow ktx-sl YAML conventions:
source_type:"table"(physical table/view) or"sql"(arbitrary SQL / derived view).table: required whensource_type: "table"(e.g."public.orders").sql: required whensource_type: "sql".grain: what one row represents (e.g.[id],[customer_id, product_id]).columns: all columns with correct types (string,number,time,boolean).- Time columns: mark with
role: time. joins: use correctrelationshiptypes (many_to_onefor FK→PK,one_to_manyfor reverse).joins.on:local_column = TARGET_SOURCE.target_column— the right side MUST include the target source name.measures.expr: aggregation expression (e.g."sum(amount)"); optionalfilterfor business rules; requireddescription.
Measure naming: descriptive snake_case (e.g. total_revenue, avg_order_value).
Rules
- Prefer adding measures to existing sources over creating new ones.
- Before editing, always
sl_read_sourcethe source to check for existing measures. - Don't duplicate measures (same aggregation on the same column).
- If two measures differ only by a filter (e.g.
revenuevspaid_revenue), they are distinct. - Use the card's
name+descriptionto write meaningful measure descriptions. - When multiple cards in a WU are near-duplicates, create ONE generalized source; the runner will skip the rest automatically.
- Process every card in the WU — don't stop early.