Captures the design for replacing the current per-query, per-dialect historic-SQL pipeline with a unified, LLM-driven ingest that feeds both ktx wiki search (pattern pages) and ktx sl search (per-table usage in _schema shards). Spec covers motivation, principles, architecture, hot and cold paths, search-surface plumbing, schemas, configuration, cutover plan, and rejected alternatives. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
35 KiB
Historic SQL Ingestion — Redesign
Status: draft Date: 2026-05-11 Owner: Andrey Avtomonov
1. Motivation
The current historic-SQL ingestion adapter (packages/context/src/ingest/adapters/historic-sql/) is slow, complex, and structurally cannot answer the questions a research/BI agent actually asks.
Concrete pain points observed:
- A full run takes 30+ minutes against a tiny demo Postgres database. The hot loop calls
SqlAnalysisPort.analyzeForFingerprint()once per query via HTTP to the Python daemon, so thousands of RPC round-trips dominate runtime. - Two completely different code paths for Postgres (baseline-diff against
pg_stat_statements) versus BigQuery/Snowflake (timestamp cursor overINFORMATION_SCHEMA.JOBS/QUERY_HISTORY). Postgres further cannot produce the same outputs as the others (no per-execution samples, no literal-slot bindings, error rate stuck at zero). - The output is fingerprint-fragmented: the pipeline emits one document per fingerprint, expands categorical literal slots into per-value sub-clusters, and ranks templates with a recency-decayed score. The result is many near-duplicate documents per fingerprint and gratuitous churn across runs.
- The output is rigid and shallow: deterministic slot classification (constant / categorical / runtime) and triage-signal buckets do not produce narrative an agent can use. The current downstream skills (
historic_sql_ingest,historic_sql_curator) try to recover narrative from these templates but at high cost. - Lots of moving parts (baseline files, reset detection, atomic per-connection commit, slot heuristics, ranking formula) for what is fundamentally "find interesting queries and tell agents about them."
The end goal — per the user — is for ingested content to be searchable by ktx wiki search and ktx sl search to help consumer research agents do data analysis and agentic BI.
2. Design principles
- LLMs are the right tool for narrative and clustering. Deterministic heuristics (slot classification, ranking formulas, categorical expansion) get replaced by LLM judgement applied to aggregated, bucketed inputs.
- The adapter stays LLM-free. The existing convention — adapters are deterministic, skills do LLM work — is preserved.
- One pipeline across dialects. A single reader interface, a single staging shape, a single set of skills. Dialect-specific behavior lives only in the snapshot query.
- No work where no signal changed. Daily reruns should LLM only the things that actually changed.
- Lean context for caller agents. Each retrieval tier (search hit → source read → pattern read) carries only what the agent needs to make the next decision. The principle lives in prompt instructions, not in defensive schema constraints.
- Simplification over backward compatibility. Hard cutover, delete the old code path, no parallel implementations.
3. Architecture
┌────────────────────────── LLM-free, deterministic ─────────────────────────┐
Reader (unified) ─▶ Aggregated snapshot ─▶ Batch SQL parse ─▶ Bucket by table
│
▼
Staged dir:
manifest.json
tables/{schema}.{name}.json (one per touched table)
patterns-input.json
│
▼
chunk() → WorkUnits
│
┌───────────────────────────────────────────────┴────────────────────────────┐
▼ ▼
┌────── LLM via skill ──────┐ ┌────── LLM via skill ──────┐
│ historic_sql_table_digest │ (N WorkUnits, parallel) │ historic_sql_patterns │
│ produces TableUsage │ │ produces Pattern[] │
│ evidence per table │ │ evidence │
└───────────────────────────┘ └───────────────────────────┘
│ │
└──────────────────────────┬───────────────────────────────────────────────────┘
▼
onPullSucceeded() projection (no LLM):
Pass A — merge `usage` into _schema/{shard}.yaml (per-shard atomic, scan-managed keys)
Pass B — write/update pattern wiki pages (slug stability + stale handling)
Pass C — trigger SL search re-index for changed sources
4. Hot path (LLM-free)
4.1 Unified reader interface
interface HistoricSqlReader {
probe(client: HistoricSqlQueryClient): Promise<HistoricSqlProbeResult>;
fetchAggregated(
client: HistoricSqlQueryClient,
window: { start: Date; end: Date },
): AsyncIterable<AggregatedTemplate>;
}
AggregatedTemplate is one record per template, already aggregated by the warehouse. Schema in §9.
Trailing-window only. No cursor, no baseline file. Every run reads "what was hot in the last N days." Idempotency comes from per-WorkUnit content hashing via the framework's DiffSetComputerPort.
4.2 Snapshot queries (one per dialect)
Postgres — pg_stat_statements collapsed to queryid:
SELECT queryid::text AS template_id,
query AS canonical_sql,
SUM(calls)::bigint AS executions,
COUNT(DISTINCT userid) AS distinct_users,
SUM(total_exec_time) / NULLIF(SUM(calls), 0) AS mean_ms,
SUM(total_rows)::bigint AS rows_produced
FROM pg_stat_statements
WHERE toplevel = true
GROUP BY queryid, query
HAVING SUM(calls) >= @min_executions
firstSeen derives from pg_stat_statements_info.stats_reset; lastSeen is now(). p50RuntimeMs / p95RuntimeMs collapse to mean_ms. errorRate = 0 (PG doesn't track failures in PGSS).
BigQuery — warehouse-side aggregation over INFORMATION_SCHEMA.JOBS_BY_PROJECT:
SELECT query_hash AS template_id,
MIN(query) AS canonical_sql,
COUNT(*) AS executions,
COUNT(DISTINCT user_email) AS distinct_users,
MIN(creation_time) AS first_seen,
MAX(creation_time) AS last_seen,
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(50)] AS p50_ms,
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(95)] AS p95_ms,
SAFE_DIVIDE(COUNTIF(error_result IS NOT NULL), COUNT(*)) AS error_rate
FROM `{project}.region-{region}.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE job_type = 'QUERY'
AND statement_type IN ('SELECT', 'MERGE')
AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL @window_days DAY)
GROUP BY query_hash
HAVING COUNT(*) >= @min_executions
Snowflake — analogous, over SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY:
SELECT query_hash AS template_id,
MIN(query_text) AS canonical_sql,
COUNT(*) AS executions,
COUNT(DISTINCT user_name) AS distinct_users,
MIN(start_time) AS first_seen,
MAX(start_time) AS last_seen,
APPROX_PERCENTILE(total_elapsed_time, 0.50) AS p50_ms,
APPROX_PERCENTILE(total_elapsed_time, 0.95) AS p95_ms,
DIV0(COUNT_IF(execution_status != 'SUCCESS'), COUNT(*)) AS error_rate,
SUM(rows_produced) AS rows_produced
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_text IS NOT NULL
AND query_type IN ('SELECT', 'MERGE')
AND start_time >= DATEADD(day, -@window_days, CURRENT_TIMESTAMP())
GROUP BY query_hash
HAVING COUNT(*) >= @min_executions
4.3 Batch SQL parse
After collecting all AggregatedTemplate rows, one call to a new daemon endpoint:
const parsed = await sqlAnalysis.analyzeBatch(
templates.map(t => ({ id: t.templateId, sql: t.canonicalSql })),
dialect,
);
// → Map<templateId, { tablesTouched: string[], columnsByClause: Record<Clause, string[]>, error?: string }>
The endpoint is implemented in python/ktx-daemon and uses sqlglot internally with ProcessPoolExecutor parallelism over the batch. Replaces the per-query HTTP roundtrip pattern that dominates today's runtime.
Per-row parse failures are non-fatal: the template loses table grounding (excluded from per-table bucketing and from patterns) but the failure is logged to manifest.warnings as parse_failed:<templateId>.
4.4 Filtering (three layers)
Layer A — Warehouse-side (in the SQL above):
- Noise prefixes (
SHOW,DESCRIBE,EXPLAIN,USE,SET). - System catalogs (
INFORMATION_SCHEMA,SNOWFLAKE.ACCOUNT_USAGE,pg_*,system.*). - DDL / non-analytical statement types via
statement_type/query_typecolumns (PG falls back to prefix regex). - Trivial probes (
SELECT 1,SELECT NOW(),SELECT VERSION()) — configurable. - Minimum executions threshold (
@min_executions, default 5). - Trailing window (
@window_days, default 90) — BQ/SF only.
Layer B — Post-fetch, in-memory:
- Service-account exclusion/inclusion via configurable regex patterns; three modes (
excludedefault,include,mark-only). - Orchestrator boilerplate (dbt/Looker/Metabase markers) — default
mark-only(do not drop; dbt-generated queries are often the actual business logic). - Failed-query filter (BQ/SF only): templates with
errorRate > 0.9 AND executions < 10.
Layer C — Post-parse:
- Zero-table templates (parsed cleanly but touch no real tables) are dropped from per-table bucketization and from patterns.
4.5 Bucketize by table
In-memory pass: a single template touching N tables ends up in N table buckets.
4.6 Staged artifacts
{stagedDir}/
manifest.json
tables/
{schema}.{name}.json # one per touched table
patterns-input.json
manifest.json is small (summary, window, counts, warnings — schema in §9).
tables/{schema}.{name}.json contains bucketed content so that DiffSet content hashes are stable when nothing material changed:
{
"table": "public.orders",
"stats": {
"executionsBucket": "1k-5k",
"distinctUsersBucket": "5-10",
"errorRateBucket": "low",
"p95RuntimeBucket": "100ms-1s",
"recencyBucket": "current"
},
"columnsByClause": {
"select": [["amount","high"], ["status","high"]],
"where": [["status","high"], ["created_at","mid"]],
"join": [["customer_id","high"]],
"groupBy": [["status","low"]]
},
"observedJoins": [
{ "withTable": "public.customers", "on": ["customer_id"], "freq": "high" },
{ "withTable": "public.line_items", "on": ["order_id"], "freq": "high" }
],
"topTemplates": [
{ "id": "...", "canonicalSql": "...", "topUsers": [...] }
]
}
patterns-input.json contains every template in compact form (id, canonicalSql, tablesTouched, executionsBucket, distinctUsersBucket, dialect). Pulls double duty as the patterns skill input and as the audit log; no separate templates.jsonl.
Bucket bands are defined deterministically in code (e.g. executionsBucket: <10, 10-100, 100-1k, 1k-5k, 5k-50k, >50k). Exact thresholds set during implementation; the principle is that small fluctuations don't change the bucket.
4.7 chunk() (trivial, convention-following)
One WorkUnit per tables/*.json file (handled by historic_sql_table_digest) + one WorkUnit referencing patterns-input.json (handled by historic_sql_patterns). No custom diff logic — the framework's DiffSetComputerPort already filters to changed files.
5. Cold path (LLM, via skills)
Both skills produce evidence; the adapter's onPullSucceeded() projects evidence to its final homes. This avoids write contention between parallel skill invocations on the same shard file.
5.1 historic_sql_table_digest
One invocation per changed table's WorkUnit. Input: the table's staged JSON plus dependency reference to the existing _schema entry (so the LLM sees the actual column list and doesn't hallucinate).
Prompt cache split (cacheControl: { type: 'ephemeral', ttl: '5m' }, auto-bump to '1h' when the run is expected to exceed ~4 minutes wall clock):
- Cached prefix: role, output JSON schema generated from
tableUsageOutputSchemavia Zod 4'sz.toJSONSchema(), extraction rules, 1–2 few-shot examples. - Variable suffix: table name, existing columns list, existing AI description, staged usage input.
Output schema (zod, in historic-sql/skill-schemas.ts):
export const tableUsageOutputSchema = z.object({
narrative: z.string(),
frequencyTier: z.enum(['high', 'mid', 'low', 'unused']),
commonFilters: z.array(z.string()),
commonGroupBys: z.array(z.string()).optional(),
commonJoins: z.array(z.object({
table: z.string(),
on: z.array(z.string()),
})),
staleSince: z.iso.datetime().nullable().optional(),
});
No hard length/cap constraints in the schema. Concision is a behavioral instruction in the prompt prefix.
Concurrency: runWithConcurrency() from packages/context/src/scan/description-generation.ts:147 (the same utility scan-description uses). Default 12, configurable in ktx.yaml.
Idempotency: when tables/{name}.json's content hash hasn't changed (bucketed stats stable), DiffSet marks the file unchanged, no WorkUnit is emitted, no LLM call happens. Steady-state daily runs LLM only the meaningfully changed tables.
5.2 historic_sql_patterns
One invocation per run (or a small handful if patterns-input.json exceeds a context budget — split deterministically by tablesTouched cardinality stratification).
Prompt: identifies recurring analytical intents that span ≥2 tables with ≥mid executionsBucket and ≥2-5 distinct users. Output is a list of PatternOutput.
Output schema:
export const patternOutputSchema = z.object({
slug: z.string(),
title: z.string(),
narrative: z.string(),
definitionSql: z.string(),
tablesInvolved: z.array(z.string()),
slRefs: z.array(z.string()),
constituentTemplateIds: z.array(z.string()),
});
Cache control: skip. Single call per run; cache write premium doesn't amortize.
Slug stability across runs: the projection step (§5.3) does a deterministic similarity check against existing pattern pages. For each new pattern, find an existing slug whose tablesInvolved ∪ constituentTemplateIds overlap ≥60% with the new one and reuse it; else mint a new slug. Pure post-process, no LLM call.
5.3 Projection inside onPullSucceeded()
After all skills complete and evidence is committed, run two passes. Both are pure data transformations, no LLM calls.
Pass A — _schema shard reconciliation:
- Collect all
historic_sql_table_usageevidence written this run. - Group by
shardKey(catalog.schema). - For each shard:
- Load existing
_schema/{shardKey}.yaml. - For each table entry: if new evidence exists, merge under
usageviamergeUsagePreservingExternal()(onlyhistoricSql-managed keys touched; user-added keys preserved — same pattern asmergeDescriptionsPreservingExternalatlocal-enrichment-artifacts.ts:237-242). - For tables previously present with
historicSql-managedusagebut absent from this run's snapshot: setusage.staleSince = lastSnapshotSeenAt, clear other historicSql-managed fields. - Atomic write to
_schema/{shardKey}.yaml.
- Load existing
- Trigger SL search re-index for changed sources via the existing flow (
sl-search.service.ts:91-99detects search-text drift).
Pass B — wiki pattern pages:
- Collect all
historic_sql_patternevidence written this run. - Load existing wiki pages with tags
['historic-sql', 'pattern']for this connection. - Run slug-stability matching.
- For each pattern (existing or new):
- Build
LocalKnowledgePagewithkey: historic-sql/{slug},scope: GLOBAL,tags: ['historic-sql', 'pattern'],slRefsto relevant SL sources,refsto other historic-sql pages. writeLocalKnowledgePage(...).
- Build
- For existing patterns not seen this run: append frontmatter
stale_since: {today}and addtag: stale. Don't delete; preserve for historical lookups. - After
staleArchiveAfterDaysthreshold (default 90 days, configurable): move the page key underhistoric-sql/_archived/and addtag: archived.
6. Search-surface plumbing
6.1 ktx wiki search — no plumbing required
Pattern pages are written to knowledge/global/historic-sql/{slug}.md and are discovered by the existing searchLocalKnowledgePages() walk. Tags ['historic-sql', 'pattern'] enable faceted search.
6.2 ktx sl search — small extension
6.2.1 — SemanticLayerSource.usage field
Add an optional usage field to SemanticLayerSource in packages/context/src/sl/schemas.ts, reusing the same tableUsageOutputSchema from skill-schemas.ts. Single source of truth end-to-end.
6.2.2 — _schema → SemanticLayerSource projection carries usage
The existing projection step in local-sl.ts (or wherever the manifest reader builds SemanticLayerSource objects) needs one new field copy: entry.usage → source.usage.
6.2.3 — buildSemanticLayerSourceSearchText() extension
Extend the function at sl-search.service.ts:8-74 to include usage content in the FTS5/embedding text:
if (source.usage) {
const u = source.usage;
parts.push(`usage: ${u.narrative}`);
parts.push(`frequency: ${u.frequencyTier}`);
if (u.commonFilters?.length) parts.push(`commonly filtered by: ${u.commonFilters.join(', ')}`);
if (u.commonGroupBys?.length) parts.push(`commonly grouped by: ${u.commonGroupBys.join(', ')}`);
for (const j of u.commonJoins ?? []) {
parts.push(`commonly joined to ${j.table} on ${j.on.join(',')}`);
}
if (u.staleSince) parts.push(`stale since ${u.staleSince}`);
}
6.2.4 — Re-index trigger
Already wired. Per-source content-hash detection at sl-search.service.ts:91-99 ensures only sources whose usage changed re-embed.
6.2.5 — Query-mode result enrichment
Extend the search result shape returned by agent sl list --query to include score and an FTS5 snippet() per hit. Implementation: small SQL change in sqlite-sl-sources-index.ts to select snippet(local_sl_sources_fts, ...) alongside the source row.
Result shape becomes:
{
"connectionId": "warehouse",
"name": "public.orders",
"table": "orders",
"columnCount": 12,
"measureCount": 3,
"joinCount": 2,
"description": "...",
"score": 0.81,
"frequencyTier": "high",
"snippet": "commonly filtered by <mark>status</mark>, joined to customers"
}
The full usage block lives in the SemanticLayerSource returned by agent sl read <name>.
7. Three-tier retrieval model
| Tier | Surface | What an agent gets |
|---|---|---|
| Search hit | agent sl list --query "..." |
name, table, counts, description, score, frequencyTier, snippet |
| Source read | agent sl read <name> |
full SemanticLayerSource YAML including columns, measures, joins, and usage block |
| Pattern read | agent wiki read historic-sql/{slug} |
title, narrative, canonical SQL, tables involved, slRefs |
Agents pull deeper only when they need to. The bytes per tier are governed by prompt-side concision instructions, not by schema constraints.
8. Configuration
Per-connection block in ktx.yaml:
connections:
warehouse:
driver: postgres
connectionUrl: postgres://...
historicSql:
enabled: true
# everything below is optional; defaults from the zod schema
windowDays: 90
minExecutions: 5
concurrency: 12
filters:
serviceAccounts:
patterns: ['^etl-', '@bot\.']
mode: exclude # exclude | include | mark-only
orchestrators:
mode: mark-only # include | exclude | mark-only
dropTrivialProbes: true
dropFailedBelow:
errorRate: 0.9
executions: 10
redactionPatterns: ['password', 'api_key']
staleArchiveAfterDays: 90
CLI setup wizard (ktx setup) flags map onto this block. --historic-sql-min-calls is renamed --historic-sql-min-executions (cross-dialect clarity); both names accepted for one release.
Doctor command (ktx dev doctor) retains PG-specific validation: version ≥ 14, extension installed, pg_read_all_stats grant, pg_stat_statements.track != 'none'. The pg_stat_statements.max ≥ 5000 check is downgraded from a warning to an informational note (deallocation churn no longer threatens delta-tracking integrity, because there is no delta tracking).
9. Schemas (zod)
Lives in packages/context/src/ingest/adapters/historic-sql/types.ts unless noted.
export const historicSqlPullConfigSchema = z.object({
dialect: z.enum(['postgres', 'bigquery', 'snowflake']),
windowDays: z.number().int().positive().default(90),
minExecutions: z.number().int().nonnegative().default(5),
concurrency: z.number().int().positive().default(12),
filters: z.object({
serviceAccounts: z.object({
patterns: z.array(z.string()).default([]),
mode: z.enum(['exclude', 'include', 'mark-only']).default('exclude'),
}).optional(),
orchestrators: z.object({
mode: z.enum(['exclude', 'include', 'mark-only']).default('mark-only'),
}).optional(),
dropTrivialProbes: z.boolean().default(true),
dropFailedBelow: z.object({
errorRate: z.number(),
executions: z.number().int(),
}).optional(),
}).optional(),
redactionPatterns: z.array(z.string()).default([]),
staleArchiveAfterDays: z.number().int().positive().default(90),
});
export const aggregatedTemplateSchema = z.object({
templateId: z.string(),
canonicalSql: z.string(),
dialect: z.enum(['postgres', 'bigquery', 'snowflake']),
stats: z.object({
executions: z.number().int(),
distinctUsers: z.number().int(),
firstSeen: z.iso.datetime(),
lastSeen: z.iso.datetime(),
p50RuntimeMs: z.number().nullable(),
p95RuntimeMs: z.number().nullable(),
errorRate: z.number(),
rowsProduced: z.number().int().nullable(),
}),
topUsers: z.array(z.object({
user: z.string().nullable(),
executions: z.number().int(),
})),
});
export const stagedTableInputSchema = z.object({
table: z.string(),
stats: z.object({
executionsBucket: z.string(),
distinctUsersBucket: z.string(),
errorRateBucket: z.string(),
p95RuntimeBucket: z.string(),
recencyBucket: z.string(),
}),
columnsByClause: z.record(z.string(), z.array(z.tuple([z.string(), z.string()]))),
observedJoins: z.array(z.object({
withTable: z.string(),
on: z.array(z.string()),
freq: z.string(),
})),
topTemplates: z.array(z.object({
id: z.string(),
canonicalSql: z.string(),
topUsers: z.array(z.object({ user: z.string().nullable() })),
})),
});
export const stagedPatternsInputSchema = z.object({
templates: z.array(z.object({
id: z.string(),
canonicalSql: z.string(),
tablesTouched: z.array(z.string()),
executionsBucket: z.string(),
distinctUsersBucket: z.string(),
dialect: z.enum(['postgres', 'bigquery', 'snowflake']),
})),
});
export const stagedManifestSchema = z.object({
source: z.literal('historic-sql'),
connectionId: z.string(),
dialect: z.enum(['postgres', 'bigquery', 'snowflake']),
fetchedAt: z.iso.datetime(),
windowStart: z.iso.datetime(),
windowEnd: z.iso.datetime(),
snapshotRowCount: z.number().int(),
touchedTableCount: z.number().int(),
parseFailures: z.number().int(),
warnings: z.array(z.string()),
probeWarnings: z.array(z.string()),
});
In packages/context/src/ingest/adapters/historic-sql/skill-schemas.ts — the single source of truth for LLM I/O shapes, imported by the prompt builder, the evidence parser, the projection step, the SemanticLayerSource type, and the _schema manifest entry type:
export const tableUsageOutputSchema = z.object({
narrative: z.string(),
frequencyTier: z.enum(['high', 'mid', 'low', 'unused']),
commonFilters: z.array(z.string()),
commonGroupBys: z.array(z.string()).optional(),
commonJoins: z.array(z.object({
table: z.string(),
on: z.array(z.string()),
})),
staleSince: z.iso.datetime().nullable().optional(),
});
export type TableUsageOutput = z.infer<typeof tableUsageOutputSchema>;
export const patternOutputSchema = z.object({
slug: z.string(),
title: z.string(),
narrative: z.string(),
definitionSql: z.string(),
tablesInvolved: z.array(z.string()),
slRefs: z.array(z.string()),
constituentTemplateIds: z.array(z.string()),
});
export const patternsArraySchema = z.array(patternOutputSchema);
export type PatternOutput = z.infer<typeof patternOutputSchema>;
Extensions to existing types:
packages/context/src/sl/schemas.ts—SemanticLayerSource.usage: tableUsageOutputSchema.optional().packages/context/src/ingest/adapters/live-database/manifest.ts—LiveDatabaseManifestTableEntry.usage?: TableUsageOutput.
The _schema/{shard}.yaml manifest version need not bump — usage is an additive, optional field. Validators must allow unknown future keys (audit during step 1 of §10).
10. Cutover plan
Hard cutover. No parallel codepaths. Single coordinated PR (or PR train).
10.1 Code that gets deleted
Within packages/context/src/ingest/adapters/historic-sql/:
stage.ts— rewrittenstage-pgss.ts— deleted (no baseline tracking)stage-pgss.test.ts,stage-pgss-golden.test.ts— deletedhistoric-sql.adapter.ts— rewrittenhistoric-sql.adapter.test.ts— rewrittenchunk.ts/chunk.test.ts— rewritten (becomes trivial)detect.ts/detect.test.ts— trivial updatepostgres-pgss-query-history-reader.ts— rewritten aspostgres-pgss-reader.ts; baseline-tracking code removedbigquery-query-history-reader.ts/snowflake-query-history-reader.ts— rewritten; cursor logic removed; warehouse-side GROUP BYtypes.ts— rewritten- new
skill-schemas.ts errors.ts— keep (probe errors); prune unused
Old skills historic_sql_ingest and historic_sql_curator — audit; if only consumed by historic-sql, delete.
expandCategoricalTemplates, classifySlot, rankTemplate, slot-related types — gone.
10.2 Existing artifacts
| Artifact | Where | Decision |
|---|---|---|
| Old per-template wiki pages | knowledge/global/... (legacy historic-sql-template tag or matching key prefix) |
One-time cleanup in onPullSucceeded() on first run after upgrade. Idempotent: subsequent runs no-op. |
| PG baseline files | .ktx/cache/historic-sql/{connectionId}/pgss-baseline.json |
Delete on first run. Cache; no signal lost. |
Old raw-sources/{connectionId}/historic-sql/{syncId}/ snapshots |
raw-sources/... |
Leave alone. Per-sync audit; framework handles retention. |
10.3 Ordering
- Foundations (independent, no behavioral change):
- Daemon
analyze-batchendpoint +SqlAnalysisPort.analyzeBatch()(old method still in place, unused). SemanticLayerSource.usagefield (no producer yet).LiveDatabaseManifestTableEntry.usagefield (no producer yet).mergeUsagePreservingExternal()utility + tests.
- Daemon
- Search enrichment (independent, ships an unrelated win):
buildSemanticLayerSourceSearchText()extension.- FTS5
snippet()+ score in query-mode results.
- New adapter (replaces old in a single commit per dialect):
- PG path first (smallest surface, has the doctor command for validation).
- BQ + SF together (share aggregation pattern).
- Skills + projection:
historic_sql_table_digest+historic_sql_patterns.onPullSucceededprojection passes.- One-time legacy cleanup.
- Delete the old codepath — same PR as step 3, ideally.
- Docs + setup wizard updates.
10.4 Verification before merging
- Demo DB end-to-end:
examples/postgres-historic/ingest completes in under 60 seconds (current 30-minute baseline becomes the regression bar). - Cross-dialect smoke: at least one run against each of PG / BQ / SF ends with non-empty
_schema/{shard}.yamlusageblocks and ≥0 pattern pages. - Idempotency: a second run immediately after the first produces zero
historic_sql_table_digestLLM calls. - Drift: a run where one table disappears from the snapshot sets
usage.staleSinceon that table's_schemaentry; reappearance clears it. - Search retrieval:
agent sl list --queryreturns hits with non-empty snippets;agent wiki search "<pattern slug>"returns the pattern page directly. - No old code paths:
git grep -E "stagePgStatStatementsTemplates|expandCategoricalTemplates|classifySlot|pgss-baseline"returns zero results. - Doctor still passes on a properly configured PG with the new adapter.
10.5 Out of scope
- Embedding-based pattern clustering (rejected in favor of LLM-driven intent detection).
- Wiki shard pages (rejected — patterns are sparse; per-page is correct).
- Incremental dialect-by-dialect rollout behind a flag.
- A
ktx historic-sql migratecommand — cleanup runs automatically once. - Framework-level
raw-sources/retention policy (separate concern; not introduced here). - Per-table wiki pages (the very problem
_schemashards exist to avoid — see §11).
10.6 Risks
| Risk | Mitigation |
|---|---|
Daemon analyze-batch slower than hoped on huge templates |
ProcessPoolExecutor parallelism; configurable batch size cap |
_schema shard concurrent writes (scan + historic-sql) |
Atomic per-shard write + scan-managed-keys merge (mergeUsagePreservingExternal); new test covers concurrent invocation |
| Pattern slug churn between runs | Slug-stability matcher in projection; ≥60% overlap reuses existing slug; falls back to new mint if no match |
Existing manifest validators reject usage field |
Audit validators in step 1 of §10.3; extend allowed-fields list |
User-edited usage fields clobbered |
mergeUsagePreservingExternal follows the same scan-managed-keys discipline as descriptions; covered by tests |
11. Rejected alternatives
Documented so future readers don't relitigate.
Per-table wiki pages — one .md per table under knowledge/global/historic-sql/. Rejected: reintroduces the per-table-file proliferation problem (writeLocalKnowledgePage writes one file per page) that _schema shards exist to avoid. ~800 markdown files for a 1000-table warehouse, ~100 churning daily.
Single-file all-usage page — one giant page containing every table. Rejected: ~700 KB blob; FTS5 snippets all come from the same source; wiki read returns an unusable mass.
One file per table in a new _usage/ directory — same file-count problem as per-table wiki, plus needs new search plumbing.
New parallel _usage/{shard}.yaml shards — same sharding benefit as merging into _schema but without riding SL search. Plumbing required without offsetting win.
One wiki page per catalog.schema — workable, but pages get large (200 tables per page) and only rides wiki search, not SL search. The chosen design rides both.
Single staged snapshot.json — to reduce raw-sources/ accumulation. Rejected: required custom diff logic in chunk(), broke framework convention, saved bounded disk for a framework-level concern (sync retention). Per-table staged files with bucketed content is cleaner.
Embedding-based pattern clustering — using sentence-transformer embeddings to cluster templates into themes before naming via LLM. Rejected: reintroduces clustering hyperparameters and determinism the redesign aims to avoid. The LLM does the grouping in one call from the full template list, no embedding step.
Skip pattern pages entirely — ship only _schema enrichment for a leaner v1. Rejected: leaves ktx wiki search empty of historic-sql content (loses one of two stated consumption surfaces) and forces agents to synthesize cross-cutting intents from fragmented per-table mentions.
TypeScript-native SQL parser instead of sqlglot via daemon — node-sql-parser, pgsql-parser (WASM), etc. Rejected: materially worse dialect coverage on Snowflake/BigQuery edge cases; duplicates parser logic when KTX already uses sqlglot elsewhere (python/ktx-daemon/src/ktx_daemon/lookml.py); AGENTS.md explicitly mandates sqlglot. Batch endpoint on the existing daemon achieves the perf win.
Hard length/count caps in zod output schemas (e.g. narrative.max(250), commonFilters.max(5)). Rejected: arbitrary thresholds, brittle retry-on-violation paths, defensive coding for a soft concern. Concision belongs in prompt instructions; the schema validates shape.
12. Cost / scale envelope
For a representative mid-size warehouse (~200 touched tables):
| Phase | Calls | Cost @ Sonnet |
|---|---|---|
| Hot path (deterministic) | 0 | $0 |
| First-run table digest (uncached + cached mix) | ~200 | ~$5–7 |
| First-run patterns | 1 | ~$0.05 |
| Embeddings (changed tables) | ~200 | ~$0.02 |
| First run total | ~$5–7 | |
| Daily steady-state (hash-skipped) | ~10–30 changed | ~$0.10–$0.25 |
Wall-clock: first run ~1–3 min on mid; demo DB <60s end-to-end.
For a large warehouse (~800 touched tables): first-run ~$20–30, daily ~$0.20–$1.00.
13. Open questions
- Exact bucket thresholds for
executionsBucket,distinctUsersBucket, etc. — to be chosen during implementation based on what produces stable hashes in practice. - Final naming of the daemon endpoint (
/sql/analyze-batchvs alternatives). - Whether
historic_sql_ingest/historic_sql_curatorskills are consumed elsewhere — audit during step 1. - Whether to delete legacy wiki pages automatically or behind a confirmation flag — design assumes automatic.