The current historic-SQL ingestion adapter (`packages/context/src/ingest/adapters/historic-sql/`) is slow, complex, and structurally cannot answer the questions a research/BI agent actually asks.
Concrete pain points observed:
- A full run takes **30+ minutes against a tiny demo Postgres database**. The hot loop calls `SqlAnalysisPort.analyzeForFingerprint()` once per query via HTTP to the Python daemon, so thousands of RPC round-trips dominate runtime.
- **Two completely different code paths** for Postgres (baseline-diff against `pg_stat_statements`) versus BigQuery/Snowflake (timestamp cursor over `INFORMATION_SCHEMA.JOBS` / `QUERY_HISTORY`). Postgres further cannot produce the same outputs as the others (no per-execution samples, no literal-slot bindings, error rate stuck at zero).
- The output is **fingerprint-fragmented**: the pipeline emits one document per fingerprint, expands categorical literal slots into per-value sub-clusters, and ranks templates with a recency-decayed score. The result is many near-duplicate documents per fingerprint and gratuitous churn across runs.
- The output is **rigid and shallow**: deterministic slot classification (constant / categorical / runtime) and triage-signal buckets do not produce narrative an agent can use. The current downstream skills (`historic_sql_ingest`, `historic_sql_curator`) try to recover narrative from these templates but at high cost.
- Lots of moving parts (baseline files, reset detection, atomic per-connection commit, slot heuristics, ranking formula) for what is fundamentally "find interesting queries and tell agents about them."
The end goal - per the user - is for ingested content to be **searchable by `ktx wiki search` and `ktx sl search` to help consumer research agents do data analysis and agentic BI**.
1.**LLMs are the right tool for narrative and clustering.** Deterministic heuristics (slot classification, ranking formulas, categorical expansion) get replaced by LLM judgement applied to aggregated, bucketed inputs.
3.**One pipeline across dialects.** A single reader interface, a single staging shape, a single set of skills. Dialect-specific behavior lives only in the snapshot query.
4.**No work where no signal changed.** Daily reruns should LLM only the things that actually changed.
5.**Lean context for caller agents.** Each retrieval tier (search hit → source read → pattern read) carries only what the agent needs to make the next decision. The principle lives in prompt instructions, not in defensive schema constraints.
6.**Simplification over backward compatibility.** Hard cutover, delete the old code path, no parallel implementations.
`AggregatedTemplate` is one record per template, already aggregated by the warehouse. Schema in §9.
**Trailing-window only.** No cursor, no baseline file. Every run reads "what was hot in the last N days." Idempotency comes from per-WorkUnit content hashing via the framework's `DiffSetComputerPort`.
The endpoint is implemented in `python/ktx-daemon` and uses `sqlglot` internally with `ProcessPoolExecutor` parallelism over the batch. Replaces the per-query HTTP roundtrip pattern that dominates today's runtime.
Per-row parse failures are non-fatal: the template loses table grounding (excluded from per-table bucketing and from patterns) but the failure is logged to `manifest.warnings` as `parse_failed:<templateId>`.
- Orchestrator boilerplate (dbt/Looker/Metabase markers) - default `mark-only` (do not drop; dbt-generated queries are often the actual business logic).
`patterns-input.json` contains every template in compact form (`id`, `canonicalSql`, `tablesTouched`, `executionsBucket`, `distinctUsersBucket`, `dialect`). Pulls double duty as the patterns skill input and as the audit log; no separate `templates.jsonl`.
Bucket bands are defined deterministically in code (e.g. `executionsBucket`: `<10`, `10-100`, `100-1k`, `1k-5k`, `5k-50k`, `>50k`). Exact thresholds set during implementation; the principle is that small fluctuations don't change the bucket.
One `WorkUnit` per `tables/*.json` file (handled by `historic_sql_table_digest`) + one `WorkUnit` referencing `patterns-input.json` (handled by `historic_sql_patterns`). No custom diff logic - the framework's `DiffSetComputerPort` already filters to changed files.
Both skills produce **evidence**; the adapter's `onPullSucceeded()` projects evidence to its final homes. This avoids write contention between parallel skill invocations on the same shard file.
### 5.1 `historic_sql_table_digest`
One invocation per changed table's `WorkUnit`. Input: the table's staged JSON plus dependency reference to the existing `_schema` entry (so the LLM sees the actual column list and doesn't hallucinate).
**Prompt cache split** (`cacheControl: { type: 'ephemeral', ttl: '5m' }`, auto-bump to `'1h'` when the run is expected to exceed ~4 minutes wall clock):
- **Cached prefix:** role, output JSON schema generated from `tableUsageOutputSchema` via Zod 4's `z.toJSONSchema()`, extraction rules, 1–2 few-shot examples.
No hard length/cap constraints in the schema. Concision is a behavioral instruction in the prompt prefix.
**Concurrency:** `runWithConcurrency()` from `packages/context/src/scan/description-generation.ts:147` (the same utility scan-description uses). Default 12, configurable in `ktx.yaml`.
**Idempotency:** when `tables/{name}.json`'s content hash hasn't changed (bucketed stats stable), DiffSet marks the file `unchanged`, no WorkUnit is emitted, no LLM call happens. Steady-state daily runs LLM only the meaningfully changed tables.
One invocation per run (or a small handful if `patterns-input.json` exceeds a context budget - split deterministically by `tablesTouched` cardinality stratification).
**Prompt:** identifies recurring analytical intents that span ≥2 tables with ≥mid executionsBucket and ≥2-5 distinct users. Output is a list of `PatternOutput`.
**Output schema:**
```typescript
export const patternOutputSchema = z.object({
slug: z.string(),
title: z.string(),
narrative: z.string(),
definitionSql: z.string(),
tablesInvolved: z.array(z.string()),
slRefs: z.array(z.string()),
constituentTemplateIds: z.array(z.string()),
});
```
**Cache control:** skip. Single call per run; cache write premium doesn't amortize.
**Slug stability across runs:** the projection step (§5.3) does a deterministic similarity check against existing pattern pages. For each new pattern, find an existing slug whose `tablesInvolved`∪`constituentTemplateIds` overlap ≥60% with the new one and reuse it; else mint a new slug. Pure post-process, no LLM call.
### 5.3 Projection inside `onPullSucceeded()`
After all skills complete and evidence is committed, run two passes. Both are pure data transformations, no LLM calls.
- For each table entry: if new evidence exists, merge under `usage` via `mergeUsagePreservingExternal()` (only `historicSql`-managed keys touched; user-added keys preserved - same pattern as `mergeDescriptionsPreservingExternal` at `local-enrichment-artifacts.ts:237-242`).
- For tables previously present with `historicSql`-managed `usage` but absent from this run's snapshot: set `usage.staleSince = lastSnapshotSeenAt`, clear other historicSql-managed fields.
- Atomic write to `_schema/{shardKey}.yaml`.
4. Trigger SL search re-index for changed sources via the existing flow (`sl-search.service.ts:91-99` detects search-text drift).
1. Collect all `historic_sql_pattern` evidence written this run.
2. Load existing wiki pages with tags `['historic-sql', 'pattern']` for this connection.
3. Run slug-stability matching.
4. For each pattern (existing or new):
- Build `LocalKnowledgePage` with `key: historic-sql/{slug}`, `scope: GLOBAL`, `tags: ['historic-sql', 'pattern']`, `slRefs` to relevant SL sources, `refs` to other historic-sql pages.
-`writeLocalKnowledgePage(...)`.
5. For existing patterns not seen this run: append frontmatter `stale_since: {today}` and add `tag: stale`. Don't delete; preserve for historical lookups.
6. After `staleArchiveAfterDays` threshold (default 90 days, configurable): move the page key under `historic-sql/_archived/` and add `tag: archived`.
Pattern pages are written to `knowledge/global/historic-sql/{slug}.md` and are discovered by the existing `searchLocalKnowledgePages()` walk. Tags `['historic-sql', 'pattern']` enable faceted search.
Add an optional `usage` field to `SemanticLayerSource` in `packages/context/src/sl/schemas.ts`, reusing the same `tableUsageOutputSchema` from `skill-schemas.ts`. Single source of truth end-to-end.
The existing projection step in `local-sl.ts` (or wherever the manifest reader builds `SemanticLayerSource` objects) needs one new field copy: `entry.usage → source.usage`.
Extend the search result shape returned by `agent sl list --query` to include `score` and an FTS5 `snippet()` per hit. Implementation: small SQL change in `sqlite-sl-sources-index.ts` to select `snippet(local_sl_sources_fts, ...)` alongside the source row.
Result shape becomes:
```jsonc
{
"connectionId": "warehouse",
"name": "public.orders",
"table": "orders",
"columnCount": 12,
"measureCount": 3,
"joinCount": 2,
"description": "...",
"score": 0.81,
"frequencyTier": "high",
"snippet": "commonly filtered by <mark>status</mark>, joined to customers"
}
```
The full `usage` block lives in the `SemanticLayerSource` returned by `agent sl read <name>`.
## 7. Three-tier retrieval model
| Tier | Surface | What an agent gets |
|---|---|---|
| Search hit | `agent sl list --query "..."` | name, table, counts, description, score, frequencyTier, snippet |
| Source read | `agent sl read <name>` | full SemanticLayerSource YAML including columns, measures, joins, and `usage` block |
Agents pull deeper only when they need to. The bytes per tier are governed by prompt-side concision instructions, not by schema constraints.
## 8. Configuration
Per-connection block in `ktx.yaml`:
```yaml
connections:
warehouse:
driver: postgres
connectionUrl: postgres://...
historicSql:
enabled: true
# everything below is optional; defaults from the zod schema
windowDays: 90
minExecutions: 5
concurrency: 12
filters:
serviceAccounts:
patterns: ['^etl-', '@bot\.']
mode: exclude # exclude | include | mark-only
orchestrators:
mode: mark-only # include | exclude | mark-only
dropTrivialProbes: true
dropFailedBelow:
errorRate: 0.9
executions: 10
redactionPatterns: ['password', 'api_key']
staleArchiveAfterDays: 90
```
CLI setup wizard (`ktx setup`) flags map onto this block. `--historic-sql-min-calls` is renamed `--historic-sql-min-executions` (cross-dialect clarity); both names accepted for one release.
Doctor command (`ktx dev doctor`) retains PG-specific validation: version ≥ 14, extension installed, `pg_read_all_stats` grant, `pg_stat_statements.track != 'none'`. The `pg_stat_statements.max ≥ 5000` check is downgraded from a warning to an informational note (deallocation churn no longer threatens delta-tracking integrity, because there is no delta tracking).
## 9. Schemas (zod)
Lives in `packages/context/src/ingest/adapters/historic-sql/types.ts` unless noted.
In `packages/context/src/ingest/adapters/historic-sql/skill-schemas.ts` - the **single source of truth for LLM I/O shapes**, imported by the prompt builder, the evidence parser, the projection step, the `SemanticLayerSource` type, and the `_schema` manifest entry type:
The `_schema/{shard}.yaml` manifest version need not bump - `usage` is an additive, optional field. Validators must allow unknown future keys (audit during step 1 of §10).
| Old per-template wiki pages | `knowledge/global/...` (legacy `historic-sql-template` tag or matching key prefix) | **One-time cleanup** in `onPullSucceeded()` on first run after upgrade. Idempotent: subsequent runs no-op. |
| PG baseline files | `.ktx/cache/historic-sql/{connectionId}/pgss-baseline.json` | **Delete on first run.** Cache; no signal lost. |
- **Demo DB end-to-end:** `examples/postgres-historic/` ingest completes in **under 60 seconds** (current 30-minute baseline becomes the regression bar).
- **Cross-dialect smoke:** at least one run against each of PG / BQ / SF ends with non-empty `_schema/{shard}.yaml``usage` blocks and ≥0 pattern pages.
- **Idempotency:** a second run immediately after the first produces zero `historic_sql_table_digest` LLM calls.
- **Drift:** a run where one table disappears from the snapshot sets `usage.staleSince` on that table's `_schema` entry; reappearance clears it.
- **Search retrieval:** `agent sl list --query` returns hits with non-empty snippets; `agent wiki search "<pattern slug>"` returns the pattern page directly.
- **No old code paths:** `git grep -E "stagePgStatStatementsTemplates|expandCategoricalTemplates|classifySlot|pgss-baseline"` returns zero results.
- **Doctor still passes** on a properly configured PG with the new adapter.
### 10.5 Out of scope
- Embedding-based pattern clustering (rejected in favor of LLM-driven intent detection).
| Pattern slug churn between runs | Slug-stability matcher in projection; ≥60% overlap reuses existing slug; falls back to new mint if no match |
| Existing manifest validators reject `usage` field | Audit validators in step 1 of §10.3; extend allowed-fields list |
| User-edited `usage` fields clobbered | `mergeUsagePreservingExternal` follows the same scan-managed-keys discipline as descriptions; covered by tests |
**Per-table wiki pages** - one `.md` per table under `knowledge/global/historic-sql/`. Rejected: reintroduces the per-table-file proliferation problem (`writeLocalKnowledgePage` writes one file per page) that `_schema` shards exist to avoid. ~800 markdown files for a 1000-table warehouse, ~100 churning daily.
**Single-file all-usage page** - one giant page containing every table. Rejected: ~700 KB blob; FTS5 snippets all come from the same source; `wiki read` returns an unusable mass.
**New parallel `_usage/{shard}.yaml` shards** - same sharding benefit as merging into `_schema` but without riding SL search. Plumbing required without offsetting win.
**One wiki page per `catalog.schema`** - workable, but pages get large (200 tables per page) and only rides wiki search, not SL search. The chosen design rides both.
**Single staged `snapshot.json`** - to reduce `raw-sources/` accumulation. Rejected: required custom diff logic in `chunk()`, broke framework convention, saved bounded disk for a framework-level concern (sync retention). Per-table staged files with bucketed content is cleaner.
**Embedding-based pattern clustering** - using sentence-transformer embeddings to cluster templates into themes before naming via LLM. Rejected: reintroduces clustering hyperparameters and determinism the redesign aims to avoid. The LLM does the grouping in one call from the full template list, no embedding step.
**Skip pattern pages entirely** - ship only `_schema` enrichment for a leaner v1. Rejected: leaves `ktx wiki search` empty of historic-sql content (loses one of two stated consumption surfaces) and forces agents to synthesize cross-cutting intents from fragmented per-table mentions.
- Exact bucket thresholds for `executionsBucket`, `distinctUsersBucket`, etc. - to be chosen during implementation based on what produces stable hashes in practice.