chore: remove private benchmark specs

Andrey Avtomonov 2026-06-30 11:13:44 +02:00
parent 67a69dba8b
commit 1c5d16abc3
40 changed files with 0 additions and 8716 deletions


@ -1,62 +0,0 @@
# spider2-specs — feature specs driven by the Spider 2.0-Lite benchmark
This directory is the handoff point between two agents working on different
sides of the same goal: making Claude Code + ktx score well on the Spider
2.0-Lite benchmark **without benchmark-specific instructions** — the agent
should succeed using only what ktx provides (skills, semantic layer, wiki).
## Mechanics
Three directories form a pipeline. A feature flows `todo/` → `specs/`
(implemented), and only its intake draft moves to `done/`:
- **`todo/`** — intake drafts. A **playground agent** (works in
`/Users/andrey/projects/kaelio/spider-clean-submission/playground`, runs the
benchmark, identifies ktx capability gaps) writes a draft spec here when it
finds a gap.
- **`specs/`** — refined specs. A **refinement pass** (brainstorming) takes a
`todo/` draft and produces a proper, implementation-ready spec at
`specs/<same-filename>.md`: sharpened requirements, resolved ambiguities,
acceptance criteria, and orientation hints. The refined spec is the **durable
artifact** the implementer builds from — it stays in `specs/` permanently and
never moves.
- **`done/`** — intake drafts whose feature has shipped (see below).
The **ktx worktree agent** (started from a ktx repo worktree, e.g.
`/Users/andrey/conductor/workspaces/ktx/tallinn-v2`) implements from the
refined spec in `specs/` (falling back to the `todo/` draft only if no refined
spec exists yet). When the feature is implemented it:
1. appends a short **"Implementation notes"** section to the refined spec in
`specs/` (what was built, where, any deviations); and
2. **moves the original intake draft from `todo/` to `done/`.**
Location is status: `todo/` = draft awaiting implementation, `done/` = draft
whose feature shipped, `specs/` = refined specs (permanent home, do not move).
A draft and its refined spec share the same filename so they correspond
(`todo/01-foo.md` ↔ `specs/01-foo.md` ↔ `done/01-foo.md`). No other tracking.
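The completion step described above can be sketched as a small script. This is only an illustration under stated assumptions: the three directories sit under one root, and the helper name `complete_spec` and the `## Implementation notes` heading level are hypothetical, not part of ktx.

```python
from pathlib import Path
import shutil


def complete_spec(root: Path, filename: str, notes: str) -> None:
    """Record a shipped feature: annotate the refined spec, move the draft.

    Hypothetical helper for illustration only.
    """
    spec = root / "specs" / filename
    draft = root / "todo" / filename
    if not draft.exists():
        raise FileNotFoundError(f"no intake draft at {draft}")

    # The refined spec is the durable artifact: append to it, never move it.
    # If the implementer fell back to the draft, there is no spec to annotate.
    if spec.exists():
        with spec.open("a", encoding="utf-8") as fh:
            fh.write(f"\n## Implementation notes\n\n{notes.strip()}\n")

    # Location is status: moving the draft is the only tracking step.
    (root / "done").mkdir(exist_ok=True)
    shutil.move(str(draft), str(root / "done" / filename))
```

The draft and the spec are addressed by the same filename, which is what lets a single argument identify both.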
## Rules for specs
1. **Generic, not benchmark-overfit.** ktx is a general-purpose product; the
benchmark only surfaces the need. Every spec must state a real-world use
case independent of Spider 2.0-Lite. If a requirement only makes sense for
the benchmark, it doesn't belong in ktx.
2. Specs are **requirement-level**, not implementation plans. Code pointers in
specs are orientation hints from exploration (line numbers may have
drifted); the implementer owns the design.
3. One spec per file, kebab-case, numeric prefix = suggested priority order.
A refined spec in `specs/` keeps the same filename as its `todo/` draft.
## For the implementer
- After implementing, rebuild and re-link the dev binary so the playground
picks it up: `pnpm run build && pnpm run link:dev` (provides `ktx-dev`).
- Add/extend tests in the ktx test suites; specs list acceptance criteria to
cover.
- Build from the refined spec in `specs/`. On completion, append
"Implementation notes" to that spec (it stays in `specs/`) and move the
intake draft from `todo/` to `done/`.
- If a spec turns out to be wrong or already satisfied, don't silently drop
it — record why in the refined spec's notes and move the draft to `done/`
explaining why no change was needed.


@ -1,74 +0,0 @@
# Connection-scoped wiki pages
## Problem
Wiki pages have only two scopes today: `GLOBAL` and `USER`
(`packages/cli/src/context/wiki/types.ts`, frontmatter schema ~lines 14-29).
There is no way to associate a page with a connection. In a project with many
connections, all pages share one search index, so `wiki_search` for a generic
term ("orders", "revenue", "average order value") surfaces pages about the
wrong database. Concept names collide across databases constantly in
real-world multi-connection projects (several databases each with `orders`,
`customers`, etc.).
Today, when `memory_ingest` is called with a `connectionId`, that id is only
used to scope which semantic-layer sources the triage agent can see
(`memory-agent.service.ts` ~46-72, ~107-109); it is **not** persisted on the
resulting wiki page in any form.
## Generic use case
Any org with multiple databases/warehouses in one ktx project: org-wide
definitions ("fiscal year starts in February") should be visible everywhere,
while database-specific conventions ("in the events DB, `user_id` is the
anonymous device id, not the account id") should not pollute searches about
other databases.
## Requirements
1. **Frontmatter field.** Add an optional `connections:` field to wiki page
frontmatter — a list of connection ids (accept a single string too,
normalize to list).
- **Absent or empty ⇒ unscoped: the page applies to all connections.**
This is exactly today's behavior, so every existing page is unaffected
(backward compatible by construction).
2. **Search filtering.** `wiki_search` (MCP tool, `context-tools.ts` ~46-64)
and `ktx wiki search` / `ktx wiki list` (CLI,
`knowledge-commands.ts`) accept an optional `connectionId`:
- With `connectionId: X` ⇒ return pages scoped to X **plus** unscoped pages.
- Without ⇒ current behavior, all pages.
- The filter must apply to **all three search lanes** (lexical FTS5,
semantic/embedding, token fallback) in
`local-knowledge.ts` / `sqlite-knowledge-index.ts` — not as a post-filter
that eats into the result limit unevenly.
3. **Index.** Persist the scoping in the `.ktx/db.sqlite` knowledge index
(the index is already re-synced from files on every search,
`local-knowledge.ts` ~286-310, so a schema addition + sync is sufficient).
4. **Write path.** The memory agent's wiki-write tool accepts the connections
field; when `memory_ingest` is invoked with a `connectionId`, the agent
should default new database-specific pages to that connection, while still
being allowed to write unscoped pages for clearly org-wide content (prompt
guidance, not a hard rule).
5. **`wiki_read` and refs are unchanged** — pages remain addressable by key
regardless of scoping; `connections` is a search/relevance concern only.
6. **Validation.** Warn (don't fail) when a page references a connection id
not present in `ktx.yaml` — config and content can evolve independently.
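The scoping semantics in requirements 1, 2, and 6 can be written down as a small predicate. This is a sketch of the intended behavior only, not the ktx implementation: the names `Page`, `normalize_connections`, `matches`, and `unknown_connections` are hypothetical, and a real implementation would push the same condition into each search lane's query rather than filtering results afterwards.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Union


def normalize_connections(raw: Union[None, str, List[str]]) -> List[str]:
    """Requirement 1: accept absent, a single string, or a list."""
    if raw is None:
        return []
    if isinstance(raw, str):
        return [raw]
    return list(raw)


@dataclass
class Page:
    key: str
    connections: List[str] = field(default_factory=list)  # empty = unscoped


def matches(page: Page, connection_id: Optional[str] = None) -> bool:
    """Requirement 2: pages scoped to X plus unscoped pages; no id = all."""
    if connection_id is None:
        return True
    return not page.connections or connection_id in page.connections


def unknown_connections(page: Page, configured: Set[str]) -> List[str]:
    """Requirement 6: ids worth a warning (warn, do not fail)."""
    return [c for c in page.connections if c not in configured]
```

Treating the empty list as "applies everywhere" is what keeps existing pages unaffected: a page that never declared the field behaves exactly as it does today.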
## Acceptance criteria
- A page with `connections: [db_a]` is returned by
`wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but
**not** by `wiki_search(query, connectionId: "db_b")`.
- A page with no `connections` field is returned in all three cases above.
- Existing projects with no scoped pages behave identically before/after.
- Filtering works in each lane independently (test with embeddings disabled
to exercise lexical/token lanes alone).
- `memory_ingest(content, connectionId)` produces a page scoped to that
connection for database-specific content.
## Benchmark context (motivation only)
Spider 2.0-Lite local subset = one project with 30 SQLite connections whose
schemas share table/concept names (Northwind, sakila, two e-commerce DBs…).
External-knowledge docs (RFM definition, F1 overtake rules) are each relevant
to exactly one database and must not surface for the other 29.


@ -1,71 +0,0 @@
# Verbatim ingest mode for authoritative documents
## Problem
`ktx ingest --text/--file` routes content through the memory agent
(`text-ingest.ts` ~246-357 → `memory-agent.service.ts`), an LLM triage loop
(30-step budget for `external_ingest`, content clipped at ~48k chars,
`memory-agent.service.ts` ~165) that may rewrite, condense, or split the
content before writing wiki pages.
For *authoritative* documents — formula definitions, specs, runbooks,
compliance text — paraphrasing is a bug, not a feature:
- exact thresholds, constants, and rule wording must survive byte-for-byte;
- lexical (BM25) search works best when the stored text matches the phrasing
users/agents will query with;
- ingestion should be deterministic and reproducible — same input file, same
resulting page.
## Generic use case
Any team ingesting documents that are already the source of truth: metric
definition sheets, SLA documents, calculation methodology docs, regulatory
text. The user wants ktx to *index and surface* the document, not to
re-author it.
## Requirements
1. **Flag.** `ktx ingest --file <path> --verbatim` (apply to `--text` too).
Composes with the existing optional `--connection <id>` so the resulting
page can be connection-scoped (see spec 01).
2. **Body preservation is enforced by code, not by prompt.** The stored page
body must be the input content byte-for-byte. The LLM is used **only** to
generate metadata: `summary`, `tags`, `sl_refs`, suggested page key/slug
(and `connections` default from the flag). Implementation freedom: a
single constrained LLM call is fine — the full memory-agent loop is not
required for this mode.
3. **No clipping of the stored body.** The ~48k clip may apply to what is
*sent to the LLM* for metadata generation, never to what is *written* to
the wiki page.
4. **Existing frontmatter.** If the input file already has YAML frontmatter,
preserve user-provided fields and only fill gaps (don't overwrite an
explicit `summary` with a generated one).
5. **Key collisions.** Deterministic, non-destructive behavior: error or
suffix — never silently overwrite an existing page.
6. **Degraded mode.** With `llm.provider.backend: none`, `--verbatim` should
still work, deriving `summary` from the first heading/sentence and leaving
optional metadata empty. (Regular agent ingest can't do this; verbatim
mode can and should.)
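A minimal sketch of how requirements 2 to 6 fit together, assuming metadata is modeled as plain dicts. The function names, the `generate` callback, and the 48,000-character default are illustrative assumptions (the source only says "~48k chars"), not ktx APIs.

```python
from typing import Callable, Dict, Optional, Set


def fallback_summary(body: str) -> str:
    """Degraded mode: first non-empty line, heading markers stripped."""
    for line in body.splitlines():
        text = line.strip().lstrip("#").strip()
        if text:
            return text
    return ""


def build_verbatim_page(
    key: str,
    body: str,
    user_meta: Dict[str, object],
    existing_keys: Set[str],
    generate: Optional[Callable[[str], Dict[str, object]]] = None,
    clip: int = 48_000,  # assumed value for the "~48k" clip
) -> Dict[str, object]:
    # Requirement 5: never silently overwrite an existing page.
    if key in existing_keys:
        raise FileExistsError(f"wiki page {key!r} already exists")

    # Requirement 3: only the text sent for metadata generation is clipped.
    if generate is not None:
        generated = generate(body[:clip])
    else:
        # Requirement 6: no LLM backend, derive a summary locally.
        generated = {"summary": fallback_summary(body)}

    # Requirement 4: user-provided fields win; generated values fill gaps.
    meta = {**generated, **user_meta}

    # Requirement 2: the body is passed through untouched by construction.
    return {"key": key, "meta": meta, "body": body}
```

Because the body never passes through the metadata step, byte-for-byte preservation does not depend on prompt wording, which is the property the hash-based acceptance test checks.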
## Acceptance criteria
- Ingesting a file with `--verbatim` produces a wiki page whose body is
byte-identical to the input (assert with a hash in tests).
- Running the same ingest twice is idempotent or fails loudly on the second
run (per requirement 5) — no duplicated/divergent pages.
- A >48k-char file is stored in full.
- `--verbatim --connection X` yields a page scoped to X (depends on spec 01;
if 01 isn't implemented yet, the flag composition can land later).
- Generated metadata makes the page findable: `wiki_search` for a phrase
from the document body returns it (lexical lane), and for a paraphrase of
its topic returns it when embeddings are enabled (semantic lane).
## Benchmark context (motivation only)
Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket
definitions, haversine formula, F1 overtake rules…). Gold SQL was authored
against their exact text; an LLM paraphrase that drops a bucket boundary
loses a question. We currently work around this by hand-writing frontmatter
and copying files into `wiki/global/` — verbatim mode makes that a supported
ktx workflow instead of a manual step.


@ -1,63 +0,0 @@
# Schema scan must tolerate individual objects that fail introspection
> Priority: MEDIUM. Found during the first full Spider2-lite sqlite ingest
> (2026-06-13): one database (`oracle_sql`) failed to ingest **entirely**
> because a single broken VIEW errored during introspection, leaving that
> connection with no semantic layer at all.
## Problem
`ktx ingest <connection>` aborts the whole database's schema scan when one
table/view errors during introspection/profiling. In `oracle_sql` the view
`emp_hire_periods_with_name` is defined as
`SELECT ehp.start_date, ehp.end_date ... FROM emp_hire_periods ehp ...` but the
base table has no `start_date`/`end_date` columns — so any attempt to read it
raises `no such column: ehp.start_date`. That single broken object failed the
ingest of all ~48 healthy tables/views in the database.
A second, related symptom: setting `enabled_tables: [main.customers]` to work
around it produced a different hard failure (`Adapter "database schema" did not
recognize fetched source output`), so the documented allowlist escape hatch did
not provide a clean fallback either.
## Generic use case
Real databases routinely contain broken or inaccessible objects: views over
dropped/renamed columns, views referencing tables the connection role can't
read, permission-denied tables, or vendor system views that error. ktx should
ingest everything it *can* and skip what it can't — never let one bad object
zero out an entire connection's context. This is basic robustness for
production warehouses, not benchmark-specific.
## Requirements
1. **Per-object isolation.** If introspecting/profiling one table or view
throws, skip that object, record a warning (object name + error), and
continue scanning the rest. The connection's semantic layer is built from
the objects that succeeded.
2. **Surface, don't hide.** Report skipped objects in the ingest summary and in
`ktx status` (e.g. "oracle_sql: 1 object skipped — emp_hire_periods_with_name:
no such column ehp.start_date"). Honor `failureMode` for whole-connection
aborts, but a single bad object should not count as a connection failure.
3. **Views vs tables.** A broken view should never block base-table ingest.
Consider profiling views defensively (they are read-only projections).
4. **Allowlist fallback should work.** `enabled_tables` should reliably restrict
the scan to the listed objects (and the qualification format for sqlite must
be documented and accepted). Fix the `did not recognize fetched source
output` failure when the allowlist yields a small/edge-case set.
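Per-object isolation (requirement 1) can be illustrated against an in-memory SQLite database. This is a sketch under assumptions: `scan_objects` is a hypothetical stand-in for the real scanner, "profiling" is reduced to reading one row, and the table definitions are simplified versions of the `oracle_sql` case described above.

```python
import sqlite3
from typing import Dict, List, Tuple


def scan_objects(
    conn: sqlite3.Connection,
) -> Tuple[Dict[str, List[str]], List[Tuple[str, str, str]]]:
    """Profile every table/view; skip and record the ones that error."""
    scanned: Dict[str, List[str]] = {}
    skipped: List[Tuple[str, str, str]] = []
    objects = conn.execute(
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    ).fetchall()
    for name, kind in objects:
        try:
            # Reading one row forces SQLite to resolve the object's columns.
            cur = conn.execute(f'SELECT * FROM "{name}" LIMIT 1')
            scanned[name] = [col[0] for col in cur.description]
        except sqlite3.Error as exc:
            # One bad object becomes a warning, not a connection failure.
            skipped.append((name, kind, str(exc)))
    return scanned, skipped


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE emp_hire_periods (emp_id INTEGER, hired TEXT);
    -- SQLite accepts this view even though the columns do not exist;
    -- the error only surfaces when the view is read.
    CREATE VIEW emp_hire_periods_with_name AS
        SELECT ehp.start_date, ehp.end_date FROM emp_hire_periods ehp;
""")
scanned, skipped = scan_objects(conn)
```

Here `scanned` holds the two healthy tables and `skipped` names the broken view along with the driver's error text, which is the information the ingest summary and `ktx status` are asked to surface.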
## Acceptance criteria
- Ingesting a sqlite DB containing one broken view plus N healthy tables yields
a semantic layer for the N healthy tables and a warning naming the broken view
— exit is success (not "failed"), subject to `failureMode`.
- The skipped object is listed in the ingest summary and `ktx status`.
- `enabled_tables` restricted to a subset ingests exactly that subset without the
adapter-output error.
## Benchmark context (motivation only)
`oracle_sql` (8 of the 135 sqlite questions) currently has no semantic layer
because of its one broken view; those questions must be solved from raw
`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning
would restore enriched context for that database.


@ -1,112 +0,0 @@
# Add universal SQL-authoring craft to the ktx-analytics skill
> Priority: HIGH. The `ktx-analytics` skill currently tells the agent *which
> ktx tools to call and in what order*, but gives almost no guidance on
> *writing correct SQL*. In benchmark runs the agent reliably produced
> runnable SQL (0 execution errors) yet failed on correctness — precision,
> determinism, type mismatches, and answer completeness. These are universal
> analytics-engineering truths that every ktx user benefits from, so they
> belong in the shipped skill, not in any caller's prompt.
## Scope guard (read first)
Only **universally-true** SQL/analytics craft goes here — guidance that helps a
real ktx user querying a **live** database. The test for inclusion: *"Would this
advice be correct and useful for an analyst on a current, production database?"*
**Dialect-specific syntax is out of scope here.** The v9 harnesses' only
per-dialect content (Snowflake: `DB.SCHEMA.TABLE` FQTNs, double-quoted
lowercase cols, VARIANT colon-paths; BigQuery: backtick FQTNs, `_TABLE_SUFFIX`
for sharded tables; sqlite: `strftime`/`julianday`) is genuinely useful but
belongs in a **dialect-aware** location (per-driver notes), not this flat
skill. Track separately as a follow-up; the rules below must stay
dialect-agnostic.
Explicitly **do NOT** add (these are application/consumer concerns, not skill
concerns, and some are actively wrong for live data):
- Output-format contracts ("return a bare result set with exactly these
columns, no prose"). The skill is for interactive analysis and already
favors readable tables + summaries; a caller that needs a strict result
shape specifies that itself.
- Anchoring relative time ("recent", "past N months") to `MAX(date)` of the
data. On a live database "recent" means relative to *now*; this is only true
for static snapshots and must not be baked into the product.
- Anything justified by a grader/scoring comparator.
## File
`packages/cli/src/skills/analytics/SKILL.md` (the shipped skill;
`setup-agents.ts` installs it into agent environments — the copy under a
project's `.claude/skills/` is regenerated from this source). Extend the
existing `<rules>` block and step 5 ("Query") / step 6 ("Validate and
explain"); keep the existing interactive guidance intact.
## Requirements: add these as general rules (behavior only, no rationale that references answers/graders)
**Schema discovery before writing SQL**
1. Inspect representative sample rows of each table before composing SQL —
confirm date/time encoding (e.g. `YYYYMMDD` vs ISO vs epoch), null
prevalence in join/filter keys, and the actual set of categorical/enum
values. (`entity_details` + a small `sql_execution` sample.)
2. Cast a column to its real type before comparing it in `WHERE`/`JOIN`. A
string column compared against a numeric literal (or vice versa) can
silently match nothing.
**Composition discipline**
3. Build complex queries incrementally — one CTE at a time, verifying each
layer's output on a small sample before stacking the next.
4. Avoid joins that fan out row counts. Add columns only from tables already
required by the grain, or pre-aggregate to the target grain before joining.
**Window-function correctness**
5. Give every ranking/ordering window function a complete, deterministic
tie-breaker (append unique key columns), so `RANK`/`ROW_NUMBER`/`LAG`
results are stable rather than flickering across runs.
6. Apply row filters **after** window functions for sequence / "first" /
"most recent" / "since" questions — compute over the full partition, then
filter.
**Numeric precision**
7. Compute at full precision; round only in the final projection, never inside
intermediate CTEs.
8. Be explicit about truncation (`CAST AS INT` truncates; use explicit
rounding when rounding is intended).
9. Distinguish "average of per-group averages" (macro: `AVG(group_metric)`)
from "overall/weighted average" (micro: `SUM(num)/SUM(den)`) based on the
question's wording.
**Answer completeness / interpretation**
10. "top / highest / most / lowest" → return only the winning row(s) (e.g.
`RANK() = 1` / `QUALIFY`), not the full ranked list, unless a list is asked
for.
11. "for each X / per X / by X" → exactly one row per X; don't collapse to a
single value unless the question says "overall" or "total across X".
12. When a question asks for inputs and a derived value ("X, Y, and their
ratio"), include the inputs as columns alongside the derived value.
13. When grouping by a human-readable label (a name), also expose the entity's
identifier — identity, not just the label, is part of the result.
14. When a result is unexpectedly empty, relax filters one at a time to find
which predicate removed the rows.
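Two of the rules above (rule 2 on type mismatches, rule 9 on macro vs. micro averages) are easy to demonstrate on a toy table. The sketch below uses SQLite through Python's standard `sqlite3` module, with an invented `orders` table. The silent non-match shown for rule 2 is SQLite's behavior; other engines may coerce the value or raise an error instead, so treat it as one concrete instance of the general rule rather than a cross-dialect guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT,
                         order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'north', '2023-01-15', 10.0),
        (2, 'north', '2023-02-03', 20.0),
        (3, 'north', '2023-02-20', 30.0),
        (4, 'south', '2023-03-09', 100.0);
""")


def scalar(sql: str):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# Rule 2: substr() yields text and the literal is an integer, so in SQLite
# the comparison is never true and the filter silently matches nothing.
mismatched = scalar(
    "SELECT COUNT(*) FROM orders WHERE substr(order_date, 1, 4) = 2023"
)
# Casting to the real type first makes the comparison mean what it says.
cast_first = scalar(
    "SELECT COUNT(*) FROM orders "
    "WHERE CAST(substr(order_date, 1, 4) AS INTEGER) = 2023"
)

# Rule 9: macro average (mean of per-region means) vs. micro (overall mean).
macro = scalar(
    "SELECT AVG(region_avg) FROM "
    "(SELECT AVG(amount) AS region_avg FROM orders GROUP BY region)"
)
micro = scalar("SELECT SUM(amount) / COUNT(*) FROM orders")
```

With this data the mismatched filter returns 0 rows while the cast version returns all 4, and the two averages differ (60.0 vs. 40.0) because the single large `south` order carries half the weight in the macro average but only a quarter of the rows.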
## Acceptance criteria
- The shipped `analytics/SKILL.md` contains the rules above, phrased as general
truths with **no reference to any benchmark, gold answer, or scoring
comparator**.
- Existing interactive guidance (compact result tables, summaries,
clarification prompts, the tool-order workflow) is preserved — the skill must
still read well for an interactive human-facing analysis session.
- None of the excluded items (output-shape contract, `MAX(date)` anchoring,
grader-driven advice) appear.
- Skill stays within a reasonable size; group the new rules under clear
sub-headings so they're scannable.
## Benchmark context (motivation only)
On the Spider 2.0-Lite sqlite subset, the solver produced 0 execution errors
but ~50 result mismatches; a large share traced to exactly these gaps
(premature rounding, string-vs-number compares, non-deterministic window
ordering, returning full lists for "top" questions, dropping inputs to derived
values). These are generic SQL-authoring defects — fixing them in the skill
improves ktx for everyone and, as a side effect, the benchmark.


@ -1,83 +0,0 @@
# Per-dialect SQL syntax notes (dialect-aware, scoped to the connection)
> Intake draft. Companion to `specs/07-analytics-skill-sql-craft.md`, which kept
> the analytics SQL craft dialect-agnostic and explicitly deferred per-dialect
> syntax here.
## Problem
Spec 07 deliberately keeps the analytics SQL-authoring craft
**dialect-agnostic** — every rule must read correctly on any engine. But a lot of
*real* correctness depends on dialect-specific syntax that spec 07 excludes and
defers to this follow-up:
- **Snowflake:** `DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase identifiers,
VARIANT colon-paths.
- **BigQuery:** backtick FQTNs, `_TABLE_SUFFIX` for sharded tables, `QUALIFY`.
- **sqlite:** `strftime`/`julianday` for dates, no `QUALIFY`.
This guidance is genuinely useful to an agent writing SQL against a live
database, but it must **not** pollute the flat dialect-agnostic skill — an agent
querying sqlite should never see Snowflake VARIANT syntax. It belongs in a
**dialect-aware** location, surfaced only for the dialect the active connection
actually uses.
## Generic use case
Any ktx project whose connections span more than one warehouse engine (e.g. a
Snowflake warehouse + a BigQuery export + a local sqlite extract). When the agent
writes SQL for a given connection, it should get that engine's syntax
conventions — and nothing for the engines it isn't querying.
## Requirements
1. **Per-driver dialect notes.** Author concise, correct syntax notes per
supported driver: FQTN form, identifier quoting/case, date/time functions,
top-N / window-filtering idiom, semi-structured access. These are genuine
per-engine invariants, so enumerating them per driver is acceptable (unlike a
denylist of bad specifics).
2. **Scope to the active dialect, derived from state.** Which notes the agent
sees must be selected from the connection's configured driver/dialect
(`ktx.yaml` connections / the connector registry), not guessed and not shown
all at once. The flat analytics skill stays dialect-agnostic (spec 07
invariant preserved).
3. **Delivery mechanism (enabling sub-requirement).** The shipped skill is
installed as a **single `SKILL.md`** per target (`setup-agents.ts` /
`readAnalyticsSkillContent`). Surfacing per-dialect notes on demand needs one
of two approaches; the refinement pass should compare them before committing:
- **Multi-file skill delivery** — bundle `reference/<dialect>.md` files and
have the skill point to the one matching the connection. Requires extending
`setup-agents.ts` to copy a skill *directory* (Claude Code, Codex, universal
`.agents`) and a multi-file zip (Claude Desktop), a **flatten/concatenate
transform** for the single-file targets (Cursor `.mdc`, OpenCode `.md`), and
**per-file manifest entries** for clean uninstall. This is the
install-mechanism improvement spec 07's Model section flags as future work.
- **Dynamic MCP delivery** — an MCP surface returns the dialect hints for a
given `connectionId` (the MCP layer already resolves the connection's
dialect), so no install change is needed and Cursor/OpenCode get identical
behavior. May be the lower-cost, more uniform path; weigh it first.
4. **No dialect syntax leaks into the dialect-agnostic skill.** Spec 07's
acceptance criterion (no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in
`analytics/SKILL.md`) stays green. This work adds a *separate* dialect-aware
channel; it does not amend the flat skill.
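Requirement 2 (select notes from the connection's configured driver) amounts to a lookup keyed by state rather than a guess. A minimal sketch, assuming the connection config exposes a `driver` value per connection id; the dict shape, the `dialect_hints` name, and the exact wording of the notes are hypothetical, with the note topics taken from the list in the Problem section.

```python
from typing import Dict, List

# Per-driver notes; topics mirror the Problem section, wording is illustrative.
DIALECT_NOTES: Dict[str, List[str]] = {
    "snowflake": [
        "Use DB.SCHEMA.TABLE fully qualified names.",
        "Double-quote lowercase identifiers.",
        "Access VARIANT fields with colon-paths.",
    ],
    "bigquery": [
        "Use backtick-quoted fully qualified names.",
        "Filter sharded tables with _TABLE_SUFFIX.",
        "QUALIFY is available for window filtering.",
    ],
    "sqlite": [
        "Use strftime/julianday for date arithmetic.",
        "QUALIFY is not available; filter window results in an outer query.",
    ],
}


def dialect_hints(connections: Dict[str, Dict[str, str]],
                  connection_id: str) -> List[str]:
    """Return notes for the active connection's driver only."""
    driver = connections[connection_id]["driver"]  # derived from config
    return DIALECT_NOTES.get(driver, [])           # unknown driver: no notes


# Hypothetical project with two engines.
connections = {
    "warehouse": {"driver": "snowflake"},
    "extract": {"driver": "sqlite"},
}
sqlite_hints = dialect_hints(connections, "extract")
```

The same function body would serve either delivery mechanism in requirement 3: a multi-file skill would use it to pick which reference file to point at, and an MCP surface would return its result directly.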
## Acceptance criteria
- An agent querying a sqlite connection gets sqlite date idioms and never sees
Snowflake/BigQuery-only syntax; an agent querying Snowflake gets
FQTN/identifier/VARIANT guidance.
- The dialect shown is **derived from the connection's configured driver**, not
hardcoded per project and not guessed.
- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are
unaffected.
- Whichever delivery mechanism is chosen installs/serves correctly across **all**
supported agent targets, including the single-file Cursor/OpenCode shape.
## Benchmark context (motivation only)
The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake
(`DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase cols, VARIANT colon-paths),
BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite
(`strftime`/`julianday`). That content is real and useful but engine-specific;
spec 07 kept it out of the flat skill and deferred it here so the
dialect-agnostic rules stay clean.


@ -1,150 +0,0 @@
# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill
## Problem
The `ktx-analytics` skill already carries a fan-out rule (spec 07, rule 4:
*"Avoid fan-out joins — add columns only from tables already at the target
grain, or pre-aggregate to that grain before joining; a join that multiplies
rows quietly inflates every downstream `SUM`/`COUNT`"*). In practice the agent
honors it on a single join but still **silently fans out on multi-hop join
chains**, where the inflation is one or two joins removed from the aggregate and
therefore much harder to notice.
The failure shape: a metric that lives at a *coarse* grain (e.g. one row per
parent record) is counted/summed *after* the parent has been joined down to a
*finer* grain (e.g. one row per child line). Every parent-level value is then
duplicated by its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by an
amount that depends on the data — runnable SQL, plausible-looking number,
quietly wrong.
The rule today is stated as a *prohibition* ("avoid"). It needs to be a
*detect-and-fix habit*: a concrete multi-hop example of the trap, and an active
verification step the agent runs while composing, not just an instruction to be
careful.
## Generic use case (independent of any benchmark)
An analyst on any production warehouse asks: *"How many orders are there per
region?"* where the path from region to the order's detail runs through several
hops (region → store → order → order line). The honest answer counts each order
once. If the query descends to the line-level table along the way (e.g. for a
filter), each order is counted once **per line on the order**, inflating the
per-region total. Attribution here is unambiguous — each order belongs to exactly
one store and thus one region — so the *only* thing that can go wrong is the row
multiplication, which is exactly what makes it a clean teaching case. This is one
of the most common silently-wrong analytics mistakes on normalized schemas; it
is not specific to any dataset, dialect, or benchmark.
## Requirements
This extends the existing `<sql_craft>` "Composition" guidance in the
`ktx-analytics` skill (spec 07). Additive only; keep it inline, dialect-agnostic,
and stated as a heuristic-plus-why (consistent with spec 07's style).
1. **Generalize the fan-out rule to multi-hop chains.** Make explicit that the
danger is *cumulative*: any one-to-many hop on the path between the table that
owns a measure and the aggregate inflates that measure, even when the
offending join is several hops away from the `SUM`/`COUNT`. The fix is the
same as the single-hop case — **pre-aggregate the measure to its own grain in
a CTE, then join the already-aggregated result** — but the agent must apply it
per measure-owning table along the whole chain, not just at the final join.
2. **Add a verification habit, not just a prohibition.** While composing, the
agent should confirm a join did not change the grain it intends to aggregate
at — e.g. check that the row count (or the count of the aggregate's key) is
unchanged across a join that is supposed to be one-to-one / many-to-one, and
pre-aggregate the finer table to grain when it is one-to-many. This is the same
"build incrementally and check each layer" discipline spec 07 already endorses,
pointed specifically at grain preservation.
**Pre-aggregate is the general fix; `COUNT(DISTINCT)` is a count-only
shortcut.** Pre-aggregating the finer table to the measure's grain in a CTE and
then joining one-to-one is the remedy that works for every aggregate
(`COUNT`/`SUM`/`AVG`). `COUNT(DISTINCT <key>)` is a valid one-liner *for counts
only* — it must NOT be generalized to a fanned-out `SUM`/`AVG`, because two
rows can legitimately hold equal amounts and `DISTINCT` would wrongly collapse
them. State this trap explicitly; a naïve "just use `COUNT(DISTINCT)`" rule is
silently wrong for sums.
3. **One concrete, generic multi-hop example.** Include a short worked example
that shows the inflation and the fix. It must use an **invented, generic
schema** — **no benchmark table names, no benchmark SQL, and no benchmark
result values** (see "Leak-safety" below — hard constraint). The example must:
(a) use a **plain `COUNT`** (not an average) so it isolates the fan-out lesson
and does not entangle the skill's separate *macro-vs-micro average* rule; and
(b) use a chain with **unambiguous single-owner attribution** so the only thing
that can go wrong is row multiplication. The intended example is the chain
`regions → stores → orders → order_lines` answering *"how many orders per region
include at least one backordered line"* — each order belongs to exactly one
store and thus exactly one region, so attribution is clean; the line-level
filter gives `order_lines` a genuine reason to be joined (so the fix is the
pre-aggregate remedy, not "drop the join"), and that join sits **several hops
below** the region-level COUNT (the multi-hop point):
```sql
-- "How many orders per region include at least one backordered line?"
-- (order_lines is genuinely needed here — for the backordered filter — so the
-- fix is NOT "just drop the join".)
-- WRONG: the order_lines join is one row per matching line, joined several hops
-- BELOW the COUNT. An order with 3 backordered lines is counted 3 times, so the
-- per-region total is inflated by backordered-lines-per-order — silently wrong.
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered -- one-to-many: fan-out
GROUP BY r.region_id;
-- RIGHT (general remedy): collapse the finer table to the measure's grain in a
-- CTE FIRST, then join one-to-one so nothing multiplies. This same shape works
-- for SUM/AVG, not just COUNT.
WITH qualifying_orders AS ( -- back to ONE row per order
SELECT DISTINCT order_id FROM order_lines WHERE is_backordered
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN qualifying_orders q ON q.order_id = o.order_id
GROUP BY r.region_id;
-- Count-only shortcut: COUNT(DISTINCT o.order_id) over the WRONG query also works
-- HERE. But it is counts-only — a fanned-out SUM/AVG of a per-order measure (e.g.
-- summing each order's shipping_fee after joining lines) must pre-aggregate;
-- DISTINCT would wrongly merge two orders that happen to share the same fee.
```
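To see the inflation and the grain check from requirement 2 with actual numbers, the same invented schema can be loaded into an in-memory SQLite database. The rows below are made up for illustration (one region, one store, three orders), and the helper name `one` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY);
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region_id INTEGER);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER);
    CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER,
                              is_backordered INTEGER);
    INSERT INTO regions VALUES (1);
    INSERT INTO stores VALUES (10, 1);
    INSERT INTO orders VALUES (100, 10), (101, 10), (102, 10);
    -- order 100: three backordered lines; 101: one; 102: none.
    INSERT INTO order_lines VALUES
        (1, 100, 1), (2, 100, 1), (3, 100, 1), (4, 101, 1), (5, 102, 0);
""")

CHAIN = """
    FROM regions r
    JOIN stores s ON s.region_id = r.region_id
    JOIN orders o ON o.store_id = s.store_id
"""
FANNED = CHAIN + """
    JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered
"""


def one(sql: str):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# WRONG: counts joined rows, so order 100 is counted once per line.
wrong = one(f"SELECT COUNT(*) {FANNED}")

# RIGHT: collapse order_lines to one row per order before joining.
right = one(f"""
    WITH qualifying_orders AS (
        SELECT DISTINCT order_id FROM order_lines WHERE is_backordered
    )
    SELECT COUNT(*) {CHAIN}
    JOIN qualifying_orders q ON q.order_id = o.order_id
""")

# Grain check: if the join preserved the order grain, the joined row count
# would equal the number of distinct order keys in the joined rows.
rows_after_join = one(f"SELECT COUNT(*) {FANNED}")
keys_after_join = one(f"SELECT COUNT(DISTINCT o.order_id) {FANNED}")
fan_out_detected = rows_after_join != keys_after_join
```

With these rows the fanned-out count is 4 while only 2 orders qualify, and the row-count vs. key-count comparison flags the mismatch before any aggregate is trusted.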
## Leak-safety (hard constraint on this spec and its example)
The benchmark's gold answers must never appear in ktx. The worked example must
be a **synthetic, generic schema invented for teaching** — not the tables,
column names, query, or numeric results of any Spider 2.0-Lite question. The
example demonstrates the *pattern* (coarse-grain measure counted after a
one-to-many join), which is universal; it must be reconstructable from first
principles by anyone, with zero reference to benchmark data. A reviewer should
be able to read the example and find nothing that ties it to a specific
benchmark instance.
## Acceptance criteria
- The skill's `<sql_craft>` Composition section states the multi-hop
generalization of the fan-out rule and a grain-verification habit, inline and
dialect-agnostic.
- It includes exactly one short, **generic** worked example (wrong vs.
pre-aggregated-right) using an invented schema, with no benchmark-derived
identifiers or values.
- No new tool, flag, or config; this is skill-content only (additive to spec 07).
- Existing analytics-skill content tests are updated to cover the added rule's
presence (mirroring spec 07's `analytics-skill-content.test.ts`).
## Benchmark context (motivation only)
Multi-hop aggregation questions (counting/averaging a coarse-grained measure
reached through several one-to-many joins) are a recurring source of
result-mismatch failures in the SQLite subset: the agent produces runnable SQL
with the right tables but a fan-out-inflated number. These are correctness
failures, not knowledge or schema-discovery failures (zero execution errors in
the latest run), so the fix belongs in the product's authoring craft — where it
also helps any real analyst — not in a benchmark-specific prompt.

@ -1,65 +0,0 @@
# Panel/period completeness — emit the full set of groups, not only the populated ones
## Problem
When a question asks for a result *per period* or *per category* ("orders for each
month of 2023", "revenue by region", "count per status"), the natural `GROUP BY`
only returns groups that actually have rows. Periods/categories with **zero**
activity silently vanish, so a "12 months" answer comes back with 9 rows and the
ones that should read `0` are simply absent. The agent writes runnable SQL with
the right aggregate but an **incomplete panel**.
This is a universal reporting correctness issue: a monthly report with missing
months, or a category breakdown missing the empty categories, is wrong for any
analyst — and it is also a frequent result-mismatch shape on the benchmark.
## Generic use case (independent of any benchmark)
"How many orders were placed in each month of 2023?" must return **12 rows** even
if March had no orders (March = 0), not 11 rows. "Sales per region" should include
regions with no sales (as 0/NULL) when the question asks for *each* region.
## Requirements
Additive to the `ktx-analytics` skill's `<sql_craft>` "Answer completeness /
interpretation" group (consistent with spec 07's inline, dialect-agnostic, heuristic
+ why style).
1. **Recognize "full-panel" phrasing.** Cues like *each / every / per <period> /
for all <category> / by month* signal that the answer's row set should be the
**complete** set of periods or categories in scope, not just those present in
the filtered fact rows.
2. **Build a spine, then LEFT JOIN.** Generate the full set of expected
groups — a date/number series via a recursive CTE for periods, or the distinct
dimension values from the authoritative dimension table for categories — and
LEFT JOIN the aggregated facts onto it, defaulting missing measures with
`COALESCE(metric, 0)` (or NULL when 0 would be wrong). *Why:* a plain
`GROUP BY` over the fact rows can only emit groups that have at least one fact row.
3. **Don't over-apply.** When the question asks only about groups that exist
("which months had orders"), the spine is unnecessary; the cue is *each/all*
vs *which*.
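
The spine recipe above can be sketched as follows. This is a minimal illustration, not the skill's wording: it assumes SQLite (via Python's `sqlite3`) and an invented `orders(order_id, order_date)` table with made-up rows, so the recursive-CTE date arithmetic would need adapting in other dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
    -- Synthetic data: no orders in March, nor from May through December.
    INSERT INTO orders (order_date) VALUES
        ('2023-01-15'), ('2023-01-20'), ('2023-02-03'), ('2023-04-11');
""")

FULL_PANEL_SQL = """
WITH RECURSIVE months(month_start) AS (   -- spine: one row per month of 2023
    SELECT '2023-01-01'
    UNION ALL
    SELECT date(month_start, '+1 month') FROM months
    WHERE month_start < '2023-12-01'
),
monthly AS (                              -- facts, aggregated to the month grain
    SELECT strftime('%Y-%m', order_date) AS ym, COUNT(*) AS n_orders
    FROM orders
    WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
    GROUP BY ym
)
SELECT strftime('%Y-%m', m.month_start) AS ym,
       COALESCE(f.n_orders, 0) AS n_orders  -- empty months read 0, not absent
FROM months m
LEFT JOIN monthly f ON f.ym = strftime('%Y-%m', m.month_start)
ORDER BY m.month_start;
"""

rows = conn.execute(FULL_PANEL_SQL).fetchall()
for ym, n_orders in rows:
    print(ym, n_orders)
```

The facts are filtered and aggregated inside their own CTE before the LEFT JOIN; a filter on the fact columns placed in the outer `WHERE` would discard the spine rows that have no match and shrink the panel again.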
## Leak-safety (hard constraint)
Any worked example must use a **synthetic generic schema** (e.g. an `orders`
table with an `order_date`) and demonstrate only the *pattern* (spine + LEFT JOIN
+ COALESCE). No benchmark table names, SQL, or result values. The behavior is
reconstructable from first principles and tied to no specific instance.
## Acceptance criteria
- `<sql_craft>` states the full-panel cue, the spine + LEFT JOIN + COALESCE recipe,
and the over-application guard — inline and dialect-agnostic.
- At most one short generic example (recursive-CTE date spine or distinct-dimension
spine), no benchmark-derived content.
- Skill-content only; analytics-skill content tests updated to cover the rule.
## Benchmark context (motivation only)
Per-period / per-category questions where some periods are empty produce
short-row result mismatches in the SQLite subset. The fix is a universal
reporting habit (complete panels), so it belongs in the product's craft, where it
also helps real analysts — not in a benchmark-specific prompt. Related to spec 11
(rolling/cumulative windows need a complete date spine to be correct).

@ -1,73 +0,0 @@
# Time-series window craft — running totals, rolling-N (min-periods), period-over-period
## Problem
A large share of analytics questions are time-series shaped: a **running/cumulative
balance**, a **rolling N-day average**, or **period-over-period growth**. The agent
knows window functions exist (spec 07 covers determinism and window-then-filter) but
gets the *time-series specifics* wrong:
- cumulative balance computed without a frame that starts at `UNBOUNDED PRECEDING`
(or relying on the default frame, which lumps together rows tied on the order key);
- "rolling 30-day" implemented as `ROWS BETWEEN 29 PRECEDING AND CURRENT ROW` over
**gappy** daily data, so the window covers the wrong calendar range when days are
missing;
- no **minimum-periods** handling — a rolling average is reported before the window
is actually full;
- "growth vs previous period" without `LAG`, or comparing to the wrong neighbor.
These are runnable-but-wrong; the structure is close, the edge case diverges.
## Generic use case (independent of any benchmark)
- "Each account's month-end running balance over 2023" — cumulative sum of monthly
net over an ordered window.
- "30-day rolling average of daily revenue, only once 30 days of history exist."
- "Month-over-month revenue growth rate."
All three are bread-and-butter for any analyst on any time-series table.
## Requirements
Additive to the `ktx-analytics` skill's `<sql_craft>` "Window functions" group
(inline, dialect-agnostic, heuristic + why).
1. **Cumulative / running total.** `SUM(x) OVER (PARTITION BY k ORDER BY t ROWS
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, with a complete tie-breaker in
`ORDER BY` (spec 07 rule). *Why:* with an `ORDER BY` and no explicit frame, the
default is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, which includes
every peer row tied on the `ORDER BY` key, so tied rows all report the same total.
2. **Rolling window over time, not over rows.** When "rolling N days/months" is
asked, the window must span a calendar range. Over gappy data, either build a
complete date spine first (see spec 10) so `ROWS BETWEEN n-1 PRECEDING AND
CURRENT ROW` equals the intended span, or use a range/self-join keyed on the
date. *Why:* row-count frames over missing dates silently measure the wrong span.
3. **Minimum periods.** When the question says "only after N periods of data" (or
it is implied by a rolling metric), emit NULL/skip until the window is full
(e.g. guard on `COUNT(*) OVER (...) = N`). *Why:* a partial early window is not
the requested metric.
4. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)`
for prior-period comparisons; growth rate = `(cur - prev) / prev` computed at
full precision (round only at the end). Guard divide-by-zero/NULL prev.
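
The four recipes above can be sketched together on toy data. This is an illustrative sketch only: it assumes SQLite 3.25 or later (window functions) through Python's `sqlite3`, and the tables `account_txns`, `daily_revenue` and `monthly_revenue` with their rows are invented. Treating a missing day as zero revenue is also an assumption; a real question may want NULL instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account_txns (txn_id INTEGER PRIMARY KEY, account_id INTEGER,
                               txn_date TEXT, net INTEGER);
    INSERT INTO account_txns VALUES          -- two txns tie on 2023-01-05
        (1, 1, '2023-01-05', 100), (2, 1, '2023-01-05', -30), (3, 1, '2023-01-09', 50);
    CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount INTEGER);
    INSERT INTO daily_revenue VALUES         -- 2023-01-03 is missing (gappy data)
        ('2023-01-01', 30), ('2023-01-02', 60), ('2023-01-04', 90), ('2023-01-05', 30);
    CREATE TABLE monthly_revenue (month TEXT PRIMARY KEY, revenue INTEGER);
    INSERT INTO monthly_revenue VALUES ('2023-01', 100), ('2023-02', 150), ('2023-03', 120);
""")

# 1. Running total: explicit ROWS frame + tie-breaker vs. the default RANGE frame.
running = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date, txn_id
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS explicit_rows,
           SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date) AS default_frame
    FROM account_txns ORDER BY txn_id
""").fetchall()

# 2 + 3. Rolling 3-day average over a complete date spine, NULL until 3 days exist.
rolling = conn.execute("""
    WITH RECURSIVE spine(day) AS (
        SELECT MIN(day) FROM daily_revenue
        UNION ALL
        SELECT date(day, '+1 day') FROM spine
        WHERE day < (SELECT MAX(day) FROM daily_revenue)
    ),
    filled AS (
        SELECT s.day, COALESCE(d.amount, 0) AS amount
        FROM spine s LEFT JOIN daily_revenue d ON d.day = s.day
    )
    SELECT day,
           CASE WHEN COUNT(*) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) = 3
                THEN AVG(amount) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
           END AS rolling_3d_avg
    FROM filled ORDER BY day
""").fetchall()

# 4. Period-over-period growth with LAG, guarding a zero or NULL previous value.
growth = conn.execute("""
    SELECT month,
           (revenue - LAG(revenue) OVER (ORDER BY month)) * 1.0
               / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) AS growth_rate
    FROM monthly_revenue ORDER BY month
""").fetchall()
```

In `running`, the two transactions dated 2023-01-05 get different totals under the explicit `ROWS` frame but the same total under the default frame, which is the tie behavior the first rule guards against.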
## Leak-safety (hard constraint)
Worked examples must use a **synthetic generic schema** (e.g. `daily_revenue(day,
amount)` or `account_txns(account_id, txn_date, net)`) and show only the *pattern*.
No benchmark table names, SQL, or result values.
## Acceptance criteria
- `<sql_craft>` "Window functions" gains the cumulative, rolling-over-time +
min-periods, and period-over-period recipes — inline, dialect-agnostic.
- At most one or two compact generic examples; no benchmark-derived content.
- Skill-content only; analytics-skill content tests updated.
## Benchmark context (motivation only)
Running-balance / rolling / period-over-period questions are the single largest
result-mismatch cluster in the SQLite subset (financial-transactions style DBs).
The methodology is universal analyst craft, so it belongs in the product's skill
(transfers to real users), not in a benchmark-specific prompt. Depends on spec 10
(date spine) for the gappy-rolling case.

@ -1,61 +0,0 @@
# Parse text-encoded numeric columns before doing math on them
## Problem
Numeric measures are often stored as **text** with human formatting: unit suffixes
(`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators (`"$1,200"`),
percent signs (`"12%"`), or non-numeric sentinels for missing/zero (`"-"`, `"N/A"`,
`""`). Aggregating or comparing such a column directly is silently wrong: string
comparison orders `"100" < "9"`, and a naive `CAST(x AS REAL)` yields `0`, NULL, or
a truncated prefix on the formatted values (SQLite reads `"$1,200"` as `0` and
`"1.2K"` as `1.2`) rather than the intended number.
The agent already samples schemas (spec 07 schema-discovery), but when it sees a
"numeric" column it tends to assume it is a real number type and skips the parse —
so the arithmetic runs on garbage. Runnable, plausible, wrong.
## Generic use case (independent of any benchmark)
A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become `1200 / 3000000
/ 0` before you can sum it or compute a daily change. A `price` stored as
`"$1,299.00"` must become `1299.00` before averaging. This is routine data hygiene
on real, messy production tables.
## Requirements
Extend the `ktx-analytics` skill's `<sql_craft>` "Schema discovery before writing
SQL" group (inline, dialect-agnostic, heuristic + why).
1. **Detect text-encoded numerics during sampling.** When a column that the
question treats as a number is stored as text, sample distinct values to learn
the encodings actually present (suffixes, symbols, separators, sentinels) before
composing — never assume the format from the column name.
2. **Parse and scale before arithmetic.** Strip currency/separator/percent
characters; multiply by the suffix scale (K=10^3, M=10^6, B=10^9); map sentinels
(`-`, `N/A`, empty) to `0` or `NULL` per the question's intent; then cast to a
numeric type. Do this in an early CTE so all downstream math sees clean numbers.
*Why:* string columns compared/aggregated as-is sort lexically and cast to 0 or a
truncated prefix, producing silently wrong results instead of errors.
3. **Confirm coverage.** After parsing, sanity-check that no intended-numeric value
failed to parse (would surface as NULL), to catch an encoding the sample missed.
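
A minimal sketch of the detect, parse/scale, and verify steps above, assuming SQLite through Python's `sqlite3`. The `metrics(label, value_text)` table and its values are made up, sentinels are mapped to `0` (a question could equally call for NULL), and negative numbers and percent signs are deliberately left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metrics (label TEXT PRIMARY KEY, value_text TEXT);
    INSERT INTO metrics VALUES
        ('a', '1.2K'), ('b', '3M'), ('c', '-'), ('d', '$1,299.00'), ('e', '7 units');
""")

# What a naive cast does in SQLite: a truncated prefix, or 0 when no prefix parses.
naive = conn.execute("SELECT CAST('1.2K' AS REAL), CAST('$1,299.00' AS REAL)").fetchone()

PARSE_SQL = """
WITH cleaned AS (                 -- strip currency symbol and thousands separators
    SELECT label, value_text,
           UPPER(TRIM(REPLACE(REPLACE(value_text, '$', ''), ',', ''))) AS s
    FROM metrics
),
split AS (                        -- separate the numeric body from the unit suffix
    SELECT label, value_text, s,
           CASE WHEN s GLOB '*[KMB]' THEN SUBSTR(s, 1, LENGTH(s) - 1) ELSE s END AS body,
           CASE SUBSTR(s, -1) WHEN 'K' THEN 1e3 WHEN 'M' THEN 1e6
                              WHEN 'B' THEN 1e9 ELSE 1 END AS scale
    FROM cleaned
)
SELECT label, value_text,
       CASE WHEN s IN ('', '-', 'N/A') THEN 0.0            -- sentinels
            WHEN body <> '' AND body NOT GLOB '*[^0-9.]*'  -- digits and dots only
                 THEN CAST(body AS REAL) * scale
       END AS value_num                                    -- anything else stays NULL
FROM split
ORDER BY label
"""

rows = conn.execute(PARSE_SQL).fetchall()
parsed = {label: value for label, _, value in rows}
# Coverage check: sentinels became 0, so any NULL is an encoding the parser missed.
unparsed = [text for _, text, value in rows if value is None]
print(parsed, unparsed)
```

Because the parser refuses anything it does not recognize instead of casting it, the unexpected `'7 units'` value surfaces in `unparsed` rather than silently contributing a wrong number.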
## Leak-safety (hard constraint)
Worked examples must use a **synthetic generic schema** and made-up values (e.g. a
`metrics(label, value_text)` table with `"1.2K"`, `"-"`). No benchmark table names,
SQL, or result values; the parsing pattern is universal and tied to no instance.
## Acceptance criteria
- `<sql_craft>` schema-discovery gains the detect → parse/scale → verify guidance —
inline, dialect-agnostic, with at most one short generic example.
- No benchmark-derived content. Skill-content only; content tests updated.
## Benchmark context (motivation only)
At least one SQLite-subset question stores trading volume as suffix-encoded text
("K"/"M", "-" for zero) and fails because the agent aggregates the raw strings. The
fix — parse messy encodings before math — is universal data hygiene that helps any
analyst, so it belongs in the product's craft rather than a benchmark-specific
prompt.

@ -1,105 +0,0 @@
# Enforce answer-output completeness with a final pre-emit check in the analytics skill
## Problem
The single largest correctness failure mode is **incomplete output**: the query runs and the
methodology is roughly right, but the result is missing columns the question asked for. Three
recurring sub-patterns:
1. **Multi-part questions answered partially.** A question that asks for several things ("report
the highest *and* the lowest month, each with its count and average, *and* the difference")
comes back with only the first part — one column instead of the several requested.
2. **Identity dropped.** Grouping by a human-readable name but not projecting the entity's
identifier (e.g. a product name without its product id, a customer name without its
customer id).
3. **Inputs to a derived value dropped.** Returning a ratio / percentage / difference but not
the underlying counts the question also asked for.
Sub-patterns 2 and 3 are **already covered by `<sql_craft>` rules** in the analytics skill
(spec 07: *"expose identity, not just the label"* and *"keep the inputs to a derived value"*),
yet they are frequently **not applied**. So the gap is not missing knowledge — it is that these
rules are passive heuristics buried in a list, and the agent doesn't reliably check them before
finalizing. The fix is to (a) add the missing multi-part-completeness rule and (b) turn
output-completeness into an **explicit final verification step** the agent performs before
emitting SQL.
This is reinforced by evidence that the failure is **model-independent**: a markedly stronger
model produced the same incomplete-output mistakes on these questions, which means it is a
craft/enforcement gap, not a capability gap.
## Generic use case (independent of any benchmark)
An analyst is asked: *"For each region, report the highest and the lowest monthly order count,
and the difference between them."* A complete, useful answer has one column each for the
region's id, its name, the highest count, the lowest count, and the difference: five columns.
Returning just the region and a single number answers only part of the request. This is a
universal expectation
on any database: answer **every** part of a multi-part request, identify the entities, and show
the inputs behind any derived figure.
## Requirements
Additive to the analytics skill's `<sql_craft>` "Answer completeness / interpretation" group and
its workflow's validate step (inline, dialect-agnostic, heuristic + why, consistent with spec 07).
1. **Multi-part / multi-output completeness (new rule).** When a question requests several
outputs — a list ("A, B, and C"), paired extremes ("the highest *and* the lowest"), or a
value plus its components ("X, Y, and their ratio") — the final projection must contain a
column for **each** requested output. *Why:* answering only the first clause is the most common
way a runnable query is still wrong; the grain and methodology can be perfect yet the answer
is short by columns.
2. **Fold the existing identity / inputs rules into the same completeness notion.** The
already-shipped rules — project the entity **identifier** alongside any human-readable label,
and **keep the inputs** to any derived value — are part of output completeness; reference them
from the check below so they are actually applied, not just listed.
3. **Add an explicit final completeness check (the enforcement mechanism).** Before emitting the
final SQL, the skill should have the agent **re-read the question and confirm the projection
covers**: every named metric/attribute; the identifier of every grouped/named entity; every
input to a derived value; all at the grain the question specifies. This is a short, concrete
checkpoint at the validate step — the point is to convert the passive heuristics into an active
pre-finalize verification. (Do **not** add unrequested/extra columns to be "safe" — that is
grader-gaming; the check is about matching the request exactly, not padding it.)
Generic teaching example (synthetic schema — see Leak-safety):
```sql
-- "For each region, report the highest and lowest monthly order count and their difference."
-- WRONG: answers only the first clause; no region id, no lowest, no difference.
SELECT r.region_name, MAX(m.monthly_orders) AS highest
FROM regions r
JOIN region_monthly m ON m.region_id = r.region_id
GROUP BY r.region_name;
-- RIGHT: one column per requested output + the entity's identity, at the region grain.
SELECT r.region_id, r.region_name,
       MAX(m.monthly_orders) AS highest_monthly_orders,
       MIN(m.monthly_orders) AS lowest_monthly_orders,
       MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference
FROM regions r
JOIN region_monthly m ON m.region_id = r.region_id
GROUP BY r.region_id, r.region_name;
```
## Leak-safety (hard constraint)
The example must use an **invented, generic schema** (`regions`, `region_monthly`) and made-up
columns — **no benchmark table names, SQL, or result values.** It teaches the *pattern* (cover
every requested output + identity + inputs), which is universal and tied to no specific instance.
## Acceptance criteria
- The skill states the multi-part-completeness rule and a concrete **final completeness check**
(re-read question → verify metrics + identity + inputs + grain), inline and dialect-agnostic,
cross-referencing the existing identity/inputs rules so they're enforced.
- Includes the over-projection guard (don't pad with extra columns — that's grader-gaming).
- One short generic example (wrong vs complete); no benchmark-derived content.
- Skill-content only; analytics-skill content tests updated to cover the new rule + check.
## Benchmark context (motivation only)
In the latest SQLite-subset run, **incomplete output was the single largest failure bucket
(~13 of 51 voted failures)**: multi-part questions answered partially, and identity / derived-value
inputs dropped — the latter two being spec-07 rules that already exist but weren't applied. A
probe with a much stronger model reproduced the *same* incomplete-output failures, confirming this
is a craft-enforcement gap rather than a model-capability one. The fix — answer every requested
part, identify entities, keep inputs — is universal analyst craft, so it belongs in the product
skill (and transfers to real users), enforced as a final check rather than left as a passive hint.

@ -1,116 +0,0 @@
# Structured, leveled logging for the ktx MCP server
> **Scope: observability only.** This spec is about *seeing* what the MCP server
> does (which tool, what params, when, how long, outcome). *Preventing* a runaway
> query from blocking the server (off-event-loop / interruptible query execution)
> is a separate concern — see "Non-goals" and the sibling spec note below.
## Problem
The ktx MCP server (`packages/cli/src/mcp-http-server.ts` +
`mcp-server-factory.ts`; raw `node:http` + `@modelcontextprotocol/sdk`
`StreamableHTTPServerTransport`) emits almost no operational logs. There is no
server-side record of **which MCP tool was called, with what parameters, when,
how long it took, or whether it succeeded** — nor of session open/close or
transport errors. When a tool call is slow, hangs, or a client connection drops
("Transport channel closed"), an operator has no trail to diagnose it and must
resort to process sampling / `lsof` / guesswork — and the offending input
(e.g. the exact SQL) is typically unrecoverable.
## Generic use case
Anyone running a long-lived ktx MCP server — a developer's local instance, a
shared team server, or a hosted deployment — needs observability into tool-call
activity to:
- diagnose slow or hung tool calls (which `sql_execution` ran, against which
connection, with what SQL, for how long);
- explain client-visible connection failures from the server side (session
lifecycle, transport-closed events);
- audit what agents asked the server to do;
- spot patterns (hot tools, slow connections, error rates).
This is standard production-server hygiene; the server currently provides none.
## Requirements (sketch — refine when picked up)
1. **One structured (JSON) logger, low overhead.** Suggested `pino` (orientation
only; implementer owns the choice). A single shared instance; write **JSON to
stdout** (12-factor — the launcher/aggregator routes it). No in-app file
rotation. Optional human-readable pretty output only when attached to a TTY
(dev).
2. **Configurable level via env** (e.g. `KTX_LOG_LEVEL`, default `info`; `debug`
for diagnosis) — verbose logging on demand without code changes.
3. **Per-session / per-call context** via child loggers: every line carries a
`sessionId` (from the transport session) and, for tool calls, a `callId` +
`tool` name, so one session's or call's activity can be traced/grepped.
4. **Tool-call logging — START logged BEFORE execution, COMPLETION after.** For
every MCP tool invocation:
- on entry: log `{ tool, params, sessionId, callId }` **before** running the
handler (so the record exists even if the handler never returns);
- on exit: log `durationMs` + outcome (ok with result size, or error with
stack).
This makes a **hung / never-returning call identifiable**: a start with no
matching completion is the culprit, with its exact parameters and timestamp.
This matters specifically because handlers like `sql_execution` run a
*synchronous* better-sqlite3 query — a runaway query blocks the process and no
completion is ever logged, so the start line (flushed before the blocking
call) is the only record. For `sql_execution`, `params` should include the SQL
text (the most useful field). Emit a **WARN** when a *completed* call exceeds a
configurable slow threshold (e.g. `KTX_SLOW_TOOL_MS`).
5. **Connection / session lifecycle:** log session open/close (with `sessionId`)
and transport errors (the SDK's closed-channel / "Transport channel closed"
events) so client-side connection failures have a server-side counterpart.
6. **Error logging** with structured stack traces (a standard error serializer),
not bare strings.
7. **Light redaction — credentials only** (bearer token, connection
passwords/secrets). SQL text and tool params are *not* secrets and must be
logged. Do not over-redact.
8. **Synchronous logging is fine.** The server uses a synchronous DB client, so
logging need not be async; prefer the simpler synchronous stdout path over
async/worker transports (which can lose buffered lines on a hard crash). Do
not introduce async-logging machinery.
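
The start-before-execute pattern in requirement 4 can be sketched in a language-neutral way. The sketch below is Python, not the server's TypeScript, and is not the proposed implementation: the event names `tool.start` / `tool.end` come from the acceptance criteria, while `logged_tool_call`, `REDACTED_KEYS`, `emit_stdout` and the `slow_ms` default are hypothetical names chosen for illustration.

```python
import json
import sys
import time
import traceback
import uuid

REDACTED_KEYS = {"token", "password", "secret"}  # hypothetical credential field names


def emit_stdout(record):
    """Write one JSON line and flush immediately, so it survives a later hang."""
    sys.stdout.write(json.dumps(record) + "\n")
    sys.stdout.flush()


def redact(params):
    """Mask credentials only; SQL text and other params stay visible."""
    return {k: "[redacted]" if k.lower() in REDACTED_KEYS else v
            for k, v in params.items()}


def logged_tool_call(tool, params, session_id, handler, emit=emit_stdout, slow_ms=5000):
    base = {"tool": tool, "sessionId": session_id, "callId": uuid.uuid4().hex}
    # START is logged BEFORE the handler runs: if the handler never returns,
    # this line is the only record of the call and its parameters.
    emit({"event": "tool.start", "level": "debug", "ts": time.time(),
          "params": redact(params), **base})
    started = time.monotonic()
    try:
        result = handler(params)
    except Exception:
        emit({"event": "tool.end", "level": "error", "outcome": "error",
              "durationMs": round((time.monotonic() - started) * 1000),
              "stack": traceback.format_exc(), **base})
        raise
    duration_ms = round((time.monotonic() - started) * 1000)
    # A completed call slower than the threshold is promoted to a warning.
    emit({"event": "tool.end", "level": "warn" if duration_ms > slow_ms else "debug",
          "outcome": "ok", "durationMs": duration_ms, **base})
    return result
```

A call such as `logged_tool_call("sql_execution", {"sql": "SELECT 1"}, "s1", run_query)` (with `run_query` standing in for a real handler) would write a `tool.start` line first; a start line with no `tool.end` sharing its `callId` then identifies a call that never returned.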
## Acceptance criteria (sketch)
- With `KTX_LOG_LEVEL=debug`, invoking any MCP tool produces a `tool.start`
(tool, params, sessionId, callId) and a `tool.end` (durationMs, outcome) line
on the server's stdout, as JSON.
- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a
`tool.start` line carrying its **exact SQL and timestamp** and **no**
`tool.end` — so the offending query is recoverable from the log alone, with no
process sampling.
- A completed tool call slower than the configured threshold emits a WARN with
its duration.
- Session open/close and transport-closed events are logged with the `sessionId`.
- At default level (`info`), routine per-tool lines are suppressed but lifecycle,
slow-call warnings, and errors are present.
- Credentials (bearer token, connection secrets) never appear in logs; SQL and
tool params do.
- No new heavy dependencies beyond the logger; no OpenTelemetry/metrics stack; no
async-transport machinery.
## Non-goals
- **Preventing/interrupting runaway queries** (off-event-loop execution, query
timeouts, worker-thread isolation). That is a *separate* spec; a single
synchronous query that fans out into a massive nested-loop join can peg the
single-threaded server for hours and break new connections — observability
surfaces *which* query, but the fix is execution-model work. (This logging is
also a prerequisite for a future watchdog that detects a `tool.start` with no
`tool.end` past a threshold and recycles the server.)
- Metrics/tracing/OpenTelemetry exporters.
- Forwarding logs to the MCP *client* via the protocol's logging capability
(`notifications/message`, `logging/setLevel`) — a possible later enhancement,
distinct from operational stdout logging.
## Benchmark context (motivation, not a requirement)
Running Spider 2.0-Lite against the MCP server at concurrency, an
adversarial-reviewer-generated query degenerated into a massive nested-loop join;
synchronous better-sqlite3 executed it on the event loop, pegging a server at
~100% CPU for hours and breaking new MCP connections to it ("Transport channel
closed"). We could not determine *which* query, because the server logs nothing
about tool calls — diagnosis required `sample`/`lsof` on the live process and the
exact SQL was never recovered. Structured tool-call logging (especially
start-before-execute) would have turned this into a one-line `grep` of the server
log.

@ -1,131 +0,0 @@
# Bounded query execution (deadline + non-blocking) for read SQL
> Priority: HIGH. Found empirically during a Spider2-lite sqlite run
> (2026-06-18): a single `sql_execution` MCP call wedged a worker at 100% CPU
> for 13+ minutes and never returned. The query
> `SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on the
> `complex_oracle` sqlite database hit a VIEW (`costs ⋈ sales`, 918,843 × 82,112
> rows, joined on a 4-column key with no composite index) whose plan degraded to
> an O(N×M) nested-loop scan. Because the sqlite connector runs
> `better_sqlite3 .all()` **synchronously with no timeout**, it blocked the MCP
> worker's entire event loop: no `tool.end` was ever logged, the port went
> unresponsive, and the query could not be cancelled. One of four eval shards
> stalled until the worker was killed by hand.
## Problem
Two compounding gaps on the read-query path:
1. **No execution deadline.** A single expensive query runs unbounded. This is
handled divergently per connector, with no shared contract: BigQuery has a
real server-side job timeout (`job_timeout_ms`); ClickHouse has an HTTP
`request_timeout`; Snowflake, Postgres, MySQL, and SQL Server bound only
connection/pool *acquisition*, not statement *execution*; SQLite has nothing.
So whether a runaway query is bounded depends entirely on which driver the
caller happened to hit.
2. **In-process engines block the event loop and can't be cancelled.** The
sqlite connector executes on the main thread via synchronous
`better_sqlite3 .all()`. A slow query freezes the whole MCP server (it can't
serve other requests, send progress, or write `tool.end`), and there is no
way to interrupt it: better-sqlite3 exposes no interrupt/cancel API — its
documented mechanism for slow queries is to run them in a **worker thread**,
and the only way to stop a runaway synchronous query is to terminate the
thread executing it.
The net effect is a query that produces a `tool.start` with no matching
`tool.end`, an unresponsive server, and no self-recovery. A row cap (`maxRows`)
does not help — it bounds returned rows, not scan work, and the failing query
returned a single aggregate row.
## Generic use case
Any data agent that lets an LLM author SQL will eventually issue an
accidentally-expensive query — an unindexed or cartesian join, an expensive
VIEW, a wide aggregate over a large fact table. A general-purpose context layer
must bound that and return a clean, fast "query exceeded Ns" error so the agent
can revise (add filters, query base tables, narrow the range) instead of hanging
the tool and the server. This matters for embedded/local warehouses (sqlite,
duckdb) and remote ones alike, and is wholly independent of any benchmark.
## Requirements
1. Every read-query execution path (`executeReadOnly`) enforces a single
canonical execution deadline. One opinionated default; **not** a per-call
user flag. Where a driver already supports a per-connection timeout
(BigQuery `job_timeout_ms`), reuse that as the per-connection override rather
than inventing a parallel knob.
2. On exceeding the deadline the path resolves with a `KtxQueryError`
("query exceeded {N}s") — a finite, decision-reaching outcome, never an
unbounded hang.
3. The deadline is a **shared contract at the connector boundary**, defined once
(on the `executeReadOnly` contract or a shared wrapper at the call site) so
all drivers participate. Bring the existing divergent timeouts (BigQuery job
timeout, ClickHouse request timeout) under this one contract instead of
leaving parallel mechanisms.
4. For in-process engines (sqlite today, any future embedded driver), execution
MUST NOT block the MCP server event loop. Run the query off the main thread
and enforce the deadline by terminating that thread on timeout (the
better-sqlite3-documented approach, since synchronous queries are
uncancellable in-thread). The event loop must stay responsive so `tool.end`
is always written and concurrent requests on the same port are served.
5. Prefer real cancellation over client-side give-up. Where the engine supports
a server-side statement timeout (Postgres `statement_timeout`, MySQL
`max_execution_time`, Snowflake `STATEMENT_TIMEOUT_IN_SECONDS`, ClickHouse
`max_execution_time`, BigQuery job timeout, SQL Server request timeout), set
it so the deadline actually stops work, not merely abandons the promise while
the query keeps running. For in-process engines, thread termination is the
cancellation.
6. The MCP `sql_execution` tool surfaces the timeout as an expected error
(classified as `KtxQueryError`, not a `$exception` fault, consistent with
existing expected-error classification) and logs a `tool.end` with the error
outcome.
7. Read-only enforcement (`assertReadOnlySql`) and the `maxRows` row cap remain
unchanged. The deadline is additive; `maxRows` is not a substitute for it.
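
Requirement 4 (execute off the main thread, terminate on deadline) can be illustrated with a small sketch. It is written in Python rather than the server's TypeScript, and it uses a child process where the spec calls for a worker thread, because Python threads cannot be killed; `run_with_deadline`, `QueryTimeout` and the `fork` start method (POSIX only) are assumptions made for this illustration, not ktx APIs.

```python
import multiprocessing as mp
import sqlite3


class QueryTimeout(Exception):
    """Stand-in for the spec's KtxQueryError ("query exceeded {N}s")."""


def _worker(db_path, sql, send_end):
    # Runs in the child: a runaway query blocks only this process.
    try:
        rows = sqlite3.connect(db_path).execute(sql).fetchall()
        send_end.send(("ok", rows))
    except Exception as exc:
        send_end.send(("error", repr(exc)))


def run_with_deadline(db_path, sql, deadline_s):
    ctx = mp.get_context("fork")          # assumption: POSIX host
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_worker, args=(db_path, sql, send_end), daemon=True)
    proc.start()
    send_end.close()                      # parent keeps only the receiving end
    try:
        if not recv_end.poll(deadline_s): # nothing arrived before the deadline
            proc.terminate()              # termination is the cancellation
            raise QueryTimeout(f"query exceeded {deadline_s}s")
        status, payload = recv_end.recv()
    finally:
        proc.join()                       # reap the child so it is not left spinning
        recv_end.close()
    if status == "error":
        raise RuntimeError(payload)
    return payload


# A deliberately unbounded query: the recursive CTE never stops producing rows.
RUNAWAY_SQL = """
WITH RECURSIVE c(n) AS (SELECT 1 UNION ALL SELECT n + 1 FROM c)
SELECT COUNT(*) FROM c
"""
```

The caller stays free to do other work while it waits, and the deadline stops the scan itself instead of merely abandoning the result, which is the distinction requirement 5 draws between real cancellation and client-side give-up.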
## Acceptance criteria
- A read query that exceeds the deadline returns a `KtxQueryError` within
roughly the deadline; the MCP worker stays responsive (a concurrent tool call
on the same server completes while the slow query is still pending) and writes
a matching `tool.end` with a non-ok outcome.
- sqlite specifically: executing a deliberately pathological query (e.g. an
expensive VIEW or an unindexed cross join) on a fixture does not block the
event loop, is terminated at the deadline, and CPU returns to idle afterward
(the off-main-thread executor is killed, not left spinning).
- No regression: normal fast queries return identical results; read-only
rejection still works; `maxRows` still bounds returned rows.
- Tests cover the deadline path for at least the in-process driver (sqlite,
terminate-on-deadline) and one server-side-timeout driver.
## Benchmark context (motivation only)
The Spider2-lite local set loads several warehouses into sqlite, some with
expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` =
`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112
rows, no composite index, and `promo_id` (the column whose index the optimizer
picks) holds a single value in 95.5% of rows. LLM-authored profiling queries
(MIN/MAX/COUNT over such a
view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval
shard for 10+ minutes; with one, the agent gets a fast error and can scope the
query instead.
## Orientation hints (code pointers; may have drifted)
- Shared contract: `packages/cli/src/context/scan/types.ts`
`KtxScanConnector.executeReadOnly` (~343), `KtxReadOnlyQueryInput` (~285).
- MCP call site: `packages/cli/src/context/mcp/local-project-ports.ts:70`
(`connector.executeReadOnly`); tool registration in
`packages/cli/src/context/mcp/context-tools.ts`.
- In-process sync execution (the acute hang):
`packages/cli/src/connectors/sqlite/connector.ts:311-313`
(`better_sqlite3 .prepare().all()`).
- Existing divergent timeouts to unify: `connectors/bigquery/connector.ts`
(`job_timeout_ms` / `jobTimeoutMs`), `connectors/clickhouse/connector.ts:602`
(`request_timeout`), `connectors/snowflake/connector.ts:342` (test/pool only),
`connectors/postgres/connector.ts`, `connectors/mysql/connector.ts`,
`connectors/sqlserver/connector.ts` (pool/connection only).
- Error class: `packages/cli/src/errors.ts:25` (`KtxQueryError`).
- better-sqlite3 (context7 `/wiselibs/better-sqlite3`, v12.x): no
interrupt/cancel API; `docs/threads.md` documents the worker-thread pattern
for slow queries (master owns worker lifecycle and respawns on exit) — extend
it with terminate-on-deadline to enforce the timeout.

@ -1,68 +0,0 @@
# 18 — BigQuery cross-project dataset support (introspect foreign-hosted datasets, bill in own project)
**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/18-…`.
## Problem (generic, real-world)
Analysts routinely query datasets that live in a **different** BigQuery project than the one
they bill jobs to — Google's `bigquery-public-data`, a partner's shared project, an
organization's central data project, etc. To make those connectable in ktx (so `discover_data`,
the semantic layer, dictionary sampling, and `sql_dialect_notes` work), ktx must be able to
**introspect a dataset hosted in a foreign project while running/billing jobs in the
credentials' own project**.
Today it can't. ktx's BigQuery connector derives a single `projectId` from
`credentials.project_id` and uses it for **both** job billing **and** schema introspection:
- `connectors/bigquery/connector.ts:294` → `projectId` is read solely from `credentials.project_id`;
there is no separate billing-vs-dataset project knob.
- `:544` (`introspectDataset`) — calls `this.getClient().dataset(datasetId)`, which resolves the
dataset **in the client's (billing) project**, and labels every table `catalog: this.resolved.projectId`.
- `:453` (`listTables`) — queries `\`${projectId}\`.\`region-…\`.INFORMATION_SCHEMA.TABLES`, i.e. the
**billing** project's INFORMATION_SCHEMA.
- `:163` (`datasetIds()`) — returns `dataset_ids` verbatim; it never parses a `project.` prefix.
So a `dataset_id` naming a dataset in another project can't be introspected, even though querying
it works fine (cross-project reads bill to the caller's project — that path already works).
### Empirical confirmation
With a service account in project `ktx-spider2-lite`:
- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (looks in
`projects/ktx-spider2-lite/datasets/austin_311`).
- The cross-project form `DatasetReference("bigquery-public-data","austin_311")` → **succeeds**
(lists the public tables; public metadata is readable by any authenticated principal).
- There is **no config knob** to separate the introspection project from the billing project.
## Requirement
The BigQuery connector must accept **fully-qualified `project.dataset` entries** in `dataset_ids`
(a single connection may span more than one source project), and for each:
- **introspect** via the *dataset's* project — `client.dataset(id, { projectId })` /
`DatasetReference(project, dataset)`, query the **dataset project's** `INFORMATION_SCHEMA`, and
label the table `catalog` with the dataset's project;
- **run jobs / bill** in `credentials.project_id` (unchanged).
A bare `dataset` (no `project.`) keeps today's behavior (resolve in `credentials.project_id`), so
existing single-project connections are unaffected.
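The entry parsing implied by this requirement can be sketched as follows. This is a minimal sketch, not ktx's actual code: `parseDatasetEntry` and `DatasetTarget` are hypothetical names, and splitting on the last dot assumes dataset names themselves never contain a dot.

```typescript
// Hypothetical helper (not part of ktx): split one `dataset_ids` entry into
// the project used for introspection and the dataset name.
interface DatasetTarget {
  projectId: string; // project hosting the dataset (introspection + `catalog` label)
  datasetId: string;
}

function parseDatasetEntry(entry: string, billingProjectId: string): DatasetTarget {
  const trimmed = entry.trim();
  // Assumption: dataset names contain no dots, so the last dot separates
  // the project from the dataset.
  const dot = trimmed.lastIndexOf(".");
  if (dot === -1) {
    // Bare dataset: keep today's behavior and resolve in the billing project.
    return { projectId: billingProjectId, datasetId: trimmed };
  }
  const projectId = trimmed.slice(0, dot);
  const datasetId = trimmed.slice(dot + 1);
  if (projectId === "" || datasetId === "") {
    throw new Error(`Invalid dataset id: "${entry}"`);
  }
  return { projectId, datasetId };
}
```

Jobs would still run in `billingProjectId` regardless of the parsed `projectId`; only introspection and the `catalog` label follow the dataset's project.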
## Acceptance
- `dataset_ids: ['bigquery-public-data.austin_311']` (credentials in a *different* project) →
`ktx ingest <conn>` introspects the tables, enriches, and samples values; `discover_data` /
`dictionary_search` return them.
- A connection mixing `['bigquery-public-data.x', 'other-project.y']` introspects both.
- `sql_execution` of a fully-qualified `project.dataset.table` query still runs and bills in
`credentials.project_id`.
- Single-project `dataset_ids: ['my_dataset']` behaves exactly as before (no regression).
## Benchmark context (motivation only — do not encode benchmark specifics)
Spider 2.0-Lite's **BigQuery slice (205 questions)** is otherwise **unservable faithfully**: every
one of its ~74 logical databases groups datasets hosted in foreign public projects
(`bigquery-public-data`, `isb-cgc-bq`, `data-to-insights`, …), never in a project we own. Query
execution already works cross-project (proven), but ktx-only *discovery* (the whole point of the
faithful surface) is blocked because the connector can't introspect them. Scope is small: of 74
BQ dbs only **1** spans more than one source project, so "let `dataset_ids` carry `project.dataset`
and introspect each in its own project" covers the benchmark and the general case alike. This is
the sole blocker for the BigQuery leaderboard slice (the Snowflake slice needed no connector
change and is already baselined).

@ -1,89 +0,0 @@
# 19 — Durable, resumable, bounded relationship detection during ingest enrichment
**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/19-…`.
## Problem (generic, real-world)
Ingest enrichment runs three stages in a fixed order inside `runLocalScanEnrichment`
(`packages/cli/src/context/scan/local-enrichment.ts`):
1. `descriptions` (`:530`) — per-table LLM descriptions (the expensive step: one model call per
table; on a large schema this is minutes of paid LLM work).
2. `embeddings` (`:559`) — column embeddings.
3. `relationships` (`:593`) — FK/join discovery: profiles a row sample of **every** table, then
validates candidate joins.
The queryable semantic-layer artifacts are persisted **once, at the very end**, by
`writeLocalScanEnrichmentArtifacts` in `local-scan.ts:510` — which runs **after**
`runLocalScanEnrichment` returns, i.e. after all three stages.
This creates three failure modes that compound on large schemas (hundreds of tables):
1. **Enrichment is lost if relationship detection is interrupted.** The descriptions + embeddings
are computed and held in memory, but they only reach the durable, queryable artifacts when the
final write runs after the `relationships` stage. If the process is killed/crashes/times out
**during** relationship detection (the last, slowest, silent stage), the artifacts are never
written — the schema survives (it was written earlier at `local-scan.ts:473`) but **all the
paid LLM enrichment is discarded**. Empirically: ingesting a 95-table BigQuery dataset produced
full descriptions + embeddings (progress reached "Building embeddings 17/17"), then the
relationships stage ran silently past a supervising deadline and was killed — the persisted
`_schema` had **0** AI descriptions, only the native column comments. Every larger dataset hits
this, so the most expensive work is the most likely to be thrown away.
2. **Re-running does not resume — it re-spends.** There is a stage state store
(`SqliteLocalScanEnrichmentStateStore`) and a `runEnrichmentStage` helper (`:413`) that saves
each completed stage's output. But the completed-stage lookup keys on **`runId`**
(`findCompletedStage({ runId, stage, inputHash })`, `:427`), and `runId` is fresh per ingest
invocation. So resume only works *within* a single run; re-running an interrupted ingest gets a
new `runId`, misses the cache, and **re-computes descriptions + embeddings from scratch**
(re-paying for the LLM work that already succeeded).
3. **Relationship detection is unobservable and unbounded.** The stage emits no progress between
"Detecting relationships" and the final "Relationship detection found N accepted" — minutes of
silence on a large schema. A supervisor watching for liveness cannot distinguish a
slow-but-working profile from a true hang, and there is no internal time/work budget, so on a very large
schema it can run far longer than any reasonable deadline.
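Failure mode 2 comes down to which fields the completed-stage lookup keys on. The sketch below contrasts a run-keyed lookup with one keyed on connection and input hash. `InMemoryStageStore` and its method names are hypothetical stand-ins for illustration, not the real `SqliteLocalScanEnrichmentStateStore` API.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";

interface CompletedStage {
  connectionId: string;
  runId: string;
  stage: Stage;
  inputHash: string;
  output: string;
}

// Hypothetical in-memory stand-in for the stage state store.
class InMemoryStageStore {
  private readonly rows: CompletedStage[] = [];

  save(row: CompletedStage): void {
    this.rows.push(row);
  }

  // Run-keyed lookup: a fresh runId can never match an earlier run's row.
  findByRun(runId: string, stage: Stage, inputHash: string): CompletedStage | undefined {
    return this.rows.find(
      (r) => r.runId === runId && r.stage === stage && r.inputHash === inputHash,
    );
  }

  // Content-keyed lookup: independent of runId, so a re-run can hit the cache.
  findByContent(connectionId: string, stage: Stage, inputHash: string): CompletedStage | undefined {
    return this.rows.find(
      (r) => r.connectionId === connectionId && r.stage === stage && r.inputHash === inputHash,
    );
  }
}
```

With a row saved under `run-1`, a second invocation using `run-2` misses via `findByRun` but hits via `findByContent`, which is the difference between re-spending and resuming.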
## Requirements
1. **Checkpoint queryable artifacts before relationship detection.** Persist the descriptions +
embeddings into the semantic-layer artifacts as soon as the `embeddings` stage completes, before
the `relationships` stage runs. Relationship detection then appends/merges its own artifact on
completion. Net: the expensive LLM + embedding enrichment is **always durable and queryable**,
even if relationship detection fails, is interrupted, or is skipped. (A failed/partial
relationship stage should degrade to "no/partial joins", never to "no descriptions".)
2. **Make stage resume work across runs.** Resolve a completed stage by stable content identity
`(connectionId, stage, inputHash)` — independent of `runId`, so re-running an interrupted
ingest resumes the finished `descriptions`/`embeddings` stages from cache and only re-runs what
actually failed (e.g. `relationships`). Re-running after an interruption must not re-spend LLM
credits on stages that already succeeded.
3. **Make relationship detection observable and bounded** (mirrors spec 16's bounded query
execution). Emit progress through the existing progress port — e.g. "Profiling table K/N",
"Validating candidate K/M" — so liveness is visible. Enforce an overall time/work budget
(configurable, e.g. under `scan.relationships`) so on a very large schema the stage stops
gracefully and returns the relationships found so far (partial) rather than running unboundedly.
Partial completion is persisted (per requirement 1) and marked as such.
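Requirement 3 can be illustrated with a budgeted loop that reports progress and returns what it has when the budget runs out. This is a minimal sketch under simplifying assumptions: work is synchronous, the budget is checked only between items (so one long item can overrun it), and `runWithBudget` is a hypothetical name.

```typescript
interface BudgetedResult<T> {
  results: T[];
  partial: boolean; // true when the budget stopped the loop early
}

function runWithBudget<I, T>(
  items: readonly I[],
  work: (item: I) => T,
  budgetMs: number,
  onProgress: (done: number, total: number) => void,
  now: () => number = Date.now, // injectable clock, mainly for tests
): BudgetedResult<T> {
  const start = now();
  const results: T[] = [];
  for (const item of items) {
    if (now() - start >= budgetMs) {
      // Budget exhausted: stop gracefully and keep what was found so far.
      return { results, partial: true };
    }
    results.push(work(item));
    onProgress(results.length, items.length); // e.g. "Profiling table K/N"
  }
  return { results, partial: false };
}
```

The `partial` flag is what a caller would persist alongside the relationships found so far, so a later reader can tell a budget cut from a full pass.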
## Acceptance
- Interrupting an ingest **during** relationship detection still leaves a queryable semantic layer
with the table/column descriptions + embeddings that were generated (verified: re-open the
connection, descriptions are present).
- Re-running an interrupted ingest **does not** regenerate descriptions/embeddings whose stage
already completed (verified: no LLM description calls for the cached tables; only the failed
stage re-runs).
- A connection with hundreds of tables emits relationship-stage progress and completes within the
configured budget, persisting partial relationships if the budget is hit — without discarding
enrichment.
- Small/single-run ingests behave exactly as before (no regression in artifacts or relationship
output when nothing is interrupted).
## Benchmark context (motivation only — do not encode benchmark specifics)
The Spider 2.0-Lite BigQuery slice has datasets with hundreds to thousands of tables (`ebi_chembl`
785, `fec` 486, `ga360` 366, …). Enriching them with claude-code costs real, rate-limited LLM
budget; losing that enrichment to a relationship-stage interruption — and re-spending it on every
retry — makes large-schema ingest impractical. This is a general durability/cost property of the
ingest pipeline, independent of the benchmark; the benchmark only made it acute at scale.

@ -1,101 +0,0 @@
# 20 — Resilient enrichment under a slow/hung LLM backend
**Status:** draft (intake). Requirement-level; the implementer refines into `specs/20-*.md`.
This is the **enrichment-stage** analog of two already-shipped specs:
- spec 16 (bounded query execution) — bound *and actually cancel* a runaway read query (child-thread/process kill, not a cosmetic JS deadline);
- spec 19 (durable/bounded relationship detection) — checkpoint expensive ingest work so an interruption doesn't lose it.
Spec 16 hardened the **read-query** path and spec 19 checkpointed at **stage boundaries**. The same two
weaknesses still exist *inside the descriptions enrichment stage*, and together they turned a single hung
table into an indefinite wedge plus total loss of an entire stage's LLM work.
## Problem / requirement
Two compounding gaps on the per-table description-enrichment path, observed end-to-end:
### 1. The per-table LLM timeout does not actually terminate the work
The per-table `generateObject` enrichment call is wrapped in `retryAsync` with a fresh
`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (ktx commit `01f63380`). When the LLM
backend is a **subprocess** (the `codex` backend spawns a child `codex` process; `claude-code` likewise
spawns a child) and that child **hangs with an open connection to the provider** (TCP ESTABLISHED, ~0%
CPU, no bytes flowing), the JS-level `AbortSignal` fires but **does not kill the child process or unblock
the await** — so the call sits *past* its own timeout indefinitely.
Observed (BigQuery ingest, codex backend, 2026-06-23): with `KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min),
two of `covid19_usa`'s widest tables (252 columns) hung; the stage sat at **268/285 for 41+ minutes**,
well past the 30-min per-attempt timeout, with exactly two codex children, each holding 3 ESTABLISHED
connections at ~0% CPU, until killed by hand. The timeout was cosmetic: it never terminated the hung
child. (This is precisely the failure mode spec 16 fixed for SQL — a deadline that fires in JS but cannot
interrupt the underlying work — applied to the enrichment LLM call instead of the query.)
**Requirement:** the per-table enrichment-call timeout must be **enforced**, not advisory — when it fires,
the in-flight work is actually cancelled (subprocess SIGKILL for process-backed providers; request abort
for HTTP-backed ones) and the call returns/throws *promptly* so the stage can proceed (skip the table per
the existing no-retry-on-timeout policy). A hung table must cost at most ~one timeout, never unbounded
wall-clock. Provider-agnostic: it must hold for `codex`, `claude-code`, and HTTP backends alike.
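The difference between an advisory and an enforced timeout can be sketched as a wrapper that kills the underlying work before settling. This is a hedged sketch, not ktx's implementation: `CancellableCall`, `withEnforcedTimeout`, and the injectable `Schedule` are hypothetical, and `kill` stands for whatever real termination the backend supports (for example a SIGKILL to a child process, or an HTTP request abort).

```typescript
interface CancellableCall<T> {
  result: Promise<T>;
  kill: () => void; // must really terminate the work, not just flag it
}

// Returns a function that cancels the scheduled callback.
type Schedule = (fn: () => void, ms: number) => () => void;

const realSchedule: Schedule = (fn, ms) => {
  const handle = setTimeout(fn, ms);
  return () => clearTimeout(handle);
};

function withEnforcedTimeout<T>(
  call: CancellableCall<T>,
  timeoutMs: number,
  schedule: Schedule = realSchedule, // injectable so tests need no real waiting
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const cancelTimer = schedule(() => {
      call.kill(); // terminate the in-flight work first
      // Then settle promptly, whether or not the work ever reports back.
      reject(new Error(`enrichment call timed out after ${timeoutMs} ms`));
    }, timeoutMs);
    call.result.then(
      (value) => { cancelTimer(); resolve(value); },
      (error) => { cancelTimer(); reject(error); },
    );
  });
}
```

Because the wrapper rejects on its own timer rather than waiting for `call.result`, a hung child cannot hold the await open; the caller can then apply the existing skip-on-timeout policy.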
### 2. Descriptions are checkpointed only at full-stage completion, so a few bad tables lose all the good ones
Spec 19 persists the descriptions checkpoint **after the descriptions stage completes** (before
relationships). There is no *within-stage* persistence: while the stage runs, every enriched table's
description lives only in memory. So if the stage cannot complete — e.g. 2 tables out of 285 hang (gap #1),
or the process is killed, or it hits the stall watchdog — **all** the already-enriched tables are lost,
even though their (expensive) LLM descriptions were finished.
Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but **0** rows in
`local_scan_enrichment_stages` and **0** `ai:` descriptions on disk; killing the wedged ingest discarded
all 283, forcing a from-scratch re-ingest. The cost of 2 pathological tables was 283 tables' worth of
redone LLM calls.
**Sharper observation (re-ingest with a short, enforced timeout):** even when the stage *does* run to
the end — the 2 hung tables hit a 4-min timeout and were skipped, so 283/285 descriptions were generated
and the ingest reported success (`Scan completed` / `Ingest finished`, embeddings built, exit 0) — the
descriptions were **still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the discard is **not** just
"lost on kill": a stage that completes with *any* skipped/aborted table currently persists **nothing**,
throwing away every successfully-generated description. The skip must be graceful — a skipped table costs
one missing description, not the entire stage's output. (This is the strongest argument for per-table
incremental persistence: the 283 good descriptions should have been durable the moment each was produced.)
**Requirement:** persist enriched descriptions **incrementally** (per-table or per-batch) during the
descriptions stage, so that (a) tables that finished are durable even if the stage never completes, and
(b) a resumed ingest re-does only the *unfinished* tables, not the whole stage. The existing additive-write
design (spec 19 already preserves existing descriptions on re-ingest) is the foundation; this extends the
checkpoint granularity from once-per-stage to incremental.
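Per-batch persistence with resume can be sketched as below. This is a simplified, synchronous illustration: the `Map` stands in for the on-disk store, `describe` returning `null` models a skipped table, and `enrichIncrementally` is a hypothetical name rather than a ktx function.

```typescript
// Hypothetical store: persisted descriptions keyed by table name.
type DescriptionStore = Map<string, string>;

interface EnrichOutcome {
  described: string[]; // tables enriched in this run
  skipped: string[];   // tables whose call timed out or failed
  reused: string[];    // tables already durable from an earlier run
}

function enrichIncrementally(
  tables: readonly string[],
  store: DescriptionStore,
  describe: (table: string) => string | null, // null models a skipped table
  batchSize: number,
): EnrichOutcome {
  const outcome: EnrichOutcome = { described: [], skipped: [], reused: [] };
  let pending = new Map<string, string>();

  const flush = (): void => {
    // Additive write: only adds entries, never removes existing ones.
    pending.forEach((text, table) => { store.set(table, text); });
    pending = new Map<string, string>();
  };

  for (const table of tables) {
    if (store.has(table)) {
      outcome.reused.push(table); // resume: already persisted, do not re-spend
      continue;
    }
    const text = describe(table);
    if (text === null) {
      outcome.skipped.push(table); // a skip costs one description, not the stage
      continue;
    }
    pending.set(table, text);
    outcome.described.push(table);
    if (pending.size >= batchSize) flush();
  }
  flush(); // persist the final partial batch
  return outcome;
}
```

If the process dies mid-stage, at most `batchSize - 1` finished descriptions are lost, and the next run reuses everything that was flushed.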
## Sketch (implementer to refine)
- **Enforced timeout:** route enrichment-call cancellation through real termination: kill the
codex/claude-code child process on timeout (reuse spec 16's child-kill mechanism), abort the HTTP request for
network backends. A fired `AbortSignal` must guarantee the await settles within a bounded grace period.
- **Sane default + the right tradeoff:** the default per-table timeout should be **moderate** (single-digit
minutes) with a small retry count, not very large — because the cost of a *hang* is the timeout value
itself, a long timeout is strictly worse for hangs. (The 30-min value used in the incident was an operator
override chosen to avoid cutting off slow-but-completing wide tables; with #1 enforced and incremental
checkpointing, a moderate default + skip is the better operating point.)
- **Incremental persistence:** flush descriptions per-batch (e.g. every N completed tables or on a timer) to
the same store/format used at stage completion; on resume, treat already-persisted tables as done and only
enrich the remainder. Keep it idempotent and additive (don't clobber prior descriptions).
- **Interaction with the stall watchdog:** with #1 enforced, no single table can starve progress for longer
than ~one timeout, so an external stall watchdog stops being the only backstop.
## Generic use case (independent of the benchmark)
Anyone ingesting a large or wide schema with an LLM enrichment backend (especially a *subprocess* backend,
which is the common local/desktop setup) will eventually hit a table whose description call hangs — a
provider stall, a rate-limit black-hole, a pathologically large prompt. Without an *enforced* timeout, one
such table wedges the whole ingest indefinitely; without *incremental* persistence, any interruption throws
away all the per-table LLM work already done (the dominant ingest cost). Both fixes make large-schema
enrichment **resilient and resumable** — a few bad tables degrade to a few skipped descriptions, not a
hung process and a from-scratch redo. This is core robustness for a general-purpose ingestion product,
wholly independent of any benchmark.
## Benchmark context (motivation only — not a benchmark-specific rule)
Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment backend). During re-enrichment of
the giant public datasets, `covid19_usa` wedged at 268/285 for 41+ minutes on 2 hung 252-column tables; the
30-min per-table `AbortSignal` timeout never killed the hung codex children, and because descriptions
checkpoint only at stage completion, the 283 already-enriched tables were unrecoverable — the operator had
to kill, cache-bust, and re-ingest the db from scratch (with a short timeout as a stopgap). The benchmark
just exercised a large/wide multi-dataset ingest at scale; the gap and the fix are generic.

@ -1,91 +0,0 @@
# 21 — Selective enrichment stages (`--stages`) + per-stage cache keys
**Status:** draft (intake). Requirement-level; the implementer refines into `specs/21-*.md`.
Follow-on to spec 19 (durable/resumable relationship detection) and spec 20 (resilient enrichment).
Those made enrichment *survivable and resumable*; this makes it *selectively re-runnable* — re-run one
enrichment stage without re-paying for the others.
## Problem / requirement
Enrichment has three stages — **`descriptions`** (per-table LLM text), **`embeddings`**
(sentence-transformers over the schema/descriptions), **`relationships`** (FK/join detection, optionally
LLM-proposed). Today you cannot re-run a *subset* of them, and three facts in the current code make a
targeted re-run impossible without a full, expensive re-enrich:
1. **One coarse cache key gates all three stages.** `context/scan/local-enrichment.ts:611` computes a
single `inputHash` from `{snapshot, mode, detectRelationships, providerIdentity, relationshipSettings}`,
and all three stages reuse it (descriptions ~`:641`, embeddings ~`:672`, relationships ~`:728`). So
changing *any* one stage's inputs invalidates *every* stage's cache. Concretely: flipping
`scan.relationships.llmProposals`, switching the LLM backend, or upgrading the embeddings model forces
ktx to re-run the **expensive per-table descriptions** even though they didn't conceptually change.
2. **No CLI surface to select stages.** The enrichment internally already supports a relationships-only
path (`mode: 'relationships'`, which skips the description/embedding stages — they're gated on
`mode === 'enriched'`), but `ktx ingest` exposes no flag to invoke it (only `--no-query-history`).
The capability is built; it's just not reachable.
3. **The per-stage storage already exists** (`local_scan_enrichment_stages` PK `(connection_id, stage,
input_hash)`) and the **additive write already preserves existing descriptions** on re-ingest — so the
foundation for "touch one stage, keep the rest" is in place; only the key granularity and the CLI
surface are missing.
**Requirement:** let an operator re-run a chosen subset of enrichment stages on already-ingested
connection(s), recomputing only those stages and **preserving the others' artifacts untouched** — cheaply,
without re-running unchanged (especially the costly `descriptions`) stages.
## Design decisions (resolved during intake; implementer may refine)
- **CLI flag: `--stages <comma-list>`** (plural). Accepts a comma-separated subset of
`descriptions,embeddings,relationships`; default = all three (current behaviour). Plural because it takes
a *set*; `--stages relationships` and `--stages descriptions,embeddings` both read naturally, and the
plural signals "list expected" (singular `--stage` implies exactly one). **Validate** the names — an
unknown stage is an error, never silently ignored.
- **Per-stage `inputHash`.** Split the single coarse hash so each stage keys on *only its own* inputs:
- `descriptions` ← `{snapshot, mode, providerIdentity}` (NOT relationship settings, NOT embedding model)
- `embeddings` ← `{snapshot, embeddings model/provider, + the description text it embeds}`
- `relationships` ← `{snapshot, relationshipSettings (incl. llmProposals), providerIdentity}`
Then flipping `llmProposals` invalidates only `relationships`; swapping the embeddings model invalidates
only `embeddings`; improving description prompts/LLM invalidates only `descriptions`.
- **Preserve-others semantics.** Stages not named in `--stages` are left exactly as on disk (additive write,
already the behaviour). A selective run never deletes another stage's artifacts.
- **Downstream-staleness handling.** Stages have a dependency order (`descriptions → embeddings`;
`relationships` depends only on the schema snapshot). Re-running `descriptions` alone can leave existing
`embeddings` semantically stale (they embedded the old text). The run must **warn** when a selected
re-run leaves an unselected downstream stage stale, and the operator can opt to cascade
(`--stages descriptions,embeddings`). Do not silently leave a stale-but-unflagged downstream.
- **`relationships` uses existing descriptions as context.** When re-running `relationships` only, the
stage should read the existing enriched schema (incl. on-disk `ai:` descriptions) so `llmProposals` has
full context — not just raw column names.
- **Scope:** the three enrichment stages for now. Design the stage-name namespace so it can later extend to
the broader scan phases (schema / query-history / source / memory) and subsume the inconsistent
`--no-query-history` negative flag, but that unification is out of scope here.
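The flag validation and the downstream-staleness check from these decisions can be sketched as two small functions. Names (`parseStages`, `staleDownstream`) are hypothetical, and rejecting an empty list is an assumption; the spec only fixes the default (flag absent means all three) and the unknown-name error.

```typescript
const ALL_STAGES = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof ALL_STAGES)[number];

function parseStages(flag: string | undefined): Stage[] {
  if (flag === undefined) return [...ALL_STAGES]; // default: all three stages
  const requested = flag.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
  if (requested.length === 0) {
    // Assumption: an empty list is treated as an error rather than "all".
    throw new Error("--stages requires at least one stage");
  }
  const selected: Stage[] = [];
  for (const name of requested) {
    const match = ALL_STAGES.find((stage) => stage === name);
    if (match === undefined) {
      throw new Error(`Unknown stage "${name}"; expected one of: ${ALL_STAGES.join(", ")}`);
    }
    if (selected.indexOf(match) === -1) selected.push(match);
  }
  return selected;
}

// Only dependency named in the spec: embeddings embed the description text.
const DOWNSTREAM: Record<Stage, Stage[]> = {
  descriptions: ["embeddings"],
  embeddings: [],
  relationships: [],
};

// Stages that a selective run leaves stale because they were not selected.
function staleDownstream(selected: readonly Stage[]): Stage[] {
  const stale: Stage[] = [];
  for (const stage of selected) {
    for (const dependent of DOWNSTREAM[stage]) {
      if (selected.indexOf(dependent) === -1 && stale.indexOf(dependent) === -1) {
        stale.push(dependent);
      }
    }
  }
  return stale;
}
```

A run would warn whenever `staleDownstream(selected)` is non-empty, for example after `--stages descriptions` without `embeddings`.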
## Sketch (implementer to refine)
- Add `--stages` to `ktx ingest`; parse+validate into a stage set; thread it to the enrichment entry so it
selects which stage blocks run (reuse the existing `mode`/stage gating — `mode: 'relationships'` is the
precedent).
- Replace the single `computeKtxScanEnrichmentInputHash` call with per-stage hash computation keyed on each
stage's own inputs; gate each stage's resume/skip on its own hash.
- Ensure selective runs read + preserve the on-disk enriched schema and write additively.
- Emit a clear staleness warning when an unselected downstream stage is invalidated by a selected one.
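The per-stage keying can be sketched by deriving each stage's key from only its own inputs. This is illustrative only: `EnrichmentInputs` and `stageKey` are hypothetical, the field shapes are simplified stand-ins, and a serialized tuple is used in place of a real hash.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";

interface EnrichmentInputs {
  snapshot: string;          // stand-in for the schema snapshot identity
  mode: string;
  providerIdentity: string;
  embeddingsModel: string;
  descriptionText: string;   // the text the embeddings stage embeds
  relationshipSettings: { llmProposals: boolean };
}

// Each stage serializes only the inputs it depends on. A real implementation
// would hash this string; the raw string is enough to show the invalidation.
function stageKey(stage: Stage, inputs: EnrichmentInputs): string {
  switch (stage) {
    case "descriptions":
      return JSON.stringify([stage, inputs.snapshot, inputs.mode, inputs.providerIdentity]);
    case "embeddings":
      return JSON.stringify([stage, inputs.snapshot, inputs.embeddingsModel, inputs.descriptionText]);
    case "relationships":
      return JSON.stringify([
        stage,
        inputs.snapshot,
        inputs.relationshipSettings.llmProposals,
        inputs.providerIdentity,
      ]);
  }
}
```

Flipping `llmProposals` then changes only the `relationships` key, so cached `descriptions` and `embeddings` rows still match.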
## Generic use case (independent of the benchmark)
Any team running ktx in production maintains its semantic layer over time: they improve description prompts
or switch the description LLM, upgrade the embeddings model, or turn on LLM-proposed joins. Today each of
those forces a **full re-enrich of every connection** — re-running the expensive per-table descriptions
even when only embeddings or relationships changed. Selective `--stages` re-runs make these routine
maintenance operations cheap and targeted: "re-embed everything on the new model" or "backfill joins now
that llmProposals is on" become a single fast pass that leaves the untouched stages — and their cost —
alone. This is core operability for a long-lived ingestion product and is wholly independent of any
benchmark.
## Benchmark context (motivation only — not a benchmark-specific rule)
Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A level-aware audit found (a) a
tail of BigQuery dbs with poor *column*-description coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …)
that want a **`descriptions`-only** re-run with a longer timeout, and (b) a desire to **backfill joins**
across all already-ingested dbs after enabling `llmProposals` — without re-paying for descriptions. Both
were blocked by the coarse single `inputHash` (flipping `llmProposals` or re-describing would invalidate
the whole enrichment) and the absence of a stage-selective CLI flag. The benchmark just exercised
large-scale multi-backend ingestion; the gap and the fix are generic.

@ -1,300 +0,0 @@
# Connection-scoped wiki pages
> Refined spec. Intake draft: `todo/01-connection-scoped-wiki.md`.
## Problem
Wiki pages have only two scopes today: `GLOBAL` and `USER`
(`packages/cli/src/context/wiki/types.ts`, `WikiScope`). Scope is expressed by
directory (`wiki/global/<key>.md`, `wiki/user/<userId>/<key>.md`) and the
search path filters by loading only the in-scope pages before any lane runs.
There is no way to associate a page with a **connection** (a warehouse/database
defined under `connections:` in `ktx.yaml`).
In a project with many connections this causes two distinct failures:
1. **Cross-database relevance pollution.** All pages share one search index, so
`wiki_search` for a generic term (`orders`, `revenue`, `average order
value`) surfaces pages written about the wrong database. Concept names
collide across databases constantly in real multi-connection projects
(several databases each with `orders`, `customers`, …).
2. **Silent overwrite on shared keys.** Page keys are a flat, global namespace.
The write path resolves a repeated key to the existing file and updates it
in place. So if the agent writes an `orders` page while ingesting database B
and an `orders` page already exists for database A, B's content **overwrites
A's** — same-concept pages for different databases cannot coexist today.
Today, when `memory_ingest` is called with a `connectionId`, that id only
scopes which semantic-layer sources the triage agent can see
(`memory-agent.service.ts`); it is **not** persisted on the resulting wiki page
and **not** validated against `ktx.yaml`.
## Generic use case
Any org with multiple databases/warehouses in one **ktx** project: org-wide
definitions ("fiscal year starts in February") should be visible everywhere,
while database-specific conventions ("in the events DB, `user_id` is the
anonymous device id, not the account id") should not pollute searches about
other databases — and two databases that both have an `orders` concept must be
able to keep separate, non-colliding pages.
## Model
`connections` is **additive frontmatter metadata**, orthogonal to the existing
`GLOBAL`/`USER` directory scope — not a third scope dimension:
- A page is still `GLOBAL` or `USER` and lives where it lives today. It may
**additionally** carry a `connections` list.
- **Page keys remain a flat, globally-unique namespace.** `connections` does
**not** namespace keys; a page is addressable by key alone, unchanged.
- A page may list **multiple** connections.
- **Absent or empty `connections` ⇒ unscoped: the page applies to all
connections.** This is exactly today's behavior, so every existing page is
unaffected.
This keeps `wiki_read` and refs untouched and adds no parallel scope axis;
filtering by connection is purely a search/relevance concern.
## Requirements
### 1. Frontmatter field
Add an optional `connections` field to wiki page frontmatter — a list of
connection ids.
- Accept a single string too; normalize to a list at parse time (reuse the
existing array-coercion helper used for `tags`/`refs`/`sl_refs`).
- Round-trips through parse/serialize without loss.
- Absent or empty ⇒ unscoped (see Model). Existing pages are unaffected by
construction.
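The parse-time normalization might look like the sketch below. `normalizeConnections` is a hypothetical stand-in for the existing `stringArray` helper, whose exact behavior is not shown in this spec; trimming and de-duplication are assumptions.

```typescript
// Hypothetical stand-in for the existing array-coercion helper.
function normalizeConnections(value: unknown): string[] {
  if (typeof value === "string") {
    const id = value.trim();
    return id === "" ? [] : [id]; // single string becomes a one-element list
  }
  if (!Array.isArray(value)) return []; // absent or malformed: unscoped
  const ids: string[] = [];
  for (const item of value) {
    if (typeof item !== "string") continue;
    const id = item.trim();
    if (id !== "" && ids.indexOf(id) === -1) ids.push(id); // assumption: trim + dedupe
  }
  return ids;
}
```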
### 2. Page identity and key distinctness
`connections` does not change how pages are identified or addressed:
- Keys stay flat and globally unique; `wiki_read(key)` is unchanged.
- Because the write path updates a page in place when its key already exists,
same-concept pages for different connections **MUST** use distinct keys
(e.g. `orders_sales_db` vs `orders_events_db`). Connection-distinctive keys
for database-specific pages are the primary mechanism (driven by write-path
prompt guidance, requirement 5).
- **Data-loss guard (code, not prompt):** a connection-scoped write whose key
matches an existing page whose `connections` scope is **disjoint** from the
incoming scope MUST surface a collision instead of silently overwriting the
existing page. (Updating a page within the same connection scope, or
broadening/narrowing its own `connections`, is a normal update — not a
collision.) The implementer owns whether the collision is a hard error or a
suffixed new key; it must not be a silent clobber.
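The data-loss guard reduces to a disjointness check between the existing and incoming scopes. The sketch below is one reading of the rule, not a settled design: `decideWrite` is hypothetical, and treating an unscoped side as overlapping with everything follows from "unscoped applies to all connections".

```typescript
type WriteDecision = "create" | "update" | "collision";

// `existing` is the connections list of the page already stored under the
// key, or undefined when no page has that key yet.
function decideWrite(existing: string[] | undefined, incoming: string[]): WriteDecision {
  if (existing === undefined) return "create";
  // An unscoped side applies to all connections, so the scopes overlap.
  if (existing.length === 0 || incoming.length === 0) return "update";
  const overlaps = incoming.some((id) => existing.indexOf(id) !== -1);
  // Disjoint scopes under one key would silently clobber another database's page.
  return overlaps ? "update" : "collision";
}
```

What happens on `"collision"` (hard error or suffixed key) stays with the implementer, as the requirement states.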
### 3. Search filtering
Add an optional connection filter to the search surfaces:
- **MCP:** `wiki_search(query, connectionId?)` (`context-tools.ts`).
- **CLI:** `ktx wiki search` and `ktx wiki list` accept `--connection <id>`
(with `-c` alias), matching the `ktx sql` connection flag.
Semantics:
- With `connectionId: X` ⇒ return pages whose `connections` is empty
(unscoped) **plus** pages whose `connections` contains X.
- Without ⇒ current behavior, all pages.
- The filter **MUST** apply uniformly to **all three search lanes** (lexical
FTS5, semantic/embedding, token fallback) at the **candidate-source level**,
so each lane draws its full candidate pool from the already-scoped set. It
**MUST NOT** be a post-filter on the merged/ranked results — that would let
off-scope candidates consume both the per-lane pool and the final result
limit unevenly.
*Orientation:* the existing `GLOBAL`/`USER` scoping already filters at the
disk-load step that feeds both the in-memory token lane and the synced SQLite
index (`local-knowledge.ts`); the connection filter fits the same seam.
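Why the filter belongs at the candidate source rather than after ranking can be shown with a toy lane. Everything here is illustrative: the `Page` shape, the substring-matching `lexicalLane`, and both search functions are hypothetical, not the real lanes in `local-knowledge.ts`.

```typescript
interface Page {
  key: string;
  connections: string[];
}

function inScope(page: Page, connectionId?: string): boolean {
  if (connectionId === undefined) return true; // no filter: all pages
  return page.connections.length === 0 || page.connections.indexOf(connectionId) !== -1;
}

// Toy lane: substring match on the key, capped at the lane's pool size.
function lexicalLane(pages: readonly Page[], query: string, poolSize: number): Page[] {
  return pages.filter((p) => p.key.indexOf(query) !== -1).slice(0, poolSize);
}

// Required behavior: scope the candidates first, then let the lane fill its pool.
function searchScoped(pages: readonly Page[], query: string, poolSize: number, connectionId?: string): Page[] {
  const candidates = pages.filter((p) => inScope(p, connectionId));
  return lexicalLane(candidates, query, poolSize);
}

// Forbidden behavior: off-scope pages consume the pool before being dropped.
function searchPostFiltered(pages: readonly Page[], query: string, poolSize: number, connectionId?: string): Page[] {
  return lexicalLane(pages, query, poolSize).filter((p) => inScope(p, connectionId));
}
```

With a pool of one and an off-scope page ranked first, the post-filtered variant returns nothing while the scoped variant still returns the in-scope page.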
### 4. Index persistence
The `.ktx/db.sqlite` knowledge index is re-synced from files on every search.
The implementer owns whether to persist `connections` as index columns / a side
table, or to filter the loaded page-set before the per-search sync. The binding
requirement is the uniform-across-lanes behavior in requirement 3 — not a
specific schema.
*Trade-off note (non-binding):* filtering the loaded page-set re-syncs only the
scoped subset and gives up a little embedding-cache reuse when searches
alternate between connections (recompute is one embedding per scoped page per
connection switch — negligible at the scale this targets). Persisting
`connections` in the index avoids that at the cost of a schema addition and a
per-lane predicate. Either is acceptable.
### 5. Write path
- The memory agent's page-write tool (`wiki-write.tool.ts`) accepts a
`connections` input field with the same REPLACE semantics as
`tags`/`refs`/`sl_refs`: omit ⇒ keep existing on update; `[]` ⇒ clear to
unscoped; `[ids]` ⇒ set.
- When `memory_ingest` / the memory agent runs with a `connectionId`, prompt
guidance directs the agent to:
- set `connections: [connectionId]` on new **database-specific** pages, using
connection-distinctive keys; and
- leave `connections` empty for clearly **org-wide** content.
- This is **prompt guidance, not a code auto-default.** A connection-scoped
ingest must remain able to produce unscoped org-wide pages, so the tool must
not force the session's `connectionId` onto every page.
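The REPLACE semantics for the `connections` input amount to a three-way case on the input value. A minimal sketch, with `applyConnectionsInput` as a hypothetical name:

```typescript
// omit => keep existing; [] => clear to unscoped; [ids] => replace.
function applyConnectionsInput(existing: string[], input: string[] | undefined): string[] {
  if (input === undefined) return existing; // field omitted on update: keep
  return input.slice(); // replaces wholesale, so [] clears to unscoped
}
```

Note that nothing here consults the session's `connectionId`; scoping a page stays an explicit choice by the agent, as the last bullet requires.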
### 6. `wiki_read` and refs unchanged
Pages remain addressable by key regardless of scoping. `wiki_read`, `refs`, and
`sl_refs` semantics are unchanged; `connections` is a search/relevance concern
only.
### 7. Validation
Validation behavior splits by surface, because an explicit argument is a
typo-prone input while persisted content drifts independently of config:
- **Explicit argument** — a connection id supplied as a command/tool argument
(`wiki_search`/`memory_ingest` `connectionId`, `ktx wiki … --connection`)
MUST be validated against `ktx.yaml` connections and **rejected with a clear
error listing the configured ids** when unknown. Reuse the canonical
`project.config.connections[id]` check. This also closes the current gap
where `memory_ingest`'s `connectionId` is accepted unvalidated.
- **Persisted frontmatter** — a connection id that appears only in a stored
page's `connections` and is not in `ktx.yaml` MUST **warn (not fail)** during
validation/doctor, and MUST NOT break loading, searching, or reading that
page. Config and content can evolve independently.
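The two validation surfaces can be sketched as one throwing check and one warning collector. Both function names and the message wording are hypothetical; `configured` stands for the ids under `connections:` in `ktx.yaml`.

```typescript
// Explicit argument: reject unknown ids and list the configured ones.
function assertKnownConnection(id: string, configured: readonly string[]): void {
  if (configured.indexOf(id) === -1) {
    throw new Error(`Unknown connection "${id}". Configured connections: ${configured.join(", ")}`);
  }
}

// Persisted frontmatter: collect warnings, never throw, so the page still
// loads, searches, and reads normally.
function frontmatterWarnings(
  pageKey: string,
  connections: readonly string[],
  configured: readonly string[],
): string[] {
  return connections
    .filter((id) => configured.indexOf(id) === -1)
    .map((id) => `Page "${pageKey}" references unknown connection "${id}"`);
}
```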
### 8. Scope boundary
This spec delivers the **mechanism** (frontmatter storage + uniform filter +
write surface + validation). Driving the agent to actually pass `connectionId`
during analytics work is the concern of
`03-multi-connection-routing-in-analytics-skill`. It composes with the
`--connection` flag on `ktx ingest` from `02-verbatim-ingest-mode`.
## Acceptance criteria
- A page with `connections: [db_a]` is returned by
`wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but
**not** by `wiki_search(query, connectionId: "db_b")`.
- A page with no `connections` field is returned in all three cases above.
- Two pages — `orders_sales_db` (`connections: [sales_db]`) and
`orders_events_db` (`connections: [events_db]`) — coexist; a search scoped to
`sales_db` returns the first and not the second, and neither overwrote the
other on write.
- A connection-scoped write whose key matches an existing page scoped to a
**different** connection surfaces a collision instead of silently
overwriting (data-loss guard, requirement 2).
- Filtering works in each lane independently (test with embeddings disabled to
exercise the lexical and token lanes alone).
- `memory_ingest(content, connectionId)` produces a page scoped to that
connection for database-specific content.
- `wiki_search`/`ktx wiki search --connection <unknown>` fails with an error
that lists the configured connection ids.
- A page whose `connections` references an id absent from `ktx.yaml` produces a
warning but stays searchable and readable; search and read do not throw.
- `connections` accepts a single string and a list, both normalized to a list.
- Existing projects with no scoped pages and no `connectionId`/`--connection`
behave identically before/after.
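The first, second, and ninth criteria above can be read as one small predicate. The sketch below is illustrative only: the `Page` shape and the function names are assumptions for the example, not the shipped `pageMatchesConnection` implementation.

```typescript
// Hypothetical minimal page shape for illustration.
interface Page {
  key: string;
  connections?: string | string[];
}

// Normalize the frontmatter value: a single string or a list becomes a list.
function normalizeConnections(value: string | string[] | undefined): string[] {
  if (value === undefined) return [];
  return (Array.isArray(value) ? value : [value]).filter((id) => id.length > 0);
}

// A page matches when no filter is given, when it is unscoped,
// or when it lists the requested connection id.
function matchesConnection(page: Page, connectionId?: string): boolean {
  if (connectionId === undefined) return true;
  const scoped = normalizeConnections(page.connections);
  return scoped.length === 0 || scoped.includes(connectionId);
}

// Returns the keys of the pages that survive the filter, in input order.
function filterPageKeys(pages: Page[], connectionId?: string): string[] {
  return pages.filter((p) => matchesConnection(p, connectionId)).map((p) => p.key);
}
```

Applying the predicate before ranking, rather than after, is what keeps the candidate pool scoped in every lane.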
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the design.
- **Frontmatter type + parse/serialize:** `wiki/types.ts` (`WikiFrontmatter`),
`wiki/knowledge-wiki.service.ts` (`parsePage`/`serializePage`), array
coercion `wiki/local-knowledge.ts` (`stringArray`).
- **Search lanes + per-search re-sync:** `wiki/local-knowledge.ts`
(`searchLocalKnowledgePagesWithSqlite`; the disk-load step that already
scopes `GLOBAL`/`USER`; token lane), `wiki/sqlite-knowledge-index.ts`
(FTS5 `knowledge_pages_fts` lexical lane, semantic scan, `sync`).
- **MCP surface:** `mcp/context-tools.ts` (`wiki_search`, `wiki_read`,
`memory_ingest`; `connectionId` already present on `memory_ingest` but
unvalidated).
- **CLI surface:** `commands/knowledge-commands.ts`
(`ktx wiki search`/`list`/`read`); canonical `--connection` flag in
`commands/sql-commands.ts`; validation pattern
`project.config.connections[id]` in `mcp/local-project-ports.ts`.
- **Write path:** `wiki/tools/wiki-write.tool.ts` (input schema, REPLACE
semantics, scope decision), `memory/memory-agent.service.ts` (`connectionId`
threaded through the capture session and tool session;
`external_ingest` forces `GLOBAL` scope).
- **Connection config:** `context/project/config.ts` (`connections` record in
`ktx.yaml`).
## Benchmark context (motivation only)
Spider 2.0-Lite local subset = one project with ~30 SQLite connections whose
schemas share table/concept names (Northwind, sakila, two e-commerce DBs…).
External-knowledge docs (RFM definition, F1 overtake rules) are each relevant
to exactly one database and must not surface for the other 29.
## Implementation notes
Shipped on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
acceptance criteria covered; full package suite green (2924 passing),
type-check, knip/biome dead-code, and pre-commit clean.
**What was built / where**
1. **Frontmatter field (req 1).** `connections?: string[]` added to
`WikiFrontmatter` (`context/wiki/types.ts`) and to the file-layer page model
`LocalKnowledgePage` (`context/wiki/local-knowledge.ts`). Parsed via a new
`stringList()` coercion (single string → list); round-trips through both
serializers. Absent/empty ⇒ unscoped.
2. **Search/list filter (req 3, req 4).** `connectionId?` threaded through
`searchLocalKnowledgePages` → both the sqlite-FTS and scan impls →
`loadAllKnowledgePages`, and through `listLocalKnowledgePages`. The filter is
applied at the **disk-load seam** (`pageMatchesConnection`: unscoped pages
plus pages listing the id), so the token lane and the per-search SQLite sync (lexical +
semantic) both draw their candidate pool from the already-scoped set —
candidate-source level, not a post-filter.
- Chose req 4 **option B (filter the loaded page-set)** over persisting a
column. Verified-safe here: standalone ktx's memory agent reads pages from
files via a no-op `LocalKnowledgeIndex`, so `.ktx/db.sqlite`'s
`knowledge_pages` is a per-search cache that `searchLocalKnowledgePages`
rebuilds every call — scoping the sync corrupts no shared state. Only cost
is one embedding recompute per scoped page on a connection switch (the
spec's acknowledged, negligible trade-off). No index-schema change.
3. **Page identity + data-loss guard (req 2).** Keys stay flat/global;
`wiki_read`/refs unchanged. The write tool (`wiki/tools/wiki-write.tool.ts`)
rejects (hard error, no silent clobber) a connection-scoped write whose
incoming `connections` is **disjoint** from a same-key existing page's
non-empty `connections`, suggesting a connection-distinctive key. Same-scope,
overlapping, broaden/narrow, and unscoped-existing updates are allowed.
Chose a hard error over auto-suffixing so the conflict reaches the agent
(the decision-maker) instead of silently forking the key namespace.
4. **Write path (req 5).** `wiki_write` accepts `connections` (string or list)
with REPLACE semantics (omit ⇒ keep, `[]` ⇒ unscoped, `[ids]` ⇒ set); no
code auto-default of the session connection. Prompt guidance added to the
shared `wiki_capture` skill (new "Connection scoping" section) and the
`memory_agent_external_ingest` prompt. The session `connectionId` is now
surfaced to the agent so the guidance is actionable: in the memory-agent
prompt header and in the ingest work-unit `<context>` block
(`build-wu-context.ts`, fed from `ingest-bundle.runner.ts`).
5. **Validation (req 7).** New shared helper
`context/connections/configured-connections.ts → assertConfiguredConnectionId`
validates explicit connection-id arguments against `ktx.yaml` and throws an
error listing the configured ids. Routed from all three explicit-arg
surfaces: MCP `wiki_search` (`local-project-ports.ts`), MCP `memory_ingest`
(validated at the boundary in `mcp-server-factory.ts` — this also closes the
prior gap where `memory_ingest`'s `connectionId` was accepted unvalidated),
and CLI `ktx wiki --connection`/`-c` (`commands/knowledge-commands.ts` +
`knowledge.ts`). Persisted-frontmatter ids absent from config are **warn-only**:
`listReferencedConnectionIds` + a non-fatal `ktx status` warning
(`status-project.ts`); loading/searching/reading never throw on them.
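The data-loss guard from item 3 reduces to a disjointness check. The following is a sketch of that rule only, with a hypothetical function name and message wording; it is not the code in `wiki-write.tool.ts`.

```typescript
// Returns an error message when the write must be rejected, otherwise null.
// `existing`: connections on the stored same-key page (undefined = no page).
// `incoming`: connections on the write (undefined or empty = not connection-scoped).
function checkScopedWriteCollision(
  key: string,
  existing: readonly string[] | undefined,
  incoming: readonly string[] | undefined,
): string | null {
  // No stored page, or an unscoped stored page: nothing to protect.
  if (existing === undefined || existing.length === 0) return null;
  // Only connection-scoped writes are guarded by this rule.
  if (incoming === undefined || incoming.length === 0) return null;
  // Same-scope, overlapping, broadened, or narrowed scopes are allowed.
  if (incoming.some((id) => existing.includes(id))) return null;
  // Disjoint scopes: surface the conflict instead of overwriting.
  return (
    `Page "${key}" is already scoped to [${existing.join(", ")}]; ` +
    `use a connection-distinctive key such as "${key}_${incoming[0]}".`
  );
}
```

Returning a message rather than auto-suffixing the key matches the choice described above: the caller decides how to resolve the conflict.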
**Deviations / notes**
- Req 1 says "reuse the existing array-coercion helper used for `tags`/`refs`".
That helper (`stringArray`) is array-only and does **not** coerce a single
string; added a dedicated `stringList` for `connections` to meet the
single-string acceptance criterion rather than change `stringArray`'s
behavior for the other fields.
- **Scope boundary kept:** `discover_data` (MCP) also searches wiki and already
takes `connectionId`, but req 3/8 scope the filter to `wiki_search` + CLI, so
its wiki lane is intentionally left unscoped. Worth a follow-up if
`discover_data`'s wiki results should also be connection-scoped for
consistency.
- MCP tools-list snapshot and the `mcp-server-factory` test were updated for the
new `wiki_search.connectionId` param and the `memory_ingest` validation
wrapper (the port is no longer the raw service object; it delegates).

@ -1,327 +0,0 @@
# Verbatim ingest mode for authoritative documents
> Refined spec. Intake draft: `todo/02-verbatim-ingest-mode.md`.
## Problem
`ktx ingest --text/--file` routes captured content through the memory agent.
`runKtxTextIngest` (`packages/cli/src/text-ingest.ts`) builds a
`MemoryAgentInput` with `sourceType: 'external_ingest'` and hands it to
`MemoryAgentService.ingest` (`context/memory/memory-agent.service.ts`), which
runs a multi-step LLM triage loop (≈30-step budget, content clipped to ~48k
chars) inside a session worktree. The agent decides — via the `wiki_write`
tool — what to persist, so it may **rewrite, condense, split, or re-title** the
content before it lands as a wiki page. The body is produced by an LLM, not
copied by code.
For *authoritative* documents — formula definitions, metric specs, runbooks,
compliance text — paraphrasing is a defect, not a feature:
- exact thresholds, constants, and rule wording must survive unchanged;
- lexical (BM25/FTS5) search works best when the stored text matches the
phrasing users and agents query with;
- ingestion should be deterministic and reproducible — the same input file
yields the same page, and re-running is safe.
Two further gaps block authoritative ingest today:
- The memory agent hard-requires an LLM backend
(`context/memory/local-memory.ts` throws when `llm.provider.backend: none`
and no runner is injected), so there is **no** offline ingest path at all.
- The agent's write tool *merges* a repeated same-scope key in place (REPLACE
frontmatter semantics in `wiki/tools/wiki-write.tool.ts`), i.e. exactly the
silent in-place rewrite an authoritative-document workflow must avoid.
## Generic use case
Any team ingesting documents that are already the source of truth: metric
definition sheets, SLA documents, calculation-methodology docs, regulatory
text. The user wants **ktx** to *index and surface* the document, not to
re-author it. Today they work around the memory agent by hand-writing
frontmatter and copying files into `wiki/global/`; verbatim mode makes that a
first-class, supported `ktx ingest` workflow.
## Model
`ktx ingest --verbatim` is a **distinct, code-driven ingest path**, not a
constrained prompt over the existing agent loop. Its defining invariants:
- **The stored page body is the input document body, written by code.** The LLM
never produces, edits, or relays the body. It is confined to generating
*metadata* about the body.
- **Behavior follows from inputs, not from a mode prompt.** Whether metadata is
LLM-generated or derived offline follows from the configured backend
(`llm.provider.backend`), not from a second user-facing switch.
- **Pages are `GLOBAL`-scoped.** Verbatim ingest targets org/project
authoritative docs (the content teams copy into `wiki/global/` today).
Connection association is expressed by the **additive `connections`
frontmatter** from spec 01, never by directory.
- **Deterministic and idempotent.** The page key, the merged frontmatter, and
the stored body are all functions of the input alone (given a fixed backend),
so the same input produces the same page and a re-run is a safe no-op.
### "Byte-for-byte" scope
The guarantee is on the document's **interior**: no paraphrase, no condense, no
split, no re-title, no reflow, **no clipping**. The shared wiki store
canonicalizes *surrounding* whitespace — `parsePage` trims the body and
`serializePage` emits a single trailing newline
(`wiki/knowledge-wiki.service.ts`) — so leading/trailing blank lines are
normalized by the storage layer. Verbatim mode **MUST** write through that
shared `writePage`/`serializePage` path rather than fork a parallel serializer;
the interior bytes (thresholds, constants, wording) are what must be preserved
exactly, and they are. Acceptance hashes compare the stored body against the
**trimmed** input body.
## Requirements
### 1. Flag
`ktx ingest --file <path> --verbatim` and `ktx ingest --text <content>
--verbatim`. `--verbatim` is a boolean that applies to every `--file`/`--text`
item in the invocation; each item becomes its own page.
- It composes with the existing `--connection-id <id>` flag
(`commands/ingest-commands.ts`) so the resulting page can be
connection-scoped (see spec 01). **Note:** the intake draft wrote
`--connection`; the shipped flag is `--connection-id`. Use `--connection-id`.
- No new `--key` flag (see requirement 4). No second behavioral switch beyond
`--verbatim` itself.
### 2. Body preservation is enforced by code, not by prompt
The stored page body is the input content (interior preserved exactly, per
**Model → "Byte-for-byte" scope**).
- Verbatim mode **MUST NOT** route the body through the memory-agent LLM loop
or any `wiki_write` tool call where a model could alter it.
- The LLM, when used, generates **only** metadata: `summary`, `tags`, and
`sl_refs`. A single constrained structured-output call (AI SDK v6
`generateObject` with a `zod` schema) is the intended mechanism — the full
memory-agent loop, worktree, and squash-merge are **not** required and should
not be used.
- The page key is **not** LLM-generated (requirement 4).
### 3. No clipping of the stored body
The ~48k clip may apply only to the text **sent to the LLM** for metadata
generation. It **MUST NOT** apply to the text **written** to the page. A
document larger than the clip limit is stored in full; only its metadata is
derived from the clipped prefix.
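A minimal sketch of the separation this requirement asks for, assuming a clip of 48,000 characters (the spec only says "~48k") and hypothetical names throughout:

```typescript
// Assumed value; the spec gives the limit only approximately.
const METADATA_CLIP_CHARS = 48_000;

interface VerbatimBodyPlan {
  storedBody: string; // written to the page, never clipped
  llmInput: string; // sent to the metadata call, clipped to the limit
}

// Keep the two texts as separate values so the clip cannot leak into the write.
function planVerbatimBody(
  body: string,
  clip: number = METADATA_CLIP_CHARS,
): VerbatimBodyPlan {
  return { storedBody: body, llmInput: body.slice(0, clip) };
}
```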
### 4. Deterministic page key
The key is derived from the input, never chosen by the LLM (an LLM-chosen slug
would break determinism and the requirement-6 idempotency guarantee):
- **`--file <path>`** → `suggestFlatWikiKey(basename without extension)`
(`wiki/keys.ts`). This is the primary document case and is always
deterministic.
- **`--text <content>`** → if the content opens with a Markdown heading, the
key is `suggestFlatWikiKey(heading text)`. If there is no leading heading,
**hard error**: inline verbatim text needs a leading heading to derive a
stable key, or should be passed as `--file`.
- No hash-based keys (unfindable) and no `--key` override flag. A real need for
explicit key control can add `--key` later.
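The derivation rules above can be sketched as below. `slugifyKey` is a stand-in for `suggestFlatWikiKey`, whose real slug rules may differ, and `VerbatimItem` and `deriveKeySketch` are hypothetical names used only for this example.

```typescript
// Stand-in for `suggestFlatWikiKey`; the actual normalization is an assumption.
function slugifyKey(text: string): string {
  return text
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "");
}

type VerbatimItem =
  | { origin: "file"; path: string; content: string }
  | { origin: "text"; content: string };

function deriveKeySketch(item: VerbatimItem): string {
  if (item.origin === "file") {
    // Basename without extension, independent of the file's content.
    const basename = item.path.split(/[\\/]/).pop() ?? item.path;
    return slugifyKey(basename.replace(/\.[^.]+$/, ""));
  }
  // Inline text: the first non-blank line must be a Markdown heading.
  const firstLine = item.content
    .split(/\r?\n/)
    .find((line) => line.trim() !== "");
  const title = firstLine?.trim().match(/^#{1,6}\s+(.+)$/)?.[1];
  if (!title) {
    throw new Error(
      "Verbatim --text input needs a leading Markdown heading, or pass it as --file.",
    );
  }
  return slugifyKey(title);
}
```

Because neither branch consults an LLM or a hash, the same input always maps to the same key.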
### 5. Frontmatter: passthrough + gap-fill
If the input has its own YAML frontmatter, split it from the body: the body is
everything after the closing `---`; the frontmatter is authoritative metadata.
- **Passthrough.** Every input frontmatter field is preserved in the stored
page, **including fields not in `WikiFrontmatter`** (`effective_date`,
`version`, `owner`, …). The serializer `YAML.stringify`s the object, so
unknown keys round-trip. Dropping them would be silent data loss on
authoritative docs.
- **Gap-fill only.** Generated/derived metadata fills **absent** fields only;
it **MUST NOT** overwrite an explicit value. An input `summary:` is never
replaced by a generated one; explicit `tags`/`sl_refs` are likewise kept.
- **Defaults.** `usage_mode` defaults to `auto` (findable via search, not
force-injected) when the input does not set it.
- **Connection scoping.** `--connection-id X` (validated via
`assertConfiguredConnectionId`, `context/connections/configured-connections.ts`)
sets `connections: [X]` when the input frontmatter does not already declare
`connections`. If the input frontmatter declares a **different**
`connections` than the flag, **hard error** (ambiguous intent) rather than
silently choosing one. If they match, or only one source is present, proceed.
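The passthrough, gap-fill, default, and connection rules compose into a single merge. The sketch below uses hypothetical names, and two choices in it are assumptions rather than spec text: an empty `connections` list is treated as "not declared", and any declared value other than exactly `[X]` counts as "different".

```typescript
type Frontmatter = Record<string, unknown>;

// Accepts a single string or a list; anything else counts as not declared.
function toIdList(value: unknown): string[] {
  if (typeof value === "string") return [value];
  if (Array.isArray(value)) {
    return value.filter((v): v is string => typeof v === "string");
  }
  return [];
}

function mergeFrontmatterSketch(
  input: Frontmatter,
  generated: Frontmatter,
  connectionId?: string,
): Frontmatter {
  // Passthrough: every input field survives, including unknown ones.
  const merged: Frontmatter = { ...input };
  // Gap-fill: generated values only land on absent fields.
  for (const [field, value] of Object.entries(generated)) {
    if (merged[field] === undefined) merged[field] = value;
  }
  if (merged["usage_mode"] === undefined) merged["usage_mode"] = "auto";
  if (connectionId !== undefined) {
    const declared = toIdList(input["connections"]);
    const matches = declared.length === 1 && declared[0] === connectionId;
    if (declared.length > 0 && !matches) {
      throw new Error(
        `Ambiguous connections: frontmatter declares [${declared.join(", ")}] ` +
          `but the flag requests "${connectionId}".`,
      );
    }
    merged["connections"] = [connectionId];
  }
  return merged;
}
```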
### 6. Degraded mode (`llm.provider.backend: none`)
`--verbatim` **MUST** work with no LLM backend; this is the capability that the
regular agent ingest lacks.
- `summary` is derived from the leading Markdown heading text, or, if none, the
first non-empty sentence of the body (trimmed to a reasonable length).
- `tags` and `sl_refs` are left empty.
- The body is still stored in full (requirement 3 applies unchanged).
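One way to read the summary rule is shown below. This is a sketch, not the shipped `deriveDegradedSummary`: the function names, the 200-character cap (the spec says only "a reasonable length"), and the sentence-boundary heuristic are all assumptions.

```typescript
// Assumed cap; the spec does not fix a number.
const MAX_SUMMARY_CHARS = 200;

// First sentence: text up to the first ., !, or ? that is followed by
// whitespace or the end of the text. Falls back to the whole text.
function firstSentence(text: string): string {
  const flattened = text.replace(/\s+/g, " ").trim();
  const match = flattened.match(/^.*?[.!?](?=\s|$)/);
  return match ? match[0] : flattened;
}

function sketchDegradedSummary(body: string): string {
  const trimmed = body.trim();
  const firstLine = trimmed.split(/\r?\n/, 1)[0] ?? "";
  // Prefer the leading Markdown heading when the body opens with one.
  const heading = firstLine.match(/^#{1,6}\s+(.+)$/)?.[1];
  const source = heading ?? firstSentence(trimmed);
  return source.trim().slice(0, MAX_SUMMARY_CHARS);
}
```

The lookahead in `firstSentence` keeps decimal constants such as `3.5` from being mistaken for a sentence end.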
### 7. Key collisions: idempotent-if-identical, else hard error
Verbatim mode does **not** reuse the agent write tool's in-place merge. Before
writing, read any existing `GLOBAL` page at the derived key:
- **No existing page** → write.
- **Existing page, stored body identical** to the new body (compared after the
storage-layer normalization in **Model**) → **idempotent no-op success**
(re-running the same file is safe).
- **Existing page, body differs** → **hard error** naming the conflicting key
and directing the user to a distinct key. Never a silent overwrite, never an
auto-suffixed second page (which would produce the duplicated/divergent pages
this mode must avoid).
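The three cases above amount to a small decision function. In this sketch the names are hypothetical, and `normalizeBody` only approximates the storage-layer behavior described in **Model** (trim, then a single trailing newline).

```typescript
type WriteDecision =
  | { action: "write" }
  | { action: "noop" }
  | { action: "conflict"; message: string };

// Approximates the shared store's canonicalization of surrounding whitespace.
function normalizeBody(body: string): string {
  return `${body.trim()}\n`;
}

// `existingBody` is undefined when no GLOBAL page exists at the derived key.
function decideVerbatimWrite(
  key: string,
  newBody: string,
  existingBody: string | undefined,
): WriteDecision {
  if (existingBody === undefined) return { action: "write" };
  if (normalizeBody(existingBody) === normalizeBody(newBody)) {
    return { action: "noop" };
  }
  return {
    action: "conflict",
    message: `Page "${key}" already exists with different content; choose a distinct key.`,
  };
}
```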
### 8. LLM-failure handling
When a backend **is** configured but the metadata call fails (rate limit,
transport error, malformed output after retries), **fail the item** (honoring
`--fail-fast` and the per-item exit-code aggregation in `text-ingest.ts`).
**MUST NOT** silently fall back to degraded derivation: a degraded page written
on a transient error would, under requirement 7, refuse to be replaced by a
healthy re-run — breaking reproducibility. Degraded derivation is reserved for
`backend: none`.
### 9. Findability
After write, the page is reindexed so search returns it:
- `wiki_search` for a phrase taken from the document body returns the page via
the lexical lane (the body is indexed in `buildKnowledgeSearchText`).
- `wiki_search` for a paraphrase of the document's topic returns it via the
semantic lane **when embeddings are enabled** (this is what the generated
`summary`/`tags` buy over a bare degraded page).
## Acceptance criteria
- Ingesting a file with `--verbatim` produces a page whose body is
byte-identical to the trimmed input body (assert with a hash in tests).
- A >48k-char file is stored in full (assert stored body length ≥ input length
minus trim).
- Running the same `--verbatim` ingest twice is idempotent: one page, identical
bytes both times, no error on the second run.
- A second ingest to the same derived key with **different** body content fails
loudly (requirement 7) and does not modify the existing page or create a
suffixed one.
- Input frontmatter with an unknown field (e.g. `effective_date`) is preserved
in the stored page; an explicit input `summary` is **not** overwritten by a
generated one.
- With `llm.provider.backend: none`, `--verbatim` still produces a page: full
body stored, `summary` derived from the heading/first sentence, `tags` and
`sl_refs` empty.
- `--verbatim --connection-id X` yields a page with `connections: [X]`; an
unknown id is rejected with an error listing the configured ids. (Depends on
spec 01, now shipped.)
- `--verbatim --connection-id X` where the input frontmatter already declares a
different `connections` fails with an ambiguity error.
- `ktx ingest --text "no heading here" --verbatim` errors asking for a leading
heading or `--file`.
- `wiki_search` for a body phrase returns the page (lexical lane); for a topic
paraphrase it returns the page when embeddings are enabled (semantic lane).
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
module layout and design, subject to the invariants above.
- **Command flag:** `commands/ingest-commands.ts` (`ktx ingest` option table;
`--text`/`--file`/`--connection-id`/`--fail-fast` already present — add
`--verbatim` and thread it into `KtxTextIngestArgs`).
- **Orchestration:** `text-ingest.ts` (`runKtxTextIngest`, `loadItems`,
`validateItems`, per-item loop and exit-code aggregation). The verbatim flow
reuses item loading and replaces the `memoryIngest.ingest(...)` call with a
code-driven write for `--verbatim` items. Keep the new logic in a focused
module (e.g. a `verbatim-ingest` sibling) rather than swelling `text-ingest`.
- **Frontmatter split / write / serialize:** `wiki/knowledge-wiki.service.ts`
(`parsePage` for the `---…---` split shape, `serializePage`, `writePage`,
`readPage` for the collision check). Write through this shared path — do not
re-implement YAML framing.
- **Key derivation:** `wiki/keys.ts` (`suggestFlatWikiKey`, `assertFlatWikiKey`).
- **Frontmatter type:** `wiki/types.ts` (`WikiFrontmatter`; `summary` and
`usage_mode` are the required fields; unknown passthrough fields live
alongside).
- **Connection validation:** `context/connections/configured-connections.ts`
(`assertConfiguredConnectionId`, shipped with spec 01).
- **Metadata LLM call:** the local LLM runtime/config resolution in
`context/llm/` (e.g. `local-config.ts`; `backend: none` ⇒ no runtime). Use a
single `generateObject` call with a `zod` metadata schema; the `ai-sdk` skill
covers v6 patterns.
- **Reindex / search lanes:** `wiki/local-knowledge.ts`
(`loadAllKnowledgePages`, `buildKnowledgeSearchText`, the lexical/token/
semantic lanes) and `wiki/sqlite-knowledge-index.ts` (`sync`).
- **Tests:** extend `packages/cli/test/text-ingest.test.ts` and add a
verbatim-focused test file covering the acceptance criteria above.
## Benchmark context (motivation only)
Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket
definitions, the haversine formula, F1 overtake rules, …). Gold SQL was
authored against their **exact** text; an LLM paraphrase that drops a bucket
boundary or rounds a constant loses the corresponding question. The current
workaround is hand-writing frontmatter and copying files into `wiki/global/`.
Verbatim mode turns that manual step into a supported **ktx** workflow, and
composes with the connection scoping from spec 01 so a doc relevant to exactly
one of the benchmark's ~30 SQLite databases does not surface for the other 29.
## Implementation notes
Shipped on branch `write-feature-spec-wiki`. All acceptance criteria are covered
by tests and verified end-to-end through the linked `ktx-dev` binary.
**What was built**
- New module `packages/cli/src/verbatim-ingest.ts`: `createLocalProjectVerbatimIngestor`
+ `LocalVerbatimIngestor`, plus the pure helpers `splitInputDocument`,
`deriveVerbatimPageKey`, `deriveDegradedSummary`, and `buildVerbatimFrontmatter`
(the last four are `@internal` exports for unit testing).
- `--verbatim` flag added to `ktx ingest` in `commands/ingest-commands.ts`, with a
guard that rejects `--verbatim` without `--text`/`--file`. The flag is threaded
into `KtxTextIngestArgs.verbatim`.
- `text-ingest.ts` now tags each loaded item with an `origin`
(`file` / `text` / `stdin`) and, when `verbatim` is set, constructs the verbatim
ingestor once and branches the per-item loop to a code-driven write instead of
`memoryIngest.ingest(...)`. The shared view, exit-code aggregation, and
`--fail-fast` handling are reused.
**Deviations from the literal spec (design refinements, per "implementer owns the design")**
- *Metadata call.* The spec suggested raw AI SDK v6 `generateObject`. The
implementation routes through the existing `KtxLlmRuntimePort.generateObject`
instead — it is implemented by all three backends (ai-sdk, claude-code, codex),
and the ai-sdk one already wraps `generateText` + `Output.object({schema})`.
This realizes the spec's "single constrained structured-output call" intent via
the canonical cross-backend path rather than forking a second LLM entry point.
- *Reindex (requirement 9).* In the standalone CLI, `searchLocalKnowledgePages`
rebuilds the SQLite index from disk on every call (recomputing embeddings for
changed pages), so a written page is findable without a dedicated reindex step.
The write still goes through the shared `KnowledgeWikiService.writePage` +
`syncSinglePage` path, so the page is also eagerly indexed.
- *Gap-fill optimization.* The LLM is skipped entirely when the input frontmatter
already supplies `summary`, `tags`, and `sl_refs` (generated metadata only fills
absent fields, so there is nothing to generate). A fully specified document thus
ingests with a configured backend without any LLM call.
**Tests**
- `packages/cli/test/verbatim-ingest.test.ts` — helper units + ingestor integration
against a real `initKtxProject` git repo (byte-identical body hash, >48k no-clip,
idempotency, conflict hard-error, frontmatter passthrough, explicit-summary
preservation, degraded mode, connection scoping + unknown-id rejection +
ambiguity error, no-heading inline error, LLM gap-fill, LLM-failure-fails-item,
lexical + semantic findability).
- `packages/cli/test/text-ingest.test.ts` — verbatim routing, origin tagging,
connection-id forwarding, fail-fast.
- `packages/cli/test/index.test.ts`: `--verbatim` flag threading and the
requires-`--text`/`--file` guard.
**Docs**
- `docs-site/content/docs/cli-reference/ktx-ingest.mdx` (flag, "Verbatim ingest"
section, examples, common errors) and
`docs-site/content/docs/guides/writing-context.mdx` (authoritative-document
workflow).
**Verification**
- Full CLI suite: 2959 passed, 1 skipped. `pnpm run build` and `pnpm run dead-code`
(Biome + Knip default + production) clean; pre-commit clean on changed files.
A pre-existing, unrelated type error in `test/mcp-server-factory.test.ts` is
untouched — it predates this work.

@ -1,361 +0,0 @@
# Schema scan tolerates individual objects that fail introspection
> Refined spec. Intake draft: `todo/06-scan-tolerate-broken-objects.md`.
## Problem
A single broken or inaccessible object zeroes out an entire connection's
context. Schema introspection iterates objects with no per-object error
handling, so one throw aborts the whole scan, the live-database adapter's
`fetch()` rejects, and the connection ends with **no semantic layer at all**
even when every other object was healthy.
The failure surfaces in two phases, and the contract must hold in both:
- **Metadata read (sqlite).** `connectors/sqlite/connector.ts` does
`rawTables.map((t) => this.readTable(...))` (≈ line 171) with no try/catch.
`readTable` runs `PRAGMA table_info(<object>)`, which *executes* a view's
body to resolve its columns — so a view over a dropped/renamed column (the
`oracle_sql` case: `emp_hire_periods_with_name` selecting `ehp.start_date`
from a base table that has no such column) raises `no such column:
ehp.start_date` and aborts introspection of all ~48 healthy objects.
- **Profiling read (warehouse drivers).** postgres/mysql/clickhouse/sqlserver/
bigquery/snowflake read metadata in bulk from catalog / `information_schema`
(a broken view rarely breaks that), then fail when a per-object profiling or
sampling `SELECT` runs against a broken object. Enrichment sampling is
*already* isolated (`description-generation.ts` wraps `sampleTable` in
try/catch → `sampling_failed`), but mandatory introspection-phase reads are
not uniformly isolated across drivers.
A second, related defect blocks the documented escape hatch. Setting
`enabled_tables: ["main.customers"]` on a sqlite connection produces a
different hard failure — `Adapter "database schema" did not recognize fetched
source output`. Root cause: the sqlite connector emits every object as
`{ db: null }` and filters the scope with `scopedTableNames(scope, { db: null })`
(`context/scan/table-ref.ts` ≈ line 47, `if (ref.db !== wantDb) continue`), but
`"main.customers"` parses to `{ db: "main", name: "customers" }`
(`context/scan/enabled-tables.ts`, `parseDottedTableEntry`). `"main" !== null`,
so the entry matches **nothing**, zero table files are written, and
`detectLiveDatabaseStagedDir` (`stage.ts` ≈ line 138) returns false, tripping
the generic "did not recognize fetched source output" error at
`context/ingest/local-stage-ingest.ts` (≈ line 291). The bare form
`enabled_tables: ["customers"]` would have worked; the `main.`-qualified form
silently matches nothing.
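The mismatch can be reproduced in a few lines. The sketch below is a simplification: `parseEntry` handles only the `name` and `db.name` forms and stands in for `parseDottedTableEntry`, and `sqliteMatch` shows one possible normalization, not necessarily the fix that should ship.

```typescript
interface TableRef {
  db: string | null;
  name: string;
}

// Simplified stand-in: "customers" has no qualifier, "main.customers" has db "main".
function parseEntry(entry: string): TableRef {
  const dot = entry.indexOf(".");
  return dot === -1
    ? { db: null, name: entry }
    : { db: entry.slice(0, dot), name: entry.slice(dot + 1) };
}

// Strict comparison, as in the failing path: "main" !== null, so nothing matches.
function strictMatch(scope: string[], wantDb: string | null): string[] {
  return scope
    .map(parseEntry)
    .filter((ref) => ref.db === wantDb)
    .map((ref) => ref.name);
}

// Possible sqlite-specific rule: the sole schema `main` equals "no qualifier".
function sqliteMatch(scope: string[]): string[] {
  return scope
    .map(parseEntry)
    .filter((ref) => ref.db === null || ref.db === "main")
    .map((ref) => ref.name);
}
```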
## Generic use case
Real warehouses routinely contain broken or inaccessible objects: views over
dropped/renamed columns, views referencing tables the connection role can't
read, permission-denied tables, and vendor system views that error on read.
**ktx** should ingest everything it *can* and skip what it can't, so one bad
object never zeroes out an entire connection's context. This is baseline
production robustness, independent of any benchmark — the same tolerance a
33-warehouse fleet needs the first time one of its databases has a stale view.
## Design
The unit of failure is **one object** (table or view). Introspecting or
profiling an object is an operation that can fail independently; a failure skips
that object, records a recoverable warning, and the scan continues from the
objects that succeeded.
Because seven Node connectors and the Python daemon each introspect differently
(sqlite reads metadata per-object via `PRAGMA`; warehouse drivers read metadata
in bulk and fail per-object during profiling), the **semantics** of "skip /
warn / total-failure" are defined **once** and every connector routes through
them — rather than seven copies of the same try/catch that drift apart:
- A shared per-object helper in the `scan/` layer — the sibling of the existing
`tryConstraintQuery` (`context/scan/constraint-discovery.ts`) — wraps a single
object read and returns `{ ok: true, table } | { ok: false, warning }`, with a
standard warning code (e.g. `object_introspection_failed`).
- A shared post-check enforces the total-failure rule (R3) uniformly.
- Each connector keeps its **natural** shape: sqlite routes each `readTable`
through the helper; bulk-read drivers route their per-object profiling reads
through it. The contract is uniform; the loop is not forced to be.
- The Python daemon implements the **same contract** in its own helper, adds a
`warnings` field to `DatabaseIntrospectionResponse`, and the Node adapter maps
those warnings into `KtxSchemaSnapshot` (`daemon-introspection.ts`).
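A minimal sketch of the shared per-object helper described above. The types are simplified stand-ins for `KtxScanWarning`, the function names are hypothetical, and the reads are synchronous for brevity where a real connector would likely be async.

```typescript
interface ScanWarning {
  code: "object_introspection_failed";
  table: string;
  message: string;
  recoverable: true;
}

type ObjectResult<T> =
  | { ok: true; table: T }
  | { ok: false; warning: ScanWarning };

// Wraps a single object read so a throw becomes a recoverable warning.
function tryReadObject<T>(name: string, read: () => T): ObjectResult<T> {
  try {
    return { ok: true, table: read() };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return {
      ok: false,
      warning: { code: "object_introspection_failed", table: name, message, recoverable: true },
    };
  }
}

// Routes every object through the helper and keeps scanning after a failure.
function scanObjectsSketch<T>(
  names: string[],
  read: (name: string) => T,
): { tables: T[]; warnings: ScanWarning[] } {
  const tables: T[] = [];
  const warnings: ScanWarning[] = [];
  for (const name of names) {
    const result = tryReadObject(name, () => read(name));
    if (result.ok) tables.push(result.table);
    else warnings.push(result.warning);
  }
  return { tables, warnings };
}
```

The sketch omits redaction and the total-failure post-check; both would sit outside the helper so that every connector shares them.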
The warning channel already exists end to end on the Node side
(`KtxSchemaSnapshot.warnings`, the `KtxScanWarning` shape with `table`/`column`/
`recoverable`, the `KtxScanWarningCode` enum, and the staged `warnings.json`
artifact written by `writeLiveDatabaseSnapshot`); sqlite simply never populates
it. This spec makes that channel carry object-skip warnings and surfaces them in
the ingest summary, the persisted report body, and `ktx status`.
## Requirements
### R1 — Per-object isolation (the contract)
If introspecting or profiling one object throws, the scan **MUST** skip that
object, record a `KtxScanWarning` (object name, the error message, and any
schema/catalog qualifier; `recoverable: true`), and continue with the remaining
objects. No single object may abort the scan.
- The contract holds in **both** phases: the mandatory metadata read *and* any
profiling/row-count/sample read performed during introspection.
- It holds for **all seven Node connectors**
(`packages/cli/src/connectors/<driver>/`) and the **Python daemon** postgres
path (R6).
- The semantics are defined once (the shared helper + warning code from the
Design section) and every connector routes through them. Do not inline a
divergent per-driver copy.
- Warnings **MUST NOT** carry secrets or full SQL bodies; record the object
identifier and the database's error text, redacted through the existing
`redactKtxSensitiveMetadata` path that `warnings.json` already uses.
### R2 — Surface, don't hide
Skipped objects **MUST** be reported both at ingest time and in the durable
status view:
- **Ingest summary.** The `ktx ingest` run summary (human-facing output) reports
a count plus the object name and a short reason for each skip — e.g.
`Skipped 1 object — emp_hire_periods_with_name: no such column ehp.start_date`.
- **Run report.** Object skips land in the run report's `warnings.json` artifact
(already written) and in the persisted report body (`IngestReportBody`), whose
natural home is the existing `fetch?: SourceFetchReport` field — the fetch
phase *is* introspection.
- **`ktx status`.** `ktx status` shows a per-connection skipped-objects line for
the connection's latest ingest — e.g. `oracle_sql: 1 object skipped —
emp_hire_periods_with_name: no such column ehp.start_date`. This is **derived
from the latest persisted report, not new persisted state**: the report body
is already stored whole as a JSON blob (`local_ingest_reports.body_json`), so
surfacing it requires **no `.ktx/db.sqlite` schema migration**: `status`
reads and renders the skip info already present in the latest report body. A
connection whose latest ingest skipped nothing shows no such line.
### R3 — Failure semantics (partial vs total)
Per-object skipping is **unconditional** — there is **no new config knob**, and
the existing `ingest.workUnits.failureMode` (which governs the later LLM
work-unit stage, not introspection) is untouched and orthogonal. Outcomes are
derived from object counts, not from a mode:
| Scope | Objects discovered / matched | Introspection outcome | Result |
| --- | --- | --- | --- |
| none | 0 | n/a (legitimately empty DB) | **success**, empty layer |
| none | N > 0 | ≥ 1 succeeds | **success** + warnings for the rest |
| none | N > 0 | all N fail | **connection failure** (clear error) |
| `enabled_tables` | matches 0 objects | n/a | **clear scope error** (R5) |
| `enabled_tables` | matches M > 0 | ≥ 1 succeeds | **success** + warnings |
| `enabled_tables` | matches M > 0 | all M fail | **connection failure** |
- "Connection failure" means the connector / `fetch()` raises a **clear,
actionable error** for that connection. It **MUST NOT** surface as the generic
`did not recognize fetched source output` (that message is reserved for a
genuinely unrecognized staged dir, not an empty/total-failure result).
- A total failure of one connection follows existing per-connection ingest
orchestration for whether sibling connections continue; this spec does not
change cross-connection behavior.
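
The table above reduces to a small pure function of three inputs. This is a sketch only: the function name and the outcome labels are hypothetical, and the real decision lives in the Node ingest path.

```python
from typing import Optional


def scan_outcome(scope: Optional[list], matched: int, succeeded: int) -> str:
    """Map object counts to the outcomes in the R3 table.

    scope     -- the `enabled_tables` list, or None/empty when no scope is set
    matched   -- objects discovered (no scope) or matched by the scope
    succeeded -- how many of those were introspected successfully
    """
    if scope and matched == 0:
        return "scope_error"            # non-empty allowlist matched nothing (R5)
    if matched == 0:
        return "success_empty"          # legitimately empty database
    if succeeded == 0:
        return "connection_failure"     # every discovered/matched object failed
    if succeeded < matched:
        return "success_with_warnings"  # skipped objects are reported as warnings
    return "success"
```

Note that no mode or config value appears among the inputs, which is the point of "derived from object counts, not from a mode".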
### R4 — A broken view never blocks base tables
A broken view **MUST NEVER** prevent base-table ingest.
- View introspection failures are isolated exactly like any other object (R1).
- Mandatory introspection **MUST** prefer reading an object's structure from the
catalog where possible over executing the object's body, and **MUST NOT** run
a data-reading query (row count, sample) against a view as a required step.
(sqlite already skips `COUNT(*)` for views; the remaining gap is isolating the
metadata read that executes the view definition.)
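
A minimal sketch of the isolation this requirement asks for, using Python's bundled `sqlite3` rather than the Node connector. The `introspect` helper and the warning dict shape are hypothetical; the table and view names echo the example in R2. It assumes that sqlite resolves a view's definition when its column metadata is read, which is the failure described above.

```python
import sqlite3


def introspect(conn: sqlite3.Connection):
    """Read column metadata per object; a failing object becomes a warning."""
    tables, warnings = {}, []
    objects = conn.execute(
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    ).fetchall()
    for name, kind in objects:
        try:
            # For a view, this metadata read resolves the view definition,
            # so a broken view raises here.
            quoted = name.replace('"', '""')
            cols = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            tables[name] = [c[1] for c in cols]  # c[1] is the column name
        except sqlite3.Error as err:
            # Only database errors are isolated; other exceptions propagate.
            warnings.append({"object": name, "kind": kind, "message": str(err)})
    return tables, warnings


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE emp_hire_periods (emp_id INTEGER, start_date TEXT);
    CREATE VIEW emp_hire_periods_with_name AS
        SELECT ehp.emp_id, ehp.start_date FROM emp_hire_periods ehp;
    -- Simulate schema drift: the column the view depends on disappears.
    DROP TABLE emp_hire_periods;
    CREATE TABLE emp_hire_periods (emp_id INTEGER);
    CREATE TABLE employees (emp_id INTEGER, name TEXT);
    """
)
tables, warnings = introspect(conn)
```

No row count or sample query is issued against the view; the only read that can fail is the metadata read, and it fails for that one object alone.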
### R5 — `enabled_tables` allowlist works
The documented allowlist escape hatch **MUST** reliably restrict the scan to the
listed objects, with no spurious adapter error:
- **sqlite qualification.** The schema-qualified form `"main.<name>"` **MUST**
resolve to the same object as the bare form `"<name>"` (sqlite's sole schema
is `main`; the connector emits `db: null`). Both forms select the object;
neither silently matches nothing.
- **Documented format.** The accepted qualification forms for each driver
(`catalog.db.name` / `db.name` / `name`) and the sqlite-specific `main`
equivalence **MUST** be documented where `enabled_tables` is described
(`context/project/driver-schemas.ts` and the user-facing config docs).
- **Zero-match is a clear error.** A non-empty `enabled_tables` that resolves to
**zero** matched objects **MUST** fail with an actionable error naming the
connection, the unmatched entries, and the available object names — **not** the
generic `did not recognize fetched source output`. This is distinct from a
legitimately empty database (R3 row 1) and from a matched-but-all-broken scope
(R3 last row).
- **Any subset works.** An `enabled_tables` matching M > 0 objects ingests
**exactly** those M objects (minus any that fail per R1), with no adapter
recognition error regardless of how small or edge-case the set is.
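
The sqlite equivalence and the zero-match error can be sketched together. This is an illustration under stated assumptions: the function name, signature, and error wording are hypothetical, and only the sqlite `main.` case is handled (other drivers' `catalog.db.name` forms are out of scope for the sketch).

```python
def resolve_enabled_tables(driver, connection_id, entries, available):
    """Return the available object names selected by `entries`.

    For sqlite, a leading `main.` qualifier is dropped so `main.<name>` and
    `<name>` select the same object. An empty allowlist means "no scope".
    Raises ValueError when a non-empty allowlist matches nothing.
    """
    if not entries:
        return list(available)

    def normalize(entry):
        if driver == "sqlite" and entry.lower().startswith("main."):
            return entry[len("main."):]
        return entry

    wanted = {normalize(e) for e in entries}
    matched = [name for name in available if name in wanted]
    if not matched:
        # Name the connection, the unmatched entries, and what is available.
        raise ValueError(
            f"connection '{connection_id}': enabled_tables matched no objects; "
            f"unmatched entries: {sorted(entries)}; "
            f"available objects: {sorted(available)}"
        )
    return matched
```

Raising at resolution time keeps the zero-match case distinguishable from an empty database, where there is no allowlist to be wrong about.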
### R6 — Python daemon parity
The daemon's postgres introspection path **MUST** honor the same contract:
- Add a `warnings` field to `DatabaseIntrospectionResponse`
(`python/ktx-daemon/src/ktx_daemon/database_introspection.py`) carrying the
same shape Node expects (code, message, object identifier, recoverable).
- Isolate per-object failures in the daemon's introspection so one broken object
does not abort the response; apply the R3 total-failure rule there too.
- Map daemon warnings into `KtxSchemaSnapshot.warnings` in
`mapDaemonSnapshot` (`context/ingest/adapters/live-database/daemon-introspection.ts`),
which currently drops them.
- The Node and Python warning shapes **MUST** stay in parity (the codebase
already mirrors Node↔Python schemas for telemetry; follow the same discipline
so the daemon cannot emit a code Node can't render).
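
One way to picture the parity rule is a warning record plus an allowlist of renderable codes. The sketch below uses a plain dataclass; whether the daemon models its responses this way is an assumption, as are the names `IntrospectionWarning`, `object_id`, and `renderable`. The code string `object_introspection_failed` is the one recorded in the implementation notes at the end of this spec.

```python
from dataclasses import asdict, dataclass

# Codes the consuming side knows how to render (assumed single-entry set).
KNOWN_CODES = frozenset({"object_introspection_failed"})


@dataclass(frozen=True)
class IntrospectionWarning:
    code: str
    message: str
    object_id: str     # identifier of the skipped object
    recoverable: bool


def renderable(warnings):
    """Keep only warnings whose code the consumer can render."""
    return [asdict(w) for w in warnings if w.code in KNOWN_CODES]
```

Filtering on the consumer side is a safety net; the parity requirement itself is that the producer never emits a code outside the shared set.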
## Acceptance criteria
- Ingesting a sqlite DB with one broken view + N healthy tables yields a
semantic layer for the N healthy tables and **exactly one** warning naming the
broken view and its error; exit is **success**.
- The skipped object appears in the `ktx ingest` summary output, in the run's
`warnings.json`, and in `ktx status` as a per-connection skipped-objects line
on the connection's latest ingest.
- A sqlite DB in which **every** discovered object fails introspection (and the
file opens) exits as a **connection failure** with a clear error — not an
empty "success" and not `did not recognize fetched source output`.
- A genuinely empty sqlite DB (zero objects) exits **success** with an empty
layer (not a failure).
- `enabled_tables: ["main.customers"]` and `enabled_tables: ["customers"]` both
ingest exactly the `customers` object on a sqlite connection.
- `enabled_tables` restricted to a valid subset of M objects ingests exactly
that subset, with **no** adapter-output error.
- `enabled_tables` that matches zero objects fails with an error naming the
connection, the unmatched entries, and available objects — distinguishable
from the empty-DB and all-broken cases.
- A broken view does not prevent ingest of base tables in the same connection
(regression test with a view that errors on read alongside a healthy table).
- The daemon's `DatabaseIntrospectionResponse` carries a `warnings` array, and a
per-object failure in the daemon path produces a warning mapped into
`KtxSchemaSnapshot.warnings` (Node↔Python parity test).
- A warehouse-driver object whose profiling/sample read fails is skipped with a
warning and does not abort introspection of its siblings.
- Existing healthy-only ingests (no broken objects, no `enabled_tables`) behave
identically before/after — no warnings, same semantic layer.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the design.
- **Shared semantics:** `context/scan/constraint-discovery.ts`
(`tryConstraintQuery` / `constraintDiscoveryWarning` — the precedent to mirror
for the per-object helper), `context/scan/types.ts`
(`KtxSchemaSnapshot.warnings`, `KtxScanWarning`, `KtxScanWarningCode` — add the
new object-skip code here).
- **Node connectors:** `packages/cli/src/connectors/<driver>/connector.ts` and
each `live-database-introspection.ts`. sqlite's loop is
`connectors/sqlite/connector.ts` `introspect` (≈ line 158) → `readTable`
(≈ line 306); the missing try/catch is the `rawTables.map(...)` at ≈ line 171.
Existing per-table sample isolation precedent: `description-generation.ts`
(≈ line 867, `sampling_failed`).
- **Driver dispatch:** `packages/cli/src/local-adapters.ts` (≈ lines 122-156)
routes every driver to its Node connector; the daemon is the `else` fallback.
- **`enabled_tables` matching:** `context/scan/enabled-tables.ts`
(`resolveEnabledTables`, `parseDottedTableEntry`), `context/scan/table-ref.ts`
(`scopedTableNames`, the `ref.db !== wantDb` filter ≈ line 47),
`context/project/driver-schemas.ts` (`enabled_tables` schema + description).
- **Staging / detect / error surface:**
`context/ingest/adapters/live-database/stage.ts`
(`writeLiveDatabaseSnapshot`, `warningArtifact` ≈ line 94,
`detectLiveDatabaseStagedDir` ≈ line 138),
`context/ingest/local-stage-ingest.ts` (the
`did not recognize fetched source output` throw ≈ line 291 — must stop being
the surface for empty-scope and total-failure).
- **Ingest summary:** `packages/cli/src/ingest.ts` (`writeReportStatus`
≈ line 202), `context/ingest/memory-flow/summary.ts`
(`formatMemoryFlowFinalSummary`) — thread object skips into the human-facing
summary.
- **Report body + `ktx status`:** `context/ingest/reports.ts` (`IngestReportBody`;
`SourceFetchReport` as the home for scan warnings),
`context/ingest/sqlite-local-ingest-store.ts` (the report body is persisted
whole as `body_json` ≈ line 90 — no migration needed), `status-project.ts`
(`buildLocalStatsStatus` reads `local_ingest_reports`; parse the latest body
per connection and render the skipped line via `renderLocalStatsAsLines`).
- **Daemon path:** `python/ktx-daemon/src/ktx_daemon/database_introspection.py`
(`DatabaseIntrospectionResponse` ≈ line 165, `introspect_database_response`
≈ line 323, `_load_postgres_rows` ≈ line 227, `_map_rows_to_tables`
≈ line 267), and the Node mapping in
`context/ingest/adapters/live-database/daemon-introspection.ts`
(`mapDaemonSnapshot` ≈ line 209).
## Benchmark context (motivation only)
`oracle_sql` (8 of the 135 local sqlite questions) currently has **no** semantic
layer because of its one broken view, so those questions fall back to raw
`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning
restores enriched context for that database. The same robustness is required for
the full Spider 2.0-Lite run across BigQuery and Snowflake, where broken or
permission-restricted objects are common and a single one must not zero out a
warehouse's context.
## Implementation notes
Shipped on branch `write-feature-spec-wiki`. All requirements implemented;
verified with `pnpm --filter @kaelio/ktx run test` (2981 passing),
`pnpm run dead-code`, `uv run pytest python/ktx-daemon/tests` (97 passing),
`uv run pre-commit`, and `pnpm run build && pnpm run link:dev`.
**Shared semantics (R1).** New `context/scan/object-introspection.ts` exposes
`tryIntrospectObject(ctx, fn)` (sibling of `tryConstraintQuery`), returning
`{ ok, table } | { ok: false, warning }` and building an
`object_introspection_failed` warning (object name + redactable DB error). It
rethrows native programming faults (`isNativeProgrammingFault`) so a ktx bug is
never masked as an object skip. The new warning code was added to
`KtxScanWarningCode` (`scan/types.ts`), the `scanWarningCodes` allowlist
(`local-structural-artifacts.ts`, plus a new exported `isKtxScanWarningCode`
validator), and `describeWarningGroup` (`scan.ts`).
**Per-object isolation, where it actually exists (R1/R4).** Only sqlite
(`readTable` via `PRAGMA`) and bigquery (`tableRef.get()` per dataset) do
per-object reads during *mandatory* introspection; both now route each object
through `tryIntrospectObject`. The other five Node connectors (postgres, mysql,
clickhouse, sqlserver, snowflake) read metadata in bulk from the catalog/
`information_schema` (already object-safe at this phase) and isolate per-object
profiling/sampling in the enrichment phase (`description-generation.ts`,
`sampling_failed`), so no divergent per-driver try/catch was added there. sqlite
also tolerates a `COUNT(*)` (profiling) failure without dropping a
structurally-readable table, and a broken view's metadata read is isolated so it
never blocks base tables (R4).
**Single-source outcome decision (R3/R5).** New
`adapters/live-database/scan-outcome.ts#assertLiveDatabaseScanOutcome` runs once
in `LiveDatabaseSourceAdapter.fetch()` — the one path every driver (and the
daemon) routes through — and derives the outcome from the snapshot + scope:
≥1 object → success (skips ride along as warnings); all matched objects failed →
clear `KtxExpectedError`; non-empty `enabled_tables` matched nothing → clear
zero-match error naming the connection, the requested entries, and the available
objects (sqlite/bigquery attach the discovered inventory via
`metadata.discovered_object_names`); empty database (no scope) → success with an
empty layer. `detectLiveDatabaseStagedDir` no longer requires table files, so a
valid empty staging is recognized; total-failure/zero-match now throw a clear
connection error before staging instead of surfacing the generic
`did not recognize fetched source output`.
**`enabled_tables` matching (R5).** Normalized at the scope boundary in
`resolveEnabledTables` using `connection.driver`: for sqlite, `main.<name>` →
`{ db: null }`, so `"main.customers"` and `"customers"` select the same object.
`table-ref.ts` stayed generic. Documented in `driver-schemas.ts` and
`docs-site/.../configuration/ktx-yaml.mdx`.
**Surfacing (R2).** Deviation from the spec's orientation: live-database schema
ingest runs through the **stage-only** path (`runLocalStageOnlyIngest` →
`local_ingest_reports`), not the bundle runner, so the home for scan warnings is
`LocalIngestRunRecord.fetch` (a new `SourceFetchReport` field; `body_json` is
persisted whole, so **no migration**), not the bundle-only
`IngestReportBody.fetch`. Both ingest paths read `adapter.readFetchReport`
(`live-database/fetch-report.ts` derives skips from the existing `warnings.json`).
The ingest summary is already rendered by `runKtxScan` from `report.warnings`
(the new `describeWarningGroup` case), and `ktx status`
(`status-project.ts#buildLocalStatsStatus`/`renderLocalStats`) now parses the
latest report body per connection and prints a per-connection
`N object(s) skipped — name: reason` line.
**Daemon parity (R6).** `database_introspection.py` adds a `warnings` field to
`DatabaseIntrospectionResponse` and a `DatabaseIntrospectionWarning` model,
isolates per-object failures in `_map_rows_to_tables`, and shares the
`OBJECT_INTROSPECTION_FAILED_CODE = "object_introspection_failed"` string with
Node. `mapDaemonSnapshot` maps `raw.warnings` into `KtxSchemaSnapshot.warnings`,
dropping any code Node cannot render (validated via `isKtxScanWarningCode`).
Deviation: the daemon does **not** re-enforce the R3 total-failure rule — the
shared Node post-check (`assertLiveDatabaseScanOutcome`) owns it for every driver
including the daemon, avoiding a divergent second implementation. Parity is
covered by a Node test (daemon-shaped warning round-trips) and a pytest
(per-object failure → warning with the shared code).

@ -1,363 +0,0 @@
# Add universal SQL-authoring craft to the ktx-analytics skill
> Refined spec. Intake draft: `todo/07-analytics-skill-sql-craft.md`.
## Problem
The shipped `ktx-analytics` skill
(`packages/cli/src/skills/analytics/SKILL.md`) is an *orchestration* guide: its
`<workflow>` and `<rules>` tell the agent **which ktx tools to call and in what
order** (`discover_data` → `entity_details`/`sl_read_source` →
`sl_query`/`sql_execution` → validate → `memory_ingest`). It says almost nothing
about **writing correct SQL**.
That gap shows up as a specific failure shape: the agent reliably produces
*runnable* SQL but *wrong* results. The recurring defects are universal
analytics-engineering mistakes, not ktx-specific ones:
- comparing a string column to a numeric literal (or vice versa), which can
silently match zero rows;
- rounding inside intermediate CTEs, so the final number is off;
- ranking/“first”/“most recent” windows with no deterministic tie-breaker, so
results flicker run to run;
- filtering *before* a window function for sequence/“since”/“first” questions,
truncating the partition the window should see;
- returning a full ranked list for a “top/highest” question, or collapsing a
“per X” question to a single value;
- dropping the inputs (or the entity identifier) a derived value was built from.
These are correctness defects every ktx user hits on a live database. They
belong in the shipped skill — fixing them once improves ktx for everyone, rather
than living in any individual caller's prompt.
## Generic use case
An analyst (human or agent) points ktx at a **live, production** database and
asks a real analytical question: “what's the most recent order per customer”,
“top region by margin”, “average order value by month”. The schema is unfamiliar
(unknown date encodings, nullable join keys, string-typed numeric columns), the
question carries grain and ranking intent in its wording, and the answer must be
*correct and deterministic*, not merely executable. The skill should encode the
analytics-engineering craft that makes the difference between a query that runs
and a query that's right, independent of any benchmark.
## Model
The change is **additive content in one Markdown file**, governed by these
invariants. They constrain the implementer; the exact prose is theirs.
### Inline-only delivery (this is a hard constraint, not a style preference)
All new guidance lives **inside `skills/analytics/SKILL.md`**. A bundled
`reference/*.md` file (the progressive-disclosure pattern Anthropic's
skill-authoring guide recommends for large skills) **MUST NOT** be used here,
because the delivery mechanism ships only `SKILL.md`:
- `setup-agents.ts` installs the analytics skill via `readAnalyticsSkillContent()`,
which reads **only** `./skills/analytics/SKILL.md` and writes a **single** file
per target: `.claude/skills/ktx-analytics/SKILL.md` (Claude Code), the Codex /
universal `.agents` equivalent, a **flattened** single rules file for Cursor
(`.cursor/rules/ktx-analytics.mdc`) and OpenCode
(`.opencode/commands/ktx-analytics.md`), and a Claude Desktop **zip that
contains only `ktx-analytics/SKILL.md`** (`writeClaudeDesktopSkillBundle`).
- Nothing copies sibling files or subdirectories. A reference file would dangle
on every target, and the Cursor/OpenCode flatten-to-one-file shape cannot
represent a multi-file skill at all.
The skill is small enough that inline costs nothing meaningful: ~67 lines today
plus ~60 of craft is well under the 500-line budget. And this craft is **core
content** — consulted on every SQL-authoring turn — so even if multi-file delivery
existed it would still belong inline: progressive disclosure only pays off for
large, *conditionally-relevant* reference material loaded on demand, not for
always-needed craft.
Multi-file skill *delivery* is a legitimate future enhancement, but it must be
**pulled by a concrete need, not built ahead of one** — no shipped skill today
exceeds the budget (largest is ~346 lines) or uses a bundled reference. The first
real trigger is the **per-dialect SQL syntax follow-up**
(`todo/08-per-dialect-sql-syntax-notes.md`), whose load-on-demand
`reference/<dialect>.md` content is a genuine progressive-disclosure fit. When
that work is scoped, note that multi-file delivery is **not** a simple directory
copy: `setup-agents.ts` flattens the skill to a *single* file for Cursor
(`.mdc`) and OpenCode (`.md`), so those targets need a concatenation transform,
and uninstall needs per-file manifest entries. Recording the constraint here so a
future implementer does not “improve” this inline content into a bundled
reference that dangles on every target.
### Heuristics with a generic *why*, not a wall of MUSTs
The new rules are phrased as **heuristics with a one-line, universal rationale**,
because SQL authoring is a high-freedom task (many valid approaches, choice
depends on the question and the data). A bare imperative overfits; a rule plus
its *why* lets the model apply judgment and generalize. This follows Anthropic's
own skill-authoring guidance (“if you find yourself writing ALWAYS/NEVER in all
caps or rigid structures, reframe and explain the reasoning”).
This **reconciles the draft's “behavior only, no rationale” instruction**: the
prohibition is specifically on rationale that references a **grader, gold answer,
or the benchmark**. *Generic analytics-engineering rationale is required* — e.g.
“…so `RANK`/`ROW_NUMBER` results don't flicker across runs”, “…a string-vs-number
compare can silently match nothing”. That is a universal truth, not a
grader reference.
### Dialect-agnostic
Every rule must read correctly on any SQL dialect a ktx connection might use.
**No dialect-specific syntax** — not `QUALIFY` (Snowflake/BigQuery/DuckDB only),
not `strftime`/`julianday` (sqlite), not backtick/`DB.SCHEMA.TABLE` FQTNs.
Per-dialect syntax notes are a **separate follow-up** living in a dialect-aware
(per-driver) location, explicitly out of scope here.
### Discovery craft attaches to discovery; authoring craft to query/validate
Two of the draft's rules (inspect sample rows; cast before comparing) are
*schema-discovery* concerns that happen **before** SQL is composed. They belong
with the discovery steps of the existing workflow, not only at the query step.
The rest (composition, window correctness, precision, completeness) belong with
the query/validate steps. The draft's “extend step 5/6” is the right home for
most rules but is slightly off for the discovery pair; this spec corrects that.
### Additive only
The existing `<workflow>`, `<rules>`, and `<examples>` — compact result tables,
summaries, clarification prompts, the tool-order workflow, the `connectionId`
scoping rules — are preserved unchanged. The skill must still read well for an
interactive, human-facing analysis session.
## Requirements
### 1. Placement and structure
Add a dedicated, scannable craft section to `SKILL.md`:
- A new top-level block — `<sql_craft>` (sibling to `<workflow>`/`<rules>`) — with
**five sub-headings**: *Schema discovery*, *Composition*, *Window functions*,
*Numeric precision*, *Answer completeness*. Sub-headings keep the block
  scannable (the draft's “group under clear sub-headings” goal).
- **Pointers, not duplication.** Step 5 (“Query”) and step 6 (“Validate and
explain”) each gain a **one-line pointer** into `<sql_craft>` rather than
  inlining the rules (state each rule once; Anthropic's “consistent terminology /
  don't repeat” guidance). The schema-discovery pair is additionally reflected as
a brief cue in the discovery steps (step 2 “Inspect” / step 4 “Plan”), pointing
to the same block.
- No new tool, flag, or config. This is content only.
### 2. The craft rules (all fourteen behaviors, grouped)
Every behavior from the intake draft must be represented. Tightly-related ones
**may** be merged into a single bullet where that reads better; none may be
dropped. Each carries a generic *why* (per Model). Dialect-agnostic throughout.
**Schema discovery** (cue in steps 2/4; lives in `<sql_craft>`)
1. Inspect representative **sample rows** of each table before composing SQL —
confirm date/time encoding (`YYYYMMDD` vs ISO vs epoch), null prevalence in
join/filter keys, and the real set of categorical/enum values
(`entity_details` + a small `sql_execution` sample). *Why:* assumptions about
encoding and nullability are the most common source of silently-wrong filters.
2. **Cast a column to its real type before comparing** it in `WHERE`/`JOIN`. A
string column compared to a numeric literal (or vice versa) can silently match
nothing.
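
To make rule 2 concrete, here is a small demonstration using Python's bundled `sqlite3`. sqlite is used only because it ships with Python; how a string-vs-number comparison is coerced differs by engine, which is the reason the rule asks for an explicit cast. The table and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stores (store_code TEXT, city TEXT);
    INSERT INTO stores VALUES ('007', 'Lyon'), ('042', 'Turin');
    """
)

# Text column compared to a numeric literal: the query runs, but in sqlite
# the literal is compared as the text '7', which is not equal to '007'.
naive = conn.execute(
    "SELECT COUNT(*) FROM stores WHERE store_code = 7"
).fetchone()[0]

# Casting the column to the type the comparison intends makes the match explicit.
cast = conn.execute(
    "SELECT COUNT(*) FROM stores WHERE CAST(store_code AS INTEGER) = 7"
).fetchone()[0]
```

Here `naive` is 0 and `cast` is 1: the first query is runnable and silently wrong, the failure shape described in the Problem section.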
**Composition**
3. Build complex queries **incrementally** — one CTE at a time, verifying each
   layer's output on a small sample before stacking the next. *Why:* a wrong
intermediate layer is far cheaper to catch early than to debug in the final
result.
4. **Avoid fan-out joins.** Add columns only from tables already at the target
grain, or **pre-aggregate** to that grain before joining. *Why:* a join that
multiplies rows quietly inflates every downstream `SUM`/`COUNT`.
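
Rule 4 can be illustrated with two tables at different grains. The schema and numbers below are invented; the sketch runs on Python's bundled `sqlite3` but the SQL itself is standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100), (2, 50);
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
    """
)

# Fan-out: order 1 has two shipments, so its amount is counted twice.
fanned_out = conn.execute(
    """
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
    """
).fetchone()[0]

# Pre-aggregate shipments to the order grain, then join one row per order.
pre_aggregated = conn.execute(
    """
    WITH shipments_per_order AS (
        SELECT order_id, COUNT(*) AS shipment_count
        FROM shipments
        GROUP BY order_id
    )
    SELECT SUM(o.amount), SUM(spo.shipment_count)
    FROM orders o
    JOIN shipments_per_order spo ON spo.order_id = o.order_id
    """
).fetchone()
```

The fanned-out total is 250 against a true total of 150; the pre-aggregated query returns `(150, 3)`.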
**Window functions**
5. Give every ranking/ordering window function a **complete, deterministic
tie-breaker** (append unique key columns to `ORDER BY`), so
`RANK`/`ROW_NUMBER`/`LAG` are stable rather than flickering across runs.
6. For sequence / “first” / “most recent” / “since” questions, **filter after the
window**, not before: compute over the full partition, then keep the rows you
want. *Why:* a pre-filter shrinks the partition the window ranks over, so
“first”/“most recent” is computed against the wrong set. (See the worked
example, requirement 3.)
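
Rules 5 and 6 are easiest to see side by side on the question "which customers' most recent order was cancelled?". The data is invented, the SQL is standard (no `QUALIFY`), and the sketch assumes the Python build links sqlite 3.25 or newer, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, customer_id INTEGER,
        order_date TEXT, status TEXT
    );
    INSERT INTO orders VALUES
        (10, 1, '2024-01-01', 'cancelled'),
        (11, 1, '2024-02-01', 'completed'),
        (20, 2, '2024-01-05', 'completed'),
        (21, 2, '2024-03-01', 'cancelled');
    """
)

wrong_sql = """
WITH ranked AS (
    SELECT customer_id,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY order_date DESC, order_id DESC
           ) AS rn
    FROM orders
    WHERE status = 'cancelled'          -- filter shrinks the partition first
)
SELECT customer_id FROM ranked WHERE rn = 1 ORDER BY customer_id
"""

right_sql = """
WITH ranked AS (
    SELECT customer_id, status,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY order_date DESC, order_id DESC  -- unique tie-breaker
           ) AS rn
    FROM orders                          -- rank over the full partition
)
SELECT customer_id FROM ranked
WHERE rn = 1 AND status = 'cancelled'    -- then keep the rows you want
ORDER BY customer_id
"""

wrong = [row[0] for row in conn.execute(wrong_sql)]
right = [row[0] for row in conn.execute(right_sql)]
```

The pre-filtered query returns customers 1 and 2 (each has *a* cancelled order); ranking first returns only customer 2, whose latest order really is the cancelled one.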
**Numeric precision**
7. Compute at **full precision; round only in the final projection**, never inside
intermediate CTEs.
8. Be **explicit about truncation**: `CAST AS INT` truncates; use explicit
rounding when rounding is intended. (May merge with rule 7.)
9. Distinguish **macro vs micro averages** based on the question's wording:
“average of per-group averages” = `AVG(group_metric)`; “overall/weighted
average” = `SUM(numerator)/SUM(denominator)`.
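
A short worked example of rules 7 to 9, again on Python's bundled `sqlite3` with invented data. It shows that the two averages are different numbers on the same table, and that truncating and rounding the same full-precision value disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE store_sales (region TEXT, revenue REAL, orders INTEGER);
    INSERT INTO store_sales VALUES
        ('A', 100.0, 10), ('A', 300.0, 10), ('B', 120.0, 3);
    """
)

# Macro: average of the per-region averages (each region counts once).
macro = conn.execute(
    """
    WITH per_region AS (
        SELECT region, SUM(revenue) / SUM(orders) AS avg_order_value
        FROM store_sales
        GROUP BY region
    )
    SELECT AVG(avg_order_value) FROM per_region
    """
).fetchone()[0]

# Micro: overall weighted average (each order counts once).
micro = conn.execute(
    "SELECT SUM(revenue) / SUM(orders) FROM store_sales"
).fetchone()[0]

# The value stays at full precision until this final projection, where
# CAST truncates toward zero and ROUND rounds to the nearest integer.
truncated, rounded = conn.execute(
    "SELECT CAST(? AS INTEGER), ROUND(?, 0)", (micro, micro)
).fetchone()
```

Region A averages 20 and region B averages 40, so the macro average is 30, while the micro average is 520 / 23 (about 22.61); truncation gives 22 and rounding gives 23.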
**Answer completeness / interpretation**
10. “top / highest / most / lowest” → return only the **winning row(s)** (keep the
top-ranked row via the window result), not the full ranked list, unless a list
is asked for. *(Phrase the mechanism dialect-agnostically — do not name
`QUALIFY`.)*
11. “for each X / per X / by X” → **exactly one row per X**; don't collapse to a
single value unless the question says “overall” or “total across X”.
12. When a question asks for inputs and a derived value (“X, Y, and their ratio”),
**include the inputs as columns** alongside the derived value.
13. When grouping by a human-readable label (a name), also **expose the entity's
identifier** — identity, not just the label, is part of the result (and
disambiguates duplicate names).
14. When a result is **unexpectedly empty, relax filters one at a time** to find
which predicate removed the rows. *Why:* this is the validation feedback loop
that turns a silent empty result into a diagnosable one.
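
The feedback loop in rule 14 can be sketched as a loop that drops one predicate at a time and counts what comes back. The helper name `relax_one_at_a_time` is hypothetical, and it treats the table name and predicates as trusted SQL fragments, which is acceptable for an analyst's own query but not for untrusted input.

```python
import sqlite3


def relax_one_at_a_time(conn, table, predicates):
    """For each predicate, count rows when only that predicate is dropped.

    A predicate whose removal turns an empty result into a non-empty one is
    the likely culprit.
    """
    counts = {}
    for i, dropped in enumerate(predicates):
        kept = [p for j, p in enumerate(predicates) if j != i]
        where = " AND ".join(kept) if kept else "1 = 1"
        counts[dropped] = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {where}"
        ).fetchone()[0]
    return counts


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, status TEXT, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 'Completed', '20240105'),
        (2, 'Completed', '20240220'),
        (3, 'Cancelled', '20240301');
    """
)
# Together these match nothing: the stored status is capitalised.
predicates = ["status = 'completed'", "order_date >= '20240101'"]
report = relax_one_at_a_time(conn, "orders", predicates)
```

Dropping the status predicate brings back three rows while dropping the date predicate still returns none, which points at the status filter; this is also the kind of mismatch that inspecting sample rows (rule 1) catches earlier.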
### 3. One worked example (dialect-agnostic)
Add **exactly one** compact before/after example to the skill, demonstrating the
**window-then-filter** rule (rule 6) — the subtlest and highest-value of the set.
It shows the wrong shape (filter inside, then rank) and the right shape (rank over
the full partition in a CTE, then filter to the top rank in the outer query),
using generic table/column names and standard SQL only (no `QUALIFY`, no
dialect functions). Keep it ~6-10 lines. Do not add a second example; the
existing three tool-orchestration examples stay as the primary example set.
*(Superseded by spec 09: the skill now carries a second `sql` worked example —
the multi-hop fan-out case — so the one-example constraint applies to spec 07's
window-then-filter example only.)*
### 4. Explicit exclusions
None of the following may appear in the skill (they are application/consumer
concerns, or actively wrong for live data):
- **Output-shape contracts** (“return a bare result set with exactly these
columns, no prose”). The skill is for interactive analysis and already favors
readable tables + summaries; a caller needing a strict shape specifies that
itself.
- **Anchoring relative time to `MAX(date)` of the data.** On a live database
“recent” / “past N months” means relative to *now*; `MAX(date)` anchoring is
only valid for static snapshots and must not be baked into the product.
- **Any advice justified by a grader, gold answer, or scoring comparator.**
- **Dialect-specific syntax** (deferred to the per-driver follow-up).
### 5. Coordination with spec 03
`03-multi-connection-routing-in-analytics-skill` also edits this same file (it
adds a connection-routing “step 0” to `<workflow>` and threads `connectionId`
through the tool calls). Spec 07's additions are **orthogonal**: they live in a
new `<sql_craft>` block and in step 5/6 pointers, and must not rewrite the
`<workflow>` routing or the `<rules>` `connectionId` scoping that spec 03 owns.
If both land, the result is one coherent skill: routing in `<workflow>`/`<rules>`,
SQL craft in `<sql_craft>`.
## Acceptance criteria
- The shipped `analytics/SKILL.md` contains all fourteen behaviors above, grouped
under the five sub-headings, each phrased as a heuristic with a generic
rationale.
- **Zero references** to any benchmark, gold answer, grader, or scoring
comparator anywhere in the skill.
- **Dialect-agnostic:** the skill contains no `QUALIFY`, no `strftime`/`julianday`,
no backtick/`DB.SCHEMA.TABLE` FQTN syntax, and no other single-dialect
construct — including in the worked example.
- The existing interactive guidance is intact: the `<workflow>` steps, the
`<rules>` (compact tables, summaries, clarification prompt, `connectionId`
scoping), and the three existing examples all still read correctly and were not
removed or contradicted.
- **None of the excluded items** (output-shape contract, `MAX(date)` anchoring of
“recent”, grader-driven advice, dialect syntax) appear.
- Exactly **one** new worked example is present, demonstrating window-then-filter,
in standard dialect-agnostic SQL. *(Superseded by spec 09, which adds a second
`sql` worked example for the multi-hop fan-out case; the shipped skill then
contains two worked examples and the content test asserts two `sql` fences.)*
- The craft is **inline in `SKILL.md`** — no bundled reference file is introduced,
and the skill still installs as a single file through `setup-agents.ts` for all
targets (Claude Code, Codex, Cursor, OpenCode, universal, Claude Desktop zip).
- The skill stays **scannable and within a reasonable size** (comfortably under
the 500-line budget).
- The frontmatter (`name`, `description`) is unchanged and still parses through
`SkillsRegistryService.parseFrontmatter`.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.
- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md`. Add the
`<sql_craft>` block; add one-line pointers in steps 5/6 and a discovery cue in
steps 2/4; add the single worked example. Keep `<workflow>`/`<rules>`/`<examples>`
otherwise intact.
- **Delivery (why inline is mandatory):** `packages/cli/src/setup-agents.ts`
(`readAnalyticsSkillContent`, `installTarget`, `writeClaudeDesktopSkillBundle`,
`plannedKtxAgentFiles`). Each target gets a single file derived from
`SKILL.md`; Cursor/OpenCode flatten to one rules file; Claude Desktop zips only
`ktx-analytics/SKILL.md`. No change to `setup-agents.ts` is required by this
spec — confirm the skill still installs unchanged.
- **Coordination:** `03-multi-connection-routing-in-analytics-skill` edits the
same file; keep the changes non-overlapping (see requirement 5).
- **Tests:** a content assertion over the shipped `analytics/SKILL.md` is the
right level (this is prompt content, not executable logic). Assert the skill
text contains the craft sub-headings / representative rule phrases, contains the
worked example, and contains none of the banned constructs: the literal tokens
`QUALIFY`/`strftime`/`julianday`, grader/benchmark words (`spider`, `benchmark`,
`gold`, `grader`), and — checked as a phrase, not a raw `MAX(` grep, since
`MAX()` is a legitimate aggregate — any instruction anchoring relative time
  (“recent”, “past N months”) to the data's maximum date. The existing
`SkillsRegistryService` frontmatter-parse test must still pass. The standalone
`ktx-dev` binary should be rebuilt/re-linked (`pnpm run build && pnpm run
link:dev`) so the playground picks up the updated skill.
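
The banned-construct half of that content assertion could look roughly like the sketch below. It is written in Python for illustration (the repo's own test suite is TypeScript), the function name is hypothetical, and the sentence-level `MAX(` guard is one possible reading of "checked as a phrase", not the shipped check.

```python
import re

BANNED_TOKENS = ["QUALIFY", "strftime", "julianday",
                 "spider", "benchmark", "gold", "grader"]
RELATIVE_TIME = re.compile(
    r"\b(recent|past \w+ (days|weeks|months|years))\b", re.I
)
MAX_CALL = re.compile(r"\bMAX\s*\(", re.I)


def banned_constructs(skill_text: str) -> list:
    """Return the banned constructs found in the skill text (empty if clean)."""
    found = [
        token for token in BANNED_TOKENS
        # Whole-word match, so "golden" does not trip the "gold" check.
        if re.search(rf"\b{re.escape(token)}\b", skill_text, re.I)
    ]
    # Phrase-level guard: flag MAX( only when the same sentence also talks
    # about relative time, since MAX() alone is a legitimate aggregate.
    for sentence in re.split(r"(?<=[.!?])\s+", skill_text):
        if RELATIVE_TIME.search(sentence) and MAX_CALL.search(sentence):
            found.append("relative time anchored to MAX(...)")
            break
    return found
```

A case-insensitive whole-word match will also flag the English verb "qualify", so a real test might match the SQL keyword case-sensitively; the sketch errs on the strict side.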
## Benchmark context (motivation only)
On the Spider 2.0-Lite sqlite subset the solver produced **0 execution errors but
~50 result mismatches**, and a large share traced to exactly these gaps:
premature rounding, string-vs-number compares, non-deterministic window ordering,
returning full lists for “top” questions, and dropping the inputs to derived
values. These are generic SQL-authoring defects — fixing them in the skill
improves ktx for every user querying a live database, and improving the benchmark
score is a side effect, not the goal. The skill itself must contain no trace of
the benchmark.
## Implementation notes
Implemented on branch `write-feature-spec-wiki`.
**What was built**
- Added a new `<sql_craft>` block to `packages/cli/src/skills/analytics/SKILL.md`
(sibling to `<workflow>`/`<rules>`, placed just before `<examples>`), with the
  five sub-headings (*Schema discovery before writing SQL*, *Composition*,
  *Window functions*, *Numeric precision*, *Answer completeness / interpretation*)
  and a one-line opener framing the bullets as heuristics-with-a-why.
- All fourteen behaviors are represented. Rules 7 and 8 (round-at-the-end /
truncation) are merged into one "Round only at the end" bullet, as the spec
permitted. Each bullet carries a generic analytics-engineering rationale; none
references a benchmark, grader, or gold answer.
- Exactly one worked example (a fenced `sql` block inside `<sql_craft>`)
demonstrates the window-then-filter rule, and incidentally the deterministic
tie-breaker: the *wrong* shape filters before the window; the *right* shape
ranks the full partition in a CTE, then filters in the outer query. Standard
SQL only — no `QUALIFY`, no dialect functions.
- Step pointers added without duplicating the rules: a schema-discovery cue in
steps 2 and 4, an authoring pointer in step 5, and a validation pointer in
step 6, each pointing into `<sql_craft>`.
- The existing `<workflow>` / `<rules>` / `<examples>` (compact tables,
summaries, clarification prompt, `connectionId` scoping, the three
orchestration examples) are unchanged. Delivery is unchanged: still a single
`SKILL.md` per target via `readAnalyticsSkillContent`; no bundled `reference/`
file was introduced.
**Tests** — added `packages/cli/test/skills/analytics-skill-content.test.ts`, a
content assertion over the source `SKILL.md`: the five sub-headings, a
representative phrase for each behavior, exactly one `sql` worked example, the
preserved interactive guidance, and the absence of banned constructs
(`QUALIFY` / `strftime` / `julianday`, `spider` / `benchmark` / `gold` /
`grader`, a backtick three-part FQTN, and a phrase-level guard against anchoring
relative time to a `MAX(...)` date). The existing `setup-agents.test.ts` content
assertions and the `SkillsRegistryService` frontmatter test still pass (77/77
across the three relevant files). Rebuilt and re-linked `ktx-dev`
(`pnpm run build && pnpm run link:dev`); the craft block is present in the
shipped `dist` asset.
**Deviations / notes**
- The worked example runs ~18 lines including comments rather than the spec's
"~6–10"; a faithful before/after with a CTE needs the extra lines, and the
skill stays well within budget (~117 lines total).
- `pnpm run type-check` currently reports one **pre-existing, unrelated** error
in `test/mcp-server-factory.test.ts` (MCP server deps typing), committed on
this branch ahead of `origin/main`. The src type-check and `pnpm run build`
are green; this change does not touch any MCP file.
- Per-dialect SQL syntax stays out of scope here (deferred to
`todo/08-per-dialect-sql-syntax-notes.md`), so the skill remains
dialect-agnostic. No dialect-tool pointer was added to `SKILL.md` yet — that
belongs with spec 08's channel so the skill never references a tool that does
not exist.

@ -1,395 +0,0 @@
# Per-dialect SQL syntax notes, served on demand and scoped to the connection
> Refined spec. Intake draft: `todo/08-per-dialect-sql-syntax-notes.md`. Companion
> to `specs/07-analytics-skill-sql-craft.md`, which kept the analytics SQL craft
> dialect-agnostic and explicitly deferred per-dialect syntax to this spec.
## Problem
Spec 07 added universal, **dialect-agnostic** SQL-authoring craft to the
`ktx-analytics` skill (`packages/cli/src/skills/analytics/SKILL.md`). That craft
deliberately excludes anything that reads correctly on only one engine — no
`QUALIFY`, no `strftime`/`julianday`, no backtick or `DB.SCHEMA.TABLE` FQTNs —
because the flat skill is installed verbatim and an agent querying sqlite must
never see Snowflake syntax.
But a large share of *real* correctness depends on exactly that excluded,
engine-specific syntax:
- **Snowflake:** `DATABASE.SCHEMA.TABLE` FQTNs, double-quoted case-sensitive
identifiers (unquoted folds to upper-case), VARIANT colon-paths
(`col:field.sub::type`), `QUALIFY`.
- **BigQuery:** backtick FQTNs (`` `project.dataset.table` ``), `_TABLE_SUFFIX`
for sharded/wildcard tables, `QUALIFY`, `JSON_VALUE`/`JSON_EXTRACT`.
- **sqlite:** `strftime`/`julianday`/`date()` for dates, no `QUALIFY`,
`json_extract`.
- and the remaining supported engines (`postgres`, `mysql`, `clickhouse`,
`sqlserver`/`tsql`), each with its own FQTN, quoting, date, top-N, and
JSON conventions.
This guidance is genuinely useful to an agent writing SQL against a live
database, but it must **not** pollute the flat dialect-agnostic skill. It belongs
in a **dialect-aware** channel, surfaced only for the dialect the active
connection actually uses, and selected from the project's own configured state —
not guessed, not shown all at once.
## Generic use case
Any **ktx** project whose connections span more than one warehouse engine — a
Snowflake warehouse plus a BigQuery export plus a local sqlite extract, say. When
the agent (or a human analyst the agent assists) writes SQL for a given
connection, it should receive *that engine's* syntax conventions — FQTN form,
identifier quoting, date functions, top-N idiom, semi-structured access — and
nothing for the engines it is not querying. The need is independent of any
benchmark: it is what "write correct SQL against this specific warehouse" requires
on every multi-engine stack.
## Model
The change adds a **dialect-aware channel** alongside spec 07's flat skill. The
following decisions are committed by this refinement; the implementer owns the
exact prose and code.
### Delivery: a dynamic MCP tool (decision committed)
The draft posed two delivery mechanisms and asked the refinement to "weigh them
before committing." This spec commits to **dynamic MCP delivery**: a new
read-only MCP tool returns the syntax notes for a given `connectionId`, with the
dialect resolved server-side from the connection's configured `driver`. The flat
skill gains a one-line pointer to that tool. **No install-mechanism change is
required.**
The alternative — **multi-file skill delivery** (bundle `reference/<dialect>.md`
files and point the skill at the matching one) — is **rejected** for **ktx**, for
reasons that hold regardless of how the skill is otherwise authored:
1. **It cannot scope on two of the six install targets.** Cursor
(`.cursor/rules/ktx-analytics.mdc`) and OpenCode
(`.opencode/commands/ktx-analytics.md`) are physically **single-file**;
`setup-agents.ts` flattens the skill to one file there. A bundled `reference/`
directory degenerates to "concatenate every dialect into one file," so a
sqlite agent would see Snowflake VARIANT syntax — **failing this spec's core
no-leak criterion on those targets**, and defeating progressive disclosure
(everything is in context at once). The MCP tool behaves **identically on all
six targets** because it is a tool call, not an installed file.
2. **Selecting the dialect is a deterministic operation, so it belongs in code,
not model judgment.** Anthropic's skill-authoring guidance explicitly says to
*"prefer scripts [tools] for deterministic operations."* With bundled files the
**model** must infer that connection X is Snowflake and open the right file —
and on a multi-connection project it can open the wrong one. With the tool, the
**server** resolves `driver → dialect` from `ktx.yaml` state and returns
exactly the right notes.
3. **It needs a delivery subsystem that the tool does not.** Multi-file delivery
requires reworking `readAnalyticsSkillContent`, `installTarget`,
`plannedKtxAgentFiles`, the install manifest (a directory variant),
`removeKtxAgentInstall`, and `writeClaudeDesktopSkillBundle`, plus a
concatenation transform for the single-file targets. The MCP tool requires one
read-only handler and one skill pointer.
4. **The dependency is free.** The `ktx-analytics` skill already hard-depends on
the **ktx** MCP server — its entire workflow is calling `discover_data`,
`entity_details`, `sql_execution`, and so on. Wherever the server is down, the
skill is already non-functional; the tool adds **no new dependency**.
5. **Dropping Cursor/OpenCode does not change this.** Removing those targets would
make multi-file delivery *possible*, but it would not make it better: reasons
2–4 stand, and the drop is a disproportionate cost (Cursor is a major target)
to neutralize a constraint the tool handles for free. Whether **ktx** supports
those targets is a separate product decision and is out of scope here.
This is consistent with Anthropic's progressive-disclosure goal — load the
relevant material on demand, at zero context cost until needed — which the tool
satisfies (its output costs context only when called) while resolving *which*
dialect from state rather than from a model guess. Reference:
[Skill authoring best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices).
### Scope derived from state, through the one existing resolver
Which dialect's notes the agent sees is **derived** from the connection's
configured `driver`, via the resolver the rest of the system already uses —
`sqlAnalysisDialectForDriver(driver)` in
`packages/cli/src/context/sql-analysis/dialect.ts`. The same function already
selects the dialect for `sql_execution`, `sl_query`, and the Python SQL-analysis
daemon. This spec **must not** introduce a second driver→dialect map. The notes
are **keyed by the resolved `SqlAnalysisDialect`** (so the SQL Server entry is
keyed `tsql`, not `sqlserver`), tying the note key-space to the resolver's
codomain so the two cannot drift.
### Authored per-engine notes are sanctioned static content
Enumerating syntax notes per engine is **not** a rotting denylist of bad
specifics; FQTN form and identifier quoting are genuine, stable invariants of each
engine — the kind of universal fact **ktx**'s design rules explicitly permit as
static content. What must stay derived-from-state is note *selection* (the active
dialect) and note *coverage* (every configured driver must resolve to notes that
exist), both of which this spec ties to the connector registry.
### The flat skill stays dialect-agnostic (spec 07 invariant preserved)
This work adds a *separate* channel. It does **not** amend spec 07's `<sql_craft>`
block or inline any dialect syntax into `SKILL.md`. Spec 07's acceptance criterion
— no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in the flat skill — stays
green. The only `SKILL.md` change is the pointer in requirement 3, which names the
tool and contains no dialect syntax.
## Requirements
### 1. A read-only `sql_dialect_notes` MCP tool
Register a new tool beside the existing context tools
(`packages/cli/src/context/mcp/context-tools.ts`). The tool name is the
implementer's to finalize but should follow the existing snake_case convention
(`entity_details`, `sql_execution`); `sql_dialect_notes` is the suggested name.
- **Input:** `{ connectionId }`, **required** — matching its siblings
`entity_details`/`sql_execution`, which always take an explicit connection.
- **Output:** `{ connectionId, dialect, notes }` where `dialect` is the resolved
`SqlAnalysisDialect` and `notes` is the markdown guidance for that dialect.
- **Resolution:** `connectionId → connection.driver →
sqlAnalysisDialectForDriver(driver) → notes[dialect]`, reusing the existing
resolver. Do not duplicate the driver→dialect map.
- **Guards:**
- A **non-SQL context-source** connection (driver `metabase`, `looker`,
`lookml`, `notion`, `dbt`, `metricflow`) returns a **clear "not a SQL
warehouse connection" error**, not postgres notes. Gate on the existing
`isDatabaseDriver()` (`packages/cli/src/connection-drivers.ts`).
- For any **SQL warehouse** connection the resolver always yields a dialect with
notes (all seven warehouse drivers are covered — requirement 2); its built-in
`postgres` default is a safety floor, so the tool never errors for a SQL
connection and never emits a single-engine dialect (e.g. Snowflake) by
accident.
- **Annotations:** read-only and idempotent, consistent with the other read
tools.
- **Description (docs-grade, third person, states what and when):** e.g.
*"Returns the SQL syntax conventions for a connection's dialect — FQTN form,
identifier quoting and case-folding, date/time functions, top-N idiom, and
semi-structured access. Use before authoring raw SQL against a connection so the
SQL matches that engine."* The description drives the agent's decision to call
the tool, so it must be specific.
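The resolution chain and guards above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the ktx implementation: only `SqlAnalysisDialect`, `sqlAnalysisDialectForDriver`, and `isDatabaseDriver` are names taken from this spec, and their bodies here are stand-ins. `ConnectionConfig`, `DialectNotesResult`, `WAREHOUSE_DRIVERS`, and `resolveDialectNotes` are hypothetical.

```ts
// Sketch only. Names quoted from the spec: SqlAnalysisDialect,
// sqlAnalysisDialectForDriver, isDatabaseDriver. Their bodies below are
// stand-ins; every other name is hypothetical.

type SqlAnalysisDialect =
  | "postgres" | "mysql" | "snowflake" | "bigquery"
  | "sqlite" | "clickhouse" | "tsql";

interface ConnectionConfig {
  id: string;
  driver: string;
}

interface DialectNotesResult {
  connectionId: string;
  dialect: SqlAnalysisDialect;
  notes: string;
}

// Stand-in for the warehouse driver registry.
const WAREHOUSE_DRIVERS: readonly string[] = [
  "postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "sqlserver",
];

// Stand-in for the registry gate.
function isDatabaseDriver(driver: string): boolean {
  return WAREHOUSE_DRIVERS.indexOf(driver) !== -1;
}

// Stand-in for the one existing resolver. Real code reuses it rather than
// adding a second driver-to-dialect map.
function sqlAnalysisDialectForDriver(driver: string): SqlAnalysisDialect {
  switch (driver) {
    case "mysql": return "mysql";
    case "snowflake": return "snowflake";
    case "bigquery": return "bigquery";
    case "sqlite": return "sqlite";
    case "clickhouse": return "clickhouse";
    case "sqlserver": return "tsql";
    default: return "postgres"; // safety floor
  }
}

// connectionId -> driver -> dialect -> notes, with the non-SQL guard first.
function resolveDialectNotes(
  connections: readonly ConnectionConfig[],
  notesByDialect: Record<SqlAnalysisDialect, string>,
  connectionId: string,
): DialectNotesResult {
  const connection = connections.filter((c) => c.id === connectionId)[0];
  if (!connection) {
    throw new Error(`Unknown connection: ${connectionId}`);
  }
  if (!isDatabaseDriver(connection.driver)) {
    throw new Error(`Connection ${connectionId} is not a SQL warehouse connection`);
  }
  const dialect = sqlAnalysisDialectForDriver(connection.driver);
  return { connectionId, dialect, notes: notesByDialect[dialect] };
}
```

Running the guard before the resolver matters in this sketch: because the resolver defaults to `postgres`, a `metabase` connection would otherwise silently receive postgres notes.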
### 2. Per-dialect note content
Author concise notes for each supported dialect against a **fixed rubric**, so
every dialect answers the same questions. Each facet is a line or two of timeless,
engine-true convention (no version-dated "as of vX" content), phrased as
guidance with the engine reason where it helps — inheriting spec 07's
heuristics-with-a-why tone. The rubric facets:
1. **FQTN form** — how to fully-qualify a table on this engine.
2. **Identifier quoting & case-folding** — quote character and how unquoted
identifiers fold.
3. **Date/time** — the engine's date functions and common date-encoding idioms.
4. **Top-N / window-filtering idiom**: `QUALIFY` where supported; a CTE +
outer-filter form where it is not; `TOP` for `tsql`.
5. **Semi-structured / JSON access** — VARIANT colon-paths, `JSON_VALUE`/
`JSON_EXTRACT`, `->`/`->>`, `json_extract`, as applicable.
6. **Sharded / partition idiom** where the engine has one (e.g. BigQuery
`_TABLE_SUFFIX`).
Constraints on the content:
- **Coverage = the reachable dialect set.** Every driver in the connector registry
must resolve to a dialect that has non-empty notes. The reachable set is
`postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, and
`tsql` (from `sqlserver`). Do **not** author notes for `duckdb`/`databricks`:
they appear in the resolver map but no connector can produce them, so they are
unreachable — matching the draft's "don't author for nonexistent drivers."
- **Keyed by `SqlAnalysisDialect`** (see Model).
- **Storage is the implementer's choice.** The notes MAY live as per-dialect
markdown files inside the package (e.g. under the skill's directory) served by
the tool, or as a typed map. If files are used they are **package-internal**
(served by the tool, never installed onto an agent target) and already ship via
the recursive `src/skills → dist/skills` copy
(`packages/cli/scripts/copy-runtime-assets.mjs`); no `setup-agents.ts` change.
- **No benchmark, gold-answer, grader, or scoring references** anywhere in the
notes.
The implementer must verify each engine's specifics against current official
documentation (the well-known anchors above are starting points, not a
substitute for checking the engine's docs).
### 3. The `SKILL.md` pointer (completes spec 07's deferral)
Add a **single one-line pointer** to the SQL-authoring step (step 4 "Plan" / step
5 "Query") of `packages/cli/src/skills/analytics/SKILL.md`, directing the agent to
call the tool before writing raw SQL against a connection — e.g. *"Before writing
raw `sql_execution` SQL, call `sql_dialect_notes` with the connection's id to get
that engine's syntax conventions."* This is the pointer spec 07 deliberately did
not add because the tool did not yet exist.
- The pointer **names the tool only**; it contains **no dialect syntax**, so the
flat skill stays dialect-agnostic.
- Follow the skill's existing tool-reference convention. The skill currently names
MCP tools by **bare** name (`discover_data`, `sql_execution`). Anthropic's
guidance recommends **fully-qualified** `ServerName:tool` names to avoid
"tool not found" when multiple MCP servers are present. Whether to fully-qualify
the new pointer (and optionally retrofit the existing bare references) is a
small, separable decision flagged for the maintainer — **not** a rename sweep
this spec mandates.
### 4. Coverage is enforced from state, not by hand
A test must **derive** the required coverage from the connector registry rather
than hardcoding a dialect list: enumerate the configured warehouse drivers
(`warehouseDrivers` in `driver-schemas.ts` / `KTX_DATABASE_DRIVER_IDS` in
`connection-drivers.ts`), resolve each through `sqlAnalysisDialectForDriver`, and
assert each result has non-empty notes. Adding a connector later then **fails this
test** until its dialect gets notes — the allowlist-from-state discipline, not a
hand-maintained list.
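A registry-derived coverage check along these lines might look like the sketch below. It is illustrative only: `driversMissingNotes` and its parameters are hypothetical, and a real test would pass in the registry tuple and `sqlAnalysisDialectForDriver` rather than the inline sample values shown in the comments.

```ts
// Hypothetical helper: given the registry's driver ids, a resolver, and the
// authored notes, return every driver whose resolved dialect lacks notes.
function driversMissingNotes(
  driverIds: readonly string[],
  resolveDialect: (driver: string) => string,
  notesByDialect: Readonly<Record<string, string | undefined>>,
): string[] {
  return driverIds.filter((driver) => {
    const notes = notesByDialect[resolveDialect(driver)];
    // Missing and whitespace-only notes both count as uncovered.
    return typeof notes !== "string" || notes.trim() === "";
  });
}

// Usage sketch with made-up sample data (not the real registry):
//   driversMissingNotes(["postgres", "sqlserver"], resolve, notes)
// returns [] when both resolved dialects have notes, so a test can assert
// that the result is empty and print the offending drivers when it is not.
```

The point of the shape is that the expected set is never written down in the test. One caveat worth keeping in mind: because the resolver described above falls back to `postgres`, a new connector only fails this check once the resolver maps it to a dialect of its own.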
### 5. No dialect syntax leaks into the flat skill
Spec 07's content assertion over `analytics/SKILL.md` stays green: the flat skill
(and its worked example) still contain no `QUALIFY`, `strftime`, `julianday`,
backtick/`DB.SCHEMA.TABLE` FQTN, or other single-engine construct. This spec adds
a tool and a tool-pointer; it does not move dialect syntax into the skill.
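As a rough illustration of what "stays green" means here, a leak scan over the skill text could be sketched as follows. The pattern list and the name `findDialectLeaks` are assumptions made for this example; the actual assertions live in spec 07's content test and may be phrased differently.

```ts
// Hypothetical leak scan: reports which single-engine constructs appear in a
// piece of skill text. The patterns are illustrative, not the real test's.
const BANNED_DIALECT_PATTERNS: ReadonlyArray<{ label: string; pattern: RegExp }> = [
  { label: "QUALIFY", pattern: /\bQUALIFY\b/i },
  { label: "strftime", pattern: /\bstrftime\b/i },
  { label: "julianday", pattern: /\bjulianday\b/i },
  // Matches a backtick-quoted three-part name such as a BigQuery FQTN.
  // Limitation: in Markdown it would also match an inline-code file name with
  // two dots, so a real test would likely scope this to SQL fences.
  { label: "backtick three-part FQTN", pattern: /`[\w-]+\.[\w-]+\.[\w-]+`/ },
];

function findDialectLeaks(skillText: string): string[] {
  return BANNED_DIALECT_PATTERNS
    .filter((entry) => entry.pattern.test(skillText))
    .map((entry) => entry.label);
}
```

An empty result for the flat skill, alongside a separate assertion that the `sql_dialect_notes` pointer is present, would cover both halves of this requirement.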
### 6. Delivery is unchanged
`setup-agents.ts` (`readAnalyticsSkillContent`, `installTarget`,
`writeClaudeDesktopSkillBundle`, `plannedKtxAgentFiles`) needs **no change**. The
skill still installs as a single `SKILL.md` per target. Confirm the channel works
on all six targets — Claude Code, Claude Desktop (zip), Codex, universal
`.agents`, Cursor (`.mdc`), OpenCode (`.md`) — by virtue of being a tool call,
including the single-file targets where multi-file delivery could not scope.
### 7. Coordination with specs 07 and 03
- **Spec 07** owns the dialect-agnostic `<sql_craft>` block. This spec must not
amend it; it adds the tool, the pointer, and the notes.
- **Spec 03** (`03-multi-connection-routing-in-analytics-skill`) threads
`connectionId` through the skill's tool calls. The `sql_dialect_notes` pointer
is `connectionId`-scoped and fits that routing; keep the pointer consistent with
spec 03's `connectionId` rules and do not rewrite the routing it owns.
## Acceptance criteria
- An agent querying a **sqlite** connection gets sqlite date idioms and **never**
sees Snowflake/BigQuery-only syntax; an agent querying **Snowflake** gets
FQTN / identifier / VARIANT guidance.
- The dialect shown is **derived from the connection's configured `driver`** via
the existing `sqlAnalysisDialectForDriver`, not hardcoded per project and not
guessed. No second driver→dialect map is introduced.
- **Every configured warehouse driver** (`postgres`, `mysql`, `snowflake`,
`bigquery`, `sqlite`, `clickhouse`, `sqlserver`) resolves to a dialect with
non-empty notes, and the coverage test derives this from the registry.
- A **non-SQL context-source** connection (e.g. `metabase`, `notion`) yields a
clear "not a SQL warehouse" response, **not** postgres notes.
- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are
unaffected. The new pointer references the tool only and adds no dialect syntax.
- The channel installs/serves correctly across **all six** agent targets,
including the single-file Cursor/OpenCode shape, with **no `setup-agents.ts`
change**.
- The notes contain **no** benchmark/gold/grader/scoring references and **no**
time-sensitive ("as of version X") content.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the design.
- **Dialect resolver (reuse, do not duplicate):**
`packages/cli/src/context/sql-analysis/dialect.ts`:
`sqlAnalysisDialectForDriver(driver)`, returning `SqlAnalysisDialect`
(`./ports.ts`), default `postgres`.
- **Connector registry (drives coverage):**
`packages/cli/src/connection-drivers.ts` (`KTX_DATABASE_DRIVER_IDS`,
`isDatabaseDriver`) and `packages/cli/src/context/project/driver-schemas.ts`
(`warehouseDrivers`, the per-driver `connectionConfigSchema`).
- **MCP tool registration:** `packages/cli/src/context/mcp/context-tools.ts`
(register beside `connection_list`, `entity_details`, `sql_execution`); the
`connectionId → driver → dialect` resolution already exists for `sql_execution`
in `packages/cli/src/context/mcp/local-project-ports.ts` — route the new tool
through the same path.
- **The skill (one-line pointer only):**
`packages/cli/src/skills/analytics/SKILL.md` — add the tool pointer in step 4/5;
leave `<workflow>`/`<rules>`/`<sql_craft>`/`<examples>` otherwise intact.
- **Note storage (if files):** under the skill directory, shipped by
`packages/cli/scripts/copy-runtime-assets.mjs`'s recursive copy; served by the
tool, never installed.
- **Delivery (confirm unchanged):** `packages/cli/src/setup-agents.ts`.
- **Tests:** unit tests for resolution (including `sqlserver → tsql`, unknown →
`postgres`, and non-warehouse rejection); a registry-derived coverage test
(requirement 4); a content test that each dialect's notes cover the rubric
facets and contain no banned tokens; and an extension of spec 07's
`analytics/SKILL.md` content test asserting the new pointer is present and the
flat skill is still dialect-clean. Rebuild and re-link the dev binary so the
playground picks up the change: `pnpm run build && pnpm run link:dev`.
## Benchmark context (motivation only)
The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake
(`DB.SCHEMA.TABLE` FQTNs, double-quoted lower-case columns, VARIANT colon-paths),
BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite
(`strftime`/`julianday`). That content is real and useful but engine-specific;
spec 07 kept it out of the flat skill and deferred it here so the dialect-agnostic
rules stay clean. Delivering it through a dialect-scoped **ktx** tool generalizes
the same correctness benefit to every multi-engine **ktx** project — improving the
benchmark score is a side effect, not the goal, and the shipped skill contains no
trace of the benchmark.
## Implementation notes
Implemented on branch `write-feature-spec-wiki`, alongside spec 07. The committed
decision (dynamic MCP delivery, not multi-file skill bundling) was implemented as
specified — no `setup-agents.ts` change.
**What was built**
- Per-dialect notes are markdown files under
`packages/cli/src/context/sql-analysis/dialects/<dialect>.md` (one each for
`postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, `tsql`),
served by `sqlDialectNotes(dialect)` in `sql-analysis/dialect-notes.ts` (lazy
read + cache, `postgres` fallback floor; the authored set is the
`DIALECTS_WITH_NOTES` const). `duckdb`/`databricks` are intentionally unauthored
(unreachable from any connector). Each note answers the fixed rubric — FQTN,
identifier quoting/case-folding, date/time, top-N/window idiom,
JSON/semi-structured, plus a sharded-table line for BigQuery. Engine specifics
were verified against current docs via Context7 (Snowflake VARIANT colon-paths
and unquoted→UPPER case-folding; BigQuery `_TABLE_SUFFIX`, `QUALIFY`,
`JSON_VALUE`; ClickHouse `LIMIT n BY` and `JSONExtract*`, with no `QUALIFY`). The
files are package-internal — `copy-runtime-assets.mjs` ships them to `dist`; they
are never installed onto an agent target.
- New read-only MCP tool `sql_dialect_notes` (`context-tools.ts`): input
`{ connectionId }` (required), output `{ connectionId, dialect, notes }`, read-only
+ idempotent annotations. It resolves through the **existing**
`connectionId → connection.driver → sqlAnalysisDialectForDriver` path (no second
driver→dialect map), implemented as the unconditional `dialectNotes` port in
`local-project-ports.ts` via an extracted `resolveDialectNotesForConnection`. A
non-SQL context source (gated by `isDatabaseDriver`) throws `KtxExpectedError`
("not a SQL warehouse"), not postgres notes — so the expected agent mistake stays
out of Error Tracking.
- `connection-drivers.ts`: `KTX_DATABASE_DRIVER_IDS` is now an exported (`@internal`)
readonly tuple so the coverage test derives required coverage from the registry;
`isDatabaseDriver` behavior is unchanged.
- `skills/analytics/SKILL.md`: a single dialect-agnostic pointer in step 5 ("call
`sql_dialect_notes` … to get that engine's FQTN, identifier-quoting, date, top-N,
and JSON conventions"). It names the tool only; spec 07's `<sql_craft>` block and
its dialect-clean content test are untouched.
**Tests**
- `test/context/mcp/dialect-notes.test.ts`: registry-derived coverage (a future
connector fails the test until its dialect has notes), the full rubric per dialect,
leak isolation (sqlite shows `strftime` and never `VARIANT`/`_TABLE_SUFFIX`;
`QUALIFY` only on snowflake/bigquery; engine-exclusive markers stay put), no
benchmark/grader or version-dated content, the postgres fallback, and
`resolveDialectNotesForConnection` resolving sqlite / snowflake / `sqlserver→tsql`
and rejecting a non-SQL source / unknown connection with `KtxExpectedError`; plus a
guard that the `DIALECTS_WITH_NOTES` const and the `dialects/*.md` files stay in sync.
- `test/context/mcp/server.test.ts`: `sql_dialect_notes` added to the retained tool
set + annotations assertion + a handler-routing test, and the regenerated
`__snapshots__/mcp-tools-list.json`.
- `test/skills/analytics-skill-content.test.ts`: asserts the new pointer is present
and the flat skill stays dialect-clean.
**Verification** — `tsc -p tsconfig.json` (src) clean; full default suite 393 files /
3001 passing; slow suite green (incl. `local-project-ports.test.ts`); all three
`dead-code` checks clean; the `dialects/*.md` files copy into `dist`. Rebuilt and
re-linked `ktx-dev`.
**Deviations / notes**
- Notes are stored as per-dialect markdown files (not a typed map, and not bundled
`reference/*.md` skill files), a storage choice the spec sanctions; plain markdown
is the easiest form to edit and maintain. They are served by the tool and ship via a
`copy-runtime-assets.mjs` entry (`src/context/sql-analysis/dialects → dist/…`); no
`setup-agents.ts` change.
- `pnpm run type-check` still reports one pre-existing, unrelated error in
`test/mcp-server-factory.test.ts` (committed in-flight MCP work on this branch);
this change adds zero new type errors and does not touch that file.

@ -1,362 +0,0 @@
# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill
> Refined spec. Intake draft: `todo/09-fan-out-safe-multi-hop-aggregation.md`.
> Extends spec 07 (`specs/07-analytics-skill-sql-craft.md`), which shipped the
> `<sql_craft>` block. Additive, content-only.
## Problem
The shipped `ktx-analytics` skill
(`packages/cli/src/skills/analytics/SKILL.md`) already carries a single-hop
fan-out rule in `<sql_craft>` → **Composition**:
> **Avoid fan-out joins.** Add columns only from tables already at the target
> grain, or pre-aggregate to that grain before joining. A join that multiplies
> rows quietly inflates every downstream `SUM`/`COUNT`.
In practice the agent honors that on a single join but still **silently
fans out on multi-hop join chains**, where the inflation is one or two joins
removed from the aggregate and therefore much harder to notice.
The failure shape: a measure that lives at a *coarse* grain (one row per parent
record) is counted/summed *after* the parent has been joined down to a *finer*
grain (one row per child line). Every parent-level value is then duplicated by
its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by a data-dependent
amount — runnable SQL, plausible-looking number, quietly wrong.
The rule today is stated only as a **prohibition** ("Avoid…"). It needs two
upgrades: (a) generalize it so the danger is understood as *cumulative across a
whole join chain*, not a single join; and (b) pair it with an **affirmative
verification habit** the agent runs while composing, so a grain change is
detected and fixed rather than merely warned against.
## Generic use case (independent of any benchmark)
An analyst on any production warehouse asks a counting/summing question whose
path runs through several one-to-many hops — e.g. *"how many orders per region
contain a returned item?"* where the path is `region → store → order →
order_line`. The honest answer counts each order once. The naïve join chain joins
`order_line` (to apply the line-level condition) and then counts orders, so an
order with three returned lines is counted three times. The inflation happens
**three joins below the `COUNT`**, where it is easy to miss. This is one of the
most common silently-wrong analytics mistakes on normalized schemas — not
specific to any dataset, dialect, or benchmark.
## Model (invariants — the implementer owns the prose)
These constrain the change; the exact wording is the implementer's. Each is
grounded in Anthropic's skill-authoring and prompt-engineering guidance so the
addition stays consistent with how spec 07 was written.
### Additive, inline-only, dialect-agnostic (inherited from spec 07)
The change is **additive content inside `skills/analytics/SKILL.md`** only — no
bundled `reference/*.md` file (the delivery path ships a single `SKILL.md` per
target; see spec 07 §Model "Inline-only delivery"). No new tool, flag, or config.
Every addition must read correctly on any dialect: **no** `QUALIFY`,
`strftime`/`julianday`, backtick/`DB.SCHEMA.TABLE` FQTNs, or other single-dialect
construct — including in the worked example. The existing `<workflow>`, `<rules>`,
`<examples>`, and the other four `<sql_craft>` sub-headings are preserved
unchanged.
### Heuristic-plus-*why*, because SQL authoring is a high-freedom task
Anthropic's "set appropriate degrees of freedom" guidance classifies tasks with
many valid approaches where decisions depend on context as **high freedom →
text-based heuristics**, the "open field, many paths" case (versus low-freedom,
fragile operations that need an exact script). SQL authoring is squarely
high-freedom. So the new content is phrased as **heuristics with a one-line,
universal rationale**, never as bare `ALWAYS`/`NEVER` imperatives — matching the
existing `<sql_craft>` style and Anthropic's "add context / explain why so Claude
generalizes" principle.
### Affirmative framing for the verification step (do, not don't)
Anthropic's prompt-engineering guidance is explicit: **"Tell Claude what to do
instead of what not to do."** The draft's requirement for "a detect-and-fix
*habit*, not just a prohibition" is the same principle. Therefore:
- The **generalized rule keeps the established `Avoid fan-out joins` lead and the
term `fan-out`** — it is spec 07's consistent terminology and the existing
content test references that phrase; reframing it would churn shared vocabulary
for no gain.
- The **new verification step is phrased affirmatively** (e.g. *"Verify the grain
holds across each join"*) — an action the agent performs while composing, not a
warning. The two together satisfy both principles: a recognized anti-pattern
name *and* a positive habit.
### One default with an escape hatch, not two equal options
Anthropic: **"Avoid offering too many options… provide a default with an escape
hatch."** The fix for an inflated aggregate is presented as exactly that:
- **Default: pre-aggregate the measure to its own grain in a CTE, then join the
already-aggregated result.** This is the single-hop fix generalized, and it is
the *only* correct fix for `SUM`/`AVG` — you cannot de-duplicate a summed
measure with `DISTINCT` (two legitimately-equal amounts would collapse).
- **Escape hatch: `COUNT(DISTINCT key)` — for a pure count only.** It rescues an
inflated count in one line, but must be stated as count-only, not as a general
remedy.
This is the deepest correctness point in the spec and the easiest to get wrong; a
naïve blanket "just use `COUNT(DISTINCT)`" is silently wrong for sums.
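The arithmetic behind that claim can be checked with a small in-memory illustration. This is not SQL and not the skill's worked example; the rows, the `OrderRow`/`OrderLineRow` shapes, and the helper names are invented for teaching, and the nested loop stands in for an inner join from orders to their returned lines.

```ts
// Synthetic data: two orders share the same amount, and order 1 has three
// returned lines, so the join fans it out three times.
interface OrderRow { orderId: number; amount: number }
interface OrderLineRow { orderId: number; returned: boolean }

const orderRows: OrderRow[] = [
  { orderId: 1, amount: 100 },
  { orderId: 2, amount: 100 },
  { orderId: 3, amount: 50 },
];
const orderLineRows: OrderLineRow[] = [
  { orderId: 1, returned: true },
  { orderId: 1, returned: true },
  { orderId: 1, returned: true },
  { orderId: 2, returned: true },
  { orderId: 3, returned: false },
];

const sumOf = (values: number[]): number => values.reduce((acc, v) => acc + v, 0);
const distinctOf = (values: number[]): number[] =>
  values.filter((v, i, all) => all.indexOf(v) === i);

// Naive shape: join orders down to returned lines, then aggregate.
const fannedOut: OrderRow[] = [];
for (const order of orderRows) {
  for (const line of orderLineRows) {
    if (line.orderId === order.orderId && line.returned) fannedOut.push(order);
  }
}

const naiveCount = fannedOut.length;                                     // 4 (true answer: 2)
const naiveSum = sumOf(fannedOut.map((o) => o.amount));                  // 400 (true answer: 200)
const distinctCount = distinctOf(fannedOut.map((o) => o.orderId)).length; // 2, the escape hatch works
const distinctSum = sumOf(distinctOf(fannedOut.map((o) => o.amount)));   // 100, equal amounts collapse

// Default fix: reduce the lines to order grain first, then join.
const ordersWithReturn = distinctOf(
  orderLineRows.filter((l) => l.returned).map((l) => l.orderId),
);
const matchedOrders = orderRows.filter((o) => ordersWithReturn.indexOf(o.orderId) !== -1);
const safeCount = matchedOrders.length;                                  // 2
const safeSum = sumOf(matchedOrders.map((o) => o.amount));               // 200
```

Only the pre-aggregated path gets both numbers right; the distinct-based path rescues the count but halves the sum.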
### Consistent terminology
Anthropic: **"Choose one term and use it throughout."** Reuse spec 07's existing
vocabulary verbatim — **`grain`**, **`fan-out`**, **`pre-aggregate`** — do not
introduce synonyms (e.g. do not rename the concept "row blow-up" or
"multiplication factor"). Prose may vary, but the named concepts stay fixed.
### Concise — the addition must justify its token cost
Anthropic: **"Concise is key… does this paragraph justify its token cost?"** and
"Claude is already very smart." The agent knows what a join and a `GROUP BY` are;
the addition explains only the non-obvious trap (cumulative grain inflation) and
shows the fix. Net addition is roughly one rewritten bullet, one new bullet, and
one worked example — the skill stays comfortably under the 500-line budget
(~117 lines today).
### Examples over descriptions — exactly one
Anthropic's "examples pattern": **"Examples help Claude understand the desired
style and level of detail more clearly than descriptions alone"** and
"examples are concrete, not abstract." The multishot guidance favors 3–5 examples
in general, but here **conciseness and spec 07's one-example-per-rule economy
win**: the skill already carries the window-then-filter example, so this adds
**exactly one** compact wrong-vs-right example. The wrong/right contrast inside
that single example supplies the diversity multishot calls for, at one example's
token cost.
### Leak-safety (hard constraint)
The worked example must be a **synthetic, generic schema invented for teaching**,
not the tables, column names, query, or numeric results of any Spider 2.0-Lite
question. It demonstrates the *pattern* (a coarse-grain measure aggregated after a
one-to-many join), which is universal and reconstructable from first principles. A
reviewer must find nothing in it that ties it to a specific benchmark instance.
See "Leak-safety" below.
## Requirements
All four land in the **Composition** sub-heading of `<sql_craft>` in
`packages/cli/src/skills/analytics/SKILL.md`. Structure (chosen design): rewrite
the existing fan-out bullet, add one affirmative verification bullet, add one
worked example. Do not touch the other four sub-headings or `<workflow>`/`<rules>`/
`<examples>`.
### 1. Generalize the fan-out rule to multi-hop chains
Rewrite the existing **`Avoid fan-out joins.`** bullet so it makes explicit that
the danger is **cumulative**: *any* one-to-many hop on the path between a measure's
owning table and the aggregate inflates that measure, **even when the offending
join is several hops away from the `SUM`/`COUNT`**. The fix is the same as the
single-hop case — **pre-aggregate the measure to its own grain in a CTE, then join
the already-aggregated result** — but the agent must apply it **per
measure-owning table along the whole chain**, not just at the final join. Keep the
`fan-out` term and the one-line *why*.
### 2. Add an affirmative grain-verification habit
Add a companion bullet, phrased as an action the agent performs **while
composing** (not a prohibition):
- Confirm that a join intended to be one-to-one / many-to-one **did not change the
grain** it aggregates at — e.g. check that the row count (or the count of the
aggregate's key) is unchanged across that join.
- When a join is genuinely one-to-many, **reach for the default fix
(pre-aggregate to grain)**; for a **pure count**, `COUNT(DISTINCT key)` is an
acceptable escape hatch.
- State the caveat once: **`SUM`/`AVG` of a fanned-out measure must pre-aggregate**,
because `DISTINCT` cannot de-duplicate a sum.
This is spec 07's "build incrementally and check each layer" discipline pointed
specifically at grain preservation, in affirmative form.
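
A minimal sketch of that habit, assuming an invented `customers` / `orders` / `order_lines` schema. SQLite (through Python's standard `sqlite3` module) is used only as a convenient in-memory engine, and the `scalar` helper and all table names are hypothetical, not part of the skill. It shows the row-count check across a many-to-one and a one-to-many join, and why `DISTINCT` cannot rescue a fanned-out `SUM`.

```python
import sqlite3

# Invented teaching schema; SQLite is only a convenient in-memory engine here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers   (customer_id INTEGER PRIMARY KEY);
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO customers   VALUES (1), (2);
INSERT INTO orders      VALUES (10, 1, 100), (11, 1, 100), (12, 2, 70);
INSERT INTO order_lines VALUES (1, 10), (2, 10), (3, 11), (4, 12);
""")


def scalar(sql):
    """Return the single value produced by a one-row, one-column query."""
    return conn.execute(sql).fetchone()[0]


# Grain check: the row count should be unchanged across a many-to-one join.
base_rows = scalar("SELECT COUNT(*) FROM orders")
after_many_to_one = scalar(
    "SELECT COUNT(*) FROM orders o JOIN customers c ON c.customer_id = o.customer_id")
after_one_to_many = scalar(
    "SELECT COUNT(*) FROM orders o JOIN order_lines l ON l.order_id = o.order_id")

# The one-to-many join inflates SUM, and DISTINCT does not repair it:
# two different orders share the amount 100, so DISTINCT drops a real value.
honest_total = scalar("SELECT SUM(amount) FROM orders")
fanned_total = scalar(
    "SELECT SUM(o.amount) FROM orders o JOIN order_lines l ON l.order_id = o.order_id")
distinct_total = scalar(
    "SELECT SUM(DISTINCT o.amount) FROM orders o JOIN order_lines l ON l.order_id = o.order_id")

# Default fix: collapse order_lines to one row per order before joining.
preagg_total = scalar("""
    WITH lines_per_order AS (
        SELECT order_id, COUNT(*) AS n_lines FROM order_lines GROUP BY order_id
    )
    SELECT SUM(o.amount)
    FROM orders o JOIN lines_per_order lp ON lp.order_id = o.order_id
""")

print(base_rows, after_many_to_one, after_one_to_many)
print(honest_total, fanned_total, distinct_total, preagg_total)
```

The changed row count after the `order_lines` join is the signal to reach for the pre-aggregate fix; only the pre-aggregated total matches the honest one.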
### 3. One concrete, generic multi-hop worked example
Add **exactly one** compact wrong-vs-right `sql` example inside `<sql_craft>`
demonstrating the multi-hop inflation and the pre-aggregate fix. It is the
**second** `sql` fence in the skill (the first is spec 07's window-then-filter
example).
**Required properties** (these are the constraints; the SQL below is orientation):
- **Multi-hop chain** where the inflating one-to-many hop is **≥1 join removed**
from the aggregate (not the single-hop case spec 07 already covers).
- **Unambiguous attribution**: each counted entity maps to **exactly one** group,
so the honest answer is well-defined. (This rules out "coarse measure attributed
to a fine dimension reached by descending," where one entity spans several
groups and the correct number is itself ambiguous — that would teach a murky
pattern.)
- **Motivated descent**: the finer-grain table is joined for a real reason (a
line-level filter or a needed line-level value), so the reader sees *why* the
fan-out join is there.
- **Plain `COUNT`/`SUM`**, not `AVG` — averaging collides with the existing
*Macro vs micro average* bullet and would muddy the fan-out lesson.
- The **RIGHT side demonstrates the default fix** (pre-aggregate to grain in a
CTE) and is **actually correct**, not merely runnable — its number must equal the
honest answer, not just avoid an error.
- Generic invented schema, standard dialect-agnostic SQL (no `QUALIFY`, no dialect
functions), no benchmark identifiers or values.
**Recommended sketch** (implementer may adjust within the properties above):
```sql
-- "How many orders per region contain a returned item?"
-- WRONG: joining order_lines to apply the line-level filter multiplies orders —
-- an order with two returned lines is counted twice, three joins below the COUNT.
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.status = 'returned'
GROUP BY r.region_id;
-- RIGHT: collapse order_lines to one row per qualifying order first, then join up.
WITH returned_orders AS (
SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN returned_orders ro ON ro.order_id = o.order_id
GROUP BY r.region_id;
-- A pure count could also use COUNT(DISTINCT o.order_id); a SUM/AVG of an
-- order-level measure fanned out this way must pre-aggregate — DISTINCT can't
-- de-duplicate a sum.
```
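
One way to check that the RIGHT side is "actually correct, not merely runnable" is to run both queries against a tiny dataset. The sketch below assumes invented rows for the same generic schema and uses Python's standard `sqlite3` module purely as a test bench; the data and variable names are hypothetical and only illustrate the inflation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions     (region_id INTEGER PRIMARY KEY);
CREATE TABLE stores      (store_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, store_id INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER, status TEXT);
INSERT INTO regions VALUES (1), (2);
INSERT INTO stores  VALUES (1, 1), (2, 2);
INSERT INTO orders  VALUES (100, 1), (101, 1), (102, 2);
INSERT INTO order_lines VALUES
    (1, 100, 'returned'), (2, 100, 'returned'),  -- order 100: two returned lines
    (3, 101, 'shipped'),                         -- order 101: nothing returned
    (4, 102, 'returned'), (5, 102, 'shipped');   -- order 102: one returned line
""")

JOINS = """
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
"""

# WRONG: the line-level join multiplies order 100 before the COUNT.
wrong = conn.execute(f"""
SELECT r.region_id, COUNT(*) AS n_orders
{JOINS}
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.status = 'returned'
GROUP BY r.region_id
ORDER BY r.region_id
""").fetchall()

# RIGHT: one row per qualifying order, then join up.
right = conn.execute(f"""
WITH returned_orders AS (
    SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
)
SELECT r.region_id, COUNT(*) AS n_orders
{JOINS}
JOIN returned_orders ro ON ro.order_id = o.order_id
GROUP BY r.region_id
ORDER BY r.region_id
""").fetchall()

# Escape hatch for a pure count only.
distinct = conn.execute(f"""
SELECT r.region_id, COUNT(DISTINCT o.order_id) AS n_orders
{JOINS}
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.status = 'returned'
GROUP BY r.region_id
ORDER BY r.region_id
""").fetchall()

print(wrong)     # region 1 counts order 100 twice
print(right)     # each qualifying order counted once
print(distinct)  # agrees with the pre-aggregated count
```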
### 4. Placement and structure
- Both bullets live under the existing **Composition** sub-heading; the example
follows them. The five-sub-heading structure spec 07 established is unchanged.
- **State each rule once** (Anthropic "consistent terminology / don't repeat"):
do not also restate the multi-hop rule in `<workflow>` steps 5/6 — those already
carry a one-line pointer into `<sql_craft>`, which is sufficient.
### 5. Coordination with spec 07 (supersession)
Spec 07's requirement 3 and acceptance criteria say the skill contains **exactly
one** worked example and "Do not add a second example." **This spec supersedes
that constraint**: the skill now carries **two** `sql` worked examples
(window-then-filter from spec 07, plus this multi-hop fan-out example). Annotate
spec 07 at those two spots with a one-line "superseded by spec 09" note so the two
permanent specs do not contradict. No other spec 07 content changes.
## Leak-safety (hard constraint on this spec and its example)
The benchmark's gold answers must never appear in ktx. The worked example must be
a **synthetic, generic schema invented for teaching** — not the tables, column
names, query, or numeric results of any Spider 2.0-Lite question. The example
demonstrates the *pattern* (a coarse-grain measure counted after a one-to-many
join), which is universal; it must be reconstructable from first principles by
anyone, with zero reference to benchmark data. A reviewer should be able to read
the example and find nothing that ties it to a specific benchmark instance.
## Acceptance criteria
- The `<sql_craft>` **Composition** section states the **multi-hop generalization**
of the fan-out rule (cumulative danger across the chain; pre-aggregate per
measure-owning table) and an **affirmative grain-verification habit**, inline and
dialect-agnostic.
- The fix is presented as **default (pre-aggregate to grain) + escape hatch
(`COUNT(DISTINCT key)`, count-only)**, with the explicit caveat that `SUM`/`AVG`
of a fanned-out measure must pre-aggregate.
- Exactly **one** new, **generic** worked example (wrong vs. pre-aggregated-right)
using an invented schema, with no benchmark-derived identifiers or values, whose
RIGHT side is actually correct (unambiguous attribution; honest number).
- The skill now contains **two** `sql` worked examples total; the existing content
test's fence-count assertion is updated `1 → 2` and new assertions cover the
multi-hop rule phrase and the grain-verification-habit phrase.
- Terminology is consistent with spec 07 (`grain`, `fan-out`, `pre-aggregate`); no
synonyms introduced.
- **No new tool, flag, or config.** Skill-content only; additive to spec 07.
- All spec 07 invariants still hold: the skill remains dialect-agnostic (no
`QUALIFY`/`strftime`/`julianday`, no backtick three-part FQTN, no relative-time
anchoring to a `MAX(...)` date) and free of any benchmark/grader/gold reference,
including in the new example; `<workflow>`/`<rules>`/`<examples>` and the other
four sub-headings are intact; frontmatter still parses through
`SkillsRegistryService.parseFrontmatter`; the skill stays under 500 lines.
- Spec 07's "exactly one example" constraint is annotated as superseded (no
contradiction between the two permanent specs).
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.
- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md` →
`<sql_craft>` → **Composition**. Rewrite the `Avoid fan-out joins` bullet, add
the affirmative grain-verification bullet, add the one worked example after them.
Leave the other four sub-headings, `<workflow>`, `<rules>`, and `<examples>`
unchanged.
- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`. Update the
"ships exactly one … worked example" test: `match(/```sql/g)` length `1 → 2`,
add an assertion for the new fan-out example's distinctive tokens (e.g.
`WITH returned_orders AS`), add the multi-hop-rule and grain-verification-habit
phrases to the behavior-presence list, and keep all banned-construct and
size-budget guards. This is a content assertion over the source `SKILL.md` — the
right level for prompt content.
- **Spec 07 annotation:** add a one-line "superseded by spec 09" note at spec 07's
requirement 3 and at its "Exactly one new worked example" acceptance bullet.
- **Rebuild/re-link** the dev binary so the playground picks up the change:
`pnpm run build && pnpm run link:dev` (provides `ktx-dev`).
## Benchmark context (motivation only)
Multi-hop aggregation questions (counting/averaging a coarse-grained measure
reached through several one-to-many joins) are a recurring source of
result-mismatch failures in the SQLite subset: the agent produces runnable SQL
with the right tables but a fan-out-inflated number. These are correctness
failures, not knowledge or schema-discovery failures (zero execution errors in the
latest run), so the fix belongs in the product's authoring craft — where it also
helps any real analyst — not in a benchmark-specific prompt. The skill itself must
contain no trace of the benchmark.
## Implementation notes
Shipped as specified — additive, content-only, no new tool/flag/config.
- **`packages/cli/src/skills/analytics/SKILL.md`** → `<sql_craft>` → **Composition**:
- Rewrote the `Avoid fan-out joins` bullet to `**Avoid fan-out joins — the
danger is cumulative.**`, generalizing to multi-hop chains: any one-to-many
hop between a measure's owning table and the aggregate inflates that measure
even when several hops below the `SUM`/`COUNT`; fix is pre-aggregate per
measure-owning table along the whole chain. Kept the `fan-out` term and the
one-line *why*.
- Added the affirmative `**Verify the grain holds across each join.**` bullet:
confirm a one-to-one / many-to-one join did not change the grain (row/key
count unchanged); default fix is pre-aggregate to grain, escape hatch is
`COUNT(DISTINCT key)` for a pure count only; stated once that `SUM`/`AVG` of a
fanned-out measure must pre-aggregate because `DISTINCT` cannot de-duplicate a
sum.
- Added one generic wrong-vs-right worked example (orders→regions via
stores/order_lines, `WITH returned_orders AS …`) — the second `sql` fence in
the skill. The inflating hop is three joins below the `COUNT`; the RIGHT side
pre-aggregates `order_lines` to one row per qualifying order so each order is
counted once (honest answer), and the trailing comment names the count-only
`COUNT(DISTINCT o.order_id)` escape hatch plus the `SUM`/`AVG` caveat. Invented
schema, dialect-agnostic SQL, no benchmark identifiers/values.
- The other four sub-headings and `<workflow>`/`<rules>`/`<examples>` are
untouched. Skill is 147 lines (well under the 500-line budget).
- **`packages/cli/test/skills/analytics-skill-content.test.ts`**: sql-fence count
`1 → 2`; added the multi-hop phrase (`the danger is cumulative`) and the
grain-verification phrase (`Verify the grain holds across each join`) to the
behavior-presence list; added new-example token assertions
(`WITH returned_orders AS`, `COUNT(DISTINCT o.order_id)`). All banned-construct,
relative-time, and size-budget guards retained. Test file passes (9/9).
- **Spec 07** annotated as superseded at requirement 3 and at its "exactly one
worked example" acceptance bullet — no contradiction between the two permanent
specs.
**Verification:** `vitest run test/skills/analytics-skill-content.test.ts` → 9
passed. `pnpm run build` (src `tsc -p tsconfig.json`) succeeds and the built
`dist/skills/analytics/SKILL.md` carries the new content; `pnpm run link:dev`
re-linked `ktx-dev`. A pre-existing, unrelated type error in
`test/mcp-server-factory.test.ts` (`KtxMcpContextPorts`/`context_tool`, last
touched in commit `2677b3ef`) surfaces under the full `type-check`'s
`tsconfig.test.json` pass; it is outside this change's surface and not introduced
here.

@ -1,289 +0,0 @@
# Panel/period completeness — emit the full set of groups, not only the populated ones
> Refined spec. Intake draft: `todo/10-panel-completeness-spine.md`.
## Problem
When a question asks for a result *per period* or *per category* ("orders for
each month of 2023", "revenue by region", "count per status"), a plain `GROUP BY`
only returns groups that actually have rows. Periods or categories with **zero**
activity silently vanish, so a "12 months" answer comes back with 9 rows and the
three that should read `0` are simply absent. The SQL is runnable and the
aggregate is right, but the **panel is incomplete** — and a monthly report with
missing months or a category breakdown missing its empty categories is wrong for
any analyst, on any database.
The existing `<sql_craft>` "Answer completeness / interpretation" group already
carries a *"For each X / per X / by X returns exactly one row per X"* rule, but
that rule only governs **grain** (don't collapse to a single value). It says
nothing about the **domain**: "one row per X" today means one row per *observed*
X, so empty groups still drop. This spec sharpens that rule from grain-only to
grain-and-completeness.
## Generic use case (independent of any benchmark)
"How many orders were placed in each month of 2023?" must return **12 rows** even
if March had no orders (March = 0), not 11. "Sales per region" should include
regions with no sales when the question asks for *each* region. Both are
bread-and-butter reporting for any analyst on any warehouse, with no benchmark in
sight.
## Model
The feature splits across **two surfaces**, each holding the half it is suited
for. This split is the central design decision and exists to satisfy spec 07's
hard dialect-agnostic invariant without weakening it.
### Why two surfaces (the dialect-agnostic reconciliation)
The draft asked for a *"recursive-CTE date spine"* worked example. But a real
date/number series is **inherently dialect-specific** — Postgres `generate_series`,
SQLite recursive `date(d,'+1 month')`, BigQuery `GENERATE_DATE_ARRAY`, Snowflake
`GENERATOR`+`DATEADD` — and spec 07 made `<sql_craft>` strictly dialect-agnostic
(the analytics-skill content test bans single-dialect constructs). Inlining a date
spine would violate that invariant; carving out a test exception would erode it.
ktx already has the canonical home for engine-specific syntax: the per-dialect
notes in `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by
the `sql_dialect_notes` MCP tool (spec 08). Those files answer a fixed rubric
(FQTN / Identifiers / Date-time / Top-N / JSON) — but **series/spine generation is
not in that rubric yet**. So the date-spine syntax belongs *there*, alongside the
other per-dialect idioms, and the dialect-agnostic skill points to it. This
routes the dialect-specific half through the existing channel rather than
standing up a parallel dialect-specific recipe inside the skill.
Surface 1 (skill) carries the **pattern**; surface 2 (dialect notes) carries the
**concrete series syntax**.
### Additive, inline, heuristic-with-a-why
Consistent with spec 07: the skill change is **additive content in one Markdown
file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/` file — the
delivery mechanism in `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic,
and phrased as a **heuristic with a one-line generic rationale**, not a wall of
MUSTs. The dialect-notes change is additive content in the seven existing
`dialects/*.md` files. No new tool, flag, or config on either surface.
## Requirements
### 1. Skill surface — `<sql_craft>` "Answer completeness / interpretation"
Add the panel-completeness rule to the existing group (it extends, and should sit
adjacent to, the *"For each X / per X / by X"* bullet). It must cover:
1. **Recognize the full-panel cue.** *each / every / all / per <period> / for all
<category> / by month* signals that the answer's row set should be the
**complete expected domain** of periods or categories in scope, not just those
present in the filtered fact rows. *Why:* a plain inner `GROUP BY` can only emit
groups that have at least one fact row.
2. **Spine → LEFT JOIN → COALESCE.** Build the full set of expected groups (the
**spine**), then LEFT JOIN the aggregated facts onto it:
- **Category/dimension spine:** the distinct values from the **domain-defining
dimension/entity table** (e.g. all regions from a `regions` table), *not*
`SELECT DISTINCT region FROM facts` — the latter yields only categories that
already occur, so a zero-activity category still drops. When no dimension
table exists, the distinct values from the **unfiltered** fact table are the
best available domain (with the residual caveat that a category which never
occurs at all cannot surface).
- **Period/number spine:** generate the series for the question's stated range
(e.g. each month of 2023 → Jan..Dec 2023). The series bounds come from the
question's explicit range; when the range is "all periods present," derive
bounds from `MIN`/`MAX` over the **unfiltered** facts. The concrete
series-generation syntax is per-dialect — the rule points the author to
`sql_dialect_notes` (see requirement 2) and shows no inline series SQL.
3. **COALESCE by measure additivity.** Default missing measures with
`COALESCE(metric, 0)` for **additive** measures (a `COUNT` or `SUM` of events
or amounts — "no activity" genuinely reads as 0). Leave **non-additive**
measures (`AVG`, a running balance, a price, a rate, a ratio) as **NULL**:
absence is "no data," and 0 would be a wrong reading. *Why:* 0 is a real value
only for additive measures.
4. **Don't over-apply (the each-vs-which guard).** When the question asks only
about groups that exist ("*which* months had orders", "regions that made a
sale"), the spine is unnecessary and wrong — emit only observed groups. The cue
is *each / all / every* (complete domain) vs *which / that have* (observed
subset).
5. **One worked example — the category spine, fully portable.** Add **exactly
one** compact before/after example demonstrating the pattern with a
**distinct-dimension spine**: the wrong shape (`GROUP BY` over facts, empty
groups missing) and the right shape (`SELECT DISTINCT` domain from the
dimension table → LEFT JOIN aggregated facts → `COALESCE(metric, 0)`). Generic
table/column names, standard SQL only — no series generation, no dialect
functions, so the example stays dialect-clean. The period-spine variant is
described in prose (requirement 2) and delegated to `sql_dialect_notes`; it
gets **no** inline example. This is the **third** worked `sql` example in the
skill (after spec 07's window-then-filter and spec 09's multi-hop fan-out).
6. **Step pointer, no duplication.** The validate/explain step (and/or the query
step) already points into `<sql_craft>` for answer-completeness; extend that
existing pointer's wording if needed, but state the rule **once** inside
`<sql_craft>`. The step-5 pointer that lists what `sql_dialect_notes` provides
("FQTN, identifier-quoting, date, top-N, and JSON conventions") should also
name the **series/calendar** convention now that it exists.
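
For orientation, the category-spine recipe and the additivity rule can be sketched together as follows. This assumes an invented `regions` / `orders` schema in which region 3 has no orders, and uses Python's standard `sqlite3` module only as a convenient engine; the names and data are hypothetical, and the wording of the skill's own example remains the implementer's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY);
CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, region_id INTEGER, amount REAL);
INSERT INTO regions VALUES (1), (2), (3);          -- region 3 has no orders
INSERT INTO orders  VALUES (1, 1, 10.0), (2, 1, 30.0), (3, 2, 5.0);
""")

# WRONG for "orders per region": only observed groups come back.
observed_only = conn.execute("""
SELECT region_id, COUNT(*) AS n_orders
FROM orders
GROUP BY region_id
ORDER BY region_id
""").fetchall()

# RIGHT: spine from the dimension table -> LEFT JOIN aggregated facts.
# COUNT is additive, so it defaults to 0; AVG is non-additive, so it stays NULL.
full_panel = conn.execute("""
WITH region_orders AS (
    SELECT region_id, COUNT(*) AS n_orders, AVG(amount) AS avg_amount
    FROM orders
    GROUP BY region_id
)
SELECT r.region_id,
       COALESCE(ro.n_orders, 0) AS n_orders,
       ro.avg_amount
FROM (SELECT DISTINCT region_id FROM regions) r
LEFT JOIN region_orders ro ON ro.region_id = r.region_id
ORDER BY r.region_id
""").fetchall()

print(observed_only)
print(full_panel)
```

The first query is still the right answer to a "which regions had orders" question; the spine applies only when the question asks for each region.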
### 2. Dialect-notes surface — `dialects/*.md`
Add a **"Series"** (date/number range) line to **each** of the seven authored
dialect files, giving that engine's idiomatic way to generate a contiguous
date or integer series for use as a spine. Each note is engine-exclusive — a
SQLite analyst gets the SQLite idiom and never another engine's construct, per the
existing dialect-notes leak guards. Orientation (exact syntax is the
implementer's):
- **postgres:** `generate_series('2023-01-01'::date, '2023-12-01'::date, interval '1 month')`.
- **sqlite:** recursive CTE — `WITH RECURSIVE m(d) AS (SELECT '2023-01-01' UNION ALL SELECT date(d,'+1 month') FROM m WHERE d < '2023-12-01')`.
- **bigquery:** `UNNEST(GENERATE_DATE_ARRAY('2023-01-01','2023-12-01', INTERVAL 1 MONTH))` (and `GENERATE_ARRAY` for integers).
- **snowflake:** `TABLE(GENERATOR(ROWCOUNT => n))` with `DATEADD('month', SEQ4(), start)`, or a recursive CTE.
- **mysql:** recursive CTE (8.0+) with `DATE_ADD(d, INTERVAL 1 MONTH)`.
- **clickhouse:** `numbers(n)` / `range(n)` with `addMonths(start, number)` (or `arrayJoin`).
- **tsql:** recursive CTE with `DATEADD(month, …)`, or a numbers/tally table.
This line is what makes the period spine usable from the dialect-agnostic skill,
and it is also consumed by **spec 11** (rolling-window-over-gappy-dates needs the
same date spine) — so it is foundational, not scope creep.
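
As an illustration of how one such Series line becomes a usable period spine, the sketch below runs the SQLite recursive-CTE idiom from the orientation list against an invented `orders` table. It assumes ISO-8601 text dates and Python's standard `sqlite3` module; the table, the data, and the half-open month join are hypothetical choices for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
INSERT INTO orders VALUES
    (1, '2023-01-15'), (2, '2023-01-20'), (3, '2023-02-03'), (4, '2023-12-31');
""")

rows = conn.execute("""
WITH RECURSIVE m(d) AS (
    SELECT '2023-01-01'
    UNION ALL
    SELECT date(d, '+1 month') FROM m WHERE d < '2023-12-01'
)
SELECT m.d AS month_start,
       COUNT(o.order_id) AS n_orders   -- counts only matched rows, so empty months read 0
FROM m
LEFT JOIN orders o
       ON o.order_date >= m.d
      AND o.order_date <  date(m.d, '+1 month')
GROUP BY m.d
ORDER BY m.d
""").fetchall()

for month_start, n_orders in rows:
    print(month_start, n_orders)
```

Because the spine drives the `FROM`, all twelve months appear even though only three have orders.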
### 3. Coordination with spec 11
Spec 11 (time-series window recipes) explicitly depends on this date spine for the
gappy-rolling case ("build a complete date spine first (see spec 10)"). Spec 10
establishes the spine concept in the Answer-completeness group and the
series syntax in the dialect notes; spec 11 reuses both from the Window-functions
group. Keep the two non-overlapping: spec 10 owns the spine; spec 11 references it.
## Leak-safety (hard constraint)
Any worked example or note must use a **synthetic generic schema** (e.g. an
`orders` table with an `order_date`, a `regions` dimension) and demonstrate only
the *pattern* (spine + LEFT JOIN + COALESCE). **No** benchmark table names, SQL,
or result values on either surface. The dialect-notes additions, like the existing
notes, carry no benchmark/grader/version-dated content. The behavior is
reconstructable from first principles and tied to no specific instance.
## Acceptance criteria
- `<sql_craft>` "Answer completeness / interpretation" states: the full-panel cue,
the spine → LEFT JOIN → COALESCE recipe, the additive-vs-non-additive COALESCE
discriminator (0 vs NULL), and the each-vs-which over-application guard —
inline, dialect-agnostic, each with a generic *why*.
- Exactly **one** new worked `sql` example is present, a portable
distinct-dimension spine (`SELECT DISTINCT` domain → LEFT JOIN → `COALESCE`),
with no series generation and no dialect-specific syntax. The skill then carries
**three** `sql` worked examples total.
- Each of the seven `dialects/*.md` files gains a **Series** (date/number range)
line in its engine's own idiom; no engine leaks another engine's construct, and
the additions contain no benchmark/grader/version-dated content.
- The skill remains dialect-clean: no `QUALIFY`, `strftime`, `julianday`,
`generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, or other
single-dialect construct anywhere in `SKILL.md`, including the new example.
- The existing interactive guidance (`<workflow>`, `<rules>`, the other examples)
and the existing dialect-note rubric lines are intact and uncontradicted.
- No grader/benchmark reference, no output-shape contract, and no anchoring of
*relative* time ("recent" / "past N months") to a `MAX(date)` over the data
appears (period-spine bounds derive from the question's explicit range or, for
"all periods present," from `MIN`/`MAX` over the facts — which is range
derivation, not relative-time anchoring).
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
still parses as `ktx-analytics`.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the
panel-completeness bullets to the Answer-completeness group, the single category
spine example, and extend the existing step pointer / dialect-notes provision
list to name the series convention. Leave `<workflow>`/`<rules>`/other examples
intact. Delivery is unchanged (single `SKILL.md` per target via
`readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change required.
- **Dialect notes:** the seven files under
`packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
`DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
`copy-runtime-assets.mjs` — no plumbing change, content only.
- **Tests:**
- `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
representative phrase for the completeness rule; bump the `sql`-fence count
assertion **2 → 3**; assert the spine + LEFT JOIN + `COALESCE` shape; the
existing dialect-clean guards already cover the no-inline-series requirement
(the example is `SELECT DISTINCT`, so they pass unchanged).
- `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the rubric loop
(the "answers the full rubric for every dialect" test) so every dialect must
also answer a **Series** line, e.g. `expect(notes).toMatch(/\*\*Series/)`.
Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
all seven without a hand-maintained list.
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
`pnpm run build && pnpm run link:dev`.
## Benchmark context (motivation only)
Per-period / per-category questions where some periods are empty produce
short-row result mismatches in the SQLite subset, and the related rolling/cumulative
cluster (spec 11) needs a complete date spine to be correct at all. The fix is a
universal reporting habit (complete panels) plus the per-dialect series syntax
that makes it executable — both belong in the product, where they help real
analysts. Improving the benchmark score is a side effect; the skill and the
dialect notes contain no trace of the benchmark.
## Implementation notes
Shipped on branch `write-feature-spec-wiki`. Content-only across two surfaces, no
new tool/flag/config, no plumbing change.
**Surface 1 — skill (`packages/cli/src/skills/analytics/SKILL.md`):**
- Added a **"Complete the panel for 'each / every / all / per <period or
category>'"** bullet to the `<sql_craft>` "Answer completeness / interpretation"
group, directly after the *"For each X / per X / by X"* bullet, with three
sub-bullets carrying the rest of the rule, each with its generic *why*: **Spine
source** (distinct domain from the dimension/entity table — not `SELECT DISTINCT`
over the facts; period/number series across the question's stated range, bounds
from `MIN`/`MAX` over the *unfiltered* facts for "all periods present"; series
syntax delegated to `sql_dialect_notes`), **Default by additivity**
(`COALESCE(metric, 0)` for additive measures, `NULL` for non-additive), and
**Don't over-apply** (the each-vs-which guard).
- Added **one** worked `sql` example at the end of the Answer-completeness group: a
portable distinct-dimension spine (`SELECT DISTINCT region_id FROM regions` →
`LEFT JOIN` aggregated facts → `COALESCE(ro.n_orders, 0)`), wrong-vs-right,
standard SQL only, no series generation, no dialect functions. The skill now
carries **three** `sql` worked examples.
- Extended the step-5 dialect-notes pointer to name the **series/calendar**
convention alongside FQTN / identifier-quoting / date / top-N / JSON.
- Delivery unchanged: `readAnalyticsSkillContent` in `setup-agents.ts` ships the
single `SKILL.md` per target — confirmed, no change.
**Surface 2 — dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`):**
- Added a `- **Series:**` line to all seven authored files (postgres, sqlite,
bigquery, snowflake, mysql, clickhouse, tsql), each in that engine's own idiom
(`generate_series`; recursive CTE with `date(d,'+1 month')`;
`UNNEST(GENERATE_DATE_ARRAY(...))`; `GENERATOR`/`SEQ4`/`DATEADD`; recursive CTE
with `DATE_ADD`; `numbers(n)`/`addMonths`; recursive CTE with `DATEADD` +
`MAXRECURSION`), placed right after each file's Date/time line. No cross-engine
leak, no version-dated/benchmark content. Shipped to `dist` unchanged by
`copy-runtime-assets.mjs`; coverage stays derived from `DIALECTS_WITH_NOTES`.
**Tests:**
- `test/skills/analytics-skill-content.test.ts`: added the `Complete the panel`
and `Default by additivity` phrases; renamed the worked-examples test and bumped
the `sql`-fence count **2 → 3**; asserted the spine + `LEFT JOIN` + `COALESCE`
shape. Also added `generate_series` and `GENERATE_DATE_ARRAY` to the
dialect-clean banned list — a deliberate **strengthening** beyond the spec's
test orientation so the "no inline series" acceptance criterion is *enforced*,
not merely incidentally true of a `SELECT DISTINCT` example.
- `test/context/mcp/dialect-notes.test.ts`: extended the "answers the full rubric
for every dialect" loop with `expect(notes).toMatch(/\*\*Series/)`, so all seven
dialects are required to answer a Series line (coverage derived from
`DIALECTS_WITH_NOTES`, no hand-maintained list).
**Verification:** both affected test files pass (19 tests). `src` type-check and
`pnpm run build` are clean, and `copy-runtime-assets.mjs` placed the Series line in
all seven `dist` dialect files; `pnpm run link:dev` re-linked `ktx-dev`. Note: an
unrelated, pre-existing `tsconfig.test.json` type error in
`test/mcp-server-factory.test.ts` exists on this branch — untouched by this work
and outside its scope.
**Coordination with spec 11:** the per-dialect Series line is the foundational
date spine that spec 11 (rolling/cumulative windows over gappy dates) references.
Spec 10 owns the spine (Answer-completeness group + dialect Series notes); spec 11
will reference it from the Window-functions group. No overlap introduced.

@ -1,391 +0,0 @@
# Time-series window craft — running totals, rolling-over-time (min-periods), period-over-period
> Refined spec. Intake draft: `todo/11-time-series-window-recipes.md`.
## Problem
A large share of analytics questions are time-series shaped: a **running /
cumulative balance**, a **rolling N-day average**, or **period-over-period
growth**. The agent already knows window functions exist — spec 07 gave the
`<sql_craft>` "Window functions" group its determinism and window-then-filter
rules, and spec 10 added panel/period completeness — but it still gets the
*time-series specifics* wrong:
- a cumulative balance computed **without an explicit unbounded-preceding
frame**, or with the implicit frame misbehaving when there are **ties on the
order key**;
- "rolling 30 days" implemented as `ROWS BETWEEN 29 PRECEDING` over **gappy**
daily data, so the window spans the wrong calendar span when days are missing;
- no **minimum-periods** handling — a rolling average reported before the window
is actually full;
- "growth vs the previous period" written **without `LAG`** (or against the wrong
neighbor), with an **unguarded** `(cur - prev) / prev` that breaks on a zero or
absent prior.
These are runnable-but-wrong: the structure is close, the edge case diverges.
It is the same failure shape spec 07 addressed at the general level; this spec
adds the time-series specifics to the **same Window-functions group**, building
on the rules already there rather than restating them.
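
The first failure above (ties on the order key under the implicit frame) is easy to reproduce. A minimal sketch, assuming an invented `txns` table with two transactions on the same day and Python's standard `sqlite3` module as the engine; names and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txns (txn_id INTEGER PRIMARY KEY, txn_day INTEGER, amount INTEGER);
INSERT INTO txns VALUES (1, 1, 10), (2, 1, 20), (3, 2, 5);  -- txns 1 and 2 tie on txn_day
""")

# Implicit frame: with ORDER BY and no frame clause, the default is a RANGE frame
# ending at the current row's peers, so tied rows share one running value.
implicit = conn.execute("""
SELECT txn_id, SUM(amount) OVER (ORDER BY txn_day) AS running
FROM txns
ORDER BY txn_id
""").fetchall()

# Explicit ROWS frame plus a complete tie-breaker: one step per row.
explicit = conn.execute("""
SELECT txn_id,
       SUM(amount) OVER (
           ORDER BY txn_day, txn_id
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running
FROM txns
ORDER BY txn_id
""").fetchall()

print(implicit)
print(explicit)
```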
## Generic use case (independent of any benchmark)
- "Each account's month-end running balance over 2023" — a cumulative sum of
monthly net over an ordered window.
- "30-day rolling average of daily revenue, only once 30 days of history exist."
- "Month-over-month revenue growth rate."
All three are bread-and-butter for any analyst on any time-series table, with no
benchmark in sight. The methodology is universal analyst craft, so it belongs in
the shipped skill — it transfers to every ktx user querying a live database.
## Model
The change is **additive content across two surfaces** — the same split spec 10
made, and for the same reason. The split is the central design decision; it
satisfies spec 07's hard dialect-agnostic invariant for `<sql_craft>` without
weakening it.
### Why two surfaces (the dialect-agnostic reconciliation)
Two of the three recipes are **pure standard SQL** and stay entirely in the
dialect-agnostic skill:
- **Cumulative / running total**: `SUM(x) OVER (... ROWS BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW)` is standard on every engine.
- **Period-over-period**: `LAG(metric) OVER (...)`, the growth ratio, and a
`NULLIF`-style divide-by-zero guard are standard on every engine.
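
A minimal sketch of both standard-SQL recipes in one query, assuming an invented `monthly_revenue` table with one row per month (so the order key has no ties) and Python's standard `sqlite3` module as the engine. The zero-revenue month is there deliberately to exercise the guard; names and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly_revenue (month TEXT PRIMARY KEY, revenue REAL);
INSERT INTO monthly_revenue VALUES
    ('2023-01', 100.0), ('2023-02', 0.0), ('2023-03', 50.0), ('2023-04', 75.0);
""")

rows = conn.execute("""
SELECT month,
       -- running total with an explicit unbounded-preceding frame
       SUM(revenue) OVER (
           ORDER BY month
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total,
       -- growth vs the previous month; NULLIF turns a zero prior into NULL
       (revenue - LAG(revenue) OVER (ORDER BY month))
           / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) AS growth
FROM monthly_revenue
ORDER BY month
""").fetchall()

for row in rows:
    print(row)
```

Growth is NULL for the first month (no prior) and for the month after a zero prior, instead of raising a divide-by-zero error or reporting a misleading number.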
The third recipe — a **rolling window over calendar time** — has one piece that
is genuinely dialect-divergent: the **calendar-range window frame**. A native
range frame such as `RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`
exists on some engines (e.g. postgres, mysql 8) but **not others** — sqlite has
no date-interval range frame, and SQL Server has **no offset `RANGE` frames at
all**; bigquery's `RANGE` frames are numeric-only. So a portable skill cannot
inline a range frame any more than it could inline a date-series generator.
ktx already routes that kind of engine-specific syntax through the per-dialect
notes in `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by
the `sql_dialect_notes` MCP tool (spec 08). Spec 10 established the precedent
exactly: series/spine generation was not in the dialect rubric, so it was added
there (the **Series** line) and the dialect-agnostic skill points to it.
Rolling-window framing is the next construct in that same position — not in the
rubric yet, dialect-specific — so the **rolling-window idiom belongs in the
dialect notes**, and the skill points to it.
Surface 1 (skill) carries the **pattern** (calendar range, not a row count; the
min-periods guard; the spine-or-range choice). Surface 2 (dialect notes) carries
the **concrete rolling-window frame syntax** per engine.
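To make the "calendar range, not a row count" distinction concrete, here is a minimal sketch run against an in-memory SQLite database. The `daily_revenue(day_no, amount)` table and the integer `day_no` key are assumptions chosen to keep the date arithmetic trivial; a real table would carry a date column and use the engine's own idiom from `sql_dialect_notes`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_revenue (day_no INTEGER PRIMARY KEY, amount REAL);
-- Days 1, 2, 3 and 10 have data; days 4-9 are missing (hypothetical data).
INSERT INTO daily_revenue VALUES (1, 10), (2, 20), (3, 30), (10, 40);
""")

# Row-count frame: "2 rows back" silently reaches across the gap.
rows_frame = conn.execute("""
    SELECT day_no,
           SUM(amount) OVER (ORDER BY day_no
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    FROM daily_revenue
    ORDER BY day_no
""").fetchall()

# Date-keyed self-join: only rows inside the 3-day calendar span are summed.
calendar_span = conn.execute("""
    SELECT cur.day_no, SUM(prev.amount)
    FROM daily_revenue AS cur
    JOIN daily_revenue AS prev
      ON prev.day_no BETWEEN cur.day_no - 2 AND cur.day_no
    GROUP BY cur.day_no
    ORDER BY cur.day_no
""").fetchall()

print(rows_frame[-1])     # day 10 includes days 2 and 3
print(calendar_span[-1])  # day 10 includes only day 10
```

Both queries agree on days 1 to 3 and diverge only on day 10, which is why the row-count version looks plausible until the data has gaps.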
### Additive, inline, heuristic-with-a-why
Consistent with specs 07 and 10: the skill change is **additive content in one
Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/`
file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, and phrased as
**heuristics with a one-line generic rationale**, not a wall of MUSTs. The
dialect-notes change is additive content in the seven existing `dialects/*.md`
files. No new tool, flag, or config on either surface.
### Build on the rules already present; do not restate them
The Window-functions group already carries **"Make the ordering deterministic"**
(complete tie-breaker) from spec 07, and the Numeric-precision group carries
**"Round only at the end."** The cumulative and period-over-period recipes
**reference** these rather than repeat them (state each rule once — Anthropic's
"consistent terminology / don't repeat" guidance, already followed in spec 07).
Spec 10's **Series** dialect line is likewise **referenced** by the rolling
recipe's spine fallback, not duplicated.
## Requirements
### 1. Skill surface — `<sql_craft>` "Window functions" group (three recipes)
Add three recipes to the **existing** "Window functions" group, after its two
current bullets (deterministic ordering; filter-after-the-window). Each is a
heuristic with a generic *why*, dialect-agnostic.
1. **Cumulative / running total.** Use an **explicit frame** — `SUM(x) OVER
(PARTITION BY k ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` —
with a **complete tie-breaker** on the `ORDER BY` (per the group's existing
deterministic-ordering rule; reference it, do not restate). *Why:* a bare
`ORDER BY` defaults to a `RANGE … CURRENT ROW` frame, which on **ties in the
order key** folds every tied peer into the same cumulative value — it runs and
looks plausible, but the running total jumps at each tie boundary.
2. **Rolling window over calendar time, plus minimum periods.** "Rolling N
days/months" must span a **calendar range**, not a fixed row count: a `ROWS
BETWEEN n-1 PRECEDING` frame silently measures the wrong span when days are
missing. Two sanctioned techniques:
- **Spine + `ROWS` (portable).** Build a gap-free date spine first (spec 10's
**Series**, via `sql_dialect_notes`) so the data has one row per calendar
unit; then a `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` frame equals the
intended calendar span. This path is fully dialect-agnostic.
- **Native range frame or date-keyed self-join (engine-specific).** Where the
engine supports it, a calendar **range frame** expresses the window directly;
otherwise a self-join keyed on the date does. Both use engine-specific
syntax — get the **rolling-window** idiom from `sql_dialect_notes` (see
requirement 3); show no inline range frame in the skill.
**Minimum periods.** When the question says "only after N periods of data" (or
a rolling metric implies it), emit `NULL` / skip until the window is actually
full — guard on a window count, e.g. `COUNT(*) OVER (<same frame>) = N`. On a
gap-free spine, `COUNT(*)` counts calendar slots; count the **non-null
observations** instead when "N periods" means N data points rather than N
calendar units. *Why:* a row-count frame over missing dates measures the wrong
span, and a partial early window is not the requested metric.
3. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)`
for the prior-period comparison; compute growth as `(cur - prev) / prev` at
**full precision**, rounding only in the final projection (per the existing
"Round only at the end" rule), and **guard divide-by-zero / NULL prev**
(e.g. divide by `NULLIF(prev, 0)`). *Why:* without `LAG` — or ordered against
the wrong neighbor — the comparison lands on the wrong period, and an unguarded
ratio errors or returns garbage when the prior period is zero or absent.
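As an illustration of the period-over-period recipe, the sketch below computes month-over-month growth with `LAG` and a `NULLIF` guard. The `monthly_revenue(account_id, period, revenue)` table is a hypothetical schema, and SQLite is used only because it is easy to run in memory; the SQL itself sticks to standard constructs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly_revenue (account_id INTEGER, period TEXT, revenue REAL);
-- Hypothetical data: the first month has no prior, the second has a zero prior.
INSERT INTO monthly_revenue VALUES
  (1, '2023-01', 0), (1, '2023-02', 50), (1, '2023-03', 75);
""")

growth = conn.execute("""
    WITH with_prev AS (
        SELECT account_id, period, revenue,
               LAG(revenue) OVER (PARTITION BY account_id
                                  ORDER BY period) AS prev
        FROM monthly_revenue
    )
    SELECT period,
           -- Full-precision ratio; rounding happens only in this final projection.
           ROUND((revenue - prev) / NULLIF(prev, 0), 4) AS growth
    FROM with_prev
    ORDER BY period
""").fetchall()

for row in growth:
    print(row)
```

The absent prior (first month) and the zero prior (second month) both surface as `NULL` rather than an error or a misleading number; only the third month yields a ratio.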
**Step pointer (no duplication).** The step-5 `sql_dialect_notes` provision list
(currently "FQTN, identifier-quoting, date, top-N, series/calendar, and JSON
conventions") should also name the **rolling-window** convention now that it
exists. State each rule once inside `<sql_craft>`; the workflow steps only point
to it.
### 2. One worked example — cumulative running total (dialect-agnostic)
Add **exactly one** new compact before/after `sql` example, demonstrating the
**cumulative running total** — the subtlest of the three (the implicit-frame trap
runs fine and is wrong only at tie boundaries) and the highest-value to show.
Use a synthetic generic schema (e.g. `account_txns(account_id, txn_date, net)`):
- **Wrong:** `SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date)` — the
implicit `RANGE` frame makes two txns on the same date share one inflated
running balance.
- **Right:** the same with an explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW` frame and a complete tie-breaker (`ORDER BY txn_date, txn_id`).
Standard SQL only — no `QUALIFY`, no dialect functions, no series generation, no
`RANGE … INTERVAL`. Keep it ~10–14 lines. The **rolling-over-time** recipe gets
**no** inline example (its correct form needs the engine-specific frame/spine,
delegated to `sql_dialect_notes`, exactly as spec 10's period-spine variant was
prose-only); the **period-over-period** recipe is self-evident from its bullet
and also gets no example. This is the **fourth** worked `sql` example in the
skill, after spec 07 (window-then-filter), spec 09 (multi-hop fan-out), and
spec 10 (panel-completeness spine).
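The wrong-vs-right pair described above can be reproduced with a small sketch. This is not the example shipped in `SKILL.md`; it only assumes the synthetic `account_txns` schema named in this section, extended with the `txn_id` column that the tie-breaker needs, and runs the two queries on an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_txns (
    account_id INTEGER, txn_id INTEGER, txn_date TEXT, net REAL
);
-- Hypothetical data: two transactions share the same date.
INSERT INTO account_txns VALUES
  (1, 1, '2023-01-05', 100),
  (1, 2, '2023-01-05', 50),
  (1, 3, '2023-01-06', -30);
""")

# Wrong: the implicit RANGE frame gives tied rows the same running balance.
implicit_frame = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date)
    FROM account_txns
    ORDER BY txn_id
""").fetchall()

# Right: explicit ROWS frame plus a complete tie-breaker.
explicit_frame = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (PARTITION BY account_id
                          ORDER BY txn_date, txn_id
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM account_txns
    ORDER BY txn_id
""").fetchall()

print(implicit_frame)
print(explicit_frame)
```

The two results differ only on the first of the tied rows, which is exactly why the implicit-frame version passes a casual check.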
### 3. Dialect-notes surface — `dialects/*.md` (rolling window)
Add a **rolling-window-over-time** idiom line to **each** of the seven authored
dialect files, parallel to spec 10's **Series** line. Each note is
engine-exclusive — a SQLite analyst gets the SQLite idiom and never another
engine's construct, per the existing dialect-notes leak guards. Each note either
gives the engine's native calendar-range frame **or** references its own
**Series** line for the spine + `ROWS` fallback (a cross-reference within the
file, not a duplicate of the Series line).
Orientation only — **`RANGE`-frame support genuinely varies by engine and
version, so the implementer must verify each engine's current support against
authoritative docs (context7 / the engine's manual) rather than assert it from
memory.** Starting points:
- **postgres:** native — `... OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days'
PRECEDING AND CURRENT ROW)`.
- **mysql (8.0+):** native — `RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT
ROW` over a temporal order key.
- **bigquery:** `RANGE` frames are **numeric** — range over an integer day key
(e.g. `UNIX_DATE(day)`) with `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or
build a spine (see **Series**) and use a `ROWS` frame.
- **sqlite:** **no** date-interval range frame — build a date spine (see
**Series**) and use a `ROWS` frame.
- **tsql (SQL Server):** **no** offset `RANGE` frames at all — build a spine (see
**Series**) and use a `ROWS` frame, or a date-keyed self-join.
- **snowflake / clickhouse:** range-frame support over dates is limited — verify;
default to a spine (see **Series**) + `ROWS` frame where a native calendar range
frame is unavailable.
This line is what makes the rolling-over-time recipe executable from the
dialect-agnostic skill. It is **distinct** from spec 10's Series line (Series =
how to *generate* a spine; Rolling window = how to compute a *moving
calendar-range aggregate*, natively or via that spine), and it cross-references
the Series line rather than overlapping it.
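For an engine without a calendar range frame, the spine + `ROWS` fallback with a minimum-periods guard might look like the sketch below. It assumes SQLite, a recursive-CTE spine (the actual **Series** idiom in the shipped notes may differ), a hypothetical `daily_revenue(day, amount)` table, an explicit date range taken from the question, and an additive measure, so a missing day counts as 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL);
-- Hypothetical data: 2023-01-03 is missing.
INSERT INTO daily_revenue VALUES
  ('2023-01-01', 10), ('2023-01-02', 20),
  ('2023-01-04', 40), ('2023-01-05', 50);
""")

rolling = conn.execute("""
    WITH RECURSIVE spine(day) AS (
        SELECT '2023-01-01'
        UNION ALL
        SELECT date(day, '+1 day') FROM spine WHERE day < '2023-01-05'
    ),
    filled AS (
        -- One row per calendar day; a missing day contributes 0 revenue.
        SELECT s.day, COALESCE(r.amount, 0) AS amount
        FROM spine AS s
        LEFT JOIN daily_revenue AS r ON r.day = s.day
    ),
    windowed AS (
        SELECT day,
               SUM(amount) OVER (ORDER BY day
                                 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sum3,
               COUNT(*) OVER (ORDER BY day
                              ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS slots
        FROM filled
    )
    -- Minimum periods: report NULL until the 3-day window is actually full.
    SELECT day, CASE WHEN slots = 3 THEN sum3 END AS rolling_3d
    FROM windowed
    ORDER BY day
""").fetchall()

for row in rolling:
    print(row)
```

Because the spine guarantees one row per day, the `ROWS` frame spans exactly three calendar days, and `COUNT(*)` over the same frame counts calendar slots rather than observations.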
### 4. Explicit constraints / exclusions
None of the following may appear (consistent with specs 07 and 10):
- **No inline dialect-specific range-frame syntax in the skill** — no
`RANGE … INTERVAL` frame, no series generator, no dialect function. The skill
stays dialect-clean; the range frame lives only in the dialect notes.
- **No anchoring of relative time to `MAX(date)`.** "Recent" / "past N months"
means relative to *now* on a live database. A range *bound* may be derived from
the question's explicit range or, for "all periods present," from `MIN`/`MAX`
over the **unfiltered** facts (range derivation, per spec 10) — but the metric
must never silently redefine "recent" as the data's maximum date.
- **No grader / gold-answer / benchmark reference**, and no output-shape contract
(the skill is for interactive analysis).
### 5. Coordination with specs 07 and 10
All three recipes live in the **existing** `<sql_craft>` "Window functions"
group; the two current bullets and the spec-07 window-then-filter example must
stay intact and uncontradicted.
- **Spec 07** owns the deterministic-ordering rule (Window functions) and the
round-at-the-end rule (Numeric precision). Spec 11 **builds on** both —
references them, never restates them.
- **Spec 10** owns the spine concept and the dialect **Series** line. Spec 11
**references** the spine for the gappy-rolling fallback and adds the **distinct**
rolling-window dialect line. Keep them non-overlapping: spec 10 = how to make a
spine; spec 11 = how to compute a moving calendar-range aggregate (native frame
or spine + `ROWS`).
## Leak-safety (hard constraint)
Every worked example or note uses a **synthetic generic schema** (e.g.
`daily_revenue(day, amount)` or `account_txns(account_id, txn_date, net)`) and
shows only the *pattern*. **No** benchmark table names, SQL, or result values on
either surface. The dialect-notes additions, like the existing notes, carry no
benchmark / grader / version-dated content. The behavior is reconstructable from
first principles and tied to no specific instance.
## Acceptance criteria
- The `<sql_craft>` "Window functions" group states the three recipes — inline,
dialect-agnostic, each with a generic *why*, and each **building on** (not
restating) the deterministic-ordering and round-at-the-end rules:
- **cumulative / running total** with an explicit `ROWS BETWEEN UNBOUNDED
PRECEDING AND CURRENT ROW` frame and a complete tie-breaker;
- **rolling window over calendar time + minimum periods** — calendar range not
row count, the spine-or-range choice, the min-periods `COUNT(*) OVER (...)`
guard — delegating the engine's range-frame syntax to `sql_dialect_notes`;
- **period-over-period** via `LAG`, with full-precision growth and a
divide-by-zero / NULL-prev guard.
- Exactly **one** new worked `sql` example: the cumulative running total,
wrong-vs-right, with the explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW` frame and a complete tie-breaker, in standard dialect-agnostic SQL. The
skill then carries **four** `sql` worked examples total.
- Each of the seven `dialects/*.md` files gains a **rolling-window-over-time**
idiom line in its engine's own idiom (native calendar-range frame where
supported, otherwise a spine + `ROWS` fallback that references its **Series**
line); no engine leaks another engine's construct, and the additions contain no
benchmark / grader / version-dated content.
- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`,
`generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, **and no
inline `RANGE … INTERVAL` frame**, anywhere in `SKILL.md` including the new
example.
- The step-5 `sql_dialect_notes` provision list names the **rolling-window**
convention alongside FQTN / identifier-quoting / date / top-N / series/calendar /
JSON.
- The existing interactive guidance (`<workflow>`, `<rules>`, the other
examples), the two existing Window-functions bullets, the window-then-filter
example, and the existing dialect-note rubric lines (including **Series**) are
intact and uncontradicted.
- No grader / benchmark reference, no output-shape contract, and no anchoring of
*relative* time ("recent" / "past N months") to a `MAX(date)` over the data.
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
still parses as `ktx-analytics`.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three recipes
to the "Window functions" group (after its two existing bullets), the single
cumulative worked example, and extend the step-5 dialect-notes provision list to
name the rolling-window convention. Leave `<workflow>` / `<rules>` / the other
examples and the two existing window bullets intact. Delivery is unchanged
(single `SKILL.md` per target via `readAnalyticsSkillContent` in
`setup-agents.ts`) — confirm, no change required.
- **Dialect notes:** the seven files under
`packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
`DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
`copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each
engine's actual `RANGE`-frame support against authoritative docs before writing
the idiom; do not assert from memory.**
- **Tests:**
- `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
representative phrase for each of the three recipes; bump the `sql`-fence count
assertion **3 → 4**; assert the cumulative example shape (e.g. `ROWS BETWEEN
UNBOUNDED PRECEDING AND CURRENT ROW`); and **strengthen** the dialect-clean
guard with a no-inline-`RANGE … INTERVAL` assertion (mirroring spec 10 adding
`generate_series` / `GENERATE_DATE_ARRAY` to the banned list, so the
"range frame lives only in the dialect notes" criterion is *enforced*, not
incidentally true).
- `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers the
full rubric for every dialect" loop with the rolling-window assertion, e.g.
`expect(notes).toMatch(/\*\*Rolling/)`, so every dialect must answer it.
Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
all seven without a hand-maintained list.
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
`pnpm run build && pnpm run link:dev`.
## Benchmark context (motivation only)
Running-balance / rolling / period-over-period questions are the single largest
result-mismatch cluster in the SQLite subset (financial-transactions-style DBs):
cumulative balances with the wrong frame on ties, rolling windows that mis-span
gappy dates, partial early windows, and unguarded period-over-period ratios. The
methodology is universal analyst craft, so it belongs in the product's skill
(where it helps every real user) plus the per-dialect rolling-window syntax that
makes it executable — not in a benchmark-specific prompt. Depends on spec 10 (the
date spine) for the gappy-rolling fallback. Improving the benchmark score is a
side effect; the skill and the dialect notes contain no trace of the benchmark.
## Implementation notes
Shipped as additive content across the two surfaces the spec specified — no new
tool, flag, or config.
**Skill (`packages/cli/src/skills/analytics/SKILL.md`).** Added the three recipes
to the existing `<sql_craft>` "Window functions" group, after its two bullets and
the spec-07 window-then-filter example: **Cumulative / running total** (explicit
`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` + a tie-breaker, referencing
the deterministic-ordering rule), **Rolling window over calendar time, plus
minimum periods** (calendar range not row count; spine-or-native-range choice
delegated to `sql_dialect_notes`; the `COUNT(*) OVER (<same frame>) = N`
min-periods guard), and **Period-over-period** (`LAG` + full-precision growth +
`NULLIF` divide guard, referencing the round-at-the-end rule). Added one worked
`sql` example — the cumulative running total, wrong-vs-right, using
`account_txns(account_id, txn_id, txn_date, net)` — bringing the skill to four
worked examples. Extended the step-5 `sql_dialect_notes` provision list to name
the rolling-window convention. No inline `RANGE … INTERVAL` frame anywhere in the
skill; it stays dialect-clean.
**Dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`).** Added a
**Rolling window over time** line to all seven files, parallel to the spec-10
**Series** line and cross-referencing it for the spine fallback.
**Deviation — `RANGE`-frame support verified against authoritative docs (the
spec's hard requirement), which corrected two of its starting points:**
- **postgres** — native interval frame: `RANGE BETWEEN INTERVAL '29 days'
PRECEDING AND CURRENT ROW` (as the spec guessed).
- **mysql** — native interval frame over a temporal key: `RANGE BETWEEN INTERVAL
29 DAY PRECEDING AND CURRENT ROW` (as guessed).
- **bigquery** — `RANGE` is numeric-only: range over `UNIX_DATE(day)` with
`RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or spine + `ROWS` (as guessed).
- **snowflake** — **corrected:** the spec said "limited; default to a spine," but
Snowflake *does* support a native interval `RANGE` frame over a date/timestamp
key and it is gap-tolerant, so the note gives the native frame
(`RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`), no spine needed.
- **clickhouse** — **corrected:** the spec said "limited; default to a spine," but
ClickHouse supports a numeric `RANGE` offset over a `Date` column (counts in
days, gap-tolerant); the `INTERVAL` form is unsupported (use seconds for
`DateTime`). The note gives the numeric `RANGE` frame, with spine + `ROWS` as
the fallback.
- **sqlite** — no date-interval range frame (no native date type): spine + `ROWS`
(as guessed).
- **tsql** — `RANGE` supports only `UNBOUNDED`/`CURRENT ROW` (no offset frame):
spine + `ROWS`, or a date-keyed self-join (as guessed).
**Tests.** `test/skills/analytics-skill-content.test.ts` — added a representative
phrase per recipe (plus `minimum periods`), bumped the `sql`-fence count 3 → 4,
asserted the cumulative example shape (`ROWS BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW` and the `ORDER BY txn_date, txn_id` tie-breaker), and strengthened
the dialect-clean guard with a no-inline-`RANGE … INTERVAL` regex.
`test/context/mcp/dialect-notes.test.ts` — extended the per-dialect rubric loop
with `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect (derived from
`DIALECTS_WITH_NOTES`) must answer the rolling-window rubric.
**Verification.** Full `@kaelio/ktx` vitest suite green (3001 passed, 1 skipped);
`pnpm run build` mirrors both surfaces into `dist`; `pnpm run link:dev` refreshed
`ktx-dev`. Pre-existing, unrelated note: `tsc -p tsconfig.test.json` reports one
error in `test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts` cast) that is
present in committed branch code and untouched by this work.

@ -1,405 +0,0 @@
# Parse text-encoded numeric columns before doing math on them
> Refined spec. Intake draft: `todo/12-parse-text-encoded-numbers.md`.
## Problem
Numeric measures are often stored as **text** with human formatting: unit
suffixes (`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators
(`"$1,200"`), percent signs (`"12%"`), or non-numeric sentinels for missing/zero
(`"-"`, `"N/A"`, `""`). Aggregating or comparing such a column directly is
**silently wrong**: a string comparison orders `"100" < "9"`, and a naive
`CAST(x AS REAL)` yields `0`/NULL/partial on the formatted values rather than the
intended number. The query runs, the shape looks right, the number is garbage.
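Both failure modes are easy to observe. The sketch below uses SQLite, where the naive cast is the most forgiving; other engines may raise an error on the same inputs instead. The literal values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Text comparison is lexical: '1' sorts before '9', so '100' < '9' is true.
lexical = conn.execute("SELECT '100' < '9'").fetchone()[0]

# A naive cast keeps only the leading numeric prefix, or falls back to 0.0.
naive = conn.execute("""
    SELECT CAST('1.2K'   AS REAL),  -- suffix dropped, scale lost
           CAST('$1,200' AS REAL),  -- no numeric prefix at all
           CAST('N/A'    AS REAL)   -- sentinel becomes a real-looking zero
""").fetchone()

print(lexical, naive)
```

None of these statements fails, which is the point: the wrong values flow into the aggregate without any signal.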
The agent already samples schemas before composing — spec 07 gave the
`<sql_craft>` "Schema discovery before writing SQL" group its *"Sample before you
compose"* and *"Cast to the real type before comparing"* rules. But those rules
guard **encoding** (date format, nullability) and **type-mismatch in `WHERE`**;
they say nothing about a column whose declared/affinity type is text yet whose
*meaning* is numeric. When the agent sees a "numeric-looking" column it tends to
assume a real number type and skips the parse, so the arithmetic runs on the raw
strings. This spec adds the detect → parse/scale → verify habit to that same
group, building on the two rules already there rather than restating them.
## Generic use case (independent of any benchmark)
- A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become
`1200 / 3000000 / 0` before you can sum it or compute a daily change.
- A `price` stored as `"$1,299.00"` must become `1299.00` before averaging.
- A `conversion_rate` stored as `"12%"` must become `0.12` before weighting it.
This is routine data hygiene on real, messy production tables — every analyst
hits text-encoded measures on some warehouse, with no benchmark in sight. The
methodology is universal craft, so it belongs in the shipped skill; it transfers
to every ktx user querying a live database.
## Model
The change is **additive content across two surfaces** — the same split specs 10
and 11 made, and for the same reason. The split is the central design decision;
it satisfies spec 07's hard dialect-agnostic invariant for `<sql_craft>` without
weakening it.
### Why two surfaces (the dialect-agnostic reconciliation)
The **detect → parse → scale** half is **pure portable SQL** and stays entirely
in the dialect-agnostic skill:
- Stripping `$` / `,` / `%` is a portable chained `REPLACE` over a small, known
set of literal characters — no regex needed.
- Suffix scaling (K=10³, M=10⁶, B=10⁹) is a portable `LIKE`/`CASE` expression.
- Sentinel mapping (`-` / `N/A` / empty → `0` or `NULL`) is a portable `CASE`.
- The final cast to a numeric type is `CAST(... AS DECIMAL)`, broadly portable.
The **verify** half has one piece that is genuinely dialect-divergent: a
**failure-detecting numeric cast** — a cast that signals (rather than silently
swallows) a value that did not parse. This is exactly what requirement 3
("confirm coverage") needs, and it cannot be written portably:
- **bigquery:** `SAFE_CAST(x AS FLOAT64)` → `NULL` on failure.
- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_CAST` → `NULL` on failure.
- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(...))` / `TRY_CONVERT` → `NULL`.
- **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`.
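- **postgres / mysql:** no `TRY_CAST` — guard with a numeric pattern test before
casting (e.g. in postgres, `CASE WHEN x ~ '^-?[0-9]+([.][0-9]+)?$' THEN x::numeric END`;
a looser class such as `[0-9.]+` would admit `'1.2.3'`, which then errors on the cast).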
- **sqlite (the gotcha):** a plain `CAST('abc' AS REAL)` returns **`0.0`** and
`CAST('12abc' AS REAL)` returns **`12.0`** — it neither errors nor NULLs, so an
`IS NULL` coverage check is **silently broken**. Detecting a failed parse needs
a `GLOB`/`typeof` pattern guard.
So a portable skill cannot inline a safe cast any more than spec 10 could inline a
date-series generator or spec 11 a calendar range frame. ktx already routes that
kind of engine-specific syntax through the per-dialect notes in
`packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by the
`sql_dialect_notes` MCP tool (spec 08). Specs 10 and 11 set the exact precedent:
a construct not yet in the dialect rubric, genuinely engine-specific, was added
there (the **Series** line; the **Rolling window** line) and the dialect-agnostic
skill points to it. The failure-detecting cast is the next construct in that same
position, so the **safe-cast idiom belongs in the dialect notes**, and the skill
points to it.
Surface 1 (skill) carries the **pattern** (detect the text encoding; parse/scale
in an early CTE; verify with a failure-detecting cast). Surface 2 (dialect notes)
carries the **concrete safe-cast syntax** per engine, including the sqlite
`CAST`-returns-0 gotcha.
The regex character-*strip* is deliberately **not** promoted to the dialect
notes: a portable chained `REPLACE` over a known character set is the opinionated
default, so there is no need for a per-dialect strip line (derive from need; one
default). The dialect surface gains exactly one thing — the safe cast — because
that is the only piece the portable path genuinely cannot express.
### Additive, inline, heuristic-with-a-why
Consistent with specs 07, 10, and 11: the skill change is **additive content in
one Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled
`reference/` file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic,
and phrased as **heuristics with a one-line generic rationale**, not a wall of
MUSTs. The dialect-notes change is additive content in the seven existing
`dialects/*.md` files. No new tool, flag, or config on either surface.
### Build on the rules already present; do not restate them
- The Schema-discovery group already carries **"Sample before you compose"** and
**"Cast to the real type before comparing"** (spec 07). The detect rule
**extends** the first (distinct-value sampling to learn the encoding) and the
parse rule **complements** the second (text-meaning-numeric, not just
text-vs-numeric literal mismatch) — reference them, do not repeat them.
- The sentinel **0-vs-NULL** choice is the **same additive-vs-non-additive
judgment** spec 10 established in its *"Default by additivity"* rule (0 only
when "no value" genuinely reads as 0; NULL otherwise). **Reference** that rule
rather than restating the discriminator (state each rule once).
## Requirements
### 1. Skill surface — `<sql_craft>` "Schema discovery before writing SQL"
Add the text-encoded-numeric guidance to the **existing** group, after its two
current bullets. Phrase as heuristics, each with a generic *why*, dialect-agnostic.
It must cover:
1. **Detect text-encoded numerics during sampling.** When a column the question
treats as a number is stored as text, sample its **distinct** values to learn
the encodings actually present — unit suffixes (`K`/`M`/`B`), currency
symbols, thousands separators, percent signs, and non-numeric sentinels
(`-`, `N/A`, empty) — **before** composing. Never infer the format from the
column name. *Why:* compared/aggregated as-is, the text sorts lexically
(`'100' < '9'`) and a naive cast collapses formatted values to `0`/NULL —
producing a silently wrong result instead of an error.
2. **Parse and scale in an early CTE.** Strip currency/separator/percent
characters, multiply by the suffix scale (K=10³, M=10⁶, B=10⁹), map sentinels
to `0` **or** `NULL` per the question's intent, then cast to a numeric type —
all in **one early CTE**, so every downstream layer sees clean numbers. The
`0`-vs-`NULL` choice for sentinels follows spec 10's **additive-vs-non-additive**
rule (reference it; do not restate). *Why:* a string column aggregated as-is
sorts lexically and casts to 0, so the math is silently wrong.
3. **Confirm coverage (verify).** After parsing, sanity-check that **no
intended-numeric value silently failed to parse** — a failed parse should
surface as `NULL`, which is only visible with a **failure-detecting cast**.
Note the divergence: a plain `CAST` errors on some engines and, on sqlite,
returns `0`/partial rather than NULL — so use the engine's safe-cast idiom from
`sql_dialect_notes` (requirement 3), then count residual NULLs among
non-sentinel rows. *Why:* an encoding the sample missed would otherwise vanish
as `0`/NULL instead of being caught.
### 2. One worked example — parse/scale, fully portable
Add **exactly one** new compact before/after `sql` example demonstrating the
parse-and-scale pattern on a synthetic generic schema
(e.g. `metrics(label, value_text)` with values like `'1.2K'`, `'$1,200'`, `'-'`):
- **Wrong:** `SUM(CAST(value_text AS REAL))` (or summing the raw strings) — the
formatted values collapse to `0`/partial, so the total is silently wrong.
- **Right:** an early CTE that strips symbols with chained `REPLACE`, applies a
`CASE` for the K/M/B suffix scale, maps `'-'`/`'N/A'`/`''` to `0`, casts to
`DECIMAL`, then `SUM`s the parsed column.
**Standard, portable SQL only** — no `REGEXP_REPLACE`, `SAFE_CAST`, `TRY_CAST`,
`TRY_TO_NUMBER`, `toFloat64OrNull`, `GLOB`, or any dialect function — so the
example stays dialect-clean. Keep it ~12–16 lines. The **verify** step gets **no**
inline example (its correct form needs the engine-specific safe cast, delegated to
`sql_dialect_notes`, exactly as spec 10's period-spine and spec 11's
rolling-window variants were prose-only).
This adds **one** worked `sql` example to the skill. Spec 11 independently adds
one as well; **do not hardcode the resulting total** — increment from the current
state. As of this writing the skill carries **three** examples (spec 07
window-then-filter, spec 09 multi-hop fan-out, spec 10 panel spine), so this is
the **fourth**; if spec 11 ships first it is the **fifth**. The fence-count test
assertion is incremented by one from its current value (see Acceptance criteria).
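A sketch of the wrong-vs-right pair this requirement describes, run on an in-memory SQLite database. It is not the example shipped in `SKILL.md`: the `metrics(label, value_text)` schema and its values are the synthetic ones suggested above, percent handling is left out for brevity, and the bare `DECIMAL` cast is an assumption that suits SQLite (on engines where a bare `DECIMAL` defaults to scale 0, an explicit precision and scale would be needed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (label TEXT, value_text TEXT);
INSERT INTO metrics VALUES
  ('a', '1.2K'), ('b', '$1,200'), ('c', '-'), ('d', '3M');
""")

# Wrong: formatted values collapse to a numeric prefix or to zero.
naive_total = conn.execute(
    "SELECT SUM(CAST(value_text AS REAL)) FROM metrics"
).fetchone()[0]

# Right: strip, scale, and map sentinels in early CTEs, then aggregate.
parsed_total = conn.execute("""
    WITH stripped AS (
        SELECT label,
               REPLACE(REPLACE(TRIM(value_text), '$', ''), ',', '') AS cleaned
        FROM metrics
    ),
    parsed AS (
        SELECT label,
               CASE
                 WHEN cleaned IN ('-', 'N/A', '') THEN 0
                 WHEN cleaned LIKE '%K'
                   THEN CAST(REPLACE(cleaned, 'K', '') AS DECIMAL) * 1000
                 WHEN cleaned LIKE '%M'
                   THEN CAST(REPLACE(cleaned, 'M', '') AS DECIMAL) * 1000000
                 WHEN cleaned LIKE '%B'
                   THEN CAST(REPLACE(cleaned, 'B', '') AS DECIMAL) * 1000000000
                 ELSE CAST(cleaned AS DECIMAL)
               END AS value_num
        FROM stripped
    )
    SELECT SUM(value_num) FROM parsed
""").fetchone()[0]

print(naive_total, parsed_total)
```

Mapping the sentinel to `0` here assumes an additive measure; a non-additive one would map it to `NULL` instead, per the additivity rule referenced above.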
### 3. Dialect-notes surface — `dialects/*.md` (safe cast)
Add a **"Safe cast"** idiom line to **each** of the seven authored dialect files,
parallel to spec 10's **Series** line and spec 11's **Rolling window** line. Each
line gives that engine's **failure-detecting numeric cast** — a cast that returns
`NULL` (or is detectably invalid) on a non-numeric input — which is what makes the
verify step correct on that engine. Each note is engine-exclusive (a SQLite
analyst gets the SQLite idiom and never another engine's construct, per the
existing dialect-notes leak guards). Orientation only — exact syntax is the
implementer's; verify against authoritative docs (context7 / the engine manual)
rather than asserting from memory:
- **postgres:** no `TRY_CAST` — guard with a numeric pattern before casting,
e.g. `CASE WHEN x ~ '^-?[0-9]+([.][0-9]+)?$' THEN x::numeric END` (a looser
class such as `[0-9.]+` would admit `'1.2.3'`, which then errors on the cast).
(`regexp_replace` is available for the strip, but chained `REPLACE` is the
portable default.)
- **mysql (8.0+):** no `TRY_CAST` — guard with `x REGEXP '^-?[0-9]+([.][0-9]+)?$'`
before `CAST(... AS DECIMAL)`; `REGEXP_REPLACE` is available for the strip.
- **bigquery:** `SAFE_CAST(x AS FLOAT64)` (or `SAFE_CAST(... AS NUMERIC)`) →
`NULL` on failure.
- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_TO_DECIMAL(x, p, s)` / `TRY_CAST` →
`NULL` on failure.
- **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`.
- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(18,4))` / `TRY_CONVERT` → `NULL`.
- **sqlite (the gotcha):** a plain `CAST` returns `0`/partial, **not** NULL or an
error, so a coverage check must use a pattern guard such as
`CASE WHEN cleaned GLOB '...' THEN CAST(cleaned AS REAL) END` (or a `typeof`
check) to detect a value that did not parse.
This line is what makes the verify step executable from the dialect-agnostic
skill. It is **distinct** from the Series and Rolling-window lines (those generate
or window over a calendar; this detects a failed numeric parse). Phrase any
version note as `8.0+`-style, **not** "as of version …" (the dialect-notes test
bans version-dated wording).
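The SQLite gotcha and one possible pattern guard can be sketched as follows. The `parsed_stage(label, cleaned)` table and the specific `GLOB` patterns are illustrative assumptions, not the wording of the shipped dialect note; the guard accepts an optional leading minus, digits, and at most one decimal point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parsed_stage (label TEXT, cleaned TEXT);
-- Hypothetical post-strip values: two are valid, two did not parse.
INSERT INTO parsed_stage VALUES
  ('a', '1200'), ('b', '12abc'), ('c', 'abc'), ('d', '-3.5');
""")

# Plain CAST never yields NULL for these inputs, so this check finds nothing.
plain_failures = conn.execute("""
    SELECT COUNT(*) FROM parsed_stage WHERE CAST(cleaned AS REAL) IS NULL
""").fetchone()[0]

# Pattern guard: cast only values that look numeric; everything else is NULL.
guarded_failures = conn.execute("""
    WITH body AS (
        SELECT label, cleaned,
               CASE WHEN cleaned LIKE '-%' THEN SUBSTR(cleaned, 2)
                    ELSE cleaned END AS digits
        FROM parsed_stage
    ),
    guarded AS (
        SELECT label,
               CASE WHEN digits GLOB '*[0-9]*'         -- at least one digit
                     AND digits NOT GLOB '*[^0-9.]*'   -- only digits and dots
                     AND digits NOT GLOB '*.*.*'       -- at most one dot
                    THEN CAST(cleaned AS REAL) END AS value_num
        FROM body
    )
    SELECT label FROM guarded WHERE value_num IS NULL ORDER BY label
""").fetchall()

print(plain_failures, guarded_failures)
```

With the guard in place, the residual-`NULL` count in the verify step means what it is supposed to mean on this engine.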
### 4. Explicit constraints / exclusions
None of the following may appear (consistent with specs 07, 10, and 11):
- **No inline dialect-specific cast/regex syntax in the skill** — no `SAFE_CAST`,
`TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`,
`replaceRegexpAll`, or `GLOB` anywhere in `SKILL.md`. The portable strip is
chained `REPLACE`; the failure-detecting cast lives only in the dialect notes.
- **No regex-strip dialect line.** The character strip stays the portable
chained-`REPLACE` default; the dialect notes gain only the **safe cast**.
- **No grader / gold-answer / benchmark reference**, and no output-shape contract
(the skill is for interactive analysis).
### 5. Coordination with specs 07, 08, 10, and 11
- **Spec 07** owns the Schema-discovery group and its two existing bullets
(*"Sample before you compose"*, *"Cast to the real type before comparing"*).
Spec 12 **extends** that group and **builds on** both bullets — references them,
never restates them; they must stay intact and uncontradicted.
- **Spec 08** owns the dialect-notes channel and its leak guards. Spec 12 adds one
rubric line through that channel; the engine-exclusivity guards apply unchanged.
- **Spec 10** owns the additive-vs-non-additive discriminator (Answer
completeness) and the dialect **Series** line. Spec 12 **references** the
additivity rule for the sentinel `0`-vs-`NULL` choice; do not duplicate it.
- **Spec 11** independently adds the dialect **Rolling window** line, one `sql`
example, and the **rolling-window** entry to the step-5 provision list. Spec 12
touches the **same** three places (the dialect-notes rubric loop, the example
count, and the step-5 list). Both are independent and additive — **add to the
current state, do not assume an order**: name **safe-cast** in the step-5 list
without removing rolling-window/series; increment the example count by one from
whatever it is; add `/\*\*Safe cast/` to the rubric loop alongside any
`/\*\*Rolling/` assertion.
### 6. Step pointer (no duplication)
The step-5 `sql_dialect_notes` provision list (currently "FQTN,
identifier-quoting, date, top-N, series/calendar, and JSON conventions"; spec 11
also names rolling-window) should additionally name the **safe-cast** convention
now that it exists. State each rule once inside `<sql_craft>`; the workflow steps
only point to it.
## Leak-safety (hard constraint)
Every worked example or note uses a **synthetic generic schema** (e.g.
`metrics(label, value_text)`) and made-up values (`'1.2K'`, `'$1,200'`, `'-'`),
showing only the *pattern*. **No** benchmark table names, SQL, or result values on
either surface. The dialect-notes additions, like the existing notes, carry no
benchmark / grader / version-dated content. The behavior is reconstructable from
first principles and tied to no specific instance.
## Acceptance criteria
- The `<sql_craft>` "Schema discovery before writing SQL" group states the three
heuristics — inline, dialect-agnostic, each with a generic *why*, and each
**building on** (not restating) the existing *"Sample before you compose"* and
*"Cast to the real type before comparing"* bullets and spec 10's additivity rule:
- **detect** text-encoded numerics by sampling distinct values (suffixes,
symbols, separators, sentinels) — never from the column name;
- **parse and scale** in an early CTE (strip → suffix-scale → sentinel map →
cast), sentinel `0`-vs-`NULL` per spec 10's additivity rule;
- **confirm coverage** with a failure-detecting cast, delegating the engine's
safe-cast syntax to `sql_dialect_notes`.
- Exactly **one** new worked `sql` example: parse-and-scale, wrong-vs-right, using
chained `REPLACE` + `CASE` suffix scale + sentinel `CASE` + `CAST(... AS
DECIMAL)`, in standard portable SQL. The `sql`-fence count assertion is
incremented by **one** from its current value (3 today → 4; or 5 if spec 11
shipped first).
- Each of the seven `dialects/*.md` files gains a **"Safe cast"** idiom line in its
engine's own failure-detecting numeric-cast idiom (including the sqlite
`CAST`-returns-0 gotcha); no engine leaks another engine's construct, and the
additions contain no benchmark / grader / version-dated content.
- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`,
`generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, inline
`RANGE … INTERVAL` frame, **and no `SAFE_CAST` / `TRY_CAST` / `TRY_TO_NUMBER` /
`REGEXP_REPLACE` / `toFloat64OrNull` / `GLOB`**, anywhere in `SKILL.md`
including the new example.
- The step-5 `sql_dialect_notes` provision list names the **safe-cast** convention
alongside FQTN / identifier-quoting / date / top-N / series-calendar /
rolling-window / JSON.
- The existing interactive guidance (`<workflow>`, `<rules>`, the other examples),
the two existing Schema-discovery bullets, and the existing dialect-note rubric
lines (including **Series** and, if present, **Rolling window**) are intact and
uncontradicted.
- No grader / benchmark reference, and no output-shape contract.
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
still parses as `ktx-analytics`.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three
heuristics to the "Schema discovery before writing SQL" group (after its two
existing bullets), the single parse-and-scale worked example, and extend the
step-5 dialect-notes provision list to name the safe-cast convention. Leave
`<workflow>` / `<rules>` / the other examples and the two existing
schema-discovery bullets intact. Delivery is unchanged (single `SKILL.md` per
target via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no
change required.
- **Dialect notes:** the seven files under
`packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
`DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
`copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each
engine's actual safe-cast / try-cast support against authoritative docs before
writing the idiom; do not assert from memory** (in particular the sqlite
`CAST`-returns-0 behavior, which is the motivating gotcha).
- **Tests:**
- `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
representative phrase for each of the three heuristics (e.g. a *detect*, a
*parse/scale*, and a *confirm-coverage* phrase) to the `represents every craft
behavior` list; bump the `sql`-fence count assertion **by one** from its
current value; assert the example shape (e.g. `REPLACE(` and `CAST(` and a
suffix-scale multiplier); and **strengthen** the dialect-clean guard by adding
`SAFE_CAST`, `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`,
and `GLOB` to the banned list (mirroring spec 10 adding `generate_series` /
`GENERATE_DATE_ARRAY` and spec 11 adding the no-inline-`RANGE … INTERVAL`
guard, so the "safe cast lives only in the dialect notes" criterion is
*enforced*, not incidentally true).
- `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers
the full rubric for every dialect" loop with the safe-cast assertion,
`expect(notes).toMatch(/\*\*Safe cast/)`, so every dialect must answer it.
Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
all seven without a hand-maintained list. Do **not** add a false-exclusivity
assertion for `TRY_CAST` (it is shared by snowflake and tsql); requiring the
line per dialect is sufficient.
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
`pnpm run build && pnpm run link:dev`.
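The two test-side guards described above (a per-dialect "Safe cast" rubric line, and a banned-token list that keeps the skill dialect-clean) can be sketched without a test runner. This is a minimal illustration, not the repo's test code: the helper names, the sample note strings, and the in-memory `notesByDialect` map are all assumptions made for the example; only the regex and the six banned tokens come from the spec.

```typescript
// Regex taken from the spec: every dialect note must carry a bolded "Safe cast" line.
const SAFE_CAST_LINE = /\*\*Safe cast/;

// Tokens the spec bans from SKILL.md (they belong only in the dialect notes).
const BANNED_IN_SKILL = [
  "SAFE_CAST",
  "TRY_CAST",
  "TRY_TO_NUMBER",
  "REGEXP_REPLACE",
  "toFloat64OrNull",
  "GLOB",
];

// Hypothetical helper: names of dialects whose notes lack the rubric line.
function dialectsMissingSafeCast(notesByDialect: Record<string, string>): string[] {
  return Object.entries(notesByDialect)
    .filter(([, notes]) => !SAFE_CAST_LINE.test(notes))
    .map(([dialect]) => dialect);
}

// Hypothetical helper: banned tokens that appear in the skill text.
// Plain substring matching is deliberately strict; it would also flag a word
// such as "GLOBAL", which a real guard may or may not want.
function bannedTokensIn(skillText: string): string[] {
  return BANNED_IN_SKILL.filter((token) => skillText.includes(token));
}

// Made-up sample content, standing in for the files read from disk.
const sampleNotes: Record<string, string> = {
  postgres: "- **Safe cast:** guard with a numeric pattern before casting.",
  sqlite: "- **Rolling window:** (some other rubric line)",
};
const cleanSkill = "WITH parsed AS (SELECT CAST(REPLACE(value_text, ',', '') AS DECIMAL(18,4)) AS v FROM metrics)";
const leakySkill = "SELECT SAFE_CAST(value_text AS FLOAT64) FROM metrics";

console.log(dialectsMissingSafeCast(sampleNotes)); // dialects that still need the line
console.log(bannedTokensIn(cleanSkill), bannedTokensIn(leakySkill));
```

Because the rubric check iterates over whatever map it is given, deriving that map from `DIALECTS_WITH_NOTES` is what lets one assertion cover all seven dialects without a hand-maintained list.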
## Benchmark context (motivation only)
At least one SQLite-subset question stores trading volume as suffix-encoded text
(`"K"`/`"M"`, `"-"` for zero) and fails because the agent aggregates the raw
strings — runnable, plausible, wrong. The sqlite `CAST`-returns-0 behavior makes
the failure especially insidious: there is no error to alert the agent, and a
naive `IS NULL` coverage check would not catch it either, which is precisely why
the safe-cast idiom belongs in the dialect notes. The fix — parse messy encodings
before math, then verify coverage with a failure-detecting cast — is universal
data hygiene that helps any analyst on any warehouse, so it belongs in the
product's craft (skill) plus the per-dialect safe-cast syntax that makes the
verify step executable, not in a benchmark-specific prompt. Improving the
benchmark score is a side effect; the skill and the dialect notes contain no trace
of the benchmark.
## Implementation notes
Shipped on branch `write-feature-spec-wiki`, on top of specs 10 and 11 (both already
applied in the working tree). Built from the current state per the "do not assume an
order" guidance — there were **four** worked examples (specs 07 window-then-filter,
09 multi-hop fan-out, 10 panel spine, 11 cumulative running total), so this is the
**fifth**, and step 5 already named `series/calendar, rolling-window`.
**Skill — `packages/cli/src/skills/analytics/SKILL.md`:**
- Added the three heuristics to the **"Schema discovery before writing SQL"** group,
after the two existing bullets: *Parse text-encoded numerics before doing math on
them* (detect by sampling distinct values, extending *Sample before you compose*,
never inferring from the column name), *Strip, scale, and cast in one early CTE*
(the *meaning-is-numeric* complement to *Cast to the real type before comparing*,
with the sentinel `0`-vs-`NULL` choice deferred to spec 10's *Default by
additivity* rule), and *Confirm the parse covered every value* (failure-detecting
cast from `sql_dialect_notes`). Each carries a one-line generic *why*; the existing
bullets and the additivity rule are referenced, not restated.
- Added **one** portable worked example (`metrics(label, value_text)` with `'1.2K'`,
`'3M'`, `'$1,200'`, `'-'`): wrong = `SUM(CAST(value_text AS REAL))`; right = an
early `parsed` CTE that strips with chained `REPLACE`, scales the K/M/B suffix with
a `CASE`, maps sentinels to `0`, casts to `DECIMAL(18,4)`, then `SUM`s. Standard
portable SQL only — no dialect functions, no inline safe cast.
- Step 5 dialect-notes provision list now names **safe-cast** alongside the others.
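The stage order the worked example follows (strip, suffix-scale, sentinel map, cast, then a coverage check) can be mirrored in TypeScript to make the logic concrete. This is only a sketch of the reasoning, not the SQL that shipped: the function name, the choice of `'-'` as the sole sentinel, and the `'n/a'` value are assumptions added for illustration.

```typescript
// Suffix multipliers for K / M / B, as in the described CASE expression.
const SUFFIX_SCALE: Record<string, number> = { K: 1e3, M: 1e6, B: 1e9 };
// Assumed sentinel meaning "no activity"; mapped to 0 because the measure is additive.
const ZERO_SENTINELS = ["-"];

// Returns the parsed number, or null when the value did not parse
// (the "failure-detecting" behavior the coverage check relies on).
function parseTextNumeric(raw: string): number | null {
  // 1. Strip: drop currency symbols and thousands separators (chained replace).
  const stripped = raw.trim().replace(/\$/g, "").replace(/,/g, "");
  // 2. Suffix-scale: peel a trailing K/M/B and remember its multiplier.
  const suffix = stripped.slice(-1).toUpperCase();
  const scale: number | undefined = SUFFIX_SCALE[suffix];
  const digits = scale === undefined ? stripped : stripped.slice(0, -1);
  // 3. Sentinel map.
  if (ZERO_SENTINELS.includes(digits)) return 0;
  // 4. Cast, guarded by a numeric pattern so a bad value becomes null, not 0.
  if (!/^-?\d+(\.\d+)?$/.test(digits)) return null;
  return Number(digits) * (scale ?? 1);
}

const values = ["1.2K", "3M", "$1,200", "-", "n/a"];
const parsed = values.map(parseTextNumeric);

// Coverage check: list every raw value that failed to parse.
const unparsed = values.filter((_, i) => parsed[i] === null);
const total = parsed.reduce<number>((sum, v) => sum + (v ?? 0), 0);

console.log({ total, unparsed });
```

The pattern guard in step 4 is the point of the exercise. A lenient parser hides the failure: JavaScript's `parseFloat("1.2K")` quietly yields `1.2`, which is the same class of silent partial parse the sqlite note warns about.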
**Dialect notes — `packages/cli/src/context/sql-analysis/dialects/*.md`:** added a
**Safe cast** line to all seven files (after the *Rolling window* line), each giving
that engine's failure-detecting numeric cast: postgres/mysql use a numeric pattern
guard before casting (no `TRY_CAST`; mysql's bare `CAST` returns `0` with a warning);
bigquery `SAFE_CAST`; snowflake `TRY_TO_NUMBER`/`TRY_TO_DECIMAL`/`TRY_CAST`; tsql
`TRY_CAST`/`TRY_CONVERT`; clickhouse `toFloat64OrNull`/`toDecimal64OrNull` (the
`...OrZero` variants return `0`); sqlite documents the `CAST`-returns-`0.0`/partial
gotcha and a `GLOB` pattern guard. ClickHouse function names were verified against
the official docs via context7 (the spec's loose `toDecimalOrNull` is not a real
name — the `to<Type>OrNull` family requires a bit width, hence `toDecimal64OrNull`).
No version-dated wording.
**Tests:** `analytics-skill-content.test.ts` — added the three representative
phrases, bumped the `sql`-fence count 4 → 5 (and the test title), asserted the
example shape (`WITH parsed AS`, `REPLACE(`, `AS DECIMAL(`, `LIKE '%K' THEN 1000`),
and strengthened the dialect-clean banned list with `SAFE_CAST`, `TRY_CAST`,
`TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, and `GLOB` (mirroring spec 10's
`generate_series` / spec 11's inline-`RANGE … INTERVAL` guards). `dialect-notes.test.ts`
— added `expect(notes).toMatch(/\*\*Safe cast/)` to the per-dialect rubric loop, so
all seven (derived from `DIALECTS_WITH_NOTES`) must answer it; no false-exclusivity
assertion for the shared `TRY_CAST`.
**Verification:** both affected test files pass (19 tests); broader `test/skills` +
`test/context/mcp` pass (65 tests); production type-check (`tsc -p tsconfig.json`)
is clean; `pnpm run build` copies both surfaces into `dist` (7 dialect files carry
*Safe cast*, the built `SKILL.md` carries the parse example) and `pnpm run link:dev`
relinks `ktx-dev`. One **pre-existing, unrelated** type error remains in the
test-only config (`test/mcp-server-factory.test.ts:152`, byte-identical to HEAD,
untouched here) — out of scope for this spec.


@ -1,336 +0,0 @@
# Output completeness — answer every requested part, enforced by a final pre-emit check
> Refined spec. Intake draft: `todo/14-output-completeness-final-check.md`.
## Problem
The single largest correctness failure mode for the analytics skill is
**incomplete output**: the query runs and the methodology is roughly right, but
the projection is missing columns the question asked for. The SQL is runnable and
the aggregate is correct — the answer is simply *short by columns*. Three
recurring shapes:
1. **Multi-part questions answered partially.** A question that asks for several
things ("report the highest *and* the lowest month, each with its count and
average, *and* the difference") comes back with only the first clause — one
column where several were requested.
2. **Identity dropped.** Grouping by a human-readable name but not projecting the
entity's identifier (a product name without its product id, a customer name
without its customer id).
3. **Inputs to a derived value dropped.** Returning a ratio / percentage /
difference but not the underlying counts the question also asked for.
Shapes 2 and 3 are **already covered** by shipped `<sql_craft>` rules — spec 07's
*"Expose identity, not just the label"* and *"Keep the inputs to a derived
value"* — yet they are frequently **not applied**. So the gap is not missing
knowledge: these rules sit as passive heuristics in a list, and nothing makes the
agent reliably check them before finalizing. The fix is twofold: (a) add the
missing **multi-part-completeness** rule that generalizes shapes 1–3, and (b)
turn output-completeness into an **explicit final verification step** the agent
performs before emitting SQL, so the existing identity/inputs rules are actually
enforced rather than merely listed.
The failure is **model-independent**: a markedly stronger model produced the same
incomplete-output mistakes on these questions, which means it is a
craft/enforcement gap, not a capability gap — exactly the kind of universal
analyst craft that belongs in the shipped skill.
## Generic use case (independent of any benchmark)
An analyst is asked: *"For each region, report the highest and the lowest monthly
order count, and the difference between them."* A complete answer has a column for
the region's id and name, the highest count, the lowest count, and the difference
— five columns. Returning just the region and a single number answers only part
of the request. This is a universal expectation on any database: answer **every**
part of a multi-part request, identify the entities, and show the inputs behind
any derived figure — and answer *exactly* that, without padding the result with
columns the question never asked for.
## Model
The change is **additive content in one Markdown file**
(`skills/analytics/SKILL.md`), governed by the same invariants spec 07
established. They constrain the implementer; the exact prose is theirs.
### Additive, inline, heuristic-with-a-why
Consistent with specs 07 and 10: the change is additive content in
`skills/analytics/SKILL.md`, **inline** (no bundled `reference/` file — the
`setup-agents.ts` delivery ships only `SKILL.md` per target), dialect-agnostic,
and phrased as **heuristics with a one-line generic rationale**, not a wall of
MUSTs. The new rule extends the existing `<sql_craft>` "Answer completeness /
interpretation" group; the shipped bullets in that group (including the *identity*
and *inputs* rules this spec builds on) are preserved unchanged. No new tool,
flag, or config.
### The over-projection guard carries a *universal* why, not a grader reference
The intake draft frames "don't pad the result with extra columns" as
*grader-gaming*. The skill forbids **any** reference to a grader, gold answer, or
benchmark (spec 07's hard invariant; the content test bans the words). So the
guard must ship with a **universal analytics rationale** instead: columns the
question did not ask for add noise, mislead the reader into thinking they matter,
and make the result harder to consume — match the request exactly, neither short
nor padded. This is the same reconciliation spec 07 applied to the draft's
"behavior only, no rationale" instruction: generic *why* is required; only
grader/gold/benchmark rationale is banned.
### Completeness is a closed set — identity and inputs are *inside* it
"Expose identity" and "keep the inputs" tell the agent to add columns; the
over-projection guard tells it not to. These only contradict if the target is
left fuzzy, so this spec pins it down. A **complete projection** is exactly:
> {every requested metric/attribute} ∪ {the identifier of each grouped/named
> entity} ∪ {the inputs to each derived value}, at the grain the question
> specifies.
Identity and inputs are **members of that set** — part of completeness, never
"padding." **Under-projection** is any member missing (the failure this spec
attacks); **over-projection** is any column *outside* the set (what the guard
forbids). The implementer must phrase the rule and guard against this single
definition so they read as one coherent notion, not two competing instructions.
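The closed-set definition above reduces to two set differences, which a short TypeScript sketch can make explicit. Everything here is illustrative: the `QuestionAsk` shape, the helper names, and the column names are hypothetical, and the grain facet is left out because it concerns rows rather than column names.

```typescript
// Hypothetical description of what a question asks for.
interface QuestionAsk {
  requested: string[];     // every requested metric / attribute
  identifiers: string[];   // identifier of each grouped / named entity
  derivedInputs: string[]; // inputs to each derived value
}

// Drop duplicates while keeping first-seen order.
function unique(cols: string[]): string[] {
  return cols.filter((c, i) => cols.indexOf(c) === i);
}

// The complete projection is the union of the three member sets.
function completeProjection(ask: QuestionAsk): string[] {
  return unique([...ask.requested, ...ask.identifiers, ...ask.derivedInputs]);
}

// Under-projection = complete minus projected; over-projection = projected minus complete.
function checkProjection(ask: QuestionAsk, projected: string[]): { missing: string[]; extra: string[] } {
  const complete = completeProjection(ask);
  return {
    missing: complete.filter((c) => !projected.includes(c)),
    extra: projected.filter((c) => !complete.includes(c)),
  };
}

// "For each region, report the highest and lowest monthly order count, and the difference."
const ask: QuestionAsk = {
  requested: ["highest_orders", "lowest_orders", "order_count_range"],
  identifiers: ["region_id", "region_name"],
  derivedInputs: ["highest_orders", "lowest_orders"],
};

const partial = checkProjection(ask, ["region_name", "highest_orders"]);
const padded = checkProjection(ask, [...completeProjection(ask), "avg_orders"]);

console.log(partial); // the under-projected case
console.log(padded);  // the over-projected case
```

Seen this way, identity and inputs can never count as padding: they are inside the union by construction, so only a column outside all three member sets shows up as `extra`.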
### Dialect-agnostic, additive-only, exclusions intact
Every addition reads correctly on any dialect — no dialect-specific syntax in the
rule text or the worked example. The existing `<workflow>`, `<rules>`, and the
other `<sql_craft>` bullets and examples (specs 07/09/10/11/12) are preserved and
uncontradicted. Spec 07's exclusions still hold: no output-shape contract, no
`MAX(date)` anchoring of relative time, no grader-driven advice, no dialect
syntax.
## Requirements
### 1. Multi-part / multi-output completeness — a new umbrella rule
Add a bullet to the `<sql_craft>` "Answer completeness / interpretation" group:
when a question requests several outputs — a **list** ("A, B, and C"), **paired
extremes** ("the highest *and* the lowest"), or a **value plus its components**
("X, Y, and their ratio") — the final projection must contain a column for
**each** requested output. *Why:* answering only the first clause is the most
common way a runnable query is still wrong; the grain and methodology can be
perfect yet the answer is short by columns.
This rule is the **umbrella** over the two shipped completeness rules: the
*inputs* rule (*"Keep the inputs to a derived value"*) is its "value + components"
instance, and the *identity* rule (*"Expose identity, not just the label"*) is its
"entity identity" instance. The new bullet should **name that relationship**
(so the three read as one notion) rather than restating either rule.
Keep this distinct from the row-selection rules in the same group: *"Top /
highest / most / lowest"* and *"For each X / per X / by X"* govern **which rows**
appear; multi-part completeness governs **which columns** appear. They compose
(e.g. "highest and lowest per region" needs one row per region *and* a column per
clause).
### 2. Final completeness check — the enforcement mechanism
The rule content lives **once** in `<sql_craft>`; the trigger is promoted to a
first-class line in `<workflow>` step 6.
- **Capstone bullet in `<sql_craft>`** (closing the "Answer completeness /
interpretation" group): *before emitting the final SQL, re-read the question and
confirm the projection covers* —
1. every named **metric / attribute** the question asks for (→ the multi-part
rule);
2. the **identifier** of every grouped or named entity (→ the *identity* rule);
3. every **input** to each derived value (→ the *inputs* rule);
4. all at the **grain** the question specifies (→ the *for each X* / panel
rules).
Each facet cross-references the rule it enforces, so the check is what makes
those passive rules active. Phrase it as a short, concrete "confirm the
projection covers…" checklist, not a wall of MUSTs.
- **Over-projection guard** (attached to the check): do **not** add columns the
question did not ask for "to be safe" — extra columns add noise, mislead, and
make the result harder to consume; match the request exactly. Carries the
**universal** why from the Model, **never** a grader/gold/benchmark reference.
- **`<workflow>` step 6 line** (the explicit ritual): step 6 ("Validate and
explain") gains a mandatory line directing the agent to **always** run the final
completeness check before emitting — re-read the question and verify every
requested output, each entity's identity, each derived value's inputs, and the
grain are all projected — pointing into the `<sql_craft>` capstone for the
detail. This **replaces the current conditional pointer's role** ("If a result
is unexpectedly empty or its grain looks wrong, work through the … rules"): the
empty/grain diagnostic stays available (it maps to the existing *"Diagnose empty
results"* and grain rules), but the completeness check fires **unconditionally**,
on every SQL-authoring turn, not only when a result looks off. The workflow line
names the ritual and the four facets; the rationale, guard, and example are
stated once in `<sql_craft>`, not duplicated into the workflow.
### 3. One worked example (dialect-agnostic)
Add **exactly one** compact before/after example to the "Answer completeness /
interpretation" group, demonstrating multi-part completeness on a **synthetic**
schema (`regions`, `region_monthly`):
- **WRONG:** answers only the first clause — `SELECT region_name,
MAX(monthly_orders) AS highest … GROUP BY region_name` — with no region id, no
lowest, no difference.
- **RIGHT:** one column per requested output plus the entity's identity, at the
region grain — `region_id, region_name`, the highest, the lowest, and the
difference, with `regions` joined to `region_monthly` and grouped by the region
id and name.
Standard dialect-clean SQL only (no `QUALIFY`, no dialect functions; `MAX`/`MIN`
are portable aggregates). Keep it tight. It teaches multi-clause coverage +
identity + derived-value inputs in one capstone, and is **distinct** from the
spec-10 `regions` panel example: that one is about missing **rows** (LEFT-JOIN
spine + `COALESCE`); this one is about missing **columns**. This is the **sixth**
worked `sql` example in the skill (after specs 07/09/10/11/12).
### 4. Coordination with specs 03 and 07/09/10/11/12
- **Spec 03** (multi-connection routing) owns `<workflow>` step 0 and the
`connectionId` threading/scoping. Spec 14 touches `<workflow>` only to add the
completeness-check line to **step 6** — it must not rewrite the routing or the
`<rules>` `connectionId` scoping. If both land, step 6 reads coherently: validate
+ the completeness ritual.
- **Specs 07/09/10/11/12** own their own bullets and worked examples in
`<sql_craft>`. Spec 14 is **additive** to the same "Answer completeness /
interpretation" group and adds one example; it must not remove or contradict
theirs.
## Leak-safety (hard constraint)
The example uses an **invented, generic schema** (`regions`, `region_monthly`) and
made-up columns — **no benchmark table names, SQL, or result values.** It teaches
the *pattern* (cover every requested output + identity + inputs, at grain, without
padding), which is universal and tied to no specific instance. The over-projection
guard's rationale is **universal** (noise/clarity/consumability), never
"grader-gaming" or any other scoring reference. No part of the addition mentions a
benchmark, gold answer, grader, or scoring comparator.
## Acceptance criteria
- `<sql_craft>` "Answer completeness / interpretation" states the **multi-part /
multi-output completeness** rule (a column per requested output; list / paired
extremes / value-plus-components), named as the umbrella over the shipped
*identity* and *inputs* rules — inline, dialect-agnostic, with a generic *why*.
- `<sql_craft>` states a concrete **final completeness check** (re-read the
question → confirm metrics + entity identity + derived-value inputs + grain are
projected), cross-referencing the existing identity/inputs/grain rules so they
are enforced, not merely listed.
- The check carries the **over-projection guard** with a **universal** rationale
(don't pad with unrequested columns — noise / misleading / harder to consume),
and the skill contains **zero** grader/gold/benchmark references anywhere.
- `<workflow>` **step 6** carries a mandatory line that runs the completeness
check **unconditionally** before emitting and points into the `<sql_craft>`
capstone; the rule content is **stated once** in `<sql_craft>` (no duplicated
rationale/guard in the workflow). The empty/grain diagnostic remains available.
- Exactly **one** new worked `sql` example is present (synthetic
`regions`/`region_monthly`, wrong vs complete), in standard dialect-agnostic SQL;
the skill then carries **six** `sql` worked examples total.
- The existing interactive guidance (`<workflow>` steps, `<rules>`, the other
`<sql_craft>` bullets and the five prior examples) is intact and uncontradicted;
the additive-only and dialect-clean invariants from specs 07/10 still hold.
- None of spec 07's excluded items appear (output-shape contract, `MAX(date)`
anchoring of "recent"/"past N", grader-driven advice, dialect syntax).
- The skill stays scannable and comfortably under the 500-line budget; the
frontmatter still parses as `ktx-analytics`.
- The analytics-skill **content test is updated** to cover the new rule and check
(see Implementation orientation).
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md`.
- Add the multi-part-completeness bullet and the final-completeness-check
capstone (with the over-projection guard) to the `<sql_craft>` "Answer
completeness / interpretation" group; add the single
`regions`/`region_monthly` worked example.
- In `<workflow>` step 6, replace the current conditional answer-completeness
pointer with the mandatory completeness-check line (unconditional, names the
four facets, points into `<sql_craft>`); keep the empty/grain diagnostic.
- Leave `<workflow>` steps 0–5, `<rules>`, and the other `<sql_craft>`
bullets/examples intact. Delivery is unchanged (single `SKILL.md` per target
via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change
required.
- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`.
- Add representative phrases to the "represents every craft behavior" list for
the multi-part rule, the final completeness check, and the over-projection
guard.
- Bump the worked-example `sql`-fence count assertion **5 → 6** (and update the
test name/comment), and assert the new example's shape (e.g. `region_monthly`,
`MAX(`, `MIN(`, the difference expression, `region_id`).
- The existing dialect-clean, grader/benchmark-clean, and relative-time
(`MAX(...)` anchoring) guards must still pass — the new example's `MAX`/`MIN`
lines carry no "recent"/"past N" wording, so the phrase-level guard is
unaffected. The `SkillsRegistryService` frontmatter test must still pass.
- Rebuild and re-link the dev binary so the playground picks up the updated skill:
`pnpm run build && pnpm run link:dev`.
## Benchmark context (motivation only)
On the latest SQLite-subset run, **incomplete output was the single largest
failure bucket (~13 of 51 voted failures)**: multi-part questions answered
partially, plus dropped identity / derived-value inputs — the latter two being
spec-07 rules that already exist but weren't applied. A probe with a much stronger
model reproduced the *same* incomplete-output failures, confirming this is a
craft-enforcement gap rather than a model-capability one. The fix — answer every
requested part, identify the entities, keep the inputs, and don't pad — is
universal analyst craft, so it belongs in the product skill (and transfers to real
users), enforced as a final pre-emit check rather than left as a passive hint.
Improving the benchmark score is a side effect; the skill contains no trace of the
benchmark.
## Implementation notes
Implemented as additive content in one Markdown file plus a test update.
- **Skill — `packages/cli/src/skills/analytics/SKILL.md`** (`<sql_craft>` "Answer
completeness / interpretation" group):
- Added the **"Answer every requested output"** umbrella bullet (list / paired
extremes / value-plus-components → a column per requested output, with a generic
*why*). It names *keep the inputs* and *expose identity* as its "value +
components" and "entity identity" instances, pins the closed-set definition of a
complete projection, and marks itself as governing *which columns* appear —
distinct from the *Top …* / *For each X* row-selection rules, with which it
composes. The two shipped instance rules are preserved verbatim.
- Added the **"Final completeness check"** capstone bullet: a four-facet
"before emitting, re-read the question and confirm the projection covers…"
checklist (metric/attribute → multi-part rule; identifier → *expose identity*;
inputs → *keep the inputs*; grain → *for each X* / *complete the panel*), run on
every query. It carries the **over-projection guard** with a universal rationale
(unrequested columns add noise, mislead, and are harder to consume — match the
request exactly), with **no** grader/gold/benchmark reference.
- Added one worked `sql` example (synthetic `regions` / `region_monthly`): WRONG
answers only the first clause (`SELECT region_name, MAX(monthly_orders) …`),
dropping the region id, the lowest, and the difference; RIGHT projects
`r.region_id, r.region_name`, `MAX` highest, `MIN` lowest, and the
`MAX - MIN` difference, joining `regions` to `region_monthly` and grouping by id
+ name. This is the **sixth** `sql` example, dialect-clean (portable `MAX`/`MIN`).
- `<workflow>` **step 6**: replaced the conditional answer-completeness pointer
with an unconditional *"Always run the final completeness check before emitting"*
line that names the four facets and points into the `<sql_craft>` capstone; the
empty/grain diagnostic is retained for diagnosis. Steps 0–5, `<rules>`, and the
other `<sql_craft>` bullets/examples are untouched.
- Delivery is unchanged: `readAnalyticsSkillContent` in
`packages/cli/src/setup-agents.ts` still ships the single `SKILL.md` per target
(confirmed, no change required).
- **Tests — `packages/cli/test/skills/analytics-skill-content.test.ts`:** added the
three representative phrases (`Answer every requested output`, `Final completeness
check`, `Don't over-project`); bumped the `sql`-fence count assertion 5 → 6 and
renamed that test; asserted the new example's shape (`region_monthly`,
`MAX(rm.monthly_orders)`, `MIN(rm.monthly_orders)`, the `MAX - MIN` difference, and
`r.region_id, r.region_name`). The dialect-clean, grader/benchmark-clean,
relative-time, and frontmatter guards still pass.
- **Verification:** `analytics-skill-content` 9/9 and `setup-agents` 46/46 pass;
production type-check (`tsconfig.json`, src) is clean; `pnpm run build` copied the
updated skill into `dist/skills/analytics/SKILL.md` (6 fences, all new content
present) and `pnpm -w run link:dev` re-linked `ktx-dev` so the playground picks it
up. The skill is 244 lines (< 500 budget) and the frontmatter still parses as
`ktx-analytics`.
- **Deviation (cosmetic):** the worked example uses alias `rm` and a difference
column named `order_count_range`; the intake draft sketched alias `m` and
`AS difference`. The spec leaves prose to the implementer, so the change is purely
naming.
- **Unrelated pre-existing issue:** `tsconfig.test.json` reports one type error in
`packages/cli/test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts`/`contextTools`
mismatch introduced by the earlier connection-scoped-wiki commit `2677b3ef`). It is
untouched by this work and out of scope here.


@ -1,405 +0,0 @@
# Structured, leveled logging for the ktx MCP server
> Refined spec. Intake draft: `todo/15-mcp-server-structured-logging.md`.
>
> **Scope: observability only.** This spec is about *seeing* what the MCP server
> does (which tool, what params, when, how long, outcome). *Preventing* a runaway
> query from blocking the server (off-event-loop / interruptible execution) is a
> separate concern — see "Non-goals".
## Problem
The ktx MCP server (`mcp-http-server.ts` + `mcp-stdio-server.ts`, both built
through `mcp-server-factory.ts` on raw `node:http` + the
`@modelcontextprotocol/sdk` transports) emits almost no operational logs. There
is no server-side record of **which MCP tool was called, with what parameters,
when, how long it took, or whether it succeeded** — nor of session open/close or
transport errors. When a tool call is slow, hangs, or a client connection drops
("Transport channel closed"), an operator has no trail to diagnose it and must
resort to process sampling / `lsof` / guesswork — and the offending input
(e.g. the exact SQL) is typically unrecoverable.
The hook to fix this already exists but is half-built: `instrumentMcpServer`
(`context/mcp/context-tools.ts`) wraps every tool handler and already times it,
but it emits **only on completion** (a sampled `mcp_request_completed` telemetry
event) and **never writes a start line and never writes to the server log**. A
call that never returns therefore leaves no trace at all.
## Generic use case (independent of any benchmark)
Anyone running a long-lived ktx MCP server — a developer's local instance
(stdio, launched by Claude Desktop / Cursor), a foreground HTTP server, or a
shared/hosted HTTP daemon — needs observability into tool-call activity to:
- diagnose slow or hung tool calls (which `sql_execution` ran, against which
connection, with what SQL, for how long);
- explain client-visible connection failures from the server side (session
lifecycle, transport-closed events);
- audit what agents asked the server to do;
- spot patterns (hot tools, slow connections, error rates).
This is standard production-server hygiene; the server currently provides none.
## Design decisions (resolved during refinement)
These resolve ambiguities the intake draft left open. They constrain the
implementer; the exact code is theirs.
### One `pino` logger, synchronous, written to **stderr**
Use `pino` — the de-facto standard structured-JSON logger for Node servers — as
a single shared instance. Two corrections to the draft's sketch:
- **stderr, not stdout.** The stdio transport reserves **stdout** for the
JSON-RPC protocol (`mcp-stdio-server.ts` deliberately no-ops `stdout.write`);
writing logs there would corrupt the protocol stream. The HTTP daemon already
redirects **both** child fds to `.ktx/logs/mcp.log`
(`managed-mcp-daemon.ts`: `stdio: ['ignore', log.fd, log.fd]`), so stderr lands
in the same log file (surfaced by `ktx mcp logs`). **stderr is therefore the
one universally-correct sink** for both transports.
- **Synchronous, no worker-thread transport.** `pino` writes through a
`DestinationStream` (`{ write(msg) }`) — the server's existing
`KtxCliIo.stderr` sink satisfies that interface directly. Configure pino with a
**synchronous** destination (`pino.destination({ sync: true })`, or the
pino-pretty stream below with `sync: true`). This is load-bearing: the
`tool.start` line **must** be flushed to the fd *before* the (possibly
blocking) handler runs, so a runaway synchronous `better-sqlite3` query that
pegs the event loop still leaves the start line on disk. A worker-thread
transport (`transport: { target: ... }`) buffers and can lose that exact line
on a hard crash — **do not use transport mode.**
### Format is derived from `stderr.isTTY`, not a config flag
One logger, two serializations chosen by the environment (the "behavior follows
from inputs" rule — not a user-visible knob):
- **TTY** (`ktx mcp start --foreground` or `ktx mcp stdio` run in a terminal) →
**`pino-pretty` as a synchronous in-process stream** (`pretty({ sync: true,
destination: <stderr sink> })`, colorized). A readable live dev view.
- **Not a TTY** (the detached daemon, whose stderr is the `.ktx/logs/mcp.log`
file fd) → **plain JSON line** via the synchronous pino destination. The log
*file* stays structured JSON so the incident workflow ("recover the hung query
with a one-line `grep` / `jq`") works — colorized ANSI in a file would defeat
it.
`KtxCliIo.stderr` has no `isTTY` field (`cli-runtime.ts`), so detect the terminal
from the underlying stream (`process.stderr.isTTY`) at logger construction, while
still writing *through* the `io.stderr` sink so tests can capture emitted lines.
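
A minimal sketch of the "one logger, two serializations" rule, written as a dependency-free stand-in rather than with `pino` / `pino-pretty` themselves. `LineSink`, `createLineLogger`, and the pretty layout are hypothetical; only the `{ write(msg) }` sink shape and the TTY-derived choice come from the text above. Real pino output differs (numeric `level`, epoch-millisecond `time`).

```typescript
// Hypothetical stand-in for the pino / pino-pretty pair: same sink shape, no dependencies.
interface LineSink {
  write(msg: string): void;
}

type Level = "debug" | "info" | "warn" | "error";

interface LineLogger {
  log(level: Level, msg: string, fields?: Record<string, unknown>): void;
}

// `isTTY` is passed in as a seam: production would hand over process.stderr.isTTY,
// a test hands over a literal.
function createLineLogger(sink: LineSink, isTTY: boolean): LineLogger {
  return {
    log(level, msg, fields = {}) {
      const time = new Date().toISOString();
      const line = isTTY
        ? `${time} ${level.toUpperCase()} ${msg} ${JSON.stringify(fields)}` // human-readable
        : JSON.stringify({ level, time, msg, ...fields }); // one JSON object per line
      // One synchronous write per record; the logger itself buffers nothing.
      sink.write(line + "\n");
    },
  };
}

// Capture lines in memory, the way a test would inject a capturing io.stderr.
const captured: string[] = [];
const capture: LineSink = {
  write: (msg) => {
    captured.push(msg);
  },
};

createLineLogger(capture, false).log("info", "tool.start", { tool: "sql_execution" });
createLineLogger(capture, true).log("info", "tool.start", { tool: "sql_execution" });
```

Because the format is an argument and not a global lookup, the same factory serves the daemon (JSON to the log file) and a terminal session (readable lines), and tests can exercise both paths without a real TTY.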
### Single hook: extend `instrumentMcpServer`, do not fork a second wrapper
Tool-call logging is added to the existing `instrumentMcpServer`
(`context-tools.ts`), which already wraps `registerTool` and measures duration.
It receives the **raw** tool input (it wraps the schema-parsing handler from
`registerParsedTool`), so the params it logs include `sql` for `sql_execution`.
The existing telemetry emission stays unchanged; logging is **additive** beside
it. Because both transports build their server through `mcp-server-factory.ts` →
`registerKtxContextTools`, this single change gives **both HTTP and stdio**
tool-call logging for free.
### `sessionId` / `callId` provenance
- **`sessionId`** comes from the SDK's per-call handler context
(`RequestHandlerExtra.sessionId`; confirmed present in `@modelcontextprotocol/sdk`
`1.29.0`). It is populated for the HTTP StreamableHTTP transport and absent for
stdio (single session) — log it when present, omit otherwise. Add
`sessionId?: string` to `KtxMcpToolHandlerContext` (`context/mcp/types.ts`).
- **`callId`** is generated per invocation with `randomUUID()` (already imported
in `context-tools.ts`). It correlates a `tool.start` with its `tool.end`.
### No redaction in v1 (explicit)
v1 ships **no log redaction**. Rationale recorded here so it is a deliberate
choice, not an oversight: these logs are **local** (stderr → `.ktx/logs/mcp.log`),
**never transmitted off-box**, and sit at the **same trust boundary** as the
`ktx.yaml` / environment that already hold the connection credentials. Concretely:
- Request **headers are never logged** at all, so the bearer token
(`KTX_MCP_TOKEN`) simply isn't collected — this is "not logged," not "redacted."
- Errors are logged with their **full message and stack** via pino's standard
`err` serializer.
- SQL text and tool params are logged **verbatim** (they are not secrets).
Credential redaction (e.g. a DB URL embedded in a driver error string) is an
explicit **v1 non-goal**; revisit only if these logs are ever shipped off-box.
This drops the draft's "light redaction" requirement and the
`collectTelemetryRedactionSecrets` / scrubber reuse it implied.
## Requirements
### 1. One shared pino logger
- A single `pino` instance per server process, constructed once and threaded to
both the transport layer (for lifecycle events) and the tool layer (for
tool-call events). Level set from env (Requirement 7), default `info`.
- Synchronous destination bound to the server's stderr sink (see Design
decisions). Pretty (`pino-pretty`, sync stream) when `process.stderr.isTTY`,
otherwise plain JSON. Each line carries pino's standard `time` and `level`.
- No new dependency beyond `pino` and `pino-pretty`. No OpenTelemetry / metrics
stack, no async/worker transport, no in-app file rotation.
### 2. Per-session / per-call context via child loggers
Use pino child loggers so every line carries the relevant correlation fields:
a per-call child binds `{ tool, callId }` plus `sessionId` when present, so one
session's or one call's activity can be grepped from the log.
### 3. Tool-call logging — START before execute, END after
In `instrumentMcpServer`, for **every** MCP tool invocation:
- **On entry, before invoking the handler**, write `tool.start` with
`{ tool, callId, sessionId?, params }` at **`info`**. `params` is the raw tool
input; for `sql_execution` this includes the full **SQL text** (the single most
useful field). The write is synchronous so the line exists even if the handler
never returns.
- **On normal completion**, write `tool.end` with
`{ tool, callId, sessionId?, durationMs, outcome: "ok", resultSize }` at
**`info`** — *unless* it is a slow call (Requirement 4). `resultSize` is a
tool-agnostic size measure (byte length of the serialized result text content).
- **On error**, write `tool.end` with
`{ tool, callId, sessionId?, durationMs, outcome: "error", err }` at **`error`**,
where `err` is the serialized error (message + stack) per Requirement 6.
`tool.start` and `tool.end` share the **same correlation fields and the same
`info` level** (for the non-slow, non-error case) so that an **unmatched
`tool.start`** — a start with no `tool.end` for the same `callId` — is an
unambiguous "this call hung" signal. This is the property that makes a runaway
`sql_execution` identifiable from the log alone, with its exact SQL and
timestamp, no process sampling.
> **Deliberate change from the intake draft.** The draft put `tool.start` /
> `tool.end` at `debug` (suppressed at the default `info`). That defeats the
> motivating incident: a hang is unpredictable, so debug would have to be enabled
> *before* it occurs, which never happens. v1 logs start/end at **`info`** — an
> always-on access log — so the offending query is recoverable at the default
> level. `debug` is reserved for heavier detail (Requirement 7).
### 4. Slow-call warning
When a call **completes** with `durationMs` greater than the configured slow
threshold (Requirement 7), emit its `tool.end` at **`warn`** (carrying the same
fields plus the duration) instead of `info`. This makes a completed-but-slow call
stand out and keeps it visible even when the level is raised to `warn`.
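
Requirements 3 and 4 can be sketched as one wrapper, assuming a simplified handler that returns its result text directly. `withToolLogging`, `Emit`, and `Handler` are hypothetical names for illustration; the real hook is `instrumentMcpServer`, whose signature is not shown in this spec and will differ.

```typescript
import { randomUUID } from "node:crypto";

type Level = "info" | "warn" | "error";
type Emit = (level: Level, msg: string, fields: Record<string, unknown>) => void;
type Handler = (params: unknown) => Promise<string>;

// Hypothetical wrapper: start line before the handler, end line after it.
function withToolLogging(
  tool: string,
  handler: Handler,
  emit: Emit,
  slowMs: number,
  sessionId?: string,
): Handler {
  return async (params) => {
    // Correlation fields shared by tool.start and tool.end; sessionId only when present.
    const bound = { tool, callId: randomUUID(), ...(sessionId ? { sessionId } : {}) };
    // Emitted before the handler is invoked, so a call that never returns
    // still leaves its start line (including the raw params) behind.
    emit("info", "tool.start", { ...bound, params });
    const startedAt = Date.now();
    try {
      const text = await handler(params);
      const durationMs = Date.now() - startedAt;
      emit(durationMs > slowMs ? "warn" : "info", "tool.end", {
        ...bound,
        durationMs,
        outcome: "ok",
        resultSize: Buffer.byteLength(text, "utf8"), // UTF-8 bytes, not characters
      });
      return text;
    } catch (err) {
      emit("error", "tool.end", {
        ...bound,
        durationMs: Date.now() - startedAt,
        outcome: "error",
        err,
      });
      throw err;
    }
  };
}

// Usage: collect events in memory instead of writing them to stderr.
const events: Array<{ level: Level; msg: string; fields: Record<string, unknown> }> = [];
const emit: Emit = (level, msg, fields) => {
  events.push({ level, msg, fields });
};
const echo = withToolLogging("sql_execution", async () => "h\u00e9llo", emit, 10_000, "s-1");
```

The `callId` is created once per invocation and spread into both lines, which is what lets a reader pair a start with its end, or notice that the end is missing.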
### 5. Connection / session lifecycle and transport errors
- **HTTP** (`mcp-http-server.ts`, in `newTransport`): log `session.open` from
`onsessioninitialized` and `session.close` from `onsessionclosed` /
`transport.onclose`, each with `sessionId`, at `info`. **Wire the currently
unused `transport.onerror`** to log `transport.error` (the SDK's
closed-channel / "Transport channel closed" events) at `error`, so a
client-visible connection failure has a server-side counterpart.
- **stdio** (`mcp-stdio-server.ts`): route the existing raw
`transport.onerror` stderr string (it currently writes a plain string) through
the logger as a `transport.error` line at `error`. A single `session.open` /
`session.close` pair for the one stdio connection MAY be logged at `info`.
### 6. Structured error logging
Errors are logged as structured objects via pino's standard `err` serializer
(`pino.stdSerializers.err` or equivalent), carrying error class, message, and
stack — never a bare interpolated string. The existing telemetry exception
reporting in `instrumentMcpServer` / `registerParsedTool` is unchanged.
### 7. Configuration surface
- **`KTX_MCP_LOG_LEVEL`** — pino level (`error` | `warn` | `info` | `debug` |
…), default **`info`**. MCP-scoped name because the MCP server is the only
emitter today; naming it global (`KTX_LOG_LEVEL`) would imply a logging system
that does not exist.
- **`KTX_MCP_SLOW_TOOL_MS`** — slow-call threshold in milliseconds (Requirement
4), default **`10000`**. Justified as a real ops knob: "slow" differs sharply
between a local SQLite file and a remote warehouse.
- Level ladder that results from Requirements 3–5:
- `debug`: everything below **plus** heavier detail (e.g. result bodies,
progress notifications) — implementer's discretion on what extra to attach.
- `info` (default): `tool.start` / `tool.end`, session lifecycle, slow `warn`s,
errors.
- `warn`: slow-call `tool.end`s, `transport.error`, errored `tool.end`s — but
not routine tool traffic.
- `error`: errored `tool.end`s and `transport.error` only.
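
The configuration surface and the level ladder above can be sketched as follows. `parseLogLevel`, `parseSlowToolMs`, and `isEnabled` are hypothetical helpers, and falling back to the default on an invalid value is an assumption (the spec does not say whether a bad value should fall back or fail). The numeric values match pino's standard levels; pino also defines `trace` and `fatal`, omitted here.

```typescript
// Numeric ordering of the four levels the ladder mentions.
const LEVEL_VALUES = { debug: 20, info: 30, warn: 40, error: 50 } as const;
type LevelName = keyof typeof LEVEL_VALUES;
const LEVEL_NAMES = Object.keys(LEVEL_VALUES) as LevelName[];

type Env = Record<string, string | undefined>;

// Assumption: an unset or unrecognized value falls back to "info".
function parseLogLevel(env: Env): LevelName {
  const raw = (env.KTX_MCP_LOG_LEVEL ?? "").trim().toLowerCase();
  const match = LEVEL_NAMES.find((name) => name === raw);
  return match ?? "info";
}

// Assumption: an unset, non-numeric, or non-positive value falls back to 10000 ms.
function parseSlowToolMs(env: Env): number {
  const parsed = Number(env.KTX_MCP_SLOW_TOOL_MS);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : 10_000;
}

// A line is written when its level is at or above the configured level.
function isEnabled(configured: LevelName, line: LevelName): boolean {
  return LEVEL_VALUES[line] >= LEVEL_VALUES[configured];
}
```

With the level raised to `warn`, `isEnabled("warn", "info")` is false, so routine `tool.start` / `tool.end` lines drop out while slow-call warnings and errors remain.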
## Acceptance criteria
- At default level (`info`), invoking any MCP tool produces a `tool.start`
(`tool`, `callId`, `sessionId` when HTTP, `params`) and a matching `tool.end`
(`durationMs`, `outcome`, `resultSize`) line, as **JSON to stderr** when stderr
is not a TTY.
- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a
`tool.start` line carrying its **exact SQL and timestamp** and **no** matching
`tool.end` for that `callId` — so the offending query is recoverable from the
log alone, with no process sampling.
- A completed call slower than `KTX_MCP_SLOW_TOOL_MS` emits its `tool.end` at
`warn` with its `durationMs`.
- Session open/close and transport-closed (`transport.error`) events are logged
with the `sessionId` (HTTP); the stdio transport error path goes through the
logger, not a raw `stderr.write`.
- At level `warn`, routine `tool.start` / `tool.end` are suppressed but
slow-call warnings, transport errors, and errored calls are present.
- When stderr is a TTY (`ktx mcp start --foreground` / `ktx mcp stdio` in a
terminal), output is human-readable colorized `pino-pretty`; the daemon log
file (`.ktx/logs/mcp.log`) is plain JSON. Both paths are synchronous.
- The bearer token never appears in any log line (headers are not logged); SQL
and tool params do appear.
- No worker-thread / async log transport is introduced; no OpenTelemetry /
metrics stack; the only new dependencies are `pino` and `pino-pretty`.
- The existing `mcp_request_completed` telemetry and exception reporting still
work unchanged.
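
The "unmatched `tool.start`" criterion lends itself to a small log scan. The sketch below assumes the event name lands in pino's default `msg` field and that `callId` is a top-level string on each line; `findHungCalls` and `ToolLine` are hypothetical names, not part of the spec.

```typescript
// Only the fields the scan needs; real log lines carry more.
interface ToolLine {
  msg?: string;
  callId?: string;
  time?: number | string;
  params?: unknown;
}

// Returns every tool.start line that has no tool.end with the same callId.
function findHungCalls(logText: string): ToolLine[] {
  const open = new Map<string, ToolLine>();
  for (const raw of logText.split("\n")) {
    if (raw.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // tolerate non-JSON lines mixed into the file
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const line = parsed as ToolLine;
    if (typeof line.callId !== "string") continue;
    if (line.msg === "tool.start") open.set(line.callId, line);
    else if (line.msg === "tool.end") open.delete(line.callId);
  }
  return Array.from(open.values());
}

// Usage with an in-memory sample: call "a" completed, call "b" never did.
const sampleLog = [
  JSON.stringify({ level: 30, time: 1, msg: "tool.start", callId: "a", params: { sql: "SELECT 1" } }),
  JSON.stringify({ level: 30, time: 2, msg: "tool.start", callId: "b", params: { sql: "SELECT COUNT(*) FROM profits" } }),
  "not a json line",
  JSON.stringify({ level: 30, time: 3, msg: "tool.end", callId: "a", durationMs: 2, outcome: "ok" }),
].join("\n");

const hungCalls = findHungCalls(sampleLog);
```

Each returned entry still carries its `params`, so the SQL of the stuck call can be read straight from the result.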
## Non-goals
- **Preventing / interrupting runaway queries** (off-event-loop execution, query
timeouts, worker-thread isolation). A single synchronous query that fans out
into a massive nested-loop join can peg the single-threaded server for hours
and break new connections — observability surfaces *which* query, but the fix
is execution-model work in a separate spec. (This logging is also the
prerequisite for a future watchdog that detects a `tool.start` with no
`tool.end` past a threshold and recycles the server.)
- **Log redaction** (see Design decisions) — explicit v1 non-goal.
- **Pretty output as a worker-thread transport** — the TTY path uses pino-pretty
as a synchronous in-process stream only.
- Metrics / tracing / OpenTelemetry exporters.
- Forwarding logs to the MCP *client* via the protocol logging capability
(`notifications/message`, `logging/setLevel`) — a possible later enhancement,
distinct from operational stderr logging.
- A global `KTX_LOG_LEVEL` spanning non-MCP commands — out of scope until other
surfaces emit structured logs.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the design.
- **New module** — a small logger factory, e.g.
`packages/cli/src/context/mcp/logger.ts`: builds the shared pino instance from
the stderr sink + `KTX_MCP_LOG_LEVEL`, choosing the pino-pretty (sync) stream
when `process.stderr.isTTY` else `pino.destination({ sync: true })`, and
exposes a `slow-threshold` read from `KTX_MCP_SLOW_TOOL_MS`.
- **Tool-call logging** — `packages/cli/src/context/mcp/context-tools.ts`:
extend `instrumentMcpServer` (~line 585) to write `tool.start` before
`handler(...)` and `tool.end` after (ok / slow-`warn` / `error`); generate
`callId` via the already-imported `randomUUID`; read `sessionId` from the
handler `context`. Thread the logger via `RegisterKtxContextToolsDeps`
(~line 26) and `registerKtxContextTools` (~line 650). Leave `registerParsedTool`
and the existing telemetry emission intact.
- **Context type** — `packages/cli/src/context/mcp/types.ts`: add
`sessionId?: string` to `KtxMcpToolHandlerContext`; add the logger to
`KtxMcpServerDeps` / the register deps.
- **Server wiring** — `packages/cli/src/context/mcp/server.ts`
(`createDefaultKtxMcpServer` / `createKtxMcpServer`) and
`packages/cli/src/mcp-server-factory.ts` (`createKtxMcpServerFactory`): accept
and pass the logger down to `registerKtxContextTools`.
- **HTTP lifecycle** — `packages/cli/src/mcp-http-server.ts`: construct (or
receive) the logger; in `newTransport` (~line 186) log `session.open` /
`session.close` and add `transport.onerror` → `transport.error`.
- **stdio lifecycle** — `packages/cli/src/mcp-stdio-server.ts`: construct (or
receive) the logger; route the existing `transport.onerror` (~line 54) through
it.
- **Log destination is already captured** — `packages/cli/src/managed-mcp-daemon.ts`
redirects child stdout+stderr to `.ktx/logs/mcp.log`; `ktx mcp logs`
(`commands/mcp-commands.ts`) tails it. No change needed there.
- **Dependencies** — add `pino` and `pino-pretty` to
`packages/cli/package.json`. Verify Knip/Biome dead-code and bundle checks
still pass.
- **Tests** — extend `packages/cli/test/mcp-http-server.test.ts`,
`mcp-server-factory.test.ts`, `context/mcp/server.test.ts`, and
`commands/mcp-commands.test.ts`: assert (a) a `tool.start` JSON line is written
before a (mock) handler runs and carries `params`/`sql`; (b) a matching
`tool.end` with `durationMs`/`outcome`; (c) a hung-handler scenario yields a
`tool.start` with no `tool.end` for that `callId`; (d) a slow completion emits
`warn`; (e) session lifecycle + `transport.error` lines; (f) the bearer token
never appears. Inject a capturing `io.stderr` and parse the JSON lines.
*Note:* `mcp-server-factory.test.ts` carries a pre-existing
`KtxMcpContextPorts`/`contextTools` type error (from commit `2677b3ef`,
unrelated to this work) — do not let it mask new failures.
- After implementing, rebuild and re-link so the playground picks it up:
`pnpm run build && pnpm run link:dev`.
## Benchmark context (motivation, not a requirement)
Running Spider 2.0-Lite against the MCP server at concurrency, an
adversarial-reviewer-generated query degenerated into a massive nested-loop join;
synchronous `better-sqlite3` executed it on the event loop, pegging a server at
~100% CPU for hours and breaking new MCP connections ("Transport channel
closed"). We could not determine *which* query, because the server logs nothing
about tool calls — diagnosis required `sample` / `lsof` on the live process and
the exact SQL was never recovered. Structured tool-call logging — especially
`tool.start` written synchronously *before* execution, at the default level —
would have turned this into a one-line `grep` of the server log. Improving the
benchmark is a side effect; the logging is generic production-server hygiene.
## Implementation notes
Implemented on branch `write-feature-spec-wiki`. All requirements and acceptance
criteria are satisfied.
**What was built / where**
- **New module `packages/cli/src/context/mcp/logger.ts`** — `createMcpLogger(io,
{ isTTY? })` builds one synchronous `pino` (v10) instance written through the
`io.stderr` sink: plain JSON when stderr is not a TTY, a `pino-pretty` (v13)
synchronous in-process stream (`{ colorize: true, sync: true }`, wrapping the
sink in a `node:stream.Writable`) when it is. Also exports `mcpLogLevel`
(`KTX_MCP_LOG_LEVEL`, validated against pino levels, default `info`),
`mcpSlowToolMs` (`KTX_MCP_SLOW_TOOL_MS`, default `10000`), and
`serializeMcpError`. No worker/async transport; no global `KTX_LOG_LEVEL`.
- **Tool-call logging — `instrumentMcpServer` (`context/mcp/context-tools.ts`)** —
per invocation: `callId = randomUUID()`, a child logger bound to
`{ tool, callId, sessionId? }`, `tool.start { params }` written at `info`
**before** awaiting the handler (synchronous, so a runaway query still leaves it
on disk), and `tool.end` after: `info { durationMs, outcome:"ok", resultSize }`,
`warn` when `durationMs > KTX_MCP_SLOW_TOOL_MS`, or `error { outcome:"error",
err }`. `resultSize` is the UTF-8 byte length of the serialized text content.
The existing `mcp_request_completed` telemetry + `reportException` are unchanged
(`durationMs` is now computed once and shared); `registerParsedTool` is intact.
- **`sessionId` / logger plumbing** — `sessionId?: string` added to
`KtxMcpToolHandlerContext`; a single per-process logger threads from each
transport entrypoint through `createKtxMcpServerFactory` →
`createDefaultKtxMcpServer` → `createKtxMcpServer` → `registerKtxContextTools`
(`KtxMcpServerDeps.logger`, `RegisterKtxContextToolsDeps.logger`).
- **HTTP lifecycle (`mcp-http-server.ts`)** — `session.open` from
`onsessioninitialized`, `session.close` from `transport.onclose`, and the
previously-unused `transport.onerror` wired to `transport.error` at `error`.
- **stdio lifecycle (`mcp-stdio-server.ts`)** — the raw `transport.onerror`
string write is replaced by a `transport.error` log line; `session.open` /
`session.close` are logged for the single stdio session.
- **Deps** — `pino ^10.3.1`, `pino-pretty ^13.1.3` added to
`packages/cli/package.json`.
- **Tests** — `test/context/mcp/logger.test.ts` (factory, level/threshold env
parsing, error serializer, TTY vs JSON), a "MCP tool-call logging" block in
`test/context/mcp/server.test.ts` (start-before-handler, matching end with
`resultSize`, hung-handler leaves an unmatched start, slow→`warn`, `warn`-level
suppression with errored end still present, no-logger no-op), session lifecycle
+ bearer-token-never-logged in `test/mcp-http-server.test.ts`, and
`test/mcp-stdio-server.test.ts` for `transport.error`.
**Deviations / decisions**
- **In-band errors carry no stack (inherent).** `registerParsedTool` converts a
thrown handler error into an `{ isError: true }` result (and reports the full
error via telemetry) before it reaches `instrumentMcpServer`, so the original
stack is already gone. `tool.end` for such a result logs `outcome:"error"` with
`err.message` only; a genuine throw that escapes gets the full pino `err`
serialization (type + message + stack). The field is always `err` for
consistency. This honours "leave `registerParsedTool` intact."
- **`session.close` is logged from `transport.onclose`** (the universal close
signal for both clean DELETE and dropped connections) rather than
`onsessionclosed`, to avoid duplicate lines; `onsessionclosed` keeps its
session-map cleanup role.
- **The logger is optional throughout.** Production always wires one per process;
when absent (programmatic/test callers that inject `createMcpServer`), tool-call
logging is simply off — which keeps existing tests unchanged.
- `createMcpLogger` accepts an optional `isTTY` purely as a test seam; production
derives format from `process.stderr.isTTY`.
**Verification**
`pnpm --filter @kaelio/ktx exec vitest run` for the four touched/added MCP test
files: 57 passed. Full default `pnpm run test`: 3018 passed, 1 skipped — the only
2 failures are in `test/skills/analytics-skill-content.test.ts`, pre-existing and
unrelated to this change (in-progress analytics-skill work on this branch).
`pnpm run dead-code` (Biome + Knip default + Knip production) clean. `pnpm run
build` and `pnpm run link:dev` succeed. `pnpm run type-check` reports only the
one pre-existing, test-only error in `test/mcp-server-factory.test.ts` from commit
`2677b3ef` (documented above); all source and the new tests type-check clean.

View file

@ -1,493 +0,0 @@
# Bounded query execution (deadline + non-blocking) for read SQL
> Refined spec. Intake draft: `todo/16-bounded-query-execution-timeout.md`.
>
> **Scope: bound and cancel a read query that runs too long.** This is the
> execution-model companion to spec 15 (MCP structured logging). Spec 15
> *surfaces* a runaway query in the log; it explicitly defers *preventing* one —
> "off-event-loop execution, query timeouts, worker-thread isolation … is
> execution-model work in a separate spec." This is that spec.
## Problem
Two compounding gaps on the read-query path (`executeReadOnly`), confirmed in the
current code:
1. **No execution deadline, handled divergently per connector.** A single
expensive query runs unbounded, and whether it is bounded at all depends
entirely on which driver the caller hit:
- **BigQuery** is the only connector with a real statement timeout — it sets
`jobTimeoutMs` on the query job from a per-connection config field
`job_timeout_ms` (`connectors/bigquery/connector.ts`, `query(...)` ~491–512).
- **ClickHouse** sets a hardcoded 30s *HTTP* `request_timeout` at client
creation (`connectors/clickhouse/connector.ts:602`) — a client-side give-up,
not a server-side `max_execution_time`; the server keeps working.
- **Snowflake, Postgres, MySQL, SQL Server** bound only pool/connection
*acquisition* (Snowflake `acquireTimeoutMillis: 60_000`; Postgres
`connectionTimeoutMillis: 10_000`; SQL Server `idleTimeoutMillis: 30000`;
MySQL pool size only) — nothing bounds statement *execution*.
- **SQLite** has nothing.
2. **In-process SQLite blocks the event loop and cannot be cancelled.** The
SQLite connector executes on the main thread via synchronous
`better-sqlite3 .prepare().all()` (`connectors/sqlite/connector.ts`,
`query(...)` 311–318, used by `executeReadOnly` 247–251). A slow query freezes
the whole MCP server — it cannot serve other requests, send progress, or write
`tool.end` — and there is no in-thread way to interrupt it: better-sqlite3 (v12)
exposes no interrupt/cancel API. Its documented mechanism for slow queries is a
**worker thread**, and the only way to stop a runaway synchronous query is to
**terminate the thread** executing it (context7 `/wiselibs/better-sqlite3`,
`docs/threads.md`).
The observed failure (Spider2-lite sqlite run, 2026-06-18): a single
`sql_execution` MCP call —
`SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on `complex_oracle`,
where `profits` is a VIEW (`costs ⋈ sales`, 918,843 × 82,112 rows, joined on a
4-column key with no composite index) — degraded to an O(N×M) nested-loop scan,
pegged a worker at 100% CPU for 13+ minutes, never returned, produced a
`tool.start` with no matching `tool.end`, and stalled an eval shard until the
worker was killed by hand. A row cap (`maxRows`) does not help: it bounds returned
rows, not scan work, and the failing query returned a single aggregate row.
## Generic use case (independent of any benchmark)
Any data agent that lets an LLM author SQL will eventually issue an
accidentally-expensive query — an unindexed or cartesian join, an expensive VIEW,
a wide aggregate over a large fact table. A general-purpose context layer must
bound that and return a clean, fast "query exceeded Ns" error so the agent can
revise (add filters, query base tables, narrow the range) instead of hanging the
tool and the server. This matters for embedded/local warehouses (SQLite, and any
future DuckDB-style in-process driver) and remote ones alike, and is wholly
independent of any benchmark.
## Design decisions (resolved during refinement)
These resolve ambiguities the intake draft left open. They constrain the
implementer; the exact code is theirs.
### One canonical deadline, applied uniformly at the contract
The deadline is enforced for **every** `executeReadOnly` caller, not only the MCP
`sql_execution` path. `executeReadOnly` has 13 call sites beyond MCP (ingest query
executor, relationship profiling and composite-candidate probes, relationship
validation, historic-SQL probes, `ktx sql`); the contract is the single place to
bound all of them. A heavy ingest profiling probe over a giant unindexed join is
exactly as worth abandoning as an interactive one — those call sites are
best-effort and degrade gracefully, so a deadline `KtxQueryError` becomes "skip
this probe / mark unprofiled," not "fail the source." (Requirement 8 covers the
call sites that must treat the timeout as recoverable.)
> Rejected alternative: a caller-resolved deadline (short on the interactive path,
> longer/none for ingest). That introduces a second value source and the open
> question "what is the ingest budget," for no real gain — the 30s default already
> clears any normal profiling probe, and a probe that exceeds it is one to drop.
### Default 30s, configurable per-connection via one shared field
- **Default `30_000` ms.** Fast enough that an LLM agent gets a clean
"exceeded 30s" and revises within the same turn; generous headroom over any
indexed aggregate or normal profiling probe; a genuine pathological nested-loop
scan blows past it immediately.
- **One shared per-connection override**, honored by every connector:
`query_timeout_ms` in `ktx.yaml` (`queryTimeoutMs` in TS), a positive integer
in **milliseconds**. Milliseconds matches the BigQuery SDK and the field it
replaces; the user-facing error still reads in seconds.
- **BigQuery's `job_timeout_ms` config key is removed**, not kept alongside the
new field. BigQuery reads the shared `query_timeout_ms` and maps the resolved
value onto its SDK's `jobTimeoutMs`. ktx keeps no backward compatibility, so
there is exactly one way to set a query timeout — no parallel knob (intake
requirement 1).
- **Granularity is per-connection only.** No global all-connections override —
different warehouses have different performance envelopes, and a second
(global) knob would double the configuration surface for no stated need.
### The shared contract is a value + an error, not a base class
There is **no shared connector base class or factory** — each connector is
constructed independently; the only shared registry is the *dialect* factory
(`context/connections/dialects.ts:47–55`). So "defined once" (intake requirement
3) means a single shared module that owns:
- `DEFAULT_QUERY_TIMEOUT_MS = 30_000`;
- `resolveQueryDeadlineMs(connectionConfig)` → the validated `query_timeout_ms`
override, else the default — so the default and the override precedence live in
exactly one place;
- `queryDeadlineExceededError(deadlineMs)` → a `KtxQueryError` with the canonical
message `query exceeded ${Math.round(deadlineMs / 1000)}s`.
Each connector calls the resolver once (at construction; connectors already
receive their connection config) and stores `this.deadlineMs`. **Enforcement is
necessarily per-connector** — different engines cancel differently — but the
*value* and the *error message* are shared, so the agent sees one consistent,
actionable error regardless of driver.
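
A minimal sketch of the shared "value + error" module, using the three export names from the list above. The `KtxQueryError` class here is a local stand-in (the project's own class is not shown in this spec), `ConnectionConfig` is an assumed shape, and validating inside the resolver is an assumption made to keep the example self-contained; the spec places validation in the shared config schema.

```typescript
// Stand-in for the project's KtxQueryError, which is defined elsewhere.
class KtxQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "KtxQueryError";
  }
}

const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Assumed config shape: only the one field this module reads.
interface ConnectionConfig {
  query_timeout_ms?: unknown;
}

// The override wins when present and valid; otherwise the default applies.
function resolveQueryDeadlineMs(config: ConnectionConfig): number {
  const raw = config.query_timeout_ms;
  if (raw === undefined) return DEFAULT_QUERY_TIMEOUT_MS;
  if (typeof raw !== "number" || !Number.isInteger(raw) || raw <= 0) {
    throw new Error(
      `query_timeout_ms must be a positive integer (milliseconds), got ${String(raw)}`,
    );
  }
  return raw;
}

// Configured in milliseconds, reported to the agent in whole seconds.
function queryDeadlineExceededError(deadlineMs: number): KtxQueryError {
  return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`);
}
```

A connector would call `resolveQueryDeadlineMs` once at construction and keep the result, so the default and the override precedence are never restated per driver.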
### Real cancellation, not client-side give-up
Per intake requirement 5, the deadline must *stop the work*, not merely abandon
the promise while the query keeps running (which on a pooled driver also risks
returning a still-busy connection to the pool). So:
- **In-process (SQLite, and any future embedded driver):** run the query off the
main thread and enforce the deadline by **terminating the worker thread**. There
is no generic `Promise.race` outer wrapper — a `Promise.race` against a
synchronous in-thread `.all()` can never fire (the loop is blocked), and against
a pooled remote query it would poison the pool. Thread termination *is* the
cancellation.
- **Remote engines:** set the engine's **server-side statement timeout** so the
server itself aborts the query and frees the connection cleanly.
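
The in-process case can be illustrated with Node's `worker_threads` alone. This is a sketch under stated assumptions, not the connector's implementation: `runWithDeadline` is a hypothetical helper, the worker body is an inline JavaScript string (via the `eval: true` option) instead of a real worker file, and a busy loop stands in for a runaway synchronous `better-sqlite3` query.

```typescript
import { Worker } from "node:worker_threads";

// Runs `source` (plain JavaScript evaluated inside a worker thread) and
// rejects once `deadlineMs` elapses, terminating the thread.
function runWithDeadline(source: string, deadlineMs: number): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(source, { eval: true });
    // The timer lives on the main thread, whose event loop stays free
    // while the worker is stuck in synchronous code.
    const timer = setTimeout(() => {
      void worker.terminate(); // stops the thread; this is the cancellation
      reject(new Error(`query exceeded ${Math.round(deadlineMs / 1000)}s`));
    }, deadlineMs);
    worker.once("message", (value: unknown) => {
      clearTimeout(timer);
      void worker.terminate(); // one short-lived worker per call
      resolve(value);
    });
    worker.once("error", (err: Error) => {
      clearTimeout(timer);
      reject(err);
    });
  });
}

// A busy loop stands in for a query that never returns.
const runawaySource = "while (true) {}";
// A worker that posts one value back and finishes.
const quickSource = "require('worker_threads').parentPort.postMessage(42)";
```

The timer can fire only because the blocking work is on another thread; the same timer placed around an in-thread synchronous call would never get a turn on the event loop.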
### Logging routes through spec 15's pino path — no second logger
The deadline cases are logged through the **existing** MCP tool-call logger
(spec 15's `instrumentMcpServer`, `context/mcp/context-tools.ts:644–730`), not a
new logging path threaded into the connector. Verified flow for a timeout:
`executeReadOnly` throws `queryDeadlineExceededError` (a `KtxQueryError`) →
`local-project-ports.ts` preserves it → `registerParsedTool` (:552) reports it
(`reportException` skips `$exception` for `KtxExpectedError`) and returns an
in-band `isError` result → `instrumentMcpServer` writes `tool.end` at **`error`**
with `outcome:"error"`, `err.message = "query exceeded {N}s"`, and the **same
`callId`** as the `tool.start`.
This is the central observability win and it requires **no new MCP logging code**:
spec 15 made a hang show up as a `tool.start` with *no* matching `tool.end`; this
spec turns it into a **matched `tool.start` → `tool.end(error)` pair** whose
`tool.end` names the deadline. The worker-termination (SQLite) and server-side
abort (remote) are internal enforcement mechanisms; their single observable signal
is that `tool.end`, so the connector does **not** get its own logger threaded
through `KtxScanContext` — that would fork a second path for one capability. The
"worker was actually reaped, not left spinning" guarantee is asserted by the
worker's `exit` event in tests (Requirement 3), not by a log line.
## Requirements
### 1. Shared deadline contract, defined once
A single new module (e.g. `packages/cli/src/context/connections/query-deadline.ts`)
exports `DEFAULT_QUERY_TIMEOUT_MS` (30_000), `resolveQueryDeadlineMs(connectionConfig)`,
and `queryDeadlineExceededError(deadlineMs)`. Every connector resolves its
deadline through this resolver; no connector hardcodes its own default or
duplicates the override-precedence logic.
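
A minimal sketch of what such a module could look like, assuming the connection config is a plain object with an optional `query_timeout_ms` key. The `KtxQueryError` class below is a local stand-in for the real one in `packages/cli/src/errors.ts`, `ConnectionConfigSketch` is a hypothetical type, and the whole-second rounding in the message is an assumption rather than something this requirement fixes.

```typescript
// Stand-in for the real KtxQueryError; only the shape matters for this sketch.
class KtxQueryError extends Error {
  constructor(message: string, readonly cause?: unknown) {
    super(message);
    this.name = "KtxQueryError";
  }
}

const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Hypothetical config shape: only the field this spec adds.
interface ConnectionConfigSketch {
  query_timeout_ms?: unknown;
}

// Single place where the default and the override precedence live.
function resolveQueryDeadlineMs(connection: ConnectionConfigSketch): number {
  const override = connection.query_timeout_ms;
  if (override === undefined) return DEFAULT_QUERY_TIMEOUT_MS;
  if (typeof override !== "number" || !Number.isInteger(override) || override <= 0) {
    throw new Error(`query_timeout_ms must be a positive integer (ms), got ${String(override)}`);
  }
  return override;
}

// Canonical timeout error; the driver error, if any, rides along as `cause`.
function queryDeadlineExceededError(deadlineMs: number, cause?: unknown): KtxQueryError {
  return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`, cause);
}
```

A connector would call `resolveQueryDeadlineMs` once and keep the result, so the default and the validation rule each exist in exactly one place.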
### 2. Shared per-connection config field; BigQuery's removed
`query_timeout_ms` is added to the **shared** connection config schema (validated
as an optional positive integer, milliseconds) so every driver accepts it. The
BigQuery-specific `job_timeout_ms` config field and its dedicated reader
(`bigQueryJobTimeoutMsFromConnection`) are removed; BigQuery sources its timeout
from the shared field and applies it as `jobTimeoutMs`. A bad `query_timeout_ms`
(zero, negative, non-integer) is a clear config validation error, consistent with
how ktx validates `ktx.yaml`.
### 3. SQLite executes off the main thread, terminated on deadline
`executeReadOnly` on the SQLite connector MUST NOT block the MCP server event
loop:
- Read-only validation and the row-limit wrapper (`assertReadOnlySql` +
`limitSqlForExecution`) run **on the main thread** before dispatch — invalid SQL
fails instantly without spawning a worker, and read-only enforcement stays at
the boundary (Requirement 7).
- The validated, row-limited SQL (and any params) is dispatched to a **worker
thread** that opens the database `{ readonly: true, fileMustExist: true }`, runs
the query, and posts back `{ headers, rows, totalRows }` (all values are
structured-cloneable — primitives, `Buffer`, `BigInt`).
- The main thread arms a timer for `this.deadlineMs`; on expiry it calls
`worker.terminate()` and rejects with `queryDeadlineExceededError`. On a normal
message it clears the timer and resolves. On a worker error (SQLite rejected the
SQL) it rejects with that error, message preserved. A provided
`ctx.signal` (`KtxScanContext.signal`, already on the contract) also terminates
the worker, for external cancellation.
- **One short-lived worker per call**, terminated on completion or deadline — not
a persistent worker or pool. Terminate-on-deadline destroys the worker, so a
pool would need respawn/job-tracking for no benefit: `executeReadOnly` is
low-frequency (LLM-issued, serial per agent turn) and worker spawn cost is
negligible against query latency. The other SQLite paths (introspect, sample,
stats, distinct-values, row-count) stay on the main thread — they are
ktx-authored, bounded, and not on the `executeReadOnly` contract.
- The event loop stays responsive throughout, so `tool.end` is always written and
concurrent requests on the same port are served.
### 4. Remote engines set a real server-side statement timeout
Each remote connector applies `this.deadlineMs` as its engine's server-side
statement timeout, so the deadline stops server work rather than abandoning the
promise:
| Connector | Mechanism | Unit |
|------------|--------------------------------------------------------|---------------|
| BigQuery | `jobTimeoutMs` on the query job (replaces `job_timeout_ms`) | ms |
| Postgres | `statement_timeout` | ms |
| MySQL | session `max_execution_time` (applies to read-only SELECT — the only kind on this path) | ms |
| Snowflake | `STATEMENT_TIMEOUT_IN_SECONDS` (ALTER SESSION) | s (ceil) |
| ClickHouse | `max_execution_time` setting, with `request_timeout` aligned to the deadline so the HTTP client does not give up before the server aborts | s (ceil) |
| SQL Server | `mssql` `requestTimeout` (TDS attention cancels server-side) | ms |
ClickHouse's existing hardcoded 30s `request_timeout` is brought under this
contract (derived from the resolved deadline), not left as a parallel mechanism.
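
The unit column above implies one conversion rule per engine. A sketch under stated assumptions: `engineTimeouts` and its field names are hypothetical, and the ClickHouse client margin (`clientGraceMs`) is an illustrative parameter, since this requirement only says `request_timeout` must not expire before the server aborts.

```typescript
// Hypothetical helper: derive per-engine timeout values from one resolved deadline.
interface EngineTimeouts {
  postgresStatementTimeoutMs: number;
  snowflakeStatementTimeoutSeconds: number;
  clickhouseMaxExecutionTimeSeconds: number;
  clickhouseRequestTimeoutMs: number;
}

function engineTimeouts(deadlineMs: number, clientGraceMs = 5_000): EngineTimeouts {
  // Second-granular engines round up, so the server-side limit is never
  // shorter than the deadline and never 0 for a positive deadline.
  const seconds = Math.ceil(deadlineMs / 1000);
  return {
    postgresStatementTimeoutMs: deadlineMs,
    snowflakeStatementTimeoutSeconds: seconds,
    clickhouseMaxExecutionTimeSeconds: seconds,
    // The HTTP client waits a little longer than the server limit (assumed margin).
    clickhouseRequestTimeoutMs: seconds * 1000 + clientGraceMs,
  };
}
```

Rounding up matters for sub-second test deadlines: a 1500 ms deadline becomes a 2 s server limit instead of truncating to 1 s.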
### 5. Timeout resolves as a `KtxQueryError` with the canonical message
On exceeding the deadline, the path resolves with a `KtxQueryError`
(`query exceeded {N}s`) — a finite, decision-reaching outcome, never an unbounded
hang. For SQLite the worker-termination path throws `queryDeadlineExceededError`
directly. For remote engines, each connector recognizes **its own** engine's
timeout signal (Postgres `57014`; MySQL errno `3024`; ClickHouse code `159`;
SQL Server `ETIMEOUT`; Snowflake and BigQuery timeout errors) and re-wraps it as
`queryDeadlineExceededError`, keeping the driver error as `cause`. Each connector
owns its driver's signal — there is no central denylist of error codes to
maintain.
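
For the Postgres case, the re-wrap could look like the sketch below. `KtxQueryError` is again a local stand-in, and `isPostgresTimeout` / `rewrapPostgresError` are hypothetical names. The only driver fact relied on is that Postgres reports a cancelled statement with SQLSTATE `57014` in the error's `code` field.

```typescript
// Stand-in for the real KtxQueryError.
class KtxQueryError extends Error {
  constructor(message: string, readonly cause?: unknown) {
    super(message);
    this.name = "KtxQueryError";
  }
}

// SQLSTATE 57014 (query_canceled) is what a statement_timeout abort surfaces as.
const POSTGRES_QUERY_CANCELED = "57014";

function isPostgresTimeout(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { code?: unknown }).code === POSTGRES_QUERY_CANCELED
  );
}

// The connector owns its own signal: anything else passes through untouched.
function rewrapPostgresError(error: unknown, deadlineMs: number): unknown {
  if (!isPostgresTimeout(error)) return error;
  return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`, error);
}
```

Each other connector would have the same two-function shape with its own predicate, which is what keeps the error codes out of any shared list.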
### 6. MCP surfacing and logging via the existing pino path
The MCP `sql_execution` path already (a) maps any non-native driver error to
`KtxQueryError` (`context/mcp/local-project-ports.ts:78-88`, guarded by
`isNativeProgrammingFault`), (b) reports it through `reportException`, which skips
`$exception` Error Tracking for `KtxExpectedError`, and (c) writes `tool.start`
synchronously before the handler and `tool.end` in `instrumentMcpServer`
(`context/mcp/context-tools.ts:644-730`). The deadline cases MUST surface through
this path — the implementer verifies and tests them, but adds **no parallel
classification or logging path**:
- **Query exceeds the deadline (any driver):** a `tool.end` at **`error`** with
`outcome:"error"` and `err.message = "query exceeded {N}s"`, carrying the same
`callId` as the `tool.start`. Classified as an expected error, so it is absent
from `$exception` Error Tracking. The reason `tool.end` was previously missing
is solely the blocked event loop (Requirement 3); once the loop stays free and
the deadline throws, the existing instrumentation logs the matched pair — closing
spec 15's "`tool.start` with no `tool.end` = hang" gap for this case.
- **Completed-but-slow query (under the deadline, over `KTX_MCP_SLOW_TOOL_MS`):**
unchanged from spec 15 — its `tool.end` is emitted at **`warn`**. The deadline
(default 30s) and the slow threshold (default 10s) are independent knobs; a query
between 10s and 30s completes with a slow `warn`, one past 30s is killed with the
`error` above.
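
The two knobs can be read as one small decision. A sketch, assuming fast successful calls log at `info` (this spec only fixes `warn` and `error`) and that the caller already knows whether the deadline fired; `toolEndLevel` is a hypothetical name, not a function in `instrumentMcpServer`.

```typescript
type ToolEndLevel = "info" | "warn" | "error";

// slowMs mirrors KTX_MCP_SLOW_TOOL_MS (default 10s per spec 15).
function toolEndLevel(elapsedMs: number, timedOut: boolean, slowMs = 10_000): ToolEndLevel {
  // A deadline kill always wins: the query did not complete.
  if (timedOut) return "error";
  // Completed, but slower than the slow-tool threshold.
  return elapsedMs > slowMs ? "warn" : "info";
}
```

With the defaults, a 15 s query that completes yields `warn`, and a query killed at 30 s yields `error`.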
### 7. Read-only enforcement and `maxRows` unchanged
`assertReadOnlySql` and the `maxRows` row cap (`limitSqlForExecution`) behave
exactly as today. The deadline is additive. `maxRows` is not a substitute for it
(it bounds returned rows, not scan work).
### 8. Best-effort callers treat a deadline timeout as recoverable
The non-interactive `executeReadOnly` call sites that are best-effort —
relationship profiling, composite-candidate probes, relationship validation,
historic-SQL probes — MUST treat a deadline `KtxQueryError` as "skip this
probe / mark unprofiled" and continue, never as a source-fatal error. The
implementer confirms each such site already swallows query errors into a
graceful-skip and adds that handling where it does not, so the uniform deadline
(Requirement 1, applied to all callers) cannot abort an ingest run. A skipped
probe is logged at the skip site through that path's existing scan/ingest logger
(`KtxScanContext.logger`, `warn`/`debug`), never silently dropped — these callers
are off the MCP tool-call path, so their visibility comes from the logger they
already use.
## Acceptance criteria
- A read query that exceeds the deadline returns a `KtxQueryError`
(`query exceeded {N}s`) within roughly the deadline; the MCP server stays
responsive (a concurrent tool call on the same server completes while the slow
query is still pending) and writes a matching `tool.end` with a non-ok outcome.
- **Logging:** a timed-out `sql_execution` produces a `tool.start` and a matching
`tool.end` (same `callId`) at `error` with `outcome:"error"` and
`err.message = "query exceeded {N}s"` — no unmatched `tool.start` remains. The
timeout does not raise a `$exception` Error Tracking event (it is a
`KtxExpectedError`). A completed query slower than `KTX_MCP_SLOW_TOOL_MS` but
under the deadline still emits its `tool.end` at `warn`. No new logger is
introduced — the lines come from the existing `instrumentMcpServer`.
- **SQLite specifically:** executing a deliberately pathological query (an
expensive VIEW or an unindexed cross join) on a fixture does not block the event
loop, is terminated at the deadline, and the worker exits (the off-main-thread
executor is killed, not left spinning) so CPU returns to idle.
- **One server-side-timeout driver (Postgres):** the connector applies
`statement_timeout` equal to the resolved deadline, and a `57014` cancellation
is mapped to the canonical `KtxQueryError`.
- `resolveQueryDeadlineMs` returns 30_000 by default, honors a `query_timeout_ms`
override, and rejects an invalid value (zero / negative / non-integer).
- **No regression:** normal fast queries return identical results; read-only
rejection still works; `maxRows` still bounds returned rows.
- The shared `query_timeout_ms` field is accepted by every connector; BigQuery's
former `job_timeout_ms` key is gone and BigQuery's timeout is driven by the
shared field.
## Non-goals
- **A row/byte/cost budget on returned data.** This spec bounds *time*, not result
size — `maxRows` already bounds rows, and BigQuery's `maximumBytesBilled` is a
separate, retained concern.
- **A global `KTX_QUERY_TIMEOUT_MS` or per-call user flag.** One opinionated
default plus a per-connection override; no per-call knob, no global knob.
- **A server watchdog that recycles the process on an unmatched `tool.start`.**
Spec 15 names this as a possible future mitigation; this spec prevents the hang
at the source, so the watchdog is out of scope here.
- **Moving SQLite introspection / sampling / stats off the main thread.** Only the
`executeReadOnly` (LLM-SQL) path needs worker isolation; the rest are bounded
ktx-authored queries.
- **Per-connection retry / backoff on timeout.** A timeout returns a clean error
for the agent to revise; ktx does not auto-retry.
- **A second logger threaded into the connector.** The deadline cases are logged
through spec 15's existing MCP tool-call logger; the connector gets no separate
pino instance and `KtxScanContext` gets no MCP-logger thread (see "Logging routes
through spec 15's pino path").
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the design.
- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`:
`DEFAULT_QUERY_TIMEOUT_MS`, `resolveQueryDeadlineMs`, `queryDeadlineExceededError`.
Error class is `KtxQueryError` (`packages/cli/src/errors.ts:25`).
- **Contract anchor**: `KtxScanConnector.executeReadOnly`
(`context/scan/types.ts:343`), `KtxReadOnlyQueryInput` (`types.ts:285`),
`KtxScanContext.signal` (`types.ts:176`, already present, currently unused on the
MCP path).
- **Config schema** — add `query_timeout_ms` to the shared connection config
(`context/project/config.ts`, `KtxProjectConnectionConfig` and its zod schema);
remove BigQuery's `job_timeout_ms` reader.
- **SQLite worker** — new `packages/cli/src/connectors/sqlite/read-query-worker.ts`
(constructed by path via `new URL('./read-query-worker.js', import.meta.url)`);
rework `connectors/sqlite/connector.ts` `executeReadOnly` (247-251) to validate
on the main thread then dispatch to the worker with a terminate-on-deadline
timer. Reuse `normalizeQueryRows` (`context/connections/query-executor.ts`) in
the worker. Register the worker as a dynamic entry in `knip.json` (it is
referenced by path, not import) and confirm the build copies it into `dist`.
- **Remote connectors** — apply the resolved deadline and recognize the engine's
timeout signal in each `executeReadOnly` / `query(...)`:
`connectors/bigquery/connector.ts` (~491-512, `jobTimeoutMs`),
`connectors/clickhouse/connector.ts` (~602/629-644, `max_execution_time` +
`request_timeout`), `connectors/snowflake/connector.ts` (~354-371/510-534,
`STATEMENT_TIMEOUT_IN_SECONDS`), `connectors/postgres/connector.ts` (~822-838,
`statement_timeout`), `connectors/mysql/connector.ts` (~774-793,
`max_execution_time`), `connectors/sqlserver/connector.ts` (~812-832,
`requestTimeout`).
- **MCP path + logging (verify only)**: `context/mcp/local-project-ports.ts:69-88`
(error mapping), the `sql_execution` registration (~915-943), and the logging in
`instrumentMcpServer` (`context/mcp/context-tools.ts:644-730`, which writes
`tool.start`/`tool.end` via the spec-15 pino logger `context/mcp/logger.ts`). No
new classification or logging code; confirm the timeout flows through as an
expected error producing a matching `tool.end(error)` with the canonical message.
- **Best-effort callers**: `context/scan/relationship-profiling.ts` (~227, 275),
`context/scan/relationship-composite-candidates.ts` (~365, 440),
`context/scan/relationship-validation.ts` (~259),
`context/ingest/historic-sql-probes/bigquery-runner.ts` (~97), and the
historic-sql clients: confirm a deadline `KtxQueryError` is swallowed into a
graceful skip.
- **Tests** — a SQLite fixture with a pathological query (tiny `query_timeout_ms`
as the test seam) asserting terminate-on-deadline, event-loop responsiveness
(a concurrent promise resolves while the query is pending), and worker exit; a
Postgres test asserting `statement_timeout` is set to the resolved deadline and
a `57014` error maps to `KtxQueryError`; resolver unit tests (default /
override / invalid); regression tests for normal results, read-only rejection,
and `maxRows`. Extend the MCP logging tests (alongside spec 15's, e.g.
`test/context/mcp/server.test.ts`) to assert a timed-out `sql_execution` yields a
matched `tool.start`/`tool.end(error)` pair carrying `query exceeded {N}s`.
- After implementing, rebuild and re-link so the playground picks it up:
`pnpm run build && pnpm run link:dev`.
## Benchmark context (motivation, not a requirement)
The Spider2-lite local set loads several warehouses into SQLite, some with
expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` =
`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112
rows, no composite index, and `promo_id` (the index the optimizer picks) holds a
single value in 95.5% of rows. LLM-authored profiling queries (MIN/MAX/COUNT over such a
view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval
shard for 10+ minutes; with one, the agent gets a fast error and can scope the
query instead. Improving the benchmark is a side effect; the deadline is generic
production hygiene for any agent that lets an LLM author SQL.
## Implementation notes
Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
acceptance criteria are met; tests, type-check, dead-code, and build are green
for the changed surface.
### What was built, and where
- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`:
`DEFAULT_QUERY_TIMEOUT_MS = 30_000`, `resolveQueryDeadlineMs(connection)` (returns
the validated `query_timeout_ms` override else the default; throws on
zero/negative/non-integer), and `queryDeadlineExceededError(deadlineMs, options?)`
(a `KtxQueryError` reading `query exceeded ${round(ms/1000)}s`, carrying the
driver error as `cause`). Unit-tested in `test/context/connections/query-deadline.test.ts`.
- **Config field**: `query_timeout_ms` (optional positive integer, ms) added to
the **shared warehouse** schema. NOTE (spec drift): that schema lives in
`context/project/driver-schemas.ts` (`warehouseConnectionSchema`), not
`config.ts`. The warehouse schemas use `z.looseObject`, so the field had to be
declared explicitly to be *validated* (otherwise it would pass through
unvalidated). BigQuery's `job_timeout_ms` field and `bigQueryJobTimeoutMsFromConnection`
reader were removed; BigQuery now resolves the shared field. Every connector
resolves its deadline once at construction via `resolveQueryDeadlineMs`.
### Deviation from the spec's SQLite mechanism (worker thread → child process)
The spec mandated running SQLite read queries on a **worker thread** and enforcing
the deadline by `worker.terminate()`. This was **empirically disproven**:
`Worker.terminate()` cannot interrupt a CPU-bound synchronous `better-sqlite3`
scan — the native `sqlite3_step` loop never yields to V8, so terminate's promise
never even resolves (an 8s probe of the exact failing query shape confirmed the
thread keeps spinning). better-sqlite3 v12 exposes no `interrupt`/progress-handler
API, and `.iterate()` does not help because the failing query is a single
aggregate row produced only *after* the full scan.
The implemented mechanism is therefore **`child_process.fork` + `SIGKILL`**
(`packages/cli/src/connectors/sqlite/read-query-child.ts`, spawned from
`connector.ts`). SIGKILL lets the OS reclaim the whole process — a probe confirmed
the scan is interrupted in ~2 ms and CPU returns to idle. This satisfies *both*
SQLite requirements better than a thread (event loop stays free **and** the query
is genuinely cancellable). The child is self-contained (imports only
`better-sqlite3` + node builtins); validation/row-limiting (`limitSqlForExecution`)
and `normalizeQueryRows` stay on the main thread. One short-lived child per call,
killed on completion, deadline, or `ctx.signal` abort. Node v24's native
TS type-stripping lets the `.ts` child load under vitest; a `.js`-if-exists-else-`.ts`
URL resolver picks the compiled child in `dist`. Registered as a dynamic entry in
`knip.json`; `tsc` emits it to `dist` (verified, plus a dist-level end-to-end smoke).
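
The kill-on-deadline shape described above, reduced to its control flow. This is an illustrative sketch, not the code in `connector.ts`: `Killable`, `StartJob`, and `runWithDeadline` are hypothetical names, and the handle stands in for a forked child whose `kill()` would send `SIGKILL`.

```typescript
// Hypothetical handle for anything that can be forcibly stopped.
interface Killable {
  kill(): void;
}

type StartJob<T> = (
  onResult: (value: T) => void,
  onError: (error: Error) => void,
) => Killable;

function runWithDeadline<T>(start: StartJob<T>, deadlineMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let settled = false;
    let handle: Killable | undefined;

    // Settle exactly once: stop the timer, reap the job, then resolve or reject.
    const settle = (action: () => void): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      handle?.kill();
      action();
    };

    // Arm the deadline before starting the job.
    const timer = setTimeout(() => {
      settle(() => reject(new Error(`query exceeded ${Math.round(deadlineMs / 1000)}s`)));
    }, deadlineMs);

    handle = start(
      (value) => settle(() => resolve(value)),
      (error) => settle(() => reject(error)),
    );
    // If the job reported synchronously, the handle did not exist yet; reap it now.
    if (settled) handle.kill();
  });
}
```

The job is reaped on every exit path (result, error, deadline), which matches the "one short-lived child per call" rule; an abort signal would be a fourth caller of `settle`.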
### Remote connectors (server-side timeouts + own-signal mapping)
Each applies the resolved deadline server-side and re-wraps its own timeout signal
as `queryDeadlineExceededError(deadlineMs, { cause })`:
- **BigQuery**: `jobTimeoutMs` on the query job; maps a "Job timed out" / timeout-reason error.
- **Postgres**: `statement_timeout` via pool `options` (`-c statement_timeout=<ms>`); maps `57014`.
- **MySQL**: `SET SESSION max_execution_time = <ms>` before the read; maps errno `3024`.
- **Snowflake**: `ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = <ceil(s)>` on the pooled connection; maps code `604` / "reached its … timeout".
- **ClickHouse**: `max_execution_time` (ceil seconds) setting, with `request_timeout` set to `deadline + 5s` so the HTTP client outlasts the server abort (replaces the old hardcoded 30s); maps code `159`.
- **SQL Server**: `requestTimeout` on the `mssql` pool config (TDS attention cancels server-side); maps `ETIMEOUT`.
Each connector has a focused test asserting the timeout is applied and its signal
maps to `KtxQueryError` (Postgres is the spec's required acceptance test).
### Best-effort callers (Requirement 8)
Confirmed already graceful: relationship **profiling** (outer try/catch →
`profile_failed` warning) and **composite-candidate** detection
(`detectCompositeRelationships` → recoverable warning, returns `[]`). Historic-SQL
**probes** flow through `runHistoricSqlReadinessProbe`, which catches *any* error
into `{ ok: false }`. **Added** handling to relationship **validation**: a
`KtxQueryError` on the per-candidate coverage probe now sends that one candidate to
`review` (`validation_query_failed`, logged via `ctx.logger.warn`) instead of
aborting the whole validation pass. `ingest-query-executor.ts` is a generic
executor port whose callers own recoverability — left unchanged.
### MCP surfacing/logging
No new MCP classification or logging code. The deadline `KtxQueryError` flows
through the existing `local-project-ports` mapping → `reportException` (skips
`$exception` for `KtxExpectedError`; existing test `telemetry/exception.test.ts`
covers the skip for `KtxQueryError`) → `instrumentMcpServer`, which logs a matched
`tool.start` → `tool.end(error, level 50)` pair carrying `err.message = "query
exceeded {N}s"`. A test in `test/context/mcp/server.test.ts` asserts the matched
pair, closing spec 15's "`tool.start` with no `tool.end` = hang" gap for this case.
### Pre-existing branch issues encountered (not part of this feature)
- `test/mcp-server-factory.test.ts` had a type error (an `as` cast to a shape with
a fake `context_tool` key, introduced by branch commit `2677b3ef`) that broke
`tsc -p tsconfig.test.json`. Fixed with a clean single cast to keep the
type-check gate green; behavior unchanged.
- `test/skills/analytics-skill-content.test.ts` fails (2 cases: missing
`**Window functions**` heading and `Expose identity, not just the label` prose
in `src/skills/analytics/SKILL.md`). This is unrelated analytics-skill (spec
13/14) content drift committed earlier on the branch; **left untouched** — no
skill files were modified by this feature.


@ -1,418 +0,0 @@
# BigQuery cross-project dataset introspection (foreign-hosted datasets, billed in own project)
> Refined spec. Intake draft: `todo/18-bigquery-cross-project-datasets.md`.
>
> **Scope: let the BigQuery connector introspect a dataset hosted in a *different*
> project than the one it bills jobs to.** A `dataset_ids` entry may be written
> fully-qualified as `project.dataset`; the connector introspects each entry in
> *its own* project while every job still runs in `credentials.project_id`. A
> bare `dataset` keeps today's single-project behavior unchanged.
>
> Out of scope (confirmed during refinement): the interactive `ktx setup` wizard
> is **not** expected to *discover* foreign datasets — you cannot enumerate
> datasets in a project you don't own, and the wizard doesn't know which foreign
> projects to probe. Users hand-write `project.dataset` entries (in `ktx.yaml` or
> at the dataset prompt); the connector must accept and introspect them. See
> *Non-goals*.
## Problem
**ktx**'s BigQuery connector derives a single `projectId` from
`credentials.project_id` and uses it for **both** job billing **and** schema
introspection. There is no way to introspect a dataset that lives in another
project, even though *querying* such a dataset already works (a cross-project
read in a `FROM` clause bills to the caller's project — that path is proven).
Confirmed in the current connector (`packages/cli/src/connectors/bigquery/connector.ts`):
- **`:294`** — `projectId` is read only from `credentials.project_id`. There is
no separate billing-vs-dataset project. `bigQueryConnectionConfigFromConfig`
(`:278`-`:301`) returns `datasetIds: string[]`, raw and unparsed.
- **`datasetIds()` (`:163`)** — returns `dataset_ids` / `dataset_id` verbatim;
it never parses a `project.` prefix.
- **`introspectDataset` (`:544`)** — calls `this.getClient().dataset(datasetId)`,
which resolves the dataset in the **client's (billing) project**, and labels
every table `catalog: this.resolved.projectId` (`:566`, `:574`) — including the
introspection-failure warning metadata (`:566`).
- **`primaryKeys` (`:591`)** — builds `INFORMATION_SCHEMA` SQL as
`` `<projectId>.<datasetId>.INFORMATION_SCHEMA.TABLE_CONSTRAINTS` `` using the
**billing** project.
- **`listTables` (`:453`)** — queries
`` `<projectId>`.`region-<region>`.INFORMATION_SCHEMA.TABLES `` against the
**billing** project and labels each row `catalog: this.resolved.projectId`.
- **`testConnection` (`:344`)** — calls `client.dataset(datasetId).get()` in the
billing project.
### Empirical confirmation (from the intake draft)
With a service account in project `ktx-spider2-lite`:
- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (it looks
in `projects/ktx-spider2-lite/datasets/austin_311`).
- The cross-project form `dataset("austin_311", { projectId: "bigquery-public-data" })`
**succeeds** (public metadata is readable by any authenticated principal).
- There is **no config knob** to separate the introspection project from billing.
### Why the table `catalog` label is load-bearing, not cosmetic
The BigQuery dialect generates **three-part `catalog.db.name`** SQL
(`connectors/bigquery/dialect.ts:38`: `formatDialectTableName(..., 'three-part')`;
`context/connections/dialect-helpers.ts:27-32` emits `catalog.db.name`). The
`catalog` stored on each scanned table is therefore the project that *every*
later query targets — `sampleTable`, `sampleColumn`, `getColumnDistinctValues`,
and ref-based `executeReadOnly` all format the ref through the dialect. If a
foreign dataset's tables are labeled with the billing project, every one of those
queries becomes `` `billing-project`.`austin_311`.`table` `` → `404`. So labeling
the table `catalog` with the dataset's own project is a **correctness
requirement**, and it is the single lever that makes sampling, dictionary value
extraction, and `discover_data` all resolve once the snapshot is right.
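
To make the failure concrete, here is a sketch of three-part formatting with the two possible labels. `formatThreePart` is a stand-in for the real `formatDialectTableName`, and `some_table` is a placeholder table name, not one taken from the dataset.

```typescript
interface TableRef {
  catalog: string;
  db: string;
  name: string;
}

// Stand-in for the dialect's three-part formatter: backtick-quote each part.
function formatThreePart(ref: TableRef): string {
  return [ref.catalog, ref.db, ref.name].map((part) => "`" + part + "`").join(".");
}

const billingProject = "ktx-spider2-lite";
const datasetProject = "bigquery-public-data";

// Labeled with the billing project: the reference names a dataset that is not there.
const wrongRef = formatThreePart({ catalog: billingProject, db: "austin_311", name: "some_table" });

// Labeled with the dataset's own project: the reference names the real location.
const rightRef = formatThreePart({ catalog: datasetProject, db: "austin_311", name: "some_table" });
```

Only the stored `catalog` differs between the two calls, which is why fixing the label at introspection time fixes every later query path at once.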
### One introspection path, no divergence
`connectors/bigquery/live-database-introspection.ts` wraps
`KtxBigQueryScanConnector.introspect` directly, so the ingest and live-database
paths share **one** introspection implementation. The SDK already supports the
fix: `client.dataset(id, { projectId })`; `@google-cloud/bigquery@8.3.1`'s
`DatasetOptions` exposes `projectId?: string`.
## Generic use case (independent of any benchmark)
Analysts routinely introspect datasets they can **read but do not own and do not
bill to**: Google's `bigquery-public-data`, a partner's shared project, an
organization's central data project that a smaller team queries from its own
billing project. To make those connectable in **ktx** — so `discover_data`, the
semantic layer, dictionary sampling, and `sql_dialect_notes` all work — the
connector must introspect a foreign-hosted dataset while billing jobs in the
credentials' own project. This is a standard BigQuery deployment shape and is
wholly independent of any benchmark.
The class to design for is "the dataset's project ≠ the billing project," and it
must generalize beyond one example: a single connection may reference datasets in
**several** foreign projects at once (e.g. one slice mixing `bigquery-public-data`
and `isb-cgc-bq`), and two different projects may host datasets with the **same
name**. The design must keep those distinct.
## Design decisions (resolved during refinement)
These resolve ambiguities the intake draft left open. They constrain the
implementer; the exact code is theirs.
### Carry the project inline on each dataset entry — no separate knob
The introspection project is expressed **per dataset**, inline, as the optional
`project.` prefix on a `dataset_ids` / `dataset_id` entry. There is no new config
field.
> Rejected alternative: a separate connection-level `dataset_project` (or
> `introspection_project`) field. It is a speculative runtime knob (against the
> repo's opinionated-defaults rule) and, more decisively, it **cannot express the
> requirement**: one connection must span *multiple* foreign projects, which a
> single global field cannot represent. The inline form also derives scope from
> the user's own declared input rather than adding a parallel setting.
### Parse to canonical `{ project, dataset }` pairs at the config boundary
Each entry is parsed **once**, in `bigQueryConnectionConfigFromConfig` /
`datasetIds()`, into a canonical pair: the project (when no prefix is present,
default it to `credentials.project_id`) and the bare dataset id. Every
introspection-side call site reads the resolved pair; nothing downstream re-parses
a `project.dataset` string.
> Rejected alternative: keep `datasetIds: string[]` raw and split the prefix
> lazily at each use site (`introspectDataset`, `primaryKeys`, `listTables`,
> `testConnection`). That re-implements one rule in four places and is exactly the
> drift trap the repo's single-source-of-truth rule warns about — a later fix
> lands on one path and not another. Normalize at the boundary; carry the
> canonical form downstream.
The internal resolved-config type (`KtxBigQueryResolvedConnectionConfig.datasetIds`)
changes shape from `string[]` to a structured pair list. That is an internal type;
the connector internals and the connector test fixtures are the only consumers.
### Parsing rule (at the boundary)
- An entry contains **at most one `.`**.
- With a dot: the segment **before** the dot is the project, validated by the
existing `normalizeBigQueryProjectId` charset
(`context/connections/bigquery-identifiers.ts`); the segment **after** is the
dataset id (validated as a normal identifier).
- Without a dot: a bare dataset; the project defaults to `credentials.project_id`
(today's behavior).
- **More than one `.`** (e.g. a stray `proj.ds.table`) is a clear config error
raised at resolution time, naming the connection — not a silent
mis-introspection.
- Legacy domain-scoped project ids that contain `:` (e.g. `example.com:proj`) stay
**out of scope**, consistent with `normalizeBigQueryProjectId`'s current charset
(which already rejects `.` and `:` in a project id).
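
The rule above fits in one small function. A sketch under stated assumptions: `parseDatasetEntry` and `DatasetRef` are hypothetical names, and both regular expressions are illustrative charsets only. The real project rule is whatever `normalizeBigQueryProjectId` enforces, and the dataset rule is the connector's own identifier validation.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Assumed charsets, for illustration only (see the note above).
const PROJECT_ID = /^[a-z][a-z0-9-]*$/;
const DATASET_ID = /^[A-Za-z0-9_]+$/;

function parseDatasetEntry(entry: string, billingProject: string, connectionId: string): DatasetRef {
  // Every failure names the connection, as the rule requires.
  const fail = (reason: string): never => {
    throw new Error(`connection "${connectionId}": invalid dataset entry "${entry}" (${reason})`);
  };

  const parts = entry.split(".");
  if (parts.length > 2) fail('more than one "."');

  const [first = "", second] = parts;
  if (second === undefined) {
    // Bare dataset: the project defaults to the billing project.
    if (!DATASET_ID.test(first)) fail("bad dataset id");
    return { project: billingProject, dataset: first };
  }
  // Qualified entry: validate both segments; empty segments fail here too.
  if (!PROJECT_ID.test(first)) fail("bad project id");
  if (!DATASET_ID.test(second)) fail("bad dataset id");
  return { project: first, dataset: second };
}
```

Because the result is already a pair, no downstream call site needs to know that a `project.dataset` string form ever existed.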
### Billing is never the dataset's project
The BigQuery client is still constructed with `projectId = credentials.project_id`
(`getClient()`, `:487`-`:495`), and `createQueryJob` always bills there. Only the
*introspection* surfaces switch to the per-dataset project. Cross-project reads in
a `FROM` clause already bill to the caller — unchanged and already proven.
### Dataset identity downstream is `(catalog, db)`
Scanned tables are keyed by `(catalog, db, name)` throughout
(`context/scan/table-ref.ts`; `context/scan/warehouse-catalog.ts:107`). Because
the table `catalog` now holds the dataset's own project, two foreign projects that
each host an `austin_311` dataset remain distinct with no extra work, provided the
snapshot's `scope` / `metadata` also preserve the project (Requirement 6).
### Setup-wizard scope: accept, don't discover
The connector's region-scoped `listTables` (`:453`) is consumed **only** by the
`ktx setup` wizard's table-selection step (`setup-databases.ts`); the
ingest / `discover_data` path reads persisted snapshot JSON via
`WarehouseCatalogService.listTables`, not the connector method. The wizard is not
expected to enumerate foreign datasets (you can't list a project you don't own).
A `project.dataset` value hand-entered at the dataset prompt, or written into
`ktx.yaml`, must be accepted, validated, and introspected. See *Non-goals* for the
region caveat that follows from this.
## Requirements
### R1 — Accept and parse `project.dataset` at the config boundary
`datasetIds()` / `bigQueryConnectionConfigFromConfig` resolve each
`dataset_ids` and `dataset_id` entry into a canonical `{ project, dataset }` pair
per the parsing rule above, defaulting `project` to `credentials.project_id` when
unprefixed. A malformed entry (more than one `.`, an empty project or dataset
segment, or a project/dataset that fails identifier validation) raises a clear
error at resolution time that names the connection id.
### R2 — Introspect each dataset in its own project
`introspectDataset` resolves the dataset via the **dataset's** project —
`client.dataset(datasetId, { projectId })` — for `getTables()` and each
`tableRef.get()`. This requires extending the `KtxBigQueryClient.dataset` port to
accept the project (e.g. `dataset(id, projectId)` / `dataset(id, { projectId })`)
and forwarding it from `DefaultBigQueryClientFactory`.
### R3 — Label table `catalog` with the dataset's project
Every table produced by `introspectDataset` is labeled `catalog: <dataset's
project>` (not the billing project), and the introspection-failure warning
metadata (`object` / `catalog`) likewise reflects the dataset's project. This is
what makes downstream sample/distinct-value/read queries resolve.
### R4 — Primary-key discovery targets the dataset's project
The `primaryKeys` `INFORMATION_SCHEMA.TABLE_CONSTRAINTS` /
`KEY_COLUMN_USAGE` SQL is built against
`` `<dataset's project>.<datasetId>.INFORMATION_SCHEMA…` ``. (This INFORMATION_SCHEMA
view is dataset-qualified and therefore region-independent.) Its existing
soft-fail-on-denied behavior (`tryConstraintQuery`, scan warning) is preserved.
### R5 — `listTables` lists each dataset in its own project
`listTables` returns rows labeled `catalog: <that dataset's project>` and queries
each referenced project's region `INFORMATION_SCHEMA.TABLES`. Because a connection
can now span projects, it queries per distinct project rather than assuming one.
(This is the setup-wizard surface — see the cross-region caveat in *Non-goals*.)
### R6 — Snapshot scope and metadata reflect multiple projects
`introspect`'s returned snapshot keeps `metadata.project_id` = the **billing**
project, but `scope.catalogs` becomes the **distinct set of dataset projects**
actually introspected. `scope.datasets` / `metadata.datasets` must stay
unambiguous when two projects share a dataset name (e.g. carry the qualified
`project.dataset`, or otherwise preserve the project). The scoped table-name
lookup that today passes `catalog: this.resolved.projectId` (`:359`) must pass
each dataset's own project so `tableScope` / `enabled_tables` filtering still
matches.
### R7 — `testConnection` resolves foreign datasets
`testConnection` validates each configured dataset via its own project
(`client.dataset(datasetId, { projectId }).get()`), so a connection pointing only
at foreign datasets reports success rather than a spurious `404`.
### R8 — Billing unchanged; bare dataset is a strict no-op
`createQueryJob` continues to bill in `credentials.project_id`. A connection whose
`dataset_ids` are all bare (no `project.` prefix) behaves **exactly** as before:
same resolved project, same `catalog` labels, same INFORMATION_SCHEMA targets, no
behavioral change.
### R9 — `getTableRowCount` honors the parsed entry
`getTableRowCount`'s default-dataset handling (`:431`, today
`this.resolved.datasetIds[0]`) resolves through the canonical pair so a foreign
default dataset is introspected in its own project.
### R10 — Docs reflect the qualified form
Document that a BigQuery `dataset_ids` / `dataset_id` entry may be written
`project.dataset` to introspect a dataset hosted in another project (billing stays
in `credentials.project_id`). Update the BigQuery rows/examples in
`docs-site/content/docs/configuration/ktx-yaml.mdx` and
`docs-site/content/docs/integrations/primary-sources.mdx` (and the dataset-scope
note in `docs-site/content/docs/cli-reference/ktx-setup.mdx`). Keep examples
copy-pasteable and follow the `fumadocs-mdx-structure` skill.
## Acceptance criteria
1. **Foreign single-project introspection.** With credentials in project
`ktx-spider2-lite` and `dataset_ids: ['bigquery-public-data.austin_311']`,
`ktx ingest <conn>` introspects the tables, enriches, and samples values;
`discover_data` / `dictionary_search` return them. Tables are labeled
`catalog: 'bigquery-public-data'`.
2. **Multi-project connection.** `dataset_ids: ['bigquery-public-data.x',
'other-project.y']` introspects **both**, each under its own project; the
snapshot's `scope.catalogs` contains both projects.
3. **Cross-project query still bills locally.** `sql_execution` of a
fully-qualified `project.dataset.table` query runs and bills in
`credentials.project_id`.
4. **Same dataset name, two projects.** `['proj-a.shared', 'proj-b.shared']`
yields two distinct dataset groups; tables do not collide.
5. **No regression.** `dataset_ids: ['my_dataset']` (or singular `dataset_id`)
behaves exactly as before — resolved under `credentials.project_id`, same
`catalog` labels and INFORMATION_SCHEMA targets.
6. **Malformed entry fails clearly.** `dataset_ids: ['proj.ds.table']` (or an
empty segment) raises a config error naming the connection, not a `404` at
scan time.
7. **Test coverage** (extend `packages/cli/test/connectors/bigquery/connector.test.ts`,
using the existing fake `clientFactory` harness):
- the fake `dataset()` is called with the dataset's project for a prefixed
entry, and with the billing project for a bare entry;
- a prefixed entry yields tables with `catalog: '<dataset project>'`;
- a mixed two-project `dataset_ids` introspects both;
- `bigQueryConnectionConfigFromConfig` rejects a multi-dot / empty-segment
entry;
- the existing single-project tests still pass unchanged.
## Non-goals
- **Foreign-dataset discovery in the setup wizard.** The wizard does not
enumerate datasets in projects the credentials don't own; users supply
`project.dataset` explicitly (scope decision A).
- **Cross-region `listTables`.** `listTables`' region-scoped
`region-<location>.INFORMATION_SCHEMA.TABLES` query uses the connection-level
`location`; a foreign dataset in a *different* region than the connection's
`location` will not be listed by that wizard-facing query. This does **not**
affect ingest/`discover_data`, whose introspection path
(`introspectDataset` REST metadata + dataset-qualified PK INFORMATION_SCHEMA) is
region-independent. A per-dataset region knob is a separate spec if ever needed.
- **Domain-scoped legacy project ids** containing `:` (e.g. `example.com:proj`),
already unsupported by `normalizeBigQueryProjectId`.
- **A separate billing/introspection config field** — explicitly rejected above.
## Implementation orientation
Pointers from exploration; line numbers may have drifted, and the implementer owns
the design.
- `packages/cli/src/connectors/bigquery/connector.ts`
- `datasetIds()` (`:163`) and `bigQueryConnectionConfigFromConfig` (`:278`) —
parse + canonicalize (R1); change `KtxBigQueryResolvedConnectionConfig.datasetIds`
shape.
- `KtxBigQueryClient.dataset` port (`:100`–`:110`) and
`DefaultBigQueryClientFactory.dataset` (`:130`–`:135`): thread `projectId`
(R2). `getClient()` (`:487`) keeps the billing project (R8).
- `introspectDataset` (`:544`) — `dataset(id, { projectId })`, table `catalog`
+ warning metadata (R2, R3).
- `primaryKeys` (`:591`) — dataset-qualified INFORMATION_SCHEMA (R4).
- `listTables` (`:453`) — per-project region INFORMATION_SCHEMA + row catalog
(R5).
- `introspect` (`:352`) — `scope.catalogs`, `scope.datasets`, scoped-name lookup
(`:359`) (R6).
- `testConnection` (`:339`) (R7); `getTableRowCount` (`:431`) (R9).
- `packages/cli/src/connectors/bigquery/live-database-introspection.ts` — wraps
`introspect`; no separate change needed (it inherits the fix).
- `packages/cli/src/context/connections/bigquery-identifiers.ts`:
`normalizeBigQueryProjectId` is the project-segment validator.
- `packages/cli/src/context/connections/dialect-helpers.ts` /
`connectors/bigquery/dialect.ts` — three-part naming; no change, but this is
*why* R3 matters.
- After implementing, rebuild and re-link so the playground picks it up:
`pnpm run build && pnpm run link:dev`. Run
`pnpm --filter @kaelio/ktx run type-check` and the connector test suite.
## Benchmark context (motivation, not a requirement — do not encode benchmark specifics)
Spider 2.0-Lite's **BigQuery slice (~205 questions)** is otherwise unservable
faithfully: every one of its ~74 logical databases groups datasets hosted in
foreign public projects (`bigquery-public-data`, `isb-cgc-bq`,
`data-to-insights`, …), never in a project we own. Query execution already works
cross-project; ktx-only *discovery* is the sole blocker, and it is blocked exactly
because the connector can't introspect a foreign-hosted dataset. Of 74 BQ
databases only **one** spans more than one source project, so "let `dataset_ids`
carry `project.dataset` and introspect each in its own project" covers the
benchmark and the general case alike. None of these project names belong in the
code — they are derived from the user's own `dataset_ids` input.
## Implementation notes
Implemented on branch `write-feature-spec-wiki`. The whole change is contained in
the BigQuery connector, its identifier helpers, the connector test suite, and three
docs pages.
**Config boundary (R1).** Added `normalizeBigQueryDatasetId`
(`packages/cli/src/context/connections/bigquery-identifiers.ts`, charset
`[A-Za-z0-9_]`) next to the existing project/region validators. In
`connectors/bigquery/connector.ts`, a single `parseBigQueryDatasetEntry(entry,
defaultProject, connectionId)` parses one entry by splitting on `.`: zero dots →
bare dataset in `defaultProject`; one dot → `project.dataset` (each segment
validated; empty segment throws); two or more dots → throws. `resolveDatasetRefs`
resolves `env:`/`file:` references first, trims/filters empties, then parses each.
`bigQueryConnectionConfigFromConfig` calls it with the billing `project_id` as the
default, so the canonical pair list is produced once at the boundary.
`KtxBigQueryResolvedConnectionConfig.datasetIds` changed from `string[]` to the new
`BigQueryDatasetRef[]` (`{ project, dataset }`). All errors name
`connections.<id>.dataset_ids entry "<entry>"`.
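
The parsing rule described above (zero dots, one dot, two or more dots) can be sketched roughly as follows. This is a minimal illustration, not the shipped `parseBigQueryDatasetEntry`: the function name, the `DatasetRef` type, and the project pattern are assumptions made for the example; only the dataset charset and the error prefix come from the notes.

```typescript
// Hypothetical sketch; the real parser lives in connectors/bigquery/connector.ts.
interface DatasetRef {
  project: string;
  dataset: string;
}

// Dataset charset is taken from the notes above; the project pattern is an assumption.
const DATASET_PATTERN = /^[A-Za-z0-9_]+$/;
const PROJECT_PATTERN = /^[A-Za-z0-9_-]+$/;

function parseDatasetEntrySketch(
  entry: string,
  defaultProject: string,
  connectionId: string,
): DatasetRef {
  const fail = (reason: string): never => {
    throw new Error(`connections.${connectionId}.dataset_ids entry "${entry}": ${reason}`);
  };

  const segments = entry.split(".");
  if (segments.length > 2) {
    fail("expected `dataset` or `project.dataset`");
  }

  // Zero dots: bare dataset in the default (billing) project.
  // One dot: project.dataset, each segment validated below.
  const [first = "", second = ""] = segments;
  const qualified = segments.length === 2;
  const project = qualified ? first : defaultProject;
  const dataset = qualified ? second : first;

  if (!PROJECT_PATTERN.test(project)) fail("invalid or empty project segment");
  if (!DATASET_PATTERN.test(dataset)) fail("invalid or empty dataset segment");
  return { project, dataset };
}

// Usage: a bare entry resolves under the billing project, a qualified one under its own.
const bare = parseDatasetEntrySketch("analytics", "billing-proj", "bq");
const foreign = parseDatasetEntrySketch("bigquery-public-data.austin_311", "billing-proj", "bq");
console.log(bare, foreign);
```

Resolving every entry to a `{ project, dataset }` pair once, at the config boundary, is what lets each later introspection site read the pair without re-parsing strings.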
**Client port (R2).** `KtxBigQueryClient.dataset` now takes
`(datasetId, projectId)`; `DefaultBigQueryClientFactory` forwards
`client.dataset(datasetId, { projectId })` (`@google-cloud/bigquery` `DatasetOptions.projectId`).
`getClient()` still constructs the client with the **billing** `project_id`, so
`createQueryJob` bills locally regardless of the dataset's project (R8, acceptance 3).
**Per-dataset introspection (R3–R7, R9).** Every introspection site reads the
resolved pair: `introspectDataset(ref, …)` resolves `dataset(ref.dataset, ref.project)`
and labels tables (and the introspection-failure warning, via `tryIntrospectObject`'s
`catalog.db.object`) with `ref.project`; `primaryKeys(ref)` builds dataset-qualified
`` `<project>.<dataset>.INFORMATION_SCHEMA…` `` SQL; `testConnection` validates each
dataset under its own project; `getTableRowCount`'s default resolves through the first
pair. `introspect` sets `scope.catalogs` to the distinct set of dataset projects and
keeps `metadata.project_id` = billing. `scope.datasets` / `metadata.datasets` use a
`qualifiedDatasetLabel` helper — bare in the billing project (so the single-project
snapshot is byte-for-byte unchanged), `project.dataset` otherwise (so two projects with
the same dataset name stay distinct, R6/acceptance 4).
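
The labeling rule and the `scope.catalogs` derivation described above can be illustrated with a short sketch. The function names and the `DatasetRef` type are hypothetical stand-ins for the real helper; the behavior (bare in the billing project, `project.dataset` otherwise) follows the notes.

```typescript
// Hypothetical sketch of the bare-in-billing / qualified-otherwise label.
interface DatasetRef {
  project: string;
  dataset: string;
}

function qualifiedDatasetLabelSketch(ref: DatasetRef, billingProject: string): string {
  return ref.project === billingProject ? ref.dataset : `${ref.project}.${ref.dataset}`;
}

// scope.catalogs: the distinct set of dataset projects, in first-seen order.
function distinctCatalogsSketch(refs: DatasetRef[]): string[] {
  return Array.from(new Set(refs.map((ref) => ref.project)));
}

const refs: DatasetRef[] = [
  { project: "proj-a", dataset: "shared" },
  { project: "proj-b", dataset: "shared" },
  { project: "billing-proj", dataset: "analytics" },
];
console.log(refs.map((ref) => qualifiedDatasetLabelSketch(ref, "billing-proj")));
console.log(distinctCatalogsSketch(refs));
```

Keeping the billing-project label bare is what leaves a single-project snapshot unchanged, while the qualified form keeps two `shared` datasets apart.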
**`listTables` (R5).** Split into `listTables` (parse override entries, group by
project) and `listTablesInProject(project, region, datasets?)`. With no override it
lists the billing project's region (unchanged); with an override it runs one
region-`INFORMATION_SCHEMA.TABLES` query per distinct project, filtered to that
project's bare datasets, and labels rows with that project. The existing single-region
test is unchanged (bare entries collapse to one billing-project query).
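
The "group by project" step in `listTables` might look like the sketch below. It only shows the grouping; the per-project region query itself is omitted, and the names are illustrative assumptions rather than the shipped code.

```typescript
// Hypothetical sketch: group parsed override entries by project so that one
// region INFORMATION_SCHEMA.TABLES query can be issued per distinct project.
interface DatasetRef {
  project: string;
  dataset: string;
}

function groupDatasetsByProjectSketch(refs: DatasetRef[]): Map<string, string[]> {
  const byProject = new Map<string, string[]>();
  for (const ref of refs) {
    const datasets = byProject.get(ref.project) ?? [];
    // Each project's filter uses bare dataset names only.
    if (!datasets.includes(ref.dataset)) datasets.push(ref.dataset);
    byProject.set(ref.project, datasets);
  }
  return byProject;
}

const grouped = groupDatasetsByProjectSketch([
  { project: "billing-proj", dataset: "analytics" },
  { project: "bigquery-public-data", dataset: "austin_311" },
  { project: "billing-proj", dataset: "events" },
]);
console.log(Array.from(grouped.entries()));
```

With only bare entries, every ref carries the billing project, so the map collapses to a single key and a single query, which matches the unchanged single-region test.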
**Docs (R10).** Added a "Cross-project datasets" subsection to
`integrations/primary-sources.mdx` (qualified-entry example + the setup/region caveats),
plus pointers from `configuration/ktx-yaml.mdx` and `cli-reference/ktx-setup.mdx`.
**Tests.** Extended `test/connectors/bigquery/connector.test.ts`: parse-to-pairs and
malformed-entry rejection (`proj.ds.table`, `proj.`, `.ds`); a foreign-only connection
calls `dataset('austin_311', 'bigquery-public-data')`, labels tables
`catalog: 'bigquery-public-data'`, builds the client with the billing project, and keeps
`metadata.project_id` local; a mixed `['bigquery-public-data.austin_311', 'analytics']`
connection introspects both under their own projects; and `['proj_a.shared',
'proj_b.shared']` stays distinct. The internal `datasetIds`-shape assertion was updated
to the pair list; all pre-existing behavioral tests pass unchanged.
**Verification.** `pnpm --filter @kaelio/ktx run type-check`, the connector suite
(18 tests), `test/setup-databases.test.ts` + `bigquery-identifiers.test.ts`,
`pnpm run build`, `pnpm run dead-code` (Biome + Knip default + production),
`pnpm run link:dev` (`ktx-dev` → 0.12.0), and `pre-commit` on the changed files all
pass. Acceptance criteria 1–4 are exercised by unit tests with the fake client factory;
criteria 5–6 by unit tests; criterion 3 (cross-project query bills locally) is
structurally guaranteed (single billing client) and asserted via the `createClient`
project. End-to-end ingest against live `bigquery-public-data` was not run here (no live
credentials in this worktree); the `link:dev` binary is ready for the playground agent to
validate.
**No deviations from the spec design.** The only judgment call: `scope.datasets`
renders bare-in-billing / qualified-otherwise rather than always-qualified, chosen to
satisfy both the no-regression requirement (R8/acceptance 5) and the disambiguation
requirement (R6/acceptance 4) with one unambiguous, dot-delimited form.

View file

@ -1,471 +0,0 @@
# Durable, resumable, bounded relationship detection during ingest enrichment
> Refined spec. Intake draft: `todo/19-durable-bounded-relationship-detection.md`.
>
> **Scope: make the expensive part of ingest enrichment survive an interrupted
> relationship stage.** Today the paid LLM descriptions + embeddings only become
> durable and queryable after the slowest, most-killable, least-valuable stage
> (relationship detection) also finishes. This spec moves the persistence boundary
> to the cost boundary, makes stage resume work across runs, and bounds + observes
> the one open-ended stage — the durability companion to spec 16 (bounded query
> execution), which this spec composes with rather than replaces.
## Problem
Three compounding failure modes, all confirmed in the current code, share one root
cause: **the three enrichment stages are treated as a single atomic unit for
persistence, identity, and bounding, even though they differ radically in cost,
durability value, runtime, and likelihood of being killed.**
`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:472`) runs three stages
in a fixed order through `runEnrichmentStage` (`:413`):
| stage | order | cost | durability value | runtime on a large schema | likely to be killed |
|-------|-------|------|------------------|---------------------------|---------------------|
| `descriptions` (`:524`) | 1st | high — one paid LLM call per table | high | minutes | low |
| `embeddings` (`:553`) | 2nd | medium | high | seconds–minutes | low |
| `relationships` (`:587`) | 3rd | low — best-effort joins | low | **minutes, silent** | **high** |
The slowest, most-killable, least-valuable stage runs **last**, and it gates the
durability of the two expensive stages held in memory before it.
### 1. Enrichment is lost if relationship detection is interrupted
The queryable artifact agents search and execute against is the `_schema` manifest
YAML (`semantic-layer/<connectionId>/_schema/*.yaml`). It is written **twice**:
- bare (native column comments only) early, at `local-scan.ts:473`
(`writeLocalScanManifestShards`), before enrichment runs; and
- rewritten **with AI descriptions + accepted joins** by
`writeLocalScanEnrichmentArtifacts` (`local-enrichment-artifacts.ts:310`), called
from `local-scan.ts:510` **after** `runLocalScanEnrichment` returns — i.e. after
all three stages.
So the descriptions and embeddings reach the queryable layer only via that single
terminal write. If the process is killed/crashes/times out **during** the
`relationships` stage, `runLocalScanEnrichment` never returns, the terminal write
never runs, and the in-memory descriptions + embeddings are discarded — the
`_schema` retains only the bare native comments from the `:473` write.
Empirically (intake draft): ingesting a 95-table BigQuery dataset produced full
descriptions + embeddings (progress reached "Building embeddings 17/17"), then the
relationship stage ran silently past a supervising deadline and was killed; the
persisted `_schema` had **0** AI descriptions. The most expensive work is the most
likely to be thrown away.
> A stage-state store (below) does save each completed stage's output to an
> internal SQLite cache as the stage finishes — so the descriptions are not lost to
> the *resume cache*. They are simply never **promoted** to the queryable `_schema`
> until the terminal write. The data survives somewhere the agent cannot query, and
> (per failure mode 2) cannot be reused on the next run either.
### 2. Re-running does not resume — it re-spends
`runEnrichmentStage` resolves a completed stage with
`findCompletedStage({ runId, stage, inputHash })` (`local-enrichment.ts:427`), and
the store keys on **`runId`**: `SqliteLocalScanEnrichmentStateStore` declares
`PRIMARY KEY (run_id, stage)` and filters lookups by `run_id`
(`sqlite-local-enrichment-state-store.ts:83,91–115`). `runId` is minted fresh per
ingest invocation (`record.runId`). The cache therefore only resolves *within* one
run; re-running an interrupted ingest gets a new `runId`, misses every cached
stage, and **recomputes descriptions + embeddings from scratch** — re-paying for
LLM work that already succeeded.
The store already computes and persists `inputHash` next to `runId`:
a stable `sha256` of `{ snapshot, mode, detectRelationships, providerIdentity,
relationshipSettings }` (`enrichment-state.ts:78`). The correct content key is
already on the row; the lookup just uses the volatile column. This is a keying
defect, not a missing capability.
### 3. Relationship detection is unobservable and unbounded
`discoverKtxRelationships` (`context/scan/relationship-discovery.ts:218`) profiles a
row sample of **every enabled table** (`profileKtxRelationshipSchema`,
`relationship-profiling.ts:320` — one sampled query per table at
`profileConcurrency`, default 4), validates candidate joins
(`relationship-validation.ts:237` — one coverage query per candidate), and detects
composite keys (`relationship-composite-candidates.ts:515` — per-table plus
cross-table queries). None of the controls the rest of the scan pipeline relies on
were ever wired into this stack:
- **No progress.** `discoverKtxRelationships` does not accept a progress port; the
caller can only emit start/end around it (`local-enrichment.ts:600,611`:
`update(0, 'Detecting relationships')`, then `update(1, 'found N')`). Minutes of
silence between.
- **No honored cancellation.** `KtxScanContext.signal` exists on the contract
(`types.ts`) but **no sub-stage reads it**.
- **No time budget.** Validation has a *count* budget (`validationBudget`, default
`min(2 × tableCount, 1000)`); profiling and composite detection have none. On a
schema with hundreds to thousands of tables, profiling is O(tables) silent queries
with no internal stop condition.
A supervisor watching for liveness cannot tell a slow-but-working profile from a
true hang, and nothing inside the stage will voluntarily stop — so on a very large
schema it runs far past any reasonable deadline and is killed (which, via failure
mode 1, takes the descriptions with it).
## Generic use case (independent of any benchmark)
Any context layer that enriches a real warehouse with paid LLM work must make that
work durable the instant it is produced, resume it across process restarts without
re-paying, and bound the open-ended profiling stage so a large catalog cannot hang
ingest indefinitely. A data team ingesting a 500-table production warehouse over a
flaky connection, a rate-limited LLM budget, or a CI step with a wall-clock limit
hits all three failure modes regardless of any benchmark. This is general
durability and cost hygiene for the ingest pipeline; the benchmark only made it
acute at scale.
## Design decisions (resolved during refinement)
These resolve ambiguities the intake draft left open. They constrain the
implementer; the exact code is theirs (requirement-level, per the specs README).
### D1 — Checkpoint queryable artifacts at the cost boundary, before relationships
As soon as the last non-relationship stage completes — `embeddings` when an
embedding provider is configured, otherwise `descriptions` — persist the
descriptions + embeddings into the **queryable** `_schema` manifest (and the raw
`descriptions.json` / `embeddings.json` enrichment artifacts), **before** the
`relationships` stage runs. The relationship stage then writes its joins on top: the
manifest builder already re-reads and preserves existing descriptions and
manual/inferred joins on rewrite (`loadExistingManifestState`,
`local-enrichment-artifacts.ts:196`), so the second write is additive, not
destructive.
Net invariant: **the descriptions + embeddings are always durable and queryable the
moment they are computed**, even if relationship detection then fails, is
interrupted, is budget-truncated, or is skipped. A failed/partial/skipped
relationship stage degrades to "no joins" or "partial joins" — **never** to "no
descriptions." This is the inverse guarantee the current terminal-write ordering
violates.
The bare `:473` manifest write stays — it is the queryable schema for the
no-providers / enrichment-disabled path. The checkpoint is an additional write that
runs only when enrichment produced descriptions.
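
The ordering D1 asks for can be sketched as below, under stated assumptions: the names are hypothetical, the stages are synchronous only to keep the example short (the real stages are async), and the sketch assumes enrichment did produce descriptions. It shows that a relationship-stage failure after the checkpoint cannot undo the checkpoint.

```typescript
// Hypothetical names throughout; not the actual runLocalScanEnrichment signature.
type StageName = "descriptions" | "embeddings" | "relationships";

interface EnrichmentSketchOptions {
  hasEmbeddingProvider: boolean;
  runStage: (stage: StageName) => void;
  // Invoked once, after the last non-relationship stage and before relationships.
  onCheckpoint: () => void;
}

function runEnrichmentSketch(options: EnrichmentSketchOptions): void {
  options.runStage("descriptions");
  if (options.hasEmbeddingProvider) {
    options.runStage("embeddings");
  }
  // Descriptions (+ embeddings) become queryable here, so a failure in the
  // stage below can no longer discard them.
  options.onCheckpoint();
  options.runStage("relationships");
}

// Usage: the relationship stage fails, yet the checkpoint has already run.
const events: string[] = [];
try {
  runEnrichmentSketch({
    hasEmbeddingProvider: true,
    runStage: (stage) => {
      if (stage === "relationships") throw new Error("killed mid-stage");
      events.push(stage);
    },
    onCheckpoint: () => events.push("checkpoint"),
  });
} catch {
  // Joins are missing; descriptions and embeddings are already persisted.
}
console.log(events);
```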
> Orientation (the implementer owns the seam): the lowest-coupling shape is a
> checkpoint hook — `runLocalScanEnrichment` invokes a caller-supplied callback once
> the last non-relationship stage completes, and `local-scan.ts` supplies a callback
> that calls the existing `writeLocalScanEnrichmentArtifacts` for the
> descriptions + embeddings + manifest only (no generated joins yet). The final
> write after the relationship stage proceeds as today. Relationship-specific
> artifacts (`relationships.json`, `relationship-profile.json`,
> `relationship-diagnostics.json`) are written by the final/relationship write, not
> the checkpoint, so the checkpoint never emits misleading empty relationship
> diagnostics.
>
> Rejected alternative: move all artifact writing inside `runLocalScanEnrichment`
> (inject the file store / project). That couples the enrichment module to
> persistence for no gain — the writer already lives in `local-scan.ts` and the
> checkpoint needs only a one-line hook, not a relocation.
### D2 — Resume by content identity, not by `runId`
Re-key completed-stage resolution on **`(connectionId, stage, inputHash)`**,
independent of `runId`, so a re-run with an unchanged schema and config resumes the
finished `descriptions` / `embeddings` stages from cache and re-runs only what
actually failed. `inputHash` is already the content fingerprint; `connectionId`
scopes it to the right source. When several rows share a content identity (one per
prior run), the most recent `updatedAt` wins.
`runId` stays on the stored row for diagnostics and for `listRunStages`, but leaves
the uniqueness/lookup key.
The state store is a **disposable local resume cache** (`.ktx` local state,
regenerable from a fresh ingest). Re-key it with **no migration bridge** — recreate
the table if its on-disk shape differs from the new `(connection_id, stage,
input_hash)` key, consistent with ktx's no-backward-compatibility policy. Losing the
old cache only means one ingest cannot resume; it never corrupts a queryable
artifact.
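
A minimal in-memory sketch of the re-keyed lookup follows. It stands in for the SQLite store only to show the key and the tie-break; the row shape and function name are assumptions for illustration.

```typescript
// Hypothetical row shape; the real rows live in the SQLite resume cache.
interface StageRowSketch {
  connectionId: string;
  runId: string; // kept for diagnostics, no longer part of the lookup key
  stage: string;
  inputHash: string;
  updatedAt: number; // epoch milliseconds
}

interface ContentKeySketch {
  connectionId: string;
  stage: string;
  inputHash: string;
}

// Resolve by (connectionId, stage, inputHash); the most recent updatedAt wins.
function findCompletedStageSketch(
  rows: StageRowSketch[],
  key: ContentKeySketch,
): StageRowSketch | undefined {
  let latest: StageRowSketch | undefined;
  for (const row of rows) {
    const matches =
      row.connectionId === key.connectionId &&
      row.stage === key.stage &&
      row.inputHash === key.inputHash;
    if (!matches) continue;
    if (latest === undefined || row.updatedAt > latest.updatedAt) latest = row;
  }
  return latest;
}

// Usage: a fresh run still finds the stage an earlier run completed.
const cached: StageRowSketch[] = [
  { connectionId: "bq", runId: "run-1", stage: "descriptions", inputHash: "abc", updatedAt: 100 },
  { connectionId: "bq", runId: "run-2", stage: "descriptions", inputHash: "abc", updatedAt: 200 },
];
console.log(findCompletedStageSketch(cached, { connectionId: "bq", stage: "descriptions", inputHash: "abc" }));
```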
> Rejected alternative: include `syncId` or `mode` in the key. `mode` and the rest
> are already folded into `inputHash`; adding them again would only narrow the key
> and re-break cross-run resume when an incidental field differs.
### D3 — Make the relationship stage observable and bounded
Thread three things the rest of the pipeline already supports through
`discoverKtxRelationships` into profiling, validation, and composite detection:
- **Progress** through the existing progress port (the relationship phase is
already `progress?.startPhase(0.25)` at `local-enrichment.ts:586`): emit per-unit
liveness — "Profiling table K/N", "Validating candidate K/M", and the equivalent
for composite probing — so a supervisor can distinguish slow-but-working from
hung.
- **A flat wall-clock budget** for the whole relationship stage: a new
`scan.relationships.detectionBudgetMs`, a positive integer of milliseconds,
project-level, validated like the other `scan.relationships` fields, **default
600_000 (10 min), enforced by default.** Checked at unit boundaries (before each
table profile, each candidate validation, each composite probe). It sits **above**
spec 16's per-query deadline (default 30s): each individual query is already
bounded; this bounds the *sum* of them.
- **Honored cancellation:** where `KtxScanContext.signal` is available, the same
unit-boundary check honors it, so external cancellation stops the stage too.
On budget exhaustion or abort: stop scheduling new work, let in-flight queries
finish (each already bounded by spec 16), finalize with the relationships found so
far, and return a **partial** result — never an unbounded hang and never an
exception that would lose the checkpointed descriptions.
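
The unit-boundary check can be sketched as a small loop. This is an illustration under assumptions: the names are hypothetical, units are processed synchronously and sequentially (the real profiling runs queries concurrently), and the clock is injected so the behavior is deterministic.

```typescript
// Hypothetical sketch of a flat wall-clock budget checked before each unit.
interface BudgetedResultSketch<T> {
  results: T[];
  partial: boolean;
  reason?: "budget" | "aborted";
}

interface BudgetOptionsSketch {
  budgetMs: number;
  now: () => number; // injected clock, e.g. Date.now
  signal?: { readonly aborted: boolean }; // structurally compatible with AbortSignal
  onProgress?: (done: number, total: number) => void;
}

function runWithinBudgetSketch<U, T>(
  units: U[],
  processUnit: (unit: U) => T,
  options: BudgetOptionsSketch,
): BudgetedResultSketch<T> {
  const deadline = options.now() + options.budgetMs;
  const results: T[] = [];
  for (const unit of units) {
    // Stop scheduling new work; whatever was found so far is returned as partial.
    if (options.signal?.aborted) return { results, partial: true, reason: "aborted" };
    if (options.now() >= deadline) return { results, partial: true, reason: "budget" };
    results.push(processUnit(unit));
    options.onProgress?.(results.length, units.length); // per-unit liveness
  }
  return { results, partial: false };
}

// Usage with a fake clock: each table "costs" 40 ms against a 100 ms budget.
let clock = 0;
const outcome = runWithinBudgetSketch(
  ["a", "b", "c", "d"],
  (table) => {
    clock += 40;
    return `profiled ${table}`;
  },
  { budgetMs: 100, now: () => clock },
);
console.log(outcome);
```

Returning a value instead of throwing is the point of the shape: the caller can persist the partial result and keep the checkpointed descriptions untouched.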
> Rejected alternative — per-table-scaled budget (N seconds × table count). It is a
> second formula to reason about and "more tables → more budget" partly re-opens the
> unbounded door this requirement closes. One flat, generous, project-level number
> matches how the other `scan.relationships` knobs are shaped and is enough for a
> best-effort stage whose partial output is durable and improvable (D4).
>
> Rejected alternative — a global `KTX_RELATIONSHIP_BUDGET_MS` env knob or a
> per-call override. One opinionated project-level default with a config override is
> the canonical ktx shape; no second runtime path.
### D4 — A budget-truncated partial is a successful, cached, completed stage
A graceful budget stop is **not** a failure. The relationship stage saves its
partial result like any completed stage (so a plain re-run resumes it for free, no
re-querying) and marks it `partial` with a reason in the relationship diagnostics
plus a recoverable scan warning. Because `detectionBudgetMs` lives in
`relationshipSettings ⊂ inputHash`, **raising the budget changes the content
identity and triggers a fresh, fuller run** — that is the only "try harder"
mechanism, with no extra flag or runtime path.
Distinguish the two stop kinds:
- **Process killed mid-stage** (crash / SIGKILL / supervisor): nothing is saved as
completed, so the next run recomputes the relationship stage (after resuming
descriptions/embeddings from cache via D2). This is the primary durability path.
- **Graceful budget/abort stop**: a partial *is* saved as completed-partial and
resumed cheaply on re-run, unless the budget is raised.
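
Why raising the budget forces a fresh run can be shown with a content hash over the settings. The sketch below is an assumption-laden illustration, not `computeKtxScanEnrichmentInputHash`: the canonical serialization, the function names, and the example field values are hypothetical; only the idea that `detectionBudgetMs` sits inside the hashed `relationshipSettings` comes from the spec.

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so that key order never changes the hash.
function stableStringifySketch(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) {
    return `[${value.map(stableStringifySketch).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const keys = Object.keys(record).sort();
    const body = keys.map((key) => `${JSON.stringify(key)}:${stableStringifySketch(record[key])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

function inputHashSketch(input: Record<string, unknown>): string {
  return createHash("sha256").update(stableStringifySketch(input)).digest("hex");
}

// Usage: the same settings hash identically; a raised budget yields a new identity.
const before = inputHashSketch({ relationshipSettings: { detectionBudgetMs: 600_000 } });
const after = inputHashSketch({ relationshipSettings: { detectionBudgetMs: 1_200_000 } });
console.log(before === after); // false, so the cached partial is not reused
```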
## Requirements
### 1. Checkpoint descriptions + embeddings before relationship detection
The descriptions and embeddings MUST be persisted into the durable, queryable
`_schema` manifest (and the raw enrichment artifacts) as soon as the last
non-relationship stage completes, before the `relationships` stage runs.
Relationship detection appends/merges its joins on completion. The expensive LLM +
embedding enrichment MUST be queryable even if the relationship stage subsequently
fails, is interrupted, is budget-truncated, or is skipped. A failed/partial/skipped
relationship stage MUST degrade to "no/partial joins," never to "no descriptions."
### 2. Stage resume resolves by content identity across runs
Completed-stage resolution MUST key on `(connectionId, stage, inputHash)`,
independent of `runId`, so re-running an interrupted ingest resumes the finished
`descriptions` / `embeddings` stages from cache and re-runs only what failed.
Re-running after an interruption MUST NOT re-issue LLM description or embedding
calls for stages that already completed. The resume cache MAY be recreated without a
migration bridge if its schema changes (it is disposable local state).
### 3. Relationship detection emits progress and honors a wall-clock budget
The relationship stage MUST emit per-unit progress through the existing progress
port (at minimum per-table during profiling and per-candidate during validation) so
liveness is observable. It MUST enforce a flat wall-clock budget
(`scan.relationships.detectionBudgetMs`, default 600_000 ms, project-level,
overridable, validated as a positive integer) checked at unit boundaries and layered
above spec 16's per-query deadline, and MUST honor `KtxScanContext.signal` where
available. On budget exhaustion or abort it MUST stop scheduling new work, finalize
with the relationships found so far, and return a partial result rather than running
unboundedly or throwing.
### 4. A budget-truncated relationship result is durable and marked partial
A graceful budget/abort stop MUST persist the partial relationship result as a
completed stage (so a plain re-run resumes it without re-querying) and MUST mark it
`partial` — in the relationship diagnostics artifact and as a recoverable scan
warning — so downstream consumers can see the joins are incomplete. Raising
`detectionBudgetMs` (which changes `inputHash`) MUST cause a fresh, fuller
relationship run; no separate flag is introduced for "redo." A process killed
mid-stage MUST NOT leave a completed record (so it recomputes on re-run).
### 5. No regression for small or uninterrupted ingests
A small or single-run ingest that is never interrupted MUST produce the same
artifacts and the same relationship output as today. The checkpoint write MUST be
idempotent with the final write (descriptions survive the join rewrite); the budget
default MUST be generous enough that normal and large-but-tractable schemas complete
relationship detection fully, hitting the budget only on pathological scale.
## Acceptance criteria
- **Durability across interruption:** interrupting an ingest **during** relationship
detection still leaves a queryable semantic layer carrying the table/column
descriptions + embeddings that were generated (verified: re-open the connection;
AI descriptions are present in `_schema`, not just native comments).
- **Resume does not re-spend:** re-running an interrupted ingest does **not**
regenerate descriptions/embeddings whose stage already completed (verified: no LLM
description calls and no embedding calls for the cached tables; only the failed
stage re-runs). Resolution is by `(connectionId, stage, inputHash)`, so the resume
survives a fresh `runId`.
- **Observable + bounded relationships:** a connection with hundreds of tables emits
relationship-stage progress (per-table profiling, per-candidate validation) and
completes within `detectionBudgetMs`; when the budget is hit, the stage stops
gracefully and persists the partial relationships found so far — without
discarding enrichment — marked `partial` in diagnostics and via a recoverable
warning.
- **Partial is cached and improvable:** re-running with an unchanged budget resumes
the partial relationship result from cache (no re-querying); raising
`detectionBudgetMs` triggers a fresh, fuller relationship run.
- **Budget validation:** `detectionBudgetMs` defaults to 600_000, honors a project
override, and rejects an invalid value (zero / negative / non-integer) as a clear
`ktx.yaml` config error.
- **No regression:** small/single-run ingests behave exactly as before — identical
artifacts and relationship output when nothing is interrupted; the checkpoint +
final writes leave descriptions intact alongside the generated joins.
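
The budget-validation criterion can be sketched without any schema library. The orientation notes place the real check in the zod config schema; the dependency-free function below is a hypothetical stand-in that only illustrates the accepted and rejected values.

```typescript
// Hypothetical stand-in for the config-schema check on scan.relationships.detectionBudgetMs.
const DEFAULT_DETECTION_BUDGET_MS = 600_000;

function resolveDetectionBudgetMsSketch(configured: unknown): number {
  if (configured === undefined) return DEFAULT_DETECTION_BUDGET_MS;
  const valid =
    typeof configured === "number" && Number.isInteger(configured) && configured > 0;
  if (!valid) {
    throw new Error(
      `scan.relationships.detectionBudgetMs must be a positive integer of milliseconds, got ${JSON.stringify(configured)}`,
    );
  }
  return configured as number;
}

console.log(resolveDetectionBudgetMsSketch(undefined)); // default applies
console.log(resolveDetectionBudgetMsSketch(1_200_000)); // project override honored
```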
## Non-goals
- **Bounding the descriptions stage's per-table LLM call.** Whether an individual
enrichment LLM call can wedge is a separate concern (already being addressed in the
working tree via a per-table enrichment timeout). This spec ensures whatever
descriptions *did* complete are durable; it does not own the per-call timeout.
- **Changing relationship-detection quality, thresholds, or the candidate/validation
algorithm.** The accept/review thresholds, scoring, and the existing
`validationBudget` count cap are unchanged; this spec adds durability,
cross-run resume, progress, and a time budget around them.
- **A per-connection or per-call relationship budget, or a global env override.**
One flat project-level `detectionBudgetMs`; no second runtime path (D3).
- **A new per-query timeout.** Spec 16 already bounds individual queries; this spec
composes above it and does not re-implement query-level deadlines.
- **Replacing the per-query deadline with the stage budget, or vice versa.** They
are independent and layered: a single query is bounded by spec 16; the stage's sum
is bounded by `detectionBudgetMs`.
- **A general checkpoint framework for every ingest stage.** The checkpoint is
specifically the descriptions+embeddings → queryable-manifest promotion before
relationships; it is not a generic per-stage artifact-flush abstraction.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns the
design.
- **Enrichment orchestration** (`context/scan/local-enrichment.ts`):
`runLocalScanEnrichment` (`:472`), the three `runEnrichmentStage` calls
(`descriptions` `:524`, `embeddings` `:553`, `relationships` `:587`),
`runEnrichmentStage` (`:413`) and its `findCompletedStage` lookup (`:427`). Add the
checkpoint hook after the last non-relationship stage; thread the progress port,
signal, and budget into the relationship stage.
- **Scan driver / write ordering** (`context/scan/local-scan.ts`): bare manifest
write (`:473`), enrichment call (`:492`, currently passing only
`{ runId, progress }` as `context` — wire `signal` through here too), terminal
`writeLocalScanEnrichmentArtifacts` (`:510`), and the enrichment-failure catch
(`:530`, which after D1 no longer loses descriptions). Supply the checkpoint
callback here.
- **Artifact writer** in `context/scan/local-enrichment-artifacts.ts`:
`writeLocalScanEnrichmentArtifacts` (`:310`), `writeLocalScanManifestShards`
(`:270`), and the description-preserving merge in `loadExistingManifestState`
(`:196`) — the basis for the additive checkpoint/final write.
- **Resume cache** in `context/scan/sqlite-local-enrichment-state-store.ts`:
`PRIMARY KEY (run_id, stage)` (`:83`), `findCompletedStage` (`:91`),
`saveCompletedStage` (`:117`). Re-key on `(connection_id, stage, input_hash)`,
pick latest `updated_at`, recreate the table if shape differs (disposable cache).
Lookup interface `KtxScanEnrichmentStageLookup` and `findCompletedStage`
in `context/scan/enrichment-state.ts` (`:10,46`); `computeKtxScanEnrichmentInputHash`
(`:78`).
- **Relationship stack (progress + budget + signal)**:
`context/scan/relationship-discovery.ts` (`discoverKtxRelationships` `:218`, accept
a progress port and budget/deadline + signal),
`context/scan/relationship-profiling.ts` (`profileKtxRelationshipSchema` `:320`,
per-table progress + budget check),
`context/scan/relationship-validation.ts` (`validateKtxRelationshipDiscoveryCandidates`
`:237` — per-candidate progress + budget check, alongside the existing
`validationBudget`),
`context/scan/relationship-composite-candidates.ts`
(`discoverKtxCompositeRelationships` `:515` — budget check).
- **Config**: `context/project/config.ts` `scan.relationships`
(`KtxScanRelationshipConfig`, `:171–213`): add `detectionBudgetMs` (positive
integer ms, default 600_000) to the zod schema and the default config builder.
- **Partial marker**: `context/scan/relationship-diagnostics.ts`
(`buildKtxRelationshipDiagnostics`, the profile/diagnostics artifact shape) carries
a `partial` flag + reason; add a recoverable warning code to the
`KtxScanWarningCode` union in `context/scan/types.ts` (e.g.
`relationship_detection_partial`).
- **Tests** — durability: a fixture ingest interrupted during the relationship stage
leaves AI descriptions in the queryable `_schema`. Resume: a second run with a
fresh `runId` and unchanged `inputHash` resolves the cached descriptions/embeddings
(assert no LLM/embedding calls) and re-runs only relationships. Budget: a schema
large enough (or a tiny `detectionBudgetMs` as the test seam) hits the budget,
emits per-unit progress, returns partial, persists it marked `partial`, and a
re-run resumes the partial; raising the budget re-runs. Resolver/config unit tests
for `detectionBudgetMs` (default / override / invalid). Regression: small
uninterrupted ingest yields identical artifacts and relationship output.
- After implementing, rebuild and re-link so the playground picks it up:
`pnpm run build && pnpm run link:dev`.
## Benchmark context (motivation, not a requirement)
The Spider 2.0-Lite BigQuery slice has datasets with hundreds to thousands of tables
(`ebi_chembl` 785, `fec` 486, `ga360` 366, …). Enriching them with claude-code
costs real, rate-limited LLM budget; losing that enrichment to a relationship-stage
interruption — and re-spending it on every retry — makes large-schema ingest
impractical, and an unbounded profiling stage runs past any supervising deadline and
is killed. This is a general durability/cost property of the ingest pipeline,
independent of the benchmark; the benchmark only made it acute at scale. Do not
encode any benchmark specifics in the implementation.
## Implementation notes
Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
four design decisions shipped; no deviations from the resolved design.
**D2 — resume by content identity** (`sqlite-local-enrichment-state-store.ts`,
`enrichment-state.ts`, `local-enrichment.ts`): the stage table is re-keyed to
`PRIMARY KEY (connection_id, stage, input_hash)`; `findCompletedStage` looks up by
`(connectionId, stage, inputHash)` ordered by `updated_at DESC` (most recent
content identity wins). `KtxScanEnrichmentStageLookup.runId` became `connectionId`;
`runId` stays on the row for diagnostics/`listRunStages`. The store drops and
recreates the table when the on-disk primary key differs (disposable cache, no
migration bridge), detected via `PRAGMA table_info`.
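A minimal sketch of the content-identity lookup described in D2, using an in-memory array in place of the SQLite table. The `StageRow` shape and the `InMemoryStageStore` class are hypothetical stand-ins, not the shipped store; only the key `(connectionId, stage, inputHash)` and the "latest `updated_at` wins" ordering come from the notes above.

```typescript
// Hypothetical row shape; the real table carries more columns.
interface StageRow {
  connectionId: string;
  stage: string;
  inputHash: string;
  runId: string; // kept for diagnostics only, not part of the key
  updatedAt: number; // epoch milliseconds
  payload: string;
}

// In-memory stand-in for the SQLite-backed stage store (assumption).
class InMemoryStageStore {
  private rows: StageRow[] = [];

  // Upsert on (connectionId, stage, inputHash), mirroring the re-keyed primary key.
  saveCompletedStage(row: StageRow): void {
    this.rows = this.rows.filter(
      (r) =>
        !(
          r.connectionId === row.connectionId &&
          r.stage === row.stage &&
          r.inputHash === row.inputHash
        ),
    );
    this.rows.push(row);
  }

  // The lookup ignores runId, so a fresh run with unchanged inputs hits the cache.
  findCompletedStage(connectionId: string, stage: string, inputHash: string): StageRow | null {
    const matches = this.rows
      .filter(
        (r) => r.connectionId === connectionId && r.stage === stage && r.inputHash === inputHash,
      )
      .sort((a, b) => b.updatedAt - a.updatedAt); // most recent first
    return matches.length > 0 ? matches[0] : null;
  }
}
```

Because `runId` is absent from the key, a second run with a new `runId` and the same `inputHash` resolves the earlier row, while any change to `inputHash` misses and forces a recompute.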
**D3 — observable + bounded relationship stage** (new
`relationship-detection-budget.ts`): a sticky `KtxRelationshipDetectionBudget`
(`check()`/`stopReason()`) built from `detectionBudgetMs` + `ctx.signal` + an
injectable `now`, plus `mapWithBudget` (a budget-aware concurrent map that
generalizes and replaces the old `mapWithConcurrency`). Threaded through
`discoverKtxRelationships` → profiling (per-table progress + budget stop),
validation (per-candidate progress + budget stop; budget-skipped candidates
degrade to the existing `validation_unattempted` review), and composite
detection (budget stops at PK-detection and coverage-probe boundaries).
`discoverKtxRelationships` now accepts `progress` and `now` and returns
`partial: { reason } | null`. The clock check fires only when work remains, so a
deadline elapsing after the last unit never marks a fully-processed stage partial.
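The sticky budget and the "check only when work remains" rule can be sketched as follows. This is a sequential simplification under stated assumptions: the class name `DetectionBudget`, the `SignalLike` interface, and the reason strings are hypothetical, and the real `mapWithBudget` is concurrent rather than a plain loop.

```typescript
type StopReason = "budget_exhausted" | "aborted";

// Minimal stand-in for AbortSignal so the sketch has no runtime dependencies.
interface SignalLike {
  aborted: boolean;
}

class DetectionBudget {
  private deadline: number;
  private now: () => number;
  private signal: SignalLike | null;
  private reason: StopReason | null = null;

  constructor(budgetMs: number, now: () => number, signal?: SignalLike) {
    this.now = now;
    this.signal = signal === undefined ? null : signal;
    this.deadline = now() + budgetMs;
  }

  // Returns true while work may continue. Sticky: once stopped, it stays stopped.
  check(): boolean {
    if (this.reason !== null) return false;
    if (this.signal !== null && this.signal.aborted) this.reason = "aborted";
    else if (this.now() >= this.deadline) this.reason = "budget_exhausted";
    return this.reason === null;
  }

  stopReason(): StopReason | null {
    return this.reason;
  }
}

// Sequential sketch of a budget-aware map (the shipped helper is concurrent).
function mapWithBudget<T, R>(
  items: T[],
  budget: DetectionBudget,
  fn: (item: T) => R,
): { results: R[]; partial: { reason: StopReason } | null } {
  const results: R[] = [];
  for (const item of items) {
    // The check runs only when a unit remains, so a deadline that elapses
    // after the last unit never marks a fully processed run partial.
    if (!budget.check()) {
      return { results, partial: { reason: budget.stopReason() as StopReason } };
    }
    results.push(fn(item));
  }
  return { results, partial: null };
}
```

Injecting `now` keeps the budget testable with a fake clock, which is how a tiny budget can serve as a test seam without real waiting.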
**D1 — checkpoint before relationships** (`local-enrichment.ts`,
`local-enrichment-artifacts.ts`, `local-scan.ts`): `runLocalScanEnrichment` fires a
caller-supplied `onCheckpoint` once descriptions/embeddings complete and before
the relationship stage runs, gated on `shouldDetectRelationships` so the
no-relationship path keeps a single write. `local-scan.ts` supplies a callback
calling the new `writeLocalScanEnrichmentCheckpoint` (descriptions.json +
embeddings.json + manifest with descriptions and no generated joins — no
relationship artifacts, so no misleading empty diagnostics). The shared
description/embedding JSON writer was factored out so checkpoint and final writes
stay one implementation. `ctx.signal` is now threaded from `RunLocalScanOptions`
into the enrichment context (completing the existing `KtxScanContext.signal`
contract already read by the budget and the in-flight description timeout).
**D4 — partial is durable + marked** (`relationship-diagnostics.ts`,
`local-enrichment.ts`, `local-enrichment-artifacts.ts`): the diagnostics artifact
carries `partial` + `partialReason`; `runLocalScanEnrichment` pushes a recoverable
`relationship_detection_partial` warning (new `KtxScanWarningCode`) when truncated.
A graceful budget/abort stop returns normally, so the relationship stage saves as a
completed-partial record and resumes cheaply; a process killed mid-stage saves
nothing and recomputes. Raising `detectionBudgetMs` changes `inputHash`
(it lives in `relationshipSettings`), forcing a fresh, fuller run — the only
"try harder" mechanism, no extra flag.
**Config** (`config.ts`): `scan.relationships.detectionBudgetMs`, positive integer
ms, default `600_000`, validated like the other relationship fields. Documented in
`docs-site/content/docs/configuration/ktx-yaml.mdx`.
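The default / override / invalid behavior of `detectionBudgetMs` can be illustrated with a hand-rolled resolver. The shipped code validates through the zod schema in `config.ts`; the function name `resolveDetectionBudgetMs` and the error wording below are assumptions made for the sketch.

```typescript
const DEFAULT_DETECTION_BUDGET_MS = 600_000;

// Hypothetical resolver: returns the default when unset, the value when it is
// a positive integer, and throws otherwise.
function resolveDetectionBudgetMs(raw: unknown): number {
  if (raw === undefined) return DEFAULT_DETECTION_BUDGET_MS;
  if (typeof raw !== "number" || !isFinite(raw) || Math.floor(raw) !== raw || raw <= 0) {
    throw new Error(
      `scan.relationships.detectionBudgetMs must be a positive integer (ms), got ${String(raw)}`,
    );
  }
  return raw;
}
```

A string such as `"600"` is rejected rather than coerced in this sketch; whether the real schema coerces is not stated in the notes above.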
**Tests** (all green): budget unit tests (`relationship-detection-budget.test.ts`);
cross-run resume + table-recreate (`enrichment-state.test.ts`,
`local-enrichment.test.ts`); progress/budget/abort partial
(`relationship-discovery.test.ts`); partial persisted/resumed/re-run-on-raise +
checkpoint ordering + no-checkpoint-when-skipped (`local-enrichment.test.ts`);
end-to-end durability — a relationship-stage failure still leaves AI descriptions
in the queryable `_schema` (`local-scan.test.ts`); diagnostics partial flag
(`relationship-diagnostics.test.ts`); config default/override/invalid
(`config.test.ts`). `pnpm --filter @kaelio/ktx type-check`, `pnpm run dead-code`,
and `pnpm run build && pnpm run link:dev` all pass. (Pre-existing and unrelated:
three `analytics-skill-content.test.ts` markdown-structure assertions fail on this
branch from earlier analytics-skill commits — untouched here.)


@ -1,533 +0,0 @@
# Resilient enrichment under a slow/hung LLM backend
> Refined spec. Intake draft: `todo/20-resilient-enrichment-under-slow-llm.md`.
>
> **Scope: make the descriptions enrichment stage survive a hung LLM backend and
> an interrupted run.** Two compounding gaps live *inside* the per-table
> description-enrichment path: (1) the per-table LLM timeout fires in JS but does
> not terminate a wedged subprocess backend, so a hung table wedges the whole
> stage indefinitely; (2) descriptions are persisted only at full-stage
> completion, so any interruption discards every already-enriched table. This is
> the enrichment-stage analog of spec 16 (enforced query cancellation — a deadline
> that *stops the work*, not just abandons the promise) and spec 19 (move the
> durability boundary to the cost boundary so expensive LLM work is not lost). It
> composes with both rather than replacing them.
## Problem
Two compounding failure modes on the per-table description-enrichment path, both
confirmed in the current code and observed end-to-end together. Their union turned
a single hung table into an indefinite wedge *plus* total loss of an entire
stage's LLM work.
### 1. The per-table LLM timeout does not terminate the work
`KtxDescriptionGenerator.generateBatchedTableDescriptions`
(`context/scan/description-generation.ts`, the bounded call ~760–866) wraps the
per-table `this.llmRuntime.generateObject(...)` call in `retryAsync` with a fresh
`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (commit `01f63380`).
A fired timeout is surfaced as `KtxAbortedError` so it is **not** retried (one
wedge stays one timeout, not 3×). That is the correct policy — but the abort never
actually stops a subprocess backend, so the timeout is cosmetic.
The runtime is selected by the `backend` config field
(`context/llm/local-config.ts`, `KTX_LLM_BACKENDS =
['none','anthropic','vertex','gateway','claude-code','codex']`). Two backends spawn
a **child process the SDK owns** and to which ktx hands only an `AbortSignal`:
- **`codex`** (`@openai/codex-sdk`, via `context/llm/codex-runtime.ts` and
`codex-sdk-runner.ts`): the SDK runs `spawn(executable, args, { signal })`. Node's
`spawn` signal-option sends the child **SIGTERM** (not SIGKILL) on abort, and the
SDK consumes the child's stdout with `for await (const line of rl)`, re-throwing
the abort error **only after that loop ends**. A child wedged on a hung provider
socket survives SIGTERM → its stdout never closes → the readline loop never ends
→ the SDK never throws → ktx's `await generateObject` **never settles**, past the
per-attempt timeout, indefinitely. The child leaks (open provider connections,
~0% CPU).
- **`claude-code`** (`@anthropic-ai/claude-agent-sdk`, via
`context/llm/claude-code-runtime.ts`, `collectResult` ~275–322): on abort it calls
best-effort `queryResult.interrupt?.()` (errors swallowed) and only checks
`throwIfAborted` **between** streamed messages. A wedged child emits no message, so
the `for await (const message of queryResult)` loop blocks and the graceful
`interrupt()` may never land — the same hang class.
By contrast, **HTTP backends** (`anthropic`/`vertex`/`gateway`/`openai`, via
`context/llm/ai-sdk-runtime.ts`) pass `abortSignal` straight to the AI SDK's
`generateObject`, which cancels the underlying `fetch` natively — the await settles
promptly and there is no child to leak.
So ktx holds **no kill handle** on the subprocess backends, and SIGTERM is too
gentle for a wedged child. Spec 16's mechanism (ktx *itself* forks
`read-query-child` and `SIGKILL`s it) works precisely because ktx owns the fork —
which it does not here.
Observed (BigQuery ingest, codex backend, 2026-06-23): with
`KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min, an operator override), two of
`covid19_usa`'s 252-column tables hung; the stage sat at **268/285 for 41+
minutes** — well past the 30-min per-attempt timeout — with exactly two codex
children, each holding 3 ESTABLISHED connections at ~0% CPU, until killed by hand.
### 2. Descriptions are persisted only at full-stage completion
`generateDescriptions` (`context/scan/local-enrichment.ts` ~279–352) fans out
per-table work through `pLimit(DESCRIPTION_TABLE_CONCURRENCY)` (default 4) and
**accumulates every table's result in an in-memory `updates` array**, returned only
when the whole stage finishes. `runEnrichmentStage` (~413, ~421–474) then calls
`saveCompletedStage` (writing the whole-stage row to `local_scan_enrichment_stages`)
**after** `compute()` returns, and the spec-19 checkpoint write
(`writeLocalScanEnrichmentCheckpoint`, `local-enrichment-artifacts.ts` ~351–379,
fired by the `onCheckpoint` hook in `local-scan.ts`) also runs **only once the
descriptions stage completes**. There is no within-stage persistence: while the
stage runs, every enriched table's description lives only in memory.
So if the stage cannot complete — 2 of 285 tables hang (gap #1), or the process is
killed, or a supervising watchdog fires — **all** already-enriched tables are lost,
even though their (expensive, paid) LLM descriptions were finished. On the next run,
`findCompletedStage` finds no row, so the descriptions stage **recomputes from
scratch**.
Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but
**0** rows in `local_scan_enrichment_stages` and **0** `ai:` descriptions on disk;
killing the wedged ingest discarded all 283, forcing a from-scratch re-ingest. The
cost of 2 pathological tables was 283 tables' worth of redone LLM calls.
Sharper still (re-ingest with a short, *enforced* timeout): even when the stage
**runs to the end** — the 2 hung tables hit their timeout and were skipped, so
**283/285** descriptions were generated and the ingest reported success (`Scan
completed` / `Ingest finished`, embeddings built, exit 0) — the descriptions were
**still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the loss is **not**
only "discarded on kill": a stage that completes with *any* skipped/aborted table
threw away **every** successfully-generated description. The skip must be
**graceful** — a skipped table costs one missing description, not the entire stage's
output — which is the strongest argument for per-table incremental persistence: the
283 good descriptions should have been durable the moment each was produced.
The on-disk artifacts already carry everything needed to fix this *additively*: the
`_schema` manifest encodes per-table completion (a table with `descriptions.ai` is
AI-enriched), and rewrites preserve existing descriptions
(`mergeDescriptionsPreservingExternal`, `manifest.ts` ~96–115;
`loadExistingManifestState`, `local-enrichment-artifacts.ts` ~196–253, the basis
spec 19 relies on). The durable record and the resume-skip set can be **derived from
the system's own on-disk state**, with no new cache schema.
## Generic use case (independent of any benchmark)
Anyone ingesting a large or wide schema with an LLM enrichment backend —
especially a **subprocess** backend, the common local/desktop setup — will
eventually hit a table whose description call hangs: a provider stall, a rate-limit
black-hole, a pathologically large prompt. Without an *enforced* timeout, one such
table wedges the entire ingest indefinitely and leaks the spawned child; without
*incremental* persistence, any interruption throws away all the per-table LLM work
already done — the dominant ingest cost. Both fixes make large-schema enrichment
**resilient and resumable**: a few bad tables degrade to a few skipped
descriptions, not a hung process and a from-scratch redo. This is core robustness
for a general-purpose ingestion product, wholly independent of any benchmark.
## Design decisions (resolved during refinement)
These resolve ambiguities the intake draft left open. They constrain the
implementer; the exact code is theirs (requirement-level, per the specs README).
### D1 — One bounded-call guarantee; enforcement follows the backend's nature
The canonical contract is a single guarantee for the per-table enrichment call:
**the in-flight work terminates and ktx's await settles within the per-table
deadline plus a small grace, on every backend.** How that guarantee is met follows
from a structural property of the configured backend — *does it own a subprocess?*
— not from a hand-maintained list of provider names:
- **Subprocess-backed (`codex`, `claude-code`):** the SDK's own abort is
insufficient (SIGTERM-only, and ktx has no kill handle), so ktx runs the call
behind a **boundary it can hard-kill** — a short-lived ktx-owned child process,
made a **process-group leader** (`detached`). The SDK's grandchild (the
`codex`/`claude` binary) inherits that group. On deadline (or `ctx.signal`), ktx
**tree-kills the whole group with SIGKILL** — reaping the wrapper *and* the
grandchild — and rejects promptly. This mirrors spec 16's child-process +
SIGKILL mechanism, extended by the critical step that **killing the immediate
child is not enough**: the grandchild would otherwise orphan to init and keep its
provider connections. Killing the group is the real fix.
- **HTTP-backed (`anthropic`/`vertex`/`gateway`/`openai`):** unchanged. The existing
in-process `abortSignal` → `fetch` cancellation already satisfies the contract:
the await settles promptly and there is no subprocess to leak. Routing these
through a subprocess would pay fork + IPC + credential-passing cost for no benefit.
> The branch on "subprocess-backed?" is behavior following from an input the backend
> declares about itself, not vendor enumeration — the same guarantee is reached two
> ways because the backends differ structurally. This matches the intake's own split
> ("subprocess SIGKILL for process-backed; request abort for HTTP-backed").
>
> Rejected alternative — a *settle-only race* (reject ktx's promise on the deadline
> regardless of the SDK, but leave the SDK's child running). It unwedges the stage
> but leaves the orphaned child holding provider connections — the exact leak the
> incident showed — so it fails the intake's "actually cancelled" requirement and
> compounds over a long ingest that hits several hung tables.
>
> Rejected alternative — a *persistent ktx subprocess pool* hosting the runtime,
> killed and respawned on timeout. Terminate-on-deadline destroys the worker, so a
> pool needs respawn + in-flight job-tracking for no benefit: the enrichment call is
> low-frequency relative to its own latency and already concurrency-bounded (4), so
> one short-lived child per call (spec 16's resolved choice) is simpler and as fast.
**Portability.** ktx supports Windows, where POSIX process groups and
`process.kill(-pgid, …)` do not exist. The tree-kill MUST be portable: a detached
process group + `kill(-pgid, 'SIGKILL')` on POSIX, and a tree-terminating
equivalent on Windows (e.g. `taskkill /pid <pid> /T /F` or a job object) so the
grandchild is reaped on every platform the subprocess backends run on.
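A sketch of the platform branch for the tree-kill, kept as a pure function so it can be unit-tested without spawning anything. The `TreeKillPlan` type and `planTreeKill` name are hypothetical; the sketch assumes the wrapper child was spawned with `detached: true` on POSIX so that its pid is also its process-group id.

```typescript
type TreeKillPlan =
  | { kind: "signal"; target: number; signal: "SIGKILL" }
  | { kind: "command"; file: string; args: string[] };

// `platform` takes the same values as Node's `process.platform`.
function planTreeKill(pid: number, platform: string): TreeKillPlan {
  if (pid <= 0 || Math.floor(pid) !== pid) {
    throw new Error(`invalid pid: ${pid}`);
  }
  if (platform === "win32") {
    // /T terminates the process and its child tree, /F forces termination.
    return { kind: "command", file: "taskkill", args: ["/pid", String(pid), "/T", "/F"] };
  }
  // A negative target addresses the whole process group led by `pid`,
  // which reaches the SDK grandchild as well as the wrapper.
  return { kind: "signal", target: -pid, signal: "SIGKILL" };
}
```

The caller would execute a `signal` plan with `process.kill(plan.target, plan.signal)` and a `command` plan by spawning `plan.file` with `plan.args`; a job object is the alternative the paragraph above mentions for Windows and is not covered here.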
### D2 — Default stays moderate and the retry/skip policy is unchanged
The per-table timeout default stays **120s** (`KTX_ENRICH_LLM_TIMEOUT_MS`), with the
existing per-attempt retry (`KTX_ENRICH_LLM_ATTEMPTS`, default 3) and the
no-retry-on-timeout policy. A hung table costs **at most one timeout**, then the
table is skipped with the existing `enrichment_timeout` warning and the stage
proceeds. The 30-min value in the incident was an operator stopgap chosen *because*
the timeout was cosmetic; once D1 makes the timeout actually terminate the work, a
long timeout is strictly worse for a hang (a hang costs the full timeout), so the
moderate default is the correct operating point. The retry loop stays in
`description-generation.ts`: each attempt runs through the bounded boundary (D1), so
a transient backend error retries while a timeout surfaces as `KtxAbortedError` and
does not.
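The retry/skip policy in D2 reduces to a small decision loop. The sketch below is a synchronous simplification with hypothetical names (`AttemptResult`, `enrichWithRetry`): the real loop is asynchronous and signals a timeout by throwing `KtxAbortedError`, which is modeled here as a `"timeout"` status instead.

```typescript
type AttemptResult<T> =
  | { status: "ok"; value: T }
  | { status: "timeout" }
  | { status: "transient" };

type TableOutcome<T> =
  | { status: "ok"; value: T; attempts: number }
  | { status: "skipped"; warning: "enrichment_timeout" | "enrichment_failed"; attempts: number };

function enrichWithRetry<T>(
  maxAttempts: number,
  attempt: (attemptNumber: number) => AttemptResult<T>,
): TableOutcome<T> {
  for (let n = 1; n <= maxAttempts; n++) {
    const result = attempt(n);
    if (result.status === "ok") {
      return { status: "ok", value: result.value, attempts: n };
    }
    if (result.status === "timeout") {
      // A timeout is terminal: retrying a wedged table would multiply the hang's cost.
      return { status: "skipped", warning: "enrichment_timeout", attempts: n };
    }
    // Transient error: fall through and try again.
  }
  return { status: "skipped", warning: "enrichment_failed", attempts: maxAttempts };
}
```

With the default of 3 attempts, a hung table therefore costs one timeout, while a table that fails transiently three times is skipped with `enrichment_failed`.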
> Not introducing a new `ktx.yaml` config field for the timeout. The existing env
> override is the tuning seam; adding a per-connection/per-call/global knob would
> multiply the runtime surface for no stated need (one opinionated default + the
> existing env override is the canonical ktx shape).
### D3 — Persist descriptions incrementally; derive the resume-skip set from on-disk state
During the descriptions fan-out, flush completed tables **per batch** (every N
tables / on a timer, at a cadence that bounds the at-risk window) to the durable
on-disk artifacts, reusing spec 19's additive write:
- the raw descriptions artifact (`descriptions.json`) is the **resume-skip source**;
- the `_schema` manifest is updated additively (`mergeDescriptionsPreservingExternal`
preserves prior `ai:`/`db:`/external keys) so finished descriptions are also
**queryable** the moment they are computed — the spec-19 invariant, one level
deeper. The implementer MAY bound manifest-rewrite cost on huge schemas by
rewriting only changed shards.
On resume, `generateDescriptions` reads the existing record, **skips any table
already enriched**, computes only the remainder, and returns the merged full set so
the embeddings stage, the checkpoint write, and the stage-store row all see a
complete result exactly as today.
**The skip is `inputHash`-gated**, preserving spec 19's recompute semantics. The
durable record is tagged with the descriptions stage's `inputHash`
(`computeKtxScanEnrichmentInputHash`). Resume reuses it to skip tables **only when
the current `inputHash` matches** — a genuine resume-after-interruption of the same
content identity. A changed `inputHash` (schema or enrichment settings changed)
ignores the prior record for skipping and recomputes the stage as today; the
manifest write stays additive regardless. The artifact's on-disk shape may gain the
`inputHash` tag with **no migration bridge** (ktx owns the artifact; a stale-shaped
record simply forces one non-incremental run), consistent with ktx's
no-backward-compatibility policy.
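The `inputHash`-gated skip can be sketched as a pure planning step. The `DurableDescriptions` shape and the `planDescriptionResume` name are hypothetical; the actual on-disk layout of `descriptions.json` is left to the implementer.

```typescript
// Hypothetical shape of the durable descriptions record.
interface DurableDescriptions {
  inputHash: string;
  tables: { [tableName: string]: string }; // table name -> AI description
}

function planDescriptionResume(
  allTables: string[],
  currentInputHash: string,
  prior: DurableDescriptions | null,
): { reuse: { [tableName: string]: string }; remaining: string[] } {
  // A missing record or a different content identity means nothing is skipped.
  const usable: { [tableName: string]: string } =
    prior !== null && prior.inputHash === currentInputHash ? prior.tables : {};
  const reuse: { [tableName: string]: string } = {};
  const remaining: string[] = [];
  for (const table of allTables) {
    if (Object.prototype.hasOwnProperty.call(usable, table)) {
      reuse[table] = usable[table];
    } else {
      remaining.push(table);
    }
  }
  return { reuse, remaining };
}
```

The stage would issue LLM calls only for `remaining` and return `reuse` merged with the new results, so downstream stages still see the full set.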
> The skip set is **derived from the artifacts ktx already writes**, not from a new
> per-table cache table. The manifest's `ai:` field already encodes "this table is
> enriched"; a parallel per-table SQLite record would be a second source of truth for
> the same fact and would drift. The whole-stage `local_scan_enrichment_stages` row is
> still written at stage completion (it remains the stage-level resume gate — a clean
> re-run skips the descriptions stage as today); the incremental record only matters
> when the stage did **not** complete — exactly the case where no row exists and
> `compute()` re-runs.
### D4 — A killed-mid-stage run is durable; resume is cheap
A process killed mid-stage (gap #1 wedge, SIGKILL, crash, supervisor) leaves the
per-batch-flushed tables durable on disk. The next run resumes the descriptions
stage (no completed `local_scan_enrichment_stages` row → `compute()` runs again),
but `generateDescriptions` now **re-issues LLM calls only for the unfinished
tables**. A failed/skipped table (timeout or exhausted retries) is left for the
remainder set and is retried on the next resume — never silently treated as done.
## Requirements
### 1. The per-table enrichment timeout is enforced for subprocess backends
When the per-table deadline fires (or `ctx.signal` aborts) on a subprocess-backed
backend (`codex`, `claude-code`), the in-flight LLM work — the spawned child **and
its descendants** — MUST be terminated (SIGKILL of the process group / tree), and
ktx's `generateObject` await MUST settle within the deadline plus a small bounded
grace. A hung table MUST cost at most ~one timeout of wall-clock, never unbounded.
The termination MUST be portable across the platforms the subprocess backends run on
(POSIX process-group kill and a Windows tree-kill equivalent). HTTP-backed backends
keep their existing native `abortSignal` → `fetch` cancellation; the guarantee is one
contract met two ways, branching on the backend's structural "owns a subprocess"
property, not on a list of provider names.
### 2. The timeout default and retry/skip policy are unchanged
The default per-table timeout stays moderate (current 120s, `KTX_ENRICH_LLM_TIMEOUT_MS`),
with the existing per-attempt retry (default 3, `KTX_ENRICH_LLM_ATTEMPTS`) and the
no-retry-on-timeout policy. On timeout, the table is skipped with the existing
`enrichment_timeout` recoverable warning and the stage proceeds. No new
per-connection / per-call / global timeout knob is added.
### 3. Descriptions are persisted incrementally during the stage
Enriched descriptions MUST be flushed to the durable on-disk artifacts **per batch**
(per-table or per-N-tables / on a timer) during the descriptions stage, at a cadence
that bounds the at-risk window to a small number of tables. The flush MUST be
idempotent and additive (never clobber a prior `ai:` description; preserve `db:` and
external keys via the existing merge). Finished tables MUST remain durable even if the
stage never completes — is wedged, killed, or interrupted. A failed/skipped
relationship/embedding stage or a killed descriptions stage MUST NOT lose the
descriptions already flushed.
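A minimal sketch of an idempotent, additive flush for one table's description keys. This is not `mergeDescriptionsPreservingExternal`; the `mergeFlushed` name and the exact rule (fill `ai` only when absent, copy every other key through untouched) are assumptions chosen to illustrate requirement 3.

```typescript
type Descriptions = { [key: string]: string };

// Hypothetical merge: never clobbers an existing `ai` value and passes
// `db` and external keys through unchanged.
function mergeFlushed(existing: Descriptions, flushed: { ai?: string }): Descriptions {
  const merged: Descriptions = {};
  for (const key of Object.keys(existing)) {
    merged[key] = existing[key];
  }
  // Fill `ai` only when a non-empty value arrives and none exists yet,
  // so repeating the same flush is a no-op.
  if (flushed.ai !== undefined && flushed.ai !== "" && merged["ai"] === undefined) {
    merged["ai"] = flushed.ai;
  }
  return merged;
}
```

Applying the same flush twice yields the same record as applying it once, which is the property that lets the per-batch flush coexist with the checkpoint and terminal writes.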
### 4. Resume re-enriches only the unfinished tables
On a resumed ingest with an unchanged `inputHash`, the descriptions stage MUST
re-issue LLM description calls **only for tables not already enriched**, deriving the
already-enriched set from the on-disk artifacts (the `inputHash`-tagged durable
record / the manifest's `ai:` descriptions), and MUST return the merged full result
so downstream stages behave as on a fresh run. A changed `inputHash` (schema or
enrichment settings changed) MUST recompute the stage as today (spec 19's
inputHash-gated semantics preserved). The durable record MAY be recreated without a
migration bridge if its on-disk shape changes (it is regenerable local/artifact
state).
### 5. No regression for small or uninterrupted ingests
A small or single-run ingest that is never interrupted MUST produce the same
artifacts (descriptions, manifest, embeddings) as today. The incremental flush MUST
be idempotent with the spec-19 checkpoint and the terminal write (descriptions
survive the embeddings/relationship rewrites). The bounded-call boundary MUST NOT
change a normal successful enrichment's output, only how a wedged call is terminated.
### 6. A skipped table costs one description, never the stage's output
A descriptions stage that **completes** with one or more skipped/aborted tables MUST
persist every successfully-generated description (the durable record and the `ai:`
manifest entries) and MUST mark the stage completed (a `local_scan_enrichment_stages`
row, embeddings + downstream proceeding) — it MUST NOT discard the whole stage's
output because some tables were skipped. No single table's failure may reject the
per-table fan-out: a per-table failure degrades to one missing description (left for
the resume remainder), not a failed stage. A genuine `ctx.signal` cancellation is the
only thing that fails the stage (so it resumes), and even then the already-flushed
descriptions remain durable.
## Acceptance criteria
- **Enforced timeout (subprocess backend):** a subprocess-backed enrichment call
that hangs past the deadline is terminated within the deadline plus a small grace;
ktx's await settles, the spawned child **and a grandchild it spawned** both exit
(verified via the child's `exit`, not left spinning), and the table is skipped with
an `enrichment_timeout` warning. The stage advances rather than wedging. A
`ctx.signal` abort terminates the same way.
- **HTTP backend unaffected:** an HTTP-backed enrichment call still cancels promptly
on abort via the existing native path, with no subprocess involved.
- **Default + policy:** the default timeout is 120s and a timeout is not retried (one
wedge = one timeout); a transient error is still retried up to the attempt limit.
- **Graceful skip persists the rest:** a stage that completes with one table failing
(timeout, exhausted retries, or an unexpected throw) still writes the other N-1
descriptions to the durable record + `ai:` `_schema` and marks the stage completed
(a `local_scan_enrichment_stages` row exists); the failed table is a single `null`
description left for the resume remainder, not a discarded stage.
- **Incremental durability:** interrupting the descriptions stage after K of N tables
leaves those K durable on disk (raw artifact + `ai:` descriptions in `_schema`),
with no completed `local_scan_enrichment_stages` row.
- **Resume does not re-spend:** re-running the interrupted ingest (unchanged
`inputHash`, fresh `runId`) issues **no** LLM description calls for the K
already-enriched tables and enriches only the remaining N-K; the returned result is the
full merged set. A changed `inputHash` recomputes the stage.
- **No regression:** a small uninterrupted ingest yields identical artifacts and the
same descriptions/embeddings output as today; the incremental flush is idempotent
with the checkpoint and terminal writes.
## Non-goals
- **Incremental persistence of embeddings.** Embeddings are fast and already covered
by spec 19's stage-level cross-run resume; the dominant loss is descriptions. This
spec scopes incremental persistence to the `descriptions` stage.
- **Changing the timeout default, retry counts, or adding a timeout config knob.**
D2 keeps the moderate default and the single env tuning seam.
- **Routing HTTP backends through the subprocess boundary.** Their native abort
already meets the contract; a subprocess would add cost and a credential-passing
surface for no benefit.
- **A persistent subprocess pool.** One short-lived ktx child per subprocess-backed
call; no pool, no respawn/job-tracking (D1).
- **Re-implementing spec 16 (per-query deadline) or spec 19 (relationship-stage
budget, cost-boundary checkpoint, cross-run stage resume).** This spec composes
above them: spec 16 bounds individual queries, spec 19 makes whole stages durable
and resumable, and this spec hardens the per-table enrichment call's termination
and adds within-stage description durability.
- **A general per-stage incremental-flush framework.** The incremental flush is
specifically the descriptions stage; it is not a generic abstraction over every
enrichment stage.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns the
design.
- **Bounded per-table call (gap #1)**: `context/scan/description-generation.ts`,
`KtxDescriptionGenerator.generateBatchedTableDescriptions` (the bounded+retry block
~760–866; `enrichTimeoutMs` ~769, `enrichAttempts` ~770, `KtxAbortedError` on
timeout ~811, `enrichment_timeout`/`enrichment_failed` warnings ~858). The retry
loop stays here; each attempt runs through the kill boundary for subprocess
backends.
- **LLM runtime + backend selection**: `context/llm/runtime-port.ts`
(`KtxLlmRuntimePort.generateObject`, `abortSignal` on the input),
`context/llm/local-config.ts` (~127–163, selects `CodexKtxLlmRuntime` /
`ClaudeCodeKtxLlmRuntime` / `AiSdkKtxLlmRuntime`), `context/project/config.ts`
(`KTX_LLM_BACKENDS`). The "owns a subprocess" property should be declared by the
backend/runtime (e.g. on the runtime interface), not inferred from a name list.
- **Subprocess backends**: `context/llm/codex-runtime.ts` +
`context/llm/codex-sdk-runner.ts` (`CodexSdkCliRunner.runStreamed`, the SDK's
`spawn(executable, args, { signal })` is in `@openai/codex-sdk`),
`context/llm/claude-code-runtime.ts` (`collectResult` ~275–322, the `interrupt()`
abort path). These are what the kill boundary must wrap and tree-kill.
- **Reuse spec 16's mechanism (extended to group/tree kill)**:
`connectors/sqlite/read-query-child.ts` (the forked child shape) and
`connectors/sqlite/connector.ts` `runReadQueryOffProcess` (~292–350: `fork`,
deadline timer, `child.kill('SIGKILL')`, `settle()`, the `.js`-if-exists-else-`.ts`
child-URL resolver ~25–27, knip dynamic entry). Gap #1 differs by making the child a
process-group leader and killing the **group/tree** (the SDK grandchild), portably.
Abort helpers: `context/core/abort.ts` (`createAbortError`, `throwIfAborted`,
`linkAbortSignal`). Note the new child hosts an LLM runtime, so the implementer owns
passing the backend config/credentials to it (env/IPC) and serializing the
structured result back.
- **Incremental persistence (gap #2)**:
`context/scan/local-enrichment.ts` (`generateDescriptions` ~279–352: the per-table
`pLimit` fan-out and the in-memory `updates` accumulation; `runEnrichmentStage`
~413/~421–474 with `findCompletedStage` ~427 and `saveCompletedStage`; the
`onCheckpoint` hook ~598–612). Make `generateDescriptions` resume-aware: read the
existing record, skip already-enriched tables, flush per batch, return the merged
full set.
- **Artifact writer + additive merge**: `context/scan/local-enrichment-artifacts.ts`
(`writeLocalScanEnrichmentCheckpoint` ~351–379, `writeEnrichmentDescriptionArtifacts`
with `descriptions.json` ~316, `writeLocalScanManifestShards` ~270–308,
`loadExistingManifestState` ~196–253, `tableDescription`/`columnDescription`
~75–105); `context/scan/manifest.ts` (`mergeDescriptionsPreservingExternal` ~96–115,
`SCAN_MANAGED_DESCRIPTION_KEYS`). Factor a per-batch flush that reuses the additive
description/manifest write; tag the durable record with `inputHash`.
- **Stage store + input hash**:
`context/scan/sqlite-local-enrichment-state-store.ts` (`STAGES_TABLE =
'local_scan_enrichment_stages'`, PK `(connection_id, stage, input_hash)`,
`findCompletedStage`, `saveCompletedStage`),
`context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78). The
whole-stage row stays; the `inputHash` is the gate for the resume-skip set.
- **Scan driver**: `context/scan/local-scan.ts` (the `onCheckpoint` wiring and the
terminal `writeLocalScanEnrichmentArtifacts`), and `KtxScanContext.signal`
(`context/scan/types.ts`) which the kill boundary must honor.
- **Tests** — gap #1: a fake subprocess-backed runtime whose child hangs (ignores
SIGTERM) is killed at a tiny test-seam deadline; assert the await settles within
deadline+grace, the child and a spawned grandchild both exit, and the table is
skipped with `enrichment_timeout`; assert an HTTP-backed abort still settles via the
native path. gap #2: interrupt the descriptions stage after K/N tables (a flush
seam), assert the K are durable (raw artifact + `ai:` in `_schema`) with no completed
stage row; a resume with matching `inputHash` issues no LLM calls for the K and
enriches only the remaining N − K; a changed `inputHash` recomputes; regression: a small
uninterrupted ingest yields identical artifacts.
- After implementing, rebuild and re-link so the playground picks it up:
`pnpm run build && pnpm run link:dev`.
## Benchmark context (motivation, not a requirement)
Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment
backend). Re-enriching the giant public datasets, `covid19_usa` wedged at 268/285 for
41+ minutes on 2 hung 252-column tables; the 30-min per-table `AbortSignal` timeout
never killed the hung codex children, and because descriptions checkpoint only at
stage completion, the 283 already-enriched tables were unrecoverable — the operator
had to kill, cache-bust, and re-ingest the database from scratch (with a short timeout
as a stopgap). The benchmark merely exercised a large/wide multi-dataset ingest at
scale; the gaps and the fixes are generic production hygiene for any agent that
enriches a real warehouse with a subprocess LLM backend. Do not encode any benchmark
specifics in the implementation.
## Implementation notes
Implemented on branch `write-feature-spec-wiki`. Both gaps shipped; all acceptance
criteria are covered by tests. The full ktx test surface for the touched code is
green (the only failures in the whole suite are 3 pre-existing assertions in
`test/skills/analytics-skill-content.test.ts` about the analytics SKILL.md markdown
— an unrelated subsystem this change does not touch).
### Gap #1 — enforced timeout for subprocess backends
- **Structural property on the runtime, not a name list.** Added
`subprocessForkSpec(): SubprocessRuntimeForkSpec | null` to `KtxLlmRuntimePort`
(`context/llm/runtime-port.ts`). `CodexKtxLlmRuntime` / `ClaudeCodeKtxLlmRuntime`
return a serializable `{ backend, projectDir, modelSlots }`; `AiSdkKtxLlmRuntime`
(and the deterministic stub) return `null`. The per-table call branches on this,
never on a vendor list (D1).
- **Shared structured core.** Both subprocess runtimes gained
`generateStructuredJson(jsonSchema)` (returns the raw object; the caller
Zod-validates). Their existing `generateObject` was refactored to delegate to the
same streaming core, so structured generation has one implementation.
- **Kill boundary.** New `context/llm/subprocess-generate-object.ts`
(`runGenerateObjectInSubprocess`, `KtxSubprocessDeadlineError`) forks a ktx-owned
child (`subprocess-generate-object-child.ts`) **detached** (process-group leader);
the SDK's model binary inherits the group. On the deadline or `ctx.signal`, ktx
tree-kills the group with `SIGKILL` (`process.kill(-pid, …)` on POSIX,
`taskkill /pid <pid> /T /F` on Windows) and rejects promptly; on success the raw
output is Zod-validated. Credentials reach the child via inherited `process.env`
(the runtimes re-derive their allowlisted env), never over IPC.
- **Wiring.** `KtxDescriptionGenerator.generateBatchedTableDescriptions`
(`context/scan/description-generation.ts`) routes each retry attempt through the
boundary for subprocess backends and keeps the native `AbortSignal` → `fetch`
path for HTTP backends. A fired deadline maps to the existing
`KtxAbortedError`/`enrichment_timeout` no-retry policy (one wedge = one timeout);
default stays 120s (D2).
- **Tests.** `test/context/llm/subprocess-generate-object.test.ts` forks a real
fixture child that spawns a grandchild and ignores SIGTERM, and asserts the
deadline/abort tree-kills both (the grandchild PID is reaped) and the await
settles within deadline+grace; plus success / schema-failure / child-error paths.
`test/context/scan/description-generation.test.ts` adds the generator-level
timeout-skip and the "HTTP backend spawns no child" cases.
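
The platform split in the kill boundary can be illustrated with a small, pure planning helper. This is a minimal sketch, not ktx's code: `TreeKillPlan` and `planTreeKill` are hypothetical names, and the sketch only decides *what* to run. It assumes the child was started detached, so its PID is also its process-group ID on POSIX.

```typescript
// Hypothetical helper, not ktx's real API: decide how to kill a child's whole tree.
type TreeKillPlan =
  | { kind: "signal-group"; target: number; signal: "SIGKILL" }
  | { kind: "command"; file: string; args: string[] };

function planTreeKill(pid: number, platform: string): TreeKillPlan {
  // Guard the POSIX special cases: a target of 0 means the caller's own group,
  // and a target of -1 means every process the caller is allowed to signal.
  if (!Number.isInteger(pid) || pid <= 1) {
    throw new RangeError(`refusing to tree-kill pid ${pid}`);
  }
  if (platform === "win32") {
    // Windows has no process groups in the POSIX sense; taskkill /T walks the tree.
    return { kind: "command", file: "taskkill", args: ["/pid", String(pid), "/T", "/F"] };
  }
  // A negative target addresses the whole process group led by `pid`.
  return { kind: "signal-group", target: -pid, signal: "SIGKILL" };
}
```

A caller would execute a `signal-group` plan with `process.kill(plan.target, plan.signal)` and a `command` plan by spawning `plan.file` with `plan.args`. Keeping the decision pure makes the platform branch testable without forking a real process.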
### Gap #2 — incremental descriptions persistence + resume
- **Durable record + resume store.** `createKtxScanDescriptionResumeStore`
(`context/scan/local-enrichment-artifacts.ts`) writes the descriptions-so-far to
a durable record (inputHash-tagged) and **only the manifest shards that gained a
table this batch** (new `onlyChangedTableNames` filter on
`writeLocalScanManifestShards`, additive merge preserved). `load(inputHash)`
returns the prior enriched set only on a matching inputHash (D3).
- **Resume-aware fan-out.** `generateDescriptions` (`context/scan/local-enrichment.ts`)
loads the prior record, skips already-enriched tables, enriches only the
remainder, flushes every `DESCRIPTION_FLUSH_EVERY` (10) completed tables (a single
in-flight flush; the final force-flush drains the tail), and returns the full
merged set (recovered + fresh + `null` for still-failed, so failures are retried,
D4). Wired through `local-scan.ts` (store constructed when not `--dry-run`).
- **Graceful-skip backstop (requirement 6).** The per-table worker wraps the call in
a try/catch: any non-cancellation failure degrades to one `null` description + an
`enrichment_failed` warning and the fan-out continues, so no single table can
reject `Promise.all` / abort the stage. This makes the "one skipped table costs one
description, not the stage's output" guarantee live at the stage boundary
(`generateBatchedTableDescriptions` already degrades its own failures; this is the
explicit backstop). A `ctx.signal` cancellation still propagates (the stage fails
and resumes), and the already-flushed descriptions stay durable. This closes the
field bug where a completed-with-skips stage persisted 0 descriptions / 0 stage rows.
- **Deviation from the spec's literal path (necessary correction).** The durable
record lives at a **stable, non-`syncId`** path
(`raw-sources/<connectionId>/live-database/enrichment-progress/descriptions.json`),
not the `syncId`-scoped `…/<syncId>/enrichment/descriptions.json` the spec named.
Reason: a from-scratch interruption (the incident's exact case — no prior
*completed* run) gets a **fresh `syncId`** on the next run
(`buildSyncId` in `context/ingest/local-stage-ingest.ts`), so a `syncId`-scoped
record would be unreachable on resume. The manifest is already at the stable
per-connection scope (`semantic-layer/<connectionId>/_schema/`), so this keeps the
resume source at the same stable scope. The `syncId`-scoped `enrichment/descriptions.json`
debug artifact written by the terminal/checkpoint writers is unchanged.
- **Tests.** `test/context/scan/description-resume.test.ts` drives
`runLocalScanEnrichment` against a real git-backed project: a fresh run flushes a
durable record + `ai:` manifest descriptions; a matching-`inputHash` resume issues
zero LLM calls and returns the full merged set; a partial record re-enriches only
the missing tables; a changed `inputHash` recomputes; the changed-shard filter
rewrites only the affected shard; and (requirement 6) a run where one table fails
still persists the other tables (durable record + `ai:`) and **completes the stage**
(a completed `local_scan_enrichment_stages` row), with the failed table left `null`
for resume.
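
The resume-and-flush behavior described above can be sketched as a synchronous simulation. It is illustrative only: `DescriptionRecord` and `enrichWithResume` are hypothetical names, the real fan-out is concurrent (`pLimit`) and asynchronous, and the record shape is an assumption. The sketch shows three properties: a matching `inputHash` skips already-enriched tables, `null` entries are retried, and a flush happens every `flushEvery` completions plus once for the tail.

```typescript
// Assumed record shape: a null description marks a table that failed earlier.
interface DescriptionRecord {
  inputHash: string;
  descriptions: Record<string, string | null>;
}

function enrichWithResume(
  tables: readonly string[],
  prior: DescriptionRecord | undefined,
  inputHash: string,
  describe: (table: string) => string | null,
  flush: (record: DescriptionRecord) => void,
  flushEvery = 10,
): DescriptionRecord {
  // A prior record is usable only when its inputHash matches the current one.
  const usable = prior !== undefined && prior.inputHash === inputHash ? prior.descriptions : {};
  const merged: Record<string, string | null> = {};
  const pending: string[] = [];
  for (const table of tables) {
    const recovered = usable[table];
    if (typeof recovered === "string") merged[table] = recovered;
    else pending.push(table); // missing or null: enrich (again)
  }
  let sinceFlush = 0;
  for (const table of pending) {
    merged[table] = describe(table);
    sinceFlush += 1;
    if (sinceFlush >= flushEvery) {
      flush({ inputHash, descriptions: { ...merged } });
      sinceFlush = 0;
    }
  }
  // Final force-flush drains whatever the last partial batch left behind.
  if (sinceFlush > 0) flush({ inputHash, descriptions: { ...merged } });
  return { inputHash, descriptions: merged };
}
```

Seeding `merged` with the recovered entries before enriching anything means every flushed record already contains the full prior set, so an interruption mid-run never shrinks what is durable.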
### Incidental
- Fixed a stale assertion in `description-generation.test.ts` ("does not run
per-column fallback…" expected 1 call) to `3`, matching the retry policy added in
commit `01f63380` (D2 / acceptance: a transient error retries up to the attempt
limit). The HTTP path is unchanged; the assertion simply predated the retry.
- No new `ktx.yaml` config field or runtime knob was added (D2). The rate-limit
governor is not wired into the scan-enrichment path, so the kill-boundary child
loses no pacing.
- Rebuilt and re-linked (`pnpm run build && pnpm run link:dev`); the child compiles
to `dist/context/llm/subprocess-generate-object-child.js`.

@ -1,567 +0,0 @@
# Selective enrichment stages (`--stages`) + per-stage cache keys
> Refined spec. Intake draft: `todo/21-selective-enrichment-stages.md`.
>
> **Scope: make the three enrichment stages independently invalidatable and
> independently re-runnable.** Today one coarse cache key gates all three stages,
> so changing any one stage's inputs re-pays for every stage — most painfully the
> expensive per-table `descriptions`. And there is no CLI surface to re-run a
> chosen subset. This spec splits the key per stage (so a change invalidates only
> the stage it touched) and adds a `--stages` flag that force-re-runs a chosen
> subset while preserving the others. It is the operability follow-on to spec 19
> (durable, cross-run stage resume) and spec 20 (resilient, per-table-resumable
> descriptions); it composes with both rather than replacing them.
## Problem
Enrichment has three stages — **`descriptions`** (one paid LLM call per table),
**`embeddings`** (sentence-transformer vectors over the schema + descriptions),
**`relationships`** (FK/join detection, optionally LLM-proposed). After specs 19
and 20 these stages are durable and resumable, but they are still **coupled for
cache invalidation and unreachable for selective re-run**. Three facts make a
targeted re-run impossible without a full, expensive re-enrich.
### 1. One coarse cache key gates all three stages
`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:611`) computes a single
`inputHash` from `{ snapshot, mode, detectRelationships, providerIdentity,
relationshipSettings }` and every stage reuses it — `descriptions` (~`:642`),
`embeddings` (~`:673`), `relationships` (~`:729`). `providerIdentity` itself
(`localScanProviderIdentity`, `local-scan.ts:241–255`) is one blob conflating the
description LLM identity, the embedding model/dimensions/batch size, **and** the
whole relationship config — and it redundantly re-encodes `mode` and
`relationships`, which the coarse hash already mixes in.
The consequence: flipping `scan.relationships.llmProposals`, switching the LLM
backend, or upgrading the embeddings model changes the **one** hash and so
invalidates **all three** stages. ktx then re-runs the expensive per-table
`descriptions` even though they did not conceptually change. The headline cost of
the system — paid LLM description calls — is thrown away on any unrelated
enrichment-config edit.
### 2. No CLI surface to select stages
The enrichment internals already support a relationships-only path
(`KtxScanMode` `'relationships'`, `types.ts:12`; `descriptions`/`embeddings` are
gated on `mode === 'enriched'` at `local-enrichment.ts:632`, while
`shouldDetectRelationships` admits `mode === 'relationships'` at `:624–626`). But
`ktx ingest` hardcodes `mode: 'enriched'` (`public-ingest.ts:973`) and exposes no
flag to select a subset (`ingest-commands.ts:26–49` declares only `--no-query-history`
and friends). The relationships-only capability is built but unreachable, and there
is no way at all to ask for "descriptions only" or "embeddings only."
### 3. The foundation for "touch one stage, keep the rest" already exists
The per-stage store `local_scan_enrichment_stages` is keyed
`(connection_id, stage, input_hash)` (spec 19) and the descriptions write is
additive — `mergeDescriptionsPreservingExternal` (`manifest.ts`) and
`loadExistingManifestState` (`local-enrichment-artifacts.ts`) preserve prior `ai:`,
`db:`, and external description keys on rewrite; spec 20's per-table resume record
(`createKtxScanDescriptionResumeStore`, `local-enrichment-artifacts.ts:286`) already
re-issues LLM calls only for the still-failed tables. So "recompute one stage, leave
the others byte-for-byte" needs only two missing pieces: **per-stage key
granularity** and a **CLI surface** to select stages.
**Requirement:** let an operator re-run a chosen subset of enrichment stages on an
already-ingested connection, recomputing only those stages, preserving the others'
artifacts untouched, and **re-paying only for what genuinely changed** — never
re-running the costly `descriptions` because an unrelated stage's inputs moved.
## Generic use case (independent of any benchmark)
Any team running ktx in production maintains its semantic layer over time: they
improve the description prompt or switch the description LLM, upgrade the embeddings
model, or turn on LLM-proposed joins. Today each of those forces a **full re-enrich
of every connection** — re-running the expensive per-table descriptions even when
only embeddings or relationships changed. Two routine operations should be cheap and
targeted:
- **"Re-embed everything on the new model."** Swapping the embeddings model should
recompute only embeddings, leaving descriptions and joins on disk.
- **"Backfill joins now that `llmProposals` is on."** Enabling LLM-proposed
relationships should recompute only relationships.
And one operation needs an explicit trigger because no input changed:
- **"These descriptions came out thin — re-run them with a longer timeout."** A
connection whose description coverage is poor because tables timed out (same
snapshot, same LLM, so the hash is unchanged) should be re-runnable on demand,
cheaply retrying only the tables that failed.
This is core operability for a long-lived ingestion product and is wholly
independent of any benchmark.
## Design decisions (resolved during refinement)
These resolve ambiguities the intake draft left open. They constrain the
implementer; the exact code is theirs (requirement-level, per the specs README).
### D1 — Split the coarse hash into three per-stage input hashes
Replace the single `computeKtxScanEnrichmentInputHash` call with **per-stage** hash
computation, each keyed on only that stage's own inputs. Decompose the
`localScanProviderIdentity` blob into the slices each stage actually depends on:
- **`descriptions`** → `{ snapshot, llmIdentity }`, where `llmIdentity` is the
description-LLM identity (`llm.models.default`, `baseUrlConfigured`). **Not** the
embedding model/dimensions/batch size, **not** relationship settings.
- **`embeddings`** → `{ snapshot, embeddingIdentity, descriptionDigest }`, where
`embeddingIdentity` is `{ model, dimensions, batchSize }` and `descriptionDigest`
is a stable digest of the resolved description text the embeddings consume (the
same text `buildEmbeddings` → `buildKtxColumnEmbeddingText` feeds the model,
`local-enrichment.ts:466–486`, `embedding-text.ts:17–44`). This content-addresses
embeddings on their real upstream (D4).
- **`relationships`** → `{ snapshot, relationshipSettings, llmIdentity }`, where
`relationshipSettings` includes `llmProposals` and `detectionBudgetMs`. **Not** the
description content (decision X, D5), **not** the embedding identity.
`mode` and `detectRelationships` drop out of the per-stage inputs: each stage
produces output under exactly one mode, so the stage name already scopes that, and
re-mixing `mode` only re-couples the keys. After the split, flipping `llmProposals`
invalidates only `relationships`; swapping the embeddings model invalidates only
`embeddings`; switching the description LLM invalidates only `descriptions`.
The per-stage hash becomes the key everywhere a single hash is used today: the
`local_scan_enrichment_stages` lookup/save in `runEnrichmentStage`, and the spec-20
descriptions resume record (`createKtxScanDescriptionResumeStore`), which is now
keyed on the **descriptions** stage's hash — so changing the embedding model no
longer busts the descriptions resume record, a strict improvement.
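
The split in D1 can be sketched as follows. This is a minimal illustration under stated assumptions, not ktx's implementation: `StageHashInputs`, `stageSlice`, `canonicalJson`, and `computeStageInputHash` are hypothetical names, the field shapes are inferred from the prose above, and FNV-1a stands in for whatever digest ktx actually uses so the block stays dependency-free.

```typescript
type StageName = "descriptions" | "embeddings" | "relationships";

// Hypothetical shapes; field names follow the spec's prose, not ktx's real types.
interface StageHashInputs {
  snapshot: string; // stand-in for the schema snapshot identity
  llmIdentity: { model: string; baseUrlConfigured: boolean };
  embeddingIdentity: { model: string; dimensions: number; batchSize: number };
  relationshipSettings: { llmProposals: boolean; detectionBudgetMs: number };
  descriptionDigest: string; // digest of the description text the embeddings consume
}

// Serialize with sorted object keys so equal inputs always produce equal text.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const body = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(record[key])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

// FNV-1a (32-bit) is only a placeholder; a real cache key would use a
// cryptographic digest such as SHA-256.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Each stage sees only its own slice of the inputs (D1).
function stageSlice(stage: StageName, inputs: StageHashInputs): unknown {
  switch (stage) {
    case "descriptions":
      return { snapshot: inputs.snapshot, llmIdentity: inputs.llmIdentity };
    case "embeddings":
      return {
        snapshot: inputs.snapshot,
        embeddingIdentity: inputs.embeddingIdentity,
        descriptionDigest: inputs.descriptionDigest,
      };
    case "relationships":
      return {
        snapshot: inputs.snapshot,
        relationshipSettings: inputs.relationshipSettings,
        llmIdentity: inputs.llmIdentity,
      };
  }
}

function computeStageInputHash(stage: StageName, inputs: StageHashInputs): string {
  return fnv1a(canonicalJson(stageSlice(stage, inputs)));
}
```

Because `mode` and `detectRelationships` never enter a slice, an edit to one stage's configuration cannot move another stage's key. The `(connection_id, stage, input_hash)` primary key already separates stages, so the stage name does not need to be mixed into the hash.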
> **No migration bridge.** The stage store and the descriptions resume record are
> disposable local `.ktx` state (regenerable from a fresh ingest). The new per-stage
> keys simply miss the old coarse-keyed rows, forcing one full re-enrich on the next
> run after upgrade. Recreate/ignore stale-shaped records with no compatibility
> shim, consistent with specs 19/20 and ktx's no-backward-compatibility policy.
### D2 — `--stages <comma-list>` selects a subset; one gate, no new mode
Add `ktx ingest [connectionId] --stages <comma-list>`, a non-empty subset of
`descriptions,embeddings,relationships`. Plural because it takes a **set**:
`--stages relationships` and `--stages descriptions,embeddings` both read naturally,
and the plural signals "list expected." Flag absent = all three (today's behavior).
A Commander custom parser validates each name against the canonical stage registry
and parses into an ordered, de-duplicated set. **An unknown or empty stage name is a
hard `InvalidArgumentError`** — never silently ignored. The set threads CLI →
`runKtxPublicIngest` (`KtxScanArgs`) → `runLocalScan` → `runLocalScanEnrichment`.
Inside enrichment the run set is **`(mode/provider-eligible stages) ∩ (selected
stages)`** — a single gate. Each existing stage block additionally checks
membership in the selected set (`descriptions`/`embeddings` already gate on
`mode === 'enriched'` + providers; `relationships` on `shouldDetectRelationships`).
This adds **no** new `KtxScanMode` variant and **no** second parallel selection
path; `mode` keeps meaning "the connection's enrichment level," and `--stages` means
"which of those stages to (re)compute this run." A named stage that cannot run
because a prerequisite is absent (e.g. `--stages embeddings` with no embedding
provider configured) MUST fail or warn clearly, never silently no-op.
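
A sketch of the validation a `--stages` parser would perform, under stated assumptions. `ENRICHMENT_STAGES`, `InvalidStagesError`, and `parseStages` are hypothetical names; the error class is a local stand-in so the block has no dependencies, where the spec calls for Commander's `InvalidArgumentError`. Trimming whitespace and returning the set in canonical registry order are also assumptions, since the spec only says "ordered, de-duplicated".

```typescript
const ENRICHMENT_STAGES = ["descriptions", "embeddings", "relationships"] as const;
type EnrichmentStage = (typeof ENRICHMENT_STAGES)[number];

// Local stand-in for the CLI library's argument error.
class InvalidStagesError extends Error {}

function parseStages(raw: string): EnrichmentStage[] {
  const selected = new Set<EnrichmentStage>();
  // "".split(",") yields [""], so an empty flag value hits the empty-name check.
  for (const part of raw.split(",")) {
    const name = part.trim();
    if (name === "") {
      throw new InvalidStagesError("--stages: empty stage name");
    }
    const match = ENRICHMENT_STAGES.find((stage) => stage === name);
    if (match === undefined) {
      throw new InvalidStagesError(
        `--stages: unknown stage "${name}" (expected one of ${ENRICHMENT_STAGES.join(", ")})`,
      );
    }
    selected.add(match); // the Set de-duplicates repeated names
  }
  // Filter the registry rather than the input so the order is always canonical.
  return ENRICHMENT_STAGES.filter((stage) => selected.has(stage));
}
```

Validating every name before returning means `--stages descriptions,foo` fails as a whole instead of quietly running `descriptions` alone.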
> Rejected alternative — repurpose `mode` (`--stages relationships`
> → `mode: 'relationships'`). It only expresses single-stage cases, leaves
> `descriptions,embeddings` with no mode, and creates two ways to say "relationships
> only." The explicit stage set is the one canonical selector.
### D3 — A named stage force-re-runs; per-table resume still avoids re-paying
Naming a stage in `--stages` carries the intent "recompute this," so a named stage
**re-enters its `compute()`, bypassing the spec-19 completed-row short-circuit** in
`runEnrichmentStage` (`local-enrichment.ts:538–547`). The spec-20 machinery still
applies **inside** `compute()`:
- `--stages descriptions` re-enters `generateDescriptions`, which loads the
per-table resume record and re-issues LLM calls **only for the still-null/failed
tables** (when the descriptions hash is unchanged) — the "fill thin coverage with
a longer `KTX_ENRICH_LLM_TIMEOUT_MS`" case, paying only for the gaps.
- A genuine input change (e.g. switching the LLM → a new descriptions hash)
invalidates the resume record and rebuilds the stage fully, as today.
Stages **not** named are skipped entirely — not run, not resumed — and their
on-disk artifacts are left exactly as they are (additive write; preserve-others is
already the behavior). The **no-flag default is unchanged**: all eligible stages
run, the completed-row short-circuit is respected (spec-19 cross-run resume).
Behavior follows from the input (did you explicitly name the stage?), not the call
path. A consequence to state plainly: `--stages descriptions,embeddings,relationships`
is **not** identical to passing no flag — naming all three is the explicit "force a
full enrichment recompute," whereas no flag is "ingest, resuming whatever is done."
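
The interaction of D2's single gate with D3's force-re-run rule reduces to one small decision function. The sketch below is an assumption-laden illustration: `StageAction`, `StageDecisionInput`, and `decideStageAction` are hypothetical names, and the `"unavailable"` outcome is one possible way to represent "selected but prerequisite missing" so the caller can fail or warn.

```typescript
type StageAction = "skip" | "unavailable" | "reuse-completed" | "compute";

interface StageDecisionInput {
  stage: string;
  eligible: boolean; // passes the existing mode/provider gate
  selected: ReadonlySet<string> | undefined; // undefined means no --stages flag
  hasCompletedRow: boolean; // a completed row exists for the current per-stage hash
}

function decideStageAction(input: StageDecisionInput): StageAction {
  if (input.selected === undefined) {
    // No flag: today's behavior, including the completed-row short-circuit.
    if (!input.eligible) return "skip";
    return input.hasCompletedRow ? "reuse-completed" : "compute";
  }
  if (!input.selected.has(input.stage)) return "skip"; // unnamed: not run, not resumed
  if (!input.eligible) return "unavailable"; // named but cannot run: surface it
  return "compute"; // named: bypass the short-circuit and re-enter compute()
}
```

With this shape, naming all three stages always yields `"compute"`, while passing no flag can yield `"reuse-completed"`, which is the distinction D3 states in its last paragraph.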
### D4 — Downstream staleness: one real edge, content-addressed, surfaced not silent
The only hard dependency between stages is **`descriptions → embeddings`**
(embeddings embed the description text; `relationships` is decoupled, D5). Two
mechanisms keep it correct without a hardcoded dependency table:
- **Self-healing via content-addressing.** Because the embeddings hash includes
`descriptionDigest` (D1), re-running `descriptions` changes that digest, so a
later embeddings run (or a full ingest) sees a hash miss and recomputes — stale
embeddings can never silently persist across a future embeddings run. (Without
this, the embeddings hash would be unchanged after a description edit and a later
run would wrongly short-circuit on stale vectors.)
- **Surfaced immediately.** After a selective run, for each **unselected** stage that
has artifacts on disk, recompute its *current* per-stage hash from on-disk state
and compare it to the stored completed-row hash; if they differ, emit a
**recoverable `enrichment_stage_stale` warning** naming the stale stage and the
cascade command (e.g. `--stages descriptions,embeddings`). This is derived from the
system's own state — it also catches "you changed the embedding model in `ktx.yaml`
but only ran `--stages descriptions`."
The run **never silently leaves a stale-but-unflagged downstream**, and **never
silently auto-cascades** extra work — the operator is told and decides. Re-running
`descriptions` does **not** flag `relationships` stale (D5).
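
The derived staleness check can be sketched as a pure comparison of stored versus recomputed hashes. The names `StageStaleWarning` and `findStaleStages` and the `rerunHint` format are hypothetical. The sketch assumes the caller has already recomputed each stage's current hash from on-disk state, and that a stage with no stored hash has no artifacts to flag.

```typescript
const STAGE_ORDER = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof STAGE_ORDER)[number];

interface StageStaleWarning {
  code: "enrichment_stage_stale";
  stage: Stage;
  recoverable: true;
  rerunHint: string; // assumed wording for the cascade command
}

function findStaleStages(
  selected: readonly Stage[],
  storedHashes: Partial<Record<Stage, string>>,
  currentHashes: Record<Stage, string>,
): StageStaleWarning[] {
  const stale = STAGE_ORDER.filter((stage) => {
    if (selected.indexOf(stage) !== -1) return false; // selected stages just ran
    const stored = storedHashes[stage];
    // No stored hash means no completed artifacts, so there is nothing to flag.
    return stored !== undefined && stored !== currentHashes[stage];
  });
  return stale.map((stage): StageStaleWarning => {
    // The cascade re-runs what was selected plus the stale downstream stage.
    const cascade = STAGE_ORDER.filter((s) => selected.indexOf(s) !== -1 || s === stage);
    return {
      code: "enrichment_stage_stale",
      stage,
      recoverable: true,
      rerunHint: `--stages ${cascade.join(",")}`,
    };
  });
}
```

No dependency table appears anywhere in the function. `relationships` stays unflagged after a description change simply because its hash does not include description content, so stored and current values still match.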
### D5 — Relationships are decoupled from description content, but still get it as context
`relationships` keys on `{ snapshot, relationshipSettings, llmIdentity }` and is
**not** invalidated or stale-flagged by a description change (decision X). Rationale:
relationships are the low-value, best-effort, expensive-to-probe stage (spec 19's
own framing); coupling them to description content would make every routine
description re-run also invalidate joins — re-opening the exact over-invalidation
this spec exists to close.
Independently, a `relationships`-only run (descriptions stage not running this
invocation) MUST **hydrate its working schema from the persisted on-disk enriched
`_schema`** (AI descriptions + embeddings) so `llmProposals` runs with full
description context, not raw column names. Today the relationship stage builds its
schema from the bare snapshot (db comments only — `local-enrichment.ts:621,688,740`
never merge the AI descriptions), so this also closes a latent gap: both the
full-run and the relationships-only paths MUST feed `llmProposals` the
best-available descriptions (fresh-this-run if `descriptions` ran, else on-disk) —
behavior from inputs, not path.
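
One way to express "best-available descriptions" is a merge in which fresh text wins and on-disk text fills the rest. This is a hedged sketch: `DescriptionsByTable` and `resolveRelationshipDescriptions` are hypothetical names, and the per-table fallback (a table that failed this run keeps its on-disk description) is an assumption that goes slightly beyond the spec's stage-level wording.

```typescript
type DescriptionsByTable = Record<string, string | null>;

function resolveRelationshipDescriptions(
  freshThisRun: DescriptionsByTable | undefined, // undefined: descriptions stage did not run
  onDisk: DescriptionsByTable,
): Record<string, string> {
  const resolved: Record<string, string> = {};
  // Start from the persisted descriptions, ignoring tables that never got one.
  for (const table of Object.keys(onDisk)) {
    const text = onDisk[table];
    if (typeof text === "string") resolved[table] = text;
  }
  // Overlay anything produced in this invocation; null (failed) entries do not
  // erase an on-disk description.
  if (freshThisRun !== undefined) {
    for (const table of Object.keys(freshThisRun)) {
      const text = freshThisRun[table];
      if (typeof text === "string") resolved[table] = text;
    }
  }
  return resolved;
}
```

Both the full-run and the relationships-only paths can call the same function, which keeps the behavior a consequence of the inputs rather than of the call path.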
### D6 — Scope: enrichment stages only, composable with existing flags
`--stages` controls only the three enrichment stages. It is **orthogonal to and
composable with** the existing `--no-query-history` flag — a pure joins backfill
across everything is `ktx ingest --all --stages relationships --no-query-history`.
Schema introspection still runs (it is the hash substrate and the enrichment base,
and it is cheap — no LLM). The stage-name namespace is built as a **registry** so it
can later extend to the broader scan phases (schema / query-history / source /
memory) and subsume the inconsistent negative `--no-query-history` flag — but that
unification is **out of scope** here.
## Requirements
### 1. Per-stage input hashes
Each enrichment stage MUST key its cache lookup/save and (for `descriptions`) its
resume record on a hash of only that stage's own inputs, per D1
(`descriptions` ← snapshot + LLM identity; `embeddings` ← snapshot + embedding
identity + a digest of the embedded description text; `relationships` ← snapshot +
relationship settings + LLM identity). Changing one stage's inputs MUST invalidate
**only** that stage. The single coarse `computeKtxScanEnrichmentInputHash` over
`{ snapshot, mode, detectRelationships, providerIdentity, relationshipSettings }`
MUST be removed in favor of per-stage computation. The stage store and the
descriptions resume record MAY be recreated without a migration bridge (disposable
local state).
### 2. `--stages` flag with strict validation
`ktx ingest` MUST accept `--stages <comma-list>`, a non-empty subset of
`descriptions,embeddings,relationships`, defaulting (when absent) to all three. An
unknown or empty stage name MUST be a hard parse error (`InvalidArgumentError`),
never silently ignored. The selected set MUST thread through to enrichment and gate
which stage blocks run as `(mode/provider-eligible) ∩ (selected)` — one gate, no new
`KtxScanMode` variant, no second selection path. A selected stage whose prerequisite
is missing MUST fail or warn clearly, not silently no-op.
### 3. Selecting a stage force-re-runs it; unselected stages are preserved
A stage named in `--stages` MUST re-enter its `compute()`, bypassing the
completed-stage short-circuit, while still using the spec-20 per-table resume record
so `descriptions` re-issues LLM calls only for still-failed tables (unchanged hash)
and rebuilds fully on a changed hash. A stage **not** named MUST NOT run and MUST
leave its on-disk artifacts untouched. The no-flag default MUST preserve spec-19
cross-run resume (all eligible stages, completed-row short-circuit respected).
### 4. Downstream staleness is surfaced, never silent
After a selective run, the run MUST emit a recoverable `enrichment_stage_stale`
warning for every **unselected** stage whose current per-stage hash no longer
matches its stored completed-row hash (derived from on-disk state, naming the stage
and the cascade command). The embeddings hash MUST include a digest of the embedded
description text so a later embeddings run self-heals after a description change. The
run MUST NOT silently leave a stale-but-unflagged downstream and MUST NOT silently
auto-cascade. A description change MUST NOT stale-flag `relationships`.
### 5. Relationships run with description context
When the `relationships` stage runs without `descriptions` having run in the same
invocation, it MUST hydrate its working schema from the persisted on-disk enriched
`_schema` (AI descriptions + embeddings) so `llmProposals` has the same description
context as a full enriched run, not bare column names. The full-run and
relationships-only paths MUST feed `llmProposals` descriptions consistently.
### 6. No regression for normal ingests
A normal `ktx ingest` with no `--stages` flag MUST produce the same artifacts as
today (descriptions, embeddings, manifest, relationships) and MUST preserve spec-19
cross-run resume and spec-20 per-table description resume. The per-stage hash split
MUST NOT change a normal run's output, only which stages a *changed* input
invalidates.
## Acceptance criteria
- **Per-stage invalidation isolation:** flipping `scan.relationships.llmProposals`
re-runs only `relationships` (descriptions + embeddings resolve from cache, no LLM
description calls, no re-embedding); swapping the embeddings model re-runs only
`embeddings`; switching the description LLM re-runs only `descriptions`. Verified by
asserting no LLM description calls / no embed calls for the unaffected stages.
- **Flag parse + validation:** `--stages relationships` and
`--stages descriptions,embeddings` parse to the right set; `--stages foo`,
`--stages` (empty), and `--stages descriptions,foo` each fail with a clear
`InvalidArgumentError`.
- **Resume-aware force-rerun:** on a connection whose `descriptions` stage completed
with K failed/null tables (unchanged hash), `--stages descriptions` re-issues LLM
calls for exactly those K tables and leaves the already-good descriptions
untouched; the run completes and the K are now enriched. A changed descriptions
hash instead rebuilds all tables.
- **Preserve others:** after `--stages descriptions`, the on-disk `embeddings` and
`relationships` artifacts are byte-stable (unselected stages did not run).
- **Derived staleness warning:** after `--stages descriptions` changes the
descriptions, the run emits `enrichment_stage_stale` for `embeddings` (its
recomputed hash diverged) and does **not** emit it for `relationships` (decision
X); a subsequent `--stages embeddings` clears it.
- **Relationships context:** a `--stages relationships` run on an already-described
connection feeds the on-disk AI descriptions into `llmProposals` (verified: the
proposal prompt carries descriptions, not just column names).
- **No regression:** a normal uninterrupted `ktx ingest` (no flag) yields identical
artifacts and the same descriptions/embeddings/relationship output as today, with
spec-19/20 resume intact.
## Non-goals
- **Unifying `--stages` with the broader scan phases or `--no-query-history`.** The
namespace is built to extend later; this spec ships only the three enrichment
stages, composable with the existing query-history flag (D6).
- **A new `KtxScanMode` variant or a second stage-selection path.** One gate,
`(eligible) ∩ (selected)` (D2).
- **Coupling `relationships` to description content** (decision X, D5). Improving
descriptions does not invalidate or stale-flag joins.
- **Auto-cascading downstream re-runs.** Staleness is surfaced as a warning; the
operator chooses to cascade (D4).
- **Capturing prompt/code-level description-prompt changes in the hash.** The
descriptions hash keys on snapshot + LLM identity (config/model), not the prompt
text; a pure prompt improvement that does not change a hash input will not
force-rebuild already-good descriptions. Forcing that is out of scope — the
operator changes a real input or selects the stage with a changed config.
- **Re-implementing spec 19 (cross-run stage resume, completed-row store) or spec 20
(per-table description resume, enforced timeout).** This spec composes above them:
it splits the key those stages resume on and adds the CLI surface to select and
force-re-run stages.
- **A general per-phase incremental-flush framework.** The selection mechanism is the
three enrichment stages; it is not a generic abstraction over every ingest phase.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns the
design.
- **Coarse hash → per-stage hashes**: `context/scan/enrichment-state.ts`
(`computeKtxScanEnrichmentInputHash` `:78`, `ComputeKtxScanEnrichmentInputHashInput`
`:57`): replace with per-stage hash functions (or one function taking a per-stage
input slice). `context/scan/local-enrichment.ts` (`:611` single hash; the three
`runEnrichmentStage` calls at `descriptions` ~`:635`, `embeddings` ~`:666`,
`relationships` ~`:722`; `runEnrichmentStage` `:524` and its short-circuit
`:538–547`). The `descriptions` hash also feeds `generateDescriptions`'
`resumeStore.load(inputHash)` (`:345`).
- **Provider-identity decomposition**: `context/scan/local-scan.ts`
(`localScanProviderIdentity` `:241–255`, the enrichment call site `:498–537`):
split into `llmIdentity` / `embeddingIdentity`, drop the redundant `mode` /
`relationships` re-encoding, and pass each stage only its slice.
- **`descriptionDigest`** — `context/scan/local-enrichment.ts` (`buildEmbeddings`
`:457-486`) and `context/scan/embedding-text.ts` (`buildKtxColumnEmbeddingText`
`:17-44`): digest the resolved per-column/table description text that the embeddings
consume, and fold that digest into the embeddings hash.
- **CLI flag** in `commands/ingest-commands.ts` (`:26-49` option declarations,
`:51-104` action handler): add `--stages` with a custom parser that validates
against the canonical stage registry (`KTX_SCAN_ENRICHMENT_STAGES` in
`enrichment-state.ts:4`) and rejects unknown/empty names with `InvalidArgumentError`.
Thread through `public-ingest.ts` (`KtxScanArgs` build `:969-978`, `mode: 'enriched'`
`:973`) → `scan.ts` (`runKtxScan`) → `local-scan.ts` (`runLocalScan`) →
`runLocalScanEnrichment`.
- **Stage gating + force-rerun** in `context/scan/local-enrichment.ts`: gate each stage
block on membership in the selected set (`descriptions` `:632`, `embeddings`
`:663-665`, `relationships` `:720`); make a named stage bypass the completed-row
short-circuit in `runEnrichmentStage` while the inner `compute()` keeps the spec-20
per-table resume. `KtxLocalScanEnrichmentInput` (`:60-85`) gains the selected-stage
set.
- **Staleness detection + warning** in `context/scan/local-enrichment.ts` (after the
stage blocks): recompute each unselected stage's current hash from on-disk state,
compare to the stored completed-row hash, push a recoverable warning on mismatch.
Add `enrichment_stage_stale` to the `KtxScanWarningCode` union in
`context/scan/types.ts` (alongside `relationship_detection_partial`).
- **Relationships description context** in `context/scan/local-enrichment.ts`
(`schema` built at `:621`/`:688`, passed to `discoverKtxRelationships` `:736-746`):
hydrate `schema` with the best-available descriptions (fresh-this-run or loaded from
the on-disk `_schema` via `loadExistingManifestState`,
`local-enrichment-artifacts.ts`) before relationship detection.
- **Stage store + resume record** in
`context/scan/sqlite-local-enrichment-state-store.ts`
(`local_scan_enrichment_stages`, PK `(connection_id, stage, input_hash)`,
`findCompletedStage`, `saveCompletedStage`); `createKtxScanDescriptionResumeStore`
(`local-enrichment-artifacts.ts:286-332`, path `:265-267`, inputHash gate
`:305-307`). Both are now keyed on the relevant per-stage hash. No migration bridge.
- **Config inputs** in `context/project/config.ts` (`scanRelationshipsSchema`
`:171-218` incl. `llmProposals` `:174` and `detectionBudgetMs`;
`scan.enrichment.embeddings` model/dimensions/batchSize; `llm.models.default`,
`llm.provider.gateway.base_url`): the sources of each per-stage identity slice.
- **Tests** — per-stage invalidation isolation (flip one input, assert only the
matching stage recomputes); `--stages` parse/validate (good subsets + unknown/empty
rejected); resume-aware force-rerun (`--stages descriptions` retries only the null
tables, leaves good ones, completes); preserve-others (unselected artifacts
byte-stable); derived staleness (`enrichment_stage_stale` fires for embeddings after
a descriptions change, not for relationships; cleared by a later `--stages
embeddings`); relationships-only run feeds on-disk descriptions to `llmProposals`;
regression — a normal no-flag ingest yields identical artifacts with spec-19/20
resume intact.
- After implementing, rebuild and re-link so the playground picks it up:
`pnpm run build && pnpm run link:dev`.
- **Docs:** add `--stages` to the `ktx ingest` CLI reference
(`docs-site/content/docs/cli-reference/`) and note the per-stage cache behavior
where enrichment/ingest is described.
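
The per-stage split described in the first bullet can be sketched as follows. This is a minimal illustration, not the shipped code: the type shapes, `stableStringify`, and `stageHashes` are hypothetical names, and the real identity slices come from the config inputs listed above. The point it shows is that each stage hashes only its own inputs, so flipping one input invalidates one stage.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration; the real types live in enrichment-state.ts.
type LlmIdentity = { model: string; baseUrl: string };
type EmbeddingIdentity = { model: string; dimensions: number };
type RelationshipSettings = { llmProposals: boolean; detectionBudgetMs: number };

interface StageInputs {
  snapshotHash: string;
  llm: LlmIdentity;
  embedding: EmbeddingIdentity;
  relationships: RelationshipSettings;
  descriptionDigest: string; // digest of the description text the embeddings consume
}

// Serialize with sorted object keys so logically equal inputs hash identically.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function sha256(parts: unknown): string {
  return createHash("sha256").update(stableStringify(parts)).digest("hex");
}

// Each stage keys only on the slice it actually depends on.
function stageHashes(i: StageInputs) {
  return {
    descriptions: sha256({ stage: "descriptions", snapshot: i.snapshotHash, llm: i.llm }),
    embeddings: sha256({
      stage: "embeddings",
      snapshot: i.snapshotHash,
      embedding: i.embedding,
      descriptionDigest: i.descriptionDigest,
    }),
    relationships: sha256({
      stage: "relationships",
      snapshot: i.snapshotHash,
      settings: i.relationships,
      llm: i.llm,
    }),
  };
}
```

Under these assumptions, toggling `llmProposals` changes only the `relationships` hash, and a new `descriptionDigest` changes only the `embeddings` hash, which is the isolation the test bullet asks for.
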
## Benchmark context (motivation, not a requirement)
Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A
level-aware audit found (a) a tail of BigQuery datasets with poor *column*-description
coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …) that want a **`descriptions`-only**
re-run with a longer timeout, and (b) a desire to **backfill joins** across all
already-ingested datasets after enabling `llmProposals` — without re-paying for
descriptions. Both were blocked by the coarse single `inputHash` (flipping
`llmProposals` or re-describing invalidated the whole enrichment) and the absence of a
stage-selective CLI flag. The benchmark merely exercised multi-backend
ingestion at scale; the gap and the fix are generic production operability. Do not
encode any benchmark specifics in the implementation.
## Implementation notes
Shipped on branch `write-feature-spec-wiki`. All seven requirements implemented;
all acceptance criteria covered by tests.
**What was built / where:**
- **Per-stage hashes (D1, Req 1).** `context/scan/enrichment-state.ts`: removed the
coarse `computeKtxScanEnrichmentInputHash` and added
`computeKtxDescriptionsStageHash` (snapshot + `llmIdentity`),
`computeKtxEmbeddingsStageHash` (snapshot + `embeddingIdentity` + `descriptionDigest`),
`computeKtxRelationshipsStageHash` (snapshot + `relationshipSettings` + `llmIdentity`),
plus `computeKtxScanDescriptionDigest` and the `KtxScanLlmIdentity` /
`KtxScanEmbeddingIdentity` types. `KTX_SCAN_ENRICHMENT_STAGES` is now exported as the
canonical registry. `local-scan.ts` `localScanProviderIdentity` was split into
`localScanLlmIdentity` + `localScanEmbeddingIdentity` (dropping the redundant
`mode`/`relationships` re-encoding). `mode`/`detectRelationships` dropped out of the
keys. No migration bridge — the stage store + descriptions resume record just miss the
old coarse-keyed rows.
- **`descriptionDigest` (D1/D4).** `local-enrichment.ts`: extracted
`buildKtxColumnEmbeddingTexts(snapshot, descriptions)`, shared by the embeddings stage
and the digest, so the embeddings hash content-addresses the exact text the model sees.
- **`--stages` flag (D2/D6, Req 2).** `commands/ingest-commands.ts`:
`parseEnrichmentStagesOption` (Commander parser) validates against the registry,
rejects unknown/empty with `InvalidArgumentError`, returns an ordered de-duplicated
set; threaded through `KtxPublicIngestArgs` → `context-build-view` → `KtxScanArgs` →
`RunLocalScanOptions` → `KtxLocalScanEnrichmentInput`. One gate
(`(eligible) ∩ (selected)`); no new `KtxScanMode`. A selected-but-ineligible stage
emits a new `enrichment_stage_skipped` warning (never a silent no-op).
- **Force-rerun (D3, Req 3).** `runEnrichmentStage` gained `forceRecompute`; a named
stage bypasses the spec-19 completed-row short-circuit while `generateDescriptions`
still consults the spec-20 per-table resume record (retries only failed tables on an
unchanged hash).
- **Descriptions hydration + `llmProposals` context (D5, Req 5).** `runLocalScanEnrichment`
resolves best-available descriptions (fresh-this-run, else on-disk via a lazy
`loadPriorDescriptions` thunk wired from `local-scan.ts` →
`loadOnDiskDescriptionUpdates` in `local-enrichment-artifacts.ts`). `snapshotToKtxEnrichedSchema`
now merges `ai` descriptions, and `relationship-llm-proposal.ts` `buildEvidencePacket`
now carries the resolved description text — closing the latent gap on **both** the
full-run and relationships-only paths.
- **Derived staleness (D4, Req 4).** `enrichment_stage_stale` warning code +
`findLatestCompletedStage` on the state store (interface + sqlite + test store). After a
selective run, each unselected stage with a completed row is compared against its
freshly recomputed hash; a mismatch warns and names the cascade command. Relationships
are never flagged by a description change (decoupled per D5).
- **Docs.** `docs-site/content/docs/cli-reference/ktx-ingest.mdx`: `--stages` flag row, a
"Selecting enrichment stages" section (per-stage cache, force-rerun, staleness), and
examples.
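
A rough sketch of the `--stages` parser behavior described above. Several details are assumptions: the comma separator, the canonical-order result, and the function name `parseStages` are illustrative, and the local `InvalidArgumentError` class is a stand-in for Commander's class of the same name so the sketch runs without dependencies.

```typescript
// Stand-in for Commander's InvalidArgumentError (assumption: kept local so the
// sketch has no third-party imports).
class InvalidArgumentError extends Error {}

// Mirrors the three enrichment stages named in this spec.
const STAGES = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof STAGES)[number];

// Assumes a comma-separated value, e.g. `--stages descriptions,relationships`.
function parseStages(raw: string): Stage[] {
  const names = raw.split(",").map((s) => s.trim());
  if (names.some((n) => n === "")) {
    throw new InvalidArgumentError("--stages requires non-empty stage names");
  }
  const unknown = names.filter((n) => !(STAGES as readonly string[]).includes(n));
  if (unknown.length > 0) {
    throw new InvalidArgumentError(
      `unknown stage(s): ${unknown.join(", ")}; expected one of ${STAGES.join(", ")}`,
    );
  }
  // Filtering the registry yields an ordered, de-duplicated selection.
  return STAGES.filter((s) => names.includes(s));
}
```

Filtering the registry rather than the user input is one simple way to get ordering and de-duplication in a single step; whether the shipped parser orders by registry or by first mention is not stated here.
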
**Deviation from the spec — embeddings hydration is descriptions-only.** D5 states a
relationships-only run should hydrate "AI descriptions **and** embeddings" from the
on-disk `_schema`. Investigation found the `_schema` manifest shards store only
descriptions; embedding vectors are written to a **syncId-scoped** `enrichment/embeddings.json`
that no code reads back, and each run mints a fresh syncId — so there is no durable
per-connection embeddings artifact to hydrate from. A relationships-only run therefore
hydrates **descriptions** (required for, and verified against, the `llmProposals`
acceptance criterion) but **not** embeddings. Consequence: a `--stages relationships`
backfill gets deterministic + name-based + LLM-proposed candidates (the point of
`llmProposals`), but not the embedding-similarity candidates a full run would add.
Durable embeddings hydration (persist vectors at a stable per-connection path, or read
them from the vector index) is a clean follow-on and was left out of scope.
**Tests:** `enrichment-state.test.ts` (per-stage hash stability + isolation),
`commands/ingest-commands.test.ts` (parser good/bad subsets, threading, text-capture
guard), `local-enrichment.test.ts` (force-rerun bypasses short-circuit + preserves
others, naming all three forces a full recompute, per-stage invalidation isolation,
prerequisite warning, on-disk descriptions reach `llmProposals`, resume-aware forced
descriptions rerun, derived `enrichment_stage_stale` fires for embeddings/not
relationships and clears after re-embed). Full `pnpm --filter @kaelio/ktx run test`,
`type-check`, `dead-code`, and `build` pass. (One pre-existing unrelated failure in
`test/skills/analytics-skill-content.test.ts` — the analytics `SKILL.md` lacks a
`**Window functions**` heading the test expects — was present before this work and left
untouched.)
---
## ⚠️ Defect found in post-implementation validation (2026-06-24)
**`--stages` subset excluding `descriptions` WIPES existing on-disk descriptions.** Violates Req
"preserve-others / a selective run never deletes another stage's artifacts."
**Reproduction (deterministic):**
- `northwind` before: 110 `ai:` column/table descriptions, 0 join edges.
- `ktx-dev ingest northwind --stages relationships` → completes in ~35s, adds **22 join edges**
but the rewritten `public.yaml` has **0 descriptions** (no `ai:`, no `db:`, columns bare). ❌
- A full `ktx-dev ingest northwind` (all stages) restores 110 descriptions + keeps the 22 joins.
**Likely root cause:** the relationships-only path rewrites the schema from the raw snapshot + only the
freshly-run stage. The implementation notes claim `snapshotToKtxEnrichedSchema` merges `ai` descriptions
and that descriptions are hydrated "fresh-this-run, else on-disk via `loadPriorDescriptions`" — but on the
**write path** of a subset run the prior descriptions are NOT merged into the emitted schema (they reach
the `llmProposals` evidence packet only). So the on-disk `_schema` loses them.
**Impact:** blocks the intended joins-everywhere backfill (`--stages relationships` across all dbs) and the
`--stages descriptions`-only re-runs — either would destroy the unselected stage's artifacts across every
db. Caught on a 1-db validation before any rollout.
**Acceptance fix:** after any `--stages` subset, the on-disk `_schema` must **retain all prior `ai:`/`db:`
descriptions** (and prior joins when descriptions-only) for stages not named — only the named stages'
artifacts change. Add a regression test that ingests a fully-enriched fixture, runs `--stages relationships`,
and asserts description count is unchanged while joins increase.
### ✅ Fixed (2026-06-24)
**Real root cause (deeper than the first diagnosis):** the wipe happened in **two** places, and the first
fix attempt only addressed one. `runLocalScan` (`context/scan/local-scan.ts`) writes the **structural**
manifest shard from the bare snapshot *before* enrichment runs; that write merges with the on-disk shard,
but the merge (`mergeDescriptionsPreservingExternal`, `live-database/manifest.ts`) treats `ai`/`db` as
**scan-managed** and overwrites them with whatever the run emits — and the structural write emits none. So a
subset run deleted the descriptions on the structural pre-write, *then* `runLocalScanEnrichment` read the
already-wiped shard via `loadPriorDescriptions` and had nothing to restore. (A unit-level enrichment test
passed because it never exercised the structural pre-write — a divergent-harness miss; the regression test
was rewritten to go through the full `runLocalScan` path.)
**What changed:**
- `runLocalScanEnrichment` (`local-enrichment.ts`) now returns the **best-available** descriptions
(`resolveDownstreamDescriptions()` — fresh-this-run if `descriptions` ran, else the on-disk ones) as
`descriptionUpdates`, instead of `[]` when the stage is skipped — so the enrichment write re-applies them.
- `runLocalScan` (`local-scan.ts`) now, on a subset run, **captures the prior on-disk descriptions before
the structural manifest write** and feeds them to both the structural write and enrichment — so the
structural pre-write preserves them too (robust even if relationship detection later fails).
- Joins were already preserved for `--stages descriptions` via the existing manual/inferred
`preservedJoins` path; verified by a symmetric test.
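
The wipe and its fix can be modeled in a few lines. This is a simplified sketch, not the real merge: `mergeShard` and `structuralWrite` are hypothetical names, and the `user` key stands in for any externally managed description. It shows why a merge that treats `ai`/`db` as scan-managed deletes them when the structural write emits none, and why capturing the prior descriptions before that write preserves them.

```typescript
type Descriptions = { ai?: string; db?: string; user?: string };
type Shard = Record<string, Descriptions>; // column key -> descriptions

// Scan-managed keys (ai, db) take whatever the run emits; external keys survive.
function mergeShard(onDisk: Shard, emitted: Shard): Shard {
  const out: Shard = {};
  for (const [key, next] of Object.entries(emitted)) {
    const merged: Descriptions = {};
    const prior = onDisk[key];
    if (prior?.user !== undefined) merged.user = prior.user; // external: preserved
    if (next.ai !== undefined) merged.ai = next.ai; // scan-managed: emitted wins
    if (next.db !== undefined) merged.db = next.db;
    out[key] = merged;
  }
  return out;
}

// Structural pre-write. Without `prior`, it emits bare columns and so wipes ai/db.
function structuralWrite(onDisk: Shard, columns: string[], prior?: Shard): Shard {
  const emitted: Shard = {};
  for (const col of columns) {
    emitted[col] = { ...(prior?.[col] ?? {}) };
  }
  return mergeShard(onDisk, emitted);
}
```

Calling `structuralWrite(onDisk, columns)` reproduces the defect in this model; passing the captured on-disk shard as `prior` corresponds to the subset-run fix in `runLocalScan`.
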
**Tests:** `local-scan.test.ts` — a full `runLocalScan` `--stages relationships` run preserves on-disk `ai`
descriptions while adding a join (RED without the fix, GREEN with it). `local-enrichment.test.ts` — the
enrichment-layer contract (`--stages relationships` preserves descriptions / `--stages descriptions`
preserves joins).
**Live validation (northwind, 15 tables):** `--stages relationships` BEFORE `ai:110 joins:22` → AFTER
`ai:110 joins:22` (descriptions intact; previously wiped to 0). `--stages descriptions` restored the
descriptions from the spec-20 resume record (`ai:0 → ai:110`) with **no** LLM calls while keeping `joins:22`.
Full `pnpm --filter @kaelio/ktx run test` (3089 passed), `type-check`, `dead-code`, and `build` pass.


@ -1,463 +0,0 @@
# Resumable and fault-tolerant source ingest
> Refined spec. No intake draft — surfaced by a real user report, not the
> playground agent (see Motivation). Lives beside the analogous scan-durability
> specs 19/20.
>
> **Scope: make `ktx ingest` (the source-ingest work-unit pipeline behind dbt /
> Metabase / Notion) survive interruption and partial failure on large
> projects.** Two compounding gaps live on the source-ingest path: (1) an
> interrupted run restarts every work unit from scratch — there is no cross-run
> reuse of already-generated work-unit output, so a multi-day dbt ingest loses
> *all* progress to a single VPN/network blip; (2) the final integration gate is
> all-or-nothing — one artifact that cannot pass it (after LLM repair) discards
> the **entire** run with nothing committed. This is the source-ingest analog of
> spec 19 (move the durability boundary to the cost boundary so expensive LLM
> work is not lost) and spec 20 (a stage survives an interruption with per-item
> durability). It **reuses** the same content-keyed durability primitive those
> specs established rather than copying it.
## Problem
Two independent failure modes on the source-ingest work-unit (WU) pipeline,
both confirmed in the current code, both observed by a user on a ~2-day dbt
ingest. Their union makes large-project ingest brittle: any interruption is
total loss, and any single unfixable artifact at the end is total loss.
### 1. An interrupted run resumes nothing — every work unit re-runs
`IngestBundleRunner` (`context/ingest/ingest-bundle.runner.ts`) executes a run as
a sequence of stages: fetch → parse/extract into **work units** → run each WU as
an isolated agent loop in a child worktree (`runIsolatedWorkUnit` →
`executeWorkUnit`, `stages/stage-3-work-units.ts`) → integrate the successful WU
patches → reconcile → finalize → final gates → one atomic squash commit
(`squashMergeIntoMain`, ~2716). The WU stage is where the LLM cost lives: each WU
is an agent loop that reads its `rawFiles`/`dependencyPaths` and writes SL/wiki
artifacts, producing a git patch (`WorkUnitOutcome.patchPath` /
`patchTouchedPaths`, `stage-3-work-units.ts:31-46`).
The only persisted cross-run state is `SqliteBundleIngestStore`
(`context/ingest/sqlite-bundle-ingest-store.ts`): run metadata, the final report,
and provenance — all written at or near **run completion**. There is **no
checkpoint of completed WU output**. A run that dies mid-flight (the user's
VPN/network drop) leaves nothing reusable: the next `ktx ingest` re-fetches,
re-parses, and **re-executes every WU from scratch**, re-paying the entire LLM
cost. The store even keys `job_id` UNIQUE, so a re-run is a brand-new job with no
relationship to the interrupted one.
> Observed (user report, large dbt project): a run reached deep into its
> work-unit progress and was lost to a network blip; the follow-up run started
> over from zero. On a ~2-day ingest this is the difference between a 5-minute
> resume and a 2-day redo.
### 2. The final integration gate is all-or-nothing
After all surviving WUs are integrated, `validateFinalIngestArtifacts`
(`context/ingest/artifact-gates.ts:96`) runs the final gate. It checks, across
the *integrated* tree:
- **intrinsic source validity**: `validateTouchedSources` →
`validateWuTouchedSources` (`stages/validate-wu-sources.ts:124`) →
`validateSingleSource` (`context/sl/tools/sl-warehouse-validation.ts:56`),
which runs a **live warehouse dry-run** (`SELECT * FROM (sql) LIMIT 1`);
- **cross-artifact references** — dangling join targets
(`findJoinTargetErrors`, `validate-wu-sources.ts:89`), dangling `wiki→wiki`
refs (`validateWikiRefs` → `findMissingWikiRefs`), broken `wiki→sl_ref`s
(`validateWikiSlRefs`, `artifact-gates.ts:39`), and broken wiki body refs
(`findInvalidWikiBodyRefs`).
On any error it **`throw`s a single concatenated string** (`artifact-gates.ts:129`).
The runner catches it, runs the LLM repair `repairFinalGateFailure`
(`runner.ts:2595`, `maxAttempts: 2`), and if repair still fails, **re-throws**
(`runner.ts:2623`) → `markFailed` → the squash never runs → `commitSha: null`
(`runner.ts:2729`) → **the whole run is discarded, nothing committed.**
The crucial asymmetry: a WU that fails *on its own terms* never reaches this gate,
because `executeWorkUnit` already validates each WU in isolation (`validateWikiRefs`
~143, `validateTouchedSources` ~150) and **soft-fails** it (`failWithReset`,
~155: the WU resets, is excluded from integration, and the run continues). So by
the time the final gate runs, intrinsic single-source failures are rare. The
gate fails predominantly on **cross-artifact dangling references**: WU-A's source
joins to a source WU-B was meant to create, but WU-B failed/was-excluded, so
A's join now points at nothing. Each WU passed *alone*; the break only appears
once the survivors are integrated — and that break currently nukes the run.
> Observed (user report): a run completed all task generation and then failed at
> the final integration gate on a **single model**; because the gate is
> all-or-nothing, that one failure discarded an ~18h run with nothing committed.
## Generic use case (independent of any benchmark)
Anyone ingesting a large warehouse/BI/dbt project with an LLM pipeline will hit
both failures. Large ingests run long enough that an interruption is a *when*,
not an *if* (laptop sleep, VPN reconnect, transient provider error, an operator
ctrl-C on an apparently-stuck run), and a large artifact set makes it
near-certain that *some* model lands a cross-reference its sibling didn't
produce. Without cross-run reuse, every interruption is a from-scratch redo of
the dominant (LLM) cost; without partial commit, one unfixable artifact throws
away every good one. Both fixes make large-project ingest **resilient and
resumable**: an interruption costs only the unfinished work, and a single bad
model costs only that model — not the run. This is core robustness for a
general-purpose ingestion product.
## Design decisions (resolved during refinement)
These resolve the design space explored during refinement. They constrain the
implementer; the exact code is theirs (requirement-level, per the specs README).
### D1 — Resume is automatic and content-keyed at the work-unit level
A successful WU's output is cached across runs, keyed by a **content hash of its
inputs**, with **no `--resume` flag**. Re-running the same `ktx ingest`
transparently replays any WU whose inputs are byte-identical to a cached success
and re-runs only the changed, failed, or missing WUs. The key is computed over:
the contents of the WU's `rawFiles` + `dependencyPaths` (the bytes the WU reads,
`types.ts:19-28`), the adapter/source identity, and a **version/prompt
fingerprint** (ktx version + the WU system/user prompt + model role). A changed
dbt model busts only that model's entry; everything unchanged replays for free.
> No flag, no config knob. Content-keying makes resume automatic; a flag would
> double the state space for no benefit. This is the same shape scan uses
> (`computeKtxScanEnrichmentInputHash`, spec 19), reached here for the WU
> pipeline.
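
As a sketch of the key described in D1, the hash below covers the file bytes, the adapter identity, and the version/prompt fingerprint. The interface and function names are hypothetical, and file contents are modeled as strings for brevity; the real key would hash the actual bytes of `rawFiles` and `dependencyPaths`.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape for illustration.
interface WorkUnitKeyInput {
  adapter: string; // adapter/source identity
  files: Record<string, string>; // path -> content of rawFiles + dependencyPaths
  ktxVersion: string;
  prompt: string; // WU system/user prompt text
  modelRole: string;
}

function workUnitInputHash(input: WorkUnitKeyInput): string {
  const h = createHash("sha256");
  // Length-prefix each field so adjacent values cannot be confused with each other.
  const field = (label: string, value: string): void => {
    h.update(`${label}:${Buffer.byteLength(value)}:`);
    h.update(value);
  };
  field("adapter", input.adapter);
  field("ktxVersion", input.ktxVersion);
  field("prompt", input.prompt);
  field("modelRole", input.modelRole);
  // Sort paths so the key does not depend on enumeration order.
  const entries = Object.entries(input.files).sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0,
  );
  for (const [path, content] of entries) {
    field("path", path);
    field("content", content);
  }
  return h.digest("hex");
}
```

Including the prompt and version in the key biases toward recompute, which matches the conservative-invalidation stance: an unnecessary recompute is acceptable, a stale reuse is not.
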
### D2 — The cached unit is the successful WU's patch; replay verifies or recomputes
The cache stores a successful WU's **output artifacts**: its git patch
(`patchPath` content / `patchTouchedPaths`) plus the metadata integration needs
(`actions`, `touchedSlSources`, `slDisallowed`). On a cache hit, the runner
**replays the patch** into the session worktree — no agent loop, no LLM — exactly
where it would have integrated a freshly-run WU. If a cached patch **fails to
apply** (the surrounding tree drifted), the entry is discarded and the WU
**recomputes**. So a stale hit degrades to "recompute," never to a corrupt tree:
the cache can only make a run faster, never wrong.
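
The replay-or-recompute rule can be sketched as a small function. Everything here is illustrative: `resolveWorkUnit` and `WorkUnitDeps` are hypothetical names, the cache is an in-memory `Map`, and the agent loop and patch application are injected as synchronous callbacks to keep the sketch short (the real ones would be asynchronous). It also reflects D4 by caching only successes.

```typescript
// Hypothetical dependencies, injected so the sketch stays self-contained.
interface WorkUnitDeps<P> {
  cache: Map<string, P>; // inputHash -> cached patch of a successful WU
  apply: (patch: P) => boolean; // true if the patch applied cleanly
  run: () => P | null; // runs the agent loop; null on failure
}

type Resolution = "replayed" | "recomputed" | "failed";

function resolveWorkUnit<P>(inputHash: string, deps: WorkUnitDeps<P>): Resolution {
  const cached = deps.cache.get(inputHash);
  if (cached !== undefined) {
    if (deps.apply(cached)) return "replayed"; // no agent loop, no LLM call
    deps.cache.delete(inputHash); // stale entry: discard, then recompute
  }
  const fresh = deps.run();
  if (fresh === null) return "failed"; // failures are never cached, so they retry
  if (!deps.apply(fresh)) return "failed";
  deps.cache.set(inputHash, fresh); // only successes are cached
  return "recomputed";
}
```

A stale hit falls through to the same code path as a miss, so in this model the worst case of a bad cache entry is one extra agent loop.
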
### D3 — One durability primitive, shared by scan and ingest
Per the "one capability, one implementation" rule, the content-keyed store is
**extracted** into a shared primitive and **both** scan and ingest route through
it — not copied. Scan's `sqlite-local-enrichment-state-store.ts` (PK
`(connection_id, stage, input_hash)`, `findCompletedStage` / `saveCompletedStage`)
and its `inputHash` computation (`enrichment-state.ts`) are generalized to a
content-keyed result cache; scan is migrated onto the shared primitive **in the
same change** so no second copy exists even transiently. The ingest cache is a
new logical namespace (e.g. keyed `(connectionId, sourceKey, workUnitInputHash)`)
on that one store.
> Extract-and-share in one PR, not "build a copy for ingest now, unify later."
> A temporary fork is exactly the divergence the rule forbids; the one-time
> extraction cost is paid once and both paths benefit from every later fix.
### D4 — Only successes are cached; failures retry on the next run
A failed WU is **not** recorded as terminal — the next run retries it. WU
failures on this path are dominantly transient (network, provider stall, an LLM
slip), and the user's explicit ask is "resume and finish the rest," so a failure
must not be sticky. This deliberately differs from scan's stage store (which
caches failed stages and re-throws): there the failure is the stage's
deterministic verdict; here a WU failure is usually a blip to retry. Caching only
successes also keeps the invariant simple — a cache entry always means "this
exact input already produced this exact good output."
### D5 — The final gate becomes non-fatal: deterministic dangling-edge prune
Replace the gate's fatal `throw`-after-repair with a deterministic reconciliation
that always yields a committable, internally-consistent tree:
1. `validateFinalIngestArtifacts` is refactored to **return structured findings**
(the danglers it already computes internally — join targets, `wiki→wiki`,
`wiki→sl_ref`, wiki body refs — plus any intrinsic source failure) instead of
flattening them into a thrown string.
2. **Drop the rare self-invalid source first.** A source that fails its *own*
validation at the final gate (intrinsic — rare, since stage 3 already filters
these) is removed, establishing the surviving artifact set.
3. **Prune the dead edges in a single pass** over that surviving set. For each
dangling reference — whether it pointed at an absent sibling or at a
just-dropped source — **remove that reference from its owner** (drop the join
entry, remove the `wiki ref` / `sl_ref`, remove the broken body link), keeping
the owning artifact. Because nodes are dropped first (step 2) and pruning only
removes edges, pruning **cannot create a new dangling edge, so one pass
suffices; no fixpoint.**
4. Re-run the gate to **confirm** the remainder is clean (warehouse dry-runs are
cached per D6/D2, ref checks are in-memory, so this is cheap), then squash-commit
the remainder. If the confirm pass *still* fails, that is a real bug — fail the
run loudly rather than commit a dirty tree.
`repairFinalGateFailure` (the LLM repair, `runner.ts:2595` / `final-gate-repair.ts`)
is **removed**. The deterministic prune supersedes it for the referential class,
and the rare intrinsic case is handled by drop.
> **Prune the edge, do not cascade the node.** The rejected alternative drops the
> *referencing artifact* and, transitively, everything that referenced *it* — a
> node-quarantine fixpoint that cascades healthy artifacts and needs a closure
> search, a confirm loop, and an un-apply step. Pruning the dead edge keeps the
> dependent intact (minus one pointer that never resolved anyway), needs no
> fixpoint, and acts on findings the gate already produces.
>
> **Why remove the LLM repair rather than keep it as a pre-prune step.** Repair
> can occasionally *fix* a ref (e.g. correct a typo'd source name) where prune
> merely deletes it, preserving marginally more content. We drop it anyway:
> determinism beats an LLM round-trip with variance on the commit path, prune
> guarantees a commit where repair could only `throw`, and deleting it is a net
> maintenance reduction. The decision is reversible — repair could later run as a
> best-effort pass *before* prune — but the default is prune-only.
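
Steps 2 to 4 can be sketched on a toy model where each source carries only a validity flag and a list of join targets. The `Source` shape and `reconcile` are hypothetical, and wiki refs are omitted; the sketch only illustrates the drop-nodes-then-prune-edges ordering and the confirm pass.

```typescript
// Toy model: `valid` stands in for the source's own validation result.
interface Source { name: string; valid: boolean; joins: string[] }
interface PrunedRef { owner: string; target: string }
interface Reconciled { survivors: Source[]; dropped: string[]; pruned: PrunedRef[] }

function reconcile(sources: Source[]): Reconciled {
  // Step 2: drop self-invalid sources first, fixing the surviving node set.
  const dropped = sources.filter((s) => !s.valid).map((s) => s.name);
  const alive = sources.filter((s) => s.valid);
  const names = new Set(alive.map((s) => s.name));

  // Step 3: one pass that removes dead edges and keeps every owning artifact.
  const pruned: PrunedRef[] = [];
  const survivors = alive.map((s) => {
    const joins: string[] = [];
    for (const target of s.joins) {
      if (names.has(target)) joins.push(target);
      else pruned.push({ owner: s.name, target });
    }
    return { ...s, joins };
  });

  // Step 4: confirm. Removing edges never removes a node, so this should hold.
  const dirty = survivors.some((s) => s.joins.some((t) => !names.has(t)));
  if (dirty) throw new Error("confirm pass failed: refusing to commit a dirty tree");
  return { survivors, dropped, pruned };
}
```

Because the node set is fixed before any edge is removed, the second pass has nothing new to find, which is the argument for "no fixpoint" above.
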
### D6 — Prune runs on the integrated tree, never poisons the cache (resume ∘ prune compose)
Pruning is applied to the **integrated session worktree** at gate time and is
**re-derived from the current survivor set on every run**. It MUST NOT mutate the
cached WU patches (D2). This makes resume and prune compose correctly and
**self-heal**:
- Run 1: WU-A (joins to B) succeeds and is cached *with its join intact*; WU-B
fails; the gate prunes A's join-to-B from the integrated tree and commits A
without it.
- Run 2 (after the root cause is fixed): A's input is unchanged → A **replays
from cache with its join restored**; B now succeeds and exists; the gate finds
no dangler and commits both, fully linked.
So a ref pruned because of a sibling's failure costs nothing permanent: fixing
the sibling and re-running restores the link for free. The cache stores
intent (the WU's real output); prune is a per-run consistency projection over
whatever survived.
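
The two-run scenario above can be traced with a toy model in which prune is a pure projection over the cached patches. `Patch` and `integrate` are hypothetical names; the only property being illustrated is that the cached patch is never mutated, so the join reappears once its target exists.

```typescript
// Cached WU output, modeled as read-only so integration cannot mutate it.
type Patch = Readonly<{ name: string; joins: readonly string[] }>;

// Builds the integrated tree for one run, pruning refs to absent targets.
function integrate(patches: readonly Patch[]): { name: string; joins: string[] }[] {
  const present = new Set(patches.map((p) => p.name));
  return patches.map((p) => ({
    name: p.name,
    joins: p.joins.filter((target) => present.has(target)),
  }));
}

// Run 1: A succeeded and is cached with its join intact; B failed.
const cachedA: Patch = { name: "A", joins: ["B"] };
const run1 = integrate([cachedA]);

// Run 2: A replays from cache unchanged; B now succeeds.
const freshB: Patch = { name: "B", joins: [] };
const run2 = integrate([cachedA, freshB]);
```

In run 1 the integrated copy of A has no joins, while `cachedA` still lists `B`; in run 2 the same cached patch integrates with the join intact.
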
### D7 — Pruning is faithful and never silent
A pruned reference was, by definition, non-functional (its target was absent), so
removing it loses nothing executable — and removing dangling SL joins is already
the established fix for the SL engine's eager orphan-join rejection. Every prune
and every drop MUST be **recorded in the run report and a trace event** naming
the artifact, the removed reference, and the absent target. The report status
MUST reflect partial completion (extend the existing `failedWorkUnits`
mechanism, `IngestBundleResult`, `types.ts:204-213`, with the pruned-refs /
dropped-sources detail) so a partial run is visibly partial, never a silent
"success."
### D8 — Cache state is regenerable; no migration bridge
The WU cache is regenerable local state under `.ktx/`. Its on-disk/SQLite shape
may change with **no migration bridge** — a stale-shaped or absent cache simply
forces a full (non-resumed) run, exactly today's behavior. Consistent with ktx's
no-backward-compatibility policy; the cache is an optimization, never a source of
truth.
## Requirements
1. **Cross-run WU resume, automatic and content-keyed.** A successful WU's output
MUST be cached keyed by a content hash over its input bytes
(`rawFiles` + `dependencyPaths`), the adapter/source identity, and a
version/prompt fingerprint (ktx version + WU prompt + model role). Re-running
`ktx ingest` MUST replay cached successes without an agent loop / LLM call and
re-run only changed, failed, or missing WUs. No `--resume` flag and no config
knob is added.
2. **Replay verifies or recomputes.** On a cache hit the runner MUST replay the
stored patch into the session worktree; if the patch does not apply cleanly the
entry MUST be discarded and the WU recomputed. A cache hit MUST NOT be able to
produce a tree different from what a fresh run of that WU would have integrated.
3. **Only successes are cached.** A failed WU MUST NOT be recorded as terminal; it
MUST be retried on the next run.
4. **Conservative invalidation.** The input hash MUST change when the ktx version,
the WU prompt, or the model role changes (bias toward recompute). Under-keying
(stale reuse) is a correctness bug; over-keying (an unnecessary recompute) is
acceptable.
5. **The final gate is non-fatal.** A final-gate failure MUST NOT discard the run.
`validateFinalIngestArtifacts` MUST return structured findings; the runner MUST
deterministically **prune** every dangling reference from its owning artifact
and **drop** any source that fails its own validation, then commit the
remaining internally-consistent tree.
6. **Single-pass prune, dependents survive.** Pruning MUST remove dead *edges*, not
cascade-drop owning artifacts; it MUST complete in a single pass (no fixpoint)
because edge removal cannot create new dangling edges. A dependent that loses
one dangling ref MUST otherwise be committed intact.
7. **Prune composes with resume.** Pruning MUST operate on the integrated tree and
MUST NOT mutate cached WU patches. A reference pruned in one run because its
target was absent MUST be restored automatically on a later run once the target
exists (resume replays the owner's intact patch).
8. **Confirm before commit.** After pruning/dropping, the gate MUST be re-run on
the remainder and MUST pass before the squash; if it still fails the run MUST
fail loudly rather than commit a dirty tree.
9. **`repairFinalGateFailure` is removed.** The LLM final-gate repair path and its
obsolete tests/branches MUST be deleted (no dormant compatibility path).
10. **Every prune/drop is reported.** Each pruned reference and dropped source MUST
be recorded in the run report and a trace event (artifact, removed ref, absent
target). A run that pruned or dropped anything MUST report as partial, never as
an unqualified success.
11. **One shared durability primitive.** The content-keyed store MUST be a single
implementation used by both scan and ingest; scan MUST be migrated onto it in
the same change. No second copy may exist, even transiently.
12. **No regression for clean runs.** A small, uninterrupted run whose every WU
passes and whose final gate is clean MUST produce byte-identical artifacts and
the same `commitSha`/report shape (modulo new, empty pruned/dropped fields) as
today.
## Acceptance criteria
- **Resume skips completed work:** interrupt an ingest after K of N WUs have
succeeded; re-run the same command (unchanged inputs); the run issues **zero**
agent loops / LLM calls for the K cached WUs, runs only the remaining N-K, and
produces the same final artifacts as an uninterrupted run.
- **Changed model busts only its entry:** edit one dbt model between runs; the
re-run re-executes **only** the WU(s) whose input bytes changed and replays the
rest from cache.
- **Stale patch self-corrects:** a cached patch that no longer applies (forced
drift in a test) causes that WU to recompute, not a corrupt tree or a crash.
- **Failures retry:** a WU that fails in run 1 (transient error) is **not** cached;
run 2 retries it and, on success, integrates it.
- **One bad model no longer nukes the run:** a run where WU-B fails so WU-A's join
to B dangles **commits** — A is committed with the dangling join **pruned**, the
report lists the pruned ref, and `commitSha` is non-null (contrast: today this
throws and commits nothing).
- **No cascade:** in that scenario A (and any other artifact that only referenced
B) is committed intact except for the single pruned reference; nothing healthy
is dropped.
- **Self-heal:** fix B's root cause and re-run; A replays from cache with its join
intact, B succeeds, and the final tree commits both fully linked with no prune.
- **Intrinsic drop:** a source that fails its own warehouse dry-run at the final
gate (forced) is dropped, refs to it are pruned, and the rest commits; the drop
is reported.
- **Repair is gone:** `repairFinalGateFailure` and its tests no longer exist; the
gate path has no LLM call.
- **One store:** scan and ingest both resume through the same content-keyed
primitive (one implementation; scan's behavior is unchanged by the migration —
spec 19/20 acceptance still passes).
- **Clean-run regression:** a small uninterrupted all-passing ingest yields
identical artifacts, `commitSha`, and report (empty pruned/dropped fields) to
today.
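
The resume criteria above (skip completed work, bust only changed entries, retry failures) can be sketched as a content-keyed lookup around WU generation. This is an illustrative sketch, not the shipped store: `WorkUnitInput`, `runWithCache`, and the in-memory `Map` are hypothetical, the key covers far fewer inputs than a real one would, and the FNV-1a-style hash stands in for a proper cryptographic digest.

```typescript
// Hypothetical shapes; a real key would cover more inputs than these three fields.
interface WorkUnitInput {
  id: string;
  rawBytes: string; // contents of the WU's input files
  dependencyBytes: string; // contents of the files it depends on
}

type GenerateResult = { ok: true; patch: string } | { ok: false; error: string };

// FNV-1a-style 32-bit hash over UTF-16 code units; a stand-in for a real digest.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// The key depends only on content, so unchanged inputs hit and edited inputs miss.
function inputHash(wu: WorkUnitInput): string {
  return fnv1a(JSON.stringify([wu.id, wu.rawBytes, wu.dependencyBytes]));
}

function runWithCache(
  units: WorkUnitInput[],
  cache: Map<string, string>,
  generate: (wu: WorkUnitInput) => GenerateResult,
): { patches: Map<string, string>; generated: string[] } {
  const patches = new Map<string, string>();
  const generated: string[] = [];
  for (const wu of units) {
    const key = inputHash(wu);
    const cached = cache.get(key);
    if (cached !== undefined) {
      patches.set(wu.id, cached); // replay: no agent loop for this WU
      continue;
    }
    generated.push(wu.id);
    const result = generate(wu);
    if (result.ok) {
      cache.set(key, result.patch); // success-only: failures are never cached
      patches.set(wu.id, result.patch);
    }
  }
  return { patches, generated };
}
```

Because failures are never written, a WU that failed in one run is simply a cache miss in the next, which gives the retry behavior without any extra state.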
## Non-goals
- **Resuming the cross-WU stages.** Reconciliation, finalization, and the final
gate re-run every time; their inputs depend on the full survivor set and their
cost is small relative to WU generation. Only WU generation is cached.
- **A `--resume` flag or any timeout/cache config knob.** Content-keying makes
resume automatic (D1); one opinionated default is the canonical ktx shape.
- **Caching failed WUs as terminal.** Failures retry (D4).
- **Node-cascade quarantine of the final gate.** Prune edges, do not drop
dependents (D5). No closure search, confirm-loop-over-nodes, or un-apply step.
- **Tolerating dangling references (warn instead of remove).** Unsafe — the SL
engine eagerly rejects orphan joins — so dead edges must be removed, not kept.
- **Keeping the LLM final-gate repair.** Removed (D5/req 9).
- **A general per-stage resume framework beyond the shared content-keyed store.**
The store is the one shared primitive (D3); this spec does not abstract every
ingest stage into a resumable framework.
- **Re-implementing spec 19/20 (scan durability).** This spec composes the same
primitive onto the source-ingest WU pipeline.
## Implementation orientation
Line numbers drift; treat these as anchors, not addresses. The implementer owns
the design.
- **Run flow + the all-or-nothing seam**: in `context/ingest/ingest-bundle.runner.ts`, the
WU run + integration of successful patches (~1600-1900), the final-gate block
(~2549-2587, `runFinalArtifactGates`), the repair-then-rethrow that must be
replaced by prune (~2588-2644; the fatal `throw` ~2623), and the atomic squash
(~2701-2729; `commitSha: null` when nothing is touched ~2729). The prune step
slots between the gate findings and the squash, operating on `sessionWorktree`.
- **Work units & cacheable output**: `context/ingest/types.ts` (`WorkUnit`
~19-28: `rawFiles`/`peerFileIndex`/`dependencyPaths`; `IngestBundleResult`
~204-213: extend with pruned/dropped detail);
`context/ingest/stages/stage-3-work-units.ts` (`executeWorkUnit`; the per-WU
validation + `failWithReset` ~134-157 that already soft-fails a WU;
`WorkUnitOutcome` ~31-46 with `patchPath`/`patchTouchedPaths`/`actions`/
`touchedSlSources` — the cache payload). The cache lookup/replay wraps the
per-WU execution; only the agent-loop branch is skipped on a hit.
- **The gate (make it return findings)**: `context/ingest/artifact-gates.ts`
(`validateFinalIngestArtifacts` ~96; the internal per-artifact danglers from
`validateWikiSlRefs` ~39, `validateWikiRefs` ~74, `findInvalidWikiBodyRefs`;
the concatenated `throw` ~129 to replace with a structured return);
`context/ingest/stages/validate-wu-sources.ts` (`validateWuTouchedSources` ~124;
`findJoinTargetErrors` ~89 already returns missing join targets per source —
the join-edge danglers to prune); `context/sl/tools/sl-warehouse-validation.ts`
(`validateSingleSource` ~56 — the intrinsic warehouse dry-run; its failures are
the drop set, not the prune set).
- **Per-ref-type pruners (pair 1:1 with the validators)** — join: remove the
offending `joins[]` entry from the source YAML; `wiki refs`/`sl_refs`: remove
the entry from page frontmatter (`context/wiki/wiki-ref-validation.ts`
`findMissingWikiRefs`); wiki body refs: remove the broken link token
(`context/ingest/wiki-body-refs.ts` `findInvalidWikiBodyRefs`). Each pruner is
deterministic and edits the integrated worktree only.
- **Remove the LLM repair**: `context/ingest/final-gate-repair.ts`
(`repairFinalGateFailure`) and the `constrained-repair.ts` usage for
`final_artifact_gate`; delete the call site (~2595) and its tests.
- **Durability primitive to extract & share**:
`context/scan/sqlite-local-enrichment-state-store.ts` (`local_scan_enrichment_stages`,
PK `(connection_id, stage, input_hash)`, `findCompletedStage`/`saveCompletedStage`),
`context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78), and
the resume wrapper `runEnrichmentStage` (`context/scan/local-enrichment.ts`).
Generalize to a content-keyed result cache; migrate scan onto it; add the ingest
namespace. The existing ingest store
`context/ingest/sqlite-bundle-ingest-store.ts` (`SqliteBundleIngestStore`) is
where ingest-side persistence lives — the WU cache sits alongside it under
`.ktx/`.
- **Tests** — resume: run an ingest against a real git-backed project with a fake
agent runner, interrupt after K WUs, assert the re-run issues no agent loops for
the K and the same artifacts result; changed-input bust; stale-patch recompute;
failed-WU retry. Prune: a fixture where one WU fails so a sibling's join/wiki
ref dangles → assert the run commits the sibling with the ref pruned, reports the
prune, and `commitSha` is non-null; assert no cascade; assert self-heal on a
follow-up run; assert intrinsic drop. Migration: spec 19/20 scan acceptance still
green on the shared primitive. Regression: a small uninterrupted all-passing
ingest is byte-identical to today.
- After implementing, rebuild and re-link so the playground picks it up:
`pnpm run build && pnpm run link:dev`.
## Motivation (the real report, not a benchmark)
A user ingesting a fairly large dbt project (~2-day run) hit both gaps together.
First, an interruption — a VPN drop / network blip — lost all progress because
ingest cannot resume; they had to restart from scratch. Second, on a later run
that completed all task generation, a **single model** failed the final
integration gate, and because the gate is all-or-nothing the one failure
discarded an ~18h run with nothing committed. Their ask: "some form of resume or
checkpoint (or at least reusing the patches that were already generated), and a
way to skip or quarantine a single failing model instead of failing the entire
run." This spec delivers both — resume via the content-keyed WU cache, and
partial commit via deterministic dangling-edge pruning. Unlike specs 19/20 this
gap was surfaced by a real user on a real warehouse, not by the benchmark; the
fix is generic production hygiene for any large ingest.
## Implementation notes
Shipped on branch `write-feature-spec-wiki` (squash-merge target). All 12
requirements and every acceptance criterion are covered by committed code and
tests; the full `@kaelio/ktx` package suite is green.
What was built and where:
- **Shared content-keyed durability primitive**: `context/cache/content-result-cache.ts`
+ `sqlite-content-result-cache.ts` (`SqliteContentResultCache`, `local_content_results`).
Scan was migrated onto it in the same change (`context/scan/sqlite-local-enrichment-state-store.ts`
is now a thin adapter; the old `local_scan_enrichment_stages` table is dropped),
so no second copy exists (D3 / req 11).
- **Content-keyed WU cache + replay**: `context/ingest/work-unit-cache.ts`
(`computeIngestWorkUnitInputHash` over raw/dependency bytes + source identity +
CLI version + prompt fingerprint + model role; success-only `saveSuccessfulWorkUnitCache`).
Replay/recompute and stale-recompute state refresh wrap the WU loop in
`ingest-bundle.runner.ts` (D1/D2/D4 / reqs 1-4).
- **Non-fatal final gate**: `artifact-gates.ts` `validateFinalIngestArtifacts`
returns structured findings; `context/ingest/final-gate-prune.ts` deterministically
drops self-invalid sources and prunes dangling edges in a single pass, then a
confirm gate runs before squash (D5/D6 / reqs 5-8). `finalGatePrunedReferences`
/ `finalGateDroppedSources` are recorded in the report + trace and surface as a
`partial` outcome (D7 / req 10). `repairFinalGateFailure` and its tests are
deleted (req 9).
Deviations / decisions worth noting (all preserve spec intent):
- **Cache stores artifact content snapshots (payload schema v2), not just a raw
git patch.** Replay materializes the owner's artifacts against the *current*
base, so a ref pruned in one run because a sibling failed is restored for free
on a later run once the sibling exists — without re-running the owner's agent
loop (D2/D6 / req 7 self-heal). A drifted/stale snapshot degrades to recompute.
- **Final-gate prune/drop resolves sources through the canonical
`resolveSlSourceFile` resolver**, not a derived `semantic-layer/<conn>/<name>.yaml`
path, so it works for uppercase / hash-derived source filenames (not only
lowercase demo names).
- **`executeWorkUnit` defers pruneable cross-artifact findings** (missing join
target / wiki ref / sl_ref) to the final gate instead of soft-failing the WU;
only intrinsic `source_validation` failures remain fatal at the WU level. This
is what lets a sibling-failed WU's owner survive to be pruned rather than be
excluded upstream (reqs 5-7, "no cascade").
- The raw report record keeps `status: 'completed'`; partial completion is derived
by `ingestReportOutcome` from the populated prune/drop fields.

@ -1,66 +0,0 @@
# Multi-connection routing guidance in the ktx-analytics skill
## Problem
The agent-facing `ktx-analytics` skill (installed into agent environments via
the ktx skills/install mechanism, see `.ktx/agents/install-manifest.json` in
projects) describes the query workflow — wiki_search → sl_read_source →
sl_query / sql_execution — but assumes the connection is obvious. In a
multi-connection project nothing tells the agent to *first decide which
connection the question is about*, and several tools silently require it:
- `sql_execution`, `sl_read_source`, `entity_details`: `connectionId`
**required**;
- `sl_query`, `discover_data`, `dictionary_search`: optional, but
auto-inference only works with exactly one connection
(`local-query.ts` `resolveLocalConnectionId` ~29-38 — throws with zero or
multiple connections).
An agent that skips routing either errors out or, worse, queries the wrong
database when names overlap.
## Generic use case
Any ktx project with more than one connection — the common shape for a data
org (warehouse + product DB + events DB). Routing is the first step of every
question, and the skill should encode it so individual agents don't have to
rediscover it.
## Requirements
1. **Add an explicit routing step (step 0) to the skill's workflow:**
- Call `connection_list` to see what exists.
- Match the question's domain to a connection using connection ids/names,
`discover_data` hits, and wiki context — not guesswork.
- If genuinely ambiguous after discovery, ask the user rather than pick.
2. **Thread the resolved `connectionId` everywhere:** all subsequent
`sl_query`, `sql_execution`, `sl_read_source`, `entity_details`,
`dictionary_search`, `discover_data` calls, and `wiki_search` once spec 01
lands (search scoped to the resolved connection plus unscoped pages).
3. **Single-connection projects stay frictionless:** the skill should say
routing is trivial when `connection_list` returns one entry — don't add a
mandatory ceremony step for the common simple case.
4. **Capture routing knowledge:** when the agent learns a non-obvious
question-domain → connection mapping, the skill should encourage
`memory_ingest` so the mapping becomes wiki knowledge for next time.
This is a docs/prompt change in the skill content (plus any skill-install
plumbing if the skill is versioned); no engine changes required.
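
The routing step in requirements 1 and 3 can be pictured as a small decision function. This is a sketch under stated assumptions: the `Connection` shape, the per-connection hit counts (standing in for `discover_data` and wiki evidence), and `routeQuestion` itself are hypothetical, since the spec asks for skill text rather than engine code.

```typescript
// Hypothetical shapes; the real `connection_list` result may differ.
interface Connection {
  id: string;
  name: string;
}

type Routing =
  | { kind: "resolved"; connectionId: string }
  | { kind: "ask_user"; candidates: string[] };

// `hits` maps a connection id to how much discovery evidence points at it.
function routeQuestion(connections: Connection[], hits: Map<string, number>): Routing {
  if (connections.length === 0) throw new Error("no connections configured");

  // Single-connection projects stay frictionless: routing is trivial.
  const only = connections.length === 1 ? connections[0] : undefined;
  if (only !== undefined) return { kind: "resolved", connectionId: only.id };

  const scored = connections.map((c) => ({ id: c.id, score: hits.get(c.id) ?? 0 }));
  const top = Math.max(...scored.map((s) => s.score));
  const best = scored.filter((s) => s.score === top).map((s) => s.id);

  // Resolve only on a unique, evidence-backed winner; otherwise ask, never guess.
  const winner = best.length === 1 && top > 0 ? best[0] : undefined;
  return winner !== undefined
    ? { kind: "resolved", connectionId: winner }
    : { kind: "ask_user", candidates: best };
}
```

The resolved `connectionId` is then what requirement 2 asks to be threaded through every later tool call.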
## Acceptance criteria
- In a fixture project with ≥2 connections, an agent following the skill
resolves the correct connection before its first data query, and no tool
call fails with "connectionId is required".
- In a single-connection project the skill-driven flow is unchanged (no
extra mandatory steps).
- Skill text nowhere assumes a default/implicit connection.
## Benchmark context (motivation only)
Spider 2.0-Lite local subset = 30 SQLite connections in one project; every
one of the 135 questions targets exactly one of them. Connection ids are set
to the benchmark's database names, so with this skill guidance routing is
mechanical (`connection_list` + name match) and needs no benchmark-specific
instructions — which is the point: the harness gives the agent only the
question text.

@ -1,51 +0,0 @@
# Offline schema-documentation ingest adapter
> **Priority: LOW / backlog.** Explicitly **not** needed for the Spider
> 2.0-Lite benchmark — we verified the benchmark's offline schema files
> (DDL dumps + sample-row JSONs) are a strict subset of what the live SQLite
> scan already captures (DDL, types, PKs, sample values, cardinality
> profiling). Implement specs 01-03 first; pick this up only if a real
> use case shows up.
## Problem
The ingest pipeline's schema knowledge comes from live database scans
(`live-database` adapter) or BI-tool adapters (metabase, looker, dbt…).
There is no adapter for **offline schema documentation**: files describing
tables/columns that exist outside the database — column-description
spreadsheets, data dictionaries, DDL exports with comments, hand-maintained
schema docs.
## Generic use case
Teams whose richest schema documentation lives outside `information_schema`:
a wiki export of column meanings, a governance tool's CSV data dictionary,
DDL files with COMMENT clauses the production scan can't see, or
environments where ktx has no live access at all and must build the semantic
layer from documentation alone.
## Requirements (sketch — refine when picked up)
1. A new ingest adapter (peer of `metabase`/`dbt` in
`context/ingest/adapters/`) consuming a configured local path of schema
docs per connection.
2. Input formats to start: DDL files (`.sql`/`.csv` of CREATE statements)
and tabular column dictionaries (CSV/JSON: table, column, description,
…). Extensible to other formats.
3. Output: **enrichment, not duplication** — merge descriptions/metadata
into the manifest-backed semantic-layer sources and dictionary for the
matching connection. Where a live scan exists, offline docs fill gaps
(descriptions, enum meanings, deprecation notes) and flag drift
(documented column missing from live schema and vice versa) rather than
creating parallel wiki pages that duplicate schema info.
4. Works without live database access (documentation-only bootstrap of a
connection's semantic layer), clearly marked as unverified-against-live.
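
Requirement 3's "enrichment, not duplication" rule could look roughly like the following. This is a sketch only: `LiveColumn`, `DocColumn`, and `mergeOfflineDocs` are hypothetical names, and real sources carry far more metadata than a description.

```typescript
// Hypothetical column shapes for illustration.
interface LiveColumn {
  table: string;
  column: string;
  description?: string;
}

interface DocColumn {
  table: string;
  column: string;
  description: string;
}

interface MergeResult {
  columns: LiveColumn[];
  documentedOnly: string[]; // documented, but missing from the live schema
  liveOnly: string[]; // in the live schema, but undocumented
}

const columnKey = (c: { table: string; column: string }): string => `${c.table}.${c.column}`;

function mergeOfflineDocs(live: LiveColumn[], docs: DocColumn[]): MergeResult {
  const docByKey = new Map<string, DocColumn>();
  for (const d of docs) docByKey.set(columnKey(d), d);
  const liveKeys = new Set(live.map((c) => columnKey(c)));

  // Offline docs only fill gaps; an existing live description is left untouched.
  const columns = live.map((c) => {
    const doc = docByKey.get(columnKey(c));
    return c.description === undefined && doc !== undefined
      ? { ...c, description: doc.description }
      : c;
  });

  return {
    columns,
    documentedOnly: docs.map((d) => columnKey(d)).filter((k) => !liveKeys.has(k)),
    liveOnly: live.map((c) => columnKey(c)).filter((k) => !docByKey.has(k)),
  };
}
```

The two drift lists are reported rather than acted on, matching the requirement to flag drift instead of creating parallel pages.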
## Acceptance criteria (sketch)
- Given a connection with a live scan plus an offline column dictionary,
semantic-layer sources carry the documented descriptions, and drift
between doc and live schema is reported.
- Given a connection with docs only (no live access), `sl list`/`sl read`
expose manifest sources built from the docs.
- No wiki pages are created that merely restate table/column lists.

@ -1,59 +0,0 @@
# Composite-key (multi-column) join detection
> Priority: MEDIUM. Found empirically during the first Spider2-lite sqlite
> smoke test (2026-06-13): relationship detection emitted **zero joins** for a
> database whose fact tables are linked only by composite keys. Agents still
> answered correctly by inferring the join from shared `grain`, so this didn't
> cost benchmark points — but it forces inference that explicit joins would
> remove, and the gap is generic.
## Problem
Relationship detection appears to emit only single-column joins. For the IPL
sqlite database, every table came back with `joins=0`, even though its fact
tables are connected by a 4-column composite key
(`match_id, over_id, ball_id, innings_no`) shared across `ball_by_ball`,
`batsman_scored`, `extra_runs`, and `wicket_taken`. The semantic layer did
correctly record that shared key as each table's `grain`, which is why agents
could recover the relationship — but no `joins:` entries were produced for the
fact-to-fact links.
## Generic use case
Event/fact tables keyed by composite business keys are common: ledger lines
(`account_id, period, line_no`), telemetry (`device_id, ts, metric`), sports
ball-by-ball, EAV/log schemas. Whenever there are no single-column FKs but a
multi-column key recurs across tables, ktx should detect and document the join
so agents (and `sl_query`) don't have to infer it.
## Requirements
1. Relationship detection considers **multi-column** join candidates, not just
single-column ones. A strong signal already exists in ktx: when two tables
share an identical (or subset/superset) declared `grain`, that grain is a
prime composite-join candidate.
2. Emitted joins carry the full composite condition, e.g.
`on: a.match_id = b.match_id AND a.over_id = b.over_id AND a.ball_id = b.ball_id AND a.innings_no = b.innings_no`,
with a sensible `relationship` cardinality.
3. The existing validation/threshold machinery
(`scan.relationships.acceptThreshold` etc.) applies to composite candidates
too; profile-based validation should check join selectivity on the full key.
4. No regression for single-column joins; don't explode combinatorially —
bound candidate generation (e.g. only consider shared-grain keys and
declared (not inferred) PK overlaps, cap column count).
5. `sl_query` can compile a join across a composite-key relationship.
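
Requirements 1, 2, and 4 can be illustrated with a shared-grain candidate check. This is a sketch under assumptions: `TableGrain`, `compositeJoinOn`, and the column cap of 6 are hypothetical, grains are assumed to contain no duplicate columns, and profile-based validation (requirement 3) is left out.

```typescript
// Hypothetical shape: a table and its declared grain columns.
interface TableGrain {
  table: string;
  grain: string[];
}

const MAX_KEY_COLUMNS = 6; // assumed cap that bounds candidate generation

// Returns the full composite ON condition, or null when the pair is not a candidate.
function compositeJoinOn(
  left: TableGrain,
  right: TableGrain,
  maxColumns: number = MAX_KEY_COLUMNS,
): string | null {
  const rightColumns = new Set(right.grain);
  const shared = left.grain.filter((c) => rightColumns.has(c));

  // Identical or subset/superset grains only: the shared key must cover one side fully.
  const coversOneSide =
    shared.length === left.grain.length || shared.length === right.grain.length;

  // Single-column keys belong to the existing single-column detection path.
  if (shared.length < 2 || shared.length > maxColumns || !coversOneSide) return null;

  return shared.map((c) => `${left.table}.${c} = ${right.table}.${c}`).join(" AND ");
}
```

Restricting candidates to grains that fully cover one side is one way to avoid the combinatorial blow-up the requirement warns about; partially overlapping grains are rejected outright.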
## Acceptance criteria
- For a fixture with two tables sharing a 3- or 4-column grain and no
single-column FK, ingest emits a composite join between them with the full
multi-column `on` condition.
- `sl read <source>` shows the composite join; `sl_query` can traverse it.
- Single-column join detection is unchanged on existing fixtures.
## Benchmark context (motivation only)
IPL (and similar ball-by-ball/event schemas in the Spider2-lite local set)
have no single-column FKs; their joins are entirely composite. Explicit
composite joins would let the agent rely on documented relationships instead
of inferring them from grain.

@ -1,89 +0,0 @@
# Canonical / authoritative-source measures in the semantic layer
## Problem
Many schemas contain an **authoritative table** that already encodes a metric's
business rules — an official standings/leaderboard table, a general-ledger or
period-end balance table, a materialized summary/snapshot — alongside the **raw
transactional** rows the metric *could* be re-derived from. Re-deriving the metric
from the raw rows frequently diverges from the canonical definition, because the
authoritative table bakes in rules the raw data doesn't expose (drop-scores,
penalties, adjustments, reconciliations, as-of snapshots).
Today ktx's semantic layer doesn't distinguish "authoritative summary" tables from
raw fact tables, so the analytics skill has no signal that one source is canonical
for a metric — and the agent often re-derives from raw rows and gets a defensible-
but-different number.
## Generic use case (independent of any benchmark)
- "Championship points per competitor this season" — a sports schema may hold both
raw per-event results AND an official standings table that applies drop-scores
and penalties. The standings table is the canonical source; summing raw results
is wrong.
- "Account balance as of month end" — prefer a ledger/balance-snapshot table over
re-summing every transaction (which may miss adjustments).
- "Monthly recognized revenue" — prefer a finance summary table over re-deriving
from line items.
In each case a real analyst should be steered to the authoritative source.
## Requirements
1. **Detect candidate authoritative tables during ingest.** Heuristics only —
e.g. tables whose name/role suggests a summary (`*standings*`, `*balance*`,
`*summary*`, `*snapshot*`, `*ledger*`), tables that are a coarser-grained
aggregation of another table, or tables documented as authoritative in provided
docs/wiki. Surface them as such in the semantic layer.
2. **Represent the metric as an SL measure backed by the authoritative table.**
Where a canonical source exists, define the measure over it so a query for that
metric resolves to the authoritative source by default. (The analytics skill
already prefers SL measures over raw SQL — spec 07/skill rule — so this plugs
into existing behavior.)
3. **Keep raw re-derivation available** as a non-default alternative; the measure
documents which source it uses and why, so the choice is transparent and
overridable.
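
The name and documentation signals in requirement 1 might be sketched as below. Everything here is illustrative: `TableInfo`, `flagAuthoritativeCandidates`, and the synthetic table names are hypothetical, the aggregation-grain signal is omitted, and a flagged table is only a candidate, not a decision.

```typescript
// Hypothetical input shape; the names in any example must stay synthetic.
interface TableInfo {
  name: string;
  documentedAuthoritative?: boolean; // set only from provided docs/wiki
}

interface Candidate {
  name: string;
  reasons: string[]; // recorded so the choice stays transparent and overridable
}

// Name fragments taken from the requirement text; heuristic only.
const SUMMARY_NAME = /(standings|balance|summary|snapshot|ledger)/i;

function flagAuthoritativeCandidates(tables: TableInfo[]): Candidate[] {
  const candidates: Candidate[] = [];
  for (const t of tables) {
    const reasons: string[] = [];
    const match = SUMMARY_NAME.exec(t.name);
    if (match !== null) reasons.push(`name suggests a summary table (${match[0]})`);
    if (t.documentedAuthoritative === true) reasons.push("documented as authoritative");
    if (reasons.length > 0) candidates.push({ name: t.name, reasons });
  }
  return candidates;
}
```

Both signals come from schema or provided documentation, which keeps the sketch inside the fairness boundary described in the next section.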
## Fairness boundary (HARD — this spec is fairness-sensitive)
The choice of authoritative source MUST be driven by **schema/structure or provided
documentation** — the table exists, is structured as a summary, or is documented as
authoritative. It must **NEVER** be driven by observing which interpretation matches
a benchmark gold answer. Concretely:
- ✅ Fair: "a table named/structured as official standings exists and aggregates the
raw results → treat it as the canonical points source."
- ❌ Forbidden: "for question X, use table T because that's what reproduces the gold
result." That is per-instance gold-tuning (cheating) and must not appear in ktx,
the ingest heuristics, or any mapping.
If a metric is genuinely underspecified and only the gold answer disambiguates the
intended source, it is **not fairly fixable** — leave it. Whether this feature helps
any specific benchmark instance is therefore *conditional* on a real schema/doc basis
existing; do not manufacture one.
## Leak-safety (hard constraint)
No benchmark table names, queries, gold values, or instance-specific mappings
anywhere in the spec, the heuristics, or tests. Examples must be synthetic/generic.
## Acceptance criteria
- Ingest can flag candidate authoritative/summary tables via generic heuristics
(name/role/aggregation/doc signals), with no benchmark-specific rules.
- The semantic layer can express a measure as backed by a designated authoritative
source; the skill resolves the metric to it by default; raw re-derivation remains
available and the choice is documented.
- Tests use synthetic schemas only; no gold-derived mappings exist anywhere.
## Benchmark context (motivation only)
Some SQLite-subset metric questions are underspecified between a raw-derivation and
an authoritative-table interpretation (e.g. season points from raw results vs an
official standings table). This is the roadmap's "canonical semantic-layer measures
from schema + provided docs" item. It is fair ONLY where schema/docs support one
source; the gold-only cases are explicitly out of scope (fixing them would require
tuning to gold). Larger than the spec 09-12 skill-content tweaks: this touches
ingest + the semantic-layer model.

@ -1,57 +0,0 @@
# 17 — Lifecycle-event metrics in the semantic layer
**Status:** draft (intake). Requirement-level; the implementer refines into `specs/17-*.md`.
## Problem / requirement
Many entities carry **several lifecycle timestamps** for the same record — an order has
`placed/purchased`, `approved`, `shipped/carrier-handoff`, `delivered`, and `estimated-delivery`
times; a ticket has `opened`, `assigned`, `resolved`, `closed`; a payment has `initiated`,
`authorized`, `settled`. When an analyst asks for a count/volume/rate of records **in a named
completed state, by period** ("delivered orders by month", "resolved tickets per week", "settled
payments by day"), the correct time anchor is the timestamp of *that named event*, not the
record-creation timestamp.
Today ktx ingests these timestamps as **peer date dimensions** with good column descriptions, but it
does **not model the lifecycle event itself** — so nothing in the semantic layer tells a solver (or a
human) that "delivered orders over time" should be anchored to the delivery timestamp. The choice is
left to per-query reasoning, which is exactly where it goes wrong. (A companion analytics-skill rule
now nudges the *solver* — ktx commit `226341cf` — but the durable, reusable home for this is the
**model**, so any consumer of the semantic layer gets it for free.)
**Requirement:** during enrichment/ingestion, when a source has a state/status column plus one or more
lifecycle timestamps whose names/descriptions map to that state's values, infer **lifecycle-event
metrics** — e.g. a `delivered_orders` metric defined as `COUNT(*)` filtered to the delivered state with
its **default time dimension** set to the matching event timestamp (`order_delivered_customer_date`),
distinct from the creation-anchored `orders` metric. Keep the inference conservative and
source-traceable (column names + enriched descriptions only); never invent a state/timestamp pairing
that the schema/descriptions don't independently support.
## Sketch (implementer to refine)
- Detect (state column, lifecycle-timestamp) pairs from column names + enrichment descriptions
(e.g. status value `delivered` → `*_delivered_*_date`; `resolved` → `resolved_at`).
- Emit a metric per detected completed state: filter = the state predicate, grain = record,
`defaultTimeDimension` = the matching event timestamp.
- Surface these via `discover_data` / `entity_details` so "delivered orders over time" retrieves the
delivery-anchored metric rather than a bare row count over the creation date.
- Gate behind the existing `enrichment.mode: llm` path; respect the conservative-inference bar
(precision over recall — a wrong pairing is worse than none).
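
The detect-and-emit steps above can be sketched with a deliberately conservative name match. This is an assumption-laden illustration: `LifecycleMetric` and `inferLifecycleMetrics` are hypothetical, matching uses column-name tokens only (no enrichment descriptions), and a state is paired only when exactly one timestamp column matches it.

```typescript
// Hypothetical metric shape; `defaultTimeDimension` follows the sketch's wording.
interface LifecycleMetric {
  name: string;
  filter: string;
  defaultTimeDimension: string;
}

function inferLifecycleMetrics(
  entity: string,
  statusColumn: string,
  statusValues: string[],
  timestampColumns: string[],
): LifecycleMetric[] {
  const metrics: LifecycleMetric[] = [];
  for (const state of statusValues) {
    const wanted = state.toLowerCase();
    // A column matches when one of its underscore-separated tokens equals the state.
    const matches = timestampColumns.filter((col) =>
      col.toLowerCase().split("_").some((token) => token === wanted),
    );
    // Precision over recall: zero or several matches means no pairing at all.
    const only = matches.length === 1 ? matches[0] : undefined;
    if (only === undefined) continue;
    metrics.push({
      name: `${wanted}_${entity}`,
      filter: `${statusColumn} = '${state}'`,
      defaultTimeDimension: only,
    });
  }
  return metrics;
}
```

With a status column holding `resolved` and a single `resolved_at` timestamp, the sketch yields a `resolved_tickets` metric anchored to `resolved_at`; a state with two plausible timestamps yields nothing, leaving the choice to a richer signal such as descriptions.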
## Generic use case (independent of the benchmark)
Any operational/transactional schema (e-commerce orders, support tickets, payments, claims, shipments)
has this multi-timestamp lifecycle shape. An analyst asking "how many X were <completed-state> last
month" almost always means *entered that state* last month. Encoding the event→timestamp mapping in the
model makes every downstream question (BI tool, ad-hoc SQL, an LLM agent) pick the right anchor without
re-deriving it, and prevents the silent "grouped by when they started" error.
## Benchmark context (motivation only — not a benchmark-specific rule)
Surfaced by the `spider2-autofix` loop, round r1: Spider 2.0-Lite `Brazilian_E_Commerce` cases local028
("delivered orders for each month") and local031 ("highest monthly delivered orders volume") both failed
because the solver bucketed delivered orders by `order_purchase_timestamp` instead of
`order_delivered_customer_date`. The trace showed the solver had both columns and even compared both
date bases for local031 before choosing purchase. A skill-text rule flipped both cases this round; this
spec is the **model-layer** form of the same fix, which would make the right anchor the default for any
solver and any lifecycle schema.