chore: remove private benchmark specs

2026-07-01 08:59:39 +02:00 · 2026-06-30 11:13:44 +02:00 · 1c5d16abc3
commit 1c5d16abc3
parent 67a69dba8b
40 changed files with 0 additions and 8716 deletions
--- a/spider2-specs/README.md
+++ b/spider2-specs/README.md
@@ -1,62 +0,0 @@
-# spider2-specs — feature specs driven by the Spider 2.0-Lite benchmark
-
-This directory is the handoff point between two agents working on different
-sides of the same goal: making Claude Code + ktx score well on the Spider
-2.0-Lite benchmark **without benchmark-specific instructions** — the agent
-should succeed using only what ktx provides (skills, semantic layer, wiki).
-
-## Mechanics
-
-Three directories form a pipeline. A feature flows `todo/` → `specs/` →
-(implemented), and only its intake draft moves to `done/`:
-
- **`todo/`** — intake drafts. A **playground agent** (works in
-  `/Users/andrey/projects/kaelio/spider-clean-submission/playground`, runs the
-  benchmark, identifies ktx capability gaps) writes a draft spec here when it
-  finds a gap.
- **`specs/`** — refined specs. A **refinement pass** (brainstorming) takes a
-  `todo/` draft and produces a proper, implementation-ready spec at
-  `specs/<same-filename>.md`: sharpened requirements, resolved ambiguities,
-  acceptance criteria, and orientation hints. The refined spec is the **durable
-  artifact** the implementer builds from — it stays in `specs/` permanently and
-  never moves.
- **`done/`** — intake drafts whose feature has shipped (see below).
-
-The **ktx worktree agent** (started from a ktx repo worktree, e.g.
-`/Users/andrey/conductor/workspaces/ktx/tallinn-v2`) implements from the
-refined spec in `specs/` (falling back to the `todo/` draft only if no refined
-spec exists yet). When the feature is implemented it:
-
-1. appends a short **"Implementation notes"** section to the refined spec in
-   `specs/` (what was built, where, any deviations); and
-2. **moves the original intake draft from `todo/` to `done/`.**
-
-Location is status: `todo/` = draft awaiting implementation, `done/` = draft
-whose feature shipped, `specs/` = refined specs (permanent home, do not move).
-A draft and its refined spec share the same filename so they correspond
-(`todo/01-foo.md` ↔ `specs/01-foo.md` ↔ `done/01-foo.md`). No other tracking.
-
-## Rules for specs
-
-1. **Generic, not benchmark-overfit.** ktx is a general-purpose product; the
-   benchmark only surfaces the need. Every spec must state a real-world use
-   case independent of Spider 2.0-Lite. If a requirement only makes sense for
-   the benchmark, it doesn't belong in ktx.
-2. Specs are **requirement-level**, not implementation plans. Code pointers in
-   specs are orientation hints from exploration (line numbers may have
-   drifted); the implementer owns the design.
-3. One spec per file, kebab-case, numeric prefix = suggested priority order.
-   A refined spec in `specs/` keeps the same filename as its `todo/` draft.
-
-## For the implementer
-
- After implementing, rebuild and re-link the dev binary so the playground
-  picks it up: `pnpm run build && pnpm run link:dev` (provides `ktx-dev`).
- Add/extend tests in the ktx test suites; specs list acceptance criteria to
-  cover.
- Build from the refined spec in `specs/`. On completion, append
-  "Implementation notes" to that spec (it stays in `specs/`) and move the
-  intake draft from `todo/` to `done/`.
- If a spec turns out to be wrong or already satisfied, don't silently drop
-  it — record why in the refined spec's notes and move the draft to `done/`
-  explaining why no change was needed.
--- a/spider2-specs/done/.gitkeep
+++ b/spider2-specs/done/.gitkeep
--- a/spider2-specs/done/01-connection-scoped-wiki.md
+++ b/spider2-specs/done/01-connection-scoped-wiki.md
@@ -1,74 +0,0 @@
-# Connection-scoped wiki pages
-
-## Problem
-
-Wiki pages have only two scopes today: `GLOBAL` and `USER`
-(`packages/cli/src/context/wiki/types.ts`, frontmatter schema ~lines 14-29).
-There is no way to associate a page with a connection. In a project with many
-connections, all pages share one search index, so `wiki_search` for a generic
-term ("orders", "revenue", "average order value") surfaces pages about the
-wrong database. Concept names collide across databases constantly in
-real-world multi-connection projects (several databases each with `orders`,
-`customers`, etc.).
-
-Today, when `memory_ingest` is called with a `connectionId`, that id is only
-used to scope which semantic-layer sources the triage agent can see
-(`memory-agent.service.ts` ~46-72, ~107-109); it is **not** persisted on the
-resulting wiki page in any form.
-
-## Generic use case
-
-Any org with multiple databases/warehouses in one ktx project: org-wide
-definitions ("fiscal year starts in February") should be visible everywhere,
-while database-specific conventions ("in the events DB, `user_id` is the
-anonymous device id, not the account id") should not pollute searches about
-other databases.
-
-## Requirements
-
-1. **Frontmatter field.** Add an optional `connections:` field to wiki page
-   frontmatter — a list of connection ids (accept a single string too,
-   normalize to list).
-   - **Absent or empty ⇒ unscoped: the page applies to all connections.**
-     This is exactly today's behavior, so every existing page is unaffected
-     (backward compatible by construction).
-2. **Search filtering.** `wiki_search` (MCP tool, `context-tools.ts` ~46-64)
-   and `ktx wiki search` / `ktx wiki list` (CLI,
-   `knowledge-commands.ts`) accept an optional `connectionId`:
-   - With `connectionId: X` ⇒ return pages scoped to X **∪** unscoped pages.
-   - Without ⇒ current behavior, all pages.
-   - The filter must apply to **all three search lanes** (lexical FTS5,
-     semantic/embedding, token fallback) in
-     `local-knowledge.ts` / `sqlite-knowledge-index.ts` — not as a post-filter
-     that eats into the result limit unevenly.
-3. **Index.** Persist the scoping in the `.ktx/db.sqlite` knowledge index
-   (the index is already re-synced from files on every search,
-   `local-knowledge.ts` ~286-310, so a schema addition + sync is sufficient).
-4. **Write path.** The memory agent's wiki-write tool accepts the connections
-   field; when `memory_ingest` is invoked with a `connectionId`, the agent
-   should default new database-specific pages to that connection, while still
-   being allowed to write unscoped pages for clearly org-wide content (prompt
-   guidance, not a hard rule).
-5. **`wiki_read` and refs are unchanged** — pages remain addressable by key
-   regardless of scoping; `connections` is a search/relevance concern only.
-6. **Validation.** Warn (don't fail) when a page references a connection id
-   not present in `ktx.yaml` — config and content can evolve independently.
-
-## Acceptance criteria
-
- A page with `connections: [db_a]` is returned by
-  `wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but
-  **not** by `wiki_search(query, connectionId: "db_b")`.
- A page with no `connections` field is returned in all three cases above.
- Existing projects with no scoped pages behave identically before/after.
- Filtering works in each lane independently (test with embeddings disabled
-  to exercise lexical/token lanes alone).
- `memory_ingest(content, connectionId)` produces a page scoped to that
-  connection for database-specific content.
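A minimal sketch of the scoping rule in requirements 1 and 2, not taken from the removed spec. `WikiPage`, `normalize_connections`, and `visible_pages` are hypothetical names, and the sketch shows only the predicate. The spec requires the filter to be applied inside each search lane rather than as a post-filter, which this sketch does not model.

```python
from dataclasses import dataclass, field


def normalize_connections(value):
    """Accept None, a single id, or a list of ids; always return a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


@dataclass
class WikiPage:  # hypothetical stand-in for an indexed wiki page
    key: str
    connections: list = field(default_factory=list)


def visible_pages(pages, connection_id=None):
    """Pages scoped to connection_id plus unscoped pages; all pages if no id."""
    if connection_id is None:
        return list(pages)
    return [p for p in pages if not p.connections or connection_id in p.connections]


pages = [
    WikiPage("rfm-definition", normalize_connections("db_a")),  # scoped page
    WikiPage("fiscal-year", normalize_connections(None)),       # unscoped page
]
```

With these two pages, a search for `db_a` sees both, a search for `db_b` sees only the unscoped page, and an unfiltered search sees both, which mirrors the first two acceptance criteria.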
-
-## Benchmark context (motivation only)
-
-Spider 2.0-Lite local subset = one project with 30 SQLite connections whose
-schemas share table/concept names (Northwind, sakila, two e-commerce DBs…).
-External-knowledge docs (RFM definition, F1 overtake rules) are each relevant
-to exactly one database and must not surface for the other 29.
--- a/spider2-specs/done/02-verbatim-ingest-mode.md
+++ b/spider2-specs/done/02-verbatim-ingest-mode.md
@@ -1,71 +0,0 @@
-# Verbatim ingest mode for authoritative documents
-
-## Problem
-
-`ktx ingest --text/--file` routes content through the memory agent
-(`text-ingest.ts` ~246-357 → `memory-agent.service.ts`), an LLM triage loop
-(30-step budget for `external_ingest`, content clipped at ~48k chars,
-`memory-agent.service.ts` ~165) that may rewrite, condense, or split the
-content before writing wiki pages.
-
-For *authoritative* documents — formula definitions, specs, runbooks,
-compliance text — paraphrasing is a bug, not a feature:
-
- exact thresholds, constants, and rule wording must survive byte-for-byte;
- lexical (BM25) search works best when the stored text matches the phrasing
-  users/agents will query with;
- ingestion should be deterministic and reproducible — same input file, same
-  resulting page.
-
-## Generic use case
-
-Any team ingesting documents that are already the source of truth: metric
-definition sheets, SLA documents, calculation methodology docs, regulatory
-text. The user wants ktx to *index and surface* the document, not to
-re-author it.
-
-## Requirements
-
-1. **Flag.** `ktx ingest --file <path> --verbatim` (apply to `--text` too).
-   Composes with the existing optional `--connection <id>` so the resulting
-   page can be connection-scoped (see spec 01).
-2. **Body preservation is enforced by code, not by prompt.** The stored page
-   body must be the input content byte-for-byte. The LLM is used **only** to
-   generate metadata: `summary`, `tags`, `sl_refs`, suggested page key/slug
-   (and `connections` default from the flag). Implementation freedom: a
-   single constrained LLM call is fine — the full memory-agent loop is not
-   required for this mode.
-3. **No clipping of the stored body.** The ~48k clip may apply to what is
-   *sent to the LLM* for metadata generation, never to what is *written* to
-   the wiki page.
-4. **Existing frontmatter.** If the input file already has YAML frontmatter,
-   preserve user-provided fields and only fill gaps (don't overwrite an
-   explicit `summary` with a generated one).
-5. **Key collisions.** Deterministic, non-destructive behavior: error or
-   suffix — never silently overwrite an existing page.
-6. **Degraded mode.** With `llm.provider.backend: none`, `--verbatim` should
-   still work, deriving `summary` from the first heading/sentence and leaving
-   optional metadata empty. (Regular agent ingest can't do this; verbatim
-   mode can and should.)
-
-## Acceptance criteria
-
- Ingesting a file with `--verbatim` produces a wiki page whose body is
-  byte-identical to the input (assert with a hash in tests).
- Running the same ingest twice is idempotent or fails loudly on the second
-  run (per requirement 5) — no duplicated/divergent pages.
- A >48k-char file is stored in full.
- `--verbatim --connection X` yields a page scoped to X (depends on spec 01;
-  if 01 isn't implemented yet, the flag composition can land later).
- Generated metadata makes the page findable: `wiki_search` for a phrase
-  from the document body returns it (lexical lane), and for a paraphrase of
-  its topic returns it when embeddings are enabled (semantic lane).
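A rough sketch of requirements 2, 4, and 6, assuming any frontmatter has already been parsed into a dict. The helper names and the sample document are invented for illustration and are not ktx APIs. The body is passed through untouched, user-provided fields win over generated ones, and the summary falls back to the first heading when no metadata generator is available.

```python
import hashlib


def derive_summary(body: str) -> str:
    """Degraded-mode summary: first non-empty line, heading markers stripped."""
    for line in body.splitlines():
        text = line.strip().lstrip("#").strip()
        if text:
            return text
    return ""


def build_verbatim_page(body: str, user_meta: dict, generated_meta: dict) -> dict:
    """Merge metadata without touching the body; user fields override generated."""
    meta = dict(generated_meta)
    meta.update({k: v for k, v in user_meta.items() if v not in (None, "")})
    meta.setdefault("summary", derive_summary(body))
    return {"meta": meta, "body": body}


def sha256(text: str) -> str:
    """Hash used to assert that the stored body equals the input."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Invented sample document, not from any benchmark.
source = "# Refund SLA\n\nRefunds are issued within 14 days.\n"
page = build_verbatim_page(
    source,
    {"summary": "Hand-written summary"},           # user frontmatter
    {"summary": "LLM summary", "tags": ["sla"]},   # generated metadata
)
```

Comparing `sha256(page["body"])` with `sha256(source)` is one way to express the byte-identity check from the first acceptance criterion.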
-
-## Benchmark context (motivation only)
-
-Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket
-definitions, haversine formula, F1 overtake rules…). Gold SQL was authored
-against their exact text; an LLM paraphrase that drops a bucket boundary
-loses a question. We currently work around this by hand-writing frontmatter
-and copying files into `wiki/global/` — verbatim mode makes that a supported
-ktx workflow instead of a manual step.
--- a/spider2-specs/done/06-scan-tolerate-broken-objects.md
+++ b/spider2-specs/done/06-scan-tolerate-broken-objects.md
@@ -1,63 +0,0 @@
-# Schema scan must tolerate individual objects that fail introspection
-
-> Priority: MEDIUM. Found during the first full Spider2-lite sqlite ingest
-> (2026-06-13): one database (`oracle_sql`) failed to ingest **entirely**
-> because a single broken VIEW errored during introspection, leaving that
-> connection with no semantic layer at all.
-
-## Problem
-
-`ktx ingest <connection>` aborts the whole database's schema scan when one
-table/view errors during introspection/profiling. In `oracle_sql` the view
-`emp_hire_periods_with_name` is defined as
-`SELECT ehp.start_date, ehp.end_date ... FROM emp_hire_periods ehp ...` but the
-base table has no `start_date`/`end_date` columns — so any attempt to read it
-raises `no such column: ehp.start_date`. That single broken object failed the
-ingest of all ~48 healthy tables/views in the database.
-
-A second, related symptom: setting `enabled_tables: [main.customers]` to work
-around it produced a different hard failure (`Adapter "database schema" did not
-recognize fetched source output`), so the documented allowlist escape hatch did
-not provide a clean fallback either.
-
-## Generic use case
-
-Real databases routinely contain broken or inaccessible objects: views over
-dropped/renamed columns, views referencing tables the connection role can't
-read, permission-denied tables, or vendor system views that error. ktx should
-ingest everything it *can* and skip what it can't — never let one bad object
-zero out an entire connection's context. This is basic robustness for
-production warehouses, not benchmark-specific.
-
-## Requirements
-
-1. **Per-object isolation.** If introspecting/profiling one table or view
-   throws, skip that object, record a warning (object name + error), and
-   continue scanning the rest. The connection's semantic layer is built from
-   the objects that succeeded.
-2. **Surface, don't hide.** Report skipped objects in the ingest summary and in
-   `ktx status` (e.g. "oracle_sql: 1 object skipped — emp_hire_periods_with_name:
-   no such column ehp.start_date"). Honor `failureMode` for whole-connection
-   aborts, but a single bad object should not count as a connection failure.
-3. **Views vs tables.** A broken view should never block base-table ingest.
-   Consider profiling views defensively (they are read-only projections).
-4. **Allowlist fallback should work.** `enabled_tables` should reliably restrict
-   the scan to the listed objects (and the qualification format for sqlite must
-   be documented and accepted). Fix the `did not recognize fetched source
-   output` failure when the allowlist yields a small/edge-case set.
-
-## Acceptance criteria
-
- Ingesting a sqlite DB containing one broken view plus N healthy tables yields
-  a semantic layer for the N healthy tables and a warning naming the broken view
-  — exit is success (not "failed"), subject to `failureMode`.
- The skipped object is listed in the ingest summary and `ktx status`.
- `enabled_tables` restricted to a subset ingests exactly that subset without the
-  adapter-output error.
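A small sketch of per-object isolation (requirement 1) using Python's standard `sqlite3` module. `scan_objects` is a hypothetical helper, not ktx code, and the view below is a simplified stand-in for the broken one described above: its real definition is elided in the spec, and the base-table columns here are invented.

```python
import sqlite3


def scan_objects(conn):
    """Probe each table/view; return (scanned names, [(name, error)] skipped)."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type IN ('table', 'view') ORDER BY name"
        )
    ]
    scanned, skipped = [], []
    for name in names:
        try:
            # A one-row probe stands in for real introspection/profiling.
            conn.execute(f'SELECT * FROM "{name}" LIMIT 1').fetchall()
            scanned.append(name)
        except sqlite3.Error as exc:
            # Record the failure and keep going instead of aborting the scan.
            skipped.append((name, str(exc)))
    return scanned, skipped


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp_hire_periods (emp_id INTEGER, hired TEXT);
    CREATE VIEW emp_hire_periods_with_name AS
      SELECT ehp.start_date, ehp.end_date, e.name
      FROM emp_hire_periods ehp JOIN employees e ON e.id = ehp.emp_id;
    """
)
scanned, skipped = scan_objects(conn)
```

The two healthy tables end up in `scanned`, while the view lands in `skipped` together with its error text, which is the information the ingest summary would need to surface.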
-
-## Benchmark context (motivation only)
-
-`oracle_sql` (8 of the 135 sqlite questions) currently has no semantic layer
-because of its one broken view; those questions must be solved from raw
-`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning
-would restore enriched context for that database.
--- a/spider2-specs/done/07-analytics-skill-sql-craft.md
+++ b/spider2-specs/done/07-analytics-skill-sql-craft.md
@@ -1,112 +0,0 @@
-# Add universal SQL-authoring craft to the ktx-analytics skill
-
-> Priority: HIGH. The `ktx-analytics` skill currently tells the agent *which
-> ktx tools to call and in what order*, but gives almost no guidance on
-> *writing correct SQL*. In benchmark runs the agent reliably produced
-> runnable SQL (0 execution errors) yet failed on correctness — precision,
-> determinism, type mismatches, and answer completeness. These are universal
-> analytics-engineering truths that every ktx user benefits from, so they
-> belong in the shipped skill, not in any caller's prompt.
-
-## Scope guard (read first)
-
-Only **universally-true** SQL/analytics craft goes here — guidance that helps a
-real ktx user querying a **live** database. The test for inclusion: *"Would this
-advice be correct and useful for an analyst on a current, production database?"*
-
-**Dialect-specific syntax is out of scope here.** The v9 harnesses' only
-per-dialect content (Snowflake: `DB.SCHEMA.TABLE` FQTNs, double-quoted
-lowercase cols, VARIANT colon-paths; BigQuery: backtick FQTNs, `_TABLE_SUFFIX`
-for sharded tables; sqlite: `strftime`/`julianday`) is genuinely useful but
-belongs in a **dialect-aware** location (per-driver notes), not this flat
-skill. Track separately as a follow-up; the rules below must stay
-dialect-agnostic.
-
-Explicitly **do NOT** add (these are application/consumer concerns, not skill
-concerns, and some are actively wrong for live data):
- Output-format contracts ("return a bare result set with exactly these
-  columns, no prose"). The skill is for interactive analysis and already
-  favors readable tables + summaries; a caller that needs a strict result
-  shape specifies that itself.
- Anchoring relative time ("recent", "past N months") to `MAX(date)` of the
-  data. On a live database "recent" means relative to *now*; this is only true
-  for static snapshots and must not be baked into the product.
- Anything justified by a grader/scoring comparator.
-
-## File
-
-`packages/cli/src/skills/analytics/SKILL.md` (the shipped skill;
-`setup-agents.ts` installs it into agent environments — the copy under a
-project's `.claude/skills/` is regenerated from this source). Extend the
-existing `<rules>` block and step 5 ("Query") / step 6 ("Validate and
-explain"); keep the existing interactive guidance intact.
-
-## Requirements — add these as general rules (behavior only, no rationale that
-references answers/graders)
-
-**Schema discovery before writing SQL**
-1. Inspect representative sample rows of each table before composing SQL —
-   confirm date/time encoding (e.g. `YYYYMMDD` vs ISO vs epoch), null
-   prevalence in join/filter keys, and the actual set of categorical/enum
-   values. (`entity_details` + a small `sql_execution` sample.)
-2. Cast a column to its real type before comparing it in `WHERE`/`JOIN`. A
-   string column compared against a numeric literal (or vice versa) can
-   silently match nothing.
-
-**Composition discipline**
-3. Build complex queries incrementally — one CTE at a time, verifying each
-   layer's output on a small sample before stacking the next.
-4. Avoid joins that fan out row counts. Add columns only from tables already
-   required by the grain, or pre-aggregate to the target grain before joining.
-
-**Window-function correctness**
-5. Give every ranking/ordering window function a complete, deterministic
-   tie-breaker (append unique key columns), so `RANK`/`ROW_NUMBER`/`LAG`
-   results are stable rather than flickering across runs.
-6. Apply row filters **after** window functions for sequence / "first" /
-   "most recent" / "since" questions — compute over the full partition, then
-   filter.
-
-**Numeric precision**
-7. Compute at full precision; round only in the final projection, never inside
-   intermediate CTEs.
-8. Be explicit about truncation (`CAST AS INT` truncates; use explicit
-   rounding when rounding is intended).
-9. Distinguish "average of per-group averages" (macro: `AVG(group_metric)`)
-   from "overall/weighted average" (micro: `SUM(num)/SUM(den)`) based on the
-   question's wording.
-
-**Answer completeness / interpretation**
-10. "top / highest / most / lowest" → return only the winning row(s) (e.g.
-    `RANK() = 1` / `QUALIFY`), not the full ranked list, unless a list is asked
-    for.
-11. "for each X / per X / by X" → exactly one row per X; don't collapse to a
-    single value unless the question says "overall" or "total across X".
-12. When a question asks for inputs and a derived value ("X, Y, and their
-    ratio"), include the inputs as columns alongside the derived value.
-13. When grouping by a human-readable label (a name), also expose the entity's
-    identifier — identity, not just the label, is part of the result.
-14. When a result is unexpectedly empty, relax filters one at a time to find
-    which predicate removed the rows.
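To make rules 8 and 9 concrete, here is a small runnable sketch on an invented `orders` table. It uses Python's `sqlite3` only as a convenient engine; the SQL itself sticks to `AVG`, `SUM`, `CAST`, and `ROUND`. The table and values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (region TEXT, amount REAL);
    INSERT INTO orders VALUES ('A', 10), ('A', 20), ('A', 30), ('B', 100);
    """
)

# Rule 9, macro average: average of the per-region averages.
macro = conn.execute(
    "SELECT AVG(region_avg) FROM "
    "(SELECT AVG(amount) AS region_avg FROM orders GROUP BY region)"
).fetchone()[0]

# Rule 9, micro average: one overall average across all rows.
micro = conn.execute("SELECT SUM(amount) / COUNT(*) FROM orders").fetchone()[0]

# Rule 8: CAST truncates toward zero, ROUND rounds to nearest.
truncated, rounded = conn.execute(
    "SELECT CAST(2.7 AS INTEGER), ROUND(2.7)"
).fetchone()
```

Region A averages 20 and region B averages 100, so the macro average is 60 while the micro average over all four rows is 40. Which one is right depends on the question's wording, as rule 9 says.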
-
-## Acceptance criteria
-
- The shipped `analytics/SKILL.md` contains the rules above, phrased as general
-  truths with **no reference to any benchmark, gold answer, or scoring
-  comparator**.
- Existing interactive guidance (compact result tables, summaries,
-  clarification prompts, the tool-order workflow) is preserved — the skill must
-  still read well for an interactive human-facing analysis session.
- None of the excluded items (output-shape contract, `MAX(date)` anchoring,
-  grader-driven advice) appear.
- Skill stays within a reasonable size; group the new rules under clear
-  sub-headings so they're scannable.
-
-## Benchmark context (motivation only)
-
-On the Spider 2.0-Lite sqlite subset, the solver produced 0 execution errors
-but ~50 result mismatches; a large share traced to exactly these gaps
-(premature rounding, string-vs-number compares, non-deterministic window
-ordering, returning full lists for "top" questions, dropping inputs to derived
-values). These are generic SQL-authoring defects — fixing them in the skill
-improves ktx for everyone and, as a side effect, the benchmark.
--- a/spider2-specs/done/08-per-dialect-sql-syntax-notes.md
+++ b/spider2-specs/done/08-per-dialect-sql-syntax-notes.md
@@ -1,83 +0,0 @@
-# Per-dialect SQL syntax notes (dialect-aware, scoped to the connection)
-
-> Intake draft. Companion to `specs/07-analytics-skill-sql-craft.md`, which kept
-> the analytics SQL craft dialect-agnostic and explicitly deferred per-dialect
-> syntax here.
-
-## Problem
-
-Spec 07 deliberately keeps the analytics SQL-authoring craft
-**dialect-agnostic** — every rule must read correctly on any engine. But a lot of
-*real* correctness depends on dialect-specific syntax that spec 07 excludes and
-defers to this follow-up:
-
- **Snowflake:** `DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase identifiers,
-  VARIANT colon-paths.
- **BigQuery:** backtick FQTNs, `_TABLE_SUFFIX` for sharded tables, `QUALIFY`.
- **sqlite:** `strftime`/`julianday` for dates, no `QUALIFY`.
-
-This guidance is genuinely useful to an agent writing SQL against a live
-database, but it must **not** pollute the flat dialect-agnostic skill — an agent
-querying sqlite should never see Snowflake VARIANT syntax. It belongs in a
-**dialect-aware** location, surfaced only for the dialect the active connection
-actually uses.
-
-## Generic use case
-
-Any ktx project whose connections span more than one warehouse engine (e.g. a
-Snowflake warehouse + a BigQuery export + a local sqlite extract). When the agent
-writes SQL for a given connection, it should get that engine's syntax
-conventions — and nothing for the engines it isn't querying.
-
-## Requirements
-
-1. **Per-driver dialect notes.** Author concise, correct syntax notes per
-   supported driver: FQTN form, identifier quoting/case, date/time functions,
-   top-N / window-filtering idiom, semi-structured access. These are genuine
-   per-engine invariants, so enumerating them per driver is acceptable (unlike a
-   denylist of bad specifics).
-2. **Scope to the active dialect, derived from state.** Which notes the agent
-   sees must be selected from the connection's configured driver/dialect
-   (`ktx.yaml` connections / the connector registry), not guessed and not shown
-   all at once. The flat analytics skill stays dialect-agnostic (spec 07
-   invariant preserved).
-3. **Delivery mechanism (enabling sub-requirement).** The shipped skill is
-   installed as a **single `SKILL.md`** per target (`setup-agents.ts` /
-   `readAnalyticsSkillContent`). Surfacing per-dialect notes on demand needs one
-   of two approaches; the refinement pass should compare them before committing:
-   - **Multi-file skill delivery** — bundle `reference/<dialect>.md` files and
-     have the skill point to the one matching the connection. Requires extending
-     `setup-agents.ts` to copy a skill *directory* (Claude Code, Codex, universal
-     `.agents`) and a multi-file zip (Claude Desktop), a **flatten/concatenate
-     transform** for the single-file targets (Cursor `.mdc`, OpenCode `.md`), and
-     **per-file manifest entries** for clean uninstall. This is the
-     install-mechanism improvement spec 07's Model section flags as future work.
-   - **Dynamic MCP delivery** — an MCP surface returns the dialect hints for a
-     given `connectionId` (the MCP layer already resolves the connection's
-     dialect), so no install change is needed and Cursor/OpenCode get identical
-     behavior. May be the lower-cost, more uniform path; weigh it first.
-4. **No dialect syntax leaks into the dialect-agnostic skill.** Spec 07's
-   acceptance criterion (no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in
-   `analytics/SKILL.md`) stays green. This work adds a *separate* dialect-aware
-   channel; it does not amend the flat skill.
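One possible shape for requirement 2, sketched under assumptions: the `config` dict stands in for a parsed `ktx.yaml`, whose real schema may differ, and `DIALECT_NOTES` and `dialect_notes_for` are hypothetical names. The note text is condensed from the bullets in this spec. The point is only that the notes are looked up from the connection's configured driver, never guessed and never shown all at once.

```python
# Hypothetical registry; note text condensed from this spec's Problem section.
DIALECT_NOTES = {
    "snowflake": "DB.SCHEMA.TABLE FQTNs; double-quoted lowercase identifiers; VARIANT colon-paths.",
    "bigquery": "Backtick FQTNs; _TABLE_SUFFIX for sharded tables; QUALIFY.",
    "sqlite": "strftime/julianday for dates; no QUALIFY.",
}


def dialect_notes_for(config: dict, connection_id: str) -> str:
    """Return notes for the connection's configured driver, or '' if none exist."""
    connection = config.get("connections", {}).get(connection_id)
    if connection is None:
        raise KeyError(f"unknown connection: {connection_id}")
    return DIALECT_NOTES.get(connection.get("driver", ""), "")


config = {  # assumed shape of a parsed ktx.yaml
    "connections": {
        "warehouse": {"driver": "snowflake"},
        "local_extract": {"driver": "sqlite"},
    }
}
```

Either delivery mechanism in requirement 3 could sit on top of a lookup like this; the sketch takes no position on which one to pick.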
-
-## Acceptance criteria
-
- An agent querying a sqlite connection gets sqlite date idioms and never sees
-  Snowflake/BigQuery-only syntax; an agent querying Snowflake gets
-  FQTN/identifier/VARIANT guidance.
- The dialect shown is **derived from the connection's configured driver**, not
-  hardcoded per project and not guessed.
- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are
-  unaffected.
- Whichever delivery mechanism is chosen installs/serves correctly across **all**
-  supported agent targets, including the single-file Cursor/OpenCode shape.
-
-## Benchmark context (motivation only)
-
-The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake
-(`DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase cols, VARIANT colon-paths),
-BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite
-(`strftime`/`julianday`). That content is real and useful but engine-specific;
-spec 07 kept it out of the flat skill and deferred it here so the
-dialect-agnostic rules stay clean.
--- a/spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md
+++ b/spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md
@@ -1,150 +0,0 @@
-# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill
-
-## Problem
-
-The `ktx-analytics` skill already carries a fan-out rule (spec 07, rule 4:
-*"Avoid fan-out joins — add columns only from tables already at the target
-grain, or pre-aggregate to that grain before joining; a join that multiplies
-rows quietly inflates every downstream `SUM`/`COUNT`"*). In practice the agent
-honors it on a single join but still **silently fans out on multi-hop join
-chains**, where the inflation is one or two joins removed from the aggregate and
-therefore much harder to notice.
-
-The failure shape: a metric that lives at a *coarse* grain (e.g. one row per
-parent record) is counted/summed *after* the parent has been joined down to a
-*finer* grain (e.g. one row per child line). Every parent-level value is then
-duplicated by its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by an
-amount that depends on the data — runnable SQL, plausible-looking number,
-quietly wrong.
-
-The rule today is stated as a *prohibition* ("avoid"). It needs to be a
-*detect-and-fix habit*: a concrete multi-hop example of the trap, and an active
-verification step the agent runs while composing, not just an instruction to be
-careful.
-
-## Generic use case (independent of any benchmark)
-
-An analyst on any production warehouse asks: *"How many orders are there per
-region?"* where the path from region to the order's detail runs through several
-hops (region → store → order → order line). The honest answer counts each order
-once. If the query descends to the line-level table along the way (e.g. for a
-filter), each order is counted once **per line on the order**, inflating the
-per-region total. Attribution here is unambiguous — each order belongs to exactly
-one store and thus one region — so the *only* thing that can go wrong is the row
-multiplication, which is exactly what makes it a clean teaching case. This is one
-of the most common silently-wrong analytics mistakes on normalized schemas — it
-is not
-specific to any dataset, dialect, or benchmark.
-
-## Requirements
-
-This extends the existing `<sql_craft>` "Composition" guidance in the
-`ktx-analytics` skill (spec 07). Additive only; keep it inline, dialect-agnostic,
-and stated as a heuristic-plus-why (consistent with spec 07's style).
-
-1. **Generalize the fan-out rule to multi-hop chains.** Make explicit that the
-   danger is *cumulative*: any one-to-many hop on the path between the table that
-   owns a measure and the aggregate inflates that measure, even when the
-   offending join is several hops away from the `SUM`/`COUNT`. The fix is the
-   same as the single-hop case — **pre-aggregate the measure to its own grain in
-   a CTE, then join the already-aggregated result** — but the agent must apply it
-   per measure-owning table along the whole chain, not just at the final join.
-
-2. **Add a verification habit, not just a prohibition.** While composing, the
-   agent should confirm a join did not change the grain it intends to aggregate
-   at — e.g. check that the row count (or the count of the aggregate's key) is
-   unchanged across a join that is supposed to be one-to-one / many-to-one, and
-   pre-aggregate the finer table to that grain when it is one-to-many. This is the same
-   "build incrementally and check each layer" discipline spec 07 already endorses,
-   pointed specifically at grain preservation.
-
-   **Pre-aggregate is the general fix; `COUNT(DISTINCT)` is a count-only
-   shortcut.** Pre-aggregating the finer table to the measure's grain in a CTE and
-   then joining one-to-one is the remedy that works for every aggregate
-   (`COUNT`/`SUM`/`AVG`). `COUNT(DISTINCT <key>)` is a valid one-liner *for counts
-   only* — it must NOT be generalized to a fanned-out `SUM`/`AVG`, because two
-   rows can legitimately hold equal amounts and `DISTINCT` would wrongly collapse
-   them. State this trap explicitly; a naïve "just use `COUNT(DISTINCT)`" rule is
-   silently wrong for sums.
-
-3. **One concrete, generic multi-hop example.** Include a short worked example
-   that shows the inflation and the fix. It must use an **invented, generic
-   schema** — **no benchmark table names, no benchmark SQL, and no benchmark
-   result values** (see "Leak-safety" below — hard constraint). The example must:
-   (a) use a **plain `COUNT`** (not an average) so it isolates the fan-out lesson
-   and does not entangle the skill's separate *macro-vs-micro average* rule; and
-   (b) use a chain with **unambiguous single-owner attribution** so the only thing
-   that can go wrong is row multiplication. The intended example is the chain
-   `regions → stores → orders → order_lines` answering *"how many orders per region
-   include at least one backordered line"* — each order belongs to exactly one
-   store and thus exactly one region, so attribution is clean; the line-level
-   filter gives `order_lines` a genuine reason to be joined (so the fix is the
-   pre-aggregate remedy, not "drop the join"), and that join sits **several hops
-   below** the region-level COUNT (the multi-hop point):
-
-   ```sql
-   -- "How many orders per region include at least one backordered line?"
-   -- (order_lines is genuinely needed here — for the backordered filter — so the
-   --  fix is NOT "just drop the join".)
-   -- WRONG: the order_lines join is one row per matching line, joined several hops
-   -- BELOW the COUNT. An order with 3 backordered lines is counted 3 times, so the
-   -- per-region total is inflated by backordered-lines-per-order — silently wrong.
-   SELECT r.region_id, COUNT(*) AS n_orders
-   FROM regions r
-   JOIN stores s      ON s.region_id = r.region_id
-   JOIN orders o      ON o.store_id  = s.store_id
-   JOIN order_lines l ON l.order_id  = o.order_id AND l.is_backordered  -- one-to-many: fan-out
-   GROUP BY r.region_id;
-
-   -- RIGHT (general remedy): collapse the finer table to the measure's grain in a
-   -- CTE FIRST, then join one-to-one so nothing multiplies. This same shape works
-   -- for SUM/AVG, not just COUNT.
-   WITH qualifying_orders AS (                 -- back to ONE row per order
-     SELECT DISTINCT order_id FROM order_lines WHERE is_backordered
-   )
-   SELECT r.region_id, COUNT(*) AS n_orders
-   FROM regions r
-   JOIN stores s            ON s.region_id = r.region_id
-   JOIN orders o            ON o.store_id  = s.store_id
-   JOIN qualifying_orders q ON q.order_id  = o.order_id
-   GROUP BY r.region_id;
-
-   -- Count-only shortcut: COUNT(DISTINCT o.order_id) over the WRONG query also works
-   -- HERE. But it is counts-only — a fanned-out SUM/AVG of a per-order measure (e.g.
-   -- summing each order's shipping_fee after joining lines) must pre-aggregate;
-   -- DISTINCT would wrongly merge two orders that happen to share the same fee.
-   ```
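The count-only caveat in the closing comment of the example above can be checked on a toy database. The sketch below is an illustration only, not the skill's shipped content: it assumes an invented two-table schema (`orders` with a hypothetical `shipping_fee`, plus `order_lines`) loaded into an in-memory SQLite database, and compares a fanned-out `SUM`, a `SUM(DISTINCT ...)` patch, and the pre-aggregated form.

```python
import sqlite3

# Invented teaching schema; table and column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, shipping_fee REAL);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER, is_backordered INTEGER);
INSERT INTO orders VALUES (1, 5.0), (2, 5.0), (3, 8.0);
INSERT INTO order_lines VALUES
  (1, 1, 1), (2, 1, 1), (3, 1, 1),   -- order 1: three backordered lines
  (4, 2, 1),                          -- order 2: one backordered line
  (5, 3, 0);                          -- order 3: nothing backordered
""")

# One row per backordered line, so each order's fee repeats once per line.
fanned_out = """
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered
"""

# Inflated: order 1's fee is added three times.
naive_sum = conn.execute("SELECT SUM(o.shipping_fee) " + fanned_out).fetchone()[0]

# Over-collapsed: orders 1 and 2 share the same fee, so DISTINCT merges them.
distinct_sum = conn.execute(
    "SELECT SUM(DISTINCT o.shipping_fee) " + fanned_out
).fetchone()[0]

# Pre-aggregate to one row per qualifying order, then join one-to-one.
pre_aggregated_sum = conn.execute("""
WITH qualifying_orders AS (
  SELECT DISTINCT order_id FROM order_lines WHERE is_backordered
)
SELECT SUM(o.shipping_fee)
FROM orders o
JOIN qualifying_orders q ON q.order_id = o.order_id
""").fetchone()[0]

print(naive_sum, distinct_sum, pre_aggregated_sum)
```

Only the pre-aggregated query counts each qualifying order's fee exactly once. The naive sum is too high and the `DISTINCT` patch is too low, which is why the remedy is framed in terms of grain and not in terms of `DISTINCT`.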
-
-## Leak-safety (hard constraint on this spec and its example)
-
-The benchmark's gold answers must never appear in ktx. The worked example must
-be a **synthetic, generic schema invented for teaching** — not the tables,
-column names, query, or numeric results of any Spider 2.0-Lite question. The
-example demonstrates the *pattern* (coarse-grain measure counted after a
-one-to-many join), which is universal; it must be reconstructable from first
-principles by anyone, with zero reference to benchmark data. A reviewer should
-be able to read the example and find nothing that ties it to a specific
-benchmark instance.
-
-## Acceptance criteria
-
- The skill's `<sql_craft>` Composition section states the multi-hop
-  generalization of the fan-out rule and a grain-verification habit, inline and
-  dialect-agnostic.
- It includes exactly one short, **generic** worked example (wrong vs.
-  pre-aggregated-right) using an invented schema, with no benchmark-derived
-  identifiers or values.
- No new tool, flag, or config; this is skill-content only (additive to spec 07).
- Existing analytics-skill content tests are updated to cover the added rule's
-  presence (mirroring spec 07's `analytics-skill-content.test.ts`).
-
-## Benchmark context (motivation only)
-
-Multi-hop aggregation questions (counting/averaging a coarse-grained measure
-reached through several one-to-many joins) are a recurring source of
-result-mismatch failures in the SQLite subset: the agent produces runnable SQL
-with the right tables but a fan-out-inflated number. These are correctness
-failures, not knowledge or schema-discovery failures (zero execution errors in
-the latest run), so the fix belongs in the product's authoring craft — where it
-also helps any real analyst — not in a benchmark-specific prompt.
-```
--- a/spider2-specs/done/10-panel-completeness-spine.md
+++ b/spider2-specs/done/10-panel-completeness-spine.md
@ -1,65 +0,0 @@
-# Panel/period completeness — emit the full set of groups, not only the populated ones
-
-## Problem
-
-When a question asks for a result *per period* or *per category* ("orders for each
-month of 2023", "revenue by region", "count per status"), the natural `GROUP BY`
-only returns groups that actually have rows. Periods/categories with **zero**
-activity silently vanish, so a "12 months" answer comes back with 9 rows and the
-ones that should read `0` are simply absent. The agent writes runnable SQL with
-the right aggregate but an **incomplete panel**.
-
-This is a universal reporting correctness issue: a monthly report with missing
-months, or a category breakdown missing the empty categories, is wrong for any
-analyst — and it is also a frequent result-mismatch shape on the benchmark.
-
-## Generic use case (independent of any benchmark)
-
-"How many orders were placed in each month of 2023?" must return **12 rows** even
-if March had no orders (March = 0), not 11 rows. "Sales per region" should include
-regions with no sales (as 0/NULL) when the question asks for *each* region.
-
-## Requirements
-
-Additive to the `ktx-analytics` skill's `<sql_craft>` "Answer completeness /
-interpretation" group (consistent with spec 07's inline, dialect-agnostic, heuristic
-+ why style).
-
-1. **Recognize "full-panel" phrasing.** Cues like *each / every / per <period> /
-   for all <category> / by month* signal that the answer's row set should be the
-   **complete** set of periods or categories in scope, not just those present in
-   the filtered fact rows.
-
-2. **Build a spine, then LEFT JOIN.** Generate the full set of expected
-   groups — a date/number series via a recursive CTE for periods, or the distinct
-   dimension values from the authoritative dimension table for categories — and
-   LEFT JOIN the aggregated facts onto it, defaulting missing measures with
-   `COALESCE(metric, 0)` (or NULL when 0 would be wrong). *Why:* a plain
-   `GROUP BY` over the fact rows can only emit groups that have at least one fact row.
-
-3. **Don't over-apply.** When the question asks only about groups that exist
-   ("which months had orders"), the spine is unnecessary; the cue is *each/all*
-   vs *which*.
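The spine recipe in requirement 2 can be sketched as follows. This is a minimal illustration under stated assumptions: a synthetic `orders(order_id, order_date)` table with ISO-formatted text dates, SQLite's `date()` and `strftime()` functions, and a recursive CTE for the month series. Other dialects would build the series differently.

```python
import sqlite3

# Synthetic data: 2023 has orders in January, February and April only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
INSERT INTO orders (order_date) VALUES
  ('2023-01-15'), ('2023-01-20'), ('2023-02-03'), ('2023-04-09');
""")

rows = conn.execute("""
WITH RECURSIVE months(month_start) AS (      -- the spine: all 12 months of 2023
  SELECT '2023-01-01'
  UNION ALL
  SELECT date(month_start, '+1 month') FROM months
  WHERE month_start < '2023-12-01'
),
monthly AS (                                 -- facts aggregated to the month grain
  SELECT strftime('%Y-%m', order_date) AS ym, COUNT(*) AS n_orders
  FROM orders
  WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
  GROUP BY strftime('%Y-%m', order_date)
)
SELECT strftime('%Y-%m', m.month_start) AS ym,
       COALESCE(monthly.n_orders, 0)    AS n_orders   -- empty months read 0
FROM months m
LEFT JOIN monthly ON monthly.ym = strftime('%Y-%m', m.month_start)
ORDER BY m.month_start
""").fetchall()

for ym, n_orders in rows:
    print(ym, n_orders)
```

The facts are aggregated before the join, so the spine stays at one row per month and the `LEFT JOIN` cannot multiply rows.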
-
-## Leak-safety (hard constraint)
-
-Any worked example must use a **synthetic generic schema** (e.g. an `orders`
-table with an `order_date`) and demonstrate only the *pattern* (spine + LEFT JOIN
-+ COALESCE). No benchmark table names, SQL, or result values. The behavior is
-reconstructable from first principles and tied to no specific instance.
-
-## Acceptance criteria
-
- `<sql_craft>` states the full-panel cue, the spine + LEFT JOIN + COALESCE recipe,
-  and the over-application guard — inline and dialect-agnostic.
- At most one short generic example (recursive-CTE date spine or distinct-dimension
-  spine), no benchmark-derived content.
- Skill-content only; analytics-skill content tests updated to cover the rule.
-
-## Benchmark context (motivation only)
-
-Per-period / per-category questions where some periods are empty produce
-short-row result mismatches in the SQLite subset. The fix is a universal
-reporting habit (complete panels), so it belongs in the product's craft, where it
-also helps real analysts — not in a benchmark-specific prompt. Related to spec 11
-(rolling/cumulative windows need a complete date spine to be correct).
--- a/spider2-specs/done/11-time-series-window-recipes.md
+++ b/spider2-specs/done/11-time-series-window-recipes.md
@ -1,73 +0,0 @@
-# Time-series window craft — running totals, rolling-N (min-periods), period-over-period
-
-## Problem
-
-A large share of analytics questions are time-series shaped: a **running/cumulative
-balance**, a **rolling N-day average**, or **period-over-period growth**. The agent
-knows window functions exist (spec 07 covers determinism and window-then-filter) but
-gets the *time-series specifics* wrong:
-
- cumulative balance computed without an unbounded preceding frame (or with the
-  frame defaulting incorrectly when there are ties on the order key);
- "rolling 30-day" implemented as `ROWS BETWEEN 29 PRECEDING AND CURRENT ROW` over
-  **gappy** daily data, so the frame covers more than 30 calendar days whenever days are missing;
- no **minimum-periods** handling — a rolling average is reported before the window
-  is actually full;
- "growth vs previous period" without `LAG`, or comparing to the wrong neighbor.
-
-These are runnable-but-wrong; the structure is close, the edge case diverges.
-
-## Generic use case (independent of any benchmark)
-
- "Each account's month-end running balance over 2023" — cumulative sum of monthly
-  net over an ordered window.
- "30-day rolling average of daily revenue, only once 30 days of history exist."
- "Month-over-month revenue growth rate."
-
-All three are bread-and-butter for any analyst on any time-series table.
-
-## Requirements
-
-Additive to the `ktx-analytics` skill's `<sql_craft>` "Window functions" group
-(inline, dialect-agnostic, heuristic + why).
-
-1. **Cumulative / running total.** `SUM(x) OVER (PARTITION BY k ORDER BY t ROWS
-   BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, with a complete tie-breaker in
-   `ORDER BY` (spec 07 rule). *Why:* the default frame is `RANGE BETWEEN UNBOUNDED
-   PRECEDING AND CURRENT ROW`, which pulls in every peer row tied on the `ORDER BY`
-   key, so tied rows all report the same running total.
-
-2. **Rolling window over time, not over rows.** When "rolling N days/months" is
-   asked, the window must span a calendar range. Over gappy data, either build a
-   complete date spine first (see spec 10) so `ROWS BETWEEN n-1 PRECEDING AND
-   CURRENT ROW` equals the intended span, or use a range frame or self-join keyed
-   on the date. *Why:* row-count frames over missing dates silently measure the
-   wrong span.
-
-3. **Minimum periods.** When the question says "only after N periods of data" (or
-   it is implied by a rolling metric), emit NULL/skip until the window is full
-   (e.g. guard on `COUNT(*) OVER (...) = N`). *Why:* a partial early window is not
-   the requested metric.
-
-4. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)`
-   for prior-period comparisons; growth rate = `(cur - prev) / prev` computed at
-   full precision (round only at the end). Guard divide-by-zero/NULL prev.
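The four recipes above can be combined in one compact sketch. It assumes the synthetic `daily_revenue(day, amount)` table suggested under Leak-safety, a single series (so no `PARTITION BY`), SQLite syntax for the date spine, and the interpretation that a missing day means zero revenue. All of these are assumptions for illustration, not a statement of how the skill words its example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL);
-- 2023-01-03 is missing on purpose: the data is gappy.
INSERT INTO daily_revenue VALUES
  ('2023-01-01', 30), ('2023-01-02', 60), ('2023-01-04', 90), ('2023-01-05', 30);
""")

rows = conn.execute("""
WITH RECURSIVE spine(day) AS (               -- complete calendar, one row per day
  SELECT '2023-01-01'
  UNION ALL
  SELECT date(day, '+1 day') FROM spine WHERE day < '2023-01-05'
),
filled AS (                                  -- assumption: a missing day means 0 revenue
  SELECT s.day, COALESCE(d.amount, 0.0) AS amount
  FROM spine s
  LEFT JOIN daily_revenue d ON d.day = s.day
)
SELECT
  day,
  -- 1. running total with an explicit ROWS frame
  SUM(amount) OVER (ORDER BY day
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
  -- 2 + 3. rolling 3-day average, NULL until three days of history exist
  CASE WHEN COUNT(*) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) = 3
       THEN AVG(amount) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
  END AS rolling_3d_avg,
  -- 4. day-over-day growth; NULLIF guards a zero previous value
  (amount - LAG(amount) OVER (ORDER BY day))
    / NULLIF(LAG(amount) OVER (ORDER BY day), 0) AS growth
FROM filled
ORDER BY day
""").fetchall()

running = [r[1] for r in rows]
rolling = [r[2] for r in rows]
growth = [r[3] for r in rows]
print(running, rolling, growth)
```

Because the spine restores the missing day, a three-row frame is also a three-day frame. Run directly over the four gappy rows, the same frame would reach back four calendar days on `2023-01-04`.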
-
-## Leak-safety (hard constraint)
-
-Worked examples must use a **synthetic generic schema** (e.g. `daily_revenue(day,
-amount)` or `account_txns(account_id, txn_date, net)`) and show only the *pattern*.
-No benchmark table names, SQL, or result values.
-
-## Acceptance criteria
-
- `<sql_craft>` "Window functions" gains the cumulative, rolling-over-time +
-  min-periods, and period-over-period recipes — inline, dialect-agnostic.
- At most one or two compact generic examples; no benchmark-derived content.
- Skill-content only; analytics-skill content tests updated.
-
-## Benchmark context (motivation only)
-
-Running-balance / rolling / period-over-period questions are the single largest
-result-mismatch cluster in the SQLite subset (financial-transactions style DBs).
-The methodology is universal analyst craft, so it belongs in the product's skill
-(transfers to real users), not in a benchmark-specific prompt. Depends on spec 10
-(date spine) for the gappy-rolling case.
--- a/spider2-specs/done/12-parse-text-encoded-numbers.md
+++ b/spider2-specs/done/12-parse-text-encoded-numbers.md
@ -1,61 +0,0 @@
-# Parse text-encoded numeric columns before doing math on them
-
-## Problem
-
-Numeric measures are often stored as **text** with human formatting: unit suffixes
-(`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators (`"$1,200"`),
-percent signs (`"12%"`), or non-numeric sentinels for missing/zero (`"-"`, `"N/A"`,
-`""`). Aggregating or comparing such a column directly is silently wrong: string
-comparison orders `"100" < "9"`, and a naive `CAST(x AS REAL)` yields `0`, NULL, or
-only the leading digits (SQLite turns `"1.2K"` into `1.2`) on the formatted values
-rather than the intended number.
-
-The agent already samples schemas (spec 07 schema-discovery), but when it sees a
-"numeric" column it tends to assume it is a real number type and skips the parse —
-so the arithmetic runs on garbage. Runnable, plausible, wrong.
-
-## Generic use case (independent of any benchmark)
-
-A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become `1200 / 3000000
-/ 0` before you can sum it or compute a daily change. A `price` stored as
-`"$1,299.00"` must become `1299.00` before averaging. This is routine data hygiene
-on real, messy production tables.
-
-## Requirements
-
-Extend the `ktx-analytics` skill's `<sql_craft>` "Schema discovery before writing
-SQL" group (inline, dialect-agnostic, heuristic + why).
-
-1. **Detect text-encoded numerics during sampling.** When a column that the
-   question treats as a number is stored as text, sample distinct values to learn
-   the encodings actually present (suffixes, symbols, separators, sentinels) before
-   composing — never assume the format from the column name.
-
-2. **Parse and scale before arithmetic.** Strip currency/separator/percent
-   characters; multiply by the suffix scale (K=10^3, M=10^6, B=10^9); map sentinels
-   (`-`, `N/A`, empty) to `0` or `NULL` per the question's intent; then cast to a
-   numeric type. Do this in an early CTE so all downstream math sees clean numbers.
-   *Why:* string columns compared/aggregated as-is sort lexically and cast to 0,
-   producing silently wrong results instead of errors.
-
-3. **Confirm coverage.** After parsing, sanity-check that no intended-numeric value
-   failed to parse (would surface as NULL), to catch an encoding the sample missed.
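The detect, parse/scale, and verify steps above can be sketched against SQLite as follows. The `metrics(label, value_text)` table and its values are made up, the sentinel-to-zero mapping is one possible reading of a question's intent, and the `GLOB` check is a deliberately loose stand-in for a real numeric validator (it would accept a body with two decimal points).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (label TEXT PRIMARY KEY, value_text TEXT);
INSERT INTO metrics VALUES
  ('a', '1.2K'), ('b', '3M'), ('c', '-'), ('d', '$1,299.00'), ('e', '12 units');
""")

# What a naive cast does in SQLite: numeric prefix, or 0.0 when there is none.
naive = conn.execute(
    "SELECT CAST('1.2K' AS REAL), CAST('$1,299.00' AS REAL)"
).fetchone()

parsed = dict(conn.execute("""
WITH cleaned AS (                            -- strip currency symbol and separators
  SELECT label,
         UPPER(TRIM(REPLACE(REPLACE(value_text, '$', ''), ',', ''))) AS s
  FROM metrics
),
split AS (                                   -- separate the numeric body from the suffix scale
  SELECT label, s,
         CASE WHEN substr(s, -1) IN ('K', 'M', 'B')
              THEN substr(s, 1, length(s) - 1) ELSE s END AS body,
         CASE substr(s, -1)
              WHEN 'K' THEN 1e3 WHEN 'M' THEN 1e6 WHEN 'B' THEN 1e9 ELSE 1.0 END AS scale
  FROM cleaned
)
SELECT label,
       CASE
         WHEN s IN ('-', 'N/A', '') THEN 0.0               -- sentinels mapped to zero
         WHEN body <> '' AND body NOT GLOB '*[^0-9.]*'      -- digits and dots only
           THEN CAST(body AS REAL) * scale
       END AS value_num                                     -- anything else stays NULL
FROM split
""").fetchall())

# Coverage check: labels whose text did not match any known encoding.
unparsed = sorted(label for label, value in parsed.items() if value is None)
print(naive, parsed, unparsed)
```

The unparsed list is the coverage check from requirement 3: `'12 units'` surfaces as NULL instead of being folded silently into a sum, which signals an encoding the sample missed.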
-
-## Leak-safety (hard constraint)
-
-Worked examples must use a **synthetic generic schema** and made-up values (e.g. a
-`metrics(label, value_text)` table with `"1.2K"`, `"-"`). No benchmark table names,
-SQL, or result values; the parsing pattern is universal and tied to no instance.
-
-## Acceptance criteria
-
- `<sql_craft>` schema-discovery gains the detect → parse/scale → verify guidance —
-  inline, dialect-agnostic, with at most one short generic example.
- No benchmark-derived content. Skill-content only; content tests updated.
-
-## Benchmark context (motivation only)
-
-At least one SQLite-subset question stores trading volume as suffix-encoded text
-("K"/"M", "-" for zero) and fails because the agent aggregates the raw strings. The
-fix — parse messy encodings before math — is universal data hygiene that helps any
-analyst, so it belongs in the product's craft rather than a benchmark-specific
-prompt.
--- a/spider2-specs/done/14-output-completeness-final-check.md
+++ b/spider2-specs/done/14-output-completeness-final-check.md
@ -1,105 +0,0 @@
-# Enforce answer-output completeness with a final pre-emit check in the analytics skill
-
-## Problem
-
-The single largest correctness failure mode is **incomplete output**: the query runs and the
-methodology is roughly right, but the result is missing columns the question asked for. Three
-recurring sub-patterns:
-
-1. **Multi-part questions answered partially.** A question that asks for several things ("report
-   the highest *and* the lowest month, each with its count and average, *and* the difference")
-   comes back with only the first part — one column instead of the several requested.
-2. **Identity dropped.** Grouping by a human-readable name but not projecting the entity's
-   identifier (e.g. a product name without its product id, a customer name without its
-   customer id).
-3. **Inputs to a derived value dropped.** Returning a ratio / percentage / difference but not
-   the underlying counts the question also asked for.
-
-Sub-patterns 2 and 3 are **already covered by `<sql_craft>` rules** in the analytics skill
-(spec 07: *"expose identity, not just the label"* and *"keep the inputs to a derived value"*),
-yet they are frequently **not applied**. So the gap is not missing knowledge — it is that these
-rules are passive heuristics buried in a list, and the agent doesn't reliably check them before
-finalizing. The fix is to (a) add the missing multi-part-completeness rule and (b) turn
-output-completeness into an **explicit final verification step** the agent performs before
-emitting SQL.
-
-This is reinforced by evidence that the failure is **model-independent**: a markedly stronger
-model produced the same incomplete-output mistakes on these questions, which means it is a
-craft/enforcement gap, not a capability gap.
-
-## Generic use case (independent of any benchmark)
-
-An analyst is asked: *"For each region, report the highest and the lowest monthly order count,
-and the difference between them."* A complete, useful answer has five columns: the region's id,
-its name, the highest count, the lowest count, and the difference. Returning just
-the region and a single number answers only part of the request. This is a universal expectation
-on any database: answer **every** part of a multi-part request, identify the entities, and show
-the inputs behind any derived figure.
-
-## Requirements
-
-Additive to the analytics skill's `<sql_craft>` "Answer completeness / interpretation" group and
-its workflow's validate step (inline, dialect-agnostic, heuristic + why, consistent with spec 07).
-
-1. **Multi-part / multi-output completeness (new rule).** When a question requests several
-   outputs — a list ("A, B, and C"), paired extremes ("the highest *and* the lowest"), or a
-   value plus its components ("X, Y, and their ratio") — the final projection must contain a
-   column for **each** requested output. *Why:* answering only the first clause is the most common
-   way a runnable query is still wrong; the grain and methodology can be perfect yet the answer
-   is short by columns.
-
-2. **Fold the existing identity / inputs rules into the same completeness notion.** The
-   already-shipped rules — project the entity **identifier** alongside any human-readable label,
-   and **keep the inputs** to any derived value — are part of output completeness; reference them
-   from the check below so they are actually applied, not just listed.
-
-3. **Add an explicit final completeness check (the enforcement mechanism).** Before emitting the
-   final SQL, the skill should have the agent **re-read the question and confirm the projection
-   covers**: every named metric/attribute; the identifier of every grouped/named entity; every
-   input to a derived value; all at the grain the question specifies. This is a short, concrete
-   checkpoint at the validate step — the point is to convert the passive heuristics into an active
-   pre-finalize verification. (Do **not** add unrequested/extra columns to be "safe" — that is
-   grader-gaming; the check is about matching the request exactly, not padding it.)
-
-   Generic teaching example (synthetic schema — see Leak-safety):
-   ```sql
-   -- "For each region, report the highest and lowest monthly order count and their difference."
-   -- WRONG: answers only the first clause; no region id, no lowest, no difference.
-   SELECT region_name, MAX(monthly_orders) AS highest
-   FROM region_monthly GROUP BY region_name;
-
-   -- RIGHT: one column per requested output + the entity's identity, at the region grain.
-   SELECT r.region_id, r.region_name,
-          MAX(m.monthly_orders) AS highest_monthly_orders,
-          MIN(m.monthly_orders) AS lowest_monthly_orders,
-          MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference
-   FROM regions r
-   JOIN region_monthly m ON m.region_id = r.region_id
-   GROUP BY r.region_id, r.region_name;
-   ```
-
-## Leak-safety (hard constraint)
-
-The example must use an **invented, generic schema** (`regions`, `region_monthly`) and made-up
-columns — **no benchmark table names, SQL, or result values.** It teaches the *pattern* (cover
-every requested output + identity + inputs), which is universal and tied to no specific instance.
-
-## Acceptance criteria
-
- The skill states the multi-part-completeness rule and a concrete **final completeness check**
-  (re-read question → verify metrics + identity + inputs + grain), inline and dialect-agnostic,
-  cross-referencing the existing identity/inputs rules so they're enforced.
- Includes the over-projection guard (don't pad with extra columns — that's grader-gaming).
- One short generic example (wrong vs complete); no benchmark-derived content.
- Skill-content only; analytics-skill content tests updated to cover the new rule + check.
-
-## Benchmark context (motivation only)
-
-In the latest SQLite-subset run, **incomplete output was the single largest failure bucket
-(~13 of 51 voted failures)**: multi-part questions answered partially, and identity / derived-value
-inputs dropped — the latter two being spec-07 rules that already exist but weren't applied. A
-probe with a much stronger model reproduced the *same* incomplete-output failures, confirming this
-is a craft-enforcement gap rather than a model-capability one. The fix — answer every requested
-part, identify entities, keep inputs — is universal analyst craft, so it belongs in the product
-skill (and transfers to real users), enforced as a final check rather than left as a passive hint.
-```
--- a/spider2-specs/done/15-mcp-server-structured-logging.md
+++ b/spider2-specs/done/15-mcp-server-structured-logging.md
@ -1,116 +0,0 @@
-# Structured, leveled logging for the ktx MCP server
-
-> **Scope: observability only.** This spec is about *seeing* what the MCP server
-> does (which tool, what params, when, how long, outcome). *Preventing* a runaway
-> query from blocking the server (off-event-loop / interruptible query execution)
-> is a separate concern — see "Non-goals" and the sibling spec note below.
-
-## Problem
-
-The ktx MCP server (`packages/cli/src/mcp-http-server.ts` +
-`mcp-server-factory.ts`; raw `node:http` + `@modelcontextprotocol/sdk`
-`StreamableHTTPServerTransport`) emits almost no operational logs. There is no
-server-side record of **which MCP tool was called, with what parameters, when,
-how long it took, or whether it succeeded** — nor of session open/close or
-transport errors. When a tool call is slow, hangs, or a client connection drops
-("Transport channel closed"), an operator has no trail to diagnose it and must
-resort to process sampling / `lsof` / guesswork — and the offending input
-(e.g. the exact SQL) is typically unrecoverable.
-
-## Generic use case
-
-Anyone running a long-lived ktx MCP server — a developer's local instance, a
-shared team server, or a hosted deployment — needs observability into tool-call
-activity to:
- diagnose slow or hung tool calls (which `sql_execution` ran, against which
-  connection, with what SQL, for how long);
- explain client-visible connection failures from the server side (session
-  lifecycle, transport-closed events);
- audit what agents asked the server to do;
- spot patterns (hot tools, slow connections, error rates).
-
-This is standard production-server hygiene; the server currently provides none.
-
-## Requirements (sketch — refine when picked up)
-
-1. **One structured (JSON) logger, low overhead.** Suggested `pino` (orientation
-   only; implementer owns the choice). A single shared instance; write **JSON to
-   stdout** (12-factor — the launcher/aggregator routes it). No in-app file
-   rotation. Optional human-readable pretty output only when attached to a TTY
-   (dev).
-2. **Configurable level via env** (e.g. `KTX_LOG_LEVEL`, default `info`; `debug`
-   for diagnosis) — verbose logging on demand without code changes.
-3. **Per-session / per-call context** via child loggers: every line carries a
-   `sessionId` (from the transport session) and, for tool calls, a `callId` +
-   `tool` name, so one session's or call's activity can be traced/grepped.
-4. **Tool-call logging — START logged BEFORE execution, COMPLETION after.** For
-   every MCP tool invocation:
-   - on entry: log `{ tool, params, sessionId, callId }` **before** running the
-     handler (so the record exists even if the handler never returns);
-   - on exit: log `durationMs` + outcome (ok with result size, or error with
-     stack).
-   This makes a **hung / never-returning call identifiable**: a start with no
-   matching completion is the culprit, with its exact parameters and timestamp.
-   This matters specifically because handlers like `sql_execution` run a
-   *synchronous* better-sqlite3 query — a runaway query blocks the process and no
-   completion is ever logged, so the start line (flushed before the blocking
-   call) is the only record. For `sql_execution`, `params` should include the SQL
-   text (the most useful field). Emit a **WARN** when a *completed* call exceeds a
-   configurable slow threshold (e.g. `KTX_SLOW_TOOL_MS`).
-5. **Connection / session lifecycle:** log session open/close (with `sessionId`)
-   and transport errors (the SDK's closed-channel / "Transport channel closed"
-   events) so client-side connection failures have a server-side counterpart.
-6. **Error logging** with structured stack traces (a standard error serializer),
-   not bare strings.
-7. **Light redaction — credentials only** (bearer token, connection
-   passwords/secrets). SQL text and tool params are *not* secrets and must be
-   logged. Do not over-redact.
-8. **Synchronous logging is fine.** The server uses a synchronous DB client, so
-   logging need not be async; prefer the simpler synchronous stdout path over
-   async/worker transports (which can lose buffered lines on a hard crash). Do
-   not introduce async-logging machinery.
-
-## Acceptance criteria (sketch)
-
- With `KTX_LOG_LEVEL=debug`, invoking any MCP tool produces a `tool.start`
-  (tool, params, sessionId, callId) and a `tool.end` (durationMs, outcome) line
-  on the server's stdout, as JSON.
- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a
-  `tool.start` line carrying its **exact SQL and timestamp** and **no**
-  `tool.end` — so the offending query is recoverable from the log alone, with no
-  process sampling.
- A completed tool call slower than the configured threshold emits a WARN with
-  its duration.
- Session open/close and transport-closed events are logged with the `sessionId`.
- At default level (`info`), routine per-tool lines are suppressed but lifecycle,
-  slow-call warnings, and errors are present.
- Credentials (bearer token, connection secrets) never appear in logs; SQL and
-  tool params do.
- No new heavy dependencies beyond the logger; no OpenTelemetry/metrics stack; no
-  async-transport machinery.
-
-## Non-goals
-
- **Preventing/interrupting runaway queries** (off-event-loop execution, query
-  timeouts, worker-thread isolation). That is a *separate* spec; a single
-  synchronous query that fans out into a massive nested-loop join can peg the
-  single-threaded server for hours and break new connections — observability
-  surfaces *which* query, but the fix is execution-model work. (This logging is
-  also a prerequisite for a future watchdog that detects a `tool.start` with no
-  `tool.end` past a threshold and recycles the server.)
- Metrics/tracing/OpenTelemetry exporters.
- Forwarding logs to the MCP *client* via the protocol's logging capability
-  (`notifications/message`, `logging/setLevel`) — a possible later enhancement,
-  distinct from operational stdout logging.
-
-## Benchmark context (motivation, not a requirement)
-
-While Spider 2.0-Lite was running against the MCP server under concurrent load, a
-query generated by an adversarial reviewer degenerated into a massive nested-loop join;
-synchronous better-sqlite3 executed it on the event loop, pegging a server at
-~100% CPU for hours and breaking new MCP connections to it ("Transport channel
-closed"). We could not determine *which* query, because the server logs nothing
-about tool calls — diagnosis required `sample`/`lsof` on the live process and the
-exact SQL was never recovered. Structured tool-call logging (especially
-start-before-execute) would have turned this into a one-line `grep` of the server
-log.
--- a/spider2-specs/done/16-bounded-query-execution-timeout.md
+++ b/spider2-specs/done/16-bounded-query-execution-timeout.md
@ -1,131 +0,0 @@
-# Bounded query execution (deadline + non-blocking) for read SQL
-
-> Priority: HIGH. Found empirically during a Spider2-lite sqlite run
-> (2026-06-18): a single `sql_execution` MCP call wedged a worker at 100% CPU
-> for 13+ minutes and never returned. The query
-> `SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on the
-> `complex_oracle` sqlite database hit a VIEW (`costs ⋈ sales`, 918,843 × 82,112
-> rows, joined on a 4-column key with no composite index) whose plan degraded to
-> an O(N×M) nested-loop scan. Because the sqlite connector runs
-> better-sqlite3's `.all()` **synchronously with no timeout**, it blocked the MCP
-> worker's entire event loop: no `tool.end` was ever logged, the port went
-> unresponsive, and the query could not be cancelled. One of four eval shards
-> stalled until the worker was killed by hand.
-
-## Problem
-
-Two compounding gaps on the read-query path:
-
-1. **No execution deadline.** A single expensive query runs unbounded. This is
-   handled divergently per connector, with no shared contract: BigQuery has a
-   real server-side job timeout (`job_timeout_ms`); ClickHouse has an HTTP
-   `request_timeout`; Snowflake, Postgres, MySQL, and SQL Server bound only
-   connection/pool *acquisition*, not statement *execution*; SQLite has nothing.
-   So whether a runaway query is bounded depends entirely on which driver the
-   caller happened to hit.
-
-2. **In-process engines block the event loop and can't be cancelled.** The
-   sqlite connector executes on the main thread via synchronous
-   better-sqlite3 `.all()`. A slow query freezes the whole MCP server (it can't
-   serve other requests, send progress, or write `tool.end`), and there is no
-   way to interrupt it: better-sqlite3 exposes no interrupt/cancel API — its
-   documented mechanism for slow queries is to run them in a **worker thread**,
-   and the only way to stop a runaway synchronous query is to terminate the
-   thread executing it.
-
-The net effect is a query that produces a `tool.start` with no matching
-`tool.end`, an unresponsive server, and no self-recovery. A row cap (`maxRows`)
-does not help — it bounds returned rows, not scan work, and the failing query
-returned a single aggregate row.
-
-## Generic use case
-
-Any data agent that lets an LLM author SQL will eventually issue an
-accidentally-expensive query — an unindexed or cartesian join, an expensive
-VIEW, a wide aggregate over a large fact table. A general-purpose context layer
-must bound that and return a clean, fast "query exceeded Ns" error so the agent
-can revise (add filters, query base tables, narrow the range) instead of hanging
-the tool and the server. This matters for embedded/local warehouses (sqlite,
-duckdb) and remote ones alike, and is wholly independent of any benchmark.
-
-## Requirements
-
-1. Every read-query execution path (`executeReadOnly`) enforces a single
-   canonical execution deadline. One opinionated default; **not** a per-call
-   user flag. Where a driver already supports a per-connection timeout
-   (BigQuery `job_timeout_ms`), reuse that as the per-connection override rather
-   than inventing a parallel knob.
-2. On exceeding the deadline the path resolves with a `KtxQueryError`
-   ("query exceeded {N}s") — a finite, decision-reaching outcome, never an
-   unbounded hang.
-3. The deadline is a **shared contract at the connector boundary**, defined once
-   (on the `executeReadOnly` contract or a shared wrapper at the call site) so
-   all drivers participate. Bring the existing divergent timeouts (BigQuery job
-   timeout, ClickHouse request timeout) under this one contract instead of
-   leaving parallel mechanisms.
-4. For in-process engines (sqlite today, any future embedded driver), execution
-   MUST NOT block the MCP server event loop. Run the query off the main thread
-   and enforce the deadline by terminating that thread on timeout (the
-   better-sqlite3-documented approach, since synchronous queries are
-   uncancellable in-thread). The event loop must stay responsive so `tool.end`
-   is always written and concurrent requests on the same port are served.
-5. Prefer real cancellation over client-side give-up. Where the engine supports
-   a server-side statement timeout (Postgres `statement_timeout`, MySQL
-   `max_execution_time`, Snowflake `STATEMENT_TIMEOUT_IN_SECONDS`, ClickHouse
-   `max_execution_time`, BigQuery job timeout, SQL Server request timeout), set
-   it so the deadline actually stops work, not merely abandons the promise while
-   the query keeps running. For in-process engines, thread termination is the
-   cancellation.
-6. The MCP `sql_execution` tool surfaces the timeout as an expected error
-   (classified as `KtxQueryError`, not a `$exception` fault, consistent with
-   existing expected-error classification) and logs a `tool.end` with the error
-   outcome.
-7. Read-only enforcement (`assertReadOnlySql`) and the `maxRows` row cap remain
-   unchanged. The deadline is additive; `maxRows` is not a substitute for it.
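
Requirement 5 (prefer a server-side statement timeout) can be sketched as a small mapping from the shared deadline to each engine's own setting. This is a minimal illustration, not ktx code: the `ServerEngine` type and `statementTimeoutSql` helper are hypothetical names, and only the four engines whose timeout is a plain session setting are covered.

```typescript
// Hypothetical helper (not part of ktx): turns one shared deadline into the
// engine-specific statement that makes the server itself stop the query.
type ServerEngine = "postgres" | "mysql" | "snowflake" | "clickhouse";

function statementTimeoutSql(engine: ServerEngine, deadlineMs: number): string {
  if (Math.floor(deadlineMs) !== deadlineMs || deadlineMs <= 0) {
    throw new RangeError("deadlineMs must be a positive integer, got " + deadlineMs);
  }
  // Round up so a seconds-based limit is never shorter than the contract's deadline.
  const seconds = Math.ceil(deadlineMs / 1000);
  switch (engine) {
    case "postgres":
      return "SET statement_timeout = " + deadlineMs; // unit: milliseconds
    case "mysql":
      return "SET SESSION max_execution_time = " + deadlineMs; // milliseconds, SELECT only
    case "snowflake":
      return "ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = " + seconds;
    case "clickhouse":
      return "SET max_execution_time = " + seconds; // unit: seconds
  }
}
```

Keeping the unit conversion in one place is the point of the sketch: the contract carries a single deadline, and each driver only translates it.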
-
-## Acceptance criteria
-
- A read query that exceeds the deadline returns a `KtxQueryError` within
-  roughly the deadline; the MCP worker stays responsive (a concurrent tool call
-  on the same server completes while the slow query is still pending) and writes
-  a matching `tool.end` with a non-ok outcome.
- sqlite specifically: executing a deliberately pathological query (e.g. an
-  expensive VIEW or an unindexed cross join) on a fixture does not block the
-  event loop, is terminated at the deadline, and CPU returns to idle afterward
-  (the off-main-thread executor is killed, not left spinning).
- No regression: normal fast queries return identical results; read-only
-  rejection still works; `maxRows` still bounds returned rows.
- Tests cover the deadline path for at least the in-process driver (sqlite,
-  terminate-on-deadline) and one server-side-timeout driver.
-
-## Benchmark context (motivation only)
-
-The Spider 2.0-Lite local set loads several warehouses into sqlite, some with
-expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` =
-`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112
-rows, no composite index, and `promo_id` (the index the optimizer picks) holds a
-single value in 95.5% of rows. LLM-authored profiling queries (MIN/MAX/COUNT over such a
-view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval
-shard for 10+ minutes; with one, the agent gets a fast error and can scope the
-query instead.
-
-## Orientation hints (code pointers; may have drifted)
-
- Shared contract: `packages/cli/src/context/scan/types.ts` —
-  `KtxScanConnector.executeReadOnly` (~343), `KtxReadOnlyQueryInput` (~285).
- MCP call site: `packages/cli/src/context/mcp/local-project-ports.ts:70`
-  (`connector.executeReadOnly`); tool registration in
-  `packages/cli/src/context/mcp/context-tools.ts`.
- In-process sync execution (the acute hang):
-  `packages/cli/src/connectors/sqlite/connector.ts:311-313`
-  (better-sqlite3 `.prepare().all()`).
- Existing divergent timeouts to unify: `connectors/bigquery/connector.ts`
-  (`job_timeout_ms` / `jobTimeoutMs`), `connectors/clickhouse/connector.ts:602`
-  (`request_timeout`), `connectors/snowflake/connector.ts:342` (test/pool only),
-  `connectors/postgres/connector.ts`, `connectors/mysql/connector.ts`,
-  `connectors/sqlserver/connector.ts` (pool/connection only).
- Error class: `packages/cli/src/errors.ts:25` (`KtxQueryError`).
- better-sqlite3 (context7 `/wiselibs/better-sqlite3`, v12.x): no
-  interrupt/cancel API; `docs/threads.md` documents the worker-thread pattern
-  for slow queries (master owns worker lifecycle and respawns on exit) — extend
-  it with terminate-on-deadline to enforce the timeout.
--- a/spider2-specs/done/18-bigquery-cross-project-datasets.md
+++ b/spider2-specs/done/18-bigquery-cross-project-datasets.md
@@ -1,68 +0,0 @@
-# 18 — BigQuery cross-project dataset support (introspect foreign-hosted datasets, bill in own project)
-
-**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/18-…`.
-
-## Problem (generic, real-world)
-
-Analysts routinely query datasets that live in a **different** BigQuery project than the one
-they bill jobs to — Google's `bigquery-public-data`, a partner's shared project, an
-organization's central data project, etc. To make those connectable in ktx (so `discover_data`,
-the semantic layer, dictionary sampling, and `sql_dialect_notes` work), ktx must be able to
-**introspect a dataset hosted in a foreign project while running/billing jobs in the
-credentials' own project**.
-
-Today it can't. ktx's BigQuery connector derives a single `projectId` from
-`credentials.project_id` and uses it for **both** job billing **and** schema introspection:
-
- `connectors/bigquery/connector.ts:294` — `projectId` is read only from `credentials.project_id`;
-  there is no separate billing-vs-dataset project knob.
- `:544` (`introspectDataset`) — calls `this.getClient().dataset(datasetId)`, which resolves the
-  dataset **in the client's (billing) project**, and labels every table `catalog: this.resolved.projectId`.
- `:453` (`listTables`) — queries `\`${projectId}\`.\`region-…\`.INFORMATION_SCHEMA.TABLES`, i.e. the
-  **billing** project's INFORMATION_SCHEMA.
- `:163` (`datasetIds()`) — returns `dataset_ids` verbatim; it never parses a `project.` prefix.
-
-So a `dataset_id` naming a dataset in another project can't be introspected, even though querying
-it works fine (cross-project reads bill to the caller's project — that path already works).
-
-### Empirical confirmation
-With a service account in project `ktx-spider2-lite`:
- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (looks in
-  `projects/ktx-spider2-lite/datasets/austin_311`).
- The cross-project form `DatasetReference("bigquery-public-data","austin_311")` → **succeeds**
-  (lists the public tables; public metadata is readable by any authenticated principal).
- There is **no config knob** to separate the introspection project from the billing project.
-
-## Requirement
-
-The BigQuery connector must accept **fully-qualified `project.dataset` entries** in `dataset_ids`
-(a single connection may span more than one source project), and for each:
- **introspect** via the *dataset's* project — `client.dataset(id, { projectId })` /
-  `DatasetReference(project, dataset)`, query the **dataset project's** `INFORMATION_SCHEMA`, and
-  label the table `catalog` with the dataset's project;
- **run jobs / bill** in `credentials.project_id` (unchanged).
-
-A bare `dataset` (no `project.`) keeps today's behavior (resolve in `credentials.project_id`), so
-existing single-project connections are unaffected.
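
The `dataset_ids` rule above (qualified entries resolve in their own project, bare entries fall back to `credentials.project_id`) can be sketched as a pure parsing step. `DatasetRef` and `parseDatasetRef` are hypothetical names chosen for illustration; the sketch assumes dataset ids never contain a dot, so it splits on the last one.

```typescript
// Hypothetical shape: which project hosts the dataset, and the dataset's own id.
interface DatasetRef {
  projectId: string;
  datasetId: string;
}

// Splits "project.dataset" on the LAST dot. Assumption: dataset ids contain no
// dots, while some legacy project ids can, so the last dot is the safe boundary.
function parseDatasetRef(entry: string, billingProjectId: string): DatasetRef {
  const trimmed = entry.trim();
  const dot = trimmed.lastIndexOf(".");
  if (dot === -1) {
    if (trimmed === "") throw new Error("empty dataset_ids entry");
    // Bare dataset: keep today's behavior and resolve in the billing project.
    return { projectId: billingProjectId, datasetId: trimmed };
  }
  const projectId = trimmed.slice(0, dot);
  const datasetId = trimmed.slice(dot + 1);
  if (projectId === "" || datasetId === "") {
    throw new Error('malformed dataset_ids entry "' + entry + '"');
  }
  return { projectId: projectId, datasetId: datasetId };
}
```

Introspection would then use `projectId` from the parsed reference, while job execution keeps using the billing project unchanged.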
-
-## Acceptance
-
- `dataset_ids: ['bigquery-public-data.austin_311']` (credentials in a *different* project) →
-  `ktx ingest <conn>` introspects the tables, enriches, and samples values; `discover_data` /
-  `dictionary_search` return them.
- A connection mixing `['bigquery-public-data.x', 'other-project.y']` introspects both.
- `sql_execution` of a fully-qualified `project.dataset.table` query still runs and bills in
-  `credentials.project_id`.
- Single-project `dataset_ids: ['my_dataset']` behaves exactly as before (no regression).
-
-## Benchmark context (motivation only — do not encode benchmark specifics)
-
-Spider 2.0-Lite's **BigQuery slice (205 questions)** is otherwise **unservable faithfully**: every
-one of its ~74 logical databases groups datasets hosted in foreign public projects
-(`bigquery-public-data`, `isb-cgc-bq`, `data-to-insights`, …), never in a project we own. Query
-execution already works cross-project (proven), but ktx-only *discovery* (the whole point of the
-faithful surface) is blocked because the connector can't introspect them. Scope is small: of 74
-BQ dbs only **1** spans more than one source project, so "let `dataset_ids` carry `project.dataset`
-and introspect each in its own project" covers the benchmark and the general case alike. This is
-the sole blocker for the BigQuery leaderboard slice (the Snowflake slice needed no connector
-change and is already baselined).
--- a/spider2-specs/done/19-durable-bounded-relationship-detection.md
+++ b/spider2-specs/done/19-durable-bounded-relationship-detection.md
@@ -1,89 +0,0 @@
-# 19 — Durable, resumable, bounded relationship detection during ingest enrichment
-
-**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/19-…`.
-
-## Problem (generic, real-world)
-
-Ingest enrichment runs three stages in a fixed order inside `runLocalScanEnrichment`
-(`packages/cli/src/context/scan/local-enrichment.ts`):
-
-1. `descriptions` (`:530`) — per-table LLM descriptions (the expensive step: one model call per
-   table; on a large schema this is minutes of paid LLM work).
-2. `embeddings` (`:559`) — column embeddings.
-3. `relationships` (`:593`) — FK/join discovery: profiles a row sample of **every** table, then
-   validates candidate joins.
-
-The queryable semantic-layer artifacts are persisted **once, at the very end**, by
-`writeLocalScanEnrichmentArtifacts` in `local-scan.ts:510` — which runs **after**
-`runLocalScanEnrichment` returns, i.e. after all three stages.
-
-This creates three failure modes that compound on large schemas (hundreds of tables):
-
-1. **Enrichment is lost if relationship detection is interrupted.** The descriptions + embeddings
-   are computed and held in memory, but they only reach the durable, queryable artifacts when the
-   final write runs after the `relationships` stage. If the process is killed/crashes/times out
-   **during** relationship detection (the last, slowest, silent stage), the artifacts are never
-   written — the schema survives (it was written earlier at `local-scan.ts:473`) but **all the
-   paid LLM enrichment is discarded**. Empirically: ingesting a 95-table BigQuery dataset produced
-   full descriptions + embeddings (progress reached "Building embeddings 17/17"), then the
-   relationships stage ran silently past a supervising deadline and was killed — the persisted
-   `_schema` had **0** AI descriptions, only the native column comments. Every larger dataset hits
-   this, so the most expensive work is the most likely to be thrown away.
-
-2. **Re-running does not resume — it re-spends.** There is a stage state store
-   (`SqliteLocalScanEnrichmentStateStore`) and a `runEnrichmentStage` helper (`:413`) that saves
-   each completed stage's output. But the completed-stage lookup keys on **`runId`**
-   (`findCompletedStage({ runId, stage, inputHash })`, `:427`), and `runId` is fresh per ingest
-   invocation. So resume only works *within* a single run; re-running an interrupted ingest gets a
-   new `runId`, misses the cache, and **re-computes descriptions + embeddings from scratch**
-   (re-paying for the LLM work that already succeeded).
-
-3. **Relationship detection is unobservable and unbounded.** The stage emits no progress between
-   "Detecting relationships" and the final "Relationship detection found N accepted" — minutes of
-   silence on a large schema. A supervisor watching for liveness cannot distinguish a
-   slow-but-working profile from a true hang, and there is no internal time/work budget, so on a very large
-   schema it can run far longer than any reasonable deadline.
-
-## Requirements
-
-1. **Checkpoint queryable artifacts before relationship detection.** Persist the descriptions +
-   embeddings into the semantic-layer artifacts as soon as the `embeddings` stage completes, before
-   the `relationships` stage runs. Relationship detection then appends/merges its own artifact on
-   completion. Net: the expensive LLM + embedding enrichment is **always durable and queryable**,
-   even if relationship detection fails, is interrupted, or is skipped. (A failed/partial
-   relationship stage should degrade to "no/partial joins", never to "no descriptions".)
-
-2. **Make stage resume work across runs.** Resolve a completed stage by stable content identity
-   — `(connectionId, stage, inputHash)` — independent of `runId`, so re-running an interrupted
-   ingest resumes the finished `descriptions`/`embeddings` stages from cache and only re-runs what
-   actually failed (e.g. `relationships`). Re-running after an interruption must not re-spend LLM
-   credits on stages that already succeeded.
-
-3. **Make relationship detection observable and bounded** (mirrors spec 16's bounded query
-   execution). Emit progress through the existing progress port — e.g. "Profiling table K/N",
-   "Validating candidate K/M" — so liveness is visible. Enforce an overall time/work budget
-   (configurable, e.g. under `scan.relationships`) so on a very large schema the stage stops
-   gracefully and returns the relationships found so far (partial) rather than running unboundedly.
-   Partial completion is persisted (per requirement 1) and marked as such.
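
Requirement 3 (progress plus a budget that yields partial results) can be sketched as a loop that checks the clock before each unit of work. This is an illustrative sketch only: `runWithinBudget` and its options are hypothetical names, the clock is injected so the behavior is deterministic, and the real stage would report through the existing progress port rather than a callback.

```typescript
interface BudgetedResult<T> {
  results: T[];
  partial: boolean; // true when the budget ran out before all items were processed
}

interface BudgetOptions {
  budgetMs: number;
  label: string; // e.g. "Profiling table"
  now: () => number; // injected clock (assumption, for testability)
  onProgress: (message: string) => void;
}

function runWithinBudget<I, T>(
  items: readonly I[],
  work: (item: I) => T,
  opts: BudgetOptions,
): BudgetedResult<T> {
  const start = opts.now();
  const results: T[] = [];
  let index = 0;
  for (const item of items) {
    // Stop gracefully and keep what was found so far.
    if (opts.now() - start >= opts.budgetMs) {
      return { results: results, partial: true };
    }
    index += 1;
    opts.onProgress(opts.label + " " + index + "/" + items.length);
    results.push(work(item));
  }
  return { results: results, partial: false };
}
```

The budget is only checked between items, so a single slow item can still overrun it; bounding an individual profiling query is the job of the query deadline from spec 16.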
-
-## Acceptance
-
- Interrupting an ingest **during** relationship detection still leaves a queryable semantic layer
-  with the table/column descriptions + embeddings that were generated (verified: re-open the
-  connection, descriptions are present).
- Re-running an interrupted ingest **does not** regenerate descriptions/embeddings whose stage
-  already completed (verified: no LLM description calls for the cached tables; only the failed
-  stage re-runs).
- A connection with hundreds of tables emits relationship-stage progress and completes within the
-  configured budget, persisting partial relationships if the budget is hit — without discarding
-  enrichment.
- Small/single-run ingests behave exactly as before (no regression in artifacts or relationship
-  output when nothing is interrupted).
-
-## Benchmark context (motivation only — do not encode benchmark specifics)
-
-The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables (`ebi_chembl`
-785, `fec` 486, `ga360` 366, …). Enriching them with claude-code costs real, rate-limited LLM
-budget; losing that enrichment to a relationship-stage interruption — and re-spending it on every
-retry — makes large-schema ingest impractical. This is a general durability/cost property of the
-ingest pipeline, independent of the benchmark; the benchmark only made it acute at scale.
--- a/spider2-specs/done/20-resilient-enrichment-under-slow-llm.md
+++ b/spider2-specs/done/20-resilient-enrichment-under-slow-llm.md
@@ -1,101 +0,0 @@
-# 20 — Resilient enrichment under a slow/hung LLM backend
-
-**Status:** draft (intake). Requirement-level; the implementer refines into `specs/20-*.md`.
-
-This is the **enrichment-stage** analog of two already-shipped specs:
- spec 16 (bounded query execution) — bound *and actually cancel* a runaway read query (child-thread/process kill, not a cosmetic JS deadline);
- spec 19 (durable/bounded relationship detection) — checkpoint expensive ingest work so an interruption doesn't lose it.
-
-Spec 16 hardened the **read-query** path and spec 19 checkpointed at **stage boundaries**. The same two
-weaknesses still exist *inside the descriptions enrichment stage*, and together they turned a single hung
-table into an indefinite wedge plus total loss of an entire stage's LLM work.
-
-## Problem / requirement
-
-Two compounding gaps on the per-table description-enrichment path, observed end-to-end:
-
-### 1. The per-table LLM timeout does not actually terminate the work
-
-The per-table `generateObject` enrichment call is wrapped in `retryAsync` with a fresh
-`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (ktx commit `01f63380`). When the LLM
-backend is a **subprocess** (the `codex` backend spawns a child `codex` process; `claude-code` likewise
-spawns a child) and that child **hangs with an open connection to the provider** (TCP ESTABLISHED, ~0%
-CPU, no bytes flowing), the JS-level `AbortSignal` fires but **does not kill the child process or unblock
-the await** — so the call sits *past* its own timeout indefinitely.
-
-Observed (BigQuery ingest, codex backend, 2026-06-23): with `KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min),
-two of `covid19_usa`'s widest tables (252 columns) hung; the stage sat at **268/285 for 41+ minutes** —
-well past the 30-min per-attempt timeout — with exactly two codex children, each holding 3 ESTABLISHED
-connections at ~0% CPU, until killed by hand. The timeout was cosmetic: it never terminated the hung
-child. (This is precisely the failure mode spec 16 fixed for SQL — a deadline that fires in JS but cannot
-interrupt the underlying work — applied to the enrichment LLM call instead of the query.)
-
-**Requirement:** the per-table enrichment-call timeout must be **enforced**, not advisory — when it fires,
-the in-flight work is actually cancelled (subprocess SIGKILL for process-backed providers; request abort
-for HTTP-backed ones) and the call returns/throws *promptly* so the stage can proceed (skip the table per
-the existing no-retry-on-timeout policy). A hung table must cost at most ~one timeout, never unbounded
-wall-clock. Provider-agnostic: it must hold for `codex`, `claude-code`, and HTTP backends alike.
-
-### 2. Descriptions are checkpointed only at full-stage completion, so a few bad tables lose all the good ones
-
-Spec 19 persists the descriptions checkpoint **after the descriptions stage completes** (before
-relationships). There is no *within-stage* persistence: while the stage runs, every enriched table's
-description lives only in memory. So if the stage cannot complete — e.g. 2 tables out of 285 hang (gap #1),
-or the process is killed, or it hits the stall watchdog — **all** the already-enriched tables are lost,
-even though their (expensive) LLM descriptions were finished.
-
-Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but **0** rows in
-`local_scan_enrichment_stages` and **0** `ai:` descriptions on disk; killing the wedged ingest discarded
-all 283, forcing a from-scratch re-ingest. The cost of 2 pathological tables was 283 tables' worth of
-redone LLM calls.
-
-**Sharper observation (re-ingest with a short, enforced timeout):** even when the stage *does* run to
-the end — the 2 hung tables hit a 4-min timeout and were skipped, so 283/285 descriptions were generated
-and the ingest reported success (`Scan completed` / `Ingest finished`, embeddings built, exit 0) — the
-descriptions were **still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the discard is **not** just
-"lost on kill": a stage that completes with *any* skipped/aborted table currently persists **nothing**,
-throwing away every successfully-generated description. The skip must be graceful — a skipped table costs
-one missing description, not the entire stage's output. (This is the strongest argument for per-table
-incremental persistence: the 283 good descriptions should have been durable the moment each was produced.)
-
-**Requirement:** persist enriched descriptions **incrementally** (per-table or per-batch) during the
-descriptions stage, so that (a) tables that finished are durable even if the stage never completes, and
-(b) a resumed ingest re-does only the *unfinished* tables, not the whole stage. The existing additive-write
-design (spec 19 already preserves existing descriptions on re-ingest) is the foundation; this extends the
-checkpoint granularity from once-per-stage to incremental.
-
-## Sketch (implementer to refine)
-
- **Enforced timeout:** route enrichment-call cancellation through real termination — kill the codex/
-  claude-code child process on timeout (reuse spec 16's child-kill mechanism), abort the HTTP request for
-  network backends. A fired `AbortSignal` must guarantee the await settles within a bounded grace period.
- **Sane default + the right tradeoff:** the default per-table timeout should be **moderate** (single-digit
-  minutes) with a small retry count, not very large — because the cost of a *hang* is the timeout value
-  itself, a long timeout is strictly worse for hangs. (The 30-min value used in the incident was an operator
-  override chosen to avoid cutting off slow-but-completing wide tables; with #1 enforced and incremental
-  checkpointing, a moderate default + skip is the better operating point.)
- **Incremental persistence:** flush descriptions per-batch (e.g. every N completed tables or on a timer) to
-  the same store/format used at stage completion; on resume, treat already-persisted tables as done and only
-  enrich the remainder. Keep it idempotent and additive (don't clobber prior descriptions).
- **Interaction with the stall watchdog:** with #1 enforced, no single table can starve progress for longer
-  than ~one timeout, so an external stall watchdog stops being the only backstop.
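
The incremental-persistence idea in the sketch above can be illustrated with a plain object standing in for the durable store. Everything here is hypothetical (`enrichIncrementally`, the `describe` callback returning `null` for a skipped table); it only shows the intended semantics: flush per batch, skip costs one description, and a re-run touches only unfinished tables.

```typescript
// `store` stands in for the durable description store (assumption for the sketch).
// `describe` returns null when a table is skipped, e.g. after an enforced timeout.
function enrichIncrementally(
  tables: readonly string[],
  store: Record<string, string>,
  describe: (table: string) => string | null,
  batchSize: number,
): { described: string[]; skipped: string[] } {
  const described: string[] = [];
  const skipped: string[] = [];
  let batch: Array<[string, string]> = [];

  // Additive write: only adds finished tables, never removes existing entries.
  const flush = (): void => {
    for (const [table, text] of batch) store[table] = text;
    batch = [];
  };

  for (const table of tables) {
    // Resume: anything already persisted counts as done.
    if (Object.prototype.hasOwnProperty.call(store, table)) continue;
    const text = describe(table);
    if (text === null) {
      skipped.push(table); // a skip costs one missing description, nothing more
      continue;
    }
    batch.push([table, text]);
    described.push(table);
    if (batch.length >= batchSize) flush();
  }
  flush(); // persist the final, possibly short, batch
  return { described: described, skipped: skipped };
}
```

A smaller `batchSize` narrows the window of work that can be lost on a kill at the cost of more writes; per-table flushing is the `batchSize = 1` case.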
-
-## Generic use case (independent of the benchmark)
-
-Anyone ingesting a large or wide schema with an LLM enrichment backend (especially a *subprocess* backend,
-which is the common local/desktop setup) will eventually hit a table whose description call hangs — a
-provider stall, a rate-limit black-hole, a pathologically large prompt. Without an *enforced* timeout, one
-such table wedges the whole ingest indefinitely; without *incremental* persistence, any interruption throws
-away all the per-table LLM work already done (the dominant ingest cost). Both fixes make large-schema
-enrichment **resilient and resumable** — a few bad tables degrade to a few skipped descriptions, not a
-hung process and a from-scratch redo. This is core robustness for a general-purpose ingestion product,
-wholly independent of any benchmark.
-
-## Benchmark context (motivation only — not a benchmark-specific rule)
-
-Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment backend). Re-enriching
-the giant public datasets, `covid19_usa` wedged at 268/285 for 41+ minutes on 2 hung 252-column tables; the
-30-min per-table `AbortSignal` timeout never killed the hung codex children, and because descriptions
-checkpoint only at stage completion, the 283 already-enriched tables were unrecoverable — the operator had
-to kill, cache-bust, and re-ingest the db from scratch (with a short timeout as a stopgap). The benchmark
-just exercised a large/wide multi-dataset ingest at scale; the gap and the fix are generic.
--- a/spider2-specs/done/21-selective-enrichment-stages.md
+++ b/spider2-specs/done/21-selective-enrichment-stages.md
@@ -1,91 +0,0 @@
-# 21 — Selective enrichment stages (`--stages`) + per-stage cache keys
-
-**Status:** draft (intake). Requirement-level; the implementer refines into `specs/21-*.md`.
-
-Follow-on to spec 19 (durable/resumable relationship detection) and spec 20 (resilient enrichment).
-Those made enrichment *survivable and resumable*; this makes it *selectively re-runnable* — re-run one
-enrichment stage without re-paying for the others.
-
-## Problem / requirement
-
-Enrichment has three stages — **`descriptions`** (per-table LLM text), **`embeddings`**
-(sentence-transformers over the schema/descriptions), **`relationships`** (FK/join detection, optionally
-LLM-proposed). Today you cannot re-run a *subset* of them, and three facts in the current code make a
-targeted re-run impossible without a full, expensive re-enrich:
-
-1. **One coarse cache key gates all three stages.** `context/scan/local-enrichment.ts:611` computes a
-   single `inputHash` from `{snapshot, mode, detectRelationships, providerIdentity, relationshipSettings}`,
-   and all three stages reuse it (descriptions ~`:641`, embeddings ~`:672`, relationships ~`:728`). So
-   changing *any* one stage's inputs invalidates *every* stage's cache. Concretely: flipping
-   `scan.relationships.llmProposals`, switching the LLM backend, or upgrading the embeddings model forces
-   ktx to re-run the **expensive per-table descriptions** even though they didn't conceptually change.
-2. **No CLI surface to select stages.** Enrichment already supports a relationships-only
-   path (`mode: 'relationships'`, which skips the description/embedding stages — they're gated on
-   `mode === 'enriched'`), but `ktx ingest` exposes no flag to invoke it (only `--no-query-history`).
-   The capability is built; it's just not reachable.
-3. **The per-stage storage already exists** (`local_scan_enrichment_stages` PK `(connection_id, stage,
-   input_hash)`) and the **additive write already preserves existing descriptions** on re-ingest — so the
-   foundation for "touch one stage, keep the rest" is in place; only the key granularity and the CLI
-   surface are missing.
-
-**Requirement:** let an operator re-run a chosen subset of enrichment stages on already-ingested
-connection(s), recomputing only those stages and **preserving the others' artifacts untouched** — cheaply,
-without re-running unchanged (especially the costly `descriptions`) stages.
-
-## Design decisions (resolved during intake; implementer may refine)
-
- **CLI flag: `--stages <comma-list>`** (plural). Accepts a comma-separated subset of
-  `descriptions,embeddings,relationships`; default = all three (current behaviour). Plural because it takes
-  a *set*; `--stages relationships` and `--stages descriptions,embeddings` both read naturally, and the
-  plural signals "list expected" (singular `--stage` implies exactly one). **Validate** the names — an
-  unknown stage is an error, never silently ignored.
- **Per-stage `inputHash`.** Split the single coarse hash so each stage keys on *only its own* inputs:
-  - `descriptions` → `{snapshot, mode, providerIdentity}` (NOT relationship settings, NOT embedding model)
-  - `embeddings`   → `{snapshot, embeddings model/provider, + the description text it embeds}`
-  - `relationships`→ `{snapshot, relationshipSettings (incl. llmProposals), providerIdentity}`
-  Then flipping `llmProposals` invalidates only `relationships`; swapping the embeddings model invalidates
-  only `embeddings`; improving description prompts/LLM invalidates only `descriptions`.
- **Preserve-others semantics.** Stages not named in `--stages` are left exactly as on disk (additive write,
-  already the behaviour). A selective run never deletes another stage's artifacts.
- **Downstream-staleness handling.** Stages have a dependency order (`descriptions → embeddings`;
-  `relationships` depends only on the schema snapshot). Re-running `descriptions` alone can leave existing
-  `embeddings` semantically stale (they embedded the old text). The run must **warn** when a selected
-  re-run leaves an unselected downstream stage stale, and the operator can opt to cascade
-  (`--stages descriptions,embeddings`). Do not silently leave a stale-but-unflagged downstream.
- **`relationships` uses existing descriptions as context.** When re-running `relationships` only, the
-  stage should read the existing enriched schema (incl. on-disk `ai:` descriptions) so `llmProposals` has
-  full context — not just raw column names.
- **Scope:** the three enrichment stages for now. Design the stage-name namespace so it can later extend to
-  the broader scan phases (schema / query-history / source / memory) and subsume the inconsistent
-  `--no-query-history` negative flag, but that unification is out of scope here.
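
The `--stages` validation and the downstream-staleness warning from the decisions above can be sketched as two small pure functions. The names (`parseStagesFlag`, `staleDownstream`, `DOWNSTREAM_OF`) are hypothetical; the only dependency edge encoded is the one the text states, `descriptions → embeddings`.

```typescript
const ENRICHMENT_STAGES = ["descriptions", "embeddings", "relationships"] as const;
type EnrichmentStage = (typeof ENRICHMENT_STAGES)[number];

// Dependency order from the text: embeddings embed the description text;
// relationships depend only on the schema snapshot.
const DOWNSTREAM_OF: Record<EnrichmentStage, EnrichmentStage[]> = {
  descriptions: ["embeddings"],
  embeddings: [],
  relationships: [],
};

function isEnrichmentStage(name: string): name is EnrichmentStage {
  return (ENRICHMENT_STAGES as readonly string[]).indexOf(name) !== -1;
}

// Parses the comma list; an omitted flag means all three stages (current behaviour).
function parseStagesFlag(flag?: string): EnrichmentStage[] {
  if (flag === undefined) return ENRICHMENT_STAGES.slice();
  const selected: EnrichmentStage[] = [];
  for (const raw of flag.split(",")) {
    const name = raw.trim();
    if (!isEnrichmentStage(name)) {
      // Unknown names are an error, never silently ignored.
      throw new Error('unknown stage "' + name + '" (expected: ' + ENRICHMENT_STAGES.join(", ") + ")");
    }
    if (selected.indexOf(name) === -1) selected.push(name);
  }
  return selected;
}

// Stages that a selected re-run leaves stale because they were not selected too.
function staleDownstream(selected: readonly EnrichmentStage[]): EnrichmentStage[] {
  const stale: EnrichmentStage[] = [];
  for (const stage of selected) {
    for (const dependent of DOWNSTREAM_OF[stage]) {
      if (selected.indexOf(dependent) === -1 && stale.indexOf(dependent) === -1) {
        stale.push(dependent);
      }
    }
  }
  return stale;
}
```

With this shape, `--stages descriptions` yields a non-empty stale list that the run can turn into a warning, while `--stages descriptions,embeddings` yields an empty one.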
-
-## Sketch (implementer to refine)
-
- Add `--stages` to `ktx ingest`; parse+validate into a stage set; thread it to the enrichment entry so it
-  selects which stage blocks run (reuse the existing `mode`/stage gating — `mode: 'relationships'` is the
-  precedent).
- Replace the single `computeKtxScanEnrichmentInputHash` call with per-stage hash computation keyed on each
-  stage's own inputs; gate each stage's resume/skip on its own hash.
- Ensure selective runs read + preserve the on-disk enriched schema and write additively.
- Emit a clear staleness warning when an unselected downstream stage is invalidated by a selected one.
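
The per-stage key split can be sketched as a function that builds each stage's key from only that stage's inputs. This is an assumption-laden illustration: `StageInputs` and `stageInputKey` are hypothetical, the field names simply mirror the lists in the design decisions, and a canonical JSON string stands in for whatever hash the real `inputHash` uses.

```typescript
type StageName = "descriptions" | "embeddings" | "relationships";

// Hypothetical input bundle; field names follow the design-decision lists.
interface StageInputs {
  snapshot: string;
  mode: string;
  providerIdentity: string;
  embeddingsModel: string;
  descriptionText: string;
  relationshipSettings: { llmProposals: boolean };
}

// Each stage keys on only its own inputs. A real implementation would hash
// this string; the JSON form is enough to show which inputs participate.
function stageInputKey(stage: StageName, inputs: StageInputs): string {
  switch (stage) {
    case "descriptions":
      return JSON.stringify([stage, inputs.snapshot, inputs.mode, inputs.providerIdentity]);
    case "embeddings":
      return JSON.stringify([stage, inputs.snapshot, inputs.embeddingsModel, inputs.descriptionText]);
    case "relationships":
      return JSON.stringify([
        stage,
        inputs.snapshot,
        inputs.relationshipSettings.llmProposals,
        inputs.providerIdentity,
      ]);
  }
}
```

Because `llmProposals` appears only in the `relationships` key, flipping it leaves the `descriptions` and `embeddings` keys, and therefore their cached stages, untouched.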
-
-## Generic use case (independent of the benchmark)
-
-Any team running ktx in production maintains its semantic layer over time: they improve description prompts
-or switch the description LLM, upgrade the embeddings model, or turn on LLM-proposed joins. Today each of
-those forces a **full re-enrich of every connection** — re-running the expensive per-table descriptions
-even when only embeddings or relationships changed. Selective `--stages` re-runs make these routine
-maintenance operations cheap and targeted: "re-embed everything on the new model" or "backfill joins now
-that llmProposals is on" become a single fast pass that leaves the untouched stages — and their cost —
-alone. This is core operability for a long-lived ingestion product and is wholly independent of any
-benchmark.
-
-## Benchmark context (motivation only — not a benchmark-specific rule)
-
-Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A level-aware audit found (a) a
-tail of BigQuery dbs with poor *column*-description coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …)
-that want a **`descriptions`-only** re-run with a longer timeout, and (b) a desire to **backfill joins**
-across all already-ingested dbs after enabling `llmProposals` — without re-paying for descriptions. Both
-were blocked by the coarse single `inputHash` (flipping `llmProposals` or re-describing would invalidate
-the whole enrichment) and the absence of a stage-selective CLI flag. The benchmark just exercised
-large-scale multi-backend ingestion; the gap and the fix are generic.
--- a/spider2-specs/specs/01-connection-scoped-wiki.md
+++ b/spider2-specs/specs/01-connection-scoped-wiki.md
@@ -1,300 +0,0 @@
-# Connection-scoped wiki pages
-
-> Refined spec. Intake draft: `todo/01-connection-scoped-wiki.md`.
-
-## Problem
-
-Wiki pages have only two scopes today: `GLOBAL` and `USER`
-(`packages/cli/src/context/wiki/types.ts`, `WikiScope`). Scope is expressed by
-directory (`wiki/global/<key>.md`, `wiki/user/<userId>/<key>.md`) and the
-search path filters by loading only the in-scope pages before any lane runs.
-There is no way to associate a page with a **connection** (a warehouse/database
-defined under `connections:` in `ktx.yaml`).
-
-In a project with many connections this causes two distinct failures:
-
-1. **Cross-database relevance pollution.** All pages share one search index, so
-   `wiki_search` for a generic term (`orders`, `revenue`, `average order
-   value`) surfaces pages written about the wrong database. Concept names
-   collide across databases constantly in real multi-connection projects
-   (several databases each with `orders`, `customers`, …).
-2. **Silent overwrite on shared keys.** Page keys are a flat, global namespace.
-   The write path resolves a repeated key to the existing file and updates it
-   in place. So if the agent writes an `orders` page while ingesting database B
-   and an `orders` page already exists for database A, B's content **overwrites
-   A's** — same-concept pages for different databases cannot coexist today.
-
-Today, when `memory_ingest` is called with a `connectionId`, that id only
-scopes which semantic-layer sources the triage agent can see
-(`memory-agent.service.ts`); it is **not** persisted on the resulting wiki page
-and **not** validated against `ktx.yaml`.
-
-## Generic use case
-
-Any org with multiple databases/warehouses in one **ktx** project: org-wide
-definitions ("fiscal year starts in February") should be visible everywhere,
-while database-specific conventions ("in the events DB, `user_id` is the
-anonymous device id, not the account id") should not pollute searches about
-other databases — and two databases that both have an `orders` concept must be
-able to keep separate, non-colliding pages.
-
-## Model
-
-`connections` is **additive frontmatter metadata**, orthogonal to the existing
-`GLOBAL`/`USER` directory scope — not a third scope dimension:
-
- A page is still `GLOBAL` or `USER` and lives where it lives today. It may
-  **additionally** carry a `connections` list.
- **Page keys remain a flat, globally-unique namespace.** `connections` does
-  **not** namespace keys; a page is addressable by key alone, unchanged.
- A page may list **multiple** connections.
- **Absent or empty `connections` ⇒ unscoped: the page applies to all
-  connections.** This is exactly today's behavior, so every existing page is
-  unaffected.
-
-This keeps `wiki_read` and refs untouched and adds no parallel scope axis;
-filtering by connection is purely a search/relevance concern.
-
-## Requirements
-
-### 1. Frontmatter field
-
-Add an optional `connections` field to wiki page frontmatter — a list of
-connection ids.
-
- Accept a single string too; normalize to a list at parse time (reuse the
-  existing array-coercion helper used for `tags`/`refs`/`sl_refs`).
- Round-trips through parse/serialize without loss.
- Absent or empty ⇒ unscoped (see Model). Existing pages are unaffected by
-  construction.
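
The parse-time normalization of `connections` can be sketched as follows. `normalizeConnections` is a hypothetical stand-in for the existing array-coercion helper, whose actual name and behavior are not shown here; dropping non-string entries and duplicates is an assumption of this sketch.

```typescript
// Hypothetical stand-in for the shared array-coercion helper.
// Accepts a single string or a list; absent or empty means unscoped ([]).
function normalizeConnections(raw: unknown): string[] {
  const items: unknown[] =
    raw === undefined || raw === null ? [] : Array.isArray(raw) ? raw : [raw];
  const out: string[] = [];
  for (const item of items) {
    if (typeof item !== "string") continue; // assumption: ignore non-string entries
    const id = item.trim();
    if (id !== "" && out.indexOf(id) === -1) out.push(id);
  }
  return out;
}
```

Returning an empty list for both the absent and the empty case keeps the "unscoped" rule in one place for every caller.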
-
-### 2. Page identity and key distinctness
-
-`connections` does not change how pages are identified or addressed:
-
- Keys stay flat and globally unique; `wiki_read(key)` is unchanged.
- Because the write path updates a page in place when its key already exists,
-  same-concept pages for different connections **MUST** use distinct keys
-  (e.g. `orders_sales_db` vs `orders_events_db`). Connection-distinctive keys
-  for database-specific pages are the primary mechanism (driven by write-path
-  prompt guidance, requirement 5).
- **Data-loss guard (code, not prompt):** a connection-scoped write whose key
-  matches an existing page whose `connections` scope is **disjoint** from the
-  incoming scope MUST surface a collision instead of silently overwriting the
-  existing page. (Updating a page within the same connection scope, or
-  broadening/narrowing its own `connections`, is a normal update — not a
-  collision.) The implementer owns whether the collision is a hard error or a
-  suffixed new key; it must not be a silent clobber.
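One possible shape for the data-loss guard, as a sketch under stated assumptions: `decideScopedWrite` and `WriteDecision` are hypothetical names, `incoming` is the already-resolved scope of the write, and treating an unscoped page on either side as a normal update is one reading of "disjoint", not something the spec spells out.

```typescript
type WriteDecision = "create" | "update" | "collision";

// Hypothetical guard: `existing` is undefined when no page has this key yet.
function decideScopedWrite(
  existing: string[] | undefined,
  incoming: string[],
): WriteDecision {
  if (existing === undefined) return "create";
  // Assumption: an unscoped page on either side is never a collision.
  if (existing.length === 0 || incoming.length === 0) return "update";
  // Any shared id means same-scope, broadening, or narrowing: a normal update.
  const overlaps = incoming.some((id) => existing.indexOf(id) !== -1);
  return overlaps ? "update" : "collision";
}
```

Whether a `"collision"` result becomes a hard error or a suffixed key is left to the caller, matching the latitude the requirement gives the implementer.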
-
-### 3. Search filtering
-
-Add an optional connection filter to the search surfaces:
-
- **MCP:** `wiki_search(query, connectionId?)` (`context-tools.ts`).
- **CLI:** `ktx wiki search` and `ktx wiki list` accept `--connection <id>`
-  (with `-c` alias), matching the `ktx sql` connection flag.
-
-Semantics:
-
- With `connectionId: X` ⇒ return pages whose `connections` is empty
-  (unscoped) **∪** pages whose `connections` contains X.
- Without ⇒ current behavior, all pages.
- The filter **MUST** apply uniformly to **all three search lanes** (lexical
-  FTS5, semantic/embedding, token fallback) at the **candidate-source level**,
-  so each lane draws its full candidate pool from the already-scoped set. It
-  **MUST NOT** be a post-filter on the merged/ranked results — that would let
-  off-scope candidates consume both the per-lane pool and the final result
-  limit unevenly.
-
-*Orientation:* the existing `GLOBAL`/`USER` scoping already filters at the
-disk-load step that feeds both the in-memory token lane and the synced SQLite
-index (`local-knowledge.ts`); the connection filter fits the same seam.
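The difference between a candidate-source filter and a post-filter can be sketched with a toy lane. Everything here is illustrative: `Page`, `Lane`, `substringLane`, and `scopedSearch` are hypothetical names, and the substring lane only stands in for the real lexical, semantic, and token lanes.

```typescript
interface Page { key: string; connections: string[]; text: string }

// Unscoped pages match every connection; scoped pages must list the id.
function matchesConnection(page: Page, connectionId?: string): boolean {
  if (connectionId === undefined) return true;
  return page.connections.length === 0 || page.connections.indexOf(connectionId) !== -1;
}

type Lane = (pool: Page[], query: string, limit: number) => string[];

// Toy stand-in for one lane: substring match, first `limit` hits.
const substringLane: Lane = (pool, query, limit) =>
  pool
    .filter((p) => p.text.toLowerCase().indexOf(query.toLowerCase()) !== -1)
    .slice(0, limit)
    .map((p) => p.key);

function scopedSearch(
  pages: Page[],
  lanes: Lane[],
  query: string,
  limit: number,
  connectionId?: string,
): string[] {
  // Scope once, before any lane runs, so every lane sees the same pool.
  const pool = pages.filter((p) => matchesConnection(p, connectionId));
  const merged: string[] = [];
  for (const lane of lanes) {
    for (const key of lane(pool, query, limit)) {
      if (merged.indexOf(key) === -1) merged.push(key);
    }
  }
  return merged.slice(0, limit);
}
```

With `limit = 1` and an off-scope page ranked first, a post-filter would return nothing because the off-scope hit consumed the only slot; filtering the pool first returns the in-scope page.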
-
-### 4. Index persistence
-
-The `.ktx/db.sqlite` knowledge index is re-synced from files on every search.
-The implementer owns whether to persist `connections` as index columns / a side
-table, or to filter the loaded page-set before the per-search sync. The binding
-requirement is the uniform-across-lanes behavior in requirement 3 — not a
-specific schema.
-
-*Trade-off note (non-binding):* filtering the loaded page-set re-syncs only the
-scoped subset and gives up a little embedding-cache reuse when searches
-alternate between connections (recompute is one embedding per scoped page per
-connection switch — negligible at the scale this targets). Persisting
-`connections` in the index avoids that at the cost of a schema addition and a
-per-lane predicate. Either is acceptable.
-
-### 5. Write path
-
- The memory agent's page-write tool (`wiki-write.tool.ts`) accepts a
-  `connections` input field with the same REPLACE semantics as
-  `tags`/`refs`/`sl_refs`: omit ⇒ keep existing on update; `[]` ⇒ clear to
-  unscoped; `[ids]` ⇒ set.
- When `memory_ingest` / the memory agent runs with a `connectionId`, prompt
-  guidance directs the agent to:
-  - set `connections: [connectionId]` on new **database-specific** pages, using
-    connection-distinctive keys; and
-  - leave `connections` empty for clearly **org-wide** content.
- This is **prompt guidance, not a code auto-default.** A connection-scoped
-  ingest must remain able to produce unscoped org-wide pages, so the tool must
-  not force the session's `connectionId` onto every page.
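The REPLACE semantics for the `connections` input can be sketched as a three-way rule. `resolveConnections` is a hypothetical name; the sketch assumes the input has already been normalized to a list or left undefined.

```typescript
// Hypothetical resolver: `input` is undefined when the caller omitted the field.
function resolveConnections(
  existing: string[] | undefined,
  input: string[] | undefined,
): string[] {
  // Omitted: keep whatever the page already has (unscoped for a new page).
  if (input === undefined) return existing ? existing.slice() : [];
  // Provided: replace wholesale. `[]` clears to unscoped; `[ids]` sets the scope.
  return input.slice();
}
```

Note that the session's `connectionId` does not appear anywhere in this function, which is the point of the last bullet: scoping is an explicit input, never an implicit default.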
-
-### 6. `wiki_read` and refs unchanged
-
-Pages remain addressable by key regardless of scoping. `wiki_read`, `refs`, and
-`sl_refs` semantics are unchanged; `connections` is a search/relevance concern
-only.
-
-### 7. Validation
-
-Validation behavior splits by surface, because an explicit argument is a
-typo-prone input while persisted content drifts independently of config:
-
- **Explicit argument** — a connection id supplied as a command/tool argument
-  (`wiki_search`/`memory_ingest` `connectionId`, `ktx wiki … --connection`)
-  MUST be validated against `ktx.yaml` connections and **rejected with a clear
-  error listing the configured ids** when unknown. Reuse the canonical
-  `project.config.connections[id]` check. This also closes the current gap
-  where `memory_ingest`'s `connectionId` is accepted unvalidated.
- **Persisted frontmatter** — a connection id that appears only in a stored
-  page's `connections` and is not in `ktx.yaml` MUST **warn (not fail)** during
-  validation/doctor, and MUST NOT break loading, searching, or reading that
-  page. Config and content can evolve independently.
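The two validation behaviors can be sketched side by side. Both function names and the `ConnectionsConfig` type are hypothetical, and the error wording is an assumption; only the reject-versus-warn split comes from the requirement.

```typescript
// Hypothetical stand-in for the `connections` record in `ktx.yaml`.
type ConnectionsConfig = Record<string, unknown>;

function isConfigured(config: ConnectionsConfig, id: string): boolean {
  return Object.prototype.hasOwnProperty.call(config, id);
}

// Explicit argument: an unknown id is rejected, and the error lists valid ids.
function assertKnownConnectionId(config: ConnectionsConfig, id: string): void {
  if (isConfigured(config, id)) return;
  const configured = Object.keys(config).sort().join(", ") || "(none)";
  throw new Error(`Unknown connection "${id}". Configured connections: ${configured}`);
}

// Persisted frontmatter: unknown ids produce warnings and never throw.
function persistedConnectionWarnings(
  config: ConnectionsConfig,
  pages: { key: string; connections: string[] }[],
): string[] {
  const warnings: string[] = [];
  for (const page of pages) {
    for (const id of page.connections) {
      if (!isConfigured(config, id)) {
        warnings.push(`Page "${page.key}" references unknown connection "${id}"`);
      }
    }
  }
  return warnings;
}
```

The asymmetry is deliberate: a mistyped argument should fail fast, while a stale id in stored content should never make a page unreadable.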
-
-### 8. Scope boundary
-
-This spec delivers the **mechanism** (frontmatter storage + uniform filter +
-write surface + validation). Driving the agent to actually pass `connectionId`
-during analytics work is the concern of
-`03-multi-connection-routing-in-analytics-skill`. It composes with the
-`--connection-id` flag on `ktx ingest` (see `02-verbatim-ingest-mode`).
-
-## Acceptance criteria
-
- A page with `connections: [db_a]` is returned by
-  `wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but
-  **not** by `wiki_search(query, connectionId: "db_b")`.
- A page with no `connections` field is returned in all three cases above.
- Two pages — `orders_sales_db` (`connections: [sales_db]`) and
-  `orders_events_db` (`connections: [events_db]`) — coexist; a search scoped to
-  `sales_db` returns the first and not the second, and neither overwrote the
-  other on write.
- A connection-scoped write whose key matches an existing page scoped to a
-  **different** connection surfaces a collision instead of silently
-  overwriting (data-loss guard, requirement 2).
- Filtering works in each lane independently (test with embeddings disabled to
-  exercise the lexical and token lanes alone).
- `memory_ingest(content, connectionId)` produces a page scoped to that
-  connection for database-specific content.
- `wiki_search`/`ktx wiki search --connection <unknown>` fails with an error
-  that lists the configured connection ids.
- A page whose `connections` references an id absent from `ktx.yaml` produces a
-  warning but stays searchable and readable; search and read do not throw.
- `connections` accepts either a single string or a list; both are normalized
-  to a list.
- Existing projects with no scoped pages and no `connectionId`/`--connection`
-  behave identically before/after.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the design.
-
- **Frontmatter type + parse/serialize:** `wiki/types.ts` (`WikiFrontmatter`),
-  `wiki/knowledge-wiki.service.ts` (`parsePage`/`serializePage`), array
-  coercion `wiki/local-knowledge.ts` (`stringArray`).
- **Search lanes + per-search re-sync:** `wiki/local-knowledge.ts`
-  (`searchLocalKnowledgePagesWithSqlite`; the disk-load step that already
-  scopes `GLOBAL`/`USER`; token lane), `wiki/sqlite-knowledge-index.ts`
-  (FTS5 `knowledge_pages_fts` lexical lane, semantic scan, `sync`).
- **MCP surface:** `mcp/context-tools.ts` (`wiki_search`, `wiki_read`,
-  `memory_ingest`; `connectionId` already present on `memory_ingest` but
-  unvalidated).
- **CLI surface:** `commands/knowledge-commands.ts`
-  (`ktx wiki search`/`list`/`read`); canonical `--connection` flag in
-  `commands/sql-commands.ts`; validation pattern
-  `project.config.connections[id]` in `mcp/local-project-ports.ts`.
- **Write path:** `wiki/tools/wiki-write.tool.ts` (input schema, REPLACE
-  semantics, scope decision), `memory/memory-agent.service.ts` (`connectionId`
-  threaded through the capture session and tool session;
-  `external_ingest` forces `GLOBAL` scope).
- **Connection config:** `context/project/config.ts` (`connections` record in
-  `ktx.yaml`).
-
-## Benchmark context (motivation only)
-
-Spider 2.0-Lite local subset = one project with ~30 SQLite connections whose
-schemas share table/concept names (Northwind, sakila, two e-commerce DBs…).
-External-knowledge docs (RFM definition, F1 overtake rules) are each relevant
-to exactly one database and must not surface for the other 29.
-
-## Implementation notes
-
-Shipped on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
-acceptance criteria covered; full package suite green (2924 passing),
-type-check, knip/biome dead-code, and pre-commit clean.
-
-**What was built / where**
-
-1. **Frontmatter field (req 1).** `connections?: string[]` added to
-   `WikiFrontmatter` (`context/wiki/types.ts`) and to the file-layer page model
-   `LocalKnowledgePage` (`context/wiki/local-knowledge.ts`). Parsed via a new
-   `stringList()` coercion (single string → list); round-trips through both
-   serializers. Absent/empty ⇒ unscoped.
-2. **Search/list filter (req 3, req 4).** `connectionId?` threaded through
-   `searchLocalKnowledgePages` → both the sqlite-FTS and scan impls →
-   `loadAllKnowledgePages`, and through `listLocalKnowledgePages`. The filter is
-   applied at the **disk-load seam** (`pageMatchesConnection`: unscoped ∪ pages
-   listing the id), so the token lane and the per-search SQLite sync (lexical +
-   semantic) both draw their candidate pool from the already-scoped set —
-   candidate-source level, not a post-filter.
-   - Chose req 4 **option B (filter the loaded page-set)** over persisting a
-     column. Verified-safe here: standalone ktx's memory agent reads pages from
-     files via a no-op `LocalKnowledgeIndex`, so `.ktx/db.sqlite`'s
-     `knowledge_pages` is a per-search cache that `searchLocalKnowledgePages`
-     rebuilds every call — scoping the sync corrupts no shared state. Only cost
-     is one embedding recompute per scoped page on a connection switch (the
-     spec's acknowledged, negligible trade-off). No index-schema change.
-3. **Page identity + data-loss guard (req 2).** Keys stay flat/global;
-   `wiki_read`/refs unchanged. The write tool (`wiki/tools/wiki-write.tool.ts`)
-   rejects (hard error, no silent clobber) a connection-scoped write whose
-   incoming `connections` is **disjoint** from a same-key existing page's
-   non-empty `connections`, suggesting a connection-distinctive key. Same-scope,
-   overlapping, broaden/narrow, and unscoped-existing updates are allowed.
-   Chose a hard error over auto-suffixing so the conflict reaches the agent
-   (the decision-maker) instead of silently forking the key namespace.
-4. **Write path (req 5).** `wiki_write` accepts `connections` (string or list)
-   with REPLACE semantics (omit ⇒ keep, `[]` ⇒ unscoped, `[ids]` ⇒ set); no
-   code auto-default of the session connection. Prompt guidance added to the
-   shared `wiki_capture` skill (new "Connection scoping" section) and the
-   `memory_agent_external_ingest` prompt. The session `connectionId` is now
-   surfaced to the agent so the guidance is actionable: in the memory-agent
-   prompt header and in the ingest work-unit `<context>` block
-   (`build-wu-context.ts`, fed from `ingest-bundle.runner.ts`).
-5. **Validation (req 7).** New shared helper
-   `context/connections/configured-connections.ts → assertConfiguredConnectionId`
-   validates explicit connection-id arguments against `ktx.yaml` and throws an
-   error listing the configured ids. Routed from all three explicit-arg
-   surfaces: MCP `wiki_search` (`local-project-ports.ts`), MCP `memory_ingest`
-   (validated at the boundary in `mcp-server-factory.ts` — this also closes the
-   prior gap where `memory_ingest`'s `connectionId` was accepted unvalidated),
-   and CLI `ktx wiki --connection`/`-c` (`commands/knowledge-commands.ts` +
-   `knowledge.ts`). Persisted-frontmatter ids absent from config are **warn-only**:
-   `listReferencedConnectionIds` + a non-fatal `ktx status` warning
-   (`status-project.ts`); loading/searching/reading never throw on them.
-
-**Deviations / notes**
-
- Req 1 says "reuse the existing array-coercion helper used for `tags`/`refs`".
-  That helper (`stringArray`) is array-only and does **not** coerce a single
-  string; added a dedicated `stringList` for `connections` to meet the
-  single-string acceptance criterion rather than change `stringArray`'s
-  behavior for the other fields.
- **Scope boundary kept:** `discover_data` (MCP) also searches wiki and already
-  takes `connectionId`, but req 3/8 scope the filter to `wiki_search` + CLI, so
-  its wiki lane is intentionally left unscoped. Worth a follow-up if
-  `discover_data`'s wiki results should also be connection-scoped for
-  consistency.
- MCP tools-list snapshot and the `mcp-server-factory` test were updated for the
-  new `wiki_search.connectionId` param and the `memory_ingest` validation
-  wrapper (the port is no longer the raw service object; it delegates).
--- a/spider2-specs/specs/02-verbatim-ingest-mode.md
+++ b/spider2-specs/specs/02-verbatim-ingest-mode.md
@ -1,327 +0,0 @@
-# Verbatim ingest mode for authoritative documents
-
-> Refined spec. Intake draft: `todo/02-verbatim-ingest-mode.md`.
-
-## Problem
-
-`ktx ingest --text/--file` routes captured content through the memory agent.
-`runKtxTextIngest` (`packages/cli/src/text-ingest.ts`) builds a
-`MemoryAgentInput` with `sourceType: 'external_ingest'` and hands it to
-`MemoryAgentService.ingest` (`context/memory/memory-agent.service.ts`), which
-runs a multi-step LLM triage loop (≈30-step budget, content clipped to ~48k
-chars) inside a session worktree. The agent decides — via the `wiki_write`
-tool — what to persist, so it may **rewrite, condense, split, or re-title** the
-content before it lands as a wiki page. The body is produced by an LLM, not
-copied by code.
-
-For *authoritative* documents — formula definitions, metric specs, runbooks,
-compliance text — paraphrasing is a defect, not a feature:
-
- exact thresholds, constants, and rule wording must survive unchanged;
- lexical (BM25/FTS5) search works best when the stored text matches the
-  phrasing users and agents query with;
- ingestion should be deterministic and reproducible — the same input file
-  yields the same page, and re-running is safe.
-
-Two further gaps block authoritative ingest today:
-
- The memory agent hard-requires an LLM backend
-  (`context/memory/local-memory.ts` throws when `llm.provider.backend: none`
-  and no runner is injected), so there is **no** offline ingest path at all.
- The agent's write tool *merges* a repeated same-scope key in place (REPLACE
-  frontmatter semantics in `wiki/tools/wiki-write.tool.ts`), i.e. exactly the
-  silent in-place rewrite an authoritative-document workflow must avoid.
-
-## Generic use case
-
-Any team ingesting documents that are already the source of truth: metric
-definition sheets, SLA documents, calculation-methodology docs, regulatory
-text. The user wants **ktx** to *index and surface* the document, not to
-re-author it. Today they work around the memory agent by hand-writing
-frontmatter and copying files into `wiki/global/`; verbatim mode makes that a
-first-class, supported `ktx ingest` workflow.
-
-## Model
-
-`ktx ingest --verbatim` is a **distinct, code-driven ingest path**, not a
-constrained prompt over the existing agent loop. Its defining invariants:
-
- **The stored page body is the input document body, written by code.** The LLM
-  never produces, edits, or relays the body. It is confined to generating
-  *metadata* about the body.
- **Behavior follows from inputs, not from a mode prompt.** Whether metadata is
-  LLM-generated or derived offline follows from the configured backend
-  (`llm.provider.backend`), not from a second user-facing switch.
- **Pages are `GLOBAL`-scoped.** Verbatim ingest targets org/project
-  authoritative docs (the content teams copy into `wiki/global/` today).
-  Connection association is expressed by the **additive `connections`
-  frontmatter** from spec 01, never by directory.
- **Deterministic and idempotent.** The page key, the merged frontmatter, and
-  the stored body are all functions of the input alone (given a fixed backend),
-  so the same input produces the same page and a re-run is a safe no-op.
-
-### "Byte-for-byte" scope
-
-The guarantee is on the document's **interior**: no paraphrase, no condense, no
-split, no re-title, no reflow, **no clipping**. The shared wiki store
-canonicalizes *surrounding* whitespace — `parsePage` trims the body and
-`serializePage` emits a single trailing newline
-(`wiki/knowledge-wiki.service.ts`) — so leading/trailing blank lines are
-normalized by the storage layer. Verbatim mode **MUST** write through that
-shared `writePage`/`serializePage` path rather than fork a parallel serializer;
-the interior bytes (thresholds, constants, wording) are what must be preserved
-exactly, and they are. Acceptance hashes compare the stored body against the
-**trimmed** input body.
-
-## Requirements
-
-### 1. Flag
-
-`ktx ingest --file <path> --verbatim` and `ktx ingest --text <content>
--verbatim`. `--verbatim` is a boolean that applies to every `--file`/`--text`
-item in the invocation; each item becomes its own page.
-
- It composes with the existing `--connection-id <id>` flag
-  (`commands/ingest-commands.ts`) so the resulting page can be
-  connection-scoped (see spec 01). **Note:** the intake draft wrote
-  `--connection`; the shipped flag is `--connection-id`. Use `--connection-id`.
- No new `--key` flag (see requirement 4). No second behavioral switch beyond
-  `--verbatim` itself.
-
-### 2. Body preservation is enforced by code, not by prompt
-
-The stored page body is the input content (interior preserved exactly, per
-**Model → "Byte-for-byte" scope**).
-
- Verbatim mode **MUST NOT** route the body through the memory-agent LLM loop
-  or any `wiki_write` tool call where a model could alter it.
- The LLM, when used, generates **only** metadata: `summary`, `tags`, and
-  `sl_refs`. A single constrained structured-output call (AI SDK v6
-  `generateObject` with a `zod` schema) is the intended mechanism — the full
-  memory-agent loop, worktree, and squash-merge are **not** required and should
-  not be used.
- The page key is **not** LLM-generated (requirement 4).
-
-### 3. No clipping of the stored body
-
-The ~48k clip may apply only to the text **sent to the LLM** for metadata
-generation. It **MUST NOT** apply to the text **written** to the page. A
-document larger than the clip limit is stored in full; only its metadata is
-derived from the clipped prefix.
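A minimal sketch of keeping the clip on the LLM side only. `splitForVerbatim` and `METADATA_CLIP_CHARS` are hypothetical names, and 48000 is a stand-in for the approximate "~48k" figure; the real constant may differ.

```typescript
// Assumption: stand-in for the approximate clip limit mentioned above.
const METADATA_CLIP_CHARS = 48000;

// The stored text is the full body; only the LLM input is clipped.
function splitForVerbatim(body: string): { stored: string; llmInput: string } {
  return {
    stored: body,
    llmInput: body.slice(0, METADATA_CLIP_CHARS),
  };
}
```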
-
-### 4. Deterministic page key
-
-The key is derived from the input, never chosen by the LLM (an LLM-chosen slug
-would break determinism and the requirement-6 idempotency guarantee):
-
- **`--file <path>`** → `suggestFlatWikiKey(basename without extension)`
-  (`wiki/keys.ts`). This is the primary document case and is always
-  deterministic.
- **`--text <content>`** → if the content opens with a Markdown heading, the
-  key is `suggestFlatWikiKey(heading text)`. If there is no leading heading,
-  **hard error**: inline verbatim text needs a leading heading to derive a
-  stable key, or should be passed as `--file`.
- No hash-based keys (unfindable) and no `--key` override flag. A real need for
-  explicit key control can add `--key` later.
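The derivation rules above might look like the sketch below. `standInKey` is a hypothetical replacement for `suggestFlatWikiKey`, whose real output shape is not shown in this spec, and `deriveKeySketch` and `VerbatimInput` are illustrative names. The sketch also assumes the heading check looks at the first non-blank line and ignores any frontmatter.

```typescript
type VerbatimInput =
  | { origin: "file"; path: string; content: string }
  | { origin: "text"; content: string };

// Assumption: stand-in slug rule; the real `suggestFlatWikiKey` may differ.
function standInKey(label: string): string {
  return label.toLowerCase().replace(/[^a-z0-9]+/g, "_").replace(/^_+|_+$/g, "");
}

function deriveKeySketch(input: VerbatimInput): string {
  if (input.origin === "file") {
    // Basename without its final extension.
    const parts = input.path.split(/[\\/]/);
    const base = parts[parts.length - 1] || "";
    return standInKey(base.replace(/\.[^.]+$/, ""));
  }
  // Inline text: require a leading Markdown heading on the first non-blank line.
  const firstLine = input.content.replace(/^\s+/, "").split(/\r?\n/)[0] || "";
  const heading = /^#{1,6}\s+(.+?)\s*#*\s*$/.exec(firstLine);
  const text = heading ? heading[1] : undefined;
  if (!text) {
    throw new Error("Verbatim --text input needs a leading heading, or pass it as --file");
  }
  return standInKey(text);
}
```

Neither branch consults an LLM or a hash, so the same input yields the same key on every run.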
-
-### 5. Frontmatter: passthrough + gap-fill
-
-If the input has its own YAML frontmatter, split it from the body: the body is
-everything after the closing `---`; the frontmatter is authoritative metadata.
-
- **Passthrough.** Every input frontmatter field is preserved in the stored
-  page, **including fields not in `WikiFrontmatter`** (`effective_date`,
-  `version`, `owner`, …). The serializer `YAML.stringify`s the object, so
-  unknown keys round-trip. Dropping them would be silent data loss on
-  authoritative docs.
- **Gap-fill only.** Generated/derived metadata fills **absent** fields only;
-  it **MUST NOT** overwrite an explicit value. An input `summary:` is never
-  replaced by a generated one; explicit `tags`/`sl_refs` are likewise kept.
- **Defaults.** `usage_mode` defaults to `auto` (findable via search, not
-  force-injected) when the input does not set it.
- **Connection scoping.** `--connection-id X` (validated via
-  `assertConfiguredConnectionId`, `context/connections/configured-connections.ts`)
-  sets `connections: [X]` when the input frontmatter does not already declare
-  `connections`. If the input frontmatter declares a **different**
-  `connections` than the flag, **hard error** (ambiguous intent) rather than
-  silently choosing one. If they match, or only one source is present, proceed.
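Passthrough, gap-fill, defaults, and the connection check can be combined in one merge step. The sketch below uses the hypothetical name `mergeVerbatimFrontmatter`; it treats `undefined` as "absent" and treats any declared `connections` other than exactly the flag's id as a conflict, which is a stricter reading than the spec requires.

```typescript
type Frontmatter = Record<string, unknown>;

function mergeVerbatimFrontmatter(
  input: Frontmatter,
  generated: Frontmatter,
  connectionId?: string,
): Frontmatter {
  // Passthrough: every input field survives, including unknown ones.
  const merged: Frontmatter = { ...input };
  // Gap-fill: generated values only land where the input left a field absent.
  for (const key of Object.keys(generated)) {
    if (merged[key] === undefined) merged[key] = generated[key];
  }
  if (merged["usage_mode"] === undefined) merged["usage_mode"] = "auto";
  if (connectionId !== undefined) {
    const declared = merged["connections"];
    if (declared === undefined) {
      merged["connections"] = [connectionId];
    } else {
      const list: unknown[] =
        typeof declared === "string" ? [declared] : Array.isArray(declared) ? declared : [];
      // Assumption: anything other than exactly [connectionId] counts as different.
      if (!(list.length === 1 && list[0] === connectionId)) {
        throw new Error(`Ambiguous connections: frontmatter and --connection-id ${connectionId} disagree`);
      }
    }
  }
  return merged;
}
```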
-
-### 6. Degraded mode (`llm.provider.backend: none`)
-
-`--verbatim` **MUST** work with no LLM backend; this is a capability the
-regular agent ingest lacks.
-
- `summary` is derived from the leading Markdown heading text, or, if none, the
-  first non-empty sentence of the body (trimmed to a reasonable length).
- `tags` and `sl_refs` are left empty.
- The body is still stored in full (requirement 3 applies unchanged).
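A possible offline derivation for `summary`, as a sketch only. `degradedSummarySketch` is a hypothetical name, the 160-character cap is an assumed value for "a reasonable length", and the sketch only inspects the first non-empty line, so a sentence that wraps across lines is cut at the line end.

```typescript
function degradedSummarySketch(body: string, maxLength: number = 160): string {
  // Find the first non-empty line of the body.
  let first = "";
  for (const line of body.split(/\r?\n/)) {
    if (line.trim().length > 0) {
      first = line.trim();
      break;
    }
  }
  let summary = first;
  const heading = /^#{1,6}\s+(.+?)\s*#*$/.exec(first);
  if (heading && heading[1]) {
    // Leading Markdown heading: use its text.
    summary = heading[1];
  } else {
    // Otherwise take the first sentence on that line, if one terminates there.
    const sentence = /^(.+?[.!?])(?:\s|$)/.exec(first);
    if (sentence && sentence[1]) summary = sentence[1];
  }
  // Assumption: cap the length; the spec leaves the exact limit open.
  return summary.length > maxLength ? summary.slice(0, maxLength).trim() : summary;
}
```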
-
-### 7. Key collisions: idempotent-if-identical, else hard error
-
-Verbatim mode does **not** reuse the agent write tool's in-place merge. Before
-writing, read any existing `GLOBAL` page at the derived key:
-
- **No existing page** → write.
- **Existing page, stored body identical** to the new body (compared after the
-  storage-layer normalization in **Model**) → **idempotent no-op success**
-  (re-running the same file is safe).
- **Existing page, body differs** → **hard error** naming the conflicting key
-  and directing the user to a distinct key. Never a silent overwrite, never an
-  auto-suffixed second page (which would produce the duplicated/divergent pages
-  this mode must avoid).
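The three collision outcomes reduce to a small decision function. In this sketch `decideVerbatimWrite` and `canonicalBody` are hypothetical names, and `trim()` is an assumed approximation of the storage-layer normalization described in **Model**.

```typescript
type VerbatimWrite =
  | { action: "write" }
  | { action: "noop" }
  | { action: "error"; message: string };

// Assumption: approximates the body trimming done by the shared wiki store.
function canonicalBody(body: string): string {
  return body.trim();
}

function decideVerbatimWrite(
  key: string,
  existingBody: string | undefined,
  newBody: string,
): VerbatimWrite {
  if (existingBody === undefined) return { action: "write" };
  if (canonicalBody(existingBody) === canonicalBody(newBody)) return { action: "noop" };
  return {
    action: "error",
    message: `Page "${key}" already exists with different content; use a distinct key`,
  };
}
```

There is deliberately no "overwrite" or "suffix" outcome in the union, so a caller cannot reach either by accident.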
-
-### 8. LLM-failure handling
-
-When a backend **is** configured but the metadata call fails (rate limit,
-transport error, malformed output after retries), **fail the item** (honoring
-`--fail-fast` and the per-item exit-code aggregation in `text-ingest.ts`).
-**MUST NOT** silently fall back to degraded derivation: a degraded page written
-on a transient error would, under requirement 7, refuse to be replaced by a
-healthy re-run — breaking reproducibility. Degraded derivation is reserved for
-`backend: none`.
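The no-fallback rule can be made structural rather than a convention. The sketch below is synchronous for brevity (a real metadata call would be asynchronous), and `resolveMetadata` and `Metadata` are hypothetical names.

```typescript
interface Metadata { summary: string; tags: string[]; sl_refs: string[] }

function resolveMetadata(
  backend: "none" | "configured",
  generate: () => Metadata,
  deriveDegraded: () => Metadata,
): Metadata {
  // Degraded derivation is reserved for the no-backend case.
  if (backend === "none") return deriveDegraded();
  // Deliberately no try/catch: a failed call propagates and fails the item.
  return generate();
}
```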
-
-### 9. Findability
-
-After write, the page is reindexed so search returns it:
-
- `wiki_search` for a phrase taken from the document body returns the page via
-  the lexical lane (the body is indexed in `buildKnowledgeSearchText`).
- `wiki_search` for a paraphrase of the document's topic returns it via the
-  semantic lane **when embeddings are enabled** (this is what the generated
-  `summary`/`tags` buy over a bare degraded page).
-
-## Acceptance criteria
-
- Ingesting a file with `--verbatim` produces a page whose body is
-  byte-identical to the trimmed input body (assert with a hash in tests).
- A >48k-char file is stored in full (assert the stored body length equals the
-  trimmed input length).
- Running the same `--verbatim` ingest twice is idempotent: one page, identical
-  bytes both times, no error on the second run.
- A second ingest to the same derived key with **different** body content fails
-  loudly (requirement 7) and does not modify the existing page or create a
-  suffixed one.
- Input frontmatter with an unknown field (e.g. `effective_date`) is preserved
-  in the stored page; an explicit input `summary` is **not** overwritten by a
-  generated one.
- With `llm.provider.backend: none`, `--verbatim` still produces a page: full
-  body stored, `summary` derived from the heading/first sentence, `tags` and
-  `sl_refs` empty.
- `--verbatim --connection-id X` yields a page with `connections: [X]`; an
-  unknown id is rejected with an error listing the configured ids. (Depends on
-  spec 01, now shipped.)
- `--verbatim --connection-id X` where the input frontmatter already declares a
-  different `connections` fails with an ambiguity error.
- `ktx ingest --text "no heading here" --verbatim` errors asking for a leading
-  heading or `--file`.
- `wiki_search` for a body phrase returns the page (lexical lane); for a topic
-  paraphrase it returns the page when embeddings are enabled (semantic lane).
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-module layout and design, subject to the invariants above.
-
- **Command flag:** `commands/ingest-commands.ts` (`ktx ingest` option table;
-  `--text`/`--file`/`--connection-id`/`--fail-fast` already present — add
-  `--verbatim` and thread it into `KtxTextIngestArgs`).
- **Orchestration:** `text-ingest.ts` (`runKtxTextIngest`, `loadItems`,
-  `validateItems`, per-item loop and exit-code aggregation). The verbatim flow
-  reuses item loading and replaces the `memoryIngest.ingest(...)` call with a
-  code-driven write for `--verbatim` items. Keep the new logic in a focused
-  module (e.g. a `verbatim-ingest` sibling) rather than swelling `text-ingest`.
- **Frontmatter split / write / serialize:** `wiki/knowledge-wiki.service.ts`
-  (`parsePage` for the `---…---` split shape, `serializePage`, `writePage`,
-  `readPage` for the collision check). Write through this shared path — do not
-  re-implement YAML framing.
- **Key derivation:** `wiki/keys.ts` (`suggestFlatWikiKey`, `assertFlatWikiKey`).
- **Frontmatter type:** `wiki/types.ts` (`WikiFrontmatter`; `summary` and
-  `usage_mode` are the required fields; unknown passthrough fields live
-  alongside).
- **Connection validation:** `context/connections/configured-connections.ts`
-  (`assertConfiguredConnectionId`, shipped with spec 01).
- **Metadata LLM call:** the local LLM runtime/config resolution in
-  `context/llm/` (e.g. `local-config.ts`; `backend: none` ⇒ no runtime). Use a
-  single `generateObject` call with a `zod` metadata schema; the `ai-sdk` skill
-  covers v6 patterns.
- **Reindex / search lanes:** `wiki/local-knowledge.ts`
-  (`loadAllKnowledgePages`, `buildKnowledgeSearchText`, the lexical/token/
-  semantic lanes) and `wiki/sqlite-knowledge-index.ts` (`sync`).
- **Tests:** extend `packages/cli/test/text-ingest.test.ts` and add a
-  verbatim-focused test file covering the acceptance criteria above.
-
-## Benchmark context (motivation only)
-
-Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket
-definitions, the haversine formula, F1 overtake rules, …). Gold SQL was
-authored against their **exact** text; an LLM paraphrase that drops a bucket
-boundary or rounds a constant loses the corresponding question. The current
-workaround is hand-writing frontmatter and copying files into `wiki/global/`.
-Verbatim mode turns that manual step into a supported **ktx** workflow, and
-composes with the connection scoping from spec 01 so a doc relevant to exactly
-one of the benchmark's ~30 SQLite databases does not surface for the other 29.
-
-## Implementation notes
-
-Shipped on branch `write-feature-spec-wiki`. All acceptance criteria are covered
-by tests and verified end-to-end through the linked `ktx-dev` binary.
-
-**What was built**
-
- New module `packages/cli/src/verbatim-ingest.ts`: `createLocalProjectVerbatimIngestor`
-  + `LocalVerbatimIngestor`, plus the pure helpers `splitInputDocument`,
-  `deriveVerbatimPageKey`, `deriveDegradedSummary`, and `buildVerbatimFrontmatter`
-  (the last four are `@internal` exports for unit testing).
- `--verbatim` flag added to `ktx ingest` in `commands/ingest-commands.ts`, with a
-  guard that rejects `--verbatim` without `--text`/`--file`. The flag is threaded
-  into `KtxTextIngestArgs.verbatim`.
- `text-ingest.ts` now tags each loaded item with an `origin`
-  (`file` / `text` / `stdin`) and, when `verbatim` is set, constructs the verbatim
-  ingestor once and branches the per-item loop to a code-driven write instead of
-  `memoryIngest.ingest(...)`. The shared view, exit-code aggregation, and
-  `--fail-fast` handling are reused.
-
-**Deviations from the literal spec (design refinements, per "implementer owns the design")**
-
- *Metadata call.* The spec suggested raw AI SDK v6 `generateObject`. The
-  implementation routes through the existing `KtxLlmRuntimePort.generateObject`
-  instead — it is implemented by all three backends (ai-sdk, claude-code, codex),
-  and the ai-sdk one already wraps `generateText` + `Output.object({schema})`.
-  This realizes the spec's "single constrained structured-output call" intent via
-  the canonical cross-backend path rather than forking a second LLM entry point.
- *Reindex (requirement 9).* In the standalone CLI, `searchLocalKnowledgePages`
-  rebuilds the SQLite index from disk on every call (recomputing embeddings for
-  changed pages), so a written page is findable without a dedicated reindex step.
-  The write still goes through the shared `KnowledgeWikiService.writePage` +
-  `syncSinglePage` path, so the page is also eagerly indexed.
- *Gap-fill optimization.* The LLM is skipped entirely when the input frontmatter
-  already supplies `summary`, `tags`, and `sl_refs` (generated metadata only fills
-  absent fields, so there is nothing to generate). A fully specified document thus
-  ingests with a configured backend without any LLM call.
-
-**Tests**
-
- `packages/cli/test/verbatim-ingest.test.ts` — helper units + ingestor integration
-  against a real `initKtxProject` git repo (byte-identical body hash, >48k no-clip,
-  idempotency, conflict hard-error, frontmatter passthrough, explicit-summary
-  preservation, degraded mode, connection scoping + unknown-id rejection +
-  ambiguity error, no-heading inline error, LLM gap-fill, LLM-failure-fails-item,
-  lexical + semantic findability).
- `packages/cli/test/text-ingest.test.ts` — verbatim routing, origin tagging,
-  connection-id forwarding, fail-fast.
- `packages/cli/test/index.test.ts` — `--verbatim` flag threading and the
-  requires-`--text`/`--file` guard.
-
-**Docs**
-
- `docs-site/content/docs/cli-reference/ktx-ingest.mdx` (flag, "Verbatim ingest"
-  section, examples, common errors) and
-  `docs-site/content/docs/guides/writing-context.mdx` (authoritative-document
-  workflow).
-
-**Verification**
-
- Full CLI suite: 2959 passed, 1 skipped. `pnpm run build` and `pnpm run dead-code`
-  (Biome + Knip default + production) clean; pre-commit clean on changed files.
-  A pre-existing, unrelated type error in `test/mcp-server-factory.test.ts` is
-  untouched — it predates this work.
--- a/spider2-specs/specs/06-scan-tolerate-broken-objects.md
+++ b/spider2-specs/specs/06-scan-tolerate-broken-objects.md
@ -1,361 +0,0 @@
-# Schema scan tolerates individual objects that fail introspection
-
-> Refined spec. Intake draft: `todo/06-scan-tolerate-broken-objects.md`.
-
-## Problem
-
-A single broken or inaccessible object zeroes out an entire connection's
-context. Schema introspection iterates objects with no per-object error
-handling, so one throw aborts the whole scan, the live-database adapter's
-`fetch()` rejects, and the connection ends with **no semantic layer at all** —
-even when every other object was healthy.
-
-The failure surfaces in two phases, and the contract must hold in both:
-
- **Metadata read (sqlite).** `connectors/sqlite/connector.ts` does
-  `rawTables.map((t) => this.readTable(...))` (≈ line 171) with no try/catch.
-  `readTable` runs `PRAGMA table_info(<object>)`, which *executes* a view's
-  body to resolve its columns — so a view over a dropped/renamed column (the
-  `oracle_sql` case: `emp_hire_periods_with_name` selecting `ehp.start_date`
-  from a base table that has no such column) raises `no such column:
-  ehp.start_date` and aborts introspection of all ~48 healthy objects.
- **Profiling read (warehouse drivers).** postgres/mysql/clickhouse/sqlserver/
-  bigquery/snowflake read metadata in bulk from catalog / `information_schema`
-  (a broken view rarely breaks that), then fail when a per-object profiling or
-  sampling `SELECT` runs against a broken object. Enrichment sampling is
-  *already* isolated (`description-generation.ts` wraps `sampleTable` in
-  try/catch → `sampling_failed`), but mandatory introspection-phase reads are
-  not uniformly isolated across drivers.
-
-A second, related defect blocks the documented escape hatch. Setting
-`enabled_tables: ["main.customers"]` on a sqlite connection produces a
-different hard failure — `Adapter "database schema" did not recognize fetched
-source output`. Root cause: the sqlite connector emits every object as
-`{ db: null }` and filters the scope with `scopedTableNames(scope, { db: null })`
-(`context/scan/table-ref.ts` ≈ line 47, `if (ref.db !== wantDb) continue`), but
-`"main.customers"` parses to `{ db: "main", name: "customers" }`
-(`context/scan/enabled-tables.ts`, `parseDottedTableEntry`). `"main" !== null`,
-so the entry matches **nothing**, zero table files are written, and
-`detectLiveDatabaseStagedDir` (`stage.ts` ≈ line 138) returns false, tripping
-the generic "did not recognize fetched source output" error at
-`context/ingest/local-stage-ingest.ts` (≈ line 291). The bare form
-`enabled_tables: ["customers"]` would have worked; the `main.`-qualified form
-silently matches nothing.
-
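The `enabled_tables` mismatch described above can be reproduced in a few lines. This is a minimal sketch, not the ktx code: `parseEntry` and `scopedNames` are hypothetical stand-ins for `parseDottedTableEntry` and `scopedTableNames`, and only the bare and two-part forms are modeled.

```typescript
// Hypothetical, simplified model of the scope filter; not the real ktx types.
type TableRef = { db: string | null; name: string };

// Assumed parse behavior: "main.customers" -> { db: "main", name: "customers" },
// and a bare "customers" -> { db: null, name: "customers" }.
function parseEntry(entry: string): TableRef {
  const dot = entry.lastIndexOf(".");
  if (dot === -1) return { db: null, name: entry };
  return { db: entry.slice(0, dot), name: entry.slice(dot + 1) };
}

// Mirrors the `ref.db !== wantDb` filter: entries for another db are skipped.
function scopedNames(entries: string[], wantDb: string | null): string[] {
  return entries
    .map(parseEntry)
    .filter((ref) => ref.db === wantDb)
    .map((ref) => ref.name);
}

// The sqlite connector asks for `db: null`, so only the bare form matches.
const qualified = scopedNames(["main.customers"], null);
const bare = scopedNames(["customers"], null);
console.log(qualified, bare);
```

Under these assumptions the qualified entry yields an empty scope while the bare entry selects `customers`, which is the silent zero-match the spec sets out to remove.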
-## Generic use case
-
-Real warehouses routinely contain broken or inaccessible objects: views over
-dropped/renamed columns, views referencing tables the connection role can't
-read, permission-denied tables, and vendor system views that error on read.
-**ktx** should ingest everything it *can* and skip what it can't, so one bad
-object never zeroes out an entire connection's context. This is baseline
-production robustness, independent of any benchmark — the same tolerance a
-33-warehouse fleet needs the first time one of its databases has a stale view.
-
-## Design
-
-The unit of failure is **one object** (table or view). Introspecting or
-profiling an object is an operation that can fail independently; a failure skips
-that object, records a recoverable warning, and the scan continues from the
-objects that succeeded.
-
-Because seven Node connectors and the Python daemon each introspect differently
-(sqlite reads metadata per-object via `PRAGMA`; warehouse drivers read metadata
-in bulk and fail per-object during profiling), the **semantics** of "skip /
-warn / total-failure" are defined **once** and every connector routes through
-them — rather than seven copies of the same try/catch that drift apart:
-
- A shared per-object helper in the `scan/` layer — the sibling of the existing
-  `tryConstraintQuery` (`context/scan/constraint-discovery.ts`) — wraps a single
-  object read and returns `{ ok: true, table } | { ok: false, warning }`, with a
-  standard warning code (e.g. `object_introspection_failed`).
- A shared post-check enforces the total-failure rule (R3) uniformly.
- Each connector keeps its **natural** shape: sqlite routes each `readTable`
-  through the helper; bulk-read drivers route their per-object profiling reads
-  through it. The contract is uniform; the loop is not forced to be.
- The Python daemon implements the **same contract** in its own helper, adds a
-  `warnings` field to `DatabaseIntrospectionResponse`, and the Node adapter maps
-  those warnings into `KtxSchemaSnapshot` (`daemon-introspection.ts`).
-
-The warning channel already exists end to end on the Node side
-(`KtxSchemaSnapshot.warnings`, the `KtxScanWarning` shape with `table`/`column`/
-`recoverable`, the `KtxScanWarningCode` enum, and the staged `warnings.json`
-artifact written by `writeLiveDatabaseSnapshot`); sqlite simply never populates
-it. This spec makes that channel carry object-skip warnings and surfaces them in
-the ingest summary, the persisted report body, and `ktx status`.
-
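The shared per-object helper from the Design section can be sketched as follows. This is an illustration under stated assumptions: `tryObject`, `scanObjects`, and the simplified `ScanWarning` type are hypothetical names, the reads are synchronous for brevity, and redaction is omitted.

```typescript
// Hypothetical, trimmed-down warning shape; the real `KtxScanWarning` has more fields.
type ScanWarning = {
  code: "object_introspection_failed";
  table: string;
  message: string;
  recoverable: true;
};

type ObjectResult<T> = { ok: true; table: T } | { ok: false; warning: ScanWarning };

// Wraps one object read: a throw becomes a recoverable warning, not a scan abort.
function tryObject<T>(name: string, read: () => T): ObjectResult<T> {
  try {
    return { ok: true, table: read() };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return {
      ok: false,
      warning: { code: "object_introspection_failed", table: name, message, recoverable: true },
    };
  }
}

type Table = { name: string; columns: string[] };

// Each connector keeps its own loop and routes every object through the helper.
function scanObjects(names: string[], readColumns: (name: string) => string[]) {
  const tables: Table[] = [];
  const warnings: ScanWarning[] = [];
  for (const name of names) {
    const result = tryObject(name, () => ({ name, columns: readColumns(name) }));
    if (result.ok) tables.push(result.table);
    else warnings.push(result.warning);
  }
  return { tables, warnings };
}

// One healthy table plus the broken view from the `oracle_sql` case.
const snapshot = scanObjects(["customers", "emp_hire_periods_with_name"], (name) => {
  if (name === "emp_hire_periods_with_name") throw new Error("no such column: ehp.start_date");
  return ["id", "name"];
});
console.log(snapshot.tables.length, snapshot.warnings.length);
```

Returning a discriminated union rather than throwing keeps the skip decision at the call site, so a connector that reads in bulk can wrap only the per-object step it actually performs.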
-## Requirements
-
-### R1 — Per-object isolation (the contract)
-
-If introspecting or profiling one object throws, the scan **MUST** skip that
-object, record a `KtxScanWarning` (object name, the error message, and any
-schema/catalog qualifier; `recoverable: true`), and continue with the remaining
-objects. No single object may abort the scan.
-
- The contract holds in **both** phases: the mandatory metadata read *and* any
-  profiling/row-count/sample read performed during introspection.
- It holds for **all seven Node connectors**
-  (`packages/cli/src/connectors/<driver>/`) and the **Python daemon** postgres
-  path (R6).
- The semantics are defined once (the shared helper + warning code from the
-  Design section) and every connector routes through them. Do not inline a
-  divergent per-driver copy.
- Warnings **MUST NOT** carry secrets or full SQL bodies; record the object
-  identifier and the database's error text, redacted through the existing
-  `redactKtxSensitiveMetadata` path that `warnings.json` already uses.
-
-### R2 — Surface, don't hide
-
-Skipped objects **MUST** be reported both at ingest time and in the durable
-status view:
-
- **Ingest summary.** The `ktx ingest` run summary (human-facing output) reports
-  a count plus the object name and a short reason for each skip — e.g.
-  `Skipped 1 object — emp_hire_periods_with_name: no such column ehp.start_date`.
- **Run report.** Object skips land in the run report's `warnings.json` artifact
-  (already written) and in the persisted report body (`IngestReportBody`), whose
-  natural home is the existing `fetch?: SourceFetchReport` field — the fetch
-  phase *is* introspection.
- **`ktx status`.** `ktx status` shows a per-connection skipped-objects line for
-  the connection's latest ingest — e.g. `oracle_sql: 1 object skipped —
-  emp_hire_periods_with_name: no such column ehp.start_date`. This is **derived
-  from the latest persisted report, not new persisted state**: the report body
-  is already stored whole as a JSON blob (`local_ingest_reports.body_json`), so
-  surfacing it requires **no `.ktx/db.sqlite` schema migration** — `status`
-  reads and renders the skip info already present in the latest report body. A
-  connection whose latest ingest skipped nothing shows no such line.
-
-### R3 — Failure semantics (partial vs total)
-
-Per-object skipping is **unconditional** — there is **no new config knob**, and
-the existing `ingest.workUnits.failureMode` (which governs the later LLM
-work-unit stage, not introspection) is untouched and orthogonal. Outcomes are
-derived from object counts, not from a mode:
-
-| Scope | Objects discovered / matched | Introspection outcome | Result |
-| --- | --- | --- | --- |
-| none | 0 | n/a (legitimately empty DB) | **success**, empty layer |
-| none | N > 0 | ≥ 1 succeeds | **success** + warnings for the rest |
-| none | N > 0 | all N fail | **connection failure** (clear error) |
-| `enabled_tables` | matches 0 objects | n/a | **clear scope error** (R5) |
-| `enabled_tables` | matches M > 0 | ≥ 1 succeeds | **success** + warnings |
-| `enabled_tables` | matches M > 0 | all M fail | **connection failure** |
-
- "Connection failure" means the connector / `fetch()` raises a **clear,
-  actionable error** for that connection. It **MUST NOT** surface as the generic
-  `did not recognize fetched source output` (that message is reserved for a
-  genuinely unrecognized staged dir, not an empty/total-failure result).
- A total failure of one connection follows existing per-connection ingest
-  orchestration for whether sibling connections continue; this spec does not
-  change cross-connection behavior.
-
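The R3 table reduces to a small decision over counts. The sketch below is one possible encoding, assuming a hypothetical `deriveOutcome` function and outcome labels that are not taken from the ktx codebase.

```typescript
// Hypothetical outcome labels, one per distinct "Result" cell in the R3 table.
type Outcome = "success" | "success_with_warnings" | "connection_failure" | "scope_error";

type ScanCounts = {
  scoped: boolean; // true when a non-empty `enabled_tables` is configured
  matched: number; // objects discovered (no scope) or matched by the scope
  succeeded: number; // objects whose introspection succeeded
};

function deriveOutcome({ scoped, matched, succeeded }: ScanCounts): Outcome {
  // Nothing to introspect: an empty database is fine, an empty scope is an error.
  if (matched === 0) return scoped ? "scope_error" : "success";
  // Objects existed but none could be read.
  if (succeeded === 0) return "connection_failure";
  // At least one object succeeded; any failures ride along as warnings.
  return succeeded < matched ? "success_with_warnings" : "success";
}

const emptyDb = deriveOutcome({ scoped: false, matched: 0, succeeded: 0 });
const oneBrokenView = deriveOutcome({ scoped: false, matched: 3, succeeded: 2 });
const allBroken = deriveOutcome({ scoped: true, matched: 2, succeeded: 0 });
const zeroMatch = deriveOutcome({ scoped: true, matched: 0, succeeded: 0 });
console.log(emptyDb, oneBrokenView, allBroken, zeroMatch);
```

Because the result depends only on counts and on whether a scope was set, no failure-mode setting is involved, which matches the "no new config knob" rule.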
-### R4 — A broken view never blocks base tables
-
-A broken view **MUST NEVER** prevent base-table ingest.
-
- View introspection failures are isolated exactly like any other object (R1).
- Mandatory introspection **MUST** prefer reading an object's structure from the
-  catalog where possible over executing the object's body, and **MUST NOT** run
-  a data-reading query (row count, sample) against a view as a required step.
-  (sqlite already skips `COUNT(*)` for views; the remaining gap is isolating the
-  metadata read that executes the view definition.)
-
-### R5 — `enabled_tables` allowlist works
-
-The documented allowlist escape hatch **MUST** reliably restrict the scan to the
-listed objects, with no spurious adapter error:
-
- **sqlite qualification.** The schema-qualified form `"main.<name>"` **MUST**
-  resolve to the same object as the bare form `"<name>"` (sqlite's sole schema
-  is `main`; the connector emits `db: null`). Both forms select the object;
-  neither silently matches nothing.
- **Documented format.** The accepted qualification forms for each driver
-  (`catalog.db.name` / `db.name` / `name`) and the sqlite-specific `main`
-  equivalence **MUST** be documented where `enabled_tables` is described
-  (`context/project/driver-schemas.ts` and the user-facing config docs).
- **Zero-match is a clear error.** A non-empty `enabled_tables` that resolves to
-  **zero** matched objects **MUST** fail with an actionable error naming the
-  connection, the unmatched entries, and the available object names — **not** the
-  generic `did not recognize fetched source output`. This is distinct from a
-  legitimately empty database (R3 row 1) and from a matched-but-all-broken scope
-  (R3 last row).
- **Any subset works.** An `enabled_tables` matching M > 0 objects ingests
-  **exactly** those M objects (minus any that fail per R1), with no adapter
-  recognition error regardless of how small or edge-case the set is.
-
-### R6 — Python daemon parity
-
-The daemon's postgres introspection path **MUST** honor the same contract:
-
- Add a `warnings` field to `DatabaseIntrospectionResponse`
-  (`python/ktx-daemon/src/ktx_daemon/database_introspection.py`) carrying the
-  same shape Node expects (code, message, object identifier, recoverable).
- Isolate per-object failures in the daemon's introspection so one broken object
-  does not abort the response; apply the R3 total-failure rule there too.
- Map daemon warnings into `KtxSchemaSnapshot.warnings` in
-  `mapDaemonSnapshot` (`context/ingest/adapters/live-database/daemon-introspection.ts`),
-  which currently drops them.
- The Node and Python warning shapes **MUST** stay in parity (the codebase
-  already mirrors Node↔Python schemas for telemetry; follow the same discipline
-  so the daemon cannot emit a code Node can't render).
-
-## Acceptance criteria
-
- Ingesting a sqlite DB with one broken view + N healthy tables yields a
-  semantic layer for the N healthy tables and **exactly one** warning naming the
-  broken view and its error; exit is **success**.
- The skipped object appears in the `ktx ingest` summary output, in the run's
-  `warnings.json`, and in `ktx status` as a per-connection skipped-objects line
-  on the connection's latest ingest.
- A sqlite DB in which **every** discovered object fails introspection (and the
-  file opens) exits as a **connection failure** with a clear error — not an
-  empty "success" and not `did not recognize fetched source output`.
- A genuinely empty sqlite DB (zero objects) exits **success** with an empty
-  layer (not a failure).
- `enabled_tables: ["main.customers"]` and `enabled_tables: ["customers"]` both
-  ingest exactly the `customers` object on a sqlite connection.
- `enabled_tables` restricted to a valid subset of M objects ingests exactly
-  that subset, with **no** adapter-output error.
- `enabled_tables` that matches zero objects fails with an error naming the
-  connection, the unmatched entries, and available objects — distinguishable
-  from the empty-DB and all-broken cases.
- A broken view does not prevent ingest of base tables in the same connection
-  (regression test with a view that errors on read alongside a healthy table).
- The daemon's `DatabaseIntrospectionResponse` carries a `warnings` array, and a
-  per-object failure in the daemon path produces a warning mapped into
-  `KtxSchemaSnapshot.warnings` (Node↔Python parity test).
- A warehouse-driver object whose profiling/sample read fails is skipped with a
-  warning and does not abort introspection of its siblings.
- Existing healthy-only ingests (no broken objects, no `enabled_tables`) behave
-  identically before/after — no warnings, same semantic layer.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the design.
-
- **Shared semantics:** `context/scan/constraint-discovery.ts`
-  (`tryConstraintQuery` / `constraintDiscoveryWarning` — the precedent to mirror
-  for the per-object helper), `context/scan/types.ts`
-  (`KtxSchemaSnapshot.warnings`, `KtxScanWarning`, `KtxScanWarningCode` — add the
-  new object-skip code here).
- **Node connectors:** `packages/cli/src/connectors/<driver>/connector.ts` and
-  each `live-database-introspection.ts`. sqlite's loop is
-  `connectors/sqlite/connector.ts` `introspect` (≈ line 158) → `readTable`
-  (≈ line 306); the missing try/catch is the `rawTables.map(...)` at ≈ line 171.
-  Existing per-table sample isolation precedent: `description-generation.ts`
-  (≈ line 867, `sampling_failed`).
- **Driver dispatch:** `packages/cli/src/local-adapters.ts` (≈ lines 122-156)
-  routes every driver to its Node connector; the daemon is the `else` fallback.
- **`enabled_tables` matching:** `context/scan/enabled-tables.ts`
-  (`resolveEnabledTables`, `parseDottedTableEntry`), `context/scan/table-ref.ts`
-  (`scopedTableNames`, the `ref.db !== wantDb` filter ≈ line 47),
-  `context/project/driver-schemas.ts` (`enabled_tables` schema + description).
- **Staging / detect / error surface:**
-  `context/ingest/adapters/live-database/stage.ts`
-  (`writeLiveDatabaseSnapshot`, `warningArtifact` ≈ line 94,
-  `detectLiveDatabaseStagedDir` ≈ line 138),
-  `context/ingest/local-stage-ingest.ts` (the
-  `did not recognize fetched source output` throw ≈ line 291 — must stop being
-  the surface for empty-scope and total-failure).
- **Ingest summary:** `packages/cli/src/ingest.ts` (`writeReportStatus`
-  ≈ line 202), `context/ingest/memory-flow/summary.ts`
-  (`formatMemoryFlowFinalSummary`) — thread object skips into the human-facing
-  summary.
- **Report body + `ktx status`:** `context/ingest/reports.ts` (`IngestReportBody`;
-  `SourceFetchReport` as the home for scan warnings),
-  `context/ingest/sqlite-local-ingest-store.ts` (the report body is persisted
-  whole as `body_json` ≈ line 90 — no migration needed), `status-project.ts`
-  (`buildLocalStatsStatus` reads `local_ingest_reports`; parse the latest body
-  per connection and render the skipped line via `renderLocalStatsAsLines`).
- **Daemon path:** `python/ktx-daemon/src/ktx_daemon/database_introspection.py`
-  (`DatabaseIntrospectionResponse` ≈ line 165, `introspect_database_response`
-  ≈ line 323, `_load_postgres_rows` ≈ line 227, `_map_rows_to_tables`
-  ≈ line 267), and the Node mapping in
-  `context/ingest/adapters/live-database/daemon-introspection.ts`
-  (`mapDaemonSnapshot` ≈ line 209).
-
-## Benchmark context (motivation only)
-
-`oracle_sql` (8 of the 135 local sqlite questions) currently has **no** semantic
-layer because of its one broken view, so those questions fall back to raw
-`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning
-restores enriched context for that database. The same robustness is required for
-the full Spider 2.0-Lite run across BigQuery and Snowflake, where broken or
-permission-restricted objects are common and a single one must not zero out a
-warehouse's context.
-
-## Implementation notes
-
-Shipped on branch `write-feature-spec-wiki`. All requirements implemented;
-verified with `pnpm --filter @kaelio/ktx run test` (2981 passing),
-`pnpm run dead-code`, `uv run pytest python/ktx-daemon/tests` (97 passing),
-`uv run pre-commit`, and `pnpm run build && pnpm run link:dev`.
-
-**Shared semantics (R1).** New `context/scan/object-introspection.ts` exposes
-`tryIntrospectObject(ctx, fn)` (sibling of `tryConstraintQuery`), returning
-`{ ok: true, table } | { ok: false, warning }` and building an
-`object_introspection_failed` warning (object name + redactable DB error). It
-rethrows native programming faults (`isNativeProgrammingFault`) so a ktx bug is
-never masked as an object skip. The new warning code was added to
-`KtxScanWarningCode` (`scan/types.ts`), the `scanWarningCodes` allowlist
-(`local-structural-artifacts.ts`, plus a new exported `isKtxScanWarningCode`
-validator), and `describeWarningGroup` (`scan.ts`).
-
-**Per-object isolation, where it actually exists (R1/R4).** Only sqlite
-(`readTable` via `PRAGMA`) and bigquery (`tableRef.get()` per dataset) do
-per-object reads during *mandatory* introspection; both now route each object
-through `tryIntrospectObject`. The other five Node connectors (postgres, mysql,
-clickhouse, sqlserver, snowflake) read metadata in bulk from the catalog/
-`information_schema` (already object-safe at this phase) and isolate per-object
-profiling/sampling in the enrichment phase (`description-generation.ts`,
-`sampling_failed`), so no divergent per-driver try/catch was added there. sqlite
-also tolerates a `COUNT(*)` (profiling) failure without dropping a
-structurally-readable table, and a broken view's metadata read is isolated so it
-never blocks base tables (R4).
-
-**Single-source outcome decision (R3/R5).** New
-`adapters/live-database/scan-outcome.ts#assertLiveDatabaseScanOutcome` runs once
-in `LiveDatabaseSourceAdapter.fetch()` — the one path every driver (and the
-daemon) routes through — and derives the outcome from the snapshot + scope:
-≥1 object → success (skips ride along as warnings); all matched objects failed →
-clear `KtxExpectedError`; non-empty `enabled_tables` matched nothing → clear
-zero-match error naming the connection, the requested entries, and the available
-objects (sqlite/bigquery attach the discovered inventory via
-`metadata.discovered_object_names`); empty database (no scope) → success with an
-empty layer. `detectLiveDatabaseStagedDir` no longer requires table files, so a
-valid empty staging is recognized; total-failure/zero-match now throw a clear
-connection error before staging instead of surfacing the generic
-`did not recognize fetched source output`.
-
-**`enabled_tables` matching (R5).** Normalized at the scope boundary in
-`resolveEnabledTables` using `connection.driver`: for sqlite, `main.<name>` →
-`{ db: null }`, so `"main.customers"` and `"customers"` select the same object.
-`table-ref.ts` stayed generic. Documented in `driver-schemas.ts` and
-`docs-site/.../configuration/ktx-yaml.mdx`.
-
-**Surfacing (R2).** Deviation from the spec's orientation: live-database schema
-ingest runs through the **stage-only** path (`runLocalStageOnlyIngest` →
-`local_ingest_reports`), not the bundle runner, so the home for scan warnings is
-`LocalIngestRunRecord.fetch` (a new `SourceFetchReport` field; `body_json` is
-persisted whole, so **no migration**), not the bundle-only
-`IngestReportBody.fetch`. Both ingest paths read `adapter.readFetchReport`
-(`live-database/fetch-report.ts` derives skips from the existing `warnings.json`).
-The ingest summary is already rendered by `runKtxScan` from `report.warnings`
-(the new `describeWarningGroup` case), and `ktx status`
-(`status-project.ts#buildLocalStatsStatus`/`renderLocalStats`) now parses the
-latest report body per connection and prints a per-connection
-`N object(s) skipped — name: reason` line.
-
-**Daemon parity (R6).** `database_introspection.py` adds a `warnings` field to
-`DatabaseIntrospectionResponse` and a `DatabaseIntrospectionWarning` model,
-isolates per-object failures in `_map_rows_to_tables`, and shares the
-`OBJECT_INTROSPECTION_FAILED_CODE = "object_introspection_failed"` string with
-Node. `mapDaemonSnapshot` maps `raw.warnings` into `KtxSchemaSnapshot.warnings`,
-dropping any code Node cannot render (validated via `isKtxScanWarningCode`).
-Deviation: the daemon does **not** re-enforce the R3 total-failure rule — the
-shared Node post-check (`assertLiveDatabaseScanOutcome`) owns it for every driver
-including the daemon, avoiding a divergent second implementation. Parity is
-covered by a Node test (daemon-shaped warning round-trips) and a pytest
-(per-object failure → warning with the shared code).
--- a/spider2-specs/specs/07-analytics-skill-sql-craft.md
+++ b/spider2-specs/specs/07-analytics-skill-sql-craft.md
@@ -1,363 +0,0 @@
-# Add universal SQL-authoring craft to the ktx-analytics skill
-
-> Refined spec. Intake draft: `todo/07-analytics-skill-sql-craft.md`.
-
-## Problem
-
-The shipped `ktx-analytics` skill
-(`packages/cli/src/skills/analytics/SKILL.md`) is an *orchestration* guide: its
-`<workflow>` and `<rules>` tell the agent **which ktx tools to call and in what
-order** (`discover_data` → `entity_details`/`sl_read_source` →
-`sl_query`/`sql_execution` → validate → `memory_ingest`). It says almost nothing
-about **writing correct SQL**.
-
-That gap shows up as a specific failure shape: the agent reliably produces
-*runnable* SQL but *wrong* results. The recurring defects are universal
-analytics-engineering mistakes, not ktx-specific ones:
-
- comparing a string column to a numeric literal (or vice versa), which can
-  silently match zero rows;
- rounding inside intermediate CTEs, so the final number is off;
- ranking/“first”/“most recent” windows with no deterministic tie-breaker, so
-  results flicker run to run;
- filtering *before* a window function for sequence/“since”/“first” questions,
-  truncating the partition the window should see;
- returning a full ranked list for a “top/highest” question, or collapsing a
-  “per X” question to a single value;
- dropping the inputs (or the entity identifier) a derived value was built from.
-
-These are correctness defects every ktx user hits on a live database. They
-belong in the shipped skill — fixing them once improves ktx for everyone, rather
-than living in any individual caller’s prompt.
-
-## Generic use case
-
-An analyst (human or agent) points ktx at a **live, production** database and
-asks a real analytical question — “what’s the most recent order per customer”,
-“top region by margin”, “average order value by month”. The schema is unfamiliar
-(unknown date encodings, nullable join keys, string-typed numeric columns), the
-question carries grain and ranking intent in its wording, and the answer must be
-*correct and deterministic*, not merely executable. The skill should encode the
-analytics-engineering craft that makes the difference between a query that runs
-and a query that’s right — independent of any benchmark.
-
-## Model
-
-The change is **additive content in one Markdown file**, governed by these
-invariants. They constrain the implementer; the exact prose is theirs.
-
-### Inline-only delivery (this is a hard constraint, not a style preference)
-
-All new guidance lives **inside `skills/analytics/SKILL.md`**. A bundled
-`reference/*.md` file (the progressive-disclosure pattern Anthropic’s
-skill-authoring guide recommends for large skills) **MUST NOT** be used here,
-because the delivery mechanism ships only `SKILL.md`:
-
- `setup-agents.ts` installs the analytics skill via `readAnalyticsSkillContent()`,
-  which reads **only** `./skills/analytics/SKILL.md` and writes a **single** file
-  per target: `.claude/skills/ktx-analytics/SKILL.md` (Claude Code), the Codex /
-  universal `.agents` equivalent, a **flattened** single rules file for Cursor
-  (`.cursor/rules/ktx-analytics.mdc`) and OpenCode
-  (`.opencode/commands/ktx-analytics.md`), and a Claude Desktop **zip that
-  contains only `ktx-analytics/SKILL.md`** (`writeClaudeDesktopSkillBundle`).
- Nothing copies sibling files or subdirectories. A reference file would dangle
-  on every target, and the Cursor/OpenCode flatten-to-one-file shape cannot
-  represent a multi-file skill at all.
-
-The skill is small enough that inline costs nothing meaningful: ~67 lines today
-plus ~60 of craft is well under the 500-line budget. And this craft is **core
-content** — consulted on every SQL-authoring turn — so even if multi-file delivery
-existed it would still belong inline: progressive disclosure only pays off for
-large, *conditionally-relevant* reference material loaded on demand, not for
-always-needed craft.
-
-Multi-file skill *delivery* is a legitimate future enhancement, but it must be
-**pulled by a concrete need, not built ahead of one** — no shipped skill today
-exceeds the budget (largest is ~346 lines) or uses a bundled reference. The first
-real trigger is the **per-dialect SQL syntax follow-up**
-(`todo/08-per-dialect-sql-syntax-notes.md`), whose load-on-demand
-`reference/<dialect>.md` content is a genuine progressive-disclosure fit. When
-that work is scoped, note that multi-file delivery is **not** a simple directory
-copy: `setup-agents.ts` flattens the skill to a *single* file for Cursor
-(`.mdc`) and OpenCode (`.md`), so those targets need a concatenation transform,
-and uninstall needs per-file manifest entries. The constraint is recorded here so
-that a future implementer does not “improve” this inline content into a bundled
-reference that dangles on every target.
-
-### Heuristics with a generic *why*, not a wall of MUSTs
-
-The new rules are phrased as **heuristics with a one-line, universal rationale**,
-because SQL authoring is a high-freedom task (many valid approaches, choice
-depends on the question and the data). A bare imperative overfits; a rule plus
-its *why* lets the model apply judgment and generalize. This follows Anthropic’s
-own skill-authoring guidance (“if you find yourself writing ALWAYS/NEVER in all
-caps or rigid structures, reframe and explain the reasoning”).
-
-This **reconciles the draft’s “behavior only, no rationale” instruction**: the
-prohibition is specifically on rationale that references a **grader, gold answer,
-or the benchmark**. *Generic analytics-engineering rationale is required* — e.g.
-“…so `RANK`/`ROW_NUMBER` results don’t flicker across runs”, “…a string-vs-number
-compare can silently match nothing”. That is a universal truth, not a
-grader reference.
-
-### Dialect-agnostic
-
-Every rule must read correctly on any SQL dialect a ktx connection might use.
-**No dialect-specific syntax** — not `QUALIFY` (Snowflake/BigQuery/DuckDB only),
-not `strftime`/`julianday` (sqlite), not backtick/`DB.SCHEMA.TABLE` FQTNs.
-Per-dialect syntax notes are a **separate follow-up** living in a dialect-aware
-(per-driver) location, explicitly out of scope here.
-
-### Discovery craft attaches to discovery; authoring craft to query/validate
-
-Two of the draft’s rules (inspect sample rows; cast before comparing) are
-*schema-discovery* concerns that happen **before** SQL is composed. They belong
-with the discovery steps of the existing workflow, not only at the query step.
-The rest (composition, window correctness, precision, completeness) belong with
-the query/validate steps. The draft’s “extend step 5/6” is the right home for
-most rules but is slightly off for the discovery pair; this spec corrects that.
-
-### Additive only
-
-The existing `<workflow>`, `<rules>`, and `<examples>` — compact result tables,
-summaries, clarification prompts, the tool-order workflow, the `connectionId`
-scoping rules — are preserved unchanged. The skill must still read well for an
-interactive, human-facing analysis session.
-
-## Requirements
-
-### 1. Placement and structure
-
-Add a dedicated, scannable craft section to `SKILL.md`:
-
- A new top-level block — `<sql_craft>` (sibling to `<workflow>`/`<rules>`) — with
-  **five sub-headings**: *Schema discovery*, *Composition*, *Window functions*,
-  *Numeric precision*, *Answer completeness*. Sub-headings keep the block
-  scannable (the draft’s “group under clear sub-headings” goal).
- **Pointers, not duplication.** Step 5 (“Query”) and step 6 (“Validate and
-  explain”) each gain a **one-line pointer** into `<sql_craft>` rather than
-  inlining the rules (state each rule once; Anthropic’s “consistent terminology /
-  don’t repeat” guidance). The schema-discovery pair is additionally reflected as
-  a brief cue in the discovery steps (step 2 “Inspect” / step 4 “Plan”), pointing
-  to the same block.
- No new tool, flag, or config. This is content only.
-
-### 2. The craft rules (all fourteen behaviors, grouped)
-
-Every behavior from the intake draft must be represented. Tightly-related ones
-**may** be merged into a single bullet where that reads better; none may be
-dropped. Each carries a generic *why* (per Model). Dialect-agnostic throughout.
-
-**Schema discovery** (cue in steps 2/4; lives in `<sql_craft>`)
-1. Inspect representative **sample rows** of each table before composing SQL —
-   confirm date/time encoding (`YYYYMMDD` vs ISO vs epoch), null prevalence in
-   join/filter keys, and the real set of categorical/enum values
-   (`entity_details` + a small `sql_execution` sample). *Why:* assumptions about
-   encoding and nullability are the most common source of silently-wrong filters.
-2. **Cast a column to its real type before comparing** it in `WHERE`/`JOIN`. A
-   string column compared to a numeric literal (or vice versa) can silently match
-   nothing.
-
-**Composition**
-3. Build complex queries **incrementally** — one CTE at a time, verifying each
-   layer’s output on a small sample before stacking the next. *Why:* a wrong
-   intermediate layer is far cheaper to catch early than to debug in the final
-   result.
-4. **Avoid fan-out joins.** Add columns only from tables already at the target
-   grain, or **pre-aggregate** to that grain before joining. *Why:* a join that
-   multiplies rows quietly inflates every downstream `SUM`/`COUNT`.
-
-**Window functions**
-5. Give every ranking/ordering window function a **complete, deterministic
-   tie-breaker** (append unique key columns to `ORDER BY`), so
-   `RANK`/`ROW_NUMBER`/`LAG` are stable rather than flickering across runs.
-6. For sequence / “first” / “most recent” / “since” questions, **filter after the
-   window**, not before: compute over the full partition, then keep the rows you
-   want. *Why:* a pre-filter shrinks the partition the window ranks over, so
-   “first”/“most recent” is computed against the wrong set. (See the worked
-   example, requirement 3.)
-
-**Numeric precision**
-7. Compute at **full precision; round only in the final projection**, never inside
-   intermediate CTEs.
-8. Be **explicit about truncation** — `CAST AS INT` truncates; use explicit
-   rounding when rounding is intended. (May merge with rule 7.)
-9. Distinguish **macro vs micro averages** based on the question’s wording:
-   “average of per-group averages” = `AVG(group_metric)`; “overall/weighted
-   average” = `SUM(numerator)/SUM(denominator)`.
-
-**Answer completeness / interpretation**
-10. “top / highest / most / lowest” → return only the **winning row(s)** (keep the
-    top-ranked row via the window result), not the full ranked list, unless a list
-    is asked for. *(Phrase the mechanism dialect-agnostically — do not name
-    `QUALIFY`.)*
-11. “for each X / per X / by X” → **exactly one row per X**; don’t collapse to a
-    single value unless the question says “overall” or “total across X”.
-12. When a question asks for inputs and a derived value (“X, Y, and their ratio”),
-    **include the inputs as columns** alongside the derived value.
-13. When grouping by a human-readable label (a name), also **expose the entity’s
-    identifier** — identity, not just the label, is part of the result (and
-    disambiguates duplicate names).
-14. When a result is **unexpectedly empty, relax filters one at a time** to find
-    which predicate removed the rows. *Why:* this is the validation feedback loop
-    that turns a silent empty result into a diagnosable one.
-
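The numeric-precision rules (7 to 9) are easy to check with plain arithmetic. The sketch below uses TypeScript rather than SQL so that it stays dialect-neutral; the sample values and the `round2` and `sum` helpers are made up for illustration.

```typescript
const round2 = (x: number): number => Math.round(x * 100) / 100;
const sum = (xs: number[]): number => xs.reduce((acc, x) => acc + x, 0);

// Rule 7: rounding each intermediate value loses the fractions before they add up.
const lineItems = [1.004, 1.004, 1.004];
const roundedEarly = sum(lineItems.map(round2)); // 3
const roundedLate = round2(sum(lineItems)); // 3.01

// Rule 8: an integer cast truncates, explicit rounding does not.
const truncated = Math.trunc(2.7); // 2
const rounded = Math.round(2.7); // 3

// Rule 9: macro average (mean of per-group ratios) vs micro average (ratio of sums).
const groups = [
  { numerator: 1, denominator: 2 },
  { numerator: 25, denominator: 100 },
];
const macroAverage = sum(groups.map((g) => g.numerator / g.denominator)) / groups.length; // 0.375
const microAverage = sum(groups.map((g) => g.numerator)) / sum(groups.map((g) => g.denominator));

console.log({ roundedEarly, roundedLate, truncated, rounded, macroAverage, microAverage });
```

The micro average here is 26/102, roughly 0.255, so the two averages answer different questions even on two groups; which one applies depends on the wording of the question.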
-### 3. One worked example (dialect-agnostic)
-
-Add **exactly one** compact before/after example to the skill, demonstrating the
-**window-then-filter** rule (rule 6) — the subtlest and highest-value of the set.
-It shows the wrong shape (filter inside, then rank) and the right shape (rank over
-the full partition in a CTE, then filter to the top rank in the outer query),
-using generic table/column names and standard SQL only (no `QUALIFY`, no
-dialect functions). Keep it ~6–10 lines. Do not add a second example; the
-existing three tool-orchestration examples stay as the primary example set.
-*(Superseded by spec 09: the skill now carries a second `sql` worked example —
-the multi-hop fan-out case — so the one-example constraint applies to spec 07's
-window-then-filter example only.)*
-
-### 4. Explicit exclusions
-
-None of the following may appear in the skill (they are application/consumer
-concerns, or actively wrong for live data):
-
- **Output-shape contracts** (“return a bare result set with exactly these
-  columns, no prose”). The skill is for interactive analysis and already favors
-  readable tables + summaries; a caller needing a strict shape specifies that
-  itself.
- **Anchoring relative time to `MAX(date)` of the data.** On a live database
-  “recent” / “past N months” means relative to *now*; `MAX(date)` anchoring is
-  only valid for static snapshots and must not be baked into the product.
- **Any advice justified by a grader, gold answer, or scoring comparator.**
- **Dialect-specific syntax** (deferred to the per-driver follow-up).
-
-### 5. Coordination with spec 03
-
-`03-multi-connection-routing-in-analytics-skill` also edits this same file (it
-adds a connection-routing “step 0” to `<workflow>` and threads `connectionId`
-through the tool calls). Spec 07’s additions are **orthogonal**: they live in a
-new `<sql_craft>` block and in step 5/6 pointers, and must not rewrite the
-`<workflow>` routing or the `<rules>` `connectionId` scoping that spec 03 owns.
-If both land, the result is one coherent skill: routing in `<workflow>`/`<rules>`,
-SQL craft in `<sql_craft>`.
-
-## Acceptance criteria
-
- The shipped `analytics/SKILL.md` contains all fourteen behaviors above, grouped
-  under the five sub-headings, each phrased as a heuristic with a generic
-  rationale.
- **Zero references** to any benchmark, gold answer, grader, or scoring
-  comparator anywhere in the skill.
- **Dialect-agnostic:** the skill contains no `QUALIFY`, no `strftime`/`julianday`,
-  no backtick/`DB.SCHEMA.TABLE` FQTN syntax, and no other single-dialect
-  construct — including in the worked example.
- The existing interactive guidance is intact: the `<workflow>` steps, the
-  `<rules>` (compact tables, summaries, clarification prompt, `connectionId`
-  scoping), and the three existing examples all still read correctly and were not
-  removed or contradicted.
- **None of the excluded items** (output-shape contract, `MAX(date)` anchoring of
-  “recent”, grader-driven advice, dialect syntax) appear.
- Exactly **one** new worked example is present, demonstrating window-then-filter,
-  in standard dialect-agnostic SQL. *(Superseded by spec 09, which adds a second
-  `sql` worked example for the multi-hop fan-out case; the shipped skill then
-  contains two worked examples and the content test asserts two `sql` fences.)*
- The craft is **inline in `SKILL.md`** — no bundled reference file is introduced,
-  and the skill still installs as a single file through `setup-agents.ts` for all
-  targets (Claude Code, Codex, Cursor, OpenCode, universal, Claude Desktop zip).
- The skill stays **scannable and within a reasonable size** (comfortably under
-  the 500-line budget).
- The frontmatter (`name`, `description`) is unchanged and still parses through
-  `SkillsRegistryService.parseFrontmatter`.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the prose.
-
- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md`. Add the
-  `<sql_craft>` block; add one-line pointers in steps 5/6 and a discovery cue in
-  steps 2/4; add the single worked example. Keep `<workflow>`/`<rules>`/`<examples>`
-  otherwise intact.
- **Delivery (why inline is mandatory):** `packages/cli/src/setup-agents.ts`
-  (`readAnalyticsSkillContent`, `installTarget`, `writeClaudeDesktopSkillBundle`,
-  `plannedKtxAgentFiles`). Each target gets a single file derived from
-  `SKILL.md`; Cursor/OpenCode flatten to one rules file; Claude Desktop zips only
-  `ktx-analytics/SKILL.md`. No change to `setup-agents.ts` is required by this
-  spec — confirm the skill still installs unchanged.
- **Coordination:** `03-multi-connection-routing-in-analytics-skill` edits the
-  same file; keep the changes non-overlapping (see requirement 5).
- **Tests:** a content assertion over the shipped `analytics/SKILL.md` is the
-  right level (this is prompt content, not executable logic). Assert the skill
-  text contains the craft sub-headings / representative rule phrases, contains the
-  worked example, and contains none of the banned constructs: the literal tokens
-  `QUALIFY`/`strftime`/`julianday`, grader/benchmark words (`spider`, `benchmark`,
-  `gold`, `grader`), and — checked as a phrase, not a raw `MAX(` grep, since
-  `MAX()` is a legitimate aggregate — any instruction anchoring relative time
-  (“recent”, “past N months”) to the data’s maximum date. The existing
-  `SkillsRegistryService` frontmatter-parse test must still pass. The standalone
-  `ktx-dev` binary should be rebuilt/re-linked (`pnpm run build && pnpm run
-  link:dev`) so the playground picks up the updated skill.
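The banned-construct half of that content assertion could look roughly like the following. This is a sketch under assumptions, not the shipped test: `findBannedConstructs` and both patterns are hypothetical, and flagging a relative-time word that shares a line with `MAX(` is only a crude proxy for the phrase-level check described above.

```typescript
// Illustrative only: the helper name, token list wiring, and patterns are
// assumptions made for this sketch.
const BANNED_TOKENS: string[] = [
  "QUALIFY", "strftime", "julianday", // single-dialect constructs
  "spider", "benchmark", "gold", "grader", // grader/benchmark words
];

// A relative-time phrase and a MAX( call, checked per line so that MAX() used
// as an ordinary aggregate elsewhere in the text is not flagged.
const RELATIVE_TIME = /\b(recent|past\s+\S+\s+(days|weeks|months|years))\b/i;
const MAX_CALL = /\bMAX\s*\(/i;

function findBannedConstructs(skillText: string): string[] {
  const hits: string[] = [];
  for (const token of BANNED_TOKENS) {
    // Whole-word, case-insensitive match on each banned token.
    if (new RegExp("\\b" + token + "\\b", "i").test(skillText)) {
      hits.push(token);
    }
  }
  for (const line of skillText.split("\n")) {
    if (RELATIVE_TIME.test(line) && MAX_CALL.test(line)) {
      hits.push("relative time anchored to MAX(date)");
    }
  }
  return hits;
}
```

A real test would read the skill source and assert that the returned list is empty; the whole-word match keeps a token such as `gold` from firing on unrelated longer words.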
-
-## Benchmark context (motivation only)
-
-On the Spider 2.0-Lite sqlite subset the solver produced **0 execution errors but
-~50 result mismatches**, and a large share traced to exactly these gaps:
-premature rounding, string-vs-number compares, non-deterministic window ordering,
-returning full lists for “top” questions, and dropping the inputs to derived
-values. These are generic SQL-authoring defects — fixing them in the skill
-improves ktx for every user querying a live database, and improving the benchmark
-score is a side effect, not the goal. The skill itself must contain no trace of
-the benchmark.
-
-## Implementation notes
-
-Implemented on branch `write-feature-spec-wiki`.
-
-**What was built**
- Added a new `<sql_craft>` block to `packages/cli/src/skills/analytics/SKILL.md`
-  (sibling to `<workflow>`/`<rules>`, placed just before `<examples>`), with the
-  five sub-headings — *Schema discovery before writing SQL*, *Composition*,
-  *Window functions*, *Numeric precision*, *Answer completeness / interpretation* —
-  and a one-line opener framing the bullets as heuristics-with-a-why.
- All fourteen behaviors are represented. Rules 7 and 8 (round-at-the-end /
-  truncation) are merged into one "Round only at the end" bullet, as the spec
-  permitted. Each bullet carries a generic analytics-engineering rationale; none
-  references a benchmark, grader, or gold answer.
- Exactly one worked example (a fenced `sql` block inside `<sql_craft>`)
-  demonstrates the window-then-filter rule, and incidentally the deterministic
-  tie-breaker: the *wrong* shape filters before the window; the *right* shape
-  ranks the full partition in a CTE, then filters in the outer query. Standard
-  SQL only — no `QUALIFY`, no dialect functions.
- Step pointers added without duplicating the rules: a schema-discovery cue in
-  steps 2 and 4, an authoring pointer in step 5, and a validation pointer in
-  step 6, each pointing into `<sql_craft>`.
- The existing `<workflow>` / `<rules>` / `<examples>` (compact tables,
-  summaries, clarification prompt, `connectionId` scoping, the three
-  orchestration examples) are unchanged. Delivery is unchanged: still a single
-  `SKILL.md` per target via `readAnalyticsSkillContent`; no bundled `reference/`
-  file was introduced.
-
-**Tests** — added `packages/cli/test/skills/analytics-skill-content.test.ts`, a
-content assertion over the source `SKILL.md`: the five sub-headings, a
-representative phrase for each behavior, exactly one `sql` worked example, the
-preserved interactive guidance, and the absence of banned constructs
-(`QUALIFY` / `strftime` / `julianday`, `spider` / `benchmark` / `gold` /
-`grader`, a backtick three-part FQTN, and a phrase-level guard against anchoring
-relative time to a `MAX(...)` date). The existing `setup-agents.test.ts` content
-assertions and the `SkillsRegistryService` frontmatter test still pass (77/77
-across the three relevant files). Rebuilt and re-linked `ktx-dev`
-(`pnpm run build && pnpm run link:dev`); the craft block is present in the
-shipped `dist` asset.
-
-**Deviations / notes**
- The worked example runs ~18 lines including comments rather than the spec's
-  "~6–10"; a faithful before/after with a CTE needs the extra lines, and the
-  skill stays well within budget (~117 lines total).
- `pnpm run type-check` currently reports one **pre-existing, unrelated** error
-  in `test/mcp-server-factory.test.ts` (MCP server deps typing), committed on
-  this branch ahead of `origin/main`. The src type-check and `pnpm run build`
-  are green; this change does not touch any MCP file.
- Per-dialect SQL syntax stays out of scope here (deferred to
-  `todo/08-per-dialect-sql-syntax-notes.md`), so the skill remains
-  dialect-agnostic. No dialect-tool pointer was added to `SKILL.md` yet — that
-  belongs with spec 08's channel so the skill never references a tool that does
-  not exist.
--- a/spider2-specs/specs/08-per-dialect-sql-syntax-notes.md
+++ b/spider2-specs/specs/08-per-dialect-sql-syntax-notes.md
@@ -1,395 +0,0 @@
-# Per-dialect SQL syntax notes, served on demand and scoped to the connection
-
-> Refined spec. Intake draft: `todo/08-per-dialect-sql-syntax-notes.md`. Companion
-> to `specs/07-analytics-skill-sql-craft.md`, which kept the analytics SQL craft
-> dialect-agnostic and explicitly deferred per-dialect syntax to this spec.
-
-## Problem
-
-Spec 07 added universal, **dialect-agnostic** SQL-authoring craft to the
-`ktx-analytics` skill (`packages/cli/src/skills/analytics/SKILL.md`). That craft
-deliberately excludes anything that reads correctly on only one engine — no
-`QUALIFY`, no `strftime`/`julianday`, no backtick or `DB.SCHEMA.TABLE` FQTNs —
-because the flat skill is installed verbatim and an agent querying sqlite must
-never see Snowflake syntax.
-
-But a large share of *real* correctness depends on exactly that excluded,
-engine-specific syntax:
-
- **Snowflake:** `DATABASE.SCHEMA.TABLE` FQTNs, double-quoted case-sensitive
-  identifiers (unquoted folds to upper-case), VARIANT colon-paths
-  (`col:field.sub::type`), `QUALIFY`.
- **BigQuery:** backtick FQTNs (`` `project.dataset.table` ``), `_TABLE_SUFFIX`
-  for sharded/wildcard tables, `QUALIFY`, `JSON_VALUE`/`JSON_EXTRACT`.
- **sqlite:** `strftime`/`julianday`/`date()` for dates, no `QUALIFY`,
-  `json_extract`.
- and the remaining supported engines (`postgres`, `mysql`, `clickhouse`,
-  `sqlserver`/`tsql`), each with its own FQTN, quoting, date, top-N, and
-  JSON conventions.
-
-This guidance is genuinely useful to an agent writing SQL against a live
-database, but it must **not** pollute the flat dialect-agnostic skill. It belongs
-in a **dialect-aware** channel, surfaced only for the dialect the active
-connection actually uses, and selected from the project's own configured state —
-not guessed, not shown all at once.
-
-## Generic use case
-
-Any **ktx** project whose connections span more than one warehouse engine — a
-Snowflake warehouse plus a BigQuery export plus a local sqlite extract, say. When
-the agent (or a human analyst the agent assists) writes SQL for a given
-connection, it should receive *that engine's* syntax conventions — FQTN form,
-identifier quoting, date functions, top-N idiom, semi-structured access — and
-nothing for the engines it is not querying. The need is independent of any
-benchmark: it is what "write correct SQL against this specific warehouse" requires
-on every multi-engine stack.
-
-## Model
-
-The change adds a **dialect-aware channel** alongside spec 07's flat skill. The
-following decisions are committed by this refinement; the implementer owns the
-exact prose and code.
-
-### Delivery: a dynamic MCP tool (decision committed)
-
-The draft posed two delivery mechanisms and asked the refinement to "weigh them
-before committing." This spec commits to **dynamic MCP delivery**: a new
-read-only MCP tool returns the syntax notes for a given `connectionId`, with the
-dialect resolved server-side from the connection's configured `driver`. The flat
-skill gains a one-line pointer to that tool. **No install-mechanism change is
-required.**
-
-The alternative — **multi-file skill delivery** (bundle `reference/<dialect>.md`
-files and point the skill at the matching one) — is **rejected** for **ktx**, for
-reasons that hold regardless of how the skill is otherwise authored:
-
-1. **It cannot scope on two of the six install targets.** Cursor
-   (`.cursor/rules/ktx-analytics.mdc`) and OpenCode
-   (`.opencode/commands/ktx-analytics.md`) are physically **single-file**;
-   `setup-agents.ts` flattens the skill to one file there. A bundled `reference/`
-   directory degenerates to "concatenate every dialect into one file," so a
-   sqlite agent would see Snowflake VARIANT syntax — **failing this spec's core
-   no-leak criterion on those targets**, and defeating progressive disclosure
-   (everything is in context at once). The MCP tool behaves **identically on all
-   six targets** because it is a tool call, not an installed file.
-2. **Selecting the dialect is a deterministic operation, so it belongs in code,
-   not model judgment.** Anthropic's skill-authoring guidance explicitly says to
-   *"prefer scripts [tools] for deterministic operations."* With bundled files the
-   **model** must infer that connection X is Snowflake and open the right file —
-   and on a multi-connection project it can open the wrong one. With the tool, the
-   **server** resolves `driver → dialect` from `ktx.yaml` state and returns
-   exactly the right notes.
-3. **It needs a delivery subsystem that the tool does not.** Multi-file delivery
-   requires reworking `readAnalyticsSkillContent`, `installTarget`,
-   `plannedKtxAgentFiles`, the install manifest (a directory variant),
-   `removeKtxAgentInstall`, and `writeClaudeDesktopSkillBundle`, plus a
-   concatenation transform for the single-file targets. The MCP tool requires one
-   read-only handler and one skill pointer.
-4. **The dependency is free.** The `ktx-analytics` skill already hard-depends on
-   the **ktx** MCP server — its entire workflow is calling `discover_data`,
-   `entity_details`, `sql_execution`, and so on. Whenever the server is down, the
-   skill is already non-functional; the tool adds **no new dependency**.
-5. **Dropping Cursor/OpenCode does not change this.** Removing those targets would
-   make multi-file delivery *possible*, but it would not make it better: reasons
-   2–4 stand, and the drop is a disproportionate cost (Cursor is a major target)
-   to neutralize a constraint the tool handles for free. Whether **ktx** supports
-   those targets is a separate product decision and is out of scope here.
-
-This is consistent with Anthropic's progressive-disclosure goal — load the
-relevant material on demand, at zero context cost until needed — which the tool
-satisfies (its output costs context only when called) while resolving *which*
-dialect from state rather than from a model guess. Reference:
-[Skill authoring best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices).
-
-### Scope derived from state, through the one existing resolver
-
-Which dialect's notes the agent sees is **derived** from the connection's
-configured `driver`, via the resolver the rest of the system already uses —
-`sqlAnalysisDialectForDriver(driver)` in
-`packages/cli/src/context/sql-analysis/dialect.ts`. The same function already
-selects the dialect for `sql_execution`, `sl_query`, and the Python SQL-analysis
-daemon. This spec **must not** introduce a second driver→dialect map. The notes
-are **keyed by the resolved `SqlAnalysisDialect`** (so the SQL Server entry is
-keyed `tsql`, not `sqlserver`), tying the note key-space to the resolver's
-codomain so the two cannot drift.
-
-### Authored per-engine notes are sanctioned static content
-
-Enumerating syntax notes per engine is **not** a rotting denylist of bad
-specifics; FQTN form and identifier quoting are genuine, stable invariants of each
-engine — the kind of universal fact **ktx**'s design rules explicitly permit as
-static content. What must stay derived-from-state is note *selection* (the active
-dialect) and note *coverage* (every configured driver must resolve to notes that
-exist), both of which this spec ties to the connector registry.
-
-### The flat skill stays dialect-agnostic (spec 07 invariant preserved)
-
-This work adds a *separate* channel. It does **not** amend spec 07's `<sql_craft>`
-block or inline any dialect syntax into `SKILL.md`. Spec 07's acceptance criterion
-— no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in the flat skill — stays
-green. The only `SKILL.md` change is the pointer in requirement 3, which names the
-tool and contains no dialect syntax.
-
-## Requirements
-
-### 1. A read-only `sql_dialect_notes` MCP tool
-
-Register a new tool beside the existing context tools
-(`packages/cli/src/context/mcp/context-tools.ts`). The tool name is the
-implementer's to finalize but should follow the existing snake_case convention
-(`entity_details`, `sql_execution`); `sql_dialect_notes` is the suggested name.
-
- **Input:** `{ connectionId }`, **required** — matching its siblings
-  `entity_details`/`sql_execution`, which always take an explicit connection.
- **Output:** `{ connectionId, dialect, notes }` where `dialect` is the resolved
-  `SqlAnalysisDialect` and `notes` is the markdown guidance for that dialect.
- **Resolution:** `connectionId → connection.driver →
-  sqlAnalysisDialectForDriver(driver) → notes[dialect]`, reusing the existing
-  resolver. Do not duplicate the driver→dialect map.
- **Guards:**
-  - A **non-SQL context-source** connection (driver `metabase`, `looker`,
-    `lookml`, `notion`, `dbt`, `metricflow`) returns a **clear "not a SQL
-    warehouse connection" error**, not postgres notes. Gate on the existing
-    `isDatabaseDriver()` (`packages/cli/src/connection-drivers.ts`).
-  - For any **SQL warehouse** connection the resolver always yields a dialect with
-    notes (all seven warehouse drivers are covered — requirement 2); its built-in
-    `postgres` default is a safety floor, so the tool never errors for a SQL
-    connection and never emits a single-engine dialect (e.g. Snowflake) by
-    accident.
- **Annotations:** read-only and idempotent, consistent with the other read
-  tools.
- **Description (docs-grade, third person, states what and when):** e.g.
-  *"Returns the SQL syntax conventions for a connection's dialect — FQTN form,
-  identifier quoting and case-folding, date/time functions, top-N idiom, and
-  semi-structured access. Use before authoring raw SQL against a connection so the
-  SQL matches that engine."* The description drives the agent's decision to call
-  the tool, so it must be specific.
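The resolution chain and its guards can be sketched as one small function. Everything named here is a hypothetical stand-in so the example runs on its own: `isSqlDriver` and `dialectForDriver` play the roles of the existing `isDatabaseDriver()` and `sqlAnalysisDialectForDriver()`, which the real handler would call rather than copy, and the result union is a simplification of whatever error mechanism the server uses.

```typescript
// Hypothetical stand-ins so the sketch runs on its own. In ktx the driver check
// and the driver-to-dialect resolver already exist and must be reused, not copied.
type Dialect = "postgres" | "mysql" | "snowflake" | "bigquery" | "sqlite" | "clickhouse" | "tsql";

const SQL_DRIVERS: string[] = [
  "postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "sqlserver",
];

function isSqlDriver(driver: string): boolean {
  return SQL_DRIVERS.indexOf(driver) !== -1;
}

function dialectForDriver(driver: string): Dialect {
  switch (driver) {
    case "mysql": return "mysql";
    case "snowflake": return "snowflake";
    case "bigquery": return "bigquery";
    case "sqlite": return "sqlite";
    case "clickhouse": return "clickhouse";
    case "sqlserver": return "tsql"; // notes are keyed by dialect, not driver
    default: return "postgres";      // safety floor
  }
}

interface Connection {
  id: string;
  driver: string;
}

type NotesResult =
  | { ok: true; connectionId: string; dialect: Dialect; notes: string }
  | { ok: false; error: string };

function resolveNotes(
  connections: Connection[],
  notesByDialect: Record<Dialect, string>,
  connectionId: string,
): NotesResult {
  let connection: Connection | undefined;
  for (const candidate of connections) {
    if (candidate.id === connectionId) connection = candidate;
  }
  if (connection === undefined) {
    return { ok: false, error: "Unknown connection: " + connectionId };
  }
  // Guard first: a context source must not fall through to the postgres default.
  if (!isSqlDriver(connection.driver)) {
    return { ok: false, error: "Not a SQL warehouse connection: " + connectionId };
  }
  const dialect = dialectForDriver(connection.driver);
  return { ok: true, connectionId, dialect, notes: notesByDialect[dialect] };
}

const sampleNotes: Record<Dialect, string> = {
  postgres: "postgres notes", mysql: "mysql notes", snowflake: "snowflake notes",
  bigquery: "bigquery notes", sqlite: "sqlite notes", clickhouse: "clickhouse notes",
  tsql: "tsql notes",
};

const sampleConnections: Connection[] = [
  { id: "warehouse", driver: "sqlserver" },
  { id: "extract", driver: "sqlite" },
  { id: "dashboards", driver: "metabase" },
];
```

The ordering matters: because the resolver defaults to `postgres`, the warehouse check has to run before it, otherwise a `metabase` connection would quietly receive postgres notes.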
-
-### 2. Per-dialect note content
-
-Author concise notes for each supported dialect against a **fixed rubric**, so
-every dialect answers the same questions. Each facet is a line or two of timeless,
-engine-true convention (no version-dated "as of vX" content), phrased as
-guidance with the engine reason where it helps — inheriting spec 07's
-heuristics-with-a-why tone. The rubric facets:
-
-1. **FQTN form** — how to fully-qualify a table on this engine.
-2. **Identifier quoting & case-folding** — quote character and how unquoted
-   identifiers fold.
-3. **Date/time** — the engine's date functions and common date-encoding idioms.
-4. **Top-N / window-filtering idiom** — `QUALIFY` where supported; a CTE +
-   outer-filter form where it is not; `TOP` for `tsql`.
-5. **Semi-structured / JSON access** — VARIANT colon-paths, `JSON_VALUE`/
-   `JSON_EXTRACT`, `->`/`->>`, `json_extract`, as applicable.
-6. **Sharded / partition idiom** where the engine has one (e.g. BigQuery
-   `_TABLE_SUFFIX`).
-
-Constraints on the content:
-
- **Coverage = the reachable dialect set.** Every driver in the connector registry
-  must resolve to a dialect that has non-empty notes. The reachable set is
-  `postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, and
-  `tsql` (from `sqlserver`). Do **not** author notes for `duckdb`/`databricks`:
-  they appear in the resolver map but no connector can produce them, so they are
-  unreachable — matching the draft's "don't author for nonexistent drivers."
- **Keyed by `SqlAnalysisDialect`** (see Model).
- **Storage is the implementer's choice.** The notes MAY live as per-dialect
-  markdown files inside the package (e.g. under the skill's directory) served by
-  the tool, or as a typed map. If files are used they are **package-internal** —
-  served by the tool, never installed onto an agent target — and already ship via
-  the recursive `src/skills → dist/skills` copy
-  (`packages/cli/scripts/copy-runtime-assets.mjs`); no `setup-agents.ts` change.
- **No benchmark, gold-answer, grader, or scoring references** anywhere in the
-  notes.
-
-The implementer must verify each engine's specifics against current official
-documentation (the well-known anchors above are starting points, not a
-substitute for checking the engine's docs).
-
-### 3. The `SKILL.md` pointer (completes spec 07's deferral)
-
-Add a **single one-line pointer** to the SQL-authoring step (step 4 "Plan" / step
-5 "Query") of `packages/cli/src/skills/analytics/SKILL.md`, directing the agent to
-call the tool before writing raw SQL against a connection — e.g. *"Before writing
-raw `sql_execution` SQL, call `sql_dialect_notes` with the connection's id to get
-that engine's syntax conventions."* This is the pointer spec 07 deliberately did
-not add because the tool did not yet exist.
-
- The pointer **names the tool only**; it contains **no dialect syntax**, so the
-  flat skill stays dialect-agnostic.
- Follow the skill's existing tool-reference convention. The skill currently names
-  MCP tools by **bare** name (`discover_data`, `sql_execution`). Anthropic's
-  guidance recommends **fully-qualified** `ServerName:tool` names to avoid
-  "tool not found" when multiple MCP servers are present. Whether to fully-qualify
-  the new pointer (and optionally retrofit the existing bare references) is a
-  small, separable decision flagged for the maintainer — **not** a rename sweep
-  this spec mandates.
-
-### 4. Coverage is enforced from state, not by hand
-
-A test must **derive** the required coverage from the connector registry rather
-than hardcoding a dialect list: enumerate the configured warehouse drivers
-(`warehouseDrivers` in `driver-schemas.ts` / `KTX_DATABASE_DRIVER_IDS` in
-`connection-drivers.ts`), resolve each through `sqlAnalysisDialectForDriver`, and
-assert each result has non-empty notes. Adding a connector later then **fails this
-test** until its dialect gets notes — the allowlist-from-state discipline, not a
-hand-maintained list.
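The shape of such a registry-derived check, as a sketch: `findUncoveredDrivers` and the example inputs are hypothetical, and in the real test the driver list and resolver would be imported from the connector registry and `dialect.ts` rather than defined inline.

```typescript
// Returns every driver whose resolved dialect has no non-empty notes.
// The driver list and resolver are passed in, so coverage follows the registry
// instead of a hand-written dialect list.
function findUncoveredDrivers(
  driverIds: string[],
  resolveDialect: (driver: string) => string,
  notesByDialect: Record<string, string>,
): string[] {
  const missing: string[] = [];
  for (const driver of driverIds) {
    const notes = notesByDialect[resolveDialect(driver)];
    if (typeof notes !== "string" || notes.trim().length === 0) {
      missing.push(driver);
    }
  }
  return missing;
}

// Example inputs, standing in for the registry and the authored notes.
const exampleDrivers: string[] = ["postgres", "sqlite", "sqlserver"];
const exampleResolve = (driver: string): string =>
  driver === "sqlserver" ? "tsql" : driver;
const exampleNotes: Record<string, string> = {
  postgres: "postgres notes",
  sqlite: "sqlite notes",
  tsql: "tsql notes",
};
```

Appending a driver with no matching notes to `exampleDrivers` makes the function report it, which is the failure the requirement wants a new connector to trigger.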
-
-### 5. No dialect syntax leaks into the flat skill
-
-Spec 07's content assertion over `analytics/SKILL.md` stays green: the flat skill
-(and its worked example) still contain no `QUALIFY`, `strftime`, `julianday`,
-backtick/`DB.SCHEMA.TABLE` FQTN, or other single-engine construct. This spec adds
-a tool and a tool-pointer; it does not move dialect syntax into the skill.
-
-### 6. Delivery is unchanged
-
-`setup-agents.ts` (`readAnalyticsSkillContent`, `installTarget`,
-`writeClaudeDesktopSkillBundle`, `plannedKtxAgentFiles`) needs **no change**. The
-skill still installs as a single `SKILL.md` per target. Confirm the channel works
-on all six targets — Claude Code, Claude Desktop (zip), Codex, universal
-`.agents`, Cursor (`.mdc`), OpenCode (`.md`) — by virtue of being a tool call,
-including the single-file targets where multi-file delivery could not scope.
-
-### 7. Coordination with specs 07 and 03
-
- **Spec 07** owns the dialect-agnostic `<sql_craft>` block. This spec must not
-  amend it; it adds the tool, the pointer, and the notes.
- **Spec 03** (`03-multi-connection-routing-in-analytics-skill`) threads
-  `connectionId` through the skill's tool calls. The `sql_dialect_notes` pointer
-  is `connectionId`-scoped and fits that routing; keep the pointer consistent with
-  spec 03's `connectionId` rules and do not rewrite the routing it owns.
-
-## Acceptance criteria
-
- An agent querying a **sqlite** connection gets sqlite date idioms and **never**
-  sees Snowflake/BigQuery-only syntax; an agent querying **Snowflake** gets
-  FQTN / identifier / VARIANT guidance.
- The dialect shown is **derived from the connection's configured `driver`** via
-  the existing `sqlAnalysisDialectForDriver`, not hardcoded per project and not
-  guessed. No second driver→dialect map is introduced.
- **Every configured warehouse driver** (`postgres`, `mysql`, `snowflake`,
-  `bigquery`, `sqlite`, `clickhouse`, `sqlserver`) resolves to a dialect with
-  non-empty notes, and the coverage test derives this from the registry.
- A **non-SQL context-source** connection (e.g. `metabase`, `notion`) yields a
-  clear "not a SQL warehouse" response, **not** postgres notes.
- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are
-  unaffected. The new pointer references the tool only and adds no dialect syntax.
- The channel installs/serves correctly across **all six** agent targets,
-  including the single-file Cursor/OpenCode shape, with **no `setup-agents.ts`
-  change**.
- The notes contain **no** benchmark/gold/grader/scoring references and **no**
-  time-sensitive ("as of version X") content.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the design.
-
- **Dialect resolver (reuse, do not duplicate):**
-  `packages/cli/src/context/sql-analysis/dialect.ts` —
-  `sqlAnalysisDialectForDriver(driver)`, returning `SqlAnalysisDialect`
-  (`./ports.ts`), default `postgres`.
- **Connector registry (drives coverage):**
-  `packages/cli/src/connection-drivers.ts` (`KTX_DATABASE_DRIVER_IDS`,
-  `isDatabaseDriver`) and `packages/cli/src/context/project/driver-schemas.ts`
-  (`warehouseDrivers`, the per-driver `connectionConfigSchema`).
- **MCP tool registration:** `packages/cli/src/context/mcp/context-tools.ts`
-  (register beside `connection_list`, `entity_details`, `sql_execution`); the
-  `connectionId → driver → dialect` resolution already exists for `sql_execution`
-  in `packages/cli/src/context/mcp/local-project-ports.ts` — route the new tool
-  through the same path.
- **The skill (one-line pointer only):**
-  `packages/cli/src/skills/analytics/SKILL.md` — add the tool pointer in step 4/5;
-  leave `<workflow>`/`<rules>`/`<sql_craft>`/`<examples>` otherwise intact.
- **Note storage (if files):** under the skill directory, shipped by
-  `packages/cli/scripts/copy-runtime-assets.mjs`'s recursive copy; served by the
-  tool, never installed.
- **Delivery (confirm unchanged):** `packages/cli/src/setup-agents.ts`.
- **Tests:** unit tests for resolution (including `sqlserver → tsql`, unknown →
-  `postgres`, and non-warehouse rejection); a registry-derived coverage test
-  (requirement 4); a content test that each dialect's notes cover the rubric
-  facets and contain no banned tokens; and an extension of spec 07's
-  `analytics/SKILL.md` content test asserting the new pointer is present and the
-  flat skill is still dialect-clean. Rebuild and re-link the dev binary so the
-  playground picks up the change: `pnpm run build && pnpm run link:dev`.
-
-## Benchmark context (motivation only)
-
-The only per-dialect content in the Spider 2.0-Lite v9 harnesses covered
-Snowflake (`DB.SCHEMA.TABLE` FQTNs, double-quoted lower-case columns, VARIANT
-colon-paths), BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and
-sqlite (`strftime`/`julianday`). That content is real and useful but engine-specific;
-spec 07 kept it out of the flat skill and deferred it here so the dialect-agnostic
-rules stay clean. Delivering it through a dialect-scoped **ktx** tool generalizes
-the same correctness benefit to every multi-engine **ktx** project — improving the
-benchmark score is a side effect, not the goal, and the shipped skill contains no
-trace of the benchmark.
-
-## Implementation notes
-
-Implemented on branch `write-feature-spec-wiki`, alongside spec 07. The committed
-decision (dynamic MCP delivery, not multi-file skill bundling) was implemented as
-specified — no `setup-agents.ts` change.
-
-**What was built**
- Per-dialect notes are markdown files under
-  `packages/cli/src/context/sql-analysis/dialects/<dialect>.md` (one each for
-  `postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, `tsql`),
-  served by `sqlDialectNotes(dialect)` in `sql-analysis/dialect-notes.ts` (lazy
-  read + cache, `postgres` fallback floor; the authored set is the
-  `DIALECTS_WITH_NOTES` const). `duckdb`/`databricks` are intentionally unauthored
-  (unreachable from any connector). Each note answers the fixed rubric — FQTN,
-  identifier quoting/case-folding, date/time, top-N/window idiom,
-  JSON/semi-structured, plus a sharded-table line for BigQuery. Engine specifics
-  were verified against current docs via Context7 (Snowflake VARIANT colon-paths
-  and unquoted→UPPER case-folding; BigQuery `_TABLE_SUFFIX`, `QUALIFY`,
-  `JSON_VALUE`; ClickHouse `LIMIT n BY` and `JSONExtract*`, with no `QUALIFY`). The
-  files are package-internal — `copy-runtime-assets.mjs` ships them to `dist`; they
-  are never installed onto an agent target.
- New read-only MCP tool `sql_dialect_notes` (`context-tools.ts`): input
-  `{ connectionId }` (required), output `{ connectionId, dialect, notes }`, read-only
-  + idempotent annotations. It resolves through the **existing**
-  `connectionId → connection.driver → sqlAnalysisDialectForDriver` path (no second
-  driver→dialect map), implemented as the unconditional `dialectNotes` port in
-  `local-project-ports.ts` via an extracted `resolveDialectNotesForConnection`. A
-  non-SQL context source (gated by `isDatabaseDriver`) throws `KtxExpectedError`
-  ("not a SQL warehouse") instead of returning postgres notes, so this expected
-  agent mistake stays out of Error Tracking.
- `connection-drivers.ts`: `KTX_DATABASE_DRIVER_IDS` is now an exported (`@internal`)
-  readonly tuple so the coverage test derives required coverage from the registry;
-  `isDatabaseDriver` behavior is unchanged.
- `skills/analytics/SKILL.md`: a single dialect-agnostic pointer in step 5 ("call
-  `sql_dialect_notes` … to get that engine's FQTN, identifier-quoting, date, top-N,
-  and JSON conventions"). It names the tool only; spec 07's `<sql_craft>` block and
-  its dialect-clean content test are untouched.
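The "lazy read + cache, `postgres` fallback floor" behavior described for `sqlDialectNotes` can be illustrated as follows. This is a sketch of the general pattern, not the shipped code: `createNotesReader` is a hypothetical name, and the loader is injected so the example runs without touching the filesystem.

```typescript
// Builds a reader that loads each dialect's notes at most once and falls back
// to a floor dialect when a dialect has no notes of its own.
function createNotesReader(
  load: (dialect: string) => string | undefined,
  fallbackDialect: string,
): (dialect: string) => string {
  const cache: Record<string, string> = {};

  function read(dialect: string): string | undefined {
    if (Object.prototype.hasOwnProperty.call(cache, dialect)) {
      return cache[dialect]; // served from cache, no second load
    }
    const loaded = load(dialect);
    if (loaded !== undefined) {
      cache[dialect] = loaded;
    }
    return loaded;
  }

  return (dialect: string): string => {
    const notes = read(dialect);
    if (notes !== undefined) {
      return notes;
    }
    const fallback = read(fallbackDialect);
    if (fallback === undefined) {
      throw new Error("No notes available for fallback dialect " + fallbackDialect);
    }
    return fallback;
  };
}
```

In this sketch only successful loads are cached, so an unauthored dialect pays one extra lookup per call before landing on the fallback; whether the shipped implementation caches misses is not stated in these notes.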
-
-**Tests**
- `test/context/mcp/dialect-notes.test.ts`: registry-derived coverage (a future
-  connector fails the test until its dialect has notes), the full rubric per dialect,
-  leak isolation (sqlite shows `strftime` and never `VARIANT`/`_TABLE_SUFFIX`;
-  `QUALIFY` only on snowflake/bigquery; engine-exclusive markers stay put), no
-  benchmark/grader or version-dated content, the postgres fallback, and
-  `resolveDialectNotesForConnection` resolving sqlite / snowflake / `sqlserver→tsql`
-  and rejecting a non-SQL source / unknown connection with `KtxExpectedError`; plus a
-  guard that the `DIALECTS_WITH_NOTES` const and the `dialects/*.md` files stay in sync.
- `test/context/mcp/server.test.ts`: `sql_dialect_notes` added to the retained tool
-  set + annotations assertion + a handler-routing test, and the regenerated
-  `__snapshots__/mcp-tools-list.json`.
- `test/skills/analytics-skill-content.test.ts`: asserts the new pointer is present
-  and the flat skill stays dialect-clean.
-
-**Verification** — `tsc -p tsconfig.json` (src) clean; full default suite 393 files /
-3001 passing; slow suite green (incl. `local-project-ports.test.ts`); all three
-`dead-code` checks clean; the `dialects/*.md` files copy into `dist`. Rebuilt and
-re-linked `ktx-dev`.
-
-**Deviations / notes**
- Notes are stored as per-dialect markdown files (not a typed map, and not bundled
-  `reference/*.md` skill files). The spec sanctions all three options; plain
-  markdown is the easiest to edit and maintain. They are served by the tool and
-  ship via a
-  `copy-runtime-assets.mjs` entry (`src/context/sql-analysis/dialects → dist/…`); no
-  `setup-agents.ts` change.
- `pnpm run type-check` still reports one pre-existing, unrelated error in
-  `test/mcp-server-factory.test.ts` (committed in-flight MCP work on this branch);
-  this change adds zero new type errors and does not touch that file.
--- a/spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md
+++ b/spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md
@ -1,362 +0,0 @@
-# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill
-
-> Refined spec. Intake draft: `todo/09-fan-out-safe-multi-hop-aggregation.md`.
-> Extends spec 07 (`specs/07-analytics-skill-sql-craft.md`), which shipped the
-> `<sql_craft>` block. Additive, content-only.
-
-## Problem
-
-The shipped `ktx-analytics` skill
-(`packages/cli/src/skills/analytics/SKILL.md`) already carries a single-hop
-fan-out rule in `<sql_craft>` → **Composition**:
-
-> **Avoid fan-out joins.** Add columns only from tables already at the target
-> grain, or pre-aggregate to that grain before joining. A join that multiplies
-> rows quietly inflates every downstream `SUM`/`COUNT`.
-
-In practice the agent honors that on a single join but still **silently
-fans out on multi-hop join chains**, where the inflation is one or two joins
-removed from the aggregate and therefore much harder to notice.
-
-The failure shape: a measure that lives at a *coarse* grain (one row per parent
-record) is counted/summed *after* the parent has been joined down to a *finer*
-grain (one row per child line). Every parent-level value is then duplicated by
-its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by a data-dependent
-amount — runnable SQL, plausible-looking number, quietly wrong.
-
-The rule today is stated only as a **prohibition** ("Avoid…"). It needs two
-upgrades: (a) generalize it so the danger is understood as *cumulative across a
-whole join chain*, not a single join; and (b) pair it with an **affirmative
-verification habit** the agent runs while composing, so a grain change is
-detected and fixed rather than merely warned against.
-
-## Generic use case (independent of any benchmark)
-
-An analyst on any production warehouse asks a counting/summing question whose
-path runs through several one-to-many hops — e.g. *"how many orders per region
-contain a returned item?"* where the path is `region → store → order →
-order_line`. The honest answer counts each order once. The naïve join chain joins
-`order_line` (to apply the line-level condition) and then counts orders, so an
-order with three returned lines is counted three times. The inflation happens
-**three joins below the `COUNT`**, where it is easy to miss. This is one of the
-most common silently-wrong analytics mistakes on normalized schemas — not
-specific to any dataset, dialect, or benchmark.
-
-## Model (invariants — the implementer owns the prose)
-
-These constrain the change; the exact wording is the implementer's. Each is
-grounded in Anthropic's skill-authoring and prompt-engineering guidance so the
-addition stays consistent with how spec 07 was written.
-
-### Additive, inline-only, dialect-agnostic (inherited from spec 07)
-
-The change is **additive content inside `skills/analytics/SKILL.md`** only — no
-bundled `reference/*.md` file (the delivery path ships a single `SKILL.md` per
-target; see spec 07 §Model "Inline-only delivery"). No new tool, flag, or config.
-Every addition must read correctly on any dialect: **no** `QUALIFY`,
-`strftime`/`julianday`, backtick/`DB.SCHEMA.TABLE` FQTNs, or other single-dialect
-construct — including in the worked example. The existing `<workflow>`, `<rules>`,
-`<examples>`, and the other four `<sql_craft>` sub-headings are preserved
-unchanged.
-
-### Heuristic-plus-*why*, because SQL authoring is a high-freedom task
-
-Anthropic's "set appropriate degrees of freedom" guidance classifies tasks with
-many valid approaches where decisions depend on context as **high freedom →
-text-based heuristics**, the "open field, many paths" case (versus low-freedom,
-fragile operations that need an exact script). SQL authoring is squarely
-high-freedom. So the new content is phrased as **heuristics with a one-line,
-universal rationale**, never as bare `ALWAYS`/`NEVER` imperatives — matching the
-existing `<sql_craft>` style and Anthropic's "add context / explain why so Claude
-generalizes" principle.
-
-### Affirmative framing for the verification step (do, not don't)
-
-Anthropic's prompt-engineering guidance is explicit: **"Tell Claude what to do
-instead of what not to do."** The draft's requirement for "a detect-and-fix
-*habit*, not just a prohibition" is the same principle. Therefore:
-
- The **generalized rule keeps the established `Avoid fan-out joins` lead and the
-  term `fan-out`** — it is spec 07's consistent terminology and the existing
-  content test references that phrase; reframing it would churn shared vocabulary
-  for no gain.
- The **new verification step is phrased affirmatively** (e.g. *"Verify the grain
-  holds across each join"*) — an action the agent performs while composing, not a
-  warning. The two together satisfy both principles: a recognized anti-pattern
-  name *and* a positive habit.
-
-### One default with an escape hatch, not two equal options
-
-Anthropic: **"Avoid offering too many options… provide a default with an escape
-hatch."** The fix for an inflated aggregate is presented as exactly that:
-
- **Default: pre-aggregate the measure to its own grain in a CTE, then join the
-  already-aggregated result.** This is the single-hop fix generalized, and it is
-  the *only* correct fix for `SUM`/`AVG` — you cannot de-duplicate a summed
-  measure with `DISTINCT` (two legitimately-equal amounts would collapse).
- **Escape hatch: `COUNT(DISTINCT key)` — for a pure count only.** It rescues an
-  inflated count in one line, but must be stated as count-only, not as a general
-  remedy.
-
-This is the deepest correctness point in the spec and the easiest to get wrong; a
-naïve blanket "just use `COUNT(DISTINCT)`" is silently wrong for sums.
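The count-only caveat can be checked on a toy dataset. The sketch below is illustrative only: the `orders`/`order_lines` tables, their values, and the use of in-memory SQLite through Python's `sqlite3` module are assumptions made for the demonstration, not part of the spec or the skill.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, amount INTEGER);
CREATE TABLE order_lines (line_id  INTEGER PRIMARY KEY, order_id INTEGER, status TEXT);
-- Two different orders that happen to have the same amount.
INSERT INTO orders VALUES (100, 50), (101, 50);
-- Order 100 has two returned lines, order 101 has one.
INSERT INTO order_lines VALUES
  (1, 100, 'returned'), (2, 100, 'returned'), (3, 101, 'returned');
""")

def scalar(sql: str) -> int:
    """Run a single-value query and return that value."""
    return conn.execute(sql).fetchone()[0]

# Fanned out: order 100's amount is added once per returned line.
fanned_sum = scalar("""
SELECT SUM(o.amount)
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.status = 'returned'
""")

# DISTINCT "repair": the two legitimately equal amounts collapse into one.
distinct_sum = scalar("""
SELECT SUM(DISTINCT o.amount)
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.status = 'returned'
""")

# Default fix: reduce order_lines to one row per qualifying order, then join.
preaggregated_sum = scalar("""
WITH returned_orders AS (
  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
)
SELECT SUM(o.amount)
FROM orders o
JOIN returned_orders ro ON ro.order_id = o.order_id
""")

print(fanned_sum, distinct_sum, preaggregated_sum)
```

On this data the honest total is 100: the fanned-out sum over-counts, the `DISTINCT` variant under-counts, and only the pre-aggregated query matches.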
-
-### Consistent terminology
-
-Anthropic: **"Choose one term and use it throughout."** Reuse spec 07's existing
-vocabulary verbatim — **`grain`**, **`fan-out`**, **`pre-aggregate`** — do not
-introduce synonyms (e.g. do not rename the concept "row blow-up" or
-"multiplication factor"). Prose may vary, but the named concepts stay fixed.
-
-### Concise — the addition must justify its token cost
-
-Anthropic: **"Concise is key… does this paragraph justify its token cost?"** and
-"Claude is already very smart." The agent knows what a join and a `GROUP BY` are;
-the addition explains only the non-obvious trap (cumulative grain inflation) and
-shows the fix. Net addition is roughly one rewritten bullet, one new bullet, and
-one worked example — the skill stays comfortably under the 500-line budget
-(~117 lines today).
-
-### Examples over descriptions — exactly one
-
-Anthropic's "examples pattern": **"Examples help Claude understand the desired
-style and level of detail more clearly than descriptions alone"** and
-"examples are concrete, not abstract." The multishot guidance favors 3–5 examples
-in general, but here **conciseness and spec 07's one-example-per-rule economy
-win**: the skill already carries the window-then-filter example, so this adds
-**exactly one** compact wrong-vs-right example. The wrong/right contrast inside
-that single example supplies the diversity multishot calls for, at one example's
-token cost.
-
-### Leak-safety (hard constraint)
-
-The worked example must be a **synthetic, generic schema invented for teaching** —
-not the tables, column names, query, or numeric results of any Spider 2.0-Lite
-question. It demonstrates the *pattern* (a coarse-grain measure aggregated after a
-one-to-many join), which is universal and reconstructable from first principles. A
-reviewer must find nothing in it that ties it to a specific benchmark instance.
-See "Leak-safety" below.
-
-## Requirements
-
-Requirements 1 through 4 land in the **Composition** sub-heading of `<sql_craft>`
-in `packages/cli/src/skills/analytics/SKILL.md`; requirement 5 only annotates
-spec 07. Structure (chosen design): rewrite the existing fan-out bullet, add one
-affirmative verification bullet, add one worked example. Do not touch the other
-four sub-headings or `<workflow>`/`<rules>`/`<examples>`.
-
-### 1. Generalize the fan-out rule to multi-hop chains
-
-Rewrite the existing **`Avoid fan-out joins.`** bullet so it makes explicit that
-the danger is **cumulative**: *any* one-to-many hop on the path between a measure's
-owning table and the aggregate inflates that measure, **even when the offending
-join is several hops away from the `SUM`/`COUNT`**. The fix is the same as the
-single-hop case — **pre-aggregate the measure to its own grain in a CTE, then join
-the already-aggregated result** — but the agent must apply it **per
-measure-owning table along the whole chain**, not just at the final join. Keep the
-`fan-out` term and the one-line *why*.
-
-### 2. Add an affirmative grain-verification habit
-
-Add a companion bullet, phrased as an action the agent performs **while
-composing** (not a prohibition):
-
- Confirm that a join intended to be one-to-one / many-to-one **did not change the
-  grain** it aggregates at — e.g. check that the row count (or the count of the
-  aggregate's key) is unchanged across that join.
- When a join is genuinely one-to-many, **reach for the default fix
-  (pre-aggregate to grain)**; for a **pure count**, `COUNT(DISTINCT key)` is an
-  acceptable escape hatch.
- State the caveat once: **`SUM`/`AVG` of a fanned-out measure must pre-aggregate**
-  — `DISTINCT` cannot de-duplicate a sum.
-
-This is spec 07's "build incrementally and check each layer" discipline pointed
-specifically at grain preservation, in affirmative form.
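One possible shape for the grain check, sketched under assumptions: the tables and the `grain_counts` helper are hypothetical, and SQLite through Python's `sqlite3` is used only so the check can be run. It compares the row count with the count of the aggregate's key before and after each join.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stores      (store_id INTEGER PRIMARY KEY);
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, store_id INTEGER);
CREATE TABLE order_lines (line_id  INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO stores VALUES (10);
INSERT INTO orders VALUES (100, 10), (101, 10);
INSERT INTO order_lines VALUES (1, 100), (2, 100), (3, 101);
""")

def grain_counts(from_clause: str) -> tuple:
    """Return (row count, distinct order count) for a FROM clause that aliases orders as o."""
    sql = f"SELECT COUNT(*), COUNT(DISTINCT o.order_id) {from_clause}"
    return tuple(conn.execute(sql).fetchone())

# Baseline: the order grain before any join.
base = grain_counts("FROM orders o")

# Many-to-one hop (each order has one store): the grain is unchanged.
after_stores = grain_counts(
    "FROM orders o JOIN stores s ON s.store_id = o.store_id"
)

# One-to-many hop (an order has several lines): rows now exceed distinct orders.
after_lines = grain_counts(
    "FROM orders o JOIN order_lines l ON l.order_id = o.order_id"
)

print(base, after_stores, after_lines)
```

Comparing `COUNT(*)` with `COUNT(DISTINCT key)` on the joined result is somewhat more robust than comparing total row counts alone, because an inner join can drop some rows while duplicating others and leave the total unchanged.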
-
-### 3. One concrete, generic multi-hop worked example
-
-Add **exactly one** compact wrong-vs-right `sql` example inside `<sql_craft>`
-demonstrating the multi-hop inflation and the pre-aggregate fix. It is the
-**second** `sql` fence in the skill (the first is spec 07's window-then-filter
-example).
-
-**Required properties** (these are the constraints; the SQL below is orientation):
-
- **Multi-hop chain** where the inflating one-to-many hop is **≥1 join removed**
-  from the aggregate (not the single-hop case spec 07 already covers).
- **Unambiguous attribution**: each counted entity maps to **exactly one** group,
-  so the honest answer is well-defined. (This rules out "coarse measure attributed
-  to a fine dimension reached by descending," where one entity spans several
-  groups and the correct number is itself ambiguous — that would teach a murky
-  pattern.)
- **Motivated descent**: the finer-grain table is joined for a real reason (a
-  line-level filter or a needed line-level value), so the reader sees *why* the
-  fan-out join is there.
- **Plain `COUNT`/`SUM`**, not `AVG` — averaging collides with the existing
-  *Macro vs micro average* bullet and would muddy the fan-out lesson.
- The **RIGHT side demonstrates the default fix** (pre-aggregate to grain in a
-  CTE) and is **actually correct**, not merely runnable — its number must equal the
-  honest answer, not just avoid an error.
- Generic invented schema, standard dialect-agnostic SQL (no `QUALIFY`, no dialect
-  functions), no benchmark identifiers or values.
-
-**Recommended sketch** (implementer may adjust within the properties above):
-
-```sql
-- "How many orders per region contain a returned item?"
-- WRONG: joining order_lines to apply the line-level filter multiplies orders —
-- an order with two returned lines is counted twice, three joins below the COUNT.
-SELECT r.region_id, COUNT(*) AS n_orders
-FROM regions r
-JOIN stores s      ON s.region_id = r.region_id
-JOIN orders o      ON o.store_id  = s.store_id
-JOIN order_lines l ON l.order_id  = o.order_id
-WHERE l.status = 'returned'
-GROUP BY r.region_id;
-
-- RIGHT: collapse order_lines to one row per qualifying order first, then join up.
-WITH returned_orders AS (
-  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
-)
-SELECT r.region_id, COUNT(*) AS n_orders
-FROM regions r
-JOIN stores s           ON s.region_id  = r.region_id
-JOIN orders o           ON o.store_id   = s.store_id
-JOIN returned_orders ro ON ro.order_id  = o.order_id
-GROUP BY r.region_id;
-- A pure count could also use COUNT(DISTINCT o.order_id); a SUM/AVG of an
-- order-level measure fanned out this way must pre-aggregate — DISTINCT can't
-- de-duplicate a sum.
-```
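To see that the RIGHT side returns the honest number and not merely a runnable query, the two statements from the sketch can be run against a tiny invented dataset. This is a hedged check, not part of the skill: the rows below are made up, and in-memory SQLite via Python's `sqlite3` is assumed purely as a convenient harness.

```python
import sqlite3

# Hypothetical toy data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions     (region_id INTEGER PRIMARY KEY);
CREATE TABLE stores      (store_id  INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders      (order_id  INTEGER PRIMARY KEY, store_id  INTEGER);
CREATE TABLE order_lines (line_id   INTEGER PRIMARY KEY, order_id  INTEGER, status TEXT);
INSERT INTO regions VALUES (1);
INSERT INTO stores  VALUES (10, 1);
INSERT INTO orders  VALUES (100, 10), (101, 10), (102, 10);
-- Order 100 has two returned lines, order 101 has one, order 102 has none.
INSERT INTO order_lines VALUES
  (1, 100, 'returned'), (2, 100, 'returned'), (3, 100, 'shipped'),
  (4, 101, 'returned'), (5, 102, 'shipped');
""")

WRONG = """
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s      ON s.region_id = r.region_id
JOIN orders o      ON o.store_id  = s.store_id
JOIN order_lines l ON l.order_id  = o.order_id
WHERE l.status = 'returned'
GROUP BY r.region_id
"""

RIGHT = """
WITH returned_orders AS (
  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s           ON s.region_id = r.region_id
JOIN orders o           ON o.store_id  = s.store_id
JOIN returned_orders ro ON ro.order_id = o.order_id
GROUP BY r.region_id
"""

wrong_rows = conn.execute(WRONG).fetchall()
right_rows = conn.execute(RIGHT).fetchall()
print("wrong:", wrong_rows)  # one row per returned line, so order 100 counts twice
print("right:", right_rows)  # one row per order that contains a returned line
```

Only orders 100 and 101 contain a returned item, so the honest count for region 1 is 2; the WRONG query reports 3 because order 100 contributes two returned lines.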
-
-### 4. Placement and structure
-
- Both bullets live under the existing **Composition** sub-heading; the example
-  follows them. The five-sub-heading structure spec 07 established is unchanged.
- **State each rule once** (Anthropic "consistent terminology / don't repeat"):
-  do not also restate the multi-hop rule in `<workflow>` steps 5/6 — those already
-  carry a one-line pointer into `<sql_craft>`, which is sufficient.
-
-### 5. Coordination with spec 07 (supersession)
-
-Spec 07's requirement 3 and acceptance criteria say the skill contains **exactly
-one** worked example and "Do not add a second example." **This spec supersedes
-that constraint**: the skill now carries **two** `sql` worked examples
-(window-then-filter from spec 07, plus this multi-hop fan-out example). Annotate
-spec 07 at those two spots with a one-line "superseded by spec 09" note so the two
-permanent specs do not contradict. No other spec 07 content changes.
-
-## Leak-safety (hard constraint on this spec and its example)
-
-The benchmark's gold answers must never appear in ktx. The worked example must be
-a **synthetic, generic schema invented for teaching** — not the tables, column
-names, query, or numeric results of any Spider 2.0-Lite question. The example
-demonstrates the *pattern* (a coarse-grain measure counted after a one-to-many
-join), which is universal; it must be reconstructable from first principles by
-anyone, with zero reference to benchmark data. A reviewer should be able to read
-the example and find nothing that ties it to a specific benchmark instance.
-
-## Acceptance criteria
-
- The `<sql_craft>` **Composition** section states the **multi-hop generalization**
-  of the fan-out rule (cumulative danger across the chain; pre-aggregate per
-  measure-owning table) and an **affirmative grain-verification habit**, inline and
-  dialect-agnostic.
- The fix is presented as **default (pre-aggregate to grain) + escape hatch
-  (`COUNT(DISTINCT key)`, count-only)**, with the explicit caveat that `SUM`/`AVG`
-  of a fanned-out measure must pre-aggregate.
- Exactly **one** new, **generic** worked example (wrong vs. pre-aggregated-right)
-  using an invented schema, with no benchmark-derived identifiers or values, whose
-  RIGHT side is actually correct (unambiguous attribution; honest number).
- The skill now contains **two** `sql` worked examples total; the existing content
-  test's fence-count assertion is updated `1 → 2` and new assertions cover the
-  multi-hop rule phrase and the grain-verification-habit phrase.
- Terminology is consistent with spec 07 (`grain`, `fan-out`, `pre-aggregate`); no
-  synonyms introduced.
- **No new tool, flag, or config.** Skill-content only; additive to spec 07.
- All spec 07 invariants still hold: the skill remains dialect-agnostic (no
-  `QUALIFY`/`strftime`/`julianday`, no backtick three-part FQTN, no relative-time
-  anchoring to a `MAX(...)` date) and free of any benchmark/grader/gold reference,
-  including in the new example; `<workflow>`/`<rules>`/`<examples>` and the other
-  four sub-headings are intact; frontmatter still parses through
-  `SkillsRegistryService.parseFrontmatter`; the skill stays under 500 lines.
- Spec 07's "exactly one example" constraint is annotated as superseded (no
-  contradiction between the two permanent specs).
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the prose.
-
- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md` →
-  `<sql_craft>` → **Composition**. Rewrite the `Avoid fan-out joins` bullet, add
-  the affirmative grain-verification bullet, add the one worked example after them.
-  Leave the other four sub-headings, `<workflow>`, `<rules>`, and `<examples>`
-  unchanged.
- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`. Update the
-  "ships exactly one … worked example" test: `match(/```sql/g)` length `1 → 2`,
-  add an assertion for the new fan-out example's distinctive tokens (e.g.
-  `WITH returned_orders AS`), add the multi-hop-rule and grain-verification-habit
-  phrases to the behavior-presence list, and keep all banned-construct and
-  size-budget guards. This is a content assertion over the source `SKILL.md` — the
-  right level for prompt content.
- **Spec 07 annotation:** add a one-line "superseded by spec 09" note at spec 07's
-  requirement 3 and at its "Exactly one new worked example" acceptance bullet.
- **Rebuild/re-link** the dev binary so the playground picks up the change:
-  `pnpm run build && pnpm run link:dev` (provides `ktx-dev`).
-
-## Benchmark context (motivation only)
-
-Multi-hop aggregation questions (counting/averaging a coarse-grained measure
-reached through several one-to-many joins) are a recurring source of
-result-mismatch failures in the SQLite subset: the agent produces runnable SQL
-with the right tables but a fan-out-inflated number. These are correctness
-failures, not knowledge or schema-discovery failures (zero execution errors in the
-latest run), so the fix belongs in the product's authoring craft — where it also
-helps any real analyst — not in a benchmark-specific prompt. The skill itself must
-contain no trace of the benchmark.
-
-## Implementation notes
-
-Shipped as specified — additive, content-only, no new tool/flag/config.
-
- **`packages/cli/src/skills/analytics/SKILL.md`** → `<sql_craft>` → **Composition**:
-  - Rewrote the `Avoid fan-out joins` bullet to `**Avoid fan-out joins — the
-    danger is cumulative.**`, generalizing to multi-hop chains: any one-to-many
-    hop between a measure's owning table and the aggregate inflates that measure
-    even when several hops below the `SUM`/`COUNT`; fix is pre-aggregate per
-    measure-owning table along the whole chain. Kept the `fan-out` term and the
-    one-line *why*.
-  - Added the affirmative `**Verify the grain holds across each join.**` bullet:
-    confirm a one-to-one / many-to-one join did not change the grain (row/key
-    count unchanged); default fix is pre-aggregate to grain, escape hatch is
-    `COUNT(DISTINCT key)` for a pure count only; stated once that `SUM`/`AVG` of a
-    fanned-out measure must pre-aggregate because `DISTINCT` cannot de-duplicate a
-    sum.
-  - Added one generic wrong-vs-right worked example (orders→regions via
-    stores/order_lines, `WITH returned_orders AS …`) — the second `sql` fence in
-    the skill. The inflating hop is three joins below the `COUNT`; the RIGHT side
-    pre-aggregates `order_lines` to one row per qualifying order so each order is
-    counted once (honest answer), and the trailing comment names the count-only
-    `COUNT(DISTINCT o.order_id)` escape hatch plus the `SUM`/`AVG` caveat. Invented
-    schema, dialect-agnostic SQL, no benchmark identifiers/values.
-  - The other four sub-headings and `<workflow>`/`<rules>`/`<examples>` are
-    untouched. Skill is 147 lines (well under the 500-line budget).
- **`packages/cli/test/skills/analytics-skill-content.test.ts`**: sql-fence count
-  `1 → 2`; added the multi-hop phrase (`the danger is cumulative`) and the
-  grain-verification phrase (`Verify the grain holds across each join`) to the
-  behavior-presence list; added new-example token assertions
-  (`WITH returned_orders AS`, `COUNT(DISTINCT o.order_id)`). All banned-construct,
-  relative-time, and size-budget guards retained. Test file passes (9/9).
- **Spec 07** annotated as superseded at requirement 3 and at its "exactly one
-  worked example" acceptance bullet — no contradiction between the two permanent
-  specs.
-
-**Verification:** `vitest run test/skills/analytics-skill-content.test.ts` → 9
-passed. `pnpm run build` (src `tsc -p tsconfig.json`) succeeds and the built
-`dist/skills/analytics/SKILL.md` carries the new content; `pnpm run link:dev`
-re-linked `ktx-dev`. A pre-existing, unrelated type error in
-`test/mcp-server-factory.test.ts` (`KtxMcpContextPorts`/`context_tool`, last
-touched in commit `2677b3ef`) surfaces under the full `type-check`'s
-`tsconfig.test.json` pass; it is outside this change's surface and not introduced
-here.
--- a/spider2-specs/specs/10-panel-completeness-spine.md
+++ b/spider2-specs/specs/10-panel-completeness-spine.md
@ -1,289 +0,0 @@
-# Panel/period completeness — emit the full set of groups, not only the populated ones
-
-> Refined spec. Intake draft: `todo/10-panel-completeness-spine.md`.
-
-## Problem
-
-When a question asks for a result *per period* or *per category* ("orders for
-each month of 2023", "revenue by region", "count per status"), a plain `GROUP BY`
-only returns groups that actually have rows. Periods or categories with **zero**
-activity silently vanish, so a "12 months" answer comes back with 9 rows and the
-three that should read `0` are simply absent. The SQL is runnable and the
-aggregate is right, but the **panel is incomplete** — and a monthly report with
-missing months or a category breakdown missing its empty categories is wrong for
-any analyst, on any database.
-
-The existing `<sql_craft>` "Answer completeness / interpretation" group already
-carries a *"For each X / per X / by X returns exactly one row per X"* rule, but
-that rule only governs **grain** (don't collapse to a single value). It says
-nothing about the **domain**: "one row per X" today means one row per *observed*
-X, so empty groups still drop. This spec sharpens that rule from grain-only to
-grain-and-completeness.
-
-## Generic use case (independent of any benchmark)
-
-"How many orders were placed in each month of 2023?" must return **12 rows** even
-if March had no orders (March = 0), not 11. "Sales per region" should include
-regions with no sales when the question asks for *each* region. Both are
-bread-and-butter reporting for any analyst on any warehouse, with no benchmark in
-sight.
-
-## Model
-
-The feature splits across **two surfaces**, each holding the half it is suited
-for. This split is the central design decision and exists to satisfy spec 07's
-hard dialect-agnostic invariant without weakening it.
-
-### Why two surfaces (the dialect-agnostic reconciliation)
-
-The draft asked for a *"recursive-CTE date spine"* worked example. But a real
-date/number series is **inherently dialect-specific** — Postgres `generate_series`,
-SQLite recursive `date(d,'+1 month')`, BigQuery `GENERATE_DATE_ARRAY`, Snowflake
-`GENERATOR`+`DATEADD` — and spec 07 made `<sql_craft>` strictly dialect-agnostic
-(the analytics-skill content test bans single-dialect constructs). Inlining a date
-spine would violate that invariant; carving out a test exception would erode it.
-
-ktx already has the canonical home for engine-specific syntax: the per-dialect
-notes in `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by
-the `sql_dialect_notes` MCP tool (spec 08). Those files answer a fixed rubric
-(FQTN / Identifiers / Date-time / Top-N / JSON) — but **series/spine generation is
-not in that rubric yet**. So the date-spine syntax belongs *there*, alongside the
-other per-dialect idioms, and the dialect-agnostic skill points to it. This
-routes the dialect-specific half through the existing channel rather than
-standing up a parallel dialect-specific recipe inside the skill.
-
-Surface 1 (skill) carries the **pattern**; surface 2 (dialect notes) carries the
-**concrete series syntax**.
-
-### Additive, inline, heuristic-with-a-why
-
-Consistent with spec 07: the skill change is **additive content in one Markdown
-file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/` file — the
-delivery mechanism in `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic,
-and phrased as a **heuristic with a one-line generic rationale**, not a wall of
-MUSTs. The dialect-notes change is additive content in the seven existing
-`dialects/*.md` files. No new tool, flag, or config on either surface.
-
-## Requirements
-
-### 1. Skill surface — `<sql_craft>` "Answer completeness / interpretation"
-
-Add the panel-completeness rule to the existing group (it extends, and should sit
-adjacent to, the *"For each X / per X / by X"* bullet). It must cover:
-
-1. **Recognize the full-panel cue.** *each / every / all / per <period> / for all
-   <category> / by month* signals that the answer's row set should be the
-   **complete expected domain** of periods or categories in scope, not just those
-   present in the filtered fact rows. *Why:* a plain `GROUP BY` over the fact
-   rows can only emit groups that have at least one fact row.
-
-2. **Spine → LEFT JOIN → COALESCE.** Build the full set of expected groups (the
-   **spine**), then LEFT JOIN the aggregated facts onto it:
-   - **Category/dimension spine:** the distinct values from the **domain-defining
-     dimension/entity table** (e.g. all regions from a `regions` table), *not*
-     `SELECT DISTINCT region FROM facts` — the latter yields only categories that
-     already occur, so a zero-activity category still drops. When no dimension
-     table exists, the distinct values from the **unfiltered** fact table are the
-     best available domain (with the residual caveat that a category which never
-     occurs at all cannot surface).
-   - **Period/number spine:** generate the series for the question's stated range
-     (e.g. each month of 2023 → Jan..Dec 2023). The series bounds come from the
-     question's explicit range; when the range is "all periods present," derive
-     bounds from `MIN`/`MAX` over the **unfiltered** facts. The concrete
-     series-generation syntax is per-dialect — the rule points the author to
-     `sql_dialect_notes` (see requirement 2) and shows no inline series SQL.
-
-3. **COALESCE by measure additivity.** Default missing measures with
-   `COALESCE(metric, 0)` for **additive** measures (a `COUNT` or `SUM` of events
-   or amounts — "no activity" genuinely reads as 0). Leave **non-additive**
-   measures (`AVG`, a running balance, a price, a rate, a ratio) as **NULL** —
-   absence is "no data," and 0 would be a wrong reading. *Why:* 0 is a real value
-   only for additive measures.
-
-4. **Don't over-apply (the each-vs-which guard).** When the question asks only
-   about groups that exist ("*which* months had orders", "regions that made a
-   sale"), the spine is unnecessary and wrong — emit only observed groups. The cue
-   is *each / all / every* (complete domain) vs *which / that have* (observed
-   subset).
-
-5. **One worked example — the category spine, fully portable.** Add **exactly
-   one** compact before/after example demonstrating the pattern with a
-   **distinct-dimension spine**: the wrong shape (`GROUP BY` over facts, empty
-   groups missing) and the right shape (`SELECT DISTINCT` domain from the
-   dimension table → LEFT JOIN aggregated facts → `COALESCE(metric, 0)`). Generic
-   table/column names, standard SQL only — no series generation, no dialect
-   functions, so the example stays dialect-clean. The period-spine variant is
-   described in prose (requirement 2) and delegated to `sql_dialect_notes`; it
-   gets **no** inline example. This is the **third** worked `sql` example in the
-   skill (after spec 07's window-then-filter and spec 09's multi-hop fan-out).
-
-6. **Step pointer, no duplication.** The validate/explain step (and/or the query
-   step) already points into `<sql_craft>` for answer-completeness; extend that
-   existing pointer's wording if needed, but state the rule **once** inside
-   `<sql_craft>`. The step-5 pointer that lists what `sql_dialect_notes` provides
-   ("FQTN, identifier-quoting, date, top-N, and JSON conventions") should also
-   name the **series/calendar** convention now that it exists.
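A minimal sketch of the category-spine pattern from items 2, 3, and 5, under stated assumptions: the `regions` and `orders` tables and their rows are invented, and the queries run on in-memory SQLite through Python's `sqlite3` only so the result can be inspected. The SQL itself avoids series generation and dialect functions.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, region_id INTEGER, amount INTEGER);
INSERT INTO regions VALUES (1, 'North'), (2, 'South'), (3, 'West');
-- West (region 3) has no orders at all.
INSERT INTO orders VALUES (100, 1, 10), (101, 1, 30), (102, 2, 5);
""")

# Observed-only shape: GROUP BY over facts cannot emit the empty region.
observed_only = conn.execute("""
SELECT region_id, COUNT(*) AS n_orders
FROM orders
GROUP BY region_id
ORDER BY region_id
""").fetchall()

# Full panel: spine from the dimension table, LEFT JOIN the aggregated facts,
# COALESCE the additive count to 0, leave the non-additive average as NULL.
full_panel = conn.execute("""
WITH spine AS (
  SELECT DISTINCT region_id FROM regions
),
agg AS (
  SELECT region_id, COUNT(*) AS n_orders, AVG(amount) AS avg_amount
  FROM orders
  GROUP BY region_id
)
SELECT s.region_id, COALESCE(a.n_orders, 0) AS n_orders, a.avg_amount
FROM spine s
LEFT JOIN agg a ON a.region_id = s.region_id
ORDER BY s.region_id
""").fetchall()

print(observed_only)
print(full_panel)
```

The empty region appears in the full panel with a count of 0 and a NULL average (`None` in Python), which matches the additive versus non-additive split in item 3.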
-
-### 2. Dialect-notes surface — `dialects/*.md`
-
-Add a **"Series"** (date/number range) line to **each** of the seven authored
-dialect files, giving that engine's idiomatic way to generate a contiguous
-date or integer series for use as a spine. Each note is engine-exclusive — a
-SQLite analyst gets the SQLite idiom and never another engine's construct, per the
-existing dialect-notes leak guards. Orientation (exact syntax is the
-implementer's):
-
- **postgres:** `generate_series('2023-01-01'::date, '2023-12-01'::date, interval '1 month')`.
- **sqlite:** recursive CTE — `WITH RECURSIVE m(d) AS (SELECT '2023-01-01' UNION ALL SELECT date(d,'+1 month') FROM m WHERE d < '2023-12-01')`.
- **bigquery:** `UNNEST(GENERATE_DATE_ARRAY('2023-01-01','2023-12-01', INTERVAL 1 MONTH))` (and `GENERATE_ARRAY` for integers).
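- **snowflake:** `TABLE(GENERATOR(ROWCOUNT => n))` with `DATEADD('month', SEQ4(), start)`, or a recursive CTE. (`SEQ4()` is not guaranteed to be gap-free; `ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1` is the safer month offset.)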
- **mysql:** recursive CTE (8.0+) with `DATE_ADD(d, INTERVAL 1 MONTH)`.
- **clickhouse:** `numbers(n)` / `range(n)` with `addMonths(start, number)` (or `arrayJoin`).
- **tsql:** recursive CTE with `DATEADD(month, …)`, or a numbers/tally table.
-
-This line is what makes the period spine usable from the dialect-agnostic skill,
-and it is also consumed by **spec 11** (rolling-window-over-gappy-dates needs the
-same date spine) — so it is foundational, not scope creep.
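As an illustration of how one such Series line feeds the period spine, the sketch below runs the SQLite orientation idiom from the list above and LEFT JOINs monthly counts onto it. The `orders` table and its rows are hypothetical, and the month bucketing with `date(order_date, 'start of month')` is a SQLite-only choice made for this demonstration, so it belongs with the dialect notes and not in the skill.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
-- Orders exist only in January and February 2023.
INSERT INTO orders VALUES
  (100, '2023-01-15'), (101, '2023-01-20'), (102, '2023-02-03');
""")

rows = conn.execute("""
WITH RECURSIVE m(d) AS (
  -- SQLite series idiom: first day of each month in 2023.
  SELECT '2023-01-01'
  UNION ALL
  SELECT date(d, '+1 month') FROM m WHERE d < '2023-12-01'
),
monthly AS (
  -- Aggregate facts to the month grain before joining onto the spine.
  SELECT date(order_date, 'start of month') AS d, COUNT(*) AS n_orders
  FROM orders
  GROUP BY date(order_date, 'start of month')
)
SELECT m.d, COALESCE(monthly.n_orders, 0) AS n_orders
FROM m
LEFT JOIN monthly ON monthly.d = m.d
ORDER BY m.d
""").fetchall()

for month_start, n_orders in rows:
    print(month_start, n_orders)
```

The result has one row per month of 2023, with 0 for every month that has no orders, instead of only the two populated months.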
-
-### 3. Coordination with spec 11
-
-Spec 11 (time-series window recipes) explicitly depends on this date spine for the
-gappy-rolling case ("build a complete date spine first (see spec 10)"). Spec 10
-establishes the spine concept in the Answer-completeness group and the
-series syntax in the dialect notes; spec 11 reuses both from the Window-functions
-group. Keep the two non-overlapping: spec 10 owns the spine; spec 11 references it.
-
-## Leak-safety (hard constraint)
-
-Any worked example or note must use a **synthetic generic schema** (e.g. an
-`orders` table with an `order_date`, a `regions` dimension) and demonstrate only
-the *pattern* (spine + LEFT JOIN + COALESCE). **No** benchmark table names, SQL,
-or result values on either surface. The dialect-notes additions, like the existing
-notes, carry no benchmark/grader/version-dated content. The behavior is
-reconstructable from first principles and tied to no specific instance.
-
-## Acceptance criteria
-
- `<sql_craft>` "Answer completeness / interpretation" states: the full-panel cue,
-  the spine → LEFT JOIN → COALESCE recipe, the additive-vs-non-additive COALESCE
-  discriminator (0 vs NULL), and the each-vs-which over-application guard —
-  inline, dialect-agnostic, each with a generic *why*.
- Exactly **one** new worked `sql` example is present, a portable
-  distinct-dimension spine (`SELECT DISTINCT` domain → LEFT JOIN → `COALESCE`),
-  with no series generation and no dialect-specific syntax. The skill then carries
-  **three** `sql` worked examples total.
- Each of the seven `dialects/*.md` files gains a **Series** (date/number range)
-  line in its engine's own idiom; no engine leaks another engine's construct, and
-  the additions contain no benchmark/grader/version-dated content.
- The skill remains dialect-clean: no `QUALIFY`, `strftime`, `julianday`,
-  `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, or other
-  single-dialect construct anywhere in `SKILL.md`, including the new example.
- The existing interactive guidance (`<workflow>`, `<rules>`, the other examples)
-  and the existing dialect-note rubric lines are intact and uncontradicted.
- No grader/benchmark reference, no output-shape contract, and no anchoring of
-  *relative* time ("recent" / "past N months") to a `MAX(date)` over the data
-  appears (period-spine bounds derive from the question's explicit range or, for
-  "all periods present," from `MIN`/`MAX` over the facts — which is range
-  derivation, not relative-time anchoring).
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
-  still parses as `ktx-analytics`.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the prose.
-
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the
-  panel-completeness bullets to the Answer-completeness group, the single category
-  spine example, and extend the existing step pointer / dialect-notes provision
-  list to name the series convention. Leave `<workflow>`/`<rules>`/other examples
-  intact. Delivery is unchanged (single `SKILL.md` per target via
-  `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change required.
- **Dialect notes:** the seven files under
-  `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
-  `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
-  `copy-runtime-assets.mjs` — no plumbing change, content only.
- **Tests:**
-  - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
-    representative phrase for the completeness rule; bump the `sql`-fence count
-    assertion **2 → 3**; assert the spine + LEFT JOIN + `COALESCE` shape; the
-    existing dialect-clean guards already cover the no-inline-series requirement
-    (the example is `SELECT DISTINCT`, so they pass unchanged).
-  - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the rubric loop
-    (the "answers the full rubric for every dialect" test) so every dialect must
-    also answer a **Series** line, e.g. `expect(notes).toMatch(/\*\*Series/)`.
-    Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
-    all seven without a hand-maintained list.
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
-  `pnpm run build && pnpm run link:dev`.
-
-## Benchmark context (motivation only)
-
-Per-period / per-category questions where some periods are empty produce
-short-row result mismatches in the SQLite subset, and the related rolling/cumulative
-cluster (spec 11) needs a complete date spine to be correct at all. The fix is a
-universal reporting habit (complete panels) plus the per-dialect series syntax
-that makes it executable — both belong in the product, where they help real
-analysts. Improving the benchmark score is a side effect; the skill and the
-dialect notes contain no trace of the benchmark.
-
-## Implementation notes
-
-Shipped on branch `write-feature-spec-wiki`. Content-only across two surfaces, no
-new tool/flag/config, no plumbing change.
-
-**Surface 1 — skill (`packages/cli/src/skills/analytics/SKILL.md`):**
- Added a **"Complete the panel for 'each / every / all / per <period or
-  category>'"** bullet to the `<sql_craft>` "Answer completeness / interpretation"
-  group, directly after the *"For each X / per X / by X"* bullet, with three
-  sub-bullets carrying the rest of the rule each with its generic *why*: **Spine
-  source** (distinct domain from the dimension/entity table — not `SELECT DISTINCT`
-  over the facts; period/number series across the question's stated range, bounds
-  from `MIN`/`MAX` over the *unfiltered* facts for "all periods present"; series
-  syntax delegated to `sql_dialect_notes`), **Default by additivity**
-  (`COALESCE(metric, 0)` for additive measures, `NULL` for non-additive), and
-  **Don't over-apply** (the each-vs-which guard).
- Added **one** worked `sql` example at the end of the Answer-completeness group: a
-  portable distinct-dimension spine (`SELECT DISTINCT region_id FROM regions` →
-  `LEFT JOIN` aggregated facts → `COALESCE(ro.n_orders, 0)`), wrong-vs-right,
-  standard SQL only, no series generation, no dialect functions. The skill now
-  carries **three** `sql` worked examples.
- Extended the step-5 dialect-notes pointer to name the **series/calendar**
-  convention alongside FQTN / identifier-quoting / date / top-N / JSON.
- Delivery unchanged: `readAnalyticsSkillContent` in `setup-agents.ts` ships the
-  single `SKILL.md` per target — confirmed, no change.
-
-**Surface 2 — dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`):**
- Added a `- **Series:**` line to all seven authored files (postgres, sqlite,
-  bigquery, snowflake, mysql, clickhouse, tsql), each in that engine's own idiom
-  (`generate_series`; recursive CTE with `date(d,'+1 month')`;
-  `UNNEST(GENERATE_DATE_ARRAY(...))`; `GENERATOR`/`SEQ4`/`DATEADD`; recursive CTE
-  with `DATE_ADD`; `numbers(n)`/`addMonths`; recursive CTE with `DATEADD` +
-  `MAXRECURSION`), placed right after each file's Date/time line. No cross-engine
-  leak, no version-dated/benchmark content. Shipped to `dist` unchanged by
-  `copy-runtime-assets.mjs`; coverage stays derived from `DIALECTS_WITH_NOTES`.
-
-**Tests:**
- `test/skills/analytics-skill-content.test.ts`: added the `Complete the panel`
-  and `Default by additivity` phrases; renamed the worked-examples test and bumped
-  the `sql`-fence count **2 → 3**; asserted the spine + `LEFT JOIN` + `COALESCE`
-  shape. Also added `generate_series` and `GENERATE_DATE_ARRAY` to the
-  dialect-clean banned list — a deliberate **strengthening** beyond the spec's
-  test orientation so the "no inline series" acceptance criterion is *enforced*,
-  not merely incidentally true of a `SELECT DISTINCT` example.
- `test/context/mcp/dialect-notes.test.ts`: extended the "answers the full rubric
-  for every dialect" loop with `expect(notes).toMatch(/\*\*Series/)`, so all seven
-  dialects are required to answer a Series line (coverage derived from
-  `DIALECTS_WITH_NOTES`, no hand-maintained list).
-
-**Verification:** both affected test files pass (19 tests). `src` type-check and
-`pnpm run build` are clean, and `copy-runtime-assets.mjs` placed the Series line in
-all seven `dist` dialect files; `pnpm run link:dev` re-linked `ktx-dev`. Note: an
-unrelated, pre-existing `tsconfig.test.json` type error in
-`test/mcp-server-factory.test.ts` exists on this branch — untouched by this work
-and outside its scope.
-
-**Coordination with spec 11:** the per-dialect Series line is the foundational
-date spine that spec 11 (rolling/cumulative windows over gappy dates) references.
-Spec 10 owns the spine (Answer-completeness group + dialect Series notes); spec 11
-will reference it from the Window-functions group. No overlap introduced.
--- a/spider2-specs/specs/11-time-series-window-recipes.md
+++ b/spider2-specs/specs/11-time-series-window-recipes.md
@ -1,391 +0,0 @@
-# Time-series window craft — running totals, rolling-over-time (min-periods), period-over-period
-
-> Refined spec. Intake draft: `todo/11-time-series-window-recipes.md`.
-
-## Problem
-
-A large share of analytics questions are time-series shaped: a **running /
-cumulative balance**, a **rolling N-day average**, or **period-over-period
-growth**. The agent already knows window functions exist — spec 07 gave the
-`<sql_craft>` "Window functions" group its determinism and window-then-filter
-rules, and spec 10 added panel/period completeness — but it still gets the
-*time-series specifics* wrong:
-
- a cumulative balance computed **without an explicit unbounded-preceding
-  frame**, or with the implicit frame misbehaving when there are **ties on the
-  order key**;
- "rolling 30 days" implemented as `ROWS BETWEEN 29 PRECEDING AND CURRENT ROW`
-  over **gappy** daily data, so the window covers the wrong calendar span when
-  days are missing;
- no **minimum-periods** handling — a rolling average reported before the window
-  is actually full;
- "growth vs the previous period" written **without `LAG`** (or against the wrong
-  neighbor), with an **unguarded** `(cur - prev) / prev` that breaks on a zero or
-  absent prior.
-
-These are runnable-but-wrong: the structure is close, the edge case diverges.
-It is the same failure shape spec 07 addressed at the general level; this spec
-adds the time-series specifics to the **same Window-functions group**, building
-on the rules already there rather than restating them.
-
-## Generic use case (independent of any benchmark)
-
- "Each account's month-end running balance over 2023" — a cumulative sum of
-  monthly net over an ordered window.
- "30-day rolling average of daily revenue, only once 30 days of history exist."
- "Month-over-month revenue growth rate."
-
-All three are bread-and-butter for any analyst on any time-series table, with no
-benchmark in sight. The methodology is universal analyst craft, so it belongs in
-the shipped skill — it transfers to every ktx user querying a live database.
-
-## Model
-
-The change is **additive content across two surfaces** — the same split spec 10
-made, and for the same reason. The split is the central design decision; it
-satisfies spec 07's hard dialect-agnostic invariant for `<sql_craft>` without
-weakening it.
-
-### Why two surfaces (the dialect-agnostic reconciliation)
-
-Two of the three recipes are **pure standard SQL** and stay entirely in the
-dialect-agnostic skill:
-
- **Cumulative / running total** — `SUM(x) OVER (... ROWS BETWEEN UNBOUNDED
-  PRECEDING AND CURRENT ROW)` is standard on every engine.
- **Period-over-period** — `LAG(metric) OVER (...)`, the growth ratio, and a
-  `NULLIF`-style divide-by-zero guard are standard on every engine.
-
-The third recipe — a **rolling window over calendar time** — has one piece that
-is genuinely dialect-divergent: the **calendar-range window frame**. A native
-range frame such as `RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`
-exists on some engines (e.g. postgres, mysql 8) but **not others** — sqlite has
-no date-interval range frame, and SQL Server has **no offset `RANGE` frames at
-all**; bigquery's `RANGE` frames are numeric-only. So a portable skill cannot
-inline a range frame any more than it could inline a date-series generator.
-
-ktx already routes that kind of engine-specific syntax through the per-dialect
-notes in `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by
-the `sql_dialect_notes` MCP tool (spec 08). Spec 10 established the precedent
-exactly: series/spine generation was not in the dialect rubric, so it was added
-there (the **Series** line) and the dialect-agnostic skill points to it.
-Rolling-window framing is the next construct in that same position — not in the
-rubric yet, dialect-specific — so the **rolling-window idiom belongs in the
-dialect notes**, and the skill points to it.
-
-Surface 1 (skill) carries the **pattern** (calendar range, not a row count; the
-min-periods guard; the spine-or-range choice). Surface 2 (dialect notes) carries
-the **concrete rolling-window frame syntax** per engine.
-
-### Additive, inline, heuristic-with-a-why
-
-Consistent with specs 07 and 10: the skill change is **additive content in one
-Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/`
-file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, and phrased as
-**heuristics with a one-line generic rationale**, not a wall of MUSTs. The
-dialect-notes change is additive content in the seven existing `dialects/*.md`
-files. No new tool, flag, or config on either surface.
-
-### Build on the rules already present; do not restate them
-
-The Window-functions group already carries **"Make the ordering deterministic"**
-(complete tie-breaker) from spec 07, and the Numeric-precision group carries
-**"Round only at the end."** The cumulative and period-over-period recipes
-**reference** these rather than repeat them (state each rule once — Anthropic's
-"consistent terminology / don't repeat" guidance, already followed in spec 07).
-Spec 10's **Series** dialect line is likewise **referenced** by the rolling
-recipe's spine fallback, not duplicated.
-
-## Requirements
-
-### 1. Skill surface — `<sql_craft>` "Window functions" group (three recipes)
-
-Add three recipes to the **existing** "Window functions" group, after its two
-current bullets (deterministic ordering; filter-after-the-window). Each is a
-heuristic with a generic *why*, dialect-agnostic.
-
-1. **Cumulative / running total.** Use an **explicit frame** — `SUM(x) OVER
-   (PARTITION BY k ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` —
-   with a **complete tie-breaker** on the `ORDER BY` (per the group's existing
-   deterministic-ordering rule; reference it, do not restate). *Why:* a bare
-   `ORDER BY` defaults to a `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`
-   frame, which on **ties in the order key** folds every tied peer into the same
-   cumulative value. It runs and looks plausible, but the running total jumps at
-   each tie boundary.
-
-2. **Rolling window over calendar time, plus minimum periods.** "Rolling N
-   days/months" must span a **calendar range**, not a fixed row count: a `ROWS
-   BETWEEN n-1 PRECEDING` frame silently measures the wrong span when days are
-   missing. Two sanctioned techniques:
-   - **Spine + `ROWS` (portable).** Build a gap-free date spine first (spec 10's
-     **Series**, via `sql_dialect_notes`) so the data has one row per calendar
-     unit; then a `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` frame equals the
-     intended calendar span. This path is fully dialect-agnostic.
-   - **Native range frame or date-keyed self-join (engine-specific).** Where the
-     engine supports it, a calendar **range frame** expresses the window directly;
-     otherwise a self-join keyed on the date does. Both use engine-specific
-     syntax — get the **rolling-window** idiom from `sql_dialect_notes` (see
-     requirement 3); show no inline range frame in the skill.
-
-   **Minimum periods.** When the question says "only after N periods of data" (or
-   a rolling metric implies it), emit `NULL` / skip until the window is actually
-   full — guard on a window count, e.g. `COUNT(*) OVER (<same frame>) = N`. On a
-   gap-free spine, `COUNT(*)` counts calendar slots; count the **non-null
-   observations** instead when "N periods" means N data points rather than N
-   calendar units. *Why:* a row-count frame over missing dates measures the wrong
-   span, and a partial early window is not the requested metric.
-
-3. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)`
-   for the prior-period comparison; compute growth as `(cur - prev) / prev` at
-   **full precision**, rounding only in the final projection (per the existing
-   "Round only at the end" rule), and **guard divide-by-zero / NULL prev**
-   (e.g. divide by `NULLIF(prev, 0)`). *Why:* without `LAG` — or ordered against
-   the wrong neighbor — the comparison lands on the wrong period, and an unguarded
-   ratio errors or returns garbage when the prior period is zero or absent.
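Recipe 3 above can be sketched as follows. The `monthly_revenue` table and its rows are hypothetical, chosen so that one period has no prior and one has a zero prior; Python's `sqlite3` is only a harness for the standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_revenue (account_id INTEGER, period TEXT, revenue REAL);
    INSERT INTO monthly_revenue VALUES
        (1, '2023-01',  0.0),
        (1, '2023-02', 50.0),
        (1, '2023-03', 75.0);
""")

# LAG fetches the prior period within each account; NULLIF(prev, 0) turns a
# zero prior into NULL so the ratio yields NULL instead of dividing by zero.
growth = conn.execute("""
    WITH with_prev AS (
        SELECT account_id, period, revenue,
               LAG(revenue) OVER (PARTITION BY account_id ORDER BY period) AS prev
        FROM monthly_revenue
    )
    SELECT period, (revenue - prev) / NULLIF(prev, 0) AS growth
    FROM with_prev
    ORDER BY period
""").fetchall()

print(growth)  # [('2023-01', None), ('2023-02', None), ('2023-03', 0.5)]
```

SQLite happens to return `NULL` on division by zero even without the guard, so this harness cannot show the failure itself; the guard matters on engines that raise an error instead. Rounding is left out on purpose, in line with the "Round only at the end" rule.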
-
-**Step pointer (no duplication).** The step-5 `sql_dialect_notes` provision list
-(currently "FQTN, identifier-quoting, date, top-N, series/calendar, and JSON
-conventions") should also name the **rolling-window** convention now that it
-exists. State each rule once inside `<sql_craft>`; the workflow steps only point
-to it.
-
-### 2. One worked example — cumulative running total (dialect-agnostic)
-
-Add **exactly one** new compact before/after `sql` example, demonstrating the
-**cumulative running total** — the subtlest of the three (the implicit-frame trap
-runs fine and is wrong only at tie boundaries) and the highest-value to show.
-Use a synthetic generic schema (e.g. `account_txns(account_id, txn_id, txn_date, net)`):
-
- **Wrong:** `SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date)` — the
-  implicit `RANGE` frame makes two txns on the same date share one inflated
-  running balance.
- **Right:** the same with an explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND
-  CURRENT ROW` frame and a complete tie-breaker (`ORDER BY txn_date, txn_id`).
-
-Standard SQL only — no `QUALIFY`, no dialect functions, no series generation, no
-`RANGE … INTERVAL`. Keep it ~10–14 lines. The **rolling-over-time** recipe gets
-**no** inline example (its correct form needs the engine-specific frame/spine,
-delegated to `sql_dialect_notes`, exactly as spec 10's period-spine variant was
-prose-only); the **period-over-period** recipe is self-evident from its bullet
-and also gets no example. This is the **fourth** worked `sql` example in the
-skill, after spec 07 (window-then-filter), spec 09 (multi-hop fan-out), and
-spec 10 (panel-completeness spine).
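The wrong-vs-right contrast described in this requirement can be reproduced with a small sketch. The rows are invented for illustration (two transactions share a date), and Python's `sqlite3` is assumed only as a way to execute the two standard-SQL queries; it needs a bundled SQLite with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account_txns (account_id INTEGER, txn_id INTEGER, txn_date TEXT, net REAL);
    INSERT INTO account_txns VALUES
        (1, 1, '2023-01-01', 100.0),
        (1, 2, '2023-01-02',  50.0),
        (1, 3, '2023-01-02',  25.0),
        (1, 4, '2023-01-03',  10.0);
""")

# Wrong: the implicit RANGE frame treats the two 2023-01-02 rows as peers,
# so both receive the same (already complete) running balance.
wrong = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date) AS running
    FROM account_txns
    ORDER BY txn_id
""").fetchall()

# Right: explicit ROWS frame plus a complete tie-breaker on the ORDER BY.
right = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (PARTITION BY account_id
                          ORDER BY txn_date, txn_id
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running
    FROM account_txns
    ORDER BY txn_id
""").fetchall()

print(wrong)  # [(1, 100.0), (2, 175.0), (3, 175.0), (4, 185.0)]
print(right)  # [(1, 100.0), (2, 150.0), (3, 175.0), (4, 185.0)]
```

Both queries agree on every row except the first of the two tied transactions, which is why the implicit-frame version passes a casual check.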
-
-### 3. Dialect-notes surface — `dialects/*.md` (rolling window)
-
-Add a **rolling-window-over-time** idiom line to **each** of the seven authored
-dialect files, parallel to spec 10's **Series** line. Each note is
-engine-exclusive — a SQLite analyst gets the SQLite idiom and never another
-engine's construct, per the existing dialect-notes leak guards. Each note either
-gives the engine's native calendar-range frame **or** references its own
-**Series** line for the spine + `ROWS` fallback (a cross-reference within the
-file, not a duplicate of the Series line).
-
-Orientation only — **`RANGE`-frame support genuinely varies by engine and
-version, so the implementer must verify each engine's current support against
-authoritative docs (context7 / the engine's manual) rather than assert it from
-memory.** Starting points:
-
- **postgres:** native — `... OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days'
-  PRECEDING AND CURRENT ROW)`.
- **mysql (8.0+):** native — `RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT
-  ROW` over a temporal order key.
- **bigquery:** `RANGE` frames are **numeric** — range over an integer day key
-  (e.g. `UNIX_DATE(day)`) with `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or
-  build a spine (see **Series**) and use a `ROWS` frame.
- **sqlite:** **no** date-interval range frame — build a date spine (see
-  **Series**) and use a `ROWS` frame.
- **tsql (SQL Server):** **no** offset `RANGE` frames at all — build a spine (see
-  **Series**) and use a `ROWS` frame, or a date-keyed self-join.
- **snowflake / clickhouse:** range-frame support over dates is limited — verify;
-  default to a spine (see **Series**) + `ROWS` frame where a native calendar range
-  frame is unavailable.
-
-This line is what makes the rolling-over-time recipe executable from the
-dialect-agnostic skill. It is **distinct** from spec 10's Series line (Series =
-how to *generate* a spine; Rolling window = how to compute a *moving
-calendar-range aggregate*, natively or via that spine), and it cross-references
-the Series line rather than overlapping it.
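For the SQLite fallback (spine + `ROWS` frame, with the min-periods guard), a rough sketch is below. It uses the synthetic `daily_revenue(day, amount)` schema with invented rows, a 3-day window instead of 30 for brevity, and spine bounds taken from `MIN`/`MAX` over the facts (the "all periods present" case). The exact wording of the shipped dialect note is not reproduced here; this only illustrates the technique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL);
    -- 2023-01-03 is deliberately missing (a gap).
    INSERT INTO daily_revenue VALUES
        ('2023-01-01', 10.0), ('2023-01-02', 20.0),
        ('2023-01-04', 40.0), ('2023-01-05', 50.0);
""")

# Naive: a row-count frame over gappy data. On 2023-01-04 it reaches back to
# 2023-01-01, so it covers four calendar days rather than three.
naive = conn.execute("""
    SELECT day,
           SUM(amount) OVER (ORDER BY day
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_3d
    FROM daily_revenue
    ORDER BY day
""").fetchall()

# Spine + ROWS: one row per calendar day, so 3 rows equal 3 days. The CASE
# emits NULL until the window actually holds 3 calendar slots.
rolling = conn.execute("""
    WITH RECURSIVE spine(day) AS (
        SELECT MIN(day) FROM daily_revenue
        UNION ALL
        SELECT date(day, '+1 day') FROM spine
        WHERE day < (SELECT MAX(day) FROM daily_revenue)
    ),
    filled AS (
        SELECT s.day, COALESCE(r.amount, 0.0) AS amount
        FROM spine AS s
        LEFT JOIN daily_revenue AS r ON r.day = s.day
    )
    SELECT day,
           CASE WHEN COUNT(*) OVER w = 3 THEN SUM(amount) OVER w END AS rolling_3d
    FROM filled
    WINDOW w AS (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    ORDER BY day
""").fetchall()

print(naive[2])    # ('2023-01-04', 70.0)
print(rolling[3])  # ('2023-01-04', 60.0)
```

Here `COUNT(*)` counts calendar slots on the spine. If "N periods" meant N observed data points, the guard would count non-null source amounts instead.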
-
-### 4. Explicit constraints / exclusions
-
-None of the following may appear (consistent with specs 07 and 10):
-
- **No inline dialect-specific range-frame syntax in the skill** — no
-  `RANGE … INTERVAL` frame, no series generator, no dialect function. The skill
-  stays dialect-clean; the range frame lives only in the dialect notes.
- **No anchoring of relative time to `MAX(date)`.** "Recent" / "past N months"
-  means relative to *now* on a live database. A range *bound* may be derived from
-  the question's explicit range or, for "all periods present," from `MIN`/`MAX`
-  over the **unfiltered** facts (range derivation, per spec 10) — but the metric
-  must never silently redefine "recent" as the data's maximum date.
- **No grader / gold-answer / benchmark reference**, and no output-shape contract
-  (the skill is for interactive analysis).
-
-### 5. Coordination with specs 07 and 10
-
-All three recipes live in the **existing** `<sql_craft>` "Window functions"
-group; the two current bullets and the spec-07 window-then-filter example must
-stay intact and uncontradicted.
-
- **Spec 07** owns the deterministic-ordering rule (Window functions) and the
-  round-at-the-end rule (Numeric precision). Spec 11 **builds on** both —
-  references them, never restates them.
- **Spec 10** owns the spine concept and the dialect **Series** line. Spec 11
-  **references** the spine for the gappy-rolling fallback and adds the **distinct**
-  rolling-window dialect line. Keep them non-overlapping: spec 10 = how to make a
-  spine; spec 11 = how to compute a moving calendar-range aggregate (native frame
-  or spine + `ROWS`).
-
-## Leak-safety (hard constraint)
-
-Every worked example or note uses a **synthetic generic schema** (e.g.
-`daily_revenue(day, amount)` or `account_txns(account_id, txn_date, net)`) and
-shows only the *pattern*. **No** benchmark table names, SQL, or result values on
-either surface. The dialect-notes additions, like the existing notes, carry no
-benchmark / grader / version-dated content. The behavior is reconstructable from
-first principles and tied to no specific instance.
-
-## Acceptance criteria
-
- The `<sql_craft>` "Window functions" group states the three recipes — inline,
-  dialect-agnostic, each with a generic *why*, and each **building on** (not
-  restating) the deterministic-ordering and round-at-the-end rules:
-  - **cumulative / running total** with an explicit `ROWS BETWEEN UNBOUNDED
-    PRECEDING AND CURRENT ROW` frame and a complete tie-breaker;
-  - **rolling window over calendar time + minimum periods** — calendar range not
-    row count, the spine-or-range choice, the min-periods `COUNT(*) OVER (...)`
-    guard — delegating the engine's range-frame syntax to `sql_dialect_notes`;
-  - **period-over-period** via `LAG`, with full-precision growth and a
-    divide-by-zero / NULL-prev guard.
- Exactly **one** new worked `sql` example: the cumulative running total,
-  wrong-vs-right, with the explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
-  ROW` frame and a complete tie-breaker, in standard dialect-agnostic SQL. The
-  skill then carries **four** `sql` worked examples total.
- Each of the seven `dialects/*.md` files gains a **rolling-window-over-time**
-  idiom line in its engine's own idiom (native calendar-range frame where
-  supported, otherwise a spine + `ROWS` fallback that references its **Series**
-  line); no engine leaks another engine's construct, and the additions contain no
-  benchmark / grader / version-dated content.
- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`,
-  `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, **and no
-  inline `RANGE … INTERVAL` frame**, anywhere in `SKILL.md` including the new
-  example.
- The step-5 `sql_dialect_notes` provision list names the **rolling-window**
-  convention alongside FQTN / identifier-quoting / date / top-N / series/calendar /
-  JSON.
- The existing interactive guidance (`<workflow>`, `<rules>`, the other
-  examples), the two existing Window-functions bullets, the window-then-filter
-  example, and the existing dialect-note rubric lines (including **Series**) are
-  intact and uncontradicted.
- No grader / benchmark reference, no output-shape contract, and no anchoring of
-  *relative* time ("recent" / "past N months") to a `MAX(date)` over the data.
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
-  still parses as `ktx-analytics`.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the prose.
-
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three recipes
-  to the "Window functions" group (after its two existing bullets), the single
-  cumulative worked example, and extend the step-5 dialect-notes provision list to
-  name the rolling-window convention. Leave `<workflow>` / `<rules>` / the other
-  examples and the two existing window bullets intact. Delivery is unchanged
-  (single `SKILL.md` per target via `readAnalyticsSkillContent` in
-  `setup-agents.ts`) — confirm, no change required.
- **Dialect notes:** the seven files under
-  `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
-  `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
-  `copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each
-  engine's actual `RANGE`-frame support against authoritative docs before writing
-  the idiom; do not assert from memory.**
- **Tests:**
-  - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
-    representative phrase for each of the three recipes; bump the `sql`-fence count
-    assertion **3 → 4**; assert the cumulative example shape (e.g. `ROWS BETWEEN
-    UNBOUNDED PRECEDING AND CURRENT ROW`); and **strengthen** the dialect-clean
-    guard with a no-inline-`RANGE … INTERVAL` assertion (mirroring spec 10 adding
-    `generate_series` / `GENERATE_DATE_ARRAY` to the banned list, so the
-    "range frame lives only in the dialect notes" criterion is *enforced*, not
-    incidentally true).
-  - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers the
-    full rubric for every dialect" loop with the rolling-window assertion, e.g.
-    `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect must answer it.
-    Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
-    all seven without a hand-maintained list.
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
-  `pnpm run build && pnpm run link:dev`.
-
-## Benchmark context (motivation only)
-
-Running-balance / rolling / period-over-period questions are the single largest
-result-mismatch cluster in the SQLite subset (financial-transactions-style DBs):
-cumulative balances with the wrong frame on ties, rolling windows that mis-span
-gappy dates, partial early windows, and unguarded period-over-period ratios. The
-methodology is universal analyst craft, so it belongs in the product's skill
-(where it helps every real user) plus the per-dialect rolling-window syntax that
-makes it executable — not in a benchmark-specific prompt. Depends on spec 10 (the
-date spine) for the gappy-rolling fallback. Improving the benchmark score is a
-side effect; the skill and the dialect notes contain no trace of the benchmark.
-
-## Implementation notes
-
-Shipped as additive content across the two surfaces the spec specified — no new
-tool, flag, or config.
-
-**Skill (`packages/cli/src/skills/analytics/SKILL.md`).** Added the three recipes
-to the existing `<sql_craft>` "Window functions" group, after its two bullets and
-the spec-07 window-then-filter example: **Cumulative / running total** (explicit
-`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` + a tie-breaker, referencing
-the deterministic-ordering rule), **Rolling window over calendar time, plus
-minimum periods** (calendar range not row count; spine-or-native-range choice
-delegated to `sql_dialect_notes`; the `COUNT(*) OVER (<same frame>) = N`
-min-periods guard), and **Period-over-period** (`LAG` + full-precision growth +
-`NULLIF` divide guard, referencing the round-at-the-end rule). Added one worked
-`sql` example — the cumulative running total, wrong-vs-right, using
-`account_txns(account_id, txn_id, txn_date, net)` — bringing the skill to four
-worked examples. Extended the step-5 `sql_dialect_notes` provision list to name
-the rolling-window convention. No inline `RANGE … INTERVAL` frame anywhere in the
-skill; it stays dialect-clean.
-
-**Dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`).** Added a
-**Rolling window over time** line to all seven files, parallel to the spec-10
-**Series** line and cross-referencing it for the spine fallback.
-
-**Deviation — `RANGE`-frame support verified against authoritative docs (the
-spec's hard requirement), which corrected two of its starting points:**
-
- **postgres** — native interval frame: `RANGE BETWEEN INTERVAL '29 days'
-  PRECEDING AND CURRENT ROW` (as the spec guessed).
- **mysql** — native interval frame over a temporal key: `RANGE BETWEEN INTERVAL
-  29 DAY PRECEDING AND CURRENT ROW` (as guessed).
- **bigquery** — `RANGE` is numeric-only: range over `UNIX_DATE(day)` with
-  `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or spine + `ROWS` (as guessed).
- **snowflake** — **corrected:** the spec said "limited; default to a spine," but
-  Snowflake *does* support a native interval `RANGE` frame over a date/timestamp
-  key and it is gap-tolerant, so the note gives the native frame
-  (`RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`), no spine needed.
- **clickhouse** — **corrected:** the spec said "limited; default to a spine," but
-  ClickHouse supports a numeric `RANGE` offset over a `Date` column (counts in
-  days, gap-tolerant); the `INTERVAL` form is unsupported (use seconds for
-  `DateTime`). The note gives the numeric `RANGE` frame, with spine + `ROWS` as
-  the fallback.
- **sqlite** — no date-interval range frame (no native date type): spine + `ROWS`
-  (as guessed).
- **tsql** — `RANGE` supports only `UNBOUNDED`/`CURRENT ROW` (no offset frame):
-  spine + `ROWS`, or a date-keyed self-join (as guessed).
-
-**Tests.** `test/skills/analytics-skill-content.test.ts` — added a representative
-phrase per recipe (plus `minimum periods`), bumped the `sql`-fence count 3 → 4,
-asserted the cumulative example shape (`ROWS BETWEEN UNBOUNDED PRECEDING AND
-CURRENT ROW` and the `ORDER BY txn_date, txn_id` tie-breaker), and strengthened
-the dialect-clean guard with a no-inline-`RANGE … INTERVAL` regex.
-`test/context/mcp/dialect-notes.test.ts` — extended the per-dialect rubric loop
-with `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect (derived from
-`DIALECTS_WITH_NOTES`) must answer the rolling-window rubric.
-
-**Verification.** Full `@kaelio/ktx` vitest suite green (3001 passed, 1 skipped);
-`pnpm run build` mirrors both surfaces into `dist`; `pnpm run link:dev` refreshed
-`ktx-dev`. Pre-existing, unrelated note: `tsc -p tsconfig.test.json` reports one
-error in `test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts` cast) that is
-present in committed branch code and untouched by this work.
--- a/spider2-specs/specs/12-parse-text-encoded-numbers.md
+++ b/spider2-specs/specs/12-parse-text-encoded-numbers.md
@ -1,405 +0,0 @@
-# Parse text-encoded numeric columns before doing math on them
-
-> Refined spec. Intake draft: `todo/12-parse-text-encoded-numbers.md`.
-
-## Problem
-
-Numeric measures are often stored as **text** with human formatting: unit
-suffixes (`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators
-(`"$1,200"`), percent signs (`"12%"`), or non-numeric sentinels for missing/zero
-(`"-"`, `"N/A"`, `""`). Aggregating or comparing such a column directly is
-**silently wrong**: a string comparison orders `"100" < "9"`, and a naive
-`CAST(x AS REAL)` yields `0`/NULL/partial on the formatted values rather than the
-intended number. The query runs, the shape looks right, the number is garbage.
-
-The agent already samples schemas before composing — spec 07 gave the
-`<sql_craft>` "Schema discovery before writing SQL" group its *"Sample before you
-compose"* and *"Cast to the real type before comparing"* rules. But those rules
-guard **encoding** (date format, nullability) and **type-mismatch in `WHERE`**;
-they say nothing about a column whose declared/affinity type is text yet whose
-*meaning* is numeric. When the agent sees a "numeric-looking" column it tends to
-assume a genuinely numeric type and skip the parse, so the arithmetic runs on the
-raw strings. This spec adds the detect → parse/scale → verify habit to that same
-group, building on the two rules already there rather than restating them.
-
-## Generic use case (independent of any benchmark)
-
- A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become
-  `1200 / 3000000 / 0` before you can sum it or compute a daily change.
- A `price` stored as `"$1,299.00"` must become `1299.00` before averaging.
- A `conversion_rate` stored as `"12%"` must become `0.12` before weighting it.
-
-This is routine data hygiene on real, messy production tables — every analyst
-hits text-encoded measures on some warehouse, with no benchmark in sight. The
-methodology is universal craft, so it belongs in the shipped skill; it transfers
-to every ktx user querying a live database.
-
-## Model
-
-The change is **additive content across two surfaces** — the same split specs 10
-and 11 made, and for the same reason. The split is the central design decision;
-it satisfies spec 07's hard dialect-agnostic invariant for `<sql_craft>` without
-weakening it.
-
-### Why two surfaces (the dialect-agnostic reconciliation)
-
-The **detect → parse → scale** half is **pure portable SQL** and stays entirely
-in the dialect-agnostic skill:
-
- Stripping `$` / `,` / `%` is a portable chained `REPLACE` over a small, known
-  set of literal characters — no regex needed.
- Suffix scaling (K=10³, M=10⁶, B=10⁹) is a portable `LIKE`/`CASE` expression.
- Sentinel mapping (`-` / `N/A` / empty → `0` or `NULL`) is a portable `CASE`.
- The final cast to a numeric type is `CAST(... AS DECIMAL)`, broadly portable.
-
-The **verify** half has one piece that is genuinely dialect-divergent: a
-**failure-detecting numeric cast**, that is, a cast that signals (rather than
-silently swallows) a value that did not parse. This is exactly what the
-"confirm coverage" heuristic (requirement 1, item 3) needs, and it cannot be
-written portably:
-
- **bigquery:** `SAFE_CAST(x AS FLOAT64)` → `NULL` on failure.
- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_CAST` → `NULL` on failure.
- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(...))` / `TRY_CONVERT` → `NULL`.
- **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`.
- **postgres / mysql:** no `TRY_CAST`; guard with a numeric pattern test before
-  casting (e.g. postgres `CASE WHEN x ~ '^-?[0-9.]+$' THEN x::numeric END`;
-  mysql uses `REGEXP` in place of `~` and `CAST(... AS DECIMAL)`).
- **sqlite (the gotcha):** a plain `CAST('abc' AS REAL)` returns **`0.0`** and
-  `CAST('12abc' AS REAL)` returns **`12.0`** — it neither errors nor NULLs, so an
-  `IS NULL` coverage check is **silently broken**. Detecting a failed parse needs
-  a `GLOB`/`typeof` pattern guard.
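The sqlite gotcha in the list above can be reproduced with a short sketch. The table name `cleaned`, its column `v`, the sample values, and the specific `GLOB` patterns are all hypothetical choices for illustration (the guard below only handles unsigned decimals); the spec leaves the exact pattern to the implementer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cleaned(v TEXT)")
conn.executemany(
    "INSERT INTO cleaned VALUES (?)",
    [("1200",), ("3.5",), ("12abc",), ("abc",), ("1.2.3",)],
)

# Plain CAST never yields NULL here, so an IS NULL coverage check finds nothing.
plain_nulls = conn.execute(
    "SELECT COUNT(*) FROM cleaned WHERE CAST(v AS REAL) IS NULL"
).fetchone()[0]

# Guarded cast: only digits and dots, at most one dot, at least one digit.
# Anything else falls through the CASE and surfaces as NULL.
GUARDED_SQL = """
SELECT COUNT(*) FROM cleaned
WHERE CASE
        WHEN v NOT GLOB '*[^0-9.]*'
         AND v NOT GLOB '*.*.*'
         AND v GLOB '*[0-9]*'
        THEN CAST(v AS REAL)
      END IS NULL
"""
guarded_nulls = conn.execute(GUARDED_SQL).fetchone()[0]

print(plain_nulls, guarded_nulls)
```

With the plain cast the three malformed values are invisible to the check; with the guard they are counted as unparsed.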
-
-So a portable skill cannot inline a safe cast any more than spec 10 could inline a
-date-series generator or spec 11 a calendar range frame. ktx already routes that
-kind of engine-specific syntax through the per-dialect notes in
-`packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by the
-`sql_dialect_notes` MCP tool (spec 08). Specs 10 and 11 set the exact precedent:
-a construct not yet in the dialect rubric, genuinely engine-specific, was added
-there (the **Series** line; the **Rolling window** line) and the dialect-agnostic
-skill points to it. The failure-detecting cast is the next construct in that same
-position, so the **safe-cast idiom belongs in the dialect notes**, and the skill
-points to it.
-
-Surface 1 (skill) carries the **pattern** (detect the text encoding; parse/scale
-in an early CTE; verify with a failure-detecting cast). Surface 2 (dialect notes)
-carries the **concrete safe-cast syntax** per engine, including the sqlite
-`CAST`-returns-0 gotcha.
-
-The regex character-*strip* is deliberately **not** promoted to the dialect
-notes: a portable chained `REPLACE` over a known character set is the opinionated
-default, so there is no need for a per-dialect strip line (derive from need; one
-default). The dialect surface gains exactly one thing — the safe cast — because
-that is the only piece the portable path genuinely cannot express.
-
-### Additive, inline, heuristic-with-a-why
-
-Consistent with specs 07, 10, and 11: the skill change is **additive content in
-one Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled
-`reference/` file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic,
-and phrased as **heuristics with a one-line generic rationale**, not a wall of
-MUSTs. The dialect-notes change is additive content in the seven existing
-`dialects/*.md` files. No new tool, flag, or config on either surface.
-
-### Build on the rules already present; do not restate them
-
- The Schema-discovery group already carries **"Sample before you compose"** and
-  **"Cast to the real type before comparing"** (spec 07). The detect rule
-  **extends** the first (distinct-value sampling to learn the encoding) and the
-  parse rule **complements** the second (text-meaning-numeric, not just
-  text-vs-numeric literal mismatch) — reference them, do not repeat them.
- The sentinel **0-vs-NULL** choice is the **same additive-vs-non-additive
-  judgment** spec 10 established in its *"Default by additivity"* rule (0 only
-  when "no value" genuinely reads as 0; NULL otherwise). **Reference** that rule
-  rather than restating the discriminator (state each rule once).
-
-## Requirements
-
-### 1. Skill surface — `<sql_craft>` "Schema discovery before writing SQL"
-
-Add the text-encoded-numeric guidance to the **existing** group, after its two
-current bullets. Phrase as heuristics, each with a generic *why*, dialect-agnostic.
-It must cover:
-
-1. **Detect text-encoded numerics during sampling.** When a column the question
-   treats as a number is stored as text, sample its **distinct** values to learn
-   the encodings actually present — unit suffixes (`K`/`M`/`B`), currency
-   symbols, thousands separators, percent signs, and non-numeric sentinels
-   (`-`, `N/A`, empty) — **before** composing. Never infer the format from the
-   column name. *Why:* compared/aggregated as-is, the text sorts lexically
-   (`'100' < '9'`) and a naive cast collapses formatted values to `0`/NULL —
-   producing a silently wrong result instead of an error.
-
-2. **Parse and scale in an early CTE.** Strip currency/separator/percent
-   characters, multiply by the suffix scale (K=10³, M=10⁶, B=10⁹), map sentinels
-   to `0` **or** `NULL` per the question's intent, then cast to a numeric type —
-   all in **one early CTE**, so every downstream layer sees clean numbers. The
-   `0`-vs-`NULL` choice for sentinels follows spec 10's **additive-vs-non-additive**
-   rule (reference it; do not restate). *Why:* left unparsed, the strings compare
-   lexically and cast to `0`, so the math is silently wrong; parsing once, up
-   front, keeps every later layer off the raw text.
-
-3. **Confirm coverage (verify).** After parsing, sanity-check that **no
-   intended-numeric value silently failed to parse** — a failed parse should
-   surface as `NULL`, which is only visible with a **failure-detecting cast**.
-   Note the divergence: a plain `CAST` errors on some engines and, on sqlite,
-   returns `0`/partial rather than NULL — so use the engine's safe-cast idiom from
-   `sql_dialect_notes` (requirement 3), then count residual NULLs among
-   non-sentinel rows. *Why:* an encoding the sample missed would otherwise vanish
-   as `0`/NULL instead of being caught.
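As an illustration of the detect heuristic, the sketch below profiles the distinct encodings in a text column before any math is attempted. It uses the synthetic `metrics(label, value_text)` schema from this spec; the sample rows, the `encoding` labels, and the `REPLACE(...) <> value_text` containment test are assumptions chosen to keep the SQL free of dialect functions, and it is run on SQLite only because that is what Python ships with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics(label TEXT, value_text TEXT)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("a", "1.2K"), ("b", "3M"), ("c", "3M"), ("d", "$1,200"),
     ("e", "-"), ("f", "12%"), ("g", "450")],
)

# Bucket every value by the formatting it carries, then count each bucket.
PROFILE_SQL = """
SELECT encoding, COUNT(*) AS n
FROM (
  SELECT CASE
           WHEN value_text IN ('-', 'N/A', '') THEN 'sentinel'
           WHEN value_text LIKE '%K' OR value_text LIKE '%M'
             OR value_text LIKE '%B' THEN 'unit suffix'
           WHEN REPLACE(value_text, '%', '') <> value_text THEN 'percent'
           WHEN REPLACE(REPLACE(value_text, '$', ''), ',', '') <> value_text
             THEN 'currency or separator'
           ELSE 'plain'
         END AS encoding
  FROM metrics
) AS profiled
GROUP BY encoding
"""
profile = dict(conn.execute(PROFILE_SQL).fetchall())
print(profile)
```

A bucket the analyst did not expect (or a large `plain` bucket hiding odd values) is the cue to sample further before writing the parse.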
-
-### 2. One worked example — parse/scale, fully portable
-
-Add **exactly one** new compact before/after `sql` example demonstrating the
-parse-and-scale pattern on a synthetic generic schema
-(e.g. `metrics(label, value_text)` with values like `'1.2K'`, `'$1,200'`, `'-'`):
-
- **Wrong:** `SUM(CAST(value_text AS REAL))` (or summing the raw strings) — the
-  formatted values collapse to `0`/partial, so the total is silently wrong.
- **Right:** an early CTE that strips symbols with chained `REPLACE`, applies a
-  `CASE` for the K/M/B suffix scale, maps `'-'`/`'N/A'`/`''` to `0`, casts to
-  `DECIMAL`, then `SUM`s the parsed column.
-
-**Standard, portable SQL only** — no `REGEXP_REPLACE`, `SAFE_CAST`, `TRY_CAST`,
-`TRY_TO_NUMBER`, `toFloat64OrNull`, `GLOB`, or any dialect function — so the
-example stays dialect-clean. Keep it ~12–16 lines. The **verify** step gets **no**
-inline example (its correct form needs the engine-specific safe cast, delegated to
-`sql_dialect_notes`, exactly as spec 10's period-spine and spec 11's
-rolling-window variants were prose-only).
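One possible shape for the wrong-versus-right pair, sketched under assumptions: the schema `metrics(label, value_text)`, the `parsed` CTE name, and `DECIMAL(18,4)` come from this spec and its implementation notes, while the column name `value_num`, the row labels, and the exact nesting of `REPLACE` calls are illustrative. It is executed on SQLite through Python's `sqlite3` only to show the totals diverge; it is not the example that shipped in `SKILL.md`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics(label TEXT, value_text TEXT)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("a", "1.2K"), ("b", "3M"), ("c", "$1,200"), ("d", "-")],
)

# Wrong: the cast keeps only leading digits, so suffixes and symbols are lost.
wrong_total = conn.execute(
    "SELECT SUM(CAST(value_text AS REAL)) FROM metrics"
).fetchone()[0]

# Right: strip symbols and suffix letters, scale by suffix, map sentinels to 0,
# and cast once in an early CTE so the outer query only sees numbers.
RIGHT_SQL = """
WITH parsed AS (
  SELECT
    label,
    CASE
      WHEN value_text IN ('-', 'N/A', '') THEN 0
      ELSE CAST(
             REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
               value_text, '$', ''), ',', ''), 'K', ''), 'M', ''), 'B', '')
             AS DECIMAL(18, 4))
           * CASE
               WHEN value_text LIKE '%K' THEN 1000
               WHEN value_text LIKE '%M' THEN 1000000
               WHEN value_text LIKE '%B' THEN 1000000000
               ELSE 1
             END
    END AS value_num
  FROM metrics
)
SELECT SUM(value_num) FROM parsed
"""
right_total = conn.execute(RIGHT_SQL).fetchone()[0]

print(wrong_total, right_total)
```

Mapping the sentinel to `0` here assumes the measure is additive; a non-additive measure would map it to `NULL` instead, per the additivity rule the spec references.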
-
-This adds **one** worked `sql` example to the skill. Spec 11 independently adds
-one as well; **do not hardcode the resulting total** — increment from the current
-state. As of this writing the skill carries **three** examples (spec 07
-window-then-filter, spec 09 multi-hop fan-out, spec 10 panel spine), so this is
-the **fourth**; if spec 11 ships first it is the **fifth**. The fence-count test
-assertion is incremented by one from its current value (see Acceptance criteria).
-
-### 3. Dialect-notes surface — `dialects/*.md` (safe cast)
-
-Add a **"Safe cast"** idiom line to **each** of the seven authored dialect files,
-parallel to spec 10's **Series** line and spec 11's **Rolling window** line. Each
-line gives that engine's **failure-detecting numeric cast** — a cast that returns
-`NULL` (or is detectably invalid) on a non-numeric input — which is what makes the
-verify step correct on that engine. Each note is engine-exclusive (a SQLite
-analyst gets the SQLite idiom and never another engine's construct, per the
-existing dialect-notes leak guards). Orientation only — exact syntax is the
-implementer's; verify against authoritative docs (context7 / the engine manual)
-rather than asserting from memory:
-
- **postgres:** no `TRY_CAST` — guard with a numeric pattern before casting,
-  e.g. `CASE WHEN x ~ '^-?[0-9.]+$' THEN x::numeric END`. (`regexp_replace` is
-  available for the strip, but chained `REPLACE` is the portable default.)
- **mysql (8.0+):** no `TRY_CAST` — guard with `x REGEXP '^-?[0-9.]+$'` before
-  `CAST(... AS DECIMAL)`; `REGEXP_REPLACE` is available for the strip.
- **bigquery:** `SAFE_CAST(x AS FLOAT64)` (or `SAFE_CAST(... AS NUMERIC)`) →
-  `NULL` on failure.
- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_TO_DECIMAL(x, p, s)` / `TRY_CAST` →
-  `NULL` on failure.
- **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`.
- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(18,4))` / `TRY_CONVERT` → `NULL`.
- **sqlite (the gotcha):** a plain `CAST` returns `0`/partial, **not** NULL or an
-  error, so a coverage check must use a pattern guard such as
-  `CASE WHEN cleaned GLOB '...' THEN CAST(cleaned AS REAL) END` (or a `typeof`
-  check) to detect a value that did not parse.
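A caveat on the orientation pattern `^-?[0-9.]+$` used in the postgres and mysql lines: it admits strings with several dots or no digits at all, which a subsequent numeric cast would reject or mis-read. The sketch below checks this with Python's `re` module, whose behavior for these simple patterns is assumed here to match the engines' regex flavors; the stricter pattern is a suggestion, not something taken from the shipped dialect notes.

```python
import re

# Pattern as written in the orientation notes.
LOOSE = re.compile(r"^-?[0-9.]+$")
# Hypothetical tighter guard: digits, then an optional single fractional part.
STRICT = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")

samples = ["1200", "-3.5", "1.2.3", ".", "12abc", ""]

loose_ok = [s for s in samples if LOOSE.match(s)]
strict_ok = [s for s in samples if STRICT.match(s)]

print(loose_ok)   # includes '1.2.3' and '.'
print(strict_ok)  # only well-formed decimals
```

Whichever pattern an implementer settles on, it is worth confirming against the engine's own documentation, as the spec already asks.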
-
-This line is what makes the verify step executable from the dialect-agnostic
-skill. It is **distinct** from the Series and Rolling-window lines (those generate
-or window over a calendar; this detects a failed numeric parse). Phrase any
-version note as `8.0+`-style, **not** "as of version …" (the dialect-notes test
-bans version-dated wording).
-
-### 4. Explicit constraints / exclusions
-
-None of the following may appear (consistent with specs 07, 10, and 11):
-
- **No inline dialect-specific cast/regex syntax in the skill** — no `SAFE_CAST`,
-  `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`,
-  `replaceRegexpAll`, or `GLOB` anywhere in `SKILL.md`. The portable strip is
-  chained `REPLACE`; the failure-detecting cast lives only in the dialect notes.
- **No regex-strip dialect line.** The character strip stays the portable
-  chained-`REPLACE` default; the dialect notes gain only the **safe cast**.
- **No grader / gold-answer / benchmark reference**, and no output-shape contract
-  (the skill is for interactive analysis).
-
-### 5. Coordination with specs 07, 08, 10, and 11
-
- **Spec 07** owns the Schema-discovery group and its two existing bullets
-  (*"Sample before you compose"*, *"Cast to the real type before comparing"*).
-  Spec 12 **extends** that group and **builds on** both bullets — references them,
-  never restates them; they must stay intact and uncontradicted.
- **Spec 08** owns the dialect-notes channel and its leak guards. Spec 12 adds one
-  rubric line through that channel; the engine-exclusivity guards apply unchanged.
- **Spec 10** owns the additive-vs-non-additive discriminator (Answer
-  completeness) and the dialect **Series** line. Spec 12 **references** the
-  additivity rule for the sentinel `0`-vs-`NULL` choice; do not duplicate it.
- **Spec 11** independently adds the dialect **Rolling window** line, one `sql`
-  example, and the **rolling-window** entry to the step-5 provision list. Spec 12
-  touches the **same** three places (the dialect-notes rubric loop, the example
-  count, and the step-5 list). Both are independent and additive — **add to the
-  current state, do not assume an order**: name **safe-cast** in the step-5 list
-  without removing rolling-window/series; increment the example count by one from
-  whatever it is; add `/\*\*Safe cast/` to the rubric loop alongside any
-  `/\*\*Rolling/` assertion.
-
-### 6. Step pointer (no duplication)
-
-The step-5 `sql_dialect_notes` provision list (currently "FQTN,
-identifier-quoting, date, top-N, series/calendar, and JSON conventions"; spec 11
-also names rolling-window) should additionally name the **safe-cast** convention
-now that it exists. State each rule once inside `<sql_craft>`; the workflow steps
-only point to it.
-
-## Leak-safety (hard constraint)
-
-Every worked example or note uses a **synthetic generic schema** (e.g.
-`metrics(label, value_text)`) and made-up values (`'1.2K'`, `'$1,200'`, `'-'`),
-showing only the *pattern*. **No** benchmark table names, SQL, or result values on
-either surface. The dialect-notes additions, like the existing notes, carry no
-benchmark / grader / version-dated content. The behavior is reconstructable from
-first principles and tied to no specific instance.
-
-## Acceptance criteria
-
- The `<sql_craft>` "Schema discovery before writing SQL" group states the three
-  heuristics — inline, dialect-agnostic, each with a generic *why*, and each
-  **building on** (not restating) the existing *"Sample before you compose"* and
-  *"Cast to the real type before comparing"* bullets and spec 10's additivity rule:
-  - **detect** text-encoded numerics by sampling distinct values (suffixes,
-    symbols, separators, sentinels) — never from the column name;
-  - **parse and scale** in an early CTE (strip → suffix-scale → sentinel map →
-    cast), sentinel `0`-vs-`NULL` per spec 10's additivity rule;
-  - **confirm coverage** with a failure-detecting cast, delegating the engine's
-    safe-cast syntax to `sql_dialect_notes`.
- Exactly **one** new worked `sql` example: parse-and-scale, wrong-vs-right, using
-  chained `REPLACE` + `CASE` suffix scale + sentinel `CASE` + `CAST(... AS
-  DECIMAL)`, in standard portable SQL. The `sql`-fence count assertion is
-  incremented by **one** from its current value (3 → 4 today; 4 → 5 if spec 11
-  shipped first).
- Each of the seven `dialects/*.md` files gains a **"Safe cast"** idiom line in its
-  engine's own failure-detecting numeric-cast idiom (including the sqlite
-  `CAST`-returns-0 gotcha); no engine leaks another engine's construct, and the
-  additions contain no benchmark / grader / version-dated content.
- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`,
-  `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, inline
-  `RANGE … INTERVAL` frame, **and no `SAFE_CAST` / `TRY_CAST` / `TRY_TO_NUMBER` /
-  `REGEXP_REPLACE` / `toFloat64OrNull` / `GLOB`**, anywhere in `SKILL.md`
-  including the new example.
- The step-5 `sql_dialect_notes` provision list names the **safe-cast** convention
-  alongside FQTN / identifier-quoting / date / top-N / series-calendar /
-  rolling-window / JSON.
- The existing interactive guidance (`<workflow>`, `<rules>`, the other examples),
-  the two existing Schema-discovery bullets, and the existing dialect-note rubric
-  lines (including **Series** and, if present, **Rolling window**) are intact and
-  uncontradicted.
- No grader / benchmark reference, and no output-shape contract.
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
-  still parses as `ktx-analytics`.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the prose.
-
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three
-  heuristics to the "Schema discovery before writing SQL" group (after its two
-  existing bullets), the single parse-and-scale worked example, and extend the
-  step-5 dialect-notes provision list to name the safe-cast convention. Leave
-  `<workflow>` / `<rules>` / the other examples and the two existing
-  schema-discovery bullets intact. Delivery is unchanged (single `SKILL.md` per
-  target via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no
-  change required.
- **Dialect notes:** the seven files under
-  `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
-  `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
-  `copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each
-  engine's actual safe-cast / try-cast support against authoritative docs before
-  writing the idiom; do not assert from memory** (in particular the sqlite
-  `CAST`-returns-0 behavior, which is the motivating gotcha).
- **Tests:**
-  - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
-    representative phrase for each of the three heuristics (e.g. a *detect*, a
-    *parse/scale*, and a *confirm-coverage* phrase) to the `represents every craft
-    behavior` list; bump the `sql`-fence count assertion **by one** from its
-    current value; assert the example shape (e.g. `REPLACE(` and `CAST(` and a
-    suffix-scale multiplier); and **strengthen** the dialect-clean guard by adding
-    `SAFE_CAST`, `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`,
-    and `GLOB` to the banned list (mirroring spec 10 adding `generate_series` /
-    `GENERATE_DATE_ARRAY` and spec 11 adding the no-inline-`RANGE … INTERVAL`
-    guard, so the "safe cast lives only in the dialect notes" criterion is
-    *enforced*, not incidentally true).
-  - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers
-    the full rubric for every dialect" loop with the safe-cast assertion,
-    `expect(notes).toMatch(/\*\*Safe cast/)`, so every dialect must answer it.
-    Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
-    all seven without a hand-maintained list. Do **not** add a false-exclusivity
-    assertion for `TRY_CAST` (it is shared by snowflake and tsql); requiring the
-    line per dialect is sufficient.
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
-  `pnpm run build && pnpm run link:dev`.
-
-## Benchmark context (motivation only)
-
-At least one SQLite-subset question stores trading volume as suffix-encoded text
-(`"K"`/`"M"`, `"-"` for zero) and fails because the agent aggregates the raw
-strings — runnable, plausible, wrong. The sqlite `CAST`-returns-0 behavior makes
-the failure especially insidious: there is no error to alert the agent, and a
-naive `IS NULL` coverage check would not catch it either, which is precisely why
-the safe-cast idiom belongs in the dialect notes. The fix — parse messy encodings
-before math, then verify coverage with a failure-detecting cast — is universal
-data hygiene that helps any analyst on any warehouse, so it belongs in the
-product's craft (skill) plus the per-dialect safe-cast syntax that makes the
-verify step executable, not in a benchmark-specific prompt. Improving the
-benchmark score is a side effect; the skill and the dialect notes contain no trace
-of the benchmark.
-
-## Implementation notes
-
-Shipped on branch `write-feature-spec-wiki`, on top of specs 10 and 11 (both already
-applied in the working tree). Built from the current state per the "do not assume an
-order" guidance — there were **four** worked examples (specs 07 window-then-filter,
-09 multi-hop fan-out, 10 panel spine, 11 cumulative running total), so this is the
-**fifth**, and step 5 already named `series/calendar, rolling-window`.
-
-**Skill — `packages/cli/src/skills/analytics/SKILL.md`:**
- Added the three heuristics to the **"Schema discovery before writing SQL"** group,
-  after the two existing bullets: *Parse text-encoded numerics before doing math on
-  them* (detect by sampling distinct values, extending *Sample before you compose*,
-  never inferring from the column name), *Strip, scale, and cast in one early CTE*
-  (the *meaning-is-numeric* complement to *Cast to the real type before comparing*,
-  with the sentinel `0`-vs-`NULL` choice deferred to spec 10's *Default by
-  additivity* rule), and *Confirm the parse covered every value* (failure-detecting
-  cast from `sql_dialect_notes`). Each carries a one-line generic *why*; the existing
-  bullets and the additivity rule are referenced, not restated.
- Added **one** portable worked example (`metrics(label, value_text)` with `'1.2K'`,
-  `'3M'`, `'$1,200'`, `'-'`): wrong = `SUM(CAST(value_text AS REAL))`; right = an
-  early `parsed` CTE that strips with chained `REPLACE`, scales the K/M/B suffix with
-  a `CASE`, maps sentinels to `0`, casts to `DECIMAL(18,4)`, then `SUM`s. Standard
-  portable SQL only — no dialect functions, no inline safe cast.
- Step 5 dialect-notes provision list now names **safe-cast** alongside the others.
-
-**Dialect notes — `packages/cli/src/context/sql-analysis/dialects/*.md`:** added a
-**Safe cast** line to all seven files (after the *Rolling window* line), each giving
-that engine's failure-detecting numeric cast: postgres/mysql use a numeric pattern
-guard before casting (no `TRY_CAST`; mysql's bare `CAST` returns `0` with a warning);
-bigquery `SAFE_CAST`; snowflake `TRY_TO_NUMBER`/`TRY_TO_DECIMAL`/`TRY_CAST`; tsql
-`TRY_CAST`/`TRY_CONVERT`; clickhouse `toFloat64OrNull`/`toDecimal64OrNull` (the
-`...OrZero` variants return `0`); sqlite documents the `CAST`-returns-`0.0`/partial
-gotcha and a `GLOB` pattern guard. ClickHouse function names were verified against
-the official docs via context7 (the spec's loose `toDecimalOrNull` is not a real
-name — the `to<Type>OrNull` family requires a bit width, hence `toDecimal64OrNull`).
-No version-dated wording.
-
-**Tests:** `analytics-skill-content.test.ts` — added the three representative
-phrases, bumped the `sql`-fence count 4 → 5 (and the test title), asserted the
-example shape (`WITH parsed AS`, `REPLACE(`, `AS DECIMAL(`, `LIKE '%K' THEN 1000`),
-and strengthened the dialect-clean banned list with `SAFE_CAST`, `TRY_CAST`,
-`TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, and `GLOB` (mirroring spec 10's
-`generate_series` / spec 11's inline-`RANGE … INTERVAL` guards). `dialect-notes.test.ts`
-— added `expect(notes).toMatch(/\*\*Safe cast/)` to the per-dialect rubric loop, so
-all seven (derived from `DIALECTS_WITH_NOTES`) must answer it; no false-exclusivity
-assertion for the shared `TRY_CAST`.
-
-**Verification:** both affected test files pass (19 tests); broader `test/skills` +
-`test/context/mcp` pass (65 tests); production type-check (`tsc -p tsconfig.json`)
-is clean; `pnpm run build` copies both surfaces into `dist` (7 dialect files carry
-*Safe cast*, the built `SKILL.md` carries the parse example) and `pnpm run link:dev`
-relinks `ktx-dev`. One **pre-existing, unrelated** type error remains in the
-test-only config (`test/mcp-server-factory.test.ts:152`, byte-identical to HEAD,
-untouched here) — out of scope for this spec.
--- a/spider2-specs/specs/14-output-completeness-final-check.md
+++ b/spider2-specs/specs/14-output-completeness-final-check.md
@@ -1,336 +0,0 @@
-# Output completeness — answer every requested part, enforced by a final pre-emit check
-
-> Refined spec. Intake draft: `todo/14-output-completeness-final-check.md`.
-
-## Problem
-
-The single largest correctness failure mode for the analytics skill is
-**incomplete output**: the query runs and the methodology is roughly right, but
-the projection is missing columns the question asked for. The SQL is runnable and
-the aggregate is correct — the answer is simply *short by columns*. Three
-recurring shapes:
-
-1. **Multi-part questions answered partially.** A question that asks for several
-   things ("report the highest *and* the lowest month, each with its count and
-   average, *and* the difference") comes back with only the first clause — one
-   column where several were requested.
-2. **Identity dropped.** Grouping by a human-readable name but not projecting the
-   entity's identifier (a product name without its product id, a customer name
-   without its customer id).
-3. **Inputs to a derived value dropped.** Returning a ratio / percentage /
-   difference but not the underlying counts the question also asked for.
-
-Shapes 2 and 3 are **already covered** by shipped `<sql_craft>` rules — spec 07's
-*"Expose identity, not just the label"* and *"Keep the inputs to a derived
-value"* — yet they are frequently **not applied**. So the gap is not missing
-knowledge: these rules sit as passive heuristics in a list, and nothing makes the
-agent reliably check them before finalizing. The fix is twofold: (a) add the
-missing **multi-part-completeness** rule that generalizes shapes 1–3, and (b)
-turn output-completeness into an **explicit final verification step** the agent
-performs before emitting SQL, so the existing identity/inputs rules are actually
-enforced rather than merely listed.
-
-The failure is **model-independent**: a markedly stronger model produced the same
-incomplete-output mistakes on these questions, which means it is a
-craft/enforcement gap, not a capability gap — exactly the kind of universal
-analyst craft that belongs in the shipped skill.
-
-## Generic use case (independent of any benchmark)
-
-An analyst is asked: *"For each region, report the highest and the lowest monthly
-order count, and the difference between them."* A complete answer has a column for
-the region's id and name, the highest count, the lowest count, and the difference
-— five columns. Returning just the region and a single number answers only part
-of the request. This is a universal expectation on any database: answer **every**
-part of a multi-part request, identify the entities, and show the inputs behind
-any derived figure — and answer *exactly* that, without padding the result with
-columns the question never asked for.
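The five-column answer described above can be sketched as follows. The table names `regions` and `region_monthly` and the columns `region_id`, `region_name`, and `monthly_orders` follow the synthetic schema this spec uses later; the `month` column and the sample rows are assumptions, and SQLite via Python's `sqlite3` is used only so the sketch can be run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE regions(region_id INTEGER, region_name TEXT)")
conn.execute(
    "CREATE TABLE region_monthly(region_id INTEGER, month TEXT, monthly_orders INTEGER)"
)
conn.executemany("INSERT INTO regions VALUES (?, ?)", [(1, "North"), (2, "South")])
conn.executemany(
    "INSERT INTO region_monthly VALUES (?, ?, ?)",
    [(1, "2024-01", 10), (1, "2024-02", 30), (2, "2024-01", 5), (2, "2024-02", 8)],
)

# Partial answer: only the first clause, and no entity identifier.
partial = conn.execute("""
SELECT r.region_name AS region_name, MAX(m.monthly_orders) AS highest
FROM regions r
JOIN region_monthly m ON m.region_id = r.region_id
GROUP BY r.region_name
""")
partial_columns = [c[0] for c in partial.description]

# Complete answer: identity, both extremes, and the derived difference.
complete = conn.execute("""
SELECT r.region_id AS region_id,
       r.region_name AS region_name,
       MAX(m.monthly_orders) AS highest,
       MIN(m.monthly_orders) AS lowest,
       MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference
FROM regions r
JOIN region_monthly m ON m.region_id = r.region_id
GROUP BY r.region_id, r.region_name
ORDER BY r.region_id
""")
complete_columns = [c[0] for c in complete.description]
complete_rows = complete.fetchall()

print(partial_columns)
print(complete_columns)
print(complete_rows)
```

Both queries run and both aggregate correctly; only the second one answers every part of the request.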
-
-## Model
-
-The change is **additive content in one Markdown file**
-(`skills/analytics/SKILL.md`), governed by the same invariants spec 07
-established. They constrain the implementer; the exact prose is theirs.
-
-### Additive, inline, heuristic-with-a-why
-
-Consistent with specs 07 and 10: the change is additive content in
-`skills/analytics/SKILL.md`, **inline** (no bundled `reference/` file — the
-`setup-agents.ts` delivery ships only `SKILL.md` per target), dialect-agnostic,
-and phrased as **heuristics with a one-line generic rationale**, not a wall of
-MUSTs. The new rule extends the existing `<sql_craft>` "Answer completeness /
-interpretation" group; the shipped bullets in that group (including the *identity*
-and *inputs* rules this spec builds on) are preserved unchanged. No new tool,
-flag, or config.
-
-### The over-projection guard carries a *universal* why, not a grader reference
-
-The intake draft frames "don't pad the result with extra columns" as
-*grader-gaming*. The skill forbids **any** reference to a grader, gold answer, or
-benchmark (spec 07's hard invariant; the content test bans the words). So the
-guard must ship with a **universal analytics rationale** instead: columns the
-question did not ask for add noise, mislead the reader into thinking they matter,
-and make the result harder to consume — match the request exactly, neither short
-nor padded. This is the same reconciliation spec 07 applied to the draft's
-"behavior only, no rationale" instruction: generic *why* is required; only
-grader/gold/benchmark rationale is banned.
-
-### Completeness is a closed set — identity and inputs are *inside* it
-
-"Expose identity" and "keep the inputs" tell the agent to add columns; the
-over-projection guard tells it not to. These only contradict if the target is
-left fuzzy, so this spec pins it down. A **complete projection** is exactly:
-
-> {every requested metric/attribute} ∪ {the identifier of each grouped/named
-> entity} ∪ {the inputs to each derived value}, at the grain the question
-> specifies.
-
-Identity and inputs are **members of that set** — part of completeness, never
-"padding." **Under-projection** is any member missing (the failure this spec
-attacks); **over-projection** is any column *outside* the set (what the guard
-forbids). The implementer must phrase the rule and guard against this single
-definition so they read as one coherent notion, not two competing instructions.
-
-### Dialect-agnostic, additive-only, exclusions intact
-
-Every addition reads correctly on any dialect — no dialect-specific syntax in the
-rule text or the worked example. The existing `<workflow>`, `<rules>`, and the
-other `<sql_craft>` bullets and examples (specs 07/09/10/11/12) are preserved and
-uncontradicted. Spec 07's exclusions still hold: no output-shape contract, no
-`MAX(date)` anchoring of relative time, no grader-driven advice, no dialect
-syntax.
-
-## Requirements
-
-### 1. Multi-part / multi-output completeness — a new umbrella rule
-
-Add a bullet to the `<sql_craft>` "Answer completeness / interpretation" group:
-when a question requests several outputs — a **list** ("A, B, and C"), **paired
-extremes** ("the highest *and* the lowest"), or a **value plus its components**
-("X, Y, and their ratio") — the final projection must contain a column for
-**each** requested output. *Why:* answering only the first clause is the most
-common way a runnable query is still wrong; the grain and methodology can be
-perfect yet the answer is short by columns.
-
-This rule is the **umbrella** over the two shipped completeness rules: the
-*inputs* rule (*"Keep the inputs to a derived value"*) is its "value + components"
-instance, and the *identity* rule (*"Expose identity, not just the label"*) is its
-"entity identity" instance. The new bullet should **name that relationship**
-(so the three read as one notion) rather than restating either rule.
-
-Keep this distinct from the row-selection rules in the same group: *"Top /
-highest / most / lowest"* and *"For each X / per X / by X"* govern **which rows**
-appear; multi-part completeness governs **which columns** appear. They compose
-(e.g. "highest and lowest per region" needs one row per region *and* a column per
-clause).
-
-### 2. Final completeness check — the enforcement mechanism
-
-The rule content lives **once** in `<sql_craft>`; the trigger is promoted to a
-first-class line in `<workflow>` step 6.
-
- **Capstone bullet in `<sql_craft>`** (closing the "Answer completeness /
-  interpretation" group): *before emitting the final SQL, re-read the question and
-  confirm the projection covers* —
-  1. every named **metric / attribute** the question asks for (→ the multi-part
-     rule);
-  2. the **identifier** of every grouped or named entity (→ the *identity* rule);
-  3. every **input** to each derived value (→ the *inputs* rule);
-  4. all at the **grain** the question specifies (→ the *for each X* / panel
-     rules).
-
-  Each facet cross-references the rule it enforces, so the check is what makes
-  those passive rules active. Phrase it as a short, concrete "confirm the
-  projection covers…" checklist, not a wall of MUSTs.
-
- **Over-projection guard** (attached to the check): do **not** add columns the
-  question did not ask for "to be safe" — extra columns add noise, mislead, and
-  make the result harder to consume; match the request exactly. Carries the
-  **universal** why from the Model, **never** a grader/gold/benchmark reference.
-
- **`<workflow>` step 6 line** (the explicit ritual): step 6 ("Validate and
-  explain") gains a mandatory line directing the agent to **always** run the final
-  completeness check before emitting — re-read the question and verify every
-  requested output, each entity's identity, each derived value's inputs, and the
-  grain are all projected — pointing into the `<sql_craft>` capstone for the
-  detail. This **replaces the current conditional pointer's role** ("If a result
-  is unexpectedly empty or its grain looks wrong, work through the … rules"): the
-  empty/grain diagnostic stays available (it maps to the existing *"Diagnose empty
-  results"* and grain rules), but the completeness check fires **unconditionally**,
-  on every SQL-authoring turn, not only when a result looks off. The workflow line
-  names the ritual and the four facets; the rationale, guard, and example are
-  stated once in `<sql_craft>`, not duplicated into the workflow.
-
-### 3. One worked example (dialect-agnostic)
-
-Add **exactly one** compact before/after example to the "Answer completeness /
-interpretation" group, demonstrating multi-part completeness on a **synthetic**
-schema (`regions`, `region_monthly`):
-
- **WRONG:** answers only the first clause — `SELECT region_name,
-  MAX(monthly_orders) AS highest … GROUP BY region_name` — with no region id, no
-  lowest, no difference.
- **RIGHT:** one column per requested output plus the entity's identity, at the
-  region grain — `region_id, region_name`, the highest, the lowest, and the
-  difference, with `regions` joined to `region_monthly` and grouped by the region
-  id and name.
-
-Standard dialect-clean SQL only (no `QUALIFY`, no dialect functions; `MAX`/`MIN`
-are portable aggregates). Keep it tight. It teaches multi-clause coverage +
-identity + derived-value inputs in one capstone, and is **distinct** from the
-spec-10 `regions` panel example: that one is about missing **rows** (LEFT-JOIN
-spine + `COALESCE`); this one is about missing **columns**. This is the **sixth**
-worked `sql` example in the skill (after specs 07/09/10/11/12).
-
-### 4. Coordination with specs 03 and 07/09/10/11/12
-
- **Spec 03** (multi-connection routing) owns `<workflow>` step 0 and the
-  `connectionId` threading/scoping. Spec 14 touches `<workflow>` only to add the
-  completeness-check line to **step 6** — it must not rewrite the routing or the
-  `<rules>` `connectionId` scoping. If both land, step 6 reads coherently: validate
-  + the completeness ritual.
- **Specs 07/09/10/11/12** own their own bullets and worked examples in
-  `<sql_craft>`. Spec 14 is **additive** to the same "Answer completeness /
-  interpretation" group and adds one example; it must not remove or contradict
-  theirs.
-
-## Leak-safety (hard constraint)
-
-The example uses an **invented, generic schema** (`regions`, `region_monthly`) and
-made-up columns — **no benchmark table names, SQL, or result values.** It teaches
-the *pattern* (cover every requested output + identity + inputs, at grain, without
-padding), which is universal and tied to no specific instance. The over-projection
-guard's rationale is **universal** (noise/clarity/consumability), never
-"grader-gaming" or any other scoring reference. No part of the addition mentions a
-benchmark, gold answer, grader, or scoring comparator.
-
-## Acceptance criteria
-
- `<sql_craft>` "Answer completeness / interpretation" states the **multi-part /
-  multi-output completeness** rule (a column per requested output; list / paired
-  extremes / value-plus-components), named as the umbrella over the shipped
-  *identity* and *inputs* rules — inline, dialect-agnostic, with a generic *why*.
- `<sql_craft>` states a concrete **final completeness check** (re-read the
-  question → confirm metrics + entity identity + derived-value inputs + grain are
-  projected), cross-referencing the existing identity/inputs/grain rules so they
-  are enforced, not merely listed.
- The check carries the **over-projection guard** with a **universal** rationale
-  (don't pad with unrequested columns — noise / misleading / harder to consume),
-  and the skill contains **zero** grader/gold/benchmark references anywhere.
- `<workflow>` **step 6** carries a mandatory line that runs the completeness
-  check **unconditionally** before emitting and points into the `<sql_craft>`
-  capstone; the rule content is **stated once** in `<sql_craft>` (no duplicated
-  rationale/guard in the workflow). The empty/grain diagnostic remains available.
- Exactly **one** new worked `sql` example is present (synthetic
-  `regions`/`region_monthly`, wrong vs complete), in standard dialect-agnostic SQL;
-  the skill then carries **six** `sql` worked examples total.
- The existing interactive guidance (`<workflow>` steps, `<rules>`, the other
-  `<sql_craft>` bullets and the five prior examples) is intact and uncontradicted;
-  the additive-only and dialect-clean invariants from specs 07/10 still hold.
- None of spec 07's excluded items appear (output-shape contract, `MAX(date)`
-  anchoring of "recent"/"past N", grader-driven advice, dialect syntax).
- The skill stays scannable and comfortably under the 500-line budget; the
-  frontmatter still parses as `ktx-analytics`.
- The analytics-skill **content test is updated** to cover the new rule and check
-  (see Implementation orientation).
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the prose.
-
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md`.
-  - Add the multi-part-completeness bullet and the final-completeness-check
-    capstone (with the over-projection guard) to the `<sql_craft>` "Answer
-    completeness / interpretation" group; add the single
-    `regions`/`region_monthly` worked example.
-  - In `<workflow>` step 6, replace the current conditional answer-completeness
-    pointer with the mandatory completeness-check line (unconditional, names the
-    four facets, points into `<sql_craft>`); keep the empty/grain diagnostic.
-  - Leave `<workflow>` steps 0–5, `<rules>`, and the other `<sql_craft>`
-    bullets/examples intact. Delivery is unchanged (single `SKILL.md` per target
-    via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change
-    required.
- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`.
-  - Add representative phrases to the "represents every craft behavior" list for
-    the multi-part rule, the final completeness check, and the over-projection
-    guard.
-  - Bump the worked-example `sql`-fence count assertion **5 → 6** (and update the
-    test name/comment), and assert the new example's shape (e.g. `region_monthly`,
-    `MAX(`, `MIN(`, the difference expression, `region_id`).
-  - The existing dialect-clean, grader/benchmark-clean, and relative-time
-    (`MAX(...)` anchoring) guards must still pass — the new example's `MAX`/`MIN`
-    lines carry no "recent"/"past N" wording, so the phrase-level guard is
-    unaffected. The `SkillsRegistryService` frontmatter test must still pass.
- Rebuild and re-link the dev binary so the playground picks up the updated skill:
-  `pnpm run build && pnpm run link:dev`.
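
The test changes described above (a `sql`-fence count and a representative-phrase list) could be sketched roughly as follows. This is an illustrative sketch only: `countSqlFences`, `missingPhrases`, and `sampleSkill` are hypothetical names, not helpers from `analytics-skill-content.test.ts`, and the simple open/close toggle assumes well-formed, non-nested fences.

```typescript
// Hypothetical helpers, not the real test code. The fence marker is built with
// repeat() so this sketch can itself live inside a fenced block.
const FENCE = "`".repeat(3);

// Counts fenced blocks whose info string is exactly "sql".
function countSqlFences(markdown: string): number {
  let count = 0;
  let insideFence = false;
  for (const line of markdown.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith(FENCE)) continue;
    if (!insideFence) {
      // Opening fence: count it only when the info string is "sql".
      if (trimmed.slice(FENCE.length).trim() === "sql") count += 1;
      insideFence = true;
    } else {
      // Any fence line seen while inside a block is treated as its close.
      insideFence = false;
    }
  }
  return count;
}

// Returns the phrases that do NOT appear in the skill text.
function missingPhrases(markdown: string, phrases: string[]): string[] {
  return phrases.filter((phrase) => !markdown.includes(phrase));
}

// Illustrative fragment only; this is not the content of the real SKILL.md.
const sampleSkill = [
  "Final completeness check: confirm the projection covers every requested output.",
  FENCE + "sql",
  "SELECT r.region_id, r.region_name FROM regions r",
  FENCE,
  FENCE + "text",
  "a non-sql block that must not be counted",
  FENCE,
].join("\n");
```

A real test would read the shipped `SKILL.md`, compare `countSqlFences(...)` with the expected total, and require `missingPhrases(...)` to be empty.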
-
-## Benchmark context (motivation only)
-
-On the latest SQLite-subset run, **incomplete output was the single largest
-failure bucket (~13 of 51 voted failures)**: multi-part questions answered
-partially, plus dropped identity / derived-value inputs — the latter two being
-spec-07 rules that already exist but weren't applied. A probe with a much stronger
-model reproduced the *same* incomplete-output failures, confirming this is a
-craft-enforcement gap rather than a model-capability one. The fix — answer every
-requested part, identify the entities, keep the inputs, and don't pad — is
-universal analyst craft, so it belongs in the product skill (and transfers to real
-users), enforced as a final pre-emit check rather than left as a passive hint.
-Improving the benchmark score is a side effect; the skill contains no trace of the
-benchmark.
-
-## Implementation notes
-
-Implemented as additive content in one Markdown file plus a test update.
-
- **Skill — `packages/cli/src/skills/analytics/SKILL.md`** (`<sql_craft>` "Answer
-  completeness / interpretation" group):
-  - Added the **"Answer every requested output"** umbrella bullet (list / paired
-    extremes / value-plus-components → a column per requested output, with a generic
-    *why*). It names *keep the inputs* and *expose identity* as its "value +
-    components" and "entity identity" instances, pins the closed-set definition of a
-    complete projection, and marks itself as governing *which columns* appear —
-    distinct from the *Top …* / *For each X* row-selection rules, with which it
-    composes. The two shipped instance rules are preserved verbatim.
-  - Added the **"Final completeness check"** capstone bullet: a four-facet
-    "before emitting, re-read the question and confirm the projection covers…"
-    checklist (metric/attribute → multi-part rule; identifier → *expose identity*;
-    inputs → *keep the inputs*; grain → *for each X* / *complete the panel*), run on
-    every query. It carries the **over-projection guard** with a universal rationale
-    (unrequested columns add noise, mislead, and are harder to consume — match the
-    request exactly), with **no** grader/gold/benchmark reference.
-  - Added one worked `sql` example (synthetic `regions` / `region_monthly`): WRONG
-    answers only the first clause (`SELECT region_name, MAX(monthly_orders) …`),
-    dropping the region id, the lowest, and the difference; RIGHT projects
-    `r.region_id, r.region_name`, `MAX` highest, `MIN` lowest, and the
-    `MAX − MIN` difference, joining `regions` to `region_monthly` and grouping by id
-    + name. This is the **sixth** `sql` example, dialect-clean (portable `MAX`/`MIN`).
-  - `<workflow>` **step 6**: replaced the conditional answer-completeness pointer
-    with an unconditional *"Always run the final completeness check before emitting"*
-    line that names the four facets and points into the `<sql_craft>` capstone; the
-    empty/grain diagnostic is retained for diagnosis. Steps 0–5, `<rules>`, and the
-    other `<sql_craft>` bullets/examples are untouched.
-  - Delivery is unchanged: `readAnalyticsSkillContent` in
-    `packages/cli/src/setup-agents.ts` still ships the single `SKILL.md` per target
-    (confirmed, no change required).
- **Tests — `packages/cli/test/skills/analytics-skill-content.test.ts`:** added the
-  three representative phrases (`Answer every requested output`, `Final completeness
-  check`, `Don't over-project`); bumped the `sql`-fence count assertion 5 → 6 and
-  renamed that test; asserted the new example's shape (`region_monthly`,
-  `MAX(rm.monthly_orders)`, `MIN(rm.monthly_orders)`, the `MAX − MIN` difference, and
-  `r.region_id, r.region_name`). The dialect-clean, grader/benchmark-clean,
-  relative-time, and frontmatter guards still pass.
- **Verification:** `analytics-skill-content` 9/9 and `setup-agents` 46/46 pass;
-  production type-check (`tsconfig.json`, src) is clean; `pnpm run build` copied the
-  updated skill into `dist/skills/analytics/SKILL.md` (6 fences, all new content
-  present) and `pnpm -w run link:dev` re-linked `ktx-dev` so the playground picks it
-  up. The skill is 244 lines (< 500 budget) and the frontmatter still parses as
-  `ktx-analytics`.
- **Deviation (cosmetic):** the worked example uses alias `rm` and a difference
-  column named `order_count_range`; the intake draft sketched alias `m` and
-  `AS difference`. The spec leaves prose to the implementer, so the change is purely
-  naming.
- **Unrelated pre-existing issue:** `tsconfig.test.json` reports one type error in
-  `packages/cli/test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts`/`contextTools`
-  mismatch introduced by the earlier connection-scoped-wiki commit `2677b3ef`). It is
-  untouched by this work and out of scope here.
--- a/spider2-specs/specs/15-mcp-server-structured-logging.md
+++ b/spider2-specs/specs/15-mcp-server-structured-logging.md
@ -1,405 +0,0 @@
-# Structured, leveled logging for the ktx MCP server
-
-> Refined spec. Intake draft: `todo/15-mcp-server-structured-logging.md`.
->
-> **Scope: observability only.** This spec is about *seeing* what the MCP server
-> does (which tool, what params, when, how long, outcome). *Preventing* a runaway
-> query from blocking the server (off-event-loop / interruptible execution) is a
-> separate concern — see "Non-goals".
-
-## Problem
-
-The ktx MCP server (`mcp-http-server.ts` + `mcp-stdio-server.ts`, both built
-through `mcp-server-factory.ts` on raw `node:http` + the
-`@modelcontextprotocol/sdk` transports) emits almost no operational logs. There
-is no server-side record of **which MCP tool was called, with what parameters,
-when, how long it took, or whether it succeeded** — nor of session open/close or
-transport errors. When a tool call is slow, hangs, or a client connection drops
-("Transport channel closed"), an operator has no trail to diagnose it and must
-resort to process sampling / `lsof` / guesswork — and the offending input
-(e.g. the exact SQL) is typically unrecoverable.
-
-The hook to fix this already exists but is half-built: `instrumentMcpServer`
-(`context/mcp/context-tools.ts`) wraps every tool handler and already times it,
-but it emits **only on completion** (a sampled `mcp_request_completed` telemetry
-event). It **never writes a start line** and **never writes to the server log**, so
-a call that never returns leaves no trace at all.
-
-## Generic use case (independent of any benchmark)
-
-Anyone running a long-lived ktx MCP server — a developer's local instance
-(stdio, launched by Claude Desktop / Cursor), a foreground HTTP server, or a
-shared/hosted HTTP daemon — needs observability into tool-call activity to:
-
- diagnose slow or hung tool calls (which `sql_execution` ran, against which
-  connection, with what SQL, for how long);
- explain client-visible connection failures from the server side (session
-  lifecycle, transport-closed events);
- audit what agents asked the server to do;
- spot patterns (hot tools, slow connections, error rates).
-
-This is standard production-server hygiene; the server currently provides none.
-
-## Design decisions (resolved during refinement)
-
-These resolve ambiguities the intake draft left open. They constrain the
-implementer; the exact code is theirs.
-
-### One `pino` logger, synchronous, written to **stderr**
-
-Use `pino` — the de-facto standard structured-JSON logger for Node servers — as
-a single shared instance. Two corrections to the draft's sketch:
-
- **stderr, not stdout.** The stdio transport reserves **stdout** for the
-  JSON-RPC protocol (`mcp-stdio-server.ts` deliberately no-ops `stdout.write`);
-  writing logs there would corrupt the protocol stream. The HTTP daemon already
-  redirects **both** child fds to `.ktx/logs/mcp.log`
-  (`managed-mcp-daemon.ts`: `stdio: ['ignore', log.fd, log.fd]`), so stderr lands
-  in the same log file (surfaced by `ktx mcp logs`). **stderr is therefore the
-  one universally-correct sink** for both transports.
- **Synchronous, no worker-thread transport.** `pino` writes through a
-  `DestinationStream` (`{ write(msg) }`) — the server's existing
-  `KtxCliIo.stderr` sink satisfies that interface directly. Configure pino with a
-  **synchronous** destination (`pino.destination({ sync: true })`, or the
-  pino-pretty stream below with `sync: true`). This is load-bearing: the
-  `tool.start` line **must** be flushed to the fd *before* the (possibly
-  blocking) handler runs, so a runaway synchronous `better-sqlite3` query that
-  pegs the event loop still leaves the start line on disk. A worker-thread
-  transport (`transport: { target: ... }`) buffers and can lose that exact line
-  on a hard crash — **do not use transport mode.**
-
-### Format is derived from `stderr.isTTY`, not a config flag
-
-One logger, two serializations chosen by the environment (the "behavior follows
-from inputs" rule — not a user-visible knob):
-
- **TTY** (`ktx mcp start --foreground` or `ktx mcp stdio` run in a terminal) →
-  **`pino-pretty` as a synchronous in-process stream** (`pretty({ sync: true,
-  destination: <stderr sink> })`, colorized). A readable live dev view.
- **Not a TTY** (the detached daemon, whose stderr is the `.ktx/logs/mcp.log`
-  file fd) → **plain JSON line** via the synchronous pino destination. The log
-  *file* stays structured JSON so the incident workflow ("recover the hung query
-  with a one-line `grep` / `jq`") works — colorized ANSI in a file would defeat
-  it.
-
-`KtxCliIo.stderr` has no `isTTY` field (`cli-runtime.ts`), so detect the terminal
-from the underlying stream (`process.stderr.isTTY`) at logger construction, while
-still writing *through* the `io.stderr` sink so tests can capture emitted lines.
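
A minimal sketch of that split, assuming a sink with only a `write` method: the TTY flag is passed in separately (in the server it would come from `process.stderr.isTTY`), while every line still goes through the injected sink. `StderrSink`, `chooseLogFormat`, and `createLineWriter` are hypothetical names, and the bracketed text format is only a stand-in for `pino-pretty` output.

```typescript
// Hypothetical stand-in for the KtxCliIo stderr sink: a write() method and no isTTY.
interface StderrSink {
  write(chunk: string): void;
}

type LogFormat = "pretty" | "json";

// The format follows from the environment; there is no user-facing flag.
function chooseLogFormat(isTTY: boolean | undefined): LogFormat {
  return isTTY === true ? "pretty" : "json";
}

// Builds a writer that serializes one line per event and pushes it through the sink.
function createLineWriter(sink: StderrSink, isTTY: boolean | undefined) {
  const format = chooseLogFormat(isTTY);
  return (level: string, msg: string, fields: Record<string, unknown> = {}): void => {
    const time = Date.now();
    const line =
      format === "json"
        ? JSON.stringify({ level, time, msg, ...fields })
        : `[${new Date(time).toISOString()}] ${level.toUpperCase()} ${msg} ${JSON.stringify(fields)}`;
    // Always write through the injected sink so a test can capture the line.
    sink.write(line + "\n");
  };
}
```

Keeping detection (`isTTY`) separate from the destination (`sink`) is what lets a test force either format while capturing output in memory.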
-
-### Single hook: extend `instrumentMcpServer`, do not fork a second wrapper
-
-Tool-call logging is added to the existing `instrumentMcpServer`
-(`context-tools.ts`), which already wraps `registerTool` and measures duration.
-It receives the **raw** tool input (it wraps the schema-parsing handler from
-`registerParsedTool`), so the params it logs include `sql` for `sql_execution`.
-The existing telemetry emission stays unchanged; logging is **additive** beside
-it. Because both transports build their server through `mcp-server-factory.ts` →
-`registerKtxContextTools`, this single change gives **both HTTP and stdio**
-tool-call logging for free.
-
-### `sessionId` / `callId` provenance
-
- **`sessionId`** comes from the SDK's per-call handler context
-  (`RequestHandlerExtra.sessionId`; confirmed present in `@modelcontextprotocol/sdk`
-  `1.29.0`). It is populated for the HTTP StreamableHTTP transport and absent for
-  stdio (single session) — log it when present, omit otherwise. Add
-  `sessionId?: string` to `KtxMcpToolHandlerContext` (`context/mcp/types.ts`).
- **`callId`** is generated per invocation with `randomUUID()` (already imported
-  in `context-tools.ts`). It correlates a `tool.start` with its `tool.end`.
-
-### No redaction in v1 (explicit)
-
-v1 ships **no log redaction**. Rationale recorded here so it is a deliberate
-choice, not an oversight: these logs are **local** (stderr → `.ktx/logs/mcp.log`),
-**never transmitted off-box**, and sit at the **same trust boundary** as the
-`ktx.yaml` / environment that already hold the connection credentials. Concretely:
-
- Request **headers are never logged** at all, so the bearer token
-  (`KTX_MCP_TOKEN`) simply isn't collected — this is "not logged," not "redacted."
- Errors are logged with their **full message and stack** via pino's standard
-  `err` serializer.
- SQL text and tool params are logged **verbatim** (they are not secrets).
-
-Credential redaction (e.g. a DB URL embedded in a driver error string) is an
-explicit **v1 non-goal**; revisit only if these logs are ever shipped off-box.
-This drops the draft's "light redaction" requirement and the
-`collectTelemetryRedactionSecrets` / scrubber reuse it implied.
-
-## Requirements
-
-### 1. One shared pino logger
-
- A single `pino` instance per server process, constructed once and threaded to
-  both the transport layer (for lifecycle events) and the tool layer (for
-  tool-call events). Level set from env (Requirement 7), default `info`.
- Synchronous destination bound to the server's stderr sink (see Design
-  decisions). Pretty (`pino-pretty`, sync stream) when `process.stderr.isTTY`,
-  otherwise plain JSON. Each line carries pino's standard `time` and `level`.
- No new dependency beyond `pino` and `pino-pretty`. No OpenTelemetry / metrics
-  stack, no async/worker transport, no in-app file rotation.
-
-### 2. Per-session / per-call context via child loggers
-
-Use pino child loggers so every line carries the relevant correlation fields:
-a per-call child binds `{ tool, callId }` plus `sessionId` when present, so one
-session's or one call's activity can be grepped from the log.
-
-### 3. Tool-call logging — START before execute, END after
-
-In `instrumentMcpServer`, for **every** MCP tool invocation:
-
- **On entry, before invoking the handler**, write `tool.start` with
-  `{ tool, callId, sessionId?, params }` at **`info`**. `params` is the raw tool
-  input; for `sql_execution` this includes the full **SQL text** (the single most
-  useful field). The write is synchronous so the line exists even if the handler
-  never returns.
- **On normal completion**, write `tool.end` with
-  `{ tool, callId, sessionId?, durationMs, outcome: "ok", resultSize }` at
-  **`info`** — *unless* it is a slow call (Requirement 4). `resultSize` is a
-  tool-agnostic size measure (byte length of the serialized result text content).
- **On error**, write `tool.end` with
-  `{ tool, callId, sessionId?, durationMs, outcome: "error", err }` at **`error`**,
-  where `err` is the serialized error (message + stack) per Requirement 6.
-
-`tool.start` and `tool.end` share the **same correlation fields and the same
-`info` level** (for the non-slow, non-error case) so that an **unmatched
-`tool.start`** — a start with no `tool.end` for the same `callId` — is an
-unambiguous "this call hung" signal. This is the property that makes a runaway
-`sql_execution` identifiable from the log alone, with its exact SQL and
-timestamp, no process sampling.
-
-> **Deliberate change from the intake draft.** The draft put `tool.start` /
-> `tool.end` at `debug` (suppressed at the default `info`). That defeats the
-> motivating incident: a hang is unpredictable, so debug would have to be enabled
-> *before* it occurs, which never happens. v1 logs start/end at **`info`** — an
-> always-on access log — so the offending query is recoverable at the default
-> level. `debug` is reserved for heavier detail (Requirement 7).
-
-### 4. Slow-call warning
-
-When a call **completes** with `durationMs` greater than the configured slow
-threshold (Requirement 7), emit its `tool.end` at **`warn`** (carrying the same
-fields plus the duration) instead of `info`. This makes a completed-but-slow call
-stand out and keeps it visible even when the level is raised to `warn`.
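
Requirements 3 and 4 can be sketched together as one wrapper. This is a sketch under stated assumptions, not the `instrumentMcpServer` code: `withToolLogging`, `LogFn`, `CallContext`, and the injected `newCallId` / `now` functions are hypothetical (`newCallId` stands in for `randomUUID()`), and the logger is reduced to a plain synchronous callback.

```typescript
type Level = "info" | "warn" | "error";
type LogFn = (level: Level, event: string, fields: Record<string, unknown>) => void;

interface ToolResult {
  content: { type: "text"; text: string }[];
}

interface CallContext {
  sessionId?: string; // present for HTTP sessions, absent for stdio
}

// UTF-8 byte length computed by hand to keep the sketch dependency-free.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0) ?? 0;
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return bytes;
}

// Hypothetical wrapper: logs tool.start before the handler runs and tool.end after.
function withToolLogging<P>(
  tool: string,
  handler: (params: P, ctx: CallContext) => Promise<ToolResult>,
  deps: { log: LogFn; slowMs: number; newCallId: () => string; now: () => number },
): (params: P, ctx: CallContext) => Promise<ToolResult> {
  return async (params, ctx) => {
    // Correlation fields shared by both lines; sessionId is omitted when absent.
    const bound = {
      tool,
      callId: deps.newCallId(),
      ...(ctx.sessionId !== undefined ? { sessionId: ctx.sessionId } : {}),
    };
    const startedAt = deps.now();
    // Written before the handler is invoked, so a call that never returns
    // still leaves this line behind.
    deps.log("info", "tool.start", { ...bound, params });
    try {
      const result = await handler(params, ctx);
      const durationMs = deps.now() - startedAt;
      const resultSize = result.content.reduce((sum, part) => sum + utf8ByteLength(part.text), 0);
      // A completed-but-slow call is promoted from info to warn.
      const level: Level = durationMs > deps.slowMs ? "warn" : "info";
      deps.log(level, "tool.end", { ...bound, durationMs, outcome: "ok", resultSize });
      return result;
    } catch (err) {
      const durationMs = deps.now() - startedAt;
      deps.log("error", "tool.end", { ...bound, durationMs, outcome: "error", err });
      throw err;
    }
  };
}
```

Because the body of an `async` function runs synchronously up to its first `await`, the `tool.start` callback fires during the call itself, before the handler has a chance to block.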
-
-### 5. Connection / session lifecycle and transport errors
-
- **HTTP** (`mcp-http-server.ts`, in `newTransport`): log `session.open` from
-  `onsessioninitialized` and `session.close` from `onsessionclosed` /
-  `transport.onclose`, each with `sessionId`, at `info`. **Wire the currently
-  unused `transport.onerror`** to log `transport.error` (the SDK's
-  closed-channel / "Transport channel closed" events) at `error`, so a
-  client-visible connection failure has a server-side counterpart.
- **stdio** (`mcp-stdio-server.ts`): route the existing `transport.onerror`
-  handler, which currently writes a plain string to stderr, through the logger
-  as a `transport.error` line at `error`. A single `session.open` /
-  `session.close` pair for the one stdio connection MAY be logged at `info`.
-
-### 6. Structured error logging
-
-Errors are logged as structured objects via pino's standard `err` serializer
-(`pino.stdSerializers.err` or equivalent), carrying error class, message, and
-stack — never a bare interpolated string. The existing telemetry exception
-reporting in `instrumentMcpServer` / `registerParsedTool` is unchanged.
-
-### 7. Configuration surface
-
- **`KTX_MCP_LOG_LEVEL`** — pino level (`error` | `warn` | `info` | `debug` |
-  …), default **`info`**. MCP-scoped name because the MCP server is the only
-  emitter today; naming it global (`KTX_LOG_LEVEL`) would imply a logging system
-  that does not exist.
- **`KTX_MCP_SLOW_TOOL_MS`** — slow-call threshold in milliseconds (Requirement
-  4), default **`10000`**. Justified as a real ops knob: "slow" differs sharply
-  between a local SQLite file and a remote warehouse.
- Level ladder that results from Requirements 3–5:
-  - `debug`: everything emitted at `info`, **plus** heavier detail (e.g. result
-    bodies, progress notifications); what extra to attach is at the implementer's
-    discretion.
-  - `info` (default): `tool.start` / `tool.end`, session lifecycle, slow `warn`s,
-    errors.
-  - `warn`: slow-call `tool.end`s, `transport.error`, errored `tool.end`s — but
-    not routine tool traffic.
-  - `error`: errored `tool.end`s and `transport.error` only.
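
The two knobs and the resulting ladder might be read as below. This is a hedged sketch: the function names mirror the ones in the implementation notes, but the bodies, the fallback-to-default behavior on invalid input, and the `Env` / `isEmitted` helpers are assumptions rather than the shipped code.

```typescript
// Level names as pino defines them, ordered from least to most severe.
const LEVELS = ["trace", "debug", "info", "warn", "error", "fatal"] as const;
type Level = (typeof LEVELS)[number];

// The environment is passed in so the functions stay pure and testable.
type Env = Record<string, string | undefined>;

// Reads KTX_MCP_LOG_LEVEL; unknown or missing values fall back to "info" (assumed behavior).
function mcpLogLevel(env: Env): Level {
  const raw = env["KTX_MCP_LOG_LEVEL"]?.trim().toLowerCase();
  return LEVELS.find((level) => level === raw) ?? "info";
}

// Reads KTX_MCP_SLOW_TOOL_MS; non-numeric or non-positive values fall back to 10000 (assumed behavior).
function mcpSlowToolMs(env: Env): number {
  const parsed = Number(env["KTX_MCP_SLOW_TOOL_MS"]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : 10000;
}

// A line is emitted when its level is at least as severe as the configured level.
function isEmitted(lineLevel: Level, configured: Level): boolean {
  return LEVELS.indexOf(lineLevel) >= LEVELS.indexOf(configured);
}
```

With the level set to `warn`, `isEmitted("info", "warn")` is false, so routine `tool.start` / `tool.end` lines drop out, while slow-call `tool.end` lines (written at `warn`) and errors remain.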
-
-## Acceptance criteria
-
- At default level (`info`), invoking any MCP tool produces a `tool.start`
-  (`tool`, `callId`, `sessionId` when HTTP, `params`) and a matching `tool.end`
-  (`durationMs`, `outcome`, `resultSize`) line, as **JSON to stderr** when stderr
-  is not a TTY.
- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a
-  `tool.start` line carrying its **exact SQL and timestamp** and **no** matching
-  `tool.end` for that `callId` — so the offending query is recoverable from the
-  log alone, with no process sampling.
- A completed call slower than `KTX_MCP_SLOW_TOOL_MS` emits its `tool.end` at
-  `warn` with its `durationMs`.
- Session open/close and transport-closed (`transport.error`) events are logged
-  with the `sessionId` (HTTP); the stdio transport error path goes through the
-  logger, not a raw `stderr.write`.
- At level `warn`, routine `tool.start` / `tool.end` are suppressed but
-  slow-call warnings, transport errors, and errored calls are present.
- When stderr is a TTY (`ktx mcp start --foreground` / `ktx mcp stdio` in a
-  terminal), output is human-readable colorized `pino-pretty`; the daemon log
-  file (`.ktx/logs/mcp.log`) is plain JSON. Both paths are synchronous.
- The bearer token never appears in any log line (headers are not logged); SQL
-  and tool params do appear.
- No worker-thread / async log transport is introduced; no OpenTelemetry /
-  metrics stack; the only new dependencies are `pino` and `pino-pretty`.
- The existing `mcp_request_completed` telemetry and exception reporting still
-  work unchanged.
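
The hung-call criterion above amounts to finding every `tool.start` whose `callId` never gets a `tool.end`. A rough sketch over the JSON log text follows; it assumes the event name lands in the line's `msg` field (pino's default message key), which the spec does not state, and `findUnmatchedStarts` / `ToolLine` are hypothetical names.

```typescript
// Assumed shape of the fields this sketch reads from each JSON log line.
interface ToolLine {
  msg?: string;
  callId?: string;
  tool?: string;
  time?: number;
  params?: unknown;
}

// Returns the tool.start lines that have no tool.end with the same callId.
function findUnmatchedStarts(logText: string): ToolLine[] {
  const open = new Map<string, ToolLine>();
  for (const raw of logText.split("\n")) {
    if (raw.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // skip lines that are not JSON (for example pretty output)
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const line = parsed as ToolLine;
    if (typeof line.callId !== "string") continue;
    if (line.msg === "tool.start") open.set(line.callId, line);
    else if (line.msg === "tool.end") open.delete(line.callId);
  }
  return Array.from(open.values());
}
```

Each returned line still carries its `params`, so for a hung `sql_execution` the offending SQL and start time can be read directly from the result.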
-
-## Non-goals
-
- **Preventing / interrupting runaway queries** (off-event-loop execution, query
-  timeouts, worker-thread isolation). A single synchronous query that fans out
-  into a massive nested-loop join can peg the single-threaded server for hours
-  and break new connections — observability surfaces *which* query, but the fix
-  is execution-model work in a separate spec. (This logging is also the
-  prerequisite for a future watchdog that detects a `tool.start` with no
-  `tool.end` past a threshold and recycles the server.)
- **Log redaction** (see Design decisions) — explicit v1 non-goal.
- **Pretty output as a worker-thread transport** — the TTY path uses pino-pretty
-  as a synchronous in-process stream only.
- Metrics / tracing / OpenTelemetry exporters.
- Forwarding logs to the MCP *client* via the protocol logging capability
-  (`notifications/message`, `logging/setLevel`) — a possible later enhancement,
-  distinct from operational stderr logging.
- A global `KTX_LOG_LEVEL` spanning non-MCP commands — out of scope until other
-  surfaces emit structured logs.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the design.
-
- **New module** — a small logger factory, e.g.
-  `packages/cli/src/context/mcp/logger.ts`: builds the shared pino instance from
-  the stderr sink + `KTX_MCP_LOG_LEVEL`, choosing the pino-pretty (sync) stream
-  when `process.stderr.isTTY`, else `pino.destination({ sync: true })`, and
-  exposes the slow-call threshold read from `KTX_MCP_SLOW_TOOL_MS`.
- **Tool-call logging** — `packages/cli/src/context/mcp/context-tools.ts`:
-  extend `instrumentMcpServer` (~line 585) to write `tool.start` before
-  `handler(...)` and `tool.end` after (ok / slow-`warn` / `error`); generate
-  `callId` via the already-imported `randomUUID`; read `sessionId` from the
-  handler `context`. Thread the logger via `RegisterKtxContextToolsDeps`
-  (~line 26) and `registerKtxContextTools` (~line 650). Leave `registerParsedTool`
-  and the existing telemetry emission intact.
- **Context type** — `packages/cli/src/context/mcp/types.ts`: add
-  `sessionId?: string` to `KtxMcpToolHandlerContext`; add the logger to
-  `KtxMcpServerDeps` / the register deps.
- **Server wiring** — `packages/cli/src/context/mcp/server.ts`
-  (`createDefaultKtxMcpServer` / `createKtxMcpServer`) and
-  `packages/cli/src/mcp-server-factory.ts` (`createKtxMcpServerFactory`): accept
-  and pass the logger down to `registerKtxContextTools`.
- **HTTP lifecycle** — `packages/cli/src/mcp-http-server.ts`: construct (or
-  receive) the logger; in `newTransport` (~line 186) log `session.open` /
-  `session.close` and add `transport.onerror` → `transport.error`.
- **stdio lifecycle** — `packages/cli/src/mcp-stdio-server.ts`: construct (or
-  receive) the logger; route the existing `transport.onerror` (~line 54) through
-  it.
- **Log destination is already captured** — `packages/cli/src/managed-mcp-daemon.ts`
-  redirects child stdout+stderr to `.ktx/logs/mcp.log`; `ktx mcp logs`
-  (`commands/mcp-commands.ts`) tails it. No change needed there.
- **Dependencies** — add `pino` and `pino-pretty` to
-  `packages/cli/package.json`. Verify Knip/Biome dead-code and bundle checks
-  still pass.
- **Tests** — extend `packages/cli/test/mcp-http-server.test.ts`,
-  `mcp-server-factory.test.ts`, `context/mcp/server.test.ts`, and
-  `commands/mcp-commands.test.ts`: assert (a) a `tool.start` JSON line is written
-  before a (mock) handler runs and carries `params`/`sql`; (b) a matching
-  `tool.end` with `durationMs`/`outcome`; (c) a hung-handler scenario yields a
-  `tool.start` with no `tool.end` for that `callId`; (d) a slow completion emits
-  `warn`; (e) session lifecycle + `transport.error` lines; (f) the bearer token
-  never appears. Inject a capturing `io.stderr` and parse the JSON lines.
-  *Note:* `mcp-server-factory.test.ts` carries a pre-existing
-  `KtxMcpContextPorts`/`contextTools` type error (from commit `2677b3ef`,
-  unrelated to this work) — do not let it mask new failures.
- After implementing, rebuild and re-link so the playground picks it up:
-  `pnpm run build && pnpm run link:dev`.
-
-## Benchmark context (motivation, not a requirement)
-
-During a concurrent Spider 2.0-Lite run against the MCP server, an
-adversarial-reviewer-generated query degenerated into a massive nested-loop join;
-synchronous `better-sqlite3` executed it on the event loop, pegging a server at
-~100% CPU for hours and breaking new MCP connections ("Transport channel
-closed"). We could not determine *which* query, because the server logs nothing
-about tool calls — diagnosis required `sample` / `lsof` on the live process and
-the exact SQL was never recovered. Structured tool-call logging — especially
-`tool.start` written synchronously *before* execution, at the default level —
-would have turned this into a one-line `grep` of the server log. Improving the
-benchmark is a side effect; the logging is generic production-server hygiene.
-
-## Implementation notes
-
-Implemented on branch `write-feature-spec-wiki`. All requirements and acceptance
-criteria are satisfied.
-
-**What was built / where**
-
- **New module `packages/cli/src/context/mcp/logger.ts`** — `createMcpLogger(io,
-  { isTTY? })` builds one synchronous `pino` (v10) instance written through the
-  `io.stderr` sink: plain JSON when stderr is not a TTY, a `pino-pretty` (v13)
-  synchronous in-process stream (`{ colorize: true, sync: true }`, wrapping the
-  sink in a `node:stream.Writable`) when it is. Also exports `mcpLogLevel`
-  (`KTX_MCP_LOG_LEVEL`, validated against pino levels, default `info`),
-  `mcpSlowToolMs` (`KTX_MCP_SLOW_TOOL_MS`, default `10000`), and
-  `serializeMcpError`. No worker/async transport; no global `KTX_LOG_LEVEL`.
- **Tool-call logging — `instrumentMcpServer` (`context/mcp/context-tools.ts`)** —
-  per invocation: `callId = randomUUID()`, a child logger bound to
-  `{ tool, callId, sessionId? }`, `tool.start { params }` written at `info`
-  **before** awaiting the handler (synchronous, so a runaway query still leaves it
-  on disk), and `tool.end` after: `info { durationMs, outcome:"ok", resultSize }`,
-  `warn` when `durationMs > KTX_MCP_SLOW_TOOL_MS`, or `error { outcome:"error",
-  err }`. `resultSize` is the UTF-8 byte length of the serialized text content.
-  The existing `mcp_request_completed` telemetry + `reportException` are unchanged
-  (`durationMs` is now computed once and shared); `registerParsedTool` is intact.
- **`sessionId` / logger plumbing** — `sessionId?: string` added to
-  `KtxMcpToolHandlerContext`; a single per-process logger threads from each
-  transport entrypoint through `createKtxMcpServerFactory` →
-  `createDefaultKtxMcpServer` → `createKtxMcpServer` → `registerKtxContextTools`
-  (`KtxMcpServerDeps.logger`, `RegisterKtxContextToolsDeps.logger`).
- **HTTP lifecycle (`mcp-http-server.ts`)** — `session.open` from
-  `onsessioninitialized`, `session.close` from `transport.onclose`, and the
-  previously-unused `transport.onerror` wired to `transport.error` at `error`.
- **stdio lifecycle (`mcp-stdio-server.ts`)** — the raw `transport.onerror`
-  string write is replaced by a `transport.error` log line; `session.open` /
-  `session.close` are logged for the single stdio session.
- **Deps** — `pino ^10.3.1`, `pino-pretty ^13.1.3` added to
-  `packages/cli/package.json`.
- **Tests** — `test/context/mcp/logger.test.ts` (factory, level/threshold env
-  parsing, error serializer, TTY vs JSON), a "MCP tool-call logging" block in
-  `test/context/mcp/server.test.ts` (start-before-handler, matching end with
-  `resultSize`, hung-handler leaves an unmatched start, slow→`warn`, `warn`-level
-  suppression with errored end still present, no-logger no-op), session lifecycle
-  + bearer-token-never-logged in `test/mcp-http-server.test.ts`, and
-  `test/mcp-stdio-server.test.ts` for `transport.error`.
-
-**Deviations / decisions**
-
- **In-band errors carry no stack (inherent).** `registerParsedTool` converts a
-  thrown handler error into an `{ isError: true }` result (and reports the full
-  error via telemetry) before it reaches `instrumentMcpServer`, so the original
-  stack is already gone. `tool.end` for such a result logs `outcome:"error"` with
-  `err.message` only; a genuine throw that escapes gets the full pino `err`
-  serialization (type + message + stack). The field is always `err` for
-  consistency. This honours "leave `registerParsedTool` intact."
- **`session.close` is logged from `transport.onclose`** (the universal close
-  signal for both clean DELETE and dropped connections) rather than
-  `onsessionclosed`, to avoid duplicate lines; `onsessionclosed` keeps its
-  session-map cleanup role.
- **The logger is optional throughout.** Production always wires one per process;
-  when absent (programmatic/test callers that inject `createMcpServer`), tool-call
-  logging is simply off — which keeps existing tests unchanged.
- `createMcpLogger` accepts an optional `isTTY` purely as a test seam; production
-  derives format from `process.stderr.isTTY`.
-
-**Verification**
-
-`pnpm --filter @kaelio/ktx exec vitest run` for the four touched/added MCP test
-files: 57 passed. Full default `pnpm run test`: 3018 passed, 1 skipped — the only
-2 failures are in `test/skills/analytics-skill-content.test.ts`, pre-existing and
-unrelated to this change (in-progress analytics-skill work on this branch).
-`pnpm run dead-code` (Biome + Knip default + Knip production) clean. `pnpm run
-build` and `pnpm run link:dev` succeed. `pnpm run type-check` reports only the
-one pre-existing, test-only error in `test/mcp-server-factory.test.ts` from commit
-`2677b3ef` (documented above); all source and the new tests type-check clean.
--- a/spider2-specs/specs/16-bounded-query-execution-timeout.md
+++ b/spider2-specs/specs/16-bounded-query-execution-timeout.md
@@ -1,493 +0,0 @@
-# Bounded query execution (deadline + non-blocking) for read SQL
-
-> Refined spec. Intake draft: `todo/16-bounded-query-execution-timeout.md`.
->
-> **Scope: bound and cancel a read query that runs too long.** This is the
-> execution-model companion to spec 15 (MCP structured logging). Spec 15
-> *surfaces* a runaway query in the log; it explicitly defers *preventing* one —
-> "off-event-loop execution, query timeouts, worker-thread isolation … is
-> execution-model work in a separate spec." This is that spec.
-
-## Problem
-
-Two compounding gaps on the read-query path (`executeReadOnly`), confirmed in the
-current code:
-
-1. **No execution deadline, handled divergently per connector.** A single
-   expensive query runs unbounded, and whether it is bounded at all depends
-   entirely on which driver the caller hit:
-   - **BigQuery** is the only connector with a real statement timeout — it sets
-     `jobTimeoutMs` on the query job from a per-connection config field
-     `job_timeout_ms` (`connectors/bigquery/connector.ts`, `query(...)` ~491–512).
-   - **ClickHouse** sets a hardcoded 30s *HTTP* `request_timeout` at client
-     creation (`connectors/clickhouse/connector.ts:602`) — a client-side give-up,
-     not a server-side `max_execution_time`; the server keeps working.
-   - **Snowflake, Postgres, MySQL, SQL Server** bound only pool/connection
-     *acquisition* (Snowflake `acquireTimeoutMillis: 60_000`; Postgres
-     `connectionTimeoutMillis: 10_000`; SQL Server `idleTimeoutMillis: 30000`;
-     MySQL pool size only) — nothing bounds statement *execution*.
-   - **SQLite** has nothing.
-
-2. **In-process SQLite blocks the event loop and cannot be cancelled.** The
-   SQLite connector executes on the main thread via synchronous
-   `better-sqlite3 .prepare().all()` (`connectors/sqlite/connector.ts`,
-   `query(...)` 311–318, used by `executeReadOnly` 247–251). A slow query freezes
-   the whole MCP server — it cannot serve other requests, send progress, or write
-   `tool.end` — and there is no in-thread way to interrupt it: better-sqlite3 (v12)
-   exposes no interrupt/cancel API. Its documented mechanism for slow queries is a
-   **worker thread**, and the only way to stop a runaway synchronous query is to
-   **terminate the thread** executing it (context7 `/wiselibs/better-sqlite3`,
-   `docs/threads.md`).
-
-The observed failure (Spider 2.0-Lite SQLite run, 2026-06-18): a single
-`sql_execution` MCP call —
-`SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on `complex_oracle`,
-where `profits` is a VIEW (`costs ⋈ sales`, 918,843 × 82,112 rows, joined on a
-4-column key with no composite index) — degraded to an O(N×M) nested-loop scan,
-pegged a worker at 100% CPU for 13+ minutes, never returned, produced a
-`tool.start` with no matching `tool.end`, and stalled an eval shard until the
-worker was killed by hand. A row cap (`maxRows`) does not help: it bounds returned
-rows, not scan work, and the failing query returned a single aggregate row.
-
-## Generic use case (independent of any benchmark)
-
-Any data agent that lets an LLM author SQL will eventually issue an
-accidentally-expensive query — an unindexed or cartesian join, an expensive VIEW,
-a wide aggregate over a large fact table. A general-purpose context layer must
-bound that and return a clean, fast "query exceeded Ns" error so the agent can
-revise (add filters, query base tables, narrow the range) instead of hanging the
-tool and the server. This matters for embedded/local warehouses (SQLite, and any
-future DuckDB-style in-process driver) and remote ones alike, and is wholly
-independent of any benchmark.
-
-## Design decisions (resolved during refinement)
-
-These resolve ambiguities the intake draft left open. They constrain the
-implementer; the exact code is theirs.
-
-### One canonical deadline, applied uniformly at the contract
-
-The deadline is enforced for **every** `executeReadOnly` caller, not only the MCP
-`sql_execution` path. `executeReadOnly` has 13 call sites beyond MCP (ingest query
-executor, relationship profiling and composite-candidate probes, relationship
-validation, historic-SQL probes, `ktx sql`); the contract is the single place to
-bound all of them. A heavy ingest profiling probe over a giant unindexed join is
-exactly as worth abandoning as an interactive one — those call sites are
-best-effort and degrade gracefully, so a deadline `KtxQueryError` becomes "skip
-this probe / mark unprofiled," not "fail the source." (Requirement 8 covers the
-call sites that must treat the timeout as recoverable.)
-
-> Rejected alternative: a caller-resolved deadline (short on the interactive path,
-> longer/none for ingest). That introduces a second value source and the open
-> question "what is the ingest budget," for no real gain — the 30s default already
-> clears any normal profiling probe, and a probe that exceeds it is one to drop.
-
-### Default 30s, configurable per-connection via one shared field
-
- **Default `30_000` ms.** Fast enough that an LLM agent gets a clean
-  "exceeded 30s" and revises within the same turn; generous headroom over any
-  indexed aggregate or normal profiling probe; a genuine pathological nested-loop
-  scan blows past it immediately.
- **One shared per-connection override**, honored by every connector:
-  `query_timeout_ms` in `ktx.yaml` (`queryTimeoutMs` in TS), a positive integer
-  in **milliseconds**. Milliseconds matches the BigQuery SDK and the field it
-  replaces; the user-facing error still reads in seconds.
- **BigQuery's `job_timeout_ms` config key is removed**, not kept alongside the
-  new field. BigQuery reads the shared `query_timeout_ms` and maps the resolved
-  value onto its SDK's `jobTimeoutMs`. ktx keeps no backward compatibility, so
-  there is exactly one way to set a query timeout — no parallel knob (intake
-  requirement 1).
- **Granularity is per-connection only.** No global all-connections override —
-  different warehouses have different performance envelopes, and a second
-  (global) knob would double the configuration surface for no stated need.
-
-### The shared contract is a value + an error, not a base class
-
-There is **no shared connector base class or factory** — each connector is
-constructed independently; the only shared registry is the *dialect* factory
-(`context/connections/dialects.ts:47–55`). So "defined once" (intake requirement
-3) means a single shared module that owns:
-
- `DEFAULT_QUERY_TIMEOUT_MS = 30_000`;
- `resolveQueryDeadlineMs(connectionConfig)` → the validated `query_timeout_ms`
-  override, else the default — so the default and the override precedence live in
-  exactly one place;
- `queryDeadlineExceededError(deadlineMs)` → a `KtxQueryError` with the canonical
-  message `query exceeded ${Math.round(deadlineMs / 1000)}s`.
-
-Each connector calls the resolver once (at construction; connectors already
-receive their connection config) and stores `this.deadlineMs`. **Enforcement is
-necessarily per-connector** — different engines cancel differently — but the
-*value* and the *error message* are shared, so the agent sees one consistent,
-actionable error regardless of driver.
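
The three exports listed above could look roughly like the following. This is a minimal sketch under stated assumptions: the `KtxQueryError` class here is a stand-in for the real one in `packages/cli/src/errors.ts`, and `ConnectionConfig` is a hypothetical one-field shape, not the actual connection config type.

```typescript
// Stand-in for the real KtxQueryError (its actual shape is not shown in this spec).
class KtxQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "KtxQueryError";
  }
}

const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Hypothetical minimal config shape; the real connection config has more fields.
interface ConnectionConfig {
  query_timeout_ms?: unknown;
}

/** Return the validated per-connection override, else the shared default. */
function resolveQueryDeadlineMs(connection: ConnectionConfig): number {
  const override = connection.query_timeout_ms;
  if (override === undefined) return DEFAULT_QUERY_TIMEOUT_MS;
  if (typeof override !== "number" || !Number.isInteger(override) || override <= 0) {
    throw new Error(
      `query_timeout_ms must be a positive integer (milliseconds), got ${String(override)}`,
    );
  }
  return override;
}

/** Build the canonical, driver-independent timeout error (message reads in seconds). */
function queryDeadlineExceededError(deadlineMs: number): KtxQueryError {
  return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`);
}

const defaultDeadline = resolveQueryDeadlineMs({});
const overriddenDeadline = resolveQueryDeadlineMs({ query_timeout_ms: 5_000 });
const timeoutMessage = queryDeadlineExceededError(defaultDeadline).message;
```

A connector would call `resolveQueryDeadlineMs` once at construction and keep the result, so the default and the override precedence stay in one place.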
-
-### Real cancellation, not client-side give-up
-
-Per intake requirement 5, the deadline must *stop the work*, not merely abandon
-the promise while the query keeps running (which on a pooled driver also risks
-returning a still-busy connection to the pool). So:
-
- **In-process (SQLite, and any future embedded driver):** run the query off the
-  main thread and enforce the deadline by **terminating the worker thread**. There
-  is no generic `Promise.race` outer wrapper — a `Promise.race` against a
-  synchronous in-thread `.all()` can never fire (the loop is blocked), and against
-  a pooled remote query it would poison the pool. Thread termination *is* the
-  cancellation.
- **Remote engines:** set the engine's **server-side statement timeout** so the
-  server itself aborts the query and frees the connection cleanly.
-
-### Logging routes through spec 15's pino path — no second logger
-
-The deadline cases are logged through the **existing** MCP tool-call logger
-(spec 15's `instrumentMcpServer`, `context/mcp/context-tools.ts:644–730`), not a
-new logging path threaded into the connector. Verified flow for a timeout:
-`executeReadOnly` throws `queryDeadlineExceededError` (a `KtxQueryError`) →
-`local-project-ports.ts` preserves it → `registerParsedTool` (:552) reports it
-(`reportException` skips `$exception` for `KtxExpectedError`) and returns an
-in-band `isError` result → `instrumentMcpServer` writes `tool.end` at **`error`**
-with `outcome:"error"`, `err.message = "query exceeded {N}s"`, and the **same
-`callId`** as the `tool.start`.
-
-This is the central observability win and it requires **no new MCP logging code**:
-spec 15 made a hang show up as a `tool.start` with *no* matching `tool.end`; this
-spec turns it into a **matched `tool.start` → `tool.end(error)` pair** whose
-`tool.end` names the deadline. The worker-termination (SQLite) and server-side
-abort (remote) are internal enforcement mechanisms; their single observable signal
-is that `tool.end`, so the connector does **not** get its own logger threaded
-through `KtxScanContext` — that would fork a second path for one capability. The
-"worker was actually reaped, not left spinning" guarantee is asserted by the
-worker's `exit` event in tests (Requirement 3), not by a log line.
-
-## Requirements
-
-### 1. Shared deadline contract, defined once
-
-A single new module (e.g. `packages/cli/src/context/connections/query-deadline.ts`)
-exports `DEFAULT_QUERY_TIMEOUT_MS` (30_000), `resolveQueryDeadlineMs(connectionConfig)`,
-and `queryDeadlineExceededError(deadlineMs)`. Every connector resolves its
-deadline through this resolver; no connector hardcodes its own default or
-duplicates the override-precedence logic.
-
-### 2. Shared per-connection config field; BigQuery's removed
-
-`query_timeout_ms` is added to the **shared** connection config schema (validated
-as an optional positive integer, milliseconds) so every driver accepts it. The
-BigQuery-specific `job_timeout_ms` config field and its dedicated reader
-(`bigQueryJobTimeoutMsFromConnection`) are removed; BigQuery sources its timeout
-from the shared field and applies it as `jobTimeoutMs`. A bad `query_timeout_ms`
-(zero, negative, non-integer) is a clear config validation error, consistent with
-how ktx validates `ktx.yaml`.
-
-### 3. SQLite executes off the main thread, terminated on deadline
-
-`executeReadOnly` on the SQLite connector MUST NOT block the MCP server event
-loop:
-
- Read-only validation and the row-limit wrapper (`assertReadOnlySql` +
-  `limitSqlForExecution`) run **on the main thread** before dispatch — invalid SQL
-  fails instantly without spawning a worker, and read-only enforcement stays at
-  the boundary (Requirement 7).
- The validated, row-limited SQL (and any params) is dispatched to a **worker
-  thread** that opens the database `{ readonly: true, fileMustExist: true }`, runs
-  the query, and posts back `{ headers, rows, totalRows }` (all values are
-  structured-cloneable — primitives, `Buffer`, `BigInt`).
- The main thread arms a timer for `this.deadlineMs`; on expiry it calls
-  `worker.terminate()` and rejects with `queryDeadlineExceededError`. On a normal
-  message it clears the timer and resolves. On a worker error (SQLite rejected the
-  SQL) it rejects with that error, message preserved. A provided
-  `ctx.signal` (`KtxScanContext.signal`, already on the contract) also terminates
-  the worker, for external cancellation.
- **One short-lived worker per call**, terminated on completion or deadline — not
-  a persistent worker or pool. Terminate-on-deadline destroys the worker, so a
-  pool would need respawn/job-tracking for no benefit: `executeReadOnly` is
-  low-frequency (LLM-issued, serial per agent turn) and worker spawn cost is
-  negligible against query latency. The other SQLite paths (introspect, sample,
-  stats, distinct-values, row-count) stay on the main thread — they are
-  ktx-authored, bounded, and not on the `executeReadOnly` contract.
- The event loop stays responsive throughout, so `tool.end` is always written and
-  concurrent requests on the same port are served.
-
-### 4. Remote engines set a real server-side statement timeout
-
-Each remote connector applies `this.deadlineMs` as its engine's server-side
-statement timeout, so the deadline stops server work rather than abandoning the
-promise:
-
-| Connector  | Mechanism                                              | Unit          |
-|------------|--------------------------------------------------------|---------------|
-| BigQuery   | `jobTimeoutMs` on the query job (replaces `job_timeout_ms`) | ms       |
-| Postgres   | `statement_timeout`                                    | ms            |
-| MySQL      | session `max_execution_time` (applies to read-only SELECT — the only kind on this path) | ms |
-| Snowflake  | `STATEMENT_TIMEOUT_IN_SECONDS` (ALTER SESSION)         | s (ceil)      |
-| ClickHouse | `max_execution_time` setting, with `request_timeout` aligned to the deadline so the HTTP client does not give up before the server aborts | s (ceil) |
-| SQL Server | `mssql` `requestTimeout` (TDS attention cancels server-side) | ms       |
-
-ClickHouse's existing hardcoded 30s `request_timeout` is brought under this
-contract (derived from the resolved deadline), not left as a parallel mechanism.
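
The unit column in the table matters because two engines take seconds while the shared field is in milliseconds. The following sketch shows only that conversion, assuming a hypothetical `toEngineTimeout` helper; it does not call any driver, and the setting names are copied from the table.

```typescript
type Engine = "bigquery" | "postgres" | "mysql" | "snowflake" | "clickhouse" | "sqlserver";

interface EngineTimeout {
  setting: string;
  value: number;
  unit: "ms" | "s";
}

/** Map the resolved deadline (ms) onto each engine's timeout setting and unit. */
function toEngineTimeout(engine: Engine, deadlineMs: number): EngineTimeout {
  // Round up for second-based engines so the server never aborts before the deadline.
  const seconds = Math.ceil(deadlineMs / 1000);
  switch (engine) {
    case "bigquery":
      return { setting: "jobTimeoutMs", value: deadlineMs, unit: "ms" };
    case "postgres":
      return { setting: "statement_timeout", value: deadlineMs, unit: "ms" };
    case "mysql":
      return { setting: "max_execution_time", value: deadlineMs, unit: "ms" };
    case "snowflake":
      return { setting: "STATEMENT_TIMEOUT_IN_SECONDS", value: seconds, unit: "s" };
    case "clickhouse":
      return { setting: "max_execution_time", value: seconds, unit: "s" };
    case "sqlserver":
      return { setting: "requestTimeout", value: deadlineMs, unit: "ms" };
  }
}

const snowflakeTimeout = toEngineTimeout("snowflake", 30_000);
const clickhouseTimeout = toEngineTimeout("clickhouse", 1_500);
const postgresTimeout = toEngineTimeout("postgres", 1_500);
```

With a 1,500 ms deadline the second-based engines get 2 s, so the server-side abort lands slightly after the nominal deadline rather than before it.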
-
-### 5. Timeout resolves as a `KtxQueryError` with the canonical message
-
-When a query exceeds the deadline, the call fails with a `KtxQueryError`
-(`query exceeded {N}s`): a finite outcome the agent can act on, never an unbounded
-hang. For SQLite the worker-termination path throws `queryDeadlineExceededError`
-directly. For remote engines, each connector recognizes **its own** engine's
-timeout signal (Postgres `57014`; MySQL errno `3024`; ClickHouse code `159`;
-SQL Server `ETIMEOUT`; Snowflake and BigQuery timeout errors) and re-wraps it as
-`queryDeadlineExceededError`, keeping the driver error as `cause`. Each connector
-owns its driver's signal — there is no central denylist of error codes to
-maintain.
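
For the Postgres case, the re-wrap described above might be sketched as follows. It assumes the driver error exposes the SQLSTATE as a string `code` property (as node-postgres errors do); `KtxQueryError` is again a stand-in class, and `mapPostgresError` is a hypothetical helper name.

```typescript
// Stand-in for the real KtxQueryError; `cause` keeps the original driver error.
class KtxQueryError extends Error {
  cause?: unknown;
  constructor(message: string, cause?: unknown) {
    super(message);
    this.name = "KtxQueryError";
    this.cause = cause;
  }
}

/** SQLSTATE 57014 is `query_canceled`, which Postgres raises on statement_timeout. */
function isPostgresQueryCanceled(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "57014"
  );
}

/** Re-wrap the engine's own timeout signal; pass every other error through untouched. */
function mapPostgresError(err: unknown, deadlineMs: number): unknown {
  if (!isPostgresQueryCanceled(err)) return err;
  return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`, err);
}

const driverTimeout = Object.assign(
  new Error("canceling statement due to statement timeout"),
  { code: "57014" },
);
const syntaxError = Object.assign(new Error("syntax error"), { code: "42601" });

const mappedTimeout = mapPostgresError(driverTimeout, 30_000);
const mappedOther = mapPostgresError(syntaxError, 30_000);
```

Each connector would own an equivalent check for its own signal, which is what keeps a central list of error codes unnecessary.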
-
-### 6. MCP surfacing and logging via the existing pino path
-
-The MCP `sql_execution` path already (a) maps any non-native driver error to
-`KtxQueryError` (`context/mcp/local-project-ports.ts:78–88`, guarded by
-`isNativeProgrammingFault`), (b) reports it through `reportException`, which skips
-`$exception` Error Tracking for `KtxExpectedError`, and (c) writes `tool.start`
-synchronously before the handler and `tool.end` in `instrumentMcpServer`
-(`context/mcp/context-tools.ts:644–730`). The deadline cases MUST surface through
-this path — the implementer verifies and tests them, but adds **no parallel
-classification or logging path**:
-
- **Query exceeds the deadline (any driver):** a `tool.end` at **`error`** with
-  `outcome:"error"` and `err.message = "query exceeded {N}s"`, carrying the same
-  `callId` as the `tool.start`. Classified as an expected error, so it is absent
-  from `$exception` Error Tracking. The reason `tool.end` was previously missing
-  is solely the blocked event loop (Requirement 3); once the loop stays free and
-  the deadline throws, the existing instrumentation logs the matched pair — closing
-  spec 15's "`tool.start` with no `tool.end` = hang" gap for this case.
- **Completed-but-slow query (under the deadline, over `KTX_MCP_SLOW_TOOL_MS`):**
-  unchanged from spec 15 — its `tool.end` is emitted at **`warn`**. The deadline
-  (default 30s) and the slow threshold (default 10s) are independent knobs; a query
-  between 10s and 30s completes with a slow `warn`, one past 30s is killed with the
-  `error` above.
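
The interaction between the two knobs reduces to a small decision on the `tool.end` level. The sketch below is illustrative only: `toolEndLevel` is a hypothetical function, not the code in `instrumentMcpServer`, and it takes the 10s slow threshold as a default parameter.

```typescript
type ToolOutcome = "ok" | "error";
type LogLevel = "info" | "warn" | "error";

/**
 * Pick the level for a `tool.end` line.
 * A failed call (including a deadline timeout) logs at `error`; a completed call
 * logs at `warn` only when it ran longer than the slow threshold.
 */
function toolEndLevel(
  outcome: ToolOutcome,
  durationMs: number,
  slowToolMs: number = 10_000,
): LogLevel {
  if (outcome === "error") return "error";
  return durationMs > slowToolMs ? "warn" : "info";
}

const fastQuery = toolEndLevel("ok", 2_000); // completed quickly
const slowQuery = toolEndLevel("ok", 15_000); // completed, between 10s and 30s
const killedQuery = toolEndLevel("error", 30_000); // terminated at the deadline
```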
-
-### 7. Read-only enforcement and `maxRows` unchanged
-
-`assertReadOnlySql` and the `maxRows` row cap (`limitSqlForExecution`) behave
-exactly as today. The deadline is additive. `maxRows` is not a substitute for it
-(it bounds returned rows, not scan work).
-
-### 8. Best-effort callers treat a deadline timeout as recoverable
-
-The non-interactive `executeReadOnly` call sites that are best-effort —
-relationship profiling, composite-candidate probes, relationship validation,
-historic-SQL probes — MUST treat a deadline `KtxQueryError` as "skip this
-probe / mark unprofiled" and continue, never as a source-fatal error. The
-implementer confirms each such site already swallows query errors into a
-graceful-skip and adds that handling where it does not, so the uniform deadline
-(Requirement 1, applied to all callers) cannot abort an ingest run. A skipped
-probe is logged at the skip site through that path's existing scan/ingest logger
-(`KtxScanContext.logger`, `warn`/`debug`), never silently dropped — these callers
-are off the MCP tool-call path, so their visibility comes from the logger they
-already use.
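
One way to express "skip this probe and continue" is a thin wrapper around each best-effort call. This is a sketch with hypothetical names (`runBestEffortProbe`, `ProbeResult`) and a stand-in `KtxQueryError`; it chooses to rethrow anything that is not a query error, which is an assumption rather than something the spec prescribes.

```typescript
// Stand-in for the real KtxQueryError.
class KtxQueryError extends Error {}

type ProbeResult<T> =
  | { status: "ok"; value: T }
  | { status: "skipped"; reason: string };

/**
 * Run one best-effort probe. A query error (including a deadline timeout) is
 * logged through the caller's logger and turned into a skip; other errors propagate.
 */
async function runBestEffortProbe<T>(
  probe: () => Promise<T>,
  warn: (message: string) => void,
): Promise<ProbeResult<T>> {
  try {
    return { status: "ok", value: await probe() };
  } catch (err) {
    if (err instanceof KtxQueryError) {
      warn(`probe skipped: ${err.message}`);
      return { status: "skipped", reason: err.message };
    }
    throw err;
  }
}

const warnings: string[] = [];
const timedOutProbe = runBestEffortProbe<number>(
  async () => {
    throw new KtxQueryError("query exceeded 30s");
  },
  (message) => {
    warnings.push(message);
  },
);
```

The ingest run continues with the remaining probes, and the warning keeps the skip visible in the scan log.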
-
-## Acceptance criteria
-
- A read query that exceeds the deadline returns a `KtxQueryError`
-  (`query exceeded {N}s`) within roughly the deadline; the MCP worker stays
-  responsive (a concurrent tool call on the same server completes while the slow
-  query is still pending) and writes a matching `tool.end` with a non-ok outcome.
- **Logging:** a timed-out `sql_execution` produces a `tool.start` and a matching
-  `tool.end` (same `callId`) at `error` with `outcome:"error"` and
-  `err.message = "query exceeded {N}s"` — no unmatched `tool.start` remains. The
-  timeout does not raise a `$exception` Error Tracking event (it is a
-  `KtxExpectedError`). A completed query slower than `KTX_MCP_SLOW_TOOL_MS` but
-  under the deadline still emits its `tool.end` at `warn`. No new logger is
-  introduced — the lines come from the existing `instrumentMcpServer`.
- **SQLite specifically:** executing a deliberately pathological query (an
-  expensive VIEW or an unindexed cross join) on a fixture does not block the event
-  loop, is terminated at the deadline, and the worker exits (the off-main-thread
-  executor is killed, not left spinning) so CPU returns to idle.
- **One server-side-timeout driver (Postgres):** the connector applies
-  `statement_timeout` equal to the resolved deadline, and a `57014` cancellation
-  is mapped to the canonical `KtxQueryError`.
- `resolveQueryDeadlineMs` returns 30_000 by default, honors a `query_timeout_ms`
-  override, and rejects an invalid value (zero / negative / non-integer).
- **No regression:** normal fast queries return identical results; read-only
-  rejection still works; `maxRows` still bounds returned rows.
- The shared `query_timeout_ms` field is accepted by every connector; BigQuery's
-  former `job_timeout_ms` key is gone and BigQuery's timeout is driven by the
-  shared field.
-
-## Non-goals
-
- **A row/byte/cost budget on returned data.** This spec bounds *time*, not result
-  size — `maxRows` already bounds rows, and BigQuery's `maximumBytesBilled` is a
-  separate, retained concern.
- **A global `KTX_QUERY_TIMEOUT_MS` or per-call user flag.** One opinionated
-  default plus a per-connection override; no per-call knob, no global knob.
- **A server watchdog that recycles the process on an unmatched `tool.start`.**
-  Spec 15 names this as a possible future mitigation; this spec prevents the hang
-  at the source, so the watchdog is out of scope here.
- **Moving SQLite introspection / sampling / stats off the main thread.** Only the
-  `executeReadOnly` (LLM-SQL) path needs worker isolation; the rest are bounded
-  ktx-authored queries.
- **Per-connection retry / backoff on timeout.** A timeout returns a clean error
-  for the agent to revise; ktx does not auto-retry.
- **A second logger threaded into the connector.** The deadline cases are logged
-  through spec 15's existing MCP tool-call logger; the connector gets no separate
-  pino instance and `KtxScanContext` gets no MCP-logger thread (see "Logging routes
-  through spec 15's pino path").
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the design.
-
- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`:
-  `DEFAULT_QUERY_TIMEOUT_MS`, `resolveQueryDeadlineMs`, `queryDeadlineExceededError`.
-  Error class is `KtxQueryError` (`packages/cli/src/errors.ts:25`).
- **Contract anchor** — `KtxScanConnector.executeReadOnly`
-  (`context/scan/types.ts:343`), `KtxReadOnlyQueryInput` (`types.ts:285`),
-  `KtxScanContext.signal` (`types.ts:176`, already present, currently unused on the
-  MCP path).
- **Config schema** — add `query_timeout_ms` to the shared connection config
-  (`context/project/config.ts`, `KtxProjectConnectionConfig` and its zod schema);
-  remove BigQuery's `job_timeout_ms` reader.
- **SQLite worker** — new `packages/cli/src/connectors/sqlite/read-query-worker.ts`
-  (constructed by path via `new URL('./read-query-worker.js', import.meta.url)`);
-  rework `connectors/sqlite/connector.ts` `executeReadOnly` (247–251) to validate
-  on the main thread then dispatch to the worker with a terminate-on-deadline
-  timer. Reuse `normalizeQueryRows` (`context/connections/query-executor.ts`) in
-  the worker. Register the worker as a dynamic entry in `knip.json` (it is
-  referenced by path, not import) and confirm the build copies it into `dist`.
- **Remote connectors** — apply the resolved deadline and recognize the engine's
-  timeout signal in each `executeReadOnly` / `query(...)`:
-  `connectors/bigquery/connector.ts` (~491–512, `jobTimeoutMs`),
-  `connectors/clickhouse/connector.ts` (~602/629–644, `max_execution_time` +
-  `request_timeout`), `connectors/snowflake/connector.ts` (~354–371/510–534,
-  `STATEMENT_TIMEOUT_IN_SECONDS`), `connectors/postgres/connector.ts` (~822–838,
-  `statement_timeout`), `connectors/mysql/connector.ts` (~774–793,
-  `max_execution_time`), `connectors/sqlserver/connector.ts` (~812–832,
-  `requestTimeout`).
- **MCP path + logging (verify only)** — `context/mcp/local-project-ports.ts:69–88`
-  (error mapping), the `sql_execution` registration (~915–943), and the logging in
-  `instrumentMcpServer` (`context/mcp/context-tools.ts:644–730`, which writes
-  `tool.start`/`tool.end` via the spec-15 pino logger `context/mcp/logger.ts`). No
-  new classification or logging code; confirm the timeout flows through as an
-  expected error producing a matching `tool.end(error)` with the canonical message.
- **Best-effort callers** — `context/scan/relationship-profiling.ts` (~227, 275),
-  `context/scan/relationship-composite-candidates.ts` (~365, 440),
-  `context/scan/relationship-validation.ts` (~259),
-  `context/ingest/historic-sql-probes/bigquery-runner.ts` (~97), and the
-  historic-sql clients: confirm a deadline `KtxQueryError` is swallowed into a
-  graceful skip.
- **Tests** — a SQLite fixture with a pathological query (tiny `query_timeout_ms`
-  as the test seam) asserting terminate-on-deadline, event-loop responsiveness
-  (a concurrent promise resolves while the query is pending), and worker exit; a
-  Postgres test asserting `statement_timeout` is set to the resolved deadline and
-  a `57014` error maps to `KtxQueryError`; resolver unit tests (default /
-  override / invalid); regression tests for normal results, read-only rejection,
-  and `maxRows`. Extend the MCP logging tests (alongside spec 15's, e.g.
-  `test/context/mcp/server.test.ts`) to assert a timed-out `sql_execution` yields a
-  matched `tool.start`/`tool.end(error)` pair carrying `query exceeded {N}s`.
- After implementing, rebuild and re-link so the playground picks it up:
-  `pnpm run build && pnpm run link:dev`.
-
-## Benchmark context (motivation, not a requirement)
-
-The Spider 2.0-Lite local set loads several warehouses into SQLite, some with
-expensive VIEWs over large fact tables. One example is `complex_oracle.profits` =
-`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`: 918,843 × 82,112
-rows, no composite index, and `promo_id` (whose index the optimizer picks) holds a
-single value in 95.5% of rows. LLM-authored profiling queries (MIN/MAX/COUNT over such a
-view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval
-shard for 10+ minutes; with one, the agent gets a fast error and can scope the
-query instead. Improving the benchmark is a side effect; the deadline is generic
-production hygiene for any agent that lets an LLM author SQL.
-
-## Implementation notes
-
-Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
-acceptance criteria are met; tests, type-check, dead-code, and build are green
-for the changed surface.
-
-### What was built, and where
-
- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`:
-  `DEFAULT_QUERY_TIMEOUT_MS = 30_000`, `resolveQueryDeadlineMs(connection)` (returns
-  the validated `query_timeout_ms` override else the default; throws on
-  zero/negative/non-integer), and `queryDeadlineExceededError(deadlineMs, options?)`
-  (a `KtxQueryError` reading `query exceeded ${round(ms/1000)}s`, carrying the
-  driver error as `cause`). Unit-tested in `test/context/connections/query-deadline.test.ts`.
- **Config field** — `query_timeout_ms` (optional positive integer, ms) added to
-  the **shared warehouse** schema. NOTE (spec drift): that schema lives in
-  `context/project/driver-schemas.ts` (`warehouseConnectionSchema`), not
-  `config.ts`. The warehouse schemas use `z.looseObject`, so the field had to be
-  declared explicitly to be *validated* (otherwise it would pass through
-  unvalidated). BigQuery's `job_timeout_ms` field and `bigQueryJobTimeoutMsFromConnection`
-  reader were removed; BigQuery now resolves the shared field. Every connector
-  resolves its deadline once at construction via `resolveQueryDeadlineMs`.
-
-### Deviation from the spec's SQLite mechanism (worker thread → child process)
-
-The spec mandated running SQLite read queries on a **worker thread** and enforcing
-the deadline by `worker.terminate()`. Testing showed this does not work:
-`Worker.terminate()` cannot interrupt a CPU-bound synchronous `better-sqlite3`
-scan. The native `sqlite3_step` loop never yields to V8, so the promise returned by
-`terminate()` never resolves (an 8s probe of the exact failing query shape confirmed
-the thread keeps spinning). better-sqlite3 v12 exposes no `interrupt`/progress-handler
-API, and `.iterate()` does not help because the failing query returns a single
-aggregate row that is produced only *after* the full scan.
-
-The implemented mechanism is therefore **`child_process.fork` + `SIGKILL`**
-(`packages/cli/src/connectors/sqlite/read-query-child.ts`, spawned from
-`connector.ts`). SIGKILL lets the OS reclaim the whole process — a probe confirmed
-the scan is interrupted in ~2 ms and CPU returns to idle. This satisfies *both*
-SQLite requirements better than a thread (event loop stays free **and** the query
-is genuinely cancellable). The child is self-contained (imports only
-`better-sqlite3` + node builtins); validation/row-limiting (`limitSqlForExecution`)
-and `normalizeQueryRows` stay on the main thread. One short-lived child per call,
-killed on completion, deadline, or `ctx.signal` abort. Node v24's native
-TS type-stripping lets the `.ts` child load under vitest; a `.js`-if-exists-else-`.ts`
-URL resolver picks the compiled child in `dist`. Registered as a dynamic entry in
-`knip.json`; `tsc` emits it to `dist` (verified, plus a dist-level end-to-end smoke).
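
The kill-on-deadline pattern can be illustrated without SQLite. The sketch below is not the code in `read-query-child.ts`: it uses `spawn` with an inline `-e` script instead of `fork` with IPC, and `runChildWithDeadline` and `DeadlineError` are hypothetical names. It only shows the shape of the mechanism: one short-lived child, a timer on the parent, and `SIGKILL` on expiry.

```typescript
import { spawn } from "node:child_process";

// Hypothetical error type standing in for the canonical deadline KtxQueryError.
class DeadlineError extends Error {}

/** Run an inline script in a short-lived child process; SIGKILL it at the deadline. */
function runChildWithDeadline(script: string, deadlineMs: number): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ["-e", script], {
      stdio: ["ignore", "pipe", "ignore"],
    });
    let stdout = "";
    let timedOut = false;
    child.stdout.on("data", (chunk: Buffer) => {
      stdout += chunk.toString("utf8");
    });

    // The parent's event loop stays free, so this timer fires even while the
    // child is stuck in a CPU-bound loop.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL");
    }, deadlineMs);

    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      if (timedOut) {
        reject(new DeadlineError(`query exceeded ${Math.round(deadlineMs / 1000)}s`));
      } else if (code === 0) {
        resolve(stdout.trim());
      } else {
        reject(new Error(`child exited with code ${String(code)}`));
      }
    });
  });
}

// A fast "query" completes normally; a runaway loop is killed at the deadline.
const fastRun = runChildWithDeadline("console.log(6 * 7)", 5_000);
const runawayRun = runChildWithDeadline("while (true) {}", 200);
```

Because the parent waits for the child's `close` event before rejecting, a rejected `DeadlineError` also means the process has actually exited, which matches the "killed, not left spinning" guarantee.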
-
-### Remote connectors (server-side timeouts + own-signal mapping)
-
-Each applies the resolved deadline server-side and re-wraps its own timeout signal
-as `queryDeadlineExceededError(deadlineMs, { cause })`:
-
-- **BigQuery** — `jobTimeoutMs` on the query job; maps a "Job timed out" / timeout-reason error.
-- **Postgres** — `statement_timeout` via pool `options` (`-c statement_timeout=<ms>`); maps `57014`.
-- **MySQL** — `SET SESSION max_execution_time = <ms>` before the read; maps errno `3024`.
-- **Snowflake** — `ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = <ceil(s)>` in the pooled connection; maps code `604` / "reached its … timeout".
-- **ClickHouse** — `max_execution_time` (ceil seconds) setting, with `request_timeout` set to `deadline + 5s` so the HTTP client outlasts the server abort (replaces the old hardcoded 30s); maps code `159`.
-- **SQL Server** — `requestTimeout` on the `mssql` pool config (TDS attention cancels server-side); maps `ETIMEOUT`.
-
-Each connector has a focused test asserting the timeout is applied and its signal
-maps to `KtxQueryError` (Postgres is the spec's required acceptance test).
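The per-connector mapping listed above can be sketched roughly as follows. This is an illustrative sketch, not the connector code: the timeout codes come from the list, while `ConnectorKind`, `DriverError`, `toDeadlineExceeded`, and `clickHouseTimeouts` are hypothetical names, and whether a given driver reports its code as a string or a number is an assumption.

```typescript
// Illustrative sketch only: codes are from the notes above, all names are hypothetical.
type ConnectorKind = "postgres" | "mysql" | "snowflake" | "clickhouse" | "sqlserver";

interface DriverError {
  code?: string | number;
  errno?: number;
}

interface DeadlineExceeded {
  kind: "query_deadline_exceeded";
  deadlineMs: number;
  cause: DriverError;
}

// One predicate per connector; message-based matching (BigQuery, Snowflake) is omitted.
const timeoutSignals: Record<ConnectorKind, (err: DriverError) => boolean> = {
  postgres: (err) => err.code === "57014",
  mysql: (err) => err.errno === 3024,
  snowflake: (err) => String(err.code) === "604",
  clickhouse: (err) => String(err.code) === "159",
  sqlserver: (err) => err.code === "ETIMEOUT",
};

// Re-wrap a driver timeout as one shared shape; any other error is left to the caller.
function toDeadlineExceeded(
  kind: ConnectorKind,
  err: DriverError,
  deadlineMs: number,
): DeadlineExceeded | null {
  return timeoutSignals[kind](err)
    ? { kind: "query_deadline_exceeded", deadlineMs, cause: err }
    : null;
}

// ClickHouse: server limit in whole seconds, HTTP client timeout 5s longer than the deadline.
function clickHouseTimeouts(deadlineMs: number): {
  maxExecutionTimeSec: number;
  requestTimeoutMs: number;
} {
  return {
    maxExecutionTimeSec: Math.ceil(deadlineMs / 1000),
    requestTimeoutMs: deadlineMs + 5000,
  };
}
```

Keeping the predicates in one table makes it easy to see at a glance which signal each connector treats as its own timeout.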
-
-### Best-effort callers (Requirement 8)
-
-Confirmed already graceful: relationship **profiling** (outer try/catch →
-`profile_failed` warning) and **composite-candidate** detection
-(`detectCompositeRelationships` → recoverable warning, returns `[]`). Historic-SQL
-**probes** flow through `runHistoricSqlReadinessProbe`, which catches *any* error
-into `{ ok: false }`. **Added** handling to relationship **validation**: a
-`KtxQueryError` on the per-candidate coverage probe now sends that one candidate to
-`review` (`validation_query_failed`, logged via `ctx.logger.warn`) instead of
-aborting the whole validation pass. `ingest-query-executor.ts` is a generic
-executor port whose callers own recoverability — left unchanged.
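The relationship-validation change described above (one failed probe sends one candidate to `review` instead of aborting the pass) might look like the sketch below. It is synchronous for brevity and every name except the `validation_query_failed` reason and the `review` outcome is hypothetical; the real code is asynchronous and checks for `KtxQueryError`.

```typescript
// Hypothetical, simplified sketch of per-candidate failure isolation.
type Verdict = "accepted" | "rejected" | "review";

interface CandidateOutcome {
  candidate: string;
  verdict: Verdict;
  reason?: string;
}

function validateCandidates(
  candidates: string[],
  probeCoverage: (candidate: string) => boolean, // may throw a query error
  isQueryError: (err: unknown) => boolean, // stands in for a KtxQueryError check
  warn: (message: string) => void, // stands in for ctx.logger.warn
): CandidateOutcome[] {
  const collected: CandidateOutcome[] = [];
  for (const candidate of candidates) {
    try {
      const covered = probeCoverage(candidate);
      collected.push({ candidate, verdict: covered ? "accepted" : "rejected" });
    } catch (err) {
      // Anything that is not a query error is unexpected and still aborts the pass.
      if (!isQueryError(err)) throw err;
      warn("validation query failed for " + candidate);
      collected.push({ candidate, verdict: "review", reason: "validation_query_failed" });
    }
  }
  return collected;
}
```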
-
-### MCP surfacing/logging
-
-No new MCP classification or logging code. The deadline `KtxQueryError` flows
-through the existing `local-project-ports` mapping → `reportException` (skips
-`$exception` for `KtxExpectedError`; existing test `telemetry/exception.test.ts`
-covers the skip for `KtxQueryError`) → `instrumentMcpServer`, which logs a matched
-`tool.start` → `tool.end(error, level 50)` pair carrying `err.message = "query
-exceeded {N}s"`. A test in `test/context/mcp/server.test.ts` asserts the matched
-pair, closing spec 15's "`tool.start` with no `tool.end` = hang" gap for this case.
-
-### Pre-existing branch issues encountered (not part of this feature)
-
-- `test/mcp-server-factory.test.ts` had a type error (an `as` cast to a shape with
-  a fake `context_tool` key, introduced by branch commit `2677b3ef`) that broke
-  `tsc -p tsconfig.test.json`. Fixed with a clean single cast to keep the
-  type-check gate green; behavior unchanged.
-- `test/skills/analytics-skill-content.test.ts` fails (2 cases: missing
-  `**Window functions**` heading and `Expose identity, not just the label` prose
-  in `src/skills/analytics/SKILL.md`). This is unrelated analytics-skill (spec
-  13/14) content drift committed earlier on the branch; **left untouched** — no
-  skill files were modified by this feature.
--- a/spider2-specs/specs/18-bigquery-cross-project-datasets.md
+++ b/spider2-specs/specs/18-bigquery-cross-project-datasets.md
@@ -1,418 +0,0 @@
-# BigQuery cross-project dataset introspection (foreign-hosted datasets, billed in own project)
-
-> Refined spec. Intake draft: `todo/18-bigquery-cross-project-datasets.md`.
->
-> **Scope: let the BigQuery connector introspect a dataset hosted in a *different*
-> project than the one it bills jobs to.** A `dataset_ids` entry may be written
-> fully-qualified as `project.dataset`; the connector introspects each entry in
-> *its own* project while every job still runs in `credentials.project_id`. A
-> bare `dataset` keeps today's single-project behavior unchanged.
->
-> Out of scope (confirmed during refinement): the interactive `ktx setup` wizard
-> is **not** expected to *discover* foreign datasets — you cannot enumerate
-> datasets in a project you don't own, and the wizard doesn't know which foreign
-> projects to probe. Users hand-write `project.dataset` entries (in `ktx.yaml` or
-> at the dataset prompt); the connector must accept and introspect them. See
-> *Non-goals*.
-
-## Problem
-
-**ktx**'s BigQuery connector derives a single `projectId` from
-`credentials.project_id` and uses it for **both** job billing **and** schema
-introspection. There is no way to introspect a dataset that lives in another
-project, even though *querying* such a dataset already works (a cross-project
-read in a `FROM` clause bills to the caller's project — that path is proven).
-
-Confirmed in the current connector (`packages/cli/src/connectors/bigquery/connector.ts`):
-
-- **`:294`** — `projectId` is read only from `credentials.project_id`. There is
-  no separate billing-vs-dataset project. `bigQueryConnectionConfigFromConfig`
-  (`:278`–`:301`) returns `datasetIds: string[]` — raw, unparsed.
-- **`datasetIds()` (`:163`)** — returns `dataset_ids` / `dataset_id` verbatim;
-  it never parses a `project.` prefix.
-- **`introspectDataset` (`:544`)** — calls `this.getClient().dataset(datasetId)`,
-  which resolves the dataset in the **client's (billing) project**, and labels
-  every table `catalog: this.resolved.projectId` (`:566`, `:574`) — including the
-  introspection-failure warning metadata (`:566`).
-- **`primaryKeys` (`:591`)** — builds `INFORMATION_SCHEMA` SQL as
-  `` `<projectId>.<datasetId>.INFORMATION_SCHEMA.TABLE_CONSTRAINTS` `` using the
-  **billing** project.
-- **`listTables` (`:453`)** — queries
-  `` `<projectId>`.`region-<region>`.INFORMATION_SCHEMA.TABLES `` against the
-  **billing** project and labels each row `catalog: this.resolved.projectId`.
-- **`testConnection` (`:344`)** — calls `client.dataset(datasetId).get()` in the
-  billing project.
-
-### Empirical confirmation (from the intake draft)
-
-With a service account in project `ktx-spider2-lite`:
-
-- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (it looks
-  in `projects/ktx-spider2-lite/datasets/austin_311`).
-- The cross-project form `dataset("austin_311", { projectId: "bigquery-public-data" })`
-  → **succeeds** (public metadata is readable by any authenticated principal).
-- There is **no config knob** to separate the introspection project from billing.
-
-### Why the table `catalog` label is load-bearing, not cosmetic
-
-The BigQuery dialect generates **three-part `catalog.db.name`** SQL
-(`connectors/bigquery/dialect.ts:38` → `formatDialectTableName(..., 'three-part')`;
-`context/connections/dialect-helpers.ts:27`–`32` emits `catalog.db.name`). The
-`catalog` stored on each scanned table is therefore the project that *every*
-later query targets — `sampleTable`, `sampleColumn`, `getColumnDistinctValues`,
-and ref-based `executeReadOnly` all format the ref through the dialect. If a
-foreign dataset's tables are labeled with the billing project, every one of those
-queries becomes `` `billing-project`.`austin_311`.`table` `` → `404`. So labeling
-the table `catalog` with the dataset's own project is a **correctness
-requirement**, and it is the single lever that makes sampling, dictionary value
-extraction, and `discover_data` all resolve once the snapshot is right.
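A minimal sketch of why the label matters, assuming a simplified three-part formatter. `TableRefSketch` and `formatThreePart` are illustrative stand-ins, not the real `formatDialectTableName`:

```typescript
// Illustrative stand-in for the dialect's three-part table naming.
interface TableRefSketch {
  catalog: string; // the project the generated SQL will target
  db: string; // the dataset
  name: string; // the table
}

function formatThreePart(ref: TableRefSketch): string {
  return [ref.catalog, ref.db, ref.name].map((part) => "`" + part + "`").join(".");
}

// Same dataset and table, two different catalog labels.
const labeledWithBilling = formatThreePart({
  catalog: "ktx-spider2-lite",
  db: "austin_311",
  name: "some_table",
});
const labeledWithOwner = formatThreePart({
  catalog: "bigquery-public-data",
  db: "austin_311",
  name: "some_table",
});
```

Only `labeledWithOwner` names a table that exists; `labeledWithBilling` points at a dataset the billing project does not host, which is the `404` described above. The table name `some_table` is a placeholder.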
-
-### One introspection path, no divergence
-
-`connectors/bigquery/live-database-introspection.ts` wraps
-`KtxBigQueryScanConnector.introspect` directly, so the ingest and live-database
-paths share **one** introspection implementation. The SDK already supports the
-fix: `client.dataset(id, { projectId })` — `@google-cloud/bigquery@8.3.1`'s
-`DatasetOptions` exposes `projectId?: string`.
-
-## Generic use case (independent of any benchmark)
-
-Analysts routinely introspect datasets they can **read but do not own and do not
-bill to**: Google's `bigquery-public-data`, a partner's shared project, an
-organization's central data project that a smaller team queries from its own
-billing project. To make those connectable in **ktx** — so `discover_data`, the
-semantic layer, dictionary sampling, and `sql_dialect_notes` all work — the
-connector must introspect a foreign-hosted dataset while billing jobs in the
-credentials' own project. This is a standard BigQuery deployment shape and is
-wholly independent of any benchmark.
-
-The class to design for is "the dataset's project ≠ the billing project," and it
-must generalize beyond one example: a single connection may reference datasets in
-**several** foreign projects at once (e.g. one slice mixing `bigquery-public-data`
-and `isb-cgc-bq`), and two different projects may host datasets with the **same
-name**. The design must keep those distinct.
-
-## Design decisions (resolved during refinement)
-
-These resolve ambiguities the intake draft left open. They constrain the
-implementer; the exact code is theirs.
-
-### Carry the project inline on each dataset entry — no separate knob
-
-The introspection project is expressed **per dataset**, inline, as the optional
-`project.` prefix on a `dataset_ids` / `dataset_id` entry. There is no new config
-field.
-
-> Rejected alternative: a separate connection-level `dataset_project` (or
-> `introspection_project`) field. It is a speculative runtime knob (against the
-> repo's opinionated-defaults rule) and, more decisively, it **cannot express the
-> requirement**: one connection must span *multiple* foreign projects, which a
-> single global field cannot represent. The inline form also derives scope from
-> the user's own declared input rather than adding a parallel setting.
-
-### Parse to canonical `{ project, dataset }` pairs at the config boundary
-
-Each entry is parsed **once**, in `bigQueryConnectionConfigFromConfig` /
-`datasetIds()`, into a canonical pair: the project (when no prefix is present,
-default it to `credentials.project_id`) and the bare dataset id. Every
-introspection-side call site reads the resolved pair; nothing downstream re-parses
-a `project.dataset` string.
-
-> Rejected alternative: keep `datasetIds: string[]` raw and split the prefix
-> lazily at each use site (`introspectDataset`, `primaryKeys`, `listTables`,
-> `testConnection`). That re-implements one rule in four places and is exactly the
-> drift trap the repo's single-source-of-truth rule warns about — a later fix
-> lands on one path and not another. Normalize at the boundary; carry the
-> canonical form downstream.
-
-The internal resolved-config type (`KtxBigQueryResolvedConnectionConfig.datasetIds`)
-changes shape from `string[]` to a structured pair list. That is an internal type;
-the connector internals and the connector test fixtures are the only consumers.
-
-### Parsing rule (at the boundary)
-
-- An entry contains **at most one `.`**.
-- With a dot: the segment **before** the dot is the project, validated by the
-  existing `normalizeBigQueryProjectId` charset
-  (`context/connections/bigquery-identifiers.ts`); the segment **after** is the
-  dataset id (validated as a normal identifier).
-- Without a dot: a bare dataset; the project defaults to `credentials.project_id`
-  (today's behavior).
-- **More than one `.`** (e.g. a stray `proj.ds.table`) is a clear config error
-  raised at resolution time, naming the connection — not a silent
-  mis-introspection.
-- Legacy domain-scoped project ids that contain `:` (e.g. `example.com:proj`) stay
-  **out of scope**, consistent with `normalizeBigQueryProjectId`'s current charset
-  (which already rejects `.` and `:` in a project id).
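The parsing rule above can be sketched as a single boundary function. This is a hedged illustration, not the shipped parser: `parseDatasetEntry` and `DatasetRefSketch` are hypothetical names, and both regular expressions are assumed charsets (the real validators live in `bigquery-identifiers.ts`).

```typescript
// Hypothetical sketch of the boundary parser; both charsets below are assumptions.
interface DatasetRefSketch {
  project: string;
  dataset: string;
}

const PROJECT_ID_SKETCH = /^[a-z][a-z0-9-]*$/; // assumed: no "." or ":" allowed
const DATASET_ID_SKETCH = /^[A-Za-z0-9_]+$/; // assumed dataset charset

function parseDatasetEntry(
  entry: string,
  defaultProject: string,
  connectionId: string,
): DatasetRefSketch {
  const fail = (why: string): never => {
    throw new Error("connections." + connectionId + '.dataset_ids entry "' + entry + '": ' + why);
  };
  const parts = entry.trim().split(".");
  if (parts.length > 2) fail("expected `dataset` or `project.dataset`");
  const [first = "", second] = parts;
  // One dot: project.dataset. No dot: bare dataset in the default (billing) project.
  const project = second === undefined ? defaultProject : first;
  const dataset = second === undefined ? first : second;
  if (!PROJECT_ID_SKETCH.test(project)) fail("invalid or empty project segment");
  if (!DATASET_ID_SKETCH.test(dataset)) fail("invalid or empty dataset segment");
  return { project, dataset };
}

const foreignRef = parseDatasetEntry("bigquery-public-data.austin_311", "ktx-spider2-lite", "bq");
const bareRef = parseDatasetEntry("my_dataset", "ktx-spider2-lite", "bq");
```

Here `foreignRef` carries the prefix as its project, while `bareRef` falls back to the default project. An entry such as `proj.ds.table`, `proj.`, or `.ds` throws at resolution time rather than failing later during a scan.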
-
-### Billing is never the dataset's project
-
-The BigQuery client is still constructed with `projectId = credentials.project_id`
-(`getClient()`, `:487`–`:495`), and `createQueryJob` always bills there. Only the
-*introspection* surfaces switch to the per-dataset project. Cross-project reads in
-a `FROM` clause already bill to the caller — unchanged and already proven.
-
-### Dataset identity downstream is `(catalog, db)`
-
-Scanned tables are keyed by `(catalog, db, name)` throughout
-(`context/scan/table-ref.ts`; `context/scan/warehouse-catalog.ts:107`). Because
-the table `catalog` now holds the dataset's own project, two foreign projects that
-each host an `austin_311` dataset remain distinct with no extra work — provided the
-snapshot's `scope` / `metadata` also preserve the project (Requirement 6).
-
-### Setup-wizard scope: accept, don't discover
-
-The connector's region-scoped `listTables` (`:453`) is consumed **only** by the
-`ktx setup` wizard's table-selection step (`setup-databases.ts`); the
-ingest / `discover_data` path reads persisted snapshot JSON via
-`WarehouseCatalogService.listTables`, not the connector method. The wizard is not
-expected to enumerate foreign datasets (you can't list a project you don't own).
-A `project.dataset` value hand-entered at the dataset prompt, or written into
-`ktx.yaml`, must be accepted, validated, and introspected. See *Non-goals* for the
-region caveat that follows from this.
-
-## Requirements
-
-### R1 — Accept and parse `project.dataset` at the config boundary
-
-`datasetIds()` / `bigQueryConnectionConfigFromConfig` resolve each
-`dataset_ids` and `dataset_id` entry into a canonical `{ project, dataset }` pair
-per the parsing rule above, defaulting `project` to `credentials.project_id` when
-unprefixed. A malformed entry (more than one `.`, an empty project or dataset
-segment, or a project/dataset that fails identifier validation) raises a clear
-error at resolution time that names the connection id.
-
-### R2 — Introspect each dataset in its own project
-
-`introspectDataset` resolves the dataset via the **dataset's** project —
-`client.dataset(datasetId, { projectId })` — for `getTables()` and each
-`tableRef.get()`. This requires extending the `KtxBigQueryClient.dataset` port to
-accept the project (e.g. `dataset(id, projectId)` / `dataset(id, { projectId })`)
-and forwarding it from `DefaultBigQueryClientFactory`.
-
-### R3 — Label table `catalog` with the dataset's project
-
-Every table produced by `introspectDataset` is labeled `catalog: <dataset's
-project>` (not the billing project), and the introspection-failure warning
-metadata (`object` / `catalog`) likewise reflects the dataset's project. This is
-what makes downstream sample/distinct-value/read queries resolve.
-
-### R4 — Primary-key discovery targets the dataset's project
-
-The `primaryKeys` `INFORMATION_SCHEMA.TABLE_CONSTRAINTS` /
-`KEY_COLUMN_USAGE` SQL is built against
-`` `<dataset's project>.<datasetId>.INFORMATION_SCHEMA…` ``. (Both INFORMATION_SCHEMA
-views are dataset-qualified and therefore region-independent.) The existing
-soft-fail-on-denied behavior (`tryConstraintQuery`, scan warning) is preserved.
-
-### R5 — `listTables` lists each dataset in its own project
-
-`listTables` returns rows labeled `catalog: <that dataset's project>` and queries
-each referenced project's region `INFORMATION_SCHEMA.TABLES`. Because a connection
-can now span projects, it queries per distinct project rather than assuming one.
-(This is the setup-wizard surface — see the cross-region caveat in *Non-goals*.)
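One way to picture R5 is to group the resolved pairs by project and build one region-scoped view name per group. The helper names below are hypothetical; only the `` `<project>`.`region-<region>`.INFORMATION_SCHEMA.TABLES `` shape is taken from the spec.

```typescript
// Hypothetical sketch: one region INFORMATION_SCHEMA.TABLES target per distinct project.
interface DatasetPair {
  project: string;
  dataset: string;
}

function groupDatasetsByProject(refs: DatasetPair[]): Map<string, string[]> {
  const byProject = new Map<string, string[]>();
  for (const ref of refs) {
    const datasets = byProject.get(ref.project) ?? [];
    datasets.push(ref.dataset);
    byProject.set(ref.project, datasets);
  }
  return byProject;
}

function regionTablesView(project: string, region: string): string {
  return "`" + project + "`.`region-" + region + "`.INFORMATION_SCHEMA.TABLES";
}

const grouped = groupDatasetsByProject([
  { project: "bigquery-public-data", dataset: "austin_311" },
  { project: "other-project", dataset: "y" },
  { project: "bigquery-public-data", dataset: "austin_crime" },
]);
```

With these inputs `grouped` holds two projects, so the wizard-facing listing would issue two queries, each filtered to that project's bare dataset names. The dataset `austin_crime` is used purely as an example value.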
-
-### R6 — Snapshot scope and metadata reflect multiple projects
-
-`introspect`'s returned snapshot keeps `metadata.project_id` = the **billing**
-project, but `scope.catalogs` becomes the **distinct set of dataset projects**
-actually introspected. `scope.datasets` / `metadata.datasets` must stay
-unambiguous when two projects share a dataset name (e.g. carry the qualified
-`project.dataset`, or otherwise preserve the project). The scoped table-name
-lookup that today passes `catalog: this.resolved.projectId` (`:359`) must pass
-each dataset's own project so `tableScope` / `enabled_tables` filtering still
-matches.
-
-### R7 — `testConnection` resolves foreign datasets
-
-`testConnection` validates each configured dataset via its own project
-(`client.dataset(datasetId, { projectId }).get()`), so a connection pointing only
-at foreign datasets reports success rather than a spurious `404`.
-
-### R8 — Billing unchanged; bare dataset is a strict no-op
-
-`createQueryJob` continues to bill in `credentials.project_id`. A connection whose
-`dataset_ids` are all bare (no `project.` prefix) behaves **exactly** as before:
-same resolved project, same `catalog` labels, same INFORMATION_SCHEMA targets, no
-behavioral change.
-
-### R9 — `getTableRowCount` honors the parsed entry
-
-`getTableRowCount`'s default-dataset handling (`:431`, today
-`this.resolved.datasetIds[0]`) resolves through the canonical pair so a foreign
-default dataset is introspected in its own project.
-
-### R10 — Docs reflect the qualified form
-
-Document that a BigQuery `dataset_ids` / `dataset_id` entry may be written
-`project.dataset` to introspect a dataset hosted in another project (billing stays
-in `credentials.project_id`). Update the BigQuery rows/examples in
-`docs-site/content/docs/configuration/ktx-yaml.mdx` and
-`docs-site/content/docs/integrations/primary-sources.mdx` (and the dataset-scope
-note in `docs-site/content/docs/cli-reference/ktx-setup.mdx`). Keep examples
-copy-pasteable and follow the `fumadocs-mdx-structure` skill.
-
-## Acceptance criteria
-
-1. **Foreign single-project introspection.** With credentials in project
-   `ktx-spider2-lite` and `dataset_ids: ['bigquery-public-data.austin_311']`,
-   `ktx ingest <conn>` introspects the tables, enriches, and samples values;
-   `discover_data` / `dictionary_search` return them. Tables are labeled
-   `catalog: 'bigquery-public-data'`.
-2. **Multi-project connection.** `dataset_ids: ['bigquery-public-data.x',
-   'other-project.y']` introspects **both**, each under its own project; the
-   snapshot's `scope.catalogs` contains both projects.
-3. **Cross-project query still bills locally.** `sql_execution` of a
-   fully-qualified `project.dataset.table` query runs and bills in
-   `credentials.project_id`.
-4. **Same dataset name, two projects.** `['proj-a.shared', 'proj-b.shared']`
-   yields two distinct dataset groups; tables do not collide.
-5. **No regression.** `dataset_ids: ['my_dataset']` (or singular `dataset_id`)
-   behaves exactly as before — resolved under `credentials.project_id`, same
-   `catalog` labels and INFORMATION_SCHEMA targets.
-6. **Malformed entry fails clearly.** `dataset_ids: ['proj.ds.table']` (or an
-   empty segment) raises a config error naming the connection, not a `404` at
-   scan time.
-7. **Test coverage** (extend `packages/cli/test/connectors/bigquery/connector.test.ts`,
-   using the existing fake `clientFactory` harness):
-   - the fake `dataset()` is called with the dataset's project for a prefixed
-     entry, and with the billing project for a bare entry;
-   - a prefixed entry yields tables with `catalog: '<dataset project>'`;
-   - a mixed two-project `dataset_ids` introspects both;
-   - `bigQueryConnectionConfigFromConfig` rejects a multi-dot / empty-segment
-     entry;
-   - the existing single-project tests still pass unchanged.
-
-## Non-goals
-
-- **Foreign-dataset discovery in the setup wizard.** The wizard does not
-  enumerate datasets in projects the credentials don't own; users supply
-  `project.dataset` explicitly (scope decision A).
-- **Cross-region `listTables`.** `listTables`' region-scoped
-  `region-<location>.INFORMATION_SCHEMA.TABLES` query uses the connection-level
-  `location`; a foreign dataset in a *different* region than the connection's
-  `location` will not be listed by that wizard-facing query. This does **not**
-  affect ingest/`discover_data`, whose introspection path
-  (`introspectDataset` REST metadata + dataset-qualified PK INFORMATION_SCHEMA) is
-  region-independent. A per-dataset region knob is a separate spec if ever needed.
-- **Domain-scoped legacy project ids** containing `:` (e.g. `example.com:proj`),
-  already unsupported by `normalizeBigQueryProjectId`.
-- **A separate billing/introspection config field** — explicitly rejected above.
-
-## Implementation orientation
-
-Pointers from exploration; line numbers may have drifted, and the implementer owns
-the design.
-
-- `packages/cli/src/connectors/bigquery/connector.ts`
-  - `datasetIds()` (`:163`) and `bigQueryConnectionConfigFromConfig` (`:278`) —
-    parse + canonicalize (R1); change `KtxBigQueryResolvedConnectionConfig.datasetIds`
-    shape.
-  - `KtxBigQueryClient.dataset` port (`:100`–`:110`) and
-    `DefaultBigQueryClientFactory.dataset` (`:130`–`:135`) — thread `projectId`
-    (R2). `getClient()` (`:487`) keeps the billing project (R8).
-  - `introspectDataset` (`:544`) — `dataset(id, { projectId })`, table `catalog`
-    + warning metadata (R2, R3).
-  - `primaryKeys` (`:591`) — dataset-qualified INFORMATION_SCHEMA (R4).
-  - `listTables` (`:453`) — per-project region INFORMATION_SCHEMA + row catalog
-    (R5).
-  - `introspect` (`:352`) — `scope.catalogs`, `scope.datasets`, scoped-name lookup
-    (`:359`) (R6).
-  - `testConnection` (`:339`) (R7); `getTableRowCount` (`:431`) (R9).
-- `packages/cli/src/connectors/bigquery/live-database-introspection.ts` — wraps
-  `introspect`; no separate change needed (it inherits the fix).
-- `packages/cli/src/context/connections/bigquery-identifiers.ts` —
-  `normalizeBigQueryProjectId` is the project-segment validator.
-- `packages/cli/src/context/connections/dialect-helpers.ts` /
-  `connectors/bigquery/dialect.ts` — three-part naming; no change, but this is
-  *why* R3 matters.
-- After implementing, rebuild and re-link so the playground picks it up:
-  `pnpm run build && pnpm run link:dev`. Run
-  `pnpm --filter @kaelio/ktx run type-check` and the connector test suite.
-
-## Benchmark context (motivation, not a requirement — do not encode benchmark specifics)
-
-Spider 2.0-Lite's **BigQuery slice (~205 questions)** cannot otherwise be served
-faithfully: every one of its ~74 logical databases groups datasets hosted in
-foreign public projects (`bigquery-public-data`, `isb-cgc-bq`,
-`data-to-insights`, …), never in a project we own. Query execution already works
-cross-project; ktx-only *discovery* is the sole blocker, and it is blocked exactly
-because the connector can't introspect a foreign-hosted dataset. Of 74 BQ
-databases only **one** spans more than one source project, so "let `dataset_ids`
-carry `project.dataset` and introspect each in its own project" covers the
-benchmark and the general case alike. None of these project names belong in the
-code — they are derived from the user's own `dataset_ids` input.
-
-## Implementation notes
-
-Implemented on branch `write-feature-spec-wiki`. The whole change is contained in
-the BigQuery connector, its identifier helpers, the connector test suite, and three
-docs pages.
-
-**Config boundary (R1).** Added `normalizeBigQueryDatasetId`
-(`packages/cli/src/context/connections/bigquery-identifiers.ts`, charset
-`[A-Za-z0-9_]`) next to the existing project/region validators. In
-`connectors/bigquery/connector.ts`, a single `parseBigQueryDatasetEntry(entry,
-defaultProject, connectionId)` parses one entry by splitting on `.`: zero dots →
-bare dataset in `defaultProject`; one dot → `project.dataset` (each segment
-validated; empty segment throws); two or more dots → throws. `resolveDatasetRefs`
-resolves `env:`/`file:` references first, trims/filters empties, then parses each.
-`bigQueryConnectionConfigFromConfig` calls it with the billing `project_id` as the
-default, so the canonical pair list is produced once at the boundary.
-`KtxBigQueryResolvedConnectionConfig.datasetIds` changed from `string[]` to the new
-`BigQueryDatasetRef[]` (`{ project, dataset }`). All errors name
-`connections.<id>.dataset_ids entry "<entry>"`.
-
-**Client port (R2).** `KtxBigQueryClient.dataset` now takes
-`(datasetId, projectId)`; `DefaultBigQueryClientFactory` forwards
-`client.dataset(datasetId, { projectId })` (`@google-cloud/bigquery` `DatasetOptions.projectId`).
-`getClient()` still constructs the client with the **billing** `project_id`, so
-`createQueryJob` bills locally regardless of the dataset's project (R8, acceptance 3).
-
-**Per-dataset introspection (R3–R7, R9).** Every introspection site reads the
-resolved pair: `introspectDataset(ref, …)` resolves `dataset(ref.dataset, ref.project)`
-and labels tables (and the introspection-failure warning, via `tryIntrospectObject`'s
-`catalog.db.object`) with `ref.project`; `primaryKeys(ref)` builds dataset-qualified
-`` `<project>.<dataset>.INFORMATION_SCHEMA…` `` SQL; `testConnection` validates each
-dataset under its own project; `getTableRowCount`'s default resolves through the first
-pair. `introspect` sets `scope.catalogs` to the distinct set of dataset projects and
-keeps `metadata.project_id` = billing. `scope.datasets` / `metadata.datasets` use a
-`qualifiedDatasetLabel` helper — bare in the billing project (so the single-project
-snapshot is byte-for-byte unchanged), `project.dataset` otherwise (so two projects with
-the same dataset name stay distinct, R6/acceptance 4).
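The labeling behavior described above could be sketched like this. The function bodies are a guess at the described behavior, not the actual `qualifiedDatasetLabel` helper, and `datasetLabel` / `distinctCatalogs` are hypothetical names.

```typescript
// Hypothetical sketch of the described snapshot labeling.
interface ResolvedDataset {
  project: string;
  dataset: string;
}

// Bare inside the billing project, `project.dataset` anywhere else.
function datasetLabel(ref: ResolvedDataset, billingProject: string): string {
  return ref.project === billingProject ? ref.dataset : ref.project + "." + ref.dataset;
}

// Distinct dataset projects, in first-seen order (a candidate for scope.catalogs).
function distinctCatalogs(refs: ResolvedDataset[]): string[] {
  const seen: string[] = [];
  for (const ref of refs) {
    if (seen.indexOf(ref.project) === -1) seen.push(ref.project);
  }
  return seen;
}

const mixedRefs: ResolvedDataset[] = [
  { project: "proj-a", dataset: "shared" },
  { project: "proj-b", dataset: "shared" },
  { project: "billing-proj", dataset: "analytics" },
];
const mixedLabels = mixedRefs.map((ref) => datasetLabel(ref, "billing-proj"));
```

In this sketch the two `shared` datasets get different labels while `analytics` stays bare, which is how a single-project snapshot can remain unchanged.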
-
-**`listTables` (R5).** Split into `listTables` (parse override entries, group by
-project) and `listTablesInProject(project, region, datasets?)`. With no override it
-lists the billing project's region (unchanged); with an override it runs one
-region-`INFORMATION_SCHEMA.TABLES` query per distinct project, filtered to that
-project's bare datasets, and labels rows with that project. The existing single-region
-test is unchanged (bare entries collapse to one billing-project query).
-
-**Docs (R10).** Added a "Cross-project datasets" subsection to
-`integrations/primary-sources.mdx` (qualified-entry example + the setup/region caveats),
-plus pointers from `configuration/ktx-yaml.mdx` and `cli-reference/ktx-setup.mdx`.
-
-**Tests.** Extended `test/connectors/bigquery/connector.test.ts`: parse-to-pairs and
-malformed-entry rejection (`proj.ds.table`, `proj.`, `.ds`); a foreign-only connection
-calls `dataset('austin_311', 'bigquery-public-data')`, labels tables
-`catalog: 'bigquery-public-data'`, builds the client with the billing project, and keeps
-`metadata.project_id` local; a mixed `['bigquery-public-data.austin_311', 'analytics']`
-connection introspects both under their own projects; and `['proj_a.shared',
-'proj_b.shared']` stays distinct. The internal `datasetIds`-shape assertion was updated
-to the pair list; all pre-existing behavioral tests pass unchanged.
-
-**Verification.** `pnpm --filter @kaelio/ktx run type-check`, the connector suite
-(18 tests), `test/setup-databases.test.ts` + `bigquery-identifiers.test.ts`,
-`pnpm run build`, `pnpm run dead-code` (Biome + Knip default + production),
-`pnpm run link:dev` (`ktx-dev` → 0.12.0), and `pre-commit` on the changed files all
-pass. Acceptance criteria 1–4 are exercised by unit tests with the fake client factory;
-criteria 5–6 by unit tests; criterion 3 (cross-project query bills locally) is also
-structurally guaranteed (single billing client) and asserted via the `createClient`
-project. End-to-end ingest against live `bigquery-public-data` was not run here (no live
-credentials in this worktree); the `link:dev` binary is ready for the playground agent to
-validate.
-
-**No deviations from the spec design.** The only judgment call: `scope.datasets`
-renders bare-in-billing / qualified-otherwise rather than always-qualified, chosen to
-satisfy both the no-regression requirement (R8/acceptance 5) and the disambiguation
-requirement (R6/acceptance 4) with one unambiguous, dot-delimited form.
--- a/spider2-specs/specs/19-durable-bounded-relationship-detection.md
+++ b/spider2-specs/specs/19-durable-bounded-relationship-detection.md
@@ -1,471 +0,0 @@
-# Durable, resumable, bounded relationship detection during ingest enrichment
-
-> Refined spec. Intake draft: `todo/19-durable-bounded-relationship-detection.md`.
->
-> **Scope: make the expensive part of ingest enrichment survive an interrupted
-> relationship stage.** Today the paid LLM descriptions + embeddings only become
-> durable and queryable after the slowest, most-killable, least-valuable stage
-> (relationship detection) also finishes. This spec moves the persistence boundary
-> to the cost boundary, makes stage resume work across runs, and bounds + observes
-> the one open-ended stage — the durability companion to spec 16 (bounded query
-> execution), which this spec composes with rather than replaces.
-
-## Problem
-
-Three compounding failure modes, all confirmed in the current code, share one root
-cause: **the three enrichment stages are treated as a single atomic unit for
-persistence, identity, and bounding, even though they differ radically in cost,
-durability value, runtime, and likelihood of being killed.**
-
-`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:472`) runs three stages
-in a fixed order through `runEnrichmentStage` (`:413`):
-
-| stage | order | cost | durability value | runtime on a large schema | likely to be killed |
-|-------|-------|------|------------------|---------------------------|---------------------|
-| `descriptions` (`:524`) | 1st | high — one paid LLM call per table | high | minutes | low |
-| `embeddings` (`:553`) | 2nd | medium | high | seconds–minutes | low |
-| `relationships` (`:587`) | 3rd | low — best-effort joins | low | **minutes, silent** | **high** |
-
-The slowest, most-killable, least-valuable stage runs **last**, and it gates the
-durability of the two expensive stages held in memory before it.
-
-### 1. Enrichment is lost if relationship detection is interrupted
-
-The queryable artifact agents search and execute against is the `_schema` manifest
-YAML (`semantic-layer/<connectionId>/_schema/*.yaml`). It is written **twice**:
-
-- bare (native column comments only) early, at `local-scan.ts:473`
-  (`writeLocalScanManifestShards`), before enrichment runs; and
-- rewritten **with AI descriptions + accepted joins** by
-  `writeLocalScanEnrichmentArtifacts` (`local-enrichment-artifacts.ts:310`), called
-  from `local-scan.ts:510` **after** `runLocalScanEnrichment` returns — i.e. after
-  all three stages.
-
-So the descriptions and embeddings reach the queryable layer only via that single
-terminal write. If the process is killed, crashes, or times out **during** the
-`relationships` stage, `runLocalScanEnrichment` never returns, the terminal write
-never runs, and the in-memory descriptions + embeddings are discarded — the
-`_schema` retains only the bare native comments from the `:473` write.
-
-Empirically (intake draft): ingesting a 95-table BigQuery dataset produced full
-descriptions + embeddings (progress reached "Building embeddings 17/17"), then the
-relationship stage ran silently past a supervising deadline and was killed; the
-persisted `_schema` had **0** AI descriptions. The most expensive work is the most
-likely to be thrown away.
-
-> A stage-state store (below) does save each completed stage's output to an
-> internal SQLite cache as the stage finishes — so the descriptions are not lost to
-> the *resume cache*. They are simply never **promoted** to the queryable `_schema`
-> until the terminal write. The data survives somewhere the agent cannot query, and
-> (per failure mode 2) cannot be reused on the next run either.
-
-### 2. Re-running does not resume — it re-spends
-
-`runEnrichmentStage` resolves a completed stage with
-`findCompletedStage({ runId, stage, inputHash })` (`local-enrichment.ts:427`), and
-the store keys on **`runId`**: `SqliteLocalScanEnrichmentStateStore` declares
-`PRIMARY KEY (run_id, stage)` and filters lookups by `run_id`
-(`sqlite-local-enrichment-state-store.ts:83,91–115`). `runId` is minted fresh per
-ingest invocation (`record.runId`). The cache therefore only resolves *within* one
-run; re-running an interrupted ingest gets a new `runId`, misses every cached
-stage, and **recomputes descriptions + embeddings from scratch** — re-paying for
-LLM work that already succeeded.
-
-The store already computes and persists `inputHash` next to `runId` —
-a stable `sha256` of `{ snapshot, mode, detectRelationships, providerIdentity,
-relationshipSettings }` (`enrichment-state.ts:78`). The correct content key is
-already on the row; the lookup just uses the volatile column. This is a keying
-defect, not a missing capability.
-
-### 3. Relationship detection is unobservable and unbounded
-
-`discoverKtxRelationships` (`context/scan/relationship-discovery.ts:218`) profiles a
-row sample of **every enabled table** (`profileKtxRelationshipSchema`,
-`relationship-profiling.ts:320` — one sampled query per table at
-`profileConcurrency`, default 4), validates candidate joins
-(`relationship-validation.ts:237` — one coverage query per candidate), and detects
-composite keys (`relationship-composite-candidates.ts:515` — per-table plus
-cross-table queries). None of the controls the rest of the scan pipeline relies on
-were ever wired into this stack:
-
- **No progress.** `discoverKtxRelationships` does not accept a progress port; the
-  caller can only emit start/end around it (`local-enrichment.ts:600,611` —
-  `update(0, 'Detecting relationships')` … `update(1, 'found N')`). Minutes of
-  silence between.
- **No honored cancellation.** `KtxScanContext.signal` exists on the contract
-  (`types.ts`) but **no sub-stage reads it**.
- **No time budget.** Validation has a *count* budget (`validationBudget`, default
-  `min(2 × tableCount, 1000)`); profiling and composite detection have none. On a
-  schema with hundreds–thousands of tables, profiling is O(tables) silent queries
-  with no internal stop condition.
-
-A supervisor watching for liveness cannot tell a slow-but-working profile from a
-true hang, and nothing inside the stage will voluntarily stop — so on a very large
-schema it runs far past any reasonable deadline and is killed (which, via failure
-mode 1, takes the descriptions with it).
-
-## Generic use case (independent of any benchmark)
-
-Any context layer that enriches a real warehouse with paid LLM work must make that
-work durable the instant it is produced, resume it across process restarts without
-re-paying, and bound the open-ended profiling stage so a large catalog cannot hang
-ingest indefinitely. A data team ingesting a 500-table production warehouse over a
-flaky connection, a rate-limited LLM budget, or a CI step with a wall-clock limit
-hits all three failure modes regardless of any benchmark. This is general
-durability and cost hygiene for the ingest pipeline; the benchmark only made it
-acute at scale.
-
-## Design decisions (resolved during refinement)
-
-These resolve ambiguities the intake draft left open. They constrain the
-implementer; the exact code is theirs (requirement-level, per the specs README).
-
-### D1 — Checkpoint queryable artifacts at the cost boundary, before relationships
-
-As soon as the last non-relationship stage completes — `embeddings` when an
-embedding provider is configured, otherwise `descriptions` — persist the
-descriptions + embeddings into the **queryable** `_schema` manifest (and the raw
-`descriptions.json` / `embeddings.json` enrichment artifacts), **before** the
-`relationships` stage runs. The relationship stage then writes its joins on top: the
-manifest builder already re-reads and preserves existing descriptions and
-manual/inferred joins on rewrite (`loadExistingManifestState`,
-`local-enrichment-artifacts.ts:196`), so the second write is additive, not
-destructive.
-
-Net invariant: **the descriptions + embeddings are always durable and queryable the
-moment they are computed**, even if relationship detection then fails, is
-interrupted, is budget-truncated, or is skipped. A failed/partial/skipped
-relationship stage degrades to "no joins" or "partial joins" — **never** to "no
-descriptions." This inverts the current terminal-write ordering, under which a
-relationship-stage failure discards the descriptions.
-
-The bare `:473` manifest write stays — it is the queryable schema for the
-no-providers / enrichment-disabled path. The checkpoint is an additional write that
-runs only when enrichment produced descriptions.
-
-> Orientation (the implementer owns the seam): the lowest-coupling shape is a
-> checkpoint hook — `runLocalScanEnrichment` invokes a caller-supplied callback once
-> the last non-relationship stage completes, and `local-scan.ts` supplies a callback
-> that calls the existing `writeLocalScanEnrichmentArtifacts` for the
-> descriptions + embeddings + manifest only (no generated joins yet). The final
-> write after the relationship stage proceeds as today. Relationship-specific
-> artifacts (`relationships.json`, `relationship-profile.json`,
-> `relationship-diagnostics.json`) are written by the final/relationship write, not
-> the checkpoint, so the checkpoint never emits misleading empty relationship
-> diagnostics.
->
-> Rejected alternative: move all artifact writing inside `runLocalScanEnrichment`
-> (inject the file store / project). That couples the enrichment module to
-> persistence for no gain — the writer already lives in `local-scan.ts` and the
-> checkpoint needs only a one-line hook, not a relocation.
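A minimal sketch of the checkpoint-then-final write ordering that D1 describes, assuming a simplified in-memory manifest. The `Manifest` shape and the `writeCheckpoint`, `writeFinal`, and `runEnrichment` helpers are hypothetical stand-ins for illustration, not ktx's actual API:

```typescript
// Hypothetical, simplified manifest: one AI description per table plus joins.
type Manifest = {
  descriptions: Record<string, string>;
  joins: string[];
};

// Checkpoint write: promote descriptions to the queryable manifest, no joins yet.
function writeCheckpoint(manifest: Manifest, descriptions: Record<string, string>): void {
  manifest.descriptions = { ...manifest.descriptions, ...descriptions };
}

// Final write: add joins on top; existing descriptions are preserved (additive).
function writeFinal(manifest: Manifest, joins: string[]): void {
  manifest.joins = [...manifest.joins, ...joins];
}

// Orchestration: the checkpoint callback fires before the relationship stage runs.
function runEnrichment(
  describe: () => Record<string, string>,
  detectRelationships: () => string[],
  onCheckpoint: (descriptions: Record<string, string>) => void,
): string[] {
  const descriptions = describe();
  onCheckpoint(descriptions); // durable before any relationship work starts
  return detectRelationships(); // may throw or be interrupted
}

const manifest: Manifest = { descriptions: {}, joins: [] };
try {
  const joins = runEnrichment(
    () => ({ orders: "One row per customer order." }),
    () => {
      throw new Error("relationship stage interrupted");
    },
    (descriptions) => writeCheckpoint(manifest, descriptions),
  );
  writeFinal(manifest, joins);
} catch {
  // The failure degrades to "no joins"; the checkpointed descriptions remain.
}
```

In this sketch the simulated relationship failure leaves `manifest.descriptions` populated and `manifest.joins` empty, which is the "no joins, never no descriptions" degradation the decision asks for.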
-
-### D2 — Resume by content identity, not by `runId`
-
-Re-key completed-stage resolution on **`(connectionId, stage, inputHash)`**,
-independent of `runId`, so a re-run with an unchanged schema and config resumes the
-finished `descriptions` / `embeddings` stages from cache and re-runs only what
-actually failed. `inputHash` is already the content fingerprint; `connectionId`
-scopes it to the right source. When several rows share a content identity (one per
-prior run), the most recent `updatedAt` wins.
-
-`runId` stays on the stored row for diagnostics and for `listRunStages`, but leaves
-the uniqueness/lookup key.
-
-The state store is a **disposable local resume cache** (`.ktx` local state,
-regenerable from a fresh ingest). Re-key it with **no migration bridge** — recreate
-the table if its on-disk shape differs from the new `(connection_id, stage,
-input_hash)` key, consistent with ktx's no-backward-compatibility policy. Losing the
-old cache only means one ingest cannot resume; it never corrupts a queryable
-artifact.
-
-> Rejected alternative: include `syncId` or `mode` in the key. `mode` and the rest
-> are already folded into `inputHash`; adding them again would only narrow the key
-> and re-break cross-run resume when an incidental field differs.
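The re-keying in D2 can be sketched with an in-memory store, assuming a simplified row shape. `InMemoryStageStore` and the `StageRow` fields are hypothetical; the real store is SQLite-backed and its exact columns may differ:

```typescript
// Hypothetical row shape for the resume cache; field names are illustrative.
type StageRow = {
  runId: string; // kept for diagnostics only, not part of the lookup
  connectionId: string;
  stage: string;
  inputHash: string;
  updatedAt: number;
  output: unknown;
};

type StageKey = { connectionId: string; stage: string; inputHash: string };

class InMemoryStageStore {
  private rows: StageRow[] = [];

  saveCompletedStage(row: StageRow): void {
    this.rows.push(row);
  }

  // Resolve by content identity; the most recent updatedAt wins.
  findCompletedStage(key: StageKey): StageRow | undefined {
    return this.rows
      .filter(
        (r) =>
          r.connectionId === key.connectionId &&
          r.stage === key.stage &&
          r.inputHash === key.inputHash,
      )
      .sort((a, b) => b.updatedAt - a.updatedAt)[0];
  }
}

const store = new InMemoryStageStore();
store.saveCompletedStage({
  runId: "run-1",
  connectionId: "warehouse",
  stage: "descriptions",
  inputHash: "abc",
  updatedAt: 1,
  output: { orders: "One row per customer order." },
});

// A re-run mints a new runId, but the lookup never consults it.
const resumed = store.findCompletedStage({ connectionId: "warehouse", stage: "descriptions", inputHash: "abc" });
// A changed schema or config yields a different inputHash and misses the cache.
const changed = store.findCompletedStage({ connectionId: "warehouse", stage: "descriptions", inputHash: "def" });
```

Here `resumed` is the row saved by `run-1` even though the caller supplied no `runId`, while `changed` is `undefined` because the content identity differs.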
-
-### D3 — Make the relationship stage observable and bounded
-
-Thread three controls that the rest of the pipeline already supports through
-`discoverKtxRelationships` and into profiling, validation, and composite detection:
-
- **Progress** through the existing progress port (the relationship phase is
-  already `progress?.startPhase(0.25)` at `local-enrichment.ts:586`): emit per-unit
-  liveness — "Profiling table K/N", "Validating candidate K/M", and the equivalent
-  for composite probing — so a supervisor can distinguish slow-but-working from
-  hung.
- **A flat wall-clock budget** for the whole relationship stage: a new
-  `scan.relationships.detectionBudgetMs`, a positive integer of milliseconds,
-  project-level, validated like the other `scan.relationships` fields, **default
-  600_000 (10 min), enforced by default.** Checked at unit boundaries (before each
-  table profile, each candidate validation, each composite probe). It sits **above**
-  spec 16's per-query deadline (default 30s): each individual query is already
-  bounded; this bounds the *sum* of them.
- **Honored cancellation:** where `KtxScanContext.signal` is available, the same
-  unit-boundary check honors it, so external cancellation stops the stage too.
-
-On budget exhaustion or abort: stop scheduling new work, let in-flight queries
-finish (each already bounded by spec 16), finalize with the relationships found so
-far, and return a **partial** result — never an unbounded hang and never an
-exception that would lose the checkpointed descriptions.
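A minimal sketch of a unit-boundary budget check with per-unit progress, assuming an injectable clock so the behavior can be exercised without real time passing. `createBudget` and `profileWithBudget` are hypothetical names, and the sketch is sequential for clarity, whereas the real stage runs concurrently:

```typescript
type StopReason = "budget_exhausted" | "aborted";

// Hypothetical sticky budget: once it trips, it stays tripped.
function createBudget(
  budgetMs: number,
  now: () => number,
  isAborted: () => boolean = () => false,
) {
  const deadline = now() + budgetMs;
  let reason: StopReason | null = null;
  return {
    // Returns true while new work may still be scheduled.
    check(): boolean {
      if (reason === null) {
        if (isAborted()) reason = "aborted";
        else if (now() >= deadline) reason = "budget_exhausted";
      }
      return reason === null;
    },
    stopReason(): StopReason | null {
      return reason;
    },
  };
}

// Process units in order, checking the budget before each one.
function profileWithBudget(
  tables: string[],
  profile: (table: string) => string,
  budget: ReturnType<typeof createBudget>,
  onProgress: (message: string) => void,
): { results: string[]; partial: { reason: StopReason } | null } {
  const results: string[] = [];
  for (let i = 0; i < tables.length; i++) {
    if (!budget.check()) {
      // Stop scheduling new work and return what was found so far.
      return { results, partial: { reason: budget.stopReason()! } };
    }
    onProgress(`Profiling table ${i + 1}/${tables.length}`);
    results.push(profile(tables[i]));
  }
  return { results, partial: null };
}

// Fake clock: each profiled table "costs" 400 ms against a 1000 ms budget.
let clock = 0;
const messages: string[] = [];
const outcome = profileWithBudget(
  ["a", "b", "c", "d"],
  (table) => {
    clock += 400;
    return `profile:${table}`;
  },
  createBudget(1000, () => clock),
  (message) => messages.push(message),
);
```

With this fake clock, three tables are profiled before the deadline passes, so `outcome.partial` carries the reason `budget_exhausted` and the fourth table is never started. Because the check runs only before a unit, a deadline that elapses after the last unit does not mark a fully processed stage as partial.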
-
-> Rejected alternative — per-table-scaled budget (N seconds × table count). It is a
-> second formula to reason about and "more tables → more budget" partly re-opens the
-> unbounded door this requirement closes. One flat, generous, project-level number
-> matches how the other `scan.relationships` knobs are shaped and is enough for a
-> best-effort stage whose partial output is durable and improvable (D4).
->
-> Rejected alternative — a global `KTX_RELATIONSHIP_BUDGET_MS` env knob or a
-> per-call override. One opinionated project-level default with a config override is
-> the canonical ktx shape; no second runtime path.
-
-### D4 — A budget-truncated partial is a successful, cached, completed stage
-
-A graceful budget stop is **not** a failure. The relationship stage saves its
-partial result like any completed stage (so a plain re-run resumes it for free, no
-re-querying) and marks it `partial` with a reason in the relationship diagnostics
-plus a recoverable scan warning. Because `detectionBudgetMs` is part of
-`relationshipSettings`, which is hashed into `inputHash`, **raising the budget
-changes the content identity and triggers a fresh, fuller run**. That is the only
-"try harder" mechanism, with no extra flag or runtime path.
-
-Distinguish the two stop kinds:
-
- **Process killed mid-stage** (crash / SIGKILL / supervisor): nothing is saved as
-  completed, so the next run recomputes the relationship stage (after resuming
-  descriptions/embeddings from cache via D2). This is the primary durability path.
- **Graceful budget/abort stop**: a partial *is* saved as completed-partial and
-  resumed cheaply on re-run, unless the budget is raised.
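To illustrate why raising the budget forces a fresh run while a plain re-run hits the cache, here is a sketch of a content identity computed over the hash inputs. It assumes a canonical JSON string as a stand-in for the real `sha256`, and `HashInputs` and `contentIdentity` are hypothetical names covering only a subset of the real fields:

```typescript
// Hypothetical hash inputs, mirroring some of the fields named in the spec.
type HashInputs = {
  snapshot: string;
  mode: string;
  relationshipSettings: { detectionBudgetMs: number };
};

// Stand-in for the real sha256: a canonical JSON string with sorted keys.
function contentIdentity(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => contentIdentity(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${contentIdentity(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

const base: HashInputs = {
  snapshot: "schema-v1",
  mode: "full",
  relationshipSettings: { detectionBudgetMs: 600_000 },
};

// Same content, different key order: the identity must not change.
const sameConfigReordered = contentIdentity({
  relationshipSettings: { detectionBudgetMs: 600_000 },
  mode: "full",
  snapshot: "schema-v1",
});

// Raised budget: the identity must change.
const raisedBudget = contentIdentity({
  ...base,
  relationshipSettings: { detectionBudgetMs: 1_800_000 },
});

const resumesPartial = contentIdentity(base) === sameConfigReordered; // cache hit
const forcesFreshRun = contentIdentity(base) !== raisedBudget; // cache miss
```

Both flags evaluate to `true` in this sketch: an unchanged configuration resolves to the same identity regardless of key order, and changing only `detectionBudgetMs` produces a different one.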
-
-## Requirements
-
-### 1. Checkpoint descriptions + embeddings before relationship detection
-
-The descriptions and embeddings MUST be persisted into the durable, queryable
-`_schema` manifest (and the raw enrichment artifacts) as soon as the last
-non-relationship stage completes, before the `relationships` stage runs.
-Relationship detection appends/merges its joins on completion. The expensive LLM +
-embedding enrichment MUST be queryable even if the relationship stage subsequently
-fails, is interrupted, is budget-truncated, or is skipped. A failed/partial/skipped
-relationship stage MUST degrade to "no/partial joins," never to "no descriptions."
-
-### 2. Stage resume resolves by content identity across runs
-
-Completed-stage resolution MUST key on `(connectionId, stage, inputHash)`,
-independent of `runId`, so re-running an interrupted ingest resumes the finished
-`descriptions` / `embeddings` stages from cache and re-runs only what failed.
-Re-running after an interruption MUST NOT re-issue LLM description or embedding
-calls for stages that already completed. The resume cache MAY be recreated without a
-migration bridge if its schema changes (it is disposable local state).
-
-### 3. Relationship detection emits progress and honors a wall-clock budget
-
-The relationship stage MUST emit per-unit progress through the existing progress
-port (at minimum per-table during profiling and per-candidate during validation) so
-liveness is observable. It MUST enforce a flat wall-clock budget
-(`scan.relationships.detectionBudgetMs`, default 600_000 ms, project-level,
-overridable, validated as a positive integer) checked at unit boundaries and layered
-above spec 16's per-query deadline, and MUST honor `KtxScanContext.signal` where
-available. On budget exhaustion or abort it MUST stop scheduling new work, finalize
-with the relationships found so far, and return a partial result rather than running
-unboundedly or throwing.
-
-### 4. A budget-truncated relationship result is durable and marked partial
-
-A graceful budget/abort stop MUST persist the partial relationship result as a
-completed stage (so a plain re-run resumes it without re-querying) and MUST mark it
-`partial` — in the relationship diagnostics artifact and as a recoverable scan
-warning — so downstream consumers can see the joins are incomplete. Raising
-`detectionBudgetMs` (which changes `inputHash`) MUST cause a fresh, fuller
-relationship run; no separate flag is introduced for "redo." A process killed
-mid-stage MUST NOT leave a completed record (so it recomputes on re-run).
-
-### 5. No regression for small or uninterrupted ingests
-
-A small or single-run ingest that is never interrupted MUST produce the same
-artifacts and the same relationship output as today. The checkpoint write MUST be
-idempotent with the final write (descriptions survive the join rewrite); the budget
-default MUST be generous enough that normal and large-but-tractable schemas complete
-relationship detection fully, hitting the budget only on pathological scale.
-
-## Acceptance criteria
-
- **Durability across interruption:** interrupting an ingest **during** relationship
-  detection still leaves a queryable semantic layer carrying the table/column
-  descriptions + embeddings that were generated (verified: re-open the connection;
-  AI descriptions are present in `_schema`, not just native comments).
- **Resume does not re-spend:** re-running an interrupted ingest does **not**
-  regenerate descriptions/embeddings whose stage already completed (verified: no LLM
-  description calls and no embedding calls for the cached tables; only the failed
-  stage re-runs). Resolution is by `(connectionId, stage, inputHash)`, so the resume
-  survives a fresh `runId`.
- **Observable + bounded relationships:** a connection with hundreds of tables emits
-  relationship-stage progress (per-table profiling, per-candidate validation) and
-  completes within `detectionBudgetMs`; when the budget is hit, the stage stops
-  gracefully and persists the partial relationships found so far — without
-  discarding enrichment — marked `partial` in diagnostics and via a recoverable
-  warning.
- **Partial is cached and improvable:** re-running with an unchanged budget resumes
-  the partial relationship result from cache (no re-querying); raising
-  `detectionBudgetMs` triggers a fresh, fuller relationship run.
- **Budget validation:** `detectionBudgetMs` defaults to 600_000, honors a project
-  override, and rejects an invalid value (zero / negative / non-integer) as a clear
-  `ktx.yaml` config error.
- **No regression:** small/single-run ingests behave exactly as before — identical
-  artifacts and relationship output when nothing is interrupted; the checkpoint +
-  final writes leave descriptions intact alongside the generated joins.
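The budget-validation criterion can be sketched as a small resolver. This assumes plain TypeScript rather than the project's zod schema, and `resolveDetectionBudgetMs`, `isRejected`, and the error wording are hypothetical:

```typescript
const DEFAULT_DETECTION_BUDGET_MS = 600_000; // 10 minutes

// Hypothetical resolver: returns the default when unset, rejects invalid values.
function resolveDetectionBudgetMs(raw: unknown): number {
  if (raw === undefined) return DEFAULT_DETECTION_BUDGET_MS;
  if (typeof raw !== "number" || !Number.isInteger(raw) || raw <= 0) {
    throw new Error(
      `ktx.yaml: scan.relationships.detectionBudgetMs must be a positive integer (milliseconds), got ${JSON.stringify(raw)}`,
    );
  }
  return raw;
}

// Helper for demonstration: true when the value is rejected.
function isRejected(raw: unknown): boolean {
  try {
    resolveDetectionBudgetMs(raw);
    return false;
  } catch {
    return true;
  }
}
```

For example, `resolveDetectionBudgetMs(undefined)` yields the 600_000 default, a project override such as `120_000` is returned unchanged, and zero, negative, fractional, or non-numeric values are rejected.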
-
-## Non-goals
-
- **Bounding the descriptions stage's per-table LLM call.** Whether an individual
-  enrichment LLM call can wedge is a separate concern (already being addressed in the
-  working tree via a per-table enrichment timeout). This spec ensures whatever
-  descriptions *did* complete are durable; it does not own the per-call timeout.
- **Changing relationship-detection quality, thresholds, or the candidate/validation
-  algorithm.** The accept/review thresholds, scoring, and the existing
-  `validationBudget` count cap are unchanged; this spec adds durability,
-  cross-run resume, progress, and a time budget around them.
- **A per-connection or per-call relationship budget, or a global env override.**
-  One flat project-level `detectionBudgetMs`; no second runtime path (D3).
- **A new per-query timeout.** Spec 16 already bounds individual queries; this spec
-  composes above it and does not re-implement query-level deadlines.
- **Replacing the per-query deadline with the stage budget, or vice versa.** They
-  are independent and layered: a single query is bounded by spec 16; the stage's sum
-  is bounded by `detectionBudgetMs`.
- **A general checkpoint framework for every ingest stage.** The checkpoint is
-  specifically the descriptions+embeddings → queryable-manifest promotion before
-  relationships; it is not a generic per-stage artifact-flush abstraction.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns the
-design.
-
- **Enrichment orchestration** — `context/scan/local-enrichment.ts`:
-  `runLocalScanEnrichment` (`:472`), the three `runEnrichmentStage` calls
-  (`descriptions` `:524`, `embeddings` `:553`, `relationships` `:587`),
-  `runEnrichmentStage` (`:413`) and its `findCompletedStage` lookup (`:427`). Add the
-  checkpoint hook after the last non-relationship stage; thread the progress port,
-  signal, and budget into the relationship stage.
- **Scan driver / write ordering** — `context/scan/local-scan.ts`: bare manifest
-  write (`:473`), enrichment call (`:492`, currently passing only
-  `{ runId, progress }` as `context` — wire `signal` through here too), terminal
-  `writeLocalScanEnrichmentArtifacts` (`:510`), and the enrichment-failure catch
-  (`:530`, which after D1 no longer loses descriptions). Supply the checkpoint
-  callback here.
- **Artifact writer** — `context/scan/local-enrichment-artifacts.ts`:
-  `writeLocalScanEnrichmentArtifacts` (`:310`), `writeLocalScanManifestShards`
-  (`:270`), and the description-preserving merge in `loadExistingManifestState`
-  (`:196`) — the basis for the additive checkpoint/final write.
- **Resume cache** — `context/scan/sqlite-local-enrichment-state-store.ts`:
-  `PRIMARY KEY (run_id, stage)` (`:83`), `findCompletedStage` (`:91`),
-  `saveCompletedStage` (`:117`). Re-key on `(connection_id, stage, input_hash)`,
-  pick latest `updated_at`, recreate the table if shape differs (disposable cache).
-  Lookup interface `KtxScanEnrichmentStageLookup` and `findCompletedStage`
-  in `context/scan/enrichment-state.ts` (`:10,46`); `computeKtxScanEnrichmentInputHash`
-  (`:78`).
- **Relationship stack (progress + budget + signal)** —
-  `context/scan/relationship-discovery.ts` (`discoverKtxRelationships` `:218`, accept
-  a progress port and budget/deadline + signal),
-  `context/scan/relationship-profiling.ts` (`profileKtxRelationshipSchema` `:320` —
-  per-table progress + budget check),
-  `context/scan/relationship-validation.ts` (`validateKtxRelationshipDiscoveryCandidates`
-  `:237` — per-candidate progress + budget check, alongside the existing
-  `validationBudget`),
-  `context/scan/relationship-composite-candidates.ts`
-  (`discoverKtxCompositeRelationships` `:515` — budget check).
- **Config** — `context/project/config.ts` `scan.relationships`
-  (`KtxScanRelationshipConfig`, `:171–213`): add `detectionBudgetMs` (positive
-  integer ms, default 600_000) to the zod schema and the default config builder.
- **Partial marker** — `context/scan/relationship-diagnostics.ts`
-  (`buildKtxRelationshipDiagnostics`, the profile/diagnostics artifact shape) carries
-  a `partial` flag + reason; add a recoverable warning code to the
-  `KtxScanWarningCode` union in `context/scan/types.ts` (e.g.
-  `relationship_detection_partial`).
- **Tests** — durability: a fixture ingest interrupted during the relationship stage
-  leaves AI descriptions in the queryable `_schema`. Resume: a second run with a
-  fresh `runId` and unchanged `inputHash` resolves the cached descriptions/embeddings
-  (assert no LLM/embedding calls) and re-runs only relationships. Budget: a schema
-  large enough (or a tiny `detectionBudgetMs` as the test seam) hits the budget,
-  emits per-unit progress, returns partial, persists it marked `partial`, and a
-  re-run resumes the partial; raising the budget re-runs. Resolver/config unit tests
-  for `detectionBudgetMs` (default / override / invalid). Regression: small
-  uninterrupted ingest yields identical artifacts and relationship output.
- After implementing, rebuild and re-link so the playground picks it up:
-  `pnpm run build && pnpm run link:dev`.
-
-## Benchmark context (motivation, not a requirement)
-
-The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables
-(`ebi_chembl` 785, `fec` 486, `ga360` 366, …). Enriching them with claude-code
-costs real, rate-limited LLM budget; losing that enrichment to a relationship-stage
-interruption — and re-spending it on every retry — makes large-schema ingest
-impractical, and an unbounded profiling stage runs past any supervising deadline and
-is killed. This is a general durability/cost property of the ingest pipeline,
-independent of the benchmark; the benchmark only made it acute at scale. Do not
-encode any benchmark specifics in the implementation.
-
-## Implementation notes
-
-Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
-four design decisions shipped; no deviations from the resolved design.
-
-**D2 — resume by content identity** (`sqlite-local-enrichment-state-store.ts`,
-`enrichment-state.ts`, `local-enrichment.ts`): the stage table is re-keyed to
-`PRIMARY KEY (connection_id, stage, input_hash)`; `findCompletedStage` looks up by
-`(connectionId, stage, inputHash)` ordered by `updated_at DESC` (the most recent
-row for a content identity wins). `KtxScanEnrichmentStageLookup.runId` became `connectionId`;
-`runId` stays on the row for diagnostics/`listRunStages`. The store drops and
-recreates the table when the on-disk primary key differs (disposable cache, no
-migration bridge), detected via `PRAGMA table_info`.
-
-**D3 — observable + bounded relationship stage** (new
-`relationship-detection-budget.ts`): a sticky `KtxRelationshipDetectionBudget`
-(`check()`/`stopReason()`) built from `detectionBudgetMs` + `ctx.signal` + an
-injectable `now`, plus `mapWithBudget` (a budget-aware concurrent map that
-generalizes and replaces the old `mapWithConcurrency`). Threaded through
-`discoverKtxRelationships` → profiling (per-table progress + budget stop),
-validation (per-candidate progress + budget stop; budget-skipped candidates
-degrade to the existing `validation_unattempted` review), and composite
-detection (budget stops at PK-detection and coverage-probe boundaries).
-`discoverKtxRelationships` now accepts `progress` and `now` and returns
-`partial: { reason } | null`. The clock check fires only when work remains, so a
-deadline elapsing after the last unit never marks a fully-processed stage partial.
-
-**D1 — checkpoint before relationships** (`local-enrichment.ts`,
-`local-enrichment-artifacts.ts`, `local-scan.ts`): `runLocalScanEnrichment` fires a
-caller-supplied `onCheckpoint` once descriptions/embeddings complete and before
-the relationship stage runs, gated on `shouldDetectRelationships` so the
-no-relationship path keeps a single write. `local-scan.ts` supplies a callback
-calling the new `writeLocalScanEnrichmentCheckpoint` (descriptions.json +
-embeddings.json + manifest with descriptions and no generated joins — no
-relationship artifacts, so no misleading empty diagnostics). The shared
-description/embedding JSON writer was factored out so checkpoint and final writes
-stay one implementation. `ctx.signal` is now threaded from `RunLocalScanOptions`
-into the enrichment context (completing the existing `KtxScanContext.signal`
-contract already read by the budget and the in-flight description timeout).
-
-**D4 — partial is durable + marked** (`relationship-diagnostics.ts`,
-`local-enrichment.ts`, `local-enrichment-artifacts.ts`): the diagnostics artifact
-carries `partial` + `partialReason`; `runLocalScanEnrichment` pushes a recoverable
-`relationship_detection_partial` warning (new `KtxScanWarningCode`) when truncated.
-A graceful budget/abort stop returns normally, so the relationship stage saves as a
-completed-partial record and resumes cheaply; a process killed mid-stage saves
-nothing and recomputes. Raising `detectionBudgetMs` changes `inputHash`
-(it lives in `relationshipSettings`), forcing a fresh, fuller run — the only
-"try harder" mechanism, no extra flag.
-
-**Config** (`config.ts`): `scan.relationships.detectionBudgetMs`, positive integer
-ms, default `600_000`, validated like the other relationship fields. Documented in
-`docs-site/content/docs/configuration/ktx-yaml.mdx`.
-
-**Tests** (all green): budget unit tests (`relationship-detection-budget.test.ts`);
-cross-run resume + table-recreate (`enrichment-state.test.ts`,
-`local-enrichment.test.ts`); progress/budget/abort partial
-(`relationship-discovery.test.ts`); partial persisted/resumed/re-run-on-raise +
-checkpoint ordering + no-checkpoint-when-skipped (`local-enrichment.test.ts`);
-end-to-end durability — a relationship-stage failure still leaves AI descriptions
-in the queryable `_schema` (`local-scan.test.ts`); diagnostics partial flag
-(`relationship-diagnostics.test.ts`); config default/override/invalid
-(`config.test.ts`). `pnpm --filter @kaelio/ktx type-check`, `pnpm run dead-code`,
-and `pnpm run build && pnpm run link:dev` all pass. (Pre-existing and unrelated:
-three `analytics-skill-content.test.ts` markdown-structure assertions fail on this
-branch from earlier analytics-skill commits — untouched here.)
--- a/spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md
+++ b/spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md
@ -1,533 +0,0 @@
-# Resilient enrichment under a slow/hung LLM backend
-
-> Refined spec. Intake draft: `todo/20-resilient-enrichment-under-slow-llm.md`.
->
-> **Scope: make the descriptions enrichment stage survive a hung LLM backend and
-> an interrupted run.** Two compounding gaps live *inside* the per-table
-> description-enrichment path: (1) the per-table LLM timeout fires in JS but does
-> not terminate a wedged subprocess backend, so a hung table wedges the whole
-> stage indefinitely; (2) descriptions are persisted only at full-stage
-> completion, so any interruption discards every already-enriched table. This is
-> the enrichment-stage analog of spec 16 (enforced query cancellation — a deadline
-> that *stops the work*, not just abandons the promise) and spec 19 (move the
-> durability boundary to the cost boundary so expensive LLM work is not lost). It
-> composes with both rather than replacing them.
-
-## Problem
-
-Two compounding failure modes on the per-table description-enrichment path, both
-confirmed in the current code and observed end-to-end together. Their union turned
-a single hung table into an indefinite wedge *plus* total loss of an entire
-stage's LLM work.
-
-### 1. The per-table LLM timeout does not terminate the work
-
-`KtxDescriptionGenerator.generateBatchedTableDescriptions`
-(`context/scan/description-generation.ts`, the bounded call ~760–866) wraps the
-per-table `this.llmRuntime.generateObject(...)` call in `retryAsync` with a fresh
-`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (commit `01f63380`).
-A fired timeout is surfaced as `KtxAbortedError` so it is **not** retried (one
-wedge stays one timeout, not 3×). That is the correct policy — but the abort never
-actually stops a subprocess backend, so the timeout is cosmetic.
-
-The runtime is selected by the `backend` config field
-(`context/llm/local-config.ts`, `KTX_LLM_BACKENDS =
-['none','anthropic','vertex','gateway','claude-code','codex']`). Two backends spawn
-a **child process the SDK owns** and to which ktx hands only an `AbortSignal`:
-
- **`codex`** (`@openai/codex-sdk`, via `context/llm/codex-runtime.ts` →
-  `codex-sdk-runner.ts`): the SDK runs `spawn(executable, args, { signal })`. Node's
-  `spawn` `signal` option sends the child **SIGTERM** (not SIGKILL) on abort, and the
-  SDK consumes the child's stdout with `for await (const line of rl)`, re-throwing
-  the abort error **only after that loop ends**. A child wedged on a hung provider
-  socket survives SIGTERM → its stdout never closes → the readline loop never ends
-  → the SDK never throws → ktx's `await generateObject` **never settles**, past the
-  per-attempt timeout, indefinitely. The child leaks (open provider connections,
-  ~0% CPU).
- **`claude-code`** (`@anthropic-ai/claude-agent-sdk`, via
-  `context/llm/claude-code-runtime.ts`, `collectResult` ~275–322): on abort it calls
-  best-effort `queryResult.interrupt?.()` (errors swallowed) and only checks
-  `throwIfAborted` **between** streamed messages. A wedged child emits no message, so
-  the `for await (const message of queryResult)` loop blocks and the graceful
-  `interrupt()` may never land — the same hang class.
-
-By contrast, **HTTP backends** (`anthropic`/`vertex`/`gateway`/`openai`, via
-`context/llm/ai-sdk-runtime.ts`) pass `abortSignal` straight to the AI SDK's
-`generateObject`, which cancels the underlying `fetch` natively — the await settles
-promptly and there is no child to leak.
-
-So ktx holds **no kill handle** on the subprocess backends, and SIGTERM is too
-gentle for a wedged child. Spec 16's mechanism (ktx *itself* forks
-`read-query-child` and `SIGKILL`s it) works precisely because ktx owns the fork —
-which it does not here.
-
-Observed (BigQuery ingest, codex backend, 2026-06-23): with
-`KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min, an operator override), two of
-`covid19_usa`'s 252-column tables hung; the stage sat at **268/285 for 41+
-minutes** — well past the 30-min per-attempt timeout — with exactly two codex
-children, each holding 3 ESTABLISHED connections at ~0% CPU, until killed by hand.
-
-### 2. Descriptions are persisted only at full-stage completion
-
-`generateDescriptions` (`context/scan/local-enrichment.ts` ~279–352) fans out
-per-table work through `pLimit(DESCRIPTION_TABLE_CONCURRENCY)` (default 4) and
-**accumulates every table's result in an in-memory `updates` array**, returned only
-when the whole stage finishes. `runEnrichmentStage` (~413, ~421–474) then calls
-`saveCompletedStage` (writing the whole-stage row to `local_scan_enrichment_stages`)
-**after** `compute()` returns, and the spec-19 checkpoint write
-(`writeLocalScanEnrichmentCheckpoint`, `local-enrichment-artifacts.ts` ~351–379,
-fired by the `onCheckpoint` hook in `local-scan.ts`) also runs **only once the
-descriptions stage completes**. There is no within-stage persistence: while the
-stage runs, every enriched table's description lives only in memory.
-
-So if the stage cannot complete — 2 of 285 tables hang (gap #1), or the process is
-killed, or a supervising watchdog fires — **all** already-enriched tables are lost,
-even though their (expensive, paid) LLM descriptions were finished. On the next run,
-`findCompletedStage` finds no row, so the descriptions stage **recomputes from
-scratch**.
-
-Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but
-**0** rows in `local_scan_enrichment_stages` and **0** `ai:` descriptions on disk;
-killing the wedged ingest discarded all 283, forcing a from-scratch re-ingest. The
-cost of 2 pathological tables was 283 tables' worth of redone LLM calls.
-
-Sharper still (re-ingest with a short, *enforced* timeout): the loss occurs even
-when the stage **runs to the end**. The 2 hung tables hit their timeout and were
-skipped, **283/285** descriptions were generated, and the ingest reported success
-(`Scan completed` / `Ingest finished`, embeddings built, exit 0), yet the
-descriptions were **still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the
-loss is **not** only "discarded on kill": a stage that completes with *any*
-skipped/aborted table throws away **every** successfully generated description.
-The skip must be **graceful**: a skipped table should cost one missing
-description, not the entire stage's output. This is the strongest argument for
-per-table incremental persistence: the 283 good descriptions should have been
-durable the moment each was produced.
-
-The on-disk artifacts already carry everything needed to fix this *additively*: the
-`_schema` manifest encodes per-table completion (a table with `descriptions.ai` is
-AI-enriched), and rewrites preserve existing descriptions
-(`mergeDescriptionsPreservingExternal`, `manifest.ts` ~96–115;
-`loadExistingManifestState`, `local-enrichment-artifacts.ts` ~196–253 — the basis
-spec 19 relies on). The durable record and the resume-skip set can be **derived from
-the system's own on-disk state**, with no new cache schema.
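A minimal sketch of deriving the resume-skip set from the manifest and persisting per table, assuming a simplified manifest entry in which only `descriptions.ai` matters. `ManifestTable`, `tablesNeedingEnrichment`, and `enrichIncrementally` are hypothetical names, and an in-memory array stands in for the on-disk `_schema`:

```typescript
// Hypothetical, simplified manifest entry; only `descriptions.ai` matters here.
type ManifestTable = {
  name: string;
  descriptions: { native?: string; ai?: string };
};

// Tables that already carry an AI description are skipped on resume.
function tablesNeedingEnrichment(tables: ManifestTable[]): string[] {
  return tables.filter((t) => t.descriptions.ai === undefined).map((t) => t.name);
}

// Persist each description the moment it is produced; a failed table is skipped.
function enrichIncrementally(
  tables: ManifestTable[],
  describe: (table: string) => string,
): string[] {
  const skipped: string[] = [];
  for (const name of tablesNeedingEnrichment(tables)) {
    const entry = tables.find((t) => t.name === name);
    if (!entry) continue;
    try {
      entry.descriptions.ai = describe(name); // durable per table, not per stage
    } catch {
      skipped.push(name); // costs one missing description, not the whole stage
    }
  }
  return skipped;
}

const manifestTables: ManifestTable[] = [
  { name: "cases", descriptions: { ai: "Daily case counts." } },
  { name: "wide_table", descriptions: {} },
  { name: "deaths", descriptions: {} },
];

const calls: string[] = [];
const skippedTables = enrichIncrementally(manifestTables, (table) => {
  calls.push(table);
  if (table === "wide_table") throw new Error("LLM call timed out");
  return `Description of ${table}.`;
});
```

In this sketch the already-enriched `cases` table is never sent to the describer, the timed-out `wide_table` is recorded as skipped, and `deaths` keeps its new description. A later run would retry only `wide_table`.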
-
-## Generic use case (independent of any benchmark)
-
-Anyone ingesting a large or wide schema with an LLM enrichment backend —
-especially a **subprocess** backend, the common local/desktop setup — will
-eventually hit a table whose description call hangs: a provider stall, a rate-limit
-black-hole, a pathologically large prompt. Without an *enforced* timeout, one such
-table wedges the entire ingest indefinitely and leaks the spawned child; without
-*incremental* persistence, any interruption throws away all the per-table LLM work
-already done — the dominant ingest cost. Both fixes make large-schema enrichment
-**resilient and resumable**: a few bad tables degrade to a few skipped
-descriptions, not a hung process and a from-scratch redo. This is core robustness
-for a general-purpose ingestion product, wholly independent of any benchmark.
-
-## Design decisions (resolved during refinement)
-
-These resolve ambiguities the intake draft left open. They constrain the
-implementer; the exact code is theirs (requirement-level, per the specs README).
-
-### D1 — One bounded-call guarantee; enforcement follows the backend's nature
-
-The canonical contract is a single guarantee for the per-table enrichment call:
-**the in-flight work terminates and ktx's await settles within the per-table
-deadline plus a small grace, on every backend.** How that guarantee is met follows
-from a structural property of the configured backend — *does it own a subprocess?*
-— not from a hand-maintained list of provider names:
-
- **Subprocess-backed (`codex`, `claude-code`):** the SDK's own abort is
-  insufficient (SIGTERM-only, and ktx has no kill handle), so ktx runs the call
-  behind a **boundary it can hard-kill** — a short-lived ktx-owned child process,
-  made a **process-group leader** (`detached`). The SDK's grandchild (the
-  `codex`/`claude` binary) inherits that group. On deadline (or `ctx.signal`), ktx
-  **tree-kills the whole group with SIGKILL** — reaping the wrapper *and* the
-  grandchild — and rejects promptly. This mirrors spec 16's child-process +
-  SIGKILL mechanism, extended by the critical step that **killing the immediate
-  child is not enough**: the grandchild would otherwise orphan to init and keep its
-  provider connections. Killing the group is the real fix.
- **HTTP-backed (`anthropic`/`vertex`/`gateway`/`openai`):** unchanged. The existing
-  in-process `abortSignal` → `fetch` cancellation already satisfies the contract —
-  the await settles promptly and there is no subprocess to leak. Routing these
-  through a subprocess would pay fork + IPC + credential-passing cost for no benefit.
-
-> The branch on "subprocess-backed?" is behavior following from an input the backend
-> declares about itself, not vendor enumeration — the same guarantee is reached two
-> ways because the backends differ structurally. This matches the intake's own split
-> ("subprocess SIGKILL for process-backed; request abort for HTTP-backed").
->
-> Rejected alternative — a *settle-only race* (reject ktx's promise on the deadline
-> regardless of the SDK, but leave the SDK's child running). It unwedges the stage
-> but leaves the orphaned child holding provider connections — the exact leak the
-> incident showed — so it fails the intake's "actually cancelled" requirement and
-> compounds over a long ingest that hits several hung tables.
->
-> Rejected alternative — a *persistent ktx subprocess pool* hosting the runtime,
-> killed and respawned on timeout. Terminate-on-deadline destroys the worker, so a
-> pool needs respawn + in-flight job-tracking for no benefit: the enrichment call is
-> low-frequency relative to its own latency and already concurrency-bounded (4), so
-> one short-lived child per call (spec 16's resolved choice) is simpler and as fast.
-
-**Portability.** ktx supports Windows, where POSIX process groups and
-`process.kill(-pgid, …)` do not exist. The tree-kill MUST be portable: a detached
-process group + `kill(-pgid, 'SIGKILL')` on POSIX, and a tree-terminating
-equivalent on Windows (e.g. `taskkill /pid <pid> /T /F` or a job object) so the
-grandchild is reaped on every platform the subprocess backends run on.
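
The portability rule above can be sketched as a small platform switch. This is a minimal illustration under stated assumptions, not the ktx implementation: `KillPlan`, `KillDeps`, `planTreeKill`, and `executeTreeKill` are hypothetical names, and the injected `deps` stand in for Node's `process.kill` and a spawned `taskkill`, so the sketch has no platform side effects.

```typescript
// Hypothetical sketch: choose a tree-kill strategy per platform.
// KillPlan, KillDeps, planTreeKill and executeTreeKill are illustrative names,
// not ktx APIs.

type KillPlan =
  | { kind: "group-signal"; target: number; signal: "SIGKILL" }
  | { kind: "command"; command: string; args: string[] };

// `pid` is the PID of the detached wrapper child. On POSIX a detached child is
// its own process-group leader, so its PID doubles as the group id and a
// negative target addresses the whole group (wrapper and grandchild).
function planTreeKill(pid: number, platform: string): KillPlan {
  if (!Number.isInteger(pid) || pid <= 0) {
    // A target of 0 would address the caller's own process group.
    throw new RangeError(`invalid pid: ${pid}`);
  }
  if (platform === "win32") {
    // /T terminates the process tree, /F forces termination.
    return { kind: "command", command: "taskkill", args: ["/pid", String(pid), "/T", "/F"] };
  }
  return { kind: "group-signal", target: -pid, signal: "SIGKILL" };
}

// Injected side effects. In real wiring (an assumption), `kill` would be
// process.kill and `run` would spawn the command.
interface KillDeps {
  kill(target: number, signal: string): void;
  run(command: string, args: string[]): void;
}

function executeTreeKill(pid: number, platform: string, deps: KillDeps): KillPlan {
  const plan = planTreeKill(pid, platform);
  if (plan.kind === "group-signal") {
    deps.kill(plan.target, plan.signal);
  } else {
    deps.run(plan.command, plan.args);
  }
  return plan;
}
```

Keeping the plan separate from its execution makes the platform branch testable without spawning anything. The PID guard is deliberate: a negated invalid PID would signal a group the caller never intended to kill.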
-
-### D2 — Default stays moderate and the retry/skip policy is unchanged
-
-The per-table timeout default stays **120s** (`KTX_ENRICH_LLM_TIMEOUT_MS`), with the
-existing per-attempt retry (`KTX_ENRICH_LLM_ATTEMPTS`, default 3) and the
-no-retry-on-timeout policy. A hung table costs **at most one timeout**, then the
-table is skipped with the existing `enrichment_timeout` warning and the stage
-proceeds. The 30-min value in the incident was an operator stopgap chosen *because*
-the timeout was cosmetic; once D1 makes the timeout actually terminate the work, a
-long timeout is strictly worse for a hang (a hang costs the full timeout), so the
-moderate default is the correct operating point. The retry loop stays in
-`description-generation.ts`: each attempt runs through the bounded boundary (D1), so
-a transient backend error retries while a timeout surfaces as `KtxAbortedError` and
-does not.
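
A minimal sketch of this retry/skip policy, assuming a hypothetical `enrichWithPolicy` helper. The `isTimeout` check (matching on the error name) stands in for however the real code recognizes `KtxAbortedError`, and mapping exhausted retries to `enrichment_failed` is likewise an assumption of this sketch.

```typescript
// Hypothetical sketch of the D2 policy: transient errors retry, a timeout does not.
// `enrichWithPolicy`, `isTimeout` and `EnrichOutcome` are illustrative names.

type EnrichOutcome<T> =
  | { status: "ok"; value: T; attempts: number }
  | { status: "skipped"; warning: "enrichment_timeout" | "enrichment_failed"; attempts: number };

// Assumption: a timeout is recognizable by its error name. The real check
// against KtxAbortedError may look different.
function isTimeout(err: unknown): boolean {
  return err instanceof Error && err.name === "KtxAbortedError";
}

async function enrichWithPolicy<T>(
  attempt: (attemptNumber: number) => Promise<T>,
  maxAttempts: number = 3, // mirrors the KTX_ENRICH_LLM_ATTEMPTS default
): Promise<EnrichOutcome<T>> {
  for (let n = 1; n <= maxAttempts; n++) {
    try {
      const value = await attempt(n);
      return { status: "ok", value, attempts: n };
    } catch (err) {
      if (isTimeout(err)) {
        // One wedge costs one timeout: skip the table, never retry it.
        return { status: "skipped", warning: "enrichment_timeout", attempts: n };
      }
      // Any other error is treated as transient; loop to the next attempt.
    }
  }
  return { status: "skipped", warning: "enrichment_failed", attempts: maxAttempts };
}
```

With the default of 3 attempts, a table that fails transiently twice and then succeeds reports `attempts: 3`, while a table that times out on its first attempt is skipped after exactly one call.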
-
-> This spec does not introduce a new `ktx.yaml` config field for the timeout. The existing env
-> override is the tuning seam; adding a per-connection/per-call/global knob would
-> multiply the runtime surface for no stated need (one opinionated default + the
-> existing env override is the canonical ktx shape).
-
-### D3 — Persist descriptions incrementally; derive the resume-skip set from on-disk state
-
-During the descriptions fan-out, flush completed tables **per batch** (every N
-tables / on a timer, at a cadence that bounds the at-risk window) to the durable
-on-disk artifacts, reusing spec 19's additive write:
-
- the raw descriptions artifact (`descriptions.json`) is the **resume-skip source**;
- the `_schema` manifest is updated additively (`mergeDescriptionsPreservingExternal`
-  preserves prior `ai:`/`db:`/external keys) so finished descriptions are also
-  **queryable** the moment they are computed — the spec-19 invariant, one level
-  deeper. The implementer MAY bound manifest-rewrite cost on huge schemas by
-  rewriting only changed shards.
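
The per-batch cadence can be illustrated with a small batcher. This is a sketch under stated assumptions, not the shipped code: `DescriptionFlushBatcher` is a hypothetical name, the flush callback stands in for the additive artifact and manifest write, and the timer-based trigger and asynchronous flushing are omitted for brevity.

```typescript
// Hypothetical sketch: flush completed descriptions every N tables.
// The `flush` callback stands in for the durable, additive on-disk write.

class DescriptionFlushBatcher {
  private pending = new Map<string, string>();
  private readonly flushEvery: number;
  private readonly flush: (batch: Map<string, string>) => void;

  constructor(flushEvery: number, flush: (batch: Map<string, string>) => void) {
    if (!Number.isInteger(flushEvery) || flushEvery < 1) {
      throw new RangeError(`flushEvery must be a positive integer, got ${flushEvery}`);
    }
    this.flushEvery = flushEvery;
    this.flush = flush;
  }

  // Record one completed table; flush as soon as the batch is full.
  add(table: string, description: string): void {
    this.pending.set(table, description);
    if (this.pending.size >= this.flushEvery) {
      this.drain();
    }
  }

  // Force-flush whatever is pending (used for the tail at stage end).
  drain(): void {
    if (this.pending.size === 0) return;
    const batch = this.pending;
    this.pending = new Map();
    this.flush(batch);
  }
}
```

Because the flush here runs synchronously once the batch fills, at most `flushEvery - 1` completed tables are unflushed at any moment, which is the at-risk window the cadence is meant to bound.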
-
-On resume, `generateDescriptions` reads the existing record, **skips any table
-already enriched**, computes only the remainder, and returns the merged full set so
-the embeddings stage, the checkpoint write, and the stage-store row all see a
-complete result exactly as today.
-
-**The skip is `inputHash`-gated**, preserving spec 19's recompute semantics. The
-durable record is tagged with the descriptions stage's `inputHash`
-(`computeKtxScanEnrichmentInputHash`). Resume reuses it to skip tables **only when
-the current `inputHash` matches** — a genuine resume-after-interruption of the same
-content identity. A changed `inputHash` (schema or enrichment settings changed)
-ignores the prior record for skipping and recomputes the stage as today; the
-manifest write stays additive regardless. The artifact's on-disk shape may gain the
-`inputHash` tag with **no migration bridge** (ktx owns the artifact; a stale-shaped
-record simply forces one non-incremental run), consistent with ktx's
-no-backward-compatibility policy.
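
The `inputHash` gate can be sketched as a pure function over the durable record. The record shape (`DurableRecord`) and the name `planResume` are assumptions made for illustration; the actual on-disk shape is the implementer's choice.

```typescript
// Hypothetical sketch of the inputHash-gated resume skip.
// DurableRecord and planResume are illustrative, not the ktx artifact shape.

interface DurableRecord {
  inputHash: string;
  // null marks a table that was attempted but failed or was skipped.
  descriptions: Record<string, string | null>;
}

interface ResumePlan {
  recovered: Record<string, string>; // reused as-is, no LLM call
  remainder: string[];               // still needs an LLM call
}

function planResume(
  allTables: string[],
  record: DurableRecord | null,
  currentInputHash: string,
): ResumePlan {
  // A missing record or a changed inputHash means nothing is reusable:
  // the stage recomputes every table, as it does today.
  const reusable: Record<string, string | null> =
    record !== null && record.inputHash === currentInputHash ? record.descriptions : {};

  const recovered: Record<string, string> = {};
  const remainder: string[] = [];
  for (const table of allTables) {
    const prior = reusable[table];
    if (typeof prior === "string" && prior.length > 0) {
      recovered[table] = prior;
    } else {
      // Never treat a failed (null) or missing table as done.
      remainder.push(table);
    }
  }
  return { recovered, remainder };
}
```

A record tagged with a different `inputHash` yields an empty `recovered` set and a full `remainder`, so the skip only ever applies to a resume of the same content identity.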
-
-> The skip set is **derived from the artifacts ktx already writes**, not from a new
-> per-table cache table. The manifest's `ai:` field already encodes "this table is
-> enriched"; a parallel per-table SQLite record would be a second source of truth for
-> the same fact and would drift. The whole-stage `local_scan_enrichment_stages` row is
-> still written at stage completion (it remains the stage-level resume gate — a clean
-> re-run skips the descriptions stage as today); the incremental record only matters
-> when the stage did **not** complete — exactly the case where no row exists and
-> `compute()` re-runs.
-
-### D4 — A killed-mid-stage run is durable; resume is cheap
-
-A process killed mid-stage (gap #1 wedge, SIGKILL, crash, supervisor) leaves the
-per-batch-flushed tables durable on disk. The next run resumes the descriptions
-stage (no completed `local_scan_enrichment_stages` row → `compute()` runs again),
-but `generateDescriptions` now **re-issues LLM calls only for the unfinished
-tables**. A failed/skipped table (timeout or exhausted retries) is left for the
-remainder set and is retried on the next resume — never silently treated as done.
-
-## Requirements
-
-### 1. The per-table enrichment timeout is enforced for subprocess backends
-
-When the per-table deadline fires (or `ctx.signal` aborts) on a subprocess-backed
-backend (`codex`, `claude-code`), the in-flight LLM work — the spawned child **and
-its descendants** — MUST be terminated (SIGKILL of the process group / tree), and
-ktx's `generateObject` await MUST settle within the deadline plus a small bounded
-grace. A hung table MUST cost at most about one timeout of wall-clock time, never an unbounded wait.
-The termination MUST be portable across the platforms the subprocess backends run on
-(POSIX process-group kill and a Windows tree-kill equivalent). HTTP-backed backends
-keep their existing native `abortSignal` → `fetch` cancellation; the guarantee is one
-contract met two ways, branching on the backend's structural "owns a subprocess"
-property, not on a list of provider names.
-
-### 2. The timeout default and retry/skip policy are unchanged
-
-The default per-table timeout stays moderate (current 120s, `KTX_ENRICH_LLM_TIMEOUT_MS`),
-with the existing per-attempt retry (default 3, `KTX_ENRICH_LLM_ATTEMPTS`) and the
-no-retry-on-timeout policy. On timeout, the table is skipped with the existing
-`enrichment_timeout` recoverable warning and the stage proceeds. No new
-per-connection / per-call / global timeout knob is added.
-
-### 3. Descriptions are persisted incrementally during the stage
-
-Enriched descriptions MUST be flushed to the durable on-disk artifacts **per batch**
-(per-table or per-N-tables / on a timer) during the descriptions stage, at a cadence
-that bounds the at-risk window to a small number of tables. The flush MUST be
-idempotent and additive (never clobber a prior `ai:` description; preserve `db:` and
-external keys via the existing merge). Finished tables MUST remain durable even if the
-stage never completes because it is wedged, killed, or interrupted. A failed/skipped
-relationship/embedding stage or a killed descriptions stage MUST NOT lose the
-descriptions already flushed.
-
-### 4. Resume re-enriches only the unfinished tables
-
-On a resumed ingest with an unchanged `inputHash`, the descriptions stage MUST
-re-issue LLM description calls **only for tables not already enriched**, deriving the
-already-enriched set from the on-disk artifacts (the `inputHash`-tagged durable
-record / the manifest's `ai:` descriptions), and MUST return the merged full result
-so downstream stages behave as on a fresh run. A changed `inputHash` (schema or
-enrichment settings changed) MUST recompute the stage as today (spec 19's
-inputHash-gated semantics preserved). The durable record MAY be recreated without a
-migration bridge if its on-disk shape changes (it is regenerable local/artifact
-state).
-
-### 5. No regression for small or uninterrupted ingests
-
-A small or single-run ingest that is never interrupted MUST produce the same
-artifacts (descriptions, manifest, embeddings) as today. The incremental flush MUST
-be idempotent with the spec-19 checkpoint and the terminal write (descriptions
-survive the embeddings/relationship rewrites). The bounded-call boundary MUST NOT
-change a normal successful enrichment's output, only how a wedged call is terminated.
-
-### 6. A skipped table costs one description, never the stage's output
-
-A descriptions stage that **completes** with one or more skipped/aborted tables MUST
-persist every successfully-generated description (the durable record and the `ai:`
-manifest entries) and MUST mark the stage completed (a `local_scan_enrichment_stages`
-row, embeddings + downstream proceeding) — it MUST NOT discard the whole stage's
-output because some tables were skipped. No single table's failure may reject the
-per-table fan-out: a per-table failure degrades to one missing description (left for
-the resume remainder), not a failed stage. A genuine `ctx.signal` cancellation is the
-only thing that fails the stage (so it resumes), and even then the already-flushed
-descriptions remain durable.
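
One way to picture this requirement is a fan-out whose per-table worker catches its own failures. The sketch below is an assumption-laden illustration: `enrichAllTables` and the `isCancellation` predicate are hypothetical, and the concurrency limit and the incremental flush are left out to keep the failure handling visible.

```typescript
// Hypothetical sketch: a per-table failure degrades to one null description,
// while a genuine cancellation still fails the whole fan-out.

interface FanOutResult {
  descriptions: Record<string, string | null>;
  warnings: string[];
}

async function enrichAllTables(
  tables: string[],
  enrichOne: (table: string) => Promise<string>,
  isCancellation: (err: unknown) => boolean,
): Promise<FanOutResult> {
  const descriptions: Record<string, string | null> = {};
  const warnings: string[] = [];

  await Promise.all(
    tables.map(async (table) => {
      try {
        descriptions[table] = await enrichOne(table);
      } catch (err) {
        if (isCancellation(err)) {
          // Only a real cancellation is allowed to reject Promise.all.
          throw err;
        }
        // Anything else costs exactly one description.
        descriptions[table] = null;
        warnings.push(`enrichment_failed: ${table}`);
      }
    }),
  );

  return { descriptions, warnings };
}
```

Because the worker never rethrows a non-cancellation error, `Promise.all` cannot reject on a single bad table, and the caller still receives every description that did succeed.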
-
-## Acceptance criteria
-
- **Enforced timeout (subprocess backend):** a subprocess-backed enrichment call
-  that hangs past the deadline is terminated within the deadline plus a small grace;
-  ktx's await settles, the spawned child **and a grandchild it spawned** both exit
-  (verified via the child's `exit`, not left spinning), and the table is skipped with
-  an `enrichment_timeout` warning. The stage advances rather than wedging. A
-  `ctx.signal` abort terminates the same way.
- **HTTP backend unaffected:** an HTTP-backed enrichment call still cancels promptly
-  on abort via the existing native path, with no subprocess involved.
- **Default + policy:** the default timeout is 120s and a timeout is not retried (one
-  wedge = one timeout); a transient error is still retried up to the attempt limit.
- **Graceful skip persists the rest:** a stage that completes with one table failing
-  (timeout, exhausted retries, or an unexpected throw) still writes the other N−1
-  descriptions to the durable record + `ai:` `_schema` and marks the stage completed
-  (a `local_scan_enrichment_stages` row exists); the failed table is a single `null`
-  description left for the resume remainder, not a discarded stage.
- **Incremental durability:** interrupting the descriptions stage after K of N tables
-  leaves those K durable on disk (raw artifact + `ai:` descriptions in `_schema`),
-  with no completed `local_scan_enrichment_stages` row.
- **Resume does not re-spend:** re-running the interrupted ingest (unchanged
-  `inputHash`, fresh `runId`) issues **no** LLM description calls for the K
-  already-enriched tables and enriches only the remaining N−K; the returned result is the
-  full merged set. A changed `inputHash` recomputes the stage.
- **No regression:** a small uninterrupted ingest yields identical artifacts and the
-  same descriptions/embeddings output as today; the incremental flush is idempotent
-  with the checkpoint and terminal writes.
-
-## Non-goals
-
- **Incremental persistence of embeddings.** Embeddings are fast and already covered
-  by spec 19's stage-level cross-run resume; the dominant loss is descriptions. This
-  spec scopes incremental persistence to the `descriptions` stage.
- **Changing the timeout default, retry counts, or adding a timeout config knob.**
-  D2 keeps the moderate default and the single env tuning seam.
- **Routing HTTP backends through the subprocess boundary.** Their native abort
-  already meets the contract; a subprocess would add cost and a credential-passing
-  surface for no benefit.
- **A persistent subprocess pool.** One short-lived ktx child per subprocess-backed
-  call; no pool, no respawn/job-tracking (D1).
- **Re-implementing spec 16 (per-query deadline) or spec 19 (relationship-stage
-  budget, cost-boundary checkpoint, cross-run stage resume).** This spec composes
-  above them: spec 16 bounds individual queries, spec 19 makes whole stages durable
-  and resumable, and this spec hardens the per-table enrichment call's termination
-  and adds within-stage description durability.
- **A general per-stage incremental-flush framework.** The incremental flush is
-  specifically the descriptions stage; it is not a generic abstraction over every
-  enrichment stage.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns the
-design.
-
- **Bounded per-table call (gap #1)** — `context/scan/description-generation.ts`,
-  `KtxDescriptionGenerator.generateBatchedTableDescriptions` (the bounded+retry block
-  ~760–866; `enrichTimeoutMs` ~769, `enrichAttempts` ~770, `KtxAbortedError` on
-  timeout ~811, `enrichment_timeout`/`enrichment_failed` warnings ~858). The retry
-  loop stays here; each attempt runs through the kill boundary for subprocess
-  backends.
- **LLM runtime + backend selection** — `context/llm/runtime-port.ts`
-  (`KtxLlmRuntimePort.generateObject`, `abortSignal` on the input),
-  `context/llm/local-config.ts` (~127–163, selects `CodexKtxLlmRuntime` /
-  `ClaudeCodeKtxLlmRuntime` / `AiSdkKtxLlmRuntime`), `context/project/config.ts`
-  (`KTX_LLM_BACKENDS`). The "owns a subprocess" property should be declared by the
-  backend/runtime (e.g. on the runtime interface), not inferred from a name list.
- **Subprocess backends** — `context/llm/codex-runtime.ts` +
-  `context/llm/codex-sdk-runner.ts` (`CodexSdkCliRunner.runStreamed`, the SDK's
-  `spawn(executable, args, { signal })` is in `@openai/codex-sdk`),
-  `context/llm/claude-code-runtime.ts` (`collectResult` ~275–322, the `interrupt()`
-  abort path). These are what the kill boundary must wrap and tree-kill.
- **Reuse spec 16's mechanism (extended to group/tree kill)** —
-  `connectors/sqlite/read-query-child.ts` (the forked child shape) and
-  `connectors/sqlite/connector.ts` `runReadQueryOffProcess` (~292–350: `fork`,
-  deadline timer, `child.kill('SIGKILL')`, `settle()`, the `.js`-if-exists-else-`.ts`
-  child-URL resolver ~25–27, knip dynamic entry). Gap #1 differs by making the child a
-  process-group leader and killing the **group/tree** (the SDK grandchild), portably.
-  Abort helpers: `context/core/abort.ts` (`createAbortError`, `throwIfAborted`,
-  `linkAbortSignal`). Note the new child hosts an LLM runtime, so the implementer owns
-  passing the backend config/credentials to it (env/IPC) and serializing the
-  structured result back.
- **Incremental persistence (gap #2)** —
-  `context/scan/local-enrichment.ts` (`generateDescriptions` ~279–352: the per-table
-  `pLimit` fan-out and the in-memory `updates` accumulation; `runEnrichmentStage`
-  ~413/~421–474 with `findCompletedStage` ~427 and `saveCompletedStage`; the
-  `onCheckpoint` hook ~598–612). Make `generateDescriptions` resume-aware: read the
-  existing record, skip already-enriched tables, flush per batch, return the merged
-  full set.
- **Artifact writer + additive merge** — `context/scan/local-enrichment-artifacts.ts`
-  (`writeLocalScanEnrichmentCheckpoint` ~351–379, `writeEnrichmentDescriptionArtifacts`
-  with `descriptions.json` ~316, `writeLocalScanManifestShards` ~270–308,
-  `loadExistingManifestState` ~196–253, `tableDescription`/`columnDescription`
-  ~75–105); `context/scan/manifest.ts` (`mergeDescriptionsPreservingExternal` ~96–115,
-  `SCAN_MANAGED_DESCRIPTION_KEYS`). Factor a per-batch flush that reuses the additive
-  description/manifest write; tag the durable record with `inputHash`.
- **Stage store + input hash** —
-  `context/scan/sqlite-local-enrichment-state-store.ts` (`STAGES_TABLE =
-  'local_scan_enrichment_stages'`, PK `(connection_id, stage, input_hash)`,
-  `findCompletedStage`, `saveCompletedStage`),
-  `context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78). The
-  whole-stage row stays; the `inputHash` is the gate for the resume-skip set.
- **Scan driver** — `context/scan/local-scan.ts` (the `onCheckpoint` wiring and the
-  terminal `writeLocalScanEnrichmentArtifacts`), and `KtxScanContext.signal`
-  (`context/scan/types.ts`) which the kill boundary must honor.
- **Tests** — gap #1: a fake subprocess-backed runtime whose child hangs (ignores
-  SIGTERM) is killed at a tiny test-seam deadline; assert the await settles within
-  deadline+grace, the child and a spawned grandchild both exit, and the table is
-  skipped with `enrichment_timeout`; assert an HTTP-backed abort still settles via the
-  native path. gap #2: interrupt the descriptions stage after K/N tables (a flush
-  seam), assert the K are durable (raw artifact + `ai:` in `_schema`) with no completed
-  stage row; a resume with matching `inputHash` issues no LLM calls for the K and
-  enriches only N−K; a changed `inputHash` recomputes; regression: a small
-  uninterrupted ingest yields identical artifacts.
- After implementing, rebuild and re-link so the playground picks it up:
-  `pnpm run build && pnpm run link:dev`.
-
-## Benchmark context (motivation, not a requirement)
-
-Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment
-backend). Re-enriching the giant public datasets, `covid19_usa` wedged at 268/285 for
-41+ minutes on 2 hung 252-column tables; the 30-min per-table `AbortSignal` timeout
-never killed the hung codex children, and because descriptions checkpoint only at
-stage completion, the 283 already-enriched tables were unrecoverable — the operator
-had to kill, cache-bust, and re-ingest the database from scratch (with a short timeout
-as a stopgap). The benchmark merely exercised a large/wide multi-dataset ingest at
-scale; the gaps and the fixes are generic production hygiene for any agent that
-enriches a real warehouse with a subprocess LLM backend. Do not encode any benchmark
-specifics in the implementation.
-
-## Implementation notes
-
-Implemented on branch `write-feature-spec-wiki`. Both gaps shipped; all acceptance
-criteria are covered by tests. The full ktx test surface for the touched code is
-green (the only failures in the whole suite are 3 pre-existing assertions in
-`test/skills/analytics-skill-content.test.ts` about the analytics SKILL.md markdown
-— an unrelated subsystem this change does not touch).
-
-### Gap #1 — enforced timeout for subprocess backends
-
- **Structural property on the runtime, not a name list.** Added
-  `subprocessForkSpec(): SubprocessRuntimeForkSpec | null` to `KtxLlmRuntimePort`
-  (`context/llm/runtime-port.ts`). `CodexKtxLlmRuntime` / `ClaudeCodeKtxLlmRuntime`
-  return a serializable `{ backend, projectDir, modelSlots }`; `AiSdkKtxLlmRuntime`
-  (and the deterministic stub) return `null`. The per-table call branches on this,
-  never on a vendor list (D1).
- **Shared structured core.** Both subprocess runtimes gained
-  `generateStructuredJson(jsonSchema)` (returns the raw object; the caller
-  Zod-validates). Their existing `generateObject` was refactored to delegate to the
-  same streaming core, so structured generation has one implementation.
- **Kill boundary.** New `context/llm/subprocess-generate-object.ts`
-  (`runGenerateObjectInSubprocess`, `KtxSubprocessDeadlineError`) forks a ktx-owned
-  child (`subprocess-generate-object-child.ts`) **detached** (process-group leader);
-  the SDK's model binary inherits the group. On the deadline or `ctx.signal`, ktx
-  tree-kills the group with `SIGKILL` (`process.kill(-pid, …)` on POSIX,
-  `taskkill /pid <pid> /T /F` on Windows) and rejects promptly; on success the raw
-  output is Zod-validated. Credentials reach the child via inherited `process.env`
-  (the runtimes re-derive their allowlisted env), never over IPC.
- **Wiring.** `KtxDescriptionGenerator.generateBatchedTableDescriptions`
-  (`context/scan/description-generation.ts`) routes each retry attempt through the
-  boundary for subprocess backends and keeps the native `AbortSignal` → `fetch`
-  path for HTTP backends. A fired deadline maps to the existing
-  `KtxAbortedError`/`enrichment_timeout` no-retry policy (one wedge = one timeout);
-  default stays 120s (D2).
- **Tests.** `test/context/llm/subprocess-generate-object.test.ts` forks a real
-  fixture child that spawns a grandchild and ignores SIGTERM, and asserts the
-  deadline/abort tree-kills both (the grandchild PID is reaped) and the await
-  settles within deadline+grace; plus success / schema-failure / child-error paths.
-  `test/context/scan/description-generation.test.ts` adds the generator-level
-  timeout-skip and the "HTTP backend spawns no child" cases.
-
-### Gap #2 — incremental descriptions persistence + resume
-
- **Durable record + resume store.** `createKtxScanDescriptionResumeStore`
-  (`context/scan/local-enrichment-artifacts.ts`) writes the descriptions-so-far to
-  a durable record (inputHash-tagged) and **only the manifest shards that gained a
-  table this batch** (new `onlyChangedTableNames` filter on
-  `writeLocalScanManifestShards`, additive merge preserved). `load(inputHash)`
-  returns the prior enriched set only on a matching inputHash (D3).
- **Resume-aware fan-out.** `generateDescriptions` (`context/scan/local-enrichment.ts`)
-  loads the prior record, skips already-enriched tables, enriches only the
-  remainder, flushes every `DESCRIPTION_FLUSH_EVERY` (10) completed tables (a single
-  in-flight flush; the final force-flush drains the tail), and returns the full
-  merged set (recovered + fresh + `null` for still-failed, so failures are retried,
-  D4). Wired through `local-scan.ts` (store constructed when not `--dry-run`).
- **Graceful-skip backstop (requirement 6).** The per-table worker wraps the call in
-  a try/catch: any non-cancellation failure degrades to one `null` description + an
-  `enrichment_failed` warning and the fan-out continues, so no single table can
-  reject `Promise.all` / abort the stage. This makes the "one skipped table costs one
-  description, not the stage's output" guarantee live at the stage boundary
-  (`generateBatchedTableDescriptions` already degrades its own failures; this is the
-  explicit backstop). A `ctx.signal` cancellation still propagates (the stage fails
-  and resumes), and the already-flushed descriptions stay durable. This closes the
-  field bug where a completed-with-skips stage persisted 0 descriptions / 0 stage rows.
- **Deviation from the spec's literal path (necessary correction).** The durable
-  record lives at a **stable, non-`syncId`** path
-  (`raw-sources/<connectionId>/live-database/enrichment-progress/descriptions.json`),
-  not the `syncId`-scoped `…/<syncId>/enrichment/descriptions.json` the spec named.
-  Reason: a from-scratch interruption (the incident's exact case — no prior
-  *completed* run) gets a **fresh `syncId`** on the next run
-  (`buildSyncId` in `context/ingest/local-stage-ingest.ts`), so a `syncId`-scoped
-  record would be unreachable on resume. The manifest is already at the stable
-  per-connection scope (`semantic-layer/<connectionId>/_schema/`), so this keeps the
-  resume source at the same stable scope. The `syncId`-scoped `enrichment/descriptions.json`
-  debug artifact written by the terminal/checkpoint writers is unchanged.
- **Tests.** `test/context/scan/description-resume.test.ts` drives
-  `runLocalScanEnrichment` against a real git-backed project: a fresh run flushes a
-  durable record + `ai:` manifest descriptions; a matching-`inputHash` resume issues
-  zero LLM calls and returns the full merged set; a partial record re-enriches only
-  the missing tables; a changed `inputHash` recomputes; the changed-shard filter
-  rewrites only the affected shard; and (requirement 6) a run where one table fails
-  still persists the other tables (durable record + `ai:`) and **completes the stage**
-  (a completed `local_scan_enrichment_stages` row), with the failed table left `null`
-  for resume.
-
-### Incidental
-
- Fixed a stale assertion in `description-generation.test.ts` ("does not run
-  per-column fallback…" expected 1 call) to `3`, matching the retry policy added in
-  commit `01f63380` (D2 / acceptance: a transient error retries up to the attempt
-  limit). The HTTP path is unchanged; the assertion simply predated the retry.
- No new `ktx.yaml` config field or runtime knob was added (D2). The rate-limit
-  governor is not wired into the scan-enrichment path, so the kill-boundary child
-  loses no pacing.
- Rebuilt and re-linked (`pnpm run build && pnpm run link:dev`); the child compiles
-  to `dist/context/llm/subprocess-generate-object-child.js`.
--- a/spider2-specs/specs/21-selective-enrichment-stages.md
+++ b/spider2-specs/specs/21-selective-enrichment-stages.md
@ -1,567 +0,0 @@
-# Selective enrichment stages (`--stages`) + per-stage cache keys
-
-> Refined spec. Intake draft: `todo/21-selective-enrichment-stages.md`.
->
-> **Scope: make the three enrichment stages independently invalidatable and
-> independently re-runnable.** Today one coarse cache key gates all three stages,
-> so changing any one stage's inputs re-pays for every stage — most painfully the
-> expensive per-table `descriptions`. And there is no CLI surface to re-run a
-> chosen subset. This spec splits the key per stage (so a change invalidates only
-> the stage it touched) and adds a `--stages` flag that force-re-runs a chosen
-> subset while preserving the others. It is the operability follow-on to spec 19
-> (durable, cross-run stage resume) and spec 20 (resilient, per-table-resumable
-> descriptions); it composes with both rather than replacing them.
-
-## Problem
-
-Enrichment has three stages — **`descriptions`** (one paid LLM call per table),
-**`embeddings`** (sentence-transformer vectors over the schema + descriptions),
-**`relationships`** (FK/join detection, optionally LLM-proposed). After specs 19
-and 20 these stages are durable and resumable, but they are still **coupled for
-cache invalidation and unreachable for selective re-run**. Three facts make a
-targeted re-run impossible without a full, expensive re-enrich.
-
-### 1. One coarse cache key gates all three stages
-
-`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:611`) computes a single
-`inputHash` from `{ snapshot, mode, detectRelationships, providerIdentity,
-relationshipSettings }` and every stage reuses it — `descriptions` (~`:642`),
-`embeddings` (~`:673`), `relationships` (~`:729`). `providerIdentity` itself
-(`localScanProviderIdentity`, `local-scan.ts:241–255`) is one blob conflating the
-description LLM identity, the embedding model/dimensions/batch size, **and** the
-whole relationship config — and it redundantly re-encodes `mode` and
-`relationships`, which the coarse hash already mixes in.
-
-The consequence: flipping `scan.relationships.llmProposals`, switching the LLM
-backend, or upgrading the embeddings model changes the **one** hash and so
-invalidates **all three** stages. ktx then re-runs the expensive per-table
-`descriptions` even though they did not conceptually change. The headline cost of
-the system — paid LLM description calls — is thrown away on any unrelated
-enrichment-config edit.
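
The coupling can be made concrete with a toy key function. This is a sketch under stated assumptions: the input shape is simplified, `stableStringify` stands in for the real hash (`computeKtxScanEnrichmentInputHash`, whose algorithm is not shown here), and `coarseKey` / `descriptionsKey` are hypothetical names. The `descriptions` slice follows the split proposed in D1.

```typescript
// Hypothetical sketch: why one coarse key invalidates every stage, and how a
// per-stage slice avoids that for `descriptions`. Shapes are simplified.

interface EnrichmentInputs {
  snapshot: string;
  llmIdentity: { defaultModel: string; baseUrlConfigured: boolean };
  embeddingIdentity: { model: string; dimensions: number; batchSize: number };
  relationshipSettings: { llmProposals: boolean };
}

// Deterministic serialization with sorted keys; a stand-in for a real hash.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`).join(",")}}`;
}

// Today: every stage shares a key derived from all inputs.
function coarseKey(inputs: EnrichmentInputs): string {
  return stableStringify(inputs);
}

// Per-stage: the descriptions key sees only the snapshot and the LLM identity.
function descriptionsKey(inputs: EnrichmentInputs): string {
  return stableStringify({ snapshot: inputs.snapshot, llmIdentity: inputs.llmIdentity });
}
```

Flipping `relationshipSettings.llmProposals` changes `coarseKey` but leaves `descriptionsKey` untouched, so under a per-stage key the paid description calls would not be repeated for that edit.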
-
-### 2. No CLI surface to select stages
-
-The enrichment internals already support a relationships-only path
-(`KtxScanMode` `'relationships'`, `types.ts:12` — `descriptions`/`embeddings` are
-gated on `mode === 'enriched'` at `local-enrichment.ts:632`, while
-`shouldDetectRelationships` admits `mode === 'relationships'` at `:624–626`). But
-`ktx ingest` hardcodes `mode: 'enriched'` (`public-ingest.ts:973`) and exposes no
-flag to select a subset (`ingest-commands.ts:26–49` — only `--no-query-history`
-and friends). The relationships-only capability is built but unreachable, and there
-is no way at all to ask for "descriptions only" or "embeddings only."
-
-### 3. The foundation for "touch one stage, keep the rest" already exists
-
-The per-stage store `local_scan_enrichment_stages` is keyed
-`(connection_id, stage, input_hash)` (spec 19) and the descriptions write is
-additive — `mergeDescriptionsPreservingExternal` (`manifest.ts`) and
-`loadExistingManifestState` (`local-enrichment-artifacts.ts`) preserve prior `ai:`,
-`db:`, and external description keys on rewrite; spec 20's per-table resume record
-(`createKtxScanDescriptionResumeStore`, `local-enrichment-artifacts.ts:286`) already
-lets a resumed run re-issue LLM calls only for the still-failed tables. So "recompute one stage, leave
-the others byte-for-byte" needs only two missing pieces: **per-stage key
-granularity** and a **CLI surface** to select stages.
-
-**Requirement:** let an operator re-run a chosen subset of enrichment stages on an
-already-ingested connection, recomputing only those stages, preserving the others'
-artifacts untouched, and **re-paying only for what genuinely changed** — never
-re-running the costly `descriptions` because an unrelated stage's inputs moved.
-
-## Generic use case (independent of any benchmark)
-
-Any team running ktx in production maintains its semantic layer over time: they
-improve the description prompt or switch the description LLM, upgrade the embeddings
-model, or turn on LLM-proposed joins. Today each of those forces a **full re-enrich
-of every connection** — re-running the expensive per-table descriptions even when
-only embeddings or relationships changed. Two routine operations should be cheap and
-targeted:
-
- **"Re-embed everything on the new model."** Swapping the embeddings model should
-  recompute only embeddings, leaving descriptions and joins on disk.
- **"Backfill joins now that `llmProposals` is on."** Enabling LLM-proposed
-  relationships should recompute only relationships.
-
-And one operation needs an explicit trigger because no input changed:
-
- **"These descriptions came out thin — re-run them with a longer timeout."** A
-  connection whose description coverage is poor because tables timed out (same
-  snapshot, same LLM, so the hash is unchanged) should be re-runnable on demand,
-  cheaply retrying only the tables that failed.
-
-This is core operability for a long-lived ingestion product and is wholly
-independent of any benchmark.
-
-## Design decisions (resolved during refinement)
-
-These resolve ambiguities the intake draft left open. They constrain the
-implementer; the exact code is theirs (requirement-level, per the specs README).
-
-### D1 — Split the coarse hash into three per-stage input hashes
-
-Replace the single `computeKtxScanEnrichmentInputHash` call with **per-stage** hash
-computation, each keyed on only that stage's own inputs. Decompose the
-`localScanProviderIdentity` blob into the slices each stage actually depends on:
-
- **`descriptions`** → `{ snapshot, llmIdentity }`, where `llmIdentity` is the
-  description-LLM identity (`llm.models.default`, `baseUrlConfigured`). **Not** the
-  embedding model/dimensions/batch size, **not** relationship settings.
- **`embeddings`** → `{ snapshot, embeddingIdentity, descriptionDigest }`, where
-  `embeddingIdentity` is `{ model, dimensions, batchSize }` and `descriptionDigest`
-  is a stable digest of the resolved description text the embeddings consume (the
-  same text `buildEmbeddings` → `buildKtxColumnEmbeddingText` feeds the model,
-  `local-enrichment.ts:466–486`, `embedding-text.ts:17–44`). This content-addresses
-  embeddings on their real upstream (D4).
- **`relationships`** → `{ snapshot, relationshipSettings, llmIdentity }`, where
-  `relationshipSettings` includes `llmProposals` and `detectionBudgetMs`. **Not** the
-  description content (decision X, D5), **not** the embedding identity.
-
-`mode` and `detectRelationships` drop out of the per-stage inputs: each stage
-produces output under exactly one mode, so the stage name already scopes that, and
-re-mixing `mode` only re-couples the keys. After the split, flipping `llmProposals`
-invalidates only `relationships`; swapping the embeddings model invalidates only
-`embeddings`; switching the description LLM invalidates only `descriptions`.
-
-The per-stage hash becomes the key everywhere a single hash is used today: the
-`local_scan_enrichment_stages` lookup/save in `runEnrichmentStage`, and the spec-20
-descriptions resume record (`createKtxScanDescriptionResumeStore`), which is now
-keyed on the **descriptions** stage's hash — so changing the embedding model no
-longer busts the descriptions resume record, a strict improvement.
-
-> **No migration bridge.** The stage store and the descriptions resume record are
-> disposable local `.ktx` state (regenerable from a fresh ingest). The new per-stage
-> keys simply miss the old coarse-keyed rows, forcing one full re-enrich on the next
-> run after upgrade. Recreate/ignore stale-shaped records with no compatibility
-> shim, consistent with specs 19/20 and ktx's no-backward-compatibility policy.
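The D1 split can be sketched as follows. This is a minimal illustration, not the shipped code: the `StageInputs` shape, `stableStringify`, `fnv1a`, and `computeStageHashes` are hypothetical names, and the 32-bit FNV-1a hash is only a dependency-free stand-in for whatever digest the real implementation uses.

```typescript
// Hypothetical input shapes for illustration; the real identity types live in ktx.
interface StageInputs {
  snapshotDigest: string;
  llmIdentity: { model: string; baseUrlConfigured: boolean };
  embeddingIdentity: { model: string; dimensions: number; batchSize: number };
  relationshipSettings: { llmProposals: boolean; detectionBudgetMs: number };
  descriptionDigest: string;
}

type Stage = "descriptions" | "embeddings" | "relationships";

// Serialize with sorted object keys so the result does not depend on key order.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((item) => stableStringify(item)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const parts = Object.keys(record)
      .sort()
      .map((key) => JSON.stringify(key) + ":" + stableStringify(record[key]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// 32-bit FNV-1a: a dependency-free stand-in for a real cryptographic digest.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Each stage hashes only its own slice of the inputs (D1).
function computeStageHashes(inputs: StageInputs): Record<Stage, string> {
  const { snapshotDigest, llmIdentity, embeddingIdentity, relationshipSettings, descriptionDigest } = inputs;
  const hashSlice = (stage: Stage, slice: unknown): string =>
    fnv1a(stage + ":" + stableStringify(slice));
  return {
    descriptions: hashSlice("descriptions", { snapshotDigest, llmIdentity }),
    embeddings: hashSlice("embeddings", { snapshotDigest, embeddingIdentity, descriptionDigest }),
    relationships: hashSlice("relationships", { snapshotDigest, relationshipSettings, llmIdentity }),
  };
}
```

With this shape, flipping `relationshipSettings.llmProposals` changes only the `relationships` hash, and swapping `embeddingIdentity.model` changes only the `embeddings` hash, which is the isolation property D1 asks for.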
-
-### D2 — `--stages <comma-list>` selects a subset; one gate, no new mode
-
-Add `ktx ingest [connectionId] --stages <comma-list>`, a non-empty subset of
-`descriptions,embeddings,relationships`. Plural because it takes a **set**:
-`--stages relationships` and `--stages descriptions,embeddings` both read naturally,
-and the plural signals "list expected." Flag absent = all three (today's behavior).
-
-A Commander custom parser validates each name against the canonical stage registry
-and parses into an ordered, de-duplicated set. **An unknown or empty stage name is a
-hard `InvalidArgumentError`** — never silently ignored. The set threads CLI →
-`runKtxPublicIngest` (`KtxScanArgs`) → `runLocalScan` → `runLocalScanEnrichment`.
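A rough sketch of the validation described above. The local `InvalidArgumentError` class is a stand-in for the one Commander exports, so the example runs without dependencies; `ENRICHMENT_STAGES` and `parseStagesSketch` are hypothetical names, and returning the set in canonical registry order (rather than input order) is an assumption, since the spec only says "ordered".

```typescript
// Stand-in for Commander's InvalidArgumentError so the sketch has no dependencies.
class InvalidArgumentError extends Error {}

// Hypothetical registry; the spec points at a canonical one in enrichment-state.ts.
const ENRICHMENT_STAGES = ["descriptions", "embeddings", "relationships"] as const;
type EnrichmentStage = (typeof ENRICHMENT_STAGES)[number];

function isEnrichmentStage(name: string): name is EnrichmentStage {
  return (ENRICHMENT_STAGES as readonly string[]).indexOf(name) !== -1;
}

// Parse "a,b" into a de-duplicated list in canonical registry order.
// Unknown or empty names are hard errors, never silently dropped.
function parseStagesSketch(raw: string): EnrichmentStage[] {
  const selected = new Set<EnrichmentStage>();
  for (const part of raw.split(",")) {
    const name = part.trim();
    if (name === "") {
      throw new InvalidArgumentError("Empty stage name in --stages.");
    }
    if (!isEnrichmentStage(name)) {
      throw new InvalidArgumentError(
        `Unknown stage "${name}". Expected one of: ${ENRICHMENT_STAGES.join(", ")}.`,
      );
    }
    selected.add(name);
  }
  return ENRICHMENT_STAGES.filter((stage) => selected.has(stage));
}
```

In a Commander program, a function like this would be passed as the custom argument parser of the `--stages` option; `parseStagesSketch("embeddings,descriptions,embeddings")` yields `["descriptions", "embeddings"]`, while `"foo"`, `""`, and `"descriptions,foo"` all throw.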
-
-Inside enrichment the run set is **`(mode/provider-eligible stages) ∩ (selected
-stages)`** — a single gate. Each existing stage block additionally checks
-membership in the selected set (`descriptions`/`embeddings` already gate on
-`mode === 'enriched'` + providers; `relationships` on `shouldDetectRelationships`).
-This adds **no** new `KtxScanMode` variant and **no** second parallel selection
-path; `mode` keeps meaning "the connection's enrichment level," and `--stages` means
-"which of those stages to (re)compute this run." A named stage that cannot run
-because a prerequisite is absent (e.g. `--stages embeddings` with no embedding
-provider configured) MUST fail or warn clearly, never silently no-op.
-
-> Rejected alternative — repurpose `mode` (`--stages relationships` →
-> `mode: 'relationships'`). It only expresses single-stage cases, leaves
-> `descriptions,embeddings` with no mode, and creates two ways to say "relationships
-> only." The explicit stage set is the one canonical selector.
-
-### D3 — A named stage force-re-runs; per-table resume still avoids re-paying
-
-Naming a stage in `--stages` carries the intent "recompute this," so a named stage
-**re-enters its `compute()`, bypassing the spec-19 completed-row short-circuit** in
-`runEnrichmentStage` (`local-enrichment.ts:538–547`). The spec-20 machinery still
-applies **inside** `compute()`:
-
- `--stages descriptions` re-enters `generateDescriptions`, which loads the
-  per-table resume record and re-issues LLM calls **only for the still-null/failed
-  tables** (when the descriptions hash is unchanged) — the "fill thin coverage with
-  a longer `KTX_ENRICH_LLM_TIMEOUT_MS`" case, paying only for the gaps.
- A genuine input change (e.g. switching the LLM → a new descriptions hash)
-  invalidates the resume record and rebuilds the stage fully, as today.
-
-Stages **not** named are skipped entirely — not run, not resumed — and their
-on-disk artifacts are left exactly as they are (additive write; preserve-others is
-already the behavior). The **no-flag default is unchanged**: all eligible stages
-run, the completed-row short-circuit is respected (spec-19 cross-run resume).
-
-Behavior follows from the input (did you explicitly name the stage?), not the call
-path. A consequence to state plainly: `--stages descriptions,embeddings,relationships`
-is **not** identical to passing no flag — naming all three is the explicit "force a
-full enrichment recompute," whereas no flag is "ingest, resuming whatever is done."
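The D2 gate and the D3 force-rerun rule combine into a small decision table. The sketch below is illustrative only; `planStageActions` and the `StageAction` labels are hypothetical and do not correspond to names in the ktx codebase.

```typescript
const STAGES = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof STAGES)[number];

// "resume":  run, honoring the completed-row short-circuit (the no-flag default).
// "force":   re-enter compute() even if a completed row exists (stage was named).
// "skip":    do not run; leave on-disk artifacts untouched.
// "blocked": named but ineligible; the caller must fail or warn, never no-op silently.
type StageAction = "resume" | "force" | "skip" | "blocked";

function planStageActions(
  eligible: ReadonlySet<Stage>,
  selected: readonly Stage[] | undefined,
): Record<Stage, StageAction> {
  const plan = {} as Record<Stage, StageAction>;
  for (const stage of STAGES) {
    const isEligible = eligible.has(stage);
    if (selected === undefined) {
      // No --stages flag: today's behavior, every eligible stage may resume.
      plan[stage] = isEligible ? "resume" : "skip";
    } else if (selected.indexOf(stage) === -1) {
      plan[stage] = "skip";
    } else {
      plan[stage] = isEligible ? "force" : "blocked";
    }
  }
  return plan;
}
```

The table makes the D3 consequence visible: with all stages eligible, `selected = undefined` plans `resume` for each stage, whereas naming all three plans `force` for each.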
-
-### D4 — Downstream staleness: one real edge, content-addressed, surfaced not silent
-
-The only hard dependency between stages is **`descriptions → embeddings`**
-(embeddings embed the description text; `relationships` is decoupled, D5). Two
-mechanisms keep it correct without a hardcoded dependency table:
-
- **Self-healing via content-addressing.** Because the embeddings hash includes
-  `descriptionDigest` (D1), re-running `descriptions` changes that digest, so a
-  later embeddings run (or a full ingest) sees a hash miss and recomputes — stale
-  embeddings can never silently persist across a future embeddings run. (Without
-  this, the embeddings hash would be unchanged after a description edit and a later
-  run would wrongly short-circuit on stale vectors.)
- **Surfaced immediately.** After a selective run, for each **unselected** stage that
-  has artifacts on disk, recompute its *current* per-stage hash from on-disk state
-  and compare it to the stored completed-row hash; if they differ, emit a
-  **recoverable `enrichment_stage_stale` warning** naming the stale stage and the
-  cascade command (e.g. `--stages descriptions,embeddings`). This is derived from the
-  system's own state — it also catches "you changed the embedding model in `ktx.yaml`
-  but only ran `--stages descriptions`."
-
-The run **never silently leaves a stale-but-unflagged downstream**, and **never
-silently auto-cascades** extra work — the operator is told and decides. Re-running
-`descriptions` does **not** flag `relationships` stale (D5).
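The derived staleness check can be sketched as a pure comparison of stored and recomputed hashes. Everything here is an assumption made for illustration: `detectStaleStages`, the `StaleWarning` shape, and the warning text are hypothetical, and "has artifacts on disk" is approximated by "has a stored completed-row hash".

```typescript
const STAGES = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof STAGES)[number];

interface StaleWarning {
  code: "enrichment_stage_stale";
  stage: Stage;
  message: string;
}

function detectStaleStages(
  selected: readonly Stage[],
  storedHashes: Partial<Record<Stage, string>>,
  currentHashes: Record<Stage, string>,
): StaleWarning[] {
  const warnings: StaleWarning[] = [];
  for (const stage of STAGES) {
    // Selected stages were just recomputed, so only unselected ones can be stale.
    if (selected.indexOf(stage) !== -1) continue;
    const stored = storedHashes[stage];
    // No completed row means there is nothing on disk to go stale.
    if (stored === undefined) continue;
    if (stored !== currentHashes[stage]) {
      warnings.push({
        code: "enrichment_stage_stale",
        stage,
        message: `Stage "${stage}" is stale; re-run with --stages ${stage}.`,
      });
    }
  }
  return warnings;
}
```

Because the `relationships` hash does not include description content (D5), a descriptions re-run leaves its current hash equal to the stored one, so only `embeddings` is reported. The check never triggers the downstream work itself; it only reports.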
-
-### D5 — Relationships are decoupled from description content, but still get it as context
-
-`relationships` keys on `{ snapshot, relationshipSettings, llmIdentity }` and is
-**not** invalidated or stale-flagged by a description change (decision X). Rationale:
-relationships are the low-value, best-effort, expensive-to-probe stage (spec 19's
-own framing); coupling them to description content would make every routine
-description re-run also invalidate joins — re-opening the exact over-invalidation
-this spec exists to close.
-
-Independently, a `relationships`-only run (descriptions stage not running this
-invocation) MUST **hydrate its working schema from the persisted on-disk enriched
-`_schema`** (AI descriptions + embeddings) so `llmProposals` runs with full
-description context, not raw column names. Today the relationship stage builds its
-schema from the bare snapshot (db comments only — `local-enrichment.ts:621,688,740`
-never merge the AI descriptions), so this also closes a latent gap: both the
-full-run and the relationships-only paths MUST feed `llmProposals` the
-best-available descriptions (fresh-this-run if `descriptions` ran, else on-disk) —
-behavior from inputs, not path.
-
-### D6 — Scope: enrichment stages only, composable with existing flags
-
-`--stages` controls only the three enrichment stages. It is **orthogonal to and
-composable with** the existing `--no-query-history` flag — a pure joins backfill
-across everything is `ktx ingest --all --stages relationships --no-query-history`.
-Schema introspection still runs (it is the hash substrate and the enrichment base,
-and it is cheap — no LLM). The stage-name namespace is built as a **registry** so it
-can later extend to the broader scan phases (schema / query-history / source /
-memory) and subsume the inconsistent negative `--no-query-history` flag — but that
-unification is **out of scope** here.
-
-## Requirements
-
-### 1. Per-stage input hashes
-
-Each enrichment stage MUST key its cache lookup/save and (for `descriptions`) its
-resume record on a hash of only that stage's own inputs, per D1
-(`descriptions` ← snapshot + LLM identity; `embeddings` ← snapshot + embedding
-identity + a digest of the embedded description text; `relationships` ← snapshot +
-relationship settings + LLM identity). Changing one stage's inputs MUST invalidate
-**only** that stage. The single coarse `computeKtxScanEnrichmentInputHash` over
-`{ snapshot, mode, detectRelationships, providerIdentity, relationshipSettings }`
-MUST be removed in favor of per-stage computation. The stage store and the
-descriptions resume record MAY be recreated without a migration bridge (disposable
-local state).
-
-### 2. `--stages` flag with strict validation
-
-`ktx ingest` MUST accept `--stages <comma-list>`, a non-empty subset of
-`descriptions,embeddings,relationships`, defaulting (when absent) to all three. An
-unknown or empty stage name MUST be a hard parse error (`InvalidArgumentError`),
-never silently ignored. The selected set MUST thread through to enrichment and gate
-which stage blocks run as `(mode/provider-eligible) ∩ (selected)` — one gate, no new
-`KtxScanMode` variant, no second selection path. A selected stage whose prerequisite
-is missing MUST fail or warn clearly, not silently no-op.
-
-### 3. Selecting a stage force-re-runs it; unselected stages are preserved
-
-A stage named in `--stages` MUST re-enter its `compute()`, bypassing the
-completed-stage short-circuit, while still using the spec-20 per-table resume record
-so `descriptions` re-issues LLM calls only for still-failed tables (unchanged hash)
-and rebuilds fully on a changed hash. A stage **not** named MUST NOT run and MUST
-leave its on-disk artifacts untouched. The no-flag default MUST preserve spec-19
-cross-run resume (all eligible stages, completed-row short-circuit respected).
-
-### 4. Downstream staleness is surfaced, never silent
-
-After a selective run, the run MUST emit a recoverable `enrichment_stage_stale`
-warning for every **unselected** stage whose current per-stage hash no longer
-matches its stored completed-row hash (derived from on-disk state, naming the stage
-and the cascade command). The embeddings hash MUST include a digest of the embedded
-description text so a later embeddings run self-heals after a description change. The
-run MUST NOT silently leave a stale-but-unflagged downstream and MUST NOT silently
-auto-cascade. A description change MUST NOT stale-flag `relationships`.
-
-### 5. Relationships run with description context
-
-When the `relationships` stage runs without `descriptions` having run in the same
-invocation, it MUST hydrate its working schema from the persisted on-disk enriched
-`_schema` (AI descriptions + embeddings) so `llmProposals` has the same description
-context as a full enriched run, not bare column names. The full-run and
-relationships-only paths MUST feed `llmProposals` descriptions consistently.
-
-### 6. No regression for normal ingests
-
-A normal `ktx ingest` with no `--stages` flag MUST produce the same artifacts as
-today (descriptions, embeddings, manifest, relationships) and MUST preserve spec-19
-cross-run resume and spec-20 per-table description resume. The per-stage hash split
-MUST NOT change a normal run's output, only which stages a *changed* input
-invalidates.
-
-## Acceptance criteria
-
- **Per-stage invalidation isolation:** flipping `scan.relationships.llmProposals`
-  re-runs only `relationships` (descriptions + embeddings resolve from cache, no LLM
-  description calls, no re-embedding); swapping the embeddings model re-runs only
-  `embeddings`; switching the description LLM re-runs only `descriptions`. Verified by
-  asserting no LLM description calls / no embed calls for the unaffected stages.
- **Flag parse + validation:** `--stages relationships` and
-  `--stages descriptions,embeddings` parse to the right set; `--stages foo`,
-  `--stages` (empty), and `--stages descriptions,foo` each fail with a clear
-  `InvalidArgumentError`.
- **Resume-aware force-rerun:** on a connection whose `descriptions` stage completed
-  with K failed/null tables (unchanged hash), `--stages descriptions` re-issues LLM
-  calls for exactly those K tables and leaves the already-good descriptions
-  untouched; the run completes and the K are now enriched. A changed descriptions
-  hash instead rebuilds all tables.
- **Preserve others:** after `--stages descriptions`, the on-disk `embeddings` and
-  `relationships` artifacts are byte-stable (unselected stages did not run).
- **Derived staleness warning:** after `--stages descriptions` changes the
-  descriptions, the run emits `enrichment_stage_stale` for `embeddings` (its
-  recomputed hash diverged) and does **not** emit it for `relationships` (decision
-  X); a subsequent `--stages embeddings` clears it.
- **Relationships context:** a `--stages relationships` run on an already-described
-  connection feeds the on-disk AI descriptions into `llmProposals` (verified: the
-  proposal prompt carries descriptions, not just column names).
- **No regression:** a normal uninterrupted `ktx ingest` (no flag) yields identical
-  artifacts and the same descriptions/embeddings/relationship output as today, with
-  spec-19/20 resume intact.
-
-## Non-goals
-
- **Unifying `--stages` with the broader scan phases or `--no-query-history`.** The
-  namespace is built to extend later; this spec ships only the three enrichment
-  stages, composable with the existing query-history flag (D6).
- **A new `KtxScanMode` variant or a second stage-selection path.** One gate,
-  `(eligible) ∩ (selected)` (D2).
- **Coupling `relationships` to description content** (decision X, D5). Improving
-  descriptions does not invalidate or stale-flag joins.
- **Auto-cascading downstream re-runs.** Staleness is surfaced as a warning; the
-  operator chooses to cascade (D4).
- **Capturing prompt/code-level description-prompt changes in the hash.** The
-  descriptions hash keys on snapshot + LLM identity (config/model), not the prompt
-  text; a pure prompt improvement that does not change a hash input will not
-  force-rebuild already-good descriptions. Forcing that is out of scope — the
-  operator changes a real input or selects the stage with a changed config.
- **Re-implementing spec 19 (cross-run stage resume, completed-row store) or spec 20
-  (per-table description resume, enforced timeout).** This spec composes above them:
-  it splits the key those stages resume on and adds the CLI surface to select and
-  force-re-run stages.
- **A general per-phase incremental-flush framework.** The selection mechanism is the
-  three enrichment stages; it is not a generic abstraction over every ingest phase.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns the
-design.
-
- **Coarse hash → per-stage hashes** — `context/scan/enrichment-state.ts`
-  (`computeKtxScanEnrichmentInputHash` `:78`, `ComputeKtxScanEnrichmentInputHashInput`
-  `:57`): replace with per-stage hash functions (or one function taking a per-stage
-  input slice). `context/scan/local-enrichment.ts` (`:611` single hash; the three
-  `runEnrichmentStage` calls at `descriptions` ~`:635`, `embeddings` ~`:666`,
-  `relationships` ~`:722`; `runEnrichmentStage` `:524` and its short-circuit
-  `:538–547`). The `descriptions` hash also feeds `generateDescriptions`'
-  `resumeStore.load(inputHash)` (`:345`).
- **Provider-identity decomposition** — `context/scan/local-scan.ts`
-  (`localScanProviderIdentity` `:241–255`, the enrichment call site `:498–537`):
-  split into `llmIdentity` / `embeddingIdentity`, drop the redundant `mode` /
-  `relationships` re-encoding, and pass each stage only its slice.
- **`descriptionDigest`** — `context/scan/local-enrichment.ts` (`buildEmbeddings`
-  `:457–486`) and `context/scan/embedding-text.ts` (`buildKtxColumnEmbeddingText`
-  `:17–44`): digest the resolved per-column/table description text that the embeddings
-  consume, and fold that digest into the embeddings hash.
- **CLI flag** — `commands/ingest-commands.ts` (`:26–49` option declarations,
-  `:51–104` action handler): add `--stages` with a custom parser that validates
-  against the canonical stage registry (`KTX_SCAN_ENRICHMENT_STAGES` in
-  `enrichment-state.ts:4`) and rejects unknown/empty names with `InvalidArgumentError`.
-  Thread through `public-ingest.ts` (`KtxScanArgs` build `:969–978`, `mode: 'enriched'`
-  `:973`) → `scan.ts` (`runKtxScan`) → `local-scan.ts` (`runLocalScan`) →
-  `runLocalScanEnrichment`.
- **Stage gating + force-rerun** — `context/scan/local-enrichment.ts`: gate each stage
-  block on membership in the selected set (`descriptions` `:632`, `embeddings`
-  `:663–665`, `relationships` `:720`); make a named stage bypass the completed-row
-  short-circuit in `runEnrichmentStage` while the inner `compute()` keeps the spec-20
-  per-table resume. `KtxLocalScanEnrichmentInput` (`:60–85`) gains the selected-stage
-  set.
- **Staleness detection + warning** — `context/scan/local-enrichment.ts` (after the
-  stage blocks): recompute each unselected stage's current hash from on-disk state,
-  compare to the stored completed-row hash, push a recoverable warning on mismatch.
-  Add `enrichment_stage_stale` to the `KtxScanWarningCode` union in
-  `context/scan/types.ts` (alongside `relationship_detection_partial`).
- **Relationships description context** — `context/scan/local-enrichment.ts`
-  (`schema` built at `:621`/`:688`, passed to `discoverKtxRelationships` `:736–746`):
-  hydrate `schema` with the best-available descriptions (fresh-this-run or loaded from
-  the on-disk `_schema` via `loadExistingManifestState`,
-  `local-enrichment-artifacts.ts`) before relationship detection.
- **Stage store + resume record** —
-  `context/scan/sqlite-local-enrichment-state-store.ts`
-  (`local_scan_enrichment_stages`, PK `(connection_id, stage, input_hash)`,
-  `findCompletedStage`, `saveCompletedStage`); `createKtxScanDescriptionResumeStore`
-  (`local-enrichment-artifacts.ts:286–332`, path `:265–267`, inputHash gate
-  `:305–307`) — both now keyed on the relevant per-stage hash. No migration bridge.
- **Config inputs** — `context/project/config.ts` (`scanRelationshipsSchema`
-  `:171–218` incl. `llmProposals` `:174` and `detectionBudgetMs`;
-  `scan.enrichment.embeddings` model/dimensions/batchSize; `llm.models.default`,
-  `llm.provider.gateway.base_url`): the sources of each per-stage identity slice.
- **Tests** — per-stage invalidation isolation (flip one input, assert only the
-  matching stage recomputes); `--stages` parse/validate (good subsets + unknown/empty
-  rejected); resume-aware force-rerun (`--stages descriptions` retries only the null
-  tables, leaves good ones, completes); preserve-others (unselected artifacts
-  byte-stable); derived staleness (`enrichment_stage_stale` fires for embeddings after
-  a descriptions change, not for relationships; cleared by a later `--stages
-  embeddings`); relationships-only run feeds on-disk descriptions to `llmProposals`;
-  regression — a normal no-flag ingest yields identical artifacts with spec-19/20
-  resume intact.
- After implementing, rebuild and re-link so the playground picks it up:
-  `pnpm run build && pnpm run link:dev`.
- **Docs:** add `--stages` to the `ktx ingest` CLI reference
-  (`docs-site/content/docs/cli-reference/`) and note the per-stage cache behavior
-  where enrichment/ingest is described.
-
-## Benchmark context (motivation, not a requirement)
-
-Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A
-level-aware audit found (a) a tail of BigQuery datasets with poor *column*-description
-coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …) that want a **`descriptions`-only**
-re-run with a longer timeout, and (b) a desire to **backfill joins** across all
-already-ingested datasets after enabling `llmProposals` — without re-paying for
-descriptions. Both were blocked by the coarse single `inputHash` (flipping
-`llmProposals` or re-describing invalidated the whole enrichment) and the absence of a
-stage-selective CLI flag. The benchmark merely exercised multi-backend ingestion at
-scale; the gap and the fix are generic production operability. Do not
-encode any benchmark specifics in the implementation.
-
-## Implementation notes
-
-Shipped on branch `write-feature-spec-wiki`. All six requirements implemented;
-all acceptance criteria covered by tests.
-
-**What was built / where:**
-
- **Per-stage hashes (D1, Req 1).** `context/scan/enrichment-state.ts`: removed the
-  coarse `computeKtxScanEnrichmentInputHash` and added
-  `computeKtxDescriptionsStageHash` (snapshot + `llmIdentity`),
-  `computeKtxEmbeddingsStageHash` (snapshot + `embeddingIdentity` + `descriptionDigest`),
-  `computeKtxRelationshipsStageHash` (snapshot + `relationshipSettings` + `llmIdentity`),
-  plus `computeKtxScanDescriptionDigest` and the `KtxScanLlmIdentity` /
-  `KtxScanEmbeddingIdentity` types. `KTX_SCAN_ENRICHMENT_STAGES` is now exported as the
-  canonical registry. `local-scan.ts` `localScanProviderIdentity` was split into
-  `localScanLlmIdentity` + `localScanEmbeddingIdentity` (dropping the redundant
-  `mode`/`relationships` re-encoding). `mode`/`detectRelationships` dropped out of the
-  keys. No migration bridge — the stage store + descriptions resume record just miss the
-  old coarse-keyed rows.
- **`descriptionDigest` (D1/D4).** `local-enrichment.ts`: extracted
-  `buildKtxColumnEmbeddingTexts(snapshot, descriptions)`, shared by the embeddings stage
-  and the digest, so the embeddings hash content-addresses the exact text the model sees.
- **`--stages` flag (D2/D6, Req 2).** `commands/ingest-commands.ts`:
-  `parseEnrichmentStagesOption` (Commander parser) validates against the registry,
-  rejects unknown/empty with `InvalidArgumentError`, returns an ordered de-duplicated
-  set; threaded through `KtxPublicIngestArgs` → `context-build-view` → `KtxScanArgs` →
-  `RunLocalScanOptions` → `KtxLocalScanEnrichmentInput`. One gate
-  (`(eligible) ∩ (selected)`); no new `KtxScanMode`. A selected-but-ineligible stage
-  emits a new `enrichment_stage_skipped` warning (never a silent no-op).
- **Force-rerun (D3, Req 3).** `runEnrichmentStage` gained `forceRecompute`; a named
-  stage bypasses the spec-19 completed-row short-circuit while `generateDescriptions`
-  still consults the spec-20 per-table resume record (retries only failed tables on an
-  unchanged hash).
- **Descriptions hydration + `llmProposals` context (D5, Req 5).** `runLocalScanEnrichment`
-  resolves best-available descriptions (fresh-this-run, else on-disk via a lazy
-  `loadPriorDescriptions` thunk wired from `local-scan.ts` →
-  `loadOnDiskDescriptionUpdates` in `local-enrichment-artifacts.ts`). `snapshotToKtxEnrichedSchema`
-  now merges `ai` descriptions, and `relationship-llm-proposal.ts` `buildEvidencePacket`
-  now carries the resolved description text — closing the latent gap on **both** the
-  full-run and relationships-only paths.
- **Derived staleness (D4, Req 4).** `enrichment_stage_stale` warning code +
-  `findLatestCompletedStage` on the state store (interface + sqlite + test store). After a
-  selective run, each unselected stage with a completed row is compared against its
-  freshly recomputed hash; a mismatch warns and names the cascade command. Relationships
-  are never flagged by a description change (decoupled per D5).
- **Docs.** `docs-site/content/docs/cli-reference/ktx-ingest.mdx`: `--stages` flag row, a
-  "Selecting enrichment stages" section (per-stage cache, force-rerun, staleness), and
-  examples.
-
-**Deviation from the spec: on-disk hydration covers descriptions only, not embeddings.** D5 states a
-relationships-only run should hydrate "AI descriptions **and** embeddings" from the
-on-disk `_schema`. Investigation found the `_schema` manifest shards store only
-descriptions; embedding vectors are written to a **syncId-scoped** `enrichment/embeddings.json`
-that no code reads back, and each run mints a fresh syncId — so there is no durable
-per-connection embeddings artifact to hydrate from. A relationships-only run therefore
-hydrates **descriptions** (required for, and verified against, the `llmProposals`
-acceptance criterion) but **not** embeddings. Consequence: a `--stages relationships`
-backfill gets deterministic + name-based + LLM-proposed candidates (the point of
-`llmProposals`), but not the embedding-similarity candidates a full run would add.
-Durable embeddings hydration (persist vectors at a stable per-connection path, or read
-them from the vector index) is a clean follow-on and was left out of scope.
-
-**Tests:** `enrichment-state.test.ts` (per-stage hash stability + isolation),
-`commands/ingest-commands.test.ts` (parser good/bad subsets, threading, text-capture
-guard), `local-enrichment.test.ts` (force-rerun bypasses short-circuit + preserves
-others, naming all three forces a full recompute, per-stage invalidation isolation,
-prerequisite warning, on-disk descriptions reach `llmProposals`, resume-aware forced
-descriptions rerun, derived `enrichment_stage_stale` fires for embeddings/not
-relationships and clears after re-embed). Full `pnpm --filter @kaelio/ktx run test`,
-`type-check`, `dead-code`, and `build` pass. (One pre-existing unrelated failure in
-`test/skills/analytics-skill-content.test.ts` — the analytics `SKILL.md` lacks a
-`**Window functions**` heading the test expects — was present before this work and left
-untouched.)
-
---
-
-## ⚠️ Defect found in post-implementation validation (2026-06-24)
-
-**`--stages` subset excluding `descriptions` WIPES existing on-disk descriptions.** Violates Req 3
-(unselected stages are preserved: a selective run must never delete another stage's artifacts).
-
-**Reproduction (deterministic):**
- `northwind` before: 110 `ai:` column/table descriptions, 0 join edges.
- `ktx-dev ingest northwind --stages relationships` → completes in ~35s, adds **22 join edges** ✅
-  but the rewritten `public.yaml` has **0 descriptions** (no `ai:`, no `db:`, columns bare). ❌
- A full `ktx-dev ingest northwind` (all stages) restores 110 descriptions + keeps the 22 joins.
-
-**Likely root cause:** the relationships-only path rewrites the schema from the raw snapshot + only the
-freshly-run stage. The implementation notes claim `snapshotToKtxEnrichedSchema` merges `ai` descriptions
-and that descriptions are hydrated "fresh-this-run, else on-disk via `loadPriorDescriptions`" — but on the
-**write path** of a subset run the prior descriptions are NOT merged into the emitted schema (they reach
-the `llmProposals` evidence packet only). So the on-disk `_schema` loses them.
-
-**Impact:** blocks the intended joins-everywhere backfill (`--stages relationships` across all dbs) and the
-`--stages descriptions`-only re-runs — either would destroy the unselected stage's artifacts across every
-db. Caught on a 1-db validation before any rollout.
-
-**Acceptance fix:** after any `--stages` subset, the on-disk `_schema` must **retain all prior `ai:`/`db:`
-descriptions** (and prior joins when descriptions-only) for stages not named — only the named stages'
-artifacts change. Add a regression test that ingests a fully-enriched fixture, runs `--stages relationships`,
-and asserts description count is unchanged while joins increase.
-
-### ✅ Fixed (2026-06-24)
-
-**Real root cause (deeper than the first diagnosis):** the wipe happened in **two** places, and the first
-fix attempt only addressed one. `runLocalScan` (`context/scan/local-scan.ts`) writes the **structural**
-manifest shard from the bare snapshot *before* enrichment runs; that write merges with the on-disk shard,
-but the merge (`mergeDescriptionsPreservingExternal`, `live-database/manifest.ts`) treats `ai`/`db` as
-**scan-managed** and overwrites them with whatever the run emits — and the structural write emits none. So a
-subset run deleted the descriptions on the structural pre-write, *then* `runLocalScanEnrichment` read the
-already-wiped shard via `loadPriorDescriptions` and had nothing to restore. (A unit-level enrichment test
-passed because it never exercised the structural pre-write — a divergent-harness miss; the regression test
-was rewritten to go through the full `runLocalScan` path.)
-
-**What changed:**
- `runLocalScanEnrichment` (`local-enrichment.ts`) now returns the **best-available** descriptions
-  (`resolveDownstreamDescriptions()` — fresh-this-run if `descriptions` ran, else the on-disk ones) as
-  `descriptionUpdates`, instead of `[]` when the stage is skipped — so the enrichment write re-applies them.
- `runLocalScan` (`local-scan.ts`) now, on a subset run, **captures the prior on-disk descriptions before
-  the structural manifest write** and feeds them to both the structural write and enrichment — so the
-  structural pre-write preserves them too (robust even if relationship detection later fails).
- Joins were already preserved for `--stages descriptions` via the existing manual/inferred
-  `preservedJoins` path; verified by a symmetric test.
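The wipe and its fix can be modeled in a few lines. This is a deliberately simplified sketch, not the real `mergeDescriptionsPreservingExternal` or `runLocalScan`: the flat `Descriptions` record, `mergeScanManaged`, `pickScanManaged`, and `structuralWrite` are all hypothetical, and it assumes only that `ai`/`db` keys are scan-managed while other keys are external.

```typescript
// One column's descriptions keyed by source; "ai" and "db" are scan-managed.
type Descriptions = Record<string, string>;
const SCAN_MANAGED_KEYS = ["ai", "db"];

// Scan-managed keys are replaced by whatever the run emits; external keys survive.
function mergeScanManaged(onDisk: Descriptions, emitted: Descriptions): Descriptions {
  const merged: Descriptions = {};
  for (const [key, value] of Object.entries(onDisk)) {
    if (SCAN_MANAGED_KEYS.indexOf(key) === -1) merged[key] = value;
  }
  return { ...merged, ...emitted };
}

function pickScanManaged(source: Descriptions): Descriptions {
  const picked: Descriptions = {};
  for (const [key, value] of Object.entries(source)) {
    if (SCAN_MANAGED_KEYS.indexOf(key) !== -1) picked[key] = value;
  }
  return picked;
}

// The fix, in miniature: on a subset run that skips "descriptions", capture the
// prior scan-managed entries BEFORE the structural write and re-emit them.
function structuralWrite(
  onDisk: Descriptions,
  selected: readonly string[] | undefined,
): Descriptions {
  const skipsDescriptions = selected !== undefined && selected.indexOf("descriptions") === -1;
  const carried = skipsDescriptions ? pickScanManaged(onDisk) : {};
  return mergeScanManaged(onDisk, carried);
}
```

Calling `mergeScanManaged(onDisk, {})` reproduces the defect (the structural pre-write emits no descriptions, so the `ai` entry disappears), while `structuralWrite(onDisk, ["relationships"])` keeps it.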
-
-**Tests:** `local-scan.test.ts` — a full `runLocalScan` `--stages relationships` run preserves on-disk `ai`
-descriptions while adding a join (RED without the fix, GREEN with it). `local-enrichment.test.ts` — the
-enrichment-layer contract (`--stages relationships` preserves descriptions / `--stages descriptions`
-preserves joins).
-
-**Live validation (northwind, 15 tables):** `--stages relationships` BEFORE `ai:110 joins:22` → AFTER
-`ai:110 joins:22` (descriptions intact; previously wiped to 0). `--stages descriptions` restored the
-descriptions from the spec-20 resume record (`ai:0 → ai:110`) with **no** LLM calls while keeping `joins:22`.
-Full `pnpm --filter @kaelio/ktx run test` (3089 passed), `type-check`, `dead-code`, and `build` pass.
--- a/spider2-specs/specs/22-resumable-and-fault-tolerant-source-ingest.md
+++ b/spider2-specs/specs/22-resumable-and-fault-tolerant-source-ingest.md
@@ -1,463 +0,0 @@
-# Resumable and fault-tolerant source ingest
-
-> Refined spec. No intake draft — surfaced by a real user report, not the
-> playground agent (see Motivation). Lives beside the analogous scan-durability
-> specs 19/20.
->
-> **Scope: make `ktx ingest` (the source-ingest work-unit pipeline behind dbt /
-> Metabase / Notion) survive interruption and partial failure on large
-> projects.** Two compounding gaps live on the source-ingest path: (1) an
-> interrupted run restarts every work unit from scratch — there is no cross-run
-> reuse of already-generated work-unit output, so a multi-day dbt ingest loses
-> *all* progress to a single VPN/network blip; (2) the final integration gate is
-> all-or-nothing — one artifact that cannot pass it (after LLM repair) discards
-> the **entire** run with nothing committed. This is the source-ingest analog of
-> spec 19 (move the durability boundary to the cost boundary so expensive LLM
-> work is not lost) and spec 20 (a stage survives an interruption with per-item
-> durability). It **reuses** the same content-keyed durability primitive those
-> specs established rather than copying it.
-
-## Problem
-
-Two independent failure modes on the source-ingest work-unit (WU) pipeline,
-both confirmed in the current code, both observed by a user on a ~2-day dbt
-ingest. Their union makes large-project ingest brittle: any interruption is
-total loss, and any single unfixable artifact at the end is total loss.
-
-### 1. An interrupted run resumes nothing — every work unit re-runs
-
-`IngestBundleRunner` (`context/ingest/ingest-bundle.runner.ts`) executes a run as
-a sequence of stages: fetch → parse/extract into **work units** → run each WU as
-an isolated agent loop in a child worktree (`runIsolatedWorkUnit` →
-`executeWorkUnit`, `stages/stage-3-work-units.ts`) → integrate the successful WU
-patches → reconcile → finalize → final gates → one atomic squash commit
-(`squashMergeIntoMain`, ~2716). The WU stage is where the LLM cost lives: each WU
-is an agent loop that reads its `rawFiles`/`dependencyPaths` and writes SL/wiki
-artifacts, producing a git patch (`WorkUnitOutcome.patchPath` /
-`patchTouchedPaths`, `stage-3-work-units.ts:31-46`).
-
-The only persisted cross-run state is `SqliteBundleIngestStore`
-(`context/ingest/sqlite-bundle-ingest-store.ts`): run metadata, the final report,
-and provenance — all written at or near **run completion**. There is **no
-checkpoint of completed WU output**. A run that dies mid-flight (the user's
-VPN/network drop) leaves nothing reusable: the next `ktx ingest` re-fetches,
-re-parses, and **re-executes every WU from scratch**, re-paying the entire LLM
-cost. The store even keys `job_id` UNIQUE, so a re-run is a brand-new job with no
-relationship to the interrupted one.
-
-> Observed (user report, large dbt project): a run reached deep into its
-> work-unit progress and was lost to a network blip; the follow-up run started
-> over from zero. On a ~2-day ingest this is the difference between a 5-minute
-> resume and a 2-day redo.
-
-### 2. The final integration gate is all-or-nothing
-
-After all surviving WUs are integrated, `validateFinalIngestArtifacts`
-(`context/ingest/artifact-gates.ts:96`) runs the final gate. It checks, across
-the *integrated* tree:
-
- **intrinsic source validity** — `validateTouchedSources` →
-  `validateWuTouchedSources` (`stages/validate-wu-sources.ts:124`) →
-  `validateSingleSource` (`context/sl/tools/sl-warehouse-validation.ts:56`),
-  which runs a **live warehouse dry-run** (`SELECT * FROM (sql) LIMIT 1`);
- **cross-artifact references** — dangling join targets
-  (`findJoinTargetErrors`, `validate-wu-sources.ts:89`), dangling `wiki→wiki`
-  refs (`validateWikiRefs` → `findMissingWikiRefs`), broken `wiki→sl_ref`s
-  (`validateWikiSlRefs`, `artifact-gates.ts:39`), and broken wiki body refs
-  (`findInvalidWikiBodyRefs`).
-
-On any error it **`throw`s a single concatenated string** (`artifact-gates.ts:129`).
-The runner catches it, runs the LLM repair `repairFinalGateFailure`
-(`runner.ts:2595`, `maxAttempts: 2`), and if repair still fails, **re-throws**
-(`runner.ts:2623`) → `markFailed` → the squash never runs → `commitSha: null`
-(`runner.ts:2729`) → **the whole run is discarded, nothing committed.**
-
-The crucial asymmetry: a WU that fails *on its own terms* never reaches this gate
-— `executeWorkUnit` already validates each WU in isolation (`validateWikiRefs`
-~143, `validateTouchedSources` ~150) and **soft-fails** it (`failWithReset`,
-~155: the WU resets, is excluded from integration, and the run continues). So by
-the time the final gate runs, intrinsic single-source failures are rare. The
-gate fails predominantly on **cross-artifact dangling references**: WU-A's source
-joins to a source WU-B was meant to create, but WU-B failed or was excluded, so
-A's join now points at nothing. Each WU passed *alone*; the break only appears
-once the survivors are integrated — and that break currently nukes the run.
-
-> Observed (user report): a run completed all task generation and then failed at
-> the final integration gate on a **single model**; because the gate is
-> all-or-nothing, that one failure discarded an ~18h run with nothing committed.
-
-## Generic use case (independent of any benchmark)
-
-Anyone ingesting a large warehouse/BI/dbt project with an LLM pipeline will hit
-both failures. Large ingests run long enough that an interruption is a *when*,
-not an *if* (laptop sleep, VPN reconnect, transient provider error, an operator
-ctrl-C on an apparently-stuck run), and a large artifact set makes it
-near-certain that *some* model lands a cross-reference its sibling didn't
-produce. Without cross-run reuse, every interruption is a from-scratch redo of
-the dominant (LLM) cost; without partial commit, one unfixable artifact throws
-away every good one. Both fixes make large-project ingest **resilient and
-resumable**: an interruption costs only the unfinished work, and a single bad
-model costs only that model — not the run. This is core robustness for a
-general-purpose ingestion product.
-
-## Design decisions (resolved during refinement)
-
-These resolve the design space explored during refinement. They constrain the
-implementer; the exact code is theirs (requirement-level, per the specs README).
-
-### D1 — Resume is automatic and content-keyed at the work-unit level
-
-A successful WU's output is cached across runs, keyed by a **content hash of its
-inputs**, with **no `--resume` flag**. Re-running the same `ktx ingest`
-transparently replays any WU whose inputs are byte-identical to a cached success
-and re-runs only the changed, failed, or missing WUs. The key is computed over:
-the contents of the WU's `rawFiles` + `dependencyPaths` (the bytes the WU reads,
-`types.ts:19-28`), the adapter/source identity, and a **version/prompt
-fingerprint** (ktx version + the WU system/user prompt + model role). A changed
-dbt model busts only that model's entry; everything unchanged replays for free.
-
-> No flag, no config knob. Content-keying makes resume automatic; a flag would
-> double the state space for no benefit. This is the same shape scan uses
-> (`computeKtxScanEnrichmentInputHash`, spec 19), reached here for the WU
-> pipeline.
-
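A minimal sketch of the D1 key, not the shipped implementation. The `WorkUnitInputs` shape and the function names are hypothetical, and FNV-1a is used only as a dependency-free stand-in for a real digest such as SHA-256. It illustrates the two properties D1 relies on: the key is independent of file enumeration order, and any change to the input bytes or the version/prompt fingerprint produces a different key.

```typescript
// Hypothetical input shape for illustration; the real WorkUnit type differs.
interface WorkUnitInputs {
  sourceKey: string; // adapter/source identity
  files: Record<string, string>; // path -> contents of rawFiles + dependencyPaths
  ktxVersion: string;
  prompt: string; // WU system + user prompt text
  modelRole: string;
}

// FNV-1a (32-bit): a stand-in for a cryptographic digest, chosen only so the
// sketch has no imports. Do not use it for a real cache key.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

function workUnitInputHash(inputs: WorkUnitInputs): string {
  // Sort paths so the key does not depend on enumeration order.
  const files = Object.keys(inputs.files)
    .sort()
    .map((path) => [path, inputs.files[path]]);
  // JSON.stringify keeps field boundaries unambiguous (no delimiter collisions).
  return fnv1a(
    JSON.stringify([
      inputs.sourceKey,
      inputs.ktxVersion,
      inputs.prompt,
      inputs.modelRole,
      files,
    ]),
  );
}
```

Hashing every fingerprint field errs toward over-keying, which costs at most an unnecessary recompute.
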
-### D2 — The cached unit is the successful WU's patch; replay verifies or recomputes
-
-The cache stores a successful WU's **output artifacts**: its git patch
-(`patchPath` content / `patchTouchedPaths`) plus the metadata integration needs
-(`actions`, `touchedSlSources`, `slDisallowed`). On a cache hit, the runner
-**replays the patch** into the session worktree — no agent loop, no LLM — exactly
-where it would have integrated a freshly-run WU. If a cached patch **fails to
-apply** (the surrounding tree drifted), the entry is discarded and the WU
-**recomputes**. So a stale hit degrades to "recompute," never to a corrupt tree:
-the cache can only make a run faster, never wrong.
-
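The replay-or-recompute rule in D2 can be sketched as follows. This is an illustration under stated assumptions, not the runner's code: `ReplayDeps`, `applyPatch`, and `runAgentLoop` are hypothetical stand-ins, and everything is synchronous for brevity. It also shows the success-only caching rule described later in D4.

```typescript
interface CachedWorkUnit {
  patch: string;
  touchedPaths: string[];
}

type ApplyResult = { ok: true } | { ok: false; reason: string };

// Hypothetical dependencies, injected so the control flow can be shown alone.
interface ReplayDeps {
  cache: Map<string, CachedWorkUnit>;
  applyPatch: (patch: string) => ApplyResult; // applies to the session worktree
  runAgentLoop: () => CachedWorkUnit | null; // null means the WU failed
}

function resolveWorkUnit(
  inputHash: string,
  deps: ReplayDeps,
): "replayed" | "recomputed" | "failed" {
  const hit = deps.cache.get(inputHash);
  if (hit) {
    if (deps.applyPatch(hit.patch).ok) return "replayed"; // no agent loop, no LLM
    deps.cache.delete(inputHash); // stale hit: discard, then fall through
  }
  const fresh = deps.runAgentLoop();
  if (fresh === null) return "failed"; // failures are never cached
  if (!deps.applyPatch(fresh.patch).ok) return "failed";
  deps.cache.set(inputHash, fresh); // only a success becomes a cache entry
  return "recomputed";
}
```

A stale entry costs one extra patch attempt before the normal path runs, so in this sketch a bad hit can slow a run down but cannot change what gets integrated.
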
-### D3 — One durability primitive, shared by scan and ingest
-
-Per the "one capability, one implementation" rule, the content-keyed store is
-**extracted** into a shared primitive and **both** scan and ingest route through
-it — not copied. Scan's `sqlite-local-enrichment-state-store.ts` (PK
-`(connection_id, stage, input_hash)`, `findCompletedStage` / `saveCompletedStage`)
-and its `inputHash` computation (`enrichment-state.ts`) are generalized to a
-content-keyed result cache; scan is migrated onto the shared primitive **in the
-same change** so no second copy exists even transiently. The ingest cache is a
-new logical namespace (e.g. keyed `(connectionId, sourceKey, workUnitInputHash)`)
-on that one store.
-
-> Extract-and-share in one PR, not "build a copy for ingest now, unify later."
-> A temporary fork is exactly the divergence the rule forbids; the one-time
-> extraction cost is paid once and both paths benefit from every later fix.
-
-### D4 — Only successes are cached; failures retry on the next run
-
-A failed WU is **not** recorded as terminal — the next run retries it. WU
-failures on this path are dominantly transient (network, provider stall, an LLM
-slip), and the user's explicit ask is "resume and finish the rest," so a failure
-must not be sticky. This deliberately differs from scan's stage store (which
-caches failed stages and re-throws): there the failure is the stage's
-deterministic verdict; here a WU failure is usually a blip to retry. Caching only
-successes also keeps the invariant simple — a cache entry always means "this
-exact input already produced this exact good output."
-
-### D5 — The final gate becomes non-fatal: deterministic dangling-edge prune
-
-Replace the gate's fatal `throw`-after-repair with a deterministic reconciliation
-that always yields a committable, internally-consistent tree:
-
-1. `validateFinalIngestArtifacts` is refactored to **return structured findings**
-   (the danglers it already computes internally — join targets, `wiki→wiki`,
-   `wiki→sl_ref`, wiki body refs — plus any intrinsic source failure) instead of
-   flattening them into a thrown string.
-2. **Drop the rare self-invalid source first.** A source that fails its *own*
-   validation at the final gate (intrinsic — rare, since stage 3 already filters
-   these) is removed, establishing the surviving artifact set.
-3. **Prune the dead edges in a single pass** over that surviving set. For each
-   dangling reference — whether it pointed at an absent sibling or at a
-   just-dropped source — **remove that reference from its owner** (drop the join
-   entry, remove the `wiki ref` / `sl_ref`, remove the broken body link), keeping
-   the owning artifact. Because nodes are dropped first (step 2) and pruning only
-   removes edges, pruning **cannot create a new dangling edge, so one pass
-   suffices; no fixpoint.**
-4. Re-run the gate to **confirm** the remainder is clean (warehouse dry-runs are
-   cached per D6/D2, ref checks are in-memory, so this is cheap), then squash-commit
-   the remainder. If the confirm pass *still* fails, that is a real bug — fail the
-   run loudly rather than commit a dirty tree.
-
-`repairFinalGateFailure` (the LLM repair, `runner.ts:2595` / `final-gate-repair.ts`)
-is **removed**. The deterministic prune supersedes it for the referential class,
-and the rare intrinsic case is handled by drop.
-
-> **Prune the edge, do not cascade the node.** The rejected alternative drops the
-> *referencing artifact* and, transitively, everything that referenced *it* — a
-> node-quarantine fixpoint that cascades healthy artifacts and needs a closure
-> search, a confirm loop, and an un-apply step. Pruning the dead edge keeps the
-> dependent intact (minus one pointer that never resolved anyway), needs no
-> fixpoint, and acts on findings the gate already produces.
->
-> **Why remove the LLM repair rather than keep it as a pre-prune step.** Repair
-> can occasionally *fix* a ref (e.g. correct a typo'd source name) where prune
-> merely deletes it, preserving marginally more content. We drop it anyway:
-> determinism beats an LLM round-trip with variance on the commit path, prune
-> guarantees a commit where repair could only `throw`, and deleting it is a net
-> maintenance reduction. The decision is reversible — repair could later run as a
-> best-effort pass *before* prune — but the default is prune-only.
-
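Steps 2 and 3 of D5 can be sketched on a toy model where a source is just a name plus its join targets. The `Source` and `PruneReport` shapes are hypothetical, and the `valid` flag stands in for the warehouse dry-run result. The sketch drops self-invalid nodes first, then removes dead edges from the survivors in one pass.

```typescript
// Toy model: `valid` stands in for the intrinsic warehouse dry-run verdict.
interface Source {
  name: string;
  joins: string[];
  valid: boolean;
}

interface PruneReport {
  droppedSources: string[];
  prunedReferences: { owner: string; target: string }[];
}

function dropThenPrune(sources: Source[]): { survivors: Source[]; report: PruneReport } {
  // Step 2: drop self-invalid sources, which fixes the surviving node set.
  const droppedSources = sources.filter((s) => !s.valid).map((s) => s.name);
  const kept = sources.filter((s) => s.valid);
  const present = new Set(kept.map((s) => s.name));

  // Step 3: remove dead edges only. No node is removed here, so no new
  // dangling edge can appear and a single pass is enough.
  const prunedReferences: PruneReport["prunedReferences"] = [];
  const survivors = kept.map((s) => ({
    ...s,
    joins: s.joins.filter((target) => {
      if (present.has(target)) return true;
      prunedReferences.push({ owner: s.name, target });
      return false;
    }),
  }));

  return { survivors, report: { droppedSources, prunedReferences } };
}
```

Running the function again on its own output prunes nothing, which is the property the confirm pass in step 4 checks.
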
-### D6 — Prune runs on the integrated tree, never poisons the cache (resume ∘ prune compose)
-
-Pruning is applied to the **integrated session worktree** at gate time and is
-**re-derived from the current survivor set on every run**. It MUST NOT mutate the
-cached WU patches (D2). This makes resume and prune compose correctly and
-**self-heal**:
-
- Run 1: WU-A (joins to B) succeeds and is cached *with its join intact*; WU-B
-  fails; the gate prunes A's join-to-B from the integrated tree and commits A
-  without it.
- Run 2 (after the root cause is fixed): A's input is unchanged → A **replays
-  from cache with its join restored**; B now succeeds and exists; the gate finds
-  no dangler and commits both, fully linked.
-
-So a ref pruned because of a sibling's failure costs nothing permanent: fixing
-the sibling and re-running restores the link for free. The cache stores
-intent (the WU's real output); prune is a per-run consistency projection over
-whatever survived.
-
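The Run 1 / Run 2 scenario above can be traced with a small sketch. The `Artifact` type, the `projectConsistent` helper, and the `hash-A` key are hypothetical; the point is only that pruning returns a per-run copy and leaves the cached output untouched.

```typescript
type Artifact = { name: string; joins: string[] };

// Per-run projection: returns pruned copies and never mutates its inputs.
function projectConsistent(survivors: Artifact[]): Artifact[] {
  const present = new Set(survivors.map((a) => a.name));
  return survivors.map((a) => ({
    name: a.name,
    joins: a.joins.filter((target) => present.has(target)),
  }));
}

// The cache holds WU-A's real output, join to B included.
const cachedA: Artifact = { name: "A", joins: ["B"] };
const cache = new Map<string, Artifact>([["hash-A", cachedA]]);

// Run 1: WU-B failed, so only A survives and its join to B is pruned.
const run1 = projectConsistent([cachedA]);

// Run 2: A replays from cache unchanged, B now exists, nothing is pruned.
const run2 = projectConsistent([cachedA, { name: "B", joins: [] }]);
```

Here `run1` commits A without the join while `cache` still holds the join, which is why `run2` gets the link back without re-running A.
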
-### D7 — Pruning is faithful and never silent
-
-A pruned reference was, by definition, non-functional (its target was absent), so
-removing it loses nothing executable — and removing dangling SL joins is already
-the established fix for the SL engine's eager orphan-join rejection. Every prune
-and every drop MUST be **recorded in the run report and a trace event** naming
-the artifact, the removed reference, and the absent target. The report status
-MUST reflect partial completion (extend the existing `failedWorkUnits`
-mechanism, `IngestBundleResult`, `types.ts:204-213`, with the pruned-refs /
-dropped-sources detail) so a partial run is visibly partial, never a silent
-"success."
-
-### D8 — Cache state is regenerable; no migration bridge
-
-The WU cache is regenerable local state under `.ktx/`. Its on-disk/SQLite shape
-may change with **no migration bridge** — a stale-shaped or absent cache simply
-forces a full (non-resumed) run, exactly today's behavior. Consistent with ktx's
-no-backward-compatibility policy; the cache is an optimization, never a source of
-truth.
-
-## Requirements
-
-1. **Cross-run WU resume, automatic and content-keyed.** A successful WU's output
-   MUST be cached keyed by a content hash over its input bytes
-   (`rawFiles` + `dependencyPaths`), the adapter/source identity, and a
-   version/prompt fingerprint (ktx version + WU prompt + model role). Re-running
-   `ktx ingest` MUST replay cached successes without an agent loop / LLM call and
-   re-run only changed, failed, or missing WUs. No `--resume` flag and no config
-   knob is added.
-2. **Replay verifies or recomputes.** On a cache hit the runner MUST replay the
-   stored patch into the session worktree; if the patch does not apply cleanly the
-   entry MUST be discarded and the WU recomputed. A cache hit MUST NOT be able to
-   produce a tree different from what a fresh run of that WU would have integrated.
-3. **Only successes are cached.** A failed WU MUST NOT be recorded as terminal; it
-   MUST be retried on the next run.
-4. **Conservative invalidation.** The input hash MUST change when the ktx version,
-   the WU prompt, or the model role changes (bias toward recompute). Under-keying
-   (stale reuse) is a correctness bug; over-keying (an unnecessary recompute) is
-   acceptable.
-5. **The final gate is non-fatal.** A final-gate failure MUST NOT discard the run.
-   `validateFinalIngestArtifacts` MUST return structured findings; the runner MUST
-   deterministically **prune** every dangling reference from its owning artifact
-   and **drop** any source that fails its own validation, then commit the
-   remaining internally-consistent tree.
-6. **Single-pass prune, dependents survive.** Pruning MUST remove dead *edges*, not
-   cascade-drop owning artifacts; it MUST complete in a single pass (no fixpoint)
-   because edge removal cannot create new dangling edges. A dependent that loses
-   one dangling ref MUST otherwise be committed intact.
-7. **Prune composes with resume.** Pruning MUST operate on the integrated tree and
-   MUST NOT mutate cached WU patches. A reference pruned in one run because its
-   target was absent MUST be restored automatically on a later run once the target
-   exists (resume replays the owner's intact patch).
-8. **Confirm before commit.** After pruning/dropping, the gate MUST be re-run on
-   the remainder and MUST pass before the squash; if it still fails the run MUST
-   fail loudly rather than commit a dirty tree.
-9. **`repairFinalGateFailure` is removed.** The LLM final-gate repair path and its
-   obsolete tests/branches MUST be deleted (no dormant compatibility path).
-10. **Every prune/drop is reported.** Each pruned reference and dropped source MUST
-    be recorded in the run report and a trace event (artifact, removed ref, absent
-    target). A run that pruned or dropped anything MUST report as partial, never as
-    an unqualified success.
-11. **One shared durability primitive.** The content-keyed store MUST be a single
-    implementation used by both scan and ingest; scan MUST be migrated onto it in
-    the same change. No second copy may exist, even transiently.
-12. **No regression for clean runs.** A small, uninterrupted run whose every WU
-    passes and whose final gate is clean MUST produce byte-identical artifacts and
-    the same `commitSha`/report shape (modulo new, empty pruned/dropped fields) as
-    today.
-
-## Acceptance criteria
-
- **Resume skips completed work:** interrupt an ingest after K of N WUs have
-  succeeded; re-run the same command (unchanged inputs); the run issues **zero**
-  agent loops / LLM calls for the K cached WUs, runs only the remaining N−K, and
-  produces the same final artifacts as an uninterrupted run.
- **Changed model busts only its entry:** edit one dbt model between runs; the
-  re-run re-executes **only** the WU(s) whose input bytes changed and replays the
-  rest from cache.
- **Stale patch self-corrects:** a cached patch that no longer applies (forced
-  drift in a test) causes that WU to recompute, not a corrupt tree or a crash.
- **Failures retry:** a WU that fails in run 1 (transient error) is **not** cached;
-  run 2 retries it and, on success, integrates it.
- **One bad model no longer nukes the run:** a run where WU-B fails so WU-A's join
-  to B dangles **commits** — A is committed with the dangling join **pruned**, the
-  report lists the pruned ref, and `commitSha` is non-null (contrast: today this
-  throws and commits nothing).
- **No cascade:** in that scenario A (and any other artifact that only referenced
-  B) is committed intact except for the single pruned reference; nothing healthy
-  is dropped.
- **Self-heal:** fix B's root cause and re-run; A replays from cache with its join
-  intact, B succeeds, and the final tree commits both fully linked with no prune.
- **Intrinsic drop:** a source that fails its own warehouse dry-run at the final
-  gate (forced) is dropped, refs to it are pruned, and the rest commits; the drop
-  is reported.
- **Repair is gone:** `repairFinalGateFailure` and its tests no longer exist; the
-  gate path has no LLM call.
- **One store:** scan and ingest both resume through the same content-keyed
-  primitive (one implementation; scan's behavior is unchanged by the migration —
-  spec 19/20 acceptance still passes).
- **Clean-run regression:** a small uninterrupted all-passing ingest yields
-  identical artifacts, `commitSha`, and report (empty pruned/dropped fields) to
-  today.
-
-## Non-goals
-
- **Resuming the cross-WU stages.** Reconciliation, finalization, and the final
-  gate re-run every time; their inputs depend on the full survivor set and their
-  cost is small relative to WU generation. Only WU generation is cached.
- **A `--resume` flag or any timeout/cache config knob.** Content-keying makes
-  resume automatic (D1); one opinionated default is the canonical ktx shape.
- **Caching failed WUs as terminal.** Failures retry (D4).
- **Node-cascade quarantine of the final gate.** Prune edges, do not drop
-  dependents (D5). No closure search, confirm-loop-over-nodes, or un-apply step.
- **Tolerating dangling references (warn instead of remove).** Unsafe — the SL
-  engine eagerly rejects orphan joins — so dead edges must be removed, not kept.
- **Keeping the LLM final-gate repair.** Removed (D5/req 9).
- **A general per-stage resume framework beyond the shared content-keyed store.**
-  The store is the one shared primitive (D3); this spec does not abstract every
-  ingest stage into a resumable framework.
- **Re-implementing spec 19/20 (scan durability).** This spec composes the same
-  primitive onto the source-ingest WU pipeline.
-
-## Implementation orientation
-
-Line numbers drift; treat these as anchors, not addresses. The implementer owns
-the design.
-
- **Run flow + the all-or-nothing seam** — `context/ingest/ingest-bundle.runner.ts`:
-  WU run + integration of successful patches (~1600–1900), the final-gate block
-  (~2549–2587, `runFinalArtifactGates`), the repair-then-rethrow that must be
-  replaced by prune (~2588–2644; the fatal `throw` ~2623), and the atomic squash
-  (~2701–2729; `commitSha: null` when nothing is touched ~2729). The prune step
-  slots between the gate findings and the squash, operating on `sessionWorktree`.
- **Work units & cacheable output** — `context/ingest/types.ts` (`WorkUnit`
-  ~19–28: `rawFiles`/`peerFileIndex`/`dependencyPaths`; `IngestBundleResult`
-  ~204–213: extend with pruned/dropped detail);
-  `context/ingest/stages/stage-3-work-units.ts` (`executeWorkUnit`; the per-WU
-  validation + `failWithReset` ~134–157 that already soft-fails a WU;
-  `WorkUnitOutcome` ~31–46 with `patchPath`/`patchTouchedPaths`/`actions`/
-  `touchedSlSources` — the cache payload). The cache lookup/replay wraps the
-  per-WU execution; only the agent-loop branch is skipped on a hit.
- **The gate (make it return findings)** — `context/ingest/artifact-gates.ts`
-  (`validateFinalIngestArtifacts` ~96; the internal per-artifact danglers from
-  `validateWikiSlRefs` ~39, `validateWikiRefs` ~74, `findInvalidWikiBodyRefs`;
-  the concatenated `throw` ~129 to replace with a structured return);
-  `context/ingest/stages/validate-wu-sources.ts` (`validateWuTouchedSources` ~124;
-  `findJoinTargetErrors` ~89 already returns missing join targets per source —
-  the join-edge danglers to prune); `context/sl/tools/sl-warehouse-validation.ts`
-  (`validateSingleSource` ~56 — the intrinsic warehouse dry-run; its failures are
-  the drop set, not the prune set).
- **Per-ref-type pruners (pair 1:1 with the validators)** — join: remove the
-  offending `joins[]` entry from the source YAML; `wiki refs`/`sl_refs`: remove
-  the entry from page frontmatter (`context/wiki/wiki-ref-validation.ts`
-  `findMissingWikiRefs`); wiki body refs: remove the broken link token
-  (`context/ingest/wiki-body-refs.ts` `findInvalidWikiBodyRefs`). Each pruner is
-  deterministic and edits the integrated worktree only.
- **Remove the LLM repair** — `context/ingest/final-gate-repair.ts`
-  (`repairFinalGateFailure`) and the `constrained-repair.ts` usage for
-  `final_artifact_gate`; delete the call site (~2595) and its tests.
- **Durability primitive to extract & share** —
-  `context/scan/sqlite-local-enrichment-state-store.ts` (`local_scan_enrichment_stages`,
-  PK `(connection_id, stage, input_hash)`, `findCompletedStage`/`saveCompletedStage`),
-  `context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78), and
-  the resume wrapper `runEnrichmentStage` (`context/scan/local-enrichment.ts`).
-  Generalize to a content-keyed result cache; migrate scan onto it; add the ingest
-  namespace. The existing ingest store
-  `context/ingest/sqlite-bundle-ingest-store.ts` (`SqliteBundleIngestStore`) is
-  where ingest-side persistence lives — the WU cache sits alongside it under
-  `.ktx/`.
- **Tests** — resume: run an ingest against a real git-backed project with a fake
-  agent runner, interrupt after K WUs, assert the re-run issues no agent loops for
-  the K and the same artifacts result; changed-input bust; stale-patch recompute;
-  failed-WU retry. Prune: a fixture where one WU fails so a sibling's join/wiki
-  ref dangles → assert the run commits the sibling with the ref pruned, reports the
-  prune, and `commitSha` is non-null; assert no cascade; assert self-heal on a
-  follow-up run; assert intrinsic drop. Migration: spec 19/20 scan acceptance still
-  green on the shared primitive. Regression: a small uninterrupted all-passing
-  ingest is byte-identical to today.
- After implementing, rebuild and re-link so the playground picks it up:
-  `pnpm run build && pnpm run link:dev`.
-
-## Motivation (the real report, not a benchmark)
-
-A user ingesting a fairly large dbt project (~2-day run) hit both gaps together.
-First, an interruption — a VPN drop / network blip — lost all progress because
-ingest cannot resume; they had to restart from scratch. Second, on a later run
-that completed all task generation, a **single model** failed the final
-integration gate, and because the gate is all-or-nothing the one failure
-discarded an ~18h run with nothing committed. Their ask: "some form of resume or
-checkpoint (or at least reusing the patches that were already generated), and a
-way to skip or quarantine a single failing model instead of failing the entire
-run." This spec delivers both — resume via the content-keyed WU cache, and
-partial commit via deterministic dangling-edge pruning. Unlike specs 19/20 this
-gap was surfaced by a real user on a real warehouse, not by the benchmark; the
-fix is generic production hygiene for any large ingest.
-
-## Implementation notes
-
-Shipped on branch `write-feature-spec-wiki` (squash-merge target). All 12
-requirements and every acceptance criterion are covered by committed code and
-tests; the full `@kaelio/ktx` package suite is green.
-
-What was built and where:
-
- **Shared content-keyed durability primitive** — `context/cache/content-result-cache.ts`
-  + `sqlite-content-result-cache.ts` (`SqliteContentResultCache`, `local_content_results`).
-  Scan was migrated onto it in the same change (`context/scan/sqlite-local-enrichment-state-store.ts`
-  is now a thin adapter; the old `local_scan_enrichment_stages` table is dropped),
-  so no second copy exists (D3 / req 11).
- **Content-keyed WU cache + replay** — `context/ingest/work-unit-cache.ts`
-  (`computeIngestWorkUnitInputHash` over raw/dependency bytes + source identity +
-  CLI version + prompt fingerprint + model role; success-only `saveSuccessfulWorkUnitCache`).
-  Replay/recompute and stale-recompute state refresh wrap the WU loop in
-  `ingest-bundle.runner.ts` (D1/D2/D4 / reqs 1–4).
- **Non-fatal final gate** — `artifact-gates.ts` `validateFinalIngestArtifacts`
-  returns structured findings; `context/ingest/final-gate-prune.ts` deterministically
-  drops self-invalid sources and prunes dangling edges in a single pass, then a
-  confirm gate runs before squash (D5/D6 / reqs 5–8). `finalGatePrunedReferences`
-  / `finalGateDroppedSources` are recorded in the report + trace and surface as a
-  `partial` outcome (D7 / req 10). `repairFinalGateFailure` and its tests are
-  deleted (req 9).
-
-Deviations / decisions worth noting (all preserve spec intent):
-
- **Cache stores artifact content snapshots (payload schema v2), not just a raw
-  git patch.** Replay materializes the owner's artifacts against the *current*
-  base, so a ref pruned in one run because a sibling failed is restored for free
-  on a later run once the sibling exists — without re-running the owner's agent
-  loop (D2/D6 / req 7 self-heal). A drifted/stale snapshot degrades to recompute.
- **Final-gate prune/drop resolves sources through the canonical
-  `resolveSlSourceFile` resolver**, not a derived `semantic-layer/<conn>/<name>.yaml`
-  path, so it works for uppercase / hash-derived source filenames (not only
-  lowercase demo names).
- **`executeWorkUnit` defers pruneable cross-artifact findings** (missing join
-  target / wiki ref / sl_ref) to the final gate instead of soft-failing the WU;
-  only intrinsic `source_validation` failures remain fatal at the WU level. This
-  is what lets a sibling-failed WU's owner survive to be pruned rather than be
-  excluded upstream (reqs 5–7, "no cascade").
- The raw report record keeps `status: 'completed'`; partial completion is derived
-  by `ingestReportOutcome` from the populated prune/drop fields.
--- a/spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md
+++ b/spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md
@@ -1,66 +0,0 @@
-# Multi-connection routing guidance in the ktx-analytics skill
-
-## Problem
-
-The agent-facing `ktx-analytics` skill (installed into agent environments via
-the ktx skills/install mechanism, see `.ktx/agents/install-manifest.json` in
-projects) describes the query workflow — wiki_search → sl_read_source →
-sl_query / sql_execution — but assumes the connection is obvious. In a
-multi-connection project nothing tells the agent to *first decide which
-connection the question is about*, and several tools silently require it:
-
- `sql_execution`, `sl_read_source`, `entity_details`: `connectionId`
-  **required**;
- `sl_query`, `discover_data`, `dictionary_search`: optional, but
-  auto-inference only works with exactly one connection
-  (`local-query.ts` `resolveLocalConnectionId` ~29-38 — throws with zero or
-  multiple connections).
-
-An agent that skips routing either errors out or, worse, queries the wrong
-database when names overlap.
-
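The auto-inference rule described above (infer only when exactly one connection exists, otherwise throw) can be sketched as below. This is not the body of `resolveLocalConnectionId`; the function name, the error wording, and the check that an explicit id is known are assumptions made for illustration.

```typescript
// Hypothetical sketch of the rule; not the actual ktx implementation.
function resolveConnectionId(explicit: string | undefined, available: string[]): string {
  if (explicit !== undefined) {
    // Assumption: an explicit id is validated against the known connections.
    if (!available.includes(explicit)) {
      throw new Error(`Unknown connection "${explicit}"`);
    }
    return explicit;
  }
  const [only] = available;
  if (available.length === 1 && only !== undefined) {
    return only; // single-connection projects need no routing step
  }
  if (available.length === 0) {
    throw new Error("No connections configured");
  }
  throw new Error(
    `connectionId is required: ${available.length} connections exist (${available.join(", ")})`,
  );
}
```

With two or more connections the only non-throwing path is the explicit one, which is why the skill has to resolve a connection before its first data query.
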
-## Generic use case
-
-Any ktx project with more than one connection — the common shape for a data
-org (warehouse + product DB + events DB). Routing is the first step of every
-question, and the skill should encode it so individual agents don't have to
-rediscover it.
-
-## Requirements
-
-1. **Add an explicit routing step (step 0) to the skill's workflow:**
-   - Call `connection_list` to see what exists.
-   - Match the question's domain to a connection using connection ids/names,
-     `discover_data` hits, and wiki context — not guesswork.
-   - If genuinely ambiguous after discovery, ask the user rather than pick.
-2. **Thread the resolved `connectionId` everywhere:** all subsequent
-   `sl_query`, `sql_execution`, `sl_read_source`, `entity_details`,
-   `dictionary_search`, `discover_data` calls, and `wiki_search` once spec 01
-   lands (search scoped to the resolved connection plus unscoped pages).
-3. **Single-connection projects stay frictionless:** the skill should say
-   routing is trivial when `connection_list` returns one entry — don't add a
-   mandatory ceremony step for the common simple case.
-4. **Capture routing knowledge:** when the agent learns a non-obvious
-   question-domain → connection mapping, the skill should encourage
-   `memory_ingest` so the mapping becomes wiki knowledge for next time.
-
-This is a docs/prompt change in the skill content (plus any skill-install
-plumbing if the skill is versioned); no engine changes required.
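
The routing rule in requirements 1-3 can be sketched as below. This is only an illustration under stated assumptions: `resolve_connection_id` and its `hint` parameter are hypothetical names, not ktx APIs, and the real `resolveLocalConnectionId` in `local-query.ts` may behave differently in detail.

```python
from typing import Optional


def resolve_connection_id(connection_ids: list[str], hint: Optional[str] = None) -> str:
    """Pick the connection to thread through later calls (hypothetical helper).

    `connection_ids` stands in for the result of `connection_list`; `hint` is
    whatever the agent derived from names, discovery hits, or wiki context.
    """
    if not connection_ids:
        raise ValueError("no connections configured")
    if len(connection_ids) == 1:
        # Single-connection project: routing is trivial, no extra ceremony.
        return connection_ids[0]
    if hint is None:
        # Never fall back to an implicit default when several connections exist.
        raise ValueError("multiple connections: route the question first")
    matches = [cid for cid in connection_ids if cid.lower() == hint.lower()]
    if len(matches) != 1:
        # Genuinely ambiguous or unknown: ask the user rather than pick.
        raise LookupError(f"cannot route to {hint!r}; ask the user")
    return matches[0]
```

The resolved id would then be passed explicitly as `connectionId` on every subsequent tool call instead of being re-inferred per call.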
-
-## Acceptance criteria
-
- In a fixture project with ≥2 connections, an agent following the skill
-  resolves the correct connection before its first data query, and no tool
-  call fails with "connectionId is required".
- In a single-connection project the skill-driven flow is unchanged (no
-  extra mandatory steps).
- Skill text nowhere assumes a default/implicit connection.
-
-## Benchmark context (motivation only)
-
-Spider 2.0-Lite local subset = 30 SQLite connections in one project; every
-one of the 135 questions targets exactly one of them. Connection ids are set
-to the benchmark's database names, so with this skill guidance routing is
-mechanical (`connection_list` + name match) and needs no benchmark-specific
-instructions — which is the point: the harness gives the agent only the
-question text.
--- a/spider2-specs/todo/04-offline-schema-docs-adapter.md
+++ b/spider2-specs/todo/04-offline-schema-docs-adapter.md
@ -1,51 +0,0 @@
-# Offline schema-documentation ingest adapter
-
-> **Priority: LOW / backlog.** Explicitly **not** needed for the Spider
-> 2.0-Lite benchmark — we verified the benchmark's offline schema files
-> (DDL dumps + sample-row JSONs) are a strict subset of what the live SQLite
-> scan already captures (DDL, types, PKs, sample values, cardinality
-> profiling). Implement specs 01-03 first; pick this up only if a real
-> use case shows up.
-
-## Problem
-
-The ingest pipeline's schema knowledge comes from live database scans
-(`live-database` adapter) or BI-tool adapters (metabase, looker, dbt…).
-There is no adapter for **offline schema documentation**: files describing
-tables/columns that exist outside the database — column-description
-spreadsheets, data dictionaries, DDL exports with comments, hand-maintained
-schema docs.
-
-## Generic use case
-
-Teams whose richest schema documentation lives outside `information_schema`:
-a wiki export of column meanings, a governance tool's CSV data dictionary,
-DDL files with COMMENT clauses the production scan can't see, or
-environments where ktx has no live access at all and must build the semantic
-layer from documentation alone.
-
-## Requirements (sketch — refine when picked up)
-
-1. A new ingest adapter (peer of `metabase`/`dbt` in
-   `context/ingest/adapters/`) consuming a configured local path of schema
-   docs per connection.
-2. Input formats to start: DDL files (`.sql`/`.csv` of CREATE statements)
-   and tabular column dictionaries (CSV/JSON: table, column, description,
-   …). Extensible to other formats.
-3. Output: **enrichment, not duplication** — merge descriptions/metadata
-   into the manifest-backed semantic-layer sources and dictionary for the
-   matching connection. Where a live scan exists, offline docs fill gaps
-   (descriptions, enum meanings, deprecation notes) and flag drift
-   (documented column missing from live schema and vice versa) rather than
-   creating parallel wiki pages that duplicate schema info.
-4. Works without live database access (documentation-only bootstrap of a
-   connection's semantic layer), clearly marked as unverified-against-live.
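
Requirement 3 (enrich, then flag drift in both directions) might look roughly like this. The function name, argument shapes, and report keys are assumptions made for illustration; the actual adapter would work against manifest-backed sources rather than plain sets and dicts.

```python
def merge_offline_docs(live_columns: set[str], documented: dict[str, str]):
    """Return (descriptions to merge, drift report) for one table.

    `live_columns` is the column set from the live scan; `documented` maps
    column name to description as read from an offline dictionary file.
    """
    # Enrichment: only attach descriptions to columns the live scan confirms.
    descriptions = {col: text for col, text in documented.items() if col in live_columns}
    # Drift: report mismatches in both directions instead of hiding them.
    drift = {
        "documented_not_live": sorted(set(documented) - live_columns),
        "live_not_documented": sorted(live_columns - set(documented)),
    }
    return descriptions, drift
```

For a documentation-only bootstrap (requirement 4), `live_columns` would be unknown, so a real implementation would need a separate path that marks every source as unverified against live.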
-
-## Acceptance criteria (sketch)
-
- Given a connection with a live scan plus an offline column dictionary,
-  semantic-layer sources carry the documented descriptions, and drift
-  between doc and live schema is reported.
- Given a connection with docs only (no live access), `sl list`/`sl read`
-  expose manifest sources built from the docs.
- No wiki pages are created that merely restate table/column lists.
--- a/spider2-specs/todo/05-composite-key-join-detection.md
+++ b/spider2-specs/todo/05-composite-key-join-detection.md
@ -1,59 +0,0 @@
-# Composite-key (multi-column) join detection
-
-> Priority: MEDIUM. Found empirically during the first Spider2-lite SQLite
-> smoke test (2026-06-13): relationship detection emitted **zero joins** for a
-> database whose fact tables are linked only by composite keys. Agents still
-> answered correctly by inferring the join from shared `grain`, so this didn't
-> cost benchmark points — but it forces inference that explicit joins would
-> remove, and the gap is generic.
-
-## Problem
-
-Relationship detection appears to emit only single-column joins. For the IPL
-SQLite database, every table came back with `joins=0`, even though its fact
-tables are connected by a 4-column composite key
-(`match_id, over_id, ball_id, innings_no`) shared across `ball_by_ball`,
-`batsman_scored`, `extra_runs`, and `wicket_taken`. The semantic layer did
-correctly record that shared key as each table's `grain`, which is why agents
-could recover the relationship — but no `joins:` entries were produced for the
-fact-to-fact links.
-
-## Generic use case
-
-Event/fact tables keyed by composite business keys are common: ledger lines
-(`account_id, period, line_no`), telemetry (`device_id, ts, metric`), sports
-ball-by-ball, EAV/log schemas. Whenever there are no single-column FKs but a
-multi-column key recurs across tables, ktx should detect and document the join
-so agents (and `sl_query`) don't have to infer it.
-
-## Requirements
-
-1. Relationship detection considers **multi-column** join candidates, not just
-   single-column ones. A strong signal already exists in ktx: when two tables
-   share an identical (or subset/superset) declared `grain`, that grain is a
-   prime composite-join candidate.
-2. Emitted joins carry the full composite condition, e.g.
-   `on: a.match_id = b.match_id AND a.over_id = b.over_id AND a.ball_id = b.ball_id AND a.innings_no = b.innings_no`,
-   with a sensible `relationship` cardinality.
-3. The existing validation/threshold machinery
-   (`scan.relationships.acceptThreshold` etc.) applies to composite candidates
-   too; profile-based validation should check join selectivity on the full key.
-4. No regression for single-column joins; don't explode combinatorially —
-   bound candidate generation (e.g. only consider shared-grain keys and
-   declared/inferred PK overlaps, cap column count).
-5. `sl_query` can compile a join across a composite-key relationship.
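
A bounded candidate generator along the lines of requirements 1, 2, and 4 could be sketched as follows. Everything here is hypothetical: the function name, the `max_columns` cap, and the table names in the usage note are illustrative, and the sketch deliberately skips single-column overlaps on the assumption that the existing detection already covers them.

```python
from itertools import combinations


def composite_join_candidates(grains: dict[str, tuple[str, ...]], max_columns: int = 6):
    """Return (left, right, on_clause) for table pairs whose grains are related."""
    candidates = []
    for (left, left_grain), (right, right_grain) in combinations(sorted(grains.items()), 2):
        left_set, right_set = set(left_grain), set(right_grain)
        # Bound the search: only identical or subset/superset grains qualify.
        if not (left_set <= right_set or right_set <= left_set):
            continue
        shared = [col for col in left_grain if col in right_set]  # keep declared order
        # Composite only (2+ columns), and cap the key width.
        if len(shared) < 2 or len(shared) > max_columns:
            continue
        on_clause = " AND ".join(f"{left}.{col} = {right}.{col}" for col in shared)
        candidates.append((left, right, on_clause))
    return candidates
```

Given grains for `accounts` (`account_id`), `ledger_lines`, and `ledger_adjustments` (both `account_id, period, line_no`), this yields one candidate joining the two ledger tables on all three columns. Each candidate would still have to pass the `scan.relationships.acceptThreshold` validation before being emitted.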
-
-## Acceptance criteria
-
- For a fixture with two tables sharing a 3- or 4-column grain and no
-  single-column FK, ingest emits a composite join between them with the full
-  multi-column `on` condition.
- `sl read <source>` shows the composite join; `sl_query` can traverse it.
- Single-column join detection is unchanged on existing fixtures.
-
-## Benchmark context (motivation only)
-
-IPL (and similar ball-by-ball/event schemas in the Spider2-lite local set)
-have no single-column FKs; their joins are entirely composite. Explicit
-composite joins would let the agent rely on documented relationships instead
-of inferring them from grain.
--- a/spider2-specs/todo/13-canonical-authoritative-source-measures.md
+++ b/spider2-specs/todo/13-canonical-authoritative-source-measures.md
@ -1,89 +0,0 @@
-# Canonical / authoritative-source measures in the semantic layer
-
-## Problem
-
-Many schemas contain an **authoritative table** that already encodes a metric's
-business rules — an official standings/leaderboard table, a general-ledger or
-period-end balance table, a materialized summary/snapshot — alongside the **raw
-transactional** rows the metric *could* be re-derived from. Re-deriving the metric
-from the raw rows frequently diverges from the canonical definition, because the
-authoritative table bakes in rules the raw data doesn't expose (drop-scores,
-penalties, adjustments, reconciliations, as-of snapshots).
-
-Today ktx's semantic layer doesn't distinguish "authoritative summary" tables from
-raw fact tables, so the analytics skill has no signal that one source is canonical
-for a metric, and the agent often re-derives from raw rows and gets a
-defensible-but-different number.
-
-## Generic use case (independent of any benchmark)
-
- "Championship points per competitor this season" — a sports schema may hold both
-  raw per-event results AND an official standings table that applies drop-scores
-  and penalties. The standings table is the canonical source; summing raw results
-  is wrong.
- "Account balance as of month end" — prefer a ledger/balance-snapshot table over
-  re-summing every transaction (which may miss adjustments).
- "Monthly recognized revenue" — prefer a finance summary table over re-deriving
-  from line items.
-
-In each case a real analyst should be steered to the authoritative source.
-
-## Requirements
-
-1. **Detect candidate authoritative tables during ingest.** Heuristics only —
-   e.g. tables whose name/role suggests a summary (`*standings*`, `*balance*`,
-   `*summary*`, `*snapshot*`, `*ledger*`), tables that are a coarser-grained
-   aggregation of another table, or tables documented as authoritative in provided
-   docs/wiki. Surface them as such in the semantic layer.
-
-2. **Represent the metric as an SL measure backed by the authoritative table.**
-   Where a canonical source exists, define the measure over it so a query for that
-   metric resolves to the authoritative source by default. (The analytics skill
-   already prefers SL measures over raw SQL — spec 07/skill rule — so this plugs
-   into existing behavior.)
-
-3. **Keep raw re-derivation available** as a non-default alternative; the measure
-   documents which source it uses and why, so the choice is transparent and
-   overridable.
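
The name-based part of requirement 1 can be illustrated with a small sketch. The pattern list is copied from the requirement; the function name and return shape are assumptions, and the table names in the usage note are synthetic. Name matching alone is weak evidence, so a real implementation would presumably combine it with the aggregation and documentation signals before flagging anything.

```python
import fnmatch

# Patterns taken from requirement 1; generic, not tied to any benchmark schema.
SUMMARY_PATTERNS = ("*standings*", "*balance*", "*summary*", "*snapshot*", "*ledger*")


def flag_authoritative_candidates(table_names):
    """Map each table whose name suggests a summary role to the patterns it matched."""
    flagged = {}
    for name in table_names:
        hits = [p for p in SUMMARY_PATTERNS if fnmatch.fnmatchcase(name.lower(), p)]
        if hits:
            # Record why the table was flagged so the choice stays transparent.
            flagged[name] = hits
    return flagged
```

With synthetic names such as `race_results` and `season_standings`, only `season_standings` is flagged, and the matched pattern is kept alongside it so the measure can document which signal made the source a candidate.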
-
-## Fairness boundary (HARD — this spec is fairness-sensitive)
-
-The choice of authoritative source MUST be driven by **schema/structure or provided
-documentation** — the table exists, is structured as a summary, or is documented as
-authoritative. It must **NEVER** be driven by observing which interpretation matches
-a benchmark gold answer. Concretely:
-
- ✅ Fair: "a table named/structured as official standings exists and aggregates the
-  raw results → treat it as the canonical points source."
- ❌ Forbidden: "for question X, use table T because that's what reproduces the gold
-  result." That is per-instance gold-tuning (cheating) and must not appear in ktx,
-  the ingest heuristics, or any mapping.
-
-If a metric is genuinely underspecified and only the gold answer disambiguates the
-intended source, it is **not fairly fixable** — leave it. Whether this feature helps
-any specific benchmark instance is therefore *conditional* on a real schema/doc basis
-existing; do not manufacture one.
-
-## Leak-safety (hard constraint)
-
-No benchmark table names, queries, gold values, or instance-specific mappings
-anywhere in the spec, the heuristics, or tests. Examples must be synthetic/generic.
-
-## Acceptance criteria
-
- Ingest can flag candidate authoritative/summary tables via generic heuristics
-  (name/role/aggregation/doc signals), with no benchmark-specific rules.
- The semantic layer can express a measure as backed by a designated authoritative
-  source; the skill resolves the metric to it by default; raw re-derivation remains
-  available and the choice is documented.
- Tests use synthetic schemas only; no gold-derived mappings exist anywhere.
-
-## Benchmark context (motivation only)
-
-Some SQLite-subset metric questions are underspecified between a raw-derivation and
-an authoritative-table interpretation (e.g. season points from raw results vs an
-official standings table). This is the roadmap's "canonical semantic-layer measures
-from schema + provided docs" item. It is fair ONLY where schema/docs support one
-source; the gold-only cases are explicitly out of scope (fixing them would require
-tuning to gold). Larger than the spec 09–12 skill-content tweaks: this touches
-ingest + the semantic-layer model.
--- a/spider2-specs/todo/17-lifecycle-event-metrics.md
+++ b/spider2-specs/todo/17-lifecycle-event-metrics.md
@ -1,57 +0,0 @@
-# 17 — Lifecycle-event metrics in the semantic layer
-
-**Status:** draft (intake). Requirement-level; the implementer refines into `specs/17-*.md`.
-
-## Problem / requirement
-
-Many entities carry **several lifecycle timestamps** for the same record — an order has
-`placed/purchased`, `approved`, `shipped/carrier-handoff`, `delivered`, and `estimated-delivery`
-times; a ticket has `opened`, `assigned`, `resolved`, `closed`; a payment has `initiated`,
-`authorized`, `settled`. When an analyst asks for a count/volume/rate of records **in a named
-completed state, by period** ("delivered orders by month", "resolved tickets per week", "settled
-payments by day"), the correct time anchor is the timestamp of *that named event*, not the
-record-creation timestamp.
-
-Today ktx ingests these timestamps as **peer date dimensions** with good column descriptions, but it
-does **not model the lifecycle event itself** — so nothing in the semantic layer tells a solver (or a
-human) that "delivered orders over time" should be anchored to the delivery timestamp. The choice is
-left to per-query reasoning, which is exactly where it goes wrong. (A companion analytics-skill rule
-now nudges the *solver* — ktx commit `226341cf` — but the durable, reusable home for this is the
-**model**, so any consumer of the semantic layer gets it for free.)
-
-**Requirement:** during enrichment/ingestion, when a source has a state/status column plus one or more
-lifecycle timestamps whose names/descriptions map to that state's values, infer **lifecycle-event
-metrics** — e.g. a `delivered_orders` metric defined as `COUNT(*)` filtered to the delivered state with
-its **default time dimension** set to the matching event timestamp (`order_delivered_customer_date`),
-distinct from the creation-anchored `orders` metric. Keep the inference conservative and
-source-traceable (column names + enriched descriptions only); never invent a state/timestamp pairing
-that the schema/descriptions don't independently support.
-
-## Sketch (implementer to refine)
-
- Detect (state column, lifecycle-timestamp) pairs from column names + enrichment descriptions
-  (e.g. status value `delivered` ↔ `*_delivered_*_date`; `resolved` ↔ `resolved_at`).
- Emit a metric per detected completed state: filter = the state predicate, grain = record,
-  `defaultTimeDimension` = the matching event timestamp.
- Surface these via `discover_data` / `entity_details` so "delivered orders over time" retrieves the
-  delivery-anchored metric rather than a bare row count over the creation date.
- Gate behind the existing `enrichment.mode: llm` path; respect the conservative-inference bar
-  (precision over recall — a wrong pairing is worse than none).
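
The detection and emission steps in this sketch could be prototyped as below, using names only (no enrichment descriptions). All function names and the metric dict shape are assumptions for illustration; only `defaultTimeDimension` comes from the sketch above. A state is paired only when exactly one timestamp column matches, which is one way to honor the precision-over-recall bar.

```python
import re


def pair_states_with_timestamps(status_values, timestamp_columns):
    """Pair a state with a timestamp column only when the match is unambiguous."""
    pairs = {}
    for state in status_values:
        # The state must appear as a whole underscore-delimited token.
        token = re.compile(rf"(^|_){re.escape(state.lower())}(_|$)")
        matches = [col for col in timestamp_columns if token.search(col.lower())]
        if len(matches) == 1:  # zero or several matches: emit nothing
            pairs[state] = matches[0]
    return pairs


def lifecycle_metrics(source, status_column, status_values, timestamp_columns):
    """Build one count metric per confidently paired completed state."""
    pairs = pair_states_with_timestamps(status_values, timestamp_columns)
    return [
        {
            "name": f"{state}_{source}",
            "expr": "COUNT(*)",
            "filter": f"{status_column} = '{state}'",
            "defaultTimeDimension": column,
        }
        for state, column in pairs.items()
    ]
```

For a `tickets` source with states `open`, `resolved`, `closed` and timestamps `created_at`, `opened_at`, `resolved_at`, `closed_at`, this emits `resolved_tickets` and `closed_tickets` anchored to their own timestamps and nothing for `open`. A state that matches two timestamp columns is skipped entirely, leaving the disambiguation to enrichment descriptions or a human.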
-
-## Generic use case (independent of the benchmark)
-
-Any operational/transactional schema (e-commerce orders, support tickets, payments, claims, shipments)
-has this multi-timestamp lifecycle shape. An analyst asking "how many X were <completed-state> last
-month" almost always means *entered that state* last month. Encoding the event→timestamp mapping in the
-model makes every downstream question (BI tool, ad-hoc SQL, an LLM agent) pick the right anchor without
-re-deriving it, and prevents the silent "grouped by when they started" error.
-
-## Benchmark context (motivation only — not a benchmark-specific rule)
-
-Surfaced by the `spider2-autofix` loop, round r1: Spider 2.0-Lite `Brazilian_E_Commerce` cases local028
-("delivered orders for each month") and local031 ("highest monthly delivered orders volume") both failed
-because the solver bucketed delivered orders by `order_purchase_timestamp` instead of
-`order_delivered_customer_date`. The trace showed the solver had both columns and even compared both
-date bases for local031 before choosing purchase. A skill-text rule flipped both cases this round; this
-spec is the **model-layer** form of the same fix, which would make the right anchor the default for any
-solver and any lifecycle schema.