chore: remove private benchmark specs

2026-07-01 08:59:39 +02:00 · 2026-06-30 11:13:44 +02:00 · 2026-06-30 11:13:44 +02:00 · 1c5d16abc3
commit 1c5d16abc3
parent 67a69dba8b
40 changed files with 0 additions and 8716 deletions
--- a/spider2-specs/README.md
+++ b/spider2-specs/README.md
@ -1,62 +0,0 @@
 # spider2-specs — feature specs driven by the Spider 2.0-Lite benchmark
 This directory is the handoff point between two agents working on different
 sides of the same goal: making Claude Code + ktx score well on the Spider
 2.0-Lite benchmark **without benchmark-specific instructions** — the agent
 should succeed using only what ktx provides (skills, semantic layer, wiki).
 ## Mechanics
 Three directories form a pipeline. A feature flows `todo/` → `specs/` →
 (implemented), and only its intake draft moves to `done/`:
 - **`todo/`** — intake drafts. A **playground agent** (works in
  `/Users/andrey/projects/kaelio/spider-clean-submission/playground`, runs the
  benchmark, identifies ktx capability gaps) writes a draft spec here when it
  finds a gap.
 - **`specs/`** — refined specs. A **refinement pass** (brainstorming) takes a
  `todo/` draft and produces a proper, implementation-ready spec at
  `specs/<same-filename>.md`: sharpened requirements, resolved ambiguities,
  acceptance criteria, and orientation hints. The refined spec is the **durable
  artifact** the implementer builds from — it stays in `specs/` permanently and
  never moves.
 - **`done/`** — intake drafts whose feature has shipped (see below).
 The **ktx worktree agent** (started from a ktx repo worktree, e.g.
 `/Users/andrey/conductor/workspaces/ktx/tallinn-v2`) implements from the
 refined spec in `specs/` (falling back to the `todo/` draft only if no refined
 spec exists yet). When the feature is implemented it:
 1. appends a short **"Implementation notes"** section to the refined spec in
   `specs/` (what was built, where, any deviations); and
 2. **moves the original intake draft from `todo/` to `done/`.**
 Location is status: `todo/` = draft awaiting implementation, `done/` = draft
 whose feature shipped, `specs/` = refined specs (permanent home, do not move).
 A draft and its refined spec share the same filename so they correspond
 (`todo/01-foo.md` ↔ `specs/01-foo.md` ↔ `done/01-foo.md`). No other tracking.
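Because location is the only tracking mechanism, a spec's status can be derived from the directory layout alone. The following is a minimal sketch, assuming the three directories sit under one root; the function name `spec_status` and the status strings are illustrative and not part of ktx.

```python
from pathlib import Path
import tempfile


def spec_status(root: Path, filename: str) -> dict:
    """Derive a spec's status purely from which directories contain it."""
    in_todo = (root / "todo" / filename).exists()
    in_done = (root / "done" / filename).exists()
    if in_done:
        draft = "shipped"
    elif in_todo:
        draft = "awaiting implementation"
    else:
        draft = "unknown"
    # The refined spec never moves, so its presence is independent of the draft.
    return {"draft": draft, "refined": (root / "specs" / filename).exists()}


# Build a throwaway layout to exercise the function (hypothetical filenames).
demo_root = Path(tempfile.mkdtemp())
for sub in ("todo", "specs", "done"):
    (demo_root / sub).mkdir()
(demo_root / "todo" / "01-foo.md").write_text("draft")
(demo_root / "specs" / "01-foo.md").write_text("refined spec")
(demo_root / "done" / "02-bar.md").write_text("draft")

print(spec_status(demo_root, "01-foo.md"))
print(spec_status(demo_root, "02-bar.md"))
```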
 ## Rules for specs
 1. **Generic, not benchmark-overfit.** ktx is a general-purpose product; the
   benchmark only surfaces the need. Every spec must state a real-world use
   case independent of Spider 2.0-Lite. If a requirement only makes sense for
   the benchmark, it doesn't belong in ktx.
 2. Specs are **requirement-level**, not implementation plans. Code pointers in
   specs are orientation hints from exploration (line numbers may have
   drifted); the implementer owns the design.
 3. One spec per file, kebab-case, numeric prefix = suggested priority order.
   A refined spec in `specs/` keeps the same filename as its `todo/` draft.
 ## For the implementer
 - After implementing, rebuild and re-link the dev binary so the playground
  picks it up: `pnpm run build && pnpm run link:dev` (provides `ktx-dev`).
 - Add/extend tests in the ktx test suites; specs list acceptance criteria to
  cover.
 - Build from the refined spec in `specs/`. On completion, append
  "Implementation notes" to that spec (it stays in `specs/`) and move the
  intake draft from `todo/` to `done/`.
 - If a spec turns out to be wrong or already satisfied, don't silently drop
  it — record why in the refined spec's notes and move the draft to `done/`
  explaining why no change was needed.
--- a/spider2-specs/done/.gitkeep
+++ b/spider2-specs/done/.gitkeep
--- a/spider2-specs/done/01-connection-scoped-wiki.md
+++ b/spider2-specs/done/01-connection-scoped-wiki.md
@ -1,74 +0,0 @@
 # Connection-scoped wiki pages
 ## Problem
 Wiki pages have only two scopes today: `GLOBAL` and `USER`
 (`packages/cli/src/context/wiki/types.ts`, frontmatter schema ~lines 14-29).
 There is no way to associate a page with a connection. In a project with many
 connections, all pages share one search index, so `wiki_search` for a generic
 term ("orders", "revenue", "average order value") surfaces pages about the
 wrong database. Concept names collide across databases constantly in
 real-world multi-connection projects (several databases each with `orders`,
 `customers`, etc.).
 Today, when `memory_ingest` is called with a `connectionId`, that id is only
 used to scope which semantic-layer sources the triage agent can see
 (`memory-agent.service.ts` ~46-72, ~107-109); it is **not** persisted on the
 resulting wiki page in any form.
 ## Generic use case
 Any org with multiple databases/warehouses in one ktx project: org-wide
 definitions ("fiscal year starts in February") should be visible everywhere,
 while database-specific conventions ("in the events DB, `user_id` is the
 anonymous device id, not the account id") should not pollute searches about
 other databases.
 ## Requirements
 1. **Frontmatter field.** Add an optional `connections:` field to wiki page
   frontmatter — a list of connection ids (accept a single string too,
   normalize to list).
   - **Absent or empty ⇒ unscoped: the page applies to all connections.**
     This is exactly today's behavior, so every existing page is unaffected
     (backward compatible by construction).
 2. **Search filtering.** `wiki_search` (MCP tool, `context-tools.ts` ~46-64)
   and `ktx wiki search` / `ktx wiki list` (CLI,
   `knowledge-commands.ts`) accept an optional `connectionId`:
   - With `connectionId: X` ⇒ return pages scoped to X **∪** unscoped pages.
   - Without ⇒ current behavior, all pages.
   - The filter must apply to **all three search lanes** (lexical FTS5,
     semantic/embedding, token fallback) in
     `local-knowledge.ts` / `sqlite-knowledge-index.ts` — not as a post-filter
     that eats into the result limit unevenly.
 3. **Index.** Persist the scoping in the `.ktx/db.sqlite` knowledge index
   (the index is already re-synced from files on every search,
   `local-knowledge.ts` ~286-310, so a schema addition + sync is sufficient).
 4. **Write path.** The memory agent's wiki-write tool accepts the connections
   field; when `memory_ingest` is invoked with a `connectionId`, the agent
   should default new database-specific pages to that connection, while still
   being allowed to write unscoped pages for clearly org-wide content (prompt
   guidance, not a hard rule).
 5. **`wiki_read` and refs are unchanged** — pages remain addressable by key
   regardless of scoping; `connections` is a search/relevance concern only.
 6. **Validation.** Warn (don't fail) when a page references a connection id
   not present in `ktx.yaml` — config and content can evolve independently.
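Requirements 1 to 3 can be sketched together: normalize the `connections` value, persist it in the index, and apply the scope filter inside the query so it runs before the result limit. This is a hedged illustration, not the ktx schema: the `pages` / `page_connections` tables, the function names, and the `LIKE` search standing in for a single lane are all assumptions.

```python
import sqlite3


def normalize_connections(value) -> list:
    """Accept None, a single string, or a list; always return a list of ids."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def build_index(pages: dict) -> sqlite3.Connection:
    """pages maps page key -> (body, raw `connections` frontmatter value)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE pages (key TEXT PRIMARY KEY, body TEXT);
        CREATE TABLE page_connections (page_key TEXT, connection_id TEXT);
    """)
    for key, (body, raw) in pages.items():
        db.execute("INSERT INTO pages VALUES (?, ?)", (key, body))
        for conn_id in normalize_connections(raw):
            db.execute("INSERT INTO page_connections VALUES (?, ?)", (key, conn_id))
    return db


def search(db, term: str, connection_id=None, limit: int = 10) -> list:
    """Substring search standing in for one lane; the scope filter runs before LIMIT."""
    sql = "SELECT p.key FROM pages p WHERE p.body LIKE ?"
    params = [f"%{term}%"]
    if connection_id is not None:
        # Keep pages scoped to this connection plus pages with no scope at all.
        sql += """
            AND (
                NOT EXISTS (SELECT 1 FROM page_connections pc WHERE pc.page_key = p.key)
                OR EXISTS (SELECT 1 FROM page_connections pc
                           WHERE pc.page_key = p.key AND pc.connection_id = ?)
            )"""
        params.append(connection_id)
    sql += " ORDER BY p.key LIMIT ?"
    params.append(limit)
    return [row[0] for row in db.execute(sql, params)]


index = build_index({
    "a-orders": ("orders in db_a", ["db_a"]),
    "b-orders": ("orders in db_b", "db_b"),                   # single string is normalized
    "fiscal-year": ("orders follow the fiscal year", None),   # unscoped
})
print(search(index, "orders", connection_id="db_a"))
print(search(index, "orders", connection_id="db_b", limit=1))
print(search(index, "orders"))
```

With `limit=1` and `connectionId` set to `db_b`, a post-filter would have fetched `a-orders` first and then discarded it, returning nothing; filtering inside the query still returns `b-orders`.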
 ## Acceptance criteria
 - A page with `connections: [db_a]` is returned by
  `wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but
  **not** by `wiki_search(query, connectionId: "db_b")`.
 - A page with no `connections` field is returned in all three cases above.
 - Existing projects with no scoped pages behave identically before/after.
 - Filtering works in each lane independently (test with embeddings disabled
  to exercise lexical/token lanes alone).
 - `memory_ingest(content, connectionId)` produces a page scoped to that
  connection for database-specific content.
 ## Benchmark context (motivation only)
 Spider 2.0-Lite local subset = one project with 30 SQLite connections whose
 schemas share table/concept names (Northwind, sakila, two e-commerce DBs…).
 External-knowledge docs (RFM definition, F1 overtake rules) are each relevant
 to exactly one database and must not surface for the other 29.
--- a/spider2-specs/done/02-verbatim-ingest-mode.md
+++ b/spider2-specs/done/02-verbatim-ingest-mode.md
@ -1,71 +0,0 @@
 # Verbatim ingest mode for authoritative documents
 ## Problem
 `ktx ingest --text/--file` routes content through the memory agent
 (`text-ingest.ts` ~246-357 → `memory-agent.service.ts`), an LLM triage loop
 (30-step budget for `external_ingest`, content clipped at ~48k chars,
 `memory-agent.service.ts` ~165) that may rewrite, condense, or split the
 content before writing wiki pages.
 For *authoritative* documents — formula definitions, specs, runbooks,
 compliance text — paraphrasing is a bug, not a feature:
 - exact thresholds, constants, and rule wording must survive byte-for-byte;
 - lexical (BM25) search works best when the stored text matches the phrasing
  users/agents will query with;
 - ingestion should be deterministic and reproducible — same input file, same
  resulting page.
 ## Generic use case
 Any team ingesting documents that are already the source of truth: metric
 definition sheets, SLA documents, calculation methodology docs, regulatory
 text. The user wants ktx to *index and surface* the document, not to
 re-author it.
 ## Requirements
 1. **Flag.** `ktx ingest --file <path> --verbatim` (apply to `--text` too).
   Composes with the existing optional `--connection <id>` so the resulting
   page can be connection-scoped (see spec 01).
 2. **Body preservation is enforced by code, not by prompt.** The stored page
   body must be the input content byte-for-byte. The LLM is used **only** to
   generate metadata: `summary`, `tags`, `sl_refs`, suggested page key/slug
   (and `connections` default from the flag). Implementation freedom: a
   single constrained LLM call is fine — the full memory-agent loop is not
   required for this mode.
 3. **No clipping of the stored body.** The ~48k clip may apply to what is
   *sent to the LLM* for metadata generation, never to what is *written* to
   the wiki page.
 4. **Existing frontmatter.** If the input file already has YAML frontmatter,
   preserve user-provided fields and only fill gaps (don't overwrite an
   explicit `summary` with a generated one).
 5. **Key collisions.** Deterministic, non-destructive behavior: error or
   suffix — never silently overwrite an existing page.
 6. **Degraded mode.** With `llm.provider.backend: none`, `--verbatim` should
   still work, deriving `summary` from the first heading/sentence and leaving
   optional metadata empty. (Regular agent ingest can't do this; verbatim
   mode can and should.)
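The code-enforced parts of these requirements (body preservation, gap-filling metadata, non-destructive key collisions, and the degraded-mode summary) can be sketched without any LLM call. This is an illustration under stated assumptions: the in-memory `store` dict, the function names, and the choice to raise on collision rather than suffix are hypothetical, and frontmatter is assumed to be parsed into a dict already.

```python
import hashlib


def derive_summary(body: str) -> str:
    """Degraded-mode summary: first non-empty heading, else first sentence."""
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith("#") and stripped.lstrip("#").strip():
            return stripped.lstrip("#").strip()
    return body.strip().split(". ")[0].rstrip(".")


def merge_metadata(user_fields: dict, generated: dict) -> dict:
    """Generated values only fill gaps; user-provided fields always win."""
    merged = dict(generated)
    merged.update(user_fields)
    return merged


def store_verbatim(store: dict, key: str, body: bytes,
                   user_fields: dict, generated: dict) -> str:
    """Write the page and return the SHA-256 of the stored body."""
    if key in store:
        raise FileExistsError(f"page {key!r} already exists; refusing to overwrite")
    metadata = merge_metadata(user_fields, generated)
    if "summary" not in metadata:
        metadata["summary"] = derive_summary(body.decode("utf-8"))
    store[key] = {"metadata": metadata, "body": body}  # the body is never clipped
    return hashlib.sha256(store[key]["body"]).hexdigest()


wiki = {}
# A made-up document longer than the ~48k clip that applies to LLM input only.
document = ("# RFM buckets\n" + "threshold line\n" * 5000).encode("utf-8")
stored_hash = store_verbatim(wiki, "rfm-buckets", document,
                             user_fields={"tags": ["metrics"]}, generated={})
print(len(document) > 48_000, stored_hash == hashlib.sha256(document).hexdigest())
print(wiki["rfm-buckets"]["metadata"]["summary"])
```

Passing `generated={}` mimics `llm.provider.backend: none`: the summary falls back to the first heading and optional metadata stays empty.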
 ## Acceptance criteria
 - Ingesting a file with `--verbatim` produces a wiki page whose body is
  byte-identical to the input (assert with a hash in tests).
 - Running the same ingest twice is idempotent or fails loudly on the second
  run (per requirement 5) — no duplicated/divergent pages.
 - A >48k-char file is stored in full.
 - `--verbatim --connection X` yields a page scoped to X (depends on spec 01;
  if 01 isn't implemented yet, the flag composition can land later).
 - Generated metadata makes the page findable: `wiki_search` for a phrase
  from the document body returns it (lexical lane), and for a paraphrase of
  its topic returns it when embeddings are enabled (semantic lane).
 ## Benchmark context (motivation only)
 Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket
 definitions, haversine formula, F1 overtake rules…). Gold SQL was authored
 against their exact text; an LLM paraphrase that drops a bucket boundary
 loses a question. We currently work around this by hand-writing frontmatter
 and copying files into `wiki/global/` — verbatim mode makes that a supported
 ktx workflow instead of a manual step.
--- a/spider2-specs/done/06-scan-tolerate-broken-objects.md
+++ b/spider2-specs/done/06-scan-tolerate-broken-objects.md
@ -1,63 +0,0 @@
 # Schema scan must tolerate individual objects that fail introspection
 > Priority: MEDIUM. Found during the first full Spider2-lite sqlite ingest
 > (2026-06-13): one database (`oracle_sql`) failed to ingest **entirely**
 > because a single broken VIEW errored during introspection, leaving that
 > connection with no semantic layer at all.
 ## Problem
 `ktx ingest <connection>` aborts the whole database's schema scan when one
 table/view errors during introspection/profiling. In `oracle_sql` the view
 `emp_hire_periods_with_name` is defined as
 `SELECT ehp.start_date, ehp.end_date ... FROM emp_hire_periods ehp ...` but the
 base table has no `start_date`/`end_date` columns — so any attempt to read it
 raises `no such column: ehp.start_date`. That single broken object failed the
 ingest of all ~48 healthy tables/views in the database.
 A second, related symptom: setting `enabled_tables: [main.customers]` to work
 around it produced a different hard failure (`Adapter "database schema" did not
 recognize fetched source output`), so the documented allowlist escape hatch did
 not provide a clean fallback either.
 ## Generic use case
 Real databases routinely contain broken or inaccessible objects: views over
 dropped/renamed columns, views referencing tables the connection role can't
 read, permission-denied tables, or vendor system views that error. ktx should
 ingest everything it *can* and skip what it can't — never let one bad object
 zero out an entire connection's context. This is basic robustness for
 production warehouses, not benchmark-specific.
 ## Requirements
 1. **Per-object isolation.** If introspecting/profiling one table or view
   throws, skip that object, record a warning (object name + error), and
   continue scanning the rest. The connection's semantic layer is built from
   the objects that succeeded.
 2. **Surface, don't hide.** Report skipped objects in the ingest summary and in
   `ktx status` (e.g. "oracle_sql: 1 object skipped — emp_hire_periods_with_name:
   no such column ehp.start_date"). Honor `failureMode` for whole-connection
   aborts, but a single bad object should not count as a connection failure.
 3. **Views vs tables.** A broken view should never block base-table ingest.
   Consider profiling views defensively (they are read-only projections).
 4. **Allowlist fallback should work.** `enabled_tables` should reliably restrict
   the scan to the listed objects (and the qualification format for sqlite must
   be documented and accepted). Fix the `did not recognize fetched source
   output` failure when the allowlist yields a small/edge-case set.
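Per-object isolation can be reproduced against SQLite directly, since SQLite accepts a view whose columns do not exist and only fails when the view is read. The sketch below is not the ktx scanner: `scan_objects`, the `hired` column, and the bare (unqualified) names in the allowlist are assumptions made for illustration.

```python
import sqlite3


def scan_objects(db: sqlite3.Connection, enabled=None):
    """Probe each table/view independently; one failure never aborts the scan."""
    names = [row[0] for row in db.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table', 'view') ORDER BY name")]
    if enabled is not None:
        names = [n for n in names if n in enabled]
    scanned, skipped = {}, {}
    for name in names:
        try:
            cursor = db.execute(f'SELECT * FROM "{name}" LIMIT 1')
            scanned[name] = [col[0] for col in cursor.description]
        except sqlite3.Error as exc:
            skipped[name] = str(exc)  # reported as a warning, not a failure
    return scanned, skipped


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE emp_hire_periods (emp_id INTEGER, hired TEXT);
    -- The view references columns the base table does not have.
    CREATE VIEW emp_hire_periods_with_name AS
        SELECT ehp.start_date, ehp.end_date FROM emp_hire_periods ehp;
""")

scanned, skipped = scan_objects(db)
print(sorted(scanned))
print(skipped)

only_customers, none_skipped = scan_objects(db, enabled={"customers"})
print(sorted(only_customers), none_skipped)
```

The two healthy tables are scanned, the broken view is recorded with its error text, and the allowlist call touches only the listed object.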
 ## Acceptance criteria
 - Ingesting a sqlite DB containing one broken view plus N healthy tables yields
  a semantic layer for the N healthy tables and a warning naming the broken view
  — exit is success (not "failed"), subject to `failureMode`.
 - The skipped object is listed in the ingest summary and `ktx status`.
 - `enabled_tables` restricted to a subset ingests exactly that subset without the
  adapter-output error.
 ## Benchmark context (motivation only)
 `oracle_sql` (8 of the 135 sqlite questions) currently has no semantic layer
 because of its one broken view; those questions must be solved from raw
 `sql`-tool introspection instead of ktx's enriched context. Tolerant scanning
 would restore enriched context for that database.
--- a/spider2-specs/done/07-analytics-skill-sql-craft.md
+++ b/spider2-specs/done/07-analytics-skill-sql-craft.md
@ -1,112 +0,0 @@
 # Add universal SQL-authoring craft to the ktx-analytics skill
 > Priority: HIGH. The `ktx-analytics` skill currently tells the agent *which
 > ktx tools to call and in what order*, but gives almost no guidance on
 > *writing correct SQL*. In benchmark runs the agent reliably produced
 > runnable SQL (0 execution errors) yet failed on correctness — precision,
 > determinism, type mismatches, and answer completeness. These are universal
 > analytics-engineering truths that every ktx user benefits from, so they
 > belong in the shipped skill, not in any caller's prompt.
 ## Scope guard (read first)
 Only **universally-true** SQL/analytics craft goes here — guidance that helps a
 real ktx user querying a **live** database. The test for inclusion: *"Would this
 advice be correct and useful for an analyst on a current, production database?"*
 **Dialect-specific syntax is out of scope here.** The v9 harnesses' only
 per-dialect content (Snowflake: `DB.SCHEMA.TABLE` FQTNs, double-quoted
 lowercase cols, VARIANT colon-paths; BigQuery: backtick FQTNs, `_TABLE_SUFFIX`
 for sharded tables; sqlite: `strftime`/`julianday`) is genuinely useful but
 belongs in a **dialect-aware** location (per-driver notes), not this flat
 skill. Track separately as a follow-up; the rules below must stay
 dialect-agnostic.
 Explicitly **do NOT** add (these are application/consumer concerns, not skill
 concerns, and some are actively wrong for live data):
 - Output-format contracts ("return a bare result set with exactly these
  columns, no prose"). The skill is for interactive analysis and already
  favors readable tables + summaries; a caller that needs a strict result
  shape specifies that itself.
 - Anchoring relative time ("recent", "past N months") to `MAX(date)` of the
  data. On a live database "recent" means relative to *now*; this is only true
  for static snapshots and must not be baked into the product.
 - Anything justified by a grader/scoring comparator.
 ## File
 `packages/cli/src/skills/analytics/SKILL.md` (the shipped skill;
 `setup-agents.ts` installs it into agent environments — the copy under a
 project's `.claude/skills/` is regenerated from this source). Extend the
 existing `<rules>` block and step 5 ("Query") / step 6 ("Validate and
 explain"); keep the existing interactive guidance intact.
 ## Requirements: add these as general rules (behavior only, no rationale that references answers/graders)
 **Schema discovery before writing SQL**
 1. Inspect representative sample rows of each table before composing SQL —
   confirm date/time encoding (e.g. `YYYYMMDD` vs ISO vs epoch), null
   prevalence in join/filter keys, and the actual set of categorical/enum
   values. (`entity_details` + a small `sql_execution` sample.)
 2. Cast a column to its real type before comparing it in `WHERE`/`JOIN`. A
   string column compared against a numeric literal (or vice versa) can
   silently match nothing.
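Rules 1 and 2 can be seen in a small SQLite session. This is a sketch with an invented `payments` table; the exact coercion behavior shown is SQLite's, and other engines may coerce differently or raise an error, which is why the rule asks for an explicit cast.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payments (payment_id INTEGER PRIMARY KEY, amount TEXT);
    INSERT INTO payments VALUES (1, '100.0'), (2, '250.5');
""")

# Step 1: sample rows and look at the stored type before writing predicates.
sample = db.execute(
    "SELECT amount, typeof(amount) FROM payments ORDER BY payment_id LIMIT 5").fetchall()

# Step 2: comparing the text column to a number compares '100.0' with '100'
# as text in SQLite, so the predicate silently matches nothing.
naive = db.execute("SELECT COUNT(*) FROM payments WHERE amount = 100").fetchone()[0]

# Casting the column to its real type first finds the row.
casted = db.execute(
    "SELECT COUNT(*) FROM payments WHERE CAST(amount AS REAL) = 100").fetchone()[0]

print(sample, naive, casted)
```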
 **Composition discipline**
 3. Build complex queries incrementally — one CTE at a time, verifying each
   layer's output on a small sample before stacking the next.
 4. Avoid joins that fan out row counts. Add columns only from tables already
   required by the grain, or pre-aggregate to the target grain before joining.
 **Window-function correctness**
 5. Give every ranking/ordering window function a complete, deterministic
   tie-breaker (append unique key columns), so `RANK`/`ROW_NUMBER`/`LAG`
   results are stable rather than flickering across runs.
 6. Apply row filters **after** window functions for sequence / "first" /
   "most recent" / "since" questions — compute over the full partition, then
   filter.
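The difference between filtering before and after a window function shows up on three rows. The sketch below uses an invented `events` table and asks, for each `view` event, on which day the user's previous event happened; the window orders by `day, event_id` so the ordering has a unique tie-breaker, as rule 5 asks.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE events (event_id INTEGER PRIMARY KEY, user_id INTEGER,
                         day INTEGER, kind TEXT);
    INSERT INTO events VALUES (1, 7, 1, 'view'), (2, 7, 2, 'cart'), (3, 7, 3, 'view');
""")

# WRONG: the WHERE clause runs before the window, so LAG only sees 'view' rows.
filter_first = db.execute("""
    SELECT event_id,
           LAG(day) OVER (PARTITION BY user_id ORDER BY day, event_id) AS prev_day
    FROM events
    WHERE kind = 'view'
    ORDER BY event_id
""").fetchall()

# RIGHT: compute LAG over the full partition, then filter the result.
window_first = db.execute("""
    WITH sequenced AS (
        SELECT event_id, kind,
               LAG(day) OVER (PARTITION BY user_id ORDER BY day, event_id) AS prev_day
        FROM events
    )
    SELECT event_id, prev_day FROM sequenced
    WHERE kind = 'view'
    ORDER BY event_id
""").fetchall()

print(filter_first)
print(window_first)
```

For event 3 the filter-first query reports day 1 as the previous event, skipping the `cart` event on day 2; the window-first query reports day 2.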
 **Numeric precision**
 7. Compute at full precision; round only in the final projection, never inside
   intermediate CTEs.
 8. Be explicit about truncation (`CAST AS INT` truncates; use explicit
   rounding when rounding is intended).
 9. Distinguish "average of per-group averages" (macro: `AVG(group_metric)`)
   from "overall/weighted average" (micro: `SUM(num)/SUM(den)`) based on the
   question's wording.
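The three precision rules each change the answer on tiny inputs. A minimal sketch on invented tables, run in SQLite (the function names `ROUND` and `CAST` behave as shown there; rounding modes can differ on other engines):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE line_items (amount REAL);
    INSERT INTO line_items VALUES (0.4), (0.4), (0.4);
    CREATE TABLE group_stats (grp TEXT, num REAL, den REAL);
    INSERT INTO group_stats VALUES ('a', 1, 2), ('b', 9, 10);
""")


def scalar(sql: str):
    """Run a single-value query and return that value."""
    return db.execute(sql).fetchone()[0]


# Rule 7: rounding inside the aggregate loses 1.2 entirely.
rounded_early = scalar("SELECT SUM(ROUND(amount)) FROM line_items")
rounded_late = scalar("SELECT ROUND(SUM(amount)) FROM line_items")

# Rule 8: a cast to integer truncates, explicit rounding does not.
truncated = scalar("SELECT CAST(2.7 AS INTEGER)")
rounded = scalar("SELECT ROUND(2.7)")

# Rule 9: average of per-group ratios versus the overall weighted ratio.
macro = scalar("SELECT AVG(num / den) FROM group_stats")
micro = scalar("SELECT SUM(num) / SUM(den) FROM group_stats")

print(rounded_early, rounded_late, truncated, rounded)
print(round(macro, 4), round(micro, 4))
```

Here the macro average is (0.5 + 0.9) / 2 and the micro average is 10 / 12, so the two readings of "average" differ by more than a tenth.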
 **Answer completeness / interpretation**
 10. "top / highest / most / lowest" → return only the winning row(s) (e.g.
    `RANK() = 1` / `QUALIFY`), not the full ranked list, unless a list is asked
    for.
 11. "for each X / per X / by X" → exactly one row per X; don't collapse to a
    single value unless the question says "overall" or "total across X".
 12. When a question asks for inputs and a derived value ("X, Y, and their
    ratio"), include the inputs as columns alongside the derived value.
 13. When grouping by a human-readable label (a name), also expose the entity's
    identifier — identity, not just the label, is part of the result.
 14. When a result is unexpectedly empty, relax filters one at a time to find
    which predicate removed the rows.
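Rules 10 and 13 interact when two entities share a label. The sketch below uses invented `customers` and `orders` tables in which two different customers are both named Alex; it is an illustration of the rules, not text from the shipped skill.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Alex'), (2, 'Alex'), (3, 'Sam');
    INSERT INTO orders VALUES (10, 1, 50), (11, 2, 50), (12, 3, 20);
""")

# Grouping by the label alone merges the two distinct customers named Alex.
by_label = db.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()

# "Top customer by spend": group by identity, then keep only rank 1 (ties included).
top_customers = db.execute("""
    WITH spend AS (
        SELECT c.customer_id, c.name, SUM(o.total) AS total_spend
        FROM customers c JOIN orders o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name
    ),
    ranked AS (
        SELECT spend.*, RANK() OVER (ORDER BY total_spend DESC) AS rnk FROM spend
    )
    SELECT customer_id, name, total_spend FROM ranked
    WHERE rnk = 1
    ORDER BY customer_id
""").fetchall()

print(by_label)
print(top_customers)
```

Grouping by name reports a single Alex with 100, which is not any real customer's spend; grouping by identifier returns both tied winners and omits the rest of the ranked list.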
 ## Acceptance criteria
 - The shipped `analytics/SKILL.md` contains the rules above, phrased as general
  truths with **no reference to any benchmark, gold answer, or scoring
  comparator**.
 - Existing interactive guidance (compact result tables, summaries,
  clarification prompts, the tool-order workflow) is preserved — the skill must
  still read well for an interactive human-facing analysis session.
 - None of the excluded items (output-shape contract, `MAX(date)` anchoring,
  grader-driven advice) appear.
 - Skill stays within a reasonable size; group the new rules under clear
  sub-headings so they're scannable.
 ## Benchmark context (motivation only)
 On the Spider 2.0-Lite sqlite subset, the solver produced 0 execution errors
 but ~50 result mismatches; a large share traced to exactly these gaps
 (premature rounding, string-vs-number compares, non-deterministic window
 ordering, returning full lists for "top" questions, dropping inputs to derived
 values). These are generic SQL-authoring defects — fixing them in the skill
 improves ktx for everyone and, as a side effect, the benchmark.
--- a/spider2-specs/done/08-per-dialect-sql-syntax-notes.md
+++ b/spider2-specs/done/08-per-dialect-sql-syntax-notes.md
@ -1,83 +0,0 @@
 # Per-dialect SQL syntax notes (dialect-aware, scoped to the connection)
 > Intake draft. Companion to `specs/07-analytics-skill-sql-craft.md`, which kept
 > the analytics SQL craft dialect-agnostic and explicitly deferred per-dialect
 > syntax here.
 ## Problem
 Spec 07 deliberately keeps the analytics SQL-authoring craft
 **dialect-agnostic** — every rule must read correctly on any engine. But a lot of
 *real* correctness depends on dialect-specific syntax that spec 07 excludes and
 defers to this follow-up:
 - **Snowflake:** `DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase identifiers,
  VARIANT colon-paths.
 - **BigQuery:** backtick FQTNs, `_TABLE_SUFFIX` for sharded tables, `QUALIFY`.
 - **sqlite:** `strftime`/`julianday` for dates, no `QUALIFY`.
 This guidance is genuinely useful to an agent writing SQL against a live
 database, but it must **not** pollute the flat dialect-agnostic skill — an agent
 querying sqlite should never see Snowflake VARIANT syntax. It belongs in a
 **dialect-aware** location, surfaced only for the dialect the active connection
 actually uses.
 ## Generic use case
 Any ktx project whose connections span more than one warehouse engine (e.g. a
 Snowflake warehouse + a BigQuery export + a local sqlite extract). When the agent
 writes SQL for a given connection, it should get that engine's syntax
 conventions — and nothing for the engines it isn't querying.
 ## Requirements
 1. **Per-driver dialect notes.** Author concise, correct syntax notes per
   supported driver: FQTN form, identifier quoting/case, date/time functions,
   top-N / window-filtering idiom, semi-structured access. These are genuine
   per-engine invariants, so enumerating them per driver is acceptable (unlike a
   denylist of bad specifics).
 2. **Scope to the active dialect, derived from state.** Which notes the agent
   sees must be selected from the connection's configured driver/dialect
   (`ktx.yaml` connections / the connector registry), not guessed and not shown
   all at once. The flat analytics skill stays dialect-agnostic (spec 07
   invariant preserved).
 3. **Delivery mechanism (enabling sub-requirement).** The shipped skill is
   installed as a **single `SKILL.md`** per target (`setup-agents.ts` /
   `readAnalyticsSkillContent`). Surfacing per-dialect notes on demand needs one
   of two approaches; the refinement pass should compare them before committing:
   - **Multi-file skill delivery** — bundle `reference/<dialect>.md` files and
     have the skill point to the one matching the connection. Requires extending
     `setup-agents.ts` to copy a skill *directory* (Claude Code, Codex, universal
     `.agents`) and a multi-file zip (Claude Desktop), a **flatten/concatenate
     transform** for the single-file targets (Cursor `.mdc`, OpenCode `.md`), and
     **per-file manifest entries** for clean uninstall. This is the
     install-mechanism improvement spec 07's Model section flags as future work.
   - **Dynamic MCP delivery** — an MCP surface returns the dialect hints for a
     given `connectionId` (the MCP layer already resolves the connection's
     dialect), so no install change is needed and Cursor/OpenCode get identical
     behavior. May be the lower-cost, more uniform path; weigh it first.
 4. **No dialect syntax leaks into the dialect-agnostic skill.** Spec 07's
   acceptance criterion (no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in
   `analytics/SKILL.md`) stays green. This work adds a *separate* dialect-aware
   channel; it does not amend the flat skill.
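Requirement 2 amounts to a lookup keyed by the connection's configured driver. A minimal sketch, assuming a parsed `ktx.yaml` exposes each connection with a `driver` key; that key name, the `DIALECT_NOTES` table, and `dialect_notes_for` are hypothetical, and the note text only restates the examples listed in the Problem section.

```python
DIALECT_NOTES = {
    "snowflake": [
        "Use DB.SCHEMA.TABLE fully qualified names.",
        "Double-quote lowercase identifiers.",
        "Access VARIANT fields with colon-paths.",
    ],
    "bigquery": [
        "Use backtick-quoted fully qualified names.",
        "Use _TABLE_SUFFIX for sharded tables.",
    ],
    "sqlite": [
        "Use strftime/julianday for dates.",
        "QUALIFY is not available.",
    ],
}


def dialect_notes_for(connection_id: str, connections: dict) -> list:
    """Return notes for the connection's own driver only, derived from config."""
    driver = connections[connection_id]["driver"]
    return DIALECT_NOTES.get(driver, [])


# Hypothetical stand-in for the connections section of a parsed ktx.yaml.
project_connections = {
    "warehouse": {"driver": "snowflake"},
    "local_extract": {"driver": "sqlite"},
}

print(dialect_notes_for("local_extract", project_connections))
print(dialect_notes_for("warehouse", project_connections))
```

Either delivery mechanism in requirement 3 could sit on top of a lookup like this; the sketch takes no position on which one is chosen.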
 ## Acceptance criteria
 - An agent querying a sqlite connection gets sqlite date idioms and never sees
  Snowflake/BigQuery-only syntax; an agent querying Snowflake gets
  FQTN/identifier/VARIANT guidance.
 - The dialect shown is **derived from the connection's configured driver**, not
  hardcoded per project and not guessed.
 - `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are
  unaffected.
 - Whichever delivery mechanism is chosen installs/serves correctly across **all**
  supported agent targets, including the single-file Cursor/OpenCode shape.
 ## Benchmark context (motivation only)
 The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake
 (`DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase cols, VARIANT colon-paths),
 BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite
 (`strftime`/`julianday`). That content is real and useful but engine-specific;
 spec 07 kept it out of the flat skill and deferred it here so the
 dialect-agnostic rules stay clean.
--- a/spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md
+++ b/spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md
@ -1,150 +0,0 @@
 # Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill
 ## Problem
 The `ktx-analytics` skill already carries a fan-out rule (spec 07, rule 4:
 *"Avoid fan-out joins — add columns only from tables already at the target
 grain, or pre-aggregate to that grain before joining; a join that multiplies
 rows quietly inflates every downstream `SUM`/`COUNT`"*). In practice the agent
 honors it on a single join but still **silently fans out on multi-hop join
 chains**, where the inflation is one or two joins removed from the aggregate and
 therefore much harder to notice.
 The failure shape: a metric that lives at a *coarse* grain (e.g. one row per
 parent record) is counted/summed *after* the parent has been joined down to a
 *finer* grain (e.g. one row per child line). Every parent-level value is then
 duplicated by its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by an
 amount that depends on the data — runnable SQL, plausible-looking number,
 quietly wrong.
 The rule today is stated as a *prohibition* ("avoid"). It needs to be a
 *detect-and-fix habit*: a concrete multi-hop example of the trap, and an active
 verification step the agent runs while composing, not just an instruction to be
 careful.
 ## Generic use case (independent of any benchmark)
 An analyst on any production warehouse asks: *"How many orders are there per
 region?"* where the path from region to the order's detail runs through several
 hops (region → store → order → order line). The honest answer counts each order
 once. If the query descends to the line-level table along the way (e.g. for a
 filter), each order is counted once **per line on the order**, inflating the
 per-region total. Attribution here is unambiguous — each order belongs to exactly
 one store and thus one region — so the *only* thing that can go wrong is the row
 multiplication, which is exactly what makes it a clean teaching case. This is one
 of the most common silently-wrong analytics mistakes on normalized schemas — it
 is not specific to any dataset, dialect, or benchmark.
 ## Requirements
 This extends the existing `<sql_craft>` "Composition" guidance in the
 `ktx-analytics` skill (spec 07). Additive only; keep it inline, dialect-agnostic,
 and stated as a heuristic-plus-why (consistent with spec 07's style).
 1. **Generalize the fan-out rule to multi-hop chains.** Make explicit that the
   danger is *cumulative*: any one-to-many hop on the path between the table that
   owns a measure and the aggregate inflates that measure, even when the
   offending join is several hops away from the `SUM`/`COUNT`. The fix is the
   same as the single-hop case — **pre-aggregate the measure to its own grain in
   a CTE, then join the already-aggregated result** — but the agent must apply it
   per measure-owning table along the whole chain, not just at the final join.
 2. **Add a verification habit, not just a prohibition.** While composing, the
   agent should confirm a join did not change the grain it intends to aggregate
   at — e.g. check that the row count (or the count of the aggregate's key) is
   unchanged across a join that is supposed to be one-to-one / many-to-one, and
 pre-aggregate the finer table to that grain when it is one-to-many. This is the same
   "build incrementally and check each layer" discipline spec 07 already endorses,
   pointed specifically at grain preservation.
   **Pre-aggregate is the general fix; `COUNT(DISTINCT)` is a count-only
   shortcut.** Pre-aggregating the finer table to the measure's grain in a CTE and
   then joining one-to-one is the remedy that works for every aggregate
   (`COUNT`/`SUM`/`AVG`). `COUNT(DISTINCT <key>)` is a valid one-liner *for counts
   only* — it must NOT be generalized to a fanned-out `SUM`/`AVG`, because two
   rows can legitimately hold equal amounts and `DISTINCT` would wrongly collapse
   them. State this trap explicitly; a naïve "just use `COUNT(DISTINCT)`" rule is
   silently wrong for sums.
 3. **One concrete, generic multi-hop example.** Include a short worked example
   that shows the inflation and the fix. It must use an **invented, generic
   schema** — **no benchmark table names, no benchmark SQL, and no benchmark
   result values** (see "Leak-safety" below — hard constraint). The example must:
   (a) use a **plain `COUNT`** (not an average) so it isolates the fan-out lesson
   and does not entangle the skill's separate *macro-vs-micro average* rule; and
   (b) use a chain with **unambiguous single-owner attribution** so the only thing
   that can go wrong is row multiplication. The intended example is the chain
   `regions → stores → orders → order_lines` answering *"how many orders per region
   include at least one backordered line"* — each order belongs to exactly one
   store and thus exactly one region, so attribution is clean; the line-level
   filter gives `order_lines` a genuine reason to be joined (so the fix is the
   pre-aggregate remedy, not "drop the join"), and that join sits **several hops
   below** the region-level COUNT (the multi-hop point):
   ```sql
   -- "How many orders per region include at least one backordered line?"
   -- (order_lines is genuinely needed here — for the backordered filter — so the
   --  fix is NOT "just drop the join".)
   -- WRONG: the order_lines join is one row per matching line, joined several hops
   -- BELOW the COUNT. An order with 3 backordered lines is counted 3 times, so the
   -- per-region total is inflated by backordered-lines-per-order — silently wrong.
   SELECT r.region_id, COUNT(*) AS n_orders
   FROM regions r
   JOIN stores s      ON s.region_id = r.region_id
   JOIN orders o      ON o.store_id  = s.store_id
   JOIN order_lines l ON l.order_id  = o.order_id AND l.is_backordered  -- one-to-many: fan-out
   GROUP BY r.region_id;
   -- RIGHT (general remedy): collapse the finer table to the measure's grain in a
   -- CTE FIRST, then join one-to-one so nothing multiplies. This same shape works
   -- for SUM/AVG, not just COUNT.
   WITH qualifying_orders AS (                 -- back to ONE row per order
     SELECT DISTINCT order_id FROM order_lines WHERE is_backordered
   )
   SELECT r.region_id, COUNT(*) AS n_orders
   FROM regions r
   JOIN stores s            ON s.region_id = r.region_id
   JOIN orders o            ON o.store_id  = s.store_id
   JOIN qualifying_orders q ON q.order_id  = o.order_id
   GROUP BY r.region_id;
   -- Count-only shortcut: COUNT(DISTINCT o.order_id) over the WRONG query also works
   -- HERE. But it is counts-only — a fanned-out SUM/AVG of a per-order measure (e.g.
   -- summing each order's shipping_fee after joining lines) must pre-aggregate;
   -- DISTINCT would wrongly merge two orders that happen to share the same fee.
   ```
 ## Leak-safety (hard constraint on this spec and its example)
 The benchmark's gold answers must never appear in ktx. The worked example must
 be a **synthetic, generic schema invented for teaching** — not the tables,
 column names, query, or numeric results of any Spider 2.0-Lite question. The
 example demonstrates the *pattern* (coarse-grain measure counted after a
 one-to-many join), which is universal; it must be reconstructable from first
 principles by anyone, with zero reference to benchmark data. A reviewer should
 be able to read the example and find nothing that ties it to a specific
 benchmark instance.
 ## Acceptance criteria
 - The skill's `<sql_craft>` Composition section states the multi-hop
  generalization of the fan-out rule and a grain-verification habit, inline and
  dialect-agnostic.
 - It includes exactly one short, **generic** worked example (wrong vs.
  pre-aggregated-right) using an invented schema, with no benchmark-derived
  identifiers or values.
 - No new tool, flag, or config; this is skill-content only (additive to spec 07).
 - Existing analytics-skill content tests are updated to cover the added rule's
  presence (mirroring spec 07's `analytics-skill-content.test.ts`).
 ## Benchmark context (motivation only)
 Multi-hop aggregation questions (counting/averaging a coarse-grained measure
 reached through several one-to-many joins) are a recurring source of
 result-mismatch failures in the SQLite subset: the agent produces runnable SQL
 with the right tables but a fan-out-inflated number. These are correctness
 failures, not knowledge or schema-discovery failures (zero execution errors in
 the latest run), so the fix belongs in the product's authoring craft — where it
 also helps any real analyst — not in a benchmark-specific prompt.
 ```
--- a/spider2-specs/done/10-panel-completeness-spine.md
+++ b/spider2-specs/done/10-panel-completeness-spine.md
@ -1,65 +0,0 @@
 # Panel/period completeness — emit the full set of groups, not only the populated ones
 ## Problem
 When a question asks for a result *per period* or *per category* ("orders for each
 month of 2023", "revenue by region", "count per status"), the natural `GROUP BY`
 only returns groups that actually have rows. Periods/categories with **zero**
 activity silently vanish, so a "12 months" answer comes back with 9 rows and the
 ones that should read `0` are simply absent. The agent writes runnable SQL with
 the right aggregate but an **incomplete panel**.
 This is a universal reporting correctness issue: a monthly report with missing
 months, or a category breakdown missing the empty categories, is wrong for any
 analyst — and it is also a frequent result-mismatch shape on the benchmark.
 ## Generic use case (independent of any benchmark)
 "How many orders were placed in each month of 2023?" must return **12 rows** even
 if March had no orders (March = 0), not 11 rows. "Sales per region" should include
 regions with no sales (as 0/NULL) when the question asks for *each* region.
 ## Requirements
 Additive to the `ktx-analytics` skill's `<sql_craft>` "Answer completeness /
 interpretation" group (consistent with spec 07's inline, dialect-agnostic, heuristic
 + why style).
 1. **Recognize "full-panel" phrasing.** Cues like *each / every / per <period> /
   for all <category> / by month* signal that the answer's row set should be the
   **complete** set of periods or categories in scope, not just those present in
   the filtered fact rows.
 2. **Build a spine, then LEFT JOIN.** Generate the full set of expected
   groups — a date/number series via a recursive CTE for periods, or the distinct
   dimension values from the authoritative dimension table for categories — and
   LEFT JOIN the aggregated facts onto it, defaulting missing measures with
    `COALESCE(metric, 0)` (or NULL when 0 would be wrong). *Why:* a plain
    `GROUP BY` over the fact rows can only emit groups that have at least one fact row.
 3. **Don't over-apply.** When the question asks only about groups that exist
   ("which months had orders"), the spine is unnecessary; the cue is *each/all*
   vs *which*.
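 A minimal sketch of the spine + LEFT JOIN + COALESCE recipe from requirement 2, for
 the "each month of 2023" use case above. It assumes a synthetic
 `orders(order_id, order_date)` table with ISO-8601 text dates and uses SQLite's
 `date()` / `strftime()`; other dialects would substitute their own series generator
 and date truncation. The CTE names are illustrative only.
 ```sql
 -- "How many orders were placed in each month of 2023?"
 WITH RECURSIVE months(month_start) AS (       -- spine: one row per month of 2023
   SELECT '2023-01-01'
   UNION ALL
   SELECT date(month_start, '+1 month')
   FROM months
   WHERE month_start < '2023-12-01'            -- stop after December is emitted
 ),
 monthly AS (                                  -- facts collapsed to the month grain
   SELECT strftime('%Y-%m-01', order_date) AS month_start,
          COUNT(*) AS n_orders
   FROM orders
   WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
   GROUP BY strftime('%Y-%m-01', order_date)
 )
 SELECT m.month_start,
        COALESCE(f.n_orders, 0) AS n_orders    -- empty months read 0, not absent
 FROM months m
 LEFT JOIN monthly f ON f.month_start = m.month_start
 ORDER BY m.month_start;
 ```
 The spine drives the row set, so the result has 12 rows whether or not every month
 has orders. For a category panel the same shape applies, with the spine read from
 the dimension table instead of generated.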
 ## Leak-safety (hard constraint)
 Any worked example must use a **synthetic generic schema** (e.g. an `orders`
 table with an `order_date`) and demonstrate only the *pattern* (spine + LEFT JOIN
 + COALESCE). No benchmark table names, SQL, or result values. The behavior is
 reconstructable from first principles and tied to no specific instance.
 ## Acceptance criteria
 - `<sql_craft>` states the full-panel cue, the spine + LEFT JOIN + COALESCE recipe,
  and the over-application guard — inline and dialect-agnostic.
 - At most one short generic example (recursive-CTE date spine or distinct-dimension
  spine), no benchmark-derived content.
 - Skill-content only; analytics-skill content tests updated to cover the rule.
 ## Benchmark context (motivation only)
 Per-period / per-category questions where some periods are empty produce
 short-row result mismatches in the SQLite subset. The fix is a universal
 reporting habit (complete panels), so it belongs in the product's craft, where it
 also helps real analysts — not in a benchmark-specific prompt. Related to spec 11
 (rolling/cumulative windows need a complete date spine to be correct).
--- a/spider2-specs/done/11-time-series-window-recipes.md
+++ b/spider2-specs/done/11-time-series-window-recipes.md
@ -1,73 +0,0 @@
 # Time-series window craft — running totals, rolling-N (min-periods), period-over-period
 ## Problem
 A large share of analytics questions are time-series shaped: a **running/cumulative
 balance**, a **rolling N-day average**, or **period-over-period growth**. The agent
 knows window functions exist (spec 07 covers determinism and window-then-filter) but
 gets the *time-series specifics* wrong:
 - cumulative balance computed without an unbounded preceding frame (or with the
  frame defaulting incorrectly when there are ties on the order key);
 - "rolling 30-day" implemented as `ROWS BETWEEN 29 PRECEDING AND CURRENT ROW` over
  **gappy** daily data, so the frame covers the wrong calendar span when days are missing;
 - no **minimum-periods** handling — a rolling average is reported before the window
  is actually full;
 - "growth vs previous period" without `LAG`, or comparing to the wrong neighbor.
 These are runnable-but-wrong; the structure is close, the edge case diverges.
 ## Generic use case (independent of any benchmark)
 - "Each account's month-end running balance over 2023" — cumulative sum of monthly
  net over an ordered window.
 - "30-day rolling average of daily revenue, only once 30 days of history exist."
 - "Month-over-month revenue growth rate."
 All three are bread-and-butter for any analyst on any time-series table.
 ## Requirements
 Additive to the `ktx-analytics` skill's `<sql_craft>` "Window functions" group
 (inline, dialect-agnostic, heuristic + why).
 1. **Cumulative / running total.** `SUM(x) OVER (PARTITION BY k ORDER BY t ROWS
   BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, with a complete tie-breaker in
    `ORDER BY` (spec 07 rule). *Why:* the default frame is `RANGE`-based, so with a
    non-unique `ORDER BY` every row that ties on the key is pulled into the frame and
    tied rows all report the same total.
 2. **Rolling window over time, not over rows.** When "rolling N days/months" is
   asked, the window must span a calendar range. Over gappy data, either build a
    complete date spine first (see spec 10) so `ROWS BETWEEN n-1 PRECEDING AND
    CURRENT ROW` equals the intended span, or use a `RANGE` frame or a self-join keyed
    on the date. *Why:* row-count frames over missing dates silently measure the
    wrong span.
 3. **Minimum periods.** When the question says "only after N periods of data" (or
   it is implied by a rolling metric), emit NULL/skip until the window is full
   (e.g. guard on `COUNT(*) OVER (...) = N`). *Why:* a partial early window is not
   the requested metric.
 4. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)`
   for prior-period comparisons; growth rate = `(cur - prev) / prev` computed at
   full precision (round only at the end). Guard divide-by-zero/NULL prev.
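 The four recipes above can be sketched on the synthetic schemas named under
 Leak-safety. This is an illustration under stated assumptions, not the skill's final
 wording: `txn_id` is a hypothetical unique column added only as a tie-breaker,
 `daily_revenue` is assumed to already hold exactly one row per calendar day (spine
 applied, see spec 10), and `strftime()` plus the named `WINDOW` clause are
 SQLite-flavored.
 ```sql
 -- (1) Running balance: explicit ROWS frame plus a complete tie-breaker.
 SELECT account_id, txn_date, net,
        SUM(net) OVER (PARTITION BY account_id
                       ORDER BY txn_date, txn_id   -- txn_id: assumed unique tie-breaker
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_balance
 FROM account_txns;
 -- (2) Rolling 30-day average with minimum periods. Because daily_revenue is assumed
 --     gap-free, 30 rows = 30 calendar days. NULL until the window is full.
 SELECT day,
        CASE WHEN COUNT(*) OVER w = 30 THEN AVG(amount) OVER w END AS rolling_30d_avg
 FROM daily_revenue
 WINDOW w AS (ORDER BY day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW);
 -- (3) Month-over-month growth: LAG for the prior period, full precision,
 --     NULL when the previous month is missing or zero.
 WITH monthly AS (
   SELECT strftime('%Y-%m', day) AS ym, SUM(amount) AS revenue
   FROM daily_revenue
   GROUP BY strftime('%Y-%m', day)
 )
 SELECT ym, revenue,
        (revenue - LAG(revenue) OVER (ORDER BY ym)) * 1.0
          / NULLIF(LAG(revenue) OVER (ORDER BY ym), 0) AS mom_growth
 FROM monthly;
 ```
 Rounding, if the question asks for it, belongs in an outer SELECT so the division
 itself runs at full precision.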
 ## Leak-safety (hard constraint)
 Worked examples must use a **synthetic generic schema** (e.g. `daily_revenue(day,
 amount)` or `account_txns(account_id, txn_date, net)`) and show only the *pattern*.
 No benchmark table names, SQL, or result values.
 ## Acceptance criteria
 - `<sql_craft>` "Window functions" gains the cumulative, rolling-over-time +
  min-periods, and period-over-period recipes — inline, dialect-agnostic.
 - At most one or two compact generic examples; no benchmark-derived content.
 - Skill-content only; analytics-skill content tests updated.
 ## Benchmark context (motivation only)
 Running-balance / rolling / period-over-period questions are the single largest
 result-mismatch cluster in the SQLite subset (financial-transactions style DBs).
 The methodology is universal analyst craft, so it belongs in the product's skill
 (transfers to real users), not in a benchmark-specific prompt. Depends on spec 10
 (date spine) for the gappy-rolling case.
--- a/spider2-specs/done/12-parse-text-encoded-numbers.md
+++ b/spider2-specs/done/12-parse-text-encoded-numbers.md
@ -1,61 +0,0 @@
 # Parse text-encoded numeric columns before doing math on them
 ## Problem
 Numeric measures are often stored as **text** with human formatting: unit suffixes
 (`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators (`"$1,200"`),
 percent signs (`"12%"`), or non-numeric sentinels for missing/zero (`"-"`, `"N/A"`,
 `""`). Aggregating or comparing such a column directly is silently wrong: string
 comparison orders `"100" < "9"`, and a naive `CAST(x AS REAL)` yields `0`, NULL, or a
 truncated prefix (SQLite reads `"1.2K"` as `1.2`) rather than the intended number.
 The agent already samples schemas (spec 07 schema-discovery), but when it sees a
 "numeric" column it tends to assume it is a real number type and skips the parse —
 so the arithmetic runs on garbage. Runnable, plausible, wrong.
 ## Generic use case (independent of any benchmark)
 A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become `1200 / 3000000
 / 0` before you can sum it or compute a daily change. A `price` stored as
 `"$1,299.00"` must become `1299.00` before averaging. This is routine data hygiene
 on real, messy production tables.
 ## Requirements
 Extend the `ktx-analytics` skill's `<sql_craft>` "Schema discovery before writing
 SQL" group (inline, dialect-agnostic, heuristic + why).
 1. **Detect text-encoded numerics during sampling.** When a column that the
   question treats as a number is stored as text, sample distinct values to learn
   the encodings actually present (suffixes, symbols, separators, sentinels) before
   composing — never assume the format from the column name.
 2. **Parse and scale before arithmetic.** Strip currency/separator/percent
   characters; multiply by the suffix scale (K=10^3, M=10^6, B=10^9); map sentinels
   (`-`, `N/A`, empty) to `0` or `NULL` per the question's intent; then cast to a
   numeric type. Do this in an early CTE so all downstream math sees clean numbers.
    *Why:* string columns compared/aggregated as-is sort lexically and cast to 0 or
    to a truncated numeric prefix, producing silently wrong results instead of errors.
 3. **Confirm coverage.** After parsing, sanity-check that no intended-numeric value
    failed to parse (a failure surfaces as NULL, or as `0` in engines whose casts never
    fail), to catch an encoding the sample missed.
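 A minimal sketch of detect → parse/scale → verify, assuming the synthetic
 `metrics(label, value_text)` table mentioned under Leak-safety and SQLite string
 functions (`SUBSTR` with a negative start, `GLOB`). The CTE names and the choice to
 map sentinels to `0` are illustrative; percent handling is left out and would follow
 the same strip-then-scale shape.
 ```sql
 WITH cleaned AS (                     -- strip symbols/separators, normalise case
   SELECT label, value_text,
          UPPER(REPLACE(REPLACE(TRIM(value_text), '$', ''), ',', '')) AS v
   FROM metrics
 ),
 split AS (                            -- separate the numeric body from the suffix scale
   SELECT label, value_text, v,
          CASE WHEN SUBSTR(v, -1) IN ('K', 'M', 'B')
               THEN SUBSTR(v, 1, LENGTH(v) - 1) ELSE v END AS body,
          CASE SUBSTR(v, -1)
               WHEN 'K' THEN 1e3 WHEN 'M' THEN 1e6 WHEN 'B' THEN 1e9
               ELSE 1 END AS scale
   FROM cleaned
 ),
 parsed AS (
   SELECT label, value_text,
          CASE
            WHEN v IN ('', '-', 'N/A') THEN 0                      -- sentinels (per intent)
            WHEN body = '' OR body GLOB '*[^0-9.]*' THEN NULL      -- unknown encoding
            ELSE CAST(body AS REAL) * scale
          END AS value_num
   FROM split
 )
 SELECT SUM(value_num)              AS total,
        COUNT(*) - COUNT(value_num) AS n_unparsed   -- coverage check: expect 0
 FROM parsed;
 ```
 Unknown encodings are forced to NULL explicitly because SQLite's `CAST` would
 otherwise turn them into `0` and hide them from the coverage check.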
 ## Leak-safety (hard constraint)
 Worked examples must use a **synthetic generic schema** and made-up values (e.g. a
 `metrics(label, value_text)` table with `"1.2K"`, `"-"`). No benchmark table names,
 SQL, or result values; the parsing pattern is universal and tied to no instance.
 ## Acceptance criteria
 - `<sql_craft>` schema-discovery gains the detect → parse/scale → verify guidance —
  inline, dialect-agnostic, with at most one short generic example.
 - No benchmark-derived content. Skill-content only; content tests updated.
 ## Benchmark context (motivation only)
 At least one SQLite-subset question stores trading volume as suffix-encoded text
 ("K"/"M", "-" for zero) and fails because the agent aggregates the raw strings. The
 fix — parse messy encodings before math — is universal data hygiene that helps any
 analyst, so it belongs in the product's craft rather than a benchmark-specific
 prompt.
--- a/spider2-specs/done/14-output-completeness-final-check.md
+++ b/spider2-specs/done/14-output-completeness-final-check.md
@ -1,105 +0,0 @@
 # Enforce answer-output completeness with a final pre-emit check in the analytics skill
 ## Problem
 The single largest correctness failure mode is **incomplete output**: the query runs and the
 methodology is roughly right, but the result is missing columns the question asked for. Three
 recurring sub-patterns:
 1. **Multi-part questions answered partially.** A question that asks for several things ("report
   the highest *and* the lowest month, each with its count and average, *and* the difference")
   comes back with only the first part — one column instead of the several requested.
 2. **Identity dropped.** Grouping by a human-readable name but not projecting the entity's
   identifier (e.g. a product name without its product id, a customer name without its
   customer id).
 3. **Inputs to a derived value dropped.** Returning a ratio / percentage / difference but not
   the underlying counts the question also asked for.
 Sub-patterns 2 and 3 are **already covered by `<sql_craft>` rules** in the analytics skill
 (spec 07: *"expose identity, not just the label"* and *"keep the inputs to a derived value"*),
 yet they are frequently **not applied**. So the gap is not missing knowledge — it is that these
 rules are passive heuristics buried in a list, and the agent doesn't reliably check them before
 finalizing. The fix is to (a) add the missing multi-part-completeness rule and (b) turn
 output-completeness into an **explicit final verification step** the agent performs before
 emitting SQL.
 This is reinforced by evidence that the failure is **model-independent**: a markedly stronger
 model produced the same incomplete-output mistakes on these questions, which means it is a
 craft/enforcement gap, not a capability gap.
 ## Generic use case (independent of any benchmark)
 An analyst is asked: *"For each region, report the highest and the lowest monthly order count,
 and the difference between them."* A complete, useful answer has five columns: the region's id,
 the region's name, the highest count, the lowest count, and the difference. Returning just
 the region and a single number answers only part of the request. This is a universal expectation
 on any database: answer **every** part of a multi-part request, identify the entities, and show
 the inputs behind any derived figure.
 ## Requirements
 Additive to the analytics skill's `<sql_craft>` "Answer completeness / interpretation" group and
 its workflow's validate step (inline, dialect-agnostic, heuristic + why, consistent with spec 07).
 1. **Multi-part / multi-output completeness (new rule).** When a question requests several
   outputs — a list ("A, B, and C"), paired extremes ("the highest *and* the lowest"), or a
   value plus its components ("X, Y, and their ratio") — the final projection must contain a
   column for **each** requested output. *Why:* answering only the first clause is the most common
   way a runnable query is still wrong; the grain and methodology can be perfect yet the answer
   is short by columns.
 2. **Fold the existing identity / inputs rules into the same completeness notion.** The
   already-shipped rules — project the entity **identifier** alongside any human-readable label,
   and **keep the inputs** to any derived value — are part of output completeness; reference them
   from the check below so they are actually applied, not just listed.
 3. **Add an explicit final completeness check (the enforcement mechanism).** Before emitting the
   final SQL, the skill should have the agent **re-read the question and confirm the projection
   covers**: every named metric/attribute; the identifier of every grouped/named entity; every
   input to a derived value; all at the grain the question specifies. This is a short, concrete
   checkpoint at the validate step — the point is to convert the passive heuristics into an active
   pre-finalize verification. (Do **not** add unrequested/extra columns to be "safe" — that is
   grader-gaming; the check is about matching the request exactly, not padding it.)
   Generic teaching example (synthetic schema — see Leak-safety):
   ```sql
   -- "For each region, report the highest and lowest monthly order count and their difference."
   -- WRONG: answers only the first clause; no region id, no lowest, no difference.
   SELECT region_name, MAX(monthly_orders) AS highest
   FROM region_monthly GROUP BY region_name;
   -- RIGHT: one column per requested output + the entity's identity, at the region grain.
   SELECT r.region_id, r.region_name,
          MAX(m.monthly_orders) AS highest_monthly_orders,
          MIN(m.monthly_orders) AS lowest_monthly_orders,
          MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference
   FROM regions r
   JOIN region_monthly m ON m.region_id = r.region_id
   GROUP BY r.region_id, r.region_name;
   ```
 ## Leak-safety (hard constraint)
 The example must use an **invented, generic schema** (`regions`, `region_monthly`) and made-up
 columns — **no benchmark table names, SQL, or result values.** It teaches the *pattern* (cover
 every requested output + identity + inputs), which is universal and tied to no specific instance.
 ## Acceptance criteria
 - The skill states the multi-part-completeness rule and a concrete **final completeness check**
  (re-read question → verify metrics + identity + inputs + grain), inline and dialect-agnostic,
  cross-referencing the existing identity/inputs rules so they're enforced.
 - Includes the over-projection guard (don't pad with extra columns — that's grader-gaming).
 - One short generic example (wrong vs complete); no benchmark-derived content.
 - Skill-content only; analytics-skill content tests updated to cover the new rule + check.
 ## Benchmark context (motivation only)
 In the latest SQLite-subset run, **incomplete output was the single largest failure bucket
 (~13 of 51 voted failures)**: multi-part questions answered partially, and identity / derived-value
 inputs dropped — the latter two being spec-07 rules that already exist but weren't applied. A
 probe with a much stronger model reproduced the *same* incomplete-output failures, confirming this
 is a craft-enforcement gap rather than a model-capability one. The fix — answer every requested
 part, identify entities, keep inputs — is universal analyst craft, so it belongs in the product
 skill (and transfers to real users), enforced as a final check rather than left as a passive hint.
--- a/spider2-specs/done/15-mcp-server-structured-logging.md
+++ b/spider2-specs/done/15-mcp-server-structured-logging.md
@ -1,116 +0,0 @@
 # Structured, leveled logging for the ktx MCP server
 > **Scope: observability only.** This spec is about *seeing* what the MCP server
 > does (which tool, what params, when, how long, outcome). *Preventing* a runaway
 > query from blocking the server (off-event-loop / interruptible query execution)
 > is a separate concern — see "Non-goals" and the sibling spec note below.
 ## Problem
 The ktx MCP server (`packages/cli/src/mcp-http-server.ts` +
 `mcp-server-factory.ts`; raw `node:http` + `@modelcontextprotocol/sdk`
 `StreamableHTTPServerTransport`) emits almost no operational logs. There is no
 server-side record of **which MCP tool was called, with what parameters, when,
 how long it took, or whether it succeeded** — nor of session open/close or
 transport errors. When a tool call is slow, hangs, or a client connection drops
 ("Transport channel closed"), an operator has no trail to diagnose it and must
 resort to process sampling / `lsof` / guesswork — and the offending input
 (e.g. the exact SQL) is typically unrecoverable.
 ## Generic use case
 Anyone running a long-lived ktx MCP server — a developer's local instance, a
 shared team server, or a hosted deployment — needs observability into tool-call
 activity to:
 - diagnose slow or hung tool calls (which `sql_execution` ran, against which
  connection, with what SQL, for how long);
 - explain client-visible connection failures from the server side (session
  lifecycle, transport-closed events);
 - audit what agents asked the server to do;
 - spot patterns (hot tools, slow connections, error rates).
 This is standard production-server hygiene; the server currently provides none.
 ## Requirements (sketch — refine when picked up)
 1. **One structured (JSON) logger, low overhead.** Suggested `pino` (orientation
   only; implementer owns the choice). A single shared instance; write **JSON to
   stdout** (12-factor — the launcher/aggregator routes it). No in-app file
   rotation. Optional human-readable pretty output only when attached to a TTY
   (dev).
 2. **Configurable level via env** (e.g. `KTX_LOG_LEVEL`, default `info`; `debug`
   for diagnosis) — verbose logging on demand without code changes.
 3. **Per-session / per-call context** via child loggers: every line carries a
   `sessionId` (from the transport session) and, for tool calls, a `callId` +
   `tool` name, so one session's or call's activity can be traced/grepped.
 4. **Tool-call logging — START logged BEFORE execution, COMPLETION after.** For
   every MCP tool invocation:
   - on entry: log `{ tool, params, sessionId, callId }` **before** running the
     handler (so the record exists even if the handler never returns);
   - on exit: log `durationMs` + outcome (ok with result size, or error with
     stack).
   This makes a **hung / never-returning call identifiable**: a start with no
   matching completion is the culprit, with its exact parameters and timestamp.
   This matters specifically because handlers like `sql_execution` run a
   *synchronous* better-sqlite3 query — a runaway query blocks the process and no
   completion is ever logged, so the start line (flushed before the blocking
   call) is the only record. For `sql_execution`, `params` should include the SQL
   text (the most useful field). Emit a **WARN** when a *completed* call exceeds a
   configurable slow threshold (e.g. `KTX_SLOW_TOOL_MS`).
 5. **Connection / session lifecycle:** log session open/close (with `sessionId`)
   and transport errors (the SDK's closed-channel / "Transport channel closed"
   events) so client-side connection failures have a server-side counterpart.
 6. **Error logging** with structured stack traces (a standard error serializer),
   not bare strings.
 7. **Light redaction — credentials only** (bearer token, connection
   passwords/secrets). SQL text and tool params are *not* secrets and must be
   logged. Do not over-redact.
 8. **Synchronous logging is fine.** The server uses a synchronous DB client, so
   logging need not be async; prefer the simpler synchronous stdout path over
   async/worker transports (which can lose buffered lines on a hard crash). Do
   not introduce async-logging machinery.
 ## Acceptance criteria (sketch)
 - With `KTX_LOG_LEVEL=debug`, invoking any MCP tool produces a `tool.start`
  (tool, params, sessionId, callId) and a `tool.end` (durationMs, outcome) line
  on the server's stdout, as JSON.
 - A tool call that never returns (e.g. a runaway `sql_execution`) leaves a
  `tool.start` line carrying its **exact SQL and timestamp** and **no**
  `tool.end` — so the offending query is recoverable from the log alone, with no
  process sampling.
 - A completed tool call slower than the configured threshold emits a WARN with
  its duration.
 - Session open/close and transport-closed events are logged with the `sessionId`.
 - At default level (`info`), routine per-tool lines are suppressed but lifecycle,
  slow-call warnings, and errors are present.
 - Credentials (bearer token, connection secrets) never appear in logs; SQL and
  tool params do.
 - No new heavy dependencies beyond the logger; no OpenTelemetry/metrics stack; no
  async-transport machinery.
 ## Non-goals
 - **Preventing/interrupting runaway queries** (off-event-loop execution, query
  timeouts, worker-thread isolation). That is a *separate* spec; a single
  synchronous query that fans out into a massive nested-loop join can peg the
  single-threaded server for hours and break new connections — observability
  surfaces *which* query, but the fix is execution-model work. (This logging is
  also a prerequisite for a future watchdog that detects a `tool.start` with no
  `tool.end` past a threshold and recycles the server.)
 - Metrics/tracing/OpenTelemetry exporters.
 - Forwarding logs to the MCP *client* via the protocol's logging capability
  (`notifications/message`, `logging/setLevel`) — a possible later enhancement,
  distinct from operational stdout logging.
 ## Benchmark context (motivation, not a requirement)
 During a concurrent Spider 2.0-Lite run against the MCP server, an
 adversarial-reviewer-generated query degenerated into a massive nested-loop join;
 synchronous better-sqlite3 executed it on the event loop, pegging a server at
 ~100% CPU for hours and breaking new MCP connections to it ("Transport channel
 closed"). We could not determine *which* query, because the server logs nothing
 about tool calls — diagnosis required `sample`/`lsof` on the live process and the
 exact SQL was never recovered. Structured tool-call logging (especially
 start-before-execute) would have turned this into a one-line `grep` of the server
 log.
--- a/spider2-specs/done/16-bounded-query-execution-timeout.md
+++ b/spider2-specs/done/16-bounded-query-execution-timeout.md
@ -1,131 +0,0 @@
 # Bounded query execution (deadline + non-blocking) for read SQL
 > Priority: HIGH. Found empirically during a Spider2-lite sqlite run
 > (2026-06-18): a single `sql_execution` MCP call wedged a worker at 100% CPU
 > for 13+ minutes and never returned. The query
 > `SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on the
 > `complex_oracle` sqlite database hit a VIEW (`costs ⋈ sales`, 918,843 × 82,112
 > rows, joined on a 4-column key with no composite index) whose plan degraded to
 > an O(N×M) nested-loop scan. Because the sqlite connector runs
 > `better_sqlite3 .all()` **synchronously with no timeout**, it blocked the MCP
 > worker's entire event loop: no `tool.end` was ever logged, the port went
 > unresponsive, and the query could not be cancelled. One of four eval shards
 > stalled until the worker was killed by hand.
 ## Problem
 Two compounding gaps on the read-query path:
 1. **No execution deadline.** A single expensive query runs unbounded. This is
   handled divergently per connector, with no shared contract: BigQuery has a
   real server-side job timeout (`job_timeout_ms`); ClickHouse has an HTTP
   `request_timeout`; Snowflake, Postgres, MySQL, and SQL Server bound only
   connection/pool *acquisition*, not statement *execution*; SQLite has nothing.
   So whether a runaway query is bounded depends entirely on which driver the
   caller happened to hit.
 2. **In-process engines block the event loop and can't be cancelled.** The
   sqlite connector executes on the main thread via synchronous
   `better_sqlite3 .all()`. A slow query freezes the whole MCP server (it can't
   serve other requests, send progress, or write `tool.end`), and there is no
   way to interrupt it: better-sqlite3 exposes no interrupt/cancel API — its
   documented mechanism for slow queries is to run them in a **worker thread**,
   and the only way to stop a runaway synchronous query is to terminate the
   thread executing it.
 The net effect is a query that produces a `tool.start` with no matching
 `tool.end`, an unresponsive server, and no self-recovery. A row cap (`maxRows`)
 does not help — it bounds returned rows, not scan work, and the failing query
 returned a single aggregate row.
 ## Generic use case
 Any data agent that lets an LLM author SQL will eventually issue an
 accidentally-expensive query — an unindexed or cartesian join, an expensive
 VIEW, a wide aggregate over a large fact table. A general-purpose context layer
 must bound that and return a clean, fast "query exceeded Ns" error so the agent
 can revise (add filters, query base tables, narrow the range) instead of hanging
 the tool and the server. This matters for embedded/local warehouses (sqlite,
 duckdb) and remote ones alike, and is wholly independent of any benchmark.
 ## Requirements
 1. Every read-query execution path (`executeReadOnly`) enforces a single
   canonical execution deadline. One opinionated default; **not** a per-call
   user flag. Where a driver already supports a per-connection timeout
   (BigQuery `job_timeout_ms`), reuse that as the per-connection override rather
   than inventing a parallel knob.
 2. On exceeding the deadline the path resolves with a `KtxQueryError`
   ("query exceeded {N}s") — a finite, decision-reaching outcome, never an
   unbounded hang.
 3. The deadline is a **shared contract at the connector boundary**, defined once
   (on the `executeReadOnly` contract or a shared wrapper at the call site) so
   all drivers participate. Bring the existing divergent timeouts (BigQuery job
   timeout, ClickHouse request timeout) under this one contract instead of
   leaving parallel mechanisms.
 4. For in-process engines (sqlite today, any future embedded driver), execution
   MUST NOT block the MCP server event loop. Run the query off the main thread
   and enforce the deadline by terminating that thread on timeout (the
   better-sqlite3-documented approach, since synchronous queries are
   uncancellable in-thread). The event loop must stay responsive so `tool.end`
   is always written and concurrent requests on the same port are served.
 5. Prefer real cancellation over client-side give-up. Where the engine supports
   a server-side statement timeout (Postgres `statement_timeout`, MySQL
   `max_execution_time`, Snowflake `STATEMENT_TIMEOUT_IN_SECONDS`, ClickHouse
   `max_execution_time`, BigQuery job timeout, SQL Server request timeout), set
   it so the deadline actually stops work, not merely abandons the promise while
   the query keeps running. For in-process engines, thread termination is the
   cancellation.
 6. The MCP `sql_execution` tool surfaces the timeout as an expected error
   (classified as `KtxQueryError`, not a `$exception` fault, consistent with
   existing expected-error classification) and logs a `tool.end` with the error
   outcome.
 7. Read-only enforcement (`assertReadOnlySql`) and the `maxRows` row cap remain
   unchanged. The deadline is additive; `maxRows` is not a substitute for it.
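 For orientation on requirement 5, these are the session-level statements that set the
 server-side timeouts named there. Each line targets a different engine (this is not
 one script), and the 30-second value is purely illustrative, not the default this
 spec will choose. The BigQuery job timeout and the SQL Server request timeout are
 set through client/job configuration rather than SQL, so they are not shown.
 ```sql
 -- Postgres: cancels any statement in this session that runs longer than 30 s.
 SET statement_timeout = '30s';
 -- MySQL: value is in MILLISECONDS and applies to read-only SELECT statements.
 SET SESSION max_execution_time = 30000;
 -- Snowflake: value is in seconds.
 ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 30;
 -- ClickHouse: value is in seconds.
 SET max_execution_time = 30;
 ```
 The unit mismatch (milliseconds for MySQL, seconds elsewhere) is one reason to define
 the deadline once at the connector boundary and translate per driver.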
 ## Acceptance criteria
 - A read query that exceeds the deadline returns a `KtxQueryError` within
  roughly the deadline; the MCP worker stays responsive (a concurrent tool call
  on the same server completes while the slow query is still pending) and writes
  a matching `tool.end` with a non-ok outcome.
 - sqlite specifically: executing a deliberately pathological query (e.g. an
  expensive VIEW or an unindexed cross join) on a fixture does not block the
  event loop, is terminated at the deadline, and CPU returns to idle afterward
  (the off-main-thread executor is killed, not left spinning).
 - No regression: normal fast queries return identical results; read-only
  rejection still works; `maxRows` still bounds returned rows.
 - Tests cover the deadline path for at least the in-process driver (sqlite,
  terminate-on-deadline) and one server-side-timeout driver.
 ## Benchmark context (motivation only)
 The Spider2-lite local set loads several warehouses into sqlite, some with
 expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` =
 `costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112
 rows, no composite index, with `promo_id` (the index the optimizer picks) being
 95.5% a single value. LLM-authored profiling queries (MIN/MAX/COUNT over such a
 view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval
 shard for 10+ minutes; with one, the agent gets a fast error and can scope the
 query instead.
 ## Orientation hints (code pointers; may have drifted)
 - Shared contract: `packages/cli/src/context/scan/types.ts` —
  `KtxScanConnector.executeReadOnly` (~343), `KtxReadOnlyQueryInput` (~285).
 - MCP call site: `packages/cli/src/context/mcp/local-project-ports.ts:70`
  (`connector.executeReadOnly`); tool registration in
  `packages/cli/src/context/mcp/context-tools.ts`.
 - In-process sync execution (the acute hang):
  `packages/cli/src/connectors/sqlite/connector.ts:311-313`
  (`better_sqlite3 .prepare().all()`).
 - Existing divergent timeouts to unify: `connectors/bigquery/connector.ts`
  (`job_timeout_ms` / `jobTimeoutMs`), `connectors/clickhouse/connector.ts:602`
  (`request_timeout`), `connectors/snowflake/connector.ts:342` (test/pool only),
  `connectors/postgres/connector.ts`, `connectors/mysql/connector.ts`,
  `connectors/sqlserver/connector.ts` (pool/connection only).
 - Error class: `packages/cli/src/errors.ts:25` (`KtxQueryError`).
 - better-sqlite3 (context7 `/wiselibs/better-sqlite3`, v12.x): no
  interrupt/cancel API; `docs/threads.md` documents the worker-thread pattern
  for slow queries (master owns worker lifecycle and respawns on exit) — extend
  it with terminate-on-deadline to enforce the timeout.
--- a/spider2-specs/done/18-bigquery-cross-project-datasets.md
+++ b/spider2-specs/done/18-bigquery-cross-project-datasets.md
@ -1,68 +0,0 @@
 # 18 — BigQuery cross-project dataset support (introspect foreign-hosted datasets, bill in own project)
 **Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/18-…`.
 ## Problem (generic, real-world)
 Analysts routinely query datasets that live in a **different** BigQuery project than the one
 they bill jobs to — Google's `bigquery-public-data`, a partner's shared project, an
 organization's central data project, etc. To make those connectable in ktx (so `discover_data`,
 the semantic layer, dictionary sampling, and `sql_dialect_notes` work), ktx must be able to
 **introspect a dataset hosted in a foreign project while running/billing jobs in the
 credentials' own project**.
 Today it can't. ktx's BigQuery connector derives a single `projectId` from
 `credentials.project_id` and uses it for **both** job billing **and** schema introspection:
 - `connectors/bigquery/connector.ts:294` — `projectId` is read only from `credentials.project_id`;
  there is no separate billing-vs-dataset project knob.
 - `:544` (`introspectDataset`) — calls `this.getClient().dataset(datasetId)`, which resolves the
  dataset **in the client's (billing) project**, and labels every table `catalog: this.resolved.projectId`.
 - `:453` (`listTables`) — queries `\`${projectId}\`.\`region-…\`.INFORMATION_SCHEMA.TABLES`, i.e. the
  **billing** project's INFORMATION_SCHEMA.
 - `:163` (`datasetIds()`) — returns `dataset_ids` verbatim; it never parses a `project.` prefix.
 So a `dataset_id` naming a dataset in another project can't be introspected, even though querying
 it works fine (cross-project reads bill to the caller's project — that path already works).
 ### Empirical confirmation
 With a service account in project `ktx-spider2-lite`:
 - ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (looks in
  `projects/ktx-spider2-lite/datasets/austin_311`).
 - The cross-project form `DatasetReference("bigquery-public-data","austin_311")` → **succeeds**
  (lists the public tables; public metadata is readable by any authenticated principal).
 - There is **no config knob** to separate the introspection project from the billing project.
 ## Requirement
 The BigQuery connector must accept **fully-qualified `project.dataset` entries** in `dataset_ids`
 (a single connection may span more than one source project), and for each:
 - **introspect** via the *dataset's* project — `client.dataset(id, { projectId })` /
  `DatasetReference(project, dataset)`, query the **dataset project's** `INFORMATION_SCHEMA`, and
  label the table `catalog` with the dataset's project;
 - **run jobs / bill** in `credentials.project_id` (unchanged).
 A bare `dataset` (no `project.`) keeps today's behavior (resolve in `credentials.project_id`), so
 existing single-project connections are unaffected.
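
The entry-resolution rule above can be sketched as a small pure function. This is an illustration under stated assumptions, not the connector's code: `resolveDatasetTarget` and `DatasetTarget` are hypothetical names, and the split on the last dot assumes that BigQuery dataset IDs never contain a dot while project IDs may.

```typescript
interface DatasetTarget {
  projectId: string; // project used for introspection and the `catalog` label
  datasetId: string;
}

// `billingProjectId` stands for credentials.project_id; jobs still bill there.
function resolveDatasetTarget(entry: string, billingProjectId: string): DatasetTarget {
  const trimmed = entry.trim();
  const dot = trimmed.lastIndexOf(".");
  if (dot === -1) {
    // Bare dataset: keep today's behavior and resolve in the billing project.
    return { projectId: billingProjectId, datasetId: trimmed };
  }
  const projectId = trimmed.slice(0, dot);
  const datasetId = trimmed.slice(dot + 1);
  if (projectId === "" || datasetId === "") {
    throw new Error(`invalid dataset_ids entry: "${entry}"`);
  }
  return { projectId, datasetId };
}
```

With credentials in `ktx-spider2-lite`, the entry `bigquery-public-data.austin_311` resolves to the foreign project for introspection, while a bare `austin_311` still resolves to `ktx-spider2-lite`.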
 ## Acceptance
 - `dataset_ids: ['bigquery-public-data.austin_311']` (credentials in a *different* project) →
  `ktx ingest <conn>` introspects the tables, enriches, and samples values; `discover_data` /
  `dictionary_search` return them.
 - A connection mixing `['bigquery-public-data.x', 'other-project.y']` introspects both.
 - `sql_execution` of a fully-qualified `project.dataset.table` query still runs and bills in
  `credentials.project_id`.
 - Single-project `dataset_ids: ['my_dataset']` behaves exactly as before (no regression).
 ## Benchmark context (motivation only — do not encode benchmark specifics)
 Spider 2.0-Lite's **BigQuery slice (205 questions)** **cannot otherwise be served faithfully**: every
 one of its ~74 logical databases groups datasets hosted in foreign public projects
 (`bigquery-public-data`, `isb-cgc-bq`, `data-to-insights`, …), never in a project we own. Query
 execution already works cross-project (proven), but ktx-only *discovery* (the whole point of the
 faithful surface) is blocked because the connector can't introspect them. Scope is small: of 74
 BQ dbs only **1** spans more than one source project, so "let `dataset_ids` carry `project.dataset`
 and introspect each in its own project" covers the benchmark and the general case alike. This is
 the sole blocker for the BigQuery leaderboard slice (the Snowflake slice needed no connector
 change and is already baselined).
--- a/spider2-specs/done/19-durable-bounded-relationship-detection.md
+++ b/spider2-specs/done/19-durable-bounded-relationship-detection.md
@ -1,89 +0,0 @@
 # 19 — Durable, resumable, bounded relationship detection during ingest enrichment
 **Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/19-…`.
 ## Problem (generic, real-world)
 Ingest enrichment runs three stages in a fixed order inside `runLocalScanEnrichment`
 (`packages/cli/src/context/scan/local-enrichment.ts`):
 1. `descriptions` (`:530`) — per-table LLM descriptions (the expensive step: one model call per
   table; on a large schema this is minutes of paid LLM work).
 2. `embeddings` (`:559`) — column embeddings.
 3. `relationships` (`:593`) — FK/join discovery: profiles a row sample of **every** table, then
   validates candidate joins.
 The queryable semantic-layer artifacts are persisted **once, at the very end**, by
 `writeLocalScanEnrichmentArtifacts` in `local-scan.ts:510` — which runs **after**
 `runLocalScanEnrichment` returns, i.e. after all three stages.
 This creates three failure modes that compound on large schemas (hundreds of tables):
 1. **Enrichment is lost if relationship detection is interrupted.** The descriptions + embeddings
   are computed and held in memory, but they only reach the durable, queryable artifacts when the
   final write runs after the `relationships` stage. If the process is killed/crashes/times out
   **during** relationship detection (the last, slowest, silent stage), the artifacts are never
   written — the schema survives (it was written earlier at `local-scan.ts:473`) but **all the
   paid LLM enrichment is discarded**. Empirically: ingesting a 95-table BigQuery dataset produced
   full descriptions + embeddings (progress reached "Building embeddings 17/17"), then the
   relationships stage ran silently past a supervising deadline and was killed — the persisted
   `_schema` had **0** AI descriptions, only the native column comments. Every larger dataset hits
   this, so the most expensive work is the most likely to be thrown away.
 2. **Re-running does not resume — it re-spends.** There is a stage state store
   (`SqliteLocalScanEnrichmentStateStore`) and a `runEnrichmentStage` helper (`:413`) that saves
   each completed stage's output. But the completed-stage lookup keys on **`runId`**
   (`findCompletedStage({ runId, stage, inputHash })`, `:427`), and `runId` is fresh per ingest
   invocation. So resume only works *within* a single run; re-running an interrupted ingest gets a
   new `runId`, misses the cache, and **re-computes descriptions + embeddings from scratch**
   (re-paying for the LLM work that already succeeded).
 3. **Relationship detection is unobservable and unbounded.** The stage emits no progress between
   "Detecting relationships" and the final "Relationship detection found N accepted" — minutes of
   silence on a large schema. A supervisor watching for liveness cannot distinguish a
   slow-but-working profile from a true hang, and there is no internal time/work budget, so on a very large
   schema it can run far longer than any reasonable deadline.
 ## Requirements
 1. **Checkpoint queryable artifacts before relationship detection.** Persist the descriptions +
   embeddings into the semantic-layer artifacts as soon as the `embeddings` stage completes, before
   the `relationships` stage runs. Relationship detection then appends/merges its own artifact on
   completion. Net: the expensive LLM + embedding enrichment is **always durable and queryable**,
   even if relationship detection fails, is interrupted, or is skipped. (A failed/partial
   relationship stage should degrade to "no/partial joins", never to "no descriptions".)
 2. **Make stage resume work across runs.** Resolve a completed stage by stable content identity
   — `(connectionId, stage, inputHash)` — independent of `runId`, so re-running an interrupted
   ingest resumes the finished `descriptions`/`embeddings` stages from cache and only re-runs what
   actually failed (e.g. `relationships`). Re-running after an interruption must not re-spend LLM
   credits on stages that already succeeded.
 3. **Make relationship detection observable and bounded** (mirrors spec 16's bounded query
   execution). Emit progress through the existing progress port — e.g. "Profiling table K/N",
   "Validating candidate K/M" — so liveness is visible. Enforce an overall time/work budget
   (configurable, e.g. under `scan.relationships`) so on a very large schema the stage stops
   gracefully and returns the relationships found so far (partial) rather than running unboundedly.
   Partial completion is persisted (per requirement 1) and marked as such.
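
Requirement 2 amounts to dropping `runId` from the lookup key. A minimal in-memory sketch of that keying follows; `InMemoryStageStore` and `runStage` are hypothetical stand-ins (the real store is `SqliteLocalScanEnrichmentStateStore`, and real stages are asynchronous), kept synchronous here for brevity.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";

interface StageKey {
  connectionId: string;
  stage: Stage;
  inputHash: string;
}

// Hypothetical store: completed stages are keyed on content identity only.
class InMemoryStageStore {
  private readonly completed = new Map<string, unknown>();

  private static keyOf(key: StageKey): string {
    // Note what is absent: no runId, so a later run can find this entry.
    return JSON.stringify([key.connectionId, key.stage, key.inputHash]);
  }

  save(key: StageKey, output: unknown): void {
    this.completed.set(InMemoryStageStore.keyOf(key), output);
  }

  findCompletedStage(key: StageKey): unknown | undefined {
    return this.completed.get(InMemoryStageStore.keyOf(key));
  }
}

// Returns the cached output when the same inputs already completed,
// otherwise computes, saves, and returns the fresh output.
function runStage<T>(store: InMemoryStageStore, key: StageKey, compute: () => T): T {
  const cached = store.findCompletedStage(key);
  if (cached !== undefined) return cached as T;
  const output = compute();
  store.save(key, output);
  return output;
}
```

A second ingest invocation that calls `runStage` with the same `(connectionId, stage, inputHash)` gets the saved output without calling `compute`, which is the "resume, do not re-spend" behavior; a changed `inputHash` still forces a recompute.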
 ## Acceptance
 - Interrupting an ingest **during** relationship detection still leaves a queryable semantic layer
  with the table/column descriptions + embeddings that were generated (verified: re-open the
  connection, descriptions are present).
 - Re-running an interrupted ingest **does not** regenerate descriptions/embeddings whose stage
  already completed (verified: no LLM description calls for the cached tables; only the failed
  stage re-runs).
 - A connection with hundreds of tables emits relationship-stage progress and completes within the
  configured budget, persisting partial relationships if the budget is hit — without discarding
  enrichment.
 - Small/single-run ingests behave exactly as before (no regression in artifacts or relationship
  output when nothing is interrupted).
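
The budget-and-progress criterion above can be illustrated with a cooperative loop that checks the budget between tables and returns what it has so far. This is a sketch under assumptions: `profileWithinBudget` and `BudgetedResult` are hypothetical names, the real stage is asynchronous, and the clock is injectable only to make the behavior easy to test.

```typescript
interface BudgetedResult<T> {
  items: T[];
  partial: boolean; // true when the budget ran out before all tables were seen
  processed: number;
}

function profileWithinBudget<T>(
  tables: string[],
  profileTable: (table: string) => T,
  budgetMs: number,
  onProgress: (message: string) => void,
  now: () => number = Date.now,
): BudgetedResult<T> {
  const startedAt = now();
  const items: T[] = [];
  for (let i = 0; i < tables.length; i++) {
    if (now() - startedAt >= budgetMs) {
      // Stop gracefully and keep the work done so far.
      return { items, partial: true, processed: i };
    }
    onProgress(`Profiling table ${i + 1}/${tables.length}`);
    items.push(profileTable(tables[i]));
  }
  return { items, partial: false, processed: tables.length };
}
```

Because the check happens between tables, this bounds the stage only to the granularity of one `profileTable` call; a single pathological profile still needs the enforced query deadline from spec 16 underneath it.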
 ## Benchmark context (motivation only — do not encode benchmark specifics)
 The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables (`ebi_chembl`
 785, `fec` 486, `ga360` 366, …). Enriching them with claude-code costs real, rate-limited LLM
 budget; losing that enrichment to a relationship-stage interruption — and re-spending it on every
 retry — makes large-schema ingest impractical. This is a general durability/cost property of the
 ingest pipeline, independent of the benchmark; the benchmark only made it acute at scale.
--- a/spider2-specs/done/20-resilient-enrichment-under-slow-llm.md
+++ b/spider2-specs/done/20-resilient-enrichment-under-slow-llm.md
@ -1,101 +0,0 @@
 # 20 — Resilient enrichment under a slow/hung LLM backend
 **Status:** draft (intake). Requirement-level; the implementer refines into `specs/20-*.md`.
 This is the **enrichment-stage** analog of two already-shipped specs:
 - spec 16 (bounded query execution) — bound *and actually cancel* a runaway read query (child-thread/process kill, not a cosmetic JS deadline);
 - spec 19 (durable/bounded relationship detection) — checkpoint expensive ingest work so an interruption doesn't lose it.
 Spec 16 hardened the **read-query** path and spec 19 checkpointed at **stage boundaries**. The same two
 weaknesses still exist *inside the descriptions enrichment stage*, and together they turned a single hung
 table into an indefinite wedge plus total loss of an entire stage's LLM work.
 ## Problem / requirement
 Two compounding gaps on the per-table description-enrichment path, observed end-to-end:
 ### 1. The per-table LLM timeout does not actually terminate the work
 The per-table `generateObject` enrichment call is wrapped in `retryAsync` with a fresh
 `AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (ktx commit `01f63380`). When the LLM
 backend is a **subprocess** (the `codex` backend spawns a child `codex` process; `claude-code` likewise
 spawns a child) and that child **hangs with an open connection to the provider** (TCP ESTABLISHED, ~0%
 CPU, no bytes flowing), the JS-level `AbortSignal` fires but **does not kill the child process or unblock
 the await** — so the call sits *past* its own timeout indefinitely.
 Observed (BigQuery ingest, codex backend, 2026-06-23): with `KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min),
 two of `covid19_usa`'s widest tables (252 columns) hung; the stage sat at **268/285 for 41+ minutes** —
 well past the 30-min per-attempt timeout — with exactly two codex children, each holding 3 ESTABLISHED
 connections at ~0% CPU, until killed by hand. The timeout was cosmetic: it never terminated the hung
 child. (This is precisely the failure mode spec 16 fixed for SQL — a deadline that fires in JS but cannot
 interrupt the underlying work — applied to the enrichment LLM call instead of the query.)
 **Requirement:** the per-table enrichment-call timeout must be **enforced**, not advisory — when it fires,
 the in-flight work is actually cancelled (subprocess SIGKILL for process-backed providers; request abort
 for HTTP-backed ones) and the call returns/throws *promptly* so the stage can proceed (skip the table per
 the existing no-retry-on-timeout policy). A hung table must cost at most ~one timeout, never unbounded
 wall-clock. Provider-agnostic: it must hold for `codex`, `claude-code`, and HTTP backends alike.
 ### 2. Descriptions are checkpointed only at full-stage completion, so a few bad tables lose all the good ones
 Spec 19 persists the descriptions checkpoint **after the descriptions stage completes** (before
 relationships). There is no *within-stage* persistence: while the stage runs, every enriched table's
 description lives only in memory. So if the stage cannot complete — e.g. 2 tables out of 285 hang (gap #1),
 or the process is killed, or it hits the stall watchdog — **all** the already-enriched tables are lost,
 even though their (expensive) LLM descriptions were finished.
 Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but **0** rows in
 `local_scan_enrichment_stages` and **0** `ai:` descriptions on disk; killing the wedged ingest discarded
 all 283, forcing a from-scratch re-ingest. The cost of 2 pathological tables was 283 tables' worth of
 redone LLM calls.
 **Sharper observation (re-ingest with a short, enforced timeout):** even when the stage *does* run to
 the end — the 2 hung tables hit a 4-min timeout and were skipped, so 283/285 descriptions were generated
 and the ingest reported success (`Scan completed` / `Ingest finished`, embeddings built, exit 0) — the
 descriptions were **still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the discard is **not** just
 "lost on kill": a stage that completes with *any* skipped/aborted table currently persists **nothing**,
 throwing away every successfully-generated description. The skip must be graceful — a skipped table costs
 one missing description, not the entire stage's output. (This is the strongest argument for per-table
 incremental persistence: the 283 good descriptions should have been durable the moment each was produced.)
 **Requirement:** persist enriched descriptions **incrementally** (per-table or per-batch) during the
 descriptions stage, so that (a) tables that finished are durable even if the stage never completes, and
 (b) a resumed ingest re-does only the *unfinished* tables, not the whole stage. The existing additive-write
 design (spec 19 already preserves existing descriptions on re-ingest) is the foundation; this extends the
 checkpoint granularity from once-per-stage to incremental.
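
The incremental-persistence requirement can be sketched as a loop that skips already-persisted tables, tolerates per-table failures, and flushes in batches. Everything named here is hypothetical: `describeIncrementally`, the `describe` callback (standing in for the per-table LLM call), and the `flush` callback (standing in for the additive write) are illustrations, not ktx APIs.

```typescript
type Descriptions = Map<string, string>;

async function describeIncrementally(
  tables: string[],
  persisted: Descriptions, // descriptions that are already durable
  describe: (table: string) => Promise<string>, // stand-in for the LLM call
  flush: (batch: Descriptions) => Promise<void>, // stand-in for the additive write
  batchSize: number,
): Promise<{ described: string[]; skipped: string[] }> {
  const described: string[] = [];
  const skipped: string[] = [];
  let batch: Descriptions = new Map();
  for (const table of tables) {
    if (persisted.has(table)) continue; // resume: finished tables are not redone
    try {
      batch.set(table, await describe(table));
      described.push(table);
    } catch {
      skipped.push(table); // a failed table costs one description, not the stage
    }
    if (batch.size >= batchSize) {
      await flush(batch);
      batch = new Map();
    }
  }
  if (batch.size > 0) await flush(batch);
  return { described, skipped };
}
```

With this shape, killing the process loses at most one unflushed batch, and a rerun that passes the persisted map as `persisted` calls `describe` only for tables that never finished.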
 ## Sketch (implementer to refine)
 - **Enforced timeout:** route enrichment-call cancellation through real termination: kill the
  codex/claude-code child process on timeout (reuse spec 16's child-kill mechanism), abort the HTTP request for
  network backends. A fired `AbortSignal` must guarantee the await settles within a bounded grace period.
 - **Sane default + the right tradeoff:** the default per-table timeout should be **moderate** (single-digit
  minutes) with a small retry count, not very large — because the cost of a *hang* is the timeout value
  itself, a long timeout is strictly worse for hangs. (The 30-min value used in the incident was an operator
  override chosen to avoid cutting off slow-but-completing wide tables; with #1 enforced and incremental
  checkpointing, a moderate default + skip is the better operating point.)
 - **Incremental persistence:** flush descriptions per-batch (e.g. every N completed tables or on a timer) to
  the same store/format used at stage completion; on resume, treat already-persisted tables as done and only
  enrich the remainder. Keep it idempotent and additive (don't clobber prior descriptions).
 - **Interaction with the stall watchdog:** with #1 enforced, no single table can starve progress for longer
  than ~one timeout, so an external stall watchdog stops being the only backstop.
 ## Generic use case (independent of the benchmark)
 Anyone ingesting a large or wide schema with an LLM enrichment backend (especially a *subprocess* backend,
 which is the common local/desktop setup) will eventually hit a table whose description call hangs — a
 provider stall, a rate-limit black-hole, a pathologically large prompt. Without an *enforced* timeout, one
 such table wedges the whole ingest indefinitely; without *incremental* persistence, any interruption throws
 away all the per-table LLM work already done (the dominant ingest cost). Both fixes make large-schema
 enrichment **resilient and resumable** — a few bad tables degrade to a few skipped descriptions, not a
 hung process and a from-scratch redo. This is core robustness for a general-purpose ingestion product,
 wholly independent of any benchmark.
 ## Benchmark context (motivation only — not a benchmark-specific rule)
 Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment backend). While the giant
 public datasets were being re-enriched, `covid19_usa` wedged at 268/285 for 41+ minutes on 2 hung 252-column tables; the
 30-min per-table `AbortSignal` timeout never killed the hung codex children, and because descriptions
 checkpoint only at stage completion, the 283 already-enriched tables were unrecoverable — the operator had
 to kill, cache-bust, and re-ingest the db from scratch (with a short timeout as a stopgap). The benchmark
 just exercised a large/wide multi-dataset ingest at scale; the gap and the fix are generic.
--- a/spider2-specs/done/21-selective-enrichment-stages.md
+++ b/spider2-specs/done/21-selective-enrichment-stages.md
@ -1,91 +0,0 @@
 # 21 — Selective enrichment stages (`--stages`) + per-stage cache keys
 **Status:** draft (intake). Requirement-level; the implementer refines into `specs/21-*.md`.
 Follow-on to spec 19 (durable/resumable relationship detection) and spec 20 (resilient enrichment).
 Those made enrichment *survivable and resumable*; this makes it *selectively re-runnable* — re-run one
 enrichment stage without re-paying for the others.
 ## Problem / requirement
 Enrichment has three stages — **`descriptions`** (per-table LLM text), **`embeddings`**
 (sentence-transformers over the schema/descriptions), **`relationships`** (FK/join detection, optionally
 LLM-proposed). Today you cannot re-run a *subset* of them, and three facts in the current code make a
 targeted re-run impossible without a full, expensive re-enrich:
 1. **One coarse cache key gates all three stages.** `context/scan/local-enrichment.ts:611` computes a
   single `inputHash` from `{snapshot, mode, detectRelationships, providerIdentity, relationshipSettings}`,
   and all three stages reuse it (descriptions ~`:641`, embeddings ~`:672`, relationships ~`:728`). So
   changing *any* one stage's inputs invalidates *every* stage's cache. Concretely: flipping
   `scan.relationships.llmProposals`, switching the LLM backend, or upgrading the embeddings model forces
   ktx to re-run the **expensive per-table descriptions** even though they didn't conceptually change.
 2. **No CLI surface to select stages.** Internally, enrichment already supports a relationships-only
   path (`mode: 'relationships'`, which skips the description/embedding stages — they're gated on
   `mode === 'enriched'`), but `ktx ingest` exposes no flag to invoke it (only `--no-query-history`).
   The capability is built; it's just not reachable.
 3. **The per-stage storage already exists** (`local_scan_enrichment_stages` PK `(connection_id, stage,
   input_hash)`) and the **additive write already preserves existing descriptions** on re-ingest — so the
   foundation for "touch one stage, keep the rest" is in place; only the key granularity and the CLI
   surface are missing.
 **Requirement:** let an operator re-run a chosen subset of enrichment stages on already-ingested
 connection(s), recomputing only those stages and **preserving the others' artifacts untouched** — cheaply,
 without re-running unchanged (especially the costly `descriptions`) stages.
 ## Design decisions (resolved during intake; implementer may refine)
 - **CLI flag: `--stages <comma-list>`** (plural). Accepts a comma-separated subset of
  `descriptions,embeddings,relationships`; default = all three (current behaviour). Plural because it takes
  a *set*; `--stages relationships` and `--stages descriptions,embeddings` both read naturally, and the
  plural signals "list expected" (singular `--stage` implies exactly one). **Validate** the names — an
  unknown stage is an error, never silently ignored.
 - **Per-stage `inputHash`.** Split the single coarse hash so each stage keys on *only its own* inputs:
  - `descriptions` → `{snapshot, mode, providerIdentity}` (NOT relationship settings, NOT embedding model)
  - `embeddings`   → `{snapshot, embeddings model/provider, + the description text it embeds}`
  - `relationships`→ `{snapshot, relationshipSettings (incl. llmProposals), providerIdentity}`
  Then flipping `llmProposals` invalidates only `relationships`; swapping the embeddings model invalidates
  only `embeddings`; improving description prompts/LLM invalidates only `descriptions`.
 - **Preserve-others semantics.** Stages not named in `--stages` are left exactly as on disk (additive write,
  already the behaviour). A selective run never deletes another stage's artifacts.
 - **Downstream-staleness handling.** Stages have a dependency order (`descriptions → embeddings`;
  `relationships` depends only on the schema snapshot). Re-running `descriptions` alone can leave existing
  `embeddings` semantically stale (they embedded the old text). The run must **warn** when a selected
  re-run leaves an unselected downstream stage stale, and the operator can opt to cascade
   (`--stages descriptions,embeddings`). Do not silently leave a stale-but-unflagged downstream stage.
 - **`relationships` uses existing descriptions as context.** When re-running `relationships` only, the
  stage should read the existing enriched schema (incl. on-disk `ai:` descriptions) so `llmProposals` has
  full context — not just raw column names.
 - **Scope:** the three enrichment stages for now. Design the stage-name namespace so it can later extend to
  the broader scan phases (schema / query-history / source / memory) and subsume the inconsistent
  `--no-query-history` negative flag, but that unification is out of scope here.
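
The per-stage `inputHash` decision above can be illustrated by deriving one key per stage from only that stage's inputs. This is a sketch with assumed names: `EnrichmentInputs` and `stageInputKey` are hypothetical, the field names mirror the lists in the design decision, and the key is left as a canonical JSON string where a real implementation would presumably hash it.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";

// Hypothetical flattened view of the inputs named in the design decision.
interface EnrichmentInputs {
  snapshot: string; // e.g. a digest of the schema snapshot
  mode: string;
  providerIdentity: string; // LLM backend/model identity
  embeddingsModel: string;
  descriptionText: string; // the text the embeddings stage embeds
  relationshipSettings: { llmProposals: boolean };
}

// Each stage keys on its own inputs only; hashing the string is omitted here.
function stageInputKey(stage: Stage, inputs: EnrichmentInputs): string {
  switch (stage) {
    case "descriptions":
      return JSON.stringify([stage, inputs.snapshot, inputs.mode, inputs.providerIdentity]);
    case "embeddings":
      return JSON.stringify([stage, inputs.snapshot, inputs.embeddingsModel, inputs.descriptionText]);
    case "relationships":
      return JSON.stringify([
        stage,
        inputs.snapshot,
        inputs.relationshipSettings.llmProposals,
        inputs.providerIdentity,
      ]);
  }
}
```

Flipping `llmProposals` changes only the `relationships` key, and swapping `embeddingsModel` changes only the `embeddings` key, so the cached `descriptions` stage stays valid in both cases.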
 ## Sketch (implementer to refine)
 - Add `--stages` to `ktx ingest`; parse+validate into a stage set; thread it to the enrichment entry so it
  selects which stage blocks run (reuse the existing `mode`/stage gating — `mode: 'relationships'` is the
  precedent).
 - Replace the single `computeKtxScanEnrichmentInputHash` call with per-stage hash computation keyed on each
  stage's own inputs; gate each stage's resume/skip on its own hash.
 - Ensure selective runs read + preserve the on-disk enriched schema and write additively.
 - Emit a clear staleness warning when an unselected downstream stage is invalidated by a selected one.
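
The flag parsing and the staleness check in the sketch above could look roughly like this. `parseStages`, `staleDownstream`, and the `DOWNSTREAM` table are hypothetical names; the only dependency edge encoded is `descriptions → embeddings`, as stated in the design decisions.

```typescript
const STAGES = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof STAGES)[number];

// Parses the value of --stages; undefined means the flag was not given.
function parseStages(flag: string | undefined): Stage[] {
  if (flag === undefined) return [...STAGES]; // default: all three stages
  const names = flag
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  if (names.length === 0) throw new Error("--stages requires at least one stage name");
  const unknown = names.filter((name) => !(STAGES as readonly string[]).includes(name));
  if (unknown.length > 0) throw new Error(`unknown stage(s): ${unknown.join(", ")}`);
  // Return in canonical order, without duplicates.
  return STAGES.filter((stage) => names.includes(stage));
}

// Only dependency edge named in the spec: embeddings embed the description text.
const DOWNSTREAM: Record<Stage, Stage[]> = {
  descriptions: ["embeddings"],
  embeddings: [],
  relationships: [],
};

// Stages that a selective run would leave stale because they were not selected.
function staleDownstream(selected: Stage[]): Stage[] {
  const stale: Stage[] = [];
  for (const stage of selected) {
    for (const dependent of DOWNSTREAM[stage]) {
      if (!selected.includes(dependent) && !stale.includes(dependent)) stale.push(dependent);
    }
  }
  return stale;
}
```

A caller would warn when `staleDownstream(parseStages(flag))` is non-empty, for example after `--stages descriptions`, and stay silent for `--stages descriptions,embeddings`.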
 ## Generic use case (independent of the benchmark)
 Any team running ktx in production maintains its semantic layer over time: they improve description prompts
 or switch the description LLM, upgrade the embeddings model, or turn on LLM-proposed joins. Today each of
 those forces a **full re-enrich of every connection** — re-running the expensive per-table descriptions
 even when only embeddings or relationships changed. Selective `--stages` re-runs make these routine
 maintenance operations cheap and targeted: "re-embed everything on the new model" or "backfill joins now
 that llmProposals is on" become a single fast pass that leaves the untouched stages — and their cost —
 alone. This is core operability for a long-lived ingestion product and is wholly independent of any
 benchmark.
 ## Benchmark context (motivation only — not a benchmark-specific rule)
 Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A level-aware audit found (a) a
 tail of BigQuery dbs with poor *column*-description coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …)
 that want a **`descriptions`-only** re-run with a longer timeout, and (b) a desire to **backfill joins**
 across all already-ingested dbs after enabling `llmProposals` — without re-paying for descriptions. Both
 were blocked by the coarse single `inputHash` (flipping `llmProposals` or re-describing would invalidate
 the whole enrichment) and the absence of a stage-selective CLI flag. The benchmark just exercised
 large-scale multi-backend ingestion; the gap and the fix are generic.
--- a/spider2-specs/specs/01-connection-scoped-wiki.md
+++ b/spider2-specs/specs/01-connection-scoped-wiki.md
@ -1,300 +0,0 @@
 # Connection-scoped wiki pages
 > Refined spec. Intake draft: `todo/01-connection-scoped-wiki.md`.
 ## Problem
 Wiki pages have only two scopes today: `GLOBAL` and `USER`
 (`packages/cli/src/context/wiki/types.ts`, `WikiScope`). Scope is expressed by
 directory (`wiki/global/<key>.md`, `wiki/user/<userId>/<key>.md`) and the
 search path filters by loading only the in-scope pages before any lane runs.
 There is no way to associate a page with a **connection** (a warehouse/database
 defined under `connections:` in `ktx.yaml`).
 In a project with many connections this causes two distinct failures:
 1. **Cross-database relevance pollution.** All pages share one search index, so
   `wiki_search` for a generic term (`orders`, `revenue`, `average order
   value`) surfaces pages written about the wrong database. Concept names
   collide across databases constantly in real multi-connection projects
   (several databases each with `orders`, `customers`, …).
 2. **Silent overwrite on shared keys.** Page keys are a flat, global namespace.
   The write path resolves a repeated key to the existing file and updates it
   in place. So if the agent writes an `orders` page while ingesting database B
   and an `orders` page already exists for database A, B's content **overwrites
   A's** — same-concept pages for different databases cannot coexist today.
 Today, when `memory_ingest` is called with a `connectionId`, that id only
 scopes which semantic-layer sources the triage agent can see
 (`memory-agent.service.ts`); it is **not** persisted on the resulting wiki page
 and **not** validated against `ktx.yaml`.
 ## Generic use case
 Any org with multiple databases/warehouses in one **ktx** project: org-wide
 definitions ("fiscal year starts in February") should be visible everywhere,
 while database-specific conventions ("in the events DB, `user_id` is the
 anonymous device id, not the account id") should not pollute searches about
 other databases — and two databases that both have an `orders` concept must be
 able to keep separate, non-colliding pages.
 ## Model
 `connections` is **additive frontmatter metadata**, orthogonal to the existing
 `GLOBAL`/`USER` directory scope — not a third scope dimension:
 - A page is still `GLOBAL` or `USER` and lives where it lives today. It may
  **additionally** carry a `connections` list.
 - **Page keys remain a flat, globally-unique namespace.** `connections` does
  **not** namespace keys; a page is addressable by key alone, unchanged.
 - A page may list **multiple** connections.
 - **Absent or empty `connections` ⇒ unscoped: the page applies to all
  connections.** This is exactly today's behavior, so every existing page is
  unaffected.
 This keeps `wiki_read` and refs untouched and adds no parallel scope axis;
 filtering by connection is purely a search/relevance concern.
 ## Requirements
 ### 1. Frontmatter field
 Add an optional `connections` field to wiki page frontmatter — a list of
 connection ids.
 - Accept a single string too; normalize to a list at parse time (reuse the
  existing array-coercion helper used for `tags`/`refs`/`sl_refs`).
 - Round-trips through parse/serialize without loss.
 - Absent or empty ⇒ unscoped (see Model). Existing pages are unaffected by
  construction.
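
As an illustration of the normalization rule, the following sketch coerces a raw frontmatter value into a list of connection ids. The spec asks to reuse the existing array-coercion helper, so `normalizeConnections` is only a hypothetical stand-in for that helper's behavior; trimming and de-duplication are assumptions added here.

```typescript
// Hypothetical stand-in for the existing array-coercion helper.
function normalizeConnections(raw: unknown): string[] {
  const values = typeof raw === "string" ? [raw] : Array.isArray(raw) ? raw : [];
  const seen = new Set<string>();
  for (const value of values) {
    // Ignore non-string and blank entries; trim the rest (assumption).
    if (typeof value === "string" && value.trim() !== "") seen.add(value.trim());
  }
  return Array.from(seen);
}
```

An absent field, an empty list, and a blank string all normalize to `[]`, which the Model section defines as unscoped.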
 ### 2. Page identity and key distinctness
 `connections` does not change how pages are identified or addressed:
 - Keys stay flat and globally unique; `wiki_read(key)` is unchanged.
 - Because the write path updates a page in place when its key already exists,
  same-concept pages for different connections **MUST** use distinct keys
  (e.g. `orders_sales_db` vs `orders_events_db`). Connection-distinctive keys
  for database-specific pages are the primary mechanism (driven by write-path
  prompt guidance, requirement 5).
 - **Data-loss guard (code, not prompt):** a connection-scoped write whose key
  matches an existing page whose `connections` scope is **disjoint** from the
  incoming scope MUST surface a collision instead of silently overwriting the
  existing page. (Updating a page within the same connection scope, or
  broadening/narrowing its own `connections`, is a normal update — not a
  collision.) The implementer owns whether the collision is a hard error or a
  suffixed new key; it must not be a silent clobber.
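One way to read the data-loss guard is as a three-way classification of a write. The sketch below assumes both scopes are already normalized to lists; `classifyScopedWrite` and `WriteDecision` are hypothetical names, and how a `"collision"` is reported (hard error or suffixed key) is left to the caller.

```typescript
type WriteDecision = "create" | "update" | "collision";

// existing: connections of the page already stored under the key, or
//           undefined when no page has that key.
// incoming: connections on the write, or undefined when the field is omitted.
function classifyScopedWrite(
  existing: string[] | undefined,
  incoming: string[] | undefined,
): WriteDecision {
  if (existing === undefined) return "create";
  // An unscoped side can never be disjoint: unscoped existing pages, and
  // writes that omit or clear the field, are normal updates.
  if (existing.length === 0 || incoming === undefined || incoming.length === 0) {
    return "update";
  }
  // Any shared id means same-scope, broadening, or narrowing: still an update.
  const overlaps = incoming.some((id) => existing.indexOf(id) !== -1);
  return overlaps ? "update" : "collision";
}
```

Only the fully disjoint case is a collision, which matches the parenthetical rule that broadening or narrowing a page's own `connections` is a normal update.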
 ### 3. Search filtering
 Add an optional connection filter to the search surfaces:
 - **MCP:** `wiki_search(query, connectionId?)` (`context-tools.ts`).
 - **CLI:** `ktx wiki search` and `ktx wiki list` accept `--connection <id>`
  (with `-c` alias), matching the `ktx sql` connection flag.
 Semantics:
 - With `connectionId: X` ⇒ return pages whose `connections` is empty
  (unscoped) **∪** pages whose `connections` contains X.
 - Without ⇒ current behavior, all pages.
 - The filter **MUST** apply uniformly to **all three search lanes** (lexical
  FTS5, semantic/embedding, token fallback) at the **candidate-source level**,
  so each lane draws its full candidate pool from the already-scoped set. It
  **MUST NOT** be a post-filter on the merged/ranked results — that would let
  off-scope candidates consume both the per-lane pool and the final result
  limit unevenly.
 *Orientation:* the existing `GLOBAL`/`USER` scoping already filters at the
 disk-load step that feeds both the in-memory token lane and the synced SQLite
 index (`local-knowledge.ts`); the connection filter fits the same seam.
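A toy model may help show why the filter has to sit at the candidate source. Everything here is illustrative: `WikiPage`, `matchesConnection`, and the single substring-matching lane are assumptions standing in for the real lanes, and the pool limit of 1 is chosen only to make the difference visible.

```typescript
interface WikiPage {
  key: string;
  connections: string[]; // already normalized; empty means unscoped
  text: string;
}

// Unscoped pages match every connection; scoped pages must list the id.
function matchesConnection(page: WikiPage, connectionId?: string): boolean {
  if (connectionId === undefined) return true;
  return page.connections.length === 0 || page.connections.indexOf(connectionId) !== -1;
}

// Toy lane: substring match with a per-lane pool limit.
function lexicalLane(candidates: WikiPage[], query: string, limit: number): WikiPage[] {
  return candidates.filter((p) => p.text.indexOf(query) !== -1).slice(0, limit);
}

// Required shape: scope the candidate set first, then run the lane.
function searchScoped(pages: WikiPage[], query: string, limit: number, connectionId?: string): WikiPage[] {
  const candidates = pages.filter((p) => matchesConnection(p, connectionId));
  return lexicalLane(candidates, query, limit);
}

// Forbidden shape: run the lane on everything, then filter the results.
function searchPostFiltered(pages: WikiPage[], query: string, limit: number, connectionId?: string): WikiPage[] {
  return lexicalLane(pages, query, limit).filter((p) => matchesConnection(p, connectionId));
}

const samplePages: WikiPage[] = [
  { key: "orders_events_db", connections: ["events_db"], text: "orders table notes" },
  { key: "orders_sales_db", connections: ["sales_db"], text: "orders table notes" },
];
```

With a pool limit of 1 and `connectionId: "sales_db"`, `searchScoped(samplePages, "orders", 1, "sales_db")` returns `orders_sales_db`, while `searchPostFiltered` returns nothing because the off-scope page consumed the only slot.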
 ### 4. Index persistence
 The `.ktx/db.sqlite` knowledge index is re-synced from files on every search.
 The implementer owns whether to persist `connections` as index columns / a side
 table, or to filter the loaded page-set before the per-search sync. The binding
 requirement is the uniform-across-lanes behavior in requirement 3 — not a
 specific schema.
 *Trade-off note (non-binding):* filtering the loaded page-set re-syncs only the
 scoped subset and gives up a little embedding-cache reuse when searches
 alternate between connections (recompute is one embedding per scoped page per
 connection switch — negligible at the scale this targets). Persisting
 `connections` in the index avoids that at the cost of a schema addition and a
 per-lane predicate. Either is acceptable.
 ### 5. Write path
 - The memory agent's page-write tool (`wiki-write.tool.ts`) accepts a
  `connections` input field with the same REPLACE semantics as
  `tags`/`refs`/`sl_refs`: omit ⇒ keep existing on update; `[]` ⇒ clear to
  unscoped; `[ids]` ⇒ set.
 - When `memory_ingest` / the memory agent runs with a `connectionId`, prompt
  guidance directs the agent to:
  - set `connections: [connectionId]` on new **database-specific** pages, using
    connection-distinctive keys; and
  - leave `connections` empty for clearly **org-wide** content.
 - This is **prompt guidance, not a code auto-default.** A connection-scoped
  ingest must remain able to produce unscoped org-wide pages, so the tool must
  not force the session's `connectionId` onto every page.
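The REPLACE semantics for the `connections` input reduce to a small function. This is a sketch with a hypothetical name (`applyConnectionsInput`); it assumes the input has already been normalized to a list when present, and it deliberately takes no session `connectionId`, since the tool must not default one in.

```typescript
// existing: the page's current connections ([] for a new or unscoped page).
// input:    the tool's `connections` argument, or undefined when omitted.
function applyConnectionsInput(existing: string[], input?: string[]): string[] {
  if (input === undefined) return existing.slice(); // omitted: keep what the page has
  return input.slice(); // [] clears to unscoped; [ids] replaces the whole list
}
```

Note that `[ids]` replaces rather than merges: writing `["events_db"]` to a page scoped to `["sales_db"]` yields `["events_db"]`, not the union.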
 ### 6. `wiki_read` and refs unchanged
 Pages remain addressable by key regardless of scoping. `wiki_read`, `refs`, and
 `sl_refs` semantics are unchanged; `connections` is a search/relevance concern
 only.
 ### 7. Validation
 Validation behavior splits by surface, because an explicit argument is a
 typo-prone input while persisted content drifts independently of config:
 - **Explicit argument** — a connection id supplied as a command/tool argument
  (`wiki_search`/`memory_ingest` `connectionId`, `ktx wiki … --connection`)
  MUST be validated against `ktx.yaml` connections and **rejected with a clear
  error listing the configured ids** when unknown. Reuse the canonical
  `project.config.connections[id]` check. This also closes the current gap
  where `memory_ingest`'s `connectionId` is accepted unvalidated.
 - **Persisted frontmatter** — a connection id that appears only in a stored
  page's `connections` and is not in `ktx.yaml` MUST **warn (not fail)** during
  validation/doctor, and MUST NOT break loading, searching, or reading that
  page. Config and content can evolve independently.
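The two validation surfaces can be sketched side by side. The function names and the exact error wording below are hypothetical; the configured ids are passed in as a plain list rather than read from `ktx.yaml`, which is an assumption made to keep the example self-contained.

```typescript
// Explicit argument: reject unknown ids and list what is configured.
function assertKnownConnectionId(configured: string[], id: string): void {
  if (configured.indexOf(id) === -1) {
    const listed = configured.length > 0 ? configured.join(", ") : "(none)";
    throw new Error(`Unknown connection "${id}". Configured connections: ${listed}`);
  }
}

// Persisted frontmatter: never throw; return the stale ids so the caller can warn.
function unknownPersistedConnections(configured: string[], pageConnections: string[]): string[] {
  return pageConnections.filter((id) => configured.indexOf(id) === -1);
}
```

The asymmetry is the point: a typo in an argument fails fast, while a page that outlived its connection only produces a warning and stays loadable.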
 ### 8. Scope boundary
 This spec delivers the **mechanism** (frontmatter storage + uniform filter +
 write surface + validation). Driving the agent to actually pass `connectionId`
 during analytics work is the concern of
 `03-multi-connection-routing-in-analytics-skill`. The mechanism composes with
 the `--connection-id` flag on `ktx ingest` used by `02-verbatim-ingest-mode`.
 ## Acceptance criteria
 - A page with `connections: [db_a]` is returned by
  `wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but
  **not** by `wiki_search(query, connectionId: "db_b")`.
 - A page with no `connections` field is returned in all three cases above.
 - Two pages — `orders_sales_db` (`connections: [sales_db]`) and
  `orders_events_db` (`connections: [events_db]`) — coexist; a search scoped to
  `sales_db` returns the first and not the second, and neither overwrote the
  other on write.
 - A connection-scoped write whose key matches an existing page scoped to a
  **different** connection surfaces a collision instead of silently
  overwriting (data-loss guard, requirement 2).
 - Filtering works in each lane independently (test with embeddings disabled to
  exercise the lexical and token lanes alone).
 - `memory_ingest(content, connectionId)` produces a page scoped to that
  connection for database-specific content.
 - An unknown connection id passed to `wiki_search` (`connectionId`) or to
  `ktx wiki search --connection` fails with an error that lists the configured
  connection ids.
 - A page whose `connections` references an id absent from `ktx.yaml` produces a
  warning but stays searchable and readable; search and read do not throw.
 - `connections` accepts a single string and a list, both normalized to a list.
 - Existing projects with no scoped pages and no `connectionId`/`--connection`
  behave identically before/after.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the design.
 - **Frontmatter type + parse/serialize:** `wiki/types.ts` (`WikiFrontmatter`),
  `wiki/knowledge-wiki.service.ts` (`parsePage`/`serializePage`), array
  coercion `wiki/local-knowledge.ts` (`stringArray`).
 - **Search lanes + per-search re-sync:** `wiki/local-knowledge.ts`
  (`searchLocalKnowledgePagesWithSqlite`; the disk-load step that already
  scopes `GLOBAL`/`USER`; token lane), `wiki/sqlite-knowledge-index.ts`
  (FTS5 `knowledge_pages_fts` lexical lane, semantic scan, `sync`).
 - **MCP surface:** `mcp/context-tools.ts` (`wiki_search`, `wiki_read`,
  `memory_ingest`; `connectionId` already present on `memory_ingest` but
  unvalidated).
 - **CLI surface:** `commands/knowledge-commands.ts`
  (`ktx wiki search`/`list`/`read`); canonical `--connection` flag in
  `commands/sql-commands.ts`; validation pattern
  `project.config.connections[id]` in `mcp/local-project-ports.ts`.
 - **Write path:** `wiki/tools/wiki-write.tool.ts` (input schema, REPLACE
  semantics, scope decision), `memory/memory-agent.service.ts` (`connectionId`
  threaded through the capture session and tool session;
  `external_ingest` forces `GLOBAL` scope).
 - **Connection config:** `context/project/config.ts` (`connections` record in
  `ktx.yaml`).
 ## Benchmark context (motivation only)
 Spider 2.0-Lite local subset = one project with ~30 SQLite connections whose
 schemas share table/concept names (Northwind, sakila, two e-commerce DBs…).
 External-knowledge docs (RFM definition, F1 overtake rules) are each relevant
 to exactly one database and must not surface for the other 29.
 ## Implementation notes
 Shipped on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
 acceptance criteria covered; full package suite green (2924 passing),
 type-check, knip/biome dead-code, and pre-commit clean.
 **What was built / where**
 1. **Frontmatter field (req 1).** `connections?: string[]` added to
   `WikiFrontmatter` (`context/wiki/types.ts`) and to the file-layer page model
   `LocalKnowledgePage` (`context/wiki/local-knowledge.ts`). Parsed via a new
   `stringList()` coercion (single string → list); round-trips through both
   serializers. Absent/empty ⇒ unscoped.
 2. **Search/list filter (req 3, req 4).** `connectionId?` threaded through
   `searchLocalKnowledgePages` → both the sqlite-FTS and scan impls →
   `loadAllKnowledgePages`, and through `listLocalKnowledgePages`. The filter is
   applied at the **disk-load seam** (`pageMatchesConnection`: unscoped ∪ pages
   listing the id), so the token lane and the per-search SQLite sync (lexical +
   semantic) both draw their candidate pool from the already-scoped set —
   candidate-source level, not a post-filter.
   - Chose req 4 **option B (filter the loaded page-set)** over persisting a
     column. Verified-safe here: standalone ktx's memory agent reads pages from
     files via a no-op `LocalKnowledgeIndex`, so `.ktx/db.sqlite`'s
     `knowledge_pages` is a per-search cache that `searchLocalKnowledgePages`
     rebuilds every call — scoping the sync corrupts no shared state. Only cost
     is one embedding recompute per scoped page on a connection switch (the
     spec's acknowledged, negligible trade-off). No index-schema change.
 3. **Page identity + data-loss guard (req 2).** Keys stay flat/global;
   `wiki_read`/refs unchanged. The write tool (`wiki/tools/wiki-write.tool.ts`)
   rejects (hard error, no silent clobber) a connection-scoped write whose
   incoming `connections` is **disjoint** from a same-key existing page's
   non-empty `connections`, suggesting a connection-distinctive key. Same-scope,
   overlapping, broaden/narrow, and unscoped-existing updates are allowed.
   Chose a hard error over auto-suffixing so the conflict reaches the agent
   (the decision-maker) instead of silently forking the key namespace.
 4. **Write path (req 5).** `wiki_write` accepts `connections` (string or list)
   with REPLACE semantics (omit ⇒ keep, `[]` ⇒ unscoped, `[ids]` ⇒ set); no
   code auto-default of the session connection. Prompt guidance added to the
   shared `wiki_capture` skill (new "Connection scoping" section) and the
   `memory_agent_external_ingest` prompt. The session `connectionId` is now
   surfaced to the agent so the guidance is actionable: in the memory-agent
   prompt header and in the ingest work-unit `<context>` block
   (`build-wu-context.ts`, fed from `ingest-bundle.runner.ts`).
 5. **Validation (req 7).** New shared helper
   `context/connections/configured-connections.ts → assertConfiguredConnectionId`
   validates explicit connection-id arguments against `ktx.yaml` and throws an
   error listing the configured ids. Routed from all three explicit-arg
   surfaces: MCP `wiki_search` (`local-project-ports.ts`), MCP `memory_ingest`
   (validated at the boundary in `mcp-server-factory.ts` — this also closes the
   prior gap where `memory_ingest`'s `connectionId` was accepted unvalidated),
   and CLI `ktx wiki --connection`/`-c` (`commands/knowledge-commands.ts` +
   `knowledge.ts`). Persisted-frontmatter ids absent from config are **warn-only**:
   `listReferencedConnectionIds` + a non-fatal `ktx status` warning
   (`status-project.ts`); loading/searching/reading never throw on them.
 **Deviations / notes**
 - Req 1 says "reuse the existing array-coercion helper used for `tags`/`refs`".
  That helper (`stringArray`) is array-only and does **not** coerce a single
  string; added a dedicated `stringList` for `connections` to meet the
  single-string acceptance criterion rather than change `stringArray`'s
  behavior for the other fields.
 - **Scope boundary kept:** `discover_data` (MCP) also searches wiki and already
  takes `connectionId`, but req 3/8 scope the filter to `wiki_search` + CLI, so
  its wiki lane is intentionally left unscoped. Worth a follow-up if
  `discover_data`'s wiki results should also be connection-scoped for
  consistency.
 - MCP tools-list snapshot and the `mcp-server-factory` test were updated for the
  new `wiki_search.connectionId` param and the `memory_ingest` validation
  wrapper (the port is no longer the raw service object; it delegates).
--- a/spider2-specs/specs/02-verbatim-ingest-mode.md
+++ b/spider2-specs/specs/02-verbatim-ingest-mode.md
@ -1,327 +0,0 @@
 # Verbatim ingest mode for authoritative documents
 > Refined spec. Intake draft: `todo/02-verbatim-ingest-mode.md`.
 ## Problem
 `ktx ingest --text/--file` routes captured content through the memory agent.
 `runKtxTextIngest` (`packages/cli/src/text-ingest.ts`) builds a
 `MemoryAgentInput` with `sourceType: 'external_ingest'` and hands it to
 `MemoryAgentService.ingest` (`context/memory/memory-agent.service.ts`), which
 runs a multi-step LLM triage loop (≈30-step budget, content clipped to ~48k
 chars) inside a session worktree. The agent decides — via the `wiki_write`
 tool — what to persist, so it may **rewrite, condense, split, or re-title** the
 content before it lands as a wiki page. The body is produced by an LLM, not
 copied by code.
 For *authoritative* documents — formula definitions, metric specs, runbooks,
 compliance text — paraphrasing is a defect, not a feature:
 - exact thresholds, constants, and rule wording must survive unchanged;
 - lexical (BM25/FTS5) search works best when the stored text matches the
  phrasing users and agents query with;
 - ingestion should be deterministic and reproducible — the same input file
  yields the same page, and re-running is safe.
 Two further gaps block authoritative ingest today:
 - The memory agent hard-requires an LLM backend
  (`context/memory/local-memory.ts` throws when `llm.provider.backend: none`
  and no runner is injected), so there is **no** offline ingest path at all.
 - The agent's write tool *merges* a repeated same-scope key in place (REPLACE
  frontmatter semantics in `wiki/tools/wiki-write.tool.ts`), i.e. exactly the
  silent in-place rewrite an authoritative-document workflow must avoid.
 ## Generic use case
 Any team ingesting documents that are already the source of truth: metric
 definition sheets, SLA documents, calculation-methodology docs, regulatory
 text. The user wants **ktx** to *index and surface* the document, not to
 re-author it. Today they work around the memory agent by hand-writing
 frontmatter and copying files into `wiki/global/`; verbatim mode makes that a
 first-class, supported `ktx ingest` workflow.
 ## Model
 `ktx ingest --verbatim` is a **distinct, code-driven ingest path**, not a
 constrained prompt over the existing agent loop. Its defining invariants:
 - **The stored page body is the input document body, written by code.** The LLM
  never produces, edits, or relays the body. It is confined to generating
  *metadata* about the body.
 - **Behavior follows from inputs, not from a mode prompt.** Whether metadata is
  LLM-generated or derived offline follows from the configured backend
  (`llm.provider.backend`), not from a second user-facing switch.
 - **Pages are `GLOBAL`-scoped.** Verbatim ingest targets org/project
  authoritative docs (the content teams copy into `wiki/global/` today).
  Connection association is expressed by the **additive `connections`
  frontmatter** from spec 01, never by directory.
 - **Deterministic and idempotent.** The page key, the merged frontmatter, and
  the stored body are all functions of the input alone (given a fixed backend),
  so the same input produces the same page and a re-run is a safe no-op.
 ### "Byte-for-byte" scope
 The guarantee is on the document's **interior**: no paraphrase, no condense, no
 split, no re-title, no reflow, **no clipping**. The shared wiki store
 canonicalizes *surrounding* whitespace — `parsePage` trims the body and
 `serializePage` emits a single trailing newline
 (`wiki/knowledge-wiki.service.ts`) — so leading/trailing blank lines are
 normalized by the storage layer. Verbatim mode **MUST** write through that
 shared `writePage`/`serializePage` path rather than fork a parallel serializer;
 the interior bytes (thresholds, constants, wording) are what must be preserved
 exactly, and they are. Acceptance hashes compare the stored body against the
 **trimmed** input body.
 ## Requirements
 ### 1. Flag
 `ktx ingest --file <path> --verbatim` and `ktx ingest --text <content>
 --verbatim`. `--verbatim` is a boolean that applies to every `--file`/`--text`
 item in the invocation; each item becomes its own page.
 - It composes with the existing `--connection-id <id>` flag
  (`commands/ingest-commands.ts`) so the resulting page can be
  connection-scoped (see spec 01). **Note:** the intake draft wrote
  `--connection`; the shipped flag is `--connection-id`. Use `--connection-id`.
 - No new `--key` flag (see requirement 4). No second behavioral switch beyond
  `--verbatim` itself.
 ### 2. Body preservation is enforced by code, not by prompt
 The stored page body is the input content (interior preserved exactly, per
 **Model → "Byte-for-byte" scope**).
 - Verbatim mode **MUST NOT** route the body through the memory-agent LLM loop
  or any `wiki_write` tool call where a model could alter it.
 - The LLM, when used, generates **only** metadata: `summary`, `tags`, and
  `sl_refs`. A single constrained structured-output call (AI SDK v6
  `generateObject` with a `zod` schema) is the intended mechanism — the full
  memory-agent loop, worktree, and squash-merge are **not** required and should
  not be used.
 - The page key is **not** LLM-generated (requirement 4).
 ### 3. No clipping of the stored body
 The ~48k clip may apply only to the text **sent to the LLM** for metadata
 generation. It **MUST NOT** apply to the text **written** to the page. A
 document larger than the clip limit is stored in full; only its metadata is
 derived from the clipped prefix.
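The separation between what is stored and what the LLM sees can be made explicit in the data shape. In this sketch `METADATA_CLIP_CHARS`, `VerbatimPayload`, and `prepareVerbatimPayload` are hypothetical, and 48000 merely stands in for the "~48k" figure.

```typescript
// Stand-in for the "~48k" limit; the exact value is an assumption.
const METADATA_CLIP_CHARS = 48000;

interface VerbatimPayload {
  storedBody: string; // written to the page in full
  llmInput: string; // possibly clipped; used only for metadata generation
}

function prepareVerbatimPayload(body: string): VerbatimPayload {
  return {
    storedBody: body, // never clipped
    llmInput: body.length > METADATA_CLIP_CHARS ? body.slice(0, METADATA_CLIP_CHARS) : body,
  };
}
```

Keeping the two strings in separate fields makes it hard to write the clipped text to disk by accident.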
 ### 4. Deterministic page key
 The key is derived from the input, never chosen by the LLM (an LLM-chosen slug
 would break determinism and the requirement-6 idempotency guarantee):
 - **`--file <path>`** → `suggestFlatWikiKey(basename without extension)`
  (`wiki/keys.ts`). This is the primary document case and is always
  deterministic.
 - **`--text <content>`** → if the content opens with a Markdown heading, the
  key is `suggestFlatWikiKey(heading text)`. If there is no leading heading,
  **hard error**: inline verbatim text needs a leading heading to derive a
  stable key, or should be passed as `--file`.
 - No hash-based keys (unfindable) and no `--key` override flag. A real need for
  explicit key control can add `--key` later.
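The key rules above can be sketched as a pure function. This is not the shipped helper: `deriveKeySketch` and its types are hypothetical, the key normalizer is injected because the behavior of `suggestFlatWikiKey` is not described here, and `demoSuggestKey` is a stand-in slugger used only for the example.

```typescript
type VerbatimItem =
  | { origin: "file"; path: string; content: string }
  | { origin: "text"; content: string };

function basenameWithoutExtension(path: string): string {
  const base = path.split(/[\\/]/).pop() || "";
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(0, dot) : base;
}

// Returns the heading text when the content opens with a Markdown heading.
function leadingHeading(content: string): string | undefined {
  const firstLine = content.replace(/^\s+/, "").split("\n")[0] || "";
  const match = /^#{1,6}\s+(.+?)\s*$/.exec(firstLine);
  return match && match[1] ? match[1] : undefined;
}

function deriveKeySketch(item: VerbatimItem, suggestKey: (label: string) => string): string {
  if (item.origin === "file") return suggestKey(basenameWithoutExtension(item.path));
  const heading = leadingHeading(item.content);
  if (heading === undefined) {
    throw new Error("Inline verbatim text needs a leading Markdown heading to derive a stable key, or should be passed with --file.");
  }
  return suggestKey(heading);
}

// Stand-in for the real key normalizer; for demonstration only.
const demoSuggestKey = (label: string): string =>
  label.toLowerCase().replace(/[^a-z0-9]+/g, "_").replace(/^_+|_+$/g, "");
```

Because the function depends only on the item, the same file or heading always maps to the same key, which is what the idempotency guarantee relies on.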
 ### 5. Frontmatter: passthrough + gap-fill
 If the input has its own YAML frontmatter, split it from the body: the body is
 everything after the closing `---`; the frontmatter is authoritative metadata.
 - **Passthrough.** Every input frontmatter field is preserved in the stored
  page, **including fields not in `WikiFrontmatter`** (`effective_date`,
  `version`, `owner`, …). The serializer `YAML.stringify`s the object, so
  unknown keys round-trip. Dropping them would be silent data loss on
  authoritative docs.
 - **Gap-fill only.** Generated/derived metadata fills **absent** fields only;
  it **MUST NOT** overwrite an explicit value. An input `summary:` is never
  replaced by a generated one; explicit `tags`/`sl_refs` are likewise kept.
 - **Defaults.** `usage_mode` defaults to `auto` (findable via search, not
  force-injected) when the input does not set it.
 - **Connection scoping.** `--connection-id X` (validated via
  `assertConfiguredConnectionId`, `context/connections/configured-connections.ts`)
  sets `connections: [X]` when the input frontmatter does not already declare
  `connections`. If the input frontmatter declares a **different**
  `connections` than the flag, **hard error** (ambiguous intent) rather than
  silently choosing one. If they match, or only one source is present, proceed.
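Passthrough, gap-fill, the `usage_mode` default, and the connection rule compose into one merge step. The sketch below uses hypothetical names (`mergeFrontmatterSketch`, `toIdList`) and makes two assumptions: an empty `connections` list in the input is treated like an absent field, and "match" means the input declares exactly the flag's id.

```typescript
type Frontmatter = Record<string, unknown>;

function isAbsent(value: unknown): boolean {
  return value === undefined || value === null;
}

function toIdList(value: unknown): string[] {
  if (typeof value === "string") return [value];
  if (!Array.isArray(value)) return [];
  return value.filter((item): item is string => typeof item === "string");
}

function mergeFrontmatterSketch(input: Frontmatter, generated: Frontmatter, connectionId?: string): Frontmatter {
  const merged: Frontmatter = {};
  // Passthrough: copy every input field, including ones outside WikiFrontmatter.
  for (const key of Object.keys(input)) merged[key] = input[key];
  // Gap-fill: generated values only land where the input left a field absent.
  for (const key of Object.keys(generated)) {
    if (isAbsent(merged[key])) merged[key] = generated[key];
  }
  if (isAbsent(merged["usage_mode"])) merged["usage_mode"] = "auto";
  if (connectionId !== undefined) {
    const declared = toIdList(merged["connections"]);
    if (declared.length === 0) {
      merged["connections"] = [connectionId];
    } else if (declared.length !== 1 || declared[0] !== connectionId) {
      throw new Error(`Ambiguous connections: frontmatter declares [${declared.join(", ")}] but the flag says "${connectionId}".`);
    }
  }
  return merged;
}
```

Copying the input first and filling gaps second is what guarantees that an explicit value can never be overwritten by a generated one.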
 ### 6. Degraded mode (`llm.provider.backend: none`)
 `--verbatim` **MUST** work with no LLM backend; this is a capability the
 regular agent ingest lacks.
 - `summary` is derived from the leading Markdown heading text, or, if none, the
  first non-empty sentence of the body (trimmed to a reasonable length).
 - `tags` and `sl_refs` are left empty.
 - The body is still stored in full (requirement 3 applies unchanged).
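A possible shape for the offline summary derivation is shown below. It is a sketch, not the shipped `deriveDegradedSummary`: the function names, the sentence-boundary regex, and the 160-character cap (standing in for "a reasonable length") are all assumptions.

```typescript
// First sentence of the text, with whitespace collapsed. A sentence ends at
// ".", "!" or "?" followed by whitespace or the end of the text.
function firstSentence(text: string): string {
  const flat = text.replace(/\s+/g, " ").trim();
  const match = /^(.+?[.!?])(\s|$)/.exec(flat);
  return match && match[1] ? match[1] : flat;
}

function deriveSummaryOffline(body: string, maxLength: number = 160): string {
  const text = body.trim();
  const firstLine = text.split("\n")[0] || "";
  // Prefer the leading Markdown heading when the body opens with one.
  const heading = /^#{1,6}\s+(.+?)\s*$/.exec(firstLine);
  const source = heading && heading[1] ? heading[1] : firstSentence(text);
  return source.length > maxLength ? source.slice(0, maxLength).trim() : source;
}
```

Requiring whitespace after the terminator keeps decimal constants such as `0.5` from being cut in half.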
 ### 7. Key collisions: idempotent-if-identical, else hard error
 Verbatim mode does **not** reuse the agent write tool's in-place merge. Before
 writing, read any existing `GLOBAL` page at the derived key:
 - **No existing page** → write.
 - **Existing page, stored body identical** to the new body (compared after the
  storage-layer normalization in **Model**) → **idempotent no-op success**
  (re-running the same file is safe).
 - **Existing page, body differs** → **hard error** naming the conflicting key
  and directing the user to a distinct key. Never a silent overwrite, never an
  auto-suffixed second page (which would produce the duplicated/divergent pages
  this mode must avoid).
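The three collision outcomes can be expressed as a small decision function. In this sketch `decideVerbatimWrite` is a hypothetical name, and `normalizeStoredBody` models the storage-layer normalization as a plain `trim()`, which is an assumption about the exact whitespace rules of `parsePage`.

```typescript
type VerbatimWriteDecision = "write" | "noop" | "conflict";

// Models "parsePage trims the body"; only surrounding whitespace is touched.
function normalizeStoredBody(body: string): string {
  return body.trim();
}

// existingBody: body of the GLOBAL page at the derived key, or undefined if none.
function decideVerbatimWrite(existingBody: string | undefined, newBody: string): VerbatimWriteDecision {
  if (existingBody === undefined) return "write";
  return normalizeStoredBody(existingBody) === normalizeStoredBody(newBody)
    ? "noop" // identical after normalization: re-running the same file is safe
    : "conflict"; // caller raises a hard error naming the key; never overwrite
}
```

Comparing after normalization matters: without it, a re-run of the same file with a trailing blank line would be reported as a conflict.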
 ### 8. LLM-failure handling
 When a backend **is** configured but the metadata call fails (rate limit,
 transport error, malformed output after retries), **fail the item** (honoring
 `--fail-fast` and the per-item exit-code aggregation in `text-ingest.ts`).
 The item **MUST NOT** silently fall back to degraded derivation: a degraded
 page written on a transient error would, under requirement 7, refuse to be
 replaced by a healthy re-run, which breaks reproducibility. Degraded derivation
 is reserved for `backend: none`.
 ### 9. Findability
 After write, the page is reindexed so search returns it:
 - `wiki_search` for a phrase taken from the document body returns the page via
  the lexical lane (the body is indexed in `buildKnowledgeSearchText`).
 - `wiki_search` for a paraphrase of the document's topic returns it via the
  semantic lane **when embeddings are enabled** (this is what the generated
  `summary`/`tags` buy over a bare degraded page).
 ## Acceptance criteria
 - Ingesting a file with `--verbatim` produces a page whose body is
  byte-identical to the trimmed input body (assert with a hash in tests).
 - A >48k-char file is stored in full (assert the stored body length equals
  the trimmed input body length).
 - Running the same `--verbatim` ingest twice is idempotent: one page, identical
  bytes both times, no error on the second run.
 - A second ingest to the same derived key with **different** body content fails
  loudly (requirement 7) and does not modify the existing page or create a
  suffixed one.
 - Input frontmatter with an unknown field (e.g. `effective_date`) is preserved
  in the stored page; an explicit input `summary` is **not** overwritten by a
  generated one.
 - With `llm.provider.backend: none`, `--verbatim` still produces a page: full
  body stored, `summary` derived from the heading/first sentence, `tags` and
  `sl_refs` empty.
 - `--verbatim --connection-id X` yields a page with `connections: [X]`; an
  unknown id is rejected with an error listing the configured ids. (Depends on
  spec 01, now shipped.)
 - `--verbatim --connection-id X` where the input frontmatter already declares a
  different `connections` fails with an ambiguity error.
 - `ktx ingest --text "no heading here" --verbatim` errors asking for a leading
  heading or `--file`.
 - `wiki_search` for a body phrase returns the page (lexical lane); for a topic
  paraphrase it returns the page when embeddings are enabled (semantic lane).
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 module layout and design, subject to the invariants above.
 - **Command flag:** `commands/ingest-commands.ts` (`ktx ingest` option table;
  `--text`/`--file`/`--connection-id`/`--fail-fast` already present — add
  `--verbatim` and thread it into `KtxTextIngestArgs`).
 - **Orchestration:** `text-ingest.ts` (`runKtxTextIngest`, `loadItems`,
  `validateItems`, per-item loop and exit-code aggregation). The verbatim flow
  reuses item loading and replaces the `memoryIngest.ingest(...)` call with a
  code-driven write for `--verbatim` items. Keep the new logic in a focused
  module (e.g. a `verbatim-ingest` sibling) rather than swelling `text-ingest`.
 - **Frontmatter split / write / serialize:** `wiki/knowledge-wiki.service.ts`
  (`parsePage` for the `---…---` split shape, `serializePage`, `writePage`,
  `readPage` for the collision check). Write through this shared path — do not
  re-implement YAML framing.
 - **Key derivation:** `wiki/keys.ts` (`suggestFlatWikiKey`, `assertFlatWikiKey`).
 - **Frontmatter type:** `wiki/types.ts` (`WikiFrontmatter`; `summary` and
  `usage_mode` are the required fields; unknown passthrough fields live
  alongside).
 - **Connection validation:** `context/connections/configured-connections.ts`
  (`assertConfiguredConnectionId`, shipped with spec 01).
 - **Metadata LLM call:** the local LLM runtime/config resolution in
  `context/llm/` (e.g. `local-config.ts`; `backend: none` ⇒ no runtime). Use a
  single `generateObject` call with a `zod` metadata schema; the `ai-sdk` skill
  covers v6 patterns.
 - **Reindex / search lanes:** `wiki/local-knowledge.ts`
  (`loadAllKnowledgePages`, `buildKnowledgeSearchText`, the lexical/token/
  semantic lanes) and `wiki/sqlite-knowledge-index.ts` (`sync`).
 - **Tests:** extend `packages/cli/test/text-ingest.test.ts` and add a
  verbatim-focused test file covering the acceptance criteria above.
 ## Benchmark context (motivation only)
 Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket
 definitions, the haversine formula, F1 overtake rules, …). Gold SQL was
 authored against their **exact** text; an LLM paraphrase that drops a bucket
 boundary or rounds a constant loses the corresponding question. The current
 workaround is hand-writing frontmatter and copying files into `wiki/global/`.
 Verbatim mode turns that manual step into a supported **ktx** workflow, and
 composes with the connection scoping from spec 01 so a doc relevant to exactly
 one of the benchmark's ~30 SQLite databases does not surface for the other 29.
 ## Implementation notes
 Shipped on branch `write-feature-spec-wiki`. All acceptance criteria are covered
 by tests and verified end-to-end through the linked `ktx-dev` binary.
 **What was built**
 - New module `packages/cli/src/verbatim-ingest.ts`: `createLocalProjectVerbatimIngestor`
  + `LocalVerbatimIngestor`, plus the pure helpers `splitInputDocument`,
  `deriveVerbatimPageKey`, `deriveDegradedSummary`, and `buildVerbatimFrontmatter`
  (the last four are `@internal` exports for unit testing).
 - `--verbatim` flag added to `ktx ingest` in `commands/ingest-commands.ts`, with a
  guard that rejects `--verbatim` without `--text`/`--file`. The flag is threaded
  into `KtxTextIngestArgs.verbatim`.
 - `text-ingest.ts` now tags each loaded item with an `origin`
  (`file` / `text` / `stdin`) and, when `verbatim` is set, constructs the verbatim
  ingestor once and branches the per-item loop to a code-driven write instead of
  `memoryIngest.ingest(...)`. The shared view, exit-code aggregation, and
  `--fail-fast` handling are reused.
 **Deviations from the literal spec (design refinements, per "implementer owns the design")**
 - *Metadata call.* The spec suggested raw AI SDK v6 `generateObject`. The
  implementation routes through the existing `KtxLlmRuntimePort.generateObject`
  instead — it is implemented by all three backends (ai-sdk, claude-code, codex),
  and the ai-sdk one already wraps `generateText` + `Output.object({schema})`.
  This realizes the spec's "single constrained structured-output call" intent via
  the canonical cross-backend path rather than forking a second LLM entry point.
 - *Reindex (requirement 9).* In the standalone CLI, `searchLocalKnowledgePages`
  rebuilds the SQLite index from disk on every call (recomputing embeddings for
  changed pages), so a written page is findable without a dedicated reindex step.
  The write still goes through the shared `KnowledgeWikiService.writePage` +
  `syncSinglePage` path, so the page is also eagerly indexed.
 - *Gap-fill optimization.* The LLM is skipped entirely when the input frontmatter
  already supplies `summary`, `tags`, and `sl_refs` (generated metadata only fills
  absent fields, so there is nothing to generate). A fully specified document thus
  ingests with a configured backend without any LLM call.
 **Tests**
 - `packages/cli/test/verbatim-ingest.test.ts` — helper units + ingestor integration
  against a real `initKtxProject` git repo (byte-identical body hash, >48k no-clip,
  idempotency, conflict hard-error, frontmatter passthrough, explicit-summary
  preservation, degraded mode, connection scoping + unknown-id rejection +
  ambiguity error, no-heading inline error, LLM gap-fill, LLM-failure-fails-item,
  lexical + semantic findability).
 - `packages/cli/test/text-ingest.test.ts` — verbatim routing, origin tagging,
  connection-id forwarding, fail-fast.
 - `packages/cli/test/index.test.ts` — `--verbatim` flag threading and the
  requires-`--text`/`--file` guard.
 **Docs**
 - `docs-site/content/docs/cli-reference/ktx-ingest.mdx` (flag, "Verbatim ingest"
  section, examples, common errors) and
  `docs-site/content/docs/guides/writing-context.mdx` (authoritative-document
  workflow).
 **Verification**
 - Full CLI suite: 2959 passed, 1 skipped. `pnpm run build` and `pnpm run dead-code`
  (Biome + Knip default + production) clean; pre-commit clean on changed files.
  A pre-existing, unrelated type error in `test/mcp-server-factory.test.ts` is
  untouched — it predates this work.
--- a/spider2-specs/specs/06-scan-tolerate-broken-objects.md
+++ b/spider2-specs/specs/06-scan-tolerate-broken-objects.md
@ -1,361 +0,0 @@
 # Schema scan tolerates individual objects that fail introspection
 > Refined spec. Intake draft: `todo/06-scan-tolerate-broken-objects.md`.
 ## Problem
 A single broken or inaccessible object zeroes out an entire connection's
 context. Schema introspection iterates objects with no per-object error
 handling, so one throw aborts the whole scan, the live-database adapter's
 `fetch()` rejects, and the connection ends with **no semantic layer at all** —
 even when every other object was healthy.
 The failure surfaces in two phases, and the contract must hold in both:
 - **Metadata read (sqlite).** `connectors/sqlite/connector.ts` does
  `rawTables.map((t) => this.readTable(...))` (≈ line 171) with no try/catch.
  `readTable` runs `PRAGMA table_info(<object>)`, which *executes* a view's
  body to resolve its columns — so a view over a dropped/renamed column (the
  `oracle_sql` case: `emp_hire_periods_with_name` selecting `ehp.start_date`
  from a base table that has no such column) raises `no such column:
  ehp.start_date` and aborts introspection of all ~48 healthy objects.
 - **Profiling read (warehouse drivers).** postgres/mysql/clickhouse/sqlserver/
  bigquery/snowflake read metadata in bulk from catalog / `information_schema`
  (a broken view rarely breaks that), then fail when a per-object profiling or
  sampling `SELECT` runs against a broken object. Enrichment sampling is
  *already* isolated (`description-generation.ts` wraps `sampleTable` in
  try/catch → `sampling_failed`), but mandatory introspection-phase reads are
  not uniformly isolated across drivers.
 A second, related defect blocks the documented escape hatch. Setting
 `enabled_tables: ["main.customers"]` on a sqlite connection produces a
 different hard failure — `Adapter "database schema" did not recognize fetched
 source output`. Root cause: the sqlite connector emits every object as
 `{ db: null }` and filters the scope with `scopedTableNames(scope, { db: null })`
 (`context/scan/table-ref.ts` ≈ line 47, `if (ref.db !== wantDb) continue`), but
 `"main.customers"` parses to `{ db: "main", name: "customers" }`
 (`context/scan/enabled-tables.ts`, `parseDottedTableEntry`). `"main" !== null`,
 so the entry matches **nothing**, zero table files are written, and
 `detectLiveDatabaseStagedDir` (`stage.ts` ≈ line 138) returns false, tripping
 the generic "did not recognize fetched source output" error at
 `context/ingest/local-stage-ingest.ts` (≈ line 291). The bare form
 `enabled_tables: ["customers"]` would have worked; the `main.`-qualified form
 silently matches nothing.
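The mismatch can be reproduced in a few lines. This is a simplified model, not the code in `table-ref.ts` or `enabled-tables.ts`: `parseDotted` and `scopedNames` are hypothetical stand-ins that handle only `name` and `db.name` entries, and they keep only the `ref.db !== wantDb` comparison quoted above.

```typescript
interface TableRef {
  db: string | null;
  name: string;
}

// Simplified parser: "main.customers" -> { db: "main", name: "customers" },
// "customers" -> { db: null, name: "customers" }.
function parseDotted(entry: string): TableRef {
  const dot = entry.indexOf(".");
  return dot === -1
    ? { db: null, name: entry }
    : { db: entry.slice(0, dot), name: entry.slice(dot + 1) };
}

// Returns the entry names whose db matches what the connector emits.
function scopedNames(entries: string[], wantDb: string | null): string[] {
  const names: string[] = [];
  for (const entry of entries) {
    const ref = parseDotted(entry);
    if (ref.db !== wantDb) continue; // "main" !== null, so qualified entries are skipped
    names.push(ref.name);
  }
  return names;
}
```

With the sqlite connector's `db: null`, `scopedNames(["main.customers"], null)` yields an empty list while `scopedNames(["customers"], null)` yields `["customers"]`, which is the silent zero-match described above.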
 ## Generic use case
 Real warehouses routinely contain broken or inaccessible objects: views over
 dropped/renamed columns, views referencing tables the connection role can't
 read, permission-denied tables, and vendor system views that error on read.
 **ktx** should ingest everything it *can* and skip what it can't, so one bad
 object never zeroes out an entire connection's context. This is baseline
 production robustness, independent of any benchmark — the same tolerance a
 33-warehouse fleet needs the first time one of its databases has a stale view.
 ## Design
 The unit of failure is **one object** (table or view). Introspecting or
 profiling an object is an operation that can fail independently; a failure skips
 that object, records a recoverable warning, and the scan continues with the
 remaining objects.
 Because seven Node connectors and the Python daemon each introspect differently
 (sqlite reads metadata per-object via `PRAGMA`; warehouse drivers read metadata
 in bulk and fail per-object during profiling), the **semantics** of "skip /
 warn / total-failure" are defined **once** and every connector routes through
 them — rather than seven copies of the same try/catch that drift apart:
 - A shared per-object helper in the `scan/` layer — the sibling of the existing
  `tryConstraintQuery` (`context/scan/constraint-discovery.ts`) — wraps a single
  object read and returns `{ ok: true, table } | { ok: false, warning }`, with a
  standard warning code (e.g. `object_introspection_failed`).
 - A shared post-check enforces the total-failure rule (R3) uniformly.
 - Each connector keeps its **natural** shape: sqlite routes each `readTable`
  through the helper; bulk-read drivers route their per-object profiling reads
  through it. The contract is uniform; the loop is not forced to be.
 - The Python daemon implements the **same contract** in its own helper, adds a
  `warnings` field to `DatabaseIntrospectionResponse`, and the Node adapter maps
  those warnings into `KtxSchemaSnapshot` (`daemon-introspection.ts`).
 The warning channel already exists end to end on the Node side
 (`KtxSchemaSnapshot.warnings`, the `KtxScanWarning` shape with `table`/`column`/
 `recoverable`, the `KtxScanWarningCode` enum, and the staged `warnings.json`
 artifact written by `writeLiveDatabaseSnapshot`); sqlite simply never populates
 it. This spec makes that channel carry object-skip warnings and surfaces them in
 the ingest summary, the persisted report body, and `ktx status`.
 ## Requirements
 ### R1 — Per-object isolation (the contract)
 If introspecting or profiling one object throws, the scan **MUST** skip that
 object, record a `KtxScanWarning` (object name, the error message, and any
 schema/catalog qualifier; `recoverable: true`), and continue with the remaining
 objects. No single object may abort the scan.
 - The contract holds in **both** phases: the mandatory metadata read *and* any
  profiling/row-count/sample read performed during introspection.
 - It holds for **all seven Node connectors**
  (`packages/cli/src/connectors/<driver>/`) and the **Python daemon** postgres
  path (R6).
 - The semantics are defined once (the shared helper + warning code from the
  Design section) and every connector routes through them. Do not inline a
  divergent per-driver copy.
 - Warnings **MUST NOT** carry secrets or full SQL bodies; record the object
  identifier and the database's error text, redacted through the existing
  `redactKtxSensitiveMetadata` path that `warnings.json` already uses.
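
 As an illustration of the R1 contract, here is a small Python sketch of a per-object helper driving a sqlite scan. The names `try_introspect_object`, `TableInfo`, `ScanWarning`, and `scan` are hypothetical stand-ins (the spec's helper lives in the Node `scan/` layer), and redaction of the error text is deliberately left out.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

OBJECT_INTROSPECTION_FAILED = "object_introspection_failed"


@dataclass
class TableInfo:
    name: str
    columns: List[str]


@dataclass
class ScanWarning:
    code: str
    table: str
    message: str
    recoverable: bool = True


def try_introspect_object(
    name: str, read: Callable[[str], TableInfo]
) -> Union[TableInfo, ScanWarning]:
    """Run one object read; turn a database error into a recoverable warning."""
    try:
        return read(name)
    except sqlite3.Error as exc:
        # Only database errors are converted. Programming faults such as
        # TypeError propagate, so a bug is not disguised as a skipped object.
        return ScanWarning(OBJECT_INTROSPECTION_FAILED, name, str(exc))


def scan(conn: sqlite3.Connection) -> Tuple[List[TableInfo], List[ScanWarning]]:
    def read(name: str) -> TableInfo:
        rows = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        return TableInfo(name, [row[1] for row in rows])

    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type IN ('table', 'view') ORDER BY name"
        ).fetchall()
    ]
    tables: List[TableInfo] = []
    warnings: List[ScanWarning] = []
    for name in names:
        result = try_introspect_object(name, read)
        if isinstance(result, ScanWarning):
            warnings.append(result)
        else:
            tables.append(result)
    return tables, warnings


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    CREATE VIEW stale_view AS SELECT o.missing_col FROM orders AS o;
    """
)
tables, warnings = scan(conn)
```

 With two healthy tables and one stale view, the scan returns both tables plus a single warning naming the view, which is the shape the first acceptance criterion asks for.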
 ### R2 — Surface, don't hide
 Skipped objects **MUST** be reported both at ingest time and in the durable
 status view:
 - **Ingest summary.** The `ktx ingest` run summary (human-facing output) reports
  a count plus the object name and a short reason for each skip — e.g.
  `Skipped 1 object — emp_hire_periods_with_name: no such column ehp.start_date`.
 - **Run report.** Object skips land in the run report's `warnings.json` artifact
  (already written) and in the persisted report body (`IngestReportBody`), whose
  natural home is the existing `fetch?: SourceFetchReport` field — the fetch
  phase *is* introspection.
 - **`ktx status`.** `ktx status` shows a per-connection skipped-objects line for
  the connection's latest ingest — e.g. `oracle_sql: 1 object skipped —
  emp_hire_periods_with_name: no such column ehp.start_date`. This is **derived
  from the latest persisted report, not new persisted state**: the report body
  is already stored whole as a JSON blob (`local_ingest_reports.body_json`), so
  surfacing it requires **no `.ktx/db.sqlite` schema migration** — `status`
  reads and renders the skip info already present in the latest report body. A
  connection whose latest ingest skipped nothing shows no such line.
 ### R3 — Failure semantics (partial vs total)
 Per-object skipping is **unconditional** — there is **no new config knob**, and
 the existing `ingest.workUnits.failureMode` (which governs the later LLM
 work-unit stage, not introspection) is untouched and orthogonal. Outcomes are
 derived from object counts, not from a mode:
 | Scope | Objects discovered / matched | Introspection outcome | Result |
 | --- | --- | --- | --- |
 | none | 0 | n/a (legitimately empty DB) | **success**, empty layer |
 | none | N > 0 | ≥ 1 succeeds | **success** + warnings for the rest |
 | none | N > 0 | all N fail | **connection failure** (clear error) |
 | `enabled_tables` | matches 0 objects | n/a | **clear scope error** (R5) |
 | `enabled_tables` | matches M > 0 | ≥ 1 succeeds | **success** + warnings |
 | `enabled_tables` | matches M > 0 | all M fail | **connection failure** |
 - "Connection failure" means the connector / `fetch()` raises a **clear,
  actionable error** for that connection. It **MUST NOT** surface as the generic
  `did not recognize fetched source output` (that message is reserved for a
  genuinely unrecognized staged dir, not an empty/total-failure result).
 - A total failure of one connection follows existing per-connection ingest
  orchestration for whether sibling connections continue; this spec does not
  change cross-connection behavior.
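
 The outcome table above reduces to a small pure function of three inputs. The sketch below is one possible encoding in Python; `Outcome` and `decide_outcome` are hypothetical names, and `candidates` stands for "objects discovered" when there is no scope and "objects matched" when `enabled_tables` is set.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    CONNECTION_FAILURE = "connection_failure"
    SCOPE_ERROR = "scope_error"


def decide_outcome(*, scoped: bool, candidates: int, succeeded: int) -> Outcome:
    """Derive the scan outcome from object counts, never from a mode flag."""
    if candidates == 0:
        # An empty database is fine; an allowlist that matches nothing is not.
        return Outcome.SCOPE_ERROR if scoped else Outcome.SUCCESS
    if succeeded == 0:
        # Every candidate object failed introspection.
        return Outcome.CONNECTION_FAILURE
    # At least one object succeeded; failures ride along as warnings.
    return Outcome.SUCCESS


# The six rows of the table, in order.
rows = [
    decide_outcome(scoped=False, candidates=0, succeeded=0),
    decide_outcome(scoped=False, candidates=48, succeeded=47),
    decide_outcome(scoped=False, candidates=48, succeeded=0),
    decide_outcome(scoped=True, candidates=0, succeeded=0),
    decide_outcome(scoped=True, candidates=3, succeeded=1),
    decide_outcome(scoped=True, candidates=3, succeeded=0),
]
```

 Keeping the decision free of I/O is what makes it easy to run once, after any connector, rather than re-deriving it per driver.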
 ### R4 — A broken view never blocks base tables
 A broken view **MUST NEVER** prevent base-table ingest.
 - View introspection failures are isolated exactly like any other object (R1).
 - Mandatory introspection **MUST** prefer reading an object's structure from the
  catalog where possible over executing the object's body, and **MUST NOT** run
  a data-reading query (row count, sample) against a view as a required step.
  (sqlite already skips `COUNT(*)` for views; the remaining gap is isolating the
  metadata read that executes the view definition.)
 ### R5 — `enabled_tables` allowlist works
 The documented allowlist escape hatch **MUST** reliably restrict the scan to the
 listed objects, with no spurious adapter error:
 - **sqlite qualification.** The schema-qualified form `"main.<name>"` **MUST**
  resolve to the same object as the bare form `"<name>"` (sqlite's sole schema
  is `main`; the connector emits `db: null`). Both forms select the object;
  neither silently matches nothing.
 - **Documented format.** The accepted qualification forms for each driver
  (`catalog.db.name` / `db.name` / `name`) and the sqlite-specific `main`
  equivalence **MUST** be documented where `enabled_tables` is described
  (`context/project/driver-schemas.ts` and the user-facing config docs).
 - **Zero-match is a clear error.** A non-empty `enabled_tables` that resolves to
  **zero** matched objects **MUST** fail with an actionable error naming the
  connection, the unmatched entries, and the available object names — **not** the
  generic `did not recognize fetched source output`. This is distinct from a
  legitimately empty database (R3 row 1) and from a matched-but-all-broken scope
  (R3 last row).
 - **Any subset works.** An `enabled_tables` matching M > 0 objects ingests
  **exactly** those M objects (minus any that fail per R1), with no adapter
  recognition error regardless of how small or edge-case the set is.
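
 The sqlite qualification and zero-match rules can be sketched as follows. This is an illustrative Python model under stated assumptions, not the ktx implementation: `parse_entry` handles only the `name` and `db.name` forms, every available object is modelled as `db=None` (as the sqlite connector emits it), and all function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TableRef:
    db: Optional[str]
    name: str


class ZeroMatchError(Exception):
    """Raised when a non-empty allowlist matches no object."""


def parse_entry(entry: str) -> TableRef:
    # Simplified: splits on the last dot, so only `name` and `db.name` work.
    db, sep, name = entry.rpartition(".")
    return TableRef(db=db if sep else None, name=name)


def normalize_for_driver(ref: TableRef, driver: str) -> TableRef:
    # sqlite's sole schema is `main`, while the connector emits db=None.
    if driver == "sqlite" and ref.db == "main":
        return TableRef(db=None, name=ref.name)
    return ref


def resolve_enabled(
    connection: str, driver: str, entries: List[str], available: List[str]
) -> List[str]:
    emitted = {TableRef(db=None, name=name) for name in available}
    wanted = [normalize_for_driver(parse_entry(e), driver) for e in entries]
    matched = [ref.name for ref in wanted if ref in emitted]
    if entries and not matched:
        raise ZeroMatchError(
            f"connection {connection!r}: enabled_tables matched no objects; "
            f"unmatched entries: {entries}; available objects: {sorted(available)}"
        )
    return matched


available = ["customers", "orders"]
qualified = resolve_enabled("oracle_sql", "sqlite", ["main.customers"], available)
bare = resolve_enabled("oracle_sql", "sqlite", ["customers"], available)

try:
    resolve_enabled("oracle_sql", "sqlite", ["main.invoices"], available)
    zero_match_message = None
except ZeroMatchError as exc:
    zero_match_message = str(exc)
```

 Normalizing at the scope boundary keeps the generic matching code unaware of sqlite, and the error message carries the three facts R5 asks for: the connection, the unmatched entries, and the available objects.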
 ### R6 — Python daemon parity
 The daemon's postgres introspection path **MUST** honor the same contract:
 - Add a `warnings` field to `DatabaseIntrospectionResponse`
  (`python/ktx-daemon/src/ktx_daemon/database_introspection.py`) carrying the
  same shape Node expects (code, message, object identifier, recoverable).
 - Isolate per-object failures in the daemon's introspection so one broken object
  does not abort the response; apply the R3 total-failure rule there too.
 - Map daemon warnings into `KtxSchemaSnapshot.warnings` in
  `mapDaemonSnapshot` (`context/ingest/adapters/live-database/daemon-introspection.ts`),
  which currently drops them.
 - The Node and Python warning shapes **MUST** stay in parity (the codebase
  already mirrors Node↔Python schemas for telemetry; follow the same discipline
  so the daemon cannot emit a code Node can't render).
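
 A rough sketch of what the daemon-side shape could look like, using plain dataclasses. Everything here is an assumption made for illustration: the real `DatabaseIntrospectionResponse` may use a different modelling library and different field names, `map_rows_to_tables` is a simplified stand-in for the daemon function of similar name, and `RENDERABLE_CODES` is a hypothetical one-entry subset of the codes Node can render.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

OBJECT_INTROSPECTION_FAILED_CODE = "object_introspection_failed"

# Hypothetical subset of the warning codes the Node side knows how to render.
RENDERABLE_CODES = frozenset({OBJECT_INTROSPECTION_FAILED_CODE})


@dataclass
class IntrospectionWarning:
    code: str
    message: str
    table: str
    recoverable: bool = True


@dataclass
class IntrospectionResponse:
    tables: List[Dict[str, Any]] = field(default_factory=list)
    warnings: List[IntrospectionWarning] = field(default_factory=list)


def map_rows_to_tables(
    columns_by_table: Dict[str, Optional[List[str]]]
) -> IntrospectionResponse:
    """Map each table independently; an unmappable one becomes a warning."""
    response = IntrospectionResponse()
    for name, columns in columns_by_table.items():
        try:
            if columns is None:
                raise ValueError(f"no column metadata for {name}")
            response.tables.append({"name": name, "columns": list(columns)})
        except ValueError as exc:
            response.warnings.append(
                IntrospectionWarning(OBJECT_INTROSPECTION_FAILED_CODE, str(exc), name)
            )
    return response


def renderable_warnings(response: IntrospectionResponse) -> List[IntrospectionWarning]:
    """Consumer-side filter: drop any code the renderer does not know."""
    return [w for w in response.warnings if w.code in RENDERABLE_CODES]


response = map_rows_to_tables({"customers": ["id", "name"], "stale_view": None})
response.warnings.append(IntrospectionWarning("future_code", "ignored", "x"))
kept = renderable_warnings(response)
```

 The consumer-side filter is the parity safeguard: a code the renderer does not recognise is dropped rather than surfaced as an unreadable warning.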
 ## Acceptance criteria
 - Ingesting a sqlite DB with one broken view + N healthy tables yields a
  semantic layer for the N healthy tables and **exactly one** warning naming the
  broken view and its error; exit is **success**.
 - The skipped object appears in the `ktx ingest` summary output, in the run's
  `warnings.json`, and in `ktx status` as a per-connection skipped-objects line
  on the connection's latest ingest.
 - A sqlite DB in which **every** discovered object fails introspection (and the
  file opens) exits as a **connection failure** with a clear error — not an
  empty "success" and not `did not recognize fetched source output`.
 - A genuinely empty sqlite DB (zero objects) exits **success** with an empty
  layer (not a failure).
 - `enabled_tables: ["main.customers"]` and `enabled_tables: ["customers"]` both
  ingest exactly the `customers` object on a sqlite connection.
 - `enabled_tables` restricted to a valid subset of M objects ingests exactly
  that subset, with **no** adapter-output error.
 - `enabled_tables` that matches zero objects fails with an error naming the
  connection, the unmatched entries, and available objects — distinguishable
  from the empty-DB and all-broken cases.
 - A broken view does not prevent ingest of base tables in the same connection
  (regression test with a view that errors on read alongside a healthy table).
 - The daemon's `DatabaseIntrospectionResponse` carries a `warnings` array, and a
  per-object failure in the daemon path produces a warning mapped into
  `KtxSchemaSnapshot.warnings` (Node↔Python parity test).
 - A warehouse-driver object whose profiling/sample read fails is skipped with a
  warning and does not abort introspection of its siblings.
 - Existing healthy-only ingests (no broken objects, no `enabled_tables`) behave
  identically before/after — no warnings, same semantic layer.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the design.
 - **Shared semantics:** `context/scan/constraint-discovery.ts`
  (`tryConstraintQuery` / `constraintDiscoveryWarning` — the precedent to mirror
  for the per-object helper), `context/scan/types.ts`
  (`KtxSchemaSnapshot.warnings`, `KtxScanWarning`, `KtxScanWarningCode` — add the
  new object-skip code here).
 - **Node connectors:** `packages/cli/src/connectors/<driver>/connector.ts` and
  each `live-database-introspection.ts`. sqlite's loop is
  `connectors/sqlite/connector.ts` `introspect` (≈ line 158) → `readTable`
  (≈ line 306); the missing try/catch is the `rawTables.map(...)` at ≈ line 171.
  Existing per-table sample isolation precedent: `description-generation.ts`
  (≈ line 867, `sampling_failed`).
 - **Driver dispatch:** `packages/cli/src/local-adapters.ts` (≈ lines 122-156)
  routes every driver to its Node connector; the daemon is the `else` fallback.
 - **`enabled_tables` matching:** `context/scan/enabled-tables.ts`
  (`resolveEnabledTables`, `parseDottedTableEntry`), `context/scan/table-ref.ts`
  (`scopedTableNames`, the `ref.db !== wantDb` filter ≈ line 47),
  `context/project/driver-schemas.ts` (`enabled_tables` schema + description).
 - **Staging / detect / error surface:**
  `context/ingest/adapters/live-database/stage.ts`
  (`writeLiveDatabaseSnapshot`, `warningArtifact` ≈ line 94,
  `detectLiveDatabaseStagedDir` ≈ line 138),
  `context/ingest/local-stage-ingest.ts` (the
  `did not recognize fetched source output` throw ≈ line 291 — must stop being
  the surface for empty-scope and total-failure).
 - **Ingest summary:** `packages/cli/src/ingest.ts` (`writeReportStatus`
  ≈ line 202), `context/ingest/memory-flow/summary.ts`
  (`formatMemoryFlowFinalSummary`) — thread object skips into the human-facing
  summary.
 - **Report body + `ktx status`:** `context/ingest/reports.ts` (`IngestReportBody`;
  `SourceFetchReport` as the home for scan warnings),
  `context/ingest/sqlite-local-ingest-store.ts` (the report body is persisted
  whole as `body_json` ≈ line 90 — no migration needed), `status-project.ts`
  (`buildLocalStatsStatus` reads `local_ingest_reports`; parse the latest body
  per connection and render the skipped line via `renderLocalStatsAsLines`).
 - **Daemon path:** `python/ktx-daemon/src/ktx_daemon/database_introspection.py`
  (`DatabaseIntrospectionResponse` ≈ line 165, `introspect_database_response`
  ≈ line 323, `_load_postgres_rows` ≈ line 227, `_map_rows_to_tables`
  ≈ line 267), and the Node mapping in
  `context/ingest/adapters/live-database/daemon-introspection.ts`
  (`mapDaemonSnapshot` ≈ line 209).
 ## Benchmark context (motivation only)
 `oracle_sql` (8 of the 135 local sqlite questions) currently has **no** semantic
 layer because of its one broken view, so those questions fall back to raw
 `sql`-tool introspection instead of ktx's enriched context. Tolerant scanning
 restores enriched context for that database. The same robustness is required for
 the full Spider 2.0-Lite run across BigQuery and Snowflake, where broken or
 permission-restricted objects are common and a single one must not zero out a
 warehouse's context.
 ## Implementation notes
 Shipped on branch `write-feature-spec-wiki`. All requirements implemented;
 verified with `pnpm --filter @kaelio/ktx run test` (2981 passing),
 `pnpm run dead-code`, `uv run pytest python/ktx-daemon/tests` (97 passing),
 `uv run pre-commit`, and `pnpm run build && pnpm run link:dev`.
 **Shared semantics (R1).** New `context/scan/object-introspection.ts` exposes
 `tryIntrospectObject(ctx, fn)` (sibling of `tryConstraintQuery`), returning
 `{ ok: true, table } | { ok: false, warning }` and building an
 `object_introspection_failed` warning (object name + redactable DB error). It
 rethrows native programming faults (`isNativeProgrammingFault`) so a ktx bug is
 never masked as an object skip. The new warning code was added to
 `KtxScanWarningCode` (`scan/types.ts`), the `scanWarningCodes` allowlist
 (`local-structural-artifacts.ts`, plus a new exported `isKtxScanWarningCode`
 validator), and `describeWarningGroup` (`scan.ts`).
 **Per-object isolation, where it actually exists (R1/R4).** Only sqlite
 (`readTable` via `PRAGMA`) and bigquery (`tableRef.get()` per dataset) do
 per-object reads during *mandatory* introspection; both now route each object
 through `tryIntrospectObject`. The other five Node connectors (postgres, mysql,
 clickhouse, sqlserver, snowflake) read metadata in bulk from the catalog/
 `information_schema` (already object-safe at this phase) and isolate per-object
 profiling/sampling in the enrichment phase (`description-generation.ts`,
 `sampling_failed`), so no divergent per-driver try/catch was added there. sqlite
 also tolerates a `COUNT(*)` (profiling) failure without dropping a
 structurally-readable table, and a broken view's metadata read is isolated so it
 never blocks base tables (R4).
 **Single-source outcome decision (R3/R5).** New
 `adapters/live-database/scan-outcome.ts#assertLiveDatabaseScanOutcome` runs once
 in `LiveDatabaseSourceAdapter.fetch()` — the one path every driver (and the
 daemon) routes through — and derives the outcome from the snapshot + scope:
 ≥1 object → success (skips ride along as warnings); all matched objects failed →
 clear `KtxExpectedError`; non-empty `enabled_tables` matched nothing → clear
 zero-match error naming the connection, the requested entries, and the available
 objects (sqlite/bigquery attach the discovered inventory via
 `metadata.discovered_object_names`); empty database (no scope) → success with an
 empty layer. `detectLiveDatabaseStagedDir` no longer requires table files, so a
 valid empty staging is recognized; total-failure/zero-match now throw a clear
 connection error before staging instead of surfacing the generic
 `did not recognize fetched source output`.
 **`enabled_tables` matching (R5).** Normalized at the scope boundary in
 `resolveEnabledTables` using `connection.driver`: for sqlite, `main.<name>` →
 `{ db: null }`, so `"main.customers"` and `"customers"` select the same object.
 `table-ref.ts` stayed generic. Documented in `driver-schemas.ts` and
 `docs-site/.../configuration/ktx-yaml.mdx`.
 **Surfacing (R2).** Deviation from the spec's orientation: live-database schema
 ingest runs through the **stage-only** path (`runLocalStageOnlyIngest` →
 `local_ingest_reports`), not the bundle runner, so the home for scan warnings is
 `LocalIngestRunRecord.fetch` (a new `SourceFetchReport` field; `body_json` is
 persisted whole, so **no migration**), not the bundle-only
 `IngestReportBody.fetch`. Both ingest paths read `adapter.readFetchReport`
 (`live-database/fetch-report.ts` derives skips from the existing `warnings.json`).
 The ingest summary is already rendered by `runKtxScan` from `report.warnings`
 (the new `describeWarningGroup` case), and `ktx status`
 (`status-project.ts#buildLocalStatsStatus`/`renderLocalStats`) now parses the
 latest report body per connection and prints a per-connection
 `N object(s) skipped — name: reason` line.
 **Daemon parity (R6).** `database_introspection.py` adds a `warnings` field to
 `DatabaseIntrospectionResponse` and a `DatabaseIntrospectionWarning` model,
 isolates per-object failures in `_map_rows_to_tables`, and shares the
 `OBJECT_INTROSPECTION_FAILED_CODE = "object_introspection_failed"` string with
 Node. `mapDaemonSnapshot` maps `raw.warnings` into `KtxSchemaSnapshot.warnings`,
 dropping any code Node cannot render (validated via `isKtxScanWarningCode`).
 Deviation: the daemon does **not** re-enforce the R3 total-failure rule — the
 shared Node post-check (`assertLiveDatabaseScanOutcome`) owns it for every driver
 including the daemon, avoiding a divergent second implementation. Parity is
 covered by a Node test (daemon-shaped warning round-trips) and a pytest
 (per-object failure → warning with the shared code).
--- a/spider2-specs/specs/07-analytics-skill-sql-craft.md
+++ b/spider2-specs/specs/07-analytics-skill-sql-craft.md
@ -1,363 +0,0 @@
 # Add universal SQL-authoring craft to the ktx-analytics skill
 > Refined spec. Intake draft: `todo/07-analytics-skill-sql-craft.md`.
 ## Problem
 The shipped `ktx-analytics` skill
 (`packages/cli/src/skills/analytics/SKILL.md`) is an *orchestration* guide: its
 `<workflow>` and `<rules>` tell the agent **which ktx tools to call and in what
 order** (`discover_data` → `entity_details`/`sl_read_source` →
 `sl_query`/`sql_execution` → validate → `memory_ingest`). It says almost nothing
 about **writing correct SQL**.
 That gap shows up as a specific failure shape: the agent reliably produces
 *runnable* SQL but *wrong* results. The recurring defects are universal
 analytics-engineering mistakes, not ktx-specific ones:
 - comparing a string column to a numeric literal (or vice versa), which can
  silently match zero rows;
 - rounding inside intermediate CTEs, so the final number is off;
 - ranking/“first”/“most recent” windows with no deterministic tie-breaker, so
  results flicker run to run;
 - filtering *before* a window function for sequence/“since”/“first” questions,
  truncating the partition the window should see;
 - returning a full ranked list for a “top/highest” question, or collapsing a
  “per X” question to a single value;
 - dropping the inputs (or the entity identifier) a derived value was built from.
 These are correctness defects every ktx user hits on a live database. They
 belong in the shipped skill — fixing them once improves ktx for everyone, rather
 than living in any individual caller’s prompt.
 ## Generic use case
 An analyst (human or agent) points ktx at a **live, production** database and
 asks a real analytical question — “what’s the most recent order per customer”,
 “top region by margin”, “average order value by month”. The schema is unfamiliar
 (unknown date encodings, nullable join keys, string-typed numeric columns), the
 question carries grain and ranking intent in its wording, and the answer must be
 *correct and deterministic*, not merely executable. The skill should encode the
 analytics-engineering craft that makes the difference between a query that runs
 and a query that’s right — independent of any benchmark.
 ## Model
 The change is **additive content in one Markdown file**, governed by these
 invariants. They constrain the implementer; the exact prose is theirs.
 ### Inline-only delivery (this is a hard constraint, not a style preference)
 All new guidance lives **inside `skills/analytics/SKILL.md`**. A bundled
 `reference/*.md` file (the progressive-disclosure pattern Anthropic’s
 skill-authoring guide recommends for large skills) **MUST NOT** be used here,
 because the delivery mechanism ships only `SKILL.md`:
 - `setup-agents.ts` installs the analytics skill via `readAnalyticsSkillContent()`,
  which reads **only** `./skills/analytics/SKILL.md` and writes a **single** file
  per target: `.claude/skills/ktx-analytics/SKILL.md` (Claude Code), the Codex /
  universal `.agents` equivalent, a **flattened** single rules file for Cursor
  (`.cursor/rules/ktx-analytics.mdc`) and OpenCode
  (`.opencode/commands/ktx-analytics.md`), and a Claude Desktop **zip that
  contains only `ktx-analytics/SKILL.md`** (`writeClaudeDesktopSkillBundle`).
 - Nothing copies sibling files or subdirectories. A reference file would dangle
  on every target, and the Cursor/OpenCode flatten-to-one-file shape cannot
  represent a multi-file skill at all.
 The skill is small enough that inline costs nothing meaningful: ~67 lines today
 plus ~60 of craft is well under the 500-line budget. And this craft is **core
 content** — consulted on every SQL-authoring turn — so even if multi-file delivery
 existed it would still belong inline: progressive disclosure only pays off for
 large, *conditionally-relevant* reference material loaded on demand, not for
 always-needed craft.
 Multi-file skill *delivery* is a legitimate future enhancement, but it must be
 **pulled by a concrete need, not built ahead of one** — no shipped skill today
 exceeds the budget (largest is ~346 lines) or uses a bundled reference. The first
 real trigger is the **per-dialect SQL syntax follow-up**
 (`todo/08-per-dialect-sql-syntax-notes.md`), whose load-on-demand
 `reference/<dialect>.md` content is a genuine progressive-disclosure fit. When
 that work is scoped, note that multi-file delivery is **not** a simple directory
 copy: `setup-agents.ts` flattens the skill to a *single* file for Cursor
 (`.mdc`) and OpenCode (`.md`), so those targets need a concatenation transform,
 and uninstall needs per-file manifest entries. The constraint is recorded here so
 that a future implementer does not “improve” this inline content into a bundled
 reference that dangles on every target.
 ### Heuristics with a generic *why*, not a wall of MUSTs
 The new rules are phrased as **heuristics with a one-line, universal rationale**,
 because SQL authoring is a high-freedom task (many valid approaches, choice
 depends on the question and the data). A bare imperative overfits; a rule plus
 its *why* lets the model apply judgment and generalize. This follows Anthropic’s
 own skill-authoring guidance (“if you find yourself writing ALWAYS/NEVER in all
 caps or rigid structures, reframe and explain the reasoning”).
 This **reconciles the draft’s “behavior only, no rationale” instruction**: the
 prohibition is specifically on rationale that references a **grader, gold answer,
 or the benchmark**. *Generic analytics-engineering rationale is required* — e.g.
 “…so `RANK`/`ROW_NUMBER` results don’t flicker across runs”, “…a string-vs-number
 compare can silently match nothing”. That is a universal truth, not a
 grader reference.
 ### Dialect-agnostic
 Every rule must read correctly on any SQL dialect a ktx connection might use.
 **No dialect-specific syntax**: not `QUALIFY` (a non-standard clause that only
 some engines, such as Snowflake, BigQuery, and DuckDB, support),
 not `strftime`/`julianday` (sqlite), not backtick/`DB.SCHEMA.TABLE` FQTNs.
 Per-dialect syntax notes are a **separate follow-up** living in a dialect-aware
 (per-driver) location, explicitly out of scope here.
 ### Discovery craft attaches to discovery; authoring craft to query/validate
 Two of the draft’s rules (inspect sample rows; cast before comparing) are
 *schema-discovery* concerns that happen **before** SQL is composed. They belong
 with the discovery steps of the existing workflow, not only at the query step.
 The rest (composition, window correctness, precision, completeness) belong with
 the query/validate steps. The draft’s “extend step 5/6” is the right home for
 most rules but is slightly off for the discovery pair; this spec corrects that.
 ### Additive only
 The existing `<workflow>`, `<rules>`, and `<examples>` — compact result tables,
 summaries, clarification prompts, the tool-order workflow, the `connectionId`
 scoping rules — are preserved unchanged. The skill must still read well for an
 interactive, human-facing analysis session.
 ## Requirements
 ### 1. Placement and structure
 Add a dedicated, scannable craft section to `SKILL.md`:
 - A new top-level block — `<sql_craft>` (sibling to `<workflow>`/`<rules>`) — with
  **five sub-headings**: *Schema discovery*, *Composition*, *Window functions*,
  *Numeric precision*, *Answer completeness*. Sub-headings keep the block
  scannable (the draft’s “group under clear sub-headings” goal).
 - **Pointers, not duplication.** Step 5 (“Query”) and step 6 (“Validate and
  explain”) each gain a **one-line pointer** into `<sql_craft>` rather than
  inlining the rules (state each rule once; Anthropic’s “consistent terminology /
  don’t repeat” guidance). The schema-discovery pair is additionally reflected as
  a brief cue in the discovery steps (step 2 “Inspect” / step 4 “Plan”), pointing
  to the same block.
 - No new tool, flag, or config. This is content only.
 ### 2. The craft rules (all fourteen behaviors, grouped)
 Every behavior from the intake draft must be represented. Tightly-related ones
 **may** be merged into a single bullet where that reads better; none may be
 dropped. Each carries a generic *why* (per Model). Dialect-agnostic throughout.
 **Schema discovery** (cue in steps 2/4; lives in `<sql_craft>`)
 1. Inspect representative **sample rows** of each table before composing SQL —
   confirm date/time encoding (`YYYYMMDD` vs ISO vs epoch), null prevalence in
   join/filter keys, and the real set of categorical/enum values
   (`entity_details` + a small `sql_execution` sample). *Why:* assumptions about
   encoding and nullability are the most common source of silently-wrong filters.
 2. **Cast a column to its real type before comparing** it in `WHERE`/`JOIN`. A
   string column compared to a numeric literal (or vice versa) can silently match
   nothing.
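
 To make rule 2 concrete, here is a small runnable sketch that uses SQLite purely as a convenient engine. The `stores` table and its values are invented for the illustration, and other engines may coerce differently or raise an error instead of returning zero rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stores (store_code TEXT, city TEXT);
    INSERT INTO stores VALUES ('007', 'Tallinn'), ('042', 'Riga');
    """
)

# The literal 7 is compared as the text '7', which is not equal to '007',
# so the filter silently matches nothing.
naive = conn.execute(
    "SELECT city FROM stores WHERE store_code = 7"
).fetchall()

# Casting the column to its real type first makes the comparison numeric.
casted = conn.execute(
    "SELECT city FROM stores WHERE CAST(store_code AS INTEGER) = 7"
).fetchall()
```

 Neither query fails, which is the point: only a look at sample rows reveals that `store_code` is zero-padded text.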
 **Composition**
 3. Build complex queries **incrementally** — one CTE at a time, verifying each
   layer’s output on a small sample before stacking the next. *Why:* a wrong
   intermediate layer is far cheaper to catch early than to debug in the final
   result.
 4. **Avoid fan-out joins.** Add columns only from tables already at the target
   grain, or **pre-aggregate** to that grain before joining. *Why:* a join that
   multiplies rows quietly inflates every downstream `SUM`/`COUNT`.
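
 Rule 4 is easiest to see with numbers. The sketch below uses SQLite as a runnable engine with standard SQL only; the `orders` and `shipments` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER);
    CREATE TABLE shipments (shipment_id INTEGER, order_id INTEGER);
    INSERT INTO orders VALUES (1, 10, 100), (2, 10, 50);
    -- Order 1 shipped in two parcels, so a join repeats its row twice.
    INSERT INTO shipments VALUES (1, 1), (2, 1), (3, 2);
    """
)

# Joining at shipment grain double-counts the amount of order 1.
fan_out = conn.execute(
    """
    SELECT o.customer_id, SUM(o.amount) AS revenue, COUNT(s.shipment_id) AS shipments
    FROM orders AS o
    JOIN shipments AS s ON s.order_id = o.order_id
    GROUP BY o.customer_id
    """
).fetchone()

# Pre-aggregating shipments to order grain keeps one row per order.
pre_aggregated = conn.execute(
    """
    WITH shipments_per_order AS (
        SELECT order_id, COUNT(*) AS shipments
        FROM shipments
        GROUP BY order_id
    )
    SELECT o.customer_id, SUM(o.amount) AS revenue, SUM(spo.shipments) AS shipments
    FROM orders AS o
    JOIN shipments_per_order AS spo ON spo.order_id = o.order_id
    GROUP BY o.customer_id
    """
).fetchone()
```

 Both queries report three shipments, but only the pre-aggregated one reports the true revenue of 150; the fan-out version returns 250.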
 **Window functions**
 5. Give every ranking/ordering window function a **complete, deterministic
   tie-breaker** (append unique key columns to `ORDER BY`), so
   `RANK`/`ROW_NUMBER`/`LAG` are stable rather than flickering across runs.
 6. For sequence / “first” / “most recent” / “since” questions, **filter after the
   window**, not before: compute over the full partition, then keep the rows you
   want. *Why:* a pre-filter shrinks the partition the window ranks over, so
   “first”/“most recent” is computed against the wrong set. (See the worked
   example, requirement 3.)
 **Numeric precision**
 7. Compute at **full precision; round only in the final projection**, never inside
   intermediate CTEs.
 8. Be **explicit about truncation**: an integer cast (`CAST AS INT`) truncates on
   some engines and rounds on others, so use an explicit truncation or rounding
   function for whichever is intended. (May merge with rule 7.)
 9. Distinguish **macro vs micro averages** based on the question’s wording:
   “average of per-group averages” = `AVG(group_metric)`; “overall/weighted
   average” = `SUM(numerator)/SUM(denominator)`.
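
 Rules 7 and 9 can be checked on a tiny dataset. This sketch runs standard SQL through SQLite; the `daily_sales` table and its figures are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE daily_sales (region TEXT, revenue REAL, orders INTEGER);
    INSERT INTO daily_sales VALUES
        ('A', 100.0, 3), ('A', 50.0, 3), ('B', 1000.0, 7);
    """
)

# Macro average: average of the per-region order values (25.0 and ~142.857).
macro = conn.execute(
    """
    SELECT ROUND(AVG(order_value), 2)
    FROM (
        SELECT region, SUM(revenue) / SUM(orders) AS order_value
        FROM daily_sales
        GROUP BY region
    ) AS per_region
    """
).fetchone()[0]

# Micro (weighted) average: total revenue over total orders, 1150 / 13.
micro = conn.execute(
    "SELECT ROUND(SUM(revenue) / SUM(orders), 2) FROM daily_sales"
).fetchone()[0]

# Rounding inside the intermediate layer shifts the final macro figure.
rounded_early = conn.execute(
    """
    SELECT ROUND(AVG(order_value), 2)
    FROM (
        SELECT region, ROUND(SUM(revenue) / SUM(orders), 0) AS order_value
        FROM daily_sales
        GROUP BY region
    ) AS per_region
    """
).fetchone()[0]
```

 The three results are about 83.93, 88.46, and 84.0. The first two differ because the question differs; the third differs from the first only because rounding happened one layer too early.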
 **Answer completeness**
 10. “top / highest / most / lowest” → return only the **winning row(s)** (keep the
    top-ranked row via the window result), not the full ranked list, unless a list
    is asked for. *(Phrase the mechanism dialect-agnostically — do not name
    `QUALIFY`.)*
 11. “for each X / per X / by X” → **exactly one row per X**; don’t collapse to a
    single value unless the question says “overall” or “total across X”.
 12. When a question asks for inputs and a derived value (“X, Y, and their ratio”),
    **include the inputs as columns** alongside the derived value.
 13. When grouping by a human-readable label (a name), also **expose the entity’s
    identifier** — identity, not just the label, is part of the result (and
    disambiguates duplicate names).
 14. When a result is **unexpectedly empty, relax filters one at a time** to find
    which predicate removed the rows. *Why:* this is the validation feedback loop
    that turns a silent empty result into a diagnosable one.
 ### 3. One worked example (dialect-agnostic)
 Add **exactly one** compact before/after example to the skill, demonstrating the
 **window-then-filter** rule (rule 6) — the subtlest and highest-value of the set.
 It shows the wrong shape (filter inside, then rank) and the right shape (rank over
 the full partition in a CTE, then filter to the top rank in the outer query),
 using generic table/column names and standard SQL only (no `QUALIFY`, no
 dialect functions). Keep it ~6–10 lines. Do not add a second example; the
 existing three tool-orchestration examples stay as the primary example set.
 *(Superseded by spec 09: the skill now carries a second `sql` worked example —
 the multi-hop fan-out case — so the one-example constraint applies to spec 07's
 window-then-filter example only.)*
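
 For reference, one way the window-then-filter contrast could be exercised end to end. The harness below is a sketch, not the skill's actual example: it uses SQLite (3.25 or later, for window functions) only to run standard SQL, and the `orders` table and the question "which customers placed their first order in 2024?" are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, customer_id TEXT, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 'c1', '2023-11-05'),
        (2, 'c1', '2024-02-01'),
        (3, 'c2', '2024-03-10'),
        (4, 'c2', '2024-06-01');
    """
)

# Wrong shape: the filter runs before the window, so c1's real first order
# (2023) is invisible and its 2024 order is ranked first.
filter_then_rank = conn.execute(
    """
    WITH ranked AS (
        SELECT customer_id, order_date,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY order_date, order_id
               ) AS rn
        FROM orders
        WHERE order_date >= '2024-01-01'
    )
    SELECT customer_id FROM ranked WHERE rn = 1 ORDER BY customer_id
    """
).fetchall()

# Right shape: rank over the full partition, then filter in the outer query.
rank_then_filter = conn.execute(
    """
    WITH ranked AS (
        SELECT customer_id, order_date,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY order_date, order_id
               ) AS rn
        FROM orders
    )
    SELECT customer_id FROM ranked
    WHERE rn = 1 AND order_date >= '2024-01-01'
    ORDER BY customer_id
    """
).fetchall()
```

 The pre-filtered query wrongly includes `c1`; ranking first and filtering afterwards returns only `c2`. Appending `order_id` to the window's `ORDER BY` also supplies the deterministic tie-breaker that rule 5 asks for.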
 ### 4. Explicit exclusions
 None of the following may appear in the skill (they are application/consumer
 concerns, or actively wrong for live data):
 - **Output-shape contracts** (“return a bare result set with exactly these
  columns, no prose”). The skill is for interactive analysis and already favors
  readable tables + summaries; a caller needing a strict shape specifies that
  itself.
 - **Anchoring relative time to `MAX(date)` of the data.** On a live database
  “recent” / “past N months” means relative to *now*; `MAX(date)` anchoring is
  only valid for static snapshots and must not be baked into the product.
 - **Any advice justified by a grader, gold answer, or scoring comparator.**
 - **Dialect-specific syntax** (deferred to the per-driver follow-up).
 ### 5. Coordination with spec 03
 `03-multi-connection-routing-in-analytics-skill` also edits this same file (it
 adds a connection-routing “step 0” to `<workflow>` and threads `connectionId`
 through the tool calls). Spec 07’s additions are **orthogonal**: they live in a
 new `<sql_craft>` block and in step 5/6 pointers, and must not rewrite the
 `<workflow>` routing or the `<rules>` `connectionId` scoping that spec 03 owns.
 If both land, the result is one coherent skill: routing in `<workflow>`/`<rules>`,
 SQL craft in `<sql_craft>`.
 ## Acceptance criteria
 - The shipped `analytics/SKILL.md` contains all fourteen behaviors above, grouped
  under the five sub-headings, each phrased as a heuristic with a generic
  rationale.
 - **Zero references** to any benchmark, gold answer, grader, or scoring
  comparator anywhere in the skill.
 - **Dialect-agnostic:** the skill contains no `QUALIFY`, no `strftime`/`julianday`,
  no backtick/`DB.SCHEMA.TABLE` FQTN syntax, and no other single-dialect
  construct — including in the worked example.
 - The existing interactive guidance is intact: the `<workflow>` steps, the
  `<rules>` (compact tables, summaries, clarification prompt, `connectionId`
  scoping), and the three existing examples all still read correctly and were not
  removed or contradicted.
 - **None of the excluded items** (output-shape contract, `MAX(date)` anchoring of
  “recent”, grader-driven advice, dialect syntax) appear.
 - Exactly **one** new worked example is present, demonstrating window-then-filter,
  in standard dialect-agnostic SQL. *(Superseded by spec 09, which adds a second
  `sql` worked example for the multi-hop fan-out case; the shipped skill then
  contains two worked examples and the content test asserts two `sql` fences.)*
 - The craft is **inline in `SKILL.md`** — no bundled reference file is introduced,
  and the skill still installs as a single file through `setup-agents.ts` for all
  targets (Claude Code, Codex, Cursor, OpenCode, universal, Claude Desktop zip).
 - The skill stays **scannable and within a reasonable size** (comfortably under
  the 500-line budget).
 - The frontmatter (`name`, `description`) is unchanged and still parses through
  `SkillsRegistryService.parseFrontmatter`.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the prose.
 - **The skill file:** `packages/cli/src/skills/analytics/SKILL.md`. Add the
  `<sql_craft>` block; add one-line pointers in steps 5/6 and a discovery cue in
  steps 2/4; add the single worked example. Keep `<workflow>`/`<rules>`/`<examples>`
  otherwise intact.
 - **Delivery (why inline is mandatory):** `packages/cli/src/setup-agents.ts`
  (`readAnalyticsSkillContent`, `installTarget`, `writeClaudeDesktopSkillBundle`,
  `plannedKtxAgentFiles`). Each target gets a single file derived from
  `SKILL.md`; Cursor/OpenCode flatten to one rules file; Claude Desktop zips only
  `ktx-analytics/SKILL.md`. No change to `setup-agents.ts` is required by this
  spec — confirm the skill still installs unchanged.
 - **Coordination:** `03-multi-connection-routing-in-analytics-skill` edits the
  same file; keep the changes non-overlapping (see requirement 5).
 - **Tests:** a content assertion over the shipped `analytics/SKILL.md` is the
  right level (this is prompt content, not executable logic). Assert the skill
  text contains the craft sub-headings / representative rule phrases, contains the
  worked example, and contains none of the banned constructs: the literal tokens
  `QUALIFY`/`strftime`/`julianday`, grader/benchmark words (`spider`, `benchmark`,
  `gold`, `grader`), and — checked as a phrase, not a raw `MAX(` grep, since
  `MAX()` is a legitimate aggregate — any instruction anchoring relative time
  (“recent”, “past N months”) to the data’s maximum date. The existing
  `SkillsRegistryService` frontmatter-parse test must still pass. The standalone
  `ktx-dev` binary should be rebuilt/re-linked (`pnpm run build && pnpm run
  link:dev`) so the playground picks up the updated skill.
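 A content assertion of the kind described in the **Tests** bullet could look roughly like this. It is a sketch under stated assumptions: `findBannedConstructs` and the regular expression are hypothetical, and the real test in the repository may be structured differently.

```typescript
// Literal tokens the spec bans from the skill text (dialect functions and benchmark words).
const BANNED_TOKENS = ["QUALIFY", "strftime", "julianday", "spider", "benchmark", "gold", "grader"];

// Hypothetical phrase-level guard: a relative-time phrase followed by MAX( in the same
// sentence. A plain MAX() aggregate on its own is allowed.
const MAX_DATE_ANCHOR = /\b(recent|past (\d+|N) (days|weeks|months|years))\b[^.]*\bMAX\s*\(/i;

function findBannedConstructs(skillText: string): string[] {
  const hits = BANNED_TOKENS.filter((token) => new RegExp(`\\b${token}\\b`, "i").test(skillText));
  if (MAX_DATE_ANCHOR.test(skillText)) hits.push("relative time anchored to MAX(date)");
  return hits;
}

const cleanText = "Use MAX(order_date) to find the latest order. Round only at the end.";
const dirtyText = "Treat recent as relative to MAX(order_date) and filter with QUALIFY.";
const cleanHits = findBannedConstructs(cleanText);
const dirtyHits = findBannedConstructs(dirtyText);
```

 The clean sample passes even though it uses `MAX(`, which is the reason the spec asks for a phrase check rather than a raw grep.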
 ## Benchmark context (motivation only)
 On the Spider 2.0-Lite sqlite subset the solver produced **0 execution errors but
 ~50 result mismatches**, and a large share traced to exactly these gaps:
 premature rounding, string-vs-number compares, non-deterministic window ordering,
 returning full lists for “top” questions, and dropping the inputs to derived
 values. These are generic SQL-authoring defects — fixing them in the skill
 improves ktx for every user querying a live database, and improving the benchmark
 score is a side effect, not the goal. The skill itself must contain no trace of
 the benchmark.
 ## Implementation notes
 Implemented on branch `write-feature-spec-wiki`.
 **What was built**
 - Added a new `<sql_craft>` block to `packages/cli/src/skills/analytics/SKILL.md`
  (sibling to `<workflow>`/`<rules>`, placed just before `<examples>`), with the
  five sub-headings — *Schema discovery before writing SQL*, *Composition*,
  *Window functions*, *Numeric precision*, *Answer completeness / interpretation* —
  and a one-line opener framing the bullets as heuristics-with-a-why.
 - All fourteen behaviors are represented. Rules 7 and 8 (round-at-the-end /
  truncation) are merged into one "Round only at the end" bullet, as the spec
  permitted. Each bullet carries a generic analytics-engineering rationale; none
  references a benchmark, grader, or gold answer.
 - Exactly one worked example (a fenced `sql` block inside `<sql_craft>`)
  demonstrates the window-then-filter rule, and incidentally the deterministic
  tie-breaker: the *wrong* shape filters before the window; the *right* shape
  ranks the full partition in a CTE, then filters in the outer query. Standard
  SQL only — no `QUALIFY`, no dialect functions.
 - Step pointers added without duplicating the rules: a schema-discovery cue in
  steps 2 and 4, an authoring pointer in step 5, and a validation pointer in
  step 6, each pointing into `<sql_craft>`.
 - The existing `<workflow>` / `<rules>` / `<examples>` (compact tables,
  summaries, clarification prompt, `connectionId` scoping, the three
  orchestration examples) are unchanged. Delivery is unchanged: still a single
  `SKILL.md` per target via `readAnalyticsSkillContent`; no bundled `reference/`
  file was introduced.
 **Tests** — added `packages/cli/test/skills/analytics-skill-content.test.ts`, a
 content assertion over the source `SKILL.md`: the five sub-headings, a
 representative phrase for each behavior, exactly one `sql` worked example, the
 preserved interactive guidance, and the absence of banned constructs
 (`QUALIFY` / `strftime` / `julianday`, `spider` / `benchmark` / `gold` /
 `grader`, a backtick three-part FQTN, and a phrase-level guard against anchoring
 relative time to a `MAX(...)` date). The existing `setup-agents.test.ts` content
 assertions and the `SkillsRegistryService` frontmatter test still pass (77/77
 across the three relevant files). Rebuilt and re-linked `ktx-dev`
 (`pnpm run build && pnpm run link:dev`); the craft block is present in the
 shipped `dist` asset.
 **Deviations / notes**
 - The worked example runs ~18 lines including comments rather than the spec's
  "~6–10"; a faithful before/after with a CTE needs the extra lines, and the
  skill stays well within budget (~117 lines total).
 - `pnpm run type-check` currently reports one **pre-existing, unrelated** error
  in `test/mcp-server-factory.test.ts` (MCP server deps typing), committed on
  this branch ahead of `origin/main`. The src type-check and `pnpm run build`
  are green; this change does not touch any MCP file.
 - Per-dialect SQL syntax stays out of scope here (deferred to
  `todo/08-per-dialect-sql-syntax-notes.md`), so the skill remains
  dialect-agnostic. No dialect-tool pointer was added to `SKILL.md` yet — that
  belongs with spec 08's channel so the skill never references a tool that does
  not exist.
--- a/spider2-specs/specs/08-per-dialect-sql-syntax-notes.md
+++ b/spider2-specs/specs/08-per-dialect-sql-syntax-notes.md
@@ -1,395 +0,0 @@
 # Per-dialect SQL syntax notes, served on demand and scoped to the connection
 > Refined spec. Intake draft: `todo/08-per-dialect-sql-syntax-notes.md`. Companion
 > to `specs/07-analytics-skill-sql-craft.md`, which kept the analytics SQL craft
 > dialect-agnostic and explicitly deferred per-dialect syntax to this spec.
 ## Problem
 Spec 07 added universal, **dialect-agnostic** SQL-authoring craft to the
 `ktx-analytics` skill (`packages/cli/src/skills/analytics/SKILL.md`). That craft
 deliberately excludes anything that reads correctly on only one engine — no
 `QUALIFY`, no `strftime`/`julianday`, no backtick or `DB.SCHEMA.TABLE` FQTNs —
 because the flat skill is installed verbatim and an agent querying sqlite must
 never see Snowflake syntax.
 But a large share of *real* correctness depends on exactly that excluded,
 engine-specific syntax:
 - **Snowflake:** `DATABASE.SCHEMA.TABLE` FQTNs, double-quoted case-sensitive
  identifiers (unquoted folds to upper-case), VARIANT colon-paths
  (`col:field.sub::type`), `QUALIFY`.
 - **BigQuery:** backtick FQTNs (`` `project.dataset.table` ``), `_TABLE_SUFFIX`
  for sharded/wildcard tables, `QUALIFY`, `JSON_VALUE`/`JSON_EXTRACT`.
 - **sqlite:** `strftime`/`julianday`/`date()` for dates, no `QUALIFY`,
  `json_extract`.
 - and the remaining supported engines (`postgres`, `mysql`, `clickhouse`,
  `sqlserver`/`tsql`), each with its own FQTN, quoting, date, top-N, and
  JSON conventions.
 This guidance is genuinely useful to an agent writing SQL against a live
 database, but it must **not** pollute the flat dialect-agnostic skill. It belongs
 in a **dialect-aware** channel, surfaced only for the dialect the active
 connection actually uses, and selected from the project's own configured state —
 not guessed, not shown all at once.
 ## Generic use case
 Any **ktx** project whose connections span more than one warehouse engine — a
 Snowflake warehouse plus a BigQuery export plus a local sqlite extract, say. When
 the agent (or a human analyst the agent assists) writes SQL for a given
 connection, it should receive *that engine's* syntax conventions — FQTN form,
 identifier quoting, date functions, top-N idiom, semi-structured access — and
 nothing for the engines it is not querying. The need is independent of any
 benchmark: it is what "write correct SQL against this specific warehouse" requires
 on every multi-engine stack.
 ## Model
 The change adds a **dialect-aware channel** alongside spec 07's flat skill. The
 following decisions are committed by this refinement; the implementer owns the
 exact prose and code.
 ### Delivery: a dynamic MCP tool (decision committed)
 The draft posed two delivery mechanisms and asked the refinement to "weigh them
 before committing." This spec commits to **dynamic MCP delivery**: a new
 read-only MCP tool returns the syntax notes for a given `connectionId`, with the
 dialect resolved server-side from the connection's configured `driver`. The flat
 skill gains a one-line pointer to that tool. **No install-mechanism change is
 required.**
 The alternative — **multi-file skill delivery** (bundle `reference/<dialect>.md`
 files and point the skill at the matching one) — is **rejected** for **ktx**, for
 reasons that hold regardless of how the skill is otherwise authored:
 1. **It cannot scope on two of the six install targets.** Cursor
   (`.cursor/rules/ktx-analytics.mdc`) and OpenCode
   (`.opencode/commands/ktx-analytics.md`) are physically **single-file**;
   `setup-agents.ts` flattens the skill to one file there. A bundled `reference/`
   directory degenerates to "concatenate every dialect into one file," so a
   sqlite agent would see Snowflake VARIANT syntax — **failing this spec's core
   no-leak criterion on those targets**, and defeating progressive disclosure
   (everything is in context at once). The MCP tool behaves **identically on all
   six targets** because it is a tool call, not an installed file.
 2. **Selecting the dialect is a deterministic operation, so it belongs in code,
   not model judgment.** Anthropic's skill-authoring guidance explicitly says to
   *"prefer scripts [tools] for deterministic operations."* With bundled files the
   **model** must infer that connection X is Snowflake and open the right file —
   and on a multi-connection project it can open the wrong one. With the tool, the
   **server** resolves `driver → dialect` from `ktx.yaml` state and returns
   exactly the right notes.
 3. **It needs a delivery subsystem that the tool does not.** Multi-file delivery
   requires reworking `readAnalyticsSkillContent`, `installTarget`,
   `plannedKtxAgentFiles`, the install manifest (a directory variant),
   `removeKtxAgentInstall`, and `writeClaudeDesktopSkillBundle`, plus a
   concatenation transform for the single-file targets. The MCP tool requires one
   read-only handler and one skill pointer.
 4. **The dependency is free.** The `ktx-analytics` skill already hard-depends on
   the **ktx** MCP server — its entire workflow is calling `discover_data`,
   `entity_details`, `sql_execution`, and so on. Wherever the server is down, the
   skill is already non-functional; the tool adds **no new dependency**.
 5. **Dropping Cursor/OpenCode does not change this.** Removing those targets would
   make multi-file delivery *possible*, but it would not make it better: reasons
   2–4 stand, and the drop is a disproportionate cost (Cursor is a major target)
   to neutralize a constraint the tool handles for free. Whether **ktx** supports
   those targets is a separate product decision and is out of scope here.
 This is consistent with Anthropic's progressive-disclosure goal — load the
 relevant material on demand, at zero context cost until needed — which the tool
 satisfies (its output costs context only when called) while resolving *which*
 dialect from state rather than from a model guess. Reference:
 [Skill authoring best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices).
 ### Scope derived from state, through the one existing resolver
 Which dialect's notes the agent sees is **derived** from the connection's
 configured `driver`, via the resolver the rest of the system already uses —
 `sqlAnalysisDialectForDriver(driver)` in
 `packages/cli/src/context/sql-analysis/dialect.ts`. The same function already
 selects the dialect for `sql_execution`, `sl_query`, and the Python SQL-analysis
 daemon. This spec **must not** introduce a second driver→dialect map. The notes
 are **keyed by the resolved `SqlAnalysisDialect`** (so the SQL Server entry is
 keyed `tsql`, not `sqlserver`), tying the note key-space to the resolver's
 codomain so the two cannot drift.
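 One way to tie the note key-space to the resolver's codomain is to type the note map by the dialect union. The sketch below uses stand-ins: `Dialect` approximates `SqlAnalysisDialect`, `dialectForDriver` approximates `sqlAnalysisDialectForDriver`, and the map entries and note strings are placeholders, not the real contents.

```typescript
// Assumed shape of the resolver's codomain; the real type is SqlAnalysisDialect.
type Dialect =
  | "postgres" | "mysql" | "snowflake" | "bigquery" | "sqlite"
  | "clickhouse" | "tsql" | "duckdb" | "databricks";

// Stand-in for the single driver-to-dialect resolver (illustrative entries only).
const DRIVER_TO_DIALECT = new Map<string, Dialect>([
  ["sqlserver", "tsql"],
  ["snowflake", "snowflake"],
  ["sqlite", "sqlite"],
]);

function dialectForDriver(driver: string): Dialect {
  return DRIVER_TO_DIALECT.get(driver) ?? "postgres"; // postgres is the default
}

// Keys are checked against Dialect, so a `sqlserver` key would not compile.
// Partial<> lets unreachable dialects (duckdb, databricks) stay unauthored.
const NOTES_BY_DIALECT: Partial<Record<Dialect, string>> = {
  postgres: "(placeholder postgres notes)",
  snowflake: "(placeholder snowflake notes)",
  sqlite: "(placeholder sqlite notes)",
  tsql: "(placeholder tsql notes)",
};

function notesForDriver(driver: string): string | undefined {
  return NOTES_BY_DIALECT[dialectForDriver(driver)];
}
```

 Because lookups always go through the resolver, there is no place for a second driver-keyed map to appear.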
 ### Authored per-engine notes are sanctioned static content
 Enumerating syntax notes per engine is **not** a rotting denylist of bad
 specifics; FQTN form and identifier quoting are genuine, stable invariants of each
 engine — the kind of universal fact **ktx**'s design rules explicitly permit as
 static content. What must stay derived-from-state is note *selection* (the active
 dialect) and note *coverage* (every configured driver must resolve to notes that
 exist), both of which this spec ties to the connector registry.
 ### The flat skill stays dialect-agnostic (spec 07 invariant preserved)
 This work adds a *separate* channel. It does **not** amend spec 07's `<sql_craft>`
 block or inline any dialect syntax into `SKILL.md`. Spec 07's acceptance criterion
 — no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in the flat skill — stays
 green. The only `SKILL.md` change is the pointer in requirement 3, which names the
 tool and contains no dialect syntax.
 ## Requirements
 ### 1. A read-only `sql_dialect_notes` MCP tool
 Register a new tool beside the existing context tools
 (`packages/cli/src/context/mcp/context-tools.ts`). The tool name is the
 implementer's to finalize but should follow the existing snake_case convention
 (`entity_details`, `sql_execution`); `sql_dialect_notes` is the suggested name.
 - **Input:** `{ connectionId }`, **required** — matching its siblings
  `entity_details`/`sql_execution`, which always take an explicit connection.
 - **Output:** `{ connectionId, dialect, notes }` where `dialect` is the resolved
  `SqlAnalysisDialect` and `notes` is the markdown guidance for that dialect.
 - **Resolution:** `connectionId → connection.driver →
  sqlAnalysisDialectForDriver(driver) → notes[dialect]`, reusing the existing
  resolver. Do not duplicate the driver→dialect map.
 - **Guards:**
  - A **non-SQL context-source** connection (driver `metabase`, `looker`,
    `lookml`, `notion`, `dbt`, `metricflow`) returns a **clear "not a SQL
    warehouse connection" error**, not postgres notes. Gate on the existing
    `isDatabaseDriver()` (`packages/cli/src/connection-drivers.ts`).
  - For any **SQL warehouse** connection the resolver always yields a dialect with
    notes: all seven warehouse drivers are covered (requirement 2), and the
    resolver's built-in `postgres` default acts as a safety floor. The tool
    therefore never errors for a SQL connection and never returns notes for an
    engine-exclusive dialect (e.g. Snowflake) by accident.
 - **Annotations:** read-only and idempotent, consistent with the other read
  tools.
 - **Description (docs-grade, third person, states what and when):** e.g.
  *"Returns the SQL syntax conventions for a connection's dialect — FQTN form,
  identifier quoting and case-folding, date/time functions, top-N idiom, and
  semi-structured access. Use before authoring raw SQL against a connection so the
  SQL matches that engine."* The description drives the agent's decision to call
  the tool, so it must be specific.
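 The resolution chain and its guard can be sketched as a plain function. Everything here is a stand-in written for illustration: `ExpectedError` approximates the project's expected-error type, `isDatabaseDriver` and `dialectForDriver` approximate the real helpers, and the note strings are placeholders.

```typescript
type Connection = { id: string; driver: string };

// Stand-in for the project's "expected" error type (an agent mistake, not a server fault).
class ExpectedError extends Error {}

const DATABASE_DRIVERS = new Set([
  "postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "sqlserver",
]);
const isDatabaseDriver = (driver: string): boolean => DATABASE_DRIVERS.has(driver);

// Stand-in resolver: sqlserver maps to tsql, anything unrecognized falls back to postgres.
const dialectForDriver = (driver: string): string =>
  driver === "sqlserver" ? "tsql" : DATABASE_DRIVERS.has(driver) ? driver : "postgres";

const PLACEHOLDER_NOTES = new Map<string, string>([
  ["postgres", "(placeholder postgres notes)"],
  ["tsql", "(placeholder tsql notes)"],
  ["sqlite", "(placeholder sqlite notes)"],
]);

// connectionId -> driver -> dialect -> notes, rejecting non-SQL context sources.
function sqlDialectNotes(connections: Connection[], connectionId: string) {
  const connection = connections.find((c) => c.id === connectionId);
  if (!connection) throw new ExpectedError(`Unknown connection: ${connectionId}`);
  if (!isDatabaseDriver(connection.driver)) {
    throw new ExpectedError(`Connection ${connectionId} is not a SQL warehouse connection`);
  }
  const dialect = dialectForDriver(connection.driver);
  const notes = PLACEHOLDER_NOTES.get(dialect) ?? PLACEHOLDER_NOTES.get("postgres") ?? "";
  return { connectionId, dialect, notes };
}

const demoConnections: Connection[] = [
  { id: "wh", driver: "sqlserver" },
  { id: "docs", driver: "notion" },
];
const warehouseResult = sqlDialectNotes(demoConnections, "wh");
```

 Calling `sqlDialectNotes(demoConnections, "docs")` throws instead of returning postgres notes, which is the guard behavior the requirement asks for.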
 ### 2. Per-dialect note content
 Author concise notes for each supported dialect against a **fixed rubric**, so
 every dialect answers the same questions. Each facet is a line or two of timeless,
 engine-true convention (no version-dated "as of vX" content), phrased as
 guidance with the engine reason where it helps — inheriting spec 07's
 heuristics-with-a-why tone. The rubric facets:
 1. **FQTN form** — how to fully-qualify a table on this engine.
 2. **Identifier quoting & case-folding** — quote character and how unquoted
   identifiers fold.
 3. **Date/time** — the engine's date functions and common date-encoding idioms.
 4. **Top-N / window-filtering idiom** — `QUALIFY` where supported; a CTE +
   outer-filter form where it is not; `TOP` for `tsql`.
 5. **Semi-structured / JSON access** — VARIANT colon-paths, `JSON_VALUE`/
   `JSON_EXTRACT`, `->`/`->>`, `json_extract`, as applicable.
 6. **Sharded / partition idiom** where the engine has one (e.g. BigQuery
   `_TABLE_SUFFIX`).
 Constraints on the content:
 - **Coverage = the reachable dialect set.** Every driver in the connector registry
  must resolve to a dialect that has non-empty notes. The reachable set is
  `postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, and
  `tsql` (from `sqlserver`). Do **not** author notes for `duckdb`/`databricks`:
  they appear in the resolver map but no connector can produce them, so they are
  unreachable — matching the draft's "don't author for nonexistent drivers."
 - **Keyed by `SqlAnalysisDialect`** (see Model).
 - **Storage is the implementer's choice.** The notes MAY live as per-dialect
  markdown files inside the package (e.g. under the skill's directory) served by
  the tool, or as a typed map. If files are used they are **package-internal** —
  served by the tool, never installed onto an agent target — and already ship via
  the recursive `src/skills → dist/skills` copy
  (`packages/cli/scripts/copy-runtime-assets.mjs`); no `setup-agents.ts` change.
 - **No benchmark, gold-answer, grader, or scoring references** anywhere in the
  notes.
 The implementer must verify each engine's specifics against current official
 documentation (the well-known anchors above are starting points, not a
 substitute for checking the engine's docs).
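 A rubric check can keep every dialect answering the same questions. The sketch below assumes each note labels its facets with the strings in `RUBRIC_FACETS`; that labeling convention and the `missingFacets` helper are hypothetical. The sample draft uses only the sqlite details already listed in this spec and is deliberately incomplete.

```typescript
// Facets every dialect note is expected to cover (the sharded-table facet is optional).
const RUBRIC_FACETS = [
  "FQTN form",
  "Identifier quoting & case-folding",
  "Date/time",
  "Top-N / window filtering",
  "Semi-structured / JSON",
] as const;

// Return the facets whose label does not appear anywhere in the note text.
function missingFacets(noteMarkdown: string): string[] {
  const haystack = noteMarkdown.toLowerCase();
  return RUBRIC_FACETS.filter((facet) => !haystack.includes(facet.toLowerCase()));
}

// An unfinished sqlite draft: three facets written, two still to do.
const sqliteDraft = [
  "Date/time: use strftime / julianday / date().",
  "Top-N / window filtering: no QUALIFY; rank in a CTE, then filter in the outer query.",
  "Semi-structured / JSON: json_extract.",
].join("\n");

const stillMissing = missingFacets(sqliteDraft);
```

 For this draft the check reports the FQTN and identifier-quoting facets as missing, so an incomplete note is caught before it ships.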
 ### 3. The `SKILL.md` pointer (completes spec 07's deferral)
 Add a **single one-line pointer** to the SQL-authoring step (step 4 "Plan" / step
 5 "Query") of `packages/cli/src/skills/analytics/SKILL.md`, directing the agent to
 call the tool before writing raw SQL against a connection — e.g. *"Before writing
 raw `sql_execution` SQL, call `sql_dialect_notes` with the connection's id to get
 that engine's syntax conventions."* This is the pointer spec 07 deliberately did
 not add because the tool did not yet exist.
 - The pointer **names the tool only**; it contains **no dialect syntax**, so the
  flat skill stays dialect-agnostic.
 - Follow the skill's existing tool-reference convention. The skill currently names
  MCP tools by **bare** name (`discover_data`, `sql_execution`). Anthropic's
  guidance recommends **fully-qualified** `ServerName:tool` names to avoid
  "tool not found" when multiple MCP servers are present. Whether to fully-qualify
  the new pointer (and optionally retrofit the existing bare references) is a
  small, separable decision flagged for the maintainer — **not** a rename sweep
  this spec mandates.
 ### 4. Coverage is enforced from state, not by hand
 A test must **derive** the required coverage from the connector registry rather
 than hardcoding a dialect list: enumerate the configured warehouse drivers
 (`warehouseDrivers` in `driver-schemas.ts` / `KTX_DATABASE_DRIVER_IDS` in
 `connection-drivers.ts`), resolve each through `sqlAnalysisDialectForDriver`, and
 assert each result has non-empty notes. Adding a connector later then **fails this
 test** until its dialect gets notes — the allowlist-from-state discipline, not a
 hand-maintained list.
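 The registry-derived check can be sketched as a function that takes the driver list, the resolver, and the notes as inputs. The names below are hypothetical, the resolver is a simplified stand-in for `sqlAnalysisDialectForDriver`, and `newdb` is an invented driver used only to show the failure mode.

```typescript
// Return every driver whose resolved dialect has no (or only blank) notes.
function driversMissingNotes(
  driverIds: readonly string[],
  resolve: (driver: string) => string,
  notes: ReadonlyMap<string, string>,
): string[] {
  return driverIds.filter((driver) => (notes.get(resolve(driver)) ?? "").trim() === "");
}

// Simplified stand-in resolver: sqlserver maps to tsql, other drivers map to themselves.
const resolveDialect = (driver: string): string => (driver === "sqlserver" ? "tsql" : driver);

const authoredNotes = new Map(
  ["postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "tsql"].map(
    (dialect): [string, string] => [dialect, `(placeholder ${dialect} notes)`],
  ),
);

// Stand-in for the registry's driver list.
const registryDrivers = ["postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "sqlserver"];

const missingToday = driversMissingNotes(registryDrivers, resolveDialect, authoredNotes);
// Simulate a connector being added to the registry before its notes are written.
const missingAfterNewConnector = driversMissingNotes([...registryDrivers, "newdb"], resolveDialect, authoredNotes);
```

 One caveat, stated as an assumption about the real resolver: the check only fails for a new connector if that driver resolves to its own dialect. A driver that falls through to the `postgres` default would pass silently, so a resolver entry should accompany any new connector.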
 ### 5. No dialect syntax leaks into the flat skill
 Spec 07's content assertion over `analytics/SKILL.md` stays green: the flat skill
 (and its worked example) still contain no `QUALIFY`, `strftime`, `julianday`,
 backtick/`DB.SCHEMA.TABLE` FQTN, or other single-engine construct. This spec adds
 a tool and a tool-pointer; it does not move dialect syntax into the skill.
 ### 6. Delivery is unchanged
 `setup-agents.ts` (`readAnalyticsSkillContent`, `installTarget`,
 `writeClaudeDesktopSkillBundle`, `plannedKtxAgentFiles`) needs **no change**. The
 skill still installs as a single `SKILL.md` per target. Confirm the channel works
 on all six targets — Claude Code, Claude Desktop (zip), Codex, universal
 `.agents`, Cursor (`.mdc`), OpenCode (`.md`) — by virtue of being a tool call,
 including the single-file targets where multi-file delivery could not scope.
 ### 7. Coordination with specs 07 and 03
 - **Spec 07** owns the dialect-agnostic `<sql_craft>` block. This spec must not
  amend it; it adds the tool, the pointer, and the notes.
 - **Spec 03** (`03-multi-connection-routing-in-analytics-skill`) threads
  `connectionId` through the skill's tool calls. The `sql_dialect_notes` pointer
  is `connectionId`-scoped and fits that routing; keep the pointer consistent with
  spec 03's `connectionId` rules and do not rewrite the routing it owns.
 ## Acceptance criteria
 - An agent querying a **sqlite** connection gets sqlite date idioms and **never**
  sees Snowflake/BigQuery-only syntax; an agent querying **Snowflake** gets
  FQTN / identifier / VARIANT guidance.
 - The dialect shown is **derived from the connection's configured `driver`** via
  the existing `sqlAnalysisDialectForDriver`, not hardcoded per project and not
  guessed. No second driver→dialect map is introduced.
 - **Every configured warehouse driver** (`postgres`, `mysql`, `snowflake`,
  `bigquery`, `sqlite`, `clickhouse`, `sqlserver`) resolves to a dialect with
  non-empty notes, and the coverage test derives this from the registry.
 - A **non-SQL context-source** connection (e.g. `metabase`, `notion`) yields a
  clear "not a SQL warehouse" response, **not** postgres notes.
 - `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are
  unaffected. The new pointer references the tool only and adds no dialect syntax.
 - The channel works on **all six** agent targets, including the single-file
  Cursor/OpenCode shape, with **no `setup-agents.ts` change**: it is served as a
  tool call, so nothing new is installed.
 - The notes contain **no** benchmark/gold/grader/scoring references and **no**
  time-sensitive ("as of version X") content.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the design.
 - **Dialect resolver (reuse, do not duplicate):**
  `packages/cli/src/context/sql-analysis/dialect.ts` —
  `sqlAnalysisDialectForDriver(driver)`, returning `SqlAnalysisDialect`
  (`./ports.ts`), default `postgres`.
 - **Connector registry (drives coverage):**
  `packages/cli/src/connection-drivers.ts` (`KTX_DATABASE_DRIVER_IDS`,
  `isDatabaseDriver`) and `packages/cli/src/context/project/driver-schemas.ts`
  (`warehouseDrivers`, the per-driver `connectionConfigSchema`).
 - **MCP tool registration:** `packages/cli/src/context/mcp/context-tools.ts`
  (register beside `connection_list`, `entity_details`, `sql_execution`); the
  `connectionId → driver → dialect` resolution already exists for `sql_execution`
  in `packages/cli/src/context/mcp/local-project-ports.ts` — route the new tool
  through the same path.
 - **The skill (one-line pointer only):**
  `packages/cli/src/skills/analytics/SKILL.md` — add the tool pointer in step 4/5;
  leave `<workflow>`/`<rules>`/`<sql_craft>`/`<examples>` otherwise intact.
 - **Note storage (if files):** under the skill directory, shipped by
  `packages/cli/scripts/copy-runtime-assets.mjs`'s recursive copy; served by the
  tool, never installed.
 - **Delivery (confirm unchanged):** `packages/cli/src/setup-agents.ts`.
 - **Tests:** unit tests for resolution (including `sqlserver → tsql`, unknown →
  `postgres`, and non-warehouse rejection); a registry-derived coverage test
  (requirement 4); a content test that each dialect's notes cover the rubric
  facets and contain no banned tokens; and an extension of spec 07's
  `analytics/SKILL.md` content test asserting the new pointer is present and the
  flat skill is still dialect-clean. Rebuild and re-link the dev binary so the
  playground picks up the change: `pnpm run build && pnpm run link:dev`.
 ## Benchmark context (motivation only)
 The only per-dialect content in the Spider 2.0-Lite v9 harnesses covered three
 engines: Snowflake (`DB.SCHEMA.TABLE` FQTNs, double-quoted lower-case columns,
 VARIANT colon-paths), BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded
 tables), and sqlite (`strftime`/`julianday`). That content is real and useful but engine-specific;
 spec 07 kept it out of the flat skill and deferred it here so the dialect-agnostic
 rules stay clean. Delivering it through a dialect-scoped **ktx** tool generalizes
 the same correctness benefit to every multi-engine **ktx** project — improving the
 benchmark score is a side effect, not the goal, and the shipped skill contains no
 trace of the benchmark.
 ## Implementation notes
 Implemented on branch `write-feature-spec-wiki`, alongside spec 07. The committed
 decision (dynamic MCP delivery, not multi-file skill bundling) was implemented as
 specified — no `setup-agents.ts` change.
 **What was built**
 - Per-dialect notes are markdown files under
  `packages/cli/src/context/sql-analysis/dialects/<dialect>.md` (one each for
  `postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, `tsql`),
  served by `sqlDialectNotes(dialect)` in `sql-analysis/dialect-notes.ts` (lazy
  read + cache, `postgres` fallback floor; the authored set is the
  `DIALECTS_WITH_NOTES` const). `duckdb`/`databricks` are intentionally unauthored
  (unreachable from any connector). Each note answers the fixed rubric — FQTN,
  identifier quoting/case-folding, date/time, top-N/window idiom,
  JSON/semi-structured, plus a sharded-table line for BigQuery. Engine specifics
  were verified against current docs via Context7 (Snowflake VARIANT colon-paths
  and unquoted→UPPER case-folding; BigQuery `_TABLE_SUFFIX`, `QUALIFY`,
  `JSON_VALUE`; ClickHouse `LIMIT n BY` and `JSONExtract*`, with no `QUALIFY`). The
  files are package-internal — `copy-runtime-assets.mjs` ships them to `dist`; they
  are never installed onto an agent target.
 - New read-only MCP tool `sql_dialect_notes` (`context-tools.ts`): input
  `{ connectionId }` (required), output `{ connectionId, dialect, notes }`, read-only
  + idempotent annotations. It resolves through the **existing**
  `connectionId → connection.driver → sqlAnalysisDialectForDriver` path (no second
  driver→dialect map), implemented as the unconditional `dialectNotes` port in
  `local-project-ports.ts` via an extracted `resolveDialectNotesForConnection`. A
  non-SQL context source (gated by `isDatabaseDriver`) throws `KtxExpectedError`
  ("not a SQL warehouse") instead of returning postgres notes, so the expected
  agent mistake stays out of Error Tracking.
 - `connection-drivers.ts`: `KTX_DATABASE_DRIVER_IDS` is now an exported (`@internal`)
  readonly tuple so the coverage test derives required coverage from the registry;
  `isDatabaseDriver` behavior is unchanged.
 - `skills/analytics/SKILL.md`: a single dialect-agnostic pointer in step 5 ("call
  `sql_dialect_notes` … to get that engine's FQTN, identifier-quoting, date, top-N,
  and JSON conventions"). It names the tool only; spec 07's `<sql_craft>` block and
  its dialect-clean content test are untouched.
 **Tests**
 - `test/context/mcp/dialect-notes.test.ts`: registry-derived coverage (a future
  connector fails the test until its dialect has notes), the full rubric per dialect,
  leak isolation (sqlite shows `strftime` and never `VARIANT`/`_TABLE_SUFFIX`;
  `QUALIFY` only on snowflake/bigquery; engine-exclusive markers stay put), no
  benchmark/grader or version-dated content, the postgres fallback, and
  `resolveDialectNotesForConnection` resolving sqlite / snowflake / `sqlserver→tsql`
  and rejecting a non-SQL source / unknown connection with `KtxExpectedError`; plus a
  guard that the `DIALECTS_WITH_NOTES` const and the `dialects/*.md` files stay in sync.
 - `test/context/mcp/server.test.ts`: `sql_dialect_notes` added to the retained tool
  set + annotations assertion + a handler-routing test, and the regenerated
  `__snapshots__/mcp-tools-list.json`.
 - `test/skills/analytics-skill-content.test.ts`: asserts the new pointer is present
  and the flat skill stays dialect-clean.
 **Verification** — `tsc -p tsconfig.json` (src) clean; full default suite 393 files /
 3001 passing; slow suite green (incl. `local-project-ports.test.ts`); all three
 `dead-code` checks clean; the `dialects/*.md` files copy into `dist`. Rebuilt and
 re-linked `ktx-dev`.
 **Deviations / notes**
 - Notes are stored as per-dialect markdown files rather than as a typed map, and
  they are not bundled `reference/*.md` skill files. The spec sanctions either
  storage form; plain markdown is the easiest to maintain. The files are served by
  the tool and ship via a `copy-runtime-assets.mjs` entry
  (`src/context/sql-analysis/dialects → dist/…`); no `setup-agents.ts` change.
 - `pnpm run type-check` still reports one pre-existing, unrelated error in
  `test/mcp-server-factory.test.ts` (committed in-flight MCP work on this branch);
  this change adds zero new type errors and does not touch that file.
--- a/spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md
+++ b/spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md
@@ -1,362 +0,0 @@
 # Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill
 > Refined spec. Intake draft: `todo/09-fan-out-safe-multi-hop-aggregation.md`.
 > Extends spec 07 (`specs/07-analytics-skill-sql-craft.md`), which shipped the
 > `<sql_craft>` block. Additive, content-only.
 ## Problem
 The shipped `ktx-analytics` skill
 (`packages/cli/src/skills/analytics/SKILL.md`) already carries a single-hop
 fan-out rule in `<sql_craft>` → **Composition**:
 > **Avoid fan-out joins.** Add columns only from tables already at the target
 > grain, or pre-aggregate to that grain before joining. A join that multiplies
 > rows quietly inflates every downstream `SUM`/`COUNT`.
 In practice the agent honors that on a single join but still **silently
 fans out on multi-hop join chains**, where the inflation is one or two joins
 removed from the aggregate and therefore much harder to notice.
 The failure shape: a measure that lives at a *coarse* grain (one row per parent
 record) is counted/summed *after* the parent has been joined down to a *finer*
 grain (one row per child line). Every parent-level value is then duplicated by
 its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by a data-dependent
 amount — runnable SQL, plausible-looking number, quietly wrong.
 The rule today is stated only as a **prohibition** ("Avoid…"). It needs two
 upgrades: (a) generalize it so the danger is understood as *cumulative across a
 whole join chain*, not a single join; and (b) pair it with an **affirmative
 verification habit** the agent runs while composing, so a grain change is
 detected and fixed rather than merely warned against.
 ## Generic use case (independent of any benchmark)
 An analyst on any production warehouse asks a counting/summing question whose
 path runs through several one-to-many hops — e.g. *"how many orders per region
 contain a returned item?"* where the path is `region → store → order →
 order_line`. The honest answer counts each order once. The naïve join chain joins
 `order_line` (to apply the line-level condition) and then counts orders, so an
 order with three returned lines is counted three times. The inflation happens
 **three joins below the `COUNT`**, where it is easy to miss. This is one of the
 most common silently-wrong analytics mistakes on normalized schemas — not
 specific to any dataset, dialect, or benchmark.
 ## Model (invariants — the implementer owns the prose)
 These constrain the change; the exact wording is the implementer's. Each is
 grounded in Anthropic's skill-authoring and prompt-engineering guidance so the
 addition stays consistent with how spec 07 was written.
 ### Additive, inline-only, dialect-agnostic (inherited from spec 07)
 The change is **additive content inside `skills/analytics/SKILL.md`** only — no
 bundled `reference/*.md` file (the delivery path ships a single `SKILL.md` per
 target; see spec 07 §Model "Inline-only delivery"). No new tool, flag, or config.
 Every addition must read correctly on any dialect: **no** `QUALIFY`,
 `strftime`/`julianday`, backtick/`DB.SCHEMA.TABLE` FQTNs, or other single-dialect
 construct — including in the worked example. The existing `<workflow>`, `<rules>`,
 `<examples>`, and the other four `<sql_craft>` sub-headings are preserved
 unchanged.
 ### Heuristic-plus-*why*, because SQL authoring is a high-freedom task
 Anthropic's "set appropriate degrees of freedom" guidance classifies tasks with
 many valid approaches where decisions depend on context as **high freedom →
 text-based heuristics**, the "open field, many paths" case (versus low-freedom,
 fragile operations that need an exact script). SQL authoring is squarely
 high-freedom. So the new content is phrased as **heuristics with a one-line,
 universal rationale**, never as bare `ALWAYS`/`NEVER` imperatives — matching the
 existing `<sql_craft>` style and Anthropic's "add context / explain why so Claude
 generalizes" principle.
 ### Affirmative framing for the verification step (do, not don't)
 Anthropic's prompt-engineering guidance is explicit: **"Tell Claude what to do
 instead of what not to do."** The draft's requirement for "a detect-and-fix
 *habit*, not just a prohibition" is the same principle. Therefore:
 - The **generalized rule keeps the established `Avoid fan-out joins` lead and the
  term `fan-out`** — it is spec 07's consistent terminology and the existing
  content test references that phrase; reframing it would churn shared vocabulary
  for no gain.
 - The **new verification step is phrased affirmatively** (e.g. *"Verify the grain
  holds across each join"*) — an action the agent performs while composing, not a
  warning. The two together satisfy both principles: a recognized anti-pattern
  name *and* a positive habit.
 ### One default with an escape hatch, not two equal options
 Anthropic: **"Avoid offering too many options… provide a default with an escape
 hatch."** The fix for an inflated aggregate is presented as exactly that:
 - **Default: pre-aggregate the measure to its own grain in a CTE, then join the
  already-aggregated result.** This is the single-hop fix generalized, and it is
  the *only* correct fix for `SUM`/`AVG` — you cannot de-duplicate a summed
  measure with `DISTINCT` (two legitimately-equal amounts would collapse).
 - **Escape hatch: `COUNT(DISTINCT key)` — for a pure count only.** It rescues an
  inflated count in one line, but must be stated as count-only, not as a general
  remedy.
 This is the deepest correctness point in the spec and the easiest to get wrong; a
 naïve blanket "just use `COUNT(DISTINCT)`" is silently wrong for sums.
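 The count-only caveat can be checked on a toy dataset. The sketch below is illustrative only: the `orders` / `order_lines` tables, the `shipping_fee` column, and the values are invented for this note, and it uses Python's built-in `sqlite3` purely as a convenient runner.

```python
import sqlite3

# Invented schema: shipping_fee is an order-level (coarse-grain) measure.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, shipping_fee INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER);
-- Two orders with the same fee; order 1 has two lines, order 2 has one.
INSERT INTO orders VALUES (1, 10), (2, 10);
INSERT INTO order_lines VALUES (1, 1), (2, 1), (3, 2);
""")


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# Honest answer: sum the measure at its own grain.
honest = scalar("SELECT SUM(shipping_fee) FROM orders")

# Fan-out: order 1's fee is repeated once per line.
fanned_out = scalar("""
    SELECT SUM(o.shipping_fee)
    FROM orders o JOIN order_lines l ON l.order_id = o.order_id
""")

# DISTINCT "fix": collapses the two legitimately equal fees into one.
distinct_sum = scalar("""
    SELECT SUM(DISTINCT o.shipping_fee)
    FROM orders o JOIN order_lines l ON l.order_id = o.order_id
""")

# Default fix: collapse the finer table to order grain first, then join.
pre_aggregated = scalar("""
    WITH lines_per_order AS (
        SELECT order_id, COUNT(*) AS n_lines
        FROM order_lines
        GROUP BY order_id
    )
    SELECT SUM(o.shipping_fee)
    FROM orders o JOIN lines_per_order lp ON lp.order_id = o.order_id
""")

print(honest, fanned_out, distinct_sum, pre_aggregated)
```

 With this data the fanned-out sum over-counts, the `DISTINCT` variant under-counts, and only the pre-aggregated form matches the honest total.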
 ### Consistent terminology
 Anthropic: **"Choose one term and use it throughout."** Reuse spec 07's existing
 vocabulary verbatim — **`grain`**, **`fan-out`**, **`pre-aggregate`** — do not
 introduce synonyms (e.g. do not rename the concept "row blow-up" or
 "multiplication factor"). Prose may vary, but the named concepts stay fixed.
 ### Concise — the addition must justify its token cost
 Anthropic: **"Concise is key… does this paragraph justify its token cost?"** and
 "Claude is already very smart." The agent knows what a join and a `GROUP BY` are;
 the addition explains only the non-obvious trap (cumulative grain inflation) and
 shows the fix. Net addition is roughly one rewritten bullet, one new bullet, and
 one worked example — the skill stays comfortably under the 500-line budget
 (~117 lines today).
 ### Examples over descriptions — exactly one
 Anthropic's "examples pattern": **"Examples help Claude understand the desired
 style and level of detail more clearly than descriptions alone"** and
 "examples are concrete, not abstract." The multishot guidance favors 3–5 examples
 in general, but here **conciseness and spec 07's one-example-per-rule economy
 win**: the skill already carries the window-then-filter example, so this adds
 **exactly one** compact wrong-vs-right example. The wrong/right contrast inside
 that single example supplies the diversity multishot calls for, at one example's
 token cost.
 ### Leak-safety (hard constraint)
 The worked example must be a **synthetic, generic schema invented for teaching** —
 not the tables, column names, query, or numeric results of any Spider 2.0-Lite
 question. It demonstrates the *pattern* (a coarse-grain measure aggregated after a
 one-to-many join), which is universal and reconstructable from first principles. A
 reviewer must find nothing in it that ties it to a specific benchmark instance.
 See "Leak-safety" below.
 ## Requirements
 Requirements 1–4 land in the **Composition** sub-heading of `<sql_craft>` in
 `packages/cli/src/skills/analytics/SKILL.md`; requirement 5 touches only spec 07.
 Structure (chosen design): rewrite
 the existing fan-out bullet, add one affirmative verification bullet, add one
 worked example. Do not touch the other four sub-headings or `<workflow>`/`<rules>`/
 `<examples>`.
 ### 1. Generalize the fan-out rule to multi-hop chains
 Rewrite the existing **`Avoid fan-out joins.`** bullet so it makes explicit that
 the danger is **cumulative**: *any* one-to-many hop on the path between a measure's
 owning table and the aggregate inflates that measure, **even when the offending
 join is several hops away from the `SUM`/`COUNT`**. The fix is the same as the
 single-hop case — **pre-aggregate the measure to its own grain in a CTE, then join
 the already-aggregated result** — but the agent must apply it **per
 measure-owning table along the whole chain**, not just at the final join. Keep the
 `fan-out` term and the one-line *why*.
 ### 2. Add an affirmative grain-verification habit
 Add a companion bullet, phrased as an action the agent performs **while
 composing** (not a prohibition):
 - Confirm that a join intended to be one-to-one / many-to-one **did not change the
  grain** it aggregates at — e.g. check that the row count (or the count of the
  aggregate's key) is unchanged across that join.
 - When a join is genuinely one-to-many, **reach for the default fix
  (pre-aggregate to grain)**; for a **pure count**, `COUNT(DISTINCT key)` is an
  acceptable escape hatch.
 - State the caveat once: **`SUM`/`AVG` of a fanned-out measure must pre-aggregate**
  — `DISTINCT` cannot de-duplicate a sum.
 This is spec 07's "build incrementally and check each layer" discipline pointed
 specifically at grain preservation, in affirmative form.
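 One way to picture the habit is to compare the row count with the count of the aggregate's key before and after each join. The following is a minimal sketch on an invented schema, not the skill's wording; the `grain_check` and `grain_holds` helpers are hypothetical names introduced here.

```python
import sqlite3

# Invented schema: orders is the grain we intend to aggregate at.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO stores VALUES (1, 100), (2, 200);
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO order_lines VALUES (1, 10), (2, 10), (3, 11), (4, 12), (5, 12);
""")


def grain_check(from_clause):
    """Return (row_count, distinct_order_count) for a FROM clause that aliases orders as o."""
    return conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT o.order_id) FROM {from_clause}"
    ).fetchone()


def grain_holds(check):
    """The order grain holds when every row is a distinct order."""
    rows, keys = check
    return rows == keys


base = grain_check("orders o")
many_to_one = grain_check("orders o JOIN stores s ON s.store_id = o.store_id")
one_to_many = grain_check("orders o JOIN order_lines l ON l.order_id = o.order_id")

print(base, many_to_one, one_to_many)
```

 The many-to-one join to `stores` leaves both counts unchanged, while the join to `order_lines` raises the row count above the key count, which is the signal to pre-aggregate before aggregating.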
 ### 3. One concrete, generic multi-hop worked example
 Add **exactly one** compact wrong-vs-right `sql` example inside `<sql_craft>`
 demonstrating the multi-hop inflation and the pre-aggregate fix. It is the
 **second** `sql` fence in the skill (the first is spec 07's window-then-filter
 example).
 **Required properties** (these are the constraints; the SQL below is orientation):
 - **Multi-hop chain** where the inflating one-to-many hop is **≥1 join removed**
  from the aggregate (not the single-hop case spec 07 already covers).
 - **Unambiguous attribution**: each counted entity maps to **exactly one** group,
  so the honest answer is well-defined. (This rules out "coarse measure attributed
  to a fine dimension reached by descending," where one entity spans several
  groups and the correct number is itself ambiguous — that would teach a murky
  pattern.)
 - **Motivated descent**: the finer-grain table is joined for a real reason (a
  line-level filter or a needed line-level value), so the reader sees *why* the
  fan-out join is there.
 - **Plain `COUNT`/`SUM`**, not `AVG` — averaging collides with the existing
  *Macro vs micro average* bullet and would muddy the fan-out lesson.
 - The **RIGHT side demonstrates the default fix** (pre-aggregate to grain in a
  CTE) and is **actually correct**, not merely runnable — its number must equal the
  honest answer, not just avoid an error.
 - Generic invented schema, standard dialect-agnostic SQL (no `QUALIFY`, no dialect
  functions), no benchmark identifiers or values.
 **Recommended sketch** (implementer may adjust within the properties above):
 ```sql
 -- "How many orders per region contain a returned item?"
 -- WRONG: joining order_lines to apply the line-level filter multiplies orders —
 -- an order with two returned lines is counted twice, three joins below the COUNT.
 SELECT r.region_id, COUNT(*) AS n_orders
 FROM regions r
 JOIN stores s      ON s.region_id = r.region_id
 JOIN orders o      ON o.store_id  = s.store_id
 JOIN order_lines l ON l.order_id  = o.order_id
 WHERE l.status = 'returned'
 GROUP BY r.region_id;
 -- RIGHT: collapse order_lines to one row per qualifying order first, then join up.
 WITH returned_orders AS (
  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
 )
 SELECT r.region_id, COUNT(*) AS n_orders
 FROM regions r
 JOIN stores s           ON s.region_id  = r.region_id
 JOIN orders o           ON o.store_id   = s.store_id
 JOIN returned_orders ro ON ro.order_id  = o.order_id
 GROUP BY r.region_id;
 -- A pure count could also use COUNT(DISTINCT o.order_id); a SUM/AVG of an
 -- order-level measure fanned out this way must pre-aggregate — DISTINCT can't
 -- de-duplicate a sum.
 ```
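 To confirm that the RIGHT side is actually correct and not merely runnable, the sketch above can be executed against a few invented rows. The data and the column types below are assumptions made for this check (the spec fixes only the table and column names), and `sqlite3` is used only as a runner; the SQL itself is the dialect-agnostic text from the sketch.

```python
import sqlite3

# Hypothetical sample rows for the invented schema used in the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY);
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER, status TEXT);
INSERT INTO regions VALUES (1), (2);
INSERT INTO stores VALUES (10, 1), (20, 2);
INSERT INTO orders VALUES (100, 10), (101, 10), (200, 20), (201, 20);
INSERT INTO order_lines VALUES
  (1, 100, 'returned'), (2, 100, 'returned'),  -- order 100: two returned lines
  (3, 101, 'returned'),
  (4, 200, 'shipped'),                         -- order 200: nothing returned
  (5, 201, 'returned'), (6, 201, 'shipped');
""")

WRONG = """
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s      ON s.region_id = r.region_id
JOIN orders o      ON o.store_id  = s.store_id
JOIN order_lines l ON l.order_id  = o.order_id
WHERE l.status = 'returned'
GROUP BY r.region_id
"""

RIGHT = """
WITH returned_orders AS (
  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s           ON s.region_id = r.region_id
JOIN orders o           ON o.store_id  = s.store_id
JOIN returned_orders ro ON ro.order_id = o.order_id
GROUP BY r.region_id
"""

# Count-only escape hatch: same join chain as WRONG, de-duplicated by key.
ESCAPE_HATCH = WRONG.replace("COUNT(*)", "COUNT(DISTINCT o.order_id)")

wrong = dict(conn.execute(WRONG).fetchall())
right = dict(conn.execute(RIGHT).fetchall())
escape_hatch = dict(conn.execute(ESCAPE_HATCH).fetchall())

print(wrong, right, escape_hatch)
```

 On these rows region 1 has two orders with a returned item; the WRONG query reports three because order 100 appears once per returned line, while the pre-aggregated query and the `COUNT(DISTINCT …)` variant both report two.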
 ### 4. Placement and structure
 - Both bullets live under the existing **Composition** sub-heading; the example
  follows them. The five-sub-heading structure spec 07 established is unchanged.
 - **State each rule once** (Anthropic "consistent terminology / don't repeat"):
  do not also restate the multi-hop rule in `<workflow>` steps 5/6 — those already
  carry a one-line pointer into `<sql_craft>`, which is sufficient.
 ### 5. Coordination with spec 07 (supersession)
 Spec 07's requirement 3 and acceptance criteria say the skill contains **exactly
 one** worked example and "Do not add a second example." **This spec supersedes
 that constraint**: the skill now carries **two** `sql` worked examples
 (window-then-filter from spec 07, plus this multi-hop fan-out example). Annotate
 spec 07 at those two spots with a one-line "superseded by spec 09" note so the two
 permanent specs do not contradict. No other spec 07 content changes.
 ## Leak-safety (hard constraint on this spec and its example)
 The benchmark's gold answers must never appear in ktx. The worked example must be
 a **synthetic, generic schema invented for teaching** — not the tables, column
 names, query, or numeric results of any Spider 2.0-Lite question. The example
 demonstrates the *pattern* (a coarse-grain measure counted after a one-to-many
 join), which is universal; it must be reconstructable from first principles by
 anyone, with zero reference to benchmark data. A reviewer should be able to read
 the example and find nothing that ties it to a specific benchmark instance.
 ## Acceptance criteria
 - The `<sql_craft>` **Composition** section states the **multi-hop generalization**
  of the fan-out rule (cumulative danger across the chain; pre-aggregate per
  measure-owning table) and an **affirmative grain-verification habit**, inline and
  dialect-agnostic.
 - The fix is presented as **default (pre-aggregate to grain) + escape hatch
  (`COUNT(DISTINCT key)`, count-only)**, with the explicit caveat that `SUM`/`AVG`
  of a fanned-out measure must pre-aggregate.
 - Exactly **one** new, **generic** worked example (wrong vs. pre-aggregated-right)
  using an invented schema, with no benchmark-derived identifiers or values, whose
  RIGHT side is actually correct (unambiguous attribution; honest number).
 - The skill now contains **two** `sql` worked examples total; the existing content
  test's fence-count assertion is updated `1 → 2` and new assertions cover the
  multi-hop rule phrase and the grain-verification-habit phrase.
 - Terminology is consistent with spec 07 (`grain`, `fan-out`, `pre-aggregate`); no
  synonyms introduced.
 - **No new tool, flag, or config.** Skill-content only; additive to spec 07.
 - All spec 07 invariants still hold: the skill remains dialect-agnostic (no
  `QUALIFY`/`strftime`/`julianday`, no backtick three-part FQTN, no relative-time
  anchoring to a `MAX(...)` date) and free of any benchmark/grader/gold reference,
  including in the new example; `<workflow>`/`<rules>`/`<examples>` and the other
  four sub-headings are intact; frontmatter still parses through
  `SkillsRegistryService.parseFrontmatter`; the skill stays under 500 lines.
 - Spec 07's "exactly one example" constraint is annotated as superseded (no
  contradiction between the two permanent specs).
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the prose.
 - **The skill file:** `packages/cli/src/skills/analytics/SKILL.md` →
  `<sql_craft>` → **Composition**. Rewrite the `Avoid fan-out joins` bullet, add
  the affirmative grain-verification bullet, add the one worked example after them.
  Leave the other four sub-headings, `<workflow>`, `<rules>`, and `<examples>`
  unchanged.
 - **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`. Update the
  "ships exactly one … worked example" test: `match(/```sql/g)` length `1 → 2`,
  add an assertion for the new fan-out example's distinctive tokens (e.g.
  `WITH returned_orders AS`), add the multi-hop-rule and grain-verification-habit
  phrases to the behavior-presence list, and keep all banned-construct and
  size-budget guards. This is a content assertion over the source `SKILL.md` — the
  right level for prompt content.
 - **Spec 07 annotation:** add a one-line "superseded by spec 09" note at spec 07's
  requirement 3 and at its "Exactly one new worked example" acceptance bullet.
 - **Rebuild/re-link** the dev binary so the playground picks up the change:
  `pnpm run build && pnpm run link:dev` (provides `ktx-dev`).
 ## Benchmark context (motivation only)
 Multi-hop aggregation questions (counting/averaging a coarse-grain measure
 reached through several one-to-many joins) are a recurring source of
 result-mismatch failures in the SQLite subset: the agent produces runnable SQL
 with the right tables but a fan-out-inflated number. These are correctness
 failures, not knowledge or schema-discovery failures (zero execution errors in the
 latest run), so the fix belongs in the product's authoring craft — where it also
 helps any real analyst — not in a benchmark-specific prompt. The skill itself must
 contain no trace of the benchmark.
 ## Implementation notes
 Shipped as specified — additive, content-only, no new tool/flag/config.
 - **`packages/cli/src/skills/analytics/SKILL.md`** → `<sql_craft>` → **Composition**:
  - Rewrote the `Avoid fan-out joins` bullet to `**Avoid fan-out joins — the
    danger is cumulative.**`, generalizing to multi-hop chains: any one-to-many
    hop between a measure's owning table and the aggregate inflates that measure
    even when several hops below the `SUM`/`COUNT`; fix is pre-aggregate per
    measure-owning table along the whole chain. Kept the `fan-out` term and the
    one-line *why*.
  - Added the affirmative `**Verify the grain holds across each join.**` bullet:
    confirm a one-to-one / many-to-one join did not change the grain (row/key
    count unchanged); default fix is pre-aggregate to grain, escape hatch is
    `COUNT(DISTINCT key)` for a pure count only; stated once that `SUM`/`AVG` of a
    fanned-out measure must pre-aggregate because `DISTINCT` cannot de-duplicate a
    sum.
  - Added one generic wrong-vs-right worked example (join chain `regions → stores →
    orders → order_lines`, `WITH returned_orders AS …`), the second `sql` fence in
    the skill. The inflating hop is three joins below the `COUNT`; the RIGHT side
    pre-aggregates `order_lines` to one row per qualifying order so each order is
    counted once (honest answer), and the trailing comment names the count-only
    `COUNT(DISTINCT o.order_id)` escape hatch plus the `SUM`/`AVG` caveat. Invented
    schema, dialect-agnostic SQL, no benchmark identifiers/values.
  - The other four sub-headings and `<workflow>`/`<rules>`/`<examples>` are
    untouched. Skill is 147 lines (well under the 500-line budget).
 - **`packages/cli/test/skills/analytics-skill-content.test.ts`**: sql-fence count
  `1 → 2`; added the multi-hop phrase (`the danger is cumulative`) and the
  grain-verification phrase (`Verify the grain holds across each join`) to the
  behavior-presence list; added new-example token assertions
  (`WITH returned_orders AS`, `COUNT(DISTINCT o.order_id)`). All banned-construct,
  relative-time, and size-budget guards retained. Test file passes (9/9).
 - **Spec 07** annotated as superseded at requirement 3 and at its "exactly one
  worked example" acceptance bullet — no contradiction between the two permanent
  specs.
 **Verification:** `vitest run test/skills/analytics-skill-content.test.ts` → 9
 passed. `pnpm run build` (src `tsc -p tsconfig.json`) succeeds and the built
 `dist/skills/analytics/SKILL.md` carries the new content; `pnpm run link:dev`
 re-linked `ktx-dev`. A pre-existing, unrelated type error in
 `test/mcp-server-factory.test.ts` (`KtxMcpContextPorts`/`context_tool`, last
 touched in commit `2677b3ef`) surfaces under the full `type-check`'s
 `tsconfig.test.json` pass; it is outside this change's surface and not introduced
 here.
--- a/spider2-specs/specs/10-panel-completeness-spine.md
+++ b/spider2-specs/specs/10-panel-completeness-spine.md
@ -1,289 +0,0 @@
 # Panel/period completeness — emit the full set of groups, not only the populated ones
 > Refined spec. Intake draft: `todo/10-panel-completeness-spine.md`.
 ## Problem
 When a question asks for a result *per period* or *per category* ("orders for
 each month of 2023", "revenue by region", "count per status"), a plain `GROUP BY`
 only returns groups that actually have rows. Periods or categories with **zero**
 activity silently vanish, so a "12 months" answer comes back with 9 rows and the
 three that should read `0` are simply absent. The SQL is runnable and the
 aggregate is right, but the **panel is incomplete** — and a monthly report with
 missing months or a category breakdown missing its empty categories is wrong for
 any analyst, on any database.
 The existing `<sql_craft>` "Answer completeness / interpretation" group already
 carries a *"For each X / per X / by X returns exactly one row per X"* rule, but
 that rule only governs **grain** (don't collapse to a single value). It says
 nothing about the **domain**: "one row per X" today means one row per *observed*
 X, so empty groups still drop. This spec sharpens that rule from grain-only to
 grain-and-completeness.
 ## Generic use case (independent of any benchmark)
 "How many orders were placed in each month of 2023?" must return **12 rows** even
 if March had no orders (March = 0), not 11. "Sales per region" should include
 regions with no sales when the question asks for *each* region. Both are
 bread-and-butter reporting for any analyst on any warehouse, with no benchmark in
 sight.
 ## Model
 The feature splits across **two surfaces**, each holding the half it is suited
 for. This split is the central design decision and exists to satisfy spec 07's
 hard dialect-agnostic invariant without weakening it.
 ### Why two surfaces (the dialect-agnostic reconciliation)
 The draft asked for a *"recursive-CTE date spine"* worked example. But a real
 date/number series is **inherently dialect-specific** — Postgres `generate_series`,
 SQLite recursive `date(d,'+1 month')`, BigQuery `GENERATE_DATE_ARRAY`, Snowflake
 `GENERATOR`+`DATEADD` — and spec 07 made `<sql_craft>` strictly dialect-agnostic
 (the analytics-skill content test bans single-dialect constructs). Inlining a date
 spine would violate that invariant; carving out a test exception would erode it.
 ktx already has the canonical home for engine-specific syntax: the per-dialect
 notes in `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by
 the `sql_dialect_notes` MCP tool (spec 08). Those files answer a fixed rubric
 (FQTN / Identifiers / Date-time / Top-N / JSON) — but **series/spine generation is
 not in that rubric yet**. So the date-spine syntax belongs *there*, alongside the
 other per-dialect idioms, and the dialect-agnostic skill points to it. This
 routes the dialect-specific half through the existing channel rather than
 standing up a parallel dialect-specific recipe inside the skill.
 Surface 1 (skill) carries the **pattern**; surface 2 (dialect notes) carries the
 **concrete series syntax**.
 ### Additive, inline, heuristic-with-a-why
 Consistent with spec 07: the skill change is **additive content in one Markdown
 file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/` file — the
 delivery mechanism in `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic,
 and phrased as a **heuristic with a one-line generic rationale**, not a wall of
 MUSTs. The dialect-notes change is additive content in the seven existing
 `dialects/*.md` files. No new tool, flag, or config on either surface.
 ## Requirements
 ### 1. Skill surface — `<sql_craft>` "Answer completeness / interpretation"
 Add the panel-completeness rule to the existing group (it extends, and should sit
 adjacent to, the *"For each X / per X / by X"* bullet). It must cover:
 1. **Recognize the full-panel cue.** *each / every / all / per <period> / for all
   <category> / by month* signals that the answer's row set should be the
   **complete expected domain** of periods or categories in scope, not just those
   present in the filtered fact rows. *Why:* a plain inner `GROUP BY` can only emit
   groups that have at least one fact row.
 2. **Spine → LEFT JOIN → COALESCE.** Build the full set of expected groups (the
   **spine**), then LEFT JOIN the aggregated facts onto it:
   - **Category/dimension spine:** the distinct values from the **domain-defining
     dimension/entity table** (e.g. all regions from a `regions` table), *not*
     `SELECT DISTINCT region FROM facts` — the latter yields only categories that
     already occur, so a zero-activity category still drops. When no dimension
     table exists, the distinct values from the **unfiltered** fact table are the
     best available domain (with the residual caveat that a category which never
     occurs at all cannot surface).
   - **Period/number spine:** generate the series for the question's stated range
     (e.g. each month of 2023 → Jan..Dec 2023). The series bounds come from the
     question's explicit range; when the range is "all periods present," derive
     bounds from `MIN`/`MAX` over the **unfiltered** facts. The concrete
     series-generation syntax is per-dialect — the rule points the author to
     `sql_dialect_notes` (see requirement 2) and shows no inline series SQL.
 3. **COALESCE by measure additivity.** Default missing measures with
   `COALESCE(metric, 0)` for **additive** measures (a `COUNT` or `SUM` of events
   or amounts — "no activity" genuinely reads as 0). Leave **non-additive**
   measures (`AVG`, a running balance, a price, a rate, a ratio) as **NULL** —
   absence is "no data," and 0 would be a wrong reading. *Why:* 0 is a real value
   only for additive measures.
 4. **Don't over-apply (the each-vs-which guard).** When the question asks only
   about groups that exist ("*which* months had orders", "regions that made a
   sale"), the spine is unnecessary and wrong — emit only observed groups. The cue
   is *each / all / every* (complete domain) vs *which / that have* (observed
   subset).
 5. **One worked example — the category spine, fully portable.** Add **exactly
   one** compact before/after example demonstrating the pattern with a
   **distinct-dimension spine**: the wrong shape (`GROUP BY` over facts, empty
   groups missing) and the right shape (`SELECT DISTINCT` domain from the
   dimension table → LEFT JOIN aggregated facts → `COALESCE(metric, 0)`). Generic
   table/column names, standard SQL only — no series generation, no dialect
   functions, so the example stays dialect-clean. The period-spine variant is
   described in prose (requirement 2) and delegated to `sql_dialect_notes`; it
   gets **no** inline example. This is the **third** worked `sql` example in the
   skill (after spec 07's window-then-filter and spec 09's multi-hop fan-out).
 6. **Step pointer, no duplication.** The validate/explain step (and/or the query
   step) already points into `<sql_craft>` for answer-completeness; extend that
   existing pointer's wording if needed, but state the rule **once** inside
   `<sql_craft>`. The step-5 pointer that lists what `sql_dialect_notes` provides
   ("FQTN, identifier-quoting, date, top-N, and JSON conventions") should also
   name the **series/calendar** convention now that it exists.
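 The category-spine recipe and the additive vs. non-additive default can be illustrated on a toy dataset. This is a sketch under stated assumptions, not the skill's shipped example: the `regions` / `orders` tables, their columns, and the values are invented here, and `sqlite3` is only a convenient runner for otherwise standard SQL.

```python
import sqlite3

# Invented schema: regions is the domain-defining dimension, orders the facts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region_id INTEGER, amount REAL);
INSERT INTO regions VALUES (1, 'north'), (2, 'south'), (3, 'west');
-- 'west' has no orders at all.
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 30.0);
""")

# Wrong shape for "orders per region, for each region": empty groups vanish.
naive = conn.execute(
    "SELECT region_id, COUNT(*) FROM orders GROUP BY region_id ORDER BY region_id"
).fetchall()

# Right shape: spine from the dimension, LEFT JOIN the aggregated facts,
# default the additive measure to 0 and leave the non-additive one NULL.
complete = conn.execute("""
    WITH spine AS (
        SELECT DISTINCT region_id, name FROM regions
    ),
    per_region AS (
        SELECT region_id, COUNT(*) AS n_orders, AVG(amount) AS avg_amount
        FROM orders
        GROUP BY region_id
    )
    SELECT s.name,
           COALESCE(p.n_orders, 0) AS n_orders,  -- additive: no activity reads as 0
           p.avg_amount                          -- non-additive: absence stays NULL
    FROM spine s
    LEFT JOIN per_region p ON p.region_id = s.region_id
    ORDER BY s.name
""").fetchall()

print(naive)
print(complete)
```

 The naive query returns two rows, whereas the spine query returns all three regions, with `west` showing a count of 0 and a NULL average. For a "which regions had orders" question, the naive shape would be the right one.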
 ### 2. Dialect-notes surface — `dialects/*.md`
 Add a **"Series"** (date/number range) line to **each** of the seven authored
 dialect files, giving that engine's idiomatic way to generate a contiguous
 date or integer series for use as a spine. Each note is engine-exclusive — a
 SQLite analyst gets the SQLite idiom and never another engine's construct, per the
 existing dialect-notes leak guards. Orientation (exact syntax is the
 implementer's):
 - **postgres:** `generate_series('2023-01-01'::date, '2023-12-01'::date, interval '1 month')`.
 - **sqlite:** recursive CTE — `WITH RECURSIVE m(d) AS (SELECT '2023-01-01' UNION ALL SELECT date(d,'+1 month') FROM m WHERE d < '2023-12-01')`.
 - **bigquery:** `UNNEST(GENERATE_DATE_ARRAY('2023-01-01','2023-12-01', INTERVAL 1 MONTH))` (and `GENERATE_ARRAY` for integers).
 - **snowflake:** `TABLE(GENERATOR(ROWCOUNT => n))` with `DATEADD('month', SEQ4(), start)`, or a recursive CTE.
 - **mysql:** recursive CTE (8.0+) with `DATE_ADD(d, INTERVAL 1 MONTH)`.
 - **clickhouse:** `numbers(n)` / `range(n)` with `addMonths(start, number)` (or `arrayJoin`).
 - **tsql:** recursive CTE with `DATEADD(month, …)`, or a numbers/tally table.
 This line is what makes the period spine usable from the dialect-agnostic skill,
 and it is also consumed by **spec 11** (rolling-window-over-gappy-dates needs the
 same date spine) — so it is foundational, not scope creep.
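 As an illustration of how one of these idioms feeds the spine recipe, the sketch below runs the SQLite orientation line from the list above against an invented `orders` table. The table, its rows, and the `strftime`-based month key are assumptions made for this note; they are SQLite-specific and belong on the dialect-notes side, never in the skill.

```python
import sqlite3

# Invented facts: no orders in March 2023, one order outside the stated range.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
INSERT INTO orders VALUES
  (1, '2023-01-05'), (2, '2023-01-20'), (3, '2023-02-11'),
  (4, '2023-04-02'), (5, '2022-12-31');
""")

rows = conn.execute("""
    WITH RECURSIVE m(d) AS (
        SELECT '2023-01-01'
        UNION ALL
        SELECT date(d, '+1 month') FROM m WHERE d < '2023-12-01'
    ),
    per_month AS (
        SELECT strftime('%Y-%m', order_date) AS ym, COUNT(*) AS n_orders
        FROM orders
        WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
        GROUP BY strftime('%Y-%m', order_date)
    )
    SELECT strftime('%Y-%m', m.d) AS ym,
           COALESCE(p.n_orders, 0) AS n_orders
    FROM m
    LEFT JOIN per_month p ON p.ym = strftime('%Y-%m', m.d)
    ORDER BY m.d
""").fetchall()

by_month = dict(rows)
print(rows)
```

 The result has one row per month of 2023, including the months with no orders, which is the complete panel the skill-side rule asks for.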
 ### 3. Coordination with spec 11
 Spec 11 (time-series window recipes) explicitly depends on this date spine for the
 gappy-rolling case ("build a complete date spine first (see spec 10)"). Spec 10
 establishes the spine concept in the Answer-completeness group and the
 series syntax in the dialect notes; spec 11 reuses both from the Window-functions
 group. Keep the two non-overlapping: spec 10 owns the spine; spec 11 references it.
 ## Leak-safety (hard constraint)
 Any worked example or note must use a **synthetic generic schema** (e.g. an
 `orders` table with an `order_date`, a `regions` dimension) and demonstrate only
 the *pattern* (spine + LEFT JOIN + COALESCE). **No** benchmark table names, SQL,
 or result values on either surface. The dialect-notes additions, like the existing
 notes, carry no benchmark/grader/version-dated content. The behavior is
 reconstructable from first principles and tied to no specific instance.
 ## Acceptance criteria
 - `<sql_craft>` "Answer completeness / interpretation" states: the full-panel cue,
  the spine → LEFT JOIN → COALESCE recipe, the additive-vs-non-additive COALESCE
  discriminator (0 vs NULL), and the each-vs-which over-application guard —
  inline, dialect-agnostic, each with a generic *why*.
 - Exactly **one** new worked `sql` example is present, a portable
  distinct-dimension spine (`SELECT DISTINCT` domain → LEFT JOIN → `COALESCE`),
  with no series generation and no dialect-specific syntax. The skill then carries
  **three** `sql` worked examples total.
 - Each of the seven `dialects/*.md` files gains a **Series** (date/number range)
  line in its engine's own idiom; no engine leaks another engine's construct, and
  the additions contain no benchmark/grader/version-dated content.
 - The skill remains dialect-clean: no `QUALIFY`, `strftime`, `julianday`,
  `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, or other
  single-dialect construct anywhere in `SKILL.md`, including the new example.
 - The existing interactive guidance (`<workflow>`, `<rules>`, the other examples)
  and the existing dialect-note rubric lines are intact and uncontradicted.
 - No grader/benchmark reference, no output-shape contract, and no anchoring of
  *relative* time ("recent" / "past N months") to a `MAX(date)` over the data
  appears (period-spine bounds derive from the question's explicit range or, for
  "all periods present," from `MIN`/`MAX` over the **unfiltered** facts, which is
  range derivation, not relative-time anchoring).
 - The skill stays scannable and comfortably under the 500-line budget; frontmatter
  still parses as `ktx-analytics`.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the prose.
 - **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the
  panel-completeness bullets to the Answer-completeness group, the single category
  spine example, and extend the existing step pointer / dialect-notes provision
  list to name the series convention. Leave `<workflow>`/`<rules>`/other examples
  intact. Delivery is unchanged (single `SKILL.md` per target via
  `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change required.
 - **Dialect notes:** the seven files under
  `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
  `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
  `copy-runtime-assets.mjs` — no plumbing change, content only.
 - **Tests:**
  - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
    representative phrase for the completeness rule; bump the `sql`-fence count
    assertion **2 → 3**; assert the spine + LEFT JOIN + `COALESCE` shape; the
    existing dialect-clean guards already cover the no-inline-series requirement
    (the example is `SELECT DISTINCT`, so they pass unchanged).
  - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the rubric loop
    (the "answers the full rubric for every dialect" test) so every dialect must
    also answer a **Series** line, e.g. `expect(notes).toMatch(/\*\*Series/)`.
    Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
    all seven without a hand-maintained list.
 - Rebuild and re-link the dev binary so the playground picks up both surfaces:
  `pnpm run build && pnpm run link:dev`.
 ## Benchmark context (motivation only)
 Per-period / per-category questions where some periods are empty produce
 short-row result mismatches in the SQLite subset, and the related rolling/cumulative
 cluster (spec 11) needs a complete date spine to be correct at all. The fix is a
 universal reporting habit (complete panels) plus the per-dialect series syntax
 that makes it executable — both belong in the product, where they help real
 analysts. Improving the benchmark score is a side effect; the skill and the
 dialect notes contain no trace of the benchmark.
 ## Implementation notes
 Shipped on branch `write-feature-spec-wiki`. Content-only across two surfaces, no
 new tool/flag/config, no plumbing change.
 **Surface 1 — skill (`packages/cli/src/skills/analytics/SKILL.md`):**
 - Added a **"Complete the panel for 'each / every / all / per <period or
  category>'"** bullet to the `<sql_craft>` "Answer completeness / interpretation"
  group, directly after the *"For each X / per X / by X"* bullet, with three
  sub-bullets carrying the rest of the rule, each with its generic *why*: **Spine
  source** (distinct domain from the dimension/entity table — not `SELECT DISTINCT`
  over the facts; period/number series across the question's stated range, bounds
  from `MIN`/`MAX` over the *unfiltered* facts for "all periods present"; series
  syntax delegated to `sql_dialect_notes`), **Default by additivity**
  (`COALESCE(metric, 0)` for additive measures, `NULL` for non-additive), and
  **Don't over-apply** (the each-vs-which guard).
 - Added **one** worked `sql` example at the end of the Answer-completeness group: a
  portable distinct-dimension spine (`SELECT DISTINCT region_id FROM regions` →
  `LEFT JOIN` aggregated facts → `COALESCE(ro.n_orders, 0)`), wrong-vs-right,
  standard SQL only, no series generation, no dialect functions. The skill now
  carries **three** `sql` worked examples.
 - Extended the step-5 dialect-notes pointer to name the **series/calendar**
  convention alongside FQTN / identifier-quoting / date / top-N / JSON.
 - Delivery unchanged: `readAnalyticsSkillContent` in `setup-agents.ts` ships the
  single `SKILL.md` per target — confirmed, no change.
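The wrong-vs-right shape of the spine example can be sketched as follows. This is a minimal illustration run through Python's `sqlite3`, not the example that shipped in `SKILL.md`; the `orders` table, its columns, and the sample rows are assumptions made for the demo, while `regions`, `region_id`, and `ro.n_orders` follow the names quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY);
    CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, region_id INTEGER);
    INSERT INTO regions VALUES (1), (2), (3);
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);  -- region 2 has no orders
""")

# Wrong: grouping the facts alone silently drops region 2.
facts_only = conn.execute("""
    SELECT region_id, COUNT(*) AS n_orders
    FROM orders
    GROUP BY region_id
    ORDER BY region_id
""").fetchall()

# Right: spine from the dimension table, LEFT JOIN the aggregated facts,
# and default the additive measure to 0.
complete_panel = conn.execute("""
    WITH spine AS (
        SELECT DISTINCT region_id FROM regions
    ),
    ro AS (
        SELECT region_id, COUNT(*) AS n_orders
        FROM orders
        GROUP BY region_id
    )
    SELECT s.region_id, COALESCE(ro.n_orders, 0) AS n_orders
    FROM spine AS s
    LEFT JOIN ro ON ro.region_id = s.region_id
    ORDER BY s.region_id
""").fetchall()

print(facts_only)      # [(1, 2), (3, 1)]
print(complete_panel)  # [(1, 2), (2, 0), (3, 1)]
```

A count is additive, so `0` is the honest default here; a non-additive measure such as an average would stay `NULL` for the empty region.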
 **Surface 2 — dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`):**
 - Added a `- **Series:**` line to all seven authored files (postgres, sqlite,
  bigquery, snowflake, mysql, clickhouse, tsql), each in that engine's own idiom
  (`generate_series`; recursive CTE with `date(d,'+1 month')`;
  `UNNEST(GENERATE_DATE_ARRAY(...))`; `GENERATOR`/`SEQ4`/`DATEADD`; recursive CTE
  with `DATE_ADD`; `numbers(n)`/`addMonths`; recursive CTE with `DATEADD` +
  `MAXRECURSION`), placed right after each file's Date/time line. No cross-engine
  leak, no version-dated/benchmark content. Shipped to `dist` unchanged by
  `copy-runtime-assets.mjs`; coverage stays derived from `DIALECTS_WITH_NOTES`.
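For orientation, the sqlite idiom named above (a recursive CTE stepping with `date(d,'+1 month')`) might look like the sketch below. The `sales` table and its rows are hypothetical, and the exact wording of the shipped Series line is not reproduced here; the bounds come from `MIN`/`MAX` over the unfiltered facts, which is the "all periods present" case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2023-01-15', 100.0),
        ('2023-01-20',  50.0),
        ('2023-04-02',  75.0);
""")

rows = conn.execute("""
    WITH RECURSIVE
    bounds AS (
        -- range derivation from the unfiltered facts
        SELECT date(MIN(sale_date), 'start of month') AS lo,
               date(MAX(sale_date), 'start of month') AS hi
        FROM sales
    ),
    months(d) AS (
        SELECT lo FROM bounds
        UNION ALL
        SELECT date(d, '+1 month') FROM months, bounds WHERE d < hi
    ),
    monthly AS (
        SELECT date(sale_date, 'start of month') AS d, SUM(amount) AS total
        FROM sales
        GROUP BY date(sale_date, 'start of month')
    )
    SELECT m.d, COALESCE(monthly.total, 0.0) AS total
    FROM months AS m
    LEFT JOIN monthly ON monthly.d = m.d
    ORDER BY m.d
""").fetchall()

for month_start, total in rows:
    print(month_start, total)
```

February and March have no sales, yet they appear with a total of `0.0` because the spine, not the facts, drives the row set.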
 **Tests:**
 - `test/skills/analytics-skill-content.test.ts`: added the `Complete the panel`
  and `Default by additivity` phrases; renamed the worked-examples test and bumped
  the `sql`-fence count **2 → 3**; asserted the spine + `LEFT JOIN` + `COALESCE`
  shape. Also added `generate_series` and `GENERATE_DATE_ARRAY` to the
  dialect-clean banned list — a deliberate **strengthening** beyond the spec's
  test orientation so the "no inline series" acceptance criterion is *enforced*,
  not merely incidentally true of a `SELECT DISTINCT` example.
 - `test/context/mcp/dialect-notes.test.ts`: extended the "answers the full rubric
  for every dialect" loop with `expect(notes).toMatch(/\*\*Series/)`, so all seven
  dialects are required to answer a Series line (coverage derived from
  `DIALECTS_WITH_NOTES`, no hand-maintained list).
 **Verification:** both affected test files pass (19 tests). `src` type-check and
 `pnpm run build` are clean, and `copy-runtime-assets.mjs` placed the Series line in
 all seven `dist` dialect files; `pnpm run link:dev` re-linked `ktx-dev`. Note: an
 unrelated, pre-existing `tsconfig.test.json` type error in
 `test/mcp-server-factory.test.ts` exists on this branch — untouched by this work
 and outside its scope.
 **Coordination with spec 11:** the per-dialect Series line is the foundational
 date spine that spec 11 (rolling/cumulative windows over gappy dates) references.
 Spec 10 owns the spine (Answer-completeness group + dialect Series notes); spec 11
 will reference it from the Window-functions group. No overlap introduced.
--- a/spider2-specs/specs/11-time-series-window-recipes.md
+++ b/spider2-specs/specs/11-time-series-window-recipes.md
@ -1,391 +0,0 @@
 # Time-series window craft — running totals, rolling-over-time (min-periods), period-over-period
 > Refined spec. Intake draft: `todo/11-time-series-window-recipes.md`.
 ## Problem
 A large share of analytics questions are time-series shaped: a **running /
 cumulative balance**, a **rolling N-day average**, or **period-over-period
 growth**. The agent already knows window functions exist — spec 07 gave the
 `<sql_craft>` "Window functions" group its determinism and window-then-filter
 rules, and spec 10 added panel/period completeness — but it still gets the
 *time-series specifics* wrong:
 - a cumulative balance computed **without an explicit unbounded-preceding
  frame**, or with the implicit frame misbehaving when there are **ties on the
  order key**;
 - "rolling 30 days" implemented as `ROWS BETWEEN 29 PRECEDING` over **gappy**
  daily data, so the window covers the wrong calendar span whenever days are missing;
 - no **minimum-periods** handling — a rolling average reported before the window
  is actually full;
 - "growth vs the previous period" written **without `LAG`** (or against the wrong
  neighbor), with an **unguarded** `(cur - prev) / prev` that breaks on a zero or
  absent prior.
 These are runnable-but-wrong: the structure is close, the edge case diverges.
 It is the same failure shape spec 07 addressed at the general level; this spec
 adds the time-series specifics to the **same Window-functions group**, building
 on the rules already there rather than restating them.
 ## Generic use case (independent of any benchmark)
 - "Each account's month-end running balance over 2023" — a cumulative sum of
  monthly net over an ordered window.
 - "30-day rolling average of daily revenue, only once 30 days of history exist."
 - "Month-over-month revenue growth rate."
 All three are bread-and-butter for any analyst on any time-series table, with no
 benchmark in sight. The methodology is universal analyst craft, so it belongs in
 the shipped skill — it transfers to every ktx user querying a live database.
 ## Model
 The change is **additive content across two surfaces** — the same split spec 10
 made, and for the same reason. The split is the central design decision; it
 satisfies spec 07's hard dialect-agnostic invariant for `<sql_craft>` without
 weakening it.
 ### Why two surfaces (the dialect-agnostic reconciliation)
 Two of the three recipes are **pure standard SQL** and stay entirely in the
 dialect-agnostic skill:
 - **Cumulative / running total** — `SUM(x) OVER (... ROWS BETWEEN UNBOUNDED
  PRECEDING AND CURRENT ROW)` is standard on every engine.
 - **Period-over-period** — `LAG(metric) OVER (...)`, the growth ratio, and a
  `NULLIF`-style divide-by-zero guard are standard on every engine.
 The third recipe — a **rolling window over calendar time** — has one piece that
 is genuinely dialect-divergent: the **calendar-range window frame**. A native
 range frame such as `RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`
 exists on some engines (e.g. postgres, mysql 8) but **not others** — sqlite has
 no date-interval range frame, and SQL Server has **no offset `RANGE` frames at
 all**; bigquery's `RANGE` frames are numeric-only. So a portable skill cannot
 inline a range frame any more than it could inline a date-series generator.
 ktx already routes that kind of engine-specific syntax through the per-dialect
 notes in `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by
 the `sql_dialect_notes` MCP tool (spec 08). Spec 10 established the precedent
 exactly: series/spine generation was not in the dialect rubric, so it was added
 there (the **Series** line) and the dialect-agnostic skill points to it.
 Rolling-window framing is the next construct in that same position — not in the
 rubric yet, dialect-specific — so the **rolling-window idiom belongs in the
 dialect notes**, and the skill points to it.
 Surface 1 (skill) carries the **pattern** (calendar range, not a row count; the
 min-periods guard; the spine-or-range choice). Surface 2 (dialect notes) carries
 the **concrete rolling-window frame syntax** per engine.
 ### Additive, inline, heuristic-with-a-why
 Consistent with specs 07 and 10: the skill change is **additive content in one
 Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/`
 file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, and phrased as
 **heuristics with a one-line generic rationale**, not a wall of MUSTs. The
 dialect-notes change is additive content in the seven existing `dialects/*.md`
 files. No new tool, flag, or config on either surface.
 ### Build on the rules already present; do not restate them
 The Window-functions group already carries **"Make the ordering deterministic"**
 (complete tie-breaker) from spec 07, and the Numeric-precision group carries
 **"Round only at the end."** The cumulative and period-over-period recipes
 **reference** these rather than repeat them (state each rule once — Anthropic's
 "consistent terminology / don't repeat" guidance, already followed in spec 07).
 Spec 10's **Series** dialect line is likewise **referenced** by the rolling
 recipe's spine fallback, not duplicated.
 ## Requirements
 ### 1. Skill surface — `<sql_craft>` "Window functions" group (three recipes)
 Add three recipes to the **existing** "Window functions" group, after its two
 current bullets (deterministic ordering; filter-after-the-window). Each is a
 heuristic with a generic *why*, dialect-agnostic.
 1. **Cumulative / running total.** Use an **explicit frame** — `SUM(x) OVER
   (PARTITION BY k ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` —
   with a **complete tie-breaker** on the `ORDER BY` (per the group's existing
   deterministic-ordering rule; reference it, do not restate). *Why:* a bare
   `ORDER BY` defaults to a `RANGE … CURRENT ROW` frame, which on **ties in the
   order key** folds every tied peer into the same cumulative value — it runs and
   looks plausible, but the running total jumps at each tie boundary.
 2. **Rolling window over calendar time, plus minimum periods.** "Rolling N
   days/months" must span a **calendar range**, not a fixed row count: a `ROWS
   BETWEEN n-1 PRECEDING` frame silently measures the wrong span when days are
   missing. Two sanctioned techniques:
   - **Spine + `ROWS` (portable).** Build a gap-free date spine first (spec 10's
     **Series**, via `sql_dialect_notes`) so the data has one row per calendar
     unit; then a `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` frame equals the
     intended calendar span. This path is fully dialect-agnostic.
   - **Native range frame or date-keyed self-join (engine-specific).** Where the
     engine supports it, a calendar **range frame** expresses the window directly;
     otherwise a self-join keyed on the date does. Both use engine-specific
     syntax — get the **rolling-window** idiom from `sql_dialect_notes` (see
     requirement 3); show no inline range frame in the skill.
   **Minimum periods.** When the question says "only after N periods of data" (or
   a rolling metric implies it), emit `NULL` / skip until the window is actually
   full — guard on a window count, e.g. `COUNT(*) OVER (<same frame>) = N`. On a
   gap-free spine, `COUNT(*)` counts calendar slots; count the **non-null
   observations** instead when "N periods" means N data points rather than N
   calendar units. *Why:* a row-count frame over missing dates measures the wrong
   span, and a partial early window is not the requested metric.
 3. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)`
   for the prior-period comparison; compute growth as `(cur - prev) / prev` at
   **full precision**, rounding only in the final projection (per the existing
   "Round only at the end" rule), and **guard divide-by-zero / NULL prev**
   (e.g. divide by `NULLIF(prev, 0)`). *Why:* without `LAG` — or ordered against
   the wrong neighbor — the comparison lands on the wrong period, and an unguarded
   ratio errors or returns garbage when the prior period is zero or absent.
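The period-over-period recipe can be sketched on a single series as below. This is an illustration only: the `monthly_revenue` table and its values are assumptions, and the `PARTITION BY k` part is omitted because there is just one series. SQLite happens to return `NULL` on division by zero, whereas engines such as postgres raise an error, so the `NULLIF` guard is what keeps the pattern portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_revenue (month TEXT PRIMARY KEY, revenue REAL);
    INSERT INTO monthly_revenue VALUES
        ('2023-01',   0.0),
        ('2023-02', 200.0),
        ('2023-03', 250.0);
""")

rows = conn.execute("""
    WITH with_prev AS (
        SELECT month,
               revenue,
               LAG(revenue) OVER (ORDER BY month) AS prev
        FROM monthly_revenue
    )
    SELECT month,
           -- full-precision ratio, guarded divide, rounded only in the final projection
           ROUND((revenue - prev) / NULLIF(prev, 0), 4) AS growth
    FROM with_prev
    ORDER BY month
""").fetchall()

for month, growth in rows:
    print(month, growth)
```

January has no prior period and February's prior period is zero, so both report `NULL` growth; March reports `0.25`.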
 **Step pointer (no duplication).** The step-5 `sql_dialect_notes` provision list
 (currently "FQTN, identifier-quoting, date, top-N, series/calendar, and JSON
 conventions") should also name the **rolling-window** convention now that it
 exists. State each rule once inside `<sql_craft>`; the workflow steps only point
 to it.
 ### 2. One worked example — cumulative running total (dialect-agnostic)
 Add **exactly one** new compact before/after `sql` example, demonstrating the
 **cumulative running total** — the subtlest of the three (the implicit-frame trap
 runs fine and is wrong only at tie boundaries) and the highest-value to show.
 Use a synthetic generic schema (e.g. `account_txns(account_id, txn_id, txn_date, net)`):
 - **Wrong:** `SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date)` — the
  implicit `RANGE` frame makes two txns on the same date share one inflated
  running balance.
 - **Right:** the same with an explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND
  CURRENT ROW` frame and a complete tie-breaker (`ORDER BY txn_date, txn_id`).
 Standard SQL only — no `QUALIFY`, no dialect functions, no series generation, no
 `RANGE … INTERVAL`. Keep it ~10–14 lines. The **rolling-over-time** recipe gets
 **no** inline example (its correct form needs the engine-specific frame/spine,
 delegated to `sql_dialect_notes`, exactly as spec 10's period-spine variant was
 prose-only); the **period-over-period** recipe is self-evident from its bullet
 and also gets no example. This is the **fourth** worked `sql` example in the
 skill, after spec 07 (window-then-filter), spec 09 (multi-hop fan-out), and
 spec 10 (panel-completeness spine).
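To make the tie-boundary trap concrete, here is a small runnable sketch of the wrong and right forms, executed through Python's `sqlite3`. It assumes a bundled SQLite with window-function support (3.25 or later) and uses invented sample rows; it is not the example text that goes into `SKILL.md`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account_txns (
        account_id INTEGER, txn_id INTEGER, txn_date TEXT, net REAL
    );
    INSERT INTO account_txns VALUES
        (1, 101, '2023-01-05', 100.0),
        (1, 102, '2023-01-05',  50.0),   -- same date as txn 101: a tie
        (1, 103, '2023-01-09', -30.0);
""")

# Wrong: the implicit RANGE frame treats both 2023-01-05 rows as peers.
wrong = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date) AS balance
    FROM account_txns
    ORDER BY txn_id
""").fetchall()

# Right: explicit ROWS frame plus a complete tie-breaker.
right = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (
               PARTITION BY account_id
               ORDER BY txn_date, txn_id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS balance
    FROM account_txns
    ORDER BY txn_id
""").fetchall()

print(wrong)  # [(101, 150.0), (102, 150.0), (103, 120.0)]
print(right)  # [(101, 100.0), (102, 150.0), (103, 120.0)]
```

Only txn 101 differs between the two results, which is why the implicit-frame version passes a casual glance.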
 ### 3. Dialect-notes surface — `dialects/*.md` (rolling window)
 Add a **rolling-window-over-time** idiom line to **each** of the seven authored
 dialect files, parallel to spec 10's **Series** line. Each note is
 engine-exclusive — a SQLite analyst gets the SQLite idiom and never another
 engine's construct, per the existing dialect-notes leak guards. Each note either
 gives the engine's native calendar-range frame **or** references its own
 **Series** line for the spine + `ROWS` fallback (a cross-reference within the
 file, not a duplicate of the Series line).
 Orientation only — **`RANGE`-frame support genuinely varies by engine and
 version, so the implementer must verify each engine's current support against
 authoritative docs (context7 / the engine's manual) rather than assert it from
 memory.** Starting points:
 - **postgres:** native — `... OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days'
  PRECEDING AND CURRENT ROW)`.
 - **mysql (8.0+):** native — `RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT
  ROW` over a temporal order key.
 - **bigquery:** `RANGE` frames are **numeric** — range over an integer day key
  (e.g. `UNIX_DATE(day)`) with `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or
  build a spine (see **Series**) and use a `ROWS` frame.
 - **sqlite:** **no** date-interval range frame — build a date spine (see
  **Series**) and use a `ROWS` frame.
 - **tsql (SQL Server):** **no** offset `RANGE` frames at all — build a spine (see
  **Series**) and use a `ROWS` frame, or a date-keyed self-join.
 - **snowflake / clickhouse:** range-frame support over dates is limited — verify;
  default to a spine (see **Series**) + `ROWS` frame where a native calendar range
  frame is unavailable.
 This line is what makes the rolling-over-time recipe executable from the
 dialect-agnostic skill. It is **distinct** from spec 10's Series line (Series =
 how to *generate* a spine; Rolling window = how to compute a *moving
 calendar-range aggregate*, natively or via that spine), and it cross-references
 the Series line rather than overlapping it.
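As an illustration of the sqlite starting point (spine + `ROWS`), the sketch below contrasts a row-count frame over gappy data with the same frame over a gap-free spine, and adds the min-periods guard. It uses a 3-day window to keep the data small; the `daily_revenue` rows are invented, and the SQL is an assumption about how the idiom could be written, not the shipped note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL);
    INSERT INTO daily_revenue VALUES
        ('2023-01-01', 10.0),
        ('2023-01-02', 20.0),
        -- 2023-01-03 and 2023-01-04 are missing
        ('2023-01-05', 30.0),
        ('2023-01-06', 40.0);
""")

# Wrong: three *rows* back reaches 2023-01-01 from 2023-01-05 (five calendar days).
row_frame = dict(conn.execute("""
    SELECT day,
           SUM(amount) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    FROM daily_revenue
""").fetchall())

# Right: one row per calendar day first, then the same ROWS frame,
# reported only once the window holds three calendar slots.
spine_frame = dict(conn.execute("""
    WITH RECURSIVE
    bounds AS (SELECT MIN(day) AS lo, MAX(day) AS hi FROM daily_revenue),
    spine(day) AS (
        SELECT lo FROM bounds
        UNION ALL
        SELECT date(day, '+1 day') FROM spine, bounds WHERE day < hi
    ),
    filled AS (
        SELECT s.day, r.amount
        FROM spine AS s
        LEFT JOIN daily_revenue AS r ON r.day = s.day
    )
    SELECT day,
           CASE WHEN COUNT(*) OVER w = 3 THEN SUM(amount) OVER w END AS rolling_3d
    FROM filled
    WINDOW w AS (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    ORDER BY day
""").fetchall())

print(row_frame["2023-01-05"])    # 60.0
print(spine_frame["2023-01-05"])  # 30.0
print(spine_frame["2023-01-02"])  # None
```

Here `COUNT(*)` counts calendar slots; counting `amount` instead would enforce "three data points" rather than "three calendar days".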
 ### 4. Explicit constraints / exclusions
 None of the following may appear (consistent with specs 07 and 10):
 - **No inline dialect-specific range-frame syntax in the skill** — no
  `RANGE … INTERVAL` frame, no series generator, no dialect function. The skill
  stays dialect-clean; the range frame lives only in the dialect notes.
 - **No anchoring of relative time to `MAX(date)`.** "Recent" / "past N months"
  means relative to *now* on a live database. A range *bound* may be derived from
  the question's explicit range or, for "all periods present," from `MIN`/`MAX`
  over the **unfiltered** facts (range derivation, per spec 10) — but the metric
  must never silently redefine "recent" as the data's maximum date.
 - **No grader / gold-answer / benchmark reference**, and no output-shape contract
  (the skill is for interactive analysis).
 ### 5. Coordination with specs 07 and 10
 All three recipes live in the **existing** `<sql_craft>` "Window functions"
 group; the two current bullets and the spec-07 window-then-filter example must
 stay intact and uncontradicted.
 - **Spec 07** owns the deterministic-ordering rule (Window functions) and the
  round-at-the-end rule (Numeric precision). Spec 11 **builds on** both —
  references them, never restates them.
 - **Spec 10** owns the spine concept and the dialect **Series** line. Spec 11
  **references** the spine for the gappy-rolling fallback and adds the **distinct**
  rolling-window dialect line. Keep them non-overlapping: spec 10 = how to make a
  spine; spec 11 = how to compute a moving calendar-range aggregate (native frame
  or spine + `ROWS`).
 ## Leak-safety (hard constraint)
 Every worked example or note uses a **synthetic generic schema** (e.g.
  `daily_revenue(day, amount)` or `account_txns(account_id, txn_id, txn_date, net)`) and
 shows only the *pattern*. **No** benchmark table names, SQL, or result values on
 either surface. The dialect-notes additions, like the existing notes, carry no
 benchmark / grader / version-dated content. The behavior is reconstructable from
 first principles and tied to no specific instance.
 ## Acceptance criteria
 - The `<sql_craft>` "Window functions" group states the three recipes — inline,
  dialect-agnostic, each with a generic *why*, and each **building on** (not
  restating) the deterministic-ordering and round-at-the-end rules:
  - **cumulative / running total** with an explicit `ROWS BETWEEN UNBOUNDED
    PRECEDING AND CURRENT ROW` frame and a complete tie-breaker;
  - **rolling window over calendar time + minimum periods** — calendar range not
    row count, the spine-or-range choice, the min-periods `COUNT(*) OVER (...)`
    guard — delegating the engine's range-frame syntax to `sql_dialect_notes`;
  - **period-over-period** via `LAG`, with full-precision growth and a
    divide-by-zero / NULL-prev guard.
 - Exactly **one** new worked `sql` example: the cumulative running total,
  wrong-vs-right, with the explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
  ROW` frame and a complete tie-breaker, in standard dialect-agnostic SQL. The
  skill then carries **four** `sql` worked examples total.
 - Each of the seven `dialects/*.md` files gains a **rolling-window-over-time**
  idiom line in its engine's own idiom (native calendar-range frame where
  supported, otherwise a spine + `ROWS` fallback that references its **Series**
  line); no engine leaks another engine's construct, and the additions contain no
  benchmark / grader / version-dated content.
 - The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`,
  `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, **and no
  inline `RANGE … INTERVAL` frame**, anywhere in `SKILL.md` including the new
  example.
 - The step-5 `sql_dialect_notes` provision list names the **rolling-window**
  convention alongside FQTN / identifier-quoting / date / top-N / series/calendar /
  JSON.
 - The existing interactive guidance (`<workflow>`, `<rules>`, the other
  examples), the two existing Window-functions bullets, the window-then-filter
  example, and the existing dialect-note rubric lines (including **Series**) are
  intact and uncontradicted.
 - No grader / benchmark reference, no output-shape contract, and no anchoring of
  *relative* time ("recent" / "past N months") to a `MAX(date)` over the data.
 - The skill stays scannable and comfortably under the 500-line budget; frontmatter
  still parses as `ktx-analytics`.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the prose.
 - **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three recipes
  to the "Window functions" group (after its two existing bullets), the single
  cumulative worked example, and extend the step-5 dialect-notes provision list to
  name the rolling-window convention. Leave `<workflow>` / `<rules>` / the other
  examples and the two existing window bullets intact. Delivery is unchanged
  (single `SKILL.md` per target via `readAnalyticsSkillContent` in
  `setup-agents.ts`) — confirm, no change required.
 - **Dialect notes:** the seven files under
  `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
  `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
  `copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each
  engine's actual `RANGE`-frame support against authoritative docs before writing
  the idiom; do not assert from memory.**
 - **Tests:**
  - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
    representative phrase for each of the three recipes; bump the `sql`-fence count
    assertion **3 → 4**; assert the cumulative example shape (e.g. `ROWS BETWEEN
    UNBOUNDED PRECEDING AND CURRENT ROW`); and **strengthen** the dialect-clean
    guard with a no-inline-`RANGE … INTERVAL` assertion (mirroring spec 10 adding
    `generate_series` / `GENERATE_DATE_ARRAY` to the banned list, so the
    "range frame lives only in the dialect notes" criterion is *enforced*, not
    incidentally true).
  - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers the
    full rubric for every dialect" loop with the rolling-window assertion, e.g.
    `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect must answer it.
    Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
    all seven without a hand-maintained list.
 - Rebuild and re-link the dev binary so the playground picks up both surfaces:
  `pnpm run build && pnpm run link:dev`.
 ## Benchmark context (motivation only)
 Running-balance / rolling / period-over-period questions are the single largest
 result-mismatch cluster in the SQLite subset (financial-transactions-style DBs):
 cumulative balances with the wrong frame on ties, rolling windows that mis-span
 gappy dates, partial early windows, and unguarded period-over-period ratios. The
 methodology is universal analyst craft, so it belongs in the product's skill
 (where it helps every real user) plus the per-dialect rolling-window syntax that
 makes it executable — not in a benchmark-specific prompt. Depends on spec 10 (the
 date spine) for the gappy-rolling fallback. Improving the benchmark score is a
 side effect; the skill and the dialect notes contain no trace of the benchmark.
 ## Implementation notes
 Shipped as additive content across the two surfaces the spec specified — no new
 tool, flag, or config.
 **Skill (`packages/cli/src/skills/analytics/SKILL.md`).** Added the three recipes
 to the existing `<sql_craft>` "Window functions" group, after its two bullets and
 the spec-07 window-then-filter example: **Cumulative / running total** (explicit
 `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` + a tie-breaker, referencing
 the deterministic-ordering rule), **Rolling window over calendar time, plus
 minimum periods** (calendar range not row count; spine-or-native-range choice
 delegated to `sql_dialect_notes`; the `COUNT(*) OVER (<same frame>) = N`
 min-periods guard), and **Period-over-period** (`LAG` + full-precision growth +
 `NULLIF` divide guard, referencing the round-at-the-end rule). Added one worked
 `sql` example — the cumulative running total, wrong-vs-right, using
 `account_txns(account_id, txn_id, txn_date, net)` — bringing the skill to four
 worked examples. Extended the step-5 `sql_dialect_notes` provision list to name
 the rolling-window convention. No inline `RANGE … INTERVAL` frame anywhere in the
 skill; it stays dialect-clean.
 **Dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`).** Added a
 **Rolling window over time** line to all seven files, parallel to the spec-10
 **Series** line and cross-referencing it for the spine fallback.
 **Deviation — `RANGE`-frame support verified against authoritative docs (the
 spec's hard requirement), which corrected two of its starting points:**
 - **postgres** — native interval frame: `RANGE BETWEEN INTERVAL '29 days'
  PRECEDING AND CURRENT ROW` (as the spec guessed).
 - **mysql** — native interval frame over a temporal key: `RANGE BETWEEN INTERVAL
  29 DAY PRECEDING AND CURRENT ROW` (as guessed).
 - **bigquery** — `RANGE` is numeric-only: range over `UNIX_DATE(day)` with
  `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or spine + `ROWS` (as guessed).
 - **snowflake** — **corrected:** the spec said "limited; default to a spine," but
  Snowflake *does* support a native interval `RANGE` frame over a date/timestamp
  key and it is gap-tolerant, so the note gives the native frame
  (`RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`), no spine needed.
 - **clickhouse** — **corrected:** the spec said "limited; default to a spine," but
  ClickHouse supports a numeric `RANGE` offset over a `Date` column (counts in
  days, gap-tolerant); the `INTERVAL` form is unsupported (use seconds for
  `DateTime`). The note gives the numeric `RANGE` frame, with spine + `ROWS` as
  the fallback.
 - **sqlite** — no date-interval range frame (no native date type): spine + `ROWS`
  (as guessed).
 - **tsql** — `RANGE` supports only `UNBOUNDED`/`CURRENT ROW` (no offset frame):
  spine + `ROWS`, or a date-keyed self-join (as guessed).
 **Tests.** `test/skills/analytics-skill-content.test.ts` — added a representative
 phrase per recipe (plus `minimum periods`), bumped the `sql`-fence count 3 → 4,
 asserted the cumulative example shape (`ROWS BETWEEN UNBOUNDED PRECEDING AND
 CURRENT ROW` and the `ORDER BY txn_date, txn_id` tie-breaker), and strengthened
 the dialect-clean guard with a no-inline-`RANGE … INTERVAL` regex.
 `test/context/mcp/dialect-notes.test.ts` — extended the per-dialect rubric loop
 with `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect (derived from
 `DIALECTS_WITH_NOTES`) must answer the rolling-window rubric.
 **Verification.** Full `@kaelio/ktx` vitest suite green (3001 passed, 1 skipped);
 `pnpm run build` mirrors both surfaces into `dist`; `pnpm run link:dev` refreshed
 `ktx-dev`. Pre-existing, unrelated note: `tsc -p tsconfig.test.json` reports one
 error in `test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts` cast) that is
 present in committed branch code and untouched by this work.
--- a/spider2-specs/specs/12-parse-text-encoded-numbers.md
+++ b/spider2-specs/specs/12-parse-text-encoded-numbers.md
@ -1,405 +0,0 @@
 # Parse text-encoded numeric columns before doing math on them
 > Refined spec. Intake draft: `todo/12-parse-text-encoded-numbers.md`.
 ## Problem
 Numeric measures are often stored as **text** with human formatting: unit
 suffixes (`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators
 (`"$1,200"`), percent signs (`"12%"`), or non-numeric sentinels for missing/zero
 (`"-"`, `"N/A"`, `""`). Aggregating or comparing such a column directly is
 **silently wrong**: a string comparison orders `"100" < "9"`, and a naive
 `CAST(x AS REAL)` yields `0`, `NULL`, or a partially parsed prefix on the formatted
 values rather than the intended number. The query runs, the shape looks right, the
 number is garbage.
 The agent already samples schemas before composing — spec 07 gave the
 `<sql_craft>` "Schema discovery before writing SQL" group its *"Sample before you
 compose"* and *"Cast to the real type before comparing"* rules. But those rules
 guard **encoding** (date format, nullability) and **type-mismatch in `WHERE`**;
 they say nothing about a column whose declared/affinity type is text yet whose
 *meaning* is numeric. When the agent sees a "numeric-looking" column it tends to
 assume a real number type and skips the parse, so the arithmetic runs on the raw
 strings. This spec adds the detect → parse/scale → verify habit to that same
 group, building on the two rules already there rather than restating them.
 ## Generic use case (independent of any benchmark)
 - A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become
  `1200 / 3000000 / 0` before you can sum it or compute a daily change.
 - A `price` stored as `"$1,299.00"` must become `1299.00` before averaging.
 - A `conversion_rate` stored as `"12%"` must become `0.12` before weighting it.
 This is routine data hygiene on real, messy production tables — every analyst
 hits text-encoded measures on some warehouse, with no benchmark in sight. The
 methodology is universal craft, so it belongs in the shipped skill; it transfers
 to every ktx user querying a live database.
 ## Model
 The change is **additive content across two surfaces** — the same split specs 10
 and 11 made, and for the same reason. The split is the central design decision;
 it satisfies spec 07's hard dialect-agnostic invariant for `<sql_craft>` without
 weakening it.
 ### Why two surfaces (the dialect-agnostic reconciliation)
 The **detect → parse → scale** half is **pure portable SQL** and stays entirely
 in the dialect-agnostic skill:
 - Stripping `$` / `,` / `%` is a portable chained `REPLACE` over a small, known
  set of literal characters — no regex needed.
 - Suffix scaling (K=10³, M=10⁶, B=10⁹) is a portable `LIKE`/`CASE` expression.
 - Sentinel mapping (`-` / `N/A` / empty → `0` or `NULL`) is a portable `CASE`.
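 - The final cast to a numeric type is `CAST(... AS DECIMAL(p, s))`, broadly portable;
  state the precision and scale, because a bare `DECIMAL` defaults to scale 0 on
  mysql and tsql and silently rounds away the fraction.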
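The portable detect, parse, and scale steps listed above can be sketched in one early CTE pair. The example runs on SQLite through Python's `sqlite3`; the `trades` table and sample values are hypothetical, the sentinel-to-`0` mapping is one of the two choices the list allows, and `REAL` is used for the cast only to keep the demo output simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (id INTEGER PRIMARY KEY, trade_volume TEXT);
    INSERT INTO trades (trade_volume) VALUES
        ('1.2K'), ('3M'), ('-'), ('$1,200'), ('N/A');
""")

rows = conn.execute("""
    WITH stripped AS (
        -- chained REPLACE over a small, known set of literal characters
        SELECT id,
               trade_volume AS raw,
               UPPER(TRIM(REPLACE(REPLACE(trade_volume, '$', ''), ',', ''))) AS s
        FROM trades
    ),
    parsed AS (
        SELECT id, raw,
               CASE
                   WHEN s IN ('', '-', 'N/A') THEN 0.0              -- sentinels
                   WHEN s LIKE '%K' THEN CAST(SUBSTR(s, 1, LENGTH(s) - 1) AS REAL) * 1e3
                   WHEN s LIKE '%M' THEN CAST(SUBSTR(s, 1, LENGTH(s) - 1) AS REAL) * 1e6
                   WHEN s LIKE '%B' THEN CAST(SUBSTR(s, 1, LENGTH(s) - 1) AS REAL) * 1e9
                   ELSE CAST(s AS REAL)
               END AS volume
        FROM stripped
    )
    SELECT raw, volume FROM parsed ORDER BY id
""").fetchall()

for raw, volume in rows:
    print(repr(raw), volume)
```

Keeping the raw text alongside the parsed value makes the later coverage check a matter of inspecting the rows where the two disagree.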
 The **verify** half has one piece that is genuinely dialect-divergent: a
 **failure-detecting numeric cast** — a cast that signals (rather than silently
 swallows) a value that did not parse. This is exactly what requirement 3
 ("confirm coverage") needs, and it cannot be written portably:
 - **bigquery:** `SAFE_CAST(x AS FLOAT64)` → `NULL` on failure.
 - **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_CAST` → `NULL` on failure.
 - **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(...))` / `TRY_CONVERT` → `NULL`.
 - **clickhouse:** `toFloat64OrNull(x)` / `toDecimal64OrNull(x, S)` → `NULL`.
 - **postgres / mysql:** no `TRY_CAST`, so guard with a numeric pattern test before
  casting (postgres, e.g. `CASE WHEN x ~ '^-?[0-9]+([.][0-9]+)?$' THEN x::numeric END`;
  mysql applies the same pattern with `REGEXP` and a plain `CAST`).
 - **sqlite (the gotcha):** a plain `CAST('abc' AS REAL)` returns **`0.0`** and
  `CAST('12abc' AS REAL)` returns **`12.0`** — it neither errors nor NULLs, so an
  `IS NULL` coverage check is **silently broken**. Detecting a failed parse needs
  a `GLOB`/`typeof` pattern guard.
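 The sqlite behavior described above can be checked directly. The sketch below is illustrative only: `plain_cast` and `guarded_cast` are hypothetical helper names, and the `GLOB` guard shown accepts only unsigned digits with at most one decimal point, which is one possible guard rather than the idiom the dialect notes must use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def plain_cast(value):
    """Plain CAST: for text input it neither errors nor returns NULL."""
    return conn.execute("SELECT CAST(? AS REAL)", (value,)).fetchone()[0]


def guarded_cast(value):
    """Hypothetical GLOB guard: NULL unless the text is digits with at most one dot."""
    sql = """
        SELECT CASE
                 WHEN :v GLOB '*[0-9]*'          -- at least one digit
                  AND :v NOT GLOB '*[^0-9.]*'    -- nothing but digits and dots
                  AND :v NOT GLOB '*.*.*'        -- at most one dot
                 THEN CAST(:v AS REAL)
               END
    """
    return conn.execute(sql, {"v": value}).fetchone()[0]


for text in ("abc", "12abc", "12.5"):
    print(text, plain_cast(text), guarded_cast(text))
```

 With the plain cast, an `IS NULL` check finds nothing to report for `'abc'` or `'12abc'`; with the guard, both surface as `NULL` and can be counted.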
 So a portable skill cannot inline a safe cast any more than spec 10 could inline a
 date-series generator or spec 11 a calendar range frame. ktx already routes that
 kind of engine-specific syntax through the per-dialect notes in
 `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by the
 `sql_dialect_notes` MCP tool (spec 08). Specs 10 and 11 set the exact precedent:
 a construct not yet in the dialect rubric, genuinely engine-specific, was added
 there (the **Series** line; the **Rolling window** line) and the dialect-agnostic
 skill points to it. The failure-detecting cast is the next construct in that same
 position, so the **safe-cast idiom belongs in the dialect notes**, and the skill
 points to it.
 Surface 1 (skill) carries the **pattern** (detect the text encoding; parse/scale
 in an early CTE; verify with a failure-detecting cast). Surface 2 (dialect notes)
 carries the **concrete safe-cast syntax** per engine, including the sqlite
 `CAST`-returns-0 gotcha.
 The regex character-*strip* is deliberately **not** promoted to the dialect
 notes: a portable chained `REPLACE` over a known character set is the opinionated
 default, so there is no need for a per-dialect strip line (derive from need; one
 default). The dialect surface gains exactly one thing — the safe cast — because
 that is the only piece the portable path genuinely cannot express.
 ### Additive, inline, heuristic-with-a-why
 Consistent with specs 07, 10, and 11: the skill change is **additive content in
 one Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled
 `reference/` file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic,
 and phrased as **heuristics with a one-line generic rationale**, not a wall of
 MUSTs. The dialect-notes change is additive content in the seven existing
 `dialects/*.md` files. No new tool, flag, or config on either surface.
 ### Build on the rules already present; do not restate them
 - The Schema-discovery group already carries **"Sample before you compose"** and
  **"Cast to the real type before comparing"** (spec 07). The detect rule
  **extends** the first (distinct-value sampling to learn the encoding) and the
  parse rule **complements** the second (text-meaning-numeric, not just
  text-vs-numeric literal mismatch) — reference them, do not repeat them.
 - The sentinel **0-vs-NULL** choice is the **same additive-vs-non-additive
  judgment** spec 10 established in its *"Default by additivity"* rule (0 only
  when "no value" genuinely reads as 0; NULL otherwise). **Reference** that rule
  rather than restating the discriminator (state each rule once).
 ## Requirements
 ### 1. Skill surface — `<sql_craft>` "Schema discovery before writing SQL"
 Add the text-encoded-numeric guidance to the **existing** group, after its two
 current bullets. Phrase as heuristics, each with a generic *why*, dialect-agnostic.
 It must cover:
 1. **Detect text-encoded numerics during sampling.** When a column the question
   treats as a number is stored as text, sample its **distinct** values to learn
   the encodings actually present — unit suffixes (`K`/`M`/`B`), currency
   symbols, thousands separators, percent signs, and non-numeric sentinels
   (`-`, `N/A`, empty) — **before** composing. Never infer the format from the
   column name. *Why:* compared/aggregated as-is, the text sorts lexically
   (`'100' < '9'`) and a naive cast collapses formatted values to `0`/NULL —
   producing a silently wrong result instead of an error.
 2. **Parse and scale in an early CTE.** Strip currency/separator/percent
   characters, multiply by the suffix scale (K=10³, M=10⁶, B=10⁹), map sentinels
   to `0` **or** `NULL` per the question's intent, then cast to a numeric type —
   all in **one early CTE**, so every downstream layer sees clean numbers. The
   `0`-vs-`NULL` choice for sentinels follows spec 10's **additive-vs-non-additive**
   rule (reference it; do not restate). *Why:* a string column aggregated as-is
   sorts lexically and casts to 0, so the math is silently wrong.
 3. **Confirm coverage (verify).** After parsing, sanity-check that **no
   intended-numeric value silently failed to parse** — a failed parse should
   surface as `NULL`, which is only visible with a **failure-detecting cast**.
   Note the divergence: a plain `CAST` errors on some engines and, on sqlite,
   returns `0`/partial rather than NULL — so use the engine's safe-cast idiom from
   `sql_dialect_notes` (requirement 3), then count residual NULLs among
   non-sentinel rows. *Why:* an encoding the sample missed would otherwise vanish
   as `0`/NULL instead of being caught.
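 The detect and confirm-coverage heuristics can be sketched end to end on a synthetic table. Everything here is an assumption made for illustration: the `metrics(label, value_text)` rows are made up, the parse deliberately forgets the `K` suffix to simulate an encoding the sample missed, and the `GLOB` test is a SQLite-specific guard (other engines would use their own safe cast).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (label TEXT, value_text TEXT)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("a", "100"), ("b", "9"), ("c", "1.2K"),
     ("d", "$1,200"), ("e", "-"), ("f", "1.2K")],
)

# Detect: sample the DISTINCT values (with counts) to see the encodings present.
encodings = conn.execute(
    "SELECT value_text, COUNT(*) FROM metrics GROUP BY value_text"
).fetchall()

# The lexical trap: compared as text, '9' outranks '100' and every formatted value.
lexical_max = conn.execute("SELECT MAX(value_text) FROM metrics").fetchone()[0]

# Confirm coverage: this parse strips only '$' and ',' (the K suffix was missed).
# Count non-sentinel rows whose cleaned text is still not a plain number.
unparsed = conn.execute(
    """
    WITH cleaned AS (
        SELECT value_text,
               REPLACE(REPLACE(value_text, '$', ''), ',', '') AS c
        FROM metrics
    )
    SELECT COUNT(*)
    FROM cleaned
    WHERE value_text NOT IN ('-', 'N/A', '')
      AND (c NOT GLOB '*[0-9]*' OR c GLOB '*[^0-9.]*' OR c GLOB '*.*.*')
    """
).fetchone()[0]

print(sorted(encodings), lexical_max, unparsed)
```

 A non-zero residual count is the signal to go back to the detect step and extend the parse, rather than letting the missed rows fold into the aggregate.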
 ### 2. One worked example — parse/scale, fully portable
 Add **exactly one** new compact before/after `sql` example demonstrating the
 parse-and-scale pattern on a synthetic generic schema
 (e.g. `metrics(label, value_text)` with values like `'1.2K'`, `'$1,200'`, `'-'`):
 - **Wrong:** `SUM(CAST(value_text AS REAL))` (or summing the raw strings) — the
  formatted values collapse to `0`/partial, so the total is silently wrong.
 - **Right:** an early CTE that strips symbols with chained `REPLACE`, applies a
  `CASE` for the K/M/B suffix scale, maps `'-'`/`'N/A'`/`''` to `0`, casts to
  `DECIMAL`, then `SUM`s the parsed column.
 **Standard, portable SQL only** — no `REGEXP_REPLACE`, `SAFE_CAST`, `TRY_CAST`,
 `TRY_TO_NUMBER`, `toFloat64OrNull`, `GLOB`, or any dialect function — so the
 example stays dialect-clean. Keep it ~12–16 lines. The **verify** step gets **no**
 inline example (its correct form needs the engine-specific safe cast, delegated to
 `sql_dialect_notes`, exactly as spec 10's period-spine and spec 11's
 rolling-window variants were prose-only).
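 For orientation, the wrong-vs-right contrast might look like the sketch below. It is one possible rendering under stated assumptions, not the example the implementer must ship: the rows are made up, the `parsed` CTE name and column names are hypothetical, and the SQL is run on SQLite through Python only so that the two totals can be observed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (label TEXT, value_text TEXT)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("a", "1.2K"), ("b", "3M"), ("c", "$1,200"), ("d", "-")],
)

# WRONG: cast the raw text. On SQLite the formatted values collapse to a
# numeric prefix or to 0, so the total is silently wrong.
wrong_total = conn.execute(
    "SELECT SUM(CAST(value_text AS REAL)) FROM metrics"
).fetchone()[0]

# RIGHT: strip, scale, map sentinels, and cast in one early CTE, then aggregate.
right_total = conn.execute(
    """
    WITH parsed AS (
        SELECT label,
               CASE
                 WHEN value_text IN ('-', 'N/A', '') THEN 0
                 ELSE CAST(
                        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
                          value_text, '$', ''), ',', ''), 'K', ''), 'M', ''), 'B', '')
                        AS DECIMAL(18, 4))
                      * CASE
                          WHEN value_text LIKE '%K' THEN 1000
                          WHEN value_text LIKE '%M' THEN 1000000
                          WHEN value_text LIKE '%B' THEN 1000000000
                          ELSE 1
                        END
               END AS value_num
        FROM metrics
    )
    SELECT SUM(value_num) FROM parsed
    """
).fetchone()[0]

print(wrong_total, right_total)
```

 The sentinel is mapped to `0` here because the sketch sums an additive measure; a non-additive measure would map it to `NULL` instead, per the additivity rule referenced above.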
 This adds **one** worked `sql` example to the skill. Spec 11 independently adds
 one as well; **do not hardcode the resulting total** — increment from the current
 state. As of this writing the skill carries **three** examples (spec 07
 window-then-filter, spec 09 multi-hop fan-out, spec 10 panel spine), so this is
 the **fourth**; if spec 11 ships first it is the **fifth**. The fence-count test
 assertion is incremented by one from its current value (see Acceptance criteria).
 ### 3. Dialect-notes surface — `dialects/*.md` (safe cast)
 Add a **"Safe cast"** idiom line to **each** of the seven authored dialect files,
 parallel to spec 10's **Series** line and spec 11's **Rolling window** line. Each
 line gives that engine's **failure-detecting numeric cast** — a cast that returns
 `NULL` (or is detectably invalid) on a non-numeric input — which is what makes the
 verify step correct on that engine. Each note is engine-exclusive (a SQLite
 analyst gets the SQLite idiom and never another engine's construct, per the
 existing dialect-notes leak guards). Orientation only — exact syntax is the
 implementer's; verify against authoritative docs (context7 / the engine manual)
 rather than asserting from memory:
 - **postgres:** no `TRY_CAST`; guard with a numeric pattern before casting,
  e.g. `CASE WHEN x ~ '^-?[0-9]+([.][0-9]+)?$' THEN x::numeric END` (a looser
  class such as `[0-9.]+` would admit `1.2.3`, which then errors on the cast).
  `regexp_replace` is available for the strip, but chained `REPLACE` is the
  portable default.
 - **mysql (8.0+):** no `TRY_CAST`; guard with
  `x REGEXP '^-?[0-9]+([.][0-9]+)?$'` before `CAST(... AS DECIMAL)`;
  `REGEXP_REPLACE` is available for the strip.
 - **bigquery:** `SAFE_CAST(x AS FLOAT64)` (or `SAFE_CAST(... AS NUMERIC)`) →
  `NULL` on failure.
 - **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_TO_DECIMAL(x, p, s)` / `TRY_CAST` →
  `NULL` on failure.
 - **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`.
 - **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(18,4))` / `TRY_CONVERT` → `NULL`.
 - **sqlite (the gotcha):** a plain `CAST` returns `0`/partial, **not** NULL or an
  error, so a coverage check must use a pattern guard such as
  `CASE WHEN cleaned GLOB '...' THEN CAST(cleaned AS REAL) END` (or a `typeof`
  check) to detect a value that did not parse.
 This line is what makes the verify step executable from the dialect-agnostic
 skill. It is **distinct** from the Series and Rolling-window lines (those generate
 or window over a calendar; this detects a failed numeric parse). Phrase any
 version note as `8.0+`-style, **not** "as of version …" (the dialect-notes test
 bans version-dated wording).
 ### 4. Explicit constraints / exclusions
 None of the following may appear (consistent with specs 07, 10, and 11):
 - **No inline dialect-specific cast/regex syntax in the skill** — no `SAFE_CAST`,
  `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`,
  `replaceRegexpAll`, or `GLOB` anywhere in `SKILL.md`. The portable strip is
  chained `REPLACE`; the failure-detecting cast lives only in the dialect notes.
 - **No regex-strip dialect line.** The character strip stays the portable
  chained-`REPLACE` default; the dialect notes gain only the **safe cast**.
 - **No grader / gold-answer / benchmark reference**, and no output-shape contract
  (the skill is for interactive analysis).
 ### 5. Coordination with specs 07, 08, 10, and 11
 - **Spec 07** owns the Schema-discovery group and its two existing bullets
  (*"Sample before you compose"*, *"Cast to the real type before comparing"*).
  Spec 12 **extends** that group and **builds on** both bullets — references them,
  never restates them; they must stay intact and uncontradicted.
 - **Spec 08** owns the dialect-notes channel and its leak guards. Spec 12 adds one
  rubric line through that channel; the engine-exclusivity guards apply unchanged.
 - **Spec 10** owns the additive-vs-non-additive discriminator (Answer
  completeness) and the dialect **Series** line. Spec 12 **references** the
  additivity rule for the sentinel `0`-vs-`NULL` choice; do not duplicate it.
 - **Spec 11** independently adds the dialect **Rolling window** line, one `sql`
  example, and the **rolling-window** entry to the step-5 provision list. Spec 12
  touches the **same** three places (the dialect-notes rubric loop, the example
  count, and the step-5 list). Both are independent and additive — **add to the
  current state, do not assume an order**: name **safe-cast** in the step-5 list
  without removing rolling-window/series; increment the example count by one from
  whatever it is; add `/\*\*Safe cast/` to the rubric loop alongside any
  `/\*\*Rolling/` assertion.
 ### 6. Step pointer (no duplication)
 The step-5 `sql_dialect_notes` provision list (currently "FQTN,
 identifier-quoting, date, top-N, series/calendar, and JSON conventions"; spec 11
 also names rolling-window) should additionally name the **safe-cast** convention
 now that it exists. State each rule once inside `<sql_craft>`; the workflow steps
 only point to it.
 ## Leak-safety (hard constraint)
 Every worked example or note uses a **synthetic generic schema** (e.g.
 `metrics(label, value_text)`) and made-up values (`'1.2K'`, `'$1,200'`, `'-'`),
 showing only the *pattern*. **No** benchmark table names, SQL, or result values on
 either surface. The dialect-notes additions, like the existing notes, carry no
 benchmark / grader / version-dated content. The behavior is reconstructable from
 first principles and tied to no specific instance.
 ## Acceptance criteria
 - The `<sql_craft>` "Schema discovery before writing SQL" group states the three
  heuristics — inline, dialect-agnostic, each with a generic *why*, and each
  **building on** (not restating) the existing *"Sample before you compose"* and
  *"Cast to the real type before comparing"* bullets and spec 10's additivity rule:
  - **detect** text-encoded numerics by sampling distinct values (suffixes,
    symbols, separators, sentinels) — never from the column name;
  - **parse and scale** in an early CTE (strip → suffix-scale → sentinel map →
    cast), sentinel `0`-vs-`NULL` per spec 10's additivity rule;
  - **confirm coverage** with a failure-detecting cast, delegating the engine's
    safe-cast syntax to `sql_dialect_notes`.
 - Exactly **one** new worked `sql` example: parse-and-scale, wrong-vs-right, using
  chained `REPLACE` + `CASE` suffix scale + sentinel `CASE` +
  `CAST(... AS DECIMAL)`, in standard portable SQL. The `sql`-fence count
  assertion is incremented by **one** from its current value (3 → 4 today; 4 → 5
  if spec 11 ships first).
 - Each of the seven `dialects/*.md` files gains a **"Safe cast"** idiom line in its
  engine's own failure-detecting numeric-cast idiom (including the sqlite
  `CAST`-returns-0 gotcha); no engine leaks another engine's construct, and the
  additions contain no benchmark / grader / version-dated content.
 - The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`,
  `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, inline
  `RANGE … INTERVAL` frame, **and no `SAFE_CAST` / `TRY_CAST` / `TRY_TO_NUMBER` /
  `REGEXP_REPLACE` / `toFloat64OrNull` / `GLOB`**, anywhere in `SKILL.md`
  including the new example.
 - The step-5 `sql_dialect_notes` provision list names the **safe-cast** convention
  alongside FQTN / identifier-quoting / date / top-N / series-calendar /
  rolling-window / JSON.
 - The existing interactive guidance (`<workflow>`, `<rules>`, the other examples),
  the two existing Schema-discovery bullets, and the existing dialect-note rubric
  lines (including **Series** and, if present, **Rolling window**) are intact and
  uncontradicted.
 - No grader / benchmark reference, and no output-shape contract.
 - The skill stays scannable and comfortably under the 500-line budget; frontmatter
  still parses as `ktx-analytics`.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the prose.
 - **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three
  heuristics to the "Schema discovery before writing SQL" group (after its two
  existing bullets), the single parse-and-scale worked example, and extend the
  step-5 dialect-notes provision list to name the safe-cast convention. Leave
  `<workflow>` / `<rules>` / the other examples and the two existing
  schema-discovery bullets intact. Delivery is unchanged (single `SKILL.md` per
  target via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no
  change required.
 - **Dialect notes:** the seven files under
  `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
  `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
  `copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each
  engine's actual safe-cast / try-cast support against authoritative docs before
  writing the idiom; do not assert from memory** (in particular the sqlite
  `CAST`-returns-0 behavior, which is the motivating gotcha).
 - **Tests:**
  - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
    representative phrase for each of the three heuristics (e.g. a *detect*, a
    *parse/scale*, and a *confirm-coverage* phrase) to the `represents every craft
    behavior` list; bump the `sql`-fence count assertion **by one** from its
    current value; assert the example shape (e.g. `REPLACE(` and `CAST(` and a
    suffix-scale multiplier); and **strengthen** the dialect-clean guard by adding
    `SAFE_CAST`, `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`,
    and `GLOB` to the banned list (mirroring spec 10 adding `generate_series` /
    `GENERATE_DATE_ARRAY` and spec 11 adding the no-inline-`RANGE … INTERVAL`
    guard, so the "safe cast lives only in the dialect notes" criterion is
    *enforced*, not incidentally true).
  - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers
    the full rubric for every dialect" loop with the safe-cast assertion,
    `expect(notes).toMatch(/\*\*Safe cast/)`, so every dialect must answer it.
    Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
    all seven without a hand-maintained list. Do **not** add a false-exclusivity
    assertion for `TRY_CAST` (it is shared by snowflake and tsql); requiring the
    line per dialect is sufficient.
 - Rebuild and re-link the dev binary so the playground picks up both surfaces:
  `pnpm run build && pnpm run link:dev`.
 ## Benchmark context (motivation only)
 At least one SQLite-subset question stores trading volume as suffix-encoded text
 (`"K"`/`"M"`, `"-"` for zero) and fails because the agent aggregates the raw
 strings — runnable, plausible, wrong. The sqlite `CAST`-returns-0 behavior makes
 the failure especially insidious: there is no error to alert the agent, and a
 naive `IS NULL` coverage check would not catch it either, which is precisely why
 the safe-cast idiom belongs in the dialect notes. The fix — parse messy encodings
 before math, then verify coverage with a failure-detecting cast — is universal
 data hygiene that helps any analyst on any warehouse, so it belongs in the
 product's craft (skill) plus the per-dialect safe-cast syntax that makes the
 verify step executable, not in a benchmark-specific prompt. Improving the
 benchmark score is a side effect; the skill and the dialect notes contain no trace
 of the benchmark.
 ## Implementation notes
 Shipped on branch `write-feature-spec-wiki`, on top of specs 10 and 11 (both already
 applied in the working tree). Built from the current state per the "do not assume an
 order" guidance — there were **four** worked examples (specs 07 window-then-filter,
 09 multi-hop fan-out, 10 panel spine, 11 cumulative running total), so this is the
 **fifth**, and step 5 already named `series/calendar, rolling-window`.
 **Skill — `packages/cli/src/skills/analytics/SKILL.md`:**
 - Added the three heuristics to the **"Schema discovery before writing SQL"** group,
  after the two existing bullets: *Parse text-encoded numerics before doing math on
  them* (detect by sampling distinct values, extending *Sample before you compose*,
  never inferring from the column name), *Strip, scale, and cast in one early CTE*
  (the *meaning-is-numeric* complement to *Cast to the real type before comparing*,
  with the sentinel `0`-vs-`NULL` choice deferred to spec 10's *Default by
  additivity* rule), and *Confirm the parse covered every value* (failure-detecting
  cast from `sql_dialect_notes`). Each carries a one-line generic *why*; the existing
  bullets and the additivity rule are referenced, not restated.
 - Added **one** portable worked example (`metrics(label, value_text)` with `'1.2K'`,
  `'3M'`, `'$1,200'`, `'-'`): wrong = `SUM(CAST(value_text AS REAL))`; right = an
  early `parsed` CTE that strips with chained `REPLACE`, scales the K/M/B suffix with
  a `CASE`, maps sentinels to `0`, casts to `DECIMAL(18,4)`, then `SUM`s. Standard
  portable SQL only — no dialect functions, no inline safe cast.
 - Step 5 dialect-notes provision list now names **safe-cast** alongside the others.
 **Dialect notes — `packages/cli/src/context/sql-analysis/dialects/*.md`:** added a
 **Safe cast** line to all seven files (after the *Rolling window* line), each giving
 that engine's failure-detecting numeric cast: postgres/mysql use a numeric pattern
 guard before casting (no `TRY_CAST`; mysql's bare `CAST` returns `0` with a warning);
 bigquery `SAFE_CAST`; snowflake `TRY_TO_NUMBER`/`TRY_TO_DECIMAL`/`TRY_CAST`; tsql
 `TRY_CAST`/`TRY_CONVERT`; clickhouse `toFloat64OrNull`/`toDecimal64OrNull` (the
 `...OrZero` variants return `0`); sqlite documents the `CAST`-returns-`0.0`/partial
 gotcha and a `GLOB` pattern guard. ClickHouse function names were verified against
 the official docs via context7 (the spec's loose `toDecimalOrNull` is not a real
 name — the `to<Type>OrNull` family requires a bit width, hence `toDecimal64OrNull`).
 No version-dated wording.
 **Tests:** `analytics-skill-content.test.ts` — added the three representative
 phrases, bumped the `sql`-fence count 4 → 5 (and the test title), asserted the
 example shape (`WITH parsed AS`, `REPLACE(`, `AS DECIMAL(`, `LIKE '%K' THEN 1000`),
 and strengthened the dialect-clean banned list with `SAFE_CAST`, `TRY_CAST`,
 `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, and `GLOB` (mirroring spec 10's
 `generate_series` / spec 11's inline-`RANGE … INTERVAL` guards). `dialect-notes.test.ts`
 — added `expect(notes).toMatch(/\*\*Safe cast/)` to the per-dialect rubric loop, so
 all seven (derived from `DIALECTS_WITH_NOTES`) must answer it; no false-exclusivity
 assertion for the shared `TRY_CAST`.
 **Verification:** both affected test files pass (19 tests); broader `test/skills` +
 `test/context/mcp` pass (65 tests); production type-check (`tsc -p tsconfig.json`)
 is clean; `pnpm run build` copies both surfaces into `dist` (7 dialect files carry
 *Safe cast*, the built `SKILL.md` carries the parse example) and `pnpm run link:dev`
 relinks `ktx-dev`. One **pre-existing, unrelated** type error remains in the
 test-only config (`test/mcp-server-factory.test.ts:152`, byte-identical to HEAD,
 untouched here) — out of scope for this spec.
--- a/spider2-specs/specs/14-output-completeness-final-check.md
+++ b/spider2-specs/specs/14-output-completeness-final-check.md
@ -1,336 +0,0 @@
 # Output completeness — answer every requested part, enforced by a final pre-emit check
 > Refined spec. Intake draft: `todo/14-output-completeness-final-check.md`.
 ## Problem
 The single largest correctness failure mode for the analytics skill is
 **incomplete output**: the query runs and the methodology is roughly right, but
 the projection is missing columns the question asked for. The SQL is runnable and
 the aggregate is correct — the answer is simply *short by columns*. Three
 recurring shapes:
 1. **Multi-part questions answered partially.** A question that asks for several
   things ("report the highest *and* the lowest month, each with its count and
   average, *and* the difference") comes back with only the first clause — one
   column where several were requested.
 2. **Identity dropped.** Grouping by a human-readable name but not projecting the
   entity's identifier (a product name without its product id, a customer name
   without its customer id).
 3. **Inputs to a derived value dropped.** Returning a ratio / percentage /
   difference but not the underlying counts the question also asked for.
 Shapes 2 and 3 are **already covered** by shipped `<sql_craft>` rules — spec 07's
 *"Expose identity, not just the label"* and *"Keep the inputs to a derived
 value"* — yet they are frequently **not applied**. So the gap is not missing
 knowledge: these rules sit as passive heuristics in a list, and nothing makes the
 agent reliably check them before finalizing. The fix is twofold: (a) add the
 missing **multi-part-completeness** rule that generalizes shapes 1–3, and (b)
 turn output-completeness into an **explicit final verification step** the agent
 performs before emitting SQL, so the existing identity/inputs rules are actually
 enforced rather than merely listed.
 The failure is **model-independent**: a markedly stronger model produced the same
 incomplete-output mistakes on these questions, which means it is a
 craft/enforcement gap, not a capability gap — exactly the kind of universal
 analyst craft that belongs in the shipped skill.
 ## Generic use case (independent of any benchmark)
 An analyst is asked: *"For each region, report the highest and the lowest monthly
 order count, and the difference between them."* A complete answer has five
 columns: the region's id, the region's name, the highest count, the lowest
 count, and the difference. Returning just the region and a single number answers
 only part of the request. This is a universal expectation on any database:
 answer **every** part of a multi-part request, identify the entities, and show
 the inputs behind any derived figure, and answer *exactly* that, without padding
 the result with columns the question never asked for.
 ## Model
 The change is **additive content in one Markdown file**
 (`skills/analytics/SKILL.md`), governed by the same invariants spec 07
 established. They constrain the implementer; the exact prose is theirs.
 ### Additive, inline, heuristic-with-a-why
 Consistent with specs 07 and 10: the change is additive content in
 `skills/analytics/SKILL.md`, **inline** (no bundled `reference/` file — the
 `setup-agents.ts` delivery ships only `SKILL.md` per target), dialect-agnostic,
 and phrased as **heuristics with a one-line generic rationale**, not a wall of
 MUSTs. The new rule extends the existing `<sql_craft>` "Answer completeness /
 interpretation" group; the shipped bullets in that group (including the *identity*
 and *inputs* rules this spec builds on) are preserved unchanged. No new tool,
 flag, or config.
 ### The over-projection guard carries a *universal* why, not a grader reference
 The intake draft frames "don't pad the result with extra columns" as
 *grader-gaming*. The skill forbids **any** reference to a grader, gold answer, or
 benchmark (spec 07's hard invariant; the content test bans the words). So the
 guard must ship with a **universal analytics rationale** instead: columns the
 question did not ask for add noise, mislead the reader into thinking they matter,
 and make the result harder to consume — match the request exactly, neither short
 nor padded. This is the same reconciliation spec 07 applied to the draft's
 "behavior only, no rationale" instruction: generic *why* is required; only
 grader/gold/benchmark rationale is banned.
 ### Completeness is a closed set — identity and inputs are *inside* it
 "Expose identity" and "keep the inputs" tell the agent to add columns; the
 over-projection guard tells it not to. The two contradict only if the target is
 left fuzzy, so this spec pins it down. A **complete projection** is exactly:
 > {every requested metric/attribute} ∪ {the identifier of each grouped/named
 > entity} ∪ {the inputs to each derived value}, at the grain the question
 > specifies.
 Identity and inputs are **members of that set** — part of completeness, never
 "padding." **Under-projection** is any member missing (the failure this spec
 attacks); **over-projection** is any column *outside* the set (what the guard
 forbids). The implementer must phrase the rule and guard against this single
 definition so they read as one coherent notion, not two competing instructions.
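 The closed-set definition above reduces to two set differences. The sketch below is only an illustration of that reading: `projection_gaps` and the column names are hypothetical, and the grain condition is not modelled.

```python
def projection_gaps(requested, identifiers, derived_inputs, projected):
    """Return (missing, extra) columns relative to the complete projection.

    The complete projection is the union of the requested outputs, the
    identifiers of grouped/named entities, and the inputs to derived values.
    """
    complete = set(requested) | set(identifiers) | set(derived_inputs)
    missing = sorted(complete - set(projected))  # under-projection
    extra = sorted(set(projected) - complete)    # over-projection
    return missing, extra


# Hypothetical question: highest, lowest, and their difference per region.
requested = ["region_name", "highest", "lowest", "difference"]
identifiers = ["region_id"]
derived_inputs = ["highest", "lowest"]  # inputs to the derived "difference"

short = projection_gaps(requested, identifiers, derived_inputs,
                        ["region_name", "highest"])
padded = projection_gaps(requested, identifiers, derived_inputs,
                         ["region_id", "region_name", "highest", "lowest",
                          "difference", "total_orders"])

print(short)
print(padded)
```

 Identity and inputs fall inside `complete`, so adding them never counts as padding; only a column outside the union, such as the made-up `total_orders`, does.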
 ### Dialect-agnostic, additive-only, exclusions intact
 Every addition reads correctly on any dialect — no dialect-specific syntax in the
 rule text or the worked example. The existing `<workflow>`, `<rules>`, and the
 other `<sql_craft>` bullets and examples (specs 07/09/10/11/12) are preserved and
 uncontradicted. Spec 07's exclusions still hold: no output-shape contract, no
 `MAX(date)` anchoring of relative time, no grader-driven advice, no dialect
 syntax.
 ## Requirements
 ### 1. Multi-part / multi-output completeness — a new umbrella rule
 Add a bullet to the `<sql_craft>` "Answer completeness / interpretation" group:
 when a question requests several outputs — a **list** ("A, B, and C"), **paired
 extremes** ("the highest *and* the lowest"), or a **value plus its components**
 ("X, Y, and their ratio") — the final projection must contain a column for
 **each** requested output. *Why:* answering only the first clause is the most
 common way a runnable query is still wrong; the grain and methodology can be
 perfect yet the answer is short by columns.
 This rule is the **umbrella** over the two shipped completeness rules: the
 *inputs* rule (*"Keep the inputs to a derived value"*) is its "value + components"
 instance, and the *identity* rule (*"Expose identity, not just the label"*) is its
 "entity identity" instance. The new bullet should **name that relationship**
 (so the three read as one notion) rather than restating either rule.
 Keep this distinct from the row-selection rules in the same group: *"Top /
 highest / most / lowest"* and *"For each X / per X / by X"* govern **which rows**
 appear; multi-part completeness governs **which columns** appear. They compose
 (e.g. "highest and lowest per region" needs one row per region *and* a column per
 clause).
 ### 2. Final completeness check — the enforcement mechanism
 The rule content lives **once** in `<sql_craft>`; the trigger is promoted to a
 first-class line in `<workflow>` step 6.
 - **Capstone bullet in `<sql_craft>`** (closing the "Answer completeness /
  interpretation" group): *before emitting the final SQL, re-read the question and
  confirm the projection covers* —
  1. every named **metric / attribute** the question asks for (→ the multi-part
     rule);
  2. the **identifier** of every grouped or named entity (→ the *identity* rule);
  3. every **input** to each derived value (→ the *inputs* rule);
  4. all at the **grain** the question specifies (→ the *for each X* / panel
     rules).
  Each facet cross-references the rule it enforces, so the check is what makes
  those passive rules active. Phrase it as a short, concrete "confirm the
  projection covers…" checklist, not a wall of MUSTs.
 - **Over-projection guard** (attached to the check): do **not** add columns the
  question did not ask for "to be safe" — extra columns add noise, mislead, and
  make the result harder to consume; match the request exactly. Carries the
  **universal** why from the Model, **never** a grader/gold/benchmark reference.
 - **`<workflow>` step 6 line** (the explicit ritual): step 6 ("Validate and
  explain") gains a mandatory line directing the agent to **always** run the final
  completeness check before emitting — re-read the question and verify every
  requested output, each entity's identity, each derived value's inputs, and the
  grain are all projected — pointing into the `<sql_craft>` capstone for the
  detail. This **replaces the current conditional pointer's role** ("If a result
  is unexpectedly empty or its grain looks wrong, work through the … rules"): the
  empty/grain diagnostic stays available (it maps to the existing *"Diagnose empty
  results"* and grain rules), but the completeness check fires **unconditionally**,
  on every SQL-authoring turn, not only when a result looks off. The workflow line
  names the ritual and the four facets; the rationale, guard, and example are
  stated once in `<sql_craft>`, not duplicated into the workflow.
 ### 3. One worked example (dialect-agnostic)
 Add **exactly one** compact before/after example to the "Answer completeness /
 interpretation" group, demonstrating multi-part completeness on a **synthetic**
 schema (`regions`, `region_monthly`):
 - **WRONG:** answers only the first clause — `SELECT region_name,
  MAX(monthly_orders) AS highest … GROUP BY region_name` — with no region id, no
  lowest, no difference.
 - **RIGHT:** one column per requested output plus the entity's identity, at the
  region grain — `region_id, region_name`, the highest, the lowest, and the
  difference, with `regions` joined to `region_monthly` and grouped by the region
  id and name.
 Standard dialect-clean SQL only (no `QUALIFY`, no dialect functions; `MAX`/`MIN`
 are portable aggregates). Keep it tight. It teaches multi-clause coverage +
 identity + derived-value inputs in one capstone, and is **distinct** from the
 spec-10 `regions` panel example: that one is about missing **rows** (LEFT-JOIN
 spine + `COALESCE`); this one is about missing **columns**. This is the **sixth**
 worked `sql` example in the skill (after specs 07/09/10/11/12).
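 A possible shape for that before/after pair is sketched below. The column names (`region_id`, `month`, `monthly_orders`) and the rows are assumptions made for illustration, and the queries are run on SQLite through Python only so the two result shapes can be compared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE regions (region_id INTEGER, region_name TEXT);
    CREATE TABLE region_monthly (region_id INTEGER, month TEXT, monthly_orders INTEGER);
    INSERT INTO regions VALUES (1, 'North'), (2, 'South');
    INSERT INTO region_monthly VALUES
        (1, '2024-01', 10), (1, '2024-02', 30),
        (2, '2024-01', 5),  (2, '2024-02', 8);
    """
)

# WRONG: answers only the first clause (no region id, no lowest, no difference).
wrong = conn.execute(
    """
    SELECT r.region_name, MAX(m.monthly_orders) AS highest
    FROM regions r
    JOIN region_monthly m ON m.region_id = r.region_id
    GROUP BY r.region_name
    """
)
wrong_columns = [d[0] for d in wrong.description]

# RIGHT: one column per requested output plus the entity's identity.
right = conn.execute(
    """
    SELECT r.region_id,
           r.region_name,
           MAX(m.monthly_orders) AS highest,
           MIN(m.monthly_orders) AS lowest,
           MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference
    FROM regions r
    JOIN region_monthly m ON m.region_id = r.region_id
    GROUP BY r.region_id, r.region_name
    ORDER BY r.region_id
    """
)
right_columns = [d[0] for d in right.description]
right_rows = right.fetchall()

print(wrong_columns)
print(right_columns, right_rows)
```

 Both queries return one row per region, so the grain is identical; the only difference is which columns are projected.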
 ### 4. Coordination with specs 03 and 07/09/10/11/12
 - **Spec 03** (multi-connection routing) owns `<workflow>` step 0 and the
  `connectionId` threading/scoping. Spec 14 touches `<workflow>` only to add the
  completeness-check line to **step 6**; it must not rewrite the routing or the
  `<rules>` `connectionId` scoping. If both land, step 6 must still read coherently:
  validation followed by the completeness ritual.
 - **Specs 07/09/10/11/12** own their own bullets and worked examples in
  `<sql_craft>`. Spec 14 is **additive** to the same "Answer completeness /
  interpretation" group and adds one example; it must not remove or contradict
  theirs.
 ## Leak-safety (hard constraint)
 The example uses an **invented, generic schema** (`regions`, `region_monthly`) and
 made-up columns — **no benchmark table names, SQL, or result values.** It teaches
 the *pattern* (cover every requested output + identity + inputs, at grain, without
 padding), which is universal and tied to no specific instance. The over-projection
 guard's rationale is **universal** (noise/clarity/consumability), never
 "grader-gaming" or any other scoring reference. No part of the addition mentions a
 benchmark, gold answer, grader, or scoring comparator.
 ## Acceptance criteria
 - `<sql_craft>` "Answer completeness / interpretation" states the **multi-part /
  multi-output completeness** rule (a column per requested output; list / paired
  extremes / value-plus-components), named as the umbrella over the shipped
  *identity* and *inputs* rules — inline, dialect-agnostic, with a generic *why*.
 - `<sql_craft>` states a concrete **final completeness check** (re-read the
  question → confirm metrics + entity identity + derived-value inputs + grain are
  projected), cross-referencing the existing identity/inputs/grain rules so they
  are enforced, not merely listed.
 - The check carries the **over-projection guard** with a **universal** rationale
  (don't pad with unrequested columns — noise / misleading / harder to consume),
  and the skill contains **zero** grader/gold/benchmark references anywhere.
 - `<workflow>` **step 6** carries a mandatory line that runs the completeness
  check **unconditionally** before emitting and points into the `<sql_craft>`
  capstone; the rule content is **stated once** in `<sql_craft>` (no duplicated
  rationale/guard in the workflow). The empty/grain diagnostic remains available.
 - Exactly **one** new worked `sql` example is present (synthetic
  `regions`/`region_monthly`, wrong vs complete), in standard dialect-agnostic SQL;
  the skill then carries **six** `sql` worked examples total.
 - The existing interactive guidance (`<workflow>` steps, `<rules>`, the other
  `<sql_craft>` bullets and the five prior examples) is intact and uncontradicted;
  the additive-only and dialect-clean invariants from specs 07/10 still hold.
 - None of spec 07's excluded items appear (output-shape contract, `MAX(date)`
  anchoring of "recent"/"past N", grader-driven advice, dialect syntax).
 - The skill stays scannable and comfortably under the 500-line budget; the
  frontmatter still parses as `ktx-analytics`.
 - The analytics-skill **content test is updated** to cover the new rule and check
  (see Implementation orientation).
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the prose.
 - **Skill:** `packages/cli/src/skills/analytics/SKILL.md`.
  - Add the multi-part-completeness bullet and the final-completeness-check
    capstone (with the over-projection guard) to the `<sql_craft>` "Answer
    completeness / interpretation" group; add the single
    `regions`/`region_monthly` worked example.
  - In `<workflow>` step 6, replace the current conditional answer-completeness
    pointer with the mandatory completeness-check line (unconditional, names the
    four facets, points into `<sql_craft>`); keep the empty/grain diagnostic.
  - Leave `<workflow>` steps 0–5, `<rules>`, and the other `<sql_craft>`
    bullets/examples intact. Delivery is unchanged (single `SKILL.md` per target
    via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change
    required.
 - **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`.
  - Add representative phrases to the "represents every craft behavior" list for
    the multi-part rule, the final completeness check, and the over-projection
    guard.
  - Bump the worked-example `sql`-fence count assertion **5 → 6** (and update the
    test name/comment), and assert the new example's shape (e.g. `region_monthly`,
    `MAX(`, `MIN(`, the difference expression, `region_id`).
  - The existing dialect-clean, grader/benchmark-clean, and relative-time
    (`MAX(...)` anchoring) guards must still pass — the new example's `MAX`/`MIN`
    lines carry no "recent"/"past N" wording, so the phrase-level guard is
    unaffected. The `SkillsRegistryService` frontmatter test must still pass.
 - Rebuild and re-link the dev binary so the playground picks up the updated skill:
  `pnpm run build && pnpm run link:dev`.
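 The content-test checks described above (fence count plus example shape) could be sketched roughly as follows. This is a dependency-free illustration, not the actual test file: `countSqlFences`, `missingFragments`, and `sampleSkill` are hypothetical names, the highest/lowest column aliases and the join key are assumptions, and the real test would read `SKILL.md` rather than an inline string.

```typescript
// Built indirectly so this sketch can itself sit inside a fenced block.
const FENCE = "`".repeat(3);

// Hypothetical helper: counts fenced blocks whose opening line is exactly a `sql` fence.
function countSqlFences(markdown: string): number {
  return markdown.split("\n").filter((line) => line.trim() === `${FENCE}sql`).length;
}

// Hypothetical helper: returns the expected fragments that the text does not contain.
function missingFragments(markdown: string, fragments: string[]): string[] {
  return fragments.filter((fragment) => !markdown.includes(fragment));
}

// Fragments the spec names for the new example's shape.
const newExampleShape = ["region_monthly", "MAX(", "MIN(", "region_id"];

// Stand-in for the skill content. Alias `rm` and `order_count_range` follow the
// implementation notes; the other aliases and the join key are assumed.
const sampleSkill = [
  "Answer every requested output.",
  `${FENCE}sql`,
  "SELECT r.region_id, r.region_name,",
  "       MAX(rm.monthly_orders) AS highest_monthly_orders,",
  "       MIN(rm.monthly_orders) AS lowest_monthly_orders,",
  "       MAX(rm.monthly_orders) - MIN(rm.monthly_orders) AS order_count_range",
  "FROM regions r",
  "JOIN region_monthly rm ON rm.region_id = r.region_id",
  "GROUP BY r.region_id, r.region_name;",
  FENCE,
].join("\n");
```

 Counting opening fences rather than matching whole blocks keeps the assertion robust to edits inside an example, which fits the spec's note that the implementer owns the prose.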
 ## Benchmark context (motivation only)
 On the latest SQLite-subset run, **incomplete output was the single largest
 failure bucket (~13 of 51 voted failures)**: multi-part questions answered
 partially, plus dropped identity / derived-value inputs — the latter two being
 spec-07 rules that already exist but weren't applied. A probe with a much stronger
 model reproduced the *same* incomplete-output failures, confirming this is a
 craft-enforcement gap rather than a model-capability one. The fix — answer every
 requested part, identify the entities, keep the inputs, and don't pad — is
 universal analyst craft, so it belongs in the product skill (and transfers to real
 users), enforced as a final pre-emit check rather than left as a passive hint.
 Improving the benchmark score is a side effect; the skill contains no trace of the
 benchmark.
 ## Implementation notes
 Implemented as additive content in one Markdown file plus a test update.
 - **Skill — `packages/cli/src/skills/analytics/SKILL.md`** (`<sql_craft>` "Answer
  completeness / interpretation" group):
  - Added the **"Answer every requested output"** umbrella bullet (list / paired
    extremes / value-plus-components → a column per requested output, with a generic
    *why*). It names *keep the inputs* and *expose identity* as its "value +
    components" and "entity identity" instances, pins the closed-set definition of a
    complete projection, and marks itself as governing *which columns* appear —
    distinct from the *Top …* / *For each X* row-selection rules, with which it
    composes. The two shipped instance rules are preserved verbatim.
  - Added the **"Final completeness check"** capstone bullet: a four-facet
    "before emitting, re-read the question and confirm the projection covers…"
    checklist (metric/attribute → multi-part rule; identifier → *expose identity*;
    inputs → *keep the inputs*; grain → *for each X* / *complete the panel*), run on
    every query. It carries the **over-projection guard** with a universal rationale
    (unrequested columns add noise, mislead, and are harder to consume — match the
    request exactly), with **no** grader/gold/benchmark reference.
  - Added one worked `sql` example (synthetic `regions` / `region_monthly`): WRONG
    answers only the first clause (`SELECT region_name, MAX(monthly_orders) …`),
    dropping the region id, the lowest, and the difference; RIGHT projects
    `r.region_id, r.region_name`, `MAX` highest, `MIN` lowest, and the
    `MAX − MIN` difference, joining `regions` to `region_monthly` and grouping by id
    + name. This is the **sixth** `sql` example, dialect-clean (portable `MAX`/`MIN`).
  - `<workflow>` **step 6**: replaced the conditional answer-completeness pointer
    with an unconditional *"Always run the final completeness check before emitting"*
    line that names the four facets and points into the `<sql_craft>` capstone; the
    empty/grain diagnostic is retained for diagnosis. Steps 0–5, `<rules>`, and the
    other `<sql_craft>` bullets/examples are untouched.
  - Delivery is unchanged: `readAnalyticsSkillContent` in
    `packages/cli/src/setup-agents.ts` still ships the single `SKILL.md` per target
    (confirmed, no change required).
 - **Tests — `packages/cli/test/skills/analytics-skill-content.test.ts`:** added the
  three representative phrases (`Answer every requested output`, `Final completeness
  check`, `Don't over-project`); bumped the `sql`-fence count assertion 5 → 6 and
  renamed that test; asserted the new example's shape (`region_monthly`,
  `MAX(rm.monthly_orders)`, `MIN(rm.monthly_orders)`, the `MAX − MIN` difference, and
  `r.region_id, r.region_name`). The dialect-clean, grader/benchmark-clean,
  relative-time, and frontmatter guards still pass.
 - **Verification:** `analytics-skill-content` 9/9 and `setup-agents` 46/46 pass;
  production type-check (`tsconfig.json`, src) is clean; `pnpm run build` copied the
  updated skill into `dist/skills/analytics/SKILL.md` (6 fences, all new content
  present) and `pnpm -w run link:dev` re-linked `ktx-dev` so the playground picks it
  up. The skill is 244 lines (< 500 budget) and the frontmatter still parses as
  `ktx-analytics`.
 - **Deviation (cosmetic):** the worked example uses alias `rm` and a difference
  column named `order_count_range`; the intake draft sketched alias `m` and
  `AS difference`. The spec leaves prose to the implementer, so the change is purely
  naming.
 - **Unrelated pre-existing issue:** `tsconfig.test.json` reports one type error in
  `packages/cli/test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts`/`contextTools`
  mismatch introduced by the earlier connection-scoped-wiki commit `2677b3ef`). It is
  untouched by this work and out of scope here.
--- a/spider2-specs/specs/15-mcp-server-structured-logging.md
+++ b/spider2-specs/specs/15-mcp-server-structured-logging.md
@ -1,405 +0,0 @@
 # Structured, leveled logging for the ktx MCP server
 > Refined spec. Intake draft: `todo/15-mcp-server-structured-logging.md`.
 >
 > **Scope: observability only.** This spec is about *seeing* what the MCP server
 > does (which tool, what params, when, how long, outcome). *Preventing* a runaway
 > query from blocking the server (off-event-loop / interruptible execution) is a
 > separate concern — see "Non-goals".
 ## Problem
 The ktx MCP server (`mcp-http-server.ts` + `mcp-stdio-server.ts`, both built
 through `mcp-server-factory.ts` on raw `node:http` + the
 `@modelcontextprotocol/sdk` transports) emits almost no operational logs. There
 is no server-side record of **which MCP tool was called, with what parameters,
 when, how long it took, or whether it succeeded** — nor of session open/close or
 transport errors. When a tool call is slow, hangs, or a client connection drops
 ("Transport channel closed"), an operator has no trail to diagnose it and must
 resort to process sampling / `lsof` / guesswork — and the offending input
 (e.g. the exact SQL) is typically unrecoverable.
 The hook to fix this already exists but is half-built: `instrumentMcpServer`
 (`context/mcp/context-tools.ts`) wraps every tool handler and already times it,
 but it emits **only on completion** (a sampled `mcp_request_completed` telemetry
 event) and **never writes a start line and never writes to the server log**. A
 call that never returns therefore leaves no trace at all.
 ## Generic use case (independent of any benchmark)
 Anyone running a long-lived ktx MCP server — a developer's local instance
 (stdio, launched by Claude Desktop / Cursor), a foreground HTTP server, or a
 shared/hosted HTTP daemon — needs observability into tool-call activity to:
 - diagnose slow or hung tool calls (which `sql_execution` ran, against which
  connection, with what SQL, for how long);
 - explain client-visible connection failures from the server side (session
  lifecycle, transport-closed events);
 - audit what agents asked the server to do;
 - spot patterns (hot tools, slow connections, error rates).
 This is standard production-server hygiene; the server currently provides none.
 ## Design decisions (resolved during refinement)
 These resolve ambiguities the intake draft left open. They constrain the
 implementer; the exact code is theirs.
 ### One `pino` logger, synchronous, written to **stderr**
 Use `pino` — the de-facto standard structured-JSON logger for Node servers — as
 a single shared instance. Two corrections to the draft's sketch:
 - **stderr, not stdout.** The stdio transport reserves **stdout** for the
  JSON-RPC protocol (`mcp-stdio-server.ts` deliberately no-ops `stdout.write`);
  writing logs there would corrupt the protocol stream. The HTTP daemon already
  redirects **both** child fds to `.ktx/logs/mcp.log`
  (`managed-mcp-daemon.ts`: `stdio: ['ignore', log.fd, log.fd]`), so stderr lands
  in the same log file (surfaced by `ktx mcp logs`). **stderr is therefore the
  one universally-correct sink** for both transports.
 - **Synchronous, no worker-thread transport.** `pino` writes through a
  `DestinationStream` (`{ write(msg) }`) — the server's existing
  `KtxCliIo.stderr` sink satisfies that interface directly. Configure pino with a
  **synchronous** destination (`pino.destination({ sync: true })`, or the
  pino-pretty stream below with `sync: true`). This is load-bearing: the
  `tool.start` line **must** be flushed to the fd *before* the (possibly
  blocking) handler runs, so a runaway synchronous `better-sqlite3` query that
  pegs the event loop still leaves the start line on disk. A worker-thread
  transport (`transport: { target: ... }`) buffers and can lose that exact line
  on a hard crash — **do not use transport mode.**
 ### Format is derived from `stderr.isTTY`, not a config flag
 One logger, two serializations chosen by the environment (the "behavior follows
 from inputs" rule — not a user-visible knob):
 - **TTY** (`ktx mcp start --foreground` or `ktx mcp stdio` run in a terminal) →
  **`pino-pretty` as a synchronous in-process stream** (`pretty({ sync: true,
  destination: <stderr sink> })`, colorized). A readable live dev view.
 - **Not a TTY** (the detached daemon, whose stderr is the `.ktx/logs/mcp.log`
  file fd) → **plain JSON line** via the synchronous pino destination. The log
  *file* stays structured JSON so the incident workflow ("recover the hung query
  with a one-line `grep` / `jq`") works — colorized ANSI in a file would defeat
  it.
 `KtxCliIo.stderr` has no `isTTY` field (`cli-runtime.ts`), so detect the terminal
 from the underlying stream (`process.stderr.isTTY`) at logger construction, while
 still writing *through* the `io.stderr` sink so tests can capture emitted lines.
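 A minimal, dependency-free sketch of the format decision above. `LineSink`, `chooseLogFormat`, `writeRecord`, and `createCapturingSink` are hypothetical names, and the pretty branch is a toy stand-in for what `pino-pretty` would render. The sketch only illustrates that the format is a function of `isTTY` and that every write goes through the injected sink.

```typescript
// Minimal sink shape the spec describes: anything with a write(msg) method.
interface LineSink {
  write(msg: string): void;
}

type LogFormat = "pretty" | "json";

// The format follows from the environment, not from a user-facing flag.
function chooseLogFormat(isTTY: boolean | undefined): LogFormat {
  return isTTY ? "pretty" : "json";
}

// Hypothetical stand-in for pino / pino-pretty serialization.
function writeRecord(sink: LineSink, format: LogFormat, record: Record<string, unknown>): void {
  const line =
    format === "json"
      ? JSON.stringify(record)
      : `[${String(record["level"])}] ${String(record["msg"])}`;
  // Synchronous: the line is handed to the sink before this function returns.
  sink.write(`${line}\n`);
}

// Capturing sink, as a test might inject in place of the real stderr.
function createCapturingSink(): LineSink & { lines: string[] } {
  const lines: string[] = [];
  return {
    lines,
    write(msg: string): void {
      lines.push(msg);
    },
  };
}
```

 In production the `isTTY` argument would come from `process.stderr.isTTY`; passing it in explicitly is what makes both branches testable without a terminal.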
 ### Single hook: extend `instrumentMcpServer`, do not fork a second wrapper
 Tool-call logging is added to the existing `instrumentMcpServer`
 (`context-tools.ts`), which already wraps `registerTool` and measures duration.
 It receives the **raw** tool input (it wraps the schema-parsing handler from
 `registerParsedTool`), so the params it logs include `sql` for `sql_execution`.
 The existing telemetry emission stays unchanged; logging is **additive** beside
 it. Because both transports build their server through `mcp-server-factory.ts` →
 `registerKtxContextTools`, this single change gives **both HTTP and stdio**
 tool-call logging for free.
 ### `sessionId` / `callId` provenance
 - **`sessionId`** comes from the SDK's per-call handler context
  (`RequestHandlerExtra.sessionId`; confirmed present in `@modelcontextprotocol/sdk`
  `1.29.0`). It is populated for the HTTP StreamableHTTP transport and absent for
  stdio (single session) — log it when present, omit otherwise. Add
  `sessionId?: string` to `KtxMcpToolHandlerContext` (`context/mcp/types.ts`).
 - **`callId`** is generated per invocation with `randomUUID()` (already imported
  in `context-tools.ts`). It correlates a `tool.start` with its `tool.end`.
 ### No redaction in v1 (explicit)
 v1 ships **no log redaction**. Rationale recorded here so it is a deliberate
 choice, not an oversight: these logs are **local** (stderr → `.ktx/logs/mcp.log`),
 **never transmitted off-box**, and sit at the **same trust boundary** as the
 `ktx.yaml` / environment that already hold the connection credentials. Concretely:
 - Request **headers are never logged** at all, so the bearer token
  (`KTX_MCP_TOKEN`) simply isn't collected — this is "not logged," not "redacted."
 - Errors are logged with their **full message and stack** via pino's standard
  `err` serializer.
 - SQL text and tool params are logged **verbatim** (they are not secrets).
 Credential redaction (e.g. a DB URL embedded in a driver error string) is an
 explicit **v1 non-goal**; revisit only if these logs are ever shipped off-box.
 This drops the draft's "light redaction" requirement and the
 `collectTelemetryRedactionSecrets` / scrubber reuse it implied.
 ## Requirements
 ### 1. One shared pino logger
 - A single `pino` instance per server process, constructed once and threaded to
  both the transport layer (for lifecycle events) and the tool layer (for
  tool-call events). Level set from env (Requirement 7), default `info`.
 - Synchronous destination bound to the server's stderr sink (see Design
  decisions). Pretty (`pino-pretty`, sync stream) when `process.stderr.isTTY`,
  otherwise plain JSON. Each line carries pino's standard `time` and `level`.
 - No new dependency beyond `pino` and `pino-pretty`. No OpenTelemetry / metrics
  stack, no async/worker transport, no in-app file rotation.
 ### 2. Per-session / per-call context via child loggers
 Use pino child loggers so every line carries the relevant correlation fields:
 a per-call child binds `{ tool, callId }` plus `sessionId` when present, so one
 session's or one call's activity can be grepped from the log.
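 A rough sketch of the child-logger binding, written without dependencies. `MiniLogger`, `createMiniLogger`, and `callLogger` are hypothetical stand-ins that mimic the shape of pino's `child(bindings)` and `info(fields, msg)` calls; the real implementation would use the pino instance directly.

```typescript
type Fields = Record<string, unknown>;

// Hypothetical stand-in for a pino-style logger: bound fields are merged into every record.
interface MiniLogger {
  info(fields: Fields, msg: string): void;
  child(bindings: Fields): MiniLogger;
}

function createMiniLogger(write: (line: string) => void, bound: Fields = {}): MiniLogger {
  return {
    info(fields: Fields, msg: string): void {
      write(JSON.stringify({ level: "info", ...bound, ...fields, msg }));
    },
    child(bindings: Fields): MiniLogger {
      return createMiniLogger(write, { ...bound, ...bindings });
    },
  };
}

// Bind sessionId only when the transport supplies one (present for HTTP, absent for stdio).
function callLogger(root: MiniLogger, tool: string, callId: string, sessionId?: string): MiniLogger {
  return root.child(sessionId === undefined ? { tool, callId } : { tool, callId, sessionId });
}
```

 Because the correlation fields live on the child, the call sites for `tool.start` and `tool.end` only pass their own payload, and the two lines cannot drift apart in which identifiers they carry.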
 ### 3. Tool-call logging — START before execute, END after
 In `instrumentMcpServer`, for **every** MCP tool invocation:
 - **On entry, before invoking the handler**, write `tool.start` with
  `{ tool, callId, sessionId?, params }` at **`info`**. `params` is the raw tool
  input; for `sql_execution` this includes the full **SQL text** (the single most
  useful field). The write is synchronous so the line exists even if the handler
  never returns.
 - **On normal completion**, write `tool.end` with
  `{ tool, callId, sessionId?, durationMs, outcome: "ok", resultSize }` at
  **`info`** — *unless* it is a slow call (Requirement 4). `resultSize` is a
  tool-agnostic size measure (byte length of the serialized result text content).
 - **On error**, write `tool.end` with
  `{ tool, callId, sessionId?, durationMs, outcome: "error", err }` at **`error`**,
  where `err` is the serialized error (message + stack) per Requirement 6.
 `tool.start` and `tool.end` share the **same correlation fields and the same
 `info` level** (for the non-slow, non-error case) so that an **unmatched
 `tool.start`** — a start with no `tool.end` for the same `callId` — is an
 unambiguous "this call hung" signal. This is the property that makes a runaway
 `sql_execution` identifiable from the log alone, with its exact SQL and
 timestamp, no process sampling.
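 The start-before-execute ordering can be sketched as below. This is an illustration under assumptions, not the shipped `instrumentMcpServer`: `withToolLogging`, `Emit`, and the injected `newCallId` are hypothetical, and the sketch leaves out `sessionId`, `resultSize`, and the slow-call promotion to keep the ordering visible.

```typescript
type Fields = Record<string, unknown>;
type Emit = (level: "info" | "error", msg: string, fields: Fields) => void;
type Handler<P, R> = (params: P) => Promise<R>;

// Hypothetical wrapper in the spirit of instrumentMcpServer; names are illustrative.
function withToolLogging<P, R>(
  tool: string,
  handler: Handler<P, R>,
  emit: Emit,
  newCallId: () => string,
): Handler<P, R> {
  return async (params: P): Promise<R> => {
    const callId = newCallId();
    const startedAt = Date.now();
    // Emitted synchronously, before the handler is invoked at all.
    emit("info", "tool.start", { tool, callId, params });
    try {
      const result = await handler(params);
      emit("info", "tool.end", { tool, callId, durationMs: Date.now() - startedAt, outcome: "ok" });
      return result;
    } catch (err) {
      emit("error", "tool.end", { tool, callId, durationMs: Date.now() - startedAt, outcome: "error", err });
      throw err;
    }
  };
}
```

 If the handler never settles, neither `tool.end` branch runs, so the log holds a `tool.start` with no partner for that `callId`. That only helps if `emit` itself writes synchronously, which is why the spec rules out a buffered transport.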
 > **Deliberate change from the intake draft.** The draft put `tool.start` /
 > `tool.end` at `debug` (suppressed at the default `info`). That defeats the
 > motivating incident: a hang is unpredictable, so debug would have to be enabled
 > *before* it occurs, which never happens. v1 logs start/end at **`info`** — an
 > always-on access log — so the offending query is recoverable at the default
 > level. `debug` is reserved for heavier detail (Requirement 7).
 ### 4. Slow-call warning
 When a call **completes** with `durationMs` greater than the configured slow
 threshold (Requirement 7), emit its `tool.end` at **`warn`** (carrying the same
 fields plus the duration) instead of `info`. This makes a completed-but-slow call
 stand out and keeps it visible even when the level is raised to `warn`.
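 The level choice for `tool.end` reduces to a small pure function. The sketch below uses a hypothetical name (`toolEndLevel`) and assumes that an errored call stays at `error` even when it was also slow, which the requirements imply but do not state outright.

```typescript
type Outcome = "ok" | "error";
type EndLevel = "info" | "warn" | "error";

// Errors win. Otherwise a completed call is promoted to warn only when it
// exceeds the threshold (strictly greater than, per the requirement).
function toolEndLevel(outcome: Outcome, durationMs: number, slowToolMs: number): EndLevel {
  if (outcome === "error") return "error";
  return durationMs > slowToolMs ? "warn" : "info";
}
```

 With the default threshold of `10000`, a call that takes exactly 10 seconds is still logged at `info`; only a longer one moves to `warn`.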
 ### 5. Connection / session lifecycle and transport errors
 - **HTTP** (`mcp-http-server.ts`, in `newTransport`): log `session.open` from
  `onsessioninitialized` and `session.close` from `onsessionclosed` /
  `transport.onclose`, each with `sessionId`, at `info`. **Wire the currently
  unused `transport.onerror`** to log `transport.error` (the SDK's
  closed-channel / "Transport channel closed" events) at `error`, so a
  client-visible connection failure has a server-side counterpart.
 - **stdio** (`mcp-stdio-server.ts`): the existing `transport.onerror` handler
  currently writes a plain string to stderr; route it through the logger as a
  `transport.error` line at `error` instead. A single `session.open` /
  `session.close` pair for the one stdio connection MAY be logged at `info`.
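 A sketch of the close/error wiring, assuming only that the transport exposes optional `onclose` and `onerror` callbacks. `TransportLike` and `attachLifecycleLogging` are hypothetical names, and `session.open` (which comes from `onsessioninitialized` on the HTTP side) is omitted here.

```typescript
type Fields = Record<string, unknown>;
type Emit = (level: "info" | "error", msg: string, fields: Fields) => void;

// Hypothetical minimal view of a transport: only the two callbacks used here.
interface TransportLike {
  onclose?: () => void;
  onerror?: (error: Error) => void;
}

// Chains onto any existing callbacks so that other cleanup work still runs.
function attachLifecycleLogging(transport: TransportLike, emit: Emit, sessionId?: string): void {
  const fields: Fields = sessionId === undefined ? {} : { sessionId };
  const previousClose = transport.onclose;
  const previousError = transport.onerror;
  transport.onclose = () => {
    emit("info", "session.close", fields);
    previousClose?.();
  };
  transport.onerror = (error: Error) => {
    emit("error", "transport.error", { ...fields, err: error });
    previousError?.(error);
  };
}
```

 Passing `sessionId` as optional lets the same helper serve stdio, where there is a single session and no identifier to log.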
 ### 6. Structured error logging
 Errors are logged as structured objects via pino's standard `err` serializer
 (`pino.stdSerializers.err` or equivalent), carrying error class, message, and
 stack — never a bare interpolated string. The existing telemetry exception
 reporting in `instrumentMcpServer` / `registerParsedTool` is unchanged.
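 For illustration, a structured error record of the kind described above might be produced like this. `serializeError` is a hypothetical, dependency-free stand-in; the real code would rely on pino's `err` serializer, and the handling of non-`Error` values here is an assumption.

```typescript
interface SerializedError {
  type: string;
  message: string;
  stack?: string;
}

// Hypothetical stand-in for a structured err serializer: class, message, stack.
function serializeError(value: unknown): SerializedError {
  if (!(value instanceof Error)) {
    // Assumed fallback for thrown non-Error values (strings, numbers, and so on).
    return { type: typeof value, message: String(value) };
  }
  const serialized: SerializedError = { type: value.name, message: value.message };
  if (value.stack !== undefined) serialized.stack = value.stack;
  return serialized;
}
```

 Keeping the error as an object under a single `err` key means a `jq` filter can select on `err.type` or `err.message` directly, which a pre-interpolated string would not allow.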
 ### 7. Configuration surface
 - **`KTX_MCP_LOG_LEVEL`** — pino level (`error` | `warn` | `info` | `debug` |
  …), default **`info`**. MCP-scoped name because the MCP server is the only
  emitter today; naming it global (`KTX_LOG_LEVEL`) would imply a logging system
  that does not exist.
 - **`KTX_MCP_SLOW_TOOL_MS`** — slow-call threshold in milliseconds (Requirement
  4), default **`10000`**. Justified as a real ops knob: "slow" differs sharply
  between a local SQLite file and a remote warehouse.
 - Level ladder that results from Requirements 3–5:
  - `debug`: everything logged at `info` **plus** heavier detail (e.g. result bodies,
    progress notifications); what extra to attach is at the implementer's discretion.
  - `info` (default): `tool.start` / `tool.end`, session lifecycle, slow `warn`s,
    errors.
  - `warn`: slow-call `tool.end`s, `transport.error`, errored `tool.end`s — but
    not routine tool traffic.
  - `error`: errored `tool.end`s and `transport.error` only.
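 The two environment knobs and the resulting level ladder can be sketched as follows. The function names are hypothetical, the numeric values are pino's standard level numbers, and falling back to the default on an invalid value is an assumption (the spec fixes only the defaults). The `silent` level is left out for brevity.

```typescript
const PINO_LEVEL_VALUES = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 } as const;
type LevelName = keyof typeof PINO_LEVEL_VALUES;
type Env = Record<string, string | undefined>;

function isLevelName(value: string): value is LevelName {
  return Object.prototype.hasOwnProperty.call(PINO_LEVEL_VALUES, value);
}

// Missing or unrecognized values fall back to `info` (assumed behavior).
function readLogLevel(env: Env): LevelName {
  const raw = env["KTX_MCP_LOG_LEVEL"]?.trim().toLowerCase();
  return raw !== undefined && isLevelName(raw) ? raw : "info";
}

// Missing, non-numeric, or non-positive values fall back to 10000 ms (assumed behavior).
function readSlowToolMs(env: Env): number {
  const parsed = Number(env["KTX_MCP_SLOW_TOOL_MS"]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : 10000;
}

// A line is written when its level is at or above the configured one.
function isEmitted(line: LevelName, configured: LevelName): boolean {
  return PINO_LEVEL_VALUES[line] >= PINO_LEVEL_VALUES[configured];
}
```

 Read against the ladder: at `warn`, a routine `tool.end` logged at `info` is dropped, while a slow `tool.end` (logged at `warn`) and an errored one (logged at `error`) both survive.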
 ## Acceptance criteria
 - At default level (`info`), invoking any MCP tool produces a `tool.start`
  (`tool`, `callId`, `sessionId` when HTTP, `params`) and a matching `tool.end`
  (`durationMs`, `outcome`, `resultSize`) line, as **JSON to stderr** when stderr
  is not a TTY.
 - A tool call that never returns (e.g. a runaway `sql_execution`) leaves a
  `tool.start` line carrying its **exact SQL and timestamp** and **no** matching
  `tool.end` for that `callId` — so the offending query is recoverable from the
  log alone, with no process sampling.
 - A completed call slower than `KTX_MCP_SLOW_TOOL_MS` emits its `tool.end` at
  `warn` with its `durationMs`.
 - Session open/close and transport-closed (`transport.error`) events are logged
  with the `sessionId` (HTTP); the stdio transport error path goes through the
  logger, not a raw `stderr.write`.
 - At level `warn`, routine `tool.start` / `tool.end` are suppressed but
  slow-call warnings, transport errors, and errored calls are present.
 - When stderr is a TTY (`ktx mcp start --foreground` / `ktx mcp stdio` in a
  terminal), output is human-readable colorized `pino-pretty`; the daemon log
  file (`.ktx/logs/mcp.log`) is plain JSON. Both paths are synchronous.
 - The bearer token never appears in any log line (headers are not logged); SQL
  and tool params do appear.
 - No worker-thread / async log transport is introduced; no OpenTelemetry /
  metrics stack; the only new dependencies are `pino` and `pino-pretty`.
 - The existing `mcp_request_completed` telemetry and exception reporting still
  work unchanged.
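 The hung-call criterion above lends itself to a mechanical check over the JSON log. The sketch below is one possible reading, assuming the event name is stored in pino's default `msg` key and that `callId` is a top-level field; `findUnmatchedStarts` and `ToolLine` are hypothetical names.

```typescript
interface ToolLine {
  msg?: string;
  callId?: string;
  [key: string]: unknown;
}

// Returns every tool.start record that has no tool.end with the same callId.
function findUnmatchedStarts(logText: string): ToolLine[] {
  const starts = new Map<string, ToolLine>();
  for (const raw of logText.split("\n")) {
    if (raw.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // skip lines that are not JSON
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const record = parsed as ToolLine;
    if (typeof record.callId !== "string") continue;
    if (record.msg === "tool.start") starts.set(record.callId, record);
    if (record.msg === "tool.end") starts.delete(record.callId);
  }
  return Array.from(starts.values());
}
```

 Each returned record still carries its `params`, so for a stuck `sql_execution` the offending SQL and its timestamp are available without sampling the process.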
 ## Non-goals
 - **Preventing / interrupting runaway queries** (off-event-loop execution, query
  timeouts, worker-thread isolation). A single synchronous query that fans out
  into a massive nested-loop join can peg the single-threaded server for hours
  and break new connections — observability surfaces *which* query, but the fix
  is execution-model work in a separate spec. (This logging is also the
  prerequisite for a future watchdog that detects a `tool.start` with no
  `tool.end` past a threshold and recycles the server.)
 - **Log redaction** (see Design decisions) — explicit v1 non-goal.
 - **Pretty output as a worker-thread transport** — the TTY path uses pino-pretty
  as a synchronous in-process stream only.
 - Metrics / tracing / OpenTelemetry exporters.
 - Forwarding logs to the MCP *client* via the protocol logging capability
  (`notifications/message`, `logging/setLevel`) — a possible later enhancement,
  distinct from operational stderr logging.
 - A global `KTX_LOG_LEVEL` spanning non-MCP commands — out of scope until other
  surfaces emit structured logs.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the design.
 - **New module** — a small logger factory, e.g.
  `packages/cli/src/context/mcp/logger.ts`: builds the shared pino instance from
  the stderr sink + `KTX_MCP_LOG_LEVEL`, choosing the pino-pretty (sync) stream
  when `process.stderr.isTTY`, otherwise `pino.destination({ sync: true })`, and
  exposes the slow-call threshold read from `KTX_MCP_SLOW_TOOL_MS`.
 - **Tool-call logging** — `packages/cli/src/context/mcp/context-tools.ts`:
  extend `instrumentMcpServer` (~line 585) to write `tool.start` before
  `handler(...)` and `tool.end` after (ok / slow-`warn` / `error`); generate
  `callId` via the already-imported `randomUUID`; read `sessionId` from the
  handler `context`. Thread the logger via `RegisterKtxContextToolsDeps`
  (~line 26) and `registerKtxContextTools` (~line 650). Leave `registerParsedTool`
  and the existing telemetry emission intact.
 - **Context type** — `packages/cli/src/context/mcp/types.ts`: add
  `sessionId?: string` to `KtxMcpToolHandlerContext`; add the logger to
  `KtxMcpServerDeps` / the register deps.
 - **Server wiring** — `packages/cli/src/context/mcp/server.ts`
  (`createDefaultKtxMcpServer` / `createKtxMcpServer`) and
  `packages/cli/src/mcp-server-factory.ts` (`createKtxMcpServerFactory`): accept
  and pass the logger down to `registerKtxContextTools`.
 - **HTTP lifecycle** — `packages/cli/src/mcp-http-server.ts`: construct (or
  receive) the logger; in `newTransport` (~line 186) log `session.open` /
  `session.close` and add `transport.onerror` → `transport.error`.
 - **stdio lifecycle** — `packages/cli/src/mcp-stdio-server.ts`: construct (or
  receive) the logger; route the existing `transport.onerror` (~line 54) through
  it.
 - **Log destination is already captured** — `packages/cli/src/managed-mcp-daemon.ts`
  redirects child stdout+stderr to `.ktx/logs/mcp.log`; `ktx mcp logs`
  (`commands/mcp-commands.ts`) tails it. No change needed there.
 - **Dependencies** — add `pino` and `pino-pretty` to
  `packages/cli/package.json`. Verify Knip/Biome dead-code and bundle checks
  still pass.
 - **Tests** — extend `packages/cli/test/mcp-http-server.test.ts`,
  `mcp-server-factory.test.ts`, `context/mcp/server.test.ts`, and
  `commands/mcp-commands.test.ts`: assert (a) a `tool.start` JSON line is written
  before a (mock) handler runs and carries `params`/`sql`; (b) a matching
  `tool.end` with `durationMs`/`outcome`; (c) a hung-handler scenario yields a
  `tool.start` with no `tool.end` for that `callId`; (d) a slow completion emits
  `warn`; (e) session lifecycle + `transport.error` lines; (f) the bearer token
  never appears. Inject a capturing `io.stderr` and parse the JSON lines.
  *Note:* `mcp-server-factory.test.ts` carries a pre-existing
  `KtxMcpContextPorts`/`contextTools` type error (from commit `2677b3ef`,
  unrelated to this work) — do not let it mask new failures.
 - After implementing, rebuild and re-link so the playground picks it up:
  `pnpm run build && pnpm run link:dev`.
 ## Benchmark context (motivation, not a requirement)
 During a concurrent Spider 2.0-Lite run against the MCP server, a query
 generated by an adversarial reviewer degenerated into a massive nested-loop join;
 synchronous `better-sqlite3` executed it on the event loop, pegging a server at
 ~100% CPU for hours and breaking new MCP connections ("Transport channel
 closed"). We could not determine *which* query, because the server logs nothing
 about tool calls — diagnosis required `sample` / `lsof` on the live process and
 the exact SQL was never recovered. Structured tool-call logging — especially
 `tool.start` written synchronously *before* execution, at the default level —
 would have turned this into a one-line `grep` of the server log. Improving the
 benchmark is a side effect; the logging is generic production-server hygiene.
 ## Implementation notes
 Implemented on branch `write-feature-spec-wiki`. All requirements and acceptance
 criteria are satisfied.
 **What was built / where**
 - **New module `packages/cli/src/context/mcp/logger.ts`** — `createMcpLogger(io,
  { isTTY? })` builds one synchronous `pino` (v10) instance written through the
  `io.stderr` sink: plain JSON when stderr is not a TTY, a `pino-pretty` (v13)
  synchronous in-process stream (`{ colorize: true, sync: true }`, wrapping the
  sink in a `node:stream.Writable`) when it is. Also exports `mcpLogLevel`
  (`KTX_MCP_LOG_LEVEL`, validated against pino levels, default `info`),
  `mcpSlowToolMs` (`KTX_MCP_SLOW_TOOL_MS`, default `10000`), and
  `serializeMcpError`. No worker/async transport; no global `KTX_LOG_LEVEL`.
 - **Tool-call logging — `instrumentMcpServer` (`context/mcp/context-tools.ts`)** —
  per invocation: `callId = randomUUID()`, a child logger bound to
  `{ tool, callId, sessionId? }`, `tool.start { params }` written at `info`
  **before** awaiting the handler (synchronous, so a runaway query still leaves it
  on disk), and `tool.end` after: `info { durationMs, outcome:"ok", resultSize }`,
  `warn` when `durationMs > KTX_MCP_SLOW_TOOL_MS`, or `error { outcome:"error",
  err }`. `resultSize` is the UTF-8 byte length of the serialized text content.
  The existing `mcp_request_completed` telemetry + `reportException` are unchanged
  (`durationMs` is now computed once and shared); `registerParsedTool` is intact.
 - **`sessionId` / logger plumbing** — `sessionId?: string` added to
  `KtxMcpToolHandlerContext`; a single per-process logger threads from each
  transport entrypoint through `createKtxMcpServerFactory` →
  `createDefaultKtxMcpServer` → `createKtxMcpServer` → `registerKtxContextTools`
  (`KtxMcpServerDeps.logger`, `RegisterKtxContextToolsDeps.logger`).
 - **HTTP lifecycle (`mcp-http-server.ts`)** — `session.open` from
  `onsessioninitialized`, `session.close` from `transport.onclose`, and the
  previously-unused `transport.onerror` wired to `transport.error` at `error`.
 - **stdio lifecycle (`mcp-stdio-server.ts`)** — the raw `transport.onerror`
  string write is replaced by a `transport.error` log line; `session.open` /
  `session.close` are logged for the single stdio session.
 - **Deps** — `pino ^10.3.1`, `pino-pretty ^13.1.3` added to
  `packages/cli/package.json`.
 - **Tests** — `test/context/mcp/logger.test.ts` (factory, level/threshold env
  parsing, error serializer, TTY vs JSON), a "MCP tool-call logging" block in
  `test/context/mcp/server.test.ts` (start-before-handler, matching end with
  `resultSize`, hung-handler leaves an unmatched start, slow→`warn`, `warn`-level
  suppression with errored end still present, no-logger no-op), session lifecycle
  + bearer-token-never-logged in `test/mcp-http-server.test.ts`, and
  `test/mcp-stdio-server.test.ts` for `transport.error`.
 **Deviations / decisions**
 - **In-band errors carry no stack (inherent).** `registerParsedTool` converts a
  thrown handler error into an `{ isError: true }` result (and reports the full
  error via telemetry) before it reaches `instrumentMcpServer`, so the original
  stack is already gone. `tool.end` for such a result logs `outcome:"error"` with
  `err.message` only; a genuine throw that escapes gets the full pino `err`
  serialization (type + message + stack). The field is always `err` for
  consistency. This honours "leave `registerParsedTool` intact."
 - **`session.close` is logged from `transport.onclose`** (the universal close
  signal for both clean DELETE and dropped connections) rather than
  `onsessionclosed`, to avoid duplicate lines; `onsessionclosed` keeps its
  session-map cleanup role.
 - **The logger is optional throughout.** Production always wires one per process;
  when absent (programmatic/test callers that inject `createMcpServer`), tool-call
  logging is simply off — which keeps existing tests unchanged.
 - `createMcpLogger` accepts an optional `isTTY` purely as a test seam; production
  derives format from `process.stderr.isTTY`.
 **Verification**
 `pnpm --filter @kaelio/ktx exec vitest run` for the four touched/added MCP test
 files: 57 passed. Full default `pnpm run test`: 3018 passed, 1 skipped — the only
 2 failures are in `test/skills/analytics-skill-content.test.ts`, pre-existing and
 unrelated to this change (in-progress analytics-skill work on this branch).
 `pnpm run dead-code` (Biome + Knip default + Knip production) clean. `pnpm run
 build` and `pnpm run link:dev` succeed. `pnpm run type-check` reports only the
 one pre-existing, test-only error in `test/mcp-server-factory.test.ts` from commit
 `2677b3ef` (documented above); all source and the new tests type-check clean.
--- a/spider2-specs/specs/16-bounded-query-execution-timeout.md
+++ b/spider2-specs/specs/16-bounded-query-execution-timeout.md
@ -1,493 +0,0 @@
 # Bounded query execution (deadline + non-blocking) for read SQL
 > Refined spec. Intake draft: `todo/16-bounded-query-execution-timeout.md`.
 >
 > **Scope: bound and cancel a read query that runs too long.** This is the
 > execution-model companion to spec 15 (MCP structured logging). Spec 15
 > *surfaces* a runaway query in the log; it explicitly defers *preventing* one —
 > "off-event-loop execution, query timeouts, worker-thread isolation … is
 > execution-model work in a separate spec." This is that spec.
 ## Problem
 Two compounding gaps on the read-query path (`executeReadOnly`), confirmed in the
 current code:
 1. **No execution deadline, handled divergently per connector.** A single
   expensive query runs unbounded, and whether it is bounded at all depends
   entirely on which driver the caller hit:
   - **BigQuery** is the only connector with a real statement timeout — it sets
     `jobTimeoutMs` on the query job from a per-connection config field
     `job_timeout_ms` (`connectors/bigquery/connector.ts`, `query(...)` ~491–512).
   - **ClickHouse** sets a hardcoded 30s *HTTP* `request_timeout` at client
     creation (`connectors/clickhouse/connector.ts:602`) — a client-side give-up,
     not a server-side `max_execution_time`; the server keeps working.
   - **Snowflake, Postgres, MySQL, SQL Server** bound only pool/connection
     *acquisition* (Snowflake `acquireTimeoutMillis: 60_000`; Postgres
     `connectionTimeoutMillis: 10_000`; SQL Server `idleTimeoutMillis: 30000`;
     MySQL pool size only) — nothing bounds statement *execution*.
   - **SQLite** has nothing.
 2. **In-process SQLite blocks the event loop and cannot be cancelled.** The
   SQLite connector executes on the main thread via synchronous
   `better-sqlite3 .prepare().all()` (`connectors/sqlite/connector.ts`,
   `query(...)` 311–318, used by `executeReadOnly` 247–251). A slow query freezes
   the whole MCP server — it cannot serve other requests, send progress, or write
   `tool.end` — and there is no in-thread way to interrupt it: better-sqlite3 (v12)
   exposes no interrupt/cancel API. Its documented mechanism for slow queries is a
   **worker thread**, and the only way to stop a runaway synchronous query is to
   **terminate the thread** executing it (context7 `/wiselibs/better-sqlite3`,
   `docs/threads.md`).
 The observed failure (Spider2-lite sqlite run, 2026-06-18): a single
 `sql_execution` MCP call —
 `SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on `complex_oracle`,
 where `profits` is a VIEW (`costs ⋈ sales`, 918,843 × 82,112 rows, joined on a
 4-column key with no composite index) — degraded to an O(N×M) nested-loop scan,
 pegged a worker at 100% CPU for 13+ minutes, never returned, produced a
 `tool.start` with no matching `tool.end`, and stalled an eval shard until the
 worker was killed by hand. A row cap (`maxRows`) does not help: it bounds returned
 rows, not scan work, and the failing query returned a single aggregate row.
 ## Generic use case (independent of any benchmark)
 Any data agent that lets an LLM author SQL will eventually issue an
 accidentally-expensive query — an unindexed or cartesian join, an expensive VIEW,
 a wide aggregate over a large fact table. A general-purpose context layer must
 bound that and return a clean, fast "query exceeded Ns" error so the agent can
 revise (add filters, query base tables, narrow the range) instead of hanging the
 tool and the server. This matters for embedded/local warehouses (SQLite, and any
 future DuckDB-style in-process driver) and remote ones alike, and is wholly
 independent of any benchmark.
 ## Design decisions (resolved during refinement)
 These resolve ambiguities the intake draft left open. They constrain the
 implementer; the exact code is theirs.
 ### One canonical deadline, applied uniformly at the contract
 The deadline is enforced for **every** `executeReadOnly` caller, not only the MCP
 `sql_execution` path. `executeReadOnly` has 13 call sites beyond MCP (ingest query
 executor, relationship profiling and composite-candidate probes, relationship
 validation, historic-SQL probes, `ktx sql`); the contract is the single place to
 bound all of them. A heavy ingest profiling probe over a giant unindexed join is
 exactly as worth abandoning as an interactive one — those call sites are
 best-effort and degrade gracefully, so a deadline `KtxQueryError` becomes "skip
 this probe / mark unprofiled," not "fail the source." (Requirement 8 covers the
 call sites that must treat the timeout as recoverable.)
 > Rejected alternative: a caller-resolved deadline (short on the interactive path,
 > longer/none for ingest). That introduces a second value source and the open
 > question "what is the ingest budget," for no real gain — the 30s default already
 > clears any normal profiling probe, and a probe that exceeds it is one to drop.
 ### Default 30s, configurable per-connection via one shared field
 - **Default `30_000` ms.** Fast enough that an LLM agent gets a clean
  "exceeded 30s" and revises within the same turn; generous headroom over any
  indexed aggregate or normal profiling probe; a genuine pathological nested-loop
  scan blows past it immediately.
 - **One shared per-connection override**, honored by every connector:
  `query_timeout_ms` in `ktx.yaml` (`queryTimeoutMs` in TS), a positive integer
  in **milliseconds**. Milliseconds matches the BigQuery SDK and the field it
  replaces; the user-facing error still reads in seconds.
 - **BigQuery's `job_timeout_ms` config key is removed**, not kept alongside the
  new field. BigQuery reads the shared `query_timeout_ms` and maps the resolved
  value onto its SDK's `jobTimeoutMs`. ktx keeps no backward compatibility, so
  there is exactly one way to set a query timeout — no parallel knob (intake
  requirement 1).
 - **Granularity is per-connection only.** No global all-connections override —
  different warehouses have different performance envelopes, and a second
  (global) knob would double the configuration surface for no stated need.
 ### The shared contract is a value + an error, not a base class
 There is **no shared connector base class or factory** — each connector is
 constructed independently; the only shared registry is the *dialect* factory
 (`context/connections/dialects.ts:47–55`). So "defined once" (intake requirement
 3) means a single shared module that owns:
 - `DEFAULT_QUERY_TIMEOUT_MS = 30_000`;
 - `resolveQueryDeadlineMs(connectionConfig)` → the validated `query_timeout_ms`
  override, else the default — so the default and the override precedence live in
  exactly one place;
 - `queryDeadlineExceededError(deadlineMs)` → a `KtxQueryError` with the canonical
  message `query exceeded ${Math.round(deadlineMs / 1000)}s`.
 Each connector calls the resolver once (at construction; connectors already
 receive their connection config) and stores `this.deadlineMs`. **Enforcement is
 necessarily per-connector** — different engines cancel differently — but the
 *value* and the *error message* are shared, so the agent sees one consistent,
 actionable error regardless of driver.
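 A minimal sketch of what such a shared module could look like, under stated
 assumptions: the `KtxQueryError` class below is a local stand-in for the real one,
 and the `ConnectionConfig` shape is hypothetical (only `query_timeout_ms` is taken
 from this spec). The exact code is the implementer's.

```typescript
// Local stand-in for ktx's KtxQueryError; the real class is not reproduced here.
class KtxQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "KtxQueryError";
  }
}

// Hypothetical minimal view of a connection config; only the shared field matters.
interface ConnectionConfig {
  query_timeout_ms?: unknown;
}

const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Returns the validated per-connection override, else the default.
// Zero, negative, and non-integer values are rejected rather than coerced.
function resolveQueryDeadlineMs(connection: ConnectionConfig): number {
  const raw = connection.query_timeout_ms;
  if (raw === undefined) return DEFAULT_QUERY_TIMEOUT_MS;
  if (typeof raw !== "number" || !Number.isInteger(raw) || raw <= 0) {
    throw new Error(
      `query_timeout_ms must be a positive integer (milliseconds), got ${String(raw)}`,
    );
  }
  return raw;
}

// Builds the canonical error; the message reads in seconds, the value is in ms.
function queryDeadlineExceededError(deadlineMs: number): KtxQueryError {
  return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`);
}
```

 A connector would call `resolveQueryDeadlineMs` once at construction and reuse
 the stored value for both enforcement and the error message.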
 ### Real cancellation, not client-side give-up
 Per intake requirement 5, the deadline must *stop the work*, not merely abandon
 the promise while the query keeps running (which on a pooled driver also risks
 returning a still-busy connection to the pool). So:
 - **In-process (SQLite, and any future embedded driver):** run the query off the
  main thread and enforce the deadline by **terminating the worker thread**. There
  is no generic `Promise.race` outer wrapper — a `Promise.race` against a
  synchronous in-thread `.all()` can never fire (the loop is blocked), and against
  a pooled remote query it would poison the pool. Thread termination *is* the
  cancellation.
 - **Remote engines:** set the engine's **server-side statement timeout** so the
  server itself aborts the query and frees the connection cleanly.
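 The claim that a timer cannot fire against synchronous in-thread work can be
 checked with a small, driver-free sketch. The busy loop below stands in for a
 synchronous `.all()` call, and the zero-delay timer stands in for the timeout arm
 of a `Promise.race`; the function name is hypothetical.

```typescript
// Returns whether a zero-delay timer managed to run while the thread was
// occupied by synchronous work for `blockMs` milliseconds.
function timerFiredDuringSyncWork(blockMs: number): boolean {
  let fired = false;
  // Stand-in for the timeout arm of a Promise.race.
  const timer = setTimeout(() => {
    fired = true;
  }, 0);

  // Stand-in for a synchronous query: the event loop cannot turn until this ends.
  const start = Date.now();
  while (Date.now() - start < blockMs) {
    // busy-wait
  }

  const firedWhileBlocked = fired;
  clearTimeout(timer);
  return firedWhileBlocked;
}

console.log(timerFiredDuringSyncWork(50)); // prints false
```

 The timer callback is only eligible to run once the current synchronous call
 returns, which is why an outer race cannot bound an in-thread query.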
 ### Logging routes through spec 15's pino path — no second logger
 The deadline cases are logged through the **existing** MCP tool-call logger
 (spec 15's `instrumentMcpServer`, `context/mcp/context-tools.ts:644–730`), not a
 new logging path threaded into the connector. Verified flow for a timeout:
 `executeReadOnly` throws `queryDeadlineExceededError` (a `KtxQueryError`) →
 `local-project-ports.ts` preserves it → `registerParsedTool` (:552) reports it
 (`reportException` skips `$exception` for `KtxExpectedError`) and returns an
 in-band `isError` result → `instrumentMcpServer` writes `tool.end` at **`error`**
 with `outcome:"error"`, `err.message = "query exceeded {N}s"`, and the **same
 `callId`** as the `tool.start`.
 This is the central observability win and it requires **no new MCP logging code**:
 spec 15 made a hang show up as a `tool.start` with *no* matching `tool.end`; this
 spec turns it into a **matched `tool.start` → `tool.end(error)` pair** whose
 `tool.end` names the deadline. The worker-termination (SQLite) and server-side
 abort (remote) are internal enforcement mechanisms; their single observable signal
 is that `tool.end`, so the connector does **not** get its own logger threaded
 through `KtxScanContext` — that would fork a second path for one capability. The
 "worker was actually reaped, not left spinning" guarantee is asserted by the
 worker's `exit` event in tests (Requirement 3), not by a log line.
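 To make the "matched pair" idea concrete, here is a small sketch that scans
 tool-call log lines for a `tool.start` without a `tool.end`. The `callId` field
 comes from this spec; the `event` field name and the `ToolLogLine` shape are
 assumptions for illustration, not the logger's actual record layout.

```typescript
// Assumed minimal shape of a tool-call log line.
interface ToolLogLine {
  event: "tool.start" | "tool.end";
  callId: string;
}

// Returns the callIds that were started but never ended, in start order.
function findUnmatchedStarts(lines: ToolLogLine[]): string[] {
  const open = new Set<string>();
  for (const line of lines) {
    if (line.event === "tool.start") {
      open.add(line.callId);
    } else {
      open.delete(line.callId);
    }
  }
  return Array.from(open);
}
```

 Before this spec, a runaway SQLite query would leave its `callId` in that set
 indefinitely; with the deadline, the same call produces a `tool.end` and the set
 drains.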
 ## Requirements
 ### 1. Shared deadline contract, defined once
 A single new module (e.g. `packages/cli/src/context/connections/query-deadline.ts`)
 exports `DEFAULT_QUERY_TIMEOUT_MS` (30_000), `resolveQueryDeadlineMs(connectionConfig)`,
 and `queryDeadlineExceededError(deadlineMs)`. Every connector resolves its
 deadline through this resolver; no connector hardcodes its own default or
 duplicates the override-precedence logic.
 ### 2. Shared per-connection config field; BigQuery's removed
 `query_timeout_ms` is added to the **shared** connection config schema (validated
 as an optional positive integer, milliseconds) so every driver accepts it. The
 BigQuery-specific `job_timeout_ms` config field and its dedicated reader
 (`bigQueryJobTimeoutMsFromConnection`) are removed; BigQuery sources its timeout
 from the shared field and applies it as `jobTimeoutMs`. A bad `query_timeout_ms`
 (zero, negative, non-integer) is a clear config validation error, consistent with
 how ktx validates `ktx.yaml`.
 ### 3. SQLite executes off the main thread, terminated on deadline
 `executeReadOnly` on the SQLite connector MUST NOT block the MCP server event
 loop:
 - Read-only validation and the row-limit wrapper (`assertReadOnlySql` +
  `limitSqlForExecution`) run **on the main thread** before dispatch — invalid SQL
  fails instantly without spawning a worker, and read-only enforcement stays at
  the boundary (Requirement 7).
 - The validated, row-limited SQL (and any params) is dispatched to a **worker
  thread** that opens the database `{ readonly: true, fileMustExist: true }`, runs
  the query, and posts back `{ headers, rows, totalRows }` (all values are
  structured-cloneable — primitives, `Buffer`, `BigInt`).
 - The main thread arms a timer for `this.deadlineMs`; on expiry it calls
  `worker.terminate()` and rejects with `queryDeadlineExceededError`. On a normal
  message it clears the timer and resolves. On a worker error (SQLite rejected the
  SQL) it rejects with that error, message preserved. A provided
  `ctx.signal` (`KtxScanContext.signal`, already on the contract) also terminates
  the worker, for external cancellation.
 - **One short-lived worker per call**, terminated on completion or deadline — not
  a persistent worker or pool. Terminate-on-deadline destroys the worker, so a
  pool would need respawn/job-tracking for no benefit: `executeReadOnly` is
  low-frequency (LLM-issued, serial per agent turn) and worker spawn cost is
  negligible against query latency. The other SQLite paths (introspect, sample,
  stats, distinct-values, row-count) stay on the main thread — they are
  ktx-authored, bounded, and not on the `executeReadOnly` contract.
 - The event loop stays responsive throughout, so `tool.end` is always written and
  concurrent requests on the same port are served.
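 The timer, completion, and external-cancellation rules above can be sketched
 independently of the executor. In this sketch `KillableJob` and `runWithDeadline`
 are hypothetical names: `kill()` stands for whatever hard-stops the off-main-thread
 executor, and the plain `Error` stands in for `queryDeadlineExceededError`.

```typescript
// Hypothetical handle on an off-main-thread query execution.
interface KillableJob<T> {
  result: Promise<T>; // settles when the executor posts rows or an error
  kill(): void; // hard-stops the executor
}

function runWithDeadline<T>(
  job: KillableJob<T>,
  deadlineMs: number,
  signal?: AbortSignal,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let settled = false;

    // Runs `action` at most once and always releases the timer and listener.
    const finish = (action: () => void): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      signal?.removeEventListener("abort", onAbort);
      action();
    };

    const onAbort = (): void => {
      finish(() => {
        job.kill();
        reject(new Error("query aborted"));
      });
    };

    // Deadline expiry: stop the work first, then report the canonical message.
    const timer = setTimeout(() => {
      finish(() => {
        job.kill();
        reject(new Error(`query exceeded ${Math.round(deadlineMs / 1000)}s`));
      });
    }, deadlineMs);

    if (signal?.aborted) {
      onAbort();
      return;
    }
    signal?.addEventListener("abort", onAbort, { once: true });

    // Normal completion or an executor error (message preserved).
    job.result.then(
      (value) => finish(() => resolve(value)),
      (error) => finish(() => reject(error)),
    );
  });
}
```

 Because every exit path goes through `finish`, the timer is never left armed and
 a late message from an already-killed executor is ignored.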
 ### 4. Remote engines set a real server-side statement timeout
 Each remote connector applies `this.deadlineMs` as its engine's server-side
 statement timeout, so the deadline stops server work rather than abandoning the
 promise:
 | Connector  | Mechanism                                              | Unit          |
 |------------|--------------------------------------------------------|---------------|
 | BigQuery   | `jobTimeoutMs` on the query job (replaces `job_timeout_ms`) | ms       |
 | Postgres   | `statement_timeout`                                    | ms            |
 | MySQL      | session `max_execution_time` (applies to read-only SELECT — the only kind on this path) | ms |
 | Snowflake  | `STATEMENT_TIMEOUT_IN_SECONDS` (ALTER SESSION)         | s (ceil)      |
 | ClickHouse | `max_execution_time` setting, with `request_timeout` aligned to the deadline so the HTTP client does not give up before the server aborts | s (ceil) |
 | SQL Server | `mssql` `requestTimeout` (TDS attention cancels server-side) | ms       |
 ClickHouse's existing hardcoded 30s `request_timeout` is brought under this
 contract (derived from the resolved deadline), not left as a parallel mechanism.
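 The unit column matters because the resolved deadline is always in milliseconds.
 A rough sketch of the conversion follows; the `EngineTimeouts` shape and the
 function name are hypothetical, and only the ms-versus-seconds split and the
 ceiling come from the table above.

```typescript
// Hypothetical bundle of per-engine timeout values derived from one deadline.
interface EngineTimeouts {
  bigQueryJobTimeoutMs: number;
  postgresStatementTimeoutMs: number;
  mysqlMaxExecutionTimeMs: number;
  sqlServerRequestTimeoutMs: number;
  snowflakeStatementTimeoutSeconds: number;
  clickHouseMaxExecutionTimeSeconds: number;
}

function toEngineTimeouts(deadlineMs: number): EngineTimeouts {
  // Second-granularity engines round up, so a sub-second deadline never
  // collapses to 0 (which ClickHouse treats as "no limit").
  const seconds = Math.ceil(deadlineMs / 1000);
  return {
    bigQueryJobTimeoutMs: deadlineMs,
    postgresStatementTimeoutMs: deadlineMs,
    mysqlMaxExecutionTimeMs: deadlineMs,
    sqlServerRequestTimeoutMs: deadlineMs,
    snowflakeStatementTimeoutSeconds: seconds,
    clickHouseMaxExecutionTimeSeconds: seconds,
  };
}
```

 With the default deadline this yields 30 for the second-based engines; a test
 seam of 1500 ms yields 2, not 1.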
 ### 5. Timeout resolves as a `KtxQueryError` with the canonical message
 On exceeding the deadline, the path fails with a `KtxQueryError`
 (`query exceeded {N}s`): a finite, decision-reaching outcome, never an unbounded
 hang. For SQLite the worker-termination path throws `queryDeadlineExceededError`
 directly. For remote engines, each connector recognizes **its own** engine's
 timeout signal (Postgres `57014`; MySQL errno `3024`; ClickHouse code `159`;
 SQL Server `ETIMEOUT`; Snowflake and BigQuery timeout errors) and re-wraps it as
 `queryDeadlineExceededError`, keeping the driver error as `cause`. Each connector
 owns its driver's signal — there is no central denylist of error codes to
 maintain.
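 As an illustration of "each connector owns its driver's signal", here is a
 hedged sketch for the Postgres case only. The `KtxQueryError` class is a local
 stand-in and the helper names are hypothetical; the `57014` SQLSTATE is the one
 named above.

```typescript
// Local stand-in for ktx's KtxQueryError, keeping the driver error as `cause`.
class KtxQueryError extends Error {
  readonly cause: unknown;
  constructor(message: string, cause?: unknown) {
    super(message);
    this.name = "KtxQueryError";
    this.cause = cause;
  }
}

// SQLSTATE 57014 is `query_canceled`; Postgres raises it when
// `statement_timeout` fires (and also for other cancellations).
const POSTGRES_QUERY_CANCELED = "57014";

function isPostgresTimeout(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { code?: unknown }).code === POSTGRES_QUERY_CANCELED
  );
}

// Re-wraps the engine's timeout signal; every other error passes through untouched.
function rewrapPostgresTimeout(error: unknown, deadlineMs: number): unknown {
  if (!isPostgresTimeout(error)) return error;
  return new KtxQueryError(
    `query exceeded ${Math.round(deadlineMs / 1000)}s`,
    error,
  );
}
```

 The other connectors would follow the same shape with their own predicate
 (errno `3024`, code `159`, `ETIMEOUT`), each living next to its driver.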
 ### 6. MCP surfacing and logging via the existing pino path
 The MCP `sql_execution` path already (a) maps any non-native driver error to
 `KtxQueryError` (`context/mcp/local-project-ports.ts:78–88`, guarded by
 `isNativeProgrammingFault`), (b) reports it through `reportException`, which skips
 `$exception` Error Tracking for `KtxExpectedError`, and (c) writes `tool.start`
 synchronously before the handler and `tool.end` in `instrumentMcpServer`
 (`context/mcp/context-tools.ts:644–730`). The deadline cases MUST surface through
 this path — the implementer verifies and tests them, but adds **no parallel
 classification or logging path**:
 - **Query exceeds the deadline (any driver):** a `tool.end` at **`error`** with
  `outcome:"error"` and `err.message = "query exceeded {N}s"`, carrying the same
  `callId` as the `tool.start`. Classified as an expected error, so it is absent
  from `$exception` Error Tracking. The reason `tool.end` was previously missing
  is solely the blocked event loop (Requirement 3); once the loop stays free and
  the deadline throws, the existing instrumentation logs the matched pair — closing
  spec 15's "`tool.start` with no `tool.end` = hang" gap for this case.
 - **Completed-but-slow query (under the deadline, over `KTX_MCP_SLOW_TOOL_MS`):**
  unchanged from spec 15 — its `tool.end` is emitted at **`warn`**. The deadline
  (default 30s) and the slow threshold (default 10s) are independent knobs; a query
  between 10s and 30s completes with a slow `warn`, one past 30s is killed with the
  `error` above.
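 The interaction of the two knobs can be summarized in a few lines. This is a
 sketch of the decision only, not the `instrumentMcpServer` code: the function
 name is hypothetical, and `info` as the level for a normal `tool.end` is an
 assumption (the spec only fixes `warn` for slow and `error` for failed calls).

```typescript
type LogLevel = "info" | "warn" | "error";

// Default of KTX_MCP_SLOW_TOOL_MS as stated above.
const DEFAULT_SLOW_TOOL_MS = 10_000;

// Picks the level for a tool.end line. An errored call (including a deadline
// timeout) is always `error`; a successful but slow call is `warn`.
function toolEndLevel(
  outcome: "ok" | "error",
  durationMs: number,
  slowMs: number = DEFAULT_SLOW_TOOL_MS,
): LogLevel {
  if (outcome === "error") return "error";
  return durationMs > slowMs ? "warn" : "info"; // `info` is assumed
}
```

 With the defaults, a 15s query that completes logs at `warn`, while a query
 killed at 30s logs at `error` regardless of the slow threshold.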
 ### 7. Read-only enforcement and `maxRows` unchanged
 `assertReadOnlySql` and the `maxRows` row cap (`limitSqlForExecution`) behave
 exactly as today. The deadline is additive. `maxRows` is not a substitute for it
 (it bounds returned rows, not scan work).
 ### 8. Best-effort callers treat a deadline timeout as recoverable
 The non-interactive `executeReadOnly` call sites that are best-effort —
 relationship profiling, composite-candidate probes, relationship validation,
 historic-SQL probes — MUST treat a deadline `KtxQueryError` as "skip this
 probe / mark unprofiled" and continue, never as a source-fatal error. The
 implementer confirms each such site already swallows query errors into a
 graceful-skip and adds that handling where it does not, so the uniform deadline
 (Requirement 1, applied to all callers) cannot abort an ingest run. A skipped
 probe is logged at the skip site through that path's existing scan/ingest logger
 (`KtxScanContext.logger`, `warn`/`debug`), never silently dropped — these callers
 are off the MCP tool-call path, so their visibility comes from the logger they
 already use.
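 A sketch of the graceful-skip shape this requirement asks for, assuming a
 hypothetical `runBestEffortProbe` helper and a minimal logger interface; the
 `KtxQueryError` class is a local stand-in. Real call sites already have their own
 structure, so this only shows the intended control flow.

```typescript
// Local stand-in for ktx's KtxQueryError.
class KtxQueryError extends Error {}

// Minimal logger surface assumed for the sketch.
interface ProbeLogger {
  warn(message: string): void;
}

type ProbeOutcome<T> =
  | { status: "ok"; value: T }
  | { status: "skipped"; reason: string };

// Runs one probe. A KtxQueryError (such as a deadline timeout) becomes a logged
// skip; anything else is a genuine fault and is rethrown.
async function runBestEffortProbe<T>(
  name: string,
  probe: () => Promise<T>,
  logger: ProbeLogger,
): Promise<ProbeOutcome<T>> {
  try {
    return { status: "ok", value: await probe() };
  } catch (error) {
    if (error instanceof KtxQueryError) {
      logger.warn(`skipping probe ${name}: ${error.message}`);
      return { status: "skipped", reason: error.message };
    }
    throw error;
  }
}
```

 The caller continues with the next probe on `skipped`, so one slow join cannot
 abort an ingest run, and the skip is still visible in the scan log.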
 ## Acceptance criteria
 - A read query that exceeds the deadline returns a `KtxQueryError`
  (`query exceeded {N}s`) within roughly the deadline; the MCP server stays
  responsive (a concurrent tool call on the same server completes while the slow
  query is still pending) and writes a matching `tool.end` with a non-ok outcome.
 - **Logging:** a timed-out `sql_execution` produces a `tool.start` and a matching
  `tool.end` (same `callId`) at `error` with `outcome:"error"` and
  `err.message = "query exceeded {N}s"` — no unmatched `tool.start` remains. The
  timeout does not raise a `$exception` Error Tracking event (it is a
  `KtxExpectedError`). A completed query slower than `KTX_MCP_SLOW_TOOL_MS` but
  under the deadline still emits its `tool.end` at `warn`. No new logger is
  introduced — the lines come from the existing `instrumentMcpServer`.
 - **SQLite specifically:** executing a deliberately pathological query (an
  expensive VIEW or an unindexed cross join) on a fixture does not block the event
  loop, is terminated at the deadline, and the worker exits (the off-main-thread
  executor is killed, not left spinning) so CPU returns to idle.
 - **One server-side-timeout driver (Postgres):** the connector applies
  `statement_timeout` equal to the resolved deadline, and a `57014` cancellation
  is mapped to the canonical `KtxQueryError`.
 - `resolveQueryDeadlineMs` returns 30_000 by default, honors a `query_timeout_ms`
  override, and rejects an invalid value (zero / negative / non-integer).
 - **No regression:** normal fast queries return identical results; read-only
  rejection still works; `maxRows` still bounds returned rows.
 - The shared `query_timeout_ms` field is accepted by every connector; BigQuery's
  former `job_timeout_ms` key is gone and BigQuery's timeout is driven by the
  shared field.
 ## Non-goals
 - **A row/byte/cost budget on returned data.** This spec bounds *time*, not result
  size — `maxRows` already bounds rows, and BigQuery's `maximumBytesBilled` is a
  separate, retained concern.
 - **A global `KTX_QUERY_TIMEOUT_MS` or per-call user flag.** One opinionated
  default plus a per-connection override; no per-call knob, no global knob.
 - **A server watchdog that recycles the process on an unmatched `tool.start`.**
  Spec 15 names this as a possible future mitigation; this spec prevents the hang
  at the source, so the watchdog is out of scope here.
 - **Moving SQLite introspection / sampling / stats off the main thread.** Only the
  `executeReadOnly` (LLM-SQL) path needs worker isolation; the rest are bounded
  ktx-authored queries.
 - **Per-connection retry / backoff on timeout.** A timeout returns a clean error
  for the agent to revise; ktx does not auto-retry.
 - **A second logger threaded into the connector.** The deadline cases are logged
  through spec 15's existing MCP tool-call logger; the connector gets no separate
  pino instance and `KtxScanContext` gets no MCP-logger thread (see "Logging routes
  through spec 15's pino path").
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the design.
 - **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`:
  `DEFAULT_QUERY_TIMEOUT_MS`, `resolveQueryDeadlineMs`, `queryDeadlineExceededError`.
  Error class is `KtxQueryError` (`packages/cli/src/errors.ts:25`).
 - **Contract anchor** — `KtxScanConnector.executeReadOnly`
  (`context/scan/types.ts:343`), `KtxReadOnlyQueryInput` (`types.ts:285`),
  `KtxScanContext.signal` (`types.ts:176`, already present, currently unused on the
  MCP path).
 - **Config schema** — add `query_timeout_ms` to the shared connection config
  (`context/project/config.ts`, `KtxProjectConnectionConfig` and its zod schema);
  remove BigQuery's `job_timeout_ms` reader.
 - **SQLite worker** — new `packages/cli/src/connectors/sqlite/read-query-worker.ts`
  (constructed by path via `new URL('./read-query-worker.js', import.meta.url)`);
  rework `connectors/sqlite/connector.ts` `executeReadOnly` (247–251) to validate
  on the main thread then dispatch to the worker with a terminate-on-deadline
  timer. Reuse `normalizeQueryRows` (`context/connections/query-executor.ts`) in
  the worker. Register the worker as a dynamic entry in `knip.json` (it is
  referenced by path, not import) and confirm the build copies it into `dist`.
 - **Remote connectors** — apply the resolved deadline and recognize the engine's
  timeout signal in each `executeReadOnly` / `query(...)`:
  `connectors/bigquery/connector.ts` (~491–512, `jobTimeoutMs`),
  `connectors/clickhouse/connector.ts` (~602/629–644, `max_execution_time` +
  `request_timeout`), `connectors/snowflake/connector.ts` (~354–371/510–534,
  `STATEMENT_TIMEOUT_IN_SECONDS`), `connectors/postgres/connector.ts` (~822–838,
  `statement_timeout`), `connectors/mysql/connector.ts` (~774–793,
  `max_execution_time`), `connectors/sqlserver/connector.ts` (~812–832,
  `requestTimeout`).
 - **MCP path + logging (verify only)** — `context/mcp/local-project-ports.ts:69–88`
  (error mapping), the `sql_execution` registration (~915–943), and the logging in
  `instrumentMcpServer` (`context/mcp/context-tools.ts:644–730`, which writes
  `tool.start`/`tool.end` via the spec-15 pino logger `context/mcp/logger.ts`). No
  new classification or logging code; confirm the timeout flows through as an
  expected error producing a matching `tool.end(error)` with the canonical message.
 - **Best-effort callers** — `context/scan/relationship-profiling.ts` (~227, 275),
  `context/scan/relationship-composite-candidates.ts` (~365, 440),
  `context/scan/relationship-validation.ts` (~259),
  `context/ingest/historic-sql-probes/bigquery-runner.ts` (~97), and the
  historic-sql clients: confirm a deadline `KtxQueryError` is swallowed into a
  graceful skip.
 - **Tests** — a SQLite fixture with a pathological query (tiny `query_timeout_ms`
  as the test seam) asserting terminate-on-deadline, event-loop responsiveness
  (a concurrent promise resolves while the query is pending), and worker exit; a
  Postgres test asserting `statement_timeout` is set to the resolved deadline and
  a `57014` error maps to `KtxQueryError`; resolver unit tests (default /
  override / invalid); regression tests for normal results, read-only rejection,
  and `maxRows`. Extend the MCP logging tests (alongside spec 15's, e.g.
  `test/context/mcp/server.test.ts`) to assert a timed-out `sql_execution` yields a
  matched `tool.start`/`tool.end(error)` pair carrying `query exceeded {N}s`.
 - After implementing, rebuild and re-link so the playground picks it up:
  `pnpm run build && pnpm run link:dev`.
 ## Benchmark context (motivation, not a requirement)
 The Spider2-lite local set loads several warehouses into SQLite, some with
 expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` =
 `costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112
 rows, no composite index, and `promo_id` (the index the optimizer picks) holds a
 single value in 95.5% of rows. LLM-authored profiling queries (MIN/MAX/COUNT over such a
 view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval
 shard for 10+ minutes; with one, the agent gets a fast error and can scope the
 query instead. Improving the benchmark is a side effect; the deadline is generic
 production hygiene for any agent that lets an LLM author SQL.
 ## Implementation notes
 Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
 acceptance criteria are met; tests, type-check, dead-code, and build are green
 for the changed surface.
 ### What was built, and where
 - **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`:
  `DEFAULT_QUERY_TIMEOUT_MS = 30_000`, `resolveQueryDeadlineMs(connection)` (returns
  the validated `query_timeout_ms` override else the default; throws on
  zero/negative/non-integer), and `queryDeadlineExceededError(deadlineMs, options?)`
  (a `KtxQueryError` reading `query exceeded ${round(ms/1000)}s`, carrying the
  driver error as `cause`). Unit-tested in `test/context/connections/query-deadline.test.ts`.
 - **Config field** — `query_timeout_ms` (optional positive integer, ms) added to
  the **shared warehouse** schema. NOTE (spec drift): that schema lives in
  `context/project/driver-schemas.ts` (`warehouseConnectionSchema`), not
  `config.ts`. The warehouse schemas use `z.looseObject`, so the field had to be
  declared explicitly to be *validated* (otherwise it would pass through
  unvalidated). BigQuery's `job_timeout_ms` field and `bigQueryJobTimeoutMsFromConnection`
  reader were removed; BigQuery now resolves the shared field. Every connector
  resolves its deadline once at construction via `resolveQueryDeadlineMs`.
 ### Deviation from the spec's SQLite mechanism (worker thread → child process)
 The spec mandated running SQLite read queries on a **worker thread** and enforcing
 the deadline by `worker.terminate()`. This was **empirically disproven**:
 `Worker.terminate()` cannot interrupt a CPU-bound synchronous `better-sqlite3`
 scan — the native `sqlite3_step` loop never yields to V8, so terminate's promise
 never even resolves (an 8s probe of the exact failing query shape confirmed the
 thread keeps spinning). better-sqlite3 v12 exposes no `interrupt`/progress-handler
 API, and `.iterate()` does not help because the failing query is a single
 aggregate row produced only *after* the full scan.
 The implemented mechanism is therefore **`child_process.fork` + `SIGKILL`**
 (`packages/cli/src/connectors/sqlite/read-query-child.ts`, spawned from
 `connector.ts`). SIGKILL lets the OS reclaim the whole process — a probe confirmed
 the scan is interrupted in ~2 ms and CPU returns to idle. This satisfies *both*
 SQLite requirements better than a thread (event loop stays free **and** the query
 is genuinely cancellable). The child is self-contained (imports only
 `better-sqlite3` + node builtins); validation/row-limiting (`limitSqlForExecution`)
 and `normalizeQueryRows` stay on the main thread. One short-lived child per call,
 killed on completion, deadline, or `ctx.signal` abort. Node v24's native
 TS type-stripping lets the `.ts` child load under vitest; a `.js`-if-exists-else-`.ts`
 URL resolver picks the compiled child in `dist`. Registered as a dynamic entry in
 `knip.json`; `tsc` emits it to `dist` (verified, plus a dist-level end-to-end smoke).
 ### Remote connectors (server-side timeouts + own-signal mapping)
 Each applies the resolved deadline server-side and re-wraps its own timeout signal
 as `queryDeadlineExceededError(deadlineMs, { cause })`:
 - **BigQuery** — `jobTimeoutMs` on the query job; maps a "Job timed out" / timeout-reason error.
 - **Postgres** — `statement_timeout` via pool `options` (`-c statement_timeout=<ms>`); maps `57014`.
 - **MySQL** — `SET SESSION max_execution_time = <ms>` before the read; maps errno `3024`.
 - **Snowflake** — `ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = <ceil(s)>` in the pooled connection; maps code `604` / "reached its … timeout".
 - **ClickHouse** — `max_execution_time` (ceil seconds) setting, with `request_timeout` set to `deadline + 5s` so the HTTP client outlasts the server abort (replaces the old hardcoded 30s); maps code `159`.
 - **SQL Server** — `requestTimeout` on the `mssql` pool config (TDS attention cancels server-side); maps `ETIMEOUT`.
 Each connector has a focused test asserting the timeout is applied and its signal
 maps to `KtxQueryError` (Postgres is the spec's required acceptance test).
 ### Best-effort callers (Requirement 8)
 Confirmed already graceful: relationship **profiling** (outer try/catch →
 `profile_failed` warning) and **composite-candidate** detection
 (`detectCompositeRelationships` → recoverable warning, returns `[]`). Historic-SQL
 **probes** flow through `runHistoricSqlReadinessProbe`, which catches *any* error
 into `{ ok: false }`. **Added** handling to relationship **validation**: a
 `KtxQueryError` on the per-candidate coverage probe now sends that one candidate to
 `review` (`validation_query_failed`, logged via `ctx.logger.warn`) instead of
 aborting the whole validation pass. `ingest-query-executor.ts` is a generic
 executor port whose callers own recoverability — left unchanged.
 ### MCP surfacing/logging
 No new MCP classification or logging code. The deadline `KtxQueryError` flows
 through the existing `local-project-ports` mapping → `reportException` (skips
 `$exception` for `KtxExpectedError`; existing test `telemetry/exception.test.ts`
 covers the skip for `KtxQueryError`) → `instrumentMcpServer`, which logs a matched
 `tool.start` → `tool.end(error, level 50)` pair carrying `err.message = "query
 exceeded {N}s"`. A test in `test/context/mcp/server.test.ts` asserts the matched
 pair, closing spec 15's "`tool.start` with no `tool.end` = hang" gap for this case.
 ### Pre-existing branch issues encountered (not part of this feature)
 - `test/mcp-server-factory.test.ts` had a type error (an `as` cast to a shape with
  a fake `context_tool` key, introduced by branch commit `2677b3ef`) that broke
  `tsc -p tsconfig.test.json`. Fixed with a clean single cast to keep the
  type-check gate green; behavior unchanged.
 - `test/skills/analytics-skill-content.test.ts` fails (2 cases: missing
  `**Window functions**` heading and `Expose identity, not just the label` prose
  in `src/skills/analytics/SKILL.md`). This is unrelated analytics-skill (spec
  13/14) content drift committed earlier on the branch; **left untouched** — no
  skill files were modified by this feature.
--- a/spider2-specs/specs/18-bigquery-cross-project-datasets.md
+++ b/spider2-specs/specs/18-bigquery-cross-project-datasets.md
@ -1,418 +0,0 @@
 # BigQuery cross-project dataset introspection (foreign-hosted datasets, billed in own project)
 > Refined spec. Intake draft: `todo/18-bigquery-cross-project-datasets.md`.
 >
 > **Scope: let the BigQuery connector introspect a dataset hosted in a *different*
 > project than the one it bills jobs to.** A `dataset_ids` entry may be written
 > fully-qualified as `project.dataset`; the connector introspects each entry in
 > *its own* project while every job still runs in `credentials.project_id`. A
 > bare `dataset` keeps today's single-project behavior unchanged.
 >
 > Out of scope (confirmed during refinement): the interactive `ktx setup` wizard
 > is **not** expected to *discover* foreign datasets — you cannot enumerate
 > datasets in a project you don't own, and the wizard doesn't know which foreign
 > projects to probe. Users hand-write `project.dataset` entries (in `ktx.yaml` or
 > at the dataset prompt); the connector must accept and introspect them. See
 > *Non-goals*.
 ## Problem
 **ktx**'s BigQuery connector derives a single `projectId` from
 `credentials.project_id` and uses it for **both** job billing **and** schema
 introspection. There is no way to introspect a dataset that lives in another
 project, even though *querying* such a dataset already works (a cross-project
 read in a `FROM` clause bills to the caller's project — that path is proven).
 Confirmed in the current connector (`packages/cli/src/connectors/bigquery/connector.ts`):
 - **`:294`** — `projectId` is read only from `credentials.project_id`. There is
  no separate billing-vs-dataset project. `bigQueryConnectionConfigFromConfig`
  (`:278`–`:301`) returns `datasetIds: string[]` — raw, unparsed.
 - **`datasetIds()` (`:163`)** — returns `dataset_ids` / `dataset_id` verbatim;
  it never parses a `project.` prefix.
 - **`introspectDataset` (`:544`)** — calls `this.getClient().dataset(datasetId)`,
  which resolves the dataset in the **client's (billing) project**, and labels
  every table `catalog: this.resolved.projectId` (`:566`, `:574`) — including the
  introspection-failure warning metadata (`:566`).
 - **`primaryKeys` (`:591`)** — builds `INFORMATION_SCHEMA` SQL as
  `` `<projectId>.<datasetId>.INFORMATION_SCHEMA.TABLE_CONSTRAINTS` `` using the
  **billing** project.
 - **`listTables` (`:453`)** — queries
  `` `<projectId>`.`region-<region>`.INFORMATION_SCHEMA.TABLES `` against the
  **billing** project and labels each row `catalog: this.resolved.projectId`.
 - **`testConnection` (`:344`)** — calls `client.dataset(datasetId).get()` in the
  billing project.
 ### Empirical confirmation (from the intake draft)
 With a service account in project `ktx-spider2-lite`:
 - ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (it looks
  in `projects/ktx-spider2-lite/datasets/austin_311`).
 - The cross-project form `dataset("austin_311", { projectId: "bigquery-public-data" })`
  → **succeeds** (public metadata is readable by any authenticated principal).
 - There is **no config knob** to separate the introspection project from billing.
 ### Why the table `catalog` label is load-bearing, not cosmetic
 The BigQuery dialect generates **three-part `catalog.db.name`** SQL
 (`connectors/bigquery/dialect.ts:38` → `formatDialectTableName(..., 'three-part')`;
 `context/connections/dialect-helpers.ts:27`–`32` emits `catalog.db.name`). The
 `catalog` stored on each scanned table is therefore the project that *every*
 later query targets — `sampleTable`, `sampleColumn`, `getColumnDistinctValues`,
 and ref-based `executeReadOnly` all format the ref through the dialect. If a
 foreign dataset's tables are labeled with the billing project, every one of those
 queries becomes `` `billing-project`.`austin_311`.`table` `` → `404`. So labeling
 the table `catalog` with the dataset's own project is a **correctness
 requirement**, and it is the single lever that makes sampling, dictionary value
 extraction, and `discover_data` all resolve once the snapshot is right.
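To make the failure concrete, here is a minimal sketch of three-part formatting. `ScannedTableRef` and `formatThreePart` are hypothetical stand-ins for the real ref type and dialect helper, not the repo's code; the only point is that the stored `catalog` decides which project the generated SQL targets.

```typescript
// Hypothetical ref shape for illustration; the real type lives in context/scan/table-ref.ts.
interface ScannedTableRef {
  catalog: string; // project the generated SQL will target
  db: string; // dataset
  table: string;
}

// Assumed stand-in for the dialect's three-part formatter.
function formatThreePart(ref: ScannedTableRef): string {
  return [ref.catalog, ref.db, ref.table].map((part) => "`" + part + "`").join(".");
}

// The same foreign table, labeled two ways at scan time.
const labeledWithBilling: ScannedTableRef = { catalog: "billing-project", db: "austin_311", table: "table" };
const labeledWithOwner: ScannedTableRef = { catalog: "bigquery-public-data", db: "austin_311", table: "table" };

// Targets the billing project, where the dataset does not exist.
const brokenSql = "SELECT * FROM " + formatThreePart(labeledWithBilling) + " LIMIT 5";
// Targets the project that actually hosts the dataset.
const workingSql = "SELECT * FROM " + formatThreePart(labeledWithOwner) + " LIMIT 5";
```

Nothing downstream can repair the first form, because every consumer formats the ref through the same helper.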
 ### One introspection path, no divergence
 `connectors/bigquery/live-database-introspection.ts` wraps
 `KtxBigQueryScanConnector.introspect` directly, so the ingest and live-database
 paths share **one** introspection implementation. The SDK already supports the
 fix: `client.dataset(id, { projectId })` — `@google-cloud/bigquery@8.3.1`'s
 `DatasetOptions` exposes `projectId?: string`.
 ## Generic use case (independent of any benchmark)
 Analysts routinely introspect datasets they can **read but do not own and do not
 bill to**: Google's `bigquery-public-data`, a partner's shared project, an
 organization's central data project that a smaller team queries from its own
 billing project. To make those connectable in **ktx** — so `discover_data`, the
 semantic layer, dictionary sampling, and `sql_dialect_notes` all work — the
 connector must introspect a foreign-hosted dataset while billing jobs in the
 credentials' own project. This is a standard BigQuery deployment shape and is
 wholly independent of any benchmark.
 The class to design for is "the dataset's project ≠ the billing project," and it
 must generalize beyond one example: a single connection may reference datasets in
 **several** foreign projects at once (e.g. one slice mixing `bigquery-public-data`
 and `isb-cgc-bq`), and two different projects may host datasets with the **same
 name**. The design must keep those distinct.
 ## Design decisions (resolved during refinement)
 These resolve ambiguities the intake draft left open. They constrain the
 implementer; the exact code is theirs.
 ### Carry the project inline on each dataset entry — no separate knob
 The introspection project is expressed **per dataset**, inline, as the optional
 `project.` prefix on a `dataset_ids` / `dataset_id` entry. There is no new config
 field.
 > Rejected alternative: a separate connection-level `dataset_project` (or
 > `introspection_project`) field. It is a speculative runtime knob (against the
 > repo's opinionated-defaults rule) and, more decisively, it **cannot express the
 > requirement**: one connection must span *multiple* foreign projects, which a
 > single global field cannot represent. The inline form also derives scope from
 > the user's own declared input rather than adding a parallel setting.
 ### Parse to canonical `{ project, dataset }` pairs at the config boundary
 Each entry is parsed **once**, in `bigQueryConnectionConfigFromConfig` /
 `datasetIds()`, into a canonical pair: the project (when no prefix is present,
 default it to `credentials.project_id`) and the bare dataset id. Every
 introspection-side call site reads the resolved pair; nothing downstream re-parses
 a `project.dataset` string.
 > Rejected alternative: keep `datasetIds: string[]` raw and split the prefix
 > lazily at each use site (`introspectDataset`, `primaryKeys`, `listTables`,
 > `testConnection`). That re-implements one rule in four places and is exactly the
 > drift trap the repo's single-source-of-truth rule warns about — a later fix
 > lands on one path and not another. Normalize at the boundary; carry the
 > canonical form downstream.
 The internal resolved-config type (`KtxBigQueryResolvedConnectionConfig.datasetIds`)
 changes shape from `string[]` to a structured pair list. That is an internal type;
 the connector internals and the connector test fixtures are the only consumers.
 ### Parsing rule (at the boundary)
 - An entry contains **at most one `.`**.
 - With a dot: the segment **before** the dot is the project, validated by the
  existing `normalizeBigQueryProjectId` charset
  (`context/connections/bigquery-identifiers.ts`); the segment **after** is the
  dataset id (validated as a normal identifier).
 - Without a dot: a bare dataset; the project defaults to `credentials.project_id`
  (today's behavior).
 - **More than one `.`** (e.g. a stray `proj.ds.table`) is a clear config error
  raised at resolution time, naming the connection — not a silent
  mis-introspection.
 - Legacy domain-scoped project ids that contain `:` (e.g. `example.com:proj`) stay
  **out of scope**, consistent with `normalizeBigQueryProjectId`'s current charset
  (which already rejects `.` and `:` in a project id).
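The rule above can be sketched as a single boundary parser. This is an illustration under assumptions, not the repo's implementation: `parseDatasetEntry` is a hypothetical name, and both regular expressions are placeholder charsets standing in for the real validators in `bigquery-identifiers.ts`.

```typescript
interface BigQueryDatasetRef {
  project: string;
  dataset: string;
}

// Assumed charsets for illustration only; the real validators live in
// context/connections/bigquery-identifiers.ts and may be stricter.
const PROJECT_PATTERN = /^[A-Za-z0-9_-]+$/;
const DATASET_PATTERN = /^[A-Za-z0-9_]+$/;

function parseDatasetEntry(entry: string, defaultProject: string, connectionId: string): BigQueryDatasetRef {
  // Every error names the connection and the offending entry.
  const fail = (reason: string): never => {
    throw new Error("connections." + connectionId + '.dataset_ids entry "' + entry + '": ' + reason);
  };
  const segments = entry.trim().split(".");
  if (segments.length > 2) fail("expected `dataset` or `project.dataset`");
  // Zero dots: bare dataset in the default (billing) project.
  // One dot: the first segment is the project, the second the dataset.
  const dataset = segments[segments.length - 1] ?? "";
  const project = segments.length === 2 ? (segments[0] ?? "") : defaultProject;
  if (!PROJECT_PATTERN.test(project)) fail("invalid or empty project segment");
  if (!DATASET_PATTERN.test(dataset)) fail("invalid or empty dataset segment");
  return { project, dataset };
}
```

Because an empty segment fails the charset test, `proj.` and `.ds` are rejected by the same checks that reject illegal characters, so no separate emptiness branch is needed.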
 ### Billing is never the dataset's project
 The BigQuery client is still constructed with `projectId = credentials.project_id`
 (`getClient()`, `:487`–`:495`), and `createQueryJob` always bills there. Only the
 *introspection* surfaces switch to the per-dataset project. Cross-project reads in
 a `FROM` clause already bill to the caller — unchanged and already proven.
 ### Dataset identity downstream is `(catalog, db)`
 Scanned tables are keyed by `(catalog, db, name)` throughout
 (`context/scan/table-ref.ts`; `context/scan/warehouse-catalog.ts:107`). Because
 the table `catalog` now holds the dataset's own project, two foreign projects that
 each host an `austin_311` dataset remain distinct with no extra work, provided the
 snapshot's `scope` / `metadata` also preserve the project (R6).
 ### Setup-wizard scope: accept, don't discover
 The connector's region-scoped `listTables` (`:453`) is consumed **only** by the
 `ktx setup` wizard's table-selection step (`setup-databases.ts`); the
 ingest / `discover_data` path reads persisted snapshot JSON via
 `WarehouseCatalogService.listTables`, not the connector method. The wizard is not
 expected to enumerate foreign datasets (you can't list a project you don't own).
 A `project.dataset` value hand-entered at the dataset prompt, or written into
 `ktx.yaml`, must be accepted, validated, and introspected. See *Non-goals* for the
 region caveat that follows from this.
 ## Requirements
 ### R1 — Accept and parse `project.dataset` at the config boundary
 `datasetIds()` / `bigQueryConnectionConfigFromConfig` resolve each
 `dataset_ids` and `dataset_id` entry into a canonical `{ project, dataset }` pair
 per the parsing rule above, defaulting `project` to `credentials.project_id` when
 unprefixed. A malformed entry (more than one `.`, an empty project or dataset
 segment, or a project/dataset that fails identifier validation) raises a clear
 error at resolution time that names the connection id.
 ### R2 — Introspect each dataset in its own project
 `introspectDataset` resolves the dataset via the **dataset's** project —
 `client.dataset(datasetId, { projectId })` — for `getTables()` and each
 `tableRef.get()`. This requires extending the `KtxBigQueryClient.dataset` port to
 accept the project (e.g. `dataset(id, projectId)` / `dataset(id, { projectId })`)
 and forwarding it from `DefaultBigQueryClientFactory`.
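A minimal sketch of the extended port, together with the kind of recording fake the connector tests could use. All type and class names here are hypothetical, and the handle is synchronous for brevity (the real SDK calls are async); the sketch only shows the project being threaded through and reused as the table `catalog`.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

interface DatasetHandle {
  tableNames(): string[]; // synchronous for brevity; the real SDK is async
}

// Hypothetical port shape: the dataset's project is passed explicitly.
interface BigQueryClientPort {
  dataset(datasetId: string, projectId: string): DatasetHandle;
}

interface ScannedTable {
  catalog: string;
  db: string;
  table: string;
}

// Fake client that records which project each dataset was resolved in.
class RecordingFakeClient implements BigQueryClientPort {
  readonly calls: DatasetRef[] = [];
  private readonly tablesByDataset: Record<string, string[]>;

  constructor(tablesByDataset: Record<string, string[]>) {
    this.tablesByDataset = tablesByDataset;
  }

  dataset(datasetId: string, projectId: string): DatasetHandle {
    this.calls.push({ project: projectId, dataset: datasetId });
    const key = projectId + "." + datasetId;
    return { tableNames: () => this.tablesByDataset[key] ?? [] };
  }
}

// Resolve the dataset in its own project and label every table with that project.
function introspectDataset(client: BigQueryClientPort, ref: DatasetRef): ScannedTable[] {
  return client
    .dataset(ref.dataset, ref.project)
    .tableNames()
    .map((table) => ({ catalog: ref.project, db: ref.dataset, table }));
}
```

A test can then assert on `calls` to confirm that a prefixed entry reached the fake with the foreign project and a bare entry with the billing project.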
 ### R3 — Label table `catalog` with the dataset's project
 Every table produced by `introspectDataset` is labeled `catalog: <dataset's
 project>` (not the billing project), and the introspection-failure warning
 metadata (`object` / `catalog`) likewise reflects the dataset's project. This is
 what makes downstream sample/distinct-value/read queries resolve.
 ### R4 — Primary-key discovery targets the dataset's project
 The `primaryKeys` `INFORMATION_SCHEMA.TABLE_CONSTRAINTS` /
 `KEY_COLUMN_USAGE` SQL is built against
 `` `<dataset's project>.<datasetId>.INFORMATION_SCHEMA…` ``. (This INFORMATION_SCHEMA
 view is dataset-qualified and therefore region-independent.) Its existing
 soft-fail-on-denied behavior (`tryConstraintQuery`, scan warning) is preserved.
 ### R5 — `listTables` lists each dataset in its own project
 `listTables` returns rows labeled `catalog: <that dataset's project>` and queries
 each referenced project's region `INFORMATION_SCHEMA.TABLES`. Because a connection
 can now span projects, it queries per distinct project rather than assuming one.
 (This is the setup-wizard surface — see the cross-region caveat in *Non-goals*.)
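One way to picture the per-project fan-out is sketched below. `regionTablesQueries` is a hypothetical helper, the selected columns are an assumption, and dataset names are inlined only because they have already passed identifier validation; a real implementation may use query parameters instead.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

interface ProjectQuery {
  project: string; // also the catalog label for every returned row
  sql: string;
}

// Build one region-scoped INFORMATION_SCHEMA.TABLES query per distinct project.
function regionTablesQueries(refs: DatasetRef[], region: string): ProjectQuery[] {
  const datasetsByProject = new Map<string, string[]>();
  for (const ref of refs) {
    const datasets = datasetsByProject.get(ref.project) ?? [];
    datasets.push(ref.dataset);
    datasetsByProject.set(ref.project, datasets);
  }
  const queries: ProjectQuery[] = [];
  datasetsByProject.forEach((datasets, project) => {
    // Inlined only because the ids were validated at the config boundary.
    const quoted = datasets.map((d) => "'" + d + "'").join(", ");
    queries.push({
      project,
      sql:
        "SELECT table_schema, table_name FROM `" + project + "`.`region-" + region + "`.INFORMATION_SCHEMA.TABLES" +
        " WHERE table_schema IN (" + quoted + ")",
    });
  });
  return queries;
}
```

With only bare entries the map has a single key, so the output collapses to the one billing-project query that exists today.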
 ### R6 — Snapshot scope and metadata reflect multiple projects
 `introspect`'s returned snapshot keeps `metadata.project_id` = the **billing**
 project, but `scope.catalogs` becomes the **distinct set of dataset projects**
 actually introspected. `scope.datasets` / `metadata.datasets` must stay
 unambiguous when two projects share a dataset name (e.g. carry the qualified
 `project.dataset`, or otherwise preserve the project). The scoped table-name
 lookup that today passes `catalog: this.resolved.projectId` (`:359`) must pass
 each dataset's own project so `tableScope` / `enabled_tables` filtering still
 matches.
 ### R7 — `testConnection` resolves foreign datasets
 `testConnection` validates each configured dataset via its own project
 (`client.dataset(datasetId, { projectId }).get()`), so a connection pointing only
 at foreign datasets reports success rather than a spurious `404`.
 ### R8 — Billing unchanged; bare dataset is a strict no-op
 `createQueryJob` continues to bill in `credentials.project_id`. A connection whose
 `dataset_ids` are all bare (no `project.` prefix) behaves **exactly** as before:
 same resolved project, same `catalog` labels, same INFORMATION_SCHEMA targets, no
 behavioral change.
 ### R9 — `getTableRowCount` honors the parsed entry
 `getTableRowCount`'s default-dataset handling (`:431`, today
 `this.resolved.datasetIds[0]`) resolves through the canonical pair so a foreign
 default dataset is introspected in its own project.
 ### R10 — Docs reflect the qualified form
 Document that a BigQuery `dataset_ids` / `dataset_id` entry may be written
 `project.dataset` to introspect a dataset hosted in another project (billing stays
 in `credentials.project_id`). Update the BigQuery rows/examples in
 `docs-site/content/docs/configuration/ktx-yaml.mdx` and
 `docs-site/content/docs/integrations/primary-sources.mdx` (and the dataset-scope
 note in `docs-site/content/docs/cli-reference/ktx-setup.mdx`). Keep examples
 copy-pasteable and follow the `fumadocs-mdx-structure` skill.
 ## Acceptance criteria
 1. **Foreign single-project introspection.** With credentials in project
   `ktx-spider2-lite` and `dataset_ids: ['bigquery-public-data.austin_311']`,
   `ktx ingest <conn>` introspects the tables, enriches, and samples values;
   `discover_data` / `dictionary_search` return them. Tables are labeled
   `catalog: 'bigquery-public-data'`.
 2. **Multi-project connection.** `dataset_ids: ['bigquery-public-data.x',
   'other-project.y']` introspects **both**, each under its own project; the
   snapshot's `scope.catalogs` contains both projects.
 3. **Cross-project query still bills locally.** `sql_execution` of a
   fully-qualified `project.dataset.table` query runs and bills in
   `credentials.project_id`.
 4. **Same dataset name, two projects.** `['proj-a.shared', 'proj-b.shared']`
   yields two distinct dataset groups; tables do not collide.
 5. **No regression.** `dataset_ids: ['my_dataset']` (or singular `dataset_id`)
   behaves exactly as before — resolved under `credentials.project_id`, same
   `catalog` labels and INFORMATION_SCHEMA targets.
 6. **Malformed entry fails clearly.** `dataset_ids: ['proj.ds.table']` (or an
   empty segment) raises a config error naming the connection, not a `404` at
   scan time.
 7. **Test coverage** (extend `packages/cli/test/connectors/bigquery/connector.test.ts`,
   using the existing fake `clientFactory` harness):
   - the fake `dataset()` is called with the dataset's project for a prefixed
     entry, and with the billing project for a bare entry;
   - a prefixed entry yields tables with `catalog: '<dataset project>'`;
   - a mixed two-project `dataset_ids` introspects both;
   - `bigQueryConnectionConfigFromConfig` rejects a multi-dot / empty-segment
     entry;
   - the existing single-project tests still pass unchanged.
 ## Non-goals
 - **Foreign-dataset discovery in the setup wizard.** The wizard does not
  enumerate datasets in projects the credentials don't own; users supply
  `project.dataset` explicitly (see *Setup-wizard scope: accept, don't discover*).
 - **Cross-region `listTables`.** `listTables`' region-scoped
  `region-<location>.INFORMATION_SCHEMA.TABLES` query uses the connection-level
  `location`; a foreign dataset in a *different* region than the connection's
  `location` will not be listed by that wizard-facing query. This does **not**
  affect ingest/`discover_data`, whose introspection path
  (`introspectDataset` REST metadata + dataset-qualified PK INFORMATION_SCHEMA) is
  region-independent. A per-dataset region knob is a separate spec if ever needed.
 - **Domain-scoped legacy project ids** containing `:` (e.g. `example.com:proj`),
  already unsupported by `normalizeBigQueryProjectId`.
 - **A separate billing/introspection config field** — explicitly rejected above.
 ## Implementation orientation
 Pointers from exploration; line numbers may have drifted, and the implementer owns
 the design.
 - `packages/cli/src/connectors/bigquery/connector.ts`
  - `datasetIds()` (`:163`) and `bigQueryConnectionConfigFromConfig` (`:278`) —
    parse + canonicalize (R1); change `KtxBigQueryResolvedConnectionConfig.datasetIds`
    shape.
  - `KtxBigQueryClient.dataset` port (`:100`–`:110`) and
    `DefaultBigQueryClientFactory.dataset` (`:130`–`:135`) — thread `projectId`
    (R2). `getClient()` (`:487`) keeps the billing project (R8).
  - `introspectDataset` (`:544`) — `dataset(id, { projectId })`, table `catalog`
    + warning metadata (R2, R3).
  - `primaryKeys` (`:591`) — dataset-qualified INFORMATION_SCHEMA (R4).
  - `listTables` (`:453`) — per-project region INFORMATION_SCHEMA + row catalog
    (R5).
  - `introspect` (`:352`) — `scope.catalogs`, `scope.datasets`, scoped-name lookup
    (`:359`) (R6).
  - `testConnection` (`:339`) (R7); `getTableRowCount` (`:431`) (R9).
 - `packages/cli/src/connectors/bigquery/live-database-introspection.ts` — wraps
  `introspect`; no separate change needed (it inherits the fix).
 - `packages/cli/src/context/connections/bigquery-identifiers.ts` —
  `normalizeBigQueryProjectId` is the project-segment validator.
 - `packages/cli/src/context/connections/dialect-helpers.ts` /
  `connectors/bigquery/dialect.ts` — three-part naming; no change, but this is
  *why* R3 matters.
 - After implementing, rebuild and re-link so the playground picks it up:
  `pnpm run build && pnpm run link:dev`. Run
  `pnpm --filter @kaelio/ktx run type-check` and the connector test suite.
 ## Benchmark context (motivation, not a requirement — do not encode benchmark specifics)
 Without this feature, Spider 2.0-Lite's **BigQuery slice (~205 questions)** cannot be
 served faithfully: every one of its ~74 logical databases groups datasets hosted in
 foreign public projects (`bigquery-public-data`, `isb-cgc-bq`,
 `data-to-insights`, …), never in a project we own. Query execution already works
 cross-project; ktx-only *discovery* is the sole blocker, and it is blocked precisely
 because the connector can't introspect a foreign-hosted dataset. Of the ~74 BQ
 databases only **one** spans more than one source project, so "let `dataset_ids`
 carry `project.dataset` and introspect each in its own project" covers the
 benchmark and the general case alike. None of these project names belong in the
 code — they are derived from the user's own `dataset_ids` input.
 ## Implementation notes
 Implemented on branch `write-feature-spec-wiki`. The whole change is contained in
 the BigQuery connector, its identifier helpers, the connector test suite, and three
 docs pages.
 **Config boundary (R1).** Added `normalizeBigQueryDatasetId`
 (`packages/cli/src/context/connections/bigquery-identifiers.ts`, charset
 `[A-Za-z0-9_]`) next to the existing project/region validators. In
 `connectors/bigquery/connector.ts`, a single `parseBigQueryDatasetEntry(entry,
 defaultProject, connectionId)` parses one entry by splitting on `.`: zero dots →
 bare dataset in `defaultProject`; one dot → `project.dataset` (each segment
 validated; empty segment throws); two or more dots → throws. `resolveDatasetRefs`
 resolves `env:`/`file:` references first, trims/filters empties, then parses each.
 `bigQueryConnectionConfigFromConfig` calls it with the billing `project_id` as the
 default, so the canonical pair list is produced once at the boundary.
 `KtxBigQueryResolvedConnectionConfig.datasetIds` changed from `string[]` to the new
 `BigQueryDatasetRef[]` (`{ project, dataset }`). All errors name
 `connections.<id>.dataset_ids entry "<entry>"`.
 **Client port (R2).** `KtxBigQueryClient.dataset` now takes
 `(datasetId, projectId)`; `DefaultBigQueryClientFactory` forwards
 `client.dataset(datasetId, { projectId })` (`@google-cloud/bigquery` `DatasetOptions.projectId`).
 `getClient()` still constructs the client with the **billing** `project_id`, so
 `createQueryJob` bills locally regardless of the dataset's project (R8, acceptance 3).
 **Per-dataset introspection (R3–R7, R9).** Every introspection site reads the
 resolved pair: `introspectDataset(ref, …)` resolves `dataset(ref.dataset, ref.project)`
 and labels tables (and the introspection-failure warning, via `tryIntrospectObject`'s
 `catalog.db.object`) with `ref.project`; `primaryKeys(ref)` builds dataset-qualified
 `` `<project>.<dataset>.INFORMATION_SCHEMA…` `` SQL; `testConnection` validates each
 dataset under its own project; `getTableRowCount`'s default resolves through the first
 pair. `introspect` sets `scope.catalogs` to the distinct set of dataset projects and
 keeps `metadata.project_id` = billing. `scope.datasets` / `metadata.datasets` use a
 `qualifiedDatasetLabel` helper — bare in the billing project (so the single-project
 snapshot is byte-for-byte unchanged), `project.dataset` otherwise (so two projects with
 the same dataset name stay distinct, R6/acceptance 4).
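The scope rule described above can be sketched as follows. The name `qualifiedDatasetLabel` comes from the notes, but its body here is a guess at the behavior described, and `snapshotScope` is a hypothetical wrapper added for illustration.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Bare label inside the billing project, qualified label everywhere else.
function qualifiedDatasetLabel(ref: DatasetRef, billingProject: string): string {
  return ref.project === billingProject ? ref.dataset : ref.project + "." + ref.dataset;
}

// Hypothetical helper: distinct projects in first-seen order, plus dataset labels.
function snapshotScope(refs: DatasetRef[], billingProject: string): { catalogs: string[]; datasets: string[] } {
  const catalogs: string[] = [];
  for (const ref of refs) {
    if (catalogs.indexOf(ref.project) === -1) catalogs.push(ref.project);
  }
  return { catalogs, datasets: refs.map((ref) => qualifiedDatasetLabel(ref, billingProject)) };
}
```

A dataset id cannot contain a dot, so a label with a dot is always a qualified one and the two forms never collide.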
 **`listTables` (R5).** Split into `listTables` (parse override entries, group by
 project) and `listTablesInProject(project, region, datasets?)`. With no override it
 lists the billing project's region (unchanged); with an override it runs one
 region-`INFORMATION_SCHEMA.TABLES` query per distinct project, filtered to that
 project's bare datasets, and labels rows with that project. The existing single-region
 test is unchanged (bare entries collapse to one billing-project query).
 **Docs (R10).** Added a "Cross-project datasets" subsection to
 `integrations/primary-sources.mdx` (qualified-entry example + the setup/region caveats),
 plus pointers from `configuration/ktx-yaml.mdx` and `cli-reference/ktx-setup.mdx`.
 **Tests.** Extended `test/connectors/bigquery/connector.test.ts`: parse-to-pairs and
 malformed-entry rejection (`proj.ds.table`, `proj.`, `.ds`); a foreign-only connection
 calls `dataset('austin_311', 'bigquery-public-data')`, labels tables
 `catalog: 'bigquery-public-data'`, builds the client with the billing project, and keeps
 `metadata.project_id` local; a mixed `['bigquery-public-data.austin_311', 'analytics']`
 connection introspects both under their own projects; and `['proj_a.shared',
 'proj_b.shared']` stays distinct. The internal `datasetIds`-shape assertion was updated
 to the pair list; all pre-existing behavioral tests pass unchanged.
 **Verification.** `pnpm --filter @kaelio/ktx run type-check`, the connector suite
 (18 tests), `test/setup-databases.test.ts` + `bigquery-identifiers.test.ts`,
 `pnpm run build`, `pnpm run dead-code` (Biome + Knip default + production),
 `pnpm run link:dev` (`ktx-dev` → 0.12.0), and `pre-commit` on the changed files all
 pass. Acceptance criteria 1–4 are exercised by unit tests with the fake client factory;
 criteria 5–6 by unit tests; criterion 3 (cross-project query bills locally) is
 structurally guaranteed (single billing client) and asserted via the `createClient`
 project. End-to-end ingest against live `bigquery-public-data` was not run here (no live
 credentials in this worktree); the `link:dev` binary is ready for the playground agent to
 validate.
 **No deviations from the spec design.** The only judgment call: `scope.datasets`
 renders bare-in-billing / qualified-otherwise rather than always-qualified, chosen to
 satisfy both the no-regression requirement (R8/acceptance 5) and the disambiguation
 requirement (R6/acceptance 4) with one unambiguous, dot-delimited form.
--- a/spider2-specs/specs/19-durable-bounded-relationship-detection.md
+++ b/spider2-specs/specs/19-durable-bounded-relationship-detection.md
@ -1,471 +0,0 @@
 # Durable, resumable, bounded relationship detection during ingest enrichment
 > Refined spec. Intake draft: `todo/19-durable-bounded-relationship-detection.md`.
 >
 > **Scope: make the expensive part of ingest enrichment survive an interrupted
 > relationship stage.** Today the paid LLM descriptions + embeddings only become
 > durable and queryable after the slowest, most-killable, least-valuable stage
 > (relationship detection) also finishes. This spec moves the persistence boundary
 > to the cost boundary, makes stage resume work across runs, and bounds + observes
 > the one open-ended stage — the durability companion to spec 16 (bounded query
 > execution), which this spec composes with rather than replaces.
 ## Problem
 Three compounding failure modes, all confirmed in the current code, share one root
 cause: **the three enrichment stages are treated as a single atomic unit for
 persistence, identity, and bounding, even though they differ radically in cost,
 durability value, runtime, and likelihood of being killed.**
 `runLocalScanEnrichment` (`context/scan/local-enrichment.ts:472`) runs three stages
 in a fixed order through `runEnrichmentStage` (`:413`):
 | stage | order | cost | durability value | runtime on a large schema | likely to be killed |
 |-------|-------|------|------------------|---------------------------|---------------------|
 | `descriptions` (`:524`) | 1st | high — one paid LLM call per table | high | minutes | low |
 | `embeddings` (`:553`) | 2nd | medium | high | seconds–minutes | low |
 | `relationships` (`:587`) | 3rd | low — best-effort joins | low | **minutes, silent** | **high** |
 The slowest, most-killable, least-valuable stage runs **last**, and it gates the
 durability of the two expensive stages held in memory before it.
 ### 1. Enrichment is lost if relationship detection is interrupted
 The queryable artifact agents search and execute against is the `_schema` manifest
 YAML (`semantic-layer/<connectionId>/_schema/*.yaml`). It is written **twice**:
 - bare (native column comments only) early, at `local-scan.ts:473`
  (`writeLocalScanManifestShards`), before enrichment runs; and
 - rewritten **with AI descriptions + accepted joins** by
  `writeLocalScanEnrichmentArtifacts` (`local-enrichment-artifacts.ts:310`), called
  from `local-scan.ts:510` **after** `runLocalScanEnrichment` returns — i.e. after
  all three stages.
 So the descriptions and embeddings reach the queryable layer only via that single
 terminal write. If the process is killed/crashes/times out **during** the
 `relationships` stage, `runLocalScanEnrichment` never returns, the terminal write
 never runs, and the in-memory descriptions + embeddings are discarded — the
 `_schema` retains only the bare native comments from the `:473` write.
 Empirically (intake draft): ingesting a 95-table BigQuery dataset produced full
 descriptions + embeddings (progress reached "Building embeddings 17/17"), then the
 relationship stage ran silently past a supervising deadline and was killed; the
 persisted `_schema` had **0** AI descriptions. The most expensive work is the most
 likely to be thrown away.
 > A stage-state store (below) does save each completed stage's output to an
 > internal SQLite cache as the stage finishes — so the descriptions are not lost to
 > the *resume cache*. They are simply never **promoted** to the queryable `_schema`
 > until the terminal write. The data survives somewhere the agent cannot query, and
 > (per failure mode 2) cannot be reused on the next run either.
 ### 2. Re-running does not resume — it re-spends
 `runEnrichmentStage` resolves a completed stage with
 `findCompletedStage({ runId, stage, inputHash })` (`local-enrichment.ts:427`), and
 the store keys on **`runId`**: `SqliteLocalScanEnrichmentStateStore` declares
 `PRIMARY KEY (run_id, stage)` and filters lookups by `run_id`
 (`sqlite-local-enrichment-state-store.ts:83,91–115`). `runId` is minted fresh per
 ingest invocation (`record.runId`). The cache therefore only resolves *within* one
 run; re-running an interrupted ingest gets a new `runId`, misses every cached
 stage, and **recomputes descriptions + embeddings from scratch** — re-paying for
 LLM work that already succeeded.
 The store already computes and persists `inputHash` next to `runId` —
 a stable `sha256` of `{ snapshot, mode, detectRelationships, providerIdentity,
 relationshipSettings }` (`enrichment-state.ts:78`). The correct content key is
 already on the row; the lookup just uses the volatile column. This is a keying
 defect, not a missing capability.
 ### 3. Relationship detection is unobservable and unbounded
 `discoverKtxRelationships` (`context/scan/relationship-discovery.ts:218`) profiles a
 row sample of **every enabled table** (`profileKtxRelationshipSchema`,
 `relationship-profiling.ts:320` — one sampled query per table at
 `profileConcurrency`, default 4), validates candidate joins
 (`relationship-validation.ts:237` — one coverage query per candidate), and detects
 composite keys (`relationship-composite-candidates.ts:515` — per-table plus
 cross-table queries). None of the controls the rest of the scan pipeline relies on
 were ever wired into this stack:
 - **No progress.** `discoverKtxRelationships` does not accept a progress port; the
  caller can only emit start/end around it (`local-enrichment.ts:600,611` —
  `update(0, 'Detecting relationships')` … `update(1, 'found N')`). Minutes of
  silence between.
 - **No honored cancellation.** `KtxScanContext.signal` exists on the contract
  (`types.ts`) but **no sub-stage reads it**.
 - **No time budget.** Validation has a *count* budget (`validationBudget`, default
  `min(2 × tableCount, 1000)`); profiling and composite detection have none. On a
  schema with hundreds–thousands of tables, profiling is O(tables) silent queries
  with no internal stop condition.
 A supervisor watching for liveness cannot tell a slow-but-working profile from a
 true hang, and nothing inside the stage will voluntarily stop — so on a very large
 schema it runs far past any reasonable deadline and is killed (which, via failure
 mode 1, takes the descriptions with it).
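For contrast, the three missing controls can be pictured in one small loop. This is a sketch of the general shape only, not the current code and not a prescribed design: every name except the `min(2 × tableCount, 1000)` formula is hypothetical, and the loop is synchronous for brevity where real profiling issues async queries at `profileConcurrency`.

```typescript
// The existing count budget for validation, as described above.
function defaultValidationBudget(tableCount: number): number {
  return Math.min(2 * tableCount, 1000);
}

interface CancelSignal {
  readonly aborted: boolean;
}

type StopReason = "completed" | "cancelled" | "time_budget_exhausted";

// Hypothetical controls that a bounded stage would accept.
interface ProfilingControls {
  signal: CancelSignal;
  deadlineMs: number; // absolute time after which no new table is started
  now: () => number; // injectable clock, so the budget is testable
  onProgress: (done: number, total: number) => void;
}

function profileTables(
  tables: string[],
  profileOne: (table: string) => void,
  controls: ProfilingControls,
): { profiled: string[]; stopReason: StopReason } {
  const profiled: string[] = [];
  for (const table of tables) {
    // Check cancellation and the time budget before starting each table.
    if (controls.signal.aborted) return { profiled, stopReason: "cancelled" };
    if (controls.now() >= controls.deadlineMs) return { profiled, stopReason: "time_budget_exhausted" };
    profileOne(table);
    profiled.push(table);
    controls.onProgress(profiled.length, tables.length);
  }
  return { profiled, stopReason: "completed" };
}
```

A supervisor that sees steady `onProgress` calls can distinguish slow work from a hang, and a truncated run returns the tables it did profile instead of being killed with nothing to show.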
 ## Generic use case (independent of any benchmark)
 Any context layer that enriches a real warehouse with paid LLM work must make that
 work durable the instant it is produced, resume it across process restarts without
 re-paying, and bound the open-ended profiling stage so a large catalog cannot hang
 ingest indefinitely. A data team ingesting a 500-table production warehouse over a
 flaky connection, a rate-limited LLM budget, or a CI step with a wall-clock limit
 hits all three failure modes regardless of any benchmark. This is general
 durability and cost hygiene for the ingest pipeline; the benchmark only made it
 acute at scale.
 ## Design decisions (resolved during refinement)
 These resolve ambiguities the intake draft left open. They constrain the
 implementer; the exact code is theirs (requirement-level, per the specs README).
 ### D1 — Checkpoint queryable artifacts at the cost boundary, before relationships
 As soon as the last non-relationship stage completes — `embeddings` when an
 embedding provider is configured, otherwise `descriptions` — persist the
 descriptions + embeddings into the **queryable** `_schema` manifest (and the raw
 `descriptions.json` / `embeddings.json` enrichment artifacts), **before** the
 `relationships` stage runs. The relationship stage then writes its joins on top: the
 manifest builder already re-reads and preserves existing descriptions and
 manual/inferred joins on rewrite (`loadExistingManifestState`,
 `local-enrichment-artifacts.ts:196`), so the second write is additive, not
 destructive.
 Net invariant: **the descriptions + embeddings are always durable and queryable the
 moment they are computed**, even if relationship detection then fails, is
 interrupted, is budget-truncated, or is skipped. A failed/partial/skipped
 relationship stage degrades to "no joins" or "partial joins" — **never** to "no
 descriptions." This is exactly the guarantee that the current terminal-write ordering
 violates.
 The bare `:473` manifest write stays — it is the queryable schema for the
 no-providers / enrichment-disabled path. The checkpoint is an additional write that
 runs only when enrichment produced descriptions.
 > Orientation (the implementer owns the seam): the lowest-coupling shape is a
 > checkpoint hook — `runLocalScanEnrichment` invokes a caller-supplied callback once
 > the last non-relationship stage completes, and `local-scan.ts` supplies a callback
 > that calls the existing `writeLocalScanEnrichmentArtifacts` for the
 > descriptions + embeddings + manifest only (no generated joins yet). The final
 > write after the relationship stage proceeds as today. Relationship-specific
 > artifacts (`relationships.json`, `relationship-profile.json`,
 > `relationship-diagnostics.json`) are written by the final/relationship write, not
 > the checkpoint, so the checkpoint never emits misleading empty relationship
 > diagnostics.
 >
 > Rejected alternative: move all artifact writing inside `runLocalScanEnrichment`
 > (inject the file store / project). That couples the enrichment module to
 > persistence for no gain — the writer already lives in `local-scan.ts` and the
 > checkpoint needs only a one-line hook, not a relocation.
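
The checkpoint-hook seam described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ktx implementation: `Manifest`, `writeManifest`, `runEnrichment`, and `scan` are hypothetical stand-ins for the manifest builder, `runLocalScanEnrichment`, and the driver in `local-scan.ts`, and the stages are synchronous for brevity.

```typescript
// Hypothetical shapes for illustration; the real ktx types differ.
type Descriptions = Record<string, string>; // table name -> AI description
type Join = { from: string; to: string };

interface Manifest {
  descriptions: Descriptions;
  joins: Join[];
}

// Additive write: keeps what the manifest already holds and layers the new
// fields on top, so a second write never erases the first.
function writeManifest(existing: Manifest, update: Partial<Manifest>): Manifest {
  return {
    descriptions: { ...existing.descriptions, ...(update.descriptions ?? {}) },
    joins: update.joins ?? existing.joins,
  };
}

// Enrichment module: knows nothing about persistence, only calls the hook.
function runEnrichment(
  describe: () => Descriptions,
  detectRelationships: () => Join[],
  onCheckpoint: (descriptions: Descriptions) => void,
): Manifest {
  const descriptions = describe();
  onCheckpoint(descriptions); // durable before the risky stage starts
  const joins = detectRelationships(); // may throw
  return { descriptions, joins };
}

// Scan driver: owns persistence and supplies the checkpoint callback.
function scan(describe: () => Descriptions, detectRelationships: () => Join[]): Manifest {
  let manifest: Manifest = { descriptions: {}, joins: [] };
  try {
    const result = runEnrichment(describe, detectRelationships, (descriptions) => {
      manifest = writeManifest(manifest, { descriptions });
    });
    manifest = writeManifest(manifest, result); // final write adds the joins
  } catch {
    // A relationship failure degrades to "no joins"; descriptions are already written.
  }
  return manifest;
}
```

With a relationship stage that throws, `scan` still returns a manifest carrying the descriptions and an empty join list, which is the degradation D1 asks for.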
 ### D2 — Resume by content identity, not by `runId`
 Re-key completed-stage resolution on **`(connectionId, stage, inputHash)`**,
 independent of `runId`, so a re-run with an unchanged schema and config resumes the
 finished `descriptions` / `embeddings` stages from cache and re-runs only what
 actually failed. `inputHash` is already the content fingerprint; `connectionId`
 scopes it to the right source. When several rows share a content identity (one per
 prior run), the most recent `updatedAt` wins.
 `runId` stays on the stored row for diagnostics and for `listRunStages`, but leaves
 the uniqueness/lookup key.
 The state store is a **disposable local resume cache** (`.ktx` local state,
 regenerable from a fresh ingest). Re-key it with **no migration bridge** — recreate
 the table if its on-disk shape differs from the new `(connection_id, stage,
 input_hash)` key, consistent with ktx's no-backward-compatibility policy. Losing the
 old cache only means one ingest cannot resume; it never corrupts a queryable
 artifact.
 > Rejected alternative: include `syncId` or `mode` in the key. `mode` and the rest
 > are already folded into `inputHash`; adding them again would only narrow the key
 > and re-break cross-run resume when an incidental field differs.
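
As a rough sketch of the lookup D2 describes, the function below resolves a completed stage by content identity over an in-memory list. The `StageRow` shape and the array-backed store are assumptions made for illustration; the real store is SQLite-backed and its columns may differ.

```typescript
// Hypothetical row shape; field names mirror the spec, not the real schema.
interface StageRow {
  connectionId: string;
  stage: string;
  inputHash: string;
  runId: string; // kept for diagnostics only, never part of the lookup
  updatedAt: number; // epoch milliseconds
  output: string;
}

interface StageKey {
  connectionId: string;
  stage: string;
  inputHash: string;
}

function findCompletedStage(rows: readonly StageRow[], key: StageKey): StageRow | undefined {
  let latest: StageRow | undefined;
  for (const row of rows) {
    const matches =
      row.connectionId === key.connectionId &&
      row.stage === key.stage &&
      row.inputHash === key.inputHash;
    // The most recent updatedAt wins when several runs share a content identity.
    if (matches && (latest === undefined || row.updatedAt > latest.updatedAt)) {
      latest = row;
    }
  }
  return latest;
}
```

Because `runId` is absent from `StageKey`, a re-run that generates a fresh `runId` still finds the row written by the interrupted run.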
 ### D3 — Make the relationship stage observable and bounded
 Thread three things the rest of the pipeline already supports through
 `discoverKtxRelationships` into profiling, validation, and composite detection:
 - **Progress** through the existing progress port (the relationship phase is
  already `progress?.startPhase(0.25)` at `local-enrichment.ts:586`): emit per-unit
  liveness — "Profiling table K/N", "Validating candidate K/M", and the equivalent
  for composite probing — so a supervisor can distinguish slow-but-working from
  hung.
 - **A flat wall-clock budget** for the whole relationship stage: a new
  `scan.relationships.detectionBudgetMs`, a positive integer of milliseconds,
  project-level, validated like the other `scan.relationships` fields, **default
  600_000 (10 min), enforced by default.** Checked at unit boundaries (before each
  table profile, each candidate validation, each composite probe). It sits **above**
  spec 16's per-query deadline (default 30s): each individual query is already
  bounded; this bounds the *sum* of them.
 - **Honored cancellation:** where `KtxScanContext.signal` is available, the same
  unit-boundary check honors it, so external cancellation stops the stage too.
 On budget exhaustion or abort: stop scheduling new work, let in-flight queries
 finish (each already bounded by spec 16), finalize with the relationships found so
 far, and return a **partial** result — never an unbounded hang and never an
 exception that would lose the checkpointed descriptions.
 > Rejected alternative — per-table-scaled budget (N seconds × table count). It is a
 > second formula to reason about and "more tables → more budget" partly re-opens the
 > unbounded door this requirement closes. One flat, generous, project-level number
 > matches how the other `scan.relationships` knobs are shaped and is enough for a
 > best-effort stage whose partial output is durable and improvable (D4).
 >
 > Rejected alternative — a global `KTX_RELATIONSHIP_BUDGET_MS` env knob or a
 > per-call override. One opinionated project-level default with a config override is
 > the canonical ktx shape; no second runtime path.
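
The unit-boundary check can be illustrated with a small sketch. Everything here is hypothetical: `createBudget` and this sequential `mapWithBudget` are simplified stand-ins (the real stage is concurrent), `AbortLike` is a structural subset of `AbortSignal`, and the injectable `now` exists only to make the clock testable.

```typescript
type StopReason = "budget_exhausted" | "aborted";

interface AbortLike {
  readonly aborted: boolean; // structurally compatible with AbortSignal
}

interface Budget {
  check(): StopReason | null;
}

// Sticky: once a stop reason is observed, every later check reports it.
function createBudget(budgetMs: number, now: () => number, signal?: AbortLike): Budget {
  const deadline = now() + budgetMs;
  let reason: StopReason | null = null;
  return {
    check(): StopReason | null {
      if (reason === null) {
        if (signal?.aborted) reason = "aborted";
        else if (now() >= deadline) reason = "budget_exhausted";
      }
      return reason;
    },
  };
}

// Runs units in order and checks the budget before each one. The check only
// happens while work remains, so a deadline that elapses during the last unit
// does not mark a fully processed stage as partial.
function mapWithBudget<T, R>(
  items: readonly T[],
  budget: Budget,
  work: (item: T) => R,
  onProgress: (done: number, total: number) => void,
): { results: R[]; partial: { reason: StopReason } | null } {
  const results: R[] = [];
  for (const item of items) {
    const stop = budget.check();
    if (stop !== null) return { results, partial: { reason: stop } };
    results.push(work(item));
    onProgress(results.length, items.length); // e.g. "Profiling table K/N"
  }
  return { results, partial: null };
}
```

On exhaustion the function returns the results gathered so far together with a reason instead of throwing, so the caller can persist a partial result.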
 ### D4 — A budget-truncated partial is a successful, cached, completed stage
 A graceful budget stop is **not** a failure. The relationship stage saves its
 partial result like any completed stage (so a plain re-run resumes it for free, no
 re-querying) and marks it `partial` with a reason in the relationship diagnostics
 plus a recoverable scan warning. Because `detectionBudgetMs` lives in
 `relationshipSettings`, which is folded into `inputHash`, **raising the budget
 changes the content identity and triggers a fresh, fuller run**. That is the only
 "try harder" mechanism, with no extra flag or runtime path.
 Distinguish the two stop kinds:
 - **Process killed mid-stage** (crash / SIGKILL / supervisor): nothing is saved as
  completed, so the next run recomputes the relationship stage (after resuming
  descriptions/embeddings from cache via D2). This is the primary durability path.
 - **Graceful budget/abort stop**: a partial *is* saved as completed-partial and
  resumed cheaply on re-run, unless the budget is raised.
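
A hedged sketch of why raising the budget forces a re-run while a plain re-run resumes. The FNV-1a hash and the `RelationshipSettings` shape are stand-ins chosen for illustration (the spec does not say how `computeKtxScanEnrichmentInputHash` is implemented), and `runOrResume` is a hypothetical helper.

```typescript
// Hypothetical settings shape; only the field names come from the spec.
interface RelationshipSettings {
  detectionBudgetMs: number;
  validationBudget: number;
}

interface StageResult {
  joins: number;
  partial: boolean;
}

// FNV-1a over a canonical string: a stand-in for the real fingerprint.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

function inputHash(schemaFingerprint: string, settings: RelationshipSettings): string {
  // Keys are listed explicitly so the serialization order is fixed.
  return fnv1a(
    JSON.stringify({
      schema: schemaFingerprint,
      detectionBudgetMs: settings.detectionBudgetMs,
      validationBudget: settings.validationBudget,
    }),
  );
}

function runOrResume(
  cache: Map<string, StageResult>,
  hash: string,
  compute: () => StageResult,
): StageResult {
  const cached = cache.get(hash);
  if (cached !== undefined) return cached; // completed or completed-partial: resume
  const result = compute(); // a crash here saves nothing, so the next run recomputes
  cache.set(hash, result); // reached only when compute returns normally
  return result;
}
```

A graceful budget stop returns normally and is therefore cached under its hash; a changed `detectionBudgetMs` yields a different hash, misses the cache, and recomputes.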
 ## Requirements
 ### 1. Checkpoint descriptions + embeddings before relationship detection
 The descriptions and embeddings MUST be persisted into the durable, queryable
 `_schema` manifest (and the raw enrichment artifacts) as soon as the last
 non-relationship stage completes, before the `relationships` stage runs.
 Relationship detection appends/merges its joins on completion. The expensive LLM +
 embedding enrichment MUST be queryable even if the relationship stage subsequently
 fails, is interrupted, is budget-truncated, or is skipped. A failed/partial/skipped
 relationship stage MUST degrade to "no/partial joins," never to "no descriptions."
 ### 2. Stage resume resolves by content identity across runs
 Completed-stage resolution MUST key on `(connectionId, stage, inputHash)`,
 independent of `runId`, so re-running an interrupted ingest resumes the finished
 `descriptions` / `embeddings` stages from cache and re-runs only what failed.
 Re-running after an interruption MUST NOT re-issue LLM description or embedding
 calls for stages that already completed. The resume cache MAY be recreated without a
 migration bridge if its schema changes (it is disposable local state).
 ### 3. Relationship detection emits progress and honors a wall-clock budget
 The relationship stage MUST emit per-unit progress through the existing progress
 port (at minimum per-table during profiling and per-candidate during validation) so
 liveness is observable. It MUST enforce a flat wall-clock budget
 (`scan.relationships.detectionBudgetMs`, default 600_000 ms, project-level,
 overridable, validated as a positive integer) checked at unit boundaries and layered
 above spec 16's per-query deadline, and MUST honor `KtxScanContext.signal` where
 available. On budget exhaustion or abort it MUST stop scheduling new work, finalize
 with the relationships found so far, and return a partial result rather than running
 unboundedly or throwing.
 ### 4. A budget-truncated relationship result is durable and marked partial
 A graceful budget/abort stop MUST persist the partial relationship result as a
 completed stage (so a plain re-run resumes it without re-querying) and MUST mark it
 `partial` — in the relationship diagnostics artifact and as a recoverable scan
 warning — so downstream consumers can see the joins are incomplete. Raising
 `detectionBudgetMs` (which changes `inputHash`) MUST cause a fresh, fuller
 relationship run; no separate flag is introduced for "redo." A process killed
 mid-stage MUST NOT leave a completed record (so it recomputes on re-run).
 ### 5. No regression for small or uninterrupted ingests
 A small or single-run ingest that is never interrupted MUST produce the same
 artifacts and the same relationship output as today. The checkpoint write MUST be
 idempotent with the final write (descriptions survive the join rewrite); the budget
 default MUST be generous enough that normal and large-but-tractable schemas complete
 relationship detection fully, hitting the budget only on pathological scale.
 ## Acceptance criteria
 - **Durability across interruption:** interrupting an ingest **during** relationship
  detection still leaves a queryable semantic layer carrying the table/column
  descriptions + embeddings that were generated (verified: re-open the connection;
  AI descriptions are present in `_schema`, not just native comments).
 - **Resume does not re-spend:** re-running an interrupted ingest does **not**
  regenerate descriptions/embeddings whose stage already completed (verified: no LLM
  description calls and no embedding calls for the cached tables; only the failed
  stage re-runs). Resolution is by `(connectionId, stage, inputHash)`, so the resume
  survives a fresh `runId`.
 - **Observable + bounded relationships:** a connection with hundreds of tables emits
  relationship-stage progress (per-table profiling, per-candidate validation) and
  completes within `detectionBudgetMs`; when the budget is hit, the stage stops
  gracefully and persists the partial relationships found so far — without
  discarding enrichment — marked `partial` in diagnostics and via a recoverable
  warning.
 - **Partial is cached and improvable:** re-running with an unchanged budget resumes
  the partial relationship result from cache (no re-querying); raising
  `detectionBudgetMs` triggers a fresh, fuller relationship run.
 - **Budget validation:** `detectionBudgetMs` defaults to 600_000, honors a project
  override, and rejects an invalid value (zero / negative / non-integer) as a clear
  `ktx.yaml` config error.
 - **No regression:** small/single-run ingests behave exactly as before — identical
  artifacts and relationship output when nothing is interrupted; the checkpoint +
  final writes leave descriptions intact alongside the generated joins.
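
The budget-validation criterion could look roughly like this. It is a plain-function sketch rather than the zod schema the project uses; `resolveDetectionBudgetMs` and the error wording are hypothetical.

```typescript
const DEFAULT_DETECTION_BUDGET_MS = 600_000; // 10 minutes

// Returns the effective budget, or throws a config error for invalid input.
function resolveDetectionBudgetMs(raw: unknown): number {
  if (raw === undefined) return DEFAULT_DETECTION_BUDGET_MS;
  if (typeof raw !== "number" || !Number.isInteger(raw) || raw <= 0) {
    throw new Error(
      "ktx.yaml: scan.relationships.detectionBudgetMs must be a positive integer " +
        "of milliseconds, got " + JSON.stringify(raw),
    );
  }
  return raw;
}
```

An omitted value falls back to the default, a valid override is returned unchanged, and zero, negative, fractional, or non-numeric values are rejected.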
 ## Non-goals
 - **Bounding the descriptions stage's per-table LLM call.** Whether an individual
  enrichment LLM call can wedge is a separate concern (already being addressed in the
  working tree via a per-table enrichment timeout). This spec ensures whatever
  descriptions *did* complete are durable; it does not own the per-call timeout.
 - **Changing relationship-detection quality, thresholds, or the candidate/validation
  algorithm.** The accept/review thresholds, scoring, and the existing
  `validationBudget` count cap are unchanged; this spec adds durability,
  cross-run resume, progress, and a time budget around them.
 - **A per-connection or per-call relationship budget, or a global env override.**
  One flat project-level `detectionBudgetMs`; no second runtime path (D3).
 - **A new per-query timeout.** Spec 16 already bounds individual queries; this spec
  composes above it and does not re-implement query-level deadlines.
 - **Replacing the per-query deadline with the stage budget, or vice versa.** They
  are independent and layered: a single query is bounded by spec 16; the stage's sum
  is bounded by `detectionBudgetMs`.
 - **A general checkpoint framework for every ingest stage.** The checkpoint is
  specifically the descriptions+embeddings → queryable-manifest promotion before
  relationships; it is not a generic per-stage artifact-flush abstraction.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns the
 design.
 - **Enrichment orchestration** — `context/scan/local-enrichment.ts`:
  `runLocalScanEnrichment` (`:472`), the three `runEnrichmentStage` calls
  (`descriptions` `:524`, `embeddings` `:553`, `relationships` `:587`),
  `runEnrichmentStage` (`:413`) and its `findCompletedStage` lookup (`:427`). Add the
  checkpoint hook after the last non-relationship stage; thread the progress port,
  signal, and budget into the relationship stage.
 - **Scan driver / write ordering** — `context/scan/local-scan.ts`: bare manifest
  write (`:473`), enrichment call (`:492`, currently passing only
  `{ runId, progress }` as `context` — wire `signal` through here too), terminal
  `writeLocalScanEnrichmentArtifacts` (`:510`), and the enrichment-failure catch
  (`:530`, which after D1 no longer loses descriptions). Supply the checkpoint
  callback here.
 - **Artifact writer** — `context/scan/local-enrichment-artifacts.ts`:
  `writeLocalScanEnrichmentArtifacts` (`:310`), `writeLocalScanManifestShards`
  (`:270`), and the description-preserving merge in `loadExistingManifestState`
  (`:196`) — the basis for the additive checkpoint/final write.
 - **Resume cache** — `context/scan/sqlite-local-enrichment-state-store.ts`:
  `PRIMARY KEY (run_id, stage)` (`:83`), `findCompletedStage` (`:91`),
  `saveCompletedStage` (`:117`). Re-key on `(connection_id, stage, input_hash)`,
  pick latest `updated_at`, recreate the table if shape differs (disposable cache).
  Lookup interface `KtxScanEnrichmentStageLookup` and `findCompletedStage`
  in `context/scan/enrichment-state.ts` (`:10,46`); `computeKtxScanEnrichmentInputHash`
  (`:78`).
 - **Relationship stack (progress + budget + signal)** —
  `context/scan/relationship-discovery.ts` (`discoverKtxRelationships` `:218`, accept
  a progress port and budget/deadline + signal),
  `context/scan/relationship-profiling.ts` (`profileKtxRelationshipSchema` `:320` —
  per-table progress + budget check),
  `context/scan/relationship-validation.ts` (`validateKtxRelationshipDiscoveryCandidates`
  `:237` — per-candidate progress + budget check, alongside the existing
  `validationBudget`),
  `context/scan/relationship-composite-candidates.ts`
  (`discoverKtxCompositeRelationships` `:515` — budget check).
 - **Config** — `context/project/config.ts` `scan.relationships`
  (`KtxScanRelationshipConfig`, `:171–213`): add `detectionBudgetMs` (positive
  integer ms, default 600_000) to the zod schema and the default config builder.
 - **Partial marker** — `context/scan/relationship-diagnostics.ts`
  (`buildKtxRelationshipDiagnostics`, the profile/diagnostics artifact shape) carries
  a `partial` flag + reason; add a recoverable warning code to the
  `KtxScanWarningCode` union in `context/scan/types.ts` (e.g.
  `relationship_detection_partial`).
 - **Tests** — durability: a fixture ingest interrupted during the relationship stage
  leaves AI descriptions in the queryable `_schema`. Resume: a second run with a
  fresh `runId` and unchanged `inputHash` resolves the cached descriptions/embeddings
  (assert no LLM/embedding calls) and re-runs only relationships. Budget: a schema
  large enough (or a tiny `detectionBudgetMs` as the test seam) hits the budget,
  emits per-unit progress, returns partial, persists it marked `partial`, and a
  re-run resumes the partial; raising the budget re-runs. Resolver/config unit tests
  for `detectionBudgetMs` (default / override / invalid). Regression: small
  uninterrupted ingest yields identical artifacts and relationship output.
 - After implementing, rebuild and re-link so the playground picks it up:
  `pnpm run build && pnpm run link:dev`.
 ## Benchmark context (motivation, not a requirement)
 The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables
 (`ebi_chembl` 785, `fec` 486, `ga360` 366, …). Enriching them with claude-code
 costs real, rate-limited LLM budget; losing that enrichment to a relationship-stage
 interruption — and re-spending it on every retry — makes large-schema ingest
 impractical, and an unbounded profiling stage runs past any supervising deadline and
 is killed. This is a general durability/cost property of the ingest pipeline,
 independent of the benchmark; the benchmark only made it acute at scale. Do not
 encode any benchmark specifics in the implementation.
 ## Implementation notes
 Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
 four design decisions shipped; no deviations from the resolved design.
 **D2 — resume by content identity** (`sqlite-local-enrichment-state-store.ts`,
 `enrichment-state.ts`, `local-enrichment.ts`): the stage table is re-keyed to
 `PRIMARY KEY (connection_id, stage, input_hash)`; `findCompletedStage` looks up by
 `(connectionId, stage, inputHash)` ordered by `updated_at DESC` (the most recent
 matching row wins). `KtxScanEnrichmentStageLookup.runId` became `connectionId`;
 `runId` stays on the row for diagnostics/`listRunStages`. The store drops and
 recreates the table when the on-disk primary key differs (disposable cache, no
 migration bridge), detected via `PRAGMA table_info`.
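
One way the primary-key check could work, sketched as pure functions over the rows that SQLite's `PRAGMA table_info` returns. The helper names are hypothetical and the actual detection in the store may differ; only the `name` and `pk` columns of the pragma output are used.

```typescript
// Subset of a `PRAGMA table_info(<table>)` row. `pk` is 0 for columns outside
// the primary key, otherwise the 1-based position of the column within it.
interface TableInfoRow {
  name: string;
  pk: number;
}

const EXPECTED_KEY = ["connection_id", "stage", "input_hash"];

function primaryKeyColumns(rows: readonly TableInfoRow[]): string[] {
  return rows
    .filter((row) => row.pk > 0)
    .sort((a, b) => a.pk - b.pk)
    .map((row) => row.name);
}

// True when the on-disk key differs from the expected one. An empty row list
// (table absent) also returns true, so the caller simply creates the table.
function needsRecreate(rows: readonly TableInfoRow[]): boolean {
  const actual = primaryKeyColumns(rows);
  return (
    actual.length !== EXPECTED_KEY.length ||
    actual.some((column, index) => column !== EXPECTED_KEY[index])
  );
}
```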
 **D3 — observable + bounded relationship stage** (new
 `relationship-detection-budget.ts`): a sticky `KtxRelationshipDetectionBudget`
 (`check()`/`stopReason()`) built from `detectionBudgetMs` + `ctx.signal` + an
 injectable `now`, plus `mapWithBudget` (a budget-aware concurrent map that
 generalizes and replaces the old `mapWithConcurrency`). Threaded through
 `discoverKtxRelationships` → profiling (per-table progress + budget stop),
 validation (per-candidate progress + budget stop; budget-skipped candidates
 degrade to the existing `validation_unattempted` review), and composite
 detection (budget stops at PK-detection and coverage-probe boundaries).
 `discoverKtxRelationships` now accepts `progress` and `now` and returns
 `partial: { reason } | null`. The clock check fires only when work remains, so a
 deadline elapsing after the last unit never marks a fully-processed stage partial.
 **D1 — checkpoint before relationships** (`local-enrichment.ts`,
 `local-enrichment-artifacts.ts`, `local-scan.ts`): `runLocalScanEnrichment` fires a
 caller-supplied `onCheckpoint` once descriptions/embeddings complete and before
 the relationship stage runs, gated on `shouldDetectRelationships` so the
 no-relationship path keeps a single write. `local-scan.ts` supplies a callback
 calling the new `writeLocalScanEnrichmentCheckpoint` (descriptions.json +
 embeddings.json + manifest with descriptions and no generated joins — no
 relationship artifacts, so no misleading empty diagnostics). The shared
 description/embedding JSON writer was factored out so checkpoint and final writes
 stay one implementation. `ctx.signal` is now threaded from `RunLocalScanOptions`
 into the enrichment context (completing the existing `KtxScanContext.signal`
 contract already read by the budget and the in-flight description timeout).
 **D4 — partial is durable + marked** (`relationship-diagnostics.ts`,
 `local-enrichment.ts`, `local-enrichment-artifacts.ts`): the diagnostics artifact
 carries `partial` + `partialReason`; `runLocalScanEnrichment` pushes a recoverable
 `relationship_detection_partial` warning (new `KtxScanWarningCode`) when truncated.
 A graceful budget/abort stop returns normally, so the relationship stage saves as a
 completed-partial record and resumes cheaply; a process killed mid-stage saves
 nothing and recomputes. Raising `detectionBudgetMs` changes `inputHash`
 (it lives in `relationshipSettings`), forcing a fresh, fuller run — the only
 "try harder" mechanism, no extra flag.
 **Config** (`config.ts`): `scan.relationships.detectionBudgetMs`, positive integer
 ms, default `600_000`, validated like the other relationship fields. Documented in
 `docs-site/content/docs/configuration/ktx-yaml.mdx`.
 **Tests** (all green): budget unit tests (`relationship-detection-budget.test.ts`);
 cross-run resume + table-recreate (`enrichment-state.test.ts`,
 `local-enrichment.test.ts`); progress/budget/abort partial
 (`relationship-discovery.test.ts`); partial persisted/resumed/re-run-on-raise +
 checkpoint ordering + no-checkpoint-when-skipped (`local-enrichment.test.ts`);
 end-to-end durability — a relationship-stage failure still leaves AI descriptions
 in the queryable `_schema` (`local-scan.test.ts`); diagnostics partial flag
 (`relationship-diagnostics.test.ts`); config default/override/invalid
 (`config.test.ts`). `pnpm --filter @kaelio/ktx type-check`, `pnpm run dead-code`,
 and `pnpm run build && pnpm run link:dev` all pass. (Pre-existing and unrelated:
 three `analytics-skill-content.test.ts` markdown-structure assertions fail on this
 branch from earlier analytics-skill commits — untouched here.)
--- a/spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md
+++ b/spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md
@ -1,533 +0,0 @@
 # Resilient enrichment under a slow/hung LLM backend
 > Refined spec. Intake draft: `todo/20-resilient-enrichment-under-slow-llm.md`.
 >
 > **Scope: make the descriptions enrichment stage survive a hung LLM backend and
 > an interrupted run.** Two compounding gaps live *inside* the per-table
 > description-enrichment path: (1) the per-table LLM timeout fires in JS but does
 > not terminate a wedged subprocess backend, so a hung table wedges the whole
 > stage indefinitely; (2) descriptions are persisted only at full-stage
 > completion, so any interruption discards every already-enriched table. This is
 > the enrichment-stage analog of spec 16 (enforced query cancellation — a deadline
 > that *stops the work*, not just abandons the promise) and spec 19 (move the
 > durability boundary to the cost boundary so expensive LLM work is not lost). It
 > composes with both rather than replacing them.
 ## Problem
 Two compounding failure modes on the per-table description-enrichment path, both
 confirmed in the current code and observed end-to-end together. Their union turned
 a single hung table into an indefinite wedge *plus* total loss of an entire
 stage's LLM work.
 ### 1. The per-table LLM timeout does not terminate the work
 `KtxDescriptionGenerator.generateBatchedTableDescriptions`
 (`context/scan/description-generation.ts`, the bounded call ~760–866) wraps the
 per-table `this.llmRuntime.generateObject(...)` call in `retryAsync` with a fresh
 `AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (commit `01f63380`).
 A fired timeout is surfaced as `KtxAbortedError` so it is **not** retried (one
 wedge stays one timeout, not 3×). That is the correct policy — but the abort never
 actually stops a subprocess backend, so the timeout is cosmetic.
 The runtime is selected by the `backend` config field
 (`context/llm/local-config.ts`, `KTX_LLM_BACKENDS =
 ['none','anthropic','vertex','gateway','claude-code','codex']`). Two backends spawn
 a **child process the SDK owns** and to which ktx hands only an `AbortSignal`:
 - **`codex`** (`@openai/codex-sdk`, via `context/llm/codex-runtime.ts` →
  `codex-sdk-runner.ts`): the SDK runs `spawn(executable, args, { signal })`. Node's
  `spawn` `signal` option sends the child **SIGTERM** (not SIGKILL) on abort, and the
  SDK consumes the child's stdout with `for await (const line of rl)`, re-throwing
  the abort error **only after that loop ends**. A child wedged on a hung provider
  socket survives SIGTERM → its stdout never closes → the readline loop never ends
  → the SDK never throws → ktx's `await generateObject` **never settles**, past the
  per-attempt timeout, indefinitely. The child leaks (open provider connections,
  ~0% CPU).
 - **`claude-code`** (`@anthropic-ai/claude-agent-sdk`, via
  `context/llm/claude-code-runtime.ts`, `collectResult` ~275–322): on abort it calls
  best-effort `queryResult.interrupt?.()` (errors swallowed) and only checks
  `throwIfAborted` **between** streamed messages. A wedged child emits no message, so
  the `for await (const message of queryResult)` loop blocks and the graceful
  `interrupt()` may never land — the same hang class.
 By contrast, **HTTP backends** (`anthropic`/`vertex`/`gateway`/`openai`, via
 `context/llm/ai-sdk-runtime.ts`) pass `abortSignal` straight to the AI SDK's
 `generateObject`, which cancels the underlying `fetch` natively — the await settles
 promptly and there is no child to leak.
 So ktx holds **no kill handle** on the subprocess backends, and SIGTERM is too
 gentle for a wedged child. Spec 16's mechanism (ktx *itself* forks
 `read-query-child` and `SIGKILL`s it) works precisely because ktx owns the fork —
 which it does not here.
 Observed (BigQuery ingest, codex backend, 2026-06-23): with
 `KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min, an operator override), two of
 `covid19_usa`'s 252-column tables hung; the stage sat at **268/285 for 41+
 minutes** — well past the 30-min per-attempt timeout — with exactly two codex
 children, each holding 3 ESTABLISHED connections at ~0% CPU, until killed by hand.
 ### 2. Descriptions are persisted only at full-stage completion
 `generateDescriptions` (`context/scan/local-enrichment.ts` ~279–352) fans out
 per-table work through `pLimit(DESCRIPTION_TABLE_CONCURRENCY)` (default 4) and
 **accumulates every table's result in an in-memory `updates` array**, returned only
 when the whole stage finishes. `runEnrichmentStage` (~413, ~421–474) then calls
 `saveCompletedStage` (writing the whole-stage row to `local_scan_enrichment_stages`)
 **after** `compute()` returns, and the spec-19 checkpoint write
 (`writeLocalScanEnrichmentCheckpoint`, `local-enrichment-artifacts.ts` ~351–379,
 fired by the `onCheckpoint` hook in `local-scan.ts`) also runs **only once the
 descriptions stage completes**. There is no within-stage persistence: while the
 stage runs, every enriched table's description lives only in memory.
 So if the stage cannot complete — 2 of 285 tables hang (gap #1), or the process is
 killed, or a supervising watchdog fires — **all** already-enriched tables are lost,
 even though their (expensive, paid) LLM descriptions were finished. On the next run,
 `findCompletedStage` finds no row, so the descriptions stage **recomputes from
 scratch**.
 Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but
 **0** rows in `local_scan_enrichment_stages` and **0** `ai:` descriptions on disk;
 killing the wedged ingest discarded all 283, forcing a from-scratch re-ingest. The
 cost of 2 pathological tables was 283 tables' worth of redone LLM calls.
 Sharper still (re-ingest with a short, *enforced* timeout): even when the stage
 **runs to the end** — the 2 hung tables hit their timeout and were skipped, so
 **283/285** descriptions were generated and the ingest reported success (`Scan
 completed` / `Ingest finished`, embeddings built, exit 0) — the descriptions were
 **still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the loss is **not**
 only "discarded on kill": a stage that completes with *any* skipped/aborted table
 throws away **every** successfully generated description. The skip must be
 **graceful** — a skipped table costs one missing description, not the entire stage's
 output — which is the strongest argument for per-table incremental persistence: the
 283 good descriptions should have been durable the moment each was produced.
 The on-disk artifacts already carry everything needed to fix this *additively*: the
 `_schema` manifest encodes per-table completion (a table with `descriptions.ai` is
 AI-enriched), and rewrites preserve existing descriptions
 (`mergeDescriptionsPreservingExternal`, `manifest.ts` ~96–115;
 `loadExistingManifestState`, `local-enrichment-artifacts.ts` ~196–253 — the basis
 spec 19 relies on). The durable record and the resume-skip set can be **derived from
 the system's own on-disk state**, with no new cache schema.
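
Deriving the resume-skip set from on-disk state might look like the sketch below. The `ManifestTable` shape and `tablesNeedingEnrichment` are assumptions for illustration; the only rule taken from the text is that a table with `descriptions.ai` counts as AI-enriched.

```typescript
// Hypothetical manifest entry; the real `_schema` shape is richer.
interface ManifestTable {
  name: string;
  descriptions?: { ai?: string; db?: string };
}

// Returns the tables that still need an LLM description call.
function tablesNeedingEnrichment(
  allTables: readonly string[],
  manifest: readonly ManifestTable[],
): string[] {
  const enriched = new Set<string>();
  for (const table of manifest) {
    // A native `db` comment alone does not count as AI-enriched.
    if (table.descriptions?.ai) enriched.add(table.name);
  }
  return allTables.filter((name) => !enriched.has(name));
}
```

Applied to the incident above, a manifest holding 283 `ai` descriptions would leave only the two hung tables in the work list on the next run.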
 ## Generic use case (independent of any benchmark)
 Anyone ingesting a large or wide schema with an LLM enrichment backend —
 especially a **subprocess** backend, the common local/desktop setup — will
 eventually hit a table whose description call hangs: a provider stall, a rate-limit
 black-hole, a pathologically large prompt. Without an *enforced* timeout, one such
 table wedges the entire ingest indefinitely and leaks the spawned child; without
 *incremental* persistence, any interruption throws away all the per-table LLM work
 already done — the dominant ingest cost. Both fixes make large-schema enrichment
 **resilient and resumable**: a few bad tables degrade to a few skipped
 descriptions, not a hung process and a from-scratch redo. This is core robustness
 for a general-purpose ingestion product, wholly independent of any benchmark.
 ## Design decisions (resolved during refinement)
 These resolve ambiguities the intake draft left open. They constrain the
 implementer; the exact code is theirs (requirement-level, per the specs README).
 ### D1 — One bounded-call guarantee; enforcement follows the backend's nature
 The canonical contract is a single guarantee for the per-table enrichment call:
 **the in-flight work terminates and ktx's await settles within the per-table
 deadline plus a small grace, on every backend.** How that guarantee is met follows
 from a structural property of the configured backend — *does it own a subprocess?*
 — not from a hand-maintained list of provider names:
 - **Subprocess-backed (`codex`, `claude-code`):** the SDK's own abort is
  insufficient (SIGTERM-only, and ktx has no kill handle), so ktx runs the call
  behind a **boundary it can hard-kill** — a short-lived ktx-owned child process,
  made a **process-group leader** (`detached`). The SDK's grandchild (the
  `codex`/`claude` binary) inherits that group. On deadline (or `ctx.signal`), ktx
  **tree-kills the whole group with SIGKILL** — reaping the wrapper *and* the
  grandchild — and rejects promptly. This mirrors spec 16's child-process +
  SIGKILL mechanism, extended by the critical step that **killing the immediate
  child is not enough**: the grandchild would otherwise orphan to init and keep its
  provider connections. Killing the group is the real fix.
 - **HTTP-backed (`anthropic`/`vertex`/`gateway`/`openai`):** unchanged. The existing
  in-process `abortSignal` → `fetch` cancellation already satisfies the contract —
  the await settles promptly and there is no subprocess to leak. Routing these
  through a subprocess would pay fork + IPC + credential-passing cost for no benefit.
 > The branch on "subprocess-backed?" is behavior following from an input the backend
 > declares about itself, not vendor enumeration — the same guarantee is reached two
 > ways because the backends differ structurally. This matches the intake's own split
 > ("subprocess SIGKILL for process-backed; request abort for HTTP-backed").
 >
 > Rejected alternative — a *settle-only race* (reject ktx's promise on the deadline
 > regardless of the SDK, but leave the SDK's child running). It unwedges the stage
 > but leaves the orphaned child holding provider connections — the exact leak the
 > incident showed — so it fails the intake's "actually cancelled" requirement and
 > compounds over a long ingest that hits several hung tables.
 >
 > Rejected alternative — a *persistent ktx subprocess pool* hosting the runtime,
 > killed and respawned on timeout. Terminate-on-deadline destroys the worker, so a
 > pool needs respawn + in-flight job-tracking for no benefit: the enrichment call is
 > low-frequency relative to its own latency and already concurrency-bounded (4), so
 > one short-lived child per call (spec 16's resolved choice) is simpler and as fast.
 **Portability.** ktx supports Windows, where POSIX process groups and
 `process.kill(-pgid, …)` do not exist. The tree-kill MUST be portable: a detached
 process group + `kill(-pgid, 'SIGKILL')` on POSIX, and a tree-terminating
 equivalent on Windows (e.g. `taskkill /pid <pid> /T /F` or a job object) so the
 grandchild is reaped on every platform the subprocess backends run on.
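
The platform split can be sketched as a pure planning function. `planTreeKill` and `KillPlan` are hypothetical names; the sketch only decides what to do and leaves execution to the caller, and it assumes the wrapper was spawned with `detached: true` so that its pid is also its process-group id on POSIX.

```typescript
type KillPlan =
  | { kind: "posix-group"; target: number; signal: "SIGKILL" }
  | { kind: "windows-tree"; command: "taskkill"; args: string[] };

// `platform` is expected to be a Node `process.platform` value such as
// "linux", "darwin", or "win32".
function planTreeKill(platform: string, pid: number): KillPlan {
  if (!Number.isInteger(pid) || pid <= 0) {
    throw new Error("invalid pid: " + String(pid));
  }
  if (platform === "win32") {
    // /T terminates the process and its child processes; /F forces termination.
    return { kind: "windows-tree", command: "taskkill", args: ["/pid", String(pid), "/T", "/F"] };
  }
  // A negative pid addresses the whole process group led by `pid`, which
  // includes the SDK's grandchild when the wrapper is the group leader.
  return { kind: "posix-group", target: -pid, signal: "SIGKILL" };
}
```

A caller would pass the POSIX plan to `process.kill(plan.target, plan.signal)` and run the Windows plan as a child process; a job object remains the alternative the text mentions.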
 ### D2 — Default stays moderate and the retry/skip policy is unchanged
 The per-table timeout default stays **120s** (`KTX_ENRICH_LLM_TIMEOUT_MS`), with the
 existing per-attempt retry (`KTX_ENRICH_LLM_ATTEMPTS`, default 3) and the
 no-retry-on-timeout policy. A hung table costs **at most one timeout**, then the
 table is skipped with the existing `enrichment_timeout` warning and the stage
 proceeds. The 30-min value in the incident was an operator stopgap chosen *because*
 the timeout was cosmetic; once D1 makes the timeout actually terminate the work, a
 long timeout is strictly worse for a hang (a hang costs the full timeout), so the
 moderate default is the correct operating point. The retry loop stays in
 `description-generation.ts`: each attempt runs through the bounded boundary (D1), so
 a transient backend error is retried, while a timeout surfaces as `KtxAbortedError` and
 is not retried.
 > Not introducing a new `ktx.yaml` config field for the timeout. The existing env
 > override is the tuning seam; adding a per-connection/per-call/global knob would
 > multiply the runtime surface for no stated need (one opinionated default + the
 > existing env override is the canonical ktx shape).
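 The policy in D2 reduces to two small decisions, sketched below. The env variable names and defaults come from this spec; the parsing rules and the `readEnrichSettings` / `nextStep` helpers are assumptions made for illustration.

```typescript
const DEFAULT_TIMEOUT_MS = 120_000;
const DEFAULT_ATTEMPTS = 3;

// Assumed parsing rule: anything that is not a positive integer falls back.
function readPositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function readEnrichSettings(env: Record<string, string | undefined>) {
  return {
    timeoutMs: readPositiveInt(env["KTX_ENRICH_LLM_TIMEOUT_MS"], DEFAULT_TIMEOUT_MS),
    attempts: readPositiveInt(env["KTX_ENRICH_LLM_ATTEMPTS"], DEFAULT_ATTEMPTS),
  };
}

type AttemptOutcome = "ok" | "timeout" | "transient-error";
type NextStep = "done" | "retry" | "skip";

// `attempt` is 1-based. A timeout is never retried, so a hung table costs
// at most one timeout; a transient error retries up to the attempt limit.
function nextStep(outcome: AttemptOutcome, attempt: number, maxAttempts: number): NextStep {
  if (outcome === "ok") return "done";
  if (outcome === "timeout") return "skip";
  return attempt < maxAttempts ? "retry" : "skip";
}
```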
 ### D3 — Persist descriptions incrementally; derive the resume-skip set from on-disk state
 During the descriptions fan-out, flush completed tables **per batch** (every N
 tables / on a timer, at a cadence that bounds the at-risk window) to the durable
 on-disk artifacts, reusing spec 19's additive write:
 - the raw descriptions artifact (`descriptions.json`) is the **resume-skip source**;
 - the `_schema` manifest is updated additively (`mergeDescriptionsPreservingExternal`
  preserves prior `ai:`/`db:`/external keys) so finished descriptions are also
  **queryable** the moment they are computed — the spec-19 invariant, one level
  deeper. The implementer MAY bound manifest-rewrite cost on huge schemas by
  rewriting only changed shards.
 On resume, `generateDescriptions` reads the existing record, **skips any table
 already enriched**, computes only the remainder, and returns the merged full set so
 the embeddings stage, the checkpoint write, and the stage-store row all see a
 complete result exactly as today.
 **The skip is `inputHash`-gated**, preserving spec 19's recompute semantics. The
 durable record is tagged with the descriptions stage's `inputHash`
 (`computeKtxScanEnrichmentInputHash`). Resume reuses it to skip tables **only when
 the current `inputHash` matches** — a genuine resume-after-interruption of the same
 content identity. A changed `inputHash` (schema or enrichment settings changed)
 ignores the prior record for skipping and recomputes the stage as today; the
 manifest write stays additive regardless. The artifact's on-disk shape may gain the
 `inputHash` tag with **no migration bridge** (ktx owns the artifact; a stale-shaped
 record simply forces one non-incremental run), consistent with ktx's
 no-backward-compatibility policy.
 > The skip set is **derived from the artifacts ktx already writes**, not from a new
 > per-table cache table. The manifest's `ai:` field already encodes "this table is
 > enriched"; a parallel per-table SQLite record would be a second source of truth for
 > the same fact and would drift. The whole-stage `local_scan_enrichment_stages` row is
 > still written at stage completion (it remains the stage-level resume gate — a clean
 > re-run skips the descriptions stage as today); the incremental record only matters
 > when the stage did **not** complete — exactly the case where no row exists and
 > `compute()` re-runs.
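 The `inputHash` gate and the resume remainder can be sketched as one pure function. The record shape and the `planResume` name are hypothetical; only the rule (reuse the prior record when the hash matches, otherwise recompute everything) is taken from D3.

```typescript
// Hypothetical shape of the durable descriptions record.
interface DescriptionRecord {
  inputHash: string;
  // table name -> description, or null when the table failed or was skipped
  descriptions: Record<string, string | null>;
}

interface ResumePlan {
  recovered: Map<string, string>; // tables that can be skipped
  remainder: string[]; // tables that still need an LLM call
}

function planResume(
  tables: string[],
  record: DescriptionRecord | null,
  currentInputHash: string,
): ResumePlan {
  const recovered = new Map<string, string>();
  // A missing record or a changed hash means nothing is reused.
  if (record !== null && record.inputHash === currentInputHash) {
    for (const table of tables) {
      const prior = record.descriptions[table];
      // A null entry is a failed table: it stays in the remainder.
      if (typeof prior === "string") recovered.set(table, prior);
    }
  }
  const remainder = tables.filter((table) => !recovered.has(table));
  return { recovered, remainder };
}
```

 Because failed tables are stored as `null` rather than omitted, they fall into the remainder and are retried on the next resume.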
 ### D4 — A killed-mid-stage run is durable; resume is cheap
 A process killed mid-stage (gap #1 wedge, SIGKILL, crash, supervisor) leaves the
 per-batch-flushed tables durable on disk. The next run resumes the descriptions
 stage (no completed `local_scan_enrichment_stages` row → `compute()` runs again),
 but `generateDescriptions` now **re-issues LLM calls only for the unfinished
 tables**. A failed/skipped table (timeout or exhausted retries) is left for the
 remainder set and is retried on the next resume — never silently treated as done.
 ## Requirements
 ### 1. The per-table enrichment timeout is enforced for subprocess backends
 When the per-table deadline fires (or `ctx.signal` aborts) on a subprocess-backed
 backend (`codex`, `claude-code`), the in-flight LLM work — the spawned child **and
 its descendants** — MUST be terminated (SIGKILL of the process group / tree), and
 ktx's `generateObject` await MUST settle within the deadline plus a small bounded
 grace. A hung table MUST cost at most about one timeout of wall-clock time, never an unbounded wait.
 The termination MUST be portable across the platforms the subprocess backends run on
 (POSIX process-group kill and a Windows tree-kill equivalent). HTTP-backed backends
 keep their existing native `abortSignal` → `fetch` cancellation; the guarantee is one
 contract met two ways, branching on the backend's structural "owns a subprocess"
 property, not on a list of provider names.
 ### 2. The timeout default and retry/skip policy are unchanged
 The default per-table timeout stays moderate (current 120s, `KTX_ENRICH_LLM_TIMEOUT_MS`),
 with the existing per-attempt retry (default 3, `KTX_ENRICH_LLM_ATTEMPTS`) and the
 no-retry-on-timeout policy. On timeout, the table is skipped with the existing
 `enrichment_timeout` recoverable warning and the stage proceeds. No new
 per-connection / per-call / global timeout knob is added.
 ### 3. Descriptions are persisted incrementally during the stage
 Enriched descriptions MUST be flushed to the durable on-disk artifacts **per batch**
 (per-table or per-N-tables / on a timer) during the descriptions stage, at a cadence
 that bounds the at-risk window to a small number of tables. The flush MUST be
 idempotent and additive (never clobber a prior `ai:` description; preserve `db:` and
 external keys via the existing merge). Finished tables MUST remain durable even if the
 stage never completes because it is wedged, killed, or interrupted. A failed/skipped
 relationship/embedding stage or a killed descriptions stage MUST NOT lose the
 descriptions already flushed.
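 One way to bound the at-risk window is a small buffer that hands completed tables to an additive writer every N tables and once more at the end. This is a synchronous sketch under stated assumptions: `DescriptionFlushBuffer` is a hypothetical name, the writer is injected, and a timer-based cadence and in-flight flush coordination are left out.

```typescript
type DescriptionBatch = Map<string, string>;

class DescriptionFlushBuffer {
  private pending: DescriptionBatch = new Map();
  private readonly flushEvery: number;
  // Assumed to be an additive, idempotent write into the durable artifacts.
  private readonly write: (batch: DescriptionBatch) => void;

  constructor(flushEvery: number, write: (batch: DescriptionBatch) => void) {
    this.flushEvery = flushEvery;
    this.write = write;
  }

  // Record one finished table; flush when the batch is full.
  add(table: string, description: string): void {
    this.pending.set(table, description);
    if (this.pending.size >= this.flushEvery) this.flush();
  }

  // Call once at the end of the stage to drain the tail.
  flush(): void {
    if (this.pending.size === 0) return;
    const batch = this.pending;
    this.pending = new Map();
    this.write(batch);
  }
}

// Stand-in for the durable artifacts: an in-memory map merged additively.
const durable = new Map<string, string>();
let flushCount = 0;
const buffer = new DescriptionFlushBuffer(2, (batch) => {
  flushCount += 1;
  for (const [table, description] of batch) durable.set(table, description);
});
buffer.add("orders", "One row per order.");
buffer.add("customers", "One row per customer."); // batch full: first flush
buffer.add("payments", "One row per payment."); // stays pending
buffer.flush(); // final drain
```

 With `flushEvery` set to 2, a crash before the final `flush()` in this example would lose at most one table.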
 ### 4. Resume re-enriches only the unfinished tables
 On a resumed ingest with an unchanged `inputHash`, the descriptions stage MUST
 re-issue LLM description calls **only for tables not already enriched**, deriving the
 already-enriched set from the on-disk artifacts (the `inputHash`-tagged durable
 record / the manifest's `ai:` descriptions), and MUST return the merged full result
 so downstream stages behave as on a fresh run. A changed `inputHash` (schema or
 enrichment settings changed) MUST recompute the stage as today (spec 19's
 inputHash-gated semantics preserved). The durable record MAY be recreated without a
 migration bridge if its on-disk shape changes (it is regenerable local/artifact
 state).
 ### 5. No regression for small or uninterrupted ingests
 A small or single-run ingest that is never interrupted MUST produce the same
 artifacts (descriptions, manifest, embeddings) as today. The incremental flush MUST
 be idempotent with the spec-19 checkpoint and the terminal write (descriptions
 survive the embeddings/relationship rewrites). The bounded-call boundary MUST NOT
 change a normal successful enrichment's output, only how a wedged call is terminated.
 ### 6. A skipped table costs one description, never the stage's output
 A descriptions stage that **completes** with one or more skipped/aborted tables MUST
 persist every successfully-generated description (the durable record and the `ai:`
 manifest entries) and MUST mark the stage completed (a `local_scan_enrichment_stages`
 row, embeddings + downstream proceeding) — it MUST NOT discard the whole stage's
 output because some tables were skipped. No single table's failure may reject the
 per-table fan-out: a per-table failure degrades to one missing description (left for
 the resume remainder), not a failed stage. A genuine `ctx.signal` cancellation is the
 only thing that fails the stage (so it resumes), and even then the already-flushed
 descriptions remain durable.
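 Requirement 6 can be read as a fold over per-table outcomes, sketched below. The outcome shape and `foldStage` are hypothetical; the warning names are the ones this spec already uses.

```typescript
type TableOutcome =
  | { table: string; kind: "ok"; description: string }
  | { table: string; kind: "failed"; reason: "enrichment_timeout" | "enrichment_failed" }
  | { table: string; kind: "cancelled" };

interface StageResult {
  stageCompleted: boolean;
  descriptions: Map<string, string | null>;
  warnings: string[];
}

function foldStage(outcomes: TableOutcome[]): StageResult {
  const descriptions = new Map<string, string | null>();
  const warnings: string[] = [];
  let cancelled = false;
  for (const outcome of outcomes) {
    if (outcome.kind === "ok") {
      descriptions.set(outcome.table, outcome.description);
    } else if (outcome.kind === "failed") {
      // One missing description, left for the resume remainder.
      descriptions.set(outcome.table, null);
      warnings.push(`${outcome.reason}: ${outcome.table}`);
    } else {
      // Only a genuine cancellation fails the stage.
      cancelled = true;
    }
  }
  // Successful descriptions are returned either way, so the caller can
  // keep them durable even when the stage is not marked completed.
  return { stageCompleted: !cancelled, descriptions, warnings };
}
```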
 ## Acceptance criteria
 - **Enforced timeout (subprocess backend):** a subprocess-backed enrichment call
  that hangs past the deadline is terminated within the deadline plus a small grace;
  ktx's await settles, the spawned child **and a grandchild it spawned** both exit
  (verified via the child's `exit`, not left spinning), and the table is skipped with
  an `enrichment_timeout` warning. The stage advances rather than wedging. A
  `ctx.signal` abort terminates the same way.
 - **HTTP backend unaffected:** an HTTP-backed enrichment call still cancels promptly
  on abort via the existing native path, with no subprocess involved.
 - **Default + policy:** the default timeout is 120s and a timeout is not retried (one
  wedge = one timeout); a transient error is still retried up to the attempt limit.
 - **Graceful skip persists the rest:** a stage that completes with one table failing
  (timeout, exhausted retries, or an unexpected throw) still writes the other N−1
  descriptions to the durable record + `ai:` `_schema` and marks the stage completed
  (a `local_scan_enrichment_stages` row exists); the failed table is a single `null`
  description left for the resume remainder, not a discarded stage.
 - **Incremental durability:** interrupting the descriptions stage after K of N tables
  leaves those K durable on disk (raw artifact + `ai:` descriptions in `_schema`),
  with no completed `local_scan_enrichment_stages` row.
 - **Resume does not re-spend:** re-running the interrupted ingest (unchanged
  `inputHash`, fresh `runId`) issues **no** LLM description calls for the K
  already-enriched tables and enriches only the remaining N−K; the returned result is the
  full merged set. A changed `inputHash` recomputes the stage.
 - **No regression:** a small uninterrupted ingest yields identical artifacts and the
  same descriptions/embeddings output as today; the incremental flush is idempotent
  with the checkpoint and terminal writes.
 ## Non-goals
 - **Incremental persistence of embeddings.** Embeddings are fast and already covered
  by spec 19's stage-level cross-run resume; the dominant loss is descriptions. This
  spec scopes incremental persistence to the `descriptions` stage.
 - **Changing the timeout default, retry counts, or adding a timeout config knob.**
  D2 keeps the moderate default and the single env tuning seam.
 - **Routing HTTP backends through the subprocess boundary.** Their native abort
  already meets the contract; a subprocess would add cost and a credential-passing
  surface for no benefit.
 - **A persistent subprocess pool.** One short-lived ktx child per subprocess-backed
  call; no pool, no respawn/job-tracking (D1).
 - **Re-implementing spec 16 (per-query deadline) or spec 19 (relationship-stage
  budget, cost-boundary checkpoint, cross-run stage resume).** This spec composes
  with them: spec 16 bounds individual queries, spec 19 makes whole stages durable
  and resumable, and this spec hardens the per-table enrichment call's termination
  and adds within-stage description durability.
 - **A general per-stage incremental-flush framework.** The incremental flush is
  specifically the descriptions stage; it is not a generic abstraction over every
  enrichment stage.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns the
 design.
 - **Bounded per-table call (gap #1)** — `context/scan/description-generation.ts`,
  `KtxDescriptionGenerator.generateBatchedTableDescriptions` (the bounded+retry block
  ~760–866; `enrichTimeoutMs` ~769, `enrichAttempts` ~770, `KtxAbortedError` on
  timeout ~811, `enrichment_timeout`/`enrichment_failed` warnings ~858). The retry
  loop stays here; each attempt runs through the kill boundary for subprocess
  backends.
 - **LLM runtime + backend selection** — `context/llm/runtime-port.ts`
  (`KtxLlmRuntimePort.generateObject`, `abortSignal` on the input),
  `context/llm/local-config.ts` (~127–163, selects `CodexKtxLlmRuntime` /
  `ClaudeCodeKtxLlmRuntime` / `AiSdkKtxLlmRuntime`), `context/project/config.ts`
  (`KTX_LLM_BACKENDS`). The "owns a subprocess" property should be declared by the
  backend/runtime (e.g. on the runtime interface), not inferred from a name list.
 - **Subprocess backends** — `context/llm/codex-runtime.ts` +
  `context/llm/codex-sdk-runner.ts` (`CodexSdkCliRunner.runStreamed`, the SDK's
  `spawn(executable, args, { signal })` is in `@openai/codex-sdk`),
  `context/llm/claude-code-runtime.ts` (`collectResult` ~275–322, the `interrupt()`
  abort path). These are what the kill boundary must wrap and tree-kill.
 - **Reuse spec 16's mechanism (extended to group/tree kill)** —
  `connectors/sqlite/read-query-child.ts` (the forked child shape) and
  `connectors/sqlite/connector.ts` `runReadQueryOffProcess` (~292–350: `fork`,
  deadline timer, `child.kill('SIGKILL')`, `settle()`, the `.js`-if-exists-else-`.ts`
  child-URL resolver ~25–27, knip dynamic entry). Gap #1 differs by making the child a
  process-group leader and killing the **group/tree** (the SDK grandchild), portably.
  Abort helpers: `context/core/abort.ts` (`createAbortError`, `throwIfAborted`,
  `linkAbortSignal`). Note the new child hosts an LLM runtime, so the implementer owns
  passing the backend config/credentials to it (env/IPC) and serializing the
  structured result back.
 - **Incremental persistence (gap #2)** —
  `context/scan/local-enrichment.ts` (`generateDescriptions` ~279–352: the per-table
  `pLimit` fan-out and the in-memory `updates` accumulation; `runEnrichmentStage`
  ~413/~421–474 with `findCompletedStage` ~427 and `saveCompletedStage`; the
  `onCheckpoint` hook ~598–612). Make `generateDescriptions` resume-aware: read the
  existing record, skip already-enriched tables, flush per batch, return the merged
  full set.
 - **Artifact writer + additive merge** — `context/scan/local-enrichment-artifacts.ts`
  (`writeLocalScanEnrichmentCheckpoint` ~351–379, `writeEnrichmentDescriptionArtifacts`
  with `descriptions.json` ~316, `writeLocalScanManifestShards` ~270–308,
  `loadExistingManifestState` ~196–253, `tableDescription`/`columnDescription`
  ~75–105); `context/scan/manifest.ts` (`mergeDescriptionsPreservingExternal` ~96–115,
  `SCAN_MANAGED_DESCRIPTION_KEYS`). Factor a per-batch flush that reuses the additive
  description/manifest write; tag the durable record with `inputHash`.
 - **Stage store + input hash** —
  `context/scan/sqlite-local-enrichment-state-store.ts` (`STAGES_TABLE =
  'local_scan_enrichment_stages'`, PK `(connection_id, stage, input_hash)`,
  `findCompletedStage`, `saveCompletedStage`),
  `context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78). The
  whole-stage row stays; the `inputHash` is the gate for the resume-skip set.
 - **Scan driver** — `context/scan/local-scan.ts` (the `onCheckpoint` wiring and the
  terminal `writeLocalScanEnrichmentArtifacts`), and `KtxScanContext.signal`
  (`context/scan/types.ts`) which the kill boundary must honor.
 - **Tests** — gap #1: a fake subprocess-backed runtime whose child hangs (ignores
  SIGTERM) is killed at a tiny test-seam deadline; assert the await settles within
  deadline+grace, the child and a spawned grandchild both exit, and the table is
  skipped with `enrichment_timeout`; assert an HTTP-backed abort still settles via the
  native path. gap #2: interrupt the descriptions stage after K/N tables (a flush
  seam), assert the K are durable (raw artifact + `ai:` in `_schema`) with no completed
  stage row; a resume with matching `inputHash` issues no LLM calls for the K and
  enriches only N−K; a changed `inputHash` recomputes; regression: a small
  uninterrupted ingest yields identical artifacts.
 - After implementing, rebuild and re-link so the playground picks it up:
  `pnpm run build && pnpm run link:dev`.
 ## Benchmark context (motivation, not a requirement)
 Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment
 backend). While the giant public datasets were being re-enriched, `covid19_usa` wedged at 268/285 for
 41+ minutes on 2 hung 252-column tables; the 30-min per-table `AbortSignal` timeout
 never killed the hung codex children, and because descriptions checkpoint only at
 stage completion, the 283 already-enriched tables were unrecoverable — the operator
 had to kill, cache-bust, and re-ingest the database from scratch (with a short timeout
 as a stopgap). The benchmark merely exercised a large/wide multi-dataset ingest at
 scale; the gaps and the fixes are generic production hygiene for any agent that
 enriches a real warehouse with a subprocess LLM backend. Do not encode any benchmark
 specifics in the implementation.
 ## Implementation notes
 Implemented on branch `write-feature-spec-wiki`. Both gaps shipped; all acceptance
 criteria are covered by tests. The full ktx test surface for the touched code is
 green (the only failures in the whole suite are 3 pre-existing assertions in
 `test/skills/analytics-skill-content.test.ts` about the analytics SKILL.md markdown
 — an unrelated subsystem this change does not touch).
 ### Gap #1 — enforced timeout for subprocess backends
 - **Structural property on the runtime, not a name list.** Added
  `subprocessForkSpec(): SubprocessRuntimeForkSpec | null` to `KtxLlmRuntimePort`
  (`context/llm/runtime-port.ts`). `CodexKtxLlmRuntime` / `ClaudeCodeKtxLlmRuntime`
  return a serializable `{ backend, projectDir, modelSlots }`; `AiSdkKtxLlmRuntime`
  (and the deterministic stub) return `null`. The per-table call branches on this,
  never on a vendor list (D1).
 - **Shared structured core.** Both subprocess runtimes gained
  `generateStructuredJson(jsonSchema)` (returns the raw object; the caller
  Zod-validates). Their existing `generateObject` was refactored to delegate to the
  same streaming core, so structured generation has one implementation.
 - **Kill boundary.** New `context/llm/subprocess-generate-object.ts`
  (`runGenerateObjectInSubprocess`, `KtxSubprocessDeadlineError`) forks a ktx-owned
  child (`subprocess-generate-object-child.ts`) **detached** (process-group leader);
  the SDK's model binary inherits the group. On the deadline or `ctx.signal`, ktx
  tree-kills the group with `SIGKILL` (`process.kill(-pid, …)` on POSIX,
  `taskkill /pid <pid> /T /F` on Windows) and rejects promptly; on success the raw
  output is Zod-validated. Credentials reach the child via inherited `process.env`
  (the runtimes re-derive their allowlisted env), never over IPC.
 - **Wiring.** `KtxDescriptionGenerator.generateBatchedTableDescriptions`
  (`context/scan/description-generation.ts`) routes each retry attempt through the
  boundary for subprocess backends and keeps the native `AbortSignal` → `fetch`
  path for HTTP backends. A fired deadline maps to the existing
  `KtxAbortedError`/`enrichment_timeout` no-retry policy (one wedge = one timeout);
  default stays 120s (D2).
 - **Tests.** `test/context/llm/subprocess-generate-object.test.ts` forks a real
  fixture child that spawns a grandchild and ignores SIGTERM, and asserts the
  deadline/abort tree-kills both (the grandchild PID is reaped) and the await
  settles within deadline+grace; plus success / schema-failure / child-error paths.
  `test/context/scan/description-generation.test.ts` adds the generator-level
  timeout-skip and the "HTTP backend spawns no child" cases.
 ### Gap #2 — incremental descriptions persistence + resume
 - **Durable record + resume store.** `createKtxScanDescriptionResumeStore`
  (`context/scan/local-enrichment-artifacts.ts`) writes the descriptions-so-far to
  a durable record (inputHash-tagged) and **only the manifest shards that gained a
  table this batch** (new `onlyChangedTableNames` filter on
  `writeLocalScanManifestShards`, additive merge preserved). `load(inputHash)`
  returns the prior enriched set only on a matching inputHash (D3).
 - **Resume-aware fan-out.** `generateDescriptions` (`context/scan/local-enrichment.ts`)
  loads the prior record, skips already-enriched tables, enriches only the
  remainder, flushes every `DESCRIPTION_FLUSH_EVERY` (10) completed tables (a single
  in-flight flush; the final force-flush drains the tail), and returns the full
  merged set (recovered + fresh + `null` for still-failed, so failures are retried,
  D4). Wired through `local-scan.ts` (store constructed when not `--dry-run`).
 - **Graceful-skip backstop (requirement 6).** The per-table worker wraps the call in
  a try/catch: any non-cancellation failure degrades to one `null` description + an
  `enrichment_failed` warning and the fan-out continues, so no single table can
  reject `Promise.all` / abort the stage. This makes the "one skipped table costs one
  description, not the stage's output" guarantee live at the stage boundary
  (`generateBatchedTableDescriptions` already degrades its own failures; this is the
  explicit backstop). A `ctx.signal` cancellation still propagates (the stage fails
  and resumes), and the already-flushed descriptions stay durable. This closes the
  field bug where a completed-with-skips stage persisted 0 descriptions / 0 stage rows.
 - **Deviation from the spec's literal path (necessary correction).** The durable
  record lives at a **stable, non-`syncId`** path
  (`raw-sources/<connectionId>/live-database/enrichment-progress/descriptions.json`),
  not the `syncId`-scoped `…/<syncId>/enrichment/descriptions.json` the spec named.
  Reason: a from-scratch interruption (the incident's exact case — no prior
  *completed* run) gets a **fresh `syncId`** on the next run
  (`buildSyncId` in `context/ingest/local-stage-ingest.ts`), so a `syncId`-scoped
  record would be unreachable on resume. The manifest is already at the stable
  per-connection scope (`semantic-layer/<connectionId>/_schema/`), so this keeps the
  resume source at the same stable scope. The `syncId`-scoped `enrichment/descriptions.json`
  debug artifact written by the terminal/checkpoint writers is unchanged.
 - **Tests.** `test/context/scan/description-resume.test.ts` drives
  `runLocalScanEnrichment` against a real git-backed project: a fresh run flushes a
  durable record + `ai:` manifest descriptions; a matching-`inputHash` resume issues
  zero LLM calls and returns the full merged set; a partial record re-enriches only
  the missing tables; a changed `inputHash` recomputes; the changed-shard filter
  rewrites only the affected shard; and (requirement 6) a run where one table fails
  still persists the other tables (durable record + `ai:`) and **completes the stage**
  (a completed `local_scan_enrichment_stages` row), with the failed table left `null`
  for resume.
 ### Incidental
 - Fixed a stale assertion in `description-generation.test.ts` ("does not run
  per-column fallback…" expected 1 call) to `3`, matching the retry policy added in
  commit `01f63380` (D2 / acceptance: a transient error retries up to the attempt
  limit). The HTTP path is unchanged; the assertion simply predated the retry.
 - No new `ktx.yaml` config field or runtime knob was added (D2). The rate-limit
  governor is not wired into the scan-enrichment path, so the kill-boundary child
  loses no pacing.
 - Rebuilt and re-linked (`pnpm run build && pnpm run link:dev`); the child compiles
  to `dist/context/llm/subprocess-generate-object-child.js`.
--- a/spider2-specs/specs/21-selective-enrichment-stages.md
+++ b/spider2-specs/specs/21-selective-enrichment-stages.md
@ -1,567 +0,0 @@
 # Selective enrichment stages (`--stages`) + per-stage cache keys
 > Refined spec. Intake draft: `todo/21-selective-enrichment-stages.md`.
 >
 > **Scope: make the three enrichment stages independently invalidatable and
 > independently re-runnable.** Today one coarse cache key gates all three stages,
 > so changing any one stage's inputs re-pays for every stage — most painfully the
 > expensive per-table `descriptions`. And there is no CLI surface to re-run a
 > chosen subset. This spec splits the key per stage (so a change invalidates only
 > the stage it touched) and adds a `--stages` flag that force-re-runs a chosen
 > subset while preserving the others. It is the operability follow-on to spec 19
 > (durable, cross-run stage resume) and spec 20 (resilient, per-table-resumable
 > descriptions); it composes with both rather than replacing them.
 ## Problem
 Enrichment has three stages — **`descriptions`** (one paid LLM call per table),
 **`embeddings`** (sentence-transformer vectors over the schema + descriptions),
 **`relationships`** (FK/join detection, optionally LLM-proposed). After specs 19
 and 20 these stages are durable and resumable, but they are still **coupled for
 cache invalidation and unreachable for selective re-run**. Three facts make a
 targeted re-run impossible without a full, expensive re-enrich.
 ### 1. One coarse cache key gates all three stages
 `runLocalScanEnrichment` (`context/scan/local-enrichment.ts:611`) computes a single
 `inputHash` from `{ snapshot, mode, detectRelationships, providerIdentity,
 relationshipSettings }` and every stage reuses it — `descriptions` (~`:642`),
 `embeddings` (~`:673`), `relationships` (~`:729`). `providerIdentity` itself
 (`localScanProviderIdentity`, `local-scan.ts:241–255`) is one blob conflating the
 description LLM identity, the embedding model/dimensions/batch size, **and** the
 whole relationship config — and it redundantly re-encodes `mode` and
 `relationships`, which the coarse hash already mixes in.
 The consequence: flipping `scan.relationships.llmProposals`, switching the LLM
 backend, or upgrading the embeddings model changes the **one** hash and so
 invalidates **all three** stages. ktx then re-runs the expensive per-table
 `descriptions` even though they did not conceptually change. The headline cost of
 the system — paid LLM description calls — is thrown away on any unrelated
 enrichment-config edit.
 ### 2. No CLI surface to select stages
 The enrichment internals already support a relationships-only path
 (`KtxScanMode` `'relationships'`, `types.ts:12` — `descriptions`/`embeddings` are
 gated on `mode === 'enriched'` at `local-enrichment.ts:632`, while
 `shouldDetectRelationships` admits `mode === 'relationships'` at `:624–626`). But
 `ktx ingest` hardcodes `mode: 'enriched'` (`public-ingest.ts:973`) and exposes no
 flag to select a subset (`ingest-commands.ts:26–49` — only `--no-query-history`
 and friends). The relationships-only capability is built but unreachable, and there
 is no way at all to ask for "descriptions only" or "embeddings only."
 ### 3. The foundation for "touch one stage, keep the rest" already exists
 The per-stage store `local_scan_enrichment_stages` is keyed
 `(connection_id, stage, input_hash)` (spec 19) and the descriptions write is
 additive — `mergeDescriptionsPreservingExternal` (`manifest.ts`) and
 `loadExistingManifestState` (`local-enrichment-artifacts.ts`) preserve prior `ai:`,
 `db:`, and external description keys on rewrite; spec 20's per-table resume record
 (`createKtxScanDescriptionResumeStore`, `local-enrichment-artifacts.ts:286`) already
 re-issues LLM calls only for the still-failed tables. So "recompute one stage, leave
 the others byte-for-byte" needs only two missing pieces: **per-stage key
 granularity** and a **CLI surface** to select stages.
 **Requirement:** let an operator re-run a chosen subset of enrichment stages on an
 already-ingested connection, recomputing only those stages, preserving the others'
 artifacts untouched, and **re-paying only for what genuinely changed** — never
 re-running the costly `descriptions` because an unrelated stage's inputs moved.
 ## Generic use case (independent of any benchmark)
 Any team running ktx in production maintains its semantic layer over time: they
 improve the description prompt or switch the description LLM, upgrade the embeddings
 model, or turn on LLM-proposed joins. Today each of those forces a **full re-enrich
 of every connection** — re-running the expensive per-table descriptions even when
 only embeddings or relationships changed. Two routine operations should be cheap and
 targeted:
 - **"Re-embed everything on the new model."** Swapping the embeddings model should
  recompute only embeddings, leaving descriptions and joins on disk.
 - **"Backfill joins now that `llmProposals` is on."** Enabling LLM-proposed
  relationships should recompute only relationships.
 And one operation needs an explicit trigger because no input changed:
 - **"These descriptions came out thin — re-run them with a longer timeout."** A
  connection whose description coverage is poor because tables timed out (same
  snapshot, same LLM, so the hash is unchanged) should be re-runnable on demand,
  cheaply retrying only the tables that failed.
 This is core operability for a long-lived ingestion product and is wholly
 independent of any benchmark.
 ## Design decisions (resolved during refinement)
 These resolve ambiguities the intake draft left open. They constrain the
 implementer; the exact code is theirs (requirement-level, per the specs README).
 ### D1 — Split the coarse hash into three per-stage input hashes
 Replace the single `computeKtxScanEnrichmentInputHash` call with **per-stage** hash
 computation, each keyed on only that stage's own inputs. Decompose the
 `localScanProviderIdentity` blob into the slices each stage actually depends on:
 - **`descriptions`** → `{ snapshot, llmIdentity }`, where `llmIdentity` is the
  description-LLM identity (`llm.models.default`, `baseUrlConfigured`). **Not** the
  embedding model/dimensions/batch size, **not** relationship settings.
 - **`embeddings`** → `{ snapshot, embeddingIdentity, descriptionDigest }`, where
  `embeddingIdentity` is `{ model, dimensions, batchSize }` and `descriptionDigest`
  is a stable digest of the resolved description text the embeddings consume (the
  same text `buildEmbeddings` → `buildKtxColumnEmbeddingText` feeds the model,
  `local-enrichment.ts:466–486`, `embedding-text.ts:17–44`). This content-addresses
  embeddings on their real upstream (D4).
 - **`relationships`** → `{ snapshot, relationshipSettings (incl. `llmProposals` and
  `detectionBudgetMs`), llmIdentity }`. **Not** the description content (decision X,
  D5), **not** the embedding identity.
 `mode` and `detectRelationships` drop out of the per-stage inputs: each stage
 produces output under exactly one mode, so the stage name already scopes that, and
 re-mixing `mode` only re-couples the keys. After the split, flipping `llmProposals`
 invalidates only `relationships`; swapping the embeddings model invalidates only
 `embeddings`; switching the description LLM invalidates only `descriptions`.
 The per-stage hash becomes the key everywhere a single hash is used today: the
 `local_scan_enrichment_stages` lookup/save in `runEnrichmentStage`, and the spec-20
 descriptions resume record (`createKtxScanDescriptionResumeStore`), which is now
 keyed on the **descriptions** stage's hash — so changing the embedding model no
 longer busts the descriptions resume record, a strict improvement.
 > **No migration bridge.** The stage store and the descriptions resume record are
 > disposable local `.ktx` state (regenerable from a fresh ingest). The new per-stage
 > keys simply miss the old coarse-keyed rows, forcing one full re-enrich on the next
 > run after upgrade. Recreate/ignore stale-shaped records with no compatibility
 > shim, consistent with specs 19/20 and ktx's no-backward-compatibility policy.
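 The split in D1 can be sketched as three hashes over disjoint input slices. This is an illustration under assumptions: SHA-256 over a key-sorted JSON serialization is a stand-in for whatever `computeKtxScanEnrichmentInputHash` actually does, and `stableStringify`, `hashStage`, `computeStageHashes`, and the `EnrichmentInputs` shape are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so the hash ignores key order.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value) ?? "null";
}

function hashStage(stage: string, inputs: unknown): string {
  return createHash("sha256").update(`${stage}:${stableStringify(inputs)}`).digest("hex");
}

interface EnrichmentInputs {
  snapshot: string; // stand-in for the schema snapshot identity
  llmIdentity: { model: string; baseUrlConfigured: boolean };
  embeddingIdentity: { model: string; dimensions: number; batchSize: number };
  descriptionDigest: string; // digest of the description text embeddings consume
  relationshipSettings: { llmProposals: boolean; detectionBudgetMs: number };
}

// Each stage hashes only the inputs it depends on.
function computeStageHashes(inputs: EnrichmentInputs) {
  const { snapshot, llmIdentity, embeddingIdentity, descriptionDigest, relationshipSettings } = inputs;
  return {
    descriptions: hashStage("descriptions", { snapshot, llmIdentity }),
    embeddings: hashStage("embeddings", { snapshot, embeddingIdentity, descriptionDigest }),
    relationships: hashStage("relationships", { snapshot, relationshipSettings, llmIdentity }),
  };
}
```

 In this sketch, flipping `llmProposals` moves only the `relationships` hash and swapping the embedding model moves only the `embeddings` hash. `llmIdentity` is modeled here as one shared value, so changing it would move both the `descriptions` and `relationships` hashes; whether the relationship stage keys on a separate LLM identity is not settled by this sketch.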
 ### D2 — `--stages <comma-list>` selects a subset; one gate, no new mode
 Add `ktx ingest [connectionId] --stages <comma-list>`, a non-empty subset of
 `descriptions,embeddings,relationships`. Plural because it takes a **set**:
 `--stages relationships` and `--stages descriptions,embeddings` both read naturally,
 and the plural signals "list expected." Flag absent = all three (today's behavior).
 A Commander custom parser validates each name against the canonical stage registry
 and parses into an ordered, de-duplicated set. **An unknown or empty stage name is a
 hard `InvalidArgumentError`** — never silently ignored. The set threads CLI →
 `runKtxPublicIngest` (`KtxScanArgs`) → `runLocalScan` → `runLocalScanEnrichment`.
 Inside enrichment the run set is **`(mode/provider-eligible stages) ∩ (selected
 stages)`** — a single gate. Each existing stage block additionally checks
 membership in the selected set (`descriptions`/`embeddings` already gate on
 `mode === 'enriched'` + providers; `relationships` on `shouldDetectRelationships`).
 This adds **no** new `KtxScanMode` variant and **no** second parallel selection
 path; `mode` keeps meaning "the connection's enrichment level," and `--stages` means
 "which of those stages to (re)compute this run." A named stage that cannot run
 because a prerequisite is absent (e.g. `--stages embeddings` with no embedding
 provider configured) MUST fail or warn clearly, never silently no-op.
 > Rejected alternative — repurpose `mode` (`--stages relationships` →
 > `mode: 'relationships'`). It only expresses single-stage cases, leaves
 > `descriptions,embeddings` with no mode, and creates two ways to say "relationships
 > only." The explicit stage set is the one canonical selector.
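A minimal sketch of the D2 parser and the single gate, under stated assumptions. The local `InvalidArgumentError` class is a stand-in so the example has no dependencies (the spec calls for Commander's error of the same name); `parseStages` and `stagesToRun` are hypothetical names, and returning the set in canonical registry order is an assumption, since the spec only says "ordered, de-duplicated".

```typescript
// Stand-in for Commander's InvalidArgumentError (assumption: same role, no dependency).
class InvalidArgumentError extends Error {}

const ENRICHMENT_STAGES = ["descriptions", "embeddings", "relationships"] as const;
type EnrichmentStage = (typeof ENRICHMENT_STAGES)[number];

function isStage(name: string): name is EnrichmentStage {
  return (ENRICHMENT_STAGES as readonly string[]).indexOf(name) !== -1;
}

// Parse "a,b" into a de-duplicated list in registry order; reject unknown or empty names.
function parseStages(raw: string): EnrichmentStage[] {
  const selected = new Set<EnrichmentStage>();
  raw.split(",").forEach((part) => {
    const name = part.trim();
    if (name === "") {
      throw new InvalidArgumentError("empty stage name in --stages");
    }
    if (!isStage(name)) {
      throw new InvalidArgumentError(
        `unknown stage "${name}"; expected one of ${ENRICHMENT_STAGES.join(", ")}`,
      );
    }
    selected.add(name);
  });
  return ENRICHMENT_STAGES.filter((stage) => selected.has(stage));
}

// The single gate: (eligible) ∩ (selected). An absent flag selects every stage.
function stagesToRun(
  eligible: ReadonlySet<EnrichmentStage>,
  selected?: readonly EnrichmentStage[],
): EnrichmentStage[] {
  const chosen: readonly EnrichmentStage[] = selected ?? ENRICHMENT_STAGES;
  return chosen.filter((stage) => eligible.has(stage));
}
```

A stage that is selected but missing from `eligible` simply drops out of this intersection, so a caller following D2 would still need to compare the two sets and fail or warn for the difference.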
 ### D3 — A named stage force-re-runs; per-table resume still avoids re-paying
 Naming a stage in `--stages` carries the intent "recompute this," so a named stage
 **re-enters its `compute()`, bypassing the spec-19 completed-row short-circuit** in
 `runEnrichmentStage` (`local-enrichment.ts:538–547`). The spec-20 machinery still
 applies **inside** `compute()`:
 - `--stages descriptions` re-enters `generateDescriptions`, which loads the
  per-table resume record and re-issues LLM calls **only for the still-null/failed
  tables** (when the descriptions hash is unchanged) — the "fill thin coverage with
  a longer `KTX_ENRICH_LLM_TIMEOUT_MS`" case, paying only for the gaps.
 - A genuine input change (e.g. switching the LLM → a new descriptions hash)
  invalidates the resume record and rebuilds the stage fully, as today.
 Stages **not** named are skipped entirely — not run, not resumed — and their
 on-disk artifacts are left exactly as they are (additive write; preserve-others is
 already the behavior). The **no-flag default is unchanged**: all eligible stages
 run, the completed-row short-circuit is respected (spec-19 cross-run resume).
 Behavior follows from the input (did you explicitly name the stage?), not the call
 path. A consequence to state plainly: `--stages descriptions,embeddings,relationships`
 is **not** identical to passing no flag — naming all three is the explicit "force a
 full enrichment recompute," whereas no flag is "ingest, resuming whatever is done."
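The interaction between the forced re-run and the per-table resume can be sketched like this. It is a simplified, synchronous illustration: the in-memory `completedRows` map stands in for the spec-19 stage store, and `runStage` and `fillMissingDescriptions` are hypothetical names rather than the real `runEnrichmentStage` and `generateDescriptions` signatures.

```typescript
// Hypothetical in-memory stand-in for the completed-row stage store (spec 19).
const completedRows = new Map<string, string>();

function runStage(
  stage: string,
  inputHash: string,
  forceRecompute: boolean,
  compute: () => string,
): { output: string; recomputed: boolean } {
  const key = `${stage}:${inputHash}`;
  const cached = completedRows.get(key);
  // A no-flag run respects the short-circuit; a stage named in --stages bypasses it.
  if (cached !== undefined && !forceRecompute) {
    return { output: cached, recomputed: false };
  }
  const output = compute();
  completedRows.set(key, output);
  return { output, recomputed: true };
}

// Per-table resume inside compute(): only tables with no description reach the LLM.
// `describe` stands in for the LLM call; the return value counts the calls made.
function fillMissingDescriptions(
  resume: Map<string, string | null>,
  describe: (table: string) => string,
): number {
  const missing: string[] = [];
  resume.forEach((text, table) => {
    if (text === null) missing.push(table);
  });
  missing.forEach((table) => resume.set(table, describe(table)));
  return missing.length;
}
```

In this sketch, forcing the stage re-enters `compute()` every time, but the cost of doing so is bounded by the number of still-null entries in the resume record.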
 ### D4 — Downstream staleness: one real edge, content-addressed, surfaced not silent
 The only hard dependency between stages is **`descriptions → embeddings`**
 (embeddings embed the description text; `relationships` is decoupled, D5). Two
 mechanisms keep it correct without a hardcoded dependency table:
 - **Self-healing via content-addressing.** Because the embeddings hash includes
  `descriptionDigest` (D1), re-running `descriptions` changes that digest, so a
  later embeddings run (or a full ingest) sees a hash miss and recomputes — stale
  embeddings can never silently persist across a future embeddings run. (Without
  this, the embeddings hash would be unchanged after a description edit and a later
  run would wrongly short-circuit on stale vectors.)
 - **Surfaced immediately.** After a selective run, for each **unselected** stage that
  has artifacts on disk, recompute its *current* per-stage hash from on-disk state
  and compare it to the stored completed-row hash; if they differ, emit a
  **recoverable `enrichment_stage_stale` warning** naming the stale stage and the
  cascade command (e.g. `--stages descriptions,embeddings`). This is derived from the
  system's own state — it also catches "you changed the embedding model in `ktx.yaml`
  but only ran `--stages descriptions`."
 The run **never silently leaves a stale-but-unflagged downstream**, and **never
 silently auto-cascades** extra work — the operator is told and decides. Re-running
 `descriptions` does **not** flag `relationships` stale (D5).
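The derived staleness check in D4 amounts to a hash comparison over the unselected stages. The sketch below assumes the caller has already recomputed each stage's current hash from on-disk state; `detectStaleStages` and the warning shape are hypothetical, and building the suggested cascade command as "selected stages plus the stale one" is an assumption modeled on the `--stages descriptions,embeddings` example.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";
type StaleWarning = { code: "enrichment_stage_stale"; stage: Stage; message: string };

const STAGE_ORDER: readonly Stage[] = ["descriptions", "embeddings", "relationships"];

function detectStaleStages(
  selected: ReadonlySet<Stage>,
  storedHashes: Partial<Record<Stage, string>>, // completed-row hash, if the stage ever completed
  currentHashes: Record<Stage, string>, // recomputed from on-disk state after the run
): StaleWarning[] {
  const warnings: StaleWarning[] = [];
  STAGE_ORDER.forEach((stage) => {
    if (selected.has(stage)) return; // only unselected stages can be stale
    const stored = storedHashes[stage];
    if (stored === undefined || stored === currentHashes[stage]) return;
    // Assumed cascade: re-run what was selected plus the stage that went stale.
    const cascade = STAGE_ORDER.filter((s) => selected.has(s) || s === stage).join(",");
    warnings.push({
      code: "enrichment_stage_stale",
      stage,
      message: `${stage} is stale; re-run with --stages ${cascade}`,
    });
  });
  return warnings;
}
```

Because the relationships hash does not include description content (D5), its current hash equals its stored hash after a descriptions-only run, so the comparison naturally leaves it unflagged without a special case.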
 ### D5 — Relationships are decoupled from description content, but still get it as context
 `relationships` keys on `{ snapshot, relationshipSettings, llmIdentity }` and is
 **not** invalidated or stale-flagged by a description change (decision X). Rationale:
 relationships are the low-value, best-effort, expensive-to-probe stage (spec 19's
 own framing); coupling them to description content would make every routine
 description re-run also invalidate joins — re-opening the exact over-invalidation
 this spec exists to close.
 Independently, a `relationships`-only run (descriptions stage not running this
 invocation) MUST **hydrate its working schema from the persisted on-disk enriched
 `_schema`** (AI descriptions + embeddings) so `llmProposals` runs with full
 description context, not raw column names. Today the relationship stage builds its
 schema from the bare snapshot (db comments only — `local-enrichment.ts:621,688,740`
 never merge the AI descriptions), so this also closes a latent gap: both the
 full-run and the relationships-only paths MUST feed `llmProposals` the
 best-available descriptions (fresh-this-run if `descriptions` ran, else on-disk) —
 behavior from inputs, not path.
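The "best-available descriptions" rule can be illustrated with a small resolver. This is a sketch under assumptions: `resolveDescriptions` and `buildProposalContext` are hypothetical helpers, descriptions are keyed by a plain `table.column` string, and the on-disk load is modeled as a lazy callback so it is only paid for when the `descriptions` stage did not run.

```typescript
type DescriptionMap = Map<string, string>; // hypothetical key: "table.column"

// Prefer descriptions produced this run; otherwise fall back to the on-disk ones.
function resolveDescriptions(
  freshThisRun: DescriptionMap | undefined,
  loadOnDisk: () => DescriptionMap,
): DescriptionMap {
  return freshThisRun ?? loadOnDisk();
}

// Both the full-run and relationships-only paths build proposal context the same way.
function buildProposalContext(columns: readonly string[], descriptions: DescriptionMap): string[] {
  return columns.map((column) => {
    const text = descriptions.get(column);
    return text === undefined ? column : `${column}: ${text}`;
  });
}
```

Routing both paths through one resolver is what keeps the behavior a function of the inputs rather than of the call path.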
 ### D6 — Scope: enrichment stages only, composable with existing flags
 `--stages` controls only the three enrichment stages. It is **orthogonal to and
 composable with** the existing `--no-query-history` flag — a pure joins backfill
 across everything is `ktx ingest --all --stages relationships --no-query-history`.
 Schema introspection still runs (it is the hash substrate and the enrichment base,
 and it is cheap — no LLM). The stage-name namespace is built as a **registry** so it
 can later extend to the broader scan phases (schema / query-history / source /
 memory) and subsume the inconsistent negative `--no-query-history` flag — but that
 unification is **out of scope** here.
 ## Requirements
 ### 1. Per-stage input hashes
 Each enrichment stage MUST key its cache lookup/save and (for `descriptions`) its
 resume record on a hash of only that stage's own inputs, per D1
 (`descriptions` ← snapshot + LLM identity; `embeddings` ← snapshot + embedding
 identity + a digest of the embedded description text; `relationships` ← snapshot +
 relationship settings + LLM identity). Changing one stage's inputs MUST invalidate
 **only** that stage. The single coarse `computeKtxScanEnrichmentInputHash` over
 `{ snapshot, mode, detectRelationships, providerIdentity, relationshipSettings }`
 MUST be removed in favor of per-stage computation. The stage store and the
 descriptions resume record MAY be recreated without a migration bridge (disposable
 local state).
 ### 2. `--stages` flag with strict validation
 `ktx ingest` MUST accept `--stages <comma-list>`, a non-empty subset of
 `descriptions,embeddings,relationships`, defaulting (when absent) to all three. An
 unknown or empty stage name MUST be a hard parse error (`InvalidArgumentError`),
 never silently ignored. The selected set MUST thread through to enrichment and gate
 which stage blocks run as `(mode/provider-eligible) ∩ (selected)` — one gate, no new
 `KtxScanMode` variant, no second selection path. A selected stage whose prerequisite
 is missing MUST fail or warn clearly, not silently no-op.
 ### 3. Selecting a stage force-re-runs it; unselected stages are preserved
 A stage named in `--stages` MUST re-enter its `compute()`, bypassing the
 completed-stage short-circuit, while still using the spec-20 per-table resume record
 so `descriptions` re-issues LLM calls only for still-failed tables (unchanged hash)
 and rebuilds fully on a changed hash. A stage **not** named MUST NOT run and MUST
 leave its on-disk artifacts untouched. The no-flag default MUST preserve spec-19
 cross-run resume (all eligible stages, completed-row short-circuit respected).
 ### 4. Downstream staleness is surfaced, never silent
 After a selective run, the run MUST emit a recoverable `enrichment_stage_stale`
 warning for every **unselected** stage whose current per-stage hash no longer
 matches its stored completed-row hash (derived from on-disk state, naming the stage
 and the cascade command). The embeddings hash MUST include a digest of the embedded
 description text so a later embeddings run self-heals after a description change. The
 run MUST NOT silently leave a stale-but-unflagged downstream and MUST NOT silently
 auto-cascade. A description change MUST NOT stale-flag `relationships`.
 ### 5. Relationships run with description context
 When the `relationships` stage runs without `descriptions` having run in the same
 invocation, it MUST hydrate its working schema from the persisted on-disk enriched
 `_schema` (AI descriptions + embeddings) so `llmProposals` has the same description
 context as a full enriched run, not bare column names. The full-run and
 relationships-only paths MUST feed `llmProposals` descriptions consistently.
 ### 6. No regression for normal ingests
 A normal `ktx ingest` with no `--stages` flag MUST produce the same artifacts as
 today (descriptions, embeddings, manifest, relationships) and MUST preserve spec-19
 cross-run resume and spec-20 per-table description resume. The per-stage hash split
 MUST NOT change a normal run's output, only which stages a *changed* input
 invalidates.
 ## Acceptance criteria
 - **Per-stage invalidation isolation:** flipping `scan.relationships.llmProposals`
  re-runs only `relationships` (descriptions + embeddings resolve from cache, no LLM
  description calls, no re-embedding); swapping the embeddings model re-runs only
  `embeddings`; switching the description LLM re-runs only `descriptions`. Verified by
  asserting no LLM description calls / no embed calls for the unaffected stages.
 - **Flag parse + validation:** `--stages relationships` and
  `--stages descriptions,embeddings` parse to the right set; `--stages foo`,
  `--stages` (empty), and `--stages descriptions,foo` each fail with a clear
  `InvalidArgumentError`.
 - **Resume-aware force-rerun:** on a connection whose `descriptions` stage completed
  with K failed/null tables (unchanged hash), `--stages descriptions` re-issues LLM
  calls for exactly those K tables and leaves the already-good descriptions
  untouched; the run completes and the K are now enriched. A changed descriptions
  hash instead rebuilds all tables.
 - **Preserve others:** after `--stages descriptions`, the on-disk `embeddings` and
  `relationships` artifacts are byte-stable (unselected stages did not run).
 - **Derived staleness warning:** after `--stages descriptions` changes the
  descriptions, the run emits `enrichment_stage_stale` for `embeddings` (its
  recomputed hash diverged) and does **not** emit it for `relationships` (decision
  X); a subsequent `--stages embeddings` clears it.
 - **Relationships context:** a `--stages relationships` run on an already-described
  connection feeds the on-disk AI descriptions into `llmProposals` (verified: the
  proposal prompt carries descriptions, not just column names).
 - **No regression:** a normal uninterrupted `ktx ingest` (no flag) yields identical
  artifacts and the same descriptions/embeddings/relationship output as today, with
  spec-19/20 resume intact.
 ## Non-goals
 - **Unifying `--stages` with the broader scan phases or `--no-query-history`.** The
  namespace is built to extend later; this spec ships only the three enrichment
  stages, composable with the existing query-history flag (D6).
 - **A new `KtxScanMode` variant or a second stage-selection path.** One gate,
  `(eligible) ∩ (selected)` (D2).
 - **Coupling `relationships` to description content** (decision X, D5). Improving
  descriptions does not invalidate or stale-flag joins.
 - **Auto-cascading downstream re-runs.** Staleness is surfaced as a warning; the
  operator chooses to cascade (D4).
 - **Capturing prompt/code-level description-prompt changes in the hash.** The
  descriptions hash keys on snapshot + LLM identity (config/model), not the prompt
  text; a pure prompt improvement that does not change a hash input will not
  force-rebuild already-good descriptions. Forcing that is out of scope — the
  operator changes a real input or selects the stage with a changed config.
 - **Re-implementing spec 19 (cross-run stage resume, completed-row store) or spec 20
  (per-table description resume, enforced timeout).** This spec composes above them:
  it splits the key those stages resume on and adds the CLI surface to select and
  force-re-run stages.
 - **A general per-phase incremental-flush framework.** The selection mechanism is the
  three enrichment stages; it is not a generic abstraction over every ingest phase.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns the
 design.
 - **Coarse hash → per-stage hashes** — `context/scan/enrichment-state.ts`
  (`computeKtxScanEnrichmentInputHash` `:78`, `ComputeKtxScanEnrichmentInputHashInput`
  `:57`): replace with per-stage hash functions (or one function taking a per-stage
  input slice). `context/scan/local-enrichment.ts` (`:611` single hash; the three
  `runEnrichmentStage` calls at `descriptions` ~`:635`, `embeddings` ~`:666`,
  `relationships` ~`:722`; `runEnrichmentStage` `:524` and its short-circuit
  `:538–547`). The `descriptions` hash also feeds `generateDescriptions`'
  `resumeStore.load(inputHash)` (`:345`).
 - **Provider-identity decomposition** — `context/scan/local-scan.ts`
  (`localScanProviderIdentity` `:241–255`, the enrichment call site `:498–537`):
  split into `llmIdentity` / `embeddingIdentity`, drop the redundant `mode` /
  `relationships` re-encoding, and pass each stage only its slice.
 - **`descriptionDigest`** — `context/scan/local-enrichment.ts` (`buildEmbeddings`
  `:457–486`) and `context/scan/embedding-text.ts` (`buildKtxColumnEmbeddingText`
  `:17–44`): digest the resolved per-column/table description text that the embeddings
  consume, and fold that digest into the embeddings hash.
 - **CLI flag** — `commands/ingest-commands.ts` (`:26–49` option declarations,
  `:51–104` action handler): add `--stages` with a custom parser that validates
  against the canonical stage registry (`KTX_SCAN_ENRICHMENT_STAGES` in
  `enrichment-state.ts:4`) and rejects unknown/empty names with `InvalidArgumentError`.
  Thread through `public-ingest.ts` (`KtxScanArgs` build `:969–978`, `mode: 'enriched'`
  `:973`) → `scan.ts` (`runKtxScan`) → `local-scan.ts` (`runLocalScan`) →
  `runLocalScanEnrichment`.
 - **Stage gating + force-rerun** — `context/scan/local-enrichment.ts`: gate each stage
  block on membership in the selected set (`descriptions` `:632`, `embeddings`
  `:663–665`, `relationships` `:720`); make a named stage bypass the completed-row
  short-circuit in `runEnrichmentStage` while the inner `compute()` keeps the spec-20
  per-table resume. `KtxLocalScanEnrichmentInput` (`:60–85`) gains the selected-stage
  set.
 - **Staleness detection + warning** — `context/scan/local-enrichment.ts` (after the
  stage blocks): recompute each unselected stage's current hash from on-disk state,
  compare to the stored completed-row hash, push a recoverable warning on mismatch.
  Add `enrichment_stage_stale` to the `KtxScanWarningCode` union in
  `context/scan/types.ts` (alongside `relationship_detection_partial`).
 - **Relationships description context** — `context/scan/local-enrichment.ts`
  (`schema` built at `:621`/`:688`, passed to `discoverKtxRelationships` `:736–746`):
  hydrate `schema` with the best-available descriptions (fresh-this-run or loaded from
  the on-disk `_schema` via `loadExistingManifestState`,
  `local-enrichment-artifacts.ts`) before relationship detection.
 - **Stage store + resume record** —
  `context/scan/sqlite-local-enrichment-state-store.ts`
  (`local_scan_enrichment_stages`, PK `(connection_id, stage, input_hash)`,
  `findCompletedStage`, `saveCompletedStage`); `createKtxScanDescriptionResumeStore`
  (`local-enrichment-artifacts.ts:286–332`, path `:265–267`, inputHash gate
  `:305–307`) — both now keyed on the relevant per-stage hash. No migration bridge.
 - **Config inputs** — `context/project/config.ts` (`scanRelationshipsSchema`
  `:171–218` incl. `llmProposals` `:174` and `detectionBudgetMs`;
  `scan.enrichment.embeddings` model/dimensions/batchSize; `llm.models.default`,
  `llm.provider.gateway.base_url`): the sources of each per-stage identity slice.
 - **Tests** — per-stage invalidation isolation (flip one input, assert only the
  matching stage recomputes); `--stages` parse/validate (good subsets + unknown/empty
  rejected); resume-aware force-rerun (`--stages descriptions` retries only the null
  tables, leaves good ones, completes); preserve-others (unselected artifacts
  byte-stable); derived staleness (`enrichment_stage_stale` fires for embeddings after
  a descriptions change, not for relationships; cleared by a later `--stages
  embeddings`); relationships-only run feeds on-disk descriptions to `llmProposals`;
  regression — a normal no-flag ingest yields identical artifacts with spec-19/20
  resume intact.
 - After implementing, rebuild and re-link so the playground picks it up:
  `pnpm run build && pnpm run link:dev`.
 - **Docs:** add `--stages` to the `ktx ingest` CLI reference
  (`docs-site/content/docs/cli-reference/`) and note the per-stage cache behavior
  where enrichment/ingest is described.
 ## Benchmark context (motivation, not a requirement)
 Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A
 level-aware audit found (a) a tail of BigQuery datasets with poor *column*-description
 coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …) that want a **`descriptions`-only**
 re-run with a longer timeout, and (b) a desire to **backfill joins** across all
 already-ingested datasets after enabling `llmProposals` — without re-paying for
 descriptions. Both were blocked by the coarse single `inputHash` (flipping
 `llmProposals` or re-describing invalidated the whole enrichment) and the absence of a
 stage-selective CLI flag. The benchmark merely exercised multi-backend ingestion
 at scale; the gap and the fix are generic production operability. Do not
 encode any benchmark specifics in the implementation.
 ## Implementation notes
 Shipped on branch `write-feature-spec-wiki`. All six requirements implemented;
 all acceptance criteria covered by tests.
 **What was built / where:**
 - **Per-stage hashes (D1, Req 1).** `context/scan/enrichment-state.ts`: removed the
  coarse `computeKtxScanEnrichmentInputHash` and added
  `computeKtxDescriptionsStageHash` (snapshot + `llmIdentity`),
  `computeKtxEmbeddingsStageHash` (snapshot + `embeddingIdentity` + `descriptionDigest`),
  `computeKtxRelationshipsStageHash` (snapshot + `relationshipSettings` + `llmIdentity`),
  plus `computeKtxScanDescriptionDigest` and the `KtxScanLlmIdentity` /
  `KtxScanEmbeddingIdentity` types. `KTX_SCAN_ENRICHMENT_STAGES` is now exported as the
  canonical registry. `local-scan.ts` `localScanProviderIdentity` was split into
  `localScanLlmIdentity` + `localScanEmbeddingIdentity` (dropping the redundant
  `mode`/`relationships` re-encoding). `mode`/`detectRelationships` dropped out of the
  keys. No migration bridge — the stage store + descriptions resume record just miss the
  old coarse-keyed rows.
 - **`descriptionDigest` (D1/D4).** `local-enrichment.ts`: extracted
  `buildKtxColumnEmbeddingTexts(snapshot, descriptions)`, shared by the embeddings stage
  and the digest, so the embeddings hash content-addresses the exact text the model sees.
 - **`--stages` flag (D2/D6, Req 2).** `commands/ingest-commands.ts`:
  `parseEnrichmentStagesOption` (Commander parser) validates against the registry,
  rejects unknown/empty with `InvalidArgumentError`, returns an ordered de-duplicated
  set; threaded through `KtxPublicIngestArgs` → `context-build-view` → `KtxScanArgs` →
  `RunLocalScanOptions` → `KtxLocalScanEnrichmentInput`. One gate
  (`(eligible) ∩ (selected)`); no new `KtxScanMode`. A selected-but-ineligible stage
  emits a new `enrichment_stage_skipped` warning (never a silent no-op).
 - **Force-rerun (D3, Req 3).** `runEnrichmentStage` gained `forceRecompute`; a named
  stage bypasses the spec-19 completed-row short-circuit while `generateDescriptions`
  still consults the spec-20 per-table resume record (retries only failed tables on an
  unchanged hash).
 - **Descriptions hydration + `llmProposals` context (D5, Req 5).** `runLocalScanEnrichment`
  resolves best-available descriptions (fresh-this-run, else on-disk via a lazy
  `loadPriorDescriptions` thunk wired from `local-scan.ts` →
  `loadOnDiskDescriptionUpdates` in `local-enrichment-artifacts.ts`). `snapshotToKtxEnrichedSchema`
  now merges `ai` descriptions, and `relationship-llm-proposal.ts` `buildEvidencePacket`
  now carries the resolved description text — closing the latent gap on **both** the
  full-run and relationships-only paths.
 - **Derived staleness (D4, Req 4).** `enrichment_stage_stale` warning code +
  `findLatestCompletedStage` on the state store (interface + sqlite + test store). After a
  selective run, each unselected stage with a completed row is compared against its
  freshly recomputed hash; a mismatch warns and names the cascade command. Relationships
  are never flagged by a description change (decoupled per D5).
 - **Docs.** `docs-site/content/docs/cli-reference/ktx-ingest.mdx`: `--stages` flag row, a
  "Selecting enrichment stages" section (per-stage cache, force-rerun, staleness), and
  examples.
 **Deviation from the spec: relationships-only hydration covers descriptions, not embeddings.** D5 states a
 relationships-only run should hydrate "AI descriptions **and** embeddings" from the
 on-disk `_schema`. Investigation found the `_schema` manifest shards store only
 descriptions; embedding vectors are written to a **syncId-scoped** `enrichment/embeddings.json`
 that no code reads back, and each run mints a fresh syncId — so there is no durable
 per-connection embeddings artifact to hydrate from. A relationships-only run therefore
 hydrates **descriptions** (required for, and verified against, the `llmProposals`
 acceptance criterion) but **not** embeddings. Consequence: a `--stages relationships`
 backfill gets deterministic + name-based + LLM-proposed candidates (the point of
 `llmProposals`), but not the embedding-similarity candidates a full run would add.
 Durable embeddings hydration (persist vectors at a stable per-connection path, or read
 them from the vector index) is a clean follow-on and was left out of scope.
 **Tests:** `enrichment-state.test.ts` (per-stage hash stability + isolation),
 `commands/ingest-commands.test.ts` (parser good/bad subsets, threading, text-capture
 guard), `local-enrichment.test.ts` (force-rerun bypasses short-circuit + preserves
 others, naming all three forces a full recompute, per-stage invalidation isolation,
 prerequisite warning, on-disk descriptions reach `llmProposals`, resume-aware forced
 descriptions rerun, derived `enrichment_stage_stale` fires for embeddings/not
 relationships and clears after re-embed). Full `pnpm --filter @kaelio/ktx run test`,
 `type-check`, `dead-code`, and `build` pass. (One pre-existing unrelated failure in
 `test/skills/analytics-skill-content.test.ts` — the analytics `SKILL.md` lacks a
 `**Window functions**` heading the test expects — was present before this work and left
 untouched.)
 ---
 ## ⚠️ Defect found in post-implementation validation (2026-06-24)
 **`--stages` subset excluding `descriptions` WIPES existing on-disk descriptions.** Violates Req 3
 (unselected stages are preserved: a selective run never deletes another stage's artifacts).
 **Reproduction (deterministic):**
 - `northwind` before: 110 `ai:` column/table descriptions, 0 join edges.
 - `ktx-dev ingest northwind --stages relationships` → completes in ~35s, adds **22 join edges** ✅
  but the rewritten `public.yaml` has **0 descriptions** (no `ai:`, no `db:`, columns bare). ❌
 - A full `ktx-dev ingest northwind` (all stages) restores 110 descriptions + keeps the 22 joins.
 **Likely root cause:** the relationships-only path rewrites the schema from the raw snapshot + only the
 freshly-run stage. The implementation notes claim `snapshotToKtxEnrichedSchema` merges `ai` descriptions
 and that descriptions are hydrated "fresh-this-run, else on-disk via `loadPriorDescriptions`" — but on the
 **write path** of a subset run the prior descriptions are NOT merged into the emitted schema (they reach
 the `llmProposals` evidence packet only). So the on-disk `_schema` loses them.
 **Impact:** blocks the intended joins-everywhere backfill (`--stages relationships` across all dbs) and the
 `--stages descriptions`-only re-runs — either would destroy the unselected stage's artifacts across every
 db. Caught on a 1-db validation before any rollout.
 **Acceptance fix:** after any `--stages` subset, the on-disk `_schema` must **retain all prior `ai:`/`db:`
 descriptions** (and prior joins when descriptions-only) for stages not named — only the named stages'
 artifacts change. Add a regression test that ingests a fully-enriched fixture, runs `--stages relationships`,
 and asserts description count is unchanged while joins increase.
 ### ✅ Fixed (2026-06-24)
 **Real root cause (deeper than the first diagnosis):** the wipe happened in **two** places, and the first
 fix attempt only addressed one. `runLocalScan` (`context/scan/local-scan.ts`) writes the **structural**
 manifest shard from the bare snapshot *before* enrichment runs; that write merges with the on-disk shard,
 but the merge (`mergeDescriptionsPreservingExternal`, `live-database/manifest.ts`) treats `ai`/`db` as
 **scan-managed** and overwrites them with whatever the run emits — and the structural write emits none. So a
 subset run deleted the descriptions on the structural pre-write, *then* `runLocalScanEnrichment` read the
 already-wiped shard via `loadPriorDescriptions` and had nothing to restore. (A unit-level enrichment test
 passed because it never exercised the structural pre-write — a divergent-harness miss; the regression test
 was rewritten to go through the full `runLocalScan` path.)
 **What changed:**
 - `runLocalScanEnrichment` (`local-enrichment.ts`) now returns the **best-available** descriptions
  (`resolveDownstreamDescriptions()` — fresh-this-run if `descriptions` ran, else the on-disk ones) as
  `descriptionUpdates`, instead of `[]` when the stage is skipped — so the enrichment write re-applies them.
 - `runLocalScan` (`local-scan.ts`) now, on a subset run, **captures the prior on-disk descriptions before
  the structural manifest write** and feeds them to both the structural write and enrichment — so the
  structural pre-write preserves them too (robust even if relationship detection later fails).
 - Joins were already preserved for `--stages descriptions` via the existing manual/inferred
  `preservedJoins` path; verified by a symmetric test.
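The wipe and its fix can be modeled in a few lines. This is a deliberately reduced sketch, not the real `mergeDescriptionsPreservingExternal`: the `Descriptions` shape, the `external` key, and both function names are hypothetical, chosen only to show why a structural write that emits nothing erased scan-managed text and how re-emitting the captured prior values avoids that.

```typescript
// Hypothetical shape: `ai` and `db` are scan-managed; `external` stands for any
// description the scan does not own.
type Descriptions = { ai?: string; db?: string; external?: string };

// Scan-managed keys come only from what the run emits; external keys survive.
function mergeDescriptions(onDisk: Descriptions, emitted: Descriptions): Descriptions {
  return { external: onDisk.external, ai: emitted.ai, db: emitted.db };
}

// Subset-run behavior after the fix: when `descriptions` is not selected, capture
// the prior scan-managed text before the write and re-emit it.
function structuralWrite(onDisk: Descriptions, descriptionsSelected: boolean): Descriptions {
  const emitted: Descriptions = descriptionsSelected ? {} : { ai: onDisk.ai, db: onDisk.db };
  return mergeDescriptions(onDisk, emitted);
}
```

Calling `mergeDescriptions(onDisk, {})` reproduces the defect in this model: the result has no `ai` or `db` text even though the on-disk shard had it.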
 **Tests:** `local-scan.test.ts` — a full `runLocalScan` `--stages relationships` run preserves on-disk `ai`
 descriptions while adding a join (RED without the fix, GREEN with it). `local-enrichment.test.ts` — the
 enrichment-layer contract (`--stages relationships` preserves descriptions / `--stages descriptions`
 preserves joins).
 **Live validation (northwind, 15 tables):** `--stages relationships` BEFORE `ai:110 joins:22` → AFTER
 `ai:110 joins:22` (descriptions intact; previously wiped to 0). `--stages descriptions` restored the
 descriptions from the spec-20 resume record (`ai:0 → ai:110`) with **no** LLM calls while keeping `joins:22`.
 Full `pnpm --filter @kaelio/ktx run test` (3089 passed), `type-check`, `dead-code`, and `build` pass.
--- a/spider2-specs/specs/22-resumable-and-fault-tolerant-source-ingest.md
+++ b/spider2-specs/specs/22-resumable-and-fault-tolerant-source-ingest.md
@ -1,463 +0,0 @@
 # Resumable and fault-tolerant source ingest
 > Refined spec. No intake draft — surfaced by a real user report, not the
 > playground agent (see Motivation). Lives beside the analogous scan-durability
 > specs 19/20.
 >
 > **Scope: make `ktx ingest` (the source-ingest work-unit pipeline behind dbt /
 > Metabase / Notion) survive interruption and partial failure on large
 > projects.** Two compounding gaps live on the source-ingest path: (1) an
 > interrupted run restarts every work unit from scratch — there is no cross-run
 > reuse of already-generated work-unit output, so a multi-day dbt ingest loses
 > *all* progress to a single VPN/network blip; (2) the final integration gate is
 > all-or-nothing — one artifact that cannot pass it (after LLM repair) discards
 > the **entire** run with nothing committed. This is the source-ingest analog of
 > spec 19 (move the durability boundary to the cost boundary so expensive LLM
 > work is not lost) and spec 20 (a stage survives an interruption with per-item
 > durability). It **reuses** the same content-keyed durability primitive those
 > specs established rather than copying it.
 ## Problem
 Two independent failure modes on the source-ingest work-unit (WU) pipeline,
 both confirmed in the current code, both observed by a user on a ~2-day dbt
 ingest. Their union makes large-project ingest brittle: any interruption is
 total loss, and any single unfixable artifact at the end is total loss.
 ### 1. An interrupted run resumes nothing — every work unit re-runs
 `IngestBundleRunner` (`context/ingest/ingest-bundle.runner.ts`) executes a run as
 a sequence of stages: fetch → parse/extract into **work units** → run each WU as
 an isolated agent loop in a child worktree (`runIsolatedWorkUnit` →
 `executeWorkUnit`, `stages/stage-3-work-units.ts`) → integrate the successful WU
 patches → reconcile → finalize → final gates → one atomic squash commit
 (`squashMergeIntoMain`, ~2716). The WU stage is where the LLM cost lives: each WU
 is an agent loop that reads its `rawFiles`/`dependencyPaths` and writes SL/wiki
 artifacts, producing a git patch (`WorkUnitOutcome.patchPath` /
 `patchTouchedPaths`, `stage-3-work-units.ts:31-46`).
 The only persisted cross-run state is `SqliteBundleIngestStore`
 (`context/ingest/sqlite-bundle-ingest-store.ts`): run metadata, the final report,
 and provenance — all written at or near **run completion**. There is **no
 checkpoint of completed WU output**. A run that dies mid-flight (the user's
 VPN/network drop) leaves nothing reusable: the next `ktx ingest` re-fetches,
 re-parses, and **re-executes every WU from scratch**, re-paying the entire LLM
 cost. The store even keys `job_id` UNIQUE, so a re-run is a brand-new job with no
 relationship to the interrupted one.
 > Observed (user report, large dbt project): a run reached deep into its
 > work-unit progress and was lost to a network blip; the follow-up run started
 > over from zero. On a ~2-day ingest this is the difference between a 5-minute
 > resume and a 2-day redo.
 ### 2. The final integration gate is all-or-nothing
 After all surviving WUs are integrated, `validateFinalIngestArtifacts`
 (`context/ingest/artifact-gates.ts:96`) runs the final gate. It checks, across
 the *integrated* tree:
 - **intrinsic source validity** — `validateTouchedSources` →
  `validateWuTouchedSources` (`stages/validate-wu-sources.ts:124`) →
  `validateSingleSource` (`context/sl/tools/sl-warehouse-validation.ts:56`),
  which runs a **live warehouse dry-run** (`SELECT * FROM (sql) LIMIT 1`);
 - **cross-artifact references** — dangling join targets
  (`findJoinTargetErrors`, `validate-wu-sources.ts:89`), dangling `wiki→wiki`
  refs (`validateWikiRefs` → `findMissingWikiRefs`), broken `wiki→sl_ref`s
  (`validateWikiSlRefs`, `artifact-gates.ts:39`), and broken wiki body refs
  (`findInvalidWikiBodyRefs`).
 On any error it **`throw`s a single concatenated string** (`artifact-gates.ts:129`).
 The runner catches it, runs the LLM repair `repairFinalGateFailure`
 (`runner.ts:2595`, `maxAttempts: 2`), and if repair still fails, **re-throws**
 (`runner.ts:2623`) → `markFailed` → the squash never runs → `commitSha: null`
 (`runner.ts:2729`) → **the whole run is discarded, nothing committed.**
 The crucial asymmetry: a WU that fails *on its own terms* never reaches this gate
 — `executeWorkUnit` already validates each WU in isolation (`validateWikiRefs`
 ~143, `validateTouchedSources` ~150) and **soft-fails** it (`failWithReset`,
 ~155: the WU resets, is excluded from integration, and the run continues). So by
 the time the final gate runs, intrinsic single-source failures are rare. The
 gate fails predominantly on **cross-artifact dangling references**: WU-A's source
 joins to a source WU-B was meant to create, but WU-B failed or was excluded, so
 A's join now points at nothing. Each WU passed *alone*; the break only appears
 once the survivors are integrated — and that break currently nukes the run.
 > Observed (user report): a run completed all task generation and then failed at
 > the final integration gate on a **single model**; because the gate is
 > all-or-nothing, that one failure discarded an ~18h run with nothing committed.
 ## Generic use case (independent of any benchmark)
 Anyone ingesting a large warehouse/BI/dbt project with an LLM pipeline will hit
 both failures. Large ingests run long enough that an interruption is a *when*,
 not an *if* (laptop sleep, VPN reconnect, transient provider error, an operator
 ctrl-C on an apparently-stuck run), and a large artifact set makes it
 near-certain that *some* model lands a cross-reference its sibling didn't
 produce. Without cross-run reuse, every interruption is a from-scratch redo of
 the dominant (LLM) cost; without partial commit, one unfixable artifact throws
 away every good one. Both fixes make large-project ingest **resilient and
 resumable**: an interruption costs only the unfinished work, and a single bad
 model costs only that model — not the run. This is core robustness for a
 general-purpose ingestion product.
 ## Design decisions (resolved during refinement)
 These resolve the design space explored during refinement. They constrain the
 implementer; the exact code is theirs (requirement-level, per the specs README).
 ### D1 — Resume is automatic and content-keyed at the work-unit level
 A successful WU's output is cached across runs, keyed by a **content hash of its
 inputs**, with **no `--resume` flag**. Re-running the same `ktx ingest`
 transparently replays any WU whose inputs are byte-identical to a cached success
 and re-runs only the changed, failed, or missing WUs. The key is computed over:
 the contents of the WU's `rawFiles` + `dependencyPaths` (the bytes the WU reads,
 `types.ts:19-28`), the adapter/source identity, and a **version/prompt
 fingerprint** (ktx version + the WU system/user prompt + model role). A changed
 dbt model busts only that model's entry; everything unchanged replays for free.
 > No flag, no config knob. Content-keying makes resume automatic; a flag would
 > double the state space for no benefit. This is the same shape scan uses
 > (`computeKtxScanEnrichmentInputHash`, spec 19), reached here for the WU
 > pipeline.
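A minimal sketch of how such a content key could be computed, assuming SHA-256 over length-prefixed fields. The `WorkUnitKeyInput` shape and the function name are hypothetical; only `rawFiles`, `dependencyPaths`, and the listed fingerprint components come from the spec.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape: `files` stands for the bytes of rawFiles + dependencyPaths.
interface WorkUnitKeyInput {
  files: Map<string, Uint8Array>; // path -> file bytes
  sourceKey: string;              // adapter/source identity
  ktxVersion: string;
  prompt: string;                 // WU system + user prompt text
  modelRole: string;
}

function computeWorkUnitInputHash(input: WorkUnitKeyInput): string {
  const hash = createHash("sha256");
  // Length-prefix every field so ("ab", "c") and ("a", "bc") cannot collide.
  const put = (value: Uint8Array | string): void => {
    const buf = typeof value === "string" ? Buffer.from(value, "utf8") : Buffer.from(value);
    hash.update(`${buf.length}:`);
    hash.update(buf);
  };
  put(input.sourceKey);
  put(input.ktxVersion);
  put(input.prompt);
  put(input.modelRole);
  // Sort paths so the key does not depend on Map insertion order.
  for (const path of [...input.files.keys()].sort()) {
    put(path);
    put(input.files.get(path)!);
  }
  return hash.digest("hex");
}
```

Any change to a file's bytes, the ktx version, the prompt, or the model role yields a different key, which is the bias toward recompute that the spec asks for.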
 ### D2 — The cached unit is the successful WU's patch; replay verifies or recomputes
 The cache stores a successful WU's **output artifacts**: its git patch
 (`patchPath` content / `patchTouchedPaths`) plus the metadata integration needs
 (`actions`, `touchedSlSources`, `slDisallowed`). On a cache hit, the runner
 **replays the patch** into the session worktree — no agent loop, no LLM — exactly
 where it would have integrated a freshly-run WU. If a cached patch **fails to
 apply** (the surrounding tree drifted), the entry is discarded and the WU
 **recomputes**. So a stale hit degrades to "recompute," never to a corrupt tree:
 the cache can only make a run faster, never wrong.
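The replay-or-recompute rule can be sketched as follows. This is a synchronous toy model under stated assumptions: `applyPatch` and `runAgentLoop` are hypothetical callbacks standing in for patch application and the agent loop, and the cache is any map-like store. It also shows the success-only caching described in D4.

```typescript
// Hypothetical types; only patch / patchTouchedPaths echo names from the spec.
interface CachedOutcome { patch: string; patchTouchedPaths: string[] }
type RunResult = { ok: true; outcome: CachedOutcome } | { ok: false; error: string };

interface WorkUnitCache {
  get(key: string): CachedOutcome | undefined;
  set(key: string, value: CachedOutcome): void;
  delete(key: string): void;
}

function replayOrRecompute(
  key: string,
  cache: WorkUnitCache,
  applyPatch: (patch: string) => boolean, // false when the patch no longer applies
  runAgentLoop: () => RunResult,
): { source: "cache" | "fresh" | "failed"; outcome?: CachedOutcome } {
  const hit = cache.get(key);
  if (hit) {
    // Cache hit: replay the stored patch, no agent loop and no LLM call.
    if (applyPatch(hit.patch)) return { source: "cache", outcome: hit };
    // Stale hit: discard the entry and fall through to a normal run.
    cache.delete(key);
  }
  const result = runAgentLoop();
  // Failures are never stored, so the next run retries this work unit.
  if (!result.ok) return { source: "failed" };
  cache.set(key, result.outcome);
  return { source: "fresh", outcome: result.outcome };
}
```

A stale entry costs one failed apply plus a normal run, which is the "degrades to recompute" behavior the decision describes.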
 ### D3 — One durability primitive, shared by scan and ingest
 Per the "one capability, one implementation" rule, the content-keyed store is
 **extracted** into a shared primitive and **both** scan and ingest route through
 it — not copied. Scan's `sqlite-local-enrichment-state-store.ts` (PK
 `(connection_id, stage, input_hash)`, `findCompletedStage` / `saveCompletedStage`)
 and its `inputHash` computation (`enrichment-state.ts`) are generalized to a
 content-keyed result cache; scan is migrated onto the shared primitive **in the
 same change** so no second copy exists even transiently. The ingest cache is a
 new logical namespace (e.g. keyed `(connectionId, sourceKey, workUnitInputHash)`)
 on that one store.
 > Extract-and-share in one PR, not "build a copy for ingest now, unify later."
 > A temporary fork is exactly the divergence the rule forbids; the one-time
 > extraction cost is paid once and both paths benefit from every later fix.
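One way to picture a single store serving both callers is a key made of a namespace, a scope, and the input hash. The sketch below is an in-memory stand-in, not the SQLite implementation; the class name, `ContentKey`, and the method names are illustrative (the method names only echo `findCompletedStage` / `saveCompletedStage`).

```typescript
// Hypothetical key shape covering both the scan PK and the ingest namespace.
interface ContentKey {
  namespace: string; // e.g. "scan" or "ingest"
  scope: string[];   // e.g. [connectionId, stage] or [connectionId, sourceKey]
  inputHash: string;
}

class InMemoryContentResultCache<T> {
  private readonly rows = new Map<string, T>();

  // JSON-encode the parts so differently split scopes cannot alias each other.
  private static rowKey(key: ContentKey): string {
    return JSON.stringify([key.namespace, key.scope, key.inputHash]);
  }

  findCompleted(key: ContentKey): T | undefined {
    return this.rows.get(InMemoryContentResultCache.rowKey(key));
  }

  saveCompleted(key: ContentKey, result: T): void {
    this.rows.set(InMemoryContentResultCache.rowKey(key), result);
  }
}
```

Scan and ingest would then differ only in the namespace and scope they pass, not in the storage code they run.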
 ### D4 — Only successes are cached; failures retry on the next run
 A failed WU is **not** recorded as terminal — the next run retries it. WU
 failures on this path are dominantly transient (network, provider stall, an LLM
 slip), and the user's explicit ask is "resume and finish the rest," so a failure
 must not be sticky. This deliberately differs from scan's stage store (which
 caches failed stages and re-throws): there the failure is the stage's
 deterministic verdict; here a WU failure is usually a blip to retry. Caching only
 successes also keeps the invariant simple — a cache entry always means "this
 exact input already produced this exact good output."
 ### D5 — The final gate becomes non-fatal: deterministic dangling-edge prune
 Replace the gate's fatal `throw`-after-repair with a deterministic reconciliation
 that always yields a committable, internally-consistent tree:
 1. `validateFinalIngestArtifacts` is refactored to **return structured findings**
   (the danglers it already computes internally — join targets, `wiki→wiki`,
   `wiki→sl_ref`, wiki body refs — plus any intrinsic source failure) instead of
   flattening them into a thrown string.
 2. **Drop the rare self-invalid source first.** A source that fails its *own*
   validation at the final gate (intrinsic — rare, since stage 3 already filters
   these) is removed, establishing the surviving artifact set.
 3. **Prune the dead edges in a single pass** over that surviving set. For each
   dangling reference — whether it pointed at an absent sibling or at a
   just-dropped source — **remove that reference from its owner** (drop the join
   entry, remove the `wiki ref` / `sl_ref`, remove the broken body link), keeping
   the owning artifact. Because nodes are dropped first (step 2) and pruning only
   removes edges, pruning **cannot create a new dangling edge, so one pass
   suffices; no fixpoint.**
 4. Re-run the gate to **confirm** the remainder is clean (warehouse dry-runs are
   cached per D6/D2, ref checks are in-memory, so this is cheap), then squash-commit
   the remainder. If the confirm pass *still* fails, that is a real bug — fail the
   run loudly rather than commit a dirty tree.
 `repairFinalGateFailure` (the LLM repair, `runner.ts:2595` / `final-gate-repair.ts`)
 is **removed**. The deterministic prune supersedes it for the referential class,
 and the rare intrinsic case is handled by drop.
 > **Prune the edge, do not cascade the node.** The rejected alternative drops the
 > *referencing artifact* and, transitively, everything that referenced *it* — a
 > node-quarantine fixpoint that cascades healthy artifacts and needs a closure
 > search, a confirm loop, and an un-apply step. Pruning the dead edge keeps the
 > dependent intact (minus one pointer that never resolved anyway), needs no
 > fixpoint, and acts on findings the gate already produces.
 >
 > **Why remove the LLM repair rather than keep it as a pre-prune step.** Repair
 > can occasionally *fix* a ref (e.g. correct a typo'd source name) where prune
 > merely deletes it, preserving marginally more content. We drop it anyway:
 > determinism beats an LLM round-trip with variance on the commit path, prune
 > guarantees a commit where repair could only `throw`, and deleting it is a net
 > maintenance reduction. The decision is reversible — repair could later run as a
 > best-effort pass *before* prune — but the default is prune-only.
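Steps 2 to 4 can be sketched on a toy graph, assuming each artifact is a node and each join or ref is an outgoing edge. All names here are illustrative and `selfValid` stands in for the intrinsic warehouse dry-run verdict.

```typescript
// Illustrative model only; none of these names are ktx APIs.
interface Artifact {
  id: string;
  refs: string[];     // joins, wiki refs, sl_refs, body links
  selfValid: boolean; // stand-in for the intrinsic source validation
}

interface Edge { owner: string; target: string }

// The referential part of the gate: every ref whose target is absent.
function findDanglers(artifacts: Artifact[]): Edge[] {
  const present = new Set(artifacts.map((a) => a.id));
  return artifacts.flatMap((a) =>
    a.refs.filter((t) => !present.has(t)).map((t) => ({ owner: a.id, target: t })),
  );
}

function dropThenPrune(artifacts: Artifact[]) {
  // Step 2: drop self-invalid nodes first, which fixes the surviving node set.
  const dropped = artifacts.filter((a) => !a.selfValid).map((a) => a.id);
  const kept = artifacts.filter((a) => a.selfValid);
  // Step 3: one pass over the survivors, removing only dead edges.
  const pruned = findDanglers(kept);
  const present = new Set(kept.map((a) => a.id));
  const survivors = kept.map((a) => ({ ...a, refs: a.refs.filter((t) => present.has(t)) }));
  // Step 4: confirm; a dangler here would be a bug, so fail loudly.
  if (findDanglers(survivors).length > 0) throw new Error("confirm gate failed after prune");
  return { survivors, dropped, pruned };
}
```

Because step 3 never removes a node, the set of present targets is the same before and after pruning, so no surviving edge can become dangling and a second pass has nothing to do.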
 ### D6 — Prune runs on the integrated tree, never poisons the cache (resume ∘ prune compose)
 Pruning is applied to the **integrated session worktree** at gate time and is
 **re-derived from the current survivor set on every run**. It MUST NOT mutate the
 cached WU patches (D2). This makes resume and prune compose correctly and
 **self-heal**:
 - Run 1: WU-A (joins to B) succeeds and is cached *with its join intact*; WU-B
  fails; the gate prunes A's join-to-B from the integrated tree and commits A
  without it.
 - Run 2 (after the root cause is fixed): A's input is unchanged → A **replays
  from cache with its join restored**; B now succeeds and exists; the gate finds
  no dangler and commits both, fully linked.
 So a ref pruned because of a sibling's failure costs nothing permanent: fixing
 the sibling and re-running restores the link for free. The cache stores
 intent (the WU's real output); prune is a per-run consistency projection over
 whatever survived.
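The two-run scenario above can be replayed in a toy model, assuming inputs stay unchanged between runs so every cached work unit replays. The cache keeps each output with its refs intact, and the committed tree is recomputed from the cache on every run. Function and type names are hypothetical.

```typescript
// artifact id -> ids it references, exactly as the work unit produced them
type Outputs = Map<string, readonly string[]>;

// Per-run projection: keep only refs whose target exists in this run's survivor set.
function pruneProjection(intended: Outputs): Map<string, string[]> {
  const committed = new Map<string, string[]>();
  for (const [id, refs] of intended) {
    committed.set(id, refs.filter((target) => intended.has(target)));
  }
  return committed;
}

function simulateRun(cache: Outputs, succeededNow: Outputs): Map<string, string[]> {
  // Success-only caching: store each fresh output unmodified.
  for (const [id, refs] of succeededNow) cache.set(id, refs);
  // The prune result is derived, never written back into the cache.
  return pruneProjection(cache);
}
```

Run 1 with only A succeeding commits A without its join; run 2 with B succeeding commits A with the join restored, because the cached entry for A was never edited.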
 ### D7 — Pruning is faithful and never silent
 A pruned reference was, by definition, non-functional (its target was absent), so
 removing it loses nothing executable — and removing dangling SL joins is already
 the established fix for the SL engine's eager orphan-join rejection. Every prune
 and every drop MUST be **recorded in the run report and a trace event** naming
 the artifact, the removed reference, and the absent target. The report status
 MUST reflect partial completion (extend the existing `failedWorkUnits`
 mechanism, `IngestBundleResult`, `types.ts:204-213`, with the pruned-refs /
 dropped-sources detail) so a partial run is visibly partial, never a silent
 "success."
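A hedged sketch of what the extended report detail and a derived outcome could look like. The field names and `deriveOutcome` are illustrative, not the actual `IngestBundleResult` shape, and treating a non-empty `failedWorkUnits` as partial is an assumption.

```typescript
// Hypothetical record shapes naming the artifact, the removed ref, and the absent target.
interface PrunedReference { artifact: string; removedRef: string; absentTarget: string }
interface DroppedSource { source: string; reason: string }

interface ReportDetail {
  failedWorkUnits: string[];
  prunedReferences: PrunedReference[];
  droppedSources: DroppedSource[];
}

type Outcome = "success" | "partial";

// A run is an unqualified success only when nothing failed, was pruned, or was dropped.
function deriveOutcome(report: ReportDetail): Outcome {
  const partial =
    report.failedWorkUnits.length > 0 ||
    report.prunedReferences.length > 0 ||
    report.droppedSources.length > 0;
  return partial ? "partial" : "success";
}
```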
 ### D8 — Cache state is regenerable; no migration bridge
 The WU cache is regenerable local state under `.ktx/`. Its on-disk/SQLite shape
 may change with **no migration bridge** — a stale-shaped or absent cache simply
 forces a full (non-resumed) run, exactly today's behavior. Consistent with ktx's
 no-backward-compatibility policy; the cache is an optimization, never a source of
 truth.
 ## Requirements
 1. **Cross-run WU resume, automatic and content-keyed.** A successful WU's output
   MUST be cached keyed by a content hash over its input bytes
   (`rawFiles` + `dependencyPaths`), the adapter/source identity, and a
   version/prompt fingerprint (ktx version + WU prompt + model role). Re-running
   `ktx ingest` MUST replay cached successes without an agent loop / LLM call and
   re-run only changed, failed, or missing WUs. No `--resume` flag and no config
   knob is added.
 2. **Replay verifies or recomputes.** On a cache hit the runner MUST replay the
   stored patch into the session worktree; if the patch does not apply cleanly the
   entry MUST be discarded and the WU recomputed. A cache hit MUST NOT be able to
   produce a tree different from what a fresh run of that WU would have integrated.
 3. **Only successes are cached.** A failed WU MUST NOT be recorded as terminal; it
   MUST be retried on the next run.
 4. **Conservative invalidation.** The input hash MUST change when the ktx version,
   the WU prompt, or the model role changes (bias toward recompute). Under-keying
   (stale reuse) is a correctness bug; over-keying (an unnecessary recompute) is
   acceptable.
 5. **The final gate is non-fatal.** A final-gate failure MUST NOT discard the run.
   `validateFinalIngestArtifacts` MUST return structured findings; the runner MUST
   deterministically **prune** every dangling reference from its owning artifact
   and **drop** any source that fails its own validation, then commit the
   remaining internally-consistent tree.
 6. **Single-pass prune, dependents survive.** Pruning MUST remove dead *edges*, not
   cascade-drop owning artifacts; it MUST complete in a single pass (no fixpoint)
   because edge removal cannot create new dangling edges. A dependent that loses
   one dangling ref MUST otherwise be committed intact.
 7. **Prune composes with resume.** Pruning MUST operate on the integrated tree and
   MUST NOT mutate cached WU patches. A reference pruned in one run because its
   target was absent MUST be restored automatically on a later run once the target
   exists (resume replays the owner's intact patch).
 8. **Confirm before commit.** After pruning/dropping, the gate MUST be re-run on
   the remainder and MUST pass before the squash; if it still fails the run MUST
   fail loudly rather than commit a dirty tree.
 9. **`repairFinalGateFailure` is removed.** The LLM final-gate repair path and its
   obsolete tests/branches MUST be deleted (no dormant compatibility path).
 10. **Every prune/drop is reported.** Each pruned reference and dropped source MUST
    be recorded in the run report and a trace event (artifact, removed ref, absent
    target). A run that pruned or dropped anything MUST report as partial, never as
    an unqualified success.
 11. **One shared durability primitive.** The content-keyed store MUST be a single
    implementation used by both scan and ingest; scan MUST be migrated onto it in
    the same change. No second copy may exist, even transiently.
 12. **No regression for clean runs.** A small, uninterrupted run whose every WU
    passes and whose final gate is clean MUST produce byte-identical artifacts and
    the same `commitSha`/report shape (modulo new, empty pruned/dropped fields) as
    today.
 ## Acceptance criteria
 - **Resume skips completed work:** interrupt an ingest after K of N WUs have
  succeeded; re-run the same command (unchanged inputs); the run issues **zero**
  agent loops / LLM calls for the K cached WUs, runs only the remaining N−K, and
  produces the same final artifacts as an uninterrupted run.
 - **Changed model busts only its entry:** edit one dbt model between runs; the
  re-run re-executes **only** the WU(s) whose input bytes changed and replays the
  rest from cache.
 - **Stale patch self-corrects:** a cached patch that no longer applies (forced
  drift in a test) causes that WU to recompute, not a corrupt tree or a crash.
 - **Failures retry:** a WU that fails in run 1 (transient error) is **not** cached;
  run 2 retries it and, on success, integrates it.
 - **One bad model no longer nukes the run:** a run where WU-B fails so WU-A's join
  to B dangles **commits** — A is committed with the dangling join **pruned**, the
  report lists the pruned ref, and `commitSha` is non-null (contrast: today this
  throws and commits nothing).
 - **No cascade:** in that scenario A (and any other artifact that only referenced
  B) is committed intact except for the single pruned reference; nothing healthy
  is dropped.
 - **Self-heal:** fix B's root cause and re-run; A replays from cache with its join
  intact, B succeeds, and the final tree commits both fully linked with no prune.
 - **Intrinsic drop:** a source that fails its own warehouse dry-run at the final
  gate (forced) is dropped, refs to it are pruned, and the rest commits; the drop
  is reported.
 - **Repair is gone:** `repairFinalGateFailure` and its tests no longer exist; the
  gate path has no LLM call.
 - **One store:** scan and ingest both resume through the same content-keyed
  primitive (one implementation; scan's behavior is unchanged by the migration —
  spec 19/20 acceptance still passes).
 - **Clean-run regression:** a small uninterrupted all-passing ingest yields
  identical artifacts, `commitSha`, and report (empty pruned/dropped fields) to
  today.
 ## Non-goals
 - **Resuming the cross-WU stages.** Reconciliation, finalization, and the final
  gate re-run every time; their inputs depend on the full survivor set and their
  cost is small relative to WU generation. Only WU generation is cached.
 - **A `--resume` flag or any timeout/cache config knob.** Content-keying makes
  resume automatic (D1); one opinionated default is the canonical ktx shape.
 - **Caching failed WUs as terminal.** Failures retry (D4).
 - **Node-cascade quarantine of the final gate.** Prune edges, do not drop
  dependents (D5). No closure search, confirm-loop-over-nodes, or un-apply step.
 - **Tolerating dangling references (warn instead of remove).** Unsafe — the SL
  engine eagerly rejects orphan joins — so dead edges must be removed, not kept.
 - **Keeping the LLM final-gate repair.** Removed (D5/req 9).
 - **A general per-stage resume framework beyond the shared content-keyed store.**
  The store is the one shared primitive (D3); this spec does not abstract every
  ingest stage into a resumable framework.
 - **Re-implementing spec 19/20 (scan durability).** This spec composes the same
  primitive onto the source-ingest WU pipeline.
 ## Implementation orientation
 Line numbers drift; treat these as anchors, not addresses. The implementer owns
 the design.
 - **Run flow + the all-or-nothing seam** — `context/ingest/ingest-bundle.runner.ts`:
  WU run + integration of successful patches (~1600–1900), the final-gate block
  (~2549–2587, `runFinalArtifactGates`), the repair-then-rethrow that must be
  replaced by prune (~2588–2644; the fatal `throw` ~2623), and the atomic squash
  (~2701–2729; `commitSha: null` when nothing is touched ~2729). The prune step
  slots between the gate findings and the squash, operating on `sessionWorktree`.
 - **Work units & cacheable output** — `context/ingest/types.ts` (`WorkUnit`
  ~19–28: `rawFiles`/`peerFileIndex`/`dependencyPaths`; `IngestBundleResult`
  ~204–213: extend with pruned/dropped detail);
  `context/ingest/stages/stage-3-work-units.ts` (`executeWorkUnit`; the per-WU
  validation + `failWithReset` ~134–157 that already soft-fails a WU;
  `WorkUnitOutcome` ~31–46 with `patchPath`/`patchTouchedPaths`/`actions`/
  `touchedSlSources` — the cache payload). The cache lookup/replay wraps the
  per-WU execution; only the agent-loop branch is skipped on a hit.
 - **The gate (make it return findings)** — `context/ingest/artifact-gates.ts`
  (`validateFinalIngestArtifacts` ~96; the internal per-artifact danglers from
  `validateWikiSlRefs` ~39, `validateWikiRefs` ~74, `findInvalidWikiBodyRefs`;
  the concatenated `throw` ~129 to replace with a structured return);
  `context/ingest/stages/validate-wu-sources.ts` (`validateWuTouchedSources` ~124;
  `findJoinTargetErrors` ~89 already returns missing join targets per source —
  the join-edge danglers to prune); `context/sl/tools/sl-warehouse-validation.ts`
  (`validateSingleSource` ~56 — the intrinsic warehouse dry-run; its failures are
  the drop set, not the prune set).
 - **Per-ref-type pruners (pair 1:1 with the validators)** — join: remove the
  offending `joins[]` entry from the source YAML; `wiki refs`/`sl_refs`: remove
  the entry from page frontmatter (`context/wiki/wiki-ref-validation.ts`
  `findMissingWikiRefs`); wiki body refs: remove the broken link token
  (`context/ingest/wiki-body-refs.ts` `findInvalidWikiBodyRefs`). Each pruner is
  deterministic and edits the integrated worktree only.
 - **Remove the LLM repair** — `context/ingest/final-gate-repair.ts`
  (`repairFinalGateFailure`) and the `constrained-repair.ts` usage for
  `final_artifact_gate`; delete the call site (~2595) and its tests.
 - **Durability primitive to extract & share** —
  `context/scan/sqlite-local-enrichment-state-store.ts` (`local_scan_enrichment_stages`,
  PK `(connection_id, stage, input_hash)`, `findCompletedStage`/`saveCompletedStage`),
  `context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78), and
  the resume wrapper `runEnrichmentStage` (`context/scan/local-enrichment.ts`).
  Generalize to a content-keyed result cache; migrate scan onto it; add the ingest
  namespace. The existing ingest store
  `context/ingest/sqlite-bundle-ingest-store.ts` (`SqliteBundleIngestStore`) is
  where ingest-side persistence lives — the WU cache sits alongside it under
  `.ktx/`.
 - **Tests** — resume: run an ingest against a real git-backed project with a fake
  agent runner, interrupt after K WUs, assert the re-run issues no agent loops for
  the K and the same artifacts result; changed-input bust; stale-patch recompute;
  failed-WU retry. Prune: a fixture where one WU fails so a sibling's join/wiki
  ref dangles → assert the run commits the sibling with the ref pruned, reports the
  prune, and `commitSha` is non-null; assert no cascade; assert self-heal on a
  follow-up run; assert intrinsic drop. Migration: spec 19/20 scan acceptance still
  green on the shared primitive. Regression: a small uninterrupted all-passing
  ingest is byte-identical to today.
 - After implementing, rebuild and re-link so the playground picks it up:
  `pnpm run build && pnpm run link:dev`.
 ## Motivation (the real report, not a benchmark)
 A user ingesting a fairly large dbt project (~2-day run) hit both gaps together.
 First, an interruption — a VPN drop / network blip — lost all progress because
 ingest cannot resume; they had to restart from scratch. Second, on a later run
 that completed all task generation, a **single model** failed the final
 integration gate, and because the gate is all-or-nothing the one failure
 discarded an ~18h run with nothing committed. Their ask: "some form of resume or
 checkpoint (or at least reusing the patches that were already generated), and a
 way to skip or quarantine a single failing model instead of failing the entire
 run." This spec delivers both — resume via the content-keyed WU cache, and
 partial commit via deterministic dangling-edge pruning. Unlike specs 19/20 this
 gap was surfaced by a real user on a real warehouse, not by the benchmark; the
 fix is generic production hygiene for any large ingest.
 ## Implementation notes
 Shipped on branch `write-feature-spec-wiki` (squash-merge target). All 12
 requirements and every acceptance criterion are covered by committed code and
 tests; the full `@kaelio/ktx` package suite is green.
 What was built and where:
 - **Shared content-keyed durability primitive** — `context/cache/content-result-cache.ts`
  + `sqlite-content-result-cache.ts` (`SqliteContentResultCache`, `local_content_results`).
  Scan was migrated onto it in the same change (`context/scan/sqlite-local-enrichment-state-store.ts`
  is now a thin adapter; the old `local_scan_enrichment_stages` table is dropped),
  so no second copy exists (D3 / req 11).
 - **Content-keyed WU cache + replay** — `context/ingest/work-unit-cache.ts`
  (`computeIngestWorkUnitInputHash` over raw/dependency bytes + source identity +
  CLI version + prompt fingerprint + model role; success-only `saveSuccessfulWorkUnitCache`).
  Replay/recompute and stale-recompute state refresh wrap the WU loop in
  `ingest-bundle.runner.ts` (D1/D2/D4 / reqs 1–4).
 - **Non-fatal final gate** — `artifact-gates.ts` `validateFinalIngestArtifacts`
  returns structured findings; `context/ingest/final-gate-prune.ts` deterministically
  drops self-invalid sources and prunes dangling edges in a single pass, then a
  confirm gate runs before squash (D5/D6 / reqs 5–8). `finalGatePrunedReferences`
  / `finalGateDroppedSources` are recorded in the report + trace and surface as a
  `partial` outcome (D7 / req 10). `repairFinalGateFailure` and its tests are
  deleted (req 9).
 Deviations / decisions worth noting (all preserve spec intent):
 - **Cache stores artifact content snapshots (payload schema v2), not just a raw
  git patch.** Replay materializes the owner's artifacts against the *current*
  base, so a ref pruned in one run because a sibling failed is restored for free
  on a later run once the sibling exists — without re-running the owner's agent
  loop (D2/D6 / req 7 self-heal). A drifted/stale snapshot degrades to recompute.
 - **Final-gate prune/drop resolves sources through the canonical
  `resolveSlSourceFile` resolver**, not a derived `semantic-layer/<conn>/<name>.yaml`
  path, so it works for uppercase / hash-derived source filenames (not only
  lowercase demo names).
 - **`executeWorkUnit` defers pruneable cross-artifact findings** (missing join
  target / wiki ref / sl_ref) to the final gate instead of soft-failing the WU;
  only intrinsic `source_validation` failures remain fatal at the WU level. This
  is what lets a sibling-failed WU's owner survive to be pruned rather than be
  excluded upstream (reqs 5–7, "no cascade").
 - The raw report record keeps `status: 'completed'`; partial completion is derived
  by `ingestReportOutcome` from the populated prune/drop fields.
--- a/spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md
+++ b/spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md
@ -1,66 +0,0 @@
 # Multi-connection routing guidance in the ktx-analytics skill
 ## Problem
 The agent-facing `ktx-analytics` skill (installed into agent environments via
 the ktx skills/install mechanism, see `.ktx/agents/install-manifest.json` in
 projects) describes the query workflow — wiki_search → sl_read_source →
 sl_query / sql_execution — but assumes the connection is obvious. In a
 multi-connection project nothing tells the agent to *first decide which
 connection the question is about*, and several tools silently require it:
 - `sql_execution`, `sl_read_source`, `entity_details`: `connectionId`
  **required**;
 - `sl_query`, `discover_data`, `dictionary_search`: optional, but
  auto-inference only works with exactly one connection
  (`local-query.ts` `resolveLocalConnectionId` ~29-38 — throws with zero or
  multiple connections).
 An agent that skips routing either errors out or, worse, queries the wrong
 database when names overlap.
 ## Generic use case
 Any ktx project with more than one connection — the common shape for a data
 org (warehouse + product DB + events DB). Routing is the first step of every
 question, and the skill should encode it so individual agents don't have to
 rediscover it.
 ## Requirements
 1. **Add an explicit routing step (step 0) to the skill's workflow:**
   - Call `connection_list` to see what exists.
   - Match the question's domain to a connection using connection ids/names,
     `discover_data` hits, and wiki context — not guesswork.
   - If genuinely ambiguous after discovery, ask the user rather than pick.
 2. **Thread the resolved `connectionId` everywhere:** all subsequent
   `sl_query`, `sql_execution`, `sl_read_source`, `entity_details`,
   `dictionary_search`, `discover_data` calls, and `wiki_search` once spec 01
   lands (search scoped to the resolved connection plus unscoped pages).
 3. **Single-connection projects stay frictionless:** the skill should say
   routing is trivial when `connection_list` returns one entry — don't add a
   mandatory ceremony step for the common simple case.
 4. **Capture routing knowledge:** when the agent learns a non-obvious
   question-domain → connection mapping, the skill should encourage
   `memory_ingest` so the mapping becomes wiki knowledge for next time.
 This is a docs/prompt change in the skill content (plus any skill-install
 plumbing if the skill is versioned); no engine changes required.
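The routing step is agent behavior rather than engine code, but its decision rule can be modeled. The sketch below assumes the agent has already called `connection_list` and tallied discovery/wiki hits per connection; `routeQuestion`, `Routing`, and the hit counts are hypothetical.

```typescript
interface Connection { id: string; name: string }

type Routing =
  | { kind: "resolved"; connectionId: string }
  | { kind: "ask_user"; candidates: string[] };

function routeQuestion(connections: Connection[], hitsByConnection: Map<string, number>): Routing {
  if (connections.length === 0) throw new Error("no connections configured");
  // Single-connection projects: routing is trivial, no extra ceremony.
  const only = connections.length === 1 ? connections[0] : undefined;
  if (only) return { kind: "resolved", connectionId: only.id };
  // Multi-connection: rank by evidence gathered from discovery and wiki context.
  const scored = connections
    .map((c) => ({ id: c.id, hits: hitsByConnection.get(c.id) ?? 0 }))
    .sort((x, y) => y.hits - x.hits);
  const [best, second] = scored;
  // Resolve only on a unique, non-zero winner; otherwise ask rather than guess.
  if (best && best.hits > 0 && (!second || best.hits > second.hits)) {
    return { kind: "resolved", connectionId: best.id };
  }
  return { kind: "ask_user", candidates: connections.map((c) => c.id) };
}
```

The resolved `connectionId` is then what gets threaded into every later tool call, as requirement 2 describes.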
 ## Acceptance criteria
 - In a fixture project with ≥2 connections, an agent following the skill
  resolves the correct connection before its first data query, and no tool
  call fails with "connectionId is required".
 - In a single-connection project the skill-driven flow is unchanged (no
  extra mandatory steps).
 - Skill text nowhere assumes a default/implicit connection.
 ## Benchmark context (motivation only)
 Spider 2.0-Lite local subset = 30 SQLite connections in one project; every
 one of the 135 questions targets exactly one of them. Connection ids are set
 to the benchmark's database names, so with this skill guidance routing is
 mechanical (`connection_list` + name match) and needs no benchmark-specific
 instructions — which is the point: the harness gives the agent only the
 question text.
--- a/spider2-specs/todo/04-offline-schema-docs-adapter.md
+++ b/spider2-specs/todo/04-offline-schema-docs-adapter.md
@ -1,51 +0,0 @@
 # Offline schema-documentation ingest adapter
 > **Priority: LOW / backlog.** Explicitly **not** needed for the Spider
 > 2.0-Lite benchmark — we verified the benchmark's offline schema files
 > (DDL dumps + sample-row JSONs) are a strict subset of what the live SQLite
 > scan already captures (DDL, types, PKs, sample values, cardinality
 > profiling). Implement specs 01-03 first; pick this up only if a real
 > use case shows up.
 ## Problem
 The ingest pipeline's schema knowledge comes from live database scans
 (`live-database` adapter) or BI-tool adapters (metabase, looker, dbt…).
 There is no adapter for **offline schema documentation**: files describing
 tables/columns that exist outside the database — column-description
 spreadsheets, data dictionaries, DDL exports with comments, hand-maintained
 schema docs.
 ## Generic use case
 Teams whose richest schema documentation lives outside `information_schema`:
 a wiki export of column meanings, a governance tool's CSV data dictionary,
 DDL files with COMMENT clauses the production scan can't see, or
 environments where ktx has no live access at all and must build the semantic
 layer from documentation alone.
 ## Requirements (sketch — refine when picked up)
 1. A new ingest adapter (peer of `metabase`/`dbt` in
   `context/ingest/adapters/`) consuming a configured local path of schema
   docs per connection.
 2. Input formats to start: DDL files (`.sql`/`.csv` of CREATE statements)
   and tabular column dictionaries (CSV/JSON: table, column, description,
   …). Extensible to other formats.
 3. Output: **enrichment, not duplication** — merge descriptions/metadata
   into the manifest-backed semantic-layer sources and dictionary for the
   matching connection. Where a live scan exists, offline docs fill gaps
   (descriptions, enum meanings, deprecation notes) and flag drift
   (documented column missing from live schema and vice versa) rather than
   creating parallel wiki pages that duplicate schema info.
 4. Works without live database access (documentation-only bootstrap of a
   connection's semantic layer), clearly marked as unverified-against-live.
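The "enrichment, not duplication" rule in requirement 3 can be sketched as a merge that only fills missing descriptions and reports drift in both directions. The types and `mergeOfflineDocs` are hypothetical, and exact, case-sensitive matching on `table.column` is an assumption.

```typescript
interface LiveColumn { table: string; column: string; description?: string }
interface DocColumn { table: string; column: string; description: string }

interface MergeResult {
  columns: LiveColumn[];
  documentedButMissingFromLive: string[];
  liveButUndocumented: string[];
}

const keyOf = (c: { table: string; column: string }): string => `${c.table}.${c.column}`;

function mergeOfflineDocs(live: LiveColumn[], docs: DocColumn[]): MergeResult {
  const docByKey = new Map(docs.map((d) => [keyOf(d), d] as const));
  const liveKeys = new Set(live.map((c) => keyOf(c)));
  // Fill gaps only: an existing live description is kept as-is.
  const columns = live.map((c) => ({
    ...c,
    description: c.description ?? docByKey.get(keyOf(c))?.description,
  }));
  return {
    columns,
    // Drift in both directions is reported rather than silently reconciled.
    documentedButMissingFromLive: docs.map((d) => keyOf(d)).filter((k) => !liveKeys.has(k)),
    liveButUndocumented: live.map((c) => keyOf(c)).filter((k) => !docByKey.has(k)),
  };
}
```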
 ## Acceptance criteria (sketch)
 - Given a connection with a live scan plus an offline column dictionary,
  semantic-layer sources carry the documented descriptions, and drift
  between doc and live schema is reported.
 - Given a connection with docs only (no live access), `sl list`/`sl read`
  expose manifest sources built from the docs.
 - No wiki pages are created that merely restate table/column lists.
--- a/spider2-specs/todo/05-composite-key-join-detection.md
+++ b/spider2-specs/todo/05-composite-key-join-detection.md
@ -1,59 +0,0 @@
 # Composite-key (multi-column) join detection
 > Priority: MEDIUM. Found empirically during the first Spider2-lite sqlite
 > smoke test (2026-06-13): relationship detection emitted **zero joins** for a
 > database whose fact tables are linked only by composite keys. Agents still
 > answered correctly by inferring the join from shared `grain`, so this didn't
 > cost benchmark points — but it forces inference that explicit joins would
 > remove, and the gap is generic.
 ## Problem
 Relationship detection appears to emit only single-column joins. For the IPL
 sqlite database, every table came back with `joins=0`, even though its fact
 tables are connected by a 4-column composite key
 (`match_id, over_id, ball_id, innings_no`) shared across `ball_by_ball`,
 `batsman_scored`, `extra_runs`, and `wicket_taken`. The semantic layer did
 correctly record that shared key as each table's `grain`, which is why agents
 could recover the relationship — but no `joins:` entries were produced for the
 fact-to-fact links.
 ## Generic use case
 Event/fact tables keyed by composite business keys are common: ledger lines
 (`account_id, period, line_no`), telemetry (`device_id, ts, metric`), sports
 ball-by-ball, EAV/log schemas. Whenever there are no single-column FKs but a
 multi-column key recurs across tables, ktx should detect and document the join
 so agents (and `sl_query`) don't have to infer it.
 ## Requirements
 1. Relationship detection considers **multi-column** join candidates, not just
   single-column ones. A strong signal already exists in ktx: when two tables
   share an identical (or subset/superset) declared `grain`, that grain is a
   prime composite-join candidate.
 2. Emitted joins carry the full composite condition, e.g.
   `on: a.match_id = b.match_id AND a.over_id = b.over_id AND a.ball_id = b.ball_id AND a.innings_no = b.innings_no`,
   with a sensible `relationship` cardinality.
 3. The existing validation/threshold machinery
   (`scan.relationships.acceptThreshold` etc.) applies to composite candidates
   too; profile-based validation should check join selectivity on the full key.
 4. No regression for single-column joins; don't explode combinatorially —
   bound candidate generation (e.g. only consider shared-grain keys and
   declared/!inferred PK overlaps, cap column count).
 5. `sl_query` can compile a join across a composite-key relationship.
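 One way to read requirements 1, 2, and 4 together is the sketch below: candidates come only from overlapping declared grains, the column count is capped, and the emitted `on` condition spans the full key. The function names, the cap value, and the cardinality labels are assumptions for illustration; they are not ktx identifiers.

```python
from itertools import combinations

MAX_KEY_COLUMNS = 6  # hypothetical cap that bounds candidate generation


def relationship(left_grain, right_grain, shared):
    """Derive a cardinality label from which side is unique on the shared key."""
    left_unique = set(shared) == set(left_grain)
    right_unique = set(shared) == set(right_grain)
    if left_unique and right_unique:
        return "one_to_one"
    return "one_to_many" if left_unique else "many_to_one"


def composite_join_candidates(grains, max_columns=MAX_KEY_COLUMNS):
    """Return (left, right, key_columns, relationship) for grain-sharing pairs.

    grains: dict mapping table name -> ordered list of grain columns.
    A pair qualifies when one grain equals or contains the other and the
    shared key has at least two columns (single-column joins are assumed to
    stay on the existing detection path).
    """
    candidates = []
    for left, right in combinations(sorted(grains), 2):
        lg, rg = grains[left], grains[right]
        shared = [c for c in lg if c in rg]
        if not 2 <= len(shared) <= max_columns:
            continue
        if set(shared) != set(lg) and set(shared) != set(rg):
            continue  # partial overlap only: neither grain contains the other
        candidates.append((left, right, shared, relationship(lg, rg, shared)))
    return candidates


def on_condition(left_alias, right_alias, columns):
    """Build the full multi-column join condition."""
    return " AND ".join(f"{left_alias}.{c} = {right_alias}.{c}" for c in columns)


grains = {
    "accounts": ["account_id"],
    "ledger_lines": ["account_id", "period", "line_no"],
    "ledger_adjustments": ["account_id", "period", "line_no", "adjustment_no"],
}
candidates = composite_join_candidates(grains)
```

 With these synthetic grains, the only candidate is `ledger_adjustments` to `ledger_lines` on `(account_id, period, line_no)`; the pairs that share just `account_id` are left to single-column detection.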
 ## Acceptance criteria
 - For a fixture with two tables sharing a 3- or 4-column grain and no
  single-column FK, ingest emits a composite join between them with the full
  multi-column `on` condition.
 - `sl read <source>` shows the composite join; `sl_query` can traverse it.
 - Single-column join detection is unchanged on existing fixtures.
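 A fixture for the first criterion could look like the sketch below, which also shows why validation should check the full key. It uses Python's `sqlite3` with synthetic tables (`readings`, `alerts`) sharing a 3-column grain; the helper `join_row_count` is hypothetical and interpolates identifiers directly, which is acceptable only for trusted fixture names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (device_id INT, ts INT, metric INT, value INT);
CREATE TABLE alerts   (device_id INT, ts INT, metric INT, level TEXT);
INSERT INTO readings VALUES (1,1,1,10), (1,1,2,20), (1,2,1,30), (2,1,1,40);
INSERT INTO alerts   VALUES (1,1,2,'high'), (2,1,1,'low');
""")


def join_row_count(conn, left, right, columns):
    """Count rows produced by joining two fixture tables on the given columns."""
    on = " AND ".join(f"a.{c} = b.{c}" for c in columns)
    sql = f"SELECT COUNT(*) FROM {left} a JOIN {right} b ON {on}"
    return conn.execute(sql).fetchone()[0]


alert_rows = conn.execute("SELECT COUNT(*) FROM alerts").fetchone()[0]
full_key = join_row_count(conn, "readings", "alerts", ["device_id", "ts", "metric"])
partial_key = join_row_count(conn, "readings", "alerts", ["device_id"])
```

 On the full key the join returns one row per alert (`full_key == alert_rows`), while joining on `device_id` alone fans out to more rows than either side's grain supports. A selectivity check on only part of the key would therefore misjudge the candidate.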
 ## Benchmark context (motivation only)
 IPL and similar ball-by-ball/event schemas in the Spider2-lite local set
 have no single-column FKs; their joins are entirely composite. Explicit
 composite joins would let the agent rely on documented relationships instead
 of inferring them from grain.
--- a/spider2-specs/todo/13-canonical-authoritative-source-measures.md
+++ b/spider2-specs/todo/13-canonical-authoritative-source-measures.md
@ -1,89 +0,0 @@
 # Canonical / authoritative-source measures in the semantic layer
 ## Problem
 Many schemas contain an **authoritative table** that already encodes a metric's
 business rules — an official standings/leaderboard table, a general-ledger or
 period-end balance table, a materialized summary/snapshot — alongside the **raw
 transactional** rows the metric *could* be re-derived from. Re-deriving the metric
 from the raw rows frequently diverges from the canonical definition, because the
 authoritative table bakes in rules the raw data doesn't expose (drop-scores,
 penalties, adjustments, reconciliations, as-of snapshots).
 Today ktx's semantic layer doesn't distinguish "authoritative summary" tables from
 raw fact tables, so the analytics skill has no signal that one source is canonical
 for a metric, and the agent often re-derives from raw rows and gets a
 defensible-but-different number.
 ## Generic use case (independent of any benchmark)
 - "Championship points per competitor this season" — a sports schema may hold both
  raw per-event results AND an official standings table that applies drop-scores
  and penalties. The standings table is the canonical source; summing raw results
  is wrong.
 - "Account balance as of month end" — prefer a ledger/balance-snapshot table over
  re-summing every transaction (which may miss adjustments).
 - "Monthly recognized revenue" — prefer a finance summary table over re-deriving
  from line items.
 In each case a real analyst should be steered to the authoritative source.
 ## Requirements
 1. **Detect candidate authoritative tables during ingest.** Heuristics only —
   e.g. tables whose name/role suggests a summary (`*standings*`, `*balance*`,
   `*summary*`, `*snapshot*`, `*ledger*`), tables that are a coarser-grained
   aggregation of another table, or tables documented as authoritative in provided
   docs/wiki. Surface them as such in the semantic layer.
 2. **Represent the metric as an SL measure backed by the authoritative table.**
   Where a canonical source exists, define the measure over it so a query for that
   metric resolves to the authoritative source by default. (The analytics skill
   already prefers SL measures over raw SQL — spec 07/skill rule — so this plugs
   into existing behavior.)
 3. **Keep raw re-derivation available** as a non-default alternative; the measure
   documents which source it uses and why, so the choice is transparent and
   overridable.
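 The detection heuristics in requirement 1 might be combined as in this sketch. It is an assumption-laden illustration: `authoritative_signals` and the signal labels are hypothetical, the name patterns are the ones listed above, and "coarser-grained" is approximated as "grain is a strict subset of another table's grain". The table names are synthetic.

```python
from fnmatch import fnmatchcase

# Name patterns taken from the requirement text; matching is on lowercased names.
SUMMARY_NAME_PATTERNS = ("*standings*", "*balance*", "*summary*", "*snapshot*", "*ledger*")


def authoritative_signals(table, grain, other_grains, documented_authoritative=()):
    """Return the generic signals suggesting `table` is an authoritative summary.

    table:                    table name
    grain:                    list of the table's grain columns
    other_grains:             dict of other table name -> grain columns
    documented_authoritative: table names that provided docs call authoritative
    """
    signals = []
    name = table.lower()
    if any(fnmatchcase(name, pattern) for pattern in SUMMARY_NAME_PATTERNS):
        signals.append("name")
    # Coarser-grained aggregation: grain is a strict subset of another grain.
    own = set(grain)
    if own and any(own < set(other) for other in other_grains.values()):
        signals.append("coarser_grain")
    if table in documented_authoritative:
        signals.append("documented")
    return signals


standings_signals = authoritative_signals(
    "season_standings",
    ["season", "competitor_id"],
    {"event_results": ["season", "event_id", "competitor_id"]},
)
```

 Every input here is schema structure or provided documentation, so the sketch stays inside the fairness boundary described below: nothing in it depends on an expected answer.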
 ## Fairness boundary (HARD — this spec is fairness-sensitive)
 The choice of authoritative source MUST be driven by **schema/structure or provided
 documentation** — the table exists, is structured as a summary, or is documented as
 authoritative. It must **NEVER** be driven by observing which interpretation matches
 a benchmark gold answer. Concretely:
 - ✅ Fair: "a table named/structured as official standings exists and aggregates the
  raw results → treat it as the canonical points source."
 - ❌ Forbidden: "for question X, use table T because that's what reproduces the gold
  result." That is per-instance gold-tuning (cheating) and must not appear in ktx,
  the ingest heuristics, or any mapping.
 If a metric is genuinely underspecified and only the gold answer disambiguates the
 intended source, it is **not fairly fixable** — leave it. Whether this feature helps
 any specific benchmark instance is therefore *conditional* on a real schema/doc basis
 existing; do not manufacture one.
 ## Leak-safety (hard constraint)
 No benchmark table names, queries, gold values, or instance-specific mappings
 anywhere in the spec, the heuristics, or tests. Examples must be synthetic/generic.
 ## Acceptance criteria
 - Ingest can flag candidate authoritative/summary tables via generic heuristics
  (name/role/aggregation/doc signals), with no benchmark-specific rules.
 - The semantic layer can express a measure as backed by a designated authoritative
  source; the skill resolves the metric to it by default; raw re-derivation remains
  available and the choice is documented.
 - Tests use synthetic schemas only; no gold-derived mappings exist anywhere.
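 The second criterion (authoritative by default, raw re-derivation still available, choice documented) could be modeled roughly as below. `Measure`, `MeasureSource`, and `resolve` are hypothetical names, and the schema is synthetic; the real semantic-layer model may represent this differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MeasureSource:
    table: str
    expression: str
    rationale: str  # documents why this source is (or is not) the default


@dataclass(frozen=True)
class Measure:
    name: str
    authoritative: Optional[MeasureSource]
    raw: MeasureSource

    def resolve(self, prefer_raw=False):
        """Pick the authoritative source unless the caller overrides it."""
        if prefer_raw or self.authoritative is None:
            return self.raw
        return self.authoritative


season_points = Measure(
    name="season_points",
    authoritative=MeasureSource(
        table="season_standings",
        expression="SUM(points)",
        rationale="Official table; applies drop-scores and penalties.",
    ),
    raw=MeasureSource(
        table="event_results",
        expression="SUM(points_awarded)",
        rationale="Re-derived from raw per-event rows; ignores adjustments.",
    ),
)
```

 `season_points.resolve()` returns the standings-backed source, and `season_points.resolve(prefer_raw=True)` returns the raw derivation, so the override is explicit and the rationale travels with each source.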
 ## Benchmark context (motivation only)
 Some SQLite-subset metric questions are underspecified between a raw-derivation and
 an authoritative-table interpretation (e.g. season points from raw results vs an
 official standings table). This is the roadmap's "canonical semantic-layer measures
 from schema + provided docs" item. It is fair ONLY where schema/docs support one
 source; the gold-only cases are explicitly out of scope (fixing them would require
 tuning to gold). This change is larger than the spec 09–12 skill-content tweaks: it
 touches both ingest and the semantic-layer model.
--- a/spider2-specs/todo/17-lifecycle-event-metrics.md
+++ b/spider2-specs/todo/17-lifecycle-event-metrics.md
@ -1,57 +0,0 @@
 # 17 — Lifecycle-event metrics in the semantic layer
 **Status:** draft (intake). Requirement-level; the implementer refines into `specs/17-*.md`.
 ## Problem / requirement
 Many entities carry **several lifecycle timestamps** for the same record — an order has
 `placed/purchased`, `approved`, `shipped/carrier-handoff`, `delivered`, and `estimated-delivery`
 times; a ticket has `opened`, `assigned`, `resolved`, `closed`; a payment has `initiated`,
 `authorized`, `settled`. When an analyst asks for a count/volume/rate of records **in a named
 completed state, by period** ("delivered orders by month", "resolved tickets per week", "settled
 payments by day"), the correct time anchor is the timestamp of *that named event*, not the
 record-creation timestamp.
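 A small worked example makes the anchoring difference concrete. The sketch uses Python's `sqlite3` with a synthetic `orders` table; the column names `placed_at` and `delivered_at` and the helper `delivered_by_month` are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, placed_at TEXT, delivered_at TEXT);
INSERT INTO orders VALUES
  (1, 'delivered', '2024-01-28', '2024-02-03'),
  (2, 'delivered', '2024-01-30', '2024-02-05'),
  (3, 'delivered', '2024-02-10', '2024-02-14'),
  (4, 'shipped',   '2024-02-20', NULL);
""")


def delivered_by_month(conn, anchor):
    """Count delivered orders per month, bucketed by the chosen timestamp column."""
    if anchor not in {"placed_at", "delivered_at"}:
        raise ValueError(f"unknown anchor column: {anchor}")
    sql = (
        f"SELECT strftime('%Y-%m', {anchor}) AS month, COUNT(*) "
        "FROM orders WHERE status = 'delivered' "
        "GROUP BY month ORDER BY month"
    )
    return conn.execute(sql).fetchall()


by_creation = delivered_by_month(conn, "placed_at")
by_event = delivered_by_month(conn, "delivered_at")
```

 Bucketing by `placed_at` reports two delivered orders in January and one in February, while bucketing by `delivered_at` reports all three in February. Both queries run without error, which is why the wrong anchor goes unnoticed.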
 Today ktx ingests these timestamps as **peer date dimensions** with good column descriptions, but it
 does **not model the lifecycle event itself** — so nothing in the semantic layer tells a solver (or a
 human) that "delivered orders over time" should be anchored to the delivery timestamp. The choice is
 left to per-query reasoning, which is exactly where it goes wrong. (A companion analytics-skill rule
 now nudges the *solver* — ktx commit `226341cf` — but the durable, reusable home for this is the
 **model**, so any consumer of the semantic layer gets it for free.)
 **Requirement:** during enrichment/ingestion, when a source has a state/status column plus one or more
 lifecycle timestamps whose names/descriptions map to that state's values, infer **lifecycle-event
 metrics** — e.g. a `delivered_orders` metric defined as `COUNT(*)` filtered to the delivered state with
 its **default time dimension** set to the matching event timestamp (`order_delivered_customer_date`),
 distinct from the creation-anchored `orders` metric. Keep the inference conservative and
 source-traceable (column names + enriched descriptions only); never invent a state/timestamp pairing
 that the schema/descriptions don't independently support.
 ## Sketch (implementer to refine)
 - Detect (state column, lifecycle-timestamp) pairs from column names + enrichment descriptions
  (e.g. status value `delivered` ↔ `*_delivered_*_date`; `resolved` ↔ `resolved_at`).
 - Emit a metric per detected completed state: filter = the state predicate, grain = record,
  `defaultTimeDimension` = the matching event timestamp.
 - Surface these via `discover_data` / `entity_details` so "delivered orders over time" retrieves the
  delivery-anchored metric rather than a bare row count over the creation date.
 - Gate behind the existing `enrichment.mode: llm` path; respect the conservative-inference bar
  (precision over recall — a wrong pairing is worse than none).
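 The first two bullets can be sketched with a deterministic, name-only matcher. This is a simplification: the spec also draws on enrichment descriptions and gates the feature behind the LLM path, neither of which is modeled here. `detect_lifecycle_pairs`, `lifecycle_metrics`, and the metric dictionary keys other than `defaultTimeDimension` are hypothetical.

```python
import re

TIMESTAMP_SUFFIX = re.compile(r"(_at|_date|_time|_timestamp)$")


def detect_lifecycle_pairs(status_values, columns):
    """Pair a status value with a timestamp column that names it as a whole token.

    Conservative by design: a value is paired only when exactly one
    timestamp-like column matches; ambiguous or missing matches are skipped.
    """
    pairs = {}
    for value in status_values:
        token = value.lower()
        matches = [
            column for column in columns
            if TIMESTAMP_SUFFIX.search(column.lower())
            and token in re.split(r"[^a-z0-9]+", column.lower())
        ]
        if len(matches) == 1:
            pairs[value] = matches[0]
    return pairs


def lifecycle_metrics(entity, status_column, pairs):
    """Emit one metric per detected completed state, anchored to its event timestamp."""
    return [
        {
            "name": f"{value}_{entity}",
            "expression": "COUNT(*)",
            "filter": f"{status_column} = '{value}'",
            "defaultTimeDimension": timestamp_column,
        }
        for value, timestamp_column in sorted(pairs.items())
    ]


pairs = detect_lifecycle_pairs(
    ["open", "resolved", "closed"],
    ["id", "status", "created_at", "assigned_at", "resolved_at", "closed_at"],
)
metrics = lifecycle_metrics("tickets", "status", pairs)
```

 In this synthetic ticket schema, `resolved` and `closed` each get a metric, while `open` gets none because no timestamp column names it. A value that matches two timestamp columns is also skipped, which is the precision-over-recall behavior the last bullet asks for.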
 ## Generic use case (independent of the benchmark)
 Any operational/transactional schema (e-commerce orders, support tickets, payments, claims, shipments)
 has this multi-timestamp lifecycle shape. An analyst asking "how many X were <completed-state> last
 month" almost always means *entered that state* last month. Encoding the event→timestamp mapping in the
 model makes every downstream question (BI tool, ad-hoc SQL, an LLM agent) pick the right anchor without
 re-deriving it, and prevents the silent "grouped by when they started" error.
 ## Benchmark context (motivation only — not a benchmark-specific rule)
 Surfaced by the `spider2-autofix` loop, round r1: Spider 2.0-Lite `Brazilian_E_Commerce` cases local028
 ("delivered orders for each month") and local031 ("highest monthly delivered orders volume") both failed
 because the solver bucketed delivered orders by `order_purchase_timestamp` instead of
 `order_delivered_customer_date`. The trace showed the solver had both columns and even compared both
 date bases for local031 before choosing purchase. A skill-text rule flipped both cases this round; this
 spec is the **model-layer** form of the same fix, which would make the right anchor the default for any
 solver and any lifecycle schema.