From 1c5d16abc31a8256f424502878bcf45735bb83da Mon Sep 17 00:00:00 2001 From: Andrey Avtomonov Date: Tue, 30 Jun 2026 11:13:44 +0200 Subject: [PATCH] chore: remove private benchmark specs --- spider2-specs/README.md | 62 -- spider2-specs/done/.gitkeep | 0 .../done/01-connection-scoped-wiki.md | 74 --- spider2-specs/done/02-verbatim-ingest-mode.md | 71 --- .../done/06-scan-tolerate-broken-objects.md | 63 -- .../done/07-analytics-skill-sql-craft.md | 112 ---- .../done/08-per-dialect-sql-syntax-notes.md | 83 --- .../09-fan-out-safe-multi-hop-aggregation.md | 150 ----- .../done/10-panel-completeness-spine.md | 65 -- .../done/11-time-series-window-recipes.md | 73 --- .../done/12-parse-text-encoded-numbers.md | 61 -- .../14-output-completeness-final-check.md | 105 ---- .../done/15-mcp-server-structured-logging.md | 116 ---- .../16-bounded-query-execution-timeout.md | 131 ---- .../18-bigquery-cross-project-datasets.md | 68 --- ...-durable-bounded-relationship-detection.md | 89 --- .../20-resilient-enrichment-under-slow-llm.md | 101 ---- .../done/21-selective-enrichment-stages.md | 91 --- .../specs/01-connection-scoped-wiki.md | 300 --------- .../specs/02-verbatim-ingest-mode.md | 327 ---------- .../specs/06-scan-tolerate-broken-objects.md | 361 ----------- .../specs/07-analytics-skill-sql-craft.md | 363 ----------- .../specs/08-per-dialect-sql-syntax-notes.md | 395 ------------ .../09-fan-out-safe-multi-hop-aggregation.md | 362 ----------- .../specs/10-panel-completeness-spine.md | 289 --------- .../specs/11-time-series-window-recipes.md | 391 ------------ .../specs/12-parse-text-encoded-numbers.md | 405 ------------- .../14-output-completeness-final-check.md | 336 ----------- .../specs/15-mcp-server-structured-logging.md | 405 ------------- .../16-bounded-query-execution-timeout.md | 493 --------------- .../18-bigquery-cross-project-datasets.md | 418 ------------- ...-durable-bounded-relationship-detection.md | 471 --------------- 
.../20-resilient-enrichment-under-slow-llm.md | 533 ---------------- .../specs/21-selective-enrichment-stages.md | 567 ------------------ ...umable-and-fault-tolerant-source-ingest.md | 463 -------------- ...i-connection-routing-in-analytics-skill.md | 66 -- .../todo/04-offline-schema-docs-adapter.md | 51 -- .../todo/05-composite-key-join-detection.md | 59 -- ...canonical-authoritative-source-measures.md | 89 --- .../todo/17-lifecycle-event-metrics.md | 57 -- 40 files changed, 8716 deletions(-) delete mode 100644 spider2-specs/README.md delete mode 100644 spider2-specs/done/.gitkeep delete mode 100644 spider2-specs/done/01-connection-scoped-wiki.md delete mode 100644 spider2-specs/done/02-verbatim-ingest-mode.md delete mode 100644 spider2-specs/done/06-scan-tolerate-broken-objects.md delete mode 100644 spider2-specs/done/07-analytics-skill-sql-craft.md delete mode 100644 spider2-specs/done/08-per-dialect-sql-syntax-notes.md delete mode 100644 spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md delete mode 100644 spider2-specs/done/10-panel-completeness-spine.md delete mode 100644 spider2-specs/done/11-time-series-window-recipes.md delete mode 100644 spider2-specs/done/12-parse-text-encoded-numbers.md delete mode 100644 spider2-specs/done/14-output-completeness-final-check.md delete mode 100644 spider2-specs/done/15-mcp-server-structured-logging.md delete mode 100644 spider2-specs/done/16-bounded-query-execution-timeout.md delete mode 100644 spider2-specs/done/18-bigquery-cross-project-datasets.md delete mode 100644 spider2-specs/done/19-durable-bounded-relationship-detection.md delete mode 100644 spider2-specs/done/20-resilient-enrichment-under-slow-llm.md delete mode 100644 spider2-specs/done/21-selective-enrichment-stages.md delete mode 100644 spider2-specs/specs/01-connection-scoped-wiki.md delete mode 100644 spider2-specs/specs/02-verbatim-ingest-mode.md delete mode 100644 spider2-specs/specs/06-scan-tolerate-broken-objects.md delete mode 100644 
spider2-specs/specs/07-analytics-skill-sql-craft.md delete mode 100644 spider2-specs/specs/08-per-dialect-sql-syntax-notes.md delete mode 100644 spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md delete mode 100644 spider2-specs/specs/10-panel-completeness-spine.md delete mode 100644 spider2-specs/specs/11-time-series-window-recipes.md delete mode 100644 spider2-specs/specs/12-parse-text-encoded-numbers.md delete mode 100644 spider2-specs/specs/14-output-completeness-final-check.md delete mode 100644 spider2-specs/specs/15-mcp-server-structured-logging.md delete mode 100644 spider2-specs/specs/16-bounded-query-execution-timeout.md delete mode 100644 spider2-specs/specs/18-bigquery-cross-project-datasets.md delete mode 100644 spider2-specs/specs/19-durable-bounded-relationship-detection.md delete mode 100644 spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md delete mode 100644 spider2-specs/specs/21-selective-enrichment-stages.md delete mode 100644 spider2-specs/specs/22-resumable-and-fault-tolerant-source-ingest.md delete mode 100644 spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md delete mode 100644 spider2-specs/todo/04-offline-schema-docs-adapter.md delete mode 100644 spider2-specs/todo/05-composite-key-join-detection.md delete mode 100644 spider2-specs/todo/13-canonical-authoritative-source-measures.md delete mode 100644 spider2-specs/todo/17-lifecycle-event-metrics.md diff --git a/spider2-specs/README.md b/spider2-specs/README.md deleted file mode 100644 index 1cf2acdb..00000000 --- a/spider2-specs/README.md +++ /dev/null @@ -1,62 +0,0 @@ -# spider2-specs — feature specs driven by the Spider 2.0-Lite benchmark - -This directory is the handoff point between two agents working on different -sides of the same goal: making Claude Code + ktx score well on the Spider -2.0-Lite benchmark **without benchmark-specific instructions** — the agent -should succeed using only what ktx provides (skills, semantic layer, wiki). 
- -## Mechanics - -Three directories form a pipeline. A feature flows `todo/` → `specs/` → -(implemented), and only its intake draft moves to `done/`: - -- **`todo/`** — intake drafts. A **playground agent** (works in - `/Users/andrey/projects/kaelio/spider-clean-submission/playground`, runs the - benchmark, identifies ktx capability gaps) writes a draft spec here when it - finds a gap. -- **`specs/`** — refined specs. A **refinement pass** (brainstorming) takes a - `todo/` draft and produces a proper, implementation-ready spec at - `specs/.md`: sharpened requirements, resolved ambiguities, - acceptance criteria, and orientation hints. The refined spec is the **durable - artifact** the implementer builds from — it stays in `specs/` permanently and - never moves. -- **`done/`** — intake drafts whose feature has shipped (see below). - -The **ktx worktree agent** (started from a ktx repo worktree, e.g. -`/Users/andrey/conductor/workspaces/ktx/tallinn-v2`) implements from the -refined spec in `specs/` (falling back to the `todo/` draft only if no refined -spec exists yet). When the feature is implemented it: - -1. appends a short **"Implementation notes"** section to the refined spec in - `specs/` (what was built, where, any deviations); and -2. **moves the original intake draft from `todo/` to `done/`.** - -Location is status: `todo/` = draft awaiting implementation, `done/` = draft -whose feature shipped, `specs/` = refined specs (permanent home, do not move). -A draft and its refined spec share the same filename so they correspond -(`todo/01-foo.md` ↔ `specs/01-foo.md` ↔ `done/01-foo.md`). No other tracking. - -## Rules for specs - -1. **Generic, not benchmark-overfit.** ktx is a general-purpose product; the - benchmark only surfaces the need. Every spec must state a real-world use - case independent of Spider 2.0-Lite. If a requirement only makes sense for - the benchmark, it doesn't belong in ktx. -2. Specs are **requirement-level**, not implementation plans. 
Code pointers in - specs are orientation hints from exploration (line numbers may have - drifted); the implementer owns the design. -3. One spec per file, kebab-case, numeric prefix = suggested priority order. - A refined spec in `specs/` keeps the same filename as its `todo/` draft. - -## For the implementer - -- After implementing, rebuild and re-link the dev binary so the playground - picks it up: `pnpm run build && pnpm run link:dev` (provides `ktx-dev`). -- Add/extend tests in the ktx test suites; specs list acceptance criteria to - cover. -- Build from the refined spec in `specs/`. On completion, append - "Implementation notes" to that spec (it stays in `specs/`) and move the - intake draft from `todo/` to `done/`. -- If a spec turns out to be wrong or already satisfied, don't silently drop - it — record why in the refined spec's notes and move the draft to `done/` - explaining why no change was needed. diff --git a/spider2-specs/done/.gitkeep b/spider2-specs/done/.gitkeep deleted file mode 100644 index e69de29b..00000000 diff --git a/spider2-specs/done/01-connection-scoped-wiki.md b/spider2-specs/done/01-connection-scoped-wiki.md deleted file mode 100644 index cbd220dc..00000000 --- a/spider2-specs/done/01-connection-scoped-wiki.md +++ /dev/null @@ -1,74 +0,0 @@ -# Connection-scoped wiki pages - -## Problem - -Wiki pages have only two scopes today: `GLOBAL` and `USER` -(`packages/cli/src/context/wiki/types.ts`, frontmatter schema ~lines 14-29). -There is no way to associate a page with a connection. In a project with many -connections, all pages share one search index, so `wiki_search` for a generic -term ("orders", "revenue", "average order value") surfaces pages about the -wrong database. Concept names collide across databases constantly in -real-world multi-connection projects (several databases each with `orders`, -`customers`, etc.). 
- -Today, when `memory_ingest` is called with a `connectionId`, that id is only -used to scope which semantic-layer sources the triage agent can see -(`memory-agent.service.ts` ~46-72, ~107-109); it is **not** persisted on the -resulting wiki page in any form. - -## Generic use case - -Any org with multiple databases/warehouses in one ktx project: org-wide -definitions ("fiscal year starts in February") should be visible everywhere, -while database-specific conventions ("in the events DB, `user_id` is the -anonymous device id, not the account id") should not pollute searches about -other databases. - -## Requirements - -1. **Frontmatter field.** Add an optional `connections:` field to wiki page - frontmatter — a list of connection ids (accept a single string too, - normalize to list). - - **Absent or empty ⇒ unscoped: the page applies to all connections.** - This is exactly today's behavior, so every existing page is unaffected - (backward compatible by construction). -2. **Search filtering.** `wiki_search` (MCP tool, `context-tools.ts` ~46-64) - and `ktx wiki search` / `ktx wiki list` (CLI, - `knowledge-commands.ts`) accept an optional `connectionId`: - - With `connectionId: X` ⇒ return pages scoped to X **∪** unscoped pages. - - Without ⇒ current behavior, all pages. - - The filter must apply to **all three search lanes** (lexical FTS5, - semantic/embedding, token fallback) in - `local-knowledge.ts` / `sqlite-knowledge-index.ts` — not as a post-filter - that eats into the result limit unevenly. -3. **Index.** Persist the scoping in the `.ktx/db.sqlite` knowledge index - (the index is already re-synced from files on every search, - `local-knowledge.ts` ~286-310, so a schema addition + sync is sufficient). -4. 
**Write path.** The memory agent's wiki-write tool accepts the connections - field; when `memory_ingest` is invoked with a `connectionId`, the agent - should default new database-specific pages to that connection, while still - being allowed to write unscoped pages for clearly org-wide content (prompt - guidance, not a hard rule). -5. **`wiki_read` and refs are unchanged** — pages remain addressable by key - regardless of scoping; `connections` is a search/relevance concern only. -6. **Validation.** Warn (don't fail) when a page references a connection id - not present in `ktx.yaml` — config and content can evolve independently. - -## Acceptance criteria - -- A page with `connections: [db_a]` is returned by - `wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but - **not** by `wiki_search(query, connectionId: "db_b")`. -- A page with no `connections` field is returned in all three cases above. -- Existing projects with no scoped pages behave identically before/after. -- Filtering works in each lane independently (test with embeddings disabled - to exercise lexical/token lanes alone). -- `memory_ingest(content, connectionId)` produces a page scoped to that - connection for database-specific content. - -## Benchmark context (motivation only) - -Spider 2.0-Lite local subset = one project with 30 SQLite connections whose -schemas share table/concept names (Northwind, sakila, two e-commerce DBs…). -External-knowledge docs (RFM definition, F1 overtake rules) are each relevant -to exactly one database and must not surface for the other 29. 
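The scoping rule in the deleted spec 01 above (pages scoped to X plus unscoped pages, with a single string normalized to a list) can be sketched as a small predicate. This is only an illustration under stated assumptions: `WikiPage`, `normalize_connections`, and `visible_pages` are hypothetical names, not ktx code, and a real index would need the same predicate inside each search lane's query rather than as a post-filter.

```python
from dataclasses import dataclass, field


def normalize_connections(value):
    """Accept None, a single string, or a list; always return a list of ids."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


@dataclass
class WikiPage:
    # Hypothetical in-memory stand-in for a wiki page's frontmatter.
    key: str
    connections: list = field(default_factory=list)  # empty means unscoped


def visible_pages(pages, connection_id=None):
    """Return pages scoped to connection_id plus unscoped pages.

    With no connection_id, every page is returned (the pre-existing behavior).
    """
    if connection_id is None:
        return list(pages)
    return [
        page
        for page in pages
        if not page.connections or connection_id in page.connections
    ]
```

An absent or empty `connections` value keeps a page visible everywhere, which is what makes the rule backward compatible for existing pages.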
diff --git a/spider2-specs/done/02-verbatim-ingest-mode.md b/spider2-specs/done/02-verbatim-ingest-mode.md deleted file mode 100644 index 03e86a02..00000000 --- a/spider2-specs/done/02-verbatim-ingest-mode.md +++ /dev/null @@ -1,71 +0,0 @@ -# Verbatim ingest mode for authoritative documents - -## Problem - -`ktx ingest --text/--file` routes content through the memory agent -(`text-ingest.ts` ~246-357 → `memory-agent.service.ts`), an LLM triage loop -(30-step budget for `external_ingest`, content clipped at ~48k chars, -`memory-agent.service.ts` ~165) that may rewrite, condense, or split the -content before writing wiki pages. - -For *authoritative* documents — formula definitions, specs, runbooks, -compliance text — paraphrasing is a bug, not a feature: - -- exact thresholds, constants, and rule wording must survive byte-for-byte; -- lexical (BM25) search works best when the stored text matches the phrasing - users/agents will query with; -- ingestion should be deterministic and reproducible — same input file, same - resulting page. - -## Generic use case - -Any team ingesting documents that are already the source of truth: metric -definition sheets, SLA documents, calculation methodology docs, regulatory -text. The user wants ktx to *index and surface* the document, not to -re-author it. - -## Requirements - -1. **Flag.** `ktx ingest --file --verbatim` (apply to `--text` too). - Composes with the existing optional `--connection ` so the resulting - page can be connection-scoped (see spec 01). -2. **Body preservation is enforced by code, not by prompt.** The stored page - body must be the input content byte-for-byte. The LLM is used **only** to - generate metadata: `summary`, `tags`, `sl_refs`, suggested page key/slug - (and `connections` default from the flag). Implementation freedom: a - single constrained LLM call is fine — the full memory-agent loop is not - required for this mode. -3. 
**No clipping of the stored body.** The ~48k clip may apply to what is - *sent to the LLM* for metadata generation, never to what is *written* to - the wiki page. -4. **Existing frontmatter.** If the input file already has YAML frontmatter, - preserve user-provided fields and only fill gaps (don't overwrite an - explicit `summary` with a generated one). -5. **Key collisions.** Deterministic, non-destructive behavior: error or - suffix — never silently overwrite an existing page. -6. **Degraded mode.** With `llm.provider.backend: none`, `--verbatim` should - still work, deriving `summary` from the first heading/sentence and leaving - optional metadata empty. (Regular agent ingest can't do this; verbatim - mode can and should.) - -## Acceptance criteria - -- Ingesting a file with `--verbatim` produces a wiki page whose body is - byte-identical to the input (assert with a hash in tests). -- Running the same ingest twice is idempotent or fails loudly on the second - run (per requirement 5) — no duplicated/divergent pages. -- A >48k-char file is stored in full. -- `--verbatim --connection X` yields a page scoped to X (depends on spec 01; - if 01 isn't implemented yet, the flag composition can land later). -- Generated metadata makes the page findable: `wiki_search` for a phrase - from the document body returns it (lexical lane), and for a paraphrase of - its topic returns it when embeddings are enabled (semantic lane). - -## Benchmark context (motivation only) - -Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket -definitions, haversine formula, F1 overtake rules…). Gold SQL was authored -against their exact text; an LLM paraphrase that drops a bucket boundary -loses a question. We currently work around this by hand-writing frontmatter -and copying files into `wiki/global/` — verbatim mode makes that a supported -ktx workflow instead of a manual step. 
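The body-preservation and key-collision requirements of the deleted spec 02 above lend themselves to a hash-based check. The sketch below is an assumption-laden illustration, not the ktx implementation: `store_verbatim`, `KeyCollisionError`, and the dict-backed `store` are hypothetical, and it picks the "error" option from requirement 5 rather than the suffix option.

```python
import hashlib


class KeyCollisionError(Exception):
    """Raised instead of silently overwriting an existing page (hypothetical)."""


def store_verbatim(store, key, body, metadata=None):
    """Write `body` (bytes) unchanged under `key` and return its SHA-256 digest.

    `store` is a plain dict standing in for the wiki; metadata is kept
    separately so that generating it can never alter the stored body.
    """
    if key in store:
        raise KeyCollisionError(f"page already exists: {key}")
    store[key] = {"body": body, "metadata": dict(metadata or {})}
    return hashlib.sha256(body).hexdigest()
```

A test can then compare the digest of the input file with the digest of the stored body, which is the "assert with a hash" check the acceptance criteria ask for.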
diff --git a/spider2-specs/done/06-scan-tolerate-broken-objects.md b/spider2-specs/done/06-scan-tolerate-broken-objects.md deleted file mode 100644 index c56e3dd5..00000000 --- a/spider2-specs/done/06-scan-tolerate-broken-objects.md +++ /dev/null @@ -1,63 +0,0 @@ -# Schema scan must tolerate individual objects that fail introspection - -> Priority: MEDIUM. Found during the first full Spider2-lite sqlite ingest -> (2026-06-13): one database (`oracle_sql`) failed to ingest **entirely** -> because a single broken VIEW errored during introspection, leaving that -> connection with no semantic layer at all. - -## Problem - -`ktx ingest ` aborts the whole database's schema scan when one -table/view errors during introspection/profiling. In `oracle_sql` the view -`emp_hire_periods_with_name` is defined as -`SELECT ehp.start_date, ehp.end_date ... FROM emp_hire_periods ehp ...` but the -base table has no `start_date`/`end_date` columns — so any attempt to read it -raises `no such column: ehp.start_date`. That single broken object failed the -ingest of all ~48 healthy tables/views in the database. - -A second, related symptom: setting `enabled_tables: [main.customers]` to work -around it produced a different hard failure (`Adapter "database schema" did not -recognize fetched source output`), so the documented allowlist escape hatch did -not provide a clean fallback either. - -## Generic use case - -Real databases routinely contain broken or inaccessible objects: views over -dropped/renamed columns, views referencing tables the connection role can't -read, permission-denied tables, or vendor system views that error. ktx should -ingest everything it *can* and skip what it can't — never let one bad object -zero out an entire connection's context. This is basic robustness for -production warehouses, not benchmark-specific. - -## Requirements - -1. 
**Per-object isolation.** If introspecting/profiling one table or view - throws, skip that object, record a warning (object name + error), and - continue scanning the rest. The connection's semantic layer is built from - the objects that succeeded. -2. **Surface, don't hide.** Report skipped objects in the ingest summary and in - `ktx status` (e.g. "oracle_sql: 1 object skipped — emp_hire_periods_with_name: - no such column ehp.start_date"). Honor `failureMode` for whole-connection - aborts, but a single bad object should not count as a connection failure. -3. **Views vs tables.** A broken view should never block base-table ingest. - Consider profiling views defensively (they are read-only projections). -4. **Allowlist fallback should work.** `enabled_tables` should reliably restrict - the scan to the listed objects (and the qualification format for sqlite must - be documented and accepted). Fix the `did not recognize fetched source - output` failure when the allowlist yields a small/edge-case set. - -## Acceptance criteria - -- Ingesting a sqlite DB containing one broken view plus N healthy tables yields - a semantic layer for the N healthy tables and a warning naming the broken view - — exit is success (not "failed"), subject to `failureMode`. -- The skipped object is listed in the ingest summary and `ktx status`. -- `enabled_tables` restricted to a subset ingests exactly that subset without the - adapter-output error. - -## Benchmark context (motivation only) - -`oracle_sql` (8 of the 135 sqlite questions) currently has no semantic layer -because of its one broken view; those questions must be solved from raw -`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning -would restore enriched context for that database. 
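The per-object isolation requirement of the deleted spec 06 above amounts to catching failures inside the scan loop instead of around it. A minimal sketch, assuming a caller-supplied `introspect` callable; both `scan_objects` and that callable are hypothetical and do not reflect the real ktx scanner:

```python
def scan_objects(object_names, introspect):
    """Introspect each object independently and collect failures.

    Returns (scanned, skipped): `scanned` maps object name to whatever
    `introspect` returned, and `skipped` lists (name, error message) pairs
    for objects that raised, so they can be reported rather than hidden.
    """
    scanned, skipped = {}, []
    for name in object_names:
        try:
            scanned[name] = introspect(name)
        except Exception as exc:  # one bad object must not fail the whole scan
            skipped.append((name, str(exc)))
    return scanned, skipped
```

The `skipped` list is what an ingest summary or status command would surface. Deciding whether a non-empty list counts as a connection failure is left to the caller, in line with the spec's `failureMode` note.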
diff --git a/spider2-specs/done/07-analytics-skill-sql-craft.md b/spider2-specs/done/07-analytics-skill-sql-craft.md deleted file mode 100644 index 97d64904..00000000 --- a/spider2-specs/done/07-analytics-skill-sql-craft.md +++ /dev/null @@ -1,112 +0,0 @@ -# Add universal SQL-authoring craft to the ktx-analytics skill - -> Priority: HIGH. The `ktx-analytics` skill currently tells the agent *which -> ktx tools to call and in what order*, but gives almost no guidance on -> *writing correct SQL*. In benchmark runs the agent reliably produced -> runnable SQL (0 execution errors) yet failed on correctness — precision, -> determinism, type mismatches, and answer completeness. These are universal -> analytics-engineering truths that every ktx user benefits from, so they -> belong in the shipped skill, not in any caller's prompt. - -## Scope guard (read first) - -Only **universally-true** SQL/analytics craft goes here — guidance that helps a -real ktx user querying a **live** database. The test for inclusion: *"Would this -advice be correct and useful for an analyst on a current, production database?"* - -**Dialect-specific syntax is out of scope here.** The v9 harnesses' only -per-dialect content (Snowflake: `DB.SCHEMA.TABLE` FQTNs, double-quoted -lowercase cols, VARIANT colon-paths; BigQuery: backtick FQTNs, `_TABLE_SUFFIX` -for sharded tables; sqlite: `strftime`/`julianday`) is genuinely useful but -belongs in a **dialect-aware** location (per-driver notes), not this flat -skill. Track separately as a follow-up; the rules below must stay -dialect-agnostic. - -Explicitly **do NOT** add (these are application/consumer concerns, not skill -concerns, and some are actively wrong for live data): -- Output-format contracts ("return a bare result set with exactly these - columns, no prose"). The skill is for interactive analysis and already - favors readable tables + summaries; a caller that needs a strict result - shape specifies that itself. 
-- Anchoring relative time ("recent", "past N months") to `MAX(date)` of the - data. On a live database "recent" means relative to *now*; this is only true - for static snapshots and must not be baked into the product. -- Anything justified by a grader/scoring comparator. - -## File - -`packages/cli/src/skills/analytics/SKILL.md` (the shipped skill; -`setup-agents.ts` installs it into agent environments — the copy under a -project's `.claude/skills/` is regenerated from this source). Extend the -existing `` block and step 5 ("Query") / step 6 ("Validate and -explain"); keep the existing interactive guidance intact. - -## Requirements — add these as general rules (behavior only, no rationale that -references answers/graders) - -**Schema discovery before writing SQL** -1. Inspect representative sample rows of each table before composing SQL — - confirm date/time encoding (e.g. `YYYYMMDD` vs ISO vs epoch), null - prevalence in join/filter keys, and the actual set of categorical/enum - values. (`entity_details` + a small `sql_execution` sample.) -2. Cast a column to its real type before comparing it in `WHERE`/`JOIN`. A - string column compared against a numeric literal (or vice versa) can - silently match nothing. - -**Composition discipline** -3. Build complex queries incrementally — one CTE at a time, verifying each - layer's output on a small sample before stacking the next. -4. Avoid joins that fan out row counts. Add columns only from tables already - required by the grain, or pre-aggregate to the target grain before joining. - -**Window-function correctness** -5. Give every ranking/ordering window function a complete, deterministic - tie-breaker (append unique key columns), so `RANK`/`ROW_NUMBER`/`LAG` - results are stable rather than flickering across runs. -6. Apply row filters **after** window functions for sequence / "first" / - "most recent" / "since" questions — compute over the full partition, then - filter. - -**Numeric precision** -7. 
Compute at full precision; round only in the final projection, never inside - intermediate CTEs. -8. Be explicit about truncation (`CAST AS INT` truncates; use explicit - rounding when rounding is intended). -9. Distinguish "average of per-group averages" (macro: `AVG(group_metric)`) - from "overall/weighted average" (micro: `SUM(num)/SUM(den)`) based on the - question's wording. - -**Answer completeness / interpretation** -10. "top / highest / most / lowest" → return only the winning row(s) (e.g. - `RANK() = 1` / `QUALIFY`), not the full ranked list, unless a list is asked - for. -11. "for each X / per X / by X" → exactly one row per X; don't collapse to a - single value unless the question says "overall" or "total across X". -12. When a question asks for inputs and a derived value ("X, Y, and their - ratio"), include the inputs as columns alongside the derived value. -13. When grouping by a human-readable label (a name), also expose the entity's - identifier — identity, not just the label, is part of the result. -14. When a result is unexpectedly empty, relax filters one at a time to find - which predicate removed the rows. - -## Acceptance criteria - -- The shipped `analytics/SKILL.md` contains the rules above, phrased as general - truths with **no reference to any benchmark, gold answer, or scoring - comparator**. -- Existing interactive guidance (compact result tables, summaries, - clarification prompts, the tool-order workflow) is preserved — the skill must - still read well for an interactive human-facing analysis session. -- None of the excluded items (output-shape contract, `MAX(date)` anchoring, - grader-driven advice) appear. -- Skill stays within a reasonable size; group the new rules under clear - sub-headings so they're scannable. 
- -## Benchmark context (motivation only) - -On the Spider 2.0-Lite sqlite subset, the solver produced 0 execution errors -but ~50 result mismatches; a large share traced to exactly these gaps -(premature rounding, string-vs-number compares, non-deterministic window -ordering, returning full lists for "top" questions, dropping inputs to derived -values). These are generic SQL-authoring defects — fixing them in the skill -improves ktx for everyone and, as a side effect, the benchmark. diff --git a/spider2-specs/done/08-per-dialect-sql-syntax-notes.md b/spider2-specs/done/08-per-dialect-sql-syntax-notes.md deleted file mode 100644 index 3cb0a815..00000000 --- a/spider2-specs/done/08-per-dialect-sql-syntax-notes.md +++ /dev/null @@ -1,83 +0,0 @@ -# Per-dialect SQL syntax notes (dialect-aware, scoped to the connection) - -> Intake draft. Companion to `specs/07-analytics-skill-sql-craft.md`, which kept -> the analytics SQL craft dialect-agnostic and explicitly deferred per-dialect -> syntax here. - -## Problem - -Spec 07 deliberately keeps the analytics SQL-authoring craft -**dialect-agnostic** — every rule must read correctly on any engine. But a lot of -*real* correctness depends on dialect-specific syntax that spec 07 excludes and -defers to this follow-up: - -- **Snowflake:** `DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase identifiers, - VARIANT colon-paths. -- **BigQuery:** backtick FQTNs, `_TABLE_SUFFIX` for sharded tables, `QUALIFY`. -- **sqlite:** `strftime`/`julianday` for dates, no `QUALIFY`. - -This guidance is genuinely useful to an agent writing SQL against a live -database, but it must **not** pollute the flat dialect-agnostic skill — an agent -querying sqlite should never see Snowflake VARIANT syntax. It belongs in a -**dialect-aware** location, surfaced only for the dialect the active connection -actually uses. - -## Generic use case - -Any ktx project whose connections span more than one warehouse engine (e.g. 
a -Snowflake warehouse + a BigQuery export + a local sqlite extract). When the agent -writes SQL for a given connection, it should get that engine's syntax -conventions — and nothing for the engines it isn't querying. - -## Requirements - -1. **Per-driver dialect notes.** Author concise, correct syntax notes per - supported driver: FQTN form, identifier quoting/case, date/time functions, - top-N / window-filtering idiom, semi-structured access. These are genuine - per-engine invariants, so enumerating them per driver is acceptable (unlike a - denylist of bad specifics). -2. **Scope to the active dialect, derived from state.** Which notes the agent - sees must be selected from the connection's configured driver/dialect - (`ktx.yaml` connections / the connector registry), not guessed and not shown - all at once. The flat analytics skill stays dialect-agnostic (spec 07 - invariant preserved). -3. **Delivery mechanism (enabling sub-requirement).** The shipped skill is - installed as a **single `SKILL.md`** per target (`setup-agents.ts` / - `readAnalyticsSkillContent`). Surfacing per-dialect notes on demand needs one - of two approaches; the refinement pass should compare them before committing: - - **Multi-file skill delivery** — bundle `reference/.md` files and - have the skill point to the one matching the connection. Requires extending - `setup-agents.ts` to copy a skill *directory* (Claude Code, Codex, universal - `.agents`) and a multi-file zip (Claude Desktop), a **flatten/concatenate - transform** for the single-file targets (Cursor `.mdc`, OpenCode `.md`), and - **per-file manifest entries** for clean uninstall. This is the - install-mechanism improvement spec 07's Model section flags as future work. - - **Dynamic MCP delivery** — an MCP surface returns the dialect hints for a - given `connectionId` (the MCP layer already resolves the connection's - dialect), so no install change is needed and Cursor/OpenCode get identical - behavior. 
May be the lower-cost, more uniform path; weigh it first. -4. **No dialect syntax leaks into the dialect-agnostic skill.** Spec 07's - acceptance criterion (no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in - `analytics/SKILL.md`) stays green. This work adds a *separate* dialect-aware - channel; it does not amend the flat skill. - -## Acceptance criteria - -- An agent querying a sqlite connection gets sqlite date idioms and never sees - Snowflake/BigQuery-only syntax; an agent querying Snowflake gets - FQTN/identifier/VARIANT guidance. -- The dialect shown is **derived from the connection's configured driver**, not - hardcoded per project and not guessed. -- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are - unaffected. -- Whichever delivery mechanism is chosen installs/serves correctly across **all** - supported agent targets, including the single-file Cursor/OpenCode shape. - -## Benchmark context (motivation only) - -The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake -(`DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase cols, VARIANT colon-paths), -BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite -(`strftime`/`julianday`). That content is real and useful but engine-specific; -spec 07 kept it out of the flat skill and deferred it here so the -dialect-agnostic rules stay clean. 
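Requirement 2 of the deleted spec 08 above (derive the dialect notes from the connection's configured driver, and show nothing for other engines) can be sketched as a lookup keyed by driver. The note text only restates the examples given in the spec. The `connections` shape with a `driver` key, and the names `DIALECT_NOTES` and `notes_for_connection`, are assumptions made for illustration and are not the actual `ktx.yaml` schema or API.

```python
# One short note per driver, restating the examples listed in the spec.
DIALECT_NOTES = {
    "snowflake": "Use DB.SCHEMA.TABLE names; VARIANT fields use colon-paths.",
    "bigquery": "Use backtick-quoted names; _TABLE_SUFFIX filters sharded tables.",
    "sqlite": "Use strftime/julianday for dates; QUALIFY is not available.",
}


def notes_for_connection(connections, connection_id):
    """Return the notes for the driver configured on `connection_id`.

    `connections` is a hypothetical mapping of connection id to a config dict
    with a "driver" key. Unknown drivers yield an empty string rather than
    falling back to another engine's syntax.
    """
    driver = connections[connection_id]["driver"]
    return DIALECT_NOTES.get(driver, "")
```

Because the lookup is driven by configuration, an agent working on a sqlite connection never receives the Snowflake or BigQuery text, whichever delivery mechanism serves the result.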
diff --git a/spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md b/spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md deleted file mode 100644 index 12334325..00000000 --- a/spider2-specs/done/09-fan-out-safe-multi-hop-aggregation.md +++ /dev/null @@ -1,150 +0,0 @@ -# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill - -## Problem - -The `ktx-analytics` skill already carries a fan-out rule (spec 07, rule 4: -*"Avoid fan-out joins — add columns only from tables already at the target -grain, or pre-aggregate to that grain before joining; a join that multiplies -rows quietly inflates every downstream `SUM`/`COUNT`"*). In practice the agent -honors it on a single join but still **silently fan-outs on multi-hop join -chains**, where the inflation is one or two joins removed from the aggregate and -therefore much harder to notice. - -The failure shape: a metric that lives at a *coarse* grain (e.g. one row per -parent record) is counted/summed *after* the parent has been joined down to a -*finer* grain (e.g. one row per child line). Every parent-level value is then -duplicated by its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by an -amount that depends on the data — runnable SQL, plausible-looking number, -quietly wrong. - -The rule today is stated as a *prohibition* ("avoid"). It needs to be a -*detect-and-fix habit*: a concrete multi-hop example of the trap, and an active -verification step the agent runs while composing, not just an instruction to be -careful. - -## Generic use case (independent of any benchmark) - -An analyst on any production warehouse asks: *"How many orders are there per -region?"* where the path from region to the order's detail runs through several -hops (region → store → order → order line). The honest answer counts each order -once. If the query descends to the line-level table along the way (e.g. 
for a -filter), each order is counted once **per line on the order**, inflating the -per-region total. Attribution here is unambiguous — each order belongs to exactly -one store and thus one region — so the *only* thing that can go wrong is the row -multiplication, which is exactly what makes it a clean teaching case. This is one -of the most common silently-wrong analytics mistakes on normalized schemas — it -is not -specific to any dataset, dialect, or benchmark. - -## Requirements - -This extends the existing `` "Composition" guidance in the -`ktx-analytics` skill (spec 07). Additive only; keep it inline, dialect-agnostic, -and stated as a heuristic-plus-why (consistent with spec 07's style). - -1. **Generalize the fan-out rule to multi-hop chains.** Make explicit that the - danger is *cumulative*: any one-to-many hop on the path between the table that - owns a measure and the aggregate inflates that measure, even when the - offending join is several hops away from the `SUM`/`COUNT`. The fix is the - same as the single-hop case — **pre-aggregate the measure to its own grain in - a CTE, then join the already-aggregated result** — but the agent must apply it - per measure-owning table along the whole chain, not just at the final join. - -2. **Add a verification habit, not just a prohibition.** While composing, the - agent should confirm a join did not change the grain it intends to aggregate - at — e.g. check that the row count (or the count of the aggregate's key) is - unchanged across a join that is supposed to be one-to-one / many-to-one, and - pre-aggregate the finer table to grain when it is one-to-many. This is the same - "build incrementally and check each layer" discipline spec 07 already endorses, - pointed specifically at grain preservation. 
- - **Pre-aggregate is the general fix; `COUNT(DISTINCT)` is a count-only - shortcut.** Pre-aggregating the finer table to the measure's grain in a CTE and - then joining one-to-one is the remedy that works for every aggregate - (`COUNT`/`SUM`/`AVG`). `COUNT(DISTINCT )` is a valid one-liner *for counts - only* — it must NOT be generalized to a fanned-out `SUM`/`AVG`, because two - rows can legitimately hold equal amounts and `DISTINCT` would wrongly collapse - them. State this trap explicitly; a naïve "just use `COUNT(DISTINCT)`" rule is - silently wrong for sums. - -3. **One concrete, generic multi-hop example.** Include a short worked example - that shows the inflation and the fix. It must use an **invented, generic - schema** — **no benchmark table names, no benchmark SQL, and no benchmark - result values** (see "Leak-safety" below — hard constraint). The example must: - (a) use a **plain `COUNT`** (not an average) so it isolates the fan-out lesson - and does not entangle the skill's separate *macro-vs-micro average* rule; and - (b) use a chain with **unambiguous single-owner attribution** so the only thing - that can go wrong is row multiplication. The intended example is the chain - `regions → stores → orders → order_lines` answering *"how many orders per region - include at least one backordered line"* — each order belongs to exactly one - store and thus exactly one region, so attribution is clean; the line-level - filter gives `order_lines` a genuine reason to be joined (so the fix is the - pre-aggregate remedy, not "drop the join"), and that join sits **several hops - below** the region-level COUNT (the multi-hop point): - - ```sql - -- "How many orders per region include at least one backordered line?" - -- (order_lines is genuinely needed here — for the backordered filter — so the - -- fix is NOT "just drop the join".) - -- WRONG: the order_lines join is one row per matching line, joined several hops - -- BELOW the COUNT. 
An order with 3 backordered lines is counted 3 times, so the - -- per-region total is inflated by backordered-lines-per-order — silently wrong. - SELECT r.region_id, COUNT(*) AS n_orders - FROM regions r - JOIN stores s ON s.region_id = r.region_id - JOIN orders o ON o.store_id = s.store_id - JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered -- one-to-many: fan-out - GROUP BY r.region_id; - - -- RIGHT (general remedy): collapse the finer table to the measure's grain in a - -- CTE FIRST, then join one-to-one so nothing multiplies. This same shape works - -- for SUM/AVG, not just COUNT. - WITH qualifying_orders AS ( -- back to ONE row per order - SELECT DISTINCT order_id FROM order_lines WHERE is_backordered - ) - SELECT r.region_id, COUNT(*) AS n_orders - FROM regions r - JOIN stores s ON s.region_id = r.region_id - JOIN orders o ON o.store_id = s.store_id - JOIN qualifying_orders q ON q.order_id = o.order_id - GROUP BY r.region_id; - - -- Count-only shortcut: COUNT(DISTINCT o.order_id) over the WRONG query also works - -- HERE. But it is counts-only — a fanned-out SUM/AVG of a per-order measure (e.g. - -- summing each order's shipping_fee after joining lines) must pre-aggregate; - -- DISTINCT would wrongly merge two orders that happen to share the same fee. - ``` - -## Leak-safety (hard constraint on this spec and its example) - -The benchmark's gold answers must never appear in ktx. The worked example must -be a **synthetic, generic schema invented for teaching** — not the tables, -column names, query, or numeric results of any Spider 2.0-Lite question. The -example demonstrates the *pattern* (coarse-grain measure counted after a -one-to-many join), which is universal; it must be reconstructable from first -principles by anyone, with zero reference to benchmark data. A reviewer should -be able to read the example and find nothing that ties it to a specific -benchmark instance. 
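The wrong and pre-aggregated queries from the worked example can be run against a tiny in-memory database to see the inflation directly. This is only an illustrative sketch: it assumes SQLite via Python's standard `sqlite3` module, and the rows are invented for the demonstration (one region, one store, two orders). It also shows one possible form of the grain-verification habit, comparing the row count after the one-to-many join with the count of distinct order keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY);
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER);
CREATE TABLE order_lines (
    line_id INTEGER PRIMARY KEY, order_id INTEGER, is_backordered INTEGER
);
INSERT INTO regions VALUES (1);
INSERT INTO stores VALUES (10, 1);
INSERT INTO orders VALUES (100, 10), (101, 10);
-- Invented data: order 100 has three backordered lines, order 101 has one.
INSERT INTO order_lines VALUES
    (1, 100, 1), (2, 100, 1), (3, 100, 1), (4, 101, 1), (5, 101, 0);
""")

# WRONG: the one-to-many join to order_lines multiplies orders before COUNT(*).
WRONG = """
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered
GROUP BY r.region_id
"""

# RIGHT: collapse order_lines to one row per order first, then join one-to-one.
RIGHT = """
WITH qualifying_orders AS (
    SELECT DISTINCT order_id FROM order_lines WHERE is_backordered
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN qualifying_orders q ON q.order_id = o.order_id
GROUP BY r.region_id
"""

# Grain check: if rows after the join exceed distinct order keys, the join fanned out.
GRAIN_CHECK = """
SELECT COUNT(*), COUNT(DISTINCT o.order_id)
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered
"""

wrong_rows = conn.execute(WRONG).fetchall()
right_rows = conn.execute(RIGHT).fetchall()
rows_after_join, distinct_orders = conn.execute(GRAIN_CHECK).fetchone()

print("wrong:", wrong_rows)
print("right:", right_rows)
print("rows after join:", rows_after_join, "distinct orders:", distinct_orders)
```

With this data the wrong query reports 4 orders for the region while the pre-aggregated query reports 2, and the grain check shows 4 joined rows against 2 distinct orders, which is the signal that the join changed the grain.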
- -## Acceptance criteria - -- The skill's `` Composition section states the multi-hop - generalization of the fan-out rule and a grain-verification habit, inline and - dialect-agnostic. -- It includes exactly one short, **generic** worked example (wrong vs. - pre-aggregated-right) using an invented schema, with no benchmark-derived - identifiers or values. -- No new tool, flag, or config; this is skill-content only (additive to spec 07). -- Existing analytics-skill content tests are updated to cover the added rule's - presence (mirroring spec 07's `analytics-skill-content.test.ts`). - -## Benchmark context (motivation only) - -Multi-hop aggregation questions (counting/averaging a coarse-grained measure -reached through several one-to-many joins) are a recurring source of -result-mismatch failures in the SQLite subset: the agent produces runnable SQL -with the right tables but a fan-out-inflated number. These are correctness -failures, not knowledge or schema-discovery failures (zero execution errors in -the latest run), so the fix belongs in the product's authoring craft — where it -also helps any real analyst — not in a benchmark-specific prompt. -``` diff --git a/spider2-specs/done/10-panel-completeness-spine.md b/spider2-specs/done/10-panel-completeness-spine.md deleted file mode 100644 index 91b9294b..00000000 --- a/spider2-specs/done/10-panel-completeness-spine.md +++ /dev/null @@ -1,65 +0,0 @@ -# Panel/period completeness — emit the full set of groups, not only the populated ones - -## Problem - -When a question asks for a result *per period* or *per category* ("orders for each -month of 2023", "revenue by region", "count per status"), the natural `GROUP BY` -only returns groups that actually have rows. Periods/categories with **zero** -activity silently vanish, so a "12 months" answer comes back with 9 rows and the -ones that should read `0` are simply absent. The agent writes runnable SQL with -the right aggregate but an **incomplete panel**. 
- -This is a universal reporting correctness issue: a monthly report with missing -months, or a category breakdown missing the empty categories, is wrong for any -analyst — and it is also a frequent result-mismatch shape on the benchmark. - -## Generic use case (independent of any benchmark) - -"How many orders were placed in each month of 2023?" must return **12 rows** even -if March had no orders (March = 0), not 11 rows. "Sales per region" should include -regions with no sales (as 0/NULL) when the question asks for *each* region. - -## Requirements - -Additive to the `ktx-analytics` skill's `` "Answer completeness / -interpretation" group (consistent with spec 07's inline, dialect-agnostic, heuristic -+ why style). - -1. **Recognize "full-panel" phrasing.** Cues like *each / every / per / - for all / by month* signal that the answer's row set should be the - **complete** set of periods or categories in scope, not just those present in - the filtered fact rows. - -2. **Build a spine, then LEFT JOIN.** Generate the full set of expected - groups — a date/number series via a recursive CTE for periods, or the distinct - dimension values from the authoritative dimension table for categories — and - LEFT JOIN the aggregated facts onto it, defaulting missing measures with - `COALESCE(metric, 0)` (or NULL when 0 would be wrong). *Why:* a plain inner - `GROUP BY` can only emit groups that have at least one fact row. - -3. **Don't over-apply.** When the question asks only about groups that exist - ("which months had orders"), the spine is unnecessary; the cue is *each/all* - vs *which*. - -## Leak-safety (hard constraint) - -Any worked example must use a **synthetic generic schema** (e.g. an `orders` -table with an `order_date`) and demonstrate only the *pattern* (spine + LEFT JOIN -+ COALESCE). No benchmark table names, SQL, or result values. The behavior is -reconstructable from first principles and tied to no specific instance. 
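The spine + LEFT JOIN + COALESCE recipe from the requirements can be sketched as follows. This is a hedged illustration, not the skill's shipped example: it assumes SQLite through Python's standard `sqlite3` module, uses an invented `orders(order_id, order_date)` table, and relies on SQLite's `strftime` to extract the month, which other engines spell differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
-- Invented data: no orders in March or from May onward.
INSERT INTO orders VALUES
    (1, '2023-01-15'), (2, '2023-01-20'), (3, '2023-02-03'), (4, '2023-04-09');
""")

# Plain GROUP BY: only months that have at least one order come back.
SPARSE = """
SELECT CAST(strftime('%m', order_date) AS INTEGER) AS m, COUNT(*) AS n_orders
FROM orders
WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
GROUP BY m
ORDER BY m
"""

# Spine: generate all 12 months, LEFT JOIN the aggregate, default gaps to 0.
PANEL = """
WITH RECURSIVE months(m) AS (
    SELECT 1
    UNION ALL
    SELECT m + 1 FROM months WHERE m < 12
),
monthly AS (
    SELECT CAST(strftime('%m', order_date) AS INTEGER) AS m, COUNT(*) AS n_orders
    FROM orders
    WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
    GROUP BY m
)
SELECT months.m, COALESCE(monthly.n_orders, 0) AS n_orders
FROM months
LEFT JOIN monthly ON monthly.m = months.m
ORDER BY months.m
"""

sparse = conn.execute(SPARSE).fetchall()
panel = conn.execute(PANEL).fetchall()

print(len(sparse), "rows without the spine")
print(len(panel), "rows with the spine")
```

On this data the plain `GROUP BY` returns 3 rows, while the spine version returns all 12 months with March and the other empty months reported as 0.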
- -## Acceptance criteria - -- `` states the full-panel cue, the spine + LEFT JOIN + COALESCE recipe, - and the over-application guard — inline and dialect-agnostic. -- At most one short generic example (recursive-CTE date spine or distinct-dimension - spine), no benchmark-derived content. -- Skill-content only; analytics-skill content tests updated to cover the rule. - -## Benchmark context (motivation only) - -Per-period / per-category questions where some periods are empty produce -short-row result mismatches in the SQLite subset. The fix is a universal -reporting habit (complete panels), so it belongs in the product's craft, where it -also helps real analysts — not in a benchmark-specific prompt. Related to spec 11 -(rolling/cumulative windows need a complete date spine to be correct). diff --git a/spider2-specs/done/11-time-series-window-recipes.md b/spider2-specs/done/11-time-series-window-recipes.md deleted file mode 100644 index 7c9bb355..00000000 --- a/spider2-specs/done/11-time-series-window-recipes.md +++ /dev/null @@ -1,73 +0,0 @@ -# Time-series window craft — running totals, rolling-N (min-periods), period-over-period - -## Problem - -A large share of analytics questions are time-series shaped: a **running/cumulative -balance**, a **rolling N-day average**, or **period-over-period growth**. The agent -knows window functions exist (spec 07 covers determinism and window-then-filter) but -gets the *time-series specifics* wrong: - -- cumulative balance computed without an unbounded preceding frame (or with the - frame defaulting incorrectly when there are ties on the order key); -- "rolling 30-day" implemented as `ROWS BETWEEN 29 PRECEDING` over **gappy** daily - data, so the window spans the wrong calendar span when days are missing; -- no **minimum-periods** handling — a rolling average is reported before the window - is actually full; -- "growth vs previous period" without `LAG`, or comparing to the wrong neighbor. 
- -These are runnable-but-wrong; the structure is close, the edge case diverges. - -## Generic use case (independent of any benchmark) - -- "Each account's month-end running balance over 2023" — cumulative sum of monthly - net over an ordered window. -- "30-day rolling average of daily revenue, only once 30 days of history exist." -- "Month-over-month revenue growth rate." - -All three are bread-and-butter for any analyst on any time-series table. - -## Requirements - -Additive to the `ktx-analytics` skill's `` "Window functions" group -(inline, dialect-agnostic, heuristic + why). - -1. **Cumulative / running total.** `SUM(x) OVER (PARTITION BY k ORDER BY t ROWS - BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, with a complete tie-breaker in - `ORDER BY` (spec 07 rule). *Why:* the default frame with a non-unique `ORDER BY` - can include/exclude peers unexpectedly. - -2. **Rolling window over time, not over rows.** When "rolling N days/months" is - asked, the window must span a calendar range. Over gappy data, either build a - complete date spine first (see spec 10) so `ROWS BETWEEN n-1 PRECEDING` equals - the intended span, or use a range/self-join keyed on the date. *Why:* row-count - frames over missing dates silently measure the wrong span. - -3. **Minimum periods.** When the question says "only after N periods of data" (or - it is implied by a rolling metric), emit NULL/skip until the window is full - (e.g. guard on `COUNT(*) OVER (...) = N`). *Why:* a partial early window is not - the requested metric. - -4. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)` - for prior-period comparisons; growth rate = `(cur - prev) / prev` computed at - full precision (round only at the end). Guard divide-by-zero/NULL prev. - -## Leak-safety (hard constraint) - -Worked examples must use a **synthetic generic schema** (e.g. `daily_revenue(day, -amount)` or `account_txns(account_id, txn_date, net)`) and show only the *pattern*. 
-No benchmark table names, SQL, or result values. - -## Acceptance criteria - -- `` "Window functions" gains the cumulative, rolling-over-time + - min-periods, and period-over-period recipes — inline, dialect-agnostic. -- At most one or two compact generic examples; no benchmark-derived content. -- Skill-content only; analytics-skill content tests updated. - -## Benchmark context (motivation only) - -Running-balance / rolling / period-over-period questions are the single largest -result-mismatch cluster in the SQLite subset (financial-transactions style DBs). -The methodology is universal analyst craft, so it belongs in the product's skill -(transfers to real users), not in a benchmark-specific prompt. Depends on spec 10 -(date spine) for the gappy-rolling case. diff --git a/spider2-specs/done/12-parse-text-encoded-numbers.md b/spider2-specs/done/12-parse-text-encoded-numbers.md deleted file mode 100644 index 43100e6c..00000000 --- a/spider2-specs/done/12-parse-text-encoded-numbers.md +++ /dev/null @@ -1,61 +0,0 @@ -# Parse text-encoded numeric columns before doing math on them - -## Problem - -Numeric measures are often stored as **text** with human formatting: unit suffixes -(`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators (`"$1,200"`), -percent signs (`"12%"`), or non-numeric sentinels for missing/zero (`"-"`, `"N/A"`, -`""`). Aggregating or comparing such a column directly is silently wrong: string -comparison orders `"100" < "9"`, and a naive `CAST(x AS REAL)` yields `0`/NULL on -the formatted values rather than the intended number. - -The agent already samples schemas (spec 07 schema-discovery), but when it sees a -"numeric" column it tends to assume it is a real number type and skips the parse — -so the arithmetic runs on garbage. Runnable, plausible, wrong. 
- -## Generic use case (independent of any benchmark) - -A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become `1200 / 3000000 -/ 0` before you can sum it or compute a daily change. A `price` stored as -`"$1,299.00"` must become `1299.00` before averaging. This is routine data hygiene -on real, messy production tables. - -## Requirements - -Extend the `ktx-analytics` skill's `` "Schema discovery before writing -SQL" group (inline, dialect-agnostic, heuristic + why). - -1. **Detect text-encoded numerics during sampling.** When a column that the - question treats as a number is stored as text, sample distinct values to learn - the encodings actually present (suffixes, symbols, separators, sentinels) before - composing — never assume the format from the column name. - -2. **Parse and scale before arithmetic.** Strip currency/separator/percent - characters; multiply by the suffix scale (K=10^3, M=10^6, B=10^9); map sentinels - (`-`, `N/A`, empty) to `0` or `NULL` per the question's intent; then cast to a - numeric type. Do this in an early CTE so all downstream math sees clean numbers. - *Why:* string columns compared/aggregated as-is sort lexically and cast to 0, - producing silently wrong results instead of errors. - -3. **Confirm coverage.** After parsing, sanity-check that no intended-numeric value - failed to parse (would surface as NULL), to catch an encoding the sample missed. - -## Leak-safety (hard constraint) - -Worked examples must use a **synthetic generic schema** and made-up values (e.g. a -`metrics(label, value_text)` table with `"1.2K"`, `"-"`). No benchmark table names, -SQL, or result values; the parsing pattern is universal and tied to no instance. - -## Acceptance criteria - -- `` schema-discovery gains the detect → parse/scale → verify guidance — - inline, dialect-agnostic, with at most one short generic example. -- No benchmark-derived content. Skill-content only; content tests updated. 
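The detect, parse/scale, and verify steps might look like the sketch below. It is an assumption-laden illustration rather than the skill's own example: it runs on SQLite through Python's standard `sqlite3` module, uses the invented `metrics(label, value_text)` table with made-up values, maps sentinels to `0` (a question could equally call for `NULL`), leaves percentages unscaled, and treats any leftover non-digit character as an unknown encoding. `SUBSTR` with a negative start and `GLOB` character classes are SQLite features; other engines need different string functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (label TEXT PRIMARY KEY, value_text TEXT);
-- Invented values covering suffixes, symbols, sentinels, and one unknown form.
INSERT INTO metrics VALUES
    ('a', '1.2K'), ('b', '3M'), ('c', '-'), ('d', '$1,299.00'),
    ('e', 'N/A'), ('f', '12%'), ('g', '~40');
""")

PARSE = """
WITH cleaned AS (
    -- Strip currency symbols, thousands separators, and percent signs.
    SELECT label, value_text,
           REPLACE(REPLACE(REPLACE(TRIM(value_text), '$', ''), ',', ''), '%', '') AS s
    FROM metrics
),
parsed AS (
    SELECT label, value_text,
           CASE
               WHEN s IN ('', '-', 'N/A') THEN 0.0
               WHEN UPPER(SUBSTR(s, -1)) = 'K'
                   THEN CAST(SUBSTR(s, 1, LENGTH(s) - 1) AS REAL) * 1e3
               WHEN UPPER(SUBSTR(s, -1)) = 'M'
                   THEN CAST(SUBSTR(s, 1, LENGTH(s) - 1) AS REAL) * 1e6
               WHEN UPPER(SUBSTR(s, -1)) = 'B'
                   THEN CAST(SUBSTR(s, 1, LENGTH(s) - 1) AS REAL) * 1e9
               -- Anything still containing a non-digit is an unknown encoding.
               WHEN s GLOB '*[^0-9.]*' THEN NULL
               ELSE CAST(s AS REAL)
           END AS value_num
    FROM cleaned
)
SELECT label, value_text, value_num FROM parsed ORDER BY label
"""

rows = conn.execute(PARSE).fetchall()
parsed = {label: value_num for label, _text, value_num in rows}

# Coverage check: values that failed to parse surface as NULL (None in Python).
unparsed = [text for _label, text, value_num in rows if value_num is None]

print(parsed)
print("unparsed:", unparsed)
```

The coverage check at the end is the part that catches an encoding the initial sample missed: here the invented `'~40'` value does not match any handled form, so it comes back as `NULL` and is listed for follow-up instead of being silently folded into a sum.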
- -## Benchmark context (motivation only) - -At least one SQLite-subset question stores trading volume as suffix-encoded text -("K"/"M", "-" for zero) and fails because the agent aggregates the raw strings. The -fix — parse messy encodings before math — is universal data hygiene that helps any -analyst, so it belongs in the product's craft rather than a benchmark-specific -prompt. diff --git a/spider2-specs/done/14-output-completeness-final-check.md b/spider2-specs/done/14-output-completeness-final-check.md deleted file mode 100644 index 49445e18..00000000 --- a/spider2-specs/done/14-output-completeness-final-check.md +++ /dev/null @@ -1,105 +0,0 @@ -# Enforce answer-output completeness with a final pre-emit check in the analytics skill - -## Problem - -The single largest correctness failure mode is **incomplete output**: the query runs and the -methodology is roughly right, but the result is missing columns the question asked for. Three -recurring sub-patterns: - -1. **Multi-part questions answered partially.** A question that asks for several things ("report - the highest *and* the lowest month, each with its count and average, *and* the difference") - comes back with only the first part — one column instead of the several requested. -2. **Identity dropped.** Grouping by a human-readable name but not projecting the entity's - identifier (e.g. a product name without its product id, a customer name without its - customer id). -3. **Inputs to a derived value dropped.** Returning a ratio / percentage / difference but not - the underlying counts the question also asked for. - -Sub-patterns 2 and 3 are **already covered by `` rules** in the analytics skill -(spec 07: *"expose identity, not just the label"* and *"keep the inputs to a derived value"*), -yet they are frequently **not applied**. So the gap is not missing knowledge — it is that these -rules are passive heuristics buried in a list, and the agent doesn't reliably check them before -finalizing. 
The fix is to (a) add the missing multi-part-completeness rule and (b) turn -output-completeness into an **explicit final verification step** the agent performs before -emitting SQL. - -This is reinforced by evidence that the failure is **model-independent**: a markedly stronger -model produced the same incomplete-output mistakes on these questions, which means it is a -craft/enforcement gap, not a capability gap. - -## Generic use case (independent of any benchmark) - -An analyst is asked: *"For each region, report the highest and the lowest monthly order count, -and the difference between them."* A complete, useful answer has a column for the region's id -and name, the highest count, the lowest count, and the difference — five columns. Returning just -the region and a single number answers only part of the request. This is a universal expectation -on any database: answer **every** part of a multi-part request, identify the entities, and show -the inputs behind any derived figure. - -## Requirements - -Additive to the analytics skill's `` "Answer completeness / interpretation" group and -its workflow's validate step (inline, dialect-agnostic, heuristic + why, consistent with spec 07). - -1. **Multi-part / multi-output completeness (new rule).** When a question requests several - outputs — a list ("A, B, and C"), paired extremes ("the highest *and* the lowest"), or a - value plus its components ("X, Y, and their ratio") — the final projection must contain a - column for **each** requested output. *Why:* answering only the first clause is the most common - way a runnable query is still wrong; the grain and methodology can be perfect yet the answer - is short by columns. - -2. 
**Fold the existing identity / inputs rules into the same completeness notion.** The - already-shipped rules — project the entity **identifier** alongside any human-readable label, - and **keep the inputs** to any derived value — are part of output completeness; reference them - from the check below so they are actually applied, not just listed. - -3. **Add an explicit final completeness check (the enforcement mechanism).** Before emitting the - final SQL, the skill should have the agent **re-read the question and confirm the projection - covers**: every named metric/attribute; the identifier of every grouped/named entity; every - input to a derived value; all at the grain the question specifies. This is a short, concrete - checkpoint at the validate step — the point is to convert the passive heuristics into an active - pre-finalize verification. (Do **not** add unrequested/extra columns to be "safe" — that is - grader-gaming; the check is about matching the request exactly, not padding it.) - - Generic teaching example (synthetic schema — see Leak-safety): - ```sql - -- "For each region, report the highest and lowest monthly order count and their difference." - -- WRONG: answers only the first clause; no region id, no lowest, no difference. - SELECT region_name, MAX(monthly_orders) AS highest - FROM region_monthly GROUP BY region_name; - - -- RIGHT: one column per requested output + the entity's identity, at the region grain. 
- SELECT r.region_id, r.region_name, - MAX(m.monthly_orders) AS highest_monthly_orders, - MIN(m.monthly_orders) AS lowest_monthly_orders, - MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference - FROM regions r - JOIN region_monthly m ON m.region_id = r.region_id - GROUP BY r.region_id, r.region_name; - ``` - -## Leak-safety (hard constraint) - -The example must use an **invented, generic schema** (`regions`, `region_monthly`) and made-up -columns — **no benchmark table names, SQL, or result values.** It teaches the *pattern* (cover -every requested output + identity + inputs), which is universal and tied to no specific instance. - -## Acceptance criteria - -- The skill states the multi-part-completeness rule and a concrete **final completeness check** - (re-read question → verify metrics + identity + inputs + grain), inline and dialect-agnostic, - cross-referencing the existing identity/inputs rules so they're enforced. -- Includes the over-projection guard (don't pad with extra columns — that's grader-gaming). -- One short generic example (wrong vs complete); no benchmark-derived content. -- Skill-content only; analytics-skill content tests updated to cover the new rule + check. - -## Benchmark context (motivation only) - -In the latest SQLite-subset run, **incomplete output was the single largest failure bucket -(~13 of 51 voted failures)**: multi-part questions answered partially, and identity / derived-value -inputs dropped — the latter two being spec-07 rules that already exist but weren't applied. A -probe with a much stronger model reproduced the *same* incomplete-output failures, confirming this -is a craft-enforcement gap rather than a model-capability one. The fix — answer every requested -part, identify entities, keep inputs — is universal analyst craft, so it belongs in the product -skill (and transfers to real users), enforced as a final check rather than left as a passive hint. 
-``` diff --git a/spider2-specs/done/15-mcp-server-structured-logging.md b/spider2-specs/done/15-mcp-server-structured-logging.md deleted file mode 100644 index 294c986c..00000000 --- a/spider2-specs/done/15-mcp-server-structured-logging.md +++ /dev/null @@ -1,116 +0,0 @@ -# Structured, leveled logging for the ktx MCP server - -> **Scope: observability only.** This spec is about *seeing* what the MCP server -> does (which tool, what params, when, how long, outcome). *Preventing* a runaway -> query from blocking the server (off-event-loop / interruptible query execution) -> is a separate concern — see "Non-goals" and the sibling spec note below. - -## Problem - -The ktx MCP server (`packages/cli/src/mcp-http-server.ts` + -`mcp-server-factory.ts`; raw `node:http` + `@modelcontextprotocol/sdk` -`StreamableHTTPServerTransport`) emits almost no operational logs. There is no -server-side record of **which MCP tool was called, with what parameters, when, -how long it took, or whether it succeeded** — nor of session open/close or -transport errors. When a tool call is slow, hangs, or a client connection drops -("Transport channel closed"), an operator has no trail to diagnose it and must -resort to process sampling / `lsof` / guesswork — and the offending input -(e.g. the exact SQL) is typically unrecoverable. - -## Generic use case - -Anyone running a long-lived ktx MCP server — a developer's local instance, a -shared team server, or a hosted deployment — needs observability into tool-call -activity to: -- diagnose slow or hung tool calls (which `sql_execution` ran, against which - connection, with what SQL, for how long); -- explain client-visible connection failures from the server side (session - lifecycle, transport-closed events); -- audit what agents asked the server to do; -- spot patterns (hot tools, slow connections, error rates). - -This is standard production-server hygiene; the server currently provides none. 
- -## Requirements (sketch — refine when picked up) - -1. **One structured (JSON) logger, low overhead.** Suggested `pino` (orientation - only; implementer owns the choice). A single shared instance; write **JSON to - stdout** (12-factor — the launcher/aggregator routes it). No in-app file - rotation. Optional human-readable pretty output only when attached to a TTY - (dev). -2. **Configurable level via env** (e.g. `KTX_LOG_LEVEL`, default `info`; `debug` - for diagnosis) — verbose logging on demand without code changes. -3. **Per-session / per-call context** via child loggers: every line carries a - `sessionId` (from the transport session) and, for tool calls, a `callId` + - `tool` name, so one session's or call's activity can be traced/grepped. -4. **Tool-call logging — START logged BEFORE execution, COMPLETION after.** For - every MCP tool invocation: - - on entry: log `{ tool, params, sessionId, callId }` **before** running the - handler (so the record exists even if the handler never returns); - - on exit: log `durationMs` + outcome (ok with result size, or error with - stack). - This makes a **hung / never-returning call identifiable**: a start with no - matching completion is the culprit, with its exact parameters and timestamp. - This matters specifically because handlers like `sql_execution` run a - *synchronous* better-sqlite3 query — a runaway query blocks the process and no - completion is ever logged, so the start line (flushed before the blocking - call) is the only record. For `sql_execution`, `params` should include the SQL - text (the most useful field). Emit a **WARN** when a *completed* call exceeds a - configurable slow threshold (e.g. `KTX_SLOW_TOOL_MS`). -5. **Connection / session lifecycle:** log session open/close (with `sessionId`) - and transport errors (the SDK's closed-channel / "Transport channel closed" - events) so client-side connection failures have a server-side counterpart. -6. 
**Error logging** with structured stack traces (a standard error serializer), - not bare strings. -7. **Light redaction — credentials only** (bearer token, connection - passwords/secrets). SQL text and tool params are *not* secrets and must be - logged. Do not over-redact. -8. **Synchronous logging is fine.** The server uses a synchronous DB client, so - logging need not be async; prefer the simpler synchronous stdout path over - async/worker transports (which can lose buffered lines on a hard crash). Do - not introduce async-logging machinery. - -## Acceptance criteria (sketch) - -- With `KTX_LOG_LEVEL=debug`, invoking any MCP tool produces a `tool.start` - (tool, params, sessionId, callId) and a `tool.end` (durationMs, outcome) line - on the server's stdout, as JSON. -- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a - `tool.start` line carrying its **exact SQL and timestamp** and **no** - `tool.end` — so the offending query is recoverable from the log alone, with no - process sampling. -- A completed tool call slower than the configured threshold emits a WARN with - its duration. -- Session open/close and transport-closed events are logged with the `sessionId`. -- At default level (`info`), routine per-tool lines are suppressed but lifecycle, - slow-call warnings, and errors are present. -- Credentials (bearer token, connection secrets) never appear in logs; SQL and - tool params do. -- No new heavy dependencies beyond the logger; no OpenTelemetry/metrics stack; no - async-transport machinery. - -## Non-goals - -- **Preventing/interrupting runaway queries** (off-event-loop execution, query - timeouts, worker-thread isolation). That is a *separate* spec; a single - synchronous query that fans out into a massive nested-loop join can peg the - single-threaded server for hours and break new connections — observability - surfaces *which* query, but the fix is execution-model work. 
(This logging is - also a prerequisite for a future watchdog that detects a `tool.start` with no - `tool.end` past a threshold and recycles the server.) -- Metrics/tracing/OpenTelemetry exporters. -- Forwarding logs to the MCP *client* via the protocol's logging capability - (`notifications/message`, `logging/setLevel`) — a possible later enhancement, - distinct from operational stdout logging. - -## Benchmark context (motivation, not a requirement) - -Running Spider 2.0-Lite against the MCP server at concurrency, an -adversarial-reviewer-generated query degenerated into a massive nested-loop join; -synchronous better-sqlite3 executed it on the event loop, pegging a server at -~100% CPU for hours and breaking new MCP connections to it ("Transport channel -closed"). We could not determine *which* query, because the server logs nothing -about tool calls — diagnosis required `sample`/`lsof` on the live process and the -exact SQL was never recovered. Structured tool-call logging (especially -start-before-execute) would have turned this into a one-line `grep` of the server -log. diff --git a/spider2-specs/done/16-bounded-query-execution-timeout.md b/spider2-specs/done/16-bounded-query-execution-timeout.md deleted file mode 100644 index 5ecd43d3..00000000 --- a/spider2-specs/done/16-bounded-query-execution-timeout.md +++ /dev/null @@ -1,131 +0,0 @@ -# Bounded query execution (deadline + non-blocking) for read SQL - -> Priority: HIGH. Found empirically during a Spider2-lite sqlite run -> (2026-06-18): a single `sql_execution` MCP call wedged a worker at 100% CPU -> for 13+ minutes and never returned. The query -> `SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on the -> `complex_oracle` sqlite database hit a VIEW (`costs ⋈ sales`, 918,843 × 82,112 -> rows, joined on a 4-column key with no composite index) whose plan degraded to -> an O(N×M) nested-loop scan. 
Because the sqlite connector runs -> `better_sqlite3 .all()` **synchronously with no timeout**, it blocked the MCP -> worker's entire event loop: no `tool.end` was ever logged, the port went -> unresponsive, and the query could not be cancelled. One of four eval shards -> stalled until the worker was killed by hand. - -## Problem - -Two compounding gaps on the read-query path: - -1. **No execution deadline.** A single expensive query runs unbounded. This is - handled divergently per connector, with no shared contract: BigQuery has a - real server-side job timeout (`job_timeout_ms`); ClickHouse has an HTTP - `request_timeout`; Snowflake, Postgres, MySQL, and SQL Server bound only - connection/pool *acquisition*, not statement *execution*; SQLite has nothing. - So whether a runaway query is bounded depends entirely on which driver the - caller happened to hit. - -2. **In-process engines block the event loop and can't be cancelled.** The - sqlite connector executes on the main thread via synchronous - `better_sqlite3 .all()`. A slow query freezes the whole MCP server (it can't - serve other requests, send progress, or write `tool.end`), and there is no - way to interrupt it: better-sqlite3 exposes no interrupt/cancel API — its - documented mechanism for slow queries is to run them in a **worker thread**, - and the only way to stop a runaway synchronous query is to terminate the - thread executing it. - -The net effect is a query that produces a `tool.start` with no matching -`tool.end`, an unresponsive server, and no self-recovery. A row cap (`maxRows`) -does not help — it bounds returned rows, not scan work, and the failing query -returned a single aggregate row. - -## Generic use case - -Any data agent that lets an LLM author SQL will eventually issue an -accidentally-expensive query — an unindexed or cartesian join, an expensive -VIEW, a wide aggregate over a large fact table. 
A general-purpose context layer -must bound that and return a clean, fast "query exceeded Ns" error so the agent -can revise (add filters, query base tables, narrow the range) instead of hanging -the tool and the server. This matters for embedded/local warehouses (sqlite, -duckdb) and remote ones alike, and is wholly independent of any benchmark. - -## Requirements - -1. Every read-query execution path (`executeReadOnly`) enforces a single - canonical execution deadline. One opinionated default; **not** a per-call - user flag. Where a driver already supports a per-connection timeout - (BigQuery `job_timeout_ms`), reuse that as the per-connection override rather - than inventing a parallel knob. -2. On exceeding the deadline the path resolves with a `KtxQueryError` - ("query exceeded {N}s") — a finite, decision-reaching outcome, never an - unbounded hang. -3. The deadline is a **shared contract at the connector boundary**, defined once - (on the `executeReadOnly` contract or a shared wrapper at the call site) so - all drivers participate. Bring the existing divergent timeouts (BigQuery job - timeout, ClickHouse request timeout) under this one contract instead of - leaving parallel mechanisms. -4. For in-process engines (sqlite today, any future embedded driver), execution - MUST NOT block the MCP server event loop. Run the query off the main thread - and enforce the deadline by terminating that thread on timeout (the - better-sqlite3-documented approach, since synchronous queries are - uncancellable in-thread). The event loop must stay responsive so `tool.end` - is always written and concurrent requests on the same port are served. -5. Prefer real cancellation over client-side give-up. 
Where the engine supports - a server-side statement timeout (Postgres `statement_timeout`, MySQL - `max_execution_time`, Snowflake `STATEMENT_TIMEOUT_IN_SECONDS`, ClickHouse - `max_execution_time`, BigQuery job timeout, SQL Server request timeout), set - it so the deadline actually stops work, not merely abandons the promise while - the query keeps running. For in-process engines, thread termination is the - cancellation. -6. The MCP `sql_execution` tool surfaces the timeout as an expected error - (classified as `KtxQueryError`, not a `$exception` fault, consistent with - existing expected-error classification) and logs a `tool.end` with the error - outcome. -7. Read-only enforcement (`assertReadOnlySql`) and the `maxRows` row cap remain - unchanged. The deadline is additive; `maxRows` is not a substitute for it. - -## Acceptance criteria - -- A read query that exceeds the deadline returns a `KtxQueryError` within - roughly the deadline; the MCP worker stays responsive (a concurrent tool call - on the same server completes while the slow query is still pending) and writes - a matching `tool.end` with a non-ok outcome. -- sqlite specifically: executing a deliberately pathological query (e.g. an - expensive VIEW or an unindexed cross join) on a fixture does not block the - event loop, is terminated at the deadline, and CPU returns to idle afterward - (the off-main-thread executor is killed, not left spinning). -- No regression: normal fast queries return identical results; read-only - rejection still works; `maxRows` still bounds returned rows. -- Tests cover the deadline path for at least the in-process driver (sqlite, - terminate-on-deadline) and one server-side-timeout driver. - -## Benchmark context (motivation only) - -The Spider2-lite local set loads several warehouses into sqlite, some with -expensive VIEWs over large fact tables — e.g. 
`complex_oracle.profits` = -`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112 -rows, no composite index, with `promo_id` (the index the optimizer picks) being -95.5% a single value. LLM-authored profiling queries (MIN/MAX/COUNT over such a -view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval -shard for 10+ minutes; with one, the agent gets a fast error and can scope the -query instead. - -## Orientation hints (code pointers; may have drifted) - -- Shared contract: `packages/cli/src/context/scan/types.ts` — - `KtxScanConnector.executeReadOnly` (~343), `KtxReadOnlyQueryInput` (~285). -- MCP call site: `packages/cli/src/context/mcp/local-project-ports.ts:70` - (`connector.executeReadOnly`); tool registration in - `packages/cli/src/context/mcp/context-tools.ts`. -- In-process sync execution (the acute hang): - `packages/cli/src/connectors/sqlite/connector.ts:311-313` - (`better_sqlite3 .prepare().all()`). -- Existing divergent timeouts to unify: `connectors/bigquery/connector.ts` - (`job_timeout_ms` / `jobTimeoutMs`), `connectors/clickhouse/connector.ts:602` - (`request_timeout`), `connectors/snowflake/connector.ts:342` (test/pool only), - `connectors/postgres/connector.ts`, `connectors/mysql/connector.ts`, - `connectors/sqlserver/connector.ts` (pool/connection only). -- Error class: `packages/cli/src/errors.ts:25` (`KtxQueryError`). -- better-sqlite3 (context7 `/wiselibs/better-sqlite3`, v12.x): no - interrupt/cancel API; `docs/threads.md` documents the worker-thread pattern - for slow queries (master owns worker lifecycle and respawns on exit) — extend - it with terminate-on-deadline to enforce the timeout. 
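As an illustration of requirement 4 in the spec above, here is a minimal sketch of deadline enforcement, assuming the query has already been moved off the main thread. `runWithDeadline`, `RunningQuery`, `QueryDeadlineError`, and the `cancel` callback are hypothetical names, not ktx code; in a real connector `cancel` would terminate the worker thread running the query, and the error would be a `KtxQueryError`.

```typescript
// Hypothetical stand-in for KtxQueryError; the real class lives in ktx's errors module.
class QueryDeadlineError extends Error {
  constructor(deadlineMs: number) {
    super(`query exceeded ${deadlineMs / 1000}s`);
    this.name = 'QueryDeadlineError';
  }
}

// A query that is already running somewhere other than the main thread.
interface RunningQuery<T> {
  result: Promise<T>;
  // Must actually stop the work (for example by terminating a worker thread).
  cancel: () => void;
}

// Race the query against a timer. On timeout, cancel the work first, then
// reject, so the caller always reaches a finite outcome.
async function runWithDeadline<T>(
  start: () => RunningQuery<T>,
  deadlineMs: number,
): Promise<T> {
  const { result, cancel } = start();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      cancel();
      reject(new QueryDeadlineError(deadlineMs));
    }, deadlineMs);
  });
  try {
    return await Promise.race([result, timeout]);
  } finally {
    // Clear the timer on the fast path so it does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

The timer lives on the event loop, so it can only fire if the query runs elsewhere. Racing a timer against a synchronous `.all()` call on the main thread would never fire, which is why the spec pairs the deadline with off-main-thread execution.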
diff --git a/spider2-specs/done/18-bigquery-cross-project-datasets.md b/spider2-specs/done/18-bigquery-cross-project-datasets.md deleted file mode 100644 index e83c74d8..00000000 --- a/spider2-specs/done/18-bigquery-cross-project-datasets.md +++ /dev/null @@ -1,68 +0,0 @@ -# 18 — BigQuery cross-project dataset support (introspect foreign-hosted datasets, bill in own project) - -**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/18-…`. - -## Problem (generic, real-world) - -Analysts routinely query datasets that live in a **different** BigQuery project than the one -they bill jobs to — Google's `bigquery-public-data`, a partner's shared project, an -organization's central data project, etc. To make those connectable in ktx (so `discover_data`, -the semantic layer, dictionary sampling, and `sql_dialect_notes` work), ktx must be able to -**introspect a dataset hosted in a foreign project while running/billing jobs in the -credentials' own project**. - -Today it can't. ktx's BigQuery connector derives a single `projectId` from -`credentials.project_id` and uses it for **both** job billing **and** schema introspection: - -- `connectors/bigquery/connector.ts:294` — `projectId` is read only from `credentials.project_id`; - there is no separate billing-vs-dataset project knob. -- `:544` (`introspectDataset`) — calls `this.getClient().dataset(datasetId)`, which resolves the - dataset **in the client's (billing) project**, and labels every table `catalog: this.resolved.projectId`. -- `:453` (`listTables`) — queries `\`${projectId}\`.\`region-…\`.INFORMATION_SCHEMA.TABLES`, i.e. the - **billing** project's INFORMATION_SCHEMA. -- `:163` (`datasetIds()`) — returns `dataset_ids` verbatim; it never parses a `project.` prefix. - -So a `dataset_id` naming a dataset in another project can't be introspected, even though querying -it works fine (cross-project reads bill to the caller's project — that path already works). 
- -### Empirical confirmation -With a service account in project `ktx-spider2-lite`: -- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (looks in - `projects/ktx-spider2-lite/datasets/austin_311`). -- The cross-project form `DatasetReference("bigquery-public-data","austin_311")` → **succeeds** - (lists the public tables; public metadata is readable by any authenticated principal). -- There is **no config knob** to separate the introspection project from the billing project. - -## Requirement - -The BigQuery connector must accept **fully-qualified `project.dataset` entries** in `dataset_ids` -(a single connection may span more than one source project), and for each: -- **introspect** via the *dataset's* project — `client.dataset(id, { projectId })` / - `DatasetReference(project, dataset)`, query the **dataset project's** `INFORMATION_SCHEMA`, and - label the table `catalog` with the dataset's project; -- **run jobs / bill** in `credentials.project_id` (unchanged). - -A bare `dataset` (no `project.`) keeps today's behavior (resolve in `credentials.project_id`), so -existing single-project connections are unaffected. - -## Acceptance - -- `dataset_ids: ['bigquery-public-data.austin_311']` (credentials in a *different* project) → - `ktx ingest ` introspects the tables, enriches, and samples values; `discover_data` / - `dictionary_search` return them. -- A connection mixing `['bigquery-public-data.x', 'other-project.y']` introspects both. -- `sql_execution` of a fully-qualified `project.dataset.table` query still runs and bills in - `credentials.project_id`. -- Single-project `dataset_ids: ['my_dataset']` behaves exactly as before (no regression). 
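
The parsing rule in the requirement above can be sketched as follows. This is an illustration under stated assumptions, not the connector's actual code: `parseDatasetId` and `DatasetTarget` are hypothetical names, and splitting on the last dot assumes that dataset names never contain a dot.

```typescript
// Hypothetical shape: where to introspect one `dataset_ids` entry.
interface DatasetTarget {
  projectId: string; // project hosting the dataset (introspection, `catalog` label)
  datasetId: string;
}

// Split a `dataset_ids` entry into its hosting project and dataset name.
// A bare `dataset` falls back to the billing project, which is today's behavior.
function parseDatasetId(entry: string, billingProjectId: string): DatasetTarget {
  const trimmed = entry.trim();
  // Assumption: dataset names contain no dot, so the last dot is the separator.
  const dot = trimmed.lastIndexOf('.');
  if (dot === -1) {
    if (trimmed === '') throw new Error('Empty dataset id');
    return { projectId: billingProjectId, datasetId: trimmed };
  }
  const projectId = trimmed.slice(0, dot);
  const datasetId = trimmed.slice(dot + 1);
  if (projectId === '' || datasetId === '') {
    throw new Error(`Invalid dataset id: "${entry}"`);
  }
  return { projectId, datasetId };
}
```

Jobs would still run and bill in `credentials.project_id`; only introspection and the `catalog` label would use the parsed `projectId`.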
- -## Benchmark context (motivation only — do not encode benchmark specifics) - -Spider 2.0-Lite's **BigQuery slice (205 questions)** is otherwise **unservable faithfully**: every -one of its ~74 logical databases groups datasets hosted in foreign public projects -(`bigquery-public-data`, `isb-cgc-bq`, `data-to-insights`, …), never in a project we own. Query -execution already works cross-project (proven), but ktx-only *discovery* (the whole point of the -faithful surface) is blocked because the connector can't introspect them. Scope is small: of 74 -BQ dbs only **1** spans more than one source project, so "let `dataset_ids` carry `project.dataset` -and introspect each in its own project" covers the benchmark and the general case alike. This is -the sole blocker for the BigQuery leaderboard slice (the Snowflake slice needed no connector -change and is already baselined). diff --git a/spider2-specs/done/19-durable-bounded-relationship-detection.md b/spider2-specs/done/19-durable-bounded-relationship-detection.md deleted file mode 100644 index 3435d2a7..00000000 --- a/spider2-specs/done/19-durable-bounded-relationship-detection.md +++ /dev/null @@ -1,89 +0,0 @@ -# 19 — Durable, resumable, bounded relationship detection during ingest enrichment - -**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/19-…`. - -## Problem (generic, real-world) - -Ingest enrichment runs three stages in a fixed order inside `runLocalScanEnrichment` -(`packages/cli/src/context/scan/local-enrichment.ts`): - -1. `descriptions` (`:530`) — per-table LLM descriptions (the expensive step: one model call per - table; on a large schema this is minutes of paid LLM work). -2. `embeddings` (`:559`) — column embeddings. -3. `relationships` (`:593`) — FK/join discovery: profiles a row sample of **every** table, then - validates candidate joins. 
- -The queryable semantic-layer artifacts are persisted **once, at the very end**, by -`writeLocalScanEnrichmentArtifacts` in `local-scan.ts:510` — which runs **after** -`runLocalScanEnrichment` returns, i.e. after all three stages. - -This creates three failure modes that compound on large schemas (hundreds of tables): - -1. **Enrichment is lost if relationship detection is interrupted.** The descriptions + embeddings - are computed and held in memory, but they only reach the durable, queryable artifacts when the - final write runs after the `relationships` stage. If the process is killed/crashes/times out - **during** relationship detection (the last, slowest, silent stage), the artifacts are never - written — the schema survives (it was written earlier at `local-scan.ts:473`) but **all the - paid LLM enrichment is discarded**. Empirically: ingesting a 95-table BigQuery dataset produced - full descriptions + embeddings (progress reached "Building embeddings 17/17"), then the - relationships stage ran silently past a supervising deadline and was killed — the persisted - `_schema` had **0** AI descriptions, only the native column comments. Every larger dataset hits - this, so the most expensive work is the most likely to be thrown away. - -2. **Re-running does not resume — it re-spends.** There is a stage state store - (`SqliteLocalScanEnrichmentStateStore`) and a `runEnrichmentStage` helper (`:413`) that saves - each completed stage's output. But the completed-stage lookup keys on **`runId`** - (`findCompletedStage({ runId, stage, inputHash })`, `:427`), and `runId` is fresh per ingest - invocation. So resume only works *within* a single run; re-running an interrupted ingest gets a - new `runId`, misses the cache, and **re-computes descriptions + embeddings from scratch** - (re-paying for the LLM work that already succeeded). - -3. 
**Relationship detection is unobservable and unbounded.** The stage emits no progress between - "Detecting relationships" and the final "Relationship detection found N accepted" — minutes of - silence on a large schema. A supervisor watching for liveness cannot distinguish a slow-but- - working profile from a true hang, and there is no internal time/work budget, so on a very large - schema it can run far longer than any reasonable deadline. - -## Requirements - -1. **Checkpoint queryable artifacts before relationship detection.** Persist the descriptions + - embeddings into the semantic-layer artifacts as soon as the `embeddings` stage completes, before - the `relationships` stage runs. Relationship detection then appends/merges its own artifact on - completion. Net: the expensive LLM + embedding enrichment is **always durable and queryable**, - even if relationship detection fails, is interrupted, or is skipped. (A failed/partial - relationship stage should degrade to "no/partial joins", never to "no descriptions".) - -2. **Make stage resume work across runs.** Resolve a completed stage by stable content identity - — `(connectionId, stage, inputHash)` — independent of `runId`, so re-running an interrupted - ingest resumes the finished `descriptions`/`embeddings` stages from cache and only re-runs what - actually failed (e.g. `relationships`). Re-running after an interruption must not re-spend LLM - credits on stages that already succeeded. - -3. **Make relationship detection observable and bounded** (mirrors spec 16's bounded query - execution). Emit progress through the existing progress port — e.g. "Profiling table K/N", - "Validating candidate K/M" — so liveness is visible. Enforce an overall time/work budget - (configurable, e.g. under `scan.relationships`) so on a very large schema the stage stops - gracefully and returns the relationships found so far (partial) rather than running unboundedly. 
- Partial completion is persisted (per requirement 1) and marked as such. - -## Acceptance - -- Interrupting an ingest **during** relationship detection still leaves a queryable semantic layer - with the table/column descriptions + embeddings that were generated (verified: re-open the - connection, descriptions are present). -- Re-running an interrupted ingest **does not** regenerate descriptions/embeddings whose stage - already completed (verified: no LLM description calls for the cached tables; only the failed - stage re-runs). -- A connection with hundreds of tables emits relationship-stage progress and completes within the - configured budget, persisting partial relationships if the budget is hit — without discarding - enrichment. -- Small/single-run ingests behave exactly as before (no regression in artifacts or relationship - output when nothing is interrupted). - -## Benchmark context (motivation only — do not encode benchmark specifics) - -The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables (`ebi_chembl` -785, `fec` 486, `ga360` 366, …). Enriching them with claude-code costs real, rate-limited LLM -budget; losing that enrichment to a relationship-stage interruption — and re-spending it on every -retry — makes large-schema ingest impractical. This is a general durability/cost property of the -ingest pipeline, independent of the benchmark; the benchmark only made it acute at scale. diff --git a/spider2-specs/done/20-resilient-enrichment-under-slow-llm.md b/spider2-specs/done/20-resilient-enrichment-under-slow-llm.md deleted file mode 100644 index ab1e176e..00000000 --- a/spider2-specs/done/20-resilient-enrichment-under-slow-llm.md +++ /dev/null @@ -1,101 +0,0 @@ -# 20 — Resilient enrichment under a slow/hung LLM backend - -**Status:** draft (intake). Requirement-level; the implementer refines into `specs/20-*.md`. 
- -This is the **enrichment-stage** analog of two already-shipped specs: -- spec 16 (bounded query execution) — bound *and actually cancel* a runaway read query (child-thread/process kill, not a cosmetic JS deadline); -- spec 19 (durable/bounded relationship detection) — checkpoint expensive ingest work so an interruption doesn't lose it. - -Spec 16 hardened the **read-query** path and spec 19 checkpointed at **stage boundaries**. The same two -weaknesses still exist *inside the descriptions enrichment stage*, and together they turned a single hung -table into an indefinite wedge plus total loss of an entire stage's LLM work. - -## Problem / requirement - -Two compounding gaps on the per-table description-enrichment path, observed end-to-end: - -### 1. The per-table LLM timeout does not actually terminate the work - -The per-table `generateObject` enrichment call is wrapped in `retryAsync` with a fresh -`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (ktx commit `01f63380`). When the LLM -backend is a **subprocess** (the `codex` backend spawns a child `codex` process; `claude-code` likewise -spawns a child) and that child **hangs with an open connection to the provider** (TCP ESTABLISHED, ~0% -CPU, no bytes flowing), the JS-level `AbortSignal` fires but **does not kill the child process or unblock -the await** — so the call sits *past* its own timeout indefinitely. - -Observed (BigQuery ingest, codex backend, 2026-06-23): with `KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min), -two of `covid19_usa`'s widest tables (252 columns) hung; the stage sat at **268/285 for 41+ minutes** — -well past the 30-min per-attempt timeout — with exactly two codex children, each holding 3 ESTABLISHED -connections at ~0% CPU, until killed by hand. The timeout was cosmetic: it never terminated the hung -child. 
(This is precisely the failure mode spec 16 fixed for SQL — a deadline that fires in JS but cannot -interrupt the underlying work — applied to the enrichment LLM call instead of the query.) - -**Requirement:** the per-table enrichment-call timeout must be **enforced**, not advisory — when it fires, -the in-flight work is actually cancelled (subprocess SIGKILL for process-backed providers; request abort -for HTTP-backed ones) and the call returns/throws *promptly* so the stage can proceed (skip the table per -the existing no-retry-on-timeout policy). A hung table must cost at most ~one timeout, never unbounded -wall-clock. Provider-agnostic: it must hold for `codex`, `claude-code`, and HTTP backends alike. - -### 2. Descriptions are checkpointed only at full-stage completion, so a few bad tables lose all the good ones - -Spec 19 persists the descriptions checkpoint **after the descriptions stage completes** (before -relationships). There is no *within-stage* persistence: while the stage runs, every enriched table's -description lives only in memory. So if the stage cannot complete — e.g. 2 tables out of 285 hang (gap #1), -or the process is killed, or it hits the stall watchdog — **all** the already-enriched tables are lost, -even though their (expensive) LLM descriptions were finished. - -Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but **0** rows in -`local_scan_enrichment_stages` and **0** `ai:` descriptions on disk; killing the wedged ingest discarded -all 283, forcing a from-scratch re-ingest. The cost of 2 pathological tables was 283 tables' worth of -redone LLM calls. 
- -**Sharper observation (re-ingest with a short, enforced timeout):** even when the stage *does* run to -the end — the 2 hung tables hit a 4-min timeout and were skipped, so 283/285 descriptions were generated -and the ingest reported success (`Scan completed` / `Ingest finished`, embeddings built, exit 0) — the -descriptions were **still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the discard is **not** just -"lost on kill": a stage that completes with *any* skipped/aborted table currently persists **nothing**, -throwing away every successfully-generated description. The skip must be graceful — a skipped table costs -one missing description, not the entire stage's output. (This is the strongest argument for per-table -incremental persistence: the 283 good descriptions should have been durable the moment each was produced.) - -**Requirement:** persist enriched descriptions **incrementally** (per-table or per-batch) during the -descriptions stage, so that (a) tables that finished are durable even if the stage never completes, and -(b) a resumed ingest re-does only the *unfinished* tables, not the whole stage. The existing additive-write -design (spec 19 already preserves existing descriptions on re-ingest) is the foundation; this extends the -checkpoint granularity from once-per-stage to incremental. - -## Sketch (implementer to refine) - -- **Enforced timeout:** route enrichment-call cancellation through real termination — kill the codex/ - claude-code child process on timeout (reuse spec 16's child-kill mechanism), abort the HTTP request for - network backends. A fired `AbortSignal` must guarantee the await settles within a bounded grace period. -- **Sane default + the right tradeoff:** the default per-table timeout should be **moderate** (single-digit - minutes) with a small retry count, not very large — because the cost of a *hang* is the timeout value - itself, a long timeout is strictly worse for hangs. 
(The 30-min value used in the incident was an operator - override chosen to avoid cutting off slow-but-completing wide tables; with #1 enforced and incremental - checkpointing, a moderate default + skip is the better operating point.) -- **Incremental persistence:** flush descriptions per-batch (e.g. every N completed tables or on a timer) to - the same store/format used at stage completion; on resume, treat already-persisted tables as done and only - enrich the remainder. Keep it idempotent and additive (don't clobber prior descriptions). -- **Interaction with the stall watchdog:** with #1 enforced, no single table can starve progress for longer - than ~one timeout, so an external stall watchdog stops being the only backstop. - -## Generic use case (independent of the benchmark) - -Anyone ingesting a large or wide schema with an LLM enrichment backend (especially a *subprocess* backend, -which is the common local/desktop setup) will eventually hit a table whose description call hangs — a -provider stall, a rate-limit black-hole, a pathologically large prompt. Without an *enforced* timeout, one -such table wedges the whole ingest indefinitely; without *incremental* persistence, any interruption throws -away all the per-table LLM work already done (the dominant ingest cost). Both fixes make large-schema -enrichment **resilient and resumable** — a few bad tables degrade to a few skipped descriptions, not a -hung process and a from-scratch redo. This is core robustness for a general-purpose ingestion product, -wholly independent of any benchmark. - -## Benchmark context (motivation only — not a benchmark-specific rule) - -Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment backend). 
Re-enriching -the giant public datasets, `covid19_usa` wedged at 268/285 for 41+ minutes on 2 hung 252-column tables; the -30-min per-table `AbortSignal` timeout never killed the hung codex children, and because descriptions -checkpoint only at stage completion, the 283 already-enriched tables were unrecoverable — the operator had -to kill, cache-bust, and re-ingest the db from scratch (with a short timeout as a stopgap). The benchmark -just exercised a large/wide multi-dataset ingest at scale; the gap and the fix are generic. diff --git a/spider2-specs/done/21-selective-enrichment-stages.md b/spider2-specs/done/21-selective-enrichment-stages.md deleted file mode 100644 index 6226fbf2..00000000 --- a/spider2-specs/done/21-selective-enrichment-stages.md +++ /dev/null @@ -1,91 +0,0 @@ -# 21 — Selective enrichment stages (`--stages`) + per-stage cache keys - -**Status:** draft (intake). Requirement-level; the implementer refines into `specs/21-*.md`. - -Follow-on to spec 19 (durable/resumable relationship detection) and spec 20 (resilient enrichment). -Those made enrichment *survivable and resumable*; this makes it *selectively re-runnable* — re-run one -enrichment stage without re-paying for the others. - -## Problem / requirement - -Enrichment has three stages — **`descriptions`** (per-table LLM text), **`embeddings`** -(sentence-transformers over the schema/descriptions), **`relationships`** (FK/join detection, optionally -LLM-proposed). Today you cannot re-run a *subset* of them, and three facts in the current code make a -targeted re-run impossible without a full, expensive re-enrich: - -1. **One coarse cache key gates all three stages.** `context/scan/local-enrichment.ts:611` computes a - single `inputHash` from `{snapshot, mode, detectRelationships, providerIdentity, relationshipSettings}`, - and all three stages reuse it (descriptions ~`:641`, embeddings ~`:672`, relationships ~`:728`). So - changing *any* one stage's inputs invalidates *every* stage's cache. 
Concretely: flipping - `scan.relationships.llmProposals`, switching the LLM backend, or upgrading the embeddings model forces - ktx to re-run the **expensive per-table descriptions** even though they didn't conceptually change. -2. **No CLI surface to select stages.** The enrichment internally already supports a relationships-only - path (`mode: 'relationships'`, which skips the description/embedding stages — they're gated on - `mode === 'enriched'`), but `ktx ingest` exposes no flag to invoke it (only `--no-query-history`). - The capability is built; it's just not reachable. -3. **The per-stage storage already exists** (`local_scan_enrichment_stages` PK `(connection_id, stage, - input_hash)`) and the **additive write already preserves existing descriptions** on re-ingest — so the - foundation for "touch one stage, keep the rest" is in place; only the key granularity and the CLI - surface are missing. - -**Requirement:** let an operator re-run a chosen subset of enrichment stages on already-ingested -connection(s), recomputing only those stages and **preserving the others' artifacts untouched** — cheaply, -without re-running unchanged (especially the costly `descriptions`) stages. - -## Design decisions (resolved during intake; implementer may refine) - -- **CLI flag: `--stages `** (plural). Accepts a comma-separated subset of - `descriptions,embeddings,relationships`; default = all three (current behaviour). Plural because it takes - a *set*; `--stages relationships` and `--stages descriptions,embeddings` both read naturally, and the - plural signals "list expected" (singular `--stage` implies exactly one). **Validate** the names — an - unknown stage is an error, never silently ignored. 
-- **Per-stage `inputHash`.** Split the single coarse hash so each stage keys on *only its own* inputs: - - `descriptions` → `{snapshot, mode, providerIdentity}` (NOT relationship settings, NOT embedding model) - - `embeddings` → `{snapshot, embeddings model/provider, + the description text it embeds}` - - `relationships`→ `{snapshot, relationshipSettings (incl. llmProposals), providerIdentity}` - Then flipping `llmProposals` invalidates only `relationships`; swapping the embeddings model invalidates - only `embeddings`; improving description prompts/LLM invalidates only `descriptions`. -- **Preserve-others semantics.** Stages not named in `--stages` are left exactly as on disk (additive write, - already the behaviour). A selective run never deletes another stage's artifacts. -- **Downstream-staleness handling.** Stages have a dependency order (`descriptions → embeddings`; - `relationships` depends only on the schema snapshot). Re-running `descriptions` alone can leave existing - `embeddings` semantically stale (they embedded the old text). The run must **warn** when a selected - re-run leaves an unselected downstream stage stale, and the operator can opt to cascade - (`--stages descriptions,embeddings`). Do not silently leave a stale-but-unflagged downstream. -- **`relationships` uses existing descriptions as context.** When re-running `relationships` only, the - stage should read the existing enriched schema (incl. on-disk `ai:` descriptions) so `llmProposals` has - full context — not just raw column names. -- **Scope:** the three enrichment stages for now. Design the stage-name namespace so it can later extend to - the broader scan phases (schema / query-history / source / memory) and subsume the inconsistent - `--no-query-history` negative flag, but that unification is out of scope here. 
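
The flag validation and the downstream-staleness check from the design decisions above can be sketched as follows. This is a sketch under stated assumptions, not ktx's implementation: `parseStages`, `staleDownstream`, and `DOWNSTREAM` are hypothetical names, and rejecting an empty list is an assumption the spec does not state.

```typescript
const STAGES = ['descriptions', 'embeddings', 'relationships'] as const;
type Stage = (typeof STAGES)[number];

// Parse a comma-separated `--stages` value. Omitting the flag selects all stages.
function parseStages(flag: string | undefined): Stage[] {
  if (flag === undefined) return [...STAGES];
  const names = flag.split(',').map((s) => s.trim()).filter((s) => s !== '');
  // Assumption: an empty list is rejected rather than treated as "all".
  if (names.length === 0) throw new Error('--stages requires at least one stage');
  const selected = new Set<Stage>();
  for (const name of names) {
    const match = STAGES.find((s) => s === name);
    if (match === undefined) {
      // Unknown names are an error, never silently ignored.
      throw new Error(`Unknown stage "${name}"; expected one of ${STAGES.join(', ')}`);
    }
    selected.add(match);
  }
  // Return in canonical pipeline order, independent of the order typed.
  return STAGES.filter((s) => selected.has(s));
}

// The only dependency edge the spec names is descriptions -> embeddings.
const DOWNSTREAM: Record<Stage, Stage[]> = {
  descriptions: ['embeddings'],
  embeddings: [],
  relationships: [],
};

// Unselected stages that a selected re-run leaves stale and that need a warning.
function staleDownstream(selected: Stage[]): Stage[] {
  const stale = new Set<Stage>();
  for (const stage of selected) {
    for (const dependent of DOWNSTREAM[stage]) {
      if (!selected.includes(dependent)) stale.add(dependent);
    }
  }
  return STAGES.filter((s) => stale.has(s));
}
```

With this shape, `--stages descriptions` would yield a warning about `embeddings`, while `--stages descriptions,embeddings` would cascade and yield none.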
- -## Sketch (implementer to refine) - -- Add `--stages` to `ktx ingest`; parse+validate into a stage set; thread it to the enrichment entry so it - selects which stage blocks run (reuse the existing `mode`/stage gating — `mode: 'relationships'` is the - precedent). -- Replace the single `computeKtxScanEnrichmentInputHash` call with per-stage hash computation keyed on each - stage's own inputs; gate each stage's resume/skip on its own hash. -- Ensure selective runs read + preserve the on-disk enriched schema and write additively. -- Emit a clear staleness warning when an unselected downstream stage is invalidated by a selected one. - -## Generic use case (independent of the benchmark) - -Any team running ktx in production maintains its semantic layer over time: they improve description prompts -or switch the description LLM, upgrade the embeddings model, or turn on LLM-proposed joins. Today each of -those forces a **full re-enrich of every connection** — re-running the expensive per-table descriptions -even when only embeddings or relationships changed. Selective `--stages` re-runs makes these routine -maintenance operations cheap and targeted: "re-embed everything on the new model" or "backfill joins now -that llmProposals is on" become a single fast pass that leaves the untouched stages — and their cost — -alone. This is core operability for a long-lived ingestion product and is wholly independent of any -benchmark. - -## Benchmark context (motivation only — not a benchmark-specific rule) - -Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A level-aware audit found (a) a -tail of BigQuery dbs with poor *column*-description coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …) -that want a **`descriptions`-only** re-run with a longer timeout, and (b) a desire to **backfill joins** -across all already-ingested dbs after enabling `llmProposals` — without re-paying for descriptions. 
Both -were blocked by the coarse single `inputHash` (flipping `llmProposals` or re-describing would invalidate -the whole enrichment) and the absence of a stage-selective CLI flag. The benchmark just exercised -large-scale multi-backend ingestion; the gap and the fix are generic. diff --git a/spider2-specs/specs/01-connection-scoped-wiki.md b/spider2-specs/specs/01-connection-scoped-wiki.md deleted file mode 100644 index 1ffed215..00000000 --- a/spider2-specs/specs/01-connection-scoped-wiki.md +++ /dev/null @@ -1,300 +0,0 @@ -# Connection-scoped wiki pages - -> Refined spec. Intake draft: `todo/01-connection-scoped-wiki.md`. - -## Problem - -Wiki pages have only two scopes today: `GLOBAL` and `USER` -(`packages/cli/src/context/wiki/types.ts`, `WikiScope`). Scope is expressed by -directory (`wiki/global/.md`, `wiki/user//.md`) and the -search path filters by loading only the in-scope pages before any lane runs. -There is no way to associate a page with a **connection** (a warehouse/database -defined under `connections:` in `ktx.yaml`). - -In a project with many connections this causes two distinct failures: - -1. **Cross-database relevance pollution.** All pages share one search index, so - `wiki_search` for a generic term (`orders`, `revenue`, `average order - value`) surfaces pages written about the wrong database. Concept names - collide across databases constantly in real multi-connection projects - (several databases each with `orders`, `customers`, …). -2. **Silent overwrite on shared keys.** Page keys are a flat, global namespace. - The write path resolves a repeated key to the existing file and updates it - in place. So if the agent writes an `orders` page while ingesting database B - and an `orders` page already exists for database A, B's content **overwrites - A's** — same-concept pages for different databases cannot coexist today. 
- -Today, when `memory_ingest` is called with a `connectionId`, that id only -scopes which semantic-layer sources the triage agent can see -(`memory-agent.service.ts`); it is **not** persisted on the resulting wiki page -and **not** validated against `ktx.yaml`. - -## Generic use case - -Any org with multiple databases/warehouses in one **ktx** project: org-wide -definitions ("fiscal year starts in February") should be visible everywhere, -while database-specific conventions ("in the events DB, `user_id` is the -anonymous device id, not the account id") should not pollute searches about -other databases — and two databases that both have an `orders` concept must be -able to keep separate, non-colliding pages. - -## Model - -`connections` is **additive frontmatter metadata**, orthogonal to the existing -`GLOBAL`/`USER` directory scope — not a third scope dimension: - -- A page is still `GLOBAL` or `USER` and lives where it lives today. It may - **additionally** carry a `connections` list. -- **Page keys remain a flat, globally-unique namespace.** `connections` does - **not** namespace keys; a page is addressable by key alone, unchanged. -- A page may list **multiple** connections. -- **Absent or empty `connections` ⇒ unscoped: the page applies to all - connections.** This is exactly today's behavior, so every existing page is - unaffected. - -This keeps `wiki_read` and refs untouched and adds no parallel scope axis; -filtering by connection is purely a search/relevance concern. - -## Requirements - -### 1. Frontmatter field - -Add an optional `connections` field to wiki page frontmatter — a list of -connection ids. - -- Accept a single string too; normalize to a list at parse time (reuse the - existing array-coercion helper used for `tags`/`refs`/`sl_refs`). -- Round-trips through parse/serialize without loss. -- Absent or empty ⇒ unscoped (see Model). Existing pages are unaffected by - construction. - -### 2. 
Page identity and key distinctness - -`connections` does not change how pages are identified or addressed: - -- Keys stay flat and globally unique; `wiki_read(key)` is unchanged. -- Because the write path updates a page in place when its key already exists, - same-concept pages for different connections **MUST** use distinct keys - (e.g. `orders_sales_db` vs `orders_events_db`). Connection-distinctive keys - for database-specific pages are the primary mechanism (driven by write-path - prompt guidance, requirement 5). -- **Data-loss guard (code, not prompt):** a connection-scoped write whose key - matches an existing page whose `connections` scope is **disjoint** from the - incoming scope MUST surface a collision instead of silently overwriting the - existing page. (Updating a page within the same connection scope, or - broadening/narrowing its own `connections`, is a normal update — not a - collision.) The implementer owns whether the collision is a hard error or a - suffixed new key; it must not be a silent clobber. - -### 3. Search filtering - -Add an optional connection filter to the search surfaces: - -- **MCP:** `wiki_search(query, connectionId?)` (`context-tools.ts`). -- **CLI:** `ktx wiki search` and `ktx wiki list` accept `--connection ` - (with `-c` alias), matching the `ktx sql` connection flag. - -Semantics: - -- With `connectionId: X` ⇒ return pages whose `connections` is empty - (unscoped) **∪** pages whose `connections` contains X. -- Without ⇒ current behavior, all pages. -- The filter **MUST** apply uniformly to **all three search lanes** (lexical - FTS5, semantic/embedding, token fallback) at the **candidate-source level**, - so each lane draws its full candidate pool from the already-scoped set. It - **MUST NOT** be a post-filter on the merged/ranked results — that would let - off-scope candidates consume both the per-lane pool and the final result - limit unevenly. 
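The difference between candidate-source filtering and a post-filter can be illustrated with a small sketch. It assumes a simplified `Page` shape and a stand-in substring lane (`lexicalLane`); these, along with `search`, are hypothetical and only the `connections` semantics are taken from the spec text. The predicate body is likewise an assumption about how the unscoped-or-contains rule could be written.

```typescript
// Hypothetical page shape; only the `connections` semantics come from the spec.
interface Page {
  key: string;
  text: string;
  connections?: string[];
}

// Unscoped pages (absent or empty `connections`) match every connection.
function pageMatchesConnection(page: Page, connectionId?: string): boolean {
  if (connectionId === undefined) return true; // no filter: current behavior
  const scoped = page.connections ?? [];
  return scoped.length === 0 || scoped.includes(connectionId);
}

// Stand-in for one search lane: substring match, capped at `limit`.
function lexicalLane(candidates: Page[], query: string, limit: number): Page[] {
  return candidates.filter((p) => p.text.includes(query)).slice(0, limit);
}

function search(
  pages: Page[],
  query: string,
  limit: number,
  connectionId?: string,
): Page[] {
  // Scope first, so the lane's candidate pool and the result limit
  // only ever see in-scope pages.
  const candidates = pages.filter((p) => pageMatchesConnection(p, connectionId));
  return lexicalLane(candidates, query, limit);
}
```

With a limit of 1 and an off-scope page ranked first, a post-filter on the lane output would return nothing, while scoping the candidates first still returns the in-scope page. That is the uneven-consumption problem the requirement is guarding against.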
- -*Orientation:* the existing `GLOBAL`/`USER` scoping already filters at the -disk-load step that feeds both the in-memory token lane and the synced SQLite -index (`local-knowledge.ts`); the connection filter fits the same seam. - -### 4. Index persistence - -The `.ktx/db.sqlite` knowledge index is re-synced from files on every search. -The implementer owns whether to persist `connections` as index columns / a side -table, or to filter the loaded page-set before the per-search sync. The binding -requirement is the uniform-across-lanes behavior in requirement 3 — not a -specific schema. - -*Trade-off note (non-binding):* filtering the loaded page-set re-syncs only the -scoped subset and gives up a little embedding-cache reuse when searches -alternate between connections (recompute is one embedding per scoped page per -connection switch — negligible at the scale this targets). Persisting -`connections` in the index avoids that at the cost of a schema addition and a -per-lane predicate. Either is acceptable. - -### 5. Write path - -- The memory agent's page-write tool (`wiki-write.tool.ts`) accepts a - `connections` input field with the same REPLACE semantics as - `tags`/`refs`/`sl_refs`: omit ⇒ keep existing on update; `[]` ⇒ clear to - unscoped; `[ids]` ⇒ set. -- When `memory_ingest` / the memory agent runs with a `connectionId`, prompt - guidance directs the agent to: - - set `connections: [connectionId]` on new **database-specific** pages, using - connection-distinctive keys; and - - leave `connections` empty for clearly **org-wide** content. -- This is **prompt guidance, not a code auto-default.** A connection-scoped - ingest must remain able to produce unscoped org-wide pages, so the tool must - not force the session's `connectionId` onto every page. - -### 6. `wiki_read` and refs unchanged - -Pages remain addressable by key regardless of scoping. `wiki_read`, `refs`, and -`sl_refs` semantics are unchanged; `connections` is a search/relevance concern -only. 
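As a rough illustration of the write-path rules in requirements 1, 2, and 5, the normalization, REPLACE semantics, and disjoint-scope guard might look like the sketch below. The function names are hypothetical, and treating an empty incoming scope as a normal update (never a collision) is one reading of the data-loss guard, not something the spec states in these exact terms.

```typescript
// Accept a single string or a list and always return a list (requirement 1).
// `undefined` is preserved so callers can tell "omitted" from "cleared".
function normalizeConnections(
  input: string | string[] | undefined,
): string[] | undefined {
  if (input === undefined) return undefined;
  return typeof input === "string" ? [input] : [...input];
}

// REPLACE semantics (requirement 5):
// omitted => keep existing, [] => clear to unscoped, [ids] => set.
function applyConnectionsUpdate(
  existing: string[],
  incoming?: string | string[],
): string[] {
  const next = normalizeConnections(incoming);
  return next === undefined ? existing : next;
}

// Data-loss guard (requirement 2), under the assumption that a collision
// needs both scopes to be non-empty and to share no connection id.
function isDisjointCollision(existing: string[], incoming: string[]): boolean {
  if (existing.length === 0 || incoming.length === 0) return false;
  return !incoming.some((id) => existing.includes(id));
}
```

In this reading, writing `connections: ["events_db"]` to a key whose existing page lists only `sales_db` is flagged, while overlapping, broadening, or clearing updates pass through as ordinary updates.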
- -### 7. Validation - -Validation behavior splits by surface, because an explicit argument is a -typo-prone input while persisted content drifts independently of config: - -- **Explicit argument** — a connection id supplied as a command/tool argument - (`wiki_search`/`memory_ingest` `connectionId`, `ktx wiki … --connection`) - MUST be validated against `ktx.yaml` connections and **rejected with a clear - error listing the configured ids** when unknown. Reuse the canonical - `project.config.connections[id]` check. This also closes the current gap - where `memory_ingest`'s `connectionId` is accepted unvalidated. -- **Persisted frontmatter** — a connection id that appears only in a stored - page's `connections` and is not in `ktx.yaml` MUST **warn (not fail)** during - validation/doctor, and MUST NOT break loading, searching, or reading that - page. Config and content can evolve independently. - -### 8. Scope boundary - -This spec delivers the **mechanism** (frontmatter storage + uniform filter + -write surface + validation). Driving the agent to actually pass `connectionId` -during analytics work is the concern of -`03-multi-connection-routing-in-analytics-skill`. It composes with the -`--connection` flag on `ktx ingest` from `02-verbatim-ingest-mode`. - -## Acceptance criteria - -- A page with `connections: [db_a]` is returned by - `wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but - **not** by `wiki_search(query, connectionId: "db_b")`. -- A page with no `connections` field is returned in all three cases above. -- Two pages — `orders_sales_db` (`connections: [sales_db]`) and - `orders_events_db` (`connections: [events_db]`) — coexist; a search scoped to - `sales_db` returns the first and not the second, and neither overwrote the - other on write. -- A connection-scoped write whose key matches an existing page scoped to a - **different** connection surfaces a collision instead of silently - overwriting (data-loss guard, requirement 2). 
-- Filtering works in each lane independently (test with embeddings disabled to - exercise the lexical and token lanes alone). -- `memory_ingest(content, connectionId)` produces a page scoped to that - connection for database-specific content. -- `wiki_search`/`ktx wiki search --connection ` fails with an error - that lists the configured connection ids. -- A page whose `connections` references an id absent from `ktx.yaml` produces a - warning but stays searchable and readable; search and read do not throw. -- `connections` accepts a single string and a list, both normalized to a list. -- Existing projects with no scoped pages and no `connectionId`/`--connection` - behave identically before/after. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the design. - -- **Frontmatter type + parse/serialize:** `wiki/types.ts` (`WikiFrontmatter`), - `wiki/knowledge-wiki.service.ts` (`parsePage`/`serializePage`), array - coercion `wiki/local-knowledge.ts` (`stringArray`). -- **Search lanes + per-search re-sync:** `wiki/local-knowledge.ts` - (`searchLocalKnowledgePagesWithSqlite`; the disk-load step that already - scopes `GLOBAL`/`USER`; token lane), `wiki/sqlite-knowledge-index.ts` - (FTS5 `knowledge_pages_fts` lexical lane, semantic scan, `sync`). -- **MCP surface:** `mcp/context-tools.ts` (`wiki_search`, `wiki_read`, - `memory_ingest`; `connectionId` already present on `memory_ingest` but - unvalidated). -- **CLI surface:** `commands/knowledge-commands.ts` - (`ktx wiki search`/`list`/`read`); canonical `--connection` flag in - `commands/sql-commands.ts`; validation pattern - `project.config.connections[id]` in `mcp/local-project-ports.ts`. -- **Write path:** `wiki/tools/wiki-write.tool.ts` (input schema, REPLACE - semantics, scope decision), `memory/memory-agent.service.ts` (`connectionId` - threaded through the capture session and tool session; - `external_ingest` forces `GLOBAL` scope). 
-- **Connection config:** `context/project/config.ts` (`connections` record in - `ktx.yaml`). - -## Benchmark context (motivation only) - -Spider 2.0-Lite local subset = one project with ~30 SQLite connections whose -schemas share table/concept names (Northwind, sakila, two e-commerce DBs…). -External-knowledge docs (RFM definition, F1 overtake rules) are each relevant -to exactly one database and must not surface for the other 29. - -## Implementation notes - -Shipped on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All -acceptance criteria covered; full package suite green (2924 passing), -type-check, knip/biome dead-code, and pre-commit clean. - -**What was built / where** - -1. **Frontmatter field (req 1).** `connections?: string[]` added to - `WikiFrontmatter` (`context/wiki/types.ts`) and to the file-layer page model - `LocalKnowledgePage` (`context/wiki/local-knowledge.ts`). Parsed via a new - `stringList()` coercion (single string → list); round-trips through both - serializers. Absent/empty ⇒ unscoped. -2. **Search/list filter (req 3, req 4).** `connectionId?` threaded through - `searchLocalKnowledgePages` → both the sqlite-FTS and scan impls → - `loadAllKnowledgePages`, and through `listLocalKnowledgePages`. The filter is - applied at the **disk-load seam** (`pageMatchesConnection`: unscoped ∪ pages - listing the id), so the token lane and the per-search SQLite sync (lexical + - semantic) both draw their candidate pool from the already-scoped set — - candidate-source level, not a post-filter. - - Chose req 4 **option B (filter the loaded page-set)** over persisting a - column. Verified-safe here: standalone ktx's memory agent reads pages from - files via a no-op `LocalKnowledgeIndex`, so `.ktx/db.sqlite`'s - `knowledge_pages` is a per-search cache that `searchLocalKnowledgePages` - rebuilds every call — scoping the sync corrupts no shared state. 
Only cost - is one embedding recompute per scoped page on a connection switch (the - spec's acknowledged, negligible trade-off). No index-schema change. -3. **Page identity + data-loss guard (req 2).** Keys stay flat/global; - `wiki_read`/refs unchanged. The write tool (`wiki/tools/wiki-write.tool.ts`) - rejects (hard error, no silent clobber) a connection-scoped write whose - incoming `connections` is **disjoint** from a same-key existing page's - non-empty `connections`, suggesting a connection-distinctive key. Same-scope, - overlapping, broaden/narrow, and unscoped-existing updates are allowed. - Chose a hard error over auto-suffixing so the conflict reaches the agent - (the decision-maker) instead of silently forking the key namespace. -4. **Write path (req 5).** `wiki_write` accepts `connections` (string or list) - with REPLACE semantics (omit ⇒ keep, `[]` ⇒ unscoped, `[ids]` ⇒ set); no - code auto-default of the session connection. Prompt guidance added to the - shared `wiki_capture` skill (new "Connection scoping" section) and the - `memory_agent_external_ingest` prompt. The session `connectionId` is now - surfaced to the agent so the guidance is actionable: in the memory-agent - prompt header and in the ingest work-unit `` block - (`build-wu-context.ts`, fed from `ingest-bundle.runner.ts`). -5. **Validation (req 7).** New shared helper - `context/connections/configured-connections.ts → assertConfiguredConnectionId` - validates explicit connection-id arguments against `ktx.yaml` and throws an - error listing the configured ids. Routed from all three explicit-arg - surfaces: MCP `wiki_search` (`local-project-ports.ts`), MCP `memory_ingest` - (validated at the boundary in `mcp-server-factory.ts` — this also closes the - prior gap where `memory_ingest`'s `connectionId` was accepted unvalidated), - and CLI `ktx wiki --connection`/`-c` (`commands/knowledge-commands.ts` + - `knowledge.ts`). 
Persisted-frontmatter ids absent from config are **warn-only**: - `listReferencedConnectionIds` + a non-fatal `ktx status` warning - (`status-project.ts`); loading/searching/reading never throw on them. - -**Deviations / notes** - -- Req 1 says "reuse the existing array-coercion helper used for `tags`/`refs`". - That helper (`stringArray`) is array-only and does **not** coerce a single - string; added a dedicated `stringList` for `connections` to meet the - single-string acceptance criterion rather than change `stringArray`'s - behavior for the other fields. -- **Scope boundary kept:** `discover_data` (MCP) also searches wiki and already - takes `connectionId`, but req 3/8 scope the filter to `wiki_search` + CLI, so - its wiki lane is intentionally left unscoped. Worth a follow-up if - `discover_data`'s wiki results should also be connection-scoped for - consistency. -- MCP tools-list snapshot and the `mcp-server-factory` test were updated for the - new `wiki_search.connectionId` param and the `memory_ingest` validation - wrapper (the port is no longer the raw service object; it delegates). diff --git a/spider2-specs/specs/02-verbatim-ingest-mode.md b/spider2-specs/specs/02-verbatim-ingest-mode.md deleted file mode 100644 index a16645d8..00000000 --- a/spider2-specs/specs/02-verbatim-ingest-mode.md +++ /dev/null @@ -1,327 +0,0 @@ -# Verbatim ingest mode for authoritative documents - -> Refined spec. Intake draft: `todo/02-verbatim-ingest-mode.md`. - -## Problem - -`ktx ingest --text/--file` routes captured content through the memory agent. -`runKtxTextIngest` (`packages/cli/src/text-ingest.ts`) builds a -`MemoryAgentInput` with `sourceType: 'external_ingest'` and hands it to -`MemoryAgentService.ingest` (`context/memory/memory-agent.service.ts`), which -runs a multi-step LLM triage loop (≈30-step budget, content clipped to ~48k -chars) inside a session worktree. 
The agent decides — via the `wiki_write` -tool — what to persist, so it may **rewrite, condense, split, or re-title** the -content before it lands as a wiki page. The body is produced by an LLM, not -copied by code. - -For *authoritative* documents — formula definitions, metric specs, runbooks, -compliance text — paraphrasing is a defect, not a feature: - -- exact thresholds, constants, and rule wording must survive unchanged; -- lexical (BM25/FTS5) search works best when the stored text matches the - phrasing users and agents query with; -- ingestion should be deterministic and reproducible — the same input file - yields the same page, and re-running is safe. - -Two further gaps block authoritative ingest today: - -- The memory agent hard-requires an LLM backend - (`context/memory/local-memory.ts` throws when `llm.provider.backend: none` - and no runner is injected), so there is **no** offline ingest path at all. -- The agent's write tool *merges* a repeated same-scope key in place (REPLACE - frontmatter semantics in `wiki/tools/wiki-write.tool.ts`), i.e. exactly the - silent in-place rewrite an authoritative-document workflow must avoid. - -## Generic use case - -Any team ingesting documents that are already the source of truth: metric -definition sheets, SLA documents, calculation-methodology docs, regulatory -text. The user wants **ktx** to *index and surface* the document, not to -re-author it. Today they work around the memory agent by hand-writing -frontmatter and copying files into `wiki/global/`; verbatim mode makes that a -first-class, supported `ktx ingest` workflow. - -## Model - -`ktx ingest --verbatim` is a **distinct, code-driven ingest path**, not a -constrained prompt over the existing agent loop. Its defining invariants: - -- **The stored page body is the input document body, written by code.** The LLM - never produces, edits, or relays the body. It is confined to generating - *metadata* about the body. 
-- **Behavior follows from inputs, not from a mode prompt.** Whether metadata is - LLM-generated or derived offline follows from the configured backend - (`llm.provider.backend`), not from a second user-facing switch. -- **Pages are `GLOBAL`-scoped.** Verbatim ingest targets org/project - authoritative docs (the content teams copy into `wiki/global/` today). - Connection association is expressed by the **additive `connections` - frontmatter** from spec 01, never by directory. -- **Deterministic and idempotent.** The page key, the merged frontmatter, and - the stored body are all functions of the input alone (given a fixed backend), - so the same input produces the same page and a re-run is a safe no-op. - -### "Byte-for-byte" scope - -The guarantee is on the document's **interior**: no paraphrase, no condense, no -split, no re-title, no reflow, **no clipping**. The shared wiki store -canonicalizes *surrounding* whitespace — `parsePage` trims the body and -`serializePage` emits a single trailing newline -(`wiki/knowledge-wiki.service.ts`) — so leading/trailing blank lines are -normalized by the storage layer. Verbatim mode **MUST** write through that -shared `writePage`/`serializePage` path rather than fork a parallel serializer; -the interior bytes (thresholds, constants, wording) are what must be preserved -exactly, and they are. Acceptance hashes compare the stored body against the -**trimmed** input body. - -## Requirements - -### 1. Flag - -`ktx ingest --file --verbatim` and `ktx ingest --text ---verbatim`. `--verbatim` is a boolean that applies to every `--file`/`--text` -item in the invocation; each item becomes its own page. - -- It composes with the existing `--connection-id ` flag - (`commands/ingest-commands.ts`) so the resulting page can be - connection-scoped (see spec 01). **Note:** the intake draft wrote - `--connection`; the shipped flag is `--connection-id`. Use `--connection-id`. -- No new `--key` flag (see requirement 4). 
No second behavioral switch beyond - `--verbatim` itself. - -### 2. Body preservation is enforced by code, not by prompt - -The stored page body is the input content (interior preserved exactly, per -**Model → "Byte-for-byte" scope**). - -- Verbatim mode **MUST NOT** route the body through the memory-agent LLM loop - or any `wiki_write` tool call where a model could alter it. -- The LLM, when used, generates **only** metadata: `summary`, `tags`, and - `sl_refs`. A single constrained structured-output call (AI SDK v6 - `generateObject` with a `zod` schema) is the intended mechanism — the full - memory-agent loop, worktree, and squash-merge are **not** required and should - not be used. -- The page key is **not** LLM-generated (requirement 4). - -### 3. No clipping of the stored body - -The ~48k clip may apply only to the text **sent to the LLM** for metadata -generation. It **MUST NOT** apply to the text **written** to the page. A -document larger than the clip limit is stored in full; only its metadata is -derived from the clipped prefix. - -### 4. Deterministic page key - -The key is derived from the input, never chosen by the LLM (an LLM-chosen slug -would break determinism and the requirement-6 idempotency guarantee): - -- **`--file `** → `suggestFlatWikiKey(basename without extension)` - (`wiki/keys.ts`). This is the primary document case and is always - deterministic. -- **`--text `** → if the content opens with a Markdown heading, the - key is `suggestFlatWikiKey(heading text)`. If there is no leading heading, - **hard error**: inline verbatim text needs a leading heading to derive a - stable key, or should be passed as `--file`. -- No hash-based keys (unfindable) and no `--key` override flag. A real need for - explicit key control can add `--key` later. - -### 5. Frontmatter: passthrough + gap-fill - -If the input has its own YAML frontmatter, split it from the body: the body is -everything after the closing `---`; the frontmatter is authoritative metadata. 
- -- **Passthrough.** Every input frontmatter field is preserved in the stored - page, **including fields not in `WikiFrontmatter`** (`effective_date`, - `version`, `owner`, …). The serializer `YAML.stringify`s the object, so - unknown keys round-trip. Dropping them would be silent data loss on - authoritative docs. -- **Gap-fill only.** Generated/derived metadata fills **absent** fields only; - it **MUST NOT** overwrite an explicit value. An input `summary:` is never - replaced by a generated one; explicit `tags`/`sl_refs` are likewise kept. -- **Defaults.** `usage_mode` defaults to `auto` (findable via search, not - force-injected) when the input does not set it. -- **Connection scoping.** `--connection-id X` (validated via - `assertConfiguredConnectionId`, `context/connections/configured-connections.ts`) - sets `connections: [X]` when the input frontmatter does not already declare - `connections`. If the input frontmatter declares a **different** - `connections` than the flag, **hard error** (ambiguous intent) rather than - silently choosing one. If they match, or only one source is present, proceed. - -### 6. Degraded mode (`llm.provider.backend: none`) - -`--verbatim` **MUST** work with no LLM backend — this is its capability the -regular agent ingest lacks. - -- `summary` is derived from the leading Markdown heading text, or, if none, the - first non-empty sentence of the body (trimmed to a reasonable length). -- `tags` and `sl_refs` are left empty. -- The body is still stored in full (requirement 3 applies unchanged). - -### 7. Key collisions: idempotent-if-identical, else hard error - -Verbatim mode does **not** reuse the agent write tool's in-place merge. Before -writing, read any existing `GLOBAL` page at the derived key: - -- **No existing page** → write. -- **Existing page, stored body identical** to the new body (compared after the - storage-layer normalization in **Model**) → **idempotent no-op success** - (re-running the same file is safe). 
-- **Existing page, body differs** → **hard error** naming the conflicting key - and directing the user to a distinct key. Never a silent overwrite, never an - auto-suffixed second page (which would produce the duplicated/divergent pages - this mode must avoid). - -### 8. LLM-failure handling - -When a backend **is** configured but the metadata call fails (rate limit, -transport error, malformed output after retries), **fail the item** (honoring -`--fail-fast` and the per-item exit-code aggregation in `text-ingest.ts`). -**MUST NOT** silently fall back to degraded derivation: a degraded page written -on a transient error would, under requirement 7, refuse to be replaced by a -healthy re-run — breaking reproducibility. Degraded derivation is reserved for -`backend: none`. - -### 9. Findability - -After write, the page is reindexed so search returns it: - -- `wiki_search` for a phrase taken from the document body returns the page via - the lexical lane (the body is indexed in `buildKnowledgeSearchText`). -- `wiki_search` for a paraphrase of the document's topic returns it via the - semantic lane **when embeddings are enabled** (this is what the generated - `summary`/`tags` buy over a bare degraded page). - -## Acceptance criteria - -- Ingesting a file with `--verbatim` produces a page whose body is - byte-identical to the trimmed input body (assert with a hash in tests). -- A >48k-char file is stored in full (assert stored body length ≥ input length - minus trim). -- Running the same `--verbatim` ingest twice is idempotent: one page, identical - bytes both times, no error on the second run. -- A second ingest to the same derived key with **different** body content fails - loudly (requirement 7) and does not modify the existing page or create a - suffixed one. -- Input frontmatter with an unknown field (e.g. `effective_date`) is preserved - in the stored page; an explicit input `summary` is **not** overwritten by a - generated one. 
-- With `llm.provider.backend: none`, `--verbatim` still produces a page: full - body stored, `summary` derived from the heading/first sentence, `tags` and - `sl_refs` empty. -- `--verbatim --connection-id X` yields a page with `connections: [X]`; an - unknown id is rejected with an error listing the configured ids. (Depends on - spec 01, now shipped.) -- `--verbatim --connection-id X` where the input frontmatter already declares a - different `connections` fails with an ambiguity error. -- `ktx ingest --text "no heading here" --verbatim` errors asking for a leading - heading or `--file`. -- `wiki_search` for a body phrase returns the page (lexical lane); for a topic - paraphrase it returns the page when embeddings are enabled (semantic lane). - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -module layout and design, subject to the invariants above. - -- **Command flag:** `commands/ingest-commands.ts` (`ktx ingest` option table; - `--text`/`--file`/`--connection-id`/`--fail-fast` already present — add - `--verbatim` and thread it into `KtxTextIngestArgs`). -- **Orchestration:** `text-ingest.ts` (`runKtxTextIngest`, `loadItems`, - `validateItems`, per-item loop and exit-code aggregation). The verbatim flow - reuses item loading and replaces the `memoryIngest.ingest(...)` call with a - code-driven write for `--verbatim` items. Keep the new logic in a focused - module (e.g. a `verbatim-ingest` sibling) rather than swelling `text-ingest`. -- **Frontmatter split / write / serialize:** `wiki/knowledge-wiki.service.ts` - (`parsePage` for the `---…---` split shape, `serializePage`, `writePage`, - `readPage` for the collision check). Write through this shared path — do not - re-implement YAML framing. -- **Key derivation:** `wiki/keys.ts` (`suggestFlatWikiKey`, `assertFlatWikiKey`). 
-- **Frontmatter type:** `wiki/types.ts` (`WikiFrontmatter`; `summary` and - `usage_mode` are the required fields; unknown passthrough fields live - alongside). -- **Connection validation:** `context/connections/configured-connections.ts` - (`assertConfiguredConnectionId`, shipped with spec 01). -- **Metadata LLM call:** the local LLM runtime/config resolution in - `context/llm/` (e.g. `local-config.ts`; `backend: none` ⇒ no runtime). Use a - single `generateObject` call with a `zod` metadata schema; the `ai-sdk` skill - covers v6 patterns. -- **Reindex / search lanes:** `wiki/local-knowledge.ts` - (`loadAllKnowledgePages`, `buildKnowledgeSearchText`, the lexical/token/ - semantic lanes) and `wiki/sqlite-knowledge-index.ts` (`sync`). -- **Tests:** extend `packages/cli/test/text-ingest.test.ts` and add a - verbatim-focused test file covering the acceptance criteria above. - -## Benchmark context (motivation only) - -Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket -definitions, the haversine formula, F1 overtake rules, …). Gold SQL was -authored against their **exact** text; an LLM paraphrase that drops a bucket -boundary or rounds a constant loses the corresponding question. The current -workaround is hand-writing frontmatter and copying files into `wiki/global/`. -Verbatim mode turns that manual step into a supported **ktx** workflow, and -composes with the connection scoping from spec 01 so a doc relevant to exactly -one of the benchmark's ~30 SQLite databases does not surface for the other 29. - -## Implementation notes - -Shipped on branch `write-feature-spec-wiki`. All acceptance criteria are covered -by tests and verified end-to-end through the linked `ktx-dev` binary. 
- -**What was built** - -- New module `packages/cli/src/verbatim-ingest.ts`: `createLocalProjectVerbatimIngestor` - + `LocalVerbatimIngestor`, plus the pure helpers `splitInputDocument`, - `deriveVerbatimPageKey`, `deriveDegradedSummary`, and `buildVerbatimFrontmatter` - (the last four are `@internal` exports for unit testing). -- `--verbatim` flag added to `ktx ingest` in `commands/ingest-commands.ts`, with a - guard that rejects `--verbatim` without `--text`/`--file`. The flag is threaded - into `KtxTextIngestArgs.verbatim`. -- `text-ingest.ts` now tags each loaded item with an `origin` - (`file` / `text` / `stdin`) and, when `verbatim` is set, constructs the verbatim - ingestor once and branches the per-item loop to a code-driven write instead of - `memoryIngest.ingest(...)`. The shared view, exit-code aggregation, and - `--fail-fast` handling are reused. - -**Deviations from the literal spec (design refinements, per "implementer owns the design")** - -- *Metadata call.* The spec suggested raw AI SDK v6 `generateObject`. The - implementation routes through the existing `KtxLlmRuntimePort.generateObject` - instead — it is implemented by all three backends (ai-sdk, claude-code, codex), - and the ai-sdk one already wraps `generateText` + `Output.object({schema})`. - This realizes the spec's "single constrained structured-output call" intent via - the canonical cross-backend path rather than forking a second LLM entry point. -- *Reindex (requirement 9).* In the standalone CLI, `searchLocalKnowledgePages` - rebuilds the SQLite index from disk on every call (recomputing embeddings for - changed pages), so a written page is findable without a dedicated reindex step. - The write still goes through the shared `KnowledgeWikiService.writePage` + - `syncSinglePage` path, so the page is also eagerly indexed. 
-- *Gap-fill optimization.* The LLM is skipped entirely when the input frontmatter - already supplies `summary`, `tags`, and `sl_refs` (generated metadata only fills - absent fields, so there is nothing to generate). A fully specified document thus - ingests with a configured backend without any LLM call. - -**Tests** - -- `packages/cli/test/verbatim-ingest.test.ts` — helper units + ingestor integration - against a real `initKtxProject` git repo (byte-identical body hash, >48k no-clip, - idempotency, conflict hard-error, frontmatter passthrough, explicit-summary - preservation, degraded mode, connection scoping + unknown-id rejection + - ambiguity error, no-heading inline error, LLM gap-fill, LLM-failure-fails-item, - lexical + semantic findability). -- `packages/cli/test/text-ingest.test.ts` — verbatim routing, origin tagging, - connection-id forwarding, fail-fast. -- `packages/cli/test/index.test.ts` — `--verbatim` flag threading and the - requires-`--text`/`--file` guard. - -**Docs** - -- `docs-site/content/docs/cli-reference/ktx-ingest.mdx` (flag, "Verbatim ingest" - section, examples, common errors) and - `docs-site/content/docs/guides/writing-context.mdx` (authoritative-document - workflow). - -**Verification** - -- Full CLI suite: 2959 passed, 1 skipped. `pnpm run build` and `pnpm run dead-code` - (Biome + Knip default + production) clean; pre-commit clean on changed files. - A pre-existing, unrelated type error in `test/mcp-server-factory.test.ts` is - untouched — it predates this work. diff --git a/spider2-specs/specs/06-scan-tolerate-broken-objects.md b/spider2-specs/specs/06-scan-tolerate-broken-objects.md deleted file mode 100644 index e64d87ef..00000000 --- a/spider2-specs/specs/06-scan-tolerate-broken-objects.md +++ /dev/null @@ -1,361 +0,0 @@ -# Schema scan tolerates individual objects that fail introspection - -> Refined spec. Intake draft: `todo/06-scan-tolerate-broken-objects.md`. 
- -## Problem - -A single broken or inaccessible object zeroes out an entire connection's -context. Schema introspection iterates objects with no per-object error -handling, so one throw aborts the whole scan, the live-database adapter's -`fetch()` rejects, and the connection ends with **no semantic layer at all** — -even when every other object was healthy. - -The failure surfaces in two phases, and the contract must hold in both: - -- **Metadata read (sqlite).** `connectors/sqlite/connector.ts` does - `rawTables.map((t) => this.readTable(...))` (≈ line 171) with no try/catch. - `readTable` runs `PRAGMA table_info()`, which *executes* a view's - body to resolve its columns — so a view over a dropped/renamed column (the - `oracle_sql` case: `emp_hire_periods_with_name` selecting `ehp.start_date` - from a base table that has no such column) raises `no such column: - ehp.start_date` and aborts introspection of all ~48 healthy objects. -- **Profiling read (warehouse drivers).** postgres/mysql/clickhouse/sqlserver/ - bigquery/snowflake read metadata in bulk from catalog / `information_schema` - (a broken view rarely breaks that), then fail when a per-object profiling or - sampling `SELECT` runs against a broken object. Enrichment sampling is - *already* isolated (`description-generation.ts` wraps `sampleTable` in - try/catch → `sampling_failed`), but mandatory introspection-phase reads are - not uniformly isolated across drivers. - -A second, related defect blocks the documented escape hatch. Setting -`enabled_tables: ["main.customers"]` on a sqlite connection produces a -different hard failure — `Adapter "database schema" did not recognize fetched -source output`. 
Root cause: the sqlite connector emits every object as -`{ db: null }` and filters the scope with `scopedTableNames(scope, { db: null })` -(`context/scan/table-ref.ts` ≈ line 47, `if (ref.db !== wantDb) continue`), but -`"main.customers"` parses to `{ db: "main", name: "customers" }` -(`context/scan/enabled-tables.ts`, `parseDottedTableEntry`). `"main" !== null`, -so the entry matches **nothing**, zero table files are written, and -`detectLiveDatabaseStagedDir` (`stage.ts` ≈ line 138) returns false, tripping -the generic "did not recognize fetched source output" error at -`context/ingest/local-stage-ingest.ts` (≈ line 291). The bare form -`enabled_tables: ["customers"]` would have worked; the `main.`-qualified form -silently matches nothing. - -## Generic use case - -Real warehouses routinely contain broken or inaccessible objects: views over -dropped/renamed columns, views referencing tables the connection role can't -read, permission-denied tables, and vendor system views that error on read. -**ktx** should ingest everything it *can* and skip what it can't, so one bad -object never zeroes out an entire connection's context. This is baseline -production robustness, independent of any benchmark — the same tolerance a -33-warehouse fleet needs the first time one of its databases has a stale view. - -## Design - -The unit of failure is **one object** (table or view). Introspecting or -profiling an object is an operation that can fail independently; a failure skips -that object, records a recoverable warning, and the scan continues from the -objects that succeeded. 
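
The "one object is the unit of failure" rule above can be sketched as follows. This is a minimal illustration, not the shipped helper: `ScanWarning`, `tryReadObject`, and `scanObjects` are hypothetical names, and the warning shape only approximates the `KtxScanWarning` fields the spec mentions.

```typescript
// Hypothetical warning shape, loosely modelled on the fields the spec names
// (object identifier, error message, recoverable flag). Not the real KtxScanWarning.
type ScanWarning = {
  code: "object_introspection_failed";
  table: string;
  message: string;
  recoverable: true;
};

type ObjectResult<T> =
  | { ok: true; table: T }
  | { ok: false; warning: ScanWarning };

// Wrap a single object read so that a throw becomes a warning
// instead of aborting the whole scan.
function tryReadObject<T>(name: string, read: () => T): ObjectResult<T> {
  try {
    return { ok: true, table: read() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return {
      ok: false,
      warning: {
        code: "object_introspection_failed",
        table: name,
        message,
        recoverable: true,
      },
    };
  }
}

// Scan every object: keep the ones that succeed, collect warnings for the rest.
function scanObjects<T>(
  names: string[],
  read: (name: string) => T,
): { tables: T[]; warnings: ScanWarning[] } {
  const tables: T[] = [];
  const warnings: ScanWarning[] = [];
  for (const name of names) {
    const result = tryReadObject(name, () => read(name));
    if (result.ok) tables.push(result.table);
    else warnings.push(result.warning);
  }
  return { tables, warnings };
}
```

With two healthy objects and one that throws, `scanObjects` would return two tables and one warning naming the broken object, so the caller can decide success or failure from the counts instead of from the first exception.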
- -Because seven Node connectors and the Python daemon each introspect differently -(sqlite reads metadata per-object via `PRAGMA`; warehouse drivers read metadata -in bulk and fail per-object during profiling), the **semantics** of "skip / -warn / total-failure" are defined **once** and every connector routes through -them — rather than seven copies of the same try/catch that drift apart: - -- A shared per-object helper in the `scan/` layer — the sibling of the existing - `tryConstraintQuery` (`context/scan/constraint-discovery.ts`) — wraps a single - object read and returns `{ ok: true, table } | { ok: false, warning }`, with a - standard warning code (e.g. `object_introspection_failed`). -- A shared post-check enforces the total-failure rule (R3) uniformly. -- Each connector keeps its **natural** shape: sqlite routes each `readTable` - through the helper; bulk-read drivers route their per-object profiling reads - through it. The contract is uniform; the loop is not forced to be. -- The Python daemon implements the **same contract** in its own helper, adds a - `warnings` field to `DatabaseIntrospectionResponse`, and the Node adapter maps - those warnings into `KtxSchemaSnapshot` (`daemon-introspection.ts`). - -The warning channel already exists end to end on the Node side -(`KtxSchemaSnapshot.warnings`, the `KtxScanWarning` shape with `table`/`column`/ -`recoverable`, the `KtxScanWarningCode` enum, and the staged `warnings.json` -artifact written by `writeLiveDatabaseSnapshot`); sqlite simply never populates -it. This spec makes that channel carry object-skip warnings and surfaces them in -the ingest summary, the persisted report body, and `ktx status`. - -## Requirements - -### R1 — Per-object isolation (the contract) - -If introspecting or profiling one object throws, the scan **MUST** skip that -object, record a `KtxScanWarning` (object name, the error message, and any -schema/catalog qualifier; `recoverable: true`), and continue with the remaining -objects. 
No single object may abort the scan. - -- The contract holds in **both** phases: the mandatory metadata read *and* any - profiling/row-count/sample read performed during introspection. -- It holds for **all seven Node connectors** - (`packages/cli/src/connectors//`) and the **Python daemon** postgres - path (R6). -- The semantics are defined once (the shared helper + warning code from the - Design section) and every connector routes through them. Do not inline a - divergent per-driver copy. -- Warnings **MUST NOT** carry secrets or full SQL bodies; record the object - identifier and the database's error text, redacted through the existing - `redactKtxSensitiveMetadata` path that `warnings.json` already uses. - -### R2 — Surface, don't hide - -Skipped objects **MUST** be reported both at ingest time and in the durable -status view: - -- **Ingest summary.** The `ktx ingest` run summary (human-facing output) reports - a count plus the object name and a short reason for each skip — e.g. - `Skipped 1 object — emp_hire_periods_with_name: no such column ehp.start_date`. -- **Run report.** Object skips land in the run report's `warnings.json` artifact - (already written) and in the persisted report body (`IngestReportBody`), whose - natural home is the existing `fetch?: SourceFetchReport` field — the fetch - phase *is* introspection. -- **`ktx status`.** `ktx status` shows a per-connection skipped-objects line for - the connection's latest ingest — e.g. `oracle_sql: 1 object skipped — - emp_hire_periods_with_name: no such column ehp.start_date`. This is **derived - from the latest persisted report, not new persisted state**: the report body - is already stored whole as a JSON blob (`local_ingest_reports.body_json`), so - surfacing it requires **no `.ktx/db.sqlite` schema migration** — `status` - reads and renders the skip info already present in the latest report body. A - connection whose latest ingest skipped nothing shows no such line. 
- -### R3 — Failure semantics (partial vs total) - -Per-object skipping is **unconditional** — there is **no new config knob**, and -the existing `ingest.workUnits.failureMode` (which governs the later LLM -work-unit stage, not introspection) is untouched and orthogonal. Outcomes are -derived from object counts, not from a mode: - -| Scope | Objects discovered / matched | Introspection outcome | Result | -| --- | --- | --- | --- | -| none | 0 | n/a (legitimately empty DB) | **success**, empty layer | -| none | N > 0 | ≥ 1 succeeds | **success** + warnings for the rest | -| none | N > 0 | all N fail | **connection failure** (clear error) | -| `enabled_tables` | matches 0 objects | n/a | **clear scope error** (R5) | -| `enabled_tables` | matches M > 0 | ≥ 1 succeeds | **success** + warnings | -| `enabled_tables` | matches M > 0 | all M fail | **connection failure** | - -- "Connection failure" means the connector / `fetch()` raises a **clear, - actionable error** for that connection. It **MUST NOT** surface as the generic - `did not recognize fetched source output` (that message is reserved for a - genuinely unrecognized staged dir, not an empty/total-failure result). -- A total failure of one connection follows existing per-connection ingest - orchestration for whether sibling connections continue; this spec does not - change cross-connection behavior. - -### R4 — A broken view never blocks base tables - -A broken view **MUST NEVER** prevent base-table ingest. - -- View introspection failures are isolated exactly like any other object (R1). -- Mandatory introspection **MUST** prefer reading an object's structure from the - catalog where possible over executing the object's body, and **MUST NOT** run - a data-reading query (row count, sample) against a view as a required step. - (sqlite already skips `COUNT(*)` for views; the remaining gap is isolating the - metadata read that executes the view definition.) 
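
The R3 outcome table above derives the result purely from counts, which can be sketched as a small pure function. This is an illustration under stated assumptions: `ScanCounts`, `ScanOutcome`, and `deriveScanOutcome` are hypothetical names, and the error wording is a placeholder for the actionable messages the spec requires.

```typescript
// Hypothetical result type covering the three outcomes in the R3 table.
type ScanOutcome =
  | { kind: "success"; warnings: number }
  | { kind: "connection_failure"; reason: string }
  | { kind: "scope_error"; reason: string };

// Hypothetical input: the counts a scan would have after per-object isolation.
interface ScanCounts {
  scoped: boolean; // true when a non-empty enabled_tables is configured
  matched: number; // objects discovered (no scope) or matched by the scope
  succeeded: number; // objects that introspected without error
}

function deriveScanOutcome(c: ScanCounts): ScanOutcome {
  if (c.matched === 0) {
    // Zero objects is only legitimate when no allowlist was requested.
    return c.scoped
      ? { kind: "scope_error", reason: "enabled_tables matched no objects" }
      : { kind: "success", warnings: 0 };
  }
  if (c.succeeded === 0) {
    // Every discovered or matched object failed: total failure.
    return {
      kind: "connection_failure",
      reason: `all ${c.matched} objects failed introspection`,
    };
  }
  // At least one object succeeded; the rest ride along as warnings.
  return { kind: "success", warnings: c.matched - c.succeeded };
}
```

Keeping the decision in one function of the counts is what lets every connector share the same semantics: no driver needs its own notion of "partial" versus "total" failure.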
- -### R5 — `enabled_tables` allowlist works - -The documented allowlist escape hatch **MUST** reliably restrict the scan to the -listed objects, with no spurious adapter error: - -- **sqlite qualification.** The schema-qualified form `"main."` **MUST** - resolve to the same object as the bare form `""` (sqlite's sole schema - is `main`; the connector emits `db: null`). Both forms select the object; - neither silently matches nothing. -- **Documented format.** The accepted qualification forms for each driver - (`catalog.db.name` / `db.name` / `name`) and the sqlite-specific `main` - equivalence **MUST** be documented where `enabled_tables` is described - (`context/project/driver-schemas.ts` and the user-facing config docs). -- **Zero-match is a clear error.** A non-empty `enabled_tables` that resolves to - **zero** matched objects **MUST** fail with an actionable error naming the - connection, the unmatched entries, and the available object names — **not** the - generic `did not recognize fetched source output`. This is distinct from a - legitimately empty database (R3 row 1) and from a matched-but-all-broken scope - (R3 last row). -- **Any subset works.** An `enabled_tables` matching M > 0 objects ingests - **exactly** those M objects (minus any that fail per R1), with no adapter - recognition error regardless of how small or edge-case the set is. - -### R6 — Python daemon parity - -The daemon's postgres introspection path **MUST** honor the same contract: - -- Add a `warnings` field to `DatabaseIntrospectionResponse` - (`python/ktx-daemon/src/ktx_daemon/database_introspection.py`) carrying the - same shape Node expects (code, message, object identifier, recoverable). -- Isolate per-object failures in the daemon's introspection so one broken object - does not abort the response; apply the R3 total-failure rule there too. 
-- Map daemon warnings into `KtxSchemaSnapshot.warnings` in - `mapDaemonSnapshot` (`context/ingest/adapters/live-database/daemon-introspection.ts`), - which currently drops them. -- The Node and Python warning shapes **MUST** stay in parity (the codebase - already mirrors Node↔Python schemas for telemetry; follow the same discipline - so the daemon cannot emit a code Node can't render). - -## Acceptance criteria - -- Ingesting a sqlite DB with one broken view + N healthy tables yields a - semantic layer for the N healthy tables and **exactly one** warning naming the - broken view and its error; exit is **success**. -- The skipped object appears in the `ktx ingest` summary output, in the run's - `warnings.json`, and in `ktx status` as a per-connection skipped-objects line - on the connection's latest ingest. -- A sqlite DB in which **every** discovered object fails introspection (and the - file opens) exits as a **connection failure** with a clear error — not an - empty "success" and not `did not recognize fetched source output`. -- A genuinely empty sqlite DB (zero objects) exits **success** with an empty - layer (not a failure). -- `enabled_tables: ["main.customers"]` and `enabled_tables: ["customers"]` both - ingest exactly the `customers` object on a sqlite connection. -- `enabled_tables` restricted to a valid subset of M objects ingests exactly - that subset, with **no** adapter-output error. -- `enabled_tables` that matches zero objects fails with an error naming the - connection, the unmatched entries, and available objects — distinguishable - from the empty-DB and all-broken cases. -- A broken view does not prevent ingest of base tables in the same connection - (regression test with a view that errors on read alongside a healthy table). -- The daemon's `DatabaseIntrospectionResponse` carries a `warnings` array, and a - per-object failure in the daemon path produces a warning mapped into - `KtxSchemaSnapshot.warnings` (Node↔Python parity test). 
-- A warehouse-driver object whose profiling/sample read fails is skipped with a - warning and does not abort introspection of its siblings. -- Existing healthy-only ingests (no broken objects, no `enabled_tables`) behave - identically before/after — no warnings, same semantic layer. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the design. - -- **Shared semantics:** `context/scan/constraint-discovery.ts` - (`tryConstraintQuery` / `constraintDiscoveryWarning` — the precedent to mirror - for the per-object helper), `context/scan/types.ts` - (`KtxSchemaSnapshot.warnings`, `KtxScanWarning`, `KtxScanWarningCode` — add the - new object-skip code here). -- **Node connectors:** `packages/cli/src/connectors//connector.ts` and - each `live-database-introspection.ts`. sqlite's loop is - `connectors/sqlite/connector.ts` `introspect` (≈ line 158) → `readTable` - (≈ line 306); the missing try/catch is the `rawTables.map(...)` at ≈ line 171. - Existing per-table sample isolation precedent: `description-generation.ts` - (≈ line 867, `sampling_failed`). -- **Driver dispatch:** `packages/cli/src/local-adapters.ts` (≈ lines 122-156) - routes every driver to its Node connector; the daemon is the `else` fallback. -- **`enabled_tables` matching:** `context/scan/enabled-tables.ts` - (`resolveEnabledTables`, `parseDottedTableEntry`), `context/scan/table-ref.ts` - (`scopedTableNames`, the `ref.db !== wantDb` filter ≈ line 47), - `context/project/driver-schemas.ts` (`enabled_tables` schema + description). -- **Staging / detect / error surface:** - `context/ingest/adapters/live-database/stage.ts` - (`writeLiveDatabaseSnapshot`, `warningArtifact` ≈ line 94, - `detectLiveDatabaseStagedDir` ≈ line 138), - `context/ingest/local-stage-ingest.ts` (the - `did not recognize fetched source output` throw ≈ line 291 — must stop being - the surface for empty-scope and total-failure). 
-- **Ingest summary:** `packages/cli/src/ingest.ts` (`writeReportStatus` - ≈ line 202), `context/ingest/memory-flow/summary.ts` - (`formatMemoryFlowFinalSummary`) — thread object skips into the human-facing - summary. -- **Report body + `ktx status`:** `context/ingest/reports.ts` (`IngestReportBody`; - `SourceFetchReport` as the home for scan warnings), - `context/ingest/sqlite-local-ingest-store.ts` (the report body is persisted - whole as `body_json` ≈ line 90 — no migration needed), `status-project.ts` - (`buildLocalStatsStatus` reads `local_ingest_reports`; parse the latest body - per connection and render the skipped line via `renderLocalStatsAsLines`). -- **Daemon path:** `python/ktx-daemon/src/ktx_daemon/database_introspection.py` - (`DatabaseIntrospectionResponse` ≈ line 165, `introspect_database_response` - ≈ line 323, `_load_postgres_rows` ≈ line 227, `_map_rows_to_tables` - ≈ line 267), and the Node mapping in - `context/ingest/adapters/live-database/daemon-introspection.ts` - (`mapDaemonSnapshot` ≈ line 209). - -## Benchmark context (motivation only) - -`oracle_sql` (8 of the 135 local sqlite questions) currently has **no** semantic -layer because of its one broken view, so those questions fall back to raw -`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning -restores enriched context for that database. The same robustness is required for -the full Spider 2.0-Lite run across BigQuery and Snowflake, where broken or -permission-restricted objects are common and a single one must not zero out a -warehouse's context. - -## Implementation notes - -Shipped on branch `write-feature-spec-wiki`. All requirements implemented; -verified with `pnpm --filter @kaelio/ktx run test` (2981 passing), -`pnpm run dead-code`, `uv run pytest python/ktx-daemon/tests` (97 passing), -`uv run pre-commit`, and `pnpm run build && pnpm run link:dev`. 
- -**Shared semantics (R1).** New `context/scan/object-introspection.ts` exposes -`tryIntrospectObject(ctx, fn)` (sibling of `tryConstraintQuery`), returning -`{ ok, table } | { ok: false, warning }` and building an -`object_introspection_failed` warning (object name + redactable DB error). It -rethrows native programming faults (`isNativeProgrammingFault`) so a ktx bug is -never masked as an object skip. The new warning code was added to -`KtxScanWarningCode` (`scan/types.ts`), the `scanWarningCodes` allowlist -(`local-structural-artifacts.ts`, plus a new exported `isKtxScanWarningCode` -validator), and `describeWarningGroup` (`scan.ts`). - -**Per-object isolation, where it actually exists (R1/R4).** Only sqlite -(`readTable` via `PRAGMA`) and bigquery (`tableRef.get()` per dataset) do -per-object reads during *mandatory* introspection; both now route each object -through `tryIntrospectObject`. The other five Node connectors (postgres, mysql, -clickhouse, sqlserver, snowflake) read metadata in bulk from the catalog/ -`information_schema` (already object-safe at this phase) and isolate per-object -profiling/sampling in the enrichment phase (`description-generation.ts`, -`sampling_failed`), so no divergent per-driver try/catch was added there. sqlite -also tolerates a `COUNT(*)` (profiling) failure without dropping a -structurally-readable table, and a broken view's metadata read is isolated so it -never blocks base tables (R4). 
- -**Single-source outcome decision (R3/R5).** New -`adapters/live-database/scan-outcome.ts#assertLiveDatabaseScanOutcome` runs once -in `LiveDatabaseSourceAdapter.fetch()` — the one path every driver (and the -daemon) routes through — and derives the outcome from the snapshot + scope: -≥1 object → success (skips ride along as warnings); all matched objects failed → -clear `KtxExpectedError`; non-empty `enabled_tables` matched nothing → clear -zero-match error naming the connection, the requested entries, and the available -objects (sqlite/bigquery attach the discovered inventory via -`metadata.discovered_object_names`); empty database (no scope) → success with an -empty layer. `detectLiveDatabaseStagedDir` no longer requires table files, so a -valid empty staging is recognized; total-failure/zero-match now throw a clear -connection error before staging instead of surfacing the generic -`did not recognize fetched source output`. - -**`enabled_tables` matching (R5).** Normalized at the scope boundary in -`resolveEnabledTables` using `connection.driver`: for sqlite, `main.` → -`{ db: null }`, so `"main.customers"` and `"customers"` select the same object. -`table-ref.ts` stayed generic. Documented in `driver-schemas.ts` and -`docs-site/.../configuration/ktx-yaml.mdx`. - -**Surfacing (R2).** Deviation from the spec's orientation: live-database schema -ingest runs through the **stage-only** path (`runLocalStageOnlyIngest` → -`local_ingest_reports`), not the bundle runner, so the home for scan warnings is -`LocalIngestRunRecord.fetch` (a new `SourceFetchReport` field; `body_json` is -persisted whole, so **no migration**), not the bundle-only -`IngestReportBody.fetch`. Both ingest paths read `adapter.readFetchReport` -(`live-database/fetch-report.ts` derives skips from the existing `warnings.json`). 
-The ingest summary is already rendered by `runKtxScan` from `report.warnings` -(the new `describeWarningGroup` case), and `ktx status` -(`status-project.ts#buildLocalStatsStatus`/`renderLocalStats`) now parses the -latest report body per connection and prints a per-connection -`N object(s) skipped — name: reason` line. - -**Daemon parity (R6).** `database_introspection.py` adds a `warnings` field to -`DatabaseIntrospectionResponse` and a `DatabaseIntrospectionWarning` model, -isolates per-object failures in `_map_rows_to_tables`, and shares the -`OBJECT_INTROSPECTION_FAILED_CODE = "object_introspection_failed"` string with -Node. `mapDaemonSnapshot` maps `raw.warnings` into `KtxSchemaSnapshot.warnings`, -dropping any code Node cannot render (validated via `isKtxScanWarningCode`). -Deviation: the daemon does **not** re-enforce the R3 total-failure rule — the -shared Node post-check (`assertLiveDatabaseScanOutcome`) owns it for every driver -including the daemon, avoiding a divergent second implementation. Parity is -covered by a Node test (daemon-shaped warning round-trips) and a pytest -(per-object failure → warning with the shared code). diff --git a/spider2-specs/specs/07-analytics-skill-sql-craft.md b/spider2-specs/specs/07-analytics-skill-sql-craft.md deleted file mode 100644 index 023780d5..00000000 --- a/spider2-specs/specs/07-analytics-skill-sql-craft.md +++ /dev/null @@ -1,363 +0,0 @@ -# Add universal SQL-authoring craft to the ktx-analytics skill - -> Refined spec. Intake draft: `todo/07-analytics-skill-sql-craft.md`. - -## Problem - -The shipped `ktx-analytics` skill -(`packages/cli/src/skills/analytics/SKILL.md`) is an *orchestration* guide: its -`` and `` tell the agent **which ktx tools to call and in what -order** (`discover_data` → `entity_details`/`sl_read_source` → -`sl_query`/`sql_execution` → validate → `memory_ingest`). It says almost nothing -about **writing correct SQL**. 
- -That gap shows up as a specific failure shape: the agent reliably produces -*runnable* SQL but *wrong* results. The recurring defects are universal -analytics-engineering mistakes, not ktx-specific ones: - -- comparing a string column to a numeric literal (or vice versa), which can - silently match zero rows; -- rounding inside intermediate CTEs, so the final number is off; -- ranking/“first”/“most recent” windows with no deterministic tie-breaker, so - results flicker run to run; -- filtering *before* a window function for sequence/“since”/“first” questions, - truncating the partition the window should see; -- returning a full ranked list for a “top/highest” question, or collapsing a - “per X” question to a single value; -- dropping the inputs (or the entity identifier) a derived value was built from. - -These are correctness defects every ktx user hits on a live database. They -belong in the shipped skill — fixing them once improves ktx for everyone, rather -than living in any individual caller’s prompt. - -## Generic use case - -An analyst (human or agent) points ktx at a **live, production** database and -asks a real analytical question — “what’s the most recent order per customer”, -“top region by margin”, “average order value by month”. The schema is unfamiliar -(unknown date encodings, nullable join keys, string-typed numeric columns), the -question carries grain and ranking intent in its wording, and the answer must be -*correct and deterministic*, not merely executable. The skill should encode the -analytics-engineering craft that makes the difference between a query that runs -and a query that’s right — independent of any benchmark. - -## Model - -The change is **additive content in one Markdown file**, governed by these -invariants. They constrain the implementer; the exact prose is theirs. - -### Inline-only delivery (this is a hard constraint, not a style preference) - -All new guidance lives **inside `skills/analytics/SKILL.md`**. 
A bundled -`reference/*.md` file (the progressive-disclosure pattern Anthropic’s -skill-authoring guide recommends for large skills) **MUST NOT** be used here, -because the delivery mechanism ships only `SKILL.md`: - -- `setup-agents.ts` installs the analytics skill via `readAnalyticsSkillContent()`, - which reads **only** `./skills/analytics/SKILL.md` and writes a **single** file - per target: `.claude/skills/ktx-analytics/SKILL.md` (Claude Code), the Codex / - universal `.agents` equivalent, a **flattened** single rules file for Cursor - (`.cursor/rules/ktx-analytics.mdc`) and OpenCode - (`.opencode/commands/ktx-analytics.md`), and a Claude Desktop **zip that - contains only `ktx-analytics/SKILL.md`** (`writeClaudeDesktopSkillBundle`). -- Nothing copies sibling files or subdirectories. A reference file would dangle - on every target, and the Cursor/OpenCode flatten-to-one-file shape cannot - represent a multi-file skill at all. - -The skill is small enough that inline costs nothing meaningful: ~67 lines today -plus ~60 of craft is well under the 500-line budget. And this craft is **core -content** — consulted on every SQL-authoring turn — so even if multi-file delivery -existed it would still belong inline: progressive disclosure only pays off for -large, *conditionally-relevant* reference material loaded on demand, not for -always-needed craft. - -Multi-file skill *delivery* is a legitimate future enhancement, but it must be -**pulled by a concrete need, not built ahead of one** — no shipped skill today -exceeds the budget (largest is ~346 lines) or uses a bundled reference. The first -real trigger is the **per-dialect SQL syntax follow-up** -(`todo/08-per-dialect-sql-syntax-notes.md`), whose load-on-demand -`reference/.md` content is a genuine progressive-disclosure fit. 
When -that work is scoped, note that multi-file delivery is **not** a simple directory -copy: `setup-agents.ts` flattens the skill to a *single* file for Cursor -(`.mdc`) and OpenCode (`.md`), so those targets need a concatenation transform, -and uninstall needs per-file manifest entries. Recording the constraint here so a -future implementer does not “improve” this inline content into a bundled -reference that dangles on every target. - -### Heuristics with a generic *why*, not a wall of MUSTs - -The new rules are phrased as **heuristics with a one-line, universal rationale**, -because SQL authoring is a high-freedom task (many valid approaches, choice -depends on the question and the data). A bare imperative overfits; a rule plus -its *why* lets the model apply judgment and generalize. This follows Anthropic’s -own skill-authoring guidance (“if you find yourself writing ALWAYS/NEVER in all -caps or rigid structures, reframe and explain the reasoning”). - -This **reconciles the draft’s “behavior only, no rationale” instruction**: the -prohibition is specifically on rationale that references a **grader, gold answer, -or the benchmark**. *Generic analytics-engineering rationale is required* — e.g. -“…so `RANK`/`ROW_NUMBER` results don’t flicker across runs”, “…a string-vs-number -compare can silently match nothing”. That is a universal truth, not a -grader reference. - -### Dialect-agnostic - -Every rule must read correctly on any SQL dialect a ktx connection might use. -**No dialect-specific syntax** — not `QUALIFY` (Snowflake/BigQuery/DuckDB only), -not `strftime`/`julianday` (sqlite), not backtick/`DB.SCHEMA.TABLE` FQTNs. -Per-dialect syntax notes are a **separate follow-up** living in a dialect-aware -(per-driver) location, explicitly out of scope here. 
- -### Discovery craft attaches to discovery; authoring craft to query/validate - -Two of the draft’s rules (inspect sample rows; cast before comparing) are -*schema-discovery* concerns that happen **before** SQL is composed. They belong -with the discovery steps of the existing workflow, not only at the query step. -The rest (composition, window correctness, precision, completeness) belong with -the query/validate steps. The draft’s “extend step 5/6” is the right home for -most rules but is slightly off for the discovery pair; this spec corrects that. - -### Additive only - -The existing ``, ``, and `` — compact result tables, -summaries, clarification prompts, the tool-order workflow, the `connectionId` -scoping rules — are preserved unchanged. The skill must still read well for an -interactive, human-facing analysis session. - -## Requirements - -### 1. Placement and structure - -Add a dedicated, scannable craft section to `SKILL.md`: - -- A new top-level block — `` (sibling to ``/``) — with - **five sub-headings**: *Schema discovery*, *Composition*, *Window functions*, - *Numeric precision*, *Answer completeness*. Sub-headings keep the block - scannable (the draft’s “group under clear sub-headings” goal). -- **Pointers, not duplication.** Step 5 (“Query”) and step 6 (“Validate and - explain”) each gain a **one-line pointer** into `` rather than - inlining the rules (state each rule once; Anthropic’s “consistent terminology / - don’t repeat” guidance). The schema-discovery pair is additionally reflected as - a brief cue in the discovery steps (step 2 “Inspect” / step 4 “Plan”), pointing - to the same block. -- No new tool, flag, or config. This is content only. - -### 2. The craft rules (all fourteen behaviors, grouped) - -Every behavior from the intake draft must be represented. Tightly-related ones -**may** be merged into a single bullet where that reads better; none may be -dropped. Each carries a generic *why* (per Model). Dialect-agnostic throughout. 
- -**Schema discovery** (cue in steps 2/4; lives in ``) -1. Inspect representative **sample rows** of each table before composing SQL — - confirm date/time encoding (`YYYYMMDD` vs ISO vs epoch), null prevalence in - join/filter keys, and the real set of categorical/enum values - (`entity_details` + a small `sql_execution` sample). *Why:* assumptions about - encoding and nullability are the most common source of silently-wrong filters. -2. **Cast a column to its real type before comparing** it in `WHERE`/`JOIN`. A - string column compared to a numeric literal (or vice versa) can silently match - nothing. - -**Composition** -3. Build complex queries **incrementally** — one CTE at a time, verifying each - layer’s output on a small sample before stacking the next. *Why:* a wrong - intermediate layer is far cheaper to catch early than to debug in the final - result. -4. **Avoid fan-out joins.** Add columns only from tables already at the target - grain, or **pre-aggregate** to that grain before joining. *Why:* a join that - multiplies rows quietly inflates every downstream `SUM`/`COUNT`. - -**Window functions** -5. Give every ranking/ordering window function a **complete, deterministic - tie-breaker** (append unique key columns to `ORDER BY`), so - `RANK`/`ROW_NUMBER`/`LAG` are stable rather than flickering across runs. -6. For sequence / “first” / “most recent” / “since” questions, **filter after the - window**, not before: compute over the full partition, then keep the rows you - want. *Why:* a pre-filter shrinks the partition the window ranks over, so - “first”/“most recent” is computed against the wrong set. (See the worked - example, requirement 3.) - -**Numeric precision** -7. Compute at **full precision; round only in the final projection**, never inside - intermediate CTEs. -8. Be **explicit about truncation** — `CAST AS INT` truncates; use explicit - rounding when rounding is intended. (May merge with rule 7.) -9. 
Distinguish **macro vs micro averages** based on the question’s wording: - “average of per-group averages” = `AVG(group_metric)`; “overall/weighted - average” = `SUM(numerator)/SUM(denominator)`. - -**Answer completeness / interpretation** -10. “top / highest / most / lowest” → return only the **winning row(s)** (keep the - top-ranked row via the window result), not the full ranked list, unless a list - is asked for. *(Phrase the mechanism dialect-agnostically — do not name - `QUALIFY`.)* -11. “for each X / per X / by X” → **exactly one row per X**; don’t collapse to a - single value unless the question says “overall” or “total across X”. -12. When a question asks for inputs and a derived value (“X, Y, and their ratio”), - **include the inputs as columns** alongside the derived value. -13. When grouping by a human-readable label (a name), also **expose the entity’s - identifier** — identity, not just the label, is part of the result (and - disambiguates duplicate names). -14. When a result is **unexpectedly empty, relax filters one at a time** to find - which predicate removed the rows. *Why:* this is the validation feedback loop - that turns a silent empty result into a diagnosable one. - -### 3. One worked example (dialect-agnostic) - -Add **exactly one** compact before/after example to the skill, demonstrating the -**window-then-filter** rule (rule 6) — the subtlest and highest-value of the set. -It shows the wrong shape (filter inside, then rank) and the right shape (rank over -the full partition in a CTE, then filter to the top rank in the outer query), -using generic table/column names and standard SQL only (no `QUALIFY`, no -dialect functions). Keep it ~6–10 lines. Do not add a second example; the -existing three tool-orchestration examples stay as the primary example set. 
-*(Superseded by spec 09: the skill now carries a second `sql` worked example — -the multi-hop fan-out case — so the one-example constraint applies to spec 07's -window-then-filter example only.)* - -### 4. Explicit exclusions - -None of the following may appear in the skill (they are application/consumer -concerns, or actively wrong for live data): - -- **Output-shape contracts** (“return a bare result set with exactly these - columns, no prose”). The skill is for interactive analysis and already favors - readable tables + summaries; a caller needing a strict shape specifies that - itself. -- **Anchoring relative time to `MAX(date)` of the data.** On a live database - “recent” / “past N months” means relative to *now*; `MAX(date)` anchoring is - only valid for static snapshots and must not be baked into the product. -- **Any advice justified by a grader, gold answer, or scoring comparator.** -- **Dialect-specific syntax** (deferred to the per-driver follow-up). - -### 5. Coordination with spec 03 - -`03-multi-connection-routing-in-analytics-skill` also edits this same file (it -adds a connection-routing “step 0” to `` and threads `connectionId` -through the tool calls). Spec 07’s additions are **orthogonal**: they live in a -new `` block and in step 5/6 pointers, and must not rewrite the -`` routing or the `` `connectionId` scoping that spec 03 owns. -If both land, the result is one coherent skill: routing in ``/``, -SQL craft in ``. - -## Acceptance criteria - -- The shipped `analytics/SKILL.md` contains all fourteen behaviors above, grouped - under the five sub-headings, each phrased as a heuristic with a generic - rationale. -- **Zero references** to any benchmark, gold answer, grader, or scoring - comparator anywhere in the skill. -- **Dialect-agnostic:** the skill contains no `QUALIFY`, no `strftime`/`julianday`, - no backtick/`DB.SCHEMA.TABLE` FQTN syntax, and no other single-dialect - construct — including in the worked example. 
-- The existing interactive guidance is intact: the `` steps, the - `` (compact tables, summaries, clarification prompt, `connectionId` - scoping), and the three existing examples all still read correctly and were not - removed or contradicted. -- **None of the excluded items** (output-shape contract, `MAX(date)` anchoring of - “recent”, grader-driven advice, dialect syntax) appear. -- Exactly **one** new worked example is present, demonstrating window-then-filter, - in standard dialect-agnostic SQL. *(Superseded by spec 09, which adds a second - `sql` worked example for the multi-hop fan-out case; the shipped skill then - contains two worked examples and the content test asserts two `sql` fences.)* -- The craft is **inline in `SKILL.md`** — no bundled reference file is introduced, - and the skill still installs as a single file through `setup-agents.ts` for all - targets (Claude Code, Codex, Cursor, OpenCode, universal, Claude Desktop zip). -- The skill stays **scannable and within a reasonable size** (comfortably under - the 500-line budget). -- The frontmatter (`name`, `description`) is unchanged and still parses through - `SkillsRegistryService.parseFrontmatter`. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the prose. - -- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md`. Add the - `` block; add one-line pointers in steps 5/6 and a discovery cue in - steps 2/4; add the single worked example. Keep ``/``/`` - otherwise intact. -- **Delivery (why inline is mandatory):** `packages/cli/src/setup-agents.ts` - (`readAnalyticsSkillContent`, `installTarget`, `writeClaudeDesktopSkillBundle`, - `plannedKtxAgentFiles`). Each target gets a single file derived from - `SKILL.md`; Cursor/OpenCode flatten to one rules file; Claude Desktop zips only - `ktx-analytics/SKILL.md`. No change to `setup-agents.ts` is required by this - spec — confirm the skill still installs unchanged. 
-- **Coordination:** `03-multi-connection-routing-in-analytics-skill` edits the - same file; keep the changes non-overlapping (see requirement 5). -- **Tests:** a content assertion over the shipped `analytics/SKILL.md` is the - right level (this is prompt content, not executable logic). Assert the skill - text contains the craft sub-headings / representative rule phrases, contains the - worked example, and contains none of the banned constructs: the literal tokens - `QUALIFY`/`strftime`/`julianday`, grader/benchmark words (`spider`, `benchmark`, - `gold`, `grader`), and — checked as a phrase, not a raw `MAX(` grep, since - `MAX()` is a legitimate aggregate — any instruction anchoring relative time - (“recent”, “past N months”) to the data’s maximum date. The existing - `SkillsRegistryService` frontmatter-parse test must still pass. The standalone - `ktx-dev` binary should be rebuilt/re-linked (`pnpm run build && pnpm run - link:dev`) so the playground picks up the updated skill. - -## Benchmark context (motivation only) - -On the Spider 2.0-Lite sqlite subset the solver produced **0 execution errors but -~50 result mismatches**, and a large share traced to exactly these gaps: -premature rounding, string-vs-number compares, non-deterministic window ordering, -returning full lists for “top” questions, and dropping the inputs to derived -values. These are generic SQL-authoring defects — fixing them in the skill -improves ktx for every user querying a live database, and improving the benchmark -score is a side effect, not the goal. The skill itself must contain no trace of -the benchmark. - -## Implementation notes - -Implemented on branch `write-feature-spec-wiki`. 
- -**What was built** -- Added a new `` block to `packages/cli/src/skills/analytics/SKILL.md` - (sibling to ``/``, placed just before ``), with the - five sub-headings — *Schema discovery before writing SQL*, *Composition*, - *Window functions*, *Numeric precision*, *Answer completeness / interpretation* — - and a one-line opener framing the bullets as heuristics-with-a-why. -- All fourteen behaviors are represented. Rules 7 and 8 (round-at-the-end / - truncation) are merged into one "Round only at the end" bullet, as the spec - permitted. Each bullet carries a generic analytics-engineering rationale; none - references a benchmark, grader, or gold answer. -- Exactly one worked example (a fenced `sql` block inside ``) - demonstrates the window-then-filter rule, and incidentally the deterministic - tie-breaker: the *wrong* shape filters before the window; the *right* shape - ranks the full partition in a CTE, then filters in the outer query. Standard - SQL only — no `QUALIFY`, no dialect functions. -- Step pointers added without duplicating the rules: a schema-discovery cue in - steps 2 and 4, an authoring pointer in step 5, and a validation pointer in - step 6, each pointing into ``. -- The existing `` / `` / `` (compact tables, - summaries, clarification prompt, `connectionId` scoping, the three - orchestration examples) are unchanged. Delivery is unchanged: still a single - `SKILL.md` per target via `readAnalyticsSkillContent`; no bundled `reference/` - file was introduced. - -**Tests** — added `packages/cli/test/skills/analytics-skill-content.test.ts`, a -content assertion over the source `SKILL.md`: the five sub-headings, a -representative phrase for each behavior, exactly one `sql` worked example, the -preserved interactive guidance, and the absence of banned constructs -(`QUALIFY` / `strftime` / `julianday`, `spider` / `benchmark` / `gold` / -`grader`, a backtick three-part FQTN, and a phrase-level guard against anchoring -relative time to a `MAX(...)` date). 
The existing `setup-agents.test.ts` content -assertions and the `SkillsRegistryService` frontmatter test still pass (77/77 -across the three relevant files). Rebuilt and re-linked `ktx-dev` -(`pnpm run build && pnpm run link:dev`); the craft block is present in the -shipped `dist` asset. - -**Deviations / notes** -- The worked example runs ~18 lines including comments rather than the spec's - "~6–10"; a faithful before/after with a CTE needs the extra lines, and the - skill stays well within budget (~117 lines total). -- `pnpm run type-check` currently reports one **pre-existing, unrelated** error - in `test/mcp-server-factory.test.ts` (MCP server deps typing), committed on - this branch ahead of `origin/main`. The src type-check and `pnpm run build` - are green; this change does not touch any MCP file. -- Per-dialect SQL syntax stays out of scope here (deferred to - `todo/08-per-dialect-sql-syntax-notes.md`), so the skill remains - dialect-agnostic. No dialect-tool pointer was added to `SKILL.md` yet — that - belongs with spec 08's channel so the skill never references a tool that does - not exist. diff --git a/spider2-specs/specs/08-per-dialect-sql-syntax-notes.md b/spider2-specs/specs/08-per-dialect-sql-syntax-notes.md deleted file mode 100644 index d2674c9c..00000000 --- a/spider2-specs/specs/08-per-dialect-sql-syntax-notes.md +++ /dev/null @@ -1,395 +0,0 @@ -# Per-dialect SQL syntax notes, served on demand and scoped to the connection - -> Refined spec. Intake draft: `todo/08-per-dialect-sql-syntax-notes.md`. Companion -> to `specs/07-analytics-skill-sql-craft.md`, which kept the analytics SQL craft -> dialect-agnostic and explicitly deferred per-dialect syntax to this spec. - -## Problem - -Spec 07 added universal, **dialect-agnostic** SQL-authoring craft to the -`ktx-analytics` skill (`packages/cli/src/skills/analytics/SKILL.md`). 
That craft -deliberately excludes anything that reads correctly on only one engine — no -`QUALIFY`, no `strftime`/`julianday`, no backtick or `DB.SCHEMA.TABLE` FQTNs — -because the flat skill is installed verbatim and an agent querying sqlite must -never see Snowflake syntax. - -But a large share of *real* correctness depends on exactly that excluded, -engine-specific syntax: - -- **Snowflake:** `DATABASE.SCHEMA.TABLE` FQTNs, double-quoted case-sensitive - identifiers (unquoted folds to upper-case), VARIANT colon-paths - (`col:field.sub::type`), `QUALIFY`. -- **BigQuery:** backtick FQTNs (`` `project.dataset.table` ``), `_TABLE_SUFFIX` - for sharded/wildcard tables, `QUALIFY`, `JSON_VALUE`/`JSON_EXTRACT`. -- **sqlite:** `strftime`/`julianday`/`date()` for dates, no `QUALIFY`, - `json_extract`. -- and the remaining supported engines (`postgres`, `mysql`, `clickhouse`, - `sqlserver`/`tsql`), each with its own FQTN, quoting, date, top-N, and - JSON conventions. - -This guidance is genuinely useful to an agent writing SQL against a live -database, but it must **not** pollute the flat dialect-agnostic skill. It belongs -in a **dialect-aware** channel, surfaced only for the dialect the active -connection actually uses, and selected from the project's own configured state — -not guessed, not shown all at once. - -## Generic use case - -Any **ktx** project whose connections span more than one warehouse engine — a -Snowflake warehouse plus a BigQuery export plus a local sqlite extract, say. When -the agent (or a human analyst the agent assists) writes SQL for a given -connection, it should receive *that engine's* syntax conventions — FQTN form, -identifier quoting, date functions, top-N idiom, semi-structured access — and -nothing for the engines it is not querying. The need is independent of any -benchmark: it is what "write correct SQL against this specific warehouse" requires -on every multi-engine stack. 
- -## Model - -The change adds a **dialect-aware channel** alongside spec 07's flat skill. The -following decisions are committed by this refinement; the implementer owns the -exact prose and code. - -### Delivery: a dynamic MCP tool (decision committed) - -The draft posed two delivery mechanisms and asked the refinement to "weigh them -before committing." This spec commits to **dynamic MCP delivery**: a new -read-only MCP tool returns the syntax notes for a given `connectionId`, with the -dialect resolved server-side from the connection's configured `driver`. The flat -skill gains a one-line pointer to that tool. **No install-mechanism change is -required.** - -The alternative — **multi-file skill delivery** (bundle `reference/.md` -files and point the skill at the matching one) — is **rejected** for **ktx**, for -reasons that hold regardless of how the skill is otherwise authored: - -1. **It cannot scope on two of the six install targets.** Cursor - (`.cursor/rules/ktx-analytics.mdc`) and OpenCode - (`.opencode/commands/ktx-analytics.md`) are physically **single-file**; - `setup-agents.ts` flattens the skill to one file there. A bundled `reference/` - directory degenerates to "concatenate every dialect into one file," so a - sqlite agent would see Snowflake VARIANT syntax — **failing this spec's core - no-leak criterion on those targets**, and defeating progressive disclosure - (everything is in context at once). The MCP tool behaves **identically on all - six targets** because it is a tool call, not an installed file. -2. **Selecting the dialect is a deterministic operation, so it belongs in code, - not model judgment.** Anthropic's skill-authoring guidance explicitly says to - *"prefer scripts [tools] for deterministic operations."* With bundled files the - **model** must infer that connection X is Snowflake and open the right file — - and on a multi-connection project it can open the wrong one. 
With the tool, the - **server** resolves `driver → dialect` from `ktx.yaml` state and returns - exactly the right notes. -3. **It needs a delivery subsystem that the tool does not.** Multi-file delivery - requires reworking `readAnalyticsSkillContent`, `installTarget`, - `plannedKtxAgentFiles`, the install manifest (a directory variant), - `removeKtxAgentInstall`, and `writeClaudeDesktopSkillBundle`, plus a - concatenation transform for the single-file targets. The MCP tool requires one - read-only handler and one skill pointer. -4. **The dependency is free.** The `ktx-analytics` skill already hard-depends on - the **ktx** MCP server — its entire workflow is calling `discover_data`, - `entity_details`, `sql_execution`, and so on. Wherever the server is down, the - skill is already non-functional; the tool adds **no new dependency**. -5. **Dropping Cursor/OpenCode does not change this.** Removing those targets would - make multi-file delivery *possible*, but it would not make it better: reasons - 2–4 stand, and the drop is a disproportionate cost (Cursor is a major target) - to neutralize a constraint the tool handles for free. Whether **ktx** supports - those targets is a separate product decision and is out of scope here. - -This is consistent with Anthropic's progressive-disclosure goal — load the -relevant material on demand, at zero context cost until needed — which the tool -satisfies (its output costs context only when called) while resolving *which* -dialect from state rather than from a model guess. Reference: -[Skill authoring best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices). - -### Scope derived from state, through the one existing resolver - -Which dialect's notes the agent sees is **derived** from the connection's -configured `driver`, via the resolver the rest of the system already uses — -`sqlAnalysisDialectForDriver(driver)` in -`packages/cli/src/context/sql-analysis/dialect.ts`. 
The same function already -selects the dialect for `sql_execution`, `sl_query`, and the Python SQL-analysis -daemon. This spec **must not** introduce a second driver→dialect map. The notes -are **keyed by the resolved `SqlAnalysisDialect`** (so the SQL Server entry is -keyed `tsql`, not `sqlserver`), tying the note key-space to the resolver's -codomain so the two cannot drift. - -### Authored per-engine notes are sanctioned static content - -Enumerating syntax notes per engine is **not** a rotting denylist of bad -specifics; FQTN form and identifier quoting are genuine, stable invariants of each -engine — the kind of universal fact **ktx**'s design rules explicitly permit as -static content. What must stay derived-from-state is note *selection* (the active -dialect) and note *coverage* (every configured driver must resolve to notes that -exist), both of which this spec ties to the connector registry. - -### The flat skill stays dialect-agnostic (spec 07 invariant preserved) - -This work adds a *separate* channel. It does **not** amend spec 07's `` -block or inline any dialect syntax into `SKILL.md`. Spec 07's acceptance criterion -— no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in the flat skill — stays -green. The only `SKILL.md` change is the pointer in requirement 3, which names the -tool and contains no dialect syntax. - -## Requirements - -### 1. A read-only `sql_dialect_notes` MCP tool - -Register a new tool beside the existing context tools -(`packages/cli/src/context/mcp/context-tools.ts`). The tool name is the -implementer's to finalize but should follow the existing snake_case convention -(`entity_details`, `sql_execution`); `sql_dialect_notes` is the suggested name. - -- **Input:** `{ connectionId }`, **required** — matching its siblings - `entity_details`/`sql_execution`, which always take an explicit connection. 
-- **Output:** `{ connectionId, dialect, notes }` where `dialect` is the resolved - `SqlAnalysisDialect` and `notes` is the markdown guidance for that dialect. -- **Resolution:** `connectionId → connection.driver → - sqlAnalysisDialectForDriver(driver) → notes[dialect]`, reusing the existing - resolver. Do not duplicate the driver→dialect map. -- **Guards:** - - A **non-SQL context-source** connection (driver `metabase`, `looker`, - `lookml`, `notion`, `dbt`, `metricflow`) returns a **clear "not a SQL - warehouse connection" error**, not postgres notes. Gate on the existing - `isDatabaseDriver()` (`packages/cli/src/connection-drivers.ts`). - - For any **SQL warehouse** connection the resolver always yields a dialect with - notes (all seven warehouse drivers are covered — requirement 2); its built-in - `postgres` default is a safety floor, so the tool never errors for a SQL - connection and never emits a single-engine dialect (e.g. Snowflake) by - accident. -- **Annotations:** read-only and idempotent, consistent with the other read - tools. -- **Description (docs-grade, third person, states what and when):** e.g. - *"Returns the SQL syntax conventions for a connection's dialect — FQTN form, - identifier quoting and case-folding, date/time functions, top-N idiom, and - semi-structured access. Use before authoring raw SQL against a connection so the - SQL matches that engine."* The description drives the agent's decision to call - the tool, so it must be specific. - -### 2. Per-dialect note content - -Author concise notes for each supported dialect against a **fixed rubric**, so -every dialect answers the same questions. Each facet is a line or two of timeless, -engine-true convention (no version-dated "as of vX" content), phrased as -guidance with the engine reason where it helps — inheriting spec 07's -heuristics-with-a-why tone. The rubric facets: - -1. **FQTN form** — how to fully-qualify a table on this engine. -2. 
**Identifier quoting & case-folding** — quote character and how unquoted - identifiers fold. -3. **Date/time** — the engine's date functions and common date-encoding idioms. -4. **Top-N / window-filtering idiom** — `QUALIFY` where supported; a CTE + - outer-filter form where it is not; `TOP` for `tsql`. -5. **Semi-structured / JSON access** — VARIANT colon-paths, `JSON_VALUE`/ - `JSON_EXTRACT`, `->`/`->>`, `json_extract`, as applicable. -6. **Sharded / partition idiom** where the engine has one (e.g. BigQuery - `_TABLE_SUFFIX`). - -Constraints on the content: - -- **Coverage = the reachable dialect set.** Every driver in the connector registry - must resolve to a dialect that has non-empty notes. The reachable set is - `postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, and - `tsql` (from `sqlserver`). Do **not** author notes for `duckdb`/`databricks`: - they appear in the resolver map but no connector can produce them, so they are - unreachable — matching the draft's "don't author for nonexistent drivers." -- **Keyed by `SqlAnalysisDialect`** (see Model). -- **Storage is the implementer's choice.** The notes MAY live as per-dialect - markdown files inside the package (e.g. under the skill's directory) served by - the tool, or as a typed map. If files are used they are **package-internal** — - served by the tool, never installed onto an agent target — and already ship via - the recursive `src/skills → dist/skills` copy - (`packages/cli/scripts/copy-runtime-assets.mjs`); no `setup-agents.ts` change. -- **No benchmark, gold-answer, grader, or scoring references** anywhere in the - notes. - -The implementer must verify each engine's specifics against current official -documentation (the well-known anchors above are starting points, not a -substitute for checking the engine's docs). - -### 3. 
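
The resolution chain and guards from requirements 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the shipped handler: every identifier ending in `Sketch` is hypothetical, the driver table is only a stand-in for the existing `sqlAnalysisDialectForDriver` resolver (which the real tool must reuse rather than copy), and the note strings are abbreviated placeholders built from the examples named in this spec.

```typescript
// Illustrative stand-ins only. The real resolver (sqlAnalysisDialectForDriver),
// guard (isDatabaseDriver), and note text live in the ktx codebase and are not
// reproduced here; every name ending in "Sketch" is hypothetical.
type DialectSketch =
  | "postgres" | "mysql" | "snowflake" | "bigquery"
  | "sqlite" | "clickhouse" | "tsql";

interface ConnectionSketch {
  id: string;
  driver: string;
}

// Hypothetical error type standing in for an "expected" (non-crash) error.
class ExpectedErrorSketch extends Error {}

// Stand-in for the single existing driver-to-dialect resolver.
const DRIVER_DIALECTS_SKETCH: Record<string, DialectSketch> = {
  postgres: "postgres",
  mysql: "mysql",
  snowflake: "snowflake",
  bigquery: "bigquery",
  sqlite: "sqlite",
  clickhouse: "clickhouse",
  sqlserver: "tsql", // note key is the dialect, not the driver id
};

function isWarehouseDriverSketch(driver: string): boolean {
  return Object.prototype.hasOwnProperty.call(DRIVER_DIALECTS_SKETCH, driver);
}

function dialectForDriverSketch(driver: string): DialectSketch {
  const known = isWarehouseDriverSketch(driver)
    ? DRIVER_DIALECTS_SKETCH[driver]
    : undefined;
  return known ?? "postgres"; // safety floor for unknown drivers
}

// Abbreviated placeholder notes, keyed by dialect so coverage is type-checked.
const NOTES_SKETCH: Record<DialectSketch, string> = {
  postgres: "(placeholder: author against the rubric)",
  mysql: "(placeholder: author against the rubric)",
  snowflake: "FQTN: DATABASE.SCHEMA.TABLE; unquoted identifiers fold to upper-case.",
  bigquery: "FQTN: `project.dataset.table`; _TABLE_SUFFIX for sharded tables.",
  sqlite: "Dates: strftime / julianday / date(); no QUALIFY.",
  clickhouse: "Top-N: LIMIT n BY; JSON: JSONExtract*; no QUALIFY.",
  tsql: "Top-N: TOP.",
};

function resolveDialectNotesSketch(
  connections: ConnectionSketch[],
  connectionId: string,
): { connectionId: string; dialect: DialectSketch; notes: string } {
  let connection: ConnectionSketch | undefined;
  for (const candidate of connections) {
    if (candidate.id === connectionId) connection = candidate;
  }
  if (connection === undefined) {
    throw new ExpectedErrorSketch(`unknown connection: ${connectionId}`);
  }
  // Guard first: a context source must not fall through to the postgres floor.
  if (!isWarehouseDriverSketch(connection.driver)) {
    throw new ExpectedErrorSketch(`${connectionId} is not a SQL warehouse connection`);
  }
  const dialect = dialectForDriverSketch(connection.driver);
  return { connectionId, dialect, notes: NOTES_SKETCH[dialect] };
}

const connectionsSketch: ConnectionSketch[] = [
  { id: "warehouse", driver: "sqlserver" },
  { id: "wiki", driver: "notion" },
];
console.log(resolveDialectNotesSketch(connectionsSketch, "warehouse").dialect); // tsql
```

Typing the notes as `Record<DialectSketch, string>` means a missing dialect fails at compile time, which complements (but does not replace) the registry-derived coverage test this spec calls for. The guard runs before the resolver so that a non-SQL source such as `notion` is rejected instead of silently receiving the `postgres` fallback.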
The `SKILL.md` pointer (completes spec 07's deferral) - -Add a **single one-line pointer** to the SQL-authoring step (step 4 "Plan" / step -5 "Query") of `packages/cli/src/skills/analytics/SKILL.md`, directing the agent to -call the tool before writing raw SQL against a connection — e.g. *"Before writing -raw `sql_execution` SQL, call `sql_dialect_notes` with the connection's id to get -that engine's syntax conventions."* This is the pointer spec 07 deliberately did -not add because the tool did not yet exist. - -- The pointer **names the tool only**; it contains **no dialect syntax**, so the - flat skill stays dialect-agnostic. -- Follow the skill's existing tool-reference convention. The skill currently names - MCP tools by **bare** name (`discover_data`, `sql_execution`). Anthropic's - guidance recommends **fully-qualified** `ServerName:tool` names to avoid - "tool not found" when multiple MCP servers are present. Whether to fully-qualify - the new pointer (and optionally retrofit the existing bare references) is a - small, separable decision flagged for the maintainer — **not** a rename sweep - this spec mandates. - -### 4. Coverage is enforced from state, not by hand - -A test must **derive** the required coverage from the connector registry rather -than hardcoding a dialect list: enumerate the configured warehouse drivers -(`warehouseDrivers` in `driver-schemas.ts` / `KTX_DATABASE_DRIVER_IDS` in -`connection-drivers.ts`), resolve each through `sqlAnalysisDialectForDriver`, and -assert each result has non-empty notes. Adding a connector later then **fails this -test** until its dialect gets notes — the allowlist-from-state discipline, not a -hand-maintained list. - -### 5. No dialect syntax leaks into the flat skill - -Spec 07's content assertion over `analytics/SKILL.md` stays green: the flat skill -(and its worked example) still contain no `QUALIFY`, `strftime`, `julianday`, -backtick/`DB.SCHEMA.TABLE` FQTN, or other single-engine construct. 
This spec adds -a tool and a tool-pointer; it does not move dialect syntax into the skill. - -### 6. Delivery is unchanged - -`setup-agents.ts` (`readAnalyticsSkillContent`, `installTarget`, -`writeClaudeDesktopSkillBundle`, `plannedKtxAgentFiles`) needs **no change**. The -skill still installs as a single `SKILL.md` per target. Confirm the channel works -on all six targets — Claude Code, Claude Desktop (zip), Codex, universal -`.agents`, Cursor (`.mdc`), OpenCode (`.md`) — by virtue of being a tool call, -including the single-file targets where multi-file delivery could not scope. - -### 7. Coordination with specs 07 and 03 - -- **Spec 07** owns the dialect-agnostic `` block. This spec must not - amend it; it adds the tool, the pointer, and the notes. -- **Spec 03** (`03-multi-connection-routing-in-analytics-skill`) threads - `connectionId` through the skill's tool calls. The `sql_dialect_notes` pointer - is `connectionId`-scoped and fits that routing; keep the pointer consistent with - spec 03's `connectionId` rules and do not rewrite the routing it owns. - -## Acceptance criteria - -- An agent querying a **sqlite** connection gets sqlite date idioms and **never** - sees Snowflake/BigQuery-only syntax; an agent querying **Snowflake** gets - FQTN / identifier / VARIANT guidance. -- The dialect shown is **derived from the connection's configured `driver`** via - the existing `sqlAnalysisDialectForDriver`, not hardcoded per project and not - guessed. No second driver→dialect map is introduced. -- **Every configured warehouse driver** (`postgres`, `mysql`, `snowflake`, - `bigquery`, `sqlite`, `clickhouse`, `sqlserver`) resolves to a dialect with - non-empty notes, and the coverage test derives this from the registry. -- A **non-SQL context-source** connection (e.g. `metabase`, `notion`) yields a - clear "not a SQL warehouse" response, **not** postgres notes. -- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are - unaffected. 
The new pointer references the tool only and adds no dialect syntax. -- The channel installs/serves correctly across **all six** agent targets, - including the single-file Cursor/OpenCode shape, with **no `setup-agents.ts` - change**. -- The notes contain **no** benchmark/gold/grader/scoring references and **no** - time-sensitive ("as of version X") content. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the design. - -- **Dialect resolver (reuse, do not duplicate):** - `packages/cli/src/context/sql-analysis/dialect.ts` — - `sqlAnalysisDialectForDriver(driver)`, returning `SqlAnalysisDialect` - (`./ports.ts`), default `postgres`. -- **Connector registry (drives coverage):** - `packages/cli/src/connection-drivers.ts` (`KTX_DATABASE_DRIVER_IDS`, - `isDatabaseDriver`) and `packages/cli/src/context/project/driver-schemas.ts` - (`warehouseDrivers`, the per-driver `connectionConfigSchema`). -- **MCP tool registration:** `packages/cli/src/context/mcp/context-tools.ts` - (register beside `connection_list`, `entity_details`, `sql_execution`); the - `connectionId → driver → dialect` resolution already exists for `sql_execution` - in `packages/cli/src/context/mcp/local-project-ports.ts` — route the new tool - through the same path. -- **The skill (one-line pointer only):** - `packages/cli/src/skills/analytics/SKILL.md` — add the tool pointer in step 4/5; - leave ``/``/``/`` otherwise intact. -- **Note storage (if files):** under the skill directory, shipped by - `packages/cli/scripts/copy-runtime-assets.mjs`'s recursive copy; served by the - tool, never installed. -- **Delivery (confirm unchanged):** `packages/cli/src/setup-agents.ts`. 
-- **Tests:** unit tests for resolution (including `sqlserver → tsql`, unknown → - `postgres`, and non-warehouse rejection); a registry-derived coverage test - (requirement 4); a content test that each dialect's notes cover the rubric - facets and contain no banned tokens; and an extension of spec 07's - `analytics/SKILL.md` content test asserting the new pointer is present and the - flat skill is still dialect-clean. Rebuild and re-link the dev binary so the - playground picks up the change: `pnpm run build && pnpm run link:dev`. - -## Benchmark context (motivation only) - -The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake -(`DB.SCHEMA.TABLE` FQTNs, double-quoted lower-case columns, VARIANT colon-paths), -BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite -(`strftime`/`julianday`). That content is real and useful but engine-specific; -spec 07 kept it out of the flat skill and deferred it here so the dialect-agnostic -rules stay clean. Delivering it through a dialect-scoped **ktx** tool generalizes -the same correctness benefit to every multi-engine **ktx** project — improving the -benchmark score is a side effect, not the goal, and the shipped skill contains no -trace of the benchmark. - -## Implementation notes - -Implemented on branch `write-feature-spec-wiki`, alongside spec 07. The committed -decision (dynamic MCP delivery, not multi-file skill bundling) was implemented as -specified — no `setup-agents.ts` change. - -**What was built** -- Per-dialect notes are markdown files under - `packages/cli/src/context/sql-analysis/dialects/.md` (one each for - `postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, `tsql`), - served by `sqlDialectNotes(dialect)` in `sql-analysis/dialect-notes.ts` (lazy - read + cache, `postgres` fallback floor; the authored set is the - `DIALECTS_WITH_NOTES` const). `duckdb`/`databricks` are intentionally unauthored - (unreachable from any connector). 
Each note answers the fixed rubric — FQTN, - identifier quoting/case-folding, date/time, top-N/window idiom, - JSON/semi-structured, plus a sharded-table line for BigQuery. Engine specifics - were verified against current docs via Context7 (Snowflake VARIANT colon-paths - and unquoted→UPPER case-folding; BigQuery `_TABLE_SUFFIX`, `QUALIFY`, - `JSON_VALUE`; ClickHouse `LIMIT n BY` and `JSONExtract*`, with no `QUALIFY`). The - files are package-internal — `copy-runtime-assets.mjs` ships them to `dist`; they - are never installed onto an agent target. -- New read-only MCP tool `sql_dialect_notes` (`context-tools.ts`): input - `{ connectionId }` (required), output `{ connectionId, dialect, notes }`, read-only - + idempotent annotations. It resolves through the **existing** - `connectionId → connection.driver → sqlAnalysisDialectForDriver` path (no second - driver→dialect map), implemented as the unconditional `dialectNotes` port in - `local-project-ports.ts` via an extracted `resolveDialectNotesForConnection`. A - non-SQL context source (gated by `isDatabaseDriver`) throws `KtxExpectedError` - ("not a SQL warehouse"), not postgres notes — so the expected agent mistake stays - out of Error Tracking. -- `connection-drivers.ts`: `KTX_DATABASE_DRIVER_IDS` is now an exported (`@internal`) - readonly tuple so the coverage test derives required coverage from the registry; - `isDatabaseDriver` behavior is unchanged. -- `skills/analytics/SKILL.md`: a single dialect-agnostic pointer in step 5 ("call - `sql_dialect_notes` … to get that engine's FQTN, identifier-quoting, date, top-N, - and JSON conventions"). It names the tool only; spec 07's `` block and - its dialect-clean content test are untouched. 
- -**Tests** -- `test/context/mcp/dialect-notes.test.ts`: registry-derived coverage (a future - connector fails the test until its dialect has notes), the full rubric per dialect, - leak isolation (sqlite shows `strftime` and never `VARIANT`/`_TABLE_SUFFIX`; - `QUALIFY` only on snowflake/bigquery; engine-exclusive markers stay put), no - benchmark/grader or version-dated content, the postgres fallback, and - `resolveDialectNotesForConnection` resolving sqlite / snowflake / `sqlserver→tsql` - and rejecting a non-SQL source / unknown connection with `KtxExpectedError`; plus a - guard that the `DIALECTS_WITH_NOTES` const and the `dialects/*.md` files stay in sync. -- `test/context/mcp/server.test.ts`: `sql_dialect_notes` added to the retained tool - set + annotations assertion + a handler-routing test, and the regenerated - `__snapshots__/mcp-tools-list.json`. -- `test/skills/analytics-skill-content.test.ts`: asserts the new pointer is present - and the flat skill stays dialect-clean. - -**Verification** — `tsc -p tsconfig.json` (src) clean; full default suite 393 files / -3001 passing; slow suite green (incl. `local-project-ports.test.ts`); all three -`dead-code` checks clean; the `dialects/*.md` files copy into `dist`. Rebuilt and -re-linked `ktx-dev`. - -**Deviations / notes** -- Notes are stored as per-dialect markdown files (not a typed map, and not bundled - `reference/*.md` skill files) — all sanctioned by the spec; plain markdown is the - most maintainable to edit. They are served by the tool and ship via a - `copy-runtime-assets.mjs` entry (`src/context/sql-analysis/dialects → dist/…`); no - `setup-agents.ts` change. -- `pnpm run type-check` still reports one pre-existing, unrelated error in - `test/mcp-server-factory.test.ts` (committed in-flight MCP work on this branch); - this change adds zero new type errors and does not touch that file. 
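The resolution behavior described in the notes above (`sqlserver → tsql`, a `postgres` fallback floor, and rejection of non-SQL sources) can be illustrated with a small sketch. This is not the ktx implementation, which is TypeScript. Every name here (`DATABASE_DRIVERS`, `DIALECT_ALIASES`, `ExpectedError`, `resolve_dialect`, and the `exampledb` and `wiki` drivers) is hypothetical and chosen only for illustration.

```python
# Illustrative only: the names and the driver list below are hypothetical,
# not the ktx registry or its TypeScript API.
DATABASE_DRIVERS = {
    "postgres", "mysql", "snowflake", "bigquery", "sqlite",
    "clickhouse", "sqlserver", "exampledb",
}
# Drivers whose dialect name differs from the driver id.
DIALECT_ALIASES = {"sqlserver": "tsql"}
# Dialects that have authored notes; anything else falls back to postgres.
DIALECTS_WITH_NOTES = {
    "postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "tsql",
}


class ExpectedError(Exception):
    """Stand-in for an 'expected' user-facing error (assumed, not the real class)."""


def resolve_dialect(driver: str) -> str:
    """Map a connection driver to the dialect whose notes should be served."""
    if driver not in DATABASE_DRIVERS:
        # A non-SQL source is rejected rather than silently given postgres notes.
        raise ExpectedError(f"{driver!r} is not a SQL warehouse")
    dialect = DIALECT_ALIASES.get(driver, driver)
    # Fallback floor: a database driver without authored notes gets postgres.
    return dialect if dialect in DIALECTS_WITH_NOTES else "postgres"
```

Keeping the rejection separate from the fallback is the point of the sketch: a database driver with no authored notes degrades to `postgres`, while a non-database source fails loudly instead of receiving misleading notes.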
diff --git a/spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md b/spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md deleted file mode 100644 index 5c75150b..00000000 --- a/spider2-specs/specs/09-fan-out-safe-multi-hop-aggregation.md +++ /dev/null @@ -1,362 +0,0 @@ -# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill - -> Refined spec. Intake draft: `todo/09-fan-out-safe-multi-hop-aggregation.md`. -> Extends spec 07 (`specs/07-analytics-skill-sql-craft.md`), which shipped the -> `` block. Additive, content-only. - -## Problem - -The shipped `ktx-analytics` skill -(`packages/cli/src/skills/analytics/SKILL.md`) already carries a single-hop -fan-out rule in `` → **Composition**: - -> **Avoid fan-out joins.** Add columns only from tables already at the target -> grain, or pre-aggregate to that grain before joining. A join that multiplies -> rows quietly inflates every downstream `SUM`/`COUNT`. - -In practice the agent honors that on a single join but still **silently -fans out on multi-hop join chains**, where the inflation is one or two joins -removed from the aggregate and therefore much harder to notice. - -The failure shape: a measure that lives at a *coarse* grain (one row per parent -record) is counted/summed *after* the parent has been joined down to a *finer* -grain (one row per child line). Every parent-level value is then duplicated by -its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by a data-dependent -amount — runnable SQL, plausible-looking number, quietly wrong. - -The rule today is stated only as a **prohibition** ("Avoid…"). It needs two -upgrades: (a) generalize it so the danger is understood as *cumulative across a -whole join chain*, not a single join; and (b) pair it with an **affirmative -verification habit** the agent runs while composing, so a grain change is -detected and fixed rather than merely warned against. 
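The failure shape described above can be reproduced in a few lines. The sketch below uses Python's standard `sqlite3` module with an invented two-table schema (`orders`, `order_lines`). The table names, column names, and values are assumptions made up for illustration and are not taken from the skill or from any dataset. It shows an order-level `amount` being summed after the join to the line grain, together with the row-count check that exposes the grain change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO orders VALUES (1, 100), (2, 50);
-- order 1 has three lines, order 2 has one
INSERT INTO order_lines VALUES (10, 1), (11, 1), (12, 1), (13, 2);
""")


def scalar(sql: str):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# Honest total at the order grain: each order's amount counted once.
honest = scalar("SELECT SUM(amount) FROM orders")

# Same measure summed after joining down to the line grain:
# order 1's amount is repeated once per line.
inflated = scalar("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN order_lines l ON l.order_id = o.order_id
""")

# Grain check: the row count changes across the join, which is the warning sign.
rows_before = scalar("SELECT COUNT(*) FROM orders")
rows_after = scalar("""
    SELECT COUNT(*)
    FROM orders o
    JOIN order_lines l ON l.order_id = o.order_id
""")

print(honest, inflated, rows_before, rows_after)
```

Both queries run without error and both return a plausible number; only the row-count comparison reveals that the second one aggregates at a different grain.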
- -## Generic use case (independent of any benchmark) - -An analyst on any production warehouse asks a counting/summing question whose -path runs through several one-to-many hops — e.g. *"how many orders per region -contain a returned item?"* where the path is `region → store → order → -order_line`. The honest answer counts each order once. The naïve join chain joins -`order_line` (to apply the line-level condition) and then counts orders, so an -order with three returned lines is counted three times. The inflation happens -**three joins below the `COUNT`**, where it is easy to miss. This is one of the -most common silently-wrong analytics mistakes on normalized schemas — not -specific to any dataset, dialect, or benchmark. - -## Model (invariants — the implementer owns the prose) - -These constrain the change; the exact wording is the implementer's. Each is -grounded in Anthropic's skill-authoring and prompt-engineering guidance so the -addition stays consistent with how spec 07 was written. - -### Additive, inline-only, dialect-agnostic (inherited from spec 07) - -The change is **additive content inside `skills/analytics/SKILL.md`** only — no -bundled `reference/*.md` file (the delivery path ships a single `SKILL.md` per -target; see spec 07 §Model "Inline-only delivery"). No new tool, flag, or config. -Every addition must read correctly on any dialect: **no** `QUALIFY`, -`strftime`/`julianday`, backtick/`DB.SCHEMA.TABLE` FQTNs, or other single-dialect -construct — including in the worked example. The existing ``, ``, -``, and the other four `` sub-headings are preserved -unchanged. - -### Heuristic-plus-*why*, because SQL authoring is a high-freedom task - -Anthropic's "set appropriate degrees of freedom" guidance classifies tasks with -many valid approaches where decisions depend on context as **high freedom → -text-based heuristics**, the "open field, many paths" case (versus low-freedom, -fragile operations that need an exact script). 
SQL authoring is squarely -high-freedom. So the new content is phrased as **heuristics with a one-line, -universal rationale**, never as bare `ALWAYS`/`NEVER` imperatives — matching the -existing `` style and Anthropic's "add context / explain why so Claude -generalizes" principle. - -### Affirmative framing for the verification step (do, not don't) - -Anthropic's prompt-engineering guidance is explicit: **"Tell Claude what to do -instead of what not to do."** The draft's requirement for "a detect-and-fix -*habit*, not just a prohibition" is the same principle. Therefore: - -- The **generalized rule keeps the established `Avoid fan-out joins` lead and the - term `fan-out`** — it is spec 07's consistent terminology and the existing - content test references that phrase; reframing it would churn shared vocabulary - for no gain. -- The **new verification step is phrased affirmatively** (e.g. *"Verify the grain - holds across each join"*) — an action the agent performs while composing, not a - warning. The two together satisfy both principles: a recognized anti-pattern - name *and* a positive habit. - -### One default with an escape hatch, not two equal options - -Anthropic: **"Avoid offering too many options… provide a default with an escape -hatch."** The fix for an inflated aggregate is presented as exactly that: - -- **Default: pre-aggregate the measure to its own grain in a CTE, then join the - already-aggregated result.** This is the single-hop fix generalized, and it is - the *only* correct fix for `SUM`/`AVG` — you cannot de-duplicate a summed - measure with `DISTINCT` (two legitimately-equal amounts would collapse). -- **Escape hatch: `COUNT(DISTINCT key)` — for a pure count only.** It rescues an - inflated count in one line, but must be stated as count-only, not as a general - remedy. - -This is the deepest correctness point in the spec and the easiest to get wrong; a -naïve blanket "just use `COUNT(DISTINCT)`" is silently wrong for sums. 
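The count-only limit on the escape hatch is easy to check by hand. The following sketch, again using Python's standard `sqlite3` module and an invented schema (all table names, column names, and values are assumptions for illustration), compares the two remedies on data where two different orders legitimately carry the same amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER);
-- orders 1 and 2 legitimately share the same amount
INSERT INTO orders VALUES (1, 100), (2, 100), (3, 40);
-- order 1 has two lines, so it fans out
INSERT INTO order_lines VALUES (10, 1), (11, 1), (12, 2), (13, 3);
""")


def scalar(sql: str):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


JOINED = "FROM orders o JOIN order_lines l ON l.order_id = o.order_id"

# Counts: COUNT(*) is inflated by the fan-out; COUNT(DISTINCT key) repairs it.
count_star = scalar(f"SELECT COUNT(*) {JOINED}")
count_distinct = scalar(f"SELECT COUNT(DISTINCT o.order_id) {JOINED}")

# Sums: the joined SUM is inflated, and SUM(DISTINCT ...) is wrong in the
# other direction because the two equal amounts collapse into one.
sum_joined = scalar(f"SELECT SUM(o.amount) {JOINED}")
sum_distinct = scalar(f"SELECT SUM(DISTINCT o.amount) {JOINED}")

# Default fix: collapse the fine-grain table to one row per order, then join.
sum_preaggregated = scalar("""
    WITH orders_with_lines AS (
        SELECT order_id FROM order_lines GROUP BY order_id
    )
    SELECT SUM(o.amount)
    FROM orders o
    JOIN orders_with_lines ol ON ol.order_id = o.order_id
""")

print(count_star, count_distinct, sum_joined, sum_distinct, sum_preaggregated)
```

On this data the honest total is 240. The joined sum over-counts, the `DISTINCT` sum under-counts, and only the pre-aggregated query matches, which is why pre-aggregation is the default and `COUNT(DISTINCT key)` is kept as a count-only shortcut.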
- -### Consistent terminology - -Anthropic: **"Choose one term and use it throughout."** Reuse spec 07's existing -vocabulary verbatim — **`grain`**, **`fan-out`**, **`pre-aggregate`** — do not -introduce synonyms (e.g. do not rename the concept "row blow-up" or -"multiplication factor"). Prose may vary, but the named concepts stay fixed. - -### Concise — the addition must justify its token cost - -Anthropic: **"Concise is key… does this paragraph justify its token cost?"** and -"Claude is already very smart." The agent knows what a join and a `GROUP BY` are; -the addition explains only the non-obvious trap (cumulative grain inflation) and -shows the fix. Net addition is roughly one rewritten bullet, one new bullet, and -one worked example — the skill stays comfortably under the 500-line budget -(~117 lines today). - -### Examples over descriptions — exactly one - -Anthropic's "examples pattern": **"Examples help Claude understand the desired -style and level of detail more clearly than descriptions alone"** and -"examples are concrete, not abstract." The multishot guidance favors 3–5 examples -in general, but here **conciseness and spec 07's one-example-per-rule economy -win**: the skill already carries the window-then-filter example, so this adds -**exactly one** compact wrong-vs-right example. The wrong/right contrast inside -that single example supplies the diversity multishot calls for, at one example's -token cost. - -### Leak-safety (hard constraint) - -The worked example must be a **synthetic, generic schema invented for teaching** — -not the tables, column names, query, or numeric results of any Spider 2.0-Lite -question. It demonstrates the *pattern* (a coarse-grain measure aggregated after a -one-to-many join), which is universal and reconstructable from first principles. A -reviewer must find nothing in it that ties it to a specific benchmark instance. -See "Leak-safety" below. 
- -## Requirements - -All four land in the **Composition** sub-heading of `` in -`packages/cli/src/skills/analytics/SKILL.md`. Structure (chosen design): rewrite -the existing fan-out bullet, add one affirmative verification bullet, add one -worked example. Do not touch the other four sub-headings or ``/``/ -``. - -### 1. Generalize the fan-out rule to multi-hop chains - -Rewrite the existing **`Avoid fan-out joins.`** bullet so it makes explicit that -the danger is **cumulative**: *any* one-to-many hop on the path between a measure's -owning table and the aggregate inflates that measure, **even when the offending -join is several hops away from the `SUM`/`COUNT`**. The fix is the same as the -single-hop case — **pre-aggregate the measure to its own grain in a CTE, then join -the already-aggregated result** — but the agent must apply it **per -measure-owning table along the whole chain**, not just at the final join. Keep the -`fan-out` term and the one-line *why*. - -### 2. Add an affirmative grain-verification habit - -Add a companion bullet, phrased as an action the agent performs **while -composing** (not a prohibition): - -- Confirm that a join intended to be one-to-one / many-to-one **did not change the - grain** it aggregates at — e.g. check that the row count (or the count of the - aggregate's key) is unchanged across that join. -- When a join is genuinely one-to-many, **reach for the default fix - (pre-aggregate to grain)**; for a **pure count**, `COUNT(DISTINCT key)` is an - acceptable escape hatch. -- State the caveat once: **`SUM`/`AVG` of a fanned-out measure must pre-aggregate** - — `DISTINCT` cannot de-duplicate a sum. - -This is spec 07's "build incrementally and check each layer" discipline pointed -specifically at grain preservation, in affirmative form. - -### 3. One concrete, generic multi-hop worked example - -Add **exactly one** compact wrong-vs-right `sql` example inside `` -demonstrating the multi-hop inflation and the pre-aggregate fix. 
It is the
-**second** `sql` fence in the skill (the first is spec 07's window-then-filter
-example).
-
-**Required properties** (these are the constraints; the SQL below is orientation):
-
-- **Multi-hop chain** where the inflating one-to-many hop is **≥1 join removed**
-  from the aggregate (not the single-hop case spec 07 already covers).
-- **Unambiguous attribution**: each counted entity maps to **exactly one** group,
-  so the honest answer is well-defined. (This rules out "coarse measure attributed
-  to a fine dimension reached by descending," where one entity spans several
-  groups and the correct number is itself ambiguous — that would teach a murky
-  pattern.)
-- **Motivated descent**: the finer-grain table is joined for a real reason (a
-  line-level filter or a needed line-level value), so the reader sees *why* the
-  fan-out join is there.
-- **Plain `COUNT`/`SUM`**, not `AVG` — averaging collides with the existing
-  *Macro vs micro average* bullet and would muddy the fan-out lesson.
-- The **RIGHT side demonstrates the default fix** (pre-aggregate to grain in a
-  CTE) and is **actually correct**, not merely runnable — its number must equal the
-  honest answer, not just avoid an error.
-- Generic invented schema, standard dialect-agnostic SQL (no `QUALIFY`, no dialect
-  functions), no benchmark identifiers or values.
-
-**Recommended sketch** (implementer may adjust within the properties above):
-
-```sql
--- "How many orders per region contain a returned item?"
--- WRONG: joining order_lines to apply the line-level filter multiplies orders —
--- an order with two returned lines is counted twice, three joins below the COUNT.
-SELECT r.region_id, COUNT(*) AS n_orders
-FROM regions r
-JOIN stores s ON s.region_id = r.region_id
-JOIN orders o ON o.store_id = s.store_id
-JOIN order_lines l ON l.order_id = o.order_id
-WHERE l.status = 'returned'
-GROUP BY r.region_id;
-
--- RIGHT: collapse order_lines to one row per qualifying order first, then join up.
-WITH returned_orders AS (
-  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
-)
-SELECT r.region_id, COUNT(*) AS n_orders
-FROM regions r
-JOIN stores s ON s.region_id = r.region_id
-JOIN orders o ON o.store_id = s.store_id
-JOIN returned_orders ro ON ro.order_id = o.order_id
-GROUP BY r.region_id;
--- A pure count could also use COUNT(DISTINCT o.order_id); a SUM/AVG of an
--- order-level measure fanned out this way must pre-aggregate — DISTINCT can't
--- de-duplicate a sum.
-```
-
-### 4. Placement and structure
-
-- Both bullets live under the existing **Composition** sub-heading; the example
-  follows them. The five-sub-heading structure spec 07 established is unchanged.
-- **State each rule once** (Anthropic "consistent terminology / don't repeat"):
-  do not also restate the multi-hop rule in `` steps 5/6 — those already
-  carry a one-line pointer into ``, which is sufficient.
-
-### 5. Coordination with spec 07 (supersession)
-
-Spec 07's requirement 3 and acceptance criteria say the skill contains **exactly
-one** worked example and "Do not add a second example." **This spec supersedes
-that constraint**: the skill now carries **two** `sql` worked examples
-(window-then-filter from spec 07, plus this multi-hop fan-out example). Annotate
-spec 07 at those two spots with a one-line "superseded by spec 09" note so the two
-permanent specs do not contradict. No other spec 07 content changes.
-
-## Leak-safety (hard constraint on this spec and its example)
-
-The benchmark's gold answers must never appear in ktx. The worked example must be
-a **synthetic, generic schema invented for teaching** — not the tables, column
-names, query, or numeric results of any Spider 2.0-Lite question. The example
-demonstrates the *pattern* (a coarse-grain measure counted after a one-to-many
-join), which is universal; it must be reconstructable from first principles by
-anyone, with zero reference to benchmark data.
A reviewer should be able to read -the example and find nothing that ties it to a specific benchmark instance. - -## Acceptance criteria - -- The `` **Composition** section states the **multi-hop generalization** - of the fan-out rule (cumulative danger across the chain; pre-aggregate per - measure-owning table) and an **affirmative grain-verification habit**, inline and - dialect-agnostic. -- The fix is presented as **default (pre-aggregate to grain) + escape hatch - (`COUNT(DISTINCT key)`, count-only)**, with the explicit caveat that `SUM`/`AVG` - of a fanned-out measure must pre-aggregate. -- Exactly **one** new, **generic** worked example (wrong vs. pre-aggregated-right) - using an invented schema, with no benchmark-derived identifiers or values, whose - RIGHT side is actually correct (unambiguous attribution; honest number). -- The skill now contains **two** `sql` worked examples total; the existing content - test's fence-count assertion is updated `1 → 2` and new assertions cover the - multi-hop rule phrase and the grain-verification-habit phrase. -- Terminology is consistent with spec 07 (`grain`, `fan-out`, `pre-aggregate`); no - synonyms introduced. -- **No new tool, flag, or config.** Skill-content only; additive to spec 07. -- All spec 07 invariants still hold: the skill remains dialect-agnostic (no - `QUALIFY`/`strftime`/`julianday`, no backtick three-part FQTN, no relative-time - anchoring to a `MAX(...)` date) and free of any benchmark/grader/gold reference, - including in the new example; ``/``/`` and the other - four sub-headings are intact; frontmatter still parses through - `SkillsRegistryService.parseFrontmatter`; the skill stays under 500 lines. -- Spec 07's "exactly one example" constraint is annotated as superseded (no - contradiction between the two permanent specs). - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the prose. 
- -- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md` → - `` → **Composition**. Rewrite the `Avoid fan-out joins` bullet, add - the affirmative grain-verification bullet, add the one worked example after them. - Leave the other four sub-headings, ``, ``, and `` - unchanged. -- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`. Update the - "ships exactly one … worked example" test: `match(/```sql/g)` length `1 → 2`, - add an assertion for the new fan-out example's distinctive tokens (e.g. - `WITH returned_orders AS`), add the multi-hop-rule and grain-verification-habit - phrases to the behavior-presence list, and keep all banned-construct and - size-budget guards. This is a content assertion over the source `SKILL.md` — the - right level for prompt content. -- **Spec 07 annotation:** add a one-line "superseded by spec 09" note at spec 07's - requirement 3 and at its "Exactly one new worked example" acceptance bullet. -- **Rebuild/re-link** the dev binary so the playground picks up the change: - `pnpm run build && pnpm run link:dev` (provides `ktx-dev`). - -## Benchmark context (motivation only) - -Multi-hop aggregation questions (counting/averaging a coarse-grained measure -reached through several one-to-many joins) are a recurring source of -result-mismatch failures in the SQLite subset: the agent produces runnable SQL -with the right tables but a fan-out-inflated number. These are correctness -failures, not knowledge or schema-discovery failures (zero execution errors in the -latest run), so the fix belongs in the product's authoring craft — where it also -helps any real analyst — not in a benchmark-specific prompt. The skill itself must -contain no trace of the benchmark. - -## Implementation notes - -Shipped as specified — additive, content-only, no new tool/flag/config. 
- -- **`packages/cli/src/skills/analytics/SKILL.md`** → `` → **Composition**: - - Rewrote the `Avoid fan-out joins` bullet to `**Avoid fan-out joins — the - danger is cumulative.**`, generalizing to multi-hop chains: any one-to-many - hop between a measure's owning table and the aggregate inflates that measure - even when several hops below the `SUM`/`COUNT`; fix is pre-aggregate per - measure-owning table along the whole chain. Kept the `fan-out` term and the - one-line *why*. - - Added the affirmative `**Verify the grain holds across each join.**` bullet: - confirm a one-to-one / many-to-one join did not change the grain (row/key - count unchanged); default fix is pre-aggregate to grain, escape hatch is - `COUNT(DISTINCT key)` for a pure count only; stated once that `SUM`/`AVG` of a - fanned-out measure must pre-aggregate because `DISTINCT` cannot de-duplicate a - sum. - - Added one generic wrong-vs-right worked example (orders→regions via - stores/order_lines, `WITH returned_orders AS …`) — the second `sql` fence in - the skill. The inflating hop is three joins below the `COUNT`; the RIGHT side - pre-aggregates `order_lines` to one row per qualifying order so each order is - counted once (honest answer), and the trailing comment names the count-only - `COUNT(DISTINCT o.order_id)` escape hatch plus the `SUM`/`AVG` caveat. Invented - schema, dialect-agnostic SQL, no benchmark identifiers/values. - - The other four sub-headings and ``/``/`` are - untouched. Skill is 147 lines (well under the 500-line budget). -- **`packages/cli/test/skills/analytics-skill-content.test.ts`**: sql-fence count - `1 → 2`; added the multi-hop phrase (`the danger is cumulative`) and the - grain-verification phrase (`Verify the grain holds across each join`) to the - behavior-presence list; added new-example token assertions - (`WITH returned_orders AS`, `COUNT(DISTINCT o.order_id)`). All banned-construct, - relative-time, and size-budget guards retained. Test file passes (9/9). 
-- **Spec 07** annotated as superseded at requirement 3 and at its "exactly one - worked example" acceptance bullet — no contradiction between the two permanent - specs. - -**Verification:** `vitest run test/skills/analytics-skill-content.test.ts` → 9 -passed. `pnpm run build` (src `tsc -p tsconfig.json`) succeeds and the built -`dist/skills/analytics/SKILL.md` carries the new content; `pnpm run link:dev` -re-linked `ktx-dev`. A pre-existing, unrelated type error in -`test/mcp-server-factory.test.ts` (`KtxMcpContextPorts`/`context_tool`, last -touched in commit `2677b3ef`) surfaces under the full `type-check`'s -`tsconfig.test.json` pass; it is outside this change's surface and not introduced -here. diff --git a/spider2-specs/specs/10-panel-completeness-spine.md b/spider2-specs/specs/10-panel-completeness-spine.md deleted file mode 100644 index 983f01b1..00000000 --- a/spider2-specs/specs/10-panel-completeness-spine.md +++ /dev/null @@ -1,289 +0,0 @@ -# Panel/period completeness — emit the full set of groups, not only the populated ones - -> Refined spec. Intake draft: `todo/10-panel-completeness-spine.md`. - -## Problem - -When a question asks for a result *per period* or *per category* ("orders for -each month of 2023", "revenue by region", "count per status"), a plain `GROUP BY` -only returns groups that actually have rows. Periods or categories with **zero** -activity silently vanish, so a "12 months" answer comes back with 9 rows and the -three that should read `0` are simply absent. The SQL is runnable and the -aggregate is right, but the **panel is incomplete** — and a monthly report with -missing months or a category breakdown missing its empty categories is wrong for -any analyst, on any database. - -The existing `` "Answer completeness / interpretation" group already -carries a *"For each X / per X / by X returns exactly one row per X"* rule, but -that rule only governs **grain** (don't collapse to a single value). 
It says -nothing about the **domain**: "one row per X" today means one row per *observed* -X, so empty groups still drop. This spec sharpens that rule from grain-only to -grain-and-completeness. - -## Generic use case (independent of any benchmark) - -"How many orders were placed in each month of 2023?" must return **12 rows** even -if March had no orders (March = 0), not 11. "Sales per region" should include -regions with no sales when the question asks for *each* region. Both are -bread-and-butter reporting for any analyst on any warehouse, with no benchmark in -sight. - -## Model - -The feature splits across **two surfaces**, each holding the half it is suited -for. This split is the central design decision and exists to satisfy spec 07's -hard dialect-agnostic invariant without weakening it. - -### Why two surfaces (the dialect-agnostic reconciliation) - -The draft asked for a *"recursive-CTE date spine"* worked example. But a real -date/number series is **inherently dialect-specific** — Postgres `generate_series`, -SQLite recursive `date(d,'+1 month')`, BigQuery `GENERATE_DATE_ARRAY`, Snowflake -`GENERATOR`+`DATEADD` — and spec 07 made `` strictly dialect-agnostic -(the analytics-skill content test bans single-dialect constructs). Inlining a date -spine would violate that invariant; carving out a test exception would erode it. - -ktx already has the canonical home for engine-specific syntax: the per-dialect -notes in `packages/cli/src/context/sql-analysis/dialects/.md`, served by -the `sql_dialect_notes` MCP tool (spec 08). Those files answer a fixed rubric -(FQTN / Identifiers / Date-time / Top-N / JSON) — but **series/spine generation is -not in that rubric yet**. So the date-spine syntax belongs *there*, alongside the -other per-dialect idioms, and the dialect-agnostic skill points to it. This -routes the dialect-specific half through the existing channel rather than -standing up a parallel dialect-specific recipe inside the skill. 
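To make the dialect-specific half concrete, here is a minimal sketch of a period spine on one engine only, SQLite, driven from Python's standard `sqlite3` module. The `orders` table and its rows are invented for illustration. Because the recursive `date(d, '+1 month')` series and `strftime` are SQLite idioms, this kind of SQL would belong on the dialect-notes surface rather than inline in the dialect-agnostic skill.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
-- no orders in March or in May through November
INSERT INTO orders (order_date) VALUES
  ('2023-01-15'), ('2023-01-20'), ('2023-02-03'), ('2023-04-10'), ('2023-12-31');
""")

# Plain GROUP BY: only months that have at least one order come back.
plain = conn.execute("""
    SELECT strftime('%Y-%m', order_date) AS month, COUNT(*) AS n_orders
    FROM orders
    WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
    GROUP BY month
    ORDER BY month
""").fetchall()

# Spine -> LEFT JOIN -> COALESCE: one row for every month of 2023.
spined = conn.execute("""
    WITH RECURSIVE months(d) AS (
        SELECT '2023-01-01'
        UNION ALL
        SELECT date(d, '+1 month') FROM months WHERE d < '2023-12-01'
    ),
    monthly AS (
        SELECT strftime('%Y-%m', order_date) AS month, COUNT(*) AS n_orders
        FROM orders
        WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
        GROUP BY month
    )
    SELECT strftime('%Y-%m', m.d) AS month, COALESCE(mo.n_orders, 0) AS n_orders
    FROM months m
    LEFT JOIN monthly mo ON mo.month = strftime('%Y-%m', m.d)
    ORDER BY month
""").fetchall()

print(len(plain), len(spined))
```

The plain query returns only the populated months, while the spined query returns all twelve with zeros filled in. The series bounds come from the question's stated range (2023), not from the data.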
- -Surface 1 (skill) carries the **pattern**; surface 2 (dialect notes) carries the -**concrete series syntax**. - -### Additive, inline, heuristic-with-a-why - -Consistent with spec 07: the skill change is **additive content in one Markdown -file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/` file — the -delivery mechanism in `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, -and phrased as a **heuristic with a one-line generic rationale**, not a wall of -MUSTs. The dialect-notes change is additive content in the seven existing -`dialects/*.md` files. No new tool, flag, or config on either surface. - -## Requirements - -### 1. Skill surface — `` "Answer completeness / interpretation" - -Add the panel-completeness rule to the existing group (it extends, and should sit -adjacent to, the *"For each X / per X / by X"* bullet). It must cover: - -1. **Recognize the full-panel cue.** *each / every / all / per / for all - / by month* signals that the answer's row set should be the - **complete expected domain** of periods or categories in scope, not just those - present in the filtered fact rows. *Why:* a plain inner `GROUP BY` can only emit - groups that have at least one fact row. - -2. **Spine → LEFT JOIN → COALESCE.** Build the full set of expected groups (the - **spine**), then LEFT JOIN the aggregated facts onto it: - - **Category/dimension spine:** the distinct values from the **domain-defining - dimension/entity table** (e.g. all regions from a `regions` table), *not* - `SELECT DISTINCT region FROM facts` — the latter yields only categories that - already occur, so a zero-activity category still drops. When no dimension - table exists, the distinct values from the **unfiltered** fact table are the - best available domain (with the residual caveat that a category which never - occurs at all cannot surface). - - **Period/number spine:** generate the series for the question's stated range - (e.g. each month of 2023 → Jan..Dec 2023). 
The series bounds come from the - question's explicit range; when the range is "all periods present," derive - bounds from `MIN`/`MAX` over the **unfiltered** facts. The concrete - series-generation syntax is per-dialect — the rule points the author to - `sql_dialect_notes` (see requirement 2) and shows no inline series SQL. - -3. **COALESCE by measure additivity.** Default missing measures with - `COALESCE(metric, 0)` for **additive** measures (a `COUNT` or `SUM` of events - or amounts — "no activity" genuinely reads as 0). Leave **non-additive** - measures (`AVG`, a running balance, a price, a rate, a ratio) as **NULL** — - absence is "no data," and 0 would be a wrong reading. *Why:* 0 is a real value - only for additive measures. - -4. **Don't over-apply (the each-vs-which guard).** When the question asks only - about groups that exist ("*which* months had orders", "regions that made a - sale"), the spine is unnecessary and wrong — emit only observed groups. The cue - is *each / all / every* (complete domain) vs *which / that have* (observed - subset). - -5. **One worked example — the category spine, fully portable.** Add **exactly - one** compact before/after example demonstrating the pattern with a - **distinct-dimension spine**: the wrong shape (`GROUP BY` over facts, empty - groups missing) and the right shape (`SELECT DISTINCT` domain from the - dimension table → LEFT JOIN aggregated facts → `COALESCE(metric, 0)`). Generic - table/column names, standard SQL only — no series generation, no dialect - functions, so the example stays dialect-clean. The period-spine variant is - described in prose (requirement 2) and delegated to `sql_dialect_notes`; it - gets **no** inline example. This is the **third** worked `sql` example in the - skill (after spec 07's window-then-filter and spec 09's multi-hop fan-out). - -6. 
**Step pointer, no duplication.** The validate/explain step (and/or the query - step) already points into `` for answer-completeness; extend that - existing pointer's wording if needed, but state the rule **once** inside - ``. The step-5 pointer that lists what `sql_dialect_notes` provides - ("FQTN, identifier-quoting, date, top-N, and JSON conventions") should also - name the **series/calendar** convention now that it exists. - -### 2. Dialect-notes surface — `dialects/*.md` - -Add a **"Series"** (date/number range) line to **each** of the seven authored -dialect files, giving that engine's idiomatic way to generate a contiguous -date or integer series for use as a spine. Each note is engine-exclusive — a -SQLite analyst gets the SQLite idiom and never another engine's construct, per the -existing dialect-notes leak guards. Orientation (exact syntax is the -implementer's): - -- **postgres:** `generate_series('2023-01-01'::date, '2023-12-01'::date, interval '1 month')`. -- **sqlite:** recursive CTE — `WITH RECURSIVE m(d) AS (SELECT '2023-01-01' UNION ALL SELECT date(d,'+1 month') FROM m WHERE d < '2023-12-01')`. -- **bigquery:** `UNNEST(GENERATE_DATE_ARRAY('2023-01-01','2023-12-01', INTERVAL 1 MONTH))` (and `GENERATE_ARRAY` for integers). -- **snowflake:** `TABLE(GENERATOR(ROWCOUNT => n))` with `DATEADD('month', SEQ4(), start)`, or a recursive CTE. -- **mysql:** recursive CTE (8.0+) with `DATE_ADD(d, INTERVAL 1 MONTH)`. -- **clickhouse:** `numbers(n)` / `range(n)` with `addMonths(start, number)` (or `arrayJoin`). -- **tsql:** recursive CTE with `DATEADD(month, …)`, or a numbers/tally table. - -This line is what makes the period spine usable from the dialect-agnostic skill, -and it is also consumed by **spec 11** (rolling-window-over-gappy-dates needs the -same date spine) — so it is foundational, not scope creep. - -### 3. 
Coordination with spec 11 - -Spec 11 (time-series window recipes) explicitly depends on this date spine for the -gappy-rolling case ("build a complete date spine first (see spec 10)"). Spec 10 -establishes the spine concept in the Answer-completeness group and the -series syntax in the dialect notes; spec 11 reuses both from the Window-functions -group. Keep the two non-overlapping: spec 10 owns the spine; spec 11 references it. - -## Leak-safety (hard constraint) - -Any worked example or note must use a **synthetic generic schema** (e.g. an -`orders` table with an `order_date`, a `regions` dimension) and demonstrate only -the *pattern* (spine + LEFT JOIN + COALESCE). **No** benchmark table names, SQL, -or result values on either surface. The dialect-notes additions, like the existing -notes, carry no benchmark/grader/version-dated content. The behavior is -reconstructable from first principles and tied to no specific instance. - -## Acceptance criteria - -- `` "Answer completeness / interpretation" states: the full-panel cue, - the spine → LEFT JOIN → COALESCE recipe, the additive-vs-non-additive COALESCE - discriminator (0 vs NULL), and the each-vs-which over-application guard — - inline, dialect-agnostic, each with a generic *why*. -- Exactly **one** new worked `sql` example is present, a portable - distinct-dimension spine (`SELECT DISTINCT` domain → LEFT JOIN → `COALESCE`), - with no series generation and no dialect-specific syntax. The skill then carries - **three** `sql` worked examples total. -- Each of the seven `dialects/*.md` files gains a **Series** (date/number range) - line in its engine's own idiom; no engine leaks another engine's construct, and - the additions contain no benchmark/grader/version-dated content. -- The skill remains dialect-clean: no `QUALIFY`, `strftime`, `julianday`, - `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, or other - single-dialect construct anywhere in `SKILL.md`, including the new example. 
-- The existing interactive guidance (``, ``, the other examples) - and the existing dialect-note rubric lines are intact and uncontradicted. -- No grader/benchmark reference, no output-shape contract, and no anchoring of - *relative* time ("recent" / "past N months") to a `MAX(date)` over the data - appears (period-spine bounds derive from the question's explicit range or, for - "all periods present," from `MIN`/`MAX` over the facts — which is range - derivation, not relative-time anchoring). -- The skill stays scannable and comfortably under the 500-line budget; frontmatter - still parses as `ktx-analytics`. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the prose. - -- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the - panel-completeness bullets to the Answer-completeness group, the single category - spine example, and extend the existing step pointer / dialect-notes provision - list to name the series convention. Leave ``/``/other examples - intact. Delivery is unchanged (single `SKILL.md` per target via - `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change required. -- **Dialect notes:** the seven files under - `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with - `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by - `copy-runtime-assets.mjs` — no plumbing change, content only. -- **Tests:** - - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a - representative phrase for the completeness rule; bump the `sql`-fence count - assertion **2 → 3**; assert the spine + LEFT JOIN + `COALESCE` shape; the - existing dialect-clean guards already cover the no-inline-series requirement - (the example is `SELECT DISTINCT`, so they pass unchanged). 
- - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the rubric loop - (the "answers the full rubric for every dialect" test) so every dialect must - also answer a **Series** line, e.g. `expect(notes).toMatch(/\*\*Series/)`. - Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces - all seven without a hand-maintained list. -- Rebuild and re-link the dev binary so the playground picks up both surfaces: - `pnpm run build && pnpm run link:dev`. - -## Benchmark context (motivation only) - -Per-period / per-category questions where some periods are empty produce -short-row result mismatches in the SQLite subset, and the related rolling/cumulative -cluster (spec 11) needs a complete date spine to be correct at all. The fix is a -universal reporting habit (complete panels) plus the per-dialect series syntax -that makes it executable — both belong in the product, where they help real -analysts. Improving the benchmark score is a side effect; the skill and the -dialect notes contain no trace of the benchmark. - -## Implementation notes - -Shipped on branch `write-feature-spec-wiki`. Content-only across two surfaces, no -new tool/flag/config, no plumbing change. 
- -**Surface 1 — skill (`packages/cli/src/skills/analytics/SKILL.md`):** -- Added a **"Complete the panel for 'each / every / all / per '"** bullet to the `` "Answer completeness / interpretation" - group, directly after the *"For each X / per X / by X"* bullet, with three - sub-bullets carrying the rest of the rule each with its generic *why*: **Spine - source** (distinct domain from the dimension/entity table — not `SELECT DISTINCT` - over the facts; period/number series across the question's stated range, bounds - from `MIN`/`MAX` over the *unfiltered* facts for "all periods present"; series - syntax delegated to `sql_dialect_notes`), **Default by additivity** - (`COALESCE(metric, 0)` for additive measures, `NULL` for non-additive), and - **Don't over-apply** (the each-vs-which guard). -- Added **one** worked `sql` example at the end of the Answer-completeness group: a - portable distinct-dimension spine (`SELECT DISTINCT region_id FROM regions` → - `LEFT JOIN` aggregated facts → `COALESCE(ro.n_orders, 0)`), wrong-vs-right, - standard SQL only, no series generation, no dialect functions. The skill now - carries **three** `sql` worked examples. -- Extended the step-5 dialect-notes pointer to name the **series/calendar** - convention alongside FQTN / identifier-quoting / date / top-N / JSON. -- Delivery unchanged: `readAnalyticsSkillContent` in `setup-agents.ts` ships the - single `SKILL.md` per target — confirmed, no change. - -**Surface 2 — dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`):** -- Added a `- **Series:**` line to all seven authored files (postgres, sqlite, - bigquery, snowflake, mysql, clickhouse, tsql), each in that engine's own idiom - (`generate_series`; recursive CTE with `date(d,'+1 month')`; - `UNNEST(GENERATE_DATE_ARRAY(...))`; `GENERATOR`/`SEQ4`/`DATEADD`; recursive CTE - with `DATE_ADD`; `numbers(n)`/`addMonths`; recursive CTE with `DATEADD` + - `MAXRECURSION`), placed right after each file's Date/time line. 
No cross-engine - leak, no version-dated/benchmark content. Shipped to `dist` unchanged by - `copy-runtime-assets.mjs`; coverage stays derived from `DIALECTS_WITH_NOTES`. - -**Tests:** -- `test/skills/analytics-skill-content.test.ts`: added the `Complete the panel` - and `Default by additivity` phrases; renamed the worked-examples test and bumped - the `sql`-fence count **2 → 3**; asserted the spine + `LEFT JOIN` + `COALESCE` - shape. Also added `generate_series` and `GENERATE_DATE_ARRAY` to the - dialect-clean banned list — a deliberate **strengthening** beyond the spec's - test orientation so the "no inline series" acceptance criterion is *enforced*, - not merely incidentally true of a `SELECT DISTINCT` example. -- `test/context/mcp/dialect-notes.test.ts`: extended the "answers the full rubric - for every dialect" loop with `expect(notes).toMatch(/\*\*Series/)`, so all seven - dialects are required to answer a Series line (coverage derived from - `DIALECTS_WITH_NOTES`, no hand-maintained list). - -**Verification:** both affected test files pass (19 tests). `src` type-check and -`pnpm run build` are clean, and `copy-runtime-assets.mjs` placed the Series line in -all seven `dist` dialect files; `pnpm run link:dev` re-linked `ktx-dev`. Note: an -unrelated, pre-existing `tsconfig.test.json` type error in -`test/mcp-server-factory.test.ts` exists on this branch — untouched by this work -and outside its scope. - -**Coordination with spec 11:** the per-dialect Series line is the foundational -date spine that spec 11 (rolling/cumulative windows over gappy dates) references. -Spec 10 owns the spine (Answer-completeness group + dialect Series notes); spec 11 -will reference it from the Window-functions group. No overlap introduced. 
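As a rough illustration of the spine, LEFT JOIN, COALESCE pattern that spec 10 describes, the sketch below runs the wrong and right shapes against an in-memory SQLite database. The `regions` and `orders` tables, their columns, and the sample rows are hypothetical and exist only for this demonstration. The SQL is plain standard SQL, wrapped in Python only so that it can be executed; it is not the example shipped in `SKILL.md`.

```python
import sqlite3

# Hypothetical synthetic schema: a `regions` dimension and an `orders` fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, region_id INTEGER, order_date TEXT);
    INSERT INTO regions VALUES (1, 'North'), (2, 'South'), (3, 'West');
    INSERT INTO orders  VALUES (10, 1, '2023-01-05'), (11, 1, '2023-02-01'), (12, 3, '2023-01-20');
""")

# Wrong for an "each region" question: grouping the facts alone
# silently drops every region that has no orders.
facts_only = conn.execute("""
    SELECT region_id, COUNT(*) AS n_orders
    FROM orders
    GROUP BY region_id
    ORDER BY region_id
""").fetchall()

# Right: take the spine from the dimension table, LEFT JOIN the aggregated
# facts, and default the additive count to 0 where a region has no rows.
full_panel = conn.execute("""
    WITH spine AS (
        SELECT DISTINCT region_id FROM regions
    ),
    ro AS (
        SELECT region_id, COUNT(*) AS n_orders
        FROM orders
        GROUP BY region_id
    )
    SELECT s.region_id, COALESCE(ro.n_orders, 0) AS n_orders
    FROM spine AS s
    LEFT JOIN ro ON ro.region_id = s.region_id
    ORDER BY s.region_id
""").fetchall()

print(facts_only)  # region 2 is missing
print(full_panel)  # region 2 appears with a count of 0
```

A non-additive measure such as an average would keep `NULL` instead of `0` in the final projection, following the spec's additive-vs-non-additive discriminator.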
diff --git a/spider2-specs/specs/11-time-series-window-recipes.md b/spider2-specs/specs/11-time-series-window-recipes.md deleted file mode 100644 index 95bf3811..00000000 --- a/spider2-specs/specs/11-time-series-window-recipes.md +++ /dev/null @@ -1,391 +0,0 @@ -# Time-series window craft — running totals, rolling-over-time (min-periods), period-over-period - -> Refined spec. Intake draft: `todo/11-time-series-window-recipes.md`. - -## Problem - -A large share of analytics questions are time-series shaped: a **running / -cumulative balance**, a **rolling N-day average**, or **period-over-period -growth**. The agent already knows window functions exist — spec 07 gave the -`` "Window functions" group its determinism and window-then-filter -rules, and spec 10 added panel/period completeness — but it still gets the -*time-series specifics* wrong: - -- a cumulative balance computed **without an explicit unbounded-preceding - frame**, or with the implicit frame misbehaving when there are **ties on the - order key**; -- "rolling 30 days" implemented as `ROWS BETWEEN 29 PRECEDING` over **gappy** - daily data, so the window spans the wrong calendar span when days are missing; -- no **minimum-periods** handling — a rolling average reported before the window - is actually full; -- "growth vs the previous period" written **without `LAG`** (or against the wrong - neighbor), with an **unguarded** `(cur - prev) / prev` that breaks on a zero or - absent prior. - -These are runnable-but-wrong: the structure is close, the edge case diverges. -It is the same failure shape spec 07 addressed at the general level; this spec -adds the time-series specifics to the **same Window-functions group**, building -on the rules already there rather than restating them. - -## Generic use case (independent of any benchmark) - -- "Each account's month-end running balance over 2023" — a cumulative sum of - monthly net over an ordered window. 
-- "30-day rolling average of daily revenue, only once 30 days of history exist." -- "Month-over-month revenue growth rate." - -All three are bread-and-butter for any analyst on any time-series table, with no -benchmark in sight. The methodology is universal analyst craft, so it belongs in -the shipped skill — it transfers to every ktx user querying a live database. - -## Model - -The change is **additive content across two surfaces** — the same split spec 10 -made, and for the same reason. The split is the central design decision; it -satisfies spec 07's hard dialect-agnostic invariant for `` without -weakening it. - -### Why two surfaces (the dialect-agnostic reconciliation) - -Two of the three recipes are **pure standard SQL** and stay entirely in the -dialect-agnostic skill: - -- **Cumulative / running total** — `SUM(x) OVER (... ROWS BETWEEN UNBOUNDED - PRECEDING AND CURRENT ROW)` is standard on every engine. -- **Period-over-period** — `LAG(metric) OVER (...)`, the growth ratio, and a - `NULLIF`-style divide-by-zero guard are standard on every engine. - -The third recipe — a **rolling window over calendar time** — has one piece that -is genuinely dialect-divergent: the **calendar-range window frame**. A native -range frame such as `RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW` -exists on some engines (e.g. postgres, mysql 8) but **not others** — sqlite has -no date-interval range frame, and SQL Server has **no offset `RANGE` frames at -all**; bigquery's `RANGE` frames are numeric-only. So a portable skill cannot -inline a range frame any more than it could inline a date-series generator. - -ktx already routes that kind of engine-specific syntax through the per-dialect -notes in `packages/cli/src/context/sql-analysis/dialects/.md`, served by -the `sql_dialect_notes` MCP tool (spec 08). 
Spec 10 established the precedent -exactly: series/spine generation was not in the dialect rubric, so it was added -there (the **Series** line) and the dialect-agnostic skill points to it. -Rolling-window framing is the next construct in that same position — not in the -rubric yet, dialect-specific — so the **rolling-window idiom belongs in the -dialect notes**, and the skill points to it. - -Surface 1 (skill) carries the **pattern** (calendar range, not a row count; the -min-periods guard; the spine-or-range choice). Surface 2 (dialect notes) carries -the **concrete rolling-window frame syntax** per engine. - -### Additive, inline, heuristic-with-a-why - -Consistent with specs 07 and 10: the skill change is **additive content in one -Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/` -file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, and phrased as -**heuristics with a one-line generic rationale**, not a wall of MUSTs. The -dialect-notes change is additive content in the seven existing `dialects/*.md` -files. No new tool, flag, or config on either surface. - -### Build on the rules already present; do not restate them - -The Window-functions group already carries **"Make the ordering deterministic"** -(complete tie-breaker) from spec 07, and the Numeric-precision group carries -**"Round only at the end."** The cumulative and period-over-period recipes -**reference** these rather than repeat them (state each rule once — Anthropic's -"consistent terminology / don't repeat" guidance, already followed in spec 07). -Spec 10's **Series** dialect line is likewise **referenced** by the rolling -recipe's spine fallback, not duplicated. - -## Requirements - -### 1. Skill surface — `` "Window functions" group (three recipes) - -Add three recipes to the **existing** "Window functions" group, after its two -current bullets (deterministic ordering; filter-after-the-window). Each is a -heuristic with a generic *why*, dialect-agnostic. 
- -1. **Cumulative / running total.** Use an **explicit frame** — `SUM(x) OVER - (PARTITION BY k ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` — - with a **complete tie-breaker** on the `ORDER BY` (per the group's existing - deterministic-ordering rule; reference it, do not restate). *Why:* a bare - `ORDER BY` defaults to a `RANGE … CURRENT ROW` frame, which on **ties in the - order key** folds every tied peer into the same cumulative value — it runs and - looks plausible, but the running total jumps at each tie boundary. - -2. **Rolling window over calendar time, plus minimum periods.** "Rolling N - days/months" must span a **calendar range**, not a fixed row count: a `ROWS - BETWEEN n-1 PRECEDING` frame silently measures the wrong span when days are - missing. Two sanctioned techniques: - - **Spine + `ROWS` (portable).** Build a gap-free date spine first (spec 10's - **Series**, via `sql_dialect_notes`) so the data has one row per calendar - unit; then a `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` frame equals the - intended calendar span. This path is fully dialect-agnostic. - - **Native range frame or date-keyed self-join (engine-specific).** Where the - engine supports it, a calendar **range frame** expresses the window directly; - otherwise a self-join keyed on the date does. Both use engine-specific - syntax — get the **rolling-window** idiom from `sql_dialect_notes` (see - requirement 3); show no inline range frame in the skill. - - **Minimum periods.** When the question says "only after N periods of data" (or - a rolling metric implies it), emit `NULL` / skip until the window is actually - full — guard on a window count, e.g. `COUNT(*) OVER (...) = N`. On a - gap-free spine, `COUNT(*)` counts calendar slots; count the **non-null - observations** instead when "N periods" means N data points rather than N - calendar units. 
*Why:* a row-count frame over missing dates measures the wrong - span, and a partial early window is not the requested metric. - -3. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)` - for the prior-period comparison; compute growth as `(cur - prev) / prev` at - **full precision**, rounding only in the final projection (per the existing - "Round only at the end" rule), and **guard divide-by-zero / NULL prev** - (e.g. divide by `NULLIF(prev, 0)`). *Why:* without `LAG` — or ordered against - the wrong neighbor — the comparison lands on the wrong period, and an unguarded - ratio errors or returns garbage when the prior period is zero or absent. - -**Step pointer (no duplication).** The step-5 `sql_dialect_notes` provision list -(currently "FQTN, identifier-quoting, date, top-N, series/calendar, and JSON -conventions") should also name the **rolling-window** convention now that it -exists. State each rule once inside ``; the workflow steps only point -to it. - -### 2. One worked example — cumulative running total (dialect-agnostic) - -Add **exactly one** new compact before/after `sql` example, demonstrating the -**cumulative running total** — the subtlest of the three (the implicit-frame trap -runs fine and is wrong only at tie boundaries) and the highest-value to show. -Use a synthetic generic schema (e.g. `account_txns(account_id, txn_date, net)`): - -- **Wrong:** `SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date)` — the - implicit `RANGE` frame makes two txns on the same date share one inflated - running balance. -- **Right:** the same with an explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND - CURRENT ROW` frame and a complete tie-breaker (`ORDER BY txn_date, txn_id`). - -Standard SQL only — no `QUALIFY`, no dialect functions, no series generation, no -`RANGE … INTERVAL`. Keep it ~10–14 lines. 
The **rolling-over-time** recipe gets -**no** inline example (its correct form needs the engine-specific frame/spine, -delegated to `sql_dialect_notes`, exactly as spec 10's period-spine variant was -prose-only); the **period-over-period** recipe is self-evident from its bullet -and also gets no example. This is the **fourth** worked `sql` example in the -skill, after spec 07 (window-then-filter), spec 09 (multi-hop fan-out), and -spec 10 (panel-completeness spine). - -### 3. Dialect-notes surface — `dialects/*.md` (rolling window) - -Add a **rolling-window-over-time** idiom line to **each** of the seven authored -dialect files, parallel to spec 10's **Series** line. Each note is -engine-exclusive — a SQLite analyst gets the SQLite idiom and never another -engine's construct, per the existing dialect-notes leak guards. Each note either -gives the engine's native calendar-range frame **or** references its own -**Series** line for the spine + `ROWS` fallback (a cross-reference within the -file, not a duplicate of the Series line). - -Orientation only — **`RANGE`-frame support genuinely varies by engine and -version, so the implementer must verify each engine's current support against -authoritative docs (context7 / the engine's manual) rather than assert it from -memory.** Starting points: - -- **postgres:** native — `... OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days' - PRECEDING AND CURRENT ROW)`. -- **mysql (8.0+):** native — `RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT - ROW` over a temporal order key. -- **bigquery:** `RANGE` frames are **numeric** — range over an integer day key - (e.g. `UNIX_DATE(day)`) with `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or - build a spine (see **Series**) and use a `ROWS` frame. -- **sqlite:** **no** date-interval range frame — build a date spine (see - **Series**) and use a `ROWS` frame. 
-- **tsql (SQL Server):** **no** offset `RANGE` frames at all — build a spine (see - **Series**) and use a `ROWS` frame, or a date-keyed self-join. -- **snowflake / clickhouse:** range-frame support over dates is limited — verify; - default to a spine (see **Series**) + `ROWS` frame where a native calendar range - frame is unavailable. - -This line is what makes the rolling-over-time recipe executable from the -dialect-agnostic skill. It is **distinct** from spec 10's Series line (Series = -how to *generate* a spine; Rolling window = how to compute a *moving -calendar-range aggregate*, natively or via that spine), and it cross-references -the Series line rather than overlapping it. - -### 4. Explicit constraints / exclusions - -None of the following may appear (consistent with specs 07 and 10): - -- **No inline dialect-specific range-frame syntax in the skill** — no - `RANGE … INTERVAL` frame, no series generator, no dialect function. The skill - stays dialect-clean; the range frame lives only in the dialect notes. -- **No anchoring of relative time to `MAX(date)`.** "Recent" / "past N months" - means relative to *now* on a live database. A range *bound* may be derived from - the question's explicit range or, for "all periods present," from `MIN`/`MAX` - over the **unfiltered** facts (range derivation, per spec 10) — but the metric - must never silently redefine "recent" as the data's maximum date. -- **No grader / gold-answer / benchmark reference**, and no output-shape contract - (the skill is for interactive analysis). - -### 5. Coordination with specs 07 and 10 - -All three recipes live in the **existing** `` "Window functions" -group; the two current bullets and the spec-07 window-then-filter example must -stay intact and uncontradicted. - -- **Spec 07** owns the deterministic-ordering rule (Window functions) and the - round-at-the-end rule (Numeric precision). Spec 11 **builds on** both — - references them, never restates them. 
-- **Spec 10** owns the spine concept and the dialect **Series** line. Spec 11 - **references** the spine for the gappy-rolling fallback and adds the **distinct** - rolling-window dialect line. Keep them non-overlapping: spec 10 = how to make a - spine; spec 11 = how to compute a moving calendar-range aggregate (native frame - or spine + `ROWS`). - -## Leak-safety (hard constraint) - -Every worked example or note uses a **synthetic generic schema** (e.g. -`daily_revenue(day, amount)` or `account_txns(account_id, txn_date, net)`) and -shows only the *pattern*. **No** benchmark table names, SQL, or result values on -either surface. The dialect-notes additions, like the existing notes, carry no -benchmark / grader / version-dated content. The behavior is reconstructable from -first principles and tied to no specific instance. - -## Acceptance criteria - -- The `` "Window functions" group states the three recipes — inline, - dialect-agnostic, each with a generic *why*, and each **building on** (not - restating) the deterministic-ordering and round-at-the-end rules: - - **cumulative / running total** with an explicit `ROWS BETWEEN UNBOUNDED - PRECEDING AND CURRENT ROW` frame and a complete tie-breaker; - - **rolling window over calendar time + minimum periods** — calendar range not - row count, the spine-or-range choice, the min-periods `COUNT(*) OVER (...)` - guard — delegating the engine's range-frame syntax to `sql_dialect_notes`; - - **period-over-period** via `LAG`, with full-precision growth and a - divide-by-zero / NULL-prev guard. -- Exactly **one** new worked `sql` example: the cumulative running total, - wrong-vs-right, with the explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT - ROW` frame and a complete tie-breaker, in standard dialect-agnostic SQL. The - skill then carries **four** `sql` worked examples total. 
-- Each of the seven `dialects/*.md` files gains a **rolling-window-over-time** - idiom line in its engine's own idiom (native calendar-range frame where - supported, otherwise a spine + `ROWS` fallback that references its **Series** - line); no engine leaks another engine's construct, and the additions contain no - benchmark / grader / version-dated content. -- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`, - `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, **and no - inline `RANGE … INTERVAL` frame**, anywhere in `SKILL.md` including the new - example. -- The step-5 `sql_dialect_notes` provision list names the **rolling-window** - convention alongside FQTN / identifier-quoting / date / top-N / series/calendar / - JSON. -- The existing interactive guidance (``, ``, the other - examples), the two existing Window-functions bullets, the window-then-filter - example, and the existing dialect-note rubric lines (including **Series**) are - intact and uncontradicted. -- No grader / benchmark reference, no output-shape contract, and no anchoring of - *relative* time ("recent" / "past N months") to a `MAX(date)` over the data. -- The skill stays scannable and comfortably under the 500-line budget; frontmatter - still parses as `ktx-analytics`. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the prose. - -- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three recipes - to the "Window functions" group (after its two existing bullets), the single - cumulative worked example, and extend the step-5 dialect-notes provision list to - name the rolling-window convention. Leave `` / `` / the other - examples and the two existing window bullets intact. Delivery is unchanged - (single `SKILL.md` per target via `readAnalyticsSkillContent` in - `setup-agents.ts`) — confirm, no change required. 
-- **Dialect notes:** the seven files under - `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with - `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by - `copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each - engine's actual `RANGE`-frame support against authoritative docs before writing - the idiom; do not assert from memory.** -- **Tests:** - - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a - representative phrase for each of the three recipes; bump the `sql`-fence count - assertion **3 → 4**; assert the cumulative example shape (e.g. `ROWS BETWEEN - UNBOUNDED PRECEDING AND CURRENT ROW`); and **strengthen** the dialect-clean - guard with a no-inline-`RANGE … INTERVAL` assertion (mirroring spec 10 adding - `generate_series` / `GENERATE_DATE_ARRAY` to the banned list, so the - "range frame lives only in the dialect notes" criterion is *enforced*, not - incidentally true). - - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers the - full rubric for every dialect" loop with the rolling-window assertion, e.g. - `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect must answer it. - Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces - all seven without a hand-maintained list. -- Rebuild and re-link the dev binary so the playground picks up both surfaces: - `pnpm run build && pnpm run link:dev`. - -## Benchmark context (motivation only) - -Running-balance / rolling / period-over-period questions are the single largest -result-mismatch cluster in the SQLite subset (financial-transactions-style DBs): -cumulative balances with the wrong frame on ties, rolling windows that mis-span -gappy dates, partial early windows, and unguarded period-over-period ratios. 
The -methodology is universal analyst craft, so it belongs in the product's skill -(where it helps every real user) plus the per-dialect rolling-window syntax that -makes it executable — not in a benchmark-specific prompt. Depends on spec 10 (the -date spine) for the gappy-rolling fallback. Improving the benchmark score is a -side effect; the skill and the dialect notes contain no trace of the benchmark. - -## Implementation notes - -Shipped as additive content across the two surfaces the spec specified — no new -tool, flag, or config. - -**Skill (`packages/cli/src/skills/analytics/SKILL.md`).** Added the three recipes -to the existing `` "Window functions" group, after its two bullets and -the spec-07 window-then-filter example: **Cumulative / running total** (explicit -`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` + a tie-breaker, referencing -the deterministic-ordering rule), **Rolling window over calendar time, plus -minimum periods** (calendar range not row count; spine-or-native-range choice -delegated to `sql_dialect_notes`; the `COUNT(*) OVER (...) = N` -min-periods guard), and **Period-over-period** (`LAG` + full-precision growth + -`NULLIF` divide guard, referencing the round-at-the-end rule). Added one worked -`sql` example — the cumulative running total, wrong-vs-right, using -`account_txns(account_id, txn_id, txn_date, net)` — bringing the skill to four -worked examples. Extended the step-5 `sql_dialect_notes` provision list to name -the rolling-window convention. No inline `RANGE … INTERVAL` frame anywhere in the -skill; it stays dialect-clean. - -**Dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`).** Added a -**Rolling window over time** line to all seven files, parallel to the spec-10 -**Series** line and cross-referencing it for the spine fallback. 
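The spine fallback that the dialect notes cross-reference can be sketched for SQLite as follows. This is a minimal illustration under stated assumptions, not the shipped note: the `daily_revenue(day, amount)` table, the sample rows, the 3-day window, and the hard-coded date bounds (standing in for a question's explicit range) are all hypothetical. It also assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount INTEGER);
    -- 2023-01-03 and 2023-01-04 are deliberately missing (gappy data).
    INSERT INTO daily_revenue VALUES
        ('2023-01-01', 10), ('2023-01-02', 20), ('2023-01-05', 30), ('2023-01-06', 40);
""")

# Wrong: a row-count frame over gappy rows reaches back across the gap,
# so the "3-day" window on 2023-01-05 also includes 2023-01-01 and 2023-01-02.
naive = conn.execute("""
    SELECT day,
           SUM(amount) OVER (
               ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS rolling_3d
    FROM daily_revenue
    ORDER BY day
""").fetchall()

# Right: build a gap-free date spine first (recursive CTE), LEFT JOIN the
# facts onto it, then the same ROWS frame covers exactly three calendar days.
spined = conn.execute("""
    WITH RECURSIVE spine(day) AS (
        SELECT '2023-01-01'
        UNION ALL
        SELECT date(day, '+1 day') FROM spine WHERE day < '2023-01-06'
    )
    SELECT s.day,
           SUM(COALESCE(r.amount, 0)) OVER (
               ORDER BY s.day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS rolling_3d
    FROM spine AS s
    LEFT JOIN daily_revenue AS r ON r.day = s.day
    ORDER BY s.day
""").fetchall()

print(dict(naive)["2023-01-06"])   # inflated: spans 2023-01-02 to 2023-01-06
print(dict(spined)["2023-01-06"])  # true 3-day sum over 2023-01-04 to 2023-01-06
```

On engines with a native calendar range frame the spine is unnecessary; the sketch covers only the `ROWS`-over-spine path.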
- -**Deviation — `RANGE`-frame support verified against authoritative docs (the -spec's hard requirement), which corrected two of its starting points:** - -- **postgres** — native interval frame: `RANGE BETWEEN INTERVAL '29 days' - PRECEDING AND CURRENT ROW` (as the spec guessed). -- **mysql** — native interval frame over a temporal key: `RANGE BETWEEN INTERVAL - 29 DAY PRECEDING AND CURRENT ROW` (as guessed). -- **bigquery** — `RANGE` is numeric-only: range over `UNIX_DATE(day)` with - `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or spine + `ROWS` (as guessed). -- **snowflake** — **corrected:** the spec said "limited; default to a spine," but - Snowflake *does* support a native interval `RANGE` frame over a date/timestamp - key and it is gap-tolerant, so the note gives the native frame - (`RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`), no spine needed. -- **clickhouse** — **corrected:** the spec said "limited; default to a spine," but - ClickHouse supports a numeric `RANGE` offset over a `Date` column (counts in - days, gap-tolerant); the `INTERVAL` form is unsupported (use seconds for - `DateTime`). The note gives the numeric `RANGE` frame, with spine + `ROWS` as - the fallback. -- **sqlite** — no date-interval range frame (no native date type): spine + `ROWS` - (as guessed). -- **tsql** — `RANGE` supports only `UNBOUNDED`/`CURRENT ROW` (no offset frame): - spine + `ROWS`, or a date-keyed self-join (as guessed). - -**Tests.** `test/skills/analytics-skill-content.test.ts` — added a representative -phrase per recipe (plus `minimum periods`), bumped the `sql`-fence count 3 → 4, -asserted the cumulative example shape (`ROWS BETWEEN UNBOUNDED PRECEDING AND -CURRENT ROW` and the `ORDER BY txn_date, txn_id` tie-breaker), and strengthened -the dialect-clean guard with a no-inline-`RANGE … INTERVAL` regex. 
-`test/context/mcp/dialect-notes.test.ts` — extended the per-dialect rubric loop -with `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect (derived from -`DIALECTS_WITH_NOTES`) must answer the rolling-window rubric. - -**Verification.** Full `@kaelio/ktx` vitest suite green (3001 passed, 1 skipped); -`pnpm run build` mirrors both surfaces into `dist`; `pnpm run link:dev` refreshed -`ktx-dev`. Pre-existing, unrelated note: `tsc -p tsconfig.test.json` reports one -error in `test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts` cast) that is -present in committed branch code and untouched by this work. diff --git a/spider2-specs/specs/12-parse-text-encoded-numbers.md b/spider2-specs/specs/12-parse-text-encoded-numbers.md deleted file mode 100644 index 68139ca3..00000000 --- a/spider2-specs/specs/12-parse-text-encoded-numbers.md +++ /dev/null @@ -1,405 +0,0 @@ -# Parse text-encoded numeric columns before doing math on them - -> Refined spec. Intake draft: `todo/12-parse-text-encoded-numbers.md`. - -## Problem - -Numeric measures are often stored as **text** with human formatting: unit -suffixes (`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators -(`"$1,200"`), percent signs (`"12%"`), or non-numeric sentinels for missing/zero -(`"-"`, `"N/A"`, `""`). Aggregating or comparing such a column directly is -**silently wrong**: a string comparison orders `"100" < "9"`, and a naive -`CAST(x AS REAL)` yields `0`/NULL/partial on the formatted values rather than the -intended number. The query runs, the shape looks right, the number is garbage. - -The agent already samples schemas before composing — spec 07 gave the -`` "Schema discovery before writing SQL" group its *"Sample before you -compose"* and *"Cast to the real type before comparing"* rules. But those rules -guard **encoding** (date format, nullability) and **type-mismatch in `WHERE`**; -they say nothing about a column whose declared/affinity type is text yet whose -*meaning* is numeric. 
When the agent sees a "numeric-looking" column it tends to -assume a real number type and skips the parse, so the arithmetic runs on the raw -strings. This spec adds the detect → parse/scale → verify habit to that same -group, building on the two rules already there rather than restating them. - -## Generic use case (independent of any benchmark) - -- A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become - `1200 / 3000000 / 0` before you can sum it or compute a daily change. -- A `price` stored as `"$1,299.00"` must become `1299.00` before averaging. -- A `conversion_rate` stored as `"12%"` must become `0.12` before weighting it. - -This is routine data hygiene on real, messy production tables — every analyst -hits text-encoded measures on some warehouse, with no benchmark in sight. The -methodology is universal craft, so it belongs in the shipped skill; it transfers -to every ktx user querying a live database. - -## Model - -The change is **additive content across two surfaces** — the same split specs 10 -and 11 made, and for the same reason. The split is the central design decision; -it satisfies spec 07's hard dialect-agnostic invariant for `` without -weakening it. - -### Why two surfaces (the dialect-agnostic reconciliation) - -The **detect → parse → scale** half is **pure portable SQL** and stays entirely -in the dialect-agnostic skill: - -- Stripping `$` / `,` / `%` is a portable chained `REPLACE` over a small, known - set of literal characters — no regex needed. -- Suffix scaling (K=10³, M=10⁶, B=10⁹) is a portable `LIKE`/`CASE` expression. -- Sentinel mapping (`-` / `N/A` / empty → `0` or `NULL`) is a portable `CASE`. -- The final cast to a numeric type is `CAST(... AS DECIMAL)`, broadly portable. - -The **verify** half has one piece that is genuinely dialect-divergent: a -**failure-detecting numeric cast** — a cast that signals (rather than silently -swallows) a value that did not parse. 
This is exactly what heuristic 3 of requirement 1 -("confirm coverage") needs, and it cannot be written portably: - -- **bigquery:** `SAFE_CAST(x AS FLOAT64)` → `NULL` on failure. -- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_CAST` → `NULL` on failure. -- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(...))` / `TRY_CONVERT` → `NULL`. -- **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`. -- **postgres / mysql:** no `TRY_CAST` — guard with a numeric pattern test before - casting (e.g. on postgres, `CASE WHEN x ~ '^-?[0-9.]+$' THEN x::numeric END`). -- **sqlite (the gotcha):** a plain `CAST('abc' AS REAL)` returns **`0.0`** and - `CAST('12abc' AS REAL)` returns **`12.0`** — it neither errors nor NULLs, so an - `IS NULL` coverage check is **silently broken**. Detecting a failed parse needs - a `GLOB`/`typeof` pattern guard. - -So a portable skill cannot inline a safe cast any more than spec 10 could inline a -date-series generator or spec 11 a calendar range frame. ktx already routes that -kind of engine-specific syntax through the per-dialect notes in -`packages/cli/src/context/sql-analysis/dialects/.md`, served by the -`sql_dialect_notes` MCP tool (spec 08). Specs 10 and 11 set the exact precedent: -a construct not yet in the dialect rubric, genuinely engine-specific, was added -there (the **Series** line; the **Rolling window** line) and the dialect-agnostic -skill points to it. The failure-detecting cast is the next construct in that same -position, so the **safe-cast idiom belongs in the dialect notes**, and the skill -points to it. - -Surface 1 (skill) carries the **pattern** (detect the text encoding; parse/scale -in an early CTE; verify with a failure-detecting cast). Surface 2 (dialect notes) -carries the **concrete safe-cast syntax** per engine, including the sqlite -`CAST`-returns-0 gotcha.
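
The sqlite gotcha described above can be checked directly. What follows is a minimal sketch, assuming Python's standard-library `sqlite3` module as the harness (the spec prescribes none). The `GLOB` guard and the `safe_cast` helper are hypothetical illustrations, not the wording shipped in the dialect notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A plain CAST neither errors nor returns NULL on non-numeric input.
plain = conn.execute(
    "SELECT CAST('abc' AS REAL), CAST('12abc' AS REAL), CAST('12.5' AS REAL)"
).fetchone()

# Hypothetical guard (an assumption for illustration): accept only non-empty
# strings made of digits and dots. It does not handle a leading sign and does
# not reject malformed inputs such as '1.2.3'.
GUARDED = """
SELECT CASE
         WHEN ? <> '' AND ? NOT GLOB '*[^0-9.]*' THEN CAST(? AS REAL)
       END
"""


def safe_cast(text):
    """Return the parsed number, or None when the guard rejects the text."""
    return conn.execute(GUARDED, (text, text, text)).fetchone()[0]


guarded = [safe_cast(v) for v in ("abc", "12abc", "12.5")]

print(plain)    # values from the unguarded CAST
print(guarded)  # rejected inputs come back as None
```

With the guard in place, a residual-`NULL` count becomes a meaningful coverage check on sqlite; without it, the failed parses are indistinguishable from genuine zeros.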
- -The regex character-*strip* is deliberately **not** promoted to the dialect -notes: a portable chained `REPLACE` over a known character set is the opinionated -default, so there is no need for a per-dialect strip line (derive from need; one -default). The dialect surface gains exactly one thing — the safe cast — because -that is the only piece the portable path genuinely cannot express. - -### Additive, inline, heuristic-with-a-why - -Consistent with specs 07, 10, and 11: the skill change is **additive content in -one Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled -`reference/` file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, -and phrased as **heuristics with a one-line generic rationale**, not a wall of -MUSTs. The dialect-notes change is additive content in the seven existing -`dialects/*.md` files. No new tool, flag, or config on either surface. - -### Build on the rules already present; do not restate them - -- The Schema-discovery group already carries **"Sample before you compose"** and - **"Cast to the real type before comparing"** (spec 07). The detect rule - **extends** the first (distinct-value sampling to learn the encoding) and the - parse rule **complements** the second (text-meaning-numeric, not just - text-vs-numeric literal mismatch) — reference them, do not repeat them. -- The sentinel **0-vs-NULL** choice is the **same additive-vs-non-additive - judgment** spec 10 established in its *"Default by additivity"* rule (0 only - when "no value" genuinely reads as 0; NULL otherwise). **Reference** that rule - rather than restating the discriminator (state each rule once). - -## Requirements - -### 1. Skill surface — `` "Schema discovery before writing SQL" - -Add the text-encoded-numeric guidance to the **existing** group, after its two -current bullets. Phrase as heuristics, each with a generic *why*, dialect-agnostic. -It must cover: - -1. 
**Detect text-encoded numerics during sampling.** When a column the question - treats as a number is stored as text, sample its **distinct** values to learn - the encodings actually present — unit suffixes (`K`/`M`/`B`), currency - symbols, thousands separators, percent signs, and non-numeric sentinels - (`-`, `N/A`, empty) — **before** composing. Never infer the format from the - column name. *Why:* compared/aggregated as-is, the text sorts lexically - (`'100' < '9'`) and a naive cast collapses formatted values to `0`/NULL — - producing a silently wrong result instead of an error. - -2. **Parse and scale in an early CTE.** Strip currency/separator/percent - characters, multiply by the suffix scale (K=10³, M=10⁶, B=10⁹), map sentinels - to `0` **or** `NULL` per the question's intent, then cast to a numeric type — - all in **one early CTE**, so every downstream layer sees clean numbers. The - `0`-vs-`NULL` choice for sentinels follows spec 10's **additive-vs-non-additive** - rule (reference it; do not restate). *Why:* a string column aggregated as-is - sorts lexically and casts to 0, so the math is silently wrong. - -3. **Confirm coverage (verify).** After parsing, sanity-check that **no - intended-numeric value silently failed to parse** — a failed parse should - surface as `NULL`, which is only visible with a **failure-detecting cast**. - Note the divergence: a plain `CAST` errors on some engines and, on sqlite, - returns `0`/partial rather than NULL — so use the engine's safe-cast idiom from - `sql_dialect_notes` (requirement 3), then count residual NULLs among - non-sentinel rows. *Why:* an encoding the sample missed would otherwise vanish - as `0`/NULL instead of being caught. - -### 2. One worked example — parse/scale, fully portable - -Add **exactly one** new compact before/after `sql` example demonstrating the -parse-and-scale pattern on a synthetic generic schema -(e.g. 
`metrics(label, value_text)` with values like `'1.2K'`, `'$1,200'`, `'-'`): - -- **Wrong:** `SUM(CAST(value_text AS REAL))` (or summing the raw strings) — the - formatted values collapse to `0`/partial, so the total is silently wrong. -- **Right:** an early CTE that strips symbols with chained `REPLACE`, applies a - `CASE` for the K/M/B suffix scale, maps `'-'`/`'N/A'`/`''` to `0`, casts to - `DECIMAL`, then `SUM`s the parsed column. - -**Standard, portable SQL only** — no `REGEXP_REPLACE`, `SAFE_CAST`, `TRY_CAST`, -`TRY_TO_NUMBER`, `toFloat64OrNull`, `GLOB`, or any dialect function — so the -example stays dialect-clean. Keep it ~12–16 lines. The **verify** step gets **no** -inline example (its correct form needs the engine-specific safe cast, delegated to -`sql_dialect_notes`, exactly as spec 10's period-spine and spec 11's -rolling-window variants were prose-only). - -This adds **one** worked `sql` example to the skill. Spec 11 independently adds -one as well; **do not hardcode the resulting total** — increment from the current -state. As of this writing the skill carries **three** examples (spec 07 -window-then-filter, spec 09 multi-hop fan-out, spec 10 panel spine), so this is -the **fourth**; if spec 11 ships first it is the **fifth**. The fence-count test -assertion is incremented by one from its current value (see Acceptance criteria). - -### 3. Dialect-notes surface — `dialects/*.md` (safe cast) - -Add a **"Safe cast"** idiom line to **each** of the seven authored dialect files, -parallel to spec 10's **Series** line and spec 11's **Rolling window** line. Each -line gives that engine's **failure-detecting numeric cast** — a cast that returns -`NULL` (or is detectably invalid) on a non-numeric input — which is what makes the -verify step correct on that engine. Each note is engine-exclusive (a SQLite -analyst gets the SQLite idiom and never another engine's construct, per the -existing dialect-notes leak guards). 
Orientation only — exact syntax is the -implementer's; verify against authoritative docs (context7 / the engine manual) -rather than asserting from memory: - -- **postgres:** no `TRY_CAST` — guard with a numeric pattern before casting, - e.g. `CASE WHEN x ~ '^-?[0-9.]+$' THEN x::numeric END`. (`regexp_replace` is - available for the strip, but chained `REPLACE` is the portable default.) -- **mysql (8.0+):** no `TRY_CAST` — guard with `x REGEXP '^-?[0-9.]+$'` before - `CAST(... AS DECIMAL)`; `REGEXP_REPLACE` is available for the strip. -- **bigquery:** `SAFE_CAST(x AS FLOAT64)` (or `SAFE_CAST(... AS NUMERIC)`) → - `NULL` on failure. -- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_TO_DECIMAL(x, p, s)` / `TRY_CAST` → - `NULL` on failure. -- **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`. -- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(18,4))` / `TRY_CONVERT` → `NULL`. -- **sqlite (the gotcha):** a plain `CAST` returns `0`/partial, **not** NULL or an - error, so a coverage check must use a pattern guard such as - `CASE WHEN cleaned GLOB '...' THEN CAST(cleaned AS REAL) END` (or a `typeof` - check) to detect a value that did not parse. - -This line is what makes the verify step executable from the dialect-agnostic -skill. It is **distinct** from the Series and Rolling-window lines (those generate -or window over a calendar; this detects a failed numeric parse). Phrase any -version note as `8.0+`-style, **not** "as of version …" (the dialect-notes test -bans version-dated wording). - -### 4. Explicit constraints / exclusions - -None of the following may appear (consistent with specs 07, 10, and 11): - -- **No inline dialect-specific cast/regex syntax in the skill** — no `SAFE_CAST`, - `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, - `replaceRegexpAll`, or `GLOB` anywhere in `SKILL.md`. The portable strip is - chained `REPLACE`; the failure-detecting cast lives only in the dialect notes. 
-- **No regex-strip dialect line.** The character strip stays the portable - chained-`REPLACE` default; the dialect notes gain only the **safe cast**. -- **No grader / gold-answer / benchmark reference**, and no output-shape contract - (the skill is for interactive analysis). - -### 5. Coordination with specs 07, 08, 10, and 11 - -- **Spec 07** owns the Schema-discovery group and its two existing bullets - (*"Sample before you compose"*, *"Cast to the real type before comparing"*). - Spec 12 **extends** that group and **builds on** both bullets — references them, - never restates them; they must stay intact and uncontradicted. -- **Spec 08** owns the dialect-notes channel and its leak guards. Spec 12 adds one - rubric line through that channel; the engine-exclusivity guards apply unchanged. -- **Spec 10** owns the additive-vs-non-additive discriminator (Answer - completeness) and the dialect **Series** line. Spec 12 **references** the - additivity rule for the sentinel `0`-vs-`NULL` choice; do not duplicate it. -- **Spec 11** independently adds the dialect **Rolling window** line, one `sql` - example, and the **rolling-window** entry to the step-5 provision list. Spec 12 - touches the **same** three places (the dialect-notes rubric loop, the example - count, and the step-5 list). Both are independent and additive — **add to the - current state, do not assume an order**: name **safe-cast** in the step-5 list - without removing rolling-window/series; increment the example count by one from - whatever it is; add `/\*\*Safe cast/` to the rubric loop alongside any - `/\*\*Rolling/` assertion. - -### 6. Step pointer (no duplication) - -The step-5 `sql_dialect_notes` provision list (currently "FQTN, -identifier-quoting, date, top-N, series/calendar, and JSON conventions"; spec 11 -also names rolling-window) should additionally name the **safe-cast** convention -now that it exists. State each rule once inside ``; the workflow steps -only point to it. 
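
The parse-and-scale pattern from requirement 2 can be sketched as follows. This is an illustration only, assuming Python's `sqlite3` module as a throwaway harness and the synthetic `metrics(label, value_text)` schema; it is not the example shipped in `SKILL.md`. The strip and the scale are split into two CTEs here purely for readability, and only `$` and `,` are stripped because the sample data has no percent values (a percent column would also need a divide-by-100 step).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (label TEXT, value_text TEXT)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("a", "1.2K"), ("b", "3M"), ("c", "$1,200"), ("d", "-")],
)

# Wrong: the formatted strings collapse to a numeric prefix or to 0.
WRONG = "SELECT SUM(CAST(value_text AS REAL)) FROM metrics"

# Right: strip symbols, map sentinels, scale the suffix, then cast.
RIGHT = """
WITH stripped AS (
  SELECT label,
         REPLACE(REPLACE(value_text, '$', ''), ',', '') AS v
  FROM metrics
),
parsed AS (
  SELECT label,
         CASE
           WHEN v IN ('-', 'N/A', '') THEN 0
           WHEN v LIKE '%K' THEN CAST(REPLACE(v, 'K', '') AS DECIMAL(18,4)) * 1000
           WHEN v LIKE '%M' THEN CAST(REPLACE(v, 'M', '') AS DECIMAL(18,4)) * 1000000
           WHEN v LIKE '%B' THEN CAST(REPLACE(v, 'B', '') AS DECIMAL(18,4)) * 1000000000
           ELSE CAST(v AS DECIMAL(18,4))
         END AS value_num
  FROM stripped
)
SELECT SUM(value_num) FROM parsed
"""

wrong_total = conn.execute(WRONG).fetchone()[0]
right_total = conn.execute(RIGHT).fetchone()[0]

print(wrong_total, right_total)
```

The sentinel `'-'` is mapped to `0` here because the measure is treated as additive; a non-additive measure would map it to `NULL` instead, per the additivity rule the spec references.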
- -## Leak-safety (hard constraint) - -Every worked example or note uses a **synthetic generic schema** (e.g. -`metrics(label, value_text)`) and made-up values (`'1.2K'`, `'$1,200'`, `'-'`), -showing only the *pattern*. **No** benchmark table names, SQL, or result values on -either surface. The dialect-notes additions, like the existing notes, carry no -benchmark / grader / version-dated content. The behavior is reconstructable from -first principles and tied to no specific instance. - -## Acceptance criteria - -- The `` "Schema discovery before writing SQL" group states the three - heuristics — inline, dialect-agnostic, each with a generic *why*, and each - **building on** (not restating) the existing *"Sample before you compose"* and - *"Cast to the real type before comparing"* bullets and spec 10's additivity rule: - - **detect** text-encoded numerics by sampling distinct values (suffixes, - symbols, separators, sentinels) — never from the column name; - - **parse and scale** in an early CTE (strip → suffix-scale → sentinel map → - cast), sentinel `0`-vs-`NULL` per spec 10's additivity rule; - - **confirm coverage** with a failure-detecting cast, delegating the engine's - safe-cast syntax to `sql_dialect_notes`. -- Exactly **one** new worked `sql` example: parse-and-scale, wrong-vs-right, using - chained `REPLACE` + `CASE` suffix scale + sentinel `CASE` + `CAST(... AS - DECIMAL)`, in standard portable SQL. The `sql`-fence count assertion is - incremented by **one** from its current value (3 today → 4; or 5 if spec 11 - shipped first). -- Each of the seven `dialects/*.md` files gains a **"Safe cast"** idiom line in its - engine's own failure-detecting numeric-cast idiom (including the sqlite - `CAST`-returns-0 gotcha); no engine leaks another engine's construct, and the - additions contain no benchmark / grader / version-dated content. 
-- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`, - `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, inline - `RANGE … INTERVAL` frame, **and no `SAFE_CAST` / `TRY_CAST` / `TRY_TO_NUMBER` / - `REGEXP_REPLACE` / `toFloat64OrNull` / `GLOB`**, anywhere in `SKILL.md` - including the new example. -- The step-5 `sql_dialect_notes` provision list names the **safe-cast** convention - alongside FQTN / identifier-quoting / date / top-N / series-calendar / - rolling-window / JSON. -- The existing interactive guidance (``, ``, the other examples), - the two existing Schema-discovery bullets, and the existing dialect-note rubric - lines (including **Series** and, if present, **Rolling window**) are intact and - uncontradicted. -- No grader / benchmark reference, and no output-shape contract. -- The skill stays scannable and comfortably under the 500-line budget; frontmatter - still parses as `ktx-analytics`. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the prose. - -- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three - heuristics to the "Schema discovery before writing SQL" group (after its two - existing bullets), the single parse-and-scale worked example, and extend the - step-5 dialect-notes provision list to name the safe-cast convention. Leave - `` / `` / the other examples and the two existing - schema-discovery bullets intact. Delivery is unchanged (single `SKILL.md` per - target via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no - change required. -- **Dialect notes:** the seven files under - `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with - `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by - `copy-runtime-assets.mjs` — no plumbing change, content only. 
**Verify each - engine's actual safe-cast / try-cast support against authoritative docs before - writing the idiom; do not assert from memory** (in particular the sqlite - `CAST`-returns-0 behavior, which is the motivating gotcha). -- **Tests:** - - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a - representative phrase for each of the three heuristics (e.g. a *detect*, a - *parse/scale*, and a *confirm-coverage* phrase) to the `represents every craft - behavior` list; bump the `sql`-fence count assertion **by one** from its - current value; assert the example shape (e.g. `REPLACE(` and `CAST(` and a - suffix-scale multiplier); and **strengthen** the dialect-clean guard by adding - `SAFE_CAST`, `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, - and `GLOB` to the banned list (mirroring spec 10 adding `generate_series` / - `GENERATE_DATE_ARRAY` and spec 11 adding the no-inline-`RANGE … INTERVAL` - guard, so the "safe cast lives only in the dialect notes" criterion is - *enforced*, not incidentally true). - - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers - the full rubric for every dialect" loop with the safe-cast assertion, - `expect(notes).toMatch(/\*\*Safe cast/)`, so every dialect must answer it. - Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces - all seven without a hand-maintained list. Do **not** add a false-exclusivity - assertion for `TRY_CAST` (it is shared by snowflake and tsql); requiring the - line per dialect is sufficient. -- Rebuild and re-link the dev binary so the playground picks up both surfaces: - `pnpm run build && pnpm run link:dev`. - -## Benchmark context (motivation only) - -At least one SQLite-subset question stores trading volume as suffix-encoded text -(`"K"`/`"M"`, `"-"` for zero) and fails because the agent aggregates the raw -strings — runnable, plausible, wrong. 
The sqlite `CAST`-returns-0 behavior makes -the failure especially insidious: there is no error to alert the agent, and a -naive `IS NULL` coverage check would not catch it either, which is precisely why -the safe-cast idiom belongs in the dialect notes. The fix — parse messy encodings -before math, then verify coverage with a failure-detecting cast — is universal -data hygiene that helps any analyst on any warehouse, so it belongs in the -product's craft (skill) plus the per-dialect safe-cast syntax that makes the -verify step executable, not in a benchmark-specific prompt. Improving the -benchmark score is a side effect; the skill and the dialect notes contain no trace -of the benchmark. - -## Implementation notes - -Shipped on branch `write-feature-spec-wiki`, on top of specs 10 and 11 (both already -applied in the working tree). Built from the current state per the "do not assume an -order" guidance — there were **four** worked examples (specs 07 window-then-filter, -09 multi-hop fan-out, 10 panel spine, 11 cumulative running total), so this is the -**fifth**, and step 5 already named `series/calendar, rolling-window`. - -**Skill — `packages/cli/src/skills/analytics/SKILL.md`:** -- Added the three heuristics to the **"Schema discovery before writing SQL"** group, - after the two existing bullets: *Parse text-encoded numerics before doing math on - them* (detect by sampling distinct values, extending *Sample before you compose*, - never inferring from the column name), *Strip, scale, and cast in one early CTE* - (the *meaning-is-numeric* complement to *Cast to the real type before comparing*, - with the sentinel `0`-vs-`NULL` choice deferred to spec 10's *Default by - additivity* rule), and *Confirm the parse covered every value* (failure-detecting - cast from `sql_dialect_notes`). Each carries a one-line generic *why*; the existing - bullets and the additivity rule are referenced, not restated. 
-- Added **one** portable worked example (`metrics(label, value_text)` with `'1.2K'`, - `'3M'`, `'$1,200'`, `'-'`): wrong = `SUM(CAST(value_text AS REAL))`; right = an - early `parsed` CTE that strips with chained `REPLACE`, scales the K/M/B suffix with - a `CASE`, maps sentinels to `0`, casts to `DECIMAL(18,4)`, then `SUM`s. Standard - portable SQL only — no dialect functions, no inline safe cast. -- Step 5 dialect-notes provision list now names **safe-cast** alongside the others. - -**Dialect notes — `packages/cli/src/context/sql-analysis/dialects/*.md`:** added a -**Safe cast** line to all seven files (after the *Rolling window* line), each giving -that engine's failure-detecting numeric cast: postgres/mysql use a numeric pattern -guard before casting (no `TRY_CAST`; mysql's bare `CAST` returns `0` with a warning); -bigquery `SAFE_CAST`; snowflake `TRY_TO_NUMBER`/`TRY_TO_DECIMAL`/`TRY_CAST`; tsql -`TRY_CAST`/`TRY_CONVERT`; clickhouse `toFloat64OrNull`/`toDecimal64OrNull` (the -`...OrZero` variants return `0`); sqlite documents the `CAST`-returns-`0.0`/partial -gotcha and a `GLOB` pattern guard. ClickHouse function names were verified against -the official docs via context7 (the spec's loose `toDecimalOrNull` is not a real -name — the `toOrNull` family requires a bit width, hence `toDecimal64OrNull`). -No version-dated wording. - -**Tests:** `analytics-skill-content.test.ts` — added the three representative -phrases, bumped the `sql`-fence count 4 → 5 (and the test title), asserted the -example shape (`WITH parsed AS`, `REPLACE(`, `AS DECIMAL(`, `LIKE '%K' THEN 1000`), -and strengthened the dialect-clean banned list with `SAFE_CAST`, `TRY_CAST`, -`TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, and `GLOB` (mirroring spec 10's -`generate_series` / spec 11's inline-`RANGE … INTERVAL` guards). 
`dialect-notes.test.ts` -— added `expect(notes).toMatch(/\*\*Safe cast/)` to the per-dialect rubric loop, so -all seven (derived from `DIALECTS_WITH_NOTES`) must answer it; no false-exclusivity -assertion for the shared `TRY_CAST`. - -**Verification:** both affected test files pass (19 tests); broader `test/skills` + -`test/context/mcp` pass (65 tests); production type-check (`tsc -p tsconfig.json`) -is clean; `pnpm run build` copies both surfaces into `dist` (7 dialect files carry -*Safe cast*, the built `SKILL.md` carries the parse example) and `pnpm run link:dev` -relinks `ktx-dev`. One **pre-existing, unrelated** type error remains in the -test-only config (`test/mcp-server-factory.test.ts:152`, byte-identical to HEAD, -untouched here) — out of scope for this spec. diff --git a/spider2-specs/specs/14-output-completeness-final-check.md b/spider2-specs/specs/14-output-completeness-final-check.md deleted file mode 100644 index c5b18e43..00000000 --- a/spider2-specs/specs/14-output-completeness-final-check.md +++ /dev/null @@ -1,336 +0,0 @@ -# Output completeness — answer every requested part, enforced by a final pre-emit check - -> Refined spec. Intake draft: `todo/14-output-completeness-final-check.md`. - -## Problem - -The single largest correctness failure mode for the analytics skill is -**incomplete output**: the query runs and the methodology is roughly right, but -the projection is missing columns the question asked for. The SQL is runnable and -the aggregate is correct — the answer is simply *short by columns*. Three -recurring shapes: - -1. **Multi-part questions answered partially.** A question that asks for several - things ("report the highest *and* the lowest month, each with its count and - average, *and* the difference") comes back with only the first clause — one - column where several were requested. -2. 
**Identity dropped.** Grouping by a human-readable name but not projecting the - entity's identifier (a product name without its product id, a customer name - without its customer id). -3. **Inputs to a derived value dropped.** Returning a ratio / percentage / - difference but not the underlying counts the question also asked for. - -Shapes 2 and 3 are **already covered** by shipped `` rules — spec 07's -*"Expose identity, not just the label"* and *"Keep the inputs to a derived -value"* — yet they are frequently **not applied**. So the gap is not missing -knowledge: these rules sit as passive heuristics in a list, and nothing makes the -agent reliably check them before finalizing. The fix is twofold: (a) add the -missing **multi-part-completeness** rule that generalizes shapes 1–3, and (b) -turn output-completeness into an **explicit final verification step** the agent -performs before emitting SQL, so the existing identity/inputs rules are actually -enforced rather than merely listed. - -The failure is **model-independent**: a markedly stronger model produced the same -incomplete-output mistakes on these questions, which means it is a -craft/enforcement gap, not a capability gap — exactly the kind of universal -analyst craft that belongs in the shipped skill. - -## Generic use case (independent of any benchmark) - -An analyst is asked: *"For each region, report the highest and the lowest monthly -order count, and the difference between them."* A complete answer has a column for -the region's id and name, the highest count, the lowest count, and the difference -— five columns. Returning just the region and a single number answers only part -of the request. This is a universal expectation on any database: answer **every** -part of a multi-part request, identify the entities, and show the inputs behind -any derived figure — and answer *exactly* that, without padding the result with -columns the question never asked for. 
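
The five-column answer described above can be sketched concretely. This is a minimal illustration, assuming Python's `sqlite3` module as the harness and an invented `regions` / `region_monthly` schema with made-up rows; the column names are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE region_monthly (region_id INTEGER, month TEXT, monthly_orders INTEGER);
INSERT INTO regions VALUES (1, 'North'), (2, 'South');
INSERT INTO region_monthly VALUES
  (1, '2024-01', 10), (1, '2024-02', 25),
  (2, '2024-01', 7),  (2, '2024-02', 4);
""")

# Partial: answers only the first clause and drops the region id.
PARTIAL = """
SELECT r.region_name, MAX(m.monthly_orders) AS highest
FROM regions r
JOIN region_monthly m ON m.region_id = r.region_id
GROUP BY r.region_name
ORDER BY r.region_name
"""

# Complete: identity, both extremes, and the derived difference.
COMPLETE = """
SELECT r.region_id,
       r.region_name,
       MAX(m.monthly_orders) AS highest,
       MIN(m.monthly_orders) AS lowest,
       MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference
FROM regions r
JOIN region_monthly m ON m.region_id = r.region_id
GROUP BY r.region_id, r.region_name
ORDER BY r.region_id
"""

partial = conn.execute(PARTIAL).fetchall()
complete = conn.execute(COMPLETE).fetchall()

print(partial)
print(complete)
```

Both queries run and both aggregates are correct; only the second projects every requested output, which is the gap the spec targets.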
- -## Model - -The change is **additive content in one Markdown file** -(`skills/analytics/SKILL.md`), governed by the same invariants spec 07 -established. They constrain the implementer; the exact prose is theirs. - -### Additive, inline, heuristic-with-a-why - -Consistent with specs 07 and 10: the change is additive content in -`skills/analytics/SKILL.md`, **inline** (no bundled `reference/` file — the -`setup-agents.ts` delivery ships only `SKILL.md` per target), dialect-agnostic, -and phrased as **heuristics with a one-line generic rationale**, not a wall of -MUSTs. The new rule extends the existing `` "Answer completeness / -interpretation" group; the shipped bullets in that group (including the *identity* -and *inputs* rules this spec builds on) are preserved unchanged. No new tool, -flag, or config. - -### The over-projection guard carries a *universal* why, not a grader reference - -The intake draft frames "don't pad the result with extra columns" as -*grader-gaming*. The skill forbids **any** reference to a grader, gold answer, or -benchmark (spec 07's hard invariant; the content test bans the words). So the -guard must ship with a **universal analytics rationale** instead: columns the -question did not ask for add noise, mislead the reader into thinking they matter, -and make the result harder to consume — match the request exactly, neither short -nor padded. This is the same reconciliation spec 07 applied to the draft's -"behavior only, no rationale" instruction: generic *why* is required; only -grader/gold/benchmark rationale is banned. - -### Completeness is a closed set — identity and inputs are *inside* it - -"Expose identity" and "keep the inputs" tell the agent to add columns; the -over-projection guard tells it not to. These only contradict if the target is -left fuzzy, so this spec pins it down. 
A **complete projection** is exactly: - -> {every requested metric/attribute} ∪ {the identifier of each grouped/named -> entity} ∪ {the inputs to each derived value}, at the grain the question -> specifies. - -Identity and inputs are **members of that set** — part of completeness, never -"padding." **Under-projection** is any member missing (the failure this spec -attacks); **over-projection** is any column *outside* the set (what the guard -forbids). The implementer must phrase the rule and guard against this single -definition so they read as one coherent notion, not two competing instructions. - -### Dialect-agnostic, additive-only, exclusions intact - -Every addition reads correctly on any dialect — no dialect-specific syntax in the -rule text or the worked example. The existing ``, ``, and the -other `` bullets and examples (specs 07/09/10/11/12) are preserved and -uncontradicted. Spec 07's exclusions still hold: no output-shape contract, no -`MAX(date)` anchoring of relative time, no grader-driven advice, no dialect -syntax. - -## Requirements - -### 1. Multi-part / multi-output completeness — a new umbrella rule - -Add a bullet to the `` "Answer completeness / interpretation" group: -when a question requests several outputs — a **list** ("A, B, and C"), **paired -extremes** ("the highest *and* the lowest"), or a **value plus its components** -("X, Y, and their ratio") — the final projection must contain a column for -**each** requested output. *Why:* answering only the first clause is the most -common way a runnable query is still wrong; the grain and methodology can be -perfect yet the answer is short by columns. - -This rule is the **umbrella** over the two shipped completeness rules: the -*inputs* rule (*"Keep the inputs to a derived value"*) is its "value + components" -instance, and the *identity* rule (*"Expose identity, not just the label"*) is its -"entity identity" instance. 
The new bullet should **name that relationship** -(so the three read as one notion) rather than restating either rule. - -Keep this distinct from the row-selection rules in the same group: *"Top / -highest / most / lowest"* and *"For each X / per X / by X"* govern **which rows** -appear; multi-part completeness governs **which columns** appear. They compose -(e.g. "highest and lowest per region" needs one row per region *and* a column per -clause). - -### 2. Final completeness check — the enforcement mechanism - -The rule content lives **once** in ``; the trigger is promoted to a -first-class line in `` step 6. - -- **Capstone bullet in ``** (closing the "Answer completeness / - interpretation" group): *before emitting the final SQL, re-read the question and - confirm the projection covers* — - 1. every named **metric / attribute** the question asks for (→ the multi-part - rule); - 2. the **identifier** of every grouped or named entity (→ the *identity* rule); - 3. every **input** to each derived value (→ the *inputs* rule); - 4. all at the **grain** the question specifies (→ the *for each X* / panel - rules). - - Each facet cross-references the rule it enforces, so the check is what makes - those passive rules active. Phrase it as a short, concrete "confirm the - projection covers…" checklist, not a wall of MUSTs. - -- **Over-projection guard** (attached to the check): do **not** add columns the - question did not ask for "to be safe" — extra columns add noise, mislead, and - make the result harder to consume; match the request exactly. Carries the - **universal** why from the Model, **never** a grader/gold/benchmark reference. 
- -- **`` step 6 line** (the explicit ritual): step 6 ("Validate and - explain") gains a mandatory line directing the agent to **always** run the final - completeness check before emitting — re-read the question and verify every - requested output, each entity's identity, each derived value's inputs, and the - grain are all projected — pointing into the `` capstone for the - detail. This **replaces the current conditional pointer's role** ("If a result - is unexpectedly empty or its grain looks wrong, work through the … rules"): the - empty/grain diagnostic stays available (it maps to the existing *"Diagnose empty - results"* and grain rules), but the completeness check fires **unconditionally**, - on every SQL-authoring turn, not only when a result looks off. The workflow line - names the ritual and the four facets; the rationale, guard, and example are - stated once in ``, not duplicated into the workflow. - -### 3. One worked example (dialect-agnostic) - -Add **exactly one** compact before/after example to the "Answer completeness / -interpretation" group, demonstrating multi-part completeness on a **synthetic** -schema (`regions`, `region_monthly`): - -- **WRONG:** answers only the first clause — `SELECT region_name, - MAX(monthly_orders) AS highest … GROUP BY region_name` — with no region id, no - lowest, no difference. -- **RIGHT:** one column per requested output plus the entity's identity, at the - region grain — `region_id, region_name`, the highest, the lowest, and the - difference, with `regions` joined to `region_monthly` and grouped by the region - id and name. - -Standard dialect-clean SQL only (no `QUALIFY`, no dialect functions; `MAX`/`MIN` -are portable aggregates). Keep it tight. It teaches multi-clause coverage + -identity + derived-value inputs in one capstone, and is **distinct** from the -spec-10 `regions` panel example: that one is about missing **rows** (LEFT-JOIN -spine + `COALESCE`); this one is about missing **columns**. 
This is the **sixth** -worked `sql` example in the skill (after specs 07/09/10/11/12). - -### 4. Coordination with specs 03 and 07/09/10/11/12 - -- **Spec 03** (multi-connection routing) owns `` step 0 and the - `connectionId` threading/scoping. Spec 14 touches `` only to add the - completeness-check line to **step 6** — it must not rewrite the routing or the - `` `connectionId` scoping. If both land, step 6 reads coherently: validate - + the completeness ritual. -- **Specs 07/09/10/11/12** own their own bullets and worked examples in - ``. Spec 14 is **additive** to the same "Answer completeness / - interpretation" group and adds one example; it must not remove or contradict - theirs. - -## Leak-safety (hard constraint) - -The example uses an **invented, generic schema** (`regions`, `region_monthly`) and -made-up columns — **no benchmark table names, SQL, or result values.** It teaches -the *pattern* (cover every requested output + identity + inputs, at grain, without -padding), which is universal and tied to no specific instance. The over-projection -guard's rationale is **universal** (noise/clarity/consumability), never -"grader-gaming" or any other scoring reference. No part of the addition mentions a -benchmark, gold answer, grader, or scoring comparator. - -## Acceptance criteria - -- `` "Answer completeness / interpretation" states the **multi-part / - multi-output completeness** rule (a column per requested output; list / paired - extremes / value-plus-components), named as the umbrella over the shipped - *identity* and *inputs* rules — inline, dialect-agnostic, with a generic *why*. -- `` states a concrete **final completeness check** (re-read the - question → confirm metrics + entity identity + derived-value inputs + grain are - projected), cross-referencing the existing identity/inputs/grain rules so they - are enforced, not merely listed. 
-- The check carries the **over-projection guard** with a **universal** rationale - (don't pad with unrequested columns — noise / misleading / harder to consume), - and the skill contains **zero** grader/gold/benchmark references anywhere. -- `` **step 6** carries a mandatory line that runs the completeness - check **unconditionally** before emitting and points into the `` - capstone; the rule content is **stated once** in `` (no duplicated - rationale/guard in the workflow). The empty/grain diagnostic remains available. -- Exactly **one** new worked `sql` example is present (synthetic - `regions`/`region_monthly`, wrong vs complete), in standard dialect-agnostic SQL; - the skill then carries **six** `sql` worked examples total. -- The existing interactive guidance (`` steps, ``, the other - `` bullets and the five prior examples) is intact and uncontradicted; - the additive-only and dialect-clean invariants from specs 07/10 still hold. -- None of spec 07's excluded items appear (output-shape contract, `MAX(date)` - anchoring of "recent"/"past N", grader-driven advice, dialect syntax). -- The skill stays scannable and comfortably under the 500-line budget; the - frontmatter still parses as `ktx-analytics`. -- The analytics-skill **content test is updated** to cover the new rule and check - (see Implementation orientation). - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the prose. - -- **Skill:** `packages/cli/src/skills/analytics/SKILL.md`. - - Add the multi-part-completeness bullet and the final-completeness-check - capstone (with the over-projection guard) to the `` "Answer - completeness / interpretation" group; add the single - `regions`/`region_monthly` worked example. - - In `` step 6, replace the current conditional answer-completeness - pointer with the mandatory completeness-check line (unconditional, names the - four facets, points into ``); keep the empty/grain diagnostic. 
- - Leave `` steps 0–5, ``, and the other `` - bullets/examples intact. Delivery is unchanged (single `SKILL.md` per target - via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change - required. -- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`. - - Add representative phrases to the "represents every craft behavior" list for - the multi-part rule, the final completeness check, and the over-projection - guard. - - Bump the worked-example `sql`-fence count assertion **5 → 6** (and update the - test name/comment), and assert the new example's shape (e.g. `region_monthly`, - `MAX(`, `MIN(`, the difference expression, `region_id`). - - The existing dialect-clean, grader/benchmark-clean, and relative-time - (`MAX(...)` anchoring) guards must still pass — the new example's `MAX`/`MIN` - lines carry no "recent"/"past N" wording, so the phrase-level guard is - unaffected. The `SkillsRegistryService` frontmatter test must still pass. -- Rebuild and re-link the dev binary so the playground picks up the updated skill: - `pnpm run build && pnpm run link:dev`. - -## Benchmark context (motivation only) - -On the latest SQLite-subset run, **incomplete output was the single largest -failure bucket (~13 of 51 voted failures)**: multi-part questions answered -partially, plus dropped identity / derived-value inputs — the latter two being -spec-07 rules that already exist but weren't applied. A probe with a much stronger -model reproduced the *same* incomplete-output failures, confirming this is a -craft-enforcement gap rather than a model-capability one. The fix — answer every -requested part, identify the entities, keep the inputs, and don't pad — is -universal analyst craft, so it belongs in the product skill (and transfers to real -users), enforced as a final pre-emit check rather than left as a passive hint. -Improving the benchmark score is a side effect; the skill contains no trace of the -benchmark. 
- -## Implementation notes - -Implemented as additive content in one Markdown file plus a test update. - -- **Skill — `packages/cli/src/skills/analytics/SKILL.md`** (`` "Answer - completeness / interpretation" group): - - Added the **"Answer every requested output"** umbrella bullet (list / paired - extremes / value-plus-components → a column per requested output, with a generic - *why*). It names *keep the inputs* and *expose identity* as its "value + - components" and "entity identity" instances, pins the closed-set definition of a - complete projection, and marks itself as governing *which columns* appear — - distinct from the *Top …* / *For each X* row-selection rules, with which it - composes. The two shipped instance rules are preserved verbatim. - - Added the **"Final completeness check"** capstone bullet: a four-facet - "before emitting, re-read the question and confirm the projection covers…" - checklist (metric/attribute → multi-part rule; identifier → *expose identity*; - inputs → *keep the inputs*; grain → *for each X* / *complete the panel*), run on - every query. It carries the **over-projection guard** with a universal rationale - (unrequested columns add noise, mislead, and are harder to consume — match the - request exactly), with **no** grader/gold/benchmark reference. - - Added one worked `sql` example (synthetic `regions` / `region_monthly`): WRONG - answers only the first clause (`SELECT region_name, MAX(monthly_orders) …`), - dropping the region id, the lowest, and the difference; RIGHT projects - `r.region_id, r.region_name`, `MAX` highest, `MIN` lowest, and the - `MAX − MIN` difference, joining `regions` to `region_monthly` and grouping by id - + name. This is the **sixth** `sql` example, dialect-clean (portable `MAX`/`MIN`). 
- - `` **step 6**: replaced the conditional answer-completeness pointer - with an unconditional *"Always run the final completeness check before emitting"* - line that names the four facets and points into the `` capstone; the - empty/grain diagnostic is retained for diagnosis. Steps 0–5, ``, and the - other `` bullets/examples are untouched. - - Delivery is unchanged: `readAnalyticsSkillContent` in - `packages/cli/src/setup-agents.ts` still ships the single `SKILL.md` per target - (confirmed, no change required). -- **Tests — `packages/cli/test/skills/analytics-skill-content.test.ts`:** added the - three representative phrases (`Answer every requested output`, `Final completeness - check`, `Don't over-project`); bumped the `sql`-fence count assertion 5 → 6 and - renamed that test; asserted the new example's shape (`region_monthly`, - `MAX(rm.monthly_orders)`, `MIN(rm.monthly_orders)`, the `MAX − MIN` difference, and - `r.region_id, r.region_name`). The dialect-clean, grader/benchmark-clean, - relative-time, and frontmatter guards still pass. -- **Verification:** `analytics-skill-content` 9/9 and `setup-agents` 46/46 pass; - production type-check (`tsconfig.json`, src) is clean; `pnpm run build` copied the - updated skill into `dist/skills/analytics/SKILL.md` (6 fences, all new content - present) and `pnpm -w run link:dev` re-linked `ktx-dev` so the playground picks it - up. The skill is 244 lines (< 500 budget) and the frontmatter still parses as - `ktx-analytics`. -- **Deviation (cosmetic):** the worked example uses alias `rm` and a difference - column named `order_count_range`; the intake draft sketched alias `m` and - `AS difference`. The spec leaves prose to the implementer, so the change is purely - naming. -- **Unrelated pre-existing issue:** `tsconfig.test.json` reports one type error in - `packages/cli/test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts`/`contextTools` - mismatch introduced by the earlier connection-scoped-wiki commit `2677b3ef`). 
It is - untouched by this work and out of scope here. diff --git a/spider2-specs/specs/15-mcp-server-structured-logging.md b/spider2-specs/specs/15-mcp-server-structured-logging.md deleted file mode 100644 index 5ad31d18..00000000 --- a/spider2-specs/specs/15-mcp-server-structured-logging.md +++ /dev/null @@ -1,405 +0,0 @@ -# Structured, leveled logging for the ktx MCP server - -> Refined spec. Intake draft: `todo/15-mcp-server-structured-logging.md`. -> -> **Scope: observability only.** This spec is about *seeing* what the MCP server -> does (which tool, what params, when, how long, outcome). *Preventing* a runaway -> query from blocking the server (off-event-loop / interruptible execution) is a -> separate concern — see "Non-goals". - -## Problem - -The ktx MCP server (`mcp-http-server.ts` + `mcp-stdio-server.ts`, both built -through `mcp-server-factory.ts` on raw `node:http` + the -`@modelcontextprotocol/sdk` transports) emits almost no operational logs. There -is no server-side record of **which MCP tool was called, with what parameters, -when, how long it took, or whether it succeeded** — nor of session open/close or -transport errors. When a tool call is slow, hangs, or a client connection drops -("Transport channel closed"), an operator has no trail to diagnose it and must -resort to process sampling / `lsof` / guesswork — and the offending input -(e.g. the exact SQL) is typically unrecoverable. - -The hook to fix this already exists but is half-built: `instrumentMcpServer` -(`context/mcp/context-tools.ts`) wraps every tool handler and already times it, -but it emits **only on completion** (a sampled `mcp_request_completed` telemetry -event) and **never writes a start line and never writes to the server log**. A -call that never returns therefore leaves no trace at all. 
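
The gap described above can be illustrated with a minimal sketch. It is not the real `instrumentMcpServer`: the names `instrumentTool`, `Sink`, and `unmatchedStarts`, the `event` field, and the counter-based `callId` are all hypothetical stand-ins chosen so the block runs without dependencies. It assumes the sink writes synchronously, so the start line exists before the handler is invoked.

```typescript
// Hypothetical sketch: every name here is illustrative, not the ktx implementation.
type Sink = { write(line: string): void };
type ToolHandler<I, O> = (input: I) => Promise<O>;

let nextCallId = 0; // stand-in for a per-call UUID

function instrumentTool<I, O>(
  tool: string,
  handler: ToolHandler<I, O>,
  sink: Sink,
): ToolHandler<I, O> {
  const emit = (
    level: "info" | "error",
    event: string,
    fields: Record<string, unknown>,
  ): void => {
    sink.write(`${JSON.stringify({ level, time: Date.now(), event, tool, ...fields })}\n`);
  };

  return async (input: I): Promise<O> => {
    const callId = `call-${++nextCallId}`;
    const startedAt = Date.now();
    // Runs synchronously when the wrapper is invoked, before the handler is called.
    emit("info", "tool.start", { callId, params: input });
    try {
      const result = await handler(input);
      emit("info", "tool.end", { callId, durationMs: Date.now() - startedAt, outcome: "ok" });
      return result;
    } catch (err) {
      emit("error", "tool.end", {
        callId,
        durationMs: Date.now() - startedAt,
        outcome: "error",
        err: err instanceof Error
          ? { message: err.message, stack: err.stack }
          : { message: String(err) },
      });
      throw err;
    }
  };
}

// Given captured log lines, return the callIds that started but never ended.
function unmatchedStarts(lines: string[]): string[] {
  const open = new Set<string>();
  for (const line of lines) {
    const entry = JSON.parse(line) as { event: string; callId: string };
    if (entry.event === "tool.start") open.add(entry.callId);
    if (entry.event === "tool.end") open.delete(entry.callId);
  }
  return Array.from(open);
}
```

With a completion-only wrapper, a handler that never returns writes nothing. With the start line in place, the same call leaves one entry that `unmatchedStarts` reports, carrying the raw input it was given.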
- -## Generic use case (independent of any benchmark) - -Anyone running a long-lived ktx MCP server — a developer's local instance -(stdio, launched by Claude Desktop / Cursor), a foreground HTTP server, or a -shared/hosted HTTP daemon — needs observability into tool-call activity to: - -- diagnose slow or hung tool calls (which `sql_execution` ran, against which - connection, with what SQL, for how long); -- explain client-visible connection failures from the server side (session - lifecycle, transport-closed events); -- audit what agents asked the server to do; -- spot patterns (hot tools, slow connections, error rates). - -This is standard production-server hygiene; the server currently provides none. - -## Design decisions (resolved during refinement) - -These resolve ambiguities the intake draft left open. They constrain the -implementer; the exact code is theirs. - -### One `pino` logger, synchronous, written to **stderr** - -Use `pino` — the de-facto standard structured-JSON logger for Node servers — as -a single shared instance. Two corrections to the draft's sketch: - -- **stderr, not stdout.** The stdio transport reserves **stdout** for the - JSON-RPC protocol (`mcp-stdio-server.ts` deliberately no-ops `stdout.write`); - writing logs there would corrupt the protocol stream. The HTTP daemon already - redirects **both** child fds to `.ktx/logs/mcp.log` - (`managed-mcp-daemon.ts`: `stdio: ['ignore', log.fd, log.fd]`), so stderr lands - in the same log file (surfaced by `ktx mcp logs`). **stderr is therefore the - one universally-correct sink** for both transports. -- **Synchronous, no worker-thread transport.** `pino` writes through a - `DestinationStream` (`{ write(msg) }`) — the server's existing - `KtxCliIo.stderr` sink satisfies that interface directly. Configure pino with a - **synchronous** destination (`pino.destination({ sync: true })`, or the - pino-pretty stream below with `sync: true`). 
This is load-bearing: the - `tool.start` line **must** be flushed to the fd *before* the (possibly - blocking) handler runs, so a runaway synchronous `better-sqlite3` query that - pegs the event loop still leaves the start line on disk. A worker-thread - transport (`transport: { target: ... }`) buffers and can lose that exact line - on a hard crash — **do not use transport mode.** - -### Format is derived from `stderr.isTTY`, not a config flag - -One logger, two serializations chosen by the environment (the "behavior follows -from inputs" rule — not a user-visible knob): - -- **TTY** (`ktx mcp start --foreground` or `ktx mcp stdio` run in a terminal) → - **`pino-pretty` as a synchronous in-process stream** (`pretty({ sync: true, - destination: })`, colorized). A readable live dev view. -- **Not a TTY** (the detached daemon, whose stderr is the `.ktx/logs/mcp.log` - file fd) → **plain JSON line** via the synchronous pino destination. The log - *file* stays structured JSON so the incident workflow ("recover the hung query - with a one-line `grep` / `jq`") works — colorized ANSI in a file would defeat - it. - -`KtxCliIo.stderr` has no `isTTY` field (`cli-runtime.ts`), so detect the terminal -from the underlying stream (`process.stderr.isTTY`) at logger construction, while -still writing *through* the `io.stderr` sink so tests can capture emitted lines. - -### Single hook: extend `instrumentMcpServer`, do not fork a second wrapper - -Tool-call logging is added to the existing `instrumentMcpServer` -(`context-tools.ts`), which already wraps `registerTool` and measures duration. -It receives the **raw** tool input (it wraps the schema-parsing handler from -`registerParsedTool`), so the params it logs include `sql` for `sql_execution`. -The existing telemetry emission stays unchanged; logging is **additive** beside -it. 
Because both transports build their server through `mcp-server-factory.ts` → -`registerKtxContextTools`, this single change gives **both HTTP and stdio** -tool-call logging for free. - -### `sessionId` / `callId` provenance - -- **`sessionId`** comes from the SDK's per-call handler context - (`RequestHandlerExtra.sessionId`; confirmed present in `@modelcontextprotocol/sdk` - `1.29.0`). It is populated for the HTTP StreamableHTTP transport and absent for - stdio (single session) — log it when present, omit otherwise. Add - `sessionId?: string` to `KtxMcpToolHandlerContext` (`context/mcp/types.ts`). -- **`callId`** is generated per invocation with `randomUUID()` (already imported - in `context-tools.ts`). It correlates a `tool.start` with its `tool.end`. - -### No redaction in v1 (explicit) - -v1 ships **no log redaction**. Rationale recorded here so it is a deliberate -choice, not an oversight: these logs are **local** (stderr → `.ktx/logs/mcp.log`), -**never transmitted off-box**, and sit at the **same trust boundary** as the -`ktx.yaml` / environment that already hold the connection credentials. Concretely: - -- Request **headers are never logged** at all, so the bearer token - (`KTX_MCP_TOKEN`) simply isn't collected — this is "not logged," not "redacted." -- Errors are logged with their **full message and stack** via pino's standard - `err` serializer. -- SQL text and tool params are logged **verbatim** (they are not secrets). - -Credential redaction (e.g. a DB URL embedded in a driver error string) is an -explicit **v1 non-goal**; revisit only if these logs are ever shipped off-box. -This drops the draft's "light redaction" requirement and the -`collectTelemetryRedactionSecrets` / scrubber reuse it implied. - -## Requirements - -### 1. One shared pino logger - -- A single `pino` instance per server process, constructed once and threaded to - both the transport layer (for lifecycle events) and the tool layer (for - tool-call events). 
Level set from env (Requirement 7), default `info`. -- Synchronous destination bound to the server's stderr sink (see Design - decisions). Pretty (`pino-pretty`, sync stream) when `process.stderr.isTTY`, - otherwise plain JSON. Each line carries pino's standard `time` and `level`. -- No new dependency beyond `pino` and `pino-pretty`. No OpenTelemetry / metrics - stack, no async/worker transport, no in-app file rotation. - -### 2. Per-session / per-call context via child loggers - -Use pino child loggers so every line carries the relevant correlation fields: -a per-call child binds `{ tool, callId }` plus `sessionId` when present, so one -session's or one call's activity can be grepped from the log. - -### 3. Tool-call logging — START before execute, END after - -In `instrumentMcpServer`, for **every** MCP tool invocation: - -- **On entry, before invoking the handler**, write `tool.start` with - `{ tool, callId, sessionId?, params }` at **`info`**. `params` is the raw tool - input; for `sql_execution` this includes the full **SQL text** (the single most - useful field). The write is synchronous so the line exists even if the handler - never returns. -- **On normal completion**, write `tool.end` with - `{ tool, callId, sessionId?, durationMs, outcome: "ok", resultSize }` at - **`info`** — *unless* it is a slow call (Requirement 4). `resultSize` is a - tool-agnostic size measure (byte length of the serialized result text content). -- **On error**, write `tool.end` with - `{ tool, callId, sessionId?, durationMs, outcome: "error", err }` at **`error`**, - where `err` is the serialized error (message + stack) per Requirement 6. - -`tool.start` and `tool.end` share the **same correlation fields and the same -`info` level** (for the non-slow, non-error case) so that an **unmatched -`tool.start`** — a start with no `tool.end` for the same `callId` — is an -unambiguous "this call hung" signal. 
This is the property that makes a runaway -`sql_execution` identifiable from the log alone, with its exact SQL and -timestamp, no process sampling. - -> **Deliberate change from the intake draft.** The draft put `tool.start` / -> `tool.end` at `debug` (suppressed at the default `info`). That defeats the -> motivating incident: a hang is unpredictable, so debug would have to be enabled -> *before* it occurs, which never happens. v1 logs start/end at **`info`** — an -> always-on access log — so the offending query is recoverable at the default -> level. `debug` is reserved for heavier detail (Requirement 7). - -### 4. Slow-call warning - -When a call **completes** with `durationMs` greater than the configured slow -threshold (Requirement 7), emit its `tool.end` at **`warn`** (carrying the same -fields plus the duration) instead of `info`. This makes a completed-but-slow call -stand out and keeps it visible even when the level is raised to `warn`. - -### 5. Connection / session lifecycle and transport errors - -- **HTTP** (`mcp-http-server.ts`, in `newTransport`): log `session.open` from - `onsessioninitialized` and `session.close` from `onsessionclosed` / - `transport.onclose`, each with `sessionId`, at `info`. **Wire the currently - unused `transport.onerror`** to log `transport.error` (the SDK's - closed-channel / "Transport channel closed" events) at `error`, so a - client-visible connection failure has a server-side counterpart. -- **stdio** (`mcp-stdio-server.ts`): route the existing raw - `transport.onerror` stderr string (it currently writes a plain string) through - the logger as a `transport.error` line at `error`. A single `session.open` / - `session.close` pair for the one stdio connection MAY be logged at `info`. - -### 6. Structured error logging - -Errors are logged as structured objects via pino's standard `err` serializer -(`pino.stdSerializers.err` or equivalent), carrying error class, message, and -stack — never a bare interpolated string. 
The existing telemetry exception -reporting in `instrumentMcpServer` / `registerParsedTool` is unchanged. - -### 7. Configuration surface - -- **`KTX_MCP_LOG_LEVEL`** — pino level (`error` | `warn` | `info` | `debug` | - …), default **`info`**. MCP-scoped name because the MCP server is the only - emitter today; naming it global (`KTX_LOG_LEVEL`) would imply a logging system - that does not exist. -- **`KTX_MCP_SLOW_TOOL_MS`** — slow-call threshold in milliseconds (Requirement - 4), default **`10000`**. Justified as a real ops knob: "slow" differs sharply - between a local SQLite file and a remote warehouse. -- Level ladder that results from Requirements 3–5: - - `debug`: everything below **plus** heavier detail (e.g. result bodies, - progress notifications) — implementer's discretion on what extra to attach. - - `info` (default): `tool.start` / `tool.end`, session lifecycle, slow `warn`s, - errors. - - `warn`: slow-call `tool.end`s, `transport.error`, errored `tool.end`s — but - not routine tool traffic. - - `error`: errored `tool.end`s and `transport.error` only. - -## Acceptance criteria - -- At default level (`info`), invoking any MCP tool produces a `tool.start` - (`tool`, `callId`, `sessionId` when HTTP, `params`) and a matching `tool.end` - (`durationMs`, `outcome`, `resultSize`) line, as **JSON to stderr** when stderr - is not a TTY. -- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a - `tool.start` line carrying its **exact SQL and timestamp** and **no** matching - `tool.end` for that `callId` — so the offending query is recoverable from the - log alone, with no process sampling. -- A completed call slower than `KTX_MCP_SLOW_TOOL_MS` emits its `tool.end` at - `warn` with its `durationMs`. -- Session open/close and transport-closed (`transport.error`) events are logged - with the `sessionId` (HTTP); the stdio transport error path goes through the - logger, not a raw `stderr.write`. 
-- At level `warn`, routine `tool.start` / `tool.end` are suppressed but - slow-call warnings, transport errors, and errored calls are present. -- When stderr is a TTY (`ktx mcp start --foreground` / `ktx mcp stdio` in a - terminal), output is human-readable colorized `pino-pretty`; the daemon log - file (`.ktx/logs/mcp.log`) is plain JSON. Both paths are synchronous. -- The bearer token never appears in any log line (headers are not logged); SQL - and tool params do appear. -- No worker-thread / async log transport is introduced; no OpenTelemetry / - metrics stack; the only new dependencies are `pino` and `pino-pretty`. -- The existing `mcp_request_completed` telemetry and exception reporting still - work unchanged. - -## Non-goals - -- **Preventing / interrupting runaway queries** (off-event-loop execution, query - timeouts, worker-thread isolation). A single synchronous query that fans out - into a massive nested-loop join can peg the single-threaded server for hours - and break new connections — observability surfaces *which* query, but the fix - is execution-model work in a separate spec. (This logging is also the - prerequisite for a future watchdog that detects a `tool.start` with no - `tool.end` past a threshold and recycles the server.) -- **Log redaction** (see Design decisions) — explicit v1 non-goal. -- **Pretty output as a worker-thread transport** — the TTY path uses pino-pretty - as a synchronous in-process stream only. -- Metrics / tracing / OpenTelemetry exporters. -- Forwarding logs to the MCP *client* via the protocol logging capability - (`notifications/message`, `logging/setLevel`) — a possible later enhancement, - distinct from operational stderr logging. -- A global `KTX_LOG_LEVEL` spanning non-MCP commands — out of scope until other - surfaces emit structured logs. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the design. - -- **New module** — a small logger factory, e.g. 
- `packages/cli/src/context/mcp/logger.ts`: builds the shared pino instance from - the stderr sink + `KTX_MCP_LOG_LEVEL`, choosing the pino-pretty (sync) stream - when `process.stderr.isTTY` else `pino.destination({ sync: true })`, and - exposes a `slow-threshold` read from `KTX_MCP_SLOW_TOOL_MS`. -- **Tool-call logging** — `packages/cli/src/context/mcp/context-tools.ts`: - extend `instrumentMcpServer` (~line 585) to write `tool.start` before - `handler(...)` and `tool.end` after (ok / slow-`warn` / `error`); generate - `callId` via the already-imported `randomUUID`; read `sessionId` from the - handler `context`. Thread the logger via `RegisterKtxContextToolsDeps` - (~line 26) and `registerKtxContextTools` (~line 650). Leave `registerParsedTool` - and the existing telemetry emission intact. -- **Context type** — `packages/cli/src/context/mcp/types.ts`: add - `sessionId?: string` to `KtxMcpToolHandlerContext`; add the logger to - `KtxMcpServerDeps` / the register deps. -- **Server wiring** — `packages/cli/src/context/mcp/server.ts` - (`createDefaultKtxMcpServer` / `createKtxMcpServer`) and - `packages/cli/src/mcp-server-factory.ts` (`createKtxMcpServerFactory`): accept - and pass the logger down to `registerKtxContextTools`. -- **HTTP lifecycle** — `packages/cli/src/mcp-http-server.ts`: construct (or - receive) the logger; in `newTransport` (~line 186) log `session.open` / - `session.close` and add `transport.onerror` → `transport.error`. -- **stdio lifecycle** — `packages/cli/src/mcp-stdio-server.ts`: construct (or - receive) the logger; route the existing `transport.onerror` (~line 54) through - it. -- **Log destination is already captured** — `packages/cli/src/managed-mcp-daemon.ts` - redirects child stdout+stderr to `.ktx/logs/mcp.log`; `ktx mcp logs` - (`commands/mcp-commands.ts`) tails it. No change needed there. -- **Dependencies** — add `pino` and `pino-pretty` to - `packages/cli/package.json`. Verify Knip/Biome dead-code and bundle checks - still pass. 
-- **Tests** — extend `packages/cli/test/mcp-http-server.test.ts`, - `mcp-server-factory.test.ts`, `context/mcp/server.test.ts`, and - `commands/mcp-commands.test.ts`: assert (a) a `tool.start` JSON line is written - before a (mock) handler runs and carries `params`/`sql`; (b) a matching - `tool.end` with `durationMs`/`outcome`; (c) a hung-handler scenario yields a - `tool.start` with no `tool.end` for that `callId`; (d) a slow completion emits - `warn`; (e) session lifecycle + `transport.error` lines; (f) the bearer token - never appears. Inject a capturing `io.stderr` and parse the JSON lines. - *Note:* `mcp-server-factory.test.ts` carries a pre-existing - `KtxMcpContextPorts`/`contextTools` type error (from commit `2677b3ef`, - unrelated to this work) — do not let it mask new failures. -- After implementing, rebuild and re-link so the playground picks it up: - `pnpm run build && pnpm run link:dev`. - -## Benchmark context (motivation, not a requirement) - -Running Spider 2.0-Lite against the MCP server at concurrency, an -adversarial-reviewer-generated query degenerated into a massive nested-loop join; -synchronous `better-sqlite3` executed it on the event loop, pegging a server at -~100% CPU for hours and breaking new MCP connections ("Transport channel -closed"). We could not determine *which* query, because the server logs nothing -about tool calls — diagnosis required `sample` / `lsof` on the live process and -the exact SQL was never recovered. Structured tool-call logging — especially -`tool.start` written synchronously *before* execution, at the default level — -would have turned this into a one-line `grep` of the server log. Improving the -benchmark is a side effect; the logging is generic production-server hygiene. - -## Implementation notes - -Implemented on branch `write-feature-spec-wiki`. All requirements and acceptance -criteria are satisfied. 
- -**What was built / where** - -- **New module `packages/cli/src/context/mcp/logger.ts`** — `createMcpLogger(io, - { isTTY? })` builds one synchronous `pino` (v10) instance written through the - `io.stderr` sink: plain JSON when stderr is not a TTY, a `pino-pretty` (v13) - synchronous in-process stream (`{ colorize: true, sync: true }`, wrapping the - sink in a `node:stream.Writable`) when it is. Also exports `mcpLogLevel` - (`KTX_MCP_LOG_LEVEL`, validated against pino levels, default `info`), - `mcpSlowToolMs` (`KTX_MCP_SLOW_TOOL_MS`, default `10000`), and - `serializeMcpError`. No worker/async transport; no global `KTX_LOG_LEVEL`. -- **Tool-call logging — `instrumentMcpServer` (`context/mcp/context-tools.ts`)** — - per invocation: `callId = randomUUID()`, a child logger bound to - `{ tool, callId, sessionId? }`, `tool.start { params }` written at `info` - **before** awaiting the handler (synchronous, so a runaway query still leaves it - on disk), and `tool.end` after: `info { durationMs, outcome:"ok", resultSize }`, - `warn` when `durationMs > KTX_MCP_SLOW_TOOL_MS`, or `error { outcome:"error", - err }`. `resultSize` is the UTF-8 byte length of the serialized text content. - The existing `mcp_request_completed` telemetry + `reportException` are unchanged - (`durationMs` is now computed once and shared); `registerParsedTool` is intact. -- **`sessionId` / logger plumbing** — `sessionId?: string` added to - `KtxMcpToolHandlerContext`; a single per-process logger threads from each - transport entrypoint through `createKtxMcpServerFactory` → - `createDefaultKtxMcpServer` → `createKtxMcpServer` → `registerKtxContextTools` - (`KtxMcpServerDeps.logger`, `RegisterKtxContextToolsDeps.logger`). -- **HTTP lifecycle (`mcp-http-server.ts`)** — `session.open` from - `onsessioninitialized`, `session.close` from `transport.onclose`, and the - previously-unused `transport.onerror` wired to `transport.error` at `error`. 
-- **stdio lifecycle (`mcp-stdio-server.ts`)** — the raw `transport.onerror` - string write is replaced by a `transport.error` log line; `session.open` / - `session.close` are logged for the single stdio session. -- **Deps** — `pino ^10.3.1`, `pino-pretty ^13.1.3` added to - `packages/cli/package.json`. -- **Tests** — `test/context/mcp/logger.test.ts` (factory, level/threshold env - parsing, error serializer, TTY vs JSON), a "MCP tool-call logging" block in - `test/context/mcp/server.test.ts` (start-before-handler, matching end with - `resultSize`, hung-handler leaves an unmatched start, slow→`warn`, `warn`-level - suppression with errored end still present, no-logger no-op), session lifecycle - + bearer-token-never-logged in `test/mcp-http-server.test.ts`, and - `test/mcp-stdio-server.test.ts` for `transport.error`. - -**Deviations / decisions** - -- **In-band errors carry no stack (inherent).** `registerParsedTool` converts a - thrown handler error into an `{ isError: true }` result (and reports the full - error via telemetry) before it reaches `instrumentMcpServer`, so the original - stack is already gone. `tool.end` for such a result logs `outcome:"error"` with - `err.message` only; a genuine throw that escapes gets the full pino `err` - serialization (type + message + stack). The field is always `err` for - consistency. This honours "leave `registerParsedTool` intact." -- **`session.close` is logged from `transport.onclose`** (the universal close - signal for both clean DELETE and dropped connections) rather than - `onsessionclosed`, to avoid duplicate lines; `onsessionclosed` keeps its - session-map cleanup role. -- **The logger is optional throughout.** Production always wires one per process; - when absent (programmatic/test callers that inject `createMcpServer`), tool-call - logging is simply off — which keeps existing tests unchanged. 
-- `createMcpLogger` accepts an optional `isTTY` purely as a test seam; production - derives format from `process.stderr.isTTY`. - -**Verification** - -`pnpm --filter @kaelio/ktx exec vitest run` for the four touched/added MCP test -files: 57 passed. Full default `pnpm run test`: 3018 passed, 1 skipped — the only -2 failures are in `test/skills/analytics-skill-content.test.ts`, pre-existing and -unrelated to this change (in-progress analytics-skill work on this branch). -`pnpm run dead-code` (Biome + Knip default + Knip production) clean. `pnpm run -build` and `pnpm run link:dev` succeed. `pnpm run type-check` reports only the -one pre-existing, test-only error in `test/mcp-server-factory.test.ts` from commit -`2677b3ef` (documented above); all source and the new tests type-check clean. diff --git a/spider2-specs/specs/16-bounded-query-execution-timeout.md b/spider2-specs/specs/16-bounded-query-execution-timeout.md deleted file mode 100644 index 597968ef..00000000 --- a/spider2-specs/specs/16-bounded-query-execution-timeout.md +++ /dev/null @@ -1,493 +0,0 @@ -# Bounded query execution (deadline + non-blocking) for read SQL - -> Refined spec. Intake draft: `todo/16-bounded-query-execution-timeout.md`. -> -> **Scope: bound and cancel a read query that runs too long.** This is the -> execution-model companion to spec 15 (MCP structured logging). Spec 15 -> *surfaces* a runaway query in the log; it explicitly defers *preventing* one — -> "off-event-loop execution, query timeouts, worker-thread isolation … is -> execution-model work in a separate spec." This is that spec. - -## Problem - -Two compounding gaps on the read-query path (`executeReadOnly`), confirmed in the -current code: - -1. 
**No execution deadline, handled divergently per connector.** A single - expensive query runs unbounded, and whether it is bounded at all depends - entirely on which driver the caller hit: - - **BigQuery** is the only connector with a real statement timeout — it sets - `jobTimeoutMs` on the query job from a per-connection config field - `job_timeout_ms` (`connectors/bigquery/connector.ts`, `query(...)` ~491–512). - - **ClickHouse** sets a hardcoded 30s *HTTP* `request_timeout` at client - creation (`connectors/clickhouse/connector.ts:602`) — a client-side give-up, - not a server-side `max_execution_time`; the server keeps working. - - **Snowflake, Postgres, MySQL, SQL Server** bound only pool/connection - *acquisition* (Snowflake `acquireTimeoutMillis: 60_000`; Postgres - `connectionTimeoutMillis: 10_000`; SQL Server `idleTimeoutMillis: 30000`; - MySQL pool size only) — nothing bounds statement *execution*. - - **SQLite** has nothing. - -2. **In-process SQLite blocks the event loop and cannot be cancelled.** The - SQLite connector executes on the main thread via synchronous - `better-sqlite3 .prepare().all()` (`connectors/sqlite/connector.ts`, - `query(...)` 311–318, used by `executeReadOnly` 247–251). A slow query freezes - the whole MCP server — it cannot serve other requests, send progress, or write - `tool.end` — and there is no in-thread way to interrupt it: better-sqlite3 (v12) - exposes no interrupt/cancel API. Its documented mechanism for slow queries is a - **worker thread**, and the only way to stop a runaway synchronous query is to - **terminate the thread** executing it (context7 `/wiselibs/better-sqlite3`, - `docs/threads.md`). 
- -The observed failure (Spider2-lite sqlite run, 2026-06-18): a single -`sql_execution` MCP call — -`SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on `complex_oracle`, -where `profits` is a VIEW (`costs ⋈ sales`, 918,843 × 82,112 rows, joined on a -4-column key with no composite index) — degraded to an O(N×M) nested-loop scan, -pegged a worker at 100% CPU for 13+ minutes, never returned, produced a -`tool.start` with no matching `tool.end`, and stalled an eval shard until the -worker was killed by hand. A row cap (`maxRows`) does not help: it bounds returned -rows, not scan work, and the failing query returned a single aggregate row. - -## Generic use case (independent of any benchmark) - -Any data agent that lets an LLM author SQL will eventually issue an -accidentally-expensive query — an unindexed or cartesian join, an expensive VIEW, -a wide aggregate over a large fact table. A general-purpose context layer must -bound that and return a clean, fast "query exceeded Ns" error so the agent can -revise (add filters, query base tables, narrow the range) instead of hanging the -tool and the server. This matters for embedded/local warehouses (SQLite, and any -future DuckDB-style in-process driver) and remote ones alike, and is wholly -independent of any benchmark. - -## Design decisions (resolved during refinement) - -These resolve ambiguities the intake draft left open. They constrain the -implementer; the exact code is theirs. - -### One canonical deadline, applied uniformly at the contract - -The deadline is enforced for **every** `executeReadOnly` caller, not only the MCP -`sql_execution` path. `executeReadOnly` has 13 call sites beyond MCP (ingest query -executor, relationship profiling and composite-candidate probes, relationship -validation, historic-SQL probes, `ktx sql`); the contract is the single place to -bound all of them. 
A heavy ingest profiling probe over a giant unindexed join is -exactly as worth abandoning as an interactive one — those call sites are -best-effort and degrade gracefully, so a deadline `KtxQueryError` becomes "skip -this probe / mark unprofiled," not "fail the source." (Requirement 8 covers the -call sites that must treat the timeout as recoverable.) - -> Rejected alternative: a caller-resolved deadline (short on the interactive path, -> longer/none for ingest). That introduces a second value source and the open -> question "what is the ingest budget," for no real gain — the 30s default already -> clears any normal profiling probe, and a probe that exceeds it is one to drop. - -### Default 30s, configurable per-connection via one shared field - -- **Default `30_000` ms.** Fast enough that an LLM agent gets a clean - "exceeded 30s" and revises within the same turn; generous headroom over any - indexed aggregate or normal profiling probe; a genuine pathological nested-loop - scan blows past it immediately. -- **One shared per-connection override**, honored by every connector: - `query_timeout_ms` in `ktx.yaml` (`queryTimeoutMs` in TS), a positive integer - in **milliseconds**. Milliseconds matches the BigQuery SDK and the field it - replaces; the user-facing error still reads in seconds. -- **BigQuery's `job_timeout_ms` config key is removed**, not kept alongside the - new field. BigQuery reads the shared `query_timeout_ms` and maps the resolved - value onto its SDK's `jobTimeoutMs`. ktx keeps no backward compatibility, so - there is exactly one way to set a query timeout — no parallel knob (intake - requirement 1). -- **Granularity is per-connection only.** No global all-connections override — - different warehouses have different performance envelopes, and a second - (global) knob would double the configuration surface for no stated need. 
- -### The shared contract is a value + an error, not a base class - -There is **no shared connector base class or factory** — each connector is -constructed independently; the only shared registry is the *dialect* factory -(`context/connections/dialects.ts:47–55`). So "defined once" (intake requirement -3) means a single shared module that owns: - -- `DEFAULT_QUERY_TIMEOUT_MS = 30_000`; -- `resolveQueryDeadlineMs(connectionConfig)` → the validated `query_timeout_ms` - override, else the default — so the default and the override precedence live in - exactly one place; -- `queryDeadlineExceededError(deadlineMs)` → a `KtxQueryError` with the canonical - message `query exceeded ${Math.round(deadlineMs / 1000)}s`. - -Each connector calls the resolver once (at construction; connectors already -receive their connection config) and stores `this.deadlineMs`. **Enforcement is -necessarily per-connector** — different engines cancel differently — but the -*value* and the *error message* are shared, so the agent sees one consistent, -actionable error regardless of driver. - -### Real cancellation, not client-side give-up - -Per intake requirement 5, the deadline must *stop the work*, not merely abandon -the promise while the query keeps running (which on a pooled driver also risks -returning a still-busy connection to the pool). So: - -- **In-process (SQLite, and any future embedded driver):** run the query off the - main thread and enforce the deadline by **terminating the worker thread**. There - is no generic `Promise.race` outer wrapper — a `Promise.race` against a - synchronous in-thread `.all()` can never fire (the loop is blocked), and against - a pooled remote query it would poison the pool. Thread termination *is* the - cancellation. -- **Remote engines:** set the engine's **server-side statement timeout** so the - server itself aborts the query and frees the connection cleanly. 
- -### Logging routes through spec 15's pino path — no second logger - -The deadline cases are logged through the **existing** MCP tool-call logger -(spec 15's `instrumentMcpServer`, `context/mcp/context-tools.ts:644–730`), not a -new logging path threaded into the connector. Verified flow for a timeout: -`executeReadOnly` throws `queryDeadlineExceededError` (a `KtxQueryError`) → -`local-project-ports.ts` preserves it → `registerParsedTool` (:552) reports it -(`reportException` skips `$exception` for `KtxExpectedError`) and returns an -in-band `isError` result → `instrumentMcpServer` writes `tool.end` at **`error`** -with `outcome:"error"`, `err.message = "query exceeded {N}s"`, and the **same -`callId`** as the `tool.start`. - -This is the central observability win and it requires **no new MCP logging code**: -spec 15 made a hang show up as a `tool.start` with *no* matching `tool.end`; this -spec turns it into a **matched `tool.start` → `tool.end(error)` pair** whose -`tool.end` names the deadline. The worker-termination (SQLite) and server-side -abort (remote) are internal enforcement mechanisms; their single observable signal -is that `tool.end`, so the connector does **not** get its own logger threaded -through `KtxScanContext` — that would fork a second path for one capability. The -"worker was actually reaped, not left spinning" guarantee is asserted by the -worker's `exit` event in tests (Requirement 3), not by a log line. - -## Requirements - -### 1. Shared deadline contract, defined once - -A single new module (e.g. `packages/cli/src/context/connections/query-deadline.ts`) -exports `DEFAULT_QUERY_TIMEOUT_MS` (30_000), `resolveQueryDeadlineMs(connectionConfig)`, -and `queryDeadlineExceededError(deadlineMs)`. Every connector resolves its -deadline through this resolver; no connector hardcodes its own default or -duplicates the override-precedence logic. - -### 2. 
Shared per-connection config field; BigQuery's removed - -`query_timeout_ms` is added to the **shared** connection config schema (validated -as an optional positive integer, milliseconds) so every driver accepts it. The -BigQuery-specific `job_timeout_ms` config field and its dedicated reader -(`bigQueryJobTimeoutMsFromConnection`) are removed; BigQuery sources its timeout -from the shared field and applies it as `jobTimeoutMs`. A bad `query_timeout_ms` -(zero, negative, non-integer) is a clear config validation error, consistent with -how ktx validates `ktx.yaml`. - -### 3. SQLite executes off the main thread, terminated on deadline - -`executeReadOnly` on the SQLite connector MUST NOT block the MCP server event -loop: - -- Read-only validation and the row-limit wrapper (`assertReadOnlySql` + - `limitSqlForExecution`) run **on the main thread** before dispatch — invalid SQL - fails instantly without spawning a worker, and read-only enforcement stays at - the boundary (Requirement 7). -- The validated, row-limited SQL (and any params) is dispatched to a **worker - thread** that opens the database `{ readonly: true, fileMustExist: true }`, runs - the query, and posts back `{ headers, rows, totalRows }` (all values are - structured-cloneable — primitives, `Buffer`, `BigInt`). -- The main thread arms a timer for `this.deadlineMs`; on expiry it calls - `worker.terminate()` and rejects with `queryDeadlineExceededError`. On a normal - message it clears the timer and resolves. On a worker error (SQLite rejected the - SQL) it rejects with that error, message preserved. A provided - `ctx.signal` (`KtxScanContext.signal`, already on the contract) also terminates - the worker, for external cancellation. -- **One short-lived worker per call**, terminated on completion or deadline — not - a persistent worker or pool. 
Terminate-on-deadline destroys the worker, so a
-  pool would need respawn/job-tracking for no benefit: `executeReadOnly` is
-  low-frequency (LLM-issued, serial per agent turn) and worker spawn cost is
-  negligible against query latency. The other SQLite paths (introspect, sample,
-  stats, distinct-values, row-count) stay on the main thread — they are
-  ktx-authored, bounded, and not on the `executeReadOnly` contract.
-- The event loop stays responsive throughout, so `tool.end` is always written and
-  concurrent requests on the same port are served.
-
-### 4. Remote engines set a real server-side statement timeout
-
-Each remote connector applies `this.deadlineMs` as its engine's server-side
-statement timeout, so the deadline stops server work rather than abandoning the
-promise:
-
-| Connector | Mechanism | Unit |
-|------------|--------------------------------------------------------|---------------|
-| BigQuery | `jobTimeoutMs` on the query job (replaces `job_timeout_ms`) | ms |
-| Postgres | `statement_timeout` | ms |
-| MySQL | session `max_execution_time` (applies to read-only SELECT — the only kind on this path) | ms |
-| Snowflake | `STATEMENT_TIMEOUT_IN_SECONDS` (ALTER SESSION) | s (ceil) |
-| ClickHouse | `max_execution_time` setting, with `request_timeout` aligned to the deadline so the HTTP client does not give up before the server aborts | s (ceil) |
-| SQL Server | `mssql` `requestTimeout` (TDS attention cancels server-side) | ms |
-
-ClickHouse's existing hardcoded 30s `request_timeout` is brought under this
-contract (derived from the resolved deadline), not left as a parallel mechanism.
-
-### 5. Timeout resolves as a `KtxQueryError` with the canonical message
-
-On exceeding the deadline, the path resolves with a `KtxQueryError`
-(`query exceeded {N}s`) — a finite, decision-reaching outcome, never an unbounded
-hang. For SQLite the worker-termination path throws `queryDeadlineExceededError`
-directly.
For remote engines, each connector recognizes **its own** engine's -timeout signal (Postgres `57014`; MySQL errno `3024`; ClickHouse code `159`; -SQL Server `ETIMEOUT`; Snowflake and BigQuery timeout errors) and re-wraps it as -`queryDeadlineExceededError`, keeping the driver error as `cause`. Each connector -owns its driver's signal — there is no central denylist of error codes to -maintain. - -### 6. MCP surfacing and logging via the existing pino path - -The MCP `sql_execution` path already (a) maps any non-native driver error to -`KtxQueryError` (`context/mcp/local-project-ports.ts:78–88`, guarded by -`isNativeProgrammingFault`), (b) reports it through `reportException`, which skips -`$exception` Error Tracking for `KtxExpectedError`, and (c) writes `tool.start` -synchronously before the handler and `tool.end` in `instrumentMcpServer` -(`context/mcp/context-tools.ts:644–730`). The deadline cases MUST surface through -this path — the implementer verifies and tests them, but adds **no parallel -classification or logging path**: - -- **Query exceeds the deadline (any driver):** a `tool.end` at **`error`** with - `outcome:"error"` and `err.message = "query exceeded {N}s"`, carrying the same - `callId` as the `tool.start`. Classified as an expected error, so it is absent - from `$exception` Error Tracking. The reason `tool.end` was previously missing - is solely the blocked event loop (Requirement 3); once the loop stays free and - the deadline throws, the existing instrumentation logs the matched pair — closing - spec 15's "`tool.start` with no `tool.end` = hang" gap for this case. -- **Completed-but-slow query (under the deadline, over `KTX_MCP_SLOW_TOOL_MS`):** - unchanged from spec 15 — its `tool.end` is emitted at **`warn`**. The deadline - (default 30s) and the slow threshold (default 10s) are independent knobs; a query - between 10s and 30s completes with a slow `warn`, one past 30s is killed with the - `error` above. - -### 7. 
Read-only enforcement and `maxRows` unchanged - -`assertReadOnlySql` and the `maxRows` row cap (`limitSqlForExecution`) behave -exactly as today. The deadline is additive. `maxRows` is not a substitute for it -(it bounds returned rows, not scan work). - -### 8. Best-effort callers treat a deadline timeout as recoverable - -The non-interactive `executeReadOnly` call sites that are best-effort — -relationship profiling, composite-candidate probes, relationship validation, -historic-SQL probes — MUST treat a deadline `KtxQueryError` as "skip this -probe / mark unprofiled" and continue, never as a source-fatal error. The -implementer confirms each such site already swallows query errors into a -graceful-skip and adds that handling where it does not, so the uniform deadline -(Requirement 1, applied to all callers) cannot abort an ingest run. A skipped -probe is logged at the skip site through that path's existing scan/ingest logger -(`KtxScanContext.logger`, `warn`/`debug`), never silently dropped — these callers -are off the MCP tool-call path, so their visibility comes from the logger they -already use. - -## Acceptance criteria - -- A read query that exceeds the deadline returns a `KtxQueryError` - (`query exceeded {N}s`) within roughly the deadline; the MCP worker stays - responsive (a concurrent tool call on the same server completes while the slow - query is still pending) and writes a matching `tool.end` with a non-ok outcome. -- **Logging:** a timed-out `sql_execution` produces a `tool.start` and a matching - `tool.end` (same `callId`) at `error` with `outcome:"error"` and - `err.message = "query exceeded {N}s"` — no unmatched `tool.start` remains. The - timeout does not raise a `$exception` Error Tracking event (it is a - `KtxExpectedError`). A completed query slower than `KTX_MCP_SLOW_TOOL_MS` but - under the deadline still emits its `tool.end` at `warn`. No new logger is - introduced — the lines come from the existing `instrumentMcpServer`. 
-- **SQLite specifically:** executing a deliberately pathological query (an - expensive VIEW or an unindexed cross join) on a fixture does not block the event - loop, is terminated at the deadline, and the worker exits (the off-main-thread - executor is killed, not left spinning) so CPU returns to idle. -- **One server-side-timeout driver (Postgres):** the connector applies - `statement_timeout` equal to the resolved deadline, and a `57014` cancellation - is mapped to the canonical `KtxQueryError`. -- `resolveQueryDeadlineMs` returns 30_000 by default, honors a `query_timeout_ms` - override, and rejects an invalid value (zero / negative / non-integer). -- **No regression:** normal fast queries return identical results; read-only - rejection still works; `maxRows` still bounds returned rows. -- The shared `query_timeout_ms` field is accepted by every connector; BigQuery's - former `job_timeout_ms` key is gone and BigQuery's timeout is driven by the - shared field. - -## Non-goals - -- **A row/byte/cost budget on returned data.** This spec bounds *time*, not result - size — `maxRows` already bounds rows, and BigQuery's `maximumBytesBilled` is a - separate, retained concern. -- **A global `KTX_QUERY_TIMEOUT_MS` or per-call user flag.** One opinionated - default plus a per-connection override; no per-call knob, no global knob. -- **A server watchdog that recycles the process on an unmatched `tool.start`.** - Spec 15 names this as a possible future mitigation; this spec prevents the hang - at the source, so the watchdog is out of scope here. -- **Moving SQLite introspection / sampling / stats off the main thread.** Only the - `executeReadOnly` (LLM-SQL) path needs worker isolation; the rest are bounded - ktx-authored queries. -- **Per-connection retry / backoff on timeout.** A timeout returns a clean error - for the agent to revise; ktx does not auto-retry. 
-- **A second logger threaded into the connector.** The deadline cases are logged - through spec 15's existing MCP tool-call logger; the connector gets no separate - pino instance and `KtxScanContext` gets no MCP-logger thread (see "Logging routes - through spec 15's pino path"). - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the design. - -- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`: - `DEFAULT_QUERY_TIMEOUT_MS`, `resolveQueryDeadlineMs`, `queryDeadlineExceededError`. - Error class is `KtxQueryError` (`packages/cli/src/errors.ts:25`). -- **Contract anchor** — `KtxScanConnector.executeReadOnly` - (`context/scan/types.ts:343`), `KtxReadOnlyQueryInput` (`types.ts:285`), - `KtxScanContext.signal` (`types.ts:176`, already present, currently unused on the - MCP path). -- **Config schema** — add `query_timeout_ms` to the shared connection config - (`context/project/config.ts`, `KtxProjectConnectionConfig` and its zod schema); - remove BigQuery's `job_timeout_ms` reader. -- **SQLite worker** — new `packages/cli/src/connectors/sqlite/read-query-worker.ts` - (constructed by path via `new URL('./read-query-worker.js', import.meta.url)`); - rework `connectors/sqlite/connector.ts` `executeReadOnly` (247–251) to validate - on the main thread then dispatch to the worker with a terminate-on-deadline - timer. Reuse `normalizeQueryRows` (`context/connections/query-executor.ts`) in - the worker. Register the worker as a dynamic entry in `knip.json` (it is - referenced by path, not import) and confirm the build copies it into `dist`. 
-- **Remote connectors** — apply the resolved deadline and recognize the engine's - timeout signal in each `executeReadOnly` / `query(...)`: - `connectors/bigquery/connector.ts` (~491–512, `jobTimeoutMs`), - `connectors/clickhouse/connector.ts` (~602/629–644, `max_execution_time` + - `request_timeout`), `connectors/snowflake/connector.ts` (~354–371/510–534, - `STATEMENT_TIMEOUT_IN_SECONDS`), `connectors/postgres/connector.ts` (~822–838, - `statement_timeout`), `connectors/mysql/connector.ts` (~774–793, - `max_execution_time`), `connectors/sqlserver/connector.ts` (~812–832, - `requestTimeout`). -- **MCP path + logging (verify only)** — `context/mcp/local-project-ports.ts:69–88` - (error mapping), the `sql_execution` registration (~915–943), and the logging in - `instrumentMcpServer` (`context/mcp/context-tools.ts:644–730`, which writes - `tool.start`/`tool.end` via the spec-15 pino logger `context/mcp/logger.ts`). No - new classification or logging code; confirm the timeout flows through as an - expected error producing a matching `tool.end(error)` with the canonical message. -- **Best-effort callers** — `context/scan/relationship-profiling.ts` (~227, 275), - `context/scan/relationship-composite-candidates.ts` (~365, 440), - `context/scan/relationship-validation.ts` (~259), - `context/ingest/historic-sql-probes/bigquery-runner.ts` (~97), and the - historic-sql clients: confirm a deadline `KtxQueryError` is swallowed into a - graceful skip. -- **Tests** — a SQLite fixture with a pathological query (tiny `query_timeout_ms` - as the test seam) asserting terminate-on-deadline, event-loop responsiveness - (a concurrent promise resolves while the query is pending), and worker exit; a - Postgres test asserting `statement_timeout` is set to the resolved deadline and - a `57014` error maps to `KtxQueryError`; resolver unit tests (default / - override / invalid); regression tests for normal results, read-only rejection, - and `maxRows`. 
Extend the MCP logging tests (alongside spec 15's, e.g. - `test/context/mcp/server.test.ts`) to assert a timed-out `sql_execution` yields a - matched `tool.start`/`tool.end(error)` pair carrying `query exceeded {N}s`. -- After implementing, rebuild and re-link so the playground picks it up: - `pnpm run build && pnpm run link:dev`. - -## Benchmark context (motivation, not a requirement) - -The Spider2-lite local set loads several warehouses into SQLite, some with -expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` = -`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112 -rows, no composite index, with `promo_id` (the index the optimizer picks) being -95.5% a single value. LLM-authored profiling queries (MIN/MAX/COUNT over such a -view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval -shard for 10+ minutes; with one, the agent gets a fast error and can scope the -query instead. Improving the benchmark is a side effect; the deadline is generic -production hygiene for any agent that lets an LLM author SQL. - -## Implementation notes - -Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All -acceptance criteria are met; tests, type-check, dead-code, and build are green -for the changed surface. - -### What was built, and where - -- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`: - `DEFAULT_QUERY_TIMEOUT_MS = 30_000`, `resolveQueryDeadlineMs(connection)` (returns - the validated `query_timeout_ms` override else the default; throws on - zero/negative/non-integer), and `queryDeadlineExceededError(deadlineMs, options?)` - (a `KtxQueryError` reading `query exceeded ${round(ms/1000)}s`, carrying the - driver error as `cause`). Unit-tested in `test/context/connections/query-deadline.test.ts`. -- **Config field** — `query_timeout_ms` (optional positive integer, ms) added to - the **shared warehouse** schema. 
NOTE (spec drift): that schema lives in - `context/project/driver-schemas.ts` (`warehouseConnectionSchema`), not - `config.ts`. The warehouse schemas use `z.looseObject`, so the field had to be - declared explicitly to be *validated* (otherwise it would pass through - unvalidated). BigQuery's `job_timeout_ms` field and `bigQueryJobTimeoutMsFromConnection` - reader were removed; BigQuery now resolves the shared field. Every connector - resolves its deadline once at construction via `resolveQueryDeadlineMs`. - -### Deviation from the spec's SQLite mechanism (worker thread → child process) - -The spec mandated running SQLite read queries on a **worker thread** and enforcing -the deadline by `worker.terminate()`. This was **empirically disproven**: -`Worker.terminate()` cannot interrupt a CPU-bound synchronous `better-sqlite3` -scan — the native `sqlite3_step` loop never yields to V8, so terminate's promise -never even resolves (an 8s probe of the exact failing query shape confirmed the -thread keeps spinning). better-sqlite3 v12 exposes no `interrupt`/progress-handler -API, and `.iterate()` does not help because the failing query is a single -aggregate row produced only *after* the full scan. - -The implemented mechanism is therefore **`child_process.fork` + `SIGKILL`** -(`packages/cli/src/connectors/sqlite/read-query-child.ts`, spawned from -`connector.ts`). SIGKILL lets the OS reclaim the whole process — a probe confirmed -the scan is interrupted in ~2 ms and CPU returns to idle. This satisfies *both* -SQLite requirements better than a thread (event loop stays free **and** the query -is genuinely cancellable). The child is self-contained (imports only -`better-sqlite3` + node builtins); validation/row-limiting (`limitSqlForExecution`) -and `normalizeQueryRows` stay on the main thread. One short-lived child per call, -killed on completion, deadline, or `ctx.signal` abort. 
Node v24's native -TS type-stripping lets the `.ts` child load under vitest; a `.js`-if-exists-else-`.ts` -URL resolver picks the compiled child in `dist`. Registered as a dynamic entry in -`knip.json`; `tsc` emits it to `dist` (verified, plus a dist-level end-to-end smoke). - -### Remote connectors (server-side timeouts + own-signal mapping) - -Each applies the resolved deadline server-side and re-wraps its own timeout signal -as `queryDeadlineExceededError(deadlineMs, { cause })`: - -- **BigQuery** — `jobTimeoutMs` on the query job; maps a "Job timed out" / timeout-reason error. -- **Postgres** — `statement_timeout` via pool `options` (`-c statement_timeout=`); maps `57014`. -- **MySQL** — `SET SESSION max_execution_time = ` before the read; maps errno `3024`. -- **Snowflake** — `ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = ` in the pooled connection; maps code `604` / "reached its … timeout". -- **ClickHouse** — `max_execution_time` (ceil seconds) setting, with `request_timeout` set to `deadline + 5s` so the HTTP client outlasts the server abort (replaces the old hardcoded 30s); maps code `159`. -- **SQL Server** — `requestTimeout` on the `mssql` pool config (TDS attention cancels server-side); maps `ETIMEOUT`. - -Each connector has a focused test asserting the timeout is applied and its signal -maps to `KtxQueryError` (Postgres is the spec's required acceptance test). - -### Best-effort callers (Requirement 8) - -Confirmed already graceful: relationship **profiling** (outer try/catch → -`profile_failed` warning) and **composite-candidate** detection -(`detectCompositeRelationships` → recoverable warning, returns `[]`). Historic-SQL -**probes** flow through `runHistoricSqlReadinessProbe`, which catches *any* error -into `{ ok: false }`. 
**Added** handling to relationship **validation**: a -`KtxQueryError` on the per-candidate coverage probe now sends that one candidate to -`review` (`validation_query_failed`, logged via `ctx.logger.warn`) instead of -aborting the whole validation pass. `ingest-query-executor.ts` is a generic -executor port whose callers own recoverability — left unchanged. - -### MCP surfacing/logging - -No new MCP classification or logging code. The deadline `KtxQueryError` flows -through the existing `local-project-ports` mapping → `reportException` (skips -`$exception` for `KtxExpectedError`; existing test `telemetry/exception.test.ts` -covers the skip for `KtxQueryError`) → `instrumentMcpServer`, which logs a matched -`tool.start` → `tool.end(error, level 50)` pair carrying `err.message = "query -exceeded {N}s"`. A test in `test/context/mcp/server.test.ts` asserts the matched -pair, closing spec 15's "`tool.start` with no `tool.end` = hang" gap for this case. - -### Pre-existing branch issues encountered (not part of this feature) - -- `test/mcp-server-factory.test.ts` had a type error (an `as` cast to a shape with - a fake `context_tool` key, introduced by branch commit `2677b3ef`) that broke - `tsc -p tsconfig.test.json`. Fixed with a clean single cast to keep the - type-check gate green; behavior unchanged. -- `test/skills/analytics-skill-content.test.ts` fails (2 cases: missing - `**Window functions**` heading and `Expose identity, not just the label` prose - in `src/skills/analytics/SKILL.md`). This is unrelated analytics-skill (spec - 13/14) content drift committed earlier on the branch; **left untouched** — no - skill files were modified by this feature. 
diff --git a/spider2-specs/specs/18-bigquery-cross-project-datasets.md b/spider2-specs/specs/18-bigquery-cross-project-datasets.md deleted file mode 100644 index 4dd65e2d..00000000 --- a/spider2-specs/specs/18-bigquery-cross-project-datasets.md +++ /dev/null @@ -1,418 +0,0 @@ -# BigQuery cross-project dataset introspection (foreign-hosted datasets, billed in own project) - -> Refined spec. Intake draft: `todo/18-bigquery-cross-project-datasets.md`. -> -> **Scope: let the BigQuery connector introspect a dataset hosted in a *different* -> project than the one it bills jobs to.** A `dataset_ids` entry may be written -> fully-qualified as `project.dataset`; the connector introspects each entry in -> *its own* project while every job still runs in `credentials.project_id`. A -> bare `dataset` keeps today's single-project behavior unchanged. -> -> Out of scope (confirmed during refinement): the interactive `ktx setup` wizard -> is **not** expected to *discover* foreign datasets — you cannot enumerate -> datasets in a project you don't own, and the wizard doesn't know which foreign -> projects to probe. Users hand-write `project.dataset` entries (in `ktx.yaml` or -> at the dataset prompt); the connector must accept and introspect them. See -> *Non-goals*. - -## Problem - -**ktx**'s BigQuery connector derives a single `projectId` from -`credentials.project_id` and uses it for **both** job billing **and** schema -introspection. There is no way to introspect a dataset that lives in another -project, even though *querying* such a dataset already works (a cross-project -read in a `FROM` clause bills to the caller's project — that path is proven). - -Confirmed in the current connector (`packages/cli/src/connectors/bigquery/connector.ts`): - -- **`:294`** — `projectId` is read only from `credentials.project_id`. There is - no separate billing-vs-dataset project. `bigQueryConnectionConfigFromConfig` - (`:278`–`:301`) returns `datasetIds: string[]` — raw, unparsed. 
-- **`datasetIds()` (`:163`)** — returns `dataset_ids` / `dataset_id` verbatim; - it never parses a `project.` prefix. -- **`introspectDataset` (`:544`)** — calls `this.getClient().dataset(datasetId)`, - which resolves the dataset in the **client's (billing) project**, and labels - every table `catalog: this.resolved.projectId` (`:566`, `:574`) — including the - introspection-failure warning metadata (`:566`). -- **`primaryKeys` (`:591`)** — builds `INFORMATION_SCHEMA` SQL as - `` `..INFORMATION_SCHEMA.TABLE_CONSTRAINTS` `` using the - **billing** project. -- **`listTables` (`:453`)** — queries - `` ``.`region-`.INFORMATION_SCHEMA.TABLES `` against the - **billing** project and labels each row `catalog: this.resolved.projectId`. -- **`testConnection` (`:344`)** — calls `client.dataset(datasetId).get()` in the - billing project. - -### Empirical confirmation (from the intake draft) - -With a service account in project `ktx-spider2-lite`: - -- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (it looks - in `projects/ktx-spider2-lite/datasets/austin_311`). -- The cross-project form `dataset("austin_311", { projectId: "bigquery-public-data" })` - → **succeeds** (public metadata is readable by any authenticated principal). -- There is **no config knob** to separate the introspection project from billing. - -### Why the table `catalog` label is load-bearing, not cosmetic - -The BigQuery dialect generates **three-part `catalog.db.name`** SQL -(`connectors/bigquery/dialect.ts:38` → `formatDialectTableName(..., 'three-part')`; -`context/connections/dialect-helpers.ts:27`–`32` emits `catalog.db.name`). The -`catalog` stored on each scanned table is therefore the project that *every* -later query targets — `sampleTable`, `sampleColumn`, `getColumnDistinctValues`, -and ref-based `executeReadOnly` all format the ref through the dialect. 
If a -foreign dataset's tables are labeled with the billing project, every one of those -queries becomes `` `billing-project`.`austin_311`.`table` `` → `404`. So labeling -the table `catalog` with the dataset's own project is a **correctness -requirement**, and it is the single lever that makes sampling, dictionary value -extraction, and `discover_data` all resolve once the snapshot is right. - -### One introspection path, no divergence - -`connectors/bigquery/live-database-introspection.ts` wraps -`KtxBigQueryScanConnector.introspect` directly, so the ingest and live-database -paths share **one** introspection implementation. The SDK already supports the -fix: `client.dataset(id, { projectId })` — `@google-cloud/bigquery@8.3.1`'s -`DatasetOptions` exposes `projectId?: string`. - -## Generic use case (independent of any benchmark) - -Analysts routinely introspect datasets they can **read but do not own and do not -bill to**: Google's `bigquery-public-data`, a partner's shared project, an -organization's central data project that a smaller team queries from its own -billing project. To make those connectable in **ktx** — so `discover_data`, the -semantic layer, dictionary sampling, and `sql_dialect_notes` all work — the -connector must introspect a foreign-hosted dataset while billing jobs in the -credentials' own project. This is a standard BigQuery deployment shape and is -wholly independent of any benchmark. - -The class to design for is "the dataset's project ≠ the billing project," and it -must generalize beyond one example: a single connection may reference datasets in -**several** foreign projects at once (e.g. one slice mixing `bigquery-public-data` -and `isb-cgc-bq`), and two different projects may host datasets with the **same -name**. The design must keep those distinct. - -## Design decisions (resolved during refinement) - -These resolve ambiguities the intake draft left open. They constrain the -implementer; the exact code is theirs. 
- -### Carry the project inline on each dataset entry — no separate knob - -The introspection project is expressed **per dataset**, inline, as the optional -`project.` prefix on a `dataset_ids` / `dataset_id` entry. There is no new config -field. - -> Rejected alternative: a separate connection-level `dataset_project` (or -> `introspection_project`) field. It is a speculative runtime knob (against the -> repo's opinionated-defaults rule) and, more decisively, it **cannot express the -> requirement**: one connection must span *multiple* foreign projects, which a -> single global field cannot represent. The inline form also derives scope from -> the user's own declared input rather than adding a parallel setting. - -### Parse to canonical `{ project, dataset }` pairs at the config boundary - -Each entry is parsed **once**, in `bigQueryConnectionConfigFromConfig` / -`datasetIds()`, into a canonical pair: the project (when no prefix is present, -default it to `credentials.project_id`) and the bare dataset id. Every -introspection-side call site reads the resolved pair; nothing downstream re-parses -a `project.dataset` string. - -> Rejected alternative: keep `datasetIds: string[]` raw and split the prefix -> lazily at each use site (`introspectDataset`, `primaryKeys`, `listTables`, -> `testConnection`). That re-implements one rule in four places and is exactly the -> drift trap the repo's single-source-of-truth rule warns about — a later fix -> lands on one path and not another. Normalize at the boundary; carry the -> canonical form downstream. - -The internal resolved-config type (`KtxBigQueryResolvedConnectionConfig.datasetIds`) -changes shape from `string[]` to a structured pair list. That is an internal type; -the connector internals and the connector test fixtures are the only consumers. - -### Parsing rule (at the boundary) - -- An entry contains **at most one `.`**. 
-- With a dot: the segment **before** the dot is the project, validated by the - existing `normalizeBigQueryProjectId` charset - (`context/connections/bigquery-identifiers.ts`); the segment **after** is the - dataset id (validated as a normal identifier). -- Without a dot: a bare dataset; the project defaults to `credentials.project_id` - (today's behavior). -- **More than one `.`** (e.g. a stray `proj.ds.table`) is a clear config error - raised at resolution time, naming the connection — not a silent - mis-introspection. -- Legacy domain-scoped project ids that contain `:` (e.g. `example.com:proj`) stay - **out of scope**, consistent with `normalizeBigQueryProjectId`'s current charset - (which already rejects `.` and `:` in a project id). - -### Billing is never the dataset's project - -The BigQuery client is still constructed with `projectId = credentials.project_id` -(`getClient()`, `:487`–`:495`), and `createQueryJob` always bills there. Only the -*introspection* surfaces switch to the per-dataset project. Cross-project reads in -a `FROM` clause already bill to the caller — unchanged and already proven. - -### Dataset identity downstream is `(catalog, db)` - -Scanned tables are keyed by `(catalog, db, name)` throughout -(`context/scan/table-ref.ts`; `context/scan/warehouse-catalog.ts:107`). Because -the table `catalog` now holds the dataset's own project, two foreign projects that -each host an `austin_311` dataset remain distinct with no extra work — provided the -snapshot's `scope` / `metadata` also preserve the project (Requirement 6). - -### Setup-wizard scope: accept, don't discover - -The connector's region-scoped `listTables` (`:453`) is consumed **only** by the -`ktx setup` wizard's table-selection step (`setup-databases.ts`); the -ingest / `discover_data` path reads persisted snapshot JSON via -`WarehouseCatalogService.listTables`, not the connector method.
The wizard is not -expected to enumerate foreign datasets (you can't list a project you don't own). -A `project.dataset` value hand-entered at the dataset prompt, or written into -`ktx.yaml`, must be accepted, validated, and introspected. See *Non-goals* for the -region caveat that follows from this. - -## Requirements - -### R1 — Accept and parse `project.dataset` at the config boundary - -`datasetIds()` / `bigQueryConnectionConfigFromConfig` resolve each -`dataset_ids` and `dataset_id` entry into a canonical `{ project, dataset }` pair -per the parsing rule above, defaulting `project` to `credentials.project_id` when -unprefixed. A malformed entry (more than one `.`, an empty project or dataset -segment, or a project/dataset that fails identifier validation) raises a clear -error at resolution time that names the connection id. - -### R2 — Introspect each dataset in its own project - -`introspectDataset` resolves the dataset via the **dataset's** project — -`client.dataset(datasetId, { projectId })` — for `getTables()` and each -`tableRef.get()`. This requires extending the `KtxBigQueryClient.dataset` port to -accept the project (e.g. `dataset(id, projectId)` / `dataset(id, { projectId })`) -and forwarding it from `DefaultBigQueryClientFactory`. - -### R3 — Label table `catalog` with the dataset's project - -Every table produced by `introspectDataset` is labeled `catalog: ` (not the billing project), and the introspection-failure warning -metadata (`object` / `catalog`) likewise reflects the dataset's project. This is -what makes downstream sample/distinct-value/read queries resolve. - -### R4 — Primary-key discovery targets the dataset's project - -The `primaryKeys` `INFORMATION_SCHEMA.TABLE_CONSTRAINTS` / -`KEY_COLUMN_USAGE` SQL is built against -`` `..INFORMATION_SCHEMA…` ``. (This INFORMATION_SCHEMA -view is dataset-qualified and therefore region-independent.) Its existing -soft-fail-on-denied behavior (`tryConstraintQuery`, scan warning) is preserved. 
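
The parsing rule that R1 relies on can be sketched as follows. This is a minimal illustration, not the connector's actual code: `parseDatasetEntry`, the `DatasetRef` shape, both charset patterns, and the error wording are assumptions made for the example. The real project validator is `normalizeBigQueryProjectId`, whose exact charset is not reproduced here.

```typescript
// Hypothetical canonical pair; the spec only requires { project, dataset }.
interface DatasetRef {
  project: string;
  dataset: string;
}

// Assumed charsets for illustration only; the real validators live in
// context/connections/bigquery-identifiers.ts.
const PROJECT_PATTERN = /^[A-Za-z0-9_-]+$/;
const DATASET_PATTERN = /^[A-Za-z0-9_]+$/;

function parseDatasetEntry(
  entry: string,
  defaultProject: string,
  connectionId: string,
): DatasetRef {
  // Every failure names the connection and the offending entry.
  const fail = (reason: string): never => {
    throw new Error(
      `connection "${connectionId}": dataset entry "${entry}" is invalid: ${reason}`,
    );
  };

  const segments = entry.trim().split(".");
  if (segments.length > 2) {
    fail("expected `dataset` or `project.dataset`");
  }

  // Zero dots: bare dataset in the billing project. One dot: project.dataset.
  const dataset = segments[segments.length - 1] ?? "";
  const project = segments.length === 2 ? segments[0] ?? "" : defaultProject;

  // Empty segments fail here because neither pattern matches "".
  if (!PROJECT_PATTERN.test(project)) fail("invalid or empty project segment");
  if (!DATASET_PATTERN.test(dataset)) fail("invalid or empty dataset segment");

  return { project, dataset };
}
```

Parsing once at the boundary means every introspection call site receives the same resolved pair, so a bare entry and a qualified entry differ only in the `project` field they carry.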
- -### R5 — `listTables` lists each dataset in its own project - -`listTables` returns rows labeled `catalog: ` and queries -each referenced project's region `INFORMATION_SCHEMA.TABLES`. Because a connection -can now span projects, it queries per distinct project rather than assuming one. -(This is the setup-wizard surface — see the cross-region caveat in *Non-goals*.) - -### R6 — Snapshot scope and metadata reflect multiple projects - -`introspect`'s returned snapshot keeps `metadata.project_id` = the **billing** -project, but `scope.catalogs` becomes the **distinct set of dataset projects** -actually introspected. `scope.datasets` / `metadata.datasets` must stay -unambiguous when two projects share a dataset name (e.g. carry the qualified -`project.dataset`, or otherwise preserve the project). The scoped table-name -lookup that today passes `catalog: this.resolved.projectId` (`:359`) must pass -each dataset's own project so `tableScope` / `enabled_tables` filtering still -matches. - -### R7 — `testConnection` resolves foreign datasets - -`testConnection` validates each configured dataset via its own project -(`client.dataset(datasetId, { projectId }).get()`), so a connection pointing only -at foreign datasets reports success rather than a spurious `404`. - -### R8 — Billing unchanged; bare dataset is a strict no-op - -`createQueryJob` continues to bill in `credentials.project_id`. A connection whose -`dataset_ids` are all bare (no `project.` prefix) behaves **exactly** as before: -same resolved project, same `catalog` labels, same INFORMATION_SCHEMA targets, no -behavioral change. - -### R9 — `getTableRowCount` honors the parsed entry - -`getTableRowCount`'s default-dataset handling (`:431`, today -`this.resolved.datasetIds[0]`) resolves through the canonical pair so a foreign -default dataset is introspected in its own project. 
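
One way to satisfy R6's scope rules can be sketched as below. The helper names `datasetLabel` and `snapshotScope` are hypothetical, and the choice to keep a bare label inside the billing project while qualifying everything else is only one of the options R6 allows; always-qualified labels would also meet the requirement.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Hypothetical helper: bare label inside the billing project, qualified otherwise,
// so two projects that share a dataset name never produce the same label.
function datasetLabel(ref: DatasetRef, billingProject: string): string {
  return ref.project === billingProject
    ? ref.dataset
    : `${ref.project}.${ref.dataset}`;
}

// Hypothetical helper: derive the snapshot scope from the resolved pairs.
function snapshotScope(refs: readonly DatasetRef[], billingProject: string) {
  const projects = refs.map((ref) => ref.project);
  return {
    // Distinct dataset projects, in first-seen order.
    catalogs: projects.filter((project, index) => projects.indexOf(project) === index),
    datasets: refs.map((ref) => datasetLabel(ref, billingProject)),
  };
}
```

With all-bare entries this yields a single catalog (the billing project) and bare dataset labels, which is consistent with the no-op guarantee in R8.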
- -### R10 — Docs reflect the qualified form - -Document that a BigQuery `dataset_ids` / `dataset_id` entry may be written -`project.dataset` to introspect a dataset hosted in another project (billing stays -in `credentials.project_id`). Update the BigQuery rows/examples in -`docs-site/content/docs/configuration/ktx-yaml.mdx` and -`docs-site/content/docs/integrations/primary-sources.mdx` (and the dataset-scope -note in `docs-site/content/docs/cli-reference/ktx-setup.mdx`). Keep examples -copy-pasteable and follow the `fumadocs-mdx-structure` skill. - -## Acceptance criteria - -1. **Foreign single-project introspection.** With credentials in project - `ktx-spider2-lite` and `dataset_ids: ['bigquery-public-data.austin_311']`, - `ktx ingest ` introspects the tables, enriches, and samples values; - `discover_data` / `dictionary_search` return them. Tables are labeled - `catalog: 'bigquery-public-data'`. -2. **Multi-project connection.** `dataset_ids: ['bigquery-public-data.x', - 'other-project.y']` introspects **both**, each under its own project; the - snapshot's `scope.catalogs` contains both projects. -3. **Cross-project query still bills locally.** `sql_execution` of a - fully-qualified `project.dataset.table` query runs and bills in - `credentials.project_id`. -4. **Same dataset name, two projects.** `['proj-a.shared', 'proj-b.shared']` - yields two distinct dataset groups; tables do not collide. -5. **No regression.** `dataset_ids: ['my_dataset']` (or singular `dataset_id`) - behaves exactly as before — resolved under `credentials.project_id`, same - `catalog` labels and INFORMATION_SCHEMA targets. -6. **Malformed entry fails clearly.** `dataset_ids: ['proj.ds.table']` (or an - empty segment) raises a config error naming the connection, not a `404` at - scan time. -7. 
**Test coverage** (extend `packages/cli/test/connectors/bigquery/connector.test.ts`, - using the existing fake `clientFactory` harness): - - the fake `dataset()` is called with the dataset's project for a prefixed - entry, and with the billing project for a bare entry; - - a prefixed entry yields tables with `catalog: ''`; - - a mixed two-project `dataset_ids` introspects both; - - `bigQueryConnectionConfigFromConfig` rejects a multi-dot / empty-segment - entry; - - the existing single-project tests still pass unchanged. - -## Non-goals - -- **Foreign-dataset discovery in the setup wizard.** The wizard does not - enumerate datasets in projects the credentials don't own; users supply - `project.dataset` explicitly (scope decision A). -- **Cross-region `listTables`.** `listTables`' region-scoped - `region-.INFORMATION_SCHEMA.TABLES` query uses the connection-level - `location`; a foreign dataset in a *different* region than the connection's - `location` will not be listed by that wizard-facing query. This does **not** - affect ingest/`discover_data`, whose introspection path - (`introspectDataset` REST metadata + dataset-qualified PK INFORMATION_SCHEMA) is - region-independent. A per-dataset region knob is a separate spec if ever needed. -- **Domain-scoped legacy project ids** containing `:` (e.g. `example.com:proj`), - already unsupported by `normalizeBigQueryProjectId`. -- **A separate billing/introspection config field** — explicitly rejected above. - -## Implementation orientation - -Pointers from exploration; line numbers may have drifted, and the implementer owns -the design. - -- `packages/cli/src/connectors/bigquery/connector.ts` - - `datasetIds()` (`:163`) and `bigQueryConnectionConfigFromConfig` (`:278`) — - parse + canonicalize (R1); change `KtxBigQueryResolvedConnectionConfig.datasetIds` - shape. - - `KtxBigQueryClient.dataset` port (`:100`–`:110`) and - `DefaultBigQueryClientFactory.dataset` (`:130`–`:135`) — thread `projectId` - (R2). 
`getClient()` (`:487`) keeps the billing project (R8). - - `introspectDataset` (`:544`) — `dataset(id, { projectId })`, table `catalog` - + warning metadata (R2, R3). - - `primaryKeys` (`:591`) — dataset-qualified INFORMATION_SCHEMA (R4). - - `listTables` (`:453`) — per-project region INFORMATION_SCHEMA + row catalog - (R5). - - `introspect` (`:352`) — `scope.catalogs`, `scope.datasets`, scoped-name lookup - (`:359`) (R6). - - `testConnection` (`:339`) (R7); `getTableRowCount` (`:431`) (R9). -- `packages/cli/src/connectors/bigquery/live-database-introspection.ts` — wraps - `introspect`; no separate change needed (it inherits the fix). -- `packages/cli/src/context/connections/bigquery-identifiers.ts` — - `normalizeBigQueryProjectId` is the project-segment validator. -- `packages/cli/src/context/connections/dialect-helpers.ts` / - `connectors/bigquery/dialect.ts` — three-part naming; no change, but this is - *why* R3 matters. -- After implementing, rebuild and re-link so the playground picks it up: - `pnpm run build && pnpm run link:dev`. Run - `pnpm --filter @kaelio/ktx run type-check` and the connector test suite. - -## Benchmark context (motivation, not a requirement — do not encode benchmark specifics) - -Spider 2.0-Lite's **BigQuery slice (~205 questions)** is otherwise unservable -faithfully: every one of its ~74 logical databases groups datasets hosted in -foreign public projects (`bigquery-public-data`, `isb-cgc-bq`, -`data-to-insights`, …), never in a project we own. Query execution already works -cross-project; ktx-only *discovery* is the sole blocker, and it is blocked exactly -because the connector can't introspect a foreign-hosted dataset. Of 74 BQ -databases only **one** spans more than one source project, so "let `dataset_ids` -carry `project.dataset` and introspect each in its own project" covers the -benchmark and the general case alike. None of these project names belong in the -code — they are derived from the user's own `dataset_ids` input. 
- -## Implementation notes - -Implemented on branch `write-feature-spec-wiki`. The whole change is contained in -the BigQuery connector, its identifier helpers, the connector test suite, and three -docs pages. - -**Config boundary (R1).** Added `normalizeBigQueryDatasetId` -(`packages/cli/src/context/connections/bigquery-identifiers.ts`, charset -`[A-Za-z0-9_]`) next to the existing project/region validators. In -`connectors/bigquery/connector.ts`, a single `parseBigQueryDatasetEntry(entry, -defaultProject, connectionId)` parses one entry by splitting on `.`: zero dots → -bare dataset in `defaultProject`; one dot → `project.dataset` (each segment -validated; empty segment throws); two or more dots → throws. `resolveDatasetRefs` -resolves `env:`/`file:` references first, trims/filters empties, then parses each. -`bigQueryConnectionConfigFromConfig` calls it with the billing `project_id` as the -default, so the canonical pair list is produced once at the boundary. -`KtxBigQueryResolvedConnectionConfig.datasetIds` changed from `string[]` to the new -`BigQueryDatasetRef[]` (`{ project, dataset }`). All errors name -`connections..dataset_ids entry ""`. - -**Client port (R2).** `KtxBigQueryClient.dataset` now takes -`(datasetId, projectId)`; `DefaultBigQueryClientFactory` forwards -`client.dataset(datasetId, { projectId })` (`@google-cloud/bigquery` `DatasetOptions.projectId`). -`getClient()` still constructs the client with the **billing** `project_id`, so -`createQueryJob` bills locally regardless of the dataset's project (R8, acceptance 3). 
- -**Per-dataset introspection (R3–R7, R9).** Every introspection site reads the -resolved pair: `introspectDataset(ref, …)` resolves `dataset(ref.dataset, ref.project)` -and labels tables (and the introspection-failure warning, via `tryIntrospectObject`'s -`catalog.db.object`) with `ref.project`; `primaryKeys(ref)` builds dataset-qualified -`` `..INFORMATION_SCHEMA…` `` SQL; `testConnection` validates each -dataset under its own project; `getTableRowCount`'s default resolves through the first -pair. `introspect` sets `scope.catalogs` to the distinct set of dataset projects and -keeps `metadata.project_id` = billing. `scope.datasets` / `metadata.datasets` use a -`qualifiedDatasetLabel` helper — bare in the billing project (so the single-project -snapshot is byte-for-byte unchanged), `project.dataset` otherwise (so two projects with -the same dataset name stay distinct, R6/acceptance 4). - -**`listTables` (R5).** Split into `listTables` (parse override entries, group by -project) and `listTablesInProject(project, region, datasets?)`. With no override it -lists the billing project's region (unchanged); with an override it runs one -region-`INFORMATION_SCHEMA.TABLES` query per distinct project, filtered to that -project's bare datasets, and labels rows with that project. The existing single-region -test is unchanged (bare entries collapse to one billing-project query). - -**Docs (R10).** Added a "Cross-project datasets" subsection to -`integrations/primary-sources.mdx` (qualified-entry example + the setup/region caveats), -plus pointers from `configuration/ktx-yaml.mdx` and `cli-reference/ktx-setup.mdx`. 
- -**Tests.** Extended `test/connectors/bigquery/connector.test.ts`: parse-to-pairs and -malformed-entry rejection (`proj.ds.table`, `proj.`, `.ds`); a foreign-only connection -calls `dataset('austin_311', 'bigquery-public-data')`, labels tables -`catalog: 'bigquery-public-data'`, builds the client with the billing project, and keeps -`metadata.project_id` local; a mixed `['bigquery-public-data.austin_311', 'analytics']` -connection introspects both under their own projects; and `['proj_a.shared', -'proj_b.shared']` stays distinct. The internal `datasetIds`-shape assertion was updated -to the pair list; all pre-existing behavioral tests pass unchanged. - -**Verification.** `pnpm --filter @kaelio/ktx run type-check`, the connector suite -(18 tests), `test/setup-databases.test.ts` + `bigquery-identifiers.test.ts`, -`pnpm run build`, `pnpm run dead-code` (Biome + Knip default + production), -`pnpm run link:dev` (`ktx-dev` → 0.12.0), and `pre-commit` on the changed files all -pass. Acceptance criteria 1–4 are exercised by unit tests with the fake client factory; -criteria 5–6 by unit tests; criterion 3 (cross-project query bills locally) is -structurally guaranteed (single billing client) and asserted via the `createClient` -project. End-to-end ingest against live `bigquery-public-data` was not run here (no live -credentials in this worktree); the `link:dev` binary is ready for the playground agent to -validate. - -**No deviations from the spec design.** The only judgment call: `scope.datasets` -renders bare-in-billing / qualified-otherwise rather than always-qualified, chosen to -satisfy both the no-regression requirement (R8/acceptance 5) and the disambiguation -requirement (R6/acceptance 4) with one unambiguous, dot-delimited form. 
diff --git a/spider2-specs/specs/19-durable-bounded-relationship-detection.md b/spider2-specs/specs/19-durable-bounded-relationship-detection.md deleted file mode 100644 index 3aecf45b..00000000 --- a/spider2-specs/specs/19-durable-bounded-relationship-detection.md +++ /dev/null @@ -1,471 +0,0 @@ -# Durable, resumable, bounded relationship detection during ingest enrichment - -> Refined spec. Intake draft: `todo/19-durable-bounded-relationship-detection.md`. -> -> **Scope: make the expensive part of ingest enrichment survive an interrupted -> relationship stage.** Today the paid LLM descriptions + embeddings only become -> durable and queryable after the slowest, most-killable, least-valuable stage -> (relationship detection) also finishes. This spec moves the persistence boundary -> to the cost boundary, makes stage resume work across runs, and bounds + observes -> the one open-ended stage — the durability companion to spec 16 (bounded query -> execution), which this spec composes with rather than replaces. 
- -## Problem - -Three compounding failure modes, all confirmed in the current code, share one root -cause: **the three enrichment stages are treated as a single atomic unit for -persistence, identity, and bounding, even though they differ radically in cost, -durability value, runtime, and likelihood of being killed.** - -`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:472`) runs three stages -in a fixed order through `runEnrichmentStage` (`:413`): - -| stage | order | cost | durability value | runtime on a large schema | likely to be killed | -|-------|-------|------|------------------|---------------------------|---------------------| -| `descriptions` (`:524`) | 1st | high — one paid LLM call per table | high | minutes | low | -| `embeddings` (`:553`) | 2nd | medium | high | seconds–minutes | low | -| `relationships` (`:587`) | 3rd | low — best-effort joins | low | **minutes, silent** | **high** | - -The slowest, most-killable, least-valuable stage runs **last**, and it gates the -durability of the two expensive stages held in memory before it. - -### 1. Enrichment is lost if relationship detection is interrupted - -The queryable artifact agents search and execute against is the `_schema` manifest -YAML (`semantic-layer//_schema/*.yaml`). It is written **twice**: - -- bare (native column comments only) early, at `local-scan.ts:473` - (`writeLocalScanManifestShards`), before enrichment runs; and -- rewritten **with AI descriptions + accepted joins** by - `writeLocalScanEnrichmentArtifacts` (`local-enrichment-artifacts.ts:310`), called - from `local-scan.ts:510` **after** `runLocalScanEnrichment` returns — i.e. after - all three stages. - -So the descriptions and embeddings reach the queryable layer only via that single -terminal write. 
If the process is killed/crashes/times out **during** the -`relationships` stage, `runLocalScanEnrichment` never returns, the terminal write -never runs, and the in-memory descriptions + embeddings are discarded — the -`_schema` retains only the bare native comments from the `:473` write. - -Empirically (intake draft): ingesting a 95-table BigQuery dataset produced full -descriptions + embeddings (progress reached "Building embeddings 17/17"), then the -relationship stage ran silently past a supervising deadline and was killed; the -persisted `_schema` had **0** AI descriptions. The most expensive work is the most -likely to be thrown away. - -> A stage-state store (below) does save each completed stage's output to an -> internal SQLite cache as the stage finishes — so the descriptions are not lost to -> the *resume cache*. They are simply never **promoted** to the queryable `_schema` -> until the terminal write. The data survives somewhere the agent cannot query, and -> (per failure mode 2) cannot be reused on the next run either. - -### 2. Re-running does not resume — it re-spends - -`runEnrichmentStage` resolves a completed stage with -`findCompletedStage({ runId, stage, inputHash })` (`local-enrichment.ts:427`), and -the store keys on **`runId`**: `SqliteLocalScanEnrichmentStateStore` declares -`PRIMARY KEY (run_id, stage)` and filters lookups by `run_id` -(`sqlite-local-enrichment-state-store.ts:83,91–115`). `runId` is minted fresh per -ingest invocation (`record.runId`). The cache therefore only resolves *within* one -run; re-running an interrupted ingest gets a new `runId`, misses every cached -stage, and **recomputes descriptions + embeddings from scratch** — re-paying for -LLM work that already succeeded. - -The store already computes and persists `inputHash` next to `runId` — -a stable `sha256` of `{ snapshot, mode, detectRelationships, providerIdentity, -relationshipSettings }` (`enrichment-state.ts:78`). 
The correct content key is -already on the row; the lookup just uses the volatile column. This is a keying -defect, not a missing capability. - -### 3. Relationship detection is unobservable and unbounded - -`discoverKtxRelationships` (`context/scan/relationship-discovery.ts:218`) profiles a -row sample of **every enabled table** (`profileKtxRelationshipSchema`, -`relationship-profiling.ts:320` — one sampled query per table at -`profileConcurrency`, default 4), validates candidate joins -(`relationship-validation.ts:237` — one coverage query per candidate), and detects -composite keys (`relationship-composite-candidates.ts:515` — per-table plus -cross-table queries). None of the controls the rest of the scan pipeline relies on -were ever wired into this stack: - -- **No progress.** `discoverKtxRelationships` does not accept a progress port; the - caller can only emit start/end around it (`local-enrichment.ts:600,611` — - `update(0, 'Detecting relationships')` … `update(1, 'found N')`). Minutes of - silence between. -- **No honored cancellation.** `KtxScanContext.signal` exists on the contract - (`types.ts`) but **no sub-stage reads it**. -- **No time budget.** Validation has a *count* budget (`validationBudget`, default - `min(2 × tableCount, 1000)`); profiling and composite detection have none. On a - schema with hundreds–thousands of tables, profiling is O(tables) silent queries - with no internal stop condition. - -A supervisor watching for liveness cannot tell a slow-but-working profile from a -true hang, and nothing inside the stage will voluntarily stop — so on a very large -schema it runs far past any reasonable deadline and is killed (which, via failure -mode 1, takes the descriptions with it). 
- -## Generic use case (independent of any benchmark) - -Any context layer that enriches a real warehouse with paid LLM work must make that -work durable the instant it is produced, resume it across process restarts without -re-paying, and bound the open-ended profiling stage so a large catalog cannot hang -ingest indefinitely. A data team ingesting a 500-table production warehouse over a -flaky connection, a rate-limited LLM budget, or a CI step with a wall-clock limit -hits all three failure modes regardless of any benchmark. This is general -durability and cost hygiene for the ingest pipeline; the benchmark only made it -acute at scale. - -## Design decisions (resolved during refinement) - -These resolve ambiguities the intake draft left open. They constrain the -implementer; the exact code is theirs (requirement-level, per the specs README). - -### D1 — Checkpoint queryable artifacts at the cost boundary, before relationships - -As soon as the last non-relationship stage completes — `embeddings` when an -embedding provider is configured, otherwise `descriptions` — persist the -descriptions + embeddings into the **queryable** `_schema` manifest (and the raw -`descriptions.json` / `embeddings.json` enrichment artifacts), **before** the -`relationships` stage runs. The relationship stage then writes its joins on top: the -manifest builder already re-reads and preserves existing descriptions and -manual/inferred joins on rewrite (`loadExistingManifestState`, -`local-enrichment-artifacts.ts:196`), so the second write is additive, not -destructive. - -Net invariant: **the descriptions + embeddings are always durable and queryable the -moment they are computed**, even if relationship detection then fails, is -interrupted, is budget-truncated, or is skipped. A failed/partial/skipped -relationship stage degrades to "no joins" or "partial joins" — **never** to "no -descriptions." This is the inverse guarantee the current terminal-write ordering -violates. 
- -The bare `:473` manifest write stays — it is the queryable schema for the -no-providers / enrichment-disabled path. The checkpoint is an additional write that -runs only when enrichment produced descriptions. - -> Orientation (the implementer owns the seam): the lowest-coupling shape is a -> checkpoint hook — `runLocalScanEnrichment` invokes a caller-supplied callback once -> the last non-relationship stage completes, and `local-scan.ts` supplies a callback -> that calls the existing `writeLocalScanEnrichmentArtifacts` for the -> descriptions + embeddings + manifest only (no generated joins yet). The final -> write after the relationship stage proceeds as today. Relationship-specific -> artifacts (`relationships.json`, `relationship-profile.json`, -> `relationship-diagnostics.json`) are written by the final/relationship write, not -> the checkpoint, so the checkpoint never emits misleading empty relationship -> diagnostics. -> -> Rejected alternative: move all artifact writing inside `runLocalScanEnrichment` -> (inject the file store / project). That couples the enrichment module to -> persistence for no gain — the writer already lives in `local-scan.ts` and the -> checkpoint needs only a one-line hook, not a relocation. - -### D2 — Resume by content identity, not by `runId` - -Re-key completed-stage resolution on **`(connectionId, stage, inputHash)`**, -independent of `runId`, so a re-run with an unchanged schema and config resumes the -finished `descriptions` / `embeddings` stages from cache and re-runs only what -actually failed. `inputHash` is already the content fingerprint; `connectionId` -scopes it to the right source. When several rows share a content identity (one per -prior run), the most recent `updatedAt` wins. - -`runId` stays on the stored row for diagnostics and for `listRunStages`, but leaves -the uniqueness/lookup key. - -The state store is a **disposable local resume cache** (`.ktx` local state, -regenerable from a fresh ingest). 
Re-key it with **no migration bridge** — recreate -the table if its on-disk shape differs from the new `(connection_id, stage, -input_hash)` key, consistent with ktx's no-backward-compatibility policy. Losing the -old cache only means one ingest cannot resume; it never corrupts a queryable -artifact. - -> Rejected alternative: include `syncId` or `mode` in the key. `mode` and the rest -> are already folded into `inputHash`; adding them again would only narrow the key -> and re-break cross-run resume when an incidental field differs. - -### D3 — Make the relationship stage observable and bounded - -Thread three things the rest of the pipeline already supports through -`discoverKtxRelationships` into profiling, validation, and composite detection: - -- **Progress** through the existing progress port (the relationship phase is - already `progress?.startPhase(0.25)` at `local-enrichment.ts:586`): emit per-unit - liveness — "Profiling table K/N", "Validating candidate K/M", and the equivalent - for composite probing — so a supervisor can distinguish slow-but-working from - hung. -- **A flat wall-clock budget** for the whole relationship stage: a new - `scan.relationships.detectionBudgetMs`, a positive integer of milliseconds, - project-level, validated like the other `scan.relationships` fields, **default - 600_000 (10 min), enforced by default.** Checked at unit boundaries (before each - table profile, each candidate validation, each composite probe). It sits **above** - spec 16's per-query deadline (default 30s): each individual query is already - bounded; this bounds the *sum* of them. -- **Honored cancellation:** where `KtxScanContext.signal` is available, the same - unit-boundary check honors it, so external cancellation stops the stage too. 
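The unit-boundary check described in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ktx implementation: `createDetectionBudget`, `runUnits`, and the `StopReason` values are hypothetical names, and the units run sequentially here for brevity where the real stage may run them concurrently.

```typescript
// Hypothetical sketch; none of these names are taken from the ktx codebase.
type StopReason = "budget_exhausted" | "aborted";

interface DetectionBudget {
  /** True while new work may still be scheduled. Sticky once it turns false. */
  check(): boolean;
  /** Why scheduling stopped, or null if it has not stopped. */
  stopReason(): StopReason | null;
}

function createDetectionBudget(
  budgetMs: number,
  // Structural subset of AbortSignal, so a real AbortSignal can be passed.
  signal?: { readonly aborted: boolean },
  // Injectable clock so tests can drive time deterministically.
  now: () => number = Date.now,
): DetectionBudget {
  if (!Number.isInteger(budgetMs) || budgetMs <= 0) {
    throw new Error("detectionBudgetMs must be a positive integer");
  }
  const deadline = now() + budgetMs;
  let reason: StopReason | null = null;
  return {
    check() {
      if (reason === null) {
        if (signal?.aborted) reason = "aborted";
        else if (now() >= deadline) reason = "budget_exhausted";
      }
      return reason === null;
    },
    stopReason: () => reason,
  };
}

// Check the budget before each unit; work already started is allowed to finish.
async function runUnits<T, R>(
  units: readonly T[],
  budget: DetectionBudget,
  work: (unit: T) => Promise<R>,
): Promise<{ results: R[]; partial: { reason: StopReason } | null }> {
  const results: R[] = [];
  for (const unit of units) {
    if (!budget.check()) {
      return { results, partial: { reason: budget.stopReason()! } };
    }
    results.push(await work(unit));
  }
  // The check only happens when work remains, so a deadline that elapses
  // after the last unit does not mark a fully processed stage partial.
  return { results, partial: null };
}
```

Because the check is made only at unit boundaries, the stage can overshoot the budget by at most the duration of the work already in flight, which the per-query deadline bounds separately.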
- -On budget exhaustion or abort: stop scheduling new work, let in-flight queries -finish (each already bounded by spec 16), finalize with the relationships found so -far, and return a **partial** result — never an unbounded hang and never an -exception that would lose the checkpointed descriptions. - -> Rejected alternative — per-table-scaled budget (N seconds × table count). It is a -> second formula to reason about and "more tables → more budget" partly re-opens the -> unbounded door this requirement closes. One flat, generous, project-level number -> matches how the other `scan.relationships` knobs are shaped and is enough for a -> best-effort stage whose partial output is durable and improvable (D4). -> -> Rejected alternative — a global `KTX_RELATIONSHIP_BUDGET_MS` env knob or a -> per-call override. One opinionated project-level default with a config override is -> the canonical ktx shape; no second runtime path. - -### D4 — A budget-truncated partial is a successful, cached, completed stage - -A graceful budget stop is **not** a failure. The relationship stage saves its -partial result like any completed stage (so a plain re-run resumes it for free, no -re-querying) and marks it `partial` with a reason in the relationship diagnostics -plus a recoverable scan warning. Because `detectionBudgetMs` lives in -`relationshipSettings ⊂ inputHash`, **raising the budget changes the content -identity and triggers a fresh, fuller run** — that is the only "try harder" -mechanism, with no extra flag or runtime path. - -Distinguish the two stop kinds: - -- **Process killed mid-stage** (crash / SIGKILL / supervisor): nothing is saved as - completed, so the next run recomputes the relationship stage (after resuming - descriptions/embeddings from cache via D2). This is the primary durability path. -- **Graceful budget/abort stop**: a partial *is* saved as completed-partial and - resumed cheaply on re-run, unless the budget is raised. - -## Requirements - -### 1. 
Checkpoint descriptions + embeddings before relationship detection - -The descriptions and embeddings MUST be persisted into the durable, queryable -`_schema` manifest (and the raw enrichment artifacts) as soon as the last -non-relationship stage completes, before the `relationships` stage runs. -Relationship detection appends/merges its joins on completion. The expensive LLM + -embedding enrichment MUST be queryable even if the relationship stage subsequently -fails, is interrupted, is budget-truncated, or is skipped. A failed/partial/skipped -relationship stage MUST degrade to "no/partial joins," never to "no descriptions." - -### 2. Stage resume resolves by content identity across runs - -Completed-stage resolution MUST key on `(connectionId, stage, inputHash)`, -independent of `runId`, so re-running an interrupted ingest resumes the finished -`descriptions` / `embeddings` stages from cache and re-runs only what failed. -Re-running after an interruption MUST NOT re-issue LLM description or embedding -calls for stages that already completed. The resume cache MAY be recreated without a -migration bridge if its schema changes (it is disposable local state). - -### 3. Relationship detection emits progress and honors a wall-clock budget - -The relationship stage MUST emit per-unit progress through the existing progress -port (at minimum per-table during profiling and per-candidate during validation) so -liveness is observable. It MUST enforce a flat wall-clock budget -(`scan.relationships.detectionBudgetMs`, default 600_000 ms, project-level, -overridable, validated as a positive integer) checked at unit boundaries and layered -above spec 16's per-query deadline, and MUST honor `KtxScanContext.signal` where -available. On budget exhaustion or abort it MUST stop scheduling new work, finalize -with the relationships found so far, and return a partial result rather than running -unboundedly or throwing. - -### 4. 
A budget-truncated relationship result is durable and marked partial - -A graceful budget/abort stop MUST persist the partial relationship result as a -completed stage (so a plain re-run resumes it without re-querying) and MUST mark it -`partial` — in the relationship diagnostics artifact and as a recoverable scan -warning — so downstream consumers can see the joins are incomplete. Raising -`detectionBudgetMs` (which changes `inputHash`) MUST cause a fresh, fuller -relationship run; no separate flag is introduced for "redo." A process killed -mid-stage MUST NOT leave a completed record (so it recomputes on re-run). - -### 5. No regression for small or uninterrupted ingests - -A small or single-run ingest that is never interrupted MUST produce the same -artifacts and the same relationship output as today. The checkpoint write MUST be -idempotent with the final write (descriptions survive the join rewrite); the budget -default MUST be generous enough that normal and large-but-tractable schemas complete -relationship detection fully, hitting the budget only on pathological scale. - -## Acceptance criteria - -- **Durability across interruption:** interrupting an ingest **during** relationship - detection still leaves a queryable semantic layer carrying the table/column - descriptions + embeddings that were generated (verified: re-open the connection; - AI descriptions are present in `_schema`, not just native comments). -- **Resume does not re-spend:** re-running an interrupted ingest does **not** - regenerate descriptions/embeddings whose stage already completed (verified: no LLM - description calls and no embedding calls for the cached tables; only the failed - stage re-runs). Resolution is by `(connectionId, stage, inputHash)`, so the resume - survives a fresh `runId`. 
-- **Observable + bounded relationships:** a connection with hundreds of tables emits - relationship-stage progress (per-table profiling, per-candidate validation) and - completes within `detectionBudgetMs`; when the budget is hit, the stage stops - gracefully and persists the partial relationships found so far — without - discarding enrichment — marked `partial` in diagnostics and via a recoverable - warning. -- **Partial is cached and improvable:** re-running with an unchanged budget resumes - the partial relationship result from cache (no re-querying); raising - `detectionBudgetMs` triggers a fresh, fuller relationship run. -- **Budget validation:** `detectionBudgetMs` defaults to 600_000, honors a project - override, and rejects an invalid value (zero / negative / non-integer) as a clear - `ktx.yaml` config error. -- **No regression:** small/single-run ingests behave exactly as before — identical - artifacts and relationship output when nothing is interrupted; the checkpoint + - final writes leave descriptions intact alongside the generated joins. - -## Non-goals - -- **Bounding the descriptions stage's per-table LLM call.** Whether an individual - enrichment LLM call can wedge is a separate concern (already being addressed in the - working tree via a per-table enrichment timeout). This spec ensures whatever - descriptions *did* complete are durable; it does not own the per-call timeout. -- **Changing relationship-detection quality, thresholds, or the candidate/validation - algorithm.** The accept/review thresholds, scoring, and the existing - `validationBudget` count cap are unchanged; this spec adds durability, - cross-run resume, progress, and a time budget around them. -- **A per-connection or per-call relationship budget, or a global env override.** - One flat project-level `detectionBudgetMs`; no second runtime path (D3). 
-- **A new per-query timeout.** Spec 16 already bounds individual queries; this spec - composes above it and does not re-implement query-level deadlines. -- **Replacing the per-query deadline with the stage budget, or vice versa.** They - are independent and layered: a single query is bounded by spec 16; the stage's sum - is bounded by `detectionBudgetMs`. -- **A general checkpoint framework for every ingest stage.** The checkpoint is - specifically the descriptions+embeddings → queryable-manifest promotion before - relationships; it is not a generic per-stage artifact-flush abstraction. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns the -design. - -- **Enrichment orchestration** — `context/scan/local-enrichment.ts`: - `runLocalScanEnrichment` (`:472`), the three `runEnrichmentStage` calls - (`descriptions` `:524`, `embeddings` `:553`, `relationships` `:587`), - `runEnrichmentStage` (`:413`) and its `findCompletedStage` lookup (`:427`). Add the - checkpoint hook after the last non-relationship stage; thread the progress port, - signal, and budget into the relationship stage. -- **Scan driver / write ordering** — `context/scan/local-scan.ts`: bare manifest - write (`:473`), enrichment call (`:492`, currently passing only - `{ runId, progress }` as `context` — wire `signal` through here too), terminal - `writeLocalScanEnrichmentArtifacts` (`:510`), and the enrichment-failure catch - (`:530`, which after D1 no longer loses descriptions). Supply the checkpoint - callback here. -- **Artifact writer** — `context/scan/local-enrichment-artifacts.ts`: - `writeLocalScanEnrichmentArtifacts` (`:310`), `writeLocalScanManifestShards` - (`:270`), and the description-preserving merge in `loadExistingManifestState` - (`:196`) — the basis for the additive checkpoint/final write. 
-- **Resume cache** — `context/scan/sqlite-local-enrichment-state-store.ts`: - `PRIMARY KEY (run_id, stage)` (`:83`), `findCompletedStage` (`:91`), - `saveCompletedStage` (`:117`). Re-key on `(connection_id, stage, input_hash)`, - pick latest `updated_at`, recreate the table if shape differs (disposable cache). - Lookup interface `KtxScanEnrichmentStageLookup` and `findCompletedStage` - in `context/scan/enrichment-state.ts` (`:10,46`); `computeKtxScanEnrichmentInputHash` - (`:78`). -- **Relationship stack (progress + budget + signal)** — - `context/scan/relationship-discovery.ts` (`discoverKtxRelationships` `:218`, accept - a progress port and budget/deadline + signal), - `context/scan/relationship-profiling.ts` (`profileKtxRelationshipSchema` `:320` — - per-table progress + budget check), - `context/scan/relationship-validation.ts` (`validateKtxRelationshipDiscoveryCandidates` - `:237` — per-candidate progress + budget check, alongside the existing - `validationBudget`), - `context/scan/relationship-composite-candidates.ts` - (`discoverKtxCompositeRelationships` `:515` — budget check). -- **Config** — `context/project/config.ts` `scan.relationships` - (`KtxScanRelationshipConfig`, `:171–213`): add `detectionBudgetMs` (positive - integer ms, default 600_000) to the zod schema and the default config builder. -- **Partial marker** — `context/scan/relationship-diagnostics.ts` - (`buildKtxRelationshipDiagnostics`, the profile/diagnostics artifact shape) carries - a `partial` flag + reason; add a recoverable warning code to the - `KtxScanWarningCode` union in `context/scan/types.ts` (e.g. - `relationship_detection_partial`). -- **Tests** — durability: a fixture ingest interrupted during the relationship stage - leaves AI descriptions in the queryable `_schema`. Resume: a second run with a - fresh `runId` and unchanged `inputHash` resolves the cached descriptions/embeddings - (assert no LLM/embedding calls) and re-runs only relationships. 
Budget: a schema - large enough (or a tiny `detectionBudgetMs` as the test seam) hits the budget, - emits per-unit progress, returns partial, persists it marked `partial`, and a - re-run resumes the partial; raising the budget re-runs. Resolver/config unit tests - for `detectionBudgetMs` (default / override / invalid). Regression: small - uninterrupted ingest yields identical artifacts and relationship output. -- After implementing, rebuild and re-link so the playground picks it up: - `pnpm run build && pnpm run link:dev`. - -## Benchmark context (motivation, not a requirement) - -The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables -(`ebi_chembl` 785, `fec` 486, `ga360` 366, …). Enriching them with claude-code -costs real, rate-limited LLM budget; losing that enrichment to a relationship-stage -interruption — and re-spending it on every retry — makes large-schema ingest -impractical, and an unbounded profiling stage runs past any supervising deadline and -is killed. This is a general durability/cost property of the ingest pipeline, -independent of the benchmark; the benchmark only made it acute at scale. Do not -encode any benchmark specifics in the implementation. - -## Implementation notes - -Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All -four design decisions shipped; no deviations from the resolved design. - -**D2 — resume by content identity** (`sqlite-local-enrichment-state-store.ts`, -`enrichment-state.ts`, `local-enrichment.ts`): the stage table is re-keyed to -`PRIMARY KEY (connection_id, stage, input_hash)`; `findCompletedStage` looks up by -`(connectionId, stage, inputHash)` ordered by `updated_at DESC` (most recent -content identity wins). `KtxScanEnrichmentStageLookup.runId` became `connectionId`; -`runId` stays on the row for diagnostics/`listRunStages`. 
The store drops and -recreates the table when the on-disk primary key differs (disposable cache, no -migration bridge), detected via `PRAGMA table_info`. - -**D3 — observable + bounded relationship stage** (new -`relationship-detection-budget.ts`): a sticky `KtxRelationshipDetectionBudget` -(`check()`/`stopReason()`) built from `detectionBudgetMs` + `ctx.signal` + an -injectable `now`, plus `mapWithBudget` (a budget-aware concurrent map that -generalizes and replaces the old `mapWithConcurrency`). Threaded through -`discoverKtxRelationships` → profiling (per-table progress + budget stop), -validation (per-candidate progress + budget stop; budget-skipped candidates -degrade to the existing `validation_unattempted` review), and composite -detection (budget stops at PK-detection and coverage-probe boundaries). -`discoverKtxRelationships` now accepts `progress` and `now` and returns -`partial: { reason } | null`. The clock check fires only when work remains, so a -deadline elapsing after the last unit never marks a fully-processed stage partial. - -**D1 — checkpoint before relationships** (`local-enrichment.ts`, -`local-enrichment-artifacts.ts`, `local-scan.ts`): `runLocalScanEnrichment` fires a -caller-supplied `onCheckpoint` once descriptions/embeddings complete and before -the relationship stage runs, gated on `shouldDetectRelationships` so the -no-relationship path keeps a single write. `local-scan.ts` supplies a callback -calling the new `writeLocalScanEnrichmentCheckpoint` (descriptions.json + -embeddings.json + manifest with descriptions and no generated joins — no -relationship artifacts, so no misleading empty diagnostics). The shared -description/embedding JSON writer was factored out so checkpoint and final writes -stay one implementation. `ctx.signal` is now threaded from `RunLocalScanOptions` -into the enrichment context (completing the existing `KtxScanContext.signal` -contract already read by the budget and the in-flight description timeout). 
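The ordering this note describes can be illustrated with a reduced sketch. It is an assumption-laden simplification, not the shipped code: `runEnrichment`, `EnrichmentStages`, and the string-array payloads are hypothetical stand-ins, and only the `onCheckpoint` and `shouldDetectRelationships` names come from the note itself.

```typescript
// Hypothetical, simplified stand-ins for the enrichment stages.
interface EnrichmentStages {
  descriptions(): Promise<string[]>;
  relationships(): Promise<string[]>;
}

interface EnrichmentResult {
  descriptions: string[];
  joins: string[];
}

async function runEnrichment(
  stages: EnrichmentStages,
  shouldDetectRelationships: boolean,
  onCheckpoint?: (descriptions: string[]) => Promise<void>,
): Promise<EnrichmentResult> {
  const descriptions = await stages.descriptions();
  if (!shouldDetectRelationships) {
    // No relationship stage: the caller's single final write covers everything.
    return { descriptions, joins: [] };
  }
  // Make the paid work durable before the open-ended stage starts.
  await onCheckpoint?.(descriptions);
  // If this throws, the checkpoint above has already been written.
  const joins = await stages.relationships();
  return { descriptions, joins };
}
```

In this shape the enrichment function never touches persistence directly; the caller decides what the checkpoint writes.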
- -**D4 — partial is durable + marked** (`relationship-diagnostics.ts`, -`local-enrichment.ts`, `local-enrichment-artifacts.ts`): the diagnostics artifact -carries `partial` + `partialReason`; `runLocalScanEnrichment` pushes a recoverable -`relationship_detection_partial` warning (new `KtxScanWarningCode`) when truncated. -A graceful budget/abort stop returns normally, so the relationship stage saves as a -completed-partial record and resumes cheaply; a process killed mid-stage saves -nothing and recomputes. Raising `detectionBudgetMs` changes `inputHash` -(it lives in `relationshipSettings`), forcing a fresh, fuller run — the only -"try harder" mechanism, no extra flag. - -**Config** (`config.ts`): `scan.relationships.detectionBudgetMs`, positive integer -ms, default `600_000`, validated like the other relationship fields. Documented in -`docs-site/content/docs/configuration/ktx-yaml.mdx`. - -**Tests** (all green): budget unit tests (`relationship-detection-budget.test.ts`); -cross-run resume + table-recreate (`enrichment-state.test.ts`, -`local-enrichment.test.ts`); progress/budget/abort partial -(`relationship-discovery.test.ts`); partial persisted/resumed/re-run-on-raise + -checkpoint ordering + no-checkpoint-when-skipped (`local-enrichment.test.ts`); -end-to-end durability — a relationship-stage failure still leaves AI descriptions -in the queryable `_schema` (`local-scan.test.ts`); diagnostics partial flag -(`relationship-diagnostics.test.ts`); config default/override/invalid -(`config.test.ts`). `pnpm --filter @kaelio/ktx type-check`, `pnpm run dead-code`, -and `pnpm run build && pnpm run link:dev` all pass. (Pre-existing and unrelated: -three `analytics-skill-content.test.ts` markdown-structure assertions fail on this -branch from earlier analytics-skill commits — untouched here.) 
diff --git a/spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md b/spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md deleted file mode 100644 index 1f4ad022..00000000 --- a/spider2-specs/specs/20-resilient-enrichment-under-slow-llm.md +++ /dev/null @@ -1,533 +0,0 @@ -# Resilient enrichment under a slow/hung LLM backend - -> Refined spec. Intake draft: `todo/20-resilient-enrichment-under-slow-llm.md`. -> -> **Scope: make the descriptions enrichment stage survive a hung LLM backend and -> an interrupted run.** Two compounding gaps live *inside* the per-table -> description-enrichment path: (1) the per-table LLM timeout fires in JS but does -> not terminate a wedged subprocess backend, so a hung table wedges the whole -> stage indefinitely; (2) descriptions are persisted only at full-stage -> completion, so any interruption discards every already-enriched table. This is -> the enrichment-stage analog of spec 16 (enforced query cancellation — a deadline -> that *stops the work*, not just abandons the promise) and spec 19 (move the -> durability boundary to the cost boundary so expensive LLM work is not lost). It -> composes with both rather than replacing them. - -## Problem - -Two compounding failure modes on the per-table description-enrichment path, both -confirmed in the current code and observed end-to-end together. Their union turned -a single hung table into an indefinite wedge *plus* total loss of an entire -stage's LLM work. - -### 1. The per-table LLM timeout does not terminate the work - -`KtxDescriptionGenerator.generateBatchedTableDescriptions` -(`context/scan/description-generation.ts`, the bounded call ~760–866) wraps the -per-table `this.llmRuntime.generateObject(...)` call in `retryAsync` with a fresh -`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (commit `01f63380`). -A fired timeout is surfaced as `KtxAbortedError` so it is **not** retried (one -wedge stays one timeout, not 3×). 
That is the correct policy — but the abort never -actually stops a subprocess backend, so the timeout is cosmetic. - -The runtime is selected by the `backend` config field -(`context/llm/local-config.ts`, `KTX_LLM_BACKENDS = -['none','anthropic','vertex','gateway','claude-code','codex']`). Two backends spawn -a **child process the SDK owns** and to which ktx hands only an `AbortSignal`: - -- **`codex`** (`@openai/codex-sdk`, via `context/llm/codex-runtime.ts` → - `codex-sdk-runner.ts`): the SDK runs `spawn(executable, args, { signal })`. Node's - `spawn` signal-option sends the child **SIGTERM** (not SIGKILL) on abort, and the - SDK consumes the child's stdout with `for await (const line of rl)`, re-throwing - the abort error **only after that loop ends**. A child wedged on a hung provider - socket survives SIGTERM → its stdout never closes → the readline loop never ends - → the SDK never throws → ktx's `await generateObject` **never settles**, past the - per-attempt timeout, indefinitely. The child leaks (open provider connections, - ~0% CPU). -- **`claude-code`** (`@anthropic-ai/claude-agent-sdk`, via - `context/llm/claude-code-runtime.ts`, `collectResult` ~275–322): on abort it calls - best-effort `queryResult.interrupt?.()` (errors swallowed) and only checks - `throwIfAborted` **between** streamed messages. A wedged child emits no message, so - the `for await (const message of queryResult)` loop blocks and the graceful - `interrupt()` may never land — the same hang class. - -By contrast, **HTTP backends** (`anthropic`/`vertex`/`gateway`/`openai`, via -`context/llm/ai-sdk-runtime.ts`) pass `abortSignal` straight to the AI SDK's -`generateObject`, which cancels the underlying `fetch` natively — the await settles -promptly and there is no child to leak. - -So ktx holds **no kill handle** on the subprocess backends, and SIGTERM is too -gentle for a wedged child. 
Spec 16's mechanism (ktx *itself* forks -`read-query-child` and `SIGKILL`s it) works precisely because ktx owns the fork — -which it does not here. - -Observed (BigQuery ingest, codex backend, 2026-06-23): with -`KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min, an operator override), two of -`covid19_usa`'s 252-column tables hung; the stage sat at **268/285 for 41+ -minutes** — well past the 30-min per-attempt timeout — with exactly two codex -children, each holding 3 ESTABLISHED connections at ~0% CPU, until killed by hand. - -### 2. Descriptions are persisted only at full-stage completion - -`generateDescriptions` (`context/scan/local-enrichment.ts` ~279–352) fans out -per-table work through `pLimit(DESCRIPTION_TABLE_CONCURRENCY)` (default 4) and -**accumulates every table's result in an in-memory `updates` array**, returned only -when the whole stage finishes. `runEnrichmentStage` (~413, ~421–474) then calls -`saveCompletedStage` (writing the whole-stage row to `local_scan_enrichment_stages`) -**after** `compute()` returns, and the spec-19 checkpoint write -(`writeLocalScanEnrichmentCheckpoint`, `local-enrichment-artifacts.ts` ~351–379, -fired by the `onCheckpoint` hook in `local-scan.ts`) also runs **only once the -descriptions stage completes**. There is no within-stage persistence: while the -stage runs, every enriched table's description lives only in memory. - -So if the stage cannot complete — 2 of 285 tables hang (gap #1), or the process is -killed, or a supervising watchdog fires — **all** already-enriched tables are lost, -even though their (expensive, paid) LLM descriptions were finished. On the next run, -`findCompletedStage` finds no row, so the descriptions stage **recomputes from -scratch**. - -Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but -**0** rows in `local_scan_enrichment_stages` and **0** `ai:` descriptions on disk; -killing the wedged ingest discarded all 283, forcing a from-scratch re-ingest. 
The -cost of 2 pathological tables was 283 tables' worth of redone LLM calls. - -Sharper still (re-ingest with a short, *enforced* timeout): even when the stage -**runs to the end** — the 2 hung tables hit their timeout and were skipped, so -**283/285** descriptions were generated and the ingest reported success (`Scan -completed` / `Ingest finished`, embeddings built, exit 0) — the descriptions were -**still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the loss is **not** -only "discarded on kill": a stage that completes with *any* skipped/aborted table -threw away **every** successfully-generated description. The skip must be -**graceful** — a skipped table costs one missing description, not the entire stage's -output — which is the strongest argument for per-table incremental persistence: the -283 good descriptions should have been durable the moment each was produced. - -The on-disk artifacts already carry everything needed to fix this *additively*: the -`_schema` manifest encodes per-table completion (a table with `descriptions.ai` is -AI-enriched), and rewrites preserve existing descriptions -(`mergeDescriptionsPreservingExternal`, `manifest.ts` ~96–115; -`loadExistingManifestState`, `local-enrichment-artifacts.ts` ~196–253 — the basis -spec 19 relies on). The durable record and the resume-skip set can be **derived from -the system's own on-disk state**, with no new cache schema. - -## Generic use case (independent of any benchmark) - -Anyone ingesting a large or wide schema with an LLM enrichment backend — -especially a **subprocess** backend, the common local/desktop setup — will -eventually hit a table whose description call hangs: a provider stall, a rate-limit -black-hole, a pathologically large prompt. Without an *enforced* timeout, one such -table wedges the entire ingest indefinitely and leaks the spawned child; without -*incremental* persistence, any interruption throws away all the per-table LLM work -already done — the dominant ingest cost. 
Both fixes make large-schema enrichment -**resilient and resumable**: a few bad tables degrade to a few skipped -descriptions, not a hung process and a from-scratch redo. This is core robustness -for a general-purpose ingestion product, wholly independent of any benchmark. - -## Design decisions (resolved during refinement) - -These resolve ambiguities the intake draft left open. They constrain the -implementer; the exact code is theirs (requirement-level, per the specs README). - -### D1 — One bounded-call guarantee; enforcement follows the backend's nature - -The canonical contract is a single guarantee for the per-table enrichment call: -**the in-flight work terminates and ktx's await settles within the per-table -deadline plus a small grace, on every backend.** How that guarantee is met follows -from a structural property of the configured backend — *does it own a subprocess?* -— not from a hand-maintained list of provider names: - -- **Subprocess-backed (`codex`, `claude-code`):** the SDK's own abort is - insufficient (SIGTERM-only, and ktx has no kill handle), so ktx runs the call - behind a **boundary it can hard-kill** — a short-lived ktx-owned child process, - made a **process-group leader** (`detached`). The SDK's grandchild (the - `codex`/`claude` binary) inherits that group. On deadline (or `ctx.signal`), ktx - **tree-kills the whole group with SIGKILL** — reaping the wrapper *and* the - grandchild — and rejects promptly. This mirrors spec 16's child-process + - SIGKILL mechanism, extended by the critical step that **killing the immediate - child is not enough**: the grandchild would otherwise orphan to init and keep its - provider connections. Killing the group is the real fix. -- **HTTP-backed (`anthropic`/`vertex`/`gateway`/`openai`):** unchanged. The existing - in-process `abortSignal` → `fetch` cancellation already satisfies the contract — - the await settles promptly and there is no subprocess to leak. 
Routing these - through a subprocess would pay fork + IPC + credential-passing cost for no benefit. - -> The branch on "subprocess-backed?" is behavior following from an input the backend -> declares about itself, not vendor enumeration — the same guarantee is reached two -> ways because the backends differ structurally. This matches the intake's own split -> ("subprocess SIGKILL for process-backed; request abort for HTTP-backed"). -> -> Rejected alternative — a *settle-only race* (reject ktx's promise on the deadline -> regardless of the SDK, but leave the SDK's child running). It unwedges the stage -> but leaves the orphaned child holding provider connections — the exact leak the -> incident showed — so it fails the intake's "actually cancelled" requirement and -> compounds over a long ingest that hits several hung tables. -> -> Rejected alternative — a *persistent ktx subprocess pool* hosting the runtime, -> killed and respawned on timeout. Terminate-on-deadline destroys the worker, so a -> pool needs respawn + in-flight job-tracking for no benefit: the enrichment call is -> low-frequency relative to its own latency and already concurrency-bounded (4), so -> one short-lived child per call (spec 16's resolved choice) is simpler and as fast. - -**Portability.** ktx supports Windows, where POSIX process groups and -`process.kill(-pgid, …)` do not exist. The tree-kill MUST be portable: a detached -process group + `kill(-pgid, 'SIGKILL')` on POSIX, and a tree-terminating -equivalent on Windows (e.g. `taskkill /pid /T /F` or a job object) so the -grandchild is reaped on every platform the subprocess backends run on. - -### D2 — Default stays moderate and the retry/skip policy is unchanged - -The per-table timeout default stays **120s** (`KTX_ENRICH_LLM_TIMEOUT_MS`), with the -existing per-attempt retry (`KTX_ENRICH_LLM_ATTEMPTS`, default 3) and the -no-retry-on-timeout policy. 
A hung table costs **at most one timeout**, then the -table is skipped with the existing `enrichment_timeout` warning and the stage -proceeds. The 30-min value in the incident was an operator stopgap chosen *because* -the timeout was cosmetic; once D1 makes the timeout actually terminate the work, a -long timeout is strictly worse for a hang (a hang costs the full timeout), so the -moderate default is the correct operating point. The retry loop stays in -`description-generation.ts`: each attempt runs through the bounded boundary (D1), so -a transient backend error retries while a timeout surfaces as `KtxAbortedError` and -does not. - -> Not introducing a new `ktx.yaml` config field for the timeout. The existing env -> override is the tuning seam; adding a per-connection/per-call/global knob would -> multiply the runtime surface for no stated need (one opinionated default + the -> existing env override is the canonical ktx shape). - -### D3 — Persist descriptions incrementally; derive the resume-skip set from on-disk state - -During the descriptions fan-out, flush completed tables **per batch** (every N -tables / on a timer, at a cadence that bounds the at-risk window) to the durable -on-disk artifacts, reusing spec 19's additive write: - -- the raw descriptions artifact (`descriptions.json`) is the **resume-skip source**; -- the `_schema` manifest is updated additively (`mergeDescriptionsPreservingExternal` - preserves prior `ai:`/`db:`/external keys) so finished descriptions are also - **queryable** the moment they are computed — the spec-19 invariant, one level - deeper. The implementer MAY bound manifest-rewrite cost on huge schemas by - rewriting only changed shards. - -On resume, `generateDescriptions` reads the existing record, **skips any table -already enriched**, computes only the remainder, and returns the merged full set so -the embeddings stage, the checkpoint write, and the stage-store row all see a -complete result exactly as today. 
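The resume behavior described above (skip what is already enriched, compute the remainder, return the merged set) can be sketched like this. It is a minimal illustration under assumptions: `resumeDescriptions`, the `Map` payload, and the `describe`/`flush` callbacks are hypothetical, tables are processed sequentially, and the flush happens per table rather than per batch to keep the example short.

```typescript
// Hypothetical shapes; the real artifact layout is not reproduced here.
type Descriptions = Map<string, string>; // table name -> AI description

async function resumeDescriptions(
  tables: readonly string[],
  existing: Descriptions,
  // Returns null when a table is skipped (for example, after a timeout).
  describe: (table: string) => Promise<string | null>,
  // Persists one finished description to durable storage.
  flush: (table: string, text: string) => Promise<void>,
): Promise<Descriptions> {
  const merged: Descriptions = new Map(existing);
  const remainder = tables.filter((table) => !existing.has(table));
  for (const table of remainder) {
    const text = await describe(table);
    if (text === null) continue; // left for the next resume, never marked done
    merged.set(table, text);
    await flush(table, text); // durable as soon as it is produced
  }
  // Downstream stages see prior and newly computed descriptions together.
  return merged;
}
```

A skipped table stays out of the merged set, so a later resume picks it up again instead of treating it as done.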
- -**The skip is `inputHash`-gated**, preserving spec 19's recompute semantics. The -durable record is tagged with the descriptions stage's `inputHash` -(`computeKtxScanEnrichmentInputHash`). Resume reuses it to skip tables **only when -the current `inputHash` matches** — a genuine resume-after-interruption of the same -content identity. A changed `inputHash` (schema or enrichment settings changed) -ignores the prior record for skipping and recomputes the stage as today; the -manifest write stays additive regardless. The artifact's on-disk shape may gain the -`inputHash` tag with **no migration bridge** (ktx owns the artifact; a stale-shaped -record simply forces one non-incremental run), consistent with ktx's -no-backward-compatibility policy. - -> The skip set is **derived from the artifacts ktx already writes**, not from a new -> per-table cache table. The manifest's `ai:` field already encodes "this table is -> enriched"; a parallel per-table SQLite record would be a second source of truth for -> the same fact and would drift. The whole-stage `local_scan_enrichment_stages` row is -> still written at stage completion (it remains the stage-level resume gate — a clean -> re-run skips the descriptions stage as today); the incremental record only matters -> when the stage did **not** complete — exactly the case where no row exists and -> `compute()` re-runs. - -### D4 — A killed-mid-stage run is durable; resume is cheap - -A process killed mid-stage (gap #1 wedge, SIGKILL, crash, supervisor) leaves the -per-batch-flushed tables durable on disk. The next run resumes the descriptions -stage (no completed `local_scan_enrichment_stages` row → `compute()` runs again), -but `generateDescriptions` now **re-issues LLM calls only for the unfinished -tables**. A failed/skipped table (timeout or exhausted retries) is left for the -remainder set and is retried on the next resume — never silently treated as done. - -## Requirements - -### 1. 
The per-table enrichment timeout is enforced for subprocess backends - -When the per-table deadline fires (or `ctx.signal` aborts) on a subprocess-backed -backend (`codex`, `claude-code`), the in-flight LLM work — the spawned child **and -its descendants** — MUST be terminated (SIGKILL of the process group / tree), and -ktx's `generateObject` await MUST settle within the deadline plus a small bounded -grace. A hung table MUST cost at most ~one timeout of wall-clock, never unbounded. -The termination MUST be portable across the platforms the subprocess backends run on -(POSIX process-group kill and a Windows tree-kill equivalent). HTTP-backed backends -keep their existing native `abortSignal` → `fetch` cancellation; the guarantee is one -contract met two ways, branching on the backend's structural "owns a subprocess" -property, not on a list of provider names. - -### 2. The timeout default and retry/skip policy are unchanged - -The default per-table timeout stays moderate (current 120s, `KTX_ENRICH_LLM_TIMEOUT_MS`), -with the existing per-attempt retry (default 3, `KTX_ENRICH_LLM_ATTEMPTS`) and the -no-retry-on-timeout policy. On timeout, the table is skipped with the existing -`enrichment_timeout` recoverable warning and the stage proceeds. No new -per-connection / per-call / global timeout knob is added. - -### 3. Descriptions are persisted incrementally during the stage - -Enriched descriptions MUST be flushed to the durable on-disk artifacts **per batch** -(per-table or per-N-tables / on a timer) during the descriptions stage, at a cadence -that bounds the at-risk window to a small number of tables. The flush MUST be -idempotent and additive (never clobber a prior `ai:` description; preserve `db:` and -external keys via the existing merge). Finished tables MUST remain durable even if the -stage never completes — is wedged, killed, or interrupted. 
A failed/skipped -relationship/embedding stage or a killed descriptions stage MUST NOT lose the -descriptions already flushed. - -### 4. Resume re-enriches only the unfinished tables - -On a resumed ingest with an unchanged `inputHash`, the descriptions stage MUST -re-issue LLM description calls **only for tables not already enriched**, deriving the -already-enriched set from the on-disk artifacts (the `inputHash`-tagged durable -record / the manifest's `ai:` descriptions), and MUST return the merged full result -so downstream stages behave as on a fresh run. A changed `inputHash` (schema or -enrichment settings changed) MUST recompute the stage as today (spec 19's -inputHash-gated semantics preserved). The durable record MAY be recreated without a -migration bridge if its on-disk shape changes (it is regenerable local/artifact -state). - -### 5. No regression for small or uninterrupted ingests - -A small or single-run ingest that is never interrupted MUST produce the same -artifacts (descriptions, manifest, embeddings) as today. The incremental flush MUST -be idempotent with the spec-19 checkpoint and the terminal write (descriptions -survive the embeddings/relationship rewrites). The bounded-call boundary MUST NOT -change a normal successful enrichment's output, only how a wedged call is terminated. - -### 6. A skipped table costs one description, never the stage's output - -A descriptions stage that **completes** with one or more skipped/aborted tables MUST -persist every successfully-generated description (the durable record and the `ai:` -manifest entries) and MUST mark the stage completed (a `local_scan_enrichment_stages` -row, embeddings + downstream proceeding) — it MUST NOT discard the whole stage's -output because some tables were skipped. No single table's failure may reject the -per-table fan-out: a per-table failure degrades to one missing description (left for -the resume remainder), not a failed stage. 
A genuine `ctx.signal` cancellation is the -only thing that fails the stage (so it resumes), and even then the already-flushed -descriptions remain durable. - -## Acceptance criteria - -- **Enforced timeout (subprocess backend):** a subprocess-backed enrichment call - that hangs past the deadline is terminated within the deadline plus a small grace; - ktx's await settles, the spawned child **and a grandchild it spawned** both exit - (verified via the child's `exit`, not left spinning), and the table is skipped with - an `enrichment_timeout` warning. The stage advances rather than wedging. A - `ctx.signal` abort terminates the same way. -- **HTTP backend unaffected:** an HTTP-backed enrichment call still cancels promptly - on abort via the existing native path, with no subprocess involved. -- **Default + policy:** the default timeout is 120s and a timeout is not retried (one - wedge = one timeout); a transient error is still retried up to the attempt limit. -- **Graceful skip persists the rest:** a stage that completes with one table failing - (timeout, exhausted retries, or an unexpected throw) still writes the other N−1 - descriptions to the durable record + `ai:` `_schema` and marks the stage completed - (a `local_scan_enrichment_stages` row exists); the failed table is a single `null` - description left for the resume remainder, not a discarded stage. -- **Incremental durability:** interrupting the descriptions stage after K of N tables - leaves those K durable on disk (raw artifact + `ai:` descriptions in `_schema`), - with no completed `local_scan_enrichment_stages` row. -- **Resume does not re-spend:** re-running the interrupted ingest (unchanged - `inputHash`, fresh `runId`) issues **no** LLM description calls for the K already- - enriched tables and enriches only the remaining N−K; the returned result is the - full merged set. A changed `inputHash` recomputes the stage. 
-- **No regression:** a small uninterrupted ingest yields identical artifacts and the - same descriptions/embeddings output as today; the incremental flush is idempotent - with the checkpoint and terminal writes. - -## Non-goals - -- **Incremental persistence of embeddings.** Embeddings are fast and already covered - by spec 19's stage-level cross-run resume; the dominant loss is descriptions. This - spec scopes incremental persistence to the `descriptions` stage. -- **Changing the timeout default, retry counts, or adding a timeout config knob.** - D2 keeps the moderate default and the single env tuning seam. -- **Routing HTTP backends through the subprocess boundary.** Their native abort - already meets the contract; a subprocess would add cost and a credential-passing - surface for no benefit. -- **A persistent subprocess pool.** One short-lived ktx child per subprocess-backed - call; no pool, no respawn/job-tracking (D1). -- **Re-implementing spec 16 (per-query deadline) or spec 19 (relationship-stage - budget, cost-boundary checkpoint, cross-run stage resume).** This spec composes - above them: spec 16 bounds individual queries, spec 19 makes whole stages durable - and resumable, and this spec hardens the per-table enrichment call's termination - and adds within-stage description durability. -- **A general per-stage incremental-flush framework.** The incremental flush is - specifically the descriptions stage; it is not a generic abstraction over every - enrichment stage. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns the -design. - -- **Bounded per-table call (gap #1)** — `context/scan/description-generation.ts`, - `KtxDescriptionGenerator.generateBatchedTableDescriptions` (the bounded+retry block - ~760–866; `enrichTimeoutMs` ~769, `enrichAttempts` ~770, `KtxAbortedError` on - timeout ~811, `enrichment_timeout`/`enrichment_failed` warnings ~858). 
The retry - loop stays here; each attempt runs through the kill boundary for subprocess - backends. -- **LLM runtime + backend selection** — `context/llm/runtime-port.ts` - (`KtxLlmRuntimePort.generateObject`, `abortSignal` on the input), - `context/llm/local-config.ts` (~127–163, selects `CodexKtxLlmRuntime` / - `ClaudeCodeKtxLlmRuntime` / `AiSdkKtxLlmRuntime`), `context/project/config.ts` - (`KTX_LLM_BACKENDS`). The "owns a subprocess" property should be declared by the - backend/runtime (e.g. on the runtime interface), not inferred from a name list. -- **Subprocess backends** — `context/llm/codex-runtime.ts` + - `context/llm/codex-sdk-runner.ts` (`CodexSdkCliRunner.runStreamed`, the SDK's - `spawn(executable, args, { signal })` is in `@openai/codex-sdk`), - `context/llm/claude-code-runtime.ts` (`collectResult` ~275–322, the `interrupt()` - abort path). These are what the kill boundary must wrap and tree-kill. -- **Reuse spec 16's mechanism (extended to group/tree kill)** — - `connectors/sqlite/read-query-child.ts` (the forked child shape) and - `connectors/sqlite/connector.ts` `runReadQueryOffProcess` (~292–350: `fork`, - deadline timer, `child.kill('SIGKILL')`, `settle()`, the `.js`-if-exists-else-`.ts` - child-URL resolver ~25–27, knip dynamic entry). Gap #1 differs by making the child a - process-group leader and killing the **group/tree** (the SDK grandchild), portably. - Abort helpers: `context/core/abort.ts` (`createAbortError`, `throwIfAborted`, - `linkAbortSignal`). Note the new child hosts an LLM runtime, so the implementer owns - passing the backend config/credentials to it (env/IPC) and serializing the - structured result back. -- **Incremental persistence (gap #2)** — - `context/scan/local-enrichment.ts` (`generateDescriptions` ~279–352: the per-table - `pLimit` fan-out and the in-memory `updates` accumulation; `runEnrichmentStage` - ~413/~421–474 with `findCompletedStage` ~427 and `saveCompletedStage`; the - `onCheckpoint` hook ~598–612). 
Make `generateDescriptions` resume-aware: read the - existing record, skip already-enriched tables, flush per batch, return the merged - full set. -- **Artifact writer + additive merge** — `context/scan/local-enrichment-artifacts.ts` - (`writeLocalScanEnrichmentCheckpoint` ~351–379, `writeEnrichmentDescriptionArtifacts` - with `descriptions.json` ~316, `writeLocalScanManifestShards` ~270–308, - `loadExistingManifestState` ~196–253, `tableDescription`/`columnDescription` - ~75–105); `context/scan/manifest.ts` (`mergeDescriptionsPreservingExternal` ~96–115, - `SCAN_MANAGED_DESCRIPTION_KEYS`). Factor a per-batch flush that reuses the additive - description/manifest write; tag the durable record with `inputHash`. -- **Stage store + input hash** — - `context/scan/sqlite-local-enrichment-state-store.ts` (`STAGES_TABLE = - 'local_scan_enrichment_stages'`, PK `(connection_id, stage, input_hash)`, - `findCompletedStage`, `saveCompletedStage`), - `context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78). The - whole-stage row stays; the `inputHash` is the gate for the resume-skip set. -- **Scan driver** — `context/scan/local-scan.ts` (the `onCheckpoint` wiring and the - terminal `writeLocalScanEnrichmentArtifacts`), and `KtxScanContext.signal` - (`context/scan/types.ts`) which the kill boundary must honor. -- **Tests** — gap #1: a fake subprocess-backed runtime whose child hangs (ignores - SIGTERM) is killed at a tiny test-seam deadline; assert the await settles within - deadline+grace, the child and a spawned grandchild both exit, and the table is - skipped with `enrichment_timeout`; assert an HTTP-backed abort still settles via the - native path. 
gap #2: interrupt the descriptions stage after K/N tables (a flush - seam), assert the K are durable (raw artifact + `ai:` in `_schema`) with no completed - stage row; a resume with matching `inputHash` issues no LLM calls for the K and - enriches only N−K; a changed `inputHash` recomputes; regression: a small - uninterrupted ingest yields identical artifacts. -- After implementing, rebuild and re-link so the playground picks it up: - `pnpm run build && pnpm run link:dev`. - -## Benchmark context (motivation, not a requirement) - -Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment -backend). Re-enriching the giant public datasets, `covid19_usa` wedged at 268/285 for -41+ minutes on 2 hung 252-column tables; the 30-min per-table `AbortSignal` timeout -never killed the hung codex children, and because descriptions checkpoint only at -stage completion, the 283 already-enriched tables were unrecoverable — the operator -had to kill, cache-bust, and re-ingest the database from scratch (with a short timeout -as a stopgap). The benchmark merely exercised a large/wide multi-dataset ingest at -scale; the gaps and the fixes are generic production hygiene for any agent that -enriches a real warehouse with a subprocess LLM backend. Do not encode any benchmark -specifics in the implementation. - -## Implementation notes - -Implemented on branch `write-feature-spec-wiki`. Both gaps shipped; all acceptance -criteria are covered by tests. The full ktx test surface for the touched code is -green (the only failures in the whole suite are 3 pre-existing assertions in -`test/skills/analytics-skill-content.test.ts` about the analytics SKILL.md markdown -— an unrelated subsystem this change does not touch). - -### Gap #1 — enforced timeout for subprocess backends - -- **Structural property on the runtime, not a name list.** Added - `subprocessForkSpec(): SubprocessRuntimeForkSpec | null` to `KtxLlmRuntimePort` - (`context/llm/runtime-port.ts`). 
`CodexKtxLlmRuntime` / `ClaudeCodeKtxLlmRuntime` - return a serializable `{ backend, projectDir, modelSlots }`; `AiSdkKtxLlmRuntime` - (and the deterministic stub) return `null`. The per-table call branches on this, - never on a vendor list (D1). -- **Shared structured core.** Both subprocess runtimes gained - `generateStructuredJson(jsonSchema)` (returns the raw object; the caller - Zod-validates). Their existing `generateObject` was refactored to delegate to the - same streaming core, so structured generation has one implementation. -- **Kill boundary.** New `context/llm/subprocess-generate-object.ts` - (`runGenerateObjectInSubprocess`, `KtxSubprocessDeadlineError`) forks a ktx-owned - child (`subprocess-generate-object-child.ts`) **detached** (process-group leader); - the SDK's model binary inherits the group. On the deadline or `ctx.signal`, ktx - tree-kills the group with `SIGKILL` (`process.kill(-pid, …)` on POSIX, - `taskkill /pid /T /F` on Windows) and rejects promptly; on success the raw - output is Zod-validated. Credentials reach the child via inherited `process.env` - (the runtimes re-derive their allowlisted env), never over IPC. -- **Wiring.** `KtxDescriptionGenerator.generateBatchedTableDescriptions` - (`context/scan/description-generation.ts`) routes each retry attempt through the - boundary for subprocess backends and keeps the native `AbortSignal` → `fetch` - path for HTTP backends. A fired deadline maps to the existing - `KtxAbortedError`/`enrichment_timeout` no-retry policy (one wedge = one timeout); - default stays 120s (D2). -- **Tests.** `test/context/llm/subprocess-generate-object.test.ts` forks a real - fixture child that spawns a grandchild and ignores SIGTERM, and asserts the - deadline/abort tree-kills both (the grandchild PID is reaped) and the await - settles within deadline+grace; plus success / schema-failure / child-error paths. 
- `test/context/scan/description-generation.test.ts` adds the generator-level - timeout-skip and the "HTTP backend spawns no child" cases. - -### Gap #2 — incremental descriptions persistence + resume - -- **Durable record + resume store.** `createKtxScanDescriptionResumeStore` - (`context/scan/local-enrichment-artifacts.ts`) writes the descriptions-so-far to - a durable record (inputHash-tagged) and **only the manifest shards that gained a - table this batch** (new `onlyChangedTableNames` filter on - `writeLocalScanManifestShards`, additive merge preserved). `load(inputHash)` - returns the prior enriched set only on a matching inputHash (D3). -- **Resume-aware fan-out.** `generateDescriptions` (`context/scan/local-enrichment.ts`) - loads the prior record, skips already-enriched tables, enriches only the - remainder, flushes every `DESCRIPTION_FLUSH_EVERY` (10) completed tables (a single - in-flight flush; the final force-flush drains the tail), and returns the full - merged set (recovered + fresh + `null` for still-failed, so failures are retried, - D4). Wired through `local-scan.ts` (store constructed when not `--dry-run`). -- **Graceful-skip backstop (requirement 6).** The per-table worker wraps the call in - a try/catch: any non-cancellation failure degrades to one `null` description + an - `enrichment_failed` warning and the fan-out continues, so no single table can - reject `Promise.all` / abort the stage. This makes the "one skipped table costs one - description, not the stage's output" guarantee live at the stage boundary - (`generateBatchedTableDescriptions` already degrades its own failures; this is the - explicit backstop). A `ctx.signal` cancellation still propagates (the stage fails - and resumes), and the already-flushed descriptions stay durable. This closes the - field bug where a completed-with-skips stage persisted 0 descriptions / 0 stage rows. 
-- **Deviation from the spec's literal path (necessary correction).** The durable - record lives at a **stable, non-`syncId`** path - (`raw-sources//live-database/enrichment-progress/descriptions.json`), - not the `syncId`-scoped `…//enrichment/descriptions.json` the spec named. - Reason: a from-scratch interruption (the incident's exact case — no prior - *completed* run) gets a **fresh `syncId`** on the next run - (`buildSyncId` in `context/ingest/local-stage-ingest.ts`), so a `syncId`-scoped - record would be unreachable on resume. The manifest is already at the stable - per-connection scope (`semantic-layer//_schema/`), so this keeps the - resume source at the same stable scope. The `syncId`-scoped `enrichment/descriptions.json` - debug artifact written by the terminal/checkpoint writers is unchanged. -- **Tests.** `test/context/scan/description-resume.test.ts` drives - `runLocalScanEnrichment` against a real git-backed project: a fresh run flushes a - durable record + `ai:` manifest descriptions; a matching-`inputHash` resume issues - zero LLM calls and returns the full merged set; a partial record re-enriches only - the missing tables; a changed `inputHash` recomputes; the changed-shard filter - rewrites only the affected shard; and (requirement 6) a run where one table fails - still persists the other tables (durable record + `ai:`) and **completes the stage** - (a completed `local_scan_enrichment_stages` row), with the failed table left `null` - for resume. - -### Incidental - -- Fixed a stale assertion in `description-generation.test.ts` ("does not run - per-column fallback…" expected 1 call) to `3`, matching the retry policy added in - commit `01f63380` (D2 / acceptance: a transient error retries up to the attempt - limit). The HTTP path is unchanged; the assertion simply predated the retry. -- No new `ktx.yaml` config field or runtime knob was added (D2). 
The rate-limit - governor is not wired into the scan-enrichment path, so the kill-boundary child - loses no pacing. -- Rebuilt and re-linked (`pnpm run build && pnpm run link:dev`); the child compiles - to `dist/context/llm/subprocess-generate-object-child.js`. diff --git a/spider2-specs/specs/21-selective-enrichment-stages.md b/spider2-specs/specs/21-selective-enrichment-stages.md deleted file mode 100644 index 130647b1..00000000 --- a/spider2-specs/specs/21-selective-enrichment-stages.md +++ /dev/null @@ -1,567 +0,0 @@ -# Selective enrichment stages (`--stages`) + per-stage cache keys - -> Refined spec. Intake draft: `todo/21-selective-enrichment-stages.md`. -> -> **Scope: make the three enrichment stages independently invalidatable and -> independently re-runnable.** Today one coarse cache key gates all three stages, -> so changing any one stage's inputs re-pays for every stage — most painfully the -> expensive per-table `descriptions`. And there is no CLI surface to re-run a -> chosen subset. This spec splits the key per stage (so a change invalidates only -> the stage it touched) and adds a `--stages` flag that force-re-runs a chosen -> subset while preserving the others. It is the operability follow-on to spec 19 -> (durable, cross-run stage resume) and spec 20 (resilient, per-table-resumable -> descriptions); it composes with both rather than replacing them. - -## Problem - -Enrichment has three stages — **`descriptions`** (one paid LLM call per table), -**`embeddings`** (sentence-transformer vectors over the schema + descriptions), -**`relationships`** (FK/join detection, optionally LLM-proposed). After specs 19 -and 20 these stages are durable and resumable, but they are still **coupled for -cache invalidation and unreachable for selective re-run**. Three facts make a -targeted re-run impossible without a full, expensive re-enrich. - -### 1. 
One coarse cache key gates all three stages - -`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:611`) computes a single -`inputHash` from `{ snapshot, mode, detectRelationships, providerIdentity, -relationshipSettings }` and every stage reuses it — `descriptions` (~`:642`), -`embeddings` (~`:673`), `relationships` (~`:729`). `providerIdentity` itself -(`localScanProviderIdentity`, `local-scan.ts:241–255`) is one blob conflating the -description LLM identity, the embedding model/dimensions/batch size, **and** the -whole relationship config — and it redundantly re-encodes `mode` and -`relationships`, which the coarse hash already mixes in. - -The consequence: flipping `scan.relationships.llmProposals`, switching the LLM -backend, or upgrading the embeddings model changes the **one** hash and so -invalidates **all three** stages. ktx then re-runs the expensive per-table -`descriptions` even though they did not conceptually change. The headline cost of -the system — paid LLM description calls — is thrown away on any unrelated -enrichment-config edit. - -### 2. No CLI surface to select stages - -The enrichment internals already support a relationships-only path -(`KtxScanMode` `'relationships'`, `types.ts:12` — `descriptions`/`embeddings` are -gated on `mode === 'enriched'` at `local-enrichment.ts:632`, while -`shouldDetectRelationships` admits `mode === 'relationships'` at `:624–626`). But -`ktx ingest` hardcodes `mode: 'enriched'` (`public-ingest.ts:973`) and exposes no -flag to select a subset (`ingest-commands.ts:26–49` — only `--no-query-history` -and friends). The relationships-only capability is built but unreachable, and there -is no way at all to ask for "descriptions only" or "embeddings only." - -### 3. 
The foundation for "touch one stage, keep the rest" already exists - -The per-stage store `local_scan_enrichment_stages` is keyed -`(connection_id, stage, input_hash)` (spec 19) and the descriptions write is -additive — `mergeDescriptionsPreservingExternal` (`manifest.ts`) and -`loadExistingManifestState` (`local-enrichment-artifacts.ts`) preserve prior `ai:`, -`db:`, and external description keys on rewrite; spec 20's per-table resume record -(`createKtxScanDescriptionResumeStore`, `local-enrichment-artifacts.ts:286`) already -re-issues LLM calls only for the still-failed tables. So "recompute one stage, leave -the others byte-for-byte" needs only two missing pieces: **per-stage key -granularity** and a **CLI surface** to select stages. - -**Requirement:** let an operator re-run a chosen subset of enrichment stages on an -already-ingested connection, recomputing only those stages, preserving the others' -artifacts untouched, and **re-paying only for what genuinely changed** — never -re-running the costly `descriptions` because an unrelated stage's inputs moved. - -## Generic use case (independent of any benchmark) - -Any team running ktx in production maintains its semantic layer over time: they -improve the description prompt or switch the description LLM, upgrade the embeddings -model, or turn on LLM-proposed joins. Today each of those forces a **full re-enrich -of every connection** — re-running the expensive per-table descriptions even when -only embeddings or relationships changed. Two routine operations should be cheap and -targeted: - -- **"Re-embed everything on the new model."** Swapping the embeddings model should - recompute only embeddings, leaving descriptions and joins on disk. -- **"Backfill joins now that `llmProposals` is on."** Enabling LLM-proposed - relationships should recompute only relationships. 
- -And one operation needs an explicit trigger because no input changed: - -- **"These descriptions came out thin — re-run them with a longer timeout."** A - connection whose description coverage is poor because tables timed out (same - snapshot, same LLM, so the hash is unchanged) should be re-runnable on demand, - cheaply retrying only the tables that failed. - -This is core operability for a long-lived ingestion product and is wholly -independent of any benchmark. - -## Design decisions (resolved during refinement) - -These resolve ambiguities the intake draft left open. They constrain the -implementer; the exact code is theirs (requirement-level, per the specs README). - -### D1 — Split the coarse hash into three per-stage input hashes - -Replace the single `computeKtxScanEnrichmentInputHash` call with **per-stage** hash -computation, each keyed on only that stage's own inputs. Decompose the -`localScanProviderIdentity` blob into the slices each stage actually depends on: - -- **`descriptions`** → `{ snapshot, llmIdentity }`, where `llmIdentity` is the - description-LLM identity (`llm.models.default`, `baseUrlConfigured`). **Not** the - embedding model/dimensions/batch size, **not** relationship settings. -- **`embeddings`** → `{ snapshot, embeddingIdentity, descriptionDigest }`, where - `embeddingIdentity` is `{ model, dimensions, batchSize }` and `descriptionDigest` - is a stable digest of the resolved description text the embeddings consume (the - same text `buildEmbeddings` → `buildKtxColumnEmbeddingText` feeds the model, - `local-enrichment.ts:466–486`, `embedding-text.ts:17–44`). This content-addresses - embeddings on their real upstream (D4). -- **`relationships`** → `{ snapshot, relationshipSettings (incl. `llmProposals` and - `detectionBudgetMs`), llmIdentity }`. **Not** the description content (decision X, - D5), **not** the embedding identity. 
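The per-stage decomposition listed in D1 can be illustrated with a small sketch. It assumes SHA-256 over a key-sorted JSON serialization, which the spec does not prescribe, and the `EnrichmentInputs` shape, `stableStringify`, and `perStageHashes` are hypothetical names chosen to mirror the prose rather than ktx's real types.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape; field names follow the spec's prose.
interface EnrichmentInputs {
  snapshot: string; // stand-in for the schema snapshot identity
  llmIdentity: { model: string; baseUrlConfigured: boolean };
  embeddingIdentity: { model: string; dimensions: number; batchSize: number };
  relationshipSettings: { llmProposals: boolean; detectionBudgetMs: number };
  descriptionDigest: string; // digest of the description text embeddings consume
}

// Serialize with sorted object keys so the hash does not depend on key order.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function hashOf(stage: string, inputs: unknown): string {
  return createHash("sha256").update(stableStringify({ stage, inputs })).digest("hex");
}

// Each stage hashes only the slices it depends on.
function perStageHashes(i: EnrichmentInputs) {
  return {
    descriptions: hashOf("descriptions", {
      snapshot: i.snapshot,
      llmIdentity: i.llmIdentity,
    }),
    embeddings: hashOf("embeddings", {
      snapshot: i.snapshot,
      embeddingIdentity: i.embeddingIdentity,
      descriptionDigest: i.descriptionDigest,
    }),
    relationships: hashOf("relationships", {
      snapshot: i.snapshot,
      relationshipSettings: i.relationshipSettings,
      llmIdentity: i.llmIdentity,
    }),
  };
}
```

Under these assumptions, toggling `relationshipSettings.llmProposals` changes only the `relationships` hash, and changing `embeddingIdentity.model` changes only the `embeddings` hash.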
- -`mode` and `detectRelationships` drop out of the per-stage inputs: each stage -produces output under exactly one mode, so the stage name already scopes that, and -re-mixing `mode` only re-couples the keys. After the split, flipping `llmProposals` -invalidates only `relationships`; swapping the embeddings model invalidates only -`embeddings`; switching the description LLM invalidates only `descriptions`. - -The per-stage hash becomes the key everywhere a single hash is used today: the -`local_scan_enrichment_stages` lookup/save in `runEnrichmentStage`, and the spec-20 -descriptions resume record (`createKtxScanDescriptionResumeStore`), which is now -keyed on the **descriptions** stage's hash — so changing the embedding model no -longer busts the descriptions resume record, a strict improvement. - -> **No migration bridge.** The stage store and the descriptions resume record are -> disposable local `.ktx` state (regenerable from a fresh ingest). The new per-stage -> keys simply miss the old coarse-keyed rows, forcing one full re-enrich on the next -> run after upgrade. Recreate/ignore stale-shaped records with no compatibility -> shim, consistent with specs 19/20 and ktx's no-backward-compatibility policy. - -### D2 — `--stages ` selects a subset; one gate, no new mode - -Add `ktx ingest [connectionId] --stages `, a non-empty subset of -`descriptions,embeddings,relationships`. Plural because it takes a **set**: -`--stages relationships` and `--stages descriptions,embeddings` both read naturally, -and the plural signals "list expected." Flag absent = all three (today's behavior). - -A Commander custom parser validates each name against the canonical stage registry -and parses into an ordered, de-duplicated set. **An unknown or empty stage name is a -hard `InvalidArgumentError`** — never silently ignored. The set threads CLI → -`runKtxPublicIngest` (`KtxScanArgs`) → `runLocalScan` → `runLocalScanEnrichment`. 
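A minimal sketch of the stage-list parsing described above follows. To stay dependency-free it throws a local `InvalidStageListError` as a stand-in for Commander's `InvalidArgumentError`; `STAGES` and `parseStages` are hypothetical names, and returning the set in canonical registry order is an assumption about what "ordered" means here.

```typescript
// Canonical stage registry (assumed shape).
const STAGES = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof STAGES)[number];

// Stand-in for Commander's InvalidArgumentError so the sketch has no dependency.
class InvalidStageListError extends Error {}

// Parse a comma-separated list into an ordered, de-duplicated stage set.
// Unknown or empty names are a hard error, never silently ignored.
function parseStages(raw: string): Stage[] {
  const selected = new Set<Stage>();
  for (const name of raw.split(",").map((part) => part.trim())) {
    if (!(STAGES as readonly string[]).includes(name)) {
      throw new InvalidStageListError(
        `unknown or empty stage name "${name}" (expected one of: ${STAGES.join(", ")})`,
      );
    }
    selected.add(name as Stage);
  }
  // Emit in registry order so the result is independent of input order.
  return STAGES.filter((stage) => selected.has(stage));
}
```

In a Commander-based CLI a function of this shape would typically be passed as the option's custom argument parser, with the thrown error swapped for `InvalidArgumentError`.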
- -Inside enrichment the run set is **`(mode/provider-eligible stages) ∩ (selected -stages)`** — a single gate. Each existing stage block additionally checks -membership in the selected set (`descriptions`/`embeddings` already gate on -`mode === 'enriched'` + providers; `relationships` on `shouldDetectRelationships`). -This adds **no** new `KtxScanMode` variant and **no** second parallel selection -path; `mode` keeps meaning "the connection's enrichment level," and `--stages` means -"which of those stages to (re)compute this run." A named stage that cannot run -because a prerequisite is absent (e.g. `--stages embeddings` with no embedding -provider configured) MUST fail or warn clearly, never silently no-op. - -> Rejected alternative — repurpose `mode` (`--stages relationships` → -> `mode: 'relationships'`). It only expresses single-stage cases, leaves -> `descriptions,embeddings` with no mode, and creates two ways to say "relationships -> only." The explicit stage set is the one canonical selector. - -### D3 — A named stage force-re-runs; per-table resume still avoids re-paying - -Naming a stage in `--stages` carries the intent "recompute this," so a named stage -**re-enters its `compute()`, bypassing the spec-19 completed-row short-circuit** in -`runEnrichmentStage` (`local-enrichment.ts:538–547`). The spec-20 machinery still -applies **inside** `compute()`: - -- `--stages descriptions` re-enters `generateDescriptions`, which loads the - per-table resume record and re-issues LLM calls **only for the still-null/failed - tables** (when the descriptions hash is unchanged) — the "fill thin coverage with - a longer `KTX_ENRICH_LLM_TIMEOUT_MS`" case, paying only for the gaps. -- A genuine input change (e.g. switching the LLM → a new descriptions hash) - invalidates the resume record and rebuilds the stage fully, as today. 
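The single gate from D2 and the force-re-run rule from D3 could be combined as in the sketch below. `planStages`, the action names, and the choice to throw for a named but ineligible stage are assumptions for illustration; the spec allows either failing or warning clearly in that case.

```typescript
const ALL_STAGES = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof ALL_STAGES)[number];

// "force":  re-enter compute(), bypassing the completed-row short-circuit
// "resume": run, but respect a completed stage row (no-flag default)
// "skip":   do not run or resume; leave on-disk artifacts untouched
type StageAction = "force" | "resume" | "skip";

function planStages(
  eligible: ReadonlySet<Stage>,           // mode/provider-eligible stages
  selected: readonly Stage[] | undefined, // parsed --stages, undefined if absent
): Record<Stage, StageAction> {
  const plan: Record<Stage, StageAction> = {
    descriptions: "skip",
    embeddings: "skip",
    relationships: "skip",
  };
  for (const stage of ALL_STAGES) {
    if (selected === undefined) {
      // No flag: every eligible stage runs with cross-run resume.
      plan[stage] = eligible.has(stage) ? "resume" : "skip";
    } else if (selected.includes(stage)) {
      if (!eligible.has(stage)) {
        // A named stage that cannot run must never silently no-op.
        throw new Error(`stage "${stage}" was requested but cannot run`);
      }
      plan[stage] = "force";
    }
  }
  return plan;
}
```

Note that in this sketch naming all three stages yields `force` for each, while omitting the flag yields `resume`, which is one way to model a forced full recompute versus resuming whatever is already done.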
- -Stages **not** named are skipped entirely — not run, not resumed — and their -on-disk artifacts are left exactly as they are (additive write; preserve-others is -already the behavior). The **no-flag default is unchanged**: all eligible stages -run, the completed-row short-circuit is respected (spec-19 cross-run resume). - -Behavior follows from the input (did you explicitly name the stage?), not the call -path. A consequence to state plainly: `--stages descriptions,embeddings,relationships` -is **not** identical to passing no flag — naming all three is the explicit "force a -full enrichment recompute," whereas no flag is "ingest, resuming whatever is done." - -### D4 — Downstream staleness: one real edge, content-addressed, surfaced not silent - -The only hard dependency between stages is **`descriptions → embeddings`** -(embeddings embed the description text; `relationships` is decoupled, D5). Two -mechanisms keep it correct without a hardcoded dependency table: - -- **Self-healing via content-addressing.** Because the embeddings hash includes - `descriptionDigest` (D1), re-running `descriptions` changes that digest, so a - later embeddings run (or a full ingest) sees a hash miss and recomputes — stale - embeddings can never silently persist across a future embeddings run. (Without - this, the embeddings hash would be unchanged after a description edit and a later - run would wrongly short-circuit on stale vectors.) -- **Surfaced immediately.** After a selective run, for each **unselected** stage that - has artifacts on disk, recompute its *current* per-stage hash from on-disk state - and compare it to the stored completed-row hash; if they differ, emit a - **recoverable `enrichment_stage_stale` warning** naming the stale stage and the - cascade command (e.g. `--stages descriptions,embeddings`). This is derived from the - system's own state — it also catches "you changed the embedding model in `ktx.yaml` - but only ran `--stages descriptions`." 
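The derived staleness check in D4 amounts to comparing stored and recomputed hashes for unselected stages. The following is a sketch under assumptions: `findStaleStages` and `StaleWarning` are hypothetical names, hashes are passed in as plain strings, and the cascade command is built as "selected stages plus the stale one", which is one plausible reading of the example in the text.

```typescript
const STAGES = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof STAGES)[number];

interface StaleWarning {
  code: "enrichment_stage_stale";
  stage: Stage;
  cascade: string;
}

// storedHashes: completed-row hash per stage (absent when the stage has no artifacts).
// currentHashes: per-stage hash recomputed from on-disk state after the selective run.
function findStaleStages(
  selected: ReadonlySet<Stage>,
  storedHashes: Partial<Record<Stage, string>>,
  currentHashes: Record<Stage, string>,
): StaleWarning[] {
  const warnings: StaleWarning[] = [];
  for (const stage of STAGES) {
    if (selected.has(stage)) continue; // only unselected stages are checked
    const stored = storedHashes[stage];
    if (stored === undefined) continue; // nothing on disk, so nothing can be stale
    if (stored === currentHashes[stage]) continue; // inputs unchanged
    const cascade = STAGES.filter((s) => selected.has(s) || s === stage).join(",");
    warnings.push({ code: "enrichment_stage_stale", stage, cascade: `--stages ${cascade}` });
  }
  return warnings;
}
```

Because the relationships hash excludes description content, a descriptions-only run leaves its stored and current hashes equal, so no warning is produced for it.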
- -The run **never silently leaves a stale-but-unflagged downstream**, and **never -silently auto-cascades** extra work — the operator is told and decides. Re-running -`descriptions` does **not** flag `relationships` stale (D5). - -### D5 — Relationships are decoupled from description content, but still get it as context - -`relationships` keys on `{ snapshot, relationshipSettings, llmIdentity }` and is -**not** invalidated or stale-flagged by a description change (decision X). Rationale: -relationships are the low-value, best-effort, expensive-to-probe stage (spec 19's -own framing); coupling them to description content would make every routine -description re-run also invalidate joins — re-opening the exact over-invalidation -this spec exists to close. - -Independently, a `relationships`-only run (descriptions stage not running this -invocation) MUST **hydrate its working schema from the persisted on-disk enriched -`_schema`** (AI descriptions + embeddings) so `llmProposals` runs with full -description context, not raw column names. Today the relationship stage builds its -schema from the bare snapshot (db comments only — `local-enrichment.ts:621,688,740` -never merge the AI descriptions), so this also closes a latent gap: both the -full-run and the relationships-only paths MUST feed `llmProposals` the -best-available descriptions (fresh-this-run if `descriptions` ran, else on-disk) — -behavior from inputs, not path. - -### D6 — Scope: enrichment stages only, composable with existing flags - -`--stages` controls only the three enrichment stages. It is **orthogonal to and -composable with** the existing `--no-query-history` flag — a pure joins backfill -across everything is `ktx ingest --all --stages relationships --no-query-history`. -Schema introspection still runs (it is the hash substrate and the enrichment base, -and it is cheap — no LLM). 
The stage-name namespace is built as a **registry** so it -can later extend to the broader scan phases (schema / query-history / source / -memory) and subsume the inconsistent negative `--no-query-history` flag — but that -unification is **out of scope** here. - -## Requirements - -### 1. Per-stage input hashes - -Each enrichment stage MUST key its cache lookup/save and (for `descriptions`) its -resume record on a hash of only that stage's own inputs, per D1 -(`descriptions` ← snapshot + LLM identity; `embeddings` ← snapshot + embedding -identity + a digest of the embedded description text; `relationships` ← snapshot + -relationship settings + LLM identity). Changing one stage's inputs MUST invalidate -**only** that stage. The single coarse `computeKtxScanEnrichmentInputHash` over -`{ snapshot, mode, detectRelationships, providerIdentity, relationshipSettings }` -MUST be removed in favor of per-stage computation. The stage store and the -descriptions resume record MAY be recreated without a migration bridge (disposable -local state). - -### 2. `--stages` flag with strict validation - -`ktx ingest` MUST accept `--stages `, a non-empty subset of -`descriptions,embeddings,relationships`, defaulting (when absent) to all three. An -unknown or empty stage name MUST be a hard parse error (`InvalidArgumentError`), -never silently ignored. The selected set MUST thread through to enrichment and gate -which stage blocks run as `(mode/provider-eligible) ∩ (selected)` — one gate, no new -`KtxScanMode` variant, no second selection path. A selected stage whose prerequisite -is missing MUST fail or warn clearly, not silently no-op. - -### 3. 
Selecting a stage force-re-runs it; unselected stages are preserved - -A stage named in `--stages` MUST re-enter its `compute()`, bypassing the -completed-stage short-circuit, while still using the spec-20 per-table resume record -so `descriptions` re-issues LLM calls only for still-failed tables (unchanged hash) -and rebuilds fully on a changed hash. A stage **not** named MUST NOT run and MUST -leave its on-disk artifacts untouched. The no-flag default MUST preserve spec-19 -cross-run resume (all eligible stages, completed-row short-circuit respected). - -### 4. Downstream staleness is surfaced, never silent - -After a selective run, the run MUST emit a recoverable `enrichment_stage_stale` -warning for every **unselected** stage whose current per-stage hash no longer -matches its stored completed-row hash (derived from on-disk state, naming the stage -and the cascade command). The embeddings hash MUST include a digest of the embedded -description text so a later embeddings run self-heals after a description change. The -run MUST NOT silently leave a stale-but-unflagged downstream and MUST NOT silently -auto-cascade. A description change MUST NOT stale-flag `relationships`. - -### 5. Relationships run with description context - -When the `relationships` stage runs without `descriptions` having run in the same -invocation, it MUST hydrate its working schema from the persisted on-disk enriched -`_schema` (AI descriptions + embeddings) so `llmProposals` has the same description -context as a full enriched run, not bare column names. The full-run and -relationships-only paths MUST feed `llmProposals` descriptions consistently. - -### 6. No regression for normal ingests - -A normal `ktx ingest` with no `--stages` flag MUST produce the same artifacts as -today (descriptions, embeddings, manifest, relationships) and MUST preserve spec-19 -cross-run resume and spec-20 per-table description resume. 
The per-stage hash split -MUST NOT change a normal run's output, only which stages a *changed* input -invalidates. - -## Acceptance criteria - -- **Per-stage invalidation isolation:** flipping `scan.relationships.llmProposals` - re-runs only `relationships` (descriptions + embeddings resolve from cache, no LLM - description calls, no re-embedding); swapping the embeddings model re-runs only - `embeddings`; switching the description LLM re-runs only `descriptions`. Verified by - asserting no LLM description calls / no embed calls for the unaffected stages. -- **Flag parse + validation:** `--stages relationships` and - `--stages descriptions,embeddings` parse to the right set; `--stages foo`, - `--stages` (empty), and `--stages descriptions,foo` each fail with a clear - `InvalidArgumentError`. -- **Resume-aware force-rerun:** on a connection whose `descriptions` stage completed - with K failed/null tables (unchanged hash), `--stages descriptions` re-issues LLM - calls for exactly those K tables and leaves the already-good descriptions - untouched; the run completes and the K are now enriched. A changed descriptions - hash instead rebuilds all tables. -- **Preserve others:** after `--stages descriptions`, the on-disk `embeddings` and - `relationships` artifacts are byte-stable (unselected stages did not run). -- **Derived staleness warning:** after `--stages descriptions` changes the - descriptions, the run emits `enrichment_stage_stale` for `embeddings` (its - recomputed hash diverged) and does **not** emit it for `relationships` (decision - X); a subsequent `--stages embeddings` clears it. -- **Relationships context:** a `--stages relationships` run on an already-described - connection feeds the on-disk AI descriptions into `llmProposals` (verified: the - proposal prompt carries descriptions, not just column names). 
-- **No regression:** a normal uninterrupted `ktx ingest` (no flag) yields identical - artifacts and the same descriptions/embeddings/relationship output as today, with - spec-19/20 resume intact. - -## Non-goals - -- **Unifying `--stages` with the broader scan phases or `--no-query-history`.** The - namespace is built to extend later; this spec ships only the three enrichment - stages, composable with the existing query-history flag (D6). -- **A new `KtxScanMode` variant or a second stage-selection path.** One gate, - `(eligible) ∩ (selected)` (D2). -- **Coupling `relationships` to description content** (decision X, D5). Improving - descriptions does not invalidate or stale-flag joins. -- **Auto-cascading downstream re-runs.** Staleness is surfaced as a warning; the - operator chooses to cascade (D4). -- **Capturing prompt/code-level description-prompt changes in the hash.** The - descriptions hash keys on snapshot + LLM identity (config/model), not the prompt - text; a pure prompt improvement that does not change a hash input will not - force-rebuild already-good descriptions. Forcing that is out of scope — the - operator changes a real input or selects the stage with a changed config. -- **Re-implementing spec 19 (cross-run stage resume, completed-row store) or spec 20 - (per-table description resume, enforced timeout).** This spec composes above them: - it splits the key those stages resume on and adds the CLI surface to select and - force-re-run stages. -- **A general per-phase incremental-flush framework.** The selection mechanism is the - three enrichment stages; it is not a generic abstraction over every ingest phase. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns the -design. 
- -- **Coarse hash → per-stage hashes** — `context/scan/enrichment-state.ts` - (`computeKtxScanEnrichmentInputHash` `:78`, `ComputeKtxScanEnrichmentInputHashInput` - `:57`): replace with per-stage hash functions (or one function taking a per-stage - input slice). `context/scan/local-enrichment.ts` (`:611` single hash; the three - `runEnrichmentStage` calls at `descriptions` ~`:635`, `embeddings` ~`:666`, - `relationships` ~`:722`; `runEnrichmentStage` `:524` and its short-circuit - `:538–547`). The `descriptions` hash also feeds `generateDescriptions`' - `resumeStore.load(inputHash)` (`:345`). -- **Provider-identity decomposition** — `context/scan/local-scan.ts` - (`localScanProviderIdentity` `:241–255`, the enrichment call site `:498–537`): - split into `llmIdentity` / `embeddingIdentity`, drop the redundant `mode` / - `relationships` re-encoding, and pass each stage only its slice. -- **`descriptionDigest`** — `context/scan/local-enrichment.ts` (`buildEmbeddings` - `:457–486`) and `context/scan/embedding-text.ts` (`buildKtxColumnEmbeddingText` - `:17–44`): digest the resolved per-column/table description text that the embeddings - consume, and fold that digest into the embeddings hash. -- **CLI flag** — `commands/ingest-commands.ts` (`:26–49` option declarations, - `:51–104` action handler): add `--stages` with a custom parser that validates - against the canonical stage registry (`KTX_SCAN_ENRICHMENT_STAGES` in - `enrichment-state.ts:4`) and rejects unknown/empty names with `InvalidArgumentError`. - Thread through `public-ingest.ts` (`KtxScanArgs` build `:969–978`, `mode: 'enriched'` - `:973`) → `scan.ts` (`runKtxScan`) → `local-scan.ts` (`runLocalScan`) → - `runLocalScanEnrichment`. 
-- **Stage gating + force-rerun** — `context/scan/local-enrichment.ts`: gate each stage - block on membership in the selected set (`descriptions` `:632`, `embeddings` - `:663–665`, `relationships` `:720`); make a named stage bypass the completed-row - short-circuit in `runEnrichmentStage` while the inner `compute()` keeps the spec-20 - per-table resume. `KtxLocalScanEnrichmentInput` (`:60–85`) gains the selected-stage - set. -- **Staleness detection + warning** — `context/scan/local-enrichment.ts` (after the - stage blocks): recompute each unselected stage's current hash from on-disk state, - compare to the stored completed-row hash, push a recoverable warning on mismatch. - Add `enrichment_stage_stale` to the `KtxScanWarningCode` union in - `context/scan/types.ts` (alongside `relationship_detection_partial`). -- **Relationships description context** — `context/scan/local-enrichment.ts` - (`schema` built at `:621`/`:688`, passed to `discoverKtxRelationships` `:736–746`): - hydrate `schema` with the best-available descriptions (fresh-this-run or loaded from - the on-disk `_schema` via `loadExistingManifestState`, - `local-enrichment-artifacts.ts`) before relationship detection. -- **Stage store + resume record** — - `context/scan/sqlite-local-enrichment-state-store.ts` - (`local_scan_enrichment_stages`, PK `(connection_id, stage, input_hash)`, - `findCompletedStage`, `saveCompletedStage`); `createKtxScanDescriptionResumeStore` - (`local-enrichment-artifacts.ts:286–332`, path `:265–267`, inputHash gate - `:305–307`) — both now keyed on the relevant per-stage hash. No migration bridge. -- **Config inputs** — `context/project/config.ts` (`scanRelationshipsSchema` - `:171–218` incl. `llmProposals` `:174` and `detectionBudgetMs`; - `scan.enrichment.embeddings` model/dimensions/batchSize; `llm.models.default`, - `llm.provider.gateway.base_url`): the sources of each per-stage identity slice. 
-- **Tests** — per-stage invalidation isolation (flip one input, assert only the -matching stage recomputes); `--stages` parse/validate (good subsets + unknown/empty -rejected); resume-aware force-rerun (`--stages descriptions` retries only the null -tables, leaves good ones, completes); preserve-others (unselected artifacts -byte-stable); derived staleness (`enrichment_stage_stale` fires for embeddings after -a descriptions change, not for relationships; cleared by a later `--stages -embeddings`); relationships-only run feeds on-disk descriptions to `llmProposals`; -regression — a normal no-flag ingest yields identical artifacts with spec-19/20 -resume intact. -- After implementing, rebuild and re-link so the playground picks it up: -`pnpm run build && pnpm run link:dev`. -- **Docs:** add `--stages` to the `ktx ingest` CLI reference -(`docs-site/content/docs/cli-reference/`) and note the per-stage cache behavior -where enrichment/ingest is described. - -## Benchmark context (motivation, not a requirement) - -Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A -level-aware audit found (a) a tail of BigQuery datasets with poor *column*-description -coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …) that want a **`descriptions`-only** -re-run with a longer timeout, and (b) a desire to **backfill joins** across all -already-ingested datasets after enabling `llmProposals` — without re-paying for -descriptions. Both were blocked by the coarse single `inputHash` (flipping -`llmProposals` or re-describing invalidated the whole enrichment) and the absence of a -stage-selective CLI flag. The benchmark merely exercised multi-backend -ingestion at scale; the gap and the fix are generic production operability. Do not -encode any benchmark specifics in the implementation. - -## Implementation notes - -Shipped on branch `write-feature-spec-wiki`. All six requirements implemented; -all acceptance criteria covered by tests. 
- -**What was built / where:** - -- **Per-stage hashes (D1, Req 1).** `context/scan/enrichment-state.ts`: removed the - coarse `computeKtxScanEnrichmentInputHash` and added - `computeKtxDescriptionsStageHash` (snapshot + `llmIdentity`), - `computeKtxEmbeddingsStageHash` (snapshot + `embeddingIdentity` + `descriptionDigest`), - `computeKtxRelationshipsStageHash` (snapshot + `relationshipSettings` + `llmIdentity`), - plus `computeKtxScanDescriptionDigest` and the `KtxScanLlmIdentity` / - `KtxScanEmbeddingIdentity` types. `KTX_SCAN_ENRICHMENT_STAGES` is now exported as the - canonical registry. `local-scan.ts` `localScanProviderIdentity` was split into - `localScanLlmIdentity` + `localScanEmbeddingIdentity` (dropping the redundant - `mode`/`relationships` re-encoding). `mode`/`detectRelationships` dropped out of the - keys. No migration bridge — the stage store + descriptions resume record just miss the - old coarse-keyed rows. -- **`descriptionDigest` (D1/D4).** `local-enrichment.ts`: extracted - `buildKtxColumnEmbeddingTexts(snapshot, descriptions)`, shared by the embeddings stage - and the digest, so the embeddings hash content-addresses the exact text the model sees. -- **`--stages` flag (D2/D6, Req 2).** `commands/ingest-commands.ts`: - `parseEnrichmentStagesOption` (Commander parser) validates against the registry, - rejects unknown/empty with `InvalidArgumentError`, returns an ordered de-duplicated - set; threaded through `KtxPublicIngestArgs` → `context-build-view` → `KtxScanArgs` → - `RunLocalScanOptions` → `KtxLocalScanEnrichmentInput`. One gate - (`(eligible) ∩ (selected)`); no new `KtxScanMode`. A selected-but-ineligible stage - emits a new `enrichment_stage_skipped` warning (never a silent no-op). 
-- **Force-rerun (D3, Req 3).** `runEnrichmentStage` gained `forceRecompute`; a named - stage bypasses the spec-19 completed-row short-circuit while `generateDescriptions` - still consults the spec-20 per-table resume record (retries only failed tables on an - unchanged hash). -- **Descriptions hydration + `llmProposals` context (D5, Req 5).** `runLocalScanEnrichment` - resolves best-available descriptions (fresh-this-run, else on-disk via a lazy - `loadPriorDescriptions` thunk wired from `local-scan.ts` → - `loadOnDiskDescriptionUpdates` in `local-enrichment-artifacts.ts`). `snapshotToKtxEnrichedSchema` - now merges `ai` descriptions, and `relationship-llm-proposal.ts` `buildEvidencePacket` - now carries the resolved description text — closing the latent gap on **both** the - full-run and relationships-only paths. -- **Derived staleness (D4, Req 4).** `enrichment_stage_stale` warning code + - `findLatestCompletedStage` on the state store (interface + sqlite + test store). After a - selective run, each unselected stage with a completed row is compared against its - freshly recomputed hash; a mismatch warns and names the cascade command. Relationships - are never flagged by a description change (decoupled per D5). -- **Docs.** `docs-site/content/docs/cli-reference/ktx-ingest.mdx`: `--stages` flag row, a - "Selecting enrichment stages" section (per-stage cache, force-rerun, staleness), and - examples. - -**Deviation from the spec — embeddings hydration is descriptions-only.** D5 states a -relationships-only run should hydrate "AI descriptions **and** embeddings" from the -on-disk `_schema`. Investigation found the `_schema` manifest shards store only -descriptions; embedding vectors are written to a **syncId-scoped** `enrichment/embeddings.json` -that no code reads back, and each run mints a fresh syncId — so there is no durable -per-connection embeddings artifact to hydrate from. 
A relationships-only run therefore -hydrates **descriptions** (required for, and verified against, the `llmProposals` -acceptance criterion) but **not** embeddings. Consequence: a `--stages relationships` -backfill gets deterministic + name-based + LLM-proposed candidates (the point of -`llmProposals`), but not the embedding-similarity candidates a full run would add. -Durable embeddings hydration (persist vectors at a stable per-connection path, or read -them from the vector index) is a clean follow-on and was left out of scope. - -**Tests:** `enrichment-state.test.ts` (per-stage hash stability + isolation), -`commands/ingest-commands.test.ts` (parser good/bad subsets, threading, text-capture -guard), `local-enrichment.test.ts` (force-rerun bypasses short-circuit + preserves -others, naming all three forces a full recompute, per-stage invalidation isolation, -prerequisite warning, on-disk descriptions reach `llmProposals`, resume-aware forced -descriptions rerun, derived `enrichment_stage_stale` fires for embeddings/not -relationships and clears after re-embed). Full `pnpm --filter @kaelio/ktx run test`, -`type-check`, `dead-code`, and `build` pass. (One pre-existing unrelated failure in -`test/skills/analytics-skill-content.test.ts` — the analytics `SKILL.md` lacks a -`**Window functions**` heading the test expects — was present before this work and left -untouched.) - ---- - -## ⚠️ Defect found in post-implementation validation (2026-06-24) - -**`--stages` subset excluding `descriptions` WIPES existing on-disk descriptions.** Violates Req -"preserve-others / a selective run never deletes another stage's artifacts." - -**Reproduction (deterministic):** -- `northwind` before: 110 `ai:` column/table descriptions, 0 join edges. -- `ktx-dev ingest northwind --stages relationships` → completes in ~35s, adds **22 join edges** ✅ - but the rewritten `public.yaml` has **0 descriptions** (no `ai:`, no `db:`, columns bare). 
❌ -- A full `ktx-dev ingest northwind` (all stages) restores 110 descriptions + keeps the 22 joins. - -**Likely root cause:** the relationships-only path rewrites the schema from the raw snapshot + only the -freshly-run stage. The implementation notes claim `snapshotToKtxEnrichedSchema` merges `ai` descriptions -and that descriptions are hydrated "fresh-this-run, else on-disk via `loadPriorDescriptions`" — but on the -**write path** of a subset run the prior descriptions are NOT merged into the emitted schema (they reach -the `llmProposals` evidence packet only). So the on-disk `_schema` loses them. - -**Impact:** blocks the intended joins-everywhere backfill (`--stages relationships` across all dbs) and the -`--stages descriptions`-only re-runs — either would destroy the unselected stage's artifacts across every -db. Caught on a 1-db validation before any rollout. - -**Acceptance fix:** after any `--stages` subset, the on-disk `_schema` must **retain all prior `ai:`/`db:` -descriptions** (and prior joins when descriptions-only) for stages not named — only the named stages' -artifacts change. Add a regression test that ingests a fully-enriched fixture, runs `--stages relationships`, -and asserts description count is unchanged while joins increase. - -### ✅ Fixed (2026-06-24) - -**Real root cause (deeper than the first diagnosis):** the wipe happened in **two** places, and the first -fix attempt only addressed one. `runLocalScan` (`context/scan/local-scan.ts`) writes the **structural** -manifest shard from the bare snapshot *before* enrichment runs; that write merges with the on-disk shard, -but the merge (`mergeDescriptionsPreservingExternal`, `live-database/manifest.ts`) treats `ai`/`db` as -**scan-managed** and overwrites them with whatever the run emits — and the structural write emits none. 
So a -subset run deleted the descriptions on the structural pre-write, *then* `runLocalScanEnrichment` read the -already-wiped shard via `loadPriorDescriptions` and had nothing to restore. (A unit-level enrichment test -passed because it never exercised the structural pre-write — a divergent-harness miss; the regression test -was rewritten to go through the full `runLocalScan` path.) - -**What changed:** -- `runLocalScanEnrichment` (`local-enrichment.ts`) now returns the **best-available** descriptions - (`resolveDownstreamDescriptions()` — fresh-this-run if `descriptions` ran, else the on-disk ones) as - `descriptionUpdates`, instead of `[]` when the stage is skipped — so the enrichment write re-applies them. -- `runLocalScan` (`local-scan.ts`) now, on a subset run, **captures the prior on-disk descriptions before - the structural manifest write** and feeds them to both the structural write and enrichment — so the - structural pre-write preserves them too (robust even if relationship detection later fails). -- Joins were already preserved for `--stages descriptions` via the existing manual/inferred - `preservedJoins` path; verified by a symmetric test. - -**Tests:** `local-scan.test.ts` — a full `runLocalScan` `--stages relationships` run preserves on-disk `ai` -descriptions while adding a join (RED without the fix, GREEN with it). `local-enrichment.test.ts` — the -enrichment-layer contract (`--stages relationships` preserves descriptions / `--stages descriptions` -preserves joins). - -**Live validation (northwind, 15 tables):** `--stages relationships` BEFORE `ai:110 joins:22` → AFTER -`ai:110 joins:22` (descriptions intact; previously wiped to 0). `--stages descriptions` restored the -descriptions from the spec-20 resume record (`ai:0 → ai:110`) with **no** LLM calls while keeping `joins:22`. -Full `pnpm --filter @kaelio/ktx run test` (3089 passed), `type-check`, `dead-code`, and `build` pass. 
diff --git a/spider2-specs/specs/22-resumable-and-fault-tolerant-source-ingest.md b/spider2-specs/specs/22-resumable-and-fault-tolerant-source-ingest.md deleted file mode 100644 index 15d1a861..00000000 --- a/spider2-specs/specs/22-resumable-and-fault-tolerant-source-ingest.md +++ /dev/null @@ -1,463 +0,0 @@ -# Resumable and fault-tolerant source ingest - -> Refined spec. No intake draft — surfaced by a real user report, not the -> playground agent (see Motivation). Lives beside the analogous scan-durability -> specs 19/20. -> -> **Scope: make `ktx ingest` (the source-ingest work-unit pipeline behind dbt / -> Metabase / Notion) survive interruption and partial failure on large -> projects.** Two compounding gaps live on the source-ingest path: (1) an -> interrupted run restarts every work unit from scratch — there is no cross-run -> reuse of already-generated work-unit output, so a multi-day dbt ingest loses -> *all* progress to a single VPN/network blip; (2) the final integration gate is -> all-or-nothing — one artifact that cannot pass it (after LLM repair) discards -> the **entire** run with nothing committed. This is the source-ingest analog of -> spec 19 (move the durability boundary to the cost boundary so expensive LLM -> work is not lost) and spec 20 (a stage survives an interruption with per-item -> durability). It **reuses** the same content-keyed durability primitive those -> specs established rather than copying it. - -## Problem - -Two independent failure modes on the source-ingest work-unit (WU) pipeline, -both confirmed in the current code, both observed by a user on a ~2-day dbt -ingest. Their union makes large-project ingest brittle: any interruption is -total loss, and any single unfixable artifact at the end is total loss. - -### 1. 
An interrupted run resumes nothing — every work unit re-runs - -`IngestBundleRunner` (`context/ingest/ingest-bundle.runner.ts`) executes a run as -a sequence of stages: fetch → parse/extract into **work units** → run each WU as -an isolated agent loop in a child worktree (`runIsolatedWorkUnit` → -`executeWorkUnit`, `stages/stage-3-work-units.ts`) → integrate the successful WU -patches → reconcile → finalize → final gates → one atomic squash commit -(`squashMergeIntoMain`, ~2716). The WU stage is where the LLM cost lives: each WU -is an agent loop that reads its `rawFiles`/`dependencyPaths` and writes SL/wiki -artifacts, producing a git patch (`WorkUnitOutcome.patchPath` / -`patchTouchedPaths`, `stage-3-work-units.ts:31-46`). - -The only persisted cross-run state is `SqliteBundleIngestStore` -(`context/ingest/sqlite-bundle-ingest-store.ts`): run metadata, the final report, -and provenance — all written at or near **run completion**. There is **no -checkpoint of completed WU output**. A run that dies mid-flight (the user's -VPN/network drop) leaves nothing reusable: the next `ktx ingest` re-fetches, -re-parses, and **re-executes every WU from scratch**, re-paying the entire LLM -cost. The store even keys `job_id` UNIQUE, so a re-run is a brand-new job with no -relationship to the interrupted one. - -> Observed (user report, large dbt project): a run reached deep into its -> work-unit progress and was lost to a network blip; the follow-up run started -> over from zero. On a ~2-day ingest this is the difference between a 5-minute -> resume and a 2-day redo. - -### 2. The final integration gate is all-or-nothing - -After all surviving WUs are integrated, `validateFinalIngestArtifacts` -(`context/ingest/artifact-gates.ts:96`) runs the final gate. 
It checks, across -the *integrated* tree: - -- **intrinsic source validity** — `validateTouchedSources` → - `validateWuTouchedSources` (`stages/validate-wu-sources.ts:124`) → - `validateSingleSource` (`context/sl/tools/sl-warehouse-validation.ts:56`), - which runs a **live warehouse dry-run** (`SELECT * FROM (sql) LIMIT 1`); -- **cross-artifact references** — dangling join targets - (`findJoinTargetErrors`, `validate-wu-sources.ts:89`), dangling `wiki→wiki` - refs (`validateWikiRefs` → `findMissingWikiRefs`), broken `wiki→sl_ref`s - (`validateWikiSlRefs`, `artifact-gates.ts:39`), and broken wiki body refs - (`findInvalidWikiBodyRefs`). - -On any error it **`throw`s a single concatenated string** (`artifact-gates.ts:129`). -The runner catches it, runs the LLM repair `repairFinalGateFailure` -(`runner.ts:2595`, `maxAttempts: 2`), and if repair still fails, **re-throws** -(`runner.ts:2623`) → `markFailed` → the squash never runs → `commitSha: null` -(`runner.ts:2729`) → **the whole run is discarded, nothing committed.** - -The crucial asymmetry: a WU that fails *on its own terms* never reaches this gate -— `executeWorkUnit` already validates each WU in isolation (`validateWikiRefs` -~143, `validateTouchedSources` ~150) and **soft-fails** it (`failWithReset`, -~155: the WU resets, is excluded from integration, and the run continues). So by -the time the final gate runs, intrinsic single-source failures are rare. The -gate fails predominantly on **cross-artifact dangling references**: WU-A's source -joins to a source WU-B was meant to create, but WU-B failed/was-excluded, so -A's join now points at nothing. Each WU passed *alone*; the break only appears -once the survivors are integrated — and that break currently nukes the run. - -> Observed (user report): a run completed all task generation and then failed at -> the final integration gate on a **single model**; because the gate is -> all-or-nothing, that one failure discarded an ~18h run with nothing committed. 
- -## Generic use case (independent of any benchmark) - -Anyone ingesting a large warehouse/BI/dbt project with an LLM pipeline will hit -both failures. Large ingests run long enough that an interruption is a *when*, -not an *if* (laptop sleep, VPN reconnect, transient provider error, an operator -ctrl-C on an apparently-stuck run), and a large artifact set makes it -near-certain that *some* model lands a cross-reference its sibling didn't -produce. Without cross-run reuse, every interruption is a from-scratch redo of -the dominant (LLM) cost; without partial commit, one unfixable artifact throws -away every good one. Both fixes make large-project ingest **resilient and -resumable**: an interruption costs only the unfinished work, and a single bad -model costs only that model — not the run. This is core robustness for a -general-purpose ingestion product. - -## Design decisions (resolved during refinement) - -These resolve the design space explored during refinement. They constrain the -implementer; the exact code is theirs (requirement-level, per the specs README). - -### D1 — Resume is automatic and content-keyed at the work-unit level - -A successful WU's output is cached across runs, keyed by a **content hash of its -inputs**, with **no `--resume` flag**. Re-running the same `ktx ingest` -transparently replays any WU whose inputs are byte-identical to a cached success -and re-runs only the changed, failed, or missing WUs. The key is computed over: -the contents of the WU's `rawFiles` + `dependencyPaths` (the bytes the WU reads, -`types.ts:19-28`), the adapter/source identity, and a **version/prompt -fingerprint** (ktx version + the WU system/user prompt + model role). A changed -dbt model busts only that model's entry; everything unchanged replays for free. - -> No flag, no config knob. Content-keying makes resume automatic; a flag would -> double the state space for no benefit. 
This is the same shape scan uses -> (`computeKtxScanEnrichmentInputHash`, spec 19), reached here for the WU -> pipeline. - -### D2 — The cached unit is the successful WU's patch; replay verifies or recomputes - -The cache stores a successful WU's **output artifacts**: its git patch -(`patchPath` content / `patchTouchedPaths`) plus the metadata integration needs -(`actions`, `touchedSlSources`, `slDisallowed`). On a cache hit, the runner -**replays the patch** into the session worktree — no agent loop, no LLM — exactly -where it would have integrated a freshly-run WU. If a cached patch **fails to -apply** (the surrounding tree drifted), the entry is discarded and the WU -**recomputes**. So a stale hit degrades to "recompute," never to a corrupt tree: -the cache can only make a run faster, never wrong. - -### D3 — One durability primitive, shared by scan and ingest - -Per the "one capability, one implementation" rule, the content-keyed store is -**extracted** into a shared primitive and **both** scan and ingest route through -it — not copied. Scan's `sqlite-local-enrichment-state-store.ts` (PK -`(connection_id, stage, input_hash)`, `findCompletedStage` / `saveCompletedStage`) -and its `inputHash` computation (`enrichment-state.ts`) are generalized to a -content-keyed result cache; scan is migrated onto the shared primitive **in the -same change** so no second copy exists even transiently. The ingest cache is a -new logical namespace (e.g. keyed `(connectionId, sourceKey, workUnitInputHash)`) -on that one store. - -> Extract-and-share in one PR, not "build a copy for ingest now, unify later." -> A temporary fork is exactly the divergence the rule forbids; the one-time -> extraction cost is paid once and both paths benefit from every later fix. - -### D4 — Only successes are cached; failures retry on the next run - -A failed WU is **not** recorded as terminal — the next run retries it. 
WU -failures on this path are dominantly transient (network, provider stall, an LLM -slip), and the user's explicit ask is "resume and finish the rest," so a failure -must not be sticky. This deliberately differs from scan's stage store (which -caches failed stages and re-throws): there the failure is the stage's -deterministic verdict; here a WU failure is usually a blip to retry. Caching only -successes also keeps the invariant simple — a cache entry always means "this -exact input already produced this exact good output." - -### D5 — The final gate becomes non-fatal: deterministic dangling-edge prune - -Replace the gate's fatal `throw`-after-repair with a deterministic reconciliation -that always yields a committable, internally-consistent tree: - -1. `validateFinalIngestArtifacts` is refactored to **return structured findings** - (the danglers it already computes internally — join targets, `wiki→wiki`, - `wiki→sl_ref`, wiki body refs — plus any intrinsic source failure) instead of - flattening them into a thrown string. -2. **Drop the rare self-invalid source first.** A source that fails its *own* - validation at the final gate (intrinsic — rare, since stage 3 already filters - these) is removed, establishing the surviving artifact set. -3. **Prune the dead edges in a single pass** over that surviving set. For each - dangling reference — whether it pointed at an absent sibling or at a - just-dropped source — **remove that reference from its owner** (drop the join - entry, remove the `wiki ref` / `sl_ref`, remove the broken body link), keeping - the owning artifact. Because nodes are dropped first (step 2) and pruning only - removes edges, pruning **cannot create a new dangling edge, so one pass - suffices; no fixpoint.** -4. Re-run the gate to **confirm** the remainder is clean (warehouse dry-runs are - cached per D6/D2, ref checks are in-memory, so this is cheap), then squash-commit - the remainder. 
If the confirm pass *still* fails, that is a real bug — fail the - run loudly rather than commit a dirty tree. - -`repairFinalGateFailure` (the LLM repair, `runner.ts:2595` / `final-gate-repair.ts`) -is **removed**. The deterministic prune supersedes it for the referential class, -and the rare intrinsic case is handled by drop. - -> **Prune the edge, do not cascade the node.** The rejected alternative drops the -> *referencing artifact* and, transitively, everything that referenced *it* — a -> node-quarantine fixpoint that cascades healthy artifacts and needs a closure -> search, a confirm loop, and an un-apply step. Pruning the dead edge keeps the -> dependent intact (minus one pointer that never resolved anyway), needs no -> fixpoint, and acts on findings the gate already produces. -> -> **Why remove the LLM repair rather than keep it as a pre-prune step.** Repair -> can occasionally *fix* a ref (e.g. correct a typo'd source name) where prune -> merely deletes it, preserving marginally more content. We drop it anyway: -> determinism beats an LLM round-trip with variance on the commit path, prune -> guarantees a commit where repair could only `throw`, and deleting it is a net -> maintenance reduction. The decision is reversible — repair could later run as a -> best-effort pass *before* prune — but the default is prune-only. - -### D6 — Prune runs on the integrated tree, never poisons the cache (resume ∘ prune compose) - -Pruning is applied to the **integrated session worktree** at gate time and is -**re-derived from the current survivor set on every run**. It MUST NOT mutate the -cached WU patches (D2). This makes resume and prune compose correctly and -**self-heal**: - -- Run 1: WU-A (joins to B) succeeds and is cached *with its join intact*; WU-B - fails; the gate prunes A's join-to-B from the integrated tree and commits A - without it. 
-- Run 2 (after the root cause is fixed): A's input is unchanged → A **replays - from cache with its join restored**; B now succeeds and exists; the gate finds - no dangler and commits both, fully linked. - -So a ref pruned because of a sibling's failure costs nothing permanent: fixing -the sibling and re-running restores the link for free. The cache stores -intent (the WU's real output); prune is a per-run consistency projection over -whatever survived. - -### D7 — Pruning is faithful and never silent - -A pruned reference was, by definition, non-functional (its target was absent), so -removing it loses nothing executable — and removing dangling SL joins is already -the established fix for the SL engine's eager orphan-join rejection. Every prune -and every drop MUST be **recorded in the run report and a trace event** naming -the artifact, the removed reference, and the absent target. The report status -MUST reflect partial completion (extend the existing `failedWorkUnits` -mechanism, `IngestBundleResult`, `types.ts:204-213`, with the pruned-refs / -dropped-sources detail) so a partial run is visibly partial, never a silent -"success." - -### D8 — Cache state is regenerable; no migration bridge - -The WU cache is regenerable local state under `.ktx/`. Its on-disk/SQLite shape -may change with **no migration bridge** — a stale-shaped or absent cache simply -forces a full (non-resumed) run, exactly today's behavior. Consistent with ktx's -no-backward-compatibility policy; the cache is an optimization, never a source of -truth. - -## Requirements - -1. **Cross-run WU resume, automatic and content-keyed.** A successful WU's output - MUST be cached keyed by a content hash over its input bytes - (`rawFiles` + `dependencyPaths`), the adapter/source identity, and a - version/prompt fingerprint (ktx version + WU prompt + model role). Re-running - `ktx ingest` MUST replay cached successes without an agent loop / LLM call and - re-run only changed, failed, or missing WUs. 
No `--resume` flag and no config - knob is added. -2. **Replay verifies or recomputes.** On a cache hit the runner MUST replay the - stored patch into the session worktree; if the patch does not apply cleanly the - entry MUST be discarded and the WU recomputed. A cache hit MUST NOT be able to - produce a tree different from what a fresh run of that WU would have integrated. -3. **Only successes are cached.** A failed WU MUST NOT be recorded as terminal; it - MUST be retried on the next run. -4. **Conservative invalidation.** The input hash MUST change when the ktx version, - the WU prompt, or the model role changes (bias toward recompute). Under-keying - (stale reuse) is a correctness bug; over-keying (an unnecessary recompute) is - acceptable. -5. **The final gate is non-fatal.** A final-gate failure MUST NOT discard the run. - `validateFinalIngestArtifacts` MUST return structured findings; the runner MUST - deterministically **prune** every dangling reference from its owning artifact - and **drop** any source that fails its own validation, then commit the - remaining internally-consistent tree. -6. **Single-pass prune, dependents survive.** Pruning MUST remove dead *edges*, not - cascade-drop owning artifacts; it MUST complete in a single pass (no fixpoint) - because edge removal cannot create new dangling edges. A dependent that loses - one dangling ref MUST otherwise be committed intact. -7. **Prune composes with resume.** Pruning MUST operate on the integrated tree and - MUST NOT mutate cached WU patches. A reference pruned in one run because its - target was absent MUST be restored automatically on a later run once the target - exists (resume replays the owner's intact patch). -8. **Confirm before commit.** After pruning/dropping, the gate MUST be re-run on - the remainder and MUST pass before the squash; if it still fails the run MUST - fail loudly rather than commit a dirty tree. -9. 
**`repairFinalGateFailure` is removed.** The LLM final-gate repair path and its - obsolete tests/branches MUST be deleted (no dormant compatibility path). -10. **Every prune/drop is reported.** Each pruned reference and dropped source MUST - be recorded in the run report and a trace event (artifact, removed ref, absent - target). A run that pruned or dropped anything MUST report as partial, never as - an unqualified success. -11. **One shared durability primitive.** The content-keyed store MUST be a single - implementation used by both scan and ingest; scan MUST be migrated onto it in - the same change. No second copy may exist, even transiently. -12. **No regression for clean runs.** A small, uninterrupted run whose every WU - passes and whose final gate is clean MUST produce byte-identical artifacts and - the same `commitSha`/report shape (modulo new, empty pruned/dropped fields) as - today. - -## Acceptance criteria - -- **Resume skips completed work:** interrupt an ingest after K of N WUs have - succeeded; re-run the same command (unchanged inputs); the run issues **zero** - agent loops / LLM calls for the K cached WUs, runs only the remaining N−K, and - produces the same final artifacts as an uninterrupted run. -- **Changed model busts only its entry:** edit one dbt model between runs; the - re-run re-executes **only** the WU(s) whose input bytes changed and replays the - rest from cache. -- **Stale patch self-corrects:** a cached patch that no longer applies (forced - drift in a test) causes that WU to recompute, not a corrupt tree or a crash. -- **Failures retry:** a WU that fails in run 1 (transient error) is **not** cached; - run 2 retries it and, on success, integrates it. -- **One bad model no longer nukes the run:** a run where WU-B fails so WU-A's join - to B dangles **commits** — A is committed with the dangling join **pruned**, the - report lists the pruned ref, and `commitSha` is non-null (contrast: today this - throws and commits nothing). 
-- **No cascade:** in that scenario A (and any other artifact that only referenced - B) is committed intact except for the single pruned reference; nothing healthy - is dropped. -- **Self-heal:** fix B's root cause and re-run; A replays from cache with its join - intact, B succeeds, and the final tree commits both fully linked with no prune. -- **Intrinsic drop:** a source that fails its own warehouse dry-run at the final - gate (forced) is dropped, refs to it are pruned, and the rest commits; the drop - is reported. -- **Repair is gone:** `repairFinalGateFailure` and its tests no longer exist; the - gate path has no LLM call. -- **One store:** scan and ingest both resume through the same content-keyed - primitive (one implementation; scan's behavior is unchanged by the migration — - spec 19/20 acceptance still passes). -- **Clean-run regression:** a small uninterrupted all-passing ingest yields - identical artifacts, `commitSha`, and report (empty pruned/dropped fields) to - today. - -## Non-goals - -- **Resuming the cross-WU stages.** Reconciliation, finalization, and the final - gate re-run every time; their inputs depend on the full survivor set and their - cost is small relative to WU generation. Only WU generation is cached. -- **A `--resume` flag or any timeout/cache config knob.** Content-keying makes - resume automatic (D1); one opinionated default is the canonical ktx shape. -- **Caching failed WUs as terminal.** Failures retry (D4). -- **Node-cascade quarantine of the final gate.** Prune edges, do not drop - dependents (D5). No closure search, confirm-loop-over-nodes, or un-apply step. -- **Tolerating dangling references (warn instead of remove).** Unsafe — the SL - engine eagerly rejects orphan joins — so dead edges must be removed, not kept. -- **Keeping the LLM final-gate repair.** Removed (D5/req 9). 
-- **A general per-stage resume framework beyond the shared content-keyed store.** - The store is the one shared primitive (D3); this spec does not abstract every - ingest stage into a resumable framework. -- **Re-implementing spec 19/20 (scan durability).** This spec composes the same - primitive onto the source-ingest WU pipeline. - -## Implementation orientation - -Line numbers drift; treat these as anchors, not addresses. The implementer owns -the design. - -- **Run flow + the all-or-nothing seam** — `context/ingest/ingest-bundle.runner.ts`: - WU run + integration of successful patches (~1600–1900), the final-gate block - (~2549–2587, `runFinalArtifactGates`), the repair-then-rethrow that must be - replaced by prune (~2588–2644; the fatal `throw` ~2623), and the atomic squash - (~2701–2729; `commitSha: null` when nothing is touched ~2729). The prune step - slots between the gate findings and the squash, operating on `sessionWorktree`. -- **Work units & cacheable output** — `context/ingest/types.ts` (`WorkUnit` - ~19–28: `rawFiles`/`peerFileIndex`/`dependencyPaths`; `IngestBundleResult` - ~204–213: extend with pruned/dropped detail); - `context/ingest/stages/stage-3-work-units.ts` (`executeWorkUnit`; the per-WU - validation + `failWithReset` ~134–157 that already soft-fails a WU; - `WorkUnitOutcome` ~31–46 with `patchPath`/`patchTouchedPaths`/`actions`/ - `touchedSlSources` — the cache payload). The cache lookup/replay wraps the - per-WU execution; only the agent-loop branch is skipped on a hit. 
-- **The gate (make it return findings)** — `context/ingest/artifact-gates.ts` - (`validateFinalIngestArtifacts` ~96; the internal per-artifact danglers from - `validateWikiSlRefs` ~39, `validateWikiRefs` ~74, `findInvalidWikiBodyRefs`; - the concatenated `throw` ~129 to replace with a structured return); - `context/ingest/stages/validate-wu-sources.ts` (`validateWuTouchedSources` ~124; - `findJoinTargetErrors` ~89 already returns missing join targets per source — - the join-edge danglers to prune); `context/sl/tools/sl-warehouse-validation.ts` - (`validateSingleSource` ~56 — the intrinsic warehouse dry-run; its failures are - the drop set, not the prune set). -- **Per-ref-type pruners (pair 1:1 with the validators)** — join: remove the - offending `joins[]` entry from the source YAML; `wiki refs`/`sl_refs`: remove - the entry from page frontmatter (`context/wiki/wiki-ref-validation.ts` - `findMissingWikiRefs`); wiki body refs: remove the broken link token - (`context/ingest/wiki-body-refs.ts` `findInvalidWikiBodyRefs`). Each pruner is - deterministic and edits the integrated worktree only. -- **Remove the LLM repair** — `context/ingest/final-gate-repair.ts` - (`repairFinalGateFailure`) and the `constrained-repair.ts` usage for - `final_artifact_gate`; delete the call site (~2595) and its tests. -- **Durability primitive to extract & share** — - `context/scan/sqlite-local-enrichment-state-store.ts` (`local_scan_enrichment_stages`, - PK `(connection_id, stage, input_hash)`, `findCompletedStage`/`saveCompletedStage`), - `context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78), and - the resume wrapper `runEnrichmentStage` (`context/scan/local-enrichment.ts`). - Generalize to a content-keyed result cache; migrate scan onto it; add the ingest - namespace. The existing ingest store - `context/ingest/sqlite-bundle-ingest-store.ts` (`SqliteBundleIngestStore`) is - where ingest-side persistence lives — the WU cache sits alongside it under - `.ktx/`. 
-- **Tests** — resume: run an ingest against a real git-backed project with a fake - agent runner, interrupt after K WUs, assert the re-run issues no agent loops for - the K and the same artifacts result; changed-input bust; stale-patch recompute; - failed-WU retry. Prune: a fixture where one WU fails so a sibling's join/wiki - ref dangles → assert the run commits the sibling with the ref pruned, reports the - prune, and `commitSha` is non-null; assert no cascade; assert self-heal on a - follow-up run; assert intrinsic drop. Migration: spec 19/20 scan acceptance still - green on the shared primitive. Regression: a small uninterrupted all-passing - ingest is byte-identical to today. -- After implementing, rebuild and re-link so the playground picks it up: - `pnpm run build && pnpm run link:dev`. - -## Motivation (the real report, not a benchmark) - -A user ingesting a fairly large dbt project (~2-day run) hit both gaps together. -First, an interruption — a VPN drop / network blip — lost all progress because -ingest cannot resume; they had to restart from scratch. Second, on a later run -that completed all task generation, a **single model** failed the final -integration gate, and because the gate is all-or-nothing the one failure -discarded an ~18h run with nothing committed. Their ask: "some form of resume or -checkpoint (or at least reusing the patches that were already generated), and a -way to skip or quarantine a single failing model instead of failing the entire -run." This spec delivers both — resume via the content-keyed WU cache, and -partial commit via deterministic dangling-edge pruning. Unlike specs 19/20 this -gap was surfaced by a real user on a real warehouse, not by the benchmark; the -fix is generic production hygiene for any large ingest. - -## Implementation notes - -Shipped on branch `write-feature-spec-wiki` (squash-merge target). 
All 12 -requirements and every acceptance criterion are covered by committed code and -tests; the full `@kaelio/ktx` package suite is green. - -What was built and where: - -- **Shared content-keyed durability primitive** — `context/cache/content-result-cache.ts` - + `sqlite-content-result-cache.ts` (`SqliteContentResultCache`, `local_content_results`). - Scan was migrated onto it in the same change (`context/scan/sqlite-local-enrichment-state-store.ts` - is now a thin adapter; the old `local_scan_enrichment_stages` table is dropped), - so no second copy exists (D3 / req 11). -- **Content-keyed WU cache + replay** — `context/ingest/work-unit-cache.ts` - (`computeIngestWorkUnitInputHash` over raw/dependency bytes + source identity + - CLI version + prompt fingerprint + model role; success-only `saveSuccessfulWorkUnitCache`). - Replay/recompute and stale-recompute state refresh wrap the WU loop in - `ingest-bundle.runner.ts` (D1/D2/D4 / reqs 1–4). -- **Non-fatal final gate** — `artifact-gates.ts` `validateFinalIngestArtifacts` - returns structured findings; `context/ingest/final-gate-prune.ts` deterministically - drops self-invalid sources and prunes dangling edges in a single pass, then a - confirm gate runs before squash (D5/D6 / reqs 5–8). `finalGatePrunedReferences` - / `finalGateDroppedSources` are recorded in the report + trace and surface as a - `partial` outcome (D7 / req 10). `repairFinalGateFailure` and its tests are - deleted (req 9). - -Deviations / decisions worth noting (all preserve spec intent): - -- **Cache stores artifact content snapshots (payload schema v2), not just a raw - git patch.** Replay materializes the owner's artifacts against the *current* - base, so a ref pruned in one run because a sibling failed is restored for free - on a later run once the sibling exists — without re-running the owner's agent - loop (D2/D6 / req 7 self-heal). A drifted/stale snapshot degrades to recompute. 
-- **Final-gate prune/drop resolves sources through the canonical - `resolveSlSourceFile` resolver**, not a derived `semantic-layer//.yaml` - path, so it works for uppercase / hash-derived source filenames (not only - lowercase demo names). -- **`executeWorkUnit` defers pruneable cross-artifact findings** (missing join - target / wiki ref / sl_ref) to the final gate instead of soft-failing the WU; - only intrinsic `source_validation` failures remain fatal at the WU level. This - is what lets a sibling-failed WU's owner survive to be pruned rather than be - excluded upstream (reqs 5–7, "no cascade"). -- The raw report record keeps `status: 'completed'`; partial completion is derived - by `ingestReportOutcome` from the populated prune/drop fields. diff --git a/spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md b/spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md deleted file mode 100644 index ad70e83d..00000000 --- a/spider2-specs/todo/03-multi-connection-routing-in-analytics-skill.md +++ /dev/null @@ -1,66 +0,0 @@ -# Multi-connection routing guidance in the ktx-analytics skill - -## Problem - -The agent-facing `ktx-analytics` skill (installed into agent environments via -the ktx skills/install mechanism, see `.ktx/agents/install-manifest.json` in -projects) describes the query workflow — wiki_search → sl_read_source → -sl_query / sql_execution — but assumes the connection is obvious. In a -multi-connection project nothing tells the agent to *first decide which -connection the question is about*, and several tools silently require it: - -- `sql_execution`, `sl_read_source`, `entity_details`: `connectionId` - **required**; -- `sl_query`, `discover_data`, `dictionary_search`: optional, but - auto-inference only works with exactly one connection - (`local-query.ts` `resolveLocalConnectionId` ~29-38 — throws with zero or - multiple connections). 
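The auto-inference rule in the list above (usable only with exactly one connection) can be sketched as follows. This is an assumption-labelled illustration of the described behaviour, not the real `resolveLocalConnectionId`; the function name, parameters, and error messages are hypothetical.

```typescript
// Illustrative only: mirrors the behaviour described in the text, not the
// actual `resolveLocalConnectionId` in `local-query.ts`.
function resolveConnectionIdSketch(
  explicitId: string | undefined,
  availableIds: string[],
): string {
  // An explicit id is used as given (this sketch does not validate it).
  if (explicitId !== undefined) {
    return explicitId;
  }
  // Inference is only safe when there is exactly one candidate.
  const [only, ...rest] = availableIds;
  if (only !== undefined && rest.length === 0) {
    return only;
  }
  if (availableIds.length === 0) {
    throw new Error("no connections configured; cannot infer connectionId");
  }
  throw new Error(
    "multiple connections (" + availableIds.join(", ") + "); connectionId must be explicit",
  );
}

// Single-connection project: inference works.
console.log(resolveConnectionIdSketch(undefined, ["warehouse"]));
// Multi-connection project: the caller must pass the id it routed to.
console.log(resolveConnectionIdSketch("events_db", ["warehouse", "events_db"]));
```

In a multi-connection project the inference branch always throws, so an agent that has not routed first has no working default to fall back on.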
- -An agent that skips routing either errors out or, worse, queries the wrong -database when names overlap. - -## Generic use case - -Any ktx project with more than one connection — the common shape for a data -org (warehouse + product DB + events DB). Routing is the first step of every -question, and the skill should encode it so individual agents don't have to -rediscover it. - -## Requirements - -1. **Add an explicit routing step (step 0) to the skill's workflow:** - - Call `connection_list` to see what exists. - - Match the question's domain to a connection using connection ids/names, - `discover_data` hits, and wiki context — not guesswork. - - If genuinely ambiguous after discovery, ask the user rather than pick. -2. **Thread the resolved `connectionId` everywhere:** all subsequent - `sl_query`, `sql_execution`, `sl_read_source`, `entity_details`, - `dictionary_search`, `discover_data` calls, and `wiki_search` once spec 01 - lands (search scoped to the resolved connection plus unscoped pages). -3. **Single-connection projects stay frictionless:** the skill should say - routing is trivial when `connection_list` returns one entry — don't add a - mandatory ceremony step for the common simple case. -4. **Capture routing knowledge:** when the agent learns a non-obvious - question-domain → connection mapping, the skill should encourage - `memory_ingest` so the mapping becomes wiki knowledge for next time. - -This is a docs/prompt change in the skill content (plus any skill-install -plumbing if the skill is versioned); no engine changes required. - -## Acceptance criteria - -- In a fixture project with ≥2 connections, an agent following the skill - resolves the correct connection before its first data query, and no tool - call fails with "connectionId is required". -- In a single-connection project the skill-driven flow is unchanged (no - extra mandatory steps). -- Skill text nowhere assumes a default/implicit connection. 
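The routing step in the requirements above is guidance for an agent rather than engine code, but its decision logic can be sketched to make the three outcomes concrete. Everything here is hypothetical: `routeQuestion`, `RoutingDecision`, and the idea that discovery yields a list of matched connection ids are assumptions for illustration.

```typescript
type RoutingDecision =
  | { kind: "resolved"; connectionId: string }
  | { kind: "ask_user"; candidates: string[] };

// `connectionIds` stands in for the result of `connection_list`.
// `matchedIds` stands in for the connections that discovery (ids/names,
// `discover_data` hits, wiki context) tied to the question.
function routeQuestion(connectionIds: string[], matchedIds: string[]): RoutingDecision {
  // Single-connection project: routing is trivial, no ceremony.
  const [only, ...rest] = connectionIds;
  if (only !== undefined && rest.length === 0) {
    return { kind: "resolved", connectionId: only };
  }
  // Multi-connection project: accept only an unambiguous match.
  const matches = connectionIds.filter((id) => matchedIds.indexOf(id) !== -1);
  const [match, ...otherMatches] = matches;
  if (match !== undefined && otherMatches.length === 0) {
    return { kind: "resolved", connectionId: match };
  }
  // Zero or several matches: ask the user instead of guessing.
  return { kind: "ask_user", candidates: matches.length > 0 ? matches : connectionIds };
}

console.log(routeQuestion(["warehouse"], []));
console.log(routeQuestion(["warehouse", "product_db", "events_db"], ["events_db"]));
console.log(routeQuestion(["warehouse", "product_db"], ["warehouse", "product_db"]));
```

The resolved `connectionId` is then what gets threaded through every subsequent tool call; the `ask_user` branch corresponds to the "ask rather than pick" requirement.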
- -## Benchmark context (motivation only) - -Spider 2.0-Lite local subset = 30 SQLite connections in one project; every -one of the 135 questions targets exactly one of them. Connection ids are set -to the benchmark's database names, so with this skill guidance routing is -mechanical (`connection_list` + name match) and needs no benchmark-specific -instructions — which is the point: the harness gives the agent only the -question text. diff --git a/spider2-specs/todo/04-offline-schema-docs-adapter.md b/spider2-specs/todo/04-offline-schema-docs-adapter.md deleted file mode 100644 index d37fd97f..00000000 --- a/spider2-specs/todo/04-offline-schema-docs-adapter.md +++ /dev/null @@ -1,51 +0,0 @@ -# Offline schema-documentation ingest adapter - -> **Priority: LOW / backlog.** Explicitly **not** needed for the Spider -> 2.0-Lite benchmark — we verified the benchmark's offline schema files -> (DDL dumps + sample-row JSONs) are a strict subset of what the live SQLite -> scan already captures (DDL, types, PKs, sample values, cardinality -> profiling). Implement specs 01-03 first; pick this up only if a real -> use case shows up. - -## Problem - -The ingest pipeline's schema knowledge comes from live database scans -(`live-database` adapter) or BI-tool adapters (metabase, looker, dbt…). -There is no adapter for **offline schema documentation**: files describing -tables/columns that exist outside the database — column-description -spreadsheets, data dictionaries, DDL exports with comments, hand-maintained -schema docs. - -## Generic use case - -Teams whose richest schema documentation lives outside `information_schema`: -a wiki export of column meanings, a governance tool's CSV data dictionary, -DDL files with COMMENT clauses the production scan can't see, or -environments where ktx has no live access at all and must build the semantic -layer from documentation alone. - -## Requirements (sketch — refine when picked up) - -1. 
A new ingest adapter (peer of `metabase`/`dbt` in - `context/ingest/adapters/`) consuming a configured local path of schema - docs per connection. -2. Input formats to start: DDL files (`.sql`/`.csv` of CREATE statements) - and tabular column dictionaries (CSV/JSON: table, column, description, - …). Extensible to other formats. -3. Output: **enrichment, not duplication** — merge descriptions/metadata - into the manifest-backed semantic-layer sources and dictionary for the - matching connection. Where a live scan exists, offline docs fill gaps - (descriptions, enum meanings, deprecation notes) and flag drift - (documented column missing from live schema and vice versa) rather than - creating parallel wiki pages that duplicate schema info. -4. Works without live database access (documentation-only bootstrap of a - connection's semantic layer), clearly marked as unverified-against-live. - -## Acceptance criteria (sketch) - -- Given a connection with a live scan plus an offline column dictionary, - semantic-layer sources carry the documented descriptions, and drift - between doc and live schema is reported. -- Given a connection with docs only (no live access), `sl list`/`sl read` - expose manifest sources built from the docs. -- No wiki pages are created that merely restate table/column lists. diff --git a/spider2-specs/todo/05-composite-key-join-detection.md b/spider2-specs/todo/05-composite-key-join-detection.md deleted file mode 100644 index 0f3a6c7e..00000000 --- a/spider2-specs/todo/05-composite-key-join-detection.md +++ /dev/null @@ -1,59 +0,0 @@ -# Composite-key (multi-column) join detection - -> Priority: MEDIUM. Found empirically during the first Spider2-lite sqlite -> smoke test (2026-06-13): relationship detection emitted **zero joins** for a -> database whose fact tables are linked only by composite keys. 
Agents still -> answered correctly by inferring the join from shared `grain`, so this didn't -> cost benchmark points — but it forces inference that explicit joins would -> remove, and the gap is generic. - -## Problem - -Relationship detection appears to emit only single-column joins. For the IPL -sqlite database, every table came back with `joins=0`, even though its fact -tables are connected by a 4-column composite key -(`match_id, over_id, ball_id, innings_no`) shared across `ball_by_ball`, -`batsman_scored`, `extra_runs`, and `wicket_taken`. The semantic layer did -correctly record that shared key as each table's `grain`, which is why agents -could recover the relationship — but no `joins:` entries were produced for the -fact-to-fact links. - -## Generic use case - -Event/fact tables keyed by composite business keys are common: ledger lines -(`account_id, period, line_no`), telemetry (`device_id, ts, metric`), sports -ball-by-ball, EAV/log schemas. Whenever there are no single-column FKs but a -multi-column key recurs across tables, ktx should detect and document the join -so agents (and `sl_query`) don't have to infer it. - -## Requirements - -1. Relationship detection considers **multi-column** join candidates, not just - single-column ones. A strong signal already exists in ktx: when two tables - share an identical (or subset/superset) declared `grain`, that grain is a - prime composite-join candidate. -2. Emitted joins carry the full composite condition, e.g. - `on: a.match_id = b.match_id AND a.over_id = b.over_id AND a.ball_id = b.ball_id AND a.innings_no = b.innings_no`, - with a sensible `relationship` cardinality. -3. The existing validation/threshold machinery - (`scan.relationships.acceptThreshold` etc.) applies to composite candidates - too; profile-based validation should check join selectivity on the full key. -4. No regression for single-column joins; don't explode combinatorially — - bound candidate generation (e.g. 
only consider shared-grain keys and - declared/!inferred PK overlaps, cap column count). -5. `sl_query` can compile a join across a composite-key relationship. - -## Acceptance criteria - -- For a fixture with two tables sharing a 3- or 4-column grain and no - single-column FK, ingest emits a composite join between them with the full - multi-column `on` condition. -- `sl read ` shows the composite join; `sl_query` can traverse it. -- Single-column join detection is unchanged on existing fixtures. - -## Benchmark context (motivation only) - -IPL (and similar ball-by-ball/event schemas in the Spider2-lite local set) -have no single-column FKs; their joins are entirely composite. Explicit -composite joins would let the agent rely on documented relationships instead -of inferring them from grain. diff --git a/spider2-specs/todo/13-canonical-authoritative-source-measures.md b/spider2-specs/todo/13-canonical-authoritative-source-measures.md deleted file mode 100644 index f80c4c2d..00000000 --- a/spider2-specs/todo/13-canonical-authoritative-source-measures.md +++ /dev/null @@ -1,89 +0,0 @@ -# Canonical / authoritative-source measures in the semantic layer - -## Problem - -Many schemas contain an **authoritative table** that already encodes a metric's -business rules — an official standings/leaderboard table, a general-ledger or -period-end balance table, a materialized summary/snapshot — alongside the **raw -transactional** rows the metric *could* be re-derived from. Re-deriving the metric -from the raw rows frequently diverges from the canonical definition, because the -authoritative table bakes in rules the raw data doesn't expose (drop-scores, -penalties, adjustments, reconciliations, as-of snapshots). 
- -Today ktx's semantic layer doesn't distinguish "authoritative summary" tables from -raw fact tables, so the analytics skill has no signal that one source is canonical -for a metric — and the agent often re-derives from raw rows and gets a defensible- -but-different number. - -## Generic use case (independent of any benchmark) - -- "Championship points per competitor this season" — a sports schema may hold both - raw per-event results AND an official standings table that applies drop-scores - and penalties. The standings table is the canonical source; summing raw results - is wrong. -- "Account balance as of month end" — prefer a ledger/balance-snapshot table over - re-summing every transaction (which may miss adjustments). -- "Monthly recognized revenue" — prefer a finance summary table over re-deriving - from line items. - -In each case a real analyst should be steered to the authoritative source. - -## Requirements - -1. **Detect candidate authoritative tables during ingest.** Heuristics only — - e.g. tables whose name/role suggests a summary (`*standings*`, `*balance*`, - `*summary*`, `*snapshot*`, `*ledger*`), tables that are a coarser-grained - aggregation of another table, or tables documented as authoritative in provided - docs/wiki. Surface them as such in the semantic layer. - -2. **Represent the metric as an SL measure backed by the authoritative table.** - Where a canonical source exists, define the measure over it so a query for that - metric resolves to the authoritative source by default. (The analytics skill - already prefers SL measures over raw SQL — spec 07/skill rule — so this plugs - into existing behavior.) - -3. **Keep raw re-derivation available** as a non-default alternative; the measure - documents which source it uses and why, so the choice is transparent and - overridable. 
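The name-based part of requirement 1 could be sketched roughly as follows. The helper `flag_authoritative_candidates` and the `SUMMARY_PATTERNS` tuple are hypothetical names; the glob patterns are copied from the examples in requirement 1. This is not ktx's ingest code, and a real detector would also need the grain and documentation signals that the requirement lists.

```python
from fnmatch import fnmatch

# Hypothetical pattern list, copied from the examples in requirement 1.
SUMMARY_PATTERNS = ("*standings*", "*balance*", "*summary*", "*snapshot*", "*ledger*")


def flag_authoritative_candidates(table_names):
    """Map each table whose name suggests a summary to the pattern that matched.

    A name match only makes a table a candidate; it is a heuristic signal,
    not proof that the table is authoritative.
    """
    candidates = {}
    for name in table_names:
        lowered = name.lower()
        for pattern in SUMMARY_PATTERNS:
            if fnmatch(lowered, pattern):
                candidates[name] = pattern  # keep the signal so the choice is traceable
                break
    return candidates


# Synthetic table names only.
tables = ["race_results", "season_standings", "txn_lines", "Account_Balance_Snapshot"]
flagged = flag_authoritative_candidates(tables)
print(flagged)
```

Recording which pattern matched keeps the flag explainable, in line with requirement 3's ask that the source choice be transparent and overridable.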
- -## Fairness boundary (HARD — this spec is fairness-sensitive) - -The choice of authoritative source MUST be driven by **schema/structure or provided -documentation** — the table exists, is structured as a summary, or is documented as -authoritative. It must **NEVER** be driven by observing which interpretation matches -a benchmark gold answer. Concretely: - -- ✅ Fair: "a table named/structured as official standings exists and aggregates the - raw results → treat it as the canonical points source." -- ❌ Forbidden: "for question X, use table T because that's what reproduces the gold - result." That is per-instance gold-tuning (cheating) and must not appear in ktx, - the ingest heuristics, or any mapping. - -If a metric is genuinely underspecified and only the gold answer disambiguates the -intended source, it is **not fairly fixable** — leave it. Whether this feature helps -any specific benchmark instance is therefore *conditional* on a real schema/doc basis -existing; do not manufacture one. - -## Leak-safety (hard constraint) - -No benchmark table names, queries, gold values, or instance-specific mappings -anywhere in the spec, the heuristics, or tests. Examples must be synthetic/generic. - -## Acceptance criteria - -- Ingest can flag candidate authoritative/summary tables via generic heuristics - (name/role/aggregation/doc signals), with no benchmark-specific rules. -- The semantic layer can express a measure as backed by a designated authoritative - source; the skill resolves the metric to it by default; raw re-derivation remains - available and the choice is documented. -- Tests use synthetic schemas only; no gold-derived mappings exist anywhere. - -## Benchmark context (motivation only) - -Some SQLite-subset metric questions are underspecified between a raw-derivation and -an authoritative-table interpretation (e.g. season points from raw results vs an -official standings table). 
This is the roadmap's "canonical semantic-layer measures -from schema + provided docs" item. It is fair ONLY where schema/docs support one -source; the gold-only cases are explicitly out of scope (fixing them would require -tuning to gold). Larger than the spec 09–12 skill-content tweaks: this touches -ingest + the semantic-layer model. diff --git a/spider2-specs/todo/17-lifecycle-event-metrics.md b/spider2-specs/todo/17-lifecycle-event-metrics.md deleted file mode 100644 index 7b8a6e2b..00000000 --- a/spider2-specs/todo/17-lifecycle-event-metrics.md +++ /dev/null @@ -1,57 +0,0 @@ -# 17 — Lifecycle-event metrics in the semantic layer - -**Status:** draft (intake). Requirement-level; the implementer refines into `specs/17-*.md`. - -## Problem / requirement - -Many entities carry **several lifecycle timestamps** for the same record — an order has -`placed/purchased`, `approved`, `shipped/carrier-handoff`, `delivered`, and `estimated-delivery` -times; a ticket has `opened`, `assigned`, `resolved`, `closed`; a payment has `initiated`, -`authorized`, `settled`. When an analyst asks for a count/volume/rate of records **in a named -completed state, by period** ("delivered orders by month", "resolved tickets per week", "settled -payments by day"), the correct time anchor is the timestamp of *that named event*, not the -record-creation timestamp. - -Today ktx ingests these timestamps as **peer date dimensions** with good column descriptions, but it -does **not model the lifecycle event itself** — so nothing in the semantic layer tells a solver (or a -human) that "delivered orders over time" should be anchored to the delivery timestamp. The choice is -left to per-query reasoning, which is exactly where it goes wrong. (A companion analytics-skill rule -now nudges the *solver* — ktx commit `226341cf` — but the durable, reusable home for this is the -**model**, so any consumer of the semantic layer gets it for free.) 
- -**Requirement:** during enrichment/ingestion, when a source has a state/status column plus one or more -lifecycle timestamps whose names/descriptions map to that state's values, infer **lifecycle-event -metrics** — e.g. a `delivered_orders` metric defined as `COUNT(*)` filtered to the delivered state with -its **default time dimension** set to the matching event timestamp (`order_delivered_customer_date`), -distinct from the creation-anchored `orders` metric. Keep the inference conservative and -source-traceable (column names + enriched descriptions only); never invent a state/timestamp pairing -that the schema/descriptions don't independently support. - -## Sketch (implementer to refine) - -- Detect (state column, lifecycle-timestamp) pairs from column names + enrichment descriptions - (e.g. status value `delivered` ↔ `*_delivered_*_date`; `resolved` ↔ `resolved_at`). -- Emit a metric per detected completed state: filter = the state predicate, grain = record, - `defaultTimeDimension` = the matching event timestamp. -- Surface these via `discover_data` / `entity_details` so "delivered orders over time" retrieves the - delivery-anchored metric rather than a bare row count over the creation date. -- Gate behind the existing `enrichment.mode: llm` path; respect the conservative-inference bar - (precision over recall — a wrong pairing is worse than none). - -## Generic use case (independent of the benchmark) - -Any operational/transactional schema (e-commerce orders, support tickets, payments, claims, shipments) -has this multi-timestamp lifecycle shape. An analyst asking "how many X were last -month" almost always means *entered that state* last month. Encoding the event→timestamp mapping in the -model makes every downstream question (BI tool, ad-hoc SQL, an LLM agent) pick the right anchor without -re-deriving it, and prevents the silent "grouped by when they started" error. 
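A minimal sketch of the anchor choice, using Python's `sqlite3` and synthetic rows. The timestamp column names follow this spec's order example; the table name `orders`, the `order_status` column, and the helper `delivered_by_month` are hypothetical. It only illustrates why the default time dimension matters, not how ktx would emit the metric.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    order_status TEXT,                      -- hypothetical state column
    order_purchase_timestamp TEXT,          -- creation anchor
    order_delivered_customer_date TEXT      -- event anchor for 'delivered'
);
INSERT INTO orders VALUES
    (1, 'delivered', '2024-01-28 10:00:00', '2024-02-03 09:00:00'),
    (2, 'delivered', '2024-01-30 12:00:00', '2024-02-05 15:00:00'),
    (3, 'delivered', '2024-02-10 08:00:00', '2024-02-14 11:00:00'),
    (4, 'shipped',   '2024-02-20 08:00:00', NULL);
""")


def delivered_by_month(anchor_column):
    # anchor_column is only ever one of the two fixed names used below,
    # never user input, so interpolating it into the SQL is safe here.
    sql = f"""
        SELECT strftime('%Y-%m', {anchor_column}) AS month,
               COUNT(*) AS delivered_orders
        FROM orders
        WHERE order_status = 'delivered'
        GROUP BY month
        ORDER BY month
    """
    return conn.execute(sql).fetchall()


by_purchase = delivered_by_month("order_purchase_timestamp")
by_delivery = delivered_by_month("order_delivered_customer_date")
print("anchored on purchase:", by_purchase)
print("anchored on delivery:", by_delivery)
```

With these rows, the purchase-anchored query reports two delivered orders in 2024-01 and one in 2024-02, while the delivery-anchored query reports all three in 2024-02. The filter is identical in both; only the time dimension differs.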
- -## Benchmark context (motivation only — not a benchmark-specific rule) - -Surfaced by the `spider2-autofix` loop, round r1: Spider 2.0-Lite `Brazilian_E_Commerce` cases local028 -("delivered orders for each month") and local031 ("highest monthly delivered orders volume") both failed -because the solver bucketed delivered orders by `order_purchase_timestamp` instead of -`order_delivered_customer_date`. The trace showed the solver had both columns and even compared both -date bases for local031 before choosing purchase. A skill-text rule flipped both cases this round; this -spec is the **model-layer** form of the same fix, which would make the right anchor the default for any -solver and any lifecycle schema.