mirror of
https://github.com/Kaelio/ktx.git
synced 2026-07-01 08:59:39 +02:00
chore: remove private benchmark specs
parent 67a69dba8b
commit 1c5d16abc3
40 changed files with 0 additions and 8716 deletions
@ -1,62 +0,0 @@
# spider2-specs — feature specs driven by the Spider 2.0-Lite benchmark

This directory is the handoff point between two agents working on different
sides of the same goal: making Claude Code + ktx score well on the Spider
2.0-Lite benchmark **without benchmark-specific instructions** — the agent
should succeed using only what ktx provides (skills, semantic layer, wiki).

## Mechanics

Three directories form a pipeline. A feature flows `todo/` → `specs/` →
(implemented), and only its intake draft moves to `done/`:

- **`todo/`** — intake drafts. A **playground agent** (works in
  `/Users/andrey/projects/kaelio/spider-clean-submission/playground`, runs the
  benchmark, identifies ktx capability gaps) writes a draft spec here when it
  finds a gap.
- **`specs/`** — refined specs. A **refinement pass** (brainstorming) takes a
  `todo/` draft and produces a proper, implementation-ready spec at
  `specs/<same-filename>.md`: sharpened requirements, resolved ambiguities,
  acceptance criteria, and orientation hints. The refined spec is the **durable
  artifact** the implementer builds from — it stays in `specs/` permanently and
  never moves.
- **`done/`** — intake drafts whose feature has shipped (see below).

The **ktx worktree agent** (started from a ktx repo worktree, e.g.
`/Users/andrey/conductor/workspaces/ktx/tallinn-v2`) implements from the
refined spec in `specs/` (falling back to the `todo/` draft only if no refined
spec exists yet). When the feature is implemented it:

1. appends a short **"Implementation notes"** section to the refined spec in
   `specs/` (what was built, where, any deviations); and
2. **moves the original intake draft from `todo/` to `done/`.**

Location is status: `todo/` = draft awaiting implementation, `done/` = draft
whose feature shipped, `specs/` = refined specs (permanent home, do not move).
A draft and its refined spec share the same filename so they correspond
(`todo/01-foo.md` ↔ `specs/01-foo.md` ↔ `done/01-foo.md`). No other tracking.

## Rules for specs

1. **Generic, not benchmark-overfit.** ktx is a general-purpose product; the
   benchmark only surfaces the need. Every spec must state a real-world use
   case independent of Spider 2.0-Lite. If a requirement only makes sense for
   the benchmark, it doesn't belong in ktx.
2. Specs are **requirement-level**, not implementation plans. Code pointers in
   specs are orientation hints from exploration (line numbers may have
   drifted); the implementer owns the design.
3. One spec per file, kebab-case, numeric prefix = suggested priority order.
   A refined spec in `specs/` keeps the same filename as its `todo/` draft.

## For the implementer

- After implementing, rebuild and re-link the dev binary so the playground
  picks it up: `pnpm run build && pnpm run link:dev` (provides `ktx-dev`).
- Add/extend tests in the ktx test suites; specs list acceptance criteria to
  cover.
- Build from the refined spec in `specs/`. On completion, append
  "Implementation notes" to that spec (it stays in `specs/`) and move the
  intake draft from `todo/` to `done/`.
- If a spec turns out to be wrong or already satisfied, don't silently drop
  it — record why in the refined spec's notes and move the draft to `done/`
  explaining why no change was needed.

@ -1,74 +0,0 @@
# Connection-scoped wiki pages

## Problem

Wiki pages have only two scopes today: `GLOBAL` and `USER`
(`packages/cli/src/context/wiki/types.ts`, frontmatter schema ~lines 14-29).
There is no way to associate a page with a connection. In a project with many
connections, all pages share one search index, so `wiki_search` for a generic
term ("orders", "revenue", "average order value") surfaces pages about the
wrong database. Concept names collide across databases constantly in
real-world multi-connection projects (several databases each with `orders`,
`customers`, etc.).

Today, when `memory_ingest` is called with a `connectionId`, that id is only
used to scope which semantic-layer sources the triage agent can see
(`memory-agent.service.ts` ~46-72, ~107-109); it is **not** persisted on the
resulting wiki page in any form.

## Generic use case

Any org with multiple databases/warehouses in one ktx project: org-wide
definitions ("fiscal year starts in February") should be visible everywhere,
while database-specific conventions ("in the events DB, `user_id` is the
anonymous device id, not the account id") should not pollute searches about
other databases.

## Requirements

1. **Frontmatter field.** Add an optional `connections:` field to wiki page
   frontmatter — a list of connection ids (accept a single string too,
   normalize to list).
   - **Absent or empty ⇒ unscoped: the page applies to all connections.**
     This is exactly today's behavior, so every existing page is unaffected
     (backward compatible by construction).
2. **Search filtering.** `wiki_search` (MCP tool, `context-tools.ts` ~46-64)
   and `ktx wiki search` / `ktx wiki list` (CLI,
   `knowledge-commands.ts`) accept an optional `connectionId`:
   - With `connectionId: X` ⇒ return pages scoped to X **∪** unscoped pages.
   - Without ⇒ current behavior, all pages.
   - The filter must apply to **all three search lanes** (lexical FTS5,
     semantic/embedding, token fallback) in
     `local-knowledge.ts` / `sqlite-knowledge-index.ts` — not as a post-filter
     that eats into the result limit unevenly.
3. **Index.** Persist the scoping in the `.ktx/db.sqlite` knowledge index
   (the index is already re-synced from files on every search,
   `local-knowledge.ts` ~286-310, so a schema addition + sync is sufficient).
4. **Write path.** The memory agent's wiki-write tool accepts the connections
   field; when `memory_ingest` is invoked with a `connectionId`, the agent
   should default new database-specific pages to that connection, while still
   being allowed to write unscoped pages for clearly org-wide content (prompt
   guidance, not a hard rule).
5. **`wiki_read` and refs are unchanged** — pages remain addressable by key
   regardless of scoping; `connections` is a search/relevance concern only.
6. **Validation.** Warn (don't fail) when a page references a connection id
   not present in `ktx.yaml` — config and content can evolve independently.
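
The normalization in requirement 1 and the "scoped to X ∪ unscoped" filter in requirement 2 can be sketched as follows. This is a minimal illustration, not ktx code: the `pages` / `page_connections` tables, the `normalize_connections` and `search` helpers, and the `LIKE` match standing in for the real search lanes are all assumptions made for the example. The point it shows is that the scope predicate sits inside the query, so `LIMIT` only counts eligible pages.

```python
import sqlite3

def normalize_connections(value):
    """Accept None, a single id, or a list of ids; always return a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)

def search(db, term, connection_id=None, limit=10):
    """Apply the scope filter inside the query, before LIMIT (hypothetical schema)."""
    sql = "SELECT p.key FROM pages p WHERE p.body LIKE ?"
    params = [f"%{term}%"]
    if connection_id is not None:
        # Keep unscoped pages (no rows in page_connections) plus pages scoped to X.
        sql += (
            " AND (NOT EXISTS (SELECT 1 FROM page_connections c"
            "                  WHERE c.page_key = p.key)"
            "      OR EXISTS (SELECT 1 FROM page_connections c"
            "                 WHERE c.page_key = p.key AND c.connection_id = ?))"
        )
        params.append(connection_id)
    sql += " ORDER BY p.key LIMIT ?"
    params.append(limit)
    return [row[0] for row in db.execute(sql, params)]

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE pages (key TEXT PRIMARY KEY, body TEXT);
    CREATE TABLE page_connections (page_key TEXT, connection_id TEXT);
""")
pages = {
    "fiscal-year": (None, "orders follow the fiscal year"),   # unscoped
    "orders-a": ("db_a", "orders in db_a"),                   # single string
    "orders-b": (["db_b"], "orders in db_b"),                 # list form
}
for key, (connections, body) in pages.items():
    db.execute("INSERT INTO pages VALUES (?, ?)", (key, body))
    for connection_id in normalize_connections(connections):
        db.execute("INSERT INTO page_connections VALUES (?, ?)", (key, connection_id))

print(search(db, "orders", "db_b", limit=2))  # ['fiscal-year', 'orders-b']
```

With `limit=2`, a post-filter over the unfiltered top two (`fiscal-year`, `orders-a`) would have returned a single page for `db_b`; filtering in the query returns both eligible pages.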

## Acceptance criteria

- A page with `connections: [db_a]` is returned by
  `wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but
  **not** by `wiki_search(query, connectionId: "db_b")`.
- A page with no `connections` field is returned in all three cases above.
- Existing projects with no scoped pages behave identically before/after.
- Filtering works in each lane independently (test with embeddings disabled
  to exercise lexical/token lanes alone).
- `memory_ingest(content, connectionId)` produces a page scoped to that
  connection for database-specific content.

## Benchmark context (motivation only)

Spider 2.0-Lite local subset = one project with 30 SQLite connections whose
schemas share table/concept names (Northwind, sakila, two e-commerce DBs…).
External-knowledge docs (RFM definition, F1 overtake rules) are each relevant
to exactly one database and must not surface for the other 29.

@ -1,71 +0,0 @@
# Verbatim ingest mode for authoritative documents

## Problem

`ktx ingest --text/--file` routes content through the memory agent
(`text-ingest.ts` ~246-357 → `memory-agent.service.ts`), an LLM triage loop
(30-step budget for `external_ingest`, content clipped at ~48k chars,
`memory-agent.service.ts` ~165) that may rewrite, condense, or split the
content before writing wiki pages.

For *authoritative* documents — formula definitions, specs, runbooks,
compliance text — paraphrasing is a bug, not a feature:

- exact thresholds, constants, and rule wording must survive byte-for-byte;
- lexical (BM25) search works best when the stored text matches the phrasing
  users/agents will query with;
- ingestion should be deterministic and reproducible — same input file, same
  resulting page.

## Generic use case

Any team ingesting documents that are already the source of truth: metric
definition sheets, SLA documents, calculation methodology docs, regulatory
text. The user wants ktx to *index and surface* the document, not to
re-author it.

## Requirements

1. **Flag.** `ktx ingest --file <path> --verbatim` (apply to `--text` too).
   Composes with the existing optional `--connection <id>` so the resulting
   page can be connection-scoped (see spec 01).
2. **Body preservation is enforced by code, not by prompt.** The stored page
   body must be the input content byte-for-byte. The LLM is used **only** to
   generate metadata: `summary`, `tags`, `sl_refs`, suggested page key/slug
   (and `connections` default from the flag). Implementation freedom: a
   single constrained LLM call is fine — the full memory-agent loop is not
   required for this mode.
3. **No clipping of the stored body.** The ~48k clip may apply to what is
   *sent to the LLM* for metadata generation, never to what is *written* to
   the wiki page.
4. **Existing frontmatter.** If the input file already has YAML frontmatter,
   preserve user-provided fields and only fill gaps (don't overwrite an
   explicit `summary` with a generated one).
5. **Key collisions.** Deterministic, non-destructive behavior: error or
   suffix — never silently overwrite an existing page.
6. **Degraded mode.** With `llm.provider.backend: none`, `--verbatim` should
   still work, deriving `summary` from the first heading/sentence and leaving
   optional metadata empty. (Regular agent ingest can't do this; verbatim
   mode can and should.)
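
A rough sketch of how requirements 2, 4, 5, and 6 could fit together without any LLM call. Everything here is hypothetical (the `derive_summary`, `fill_gaps`, `build_page`, and `write_page` helpers, the dict-based page shape, and the in-memory `store`); it only illustrates the intended behavior: the body passes through untouched, user-provided metadata wins over generated values, and an existing key is never overwritten.

```python
import hashlib

def derive_summary(body):
    """Degraded-mode summary: first Markdown heading, else the first sentence."""
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            return stripped.lstrip("#").strip()
    text = " ".join(body.split())
    first, _, _ = text.partition(". ")
    return first.rstrip(".")

def fill_gaps(user_meta, generated):
    """User-provided fields win; generated values only fill missing ones."""
    merged = dict(generated)
    merged.update({k: v for k, v in user_meta.items() if v not in (None, "", [])})
    return merged

def build_page(body, user_meta, connection=None):
    generated = {"summary": derive_summary(body), "tags": []}
    if connection is not None:
        generated["connections"] = [connection]
    # The body is stored as-is: no clipping, no rewriting.
    return {"meta": fill_gaps(user_meta, generated), "body": body}

def write_page(store, key, page):
    """Non-destructive write: fail loudly instead of overwriting."""
    if key in store:
        raise FileExistsError(f"wiki page already exists: {key}")
    store[key] = page

def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

source = "# RFM buckets\n\nRecency score 5 means fewer than 30 days.\n"
store = {}
write_page(store, "rfm-buckets", build_page(source, {"tags": ["rfm"]}, connection="db_a"))

# Byte-identity check in the spirit of the hash-based acceptance test.
assert digest(store["rfm-buckets"]["body"]) == digest(source)
```

Erroring on collision is only one of the two options requirement 5 allows; a suffixing strategy would satisfy it equally.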

## Acceptance criteria

- Ingesting a file with `--verbatim` produces a wiki page whose body is
  byte-identical to the input (assert with a hash in tests).
- Running the same ingest twice is idempotent or fails loudly on the second
  run (per requirement 5) — no duplicated/divergent pages.
- A >48k-char file is stored in full.
- `--verbatim --connection X` yields a page scoped to X (depends on spec 01;
  if 01 isn't implemented yet, the flag composition can land later).
- Generated metadata makes the page findable: `wiki_search` for a phrase
  from the document body returns it (lexical lane), and for a paraphrase of
  its topic returns it when embeddings are enabled (semantic lane).

## Benchmark context (motivation only)

Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket
definitions, haversine formula, F1 overtake rules…). Gold SQL was authored
against their exact text; an LLM paraphrase that drops a bucket boundary
loses a question. We currently work around this by hand-writing frontmatter
and copying files into `wiki/global/` — verbatim mode makes that a supported
ktx workflow instead of a manual step.

@ -1,63 +0,0 @@
# Schema scan must tolerate individual objects that fail introspection

> Priority: MEDIUM. Found during the first full Spider2-lite sqlite ingest
> (2026-06-13): one database (`oracle_sql`) failed to ingest **entirely**
> because a single broken VIEW errored during introspection, leaving that
> connection with no semantic layer at all.

## Problem

`ktx ingest <connection>` aborts the whole database's schema scan when one
table/view errors during introspection/profiling. In `oracle_sql` the view
`emp_hire_periods_with_name` is defined as
`SELECT ehp.start_date, ehp.end_date ... FROM emp_hire_periods ehp ...` but the
base table has no `start_date`/`end_date` columns — so any attempt to read it
raises `no such column: ehp.start_date`. That single broken object failed the
ingest of all ~48 healthy tables/views in the database.

A second, related symptom: setting `enabled_tables: [main.customers]` to work
around it produced a different hard failure (`Adapter "database schema" did not
recognize fetched source output`), so the documented allowlist escape hatch did
not provide a clean fallback either.

## Generic use case

Real databases routinely contain broken or inaccessible objects: views over
dropped/renamed columns, views referencing tables the connection role can't
read, permission-denied tables, or vendor system views that error. ktx should
ingest everything it *can* and skip what it can't — never let one bad object
zero out an entire connection's context. This is basic robustness for
production warehouses, not benchmark-specific.

## Requirements

1. **Per-object isolation.** If introspecting/profiling one table or view
   throws, skip that object, record a warning (object name + error), and
   continue scanning the rest. The connection's semantic layer is built from
   the objects that succeeded.
2. **Surface, don't hide.** Report skipped objects in the ingest summary and in
   `ktx status` (e.g. "oracle_sql: 1 object skipped — emp_hire_periods_with_name:
   no such column ehp.start_date"). Honor `failureMode` for whole-connection
   aborts, but a single bad object should not count as a connection failure.
3. **Views vs tables.** A broken view should never block base-table ingest.
   Consider profiling views defensively (they are read-only projections).
4. **Allowlist fallback should work.** `enabled_tables` should reliably restrict
   the scan to the listed objects (and the qualification format for sqlite must
   be documented and accepted). Fix the `did not recognize fetched source
   output` failure when the allowlist yields a small/edge-case set.
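
Per-object isolation (requirement 1) amounts to moving the error boundary from the whole scan to each object. The sketch below reproduces the `oracle_sql` failure shape in an in-memory SQLite database; the `scan_schema` helper and the `SELECT * ... LIMIT 1` probe are assumptions for illustration and do not reflect how ktx actually introspects.

```python
import sqlite3

def scan_schema(db):
    """Probe every table/view; skip the ones that error instead of aborting."""
    scanned, skipped = {}, []
    objects = db.execute(
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    ).fetchall()
    for name, kind in objects:
        try:
            cursor = db.execute(f'SELECT * FROM "{name}" LIMIT 1')
            scanned[name] = [column[0] for column in cursor.description]
        except sqlite3.Error as exc:
            # Record object name + error, then keep scanning the rest.
            skipped.append((name, kind, str(exc)))
    return scanned, skipped

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE emp_hire_periods (emp_id INTEGER, title TEXT);
    CREATE TABLE customers (id INTEGER, name TEXT);
    -- SQLite accepts this view at creation time; it only fails when read.
    CREATE VIEW emp_hire_periods_with_name AS
        SELECT ehp.start_date, ehp.end_date FROM emp_hire_periods ehp;
""")

scanned, skipped = scan_schema(db)
for name, kind, error in skipped:
    print(f"skipped {kind} {name}: {error}")
```

The two healthy tables end up in `scanned`, and the broken view is reported in `skipped` rather than failing the run, which is the shape the ingest summary and `ktx status` would need to surface.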

## Acceptance criteria

- Ingesting a sqlite DB containing one broken view plus N healthy tables yields
  a semantic layer for the N healthy tables and a warning naming the broken view
  — exit is success (not "failed"), subject to `failureMode`.
- The skipped object is listed in the ingest summary and `ktx status`.
- `enabled_tables` restricted to a subset ingests exactly that subset without the
  adapter-output error.

## Benchmark context (motivation only)

`oracle_sql` (8 of the 135 sqlite questions) currently has no semantic layer
because of its one broken view; those questions must be solved from raw
`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning
would restore enriched context for that database.

@ -1,112 +0,0 @@
# Add universal SQL-authoring craft to the ktx-analytics skill

> Priority: HIGH. The `ktx-analytics` skill currently tells the agent *which
> ktx tools to call and in what order*, but gives almost no guidance on
> *writing correct SQL*. In benchmark runs the agent reliably produced
> runnable SQL (0 execution errors) yet failed on correctness — precision,
> determinism, type mismatches, and answer completeness. These are universal
> analytics-engineering truths that every ktx user benefits from, so they
> belong in the shipped skill, not in any caller's prompt.

## Scope guard (read first)

Only **universally-true** SQL/analytics craft goes here — guidance that helps a
real ktx user querying a **live** database. The test for inclusion: *"Would this
advice be correct and useful for an analyst on a current, production database?"*

**Dialect-specific syntax is out of scope here.** The v9 harnesses' only
per-dialect content (Snowflake: `DB.SCHEMA.TABLE` FQTNs, double-quoted
lowercase cols, VARIANT colon-paths; BigQuery: backtick FQTNs, `_TABLE_SUFFIX`
for sharded tables; sqlite: `strftime`/`julianday`) is genuinely useful but
belongs in a **dialect-aware** location (per-driver notes), not this flat
skill. Track separately as a follow-up; the rules below must stay
dialect-agnostic.

Explicitly **do NOT** add (these are application/consumer concerns, not skill
concerns, and some are actively wrong for live data):

- Output-format contracts ("return a bare result set with exactly these
  columns, no prose"). The skill is for interactive analysis and already
  favors readable tables + summaries; a caller that needs a strict result
  shape specifies that itself.
- Anchoring relative time ("recent", "past N months") to `MAX(date)` of the
  data. On a live database "recent" means relative to *now*; this is only true
  for static snapshots and must not be baked into the product.
- Anything justified by a grader/scoring comparator.

## File

`packages/cli/src/skills/analytics/SKILL.md` (the shipped skill;
`setup-agents.ts` installs it into agent environments — the copy under a
project's `.claude/skills/` is regenerated from this source). Extend the
existing `<rules>` block and step 5 ("Query") / step 6 ("Validate and
explain"); keep the existing interactive guidance intact.

## Requirements — add these as general rules (behavior only, no rationale that references answers/graders)

**Schema discovery before writing SQL**
1. Inspect representative sample rows of each table before composing SQL —
   confirm date/time encoding (e.g. `YYYYMMDD` vs ISO vs epoch), null
   prevalence in join/filter keys, and the actual set of categorical/enum
   values. (`entity_details` + a small `sql_execution` sample.)
2. Cast a column to its real type before comparing it in `WHERE`/`JOIN`. A
   string column compared against a numeric literal (or vice versa) can
   silently match nothing.

**Composition discipline**
3. Build complex queries incrementally — one CTE at a time, verifying each
   layer's output on a small sample before stacking the next.
4. Avoid joins that fan out row counts. Add columns only from tables already
   required by the grain, or pre-aggregate to the target grain before joining.

**Window-function correctness**
5. Give every ranking/ordering window function a complete, deterministic
   tie-breaker (append unique key columns), so `RANK`/`ROW_NUMBER`/`LAG`
   results are stable rather than flickering across runs.
6. Apply row filters **after** window functions for sequence / "first" /
   "most recent" / "since" questions — compute over the full partition, then
   filter.
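
Rules 5 and 6 are easy to get subtly wrong, so a small worked case may help. The sketch below uses an in-memory SQLite database purely as a convenient test bed (the rules themselves are dialect-agnostic); the `orders` table and its rows are invented for the example. The question is "which orders placed since 2024-02-01 were the customer's first order?".

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id TEXT, ordered_on TEXT);
    INSERT INTO orders VALUES
        (1, 'u1', '2024-01-10'),
        (2, 'u1', '2024-02-05'),
        (3, 'u2', '2024-02-07'),
        (4, 'u2', '2024-02-07');  -- same day as order 3: a tie on ordered_on
""")

# Rank over the full partition, break ties with the unique key, filter afterwards.
filter_after = [row[0] for row in db.execute("""
    WITH ranked AS (
        SELECT order_id, ordered_on,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY ordered_on, order_id   -- order_id makes the order total
               ) AS rn
        FROM orders
    )
    SELECT order_id FROM ranked
    WHERE rn = 1 AND ordered_on >= '2024-02-01'
    ORDER BY order_id
""")]

# Filtering first changes the question to "first order since February".
filter_before = [row[0] for row in db.execute("""
    WITH ranked AS (
        SELECT order_id,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY ordered_on, order_id
               ) AS rn
        FROM orders
        WHERE ordered_on >= '2024-02-01'
    )
    SELECT order_id FROM ranked WHERE rn = 1 ORDER BY order_id
""")]

print(filter_after)   # [3]
print(filter_before)  # [2, 3]
```

Order 2 is not `u1`'s first order, yet the filter-first query reports it. Without `order_id` in the `ORDER BY`, either of `u2`'s two same-day orders could legitimately be numbered 1.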

**Numeric precision**
7. Compute at full precision; round only in the final projection, never inside
   intermediate CTEs.
8. Be explicit about truncation (`CAST AS INT` truncates; use explicit
   rounding when rounding is intended).
9. Distinguish "average of per-group averages" (macro: `AVG(group_metric)`)
   from "overall/weighted average" (micro: `SUM(num)/SUM(den)`) based on the
   question's wording.
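
The three precision rules can be checked with tiny numbers. This is an illustrative sketch on an in-memory SQLite database; the `parts` and `conversions` tables and their values are made up, and other engines may differ in the exact type returned by `ROUND`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE parts (x REAL);
    INSERT INTO parts VALUES (0.4), (0.4), (0.4);

    CREATE TABLE conversions (region TEXT, num INTEGER, den INTEGER);
    INSERT INTO conversions VALUES ('A', 1, 2), ('B', 4, 5), ('B', 5, 5);
""")

# Rule 7: rounding each row before summing loses the whole total.
early, late = db.execute(
    "SELECT SUM(ROUND(x)), ROUND(SUM(x)) FROM parts"
).fetchone()

# Rule 8: CAST truncates toward zero, ROUND rounds.
truncated, rounded = db.execute(
    "SELECT CAST(2.9 AS INTEGER), ROUND(2.9)"
).fetchone()

# Rule 9: macro average (mean of per-region rates) vs micro average (pooled rate).
macro, micro = db.execute("""
    WITH per_region AS (
        SELECT region, 1.0 * SUM(num) / SUM(den) AS rate
        FROM conversions
        GROUP BY region
    )
    SELECT (SELECT AVG(rate) FROM per_region),
           (SELECT 1.0 * SUM(num) / SUM(den) FROM conversions)
""").fetchone()

print(early, late)          # 0.0 1.0
print(truncated, rounded)   # 2 3.0
print(round(macro, 4), round(micro, 4))  # 0.7 0.8333
```

Region A converts 1 of 2 and region B converts 9 of 10, so the macro average is (0.5 + 0.9) / 2 while the micro average is 10 / 12; which one is right depends entirely on how the question is worded.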

**Answer completeness / interpretation**
10. "top / highest / most / lowest" → return only the winning row(s) (e.g.
    `RANK() = 1` / `QUALIFY`), not the full ranked list, unless a list is asked
    for.
11. "for each X / per X / by X" → exactly one row per X; don't collapse to a
    single value unless the question says "overall" or "total across X".
12. When a question asks for inputs and a derived value ("X, Y, and their
    ratio"), include the inputs as columns alongside the derived value.
13. When grouping by a human-readable label (a name), also expose the entity's
    identifier — identity, not just the label, is part of the result.
14. When a result is unexpectedly empty, relax filters one at a time to find
    which predicate removed the rows.
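
Rules 10 and 13 interact: "top customer" should return only the winners, and grouping by a name alone can merge two different people. A small sketch with invented `customers` and `orders` tables, again on in-memory SQLite only because it is easy to run:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alex'), (2, 'Alex'), (3, 'Sam');
    INSERT INTO orders VALUES (1, 50), (2, 70), (3, 70);
""")

# Group by identity (id) and carry the label; keep every row that ties for first.
top_by_identity = db.execute("""
    WITH totals AS (
        SELECT c.id, c.name, SUM(o.amount) AS total
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id, c.name
    ),
    ranked AS (
        SELECT id, name, total, RANK() OVER (ORDER BY total DESC) AS rnk
        FROM totals
    )
    SELECT id, name, total FROM ranked WHERE rnk = 1 ORDER BY id
""").fetchall()

# Grouping by the label alone merges the two customers named Alex.
top_by_label = db.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
    LIMIT 1
""").fetchall()

print(top_by_identity)  # [(2, 'Alex', 70), (3, 'Sam', 70)]
print(top_by_label)     # [('Alex', 120)]
```

The label-only query names a "top customer" who does not exist, and `LIMIT 1` would also have hidden the tie even with correct grouping.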

## Acceptance criteria

- The shipped `analytics/SKILL.md` contains the rules above, phrased as general
  truths with **no reference to any benchmark, gold answer, or scoring
  comparator**.
- Existing interactive guidance (compact result tables, summaries,
  clarification prompts, the tool-order workflow) is preserved — the skill must
  still read well for an interactive human-facing analysis session.
- None of the excluded items (output-shape contract, `MAX(date)` anchoring,
  grader-driven advice) appear.
- Skill stays within a reasonable size; group the new rules under clear
  sub-headings so they're scannable.

## Benchmark context (motivation only)

On the Spider 2.0-Lite sqlite subset, the solver produced 0 execution errors
but ~50 result mismatches; a large share traced to exactly these gaps
(premature rounding, string-vs-number compares, non-deterministic window
ordering, returning full lists for "top" questions, dropping inputs to derived
values). These are generic SQL-authoring defects — fixing them in the skill
improves ktx for everyone and, as a side effect, the benchmark.

@ -1,83 +0,0 @@
# Per-dialect SQL syntax notes (dialect-aware, scoped to the connection)

> Intake draft. Companion to `specs/07-analytics-skill-sql-craft.md`, which kept
> the analytics SQL craft dialect-agnostic and explicitly deferred per-dialect
> syntax here.

## Problem

Spec 07 deliberately keeps the analytics SQL-authoring craft
**dialect-agnostic** — every rule must read correctly on any engine. But a lot of
*real* correctness depends on dialect-specific syntax that spec 07 excludes and
defers to this follow-up:

- **Snowflake:** `DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase identifiers,
  VARIANT colon-paths.
- **BigQuery:** backtick FQTNs, `_TABLE_SUFFIX` for sharded tables, `QUALIFY`.
- **sqlite:** `strftime`/`julianday` for dates, no `QUALIFY`.

This guidance is genuinely useful to an agent writing SQL against a live
database, but it must **not** pollute the flat dialect-agnostic skill — an agent
querying sqlite should never see Snowflake VARIANT syntax. It belongs in a
**dialect-aware** location, surfaced only for the dialect the active connection
actually uses.

## Generic use case

Any ktx project whose connections span more than one warehouse engine (e.g. a
Snowflake warehouse + a BigQuery export + a local sqlite extract). When the agent
writes SQL for a given connection, it should get that engine's syntax
conventions — and nothing for the engines it isn't querying.

## Requirements

1. **Per-driver dialect notes.** Author concise, correct syntax notes per
   supported driver: FQTN form, identifier quoting/case, date/time functions,
   top-N / window-filtering idiom, semi-structured access. These are genuine
   per-engine invariants, so enumerating them per driver is acceptable (unlike a
   denylist of bad specifics).
2. **Scope to the active dialect, derived from state.** Which notes the agent
   sees must be selected from the connection's configured driver/dialect
   (`ktx.yaml` connections / the connector registry), not guessed and not shown
   all at once. The flat analytics skill stays dialect-agnostic (spec 07
   invariant preserved).
3. **Delivery mechanism (enabling sub-requirement).** The shipped skill is
   installed as a **single `SKILL.md`** per target (`setup-agents.ts` /
   `readAnalyticsSkillContent`). Surfacing per-dialect notes on demand needs one
   of two approaches; the refinement pass should compare them before committing:
   - **Multi-file skill delivery** — bundle `reference/<dialect>.md` files and
     have the skill point to the one matching the connection. Requires extending
     `setup-agents.ts` to copy a skill *directory* (Claude Code, Codex, universal
     `.agents`) and a multi-file zip (Claude Desktop), a **flatten/concatenate
     transform** for the single-file targets (Cursor `.mdc`, OpenCode `.md`), and
     **per-file manifest entries** for clean uninstall. This is the
     install-mechanism improvement spec 07's Model section flags as future work.
   - **Dynamic MCP delivery** — an MCP surface returns the dialect hints for a
     given `connectionId` (the MCP layer already resolves the connection's
     dialect), so no install change is needed and Cursor/OpenCode get identical
     behavior. May be the lower-cost, more uniform path; weigh it first.
4. **No dialect syntax leaks into the dialect-agnostic skill.** Spec 07's
   acceptance criterion (no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in
   `analytics/SKILL.md`) stays green. This work adds a *separate* dialect-aware
   channel; it does not amend the flat skill.
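
Requirement 2 (select notes from the configured driver, never guess, never show everything) reduces to a lookup keyed by the connection's driver. The sketch below is hypothetical throughout: the `DIALECT_NOTES` table, the `notes_for_connection` helper, and the `connections -> id -> driver` config shape are assumptions and may not match how `ktx.yaml` or the connector registry is actually structured. The note text is taken from the per-dialect list in this spec.

```python
# Hypothetical per-driver notes, paraphrasing the dialect list in this spec.
DIALECT_NOTES = {
    "snowflake": [
        "Fully qualify tables as DB.SCHEMA.TABLE.",
        "Double-quote lowercase identifiers.",
        "Use colon-paths to read VARIANT fields.",
    ],
    "bigquery": [
        "Wrap fully qualified table names in backticks.",
        "Filter sharded tables with _TABLE_SUFFIX.",
        "QUALIFY is available for filtering on window results.",
    ],
    "sqlite": [
        "Use strftime/julianday for date handling.",
        "QUALIFY is not available; filter window results in an outer query.",
    ],
}

def notes_for_connection(config, connection_id):
    """Return notes for the connection's configured driver only."""
    driver = config["connections"][connection_id]["driver"]  # assumed config shape
    if driver not in DIALECT_NOTES:
        # Refuse to guess: an unknown driver gets no notes rather than the wrong ones.
        raise ValueError(f"no dialect notes for driver {driver!r}")
    return DIALECT_NOTES[driver]

config = {
    "connections": {
        "warehouse": {"driver": "snowflake"},
        "local_extract": {"driver": "sqlite"},
    }
}

for note in notes_for_connection(config, "local_extract"):
    print(note)
```

The same lookup could back either delivery mechanism in requirement 3: choosing which `reference/<dialect>.md` to point at, or building the response of an MCP surface keyed by `connectionId`.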

## Acceptance criteria

- An agent querying a sqlite connection gets sqlite date idioms and never sees
  Snowflake/BigQuery-only syntax; an agent querying Snowflake gets
  FQTN/identifier/VARIANT guidance.
- The dialect shown is **derived from the connection's configured driver**, not
  hardcoded per project and not guessed.
- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are
  unaffected.
- Whichever delivery mechanism is chosen installs/serves correctly across **all**
  supported agent targets, including the single-file Cursor/OpenCode shape.

## Benchmark context (motivation only)

The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake
(`DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase cols, VARIANT colon-paths),
BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite
(`strftime`/`julianday`). That content is real and useful but engine-specific;
spec 07 kept it out of the flat skill and deferred it here so the
dialect-agnostic rules stay clean.

@ -1,150 +0,0 @@
# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill

## Problem

The `ktx-analytics` skill already carries a fan-out rule (spec 07, rule 4:
*"Avoid fan-out joins — add columns only from tables already at the target
grain, or pre-aggregate to that grain before joining; a join that multiplies
rows quietly inflates every downstream `SUM`/`COUNT`"*). In practice the agent
honors it on a single join but still **silently fans out on multi-hop join
chains**, where the inflation is one or two joins removed from the aggregate and
therefore much harder to notice.

The failure shape: a metric that lives at a *coarse* grain (e.g. one row per
parent record) is counted/summed *after* the parent has been joined down to a
*finer* grain (e.g. one row per child line). Every parent-level value is then
duplicated by its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by an
amount that depends on the data — runnable SQL, plausible-looking number,
quietly wrong.

The rule today is stated as a *prohibition* ("avoid"). It needs to be a
*detect-and-fix habit*: a concrete multi-hop example of the trap, and an active
verification step the agent runs while composing, not just an instruction to be
careful.

## Generic use case (independent of any benchmark)

An analyst on any production warehouse asks: *"How many orders are there per
region?"* where the path from region to the order's detail runs through several
hops (region → store → order → order line). The honest answer counts each order
once. If the query descends to the line-level table along the way (e.g. for a
filter), each order is counted once **per line on the order**, inflating the
per-region total. Attribution here is unambiguous — each order belongs to exactly
one store and thus one region — so the *only* thing that can go wrong is the row
multiplication, which is exactly what makes it a clean teaching case. This is one
of the most common silently-wrong analytics mistakes on normalized schemas — it
is not specific to any dataset, dialect, or benchmark.
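
The orders-per-region case above can be reproduced in a few rows. The sketch below is illustrative only: the `stores`, `orders`, and `order_lines` tables are invented, the region is folded into `stores` to keep the chain short, and SQLite is used simply because it runs in memory.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER);
    CREATE TABLE order_lines (order_id INTEGER, sku TEXT);
    INSERT INTO stores VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO order_lines VALUES
        (10, 'a'), (10, 'b'), (10, 'c'), (11, 'a'), (12, 'a'), (12, 'b');
""")

# Trap: the join reaches line grain, so COUNT(*) counts lines, not orders.
fanned_out = db.execute("""
    SELECT s.region, COUNT(*) AS orders
    FROM stores s
    JOIN orders o ON o.store_id = s.store_id
    JOIN order_lines l ON l.order_id = o.order_id
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()

# Fix: bring the line table back to order grain first, then join one-to-one.
pre_aggregated = db.execute("""
    WITH lines_per_order AS (
        SELECT order_id, COUNT(*) AS line_count
        FROM order_lines
        GROUP BY order_id
    )
    SELECT s.region, COUNT(*) AS orders
    FROM stores s
    JOIN orders o ON o.store_id = s.store_id
    JOIN lines_per_order lp ON lp.order_id = o.order_id
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()

print(fanned_out)      # [('north', 4), ('south', 2)]
print(pre_aggregated)  # [('north', 2), ('south', 1)]
```

Both queries run without error and both return plausible numbers; only the second counts each order once.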
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
This extends the existing `<sql_craft>` "Composition" guidance in the
|
|
||||||
`ktx-analytics` skill (spec 07). Additive only; keep it inline, dialect-agnostic,
|
|
||||||
and stated as a heuristic-plus-why (consistent with spec 07's style).
|
|
||||||
|
|
||||||
1. **Generalize the fan-out rule to multi-hop chains.** Make explicit that the
|
|
||||||
danger is *cumulative*: any one-to-many hop on the path between the table that
|
|
||||||
owns a measure and the aggregate inflates that measure, even when the
|
|
||||||
offending join is several hops away from the `SUM`/`COUNT`. The fix is the
|
|
||||||
same as the single-hop case — **pre-aggregate the measure to its own grain in
|
|
||||||
a CTE, then join the already-aggregated result** — but the agent must apply it
|
|
||||||
per measure-owning table along the whole chain, not just at the final join.
|
|
||||||
|
|
||||||
2. **Add a verification habit, not just a prohibition.** While composing, the
|
|
||||||
agent should confirm a join did not change the grain it intends to aggregate
|
|
||||||
at — e.g. check that the row count (or the count of the aggregate's key) is
|
|
||||||
unchanged across a join that is supposed to be one-to-one / many-to-one, and
|
|
||||||
pre-aggregate the finer table to that grain when it is one-to-many. This is the same
|
|
||||||
"build incrementally and check each layer" discipline spec 07 already endorses,
|
|
||||||
pointed specifically at grain preservation.
|
|
||||||
|
|
||||||
**Pre-aggregate is the general fix; `COUNT(DISTINCT)` is a count-only
|
|
||||||
shortcut.** Pre-aggregating the finer table to the measure's grain in a CTE and
|
|
||||||
then joining one-to-one is the remedy that works for every aggregate
|
|
||||||
(`COUNT`/`SUM`/`AVG`). `COUNT(DISTINCT <key>)` is a valid one-liner *for counts
|
|
||||||
only* — it must NOT be generalized to a fanned-out `SUM`/`AVG`, because two
|
|
||||||
rows can legitimately hold equal amounts and `DISTINCT` would wrongly collapse
|
|
||||||
them. State this trap explicitly; a naïve "just use `COUNT(DISTINCT)`" rule is
|
|
||||||
silently wrong for sums.
|
|
||||||
|
|
||||||
3. **One concrete, generic multi-hop example.** Include a short worked example
|
|
||||||
that shows the inflation and the fix. It must use an **invented, generic
|
|
||||||
schema** — **no benchmark table names, no benchmark SQL, and no benchmark
|
|
||||||
result values** (see "Leak-safety" below — hard constraint). The example must:
|
|
||||||
(a) use a **plain `COUNT`** (not an average) so it isolates the fan-out lesson
|
|
||||||
and does not entangle the skill's separate *macro-vs-micro average* rule; and
|
|
||||||
(b) use a chain with **unambiguous single-owner attribution** so the only thing
|
|
||||||
that can go wrong is row multiplication. The intended example is the chain
|
|
||||||
`regions → stores → orders → order_lines` answering *"how many orders per region
|
|
||||||
include at least one backordered line"* — each order belongs to exactly one
|
|
||||||
store and thus exactly one region, so attribution is clean; the line-level
|
|
||||||
filter gives `order_lines` a genuine reason to be joined (so the fix is the
|
|
||||||
pre-aggregate remedy, not "drop the join"), and that join sits **several hops
|
|
||||||
below** the region-level COUNT (the multi-hop point):
|
|
||||||
|
|
||||||
```sql
-- "How many orders per region include at least one backordered line?"
-- (order_lines is genuinely needed here, for the backordered filter, so the
-- fix is NOT "just drop the join".)
-- WRONG: the order_lines join is one row per matching line, joined several hops
-- BELOW the COUNT. An order with 3 backordered lines is counted 3 times, so the
-- per-region total is inflated by backordered-lines-per-order: silently wrong.
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered  -- one-to-many: fan-out
GROUP BY r.region_id;

-- RIGHT (general remedy): collapse the finer table to the measure's grain in a
-- CTE FIRST, then join one-to-one so nothing multiplies. This same shape works
-- for SUM/AVG, not just COUNT.
WITH qualifying_orders AS (  -- back to ONE row per order
  SELECT DISTINCT order_id FROM order_lines WHERE is_backordered
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN qualifying_orders q ON q.order_id = o.order_id
GROUP BY r.region_id;

-- Count-only shortcut: COUNT(DISTINCT o.order_id) over the WRONG query also works
-- HERE. But it is counts-only: a fanned-out SUM/AVG of a per-order measure (e.g.
-- summing each order's shipping_fee after joining lines) must pre-aggregate;
-- DISTINCT would wrongly merge two orders that happen to share the same fee.
```
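
A minimal runnable sketch of the grain check from requirement 2 and the `DISTINCT` trap, assuming Python's built-in `sqlite3` module and the invented `regions → stores → orders → order_lines` schema. The sample rows and fee values are made up for illustration. Comparing `COUNT(*)` with `COUNT(DISTINCT <grain key>)` after the join is one possible form of the check, not prescribed skill wording.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY);
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER, shipping_fee REAL);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER, is_backordered INTEGER);
INSERT INTO regions VALUES (1);
INSERT INTO stores VALUES (10, 1);
-- Orders 100 and 101 happen to share the same fee.
INSERT INTO orders VALUES (100, 10, 5.0), (101, 10, 5.0), (102, 10, 7.0);
-- Order 100 has 3 backordered lines, order 101 has 1, order 102 has none.
INSERT INTO order_lines VALUES
  (1, 100, 1), (2, 100, 1), (3, 100, 1), (4, 101, 1), (5, 102, 0);
""")

# The fanned-out join chain from the WRONG query.
JOINED = """
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered
"""

# Grain check: if rows outnumber distinct order keys, the join multiplied orders.
rows, distinct_orders = con.execute(
    f"SELECT COUNT(*), COUNT(DISTINCT o.order_id) {JOINED}"
).fetchone()
fanned_out = rows != distinct_orders

# A per-order measure summed over the fanned-out rows is inflated ...
inflated_fee = con.execute(f"SELECT SUM(o.shipping_fee) {JOINED}").fetchone()[0]
# ... and SUM(DISTINCT) over-corrects by merging the two equal fees.
distinct_fee = con.execute(f"SELECT SUM(DISTINCT o.shipping_fee) {JOINED}").fetchone()[0]

# Pre-aggregating order_lines to one row per order gives the intended total.
correct_fee = con.execute("""
WITH qualifying_orders AS (
  SELECT DISTINCT order_id FROM order_lines WHERE is_backordered
)
SELECT SUM(o.shipping_fee)
FROM orders o
JOIN qualifying_orders q ON q.order_id = o.order_id
""").fetchone()[0]

print(rows, distinct_orders, fanned_out)        # join rows vs. orders at the intended grain
print(inflated_fee, distinct_fee, correct_fee)  # three different "totals" for one question
```

On this toy data the three fee totals all differ, which is the point: only the pre-aggregated form answers the question for sums.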
|
|
||||||
|
|
||||||
## Leak-safety (hard constraint on this spec and its example)
|
|
||||||
|
|
||||||
The benchmark's gold answers must never appear in ktx. The worked example must
|
|
||||||
be a **synthetic, generic schema invented for teaching** — not the tables,
|
|
||||||
column names, query, or numeric results of any Spider 2.0-Lite question. The
|
|
||||||
example demonstrates the *pattern* (coarse-grain measure counted after a
|
|
||||||
one-to-many join), which is universal; it must be reconstructable from first
|
|
||||||
principles by anyone, with zero reference to benchmark data. A reviewer should
|
|
||||||
be able to read the example and find nothing that ties it to a specific
|
|
||||||
benchmark instance.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- The skill's `<sql_craft>` Composition section states the multi-hop
|
|
||||||
generalization of the fan-out rule and a grain-verification habit, inline and
|
|
||||||
dialect-agnostic.
|
|
||||||
- It includes exactly one short, **generic** worked example (wrong vs.
|
|
||||||
pre-aggregated-right) using an invented schema, with no benchmark-derived
|
|
||||||
identifiers or values.
|
|
||||||
- No new tool, flag, or config; this is skill-content only (additive to spec 07).
|
|
||||||
- Existing analytics-skill content tests are updated to cover the added rule's
|
|
||||||
presence (mirroring spec 07's `analytics-skill-content.test.ts`).
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
Multi-hop aggregation questions (counting/averaging a coarse-grained measure
|
|
||||||
reached through several one-to-many joins) are a recurring source of
|
|
||||||
result-mismatch failures in the SQLite subset: the agent produces runnable SQL
|
|
||||||
with the right tables but a fan-out-inflated number. These are correctness
|
|
||||||
failures, not knowledge or schema-discovery failures (zero execution errors in
|
|
||||||
the latest run), so the fix belongs in the product's authoring craft — where it
|
|
||||||
also helps any real analyst — not in a benchmark-specific prompt.
|
|
||||||
```
|
|
||||||
|
|
@ -1,65 +0,0 @@
|
||||||
# Panel/period completeness — emit the full set of groups, not only the populated ones
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
When a question asks for a result *per period* or *per category* ("orders for each
|
|
||||||
month of 2023", "revenue by region", "count per status"), the natural `GROUP BY`
|
|
||||||
only returns groups that actually have rows. Periods/categories with **zero**
|
|
||||||
activity silently vanish, so a "12 months" answer comes back with 9 rows and the
|
|
||||||
ones that should read `0` are simply absent. The agent writes runnable SQL with
|
|
||||||
the right aggregate but an **incomplete panel**.
|
|
||||||
|
|
||||||
This is a universal reporting correctness issue: a monthly report with missing
|
|
||||||
months, or a category breakdown missing the empty categories, is wrong for any
|
|
||||||
analyst — and it is also a frequent result-mismatch shape on the benchmark.
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
"How many orders were placed in each month of 2023?" must return **12 rows** even
|
|
||||||
if March had no orders (March = 0), not 11 rows. "Sales per region" should include
|
|
||||||
regions with no sales (as 0/NULL) when the question asks for *each* region.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
Additive to the `ktx-analytics` skill's `<sql_craft>` "Answer completeness /
|
|
||||||
interpretation" group (consistent with spec 07's inline, dialect-agnostic, heuristic
|
|
||||||
+ why style).
|
|
||||||
|
|
||||||
1. **Recognize "full-panel" phrasing.** Cues like *each / every / per <period> /
|
|
||||||
for all <category> / by month* signal that the answer's row set should be the
|
|
||||||
**complete** set of periods or categories in scope, not just those present in
|
|
||||||
the filtered fact rows.
|
|
||||||
|
|
||||||
2. **Build a spine, then LEFT JOIN.** Generate the full set of expected
|
|
||||||
groups — a date/number series via a recursive CTE for periods, or the distinct
|
|
||||||
dimension values from the authoritative dimension table for categories — and
|
|
||||||
LEFT JOIN the aggregated facts onto it, defaulting missing measures with
|
|
||||||
`COALESCE(metric, 0)` (or NULL when 0 would be wrong). *Why:* a plain inner
|
|
||||||
`GROUP BY` can only emit groups that have at least one fact row.
|
|
||||||
|
|
||||||
3. **Don't over-apply.** When the question asks only about groups that exist
|
|
||||||
("which months had orders"), the spine is unnecessary; the cue is *each/all*
|
|
||||||
vs *which*.
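
The spine recipe in requirement 2 can be sketched as follows, assuming Python's built-in `sqlite3` and a synthetic `orders(order_id, order_date)` table with made-up rows. `strftime` is SQLite-specific; other dialects would use their own month-extraction and series-generation functions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
-- No orders in March, or in May through December.
INSERT INTO orders (order_date) VALUES
  ('2023-01-15'), ('2023-01-20'), ('2023-02-03'), ('2023-04-09');
""")

# Plain GROUP BY: only months that have at least one order come back.
populated = con.execute("""
SELECT strftime('%m', order_date) AS m, COUNT(*) AS n_orders
FROM orders
GROUP BY strftime('%m', order_date)
""").fetchall()

# Spine + LEFT JOIN + COALESCE: all 12 months, empty ones reported as 0.
panel = con.execute("""
WITH RECURSIVE months(m) AS (
  SELECT 1
  UNION ALL
  SELECT m + 1 FROM months WHERE m < 12
),
monthly AS (
  SELECT CAST(strftime('%m', order_date) AS INTEGER) AS m, COUNT(*) AS n_orders
  FROM orders
  WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
  GROUP BY CAST(strftime('%m', order_date) AS INTEGER)
)
SELECT months.m, COALESCE(monthly.n_orders, 0) AS n_orders
FROM months
LEFT JOIN monthly ON monthly.m = months.m
ORDER BY months.m
""").fetchall()

print(len(populated), len(panel))
```

Note that the facts are aggregated before the join, so the spine join stays one-to-one and cannot itself fan out.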
|
|
||||||
|
|
||||||
## Leak-safety (hard constraint)
|
|
||||||
|
|
||||||
Any worked example must use a **synthetic generic schema** (e.g. an `orders`
|
|
||||||
table with an `order_date`) and demonstrate only the *pattern* (spine + LEFT JOIN
|
|
||||||
+ COALESCE). No benchmark table names, SQL, or result values. The behavior is
|
|
||||||
reconstructable from first principles and tied to no specific instance.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- `<sql_craft>` states the full-panel cue, the spine + LEFT JOIN + COALESCE recipe,
|
|
||||||
and the over-application guard — inline and dialect-agnostic.
|
|
||||||
- At most one short generic example (recursive-CTE date spine or distinct-dimension
|
|
||||||
spine), no benchmark-derived content.
|
|
||||||
- Skill-content only; analytics-skill content tests updated to cover the rule.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
Per-period / per-category questions where some periods are empty produce
|
|
||||||
short-row result mismatches in the SQLite subset. The fix is a universal
|
|
||||||
reporting habit (complete panels), so it belongs in the product's craft, where it
|
|
||||||
also helps real analysts — not in a benchmark-specific prompt. Related to spec 11
|
|
||||||
(rolling/cumulative windows need a complete date spine to be correct).
|
|
||||||
|
|
@ -1,73 +0,0 @@
|
||||||
# Time-series window craft — running totals, rolling-N (min-periods), period-over-period
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
A large share of analytics questions are time-series shaped: a **running/cumulative
|
|
||||||
balance**, a **rolling N-day average**, or **period-over-period growth**. The agent
|
|
||||||
knows window functions exist (spec 07 covers determinism and window-then-filter) but
|
|
||||||
gets the *time-series specifics* wrong:
|
|
||||||
|
|
||||||
- cumulative balance computed without an unbounded preceding frame (or with the
|
|
||||||
frame defaulting incorrectly when there are ties on the order key);
|
|
||||||
- "rolling 30-day" implemented as `ROWS BETWEEN 29 PRECEDING AND CURRENT ROW` over **gappy** daily
|
|
||||||
data, so the window spans the wrong calendar span when days are missing;
|
|
||||||
- no **minimum-periods** handling — a rolling average is reported before the window
|
|
||||||
is actually full;
|
|
||||||
- "growth vs previous period" without `LAG`, or comparing to the wrong neighbor.
|
|
||||||
|
|
||||||
These are runnable-but-wrong; the structure is close, the edge case diverges.
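
To make the tie problem from the first bullet concrete, here is a small sketch assuming Python's built-in `sqlite3` (with a SQLite build that supports window functions, 3.25 or later) and a synthetic `account_txns` table. The `txn_id` tie-breaker column and the rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE account_txns (
  txn_id INTEGER PRIMARY KEY, account_id INTEGER, txn_date TEXT, net REAL
);
INSERT INTO account_txns VALUES
  (1, 1, '2023-01-05', 100.0),
  (2, 1, '2023-01-05', 50.0),   -- same txn_date as txn 1: a tie on the order key
  (3, 1, '2023-01-09', -30.0);
""")

rows = con.execute("""
SELECT txn_id,
       -- Default frame: rows tied on txn_date are peers and share one total.
       SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date) AS default_frame,
       -- Explicit ROWS frame plus a complete tie-breaker: a true row-by-row total.
       SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date, txn_id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running
FROM account_txns
ORDER BY txn_id
""").fetchall()

for row in rows:
    print(row)
```

Both columns agree on the last row and differ only on the first tied row, which is why this mistake tends to survive a casual look at the final balance.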
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
- "Each account's month-end running balance over 2023" — cumulative sum of monthly
|
|
||||||
net over an ordered window.
|
|
||||||
- "30-day rolling average of daily revenue, only once 30 days of history exist."
|
|
||||||
- "Month-over-month revenue growth rate."
|
|
||||||
|
|
||||||
All three are bread-and-butter for any analyst on any time-series table.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
Additive to the `ktx-analytics` skill's `<sql_craft>` "Window functions" group
|
|
||||||
(inline, dialect-agnostic, heuristic + why).
|
|
||||||
|
|
||||||
1. **Cumulative / running total.** `SUM(x) OVER (PARTITION BY k ORDER BY t ROWS
|
|
||||||
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, with a complete tie-breaker in
|
|
||||||
`ORDER BY` (spec 07 rule). *Why:* the default frame with a non-unique `ORDER BY`
|
|
||||||
treats tied rows as peers, so every tied row receives the same running total.
|
|
||||||
|
|
||||||
2. **Rolling window over time, not over rows.** When "rolling N days/months" is
|
|
||||||
asked, the window must span a calendar range. Over gappy data, either build a
|
|
||||||
complete date spine first (see spec 10) so `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` equals
|
|
||||||
the intended span, or use a range/self-join keyed on the date. *Why:* row-count
|
|
||||||
frames over missing dates silently measure the wrong span.
|
|
||||||
|
|
||||||
3. **Minimum periods.** When the question says "only after N periods of data" (or
|
|
||||||
it is implied by a rolling metric), emit NULL/skip until the window is full
|
|
||||||
(e.g. guard on `COUNT(*) OVER (...) = N`). *Why:* a partial early window is not
|
|
||||||
the requested metric.
|
|
||||||
|
|
||||||
4. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)`
|
|
||||||
for prior-period comparisons; growth rate = `(cur - prev) / prev` computed at
|
|
||||||
full precision (round only at the end). Guard divide-by-zero/NULL prev.
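
Requirements 2 to 4 can be sketched together, assuming Python's built-in `sqlite3` (SQLite 3.25+ for window functions) and synthetic data. A 3-day window stands in for "N days" to keep the data small. Filling a missing day with `0` is an assumption that suits revenue; other measures may need NULL instead.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL);
INSERT INTO daily_revenue VALUES
  ('2023-01-01', 10.0), ('2023-01-02', 20.0),
  ('2023-01-04', 40.0), ('2023-01-05', 50.0);   -- 2023-01-03 is missing
""")

# Rolling 3-day average over a complete date spine, NULL until the window is full.
rolling = con.execute("""
WITH RECURSIVE spine(day) AS (
  SELECT (SELECT MIN(day) FROM daily_revenue)
  UNION ALL
  SELECT date(day, '+1 day') FROM spine
  WHERE day < (SELECT MAX(day) FROM daily_revenue)
),
filled AS (
  SELECT spine.day, COALESCE(d.amount, 0) AS amount
  FROM spine
  LEFT JOIN daily_revenue d ON d.day = spine.day
),
windowed AS (
  SELECT day,
         AVG(amount) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg_3d,
         COUNT(*)    OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS n_days
  FROM filled
)
SELECT day, CASE WHEN n_days = 3 THEN avg_3d END AS avg_3d   -- minimum-periods guard
FROM windowed
ORDER BY day
""").fetchall()

# Period-over-period growth with LAG, guarding a zero or missing previous value.
growth = con.execute("""
WITH monthly(period, revenue) AS (
  VALUES ('2023-01', 100.0), ('2023-02', 120.0), ('2023-03', 0.0), ('2023-04', 30.0)
),
lagged AS (
  SELECT period, revenue, LAG(revenue) OVER (ORDER BY period) AS prev
  FROM monthly
)
SELECT period, (revenue - prev) / NULLIF(prev, 0) AS growth
FROM lagged
ORDER BY period
""").fetchall()

print(rolling)
print(growth)
```

Without the spine, the same `ROWS` frame on 2023-01-05 would reach back to 2023-01-02 and average over four calendar days instead of three.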
|
|
||||||
|
|
||||||
## Leak-safety (hard constraint)
|
|
||||||
|
|
||||||
Worked examples must use a **synthetic generic schema** (e.g. `daily_revenue(day,
|
|
||||||
amount)` or `account_txns(account_id, txn_date, net)`) and show only the *pattern*.
|
|
||||||
No benchmark table names, SQL, or result values.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- `<sql_craft>` "Window functions" gains the cumulative, rolling-over-time +
|
|
||||||
min-periods, and period-over-period recipes — inline, dialect-agnostic.
|
|
||||||
- At most one or two compact generic examples; no benchmark-derived content.
|
|
||||||
- Skill-content only; analytics-skill content tests updated.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
Running-balance / rolling / period-over-period questions are the single largest
|
|
||||||
result-mismatch cluster in the SQLite subset (financial-transactions style DBs).
|
|
||||||
The methodology is universal analyst craft, so it belongs in the product's skill
|
|
||||||
(transfers to real users), not in a benchmark-specific prompt. Depends on spec 10
|
|
||||||
(date spine) for the gappy-rolling case.
|
|
||||||
|
|
@ -1,61 +0,0 @@
|
||||||
# Parse text-encoded numeric columns before doing math on them
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
Numeric measures are often stored as **text** with human formatting: unit suffixes
|
|
||||||
(`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators (`"$1,200"`),
|
|
||||||
percent signs (`"12%"`), or non-numeric sentinels for missing/zero (`"-"`, `"N/A"`,
|
|
||||||
`""`). Aggregating or comparing such a column directly is silently wrong: string
|
|
||||||
comparison orders `"100" < "9"`, and a naive `CAST(x AS REAL)` yields `0`/NULL on
|
|
||||||
the formatted values (or a truncated prefix: SQLite turns `"1.2K"` into `1.2`) rather than the intended number.
|
|
||||||
|
|
||||||
The agent already samples schemas (spec 07 schema-discovery), but when it sees a
|
|
||||||
"numeric" column it tends to assume it is a real number type and skips the parse —
|
|
||||||
so the arithmetic runs on garbage. Runnable, plausible, wrong.
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become `1200 / 3000000
|
|
||||||
/ 0` before you can sum it or compute a daily change. A `price` stored as
|
|
||||||
`"$1,299.00"` must become `1299.00` before averaging. This is routine data hygiene
|
|
||||||
on real, messy production tables.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
Extend the `ktx-analytics` skill's `<sql_craft>` "Schema discovery before writing
|
|
||||||
SQL" group (inline, dialect-agnostic, heuristic + why).
|
|
||||||
|
|
||||||
1. **Detect text-encoded numerics during sampling.** When a column that the
|
|
||||||
question treats as a number is stored as text, sample distinct values to learn
|
|
||||||
the encodings actually present (suffixes, symbols, separators, sentinels) before
|
|
||||||
composing — never assume the format from the column name.
|
|
||||||
|
|
||||||
2. **Parse and scale before arithmetic.** Strip currency/separator/percent
|
|
||||||
characters; multiply by the suffix scale (K=10^3, M=10^6, B=10^9); map sentinels
|
|
||||||
(`-`, `N/A`, empty) to `0` or `NULL` per the question's intent; then cast to a
|
|
||||||
numeric type. Do this in an early CTE so all downstream math sees clean numbers.
|
|
||||||
*Why:* string columns compared/aggregated as-is sort lexically and cast to 0,
|
|
||||||
producing silently wrong results instead of errors.
|
|
||||||
|
|
||||||
3. **Confirm coverage.** After parsing, sanity-check that no intended-numeric value
|
|
||||||
failed to parse (would surface as NULL), to catch an encoding the sample missed.
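
The detect, parse/scale, and confirm-coverage steps can be sketched like this, assuming Python's built-in `sqlite3` and the synthetic `metrics(label, value_text)` table with made-up values. The `CASE` ladder is illustrative only: it handles unsigned values, maps sentinels to `0`, and uses SQLite's `GLOB`, which other dialects would replace with their own pattern matching.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE metrics (label TEXT, value_text TEXT);
INSERT INTO metrics VALUES
  ('a', '1.2K'), ('b', '3M'), ('c', '-'), ('d', '$1,200'), ('e', '7'),
  ('f', 'approx 9');   -- an encoding the initial sample missed
""")

# Naive cast: runs without error but silently drops the suffix scale.
naive_total = con.execute(
    "SELECT SUM(CAST(value_text AS REAL)) FROM metrics"
).fetchone()[0]

# Early CTEs: strip symbols, scale suffixes, map sentinels, NULL anything unrecognized.
PARSED = """
WITH stripped AS (
  SELECT label, value_text,
         REPLACE(REPLACE(TRIM(value_text), '$', ''), ',', '') AS s
  FROM metrics
),
parsed AS (
  SELECT label, value_text,
         CASE
           WHEN s IN ('-', 'N/A', '') THEN 0.0
           WHEN s GLOB '*[0-9]K' THEN CAST(SUBSTR(s, 1, LENGTH(s) - 1) AS REAL) * 1e3
           WHEN s GLOB '*[0-9]M' THEN CAST(SUBSTR(s, 1, LENGTH(s) - 1) AS REAL) * 1e6
           WHEN s GLOB '*[0-9]B' THEN CAST(SUBSTR(s, 1, LENGTH(s) - 1) AS REAL) * 1e9
           WHEN s GLOB '*[^0-9.]*' THEN NULL   -- unrecognized: surface it, do not guess
           ELSE CAST(s AS REAL)
         END AS value_num
  FROM stripped
)
"""

parsed_total = con.execute(f"{PARSED} SELECT SUM(value_num) FROM parsed").fetchone()[0]

# Coverage check: any row that failed to parse shows up here.
unparsed = con.execute(
    f"{PARSED} SELECT label, value_text FROM parsed WHERE value_num IS NULL"
).fetchall()

print(naive_total, parsed_total, unparsed)
```

The coverage query is what catches `'approx 9'`: without it, `SUM` would skip the NULL and the total would look complete.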
|
|
||||||
|
|
||||||
## Leak-safety (hard constraint)
|
|
||||||
|
|
||||||
Worked examples must use a **synthetic generic schema** and made-up values (e.g. a
|
|
||||||
`metrics(label, value_text)` table with `"1.2K"`, `"-"`). No benchmark table names,
|
|
||||||
SQL, or result values; the parsing pattern is universal and tied to no instance.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- `<sql_craft>` schema-discovery gains the detect → parse/scale → verify guidance —
|
|
||||||
inline, dialect-agnostic, with at most one short generic example.
|
|
||||||
- No benchmark-derived content. Skill-content only; content tests updated.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
At least one SQLite-subset question stores trading volume as suffix-encoded text
|
|
||||||
("K"/"M", "-" for zero) and fails because the agent aggregates the raw strings. The
|
|
||||||
fix — parse messy encodings before math — is universal data hygiene that helps any
|
|
||||||
analyst, so it belongs in the product's craft rather than a benchmark-specific
|
|
||||||
prompt.
|
|
||||||
|
|
@ -1,105 +0,0 @@
|
||||||
# Enforce answer-output completeness with a final pre-emit check in the analytics skill
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
The single largest correctness failure mode is **incomplete output**: the query runs and the
|
|
||||||
methodology is roughly right, but the result is missing columns the question asked for. Three
|
|
||||||
recurring sub-patterns:
|
|
||||||
|
|
||||||
1. **Multi-part questions answered partially.** A question that asks for several things ("report
|
|
||||||
the highest *and* the lowest month, each with its count and average, *and* the difference")
|
|
||||||
comes back with only the first part — one column instead of the several requested.
|
|
||||||
2. **Identity dropped.** Grouping by a human-readable name but not projecting the entity's
|
|
||||||
identifier (e.g. a product name without its product id, a customer name without its
|
|
||||||
customer id).
|
|
||||||
3. **Inputs to a derived value dropped.** Returning a ratio / percentage / difference but not
|
|
||||||
the underlying counts the question also asked for.
|
|
||||||
|
|
||||||
Sub-patterns 2 and 3 are **already covered by `<sql_craft>` rules** in the analytics skill
|
|
||||||
(spec 07: *"expose identity, not just the label"* and *"keep the inputs to a derived value"*),
|
|
||||||
yet they are frequently **not applied**. So the gap is not missing knowledge — it is that these
|
|
||||||
rules are passive heuristics buried in a list, and the agent doesn't reliably check them before
|
|
||||||
finalizing. The fix is to (a) add the missing multi-part-completeness rule and (b) turn
|
|
||||||
output-completeness into an **explicit final verification step** the agent performs before
|
|
||||||
emitting SQL.
|
|
||||||
|
|
||||||
This is reinforced by evidence that the failure is **model-independent**: a markedly stronger
|
|
||||||
model produced the same incomplete-output mistakes on these questions, which means it is a
|
|
||||||
craft/enforcement gap, not a capability gap.
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
An analyst is asked: *"For each region, report the highest and the lowest monthly order count,
|
|
||||||
and the difference between them."* A complete, useful answer has a column for the region's id
|
|
||||||
and name, the highest count, the lowest count, and the difference — five columns. Returning just
|
|
||||||
the region and a single number answers only part of the request. This is a universal expectation
|
|
||||||
on any database: answer **every** part of a multi-part request, identify the entities, and show
|
|
||||||
the inputs behind any derived figure.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
Additive to the analytics skill's `<sql_craft>` "Answer completeness / interpretation" group and
|
|
||||||
its workflow's validate step (inline, dialect-agnostic, heuristic + why, consistent with spec 07).
|
|
||||||
|
|
||||||
1. **Multi-part / multi-output completeness (new rule).** When a question requests several
|
|
||||||
outputs — a list ("A, B, and C"), paired extremes ("the highest *and* the lowest"), or a
|
|
||||||
value plus its components ("X, Y, and their ratio") — the final projection must contain a
|
|
||||||
column for **each** requested output. *Why:* answering only the first clause is the most common
|
|
||||||
way a runnable query is still wrong; the grain and methodology can be perfect yet the answer
|
|
||||||
is short by columns.
|
|
||||||
|
|
||||||
2. **Fold the existing identity / inputs rules into the same completeness notion.** The
|
|
||||||
already-shipped rules — project the entity **identifier** alongside any human-readable label,
|
|
||||||
and **keep the inputs** to any derived value — are part of output completeness; reference them
|
|
||||||
from the check below so they are actually applied, not just listed.
|
|
||||||
|
|
||||||
3. **Add an explicit final completeness check (the enforcement mechanism).** Before emitting the
|
|
||||||
final SQL, the skill should have the agent **re-read the question and confirm the projection
|
|
||||||
covers**: every named metric/attribute; the identifier of every grouped/named entity; every
|
|
||||||
input to a derived value; all at the grain the question specifies. This is a short, concrete
|
|
||||||
checkpoint at the validate step — the point is to convert the passive heuristics into an active
|
|
||||||
pre-finalize verification. (Do **not** add unrequested/extra columns to be "safe" — that is
|
|
||||||
grader-gaming; the check is about matching the request exactly, not padding it.)
|
|
||||||
|
|
||||||
Generic teaching example (synthetic schema — see Leak-safety):
|
|
||||||
```sql
-- "For each region, report the highest and lowest monthly order count and their difference."
-- WRONG: answers only the first clause; no region id, no lowest, no difference.
SELECT region_name, MAX(monthly_orders) AS highest
FROM region_monthly GROUP BY region_name;

-- RIGHT: one column per requested output + the entity's identity, at the region grain.
SELECT r.region_id, r.region_name,
       MAX(m.monthly_orders) AS highest_monthly_orders,
       MIN(m.monthly_orders) AS lowest_monthly_orders,
       MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference
FROM regions r
JOIN region_monthly m ON m.region_id = r.region_id
GROUP BY r.region_id, r.region_name;
```
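
The final completeness check in requirement 3 is a re-reading habit rather than a tool, but its logic can be illustrated mechanically. The sketch below assumes Python's built-in `sqlite3`, the synthetic `regions` / `region_monthly` schema, and a hypothetical helper `projection_gaps` that compares a query's projected column names with a hand-written list of requested outputs. Neither the helper nor the name-matching approach is part of ktx.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE region_monthly (region_id INTEGER, month TEXT, monthly_orders INTEGER);
""")

# What the question asks for, written down before finalizing the SQL.
REQUESTED = ["region_id", "region_name", "highest_monthly_orders",
             "lowest_monthly_orders", "difference"]

def projection_gaps(sql, requested):
    """Hypothetical helper: return (missing, extra) column names for a query."""
    projected = [col[0] for col in con.execute(sql).description]
    missing = [c for c in requested if c not in projected]
    extra = [c for c in projected if c not in requested]   # padding is also a mismatch
    return missing, extra

PARTIAL_SQL = """
SELECT r.region_name AS region_name, MAX(m.monthly_orders) AS highest
FROM regions r JOIN region_monthly m ON m.region_id = r.region_id
GROUP BY r.region_name
"""

COMPLETE_SQL = """
SELECT r.region_id AS region_id, r.region_name AS region_name,
       MAX(m.monthly_orders) AS highest_monthly_orders,
       MIN(m.monthly_orders) AS lowest_monthly_orders,
       MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference
FROM regions r JOIN region_monthly m ON m.region_id = r.region_id
GROUP BY r.region_id, r.region_name
"""

partial_gaps = projection_gaps(PARTIAL_SQL, REQUESTED)
complete_gaps = projection_gaps(COMPLETE_SQL, REQUESTED)
print(partial_gaps)
print(complete_gaps)
```

Reporting `extra` alongside `missing` mirrors the over-projection guard: the goal is an exact match with the request, not a superset of it.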
|
|
||||||
|
|
||||||
## Leak-safety (hard constraint)
|
|
||||||
|
|
||||||
The example must use an **invented, generic schema** (`regions`, `region_monthly`) and made-up
|
|
||||||
columns — **no benchmark table names, SQL, or result values.** It teaches the *pattern* (cover
|
|
||||||
every requested output + identity + inputs), which is universal and tied to no specific instance.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- The skill states the multi-part-completeness rule and a concrete **final completeness check**
|
|
||||||
(re-read question → verify metrics + identity + inputs + grain), inline and dialect-agnostic,
|
|
||||||
cross-referencing the existing identity/inputs rules so they're enforced.
|
|
||||||
- Includes the over-projection guard (don't pad with extra columns — that's grader-gaming).
|
|
||||||
- One short generic example (wrong vs complete); no benchmark-derived content.
|
|
||||||
- Skill-content only; analytics-skill content tests updated to cover the new rule + check.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
In the latest SQLite-subset run, **incomplete output was the single largest failure bucket
|
|
||||||
(~13 of 51 voted failures)**: multi-part questions answered partially, and identity / derived-value
|
|
||||||
inputs dropped — the latter two being spec-07 rules that already exist but weren't applied. A
|
|
||||||
probe with a much stronger model reproduced the *same* incomplete-output failures, confirming this
|
|
||||||
is a craft-enforcement gap rather than a model-capability one. The fix — answer every requested
|
|
||||||
part, identify entities, keep inputs — is universal analyst craft, so it belongs in the product
|
|
||||||
skill (and transfers to real users), enforced as a final check rather than left as a passive hint.
|
|
||||||
```
|
|
||||||
|
|
@ -1,116 +0,0 @@
|
||||||
# Structured, leveled logging for the ktx MCP server
|
|
||||||
|
|
||||||
> **Scope: observability only.** This spec is about *seeing* what the MCP server
|
|
||||||
> does (which tool, what params, when, how long, outcome). *Preventing* a runaway
|
|
||||||
> query from blocking the server (off-event-loop / interruptible query execution)
|
|
||||||
> is a separate concern — see "Non-goals" and the sibling spec note below.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
The ktx MCP server (`packages/cli/src/mcp-http-server.ts` +
|
|
||||||
`mcp-server-factory.ts`; raw `node:http` + `@modelcontextprotocol/sdk`
|
|
||||||
`StreamableHTTPServerTransport`) emits almost no operational logs. There is no
|
|
||||||
server-side record of **which MCP tool was called, with what parameters, when,
|
|
||||||
how long it took, or whether it succeeded** — nor of session open/close or
|
|
||||||
transport errors. When a tool call is slow, hangs, or a client connection drops
|
|
||||||
("Transport channel closed"), an operator has no trail to diagnose it and must
|
|
||||||
resort to process sampling / `lsof` / guesswork — and the offending input
|
|
||||||
(e.g. the exact SQL) is typically unrecoverable.
|
|
||||||
|
|
||||||
## Generic use case
|
|
||||||
|
|
||||||
Anyone running a long-lived ktx MCP server — a developer's local instance, a
|
|
||||||
shared team server, or a hosted deployment — needs observability into tool-call
|
|
||||||
activity to:
|
|
||||||
- diagnose slow or hung tool calls (which `sql_execution` ran, against which
|
|
||||||
connection, with what SQL, for how long);
|
|
||||||
- explain client-visible connection failures from the server side (session
|
|
||||||
lifecycle, transport-closed events);
|
|
||||||
- audit what agents asked the server to do;
|
|
||||||
- spot patterns (hot tools, slow connections, error rates).
|
|
||||||
|
|
||||||
This is standard production-server hygiene; the server currently provides none.
|
|
||||||
|
|
||||||
## Requirements (sketch — refine when picked up)
|
|
||||||
|
|
||||||
1. **One structured (JSON) logger, low overhead.** Suggested `pino` (orientation
|
|
||||||
only; implementer owns the choice). A single shared instance; write **JSON to
|
|
||||||
stdout** (12-factor — the launcher/aggregator routes it). No in-app file
|
|
||||||
rotation. Optional human-readable pretty output only when attached to a TTY
|
|
||||||
(dev).
|
|
||||||
2. **Configurable level via env** (e.g. `KTX_LOG_LEVEL`, default `info`; `debug`
|
|
||||||
for diagnosis) — verbose logging on demand without code changes.
|
|
||||||
3. **Per-session / per-call context** via child loggers: every line carries a
|
|
||||||
`sessionId` (from the transport session) and, for tool calls, a `callId` +
|
|
||||||
`tool` name, so one session's or call's activity can be traced/grepped.
|
|
||||||
4. **Tool-call logging — START logged BEFORE execution, COMPLETION after.** For
|
|
||||||
every MCP tool invocation:
|
|
||||||
- on entry: log `{ tool, params, sessionId, callId }` **before** running the
|
|
||||||
handler (so the record exists even if the handler never returns);
|
|
||||||
- on exit: log `durationMs` + outcome (ok with result size, or error with
|
|
||||||
stack).
|
|
||||||
This makes a **hung / never-returning call identifiable**: a start with no
|
|
||||||
matching completion is the culprit, with its exact parameters and timestamp.
|
|
||||||
This matters specifically because handlers like `sql_execution` run a
|
|
||||||
*synchronous* better-sqlite3 query — a runaway query blocks the process and no
|
|
||||||
completion is ever logged, so the start line (flushed before the blocking
|
|
||||||
call) is the only record. For `sql_execution`, `params` should include the SQL
|
|
||||||
text (the most useful field). Emit a **WARN** when a *completed* call exceeds a
|
|
||||||
configurable slow threshold (e.g. `KTX_SLOW_TOOL_MS`).
|
|
||||||
5. **Connection / session lifecycle:** log session open/close (with `sessionId`)
|
|
||||||
and transport errors (the SDK's closed-channel / "Transport channel closed"
|
|
||||||
events) so client-side connection failures have a server-side counterpart.
|
|
||||||
6. **Error logging** with structured stack traces (a standard error serializer),
|
|
||||||
not bare strings.
|
|
||||||
7. **Light redaction — credentials only** (bearer token, connection
|
|
||||||
passwords/secrets). SQL text and tool params are *not* secrets and must be
|
|
||||||
logged. Do not over-redact.
|
|
||||||
8. **Synchronous logging is fine.** The server uses a synchronous DB client, so
|
|
||||||
logging need not be async; prefer the simpler synchronous stdout path over
|
|
||||||
async/worker transports (which can lose buffered lines on a hard crash). Do
|
|
||||||
not introduce async-logging machinery.
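
The start-before / end-after ordering in requirement 4 can be sketched in a few lines. The real server is TypeScript and the spec suggests `pino`; the Python below is only a language-neutral illustration of the ordering, with hypothetical names (`log`, `redact`, `call_tool`, `SECRET_KEYS`) and a made-up default slow threshold. Level filtering via `KTX_LOG_LEVEL` is left out for brevity.

```python
import json
import sys
import time
import uuid

SECRET_KEYS = {"password", "token", "authorization"}  # assumed credential field names

def log(level, event, **fields):
    """Write one JSON line to stdout synchronously and flush it immediately."""
    sys.stdout.write(json.dumps({"level": level, "event": event,
                                 "time": time.time(), **fields}) + "\n")
    sys.stdout.flush()

def redact(params):
    """Mask credentials only; SQL text and other params are logged as-is."""
    return {k: ("[redacted]" if k.lower() in SECRET_KEYS else v)
            for k, v in params.items()}

def call_tool(tool, handler, params, session_id, slow_ms=5000):
    ctx = {"tool": tool, "sessionId": session_id, "callId": uuid.uuid4().hex}
    # The start line is written and flushed BEFORE the handler runs, so a
    # handler that never returns still leaves its exact parameters in the log.
    log("debug", "tool.start", params=redact(params), **ctx)
    started = time.monotonic()
    try:
        result = handler(**params)
    except Exception as exc:
        duration_ms = round((time.monotonic() - started) * 1000)
        log("error", "tool.end", outcome="error", error=repr(exc),
            durationMs=duration_ms, **ctx)
        raise
    duration_ms = round((time.monotonic() - started) * 1000)
    level = "warn" if duration_ms > slow_ms else "debug"
    log(level, "tool.end", outcome="ok", durationMs=duration_ms, **ctx)
    return result
```

A call such as `call_tool("sql_execution", run_sql, {"sql": "SELECT 1"}, "session-1")`, where `run_sql` is whatever handler executes the query, would emit one `tool.start` and one `tool.end` line sharing the same `callId`. A `tool.start` with no matching `tool.end` marks the hung call.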
|
|
||||||
|
|
||||||
## Acceptance criteria (sketch)

- With `KTX_LOG_LEVEL=debug`, invoking any MCP tool produces a `tool.start`
  (tool, params, sessionId, callId) and a `tool.end` (durationMs, outcome) line
  on the server's stdout, as JSON.
- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a
  `tool.start` line carrying its **exact SQL and timestamp** and **no**
  `tool.end` — so the offending query is recoverable from the log alone, with no
  process sampling.
- A completed tool call slower than the configured threshold emits a WARN with
  its duration.
- Session open/close and transport-closed events are logged with the `sessionId`.
- At default level (`info`), routine per-tool lines are suppressed but lifecycle,
  slow-call warnings, and errors are present.
- Credentials (bearer token, connection secrets) never appear in logs; SQL and
  tool params do.
- No new heavy dependencies beyond the logger; no OpenTelemetry/metrics stack; no
  async-transport machinery.

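The "start with no matching end" criterion above can be checked mechanically. The following is a hedged sketch of such a check over JSON log lines; `findOrphanStarts` and the `ToolLogRecord` shape are hypothetical names, and the only assumption about the log format is that each record carries `event` and `callId` as in the criteria.

```typescript
// Illustrative record shape; only `event` and `callId` are needed for matching.
interface ToolLogRecord {
  event?: string;
  callId?: string;
  ts?: string;
  tool?: string;
  params?: unknown;
}

// Returns every tool.start record that has no tool.end with the same callId.
function findOrphanStarts(logText: string): ToolLogRecord[] {
  const open = new Map<string, ToolLogRecord>();
  for (const raw of logText.split("\n")) {
    if (raw.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // tolerate non-JSON lines mixed into stdout
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const rec = parsed as ToolLogRecord;
    if (typeof rec.callId !== "string") continue;
    if (rec.event === "tool.start") open.set(rec.callId, rec);
    else if (rec.event === "tool.end") open.delete(rec.callId);
  }
  return Array.from(open.values());
}
```

Each returned record still carries its `params` and `ts`, which is what makes the offending SQL recoverable without sampling the process.
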
## Non-goals

- **Preventing/interrupting runaway queries** (off-event-loop execution, query
  timeouts, worker-thread isolation). That is a *separate* spec; a single
  synchronous query that fans out into a massive nested-loop join can peg the
  single-threaded server for hours and break new connections — observability
  surfaces *which* query, but the fix is execution-model work. (This logging is
  also a prerequisite for a future watchdog that detects a `tool.start` with no
  `tool.end` past a threshold and recycles the server.)
- Metrics/tracing/OpenTelemetry exporters.
- Forwarding logs to the MCP *client* via the protocol's logging capability
  (`notifications/message`, `logging/setLevel`) — a possible later enhancement,
  distinct from operational stdout logging.

## Benchmark context (motivation, not a requirement)

Running Spider 2.0-Lite against the MCP server at concurrency, an
adversarial-reviewer-generated query degenerated into a massive nested-loop join;
synchronous better-sqlite3 executed it on the event loop, pegging a server at
~100% CPU for hours and breaking new MCP connections to it ("Transport channel
closed"). We could not determine *which* query, because the server logs nothing
about tool calls — diagnosis required `sample`/`lsof` on the live process and the
exact SQL was never recovered. Structured tool-call logging (especially
start-before-execute) would have turned this into a one-line `grep` of the server
log.

@ -1,131 +0,0 @@
# Bounded query execution (deadline + non-blocking) for read SQL

> Priority: HIGH. Found empirically during a Spider2-lite sqlite run
> (2026-06-18): a single `sql_execution` MCP call wedged a worker at 100% CPU
> for 13+ minutes and never returned. The query
> `SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on the
> `complex_oracle` sqlite database hit a VIEW (`costs ⋈ sales`, 918,843 × 82,112
> rows, joined on a 4-column key with no composite index) whose plan degraded to
> an O(N×M) nested-loop scan. Because the sqlite connector runs
> `better_sqlite3 .all()` **synchronously with no timeout**, it blocked the MCP
> worker's entire event loop: no `tool.end` was ever logged, the port went
> unresponsive, and the query could not be cancelled. One of four eval shards
> stalled until the worker was killed by hand.

## Problem

Two compounding gaps on the read-query path:

1. **No execution deadline.** A single expensive query runs unbounded. This is
   handled divergently per connector, with no shared contract: BigQuery has a
   real server-side job timeout (`job_timeout_ms`); ClickHouse has an HTTP
   `request_timeout`; Snowflake, Postgres, MySQL, and SQL Server bound only
   connection/pool *acquisition*, not statement *execution*; SQLite has nothing.
   So whether a runaway query is bounded depends entirely on which driver the
   caller happened to hit.

2. **In-process engines block the event loop and can't be cancelled.** The
   sqlite connector executes on the main thread via synchronous
   `better_sqlite3 .all()`. A slow query freezes the whole MCP server (it can't
   serve other requests, send progress, or write `tool.end`), and there is no
   way to interrupt it: better-sqlite3 exposes no interrupt/cancel API — its
   documented mechanism for slow queries is to run them in a **worker thread**,
   and the only way to stop a runaway synchronous query is to terminate the
   thread executing it.

The net effect is a query that produces a `tool.start` with no matching
`tool.end`, an unresponsive server, and no self-recovery. A row cap (`maxRows`)
does not help — it bounds returned rows, not scan work, and the failing query
returned a single aggregate row.

## Generic use case

Any data agent that lets an LLM author SQL will eventually issue an
accidentally-expensive query — an unindexed or cartesian join, an expensive
VIEW, a wide aggregate over a large fact table. A general-purpose context layer
must bound that and return a clean, fast "query exceeded Ns" error so the agent
can revise (add filters, query base tables, narrow the range) instead of hanging
the tool and the server. This matters for embedded/local warehouses (sqlite,
duckdb) and remote ones alike, and is wholly independent of any benchmark.

## Requirements

1. Every read-query execution path (`executeReadOnly`) enforces a single
   canonical execution deadline. One opinionated default; **not** a per-call
   user flag. Where a driver already supports a per-connection timeout
   (BigQuery `job_timeout_ms`), reuse that as the per-connection override rather
   than inventing a parallel knob.
2. On exceeding the deadline the path resolves with a `KtxQueryError`
   ("query exceeded {N}s") — a finite, decision-reaching outcome, never an
   unbounded hang.
3. The deadline is a **shared contract at the connector boundary**, defined once
   (on the `executeReadOnly` contract or a shared wrapper at the call site) so
   all drivers participate. Bring the existing divergent timeouts (BigQuery job
   timeout, ClickHouse request timeout) under this one contract instead of
   leaving parallel mechanisms.
4. For in-process engines (sqlite today, any future embedded driver), execution
   MUST NOT block the MCP server event loop. Run the query off the main thread
   and enforce the deadline by terminating that thread on timeout (the
   better-sqlite3-documented approach, since synchronous queries are
   uncancellable in-thread). The event loop must stay responsive so `tool.end`
   is always written and concurrent requests on the same port are served.
5. Prefer real cancellation over client-side give-up. Where the engine supports
   a server-side statement timeout (Postgres `statement_timeout`, MySQL
   `max_execution_time`, Snowflake `STATEMENT_TIMEOUT_IN_SECONDS`, ClickHouse
   `max_execution_time`, BigQuery job timeout, SQL Server request timeout), set
   it so the deadline actually stops work, not merely abandons the promise while
   the query keeps running. For in-process engines, thread termination is the
   cancellation.
6. The MCP `sql_execution` tool surfaces the timeout as an expected error
   (classified as `KtxQueryError`, not a `$exception` fault, consistent with
   existing expected-error classification) and logs a `tool.end` with the error
   outcome.
7. Read-only enforcement (`assertReadOnlySql`) and the `maxRows` row cap remain
   unchanged. The deadline is additive; `maxRows` is not a substitute for it.

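To make requirement 5 concrete, here is a hedged sketch of translating one canonical deadline into the server-side settings named above. `serverSideTimeoutStatement` and the `Driver` union are hypothetical; the units reflect each engine's documented setting (milliseconds for Postgres and MySQL, seconds for Snowflake and ClickHouse). How a real connector attaches the setting (a `SET` statement, a per-query option, or a connection parameter) depends on the client library and is not specified by this spec.

```typescript
// Hypothetical helper: one canonical deadline, expressed per engine.
type Driver = "postgres" | "mysql" | "snowflake" | "clickhouse";

function serverSideTimeoutStatement(driver: Driver, deadlineMs: number): string {
  const ms = Math.max(1, Math.round(deadlineMs));
  // Engines that take whole seconds round up so the server never cuts off early.
  const seconds = Math.max(1, Math.ceil(ms / 1000));
  switch (driver) {
    case "postgres":
      return `SET statement_timeout = ${ms}`; // bare integer is milliseconds
    case "mysql":
      return `SET SESSION max_execution_time = ${ms}`; // milliseconds, applies to SELECT
    case "snowflake":
      return `ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = ${seconds}`;
    case "clickhouse":
      return `SET max_execution_time = ${seconds}`; // seconds
  }
}

// Message shape from requirement 2; the error class itself is left to the codebase.
function deadlineMessage(deadlineMs: number): string {
  return `query exceeded ${Math.ceil(deadlineMs / 1000)}s`;
}
```

Keeping the translation in one place is one way to satisfy the "defined once" wording of requirement 3: drivers differ only in how they express the same number.
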
## Acceptance criteria

- A read query that exceeds the deadline returns a `KtxQueryError` within
  roughly the deadline; the MCP worker stays responsive (a concurrent tool call
  on the same server completes while the slow query is still pending) and writes
  a matching `tool.end` with a non-ok outcome.
- sqlite specifically: executing a deliberately pathological query (e.g. an
  expensive VIEW or an unindexed cross join) on a fixture does not block the
  event loop, is terminated at the deadline, and CPU returns to idle afterward
  (the off-main-thread executor is killed, not left spinning).
- No regression: normal fast queries return identical results; read-only
  rejection still works; `maxRows` still bounds returned rows.
- Tests cover the deadline path for at least the in-process driver (sqlite,
  terminate-on-deadline) and one server-side-timeout driver.

## Benchmark context (motivation only)

The Spider2-lite local set loads several warehouses into sqlite, some with
expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` =
`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112
rows, no composite index, with `promo_id` (the index the optimizer picks) being
95.5% a single value. LLM-authored profiling queries (MIN/MAX/COUNT over such a
view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval
shard for 10+ minutes; with one, the agent gets a fast error and can scope the
query instead.

## Orientation hints (code pointers; may have drifted)

- Shared contract: `packages/cli/src/context/scan/types.ts` —
  `KtxScanConnector.executeReadOnly` (~343), `KtxReadOnlyQueryInput` (~285).
- MCP call site: `packages/cli/src/context/mcp/local-project-ports.ts:70`
  (`connector.executeReadOnly`); tool registration in
  `packages/cli/src/context/mcp/context-tools.ts`.
- In-process sync execution (the acute hang):
  `packages/cli/src/connectors/sqlite/connector.ts:311-313`
  (`better_sqlite3 .prepare().all()`).
- Existing divergent timeouts to unify: `connectors/bigquery/connector.ts`
  (`job_timeout_ms` / `jobTimeoutMs`), `connectors/clickhouse/connector.ts:602`
  (`request_timeout`), `connectors/snowflake/connector.ts:342` (test/pool only),
  `connectors/postgres/connector.ts`, `connectors/mysql/connector.ts`,
  `connectors/sqlserver/connector.ts` (pool/connection only).
- Error class: `packages/cli/src/errors.ts:25` (`KtxQueryError`).
- better-sqlite3 (context7 `/wiselibs/better-sqlite3`, v12.x): no
  interrupt/cancel API; `docs/threads.md` documents the worker-thread pattern
  for slow queries (master owns worker lifecycle and respawns on exit) — extend
  it with terminate-on-deadline to enforce the timeout.

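The terminate-on-deadline idea in the last hint can be sketched with Node's `worker_threads` alone. This is an illustration under stated assumptions, not the connector's implementation: `runWithDeadline` is a hypothetical name, and a JavaScript busy loop stands in for the runaway sqlite query so the example has no native dependency. Whether `terminate()` interrupts a thread that is blocked inside native sqlite code, as opposed to JavaScript, should be verified on a fixture before relying on it; a killable child process is the stronger fallback.

```typescript
import { Worker } from "node:worker_threads";

// Stand-in worker body (evaluated as CommonJS): returns rows quickly, or spins
// forever to simulate an uncancellable synchronous query.
const WORKER_SOURCE = `
const { parentPort, workerData } = require("node:worker_threads");
if (workerData.runaway) { while (true) {} }
parentPort.postMessage({ rows: [{ answer: 42 }] });
`;

function runWithDeadline(runaway: boolean, deadlineMs: number): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(WORKER_SOURCE, { eval: true, workerData: { runaway } });
    const timer = setTimeout(() => {
      // The main thread stays free, so this timer fires even while the worker spins.
      void worker.terminate();
      reject(new Error(`query exceeded ${deadlineMs / 1000}s`));
    }, deadlineMs);
    worker.once("message", (result) => {
      clearTimeout(timer);
      resolve(result);
      void worker.terminate(); // release the thread once the result is in hand
    });
    worker.once("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
  });
}
```

Because the deadline timer lives on the main thread, the server can still write `tool.end` and serve other requests while the worker is busy, which is the property requirement 4 asks for.
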
@ -1,68 +0,0 @@
# 18 — BigQuery cross-project dataset support (introspect foreign-hosted datasets, bill in own project)

**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/18-…`.

## Problem (generic, real-world)

Analysts routinely query datasets that live in a **different** BigQuery project than the one
they bill jobs to — Google's `bigquery-public-data`, a partner's shared project, an
organization's central data project, etc. To make those connectable in ktx (so `discover_data`,
the semantic layer, dictionary sampling, and `sql_dialect_notes` work), ktx must be able to
**introspect a dataset hosted in a foreign project while running/billing jobs in the
credentials' own project**.

Today it can't. ktx's BigQuery connector derives a single `projectId` from
`credentials.project_id` and uses it for **both** job billing **and** schema introspection:

- `connectors/bigquery/connector.ts:294` — `projectId` is read only from `credentials.project_id`;
  there is no separate billing-vs-dataset project knob.
- `:544` (`introspectDataset`) — calls `this.getClient().dataset(datasetId)`, which resolves the
  dataset **in the client's (billing) project**, and labels every table `catalog: this.resolved.projectId`.
- `:453` (`listTables`) — queries `` `${projectId}`.`region-…`.INFORMATION_SCHEMA.TABLES ``, i.e. the
  **billing** project's INFORMATION_SCHEMA.
- `:163` (`datasetIds()`) — returns `dataset_ids` verbatim; it never parses a `project.` prefix.

So a `dataset_id` naming a dataset in another project can't be introspected, even though querying
it works fine (cross-project reads bill to the caller's project — that path already works).

### Empirical confirmation
With a service account in project `ktx-spider2-lite`:
- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (looks in
  `projects/ktx-spider2-lite/datasets/austin_311`).
- The cross-project form `DatasetReference("bigquery-public-data","austin_311")` → **succeeds**
  (lists the public tables; public metadata is readable by any authenticated principal).
- There is **no config knob** to separate the introspection project from the billing project.

## Requirement

The BigQuery connector must accept **fully-qualified `project.dataset` entries** in `dataset_ids`
(a single connection may span more than one source project), and for each:
- **introspect** via the *dataset's* project — `client.dataset(id, { projectId })` /
  `DatasetReference(project, dataset)`, query the **dataset project's** `INFORMATION_SCHEMA`, and
  label the table `catalog` with the dataset's project;
- **run jobs / bill** in `credentials.project_id` (unchanged).

A bare `dataset` (no `project.`) keeps today's behavior (resolve in `credentials.project_id`), so
existing single-project connections are unaffected.

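The parsing half of this requirement is small enough to sketch. `parseDatasetEntry` and `DatasetRef` are hypothetical names, and the sketch assumes dataset IDs never contain a dot (BigQuery restricts them to letters, digits, and underscores), which is why it splits on the last dot rather than the first.

```typescript
interface DatasetRef {
  projectId: string; // project that hosts the dataset (used for introspection)
  datasetId: string;
}

// Resolves a `dataset_ids` entry; bare names fall back to the billing project.
function parseDatasetEntry(entry: string, billingProjectId: string): DatasetRef {
  const trimmed = entry.trim();
  const dot = trimmed.lastIndexOf(".");
  if (dot === -1) {
    if (trimmed === "") throw new Error("empty dataset_ids entry");
    return { projectId: billingProjectId, datasetId: trimmed };
  }
  const projectId = trimmed.slice(0, dot);
  const datasetId = trimmed.slice(dot + 1);
  if (projectId === "" || datasetId === "") {
    throw new Error(`invalid dataset_ids entry: "${entry}"`);
  }
  return { projectId, datasetId };
}
```

For example, `parseDatasetEntry("bigquery-public-data.austin_311", "ktx-spider2-lite")` yields the public project for introspection, while jobs would still be created under the billing project passed as the second argument.
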
## Acceptance

- `dataset_ids: ['bigquery-public-data.austin_311']` (credentials in a *different* project) →
  `ktx ingest <conn>` introspects the tables, enriches, and samples values; `discover_data` /
  `dictionary_search` return them.
- A connection mixing `['bigquery-public-data.x', 'other-project.y']` introspects both.
- `sql_execution` of a fully-qualified `project.dataset.table` query still runs and bills in
  `credentials.project_id`.
- Single-project `dataset_ids: ['my_dataset']` behaves exactly as before (no regression).

## Benchmark context (motivation only — do not encode benchmark specifics)

Spider 2.0-Lite's **BigQuery slice (205 questions)** is otherwise **unservable faithfully**: every
one of its ~74 logical databases groups datasets hosted in foreign public projects
(`bigquery-public-data`, `isb-cgc-bq`, `data-to-insights`, …), never in a project we own. Query
execution already works cross-project (proven), but ktx-only *discovery* (the whole point of the
faithful surface) is blocked because the connector can't introspect them. Scope is small: of 74
BQ dbs only **1** spans more than one source project, so "let `dataset_ids` carry `project.dataset`
and introspect each in its own project" covers the benchmark and the general case alike. This is
the sole blocker for the BigQuery leaderboard slice (the Snowflake slice needed no connector
change and is already baselined).

@ -1,89 +0,0 @@
# 19 — Durable, resumable, bounded relationship detection during ingest enrichment

**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/19-…`.

## Problem (generic, real-world)

Ingest enrichment runs three stages in a fixed order inside `runLocalScanEnrichment`
(`packages/cli/src/context/scan/local-enrichment.ts`):

1. `descriptions` (`:530`) — per-table LLM descriptions (the expensive step: one model call per
   table; on a large schema this is minutes of paid LLM work).
2. `embeddings` (`:559`) — column embeddings.
3. `relationships` (`:593`) — FK/join discovery: profiles a row sample of **every** table, then
   validates candidate joins.

The queryable semantic-layer artifacts are persisted **once, at the very end**, by
`writeLocalScanEnrichmentArtifacts` in `local-scan.ts:510` — which runs **after**
`runLocalScanEnrichment` returns, i.e. after all three stages.

This creates three failure modes that compound on large schemas (hundreds of tables):

1. **Enrichment is lost if relationship detection is interrupted.** The descriptions + embeddings
   are computed and held in memory, but they only reach the durable, queryable artifacts when the
   final write runs after the `relationships` stage. If the process is killed/crashes/times out
   **during** relationship detection (the last, slowest, silent stage), the artifacts are never
   written — the schema survives (it was written earlier at `local-scan.ts:473`) but **all the
   paid LLM enrichment is discarded**. Empirically: ingesting a 95-table BigQuery dataset produced
   full descriptions + embeddings (progress reached "Building embeddings 17/17"), then the
   relationships stage ran silently past a supervising deadline and was killed — the persisted
   `_schema` had **0** AI descriptions, only the native column comments. Every larger dataset hits
   this, so the most expensive work is the most likely to be thrown away.

2. **Re-running does not resume — it re-spends.** There is a stage state store
   (`SqliteLocalScanEnrichmentStateStore`) and a `runEnrichmentStage` helper (`:413`) that saves
   each completed stage's output. But the completed-stage lookup keys on **`runId`**
   (`findCompletedStage({ runId, stage, inputHash })`, `:427`), and `runId` is fresh per ingest
   invocation. So resume only works *within* a single run; re-running an interrupted ingest gets a
   new `runId`, misses the cache, and **re-computes descriptions + embeddings from scratch**
   (re-paying for the LLM work that already succeeded).

3. **Relationship detection is unobservable and unbounded.** The stage emits no progress between
   "Detecting relationships" and the final "Relationship detection found N accepted" — minutes of
   silence on a large schema. A supervisor watching for liveness cannot distinguish a
   slow-but-working profile from a true hang, and there is no internal time/work budget, so on a
   very large schema it can run far longer than any reasonable deadline.

## Requirements

1. **Checkpoint queryable artifacts before relationship detection.** Persist the descriptions +
   embeddings into the semantic-layer artifacts as soon as the `embeddings` stage completes, before
   the `relationships` stage runs. Relationship detection then appends/merges its own artifact on
   completion. Net: the expensive LLM + embedding enrichment is **always durable and queryable**,
   even if relationship detection fails, is interrupted, or is skipped. (A failed/partial
   relationship stage should degrade to "no/partial joins", never to "no descriptions".)

2. **Make stage resume work across runs.** Resolve a completed stage by stable content identity
   — `(connectionId, stage, inputHash)` — independent of `runId`, so re-running an interrupted
   ingest resumes the finished `descriptions`/`embeddings` stages from cache and only re-runs what
   actually failed (e.g. `relationships`). Re-running after an interruption must not re-spend LLM
   credits on stages that already succeeded.

3. **Make relationship detection observable and bounded** (mirrors spec 16's bounded query
   execution). Emit progress through the existing progress port — e.g. "Profiling table K/N",
   "Validating candidate K/M" — so liveness is visible. Enforce an overall time/work budget
   (configurable, e.g. under `scan.relationships`) so on a very large schema the stage stops
   gracefully and returns the relationships found so far (partial) rather than running unboundedly.
   Partial completion is persisted (per requirement 1) and marked as such.

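Requirement 2 comes down to which fields form the cache key. A minimal in-memory sketch follows, assuming hypothetical names (`StageCache`, `runStage`) and a synchronous `compute` for brevity; the real store is SQLite-backed and the real stages are asynchronous. The point it illustrates is that `runId` is simply absent from the key.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";

interface StageKey {
  connectionId: string;
  stage: Stage;
  inputHash: string;
}

// In-memory stand-in for the durable stage state store.
class StageCache<T> {
  private readonly completed = new Map<string, T>();

  private static keyOf(k: StageKey): string {
    // Stable content identity; note there is no runId here.
    return JSON.stringify([k.connectionId, k.stage, k.inputHash]);
  }

  find(k: StageKey): T | undefined {
    return this.completed.get(StageCache.keyOf(k));
  }

  save(k: StageKey, output: T): void {
    this.completed.set(StageCache.keyOf(k), output);
  }
}

// Runs a stage unless an identical one already completed in ANY earlier run.
function runStage<T>(cache: StageCache<T>, key: StageKey, compute: () => T): { output: T; cached: boolean } {
  const hit = cache.find(key);
  if (hit !== undefined) return { output: hit, cached: true };
  const output = compute();
  cache.save(key, output);
  return { output, cached: false };
}
```

A second ingest invocation with a fresh `runId` but the same connection, stage, and input hash would hit the cache and skip the paid work; changing the input hash still forces a recompute.
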
## Acceptance

- Interrupting an ingest **during** relationship detection still leaves a queryable semantic layer
  with the table/column descriptions + embeddings that were generated (verified: re-open the
  connection, descriptions are present).
- Re-running an interrupted ingest **does not** regenerate descriptions/embeddings whose stage
  already completed (verified: no LLM description calls for the cached tables; only the failed
  stage re-runs).
- A connection with hundreds of tables emits relationship-stage progress and completes within the
  configured budget, persisting partial relationships if the budget is hit — without discarding
  enrichment.
- Small/single-run ingests behave exactly as before (no regression in artifacts or relationship
  output when nothing is interrupted).

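The budget criterion above can be pictured as a loop that checks a deadline before each unit of work and reports progress as it goes. This is a sketch with hypothetical names (`profileWithinBudget`, `BudgetedResult`); the clock is injected so the behavior can be tested deterministically, and the per-table work is synchronous only to keep the example short.

```typescript
interface BudgetedResult<R> {
  results: R[];
  partial: boolean; // true when the budget ran out before all tables were processed
  processed: number;
}

function profileWithinBudget<T, R>(
  tables: T[],
  profile: (table: T) => R,
  budgetMs: number,
  onProgress: (message: string) => void,
  now: () => number = Date.now,
): BudgetedResult<R> {
  const deadline = now() + budgetMs;
  const results: R[] = [];
  for (let i = 0; i < tables.length; i++) {
    // Stop gracefully and keep what was found so far.
    if (now() >= deadline) return { results, partial: true, processed: i };
    onProgress(`Profiling table ${i + 1}/${tables.length}`);
    results.push(profile(tables[i] as T));
  }
  return { results, partial: false, processed: tables.length };
}
```

Returning the `partial` flag alongside the results lets the caller persist what exists and mark it as incomplete, rather than choosing between everything and nothing.
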
## Benchmark context (motivation only — do not encode benchmark specifics)

The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables (`ebi_chembl`
785, `fec` 486, `ga360` 366, …). Enriching them with claude-code costs real, rate-limited LLM
budget; losing that enrichment to a relationship-stage interruption — and re-spending it on every
retry — makes large-schema ingest impractical. This is a general durability/cost property of the
ingest pipeline, independent of the benchmark; the benchmark only made it acute at scale.

@ -1,101 +0,0 @@
# 20 — Resilient enrichment under a slow/hung LLM backend

**Status:** draft (intake). Requirement-level; the implementer refines into `specs/20-*.md`.

This is the **enrichment-stage** analog of two already-shipped specs:
- spec 16 (bounded query execution) — bound *and actually cancel* a runaway read query (child-thread/process kill, not a cosmetic JS deadline);
- spec 19 (durable/bounded relationship detection) — checkpoint expensive ingest work so an interruption doesn't lose it.

Spec 16 hardened the **read-query** path and spec 19 checkpointed at **stage boundaries**. The same two
weaknesses still exist *inside the descriptions enrichment stage*, and together they turned a single hung
table into an indefinite wedge plus total loss of an entire stage's LLM work.

## Problem / requirement

Two compounding gaps on the per-table description-enrichment path, observed end-to-end:

### 1. The per-table LLM timeout does not actually terminate the work

The per-table `generateObject` enrichment call is wrapped in `retryAsync` with a fresh
`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (ktx commit `01f63380`). When the LLM
backend is a **subprocess** (the `codex` backend spawns a child `codex` process; `claude-code` likewise
spawns a child) and that child **hangs with an open connection to the provider** (TCP ESTABLISHED, ~0%
CPU, no bytes flowing), the JS-level `AbortSignal` fires but **does not kill the child process or unblock
the await** — so the call sits *past* its own timeout indefinitely.

Observed (BigQuery ingest, codex backend, 2026-06-23): with `KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min),
two of `covid19_usa`'s widest tables (252 columns) hung; the stage sat at **268/285 for 41+ minutes** —
well past the 30-min per-attempt timeout — with exactly two codex children, each holding 3 ESTABLISHED
connections at ~0% CPU, until killed by hand. The timeout was cosmetic: it never terminated the hung
child. (This is precisely the failure mode spec 16 fixed for SQL — a deadline that fires in JS but cannot
interrupt the underlying work — applied to the enrichment LLM call instead of the query.)

**Requirement:** the per-table enrichment-call timeout must be **enforced**, not advisory — when it fires,
the in-flight work is actually cancelled (subprocess SIGKILL for process-backed providers; request abort
for HTTP-backed ones) and the call returns/throws *promptly* so the stage can proceed (skip the table per
the existing no-retry-on-timeout policy). A hung table must cost at most ~one timeout, never unbounded
wall-clock. Provider-agnostic: it must hold for `codex`, `claude-code`, and HTTP backends alike.

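For the process-backed case, "enforced" means the timer kills the child rather than only flipping a flag. The sketch below uses only Node's `child_process`; `runChildWithEnforcedTimeout` is a hypothetical helper, not how the `codex` or `claude-code` backends are wired, and a real provider integration would also need to handle stderr and partial output.

```typescript
import { spawn } from "node:child_process";

// Runs a command and guarantees the promise settles shortly after the timeout,
// because the child process itself is killed.
function runChildWithEnforcedTimeout(command: string, args: string[], timeoutMs: number): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdout = "";
    let timedOut = false;
    child.stdout.on("data", (chunk: Buffer) => {
      stdout += chunk.toString("utf8");
    });
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL"); // ends the process; an abort flag alone would leave it running
    }, timeoutMs);
    child.once("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.once("close", (code) => {
      clearTimeout(timer);
      if (timedOut) reject(new Error(`enrichment call exceeded ${timeoutMs}ms`));
      else if (code === 0) resolve(stdout);
      else reject(new Error(`child exited with code ${code}`));
    });
  });
}
```

With this shape a hung child costs roughly one timeout: the kill closes its pipes, the `close` event fires, and the caller can skip the table and move on.
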
### 2. Descriptions are checkpointed only at full-stage completion, so a few bad tables lose all the good ones

Spec 19 persists the descriptions checkpoint **after the descriptions stage completes** (before
relationships). There is no *within-stage* persistence: while the stage runs, every enriched table's
description lives only in memory. So if the stage cannot complete — e.g. 2 tables out of 285 hang (gap #1),
or the process is killed, or it hits the stall watchdog — **all** the already-enriched tables are lost,
even though their (expensive) LLM descriptions were finished.

Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but **0** rows in
`local_scan_enrichment_stages` and **0** `ai:` descriptions on disk; killing the wedged ingest discarded
all 283, forcing a from-scratch re-ingest. The cost of 2 pathological tables was 283 tables' worth of
redone LLM calls.

**Sharper observation (re-ingest with a short, enforced timeout):** even when the stage *does* run to
the end — the 2 hung tables hit a 4-min timeout and were skipped, so 283/285 descriptions were generated
and the ingest reported success (`Scan completed` / `Ingest finished`, embeddings built, exit 0) — the
descriptions were **still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the discard is **not** just
"lost on kill": a stage that completes with *any* skipped/aborted table currently persists **nothing**,
throwing away every successfully-generated description. The skip must be graceful — a skipped table costs
one missing description, not the entire stage's output. (This is the strongest argument for per-table
incremental persistence: the 283 good descriptions should have been durable the moment each was produced.)

**Requirement:** persist enriched descriptions **incrementally** (per-table or per-batch) during the
descriptions stage, so that (a) tables that finished are durable even if the stage never completes, and
(b) a resumed ingest re-does only the *unfinished* tables, not the whole stage. The existing additive-write
design (spec 19 already preserves existing descriptions on re-ingest) is the foundation; this extends the
checkpoint granularity from once-per-stage to incremental.

## Sketch (implementer to refine)

- **Enforced timeout:** route enrichment-call cancellation through real termination — kill the
  codex/claude-code child process on timeout (reuse spec 16's child-kill mechanism), abort the HTTP
  request for network backends. A fired `AbortSignal` must guarantee the await settles within a
  bounded grace period.
- **Sane default + the right tradeoff:** the default per-table timeout should be **moderate** (single-digit
  minutes) with a small retry count, not very large. Because the cost of a *hang* is the timeout value
  itself, a long timeout is strictly worse for hangs. (The 30-min value used in the incident was an operator
  override chosen to avoid cutting off slow-but-completing wide tables; with #1 enforced and incremental
  checkpointing, a moderate default + skip is the better operating point.)
- **Incremental persistence:** flush descriptions per-batch (e.g. every N completed tables or on a timer) to
  the same store/format used at stage completion; on resume, treat already-persisted tables as done and only
  enrich the remainder. Keep it idempotent and additive (don't clobber prior descriptions).
- **Interaction with the stall watchdog:** with #1 enforced, no single table can starve progress for longer
  than ~one timeout, so an external stall watchdog stops being the only backstop.

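The incremental-persistence bullet above can be illustrated with a small loop. Everything here is an assumption made for the example: `enrichIncrementally` is a hypothetical name, a `Map` stands in for the durable store, and `describe` stands in for the per-table LLM call (which may reject on an enforced timeout).

```typescript
// `persisted` plays the role of the durable store; a real one would write to disk.
async function enrichIncrementally(
  tables: string[],
  describe: (table: string) => Promise<string>,
  persisted: Map<string, string>,
  batchSize: number,
): Promise<{ enriched: string[]; skipped: string[] }> {
  const pending = new Map<string, string>();
  const enriched: string[] = [];
  const skipped: string[] = [];

  const flush = (): void => {
    pending.forEach((text, table) => {
      // Additive: never clobber a description that is already durable.
      if (!persisted.has(table)) persisted.set(table, text);
    });
    pending.clear();
  };

  for (const table of tables) {
    if (persisted.has(table)) continue; // resume: already done in an earlier run
    try {
      pending.set(table, await describe(table));
      enriched.push(table);
    } catch {
      skipped.push(table); // a failed or timed-out table costs one description only
    }
    if (pending.size >= batchSize) flush();
  }
  flush(); // persist the final partial batch even when some tables were skipped
  return { enriched, skipped };
}
```

Run twice against the same store, the second call only invokes `describe` for tables that were skipped or never reached, which is the resume behavior the requirement describes.
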
## Generic use case (independent of the benchmark)

Anyone ingesting a large or wide schema with an LLM enrichment backend (especially a *subprocess* backend,
which is the common local/desktop setup) will eventually hit a table whose description call hangs — a
provider stall, a rate-limit black-hole, a pathologically large prompt. Without an *enforced* timeout, one
such table wedges the whole ingest indefinitely; without *incremental* persistence, any interruption throws
away all the per-table LLM work already done (the dominant ingest cost). Both fixes make large-schema
enrichment **resilient and resumable** — a few bad tables degrade to a few skipped descriptions, not a
hung process and a from-scratch redo. This is core robustness for a general-purpose ingestion product,
wholly independent of any benchmark.

## Benchmark context (motivation only — not a benchmark-specific rule)

Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment backend). Re-enriching
the giant public datasets, `covid19_usa` wedged at 268/285 for 41+ minutes on 2 hung 252-column tables; the
30-min per-table `AbortSignal` timeout never killed the hung codex children, and because descriptions
checkpoint only at stage completion, the 283 already-enriched tables were unrecoverable — the operator had
to kill, cache-bust, and re-ingest the db from scratch (with a short timeout as a stopgap). The benchmark
just exercised a large/wide multi-dataset ingest at scale; the gap and the fix are generic.

@ -1,91 +0,0 @@
|
||||||
# 21 — Selective enrichment stages (`--stages`) + per-stage cache keys
|
|
||||||
|
|
||||||
**Status:** draft (intake). Requirement-level; the implementer refines into `specs/21-*.md`.
|
|
||||||
|
|
||||||
Follow-on to spec 19 (durable/resumable relationship detection) and spec 20 (resilient enrichment).
|
|
||||||
Those made enrichment *survivable and resumable*; this makes it *selectively re-runnable* — re-run one
|
|
||||||
enrichment stage without re-paying for the others.
|
|
||||||
|
|
||||||
## Problem / requirement
|
|
||||||
|
|
||||||
Enrichment has three stages — **`descriptions`** (per-table LLM text), **`embeddings`**
|
|
||||||
(sentence-transformers over the schema/descriptions), **`relationships`** (FK/join detection, optionally
|
|
||||||
LLM-proposed). Today you cannot re-run a *subset* of them, and three facts in the current code make a
|
|
||||||
targeted re-run impossible without a full, expensive re-enrich:
|
|
||||||
|
|
||||||
1. **One coarse cache key gates all three stages.** `context/scan/local-enrichment.ts:611` computes a
|
|
||||||
single `inputHash` from `{snapshot, mode, detectRelationships, providerIdentity, relationshipSettings}`,
|
|
||||||
and all three stages reuse it (descriptions ~`:641`, embeddings ~`:672`, relationships ~`:728`). So
|
|
||||||
changing *any* one stage's inputs invalidates *every* stage's cache. Concretely: flipping
|
|
||||||
`scan.relationships.llmProposals`, switching the LLM backend, or upgrading the embeddings model forces
|
|
||||||
ktx to re-run the **expensive per-table descriptions** even though they didn't conceptually change.
|
|
||||||
2. **No CLI surface to select stages.** The enrichment internally already supports a relationships-only
|
|
||||||
path (`mode: 'relationships'`, which skips the description/embedding stages — they're gated on
|
|
||||||
`mode === 'enriched'`), but `ktx ingest` exposes no flag to invoke it (only `--no-query-history`).
|
|
||||||
The capability is built; it's just not reachable.
|
|
||||||
3. **The per-stage storage already exists** (`local_scan_enrichment_stages` PK `(connection_id, stage,
|
|
||||||
input_hash)`) and the **additive write already preserves existing descriptions** on re-ingest — so the
|
|
||||||
foundation for "touch one stage, keep the rest" is in place; only the key granularity and the CLI
|
|
||||||
surface are missing.
|
|
||||||
|
|
||||||
**Requirement:** let an operator re-run a chosen subset of enrichment stages on already-ingested
|
|
||||||
connection(s), recomputing only those stages and **preserving the others' artifacts untouched** — cheaply,
|
|
||||||
without re-running unchanged stages (especially the costly `descriptions`).
|
|
||||||
|
|
||||||
## Design decisions (resolved during intake; implementer may refine)
|
|
||||||
|
|
||||||
- **CLI flag: `--stages <comma-list>`** (plural). Accepts a comma-separated subset of
|
|
||||||
`descriptions,embeddings,relationships`; default = all three (current behaviour). Plural because it takes
|
|
||||||
a *set*; `--stages relationships` and `--stages descriptions,embeddings` both read naturally, and the
|
|
||||||
plural signals "list expected" (singular `--stage` implies exactly one). **Validate** the names — an
|
|
||||||
unknown stage is an error, never silently ignored.
|
|
||||||
- **Per-stage `inputHash`.** Split the single coarse hash so each stage keys on *only its own* inputs:
|
|
||||||
- `descriptions` → `{snapshot, mode, providerIdentity}` (NOT relationship settings, NOT embedding model)
|
|
||||||
- `embeddings` → `{snapshot, embeddings model/provider, + the description text it embeds}`
|
|
||||||
- `relationships` → `{snapshot, relationshipSettings (incl. llmProposals), providerIdentity}`
|
|
||||||
Then flipping `llmProposals` invalidates only `relationships`; swapping the embeddings model invalidates
|
|
||||||
only `embeddings`; improving description prompts/LLM invalidates only `descriptions`.
|
|
||||||
- **Preserve-others semantics.** Stages not named in `--stages` are left exactly as on disk (additive write,
|
|
||||||
already the behaviour). A selective run never deletes another stage's artifacts.
|
|
||||||
- **Downstream-staleness handling.** Stages have a dependency order (`descriptions → embeddings`;
|
|
||||||
`relationships` depends only on the schema snapshot). Re-running `descriptions` alone can leave existing
|
|
||||||
`embeddings` semantically stale (they embedded the old text). The run must **warn** when a selected
|
|
||||||
re-run leaves an unselected downstream stage stale, and the operator can opt to cascade
|
|
||||||
(`--stages descriptions,embeddings`). Do not silently leave a stale-but-unflagged downstream.
|
|
||||||
- **`relationships` uses existing descriptions as context.** When re-running `relationships` only, the
|
|
||||||
stage should read the existing enriched schema (incl. on-disk `ai:` descriptions) so `llmProposals` has
|
|
||||||
full context — not just raw column names.
|
|
||||||
- **Scope:** the three enrichment stages for now. Design the stage-name namespace so it can later extend to
|
|
||||||
the broader scan phases (schema / query-history / source / memory) and subsume the inconsistent
|
|
||||||
`--no-query-history` negative flag, but that unification is out of scope here.
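The flag validation and downstream-staleness decisions above could look roughly like this. It is a sketch under stated assumptions: `parseStages` and `staleDownstream` are hypothetical helper names, rejecting an empty list is an assumption the spec does not state, and the staleness check is conservative (it flags `embeddings` whenever `descriptions` is selected without it, whether or not the text actually changed).

```typescript
const STAGES = ["descriptions", "embeddings", "relationships"] as const;
type Stage = (typeof STAGES)[number];

// Dependency order from the spec: descriptions -> embeddings.
// relationships depends only on the schema snapshot.
const DOWNSTREAM: Record<Stage, Stage[]> = {
  descriptions: ["embeddings"],
  embeddings: [],
  relationships: [],
};

// Parse the raw --stages value; undefined means "flag not given" => all stages.
function parseStages(flag: string | undefined): Stage[] {
  if (flag === undefined) return STAGES.slice();
  const names = flag
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const unknown = names.filter((n) => (STAGES as readonly string[]).indexOf(n) === -1);
  if (names.length === 0 || unknown.length > 0) {
    // An unknown stage is an error, never silently ignored.
    throw new Error(`invalid --stages value "${flag}" (valid: ${STAGES.join(", ")})`);
  }
  // De-duplicate and return in canonical stage order.
  return STAGES.filter((s) => names.indexOf(s) !== -1);
}

// Unselected stages that a selected upstream stage may leave stale.
function staleDownstream(selected: Stage[]): Stage[] {
  const stale: Stage[] = [];
  for (const stage of selected) {
    for (const dep of DOWNSTREAM[stage]) {
      if (selected.indexOf(dep) === -1 && stale.indexOf(dep) === -1) stale.push(dep);
    }
  }
  return stale;
}
```

A caller would warn for every stage returned by `staleDownstream` and suggest the cascading form, for example `--stages descriptions,embeddings`.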
|
|
||||||
|
|
||||||
## Sketch (implementer to refine)
|
|
||||||
|
|
||||||
- Add `--stages` to `ktx ingest`; parse+validate into a stage set; thread it to the enrichment entry so it
|
|
||||||
selects which stage blocks run (reuse the existing `mode`/stage gating — `mode: 'relationships'` is the
|
|
||||||
precedent).
|
|
||||||
- Replace the single `computeKtxScanEnrichmentInputHash` call with per-stage hash computation keyed on each
|
|
||||||
stage's own inputs; gate each stage's resume/skip on its own hash.
|
|
||||||
- Ensure selective runs read + preserve the on-disk enriched schema and write additively.
|
|
||||||
- Emit a clear staleness warning when an unselected downstream stage is invalidated by a selected one.
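A rough sketch of the per-stage hash idea from the design decisions, assuming a hypothetical `EnrichmentInputs` shape whose fields mirror the spec's bullet list. `stageInputHashes` and `hashOf` are illustrative names; the real `computeKtxScanEnrichmentInputHash` signature is not shown in this spec, so nothing here should be read as its API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape; field names follow the spec's bullets, not real ktx types.
interface EnrichmentInputs {
  snapshot: string; // e.g. a digest of the schema snapshot
  mode: string;
  providerIdentity: string; // LLM backend/model identity
  embeddingsModel: string;
  descriptionText: string; // the text the embeddings stage embeds
  relationshipSettings: { llmProposals: boolean };
}

type StageName = "descriptions" | "embeddings" | "relationships";

// Arrays serialize in a fixed order, so the digest is stable across runs.
function hashOf(parts: unknown[]): string {
  return createHash("sha256").update(JSON.stringify(parts)).digest("hex");
}

// Each stage keys on only its own inputs.
function stageInputHashes(i: EnrichmentInputs): Record<StageName, string> {
  return {
    descriptions: hashOf(["descriptions", i.snapshot, i.mode, i.providerIdentity]),
    embeddings: hashOf(["embeddings", i.snapshot, i.embeddingsModel, i.descriptionText]),
    relationships: hashOf([
      "relationships",
      i.snapshot,
      i.relationshipSettings.llmProposals,
      i.providerIdentity,
    ]),
  };
}
```

With keys split this way, flipping `llmProposals` changes only the `relationships` hash, so the other stages' cached rows still match on resume.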
|
|
||||||
|
|
||||||
## Generic use case (independent of the benchmark)
|
|
||||||
|
|
||||||
Any team running ktx in production maintains its semantic layer over time: they improve description prompts
|
|
||||||
or switch the description LLM, upgrade the embeddings model, or turn on LLM-proposed joins. Today each of
|
|
||||||
those forces a **full re-enrich of every connection** — re-running the expensive per-table descriptions
|
|
||||||
even when only embeddings or relationships changed. Selective `--stages` re-runs make these routine
|
|
||||||
maintenance operations cheap and targeted: "re-embed everything on the new model" or "backfill joins now
|
|
||||||
that llmProposals is on" become a single fast pass that leaves the untouched stages — and their cost —
|
|
||||||
alone. This is core operability for a long-lived ingestion product and is wholly independent of any
|
|
||||||
benchmark.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only — not a benchmark-specific rule)
|
|
||||||
|
|
||||||
Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A level-aware audit found (a) a
|
|
||||||
tail of BigQuery dbs with poor *column*-description coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …)
|
|
||||||
that want a **`descriptions`-only** re-run with a longer timeout, and (b) a desire to **backfill joins**
|
|
||||||
across all already-ingested dbs after enabling `llmProposals` — without re-paying for descriptions. Both
|
|
||||||
were blocked by the coarse single `inputHash` (flipping `llmProposals` or re-describing would invalidate
|
|
||||||
the whole enrichment) and the absence of a stage-selective CLI flag. The benchmark just exercised
|
|
||||||
large-scale multi-backend ingestion; the gap and the fix are generic.
|
|
||||||
|
|
@ -1,300 +0,0 @@
|
||||||
# Connection-scoped wiki pages
|
|
||||||
|
|
||||||
> Refined spec. Intake draft: `todo/01-connection-scoped-wiki.md`.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
Wiki pages have only two scopes today: `GLOBAL` and `USER`
|
|
||||||
(`packages/cli/src/context/wiki/types.ts`, `WikiScope`). Scope is expressed by
|
|
||||||
directory (`wiki/global/<key>.md`, `wiki/user/<userId>/<key>.md`) and the
|
|
||||||
search path filters by loading only the in-scope pages before any lane runs.
|
|
||||||
There is no way to associate a page with a **connection** (a warehouse/database
|
|
||||||
defined under `connections:` in `ktx.yaml`).
|
|
||||||
|
|
||||||
In a project with many connections this causes two distinct failures:
|
|
||||||
|
|
||||||
1. **Cross-database relevance pollution.** All pages share one search index, so
|
|
||||||
`wiki_search` for a generic term (`orders`, `revenue`, `average order
|
|
||||||
value`) surfaces pages written about the wrong database. Concept names
|
|
||||||
collide across databases constantly in real multi-connection projects
|
|
||||||
(several databases each with `orders`, `customers`, …).
|
|
||||||
2. **Silent overwrite on shared keys.** Page keys are a flat, global namespace.
|
|
||||||
The write path resolves a repeated key to the existing file and updates it
|
|
||||||
in place. So if the agent writes an `orders` page while ingesting database B
|
|
||||||
and an `orders` page already exists for database A, B's content **overwrites
|
|
||||||
A's** — same-concept pages for different databases cannot coexist today.
|
|
||||||
|
|
||||||
Today, when `memory_ingest` is called with a `connectionId`, that id only
|
|
||||||
scopes which semantic-layer sources the triage agent can see
|
|
||||||
(`memory-agent.service.ts`); it is **not** persisted on the resulting wiki page
|
|
||||||
and **not** validated against `ktx.yaml`.
|
|
||||||
|
|
||||||
## Generic use case
|
|
||||||
|
|
||||||
Any org with multiple databases/warehouses in one **ktx** project: org-wide
|
|
||||||
definitions ("fiscal year starts in February") should be visible everywhere,
|
|
||||||
while database-specific conventions ("in the events DB, `user_id` is the
|
|
||||||
anonymous device id, not the account id") should not pollute searches about
|
|
||||||
other databases — and two databases that both have an `orders` concept must be
|
|
||||||
able to keep separate, non-colliding pages.
|
|
||||||
|
|
||||||
## Model
|
|
||||||
|
|
||||||
`connections` is **additive frontmatter metadata**, orthogonal to the existing
|
|
||||||
`GLOBAL`/`USER` directory scope — not a third scope dimension:
|
|
||||||
|
|
||||||
- A page is still `GLOBAL` or `USER` and lives where it lives today. It may
|
|
||||||
**additionally** carry a `connections` list.
|
|
||||||
- **Page keys remain a flat, globally-unique namespace.** `connections` does
|
|
||||||
**not** namespace keys; a page is addressable by key alone, unchanged.
|
|
||||||
- A page may list **multiple** connections.
|
|
||||||
- **Absent or empty `connections` ⇒ unscoped: the page applies to all
|
|
||||||
connections.** This is exactly today's behavior, so every existing page is
|
|
||||||
unaffected.
|
|
||||||
|
|
||||||
This keeps `wiki_read` and refs untouched and adds no parallel scope axis;
|
|
||||||
filtering by connection is purely a search/relevance concern.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
### 1. Frontmatter field
|
|
||||||
|
|
||||||
Add an optional `connections` field to wiki page frontmatter — a list of
|
|
||||||
connection ids.
|
|
||||||
|
|
||||||
- Accept a single string too; normalize to a list at parse time (reuse the
|
|
||||||
existing array-coercion helper used for `tags`/`refs`/`sl_refs`).
|
|
||||||
- Round-trips through parse/serialize without loss.
|
|
||||||
- Absent or empty ⇒ unscoped (see Model). Existing pages are unaffected by
|
|
||||||
construction.
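The normalization in requirement 1 might be sketched like this. `toConnectionList` is a hypothetical helper name, and trimming and de-duplicating ids are assumptions of this sketch that the requirement does not state.

```typescript
// Normalize a raw frontmatter value to a list of connection ids.
// A single string becomes a one-element list; absent or invalid becomes [].
function toConnectionList(raw: unknown): string[] {
  const items: unknown[] = typeof raw === "string" ? [raw] : Array.isArray(raw) ? raw : [];
  const out: string[] = [];
  for (const item of items) {
    if (typeof item !== "string") continue; // ignore non-string entries
    const id = item.trim();
    if (id.length > 0 && out.indexOf(id) === -1) out.push(id);
  }
  return out;
}

// An empty list means "unscoped": the page applies to all connections.
function isUnscoped(connections: string[]): boolean {
  return connections.length === 0;
}
```

Because an absent field normalizes to the empty list, existing pages come out unscoped without any migration.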
|
|
||||||
|
|
||||||
### 2. Page identity and key distinctness
|
|
||||||
|
|
||||||
`connections` does not change how pages are identified or addressed:
|
|
||||||
|
|
||||||
- Keys stay flat and globally unique; `wiki_read(key)` is unchanged.
|
|
||||||
- Because the write path updates a page in place when its key already exists,
|
|
||||||
same-concept pages for different connections **MUST** use distinct keys
|
|
||||||
(e.g. `orders_sales_db` vs `orders_events_db`). Connection-distinctive keys
|
|
||||||
for database-specific pages are the primary mechanism (driven by write-path
|
|
||||||
prompt guidance, requirement 5).
|
|
||||||
- **Data-loss guard (code, not prompt):** a connection-scoped write whose key
|
|
||||||
matches an existing page whose `connections` scope is **disjoint** from the
|
|
||||||
incoming scope MUST surface a collision instead of silently overwriting the
|
|
||||||
existing page. (Updating a page within the same connection scope, or
|
|
||||||
broadening/narrowing its own `connections`, is a normal update — not a
|
|
||||||
collision.) The implementer owns whether the collision is a hard error or a
|
|
||||||
suffixed new key; it must not be a silent clobber.
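One possible shape for the data-loss guard, as a sketch only: `decideWrite` and `WriteDecision` are hypothetical names, and treating an unscoped side (existing or incoming) as a normal update is this sketch's reading of the "broadening/narrowing" allowance above.

```typescript
type WriteDecision = "create" | "update" | "collision";

// existing: the connections of the page already stored under this key,
//           or undefined when no page has the key yet.
// incoming: the connections supplied with the write.
function decideWrite(existing: string[] | undefined, incoming: string[]): WriteDecision {
  if (existing === undefined) return "create";
  // If either side is unscoped, the write broadens or narrows scope: a normal update.
  if (existing.length === 0 || incoming.length === 0) return "update";
  // Both sides are scoped: any shared id means the write stays within scope.
  const overlaps = incoming.some((id) => existing.indexOf(id) !== -1);
  // Disjoint scopes would clobber another connection's page, so surface it.
  return overlaps ? "update" : "collision";
}
```

Whether a `"collision"` becomes a hard error or a suffixed key is left to the caller, matching the latitude the requirement gives the implementer.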
|
|
||||||
|
|
||||||
### 3. Search filtering
|
|
||||||
|
|
||||||
Add an optional connection filter to the search surfaces:
|
|
||||||
|
|
||||||
- **MCP:** `wiki_search(query, connectionId?)` (`context-tools.ts`).
|
|
||||||
- **CLI:** `ktx wiki search` and `ktx wiki list` accept `--connection <id>`
|
|
||||||
(with `-c` alias), matching the `ktx sql` connection flag.
|
|
||||||
|
|
||||||
Semantics:
|
|
||||||
|
|
||||||
- With `connectionId: X` ⇒ return pages whose `connections` is empty
|
|
||||||
(unscoped) **∪** pages whose `connections` contains X.
|
|
||||||
- Without ⇒ current behavior, all pages.
|
|
||||||
- The filter **MUST** apply uniformly to **all three search lanes** (lexical
|
|
||||||
FTS5, semantic/embedding, token fallback) at the **candidate-source level**,
|
|
||||||
so each lane draws its full candidate pool from the already-scoped set. It
|
|
||||||
**MUST NOT** be a post-filter on the merged/ranked results — that would let
|
|
||||||
off-scope candidates consume both the per-lane pool and the final result
|
|
||||||
limit unevenly.
|
|
||||||
|
|
||||||
*Orientation:* the existing `GLOBAL`/`USER` scoping already filters at the
|
|
||||||
disk-load step that feeds both the in-memory token lane and the synced SQLite
|
|
||||||
index (`local-knowledge.ts`); the connection filter fits the same seam.
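To illustrate why the filter belongs at the candidate-source level, here is a toy sketch with a single substring-matching lane. `Page`, `inConnectionScope`, `lexicalLane`, `search`, and `searchPostFiltered` are all hypothetical names; the real lanes (FTS5, embeddings, token fallback) are far richer than this.

```typescript
interface Page {
  key: string;
  body: string;
  connections: string[];
}

// Unscoped pages match every connection; scoped pages must list the id.
function inConnectionScope(page: Page, connectionId?: string): boolean {
  if (connectionId === undefined) return true;
  return page.connections.length === 0 || page.connections.indexOf(connectionId) !== -1;
}

// Toy stand-in for one search lane: substring match, capped at `limit`.
function lexicalLane(candidates: Page[], query: string, limit: number): string[] {
  return candidates
    .filter((p) => p.body.indexOf(query) !== -1)
    .slice(0, limit)
    .map((p) => p.key);
}

// Required behavior: scope the candidate pool first, then run the lane.
function search(pages: Page[], query: string, limit: number, connectionId?: string): string[] {
  const scoped = pages.filter((p) => inConnectionScope(p, connectionId));
  return lexicalLane(scoped, query, limit);
}

// Anti-pattern shown for contrast: off-scope pages consume the limit before filtering.
function searchPostFiltered(pages: Page[], query: string, limit: number, connectionId?: string): string[] {
  const hits = lexicalLane(pages, query, limit);
  return hits.filter((key) => {
    const page = pages.filter((p) => p.key === key)[0];
    return inConnectionScope(page, connectionId);
  });
}
```

With a limit of 1 and an off-scope page ranked first, `search` still returns the in-scope page while `searchPostFiltered` returns nothing, which is the uneven consumption the requirement forbids.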
|
|
||||||
|
|
||||||
### 4. Index persistence
|
|
||||||
|
|
||||||
The `.ktx/db.sqlite` knowledge index is re-synced from files on every search.
|
|
||||||
The implementer owns whether to persist `connections` as index columns / a side
|
|
||||||
table, or to filter the loaded page-set before the per-search sync. The binding
|
|
||||||
requirement is the uniform-across-lanes behavior in requirement 3 — not a
|
|
||||||
specific schema.
|
|
||||||
|
|
||||||
*Trade-off note (non-binding):* filtering the loaded page-set re-syncs only the
|
|
||||||
scoped subset and gives up a little embedding-cache reuse when searches
|
|
||||||
alternate between connections (recompute is one embedding per scoped page per
|
|
||||||
connection switch — negligible at the scale this targets). Persisting
|
|
||||||
`connections` in the index avoids that at the cost of a schema addition and a
|
|
||||||
per-lane predicate. Either is acceptable.
|
|
||||||
|
|
||||||
### 5. Write path
|
|
||||||
|
|
||||||
- The memory agent's page-write tool (`wiki-write.tool.ts`) accepts a
|
|
||||||
`connections` input field with the same REPLACE semantics as
|
|
||||||
`tags`/`refs`/`sl_refs`: omit ⇒ keep existing on update; `[]` ⇒ clear to
|
|
||||||
unscoped; `[ids]` ⇒ set.
|
|
||||||
- When `memory_ingest` / the memory agent runs with a `connectionId`, prompt
|
|
||||||
guidance directs the agent to:
|
|
||||||
- set `connections: [connectionId]` on new **database-specific** pages, using
|
|
||||||
connection-distinctive keys; and
|
|
||||||
- leave `connections` empty for clearly **org-wide** content.
|
|
||||||
- This is **prompt guidance, not a code auto-default.** A connection-scoped
|
|
||||||
ingest must remain able to produce unscoped org-wide pages, so the tool must
|
|
||||||
not force the session's `connectionId` onto every page.
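The REPLACE semantics for the `connections` input reduce to a small three-way rule. The sketch below uses a hypothetical `resolveConnections` helper and deliberately does not consult the session's `connectionId`, mirroring the "no code auto-default" rule above.

```typescript
// existing: connections currently stored on the page ([] when unscoped or new).
// input: the tool's `connections` field; undefined means the field was omitted.
function resolveConnections(existing: string[], input: string[] | undefined): string[] {
  if (input === undefined) return existing.slice(); // omit => keep existing on update
  return input.slice(); // [] => clear to unscoped; [ids] => set
}
```

Note that the session connection never appears as a parameter, so a connection-scoped ingest can still produce an unscoped org-wide page.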
|
|
||||||
|
|
||||||
### 6. `wiki_read` and refs unchanged
|
|
||||||
|
|
||||||
Pages remain addressable by key regardless of scoping. `wiki_read`, `refs`, and
|
|
||||||
`sl_refs` semantics are unchanged; `connections` is a search/relevance concern
|
|
||||||
only.
|
|
||||||
|
|
||||||
### 7. Validation
|
|
||||||
|
|
||||||
Validation behavior splits by surface, because an explicit argument is a
|
|
||||||
typo-prone input while persisted content drifts independently of config:
|
|
||||||
|
|
||||||
- **Explicit argument** — a connection id supplied as a command/tool argument
|
|
||||||
(`wiki_search`/`memory_ingest` `connectionId`, `ktx wiki … --connection`)
|
|
||||||
MUST be validated against `ktx.yaml` connections and **rejected with a clear
|
|
||||||
error listing the configured ids** when unknown. Reuse the canonical
|
|
||||||
`project.config.connections[id]` check. This also closes the current gap
|
|
||||||
where `memory_ingest`'s `connectionId` is accepted unvalidated.
|
|
||||||
- **Persisted frontmatter** — a connection id that appears only in a stored
|
|
||||||
page's `connections` and is not in `ktx.yaml` MUST **warn (not fail)** during
|
|
||||||
validation/doctor, and MUST NOT break loading, searching, or reading that
|
|
||||||
page. Config and content can evolve independently.
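The two validation behaviors can be contrasted in a short sketch. `assertKnownConnection` and `persistedConnectionWarnings` are hypothetical names, and passing the configured ids as a plain string list is an assumption; the real check goes through `project.config.connections[id]`.

```typescript
interface ScopedPage {
  key: string;
  connections: string[];
}

// Explicit argument: an unknown id is rejected with the configured ids listed.
function assertKnownConnection(configured: string[], id: string): void {
  if (configured.indexOf(id) === -1) {
    throw new Error(`Unknown connection "${id}". Configured connections: ${configured.join(", ")}`);
  }
}

// Persisted frontmatter: unknown ids only produce warnings and never throw.
function persistedConnectionWarnings(configured: string[], pages: ScopedPage[]): string[] {
  const warnings: string[] = [];
  for (const page of pages) {
    for (const id of page.connections) {
      if (configured.indexOf(id) === -1) {
        warnings.push(`page "${page.key}" references unknown connection "${id}"`);
      }
    }
  }
  return warnings;
}
```

The asymmetry is the point: a typo in a command argument fails fast, while a stale id in stored content stays loadable and is only reported.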
|
|
||||||
|
|
||||||
### 8. Scope boundary
|
|
||||||
|
|
||||||
This spec delivers the **mechanism** (frontmatter storage + uniform filter +
|
|
||||||
write surface + validation). Driving the agent to actually pass `connectionId`
|
|
||||||
during analytics work is the concern of
|
|
||||||
`03-multi-connection-routing-in-analytics-skill`. It composes with the
|
|
||||||
`--connection-id` flag on `ktx ingest` (written as `--connection` in the intake draft) from `02-verbatim-ingest-mode`.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- A page with `connections: [db_a]` is returned by
|
|
||||||
`wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but
|
|
||||||
**not** by `wiki_search(query, connectionId: "db_b")`.
|
|
||||||
- A page with no `connections` field is returned in all three cases above.
|
|
||||||
- Two pages — `orders_sales_db` (`connections: [sales_db]`) and
|
|
||||||
`orders_events_db` (`connections: [events_db]`) — coexist; a search scoped to
|
|
||||||
`sales_db` returns the first and not the second, and neither overwrote the
|
|
||||||
other on write.
|
|
||||||
- A connection-scoped write whose key matches an existing page scoped to a
|
|
||||||
**different** connection surfaces a collision instead of silently
|
|
||||||
overwriting (data-loss guard, requirement 2).
|
|
||||||
- Filtering works in each lane independently (test with embeddings disabled to
|
|
||||||
exercise the lexical and token lanes alone).
|
|
||||||
- `memory_ingest(content, connectionId)` produces a page scoped to that
|
|
||||||
connection for database-specific content.
|
|
||||||
- `wiki_search`/`ktx wiki search --connection <unknown>` fails with an error
|
|
||||||
that lists the configured connection ids.
|
|
||||||
- A page whose `connections` references an id absent from `ktx.yaml` produces a
|
|
||||||
warning but stays searchable and readable; search and read do not throw.
|
|
||||||
- `connections` accepts a single string and a list, both normalized to a list.
|
|
||||||
- Existing projects with no scoped pages and no `connectionId`/`--connection`
|
|
||||||
behave identically before/after.
|
|
||||||
|
|
||||||
## Implementation orientation
|
|
||||||
|
|
||||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
|
||||||
the design.
|
|
||||||
|
|
||||||
- **Frontmatter type + parse/serialize:** `wiki/types.ts` (`WikiFrontmatter`),
|
|
||||||
`wiki/knowledge-wiki.service.ts` (`parsePage`/`serializePage`), array
|
|
||||||
coercion `wiki/local-knowledge.ts` (`stringArray`).
|
|
||||||
- **Search lanes + per-search re-sync:** `wiki/local-knowledge.ts`
|
|
||||||
(`searchLocalKnowledgePagesWithSqlite`; the disk-load step that already
|
|
||||||
scopes `GLOBAL`/`USER`; token lane), `wiki/sqlite-knowledge-index.ts`
|
|
||||||
(FTS5 `knowledge_pages_fts` lexical lane, semantic scan, `sync`).
|
|
||||||
- **MCP surface:** `mcp/context-tools.ts` (`wiki_search`, `wiki_read`,
|
|
||||||
`memory_ingest`; `connectionId` already present on `memory_ingest` but
|
|
||||||
unvalidated).
|
|
||||||
- **CLI surface:** `commands/knowledge-commands.ts`
|
|
||||||
(`ktx wiki search`/`list`/`read`); canonical `--connection` flag in
|
|
||||||
`commands/sql-commands.ts`; validation pattern
|
|
||||||
`project.config.connections[id]` in `mcp/local-project-ports.ts`.
|
|
||||||
- **Write path:** `wiki/tools/wiki-write.tool.ts` (input schema, REPLACE
|
|
||||||
semantics, scope decision), `memory/memory-agent.service.ts` (`connectionId`
|
|
||||||
threaded through the capture session and tool session;
|
|
||||||
`external_ingest` forces `GLOBAL` scope).
|
|
||||||
- **Connection config:** `context/project/config.ts` (`connections` record in
|
|
||||||
`ktx.yaml`).
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
Spider 2.0-Lite local subset = one project with ~30 SQLite connections whose
|
|
||||||
schemas share table/concept names (Northwind, sakila, two e-commerce DBs…).
|
|
||||||
External-knowledge docs (RFM definition, F1 overtake rules) are each relevant
|
|
||||||
to exactly one database and must not surface for the other 29.
|
|
||||||
|
|
||||||
## Implementation notes
|
|
||||||
|
|
||||||
Shipped on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
|
|
||||||
acceptance criteria covered; full package suite green (2924 passing),
|
|
||||||
type-check, knip/biome dead-code, and pre-commit clean.
|
|
||||||
|
|
||||||
**What was built / where**
|
|
||||||
|
|
||||||
1. **Frontmatter field (req 1).** `connections?: string[]` added to
|
|
||||||
`WikiFrontmatter` (`context/wiki/types.ts`) and to the file-layer page model
|
|
||||||
`LocalKnowledgePage` (`context/wiki/local-knowledge.ts`). Parsed via a new
|
|
||||||
`stringList()` coercion (single string → list); round-trips through both
|
|
||||||
serializers. Absent/empty ⇒ unscoped.
|
|
||||||
2. **Search/list filter (req 3, req 4).** `connectionId?` threaded through
|
|
||||||
`searchLocalKnowledgePages` → both the sqlite-FTS and scan impls →
|
|
||||||
`loadAllKnowledgePages`, and through `listLocalKnowledgePages`. The filter is
|
|
||||||
applied at the **disk-load seam** (`pageMatchesConnection`: unscoped ∪ pages
|
|
||||||
listing the id), so the token lane and the per-search SQLite sync (lexical +
|
|
||||||
semantic) both draw their candidate pool from the already-scoped set —
|
|
||||||
candidate-source level, not a post-filter.
|
|
||||||
- Chose req 4 **option B (filter the loaded page-set)** over persisting a
|
|
||||||
column. Verified-safe here: standalone ktx's memory agent reads pages from
|
|
||||||
files via a no-op `LocalKnowledgeIndex`, so `.ktx/db.sqlite`'s
|
|
||||||
`knowledge_pages` is a per-search cache that `searchLocalKnowledgePages`
|
|
||||||
rebuilds every call — scoping the sync corrupts no shared state. Only cost
|
|
||||||
is one embedding recompute per scoped page on a connection switch (the
|
|
||||||
spec's acknowledged, negligible trade-off). No index-schema change.
|
|
||||||
3. **Page identity + data-loss guard (req 2).** Keys stay flat/global;
|
|
||||||
`wiki_read`/refs unchanged. The write tool (`wiki/tools/wiki-write.tool.ts`)
|
|
||||||
rejects (hard error, no silent clobber) a connection-scoped write whose
|
|
||||||
incoming `connections` is **disjoint** from a same-key existing page's
|
|
||||||
non-empty `connections`, suggesting a connection-distinctive key. Same-scope,
|
|
||||||
overlapping, broaden/narrow, and unscoped-existing updates are allowed.
|
|
||||||
Chose a hard error over auto-suffixing so the conflict reaches the agent
|
|
||||||
(the decision-maker) instead of silently forking the key namespace.
|
|
||||||
4. **Write path (req 5).** `wiki_write` accepts `connections` (string or list)
|
|
||||||
with REPLACE semantics (omit ⇒ keep, `[]` ⇒ unscoped, `[ids]` ⇒ set); no
|
|
||||||
code auto-default of the session connection. Prompt guidance added to the
|
|
||||||
shared `wiki_capture` skill (new "Connection scoping" section) and the
|
|
||||||
`memory_agent_external_ingest` prompt. The session `connectionId` is now
|
|
||||||
surfaced to the agent so the guidance is actionable: in the memory-agent
|
|
||||||
prompt header and in the ingest work-unit `<context>` block
|
|
||||||
(`build-wu-context.ts`, fed from `ingest-bundle.runner.ts`).
|
|
||||||
5. **Validation (req 7).** New shared helper
|
|
||||||
`context/connections/configured-connections.ts → assertConfiguredConnectionId`
|
|
||||||
validates explicit connection-id arguments against `ktx.yaml` and throws an
|
|
||||||
error listing the configured ids. Routed from all three explicit-arg
|
|
||||||
surfaces: MCP `wiki_search` (`local-project-ports.ts`), MCP `memory_ingest`
|
|
||||||
(validated at the boundary in `mcp-server-factory.ts` — this also closes the
|
|
||||||
prior gap where `memory_ingest`'s `connectionId` was accepted unvalidated),
|
|
||||||
and CLI `ktx wiki --connection`/`-c` (`commands/knowledge-commands.ts` +
|
|
||||||
`knowledge.ts`). Persisted-frontmatter ids absent from config are **warn-only**:
|
|
||||||
`listReferencedConnectionIds` + a non-fatal `ktx status` warning
|
|
||||||
(`status-project.ts`); loading/searching/reading never throw on them.
|
|
||||||
|
|
||||||
**Deviations / notes**
|
|
||||||
|
|
||||||
- Req 1 says "reuse the existing array-coercion helper used for `tags`/`refs`".
|
|
||||||
That helper (`stringArray`) is array-only and does **not** coerce a single
|
|
||||||
string; added a dedicated `stringList` for `connections` to meet the
|
|
||||||
single-string acceptance criterion rather than change `stringArray`'s
|
|
||||||
behavior for the other fields.
|
|
||||||
- **Scope boundary kept:** `discover_data` (MCP) also searches wiki and already
|
|
||||||
takes `connectionId`, but req 3/8 scope the filter to `wiki_search` + CLI, so
|
|
||||||
its wiki lane is intentionally left unscoped. Worth a follow-up if
|
|
||||||
`discover_data`'s wiki results should also be connection-scoped for
|
|
||||||
consistency.
|
|
||||||
- MCP tools-list snapshot and the `mcp-server-factory` test were updated for the
|
|
||||||
new `wiki_search.connectionId` param and the `memory_ingest` validation
|
|
||||||
wrapper (the port is no longer the raw service object; it delegates).
|
|
||||||
|
|
@ -1,327 +0,0 @@
|
||||||
# Verbatim ingest mode for authoritative documents
|
|
||||||
|
|
||||||
> Refined spec. Intake draft: `todo/02-verbatim-ingest-mode.md`.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
`ktx ingest --text/--file` routes captured content through the memory agent.
|
|
||||||
`runKtxTextIngest` (`packages/cli/src/text-ingest.ts`) builds a
|
|
||||||
`MemoryAgentInput` with `sourceType: 'external_ingest'` and hands it to
|
|
||||||
`MemoryAgentService.ingest` (`context/memory/memory-agent.service.ts`), which
|
|
||||||
runs a multi-step LLM triage loop (≈30-step budget, content clipped to ~48k
|
|
||||||
chars) inside a session worktree. The agent decides — via the `wiki_write`
|
|
||||||
tool — what to persist, so it may **rewrite, condense, split, or re-title** the
|
|
||||||
content before it lands as a wiki page. The body is produced by an LLM, not
|
|
||||||
copied by code.
|
|
||||||
|
|
||||||
For *authoritative* documents — formula definitions, metric specs, runbooks,
|
|
||||||
compliance text — paraphrasing is a defect, not a feature:
|
|
||||||
|
|
||||||
- exact thresholds, constants, and rule wording must survive unchanged;
|
|
||||||
- lexical (BM25/FTS5) search works best when the stored text matches the
|
|
||||||
phrasing users and agents query with;
|
|
||||||
- ingestion should be deterministic and reproducible — the same input file
|
|
||||||
yields the same page, and re-running is safe.
|
|
||||||
|
|
||||||
Two further gaps block authoritative ingest today:
|
|
||||||
|
|
||||||
- The memory agent hard-requires an LLM backend
|
|
||||||
(`context/memory/local-memory.ts` throws when `llm.provider.backend: none`
|
|
||||||
and no runner is injected), so there is **no** offline ingest path at all.
|
|
||||||
- The agent's write tool *merges* a repeated same-scope key in place (REPLACE
|
|
||||||
frontmatter semantics in `wiki/tools/wiki-write.tool.ts`), i.e. exactly the
|
|
||||||
silent in-place rewrite an authoritative-document workflow must avoid.
|
|
||||||
|
|
||||||
## Generic use case
|
|
||||||
|
|
||||||
Any team ingesting documents that are already the source of truth: metric
|
|
||||||
definition sheets, SLA documents, calculation-methodology docs, regulatory
|
|
||||||
text. The user wants **ktx** to *index and surface* the document, not to
|
|
||||||
re-author it. Today they work around the memory agent by hand-writing
|
|
||||||
frontmatter and copying files into `wiki/global/`; verbatim mode makes that a
|
|
||||||
first-class, supported `ktx ingest` workflow.
|
|
||||||
|
|
||||||
## Model
|
|
||||||
|
|
||||||
`ktx ingest --verbatim` is a **distinct, code-driven ingest path**, not a
|
|
||||||
constrained prompt over the existing agent loop. Its defining invariants:
|
|
||||||
|
|
||||||
- **The stored page body is the input document body, written by code.** The LLM
|
|
||||||
never produces, edits, or relays the body. It is confined to generating
|
|
||||||
*metadata* about the body.
|
|
||||||
- **Behavior follows from inputs, not from a mode prompt.** Whether metadata is
|
|
||||||
LLM-generated or derived offline follows from the configured backend
|
|
||||||
(`llm.provider.backend`), not from a second user-facing switch.
|
|
||||||
- **Pages are `GLOBAL`-scoped.** Verbatim ingest targets org/project
|
|
||||||
authoritative docs (the content teams copy into `wiki/global/` today).
|
|
||||||
Connection association is expressed by the **additive `connections`
|
|
||||||
frontmatter** from spec 01, never by directory.
|
|
||||||
- **Deterministic and idempotent.** The page key, the merged frontmatter, and
|
|
||||||
the stored body are all functions of the input alone (given a fixed backend),
|
|
||||||
so the same input produces the same page and a re-run is a safe no-op.
|
|
||||||
|
|
||||||
### "Byte-for-byte" scope
|
|
||||||
|
|
||||||
The guarantee is on the document's **interior**: no paraphrase, no condense, no
|
|
||||||
split, no re-title, no reflow, **no clipping**. The shared wiki store
|
|
||||||
canonicalizes *surrounding* whitespace — `parsePage` trims the body and
|
|
||||||
`serializePage` emits a single trailing newline
|
|
||||||
(`wiki/knowledge-wiki.service.ts`) — so leading/trailing blank lines are
|
|
||||||
normalized by the storage layer. Verbatim mode **MUST** write through that
|
|
||||||
shared `writePage`/`serializePage` path rather than fork a parallel serializer;
|
|
||||||
the interior bytes (thresholds, constants, wording) are what must be preserved
|
|
||||||
exactly, and they are. Acceptance hashes compare the stored body against the
|
|
||||||
**trimmed** input body.
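The acceptance check described above could be expressed as follows. This is a sketch under assumptions: `storeBody` and `readBody` are hypothetical stand-ins for the store's trim-and-single-newline behavior, and "trim" is taken to mean JavaScript's `String.prototype.trim`, which the spec does not pin down.

```typescript
import { createHash } from "node:crypto";

function sha256(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Assumed storage-layer behavior: surrounding whitespace is canonicalized,
// the interior is untouched, and a single trailing newline is emitted.
function storeBody(input: string): string {
  return input.trim() + "\n";
}

function readBody(storedFile: string): string {
  return storedFile.trim();
}

// Acceptance check: the stored body hashes equal to the trimmed input body.
function bodyPreserved(input: string, storedFile: string): boolean {
  return sha256(readBody(storedFile)) === sha256(input.trim());
}
```

Any interior change, such as a rewritten threshold, alters the digest, while leading or trailing blank lines do not.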

## Requirements

### 1. Flag

`ktx ingest --file <path> --verbatim` and
`ktx ingest --text <content> --verbatim`. `--verbatim` is a boolean that
applies to every `--file`/`--text` item in the invocation; each item becomes
its own page.

- It composes with the existing `--connection-id <id>` flag
  (`commands/ingest-commands.ts`) so the resulting page can be
  connection-scoped (see spec 01). **Note:** the intake draft wrote
  `--connection`; the shipped flag is `--connection-id`. Use `--connection-id`.
- No new `--key` flag (see requirement 4). No second behavioral switch beyond
  `--verbatim` itself.

### 2. Body preservation is enforced by code, not by prompt

The stored page body is the input content (interior preserved exactly, per
**Model → "Byte-for-byte" scope**).

- Verbatim mode **MUST NOT** route the body through the memory-agent LLM loop
  or any `wiki_write` tool call where a model could alter it.
- The LLM, when used, generates **only** metadata: `summary`, `tags`, and
  `sl_refs`. A single constrained structured-output call (AI SDK v6
  `generateObject` with a `zod` schema) is the intended mechanism — the full
  memory-agent loop, worktree, and squash-merge are **not** required and should
  not be used.
- The page key is **not** LLM-generated (requirement 4).

### 3. No clipping of the stored body

The ~48k clip may apply only to the text **sent to the LLM** for metadata
generation. It **MUST NOT** apply to the text **written** to the page. A
document larger than the clip limit is stored in full; only its metadata is
derived from the clipped prefix.
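
The separation above can be sketched as follows. This is only an illustration: `splitForMetadata` and the 48,000-character constant are hypothetical (the spec says only "~48k"), and nothing here is taken from the ktx codebase.

```typescript
// Hypothetical limit; the spec gives only an approximate "~48k".
const METADATA_CLIP_CHARS = 48_000;

interface VerbatimTexts {
  /** Text written to the page: always the full body. */
  storedBody: string;
  /** Text sent to the LLM for metadata: at most the clipped prefix. */
  llmInput: string;
}

// Clipping is applied to the metadata input only, never to the stored body.
function splitForMetadata(
  body: string,
  clip: number = METADATA_CLIP_CHARS,
): VerbatimTexts {
  return { storedBody: body, llmInput: body.slice(0, clip) };
}
```

Keeping the two strings in one return value makes it harder for a caller to write the clipped text to the page by accident.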

### 4. Deterministic page key

The key is derived from the input, never chosen by the LLM (an LLM-chosen slug
would break determinism and the requirement-7 idempotency guarantee):

- **`--file <path>`** → `suggestFlatWikiKey(basename without extension)`
  (`wiki/keys.ts`). This is the primary document case and is always
  deterministic.
- **`--text <content>`** → if the content opens with a Markdown heading, the
  key is `suggestFlatWikiKey(heading text)`. If there is no leading heading,
  **hard error**: inline verbatim text needs a leading heading to derive a
  stable key, or should be passed as `--file`.
- No hash-based keys (unfindable) and no `--key` override flag. A real need for
  explicit key control can add `--key` later.
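
A rough sketch of these derivation rules, under stated assumptions: `slugifyKey` is a hypothetical stand-in for `suggestFlatWikiKey` (whose exact normalization is not described here), and `VerbatimItem`, `deriveKey`, and the other helper names are illustrative only.

```typescript
type VerbatimItem =
  | { origin: "file"; path: string; content: string }
  | { origin: "text"; content: string };

// Hypothetical stand-in for suggestFlatWikiKey; the real rules may differ.
function slugifyKey(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function basenameWithoutExtension(path: string): string {
  const base = path.split(/[\\/]/).pop() ?? path;
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(0, dot) : base;
}

// Returns the heading text if the first non-blank line is an ATX heading.
function leadingHeadingText(content: string): string | undefined {
  const firstLine = content.trimStart().split(/\r?\n/, 1)[0] ?? "";
  const match = /^#{1,6}\s+(.+?)\s*#*\s*$/.exec(firstLine);
  return match?.[1];
}

// The key depends only on the input, so the same input yields the same key.
function deriveKey(item: VerbatimItem): string {
  if (item.origin === "file") {
    return slugifyKey(basenameWithoutExtension(item.path));
  }
  const heading = leadingHeadingText(item.content);
  if (heading === undefined) {
    throw new Error(
      "inline verbatim text needs a leading Markdown heading, or pass it as --file",
    );
  }
  return slugifyKey(heading);
}
```

Because no LLM output or hash enters the key, re-running the same command targets the same page, which is what the collision rule in requirement 7 relies on.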

### 5. Frontmatter: passthrough + gap-fill

If the input has its own YAML frontmatter, split it from the body: the body is
everything after the closing `---`; the frontmatter is authoritative metadata.

- **Passthrough.** Every input frontmatter field is preserved in the stored
  page, **including fields not in `WikiFrontmatter`** (`effective_date`,
  `version`, `owner`, …). The serializer `YAML.stringify`s the object, so
  unknown keys round-trip. Dropping them would be silent data loss on
  authoritative docs.
- **Gap-fill only.** Generated/derived metadata fills **absent** fields only;
  it **MUST NOT** overwrite an explicit value. An input `summary:` is never
  replaced by a generated one; explicit `tags`/`sl_refs` are likewise kept.
- **Defaults.** `usage_mode` defaults to `auto` (findable via search, not
  force-injected) when the input does not set it.
- **Connection scoping.** `--connection-id X` (validated via
  `assertConfiguredConnectionId`, `context/connections/configured-connections.ts`)
  sets `connections: [X]` when the input frontmatter does not already declare
  `connections`. If the input frontmatter declares a **different**
  `connections` than the flag, **hard error** (ambiguous intent) rather than
  silently choosing one. If they match, or only one source is present, proceed.
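
The merge rules above might look roughly like this. It is a sketch, not the shipped `buildVerbatimFrontmatter`: `mergeFrontmatter` and its types are hypothetical, and it assumes that "different" means anything other than exactly `[X]`, which the spec does not spell out.

```typescript
type Frontmatter = Record<string, unknown>;

interface GeneratedMetadata {
  summary?: string;
  tags?: string[];
  sl_refs?: string[];
}

function mergeFrontmatter(
  input: Frontmatter,
  generated: GeneratedMetadata,
  connectionId?: string,
): Frontmatter {
  // Passthrough: start from every input field, known or unknown.
  const merged: Frontmatter = { ...input };

  // Gap-fill: generated values land only where the input has no value.
  for (const [key, value] of Object.entries(generated)) {
    if (merged[key] === undefined && value !== undefined) {
      merged[key] = value;
    }
  }

  // Default usage_mode when the input does not set it.
  if (merged["usage_mode"] === undefined) {
    merged["usage_mode"] = "auto";
  }

  // Connection scoping: the flag fills a gap, but never overrides a
  // conflicting declaration (assumption: only exactly [X] counts as a match).
  if (connectionId !== undefined) {
    const declared = input["connections"];
    if (declared === undefined) {
      merged["connections"] = [connectionId];
    } else {
      const matches =
        Array.isArray(declared) &&
        declared.length === 1 &&
        String(declared[0]) === connectionId;
      if (!matches) {
        throw new Error(
          `--connection-id "${connectionId}" conflicts with frontmatter connections`,
        );
      }
    }
  }
  return merged;
}
```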

### 6. Degraded mode (`llm.provider.backend: none`)

`--verbatim` **MUST** work with no LLM backend — this is a capability the
regular agent ingest lacks.

- `summary` is derived from the leading Markdown heading text, or, if none, the
  first non-empty sentence of the body (trimmed to a reasonable length).
- `tags` and `sl_refs` are left empty.
- The body is still stored in full (requirement 3 applies unchanged).
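
One way the offline derivation could work is sketched below. The function names and the 200-character cap are assumptions (the spec says only "a reasonable length"), and the sentence splitter is deliberately naive; this is not the shipped `deriveDegradedSummary`.

```typescript
// Hypothetical cap; the spec only asks for "a reasonable length".
const MAX_SUMMARY_CHARS = 200;

interface DegradedMetadata {
  summary: string;
  tags: string[];
  sl_refs: string[];
}

// Heading text if the first non-blank line is an ATX heading.
function headingOf(body: string): string | undefined {
  const firstLine = body.trimStart().split(/\r?\n/, 1)[0] ?? "";
  return /^#{1,6}\s+(.+?)\s*#*\s*$/.exec(firstLine)?.[1];
}

// Naive first sentence: text up to the first ., ! or ? followed by
// whitespace or end of input, so "0.5" does not end a sentence.
function firstSentence(body: string): string {
  const flat = body.replace(/\s+/g, " ").trim();
  return /^(.+?[.!?])(?=\s|$)/.exec(flat)?.[1] ?? flat;
}

function deriveMetadataOffline(body: string): DegradedMetadata {
  const summary = headingOf(body) ?? firstSentence(body);
  return {
    summary: summary.slice(0, MAX_SUMMARY_CHARS),
    tags: [], // left empty in degraded mode
    sl_refs: [], // left empty in degraded mode
  };
}
```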

### 7. Key collisions: idempotent-if-identical, else hard error

Verbatim mode does **not** reuse the agent write tool's in-place merge. Before
writing, read any existing `GLOBAL` page at the derived key:

- **No existing page** → write.
- **Existing page, stored body identical** to the new body (compared after the
  storage-layer normalization in **Model**) → **idempotent no-op success**
  (re-running the same file is safe).
- **Existing page, body differs** → **hard error** naming the conflicting key
  and directing the user to a distinct key. Never a silent overwrite, never an
  auto-suffixed second page (which would produce the duplicated/divergent pages
  this mode must avoid).
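
The three cases can be expressed as a small decision function. This is a sketch: `decideWrite` and `CollisionDecision` are hypothetical names, and `trim()` stands in for the storage-layer normalization described in **Model**.

```typescript
type CollisionDecision =
  | { action: "write" }
  | { action: "noop" }
  | { action: "error"; message: string };

// existingBody is undefined when no GLOBAL page exists at the derived key.
function decideWrite(
  key: string,
  existingBody: string | undefined,
  newBody: string,
): CollisionDecision {
  if (existingBody === undefined) {
    return { action: "write" };
  }
  // Compare after the same surrounding-whitespace normalization the store
  // applies, so a re-run of the same file is a no-op.
  if (existingBody.trim() === newBody.trim()) {
    return { action: "noop" };
  }
  return {
    action: "error",
    message: `page "${key}" already exists with different content; use a distinct key`,
  };
}
```

Returning a decision rather than writing directly keeps the rule testable without touching the wiki store.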

### 8. LLM-failure handling

When a backend **is** configured but the metadata call fails (rate limit,
transport error, malformed output after retries), **fail the item** (honoring
`--fail-fast` and the per-item exit-code aggregation in `text-ingest.ts`).
**MUST NOT** silently fall back to degraded derivation: a degraded page written
on a transient error would, under requirement 7, refuse to be replaced by a
healthy re-run — breaking reproducibility. Degraded derivation is reserved for
`backend: none`.

### 9. Findability

After write, the page is reindexed so search returns it:

- `wiki_search` for a phrase taken from the document body returns the page via
  the lexical lane (the body is indexed in `buildKnowledgeSearchText`).
- `wiki_search` for a paraphrase of the document's topic returns it via the
  semantic lane **when embeddings are enabled** (this is what the generated
  `summary`/`tags` buy over a bare degraded page).

## Acceptance criteria

- Ingesting a file with `--verbatim` produces a page whose body is
  byte-identical to the trimmed input body (assert with a hash in tests).
- A >48k-char file is stored in full (assert stored body length ≥ input length
  minus trim).
- Running the same `--verbatim` ingest twice is idempotent: one page, identical
  bytes both times, no error on the second run.
- A second ingest to the same derived key with **different** body content fails
  loudly (requirement 7) and does not modify the existing page or create a
  suffixed one.
- Input frontmatter with an unknown field (e.g. `effective_date`) is preserved
  in the stored page; an explicit input `summary` is **not** overwritten by a
  generated one.
- With `llm.provider.backend: none`, `--verbatim` still produces a page: full
  body stored, `summary` derived from the heading/first sentence, `tags` and
  `sl_refs` empty.
- `--verbatim --connection-id X` yields a page with `connections: [X]`; an
  unknown id is rejected with an error listing the configured ids. (Depends on
  spec 01, now shipped.)
- `--verbatim --connection-id X` where the input frontmatter already declares a
  different `connections` fails with an ambiguity error.
- `ktx ingest --text "no heading here" --verbatim` errors asking for a leading
  heading or `--file`.
- `wiki_search` for a body phrase returns the page (lexical lane); for a topic
  paraphrase it returns the page when embeddings are enabled (semantic lane).

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
module layout and design, subject to the invariants above.

- **Command flag:** `commands/ingest-commands.ts` (`ktx ingest` option table;
  `--text`/`--file`/`--connection-id`/`--fail-fast` already present — add
  `--verbatim` and thread it into `KtxTextIngestArgs`).
- **Orchestration:** `text-ingest.ts` (`runKtxTextIngest`, `loadItems`,
  `validateItems`, per-item loop and exit-code aggregation). The verbatim flow
  reuses item loading and replaces the `memoryIngest.ingest(...)` call with a
  code-driven write for `--verbatim` items. Keep the new logic in a focused
  module (e.g. a `verbatim-ingest` sibling) rather than swelling `text-ingest`.
- **Frontmatter split / write / serialize:** `wiki/knowledge-wiki.service.ts`
  (`parsePage` for the `---…---` split shape, `serializePage`, `writePage`,
  `readPage` for the collision check). Write through this shared path — do not
  re-implement YAML framing.
- **Key derivation:** `wiki/keys.ts` (`suggestFlatWikiKey`, `assertFlatWikiKey`).
- **Frontmatter type:** `wiki/types.ts` (`WikiFrontmatter`; `summary` and
  `usage_mode` are the required fields; unknown passthrough fields live
  alongside).
- **Connection validation:** `context/connections/configured-connections.ts`
  (`assertConfiguredConnectionId`, shipped with spec 01).
- **Metadata LLM call:** the local LLM runtime/config resolution in
  `context/llm/` (e.g. `local-config.ts`; `backend: none` ⇒ no runtime). Use a
  single `generateObject` call with a `zod` metadata schema; the `ai-sdk` skill
  covers v6 patterns.
- **Reindex / search lanes:** `wiki/local-knowledge.ts`
  (`loadAllKnowledgePages`, `buildKnowledgeSearchText`, the
  lexical/token/semantic lanes) and `wiki/sqlite-knowledge-index.ts` (`sync`).
- **Tests:** extend `packages/cli/test/text-ingest.test.ts` and add a
  verbatim-focused test file covering the acceptance criteria above.

## Benchmark context (motivation only)

Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket
definitions, the haversine formula, F1 overtake rules, …). Gold SQL was
authored against their **exact** text; an LLM paraphrase that drops a bucket
boundary or rounds a constant loses the corresponding question. The current
workaround is hand-writing frontmatter and copying files into `wiki/global/`.
Verbatim mode turns that manual step into a supported **ktx** workflow, and
composes with the connection scoping from spec 01 so a doc relevant to exactly
one of the benchmark's ~30 SQLite databases does not surface for the other 29.

## Implementation notes

Shipped on branch `write-feature-spec-wiki`. All acceptance criteria are covered
by tests and verified end-to-end through the linked `ktx-dev` binary.

**What was built**

- New module `packages/cli/src/verbatim-ingest.ts`: `createLocalProjectVerbatimIngestor`
  + `LocalVerbatimIngestor`, plus the pure helpers `splitInputDocument`,
  `deriveVerbatimPageKey`, `deriveDegradedSummary`, and `buildVerbatimFrontmatter`
  (the last four are `@internal` exports for unit testing).
- `--verbatim` flag added to `ktx ingest` in `commands/ingest-commands.ts`, with a
  guard that rejects `--verbatim` without `--text`/`--file`. The flag is threaded
  into `KtxTextIngestArgs.verbatim`.
- `text-ingest.ts` now tags each loaded item with an `origin`
  (`file` / `text` / `stdin`) and, when `verbatim` is set, constructs the verbatim
  ingestor once and branches the per-item loop to a code-driven write instead of
  `memoryIngest.ingest(...)`. The shared view, exit-code aggregation, and
  `--fail-fast` handling are reused.

**Deviations from the literal spec (design refinements, per "implementer owns the design")**

- *Metadata call.* The spec suggested raw AI SDK v6 `generateObject`. The
  implementation routes through the existing `KtxLlmRuntimePort.generateObject`
  instead — it is implemented by all three backends (ai-sdk, claude-code, codex),
  and the ai-sdk one already wraps `generateText` + `Output.object({schema})`.
  This realizes the spec's "single constrained structured-output call" intent via
  the canonical cross-backend path rather than forking a second LLM entry point.
- *Reindex (requirement 9).* In the standalone CLI, `searchLocalKnowledgePages`
  rebuilds the SQLite index from disk on every call (recomputing embeddings for
  changed pages), so a written page is findable without a dedicated reindex step.
  The write still goes through the shared `KnowledgeWikiService.writePage` +
  `syncSinglePage` path, so the page is also eagerly indexed.
- *Gap-fill optimization.* The LLM is skipped entirely when the input frontmatter
  already supplies `summary`, `tags`, and `sl_refs` (generated metadata only fills
  absent fields, so there is nothing to generate). A fully specified document thus
  ingests with a configured backend without any LLM call.

**Tests**

- `packages/cli/test/verbatim-ingest.test.ts` — helper units + ingestor integration
  against a real `initKtxProject` git repo (byte-identical body hash, >48k no-clip,
  idempotency, conflict hard-error, frontmatter passthrough, explicit-summary
  preservation, degraded mode, connection scoping + unknown-id rejection +
  ambiguity error, no-heading inline error, LLM gap-fill, LLM-failure-fails-item,
  lexical + semantic findability).
- `packages/cli/test/text-ingest.test.ts` — verbatim routing, origin tagging,
  connection-id forwarding, fail-fast.
- `packages/cli/test/index.test.ts` — `--verbatim` flag threading and the
  requires-`--text`/`--file` guard.

**Docs**

- `docs-site/content/docs/cli-reference/ktx-ingest.mdx` (flag, "Verbatim ingest"
  section, examples, common errors) and
  `docs-site/content/docs/guides/writing-context.mdx` (authoritative-document
  workflow).

**Verification**

- Full CLI suite: 2959 passed, 1 skipped. `pnpm run build` and `pnpm run dead-code`
  (Biome + Knip default + production) clean; pre-commit clean on changed files.
  A pre-existing, unrelated type error in `test/mcp-server-factory.test.ts` is
  untouched — it predates this work.

@@ -1,361 +0,0 @@
# Schema scan tolerates individual objects that fail introspection

> Refined spec. Intake draft: `todo/06-scan-tolerate-broken-objects.md`.

## Problem

A single broken or inaccessible object zeroes out an entire connection's
context. Schema introspection iterates objects with no per-object error
handling, so one throw aborts the whole scan, the live-database adapter's
`fetch()` rejects, and the connection ends with **no semantic layer at all** —
even when every other object was healthy.

The failure surfaces in two phases, and the contract must hold in both:

- **Metadata read (sqlite).** `connectors/sqlite/connector.ts` does
  `rawTables.map((t) => this.readTable(...))` (≈ line 171) with no try/catch.
  `readTable` runs `PRAGMA table_info(<object>)`, which *executes* a view's
  body to resolve its columns — so a view over a dropped/renamed column (the
  `oracle_sql` case: `emp_hire_periods_with_name` selecting `ehp.start_date`
  from a base table that has no such column) raises
  `no such column: ehp.start_date` and aborts introspection of all ~48 healthy
  objects.
- **Profiling read (warehouse drivers).**
  postgres/mysql/clickhouse/sqlserver/bigquery/snowflake read metadata in bulk
  from catalog / `information_schema` (a broken view rarely breaks that), then
  fail when a per-object profiling or sampling `SELECT` runs against a broken
  object. Enrichment sampling is *already* isolated
  (`description-generation.ts` wraps `sampleTable` in try/catch →
  `sampling_failed`), but mandatory introspection-phase reads are not uniformly
  isolated across drivers.

A second, related defect blocks the documented escape hatch. Setting
`enabled_tables: ["main.customers"]` on a sqlite connection produces a
different hard failure —
`Adapter "database schema" did not recognize fetched source output`. Root
cause: the sqlite connector emits every object as `{ db: null }` and filters
the scope with `scopedTableNames(scope, { db: null })`
(`context/scan/table-ref.ts` ≈ line 47, `if (ref.db !== wantDb) continue`), but
`"main.customers"` parses to `{ db: "main", name: "customers" }`
(`context/scan/enabled-tables.ts`, `parseDottedTableEntry`). `"main" !== null`,
so the entry matches **nothing**, zero table files are written, and
`detectLiveDatabaseStagedDir` (`stage.ts` ≈ line 138) returns false, tripping
the generic "did not recognize fetched source output" error at
`context/ingest/local-stage-ingest.ts` (≈ line 291). The bare form
`enabled_tables: ["customers"]` would have worked; the `main.`-qualified form
silently matches nothing.

## Generic use case

Real warehouses routinely contain broken or inaccessible objects: views over
dropped/renamed columns, views referencing tables the connection role can't
read, permission-denied tables, and vendor system views that error on read.
**ktx** should ingest everything it *can* and skip what it can't, so one bad
object never zeroes out an entire connection's context. This is baseline
production robustness, independent of any benchmark — the same tolerance a
33-warehouse fleet needs the first time one of its databases has a stale view.

## Design

The unit of failure is **one object** (table or view). Introspecting or
profiling an object is an operation that can fail independently; a failure skips
that object, records a recoverable warning, and the scan continues from the
objects that succeeded.

Because seven Node connectors and the Python daemon each introspect differently
(sqlite reads metadata per-object via `PRAGMA`; warehouse drivers read metadata
in bulk and fail per-object during profiling), the **semantics** of "skip /
warn / total-failure" are defined **once** and every connector routes through
them — rather than seven copies of the same try/catch that drift apart:

- A shared per-object helper in the `scan/` layer — the sibling of the existing
  `tryConstraintQuery` (`context/scan/constraint-discovery.ts`) — wraps a single
  object read and returns `{ ok: true, table } | { ok: false, warning }`, with a
  standard warning code (e.g. `object_introspection_failed`).
- A shared post-check enforces the total-failure rule (R3) uniformly.
- Each connector keeps its **natural** shape: sqlite routes each `readTable`
  through the helper; bulk-read drivers route their per-object profiling reads
  through it. The contract is uniform; the loop is not forced to be.
- The Python daemon implements the **same contract** in its own helper, adds a
  `warnings` field to `DatabaseIntrospectionResponse`, and the Node adapter maps
  those warnings into `KtxSchemaSnapshot` (`daemon-introspection.ts`).
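
The shared helper described in the first bullet could look something like the sketch below. It is an illustration under assumptions: `ScanWarning` is a simplified stand-in for `KtxScanWarning`, `tryIntrospectObject` and `introspectAll` are hypothetical names, and the reads are synchronous here although real connector reads are likely asynchronous.

```typescript
// Simplified stand-in for KtxScanWarning; the real shape has more fields.
interface ScanWarning {
  code: "object_introspection_failed";
  table: string;
  message: string;
  recoverable: true;
}

type ObjectResult<T> =
  | { ok: true; table: T }
  | { ok: false; warning: ScanWarning };

// Wraps one object read so a throw becomes a warning instead of aborting.
function tryIntrospectObject<T>(name: string, read: () => T): ObjectResult<T> {
  try {
    return { ok: true, table: read() };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return {
      ok: false,
      warning: {
        code: "object_introspection_failed",
        table: name,
        message,
        recoverable: true,
      },
    };
  }
}

// Per-object loop: healthy objects are kept, failures are collected.
function introspectAll<T>(
  names: string[],
  read: (name: string) => T,
): { tables: T[]; warnings: ScanWarning[] } {
  const tables: T[] = [];
  const warnings: ScanWarning[] = [];
  for (const name of names) {
    const result = tryIntrospectObject(name, () => read(name));
    if (result.ok) {
      tables.push(result.table);
    } else {
      warnings.push(result.warning);
    }
  }
  return { tables, warnings };
}
```

A connector that reads metadata in bulk would call only `tryIntrospectObject` around its per-object profiling read and keep its own loop.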

The warning channel already exists end to end on the Node side
(`KtxSchemaSnapshot.warnings`, the `KtxScanWarning` shape with
`table`/`column`/`recoverable`, the `KtxScanWarningCode` enum, and the staged
`warnings.json` artifact written by `writeLiveDatabaseSnapshot`); sqlite simply
never populates it. This spec makes that channel carry object-skip warnings and
surfaces them in the ingest summary, the persisted report body, and
`ktx status`.

## Requirements

### R1 — Per-object isolation (the contract)

If introspecting or profiling one object throws, the scan **MUST** skip that
object, record a `KtxScanWarning` (object name, the error message, and any
schema/catalog qualifier; `recoverable: true`), and continue with the remaining
objects. No single object may abort the scan.

- The contract holds in **both** phases: the mandatory metadata read *and* any
  profiling/row-count/sample read performed during introspection.
- It holds for **all seven Node connectors**
  (`packages/cli/src/connectors/<driver>/`) and the **Python daemon** postgres
  path (R6).
- The semantics are defined once (the shared helper + warning code from the
  Design section) and every connector routes through them. Do not inline a
  divergent per-driver copy.
- Warnings **MUST NOT** carry secrets or full SQL bodies; record the object
  identifier and the database's error text, redacted through the existing
  `redactKtxSensitiveMetadata` path that `warnings.json` already uses.

### R2 — Surface, don't hide

Skipped objects **MUST** be reported both at ingest time and in the durable
status view:

- **Ingest summary.** The `ktx ingest` run summary (human-facing output) reports
  a count plus the object name and a short reason for each skip — e.g.
  `Skipped 1 object — emp_hire_periods_with_name: no such column ehp.start_date`.
- **Run report.** Object skips land in the run report's `warnings.json` artifact
  (already written) and in the persisted report body (`IngestReportBody`), whose
  natural home is the existing `fetch?: SourceFetchReport` field — the fetch
  phase *is* introspection.
- **`ktx status`.** `ktx status` shows a per-connection skipped-objects line for
  the connection's latest ingest — e.g.
  `oracle_sql: 1 object skipped — emp_hire_periods_with_name: no such column ehp.start_date`.
  This is **derived from the latest persisted report, not new persisted
  state**: the report body is already stored whole as a JSON blob
  (`local_ingest_reports.body_json`), so surfacing it requires **no
  `.ktx/db.sqlite` schema migration** — `status` reads and renders the skip
  info already present in the latest report body. A connection whose latest
  ingest skipped nothing shows no such line.

### R3 — Failure semantics (partial vs total)

Per-object skipping is **unconditional** — there is **no new config knob**, and
the existing `ingest.workUnits.failureMode` (which governs the later LLM
work-unit stage, not introspection) is untouched and orthogonal. Outcomes are
derived from object counts, not from a mode:

| Scope | Objects discovered / matched | Introspection outcome | Result |
| --- | --- | --- | --- |
| none | 0 | n/a (legitimately empty DB) | **success**, empty layer |
| none | N > 0 | ≥ 1 succeeds | **success** + warnings for the rest |
| none | N > 0 | all N fail | **connection failure** (clear error) |
| `enabled_tables` | matches 0 objects | n/a | **clear scope error** (R5) |
| `enabled_tables` | matches M > 0 | ≥ 1 succeeds | **success** + warnings |
| `enabled_tables` | matches M > 0 | all M fail | **connection failure** |

- "Connection failure" means the connector / `fetch()` raises a **clear,
  actionable error** for that connection. It **MUST NOT** surface as the generic
  `did not recognize fetched source output` (that message is reserved for a
  genuinely unrecognized staged dir, not an empty/total-failure result).
- A total failure of one connection follows existing per-connection ingest
  orchestration for whether sibling connections continue; this spec does not
  change cross-connection behavior.
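
The table reduces to a function of three inputs, sketched below. `classifyScan` and the `ScanOutcome` labels are hypothetical names chosen for this illustration; the spec only fixes the outcomes themselves.

```typescript
type ScanOutcome =
  | "success" // includes the legitimately empty database
  | "success_with_warnings"
  | "connection_failure"
  | "scope_error";

interface ScanCounts {
  /** True when a non-empty enabled_tables allowlist is configured. */
  scoped: boolean;
  /** Objects discovered (unscoped) or matched by the allowlist (scoped). */
  matched: number;
  /** Objects whose introspection succeeded. */
  succeeded: number;
}

// Outcomes follow from counts alone; there is no mode switch.
function classifyScan({ scoped, matched, succeeded }: ScanCounts): ScanOutcome {
  if (matched === 0) {
    // An allowlist that matches nothing is a scope error (R5);
    // an unscoped scan that finds nothing is an empty database.
    return scoped ? "scope_error" : "success";
  }
  if (succeeded === 0) {
    return "connection_failure";
  }
  return succeeded < matched ? "success_with_warnings" : "success";
}
```

For instance, the `oracle_sql` case from the Problem section (one broken view among roughly 48 healthy objects, no allowlist) classifies as `success_with_warnings`.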

### R4 — A broken view never blocks base tables

A broken view **MUST NEVER** prevent base-table ingest.

- View introspection failures are isolated exactly like any other object (R1).
- Mandatory introspection **MUST** prefer reading an object's structure from the
  catalog where possible over executing the object's body, and **MUST NOT** run
  a data-reading query (row count, sample) against a view as a required step.
  (sqlite already skips `COUNT(*)` for views; the remaining gap is isolating the
  metadata read that executes the view definition.)

### R5 — `enabled_tables` allowlist works

The documented allowlist escape hatch **MUST** reliably restrict the scan to the
listed objects, with no spurious adapter error:

- **sqlite qualification.** The schema-qualified form `"main.<name>"` **MUST**
  resolve to the same object as the bare form `"<name>"` (sqlite's sole schema
  is `main`; the connector emits `db: null`). Both forms select the object;
  neither silently matches nothing.
- **Documented format.** The accepted qualification forms for each driver
  (`catalog.db.name` / `db.name` / `name`) and the sqlite-specific `main`
  equivalence **MUST** be documented where `enabled_tables` is described
  (`context/project/driver-schemas.ts` and the user-facing config docs).
- **Zero-match is a clear error.** A non-empty `enabled_tables` that resolves to
  **zero** matched objects **MUST** fail with an actionable error naming the
  connection, the unmatched entries, and the available object names — **not** the
  generic `did not recognize fetched source output`. This is distinct from a
  legitimately empty database (R3 row 1) and from a matched-but-all-broken scope
  (R3 last row).
- **Any subset works.** An `enabled_tables` matching M > 0 objects ingests
  **exactly** those M objects (minus any that fail per R1), with no adapter
  recognition error regardless of how small or edge-case the set is.
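
A sketch of the sqlite side of these rules follows. It assumes simplified, hypothetical stand-ins (`parseEntry`, `normalizeSqliteRef`, `resolveSqliteScope`) for `parseDottedTableEntry` and `scopedTableNames`, handles only the `name` and `db.name` forms, and does not reflect how the real resolver is structured.

```typescript
interface TableRef {
  db: string | null;
  name: string;
}

// Hypothetical parser: "main.customers" -> { db: "main", name: "customers" }.
function parseEntry(entry: string): TableRef {
  const dot = entry.lastIndexOf(".");
  return dot === -1
    ? { db: null, name: entry }
    : { db: entry.slice(0, dot), name: entry.slice(dot + 1) };
}

// sqlite's sole schema is "main", and the connector emits db: null,
// so the qualified form is folded onto the bare form before matching.
function normalizeSqliteRef(ref: TableRef): TableRef {
  return ref.db === "main" ? { db: null, name: ref.name } : ref;
}

function resolveSqliteScope(
  connectionId: string,
  enabledTables: string[],
  available: string[],
): string[] {
  const matched = new Set<string>();
  const unmatched: string[] = [];
  for (const entry of enabledTables) {
    const ref = normalizeSqliteRef(parseEntry(entry));
    if (ref.db === null && available.includes(ref.name)) {
      matched.add(ref.name);
    } else {
      unmatched.push(entry);
    }
  }
  // Zero-match on a non-empty allowlist is an actionable scope error.
  if (enabledTables.length > 0 && matched.size === 0) {
    throw new Error(
      `connection "${connectionId}": enabled_tables matched no objects ` +
        `(unmatched: ${unmatched.join(", ")}; available: ${available.join(", ")})`,
    );
  }
  return Array.from(matched);
}
```

Normalizing before the comparison is one option; the real fix could equally teach the filter to treat `main` and `null` as equal for sqlite.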

### R6 — Python daemon parity

The daemon's postgres introspection path **MUST** honor the same contract:

- Add a `warnings` field to `DatabaseIntrospectionResponse`
  (`python/ktx-daemon/src/ktx_daemon/database_introspection.py`) carrying the
  same shape Node expects (code, message, object identifier, recoverable).
- Isolate per-object failures in the daemon's introspection so one broken object
  does not abort the response; apply the R3 total-failure rule there too.
- Map daemon warnings into `KtxSchemaSnapshot.warnings` in
  `mapDaemonSnapshot` (`context/ingest/adapters/live-database/daemon-introspection.ts`),
  which currently drops them.
- The Node and Python warning shapes **MUST** stay in parity (the codebase
  already mirrors Node↔Python schemas for telemetry; follow the same discipline
  so the daemon cannot emit a code Node can't render).

## Acceptance criteria

- Ingesting a sqlite DB with one broken view + N healthy tables yields a
  semantic layer for the N healthy tables and **exactly one** warning naming the
  broken view and its error; exit is **success**.
- The skipped object appears in the `ktx ingest` summary output, in the run's
  `warnings.json`, and in `ktx status` as a per-connection skipped-objects line
  on the connection's latest ingest.
- A sqlite DB in which **every** discovered object fails introspection (and the
  file opens) exits as a **connection failure** with a clear error — not an
  empty "success" and not `did not recognize fetched source output`.
- A genuinely empty sqlite DB (zero objects) exits **success** with an empty
  layer (not a failure).
- `enabled_tables: ["main.customers"]` and `enabled_tables: ["customers"]` both
  ingest exactly the `customers` object on a sqlite connection.
- `enabled_tables` restricted to a valid subset of M objects ingests exactly
  that subset, with **no** adapter-output error.
- `enabled_tables` that matches zero objects fails with an error naming the
  connection, the unmatched entries, and available objects — distinguishable
  from the empty-DB and all-broken cases.
- A broken view does not prevent ingest of base tables in the same connection
  (regression test with a view that errors on read alongside a healthy table).
- The daemon's `DatabaseIntrospectionResponse` carries a `warnings` array, and a
  per-object failure in the daemon path produces a warning mapped into
  `KtxSchemaSnapshot.warnings` (Node↔Python parity test).
- A warehouse-driver object whose profiling/sample read fails is skipped with a
  warning and does not abort introspection of its siblings.
- Existing healthy-only ingests (no broken objects, no `enabled_tables`) behave
  identically before/after — no warnings, same semantic layer.
|
|
||||||

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
the design.

- **Shared semantics:** `context/scan/constraint-discovery.ts`
(`tryConstraintQuery` / `constraintDiscoveryWarning` — the precedent to mirror
for the per-object helper), `context/scan/types.ts`
(`KtxSchemaSnapshot.warnings`, `KtxScanWarning`, `KtxScanWarningCode` — add the
new object-skip code here).
- **Node connectors:** `packages/cli/src/connectors/<driver>/connector.ts` and
each `live-database-introspection.ts`. sqlite's loop is
`connectors/sqlite/connector.ts` `introspect` (≈ line 158) → `readTable`
(≈ line 306); the missing try/catch is the `rawTables.map(...)` at ≈ line 171.
Existing per-table sample isolation precedent: `description-generation.ts`
(≈ line 867, `sampling_failed`).
- **Driver dispatch:** `packages/cli/src/local-adapters.ts` (≈ lines 122-156)
routes every driver to its Node connector; the daemon is the `else` fallback.
- **`enabled_tables` matching:** `context/scan/enabled-tables.ts`
(`resolveEnabledTables`, `parseDottedTableEntry`), `context/scan/table-ref.ts`
(`scopedTableNames`, the `ref.db !== wantDb` filter ≈ line 47),
`context/project/driver-schemas.ts` (`enabled_tables` schema + description).
- **Staging / detect / error surface:**
`context/ingest/adapters/live-database/stage.ts`
(`writeLiveDatabaseSnapshot`, `warningArtifact` ≈ line 94,
`detectLiveDatabaseStagedDir` ≈ line 138),
`context/ingest/local-stage-ingest.ts` (the
`did not recognize fetched source output` throw ≈ line 291 — must stop being
the surface for empty-scope and total-failure).
- **Ingest summary:** `packages/cli/src/ingest.ts` (`writeReportStatus`
≈ line 202), `context/ingest/memory-flow/summary.ts`
(`formatMemoryFlowFinalSummary`) — thread object skips into the human-facing
summary.
- **Report body + `ktx status`:** `context/ingest/reports.ts` (`IngestReportBody`;
`SourceFetchReport` as the home for scan warnings),
`context/ingest/sqlite-local-ingest-store.ts` (the report body is persisted
whole as `body_json` ≈ line 90 — no migration needed), `status-project.ts`
(`buildLocalStatsStatus` reads `local_ingest_reports`; parse the latest body
per connection and render the skipped line via `renderLocalStatsAsLines`).
- **Daemon path:** `python/ktx-daemon/src/ktx_daemon/database_introspection.py`
(`DatabaseIntrospectionResponse` ≈ line 165, `introspect_database_response`
≈ line 323, `_load_postgres_rows` ≈ line 227, `_map_rows_to_tables`
≈ line 267), and the Node mapping in
`context/ingest/adapters/live-database/daemon-introspection.ts`
(`mapDaemonSnapshot` ≈ line 209).

## Benchmark context (motivation only)

`oracle_sql` (8 of the 135 local sqlite questions) currently has **no** semantic
layer because of its one broken view, so those questions fall back to raw
`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning
restores enriched context for that database. The same robustness is required for
the full Spider 2.0-Lite run across BigQuery and Snowflake, where broken or
permission-restricted objects are common and a single one must not zero out a
warehouse's context.

## Implementation notes

Shipped on branch `write-feature-spec-wiki`. All requirements implemented;
verified with `pnpm --filter @kaelio/ktx run test` (2981 passing),
`pnpm run dead-code`, `uv run pytest python/ktx-daemon/tests` (97 passing),
`uv run pre-commit`, and `pnpm run build && pnpm run link:dev`.

**Shared semantics (R1).** New `context/scan/object-introspection.ts` exposes
`tryIntrospectObject(ctx, fn)` (sibling of `tryConstraintQuery`), returning
`{ ok, table } | { ok: false, warning }` and building an
`object_introspection_failed` warning (object name + redactable DB error). It
rethrows native programming faults (`isNativeProgrammingFault`) so a ktx bug is
never masked as an object skip. The new warning code was added to
`KtxScanWarningCode` (`scan/types.ts`), the `scanWarningCodes` allowlist
(`local-structural-artifacts.ts`, plus a new exported `isKtxScanWarningCode`
validator), and `describeWarningGroup` (`scan.ts`).

**Per-object isolation, where it actually exists (R1/R4).** Only sqlite
(`readTable` via `PRAGMA`) and bigquery (`tableRef.get()` per dataset) do
per-object reads during *mandatory* introspection; both now route each object
through `tryIntrospectObject`. The other five Node connectors (postgres, mysql,
clickhouse, sqlserver, snowflake) read metadata in bulk from the catalog/
`information_schema` (already object-safe at this phase) and isolate per-object
profiling/sampling in the enrichment phase (`description-generation.ts`,
`sampling_failed`), so no divergent per-driver try/catch was added there. sqlite
also tolerates a `COUNT(*)` (profiling) failure without dropping a
structurally-readable table, and a broken view's metadata read is isolated so it
never blocks base tables (R4).
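The isolation pattern in the two paragraphs above can be sketched in a few lines. This is a minimal illustration in Python against an in-memory sqlite database, not ktx's implementation: `try_introspect_object`, `ScanWarning`, and `read_columns` are hypothetical stand-ins (the shipped helper is the TypeScript `tryIntrospectObject`), and treating `sqlite3.Error` as the only skippable failure is an assumption made for the sketch.

```python
import sqlite3
from dataclasses import dataclass

# Warning code string named in the notes above.
OBJECT_INTROSPECTION_FAILED = "object_introspection_failed"


@dataclass
class ScanWarning:  # hypothetical shape, for illustration only
    code: str
    object_name: str
    message: str


def try_introspect_object(name, read):
    """Run one per-object read and return (table, None) or (None, warning)."""
    try:
        return read(name), None
    except sqlite3.Error as exc:
        # Database-level failure: skip this object, keep its siblings.
        return None, ScanWarning(OBJECT_INTROSPECTION_FAILED, name, str(exc))
    # Any other exception (TypeError, KeyError, ...) propagates, so a bug in
    # the scanner is never reported as a skipped object.


def read_columns(conn, name):
    # LIMIT 0 makes sqlite resolve the object without fetching any rows.
    cursor = conn.execute(f'SELECT * FROM "{name}" LIMIT 0')
    return {"name": name, "columns": [col[0] for col in cursor.description]}


def introspect(conn):
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    )]
    tables, warnings = [], []
    for name in names:
        table, warning = try_introspect_object(
            name, lambda n: read_columns(conn, n)
        )
        if table is not None:
            tables.append(table)
        else:
            warnings.append(warning)
    return tables, warnings


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE legacy (id INTEGER);
    CREATE VIEW legacy_view AS SELECT id FROM legacy;
    DROP TABLE legacy;  -- leaves legacy_view broken
""")
tables, warnings = introspect(conn)
print([t["name"] for t in tables])        # healthy objects
print([w.object_name for w in warnings])  # skipped objects
```

The healthy `customers` table is still returned while the broken view becomes a warning, which is the shape the regression test in the acceptance criteria asks for.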

**Single-source outcome decision (R3/R5).** New
`adapters/live-database/scan-outcome.ts#assertLiveDatabaseScanOutcome` runs once
in `LiveDatabaseSourceAdapter.fetch()` — the one path every driver (and the
daemon) routes through — and derives the outcome from the snapshot + scope:
≥1 object → success (skips ride along as warnings); all matched objects failed →
clear `KtxExpectedError`; non-empty `enabled_tables` matched nothing → clear
zero-match error naming the connection, the requested entries, and the available
objects (sqlite/bigquery attach the discovered inventory via
`metadata.discovered_object_names`); empty database (no scope) → success with an
empty layer. `detectLiveDatabaseStagedDir` no longer requires table files, so a
valid empty staging is recognized; total-failure/zero-match now throw a clear
connection error before staging instead of surfacing the generic
`did not recognize fetched source output`.
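The four outcomes listed above form a small decision table. The sketch below restates it in Python purely as an illustration; `assert_scan_outcome`, `ScanOutcomeError`, the argument names, and the message wording are all hypothetical and are not taken from `scan-outcome.ts`.

```python
class ScanOutcomeError(Exception):
    """Hypothetical stand-in for a clear, user-facing connection error."""


def assert_scan_outcome(connection_id, tables, skipped,
                        enabled_tables=None, discovered=()):
    """Classify one scan result; names and messages are illustrative."""
    if tables:
        # At least one object was read: success, skips ride along as warnings.
        return "success"
    if skipped:
        # Objects were in scope, but every one of them failed.
        raise ScanOutcomeError(
            f"connection {connection_id!r}: all {len(skipped)} matched "
            f"object(s) failed introspection: {sorted(skipped)}"
        )
    if enabled_tables:
        # A non-empty scope selected nothing.
        raise ScanOutcomeError(
            f"connection {connection_id!r}: enabled_tables "
            f"{sorted(enabled_tables)} matched no objects; "
            f"available: {sorted(discovered)}"
        )
    # No scope and no objects: a genuinely empty database.
    return "empty"


def outcome_or_error(*args, **kwargs):
    """Small helper for the usage lines below."""
    try:
        return assert_scan_outcome(*args, **kwargs)
    except ScanOutcomeError as exc:
        return f"error: {exc}"


print(outcome_or_error("local", ["customers"], ["legacy_view"]))
print(outcome_or_error("local", [], ["legacy_view"]))
print(outcome_or_error("local", [], [], enabled_tables=["custmers"],
                       discovered=["customers", "orders"]))
print(outcome_or_error("local", [], []))
```

Keeping the three failure-or-empty cases in one function is what makes them distinguishable to the user: each branch produces its own message instead of sharing a generic staging error.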

**`enabled_tables` matching (R5).** Normalized at the scope boundary in
`resolveEnabledTables` using `connection.driver`: for sqlite, `main.<name>` →
`{ db: null }`, so `"main.customers"` and `"customers"` select the same object.
`table-ref.ts` stayed generic. Documented in `driver-schemas.ts` and
`docs-site/.../configuration/ktx-yaml.mdx`.
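As a rough illustration of that normalization, the sketch below maps an `enabled_tables` entry to a `(db, name)` pair. `normalize_enabled_table` is a hypothetical name, and the dotted-entry parsing is deliberately simplified (it ignores quoted identifiers that contain dots); the real logic lives in `resolveEnabledTables`.

```python
def normalize_enabled_table(entry: str, driver: str):
    """Split 'db.name' into (db, name); illustrative, not ktx's parser."""
    db, _, name = entry.rpartition(".")
    db = db or None  # no dot in the entry means no explicit db/schema
    if driver == "sqlite" and db == "main":
        # sqlite's default schema is implicit, so 'main.x' and 'x'
        # name the same object.
        db = None
    return db, name


print(normalize_enabled_table("main.customers", "sqlite"))
print(normalize_enabled_table("customers", "sqlite"))
print(normalize_enabled_table("public.customers", "postgres"))
```

Doing this once at the scope boundary, keyed on the driver, is what lets the downstream table-reference matching stay generic.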

**Surfacing (R2).** Deviation from the spec's orientation: live-database schema
ingest runs through the **stage-only** path (`runLocalStageOnlyIngest` →
`local_ingest_reports`), not the bundle runner, so the home for scan warnings is
`LocalIngestRunRecord.fetch` (a new `SourceFetchReport` field; `body_json` is
persisted whole, so **no migration**), not the bundle-only
`IngestReportBody.fetch`. Both ingest paths read `adapter.readFetchReport`
(`live-database/fetch-report.ts` derives skips from the existing `warnings.json`).
The ingest summary is already rendered by `runKtxScan` from `report.warnings`
(the new `describeWarningGroup` case), and `ktx status`
(`status-project.ts#buildLocalStatsStatus`/`renderLocalStats`) now parses the
latest report body per connection and prints a per-connection
`N object(s) skipped — name: reason` line.

**Daemon parity (R6).** `database_introspection.py` adds a `warnings` field to
`DatabaseIntrospectionResponse` and a `DatabaseIntrospectionWarning` model,
isolates per-object failures in `_map_rows_to_tables`, and shares the
`OBJECT_INTROSPECTION_FAILED_CODE = "object_introspection_failed"` string with
Node. `mapDaemonSnapshot` maps `raw.warnings` into `KtxSchemaSnapshot.warnings`,
dropping any code Node cannot render (validated via `isKtxScanWarningCode`).
Deviation: the daemon does **not** re-enforce the R3 total-failure rule — the
shared Node post-check (`assertLiveDatabaseScanOutcome`) owns it for every driver
including the daemon, avoiding a divergent second implementation. Parity is
covered by a Node test (daemon-shaped warning round-trips) and a pytest
(per-object failure → warning with the shared code).
@@ -1,363 +0,0 @@
# Add universal SQL-authoring craft to the ktx-analytics skill

> Refined spec. Intake draft: `todo/07-analytics-skill-sql-craft.md`.

## Problem

The shipped `ktx-analytics` skill
(`packages/cli/src/skills/analytics/SKILL.md`) is an *orchestration* guide: its
`<workflow>` and `<rules>` tell the agent **which ktx tools to call and in what
order** (`discover_data` → `entity_details`/`sl_read_source` →
`sl_query`/`sql_execution` → validate → `memory_ingest`). It says almost nothing
about **writing correct SQL**.

That gap shows up as a specific failure shape: the agent reliably produces
*runnable* SQL but *wrong* results. The recurring defects are universal
analytics-engineering mistakes, not ktx-specific ones:

- comparing a string column to a numeric literal (or vice versa), which can
silently match zero rows;
- rounding inside intermediate CTEs, so the final number is off;
- ranking/“first”/“most recent” windows with no deterministic tie-breaker, so
results flicker run to run;
- filtering *before* a window function for sequence/“since”/“first” questions,
truncating the partition the window should see;
- returning a full ranked list for a “top/highest” question, or collapsing a
“per X” question to a single value;
- dropping the inputs (or the entity identifier) a derived value was built from.

These are correctness defects every ktx user hits on a live database. They
belong in the shipped skill — fixing them once improves ktx for everyone, rather
than living in any individual caller’s prompt.

## Generic use case

An analyst (human or agent) points ktx at a **live, production** database and
asks a real analytical question — “what’s the most recent order per customer”,
“top region by margin”, “average order value by month”. The schema is unfamiliar
(unknown date encodings, nullable join keys, string-typed numeric columns), the
question carries grain and ranking intent in its wording, and the answer must be
*correct and deterministic*, not merely executable. The skill should encode the
analytics-engineering craft that makes the difference between a query that runs
and a query that’s right — independent of any benchmark.

## Model

The change is **additive content in one Markdown file**, governed by these
invariants. They constrain the implementer; the exact prose is theirs.

### Inline-only delivery (this is a hard constraint, not a style preference)

All new guidance lives **inside `skills/analytics/SKILL.md`**. A bundled
`reference/*.md` file (the progressive-disclosure pattern Anthropic’s
skill-authoring guide recommends for large skills) **MUST NOT** be used here,
because the delivery mechanism ships only `SKILL.md`:

- `setup-agents.ts` installs the analytics skill via `readAnalyticsSkillContent()`,
which reads **only** `./skills/analytics/SKILL.md` and writes a **single** file
per target: `.claude/skills/ktx-analytics/SKILL.md` (Claude Code), the Codex /
universal `.agents` equivalent, a **flattened** single rules file for Cursor
(`.cursor/rules/ktx-analytics.mdc`) and OpenCode
(`.opencode/commands/ktx-analytics.md`), and a Claude Desktop **zip that
contains only `ktx-analytics/SKILL.md`** (`writeClaudeDesktopSkillBundle`).
- Nothing copies sibling files or subdirectories. A reference file would dangle
on every target, and the Cursor/OpenCode flatten-to-one-file shape cannot
represent a multi-file skill at all.

The skill is small enough that inline costs nothing meaningful: ~67 lines today
plus ~60 of craft is well under the 500-line budget. And this craft is **core
content** — consulted on every SQL-authoring turn — so even if multi-file delivery
existed it would still belong inline: progressive disclosure only pays off for
large, *conditionally-relevant* reference material loaded on demand, not for
always-needed craft.

Multi-file skill *delivery* is a legitimate future enhancement, but it must be
**pulled by a concrete need, not built ahead of one** — no shipped skill today
exceeds the budget (largest is ~346 lines) or uses a bundled reference. The first
real trigger is the **per-dialect SQL syntax follow-up**
(`todo/08-per-dialect-sql-syntax-notes.md`), whose load-on-demand
`reference/<dialect>.md` content is a genuine progressive-disclosure fit. When
that work is scoped, note that multi-file delivery is **not** a simple directory
copy: `setup-agents.ts` flattens the skill to a *single* file for Cursor
(`.mdc`) and OpenCode (`.md`), so those targets need a concatenation transform,
and uninstall needs per-file manifest entries. Recording the constraint here so a
future implementer does not “improve” this inline content into a bundled
reference that dangles on every target.

### Heuristics with a generic *why*, not a wall of MUSTs

The new rules are phrased as **heuristics with a one-line, universal rationale**,
because SQL authoring is a high-freedom task (many valid approaches, choice
depends on the question and the data). A bare imperative overfits; a rule plus
its *why* lets the model apply judgment and generalize. This follows Anthropic’s
own skill-authoring guidance (“if you find yourself writing ALWAYS/NEVER in all
caps or rigid structures, reframe and explain the reasoning”).

This **reconciles the draft’s “behavior only, no rationale” instruction**: the
prohibition is specifically on rationale that references a **grader, gold answer,
or the benchmark**. *Generic analytics-engineering rationale is required* — e.g.
“…so `RANK`/`ROW_NUMBER` results don’t flicker across runs”, “…a string-vs-number
compare can silently match nothing”. That is a universal truth, not a
grader reference.

### Dialect-agnostic

Every rule must read correctly on any SQL dialect a ktx connection might use.
**No dialect-specific syntax** — not `QUALIFY` (Snowflake/BigQuery/DuckDB only),
not `strftime`/`julianday` (sqlite), not backtick/`DB.SCHEMA.TABLE` FQTNs.
Per-dialect syntax notes are a **separate follow-up** living in a dialect-aware
(per-driver) location, explicitly out of scope here.

### Discovery craft attaches to discovery; authoring craft to query/validate

Two of the draft’s rules (inspect sample rows; cast before comparing) are
*schema-discovery* concerns that happen **before** SQL is composed. They belong
with the discovery steps of the existing workflow, not only at the query step.
The rest (composition, window correctness, precision, completeness) belong with
the query/validate steps. The draft’s “extend step 5/6” is the right home for
most rules but is slightly off for the discovery pair; this spec corrects that.

### Additive only

The existing `<workflow>`, `<rules>`, and `<examples>` — compact result tables,
summaries, clarification prompts, the tool-order workflow, the `connectionId`
scoping rules — are preserved unchanged. The skill must still read well for an
interactive, human-facing analysis session.

## Requirements

### 1. Placement and structure

Add a dedicated, scannable craft section to `SKILL.md`:

- A new top-level block — `<sql_craft>` (sibling to `<workflow>`/`<rules>`) — with
**five sub-headings**: *Schema discovery*, *Composition*, *Window functions*,
*Numeric precision*, *Answer completeness*. Sub-headings keep the block
scannable (the draft’s “group under clear sub-headings” goal).
- **Pointers, not duplication.** Step 5 (“Query”) and step 6 (“Validate and
explain”) each gain a **one-line pointer** into `<sql_craft>` rather than
inlining the rules (state each rule once; Anthropic’s “consistent terminology /
don’t repeat” guidance). The schema-discovery pair is additionally reflected as
a brief cue in the discovery steps (step 2 “Inspect” / step 4 “Plan”), pointing
to the same block.
- No new tool, flag, or config. This is content only.

### 2. The craft rules (all fourteen behaviors, grouped)

Every behavior from the intake draft must be represented. Tightly-related ones
**may** be merged into a single bullet where that reads better; none may be
dropped. Each carries a generic *why* (per Model). Dialect-agnostic throughout.

**Schema discovery** (cue in steps 2/4; lives in `<sql_craft>`)
1. Inspect representative **sample rows** of each table before composing SQL —
confirm date/time encoding (`YYYYMMDD` vs ISO vs epoch), null prevalence in
join/filter keys, and the real set of categorical/enum values
(`entity_details` + a small `sql_execution` sample). *Why:* assumptions about
encoding and nullability are the most common source of silently-wrong filters.
2. **Cast a column to its real type before comparing** it in `WHERE`/`JOIN`. A
string column compared to a numeric literal (or vice versa) can silently match
nothing.
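A small, runnable illustration of rule 2, using Python's built-in `sqlite3` module. The `stores` table and its values are invented for the example, and the coercion behavior shown is sqlite's; other engines may coerce differently or raise a type error, which is one more reason to cast explicitly rather than rely on any single engine's rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, zip TEXT);
    INSERT INTO stores VALUES (1, '02134'), (2, '10001');
""")

# Naive: the text column is compared to a numeric literal. sqlite compares
# as text here, so the zero-padded value does not equal '2134'. No error is
# raised; the filter just matches nothing.
naive = conn.execute(
    "SELECT store_id FROM stores WHERE zip = 2134"
).fetchall()

# Explicit: cast the column to the type the comparison is meant to use.
explicit = conn.execute(
    "SELECT store_id FROM stores WHERE CAST(zip AS INTEGER) = 2134"
).fetchall()

print("naive:", naive)
print("explicit:", explicit)
```

Sampling the column first (rule 1) is what reveals that `zip` is stored as zero-padded text, which then tells you which side of the comparison to cast.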

**Composition**
3. Build complex queries **incrementally** — one CTE at a time, verifying each
layer’s output on a small sample before stacking the next. *Why:* a wrong
intermediate layer is far cheaper to catch early than to debug in the final
result.
4. **Avoid fan-out joins.** Add columns only from tables already at the target
grain, or **pre-aggregate** to that grain before joining. *Why:* a join that
multiplies rows quietly inflates every downstream `SUM`/`COUNT`.
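Rule 4 is easiest to see on a tiny dataset. The sketch below uses Python's `sqlite3` module with invented `orders` and `shipments` tables; the SQL itself is plain joins, `GROUP BY`, and a CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY,
                            order_id INTEGER);
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 10, 50.0);
    INSERT INTO shipments VALUES (1, 1), (2, 1), (3, 2);
""")

# Fan-out: order 1 has two shipments, so its amount is summed twice.
FAN_OUT = """
SELECT o.customer_id, SUM(o.amount) AS revenue,
       COUNT(s.shipment_id) AS shipments
FROM orders o
JOIN shipments s ON s.order_id = o.order_id
GROUP BY o.customer_id
"""

# Pre-aggregate shipments to the order grain, then join one row per order.
PRE_AGGREGATED = """
WITH shipments_per_order AS (
    SELECT order_id, COUNT(*) AS shipments
    FROM shipments
    GROUP BY order_id
)
SELECT o.customer_id, SUM(o.amount) AS revenue,
       SUM(spo.shipments) AS shipments
FROM orders o
LEFT JOIN shipments_per_order spo ON spo.order_id = o.order_id
GROUP BY o.customer_id
"""

inflated = conn.execute(FAN_OUT).fetchall()
correct = conn.execute(PRE_AGGREGATED).fetchall()
print("fan-out:", inflated)
print("pre-aggregated:", correct)
```

Both queries run without error and both report three shipments; only the revenue differs, which is why this defect tends to go unnoticed without a check against the un-joined total.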

**Window functions**
5. Give every ranking/ordering window function a **complete, deterministic
tie-breaker** (append unique key columns to `ORDER BY`), so
`RANK`/`ROW_NUMBER`/`LAG` are stable rather than flickering across runs.
6. For sequence / “first” / “most recent” / “since” questions, **filter after the
window**, not before: compute over the full partition, then keep the rows you
want. *Why:* a pre-filter shrinks the partition the window ranks over, so
“first”/“most recent” is computed against the wrong set. (See the worked
example, requirement 3.)

**Numeric precision**
7. Compute at **full precision; round only in the final projection**, never inside
intermediate CTEs.
8. Be **explicit about truncation** — `CAST AS INT` truncates; use explicit
rounding when rounding is intended. (May merge with rule 7.)
9. Distinguish **macro vs micro averages** based on the question’s wording:
“average of per-group averages” = `AVG(group_metric)`; “overall/weighted
average” = `SUM(numerator)/SUM(denominator)`.
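Rules 7 and 9 can be checked numerically. The sketch below again uses Python's `sqlite3` module with invented tables (`line_items`, `region_sales`); the two-argument `ROUND` is widely available but is an assumption about the target engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE line_items (item_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO line_items VALUES (1, 1.004), (2, 1.004), (3, 1.004);

    CREATE TABLE region_sales (region TEXT, revenue REAL, units REAL);
    INSERT INTO region_sales VALUES ('A', 100.0, 10.0), ('B', 900.0, 30.0);
""")

# Rule 7: rounding each item first drops the fraction before it can add up.
rounded_early = conn.execute(
    "SELECT SUM(ROUND(amount, 2)) FROM line_items"
).fetchone()[0]
rounded_late = conn.execute(
    "SELECT ROUND(SUM(amount), 2) FROM line_items"
).fetchone()[0]

# Rule 9: macro average (mean of per-region prices) vs micro average
# (total revenue over total units). Same data, different questions.
macro = conn.execute(
    "SELECT AVG(revenue / units) FROM region_sales"
).fetchone()[0]
micro = conn.execute(
    "SELECT SUM(revenue) / SUM(units) FROM region_sales"
).fetchone()[0]

print("rounded early:", rounded_early, "rounded late:", rounded_late)
print("macro:", macro, "micro:", micro)
```

Neither pair has a universally right member: which average is correct depends on the question's wording, while rounding late is correct whenever the question asks for a rounded total.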

**Answer completeness / interpretation**
10. “top / highest / most / lowest” → return only the **winning row(s)** (keep the
top-ranked row via the window result), not the full ranked list, unless a list
is asked for. *(Phrase the mechanism dialect-agnostically — do not name
`QUALIFY`.)*
11. “for each X / per X / by X” → **exactly one row per X**; don’t collapse to a
single value unless the question says “overall” or “total across X”.
12. When a question asks for inputs and a derived value (“X, Y, and their ratio”),
**include the inputs as columns** alongside the derived value.
13. When grouping by a human-readable label (a name), also **expose the entity’s
identifier** — identity, not just the label, is part of the result (and
disambiguates duplicate names).
14. When a result is **unexpectedly empty, relax filters one at a time** to find
which predicate removed the rows. *Why:* this is the validation feedback loop
that turns a silent empty result into a diagnosable one.

### 3. One worked example (dialect-agnostic)

Add **exactly one** compact before/after example to the skill, demonstrating the
**window-then-filter** rule (rule 6) — the subtlest and highest-value of the set.
It shows the wrong shape (filter inside, then rank) and the right shape (rank over
the full partition in a CTE, then filter to the top rank in the outer query),
using generic table/column names and standard SQL only (no `QUALIFY`, no
dialect functions). Keep it ~6–10 lines. Do not add a second example; the
existing three tool-orchestration examples stay as the primary example set.
*(Superseded by spec 09: the skill now carries a second `sql` worked example —
the multi-hop fan-out case — so the one-example constraint applies to spec 07's
window-then-filter example only.)*
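For readers of this spec (not for the skill itself), the two shapes can be compared on a three-row dataset. The sketch uses Python's `sqlite3` module and an invented `orders` table; it assumes a sqlite build with window-function support (3.25 or later). It is not the example shipped in `SKILL.md`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 1, '2023-11-05'), (2, 1, '2024-02-01'), (3, 2, '2024-03-10');
""")

# Question: which customers placed their FIRST order in 2024 or later?

# Wrong shape: the filter runs before the window, so customer 1's real
# first order (2023) is no longer in the partition being ranked.
WRONG = """
WITH ranked AS (
    SELECT customer_id, order_date,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY order_date, order_id) AS rn
    FROM orders
    WHERE order_date >= '2024-01-01'
)
SELECT customer_id FROM ranked WHERE rn = 1 ORDER BY customer_id
"""

# Right shape: rank over the full history, then filter in the outer query.
RIGHT = """
WITH ranked AS (
    SELECT customer_id, order_date,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY order_date, order_id) AS rn
    FROM orders
)
SELECT customer_id FROM ranked
WHERE rn = 1 AND order_date >= '2024-01-01'
ORDER BY customer_id
"""

wrong_rows = conn.execute(WRONG).fetchall()
right_rows = conn.execute(RIGHT).fetchall()
print("filter-then-window:", wrong_rows)
print("window-then-filter:", right_rows)
```

The `order_id` appended to the window's `ORDER BY` is the deterministic tie-breaker from rule 5, so the same sketch covers both window rules.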

### 4. Explicit exclusions

None of the following may appear in the skill (they are application/consumer
concerns, or actively wrong for live data):

- **Output-shape contracts** (“return a bare result set with exactly these
columns, no prose”). The skill is for interactive analysis and already favors
readable tables + summaries; a caller needing a strict shape specifies that
itself.
- **Anchoring relative time to `MAX(date)` of the data.** On a live database
“recent” / “past N months” means relative to *now*; `MAX(date)` anchoring is
only valid for static snapshots and must not be baked into the product.
- **Any advice justified by a grader, gold answer, or scoring comparator.**
- **Dialect-specific syntax** (deferred to the per-driver follow-up).

### 5. Coordination with spec 03

`03-multi-connection-routing-in-analytics-skill` also edits this same file (it
adds a connection-routing “step 0” to `<workflow>` and threads `connectionId`
through the tool calls). Spec 07’s additions are **orthogonal**: they live in a
new `<sql_craft>` block and in step 5/6 pointers, and must not rewrite the
`<workflow>` routing or the `<rules>` `connectionId` scoping that spec 03 owns.
If both land, the result is one coherent skill: routing in `<workflow>`/`<rules>`,
SQL craft in `<sql_craft>`.

## Acceptance criteria

- The shipped `analytics/SKILL.md` contains all fourteen behaviors above, grouped
under the five sub-headings, each phrased as a heuristic with a generic
rationale.
- **Zero references** to any benchmark, gold answer, grader, or scoring
comparator anywhere in the skill.
- **Dialect-agnostic:** the skill contains no `QUALIFY`, no `strftime`/`julianday`,
no backtick/`DB.SCHEMA.TABLE` FQTN syntax, and no other single-dialect
construct — including in the worked example.
- The existing interactive guidance is intact: the `<workflow>` steps, the
`<rules>` (compact tables, summaries, clarification prompt, `connectionId`
scoping), and the three existing examples all still read correctly and were not
removed or contradicted.
- **None of the excluded items** (output-shape contract, `MAX(date)` anchoring of
“recent”, grader-driven advice, dialect syntax) appear.
- Exactly **one** new worked example is present, demonstrating window-then-filter,
in standard dialect-agnostic SQL. *(Superseded by spec 09, which adds a second
`sql` worked example for the multi-hop fan-out case; the shipped skill then
contains two worked examples and the content test asserts two `sql` fences.)*
- The craft is **inline in `SKILL.md`** — no bundled reference file is introduced,
and the skill still installs as a single file through `setup-agents.ts` for all
targets (Claude Code, Codex, Cursor, OpenCode, universal, Claude Desktop zip).
- The skill stays **scannable and within a reasonable size** (comfortably under
the 500-line budget).
- The frontmatter (`name`, `description`) is unchanged and still parses through
`SkillsRegistryService.parseFrontmatter`.

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.

- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md`. Add the
`<sql_craft>` block; add one-line pointers in steps 5/6 and a discovery cue in
steps 2/4; add the single worked example. Keep `<workflow>`/`<rules>`/`<examples>`
otherwise intact.
- **Delivery (why inline is mandatory):** `packages/cli/src/setup-agents.ts`
(`readAnalyticsSkillContent`, `installTarget`, `writeClaudeDesktopSkillBundle`,
`plannedKtxAgentFiles`). Each target gets a single file derived from
`SKILL.md`; Cursor/OpenCode flatten to one rules file; Claude Desktop zips only
`ktx-analytics/SKILL.md`. No change to `setup-agents.ts` is required by this
spec — confirm the skill still installs unchanged.
- **Coordination:** `03-multi-connection-routing-in-analytics-skill` edits the
same file; keep the changes non-overlapping (see requirement 5).
- **Tests:** a content assertion over the shipped `analytics/SKILL.md` is the
right level (this is prompt content, not executable logic). Assert the skill
text contains the craft sub-headings / representative rule phrases, contains the
worked example, and contains none of the banned constructs: the literal tokens
`QUALIFY`/`strftime`/`julianday`, grader/benchmark words (`spider`, `benchmark`,
`gold`, `grader`), and — checked as a phrase, not a raw `MAX(` grep, since
`MAX()` is a legitimate aggregate — any instruction anchoring relative time
(“recent”, “past N months”) to the data’s maximum date. The existing
`SkillsRegistryService` frontmatter-parse test must still pass. The standalone
`ktx-dev` binary should be rebuilt/re-linked (`pnpm run build && pnpm run
link:dev`) so the playground picks up the updated skill.
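The content assertion described in the **Tests** bullet could look roughly like the sketch below. It is written in Python only to keep it self-contained; the real test is a TypeScript file in the CLI package, and `skill_violations`, the token lists, and the sample text here are hypothetical. The phrase-level guard against `MAX(date)` anchoring is intentionally left out, since its exact wording is not specified.

```python
import re

FENCE = "`" * 3  # built at runtime so this sketch can sit inside Markdown

BANNED_TOKENS = ("QUALIFY", "strftime", "julianday",
                 "spider", "benchmark", "gold", "grader")
REQUIRED_HEADINGS = ("Schema discovery", "Composition", "Window functions",
                     "Numeric precision", "Answer completeness")


def skill_violations(text, expected_sql_examples=1):
    """Return a list of human-readable problems found in the skill text."""
    problems = []
    for token in BANNED_TOKENS:
        # Whole-word, case-insensitive match, so 'golden' would not trip 'gold'.
        if re.search(rf"\b{re.escape(token)}\b", text, flags=re.IGNORECASE):
            problems.append(f"banned token: {token}")
    lowered = text.lower()
    for heading in REQUIRED_HEADINGS:
        if heading.lower() not in lowered:
            problems.append(f"missing sub-heading: {heading}")
    sql_fences = len(re.findall(
        rf"^{re.escape(FENCE)}sql\s*$", text, flags=re.MULTILINE
    ))
    if sql_fences != expected_sql_examples:
        problems.append(
            f"expected {expected_sql_examples} sql example(s), "
            f"found {sql_fences}"
        )
    return problems


sample = "\n".join([
    "<sql_craft>",
    "Schema discovery", "Composition", "Window functions",
    "Numeric precision", "Answer completeness",
    FENCE + "sql", "SELECT 1;", FENCE,
    "</sql_craft>",
])
print(skill_violations(sample))
print(skill_violations(sample.replace("SELECT 1;", "SELECT 1 QUALIFY rn = 1;")))
```

Making the expected number of `sql` fences a parameter keeps the same check usable after spec 09 raises the count to two.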

## Benchmark context (motivation only)

On the Spider 2.0-Lite sqlite subset the solver produced **0 execution errors but
~50 result mismatches**, and a large share traced to exactly these gaps:
premature rounding, string-vs-number compares, non-deterministic window ordering,
returning full lists for “top” questions, and dropping the inputs to derived
values. These are generic SQL-authoring defects — fixing them in the skill
improves ktx for every user querying a live database, and improving the benchmark
score is a side effect, not the goal. The skill itself must contain no trace of
the benchmark.

## Implementation notes

Implemented on branch `write-feature-spec-wiki`.

**What was built**
- Added a new `<sql_craft>` block to `packages/cli/src/skills/analytics/SKILL.md`
(sibling to `<workflow>`/`<rules>`, placed just before `<examples>`), with the
five sub-headings — *Schema discovery before writing SQL*, *Composition*,
*Window functions*, *Numeric precision*, *Answer completeness / interpretation* —
and a one-line opener framing the bullets as heuristics-with-a-why.
- All fourteen behaviors are represented. Rules 7 and 8 (round-at-the-end /
truncation) are merged into one "Round only at the end" bullet, as the spec
permitted. Each bullet carries a generic analytics-engineering rationale; none
references a benchmark, grader, or gold answer.
- Exactly one worked example (a fenced `sql` block inside `<sql_craft>`)
demonstrates the window-then-filter rule, and incidentally the deterministic
tie-breaker: the *wrong* shape filters before the window; the *right* shape
ranks the full partition in a CTE, then filters in the outer query. Standard
SQL only — no `QUALIFY`, no dialect functions.
- Step pointers added without duplicating the rules: a schema-discovery cue in
steps 2 and 4, an authoring pointer in step 5, and a validation pointer in
step 6, each pointing into `<sql_craft>`.
- The existing `<workflow>` / `<rules>` / `<examples>` (compact tables,
summaries, clarification prompt, `connectionId` scoping, the three
orchestration examples) are unchanged. Delivery is unchanged: still a single
`SKILL.md` per target via `readAnalyticsSkillContent`; no bundled `reference/`
file was introduced.
|
|
||||||
**Tests** — added `packages/cli/test/skills/analytics-skill-content.test.ts`, a
|
|
||||||
content assertion over the source `SKILL.md`: the five sub-headings, a
|
|
||||||
representative phrase for each behavior, exactly one `sql` worked example, the
|
|
||||||
preserved interactive guidance, and the absence of banned constructs
|
|
||||||
(`QUALIFY` / `strftime` / `julianday`, `spider` / `benchmark` / `gold` /
|
|
||||||
`grader`, a backtick three-part FQTN, and a phrase-level guard against anchoring
|
|
||||||
relative time to a `MAX(...)` date). The existing `setup-agents.test.ts` content
|
|
||||||
assertions and the `SkillsRegistryService` frontmatter test still pass (77/77
|
|
||||||
across the three relevant files). Rebuilt and re-linked `ktx-dev`
|
|
||||||
(`pnpm run build && pnpm run link:dev`); the craft block is present in the
|
|
||||||
shipped `dist` asset.
|
|
||||||
|
|
||||||
**Deviations / notes**
|
|
||||||
- The worked example runs ~18 lines including comments rather than the spec's
|
|
||||||
"~6–10"; a faithful before/after with a CTE needs the extra lines, and the
|
|
||||||
skill stays well within budget (~117 lines total).
|
|
||||||
- `pnpm run type-check` currently reports one **pre-existing, unrelated** error
|
|
||||||
in `test/mcp-server-factory.test.ts` (MCP server deps typing), committed on
|
|
||||||
this branch ahead of `origin/main`. The src type-check and `pnpm run build`
|
|
||||||
are green; this change does not touch any MCP file.
|
|
||||||
- Per-dialect SQL syntax stays out of scope here (deferred to
|
|
||||||
`todo/08-per-dialect-sql-syntax-notes.md`), so the skill remains
|
|
||||||
dialect-agnostic. No dialect-tool pointer was added to `SKILL.md` yet — that
|
|
||||||
belongs with spec 08's channel so the skill never references a tool that does
|
|
||||||
not exist.
@ -1,395 +0,0 @@
# Per-dialect SQL syntax notes, served on demand and scoped to the connection

> Refined spec. Intake draft: `todo/08-per-dialect-sql-syntax-notes.md`. Companion
> to `specs/07-analytics-skill-sql-craft.md`, which kept the analytics SQL craft
> dialect-agnostic and explicitly deferred per-dialect syntax to this spec.

## Problem

Spec 07 added universal, **dialect-agnostic** SQL-authoring craft to the
`ktx-analytics` skill (`packages/cli/src/skills/analytics/SKILL.md`). That craft
deliberately excludes anything that reads correctly on only one engine — no
`QUALIFY`, no `strftime`/`julianday`, no backtick or `DB.SCHEMA.TABLE` FQTNs —
because the flat skill is installed verbatim and an agent querying sqlite must
never see Snowflake syntax.

But a large share of *real* correctness depends on exactly that excluded,
engine-specific syntax:

- **Snowflake:** `DATABASE.SCHEMA.TABLE` FQTNs, double-quoted case-sensitive
identifiers (unquoted folds to upper-case), VARIANT colon-paths
(`col:field.sub::type`), `QUALIFY`.
- **BigQuery:** backtick FQTNs (`` `project.dataset.table` ``), `_TABLE_SUFFIX`
for sharded/wildcard tables, `QUALIFY`, `JSON_VALUE`/`JSON_EXTRACT`.
- **sqlite:** `strftime`/`julianday`/`date()` for dates, no `QUALIFY`,
`json_extract`.
- and the remaining supported engines (`postgres`, `mysql`, `clickhouse`,
`sqlserver`/`tsql`), each with its own FQTN, quoting, date, top-N, and
JSON conventions.

This guidance is genuinely useful to an agent writing SQL against a live
database, but it must **not** pollute the flat dialect-agnostic skill. It belongs
in a **dialect-aware** channel, surfaced only for the dialect the active
connection actually uses, and selected from the project's own configured state —
not guessed, not shown all at once.

## Generic use case

Any **ktx** project whose connections span more than one warehouse engine — a
Snowflake warehouse plus a BigQuery export plus a local sqlite extract, say. When
the agent (or a human analyst the agent assists) writes SQL for a given
connection, it should receive *that engine's* syntax conventions — FQTN form,
identifier quoting, date functions, top-N idiom, semi-structured access — and
nothing for the engines it is not querying. The need is independent of any
benchmark: it is what "write correct SQL against this specific warehouse" requires
on every multi-engine stack.

## Model

The change adds a **dialect-aware channel** alongside spec 07's flat skill. The
following decisions are committed by this refinement; the implementer owns the
exact prose and code.

### Delivery: a dynamic MCP tool (decision committed)

The draft posed two delivery mechanisms and asked the refinement to "weigh them
before committing." This spec commits to **dynamic MCP delivery**: a new
read-only MCP tool returns the syntax notes for a given `connectionId`, with the
dialect resolved server-side from the connection's configured `driver`. The flat
skill gains a one-line pointer to that tool. **No install-mechanism change is
required.**

The alternative — **multi-file skill delivery** (bundle `reference/<dialect>.md`
files and point the skill at the matching one) — is **rejected** for **ktx**, for
reasons that hold regardless of how the skill is otherwise authored:

1. **It cannot scope on two of the six install targets.** Cursor
(`.cursor/rules/ktx-analytics.mdc`) and OpenCode
(`.opencode/commands/ktx-analytics.md`) are physically **single-file**;
`setup-agents.ts` flattens the skill to one file there. A bundled `reference/`
directory degenerates to "concatenate every dialect into one file," so a
sqlite agent would see Snowflake VARIANT syntax — **failing this spec's core
no-leak criterion on those targets**, and defeating progressive disclosure
(everything is in context at once). The MCP tool behaves **identically on all
six targets** because it is a tool call, not an installed file.
2. **Selecting the dialect is a deterministic operation, so it belongs in code,
not model judgment.** Anthropic's skill-authoring guidance explicitly says to
*"prefer scripts [tools] for deterministic operations."* With bundled files the
**model** must infer that connection X is Snowflake and open the right file —
and on a multi-connection project it can open the wrong one. With the tool, the
**server** resolves `driver → dialect` from `ktx.yaml` state and returns
exactly the right notes.
3. **It needs a delivery subsystem that the tool does not.** Multi-file delivery
requires reworking `readAnalyticsSkillContent`, `installTarget`,
`plannedKtxAgentFiles`, the install manifest (a directory variant),
`removeKtxAgentInstall`, and `writeClaudeDesktopSkillBundle`, plus a
concatenation transform for the single-file targets. The MCP tool requires one
read-only handler and one skill pointer.
4. **The dependency is free.** The `ktx-analytics` skill already hard-depends on
the **ktx** MCP server — its entire workflow is calling `discover_data`,
`entity_details`, `sql_execution`, and so on. Wherever the server is down, the
skill is already non-functional; the tool adds **no new dependency**.
5. **Dropping Cursor/OpenCode does not change this.** Removing those targets would
make multi-file delivery *possible*, but it would not make it better: reasons
2–4 stand, and the drop is a disproportionate cost (Cursor is a major target)
to neutralize a constraint the tool handles for free. Whether **ktx** supports
those targets is a separate product decision and is out of scope here.

This is consistent with Anthropic's progressive-disclosure goal — load the
relevant material on demand, at zero context cost until needed — which the tool
satisfies (its output costs context only when called) while resolving *which*
dialect from state rather than from a model guess. Reference:
[Skill authoring best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices).

### Scope derived from state, through the one existing resolver

Which dialect's notes the agent sees is **derived** from the connection's
configured `driver`, via the resolver the rest of the system already uses —
`sqlAnalysisDialectForDriver(driver)` in
`packages/cli/src/context/sql-analysis/dialect.ts`. The same function already
selects the dialect for `sql_execution`, `sl_query`, and the Python SQL-analysis
daemon. This spec **must not** introduce a second driver→dialect map. The notes
are **keyed by the resolved `SqlAnalysisDialect`** (so the SQL Server entry is
keyed `tsql`, not `sqlserver`), tying the note key-space to the resolver's
codomain so the two cannot drift.
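
A minimal sketch of how keying the notes by the resolver's return type can keep the two from drifting. Every name here (`dialectForDriver`, `DRIVER_TO_DIALECT`, `notesByDialect`, the placeholder note strings) is an illustrative assumption, not the real `sqlAnalysisDialectForDriver` or its map, and the sketch ignores the unreachable `duckdb`/`databricks` dialects.

```typescript
// Hypothetical stand-in for sqlAnalysisDialectForDriver; the real resolver
// lives in packages/cli/src/context/sql-analysis/dialect.ts and may differ.
type Dialect =
  | "postgres"
  | "mysql"
  | "snowflake"
  | "bigquery"
  | "sqlite"
  | "clickhouse"
  | "tsql";

const DRIVER_TO_DIALECT = new Map<string, Dialect>([
  ["postgres", "postgres"],
  ["mysql", "mysql"],
  ["snowflake", "snowflake"],
  ["bigquery", "bigquery"],
  ["sqlite", "sqlite"],
  ["clickhouse", "clickhouse"],
  ["sqlserver", "tsql"], // the driver id and the dialect id differ here
]);

function dialectForDriver(driver: string): Dialect {
  // Unknown drivers fall back to postgres, mirroring the spec's safety floor.
  return DRIVER_TO_DIALECT.get(driver) ?? "postgres";
}

// The key type is derived from the resolver's return type. A key the resolver
// can never produce (for example "sqlserver") fails to compile, and so does a
// dialect that is added to the resolver without a notes entry.
type NoteKey = ReturnType<typeof dialectForDriver>;

const notesByDialect: Record<NoteKey, string> = {
  postgres: "placeholder: postgres notes",
  mysql: "placeholder: mysql notes",
  snowflake: "placeholder: snowflake notes",
  bigquery: "placeholder: bigquery notes",
  sqlite: "placeholder: sqlite notes",
  clickhouse: "placeholder: clickhouse notes",
  tsql: "placeholder: tsql notes",
};

function notesForDriver(driver: string): string {
  return notesByDialect[dialectForDriver(driver)];
}
```

Under these assumptions, `notesForDriver("sqlserver")` returns the `tsql` entry, and there is no second driver-to-dialect map for the notes to disagree with.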

### Authored per-engine notes are sanctioned static content

Enumerating syntax notes per engine is **not** a rotting denylist of bad
specifics; FQTN form and identifier quoting are genuine, stable invariants of each
engine — the kind of universal fact **ktx**'s design rules explicitly permit as
static content. What must stay derived-from-state is note *selection* (the active
dialect) and note *coverage* (every configured driver must resolve to notes that
exist), both of which this spec ties to the connector registry.

### The flat skill stays dialect-agnostic (spec 07 invariant preserved)

This work adds a *separate* channel. It does **not** amend spec 07's `<sql_craft>`
block or inline any dialect syntax into `SKILL.md`. Spec 07's acceptance criterion
— no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in the flat skill — stays
green. The only `SKILL.md` change is the pointer in requirement 3, which names the
tool and contains no dialect syntax.

## Requirements

### 1. A read-only `sql_dialect_notes` MCP tool

Register a new tool beside the existing context tools
(`packages/cli/src/context/mcp/context-tools.ts`). The tool name is the
implementer's to finalize but should follow the existing snake_case convention
(`entity_details`, `sql_execution`); `sql_dialect_notes` is the suggested name.

- **Input:** `{ connectionId }`, **required** — matching its siblings
`entity_details`/`sql_execution`, which always take an explicit connection.
- **Output:** `{ connectionId, dialect, notes }` where `dialect` is the resolved
`SqlAnalysisDialect` and `notes` is the markdown guidance for that dialect.
- **Resolution:** `connectionId → connection.driver →
sqlAnalysisDialectForDriver(driver) → notes[dialect]`, reusing the existing
resolver. Do not duplicate the driver→dialect map.
- **Guards:**
  - A **non-SQL context-source** connection (driver `metabase`, `looker`,
    `lookml`, `notion`, `dbt`, `metricflow`) returns a **clear "not a SQL
    warehouse connection" error**, not postgres notes. Gate on the existing
    `isDatabaseDriver()` (`packages/cli/src/connection-drivers.ts`).
  - For any **SQL warehouse** connection the resolver always yields a dialect with
    notes (all seven warehouse drivers are covered — requirement 2); its built-in
    `postgres` default is a safety floor, so the tool never errors for a SQL
    connection and never emits a single-engine dialect (e.g. Snowflake) by
    accident.
- **Annotations:** read-only and idempotent, consistent with the other read
tools.
- **Description (docs-grade, third person, states what and when):** e.g.
*"Returns the SQL syntax conventions for a connection's dialect — FQTN form,
identifier quoting and case-folding, date/time functions, top-N idiom, and
semi-structured access. Use before authoring raw SQL against a connection so the
SQL matches that engine."* The description drives the agent's decision to call
the tool, so it must be specific.
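
The resolution chain and guards above can be sketched as a single function. This is a rough illustration under stated assumptions, not the shipped handler: `Connection`, `ExpectedError`, `NOTES`, and `sqlDialectNotesHandler` are hypothetical names, only a subset of drivers is modeled, and `isDatabaseDriver` and `dialectForDriver` are local stand-ins for the real `isDatabaseDriver()` and `sqlAnalysisDialectForDriver()`.

```typescript
// Hypothetical shapes throughout: the real connection config, resolver, notes
// store, and error class live in the ktx codebase and may look different.
type Dialect = "postgres" | "sqlite" | "snowflake" | "tsql";

interface Connection {
  id: string;
  driver: string;
}

interface DialectNotesResult {
  connectionId: string;
  dialect: Dialect;
  notes: string;
}

// Stand-in for an "expected" error the agent can read and recover from.
class ExpectedError extends Error {}

// Only a subset of warehouse drivers is modeled to keep the sketch short.
const DATABASE_DRIVERS = new Set(["postgres", "sqlite", "snowflake", "sqlserver"]);

function isDatabaseDriver(driver: string): boolean {
  return DATABASE_DRIVERS.has(driver);
}

function dialectForDriver(driver: string): Dialect {
  switch (driver) {
    case "sqlite":
      return "sqlite";
    case "snowflake":
      return "snowflake";
    case "sqlserver":
      return "tsql";
    default:
      return "postgres"; // safety floor for SQL drivers without a mapping
  }
}

const NOTES: Record<Dialect, string> = {
  postgres: "placeholder: postgres notes",
  sqlite: "placeholder: sqlite notes",
  snowflake: "placeholder: snowflake notes",
  tsql: "placeholder: tsql notes",
};

function sqlDialectNotesHandler(
  connections: readonly Connection[],
  connectionId: string,
): DialectNotesResult {
  const connection = connections.find((c) => c.id === connectionId);
  if (connection === undefined) {
    throw new ExpectedError(`Unknown connection "${connectionId}".`);
  }
  // Guard before resolving: otherwise a non-SQL source would silently hit the
  // postgres default and receive postgres notes.
  if (!isDatabaseDriver(connection.driver)) {
    throw new ExpectedError(
      `Connection "${connectionId}" (driver "${connection.driver}") is not a SQL warehouse connection.`,
    );
  }
  const dialect = dialectForDriver(connection.driver);
  return { connectionId, dialect, notes: NOTES[dialect] };
}
```

The ordering is the point of the sketch: because the resolver defaults to `postgres`, the non-SQL guard has to run first, or a `metabase` connection would be answered with postgres notes instead of an error.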

### 2. Per-dialect note content

Author concise notes for each supported dialect against a **fixed rubric**, so
every dialect answers the same questions. Each facet is a line or two of timeless,
engine-true convention (no version-dated "as of vX" content), phrased as
guidance with the engine reason where it helps — inheriting spec 07's
heuristics-with-a-why tone. The rubric facets:

1. **FQTN form** — how to fully-qualify a table on this engine.
2. **Identifier quoting & case-folding** — quote character and how unquoted
identifiers fold.
3. **Date/time** — the engine's date functions and common date-encoding idioms.
4. **Top-N / window-filtering idiom** — `QUALIFY` where supported; a CTE +
outer-filter form where it is not; `TOP` for `tsql`.
5. **Semi-structured / JSON access** — VARIANT colon-paths, `JSON_VALUE`/
`JSON_EXTRACT`, `->`/`->>`, `json_extract`, as applicable.
6. **Sharded / partition idiom** where the engine has one (e.g. BigQuery
`_TABLE_SUFFIX`).

Constraints on the content:

- **Coverage = the reachable dialect set.** Every warehouse driver in the connector
registry must resolve to a dialect that has non-empty notes. The reachable set is
`postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, and
`tsql` (from `sqlserver`). Do **not** author notes for `duckdb`/`databricks`:
they appear in the resolver map but no connector can produce them, so they are
unreachable — matching the draft's "don't author for nonexistent drivers."
- **Keyed by `SqlAnalysisDialect`** (see Model).
- **Storage is the implementer's choice.** The notes MAY live as per-dialect
markdown files inside the package (e.g. under the skill's directory) served by
the tool, or as a typed map. If files are used they are **package-internal** —
served by the tool, never installed onto an agent target — and already ship via
the recursive `src/skills → dist/skills` copy
(`packages/cli/scripts/copy-runtime-assets.mjs`); no `setup-agents.ts` change.
- **No benchmark, gold-answer, grader, or scoring references** anywhere in the
notes.

The implementer must verify each engine's specifics against current official
documentation (the well-known anchors above are starting points, not a
substitute for checking the engine's docs).

### 3. The `SKILL.md` pointer (completes spec 07's deferral)

Add a **single one-line pointer** to the SQL-authoring step (step 4 "Plan" / step
5 "Query") of `packages/cli/src/skills/analytics/SKILL.md`, directing the agent to
call the tool before writing raw SQL against a connection — e.g. *"Before writing
raw `sql_execution` SQL, call `sql_dialect_notes` with the connection's id to get
that engine's syntax conventions."* This is the pointer spec 07 deliberately did
not add because the tool did not yet exist.

- The pointer **names the tool only**; it contains **no dialect syntax**, so the
flat skill stays dialect-agnostic.
- Follow the skill's existing tool-reference convention. The skill currently names
MCP tools by **bare** name (`discover_data`, `sql_execution`). Anthropic's
guidance recommends **fully-qualified** `ServerName:tool` names to avoid
"tool not found" when multiple MCP servers are present. Whether to fully-qualify
the new pointer (and optionally retrofit the existing bare references) is a
small, separable decision flagged for the maintainer — **not** a rename sweep
this spec mandates.

### 4. Coverage is enforced from state, not by hand

A test must **derive** the required coverage from the connector registry rather
than hardcoding a dialect list: enumerate the configured warehouse drivers
(`warehouseDrivers` in `driver-schemas.ts` / `KTX_DATABASE_DRIVER_IDS` in
`connection-drivers.ts`), resolve each through `sqlAnalysisDialectForDriver`, and
assert each result has non-empty notes. Adding a connector later then **fails this
test** until its dialect gets notes — the allowlist-from-state discipline, not a
hand-maintained list.
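
A rough sketch of the registry-derived coverage check described above. The registry contents, the resolver, and the notes map are hypothetical stand-ins (the real ones are `KTX_DATABASE_DRIVER_IDS` / `warehouseDrivers` and `sqlAnalysisDialectForDriver`), and `driversMissingNotes` is a name invented for this illustration.

```typescript
// Returns the drivers whose resolved dialect has no notes, or only blank ones.
// The required set comes from the registry argument, never from a literal list
// inside the check itself.
function driversMissingNotes(
  drivers: readonly string[],
  resolveDialect: (driver: string) => string,
  notesByDialect: Readonly<Record<string, string | undefined>>,
): string[] {
  return drivers.filter((driver) => {
    const notes = notesByDialect[resolveDialect(driver)];
    return notes === undefined || notes.trim() === "";
  });
}

// Hypothetical registry, resolver, and notes, for illustration only.
const registry = ["postgres", "sqlite", "sqlserver"] as const;

const resolve = (driver: string): string =>
  driver === "sqlserver" ? "tsql" : driver;

const notes: Record<string, string | undefined> = {
  postgres: "placeholder: postgres notes",
  sqlite: "placeholder: sqlite notes",
  tsql: "placeholder: tsql notes",
};

// Today every registered driver resolves to a dialect with notes.
const missingToday = driversMissingNotes(registry, resolve, notes);

// Simulate a connector added later whose dialect has no notes yet.
const missingAfterNewConnector = driversMissingNotes(
  [...registry, "newengine"],
  resolve,
  notes,
);

console.log(missingToday); // []
console.log(missingAfterNewConnector); // ["newengine"]
```

In this sketch the hypothetical resolver passes unknown drivers through unchanged. A resolver with a `postgres` default would flag a new connector only once that connector maps to a dialect that lacks notes, so the real test depends on the resolver map being extended alongside the registry.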

### 5. No dialect syntax leaks into the flat skill

Spec 07's content assertion over `analytics/SKILL.md` stays green: the flat skill
(and its worked example) still contain no `QUALIFY`, `strftime`, `julianday`,
backtick/`DB.SCHEMA.TABLE` FQTN, or other single-engine construct. This spec adds
a tool and a tool-pointer; it does not move dialect syntax into the skill.

### 6. Delivery is unchanged

`setup-agents.ts` (`readAnalyticsSkillContent`, `installTarget`,
`writeClaudeDesktopSkillBundle`, `plannedKtxAgentFiles`) needs **no change**. The
skill still installs as a single `SKILL.md` per target. Confirm the channel works
on all six targets — Claude Code, Claude Desktop (zip), Codex, universal
`.agents`, Cursor (`.mdc`), OpenCode (`.md`) — by virtue of being a tool call,
including the single-file targets where multi-file delivery could not scope.

### 7. Coordination with specs 07 and 03

- **Spec 07** owns the dialect-agnostic `<sql_craft>` block. This spec must not
amend it; it adds the tool, the pointer, and the notes.
- **Spec 03** (`03-multi-connection-routing-in-analytics-skill`) threads
`connectionId` through the skill's tool calls. The `sql_dialect_notes` pointer
is `connectionId`-scoped and fits that routing; keep the pointer consistent with
spec 03's `connectionId` rules and do not rewrite the routing it owns.

## Acceptance criteria

- An agent querying a **sqlite** connection gets sqlite date idioms and **never**
sees Snowflake/BigQuery-only syntax; an agent querying **Snowflake** gets
FQTN / identifier / VARIANT guidance.
- The dialect shown is **derived from the connection's configured `driver`** via
the existing `sqlAnalysisDialectForDriver`, not hardcoded per project and not
guessed. No second driver→dialect map is introduced.
- **Every configured warehouse driver** (`postgres`, `mysql`, `snowflake`,
`bigquery`, `sqlite`, `clickhouse`, `sqlserver`) resolves to a dialect with
non-empty notes, and the coverage test derives this from the registry.
- A **non-SQL context-source** connection (e.g. `metabase`, `notion`) yields a
clear "not a SQL warehouse" response, **not** postgres notes.
- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are
unaffected. The new pointer references the tool only and adds no dialect syntax.
- The channel installs/serves correctly across **all six** agent targets,
including the single-file Cursor/OpenCode shape, with **no `setup-agents.ts`
change**.
- The notes contain **no** benchmark/gold/grader/scoring references and **no**
time-sensitive ("as of version X") content.
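
The no-leak criterion in the list above lends itself to a small content check. The following is a sketch under assumptions: the marker table is hypothetical (tokens this spec names as belonging to one engine), the sample note strings are invented for the example, and the real test's token set may differ.

```typescript
// Hypothetical marker table: tokens assumed to belong to exactly one engine.
const EXCLUSIVE_MARKERS: ReadonlyArray<{ owner: string; pattern: RegExp }> = [
  { owner: "snowflake", pattern: /\bVARIANT\b/i },
  { owner: "bigquery", pattern: /_TABLE_SUFFIX/i },
  { owner: "sqlite", pattern: /\b(strftime|julianday)\b/i },
];

// Returns the engines whose exclusive markers appear in another dialect's
// notes. An empty array means nothing leaked into `notes`.
function leakedEngines(dialect: string, notes: string): string[] {
  return EXCLUSIVE_MARKERS.filter(
    (marker) => marker.owner !== dialect && marker.pattern.test(notes),
  ).map((marker) => marker.owner);
}

// Invented sample notes: one clean, one with Snowflake syntax leaking in.
const cleanSqliteNotes = "Use strftime('%Y', order_date) to extract the year.";
const leakySqliteNotes = "Read VARIANT fields with col:field::string.";

console.log(leakedEngines("sqlite", cleanSqliteNotes)); // []
console.log(leakedEngines("sqlite", leakySqliteNotes)); // ["snowflake"]
```

A dialect's own markers are allowed in its own notes, which is why the filter skips entries whose `owner` matches the dialect under test.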

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
the design.

- **Dialect resolver (reuse, do not duplicate):**
`packages/cli/src/context/sql-analysis/dialect.ts` —
`sqlAnalysisDialectForDriver(driver)`, returning `SqlAnalysisDialect`
(`./ports.ts`), default `postgres`.
- **Connector registry (drives coverage):**
`packages/cli/src/connection-drivers.ts` (`KTX_DATABASE_DRIVER_IDS`,
`isDatabaseDriver`) and `packages/cli/src/context/project/driver-schemas.ts`
(`warehouseDrivers`, the per-driver `connectionConfigSchema`).
- **MCP tool registration:** `packages/cli/src/context/mcp/context-tools.ts`
(register beside `connection_list`, `entity_details`, `sql_execution`); the
`connectionId → driver → dialect` resolution already exists for `sql_execution`
in `packages/cli/src/context/mcp/local-project-ports.ts` — route the new tool
through the same path.
- **The skill (one-line pointer only):**
`packages/cli/src/skills/analytics/SKILL.md` — add the tool pointer in step 4/5;
leave `<workflow>`/`<rules>`/`<sql_craft>`/`<examples>` otherwise intact.
- **Note storage (if files):** under the skill directory, shipped by
`packages/cli/scripts/copy-runtime-assets.mjs`'s recursive copy; served by the
tool, never installed.
- **Delivery (confirm unchanged):** `packages/cli/src/setup-agents.ts`.
- **Tests:** unit tests for resolution (including `sqlserver → tsql`, unknown →
`postgres`, and non-warehouse rejection); a registry-derived coverage test
(requirement 4); a content test that each dialect's notes cover the rubric
facets and contain no banned tokens; and an extension of spec 07's
`analytics/SKILL.md` content test asserting the new pointer is present and the
flat skill is still dialect-clean. Rebuild and re-link the dev binary so the
playground picks up the change: `pnpm run build && pnpm run link:dev`.

## Benchmark context (motivation only)

The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake
(`DB.SCHEMA.TABLE` FQTNs, double-quoted lower-case columns, VARIANT colon-paths),
BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite
(`strftime`/`julianday`). That content is real and useful but engine-specific;
spec 07 kept it out of the flat skill and deferred it here so the dialect-agnostic
rules stay clean. Delivering it through a dialect-scoped **ktx** tool generalizes
the same correctness benefit to every multi-engine **ktx** project — improving the
benchmark score is a side effect, not the goal, and the shipped skill contains no
trace of the benchmark.

## Implementation notes

Implemented on branch `write-feature-spec-wiki`, alongside spec 07. The committed
decision (dynamic MCP delivery, not multi-file skill bundling) was implemented as
specified — no `setup-agents.ts` change.

**What was built**
- Per-dialect notes are markdown files under
`packages/cli/src/context/sql-analysis/dialects/<dialect>.md` (one each for
`postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, `tsql`),
served by `sqlDialectNotes(dialect)` in `sql-analysis/dialect-notes.ts` (lazy
read + cache, `postgres` fallback floor; the authored set is the
`DIALECTS_WITH_NOTES` const). `duckdb`/`databricks` are intentionally unauthored
(unreachable from any connector). Each note answers the fixed rubric — FQTN,
identifier quoting/case-folding, date/time, top-N/window idiom,
JSON/semi-structured, plus a sharded-table line for BigQuery. Engine specifics
were verified against current docs via Context7 (Snowflake VARIANT colon-paths
and unquoted→UPPER case-folding; BigQuery `_TABLE_SUFFIX`, `QUALIFY`,
`JSON_VALUE`; ClickHouse `LIMIT n BY` and `JSONExtract*`, with no `QUALIFY`). The
files are package-internal — `copy-runtime-assets.mjs` ships them to `dist`; they
are never installed onto an agent target.
- New read-only MCP tool `sql_dialect_notes` (`context-tools.ts`): input
`{ connectionId }` (required), output `{ connectionId, dialect, notes }`, with
read-only and idempotent annotations. It resolves through the **existing**
`connectionId → connection.driver → sqlAnalysisDialectForDriver` path (no second
driver→dialect map), implemented as the unconditional `dialectNotes` port in
`local-project-ports.ts` via an extracted `resolveDialectNotesForConnection`. A
non-SQL context source (gated by `isDatabaseDriver`) throws `KtxExpectedError`
("not a SQL warehouse") rather than returning postgres notes, so the expected
agent mistake stays out of Error Tracking.
- `connection-drivers.ts`: `KTX_DATABASE_DRIVER_IDS` is now an exported (`@internal`)
readonly tuple so the coverage test derives required coverage from the registry;
`isDatabaseDriver` behavior is unchanged.
- `skills/analytics/SKILL.md`: a single dialect-agnostic pointer in step 5 ("call
`sql_dialect_notes` … to get that engine's FQTN, identifier-quoting, date, top-N,
and JSON conventions"). It names the tool only; spec 07's `<sql_craft>` block and
its dialect-clean content test are untouched.

**Tests**
- `test/context/mcp/dialect-notes.test.ts`: registry-derived coverage (a future
connector fails the test until its dialect has notes), the full rubric per dialect,
leak isolation (sqlite shows `strftime` and never `VARIANT`/`_TABLE_SUFFIX`;
`QUALIFY` only on snowflake/bigquery; engine-exclusive markers stay put), no
benchmark/grader or version-dated content, the postgres fallback, and
`resolveDialectNotesForConnection` resolving sqlite / snowflake / `sqlserver→tsql`
and rejecting a non-SQL source / unknown connection with `KtxExpectedError`; plus a
guard that the `DIALECTS_WITH_NOTES` const and the `dialects/*.md` files stay in sync.
- `test/context/mcp/server.test.ts`: `sql_dialect_notes` added to the retained tool
set + annotations assertion + a handler-routing test, and the regenerated
`__snapshots__/mcp-tools-list.json`.
- `test/skills/analytics-skill-content.test.ts`: asserts the new pointer is present
and the flat skill stays dialect-clean.

**Verification** — `tsc -p tsconfig.json` (src) clean; full default suite 393 files /
3001 passing; slow suite green (incl. `local-project-ports.test.ts`); all three
`dead-code` checks clean; the `dialects/*.md` files copy into `dist`. Rebuilt and
re-linked `ktx-dev`.

**Deviations / notes**
- Notes are stored as per-dialect markdown files (not a typed map, and not bundled
`reference/*.md` skill files), a storage choice the spec sanctions; plain markdown
is the most maintainable to edit. They are served by the tool and ship via a
`copy-runtime-assets.mjs` entry (`src/context/sql-analysis/dialects → dist/…`); no
`setup-agents.ts` change.
- `pnpm run type-check` still reports one pre-existing, unrelated error in
`test/mcp-server-factory.test.ts` (committed in-flight MCP work on this branch);
this change adds zero new type errors and does not touch that file.
@ -1,362 +0,0 @@
# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill

> Refined spec. Intake draft: `todo/09-fan-out-safe-multi-hop-aggregation.md`.
> Extends spec 07 (`specs/07-analytics-skill-sql-craft.md`), which shipped the
> `<sql_craft>` block. Additive, content-only.

## Problem

The shipped `ktx-analytics` skill
(`packages/cli/src/skills/analytics/SKILL.md`) already carries a single-hop
fan-out rule in `<sql_craft>` → **Composition**:

> **Avoid fan-out joins.** Add columns only from tables already at the target
> grain, or pre-aggregate to that grain before joining. A join that multiplies
> rows quietly inflates every downstream `SUM`/`COUNT`.

In practice the agent honors that on a single join but still **silently
fans out on multi-hop join chains**, where the inflation is one or two joins
removed from the aggregate and therefore much harder to notice.

The failure shape: a measure that lives at a *coarse* grain (one row per parent
record) is counted/summed *after* the parent has been joined down to a *finer*
grain (one row per child line). Every parent-level value is then duplicated by
its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by a data-dependent
amount — runnable SQL, plausible-looking number, quietly wrong.

The rule today is stated only as a **prohibition** ("Avoid…"). It needs two
upgrades: (a) generalize it so the danger is understood as *cumulative across a
whole join chain*, not a single join; and (b) pair it with an **affirmative
verification habit** the agent runs while composing, so a grain change is
detected and fixed rather than merely warned against.

## Generic use case (independent of any benchmark)

An analyst on any production warehouse asks a counting/summing question whose
path runs through several one-to-many hops — e.g. *"how many orders per region
contain a returned item?"* where the path is `region → store → order →
order_line`. The honest answer counts each order once. The naïve join chain joins
`order_line` (to apply the line-level condition) and then counts orders, so an
order with three returned lines is counted three times. The inflation happens
**three joins below the `COUNT`**, where it is easy to miss. This is one of the
most common silently-wrong analytics mistakes on normalized schemas — not
specific to any dataset, dialect, or benchmark.
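
The over-count described above can be reproduced with a few in-memory rows. This is an illustrative sketch only: the `Order` and `OrderLine` shapes, the sample data, and both function names are invented for the example, and the `store` hop is folded into `region` to keep it short.

```typescript
interface Order {
  orderId: number;
  region: string;
}

interface OrderLine {
  orderId: number;
  returned: boolean;
}

// Invented sample data: order 1 has three returned lines.
const orders: Order[] = [
  { orderId: 1, region: "north" },
  { orderId: 2, region: "north" },
  { orderId: 3, region: "south" },
];

const orderLines: OrderLine[] = [
  { orderId: 1, returned: true },
  { orderId: 1, returned: true },
  { orderId: 1, returned: true },
  { orderId: 2, returned: false },
  { orderId: 3, returned: true },
];

// Naive shape: join orders down to the line grain, then count joined rows.
// Each order is counted once per matching line, so the result is inflated.
function naiveOrderCount(os: Order[], ls: OrderLine[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const order of os) {
    for (const line of ls) {
      if (line.orderId === order.orderId && line.returned) {
        counts.set(order.region, (counts.get(order.region) ?? 0) + 1);
      }
    }
  }
  return counts;
}

// Grain-safe shape: first reduce the lines to the order grain (the set of
// orders with at least one returned line), then count orders.
function grainSafeOrderCount(os: Order[], ls: OrderLine[]): Map<string, number> {
  const ordersWithReturn = new Set(
    ls.filter((line) => line.returned).map((line) => line.orderId),
  );
  const counts = new Map<string, number>();
  for (const order of os) {
    if (ordersWithReturn.has(order.orderId)) {
      counts.set(order.region, (counts.get(order.region) ?? 0) + 1);
    }
  }
  return counts;
}

console.log(naiveOrderCount(orders, orderLines).get("north")); // 3
console.log(grainSafeOrderCount(orders, orderLines).get("north")); // 1
```

Both versions agree for `south`, where the only order has a single returned line. The naive count is wrong only where the fan-out is greater than one, which is why the error depends on the data and is easy to miss.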
|
|
||||||
|
|
||||||
## Model (invariants — the implementer owns the prose)
|
|
||||||
|
|
||||||
These constrain the change; the exact wording is the implementer's. Each is
|
|
||||||
grounded in Anthropic's skill-authoring and prompt-engineering guidance so the
|
|
||||||
addition stays consistent with how spec 07 was written.

### Additive, inline-only, dialect-agnostic (inherited from spec 07)

The change is **additive content inside `skills/analytics/SKILL.md`** only — no
bundled `reference/*.md` file (the delivery path ships a single `SKILL.md` per
target; see spec 07 §Model "Inline-only delivery"). No new tool, flag, or config.
Every addition must read correctly on any dialect: **no** `QUALIFY`,
`strftime`/`julianday`, backtick/`DB.SCHEMA.TABLE` FQTNs, or other single-dialect
construct — including in the worked example. The existing `<workflow>`, `<rules>`,
`<examples>`, and the other four `<sql_craft>` sub-headings are preserved
unchanged.

### Heuristic-plus-*why*, because SQL authoring is a high-freedom task

Anthropic's "set appropriate degrees of freedom" guidance classifies tasks with
many valid approaches where decisions depend on context as **high freedom →
text-based heuristics**, the "open field, many paths" case (versus low-freedom,
fragile operations that need an exact script). SQL authoring is squarely
high-freedom. So the new content is phrased as **heuristics with a one-line,
universal rationale**, never as bare `ALWAYS`/`NEVER` imperatives — matching the
existing `<sql_craft>` style and Anthropic's "add context / explain why so Claude
generalizes" principle.

### Affirmative framing for the verification step (do, not don't)

Anthropic's prompt-engineering guidance is explicit: **"Tell Claude what to do
instead of what not to do."** The draft's requirement for "a detect-and-fix
*habit*, not just a prohibition" is the same principle. Therefore:

- The **generalized rule keeps the established `Avoid fan-out joins` lead and the
  term `fan-out`** — it is spec 07's consistent terminology and the existing
  content test references that phrase; reframing it would churn shared vocabulary
  for no gain.
- The **new verification step is phrased affirmatively** (e.g. *"Verify the grain
  holds across each join"*) — an action the agent performs while composing, not a
  warning. The two together satisfy both principles: a recognized anti-pattern
  name *and* a positive habit.

### One default with an escape hatch, not two equal options

Anthropic: **"Avoid offering too many options… provide a default with an escape
hatch."** The fix for an inflated aggregate is presented as exactly that:

- **Default: pre-aggregate the measure to its own grain in a CTE, then join the
  already-aggregated result.** This is the single-hop fix generalized, and it is
  the *only* correct fix for `SUM`/`AVG` — you cannot de-duplicate a summed
  measure with `DISTINCT` (two legitimately-equal amounts would collapse).
- **Escape hatch: `COUNT(DISTINCT key)` — for a pure count only.** It rescues an
  inflated count in one line, but must be stated as count-only, not as a general
  remedy.

This is the deepest correctness point in the spec and the easiest to get wrong; a
naïve blanket "just use `COUNT(DISTINCT)`" is silently wrong for sums.
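
To make the `SUM` caveat concrete, here is a minimal sketch in Python, using the standard-library `sqlite3` module purely as a convenient in-memory runner. The `orders` and `order_lines` tables, their columns, and every value are invented for illustration; they are assumptions of this sketch, not part of the skill or of any dataset.

```python
import sqlite3

# Hypothetical teaching schema: two orders share the same amount (50).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER, status TEXT);
INSERT INTO orders VALUES (1, 50), (2, 50), (3, 20);
INSERT INTO order_lines VALUES
  (10, 1, 'returned'), (11, 1, 'returned'),
  (12, 2, 'returned'),
  (13, 3, 'shipped');
""")


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


fanned_out_join = """
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.status = 'returned'
"""

# Order 1 has two returned lines, so its amount is summed twice.
fanned_out_sum = scalar("SELECT SUM(o.amount) " + fanned_out_join)

# DISTINCT de-duplicates values, not orders: the two legitimate 50s collapse.
distinct_sum = scalar("SELECT SUM(DISTINCT o.amount) " + fanned_out_join)

# Default fix: collapse order_lines to one row per qualifying order, then join.
pre_aggregated_sum = scalar("""
WITH returned_orders AS (
  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
)
SELECT SUM(o.amount)
FROM orders o
JOIN returned_orders ro ON ro.order_id = o.order_id
""")

print(fanned_out_sum, distinct_sum, pre_aggregated_sum)
```

On this toy data the fanned-out sum is too high, the `DISTINCT` sum is too low, and only the pre-aggregated query returns the total of the two qualifying orders.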

### Consistent terminology

Anthropic: **"Choose one term and use it throughout."** Reuse spec 07's existing
vocabulary verbatim — **`grain`**, **`fan-out`**, **`pre-aggregate`** — do not
introduce synonyms (e.g. do not rename the concept "row blow-up" or
"multiplication factor"). Prose may vary, but the named concepts stay fixed.

### Concise — the addition must justify its token cost

Anthropic: **"Concise is key… does this paragraph justify its token cost?"** and
"Claude is already very smart." The agent knows what a join and a `GROUP BY` are;
the addition explains only the non-obvious trap (cumulative grain inflation) and
shows the fix. Net addition is roughly one rewritten bullet, one new bullet, and
one worked example — the skill stays comfortably under the 500-line budget
(~117 lines today).

### Examples over descriptions — exactly one

Anthropic's "examples pattern": **"Examples help Claude understand the desired
style and level of detail more clearly than descriptions alone"** and
"examples are concrete, not abstract." The multishot guidance favors 3–5 examples
in general, but here **conciseness and spec 07's one-example-per-rule economy
win**: the skill already carries the window-then-filter example, so this adds
**exactly one** compact wrong-vs-right example. The wrong/right contrast inside
that single example supplies the diversity multishot calls for, at one example's
token cost.

### Leak-safety (hard constraint)

The worked example must be a **synthetic, generic schema invented for teaching** —
not the tables, column names, query, or numeric results of any Spider 2.0-Lite
question. It demonstrates the *pattern* (a coarse-grain measure aggregated after a
one-to-many join), which is universal and reconstructable from first principles. A
reviewer must find nothing in it that ties it to a specific benchmark instance.
See "Leak-safety" below.

## Requirements

Requirements 1–4 all land in the **Composition** sub-heading of `<sql_craft>` in
`packages/cli/src/skills/analytics/SKILL.md`; requirement 5 only annotates spec 07.
Structure (chosen design): rewrite the existing fan-out bullet, add one
affirmative verification bullet, add one worked example. Do not touch the other
four sub-headings or `<workflow>`/`<rules>`/`<examples>`.

### 1. Generalize the fan-out rule to multi-hop chains

Rewrite the existing **`Avoid fan-out joins.`** bullet so it makes explicit that
the danger is **cumulative**: *any* one-to-many hop on the path between a measure's
owning table and the aggregate inflates that measure, **even when the offending
join is several hops away from the `SUM`/`COUNT`**. The fix is the same as the
single-hop case — **pre-aggregate the measure to its own grain in a CTE, then join
the already-aggregated result** — but the agent must apply it **per
measure-owning table along the whole chain**, not just at the final join. Keep the
`fan-out` term and the one-line *why*.

### 2. Add an affirmative grain-verification habit

Add a companion bullet, phrased as an action the agent performs **while
composing** (not a prohibition):

- Confirm that a join intended to be one-to-one / many-to-one **did not change the
  grain** it aggregates at — e.g. check that the row count (or the count of the
  aggregate's key) is unchanged across that join.
- When a join is genuinely one-to-many, **reach for the default fix
  (pre-aggregate to grain)**; for a **pure count**, `COUNT(DISTINCT key)` is an
  acceptable escape hatch.
- State the caveat once: **`SUM`/`AVG` of a fanned-out measure must pre-aggregate**
  — `DISTINCT` cannot de-duplicate a sum.

This is spec 07's "build incrementally and check each layer" discipline pointed
specifically at grain preservation, in affirmative form.
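
The row-count check described above can be sketched as follows. This is an illustrative Python snippet that uses the standard-library `sqlite3` module only as a runner; the schema, the `grain_check` helper, and the data are hypothetical and exist only to show the habit.

```python
import sqlite3

# Hypothetical teaching schema: orders is the grain we intend to aggregate at.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO stores VALUES (1, 100), (2, 100);
INSERT INTO orders VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO order_lines VALUES (10, 1), (11, 1), (12, 2), (13, 3);
""")


def grain_check(from_clause):
    """Return (row_count, distinct_order_count) for a FROM clause aliasing orders as o."""
    return conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT o.order_id) " + from_clause
    ).fetchone()


base = grain_check("FROM orders o")
many_to_one = grain_check("FROM orders o JOIN stores s ON s.store_id = o.store_id")
one_to_many = grain_check("FROM orders o JOIN order_lines l ON l.order_id = o.order_id")

# The grain holds when the row count still equals the distinct key count.
for label, (rows, keys) in [("base", base), ("stores", many_to_one), ("order_lines", one_to_many)]:
    print(label, rows, keys, "grain holds" if rows == keys else "fan-out")
```

The many-to-one join leaves both numbers unchanged, while the one-to-many join raises the row count above the key count, which is the signal to pre-aggregate.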

### 3. One concrete, generic multi-hop worked example

Add **exactly one** compact wrong-vs-right `sql` example inside `<sql_craft>`
demonstrating the multi-hop inflation and the pre-aggregate fix. It is the
**second** `sql` fence in the skill (the first is spec 07's window-then-filter
example).

**Required properties** (these are the constraints; the SQL below is orientation):

- **Multi-hop chain** where the inflating one-to-many hop is **≥1 join removed**
  from the aggregate (not the single-hop case spec 07 already covers).
- **Unambiguous attribution**: each counted entity maps to **exactly one** group,
  so the honest answer is well-defined. (This rules out "coarse measure attributed
  to a fine dimension reached by descending," where one entity spans several
  groups and the correct number is itself ambiguous — that would teach a murky
  pattern.)
- **Motivated descent**: the finer-grain table is joined for a real reason (a
  line-level filter or a needed line-level value), so the reader sees *why* the
  fan-out join is there.
- **Plain `COUNT`/`SUM`**, not `AVG` — averaging collides with the existing
  *Macro vs micro average* bullet and would muddy the fan-out lesson.
- The **RIGHT side demonstrates the default fix** (pre-aggregate to grain in a
  CTE) and is **actually correct**, not merely runnable — its number must equal the
  honest answer, not just avoid an error.
- Generic invented schema, standard dialect-agnostic SQL (no `QUALIFY`, no dialect
  functions), no benchmark identifiers or values.

**Recommended sketch** (implementer may adjust within the properties above):

```sql
-- "How many orders per region contain a returned item?"
-- WRONG: joining order_lines to apply the line-level filter multiplies orders —
-- an order with two returned lines is counted twice, three joins below the COUNT.
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.status = 'returned'
GROUP BY r.region_id;

-- RIGHT: collapse order_lines to one row per qualifying order first, then join up.
WITH returned_orders AS (
  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN returned_orders ro ON ro.order_id = o.order_id
GROUP BY r.region_id;
-- A pure count could also use COUNT(DISTINCT o.order_id); a SUM/AVG of an
-- order-level measure fanned out this way must pre-aggregate — DISTINCT can't
-- de-duplicate a sum.
```
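
One way to check that the RIGHT side of the sketch returns the honest number is to run both queries against a tiny synthetic dataset. The Python snippet below is a hedged illustration: it uses the standard-library `sqlite3` module only as an in-memory runner, the rows are invented, and an `ORDER BY` is added (an assumption of this snippet, not of the sketch) so the output order is deterministic.

```python
import sqlite3

# Invented data: order 100 has two returned lines, order 200 has one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY);
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER, status TEXT);
INSERT INTO regions VALUES (1), (2);
INSERT INTO stores VALUES (10, 1), (20, 2);
INSERT INTO orders VALUES (100, 10), (101, 10), (200, 20);
INSERT INTO order_lines VALUES
  (1, 100, 'returned'), (2, 100, 'returned'),
  (3, 101, 'shipped'),
  (4, 200, 'returned');
""")

WRONG = """
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.status = 'returned'
GROUP BY r.region_id
ORDER BY r.region_id
"""

RIGHT = """
WITH returned_orders AS (
  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN returned_orders ro ON ro.order_id = o.order_id
GROUP BY r.region_id
ORDER BY r.region_id
"""

# Count-only escape hatch: same join chain as WRONG, but counting distinct orders.
ESCAPE_HATCH = WRONG.replace("COUNT(*)", "COUNT(DISTINCT o.order_id)")

wrong_rows = conn.execute(WRONG).fetchall()
right_rows = conn.execute(RIGHT).fetchall()
escape_rows = conn.execute(ESCAPE_HATCH).fetchall()
print(wrong_rows, right_rows, escape_rows)
```

On this data the WRONG query reports two orders for region 1 although only one order there contains a returned item; the pre-aggregated query and the count-only escape hatch agree on the honest count.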

### 4. Placement and structure

- Both bullets live under the existing **Composition** sub-heading; the example
  follows them. The five-sub-heading structure spec 07 established is unchanged.
- **State each rule once** (Anthropic "consistent terminology / don't repeat"):
  do not also restate the multi-hop rule in `<workflow>` steps 5/6 — those already
  carry a one-line pointer into `<sql_craft>`, which is sufficient.

### 5. Coordination with spec 07 (supersession)

Spec 07's requirement 3 and acceptance criteria say the skill contains **exactly
one** worked example and "Do not add a second example." **This spec supersedes
that constraint**: the skill now carries **two** `sql` worked examples
(window-then-filter from spec 07, plus this multi-hop fan-out example). Annotate
spec 07 at those two spots with a one-line "superseded by spec 09" note so the two
permanent specs do not contradict. No other spec 07 content changes.

## Leak-safety (hard constraint on this spec and its example)

The benchmark's gold answers must never appear in ktx. The worked example must be
a **synthetic, generic schema invented for teaching** — not the tables, column
names, query, or numeric results of any Spider 2.0-Lite question. The example
demonstrates the *pattern* (a coarse-grain measure counted after a one-to-many
join), which is universal; it must be reconstructable from first principles by
anyone, with zero reference to benchmark data. A reviewer should be able to read
the example and find nothing that ties it to a specific benchmark instance.

## Acceptance criteria

- The `<sql_craft>` **Composition** section states the **multi-hop generalization**
  of the fan-out rule (cumulative danger across the chain; pre-aggregate per
  measure-owning table) and an **affirmative grain-verification habit**, inline and
  dialect-agnostic.
- The fix is presented as **default (pre-aggregate to grain) + escape hatch
  (`COUNT(DISTINCT key)`, count-only)**, with the explicit caveat that `SUM`/`AVG`
  of a fanned-out measure must pre-aggregate.
- Exactly **one** new, **generic** worked example (wrong vs. pre-aggregated-right)
  using an invented schema, with no benchmark-derived identifiers or values, whose
  RIGHT side is actually correct (unambiguous attribution; honest number).
- The skill now contains **two** `sql` worked examples total; the existing content
  test's fence-count assertion is updated `1 → 2` and new assertions cover the
  multi-hop rule phrase and the grain-verification-habit phrase.
- Terminology is consistent with spec 07 (`grain`, `fan-out`, `pre-aggregate`); no
  synonyms introduced.
- **No new tool, flag, or config.** Skill-content only; additive to spec 07.
- All spec 07 invariants still hold: the skill remains dialect-agnostic (no
  `QUALIFY`/`strftime`/`julianday`, no backtick three-part FQTN, no relative-time
  anchoring to a `MAX(...)` date) and free of any benchmark/grader/gold reference,
  including in the new example; `<workflow>`/`<rules>`/`<examples>` and the other
  four sub-headings are intact; frontmatter still parses through
  `SkillsRegistryService.parseFrontmatter`; the skill stays under 500 lines.
- Spec 07's "exactly one example" constraint is annotated as superseded (no
  contradiction between the two permanent specs).

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.

- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md` →
  `<sql_craft>` → **Composition**. Rewrite the `Avoid fan-out joins` bullet, add
  the affirmative grain-verification bullet, add the one worked example after them.
  Leave the other four sub-headings, `<workflow>`, `<rules>`, and `<examples>`
  unchanged.
- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`. Update the
  "ships exactly one … worked example" test: `match(/```sql/g)` length `1 → 2`,
  add an assertion for the new fan-out example's distinctive tokens (e.g.
  `WITH returned_orders AS`), add the multi-hop-rule and grain-verification-habit
  phrases to the behavior-presence list, and keep all banned-construct and
  size-budget guards. This is a content assertion over the source `SKILL.md` — the
  right level for prompt content.
- **Spec 07 annotation:** add a one-line "superseded by spec 09" note at spec 07's
  requirement 3 and at its "Exactly one new worked example" acceptance bullet.
- **Rebuild/re-link** the dev binary so the playground picks up the change:
  `pnpm run build && pnpm run link:dev` (provides `ktx-dev`).
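
The kind of content assertion described in the **Tests** bullet can be sketched language-neutrally. The snippet below is written in Python rather than in the repository's own test framework, and `check_skill_content`, the sample text, and the phrase lists are all hypothetical stand-ins; it only illustrates the shape of a fence-count, phrase-presence, and banned-construct check, not the actual test file.

```python
import re

FENCE = "`" * 3  # built programmatically so this snippet never contains a literal fence


def check_skill_content(text, expected_sql_fences, required_phrases, banned_tokens):
    """Return a list of problems; an empty list means the content passes."""
    problems = []
    fences = len(re.findall(re.escape(FENCE + "sql"), text))
    if fences != expected_sql_fences:
        problems.append(f"expected {expected_sql_fences} sql fences, found {fences}")
    for phrase in required_phrases:
        if phrase not in text:
            problems.append(f"missing phrase: {phrase}")
    for token in banned_tokens:
        if token.lower() in text.lower():
            problems.append(f"banned construct: {token}")
    return problems


# Hypothetical stand-in for the skill text, with two sql fences.
sample = "\n".join([
    "Avoid fan-out joins: the danger is cumulative.",
    "Verify the grain holds across each join.",
    FENCE + "sql", "SELECT 1;", FENCE,
    FENCE + "sql", "WITH returned_orders AS (SELECT 1 AS order_id) SELECT 1;", FENCE,
])

problems = check_skill_content(
    sample,
    expected_sql_fences=2,
    required_phrases=[
        "the danger is cumulative",
        "Verify the grain holds across each join",
        "WITH returned_orders AS",
    ],
    banned_tokens=["QUALIFY", "strftime", "julianday"],
)
print(problems)
```

Returning a list of problems instead of failing on the first one is a design choice of this sketch: it reports every missing phrase or banned construct in one run.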

## Benchmark context (motivation only)

Multi-hop aggregation questions (counting/averaging a coarse-grained measure
reached through several one-to-many joins) are a recurring source of
result-mismatch failures in the SQLite subset: the agent produces runnable SQL
with the right tables but a fan-out-inflated number. These are correctness
failures, not knowledge or schema-discovery failures (zero execution errors in the
latest run), so the fix belongs in the product's authoring craft — where it also
helps any real analyst — not in a benchmark-specific prompt. The skill itself must
contain no trace of the benchmark.

## Implementation notes

Shipped as specified — additive, content-only, no new tool/flag/config.

- **`packages/cli/src/skills/analytics/SKILL.md`** → `<sql_craft>` → **Composition**:
  - Rewrote the `Avoid fan-out joins` bullet to `**Avoid fan-out joins — the
    danger is cumulative.**`, generalizing to multi-hop chains: any one-to-many
    hop between a measure's owning table and the aggregate inflates that measure
    even when several hops below the `SUM`/`COUNT`; fix is pre-aggregate per
    measure-owning table along the whole chain. Kept the `fan-out` term and the
    one-line *why*.
  - Added the affirmative `**Verify the grain holds across each join.**` bullet:
    confirm a one-to-one / many-to-one join did not change the grain (row/key
    count unchanged); default fix is pre-aggregate to grain, escape hatch is
    `COUNT(DISTINCT key)` for a pure count only; stated once that `SUM`/`AVG` of a
    fanned-out measure must pre-aggregate because `DISTINCT` cannot de-duplicate a
    sum.
  - Added one generic wrong-vs-right worked example (orders→regions via
    stores/order_lines, `WITH returned_orders AS …`) — the second `sql` fence in
    the skill. The inflating hop is three joins below the `COUNT`; the RIGHT side
    pre-aggregates `order_lines` to one row per qualifying order so each order is
    counted once (honest answer), and the trailing comment names the count-only
    `COUNT(DISTINCT o.order_id)` escape hatch plus the `SUM`/`AVG` caveat. Invented
    schema, dialect-agnostic SQL, no benchmark identifiers/values.
  - The other four sub-headings and `<workflow>`/`<rules>`/`<examples>` are
    untouched. Skill is 147 lines (well under the 500-line budget).
- **`packages/cli/test/skills/analytics-skill-content.test.ts`**: sql-fence count
  `1 → 2`; added the multi-hop phrase (`the danger is cumulative`) and the
  grain-verification phrase (`Verify the grain holds across each join`) to the
  behavior-presence list; added new-example token assertions
  (`WITH returned_orders AS`, `COUNT(DISTINCT o.order_id)`). All banned-construct,
  relative-time, and size-budget guards retained. Test file passes (9/9).
- **Spec 07** annotated as superseded at requirement 3 and at its "exactly one
  worked example" acceptance bullet — no contradiction between the two permanent
  specs.

**Verification:** `vitest run test/skills/analytics-skill-content.test.ts` → 9
passed. `pnpm run build` (src `tsc -p tsconfig.json`) succeeds and the built
`dist/skills/analytics/SKILL.md` carries the new content; `pnpm run link:dev`
re-linked `ktx-dev`. A pre-existing, unrelated type error in
`test/mcp-server-factory.test.ts` (`KtxMcpContextPorts`/`context_tool`, last
touched in commit `2677b3ef`) surfaces under the full `type-check`'s
`tsconfig.test.json` pass; it is outside this change's surface and not introduced
here.
@ -1,289 +0,0 @@
# Panel/period completeness — emit the full set of groups, not only the populated ones

> Refined spec. Intake draft: `todo/10-panel-completeness-spine.md`.

## Problem

When a question asks for a result *per period* or *per category* ("orders for
each month of 2023", "revenue by region", "count per status"), a plain `GROUP BY`
only returns groups that actually have rows. Periods or categories with **zero**
activity silently vanish, so a "12 months" answer comes back with 9 rows and the
three that should read `0` are simply absent. The SQL is runnable and the
aggregate is right, but the **panel is incomplete** — and a monthly report with
missing months or a category breakdown missing its empty categories is wrong for
any analyst, on any database.

The existing `<sql_craft>` "Answer completeness / interpretation" group already
carries a *"For each X / per X / by X returns exactly one row per X"* rule, but
that rule only governs **grain** (don't collapse to a single value). It says
nothing about the **domain**: "one row per X" today means one row per *observed*
X, so empty groups still drop. This spec sharpens that rule from grain-only to
grain-and-completeness.
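
The dropped-group behavior is easy to reproduce. The following is a minimal sketch in Python with the standard-library `sqlite3` module as an in-memory runner; the `regions` and `orders` tables and their rows are invented for illustration only.

```python
import sqlite3

# Invented data: 'west' exists as a region but has no orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region TEXT PRIMARY KEY);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT);
INSERT INTO regions VALUES ('north'), ('south'), ('west');
INSERT INTO orders VALUES (1, 'north'), (2, 'north'), (3, 'south');
""")

# A plain GROUP BY can only emit groups that have at least one fact row.
observed = conn.execute(
    "SELECT region, COUNT(*) FROM orders GROUP BY region ORDER BY region"
).fetchall()

# The expected domain comes from the dimension table, not from the facts.
expected_domain = [row[0] for row in conn.execute("SELECT region FROM regions ORDER BY region")]
observed_regions = {row[0] for row in observed}
missing = [region for region in expected_domain if region not in observed_regions]

print(observed, missing)
```

The query runs without error and every count it returns is right, yet the region with zero orders never appears in the result.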

## Generic use case (independent of any benchmark)

"How many orders were placed in each month of 2023?" must return **12 rows** even
if March had no orders (March = 0), not 11. "Sales per region" should include
regions with no sales when the question asks for *each* region. Both are
bread-and-butter reporting for any analyst on any warehouse, with no benchmark in
sight.

## Model

The feature splits across **two surfaces**, each holding the half it is suited
for. This split is the central design decision and exists to satisfy spec 07's
hard dialect-agnostic invariant without weakening it.

### Why two surfaces (the dialect-agnostic reconciliation)

The draft asked for a *"recursive-CTE date spine"* worked example. But a real
date/number series is **inherently dialect-specific** — Postgres `generate_series`,
SQLite recursive `date(d,'+1 month')`, BigQuery `GENERATE_DATE_ARRAY`, Snowflake
`GENERATOR`+`DATEADD` — and spec 07 made `<sql_craft>` strictly dialect-agnostic
(the analytics-skill content test bans single-dialect constructs). Inlining a date
spine would violate that invariant; carving out a test exception would erode it.

ktx already has the canonical home for engine-specific syntax: the per-dialect
notes in `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by
the `sql_dialect_notes` MCP tool (spec 08). Those files answer a fixed rubric
(FQTN / Identifiers / Date-time / Top-N / JSON) — but **series/spine generation is
not in that rubric yet**. So the date-spine syntax belongs *there*, alongside the
other per-dialect idioms, and the dialect-agnostic skill points to it. This
routes the dialect-specific half through the existing channel rather than
standing up a parallel dialect-specific recipe inside the skill.

Surface 1 (skill) carries the **pattern**; surface 2 (dialect notes) carries the
**concrete series syntax**.

### Additive, inline, heuristic-with-a-why

Consistent with spec 07: the skill change is **additive content in one Markdown
file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/` file — the
delivery mechanism in `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic,
and phrased as a **heuristic with a one-line generic rationale**, not a wall of
MUSTs. The dialect-notes change is additive content in the seven existing
`dialects/*.md` files. No new tool, flag, or config on either surface.

## Requirements

### 1. Skill surface — `<sql_craft>` "Answer completeness / interpretation"

Add the panel-completeness rule to the existing group (it extends, and should sit
adjacent to, the *"For each X / per X / by X"* bullet). It must cover:

1. **Recognize the full-panel cue.** *each / every / all / per <period> / for all
   <category> / by month* signals that the answer's row set should be the
   **complete expected domain** of periods or categories in scope, not just those
   present in the filtered fact rows. *Why:* a plain inner `GROUP BY` can only emit
   groups that have at least one fact row.

2. **Spine → LEFT JOIN → COALESCE.** Build the full set of expected groups (the
   **spine**), then LEFT JOIN the aggregated facts onto it:
   - **Category/dimension spine:** the distinct values from the **domain-defining
     dimension/entity table** (e.g. all regions from a `regions` table), *not*
     `SELECT DISTINCT region FROM facts` — the latter yields only categories that
     already occur, so a zero-activity category still drops. When no dimension
     table exists, the distinct values from the **unfiltered** fact table are the
     best available domain (with the residual caveat that a category which never
     occurs at all cannot surface).
   - **Period/number spine:** generate the series for the question's stated range
     (e.g. each month of 2023 → Jan..Dec 2023). The series bounds come from the
     question's explicit range; when the range is "all periods present," derive
     bounds from `MIN`/`MAX` over the **unfiltered** facts. The concrete
     series-generation syntax is per-dialect — the rule points the author to
     `sql_dialect_notes` (see requirement 2) and shows no inline series SQL.

3. **COALESCE by measure additivity.** Default missing measures with
   `COALESCE(metric, 0)` for **additive** measures (a `COUNT` or `SUM` of events
   or amounts — "no activity" genuinely reads as 0). Leave **non-additive**
   measures (`AVG`, a running balance, a price, a rate, a ratio) as **NULL** —
   absence is "no data," and 0 would be a wrong reading. *Why:* 0 is a real value
   only for additive measures.

4. **Don't over-apply (the each-vs-which guard).** When the question asks only
   about groups that exist ("*which* months had orders", "regions that made a
   sale"), the spine is unnecessary and wrong — emit only observed groups. The cue
   is *each / all / every* (complete domain) vs *which / that have* (observed
   subset).

5. **One worked example — the category spine, fully portable.** Add **exactly
   one** compact before/after example demonstrating the pattern with a
   **distinct-dimension spine**: the wrong shape (`GROUP BY` over facts, empty
   groups missing) and the right shape (`SELECT DISTINCT` domain from the
   dimension table → LEFT JOIN aggregated facts → `COALESCE(metric, 0)`). Generic
   table/column names, standard SQL only — no series generation, no dialect
   functions, so the example stays dialect-clean. The period-spine variant is
   described in prose (requirement 2) and delegated to `sql_dialect_notes`; it
   gets **no** inline example. This is the **third** worked `sql` example in the
   skill (after spec 07's window-then-filter and spec 09's multi-hop fan-out).

6. **Step pointer, no duplication.** The validate/explain step (and/or the query
   step) already points into `<sql_craft>` for answer-completeness; extend that
   existing pointer's wording if needed, but state the rule **once** inside
   `<sql_craft>`. The step-5 pointer that lists what `sql_dialect_notes` provides
   ("FQTN, identifier-quoting, date, top-N, and JSON conventions") should also
   name the **series/calendar** convention now that it exists.
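
The category-spine recipe and the additivity rule above can be sketched together. This Python snippet uses the standard-library `sqlite3` module only as an in-memory runner; the `regions` and `orders` tables, the `spine` and `order_stats` CTE names, and all values are assumptions made for this illustration, and the embedded SQL sticks to standard constructs.

```python
import sqlite3

# Invented data: region 3 ('west') has no orders at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region_id INTEGER, amount INTEGER);
INSERT INTO regions VALUES (1, 'north'), (2, 'south'), (3, 'west');
INSERT INTO orders VALUES (1, 1, 10), (2, 1, 30), (3, 2, 5);
""")

rows = conn.execute("""
WITH spine AS (
  -- Full expected domain, taken from the dimension table rather than the facts.
  SELECT DISTINCT region_id, name FROM regions
),
order_stats AS (
  -- Facts aggregated to the spine's grain before the join.
  SELECT region_id, COUNT(*) AS n_orders, AVG(amount) AS avg_amount
  FROM orders
  GROUP BY region_id
)
SELECT sp.name,
       COALESCE(st.n_orders, 0) AS n_orders,  -- additive: no activity reads as 0
       st.avg_amount                          -- non-additive: stays NULL
FROM spine sp
LEFT JOIN order_stats st ON st.region_id = sp.region_id
ORDER BY sp.name
""").fetchall()

for row in rows:
    print(row)
```

The empty region now appears with a count of 0, while its average stays `NULL` (`None` in Python) because there is no data to average.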

### 2. Dialect-notes surface — `dialects/*.md`

Add a **"Series"** (date/number range) line to **each** of the seven authored
dialect files, giving that engine's idiomatic way to generate a contiguous
date or integer series for use as a spine. Each note is engine-exclusive — a
SQLite analyst gets the SQLite idiom and never another engine's construct, per the
existing dialect-notes leak guards. Orientation (exact syntax is the
implementer's):

- **postgres:** `generate_series('2023-01-01'::date, '2023-12-01'::date, interval '1 month')`.
- **sqlite:** recursive CTE — `WITH RECURSIVE m(d) AS (SELECT '2023-01-01' UNION ALL SELECT date(d,'+1 month') FROM m WHERE d < '2023-12-01')`.
- **bigquery:** `UNNEST(GENERATE_DATE_ARRAY('2023-01-01','2023-12-01', INTERVAL 1 MONTH))` (and `GENERATE_ARRAY` for integers).
- **snowflake:** `TABLE(GENERATOR(ROWCOUNT => n))` with `DATEADD('month', SEQ4(), start)`, or a recursive CTE.
- **mysql:** recursive CTE (8.0+) with `DATE_ADD(d, INTERVAL 1 MONTH)`.
- **clickhouse:** `numbers(n)` / `range(n)` with `addMonths(start, number)` (or `arrayJoin`).
- **tsql:** recursive CTE with `DATEADD(month, …)`, or a numbers/tally table.

This line is what makes the period spine usable from the dialect-agnostic skill,
and it is also consumed by **spec 11** (rolling-window-over-gappy-dates needs the
same date spine) — so it is foundational, not scope creep.
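
As a worked illustration of one of these idioms, the SQLite recursive-CTE line above can be exercised end to end. The sketch below is Python with the standard-library `sqlite3` module; the `orders` table, the `monthly` CTE, and the use of `substr` to bucket ISO dates by month are assumptions of this example, and the series syntax is SQLite-only by design, so it belongs in the SQLite dialect note and not in the skill.

```python
import sqlite3

# Invented data: orders only in January and April of 2023.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
INSERT INTO orders VALUES (1, '2023-01-15'), (2, '2023-01-20'), (3, '2023-04-02');
""")

rows = conn.execute("""
WITH RECURSIVE m(d) AS (
  -- Month spine for the question's stated range: Jan..Dec 2023.
  SELECT '2023-01-01'
  UNION ALL
  SELECT date(d, '+1 month') FROM m WHERE d < '2023-12-01'
),
monthly AS (
  -- Facts aggregated per month (first 7 characters of an ISO date are YYYY-MM).
  SELECT substr(order_date, 1, 7) AS ym, COUNT(*) AS n_orders
  FROM orders
  GROUP BY substr(order_date, 1, 7)
)
SELECT substr(m.d, 1, 7) AS month, COALESCE(monthly.n_orders, 0) AS n_orders
FROM m
LEFT JOIN monthly ON monthly.ym = substr(m.d, 1, 7)
ORDER BY m.d
""").fetchall()

for month, n_orders in rows:
    print(month, n_orders)
```

All twelve months come back, with the ten empty months reading 0 instead of disappearing.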

### 3. Coordination with spec 11

Spec 11 (time-series window recipes) explicitly depends on this date spine for the
gappy-rolling case ("build a complete date spine first (see spec 10)"). Spec 10
establishes the spine concept in the Answer-completeness group and the
series syntax in the dialect notes; spec 11 reuses both from the Window-functions
group. Keep the two non-overlapping: spec 10 owns the spine; spec 11 references it.

## Leak-safety (hard constraint)

Any worked example or note must use a **synthetic generic schema** (e.g. an
`orders` table with an `order_date`, a `regions` dimension) and demonstrate only
the *pattern* (spine + LEFT JOIN + COALESCE). **No** benchmark table names, SQL,
or result values on either surface. The dialect-notes additions, like the existing
notes, carry no benchmark/grader/version-dated content. The behavior is
reconstructable from first principles and tied to no specific instance.

## Acceptance criteria

- `<sql_craft>` "Answer completeness / interpretation" states: the full-panel cue,
  the spine → LEFT JOIN → COALESCE recipe, the additive-vs-non-additive COALESCE
  discriminator (0 vs NULL), and the each-vs-which over-application guard —
  inline, dialect-agnostic, each with a generic *why*.
- Exactly **one** new worked `sql` example is present, a portable
  distinct-dimension spine (`SELECT DISTINCT` domain → LEFT JOIN → `COALESCE`),
  with no series generation and no dialect-specific syntax. The skill then carries
  **three** `sql` worked examples total.
- Each of the seven `dialects/*.md` files gains a **Series** (date/number range)
  line in its engine's own idiom; no engine leaks another engine's construct, and
  the additions contain no benchmark/grader/version-dated content.
- The skill remains dialect-clean: no `QUALIFY`, `strftime`, `julianday`,
  `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, or other
  single-dialect construct anywhere in `SKILL.md`, including the new example.
- The existing interactive guidance (`<workflow>`, `<rules>`, the other examples)
  and the existing dialect-note rubric lines are intact and uncontradicted.
- No grader/benchmark reference, no output-shape contract, and no anchoring of
  *relative* time ("recent" / "past N months") to a `MAX(date)` over the data
  appears (period-spine bounds derive from the question's explicit range or, for
  "all periods present," from `MIN`/`MAX` over the **unfiltered** facts — which is
  range derivation, not relative-time anchoring).
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
  still parses as `ktx-analytics`.
|
|
||||||
|
|
||||||
## Implementation orientation
|
|
||||||
|
|
||||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
|
||||||
the prose.
|
|
||||||
|
|
||||||
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the
|
|
||||||
panel-completeness bullets to the Answer-completeness group, the single category
|
|
||||||
spine example, and extend the existing step pointer / dialect-notes provision
|
|
||||||
list to name the series convention. Leave `<workflow>`/`<rules>`/other examples
|
|
||||||
intact. Delivery is unchanged (single `SKILL.md` per target via
|
|
||||||
`readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change required.
|
|
||||||
- **Dialect notes:** the seven files under
|
|
||||||
`packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
|
|
||||||
`DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
|
|
||||||
`copy-runtime-assets.mjs` — no plumbing change, content only.
|
|
||||||
- **Tests:**
|
|
||||||
- `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
|
|
||||||
representative phrase for the completeness rule; bump the `sql`-fence count
|
|
||||||
assertion **2 → 3**; assert the spine + LEFT JOIN + `COALESCE` shape; the
|
|
||||||
existing dialect-clean guards already cover the no-inline-series requirement
|
|
||||||
(the example is `SELECT DISTINCT`, so they pass unchanged).
|
|
||||||
- `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the rubric loop
|
|
||||||
(the "answers the full rubric for every dialect" test) so every dialect must
|
|
||||||
also answer a **Series** line, e.g. `expect(notes).toMatch(/\*\*Series/)`.
|
|
||||||
Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
|
|
||||||
all seven without a hand-maintained list.
|
|
||||||
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
|
|
||||||
`pnpm run build && pnpm run link:dev`.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
Per-period / per-category questions where some periods are empty produce
|
|
||||||
short-row result mismatches in the SQLite subset, and the related rolling/cumulative
|
|
||||||
cluster (spec 11) needs a complete date spine to be correct at all. The fix is a
|
|
||||||
universal reporting habit (complete panels) plus the per-dialect series syntax
|
|
||||||
that makes it executable — both belong in the product, where they help real
|
|
||||||
analysts. Improving the benchmark score is a side effect; the skill and the
|
|
||||||
dialect notes contain no trace of the benchmark.
|
|
||||||
|
|
||||||
## Implementation notes
|
|
||||||
|
|
||||||
Shipped on branch `write-feature-spec-wiki`. Content-only across two surfaces, no
|
|
||||||
new tool/flag/config, no plumbing change.
|
|
||||||
|
|
||||||
**Surface 1 — skill (`packages/cli/src/skills/analytics/SKILL.md`):**
|
|
||||||
- Added a **"Complete the panel for 'each / every / all / per <period or
|
|
||||||
category>'"** bullet to the `<sql_craft>` "Answer completeness / interpretation"
|
|
||||||
group, directly after the *"For each X / per X / by X"* bullet, with three
|
|
||||||
sub-bullets carrying the rest of the rule each with its generic *why*: **Spine
|
|
||||||
source** (distinct domain from the dimension/entity table — not `SELECT DISTINCT`
|
|
||||||
over the facts; period/number series across the question's stated range, bounds
|
|
||||||
from `MIN`/`MAX` over the *unfiltered* facts for "all periods present"; series
|
|
||||||
syntax delegated to `sql_dialect_notes`), **Default by additivity**
|
|
||||||
(`COALESCE(metric, 0)` for additive measures, `NULL` for non-additive), and
|
|
||||||
**Don't over-apply** (the each-vs-which guard).
|
|
||||||
- Added **one** worked `sql` example at the end of the Answer-completeness group: a
|
|
||||||
portable distinct-dimension spine (`SELECT DISTINCT region_id FROM regions` →
|
|
||||||
`LEFT JOIN` aggregated facts → `COALESCE(ro.n_orders, 0)`), wrong-vs-right,
|
|
||||||
standard SQL only, no series generation, no dialect functions. The skill now
|
|
||||||
carries **three** `sql` worked examples.
|
|
||||||
- Extended the step-5 dialect-notes pointer to name the **series/calendar**
|
|
||||||
convention alongside FQTN / identifier-quoting / date / top-N / JSON.
|
|
||||||
- Delivery unchanged: `readAnalyticsSkillContent` in `setup-agents.ts` ships the
|
|
||||||
single `SKILL.md` per target — confirmed, no change.
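
For orientation, the spine → `LEFT JOIN` → `COALESCE` shape described above can be sketched as follows. This is a minimal illustration, not the example that ships in `SKILL.md`: the `regions` / `orders` rows are invented, and Python's standard-library `sqlite3` is used only to make the standard SQL runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, region_id INTEGER, order_date TEXT);
    INSERT INTO regions VALUES (1, 'north'), (2, 'south'), (3, 'west');
    -- Region 3 has no orders at all (synthetic data, assumed for the sketch).
    INSERT INTO orders VALUES (10, 1, '2023-01-05'), (11, 1, '2023-01-09'), (12, 2, '2023-02-01');
""")

# Wrong: grouping the facts alone silently drops regions with no orders.
wrong = conn.execute("""
    SELECT region_id, COUNT(*) AS n_orders
    FROM orders
    GROUP BY region_id
    ORDER BY region_id
""").fetchall()

# Right: take the spine from the dimension table, LEFT JOIN the aggregated
# facts, and COALESCE the additive measure to 0.
right = conn.execute("""
    WITH spine AS (
        SELECT DISTINCT region_id FROM regions
    ),
    ro AS (
        SELECT region_id, COUNT(*) AS n_orders
        FROM orders
        GROUP BY region_id
    )
    SELECT s.region_id, COALESCE(ro.n_orders, 0) AS n_orders
    FROM spine AS s
    LEFT JOIN ro ON ro.region_id = s.region_id
    ORDER BY s.region_id
""").fetchall()

print(wrong)  # [(1, 2), (2, 1)]
print(right)  # [(1, 2), (2, 1), (3, 0)]
```

The `0` default is appropriate here because an order count is additive; a non-additive measure such as an average would stay `NULL` for the empty region.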
|
|
||||||
|
|
||||||
**Surface 2 — dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`):**
|
|
||||||
- Added a `- **Series:**` line to all seven authored files (postgres, sqlite,
|
|
||||||
bigquery, snowflake, mysql, clickhouse, tsql), each in that engine's own idiom
|
|
||||||
(`generate_series`; recursive CTE with `date(d,'+1 month')`;
|
|
||||||
`UNNEST(GENERATE_DATE_ARRAY(...))`; `GENERATOR`/`SEQ4`/`DATEADD`; recursive CTE
|
|
||||||
with `DATE_ADD`; `numbers(n)`/`addMonths`; recursive CTE with `DATEADD` +
|
|
||||||
`MAXRECURSION`), placed right after each file's Date/time line. No cross-engine
|
|
||||||
leak, no version-dated/benchmark content. Shipped to `dist` unchanged by
|
|
||||||
`copy-runtime-assets.mjs`; coverage stays derived from `DIALECTS_WITH_NOTES`.
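
As a rough illustration of the SQLite idiom named above (recursive CTE with `date(d,'+1 month')`), the sketch below generates a month spine. The bounds are hypothetical placeholders for the question's stated range, and the exact wording of the shipped Series line may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical bounds; in practice they come from the question's explicit range
# (or MIN/MAX over the unfiltered facts for "all periods present").
start, end = "2023-01-01", "2023-06-01"

months = [row[0] for row in conn.execute("""
    WITH RECURSIVE spine(d) AS (
        SELECT date(:start)
        UNION ALL
        -- ISO-8601 date strings compare correctly as text.
        SELECT date(d, '+1 month') FROM spine WHERE d < date(:end)
    )
    SELECT d FROM spine
""", {"start": start, "end": end})]

print(months)
# ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01', '2023-05-01', '2023-06-01']
```

The resulting rows are what a period spine is joined against, in the same way the distinct-dimension spine is joined in the skill's worked example.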
|
|
||||||
|
|
||||||
**Tests:**
|
|
||||||
- `test/skills/analytics-skill-content.test.ts`: added the `Complete the panel`
|
|
||||||
and `Default by additivity` phrases; renamed the worked-examples test and bumped
|
|
||||||
the `sql`-fence count **2 → 3**; asserted the spine + `LEFT JOIN` + `COALESCE`
|
|
||||||
shape. Also added `generate_series` and `GENERATE_DATE_ARRAY` to the
|
|
||||||
dialect-clean banned list — a deliberate **strengthening** beyond the spec's
|
|
||||||
test orientation so the "no inline series" acceptance criterion is *enforced*,
|
|
||||||
not merely incidentally true of a `SELECT DISTINCT` example.
|
|
||||||
- `test/context/mcp/dialect-notes.test.ts`: extended the "answers the full rubric
|
|
||||||
for every dialect" loop with `expect(notes).toMatch(/\*\*Series/)`, so all seven
|
|
||||||
dialects are required to answer a Series line (coverage derived from
|
|
||||||
`DIALECTS_WITH_NOTES`, no hand-maintained list).
|
|
||||||
|
|
||||||
**Verification:** both affected test files pass (19 tests). `src` type-check and
|
|
||||||
`pnpm run build` are clean, and `copy-runtime-assets.mjs` placed the Series line in
|
|
||||||
all seven `dist` dialect files; `pnpm run link:dev` re-linked `ktx-dev`. Note: an
|
|
||||||
unrelated, pre-existing `tsconfig.test.json` type error in
|
|
||||||
`test/mcp-server-factory.test.ts` exists on this branch — untouched by this work
|
|
||||||
and outside its scope.
|
|
||||||
|
|
||||||
**Coordination with spec 11:** the per-dialect Series line is the foundational
|
|
||||||
date spine that spec 11 (rolling/cumulative windows over gappy dates) references.
|
|
||||||
Spec 10 owns the spine (Answer-completeness group + dialect Series notes); spec 11
|
|
||||||
will reference it from the Window-functions group. No overlap introduced.
|
|
||||||
|
|
@ -1,391 +0,0 @@
|
||||||
# Time-series window craft — running totals, rolling-over-time (min-periods), period-over-period
|
|
||||||
|
|
||||||
> Refined spec. Intake draft: `todo/11-time-series-window-recipes.md`.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
A large share of analytics questions are time-series shaped: a **running /
|
|
||||||
cumulative balance**, a **rolling N-day average**, or **period-over-period
|
|
||||||
growth**. The agent already knows window functions exist — spec 07 gave the
|
|
||||||
`<sql_craft>` "Window functions" group its determinism and window-then-filter
|
|
||||||
rules, and spec 10 added panel/period completeness — but it still gets the
|
|
||||||
*time-series specifics* wrong:
|
|
||||||
|
|
||||||
- a cumulative balance computed **without an explicit unbounded-preceding
|
|
||||||
frame**, or with the implicit frame misbehaving when there are **ties on the
|
|
||||||
order key**;
|
|
||||||
- "rolling 30 days" implemented as `ROWS BETWEEN 29 PRECEDING` over **gappy**
|
|
||||||
daily data, so the window covers the wrong calendar span when days are missing;
|
|
||||||
- no **minimum-periods** handling — a rolling average reported before the window
|
|
||||||
is actually full;
|
|
||||||
- "growth vs the previous period" written **without `LAG`** (or against the wrong
|
|
||||||
neighbor), with an **unguarded** `(cur - prev) / prev` that breaks on a zero or
|
|
||||||
absent prior.
|
|
||||||
|
|
||||||
These are runnable-but-wrong: the structure is close, the edge case diverges.
|
|
||||||
It is the same failure shape spec 07 addressed at the general level; this spec
|
|
||||||
adds the time-series specifics to the **same Window-functions group**, building
|
|
||||||
on the rules already there rather than restating them.
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
- "Each account's month-end running balance over 2023" — a cumulative sum of
|
|
||||||
monthly net over an ordered window.
|
|
||||||
- "30-day rolling average of daily revenue, only once 30 days of history exist."
|
|
||||||
- "Month-over-month revenue growth rate."
|
|
||||||
|
|
||||||
All three are bread-and-butter for any analyst on any time-series table, with no
|
|
||||||
benchmark in sight. The methodology is universal analyst craft, so it belongs in
|
|
||||||
the shipped skill — it transfers to every ktx user querying a live database.
|
|
||||||
|
|
||||||
## Model
|
|
||||||
|
|
||||||
The change is **additive content across two surfaces** — the same split spec 10
|
|
||||||
made, and for the same reason. The split is the central design decision; it
|
|
||||||
satisfies spec 07's hard dialect-agnostic invariant for `<sql_craft>` without
|
|
||||||
weakening it.
|
|
||||||
|
|
||||||
### Why two surfaces (the dialect-agnostic reconciliation)
|
|
||||||
|
|
||||||
Two of the three recipes are **pure standard SQL** and stay entirely in the
|
|
||||||
dialect-agnostic skill:
|
|
||||||
|
|
||||||
- **Cumulative / running total** — `SUM(x) OVER (... ROWS BETWEEN UNBOUNDED
|
|
||||||
PRECEDING AND CURRENT ROW)` is standard on every engine.
|
|
||||||
- **Period-over-period** — `LAG(metric) OVER (...)`, the growth ratio, and a
|
|
||||||
`NULLIF`-style divide-by-zero guard are standard on every engine.
|
|
||||||
|
|
||||||
The third recipe — a **rolling window over calendar time** — has one piece that
|
|
||||||
is genuinely dialect-divergent: the **calendar-range window frame**. A native
|
|
||||||
range frame such as `RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`
|
|
||||||
exists on some engines (e.g. postgres, mysql 8) but **not others** — sqlite has
|
|
||||||
no date-interval range frame, and SQL Server has **no offset `RANGE` frames at
|
|
||||||
all**; bigquery's `RANGE` frames are numeric-only. So a portable skill cannot
|
|
||||||
inline a range frame any more than it could inline a date-series generator.
|
|
||||||
|
|
||||||
ktx already routes that kind of engine-specific syntax through the per-dialect
|
|
||||||
notes in `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by
|
|
||||||
the `sql_dialect_notes` MCP tool (spec 08). Spec 10 established the precedent
|
|
||||||
exactly: series/spine generation was not in the dialect rubric, so it was added
|
|
||||||
there (the **Series** line) and the dialect-agnostic skill points to it.
|
|
||||||
Rolling-window framing is the next construct in that same position — not in the
|
|
||||||
rubric yet, dialect-specific — so the **rolling-window idiom belongs in the
|
|
||||||
dialect notes**, and the skill points to it.
|
|
||||||
|
|
||||||
Surface 1 (skill) carries the **pattern** (calendar range, not a row count; the
|
|
||||||
min-periods guard; the spine-or-range choice). Surface 2 (dialect notes) carries
|
|
||||||
the **concrete rolling-window frame syntax** per engine.
|
|
||||||
|
|
||||||
### Additive, inline, heuristic-with-a-why
|
|
||||||
|
|
||||||
Consistent with specs 07 and 10: the skill change is **additive content in one
|
|
||||||
Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/`
|
|
||||||
file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, and phrased as
|
|
||||||
**heuristics with a one-line generic rationale**, not a wall of MUSTs. The
|
|
||||||
dialect-notes change is additive content in the seven existing `dialects/*.md`
|
|
||||||
files. No new tool, flag, or config on either surface.
|
|
||||||
|
|
||||||
### Build on the rules already present; do not restate them
|
|
||||||
|
|
||||||
The Window-functions group already carries **"Make the ordering deterministic"**
|
|
||||||
(complete tie-breaker) from spec 07, and the Numeric-precision group carries
|
|
||||||
**"Round only at the end."** The cumulative and period-over-period recipes
|
|
||||||
**reference** these rather than repeat them (state each rule once — Anthropic's
|
|
||||||
"consistent terminology / don't repeat" guidance, already followed in spec 07).
|
|
||||||
Spec 10's **Series** dialect line is likewise **referenced** by the rolling
|
|
||||||
recipe's spine fallback, not duplicated.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
### 1. Skill surface — `<sql_craft>` "Window functions" group (three recipes)
|
|
||||||
|
|
||||||
Add three recipes to the **existing** "Window functions" group, after its two
|
|
||||||
current bullets (deterministic ordering; filter-after-the-window). Each is a
|
|
||||||
heuristic with a generic *why*, dialect-agnostic.
|
|
||||||
|
|
||||||
1. **Cumulative / running total.** Use an **explicit frame** — `SUM(x) OVER
|
|
||||||
(PARTITION BY k ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` —
|
|
||||||
with a **complete tie-breaker** on the `ORDER BY` (per the group's existing
|
|
||||||
deterministic-ordering rule; reference it, do not restate). *Why:* a bare
|
|
||||||
`ORDER BY` defaults to a `RANGE … CURRENT ROW` frame, which on **ties in the
|
|
||||||
order key** folds every tied peer into the same cumulative value — it runs and
|
|
||||||
looks plausible, but the running total jumps at each tie boundary.
|
|
||||||
|
|
||||||
2. **Rolling window over calendar time, plus minimum periods.** "Rolling N
|
|
||||||
days/months" must span a **calendar range**, not a fixed row count: a
`ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` frame silently measures the wrong span
when days are missing. Two sanctioned techniques:
|
|
||||||
- **Spine + `ROWS` (portable).** Build a gap-free date spine first (spec 10's
|
|
||||||
**Series**, via `sql_dialect_notes`) so the data has one row per calendar
|
|
||||||
unit; then a `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` frame equals the
|
|
||||||
intended calendar span. This path is fully dialect-agnostic.
|
|
||||||
- **Native range frame or date-keyed self-join (engine-specific).** Where the
|
|
||||||
engine supports it, a calendar **range frame** expresses the window directly;
|
|
||||||
otherwise a self-join keyed on the date does. Both use engine-specific
|
|
||||||
syntax — get the **rolling-window** idiom from `sql_dialect_notes` (see
|
|
||||||
requirement 3); show no inline range frame in the skill.
|
|
||||||
|
|
||||||
**Minimum periods.** When the question says "only after N periods of data" (or
|
|
||||||
a rolling metric implies it), emit `NULL` / skip until the window is actually
|
|
||||||
full — guard on a window count, e.g. `COUNT(*) OVER (<same frame>) = N`. On a
|
|
||||||
gap-free spine, `COUNT(*)` counts calendar slots; count the **non-null
|
|
||||||
observations** instead when "N periods" means N data points rather than N
|
|
||||||
calendar units. *Why:* a row-count frame over missing dates measures the wrong
|
|
||||||
span, and a partial early window is not the requested metric.
|
|
||||||
|
|
||||||
3. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)`
|
|
||||||
for the prior-period comparison; compute growth as `(cur - prev) / prev` at
|
|
||||||
**full precision**, rounding only in the final projection (per the existing
|
|
||||||
"Round only at the end" rule), and **guard divide-by-zero / NULL prev**
|
|
||||||
(e.g. divide by `NULLIF(prev, 0)`). *Why:* without `LAG` — or ordered against
|
|
||||||
the wrong neighbor — the comparison lands on the wrong period, and an unguarded
|
|
||||||
ratio errors or returns garbage when the prior period is zero or absent.
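
The period-over-period recipe can be sketched like this. It is an assumption-laden illustration: `monthly_revenue` and its rows are invented, the partition key is omitted because there is only one series, and `sqlite3` (SQLite 3.25 or later, for window functions) is used only so the standard SQL runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_revenue (month TEXT PRIMARY KEY, revenue REAL);
    INSERT INTO monthly_revenue VALUES
        ('2023-01', 100.0), ('2023-02', 0.0), ('2023-03', 50.0), ('2023-04', 75.0);
""")

rows = conn.execute("""
    WITH with_prev AS (
        SELECT month, revenue,
               LAG(revenue) OVER (ORDER BY month) AS prev
        FROM monthly_revenue
    )
    SELECT month,
           -- Full-precision ratio, guarded divide, rounding only in the final projection.
           ROUND((revenue - prev) / NULLIF(prev, 0), 4) AS growth
    FROM with_prev
    ORDER BY month
""").fetchall()

print(rows)
# [('2023-01', None), ('2023-02', -1.0), ('2023-03', None), ('2023-04', 0.5)]
```

January has no prior period and March follows a zero month, so both come back as `NULL` rather than raising an error or reporting a misleading ratio.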
|
|
||||||
|
|
||||||
**Step pointer (no duplication).** The step-5 `sql_dialect_notes` provision list
|
|
||||||
(currently "FQTN, identifier-quoting, date, top-N, series/calendar, and JSON
|
|
||||||
conventions") should also name the **rolling-window** convention now that it
|
|
||||||
exists. State each rule once inside `<sql_craft>`; the workflow steps only point
|
|
||||||
to it.
|
|
||||||
|
|
||||||
### 2. One worked example — cumulative running total (dialect-agnostic)
|
|
||||||
|
|
||||||
Add **exactly one** new compact before/after `sql` example, demonstrating the
|
|
||||||
**cumulative running total** — the subtlest of the three (the implicit-frame trap
|
|
||||||
runs fine and is wrong only at tie boundaries) and the highest-value to show.
|
|
||||||
Use a synthetic generic schema (e.g. `account_txns(account_id, txn_id, txn_date, net)`):
|
|
||||||
|
|
||||||
- **Wrong:** `SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date)` — the
|
|
||||||
implicit `RANGE` frame makes two txns on the same date share one inflated
|
|
||||||
running balance.
|
|
||||||
- **Right:** the same with an explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND
|
|
||||||
CURRENT ROW` frame and a complete tie-breaker (`ORDER BY txn_date, txn_id`).
|
|
||||||
|
|
||||||
Standard SQL only — no `QUALIFY`, no dialect functions, no series generation, no
|
|
||||||
`RANGE … INTERVAL`. Keep it ~10–14 lines. The **rolling-over-time** recipe gets
|
|
||||||
**no** inline example (its correct form needs the engine-specific frame/spine,
|
|
||||||
delegated to `sql_dialect_notes`, exactly as spec 10's period-spine variant was
|
|
||||||
prose-only); the **period-over-period** recipe is self-evident from its bullet
|
|
||||||
and also gets no example. This is the **fourth** worked `sql` example in the
|
|
||||||
skill, after spec 07 (window-then-filter), spec 09 (multi-hop fan-out), and
|
|
||||||
spec 10 (panel-completeness spine).
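
A runnable sketch of the wrong-vs-right pair described above, assuming invented rows and a `txn_id` column as the tie-breaker. It is not the text that ships in the skill, and it needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account_txns (account_id INTEGER, txn_id INTEGER, txn_date TEXT, net REAL);
    -- Transactions 2 and 3 share a date: a tie on the order key.
    INSERT INTO account_txns VALUES
        (1, 1, '2023-01-01', 10.0),
        (1, 2, '2023-01-02',  5.0),
        (1, 3, '2023-01-02',  7.0),
        (1, 4, '2023-01-03', -2.0);
""")

# Wrong: the implicit RANGE frame gives both same-day rows the same total.
wrong = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date) AS running
    FROM account_txns
    ORDER BY txn_id
""").fetchall()

# Right: explicit ROWS frame plus a complete tie-breaker on the ORDER BY.
right = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (
               PARTITION BY account_id
               ORDER BY txn_date, txn_id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running
    FROM account_txns
    ORDER BY txn_id
""").fetchall()

print(wrong)  # [(1, 10.0), (2, 22.0), (3, 22.0), (4, 20.0)]
print(right)  # [(1, 10.0), (2, 15.0), (3, 22.0), (4, 20.0)]
```

The two results differ only on transaction 2, which is why the implicit-frame version tends to pass a casual check.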
|
|
||||||
|
|
||||||
### 3. Dialect-notes surface — `dialects/*.md` (rolling window)
|
|
||||||
|
|
||||||
Add a **rolling-window-over-time** idiom line to **each** of the seven authored
|
|
||||||
dialect files, parallel to spec 10's **Series** line. Each note is
|
|
||||||
engine-exclusive — a SQLite analyst gets the SQLite idiom and never another
|
|
||||||
engine's construct, per the existing dialect-notes leak guards. Each note either
|
|
||||||
gives the engine's native calendar-range frame **or** references its own
|
|
||||||
**Series** line for the spine + `ROWS` fallback (a cross-reference within the
|
|
||||||
file, not a duplicate of the Series line).
|
|
||||||
|
|
||||||
Orientation only — **`RANGE`-frame support genuinely varies by engine and
|
|
||||||
version, so the implementer must verify each engine's current support against
|
|
||||||
authoritative docs (context7 / the engine's manual) rather than assert it from
|
|
||||||
memory.** Starting points:
|
|
||||||
|
|
||||||
- **postgres:** native — `... OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days'
|
|
||||||
PRECEDING AND CURRENT ROW)`.
|
|
||||||
- **mysql (8.0+):** native — `RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT
|
|
||||||
ROW` over a temporal order key.
|
|
||||||
- **bigquery:** `RANGE` frames are **numeric** — range over an integer day key
|
|
||||||
(e.g. `UNIX_DATE(day)`) with `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or
|
|
||||||
build a spine (see **Series**) and use a `ROWS` frame.
|
|
||||||
- **sqlite:** **no** date-interval range frame — build a date spine (see
|
|
||||||
**Series**) and use a `ROWS` frame.
|
|
||||||
- **tsql (SQL Server):** **no** offset `RANGE` frames at all — build a spine (see
|
|
||||||
**Series**) and use a `ROWS` frame, or a date-keyed self-join.
|
|
||||||
- **snowflake / clickhouse:** range-frame support over dates is limited — verify;
|
|
||||||
default to a spine (see **Series**) + `ROWS` frame where a native calendar range
|
|
||||||
frame is unavailable.
|
|
||||||
|
|
||||||
This line is what makes the rolling-over-time recipe executable from the
|
|
||||||
dialect-agnostic skill. It is **distinct** from spec 10's Series line (Series =
|
|
||||||
how to *generate* a spine; Rolling window = how to compute a *moving
|
|
||||||
calendar-range aggregate*, natively or via that spine), and it cross-references
|
|
||||||
the Series line rather than overlapping it.
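
To make the spine + `ROWS` fallback concrete, here is a hedged sketch for the SQLite case, with a 3-day window for brevity. The `daily_revenue` rows, the hard-coded bounds, and the choice of a rolling sum are assumptions of the sketch; it needs SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL);
    -- 2023-01-03 is missing: the data is gappy.
    INSERT INTO daily_revenue VALUES
        ('2023-01-01', 10.0), ('2023-01-02', 20.0),
        ('2023-01-04', 40.0), ('2023-01-05', 50.0);
""")

rows = conn.execute("""
    WITH RECURSIVE spine(day) AS (
        SELECT '2023-01-01'
        UNION ALL
        SELECT date(day, '+1 day') FROM spine WHERE day < '2023-01-05'
    ),
    filled AS (
        -- One row per calendar day; revenue is additive, so gaps become 0.
        SELECT s.day, COALESCE(r.amount, 0.0) AS amount
        FROM spine AS s
        LEFT JOIN daily_revenue AS r ON r.day = s.day
    ),
    windowed AS (
        SELECT day,
               SUM(amount) OVER w AS sum_3d,
               COUNT(*)    OVER w AS slots
        FROM filled
        WINDOW w AS (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    )
    -- Minimum periods: emit NULL until the window holds 3 calendar slots.
    SELECT day, CASE WHEN slots = 3 THEN sum_3d END AS rolling_3d_sum
    FROM windowed
    ORDER BY day
""").fetchall()

print(rows)
# [('2023-01-01', None), ('2023-01-02', None), ('2023-01-03', 30.0),
#  ('2023-01-04', 60.0), ('2023-01-05', 90.0)]
```

Without the spine, the same `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` frame over the raw rows would report 70.0 for 2023-01-04, because its three rows cover four calendar days.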
|
|
||||||
|
|
||||||
### 4. Explicit constraints / exclusions
|
|
||||||
|
|
||||||
None of the following may appear (consistent with specs 07 and 10):
|
|
||||||
|
|
||||||
- **No inline dialect-specific range-frame syntax in the skill** — no
|
|
||||||
`RANGE … INTERVAL` frame, no series generator, no dialect function. The skill
|
|
||||||
stays dialect-clean; the range frame lives only in the dialect notes.
|
|
||||||
- **No anchoring of relative time to `MAX(date)`.** "Recent" / "past N months"
|
|
||||||
means relative to *now* on a live database. A range *bound* may be derived from
|
|
||||||
the question's explicit range or, for "all periods present," from `MIN`/`MAX`
|
|
||||||
over the **unfiltered** facts (range derivation, per spec 10) — but the metric
|
|
||||||
must never silently redefine "recent" as the data's maximum date.
|
|
||||||
- **No grader / gold-answer / benchmark reference**, and no output-shape contract
|
|
||||||
(the skill is for interactive analysis).
|
|
||||||
|
|
||||||
### 5. Coordination with specs 07 and 10
|
|
||||||
|
|
||||||
All three recipes live in the **existing** `<sql_craft>` "Window functions"
|
|
||||||
group; the two current bullets and the spec-07 window-then-filter example must
|
|
||||||
stay intact and uncontradicted.
|
|
||||||
|
|
||||||
- **Spec 07** owns the deterministic-ordering rule (Window functions) and the
|
|
||||||
round-at-the-end rule (Numeric precision). Spec 11 **builds on** both —
|
|
||||||
references them, never restates them.
|
|
||||||
- **Spec 10** owns the spine concept and the dialect **Series** line. Spec 11
|
|
||||||
**references** the spine for the gappy-rolling fallback and adds the **distinct**
|
|
||||||
rolling-window dialect line. Keep them non-overlapping: spec 10 = how to make a
|
|
||||||
spine; spec 11 = how to compute a moving calendar-range aggregate (native frame
|
|
||||||
or spine + `ROWS`).
|
|
||||||
|
|
||||||
## Leak-safety (hard constraint)
|
|
||||||
|
|
||||||
Every worked example or note uses a **synthetic generic schema** (e.g.
|
|
||||||
`daily_revenue(day, amount)` or `account_txns(account_id, txn_id, txn_date, net)`) and
|
|
||||||
shows only the *pattern*. **No** benchmark table names, SQL, or result values on
|
|
||||||
either surface. The dialect-notes additions, like the existing notes, carry no
|
|
||||||
benchmark / grader / version-dated content. The behavior is reconstructable from
|
|
||||||
first principles and tied to no specific instance.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- The `<sql_craft>` "Window functions" group states the three recipes — inline,
|
|
||||||
dialect-agnostic, each with a generic *why*, and each **building on** (not
|
|
||||||
restating) the deterministic-ordering and round-at-the-end rules:
|
|
||||||
- **cumulative / running total** with an explicit `ROWS BETWEEN UNBOUNDED
|
|
||||||
PRECEDING AND CURRENT ROW` frame and a complete tie-breaker;
|
|
||||||
- **rolling window over calendar time + minimum periods** — calendar range not
|
|
||||||
row count, the spine-or-range choice, the min-periods `COUNT(*) OVER (...)`
|
|
||||||
guard — delegating the engine's range-frame syntax to `sql_dialect_notes`;
|
|
||||||
- **period-over-period** via `LAG`, with full-precision growth and a
|
|
||||||
divide-by-zero / NULL-prev guard.
|
|
||||||
- Exactly **one** new worked `sql` example: the cumulative running total,
|
|
||||||
wrong-vs-right, with the explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
|
|
||||||
ROW` frame and a complete tie-breaker, in standard dialect-agnostic SQL. The
|
|
||||||
skill then carries **four** `sql` worked examples total.
|
|
||||||
- Each of the seven `dialects/*.md` files gains a **rolling-window-over-time**
|
|
||||||
idiom line in its engine's own idiom (native calendar-range frame where
|
|
||||||
supported, otherwise a spine + `ROWS` fallback that references its **Series**
|
|
||||||
line); no engine leaks another engine's construct, and the additions contain no
|
|
||||||
benchmark / grader / version-dated content.
|
|
||||||
- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`,
|
|
||||||
`generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, **and no
|
|
||||||
inline `RANGE … INTERVAL` frame**, anywhere in `SKILL.md` including the new
|
|
||||||
example.
|
|
||||||
- The step-5 `sql_dialect_notes` provision list names the **rolling-window**
|
|
||||||
convention alongside FQTN / identifier-quoting / date / top-N / series/calendar /
|
|
||||||
JSON.
|
|
||||||
- The existing interactive guidance (`<workflow>`, `<rules>`, the other
|
|
||||||
examples), the two existing Window-functions bullets, the window-then-filter
|
|
||||||
example, and the existing dialect-note rubric lines (including **Series**) are
|
|
||||||
intact and uncontradicted.
|
|
||||||
- No grader / benchmark reference, no output-shape contract, and no anchoring of
|
|
||||||
*relative* time ("recent" / "past N months") to a `MAX(date)` over the data.
|
|
||||||
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
|
|
||||||
still parses as `ktx-analytics`.
|
|
||||||
|
|
||||||
## Implementation orientation
|
|
||||||
|
|
||||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
|
||||||
the prose.
|
|
||||||
|
|
||||||
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three recipes
|
|
||||||
to the "Window functions" group (after its two existing bullets), the single
|
|
||||||
cumulative worked example, and extend the step-5 dialect-notes provision list to
|
|
||||||
name the rolling-window convention. Leave `<workflow>` / `<rules>` / the other
|
|
||||||
examples and the two existing window bullets intact. Delivery is unchanged
|
|
||||||
(single `SKILL.md` per target via `readAnalyticsSkillContent` in
|
|
||||||
`setup-agents.ts`) — confirm, no change required.
|
|
||||||
- **Dialect notes:** the seven files under
|
|
||||||
`packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
|
|
||||||
`DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
|
|
||||||
`copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each
|
|
||||||
engine's actual `RANGE`-frame support against authoritative docs before writing
|
|
||||||
the idiom; do not assert from memory.**
|
|
||||||
- **Tests:**
|
|
||||||
- `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
|
|
||||||
representative phrase for each of the three recipes; bump the `sql`-fence count
|
|
||||||
assertion **3 → 4**; assert the cumulative example shape (e.g. `ROWS BETWEEN
|
|
||||||
UNBOUNDED PRECEDING AND CURRENT ROW`); and **strengthen** the dialect-clean
|
|
||||||
guard with a no-inline-`RANGE … INTERVAL` assertion (mirroring spec 10 adding
|
|
||||||
`generate_series` / `GENERATE_DATE_ARRAY` to the banned list, so the
|
|
||||||
"range frame lives only in the dialect notes" criterion is *enforced*, not
|
|
||||||
incidentally true).
|
|
||||||
- `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers the
|
|
||||||
full rubric for every dialect" loop with the rolling-window assertion, e.g.
|
|
||||||
`expect(notes).toMatch(/\*\*Rolling/)`, so every dialect must answer it.
|
|
||||||
Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
|
|
||||||
all seven without a hand-maintained list.
|
|
||||||
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
|
|
||||||
`pnpm run build && pnpm run link:dev`.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
Running-balance / rolling / period-over-period questions are the single largest
|
|
||||||
result-mismatch cluster in the SQLite subset (financial-transactions-style DBs):
|
|
||||||
cumulative balances with the wrong frame on ties, rolling windows that mis-span
|
|
||||||
gappy dates, partial early windows, and unguarded period-over-period ratios. The
|
|
||||||
methodology is universal analyst craft, so it belongs in the product's skill
|
|
||||||
(where it helps every real user) plus the per-dialect rolling-window syntax that
|
|
||||||
makes it executable — not in a benchmark-specific prompt. Depends on spec 10 (the
|
|
||||||
date spine) for the gappy-rolling fallback. Improving the benchmark score is a
|
|
||||||
side effect; the skill and the dialect notes contain no trace of the benchmark.
|
|
||||||
|
|
||||||
## Implementation notes
|
|
||||||
|
|
||||||
Shipped as additive content across the two surfaces the spec specified — no new
|
|
||||||
tool, flag, or config.
|
|
||||||
|
|
||||||
**Skill (`packages/cli/src/skills/analytics/SKILL.md`).** Added the three recipes
|
|
||||||
to the existing `<sql_craft>` "Window functions" group, after its two bullets and
|
|
||||||
the spec-07 window-then-filter example: **Cumulative / running total** (explicit
|
|
||||||
`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` + a tie-breaker, referencing
|
|
||||||
the deterministic-ordering rule), **Rolling window over calendar time, plus
|
|
||||||
minimum periods** (calendar range not row count; spine-or-native-range choice
|
|
||||||
delegated to `sql_dialect_notes`; the `COUNT(*) OVER (<same frame>) = N`
|
|
||||||
min-periods guard), and **Period-over-period** (`LAG` + full-precision growth +
|
|
||||||
`NULLIF` divide guard, referencing the round-at-the-end rule). Added one worked
|
|
||||||
`sql` example — the cumulative running total, wrong-vs-right, using
|
|
||||||
`account_txns(account_id, txn_id, txn_date, net)` — bringing the skill to four
|
|
||||||
worked examples. Extended the step-5 `sql_dialect_notes` provision list to name
|
|
||||||
the rolling-window convention. No inline `RANGE … INTERVAL` frame anywhere in the
|
|
||||||
skill; it stays dialect-clean.

**Dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`).** Added a
**Rolling window over time** line to all seven files, parallel to the spec-10
**Series** line and cross-referencing it for the spine fallback.

**Deviation — `RANGE`-frame support verified against authoritative docs (the
spec's hard requirement), which corrected two of its starting points:**

- **postgres** — native interval frame: `RANGE BETWEEN INTERVAL '29 days'
  PRECEDING AND CURRENT ROW` (as the spec guessed).
- **mysql** — native interval frame over a temporal key: `RANGE BETWEEN INTERVAL
  29 DAY PRECEDING AND CURRENT ROW` (as guessed).
- **bigquery** — `RANGE` is numeric-only: range over `UNIX_DATE(day)` with
  `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or spine + `ROWS` (as guessed).
- **snowflake** — **corrected:** the spec said "limited; default to a spine," but
  Snowflake *does* support a native interval `RANGE` frame over a date/timestamp
  key and it is gap-tolerant, so the note gives the native frame
  (`RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`), no spine needed.
- **clickhouse** — **corrected:** the spec said "limited; default to a spine," but
  ClickHouse supports a numeric `RANGE` offset over a `Date` column (counts in
  days, gap-tolerant); the `INTERVAL` form is unsupported (use seconds for
  `DateTime`). The note gives the numeric `RANGE` frame, with spine + `ROWS` as
  the fallback.
- **sqlite** — no date-interval range frame (no native date type): spine + `ROWS`
  (as guessed).
- **tsql** — `RANGE` supports only `UNBOUNDED`/`CURRENT ROW` (no offset frame):
  spine + `ROWS`, or a date-keyed self-join (as guessed).
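
For the engines that fall back to spine + `ROWS`, the difference shows up as soon as the dates have a gap. This is a rough sketch on SQLite via Python's standard `sqlite3` module. The `daily(day, amount)` table, its values, the 3-day window, and the recursive-CTE spine are all illustrative assumptions, not text from the shipped dialect notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-02", 20.0), ("2024-01-05", 30.0)],
)

# Wrong: ROWS counts rows, so on 2024-01-05 the "3-day" frame reaches
# back across the gap and still includes 2024-01-01.
wrong = conn.execute(
    """
    SELECT day,
           SUM(amount) OVER (
             ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    FROM daily
    ORDER BY day
    """
).fetchall()

# Right without a date-interval RANGE frame: build a dense daily spine
# first, so that three rows really are three calendar days.
right = conn.execute(
    """
    WITH RECURSIVE spine(day) AS (
      SELECT MIN(day) FROM daily
      UNION ALL
      SELECT date(day, '+1 day') FROM spine
      WHERE day < (SELECT MAX(day) FROM daily)
    )
    SELECT s.day,
           SUM(COALESCE(d.amount, 0)) OVER (
             ORDER BY s.day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    FROM spine AS s
    LEFT JOIN daily AS d ON d.day = s.day
    ORDER BY s.day
    """
).fetchall()

print(wrong[-1])  # last raw row: window spans five calendar days
print(right[-1])  # last spine row: window spans exactly three days
```
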

**Tests.** `test/skills/analytics-skill-content.test.ts` — added a representative
phrase per recipe (plus `minimum periods`), bumped the `sql`-fence count 3 → 4,
asserted the cumulative example shape (`ROWS BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW` and the `ORDER BY txn_date, txn_id` tie-breaker), and strengthened
the dialect-clean guard with a no-inline-`RANGE … INTERVAL` regex.
`test/context/mcp/dialect-notes.test.ts` — extended the per-dialect rubric loop
with `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect (derived from
`DIALECTS_WITH_NOTES`) must answer the rolling-window rubric.

**Verification.** Full `@kaelio/ktx` vitest suite green (3001 passed, 1 skipped);
`pnpm run build` mirrors both surfaces into `dist`; `pnpm run link:dev` refreshed
`ktx-dev`. Pre-existing, unrelated note: `tsc -p tsconfig.test.json` reports one
error in `test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts` cast) that is
present in committed branch code and untouched by this work.
@@ -1,405 +0,0 @@
# Parse text-encoded numeric columns before doing math on them

> Refined spec. Intake draft: `todo/12-parse-text-encoded-numbers.md`.

## Problem

Numeric measures are often stored as **text** with human formatting: unit
suffixes (`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators
(`"$1,200"`), percent signs (`"12%"`), or non-numeric sentinels for missing/zero
(`"-"`, `"N/A"`, `""`). Aggregating or comparing such a column directly is
**silently wrong**: a string comparison orders `"100" < "9"`, and a naive
`CAST(x AS REAL)` yields `0`/NULL/partial on the formatted values rather than the
intended number. The query runs, the shape looks right, the number is garbage.
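
Both failure modes are easy to observe. The snippet below is a small sketch, assuming SQLite through Python's standard `sqlite3` module; the `metrics(label, value_text)` table and its four values are made up for illustration, and other engines may raise an error or return NULL where SQLite returns a partial number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (label TEXT, value_text TEXT)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("a", "100"), ("b", "9"), ("c", "$1,200"), ("d", "1.2K")],
)

# Text compares character by character, so '100' sorts before '9'.
lexical_lt = conn.execute("SELECT '100' < '9'").fetchone()[0]

# MAX over the raw text column returns the lexically largest string.
text_max = conn.execute(
    "SELECT MAX(value_text) FROM metrics WHERE label IN ('a', 'b')"
).fetchone()[0]

# A naive cast keeps only a leading numeric prefix: '$1,200' has none,
# and '1.2K' silently loses its thousand multiplier.
naive = dict(
    conn.execute("SELECT value_text, CAST(value_text AS REAL) FROM metrics")
)

print(lexical_lt, text_max)
print(naive)
```

Nothing in this run raises an error, which is the point: every statement succeeds and returns a plausible-looking value.
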

The agent already samples schemas before composing — spec 07 gave the
`<sql_craft>` "Schema discovery before writing SQL" group its *"Sample before you
compose"* and *"Cast to the real type before comparing"* rules. But those rules
guard **encoding** (date format, nullability) and **type-mismatch in `WHERE`**;
they say nothing about a column whose declared/affinity type is text yet whose
*meaning* is numeric. When the agent sees a "numeric-looking" column it tends to
assume a real number type and skips the parse, so the arithmetic runs on the raw
strings. This spec adds the detect → parse/scale → verify habit to that same
group, building on the two rules already there rather than restating them.

## Generic use case (independent of any benchmark)

- A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become
  `1200 / 3000000 / 0` before you can sum it or compute a daily change.
- A `price` stored as `"$1,299.00"` must become `1299.00` before averaging.
- A `conversion_rate` stored as `"12%"` must become `0.12` before weighting it.

This is routine data hygiene on real, messy production tables — every analyst
hits text-encoded measures on some warehouse, with no benchmark in sight. The
methodology is universal craft, so it belongs in the shipped skill; it transfers
to every ktx user querying a live database.

## Model

The change is **additive content across two surfaces** — the same split specs 10
and 11 made, and for the same reason. The split is the central design decision;
it satisfies spec 07's hard dialect-agnostic invariant for `<sql_craft>` without
weakening it.

### Why two surfaces (the dialect-agnostic reconciliation)

The **detect → parse → scale** half is **pure portable SQL** and stays entirely
in the dialect-agnostic skill:

- Stripping `$` / `,` / `%` is a portable chained `REPLACE` over a small, known
  set of literal characters — no regex needed.
- Suffix scaling (K=10³, M=10⁶, B=10⁹) is a portable `LIKE`/`CASE` expression.
- Sentinel mapping (`-` / `N/A` / empty → `0` or `NULL`) is a portable `CASE`.
- The final cast to a numeric type is `CAST(... AS DECIMAL)`, broadly portable.

The **verify** half has one piece that is genuinely dialect-divergent: a
**failure-detecting numeric cast** — a cast that signals (rather than silently
swallows) a value that did not parse. This is exactly what the skill's third
heuristic ("Confirm coverage", under requirement 1) needs, and it cannot be
written portably:

- **bigquery:** `SAFE_CAST(x AS FLOAT64)` → `NULL` on failure.
- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_CAST` → `NULL` on failure.
- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(...))` / `TRY_CONVERT` → `NULL`.
- **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`.
- **postgres / mysql:** no `TRY_CAST` — guard with a numeric pattern test before
  casting (e.g. `CASE WHEN x ~ '^-?[0-9.]+$' THEN x::numeric END`).
- **sqlite (the gotcha):** a plain `CAST('abc' AS REAL)` returns **`0.0`** and
  `CAST('12abc' AS REAL)` returns **`12.0`** — it neither errors nor NULLs, so an
  `IS NULL` coverage check is **silently broken**. Detecting a failed parse needs
  a `GLOB`/`typeof` pattern guard.
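
The sqlite gotcha and one possible guard can be checked directly. This is a sketch under stated assumptions: it uses Python's standard `sqlite3` module, a made-up `cleaned_values(cleaned)` table standing in for already-stripped text, and a hypothetical three-part `GLOB` pattern that accepts only unsigned digits with at most one dot. It is not the wording of the shipped dialect note, and signed values would need an extra branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cleaned_values (cleaned TEXT)")
conn.executemany(
    "INSERT INTO cleaned_values VALUES (?)",
    [("1200",), ("3.5",), ("abc",), ("12abc",)],
)

# Naive coverage check: a plain CAST never yields NULL on these inputs,
# so this count says "nothing failed" even though two values are junk.
naive_failures = conn.execute(
    "SELECT COUNT(*) FROM cleaned_values WHERE CAST(cleaned AS REAL) IS NULL"
).fetchone()[0]

# Guarded cast (illustrative pattern): require at least one digit, nothing
# but digits and dots, and at most one dot. Anything else becomes NULL.
GUARDED_CAST = """
    CASE
      WHEN cleaned GLOB '*[0-9]*'
       AND cleaned NOT GLOB '*[^0-9.]*'
       AND cleaned NOT GLOB '*.*.*'
      THEN CAST(cleaned AS REAL)
    END
"""

failed = [
    row[0]
    for row in conn.execute(
        f"SELECT cleaned FROM cleaned_values "
        f"WHERE {GUARDED_CAST} IS NULL ORDER BY cleaned"
    )
]

print(naive_failures)  # the broken check finds no failures
print(failed)          # the guarded check surfaces the unparsed values
```
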

So a portable skill cannot inline a safe cast any more than spec 10 could inline a
date-series generator or spec 11 a calendar range frame. ktx already routes that
kind of engine-specific syntax through the per-dialect notes in
`packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by the
`sql_dialect_notes` MCP tool (spec 08). Specs 10 and 11 set the exact precedent:
a construct not yet in the dialect rubric, genuinely engine-specific, was added
there (the **Series** line; the **Rolling window** line) and the dialect-agnostic
skill points to it. The failure-detecting cast is the next construct in that same
position, so the **safe-cast idiom belongs in the dialect notes**, and the skill
points to it.

Surface 1 (skill) carries the **pattern** (detect the text encoding; parse/scale
in an early CTE; verify with a failure-detecting cast). Surface 2 (dialect notes)
carries the **concrete safe-cast syntax** per engine, including the sqlite
`CAST`-returns-0 gotcha.

The regex character-*strip* is deliberately **not** promoted to the dialect
notes: a portable chained `REPLACE` over a known character set is the opinionated
default, so there is no need for a per-dialect strip line (derive from need; one
default). The dialect surface gains exactly one thing — the safe cast — because
that is the only piece the portable path genuinely cannot express.

### Additive, inline, heuristic-with-a-why

Consistent with specs 07, 10, and 11: the skill change is **additive content in
one Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled
`reference/` file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic,
and phrased as **heuristics with a one-line generic rationale**, not a wall of
MUSTs. The dialect-notes change is additive content in the seven existing
`dialects/*.md` files. No new tool, flag, or config on either surface.

### Build on the rules already present; do not restate them

- The Schema-discovery group already carries **"Sample before you compose"** and
  **"Cast to the real type before comparing"** (spec 07). The detect rule
  **extends** the first (distinct-value sampling to learn the encoding) and the
  parse rule **complements** the second (text-meaning-numeric, not just
  text-vs-numeric literal mismatch) — reference them, do not repeat them.
- The sentinel **0-vs-NULL** choice is the **same additive-vs-non-additive
  judgment** spec 10 established in its *"Default by additivity"* rule (0 only
  when "no value" genuinely reads as 0; NULL otherwise). **Reference** that rule
  rather than restating the discriminator (state each rule once).

## Requirements

### 1. Skill surface — `<sql_craft>` "Schema discovery before writing SQL"

Add the text-encoded-numeric guidance to the **existing** group, after its two
current bullets. Phrase as heuristics, each with a generic *why*, dialect-agnostic.
It must cover:

1. **Detect text-encoded numerics during sampling.** When a column the question
   treats as a number is stored as text, sample its **distinct** values to learn
   the encodings actually present — unit suffixes (`K`/`M`/`B`), currency
   symbols, thousands separators, percent signs, and non-numeric sentinels
   (`-`, `N/A`, empty) — **before** composing. Never infer the format from the
   column name. *Why:* compared/aggregated as-is, the text sorts lexically
   (`'100' < '9'`) and a naive cast collapses formatted values to `0`/NULL —
   producing a silently wrong result instead of an error.

2. **Parse and scale in an early CTE.** Strip currency/separator/percent
   characters, multiply by the suffix scale (K=10³, M=10⁶, B=10⁹), map sentinels
   to `0` **or** `NULL` per the question's intent, then cast to a numeric type —
   all in **one early CTE**, so every downstream layer sees clean numbers. The
   `0`-vs-`NULL` choice for sentinels follows spec 10's **additive-vs-non-additive**
   rule (reference it; do not restate). *Why:* a string column aggregated as-is
   sorts lexically and casts to 0, so the math is silently wrong.

3. **Confirm coverage (verify).** After parsing, sanity-check that **no
   intended-numeric value silently failed to parse** — a failed parse should
   surface as `NULL`, which is only visible with a **failure-detecting cast**.
   Note the divergence: a plain `CAST` errors on some engines and, on sqlite,
   returns `0`/partial rather than NULL — so use the engine's safe-cast idiom from
   `sql_dialect_notes` (requirement 3, the dialect-notes surface), then count
   residual NULLs among non-sentinel rows. *Why:* an encoding the sample missed
   would otherwise vanish as `0`/NULL instead of being caught.

### 2. One worked example — parse/scale, fully portable

Add **exactly one** new compact before/after `sql` example demonstrating the
parse-and-scale pattern on a synthetic generic schema
(e.g. `metrics(label, value_text)` with values like `'1.2K'`, `'$1,200'`, `'-'`):

- **Wrong:** `SUM(CAST(value_text AS REAL))` (or summing the raw strings) — the
  formatted values collapse to `0`/partial, so the total is silently wrong.
- **Right:** an early CTE that strips symbols with chained `REPLACE`, applies a
  `CASE` for the K/M/B suffix scale, maps `'-'`/`'N/A'`/`''` to `0`, casts to
  `DECIMAL`, then `SUM`s the parsed column.
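
One possible rendering of that wrong-vs-right pair is sketched below. It is an illustration only, not the example that ships in `SKILL.md`: the SQL is executed on SQLite through Python's standard `sqlite3` module, the four rows are made up, and the CTE name `parsed` and column name `value_num` are assumptions. The query itself sticks to `REPLACE`, `CASE`, `LIKE`, and `CAST`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (label TEXT, value_text TEXT)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("a", "1.2K"), ("b", "3M"), ("c", "$1,200"), ("d", "-")],
)

# Wrong: cast the formatted text directly and sum whatever comes out.
WRONG = "SELECT SUM(CAST(value_text AS REAL)) FROM metrics"

# Right: one early CTE maps sentinels, strips symbols and suffix letters,
# casts the remainder, and multiplies by the suffix scale.
RIGHT = """
WITH parsed AS (
  SELECT
    label,
    CASE
      WHEN value_text IN ('-', 'N/A', '') THEN 0
      ELSE CAST(
             REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
               value_text, '$', ''), ',', ''), 'K', ''), 'M', ''), 'B', '')
             AS DECIMAL(18, 4))
           * CASE
               WHEN value_text LIKE '%K' THEN 1000
               WHEN value_text LIKE '%M' THEN 1000000
               WHEN value_text LIKE '%B' THEN 1000000000
               ELSE 1
             END
    END AS value_num
  FROM metrics
)
SELECT SUM(value_num) FROM parsed
"""

wrong_total = conn.execute(WRONG).fetchone()[0]
right_total = conn.execute(RIGHT).fetchone()[0]

print(wrong_total)  # sums only the numeric prefixes
print(right_total)  # 1200 + 3000000 + 1200 + 0
```

Mapping the `'-'` sentinel to `0` here assumes an additive measure; for a non-additive one the same branch would return `NULL`.
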

**Standard, portable SQL only** — no `REGEXP_REPLACE`, `SAFE_CAST`, `TRY_CAST`,
`TRY_TO_NUMBER`, `toFloat64OrNull`, `GLOB`, or any dialect function — so the
example stays dialect-clean. Keep it ~12–16 lines. The **verify** step gets **no**
inline example (its correct form needs the engine-specific safe cast, delegated to
`sql_dialect_notes`, exactly as spec 10's period-spine and spec 11's
rolling-window variants were prose-only).

This adds **one** worked `sql` example to the skill. Spec 11 independently adds
one as well; **do not hardcode the resulting total** — increment from the current
state. As of this writing the skill carries **three** examples (spec 07
window-then-filter, spec 09 multi-hop fan-out, spec 10 panel spine), so this is
the **fourth**; if spec 11 ships first it is the **fifth**. The fence-count test
assertion is incremented by one from its current value (see Acceptance criteria).

### 3. Dialect-notes surface — `dialects/*.md` (safe cast)

Add a **"Safe cast"** idiom line to **each** of the seven authored dialect files,
parallel to spec 10's **Series** line and spec 11's **Rolling window** line. Each
line gives that engine's **failure-detecting numeric cast** — a cast that returns
`NULL` (or is detectably invalid) on a non-numeric input — which is what makes the
verify step correct on that engine. Each note is engine-exclusive (a SQLite
analyst gets the SQLite idiom and never another engine's construct, per the
existing dialect-notes leak guards). Orientation only — exact syntax is the
implementer's; verify against authoritative docs (context7 / the engine manual)
rather than asserting from memory:

- **postgres:** no `TRY_CAST` — guard with a numeric pattern before casting,
  e.g. `CASE WHEN x ~ '^-?[0-9.]+$' THEN x::numeric END`. (`regexp_replace` is
  available for the strip, but chained `REPLACE` is the portable default.)
- **mysql (8.0+):** no `TRY_CAST` — guard with `x REGEXP '^-?[0-9.]+$'` before
  `CAST(... AS DECIMAL)`; `REGEXP_REPLACE` is available for the strip.
- **bigquery:** `SAFE_CAST(x AS FLOAT64)` (or `SAFE_CAST(... AS NUMERIC)`) →
  `NULL` on failure.
- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_TO_DECIMAL(x, p, s)` / `TRY_CAST` →
  `NULL` on failure.
- **clickhouse:** `toFloat64OrNull(x)` / `toDecimalOrNull(...)` → `NULL`.
- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(18,4))` / `TRY_CONVERT` → `NULL`.
- **sqlite (the gotcha):** a plain `CAST` returns `0`/partial, **not** NULL or an
  error, so a coverage check must use a pattern guard such as
  `CASE WHEN cleaned GLOB '...' THEN CAST(cleaned AS REAL) END` (or a `typeof`
  check) to detect a value that did not parse.

This line is what makes the verify step executable from the dialect-agnostic
skill. It is **distinct** from the Series and Rolling-window lines (those generate
or window over a calendar; this detects a failed numeric parse). Phrase any
version note as `8.0+`-style, **not** "as of version …" (the dialect-notes test
bans version-dated wording).

### 4. Explicit constraints / exclusions

None of the following may appear (consistent with specs 07, 10, and 11):

- **No inline dialect-specific cast/regex syntax in the skill** — no `SAFE_CAST`,
  `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`,
  `replaceRegexpAll`, or `GLOB` anywhere in `SKILL.md`. The portable strip is
  chained `REPLACE`; the failure-detecting cast lives only in the dialect notes.
- **No regex-strip dialect line.** The character strip stays the portable
  chained-`REPLACE` default; the dialect notes gain only the **safe cast**.
- **No grader / gold-answer / benchmark reference**, and no output-shape contract
  (the skill is for interactive analysis).

### 5. Coordination with specs 07, 08, 10, and 11

- **Spec 07** owns the Schema-discovery group and its two existing bullets
  (*"Sample before you compose"*, *"Cast to the real type before comparing"*).
  Spec 12 **extends** that group and **builds on** both bullets — references them,
  never restates them; they must stay intact and uncontradicted.
- **Spec 08** owns the dialect-notes channel and its leak guards. Spec 12 adds one
  rubric line through that channel; the engine-exclusivity guards apply unchanged.
- **Spec 10** owns the additive-vs-non-additive discriminator (Answer
  completeness) and the dialect **Series** line. Spec 12 **references** the
  additivity rule for the sentinel `0`-vs-`NULL` choice; do not duplicate it.
- **Spec 11** independently adds the dialect **Rolling window** line, one `sql`
  example, and the **rolling-window** entry to the step-5 provision list. Spec 12
  touches the **same** three places (the dialect-notes rubric loop, the example
  count, and the step-5 list). Both are independent and additive — **add to the
  current state, do not assume an order**: name **safe-cast** in the step-5 list
  without removing rolling-window/series; increment the example count by one from
  whatever it is; add `/\*\*Safe cast/` to the rubric loop alongside any
  `/\*\*Rolling/` assertion.

### 6. Step pointer (no duplication)

The step-5 `sql_dialect_notes` provision list (currently "FQTN,
identifier-quoting, date, top-N, series/calendar, and JSON conventions"; spec 11
also names rolling-window) should additionally name the **safe-cast** convention
now that it exists. State each rule once inside `<sql_craft>`; the workflow steps
only point to it.

## Leak-safety (hard constraint)

Every worked example or note uses a **synthetic generic schema** (e.g.
`metrics(label, value_text)`) and made-up values (`'1.2K'`, `'$1,200'`, `'-'`),
showing only the *pattern*. **No** benchmark table names, SQL, or result values on
either surface. The dialect-notes additions, like the existing notes, carry no
benchmark / grader / version-dated content. The behavior is reconstructable from
first principles and tied to no specific instance.

## Acceptance criteria

- The `<sql_craft>` "Schema discovery before writing SQL" group states the three
  heuristics — inline, dialect-agnostic, each with a generic *why*, and each
  **building on** (not restating) the existing *"Sample before you compose"* and
  *"Cast to the real type before comparing"* bullets and spec 10's additivity rule:
  - **detect** text-encoded numerics by sampling distinct values (suffixes,
    symbols, separators, sentinels) — never from the column name;
  - **parse and scale** in an early CTE (strip → suffix-scale → sentinel map →
    cast), sentinel `0`-vs-`NULL` per spec 10's additivity rule;
  - **confirm coverage** with a failure-detecting cast, delegating the engine's
    safe-cast syntax to `sql_dialect_notes`.
- Exactly **one** new worked `sql` example: parse-and-scale, wrong-vs-right, using
  chained `REPLACE` + `CASE` suffix scale + sentinel `CASE` + `CAST(... AS
  DECIMAL)`, in standard portable SQL. The `sql`-fence count assertion is
  incremented by **one** from its current value (3 today → 4; or 5 if spec 11
  shipped first).
- Each of the seven `dialects/*.md` files gains a **"Safe cast"** idiom line in its
  engine's own failure-detecting numeric-cast idiom (including the sqlite
  `CAST`-returns-0 gotcha); no engine leaks another engine's construct, and the
  additions contain no benchmark / grader / version-dated content.
- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`,
  `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, inline
  `RANGE … INTERVAL` frame, **and no `SAFE_CAST` / `TRY_CAST` / `TRY_TO_NUMBER` /
  `REGEXP_REPLACE` / `toFloat64OrNull` / `GLOB`**, anywhere in `SKILL.md`
  including the new example.
- The step-5 `sql_dialect_notes` provision list names the **safe-cast** convention
  alongside FQTN / identifier-quoting / date / top-N / series-calendar /
  rolling-window / JSON.
- The existing interactive guidance (`<workflow>`, `<rules>`, the other examples),
  the two existing Schema-discovery bullets, and the existing dialect-note rubric
  lines (including **Series** and, if present, **Rolling window**) are intact and
  uncontradicted.
- No grader / benchmark reference, and no output-shape contract.
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
  still parses as `ktx-analytics`.

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.

- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three
  heuristics to the "Schema discovery before writing SQL" group (after its two
  existing bullets), the single parse-and-scale worked example, and extend the
  step-5 dialect-notes provision list to name the safe-cast convention. Leave
  `<workflow>` / `<rules>` / the other examples and the two existing
  schema-discovery bullets intact. Delivery is unchanged (single `SKILL.md` per
  target via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no
  change required.
- **Dialect notes:** the seven files under
  `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
  `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
  `copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each
  engine's actual safe-cast / try-cast support against authoritative docs before
  writing the idiom; do not assert from memory** (in particular the sqlite
  `CAST`-returns-0 behavior, which is the motivating gotcha).
- **Tests:**
  - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
    representative phrase for each of the three heuristics (e.g. a *detect*, a
    *parse/scale*, and a *confirm-coverage* phrase) to the `represents every craft
    behavior` list; bump the `sql`-fence count assertion **by one** from its
    current value; assert the example shape (e.g. `REPLACE(` and `CAST(` and a
    suffix-scale multiplier); and **strengthen** the dialect-clean guard by adding
    `SAFE_CAST`, `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`,
    and `GLOB` to the banned list (mirroring spec 10 adding `generate_series` /
    `GENERATE_DATE_ARRAY` and spec 11 adding the no-inline-`RANGE … INTERVAL`
    guard, so the "safe cast lives only in the dialect notes" criterion is
    *enforced*, not incidentally true).
  - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers
    the full rubric for every dialect" loop with the safe-cast assertion,
    `expect(notes).toMatch(/\*\*Safe cast/)`, so every dialect must answer it.
    Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
    all seven without a hand-maintained list. Do **not** add a false-exclusivity
    assertion for `TRY_CAST` (it is shared by snowflake and tsql); requiring the
    line per dialect is sufficient.
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
  `pnpm run build && pnpm run link:dev`.

## Benchmark context (motivation only)

At least one SQLite-subset question stores trading volume as suffix-encoded text
(`"K"`/`"M"`, `"-"` for zero) and fails because the agent aggregates the raw
strings — runnable, plausible, wrong. The sqlite `CAST`-returns-0 behavior makes
the failure especially insidious: there is no error to alert the agent, and a
naive `IS NULL` coverage check would not catch it either, which is precisely why
the safe-cast idiom belongs in the dialect notes. The fix — parse messy encodings
before math, then verify coverage with a failure-detecting cast — is universal
data hygiene that helps any analyst on any warehouse, so it belongs in the
product's craft (skill) plus the per-dialect safe-cast syntax that makes the
verify step executable, not in a benchmark-specific prompt. Improving the
benchmark score is a side effect; the skill and the dialect notes contain no trace
of the benchmark.

## Implementation notes

Shipped on branch `write-feature-spec-wiki`, on top of specs 10 and 11 (both already
applied in the working tree). Built from the current state per the "do not assume an
order" guidance — there were **four** worked examples (specs 07 window-then-filter,
09 multi-hop fan-out, 10 panel spine, 11 cumulative running total), so this is the
**fifth**, and step 5 already named `series/calendar, rolling-window`.

**Skill — `packages/cli/src/skills/analytics/SKILL.md`:**
- Added the three heuristics to the **"Schema discovery before writing SQL"** group,
  after the two existing bullets: *Parse text-encoded numerics before doing math on
  them* (detect by sampling distinct values, extending *Sample before you compose*,
  never inferring from the column name), *Strip, scale, and cast in one early CTE*
  (the *meaning-is-numeric* complement to *Cast to the real type before comparing*,
  with the sentinel `0`-vs-`NULL` choice deferred to spec 10's *Default by
  additivity* rule), and *Confirm the parse covered every value* (failure-detecting
  cast from `sql_dialect_notes`). Each carries a one-line generic *why*; the existing
  bullets and the additivity rule are referenced, not restated.
- Added **one** portable worked example (`metrics(label, value_text)` with `'1.2K'`,
  `'3M'`, `'$1,200'`, `'-'`): wrong = `SUM(CAST(value_text AS REAL))`; right = an
  early `parsed` CTE that strips with chained `REPLACE`, scales the K/M/B suffix with
  a `CASE`, maps sentinels to `0`, casts to `DECIMAL(18,4)`, then `SUM`s. Standard
  portable SQL only — no dialect functions, no inline safe cast.
- Step 5 dialect-notes provision list now names **safe-cast** alongside the others.

**Dialect notes — `packages/cli/src/context/sql-analysis/dialects/*.md`:** added a
**Safe cast** line to all seven files (after the *Rolling window* line), each giving
that engine's failure-detecting numeric cast: postgres/mysql use a numeric pattern
guard before casting (no `TRY_CAST`; mysql's bare `CAST` returns `0` with a warning);
bigquery `SAFE_CAST`; snowflake `TRY_TO_NUMBER`/`TRY_TO_DECIMAL`/`TRY_CAST`; tsql
`TRY_CAST`/`TRY_CONVERT`; clickhouse `toFloat64OrNull`/`toDecimal64OrNull` (the
`...OrZero` variants return `0`); sqlite documents the `CAST`-returns-`0.0`/partial
gotcha and a `GLOB` pattern guard. ClickHouse function names were verified against
the official docs via context7 (the spec's loose `toDecimalOrNull` is not a real
name — the `to<Type>OrNull` family requires a bit width, hence `toDecimal64OrNull`).
No version-dated wording.

**Tests:** `analytics-skill-content.test.ts` — added the three representative
phrases, bumped the `sql`-fence count 4 → 5 (and the test title), asserted the
example shape (`WITH parsed AS`, `REPLACE(`, `AS DECIMAL(`, `LIKE '%K' THEN 1000`),
and strengthened the dialect-clean banned list with `SAFE_CAST`, `TRY_CAST`,
`TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, and `GLOB` (mirroring spec 10's
`generate_series` / spec 11's inline-`RANGE … INTERVAL` guards). `dialect-notes.test.ts`
— added `expect(notes).toMatch(/\*\*Safe cast/)` to the per-dialect rubric loop, so
all seven (derived from `DIALECTS_WITH_NOTES`) must answer it; no false-exclusivity
assertion for the shared `TRY_CAST`.

**Verification:** both affected test files pass (19 tests); broader `test/skills` +
`test/context/mcp` pass (65 tests); production type-check (`tsc -p tsconfig.json`)
is clean; `pnpm run build` copies both surfaces into `dist` (7 dialect files carry
*Safe cast*, the built `SKILL.md` carries the parse example) and `pnpm run link:dev`
relinks `ktx-dev`. One **pre-existing, unrelated** type error remains in the
test-only config (`test/mcp-server-factory.test.ts:152`, byte-identical to HEAD,
untouched here) — out of scope for this spec.
@@ -1,336 +0,0 @@
# Output completeness — answer every requested part, enforced by a final pre-emit check

> Refined spec. Intake draft: `todo/14-output-completeness-final-check.md`.

## Problem
|
|
||||||
|
|
||||||
The single largest correctness failure mode for the analytics skill is
|
|
||||||
**incomplete output**: the query runs and the methodology is roughly right, but
|
|
||||||
the projection is missing columns the question asked for. The SQL is runnable and
|
|
||||||
the aggregate is correct — the answer is simply *short by columns*. Three
|
|
||||||
recurring shapes:
|
|
||||||
|
|
||||||
1. **Multi-part questions answered partially.** A question that asks for several
|
|
||||||
things ("report the highest *and* the lowest month, each with its count and
|
|
||||||
average, *and* the difference") comes back with only the first clause — one
|
|
||||||
column where several were requested.
|
|
||||||
2. **Identity dropped.** Grouping by a human-readable name but not projecting the
|
|
||||||
entity's identifier (a product name without its product id, a customer name
|
|
||||||
without its customer id).
|
|
||||||
3. **Inputs to a derived value dropped.** Returning a ratio / percentage /
|
|
||||||
difference but not the underlying counts the question also asked for.
|
|
||||||
|
|
||||||
Shapes 2 and 3 are **already covered** by shipped `<sql_craft>` rules — spec 07's
|
|
||||||
*"Expose identity, not just the label"* and *"Keep the inputs to a derived
|
|
||||||
value"* — yet they are frequently **not applied**. So the gap is not missing
|
|
||||||
knowledge: these rules sit as passive heuristics in a list, and nothing makes the
|
|
||||||
agent reliably check them before finalizing. The fix is twofold: (a) add the
|
|
||||||
missing **multi-part-completeness** rule that generalizes shapes 1–3, and (b)
|
|
||||||
turn output-completeness into an **explicit final verification step** the agent
|
|
||||||
performs before emitting SQL, so the existing identity/inputs rules are actually
|
|
||||||
enforced rather than merely listed.
|
|
||||||
|
|
||||||
The failure is **model-independent**: a markedly stronger model produced the same
|
|
||||||
incomplete-output mistakes on these questions, which means it is a
|
|
||||||
craft/enforcement gap, not a capability gap — exactly the kind of universal
|
|
||||||
analyst craft that belongs in the shipped skill.
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
An analyst is asked: *"For each region, report the highest and the lowest monthly
|
|
||||||
order count, and the difference between them."* A complete answer has one column each for
|
|
||||||
the region's id and name, the highest count, the lowest count, and the difference
|
|
||||||
— five columns. Returning just the region and a single number answers only part
|
|
||||||
of the request. This is a universal expectation on any database: answer **every**
|
|
||||||
part of a multi-part request, identify the entities, and show the inputs behind
|
|
||||||
any derived figure — and answer *exactly* that, without padding the result with
|
|
||||||
columns the question never asked for.
|
|
||||||
|
|
||||||
## Model
|
|
||||||
|
|
||||||
The change is **additive content in one Markdown file**
|
|
||||||
(`skills/analytics/SKILL.md`), governed by the same invariants spec 07
|
|
||||||
established. They constrain the implementer; the exact prose is theirs.
|
|
||||||
|
|
||||||
### Additive, inline, heuristic-with-a-why
|
|
||||||
|
|
||||||
Consistent with specs 07 and 10: the change is additive content in
|
|
||||||
`skills/analytics/SKILL.md`, **inline** (no bundled `reference/` file — the
|
|
||||||
`setup-agents.ts` delivery ships only `SKILL.md` per target), dialect-agnostic,
|
|
||||||
and phrased as **heuristics with a one-line generic rationale**, not a wall of
|
|
||||||
MUSTs. The new rule extends the existing `<sql_craft>` "Answer completeness /
|
|
||||||
interpretation" group; the shipped bullets in that group (including the *identity*
|
|
||||||
and *inputs* rules this spec builds on) are preserved unchanged. No new tool,
|
|
||||||
flag, or config.
|
|
||||||
|
|
||||||
### The over-projection guard carries a *universal* why, not a grader reference
|
|
||||||
|
|
||||||
The intake draft frames "don't pad the result with extra columns" as
|
|
||||||
*grader-gaming*. The skill forbids **any** reference to a grader, gold answer, or
|
|
||||||
benchmark (spec 07's hard invariant; the content test bans the words). So the
|
|
||||||
guard must ship with a **universal analytics rationale** instead: columns the
|
|
||||||
question did not ask for add noise, mislead the reader into thinking they matter,
|
|
||||||
and make the result harder to consume — match the request exactly, neither short
|
|
||||||
nor padded. This is the same reconciliation spec 07 applied to the draft's
|
|
||||||
"behavior only, no rationale" instruction: generic *why* is required; only
|
|
||||||
grader/gold/benchmark rationale is banned.
|
|
||||||
|
|
||||||
### Completeness is a closed set — identity and inputs are *inside* it
|
|
||||||
|
|
||||||
"Expose identity" and "keep the inputs" tell the agent to add columns; the
|
|
||||||
over-projection guard tells it not to. These only contradict if the target is
|
|
||||||
left fuzzy, so this spec pins it down. A **complete projection** is exactly:
|
|
||||||
|
|
||||||
> {every requested metric/attribute} ∪ {the identifier of each grouped/named
|
|
||||||
> entity} ∪ {the inputs to each derived value}, at the grain the question
|
|
||||||
> specifies.
|
|
||||||
|
|
||||||
Identity and inputs are **members of that set** — part of completeness, never
|
|
||||||
"padding." **Under-projection** is any member missing (the failure this spec
|
|
||||||
attacks); **over-projection** is any column *outside* the set (what the guard
|
|
||||||
forbids). The implementer must phrase the rule and guard against this single
|
|
||||||
definition so they read as one coherent notion, not two competing instructions.
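The closed-set definition above can be modeled in a few lines. This is a minimal sketch, not anything that exists in ktx: the `ProjectionRequest` shape, the function names, and the sample column names (`highest`, `lowest`, `difference`, `created_at`) are all hypothetical, and grain is deliberately left out of the model.

```typescript
// Hypothetical model of the closed-set definition; none of these names exist in ktx.
interface ProjectionRequest {
  requested: string[]; // metrics/attributes the question names
  identifiers: string[]; // identifier of each grouped or named entity
  derivedInputs: string[]; // inputs behind each derived value
}

interface ProjectionAudit {
  missing: string[]; // under-projection: set members that were not projected
  extra: string[]; // over-projection: projected columns outside the set
}

// The complete projection is the union of the three member groups.
function completeProjection(req: ProjectionRequest): Set<string> {
  return new Set(req.requested.concat(req.identifiers, req.derivedInputs));
}

// Compare what a query projects against the complete projection.
function auditProjection(req: ProjectionRequest, projected: string[]): ProjectionAudit {
  const target = completeProjection(req);
  const actual = new Set(projected);
  return {
    missing: Array.from(target).filter((column) => !actual.has(column)),
    extra: Array.from(actual).filter((column) => !target.has(column)),
  };
}

// "For each region, report the highest and lowest monthly order count and the difference."
const regionQuestion: ProjectionRequest = {
  requested: ["region_name", "highest", "lowest", "difference"],
  identifiers: ["region_id"],
  derivedInputs: ["highest", "lowest"], // inputs to `difference`, already requested
};

// Answers only the first clause: under-projected.
const partial = auditProjection(regionQuestion, ["region_name", "highest"]);

// Covers everything but adds an unrequested column: over-projected.
const padded = auditProjection(regionQuestion, [
  "region_id", "region_name", "highest", "lowest", "difference", "created_at",
]);
```

Because identity and inputs are members of the same set as the requested outputs, a single comparison reports both failure directions, which is the point of treating completeness as one notion rather than two competing instructions.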
|
|
||||||
|
|
||||||
### Dialect-agnostic, additive-only, exclusions intact
|
|
||||||
|
|
||||||
Every addition reads correctly on any dialect — no dialect-specific syntax in the
|
|
||||||
rule text or the worked example. The existing `<workflow>`, `<rules>`, and the
|
|
||||||
other `<sql_craft>` bullets and examples (specs 07/09/10/11/12) are preserved and
|
|
||||||
uncontradicted. Spec 07's exclusions still hold: no output-shape contract, no
|
|
||||||
`MAX(date)` anchoring of relative time, no grader-driven advice, no dialect
|
|
||||||
syntax.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
### 1. Multi-part / multi-output completeness — a new umbrella rule
|
|
||||||
|
|
||||||
Add a bullet to the `<sql_craft>` "Answer completeness / interpretation" group:
|
|
||||||
when a question requests several outputs — a **list** ("A, B, and C"), **paired
|
|
||||||
extremes** ("the highest *and* the lowest"), or a **value plus its components**
|
|
||||||
("X, Y, and their ratio") — the final projection must contain a column for
|
|
||||||
**each** requested output. *Why:* answering only the first clause is the most
|
|
||||||
common way a runnable query is still wrong; the grain and methodology can be
|
|
||||||
perfect yet the answer is short by columns.
|
|
||||||
|
|
||||||
This rule is the **umbrella** over the two shipped completeness rules: the
|
|
||||||
*inputs* rule (*"Keep the inputs to a derived value"*) is its "value + components"
|
|
||||||
instance, and the *identity* rule (*"Expose identity, not just the label"*) is its
|
|
||||||
"entity identity" instance. The new bullet should **name that relationship**
|
|
||||||
(so the three read as one notion) rather than restating either rule.
|
|
||||||
|
|
||||||
Keep this distinct from the row-selection rules in the same group: *"Top /
|
|
||||||
highest / most / lowest"* and *"For each X / per X / by X"* govern **which rows**
|
|
||||||
appear; multi-part completeness governs **which columns** appear. They compose
|
|
||||||
(e.g. "highest and lowest per region" needs one row per region *and* a column per
|
|
||||||
clause).
|
|
||||||
|
|
||||||
### 2. Final completeness check — the enforcement mechanism
|
|
||||||
|
|
||||||
The rule content lives **once** in `<sql_craft>`; the trigger is promoted to a
|
|
||||||
first-class line in `<workflow>` step 6.
|
|
||||||
|
|
||||||
- **Capstone bullet in `<sql_craft>`** (closing the "Answer completeness /
|
|
||||||
interpretation" group): *before emitting the final SQL, re-read the question and
|
|
||||||
confirm the projection covers* —
|
|
||||||
1. every named **metric / attribute** the question asks for (→ the multi-part
|
|
||||||
rule);
|
|
||||||
2. the **identifier** of every grouped or named entity (→ the *identity* rule);
|
|
||||||
3. every **input** to each derived value (→ the *inputs* rule);
|
|
||||||
4. all at the **grain** the question specifies (→ the *for each X* / panel
|
|
||||||
rules).
|
|
||||||
|
|
||||||
Each facet cross-references the rule it enforces, so the check is what makes
|
|
||||||
those passive rules active. Phrase it as a short, concrete "confirm the
|
|
||||||
projection covers…" checklist, not a wall of MUSTs.
|
|
||||||
|
|
||||||
- **Over-projection guard** (attached to the check): do **not** add columns the
|
|
||||||
question did not ask for "to be safe" — extra columns add noise, mislead, and
|
|
||||||
make the result harder to consume; match the request exactly. Carries the
|
|
||||||
**universal** why from the Model, **never** a grader/gold/benchmark reference.
|
|
||||||
|
|
||||||
- **`<workflow>` step 6 line** (the explicit ritual): step 6 ("Validate and
|
|
||||||
explain") gains a mandatory line directing the agent to **always** run the final
|
|
||||||
completeness check before emitting — re-read the question and verify every
|
|
||||||
requested output, each entity's identity, each derived value's inputs, and the
|
|
||||||
grain are all projected — pointing into the `<sql_craft>` capstone for the
|
|
||||||
detail. This **replaces the current conditional pointer's role** ("If a result
|
|
||||||
is unexpectedly empty or its grain looks wrong, work through the … rules"): the
|
|
||||||
empty/grain diagnostic stays available (it maps to the existing *"Diagnose empty
|
|
||||||
results"* and grain rules), but the completeness check fires **unconditionally**,
|
|
||||||
on every SQL-authoring turn, not only when a result looks off. The workflow line
|
|
||||||
names the ritual and the four facets; the rationale, guard, and example are
|
|
||||||
stated once in `<sql_craft>`, not duplicated into the workflow.
|
|
||||||
|
|
||||||
### 3. One worked example (dialect-agnostic)
|
|
||||||
|
|
||||||
Add **exactly one** compact before/after example to the "Answer completeness /
|
|
||||||
interpretation" group, demonstrating multi-part completeness on a **synthetic**
|
|
||||||
schema (`regions`, `region_monthly`):
|
|
||||||
|
|
||||||
- **WRONG:** answers only the first clause — `SELECT region_name,
|
|
||||||
MAX(monthly_orders) AS highest … GROUP BY region_name` — with no region id, no
|
|
||||||
lowest, no difference.
|
|
||||||
- **RIGHT:** one column per requested output plus the entity's identity, at the
|
|
||||||
region grain — `region_id, region_name`, the highest, the lowest, and the
|
|
||||||
difference, with `regions` joined to `region_monthly` and grouped by the region
|
|
||||||
id and name.
|
|
||||||
|
|
||||||
Standard dialect-clean SQL only (no `QUALIFY`, no dialect functions; `MAX`/`MIN`
|
|
||||||
are portable aggregates). Keep it tight. It teaches multi-clause coverage +
|
|
||||||
identity + derived-value inputs in one capstone, and is **distinct** from the
|
|
||||||
spec-10 `regions` panel example: that one is about missing **rows** (LEFT-JOIN
|
|
||||||
spine + `COALESCE`); this one is about missing **columns**. This is the **sixth**
|
|
||||||
worked `sql` example in the skill (after specs 07/09/10/11/12).
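For orientation, the RIGHT shape described above might look like the query held in the string below. This is a sketch under stated assumptions rather than the shipped example: the tables, the `rm` alias, `monthly_orders`, and `order_count_range` appear elsewhere in this spec, while the join key (`region_id` on both tables) and the `highest_monthly_orders` / `lowest_monthly_orders` aliases are invented for illustration.

```typescript
// Sketch only. From the spec: tables `regions` / `region_monthly`, alias `rm`,
// column `monthly_orders`, difference column `order_count_range`.
// Assumed for illustration: the join key and the highest/lowest column aliases.
const completeRegionQuery = `
SELECT r.region_id, r.region_name,
       MAX(rm.monthly_orders) AS highest_monthly_orders,
       MIN(rm.monthly_orders) AS lowest_monthly_orders,
       MAX(rm.monthly_orders) - MIN(rm.monthly_orders) AS order_count_range
FROM regions r
JOIN region_monthly rm ON rm.region_id = r.region_id
GROUP BY r.region_id, r.region_name
`;

// The first-clause-only projection the spec labels WRONG (its FROM clause is elided there).
const firstClauseOnly = "SELECT region_name, MAX(monthly_orders) AS highest";

// Fragments a complete answer must contain: identity, both extremes, the difference.
const requiredFragments = [
  "r.region_id, r.region_name",
  "MAX(rm.monthly_orders)",
  "MIN(rm.monthly_orders)",
  "AS order_count_range",
  "region_monthly",
];

// Return the required fragments that a SQL string does not contain.
function missingFragments(sql: string, fragments: string[]): string[] {
  return fragments.filter((fragment) => sql.indexOf(fragment) === -1);
}
```

The query uses only `MAX`, `MIN`, subtraction, an inner join, and `GROUP BY`, so it stays within the dialect-clean constraint; `missingFragments` is a hypothetical helper showing how the WRONG form falls short on every facet.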
|
|
||||||
|
|
||||||
### 4. Coordination with specs 03 and 07/09/10/11/12
|
|
||||||
|
|
||||||
- **Spec 03** (multi-connection routing) owns `<workflow>` step 0 and the
|
|
||||||
`connectionId` threading/scoping. Spec 14 touches `<workflow>` only to add the
|
|
||||||
completeness-check line to **step 6** — it must not rewrite the routing or the
|
|
||||||
`<rules>` `connectionId` scoping. If both land, step 6 reads coherently: validate
|
|
||||||
plus the completeness ritual.
|
|
||||||
- **Specs 07/09/10/11/12** own their own bullets and worked examples in
|
|
||||||
`<sql_craft>`. Spec 14 is **additive** to the same "Answer completeness /
|
|
||||||
interpretation" group and adds one example; it must not remove or contradict
|
|
||||||
theirs.
|
|
||||||
|
|
||||||
## Leak-safety (hard constraint)
|
|
||||||
|
|
||||||
The example uses an **invented, generic schema** (`regions`, `region_monthly`) and
|
|
||||||
made-up columns — **no benchmark table names, SQL, or result values.** It teaches
|
|
||||||
the *pattern* (cover every requested output + identity + inputs, at grain, without
|
|
||||||
padding), which is universal and tied to no specific instance. The over-projection
|
|
||||||
guard's rationale is **universal** (noise/clarity/consumability), never
|
|
||||||
"grader-gaming" or any other scoring reference. No part of the addition mentions a
|
|
||||||
benchmark, gold answer, grader, or scoring comparator.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- `<sql_craft>` "Answer completeness / interpretation" states the **multi-part /
|
|
||||||
multi-output completeness** rule (a column per requested output; list / paired
|
|
||||||
extremes / value-plus-components), named as the umbrella over the shipped
|
|
||||||
*identity* and *inputs* rules — inline, dialect-agnostic, with a generic *why*.
|
|
||||||
- `<sql_craft>` states a concrete **final completeness check** (re-read the
|
|
||||||
question → confirm metrics + entity identity + derived-value inputs + grain are
|
|
||||||
projected), cross-referencing the existing identity/inputs/grain rules so they
|
|
||||||
are enforced, not merely listed.
|
|
||||||
- The check carries the **over-projection guard** with a **universal** rationale
|
|
||||||
(don't pad with unrequested columns — noise / misleading / harder to consume),
|
|
||||||
and the skill contains **zero** grader/gold/benchmark references anywhere.
|
|
||||||
- `<workflow>` **step 6** carries a mandatory line that runs the completeness
|
|
||||||
check **unconditionally** before emitting and points into the `<sql_craft>`
|
|
||||||
capstone; the rule content is **stated once** in `<sql_craft>` (no duplicated
|
|
||||||
rationale/guard in the workflow). The empty/grain diagnostic remains available.
|
|
||||||
- Exactly **one** new worked `sql` example is present (synthetic
|
|
||||||
`regions`/`region_monthly`, wrong vs complete), in standard dialect-agnostic SQL;
|
|
||||||
the skill then carries **six** `sql` worked examples total.
|
|
||||||
- The existing interactive guidance (`<workflow>` steps, `<rules>`, the other
|
|
||||||
`<sql_craft>` bullets and the five prior examples) is intact and uncontradicted;
|
|
||||||
the additive-only and dialect-clean invariants from specs 07/10 still hold.
|
|
||||||
- None of spec 07's excluded items appear (output-shape contract, `MAX(date)`
|
|
||||||
anchoring of "recent"/"past N", grader-driven advice, dialect syntax).
|
|
||||||
- The skill stays scannable and comfortably under the 500-line budget; the
|
|
||||||
frontmatter still parses as `ktx-analytics`.
|
|
||||||
- The analytics-skill **content test is updated** to cover the new rule and check
|
|
||||||
(see Implementation orientation).
|
|
||||||
|
|
||||||
## Implementation orientation
|
|
||||||
|
|
||||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
|
||||||
the prose.
|
|
||||||
|
|
||||||
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md`.
|
|
||||||
- Add the multi-part-completeness bullet and the final-completeness-check
|
|
||||||
capstone (with the over-projection guard) to the `<sql_craft>` "Answer
|
|
||||||
completeness / interpretation" group; add the single
|
|
||||||
`regions`/`region_monthly` worked example.
|
|
||||||
- In `<workflow>` step 6, replace the current conditional answer-completeness
|
|
||||||
pointer with the mandatory completeness-check line (unconditional, names the
|
|
||||||
four facets, points into `<sql_craft>`); keep the empty/grain diagnostic.
|
|
||||||
- Leave `<workflow>` steps 0–5, `<rules>`, and the other `<sql_craft>`
|
|
||||||
bullets/examples intact. Delivery is unchanged (single `SKILL.md` per target
|
|
||||||
via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change
|
|
||||||
required.
|
|
||||||
- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`.
|
|
||||||
- Add representative phrases to the "represents every craft behavior" list for
|
|
||||||
the multi-part rule, the final completeness check, and the over-projection
|
|
||||||
guard.
|
|
||||||
- Bump the worked-example `sql`-fence count assertion **5 → 6** (and update the
|
|
||||||
test name/comment), and assert the new example's shape (e.g. `region_monthly`,
|
|
||||||
`MAX(`, `MIN(`, the difference expression, `region_id`).
|
|
||||||
- The existing dialect-clean, grader/benchmark-clean, and relative-time
|
|
||||||
(`MAX(...)` anchoring) guards must still pass — the new example's `MAX`/`MIN`
|
|
||||||
lines carry no "recent"/"past N" wording, so the phrase-level guard is
|
|
||||||
unaffected. The `SkillsRegistryService` frontmatter test must still pass.
|
|
||||||
- Rebuild and re-link the dev binary so the playground picks up the updated skill:
|
|
||||||
`pnpm run build && pnpm run link:dev`.
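The fence-count and banned-word guards listed under Tests could be approximated as below. This is a hedged sketch, not the contents of `analytics-skill-content.test.ts`: the helper names, the sample skill text, and the exact banned-word list are assumptions, and the real test presumably reads `SKILL.md` from disk and uses the project's test runner.

```typescript
// Build the fence marker programmatically so this block contains no nested fences.
const FENCE = "`".repeat(3);

// Count opening fences tagged with a given language (e.g. "sql").
function countFences(markdown: string, lang: string): number {
  return markdown.split("\n").filter((line) => line.trim() === FENCE + lang).length;
}

// Return the banned words that appear anywhere in the text (case-insensitive).
function findBannedWords(markdown: string, banned: string[]): string[] {
  const lower = markdown.toLowerCase();
  return banned.filter((word) => lower.indexOf(word.toLowerCase()) !== -1);
}

// Hypothetical stand-in for the skill file: two worked `sql` examples.
const sampleSkill = [
  "## sql_craft",
  FENCE + "sql",
  "SELECT 1",
  FENCE,
  "Final completeness check: re-read the question before emitting.",
  FENCE + "sql",
  "SELECT r.region_id, r.region_name FROM regions r",
  FENCE,
].join("\n");

// Assumed banned list; the spec says only that grader/gold/benchmark wording is banned.
const bannedWords = ["grader", "gold answer", "benchmark"];

const sqlFenceCount = countFences(sampleSkill, "sql");
const leaks = findBannedWords(sampleSkill, bannedWords);
```

Counting only opening fences that carry the `sql` tag keeps closing fences and fences in other languages out of the tally, which is what makes a "5 → 6" bump a precise assertion.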
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
On the latest SQLite-subset run, **incomplete output was the single largest
|
|
||||||
failure bucket (~13 of 51 voted failures)**: multi-part questions answered
|
|
||||||
partially, plus dropped identity / derived-value inputs — the latter two being
|
|
||||||
spec-07 rules that already exist but weren't applied. A probe with a much stronger
|
|
||||||
model reproduced the *same* incomplete-output failures, confirming this is a
|
|
||||||
craft-enforcement gap rather than a model-capability one. The fix — answer every
|
|
||||||
requested part, identify the entities, keep the inputs, and don't pad — is
|
|
||||||
universal analyst craft, so it belongs in the product skill (and transfers to real
|
|
||||||
users), enforced as a final pre-emit check rather than left as a passive hint.
|
|
||||||
Improving the benchmark score is a side effect; the skill contains no trace of the
|
|
||||||
benchmark.
|
|
||||||
|
|
||||||
## Implementation notes
|
|
||||||
|
|
||||||
Implemented as additive content in one Markdown file plus a test update.
|
|
||||||
|
|
||||||
- **Skill — `packages/cli/src/skills/analytics/SKILL.md`** (`<sql_craft>` "Answer
|
|
||||||
completeness / interpretation" group):
|
|
||||||
- Added the **"Answer every requested output"** umbrella bullet (list / paired
|
|
||||||
extremes / value-plus-components → a column per requested output, with a generic
|
|
||||||
*why*). It names *keep the inputs* and *expose identity* as its "value +
|
|
||||||
components" and "entity identity" instances, pins the closed-set definition of a
|
|
||||||
complete projection, and marks itself as governing *which columns* appear —
|
|
||||||
distinct from the *Top …* / *For each X* row-selection rules, with which it
|
|
||||||
composes. The two shipped instance rules are preserved verbatim.
|
|
||||||
- Added the **"Final completeness check"** capstone bullet: a four-facet
|
|
||||||
"before emitting, re-read the question and confirm the projection covers…"
|
|
||||||
checklist (metric/attribute → multi-part rule; identifier → *expose identity*;
|
|
||||||
inputs → *keep the inputs*; grain → *for each X* / *complete the panel*), run on
|
|
||||||
every query. It carries the **over-projection guard** with a universal rationale
|
|
||||||
(unrequested columns add noise, mislead, and are harder to consume — match the
|
|
||||||
request exactly), with **no** grader/gold/benchmark reference.
|
|
||||||
- Added one worked `sql` example (synthetic `regions` / `region_monthly`): WRONG
|
|
||||||
answers only the first clause (`SELECT region_name, MAX(monthly_orders) …`),
|
|
||||||
dropping the region id, the lowest, and the difference; RIGHT projects
|
|
||||||
`r.region_id, r.region_name`, `MAX` highest, `MIN` lowest, and the
|
|
||||||
`MAX − MIN` difference, joining `regions` to `region_monthly` and grouping by id
|
|
||||||
and name. This is the **sixth** `sql` example, dialect-clean (portable `MAX`/`MIN`).
|
|
||||||
- `<workflow>` **step 6**: replaced the conditional answer-completeness pointer
|
|
||||||
with an unconditional *"Always run the final completeness check before emitting"*
|
|
||||||
line that names the four facets and points into the `<sql_craft>` capstone; the
|
|
||||||
empty/grain diagnostic is retained for diagnosis. Steps 0–5, `<rules>`, and the
|
|
||||||
other `<sql_craft>` bullets/examples are untouched.
|
|
||||||
- Delivery is unchanged: `readAnalyticsSkillContent` in
|
|
||||||
`packages/cli/src/setup-agents.ts` still ships the single `SKILL.md` per target
|
|
||||||
(confirmed, no change required).
|
|
||||||
- **Tests — `packages/cli/test/skills/analytics-skill-content.test.ts`:** added the
|
|
||||||
three representative phrases (`Answer every requested output`, `Final completeness
|
|
||||||
check`, `Don't over-project`); bumped the `sql`-fence count assertion 5 → 6 and
|
|
||||||
renamed that test; asserted the new example's shape (`region_monthly`,
|
|
||||||
`MAX(rm.monthly_orders)`, `MIN(rm.monthly_orders)`, the `MAX − MIN` difference, and
|
|
||||||
`r.region_id, r.region_name`). The dialect-clean, grader/benchmark-clean,
|
|
||||||
relative-time, and frontmatter guards still pass.
|
|
||||||
- **Verification:** `analytics-skill-content` 9/9 and `setup-agents` 46/46 pass;
|
|
||||||
production type-check (`tsconfig.json`, src) is clean; `pnpm run build` copied the
|
|
||||||
updated skill into `dist/skills/analytics/SKILL.md` (6 fences, all new content
|
|
||||||
present) and `pnpm -w run link:dev` re-linked `ktx-dev` so the playground picks it
|
|
||||||
up. The skill is 244 lines (< 500 budget) and the frontmatter still parses as
|
|
||||||
`ktx-analytics`.
|
|
||||||
- **Deviation (cosmetic):** the worked example uses alias `rm` and a difference
|
|
||||||
column named `order_count_range`; the intake draft sketched alias `m` and
|
|
||||||
`AS difference`. The spec leaves prose to the implementer, so the change is purely
|
|
||||||
naming.
|
|
||||||
- **Unrelated pre-existing issue:** `tsconfig.test.json` reports one type error in
|
|
||||||
`packages/cli/test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts`/`contextTools`
|
|
||||||
mismatch introduced by the earlier connection-scoped-wiki commit `2677b3ef`). It is
|
|
||||||
untouched by this work and out of scope here.
|
|
||||||
|
|
@@ -1,405 +0,0 @@
|
||||||
# Structured, leveled logging for the ktx MCP server
|
|
||||||
|
|
||||||
> Refined spec. Intake draft: `todo/15-mcp-server-structured-logging.md`.
|
|
||||||
>
|
|
||||||
> **Scope: observability only.** This spec is about *seeing* what the MCP server
|
|
||||||
> does (which tool, what params, when, how long, outcome). *Preventing* a runaway
|
|
||||||
> query from blocking the server (off-event-loop / interruptible execution) is a
|
|
||||||
> separate concern — see "Non-goals".
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
The ktx MCP server (`mcp-http-server.ts` + `mcp-stdio-server.ts`, both built
|
|
||||||
through `mcp-server-factory.ts` on raw `node:http` + the
|
|
||||||
`@modelcontextprotocol/sdk` transports) emits almost no operational logs. There
|
|
||||||
is no server-side record of **which MCP tool was called, with what parameters,
|
|
||||||
when, how long it took, or whether it succeeded** — nor of session open/close or
|
|
||||||
transport errors. When a tool call is slow, hangs, or a client connection drops
|
|
||||||
("Transport channel closed"), an operator has no trail to diagnose it and must
|
|
||||||
resort to process sampling / `lsof` / guesswork — and the offending input
|
|
||||||
(e.g. the exact SQL) is typically unrecoverable.
|
|
||||||
|
|
||||||
The hook to fix this already exists but is half-built: `instrumentMcpServer`
|
|
||||||
(`context/mcp/context-tools.ts`) wraps every tool handler and already times it,
|
|
||||||
but it emits **only on completion** (a sampled `mcp_request_completed` telemetry
|
|
||||||
event) and **never writes a start line and never writes to the server log**. A
|
|
||||||
call that never returns therefore leaves no trace at all.
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
Anyone running a long-lived ktx MCP server — a developer's local instance
|
|
||||||
(stdio, launched by Claude Desktop / Cursor), a foreground HTTP server, or a
|
|
||||||
shared/hosted HTTP daemon — needs observability into tool-call activity to:
|
|
||||||
|
|
||||||
- diagnose slow or hung tool calls (which `sql_execution` ran, against which
|
|
||||||
connection, with what SQL, for how long);
|
|
||||||
- explain client-visible connection failures from the server side (session
|
|
||||||
lifecycle, transport-closed events);
|
|
||||||
- audit what agents asked the server to do;
|
|
||||||
- spot patterns (hot tools, slow connections, error rates).
|
|
||||||
|
|
||||||
This is standard production-server hygiene; the server currently provides none.
|
|
||||||
|
|
||||||
## Design decisions (resolved during refinement)
|
|
||||||
|
|
||||||
These resolve ambiguities the intake draft left open. They constrain the
|
|
||||||
implementer; the exact code is theirs.
|
|
||||||
|
|
||||||
### One `pino` logger, synchronous, written to **stderr**
|
|
||||||
|
|
||||||
Use `pino` — the de-facto standard structured-JSON logger for Node servers — as
|
|
||||||
a single shared instance. Two corrections to the draft's sketch:
|
|
||||||
|
|
||||||
- **stderr, not stdout.** The stdio transport reserves **stdout** for the
|
|
||||||
JSON-RPC protocol (`mcp-stdio-server.ts` deliberately no-ops `stdout.write`);
|
|
||||||
writing logs there would corrupt the protocol stream. The HTTP daemon already
|
|
||||||
redirects **both** child fds to `.ktx/logs/mcp.log`
|
|
||||||
(`managed-mcp-daemon.ts`: `stdio: ['ignore', log.fd, log.fd]`), so stderr lands
|
|
||||||
in the same log file (surfaced by `ktx mcp logs`). **stderr is therefore the
|
|
||||||
one universally-correct sink** for both transports.
|
|
||||||
- **Synchronous, no worker-thread transport.** `pino` writes through a
|
|
||||||
`DestinationStream` (`{ write(msg) }`) — the server's existing
|
|
||||||
`KtxCliIo.stderr` sink satisfies that interface directly. Configure pino with a
|
|
||||||
**synchronous** destination (`pino.destination({ sync: true })`, or the
|
|
||||||
pino-pretty stream below with `sync: true`). This is load-bearing: the
|
|
||||||
`tool.start` line **must** be flushed to the fd *before* the (possibly
|
|
||||||
blocking) handler runs, so a runaway synchronous `better-sqlite3` query that
|
|
||||||
pegs the event loop still leaves the start line on disk. A worker-thread
|
|
||||||
transport (`transport: { target: ... }`) buffers and can lose that exact line
|
|
||||||
on a hard crash — **do not use transport mode.**
|
|
||||||
|
|
||||||
### Format is derived from `stderr.isTTY`, not a config flag
|
|
||||||
|
|
||||||
One logger, two serializations chosen by the environment (the "behavior follows
|
|
||||||
from inputs" rule — not a user-visible knob):
|
|
||||||
|
|
||||||
- **TTY** (`ktx mcp start --foreground` or `ktx mcp stdio` run in a terminal) →
|
|
||||||
**`pino-pretty` as a synchronous in-process stream** (`pretty({ sync: true,
|
|
||||||
destination: <stderr sink> })`, colorized). A readable live dev view.
|
|
||||||
- **Not a TTY** (the detached daemon, whose stderr is the `.ktx/logs/mcp.log`
|
|
||||||
file fd) → **plain JSON line** via the synchronous pino destination. The log
|
|
||||||
*file* stays structured JSON so the incident workflow ("recover the hung query
|
|
||||||
with a one-line `grep` / `jq`") works — colorized ANSI in a file would defeat
|
|
||||||
it.
|
|
||||||
|
|
||||||
`KtxCliIo.stderr` has no `isTTY` field (`cli-runtime.ts`), so detect the terminal
|
|
||||||
from the underlying stream (`process.stderr.isTTY`) at logger construction, while
|
|
||||||
still writing *through* the `io.stderr` sink so tests can capture emitted lines.
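The selection rule can be illustrated without pulling in `pino` at all. The sketch below is a stand-in, not the implementation: `LineSink` mirrors the `{ write(msg) }` shape described earlier, while `chooseFormat`, `formatLine`, and `createLineLogger` are hypothetical names, string levels are a simplification, and the "pretty" branch only gestures at what `pino-pretty` would render.

```typescript
// Minimal stand-in for the sink shape the spec describes: `{ write(msg) }`.
interface LineSink {
  write(msg: string): void;
}

type LogFormat = "pretty" | "json";

// `process.stderr.isTTY` is `true` in a terminal and `undefined` otherwise,
// so anything other than `true` selects the structured JSON form.
function chooseFormat(isTTY: boolean | undefined): LogFormat {
  return isTTY === true ? "pretty" : "json";
}

// Serialize one record in the chosen format, one line per record.
function formatLine(format: LogFormat, record: Record<string, unknown>): string {
  if (format === "json") return JSON.stringify(record) + "\n";
  const { level, msg, ...rest } = record;
  return `${String(level).toUpperCase()} ${String(msg)} ${JSON.stringify(rest)}\n`;
}

// Detect the terminal once at construction, but always write through the sink
// so a test can capture the emitted lines.
function createLineLogger(sink: LineSink, isTTY: boolean | undefined) {
  const format = chooseFormat(isTTY);
  return (record: Record<string, unknown>): void => {
    sink.write(formatLine(format, record));
  };
}

// Usage with an in-memory sink standing in for `io.stderr`.
const captured: string[] = [];
const memorySink: LineSink = { write: (msg) => { captured.push(msg); } };

const daemonLog = createLineLogger(memorySink, undefined); // file fd: JSON lines
const terminalLog = createLineLogger(memorySink, true); // TTY: readable lines

daemonLog({ level: "info", msg: "tool.start", tool: "sql_execution" });
terminalLog({ level: "info", msg: "tool.start", tool: "sql_execution" });
```

Passing the TTY flag in as an argument, rather than reading `process.stderr` inside the logger, is the design choice that keeps the format decision environment-driven while leaving the sink injectable for tests.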
|
|
||||||
|
|
||||||
### Single hook: extend `instrumentMcpServer`, do not fork a second wrapper
|
|
||||||
|
|
||||||
Tool-call logging is added to the existing `instrumentMcpServer`
|
|
||||||
(`context-tools.ts`), which already wraps `registerTool` and measures duration.
|
|
||||||
It receives the **raw** tool input (it wraps the schema-parsing handler from
|
|
||||||
`registerParsedTool`), so the params it logs include `sql` for `sql_execution`.
|
|
||||||
The existing telemetry emission stays unchanged; logging is **additive** beside
|
|
||||||
it. Because both transports build their server through `mcp-server-factory.ts` →
|
|
||||||
`registerKtxContextTools`, this single change gives **both HTTP and stdio**
|
|
||||||
tool-call logging for free.
|
|
||||||
|
|
||||||
### `sessionId` / `callId` provenance
|
|
||||||
|
|
||||||
- **`sessionId`** comes from the SDK's per-call handler context
|
|
||||||
(`RequestHandlerExtra.sessionId`; confirmed present in `@modelcontextprotocol/sdk`
|
|
||||||
`1.29.0`). It is populated for the HTTP StreamableHTTP transport and absent for
|
|
||||||
stdio (single session) — log it when present, omit otherwise. Add
|
|
||||||
`sessionId?: string` to `KtxMcpToolHandlerContext` (`context/mcp/types.ts`).
|
|
||||||
- **`callId`** is generated per invocation with `randomUUID()` (already imported
|
|
||||||
in `context-tools.ts`). It correlates a `tool.start` with its `tool.end`.
|
|
||||||
|
|
||||||
### No redaction in v1 (explicit)
|
|
||||||
|
|
||||||
v1 ships **no log redaction**. Rationale recorded here so it is a deliberate
|
|
||||||
choice, not an oversight: these logs are **local** (stderr → `.ktx/logs/mcp.log`),
|
|
||||||
**never transmitted off-box**, and sit at the **same trust boundary** as the
|
|
||||||
`ktx.yaml` / environment that already hold the connection credentials. Concretely:
|
|
||||||
|
|
||||||
- Request **headers are never logged** at all, so the bearer token
|
|
||||||
(`KTX_MCP_TOKEN`) simply isn't collected — this is "not logged," not "redacted."
|
|
||||||
- Errors are logged with their **full message and stack** via pino's standard
|
|
||||||
`err` serializer.
|
|
||||||
- SQL text and tool params are logged **verbatim** (they are not secrets).
|
|
||||||
|
|
||||||
Credential redaction (e.g. a DB URL embedded in a driver error string) is an
|
|
||||||
explicit **v1 non-goal**; revisit only if these logs are ever shipped off-box.
|
|
||||||
This drops the draft's "light redaction" requirement and the
|
|
||||||
`collectTelemetryRedactionSecrets` / scrubber reuse it implied.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
### 1. One shared pino logger
|
|
||||||
|
|
||||||
- A single `pino` instance per server process, constructed once and threaded to
|
|
||||||
both the transport layer (for lifecycle events) and the tool layer (for
|
|
||||||
tool-call events). Level set from env (Requirement 7), default `info`.
|
|
||||||
- Synchronous destination bound to the server's stderr sink (see Design
|
|
||||||
decisions). Pretty (`pino-pretty`, sync stream) when `process.stderr.isTTY`,
|
|
||||||
otherwise plain JSON. Each line carries pino's standard `time` and `level`.
|
|
||||||
- No new dependency beyond `pino` and `pino-pretty`. No OpenTelemetry / metrics
|
|
||||||
stack, no async/worker transport, no in-app file rotation.
|
|
||||||
|
|
||||||
### 2. Per-session / per-call context via child loggers
|
|
||||||
|
|
||||||
Use pino child loggers so every line carries the relevant correlation fields:
|
|
||||||
a per-call child binds `{ tool, callId }` plus `sessionId` when present, so one
|
|
||||||
session's or one call's activity can be grepped from the log.
|
|
||||||
|
|
||||||
### 3. Tool-call logging — START before execute, END after
|
|
||||||
|
|
||||||
In `instrumentMcpServer`, for **every** MCP tool invocation:
|
|
||||||
|
|
||||||
- **On entry, before invoking the handler**, write `tool.start` with
|
|
||||||
`{ tool, callId, sessionId?, params }` at **`info`**. `params` is the raw tool
|
|
||||||
input; for `sql_execution` this includes the full **SQL text** (the single most
|
|
||||||
useful field). The write is synchronous so the line exists even if the handler
|
|
||||||
never returns.
|
|
||||||
- **On normal completion**, write `tool.end` with
|
|
||||||
`{ tool, callId, sessionId?, durationMs, outcome: "ok", resultSize }` at
|
|
||||||
**`info`** — *unless* it is a slow call (Requirement 4). `resultSize` is a
|
|
||||||
tool-agnostic size measure (byte length of the serialized result text content).
|
|
||||||
- **On error**, write `tool.end` with
|
|
||||||
`{ tool, callId, sessionId?, durationMs, outcome: "error", err }` at **`error`**,
|
|
||||||
where `err` is the serialized error (message + stack) per Requirement 6.
|
|
||||||
|
|
||||||
`tool.start` and `tool.end` share the **same correlation fields and the same
|
|
||||||
`info` level** (for the non-slow, non-error case) so that an **unmatched
|
|
||||||
`tool.start`** — a start with no `tool.end` for the same `callId` — is an
|
|
||||||
unambiguous "this call hung" signal. This is the property that makes a runaway
|
|
||||||
`sql_execution` identifiable from the log alone, with its exact SQL and
|
|
||||||
timestamp, no process sampling.
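
A minimal sketch of how the unmatched-start property could be checked against the log. It assumes one JSON object per line and that the event name (`tool.start` / `tool.end`) lands in pino's `msg` field; the spec names the events but not the field that carries them, and `findUnmatchedStarts` is a hypothetical helper, not part of ktx.

```typescript
// Hypothetical helper: list calls that logged tool.start but never tool.end.
// Assumption: the event name is stored in `msg`; adjust if it lives elsewhere.
interface ToolLogLine {
  msg?: string;
  callId?: string;
  tool?: string;
  time?: number;
  params?: unknown;
}

export function findUnmatchedStarts(logText: string): ToolLogLine[] {
  const open = new Map<string, ToolLogLine>();
  for (const raw of logText.split("\n")) {
    const line = raw.trim();
    if (!line.startsWith("{")) continue; // skip blank or non-JSON lines
    let entry: ToolLogLine;
    try {
      entry = JSON.parse(line) as ToolLogLine;
    } catch {
      continue; // tolerate a truncated final line
    }
    if (typeof entry.callId !== "string") continue;
    if (entry.msg === "tool.start") open.set(entry.callId, entry);
    else if (entry.msg === "tool.end") open.delete(entry.callId);
  }
  // Whatever is still open started but never ended: the hung-call candidates.
  return Array.from(open.values());
}
```

Each returned entry still carries its `params` and `time`, so for `sql_execution` the offending SQL can be read straight off the result.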
|
|
||||||
|
|
||||||
> **Deliberate change from the intake draft.** The draft put `tool.start` /
|
|
||||||
> `tool.end` at `debug` (suppressed at the default `info`). That defeats the
|
|
||||||
> motivating incident: a hang is unpredictable, so debug would have to be enabled
|
|
||||||
> *before* it occurs, which never happens. v1 logs start/end at **`info`** — an
|
|
||||||
> always-on access log — so the offending query is recoverable at the default
|
|
||||||
> level. `debug` is reserved for heavier detail (Requirement 7).
|
|
||||||
|
|
||||||
### 4. Slow-call warning
|
|
||||||
|
|
||||||
When a call **completes** with `durationMs` greater than the configured slow
|
|
||||||
threshold (Requirement 7), emit its `tool.end` at **`warn`** (carrying the same
|
|
||||||
fields plus the duration) instead of `info`. This makes a completed-but-slow call
|
|
||||||
stand out and keeps it visible even when the level is raised to `warn`.
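
Requirements 2 to 4 can be sketched together as one wrapper. This is an illustration under stated assumptions, not the actual `instrumentMcpServer`: `ToolLogger` is a structural stand-in for the subset of a pino logger used here, `withToolLogging` and its parameter order are hypothetical, and `resultSize` is approximated with `JSON.stringify` because the spec does not show how the result text is serialized. Whether the `tool.start` write is truly synchronous depends on the logger's destination, which this sketch does not configure.

```typescript
import { randomUUID } from "node:crypto";

// Structural stand-in for the few pino logger methods this sketch needs.
type Fields = Record<string, unknown>;
interface ToolLogger {
  info(fields: Fields, msg: string): void;
  warn(fields: Fields, msg: string): void;
  error(fields: Fields, msg: string): void;
  child(bindings: Fields): ToolLogger;
}

// Hypothetical wrapper; names and signature are assumptions for illustration.
export async function withToolLogging<T>(
  logger: ToolLogger,
  tool: string,
  params: unknown,
  slowMs: number,
  handler: () => Promise<T>,
  sessionId?: string,
): Promise<T> {
  const callId = randomUUID();
  // The child logger binds the correlation fields once for both lines.
  const log = logger.child(sessionId ? { tool, callId, sessionId } : { tool, callId });
  const startedAt = Date.now();
  // Logged before the handler runs, so a call that never returns still leaves a line.
  log.info({ params }, "tool.start");
  try {
    const result = await handler();
    const durationMs = Date.now() - startedAt;
    // Assumption: byte length of the JSON-serialized result stands in for resultSize.
    const resultSize = Buffer.byteLength(JSON.stringify(result) ?? "", "utf8");
    const fields = { durationMs, outcome: "ok", resultSize };
    if (durationMs > slowMs) log.warn(fields, "tool.end"); // slow call: same line, higher level
    else log.info(fields, "tool.end");
    return result;
  } catch (err) {
    log.error({ durationMs: Date.now() - startedAt, outcome: "error", err }, "tool.end");
    throw err;
  }
}
```

Keeping the slow case as a `tool.end` at `warn`, rather than a separate event, means every completed call produces exactly one end line regardless of level.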
|
|
||||||
|
|
||||||
### 5. Connection / session lifecycle and transport errors
|
|
||||||
|
|
||||||
- **HTTP** (`mcp-http-server.ts`, in `newTransport`): log `session.open` from
|
|
||||||
`onsessioninitialized` and `session.close` from `onsessionclosed` /
|
|
||||||
`transport.onclose`, each with `sessionId`, at `info`. **Wire the currently
|
|
||||||
unused `transport.onerror`** to log `transport.error` (the SDK's
|
|
||||||
closed-channel / "Transport channel closed" events) at `error`, so a
|
|
||||||
client-visible connection failure has a server-side counterpart.
|
|
||||||
- **stdio** (`mcp-stdio-server.ts`): route the existing raw
|
|
||||||
`transport.onerror` stderr string (it currently writes a plain string) through
|
|
||||||
the logger as a `transport.error` line at `error`. A single `session.open` /
|
|
||||||
`session.close` pair for the one stdio connection MAY be logged at `info`.
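
A rough sketch of the lifecycle wiring, written against structural stand-ins rather than the MCP SDK types. `TransportLike`, `LifecycleLogger`, and `wireTransportLogging` are hypothetical names; the only assumption about the real transport is that it exposes assignable `onclose` / `onerror` callbacks and an optional `sessionId`, as the requirement implies.

```typescript
// Structural stand-ins; only the members used below are declared.
interface TransportLike {
  sessionId?: string;
  onclose?: () => void;
  onerror?: (error: Error) => void;
}
interface LifecycleLogger {
  info(fields: Record<string, unknown>, msg: string): void;
  error(fields: Record<string, unknown>, msg: string): void;
}

// Hypothetical helper: log close and error events, preserving existing handlers.
export function wireTransportLogging(transport: TransportLike, logger: LifecycleLogger): void {
  const previousClose = transport.onclose;
  const previousError = transport.onerror;
  transport.onclose = () => {
    logger.info({ sessionId: transport.sessionId }, "session.close");
    previousClose?.();
  };
  transport.onerror = (error) => {
    // Pass the Error object itself so the logger's err serializer keeps the stack.
    logger.error({ sessionId: transport.sessionId, err: error }, "transport.error");
    previousError?.(error);
  };
}
```

Chaining the previous handlers keeps any existing cleanup (for example session-map removal) intact while adding the log lines.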
|
|
||||||
|
|
||||||
### 6. Structured error logging
|
|
||||||
|
|
||||||
Errors are logged as structured objects via pino's standard `err` serializer
|
|
||||||
(`pino.stdSerializers.err` or equivalent), carrying error class, message, and
|
|
||||||
stack — never a bare interpolated string. The existing telemetry exception
|
|
||||||
reporting in `instrumentMcpServer` / `registerParsedTool` is unchanged.
|
|
||||||
|
|
||||||
### 7. Configuration surface
|
|
||||||
|
|
||||||
- **`KTX_MCP_LOG_LEVEL`** — pino level (`error` | `warn` | `info` | `debug` |
|
|
||||||
…), default **`info`**. MCP-scoped name because the MCP server is the only
|
|
||||||
emitter today; naming it global (`KTX_LOG_LEVEL`) would imply a logging system
|
|
||||||
that does not exist.
|
|
||||||
- **`KTX_MCP_SLOW_TOOL_MS`** — slow-call threshold in milliseconds (Requirement
|
|
||||||
4), default **`10000`**. Justified as a real ops knob: "slow" differs sharply
|
|
||||||
between a local SQLite file and a remote warehouse.
|
|
||||||
- Level ladder that results from Requirements 3–5:
|
|
||||||
- `debug`: everything below **plus** heavier detail (e.g. result bodies,
|
|
||||||
progress notifications) — implementer's discretion on what extra to attach.
|
|
||||||
- `info` (default): `tool.start` / `tool.end`, session lifecycle, slow `warn`s,
|
|
||||||
errors.
|
|
||||||
- `warn`: slow-call `tool.end`s, `transport.error`, errored `tool.end`s — but
|
|
||||||
not routine tool traffic.
|
|
||||||
- `error`: errored `tool.end`s and `transport.error` only.
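
The two knobs could be read roughly as follows. This is a sketch under assumptions: the function names echo `mcpLogLevel` / `mcpSlowToolMs` from the implementation notes, but the signatures are invented here, and falling back to the default on an invalid value (and lowercasing the level) is an assumed policy that the spec does not state.

```typescript
// pino's built-in level names, plus "silent" which disables output.
const PINO_LEVELS = ["fatal", "error", "warn", "info", "debug", "trace", "silent"] as const;
type McpLogLevel = (typeof PINO_LEVELS)[number];

type Env = Record<string, string | undefined>;

// Assumed policy: unknown or missing values fall back to "info".
export function mcpLogLevel(env: Env): McpLogLevel {
  const raw = (env["KTX_MCP_LOG_LEVEL"] ?? "").trim().toLowerCase();
  const match = PINO_LEVELS.find((level) => level === raw);
  return match ?? "info";
}

// Assumed policy: anything that is not a positive finite number falls back to 10000 ms.
export function mcpSlowToolMs(env: Env): number {
  const parsed = Number(env["KTX_MCP_SLOW_TOOL_MS"]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : 10_000;
}
```

Taking the environment as a parameter instead of reading `process.env` directly keeps both functions trivially testable.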
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- At default level (`info`), invoking any MCP tool produces a `tool.start`
|
|
||||||
(`tool`, `callId`, `sessionId` when HTTP, `params`) and a matching `tool.end`
|
|
||||||
(`durationMs`, `outcome`, `resultSize`) line, as **JSON to stderr** when stderr
|
|
||||||
is not a TTY.
|
|
||||||
- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a
|
|
||||||
`tool.start` line carrying its **exact SQL and timestamp** and **no** matching
|
|
||||||
`tool.end` for that `callId` — so the offending query is recoverable from the
|
|
||||||
log alone, with no process sampling.
|
|
||||||
- A completed call slower than `KTX_MCP_SLOW_TOOL_MS` emits its `tool.end` at
|
|
||||||
`warn` with its `durationMs`.
|
|
||||||
- Session open/close and transport-closed (`transport.error`) events are logged
|
|
||||||
with the `sessionId` (HTTP); the stdio transport error path goes through the
|
|
||||||
logger, not a raw `stderr.write`.
|
|
||||||
- At level `warn`, routine `tool.start` / `tool.end` are suppressed but
|
|
||||||
slow-call warnings, transport errors, and errored calls are present.
|
|
||||||
- When stderr is a TTY (`ktx mcp start --foreground` / `ktx mcp stdio` in a
|
|
||||||
terminal), output is human-readable colorized `pino-pretty`; the daemon log
|
|
||||||
file (`.ktx/logs/mcp.log`) is plain JSON. Both paths are synchronous.
|
|
||||||
- The bearer token never appears in any log line (headers are not logged); SQL
|
|
||||||
and tool params do appear.
|
|
||||||
- No worker-thread / async log transport is introduced; no OpenTelemetry /
|
|
||||||
metrics stack; the only new dependencies are `pino` and `pino-pretty`.
|
|
||||||
- The existing `mcp_request_completed` telemetry and exception reporting still
|
|
||||||
work unchanged.
|
|
||||||
|
|
||||||
## Non-goals
|
|
||||||
|
|
||||||
- **Preventing / interrupting runaway queries** (off-event-loop execution, query
|
|
||||||
timeouts, worker-thread isolation). A single synchronous query that fans out
|
|
||||||
into a massive nested-loop join can peg the single-threaded server for hours
|
|
||||||
and break new connections — observability surfaces *which* query, but the fix
|
|
||||||
is execution-model work in a separate spec. (This logging is also the
|
|
||||||
prerequisite for a future watchdog that detects a `tool.start` with no
|
|
||||||
`tool.end` past a threshold and recycles the server.)
|
|
||||||
- **Log redaction** (see Design decisions) — explicit v1 non-goal.
|
|
||||||
- **Pretty output as a worker-thread transport** — the TTY path uses pino-pretty
|
|
||||||
as a synchronous in-process stream only.
|
|
||||||
- Metrics / tracing / OpenTelemetry exporters.
|
|
||||||
- Forwarding logs to the MCP *client* via the protocol logging capability
|
|
||||||
(`notifications/message`, `logging/setLevel`) — a possible later enhancement,
|
|
||||||
distinct from operational stderr logging.
|
|
||||||
- A global `KTX_LOG_LEVEL` spanning non-MCP commands — out of scope until other
|
|
||||||
surfaces emit structured logs.
|
|
||||||
|
|
||||||
## Implementation orientation
|
|
||||||
|
|
||||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
|
||||||
the design.
|
|
||||||
|
|
||||||
- **New module** — a small logger factory, e.g.
|
|
||||||
`packages/cli/src/context/mcp/logger.ts`: builds the shared pino instance from
|
|
||||||
the stderr sink + `KTX_MCP_LOG_LEVEL`, choosing the pino-pretty (sync) stream
|
|
||||||
when `process.stderr.isTTY` else `pino.destination({ sync: true })`, and
|
|
||||||
exposes the slow-call threshold read from `KTX_MCP_SLOW_TOOL_MS`.
|
|
||||||
- **Tool-call logging** — `packages/cli/src/context/mcp/context-tools.ts`:
|
|
||||||
extend `instrumentMcpServer` (~line 585) to write `tool.start` before
|
|
||||||
`handler(...)` and `tool.end` after (ok / slow-`warn` / `error`); generate
|
|
||||||
`callId` via the already-imported `randomUUID`; read `sessionId` from the
|
|
||||||
handler `context`. Thread the logger via `RegisterKtxContextToolsDeps`
|
|
||||||
(~line 26) and `registerKtxContextTools` (~line 650). Leave `registerParsedTool`
|
|
||||||
and the existing telemetry emission intact.
|
|
||||||
- **Context type** — `packages/cli/src/context/mcp/types.ts`: add
|
|
||||||
`sessionId?: string` to `KtxMcpToolHandlerContext`; add the logger to
|
|
||||||
`KtxMcpServerDeps` / the register deps.
|
|
||||||
- **Server wiring** — `packages/cli/src/context/mcp/server.ts`
|
|
||||||
(`createDefaultKtxMcpServer` / `createKtxMcpServer`) and
|
|
||||||
`packages/cli/src/mcp-server-factory.ts` (`createKtxMcpServerFactory`): accept
|
|
||||||
and pass the logger down to `registerKtxContextTools`.
|
|
||||||
- **HTTP lifecycle** — `packages/cli/src/mcp-http-server.ts`: construct (or
|
|
||||||
receive) the logger; in `newTransport` (~line 186) log `session.open` /
|
|
||||||
`session.close` and add `transport.onerror` → `transport.error`.
|
|
||||||
- **stdio lifecycle** — `packages/cli/src/mcp-stdio-server.ts`: construct (or
|
|
||||||
receive) the logger; route the existing `transport.onerror` (~line 54) through
|
|
||||||
it.
|
|
||||||
- **Log destination is already captured** — `packages/cli/src/managed-mcp-daemon.ts`
|
|
||||||
redirects child stdout+stderr to `.ktx/logs/mcp.log`; `ktx mcp logs`
|
|
||||||
(`commands/mcp-commands.ts`) tails it. No change needed there.
|
|
||||||
- **Dependencies** — add `pino` and `pino-pretty` to
|
|
||||||
`packages/cli/package.json`. Verify Knip/Biome dead-code and bundle checks
|
|
||||||
still pass.
|
|
||||||
- **Tests** — extend `packages/cli/test/mcp-http-server.test.ts`,
|
|
||||||
`mcp-server-factory.test.ts`, `context/mcp/server.test.ts`, and
|
|
||||||
`commands/mcp-commands.test.ts`: assert (a) a `tool.start` JSON line is written
|
|
||||||
before a (mock) handler runs and carries `params`/`sql`; (b) a matching
|
|
||||||
`tool.end` with `durationMs`/`outcome`; (c) a hung-handler scenario yields a
|
|
||||||
`tool.start` with no `tool.end` for that `callId`; (d) a slow completion emits
|
|
||||||
`warn`; (e) session lifecycle + `transport.error` lines; (f) the bearer token
|
|
||||||
never appears. Inject a capturing `io.stderr` and parse the JSON lines.
|
|
||||||
*Note:* `mcp-server-factory.test.ts` carries a pre-existing
|
|
||||||
`KtxMcpContextPorts`/`contextTools` type error (from commit `2677b3ef`,
|
|
||||||
unrelated to this work) — do not let it mask new failures.
|
|
||||||
- After implementing, rebuild and re-link so the playground picks it up:
|
|
||||||
`pnpm run build && pnpm run link:dev`.
|
|
||||||
|
|
||||||
## Benchmark context (motivation, not a requirement)
|
|
||||||
|
|
||||||
While Spider 2.0-Lite was running concurrently against the MCP server, an
|
|
||||||
adversarial-reviewer-generated query degenerated into a massive nested-loop join;
|
|
||||||
synchronous `better-sqlite3` executed it on the event loop, pegging a server at
|
|
||||||
~100% CPU for hours and breaking new MCP connections ("Transport channel
|
|
||||||
closed"). We could not determine *which* query, because the server logs nothing
|
|
||||||
about tool calls — diagnosis required `sample` / `lsof` on the live process and
|
|
||||||
the exact SQL was never recovered. Structured tool-call logging — especially
|
|
||||||
`tool.start` written synchronously *before* execution, at the default level —
|
|
||||||
would have turned this into a one-line `grep` of the server log. Improving the
|
|
||||||
benchmark is a side effect; the logging is generic production-server hygiene.
|
|
||||||
|
|
||||||
## Implementation notes
|
|
||||||
|
|
||||||
Implemented on branch `write-feature-spec-wiki`. All requirements and acceptance
|
|
||||||
criteria are satisfied.
|
|
||||||
|
|
||||||
**What was built / where**
|
|
||||||
|
|
||||||
- **New module `packages/cli/src/context/mcp/logger.ts`** — `createMcpLogger(io,
|
|
||||||
{ isTTY? })` builds one synchronous `pino` (v10) instance written through the
|
|
||||||
`io.stderr` sink: plain JSON when stderr is not a TTY, a `pino-pretty` (v13)
|
|
||||||
synchronous in-process stream (`{ colorize: true, sync: true }`, wrapping the
|
|
||||||
sink in a `node:stream.Writable`) when it is. Also exports `mcpLogLevel`
|
|
||||||
(`KTX_MCP_LOG_LEVEL`, validated against pino levels, default `info`),
|
|
||||||
`mcpSlowToolMs` (`KTX_MCP_SLOW_TOOL_MS`, default `10000`), and
|
|
||||||
`serializeMcpError`. No worker/async transport; no global `KTX_LOG_LEVEL`.
|
|
||||||
- **Tool-call logging — `instrumentMcpServer` (`context/mcp/context-tools.ts`)** —
|
|
||||||
per invocation: `callId = randomUUID()`, a child logger bound to
|
|
||||||
`{ tool, callId, sessionId? }`, `tool.start { params }` written at `info`
|
|
||||||
**before** awaiting the handler (synchronous, so a runaway query still leaves it
|
|
||||||
on disk), and `tool.end` after: `info { durationMs, outcome:"ok", resultSize }`,
|
|
||||||
`warn` when `durationMs > KTX_MCP_SLOW_TOOL_MS`, or `error { outcome:"error",
|
|
||||||
err }`. `resultSize` is the UTF-8 byte length of the serialized text content.
|
|
||||||
The existing `mcp_request_completed` telemetry + `reportException` are unchanged
|
|
||||||
(`durationMs` is now computed once and shared); `registerParsedTool` is intact.
|
|
||||||
- **`sessionId` / logger plumbing** — `sessionId?: string` added to
|
|
||||||
`KtxMcpToolHandlerContext`; a single per-process logger threads from each
|
|
||||||
transport entrypoint through `createKtxMcpServerFactory` →
|
|
||||||
`createDefaultKtxMcpServer` → `createKtxMcpServer` → `registerKtxContextTools`
|
|
||||||
(`KtxMcpServerDeps.logger`, `RegisterKtxContextToolsDeps.logger`).
|
|
||||||
- **HTTP lifecycle (`mcp-http-server.ts`)** — `session.open` from
|
|
||||||
`onsessioninitialized`, `session.close` from `transport.onclose`, and the
|
|
||||||
previously-unused `transport.onerror` wired to `transport.error` at `error`.
|
|
||||||
- **stdio lifecycle (`mcp-stdio-server.ts`)** — the raw `transport.onerror`
|
|
||||||
string write is replaced by a `transport.error` log line; `session.open` /
|
|
||||||
`session.close` are logged for the single stdio session.
|
|
||||||
- **Deps** — `pino ^10.3.1`, `pino-pretty ^13.1.3` added to
|
|
||||||
`packages/cli/package.json`.
|
|
||||||
- **Tests** — `test/context/mcp/logger.test.ts` (factory, level/threshold env
|
|
||||||
parsing, error serializer, TTY vs JSON), a "MCP tool-call logging" block in
|
|
||||||
`test/context/mcp/server.test.ts` (start-before-handler, matching end with
|
|
||||||
`resultSize`, hung-handler leaves an unmatched start, slow→`warn`, `warn`-level
|
|
||||||
suppression with errored end still present, no-logger no-op), session lifecycle
|
|
||||||
and bearer-token-never-logged in `test/mcp-http-server.test.ts`, and
|
|
||||||
`test/mcp-stdio-server.test.ts` for `transport.error`.
|
|
||||||
|
|
||||||
**Deviations / decisions**
|
|
||||||
|
|
||||||
- **In-band errors carry no stack (inherent).** `registerParsedTool` converts a
|
|
||||||
thrown handler error into an `{ isError: true }` result (and reports the full
|
|
||||||
error via telemetry) before it reaches `instrumentMcpServer`, so the original
|
|
||||||
stack is already gone. `tool.end` for such a result logs `outcome:"error"` with
|
|
||||||
`err.message` only; a genuine throw that escapes gets the full pino `err`
|
|
||||||
serialization (type + message + stack). The field is always `err` for
|
|
||||||
consistency. This honours "leave `registerParsedTool` intact."
|
|
||||||
- **`session.close` is logged from `transport.onclose`** (the universal close
|
|
||||||
signal for both clean DELETE and dropped connections) rather than
|
|
||||||
`onsessionclosed`, to avoid duplicate lines; `onsessionclosed` keeps its
|
|
||||||
session-map cleanup role.
|
|
||||||
- **The logger is optional throughout.** Production always wires one per process;
|
|
||||||
when absent (programmatic/test callers that inject `createMcpServer`), tool-call
|
|
||||||
logging is simply off — which keeps existing tests unchanged.
|
|
||||||
- `createMcpLogger` accepts an optional `isTTY` purely as a test seam; production
|
|
||||||
derives format from `process.stderr.isTTY`.
|
|
||||||
|
|
||||||
**Verification**
|
|
||||||
|
|
||||||
`pnpm --filter @kaelio/ktx exec vitest run` for the four touched/added MCP test
|
|
||||||
files: 57 passed. Full default `pnpm run test`: 3018 passed, 1 skipped — the only
|
|
||||||
2 failures are in `test/skills/analytics-skill-content.test.ts`, pre-existing and
|
|
||||||
unrelated to this change (in-progress analytics-skill work on this branch).
|
|
||||||
`pnpm run dead-code` (Biome + Knip default + Knip production) clean. `pnpm run
|
|
||||||
build` and `pnpm run link:dev` succeed. `pnpm run type-check` reports only the
|
|
||||||
one pre-existing, test-only error in `test/mcp-server-factory.test.ts` from commit
|
|
||||||
`2677b3ef` (documented above); all source and the new tests type-check clean.
|
|
||||||
|
|
@ -1,493 +0,0 @@
|
||||||
# Bounded query execution (deadline + non-blocking) for read SQL
|
|
||||||
|
|
||||||
> Refined spec. Intake draft: `todo/16-bounded-query-execution-timeout.md`.
|
|
||||||
>
|
|
||||||
> **Scope: bound and cancel a read query that runs too long.** This is the
|
|
||||||
> execution-model companion to spec 15 (MCP structured logging). Spec 15
|
|
||||||
> *surfaces* a runaway query in the log; it explicitly defers *preventing* one —
|
|
||||||
> "off-event-loop execution, query timeouts, worker-thread isolation … is
|
|
||||||
> execution-model work in a separate spec." This is that spec.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
Two compounding gaps on the read-query path (`executeReadOnly`), confirmed in the
|
|
||||||
current code:
|
|
||||||
|
|
||||||
1. **No execution deadline, handled divergently per connector.** A single
|
|
||||||
expensive query runs unbounded, and whether it is bounded at all depends
|
|
||||||
entirely on which driver the caller hit:
|
|
||||||
- **BigQuery** is the only connector with a real statement timeout — it sets
|
|
||||||
`jobTimeoutMs` on the query job from a per-connection config field
|
|
||||||
`job_timeout_ms` (`connectors/bigquery/connector.ts`, `query(...)` ~491–512).
|
|
||||||
- **ClickHouse** sets a hardcoded 30s *HTTP* `request_timeout` at client
|
|
||||||
creation (`connectors/clickhouse/connector.ts:602`) — a client-side give-up,
|
|
||||||
not a server-side `max_execution_time`; the server keeps working.
|
|
||||||
- **Snowflake, Postgres, MySQL, SQL Server** bound only pool/connection
|
|
||||||
*acquisition* (Snowflake `acquireTimeoutMillis: 60_000`; Postgres
|
|
||||||
`connectionTimeoutMillis: 10_000`; SQL Server `idleTimeoutMillis: 30000`;
|
|
||||||
MySQL pool size only) — nothing bounds statement *execution*.
|
|
||||||
- **SQLite** has nothing.
|
|
||||||
|
|
||||||
2. **In-process SQLite blocks the event loop and cannot be cancelled.** The
|
|
||||||
SQLite connector executes on the main thread via synchronous
|
|
||||||
`better-sqlite3 .prepare().all()` (`connectors/sqlite/connector.ts`,
|
|
||||||
`query(...)` 311–318, used by `executeReadOnly` 247–251). A slow query freezes
|
|
||||||
the whole MCP server — it cannot serve other requests, send progress, or write
|
|
||||||
`tool.end` — and there is no in-thread way to interrupt it: better-sqlite3 (v12)
|
|
||||||
exposes no interrupt/cancel API. Its documented mechanism for slow queries is a
|
|
||||||
**worker thread**, and the only way to stop a runaway synchronous query is to
|
|
||||||
**terminate the thread** executing it (context7 `/wiselibs/better-sqlite3`,
|
|
||||||
`docs/threads.md`).
|
|
||||||
|
|
||||||
The observed failure (Spider 2.0-Lite SQLite run, 2026-06-18): a single
|
|
||||||
`sql_execution` MCP call —
|
|
||||||
`SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on `complex_oracle`,
|
|
||||||
where `profits` is a VIEW (`costs ⋈ sales`, 918,843 × 82,112 rows, joined on a
|
|
||||||
4-column key with no composite index) — degraded to an O(N×M) nested-loop scan,
|
|
||||||
pegged a worker at 100% CPU for 13+ minutes, never returned, produced a
|
|
||||||
`tool.start` with no matching `tool.end`, and stalled an eval shard until the
|
|
||||||
worker was killed by hand. A row cap (`maxRows`) does not help: it bounds returned
|
|
||||||
rows, not scan work, and the failing query returned a single aggregate row.
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
Any data agent that lets an LLM author SQL will eventually issue an
|
|
||||||
accidentally-expensive query — an unindexed or cartesian join, an expensive VIEW,
|
|
||||||
a wide aggregate over a large fact table. A general-purpose context layer must
|
|
||||||
bound that and return a clean, fast "query exceeded Ns" error so the agent can
|
|
||||||
revise (add filters, query base tables, narrow the range) instead of hanging the
|
|
||||||
tool and the server. This matters for embedded/local warehouses (SQLite, and any
|
|
||||||
future DuckDB-style in-process driver) and remote ones alike, and is wholly
|
|
||||||
independent of any benchmark.
|
|
||||||
|
|
||||||
## Design decisions (resolved during refinement)
|
|
||||||
|
|
||||||
These resolve ambiguities the intake draft left open. They constrain the
|
|
||||||
implementer; the exact code is theirs.
|
|
||||||
|
|
||||||
### One canonical deadline, applied uniformly at the contract
|
|
||||||
|
|
||||||
The deadline is enforced for **every** `executeReadOnly` caller, not only the MCP
|
|
||||||
`sql_execution` path. `executeReadOnly` has 13 call sites beyond MCP (ingest query
|
|
||||||
executor, relationship profiling and composite-candidate probes, relationship
|
|
||||||
validation, historic-SQL probes, `ktx sql`); the contract is the single place to
|
|
||||||
bound all of them. A heavy ingest profiling probe over a giant unindexed join is
|
|
||||||
exactly as worth abandoning as an interactive one — those call sites are
|
|
||||||
best-effort and degrade gracefully, so a deadline `KtxQueryError` becomes "skip
|
|
||||||
this probe / mark unprofiled," not "fail the source." (Requirement 8 covers the
|
|
||||||
call sites that must treat the timeout as recoverable.)
|
|
||||||
|
|
||||||
> Rejected alternative: a caller-resolved deadline (short on the interactive path,
|
|
||||||
> longer/none for ingest). That introduces a second value source and the open
|
|
||||||
> question "what is the ingest budget," for no real gain — the 30s default already
|
|
||||||
> clears any normal profiling probe, and a probe that exceeds it is one to drop.
|
|
||||||
|
|
||||||
### Default 30s, configurable per-connection via one shared field
|
|
||||||
|
|
||||||
- **Default `30_000` ms.** Fast enough that an LLM agent gets a clean
|
|
||||||
"exceeded 30s" and revises within the same turn; generous headroom over any
|
|
||||||
indexed aggregate or normal profiling probe; a genuine pathological nested-loop
|
|
||||||
scan blows past it immediately.
|
|
||||||
- **One shared per-connection override**, honored by every connector:
|
|
||||||
`query_timeout_ms` in `ktx.yaml` (`queryTimeoutMs` in TS), a positive integer
|
|
||||||
in **milliseconds**. Milliseconds matches the BigQuery SDK and the field it
|
|
||||||
replaces; the user-facing error still reads in seconds.
|
|
||||||
- **BigQuery's `job_timeout_ms` config key is removed**, not kept alongside the
|
|
||||||
new field. BigQuery reads the shared `query_timeout_ms` and maps the resolved
|
|
||||||
value onto its SDK's `jobTimeoutMs`. ktx keeps no backward compatibility, so
|
|
||||||
there is exactly one way to set a query timeout — no parallel knob (intake
|
|
||||||
requirement 1).
|
|
||||||
- **Granularity is per-connection only.** No global all-connections override —
|
|
||||||
different warehouses have different performance envelopes, and a second
|
|
||||||
(global) knob would double the configuration surface for no stated need.
|
|
||||||
|
|
||||||
### The shared contract is a value + an error, not a base class
|
|
||||||
|
|
||||||
There is **no shared connector base class or factory** — each connector is
|
|
||||||
constructed independently; the only shared registry is the *dialect* factory
|
|
||||||
(`context/connections/dialects.ts:47–55`). So "defined once" (intake requirement
|
|
||||||
3) means a single shared module that owns:
|
|
||||||
|
|
||||||
- `DEFAULT_QUERY_TIMEOUT_MS = 30_000`;
|
|
||||||
- `resolveQueryDeadlineMs(connectionConfig)` → the validated `query_timeout_ms`
|
|
||||||
override, else the default — so the default and the override precedence live in
|
|
||||||
exactly one place;
|
|
||||||
- `queryDeadlineExceededError(deadlineMs)` → a `KtxQueryError` with the canonical
|
|
||||||
message `query exceeded ${Math.round(deadlineMs / 1000)}s`.
|
|
||||||
|
|
||||||
Each connector calls the resolver once (at construction; connectors already
|
|
||||||
receive their connection config) and stores `this.deadlineMs`. **Enforcement is
|
|
||||||
necessarily per-connector** — different engines cancel differently — but the
|
|
||||||
*value* and the *error message* are shared, so the agent sees one consistent,
|
|
||||||
actionable error regardless of driver.
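
A minimal sketch of what the shared module might contain. The three exported names and the error message come from the list above; everything else is an assumption: `KtxQueryError` is redeclared locally as a stand-in because its real definition is not shown, `ConnectionConfigLike` is a hypothetical shape, and the inline validation stands in for whatever the `ktx.yaml` schema layer actually does.

```typescript
// Stand-in for ktx's KtxQueryError; the real class is not shown in the spec.
export class KtxQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "KtxQueryError";
  }
}

export const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Assumed shape: only the one field this module reads.
interface ConnectionConfigLike {
  query_timeout_ms?: unknown;
}

// Override precedence lives here and nowhere else: explicit value, else default.
export function resolveQueryDeadlineMs(config: ConnectionConfigLike): number {
  const raw = config.query_timeout_ms;
  if (raw === undefined) return DEFAULT_QUERY_TIMEOUT_MS;
  if (typeof raw !== "number" || !Number.isInteger(raw) || raw <= 0) {
    // Assumption: in ktx this check would belong to config schema validation.
    throw new Error(`query_timeout_ms must be a positive integer (milliseconds), got ${String(raw)}`);
  }
  return raw;
}

// The canonical message reads in seconds even though the config is in milliseconds.
export function queryDeadlineExceededError(deadlineMs: number): KtxQueryError {
  return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`);
}
```

A connector would then call `resolveQueryDeadlineMs` once at construction and reuse the stored value for both enforcement and the error.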
|
|
||||||
|
|
||||||
### Real cancellation, not client-side give-up
|
|
||||||
|
|
||||||
Per intake requirement 5, the deadline must *stop the work*, not merely abandon
|
|
||||||
the promise while the query keeps running (which on a pooled driver also risks
|
|
||||||
returning a still-busy connection to the pool). So:
|
|
||||||
|
|
||||||
- **In-process (SQLite, and any future embedded driver):** run the query off the
|
|
||||||
main thread and enforce the deadline by **terminating the worker thread**. There
|
|
||||||
is no generic `Promise.race` outer wrapper — a `Promise.race` against a
|
|
||||||
synchronous in-thread `.all()` can never fire (the loop is blocked), and against
|
|
||||||
a pooled remote query it would poison the pool. Thread termination *is* the
|
|
||||||
cancellation.
|
|
||||||
- **Remote engines:** set the engine's **server-side statement timeout** so the
|
|
||||||
server itself aborts the query and frees the connection cleanly.
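
The claim that a `Promise.race` deadline can never fire against synchronous in-thread work can be demonstrated with nothing but a timer and a busy loop. The sketch below is a demonstration only; `blockFor` and `timerDelayUnderBlocking` are hypothetical names, and the busy loop merely stands in for a synchronous `.all()` call.

```typescript
// Busy-wait on the current thread, standing in for a synchronous query.
export function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // intentionally empty: the event loop cannot run while this spins
  }
}

// Arms a timer, then blocks. Resolves with how long the timer actually took to fire.
export function timerDelayUnderBlocking(timerMs: number, blockMs: number): Promise<number> {
  return new Promise((resolve) => {
    const armedAt = Date.now();
    setTimeout(() => resolve(Date.now() - armedAt), timerMs);
    blockFor(blockMs); // the timer callback is due but cannot run until this returns
  });
}
```

With a short timer and a longer block, the measured delay tracks the block duration rather than the timer value, which is why the deadline has to be enforced from a thread that is not running the query.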
|
|
||||||
|
|
||||||
### Logging routes through spec 15's pino path — no second logger
|
|
||||||
|
|
||||||
The deadline cases are logged through the **existing** MCP tool-call logger
|
|
||||||
(spec 15's `instrumentMcpServer`, `context/mcp/context-tools.ts:644–730`), not a
|
|
||||||
new logging path threaded into the connector. Verified flow for a timeout:
|
|
||||||
`executeReadOnly` throws `queryDeadlineExceededError` (a `KtxQueryError`) →
|
|
||||||
`local-project-ports.ts` preserves it → `registerParsedTool` (:552) reports it
|
|
||||||
(`reportException` skips `$exception` for `KtxExpectedError`) and returns an
|
|
||||||
in-band `isError` result → `instrumentMcpServer` writes `tool.end` at **`error`**
|
|
||||||
with `outcome:"error"`, `err.message = "query exceeded {N}s"`, and the **same
|
|
||||||
`callId`** as the `tool.start`.
|
|
||||||
|
|
||||||
This is the central observability win and it requires **no new MCP logging code**:
|
|
||||||
spec 15 made a hang show up as a `tool.start` with *no* matching `tool.end`; this
|
|
||||||
spec turns it into a **matched `tool.start` → `tool.end(error)` pair** whose
|
|
||||||
`tool.end` names the deadline. The worker-termination (SQLite) and server-side
|
|
||||||
abort (remote) are internal enforcement mechanisms; their single observable signal
|
|
||||||
is that `tool.end`, so the connector does **not** get its own logger threaded
|
|
||||||
through `KtxScanContext` — that would fork a second path for one capability. The
|
|
||||||
"worker was actually reaped, not left spinning" guarantee is asserted by the
|
|
||||||
worker's `exit` event in tests (Requirement 3), not by a log line.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
### 1. Shared deadline contract, defined once
|
|
||||||
|
|
||||||
A single new module (e.g. `packages/cli/src/context/connections/query-deadline.ts`)
|
|
||||||
exports `DEFAULT_QUERY_TIMEOUT_MS` (30_000), `resolveQueryDeadlineMs(connectionConfig)`,
|
|
||||||
and `queryDeadlineExceededError(deadlineMs)`. Every connector resolves its
|
|
||||||
deadline through this resolver; no connector hardcodes its own default or
|
|
||||||
duplicates the override-precedence logic.
|
|
||||||
|
|
||||||
### 2. Shared per-connection config field; BigQuery's removed
|
|
||||||
|
|
||||||
`query_timeout_ms` is added to the **shared** connection config schema (validated
|
|
||||||
as an optional positive integer, milliseconds) so every driver accepts it. The
|
|
||||||
BigQuery-specific `job_timeout_ms` config field and its dedicated reader
|
|
||||||
(`bigQueryJobTimeoutMsFromConnection`) are removed; BigQuery sources its timeout
|
|
||||||
from the shared field and applies it as `jobTimeoutMs`. A bad `query_timeout_ms`
|
|
||||||
(zero, negative, non-integer) is a clear config validation error, consistent with
|
|
||||||
how ktx validates `ktx.yaml`.
|
|
||||||
|
|
||||||
### 3. SQLite executes off the main thread, terminated on deadline
|
|
||||||
|
|
||||||
`executeReadOnly` on the SQLite connector MUST NOT block the MCP server event
|
|
||||||
loop:
|
|
||||||
|
|
||||||
- Read-only validation and the row-limit wrapper (`assertReadOnlySql` +
|
|
||||||
`limitSqlForExecution`) run **on the main thread** before dispatch — invalid SQL
|
|
||||||
fails instantly without spawning a worker, and read-only enforcement stays at
|
|
||||||
the boundary (Requirement 7).
|
|
||||||
- The validated, row-limited SQL (and any params) is dispatched to a **worker
|
|
||||||
thread** that opens the database `{ readonly: true, fileMustExist: true }`, runs
|
|
||||||
the query, and posts back `{ headers, rows, totalRows }` (all values are
|
|
||||||
structured-cloneable — primitives, `Buffer`, `BigInt`).
|
|
||||||
- The main thread arms a timer for `this.deadlineMs`; on expiry it calls
|
|
||||||
`worker.terminate()` and rejects with `queryDeadlineExceededError`. On a normal
|
|
||||||
message it clears the timer and resolves. On a worker error (SQLite rejected the
|
|
||||||
SQL) it rejects with that error, message preserved. A provided
|
|
||||||
`ctx.signal` (`KtxScanContext.signal`, already on the contract) also terminates
|
|
||||||
the worker, for external cancellation.
|
|
||||||
- **One short-lived worker per call**, terminated on completion or deadline — not
|
|
||||||
a persistent worker or pool. Terminate-on-deadline destroys the worker, so a
|
|
||||||
pool would need respawn/job-tracking for no benefit: `executeReadOnly` is
|
|
||||||
low-frequency (LLM-issued, serial per agent turn) and worker spawn cost is
|
|
||||||
negligible against query latency. The other SQLite paths (introspect, sample,
|
|
||||||
stats, distinct-values, row-count) stay on the main thread — they are
|
|
||||||
ktx-authored, bounded, and not on the `executeReadOnly` contract.
|
|
||||||
- The event loop stays responsive throughout, so `tool.end` is always written and
|
|
||||||
concurrent requests on the same port are served.
|
|
||||||
|
|
||||||
### 4. Remote engines set a real server-side statement timeout
|
|
||||||
|
|
||||||
Each remote connector applies `this.deadlineMs` as its engine's server-side
|
|
||||||
statement timeout, so the deadline stops server work rather than abandoning the
|
|
||||||
promise:
|
|
||||||
|
|
||||||
| Connector  | Mechanism | Unit |
|------------|-----------|------|
| BigQuery   | `jobTimeoutMs` on the query job (replaces `job_timeout_ms`) | ms |
| Postgres   | `statement_timeout` | ms |
| MySQL      | session `max_execution_time` (applies to read-only SELECT, the only kind on this path) | ms |
| Snowflake  | `STATEMENT_TIMEOUT_IN_SECONDS` (ALTER SESSION) | s (ceil) |
| ClickHouse | `max_execution_time` setting, with `request_timeout` aligned to the deadline so the HTTP client does not give up before the server aborts | s (ceil) |
| SQL Server | `mssql` `requestTimeout` (TDS attention cancels server-side) | ms |
|
|
||||||
|
|
||||||
ClickHouse's existing hardcoded 30s `request_timeout` is brought under this
|
|
||||||
contract (derived from the resolved deadline), not left as a parallel mechanism.
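
The unit column above can be read as a small conversion rule. The following is a minimal sketch, not the connectors' actual code: `engineTimeout`, `Engine`, and `EngineTimeout` are hypothetical names, and each connector applies its value through its own driver call rather than through a shared function like this one.

```typescript
// Hypothetical helper: maps a resolved deadline (ms) to the value each engine's
// timeout setting expects, following the unit column of the table above.
type Engine = "bigquery" | "postgres" | "mysql" | "snowflake" | "clickhouse" | "sqlserver";

interface EngineTimeout {
  setting: string;
  value: number;
  unit: "ms" | "s";
}

function engineTimeout(engine: Engine, deadlineMs: number): EngineTimeout {
  // Engines that take whole seconds round up, so the server-side limit is
  // never shorter than the configured deadline.
  const ceilSeconds = Math.ceil(deadlineMs / 1000);
  switch (engine) {
    case "bigquery":
      return { setting: "jobTimeoutMs", value: deadlineMs, unit: "ms" };
    case "postgres":
      return { setting: "statement_timeout", value: deadlineMs, unit: "ms" };
    case "mysql":
      return { setting: "max_execution_time", value: deadlineMs, unit: "ms" };
    case "snowflake":
      return { setting: "STATEMENT_TIMEOUT_IN_SECONDS", value: ceilSeconds, unit: "s" };
    case "clickhouse":
      return { setting: "max_execution_time", value: ceilSeconds, unit: "s" };
    case "sqlserver":
      return { setting: "requestTimeout", value: deadlineMs, unit: "ms" };
  }
}
```

With a `query_timeout_ms` of 1500, the millisecond engines receive 1500 and the second-based engines receive 2.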
|
|
||||||
|
|
||||||
### 5. Timeout resolves as a `KtxQueryError` with the canonical message
|
|
||||||
|
|
||||||
When the deadline is exceeded, the call rejects with a `KtxQueryError`
|
|
||||||
(`query exceeded {N}s`) — a finite, decision-reaching outcome, never an unbounded
|
|
||||||
hang. For SQLite the worker-termination path throws `queryDeadlineExceededError`
|
|
||||||
directly. For remote engines, each connector recognizes **its own** engine's
|
|
||||||
timeout signal (Postgres `57014`; MySQL errno `3024`; ClickHouse code `159`;
|
|
||||||
SQL Server `ETIMEOUT`; Snowflake and BigQuery timeout errors) and re-wraps it as
|
|
||||||
`queryDeadlineExceededError`, keeping the driver error as `cause`. Each connector
|
|
||||||
owns its driver's signal — there is no central denylist of error codes to
|
|
||||||
maintain.
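
A minimal sketch of the per-connector re-wrapping, under stated assumptions: the error here is a plain `Error` named `KtxQueryError` rather than the real class, the helper names (`isPostgresTimeout`, `isMysqlTimeout`, `rewrapIfTimeout`) are hypothetical, and the sketch assumes the Postgres driver exposes the SQLSTATE as a string `code` and the MySQL driver exposes a numeric `errno`.

```typescript
type ErrorWithCause = Error & { cause?: unknown };

// Stand-in for queryDeadlineExceededError; the real one builds a KtxQueryError.
function queryDeadlineExceededError(deadlineMs: number, options?: { cause?: unknown }): ErrorWithCause {
  const err: ErrorWithCause = new Error(`query exceeded ${Math.round(deadlineMs / 1000)}s`);
  err.name = "KtxQueryError";
  if (options?.cause !== undefined) err.cause = options.cause;
  return err;
}

// Each connector owns the check for its own engine's timeout signal.
function isPostgresTimeout(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { code?: unknown }).code === "57014";
}

function isMysqlTimeout(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { errno?: unknown }).errno === 3024;
}

// Timeout signals become the canonical error; everything else passes through untouched.
function rewrapIfTimeout(err: unknown, deadlineMs: number, isTimeout: (e: unknown) => boolean): unknown {
  return isTimeout(err) ? queryDeadlineExceededError(deadlineMs, { cause: err }) : err;
}
```

Because the predicate is passed in by the connector, no shared list of driver error codes exists to drift out of date.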
|
|
||||||
|
|
||||||
### 6. MCP surfacing and logging via the existing pino path
|
|
||||||
|
|
||||||
The MCP `sql_execution` path already (a) maps any non-native driver error to
|
|
||||||
`KtxQueryError` (`context/mcp/local-project-ports.ts:78–88`, guarded by
|
|
||||||
`isNativeProgrammingFault`), (b) reports it through `reportException`, which skips
|
|
||||||
`$exception` Error Tracking for `KtxExpectedError`, and (c) writes `tool.start`
|
|
||||||
synchronously before the handler and `tool.end` in `instrumentMcpServer`
|
|
||||||
(`context/mcp/context-tools.ts:644–730`). The deadline cases MUST surface through
|
|
||||||
this path — the implementer verifies and tests them, but adds **no parallel
|
|
||||||
classification or logging path**:
|
|
||||||
|
|
||||||
- **Query exceeds the deadline (any driver):** a `tool.end` at **`error`** with
|
|
||||||
`outcome:"error"` and `err.message = "query exceeded {N}s"`, carrying the same
|
|
||||||
`callId` as the `tool.start`. Classified as an expected error, so it is absent
|
|
||||||
from `$exception` Error Tracking. The reason `tool.end` was previously missing
|
|
||||||
is solely the blocked event loop (Requirement 3); once the loop stays free and
|
|
||||||
the deadline throws, the existing instrumentation logs the matched pair — closing
|
|
||||||
spec 15's "`tool.start` with no `tool.end` = hang" gap for this case.
|
|
||||||
- **Completed-but-slow query (under the deadline, over `KTX_MCP_SLOW_TOOL_MS`):**
|
|
||||||
unchanged from spec 15 — its `tool.end` is emitted at **`warn`**. The deadline
|
|
||||||
(default 30s) and the slow threshold (default 10s) are independent knobs; a query
|
|
||||||
between 10s and 30s completes with a slow `warn`, one past 30s is killed with the
|
|
||||||
`error` above.
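
The interaction of the two knobs can be sketched as a single decision. This is illustrative only: `toolEndLevel` is a hypothetical function, the real level selection lives in `instrumentMcpServer`, and `"normal"` stands for whatever level spec 15 uses for an ordinary call, which this section does not state.

```typescript
const DEFAULT_SLOW_TOOL_MS = 10_000; // default of KTX_MCP_SLOW_TOOL_MS, per the text above

type ToolEndLevel = "error" | "warn" | "normal";

// Hypothetical summary of the two cases: a deadline kill always logs at error;
// a completed call logs at warn only when it ran longer than the slow threshold.
function toolEndLevel(durationMs: number, timedOut: boolean, slowMs: number = DEFAULT_SLOW_TOOL_MS): ToolEndLevel {
  if (timedOut) return "error";
  return durationMs > slowMs ? "warn" : "normal";
}
```

With the defaults, a 12s query that completes yields `warn`, and a query killed at the 30s deadline yields `error`.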
|
|
||||||
|
|
||||||
### 7. Read-only enforcement and `maxRows` unchanged
|
|
||||||
|
|
||||||
`assertReadOnlySql` and the `maxRows` row cap (`limitSqlForExecution`) behave
|
|
||||||
exactly as today. The deadline is additive. `maxRows` is not a substitute for it
|
|
||||||
(it bounds returned rows, not scan work).
|
|
||||||
|
|
||||||
### 8. Best-effort callers treat a deadline timeout as recoverable
|
|
||||||
|
|
||||||
The non-interactive `executeReadOnly` call sites that are best-effort —
|
|
||||||
relationship profiling, composite-candidate probes, relationship validation,
|
|
||||||
historic-SQL probes — MUST treat a deadline `KtxQueryError` as "skip this
|
|
||||||
probe / mark unprofiled" and continue, never as a source-fatal error. The
|
|
||||||
implementer confirms each such site already swallows query errors into a
|
|
||||||
graceful-skip and adds that handling where it does not, so the uniform deadline
|
|
||||||
(Requirement 1, applied to all callers) cannot abort an ingest run. A skipped
|
|
||||||
probe is logged at the skip site through that path's existing scan/ingest logger
|
|
||||||
(`KtxScanContext.logger`, `warn`/`debug`), never silently dropped — these callers
|
|
||||||
are off the MCP tool-call path, so their visibility comes from the logger they
|
|
||||||
already use.
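
A minimal sketch of the graceful-skip shape this requirement asks for. `probeOrSkip` and `LoggerSketch` are hypothetical names, and the check on `err.name` stands in for an `instanceof KtxQueryError` test in the real code.

```typescript
interface LoggerSketch {
  warn(message: string): void;
}

// Runs a best-effort probe. A query error (including a deadline timeout) is
// logged and turned into `undefined` so the caller can mark the item
// unprofiled and continue; any other error still propagates.
async function probeOrSkip<T>(
  label: string,
  probe: () => Promise<T>,
  logger: LoggerSketch,
): Promise<T | undefined> {
  try {
    return await probe();
  } catch (err) {
    if (err instanceof Error && err.name === "KtxQueryError") {
      logger.warn(`skipping ${label}: ${err.message}`);
      return undefined;
    }
    throw err;
  }
}
```

The warning is written at the skip site, which matches the requirement that a skipped probe is never dropped silently.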
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- A read query that exceeds the deadline returns a `KtxQueryError`
|
|
||||||
(`query exceeded {N}s`) within roughly the deadline; the MCP server stays
|
|
||||||
responsive (a concurrent tool call on the same server completes while the slow
|
|
||||||
query is still pending) and writes a matching `tool.end` with a non-ok outcome.
|
|
||||||
- **Logging:** a timed-out `sql_execution` produces a `tool.start` and a matching
|
|
||||||
`tool.end` (same `callId`) at `error` with `outcome:"error"` and
|
|
||||||
`err.message = "query exceeded {N}s"` — no unmatched `tool.start` remains. The
|
|
||||||
timeout does not raise a `$exception` Error Tracking event (it is a
|
|
||||||
`KtxExpectedError`). A completed query slower than `KTX_MCP_SLOW_TOOL_MS` but
|
|
||||||
under the deadline still emits its `tool.end` at `warn`. No new logger is
|
|
||||||
introduced — the lines come from the existing `instrumentMcpServer`.
|
|
||||||
- **SQLite specifically:** executing a deliberately pathological query (an
|
|
||||||
expensive VIEW or an unindexed cross join) on a fixture does not block the event
|
|
||||||
loop, is terminated at the deadline, and the worker exits (the off-main-thread
|
|
||||||
executor is killed, not left spinning) so CPU returns to idle.
|
|
||||||
- **One server-side-timeout driver (Postgres):** the connector applies
|
|
||||||
`statement_timeout` equal to the resolved deadline, and a `57014` cancellation
|
|
||||||
is mapped to the canonical `KtxQueryError`.
|
|
||||||
- `resolveQueryDeadlineMs` returns 30_000 by default, honors a `query_timeout_ms`
|
|
||||||
override, and rejects an invalid value (zero / negative / non-integer).
|
|
||||||
- **No regression:** normal fast queries return identical results; read-only
|
|
||||||
rejection still works; `maxRows` still bounds returned rows.
|
|
||||||
- The shared `query_timeout_ms` field is accepted by every connector; BigQuery's
|
|
||||||
former `job_timeout_ms` key is gone and BigQuery's timeout is driven by the
|
|
||||||
shared field.
|
|
||||||
|
|
||||||
## Non-goals
|
|
||||||
|
|
||||||
- **A row/byte/cost budget on returned data.** This spec bounds *time*, not result
|
|
||||||
size — `maxRows` already bounds rows, and BigQuery's `maximumBytesBilled` is a
|
|
||||||
separate, retained concern.
|
|
||||||
- **A global `KTX_QUERY_TIMEOUT_MS` or per-call user flag.** One opinionated
|
|
||||||
default plus a per-connection override; no per-call knob, no global knob.
|
|
||||||
- **A server watchdog that recycles the process on an unmatched `tool.start`.**
|
|
||||||
Spec 15 names this as a possible future mitigation; this spec prevents the hang
|
|
||||||
at the source, so the watchdog is out of scope here.
|
|
||||||
- **Moving SQLite introspection / sampling / stats off the main thread.** Only the
|
|
||||||
`executeReadOnly` (LLM-SQL) path needs worker isolation; the rest are bounded
|
|
||||||
ktx-authored queries.
|
|
||||||
- **Per-connection retry / backoff on timeout.** A timeout returns a clean error
|
|
||||||
for the agent to revise; ktx does not auto-retry.
|
|
||||||
- **A second logger threaded into the connector.** The deadline cases are logged
|
|
||||||
through spec 15's existing MCP tool-call logger; the connector gets no separate
|
|
||||||
pino instance and `KtxScanContext` gets no MCP-logger thread (see "Logging routes
|
|
||||||
through spec 15's pino path").
|
|
||||||
|
|
||||||
## Implementation orientation
|
|
||||||
|
|
||||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
|
||||||
the design.
|
|
||||||
|
|
||||||
- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`:
|
|
||||||
`DEFAULT_QUERY_TIMEOUT_MS`, `resolveQueryDeadlineMs`, `queryDeadlineExceededError`.
|
|
||||||
Error class is `KtxQueryError` (`packages/cli/src/errors.ts:25`).
|
|
||||||
- **Contract anchor** — `KtxScanConnector.executeReadOnly`
|
|
||||||
(`context/scan/types.ts:343`), `KtxReadOnlyQueryInput` (`types.ts:285`),
|
|
||||||
`KtxScanContext.signal` (`types.ts:176`, already present, currently unused on the
|
|
||||||
MCP path).
|
|
||||||
- **Config schema** — add `query_timeout_ms` to the shared connection config
|
|
||||||
(`context/project/config.ts`, `KtxProjectConnectionConfig` and its zod schema);
|
|
||||||
remove BigQuery's `job_timeout_ms` reader.
|
|
||||||
- **SQLite worker** — new `packages/cli/src/connectors/sqlite/read-query-worker.ts`
|
|
||||||
(constructed by path via `new URL('./read-query-worker.js', import.meta.url)`);
|
|
||||||
rework `connectors/sqlite/connector.ts` `executeReadOnly` (247–251) to validate
|
|
||||||
on the main thread then dispatch to the worker with a terminate-on-deadline
|
|
||||||
timer. Reuse `normalizeQueryRows` (`context/connections/query-executor.ts`) in
|
|
||||||
the worker. Register the worker as a dynamic entry in `knip.json` (it is
|
|
||||||
referenced by path, not import) and confirm the build copies it into `dist`.
|
|
||||||
- **Remote connectors** — apply the resolved deadline and recognize the engine's
|
|
||||||
timeout signal in each `executeReadOnly` / `query(...)`:
|
|
||||||
`connectors/bigquery/connector.ts` (~491–512, `jobTimeoutMs`),
|
|
||||||
`connectors/clickhouse/connector.ts` (~602/629–644, `max_execution_time` +
|
|
||||||
`request_timeout`), `connectors/snowflake/connector.ts` (~354–371/510–534,
|
|
||||||
`STATEMENT_TIMEOUT_IN_SECONDS`), `connectors/postgres/connector.ts` (~822–838,
|
|
||||||
`statement_timeout`), `connectors/mysql/connector.ts` (~774–793,
|
|
||||||
`max_execution_time`), `connectors/sqlserver/connector.ts` (~812–832,
|
|
||||||
`requestTimeout`).
|
|
||||||
- **MCP path + logging (verify only)** — `context/mcp/local-project-ports.ts:69–88`
|
|
||||||
(error mapping), the `sql_execution` registration (~915–943), and the logging in
|
|
||||||
`instrumentMcpServer` (`context/mcp/context-tools.ts:644–730`, which writes
|
|
||||||
`tool.start`/`tool.end` via the spec-15 pino logger `context/mcp/logger.ts`). No
|
|
||||||
new classification or logging code; confirm the timeout flows through as an
|
|
||||||
expected error producing a matching `tool.end(error)` with the canonical message.
|
|
||||||
- **Best-effort callers** — `context/scan/relationship-profiling.ts` (~227, 275),
|
|
||||||
`context/scan/relationship-composite-candidates.ts` (~365, 440),
|
|
||||||
`context/scan/relationship-validation.ts` (~259),
|
|
||||||
`context/ingest/historic-sql-probes/bigquery-runner.ts` (~97), and the
|
|
||||||
historic-sql clients: confirm a deadline `KtxQueryError` is swallowed into a
|
|
||||||
graceful skip.
|
|
||||||
- **Tests** — a SQLite fixture with a pathological query (tiny `query_timeout_ms`
|
|
||||||
as the test seam) asserting terminate-on-deadline, event-loop responsiveness
|
|
||||||
(a concurrent promise resolves while the query is pending), and worker exit; a
|
|
||||||
Postgres test asserting `statement_timeout` is set to the resolved deadline and
|
|
||||||
a `57014` error maps to `KtxQueryError`; resolver unit tests (default /
|
|
||||||
override / invalid); regression tests for normal results, read-only rejection,
|
|
||||||
and `maxRows`. Extend the MCP logging tests (alongside spec 15's, e.g.
|
|
||||||
`test/context/mcp/server.test.ts`) to assert a timed-out `sql_execution` yields a
|
|
||||||
matched `tool.start`/`tool.end(error)` pair carrying `query exceeded {N}s`.
|
|
||||||
- After implementing, rebuild and re-link so the playground picks it up:
|
|
||||||
`pnpm run build && pnpm run link:dev`.
|
|
||||||
|
|
||||||
## Benchmark context (motivation, not a requirement)
|
|
||||||
|
|
||||||
The Spider2-lite local set loads several warehouses into SQLite, some with
|
|
||||||
expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` =
|
|
||||||
`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112
|
|
||||||
rows, no composite index, with `promo_id` (the index the optimizer picks) being
|
|
||||||
95.5% a single value. LLM-authored profiling queries (MIN/MAX/COUNT over such a
|
|
||||||
view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval
|
|
||||||
shard for 10+ minutes; with one, the agent gets a fast error and can scope the
|
|
||||||
query instead. Improving the benchmark is a side effect; the deadline is generic
|
|
||||||
production hygiene for any agent that lets an LLM author SQL.
|
|
||||||
|
|
||||||
## Implementation notes
|
|
||||||
|
|
||||||
Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
|
|
||||||
acceptance criteria are met; tests, type-check, dead-code, and build are green
|
|
||||||
for the changed surface.
|
|
||||||
|
|
||||||
### What was built, and where
|
|
||||||
|
|
||||||
- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`:
|
|
||||||
`DEFAULT_QUERY_TIMEOUT_MS = 30_000`, `resolveQueryDeadlineMs(connection)` (returns
|
|
||||||
the validated `query_timeout_ms` override else the default; throws on
|
|
||||||
zero/negative/non-integer), and `queryDeadlineExceededError(deadlineMs, options?)`
|
|
||||||
(a `KtxQueryError` reading `query exceeded ${round(ms/1000)}s`, carrying the
|
|
||||||
driver error as `cause`). Unit-tested in `test/context/connections/query-deadline.test.ts`.
|
|
||||||
- **Config field** — `query_timeout_ms` (optional positive integer, ms) added to
|
|
||||||
the **shared warehouse** schema. NOTE (spec drift): that schema lives in
|
|
||||||
`context/project/driver-schemas.ts` (`warehouseConnectionSchema`), not
|
|
||||||
`config.ts`. The warehouse schemas use `z.looseObject`, so the field had to be
|
|
||||||
declared explicitly to be *validated* (otherwise it would pass through
|
|
||||||
unvalidated). BigQuery's `job_timeout_ms` field and `bigQueryJobTimeoutMsFromConnection`
|
|
||||||
reader were removed; BigQuery now resolves the shared field. Every connector
|
|
||||||
resolves its deadline once at construction via `resolveQueryDeadlineMs`.
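
A minimal sketch of the shared contract described in this list, not the shipped file. The connection type is reduced to the one relevant field, the validation-failure message is an assumption, and the error is a plain `Error` named `KtxQueryError` standing in for the real class.

```typescript
const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Only the field this sketch needs; the real connection config carries more.
interface ConnectionSketch {
  query_timeout_ms?: unknown;
}

// Returns the validated per-connection override, else the default.
function resolveQueryDeadlineMs(connection: ConnectionSketch): number {
  const override = connection.query_timeout_ms;
  if (override === undefined) return DEFAULT_QUERY_TIMEOUT_MS;
  if (typeof override !== "number" || !Number.isInteger(override) || override <= 0) {
    // Message wording is illustrative, not the real validation error.
    throw new Error(`query_timeout_ms must be a positive integer (ms), got ${String(override)}`);
  }
  return override;
}

type ErrorWithCause = Error & { cause?: unknown };

// Builds the canonical timeout error, keeping the driver error as `cause`.
function queryDeadlineExceededError(deadlineMs: number, options?: { cause?: unknown }): ErrorWithCause {
  const err: ErrorWithCause = new Error(`query exceeded ${Math.round(deadlineMs / 1000)}s`);
  err.name = "KtxQueryError"; // stand-in for constructing the real KtxQueryError
  if (options?.cause !== undefined) err.cause = options.cause;
  return err;
}
```

A connector would call `resolveQueryDeadlineMs` once in its constructor and reuse the result for both the server-side setting and the error message.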
|
|
||||||
|
|
||||||
### Deviation from the spec's SQLite mechanism (worker thread → child process)
|
|
||||||
|
|
||||||
The spec mandated running SQLite read queries on a **worker thread** and enforcing
|
|
||||||
the deadline with `worker.terminate()`. That mechanism was **empirically disproven**:
|
|
||||||
`Worker.terminate()` cannot interrupt a CPU-bound synchronous `better-sqlite3`
|
|
||||||
scan — the native `sqlite3_step` loop never yields to V8, so terminate's promise
|
|
||||||
never even resolves (an 8s probe of the exact failing query shape confirmed the
|
|
||||||
thread keeps spinning). better-sqlite3 v12 exposes no `interrupt`/progress-handler
|
|
||||||
API, and `.iterate()` does not help because the failing query is a single
|
|
||||||
aggregate row produced only *after* the full scan.
|
|
||||||
|
|
||||||
The implemented mechanism is therefore **`child_process.fork` + `SIGKILL`**
|
|
||||||
(`packages/cli/src/connectors/sqlite/read-query-child.ts`, spawned from
|
|
||||||
`connector.ts`). SIGKILL lets the OS reclaim the whole process — a probe confirmed
|
|
||||||
the scan is interrupted in ~2 ms and CPU returns to idle. This satisfies *both*
|
|
||||||
SQLite requirements better than a thread (event loop stays free **and** the query
|
|
||||||
is genuinely cancellable). The child is self-contained (imports only
|
|
||||||
`better-sqlite3` + node builtins); validation/row-limiting (`limitSqlForExecution`)
|
|
||||||
and `normalizeQueryRows` stay on the main thread. One short-lived child per call,
|
|
||||||
killed on completion, deadline, or `ctx.signal` abort. Node v24's native
|
|
||||||
TS type-stripping lets the `.ts` child load under vitest; a `.js`-if-exists-else-`.ts`
|
|
||||||
URL resolver picks the compiled child in `dist`. Registered as a dynamic entry in
|
|
||||||
`knip.json`; `tsc` emits it to `dist` (verified, plus a dist-level end-to-end smoke).
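
The kill-on-deadline pattern can be sketched in one file. This is an illustration under stated assumptions, not the shipped `read-query-child.ts`: it uses `spawn` with an inline `-e` script instead of `child_process.fork` on a module, the child runs a busy loop rather than a `better-sqlite3` query, and `runWithDeadline` is a hypothetical name.

```typescript
import { spawn } from "node:child_process";

// Runs `script` in a separate Node process and SIGKILLs it if it has not
// exited within `deadlineMs`. The parent's event loop stays free throughout.
function runWithDeadline(script: string, deadlineMs: number): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ["-e", script], {
      stdio: ["ignore", "pipe", "ignore"],
    });
    let stdout = "";
    let timedOut = false;
    child.stdout?.on("data", (chunk: Buffer) => {
      stdout += chunk.toString("utf8");
    });

    // SIGKILL cannot be caught or deferred by the child, so a CPU-bound
    // native loop is interrupted by the OS rather than by cooperation.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL");
    }, deadlineMs);

    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      if (timedOut) {
        reject(new Error(`query exceeded ${Math.round(deadlineMs / 1000)}s`));
      } else if (code !== 0) {
        reject(new Error(`child exited with code ${String(code)}`));
      } else {
        resolve(stdout);
      }
    });
  });
}
```

Calling `runWithDeadline("while (true) {}", 200)` rejects shortly after 200 ms even though the child never yields, which is the property `Worker.terminate()` could not provide for the synchronous scan described above.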
|
|
||||||
|
|
||||||
### Remote connectors (server-side timeouts + own-signal mapping)
|
|
||||||
|
|
||||||
Each applies the resolved deadline server-side and re-wraps its own timeout signal
|
|
||||||
as `queryDeadlineExceededError(deadlineMs, { cause })`:
|
|
||||||
|
|
||||||
- **BigQuery** — `jobTimeoutMs` on the query job; maps a "Job timed out" / timeout-reason error.
|
|
||||||
- **Postgres** — `statement_timeout` via pool `options` (`-c statement_timeout=<ms>`); maps `57014`.
|
|
||||||
- **MySQL** — `SET SESSION max_execution_time = <ms>` before the read; maps errno `3024`.
|
|
||||||
- **Snowflake** — `ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = <ceil(s)>` in the pooled connection; maps code `604` / "reached its … timeout".
|
|
||||||
- **ClickHouse** — `max_execution_time` (ceil seconds) setting, with `request_timeout` set to `deadline + 5s` so the HTTP client outlasts the server abort (replaces the old hardcoded 30s); maps code `159`.
|
|
||||||
- **SQL Server** — `requestTimeout` on the `mssql` pool config (TDS attention cancels server-side); maps `ETIMEOUT`.
|
|
||||||
|
|
||||||
Each connector has a focused test asserting the timeout is applied and its signal
|
|
||||||
maps to `KtxQueryError` (Postgres is the spec's required acceptance test).
|
|
||||||
|
|
||||||
### Best-effort callers (Requirement 8)
|
|
||||||
|
|
||||||
Confirmed already graceful: relationship **profiling** (outer try/catch →
|
|
||||||
`profile_failed` warning) and **composite-candidate** detection
|
|
||||||
(`detectCompositeRelationships` → recoverable warning, returns `[]`). Historic-SQL
|
|
||||||
**probes** flow through `runHistoricSqlReadinessProbe`, which catches *any* error
|
|
||||||
into `{ ok: false }`. **Added** handling to relationship **validation**: a
|
|
||||||
`KtxQueryError` on the per-candidate coverage probe now sends that one candidate to
|
|
||||||
`review` (`validation_query_failed`, logged via `ctx.logger.warn`) instead of
|
|
||||||
aborting the whole validation pass. `ingest-query-executor.ts` is a generic
|
|
||||||
executor port whose callers own recoverability — left unchanged.
|
|
||||||
|
|
||||||
### MCP surfacing/logging
|
|
||||||
|
|
||||||
No new MCP classification or logging code. The deadline `KtxQueryError` flows
|
|
||||||
through the existing `local-project-ports` mapping → `reportException` (skips
|
|
||||||
`$exception` for `KtxExpectedError`; existing test `telemetry/exception.test.ts`
|
|
||||||
covers the skip for `KtxQueryError`) → `instrumentMcpServer`, which logs a matched
|
|
||||||
`tool.start` → `tool.end(error, level 50)` pair carrying `err.message = "query
|
|
||||||
exceeded {N}s"`. A test in `test/context/mcp/server.test.ts` asserts the matched
|
|
||||||
pair, closing spec 15's "`tool.start` with no `tool.end` = hang" gap for this case.
|
|
||||||
|
|
||||||
### Pre-existing branch issues encountered (not part of this feature)
|
|
||||||
|
|
||||||
- `test/mcp-server-factory.test.ts` had a type error (an `as` cast to a shape with
|
|
||||||
a fake `context_tool` key, introduced by branch commit `2677b3ef`) that broke
|
|
||||||
`tsc -p tsconfig.test.json`. Fixed with a clean single cast to keep the
|
|
||||||
type-check gate green; behavior unchanged.
|
|
||||||
- `test/skills/analytics-skill-content.test.ts` fails (2 cases: missing
|
|
||||||
`**Window functions**` heading and `Expose identity, not just the label` prose
|
|
||||||
in `src/skills/analytics/SKILL.md`). This is unrelated analytics-skill (spec
|
|
||||||
13/14) content drift committed earlier on the branch; **left untouched** — no
|
|
||||||
skill files were modified by this feature.
|
|
||||||
|
|
@ -1,418 +0,0 @@
|
||||||
# BigQuery cross-project dataset introspection (foreign-hosted datasets, billed in own project)
|
|
||||||
|
|
||||||
> Refined spec. Intake draft: `todo/18-bigquery-cross-project-datasets.md`.
|
|
||||||
>
|
|
||||||
> **Scope: let the BigQuery connector introspect a dataset hosted in a *different*
|
|
||||||
> project than the one it bills jobs to.** A `dataset_ids` entry may be written
|
|
||||||
> fully-qualified as `project.dataset`; the connector introspects each entry in
|
|
||||||
> *its own* project while every job still runs in `credentials.project_id`. A
|
|
||||||
> bare `dataset` keeps today's single-project behavior unchanged.
|
|
||||||
>
|
|
||||||
> Out of scope (confirmed during refinement): the interactive `ktx setup` wizard
|
|
||||||
> is **not** expected to *discover* foreign datasets — you cannot enumerate
|
|
||||||
> datasets in a project you don't own, and the wizard doesn't know which foreign
|
|
||||||
> projects to probe. Users hand-write `project.dataset` entries (in `ktx.yaml` or
|
|
||||||
> at the dataset prompt); the connector must accept and introspect them. See
|
|
||||||
> *Non-goals*.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
**ktx**'s BigQuery connector derives a single `projectId` from
|
|
||||||
`credentials.project_id` and uses it for **both** job billing **and** schema
|
|
||||||
introspection. There is no way to introspect a dataset that lives in another
|
|
||||||
project, even though *querying* such a dataset already works (a cross-project
|
|
||||||
read in a `FROM` clause bills to the caller's project — that path is proven).
|
|
||||||
|
|
||||||
Confirmed in the current connector (`packages/cli/src/connectors/bigquery/connector.ts`):
|
|
||||||
|
|
||||||
- **`:294`** — `projectId` is read only from `credentials.project_id`. There is
|
|
||||||
no separate billing-vs-dataset project. `bigQueryConnectionConfigFromConfig`
|
|
||||||
(`:278`–`:301`) returns `datasetIds: string[]` — raw, unparsed.
|
|
||||||
- **`datasetIds()` (`:163`)** — returns `dataset_ids` / `dataset_id` verbatim;
|
|
||||||
it never parses a `project.` prefix.
|
|
||||||
- **`introspectDataset` (`:544`)** — calls `this.getClient().dataset(datasetId)`,
|
|
||||||
which resolves the dataset in the **client's (billing) project**, and labels
|
|
||||||
every table `catalog: this.resolved.projectId` (`:566`, `:574`) — including the
|
|
||||||
introspection-failure warning metadata (`:566`).
|
|
||||||
- **`primaryKeys` (`:591`)** — builds `INFORMATION_SCHEMA` SQL as
|
|
||||||
`` `<projectId>.<datasetId>.INFORMATION_SCHEMA.TABLE_CONSTRAINTS` `` using the
|
|
||||||
**billing** project.
|
|
||||||
- **`listTables` (`:453`)** — queries
|
|
||||||
`` `<projectId>`.`region-<region>`.INFORMATION_SCHEMA.TABLES `` against the
|
|
||||||
**billing** project and labels each row `catalog: this.resolved.projectId`.
|
|
||||||
- **`testConnection` (`:344`)** — calls `client.dataset(datasetId).get()` in the
|
|
||||||
billing project.
|
|
||||||
|
|
||||||
### Empirical confirmation (from the intake draft)
|
|
||||||
|
|
||||||
With a service account in project `ktx-spider2-lite`:
|
|
||||||
|
|
||||||
- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (it looks
|
|
||||||
in `projects/ktx-spider2-lite/datasets/austin_311`).
|
|
||||||
- The cross-project form `dataset("austin_311", { projectId: "bigquery-public-data" })`
|
|
||||||
→ **succeeds** (public metadata is readable by any authenticated principal).
|
|
||||||
- There is **no config knob** to separate the introspection project from billing.
|
|
||||||
|
|
||||||
### Why the table `catalog` label is load-bearing, not cosmetic
|
|
||||||
|
|
||||||
The BigQuery dialect generates **three-part `catalog.db.name`** SQL
|
|
||||||
(`connectors/bigquery/dialect.ts:38` → `formatDialectTableName(..., 'three-part')`;
|
|
||||||
`context/connections/dialect-helpers.ts:27`–`32` emits `catalog.db.name`). The
|
|
||||||
`catalog` stored on each scanned table is therefore the project that *every*
|
|
||||||
later query targets — `sampleTable`, `sampleColumn`, `getColumnDistinctValues`,
|
|
||||||
and ref-based `executeReadOnly` all format the ref through the dialect. If a
|
|
||||||
foreign dataset's tables are labeled with the billing project, every one of those
|
|
||||||
queries becomes `` `billing-project`.`austin_311`.`table` `` → `404`. So labeling
|
|
||||||
the table `catalog` with the dataset's own project is a **correctness
|
|
||||||
requirement**, and it is the single lever that makes sampling, dictionary value
|
|
||||||
extraction, and `discover_data` all resolve once the snapshot is right.
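
To make the failure concrete, here is a minimal sketch of three-part name formatting. `formatThreePart` and `TableRefSketch` are hypothetical stand-ins for the dialect helper, and the table name `requests` and the billing project `my-billing-project` are placeholders.

```typescript
interface TableRefSketch {
  catalog: string; // the project stored on the scanned table
  db: string;      // the dataset
  name: string;    // the table
}

// Quotes each part with backticks and joins them, mirroring the
// catalog.db.name shape the dialect emits.
function formatThreePart(ref: TableRefSketch): string {
  return [ref.catalog, ref.db, ref.name].map((part) => "`" + part + "`").join(".");
}

// Labeled with the billing project: points at a dataset that does not exist there.
const mislabeled = formatThreePart({ catalog: "my-billing-project", db: "austin_311", name: "requests" });

// Labeled with the dataset's own project: resolves to the real table.
const correct = formatThreePart({ catalog: "bigquery-public-data", db: "austin_311", name: "requests" });
```

The only difference between the two refs is the stored `catalog`, which is why fixing the label at introspection time is enough for every later query path.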
|
|
||||||
|
|
||||||
### One introspection path, no divergence
|
|
||||||
|
|
||||||
`connectors/bigquery/live-database-introspection.ts` wraps
|
|
||||||
`KtxBigQueryScanConnector.introspect` directly, so the ingest and live-database
|
|
||||||
paths share **one** introspection implementation. The SDK already supports the
|
|
||||||
fix: `client.dataset(id, { projectId })` — `@google-cloud/bigquery@8.3.1`'s
|
|
||||||
`DatasetOptions` exposes `projectId?: string`.
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
Analysts routinely introspect datasets they can **read but do not own and do not
|
|
||||||
bill to**: Google's `bigquery-public-data`, a partner's shared project, an
|
|
||||||
organization's central data project that a smaller team queries from its own
|
|
||||||
billing project. To make those connectable in **ktx** — so `discover_data`, the
|
|
||||||
semantic layer, dictionary sampling, and `sql_dialect_notes` all work — the
|
|
||||||
connector must introspect a foreign-hosted dataset while billing jobs in the
|
|
||||||
credentials' own project. This is a standard BigQuery deployment shape and is
|
|
||||||
wholly independent of any benchmark.
|
|
||||||
|
|
||||||
The class to design for is "the dataset's project ≠ the billing project," and it
|
|
||||||
must generalize beyond one example: a single connection may reference datasets in
|
|
||||||
**several** foreign projects at once (e.g. one slice mixing `bigquery-public-data`
|
|
||||||
and `isb-cgc-bq`), and two different projects may host datasets with the **same
|
|
||||||
name**. The design must keep those distinct.
|
|
||||||
|
|
||||||
## Design decisions (resolved during refinement)
|
|
||||||
|
|
||||||
These resolve ambiguities the intake draft left open. They constrain the
|
|
||||||
implementer; the exact code is theirs.
|
|
||||||
|
|
||||||
### Carry the project inline on each dataset entry — no separate knob
|
|
||||||
|
|
||||||
The introspection project is expressed **per dataset**, inline, as the optional
|
|
||||||
`project.` prefix on a `dataset_ids` / `dataset_id` entry. There is no new config
|
|
||||||
field.
|
|
||||||
|
|
||||||
> Rejected alternative: a separate connection-level `dataset_project` (or
|
|
||||||
> `introspection_project`) field. It is a speculative runtime knob (against the
|
|
||||||
> repo's opinionated-defaults rule) and, more decisively, it **cannot express the
|
|
||||||
> requirement**: one connection must span *multiple* foreign projects, which a
|
|
||||||
> single global field cannot represent. The inline form also derives scope from
|
|
||||||
> the user's own declared input rather than adding a parallel setting.
|
|
||||||
|
|
||||||
### Parse to canonical `{ project, dataset }` pairs at the config boundary
|
|
||||||
|
|
||||||
Each entry is parsed **once**, in `bigQueryConnectionConfigFromConfig` /
|
|
||||||
`datasetIds()`, into a canonical pair: the project (when no prefix is present,
|
|
||||||
default it to `credentials.project_id`) and the bare dataset id. Every
|
|
||||||
introspection-side call site reads the resolved pair; nothing downstream re-parses
|
|
||||||
a `project.dataset` string.
|
|
||||||
|
|
||||||
> Rejected alternative: keep `datasetIds: string[]` raw and split the prefix
|
|
||||||
> lazily at each use site (`introspectDataset`, `primaryKeys`, `listTables`,
|
|
||||||
> `testConnection`). That re-implements one rule in four places and is exactly the
|
|
||||||
> drift trap the repo's single-source-of-truth rule warns about — a later fix
|
|
||||||
> lands on one path and not another. Normalize at the boundary; carry the
|
|
||||||
> canonical form downstream.
|
|
||||||
|
|
||||||
The internal resolved-config type (`KtxBigQueryResolvedConnectionConfig.datasetIds`)
|
|
||||||
changes shape from `string[]` to a structured pair list. That is an internal type;
|
|
||||||
the connector internals and the connector test fixtures are the only consumers.
|
|
||||||
|
|
||||||
### Parsing rule (at the boundary)
|
|
||||||
|
|
||||||
- An entry contains **at most one `.`**.
|
|
||||||
- With a dot: the segment **before** the dot is the project, validated by the
|
|
||||||
existing `normalizeBigQueryProjectId` charset
|
|
||||||
(`context/connections/bigquery-identifiers.ts`); the segment **after** is the
|
|
||||||
dataset id (validated as a normal identifier).
|
|
||||||
- Without a dot: a bare dataset; the project defaults to `credentials.project_id`
|
|
||||||
(today's behavior).
|
|
||||||
- **More than one `.`** (e.g. a stray `proj.ds.table`) is a clear config error
|
|
||||||
raised at resolution time, naming the connection — not a silent
|
|
||||||
mis-introspection.
|
|
||||||
- Legacy domain-scoped project ids that contain `:` (e.g. `example.com:proj`) stay
|
|
||||||
**out of scope**, consistent with `normalizeBigQueryProjectId`'s current charset
|
|
||||||
(which already rejects `.` and `:` in a project id).
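
The parsing rule above can be sketched as one boundary function. This is a hedged illustration, not the connector's code: `parseDatasetEntry` and `DatasetRefSketch` are hypothetical names, and the two regular expressions are simplified placeholder charsets, since the real checks live in `normalizeBigQueryProjectId` and the existing identifier validation.

```typescript
interface DatasetRefSketch {
  project: string;
  dataset: string;
}

// Placeholder charsets for illustration only (assumptions, not the real rules).
const PROJECT_ID = /^[a-z][a-z0-9-]*$/;
const DATASET_ID = /^[A-Za-z0-9_]+$/;

function parseDatasetEntry(entry: string, billingProject: string, connectionId: string): DatasetRefSketch {
  const firstDot = entry.indexOf(".");
  if (firstDot !== entry.lastIndexOf(".")) {
    throw new Error(`connection ${connectionId}: dataset entry "${entry}" has more than one "."`);
  }
  // No dot: a bare dataset in the billing project (today's behavior).
  const hasPrefix = firstDot !== -1;
  const project = hasPrefix ? entry.slice(0, firstDot) : billingProject;
  const dataset = hasPrefix ? entry.slice(firstDot + 1) : entry;

  if (hasPrefix && !PROJECT_ID.test(project)) {
    throw new Error(`connection ${connectionId}: invalid project in dataset entry "${entry}"`);
  }
  if (!DATASET_ID.test(dataset)) {
    throw new Error(`connection ${connectionId}: invalid dataset in dataset entry "${entry}"`);
  }
  return { project, dataset };
}
```

Empty segments such as `.austin_311` or `bigquery-public-data.` fail the charset checks, so they surface as config errors rather than reaching introspection.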
|
|
||||||
|
|
||||||
### Billing is never the dataset's project
|
|
||||||
|
|
||||||
The BigQuery client is still constructed with `projectId = credentials.project_id`
|
|
||||||
(`getClient()`, `:487`–`:495`), and `createQueryJob` always bills there. Only the
|
|
||||||
*introspection* surfaces switch to the per-dataset project. Cross-project reads in
|
|
||||||
a `FROM` clause already bill to the caller — unchanged and already proven.
|
|
||||||
|
|
||||||
### Dataset identity downstream is `(catalog, db)`
|
|
||||||
|
|
||||||
Scanned tables are keyed by `(catalog, db, name)` throughout
|
|
||||||
(`context/scan/table-ref.ts`; `context/scan/warehouse-catalog.ts:107`). Because
|
|
||||||
the table `catalog` now holds the dataset's own project, two foreign projects that
|
|
||||||
each host an `austin_311` dataset remain distinct with no extra work, provided the
|
|
||||||
snapshot's `scope` / `metadata` also preserve the project (Requirement 6).
|
|
||||||
|
|
||||||
### Setup-wizard scope: accept, don't discover
|
|
||||||
|
|
||||||
The connector's region-scoped `listTables` (`:453`) is consumed **only** by the
|
|
||||||
`ktx setup` wizard's table-selection step (`setup-databases.ts`); the
|
|
||||||
ingest / `discover_data` path reads persisted snapshot JSON via
|
|
||||||
`WarehouseCatalogService.listTables`, not the connector method. The wizard is not
|
|
||||||
expected to enumerate foreign datasets (you can't list a project you don't own).
|
|
||||||
A `project.dataset` value hand-entered at the dataset prompt, or written into
|
|
||||||
`ktx.yaml`, must be accepted, validated, and introspected. See *Non-goals* for the
|
|
||||||
region caveat that follows from this.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
### R1 — Accept and parse `project.dataset` at the config boundary
|
|
||||||
|
|
||||||
`datasetIds()` / `bigQueryConnectionConfigFromConfig` resolve each
|
|
||||||
`dataset_ids` and `dataset_id` entry into a canonical `{ project, dataset }` pair
|
|
||||||
per the parsing rule above, defaulting `project` to `credentials.project_id` when
|
|
||||||
unprefixed. A malformed entry (more than one `.`, an empty project or dataset
|
|
||||||
segment, or a project/dataset that fails identifier validation) raises a clear
|
|
||||||
error at resolution time that names the connection id.
|
|
||||||
|
|
||||||
### R2 — Introspect each dataset in its own project
|
|
||||||
|
|
||||||
`introspectDataset` resolves the dataset via the **dataset's** project —
|
|
||||||
`client.dataset(datasetId, { projectId })` — for `getTables()` and each
|
|
||||||
`tableRef.get()`. This requires extending the `KtxBigQueryClient.dataset` port to
|
|
||||||
accept the project (e.g. `dataset(id, projectId)` / `dataset(id, { projectId })`)
|
|
||||||
and forwarding it from `DefaultBigQueryClientFactory`.
|
|
||||||
|
|
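As a rough sketch of what threading the project through the port could look like, the following uses hypothetical names (`DatasetPort`, `RecordingFakeClient`, `resolveDatasets`) rather than the real `KtxBigQueryClient` interface. The recording fake mirrors the style of test double the acceptance criteria call for; it does not touch `@google-cloud/bigquery`.

```typescript
// Hypothetical shapes; the real port and its return type may differ.
interface DatasetHandle {
  projectId: string;
  datasetId: string;
}

interface DatasetPort {
  // The project is passed explicitly instead of being implied by the client.
  dataset(datasetId: string, projectId: string): DatasetHandle;
}

// A recording fake: remembers which (dataset, project) pairs were requested.
class RecordingFakeClient implements DatasetPort {
  readonly calls: Array<[string, string]> = [];

  dataset(datasetId: string, projectId: string): DatasetHandle {
    this.calls.push([datasetId, projectId]);
    return { projectId, datasetId };
  }
}

// Each resolved pair is looked up under its own project, never the billing one.
function resolveDatasets(
  client: DatasetPort,
  refs: ReadonlyArray<{ project: string; dataset: string }>,
): DatasetHandle[] {
  return refs.map((ref) => client.dataset(ref.dataset, ref.project));
}
```

A bare entry reaches this function already defaulted to the billing project, so the call site needs no special case for it.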
||||||
### R3 — Label table `catalog` with the dataset's project
|
|
||||||
|
|
||||||
Every table produced by `introspectDataset` is labeled `catalog: <dataset's
|
|
||||||
project>` (not the billing project), and the introspection-failure warning
|
|
||||||
metadata (`object` / `catalog`) likewise reflects the dataset's project. This is
|
|
||||||
what makes downstream sample/distinct-value/read queries resolve.
|
|
||||||
|
|
||||||
### R4 — Primary-key discovery targets the dataset's project
|
|
||||||
|
|
||||||
The `primaryKeys` `INFORMATION_SCHEMA.TABLE_CONSTRAINTS` /
|
|
||||||
`KEY_COLUMN_USAGE` SQL is built against
|
|
||||||
`` `<dataset's project>.<datasetId>.INFORMATION_SCHEMA…` ``. (Both INFORMATION_SCHEMA
|
|
||||||
views are dataset-qualified and therefore region-independent.) The existing
|
|
||||||
soft-fail-on-denied behavior (`tryConstraintQuery`, scan warning) is preserved.
|
|
||||||
|
|
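A sketch of how the dataset-qualified primary-key SQL might be assembled. The function names and the exact join are assumptions for illustration; the quoting follows the `` `<project>.<dataset>.INFORMATION_SCHEMA…` `` form written in R4, and the connector's real query may select different columns.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Builds `project.dataset.INFORMATION_SCHEMA.<view>` wrapped in backticks.
function informationSchemaView(ref: DatasetRef, view: string): string {
  return '`' + ref.project + '.' + ref.dataset + '.INFORMATION_SCHEMA.' + view + '`';
}

// Hypothetical query: primary-key columns per table, in key order.
function primaryKeySql(ref: DatasetRef): string {
  return [
    'SELECT kcu.table_name, kcu.column_name, kcu.ordinal_position',
    'FROM ' + informationSchemaView(ref, 'TABLE_CONSTRAINTS') + ' AS tc',
    'JOIN ' + informationSchemaView(ref, 'KEY_COLUMN_USAGE') + ' AS kcu',
    '  ON tc.constraint_name = kcu.constraint_name',
    "WHERE tc.constraint_type = 'PRIMARY KEY'",
    'ORDER BY kcu.table_name, kcu.ordinal_position',
  ].join('\n');
}
```

Because the project comes from the resolved pair, a bare entry produces the same SQL text as before the change.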
||||||
### R5 — `listTables` lists each dataset in its own project
|
|
||||||
|
|
||||||
`listTables` returns rows labeled `catalog: <that dataset's project>` and queries
|
|
||||||
each referenced project's region `INFORMATION_SCHEMA.TABLES`. Because a connection
|
|
||||||
can now span projects, it queries per distinct project rather than assuming one.
|
|
||||||
(This is the setup-wizard surface — see the cross-region caveat in *Non-goals*.)
|
|
||||||
|
|
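The "query per distinct project" step in R5 amounts to a grouping pass over the resolved pairs. A minimal sketch, with `groupDatasetsByProject` as a hypothetical helper name:

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Groups bare dataset names under the project that hosts them, preserving
// first-seen project order so the resulting queries run deterministically.
function groupDatasetsByProject(refs: ReadonlyArray<DatasetRef>): Map<string, string[]> {
  const byProject = new Map<string, string[]>();
  for (const ref of refs) {
    const datasets = byProject.get(ref.project) ?? [];
    datasets.push(ref.dataset);
    byProject.set(ref.project, datasets);
  }
  return byProject;
}
```

Each map entry would then drive one region-scoped `INFORMATION_SCHEMA.TABLES` query, with the rows labeled by that entry's project. When every entry is bare, the map has a single key (the billing project), which is why the single-project case stays a no-op.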
||||||
### R6 — Snapshot scope and metadata reflect multiple projects
|
|
||||||
|
|
||||||
`introspect`'s returned snapshot keeps `metadata.project_id` = the **billing**
|
|
||||||
project, but `scope.catalogs` becomes the **distinct set of dataset projects**
|
|
||||||
actually introspected. `scope.datasets` / `metadata.datasets` must stay
|
|
||||||
unambiguous when two projects share a dataset name (e.g. carry the qualified
|
|
||||||
`project.dataset`, or otherwise preserve the project). The scoped table-name
|
|
||||||
lookup that today passes `catalog: this.resolved.projectId` (`:359`) must pass
|
|
||||||
each dataset's own project so `tableScope` / `enabled_tables` filtering still
|
|
||||||
matches.
|
|
||||||
|
|
||||||
### R7 — `testConnection` resolves foreign datasets
|
|
||||||
|
|
||||||
`testConnection` validates each configured dataset via its own project
|
|
||||||
(`client.dataset(datasetId, { projectId }).get()`), so a connection pointing only
|
|
||||||
at foreign datasets reports success rather than a spurious `404`.
|
|
||||||
|
|
||||||
### R8 — Billing unchanged; bare dataset is a strict no-op
|
|
||||||
|
|
||||||
`createQueryJob` continues to bill in `credentials.project_id`. A connection whose
|
|
||||||
`dataset_ids` are all bare (no `project.` prefix) behaves **exactly** as before:
|
|
||||||
same resolved project, same `catalog` labels, same INFORMATION_SCHEMA targets, no
|
|
||||||
behavioral change.
|
|
||||||
|
|
||||||
### R9 — `getTableRowCount` honors the parsed entry
|
|
||||||
|
|
||||||
`getTableRowCount`'s default-dataset handling (`:431`, today
|
|
||||||
`this.resolved.datasetIds[0]`) resolves through the canonical pair so a foreign
|
|
||||||
default dataset is introspected in its own project.
|
|
||||||
|
|
||||||
### R10 — Docs reflect the qualified form
|
|
||||||
|
|
||||||
Document that a BigQuery `dataset_ids` / `dataset_id` entry may be written
|
|
||||||
`project.dataset` to introspect a dataset hosted in another project (billing stays
|
|
||||||
in `credentials.project_id`). Update the BigQuery rows/examples in
|
|
||||||
`docs-site/content/docs/configuration/ktx-yaml.mdx` and
|
|
||||||
`docs-site/content/docs/integrations/primary-sources.mdx` (and the dataset-scope
|
|
||||||
note in `docs-site/content/docs/cli-reference/ktx-setup.mdx`). Keep examples
|
|
||||||
copy-pasteable and follow the `fumadocs-mdx-structure` skill.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
1. **Foreign single-project introspection.** With credentials in project
|
|
||||||
`ktx-spider2-lite` and `dataset_ids: ['bigquery-public-data.austin_311']`,
|
|
||||||
`ktx ingest <conn>` introspects the tables, enriches, and samples values;
|
|
||||||
`discover_data` / `dictionary_search` return them. Tables are labeled
|
|
||||||
`catalog: 'bigquery-public-data'`.
|
|
||||||
2. **Multi-project connection.** `dataset_ids: ['bigquery-public-data.x',
|
|
||||||
'other-project.y']` introspects **both**, each under its own project; the
|
|
||||||
snapshot's `scope.catalogs` contains both projects.
|
|
||||||
3. **Cross-project query still bills locally.** `sql_execution` of a
|
|
||||||
fully-qualified `project.dataset.table` query runs and bills in
|
|
||||||
`credentials.project_id`.
|
|
||||||
4. **Same dataset name, two projects.** `['proj-a.shared', 'proj-b.shared']`
|
|
||||||
yields two distinct dataset groups; tables do not collide.
|
|
||||||
5. **No regression.** `dataset_ids: ['my_dataset']` (or singular `dataset_id`)
|
|
||||||
behaves exactly as before — resolved under `credentials.project_id`, same
|
|
||||||
`catalog` labels and INFORMATION_SCHEMA targets.
|
|
||||||
6. **Malformed entry fails clearly.** `dataset_ids: ['proj.ds.table']` (or an
|
|
||||||
empty segment) raises a config error naming the connection, not a `404` at
|
|
||||||
scan time.
|
|
||||||
7. **Test coverage** (extend `packages/cli/test/connectors/bigquery/connector.test.ts`,
|
|
||||||
using the existing fake `clientFactory` harness):
|
|
||||||
- the fake `dataset()` is called with the dataset's project for a prefixed
|
|
||||||
entry, and with the billing project for a bare entry;
|
|
||||||
- a prefixed entry yields tables with `catalog: '<dataset project>'`;
|
|
||||||
- a mixed two-project `dataset_ids` introspects both;
|
|
||||||
- `bigQueryConnectionConfigFromConfig` rejects a multi-dot / empty-segment
|
|
||||||
entry;
|
|
||||||
- the existing single-project tests still pass unchanged.
|
|
||||||
|
|
||||||
## Non-goals
|
|
||||||
|
|
||||||
- **Foreign-dataset discovery in the setup wizard.** The wizard does not
|
|
||||||
enumerate datasets in projects the credentials don't own; users supply
|
|
||||||
`project.dataset` explicitly (scope decision A).
|
|
||||||
- **Cross-region `listTables`.** `listTables`' region-scoped
|
|
||||||
`region-<location>.INFORMATION_SCHEMA.TABLES` query uses the connection-level
|
|
||||||
`location`; a foreign dataset in a *different* region than the connection's
|
|
||||||
`location` will not be listed by that wizard-facing query. This does **not**
|
|
||||||
affect ingest/`discover_data`, whose introspection path
|
|
||||||
(`introspectDataset` REST metadata + dataset-qualified PK INFORMATION_SCHEMA) is
|
|
||||||
region-independent. A per-dataset region knob is a separate spec if ever needed.
|
|
||||||
- **Domain-scoped legacy project ids** containing `:` (e.g. `example.com:proj`),
|
|
||||||
already unsupported by `normalizeBigQueryProjectId`.
|
|
||||||
- **A separate billing/introspection config field** — explicitly rejected above.
|
|
||||||
|
|
||||||
## Implementation orientation
|
|
||||||
|
|
||||||
Pointers from exploration; line numbers may have drifted, and the implementer owns
|
|
||||||
the design.
|
|
||||||
|
|
||||||
- `packages/cli/src/connectors/bigquery/connector.ts`
|
|
||||||
- `datasetIds()` (`:163`) and `bigQueryConnectionConfigFromConfig` (`:278`) —
|
|
||||||
parse + canonicalize (R1); change `KtxBigQueryResolvedConnectionConfig.datasetIds`
|
|
||||||
shape.
|
|
||||||
- `KtxBigQueryClient.dataset` port (`:100`–`:110`) and
|
|
||||||
`DefaultBigQueryClientFactory.dataset` (`:130`–`:135`) — thread `projectId`
|
|
||||||
(R2). `getClient()` (`:487`) keeps the billing project (R8).
|
|
||||||
- `introspectDataset` (`:544`) — `dataset(id, { projectId })`, table `catalog`
|
|
||||||
+ warning metadata (R2, R3).
|
|
||||||
- `primaryKeys` (`:591`) — dataset-qualified INFORMATION_SCHEMA (R4).
|
|
||||||
- `listTables` (`:453`) — per-project region INFORMATION_SCHEMA + row catalog
|
|
||||||
(R5).
|
|
||||||
- `introspect` (`:352`) — `scope.catalogs`, `scope.datasets`, scoped-name lookup
|
|
||||||
(`:359`) (R6).
|
|
||||||
- `testConnection` (`:339`) (R7); `getTableRowCount` (`:431`) (R9).
|
|
||||||
- `packages/cli/src/connectors/bigquery/live-database-introspection.ts` — wraps
|
|
||||||
`introspect`; no separate change needed (it inherits the fix).
|
|
||||||
- `packages/cli/src/context/connections/bigquery-identifiers.ts` —
|
|
||||||
`normalizeBigQueryProjectId` is the project-segment validator.
|
|
||||||
- `packages/cli/src/context/connections/dialect-helpers.ts` /
|
|
||||||
`connectors/bigquery/dialect.ts` — three-part naming; no change, but this is
|
|
||||||
*why* R3 matters.
|
|
||||||
- After implementing, rebuild and re-link so the playground picks it up:
|
|
||||||
`pnpm run build && pnpm run link:dev`. Run
|
|
||||||
`pnpm --filter @kaelio/ktx run type-check` and the connector test suite.
|
|
||||||
|
|
||||||
## Benchmark context (motivation, not a requirement — do not encode benchmark specifics)
|
|
||||||
|
|
||||||
Spider 2.0-Lite's **BigQuery slice (~205 questions)** cannot otherwise be served
|
|
||||||
faithfully: every one of its ~74 logical databases groups datasets hosted in
|
|
||||||
foreign public projects (`bigquery-public-data`, `isb-cgc-bq`,
|
|
||||||
`data-to-insights`, …), never in a project we own. Query execution already works
|
|
||||||
cross-project; ktx-only *discovery* is the sole blocker, and it is blocked exactly
|
|
||||||
because the connector can't introspect a foreign-hosted dataset. Of 74 BQ
|
|
||||||
databases only **one** spans more than one source project, so "let `dataset_ids`
|
|
||||||
carry `project.dataset` and introspect each in its own project" covers the
|
|
||||||
benchmark and the general case alike. None of these project names belong in the
|
|
||||||
code — they are derived from the user's own `dataset_ids` input.
|
|
||||||
|
|
||||||
## Implementation notes
|
|
||||||
|
|
||||||
Implemented on branch `write-feature-spec-wiki`. The whole change is contained in
|
|
||||||
the BigQuery connector, its identifier helpers, the connector test suite, and three
|
|
||||||
docs pages.
|
|
||||||
|
|
||||||
**Config boundary (R1).** Added `normalizeBigQueryDatasetId`
|
|
||||||
(`packages/cli/src/context/connections/bigquery-identifiers.ts`, charset
|
|
||||||
`[A-Za-z0-9_]`) next to the existing project/region validators. In
|
|
||||||
`connectors/bigquery/connector.ts`, a single `parseBigQueryDatasetEntry(entry,
|
|
||||||
defaultProject, connectionId)` parses one entry by splitting on `.`: zero dots →
|
|
||||||
bare dataset in `defaultProject`; one dot → `project.dataset` (each segment
|
|
||||||
validated; empty segment throws); two or more dots → throws. `resolveDatasetRefs`
|
|
||||||
resolves `env:`/`file:` references first, trims/filters empties, then parses each.
|
|
||||||
`bigQueryConnectionConfigFromConfig` calls it with the billing `project_id` as the
|
|
||||||
default, so the canonical pair list is produced once at the boundary.
|
|
||||||
`KtxBigQueryResolvedConnectionConfig.datasetIds` changed from `string[]` to the new
|
|
||||||
`BigQueryDatasetRef[]` (`{ project, dataset }`). All errors name
|
|
||||||
`connections.<id>.dataset_ids entry "<entry>"`.
|
|
||||||
|
|
||||||
**Client port (R2).** `KtxBigQueryClient.dataset` now takes
|
|
||||||
`(datasetId, projectId)`; `DefaultBigQueryClientFactory` forwards
|
|
||||||
`client.dataset(datasetId, { projectId })` (`@google-cloud/bigquery` `DatasetOptions.projectId`).
|
|
||||||
`getClient()` still constructs the client with the **billing** `project_id`, so
|
|
||||||
`createQueryJob` bills locally regardless of the dataset's project (R8, acceptance 3).
|
|
||||||
|
|
||||||
**Per-dataset introspection (R3–R7, R9).** Every introspection site reads the
|
|
||||||
resolved pair: `introspectDataset(ref, …)` resolves `dataset(ref.dataset, ref.project)`
|
|
||||||
and labels tables (and the introspection-failure warning, via `tryIntrospectObject`'s
|
|
||||||
`catalog.db.object`) with `ref.project`; `primaryKeys(ref)` builds dataset-qualified
|
|
||||||
`` `<project>.<dataset>.INFORMATION_SCHEMA…` `` SQL; `testConnection` validates each
|
|
||||||
dataset under its own project; `getTableRowCount`'s default resolves through the first
|
|
||||||
pair. `introspect` sets `scope.catalogs` to the distinct set of dataset projects and
|
|
||||||
keeps `metadata.project_id` = billing. `scope.datasets` / `metadata.datasets` use a
|
|
||||||
`qualifiedDatasetLabel` helper — bare in the billing project (so the single-project
|
|
||||||
snapshot is byte-for-byte unchanged), `project.dataset` otherwise (so two projects with
|
|
||||||
the same dataset name stay distinct, R6/acceptance 4).
|
|
||||||
|
|
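The labeling rule described above (bare in the billing project, qualified otherwise) can be sketched as follows. `qualifiedDatasetLabel` is the helper name from these notes, but its signature here, and the `snapshotScope` wrapper, are assumptions made for illustration.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Bare label inside the billing project, `project.dataset` everywhere else.
function qualifiedDatasetLabel(ref: DatasetRef, billingProject: string): string {
  return ref.project === billingProject ? ref.dataset : ref.project + '.' + ref.dataset;
}

// Hypothetical wrapper showing how scope fields could be derived from the pairs.
function snapshotScope(refs: ReadonlyArray<DatasetRef>, billingProject: string) {
  return {
    // Distinct dataset projects, in first-seen order.
    catalogs: Array.from(new Set(refs.map((ref) => ref.project))),
    datasets: refs.map((ref) => qualifiedDatasetLabel(ref, billingProject)),
  };
}
```

With `proj_a.shared` and `proj_b.shared` the two labels differ, while a connection made only of bare entries yields the same `datasets` array it did before.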
||||||
**`listTables` (R5).** Split into `listTables` (parse override entries, group by
|
|
||||||
project) and `listTablesInProject(project, region, datasets?)`. With no override it
|
|
||||||
lists the billing project's region (unchanged); with an override it runs one
|
|
||||||
region-`INFORMATION_SCHEMA.TABLES` query per distinct project, filtered to that
|
|
||||||
project's bare datasets, and labels rows with that project. The existing single-region
|
|
||||||
test is unchanged (bare entries collapse to one billing-project query).
|
|
||||||
|
|
||||||
**Docs (R10).** Added a "Cross-project datasets" subsection to
|
|
||||||
`integrations/primary-sources.mdx` (qualified-entry example + the setup/region caveats),
|
|
||||||
plus pointers from `configuration/ktx-yaml.mdx` and `cli-reference/ktx-setup.mdx`.
|
|
||||||
|
|
||||||
**Tests.** Extended `test/connectors/bigquery/connector.test.ts`: parse-to-pairs and
|
|
||||||
malformed-entry rejection (`proj.ds.table`, `proj.`, `.ds`); a foreign-only connection
|
|
||||||
calls `dataset('austin_311', 'bigquery-public-data')`, labels tables
|
|
||||||
`catalog: 'bigquery-public-data'`, builds the client with the billing project, and keeps
|
|
||||||
`metadata.project_id` local; a mixed `['bigquery-public-data.austin_311', 'analytics']`
|
|
||||||
connection introspects both under their own projects; and `['proj_a.shared',
|
|
||||||
'proj_b.shared']` stays distinct. The internal `datasetIds`-shape assertion was updated
|
|
||||||
to the pair list; all pre-existing behavioral tests pass unchanged.
|
|
||||||
|
|
||||||
**Verification.** `pnpm --filter @kaelio/ktx run type-check`, the connector suite
|
|
||||||
(18 tests), `test/setup-databases.test.ts` + `bigquery-identifiers.test.ts`,
|
|
||||||
`pnpm run build`, `pnpm run dead-code` (Biome + Knip default + production),
|
|
||||||
`pnpm run link:dev` (`ktx-dev` → 0.12.0), and `pre-commit` on the changed files all
|
|
||||||
pass. Acceptance criteria 1–4 are exercised by unit tests with the fake client factory;
|
|
||||||
criteria 5–6 by unit tests; criterion 3 (cross-project query bills locally) is
|
|
||||||
structurally guaranteed (single billing client) and asserted via the `createClient`
|
|
||||||
project. End-to-end ingest against live `bigquery-public-data` was not run here (no live
|
|
||||||
credentials in this worktree); the `link:dev` binary is ready for the playground agent to
|
|
||||||
validate.
|
|
||||||
|
|
||||||
**No deviations from the spec design.** The only judgment call: `scope.datasets`
|
|
||||||
renders bare-in-billing / qualified-otherwise rather than always-qualified, chosen to
|
|
||||||
satisfy both the no-regression requirement (R8/acceptance 5) and the disambiguation
|
|
||||||
requirement (R6/acceptance 4) with one unambiguous, dot-delimited form.
|
|
||||||
|
|
@ -1,471 +0,0 @@
|
||||||
# Durable, resumable, bounded relationship detection during ingest enrichment
|
|
||||||
|
|
||||||
> Refined spec. Intake draft: `todo/19-durable-bounded-relationship-detection.md`.
|
|
||||||
>
|
|
||||||
> **Scope: make the expensive part of ingest enrichment survive an interrupted
|
|
||||||
> relationship stage.** Today the paid LLM descriptions + embeddings only become
|
|
||||||
> durable and queryable after the slowest, most-killable, least-valuable stage
|
|
||||||
> (relationship detection) also finishes. This spec moves the persistence boundary
|
|
||||||
> to the cost boundary, makes stage resume work across runs, and bounds + observes
|
|
||||||
> the one open-ended stage — the durability companion to spec 16 (bounded query
|
|
||||||
> execution), which this spec composes with rather than replaces.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
Three compounding failure modes, all confirmed in the current code, share one root
|
|
||||||
cause: **the three enrichment stages are treated as a single atomic unit for
|
|
||||||
persistence, identity, and bounding, even though they differ radically in cost,
|
|
||||||
durability value, runtime, and likelihood of being killed.**
|
|
||||||
|
|
||||||
`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:472`) runs three stages
|
|
||||||
in a fixed order through `runEnrichmentStage` (`:413`):
|
|
||||||
|
|
||||||
| stage | order | cost | durability value | runtime on a large schema | likely to be killed |
|
|
||||||
|-------|-------|------|------------------|---------------------------|---------------------|
|
|
||||||
| `descriptions` (`:524`) | 1st | high — one paid LLM call per table | high | minutes | low |
|
|
||||||
| `embeddings` (`:553`) | 2nd | medium | high | seconds–minutes | low |
|
|
||||||
| `relationships` (`:587`) | 3rd | low — best-effort joins | low | **minutes, silent** | **high** |
|
|
||||||
|
|
||||||
The slowest, most-killable, least-valuable stage runs **last**, and it gates the
|
|
||||||
durability of the two expensive stages held in memory before it.
|
|
||||||
|
|
||||||
### 1. Enrichment is lost if relationship detection is interrupted
|
|
||||||
|
|
||||||
The queryable artifact agents search and execute against is the `_schema` manifest
|
|
||||||
YAML (`semantic-layer/<connectionId>/_schema/*.yaml`). It is written **twice**:
|
|
||||||
|
|
||||||
- bare (native column comments only) early, at `local-scan.ts:473`
|
|
||||||
(`writeLocalScanManifestShards`), before enrichment runs; and
|
|
||||||
- rewritten **with AI descriptions + accepted joins** by
|
|
||||||
`writeLocalScanEnrichmentArtifacts` (`local-enrichment-artifacts.ts:310`), called
|
|
||||||
from `local-scan.ts:510` **after** `runLocalScanEnrichment` returns — i.e. after
|
|
||||||
all three stages.
|
|
||||||
|
|
||||||
So the descriptions and embeddings reach the queryable layer only via that single
|
|
||||||
terminal write. If the process is killed/crashes/times out **during** the
|
|
||||||
`relationships` stage, `runLocalScanEnrichment` never returns, the terminal write
|
|
||||||
never runs, and the in-memory descriptions + embeddings are discarded — the
|
|
||||||
`_schema` retains only the bare native comments from the `:473` write.
|
|
||||||
|
|
||||||
Empirically (intake draft): ingesting a 95-table BigQuery dataset produced full
|
|
||||||
descriptions + embeddings (progress reached "Building embeddings 17/17"), then the
|
|
||||||
relationship stage ran silently past a supervising deadline and was killed; the
|
|
||||||
persisted `_schema` had **0** AI descriptions. The most expensive work is the most
|
|
||||||
likely to be thrown away.
|
|
||||||
|
|
||||||
> A stage-state store (below) does save each completed stage's output to an
|
|
||||||
> internal SQLite cache as the stage finishes, so the descriptions do survive in
|
|
||||||
> the *resume cache*. They are simply never **promoted** to the queryable `_schema`
|
|
||||||
> until the terminal write. The data survives somewhere the agent cannot query, and
|
|
||||||
> (per failure mode 2) cannot be reused on the next run either.
|
|
||||||
|
|
||||||
### 2. Re-running does not resume — it re-spends
|
|
||||||
|
|
||||||
`runEnrichmentStage` resolves a completed stage with
|
|
||||||
`findCompletedStage({ runId, stage, inputHash })` (`local-enrichment.ts:427`), and
|
|
||||||
the store keys on **`runId`**: `SqliteLocalScanEnrichmentStateStore` declares
|
|
||||||
`PRIMARY KEY (run_id, stage)` and filters lookups by `run_id`
|
|
||||||
(`sqlite-local-enrichment-state-store.ts:83,91–115`). `runId` is minted fresh per
|
|
||||||
ingest invocation (`record.runId`). The cache therefore only resolves *within* one
|
|
||||||
run; re-running an interrupted ingest gets a new `runId`, misses every cached
|
|
||||||
stage, and **recomputes descriptions + embeddings from scratch** — re-paying for
|
|
||||||
LLM work that already succeeded.
|
|
||||||
|
|
||||||
The store already computes and persists `inputHash` next to `runId` —
|
|
||||||
a stable `sha256` of `{ snapshot, mode, detectRelationships, providerIdentity,
|
|
||||||
relationshipSettings }` (`enrichment-state.ts:78`). The correct content key is
|
|
||||||
already on the row; the lookup just uses the volatile column. This is a keying
|
|
||||||
defect, not a missing capability.
|
|
||||||
|
|
||||||
### 3. Relationship detection is unobservable and unbounded
|
|
||||||
|
|
||||||
`discoverKtxRelationships` (`context/scan/relationship-discovery.ts:218`) profiles a
|
|
||||||
row sample of **every enabled table** (`profileKtxRelationshipSchema`,
|
|
||||||
`relationship-profiling.ts:320` — one sampled query per table at
|
|
||||||
`profileConcurrency`, default 4), validates candidate joins
|
|
||||||
(`relationship-validation.ts:237` — one coverage query per candidate), and detects
|
|
||||||
composite keys (`relationship-composite-candidates.ts:515` — per-table plus
|
|
||||||
cross-table queries). None of the controls the rest of the scan pipeline relies on
|
|
||||||
were ever wired into this stack:
|
|
||||||
|
|
||||||
- **No progress.** `discoverKtxRelationships` does not accept a progress port; the
|
|
||||||
caller can only emit start/end around it (`local-enrichment.ts:600,611` —
|
|
||||||
`update(0, 'Detecting relationships')` … `update(1, 'found N')`). Minutes of
|
|
||||||
silence between.
|
|
||||||
- **No honored cancellation.** `KtxScanContext.signal` exists on the contract
|
|
||||||
(`types.ts`) but **no sub-stage reads it**.
|
|
||||||
- **No time budget.** Validation has a *count* budget (`validationBudget`, default
|
|
||||||
`min(2 × tableCount, 1000)`); profiling and composite detection have none. On a
|
|
||||||
schema with hundreds–thousands of tables, profiling is O(tables) silent queries
|
|
||||||
with no internal stop condition.
|
|
||||||
|
|
||||||
A supervisor watching for liveness cannot tell a slow-but-working profile from a
|
|
||||||
true hang, and nothing inside the stage will voluntarily stop — so on a very large
|
|
||||||
schema it runs far past any reasonable deadline and is killed (which, via failure
|
|
||||||
mode 1, takes the descriptions with it).
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
Any context layer that enriches a real warehouse with paid LLM work must make that
|
|
||||||
work durable the instant it is produced, resume it across process restarts without
|
|
||||||
re-paying, and bound the open-ended profiling stage so a large catalog cannot hang
|
|
||||||
ingest indefinitely. A data team ingesting a 500-table production warehouse over a
|
|
||||||
flaky connection, a rate-limited LLM budget, or a CI step with a wall-clock limit
|
|
||||||
hits all three failure modes regardless of any benchmark. This is general
|
|
||||||
durability and cost hygiene for the ingest pipeline; the benchmark only made it
|
|
||||||
acute at scale.
|
|
||||||
|
|
||||||
## Design decisions (resolved during refinement)
|
|
||||||
|
|
||||||
These resolve ambiguities the intake draft left open. They constrain the
|
|
||||||
implementer; the exact code is theirs (requirement-level, per the specs README).
|
|
||||||
|
|
||||||
### D1 — Checkpoint queryable artifacts at the cost boundary, before relationships
|
|
||||||
|
|
||||||
As soon as the last non-relationship stage completes — `embeddings` when an
|
|
||||||
embedding provider is configured, otherwise `descriptions` — persist the
|
|
||||||
descriptions + embeddings into the **queryable** `_schema` manifest (and the raw
|
|
||||||
`descriptions.json` / `embeddings.json` enrichment artifacts), **before** the
|
|
||||||
`relationships` stage runs. The relationship stage then writes its joins on top: the
|
|
||||||
manifest builder already re-reads and preserves existing descriptions and
|
|
||||||
manual/inferred joins on rewrite (`loadExistingManifestState`,
|
|
||||||
`local-enrichment-artifacts.ts:196`), so the second write is additive, not
|
|
||||||
destructive.
|
|
||||||
|
|
||||||
Net invariant: **the descriptions + embeddings are always durable and queryable the
|
|
||||||
moment they are computed**, even if relationship detection then fails, is
|
|
||||||
interrupted, is budget-truncated, or is skipped. A failed/partial/skipped
|
|
||||||
relationship stage degrades to "no joins" or "partial joins" — **never** to "no
|
|
||||||
descriptions." This is exactly the guarantee the current terminal-write ordering
|
|
||||||
violates.
|
|
||||||
|
|
||||||
The bare `:473` manifest write stays — it is the queryable schema for the
|
|
||||||
no-providers / enrichment-disabled path. The checkpoint is an additional write that
|
|
||||||
runs only when enrichment produced descriptions.
|
|
||||||
|
|
||||||
> Orientation (the implementer owns the seam): the lowest-coupling shape is a
|
|
||||||
> checkpoint hook — `runLocalScanEnrichment` invokes a caller-supplied callback once
|
|
||||||
> the last non-relationship stage completes, and `local-scan.ts` supplies a callback
|
|
||||||
> that calls the existing `writeLocalScanEnrichmentArtifacts` for the
|
|
||||||
> descriptions + embeddings + manifest only (no generated joins yet). The final
|
|
||||||
> write after the relationship stage proceeds as today. Relationship-specific
|
|
||||||
> artifacts (`relationships.json`, `relationship-profile.json`,
|
|
||||||
> `relationship-diagnostics.json`) are written by the final/relationship write, not
|
|
||||||
> the checkpoint, so the checkpoint never emits misleading empty relationship
|
|
||||||
> diagnostics.
|
|
||||||
>
|
|
||||||
> Rejected alternative: move all artifact writing inside `runLocalScanEnrichment`
|
|
||||||
> (inject the file store / project). That couples the enrichment module to
|
|
||||||
> persistence for no gain — the writer already lives in `local-scan.ts` and the
|
|
||||||
> checkpoint needs only a one-line hook, not a relocation.
|
|
||||||
|
|
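A minimal sketch of the checkpoint-hook shape from D1. Everything here is hypothetical (`runEnrichmentSketch`, `Checkpoint`, the stage callbacks), and the stages are synchronous to keep the example short, whereas the real pipeline is asynchronous.

```typescript
interface Checkpoint {
  descriptions: string[];
  embeddings: number[][];
}

interface EnrichmentOutcome extends Checkpoint {
  joins: string[];
}

interface Stages {
  describe: () => string[];
  embed?: () => number[][]; // absent when no embedding provider is configured
  detectRelationships: () => string[];
}

function runEnrichmentSketch(
  stages: Stages,
  onCheckpoint: (checkpoint: Checkpoint) => void,
): EnrichmentOutcome {
  const descriptions = stages.describe();
  const embeddings = stages.embed ? stages.embed() : [];

  // Cost boundary: persist the expensive output before the killable stage starts.
  onCheckpoint({ descriptions, embeddings });

  let joins: string[] = [];
  try {
    joins = stages.detectRelationships();
  } catch {
    // A failed relationship stage degrades to "no joins", never "no descriptions".
    joins = [];
  }
  return { descriptions, embeddings, joins };
}
```

The `try`/`catch` only covers in-process failures. A hard kill is survived because the callback has already run by the time the relationship stage begins.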
||||||
### D2 — Resume by content identity, not by `runId`
|
|
||||||
|
|
||||||
Re-key completed-stage resolution on **`(connectionId, stage, inputHash)`**,
|
|
||||||
independent of `runId`, so a re-run with an unchanged schema and config resumes the
|
|
||||||
finished `descriptions` / `embeddings` stages from cache and re-runs only what
|
|
||||||
actually failed. `inputHash` is already the content fingerprint; `connectionId`
|
|
||||||
scopes it to the right source. When several rows share a content identity (one per
|
|
||||||
prior run), the most recent `updatedAt` wins.
|
|
||||||
|
|
||||||
`runId` stays on the stored row for diagnostics and for `listRunStages`, but leaves
|
|
||||||
the uniqueness/lookup key.
|
|
||||||
|
|
||||||
The state store is a **disposable local resume cache** (`.ktx` local state,
|
|
||||||
regenerable from a fresh ingest). Re-key it with **no migration bridge** — recreate
|
|
||||||
the table if its on-disk shape differs from the new `(connection_id, stage,
|
|
||||||
input_hash)` key, consistent with ktx's no-backward-compatibility policy. Losing the
|
|
||||||
old cache only means one ingest cannot resume; it never corrupts a queryable
|
|
||||||
artifact.
|
|
||||||
|
|
||||||
> Rejected alternative: include `syncId` or `mode` in the key. `mode` and the rest
|
|
||||||
> are already folded into `inputHash`; adding them again would only narrow the key
|
|
||||||
> and re-break cross-run resume when an incidental field differs.
|
|
||||||
|
|
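To make the re-keying in D2 concrete, here is an in-memory sketch of a lookup by content identity. `StageRow` and `findCompletedStageByContent` are hypothetical names; the real store is SQLite-backed and would express the same rule as a query.

```typescript
interface StageRow {
  connectionId: string;
  stage: string;
  inputHash: string;
  runId: string; // kept for diagnostics only; not part of the lookup key
  updatedAt: number; // epoch milliseconds
}

interface ContentKey {
  connectionId: string;
  stage: string;
  inputHash: string;
}

// Returns the most recently updated row matching the content identity,
// regardless of which run produced it.
function findCompletedStageByContent(
  rows: ReadonlyArray<StageRow>,
  key: ContentKey,
): StageRow | undefined {
  let best: StageRow | undefined;
  for (const row of rows) {
    const matches =
      row.connectionId === key.connectionId &&
      row.stage === key.stage &&
      row.inputHash === key.inputHash;
    if (matches && (best === undefined || row.updatedAt > best.updatedAt)) {
      best = row;
    }
  }
  return best;
}
```

A second ingest with a fresh `runId` but an unchanged `inputHash` now hits the cached `descriptions` row instead of recomputing it.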
||||||
### D3 — Make the relationship stage observable and bounded
|
|
||||||
|
|
||||||
Thread three things the rest of the pipeline already supports through
|
|
||||||
`discoverKtxRelationships` into profiling, validation, and composite detection:
|
|
||||||
|
|
||||||
- **Progress** through the existing progress port (the relationship phase is
|
|
||||||
already `progress?.startPhase(0.25)` at `local-enrichment.ts:586`): emit per-unit
|
|
||||||
liveness — "Profiling table K/N", "Validating candidate K/M", and the equivalent
|
|
||||||
for composite probing — so a supervisor can distinguish slow-but-working from
|
|
||||||
hung.
|
|
||||||
- **A flat wall-clock budget** for the whole relationship stage: a new
|
|
||||||
`scan.relationships.detectionBudgetMs`, a positive integer of milliseconds,
|
|
||||||
project-level, validated like the other `scan.relationships` fields, **default
|
|
||||||
600_000 (10 min), enforced by default.** Checked at unit boundaries (before each
|
|
||||||
table profile, each candidate validation, each composite probe). It sits **above**
|
|
||||||
spec 16's per-query deadline (default 30s): each individual query is already
|
|
||||||
bounded; this bounds the *sum* of them.
|
|
||||||
- **Honored cancellation:** where `KtxScanContext.signal` is available, the same
|
|
||||||
unit-boundary check honors it, so external cancellation stops the stage too.
|
|
||||||
|
|
||||||
On budget exhaustion or abort: stop scheduling new work, let in-flight queries
|
|
||||||
finish (each already bounded by spec 16), finalize with the relationships found so
|
|
||||||
far, and return a **partial** result — never an unbounded hang and never an
|
|
||||||
exception that would lose the checkpointed descriptions.
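
The unit-boundary check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped code: `createBudget`, `runBounded`, the `StopReason` values, and the progress callback shape are hypothetical names. Only the millisecond budget, the injectable clock, the abort signal, and the "Profiling table K/N" message come from the text.

```typescript
// Hypothetical sketch: a sticky wall-clock budget checked at unit boundaries.
type StopReason = "budget_exhausted" | "aborted";

interface Budget {
  // Returns a stop reason once the budget is spent or the signal has aborted;
  // the answer is sticky afterwards.
  check(): StopReason | null;
}

function createBudget(
  detectionBudgetMs: number,
  now: () => number,
  signal?: { readonly aborted: boolean }, // an AbortSignal fits this shape
): Budget {
  const deadline = now() + detectionBudgetMs;
  let stopped: StopReason | null = null;
  return {
    check() {
      if (stopped !== null) return stopped;
      if (signal !== undefined && signal.aborted) stopped = "aborted";
      else if (now() >= deadline) stopped = "budget_exhausted";
      return stopped;
    },
  };
}

interface BoundedResult<R> {
  results: R[];
  partial: { reason: StopReason } | null;
}

function runBounded<T, R>(
  units: T[],
  budget: Budget,
  work: (unit: T) => R,
  onProgress: (message: string) => void,
): BoundedResult<R> {
  const results: R[] = [];
  let index = 0;
  for (const unit of units) {
    // The check runs only while work remains, so a deadline that elapses
    // after the last unit does not mark the stage partial.
    const reason = budget.check();
    if (reason !== null) return { results, partial: { reason } };
    index += 1;
    onProgress(`Profiling table ${index}/${units.length}`);
    results.push(work(unit));
  }
  return { results, partial: null };
}
```

Because the budget is only consulted between units, a unit that is already running is never interrupted by this layer; bounding the unit itself is left to the per-query deadline.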
|
|
||||||
|
|
||||||
> Rejected alternative — per-table-scaled budget (N seconds × table count). It is a
|
|
||||||
> second formula to reason about and "more tables → more budget" partly re-opens the
|
|
||||||
> unbounded door this requirement closes. One flat, generous, project-level number
|
|
||||||
> matches how the other `scan.relationships` knobs are shaped and is enough for a
|
|
||||||
> best-effort stage whose partial output is durable and improvable (D4).
|
|
||||||
>
|
|
||||||
> Rejected alternative — a global `KTX_RELATIONSHIP_BUDGET_MS` env knob or a
|
|
||||||
> per-call override. One opinionated project-level default with a config override is
|
|
||||||
> the canonical ktx shape; no second runtime path.
|
|
||||||
|
|
||||||
### D4 — A budget-truncated partial is a successful, cached, completed stage
|
|
||||||
|
|
||||||
A graceful budget stop is **not** a failure. The relationship stage saves its
|
|
||||||
partial result like any completed stage (so a plain re-run resumes it for free, no
|
|
||||||
re-querying) and marks it `partial` with a reason in the relationship diagnostics
|
|
||||||
plus a recoverable scan warning. Because `detectionBudgetMs` lives in
|
|
||||||
`relationshipSettings ⊂ inputHash`, **raising the budget changes the content
|
|
||||||
identity and triggers a fresh, fuller run** — that is the only "try harder"
|
|
||||||
mechanism, with no extra flag or runtime path.
|
|
||||||
|
|
||||||
Distinguish the two stop kinds:
|
|
||||||
|
|
||||||
- **Process killed mid-stage** (crash / SIGKILL / supervisor): nothing is saved as
|
|
||||||
completed, so the next run recomputes the relationship stage (after resuming
|
|
||||||
descriptions/embeddings from cache via D2). This is the primary durability path.
|
|
||||||
- **Graceful budget/abort stop**: a partial *is* saved as completed-partial and
|
|
||||||
resumed cheaply on re-run, unless the budget is raised.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
### 1. Checkpoint descriptions + embeddings before relationship detection
|
|
||||||
|
|
||||||
The descriptions and embeddings MUST be persisted into the durable, queryable
|
|
||||||
`_schema` manifest (and the raw enrichment artifacts) as soon as the last
|
|
||||||
non-relationship stage completes, before the `relationships` stage runs.
|
|
||||||
Relationship detection appends/merges its joins on completion. The expensive LLM +
|
|
||||||
embedding enrichment MUST be queryable even if the relationship stage subsequently
|
|
||||||
fails, is interrupted, is budget-truncated, or is skipped. A failed/partial/skipped
|
|
||||||
relationship stage MUST degrade to "no/partial joins," never to "no descriptions."
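
As a rough illustration of the additive checkpoint and final write, the sketch below uses an in-memory map as a stand-in for the `_schema` manifest. The `TableEntry` shape and the `writeCheckpoint` / `writeJoins` helpers are assumptions made for this example; the real artifact writer and manifest layout are not shown in this spec.

```typescript
// Hypothetical in-memory stand-in for the queryable `_schema` manifest.
interface TableEntry {
  descriptions: { ai?: string; native?: string };
  joins: string[];
}
type Manifest = Map<string, TableEntry>;

// Checkpoint: persist AI descriptions before relationship detection runs.
function writeCheckpoint(manifest: Manifest, aiDescriptions: Map<string, string>): void {
  aiDescriptions.forEach((text, table) => {
    const entry: TableEntry = manifest.get(table) ?? { descriptions: {}, joins: [] };
    entry.descriptions.ai = text;
    manifest.set(table, entry);
  });
}

// Final write: merge generated joins additively; descriptions are left untouched.
function writeJoins(manifest: Manifest, joins: Map<string, string[]>): void {
  joins.forEach((found, table) => {
    const entry: TableEntry = manifest.get(table) ?? { descriptions: {}, joins: [] };
    for (const join of found) {
      if (entry.joins.indexOf(join) === -1) entry.joins.push(join);
    }
    manifest.set(table, entry);
  });
}
```

If the process stops between the two calls, the manifest still carries every description with an empty join list, which is the "no/partial joins, never no descriptions" outcome required above.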
|
|
||||||
|
|
||||||
### 2. Stage resume resolves by content identity across runs
|
|
||||||
|
|
||||||
Completed-stage resolution MUST key on `(connectionId, stage, inputHash)`,
|
|
||||||
independent of `runId`, so re-running an interrupted ingest resumes the finished
|
|
||||||
`descriptions` / `embeddings` stages from cache and re-runs only what failed.
|
|
||||||
Re-running after an interruption MUST NOT re-issue LLM description or embedding
|
|
||||||
calls for stages that already completed. The resume cache MAY be recreated without a
|
|
||||||
migration bridge if its schema changes (it is disposable local state).
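
A minimal sketch of content-identity resolution follows, using an in-memory map instead of the SQLite store. The `StageCache` class and the `StageRecord` shape are hypothetical; the method names mirror `saveCompletedStage` and `findCompletedStage` from the spec, but their signatures here are assumptions.

```typescript
// Hypothetical record shape; `runId` is kept for diagnostics only.
interface StageRecord {
  connectionId: string;
  stage: string;
  inputHash: string;
  runId: string;
  output: unknown;
}

class StageCache {
  private readonly rows = new Map<string, StageRecord>();

  // The key deliberately excludes runId, so a fresh run can resume.
  private static key(connectionId: string, stage: string, inputHash: string): string {
    return JSON.stringify([connectionId, stage, inputHash]);
  }

  saveCompletedStage(record: StageRecord): void {
    this.rows.set(StageCache.key(record.connectionId, record.stage, record.inputHash), record);
  }

  findCompletedStage(connectionId: string, stage: string, inputHash: string): StageRecord | undefined {
    return this.rows.get(StageCache.key(connectionId, stage, inputHash));
  }
}
```

A lookup from a later run with the same `inputHash` hits the row saved by the interrupted run, while any change to the hashed inputs misses and forces a recompute.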
|
|
||||||
|
|
||||||
### 3. Relationship detection emits progress and honors a wall-clock budget
|
|
||||||
|
|
||||||
The relationship stage MUST emit per-unit progress through the existing progress
|
|
||||||
port (at minimum per-table during profiling and per-candidate during validation) so
|
|
||||||
liveness is observable. It MUST enforce a flat wall-clock budget
|
|
||||||
(`scan.relationships.detectionBudgetMs`, default 600_000 ms, project-level,
|
|
||||||
overridable, validated as a positive integer) checked at unit boundaries and layered
|
|
||||||
above spec 16's per-query deadline, and MUST honor `KtxScanContext.signal` where
|
|
||||||
available. On budget exhaustion or abort it MUST stop scheduling new work, finalize
|
|
||||||
with the relationships found so far, and return a partial result rather than running
|
|
||||||
unboundedly or throwing.
|
|
||||||
|
|
||||||
### 4. A budget-truncated relationship result is durable and marked partial
|
|
||||||
|
|
||||||
A graceful budget/abort stop MUST persist the partial relationship result as a
|
|
||||||
completed stage (so a plain re-run resumes it without re-querying) and MUST mark it
|
|
||||||
`partial` — in the relationship diagnostics artifact and as a recoverable scan
|
|
||||||
warning — so downstream consumers can see the joins are incomplete. Raising
|
|
||||||
`detectionBudgetMs` (which changes `inputHash`) MUST cause a fresh, fuller
|
|
||||||
relationship run; no separate flag is introduced for "redo." A process killed
|
|
||||||
mid-stage MUST NOT leave a completed record (so it recomputes on re-run).
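
One way to picture the "completed but partial" outcome is the sketch below. The `partial: { reason } | null` result, the `partial` / `partialReason` diagnostics fields, and the `relationship_detection_partial` code come from this spec; `finalizeRelationshipStage` and the surrounding type shapes are assumptions for illustration.

```typescript
interface RelationshipResult {
  relationships: string[];
  partial: { reason: string } | null;
}

interface StageWarning {
  code: "relationship_detection_partial";
  recoverable: true;
  message: string;
}

interface StageOutcome {
  // A graceful stop is still a completed stage, so it is cached like any other.
  status: "completed";
  diagnostics: { partial: boolean; partialReason: string | null; relationshipCount: number };
  warnings: StageWarning[];
}

function finalizeRelationshipStage(result: RelationshipResult): StageOutcome {
  const warnings: StageWarning[] = [];
  if (result.partial !== null) {
    warnings.push({
      code: "relationship_detection_partial",
      recoverable: true,
      message: `Relationship detection stopped early: ${result.partial.reason}`,
    });
  }
  return {
    status: "completed",
    diagnostics: {
      partial: result.partial !== null,
      partialReason: result.partial !== null ? result.partial.reason : null,
      relationshipCount: result.relationships.length,
    },
    warnings,
  };
}
```

A process that is killed never reaches this function, so no completed record exists and the stage recomputes on the next run.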
|
|
||||||
|
|
||||||
### 5. No regression for small or uninterrupted ingests
|
|
||||||
|
|
||||||
A small or single-run ingest that is never interrupted MUST produce the same
|
|
||||||
artifacts and the same relationship output as today. The checkpoint write MUST be
|
|
||||||
idempotent with the final write (descriptions survive the join rewrite); the budget
|
|
||||||
default MUST be generous enough that normal and large-but-tractable schemas complete
|
|
||||||
relationship detection fully, hitting the budget only on pathological scale.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- **Durability across interruption:** interrupting an ingest **during** relationship
|
|
||||||
detection still leaves a queryable semantic layer carrying the table/column
|
|
||||||
descriptions + embeddings that were generated (verified: re-open the connection;
|
|
||||||
AI descriptions are present in `_schema`, not just native comments).
|
|
||||||
- **Resume does not re-spend:** re-running an interrupted ingest does **not**
|
|
||||||
regenerate descriptions/embeddings whose stage already completed (verified: no LLM
|
|
||||||
description calls and no embedding calls for the cached tables; only the failed
|
|
||||||
stage re-runs). Resolution is by `(connectionId, stage, inputHash)`, so the resume
|
|
||||||
survives a fresh `runId`.
|
|
||||||
- **Observable + bounded relationships:** a connection with hundreds of tables emits
|
|
||||||
relationship-stage progress (per-table profiling, per-candidate validation) and
|
|
||||||
completes within `detectionBudgetMs`; when the budget is hit, the stage stops
|
|
||||||
gracefully and persists the partial relationships found so far — without
|
|
||||||
discarding enrichment — marked `partial` in diagnostics and via a recoverable
|
|
||||||
warning.
|
|
||||||
- **Partial is cached and improvable:** re-running with an unchanged budget resumes
|
|
||||||
the partial relationship result from cache (no re-querying); raising
|
|
||||||
`detectionBudgetMs` triggers a fresh, fuller relationship run.
|
|
||||||
- **Budget validation:** `detectionBudgetMs` defaults to 600_000, honors a project
|
|
||||||
override, and rejects an invalid value (zero / negative / non-integer) as a clear
|
|
||||||
`ktx.yaml` config error.
|
|
||||||
- **No regression:** small/single-run ingests behave exactly as before — identical
|
|
||||||
artifacts and relationship output when nothing is interrupted; the checkpoint +
|
|
||||||
final writes leave descriptions intact alongside the generated joins.
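
The budget-validation criterion can be sketched in plain TypeScript as below. The spec places the real check in the zod schema for `scan.relationships`; this hand-written `resolveDetectionBudgetMs` is a hypothetical stand-in that only illustrates the default, override, and rejection rules, and the error wording is invented.

```typescript
const DEFAULT_DETECTION_BUDGET_MS = 600_000; // 10 minutes

// Hypothetical resolver: undefined means "use the default"; anything else
// must be a positive integer number of milliseconds.
function resolveDetectionBudgetMs(raw: unknown): number {
  if (raw === undefined) return DEFAULT_DETECTION_BUDGET_MS;
  if (typeof raw !== "number" || !Number.isInteger(raw) || raw <= 0) {
    throw new Error(
      `ktx.yaml: scan.relationships.detectionBudgetMs must be a positive integer (ms), got ${String(raw)}`,
    );
  }
  return raw;
}
```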
|
|
||||||
|
|
||||||
## Non-goals
|
|
||||||
|
|
||||||
- **Bounding the descriptions stage's per-table LLM call.** Whether an individual
|
|
||||||
enrichment LLM call can wedge is a separate concern (already being addressed in the
|
|
||||||
working tree via a per-table enrichment timeout). This spec ensures whatever
|
|
||||||
descriptions *did* complete are durable; it does not own the per-call timeout.
|
|
||||||
- **Changing relationship-detection quality, thresholds, or the candidate/validation
|
|
||||||
algorithm.** The accept/review thresholds, scoring, and the existing
|
|
||||||
`validationBudget` count cap are unchanged; this spec adds durability,
|
|
||||||
cross-run resume, progress, and a time budget around them.
|
|
||||||
- **A per-connection or per-call relationship budget, or a global env override.**
|
|
||||||
One flat project-level `detectionBudgetMs`; no second runtime path (D3).
|
|
||||||
- **A new per-query timeout.** Spec 16 already bounds individual queries; this spec
|
|
||||||
composes above it and does not re-implement query-level deadlines.
|
|
||||||
- **Replacing the per-query deadline with the stage budget, or vice versa.** They
|
|
||||||
are independent and layered: a single query is bounded by spec 16; the stage's sum
|
|
||||||
is bounded by `detectionBudgetMs`.
|
|
||||||
- **A general checkpoint framework for every ingest stage.** The checkpoint is
|
|
||||||
specifically the descriptions+embeddings → queryable-manifest promotion before
|
|
||||||
relationships; it is not a generic per-stage artifact-flush abstraction.
|
|
||||||
|
|
||||||
## Implementation orientation
|
|
||||||
|
|
||||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns the
|
|
||||||
design.
|
|
||||||
|
|
||||||
- **Enrichment orchestration** — `context/scan/local-enrichment.ts`:
|
|
||||||
`runLocalScanEnrichment` (`:472`), the three `runEnrichmentStage` calls
|
|
||||||
(`descriptions` `:524`, `embeddings` `:553`, `relationships` `:587`),
|
|
||||||
`runEnrichmentStage` (`:413`) and its `findCompletedStage` lookup (`:427`). Add the
|
|
||||||
checkpoint hook after the last non-relationship stage; thread the progress port,
|
|
||||||
signal, and budget into the relationship stage.
|
|
||||||
- **Scan driver / write ordering** — `context/scan/local-scan.ts`: bare manifest
|
|
||||||
write (`:473`), enrichment call (`:492`, currently passing only
|
|
||||||
`{ runId, progress }` as `context` — wire `signal` through here too), terminal
|
|
||||||
`writeLocalScanEnrichmentArtifacts` (`:510`), and the enrichment-failure catch
|
|
||||||
(`:530`, which after D1 no longer loses descriptions). Supply the checkpoint
|
|
||||||
callback here.
|
|
||||||
- **Artifact writer** — `context/scan/local-enrichment-artifacts.ts`:
|
|
||||||
`writeLocalScanEnrichmentArtifacts` (`:310`), `writeLocalScanManifestShards`
|
|
||||||
(`:270`), and the description-preserving merge in `loadExistingManifestState`
|
|
||||||
(`:196`) — the basis for the additive checkpoint/final write.
|
|
||||||
- **Resume cache** — `context/scan/sqlite-local-enrichment-state-store.ts`:
|
|
||||||
`PRIMARY KEY (run_id, stage)` (`:83`), `findCompletedStage` (`:91`),
|
|
||||||
`saveCompletedStage` (`:117`). Re-key on `(connection_id, stage, input_hash)`,
|
|
||||||
pick latest `updated_at`, recreate the table if shape differs (disposable cache).
|
|
||||||
Lookup interface `KtxScanEnrichmentStageLookup` and `findCompletedStage`
|
|
||||||
in `context/scan/enrichment-state.ts` (`:10,46`); `computeKtxScanEnrichmentInputHash`
|
|
||||||
(`:78`).
|
|
||||||
- **Relationship stack (progress + budget + signal)** —
|
|
||||||
`context/scan/relationship-discovery.ts` (`discoverKtxRelationships` `:218`, accept
|
|
||||||
a progress port and budget/deadline + signal),
|
|
||||||
`context/scan/relationship-profiling.ts` (`profileKtxRelationshipSchema` `:320` —
|
|
||||||
per-table progress + budget check),
|
|
||||||
`context/scan/relationship-validation.ts` (`validateKtxRelationshipDiscoveryCandidates`
|
|
||||||
`:237` — per-candidate progress + budget check, alongside the existing
|
|
||||||
`validationBudget`),
|
|
||||||
`context/scan/relationship-composite-candidates.ts`
|
|
||||||
(`discoverKtxCompositeRelationships` `:515` — budget check).
|
|
||||||
- **Config** — `context/project/config.ts` `scan.relationships`
|
|
||||||
(`KtxScanRelationshipConfig`, `:171–213`): add `detectionBudgetMs` (positive
|
|
||||||
integer ms, default 600_000) to the zod schema and the default config builder.
|
|
||||||
- **Partial marker** — `context/scan/relationship-diagnostics.ts`
|
|
||||||
(`buildKtxRelationshipDiagnostics`, the profile/diagnostics artifact shape) carries
|
|
||||||
a `partial` flag + reason; add a recoverable warning code to the
|
|
||||||
`KtxScanWarningCode` union in `context/scan/types.ts` (e.g.
|
|
||||||
`relationship_detection_partial`).
|
|
||||||
- **Tests** — durability: a fixture ingest interrupted during the relationship stage
|
|
||||||
leaves AI descriptions in the queryable `_schema`. Resume: a second run with a
|
|
||||||
fresh `runId` and unchanged `inputHash` resolves the cached descriptions/embeddings
|
|
||||||
(assert no LLM/embedding calls) and re-runs only relationships. Budget: a schema
|
|
||||||
large enough (or a tiny `detectionBudgetMs` as the test seam) hits the budget,
|
|
||||||
emits per-unit progress, returns partial, persists it marked `partial`, and a
|
|
||||||
re-run resumes the partial; raising the budget re-runs. Resolver/config unit tests
|
|
||||||
for `detectionBudgetMs` (default / override / invalid). Regression: small
|
|
||||||
uninterrupted ingest yields identical artifacts and relationship output.
|
|
||||||
- After implementing, rebuild and re-link so the playground picks it up:
|
|
||||||
`pnpm run build && pnpm run link:dev`.
|
|
||||||
|
|
||||||
## Benchmark context (motivation, not a requirement)
|
|
||||||
|
|
||||||
The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables
|
|
||||||
(`ebi_chembl` 785, `fec` 486, `ga360` 366, …). Enriching them with claude-code
|
|
||||||
costs real, rate-limited LLM budget; losing that enrichment to a relationship-stage
|
|
||||||
interruption — and re-spending it on every retry — makes large-schema ingest
|
|
||||||
impractical, and an unbounded profiling stage runs past any supervising deadline and
|
|
||||||
is killed. This is a general durability/cost property of the ingest pipeline,
|
|
||||||
independent of the benchmark; the benchmark only made it acute at scale. Do not
|
|
||||||
encode any benchmark specifics in the implementation.
|
|
||||||
|
|
||||||
## Implementation notes
|
|
||||||
|
|
||||||
Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
|
|
||||||
four design decisions shipped; no deviations from the resolved design.
|
|
||||||
|
|
||||||
**D2 — resume by content identity** (`sqlite-local-enrichment-state-store.ts`,
|
|
||||||
`enrichment-state.ts`, `local-enrichment.ts`): the stage table is re-keyed to
|
|
||||||
`PRIMARY KEY (connection_id, stage, input_hash)`; `findCompletedStage` looks up by
|
|
||||||
`(connectionId, stage, inputHash)` ordered by `updated_at DESC` (most recent
|
|
||||||
content identity wins). `KtxScanEnrichmentStageLookup.runId` became `connectionId`;
|
|
||||||
`runId` stays on the row for diagnostics/`listRunStages`. The store drops and
|
|
||||||
recreates the table when the on-disk primary key differs (disposable cache, no
|
|
||||||
migration bridge), detected via `PRAGMA table_info`.
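
The primary-key comparison could look roughly like this. SQLite's `PRAGMA table_info` reports a `pk` column that is 0 for non-key columns and the 1-based position within the primary key otherwise; the helper names and the way rows reach them are assumptions, and the actual store code is not reproduced here.

```typescript
// Subset of the row shape returned by `PRAGMA table_info(<table>)`.
interface TableInfoRow {
  name: string;
  pk: number; // 0 = not part of the primary key, otherwise 1-based key position
}

function primaryKeyColumns(rows: TableInfoRow[]): string[] {
  return rows
    .filter((row) => row.pk > 0)
    .sort((a, b) => a.pk - b.pk)
    .map((row) => row.name);
}

// True when the on-disk key differs from the expected one. A missing table
// yields no rows, which also counts as "differs".
function needsRecreate(rows: TableInfoRow[], expectedKey: string[]): boolean {
  const actual = primaryKeyColumns(rows);
  if (actual.length !== expectedKey.length) return true;
  return actual.some((column, i) => column !== expectedKey[i]);
}
```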
|
|
||||||
|
|
||||||
**D3 — observable + bounded relationship stage** (new
|
|
||||||
`relationship-detection-budget.ts`): a sticky `KtxRelationshipDetectionBudget`
|
|
||||||
(`check()`/`stopReason()`) built from `detectionBudgetMs` + `ctx.signal` + an
|
|
||||||
injectable `now`, plus `mapWithBudget` (a budget-aware concurrent map that
|
|
||||||
generalizes and replaces the old `mapWithConcurrency`). Threaded through
|
|
||||||
`discoverKtxRelationships` → profiling (per-table progress + budget stop),
|
|
||||||
validation (per-candidate progress + budget stop; budget-skipped candidates
|
|
||||||
degrade to the existing `validation_unattempted` review), and composite
|
|
||||||
detection (budget stops at PK-detection and coverage-probe boundaries).
|
|
||||||
`discoverKtxRelationships` now accepts `progress` and `now` and returns
|
|
||||||
`partial: { reason } | null`. The clock check fires only when work remains, so a
|
|
||||||
deadline elapsing after the last unit never marks a fully-processed stage partial.
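
The behavior described for the budget-aware concurrent map can be sketched as follows. This is not the shipped `mapWithBudget`: the signature, the `BudgetLike` interface, and the returned shape are assumptions chosen to show the two properties named above, namely that in-flight items finish and that the budget is consulted only while unstarted items remain.

```typescript
interface BudgetLike {
  check(): string | null; // null while work may continue, otherwise a stop reason
}

async function mapWithBudget<T, R>(
  items: T[],
  concurrency: number,
  budget: BudgetLike,
  fn: (item: T, index: number) => Promise<R>,
): Promise<{ results: (R | undefined)[]; stopReason: string | null }> {
  const results: (R | undefined)[] = items.map(() => undefined);
  let next = 0;
  let stopReason: string | null = null;

  async function worker(): Promise<void> {
    // The loop condition guarantees the budget is checked only when an
    // unstarted item remains.
    while (next < items.length && stopReason === null) {
      stopReason = budget.check();
      if (stopReason !== null) return;
      const index = next++;
      results[index] = await fn(items[index] as T, index);
    }
  }

  const workers: Promise<void>[] = [];
  for (let i = 0; i < Math.min(concurrency, items.length); i++) workers.push(worker());
  await Promise.all(workers);
  return { results, stopReason };
}
```

Items that were never started stay `undefined` in the result, which lets a caller map them onto whatever "unattempted" state it already has.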
|
|
||||||
|
|
||||||
**D1 — checkpoint before relationships** (`local-enrichment.ts`,
|
|
||||||
`local-enrichment-artifacts.ts`, `local-scan.ts`): `runLocalScanEnrichment` fires a
|
|
||||||
caller-supplied `onCheckpoint` once descriptions/embeddings complete and before
|
|
||||||
the relationship stage runs, gated on `shouldDetectRelationships` so the
|
|
||||||
no-relationship path keeps a single write. `local-scan.ts` supplies a callback
|
|
||||||
calling the new `writeLocalScanEnrichmentCheckpoint` (descriptions.json +
|
|
||||||
embeddings.json + manifest with descriptions and no generated joins — no
|
|
||||||
relationship artifacts, so no misleading empty diagnostics). The shared
|
|
||||||
description/embedding JSON writer was factored out so checkpoint and final writes
|
|
||||||
stay one implementation. `ctx.signal` is now threaded from `RunLocalScanOptions`
|
|
||||||
into the enrichment context (completing the existing `KtxScanContext.signal`
|
|
||||||
contract already read by the budget and the in-flight description timeout).
|
|
||||||
|
|
||||||
**D4 — partial is durable + marked** (`relationship-diagnostics.ts`,
|
|
||||||
`local-enrichment.ts`, `local-enrichment-artifacts.ts`): the diagnostics artifact
|
|
||||||
carries `partial` + `partialReason`; `runLocalScanEnrichment` pushes a recoverable
|
|
||||||
`relationship_detection_partial` warning (new `KtxScanWarningCode`) when truncated.
|
|
||||||
A graceful budget/abort stop returns normally, so the relationship stage saves as a
|
|
||||||
completed-partial record and resumes cheaply; a process killed mid-stage saves
|
|
||||||
nothing and recomputes. Raising `detectionBudgetMs` changes `inputHash`
|
|
||||||
(it lives in `relationshipSettings`), forcing a fresh, fuller run — the only
|
|
||||||
"try harder" mechanism, no extra flag.
|
|
||||||
|
|
||||||
**Config** (`config.ts`): `scan.relationships.detectionBudgetMs`, positive integer
|
|
||||||
ms, default `600_000`, validated like the other relationship fields. Documented in
|
|
||||||
`docs-site/content/docs/configuration/ktx-yaml.mdx`.
|
|
||||||
|
|
||||||
**Tests** (all green): budget unit tests (`relationship-detection-budget.test.ts`);
|
|
||||||
cross-run resume + table-recreate (`enrichment-state.test.ts`,
|
|
||||||
`local-enrichment.test.ts`); progress/budget/abort partial
|
|
||||||
(`relationship-discovery.test.ts`); partial persisted/resumed/re-run-on-raise +
|
|
||||||
checkpoint ordering + no-checkpoint-when-skipped (`local-enrichment.test.ts`);
|
|
||||||
end-to-end durability — a relationship-stage failure still leaves AI descriptions
|
|
||||||
in the queryable `_schema` (`local-scan.test.ts`); diagnostics partial flag
|
|
||||||
(`relationship-diagnostics.test.ts`); config default/override/invalid
|
|
||||||
(`config.test.ts`). `pnpm --filter @kaelio/ktx type-check`, `pnpm run dead-code`,
|
|
||||||
and `pnpm run build && pnpm run link:dev` all pass. (Pre-existing and unrelated:
|
|
||||||
three `analytics-skill-content.test.ts` markdown-structure assertions fail on this
|
|
||||||
branch from earlier analytics-skill commits — untouched here.)
|
|
||||||
|
|
@@ -1,533 +0,0 @@
|
||||||
# Resilient enrichment under a slow/hung LLM backend
|
|
||||||
|
|
||||||
> Refined spec. Intake draft: `todo/20-resilient-enrichment-under-slow-llm.md`.
|
|
||||||
>
|
|
||||||
> **Scope: make the descriptions enrichment stage survive a hung LLM backend and
|
|
||||||
> an interrupted run.** Two compounding gaps live *inside* the per-table
|
|
||||||
> description-enrichment path: (1) the per-table LLM timeout fires in JS but does
|
|
||||||
> not terminate a wedged subprocess backend, so a hung table wedges the whole
|
|
||||||
> stage indefinitely; (2) descriptions are persisted only at full-stage
|
|
||||||
> completion, so any interruption discards every already-enriched table. This is
|
|
||||||
> the enrichment-stage analog of spec 16 (enforced query cancellation — a deadline
|
|
||||||
> that *stops the work*, not just abandons the promise) and spec 19 (move the
|
|
||||||
> durability boundary to the cost boundary so expensive LLM work is not lost). It
|
|
||||||
> composes with both rather than replacing them.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
Two compounding failure modes on the per-table description-enrichment path, both
|
|
||||||
confirmed in the current code and observed end-to-end together. Their union turned
|
|
||||||
a single hung table into an indefinite wedge *plus* total loss of an entire
|
|
||||||
stage's LLM work.
|
|
||||||
|
|
||||||
### 1. The per-table LLM timeout does not terminate the work
|
|
||||||
|
|
||||||
`KtxDescriptionGenerator.generateBatchedTableDescriptions`
|
|
||||||
(`context/scan/description-generation.ts`, the bounded call ~760–866) wraps the
|
|
||||||
per-table `this.llmRuntime.generateObject(...)` call in `retryAsync` with a fresh
|
|
||||||
`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (commit `01f63380`).
|
|
||||||
A fired timeout is surfaced as `KtxAbortedError` so it is **not** retried (one
|
|
||||||
wedge stays one timeout, not 3×). That is the correct policy — but the abort never
|
|
||||||
actually stops a subprocess backend, so the timeout is cosmetic.
|
|
||||||
|
|
||||||
The runtime is selected by the `backend` config field
|
|
||||||
(`context/llm/local-config.ts`, `KTX_LLM_BACKENDS =
|
|
||||||
['none','anthropic','vertex','gateway','claude-code','codex']`). Two backends spawn
|
|
||||||
a **child process the SDK owns** and to which ktx hands only an `AbortSignal`:
|
|
||||||
|
|
||||||
- **`codex`** (`@openai/codex-sdk`, via `context/llm/codex-runtime.ts` →
|
|
||||||
`codex-sdk-runner.ts`): the SDK runs `spawn(executable, args, { signal })`. Node's
|
|
||||||
`spawn` `signal` option sends the child **SIGTERM** (not SIGKILL) by default on abort, and the
|
|
||||||
SDK consumes the child's stdout with `for await (const line of rl)`, re-throwing
|
|
||||||
the abort error **only after that loop ends**. A child wedged on a hung provider
|
|
||||||
socket survives SIGTERM → its stdout never closes → the readline loop never ends
|
|
||||||
→ the SDK never throws → ktx's `await generateObject` **never settles**, past the
|
|
||||||
per-attempt timeout, indefinitely. The child leaks (open provider connections,
|
|
||||||
~0% CPU).
|
|
||||||
- **`claude-code`** (`@anthropic-ai/claude-agent-sdk`, via
|
|
||||||
`context/llm/claude-code-runtime.ts`, `collectResult` ~275–322): on abort it calls
|
|
||||||
best-effort `queryResult.interrupt?.()` (errors swallowed) and only checks
|
|
||||||
`throwIfAborted` **between** streamed messages. A wedged child emits no message, so
|
|
||||||
the `for await (const message of queryResult)` loop blocks and the graceful
|
|
||||||
`interrupt()` may never land — the same hang class.
|
|
||||||
|
|
||||||
By contrast, **HTTP backends** (`anthropic`/`vertex`/`gateway`/`openai`, via
|
|
||||||
`context/llm/ai-sdk-runtime.ts`) pass `abortSignal` straight to the AI SDK's
|
|
||||||
`generateObject`, which cancels the underlying `fetch` natively — the await settles
|
|
||||||
promptly and there is no child to leak.
|
|
||||||
|
|
||||||
So ktx holds **no kill handle** on the subprocess backends, and SIGTERM is too
|
|
||||||
gentle for a wedged child. Spec 16's mechanism (ktx *itself* forks
|
|
||||||
`read-query-child` and `SIGKILL`s it) works precisely because ktx owns the fork —
|
|
||||||
which it does not here.
|
|
||||||
|
|
||||||
Observed (BigQuery ingest, codex backend, 2026-06-23): with
|
|
||||||
`KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min, an operator override), two of
|
|
||||||
`covid19_usa`'s 252-column tables hung; the stage sat at **268/285 for 41+
|
|
||||||
minutes** — well past the 30-min per-attempt timeout — with exactly two codex
|
|
||||||
children, each holding 3 ESTABLISHED connections at ~0% CPU, until killed by hand.
|
|
||||||
|
|
||||||
### 2. Descriptions are persisted only at full-stage completion
|
|
||||||
|
|
||||||
`generateDescriptions` (`context/scan/local-enrichment.ts` ~279–352) fans out
|
|
||||||
per-table work through `pLimit(DESCRIPTION_TABLE_CONCURRENCY)` (default 4) and
|
|
||||||
**accumulates every table's result in an in-memory `updates` array**, returned only
|
|
||||||
when the whole stage finishes. `runEnrichmentStage` (~413, ~421–474) then calls
|
|
||||||
`saveCompletedStage` (writing the whole-stage row to `local_scan_enrichment_stages`)
|
|
||||||
**after** `compute()` returns, and the spec-19 checkpoint write
|
|
||||||
(`writeLocalScanEnrichmentCheckpoint`, `local-enrichment-artifacts.ts` ~351–379,
|
|
||||||
fired by the `onCheckpoint` hook in `local-scan.ts`) also runs **only once the
|
|
||||||
descriptions stage completes**. There is no within-stage persistence: while the
|
|
||||||
stage runs, every enriched table's description lives only in memory.
|
|
||||||
|
|
||||||
So if the stage cannot complete — 2 of 285 tables hang (gap #1), or the process is
|
|
||||||
killed, or a supervising watchdog fires — **all** already-enriched tables are lost,
|
|
||||||
even though their (expensive, paid) LLM descriptions were finished. On the next run,
|
|
||||||
`findCompletedStage` finds no row, so the descriptions stage **recomputes from
|
|
||||||
scratch**.
|
|
||||||
|
|
||||||
Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but
|
|
||||||
**0** rows in `local_scan_enrichment_stages` and **0** `ai:` descriptions on disk;
|
|
||||||
killing the wedged ingest discarded all 283, forcing a from-scratch re-ingest. The
|
|
||||||
cost of 2 pathological tables was 283 tables' worth of redone LLM calls.
|
|
||||||
|
|
||||||
Sharper still (re-ingest with a short, *enforced* timeout): even when the stage
|
|
||||||
**runs to the end** — the 2 hung tables hit their timeout and were skipped, so
|
|
||||||
**283/285** descriptions were generated and the ingest reported success (`Scan
|
|
||||||
completed` / `Ingest finished`, embeddings built, exit 0) — the descriptions were
|
|
||||||
**still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the loss is **not**
|
|
||||||
only "discarded on kill": a stage that completes with *any* skipped/aborted table
|
|
||||||
throws away **every** successfully generated description. The skip must be
|
|
||||||
**graceful** — a skipped table costs one missing description, not the entire stage's
|
|
||||||
output — which is the strongest argument for per-table incremental persistence: the
|
|
||||||
283 good descriptions should have been durable the moment each was produced.
|
|
||||||
|
|
||||||
The on-disk artifacts already carry everything needed to fix this *additively*: the
|
|
||||||
`_schema` manifest encodes per-table completion (a table with `descriptions.ai` is
|
|
||||||
AI-enriched), and rewrites preserve existing descriptions
|
|
||||||
(`mergeDescriptionsPreservingExternal`, `manifest.ts` ~96–115;
|
|
||||||
`loadExistingManifestState`, `local-enrichment-artifacts.ts` ~196–253 — the basis
|
|
||||||
spec 19 relies on). The durable record and the resume-skip set can be **derived from
|
|
||||||
the system's own on-disk state**, with no new cache schema.
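
Deriving the resume-skip set from the manifest might look like the sketch below. The rule that a table with `descriptions.ai` counts as AI-enriched comes from the text; the `ManifestTable` shape and the `tablesNeedingEnrichment` helper are hypothetical and say nothing about how the manifest is actually loaded from disk.

```typescript
// Hypothetical, simplified view of one table entry in the `_schema` manifest.
interface ManifestTable {
  name: string;
  descriptions: { ai?: string; native?: string };
}

// Tables that already carry an AI description are skipped on resume;
// everything else (including tables missing from the manifest) is enriched.
function tablesNeedingEnrichment(allTables: string[], manifest: ManifestTable[]): string[] {
  const done = new Set<string>();
  for (const table of manifest) {
    if (table.descriptions.ai !== undefined) done.add(table.name);
  }
  return allTables.filter((name) => !done.has(name));
}
```

A native comment alone does not count as enriched, so a table that only has a database-level comment is still sent to the LLM.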
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
Anyone ingesting a large or wide schema with an LLM enrichment backend —
|
|
||||||
especially a **subprocess** backend, the common local/desktop setup — will
|
|
||||||
eventually hit a table whose description call hangs: a provider stall, a rate-limit
|
|
||||||
black-hole, a pathologically large prompt. Without an *enforced* timeout, one such
|
|
||||||
table wedges the entire ingest indefinitely and leaks the spawned child; without
|
|
||||||
*incremental* persistence, any interruption throws away all the per-table LLM work
|
|
||||||
already done — the dominant ingest cost. Both fixes make large-schema enrichment
|
|
||||||
**resilient and resumable**: a few bad tables degrade to a few skipped
|
|
||||||
descriptions, not a hung process and a from-scratch redo. This is core robustness
|
|
||||||
for a general-purpose ingestion product, wholly independent of any benchmark.
|
|
||||||
|
|
||||||
## Design decisions (resolved during refinement)
|
|
||||||
|
|
||||||
These resolve ambiguities the intake draft left open. They constrain the
|
|
||||||
implementer; the exact code is theirs (requirement-level, per the specs README).
|
|
||||||
|
|
||||||
### D1 — One bounded-call guarantee; enforcement follows the backend's nature
|
|
||||||
|
|
||||||
The canonical contract is a single guarantee for the per-table enrichment call:
|
|
||||||
**the in-flight work terminates and ktx's await settles within the per-table
|
|
||||||
deadline plus a small grace, on every backend.** How that guarantee is met follows
|
|
||||||
from a structural property of the configured backend — *does it own a subprocess?*
|
|
||||||
— not from a hand-maintained list of provider names:
|
|
||||||
|
|
||||||
- **Subprocess-backed (`codex`, `claude-code`):** the SDK's own abort is
|
|
||||||
insufficient (SIGTERM-only, and ktx has no kill handle), so ktx runs the call
|
|
||||||
behind a **boundary it can hard-kill** — a short-lived ktx-owned child process,
|
|
||||||
made a **process-group leader** (`detached`). The SDK's grandchild (the
|
|
||||||
`codex`/`claude` binary) inherits that group. On deadline (or `ctx.signal`), ktx
|
|
||||||
**tree-kills the whole group with SIGKILL** — reaping the wrapper *and* the
|
|
||||||
grandchild — and rejects promptly. This mirrors spec 16's child-process +
|
|
||||||
SIGKILL mechanism, extended by the critical step that **killing the immediate
|
|
||||||
child is not enough**: the grandchild would otherwise orphan to init and keep its
|
|
||||||
provider connections. Killing the group is the real fix.
|
|
||||||
- **HTTP-backed (`anthropic`/`vertex`/`gateway`/`openai`):** unchanged. The existing
|
|
||||||
in-process `abortSignal` → `fetch` cancellation already satisfies the contract —
|
|
||||||
the await settles promptly and there is no subprocess to leak. Routing these
|
|
||||||
through a subprocess would pay fork + IPC + credential-passing cost for no benefit.
|
|
||||||
|
|
||||||
> The branch on "subprocess-backed?" is behavior following from an input the backend
|
|
||||||
> declares about itself, not vendor enumeration — the same guarantee is reached two
|
|
||||||
> ways because the backends differ structurally. This matches the intake's own split
|
|
||||||
> ("subprocess SIGKILL for process-backed; request abort for HTTP-backed").
|
|
||||||
>
|
|
||||||
> Rejected alternative — a *settle-only race* (reject ktx's promise on the deadline
|
|
||||||
> regardless of the SDK, but leave the SDK's child running). It unwedges the stage
|
|
||||||
> but leaves the orphaned child holding provider connections — the exact leak the
|
|
||||||
> incident showed — so it fails the intake's "actually cancelled" requirement and
|
|
||||||
> compounds over a long ingest that hits several hung tables.
|
|
||||||
>
|
|
||||||
> Rejected alternative — a *persistent ktx subprocess pool* hosting the runtime,
|
|
||||||
> killed and respawned on timeout. Terminate-on-deadline destroys the worker, so a
|
|
||||||
> pool needs respawn + in-flight job-tracking for no benefit: the enrichment call is
|
|
||||||
> low-frequency relative to its own latency and already concurrency-bounded (4), so
|
|
||||||
> one short-lived child per call (spec 16's resolved choice) is simpler and just as fast.
|
|
||||||
|
|
||||||
**Portability.** ktx supports Windows, where POSIX process groups and
|
|
||||||
`process.kill(-pgid, …)` do not exist. The tree-kill MUST be portable: a detached
|
|
||||||
process group + `kill(-pgid, 'SIGKILL')` on POSIX, and a tree-terminating
|
|
||||||
equivalent on Windows (e.g. `taskkill /pid <pid> /T /F` or a job object) so the
|
|
||||||
grandchild is reaped on every platform the subprocess backends run on.
|
|
||||||
|
|
||||||
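
The platform branch described under **Portability** can be sketched as a pure planning step. This is a minimal illustration, not ktx's implementation: `planTreeKill` and `TreeKillPlan` are hypothetical names, and the sketch assumes the child was spawned detached so that its pid doubles as the process-group id on POSIX.

```typescript
// Hypothetical helper: decides HOW to terminate a detached child's whole tree.
// It only builds a plan; the caller executes it (signal or external command).
type TreeKillPlan =
  | { kind: "signal"; target: number; signal: "SIGKILL" }
  | { kind: "command"; file: string; args: string[] };

function planTreeKill(platform: string, pid: number): TreeKillPlan {
  if (!Number.isInteger(pid) || pid <= 1) {
    // pid 0 and pid 1 would map to targets 0 and -1, which kill(2) treats as
    // "my own process group" and "every process I may signal". Refuse both.
    throw new RangeError(`refusing to tree-kill pid ${pid}`);
  }
  if (platform === "win32") {
    // Windows has no process groups: /T walks the child tree, /F forces it.
    return {
      kind: "command",
      file: "taskkill",
      args: ["/pid", String(pid), "/T", "/F"],
    };
  }
  // POSIX: a negative pid addresses the group led by the detached child.
  return { kind: "signal", target: -pid, signal: "SIGKILL" };
}
```

A caller would pass `process.platform` and the detached child's pid, then either call `process.kill(plan.target, plan.signal)` or run the returned command. Keeping the decision pure lets the Windows branch be exercised on a POSIX host.
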
### D2 — Default stays moderate and the retry/skip policy is unchanged

The per-table timeout default stays **120s** (`KTX_ENRICH_LLM_TIMEOUT_MS`), with the
existing per-attempt retry (`KTX_ENRICH_LLM_ATTEMPTS`, default 3) and the
no-retry-on-timeout policy. A hung table costs **at most one timeout**, then the
table is skipped with the existing `enrichment_timeout` warning and the stage
proceeds. The 30-min value in the incident was an operator stopgap chosen *because*
the timeout was cosmetic; once D1 makes the timeout actually terminate the work, a
long timeout is strictly worse for a hang (a hang costs the full timeout), so the
moderate default is the correct operating point. The retry loop stays in
`description-generation.ts`: each attempt runs through the bounded boundary (D1), so
a transient backend error retries while a timeout surfaces as `KtxAbortedError` and
does not.

> Not introducing a new `ktx.yaml` config field for the timeout. The existing env
> override is the tuning seam; adding a per-connection/per-call/global knob would
> multiply the runtime surface for no stated need (one opinionated default + the
> existing env override is the canonical ktx shape).

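
The D2 policy (env-tunable defaults, retry on transient errors, no retry on timeout) can be sketched as two small pure functions. Everything here is illustrative: `loadPolicy`, `nextStep`, and the `AttemptFailure` shape are hypothetical, and mapping exhausted retries to the `enrichment_failed` warning is an assumption drawn from the warning names this spec mentions. Only the two env variable names and the defaults (120s, 3 attempts) come from the text above.

```typescript
interface EnrichPolicy {
  timeoutMs: number;
  attempts: number;
}

// Falls back to the default for missing, non-integer, or non-positive values.
function readPositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number(raw);
  return raw !== undefined && Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

function loadPolicy(env: Record<string, string | undefined>): EnrichPolicy {
  return {
    timeoutMs: readPositiveInt(env.KTX_ENRICH_LLM_TIMEOUT_MS, 120 * 1000),
    attempts: readPositiveInt(env.KTX_ENRICH_LLM_ATTEMPTS, 3),
  };
}

// Stand-in for "the attempt threw": a timeout (KtxAbortedError in the spec)
// versus any other, presumably transient, backend error.
type AttemptFailure = { kind: "timeout" } | { kind: "error"; message: string };

type NextStep =
  | { action: "retry" }
  | { action: "skip"; warning: "enrichment_timeout" | "enrichment_failed" };

// attemptNumber is 1-based: the attempt that just failed.
function nextStep(failure: AttemptFailure, attemptNumber: number, policy: EnrichPolicy): NextStep {
  if (failure.kind === "timeout") {
    // One wedge costs one timeout: never retried.
    return { action: "skip", warning: "enrichment_timeout" };
  }
  if (attemptNumber < policy.attempts) {
    return { action: "retry" };
  }
  return { action: "skip", warning: "enrichment_failed" };
}
```

With the defaults, a table that times out on its first attempt is skipped immediately, while a table that keeps hitting transient errors is tried three times before being skipped.
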
### D3 — Persist descriptions incrementally; derive the resume-skip set from on-disk state

During the descriptions fan-out, flush completed tables **per batch** (every N
tables / on a timer, at a cadence that bounds the at-risk window) to the durable
on-disk artifacts, reusing spec 19's additive write:

- the raw descriptions artifact (`descriptions.json`) is the **resume-skip source**;
- the `_schema` manifest is updated additively (`mergeDescriptionsPreservingExternal`
preserves prior `ai:`/`db:`/external keys) so finished descriptions are also
**queryable** the moment they are computed — the spec-19 invariant, one level
deeper. The implementer MAY bound manifest-rewrite cost on huge schemas by
rewriting only changed shards.

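
The per-batch cadence can be sketched as a small buffer that hands back a batch once enough tables have completed. This is a count-based illustration only (the timer variant is omitted), and `createFlushBuffer` is a hypothetical name, not a ktx API; the caller is assumed to write each returned batch to the durable artifacts.

```typescript
type DescriptionEntry = [table: string, description: string];

interface FlushBuffer {
  // Records one finished table; returns a batch to persist once the
  // threshold is reached, otherwise null.
  add(table: string, description: string): DescriptionEntry[] | null;
  // Returns whatever is still pending (used for the final tail flush).
  drain(): DescriptionEntry[];
}

function createFlushBuffer(flushEvery: number): FlushBuffer {
  let pending: DescriptionEntry[] = [];

  const drain = (): DescriptionEntry[] => {
    const batch = pending;
    pending = [];
    return batch;
  };

  return {
    add(table, description) {
      pending.push([table, description]);
      return pending.length >= flushEvery ? drain() : null;
    },
    drain,
  };
}
```

With `flushEvery = N`, at most N − 1 completed tables sit only in memory at any moment, which is the "at-risk window" the cadence is meant to bound.
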
On resume, `generateDescriptions` reads the existing record, **skips any table
already enriched**, computes only the remainder, and returns the merged full set so
the embeddings stage, the checkpoint write, and the stage-store row all see a
complete result exactly as today.

**The skip is `inputHash`-gated**, preserving spec 19's recompute semantics. The
durable record is tagged with the descriptions stage's `inputHash`
(`computeKtxScanEnrichmentInputHash`). Resume reuses it to skip tables **only when
the current `inputHash` matches** — a genuine resume-after-interruption of the same
content identity. A changed `inputHash` (schema or enrichment settings changed)
ignores the prior record for skipping and recomputes the stage as today; the
manifest write stays additive regardless. The artifact's on-disk shape may gain the
`inputHash` tag with **no migration bridge** (ktx owns the artifact; a stale-shaped
record simply forces one non-incremental run), consistent with ktx's
no-backward-compatibility policy.

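
The `inputHash` gate can be sketched as a function that splits the table list into a recovered set and a remainder. The record shape and the name `planResume` are assumptions for illustration; the spec only requires that the durable record carry the `inputHash` and that a mismatch disables skipping.

```typescript
// Assumed shape of the durable record: the tag plus descriptions so far.
// A null entry means "attempted but not enriched".
interface DurableDescriptionRecord {
  inputHash: string;
  descriptions: Record<string, string | null>;
}

interface ResumePlan {
  recovered: Record<string, string>; // reused as-is, no LLM call
  remainder: string[]; // tables that still need an LLM call
}

function planResume(
  record: DurableDescriptionRecord | null,
  currentInputHash: string,
  tables: string[],
): ResumePlan {
  // A missing record or a different inputHash means nothing may be skipped.
  const usable: Record<string, string | null> =
    record !== null && record.inputHash === currentInputHash ? record.descriptions : {};

  const recovered: Record<string, string> = {};
  const remainder: string[] = [];
  for (const table of tables) {
    const prior = Object.prototype.hasOwnProperty.call(usable, table) ? usable[table] : null;
    if (typeof prior === "string" && prior.length > 0) {
      recovered[table] = prior;
    } else {
      remainder.push(table); // never enriched, or previously failed
    }
  }
  return { recovered, remainder };
}
```

The stage would then enrich only `remainder` and return `recovered` merged with the fresh results, so downstream stages still see a full set.
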
> The skip set is **derived from the artifacts ktx already writes**, not from a new
> per-table cache table. The manifest's `ai:` field already encodes "this table is
> enriched"; a parallel per-table SQLite record would be a second source of truth for
> the same fact and would drift. The whole-stage `local_scan_enrichment_stages` row is
> still written at stage completion (it remains the stage-level resume gate — a clean
> re-run skips the descriptions stage as today); the incremental record only matters
> when the stage did **not** complete — exactly the case where no row exists and
> `compute()` re-runs.

### D4 — A killed-mid-stage run is durable; resume is cheap

A process killed mid-stage (gap #1 wedge, SIGKILL, crash, supervisor) leaves the
per-batch-flushed tables durable on disk. The next run resumes the descriptions
stage (no completed `local_scan_enrichment_stages` row → `compute()` runs again),
but `generateDescriptions` now **re-issues LLM calls only for the unfinished
tables**. A failed/skipped table (timeout or exhausted retries) is left for the
remainder set and is retried on the next resume — never silently treated as done.

## Requirements

### 1. The per-table enrichment timeout is enforced for subprocess backends

When the per-table deadline fires (or `ctx.signal` aborts) on a subprocess-backed
backend (`codex`, `claude-code`), the in-flight LLM work — the spawned child **and
its descendants** — MUST be terminated (SIGKILL of the process group / tree), and
ktx's `generateObject` await MUST settle within the deadline plus a small bounded
grace. A hung table MUST cost at most ~one timeout of wall-clock, never unbounded.
The termination MUST be portable across the platforms the subprocess backends run on
(POSIX process-group kill and a Windows tree-kill equivalent). HTTP-backed backends
keep their existing native `abortSignal` → `fetch` cancellation; the guarantee is one
contract met two ways, branching on the backend's structural "owns a subprocess"
property, not on a list of provider names.

### 2. The timeout default and retry/skip policy are unchanged

The default per-table timeout stays moderate (current 120s, `KTX_ENRICH_LLM_TIMEOUT_MS`),
with the existing per-attempt retry (default 3, `KTX_ENRICH_LLM_ATTEMPTS`) and the
no-retry-on-timeout policy. On timeout, the table is skipped with the existing
`enrichment_timeout` recoverable warning and the stage proceeds. No new
per-connection / per-call / global timeout knob is added.

### 3. Descriptions are persisted incrementally during the stage

Enriched descriptions MUST be flushed to the durable on-disk artifacts **per batch**
(per-table or per-N-tables / on a timer) during the descriptions stage, at a cadence
that bounds the at-risk window to a small number of tables. The flush MUST be
idempotent and additive (never clobber a prior `ai:` description; preserve `db:` and
external keys via the existing merge). Finished tables MUST remain durable even if the
stage never completes — is wedged, killed, or interrupted. A failed/skipped
relationship/embedding stage or a killed descriptions stage MUST NOT lose the
descriptions already flushed.

### 4. Resume re-enriches only the unfinished tables

On a resumed ingest with an unchanged `inputHash`, the descriptions stage MUST
re-issue LLM description calls **only for tables not already enriched**, deriving the
already-enriched set from the on-disk artifacts (the `inputHash`-tagged durable
record / the manifest's `ai:` descriptions), and MUST return the merged full result
so downstream stages behave as on a fresh run. A changed `inputHash` (schema or
enrichment settings changed) MUST recompute the stage as today (spec 19's
inputHash-gated semantics preserved). The durable record MAY be recreated without a
migration bridge if its on-disk shape changes (it is regenerable local/artifact
state).

### 5. No regression for small or uninterrupted ingests

A small or single-run ingest that is never interrupted MUST produce the same
artifacts (descriptions, manifest, embeddings) as today. The incremental flush MUST
be idempotent with the spec-19 checkpoint and the terminal write (descriptions
survive the embeddings/relationship rewrites). The bounded-call boundary MUST NOT
change a normal successful enrichment's output, only how a wedged call is terminated.

### 6. A skipped table costs one description, never the stage's output

A descriptions stage that **completes** with one or more skipped/aborted tables MUST
persist every successfully-generated description (the durable record and the `ai:`
manifest entries) and MUST mark the stage completed (a `local_scan_enrichment_stages`
row, embeddings + downstream proceeding) — it MUST NOT discard the whole stage's
output because some tables were skipped. No single table's failure may reject the
per-table fan-out: a per-table failure degrades to one missing description (left for
the resume remainder), not a failed stage. A genuine `ctx.signal` cancellation is the
only thing that fails the stage (so it resumes), and even then the already-flushed
descriptions remain durable.

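
Requirement 6 can be read as a rule for folding per-table outcomes into a stage result. The sketch below is one hedged way to express that rule; `TableOutcome`, `StageResult`, and `summarizeStage` are hypothetical names, and the real fan-out is asynchronous, which this synchronous fold deliberately leaves out.

```typescript
type TableOutcome =
  | { table: string; status: "ok"; description: string }
  | { table: string; status: "failed"; warning: "enrichment_timeout" | "enrichment_failed" }
  | { table: string; status: "cancelled" }; // a genuine ctx.signal abort

interface StageResult {
  completed: boolean; // false only when the run was cancelled
  descriptions: Record<string, string | null>; // null = left for resume
  warnings: string[];
}

function summarizeStage(outcomes: TableOutcome[]): StageResult {
  const descriptions: Record<string, string | null> = {};
  const warnings: string[] = [];
  let cancelled = false;

  for (const outcome of outcomes) {
    if (outcome.status === "ok") {
      descriptions[outcome.table] = outcome.description;
    } else if (outcome.status === "failed") {
      // One failed table degrades to one missing description plus a warning.
      descriptions[outcome.table] = null;
      warnings.push(`${outcome.warning}:${outcome.table}`);
    } else {
      // Cancellation is the only outcome that fails the stage.
      cancelled = true;
    }
  }
  return { completed: !cancelled, descriptions, warnings };
}
```

Note that successful descriptions are kept in both cases: a cancelled run reports `completed: false` but does not drop what was already produced.
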
## Acceptance criteria

- **Enforced timeout (subprocess backend):** a subprocess-backed enrichment call
that hangs past the deadline is terminated within the deadline plus a small grace;
ktx's await settles, the spawned child **and a grandchild it spawned** both exit
(verified via the child's `exit`, not left spinning), and the table is skipped with
an `enrichment_timeout` warning. The stage advances rather than wedging. A
`ctx.signal` abort terminates the same way.
- **HTTP backend unaffected:** an HTTP-backed enrichment call still cancels promptly
on abort via the existing native path, with no subprocess involved.
- **Default + policy:** the default timeout is 120s and a timeout is not retried (one
wedge = one timeout); a transient error is still retried up to the attempt limit.
- **Graceful skip persists the rest:** a stage that completes with one table failing
(timeout, exhausted retries, or an unexpected throw) still writes the other N−1
descriptions to the durable record + `ai:` `_schema` and marks the stage completed
(a `local_scan_enrichment_stages` row exists); the failed table is a single `null`
description left for the resume remainder, not a discarded stage.
- **Incremental durability:** interrupting the descriptions stage after K of N tables
leaves those K durable on disk (raw artifact + `ai:` descriptions in `_schema`),
with no completed `local_scan_enrichment_stages` row.
- **Resume does not re-spend:** re-running the interrupted ingest (unchanged
`inputHash`, fresh `runId`) issues **no** LLM description calls for the K
already-enriched tables and enriches only the remaining N−K; the returned result is
the full merged set. A changed `inputHash` recomputes the stage.
- **No regression:** a small uninterrupted ingest yields identical artifacts and the
same descriptions/embeddings output as today; the incremental flush is idempotent
with the checkpoint and terminal writes.

## Non-goals

- **Incremental persistence of embeddings.** Embeddings are fast and already covered
by spec 19's stage-level cross-run resume; the dominant loss is descriptions. This
spec scopes incremental persistence to the `descriptions` stage.
- **Changing the timeout default, retry counts, or adding a timeout config knob.**
D2 keeps the moderate default and the single env tuning seam.
- **Routing HTTP backends through the subprocess boundary.** Their native abort
already meets the contract; a subprocess would add cost and a credential-passing
surface for no benefit.
- **A persistent subprocess pool.** One short-lived ktx child per subprocess-backed
call; no pool, no respawn/job-tracking (D1).
- **Re-implementing spec 16 (per-query deadline) or spec 19 (relationship-stage
budget, cost-boundary checkpoint, cross-run stage resume).** This spec composes
above them: spec 16 bounds individual queries, spec 19 makes whole stages durable
and resumable, and this spec hardens the per-table enrichment call's termination
and adds within-stage description durability.
- **A general per-stage incremental-flush framework.** The incremental flush is
specifically the descriptions stage; it is not a generic abstraction over every
enrichment stage.

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns the
design.

- **Bounded per-table call (gap #1)** — `context/scan/description-generation.ts`,
`KtxDescriptionGenerator.generateBatchedTableDescriptions` (the bounded+retry block
~760–866; `enrichTimeoutMs` ~769, `enrichAttempts` ~770, `KtxAbortedError` on
timeout ~811, `enrichment_timeout`/`enrichment_failed` warnings ~858). The retry
loop stays here; each attempt runs through the kill boundary for subprocess
backends.
- **LLM runtime + backend selection** — `context/llm/runtime-port.ts`
(`KtxLlmRuntimePort.generateObject`, `abortSignal` on the input),
`context/llm/local-config.ts` (~127–163, selects `CodexKtxLlmRuntime` /
`ClaudeCodeKtxLlmRuntime` / `AiSdkKtxLlmRuntime`), `context/project/config.ts`
(`KTX_LLM_BACKENDS`). The "owns a subprocess" property should be declared by the
backend/runtime (e.g. on the runtime interface), not inferred from a name list.
- **Subprocess backends** — `context/llm/codex-runtime.ts` +
`context/llm/codex-sdk-runner.ts` (`CodexSdkCliRunner.runStreamed`, the SDK's
`spawn(executable, args, { signal })` is in `@openai/codex-sdk`),
`context/llm/claude-code-runtime.ts` (`collectResult` ~275–322, the `interrupt()`
abort path). These are what the kill boundary must wrap and tree-kill.
- **Reuse spec 16's mechanism (extended to group/tree kill)** —
`connectors/sqlite/read-query-child.ts` (the forked child shape) and
`connectors/sqlite/connector.ts` `runReadQueryOffProcess` (~292–350: `fork`,
deadline timer, `child.kill('SIGKILL')`, `settle()`, the `.js`-if-exists-else-`.ts`
child-URL resolver ~25–27, knip dynamic entry). Gap #1 differs by making the child a
process-group leader and killing the **group/tree** (the SDK grandchild), portably.
Abort helpers: `context/core/abort.ts` (`createAbortError`, `throwIfAborted`,
`linkAbortSignal`). Note the new child hosts an LLM runtime, so the implementer owns
passing the backend config/credentials to it (env/IPC) and serializing the
structured result back.
- **Incremental persistence (gap #2)** —
`context/scan/local-enrichment.ts` (`generateDescriptions` ~279–352: the per-table
`pLimit` fan-out and the in-memory `updates` accumulation; `runEnrichmentStage`
~413/~421–474 with `findCompletedStage` ~427 and `saveCompletedStage`; the
`onCheckpoint` hook ~598–612). Make `generateDescriptions` resume-aware: read the
existing record, skip already-enriched tables, flush per batch, return the merged
full set.
- **Artifact writer + additive merge** — `context/scan/local-enrichment-artifacts.ts`
(`writeLocalScanEnrichmentCheckpoint` ~351–379, `writeEnrichmentDescriptionArtifacts`
with `descriptions.json` ~316, `writeLocalScanManifestShards` ~270–308,
`loadExistingManifestState` ~196–253, `tableDescription`/`columnDescription`
~75–105); `context/scan/manifest.ts` (`mergeDescriptionsPreservingExternal` ~96–115,
`SCAN_MANAGED_DESCRIPTION_KEYS`). Factor a per-batch flush that reuses the additive
description/manifest write; tag the durable record with `inputHash`.
- **Stage store + input hash** —
`context/scan/sqlite-local-enrichment-state-store.ts` (`STAGES_TABLE =
'local_scan_enrichment_stages'`, PK `(connection_id, stage, input_hash)`,
`findCompletedStage`, `saveCompletedStage`),
`context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78). The
whole-stage row stays; the `inputHash` is the gate for the resume-skip set.
- **Scan driver** — `context/scan/local-scan.ts` (the `onCheckpoint` wiring and the
terminal `writeLocalScanEnrichmentArtifacts`), and `KtxScanContext.signal`
(`context/scan/types.ts`) which the kill boundary must honor.
- **Tests** — gap #1: a fake subprocess-backed runtime whose child hangs (ignores
SIGTERM) is killed at a tiny test-seam deadline; assert the await settles within
deadline+grace, the child and a spawned grandchild both exit, and the table is
skipped with `enrichment_timeout`; assert an HTTP-backed abort still settles via the
native path. gap #2: interrupt the descriptions stage after K/N tables (a flush
seam), assert the K are durable (raw artifact + `ai:` in `_schema`) with no completed
stage row; a resume with matching `inputHash` issues no LLM calls for the K and
enriches only N−K; a changed `inputHash` recomputes; regression: a small
uninterrupted ingest yields identical artifacts.
- After implementing, rebuild and re-link so the playground picks it up:
`pnpm run build && pnpm run link:dev`.

## Benchmark context (motivation, not a requirement)

Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment
backend). Re-enriching the giant public datasets, `covid19_usa` wedged at 268/285 for
41+ minutes on 2 hung 252-column tables; the 30-min per-table `AbortSignal` timeout
never killed the hung codex children, and because descriptions checkpoint only at
stage completion, the 283 already-enriched tables were unrecoverable — the operator
had to kill, cache-bust, and re-ingest the database from scratch (with a short timeout
as a stopgap). The benchmark merely exercised a large/wide multi-dataset ingest at
scale; the gaps and the fixes are generic production hygiene for any agent that
enriches a real warehouse with a subprocess LLM backend. Do not encode any benchmark
specifics in the implementation.

## Implementation notes

Implemented on branch `write-feature-spec-wiki`. Both gaps shipped; all acceptance
criteria are covered by tests. The full ktx test surface for the touched code is
green (the only failures in the whole suite are 3 pre-existing assertions in
`test/skills/analytics-skill-content.test.ts` about the analytics SKILL.md markdown
— an unrelated subsystem this change does not touch).

### Gap #1 — enforced timeout for subprocess backends

- **Structural property on the runtime, not a name list.** Added
`subprocessForkSpec(): SubprocessRuntimeForkSpec | null` to `KtxLlmRuntimePort`
(`context/llm/runtime-port.ts`). `CodexKtxLlmRuntime` / `ClaudeCodeKtxLlmRuntime`
return a serializable `{ backend, projectDir, modelSlots }`; `AiSdkKtxLlmRuntime`
(and the deterministic stub) return `null`. The per-table call branches on this,
never on a vendor list (D1).
- **Shared structured core.** Both subprocess runtimes gained
`generateStructuredJson(jsonSchema)` (returns the raw object; the caller
Zod-validates). Their existing `generateObject` was refactored to delegate to the
same streaming core, so structured generation has one implementation.
- **Kill boundary.** New `context/llm/subprocess-generate-object.ts`
(`runGenerateObjectInSubprocess`, `KtxSubprocessDeadlineError`) forks a ktx-owned
child (`subprocess-generate-object-child.ts`) **detached** (process-group leader);
the SDK's model binary inherits the group. On the deadline or `ctx.signal`, ktx
tree-kills the group with `SIGKILL` (`process.kill(-pid, …)` on POSIX,
`taskkill /pid <pid> /T /F` on Windows) and rejects promptly; on success the raw
output is Zod-validated. Credentials reach the child via inherited `process.env`
(the runtimes re-derive their allowlisted env), never over IPC.
- **Wiring.** `KtxDescriptionGenerator.generateBatchedTableDescriptions`
(`context/scan/description-generation.ts`) routes each retry attempt through the
boundary for subprocess backends and keeps the native `AbortSignal` → `fetch`
path for HTTP backends. A fired deadline maps to the existing
`KtxAbortedError`/`enrichment_timeout` no-retry policy (one wedge = one timeout);
default stays 120s (D2).
- **Tests.** `test/context/llm/subprocess-generate-object.test.ts` forks a real
fixture child that spawns a grandchild and ignores SIGTERM, and asserts the
deadline/abort tree-kills both (the grandchild PID is reaped) and the await
settles within deadline+grace; plus success / schema-failure / child-error paths.
`test/context/scan/description-generation.test.ts` adds the generator-level
timeout-skip and the "HTTP backend spawns no child" cases.

### Gap #2 — incremental descriptions persistence + resume

- **Durable record + resume store.** `createKtxScanDescriptionResumeStore`
(`context/scan/local-enrichment-artifacts.ts`) writes the descriptions-so-far to
a durable record (inputHash-tagged) and **only the manifest shards that gained a
table this batch** (new `onlyChangedTableNames` filter on
`writeLocalScanManifestShards`, additive merge preserved). `load(inputHash)`
returns the prior enriched set only on a matching inputHash (D3).
- **Resume-aware fan-out.** `generateDescriptions` (`context/scan/local-enrichment.ts`)
loads the prior record, skips already-enriched tables, enriches only the
remainder, flushes every `DESCRIPTION_FLUSH_EVERY` (10) completed tables (a single
in-flight flush; the final force-flush drains the tail), and returns the full
merged set (recovered + fresh + `null` for still-failed, so failures are retried,
D4). Wired through `local-scan.ts` (store constructed when not `--dry-run`).
- **Graceful-skip backstop (requirement 6).** The per-table worker wraps the call in
a try/catch: any non-cancellation failure degrades to one `null` description + an
`enrichment_failed` warning and the fan-out continues, so no single table can
reject `Promise.all` / abort the stage. This makes the "one skipped table costs one
description, not the stage's output" guarantee live at the stage boundary
(`generateBatchedTableDescriptions` already degrades its own failures; this is the
explicit backstop). A `ctx.signal` cancellation still propagates (the stage fails
and resumes), and the already-flushed descriptions stay durable. This closes the
field bug where a completed-with-skips stage persisted 0 descriptions / 0 stage rows.
- **Deviation from the spec's literal path (necessary correction).** The durable
record lives at a **stable, non-`syncId`** path
(`raw-sources/<connectionId>/live-database/enrichment-progress/descriptions.json`),
not the `syncId`-scoped `…/<syncId>/enrichment/descriptions.json` the spec named.
Reason: a from-scratch interruption (the incident's exact case — no prior
*completed* run) gets a **fresh `syncId`** on the next run
(`buildSyncId` in `context/ingest/local-stage-ingest.ts`), so a `syncId`-scoped
record would be unreachable on resume. The manifest is already at the stable
per-connection scope (`semantic-layer/<connectionId>/_schema/`), so this keeps the
resume source at the same stable scope. The `syncId`-scoped `enrichment/descriptions.json`
debug artifact written by the terminal/checkpoint writers is unchanged.
- **Tests.** `test/context/scan/description-resume.test.ts` drives
`runLocalScanEnrichment` against a real git-backed project: a fresh run flushes a
durable record + `ai:` manifest descriptions; a matching-`inputHash` resume issues
zero LLM calls and returns the full merged set; a partial record re-enriches only
the missing tables; a changed `inputHash` recomputes; the changed-shard filter
rewrites only the affected shard; and (requirement 6) a run where one table fails
still persists the other tables (durable record + `ai:`) and **completes the stage**
(a completed `local_scan_enrichment_stages` row), with the failed table left `null`
for resume.

### Incidental

- Fixed a stale assertion in `description-generation.test.ts` ("does not run
per-column fallback…" expected 1 call) to `3`, matching the retry policy added in
commit `01f63380` (D2 / acceptance: a transient error retries up to the attempt
limit). The HTTP path is unchanged; the assertion simply predated the retry.
- No new `ktx.yaml` config field or runtime knob was added (D2). The rate-limit
governor is not wired into the scan-enrichment path, so the kill-boundary child
loses no pacing.
- Rebuilt and re-linked (`pnpm run build && pnpm run link:dev`); the child compiles
to `dist/context/llm/subprocess-generate-object-child.js`.

@ -1,567 +0,0 @@
# Selective enrichment stages (`--stages`) + per-stage cache keys

> Refined spec. Intake draft: `todo/21-selective-enrichment-stages.md`.
>
> **Scope: make the three enrichment stages independently invalidatable and
> independently re-runnable.** Today one coarse cache key gates all three stages,
> so changing any one stage's inputs re-pays for every stage — most painfully the
> expensive per-table `descriptions`. And there is no CLI surface to re-run a
> chosen subset. This spec splits the key per stage (so a change invalidates only
> the stage it touched) and adds a `--stages` flag that force-re-runs a chosen
> subset while preserving the others. It is the operability follow-on to spec 19
> (durable, cross-run stage resume) and spec 20 (resilient, per-table-resumable
> descriptions); it composes with both rather than replacing them.

## Problem

Enrichment has three stages — **`descriptions`** (one paid LLM call per table),
**`embeddings`** (sentence-transformer vectors over the schema + descriptions),
**`relationships`** (FK/join detection, optionally LLM-proposed). After specs 19
and 20 these stages are durable and resumable, but they are still **coupled for
cache invalidation and unreachable for selective re-run**. Three facts make a
targeted re-run impossible without a full, expensive re-enrich.

### 1. One coarse cache key gates all three stages

`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:611`) computes a single
`inputHash` from `{ snapshot, mode, detectRelationships, providerIdentity,
relationshipSettings }` and every stage reuses it — `descriptions` (~`:642`),
`embeddings` (~`:673`), `relationships` (~`:729`). `providerIdentity` itself
(`localScanProviderIdentity`, `local-scan.ts:241–255`) is one blob conflating the
description LLM identity, the embedding model/dimensions/batch size, **and** the
whole relationship config — and it redundantly re-encodes `mode` and
`relationships`, which the coarse hash already mixes in.

The consequence: flipping `scan.relationships.llmProposals`, switching the LLM
backend, or upgrading the embeddings model changes the **one** hash and so
invalidates **all three** stages. ktx then re-runs the expensive per-table
`descriptions` even though they did not conceptually change. The headline cost of
the system — paid LLM description calls — is thrown away on any unrelated
enrichment-config edit.

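
The coupling can be made concrete with a toy model. This is an illustration under loud assumptions: the input fields are simplified stand-ins, a JSON string stands in for the real hash (`computeKtxScanEnrichmentInputHash`), and the per-stage projection in `stageKey` is a hypothetical split chosen for the example, not the split this spec resolves. Whether a description change should cascade into the embeddings key is deliberately left out.

```typescript
interface EnrichmentInputs {
  snapshot: string; // stand-in for the schema snapshot identity
  descriptionLlm: string;
  embeddingModel: string;
  llmProposals: boolean;
}

type Stage = "descriptions" | "embeddings" | "relationships";
const ALL_STAGES: Stage[] = ["descriptions", "embeddings", "relationships"];

// Today's shape: one key over every input, reused by all three stages.
function coarseKey(inputs: EnrichmentInputs): string {
  return JSON.stringify([
    inputs.snapshot,
    inputs.descriptionLlm,
    inputs.embeddingModel,
    inputs.llmProposals,
  ]);
}

// Hypothetical per-stage projection: each stage keys only on what it reads.
function stageKey(stage: Stage, inputs: EnrichmentInputs): string {
  if (stage === "descriptions") {
    return JSON.stringify([stage, inputs.snapshot, inputs.descriptionLlm]);
  }
  if (stage === "embeddings") {
    return JSON.stringify([stage, inputs.snapshot, inputs.embeddingModel]);
  }
  return JSON.stringify([stage, inputs.snapshot, inputs.llmProposals]);
}

// Which stages would have to recompute after an input change?
function invalidatedStages(before: EnrichmentInputs, after: EnrichmentInputs): Stage[] {
  return ALL_STAGES.filter((stage) => stageKey(stage, before) !== stageKey(stage, after));
}
```

Flipping `llmProposals` changes `coarseKey`, which under a shared key invalidates every stage, while `invalidatedStages` under the per-stage projection reports only `relationships`.
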
### 2. No CLI surface to select stages
|
|
||||||
|
|
||||||
The enrichment internals already support a relationships-only path
|
|
||||||
(`KtxScanMode` `'relationships'`, `types.ts:12` — `descriptions`/`embeddings` are
|
|
||||||
gated on `mode === 'enriched'` at `local-enrichment.ts:632`, while
|
|
||||||
`shouldDetectRelationships` admits `mode === 'relationships'` at `:624–626`). But
|
|
||||||
`ktx ingest` hardcodes `mode: 'enriched'` (`public-ingest.ts:973`) and exposes no
|
|
||||||
flag to select a subset (`ingest-commands.ts:26–49` — only `--no-query-history`
|
|
||||||
and friends). The relationships-only capability is built but unreachable, and there
|
|
||||||
is no way at all to ask for "descriptions only" or "embeddings only."

### 3. The foundation for "touch one stage, keep the rest" already exists

The per-stage store `local_scan_enrichment_stages` is keyed
`(connection_id, stage, input_hash)` (spec 19) and the descriptions write is
additive — `mergeDescriptionsPreservingExternal` (`manifest.ts`) and
`loadExistingManifestState` (`local-enrichment-artifacts.ts`) preserve prior `ai:`,
`db:`, and external description keys on rewrite; spec 20's per-table resume record
(`createKtxScanDescriptionResumeStore`, `local-enrichment-artifacts.ts:286`) already
re-issues LLM calls only for the still-failed tables. So "recompute one stage, leave
the others byte-for-byte" needs only two missing pieces: **per-stage key
granularity** and a **CLI surface** to select stages.

**Requirement:** let an operator re-run a chosen subset of enrichment stages on an
already-ingested connection, recomputing only those stages, preserving the others'
artifacts untouched, and **re-paying only for what genuinely changed** — never
re-running the costly `descriptions` because an unrelated stage's inputs moved.

## Generic use case (independent of any benchmark)

Any team running ktx in production maintains its semantic layer over time: they
improve the description prompt or switch the description LLM, upgrade the embeddings
model, or turn on LLM-proposed joins. Today each of those forces a **full re-enrich
of every connection** — re-running the expensive per-table descriptions even when
only embeddings or relationships changed. Two routine operations should be cheap and
targeted:

- **"Re-embed everything on the new model."** Swapping the embeddings model should
  recompute only embeddings, leaving descriptions and joins on disk.
- **"Backfill joins now that `llmProposals` is on."** Enabling LLM-proposed
  relationships should recompute only relationships.

And one operation needs an explicit trigger because no input changed:

- **"These descriptions came out thin — re-run them with a longer timeout."** A
  connection whose description coverage is poor because tables timed out (same
  snapshot, same LLM, so the hash is unchanged) should be re-runnable on demand,
  cheaply retrying only the tables that failed.

This is core operability for a long-lived ingestion product and is wholly
independent of any benchmark.

## Design decisions (resolved during refinement)

These resolve ambiguities the intake draft left open. They constrain the
implementer; the exact code is theirs (requirement-level, per the specs README).

### D1 — Split the coarse hash into three per-stage input hashes

Replace the single `computeKtxScanEnrichmentInputHash` call with **per-stage** hash
computation, each keyed on only that stage's own inputs. Decompose the
`localScanProviderIdentity` blob into the slices each stage actually depends on:

- **`descriptions`** → `{ snapshot, llmIdentity }`, where `llmIdentity` is the
  description-LLM identity (`llm.models.default`, `baseUrlConfigured`). **Not** the
  embedding model/dimensions/batch size, **not** relationship settings.
- **`embeddings`** → `{ snapshot, embeddingIdentity, descriptionDigest }`, where
  `embeddingIdentity` is `{ model, dimensions, batchSize }` and `descriptionDigest`
  is a stable digest of the resolved description text the embeddings consume (the
  same text `buildEmbeddings` → `buildKtxColumnEmbeddingText` feeds the model,
  `local-enrichment.ts:466–486`, `embedding-text.ts:17–44`). This content-addresses
  embeddings on their real upstream (D4).
- **`relationships`** → `{ snapshot, relationshipSettings, llmIdentity }`, where
  `relationshipSettings` includes `llmProposals` and `detectionBudgetMs`. **Not** the
  description content (decision X, D5), **not** the embedding identity.

`mode` and `detectRelationships` drop out of the per-stage inputs: each stage
produces output under exactly one mode, so the stage name already scopes that, and
re-mixing `mode` only re-couples the keys. After the split, flipping `llmProposals`
invalidates only `relationships`; swapping the embeddings model invalidates only
`embeddings`; switching the description LLM invalidates only `descriptions`.

The per-stage hash becomes the key everywhere a single hash is used today: the
`local_scan_enrichment_stages` lookup/save in `runEnrichmentStage`, and the spec-20
descriptions resume record (`createKtxScanDescriptionResumeStore`), which is now
keyed on the **descriptions** stage's hash — so changing the embedding model no
longer busts the descriptions resume record, a strict improvement.
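
A minimal sketch of the per-stage split described above, not the shipped code. It assumes the snapshot is already reduced to a `snapshotHash` string, uses SHA-256 over a key-sorted JSON encoding, and invents the function names and the exact field shapes for illustration; only the slice contents come from D1.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes: D1 names the slices, the exact fields are assumptions.
interface LlmIdentity { model: string; baseUrlConfigured: boolean }
interface EmbeddingIdentity { model: string; dimensions: number; batchSize: number }
interface RelationshipSettings { llmProposals: boolean; detectionBudgetMs: number }

// Serialize with sorted object keys so equal inputs always hash identically.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const body = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(record[key])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

// The stage name is mixed in so two stages never share a key by accident.
function hashSlice(stage: string, slice: unknown): string {
  return createHash("sha256").update(`${stage}\n${stableStringify(slice)}`).digest("hex");
}

function descriptionsStageHash(snapshotHash: string, llm: LlmIdentity): string {
  return hashSlice("descriptions", { snapshotHash, llm });
}

function embeddingsStageHash(
  snapshotHash: string,
  embedding: EmbeddingIdentity,
  descriptionDigest: string,
): string {
  return hashSlice("embeddings", { snapshotHash, embedding, descriptionDigest });
}

function relationshipsStageHash(
  snapshotHash: string,
  settings: RelationshipSettings,
  llm: LlmIdentity,
): string {
  return hashSlice("relationships", { snapshotHash, settings, llm });
}
```

Because each function sees only its own slice, flipping `llmProposals` changes the `relationships` key while the `descriptions` and `embeddings` keys stay identical, which is the isolation property the acceptance criteria assert.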

> **No migration bridge.** The stage store and the descriptions resume record are
> disposable local `.ktx` state (regenerable from a fresh ingest). The new per-stage
> keys simply miss the old coarse-keyed rows, forcing one full re-enrich on the next
> run after upgrade. Recreate/ignore stale-shaped records with no compatibility
> shim, consistent with specs 19/20 and ktx's no-backward-compatibility policy.

### D2 — `--stages <comma-list>` selects a subset; one gate, no new mode

Add `ktx ingest [connectionId] --stages <comma-list>`, a non-empty subset of
`descriptions,embeddings,relationships`. Plural because it takes a **set**:
`--stages relationships` and `--stages descriptions,embeddings` both read naturally,
and the plural signals "list expected." Flag absent = all three (today's behavior).

A Commander custom parser validates each name against the canonical stage registry
and parses into an ordered, de-duplicated set. **An unknown or empty stage name is a
hard `InvalidArgumentError`** — never silently ignored. The set threads CLI →
`runKtxPublicIngest` (`KtxScanArgs`) → `runLocalScan` → `runLocalScanEnrichment`.
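
The parsing rule above can be sketched as follows. This is an illustration under assumptions: the local `InvalidArgumentError` class is a stand-in for Commander's so the snippet runs without dependencies, `ENRICHMENT_STAGES` and `parseStagesOption` are hypothetical names, and "ordered" is read here as canonical registry order rather than input order.

```typescript
// Stand-in for Commander's InvalidArgumentError (assumption: keeps the sketch dependency-free).
class InvalidArgumentError extends Error {}

// Hypothetical registry; the spec only fixes the three stage names.
const ENRICHMENT_STAGES = ["descriptions", "embeddings", "relationships"] as const;
type EnrichmentStage = (typeof ENRICHMENT_STAGES)[number];

function isEnrichmentStage(name: string): name is EnrichmentStage {
  return (ENRICHMENT_STAGES as readonly string[]).includes(name);
}

// Parse "a,b,c" into a de-duplicated list; any unknown or empty name is a hard error.
function parseStagesOption(raw: string): EnrichmentStage[] {
  const selected = new Set<EnrichmentStage>();
  for (const part of raw.split(",")) {
    const name = part.trim();
    if (name === "") {
      throw new InvalidArgumentError("Empty stage name in --stages.");
    }
    if (!isEnrichmentStage(name)) {
      throw new InvalidArgumentError(
        `Unknown stage "${name}". Expected one of: ${ENRICHMENT_STAGES.join(", ")}.`,
      );
    }
    selected.add(name);
  }
  // Emit in registry order regardless of how the operator typed the list.
  return ENRICHMENT_STAGES.filter((stage) => selected.has(stage));
}
```

With Commander, a function of this shape would typically be passed as the option's argument parser, so the error surfaces at parse time rather than deep inside the scan.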

Inside enrichment the run set is **`(mode/provider-eligible stages) ∩ (selected
stages)`** — a single gate. Each existing stage block additionally checks
membership in the selected set (`descriptions`/`embeddings` already gate on
`mode === 'enriched'` + providers; `relationships` on `shouldDetectRelationships`).
This adds **no** new `KtxScanMode` variant and **no** second parallel selection
path; `mode` keeps meaning "the connection's enrichment level," and `--stages` means
"which of those stages to (re)compute this run." A named stage that cannot run
because a prerequisite is absent (e.g. `--stages embeddings` with no embedding
provider configured) MUST fail or warn clearly, never silently no-op.
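
As a rough illustration of the single gate, the run set can be computed as an intersection, with explicitly named but ineligible stages reported instead of dropped. `planStages` and `StagePlan` are hypothetical names, and how eligibility is derived from `mode` and providers is left out.

```typescript
type EnrichmentStage = "descriptions" | "embeddings" | "relationships";

interface StagePlan {
  run: EnrichmentStage[];     // eligible ∩ selected
  blocked: EnrichmentStage[]; // explicitly selected, but a prerequisite is missing
}

// `selected === undefined` models "flag absent": every eligible stage runs.
function planStages(
  eligible: ReadonlySet<EnrichmentStage>,
  selected: readonly EnrichmentStage[] | undefined,
): StagePlan {
  const all: EnrichmentStage[] = ["descriptions", "embeddings", "relationships"];
  if (selected === undefined) {
    return { run: all.filter((stage) => eligible.has(stage)), blocked: [] };
  }
  return {
    run: selected.filter((stage) => eligible.has(stage)),
    blocked: selected.filter((stage) => !eligible.has(stage)),
  };
}
```

A non-empty `blocked` list is where the caller would fail or push a warning, so `--stages embeddings` without an embedding provider never degrades into a silent no-op.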

> Rejected alternative — repurpose `mode` (`--stages relationships` →
> `mode: 'relationships'`). It only expresses single-stage cases, leaves
> `descriptions,embeddings` with no mode, and creates two ways to say "relationships
> only." The explicit stage set is the one canonical selector.

### D3 — A named stage force-re-runs; per-table resume still avoids re-paying

Naming a stage in `--stages` carries the intent "recompute this," so a named stage
**re-enters its `compute()`, bypassing the spec-19 completed-row short-circuit** in
`runEnrichmentStage` (`local-enrichment.ts:538–547`). The spec-20 machinery still
applies **inside** `compute()`:

- `--stages descriptions` re-enters `generateDescriptions`, which loads the
  per-table resume record and re-issues LLM calls **only for the still-null/failed
  tables** (when the descriptions hash is unchanged) — the "fill thin coverage with
  a longer `KTX_ENRICH_LLM_TIMEOUT_MS`" case, paying only for the gaps.
- A genuine input change (e.g. switching the LLM → a new descriptions hash)
  invalidates the resume record and rebuilds the stage fully, as today.

Stages **not** named are skipped entirely — not run, not resumed — and their
on-disk artifacts are left exactly as they are (additive write; preserve-others is
already the behavior). The **no-flag default is unchanged**: all eligible stages
run, the completed-row short-circuit is respected (spec-19 cross-run resume).

Behavior follows from the input (did you explicitly name the stage?), not the call
path. A consequence to state plainly: `--stages descriptions,embeddings,relationships`
is **not** identical to passing no flag — naming all three is the explicit "force a
full enrichment recompute," whereas no flag is "ingest, resuming whatever is done."
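
The interaction between the stage-level bypass and the per-table resume can be sketched like this. Everything here is a simplified, synchronous stand-in: `runStage`, `StageStore`, and `describeMissingTables` are hypothetical names, and the real stage runner and resume store are presumably asynchronous and richer.

```typescript
// Hypothetical stand-in for the completed-stage store.
interface StageStore<T> {
  findCompleted(stage: string, inputHash: string): T | undefined;
  saveCompleted(stage: string, inputHash: string, result: T): void;
}

interface RunStageOptions<T> {
  stage: string;
  inputHash: string;
  forceRecompute: boolean; // true when the stage was named in --stages
  store: StageStore<T>;
  compute: () => T;
}

// Stage level: a forced stage skips the completed-row lookup and re-enters compute().
function runStage<T>(options: RunStageOptions<T>): T {
  if (!options.forceRecompute) {
    const completed = options.store.findCompleted(options.stage, options.inputHash);
    if (completed !== undefined) return completed;
  }
  const result = options.compute();
  options.store.saveCompleted(options.stage, options.inputHash, result);
  return result;
}

// Table level: inside compute(), only tables missing from the resume record are retried.
function describeMissingTables(
  tables: readonly string[],
  resume: ReadonlyMap<string, string>,
  describe: (table: string) => string | null, // null models a timeout or failure
): Map<string, string> {
  const next = new Map(resume);
  for (const table of tables) {
    if (next.has(table)) continue; // already described: no LLM call
    const text = describe(table);
    if (text !== null) next.set(table, text);
  }
  return next;
}
```

The two layers are independent: forcing the stage decides whether `compute()` is entered at all, while the resume record decides which tables are paid for once inside it.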

### D4 — Downstream staleness: one real edge, content-addressed, surfaced not silent

The only hard dependency between stages is **`descriptions → embeddings`**
(embeddings embed the description text; `relationships` is decoupled, D5). Two
mechanisms keep it correct without a hardcoded dependency table:

- **Self-healing via content-addressing.** Because the embeddings hash includes
  `descriptionDigest` (D1), re-running `descriptions` changes that digest, so a
  later embeddings run (or a full ingest) sees a hash miss and recomputes — stale
  embeddings can never silently persist across a future embeddings run. (Without
  this, the embeddings hash would be unchanged after a description edit and a later
  run would wrongly short-circuit on stale vectors.)
- **Surfaced immediately.** After a selective run, for each **unselected** stage that
  has artifacts on disk, recompute its *current* per-stage hash from on-disk state
  and compare it to the stored completed-row hash; if they differ, emit a
  **recoverable `enrichment_stage_stale` warning** naming the stale stage and the
  cascade command (e.g. `--stages descriptions,embeddings`). This is derived from the
  system's own state — it also catches "you changed the embedding model in `ktx.yaml`
  but only ran `--stages descriptions`."

The run **never silently leaves a stale-but-unflagged downstream**, and **never
silently auto-cascades** extra work — the operator is told and decides. Re-running
`descriptions` does **not** flag `relationships` stale (D5).
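
A sketch of the derived staleness check, assuming the per-stage hashes have already been recomputed from on-disk state and passed in as strings. `findStaleStages` and the warning shape are hypothetical, and the message wording (which command it suggests) is illustrative only.

```typescript
type EnrichmentStage = "descriptions" | "embeddings" | "relationships";

interface StageStaleWarning {
  code: "enrichment_stage_stale";
  stage: EnrichmentStage;
  recoverable: true;
  message: string;
}

// Compare each unselected stage's stored completed-row hash to its current hash.
function findStaleStages(input: {
  selected: readonly EnrichmentStage[];
  storedHashes: Partial<Record<EnrichmentStage, string>>; // absent = never completed
  currentHashes: Record<EnrichmentStage, string>;
}): StageStaleWarning[] {
  const order: EnrichmentStage[] = ["descriptions", "embeddings", "relationships"];
  const warnings: StageStaleWarning[] = [];
  for (const stage of order) {
    if (input.selected.includes(stage)) continue; // it just ran, so it is fresh
    const stored = input.storedHashes[stage];
    if (stored === undefined) continue; // nothing on disk to go stale
    if (stored === input.currentHashes[stage]) continue;
    warnings.push({
      code: "enrichment_stage_stale",
      stage,
      recoverable: true,
      message: `Stage "${stage}" is stale; re-run with --stages ${stage}.`,
    });
  }
  return warnings;
}
```

Since the relationships hash does not include description content, a descriptions-only run leaves its current hash equal to the stored one, and no warning is produced for it without any special-casing.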

### D5 — Relationships are decoupled from description content, but still get it as context

`relationships` keys on `{ snapshot, relationshipSettings, llmIdentity }` and is
**not** invalidated or stale-flagged by a description change (decision X). Rationale:
relationships are the low-value, best-effort, expensive-to-probe stage (spec 19's
own framing); coupling them to description content would make every routine
description re-run also invalidate joins — re-opening the exact over-invalidation
this spec exists to close.

Independently, a `relationships`-only run (descriptions stage not running this
invocation) MUST **hydrate its working schema from the persisted on-disk enriched
`_schema`** (AI descriptions + embeddings) so `llmProposals` runs with full
description context, not raw column names. Today the relationship stage builds its
schema from the bare snapshot (db comments only — `local-enrichment.ts:621,688,740`
never merge the AI descriptions), so this also closes a latent gap: both the
full-run and the relationships-only paths MUST feed `llmProposals` the
best-available descriptions (fresh-this-run if `descriptions` ran, else on-disk) —
behavior from inputs, not path.
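
The "fresh-this-run, else on-disk" rule can be sketched as a small resolver plus a context builder. The `"table.column"` keying, `resolveDescriptions`, and `buildProposalContext` are assumptions made for illustration; the actual schema types are not shown in this spec.

```typescript
// Hypothetical keying: "table.column" -> resolved description text.
type DescriptionMap = ReadonlyMap<string, string>;

interface ColumnRef { table: string; column: string }
interface ColumnContext extends ColumnRef { description: string | null }

// Prefer descriptions produced in this invocation; otherwise load the persisted ones lazily.
function resolveDescriptions(
  freshThisRun: DescriptionMap | undefined,
  loadOnDisk: () => DescriptionMap,
): DescriptionMap {
  return freshThisRun ?? loadOnDisk();
}

// Attach the resolved description to every column handed to the proposal step.
function buildProposalContext(
  columns: readonly ColumnRef[],
  descriptions: DescriptionMap,
): ColumnContext[] {
  return columns.map(({ table, column }) => ({
    table,
    column,
    description: descriptions.get(`${table}.${column}`) ?? null,
  }));
}
```

Both the full run and the relationships-only run would go through the same resolver, so which descriptions reach `llmProposals` depends only on what is available, not on which path invoked the stage.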

### D6 — Scope: enrichment stages only, composable with existing flags

`--stages` controls only the three enrichment stages. It is **orthogonal to and
composable with** the existing `--no-query-history` flag — a pure joins backfill
across everything is `ktx ingest --all --stages relationships --no-query-history`.
Schema introspection still runs (it is the hash substrate and the enrichment base,
and it is cheap — no LLM). The stage-name namespace is built as a **registry** so it
can later extend to the broader scan phases (schema / query-history / source /
memory) and subsume the inconsistent negative `--no-query-history` flag — but that
unification is **out of scope** here.

## Requirements

### 1. Per-stage input hashes

Each enrichment stage MUST key its cache lookup/save and (for `descriptions`) its
resume record on a hash of only that stage's own inputs, per D1
(`descriptions` ← snapshot + LLM identity; `embeddings` ← snapshot + embedding
identity + a digest of the embedded description text; `relationships` ← snapshot +
relationship settings + LLM identity). Changing one stage's inputs MUST invalidate
**only** that stage. The single coarse `computeKtxScanEnrichmentInputHash` over
`{ snapshot, mode, detectRelationships, providerIdentity, relationshipSettings }`
MUST be removed in favor of per-stage computation. The stage store and the
descriptions resume record MAY be recreated without a migration bridge (disposable
local state).

### 2. `--stages` flag with strict validation

`ktx ingest` MUST accept `--stages <comma-list>`, a non-empty subset of
`descriptions,embeddings,relationships`, defaulting (when absent) to all three. An
unknown or empty stage name MUST be a hard parse error (`InvalidArgumentError`),
never silently ignored. The selected set MUST thread through to enrichment and gate
which stage blocks run as `(mode/provider-eligible) ∩ (selected)` — one gate, no new
`KtxScanMode` variant, no second selection path. A selected stage whose prerequisite
is missing MUST fail or warn clearly, not silently no-op.

### 3. Selecting a stage force-re-runs it; unselected stages are preserved

A stage named in `--stages` MUST re-enter its `compute()`, bypassing the
completed-stage short-circuit, while still using the spec-20 per-table resume record
so `descriptions` re-issues LLM calls only for still-failed tables (unchanged hash)
and rebuilds fully on a changed hash. A stage **not** named MUST NOT run and MUST
leave its on-disk artifacts untouched. The no-flag default MUST preserve spec-19
cross-run resume (all eligible stages, completed-row short-circuit respected).

### 4. Downstream staleness is surfaced, never silent

After a selective run, the run MUST emit a recoverable `enrichment_stage_stale`
warning for every **unselected** stage whose current per-stage hash no longer
matches its stored completed-row hash (derived from on-disk state, naming the stage
and the cascade command). The embeddings hash MUST include a digest of the embedded
description text so a later embeddings run self-heals after a description change. The
run MUST NOT silently leave a stale-but-unflagged downstream and MUST NOT silently
auto-cascade. A description change MUST NOT stale-flag `relationships`.

### 5. Relationships run with description context

When the `relationships` stage runs without `descriptions` having run in the same
invocation, it MUST hydrate its working schema from the persisted on-disk enriched
`_schema` (AI descriptions + embeddings) so `llmProposals` has the same description
context as a full enriched run, not bare column names. The full-run and
relationships-only paths MUST feed `llmProposals` descriptions consistently.

### 6. No regression for normal ingests

A normal `ktx ingest` with no `--stages` flag MUST produce the same artifacts as
today (descriptions, embeddings, manifest, relationships) and MUST preserve spec-19
cross-run resume and spec-20 per-table description resume. The per-stage hash split
MUST NOT change a normal run's output, only which stages a *changed* input
invalidates.

## Acceptance criteria

- **Per-stage invalidation isolation:** flipping `scan.relationships.llmProposals`
  re-runs only `relationships` (descriptions + embeddings resolve from cache, no LLM
  description calls, no re-embedding); swapping the embeddings model re-runs only
  `embeddings`; switching the description LLM re-runs only `descriptions`. Verified by
  asserting no LLM description calls / no embed calls for the unaffected stages.
- **Flag parse + validation:** `--stages relationships` and
  `--stages descriptions,embeddings` parse to the right set; `--stages foo`,
  `--stages` (empty), and `--stages descriptions,foo` each fail with a clear
  `InvalidArgumentError`.
- **Resume-aware force-rerun:** on a connection whose `descriptions` stage completed
  with K failed/null tables (unchanged hash), `--stages descriptions` re-issues LLM
  calls for exactly those K tables and leaves the already-good descriptions
  untouched; the run completes and the K are now enriched. A changed descriptions
  hash instead rebuilds all tables.
- **Preserve others:** after `--stages descriptions`, the on-disk `embeddings` and
  `relationships` artifacts are byte-stable (unselected stages did not run).
- **Derived staleness warning:** after `--stages descriptions` changes the
  descriptions, the run emits `enrichment_stage_stale` for `embeddings` (its
  recomputed hash diverged) and does **not** emit it for `relationships` (decision
  X); a subsequent `--stages embeddings` clears it.
- **Relationships context:** a `--stages relationships` run on an already-described
  connection feeds the on-disk AI descriptions into `llmProposals` (verified: the
  proposal prompt carries descriptions, not just column names).
- **No regression:** a normal uninterrupted `ktx ingest` (no flag) yields identical
  artifacts and the same descriptions/embeddings/relationship output as today, with
  spec-19/20 resume intact.

## Non-goals

- **Unifying `--stages` with the broader scan phases or `--no-query-history`.** The
  namespace is built to extend later; this spec ships only the three enrichment
  stages, composable with the existing query-history flag (D6).
- **A new `KtxScanMode` variant or a second stage-selection path.** One gate,
  `(eligible) ∩ (selected)` (D2).
- **Coupling `relationships` to description content** (decision X, D5). Improving
  descriptions does not invalidate or stale-flag joins.
- **Auto-cascading downstream re-runs.** Staleness is surfaced as a warning; the
  operator chooses to cascade (D4).
- **Capturing prompt/code-level description-prompt changes in the hash.** The
  descriptions hash keys on snapshot + LLM identity (config/model), not the prompt
  text; a pure prompt improvement that does not change a hash input will not
  force-rebuild already-good descriptions. Forcing that is out of scope — the
  operator changes a real input or selects the stage with a changed config.
- **Re-implementing spec 19 (cross-run stage resume, completed-row store) or spec 20
  (per-table description resume, enforced timeout).** This spec composes above them:
  it splits the key those stages resume on and adds the CLI surface to select and
  force-re-run stages.
- **A general per-phase incremental-flush framework.** The selection mechanism is the
  three enrichment stages; it is not a generic abstraction over every ingest phase.

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns the
design.

- **Coarse hash → per-stage hashes** — `context/scan/enrichment-state.ts`
  (`computeKtxScanEnrichmentInputHash` `:78`, `ComputeKtxScanEnrichmentInputHashInput`
  `:57`): replace with per-stage hash functions (or one function taking a per-stage
  input slice). `context/scan/local-enrichment.ts` (`:611` single hash; the three
  `runEnrichmentStage` calls at `descriptions` ~`:635`, `embeddings` ~`:666`,
  `relationships` ~`:722`; `runEnrichmentStage` `:524` and its short-circuit
  `:538–547`). The `descriptions` hash also feeds `generateDescriptions`'
  `resumeStore.load(inputHash)` (`:345`).
- **Provider-identity decomposition** — `context/scan/local-scan.ts`
  (`localScanProviderIdentity` `:241–255`, the enrichment call site `:498–537`):
  split into `llmIdentity` / `embeddingIdentity`, drop the redundant `mode` /
  `relationships` re-encoding, and pass each stage only its slice.
- **`descriptionDigest`** — `context/scan/local-enrichment.ts` (`buildEmbeddings`
  `:457–486`) and `context/scan/embedding-text.ts` (`buildKtxColumnEmbeddingText`
  `:17–44`): digest the resolved per-column/table description text that the embeddings
  consume, and fold that digest into the embeddings hash.
- **CLI flag** — `commands/ingest-commands.ts` (`:26–49` option declarations,
  `:51–104` action handler): add `--stages` with a custom parser that validates
  against the canonical stage registry (`KTX_SCAN_ENRICHMENT_STAGES` in
  `enrichment-state.ts:4`) and rejects unknown/empty names with `InvalidArgumentError`.
  Thread through `public-ingest.ts` (`KtxScanArgs` build `:969–978`, `mode: 'enriched'`
  `:973`) → `scan.ts` (`runKtxScan`) → `local-scan.ts` (`runLocalScan`) →
  `runLocalScanEnrichment`.
- **Stage gating + force-rerun** — `context/scan/local-enrichment.ts`: gate each stage
  block on membership in the selected set (`descriptions` `:632`, `embeddings`
  `:663–665`, `relationships` `:720`); make a named stage bypass the completed-row
  short-circuit in `runEnrichmentStage` while the inner `compute()` keeps the spec-20
  per-table resume. `KtxLocalScanEnrichmentInput` (`:60–85`) gains the selected-stage
  set.
- **Staleness detection + warning** — `context/scan/local-enrichment.ts` (after the
  stage blocks): recompute each unselected stage's current hash from on-disk state,
  compare to the stored completed-row hash, push a recoverable warning on mismatch.
  Add `enrichment_stage_stale` to the `KtxScanWarningCode` union in
  `context/scan/types.ts` (alongside `relationship_detection_partial`).
- **Relationships description context** — `context/scan/local-enrichment.ts`
  (`schema` built at `:621`/`:688`, passed to `discoverKtxRelationships` `:736–746`):
  hydrate `schema` with the best-available descriptions (fresh-this-run or loaded from
  the on-disk `_schema` via `loadExistingManifestState`,
  `local-enrichment-artifacts.ts`) before relationship detection.
- **Stage store + resume record** —
  `context/scan/sqlite-local-enrichment-state-store.ts`
  (`local_scan_enrichment_stages`, PK `(connection_id, stage, input_hash)`,
  `findCompletedStage`, `saveCompletedStage`); `createKtxScanDescriptionResumeStore`
  (`local-enrichment-artifacts.ts:286–332`, path `:265–267`, inputHash gate
  `:305–307`) — both now keyed on the relevant per-stage hash. No migration bridge.
- **Config inputs** — `context/project/config.ts` (`scanRelationshipsSchema`
  `:171–218` incl. `llmProposals` `:174` and `detectionBudgetMs`;
  `scan.enrichment.embeddings` model/dimensions/batchSize; `llm.models.default`,
  `llm.provider.gateway.base_url`): the sources of each per-stage identity slice.
- **Tests** — per-stage invalidation isolation (flip one input, assert only the
  matching stage recomputes); `--stages` parse/validate (good subsets + unknown/empty
  rejected); resume-aware force-rerun (`--stages descriptions` retries only the null
  tables, leaves good ones, completes); preserve-others (unselected artifacts
  byte-stable); derived staleness (`enrichment_stage_stale` fires for embeddings after
  a descriptions change, not for relationships; cleared by a later `--stages
  embeddings`); relationships-only run feeds on-disk descriptions to `llmProposals`;
  regression — a normal no-flag ingest yields identical artifacts with spec-19/20
  resume intact.
- After implementing, rebuild and re-link so the playground picks it up:
  `pnpm run build && pnpm run link:dev`.
- **Docs:** add `--stages` to the `ktx ingest` CLI reference
  (`docs-site/content/docs/cli-reference/`) and note the per-stage cache behavior
  where enrichment/ingest is described.

## Benchmark context (motivation, not a requirement)

Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A
level-aware audit found (a) a tail of BigQuery datasets with poor *column*-description
coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …) that want a **`descriptions`-only**
re-run with a longer timeout, and (b) a desire to **backfill joins** across all
already-ingested datasets after enabling `llmProposals` — without re-paying for
descriptions. Both were blocked by the coarse single `inputHash` (flipping
`llmProposals` or re-describing invalidated the whole enrichment) and the absence of a
stage-selective CLI flag. The benchmark merely exercised multi-backend ingestion at
scale; the gap and the fix are generic production operability. Do not
encode any benchmark specifics in the implementation.

## Implementation notes

Shipped on branch `write-feature-spec-wiki`. All six requirements implemented;
all acceptance criteria covered by tests.

**What was built / where:**

- **Per-stage hashes (D1, Req 1).** `context/scan/enrichment-state.ts`: removed the
  coarse `computeKtxScanEnrichmentInputHash` and added
  `computeKtxDescriptionsStageHash` (snapshot + `llmIdentity`),
  `computeKtxEmbeddingsStageHash` (snapshot + `embeddingIdentity` + `descriptionDigest`),
  `computeKtxRelationshipsStageHash` (snapshot + `relationshipSettings` + `llmIdentity`),
  plus `computeKtxScanDescriptionDigest` and the `KtxScanLlmIdentity` /
  `KtxScanEmbeddingIdentity` types. `KTX_SCAN_ENRICHMENT_STAGES` is now exported as the
  canonical registry. `local-scan.ts` `localScanProviderIdentity` was split into
  `localScanLlmIdentity` + `localScanEmbeddingIdentity` (dropping the redundant
  `mode`/`relationships` re-encoding). `mode`/`detectRelationships` dropped out of the
  keys. No migration bridge — the stage store + descriptions resume record just miss the
  old coarse-keyed rows.
- **`descriptionDigest` (D1/D4).** `local-enrichment.ts`: extracted
  `buildKtxColumnEmbeddingTexts(snapshot, descriptions)`, shared by the embeddings stage
  and the digest, so the embeddings hash content-addresses the exact text the model sees.
- **`--stages` flag (D2/D6, Req 2).** `commands/ingest-commands.ts`:
  `parseEnrichmentStagesOption` (Commander parser) validates against the registry,
  rejects unknown/empty with `InvalidArgumentError`, returns an ordered de-duplicated
  set; threaded through `KtxPublicIngestArgs` → `context-build-view` → `KtxScanArgs` →
  `RunLocalScanOptions` → `KtxLocalScanEnrichmentInput`. One gate
  (`(eligible) ∩ (selected)`); no new `KtxScanMode`. A selected-but-ineligible stage
  emits a new `enrichment_stage_skipped` warning (never a silent no-op).
- **Force-rerun (D3, Req 3).** `runEnrichmentStage` gained `forceRecompute`; a named
  stage bypasses the spec-19 completed-row short-circuit while `generateDescriptions`
  still consults the spec-20 per-table resume record (retries only failed tables on an
  unchanged hash).
- **Descriptions hydration + `llmProposals` context (D5, Req 5).** `runLocalScanEnrichment`
  resolves best-available descriptions (fresh-this-run, else on-disk via a lazy
  `loadPriorDescriptions` thunk wired from `local-scan.ts` →
  `loadOnDiskDescriptionUpdates` in `local-enrichment-artifacts.ts`). `snapshotToKtxEnrichedSchema`
  now merges `ai` descriptions, and `relationship-llm-proposal.ts` `buildEvidencePacket`
  now carries the resolved description text — closing the latent gap on **both** the
  full-run and relationships-only paths.
- **Derived staleness (D4, Req 4).** `enrichment_stage_stale` warning code +
  `findLatestCompletedStage` on the state store (interface + sqlite + test store). After a
  selective run, each unselected stage with a completed row is compared against its
  freshly recomputed hash; a mismatch warns and names the cascade command. Relationships
  are never flagged by a description change (decoupled per D5).
- **Docs.** `docs-site/content/docs/cli-reference/ktx-ingest.mdx`: `--stages` flag row, a
  "Selecting enrichment stages" section (per-stage cache, force-rerun, staleness), and
  examples.

**Deviation from the spec — hydration is descriptions-only (no embeddings).** D5 states a
relationships-only run should hydrate "AI descriptions **and** embeddings" from the
on-disk `_schema`. Investigation found the `_schema` manifest shards store only
descriptions; embedding vectors are written to a **syncId-scoped** `enrichment/embeddings.json`
that no code reads back, and each run mints a fresh syncId — so there is no durable
per-connection embeddings artifact to hydrate from. A relationships-only run therefore
hydrates **descriptions** (required for, and verified against, the `llmProposals`
acceptance criterion) but **not** embeddings. Consequence: a `--stages relationships`
backfill gets deterministic + name-based + LLM-proposed candidates (the point of
`llmProposals`), but not the embedding-similarity candidates a full run would add.
Durable embeddings hydration (persist vectors at a stable per-connection path, or read
them from the vector index) is a clean follow-on and was left out of scope.
|
|
||||||
|
|
||||||
**Tests:** `enrichment-state.test.ts` (per-stage hash stability + isolation),
|
|
||||||
`commands/ingest-commands.test.ts` (parser good/bad subsets, threading, text-capture
|
|
||||||
guard), `local-enrichment.test.ts` (force-rerun bypasses short-circuit + preserves
|
|
||||||
others, naming all three forces a full recompute, per-stage invalidation isolation,
|
|
||||||
prerequisite warning, on-disk descriptions reach `llmProposals`, resume-aware forced
|
|
||||||
descriptions rerun, derived `enrichment_stage_stale` fires for embeddings but not
|
|
||||||
relationships and clears after re-embed). Full `pnpm --filter @kaelio/ktx run test`,
|
|
||||||
`type-check`, `dead-code`, and `build` pass. (One pre-existing unrelated failure in
|
|
||||||
`test/skills/analytics-skill-content.test.ts` — the analytics `SKILL.md` lacks a
|
|
||||||
`**Window functions**` heading the test expects — was present before this work and left
|
|
||||||
untouched.)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## ⚠️ Defect found in post-implementation validation (2026-06-24)
|
|
||||||
|
|
||||||
**A `--stages` subset excluding `descriptions` WIPES existing on-disk descriptions.** Violates the requirement
|
|
||||||
"preserve-others / a selective run never deletes another stage's artifacts."
|
|
||||||
|
|
||||||
**Reproduction (deterministic):**
|
|
||||||
- `northwind` before: 110 `ai:` column/table descriptions, 0 join edges.
|
|
||||||
- `ktx-dev ingest northwind --stages relationships` → completes in ~35s, adds **22 join edges** ✅
|
|
||||||
but the rewritten `public.yaml` has **0 descriptions** (no `ai:`, no `db:`, columns bare). ❌
|
|
||||||
- A full `ktx-dev ingest northwind` (all stages) restores 110 descriptions + keeps the 22 joins.
|
|
||||||
|
|
||||||
**Likely root cause:** the relationships-only path rewrites the schema from the raw snapshot + only the
|
|
||||||
freshly-run stage. The implementation notes claim `snapshotToKtxEnrichedSchema` merges `ai` descriptions
|
|
||||||
and that descriptions are hydrated "fresh-this-run, else on-disk via `loadPriorDescriptions`" — but on the
|
|
||||||
**write path** of a subset run the prior descriptions are NOT merged into the emitted schema (they reach
|
|
||||||
the `llmProposals` evidence packet only). So the on-disk `_schema` loses them.
|
|
||||||
|
|
||||||
**Impact:** blocks the intended joins-everywhere backfill (`--stages relationships` across all dbs) and the
|
|
||||||
`--stages descriptions`-only re-runs — either would destroy the unselected stage's artifacts across every
|
|
||||||
db. Caught on a 1-db validation before any rollout.
|
|
||||||
|
|
||||||
**Acceptance fix:** after any `--stages` subset, the on-disk `_schema` must **retain all prior `ai:`/`db:`
|
|
||||||
descriptions** (and prior joins when descriptions-only) for stages not named — only the named stages'
|
|
||||||
artifacts change. Add a regression test that ingests a fully-enriched fixture, runs `--stages relationships`,
|
|
||||||
and asserts description count is unchanged while joins increase.
|
|
||||||
|
|
||||||
### ✅ Fixed (2026-06-24)
|
|
||||||
|
|
||||||
**Real root cause (deeper than the first diagnosis):** the wipe happened in **two** places, and the first
|
|
||||||
fix attempt only addressed one. `runLocalScan` (`context/scan/local-scan.ts`) writes the **structural**
|
|
||||||
manifest shard from the bare snapshot *before* enrichment runs; that write merges with the on-disk shard,
|
|
||||||
but the merge (`mergeDescriptionsPreservingExternal`, `live-database/manifest.ts`) treats `ai`/`db` as
|
|
||||||
**scan-managed** and overwrites them with whatever the run emits — and the structural write emits none. So a
|
|
||||||
subset run deleted the descriptions on the structural pre-write, *then* `runLocalScanEnrichment` read the
|
|
||||||
already-wiped shard via `loadPriorDescriptions` and had nothing to restore. (A unit-level enrichment test
|
|
||||||
passed because it never exercised the structural pre-write — a divergent-harness miss; the regression test
|
|
||||||
was rewritten to go through the full `runLocalScan` path.)
|
|
||||||
|
|
||||||
**What changed:**
|
|
||||||
- `runLocalScanEnrichment` (`local-enrichment.ts`) now returns the **best-available** descriptions
|
|
||||||
(`resolveDownstreamDescriptions()` — fresh-this-run if `descriptions` ran, else the on-disk ones) as
|
|
||||||
`descriptionUpdates`, instead of `[]` when the stage is skipped — so the enrichment write re-applies them.
|
|
||||||
- `runLocalScan` (`local-scan.ts`) now, on a subset run, **captures the prior on-disk descriptions before
|
|
||||||
the structural manifest write** and feeds them to both the structural write and enrichment — so the
|
|
||||||
structural pre-write preserves them too (robust even if relationship detection later fails).
|
|
||||||
- Joins were already preserved for `--stages descriptions` via the existing manual/inferred
|
|
||||||
`preservedJoins` path; verified by a symmetric test.
|
|
||||||
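The ordering problem behind the two-place wipe can be shown with a toy model. This is a sketch under stated assumptions: `Descriptions`, `writeShard`, and `subsetRun` are hypothetical names, and the model keeps only the one property the notes describe, namely that a shard write treats `ai` descriptions as scan-managed and ends up with exactly what the run emits.

```typescript
type Descriptions = Map<string, string>; // hypothetical: column path -> `ai` text

// Models the merge described above: `ai` descriptions are scan-managed,
// so the shard ends up with exactly what the run emits.
function writeShard(emitted: Descriptions): Descriptions {
  return new Map(emitted);
}

// A `--stages relationships` style run in which `descriptions` is skipped.
function subsetRun(onDisk: Descriptions, captureFirst: boolean): Descriptions {
  // The fix: read prior descriptions BEFORE the structural pre-write.
  const prior: Descriptions = captureFirst ? new Map(onDisk) : new Map();
  // Structural pre-write from the bare snapshot (emits none unless fed `prior`).
  let shard = writeShard(prior);
  // Enrichment write: re-apply the best available set, which without the
  // early capture is whatever survived the pre-write (nothing).
  const bestAvailable = captureFirst ? prior : shard;
  shard = writeShard(bestAvailable);
  return shard;
}
```

Without the early capture the enrichment step can only read the already-wiped shard, so returning on-disk descriptions from enrichment alone would not have been enough.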
|
|
||||||
**Tests:** `local-scan.test.ts` — a full `runLocalScan` `--stages relationships` run preserves on-disk `ai`
|
|
||||||
descriptions while adding a join (RED without the fix, GREEN with it). `local-enrichment.test.ts` — the
|
|
||||||
enrichment-layer contract (`--stages relationships` preserves descriptions / `--stages descriptions`
|
|
||||||
preserves joins).
|
|
||||||
|
|
||||||
**Live validation (northwind, 15 tables):** `--stages relationships` BEFORE `ai:110 joins:22` → AFTER
|
|
||||||
`ai:110 joins:22` (descriptions intact; previously wiped to 0). `--stages descriptions` restored the
|
|
||||||
descriptions from the spec-20 resume record (`ai:0 → ai:110`) with **no** LLM calls while keeping `joins:22`.
|
|
||||||
Full `pnpm --filter @kaelio/ktx run test` (3089 passed), `type-check`, `dead-code`, and `build` pass.
|
|
||||||
|
|
@ -1,463 +0,0 @@
|
||||||
# Resumable and fault-tolerant source ingest
|
|
||||||
|
|
||||||
> Refined spec. No intake draft — surfaced by a real user report, not the
|
|
||||||
> playground agent (see Motivation). Lives beside the analogous scan-durability
|
|
||||||
> specs 19/20.
|
|
||||||
>
|
|
||||||
> **Scope: make `ktx ingest` (the source-ingest work-unit pipeline behind dbt /
|
|
||||||
> Metabase / Notion) survive interruption and partial failure on large
|
|
||||||
> projects.** Two compounding gaps live on the source-ingest path: (1) an
|
|
||||||
> interrupted run restarts every work unit from scratch — there is no cross-run
|
|
||||||
> reuse of already-generated work-unit output, so a multi-day dbt ingest loses
|
|
||||||
> *all* progress to a single VPN/network blip; (2) the final integration gate is
|
|
||||||
> all-or-nothing — one artifact that cannot pass it (after LLM repair) discards
|
|
||||||
> the **entire** run with nothing committed. This is the source-ingest analog of
|
|
||||||
> spec 19 (move the durability boundary to the cost boundary so expensive LLM
|
|
||||||
> work is not lost) and spec 20 (a stage survives an interruption with per-item
|
|
||||||
> durability). It **reuses** the same content-keyed durability primitive those
|
|
||||||
> specs established rather than copying it.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
Two independent failure modes on the source-ingest work-unit (WU) pipeline,
|
|
||||||
both confirmed in the current code, both observed by a user on a ~2-day dbt
|
|
||||||
ingest. Their union makes large-project ingest brittle: any interruption is
|
|
||||||
total loss, and any single unfixable artifact at the end is total loss.
|
|
||||||
|
|
||||||
### 1. An interrupted run resumes nothing — every work unit re-runs
|
|
||||||
|
|
||||||
`IngestBundleRunner` (`context/ingest/ingest-bundle.runner.ts`) executes a run as
|
|
||||||
a sequence of stages: fetch → parse/extract into **work units** → run each WU as
|
|
||||||
an isolated agent loop in a child worktree (`runIsolatedWorkUnit` →
|
|
||||||
`executeWorkUnit`, `stages/stage-3-work-units.ts`) → integrate the successful WU
|
|
||||||
patches → reconcile → finalize → final gates → one atomic squash commit
|
|
||||||
(`squashMergeIntoMain`, ~2716). The WU stage is where the LLM cost lives: each WU
|
|
||||||
is an agent loop that reads its `rawFiles`/`dependencyPaths` and writes SL/wiki
|
|
||||||
artifacts, producing a git patch (`WorkUnitOutcome.patchPath` /
|
|
||||||
`patchTouchedPaths`, `stage-3-work-units.ts:31-46`).
|
|
||||||
|
|
||||||
The only persisted cross-run state is `SqliteBundleIngestStore`
|
|
||||||
(`context/ingest/sqlite-bundle-ingest-store.ts`): run metadata, the final report,
|
|
||||||
and provenance — all written at or near **run completion**. There is **no
|
|
||||||
checkpoint of completed WU output**. A run that dies mid-flight (the user's
|
|
||||||
VPN/network drop) leaves nothing reusable: the next `ktx ingest` re-fetches,
|
|
||||||
re-parses, and **re-executes every WU from scratch**, re-paying the entire LLM
|
|
||||||
cost. The store even keys `job_id` UNIQUE, so a re-run is a brand-new job with no
|
|
||||||
relationship to the interrupted one.
|
|
||||||
|
|
||||||
> Observed (user report, large dbt project): a run reached deep into its
|
|
||||||
> work-unit progress and was lost to a network blip; the follow-up run started
|
|
||||||
> over from zero. On a ~2-day ingest this is the difference between a 5-minute
|
|
||||||
> resume and a 2-day redo.
|
|
||||||
|
|
||||||
### 2. The final integration gate is all-or-nothing
|
|
||||||
|
|
||||||
After all surviving WUs are integrated, `validateFinalIngestArtifacts`
|
|
||||||
(`context/ingest/artifact-gates.ts:96`) runs the final gate. It checks, across
|
|
||||||
the *integrated* tree:
|
|
||||||
|
|
||||||
- **intrinsic source validity** — `validateTouchedSources` →
|
|
||||||
`validateWuTouchedSources` (`stages/validate-wu-sources.ts:124`) →
|
|
||||||
`validateSingleSource` (`context/sl/tools/sl-warehouse-validation.ts:56`),
|
|
||||||
which runs a **live warehouse dry-run** (`SELECT * FROM (sql) LIMIT 1`);
|
|
||||||
- **cross-artifact references** — dangling join targets
|
|
||||||
(`findJoinTargetErrors`, `validate-wu-sources.ts:89`), dangling `wiki→wiki`
|
|
||||||
refs (`validateWikiRefs` → `findMissingWikiRefs`), broken `wiki→sl_ref`s
|
|
||||||
(`validateWikiSlRefs`, `artifact-gates.ts:39`), and broken wiki body refs
|
|
||||||
(`findInvalidWikiBodyRefs`).
|
|
||||||
|
|
||||||
On any error it **`throw`s a single concatenated string** (`artifact-gates.ts:129`).
|
|
||||||
The runner catches it, runs the LLM repair `repairFinalGateFailure`
|
|
||||||
(`runner.ts:2595`, `maxAttempts: 2`), and if repair still fails, **re-throws**
|
|
||||||
(`runner.ts:2623`) → `markFailed` → the squash never runs → `commitSha: null`
|
|
||||||
(`runner.ts:2729`) → **the whole run is discarded, nothing committed.**
|
|
||||||
|
|
||||||
The crucial asymmetry: a WU that fails *on its own terms* never reaches this gate
|
|
||||||
— `executeWorkUnit` already validates each WU in isolation (`validateWikiRefs`
|
|
||||||
~143, `validateTouchedSources` ~150) and **soft-fails** it (`failWithReset`,
|
|
||||||
~155: the WU resets, is excluded from integration, and the run continues). So by
|
|
||||||
the time the final gate runs, intrinsic single-source failures are rare. The
|
|
||||||
gate fails predominantly on **cross-artifact dangling references**: WU-A's source
|
|
||||||
joins to a source WU-B was meant to create, but WU-B failed or was excluded, so
|
|
||||||
A's join now points at nothing. Each WU passed *alone*; the break only appears
|
|
||||||
once the survivors are integrated — and that break currently nukes the run.
|
|
||||||
|
|
||||||
> Observed (user report): a run completed all task generation and then failed at
|
|
||||||
> the final integration gate on a **single model**; because the gate is
|
|
||||||
> all-or-nothing, that one failure discarded an ~18h run with nothing committed.
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
Anyone ingesting a large warehouse/BI/dbt project with an LLM pipeline will hit
|
|
||||||
both failures. Large ingests run long enough that an interruption is a *when*,
|
|
||||||
not an *if* (laptop sleep, VPN reconnect, transient provider error, an operator
|
|
||||||
ctrl-C on an apparently-stuck run), and a large artifact set makes it
|
|
||||||
near-certain that *some* model lands a cross-reference its sibling didn't
|
|
||||||
produce. Without cross-run reuse, every interruption is a from-scratch redo of
|
|
||||||
the dominant (LLM) cost; without partial commit, one unfixable artifact throws
|
|
||||||
away every good one. Both fixes make large-project ingest **resilient and
|
|
||||||
resumable**: an interruption costs only the unfinished work, and a single bad
|
|
||||||
model costs only that model — not the run. This is core robustness for a
|
|
||||||
general-purpose ingestion product.
|
|
||||||
|
|
||||||
## Design decisions (resolved during refinement)
|
|
||||||
|
|
||||||
These resolve the design space explored during refinement. They constrain the
|
|
||||||
implementer; the exact code is theirs (requirement-level, per the specs README).
|
|
||||||
|
|
||||||
### D1 — Resume is automatic and content-keyed at the work-unit level
|
|
||||||
|
|
||||||
A successful WU's output is cached across runs, keyed by a **content hash of its
|
|
||||||
inputs**, with **no `--resume` flag**. Re-running the same `ktx ingest`
|
|
||||||
transparently replays any WU whose inputs are byte-identical to a cached success
|
|
||||||
and re-runs only the changed, failed, or missing WUs. The key is computed over:
|
|
||||||
the contents of the WU's `rawFiles` + `dependencyPaths` (the bytes the WU reads,
|
|
||||||
`types.ts:19-28`), the adapter/source identity, and a **version/prompt
|
|
||||||
fingerprint** (ktx version + the WU system/user prompt + model role). A changed
|
|
||||||
dbt model busts only that model's entry; everything unchanged replays for free.
|
|
||||||
|
|
||||||
> No flag, no config knob. Content-keying makes resume automatic; a flag would
|
|
||||||
> double the state space for no benefit. This is the same shape scan uses
|
|
||||||
> (`computeKtxScanEnrichmentInputHash`, spec 19), applied here to the WU
|
|
||||||
> pipeline.
|
|
||||||
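One way to compute the D1 key is sketched below. This is illustrative only: `WorkUnitKeyInput`, its field names, and `computeWorkUnitInputHash` are hypothetical, and SHA-256 with length-prefixed fields is an assumed encoding, not something the spec prescribes. The spec fixes only what goes into the key (input bytes, adapter/source identity, and the version/prompt fingerprint).

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape; only rawFiles/dependencyPaths are named in the spec.
interface WorkUnitKeyInput {
  adapterId: string; // adapter/source identity
  files: Record<string, string>; // path -> contents of rawFiles + dependencyPaths
  ktxVersion: string;
  prompt: string; // WU system/user prompt text
  modelRole: string;
}

function computeWorkUnitInputHash(input: WorkUnitKeyInput): string {
  const hash = createHash("sha256");
  // Length-prefix every field so adjacent values cannot run together.
  const feed = (value: string): void => {
    hash.update(`${Buffer.byteLength(value, "utf8")}:`);
    hash.update(value, "utf8");
  };
  feed(input.adapterId);
  feed(input.ktxVersion);
  feed(input.prompt);
  feed(input.modelRole);
  // Sort paths so the key does not depend on enumeration order.
  const entries = Object.entries(input.files).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  for (const [path, content] of entries) {
    feed(path);
    feed(content);
  }
  return hash.digest("hex");
}
```

Because the fingerprint fields are part of the hash, a ktx upgrade or prompt edit changes every key and forces a recompute, which errs on the side of over-keying rather than stale reuse.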
|
|
||||||
### D2 — The cached unit is the successful WU's patch; replay verifies or recomputes
|
|
||||||
|
|
||||||
The cache stores a successful WU's **output artifacts**: its git patch
|
|
||||||
(`patchPath` content / `patchTouchedPaths`) plus the metadata integration needs
|
|
||||||
(`actions`, `touchedSlSources`, `slDisallowed`). On a cache hit, the runner
|
|
||||||
**replays the patch** into the session worktree — no agent loop, no LLM — exactly
|
|
||||||
where it would have integrated a freshly-run WU. If a cached patch **fails to
|
|
||||||
apply** (the surrounding tree drifted), the entry is discarded and the WU
|
|
||||||
**recomputes**. So a stale hit degrades to "recompute," never to a corrupt tree:
|
|
||||||
the cache can only make a run faster, never wrong.
|
|
||||||
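The replay-or-recompute rule can be sketched as a small decision function. This is a minimal sketch, assuming hypothetical `CachedWorkUnit`, `ReplayDeps`, and `resolveWorkUnit` names and a synchronous, injected `applyPatch` and `runAgentLoop`; the real runner works against git worktrees and is not shown in this spec.

```typescript
// Hypothetical shapes; only patch/actions/touchedSlSources are named in the spec.
interface CachedWorkUnit {
  patch: string;
  actions: string[];
  touchedSlSources: string[];
}

interface ReplayDeps {
  cache: Map<string, CachedWorkUnit>;
  applyPatch: (patch: string) => boolean; // false when the patch does not apply
  runAgentLoop: () => CachedWorkUnit | null; // null when the WU fails
}

type Resolution = "replayed" | "recomputed" | "failed";

function resolveWorkUnit(inputHash: string, deps: ReplayDeps): Resolution {
  const hit = deps.cache.get(inputHash);
  if (hit !== undefined) {
    if (deps.applyPatch(hit.patch)) return "replayed"; // no agent loop, no LLM
    deps.cache.delete(inputHash); // stale entry: discard and fall through
  }
  const fresh = deps.runAgentLoop();
  if (fresh === null) return "failed"; // failures are never cached
  if (!deps.applyPatch(fresh.patch)) return "failed";
  deps.cache.set(inputHash, fresh); // only a successful result is stored
  return "recomputed";
}
```

A stale hit costs one failed apply plus a normal run, so the worst case matches a run with no cache at all.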
|
|
||||||
### D3 — One durability primitive, shared by scan and ingest
|
|
||||||
|
|
||||||
Per the "one capability, one implementation" rule, the content-keyed store is
|
|
||||||
**extracted** into a shared primitive and **both** scan and ingest route through
|
|
||||||
it — not copied. Scan's `sqlite-local-enrichment-state-store.ts` (PK
|
|
||||||
`(connection_id, stage, input_hash)`, `findCompletedStage` / `saveCompletedStage`)
|
|
||||||
and its `inputHash` computation (`enrichment-state.ts`) are generalized to a
|
|
||||||
content-keyed result cache; scan is migrated onto the shared primitive **in the
|
|
||||||
same change** so no second copy exists even transiently. The ingest cache is a
|
|
||||||
new logical namespace (e.g. keyed `(connectionId, sourceKey, workUnitInputHash)`)
|
|
||||||
on that one store.
|
|
||||||
|
|
||||||
> Extract-and-share in one PR, not "build a copy for ingest now, unify later."
|
|
||||||
> A temporary fork is exactly the divergence the rule forbids; the one-time
|
|
||||||
> extraction cost is paid once and both paths benefit from every later fix.
|
|
||||||
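The "one store, several logical namespaces" idea might look like the following. This is an in-memory sketch with a hypothetical `ContentKeyedStore` API; the real primitive is SQLite-backed and its interface is left to the implementer. The two key tuples mirror the scan primary key and the example ingest key given above.

```typescript
// Hypothetical API; the real store is SQLite-backed and may differ.
class ContentKeyedStore {
  private readonly rows = new Map<string, unknown>();

  private key(namespace: string, scope: readonly string[], inputHash: string): string {
    return JSON.stringify([namespace, ...scope, inputHash]);
  }

  find<T>(namespace: string, scope: readonly string[], inputHash: string): T | undefined {
    return this.rows.get(this.key(namespace, scope, inputHash)) as T | undefined;
  }

  saveCompleted<T>(namespace: string, scope: readonly string[], inputHash: string, result: T): void {
    this.rows.set(this.key(namespace, scope, inputHash), result);
  }
}

const store = new ContentKeyedStore();
// Scan namespace: (connectionId, stage, inputHash), as in the existing stage store.
store.saveCompleted("scan", ["conn-1", "descriptions"], "hash-a", { columns: 110 });
// Ingest namespace: (connectionId, sourceKey, workUnitInputHash).
store.saveCompleted("ingest", ["conn-1", "dbt"], "hash-a", { patch: "diff --git ..." });
```

Both callers share the lookup and save code path, so a fix to the store benefits scan and ingest at once.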
|
|
||||||
### D4 — Only successes are cached; failures retry on the next run
|
|
||||||
|
|
||||||
A failed WU is **not** recorded as terminal — the next run retries it. WU
|
|
||||||
failures on this path are dominantly transient (network, provider stall, an LLM
|
|
||||||
slip), and the user's explicit ask is "resume and finish the rest," so a failure
|
|
||||||
must not be sticky. This deliberately differs from scan's stage store (which
|
|
||||||
caches failed stages and re-throws): there the failure is the stage's
|
|
||||||
deterministic verdict; here a WU failure is usually a blip to retry. Caching only
|
|
||||||
successes also keeps the invariant simple — a cache entry always means "this
|
|
||||||
exact input already produced this exact good output."
|
|
||||||
|
|
||||||
### D5 — The final gate becomes non-fatal: deterministic dangling-edge prune
|
|
||||||
|
|
||||||
Replace the gate's fatal `throw`-after-repair with a deterministic reconciliation
|
|
||||||
that always yields a committable, internally-consistent tree:
|
|
||||||
|
|
||||||
1. `validateFinalIngestArtifacts` is refactored to **return structured findings**
|
|
||||||
(the danglers it already computes internally — join targets, `wiki→wiki`,
|
|
||||||
`wiki→sl_ref`, wiki body refs — plus any intrinsic source failure) instead of
|
|
||||||
flattening them into a thrown string.
|
|
||||||
2. **Drop the rare self-invalid source first.** A source that fails its *own*
|
|
||||||
validation at the final gate (intrinsic — rare, since stage 3 already filters
|
|
||||||
these) is removed, establishing the surviving artifact set.
|
|
||||||
3. **Prune the dead edges in a single pass** over that surviving set. For each
|
|
||||||
dangling reference — whether it pointed at an absent sibling or at a
|
|
||||||
just-dropped source — **remove that reference from its owner** (drop the join
|
|
||||||
entry, remove the `wiki ref` / `sl_ref`, remove the broken body link), keeping
|
|
||||||
the owning artifact. Because nodes are dropped first (step 2) and pruning only
|
|
||||||
removes edges, pruning **cannot create a new dangling edge, so one pass
|
|
||||||
suffices; no fixpoint.**
|
|
||||||
4. Re-run the gate to **confirm** the remainder is clean (warehouse dry-runs are
|
|
||||||
cached per D6/D2, ref checks are in-memory, so this is cheap), then squash-commit
|
|
||||||
the remainder. If the confirm pass *still* fails, that is a real bug — fail the
|
|
||||||
run loudly rather than commit a dirty tree.
|
|
||||||
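Steps 2 to 4 above can be sketched on an in-memory graph. This is a minimal sketch under stated assumptions: `Artifact`, `PruneReport`, `dropAndPrune`, and `findDanglers` are hypothetical names, and every reference type (join, wiki ref, `sl_ref`, body link) is flattened into one `refs` list for brevity.

```typescript
// Hypothetical in-memory model: each artifact owns a list of outgoing refs.
interface Artifact {
  id: string;
  refs: string[]; // ids of join targets, wiki refs, sl_refs, body links
  selfValid: boolean; // result of the artifact's own validation
}

interface PrunedEdge {
  owner: string;
  ref: string;
}

interface PruneReport {
  dropped: string[];
  pruned: PrunedEdge[];
}

function dropAndPrune(artifacts: readonly Artifact[]): { survivors: Artifact[]; report: PruneReport } {
  // Step 2: drop self-invalid nodes first to fix the surviving set.
  const dropped: string[] = [];
  const kept: Artifact[] = [];
  for (const a of artifacts) {
    if (a.selfValid) kept.push(a);
    else dropped.push(a.id);
  }
  const present = new Set(kept.map((a) => a.id));

  // Step 3: one pass that removes edges whose target is absent; owners stay.
  const pruned: PrunedEdge[] = [];
  const survivors = kept.map((a) => {
    const refs: string[] = [];
    for (const ref of a.refs) {
      if (present.has(ref)) refs.push(ref);
      else pruned.push({ owner: a.id, ref });
    }
    return { ...a, refs };
  });
  return { survivors, report: { dropped, pruned } };
}

// Step 4: the confirm pass. Any dangler left here indicates a bug.
function findDanglers(artifacts: readonly Artifact[]): string[] {
  const present = new Set(artifacts.map((a) => a.id));
  const danglers: string[] = [];
  for (const a of artifacts) {
    for (const ref of a.refs) {
      if (!present.has(ref)) danglers.push(`${a.id}->${ref}`);
    }
  }
  return danglers;
}
```

Since the set of present nodes is fixed before any edge is removed, removing an edge never changes which other edges dangle, which is why a single pass is enough in this model.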
|
|
||||||
`repairFinalGateFailure` (the LLM repair, `runner.ts:2595` / `final-gate-repair.ts`)
|
|
||||||
is **removed**. The deterministic prune supersedes it for the referential class,
|
|
||||||
and the rare intrinsic case is handled by drop.
|
|
||||||
|
|
||||||
> **Prune the edge, do not cascade the node.** The rejected alternative drops the
|
|
||||||
> *referencing artifact* and, transitively, everything that referenced *it* — a
|
|
||||||
> node-quarantine fixpoint that cascades healthy artifacts and needs a closure
|
|
||||||
> search, a confirm loop, and an un-apply step. Pruning the dead edge keeps the
|
|
||||||
> dependent intact (minus one pointer that never resolved anyway), needs no
|
|
||||||
> fixpoint, and acts on findings the gate already produces.
|
|
||||||
>
|
|
||||||
> **Why remove the LLM repair rather than keep it as a pre-prune step.** Repair
|
|
||||||
> can occasionally *fix* a ref (e.g. correct a typo'd source name) where prune
|
|
||||||
> merely deletes it, preserving marginally more content. We drop it anyway:
|
|
||||||
> determinism beats an LLM round-trip with variance on the commit path, prune
|
|
||||||
> guarantees a commit where repair could only `throw`, and deleting it is a net
|
|
||||||
> maintenance reduction. The decision is reversible — repair could later run as a
|
|
||||||
> best-effort pass *before* prune — but the default is prune-only.
|
|
||||||
|
|
||||||
### D6 — Prune runs on the integrated tree, never poisons the cache (resume ∘ prune compose)
|
|
||||||
|
|
||||||
Pruning is applied to the **integrated session worktree** at gate time and is
|
|
||||||
**re-derived from the current survivor set on every run**. It MUST NOT mutate the
|
|
||||||
cached WU patches (D2). This makes resume and prune compose correctly and
|
|
||||||
**self-heal**:
|
|
||||||
|
|
||||||
- Run 1: WU-A (joins to B) succeeds and is cached *with its join intact*; WU-B
|
|
||||||
fails; the gate prunes A's join-to-B from the integrated tree and commits A
|
|
||||||
without it.
|
|
||||||
- Run 2 (after the root cause is fixed): A's input is unchanged → A **replays
|
|
||||||
from cache with its join restored**; B now succeeds and exists; the gate finds
|
|
||||||
no dangler and commits both, fully linked.
|
|
||||||
|
|
||||||
So a ref pruned because of a sibling's failure costs nothing permanent: fixing
|
|
||||||
the sibling and re-running restores the link for free. The cache stores
|
|
||||||
intent (the WU's real output); prune is a per-run consistency projection over
|
|
||||||
whatever survived.
|
|
||||||
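The two-run scenario can be simulated in a few lines. This is a toy sketch, assuming a hypothetical `WuOutput` shape and `projectConsistent` helper; the point it illustrates is that prune returns a new projection and never edits the cached entries.

```typescript
interface WuOutput {
  id: string;
  joins: string[];
}

// Prune as a pure projection: returns new objects and never edits its input.
function projectConsistent(outputs: readonly WuOutput[]): WuOutput[] {
  const present = new Set(outputs.map((o) => o.id));
  return outputs.map((o) => ({ id: o.id, joins: o.joins.filter((j) => present.has(j)) }));
}

const cache = new Map<string, WuOutput>();

// Run 1: A succeeds and is cached with its join to B intact; B fails (not cached).
cache.set("hash-A", { id: "A", joins: ["B"] });
const run1 = projectConsistent(Array.from(cache.values()));

// Run 2: A replays from cache unchanged; B now succeeds.
cache.set("hash-B", { id: "B", joins: [] });
const run2 = projectConsistent(Array.from(cache.values()));
```

In `run1` A is committed without its join, while the cached entry still carries it, so `run2` gets the link back without rerunning A.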
|
|
||||||
### D7 — Pruning is faithful and never silent
|
|
||||||
|
|
||||||
A pruned reference was, by definition, non-functional (its target was absent), so
|
|
||||||
removing it loses nothing executable — and removing dangling SL joins is already
|
|
||||||
the established fix for the SL engine's eager orphan-join rejection. Every prune
|
|
||||||
and every drop MUST be **recorded in the run report and a trace event** naming
|
|
||||||
the artifact, the removed reference, and the absent target. The report status
|
|
||||||
MUST reflect partial completion (extend the existing `failedWorkUnits`
|
|
||||||
mechanism, `IngestBundleResult`, `types.ts:204-213`, with the pruned-refs /
|
|
||||||
dropped-sources detail) so a partial run is visibly partial, never a silent
|
|
||||||
"success."
|
|
||||||
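A possible shape for the reporting detail is sketched below. All field names here (`prunedRefs`, `droppedSources`, `removedRef`, `absentTarget`) and the `deriveStatus` helper are hypothetical; the spec only requires that the artifact, the removed reference, and the absent target are recorded, and that such a run is reported as partial. Treating failed work units as partial too is an assumption of this sketch.

```typescript
// Hypothetical field names; the actual IngestBundleResult extension may differ.
interface PrunedRef {
  artifact: string;
  removedRef: string;
  absentTarget: string;
}

interface DroppedSource {
  source: string;
  reason: string;
}

interface IngestReportDetail {
  failedWorkUnits: string[];
  prunedRefs: PrunedRef[];
  droppedSources: DroppedSource[];
}

type RunStatus = "success" | "partial";

function deriveStatus(detail: IngestReportDetail): RunStatus {
  const clean =
    detail.failedWorkUnits.length === 0 &&
    detail.prunedRefs.length === 0 &&
    detail.droppedSources.length === 0;
  // Anything pruned, dropped, or failed makes the run visibly partial.
  return clean ? "success" : "partial";
}
```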
|
|
||||||
### D8 — Cache state is regenerable; no migration bridge
|
|
||||||
|
|
||||||
The WU cache is regenerable local state under `.ktx/`. Its on-disk/SQLite shape
|
|
||||||
may change with **no migration bridge** — a stale-shaped or absent cache simply
|
|
||||||
forces a full (non-resumed) run, exactly today's behavior. Consistent with ktx's
|
|
||||||
no-backward-compatibility policy; the cache is an optimization, never a source of
|
|
||||||
truth.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
1. **Cross-run WU resume, automatic and content-keyed.** A successful WU's output
|
|
||||||
MUST be cached keyed by a content hash over its input bytes
|
|
||||||
(`rawFiles` + `dependencyPaths`), the adapter/source identity, and a
|
|
||||||
version/prompt fingerprint (ktx version + WU prompt + model role). Re-running
|
|
||||||
`ktx ingest` MUST replay cached successes without an agent loop / LLM call and
|
|
||||||
re-run only changed, failed, or missing WUs. No `--resume` flag and no config
|
|
||||||
knob is added.
|
|
||||||
2. **Replay verifies or recomputes.** On a cache hit the runner MUST replay the
|
|
||||||
stored patch into the session worktree; if the patch does not apply cleanly the
|
|
||||||
entry MUST be discarded and the WU recomputed. A cache hit MUST NOT be able to
|
|
||||||
produce a tree different from what a fresh run of that WU would have integrated.
|
|
||||||
3. **Only successes are cached.** A failed WU MUST NOT be recorded as terminal; it
|
|
||||||
MUST be retried on the next run.
|
|
||||||
4. **Conservative invalidation.** The input hash MUST change when the ktx version,
|
|
||||||
the WU prompt, or the model role changes (bias toward recompute). Under-keying
|
|
||||||
(stale reuse) is a correctness bug; over-keying (an unnecessary recompute) is
|
|
||||||
acceptable.
|
|
||||||
5. **The final gate is non-fatal.** A final-gate failure MUST NOT discard the run.
|
|
||||||
`validateFinalIngestArtifacts` MUST return structured findings; the runner MUST
|
|
||||||
deterministically **prune** every dangling reference from its owning artifact
|
|
||||||
and **drop** any source that fails its own validation, then commit the
|
|
||||||
remaining internally-consistent tree.
|
|
||||||
6. **Single-pass prune, dependents survive.** Pruning MUST remove dead *edges*, not
|
|
||||||
cascade-drop owning artifacts; it MUST complete in a single pass (no fixpoint)
|
|
||||||
because edge removal cannot create new dangling edges. A dependent that loses
|
|
||||||
one dangling ref MUST otherwise be committed intact.
|
|
||||||
7. **Prune composes with resume.** Pruning MUST operate on the integrated tree and
|
|
||||||
MUST NOT mutate cached WU patches. A reference pruned in one run because its
|
|
||||||
target was absent MUST be restored automatically on a later run once the target
|
|
||||||
exists (resume replays the owner's intact patch).
|
|
||||||
8. **Confirm before commit.** After pruning/dropping, the gate MUST be re-run on
|
|
||||||
the remainder and MUST pass before the squash; if it still fails the run MUST
|
|
||||||
fail loudly rather than commit a dirty tree.
|
|
||||||
9. **`repairFinalGateFailure` is removed.** The LLM final-gate repair path and its
|
|
||||||
obsolete tests/branches MUST be deleted (no dormant compatibility path).
|
|
||||||
10. **Every prune/drop is reported.** Each pruned reference and dropped source MUST
|
|
||||||
be recorded in the run report and a trace event (artifact, removed ref, absent
|
|
||||||
target). A run that pruned or dropped anything MUST report as partial, never as
|
|
||||||
an unqualified success.
|
|
||||||
11. **One shared durability primitive.** The content-keyed store MUST be a single
|
|
||||||
implementation used by both scan and ingest; scan MUST be migrated onto it in
|
|
||||||
the same change. No second copy may exist, even transiently.
|
|
||||||
12. **No regression for clean runs.** A small, uninterrupted run whose every WU
|
|
||||||
passes and whose final gate is clean MUST produce byte-identical artifacts and
|
|
||||||
the same `commitSha`/report shape (modulo new, empty pruned/dropped fields) as
|
|
||||||
today.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- **Resume skips completed work:** interrupt an ingest after K of N WUs have
|
|
||||||
succeeded; re-run the same command (unchanged inputs); the run issues **zero**
|
|
||||||
agent loops / LLM calls for the K cached WUs, runs only the remaining N−K, and
|
|
||||||
produces the same final artifacts as an uninterrupted run.
|
|
||||||
- **Changed model busts only its entry:** edit one dbt model between runs; the
|
|
||||||
re-run re-executes **only** the WU(s) whose input bytes changed and replays the
|
|
||||||
rest from cache.
|
|
||||||
- **Stale patch self-corrects:** a cached patch that no longer applies (forced
|
|
||||||
drift in a test) causes that WU to recompute, not a corrupt tree or a crash.
|
|
||||||
- **Failures retry:** a WU that fails in run 1 (transient error) is **not** cached;
|
|
||||||
run 2 retries it and, on success, integrates it.
|
|
||||||
- **One bad model no longer nukes the run:** a run where WU-B fails so WU-A's join
|
|
||||||
to B dangles **commits** — A is committed with the dangling join **pruned**, the
|
|
||||||
report lists the pruned ref, and `commitSha` is non-null (contrast: today this
|
|
||||||
throws and commits nothing).
|
|
||||||
- **No cascade:** in that scenario A (and any other artifact that only referenced
|
|
||||||
B) is committed intact except for the single pruned reference; nothing healthy
|
|
||||||
is dropped.
|
|
||||||
- **Self-heal:** fix B's root cause and re-run; A replays from cache with its join
|
|
||||||
intact, B succeeds, and the final tree commits both fully linked with no prune.
|
|
||||||
- **Intrinsic drop:** a source that fails its own warehouse dry-run at the final
|
|
||||||
gate (forced) is dropped, refs to it are pruned, and the rest commits; the drop
|
|
||||||
is reported.
|
|
||||||
- **Repair is gone:** `repairFinalGateFailure` and its tests no longer exist; the
|
|
||||||
gate path has no LLM call.
|
|
||||||
- **One store:** scan and ingest both resume through the same content-keyed
|
|
||||||
primitive (one implementation; scan's behavior is unchanged by the migration —
|
|
||||||
spec 19/20 acceptance still passes).
|
|
||||||
- **Clean-run regression:** a small uninterrupted all-passing ingest yields
|
|
||||||
identical artifacts, `commitSha`, and report (empty pruned/dropped fields) to
|
|
||||||
today.
|
|
||||||
|
|
||||||
## Non-goals
|
|
||||||
|
|
||||||
- **Resuming the cross-WU stages.** Reconciliation, finalization, and the final
|
|
||||||
gate re-run every time; their inputs depend on the full survivor set and their
|
|
||||||
cost is small relative to WU generation. Only WU generation is cached.
|
|
||||||
- **A `--resume` flag or any timeout/cache config knob.** Content-keying makes
|
|
||||||
resume automatic (D1); one opinionated default is the canonical ktx shape.
|
|
||||||
- **Caching failed WUs as terminal.** Failures retry (D4).
|
|
||||||
- **Node-cascade quarantine of the final gate.** Prune edges, do not drop
|
|
||||||
dependents (D5). No closure search, confirm-loop-over-nodes, or un-apply step.
|
|
||||||
- **Tolerating dangling references (warn instead of remove).** Unsafe — the SL
|
|
||||||
engine eagerly rejects orphan joins — so dead edges must be removed, not kept.
|
|
||||||
- **Keeping the LLM final-gate repair.** Removed (D5/req 9).
|
|
||||||
- **A general per-stage resume framework beyond the shared content-keyed store.**
|
|
||||||
The store is the one shared primitive (D3); this spec does not abstract every
|
|
||||||
ingest stage into a resumable framework.
|
|
||||||
- **Re-implementing spec 19/20 (scan durability).** This spec composes the same
|
|
||||||
primitive onto the source-ingest WU pipeline.
|
|
||||||
|
|
||||||
## Implementation orientation
|
|
||||||
|
|
||||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
|
||||||
the design.
|
|
||||||
|
|
||||||
- **Run flow + the all-or-nothing seam** — `context/ingest/ingest-bundle.runner.ts`:
|
|
||||||
WU run + integration of successful patches (~1600–1900), the final-gate block
|
|
||||||
(~2549–2587, `runFinalArtifactGates`), the repair-then-rethrow that must be
|
|
||||||
replaced by prune (~2588–2644; the fatal `throw` ~2623), and the atomic squash
|
|
||||||
(~2701–2729; `commitSha: null` when nothing is touched ~2729). The prune step
|
|
||||||
slots between the gate findings and the squash, operating on `sessionWorktree`.
|
|
||||||
- **Work units & cacheable output** — `context/ingest/types.ts` (`WorkUnit`
|
|
||||||
~19–28: `rawFiles`/`peerFileIndex`/`dependencyPaths`; `IngestBundleResult`
|
|
||||||
~204–213: extend with pruned/dropped detail);
|
|
||||||
`context/ingest/stages/stage-3-work-units.ts` (`executeWorkUnit`; the per-WU
|
|
||||||
validation + `failWithReset` ~134–157 that already soft-fails a WU;
|
|
||||||
`WorkUnitOutcome` ~31–46 with `patchPath`/`patchTouchedPaths`/`actions`/
|
|
||||||
`touchedSlSources` — the cache payload). The cache lookup/replay wraps the
|
|
||||||
per-WU execution; only the agent-loop branch is skipped on a hit.
|
|
||||||
- **The gate (make it return findings)** — `context/ingest/artifact-gates.ts`
|
|
||||||
(`validateFinalIngestArtifacts` ~96; the internal per-artifact danglers from
|
|
||||||
`validateWikiSlRefs` ~39, `validateWikiRefs` ~74, `findInvalidWikiBodyRefs`;
|
|
||||||
the concatenated `throw` ~129 to replace with a structured return);
|
|
||||||
`context/ingest/stages/validate-wu-sources.ts` (`validateWuTouchedSources` ~124;
|
|
||||||
`findJoinTargetErrors` ~89 already returns missing join targets per source —
|
|
||||||
the join-edge danglers to prune); `context/sl/tools/sl-warehouse-validation.ts`
|
|
||||||
(`validateSingleSource` ~56 — the intrinsic warehouse dry-run; its failures are
|
|
||||||
the drop set, not the prune set).
|
|
||||||
- **Per-ref-type pruners (pair 1:1 with the validators)** — join: remove the
|
|
||||||
offending `joins[]` entry from the source YAML; `wiki refs`/`sl_refs`: remove
|
|
||||||
the entry from page frontmatter (`context/wiki/wiki-ref-validation.ts`
|
|
||||||
`findMissingWikiRefs`); wiki body refs: remove the broken link token
|
|
||||||
(`context/ingest/wiki-body-refs.ts` `findInvalidWikiBodyRefs`). Each pruner is
|
|
||||||
deterministic and edits the integrated worktree only.
|
|
||||||
- **Remove the LLM repair** — `context/ingest/final-gate-repair.ts`
|
|
||||||
(`repairFinalGateFailure`) and the `constrained-repair.ts` usage for
|
|
||||||
`final_artifact_gate`; delete the call site (~2595) and its tests.
|
|
||||||
- **Durability primitive to extract & share** —
|
|
||||||
`context/scan/sqlite-local-enrichment-state-store.ts` (`local_scan_enrichment_stages`,
|
|
||||||
PK `(connection_id, stage, input_hash)`, `findCompletedStage`/`saveCompletedStage`),
|
|
||||||
`context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78), and
|
|
||||||
the resume wrapper `runEnrichmentStage` (`context/scan/local-enrichment.ts`).
|
|
||||||
Generalize to a content-keyed result cache; migrate scan onto it; add the ingest
|
|
||||||
namespace. The existing ingest store
|
|
||||||
`context/ingest/sqlite-bundle-ingest-store.ts` (`SqliteBundleIngestStore`) is
|
|
||||||
where ingest-side persistence lives — the WU cache sits alongside it under
|
|
||||||
`.ktx/`.
|
|
||||||
- **Tests** — resume: run an ingest against a real git-backed project with a fake
|
|
||||||
agent runner, interrupt after K WUs, assert the re-run issues no agent loops for
|
|
||||||
the K and the same artifacts result; changed-input bust; stale-patch recompute;
|
|
||||||
failed-WU retry. Prune: a fixture where one WU fails so a sibling's join/wiki
|
|
||||||
ref dangles → assert the run commits the sibling with the ref pruned, reports the
|
|
||||||
prune, and `commitSha` is non-null; assert no cascade; assert self-heal on a
|
|
||||||
follow-up run; assert intrinsic drop. Migration: spec 19/20 scan acceptance still
|
|
||||||
green on the shared primitive. Regression: a small uninterrupted all-passing
|
|
||||||
ingest is byte-identical to today.
|
|
||||||
- After implementing, rebuild and re-link so the playground picks it up:
|
|
||||||
`pnpm run build && pnpm run link:dev`.
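
The join pruner described in the hints above could be sketched roughly as follows. This is a minimal illustration, assuming a simplified in-memory source shape: `SlSource`, `JoinEntry`, `PruneResult`, and `pruneDanglingJoins` are hypothetical names, not ktx's real types, and the real pruner would edit the source YAML in the integrated worktree rather than plain objects.

```typescript
// Hypothetical, simplified shapes for illustration only; the real ktx source
// YAML schema and its types are not shown in this spec.
interface JoinEntry {
  to: string; // name of the join target source
  on: string; // join condition
}

interface SlSource {
  name: string;
  joins: JoinEntry[];
}

interface PruneResult {
  source: SlSource;    // copy of the source with dangling joins removed
  pruned: JoinEntry[]; // the removed joins, kept so the run can report them
}

// Remove joins whose target is not among the surviving sources.
// Pure and deterministic: the input is not mutated, and the same input
// always produces the same output.
function pruneDanglingJoins(
  source: SlSource,
  survivingSources: ReadonlySet<string>,
): PruneResult {
  const kept: JoinEntry[] = [];
  const pruned: JoinEntry[] = [];
  for (const join of source.joins) {
    if (survivingSources.has(join.to)) {
      kept.push(join);
    } else {
      pruned.push(join);
    }
  }
  return { source: { ...source, joins: kept }, pruned };
}

const orders: SlSource = {
  name: "orders",
  joins: [
    { to: "customers", on: "orders.customer_id = customers.id" },
    { to: "payments", on: "orders.id = payments.order_id" }, // sibling WU failed
  ],
};

const pruneResult = pruneDanglingJoins(orders, new Set(["orders", "customers"]));
console.log(pruneResult.source.joins.map((j) => j.to));
console.log(pruneResult.pruned.map((j) => j.to));
```

Returning the pruned entries alongside the edited source keeps the reporting requirement cheap: the caller can record exactly what was removed without diffing files.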
|
|
||||||
|
|
||||||
## Motivation (the real report, not a benchmark)
|
|
||||||
|
|
||||||
A user ingesting a fairly large dbt project (~2-day run) hit both gaps together.
|
|
||||||
First, an interruption — a VPN drop / network blip — lost all progress because
|
|
||||||
ingest cannot resume; they had to restart from scratch. Second, on a later run
|
|
||||||
that completed all task generation, a **single model** failed the final
|
|
||||||
integration gate, and because the gate is all-or-nothing the one failure
|
|
||||||
discarded an ~18h run with nothing committed. Their ask: "some form of resume or
|
|
||||||
checkpoint (or at least reusing the patches that were already generated), and a
|
|
||||||
way to skip or quarantine a single failing model instead of failing the entire
|
|
||||||
run." This spec delivers both — resume via the content-keyed WU cache, and
|
|
||||||
partial commit via deterministic dangling-edge pruning. Unlike specs 19/20 this
|
|
||||||
gap was surfaced by a real user on a real warehouse, not by the benchmark; the
|
|
||||||
fix is generic production hygiene for any large ingest.
|
|
||||||
|
|
||||||
## Implementation notes
|
|
||||||
|
|
||||||
Shipped on branch `write-feature-spec-wiki` (squash-merge target). All 12
|
|
||||||
requirements and every acceptance criterion are covered by committed code and
|
|
||||||
tests; the full `@kaelio/ktx` package suite is green.
|
|
||||||
|
|
||||||
What was built and where:
|
|
||||||
|
|
||||||
- **Shared content-keyed durability primitive** — `context/cache/content-result-cache.ts`
|
|
||||||
+ `sqlite-content-result-cache.ts` (`SqliteContentResultCache`, `local_content_results`).
|
|
||||||
Scan was migrated onto it in the same change (`context/scan/sqlite-local-enrichment-state-store.ts`
|
|
||||||
is now a thin adapter; the old `local_scan_enrichment_stages` table is dropped),
|
|
||||||
so no second copy exists (D3 / req 11).
|
|
||||||
- **Content-keyed WU cache + replay** — `context/ingest/work-unit-cache.ts`
|
|
||||||
(`computeIngestWorkUnitInputHash` over raw/dependency bytes + source identity +
|
|
||||||
CLI version + prompt fingerprint + model role; success-only `saveSuccessfulWorkUnitCache`).
|
|
||||||
Replay/recompute and stale-recompute state refresh wrap the WU loop in
|
|
||||||
`ingest-bundle.runner.ts` (D1/D2/D4 / reqs 1–4).
|
|
||||||
- **Non-fatal final gate** — `artifact-gates.ts` `validateFinalIngestArtifacts`
|
|
||||||
returns structured findings; `context/ingest/final-gate-prune.ts` deterministically
|
|
||||||
drops self-invalid sources and prunes dangling edges in a single pass, then a
|
|
||||||
confirm gate runs before squash (D5/D6 / reqs 5–8). `finalGatePrunedReferences`
|
|
||||||
/ `finalGateDroppedSources` are recorded in the report + trace and surface as a
|
|
||||||
`partial` outcome (D7 / req 10). `repairFinalGateFailure` and its tests are
|
|
||||||
deleted (req 9).
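
To make the content-keyed hash idea concrete, here is a minimal sketch of how an input hash over the listed ingredients might be computed. The `WorkUnitHashInput` shape, its field names, and `computeInputHash` are assumptions for illustration; the real `computeIngestWorkUnitInputHash` signature is not shown in this spec. SHA-256 via Node's `node:crypto` is also an assumption.

```typescript
import { createHash, Hash } from "node:crypto";

// Hypothetical input shape; field names are illustrative, not ktx's real types.
interface WorkUnitHashInput {
  rawFiles: Record<string, string>;        // path -> file content
  dependencyFiles: Record<string, string>; // path -> file content
  sourceIdentity: string;
  cliVersion: string;
  promptFingerprint: string;
  modelRole: string;
}

// Feed one field into the hash with a byte-length prefix, so adjacent fields
// cannot be confused ("ab" + "c" must not hash like "a" + "bc").
function hashField(hash: Hash, value: string): void {
  const bytes = Buffer.from(value, "utf8");
  hash.update(`${bytes.length}:`);
  hash.update(bytes);
}

function computeInputHash(input: WorkUnitHashInput): string {
  const hash = createHash("sha256");
  for (const files of [input.rawFiles, input.dependencyFiles]) {
    // Sort paths so the hash does not depend on object insertion order.
    const paths = Object.keys(files).sort();
    hashField(hash, String(paths.length));
    for (const path of paths) {
      hashField(hash, path);
      hashField(hash, files[path] ?? "");
    }
  }
  const scalars = [
    input.sourceIdentity,
    input.cliVersion,
    input.promptFingerprint,
    input.modelRole,
  ];
  for (const value of scalars) {
    hashField(hash, value);
  }
  return hash.digest("hex");
}

const baseInput: WorkUnitHashInput = {
  rawFiles: { "models/a.sql": "select 1", "models/b.sql": "select 2" },
  dependencyFiles: { "macros/m.sql": "-- macro" },
  sourceIdentity: "dbt:demo",
  cliVersion: "0.0.0",
  promptFingerprint: "prompt-v1",
  modelRole: "ingest",
};
const reorderedInput: WorkUnitHashInput = {
  ...baseInput,
  rawFiles: { "models/b.sql": "select 2", "models/a.sql": "select 1" },
};
const editedInput: WorkUnitHashInput = {
  ...baseInput,
  rawFiles: { ...baseInput.rawFiles, "models/a.sql": "select 3" },
};

console.log(computeInputHash(baseInput) === computeInputHash(reorderedInput)); // true
console.log(computeInputHash(baseInput) === computeInputHash(editedInput)); // false
```

Because every ingredient that can change the agent's output is part of the key, a cache hit can be replayed without an invalidation step: any change to a file, the CLI version, the prompt, or the model role simply produces a different key.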
|
|
||||||
|
|
||||||
Deviations / decisions worth noting (all preserve spec intent):
|
|
||||||
|
|
||||||
- **Cache stores artifact content snapshots (payload schema v2), not just a raw
|
|
||||||
git patch.** Replay materializes the owner's artifacts against the *current*
|
|
||||||
base, so a ref pruned in one run because a sibling failed is restored for free
|
|
||||||
on a later run once the sibling exists — without re-running the owner's agent
|
|
||||||
loop (D2/D6 / req 7 self-heal). A drifted/stale snapshot degrades to recompute.
|
|
||||||
- **Final-gate prune/drop resolves sources through the canonical
|
|
||||||
`resolveSlSourceFile` resolver**, not a derived `semantic-layer/<conn>/<name>.yaml`
|
|
||||||
path, so it works for uppercase / hash-derived source filenames (not only
|
|
||||||
lowercase demo names).
|
|
||||||
- **`executeWorkUnit` defers pruneable cross-artifact findings** (missing join
|
|
||||||
target / wiki ref / sl_ref) to the final gate instead of soft-failing the WU;
|
|
||||||
only intrinsic `source_validation` failures remain fatal at the WU level. This
|
|
||||||
is what lets a sibling-failed WU's owner survive to be pruned rather than be
|
|
||||||
excluded upstream (reqs 5–7, "no cascade").
|
|
||||||
- The raw report record keeps `status: 'completed'`; partial completion is derived
|
|
||||||
by `ingestReportOutcome` from the populated prune/drop fields.
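
The derivation in the last bullet can be sketched as below. This is only an illustration under stated assumptions: `IngestReportSketch`, the `'failed'` status value, the element types of the two arrays, and the name `deriveOutcome` are hypothetical; the real `ingestReportOutcome` signature is not shown here.

```typescript
// Hypothetical report shape. Only `status: 'completed'` and the two
// final-gate field names come from the spec; everything else is assumed.
interface IngestReportSketch {
  status: "completed" | "failed";
  finalGatePrunedReferences?: unknown[];
  finalGateDroppedSources?: unknown[];
}

type IngestOutcome = "completed" | "partial" | "failed";

// Derive the user-facing outcome without changing the stored status:
// a completed run with any pruned refs or dropped sources is "partial".
function deriveOutcome(report: IngestReportSketch): IngestOutcome {
  if (report.status !== "completed") {
    return "failed";
  }
  const prunedCount = report.finalGatePrunedReferences?.length ?? 0;
  const droppedCount = report.finalGateDroppedSources?.length ?? 0;
  return prunedCount + droppedCount > 0 ? "partial" : "completed";
}

console.log(deriveOutcome({ status: "completed" })); // completed
console.log(
  deriveOutcome({ status: "completed", finalGatePrunedReferences: [{}] }),
); // partial
```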
|
|
||||||
|
|
@ -1,66 +0,0 @@
|
||||||
# Multi-connection routing guidance in the ktx-analytics skill
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
The agent-facing `ktx-analytics` skill (installed into agent environments via
|
|
||||||
the ktx skills/install mechanism, see `.ktx/agents/install-manifest.json` in
|
|
||||||
projects) describes the query workflow — wiki_search → sl_read_source →
|
|
||||||
sl_query / sql_execution — but assumes the connection is obvious. In a
|
|
||||||
multi-connection project nothing tells the agent to *first decide which
|
|
||||||
connection the question is about*, and several tools silently require it:
|
|
||||||
|
|
||||||
- `sql_execution`, `sl_read_source`, `entity_details`: `connectionId`
|
|
||||||
**required**;
|
|
||||||
- `sl_query`, `discover_data`, `dictionary_search`: optional, but
|
|
||||||
auto-inference only works with exactly one connection
|
|
||||||
(`local-query.ts` `resolveLocalConnectionId` ~29-38 — throws with zero or
|
|
||||||
multiple connections).
|
|
||||||
|
|
||||||
An agent that skips routing either errors out or, worse, queries the wrong
|
|
||||||
database when names overlap.
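
The auto-inference behavior described above (succeeds with exactly one connection, throws with zero or several) can be illustrated with a small sketch. The function name `resolveConnectionId`, its parameters, and the error messages are hypothetical; this is not the real `resolveLocalConnectionId` from `local-query.ts`.

```typescript
// Illustrative only: mirrors the described behavior, not the real signature.
function resolveConnectionId(
  explicit: string | undefined,
  available: readonly string[],
): string {
  // An explicitly passed connection id always wins.
  if (explicit !== undefined) {
    return explicit;
  }
  // Inference is only safe when there is exactly one candidate.
  if (available.length === 1) {
    return available[0]!;
  }
  if (available.length === 0) {
    throw new Error("No connections configured; connectionId cannot be inferred.");
  }
  throw new Error(
    `Multiple connections (${available.join(", ")}); pass connectionId explicitly.`,
  );
}

console.log(resolveConnectionId(undefined, ["warehouse"])); // warehouse
console.log(resolveConnectionId("events", ["warehouse", "events"])); // events
```

This is why routing has to happen first in a multi-connection project: every call that relies on inference fails until the agent supplies an id.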
|
|
||||||
|
|
||||||
## Generic use case
|
|
||||||
|
|
||||||
Any ktx project with more than one connection — the common shape for a data
|
|
||||||
org (warehouse + product DB + events DB). Routing is the first step of every
|
|
||||||
question, and the skill should encode it so individual agents don't have to
|
|
||||||
rediscover it.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
1. **Add an explicit routing step (step 0) to the skill's workflow:**
|
|
||||||
- Call `connection_list` to see what exists.
|
|
||||||
- Match the question's domain to a connection using connection ids/names,
|
|
||||||
`discover_data` hits, and wiki context — not guesswork.
|
|
||||||
- If genuinely ambiguous after discovery, ask the user rather than pick.
|
|
||||||
2. **Thread the resolved `connectionId` everywhere:** all subsequent
|
|
||||||
`sl_query`, `sql_execution`, `sl_read_source`, `entity_details`,
|
|
||||||
`dictionary_search`, `discover_data` calls, and `wiki_search` once spec 01
|
|
||||||
lands (search scoped to the resolved connection plus unscoped pages).
|
|
||||||
3. **Single-connection projects stay frictionless:** the skill should say
|
|
||||||
routing is trivial when `connection_list` returns one entry — don't add a
|
|
||||||
mandatory ceremony step for the common simple case.
|
|
||||||
4. **Capture routing knowledge:** when the agent learns a non-obvious
|
|
||||||
question-domain → connection mapping, the skill should encourage
|
|
||||||
`memory_ingest` so the mapping becomes wiki knowledge for next time.
|
|
||||||
|
|
||||||
This is a docs/prompt change in the skill content (plus any skill-install
|
|
||||||
plumbing if the skill is versioned); no engine changes required.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- In a fixture project with ≥2 connections, an agent following the skill
|
|
||||||
resolves the correct connection before its first data query, and no tool
|
|
||||||
call fails with "connectionId is required".
|
|
||||||
- In a single-connection project the skill-driven flow is unchanged (no
|
|
||||||
extra mandatory steps).
|
|
||||||
- Skill text nowhere assumes a default/implicit connection.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
Spider 2.0-Lite local subset = 30 SQLite connections in one project; every
|
|
||||||
one of the 135 questions targets exactly one of them. Connection ids are set
|
|
||||||
to the benchmark's database names, so with this skill guidance routing is
|
|
||||||
mechanical (`connection_list` + name match) and needs no benchmark-specific
|
|
||||||
instructions — which is the point: the harness gives the agent only the
|
|
||||||
question text.
|
|
||||||
|
|
@ -1,51 +0,0 @@
|
||||||
# Offline schema-documentation ingest adapter
|
|
||||||
|
|
||||||
> **Priority: LOW / backlog.** Explicitly **not** needed for the Spider
|
|
||||||
> 2.0-Lite benchmark — we verified the benchmark's offline schema files
|
|
||||||
> (DDL dumps + sample-row JSONs) are a strict subset of what the live SQLite
|
|
||||||
> scan already captures (DDL, types, PKs, sample values, cardinality
|
|
||||||
> profiling). Implement specs 01-03 first; pick this up only if a real
|
|
||||||
> use case shows up.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
The ingest pipeline's schema knowledge comes from live database scans
|
|
||||||
(`live-database` adapter) or BI-tool adapters (metabase, looker, dbt…).
|
|
||||||
There is no adapter for **offline schema documentation**: files describing
|
|
||||||
tables/columns that exist outside the database — column-description
|
|
||||||
spreadsheets, data dictionaries, DDL exports with comments, hand-maintained
|
|
||||||
schema docs.
|
|
||||||
|
|
||||||
## Generic use case
|
|
||||||
|
|
||||||
Teams whose richest schema documentation lives outside `information_schema`:
|
|
||||||
a wiki export of column meanings, a governance tool's CSV data dictionary,
|
|
||||||
DDL files with COMMENT clauses the production scan can't see, or
|
|
||||||
environments where ktx has no live access at all and must build the semantic
|
|
||||||
layer from documentation alone.
|
|
||||||
|
|
||||||
## Requirements (sketch — refine when picked up)
|
|
||||||
|
|
||||||
1. A new ingest adapter (peer of `metabase`/`dbt` in
|
|
||||||
`context/ingest/adapters/`) consuming a configured local path of schema
|
|
||||||
docs per connection.
|
|
||||||
2. Input formats to start: DDL files (`.sql`/`.csv` of CREATE statements)
|
|
||||||
and tabular column dictionaries (CSV/JSON: table, column, description,
|
|
||||||
…). Extensible to other formats.
|
|
||||||
3. Output: **enrichment, not duplication** — merge descriptions/metadata
|
|
||||||
into the manifest-backed semantic-layer sources and dictionary for the
|
|
||||||
matching connection. Where a live scan exists, offline docs fill gaps
|
|
||||||
(descriptions, enum meanings, deprecation notes) and flag drift
|
|
||||||
(documented column missing from live schema and vice versa) rather than
|
|
||||||
creating parallel wiki pages that duplicate schema info.
|
|
||||||
4. Works without live database access (documentation-only bootstrap of a
|
|
||||||
connection's semantic layer), clearly marked as unverified-against-live.
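
The drift flagging in requirement 3 amounts to a two-way set difference between documented and live columns. A minimal sketch, assuming exact (case-sensitive) name matching and hypothetical names `DriftReport` and `diffColumns`:

```typescript
// Hypothetical shape for a per-table drift report.
interface DriftReport {
  documentedOnly: string[]; // in the offline docs, missing from the live schema
  liveOnly: string[];       // in the live schema, missing from the offline docs
}

// Compare column names for one table. Matching is exact here; a real adapter
// might need to normalize case or quoting, which this sketch does not attempt.
function diffColumns(
  documented: readonly string[],
  live: readonly string[],
): DriftReport {
  const documentedSet = new Set(documented);
  const liveSet = new Set(live);
  return {
    documentedOnly: [...documentedSet].filter((c) => !liveSet.has(c)).sort(),
    liveOnly: [...liveSet].filter((c) => !documentedSet.has(c)).sort(),
  };
}

const drift = diffColumns(
  ["id", "email", "legacy_flag"],
  ["id", "email", "created_at"],
);
console.log(drift.documentedOnly); // [ 'legacy_flag' ]
console.log(drift.liveOnly); // [ 'created_at' ]
```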
|
|
||||||
|
|
||||||
## Acceptance criteria (sketch)
|
|
||||||
|
|
||||||
- Given a connection with a live scan plus an offline column dictionary,
|
|
||||||
semantic-layer sources carry the documented descriptions, and drift
|
|
||||||
between doc and live schema is reported.
|
|
||||||
- Given a connection with docs only (no live access), `sl list`/`sl read`
|
|
||||||
expose manifest sources built from the docs.
|
|
||||||
- No wiki pages are created that merely restate table/column lists.
|
|
||||||
|
|
@ -1,59 +0,0 @@
|
||||||
# Composite-key (multi-column) join detection
|
|
||||||
|
|
||||||
> Priority: MEDIUM. Found empirically during the first Spider2-lite sqlite
|
|
||||||
> smoke test (2026-06-13): relationship detection emitted **zero joins** for a
|
|
||||||
> database whose fact tables are linked only by composite keys. Agents still
|
|
||||||
> answered correctly by inferring the join from shared `grain`, so this didn't
|
|
||||||
> cost benchmark points — but it forces inference that explicit joins would
|
|
||||||
> remove, and the gap is generic.
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
Relationship detection appears to emit only single-column joins. For the IPL
|
|
||||||
sqlite database, every table came back with `joins=0`, even though its fact
|
|
||||||
tables are connected by a 4-column composite key
|
|
||||||
(`match_id, over_id, ball_id, innings_no`) shared across `ball_by_ball`,
|
|
||||||
`batsman_scored`, `extra_runs`, and `wicket_taken`. The semantic layer did
|
|
||||||
correctly record that shared key as each table's `grain`, which is why agents
|
|
||||||
could recover the relationship — but no `joins:` entries were produced for the
|
|
||||||
fact-to-fact links.
|
|
||||||
|
|
||||||
## Generic use case
|
|
||||||
|
|
||||||
Event/fact tables keyed by composite business keys are common: ledger lines
|
|
||||||
(`account_id, period, line_no`), telemetry (`device_id, ts, metric`), sports
|
|
||||||
ball-by-ball, EAV/log schemas. Whenever there are no single-column FKs but a
|
|
||||||
multi-column key recurs across tables, ktx should detect and document the join
|
|
||||||
so agents (and `sl_query`) don't have to infer it.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
1. Relationship detection considers **multi-column** join candidates, not just
|
|
||||||
single-column ones. A strong signal already exists in ktx: when two tables
|
|
||||||
share an identical (or subset/superset) declared `grain`, that grain is a
|
|
||||||
prime composite-join candidate.
|
|
||||||
2. Emitted joins carry the full composite condition, e.g.
|
|
||||||
`on: a.match_id = b.match_id AND a.over_id = b.over_id AND a.ball_id = b.ball_id AND a.innings_no = b.innings_no`,
|
|
||||||
with a sensible `relationship` cardinality.
|
|
||||||
3. The existing validation/threshold machinery
|
|
||||||
(`scan.relationships.acceptThreshold` etc.) applies to composite candidates
|
|
||||||
too; profile-based validation should check join selectivity on the full key.
|
|
||||||
4. No regression for single-column joins; don't explode combinatorially —
|
|
||||||
bound candidate generation (e.g. only consider shared-grain keys and
|
|
||||||
   declared (not inferred) PK overlaps, and cap the column count).
|
|
||||||
5. `sl_query` can compile a join across a composite-key relationship.
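
Requirements 1, 2, and 4 can be sketched together: treat a shared grain as the only candidate source, cap the column count, and emit the full `on` condition. The names `TableGrain` and `compositeJoinCondition` and the default cap of 6 columns are assumptions for illustration, not ktx's actual detection code.

```typescript
// Hypothetical minimal view of a table: its name and declared grain columns.
interface TableGrain {
  name: string;
  grain: string[];
}

// Return the composite join condition when one table's grain is contained in
// the other's (identical grains are the special case of mutual containment).
// Returns null when there is no containment, when the key has fewer than two
// columns (single-column joins are handled elsewhere), or when it exceeds
// the cap that bounds candidate generation.
function compositeJoinCondition(
  a: TableGrain,
  b: TableGrain,
  maxColumns: number = 6,
): string | null {
  const aIsSmaller = a.grain.length <= b.grain.length;
  const small = aIsSmaller ? a.grain : b.grain;
  const large = new Set(aIsSmaller ? b.grain : a.grain);
  if (!small.every((column) => large.has(column))) {
    return null;
  }
  if (small.length < 2 || small.length > maxColumns) {
    return null;
  }
  return small
    .map((column) => `${a.name}.${column} = ${b.name}.${column}`)
    .join(" AND ");
}

const ballByBall: TableGrain = {
  name: "ball_by_ball",
  grain: ["match_id", "over_id", "ball_id", "innings_no"],
};
const batsmanScored: TableGrain = {
  name: "batsman_scored",
  grain: ["match_id", "over_id", "ball_id", "innings_no"],
};

console.log(compositeJoinCondition(ballByBall, batsmanScored));
```

Restricting candidates to grain containment keeps generation linear in the number of table pairs instead of enumerating column subsets; profile-based validation against the accept threshold would still run on each emitted candidate.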
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- For a fixture with two tables sharing a 3- or 4-column grain and no
|
|
||||||
single-column FK, ingest emits a composite join between them with the full
|
|
||||||
multi-column `on` condition.
|
|
||||||
- `sl read <source>` shows the composite join; `sl_query` can traverse it.
|
|
||||||
- Single-column join detection is unchanged on existing fixtures.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
IPL (and similar ball-by-ball/event schemas in the Spider2-lite local set)
|
|
||||||
have no single-column FKs; their joins are entirely composite. Explicit
|
|
||||||
composite joins would let the agent rely on documented relationships instead
|
|
||||||
of inferring them from grain.
|
|
||||||
|
|
@ -1,89 +0,0 @@
|
||||||
# Canonical / authoritative-source measures in the semantic layer
|
|
||||||
|
|
||||||
## Problem
|
|
||||||
|
|
||||||
Many schemas contain an **authoritative table** that already encodes a metric's
|
|
||||||
business rules — an official standings/leaderboard table, a general-ledger or
|
|
||||||
period-end balance table, a materialized summary/snapshot — alongside the **raw
|
|
||||||
transactional** rows the metric *could* be re-derived from. Re-deriving the metric
|
|
||||||
from the raw rows frequently diverges from the canonical definition, because the
|
|
||||||
authoritative table bakes in rules the raw data doesn't expose (drop-scores,
|
|
||||||
penalties, adjustments, reconciliations, as-of snapshots).
|
|
||||||
|
|
||||||
Today ktx's semantic layer doesn't distinguish "authoritative summary" tables from
|
|
||||||
raw fact tables, so the analytics skill has no signal that one source is canonical
|
|
||||||
for a metric — and the agent often re-derives from raw rows and gets a defensible-
|
|
||||||
but-different number.
|
|
||||||
|
|
||||||
## Generic use case (independent of any benchmark)
|
|
||||||
|
|
||||||
- "Championship points per competitor this season" — a sports schema may hold both
|
|
||||||
raw per-event results AND an official standings table that applies drop-scores
|
|
||||||
and penalties. The standings table is the canonical source; summing raw results
|
|
||||||
is wrong.
|
|
||||||
- "Account balance as of month end" — prefer a ledger/balance-snapshot table over
|
|
||||||
re-summing every transaction (which may miss adjustments).
|
|
||||||
- "Monthly recognized revenue" — prefer a finance summary table over re-deriving
|
|
||||||
from line items.
|
|
||||||
|
|
||||||
In each case a real analyst should be steered to the authoritative source.
|
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
1. **Detect candidate authoritative tables during ingest.** Heuristics only —
|
|
||||||
e.g. tables whose name/role suggests a summary (`*standings*`, `*balance*`,
|
|
||||||
`*summary*`, `*snapshot*`, `*ledger*`), tables that are a coarser-grained
|
|
||||||
aggregation of another table, or tables documented as authoritative in provided
|
|
||||||
docs/wiki. Surface them as such in the semantic layer.
|
|
||||||
|
|
||||||
2. **Represent the metric as an SL measure backed by the authoritative table.**
|
|
||||||
Where a canonical source exists, define the measure over it so a query for that
|
|
||||||
metric resolves to the authoritative source by default. (The analytics skill
|
|
||||||
already prefers SL measures over raw SQL — spec 07/skill rule — so this plugs
|
|
||||||
into existing behavior.)
|
|
||||||
|
|
||||||
3. **Keep raw re-derivation available** as a non-default alternative; the measure
|
|
||||||
documents which source it uses and why, so the choice is transparent and
|
|
||||||
overridable.
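
As a rough illustration of the name-based heuristic in requirement 1, the sketch below returns which summary hints a table name matches. The hint list is taken from the patterns above; `summaryNameSignals` is a hypothetical helper, and a name match would only be one signal among several (grain coarseness, provided docs), never a decision on its own. Table names in the example are synthetic.

```typescript
// Name fragments taken from the patterns listed in requirement 1.
const SUMMARY_NAME_HINTS: readonly string[] = [
  "standings",
  "balance",
  "summary",
  "snapshot",
  "ledger",
];

// Return the hints that appear in the table name (case-insensitive).
// An empty result means the name carries no summary signal.
function summaryNameSignals(tableName: string): string[] {
  const lower = tableName.toLowerCase();
  return SUMMARY_NAME_HINTS.filter((hint) => lower.includes(hint));
}

console.log(summaryNameSignals("season_standings")); // [ 'standings' ]
console.log(summaryNameSignals("Account_Balance_Snapshot")); // [ 'balance', 'snapshot' ]
console.log(summaryNameSignals("event_results")); // []
```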
|
|
||||||
|
|
||||||
## Fairness boundary (HARD — this spec is fairness-sensitive)
|
|
||||||
|
|
||||||
The choice of authoritative source MUST be driven by **schema/structure or provided
|
|
||||||
documentation** — the table exists, is structured as a summary, or is documented as
|
|
||||||
authoritative. It must **NEVER** be driven by observing which interpretation matches
|
|
||||||
a benchmark gold answer. Concretely:
|
|
||||||
|
|
||||||
- ✅ Fair: "a table named/structured as official standings exists and aggregates the
|
|
||||||
raw results → treat it as the canonical points source."
|
|
||||||
- ❌ Forbidden: "for question X, use table T because that's what reproduces the gold
|
|
||||||
result." That is per-instance gold-tuning (cheating) and must not appear in ktx,
|
|
||||||
the ingest heuristics, or any mapping.
|
|
||||||
|
|
||||||
If a metric is genuinely underspecified and only the gold answer disambiguates the
|
|
||||||
intended source, it is **not fairly fixable** — leave it. Whether this feature helps
|
|
||||||
any specific benchmark instance is therefore *conditional* on a real schema/doc basis
|
|
||||||
existing; do not manufacture one.
|
|
||||||
|
|
||||||
## Leak-safety (hard constraint)
|
|
||||||
|
|
||||||
No benchmark table names, queries, gold values, or instance-specific mappings
|
|
||||||
anywhere in the spec, the heuristics, or tests. Examples must be synthetic/generic.
|
|
||||||
|
|
||||||
## Acceptance criteria
|
|
||||||
|
|
||||||
- Ingest can flag candidate authoritative/summary tables via generic heuristics
|
|
||||||
(name/role/aggregation/doc signals), with no benchmark-specific rules.
|
|
||||||
- The semantic layer can express a measure as backed by a designated authoritative
|
|
||||||
source; the skill resolves the metric to it by default; raw re-derivation remains
|
|
||||||
available and the choice is documented.
|
|
||||||
- Tests use synthetic schemas only; no gold-derived mappings exist anywhere.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only)
|
|
||||||
|
|
||||||
Some SQLite-subset metric questions are underspecified between a raw-derivation and
|
|
||||||
an authoritative-table interpretation (e.g. season points from raw results vs an
|
|
||||||
official standings table). This is the roadmap's "canonical semantic-layer measures
|
|
||||||
from schema + provided docs" item. It is fair ONLY where schema/docs support one
|
|
||||||
source; the gold-only cases are explicitly out of scope (fixing them would require
|
|
||||||
tuning to gold). Larger than the spec 09–12 skill-content tweaks: this touches
|
|
||||||
ingest + the semantic-layer model.
|
|
||||||
|
|
@ -1,57 +0,0 @@
|
||||||
# 17 — Lifecycle-event metrics in the semantic layer
|
|
||||||
|
|
||||||
**Status:** draft (intake). Requirement-level; the implementer refines into `specs/17-*.md`.
|
|
||||||
|
|
||||||
## Problem / requirement
|
|
||||||
|
|
||||||
Many entities carry **several lifecycle timestamps** for the same record — an order has
|
|
||||||
`placed/purchased`, `approved`, `shipped/carrier-handoff`, `delivered`, and `estimated-delivery`
|
|
||||||
times; a ticket has `opened`, `assigned`, `resolved`, `closed`; a payment has `initiated`,
|
|
||||||
`authorized`, `settled`. When an analyst asks for a count/volume/rate of records **in a named
|
|
||||||
completed state, by period** ("delivered orders by month", "resolved tickets per week", "settled
|
|
||||||
payments by day"), the correct time anchor is the timestamp of *that named event*, not the
|
|
||||||
record-creation timestamp.
|
|
||||||
|
|
||||||
Today ktx ingests these timestamps as **peer date dimensions** with good column descriptions, but it
|
|
||||||
does **not model the lifecycle event itself** — so nothing in the semantic layer tells a solver (or a
|
|
||||||
human) that "delivered orders over time" should be anchored to the delivery timestamp. The choice is
|
|
||||||
left to per-query reasoning, which is exactly where it goes wrong. (A companion analytics-skill rule
|
|
||||||
now nudges the *solver* — ktx commit `226341cf` — but the durable, reusable home for this is the
|
|
||||||
**model**, so any consumer of the semantic layer gets it for free.)
|
|
||||||
|
|
||||||
**Requirement:** during enrichment/ingestion, when a source has a state/status column plus one or more
|
|
||||||
lifecycle timestamps whose names/descriptions map to that state's values, infer **lifecycle-event
|
|
||||||
metrics** — e.g. a `delivered_orders` metric defined as `COUNT(*)` filtered to the delivered state with
|
|
||||||
its **default time dimension** set to the matching event timestamp (`order_delivered_customer_date`),
|
|
||||||
distinct from the creation-anchored `orders` metric. Keep the inference conservative and
|
|
||||||
source-traceable (column names + enriched descriptions only); never invent a state/timestamp pairing
|
|
||||||
that the schema/descriptions don't independently support.
|
|
||||||
|
|
||||||
## Sketch (implementer to refine)
|
|
||||||
|
|
||||||
- Detect (state column, lifecycle-timestamp) pairs from column names + enrichment descriptions
|
|
||||||
(e.g. status value `delivered` ↔ `*_delivered_*_date`; `resolved` ↔ `resolved_at`).
|
|
||||||
- Emit a metric per detected completed state: filter = the state predicate, grain = record,
|
|
||||||
`defaultTimeDimension` = the matching event timestamp.
|
|
||||||
- Surface these via `discover_data` / `entity_details` so "delivered orders over time" retrieves the
|
|
||||||
delivery-anchored metric rather than a bare row count over the creation date.
|
|
||||||
- Gate behind the existing `enrichment.mode: llm` path; respect the conservative-inference bar
|
|
||||||
(precision over recall — a wrong pairing is worse than none).
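
The name-based half of the detection step might look like the sketch below. It is deliberately conservative: a state is paired only when exactly one timestamp-like column mentions it, and ambiguous states are skipped rather than guessed. `LifecyclePair`, `detectLifecyclePairs`, and the suffix list are assumptions; the real inference would also consult enriched descriptions, which this sketch ignores.

```typescript
// Hypothetical result shape: one completed state and its event timestamp.
interface LifecyclePair {
  state: string;
  timestampColumn: string;
}

// Final name tokens that mark a column as timestamp-like (an assumption).
const TIME_SUFFIXES: readonly string[] = ["date", "at", "timestamp", "time"];

function detectLifecyclePairs(
  stateValues: readonly string[],
  columns: readonly string[],
): LifecyclePair[] {
  const pairs: LifecyclePair[] = [];
  for (const state of stateValues) {
    const matches = columns.filter((column) => {
      const tokens = column.toLowerCase().split("_");
      const last = tokens[tokens.length - 1] ?? "";
      return TIME_SUFFIXES.includes(last) && tokens.includes(state.toLowerCase());
    });
    // Precision over recall: emit a pair only for an unambiguous match.
    if (matches.length === 1) {
      pairs.push({ state, timestampColumn: matches[0]! });
    }
  }
  return pairs;
}

const ticketPairs = detectLifecyclePairs(
  ["opened", "resolved", "closed"],
  ["opened_at", "assigned_at", "resolved_at", "closed_at", "updated_at"],
);
console.log(ticketPairs);
```

When a state matches several columns (for example two different "delivered" timestamps), this sketch emits nothing for it; resolving such cases is where the enriched descriptions would have to carry the decision.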
|
|
||||||
|
|
||||||
## Generic use case (independent of the benchmark)
|
|
||||||
|
|
||||||
Any operational/transactional schema (e-commerce orders, support tickets, payments, claims, shipments)
|
|
||||||
has this multi-timestamp lifecycle shape. An analyst asking "how many X were <completed-state> last
|
|
||||||
month" almost always means *entered that state* last month. Encoding the event→timestamp mapping in the
|
|
||||||
model makes every downstream question (BI tool, ad-hoc SQL, an LLM agent) pick the right anchor without
|
|
||||||
re-deriving it, and prevents the silent "grouped by when they started" error.
|
|
||||||
|
|
||||||
## Benchmark context (motivation only — not a benchmark-specific rule)
|
|
||||||
|
|
||||||
Surfaced by the `spider2-autofix` loop, round r1: Spider 2.0-Lite `Brazilian_E_Commerce` cases local028
|
|
||||||
("delivered orders for each month") and local031 ("highest monthly delivered orders volume") both failed
|
|
||||||
because the solver bucketed delivered orders by `order_purchase_timestamp` instead of
|
|
||||||
`order_delivered_customer_date`. The trace showed the solver had both columns and even compared both
|
|
||||||
date bases for local031 before choosing purchase. A skill-text rule flipped both cases this round; this
|
|
||||||
spec is the **model-layer** form of the same fix, which would make the right anchor the default for any
|
|
||||||
solver and any lifecycle schema.
|
|
||||||