mirror of
https://github.com/Kaelio/ktx.git
synced 2026-07-01 08:59:39 +02:00
chore: remove private benchmark specs
parent 67a69dba8b
commit 1c5d16abc3
40 changed files with 0 additions and 8716 deletions

@@ -1,62 +0,0 @@
# spider2-specs — feature specs driven by the Spider 2.0-Lite benchmark

This directory is the handoff point between two agents working on different
sides of the same goal: making Claude Code + ktx score well on the Spider
2.0-Lite benchmark **without benchmark-specific instructions** — the agent
should succeed using only what ktx provides (skills, semantic layer, wiki).

## Mechanics

Three directories form a pipeline. A feature flows `todo/` → `specs/` →
(implemented), and only its intake draft moves to `done/`:

- **`todo/`** — intake drafts. A **playground agent** (works in
  `/Users/andrey/projects/kaelio/spider-clean-submission/playground`, runs the
  benchmark, identifies ktx capability gaps) writes a draft spec here when it
  finds a gap.
- **`specs/`** — refined specs. A **refinement pass** (brainstorming) takes a
  `todo/` draft and produces a proper, implementation-ready spec at
  `specs/<same-filename>.md`: sharpened requirements, resolved ambiguities,
  acceptance criteria, and orientation hints. The refined spec is the **durable
  artifact** the implementer builds from — it stays in `specs/` permanently and
  never moves.
- **`done/`** — intake drafts whose feature has shipped (see below).

The **ktx worktree agent** (started from a ktx repo worktree, e.g.
`/Users/andrey/conductor/workspaces/ktx/tallinn-v2`) implements from the
refined spec in `specs/` (falling back to the `todo/` draft only if no refined
spec exists yet). When the feature is implemented it:

1. appends a short **"Implementation notes"** section to the refined spec in
   `specs/` (what was built, where, any deviations); and
2. **moves the original intake draft from `todo/` to `done/`.**

Location is status: `todo/` = draft awaiting implementation, `done/` = draft
whose feature shipped, `specs/` = refined specs (permanent home, do not move).
A draft and its refined spec share the same filename so they correspond
(`todo/01-foo.md` ↔ `specs/01-foo.md` ↔ `done/01-foo.md`). No other tracking.
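The "location is status" convention can be sketched as a small lookup. This is only an illustration, not part of ktx: the function name `spec_status` and the status labels are hypothetical, and only the three directory names come from the text above.

```python
from pathlib import Path

# Directory names come from the README; the status labels are illustrative.
PIPELINE_DIRS = ("todo", "specs", "done")


def spec_status(root: Path, filename: str) -> dict:
    """Report where a spec lives; location is the only tracking mechanism."""
    present = {d: (root / d / filename).is_file() for d in PIPELINE_DIRS}
    if present["done"]:
        state = "shipped"
    elif present["todo"]:
        state = "awaiting implementation"
    else:
        state = "unknown"
    # The refined spec never moves, so it is reported separately from the state.
    return {"state": state, "refined": present["specs"]}
```

Calling `spec_status(Path("."), "01-foo.md")` from the specs directory would then answer both questions at once: has the feature shipped, and does a refined spec exist yet.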

## Rules for specs

1. **Generic, not benchmark-overfit.** ktx is a general-purpose product; the
   benchmark only surfaces the need. Every spec must state a real-world use
   case independent of Spider 2.0-Lite. If a requirement only makes sense for
   the benchmark, it doesn't belong in ktx.
2. Specs are **requirement-level**, not implementation plans. Code pointers in
   specs are orientation hints from exploration (line numbers may have
   drifted); the implementer owns the design.
3. One spec per file, kebab-case, numeric prefix = suggested priority order.
   A refined spec in `specs/` keeps the same filename as its `todo/` draft.

## For the implementer

- After implementing, rebuild and re-link the dev binary so the playground
  picks it up: `pnpm run build && pnpm run link:dev` (provides `ktx-dev`).
- Add/extend tests in the ktx test suites; specs list acceptance criteria to
  cover.
- Build from the refined spec in `specs/`. On completion, append
  "Implementation notes" to that spec (it stays in `specs/`) and move the
  intake draft from `todo/` to `done/`.
- If a spec turns out to be wrong or already satisfied, don't silently drop
  it — record why in the refined spec's notes and move the draft to `done/`
  explaining why no change was needed.

@@ -1,74 +0,0 @@
# Connection-scoped wiki pages

## Problem

Wiki pages have only two scopes today: `GLOBAL` and `USER`
(`packages/cli/src/context/wiki/types.ts`, frontmatter schema ~lines 14-29).
There is no way to associate a page with a connection. In a project with many
connections, all pages share one search index, so `wiki_search` for a generic
term ("orders", "revenue", "average order value") surfaces pages about the
wrong database. Concept names collide across databases constantly in
real-world multi-connection projects (several databases each with `orders`,
`customers`, etc.).

Today, when `memory_ingest` is called with a `connectionId`, that id is only
used to scope which semantic-layer sources the triage agent can see
(`memory-agent.service.ts` ~46-72, ~107-109); it is **not** persisted on the
resulting wiki page in any form.

## Generic use case

Any org with multiple databases/warehouses in one ktx project: org-wide
definitions ("fiscal year starts in February") should be visible everywhere,
while database-specific conventions ("in the events DB, `user_id` is the
anonymous device id, not the account id") should not pollute searches about
other databases.

## Requirements

1. **Frontmatter field.** Add an optional `connections:` field to wiki page
   frontmatter — a list of connection ids (accept a single string too,
   normalize to list).
   - **Absent or empty ⇒ unscoped: the page applies to all connections.**
     This is exactly today's behavior, so every existing page is unaffected
     (backward compatible by construction).
2. **Search filtering.** `wiki_search` (MCP tool, `context-tools.ts` ~46-64)
   and `ktx wiki search` / `ktx wiki list` (CLI,
   `knowledge-commands.ts`) accept an optional `connectionId`:
   - With `connectionId: X` ⇒ return pages scoped to X **∪** unscoped pages.
   - Without ⇒ current behavior, all pages.
   - The filter must apply to **all three search lanes** (lexical FTS5,
     semantic/embedding, token fallback) in
     `local-knowledge.ts` / `sqlite-knowledge-index.ts` — not as a post-filter
     that eats into the result limit unevenly.
3. **Index.** Persist the scoping in the `.ktx/db.sqlite` knowledge index
   (the index is already re-synced from files on every search,
   `local-knowledge.ts` ~286-310, so a schema addition + sync is sufficient).
4. **Write path.** The memory agent's wiki-write tool accepts the connections
   field; when `memory_ingest` is invoked with a `connectionId`, the agent
   should default new database-specific pages to that connection, while still
   being allowed to write unscoped pages for clearly org-wide content (prompt
   guidance, not a hard rule).
5. **`wiki_read` and refs are unchanged** — pages remain addressable by key
   regardless of scoping; `connections` is a search/relevance concern only.
6. **Validation.** Warn (don't fail) when a page references a connection id
   not present in `ktx.yaml` — config and content can evolve independently.
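Requirements 1 and 2 can be sketched with Python's built-in `sqlite3`. This is a minimal illustration under stated assumptions, not the ktx index: the `pages` / `page_connections` tables and the function names are hypothetical, and a plain `LIKE` match stands in for the real search lanes. What it shows is the frontmatter normalization and a scope predicate that sits inside the query, so it is applied before `LIMIT` rather than as a post-filter.

```python
import sqlite3


def normalize_connections(value) -> list[str]:
    """Accept a missing value, a single string, or a list; return a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def search_pages(db: sqlite3.Connection, term: str, connection_id=None, limit=10):
    """Substring search with the scope predicate inside the query, before LIMIT."""
    sql = "SELECT p.key FROM pages p WHERE p.body LIKE ?"
    params = [f"%{term}%"]
    if connection_id is not None:
        # Scoped to X, or unscoped (no rows in page_connections at all).
        sql += (
            " AND (EXISTS (SELECT 1 FROM page_connections c"
            " WHERE c.page_key = p.key AND c.connection_id = ?)"
            " OR NOT EXISTS (SELECT 1 FROM page_connections c"
            " WHERE c.page_key = p.key))"
        )
        params.append(connection_id)
    sql += " ORDER BY p.key LIMIT ?"
    params.append(limit)
    return [row[0] for row in db.execute(sql, params)]


db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE pages (key TEXT PRIMARY KEY, body TEXT);
    CREATE TABLE page_connections (page_key TEXT, connection_id TEXT);
    """
)
pages = {
    "fiscal-year": ("orders follow the fiscal year", None),  # unscoped
    "orders-a": ("orders in db_a", "db_a"),                  # single string
    "orders-b": ("orders in db_b", ["db_b"]),                # list form
}
for key, (body, conns) in pages.items():
    db.execute("INSERT INTO pages VALUES (?, ?)", (key, body))
    for conn_id in normalize_connections(conns):
        db.execute("INSERT INTO page_connections VALUES (?, ?)", (key, conn_id))

print(search_pages(db, "orders", "db_a"))  # pages scoped to db_a plus unscoped pages
print(search_pages(db, "orders"))          # no filter: all pages
```

Because the predicate is part of the `WHERE` clause, a search with `limit=1` still returns one relevant page instead of spending the limit on pages that a post-filter would then discard.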

## Acceptance criteria

- A page with `connections: [db_a]` is returned by
  `wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but
  **not** by `wiki_search(query, connectionId: "db_b")`.
- A page with no `connections` field is returned in all three cases above.
- Existing projects with no scoped pages behave identically before/after.
- Filtering works in each lane independently (test with embeddings disabled
  to exercise lexical/token lanes alone).
- `memory_ingest(content, connectionId)` produces a page scoped to that
  connection for database-specific content.

## Benchmark context (motivation only)

Spider 2.0-Lite local subset = one project with 30 SQLite connections whose
schemas share table/concept names (Northwind, sakila, two e-commerce DBs…).
External-knowledge docs (RFM definition, F1 overtake rules) are each relevant
to exactly one database and must not surface for the other 29.

@@ -1,71 +0,0 @@
# Verbatim ingest mode for authoritative documents

## Problem

`ktx ingest --text/--file` routes content through the memory agent
(`text-ingest.ts` ~246-357 → `memory-agent.service.ts`), an LLM triage loop
(30-step budget for `external_ingest`, content clipped at ~48k chars,
`memory-agent.service.ts` ~165) that may rewrite, condense, or split the
content before writing wiki pages.

For *authoritative* documents — formula definitions, specs, runbooks,
compliance text — paraphrasing is a bug, not a feature:

- exact thresholds, constants, and rule wording must survive byte-for-byte;
- lexical (BM25) search works best when the stored text matches the phrasing
  users/agents will query with;
- ingestion should be deterministic and reproducible — same input file, same
  resulting page.

## Generic use case

Any team ingesting documents that are already the source of truth: metric
definition sheets, SLA documents, calculation methodology docs, regulatory
text. The user wants ktx to *index and surface* the document, not to
re-author it.

## Requirements

1. **Flag.** `ktx ingest --file <path> --verbatim` (apply to `--text` too).
   Composes with the existing optional `--connection <id>` so the resulting
   page can be connection-scoped (see spec 01).
2. **Body preservation is enforced by code, not by prompt.** The stored page
   body must be the input content byte-for-byte. The LLM is used **only** to
   generate metadata: `summary`, `tags`, `sl_refs`, suggested page key/slug
   (and `connections` default from the flag). Implementation freedom: a
   single constrained LLM call is fine — the full memory-agent loop is not
   required for this mode.
3. **No clipping of the stored body.** The ~48k clip may apply to what is
   *sent to the LLM* for metadata generation, never to what is *written* to
   the wiki page.
4. **Existing frontmatter.** If the input file already has YAML frontmatter,
   preserve user-provided fields and only fill gaps (don't overwrite an
   explicit `summary` with a generated one).
5. **Key collisions.** Deterministic, non-destructive behavior: error or
   suffix — never silently overwrite an existing page.
6. **Degraded mode.** With `llm.provider.backend: none`, `--verbatim` should
   still work, deriving `summary` from the first heading/sentence and leaving
   optional metadata empty. (Regular agent ingest can't do this; verbatim
   mode can and should.)
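A minimal sketch of how requirements 2 to 6 might fit together, assuming an in-memory dict stands in for the wiki store. Every name here (`verbatim_ingest`, `derive_summary`, `KeyCollisionError`) is hypothetical and the sample document is invented; the point is only that the body is handled as opaque bytes while metadata is merged separately.

```python
import hashlib


class KeyCollisionError(Exception):
    """Raised instead of overwriting an existing page (requirement 5)."""


def derive_summary(body: bytes) -> str:
    """Degraded-mode summary: first non-empty line, heading markers stripped."""
    for line in body.decode("utf-8", errors="replace").splitlines():
        text = line.strip().lstrip("#").strip()
        if text:
            return text
    return ""


def verbatim_ingest(store: dict, key: str, body: bytes,
                    user_meta: dict, generated_meta: dict) -> dict:
    """Store `body` untouched; generated metadata only fills gaps."""
    if key in store:
        raise KeyCollisionError(f"page {key!r} already exists")
    meta = dict(generated_meta)
    meta.update(user_meta)  # explicit user fields win over generated ones
    meta.setdefault("summary", derive_summary(body))  # no-LLM fallback
    store[key] = {"meta": meta, "body": body}  # body is never clipped or rewritten
    return store[key]


store: dict = {}
doc = b"# Late fee policy\n\nA fee of 2.5% applies after 30 days.\n"
page = verbatim_ingest(store, "late-fee-policy", doc, {}, {})
unchanged = hashlib.sha256(page["body"]).digest() == hashlib.sha256(doc).digest()
print(page["meta"]["summary"], unchanged)
```

Passing empty `generated_meta` models the degraded mode: the page is still written, with a summary derived from the first heading. A second call with the same key raises rather than overwriting.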

## Acceptance criteria

- Ingesting a file with `--verbatim` produces a wiki page whose body is
  byte-identical to the input (assert with a hash in tests).
- Running the same ingest twice is idempotent or fails loudly on the second
  run (per requirement 5) — no duplicated/divergent pages.
- A >48k-char file is stored in full.
- `--verbatim --connection X` yields a page scoped to X (depends on spec 01;
  if 01 isn't implemented yet, the flag composition can land later).
- Generated metadata makes the page findable: `wiki_search` for a phrase
  from the document body returns it (lexical lane), and for a paraphrase of
  its topic returns it when embeddings are enabled (semantic lane).

## Benchmark context (motivation only)

Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket
definitions, haversine formula, F1 overtake rules…). Gold SQL was authored
against their exact text; an LLM paraphrase that drops a bucket boundary
loses a question. We currently work around this by hand-writing frontmatter
and copying files into `wiki/global/` — verbatim mode makes that a supported
ktx workflow instead of a manual step.

@@ -1,63 +0,0 @@
# Schema scan must tolerate individual objects that fail introspection

> Priority: MEDIUM. Found during the first full Spider2-lite sqlite ingest
> (2026-06-13): one database (`oracle_sql`) failed to ingest **entirely**
> because a single broken VIEW errored during introspection, leaving that
> connection with no semantic layer at all.

## Problem

`ktx ingest <connection>` aborts the whole database's schema scan when one
table/view errors during introspection/profiling. In `oracle_sql` the view
`emp_hire_periods_with_name` is defined as
`SELECT ehp.start_date, ehp.end_date ... FROM emp_hire_periods ehp ...` but the
base table has no `start_date`/`end_date` columns — so any attempt to read it
raises `no such column: ehp.start_date`. That single broken object failed the
ingest of all ~48 healthy tables/views in the database.

A second, related symptom: setting `enabled_tables: [main.customers]` to work
around it produced a different hard failure (`Adapter "database schema" did not
recognize fetched source output`), so the documented allowlist escape hatch did
not provide a clean fallback either.

## Generic use case

Real databases routinely contain broken or inaccessible objects: views over
dropped/renamed columns, views referencing tables the connection role can't
read, permission-denied tables, or vendor system views that error. ktx should
ingest everything it *can* and skip what it can't — never let one bad object
zero out an entire connection's context. This is basic robustness for
production warehouses, not benchmark-specific.

## Requirements

1. **Per-object isolation.** If introspecting/profiling one table or view
   throws, skip that object, record a warning (object name + error), and
   continue scanning the rest. The connection's semantic layer is built from
   the objects that succeeded.
2. **Surface, don't hide.** Report skipped objects in the ingest summary and in
   `ktx status` (e.g. "oracle_sql: 1 object skipped — emp_hire_periods_with_name:
   no such column ehp.start_date"). Honor `failureMode` for whole-connection
   aborts, but a single bad object should not count as a connection failure.
3. **Views vs tables.** A broken view should never block base-table ingest.
   Consider profiling views defensively (they are read-only projections).
4. **Allowlist fallback should work.** `enabled_tables` should reliably restrict
   the scan to the listed objects (and the qualification format for sqlite must
   be documented and accepted). Fix the `did not recognize fetched source
   output` failure when the allowlist yields a small/edge-case set.
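Per-object isolation (requirement 1) can be sketched against an in-memory SQLite database. This is an illustration only, not the ktx scanner: `scan_objects` is a hypothetical name, the `emp_id`/`hired` and `customers` columns are invented, and only the view name and its missing `start_date`/`end_date` columns come from the report above. It relies on SQLite resolving a view's column references when the view is read rather than when it is created.

```python
import sqlite3


def scan_objects(db: sqlite3.Connection):
    """Probe each table/view on its own; one failure never aborts the scan."""
    scanned, skipped = [], []
    objects = db.execute(
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    ).fetchall()
    for name, kind in objects:
        quoted = '"' + name.replace('"', '""') + '"'
        try:
            cursor = db.execute(f"SELECT * FROM {quoted} LIMIT 1")
            columns = [col[0] for col in cursor.description]
            scanned.append((name, kind, columns))
        except sqlite3.Error as exc:
            # Record a warning and keep going; this is not a connection failure.
            skipped.append((name, str(exc)))
    return scanned, skipped


db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE emp_hire_periods (emp_id INTEGER, hired TEXT);
    CREATE TABLE customers (id INTEGER, name TEXT);
    -- The view references columns the base table does not have.
    CREATE VIEW emp_hire_periods_with_name AS
        SELECT ehp.start_date, ehp.end_date FROM emp_hire_periods ehp;
    """
)
scanned, skipped = scan_objects(db)
for name, error in skipped:
    print(f"skipped {name}: {error}")
```

The two healthy tables end up in `scanned` and the broken view in `skipped` with its error text, which is the information requirement 2 asks to surface in the ingest summary.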

## Acceptance criteria

- Ingesting a sqlite DB containing one broken view plus N healthy tables yields
  a semantic layer for the N healthy tables and a warning naming the broken view
  — exit is success (not "failed"), subject to `failureMode`.
- The skipped object is listed in the ingest summary and `ktx status`.
- `enabled_tables` restricted to a subset ingests exactly that subset without the
  adapter-output error.

## Benchmark context (motivation only)

`oracle_sql` (8 of the 135 sqlite questions) currently has no semantic layer
because of its one broken view; those questions must be solved from raw
`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning
would restore enriched context for that database.

@@ -1,112 +0,0 @@
# Add universal SQL-authoring craft to the ktx-analytics skill

> Priority: HIGH. The `ktx-analytics` skill currently tells the agent *which
> ktx tools to call and in what order*, but gives almost no guidance on
> *writing correct SQL*. In benchmark runs the agent reliably produced
> runnable SQL (0 execution errors) yet failed on correctness — precision,
> determinism, type mismatches, and answer completeness. These are universal
> analytics-engineering truths that every ktx user benefits from, so they
> belong in the shipped skill, not in any caller's prompt.

## Scope guard (read first)

Only **universally-true** SQL/analytics craft goes here — guidance that helps a
real ktx user querying a **live** database. The test for inclusion: *"Would this
advice be correct and useful for an analyst on a current, production database?"*

**Dialect-specific syntax is out of scope here.** The v9 harnesses' only
per-dialect content (Snowflake: `DB.SCHEMA.TABLE` FQTNs, double-quoted
lowercase cols, VARIANT colon-paths; BigQuery: backtick FQTNs, `_TABLE_SUFFIX`
for sharded tables; sqlite: `strftime`/`julianday`) is genuinely useful but
belongs in a **dialect-aware** location (per-driver notes), not this flat
skill. Track separately as a follow-up; the rules below must stay
dialect-agnostic.

Explicitly **do NOT** add (these are application/consumer concerns, not skill
concerns, and some are actively wrong for live data):
- Output-format contracts ("return a bare result set with exactly these
  columns, no prose"). The skill is for interactive analysis and already
  favors readable tables + summaries; a caller that needs a strict result
  shape specifies that itself.
- Anchoring relative time ("recent", "past N months") to `MAX(date)` of the
  data. On a live database "recent" means relative to *now*; anchoring to
  `MAX(date)` is only valid for static snapshots and must not be baked into
  the product.
- Anything justified by a grader/scoring comparator.

## File

`packages/cli/src/skills/analytics/SKILL.md` (the shipped skill;
`setup-agents.ts` installs it into agent environments — the copy under a
project's `.claude/skills/` is regenerated from this source). Extend the
existing `<rules>` block and step 5 ("Query") / step 6 ("Validate and
explain"); keep the existing interactive guidance intact.

## Requirements — add these as general rules (behavior only, no rationale that references answers/graders)

**Schema discovery before writing SQL**
1. Inspect representative sample rows of each table before composing SQL —
   confirm date/time encoding (e.g. `YYYYMMDD` vs ISO vs epoch), null
   prevalence in join/filter keys, and the actual set of categorical/enum
   values. (`entity_details` + a small `sql_execution` sample.)
2. Cast a column to its real type before comparing it in `WHERE`/`JOIN`. A
   string column compared against a numeric literal (or vice versa) can
   silently match nothing.

**Composition discipline**
3. Build complex queries incrementally — one CTE at a time, verifying each
   layer's output on a small sample before stacking the next.
4. Avoid joins that fan out row counts. Add columns only from tables already
   required by the grain, or pre-aggregate to the target grain before joining.

**Window-function correctness**
5. Give every ranking/ordering window function a complete, deterministic
   tie-breaker (append unique key columns), so `RANK`/`ROW_NUMBER`/`LAG`
   results are stable rather than flickering across runs.
6. Apply row filters **after** window functions for sequence / "first" /
   "most recent" / "since" questions — compute over the full partition, then
   filter.

**Numeric precision**
7. Compute at full precision; round only in the final projection, never inside
   intermediate CTEs.
8. Be explicit about truncation (`CAST AS INT` truncates; use explicit
   rounding when rounding is intended).
9. Distinguish "average of per-group averages" (macro: `AVG(group_metric)`)
   from "overall/weighted average" (micro: `SUM(num)/SUM(den)`) based on the
   question's wording.

**Answer completeness / interpretation**
10. "top / highest / most / lowest" → return only the winning row(s) (e.g.
    `RANK() = 1` / `QUALIFY`), not the full ranked list, unless a list is asked
    for.
11. "for each X / per X / by X" → exactly one row per X; don't collapse to a
    single value unless the question says "overall" or "total across X".
12. When a question asks for inputs and a derived value ("X, Y, and their
    ratio"), include the inputs as columns alongside the derived value.
13. When grouping by a human-readable label (a name), also expose the entity's
    identifier — identity, not just the label, is part of the result.
14. When a result is unexpectedly empty, relax filters one at a time to find
    which predicate removed the rows.
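Two of the numeric-precision rules are easy to see on a toy dataset. The sketch below uses Python's built-in `sqlite3` with invented tables (`store_sales`, `fees`) and invented values, purely as an illustration: rule 9 (macro vs micro average) gives two different, equally runnable answers, and rule 7 (round late) changes a total when rounding happens per row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE store_sales (store_id INTEGER PRIMARY KEY, revenue REAL, orders INTEGER);
    INSERT INTO store_sales VALUES (1, 100.0, 2), (2, 900.0, 10);
    CREATE TABLE fees (amount REAL);
    INSERT INTO fees VALUES (0.004), (0.004), (0.004);
    """
)

# Rule 9, macro: average of per-store averages, every store weighted equally.
macro = db.execute("SELECT AVG(revenue / orders) FROM store_sales").fetchone()[0]
# Rule 9, micro: overall average order value, every order weighted equally.
micro = db.execute("SELECT SUM(revenue) / SUM(orders) FROM store_sales").fetchone()[0]

# Rule 7: rounding each row first loses the total; rounding last keeps it.
rounded_early, rounded_late = db.execute(
    "SELECT SUM(ROUND(amount, 2)), ROUND(SUM(amount), 2) FROM fees"
).fetchone()

print(macro, micro, rounded_early, rounded_late)
```

Here the macro average is 70 (the mean of 50 and 90) while the micro average is 1000/12, about 83.33; which one is right depends entirely on the question's wording. The per-row rounding sums to zero, whereas rounding the total keeps the cent.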

## Acceptance criteria

- The shipped `analytics/SKILL.md` contains the rules above, phrased as general
  truths with **no reference to any benchmark, gold answer, or scoring
  comparator**.
- Existing interactive guidance (compact result tables, summaries,
  clarification prompts, the tool-order workflow) is preserved — the skill must
  still read well for an interactive human-facing analysis session.
- None of the excluded items (output-shape contract, `MAX(date)` anchoring,
  grader-driven advice) appear.
- Skill stays within a reasonable size; group the new rules under clear
  sub-headings so they're scannable.

## Benchmark context (motivation only)

On the Spider 2.0-Lite sqlite subset, the solver produced 0 execution errors
but ~50 result mismatches; a large share traced to exactly these gaps
(premature rounding, string-vs-number compares, non-deterministic window
ordering, returning full lists for "top" questions, dropping inputs to derived
values). These are generic SQL-authoring defects — fixing them in the skill
improves ktx for everyone and, as a side effect, the benchmark.

@@ -1,83 +0,0 @@
# Per-dialect SQL syntax notes (dialect-aware, scoped to the connection)

> Intake draft. Companion to `specs/07-analytics-skill-sql-craft.md`, which kept
> the analytics SQL craft dialect-agnostic and explicitly deferred per-dialect
> syntax here.

## Problem

Spec 07 deliberately keeps the analytics SQL-authoring craft
**dialect-agnostic** — every rule must read correctly on any engine. But a lot of
*real* correctness depends on dialect-specific syntax that spec 07 excludes and
defers to this follow-up:

- **Snowflake:** `DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase identifiers,
  VARIANT colon-paths.
- **BigQuery:** backtick FQTNs, `_TABLE_SUFFIX` for sharded tables, `QUALIFY`.
- **sqlite:** `strftime`/`julianday` for dates, no `QUALIFY`.

This guidance is genuinely useful to an agent writing SQL against a live
database, but it must **not** pollute the flat dialect-agnostic skill — an agent
querying sqlite should never see Snowflake VARIANT syntax. It belongs in a
**dialect-aware** location, surfaced only for the dialect the active connection
actually uses.

## Generic use case

Any ktx project whose connections span more than one warehouse engine (e.g. a
Snowflake warehouse + a BigQuery export + a local sqlite extract). When the agent
writes SQL for a given connection, it should get that engine's syntax
conventions — and nothing for the engines it isn't querying.

## Requirements

1. **Per-driver dialect notes.** Author concise, correct syntax notes per
   supported driver: FQTN form, identifier quoting/case, date/time functions,
   top-N / window-filtering idiom, semi-structured access. These are genuine
   per-engine invariants, so enumerating them per driver is acceptable (unlike a
   denylist of bad specifics).
2. **Scope to the active dialect, derived from state.** Which notes the agent
   sees must be selected from the connection's configured driver/dialect
   (`ktx.yaml` connections / the connector registry), not guessed and not shown
   all at once. The flat analytics skill stays dialect-agnostic (spec 07
   invariant preserved).
3. **Delivery mechanism (enabling sub-requirement).** The shipped skill is
   installed as a **single `SKILL.md`** per target (`setup-agents.ts` /
   `readAnalyticsSkillContent`). Surfacing per-dialect notes on demand needs one
   of two approaches; the refinement pass should compare them before committing:
   - **Multi-file skill delivery** — bundle `reference/<dialect>.md` files and
     have the skill point to the one matching the connection. Requires extending
     `setup-agents.ts` to copy a skill *directory* (Claude Code, Codex, universal
     `.agents`) and a multi-file zip (Claude Desktop), a **flatten/concatenate
     transform** for the single-file targets (Cursor `.mdc`, OpenCode `.md`), and
     **per-file manifest entries** for clean uninstall. This is the
     install-mechanism improvement spec 07's Model section flags as future work.
   - **Dynamic MCP delivery** — an MCP surface returns the dialect hints for a
     given `connectionId` (the MCP layer already resolves the connection's
     dialect), so no install change is needed and Cursor/OpenCode get identical
     behavior. May be the lower-cost, more uniform path; weigh it first.
4. **No dialect syntax leaks into the dialect-agnostic skill.** Spec 07's
   acceptance criterion (no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in
   `analytics/SKILL.md`) stays green. This work adds a *separate* dialect-aware
   channel; it does not amend the flat skill.
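Requirement 2 (select notes from the connection's configured driver) reduces to a lookup keyed by state rather than by guesswork. The sketch below is hypothetical throughout: the `CONNECTIONS` map only loosely imitates a `ktx.yaml` connections section, the connection ids are invented, and `dialect_notes_for` is not an existing ktx function. The note texts are limited to the items listed in the Problem section.

```python
# Hypothetical project config, loosely shaped like a `ktx.yaml` connections map.
CONNECTIONS = {
    "warehouse": {"driver": "snowflake"},
    "export": {"driver": "bigquery"},
    "extract": {"driver": "sqlite"},
}

# One short note per driver, limited to items the spec itself lists.
DIALECT_NOTES = {
    "snowflake": "DB.SCHEMA.TABLE FQTNs; double-quoted lowercase identifiers; VARIANT colon-paths.",
    "bigquery": "Backtick FQTNs; _TABLE_SUFFIX for sharded tables; QUALIFY.",
    "sqlite": "strftime/julianday for dates; no QUALIFY.",
}


def dialect_notes_for(connection_id: str) -> str:
    """Return notes for the connection's configured driver only."""
    try:
        driver = CONNECTIONS[connection_id]["driver"]
    except KeyError:
        raise ValueError(f"unknown connection: {connection_id!r}") from None
    # A driver without notes yields an empty answer, never another engine's notes.
    return DIALECT_NOTES.get(driver, "")


print(dialect_notes_for("extract"))
```

Either delivery mechanism in requirement 3 could sit on top of a lookup like this: the multi-file variant would resolve a `reference/<dialect>.md` path, while the MCP variant would return the text directly for a `connectionId`.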

## Acceptance criteria

- An agent querying a sqlite connection gets sqlite date idioms and never sees
  Snowflake/BigQuery-only syntax; an agent querying Snowflake gets
  FQTN/identifier/VARIANT guidance.
- The dialect shown is **derived from the connection's configured driver**, not
  hardcoded per project and not guessed.
- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are
  unaffected.
- Whichever delivery mechanism is chosen installs/serves correctly across **all**
  supported agent targets, including the single-file Cursor/OpenCode shape.

## Benchmark context (motivation only)

The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake
(`DB.SCHEMA.TABLE` FQTNs, double-quoted lowercase cols, VARIANT colon-paths),
BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite
(`strftime`/`julianday`). That content is real and useful but engine-specific;
spec 07 kept it out of the flat skill and deferred it here so the
dialect-agnostic rules stay clean.

@@ -1,150 +0,0 @@
# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill

## Problem

The `ktx-analytics` skill already carries a fan-out rule (spec 07, rule 4:
*"Avoid fan-out joins — add columns only from tables already at the target
grain, or pre-aggregate to that grain before joining; a join that multiplies
rows quietly inflates every downstream `SUM`/`COUNT`"*). In practice the agent
honors it on a single join but still **silently fans out on multi-hop join
chains**, where the inflation is one or two joins removed from the aggregate and
therefore much harder to notice.

The failure shape: a metric that lives at a *coarse* grain (e.g. one row per
parent record) is counted/summed *after* the parent has been joined down to a
*finer* grain (e.g. one row per child line). Every parent-level value is then
duplicated by its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by an
amount that depends on the data — runnable SQL, plausible-looking number,
quietly wrong.

The rule today is stated as a *prohibition* ("avoid"). It needs to be a
*detect-and-fix habit*: a concrete multi-hop example of the trap, and an active
verification step the agent runs while composing, not just an instruction to be
careful.
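The failure shape is easy to reproduce on a toy dataset. The sketch below uses Python's built-in `sqlite3` with an invented two-table schema and invented amounts (nothing here comes from a benchmark). It sums order amounts for orders that have at least one backordered line, and compares the fanned-out sum, a `SUM(DISTINCT ...)` attempt, and the pre-aggregated join, followed by a row-count check on the join.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE order_lines (order_id INTEGER, is_backordered INTEGER);
    INSERT INTO orders VALUES (1, 50.0), (2, 50.0), (3, 20.0);
    -- order 1 has three backordered lines, order 2 has one, order 3 has none
    INSERT INTO order_lines VALUES (1, 1), (1, 1), (1, 1), (2, 1), (3, 0);
    """
)

JOIN = (
    "FROM orders o JOIN order_lines l "
    "ON l.order_id = o.order_id AND l.is_backordered = 1"
)

# Fanned out: order 1's amount is repeated once per matching line.
inflated = db.execute(f"SELECT SUM(o.amount) {JOIN}").fetchone()[0]

# DISTINCT attempt: orders 1 and 2 both cost 50.0, so one of them disappears.
collapsed = db.execute(f"SELECT SUM(DISTINCT o.amount) {JOIN}").fetchone()[0]

# Pre-aggregate the lines to one row per order, then join one-to-one.
correct = db.execute(
    """
    WITH flagged AS (
        SELECT order_id FROM order_lines WHERE is_backordered = 1 GROUP BY order_id
    )
    SELECT SUM(o.amount) FROM orders o JOIN flagged f ON f.order_id = o.order_id
    """
).fetchone()[0]

# Grain check: a grain-preserving join has as many rows as distinct order keys.
rows, keys = db.execute(
    f"SELECT COUNT(*), COUNT(DISTINCT o.order_id) {JOIN}"
).fetchone()

print(inflated, collapsed, correct, rows, keys)
```

The three sums come out as 200, 50, and 100: both wrong answers are runnable and plausible. The grain check reports 4 joined rows for 2 distinct orders, which is the signal that the join multiplied rows before the aggregate.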
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
An analyst on any production warehouse asks: *"How many orders are there per
|
||||
region?"* where the path from region to the order's detail runs through several
|
||||
hops (region → store → order → order line). The honest answer counts each order
|
||||
once. If the query descends to the line-level table along the way (e.g. for a
|
||||
filter), each order is counted once **per line on the order**, inflating the
|
||||
per-region total. Attribution here is unambiguous — each order belongs to exactly
|
||||
one store and thus one region — so the *only* thing that can go wrong is the row
|
||||
multiplication, which is exactly what makes it a clean teaching case. This is one
|
||||
of the most common silently-wrong analytics mistakes on normalized schemas — it
|
||||
is not
|
||||
specific to any dataset, dialect, or benchmark.
|
||||
|
||||
## Requirements
|
||||
|
||||
This extends the existing `<sql_craft>` "Composition" guidance in the
|
||||
`ktx-analytics` skill (spec 07). Additive only; keep it inline, dialect-agnostic,
|
||||
and stated as a heuristic-plus-why (consistent with spec 07's style).
|
||||
|
||||
1. **Generalize the fan-out rule to multi-hop chains.** Make explicit that the
|
||||
danger is *cumulative*: any one-to-many hop on the path between the table that
|
||||
owns a measure and the aggregate inflates that measure, even when the
|
||||
offending join is several hops away from the `SUM`/`COUNT`. The fix is the
|
||||
same as the single-hop case — **pre-aggregate the measure to its own grain in
|
||||
a CTE, then join the already-aggregated result** — but the agent must apply it
|
||||
per measure-owning table along the whole chain, not just at the final join.
|
||||
|
||||
2. **Add a verification habit, not just a prohibition.** While composing, the
|
||||
agent should confirm a join did not change the grain it intends to aggregate
|
||||
at — e.g. check that the row count (or the count of the aggregate's key) is
|
||||
unchanged across a join that is supposed to be one-to-one / many-to-one, and
|
||||
pre-aggregate the finer table to grain when it is one-to-many. This is the same
|
||||
"build incrementally and check each layer" discipline spec 07 already endorses,
|
||||
pointed specifically at grain preservation.
|
||||
|
||||
**Pre-aggregate is the general fix; `COUNT(DISTINCT)` is a count-only
|
||||
shortcut.** Pre-aggregating the finer table to the measure's grain in a CTE and
|
||||
then joining one-to-one is the remedy that works for every aggregate
|
||||
(`COUNT`/`SUM`/`AVG`). `COUNT(DISTINCT <key>)` is a valid one-liner *for counts
|
||||
only* — it must NOT be generalized to a fanned-out `SUM`/`AVG`, because two
|
||||
rows can legitimately hold equal amounts and `DISTINCT` would wrongly collapse
|
||||
them. State this trap explicitly; a naïve "just use `COUNT(DISTINCT)`" rule is
|
||||
silently wrong for sums.
|
||||
|
||||
3. **One concrete, generic multi-hop example.** Include a short worked example
|
||||
that shows the inflation and the fix. It must use an **invented, generic
|
||||
schema** — **no benchmark table names, no benchmark SQL, and no benchmark
|
||||
result values** (see "Leak-safety" below — hard constraint). The example must:
|
||||
(a) use a **plain `COUNT`** (not an average) so it isolates the fan-out lesson
|
||||
and does not entangle the skill's separate *macro-vs-micro average* rule; and
|
||||
(b) use a chain with **unambiguous single-owner attribution** so the only thing
|
||||
that can go wrong is row multiplication. The intended example is the chain
|
||||
`regions → stores → orders → order_lines` answering *"how many orders per region
|
||||
include at least one backordered line"* — each order belongs to exactly one
|
||||
store and thus exactly one region, so attribution is clean; the line-level
|
||||
filter gives `order_lines` a genuine reason to be joined (so the fix is the
|
||||
pre-aggregate remedy, not "drop the join"), and that join sits **several hops
|
||||
below** the region-level COUNT (the multi-hop point):
|
||||
|
||||
```sql
-- "How many orders per region include at least one backordered line?"
-- (order_lines is genuinely needed here, for the backordered filter, so the
-- fix is NOT "just drop the join".)
-- WRONG: the order_lines join is one row per matching line, joined several hops
-- BELOW the COUNT. An order with 3 backordered lines is counted 3 times, so the
-- per-region total is inflated by backordered-lines-per-order, silently wrong.
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered -- one-to-many: fan-out
GROUP BY r.region_id;

-- RIGHT (general remedy): collapse the finer table to the measure's grain in a
-- CTE FIRST, then join one-to-one so nothing multiplies. This same shape works
-- for SUM/AVG, not just COUNT.
WITH qualifying_orders AS ( -- back to ONE row per order
  SELECT DISTINCT order_id FROM order_lines WHERE is_backordered
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN qualifying_orders q ON q.order_id = o.order_id
GROUP BY r.region_id;

-- Count-only shortcut: COUNT(DISTINCT o.order_id) over the WRONG query also works
-- HERE. But it is counts-only: a fanned-out SUM/AVG of a per-order measure (e.g.
-- summing each order's shipping_fee after joining lines) must pre-aggregate;
-- DISTINCT would wrongly merge two orders that happen to share the same fee.
```
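
To make the inflation and the `DISTINCT` trap concrete, the sketch below runs the same invented schema against a tiny in-memory SQLite database. The sample rows, the `shipping_fee` values, and the helper name `scalar` are assumptions made up for illustration; nothing here comes from benchmark data. It compares the fanned-out count and sum with the pre-aggregated ones, and includes the row-count check across the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY);
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER, shipping_fee REAL);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER, is_backordered INTEGER);
INSERT INTO regions VALUES (1);
INSERT INTO stores VALUES (10, 1);
-- Two orders that happen to share the same shipping fee.
INSERT INTO orders VALUES (100, 10, 5.0), (101, 10, 5.0);
-- Order 100 has three backordered lines, order 101 has one.
INSERT INTO order_lines VALUES (1, 100, 1), (2, 100, 1), (3, 100, 1), (4, 101, 1);
""")

# Chain down to the order grain (one row per order).
JOINS = """
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
"""
# Same chain plus the one-to-many hop that causes the fan-out.
FANOUT = JOINS + "JOIN order_lines l ON l.order_id = o.order_id AND l.is_backordered\n"
# Pre-aggregated remedy: collapse order_lines to one row per order first.
PREAGG_CTE = """
WITH qualifying_orders AS (
  SELECT DISTINCT order_id FROM order_lines WHERE is_backordered
)
"""
PREAGG = JOINS + "JOIN qualifying_orders q ON q.order_id = o.order_id\n"


def scalar(sql):
    """Run a single-value query and return that value."""
    return conn.execute(sql).fetchone()[0]


# Grain check: the row count changes across the one-to-many join.
rows_before_join = scalar("SELECT COUNT(*) " + JOINS)
rows_after_join = scalar("SELECT COUNT(*) " + FANOUT)

wrong_count = scalar("SELECT COUNT(*) " + FANOUT)
distinct_count = scalar("SELECT COUNT(DISTINCT o.order_id) " + FANOUT)
right_count = scalar(PREAGG_CTE + "SELECT COUNT(*) " + PREAGG)

wrong_sum = scalar("SELECT SUM(o.shipping_fee) " + FANOUT)
distinct_sum = scalar("SELECT SUM(DISTINCT o.shipping_fee) " + FANOUT)
right_sum = scalar(PREAGG_CTE + "SELECT SUM(o.shipping_fee) " + PREAGG)

print("rows before/after join:", rows_before_join, rows_after_join)
print("count wrong/distinct/right:", wrong_count, distinct_count, right_count)
print("sum   wrong/distinct/right:", wrong_sum, distinct_sum, right_sum)
```

With these rows, the count shortcut agrees with the pre-aggregated count, while `SUM(DISTINCT ...)` undercounts because both orders carry the same fee. Only the pre-aggregated sum matches the true total of the two orders.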
|
||||
|
||||
## Leak-safety (hard constraint on this spec and its example)
|
||||
|
||||
The benchmark's gold answers must never appear in ktx. The worked example must
|
||||
be a **synthetic, generic schema invented for teaching** — not the tables,
|
||||
column names, query, or numeric results of any Spider 2.0-Lite question. The
|
||||
example demonstrates the *pattern* (coarse-grain measure counted after a
|
||||
one-to-many join), which is universal; it must be reconstructable from first
|
||||
principles by anyone, with zero reference to benchmark data. A reviewer should
|
||||
be able to read the example and find nothing that ties it to a specific
|
||||
benchmark instance.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- The skill's `<sql_craft>` Composition section states the multi-hop
|
||||
generalization of the fan-out rule and a grain-verification habit, inline and
|
||||
dialect-agnostic.
|
||||
- It includes exactly one short, **generic** worked example (wrong vs.
|
||||
pre-aggregated-right) using an invented schema, with no benchmark-derived
|
||||
identifiers or values.
|
||||
- No new tool, flag, or config; this is skill-content only (additive to spec 07).
|
||||
- Existing analytics-skill content tests are updated to cover the added rule's
|
||||
presence (mirroring spec 07's `analytics-skill-content.test.ts`).
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
Multi-hop aggregation questions (counting/averaging a coarse-grained measure
|
||||
reached through several one-to-many joins) are a recurring source of
|
||||
result-mismatch failures in the SQLite subset: the agent produces runnable SQL
|
||||
with the right tables but a fan-out-inflated number. These are correctness
|
||||
failures, not knowledge or schema-discovery failures (zero execution errors in
|
||||
the latest run), so the fix belongs in the product's authoring craft — where it
|
||||
also helps any real analyst — not in a benchmark-specific prompt.
|
||||
```
|
||||
|
|
@ -1,65 +0,0 @@
|
|||
# Panel/period completeness — emit the full set of groups, not only the populated ones
|
||||
|
||||
## Problem
|
||||
|
||||
When a question asks for a result *per period* or *per category* ("orders for each
|
||||
month of 2023", "revenue by region", "count per status"), the natural `GROUP BY`
|
||||
only returns groups that actually have rows. Periods/categories with **zero**
|
||||
activity silently vanish, so a "12 months" answer comes back with 9 rows and the
|
||||
ones that should read `0` are simply absent. The agent writes runnable SQL with
|
||||
the right aggregate but an **incomplete panel**.
|
||||
|
||||
This is a universal reporting correctness issue: a monthly report with missing
|
||||
months, or a category breakdown missing the empty categories, is wrong for any
|
||||
analyst — and it is also a frequent result-mismatch shape on the benchmark.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
"How many orders were placed in each month of 2023?" must return **12 rows** even
|
||||
if March had no orders (March = 0), not 11 rows. "Sales per region" should include
|
||||
regions with no sales (as 0/NULL) when the question asks for *each* region.
|
||||
|
||||
## Requirements
|
||||
|
||||
Additive to the `ktx-analytics` skill's `<sql_craft>` "Answer completeness /
interpretation" group (consistent with spec 07's inline, dialect-agnostic,
heuristic-plus-why style).
|
||||
|
||||
1. **Recognize "full-panel" phrasing.** Cues like *each / every / per <period> /
|
||||
for all <category> / by month* signal that the answer's row set should be the
|
||||
**complete** set of periods or categories in scope, not just those present in
|
||||
the filtered fact rows.
|
||||
|
||||
2. **Build a spine, then LEFT JOIN.** Generate the full set of expected
|
||||
groups — a date/number series via a recursive CTE for periods, or the distinct
|
||||
dimension values from the authoritative dimension table for categories — and
|
||||
LEFT JOIN the aggregated facts onto it, defaulting missing measures with
|
||||
`COALESCE(metric, 0)` (or NULL when 0 would be wrong). *Why:* a plain inner
|
||||
`GROUP BY` can only emit groups that have at least one fact row.
|
||||
|
||||
3. **Don't over-apply.** When the question asks only about groups that exist
|
||||
("which months had orders"), the spine is unnecessary; the cue is *each/all*
|
||||
vs *which*.
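
A minimal sketch of the spine recipe, assuming SQLite date functions and an invented `orders(order_id, order_date)` table with made-up rows. The CTE names `months` and `monthly` are illustrative only. It contrasts the plain `GROUP BY` with the spine + LEFT JOIN + `COALESCE` form for "orders in each month of 2023".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
INSERT INTO orders (order_date) VALUES
  ('2023-01-15'), ('2023-01-20'), ('2023-02-03'), ('2023-04-09'), ('2023-12-31');
""")

# Plain GROUP BY: only months that have at least one order come back.
PLAIN_SQL = """
SELECT strftime('%Y-%m', order_date) AS ym, COUNT(*) AS n_orders
FROM orders
WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
GROUP BY strftime('%Y-%m', order_date)
ORDER BY 1;
"""

# Spine: generate all 12 month starts, then LEFT JOIN the aggregated facts.
SPINE_SQL = """
WITH RECURSIVE months(month_start) AS (
  SELECT '2023-01-01'
  UNION ALL
  SELECT date(month_start, '+1 month') FROM months WHERE month_start < '2023-12-01'
),
monthly AS (
  SELECT strftime('%Y-%m', order_date) AS ym, COUNT(*) AS n_orders
  FROM orders
  WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
  GROUP BY strftime('%Y-%m', order_date)
)
SELECT strftime('%Y-%m', m.month_start) AS ym,
       COALESCE(f.n_orders, 0) AS n_orders  -- empty months read 0, not absent
FROM months m
LEFT JOIN monthly f ON f.ym = strftime('%Y-%m', m.month_start)
ORDER BY m.month_start;
"""

plain_rows = conn.execute(PLAIN_SQL).fetchall()
panel_rows = conn.execute(SPINE_SQL).fetchall()

print(len(plain_rows), "rows from plain GROUP BY")
print(len(panel_rows), "rows from spine + LEFT JOIN")
for ym, n in panel_rows:
    print(ym, n)
```

For a category panel the same shape applies, with the spine taken from the dimension table's distinct values instead of a generated date series.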
|
||||
|
||||
## Leak-safety (hard constraint)
|
||||
|
||||
Any worked example must use a **synthetic generic schema** (e.g. an `orders`
|
||||
table with an `order_date`) and demonstrate only the *pattern*
(spine + LEFT JOIN + COALESCE). No benchmark table names, SQL, or result values. The behavior is
|
||||
reconstructable from first principles and tied to no specific instance.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- `<sql_craft>` states the full-panel cue, the spine + LEFT JOIN + COALESCE recipe,
|
||||
and the over-application guard — inline and dialect-agnostic.
|
||||
- At most one short generic example (recursive-CTE date spine or distinct-dimension
|
||||
spine), no benchmark-derived content.
|
||||
- Skill-content only; analytics-skill content tests updated to cover the rule.
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
Per-period / per-category questions where some periods are empty produce
|
||||
short-row result mismatches in the SQLite subset. The fix is a universal
|
||||
reporting habit (complete panels), so it belongs in the product's craft, where it
|
||||
also helps real analysts — not in a benchmark-specific prompt. Related to spec 11
|
||||
(rolling/cumulative windows need a complete date spine to be correct).
|
||||
|
|
@ -1,73 +0,0 @@
|
|||
# Time-series window craft — running totals, rolling-N (min-periods), period-over-period
|
||||
|
||||
## Problem
|
||||
|
||||
A large share of analytics questions are time-series shaped: a **running/cumulative
|
||||
balance**, a **rolling N-day average**, or **period-over-period growth**. The agent
|
||||
knows window functions exist (spec 07 covers determinism and window-then-filter) but
|
||||
gets the *time-series specifics* wrong:
|
||||
|
||||
- cumulative balance computed without an unbounded preceding frame (or with the
|
||||
frame defaulting incorrectly when there are ties on the order key);
|
||||
- "rolling 30-day" implemented as `ROWS BETWEEN 29 PRECEDING AND CURRENT ROW` over **gappy** daily
|
||||
data, so the window spans the wrong calendar span when days are missing;
|
||||
- no **minimum-periods** handling — a rolling average is reported before the window
|
||||
is actually full;
|
||||
- "growth vs previous period" without `LAG`, or comparing to the wrong neighbor.
|
||||
|
||||
These are runnable-but-wrong; the structure is close, the edge case diverges.
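
The gappy-data case can be reproduced in a few lines. The sketch below assumes SQLite with window-function support and an invented `daily_revenue(day, amount)` table with made-up values; a 3-day window stands in for the 30-day one to keep the data small. It compares a row-count frame with a window keyed on the calendar date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL);
-- Days 4 through 9 are missing: the data is gappy.
INSERT INTO daily_revenue VALUES
  ('2023-01-01', 10), ('2023-01-02', 20), ('2023-01-03', 30), ('2023-01-10', 40);
""")

# Row-count frame: "the last 3 rows", whatever dates they carry.
ROWS_SQL = """
SELECT day,
       SUM(amount) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sum_3
FROM daily_revenue
ORDER BY day;
"""

# Calendar window: "the last 3 calendar days", via a date-keyed correlated subquery.
CALENDAR_SQL = """
SELECT d.day,
       (SELECT SUM(x.amount)
        FROM daily_revenue x
        WHERE x.day BETWEEN date(d.day, '-2 days') AND d.day) AS sum_3
FROM daily_revenue d
ORDER BY d.day;
"""

by_rows = dict(conn.execute(ROWS_SQL).fetchall())
by_calendar = dict(conn.execute(CALENDAR_SQL).fetchall())

for day in by_rows:
    print(day, "rows frame:", by_rows[day], "calendar window:", by_calendar[day])
```

On the last date the row frame reaches back across the gap and pulls in revenue from a week earlier, while the calendar window sees only that day's amount. Both queries run without error, which is what makes the mistake easy to miss.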
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
- "Each account's month-end running balance over 2023" — cumulative sum of monthly
|
||||
net over an ordered window.
|
||||
- "30-day rolling average of daily revenue, only once 30 days of history exist."
|
||||
- "Month-over-month revenue growth rate."
|
||||
|
||||
All three are bread-and-butter for any analyst on any time-series table.
|
||||
|
||||
## Requirements
|
||||
|
||||
Additive to the `ktx-analytics` skill's `<sql_craft>` "Window functions" group
|
||||
(inline, dialect-agnostic, heuristic + why).
|
||||
|
||||
1. **Cumulative / running total.** `SUM(x) OVER (PARTITION BY k ORDER BY t ROWS
|
||||
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`, with a complete tie-breaker in
|
||||
`ORDER BY` (spec 07 rule). *Why:* the default frame with a non-unique `ORDER BY`
|
||||
can include/exclude peers unexpectedly.
|
||||
|
||||
2. **Rolling window over time, not over rows.** When "rolling N days/months" is
|
||||
asked, the window must span a calendar range. Over gappy data, either build a
|
||||
complete date spine first (see spec 10) so `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` equals
|
||||
the intended span, or use a range/self-join keyed on the date. *Why:* row-count
|
||||
frames over missing dates silently measure the wrong span.
|
||||
|
||||
3. **Minimum periods.** When the question says "only after N periods of data" (or
|
||||
it is implied by a rolling metric), emit NULL/skip until the window is full
|
||||
(e.g. guard on `COUNT(*) OVER (...) = N`). *Why:* a partial early window is not
|
||||
the requested metric.
|
||||
|
||||
4. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)`
|
||||
for prior-period comparisons; growth rate = `(cur - prev) / prev` computed at
|
||||
full precision (round only at the end). Guard divide-by-zero/NULL prev.
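
The four recipes can be combined in one compact query. This is a sketch under stated assumptions: SQLite 3.25 or later (for window functions), an invented `daily_revenue(day, amount)` table with made-up values, a 3-day window in place of a longer one, and a missing day treated as zero revenue. The CTE names `spine`, `filled`, and `windowed` are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL);
-- 2023-01-04 is missing on purpose.
INSERT INTO daily_revenue VALUES
  ('2023-01-01', 10), ('2023-01-02', 20), ('2023-01-03', 30), ('2023-01-05', 50);
""")

SQL = """
WITH RECURSIVE spine(day) AS (              -- complete date spine
  SELECT '2023-01-01'
  UNION ALL
  SELECT date(day, '+1 day') FROM spine WHERE day < '2023-01-05'
),
filled AS (                                 -- one row per calendar day
  SELECT s.day, COALESCE(r.amount, 0) AS amount
  FROM spine s LEFT JOIN daily_revenue r ON r.day = s.day
),
windowed AS (
  SELECT day, amount,
         SUM(amount) OVER (ORDER BY day
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
         AVG(amount) OVER w3 AS avg_3d,
         COUNT(*)    OVER w3 AS days_in_window,
         LAG(amount) OVER (ORDER BY day) AS prev_amount
  FROM filled
  WINDOW w3 AS (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
)
SELECT day,
       running_total,
       CASE WHEN days_in_window = 3 THEN avg_3d END AS rolling_avg_3d,  -- min periods
       CASE WHEN prev_amount IS NULL OR prev_amount = 0 THEN NULL       -- guard the divide
            ELSE (amount - prev_amount) / prev_amount END AS growth
FROM windowed
ORDER BY day;
"""

rows = conn.execute(SQL).fetchall()
for row in rows:
    print(row)
```

Because the spine makes `day` unique, the `ORDER BY day` here is already a complete ordering. On a table with several rows per timestamp, the `ORDER BY` would need an extra tie-breaker column, as the cumulative rule above requires.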
|
||||
|
||||
## Leak-safety (hard constraint)
|
||||
|
||||
Worked examples must use a **synthetic generic schema** (e.g. `daily_revenue(day,
|
||||
amount)` or `account_txns(account_id, txn_date, net)`) and show only the *pattern*.
|
||||
No benchmark table names, SQL, or result values.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- `<sql_craft>` "Window functions" gains the cumulative, rolling-over-time +
|
||||
min-periods, and period-over-period recipes — inline, dialect-agnostic.
|
||||
- At most one or two compact generic examples; no benchmark-derived content.
|
||||
- Skill-content only; analytics-skill content tests updated.
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
Running-balance / rolling / period-over-period questions are the single largest
|
||||
result-mismatch cluster in the SQLite subset (financial-transactions style DBs).
|
||||
The methodology is universal analyst craft, so it belongs in the product's skill
|
||||
(transfers to real users), not in a benchmark-specific prompt. Depends on spec 10
|
||||
(date spine) for the gappy-rolling case.
|
||||
|
|
@ -1,61 +0,0 @@
|
|||
# Parse text-encoded numeric columns before doing math on them
|
||||
|
||||
## Problem
|
||||
|
||||
Numeric measures are often stored as **text** with human formatting: unit suffixes
|
||||
(`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators (`"$1,200"`),
|
||||
percent signs (`"12%"`), or non-numeric sentinels for missing/zero (`"-"`, `"N/A"`,
|
||||
`""`). Aggregating or comparing such a column directly is silently wrong: string
|
||||
comparison orders `"100" < "9"`, and a naive `CAST(x AS REAL)` yields `0`, NULL, a
truncated prefix (in SQLite, `"1.2K"` casts to `1.2`), or an error, depending on the
dialect, rather than the intended number.
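
The failure modes are easy to confirm directly. The snippet below is SQLite-specific (other engines may raise an error or return NULL instead) and uses made-up literal values; the helper name `one` is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def one(sql):
    """Return the single value produced by a one-row, one-column query."""
    return conn.execute(sql).fetchone()[0]


# Text comparison is lexical: '1' sorts before '9', so '100' < '9' is true.
lexical_less = one("SELECT '100' < '9'")

# CAST keeps only the longest numeric prefix; anything else becomes 0.0.
cast_currency = one("SELECT CAST('$1,200' AS REAL)")  # no numeric prefix
cast_suffix = one("SELECT CAST('1.2K' AS REAL)")      # prefix '1.2', suffix dropped
cast_sentinel = one("SELECT CAST('N/A' AS REAL)")     # no numeric prefix

print(lexical_less, cast_currency, cast_suffix, cast_sentinel)
```

None of these statements fails, so a query built on them runs cleanly and returns a plausible-looking but wrong number.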
|
||||
|
||||
The agent already samples schemas (spec 07 schema-discovery), but when it sees a
|
||||
"numeric" column it tends to assume it is a real number type and skips the parse —
|
||||
so the arithmetic runs on garbage. Runnable, plausible, wrong.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become `1200 / 3000000
|
||||
/ 0` before you can sum it or compute a daily change. A `price` stored as
|
||||
`"$1,299.00"` must become `1299.00` before averaging. This is routine data hygiene
|
||||
on real, messy production tables.
|
||||
|
||||
## Requirements
|
||||
|
||||
Extend the `ktx-analytics` skill's `<sql_craft>` "Schema discovery before writing
|
||||
SQL" group (inline, dialect-agnostic, heuristic + why).
|
||||
|
||||
1. **Detect text-encoded numerics during sampling.** When a column that the
|
||||
question treats as a number is stored as text, sample distinct values to learn
|
||||
the encodings actually present (suffixes, symbols, separators, sentinels) before
|
||||
composing — never assume the format from the column name.
|
||||
|
||||
2. **Parse and scale before arithmetic.** Strip currency/separator/percent
|
||||
characters; multiply by the suffix scale (K=10^3, M=10^6, B=10^9); map sentinels
|
||||
(`-`, `N/A`, empty) to `0` or `NULL` per the question's intent; then cast to a
|
||||
numeric type. Do this in an early CTE so all downstream math sees clean numbers.
|
||||
*Why:* string columns compared/aggregated as-is sort lexically and cast to 0,
|
||||
producing silently wrong results instead of errors.
|
||||
|
||||
3. **Confirm coverage.** After parsing, sanity-check that no intended-numeric value
|
||||
failed to parse (would surface as NULL), to catch an encoding the sample missed.
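
The detect, parse/scale, and verify steps above can be sketched as one early CTE chain. Assumptions, all made up for illustration: SQLite string functions, an invented `metrics(label, value_text)` table, non-negative values only, and sentinels mapped to `0` (a question could equally call for `NULL`). The CTE names `cleaned` and `split` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (label TEXT PRIMARY KEY, value_text TEXT);
INSERT INTO metrics VALUES
  ('a', '1.2K'), ('b', '3M'), ('c', '-'), ('d', '$1,299.00'), ('e', '7 units');
""")

PARSE_SQL = """
WITH cleaned AS (                       -- strip currency, separators, percent signs
  SELECT label, value_text,
         REPLACE(REPLACE(REPLACE(TRIM(value_text), '$', ''), ',', ''), '%', '') AS s
  FROM metrics
),
split AS (                              -- separate the numeric body from the scale suffix
  SELECT label, value_text, s,
         CASE WHEN UPPER(SUBSTR(s, -1)) IN ('K', 'M', 'B')
              THEN SUBSTR(s, 1, LENGTH(s) - 1) ELSE s END AS body,
         CASE UPPER(SUBSTR(s, -1))
              WHEN 'K' THEN 1000.0
              WHEN 'M' THEN 1000000.0
              WHEN 'B' THEN 1000000000.0
              ELSE 1.0 END AS scale
  FROM cleaned
)
SELECT label, value_text,
       CASE
         WHEN s IN ('', '-', 'N/A') THEN 0.0                   -- sentinel: assumed to mean zero
         WHEN body = '' OR body GLOB '*[^0-9.]*' THEN NULL     -- unrecognized encoding
         ELSE CAST(body AS REAL) * scale
       END AS value_num
FROM split
ORDER BY label;
"""

parsed = conn.execute(PARSE_SQL).fetchall()
for row in parsed:
    print(row)

# Coverage check: any NULL here is an encoding the parser did not anticipate.
unparsed = [label for label, _text, num in parsed if num is None]
print("unparsed labels:", unparsed)
```

Returning `NULL` for an unrecognized encoding, rather than letting `CAST` quietly produce a number, is what makes the coverage check possible: the leftover `'7 units'` row shows up in `unparsed` instead of being summed as `7`.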
|
||||
|
||||
## Leak-safety (hard constraint)
|
||||
|
||||
Worked examples must use a **synthetic generic schema** and made-up values (e.g. a
|
||||
`metrics(label, value_text)` table with `"1.2K"`, `"-"`). No benchmark table names,
|
||||
SQL, or result values; the parsing pattern is universal and tied to no instance.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- `<sql_craft>` schema-discovery gains the detect → parse/scale → verify guidance —
|
||||
inline, dialect-agnostic, with at most one short generic example.
|
||||
- No benchmark-derived content. Skill-content only; content tests updated.
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
At least one SQLite-subset question stores trading volume as suffix-encoded text
|
||||
("K"/"M", "-" for zero) and fails because the agent aggregates the raw strings. The
|
||||
fix — parse messy encodings before math — is universal data hygiene that helps any
|
||||
analyst, so it belongs in the product's craft rather than a benchmark-specific
|
||||
prompt.
|
||||
|
|
@ -1,105 +0,0 @@
|
|||
# Enforce answer-output completeness with a final pre-emit check in the analytics skill
|
||||
|
||||
## Problem
|
||||
|
||||
The single largest correctness failure mode is **incomplete output**: the query runs and the
|
||||
methodology is roughly right, but the result is missing columns the question asked for. Three
|
||||
recurring sub-patterns:
|
||||
|
||||
1. **Multi-part questions answered partially.** A question that asks for several things ("report
|
||||
the highest *and* the lowest month, each with its count and average, *and* the difference")
|
||||
comes back with only the first part — one column instead of the several requested.
|
||||
2. **Identity dropped.** Grouping by a human-readable name but not projecting the entity's
|
||||
identifier (e.g. a product name without its product id, a customer name without its
|
||||
customer id).
|
||||
3. **Inputs to a derived value dropped.** Returning a ratio / percentage / difference but not
|
||||
the underlying counts the question also asked for.
|
||||
|
||||
Sub-patterns 2 and 3 are **already covered by `<sql_craft>` rules** in the analytics skill
|
||||
(spec 07: *"expose identity, not just the label"* and *"keep the inputs to a derived value"*),
|
||||
yet they are frequently **not applied**. So the gap is not missing knowledge — it is that these
|
||||
rules are passive heuristics buried in a list, and the agent doesn't reliably check them before
|
||||
finalizing. The fix is to (a) add the missing multi-part-completeness rule and (b) turn
|
||||
output-completeness into an **explicit final verification step** the agent performs before
|
||||
emitting SQL.
|
||||
|
||||
This is reinforced by evidence that the failure is **model-independent**: a markedly stronger
|
||||
model produced the same incomplete-output mistakes on these questions, which means it is a
|
||||
craft/enforcement gap, not a capability gap.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
An analyst is asked: *"For each region, report the highest and the lowest monthly order count,
|
||||
and the difference between them."* A complete, useful answer has a column for the region's id
|
||||
and name, the highest count, the lowest count, and the difference — five columns. Returning just
|
||||
the region and a single number answers only part of the request. This is a universal expectation
|
||||
on any database: answer **every** part of a multi-part request, identify the entities, and show
|
||||
the inputs behind any derived figure.
|
||||
|
||||
## Requirements
|
||||
|
||||
Additive to the analytics skill's `<sql_craft>` "Answer completeness / interpretation" group and
|
||||
its workflow's validate step (inline, dialect-agnostic, heuristic + why, consistent with spec 07).
|
||||
|
||||
1. **Multi-part / multi-output completeness (new rule).** When a question requests several
|
||||
outputs — a list ("A, B, and C"), paired extremes ("the highest *and* the lowest"), or a
|
||||
value plus its components ("X, Y, and their ratio") — the final projection must contain a
|
||||
column for **each** requested output. *Why:* answering only the first clause is the most common
|
||||
way a runnable query is still wrong; the grain and methodology can be perfect yet the answer
|
||||
is short by columns.
|
||||
|
||||
2. **Fold the existing identity / inputs rules into the same completeness notion.** The
|
||||
already-shipped rules — project the entity **identifier** alongside any human-readable label,
|
||||
and **keep the inputs** to any derived value — are part of output completeness; reference them
|
||||
from the check below so they are actually applied, not just listed.
|
||||
|
||||
3. **Add an explicit final completeness check (the enforcement mechanism).** Before emitting the
|
||||
final SQL, the skill should have the agent **re-read the question and confirm the projection
|
||||
covers**: every named metric/attribute; the identifier of every grouped/named entity; every
|
||||
input to a derived value; all at the grain the question specifies. This is a short, concrete
|
||||
checkpoint at the validate step — the point is to convert the passive heuristics into an active
|
||||
pre-finalize verification. (Do **not** add unrequested/extra columns to be "safe" — that is
|
||||
grader-gaming; the check is about matching the request exactly, not padding it.)
|
||||
|
||||
Generic teaching example (synthetic schema — see Leak-safety):
|
||||
```sql
-- "For each region, report the highest and lowest monthly order count and their difference."
-- WRONG: answers only the first clause; no region id, no lowest, no difference.
SELECT region_name, MAX(monthly_orders) AS highest
FROM region_monthly GROUP BY region_name;

-- RIGHT: one column per requested output + the entity's identity, at the region grain.
SELECT r.region_id, r.region_name,
       MAX(m.monthly_orders) AS highest_monthly_orders,
       MIN(m.monthly_orders) AS lowest_monthly_orders,
       MAX(m.monthly_orders) - MIN(m.monthly_orders) AS difference
FROM regions r
JOIN region_monthly m ON m.region_id = r.region_id
GROUP BY r.region_id, r.region_name;
```
|
||||
|
||||
## Leak-safety (hard constraint)
|
||||
|
||||
The example must use an **invented, generic schema** (`regions`, `region_monthly`) and made-up
|
||||
columns — **no benchmark table names, SQL, or result values.** It teaches the *pattern* (cover
|
||||
every requested output + identity + inputs), which is universal and tied to no specific instance.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- The skill states the multi-part-completeness rule and a concrete **final completeness check**
|
||||
(re-read question → verify metrics + identity + inputs + grain), inline and dialect-agnostic,
|
||||
cross-referencing the existing identity/inputs rules so they're enforced.
|
||||
- Includes the over-projection guard (don't pad with extra columns — that's grader-gaming).
|
||||
- One short generic example (wrong vs complete); no benchmark-derived content.
|
||||
- Skill-content only; analytics-skill content tests updated to cover the new rule + check.
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
In the latest SQLite-subset run, **incomplete output was the single largest failure bucket
|
||||
(~13 of 51 voted failures)**: multi-part questions answered partially, and identity / derived-value
|
||||
inputs dropped — the latter two being spec-07 rules that already exist but weren't applied. A
|
||||
probe with a much stronger model reproduced the *same* incomplete-output failures, confirming this
|
||||
is a craft-enforcement gap rather than a model-capability one. The fix — answer every requested
|
||||
part, identify entities, keep inputs — is universal analyst craft, so it belongs in the product
|
||||
skill (and transfers to real users), enforced as a final check rather than left as a passive hint.
|
||||
```
|
||||
|
|
@ -1,116 +0,0 @@
|
|||
# Structured, leveled logging for the ktx MCP server
|
||||
|
||||
> **Scope: observability only.** This spec is about *seeing* what the MCP server
|
||||
> does (which tool, what params, when, how long, outcome). *Preventing* a runaway
|
||||
> query from blocking the server (off-event-loop / interruptible query execution)
|
||||
> is a separate concern — see "Non-goals" and the sibling spec note below.
|
||||
|
||||
## Problem
|
||||
|
||||
The ktx MCP server (`packages/cli/src/mcp-http-server.ts` +
|
||||
`mcp-server-factory.ts`; raw `node:http` + `@modelcontextprotocol/sdk`
|
||||
`StreamableHTTPServerTransport`) emits almost no operational logs. There is no
|
||||
server-side record of **which MCP tool was called, with what parameters, when,
|
||||
how long it took, or whether it succeeded** — nor of session open/close or
|
||||
transport errors. When a tool call is slow, hangs, or a client connection drops
|
||||
("Transport channel closed"), an operator has no trail to diagnose it and must
|
||||
resort to process sampling / `lsof` / guesswork — and the offending input
|
||||
(e.g. the exact SQL) is typically unrecoverable.
|
||||
|
||||
## Generic use case
|
||||
|
||||
Anyone running a long-lived ktx MCP server — a developer's local instance, a
|
||||
shared team server, or a hosted deployment — needs observability into tool-call
|
||||
activity to:
|
||||
- diagnose slow or hung tool calls (which `sql_execution` ran, against which
|
||||
connection, with what SQL, for how long);
|
||||
- explain client-visible connection failures from the server side (session
|
||||
lifecycle, transport-closed events);
|
||||
- audit what agents asked the server to do;
|
||||
- spot patterns (hot tools, slow connections, error rates).
|
||||
|
||||
This is standard production-server hygiene; the server currently provides none.
|
||||
|
||||
## Requirements (sketch — refine when picked up)
|
||||
|
||||
1. **One structured (JSON) logger, low overhead.** Suggested `pino` (orientation
|
||||
only; implementer owns the choice). A single shared instance; write **JSON to
|
||||
stdout** (12-factor — the launcher/aggregator routes it). No in-app file
|
||||
rotation. Optional human-readable pretty output only when attached to a TTY
|
||||
(dev).
|
||||
2. **Configurable level via env** (e.g. `KTX_LOG_LEVEL`, default `info`; `debug`
|
||||
for diagnosis) — verbose logging on demand without code changes.
|
||||
3. **Per-session / per-call context** via child loggers: every line carries a
|
||||
`sessionId` (from the transport session) and, for tool calls, a `callId` +
|
||||
`tool` name, so one session's or call's activity can be traced/grepped.
|
||||
4. **Tool-call logging — START logged BEFORE execution, COMPLETION after.** For
|
||||
every MCP tool invocation:
|
||||
- on entry: log `{ tool, params, sessionId, callId }` **before** running the
|
||||
handler (so the record exists even if the handler never returns);
|
||||
- on exit: log `durationMs` + outcome (ok with result size, or error with
|
||||
stack).
|
||||
This makes a **hung / never-returning call identifiable**: a start with no
|
||||
matching completion is the culprit, with its exact parameters and timestamp.
|
||||
This matters specifically because handlers like `sql_execution` run a
|
||||
*synchronous* better-sqlite3 query — a runaway query blocks the process and no
|
||||
completion is ever logged, so the start line (flushed before the blocking
|
||||
call) is the only record. For `sql_execution`, `params` should include the SQL
|
||||
text (the most useful field). Emit a **WARN** when a *completed* call exceeds a
|
||||
configurable slow threshold (e.g. `KTX_SLOW_TOOL_MS`).
|
||||
5. **Connection / session lifecycle:** log session open/close (with `sessionId`)
|
||||
and transport errors (the SDK's closed-channel / "Transport channel closed"
|
||||
events) so client-side connection failures have a server-side counterpart.
|
||||
6. **Error logging** with structured stack traces (a standard error serializer),
|
||||
not bare strings.
|
||||
7. **Light redaction — credentials only** (bearer token, connection
|
||||
passwords/secrets). SQL text and tool params are *not* secrets and must be
|
||||
logged. Do not over-redact.
|
||||
8. **Synchronous logging is fine.** The server uses a synchronous DB client, so
|
||||
logging need not be async; prefer the simpler synchronous stdout path over
|
||||
async/worker transports (which can lose buffered lines on a hard crash). Do
|
||||
not introduce async-logging machinery.
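
The start-before-execute requirement is the least obvious one, so here is a language-neutral sketch of it. It is written in Python only to stay runnable here; the real server is Node/TypeScript and the spec merely suggests `pino`. The function names `make_logger` and `call_tool`, the `tool.slow` event name, and the set of credential field names are all hypothetical. The field names `tool`, `params`, `sessionId`, `callId`, and `durationMs` follow the list above.

```python
import io
import json
import time
import uuid

LEVELS = {"debug": 10, "info": 20, "warn": 30, "error": 40}
SECRET_KEYS = {"password", "token", "authorization"}  # hypothetical credential fields


def redact(params):
    """Mask credential-like fields only; SQL text and other params pass through."""
    return {k: ("[redacted]" if k.lower() in SECRET_KEYS else v) for k, v in params.items()}


def make_logger(stream, level="info", **context):
    """Return log(level, event, **fields), which writes one JSON line per call."""
    threshold = LEVELS[level]

    def log(lvl, event, **fields):
        if LEVELS[lvl] < threshold:
            return
        record = {"time": time.time(), "level": lvl, "event": event, **context, **fields}
        stream.write(json.dumps(record, default=str) + "\n")
        stream.flush()  # synchronous: the line is out before anything else runs

    return log


def call_tool(stream, level, session_id, tool, params, handler, slow_ms=1000):
    """Run handler(params) with a start line before it and an end line after it."""
    log = make_logger(stream, level, sessionId=session_id, callId=str(uuid.uuid4()), tool=tool)
    log("debug", "tool.start", params=redact(params))  # written BEFORE the handler runs
    started = time.monotonic()
    try:
        result = handler(params)
    except Exception as exc:
        duration_ms = round((time.monotonic() - started) * 1000, 3)
        log("error", "tool.end", durationMs=duration_ms, outcome="error", error=repr(exc))
        raise
    duration_ms = round((time.monotonic() - started) * 1000, 3)
    log("debug", "tool.end", durationMs=duration_ms, outcome="ok", resultSize=len(result))
    if duration_ms > slow_ms:
        log("warn", "tool.slow", durationMs=duration_ms)
    return result


# Usage: a fake handler records what the log already contained when it started.
stream = io.StringIO()
log_seen_by_handler = []


def fake_sql_execution(params):
    log_seen_by_handler.append(stream.getvalue())
    return [(1,)]


call_tool(stream, "debug", "s-1", "sql_execution",
          {"sql": "SELECT 1", "password": "hunter2"}, fake_sql_execution)
lines = [json.loads(line) for line in stream.getvalue().splitlines()]
print(stream.getvalue())
```

If the handler never returned, the stream would hold only the `tool.start` line, with the SQL text and a timestamp, which is the property the requirement depends on.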
|
||||
|
||||
## Acceptance criteria (sketch)
|
||||
|
||||
- With `KTX_LOG_LEVEL=debug`, invoking any MCP tool produces a `tool.start`
|
||||
(tool, params, sessionId, callId) and a `tool.end` (durationMs, outcome) line
|
||||
on the server's stdout, as JSON.
|
||||
- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a
|
||||
`tool.start` line carrying its **exact SQL and timestamp** and **no**
|
||||
`tool.end` — so the offending query is recoverable from the log alone, with no
|
||||
process sampling.
|
||||
- A completed tool call slower than the configured threshold emits a WARN with
|
||||
its duration.
|
||||
- Session open/close and transport-closed events are logged with the `sessionId`.
|
||||
- At default level (`info`), routine per-tool lines are suppressed but lifecycle,
|
||||
slow-call warnings, and errors are present.
|
||||
- Credentials (bearer token, connection secrets) never appear in logs; SQL and
|
||||
tool params do.
|
||||
- No new heavy dependencies beyond the logger; no OpenTelemetry/metrics stack; no
|
||||
async-transport machinery.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Preventing/interrupting runaway queries** (off-event-loop execution, query
|
||||
timeouts, worker-thread isolation). That is a *separate* spec; a single
|
||||
synchronous query that fans out into a massive nested-loop join can peg the
|
||||
single-threaded server for hours and break new connections — observability
|
||||
surfaces *which* query, but the fix is execution-model work. (This logging is
|
||||
also a prerequisite for a future watchdog that detects a `tool.start` with no
|
||||
`tool.end` past a threshold and recycles the server.)
|
||||
- Metrics/tracing/OpenTelemetry exporters.
|
||||
- Forwarding logs to the MCP *client* via the protocol's logging capability
|
||||
(`notifications/message`, `logging/setLevel`) — a possible later enhancement,
|
||||
distinct from operational stdout logging.
|
||||
|
||||
## Benchmark context (motivation, not a requirement)
|
||||
|
||||
Running Spider 2.0-Lite against the MCP server at concurrency, an
|
||||
adversarial-reviewer-generated query degenerated into a massive nested-loop join;
|
||||
synchronous better-sqlite3 executed it on the event loop, pegging a server at
|
||||
~100% CPU for hours and breaking new MCP connections to it ("Transport channel
|
||||
closed"). We could not determine *which* query, because the server logs nothing
|
||||
about tool calls — diagnosis required `sample`/`lsof` on the live process and the
|
||||
exact SQL was never recovered. Structured tool-call logging (especially
|
||||
start-before-execute) would have turned this into a one-line `grep` of the server
|
||||
log.
|
||||
|
|
@ -1,131 +0,0 @@
|
|||
# Bounded query execution (deadline + non-blocking) for read SQL
|
||||
|
||||
> Priority: HIGH. Found empirically during a Spider2-lite sqlite run
|
||||
> (2026-06-18): a single `sql_execution` MCP call wedged a worker at 100% CPU
|
||||
> for 13+ minutes and never returned. The query
|
||||
> `SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on the
|
||||
> `complex_oracle` sqlite database hit a VIEW (`costs ⋈ sales`, 918,843 × 82,112
|
||||
> rows, joined on a 4-column key with no composite index) whose plan degraded to
|
||||
> an O(N×M) nested-loop scan. Because the sqlite connector runs
|
||||
> better-sqlite3's `.all()` **synchronously with no timeout**, it blocked the MCP
|
||||
> worker's entire event loop: no `tool.end` was ever logged, the port went
|
||||
> unresponsive, and the query could not be cancelled. One of four eval shards
|
||||
> stalled until the worker was killed by hand.
|
||||
|
||||
## Problem

Two compounding gaps on the read-query path:

1. **No execution deadline.** A single expensive query runs unbounded. This is
   handled divergently per connector, with no shared contract: BigQuery has a
   real server-side job timeout (`job_timeout_ms`); ClickHouse has an HTTP
   `request_timeout`; Snowflake, Postgres, MySQL, and SQL Server bound only
   connection/pool *acquisition*, not statement *execution*; SQLite has nothing.
   So whether a runaway query is bounded depends entirely on which driver the
   caller happened to hit.

2. **In-process engines block the event loop and can't be cancelled.** The
   sqlite connector executes on the main thread via synchronous
   better-sqlite3 `.all()`. A slow query freezes the whole MCP server (it can't
   serve other requests, send progress, or write `tool.end`), and there is no
   way to interrupt it: better-sqlite3 exposes no interrupt/cancel API — its
   documented mechanism for slow queries is to run them in a **worker thread**,
   and the only way to stop a runaway synchronous query is to terminate the
   thread executing it.

The net effect is a query that produces a `tool.start` with no matching
`tool.end`, an unresponsive server, and no self-recovery. A row cap (`maxRows`)
does not help — it bounds returned rows, not scan work, and the failing query
returned a single aggregate row.

## Generic use case

Any data agent that lets an LLM author SQL will eventually issue an
accidentally-expensive query — an unindexed or cartesian join, an expensive
VIEW, a wide aggregate over a large fact table. A general-purpose context layer
must bound that and return a clean, fast "query exceeded Ns" error so the agent
can revise (add filters, query base tables, narrow the range) instead of hanging
the tool and the server. This matters for embedded/local warehouses (sqlite,
duckdb) and remote ones alike, and is wholly independent of any benchmark.

## Requirements

1. Every read-query execution path (`executeReadOnly`) enforces a single
   canonical execution deadline. One opinionated default; **not** a per-call
   user flag. Where a driver already supports a per-connection timeout
   (BigQuery `job_timeout_ms`), reuse that as the per-connection override rather
   than inventing a parallel knob.
2. On exceeding the deadline the path resolves with a `KtxQueryError`
   ("query exceeded {N}s") — a finite, decision-reaching outcome, never an
   unbounded hang.
3. The deadline is a **shared contract at the connector boundary**, defined once
   (on the `executeReadOnly` contract or a shared wrapper at the call site) so
   all drivers participate. Bring the existing divergent timeouts (BigQuery job
   timeout, ClickHouse request timeout) under this one contract instead of
   leaving parallel mechanisms.
4. For in-process engines (sqlite today, any future embedded driver), execution
   MUST NOT block the MCP server event loop. Run the query off the main thread
   and enforce the deadline by terminating that thread on timeout (the
   better-sqlite3-documented approach, since synchronous queries are
   uncancellable in-thread). The event loop must stay responsive so `tool.end`
   is always written and concurrent requests on the same port are served.
5. Prefer real cancellation over client-side give-up. Where the engine supports
   a server-side statement timeout (Postgres `statement_timeout`, MySQL
   `max_execution_time`, Snowflake `STATEMENT_TIMEOUT_IN_SECONDS`, ClickHouse
   `max_execution_time`, BigQuery job timeout, SQL Server request timeout), set
   it so the deadline actually stops work, not merely abandons the promise while
   the query keeps running. For in-process engines, thread termination is the
   cancellation.
6. The MCP `sql_execution` tool surfaces the timeout as an expected error
   (classified as `KtxQueryError`, not a `$exception` fault, consistent with
   existing expected-error classification) and logs a `tool.end` with the error
   outcome.
7. Read-only enforcement (`assertReadOnlySql`) and the `maxRows` row cap remain
   unchanged. The deadline is additive; `maxRows` is not a substitute for it.

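
Requirement 4 can be illustrated with a minimal sketch, assuming Node's `worker_threads` module. The worker body below only simulates a runaway query with a busy loop. `runOffMainThread`, `QueryDeadlineError`, and the `SPIN` marker are hypothetical names for illustration; a real connector would open the database inside the worker and map the timeout to `KtxQueryError`. Whether `terminate()` interrupts a long native call as promptly as it interrupts a JavaScript loop is an assumption that should be verified against the driver.

```typescript
import { Worker } from "node:worker_threads";

// Hypothetical stand-in for ktx's KtxQueryError.
class QueryDeadlineError extends Error {
  constructor(seconds: number) {
    super(`query exceeded ${seconds}s`);
    this.name = "QueryDeadlineError";
  }
}

// Inline worker script. A real connector would open the sqlite database here
// and run the statement; this sketch only simulates the work.
const workerSource = `
  const { parentPort, workerData } = require("node:worker_threads");
  if (workerData.sql === "SPIN") {
    for (;;) {} // simulates a runaway synchronous query
  }
  parentPort.postMessage([{ echoed: workerData.sql }]);
`;

function runOffMainThread(sql: string, deadlineMs: number): Promise<unknown[]> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { sql } });
    let settled = false;
    // Settle the promise at most once and always clear the deadline timer.
    const settle = (fn: () => void) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      fn();
    };
    const timer = setTimeout(() => {
      // Stop the thread itself; the main event loop was never blocked.
      void worker.terminate();
      settle(() => reject(new QueryDeadlineError(deadlineMs / 1000)));
    }, deadlineMs);
    worker.once("message", (rows: unknown[]) => {
      settle(() => resolve(rows));
      void worker.terminate();
    });
    worker.once("error", (err) => settle(() => reject(err)));
    worker.once("exit", (code) =>
      settle(() => reject(new Error(`worker exited with code ${code}`))),
    );
  });
}
```

Because the query runs in its own thread, the main thread remains free to serve concurrent tool calls and to write `tool.end` when the deadline fires.
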
## Acceptance criteria

- A read query that exceeds the deadline returns a `KtxQueryError` within
  roughly the deadline; the MCP worker stays responsive (a concurrent tool call
  on the same server completes while the slow query is still pending) and writes
  a matching `tool.end` with a non-ok outcome.
- sqlite specifically: executing a deliberately pathological query (e.g. an
  expensive VIEW or an unindexed cross join) on a fixture does not block the
  event loop, is terminated at the deadline, and CPU returns to idle afterward
  (the off-main-thread executor is killed, not left spinning).
- No regression: normal fast queries return identical results; read-only
  rejection still works; `maxRows` still bounds returned rows.
- Tests cover the deadline path for at least the in-process driver (sqlite,
  terminate-on-deadline) and one server-side-timeout driver.

## Benchmark context (motivation only)

The Spider 2.0-Lite local set loads several warehouses into sqlite, some with
expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` =
`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112
rows, no composite index, with `promo_id` (the index the optimizer picks) being
95.5% a single value. LLM-authored profiling queries (MIN/MAX/COUNT over such a
view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval
shard for 10+ minutes; with one, the agent gets a fast error and can scope the
query instead.

## Orientation hints (code pointers; may have drifted)

- Shared contract: `packages/cli/src/context/scan/types.ts` —
  `KtxScanConnector.executeReadOnly` (~343), `KtxReadOnlyQueryInput` (~285).
- MCP call site: `packages/cli/src/context/mcp/local-project-ports.ts:70`
  (`connector.executeReadOnly`); tool registration in
  `packages/cli/src/context/mcp/context-tools.ts`.
- In-process sync execution (the acute hang):
  `packages/cli/src/connectors/sqlite/connector.ts:311-313`
  (better-sqlite3 `.prepare().all()`).
- Existing divergent timeouts to unify: `connectors/bigquery/connector.ts`
  (`job_timeout_ms` / `jobTimeoutMs`), `connectors/clickhouse/connector.ts:602`
  (`request_timeout`), `connectors/snowflake/connector.ts:342` (test/pool only),
  `connectors/postgres/connector.ts`, `connectors/mysql/connector.ts`,
  `connectors/sqlserver/connector.ts` (pool/connection only).
- Error class: `packages/cli/src/errors.ts:25` (`KtxQueryError`).
- better-sqlite3 (context7 `/wiselibs/better-sqlite3`, v12.x): no
  interrupt/cancel API; `docs/threads.md` documents the worker-thread pattern
  for slow queries (master owns worker lifecycle and respawns on exit) — extend
  it with terminate-on-deadline to enforce the timeout.
|
||||
|
|
@ -1,68 +0,0 @@
|
|||
# 18 — BigQuery cross-project dataset support (introspect foreign-hosted datasets, bill in own project)

**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/18-…`.

## Problem (generic, real-world)

Analysts routinely query datasets that live in a **different** BigQuery project than the one
they bill jobs to — Google's `bigquery-public-data`, a partner's shared project, an
organization's central data project, etc. To make those connectable in ktx (so `discover_data`,
the semantic layer, dictionary sampling, and `sql_dialect_notes` work), ktx must be able to
**introspect a dataset hosted in a foreign project while running/billing jobs in the
credentials' own project**.

Today it can't. ktx's BigQuery connector derives a single `projectId` from
`credentials.project_id` and uses it for **both** job billing **and** schema introspection:

- `connectors/bigquery/connector.ts:294` — `projectId` is read only from `credentials.project_id`;
  there is no separate billing-vs-dataset project knob.
- `:544` (`introspectDataset`) — calls `this.getClient().dataset(datasetId)`, which resolves the
  dataset **in the client's (billing) project**, and labels every table `catalog: this.resolved.projectId`.
- `:453` (`listTables`) — queries `` `${projectId}`.`region-…`.INFORMATION_SCHEMA.TABLES ``, i.e. the
  **billing** project's INFORMATION_SCHEMA.
- `:163` (`datasetIds()`) — returns `dataset_ids` verbatim; it never parses a `project.` prefix.

So a `dataset_id` naming a dataset in another project can't be introspected, even though querying
it works fine (cross-project reads bill to the caller's project — that path already works).

### Empirical confirmation

With a service account in project `ktx-spider2-lite`:

- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (looks in
  `projects/ktx-spider2-lite/datasets/austin_311`).
- The cross-project form `DatasetReference("bigquery-public-data","austin_311")` → **succeeds**
  (lists the public tables; public metadata is readable by any authenticated principal).
- There is **no config knob** to separate the introspection project from the billing project.

## Requirement

The BigQuery connector must accept **fully-qualified `project.dataset` entries** in `dataset_ids`
(a single connection may span more than one source project), and for each:

- **introspect** via the *dataset's* project — `client.dataset(id, { projectId })` /
  `DatasetReference(project, dataset)`, query the **dataset project's** `INFORMATION_SCHEMA`, and
  label the table `catalog` with the dataset's project;
- **run jobs / bill** in `credentials.project_id` (unchanged).

A bare `dataset` (no `project.`) keeps today's behavior (resolve in `credentials.project_id`), so
existing single-project connections are unaffected.

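
The parsing rule in this requirement can be sketched as follows. This is only an illustration: `parseDatasetEntry` and `DatasetRef` are hypothetical names, not part of the connector, and the sketch assumes dataset IDs are limited to letters, digits, and underscores, so the last dot in an entry separates the hosting project from the dataset.

```typescript
interface DatasetRef {
  projectId: string; // project that hosts the dataset (used for introspection)
  datasetId: string;
}

// Hypothetical helper: split one `dataset_ids` entry into hosting project and
// dataset. A bare entry falls back to the billing project from the credentials.
function parseDatasetEntry(entry: string, billingProjectId: string): DatasetRef {
  const trimmed = entry.trim();
  // Assumption: dataset IDs never contain dots, so split on the last one.
  const lastDot = trimmed.lastIndexOf(".");
  const projectId = lastDot === -1 ? billingProjectId : trimmed.slice(0, lastDot);
  const datasetId = trimmed.slice(lastDot + 1);
  if (!projectId || !/^[A-Za-z0-9_]+$/.test(datasetId)) {
    throw new Error(`invalid dataset_ids entry: "${entry}"`);
  }
  return { projectId, datasetId };
}
```

Jobs would still be created with the billing project; only the `projectId` returned here would be used for dataset lookup, `INFORMATION_SCHEMA` queries, and the `catalog` label.
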
## Acceptance

- `dataset_ids: ['bigquery-public-data.austin_311']` (credentials in a *different* project) →
  `ktx ingest <conn>` introspects the tables, enriches, and samples values; `discover_data` /
  `dictionary_search` return them.
- A connection mixing `['bigquery-public-data.x', 'other-project.y']` introspects both.
- `sql_execution` of a fully-qualified `project.dataset.table` query still runs and bills in
  `credentials.project_id`.
- Single-project `dataset_ids: ['my_dataset']` behaves exactly as before (no regression).

## Benchmark context (motivation only — do not encode benchmark specifics)

Spider 2.0-Lite's **BigQuery slice (205 questions)** is otherwise **unservable faithfully**: every
one of its ~74 logical databases groups datasets hosted in foreign public projects
(`bigquery-public-data`, `isb-cgc-bq`, `data-to-insights`, …), never in a project we own. Query
execution already works cross-project (proven), but ktx-only *discovery* (the whole point of the
faithful surface) is blocked because the connector can't introspect them. Scope is small: of 74
BQ dbs only **1** spans more than one source project, so "let `dataset_ids` carry `project.dataset`
and introspect each in its own project" covers the benchmark and the general case alike. This is
the sole blocker for the BigQuery leaderboard slice (the Snowflake slice needed no connector
change and is already baselined).
|
||||
|
|
@ -1,89 +0,0 @@
|
|||
# 19 — Durable, resumable, bounded relationship detection during ingest enrichment

**Status:** intake draft (todo). Requirement-level; the implementer refines into `specs/19-…`.

## Problem (generic, real-world)

Ingest enrichment runs three stages in a fixed order inside `runLocalScanEnrichment`
(`packages/cli/src/context/scan/local-enrichment.ts`):

1. `descriptions` (`:530`) — per-table LLM descriptions (the expensive step: one model call per
   table; on a large schema this is minutes of paid LLM work).
2. `embeddings` (`:559`) — column embeddings.
3. `relationships` (`:593`) — FK/join discovery: profiles a row sample of **every** table, then
   validates candidate joins.

The queryable semantic-layer artifacts are persisted **once, at the very end**, by
`writeLocalScanEnrichmentArtifacts` in `local-scan.ts:510` — which runs **after**
`runLocalScanEnrichment` returns, i.e. after all three stages.

This creates three failure modes that compound on large schemas (hundreds of tables):

1. **Enrichment is lost if relationship detection is interrupted.** The descriptions + embeddings
   are computed and held in memory, but they only reach the durable, queryable artifacts when the
   final write runs after the `relationships` stage. If the process is killed/crashes/times out
   **during** relationship detection (the last, slowest, silent stage), the artifacts are never
   written — the schema survives (it was written earlier at `local-scan.ts:473`) but **all the
   paid LLM enrichment is discarded**. Empirically: ingesting a 95-table BigQuery dataset produced
   full descriptions + embeddings (progress reached "Building embeddings 17/17"), then the
   relationships stage ran silently past a supervising deadline and was killed — the persisted
   `_schema` had **0** AI descriptions, only the native column comments. Every larger dataset hits
   this, so the most expensive work is the most likely to be thrown away.

2. **Re-running does not resume — it re-spends.** There is a stage state store
   (`SqliteLocalScanEnrichmentStateStore`) and a `runEnrichmentStage` helper (`:413`) that saves
   each completed stage's output. But the completed-stage lookup keys on **`runId`**
   (`findCompletedStage({ runId, stage, inputHash })`, `:427`), and `runId` is fresh per ingest
   invocation. So resume only works *within* a single run; re-running an interrupted ingest gets a
   new `runId`, misses the cache, and **re-computes descriptions + embeddings from scratch**
   (re-paying for the LLM work that already succeeded).

3. **Relationship detection is unobservable and unbounded.** The stage emits no progress between
   "Detecting relationships" and the final "Relationship detection found N accepted" — minutes of
   silence on a large schema. A supervisor watching for liveness cannot distinguish a
   slow-but-working profile from a true hang, and there is no internal time/work budget, so on a
   very large schema it can run far longer than any reasonable deadline.

## Requirements

1. **Checkpoint queryable artifacts before relationship detection.** Persist the descriptions +
   embeddings into the semantic-layer artifacts as soon as the `embeddings` stage completes, before
   the `relationships` stage runs. Relationship detection then appends/merges its own artifact on
   completion. Net: the expensive LLM + embedding enrichment is **always durable and queryable**,
   even if relationship detection fails, is interrupted, or is skipped. (A failed/partial
   relationship stage should degrade to "no/partial joins", never to "no descriptions".)

2. **Make stage resume work across runs.** Resolve a completed stage by stable content identity
   — `(connectionId, stage, inputHash)` — independent of `runId`, so re-running an interrupted
   ingest resumes the finished `descriptions`/`embeddings` stages from cache and only re-runs what
   actually failed (e.g. `relationships`). Re-running after an interruption must not re-spend LLM
   credits on stages that already succeeded.

3. **Make relationship detection observable and bounded** (mirrors spec 16's bounded query
   execution). Emit progress through the existing progress port — e.g. "Profiling table K/N",
   "Validating candidate K/M" — so liveness is visible. Enforce an overall time/work budget
   (configurable, e.g. under `scan.relationships`) so on a very large schema the stage stops
   gracefully and returns the relationships found so far (partial) rather than running unboundedly.
   Partial completion is persisted (per requirement 1) and marked as such.

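
Requirement 2 amounts to dropping `runId` from the completed-stage lookup key. The sketch below illustrates the idea with an in-memory map; `StageCache` and `runStage` are hypothetical stand-ins, not the real `SqliteLocalScanEnrichmentStateStore` or `runEnrichmentStage`, and the synchronous `compute` callback is a simplification.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";

interface StageKey {
  connectionId: string;
  stage: Stage;
  inputHash: string;
}

// Hypothetical in-memory stand-in for the sqlite-backed stage state store.
class StageCache<T> {
  private readonly completed = new Map<string, T>();

  // The key deliberately omits runId, so a later run can find earlier results.
  private static keyOf(k: StageKey): string {
    return JSON.stringify([k.connectionId, k.stage, k.inputHash]);
  }

  saveCompleted(key: StageKey, output: T): void {
    this.completed.set(StageCache.keyOf(key), output);
  }

  findCompleted(key: StageKey): T | undefined {
    return this.completed.get(StageCache.keyOf(key));
  }
}

// Runs a stage only when no completed result exists for the same inputs.
function runStage<T>(cache: StageCache<T>, key: StageKey, compute: () => T): T {
  const cached = cache.findCompleted(key);
  if (cached !== undefined) return cached;
  const output = compute();
  cache.saveCompleted(key, output);
  return output;
}
```

With this keying, a second ingest of the same connection and inputs would reuse the `descriptions` result from the first run instead of paying for it again, while a changed `inputHash` would still force a recompute.
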
## Acceptance

- Interrupting an ingest **during** relationship detection still leaves a queryable semantic layer
  with the table/column descriptions + embeddings that were generated (verified: re-open the
  connection, descriptions are present).
- Re-running an interrupted ingest **does not** regenerate descriptions/embeddings whose stage
  already completed (verified: no LLM description calls for the cached tables; only the failed
  stage re-runs).
- A connection with hundreds of tables emits relationship-stage progress and completes within the
  configured budget, persisting partial relationships if the budget is hit — without discarding
  enrichment.
- Small/single-run ingests behave exactly as before (no regression in artifacts or relationship
  output when nothing is interrupted).

## Benchmark context (motivation only — do not encode benchmark specifics)

The Spider 2.0-Lite BigQuery slice has datasets with hundreds–thousands of tables (`ebi_chembl`
785, `fec` 486, `ga360` 366, …). Enriching them with claude-code costs real, rate-limited LLM
budget; losing that enrichment to a relationship-stage interruption — and re-spending it on every
retry — makes large-schema ingest impractical. This is a general durability/cost property of the
ingest pipeline, independent of the benchmark; the benchmark only made it acute at scale.
|
||||
|
|
@ -1,101 +0,0 @@
|
|||
# 20 — Resilient enrichment under a slow/hung LLM backend

**Status:** draft (intake). Requirement-level; the implementer refines into `specs/20-*.md`.

This is the **enrichment-stage** analog of two already-shipped specs:

- spec 16 (bounded query execution) — bound *and actually cancel* a runaway read query (child-thread/process kill, not a cosmetic JS deadline);
- spec 19 (durable/bounded relationship detection) — checkpoint expensive ingest work so an interruption doesn't lose it.

Spec 16 hardened the **read-query** path and spec 19 checkpointed at **stage boundaries**. The same two
weaknesses still exist *inside the descriptions enrichment stage*, and together they turned a single hung
table into an indefinite wedge plus total loss of an entire stage's LLM work.

## Problem / requirement

Two compounding gaps on the per-table description-enrichment path, observed end-to-end:

### 1. The per-table LLM timeout does not actually terminate the work

The per-table `generateObject` enrichment call is wrapped in `retryAsync` with a fresh
`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (ktx commit `01f63380`). When the LLM
backend is a **subprocess** (the `codex` backend spawns a child `codex` process; `claude-code` likewise
spawns a child) and that child **hangs with an open connection to the provider** (TCP ESTABLISHED, ~0%
CPU, no bytes flowing), the JS-level `AbortSignal` fires but **does not kill the child process or unblock
the await** — so the call sits *past* its own timeout indefinitely.

Observed (BigQuery ingest, codex backend, 2026-06-23): with `KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min),
two of `covid19_usa`'s widest tables (252 columns) hung; the stage sat at **268/285 for 41+ minutes** —
well past the 30-min per-attempt timeout — with exactly two codex children, each holding 3 ESTABLISHED
connections at ~0% CPU, until killed by hand. The timeout was cosmetic: it never terminated the hung
child. (This is precisely the failure mode spec 16 fixed for SQL — a deadline that fires in JS but cannot
interrupt the underlying work — applied to the enrichment LLM call instead of the query.)

**Requirement:** the per-table enrichment-call timeout must be **enforced**, not advisory — when it fires,
the in-flight work is actually cancelled (subprocess SIGKILL for process-backed providers; request abort
for HTTP-backed ones) and the call returns/throws *promptly* so the stage can proceed (skip the table per
the existing no-retry-on-timeout policy). A hung table must cost at most ~one timeout, never unbounded
wall-clock. Provider-agnostic: it must hold for `codex`, `claude-code`, and HTTP backends alike.

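
For a process-backed provider, an enforced timeout might look like the following sketch, which uses Node's `child_process.spawn` and `kill("SIGKILL")`. The function name `runWithEnforcedTimeout` and the error text are hypothetical; this does not reproduce how the `codex` or `claude-code` backends are actually wired. The key point is that the promise is rejected in the same callback that kills the child, so the caller never waits on a hung process.

```typescript
import { spawn } from "node:child_process";

// Runs a command and resolves with its stdout, or rejects once timeoutMs
// elapses. On timeout the child is killed with SIGKILL and the promise is
// rejected immediately, so a hung child cannot hold the caller.
function runWithEnforcedTimeout(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdout = "";
    child.stdout.on("data", (chunk: Buffer) => {
      stdout += chunk.toString("utf8");
    });
    const timer = setTimeout(() => {
      child.kill("SIGKILL"); // real termination, not an abandoned promise
      reject(new Error(`enrichment call exceeded ${timeoutMs}ms`));
    }, timeoutMs);
    child.once("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.once("close", (code) => {
      clearTimeout(timer);
      // After a timeout these calls are no-ops: the promise is already rejected.
      if (code === 0) resolve(stdout);
      else reject(new Error(`child exited with code ${code}`));
    });
  });
}
```

A caller would treat the timeout rejection as "skip this table" under the existing no-retry-on-timeout policy, so each hung table costs roughly one timeout.
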
### 2. Descriptions are checkpointed only at full-stage completion, so a few bad tables lose all the good ones

Spec 19 persists the descriptions checkpoint **after the descriptions stage completes** (before
relationships). There is no *within-stage* persistence: while the stage runs, every enriched table's
description lives only in memory. So if the stage cannot complete — e.g. 2 tables out of 285 hang (gap #1),
or the process is killed, or it hits the stall watchdog — **all** the already-enriched tables are lost,
even though their (expensive) LLM descriptions were finished.

Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but **0** rows in
`local_scan_enrichment_stages` and **0** `ai:` descriptions on disk; killing the wedged ingest discarded
all 283, forcing a from-scratch re-ingest. The cost of 2 pathological tables was 283 tables' worth of
redone LLM calls.

**Sharper observation (re-ingest with a short, enforced timeout):** even when the stage *does* run to
the end — the 2 hung tables hit a 4-min timeout and were skipped, so 283/285 descriptions were generated
and the ingest reported success (`Scan completed` / `Ingest finished`, embeddings built, exit 0) — the
descriptions were **still persisted as 0** (0 `ai:` on disk, 0 stage rows). So the discard is **not** just
"lost on kill": a stage that completes with *any* skipped/aborted table currently persists **nothing**,
throwing away every successfully-generated description. The skip must be graceful — a skipped table costs
one missing description, not the entire stage's output. (This is the strongest argument for per-table
incremental persistence: the 283 good descriptions should have been durable the moment each was produced.)

**Requirement:** persist enriched descriptions **incrementally** (per-table or per-batch) during the
descriptions stage, so that (a) tables that finished are durable even if the stage never completes, and
(b) a resumed ingest re-does only the *unfinished* tables, not the whole stage. The existing additive-write
design (spec 19 already preserves existing descriptions on re-ingest) is the foundation; this extends the
checkpoint granularity from once-per-stage to incremental.

## Sketch (implementer to refine)

- **Enforced timeout:** route enrichment-call cancellation through real termination — kill the
  codex/claude-code child process on timeout (reuse spec 16's child-kill mechanism), abort the HTTP
  request for network backends. A fired `AbortSignal` must guarantee the await settles within a
  bounded grace period.
- **Sane default + the right tradeoff:** the default per-table timeout should be **moderate** (single-digit
  minutes) with a small retry count, not very large: the cost of a *hang* is the timeout value
  itself, so a long timeout is strictly worse for hangs. (The 30-min value used in the incident was an operator
  override chosen to avoid cutting off slow-but-completing wide tables; with #1 enforced and incremental
  checkpointing, a moderate default + skip is the better operating point.)
- **Incremental persistence:** flush descriptions per-batch (e.g. every N completed tables or on a timer) to
  the same store/format used at stage completion; on resume, treat already-persisted tables as done and only
  enrich the remainder. Keep it idempotent and additive (don't clobber prior descriptions).
- **Interaction with the stall watchdog:** with #1 enforced, no single table can starve progress for longer
  than ~one timeout, so an external stall watchdog stops being the only backstop.

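
The incremental-persistence bullet above can be sketched as a loop that flushes every few completed tables and skips already-persisted ones on resume. Everything here is illustrative: `DescriptionStore`, `InMemoryStore`, and `enrichIncrementally` are hypothetical names, the store is an in-memory stand-in for the real artifacts, and `describe` is synchronous only to keep the sketch short (the real call is an async LLM request).

```typescript
// Hypothetical durable store; a real one would write the enrichment artifacts.
interface DescriptionStore {
  load(): Map<string, string>;
  append(batch: Map<string, string>): void;
}

class InMemoryStore implements DescriptionStore {
  private readonly rows = new Map<string, string>();
  load(): Map<string, string> {
    return new Map(this.rows);
  }
  append(batch: Map<string, string>): void {
    // Additive: never overwrite a description that is already persisted.
    batch.forEach((text, table) => {
      if (!this.rows.has(table)) this.rows.set(table, text);
    });
  }
}

// Describes tables in order, flushing every `batchSize` successes. `describe`
// returns undefined for a table that was skipped (for example after a timeout).
function enrichIncrementally(
  tables: string[],
  store: DescriptionStore,
  describe: (table: string) => string | undefined,
  batchSize = 10,
): { described: number; skipped: string[] } {
  const done = store.load();
  let pending = new Map<string, string>();
  const skipped: string[] = [];
  let described = 0;
  for (const table of tables) {
    if (done.has(table)) continue; // resume: this table is already durable
    const text = describe(table);
    if (text === undefined) {
      skipped.push(table); // a skip costs one description, not the stage
      continue;
    }
    pending.set(table, text);
    described += 1;
    if (pending.size >= batchSize) {
      store.append(pending);
      pending = new Map();
    }
  }
  if (pending.size > 0) store.append(pending);
  return { described, skipped };
}
```

With this shape, an interruption loses at most one unflushed batch, and a skipped table shows up in `skipped` without affecting the descriptions that were already written.
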
## Generic use case (independent of the benchmark)

Anyone ingesting a large or wide schema with an LLM enrichment backend (especially a *subprocess* backend,
which is the common local/desktop setup) will eventually hit a table whose description call hangs — a
provider stall, a rate-limit black-hole, a pathologically large prompt. Without an *enforced* timeout, one
such table wedges the whole ingest indefinitely; without *incremental* persistence, any interruption throws
away all the per-table LLM work already done (the dominant ingest cost). Both fixes make large-schema
enrichment **resilient and resumable** — a few bad tables degrade to a few skipped descriptions, not a
hung process and a from-scratch redo. This is core robustness for a general-purpose ingestion product,
wholly independent of any benchmark.

## Benchmark context (motivation only — not a benchmark-specific rule)

Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment backend). Re-enriching
the giant public datasets, `covid19_usa` wedged at 268/285 for 41+ minutes on 2 hung 252-column tables; the
30-min per-table `AbortSignal` timeout never killed the hung codex children, and because descriptions
checkpoint only at stage completion, the 283 already-enriched tables were unrecoverable — the operator had
to kill, cache-bust, and re-ingest the db from scratch (with a short timeout as a stopgap). The benchmark
just exercised a large/wide multi-dataset ingest at scale; the gap and the fix are generic.
|
||||
|
|
@ -1,91 +0,0 @@
|
|||
# 21 — Selective enrichment stages (`--stages`) + per-stage cache keys

**Status:** draft (intake). Requirement-level; the implementer refines into `specs/21-*.md`.

Follow-on to spec 19 (durable/resumable relationship detection) and spec 20 (resilient enrichment).
Those made enrichment *survivable and resumable*; this makes it *selectively re-runnable* — re-run one
enrichment stage without re-paying for the others.

## Problem / requirement

Enrichment has three stages — **`descriptions`** (per-table LLM text), **`embeddings`**
(sentence-transformers over the schema/descriptions), **`relationships`** (FK/join detection, optionally
LLM-proposed). Today you cannot re-run a *subset* of them, and three facts in the current code make a
targeted re-run impossible without a full, expensive re-enrich:

1. **One coarse cache key gates all three stages.** `context/scan/local-enrichment.ts:611` computes a
   single `inputHash` from `{snapshot, mode, detectRelationships, providerIdentity, relationshipSettings}`,
   and all three stages reuse it (descriptions ~`:641`, embeddings ~`:672`, relationships ~`:728`). So
   changing *any* one stage's inputs invalidates *every* stage's cache. Concretely: flipping
   `scan.relationships.llmProposals`, switching the LLM backend, or upgrading the embeddings model forces
   ktx to re-run the **expensive per-table descriptions** even though they didn't conceptually change.
2. **No CLI surface to select stages.** The enrichment internally already supports a relationships-only
   path (`mode: 'relationships'`, which skips the description/embedding stages — they're gated on
   `mode === 'enriched'`), but `ktx ingest` exposes no flag to invoke it (only `--no-query-history`).
   The capability is built; it's just not reachable.
3. **The per-stage storage already exists** (`local_scan_enrichment_stages` PK `(connection_id, stage,
   input_hash)`) and the **additive write already preserves existing descriptions** on re-ingest — so the
   foundation for "touch one stage, keep the rest" is in place; only the key granularity and the CLI
   surface are missing.

**Requirement:** let an operator re-run a chosen subset of enrichment stages on already-ingested
connection(s), recomputing only those stages and **preserving the others' artifacts untouched** — cheaply,
without re-running unchanged (especially the costly `descriptions`) stages.

## Design decisions (resolved during intake; implementer may refine)

- **CLI flag: `--stages <comma-list>`** (plural). Accepts a comma-separated subset of
  `descriptions,embeddings,relationships`; default = all three (current behaviour). Plural because it takes
  a *set*; `--stages relationships` and `--stages descriptions,embeddings` both read naturally, and the
  plural signals "list expected" (singular `--stage` implies exactly one). **Validate** the names — an
  unknown stage is an error, never silently ignored.
- **Per-stage `inputHash`.** Split the single coarse hash so each stage keys on *only its own* inputs:
  - `descriptions` → `{snapshot, mode, providerIdentity}` (NOT relationship settings, NOT embedding model)
  - `embeddings` → `{snapshot, embeddings model/provider, + the description text it embeds}`
  - `relationships` → `{snapshot, relationshipSettings (incl. llmProposals), providerIdentity}`

  Then flipping `llmProposals` invalidates only `relationships`; swapping the embeddings model invalidates
  only `embeddings`; improving description prompts/LLM invalidates only `descriptions`.
- **Preserve-others semantics.** Stages not named in `--stages` are left exactly as on disk (additive write,
  already the behaviour). A selective run never deletes another stage's artifacts.
- **Downstream-staleness handling.** Stages have a dependency order (`descriptions → embeddings`;
  `relationships` depends only on the schema snapshot). Re-running `descriptions` alone can leave existing
  `embeddings` semantically stale (they embedded the old text). The run must **warn** when a selected
  re-run leaves an unselected downstream stage stale, and the operator can opt to cascade
  (`--stages descriptions,embeddings`). Do not silently leave a stale-but-unflagged downstream.
- **`relationships` uses existing descriptions as context.** When re-running `relationships` only, the
  stage should read the existing enriched schema (incl. on-disk `ai:` descriptions) so `llmProposals` has
  full context — not just raw column names.
- **Scope:** the three enrichment stages for now. Design the stage-name namespace so it can later extend to
  the broader scan phases (schema / query-history / source / memory) and subsume the inconsistent
  `--no-query-history` negative flag, but that unification is out of scope here.

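
The per-stage `inputHash` decision can be sketched as below, assuming SHA-256 over a key-sorted JSON serialization. The `EnrichmentInputs` shape, its field names, and `stageInputHash` are hypothetical and follow the spec text only; this is not the existing `computeKtxScanEnrichmentInputHash`.

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so equal inputs always hash identically.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical input shape; field names follow the spec text, not the code.
interface EnrichmentInputs {
  snapshot: string; // e.g. a digest of the schema snapshot
  mode: string;
  providerIdentity: string; // description LLM backend/model
  embeddingsModel: string;
  descriptionText: string; // the text the embeddings stage embeds
  relationshipSettings: { llmProposals: boolean };
}

type Stage = "descriptions" | "embeddings" | "relationships";

// Each stage hashes only the inputs it actually depends on.
function stageInputHash(stage: Stage, i: EnrichmentInputs): string {
  const own: Record<Stage, unknown> = {
    descriptions: { snapshot: i.snapshot, mode: i.mode, providerIdentity: i.providerIdentity },
    embeddings: {
      snapshot: i.snapshot,
      embeddingsModel: i.embeddingsModel,
      descriptionText: i.descriptionText,
    },
    relationships: {
      snapshot: i.snapshot,
      relationshipSettings: i.relationshipSettings,
      providerIdentity: i.providerIdentity,
    },
  };
  return createHash("sha256")
    .update(canonicalJson({ stage, inputs: own[stage] }))
    .digest("hex");
}
```

Under this split, flipping `llmProposals` changes only the `relationships` hash, so the cached `descriptions` and `embeddings` stages remain valid.
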
|
||||
|
||||
## Sketch (implementer to refine)
|
||||
|
||||
- Add `--stages` to `ktx ingest`; parse+validate into a stage set; thread it to the enrichment entry so it
|
||||
selects which stage blocks run (reuse the existing `mode`/stage gating — `mode: 'relationships'` is the
|
||||
precedent).
|
||||
- Replace the single `computeKtxScanEnrichmentInputHash` call with per-stage hash computation keyed on each
|
||||
stage's own inputs; gate each stage's resume/skip on its own hash.
|
||||
- Ensure selective runs read + preserve the on-disk enriched schema and write additively.
|
||||
- Emit a clear staleness warning when an unselected downstream stage is invalidated by a selected one.
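
A rough sketch of the `--stages` parsing and the downstream-staleness check, under the assumption that the only dependency edge is `descriptions → embeddings`. `parseStages`, `staleDownstream`, and the error wording are hypothetical, not existing ktx functions.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";

const KNOWN_STAGES: Stage[] = ["descriptions", "embeddings", "relationships"];

// Dependency order from the spec: embeddings consume the description text.
const DOWNSTREAM: Record<Stage, Stage[]> = {
  descriptions: ["embeddings"],
  embeddings: [],
  relationships: [],
};

// Parse a comma-separated --stages value into a validated set.
function parseStages(raw: string): Set<Stage> {
  const selected = new Set<Stage>();
  const names = raw.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
  for (const name of names) {
    if (KNOWN_STAGES.indexOf(name as Stage) === -1) {
      throw new Error(`Unknown stage "${name}". Valid stages: ${KNOWN_STAGES.join(", ")}`);
    }
    selected.add(name as Stage);
  }
  if (selected.size === 0) throw new Error("--stages needs at least one stage");
  return selected;
}

// Unselected stages that a selected upstream stage leaves stale.
function staleDownstream(selected: Set<Stage>): Stage[] {
  const stale = new Set<Stage>();
  Array.from(selected).forEach((stage) => {
    DOWNSTREAM[stage].forEach((dep) => {
      if (!selected.has(dep)) stale.add(dep);
    });
  });
  return Array.from(stale);
}
```

A caller would print a warning for each entry returned by `staleDownstream` and suggest the cascading form (`--stages descriptions,embeddings`).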
|
||||
|
||||
## Generic use case (independent of the benchmark)
|
||||
|
||||
Any team running ktx in production maintains its semantic layer over time: they improve description prompts
|
||||
or switch the description LLM, upgrade the embeddings model, or turn on LLM-proposed joins. Today each of
|
||||
those forces a **full re-enrich of every connection** — re-running the expensive per-table descriptions
|
||||
even when only embeddings or relationships changed. Selective `--stages` re-runs make these routine
|
||||
maintenance operations cheap and targeted: "re-embed everything on the new model" or "backfill joins now
|
||||
that llmProposals is on" become a single fast pass that leaves the untouched stages — and their cost —
|
||||
alone. This is core operability for a long-lived ingestion product and is wholly independent of any
|
||||
benchmark.
|
||||
|
||||
## Benchmark context (motivation only — not a benchmark-specific rule)
|
||||
|
||||
Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A level-aware audit found (a) a
|
||||
tail of BigQuery dbs with poor *column*-description coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …)
|
||||
that want a **`descriptions`-only** re-run with a longer timeout, and (b) a desire to **backfill joins**
|
||||
across all already-ingested dbs after enabling `llmProposals` — without re-paying for descriptions. Both
|
||||
were blocked by the coarse single `inputHash` (flipping `llmProposals` or re-describing would invalidate
|
||||
the whole enrichment) and the absence of a stage-selective CLI flag. The benchmark just exercised
|
||||
large-scale multi-backend ingestion; the gap and the fix are generic.
|
||||
|
|
@ -1,300 +0,0 @@
|
|||
# Connection-scoped wiki pages
|
||||
|
||||
> Refined spec. Intake draft: `todo/01-connection-scoped-wiki.md`.
|
||||
|
||||
## Problem
|
||||
|
||||
Wiki pages have only two scopes today: `GLOBAL` and `USER`
|
||||
(`packages/cli/src/context/wiki/types.ts`, `WikiScope`). Scope is expressed by
|
||||
directory (`wiki/global/<key>.md`, `wiki/user/<userId>/<key>.md`) and the
|
||||
search path filters by loading only the in-scope pages before any lane runs.
|
||||
There is no way to associate a page with a **connection** (a warehouse/database
|
||||
defined under `connections:` in `ktx.yaml`).
|
||||
|
||||
In a project with many connections this causes two distinct failures:
|
||||
|
||||
1. **Cross-database relevance pollution.** All pages share one search index, so
|
||||
`wiki_search` for a generic term (`orders`, `revenue`, `average order
|
||||
value`) surfaces pages written about the wrong database. Concept names
|
||||
collide across databases constantly in real multi-connection projects
|
||||
(several databases each with `orders`, `customers`, …).
|
||||
2. **Silent overwrite on shared keys.** Page keys are a flat, global namespace.
|
||||
The write path resolves a repeated key to the existing file and updates it
|
||||
in place. So if the agent writes an `orders` page while ingesting database B
|
||||
and an `orders` page already exists for database A, B's content **overwrites
|
||||
A's** — same-concept pages for different databases cannot coexist today.
|
||||
|
||||
Today, when `memory_ingest` is called with a `connectionId`, that id only
|
||||
scopes which semantic-layer sources the triage agent can see
|
||||
(`memory-agent.service.ts`); it is **not** persisted on the resulting wiki page
|
||||
and **not** validated against `ktx.yaml`.
|
||||
|
||||
## Generic use case
|
||||
|
||||
Any org with multiple databases/warehouses in one **ktx** project: org-wide
|
||||
definitions ("fiscal year starts in February") should be visible everywhere,
|
||||
while database-specific conventions ("in the events DB, `user_id` is the
|
||||
anonymous device id, not the account id") should not pollute searches about
|
||||
other databases — and two databases that both have an `orders` concept must be
|
||||
able to keep separate, non-colliding pages.
|
||||
|
||||
## Model
|
||||
|
||||
`connections` is **additive frontmatter metadata**, orthogonal to the existing
|
||||
`GLOBAL`/`USER` directory scope — not a third scope dimension:
|
||||
|
||||
- A page is still `GLOBAL` or `USER` and lives where it lives today. It may
|
||||
**additionally** carry a `connections` list.
|
||||
- **Page keys remain a flat, globally-unique namespace.** `connections` does
|
||||
**not** namespace keys; a page is addressable by key alone, unchanged.
|
||||
- A page may list **multiple** connections.
|
||||
- **Absent or empty `connections` ⇒ unscoped: the page applies to all
|
||||
connections.** This is exactly today's behavior, so every existing page is
|
||||
unaffected.
|
||||
|
||||
This keeps `wiki_read` and refs untouched and adds no parallel scope axis;
|
||||
filtering by connection is purely a search/relevance concern.
|
||||
|
||||
## Requirements
|
||||
|
||||
### 1. Frontmatter field
|
||||
|
||||
Add an optional `connections` field to wiki page frontmatter — a list of
|
||||
connection ids.
|
||||
|
||||
- Accept a single string too; normalize to a list at parse time (reuse the
|
||||
existing array-coercion helper used for `tags`/`refs`/`sl_refs`).
|
||||
- Round-trips through parse/serialize without loss.
|
||||
- Absent or empty ⇒ unscoped (see Model). Existing pages are unaffected by
|
||||
construction.
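
The parse-time normalization can be sketched as below. `normalizeConnections` is a hypothetical name, and trimming, dropping empty strings, and de-duplicating are assumptions that the spec does not mandate.

```typescript
// Coerce a parsed frontmatter value into a list of connection ids.
// A single string becomes a one-element list; anything unusable becomes [].
function normalizeConnections(value: unknown): string[] {
  const items: unknown[] =
    typeof value === "string" ? [value] : Array.isArray(value) ? value : [];
  const ids: string[] = [];
  for (const item of items) {
    if (typeof item !== "string") continue; // ignore non-string entries
    const id = item.trim();
    if (id.length > 0 && ids.indexOf(id) === -1) ids.push(id);
  }
  return ids; // [] means unscoped: the page applies to all connections
}
```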
|
||||
|
||||
### 2. Page identity and key distinctness
|
||||
|
||||
`connections` does not change how pages are identified or addressed:
|
||||
|
||||
- Keys stay flat and globally unique; `wiki_read(key)` is unchanged.
|
||||
- Because the write path updates a page in place when its key already exists,
|
||||
same-concept pages for different connections **MUST** use distinct keys
|
||||
(e.g. `orders_sales_db` vs `orders_events_db`). Connection-distinctive keys
|
||||
for database-specific pages are the primary mechanism (driven by write-path
|
||||
prompt guidance, requirement 5).
|
||||
- **Data-loss guard (code, not prompt):** a connection-scoped write whose key
|
||||
matches an existing page whose `connections` scope is **disjoint** from the
|
||||
incoming scope MUST surface a collision instead of silently overwriting the
|
||||
existing page. (Updating a page within the same connection scope, or
|
||||
broadening/narrowing its own `connections`, is a normal update — not a
|
||||
collision.) The implementer owns whether the collision is a hard error or a
|
||||
suffixed new key; it must not be a silent clobber.
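
One way to express the data-loss guard is a pure decision function over the existing and incoming scopes. `decideWrite` and its return values are hypothetical; whether a collision becomes a hard error or a suffixed key stays with the implementer, as stated above.

```typescript
type WriteDecision = "create" | "update" | "collision";

// existing: connections of the page already stored at this key, or null if no page exists.
// incoming: connections supplied with the write, or undefined if the field was omitted.
function decideWrite(existing: string[] | null, incoming: string[] | undefined): WriteDecision {
  if (existing === null) return "create";
  // Unscoped on either side cannot be disjoint, so it is a normal update.
  if (incoming === undefined || incoming.length === 0 || existing.length === 0) return "update";
  // Any shared id means same-scope, broadening, or narrowing: still an update.
  const overlaps = incoming.some((id) => existing.indexOf(id) !== -1);
  return overlaps ? "update" : "collision";
}
```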
|
||||
|
||||
### 3. Search filtering
|
||||
|
||||
Add an optional connection filter to the search surfaces:
|
||||
|
||||
- **MCP:** `wiki_search(query, connectionId?)` (`context-tools.ts`).
|
||||
- **CLI:** `ktx wiki search` and `ktx wiki list` accept `--connection <id>`
|
||||
(with `-c` alias), matching the `ktx sql` connection flag.
|
||||
|
||||
Semantics:
|
||||
|
||||
- With `connectionId: X` ⇒ return pages whose `connections` is empty
|
||||
(unscoped) **∪** pages whose `connections` contains X.
|
||||
- Without ⇒ current behavior, all pages.
|
||||
- The filter **MUST** apply uniformly to **all three search lanes** (lexical
|
||||
FTS5, semantic/embedding, token fallback) at the **candidate-source level**,
|
||||
so each lane draws its full candidate pool from the already-scoped set. It
|
||||
**MUST NOT** be a post-filter on the merged/ranked results — that would let
|
||||
off-scope candidates consume both the per-lane pool and the final result
|
||||
limit unevenly.
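
The filter semantics above amount to a single predicate applied before any lane runs. The sketch below assumes a minimal `Page` shape; `pageMatchesConnection` mirrors the intent of the rule, but its signature here is illustrative.

```typescript
interface Page {
  key: string;
  connections?: string[];
}

// Unscoped pages match every connection; scoped pages match only listed ids.
function pageMatchesConnection(page: Page, connectionId?: string): boolean {
  if (connectionId === undefined) return true; // no filter: current behavior
  const scoped = page.connections ?? [];
  return scoped.length === 0 || scoped.indexOf(connectionId) !== -1;
}

// Build the candidate pool once, then hand the same pool to every search lane.
function candidatePool(pages: Page[], connectionId?: string): Page[] {
  return pages.filter((page) => pageMatchesConnection(page, connectionId));
}
```

Because every lane ranks only what `candidatePool` returns, off-scope pages never occupy slots in a per-lane pool or in the final result limit.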
|
||||
|
||||
*Orientation:* the existing `GLOBAL`/`USER` scoping already filters at the
|
||||
disk-load step that feeds both the in-memory token lane and the synced SQLite
|
||||
index (`local-knowledge.ts`); the connection filter fits the same seam.
|
||||
|
||||
### 4. Index persistence
|
||||
|
||||
The `.ktx/db.sqlite` knowledge index is re-synced from files on every search.
|
||||
The implementer owns whether to persist `connections` as index columns / a side
|
||||
table, or to filter the loaded page-set before the per-search sync. The binding
|
||||
requirement is the uniform-across-lanes behavior in requirement 3 — not a
|
||||
specific schema.
|
||||
|
||||
*Trade-off note (non-binding):* filtering the loaded page-set re-syncs only the
|
||||
scoped subset and gives up a little embedding-cache reuse when searches
|
||||
alternate between connections (recompute is one embedding per scoped page per
|
||||
connection switch — negligible at the scale this targets). Persisting
|
||||
`connections` in the index avoids that at the cost of a schema addition and a
|
||||
per-lane predicate. Either is acceptable.
|
||||
|
||||
### 5. Write path
|
||||
|
||||
- The memory agent's page-write tool (`wiki-write.tool.ts`) accepts a
|
||||
`connections` input field with the same REPLACE semantics as
|
||||
`tags`/`refs`/`sl_refs`: omit ⇒ keep existing on update; `[]` ⇒ clear to
|
||||
unscoped; `[ids]` ⇒ set.
|
||||
- When `memory_ingest` / the memory agent runs with a `connectionId`, prompt
|
||||
guidance directs the agent to:
|
||||
- set `connections: [connectionId]` on new **database-specific** pages, using
|
||||
connection-distinctive keys; and
|
||||
- leave `connections` empty for clearly **org-wide** content.
|
||||
- This is **prompt guidance, not a code auto-default.** A connection-scoped
|
||||
ingest must remain able to produce unscoped org-wide pages, so the tool must
|
||||
not force the session's `connectionId` onto every page.
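
The REPLACE semantics for `connections` on write reduce to a three-way case split. `applyConnectionsInput` is a hypothetical helper name used only for illustration.

```typescript
// existing: connections currently stored on the page (undefined for a new page).
// input: the tool's `connections` argument (undefined when the field is omitted).
function applyConnectionsInput(existing: string[] | undefined, input: string[] | undefined): string[] {
  if (input === undefined) return existing ? existing.slice() : []; // omit: keep existing
  return input.slice(); // [] clears to unscoped; [ids] replaces the stored list
}
```

Note that the session's `connectionId` never appears here: scoping a page is the agent's decision, guided by the prompt.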
|
||||
|
||||
### 6. `wiki_read` and refs unchanged
|
||||
|
||||
Pages remain addressable by key regardless of scoping. `wiki_read`, `refs`, and
|
||||
`sl_refs` semantics are unchanged; `connections` is a search/relevance concern
|
||||
only.
|
||||
|
||||
### 7. Validation
|
||||
|
||||
Validation behavior splits by surface, because an explicit argument is a
|
||||
typo-prone input while persisted content drifts independently of config:
|
||||
|
||||
- **Explicit argument** — a connection id supplied as a command/tool argument
|
||||
(`wiki_search`/`memory_ingest` `connectionId`, `ktx wiki … --connection`)
|
||||
MUST be validated against `ktx.yaml` connections and **rejected with a clear
|
||||
error listing the configured ids** when unknown. Reuse the canonical
|
||||
`project.config.connections[id]` check. This also closes the current gap
|
||||
where `memory_ingest`'s `connectionId` is accepted unvalidated.
|
||||
- **Persisted frontmatter** — a connection id that appears only in a stored
|
||||
page's `connections` and is not in `ktx.yaml` MUST **warn (not fail)** during
|
||||
validation/doctor, and MUST NOT break loading, searching, or reading that
|
||||
page. Config and content can evolve independently.
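
The two validation surfaces can be sketched as a throwing check for explicit arguments and a collecting check for persisted frontmatter. Both function names and the error wording are stand-ins, and `connections` is assumed to be a plain record keyed by connection id.

```typescript
type ConnectionsConfig = Record<string, unknown>;

// Explicit argument: reject unknown ids and list the configured ones.
function assertKnownConnectionId(connections: ConnectionsConfig, id: string): void {
  if (Object.prototype.hasOwnProperty.call(connections, id)) return;
  const known = Object.keys(connections).sort();
  const listing = known.length > 0 ? known.join(", ") : "(none configured)";
  throw new Error(`Unknown connection "${id}". Configured connections: ${listing}`);
}

// Persisted frontmatter: collect unknown ids for a warning; never throw.
function unknownPersistedIds(connections: ConnectionsConfig, pages: { connections?: string[] }[]): string[] {
  const unknown: string[] = [];
  for (const page of pages) {
    for (const id of page.connections ?? []) {
      const configured = Object.prototype.hasOwnProperty.call(connections, id);
      if (!configured && unknown.indexOf(id) === -1) unknown.push(id);
    }
  }
  return unknown;
}
```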
|
||||
|
||||
### 8. Scope boundary
|
||||
|
||||
This spec delivers the **mechanism** (frontmatter storage + uniform filter +
|
||||
write surface + validation). Driving the agent to actually pass `connectionId`
|
||||
during analytics work is the concern of
|
||||
`03-multi-connection-routing-in-analytics-skill`. It composes with the
|
||||
`--connection-id` flag on `ktx ingest` from `02-verbatim-ingest-mode`.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- A page with `connections: [db_a]` is returned by
|
||||
`wiki_search(query, connectionId: "db_a")` and by an unfiltered search, but
|
||||
**not** by `wiki_search(query, connectionId: "db_b")`.
|
||||
- A page with no `connections` field is returned in all three cases above.
|
||||
- Two pages — `orders_sales_db` (`connections: [sales_db]`) and
|
||||
`orders_events_db` (`connections: [events_db]`) — coexist; a search scoped to
|
||||
`sales_db` returns the first and not the second, and neither overwrote the
|
||||
other on write.
|
||||
- A connection-scoped write whose key matches an existing page scoped to a
|
||||
**different** connection surfaces a collision instead of silently
|
||||
overwriting (data-loss guard, requirement 2).
|
||||
- Filtering works in each lane independently (test with embeddings disabled to
|
||||
exercise the lexical and token lanes alone).
|
||||
- `memory_ingest(content, connectionId)` produces a page scoped to that
|
||||
connection for database-specific content.
|
||||
- `wiki_search`/`ktx wiki search --connection <unknown>` fails with an error
|
||||
that lists the configured connection ids.
|
||||
- A page whose `connections` references an id absent from `ktx.yaml` produces a
|
||||
warning but stays searchable and readable; search and read do not throw.
|
||||
- `connections` accepts a single string and a list, both normalized to a list.
|
||||
- Existing projects with no scoped pages and no `connectionId`/`--connection`
|
||||
behave identically before/after.
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
||||
the design.
|
||||
|
||||
- **Frontmatter type + parse/serialize:** `wiki/types.ts` (`WikiFrontmatter`),
|
||||
`wiki/knowledge-wiki.service.ts` (`parsePage`/`serializePage`), array
|
||||
coercion `wiki/local-knowledge.ts` (`stringArray`).
|
||||
- **Search lanes + per-search re-sync:** `wiki/local-knowledge.ts`
|
||||
(`searchLocalKnowledgePagesWithSqlite`; the disk-load step that already
|
||||
scopes `GLOBAL`/`USER`; token lane), `wiki/sqlite-knowledge-index.ts`
|
||||
(FTS5 `knowledge_pages_fts` lexical lane, semantic scan, `sync`).
|
||||
- **MCP surface:** `mcp/context-tools.ts` (`wiki_search`, `wiki_read`,
|
||||
`memory_ingest`; `connectionId` already present on `memory_ingest` but
|
||||
unvalidated).
|
||||
- **CLI surface:** `commands/knowledge-commands.ts`
|
||||
(`ktx wiki search`/`list`/`read`); canonical `--connection` flag in
|
||||
`commands/sql-commands.ts`; validation pattern
|
||||
`project.config.connections[id]` in `mcp/local-project-ports.ts`.
|
||||
- **Write path:** `wiki/tools/wiki-write.tool.ts` (input schema, REPLACE
|
||||
semantics, scope decision), `memory/memory-agent.service.ts` (`connectionId`
|
||||
threaded through the capture session and tool session;
|
||||
`external_ingest` forces `GLOBAL` scope).
|
||||
- **Connection config:** `context/project/config.ts` (`connections` record in
|
||||
`ktx.yaml`).
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
Spider 2.0-Lite local subset = one project with ~30 SQLite connections whose
|
||||
schemas share table/concept names (Northwind, sakila, two e-commerce DBs…).
|
||||
External-knowledge docs (RFM definition, F1 overtake rules) are each relevant
|
||||
to exactly one database and must not surface for the other 29.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Shipped on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
|
||||
acceptance criteria covered; full package suite green (2924 passing),
|
||||
type-check, knip/biome dead-code, and pre-commit clean.
|
||||
|
||||
**What was built / where**
|
||||
|
||||
1. **Frontmatter field (req 1).** `connections?: string[]` added to
|
||||
`WikiFrontmatter` (`context/wiki/types.ts`) and to the file-layer page model
|
||||
`LocalKnowledgePage` (`context/wiki/local-knowledge.ts`). Parsed via a new
|
||||
`stringList()` coercion (single string → list); round-trips through both
|
||||
serializers. Absent/empty ⇒ unscoped.
|
||||
2. **Search/list filter (req 3, req 4).** `connectionId?` threaded through
|
||||
`searchLocalKnowledgePages` → both the sqlite-FTS and scan impls →
|
||||
`loadAllKnowledgePages`, and through `listLocalKnowledgePages`. The filter is
|
||||
applied at the **disk-load seam** (`pageMatchesConnection`: unscoped ∪ pages
|
||||
listing the id), so the token lane and the per-search SQLite sync (lexical +
|
||||
semantic) both draw their candidate pool from the already-scoped set —
|
||||
candidate-source level, not a post-filter.
|
||||
- Chose req 4 **option B (filter the loaded page-set)** over persisting a
|
||||
column. Verified-safe here: standalone ktx's memory agent reads pages from
|
||||
files via a no-op `LocalKnowledgeIndex`, so `.ktx/db.sqlite`'s
|
||||
`knowledge_pages` is a per-search cache that `searchLocalKnowledgePages`
|
||||
rebuilds every call — scoping the sync corrupts no shared state. Only cost
|
||||
is one embedding recompute per scoped page on a connection switch (the
|
||||
spec's acknowledged, negligible trade-off). No index-schema change.
|
||||
3. **Page identity + data-loss guard (req 2).** Keys stay flat/global;
|
||||
`wiki_read`/refs unchanged. The write tool (`wiki/tools/wiki-write.tool.ts`)
|
||||
rejects (hard error, no silent clobber) a connection-scoped write whose
|
||||
incoming `connections` is **disjoint** from a same-key existing page's
|
||||
non-empty `connections`, suggesting a connection-distinctive key. Same-scope,
|
||||
overlapping, broaden/narrow, and unscoped-existing updates are allowed.
|
||||
Chose a hard error over auto-suffixing so the conflict reaches the agent
|
||||
(the decision-maker) instead of silently forking the key namespace.
|
||||
4. **Write path (req 5).** `wiki_write` accepts `connections` (string or list)
|
||||
with REPLACE semantics (omit ⇒ keep, `[]` ⇒ unscoped, `[ids]` ⇒ set); no
|
||||
code auto-default of the session connection. Prompt guidance added to the
|
||||
shared `wiki_capture` skill (new "Connection scoping" section) and the
|
||||
`memory_agent_external_ingest` prompt. The session `connectionId` is now
|
||||
surfaced to the agent so the guidance is actionable: in the memory-agent
|
||||
prompt header and in the ingest work-unit `<context>` block
|
||||
(`build-wu-context.ts`, fed from `ingest-bundle.runner.ts`).
|
||||
5. **Validation (req 7).** New shared helper
|
||||
`context/connections/configured-connections.ts → assertConfiguredConnectionId`
|
||||
validates explicit connection-id arguments against `ktx.yaml` and throws an
|
||||
error listing the configured ids. Routed from all three explicit-arg
|
||||
surfaces: MCP `wiki_search` (`local-project-ports.ts`), MCP `memory_ingest`
|
||||
(validated at the boundary in `mcp-server-factory.ts` — this also closes the
|
||||
prior gap where `memory_ingest`'s `connectionId` was accepted unvalidated),
|
||||
and CLI `ktx wiki --connection`/`-c` (`commands/knowledge-commands.ts` +
|
||||
`knowledge.ts`). Persisted-frontmatter ids absent from config are **warn-only**:
|
||||
`listReferencedConnectionIds` + a non-fatal `ktx status` warning
|
||||
(`status-project.ts`); loading/searching/reading never throw on them.
|
||||
|
||||
**Deviations / notes**
|
||||
|
||||
- Req 1 says "reuse the existing array-coercion helper used for `tags`/`refs`".
|
||||
That helper (`stringArray`) is array-only and does **not** coerce a single
|
||||
string; added a dedicated `stringList` for `connections` to meet the
|
||||
single-string acceptance criterion rather than change `stringArray`'s
|
||||
behavior for the other fields.
|
||||
- **Scope boundary kept:** `discover_data` (MCP) also searches wiki and already
|
||||
takes `connectionId`, but req 3/8 scope the filter to `wiki_search` + CLI, so
|
||||
its wiki lane is intentionally left unscoped. Worth a follow-up if
|
||||
`discover_data`'s wiki results should also be connection-scoped for
|
||||
consistency.
|
||||
- MCP tools-list snapshot and the `mcp-server-factory` test were updated for the
|
||||
new `wiki_search.connectionId` param and the `memory_ingest` validation
|
||||
wrapper (the port is no longer the raw service object; it delegates).
|
||||
|
|
@ -1,327 +0,0 @@
|
|||
# Verbatim ingest mode for authoritative documents
|
||||
|
||||
> Refined spec. Intake draft: `todo/02-verbatim-ingest-mode.md`.
|
||||
|
||||
## Problem
|
||||
|
||||
`ktx ingest --text/--file` routes captured content through the memory agent.
|
||||
`runKtxTextIngest` (`packages/cli/src/text-ingest.ts`) builds a
|
||||
`MemoryAgentInput` with `sourceType: 'external_ingest'` and hands it to
|
||||
`MemoryAgentService.ingest` (`context/memory/memory-agent.service.ts`), which
|
||||
runs a multi-step LLM triage loop (≈30-step budget, content clipped to ~48k
|
||||
chars) inside a session worktree. The agent decides — via the `wiki_write`
|
||||
tool — what to persist, so it may **rewrite, condense, split, or re-title** the
|
||||
content before it lands as a wiki page. The body is produced by an LLM, not
|
||||
copied by code.
|
||||
|
||||
For *authoritative* documents — formula definitions, metric specs, runbooks,
|
||||
compliance text — paraphrasing is a defect, not a feature:
|
||||
|
||||
- exact thresholds, constants, and rule wording must survive unchanged;
|
||||
- lexical (BM25/FTS5) search works best when the stored text matches the
|
||||
phrasing users and agents query with;
|
||||
- ingestion should be deterministic and reproducible — the same input file
|
||||
yields the same page, and re-running is safe.
|
||||
|
||||
Two further gaps block authoritative ingest today:
|
||||
|
||||
- The memory agent hard-requires an LLM backend
|
||||
(`context/memory/local-memory.ts` throws when `llm.provider.backend: none`
|
||||
and no runner is injected), so there is **no** offline ingest path at all.
|
||||
- The agent's write tool *merges* a repeated same-scope key in place (REPLACE
|
||||
frontmatter semantics in `wiki/tools/wiki-write.tool.ts`), i.e. exactly the
|
||||
silent in-place rewrite an authoritative-document workflow must avoid.
|
||||
|
||||
## Generic use case
|
||||
|
||||
Any team ingesting documents that are already the source of truth: metric
|
||||
definition sheets, SLA documents, calculation-methodology docs, regulatory
|
||||
text. The user wants **ktx** to *index and surface* the document, not to
|
||||
re-author it. Today they work around the memory agent by hand-writing
|
||||
frontmatter and copying files into `wiki/global/`; verbatim mode makes that a
|
||||
first-class, supported `ktx ingest` workflow.
|
||||
|
||||
## Model
|
||||
|
||||
`ktx ingest --verbatim` is a **distinct, code-driven ingest path**, not a
|
||||
constrained prompt over the existing agent loop. Its defining invariants:
|
||||
|
||||
- **The stored page body is the input document body, written by code.** The LLM
|
||||
never produces, edits, or relays the body. It is confined to generating
|
||||
*metadata* about the body.
|
||||
- **Behavior follows from inputs, not from a mode prompt.** Whether metadata is
|
||||
LLM-generated or derived offline follows from the configured backend
|
||||
(`llm.provider.backend`), not from a second user-facing switch.
|
||||
- **Pages are `GLOBAL`-scoped.** Verbatim ingest targets org/project
|
||||
authoritative docs (the content teams copy into `wiki/global/` today).
|
||||
Connection association is expressed by the **additive `connections`
|
||||
frontmatter** from spec 01, never by directory.
|
||||
- **Deterministic and idempotent.** The page key, the merged frontmatter, and
|
||||
the stored body are all functions of the input alone (given a fixed backend),
|
||||
so the same input produces the same page and a re-run is a safe no-op.
|
||||
|
||||
### "Byte-for-byte" scope
|
||||
|
||||
The guarantee is on the document's **interior**: no paraphrase, no condense, no
|
||||
split, no re-title, no reflow, **no clipping**. The shared wiki store
|
||||
canonicalizes *surrounding* whitespace — `parsePage` trims the body and
|
||||
`serializePage` emits a single trailing newline
|
||||
(`wiki/knowledge-wiki.service.ts`) — so leading/trailing blank lines are
|
||||
normalized by the storage layer. Verbatim mode **MUST** write through that
|
||||
shared `writePage`/`serializePage` path rather than fork a parallel serializer;
|
||||
the interior bytes (thresholds, constants, wording) are what must be preserved
|
||||
exactly, and they are. Acceptance hashes compare the stored body against the
|
||||
**trimmed** input body.
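
An acceptance-style check for this guarantee might compare hashes of the trimmed bodies. This sketch assumes the storage layer's trim behaves like `String.prototype.trim`, which the spec does not state explicitly; `bodyHash` is a hypothetical test helper.

```typescript
import { createHash } from "node:crypto";

// Hash the body after removing surrounding whitespace only; interior bytes are untouched.
function bodyHash(body: string): string {
  return createHash("sha256").update(body.trim(), "utf8").digest("hex");
}

// True when the stored page body preserves the input's interior exactly.
function interiorPreserved(inputBody: string, storedBody: string): boolean {
  return bodyHash(inputBody) === bodyHash(storedBody);
}
```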
|
||||
|
||||
## Requirements
|
||||
|
||||
### 1. Flag
|
||||
|
||||
`ktx ingest --file <path> --verbatim` and `ktx ingest --text <content>
|
||||
--verbatim`. `--verbatim` is a boolean that applies to every `--file`/`--text`
|
||||
item in the invocation; each item becomes its own page.
|
||||
|
||||
- It composes with the existing `--connection-id <id>` flag
|
||||
(`commands/ingest-commands.ts`) so the resulting page can be
|
||||
connection-scoped (see spec 01). **Note:** the intake draft wrote
|
||||
`--connection`; the shipped flag is `--connection-id`. Use `--connection-id`.
|
||||
- No new `--key` flag (see requirement 4). No second behavioral switch beyond
|
||||
`--verbatim` itself.
|
||||
|
||||
### 2. Body preservation is enforced by code, not by prompt
|
||||
|
||||
The stored page body is the input content (interior preserved exactly, per
|
||||
**Model → "Byte-for-byte" scope**).
|
||||
|
||||
- Verbatim mode **MUST NOT** route the body through the memory-agent LLM loop
|
||||
or any `wiki_write` tool call where a model could alter it.
|
||||
- The LLM, when used, generates **only** metadata: `summary`, `tags`, and
|
||||
`sl_refs`. A single constrained structured-output call (AI SDK v6
|
||||
`generateObject` with a `zod` schema) is the intended mechanism — the full
|
||||
memory-agent loop, worktree, and squash-merge are **not** required and should
|
||||
not be used.
|
||||
- The page key is **not** LLM-generated (requirement 4).
|
||||
|
||||
### 3. No clipping of the stored body
|
||||
|
||||
The ~48k clip may apply only to the text **sent to the LLM** for metadata
|
||||
generation. It **MUST NOT** apply to the text **written** to the page. A
|
||||
document larger than the clip limit is stored in full; only its metadata is
|
||||
derived from the clipped prefix.
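
The separation between what is stored and what the model sees can be made explicit in the data flow. The constant below is an assumed round figure for the "~48k" limit, and `prepareVerbatimItem` is a hypothetical name.

```typescript
// Assumed value for the "~48k chars" clip; the real limit lives in the memory-agent code.
const METADATA_CLIP_CHARS = 48000;

interface PreparedItem {
  storedBody: string; // written to the page, never clipped
  metadataInput: string; // sent to the LLM for summary/tags/sl_refs only
}

function prepareVerbatimItem(body: string): PreparedItem {
  return {
    storedBody: body,
    metadataInput: body.slice(0, METADATA_CLIP_CHARS),
  };
}
```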
|
||||
|
||||
### 4. Deterministic page key
|
||||
|
||||
The key is derived from the input, never chosen by the LLM (an LLM-chosen slug
|
||||
would break determinism and the requirement-7 idempotency guarantee):
|
||||
|
||||
- **`--file <path>`** → `suggestFlatWikiKey(basename without extension)`
|
||||
(`wiki/keys.ts`). This is the primary document case and is always
|
||||
deterministic.
|
||||
- **`--text <content>`** → if the content opens with a Markdown heading, the
|
||||
key is `suggestFlatWikiKey(heading text)`. If there is no leading heading,
|
||||
**hard error**: inline verbatim text needs a leading heading to derive a
|
||||
stable key, or should be passed as `--file`.
|
||||
- No hash-based keys (unfindable) and no `--key` override flag. A real need for
|
||||
explicit key control can add `--key` later.
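
A sketch of the key derivation rules follows. `slugKey` is a stand-in for `suggestFlatWikiKey`, whose actual slug rules may differ, and treating leading blank lines before the heading as acceptable is an assumption.

```typescript
// Stand-in slugifier (NOT the real suggestFlatWikiKey): lowercase, underscores for other runs.
function slugKey(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, "_").replace(/^_+|_+$/g, "");
}

// --file: key from the basename without its extension.
function keyFromFile(path: string): string {
  const base = path.split(/[\\/]/).pop() ?? "";
  return slugKey(base.replace(/\.[^.]+$/, ""));
}

// --text: key from a leading Markdown heading, otherwise a hard error.
function keyFromText(content: string): string {
  const firstLine = content.replace(/^\s+/, "").split("\n")[0] ?? "";
  const match = /^#{1,6}\s+(.+?)\s*$/.exec(firstLine);
  if (!match) {
    throw new Error("Verbatim --text needs a leading Markdown heading to derive a key; otherwise pass the content as --file.");
  }
  return slugKey(match[1] ?? "");
}
```

Both paths are pure functions of the input, which is what makes re-running the same ingest land on the same key.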
|
||||
|
||||
### 5. Frontmatter: passthrough + gap-fill
|
||||
|
||||
If the input has its own YAML frontmatter, split it from the body: the body is
|
||||
everything after the closing `---`; the frontmatter is authoritative metadata.
|
||||
|
||||
- **Passthrough.** Every input frontmatter field is preserved in the stored
|
||||
page, **including fields not in `WikiFrontmatter`** (`effective_date`,
|
||||
`version`, `owner`, …). The serializer `YAML.stringify`s the object, so
|
||||
unknown keys round-trip. Dropping them would be silent data loss on
|
||||
authoritative docs.
|
||||
- **Gap-fill only.** Generated/derived metadata fills **absent** fields only;
|
||||
it **MUST NOT** overwrite an explicit value. An input `summary:` is never
|
||||
replaced by a generated one; explicit `tags`/`sl_refs` are likewise kept.
|
||||
- **Defaults.** `usage_mode` defaults to `auto` (findable via search, not
|
||||
force-injected) when the input does not set it.
|
||||
- **Connection scoping.** `--connection-id X` (validated via
|
||||
`assertConfiguredConnectionId`, `context/connections/configured-connections.ts`)
|
||||
sets `connections: [X]` when the input frontmatter does not already declare
|
||||
`connections`. If the input frontmatter declares a **different**
|
||||
`connections` than the flag, **hard error** (ambiguous intent) rather than
|
||||
silently choosing one. If they match, or only one source is present, proceed.
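
The passthrough, gap-fill, and connection rules can be sketched as two small functions. Treating only `undefined` as "absent", and treating any declared list other than exactly `[X]` as "different" from `--connection-id X`, are assumptions; the names are hypothetical.

```typescript
type Frontmatter = Record<string, unknown>;

// Input fields always win; generated metadata fills only absent fields.
function gapFill(input: Frontmatter, generated: Frontmatter): Frontmatter {
  const merged: Frontmatter = { ...input }; // unknown keys pass through untouched
  for (const key of Object.keys(generated)) {
    if (merged[key] === undefined) merged[key] = generated[key];
  }
  if (merged["usage_mode"] === undefined) merged["usage_mode"] = "auto";
  return merged;
}

// Reconcile the --connection-id flag with `connections` declared in the input.
function resolveConnections(flag: string | undefined, declared: string[] | undefined): string[] | undefined {
  const fromInput = declared && declared.length > 0 ? declared : undefined;
  if (flag === undefined) return fromInput;
  if (fromInput === undefined) return [flag];
  const same = fromInput.length === 1 && fromInput[0] === flag;
  if (!same) {
    throw new Error(`--connection-id "${flag}" conflicts with frontmatter connections [${fromInput.join(", ")}]`);
  }
  return fromInput;
}
```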
|
||||
|
||||
### 6. Degraded mode (`llm.provider.backend: none`)
|
||||
|
||||
`--verbatim` **MUST** work with no LLM backend; this is a capability the
|
||||
regular agent ingest lacks.
|
||||
|
||||
- `summary` is derived from the leading Markdown heading text, or, if none, the
|
||||
first non-empty sentence of the body (trimmed to a reasonable length).
|
||||
- `tags` and `sl_refs` are left empty.
|
||||
- The body is still stored in full (requirement 3 applies unchanged).
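
The offline `summary` derivation might look like the sketch below. The 200-character cap and the sentence-boundary regex are assumptions standing in for "a reasonable length" and "first non-empty sentence"; `deriveSummary` is a hypothetical name.

```typescript
// Derive a summary without an LLM: leading heading text, else the first sentence.
function deriveSummary(body: string, maxLen: number = 200): string {
  const text = body.trim();
  const firstLine = (text.split("\n")[0] ?? "").trim();
  const heading = /^#{1,6}\s+(.+)$/.exec(firstLine);
  if (heading) return (heading[1] ?? "").trim().slice(0, maxLen);
  // Shortest prefix ending in ., ! or ? that is followed by whitespace or end of text.
  const sentence = /^[\s\S]*?[.!?](?=\s|$)/.exec(text);
  const raw = sentence ? sentence[0] : text;
  return raw.replace(/\s+/g, " ").trim().slice(0, maxLen);
}
```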
|
||||
|
||||
### 7. Key collisions: idempotent-if-identical, else hard error
|
||||
|
||||
Verbatim mode does **not** reuse the agent write tool's in-place merge. Before
|
||||
writing, read any existing `GLOBAL` page at the derived key:
|
||||
|
||||
- **No existing page** → write.
|
||||
- **Existing page, stored body identical** to the new body (compared after the
|
||||
storage-layer normalization in **Model**) → **idempotent no-op success**
|
||||
(re-running the same file is safe).
|
||||
- **Existing page, body differs** → **hard error** naming the conflicting key
|
||||
and directing the user to a distinct key. Never a silent overwrite, never an
|
||||
auto-suffixed second page (which would produce the duplicated/divergent pages
|
||||
this mode must avoid).
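
The three collision cases map onto a small planning function. This is a sketch: `planVerbatimWrite` is a hypothetical name, and comparing with `trim()` assumes that is how the storage layer normalizes bodies.

```typescript
type VerbatimPlan = "write" | "noop";

// existingBody: body of the GLOBAL page at the derived key, or null if none exists.
function planVerbatimWrite(key: string, existingBody: string | null, newBody: string): VerbatimPlan {
  if (existingBody === null) return "write";
  if (existingBody.trim() === newBody.trim()) return "noop"; // idempotent re-run
  throw new Error(
    `Wiki page "${key}" already exists with different content; rename the input so it derives a distinct key.`,
  );
}
```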
|
||||
|
||||
### 8. LLM-failure handling
|
||||
|
||||
When a backend **is** configured but the metadata call fails (rate limit,
|
||||
transport error, malformed output after retries), **fail the item** (honoring
|
||||
`--fail-fast` and the per-item exit-code aggregation in `text-ingest.ts`).
|
||||
**MUST NOT** silently fall back to degraded derivation: a degraded page written
on a transient error would, under requirement 7, refuse to be replaced by a
healthy re-run — breaking reproducibility. Degraded derivation is reserved for
`backend: none`.

### 9. Findability

After write, the page is reindexed so search returns it:

- `wiki_search` for a phrase taken from the document body returns the page via
  the lexical lane (the body is indexed in `buildKnowledgeSearchText`).
- `wiki_search` for a paraphrase of the document's topic returns it via the
  semantic lane **when embeddings are enabled** (this is what the generated
  `summary`/`tags` buy over a bare degraded page).

## Acceptance criteria

- Ingesting a file with `--verbatim` produces a page whose body is
  byte-identical to the trimmed input body (assert with a hash in tests).
- A >48k-char file is stored in full (assert stored body length ≥ input length
  minus trim).
- Running the same `--verbatim` ingest twice is idempotent: one page, identical
  bytes both times, no error on the second run.
- A second ingest to the same derived key with **different** body content fails
  loudly (requirement 7) and does not modify the existing page or create a
  suffixed one.
- Input frontmatter with an unknown field (e.g. `effective_date`) is preserved
  in the stored page; an explicit input `summary` is **not** overwritten by a
  generated one.
- With `llm.provider.backend: none`, `--verbatim` still produces a page: full
  body stored, `summary` derived from the heading/first sentence, `tags` and
  `sl_refs` empty.
- `--verbatim --connection-id X` yields a page with `connections: [X]`; an
  unknown id is rejected with an error listing the configured ids. (Depends on
  spec 01, now shipped.)
- `--verbatim --connection-id X` where the input frontmatter already declares a
  different `connections` fails with an ambiguity error.
- `ktx ingest --text "no heading here" --verbatim` errors asking for a leading
  heading or `--file`.
- `wiki_search` for a body phrase returns the page (lexical lane); for a topic
  paraphrase it returns the page when embeddings are enabled (semantic lane).
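
The idempotency and conflict criteria above amount to a three-way decision on the page stored at the derived key. The following is a minimal sketch, not the shipped code: `decideVerbatimWrite` and its result labels are hypothetical names, and `String.prototype.trim` stands in for whatever trimming the spec's "trimmed input body" actually applies.

```typescript
// Hypothetical result labels for illustration only.
type VerbatimWriteDecision = "create" | "noop" | "conflict";

// existingBody is undefined when no page exists yet at the derived key.
function decideVerbatimWrite(
  existingBody: string | undefined,
  incomingBody: string,
): VerbatimWriteDecision {
  // No page at this key yet: write it.
  if (existingBody === undefined) return "create";
  // Same trimmed bytes: a re-run is a no-op, not an error.
  if (existingBody.trim() === incomingBody.trim()) return "noop";
  // Different content at the same key: fail loudly, never overwrite or suffix.
  return "conflict";
}
```

A caller would treat `"conflict"` as a hard error for that item and leave the stored page untouched.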

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
module layout and design, subject to the invariants above.

- **Command flag:** `commands/ingest-commands.ts` (`ktx ingest` option table;
  `--text`/`--file`/`--connection-id`/`--fail-fast` already present — add
  `--verbatim` and thread it into `KtxTextIngestArgs`).
- **Orchestration:** `text-ingest.ts` (`runKtxTextIngest`, `loadItems`,
  `validateItems`, per-item loop and exit-code aggregation). The verbatim flow
  reuses item loading and replaces the `memoryIngest.ingest(...)` call with a
  code-driven write for `--verbatim` items. Keep the new logic in a focused
  module (e.g. a `verbatim-ingest` sibling) rather than swelling `text-ingest`.
- **Frontmatter split / write / serialize:** `wiki/knowledge-wiki.service.ts`
  (`parsePage` for the `---…---` split shape, `serializePage`, `writePage`,
  `readPage` for the collision check). Write through this shared path — do not
  re-implement YAML framing.
- **Key derivation:** `wiki/keys.ts` (`suggestFlatWikiKey`, `assertFlatWikiKey`).
- **Frontmatter type:** `wiki/types.ts` (`WikiFrontmatter`; `summary` and
  `usage_mode` are the required fields; unknown passthrough fields live
  alongside).
- **Connection validation:** `context/connections/configured-connections.ts`
  (`assertConfiguredConnectionId`, shipped with spec 01).
- **Metadata LLM call:** the local LLM runtime/config resolution in
  `context/llm/` (e.g. `local-config.ts`; `backend: none` ⇒ no runtime). Use a
  single `generateObject` call with a `zod` metadata schema; the `ai-sdk` skill
  covers v6 patterns.
- **Reindex / search lanes:** `wiki/local-knowledge.ts`
  (`loadAllKnowledgePages`, `buildKnowledgeSearchText`, the lexical/token/
  semantic lanes) and `wiki/sqlite-knowledge-index.ts` (`sync`).
- **Tests:** extend `packages/cli/test/text-ingest.test.ts` and add a
  verbatim-focused test file covering the acceptance criteria above.

## Benchmark context (motivation only)

Spider 2.0-Lite ships 8 external-knowledge markdown docs (RFM bucket
definitions, the haversine formula, F1 overtake rules, …). Gold SQL was
authored against their **exact** text; an LLM paraphrase that drops a bucket
boundary or rounds a constant loses the corresponding question. The current
workaround is hand-writing frontmatter and copying files into `wiki/global/`.
Verbatim mode turns that manual step into a supported **ktx** workflow, and
composes with the connection scoping from spec 01 so a doc relevant to exactly
one of the benchmark's ~30 SQLite databases does not surface for the other 29.

## Implementation notes

Shipped on branch `write-feature-spec-wiki`. All acceptance criteria are covered
by tests and verified end-to-end through the linked `ktx-dev` binary.

**What was built**

- New module `packages/cli/src/verbatim-ingest.ts`:
  `createLocalProjectVerbatimIngestor` + `LocalVerbatimIngestor`, plus the pure
  helpers `splitInputDocument`, `deriveVerbatimPageKey`,
  `deriveDegradedSummary`, and `buildVerbatimFrontmatter` (the last four are
  `@internal` exports for unit testing).
- `--verbatim` flag added to `ktx ingest` in `commands/ingest-commands.ts`, with a
  guard that rejects `--verbatim` without `--text`/`--file`. The flag is threaded
  into `KtxTextIngestArgs.verbatim`.
- `text-ingest.ts` now tags each loaded item with an `origin`
  (`file` / `text` / `stdin`) and, when `verbatim` is set, constructs the verbatim
  ingestor once and branches the per-item loop to a code-driven write instead of
  `memoryIngest.ingest(...)`. The shared view, exit-code aggregation, and
  `--fail-fast` handling are reused.

**Deviations from the literal spec (design refinements, per "implementer owns the design")**

- *Metadata call.* The spec suggested raw AI SDK v6 `generateObject`. The
  implementation routes through the existing `KtxLlmRuntimePort.generateObject`
  instead — it is implemented by all three backends (ai-sdk, claude-code, codex),
  and the ai-sdk one already wraps `generateText` + `Output.object({schema})`.
  This realizes the spec's "single constrained structured-output call" intent via
  the canonical cross-backend path rather than forking a second LLM entry point.
- *Reindex (requirement 9).* In the standalone CLI, `searchLocalKnowledgePages`
  rebuilds the SQLite index from disk on every call (recomputing embeddings for
  changed pages), so a written page is findable without a dedicated reindex step.
  The write still goes through the shared `KnowledgeWikiService.writePage` +
  `syncSinglePage` path, so the page is also eagerly indexed.
- *Gap-fill optimization.* The LLM is skipped entirely when the input frontmatter
  already supplies `summary`, `tags`, and `sl_refs` (generated metadata only fills
  absent fields, so there is nothing to generate). A fully specified document thus
  ingests with a configured backend without any LLM call.
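
The gap-fill rule described above (input values win, generated values only fill absent fields, and the LLM is skipped when nothing is absent) could look roughly like this. It is a sketch under stated assumptions: `Frontmatter`, `needsMetadataCall`, and `fillGaps` are hypothetical names, and the real `WikiFrontmatter` type and ingestor are not reproduced here.

```typescript
// Illustrative loose shape; the real frontmatter type is richer.
type Frontmatter = Record<string, unknown>;

// The three fields the metadata call can generate, per the notes above.
const GENERATED_FIELDS = ["summary", "tags", "sl_refs"] as const;

// The LLM call is only needed when at least one generated field is absent.
function needsMetadataCall(input: Frontmatter): boolean {
  return GENERATED_FIELDS.some((field) => input[field] === undefined);
}

// Input values always win; generated values only fill absent fields.
// Unknown passthrough fields (e.g. effective_date) are carried over untouched.
function fillGaps(input: Frontmatter, generated: Frontmatter): Frontmatter {
  const merged: Frontmatter = { ...input };
  for (const field of GENERATED_FIELDS) {
    if (merged[field] === undefined && generated[field] !== undefined) {
      merged[field] = generated[field];
    }
  }
  return merged;
}
```

With this shape, a document that already declares all three fields never reaches the metadata call, which matches the behaviour the note describes.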

**Tests**

- `packages/cli/test/verbatim-ingest.test.ts` — helper units + ingestor integration
  against a real `initKtxProject` git repo (byte-identical body hash, >48k no-clip,
  idempotency, conflict hard-error, frontmatter passthrough, explicit-summary
  preservation, degraded mode, connection scoping + unknown-id rejection +
  ambiguity error, no-heading inline error, LLM gap-fill, LLM-failure-fails-item,
  lexical + semantic findability).
- `packages/cli/test/text-ingest.test.ts` — verbatim routing, origin tagging,
  connection-id forwarding, fail-fast.
- `packages/cli/test/index.test.ts` — `--verbatim` flag threading and the
  requires-`--text`/`--file` guard.

**Docs**

- `docs-site/content/docs/cli-reference/ktx-ingest.mdx` (flag, "Verbatim ingest"
  section, examples, common errors) and
  `docs-site/content/docs/guides/writing-context.mdx` (authoritative-document
  workflow).

**Verification**

- Full CLI suite: 2959 passed, 1 skipped. `pnpm run build` and `pnpm run dead-code`
  (Biome + Knip default + production) clean; pre-commit clean on changed files.
  A pre-existing, unrelated type error in `test/mcp-server-factory.test.ts` is
  untouched — it predates this work.
@ -1,361 +0,0 @@
# Schema scan tolerates individual objects that fail introspection

> Refined spec. Intake draft: `todo/06-scan-tolerate-broken-objects.md`.

## Problem

A single broken or inaccessible object zeroes out an entire connection's
context. Schema introspection iterates objects with no per-object error
handling, so one throw aborts the whole scan, the live-database adapter's
`fetch()` rejects, and the connection ends with **no semantic layer at all** —
even when every other object was healthy.

The failure surfaces in two phases, and the contract must hold in both:

- **Metadata read (sqlite).** `connectors/sqlite/connector.ts` does
  `rawTables.map((t) => this.readTable(...))` (≈ line 171) with no try/catch.
  `readTable` runs `PRAGMA table_info(<object>)`, which *executes* a view's
  body to resolve its columns — so a view over a dropped/renamed column (the
  `oracle_sql` case: `emp_hire_periods_with_name` selecting `ehp.start_date`
  from a base table that has no such column) raises `no such column:
  ehp.start_date` and aborts introspection of all ~48 healthy objects.
- **Profiling read (warehouse drivers).** postgres/mysql/clickhouse/sqlserver/
  bigquery/snowflake read metadata in bulk from catalog / `information_schema`
  (a broken view rarely breaks that), then fail when a per-object profiling or
  sampling `SELECT` runs against a broken object. Enrichment sampling is
  *already* isolated (`description-generation.ts` wraps `sampleTable` in
  try/catch → `sampling_failed`), but mandatory introspection-phase reads are
  not uniformly isolated across drivers.

A second, related defect blocks the documented escape hatch. Setting
`enabled_tables: ["main.customers"]` on a sqlite connection produces a
different hard failure — `Adapter "database schema" did not recognize fetched
source output`. Root cause: the sqlite connector emits every object as
`{ db: null }` and filters the scope with `scopedTableNames(scope, { db: null })`
(`context/scan/table-ref.ts` ≈ line 47, `if (ref.db !== wantDb) continue`), but
`"main.customers"` parses to `{ db: "main", name: "customers" }`
(`context/scan/enabled-tables.ts`, `parseDottedTableEntry`). `"main" !== null`,
so the entry matches **nothing**, zero table files are written, and
`detectLiveDatabaseStagedDir` (`stage.ts` ≈ line 138) returns false, tripping
the generic "did not recognize fetched source output" error at
`context/ingest/local-stage-ingest.ts` (≈ line 291). The bare form
`enabled_tables: ["customers"]` would have worked; the `main.`-qualified form
silently matches nothing.
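
The mismatch described above can be reproduced in a few lines. This is a simplified sketch, not the real code: `parseDottedEntry` and `scopedNames` are hypothetical stand-ins for `parseDottedTableEntry` and `scopedTableNames`, and catalog handling is omitted.

```typescript
// Simplified table reference; the real one also carries a catalog.
interface TableRef {
  db: string | null;
  name: string;
}

// "main.customers" -> { db: "main", name: "customers" }
// "customers"      -> { db: null,   name: "customers" }
function parseDottedEntry(entry: string): TableRef {
  const parts = entry.split(".");
  const name = parts[parts.length - 1] ?? entry;
  const db = parts.length > 1 ? (parts[parts.length - 2] ?? null) : null;
  return { db, name };
}

// Mirrors the `ref.db !== wantDb` filter: entries whose db differs from the
// db the connector emits are dropped without any diagnostic.
function scopedNames(entries: string[], wantDb: string | null): string[] {
  const names: string[] = [];
  for (const entry of entries) {
    const ref = parseDottedEntry(entry);
    if (ref.db !== wantDb) continue;
    names.push(ref.name);
  }
  return names;
}
```

Calling `scopedNames(["customers"], null)` keeps the entry, while `scopedNames(["main.customers"], null)` returns an empty list, which is the silent zero-match that later surfaces as the generic adapter error.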

## Generic use case

Real warehouses routinely contain broken or inaccessible objects: views over
dropped/renamed columns, views referencing tables the connection role can't
read, permission-denied tables, and vendor system views that error on read.
**ktx** should ingest everything it *can* and skip what it can't, so one bad
object never zeroes out an entire connection's context. This is baseline
production robustness, independent of any benchmark — the same tolerance a
33-warehouse fleet needs the first time one of its databases has a stale view.

## Design

The unit of failure is **one object** (table or view). Introspecting or
profiling an object is an operation that can fail independently; a failure skips
that object, records a recoverable warning, and the scan continues from the
objects that succeeded.

Because seven Node connectors and the Python daemon each introspect differently
(sqlite reads metadata per-object via `PRAGMA`; warehouse drivers read metadata
in bulk and fail per-object during profiling), the **semantics** of "skip /
warn / total-failure" are defined **once** and every connector routes through
them — rather than seven copies of the same try/catch that drift apart:

- A shared per-object helper in the `scan/` layer — the sibling of the existing
  `tryConstraintQuery` (`context/scan/constraint-discovery.ts`) — wraps a single
  object read and returns `{ ok: true, table } | { ok: false, warning }`, with a
  standard warning code (e.g. `object_introspection_failed`).
- A shared post-check enforces the total-failure rule (R3) uniformly.
- Each connector keeps its **natural** shape: sqlite routes each `readTable`
  through the helper; bulk-read drivers route their per-object profiling reads
  through it. The contract is uniform; the loop is not forced to be.
- The Python daemon implements the **same contract** in its own helper, adds a
  `warnings` field to `DatabaseIntrospectionResponse`, and the Node adapter maps
  those warnings into `KtxSchemaSnapshot` (`daemon-introspection.ts`).
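
The per-object helper in the list above is essentially a try/catch that turns a throw into a discriminated-union result. A minimal synchronous sketch follows; `tryReadObject`, `readAll`, and the trimmed-down `ScanWarning` shape are hypothetical, the real reads are presumably asynchronous, and redaction of the error text is left out.

```typescript
// Illustrative warning shape; the real KtxScanWarning carries more fields.
interface ScanWarning {
  code: "object_introspection_failed";
  table: string;
  message: string;
  recoverable: true;
}

type ObjectResult<T> = { ok: true; table: T } | { ok: false; warning: ScanWarning };

// Wraps one object read so a throw becomes a warning instead of aborting the scan.
function tryReadObject<T>(objectName: string, read: () => T): ObjectResult<T> {
  try {
    return { ok: true, table: read() };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return {
      ok: false,
      warning: { code: "object_introspection_failed", table: objectName, message, recoverable: true },
    };
  }
}

// The caller keeps its natural loop and collects successes and warnings separately.
function readAll<T>(
  names: string[],
  read: (name: string) => T,
): { tables: T[]; warnings: ScanWarning[] } {
  const tables: T[] = [];
  const warnings: ScanWarning[] = [];
  for (const name of names) {
    const result = tryReadObject(name, () => read(name));
    if (result.ok) tables.push(result.table);
    else warnings.push(result.warning);
  }
  return { tables, warnings };
}
```

The union keeps the skip decision in one place: each connector only chooses which read to wrap, not what a failure means.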

The warning channel already exists end to end on the Node side
(`KtxSchemaSnapshot.warnings`, the `KtxScanWarning` shape with `table`/`column`/
`recoverable`, the `KtxScanWarningCode` enum, and the staged `warnings.json`
artifact written by `writeLiveDatabaseSnapshot`); sqlite simply never populates
it. This spec makes that channel carry object-skip warnings and surfaces them in
the ingest summary, the persisted report body, and `ktx status`.

## Requirements

### R1 — Per-object isolation (the contract)

If introspecting or profiling one object throws, the scan **MUST** skip that
object, record a `KtxScanWarning` (object name, the error message, and any
schema/catalog qualifier; `recoverable: true`), and continue with the remaining
objects. No single object may abort the scan.

- The contract holds in **both** phases: the mandatory metadata read *and* any
  profiling/row-count/sample read performed during introspection.
- It holds for **all seven Node connectors**
  (`packages/cli/src/connectors/<driver>/`) and the **Python daemon** postgres
  path (R6).
- The semantics are defined once (the shared helper + warning code from the
  Design section) and every connector routes through them. Do not inline a
  divergent per-driver copy.
- Warnings **MUST NOT** carry secrets or full SQL bodies; record the object
  identifier and the database's error text, redacted through the existing
  `redactKtxSensitiveMetadata` path that `warnings.json` already uses.

### R2 — Surface, don't hide

Skipped objects **MUST** be reported both at ingest time and in the durable
status view:

- **Ingest summary.** The `ktx ingest` run summary (human-facing output) reports
  a count plus the object name and a short reason for each skip — e.g.
  `Skipped 1 object — emp_hire_periods_with_name: no such column ehp.start_date`.
- **Run report.** Object skips land in the run report's `warnings.json` artifact
  (already written) and in the persisted report body (`IngestReportBody`), whose
  natural home is the existing `fetch?: SourceFetchReport` field — the fetch
  phase *is* introspection.
- **`ktx status`.** `ktx status` shows a per-connection skipped-objects line for
  the connection's latest ingest — e.g. `oracle_sql: 1 object skipped —
  emp_hire_periods_with_name: no such column ehp.start_date`. This is **derived
  from the latest persisted report, not new persisted state**: the report body
  is already stored whole as a JSON blob (`local_ingest_reports.body_json`), so
  surfacing it requires **no `.ktx/db.sqlite` schema migration** — `status`
  reads and renders the skip info already present in the latest report body. A
  connection whose latest ingest skipped nothing shows no such line.

### R3 — Failure semantics (partial vs total)

Per-object skipping is **unconditional** — there is **no new config knob**, and
the existing `ingest.workUnits.failureMode` (which governs the later LLM
work-unit stage, not introspection) is untouched and orthogonal. Outcomes are
derived from object counts, not from a mode:

| Scope | Objects discovered / matched | Introspection outcome | Result |
| --- | --- | --- | --- |
| none | 0 | n/a (legitimately empty DB) | **success**, empty layer |
| none | N > 0 | ≥ 1 succeeds | **success** + warnings for the rest |
| none | N > 0 | all N fail | **connection failure** (clear error) |
| `enabled_tables` | matches 0 objects | n/a | **clear scope error** (R5) |
| `enabled_tables` | matches M > 0 | ≥ 1 succeeds | **success** + warnings |
| `enabled_tables` | matches M > 0 | all M fail | **connection failure** |

- "Connection failure" means the connector / `fetch()` raises a **clear,
  actionable error** for that connection. It **MUST NOT** surface as the generic
  `did not recognize fetched source output` (that message is reserved for a
  genuinely unrecognized staged dir, not an empty/total-failure result).
- A total failure of one connection follows existing per-connection ingest
  orchestration for whether sibling connections continue; this spec does not
  change cross-connection behavior.
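
Because the R3 outcomes depend only on counts, the table above reduces to a small pure function. The sketch below is one way to write it; `ScanCounts`, `ScanOutcome`, and `decideScanOutcome` are hypothetical names, not the shipped post-check.

```typescript
type ScanOutcome = "success" | "connection_failure" | "scope_error";

interface ScanCounts {
  // True when a non-empty enabled_tables allowlist is configured.
  scoped: boolean;
  // Objects discovered (unscoped) or matched by the allowlist (scoped).
  matched: number;
  // Objects that were introspected without error.
  succeeded: number;
}

function decideScanOutcome({ scoped, matched, succeeded }: ScanCounts): ScanOutcome {
  // Nothing to introspect: an empty database is fine, a zero-match allowlist is not.
  if (matched === 0) return scoped ? "scope_error" : "success";
  // At least one healthy object is a success; the rest ride along as warnings.
  return succeeded >= 1 ? "success" : "connection_failure";
}
```

Keeping this as a function of counts is what lets one post-check serve every driver: no connector needs its own notion of partial versus total failure.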

### R4 — A broken view never blocks base tables

A broken view **MUST NEVER** prevent base-table ingest.

- View introspection failures are isolated exactly like any other object (R1).
- Mandatory introspection **MUST** prefer reading an object's structure from the
  catalog where possible over executing the object's body, and **MUST NOT** run
  a data-reading query (row count, sample) against a view as a required step.
  (sqlite already skips `COUNT(*)` for views; the remaining gap is isolating the
  metadata read that executes the view definition.)

### R5 — `enabled_tables` allowlist works

The documented allowlist escape hatch **MUST** reliably restrict the scan to the
listed objects, with no spurious adapter error:

- **sqlite qualification.** The schema-qualified form `"main.<name>"` **MUST**
  resolve to the same object as the bare form `"<name>"` (sqlite's sole schema
  is `main`; the connector emits `db: null`). Both forms select the object;
  neither silently matches nothing.
- **Documented format.** The accepted qualification forms for each driver
  (`catalog.db.name` / `db.name` / `name`) and the sqlite-specific `main`
  equivalence **MUST** be documented where `enabled_tables` is described
  (`context/project/driver-schemas.ts` and the user-facing config docs).
- **Zero-match is a clear error.** A non-empty `enabled_tables` that resolves to
  **zero** matched objects **MUST** fail with an actionable error naming the
  connection, the unmatched entries, and the available object names — **not** the
  generic `did not recognize fetched source output`. This is distinct from a
  legitimately empty database (R3 row 1) and from a matched-but-all-broken scope
  (R3 last row).
- **Any subset works.** An `enabled_tables` matching M > 0 objects ingests
  **exactly** those M objects (minus any that fail per R1), with no adapter
  recognition error regardless of how small or edge-case the set is.

### R6 — Python daemon parity

The daemon's postgres introspection path **MUST** honor the same contract:

- Add a `warnings` field to `DatabaseIntrospectionResponse`
  (`python/ktx-daemon/src/ktx_daemon/database_introspection.py`) carrying the
  same shape Node expects (code, message, object identifier, recoverable).
- Isolate per-object failures in the daemon's introspection so one broken object
  does not abort the response; apply the R3 total-failure rule there too.
- Map daemon warnings into `KtxSchemaSnapshot.warnings` in
  `mapDaemonSnapshot` (`context/ingest/adapters/live-database/daemon-introspection.ts`),
  which currently drops them.
- The Node and Python warning shapes **MUST** stay in parity (the codebase
  already mirrors Node↔Python schemas for telemetry; follow the same discipline
  so the daemon cannot emit a code Node can't render).

## Acceptance criteria

- Ingesting a sqlite DB with one broken view + N healthy tables yields a
  semantic layer for the N healthy tables and **exactly one** warning naming the
  broken view and its error; exit is **success**.
- The skipped object appears in the `ktx ingest` summary output, in the run's
  `warnings.json`, and in `ktx status` as a per-connection skipped-objects line
  on the connection's latest ingest.
- A sqlite DB in which **every** discovered object fails introspection (and the
  file opens) exits as a **connection failure** with a clear error — not an
  empty "success" and not `did not recognize fetched source output`.
- A genuinely empty sqlite DB (zero objects) exits **success** with an empty
  layer (not a failure).
- `enabled_tables: ["main.customers"]` and `enabled_tables: ["customers"]` both
  ingest exactly the `customers` object on a sqlite connection.
- `enabled_tables` restricted to a valid subset of M objects ingests exactly
  that subset, with **no** adapter-output error.
- `enabled_tables` that matches zero objects fails with an error naming the
  connection, the unmatched entries, and available objects — distinguishable
  from the empty-DB and all-broken cases.
- A broken view does not prevent ingest of base tables in the same connection
  (regression test with a view that errors on read alongside a healthy table).
- The daemon's `DatabaseIntrospectionResponse` carries a `warnings` array, and a
  per-object failure in the daemon path produces a warning mapped into
  `KtxSchemaSnapshot.warnings` (Node↔Python parity test).
- A warehouse-driver object whose profiling/sample read fails is skipped with a
  warning and does not abort introspection of its siblings.
- Existing healthy-only ingests (no broken objects, no `enabled_tables`) behave
  identically before/after — no warnings, same semantic layer.

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
the design.

- **Shared semantics:** `context/scan/constraint-discovery.ts`
  (`tryConstraintQuery` / `constraintDiscoveryWarning` — the precedent to mirror
  for the per-object helper), `context/scan/types.ts`
  (`KtxSchemaSnapshot.warnings`, `KtxScanWarning`, `KtxScanWarningCode` — add the
  new object-skip code here).
- **Node connectors:** `packages/cli/src/connectors/<driver>/connector.ts` and
  each `live-database-introspection.ts`. sqlite's loop is
  `connectors/sqlite/connector.ts` `introspect` (≈ line 158) → `readTable`
  (≈ line 306); the missing try/catch is the `rawTables.map(...)` at ≈ line 171.
  Existing per-table sample isolation precedent: `description-generation.ts`
  (≈ line 867, `sampling_failed`).
- **Driver dispatch:** `packages/cli/src/local-adapters.ts` (≈ lines 122-156)
  routes every driver to its Node connector; the daemon is the `else` fallback.
- **`enabled_tables` matching:** `context/scan/enabled-tables.ts`
  (`resolveEnabledTables`, `parseDottedTableEntry`), `context/scan/table-ref.ts`
  (`scopedTableNames`, the `ref.db !== wantDb` filter ≈ line 47),
  `context/project/driver-schemas.ts` (`enabled_tables` schema + description).
- **Staging / detect / error surface:**
  `context/ingest/adapters/live-database/stage.ts`
  (`writeLiveDatabaseSnapshot`, `warningArtifact` ≈ line 94,
  `detectLiveDatabaseStagedDir` ≈ line 138),
  `context/ingest/local-stage-ingest.ts` (the
  `did not recognize fetched source output` throw ≈ line 291 — must stop being
  the surface for empty-scope and total-failure).
- **Ingest summary:** `packages/cli/src/ingest.ts` (`writeReportStatus`
  ≈ line 202), `context/ingest/memory-flow/summary.ts`
  (`formatMemoryFlowFinalSummary`) — thread object skips into the human-facing
  summary.
- **Report body + `ktx status`:** `context/ingest/reports.ts` (`IngestReportBody`;
  `SourceFetchReport` as the home for scan warnings),
  `context/ingest/sqlite-local-ingest-store.ts` (the report body is persisted
  whole as `body_json` ≈ line 90 — no migration needed), `status-project.ts`
  (`buildLocalStatsStatus` reads `local_ingest_reports`; parse the latest body
  per connection and render the skipped line via `renderLocalStatsAsLines`).
- **Daemon path:** `python/ktx-daemon/src/ktx_daemon/database_introspection.py`
  (`DatabaseIntrospectionResponse` ≈ line 165, `introspect_database_response`
  ≈ line 323, `_load_postgres_rows` ≈ line 227, `_map_rows_to_tables`
  ≈ line 267), and the Node mapping in
  `context/ingest/adapters/live-database/daemon-introspection.ts`
  (`mapDaemonSnapshot` ≈ line 209).

## Benchmark context (motivation only)

`oracle_sql` (8 of the 135 local sqlite questions) currently has **no** semantic
layer because of its one broken view, so those questions fall back to raw
`sql`-tool introspection instead of ktx's enriched context. Tolerant scanning
restores enriched context for that database. The same robustness is required for
the full Spider 2.0-Lite run across BigQuery and Snowflake, where broken or
permission-restricted objects are common and a single one must not zero out a
warehouse's context.

## Implementation notes

Shipped on branch `write-feature-spec-wiki`. All requirements implemented;
verified with `pnpm --filter @kaelio/ktx run test` (2981 passing),
`pnpm run dead-code`, `uv run pytest python/ktx-daemon/tests` (97 passing),
`uv run pre-commit`, and `pnpm run build && pnpm run link:dev`.

**Shared semantics (R1).** New `context/scan/object-introspection.ts` exposes
`tryIntrospectObject(ctx, fn)` (sibling of `tryConstraintQuery`), returning
`{ ok: true, table } | { ok: false, warning }` and building an
`object_introspection_failed` warning (object name + redactable DB error). It
rethrows native programming faults (`isNativeProgrammingFault`) so a ktx bug is
never masked as an object skip. The new warning code was added to
`KtxScanWarningCode` (`scan/types.ts`), the `scanWarningCodes` allowlist
(`local-structural-artifacts.ts`, plus a new exported `isKtxScanWarningCode`
validator), and `describeWarningGroup` (`scan.ts`).

**Per-object isolation, where it actually exists (R1/R4).** Only sqlite
(`readTable` via `PRAGMA`) and bigquery (`tableRef.get()` per dataset) do
per-object reads during *mandatory* introspection; both now route each object
through `tryIntrospectObject`. The other five Node connectors (postgres, mysql,
clickhouse, sqlserver, snowflake) read metadata in bulk from the catalog/
`information_schema` (already object-safe at this phase) and isolate per-object
profiling/sampling in the enrichment phase (`description-generation.ts`,
`sampling_failed`), so no divergent per-driver try/catch was added there. sqlite
also tolerates a `COUNT(*)` (profiling) failure without dropping a
structurally-readable table, and a broken view's metadata read is isolated so it
never blocks base tables (R4).

**Single-source outcome decision (R3/R5).** New
`adapters/live-database/scan-outcome.ts#assertLiveDatabaseScanOutcome` runs once
in `LiveDatabaseSourceAdapter.fetch()` — the one path every driver (and the
daemon) routes through — and derives the outcome from the snapshot + scope:
≥1 object → success (skips ride along as warnings); all matched objects failed →
clear `KtxExpectedError`; non-empty `enabled_tables` matched nothing → clear
zero-match error naming the connection, the requested entries, and the available
objects (sqlite/bigquery attach the discovered inventory via
`metadata.discovered_object_names`); empty database (no scope) → success with an
empty layer. `detectLiveDatabaseStagedDir` no longer requires table files, so a
valid empty staging is recognized; total-failure/zero-match now throw a clear
connection error before staging instead of surfacing the generic
`did not recognize fetched source output`.

**`enabled_tables` matching (R5).** Normalized at the scope boundary in
`resolveEnabledTables` using `connection.driver`: for sqlite, `main.<name>` →
`{ db: null }`, so `"main.customers"` and `"customers"` select the same object.
`table-ref.ts` stayed generic. Documented in `driver-schemas.ts` and
`docs-site/.../configuration/ktx-yaml.mdx`.
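
The sqlite normalization described above can be pictured as a single rewrite applied before scope matching. This is only a sketch of the idea: `ScopeRef` and `normalizeScopeRef` are hypothetical names, and the actual logic inside `resolveEnabledTables` is not shown in this spec.

```typescript
// Simplified reference shape; catalog handling is omitted.
interface ScopeRef {
  db: string | null;
  name: string;
}

// For sqlite the sole schema is `main` and the connector emits `db: null`,
// so a `main`-qualified entry is rewritten to the bare form before matching.
// Other drivers are passed through unchanged.
function normalizeScopeRef(driver: string, ref: ScopeRef): ScopeRef {
  if (driver === "sqlite" && ref.db === "main") {
    return { ...ref, db: null };
  }
  return ref;
}
```

Doing the rewrite at the scope boundary keeps the generic `ref.db !== wantDb` filter untouched, which is the design choice the note records.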

**Surfacing (R2).** Deviation from the spec's orientation: live-database schema
ingest runs through the **stage-only** path (`runLocalStageOnlyIngest` →
`local_ingest_reports`), not the bundle runner, so the home for scan warnings is
`LocalIngestRunRecord.fetch` (a new `SourceFetchReport` field; `body_json` is
persisted whole, so **no migration**), not the bundle-only
`IngestReportBody.fetch`. Both ingest paths read `adapter.readFetchReport`
(`live-database/fetch-report.ts` derives skips from the existing `warnings.json`).
The ingest summary is already rendered by `runKtxScan` from `report.warnings`
(the new `describeWarningGroup` case), and `ktx status`
(`status-project.ts#buildLocalStatsStatus`/`renderLocalStats`) now parses the
latest report body per connection and prints a per-connection
`N object(s) skipped — name: reason` line.

**Daemon parity (R6).** `database_introspection.py` adds a `warnings` field to
`DatabaseIntrospectionResponse` and a `DatabaseIntrospectionWarning` model,
isolates per-object failures in `_map_rows_to_tables`, and shares the
`OBJECT_INTROSPECTION_FAILED_CODE = "object_introspection_failed"` string with
Node. `mapDaemonSnapshot` maps `raw.warnings` into `KtxSchemaSnapshot.warnings`,
dropping any code Node cannot render (validated via `isKtxScanWarningCode`).
Deviation: the daemon does **not** re-enforce the R3 total-failure rule — the
shared Node post-check (`assertLiveDatabaseScanOutcome`) owns it for every driver
including the daemon, avoiding a divergent second implementation. Parity is
covered by a Node test (daemon-shaped warning round-trips) and a pytest
(per-object failure → warning with the shared code).
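
The "drop any code Node cannot render" step above is a type-guard filter over the daemon's warnings. A minimal sketch, assuming a one-element allowlist and a `table` field for the object identifier; `isKnownCode` and `mapDaemonWarnings` are hypothetical stand-ins for `isKtxScanWarningCode` and the warning part of `mapDaemonSnapshot`.

```typescript
// Hypothetical allowlist; the real KtxScanWarningCode enum has more members.
const KNOWN_CODES = ["object_introspection_failed"] as const;
type KnownCode = (typeof KNOWN_CODES)[number];

// The daemon side is untrusted input, so its code is a plain string.
interface DaemonWarning {
  code: string;
  table: string;
  message: string;
  recoverable: boolean;
}

interface NodeWarning {
  code: KnownCode;
  table: string;
  message: string;
  recoverable: boolean;
}

function isKnownCode(code: string): code is KnownCode {
  return (KNOWN_CODES as readonly string[]).includes(code);
}

// Keeps only warnings whose code Node can render; unknown codes are dropped.
function mapDaemonWarnings(raw: DaemonWarning[] | undefined): NodeWarning[] {
  const mapped: NodeWarning[] = [];
  for (const warning of raw ?? []) {
    if (isKnownCode(warning.code)) {
      mapped.push({ ...warning, code: warning.code });
    }
  }
  return mapped;
}
```

Treating a missing `warnings` field as an empty list keeps the mapping tolerant of an older daemon that does not send the field yet; whether the shipped code does the same is not stated here.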
@ -1,363 +0,0 @@
# Add universal SQL-authoring craft to the ktx-analytics skill

> Refined spec. Intake draft: `todo/07-analytics-skill-sql-craft.md`.

## Problem

The shipped `ktx-analytics` skill
(`packages/cli/src/skills/analytics/SKILL.md`) is an *orchestration* guide: its
`<workflow>` and `<rules>` tell the agent **which ktx tools to call and in what
order** (`discover_data` → `entity_details`/`sl_read_source` →
`sl_query`/`sql_execution` → validate → `memory_ingest`). It says almost nothing
about **writing correct SQL**.

That gap shows up as a specific failure shape: the agent reliably produces
*runnable* SQL but *wrong* results. The recurring defects are universal
analytics-engineering mistakes, not ktx-specific ones:

- comparing a string column to a numeric literal (or vice versa), which can
  silently match zero rows;
- rounding inside intermediate CTEs, so the final number is off;
- ranking/“first”/“most recent” windows with no deterministic tie-breaker, so
  results flicker run to run;
- filtering *before* a window function for sequence/“since”/“first” questions,
  truncating the partition the window should see;
- returning a full ranked list for a “top/highest” question, or collapsing a
  “per X” question to a single value;
- dropping the inputs (or the entity identifier) a derived value was built from.
|
||||
|
||||
These are correctness defects every ktx user hits on a live database. They
|
||||
belong in the shipped skill — fixing them once improves ktx for everyone, rather
|
||||
than living in any individual caller’s prompt.
|
||||
|
||||
## Generic use case
|
||||
|
||||
An analyst (human or agent) points ktx at a **live, production** database and
|
||||
asks a real analytical question — “what’s the most recent order per customer”,
|
||||
“top region by margin”, “average order value by month”. The schema is unfamiliar
|
||||
(unknown date encodings, nullable join keys, string-typed numeric columns), the
|
||||
question carries grain and ranking intent in its wording, and the answer must be
|
||||
*correct and deterministic*, not merely executable. The skill should encode the
|
||||
analytics-engineering craft that makes the difference between a query that runs
|
||||
and a query that’s right — independent of any benchmark.
|
||||
|
||||
## Model
|
||||
|
||||
The change is **additive content in one Markdown file**, governed by these
|
||||
invariants. They constrain the implementer; the exact prose is theirs.
|
||||
|
||||
### Inline-only delivery (this is a hard constraint, not a style preference)
|
||||
|
||||
All new guidance lives **inside `skills/analytics/SKILL.md`**. A bundled
|
||||
`reference/*.md` file (the progressive-disclosure pattern Anthropic’s
|
||||
skill-authoring guide recommends for large skills) **MUST NOT** be used here,
|
||||
because the delivery mechanism ships only `SKILL.md`:
|
||||
|
||||
- `setup-agents.ts` installs the analytics skill via `readAnalyticsSkillContent()`,
|
||||
which reads **only** `./skills/analytics/SKILL.md` and writes a **single** file
|
||||
per target: `.claude/skills/ktx-analytics/SKILL.md` (Claude Code), the Codex /
|
||||
universal `.agents` equivalent, a **flattened** single rules file for Cursor
|
||||
(`.cursor/rules/ktx-analytics.mdc`) and OpenCode
|
||||
(`.opencode/commands/ktx-analytics.md`), and a Claude Desktop **zip that
|
||||
contains only `ktx-analytics/SKILL.md`** (`writeClaudeDesktopSkillBundle`).
|
||||
- Nothing copies sibling files or subdirectories. A reference file would dangle
|
||||
on every target, and the Cursor/OpenCode flatten-to-one-file shape cannot
|
||||
represent a multi-file skill at all.
|
||||
|
||||
The skill is small enough that inline costs nothing meaningful: ~67 lines today
|
||||
plus ~60 of craft is well under the 500-line budget. And this craft is **core
|
||||
content** — consulted on every SQL-authoring turn — so even if multi-file delivery
|
||||
existed it would still belong inline: progressive disclosure only pays off for
|
||||
large, *conditionally-relevant* reference material loaded on demand, not for
|
||||
always-needed craft.
|
||||
|
||||
Multi-file skill *delivery* is a legitimate future enhancement, but it must be
|
||||
**pulled by a concrete need, not built ahead of one** — no shipped skill today
|
||||
exceeds the budget (largest is ~346 lines) or uses a bundled reference. The first
|
||||
real trigger is the **per-dialect SQL syntax follow-up**
|
||||
(`todo/08-per-dialect-sql-syntax-notes.md`), whose load-on-demand
|
||||
`reference/<dialect>.md` content is a genuine progressive-disclosure fit. When
|
||||
that work is scoped, note that multi-file delivery is **not** a simple directory
|
||||
copy: `setup-agents.ts` flattens the skill to a *single* file for Cursor
|
||||
(`.mdc`) and OpenCode (`.md`), so those targets need a concatenation transform,
|
||||
and uninstall needs per-file manifest entries. The constraint is recorded here so a
|
||||
future implementer does not “improve” this inline content into a bundled
|
||||
reference that dangles on every target.
|
||||
|
||||
### Heuristics with a generic *why*, not a wall of MUSTs
|
||||
|
||||
The new rules are phrased as **heuristics with a one-line, universal rationale**,
|
||||
because SQL authoring is a high-freedom task (many valid approaches, choice
|
||||
depends on the question and the data). A bare imperative overfits; a rule plus
|
||||
its *why* lets the model apply judgment and generalize. This follows Anthropic’s
|
||||
own skill-authoring guidance (“if you find yourself writing ALWAYS/NEVER in all
|
||||
caps or rigid structures, reframe and explain the reasoning”).
|
||||
|
||||
This **reconciles the draft’s “behavior only, no rationale” instruction**: the
|
||||
prohibition is specifically on rationale that references a **grader, gold answer,
|
||||
or the benchmark**. *Generic analytics-engineering rationale is required* — e.g.
|
||||
“…so `RANK`/`ROW_NUMBER` results don’t flicker across runs”, “…a string-vs-number
|
||||
compare can silently match nothing”. That is a universal truth, not a
|
||||
grader reference.
|
||||
|
||||
### Dialect-agnostic
|
||||
|
||||
Every rule must read correctly on any SQL dialect a ktx connection might use.
|
||||
**No dialect-specific syntax**: not `QUALIFY` (a non-standard extension found on only some engines, e.g. Snowflake, BigQuery, DuckDB),
|
||||
not `strftime`/`julianday` (sqlite), not backtick/`DB.SCHEMA.TABLE` FQTNs.
|
||||
Per-dialect syntax notes are a **separate follow-up** living in a dialect-aware
|
||||
(per-driver) location, explicitly out of scope here.
|
||||
|
||||
### Discovery craft attaches to discovery; authoring craft to query/validate
|
||||
|
||||
Two of the draft’s rules (inspect sample rows; cast before comparing) are
|
||||
*schema-discovery* concerns that happen **before** SQL is composed. They belong
|
||||
with the discovery steps of the existing workflow, not only at the query step.
|
||||
The rest (composition, window correctness, precision, completeness) belong with
|
||||
the query/validate steps. The draft’s “extend step 5/6” is the right home for
|
||||
most rules but is slightly off for the discovery pair; this spec corrects that.
|
||||
|
||||
### Additive only
|
||||
|
||||
The existing `<workflow>`, `<rules>`, and `<examples>` — compact result tables,
|
||||
summaries, clarification prompts, the tool-order workflow, the `connectionId`
|
||||
scoping rules — are preserved unchanged. The skill must still read well for an
|
||||
interactive, human-facing analysis session.
|
||||
|
||||
## Requirements
|
||||
|
||||
### 1. Placement and structure
|
||||
|
||||
Add a dedicated, scannable craft section to `SKILL.md`:
|
||||
|
||||
- A new top-level block — `<sql_craft>` (sibling to `<workflow>`/`<rules>`) — with
|
||||
**five sub-headings**: *Schema discovery*, *Composition*, *Window functions*,
|
||||
*Numeric precision*, *Answer completeness*. Sub-headings keep the block
|
||||
scannable (the draft’s “group under clear sub-headings” goal).
|
||||
- **Pointers, not duplication.** Step 5 (“Query”) and step 6 (“Validate and
|
||||
explain”) each gain a **one-line pointer** into `<sql_craft>` rather than
|
||||
inlining the rules (state each rule once; Anthropic’s “consistent terminology /
|
||||
don’t repeat” guidance). The schema-discovery pair is additionally reflected as
|
||||
a brief cue in the discovery steps (step 2 “Inspect” / step 4 “Plan”), pointing
|
||||
to the same block.
|
||||
- No new tool, flag, or config. This is content only.
|
||||
|
||||
### 2. The craft rules (all fourteen behaviors, grouped)
|
||||
|
||||
Every behavior from the intake draft must be represented. Tightly-related ones
|
||||
**may** be merged into a single bullet where that reads better; none may be
|
||||
dropped. Each carries a generic *why* (per Model). Dialect-agnostic throughout.
|
||||
|
||||
**Schema discovery** (cue in steps 2/4; lives in `<sql_craft>`)
|
||||
1. Inspect representative **sample rows** of each table before composing SQL —
|
||||
confirm date/time encoding (`YYYYMMDD` vs ISO vs epoch), null prevalence in
|
||||
join/filter keys, and the real set of categorical/enum values
|
||||
(`entity_details` + a small `sql_execution` sample). *Why:* assumptions about
|
||||
encoding and nullability are the most common source of silently-wrong filters.
|
||||
2. **Cast a column to its real type before comparing** it in `WHERE`/`JOIN`. A
|
||||
string column compared to a numeric literal (or vice versa) can silently match
|
||||
nothing.
|
||||
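As an illustration of rule 2 for readers of this spec (not text for the skill), the sketch below uses Python's standard-library `sqlite3` purely as a convenient runner. The `stores` table and its zero-padded `store_code` values are hypothetical. How an engine coerces a string-vs-number comparison varies: some return no rows, as here, and others raise a type error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: a numeric-looking code stored as zero-padded text.
conn.execute("CREATE TABLE stores (store_code TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO stores VALUES (?, ?)",
    [("007", "Tallinn"), ("042", "Riga")],
)

# Naive filter: the text column is compared to a numeric literal.
# On SQLite the literal is compared as the text '7', which never equals '007'.
naive = conn.execute(
    "SELECT city FROM stores WHERE store_code = 7"
).fetchall()

# Cast the column to the type the question is really about, then compare.
cast = conn.execute(
    "SELECT city FROM stores WHERE CAST(store_code AS INTEGER) = 7"
).fetchall()

print(naive)  # []
print(cast)   # [('Tallinn',)]
```

The naive query runs without error and returns nothing, which is exactly the "runnable but wrong" shape the Problem section describes.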
|
||||
**Composition**
|
||||
3. Build complex queries **incrementally** — one CTE at a time, verifying each
|
||||
layer’s output on a small sample before stacking the next. *Why:* a wrong
|
||||
intermediate layer is far cheaper to catch early than to debug in the final
|
||||
result.
|
||||
4. **Avoid fan-out joins.** Add columns only from tables already at the target
|
||||
grain, or **pre-aggregate** to that grain before joining. *Why:* a join that
|
||||
multiplies rows quietly inflates every downstream `SUM`/`COUNT`.
|
||||
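A minimal sketch of the fan-out hazard in rule 4, again using `sqlite3` only as a runner for standard SQL. The `orders` and `shipments` tables are hypothetical; the point is that `orders` is at order grain while `shipments` has several rows per order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER, weight REAL);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO shipments VALUES (10, 1, 2.0), (11, 1, 3.0), (12, 2, 1.0);
    """
)

# Fan-out: order 1 has two shipments, so its amount is counted twice.
fanned_out = conn.execute(
    """
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
    """
).fetchone()[0]

# Pre-aggregate shipments to order grain, then join one row per order.
pre_aggregated = conn.execute(
    """
    WITH shipment_totals AS (
        SELECT order_id, SUM(weight) AS total_weight
        FROM shipments
        GROUP BY order_id
    )
    SELECT SUM(o.amount), SUM(t.total_weight)
    FROM orders o
    JOIN shipment_totals t ON t.order_id = o.order_id
    """
).fetchone()

print(fanned_out)      # 250.0 (inflated; the true total is 150.0)
print(pre_aggregated)  # (150.0, 6.0)
```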
|
||||
**Window functions**
|
||||
5. Give every ranking/ordering window function a **complete, deterministic
|
||||
tie-breaker** (append unique key columns to `ORDER BY`), so
|
||||
`RANK`/`ROW_NUMBER`/`LAG` are stable rather than flickering across runs.
|
||||
6. For sequence / “first” / “most recent” / “since” questions, **filter after the
|
||||
window**, not before: compute over the full partition, then keep the rows you
|
||||
want. *Why:* a pre-filter shrinks the partition the window ranks over, so
|
||||
“first”/“most recent” is computed against the wrong set. (See the worked
|
||||
example, requirement 3.)
|
||||
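A small sketch of the tie-breaker in rule 5, under the assumption of a hypothetical `orders` table in which one customer has two orders on the same date. It needs an SQLite build with window-function support (3.25 or later); the SQL itself is standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id TEXT, ordered_at TEXT);
    INSERT INTO orders VALUES
        (1, 'a', '2024-03-01'),
        (2, 'a', '2024-03-01'),  -- ties with order 1 on ordered_at
        (3, 'b', '2024-02-10');
    """
)

# ordered_at alone cannot separate orders 1 and 2, so the unique key
# order_id is appended to make the ranking fully determined.
latest = conn.execute(
    """
    WITH ranked AS (
        SELECT
            customer_id,
            order_id,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY ordered_at DESC, order_id DESC
            ) AS rn
        FROM orders
    )
    SELECT customer_id, order_id
    FROM ranked
    WHERE rn = 1
    ORDER BY customer_id
    """
).fetchall()

print(latest)  # [('a', 2), ('b', 3)]
```

Without `order_id` in the window's `ORDER BY`, either order 1 or order 2 could legally be ranked first for customer `a`, and the engine is free to pick differently between runs.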
|
||||
**Numeric precision**
|
||||
7. Compute at **full precision; round only in the final projection**, never inside
|
||||
intermediate CTEs.
|
||||
8. Be **explicit about truncation**: an integer cast such as `CAST(x AS INT)`
|
||||
truncates on some engines and rounds on others, so use an explicit rounding or
|
||||
truncation function for whichever is intended. (May merge with rule 7.)
|
||||
9. Distinguish **macro vs micro averages** based on the question’s wording:
|
||||
“average of per-group averages” = `AVG(group_metric)`; “overall/weighted
|
||||
average” = `SUM(numerator)/SUM(denominator)`.
|
||||
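The difference between rules 7 and 9 and their failure modes is easy to see with numbers. The sketch below is illustrative only: `region_stats` and `fees` are hypothetical tables, and `sqlite3` is just a runner for standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical per-region rollup and a list of small fees.
    CREATE TABLE region_stats (region TEXT, revenue REAL, order_count INTEGER);
    INSERT INTO region_stats VALUES ('A', 100.0, 1), ('B', 60.0, 3);
    CREATE TABLE fees (fee REAL);
    INSERT INTO fees VALUES (0.4), (0.4), (0.4);
    """
)

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Macro average: average of the per-region averages (100 and 20).
macro_avg = scalar("SELECT AVG(revenue / order_count) FROM region_stats")
# Micro (weighted) average: total revenue over total orders (160 / 4).
micro_avg = scalar("SELECT SUM(revenue) / SUM(order_count) FROM region_stats")

# Rounding each value before aggregating loses every fee.
rounded_early = scalar("SELECT SUM(ROUND(fee)) FROM fees")
# Aggregating at full precision and rounding once keeps the total.
rounded_late = scalar("SELECT ROUND(SUM(fee)) FROM fees")

print(macro_avg, micro_avg)         # 60.0 40.0
print(rounded_early, rounded_late)  # 0.0 1.0
```

Both averages are valid answers to different questions, which is why the rule keys the choice to the question's wording rather than prescribing one form.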
|
||||
**Answer completeness / interpretation**
|
||||
10. “top / highest / most / lowest” → return only the **winning row(s)** (keep the
|
||||
top-ranked row via the window result), not the full ranked list, unless a list
|
||||
is asked for. *(Phrase the mechanism dialect-agnostically — do not name
|
||||
`QUALIFY`.)*
|
||||
11. “for each X / per X / by X” → **exactly one row per X**; don’t collapse to a
|
||||
single value unless the question says “overall” or “total across X”.
|
||||
12. When a question asks for inputs and a derived value (“X, Y, and their ratio”),
|
||||
**include the inputs as columns** alongside the derived value.
|
||||
13. When grouping by a human-readable label (a name), also **expose the entity’s
|
||||
identifier** — identity, not just the label, is part of the result (and
|
||||
disambiguates duplicate names).
|
||||
14. When a result is **unexpectedly empty, relax filters one at a time** to find
|
||||
which predicate removed the rows. *Why:* this is the validation feedback loop
|
||||
that turns a silent empty result into a diagnosable one.
|
||||
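Rule 14's feedback loop can be sketched as a small diagnostic: drop one predicate at a time and see which removal brings rows back. The table, the predicates, and the helper `count_rows` are all hypothetical; an agent would run the equivalent counts through `sql_execution`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, region TEXT, ordered_at TEXT);
    INSERT INTO orders VALUES
        (1, 'shipped',   'emea', '2025-02-01'),
        (2, 'shipped',   'emea', '2025-03-01'),
        (3, 'cancelled', 'apac', '2025-01-10');
    """
)

# The filters of the query that came back empty, named for reporting.
predicates = {
    "status": "status = 'shipped'",
    "region": "region = 'EMEA'",  # data stores lower-case 'emea'
    "date": "ordered_at >= '2025-01-01'",
}

def count_rows(active):
    """Count rows matching every predicate in `active` (all rows if empty)."""
    where = " AND ".join(active) if active else "1 = 1"
    return conn.execute(f"SELECT COUNT(*) FROM orders WHERE {where}").fetchone()[0]

all_filters = count_rows(list(predicates.values()))
# Row count with each predicate removed in turn, the others kept.
relaxed = {
    name: count_rows([sql for other, sql in predicates.items() if other != name])
    for name in predicates
}

print(all_filters)  # 0
print(relaxed)      # {'status': 0, 'region': 2, 'date': 0}
```

Only removing the `region` predicate restores rows, which points at the enum-value mismatch that sample-row inspection (rule 1) would also have caught.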
|
||||
### 3. One worked example (dialect-agnostic)
|
||||
|
||||
Add **exactly one** compact before/after example to the skill, demonstrating the
|
||||
**window-then-filter** rule (rule 6) — the subtlest and highest-value of the set.
|
||||
It shows the wrong shape (filter inside, then rank) and the right shape (rank over
|
||||
the full partition in a CTE, then filter to the top rank in the outer query),
|
||||
using generic table/column names and standard SQL only (no `QUALIFY`, no
|
||||
dialect functions). Keep it ~6–10 lines. Do not add a second example; the
|
||||
existing three tool-orchestration examples stay as the primary example set.
|
||||
*(Superseded by spec 09: the skill now carries a second `sql` worked example —
|
||||
the multi-hop fan-out case — so the one-example constraint applies to spec 07's
|
||||
window-then-filter example only.)*
|
||||
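For orientation, here is one possible before/after pair of the kind this requirement describes. It is a sketch for readers of the spec, not the example shipped in the skill: the `orders` table and the question ("customers whose most recent order was cancelled") are hypothetical, and `sqlite3` (3.25 or later, for window functions) is only the runner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id TEXT, ordered_at TEXT, status TEXT);
    INSERT INTO orders VALUES
        (1, 'a', '2024-01-01', 'cancelled'),
        (2, 'a', '2024-02-01', 'shipped'),
        (3, 'b', '2024-01-15', 'shipped'),
        (4, 'b', '2024-03-01', 'cancelled');
    """
)

# Wrong shape: the filter runs before the window, so the rank is computed
# over cancelled orders only and every customer with one qualifies.
wrong = conn.execute(
    """
    WITH ranked AS (
        SELECT customer_id,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY ordered_at DESC, order_id DESC
               ) AS rn
        FROM orders
        WHERE status = 'cancelled'
    )
    SELECT customer_id FROM ranked WHERE rn = 1 ORDER BY customer_id
    """
).fetchall()

# Right shape: rank the full partition in the CTE, filter in the outer query.
right = conn.execute(
    """
    WITH ranked AS (
        SELECT customer_id, status,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY ordered_at DESC, order_id DESC
               ) AS rn
        FROM orders
    )
    SELECT customer_id FROM ranked
    WHERE rn = 1 AND status = 'cancelled'
    ORDER BY customer_id
    """
).fetchall()

print(wrong)  # [('a',), ('b',)]
print(right)  # [('b',)]
```

Customer `a`'s latest order was shipped, so only `b` belongs in the answer; the pre-filtered version reports both.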
|
||||
### 4. Explicit exclusions
|
||||
|
||||
None of the following may appear in the skill (they are application/consumer
|
||||
concerns, or actively wrong for live data):
|
||||
|
||||
- **Output-shape contracts** (“return a bare result set with exactly these
|
||||
columns, no prose”). The skill is for interactive analysis and already favors
|
||||
readable tables + summaries; a caller needing a strict shape specifies that
|
||||
itself.
|
||||
- **Anchoring relative time to `MAX(date)` of the data.** On a live database
|
||||
“recent” / “past N months” means relative to *now*; `MAX(date)` anchoring is
|
||||
only valid for static snapshots and must not be baked into the product.
|
||||
- **Any advice justified by a grader, gold answer, or scoring comparator.**
|
||||
- **Dialect-specific syntax** (deferred to the per-driver follow-up).
|
||||
|
||||
### 5. Coordination with spec 03
|
||||
|
||||
`03-multi-connection-routing-in-analytics-skill` also edits this same file (it
|
||||
adds a connection-routing “step 0” to `<workflow>` and threads `connectionId`
|
||||
through the tool calls). Spec 07’s additions are **orthogonal**: they live in a
|
||||
new `<sql_craft>` block and in step 5/6 pointers, and must not rewrite the
|
||||
`<workflow>` routing or the `<rules>` `connectionId` scoping that spec 03 owns.
|
||||
If both land, the result is one coherent skill: routing in `<workflow>`/`<rules>`,
|
||||
SQL craft in `<sql_craft>`.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- The shipped `analytics/SKILL.md` contains all fourteen behaviors above, grouped
|
||||
under the five sub-headings, each phrased as a heuristic with a generic
|
||||
rationale.
|
||||
- **Zero references** to any benchmark, gold answer, grader, or scoring
|
||||
comparator anywhere in the skill.
|
||||
- **Dialect-agnostic:** the skill contains no `QUALIFY`, no `strftime`/`julianday`,
|
||||
no backtick/`DB.SCHEMA.TABLE` FQTN syntax, and no other single-dialect
|
||||
construct — including in the worked example.
|
||||
- The existing interactive guidance is intact: the `<workflow>` steps, the
|
||||
`<rules>` (compact tables, summaries, clarification prompt, `connectionId`
|
||||
scoping), and the three existing examples all still read correctly and were not
|
||||
removed or contradicted.
|
||||
- **None of the excluded items** (output-shape contract, `MAX(date)` anchoring of
|
||||
“recent”, grader-driven advice, dialect syntax) appear.
|
||||
- Exactly **one** new worked example is present, demonstrating window-then-filter,
|
||||
in standard dialect-agnostic SQL. *(Superseded by spec 09, which adds a second
|
||||
`sql` worked example for the multi-hop fan-out case; the shipped skill then
|
||||
contains two worked examples and the content test asserts two `sql` fences.)*
|
||||
- The craft is **inline in `SKILL.md`** — no bundled reference file is introduced,
|
||||
and the skill still installs as a single file through `setup-agents.ts` for all
|
||||
targets (Claude Code, Codex, Cursor, OpenCode, universal, Claude Desktop zip).
|
||||
- The skill stays **scannable and within a reasonable size** (comfortably under
|
||||
the 500-line budget).
|
||||
- The frontmatter (`name`, `description`) is unchanged and still parses through
|
||||
`SkillsRegistryService.parseFrontmatter`.
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
||||
the prose.
|
||||
|
||||
- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md`. Add the
|
||||
`<sql_craft>` block; add one-line pointers in steps 5/6 and a discovery cue in
|
||||
steps 2/4; add the single worked example. Keep `<workflow>`/`<rules>`/`<examples>`
|
||||
otherwise intact.
|
||||
- **Delivery (why inline is mandatory):** `packages/cli/src/setup-agents.ts`
|
||||
(`readAnalyticsSkillContent`, `installTarget`, `writeClaudeDesktopSkillBundle`,
|
||||
`plannedKtxAgentFiles`). Each target gets a single file derived from
|
||||
`SKILL.md`; Cursor/OpenCode flatten to one rules file; Claude Desktop zips only
|
||||
`ktx-analytics/SKILL.md`. No change to `setup-agents.ts` is required by this
|
||||
spec — confirm the skill still installs unchanged.
|
||||
- **Coordination:** `03-multi-connection-routing-in-analytics-skill` edits the
|
||||
same file; keep the changes non-overlapping (see requirement 5).
|
||||
- **Tests:** a content assertion over the shipped `analytics/SKILL.md` is the
|
||||
right level (this is prompt content, not executable logic). Assert the skill
|
||||
text contains the craft sub-headings / representative rule phrases, contains the
|
||||
worked example, and contains none of the banned constructs: the literal tokens
|
||||
`QUALIFY`/`strftime`/`julianday`, grader/benchmark words (`spider`, `benchmark`,
|
||||
`gold`, `grader`), and — checked as a phrase, not a raw `MAX(` grep, since
|
||||
`MAX()` is a legitimate aggregate — any instruction anchoring relative time
|
||||
(“recent”, “past N months”) to the data’s maximum date. The existing
|
||||
`SkillsRegistryService` frontmatter-parse test must still pass. The standalone
|
||||
`ktx-dev` binary should be rebuilt/re-linked (`pnpm run build && pnpm run
|
||||
link:dev`) so the playground picks up the updated skill.
|
||||
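The shape of the banned-construct check can be sketched in a few lines. The real assertion belongs in the package's own test suite; this Python version only illustrates the logic, and the regular expression used for the phrase-level `MAX(date)` guard is an assumption, not the pattern any shipped test uses.

```python
import re

# Literal tokens the skill text must not contain, matched as whole words.
BANNED_TOKENS = ["QUALIFY", "strftime", "julianday", "spider", "benchmark", "gold", "grader"]

# Hypothetical phrase-level guard: a relative-time phrase followed, within the
# same sentence, by a MAX(...) call. A bare MAX( elsewhere is not flagged.
MAX_DATE_ANCHOR = re.compile(
    r"(recent|past \d+ (days|weeks|months|years))[^.\n]*\bMAX\s*\(",
    re.IGNORECASE,
)

def banned_constructs(skill_text):
    """Return the banned constructs found in the given skill text."""
    found = [
        token
        for token in BANNED_TOKENS
        if re.search(rf"\b{re.escape(token)}\b", skill_text, re.IGNORECASE)
    ]
    if MAX_DATE_ANCHOR.search(skill_text):
        found.append("MAX(date) anchoring")
    return found

clean = "Rank over the full partition, then filter. Use MAX(amount) for the largest order."
dirty = "Use QUALIFY rn = 1. For recent orders, anchor to MAX(order_date)."

print(banned_constructs(clean))  # []
print(banned_constructs(dirty))  # ['QUALIFY', 'MAX(date) anchoring']
```

Matching on word boundaries keeps a token such as `gold` from tripping on unrelated words like "golden", and the phrase-level guard leaves legitimate `MAX()` aggregates alone.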
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
On the Spider 2.0-Lite sqlite subset the solver produced **0 execution errors but
|
||||
~50 result mismatches**, and a large share traced to exactly these gaps:
|
||||
premature rounding, string-vs-number compares, non-deterministic window ordering,
|
||||
returning full lists for “top” questions, and dropping the inputs to derived
|
||||
values. These are generic SQL-authoring defects — fixing them in the skill
|
||||
improves ktx for every user querying a live database, and improving the benchmark
|
||||
score is a side effect, not the goal. The skill itself must contain no trace of
|
||||
the benchmark.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Implemented on branch `write-feature-spec-wiki`.
|
||||
|
||||
**What was built**
|
||||
- Added a new `<sql_craft>` block to `packages/cli/src/skills/analytics/SKILL.md`
|
||||
(sibling to `<workflow>`/`<rules>`, placed just before `<examples>`), with the
|
||||
five sub-headings — *Schema discovery before writing SQL*, *Composition*,
|
||||
*Window functions*, *Numeric precision*, *Answer completeness / interpretation* —
|
||||
and a one-line opener framing the bullets as heuristics-with-a-why.
|
||||
- All fourteen behaviors are represented. Rules 7 and 8 (round-at-the-end /
|
||||
truncation) are merged into one "Round only at the end" bullet, as the spec
|
||||
permitted. Each bullet carries a generic analytics-engineering rationale; none
|
||||
references a benchmark, grader, or gold answer.
|
||||
- Exactly one worked example (a fenced `sql` block inside `<sql_craft>`)
|
||||
demonstrates the window-then-filter rule, and incidentally the deterministic
|
||||
tie-breaker: the *wrong* shape filters before the window; the *right* shape
|
||||
ranks the full partition in a CTE, then filters in the outer query. Standard
|
||||
SQL only — no `QUALIFY`, no dialect functions.
|
||||
- Step pointers added without duplicating the rules: a schema-discovery cue in
|
||||
steps 2 and 4, an authoring pointer in step 5, and a validation pointer in
|
||||
step 6, each pointing into `<sql_craft>`.
|
||||
- The existing `<workflow>` / `<rules>` / `<examples>` (compact tables,
|
||||
summaries, clarification prompt, `connectionId` scoping, the three
|
||||
orchestration examples) are unchanged. Delivery is unchanged: still a single
|
||||
`SKILL.md` per target via `readAnalyticsSkillContent`; no bundled `reference/`
|
||||
file was introduced.
|
||||
|
||||
**Tests** — added `packages/cli/test/skills/analytics-skill-content.test.ts`, a
|
||||
content assertion over the source `SKILL.md`: the five sub-headings, a
|
||||
representative phrase for each behavior, exactly one `sql` worked example, the
|
||||
preserved interactive guidance, and the absence of banned constructs
|
||||
(`QUALIFY` / `strftime` / `julianday`, `spider` / `benchmark` / `gold` /
|
||||
`grader`, a backtick three-part FQTN, and a phrase-level guard against anchoring
|
||||
relative time to a `MAX(...)` date). The existing `setup-agents.test.ts` content
|
||||
assertions and the `SkillsRegistryService` frontmatter test still pass (77/77
|
||||
across the three relevant files). Rebuilt and re-linked `ktx-dev`
|
||||
(`pnpm run build && pnpm run link:dev`); the craft block is present in the
|
||||
shipped `dist` asset.
|
||||
|
||||
**Deviations / notes**
|
||||
- The worked example runs ~18 lines including comments rather than the spec's
|
||||
"~6–10"; a faithful before/after with a CTE needs the extra lines, and the
|
||||
skill stays well within budget (~117 lines total).
|
||||
- `pnpm run type-check` currently reports one **pre-existing, unrelated** error
|
||||
in `test/mcp-server-factory.test.ts` (MCP server deps typing), committed on
|
||||
this branch ahead of `origin/main`. The src type-check and `pnpm run build`
|
||||
are green; this change does not touch any MCP file.
|
||||
- Per-dialect SQL syntax stays out of scope here (deferred to
|
||||
`todo/08-per-dialect-sql-syntax-notes.md`), so the skill remains
|
||||
dialect-agnostic. No dialect-tool pointer was added to `SKILL.md` yet — that
|
||||
belongs with spec 08's channel so the skill never references a tool that does
|
||||
not exist.
|
||||
|
|
@ -1,395 +0,0 @@
|
|||
# Per-dialect SQL syntax notes, served on demand and scoped to the connection
|
||||
|
||||
> Refined spec. Intake draft: `todo/08-per-dialect-sql-syntax-notes.md`. Companion
|
||||
> to `specs/07-analytics-skill-sql-craft.md`, which kept the analytics SQL craft
|
||||
> dialect-agnostic and explicitly deferred per-dialect syntax to this spec.
|
||||
|
||||
## Problem
|
||||
|
||||
Spec 07 added universal, **dialect-agnostic** SQL-authoring craft to the
|
||||
`ktx-analytics` skill (`packages/cli/src/skills/analytics/SKILL.md`). That craft
|
||||
deliberately excludes anything that reads correctly on only one engine — no
|
||||
`QUALIFY`, no `strftime`/`julianday`, no backtick or `DB.SCHEMA.TABLE` FQTNs —
|
||||
because the flat skill is installed verbatim and an agent querying sqlite must
|
||||
never see Snowflake syntax.
|
||||
|
||||
But a large share of *real* correctness depends on exactly that excluded,
|
||||
engine-specific syntax:
|
||||
|
||||
- **Snowflake:** `DATABASE.SCHEMA.TABLE` FQTNs, double-quoted case-sensitive
|
||||
identifiers (unquoted folds to upper-case), VARIANT colon-paths
|
||||
(`col:field.sub::type`), `QUALIFY`.
|
||||
- **BigQuery:** backtick FQTNs (`` `project.dataset.table` ``), `_TABLE_SUFFIX`
|
||||
for sharded/wildcard tables, `QUALIFY`, `JSON_VALUE`/`JSON_EXTRACT`.
|
||||
- **sqlite:** `strftime`/`julianday`/`date()` for dates, no `QUALIFY`,
|
||||
`json_extract`.
|
||||
- and the remaining supported engines (`postgres`, `mysql`, `clickhouse`,
|
||||
`sqlserver`/`tsql`), each with its own FQTN, quoting, date, top-N, and
|
||||
JSON conventions.
|
||||
|
||||
This guidance is genuinely useful to an agent writing SQL against a live
|
||||
database, but it must **not** pollute the flat dialect-agnostic skill. It belongs
|
||||
in a **dialect-aware** channel, surfaced only for the dialect the active
|
||||
connection actually uses, and selected from the project's own configured state —
|
||||
not guessed, not shown all at once.
|
||||
|
||||
## Generic use case
|
||||
|
||||
Any **ktx** project whose connections span more than one warehouse engine — a
|
||||
Snowflake warehouse plus a BigQuery export plus a local sqlite extract, say. When
|
||||
the agent (or a human analyst the agent assists) writes SQL for a given
|
||||
connection, it should receive *that engine's* syntax conventions — FQTN form,
|
||||
identifier quoting, date functions, top-N idiom, semi-structured access — and
|
||||
nothing for the engines it is not querying. The need is independent of any
|
||||
benchmark: it is what "write correct SQL against this specific warehouse" requires
|
||||
on every multi-engine stack.
|
||||
|
||||
## Model
|
||||
|
||||
The change adds a **dialect-aware channel** alongside spec 07's flat skill. The
|
||||
following decisions are committed by this refinement; the implementer owns the
|
||||
exact prose and code.
|
||||
|
||||
### Delivery: a dynamic MCP tool (decision committed)
|
||||
|
||||
The draft posed two delivery mechanisms and asked the refinement to "weigh them
|
||||
before committing." This spec commits to **dynamic MCP delivery**: a new
|
||||
read-only MCP tool returns the syntax notes for a given `connectionId`, with the
|
||||
dialect resolved server-side from the connection's configured `driver`. The flat
|
||||
skill gains a one-line pointer to that tool. **No install-mechanism change is
|
||||
required.**
|
||||
|
||||
The alternative — **multi-file skill delivery** (bundle `reference/<dialect>.md`
|
||||
files and point the skill at the matching one) — is **rejected** for **ktx**, for
|
||||
reasons that hold regardless of how the skill is otherwise authored:
|
||||
|
||||
1. **It cannot scope on two of the six install targets.** Cursor
|
||||
(`.cursor/rules/ktx-analytics.mdc`) and OpenCode
|
||||
(`.opencode/commands/ktx-analytics.md`) are physically **single-file**;
|
||||
`setup-agents.ts` flattens the skill to one file there. A bundled `reference/`
|
||||
directory degenerates to "concatenate every dialect into one file," so a
|
||||
sqlite agent would see Snowflake VARIANT syntax — **failing this spec's core
|
||||
no-leak criterion on those targets**, and defeating progressive disclosure
|
||||
(everything is in context at once). The MCP tool behaves **identically on all
|
||||
six targets** because it is a tool call, not an installed file.
|
||||
2. **Selecting the dialect is a deterministic operation, so it belongs in code,
|
||||
not model judgment.** Anthropic's skill-authoring guidance explicitly says to
|
||||
*"prefer scripts [tools] for deterministic operations."* With bundled files the
|
||||
**model** must infer that connection X is Snowflake and open the right file —
|
||||
and on a multi-connection project it can open the wrong one. With the tool, the
|
||||
**server** resolves `driver → dialect` from `ktx.yaml` state and returns
|
||||
exactly the right notes.
|
||||
3. **It needs a delivery subsystem that the tool does not.** Multi-file delivery
|
||||
requires reworking `readAnalyticsSkillContent`, `installTarget`,
|
||||
`plannedKtxAgentFiles`, the install manifest (a directory variant),
|
||||
`removeKtxAgentInstall`, and `writeClaudeDesktopSkillBundle`, plus a
|
||||
concatenation transform for the single-file targets. The MCP tool requires one
|
||||
read-only handler and one skill pointer.
|
||||
4. **The dependency is free.** The `ktx-analytics` skill already hard-depends on
|
||||
the **ktx** MCP server — its entire workflow is calling `discover_data`,
|
||||
`entity_details`, `sql_execution`, and so on. Wherever the server is down, the
|
||||
skill is already non-functional; the tool adds **no new dependency**.
|
||||
5. **Dropping Cursor/OpenCode does not change this.** Removing those targets would
|
||||
make multi-file delivery *possible*, but it would not make it better: reasons
|
||||
2–4 stand, and the drop is a disproportionate cost (Cursor is a major target)
|
||||
to neutralize a constraint the tool handles for free. Whether **ktx** supports
|
||||
those targets is a separate product decision and is out of scope here.
|
||||
|
||||
This is consistent with Anthropic's progressive-disclosure goal — load the
|
||||
relevant material on demand, at zero context cost until needed — which the tool
|
||||
satisfies (its output costs context only when called) while resolving *which*
|
||||
dialect from state rather than from a model guess. Reference:
|
||||
[Skill authoring best practices](https://platform.claude.com/docs/en/agents-and-tools/agent-skills/best-practices).
|
||||
|
||||
### Scope derived from state, through the one existing resolver
|
||||
|
||||
Which dialect's notes the agent sees is **derived** from the connection's
|
||||
configured `driver`, via the resolver the rest of the system already uses —
|
||||
`sqlAnalysisDialectForDriver(driver)` in
|
||||
`packages/cli/src/context/sql-analysis/dialect.ts`. The same function already
|
||||
selects the dialect for `sql_execution`, `sl_query`, and the Python SQL-analysis
|
||||
daemon. This spec **must not** introduce a second driver→dialect map. The notes
|
||||
are **keyed by the resolved `SqlAnalysisDialect`** (so the SQL Server entry is
|
||||
keyed `tsql`, not `sqlserver`), tying the note key-space to the resolver's
|
||||
codomain so the two cannot drift.
|
||||
|
||||
### Authored per-engine notes are sanctioned static content
|
||||
|
||||
Enumerating syntax notes per engine is **not** a rotting denylist of bad
|
||||
specifics; FQTN form and identifier quoting are genuine, stable invariants of each
|
||||
engine — the kind of universal fact **ktx**'s design rules explicitly permit as
|
||||
static content. What must stay derived-from-state is note *selection* (the active
|
||||
dialect) and note *coverage* (every configured driver must resolve to notes that
|
||||
exist), both of which this spec ties to the connector registry.
|
||||
|
||||
### The flat skill stays dialect-agnostic (spec 07 invariant preserved)
|
||||
|
||||
This work adds a *separate* channel. It does **not** amend spec 07's `<sql_craft>`
|
||||
block or inline any dialect syntax into `SKILL.md`. Spec 07's acceptance criterion
|
||||
— no `QUALIFY`/`strftime`/`julianday`/backtick-FQTN/etc. in the flat skill — stays
|
||||
green. The only `SKILL.md` change is the pointer in requirement 3, which names the
|
||||
tool and contains no dialect syntax.
|
||||
|
||||
## Requirements
|
||||
|
||||
### 1. A read-only `sql_dialect_notes` MCP tool
|
||||
|
||||
Register a new tool beside the existing context tools
|
||||
(`packages/cli/src/context/mcp/context-tools.ts`). The tool name is the
|
||||
implementer's to finalize but should follow the existing snake_case convention
|
||||
(`entity_details`, `sql_execution`); `sql_dialect_notes` is the suggested name.
|
||||
|
||||
- **Input:** `{ connectionId }`, **required** — matching its siblings
|
||||
`entity_details`/`sql_execution`, which always take an explicit connection.
|
||||
- **Output:** `{ connectionId, dialect, notes }` where `dialect` is the resolved
|
||||
`SqlAnalysisDialect` and `notes` is the markdown guidance for that dialect.
|
||||
- **Resolution:** `connectionId → connection.driver →
|
||||
sqlAnalysisDialectForDriver(driver) → notes[dialect]`, reusing the existing
|
||||
resolver. Do not duplicate the driver→dialect map.
|
||||
- **Guards:**
|
||||
- A **non-SQL context-source** connection (driver `metabase`, `looker`,
|
||||
`lookml`, `notion`, `dbt`, `metricflow`) returns a **clear "not a SQL
|
||||
warehouse connection" error**, not postgres notes. Gate on the existing
|
||||
`isDatabaseDriver()` (`packages/cli/src/connection-drivers.ts`).
|
||||
- For any **SQL warehouse** connection the resolver always yields a dialect with
|
||||
notes (all seven warehouse drivers are covered — requirement 2); its built-in
|
||||
`postgres` default is a safety floor, so the tool never errors for a SQL
|
||||
connection and never emits a single-engine dialect (e.g. Snowflake) by
|
||||
accident.
|
||||
- **Annotations:** read-only and idempotent, consistent with the other read
|
||||
tools.
|
||||
- **Description (docs-grade, third person, states what and when):** e.g.
|
||||
*"Returns the SQL syntax conventions for a connection's dialect — FQTN form,
|
||||
identifier quoting and case-folding, date/time functions, top-N idiom, and
|
||||
semi-structured access. Use before authoring raw SQL against a connection so the
|
||||
SQL matches that engine."* The description drives the agent's decision to call
|
||||
the tool, so it must be specific.
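The resolution chain and the non-SQL guard above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the ktx implementation: the `Connection` shape, the `NotSqlWarehouseError` class, the `resolveDialectNotes` helper, the placeholder note strings, and the bodies of `isDatabaseDriver` and `sqlAnalysisDialectForDriver` are local stand-ins written for this example. Only the two function names, the driver lists, and the `{ connectionId, dialect, notes }` output shape come from this spec.

```typescript
// Sketch only. Everything here is a local stand-in for illustration; the real
// isDatabaseDriver and sqlAnalysisDialectForDriver live in the ktx package.

type SqlAnalysisDialect =
  | "postgres" | "mysql" | "snowflake" | "bigquery"
  | "sqlite" | "clickhouse" | "tsql";

interface Connection {
  id: string;
  driver: string;
}

// Hypothetical error type for the "not a SQL warehouse connection" guard.
class NotSqlWarehouseError extends Error {}

// The seven warehouse drivers listed in this spec.
const WAREHOUSE_DRIVERS: readonly string[] = [
  "postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "sqlserver",
];

function isDatabaseDriver(driver: string): boolean {
  return WAREHOUSE_DRIVERS.includes(driver);
}

// Stand-in resolver: sqlserver maps to tsql, anything unmapped floors to postgres.
const DRIVER_DIALECTS = new Map<string, SqlAnalysisDialect>([
  ["postgres", "postgres"], ["mysql", "mysql"], ["snowflake", "snowflake"],
  ["bigquery", "bigquery"], ["sqlite", "sqlite"], ["clickhouse", "clickhouse"],
  ["sqlserver", "tsql"],
]);

function sqlAnalysisDialectForDriver(driver: string): SqlAnalysisDialect {
  return DRIVER_DIALECTS.get(driver) ?? "postgres";
}

// Placeholder note bodies; real notes would be authored against the rubric.
const NOTES: Record<SqlAnalysisDialect, string> = {
  postgres: "(placeholder notes for postgres)",
  mysql: "(placeholder notes for mysql)",
  snowflake: "(placeholder notes for snowflake)",
  bigquery: "(placeholder notes for bigquery)",
  sqlite: "(placeholder notes for sqlite)",
  clickhouse: "(placeholder notes for clickhouse)",
  tsql: "(placeholder notes for tsql)",
};

// connectionId -> connection.driver -> dialect -> notes, with the guard first.
function resolveDialectNotes(connections: Connection[], connectionId: string) {
  const connection = connections.find((c) => c.id === connectionId);
  if (!connection) {
    throw new Error(`Unknown connection: ${connectionId}`);
  }
  if (!isDatabaseDriver(connection.driver)) {
    throw new NotSqlWarehouseError(
      `Connection "${connectionId}" (${connection.driver}) is not a SQL warehouse connection`,
    );
  }
  const dialect = sqlAnalysisDialectForDriver(connection.driver);
  return { connectionId, dialect, notes: NOTES[dialect] };
}

const exampleConnections: Connection[] = [
  { id: "warehouse", driver: "sqlserver" },
  { id: "dashboards", driver: "metabase" },
];
const resolved = resolveDialectNotes(exampleConnections, "warehouse");
// resolved.dialect is "tsql"; asking for "dashboards" throws NotSqlWarehouseError.
```

The guard runs before the resolver on purpose: without it, `metabase` would fall through to the `postgres` floor and return misleading notes.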
|
||||
|
||||
### 2. Per-dialect note content
|
||||
|
||||
Author concise notes for each supported dialect against a **fixed rubric**, so
|
||||
every dialect answers the same questions. Each facet is a line or two of timeless,
|
||||
engine-true convention (no version-dated "as of vX" content), phrased as
|
||||
guidance with the engine reason where it helps — inheriting spec 07's
|
||||
heuristics-with-a-why tone. The rubric facets:
|
||||
|
||||
1. **FQTN form** — how to fully-qualify a table on this engine.
|
||||
2. **Identifier quoting & case-folding** — quote character and how unquoted
|
||||
identifiers fold.
|
||||
3. **Date/time** — the engine's date functions and common date-encoding idioms.
|
||||
4. **Top-N / window-filtering idiom** — `QUALIFY` where supported; a CTE +
|
||||
outer-filter form where it is not; `TOP` for `tsql`.
|
||||
5. **Semi-structured / JSON access** — VARIANT colon-paths, `JSON_VALUE`/
|
||||
`JSON_EXTRACT`, `->`/`->>`, `json_extract`, as applicable.
|
||||
6. **Sharded / partition idiom** where the engine has one (e.g. BigQuery
|
||||
`_TABLE_SUFFIX`).
|
||||
|
||||
Constraints on the content:
|
||||
|
||||
- **Coverage = the reachable dialect set.** Every driver in the connector registry
|
||||
must resolve to a dialect that has non-empty notes. The reachable set is
|
||||
`postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, and
|
||||
`tsql` (from `sqlserver`). Do **not** author notes for `duckdb`/`databricks`:
|
||||
they appear in the resolver map but no connector can produce them, so they are
|
||||
unreachable — matching the draft's "don't author for nonexistent drivers."
|
||||
- **Keyed by `SqlAnalysisDialect`** (see Model).
|
||||
- **Storage is the implementer's choice.** The notes MAY live as per-dialect
|
||||
markdown files inside the package (e.g. under the skill's directory) served by
|
||||
the tool, or as a typed map. If files are used they are **package-internal** —
|
||||
served by the tool, never installed onto an agent target — and already ship via
|
||||
the recursive `src/skills → dist/skills` copy
|
||||
(`packages/cli/scripts/copy-runtime-assets.mjs`); no `setup-agents.ts` change.
|
||||
- **No benchmark, gold-answer, grader, or scoring references** anywhere in the
|
||||
notes.
|
||||
|
||||
The implementer must verify each engine's specifics against current official
|
||||
documentation (the well-known anchors above are starting points, not a
|
||||
substitute for checking the engine's docs).
|
||||
|
||||
### 3. The `SKILL.md` pointer (completes spec 07's deferral)
|
||||
|
||||
Add a **single one-line pointer** to the SQL-authoring step (step 4 "Plan" / step
|
||||
5 "Query") of `packages/cli/src/skills/analytics/SKILL.md`, directing the agent to
|
||||
call the tool before writing raw SQL against a connection — e.g. *"Before writing
|
||||
raw `sql_execution` SQL, call `sql_dialect_notes` with the connection's id to get
|
||||
that engine's syntax conventions."* This is the pointer spec 07 deliberately did
|
||||
not add because the tool did not yet exist.
|
||||
|
||||
- The pointer **names the tool only**; it contains **no dialect syntax**, so the
|
||||
flat skill stays dialect-agnostic.
|
||||
- Follow the skill's existing tool-reference convention. The skill currently names
|
||||
MCP tools by **bare** name (`discover_data`, `sql_execution`). Anthropic's
|
||||
guidance recommends **fully-qualified** `ServerName:tool` names to avoid
|
||||
"tool not found" when multiple MCP servers are present. Whether to fully-qualify
|
||||
the new pointer (and optionally retrofit the existing bare references) is a
|
||||
small, separable decision flagged for the maintainer — **not** a rename sweep
|
||||
this spec mandates.
|
||||
|
||||
### 4. Coverage is enforced from state, not by hand
|
||||
|
||||
A test must **derive** the required coverage from the connector registry rather
|
||||
than hardcoding a dialect list: enumerate the configured warehouse drivers
|
||||
(`warehouseDrivers` in `driver-schemas.ts` / `KTX_DATABASE_DRIVER_IDS` in
|
||||
`connection-drivers.ts`), resolve each through `sqlAnalysisDialectForDriver`, and
|
||||
assert each result has non-empty notes. Adding a connector later then **fails this
|
||||
test** until its dialect gets notes — the allowlist-from-state discipline, not a
|
||||
hand-maintained list.
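One way to picture the registry-derived check, as a sketch under assumptions: `driversMissingNotes`, `resolveDialect`, `KNOWN_DIALECTS`, `authoredNotes`, and the inline driver array are hypothetical stand-ins for the real registry, resolver, and note store. The actual test would import those from the package instead of redefining them.

```typescript
// Sketch only: the driver list, resolver, and notes map are hypothetical stand-ins.

// Returns the drivers whose resolved dialect has missing or blank notes.
function driversMissingNotes(
  driverIds: readonly string[],
  resolve: (driver: string) => string,
  notes: ReadonlyMap<string, string>,
): string[] {
  return driverIds.filter((driver) => {
    const text = notes.get(resolve(driver));
    return text === undefined || text.trim().length === 0;
  });
}

// Stand-in resolver: sqlserver maps to tsql, known dialect names map to
// themselves, and anything else floors to postgres.
const KNOWN_DIALECTS = new Set([
  "postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "tsql", "duckdb",
]);

function resolveDialect(driver: string): string {
  if (driver === "sqlserver") return "tsql";
  return KNOWN_DIALECTS.has(driver) ? driver : "postgres";
}

// Notes exist only for the reachable set; duckdb is deliberately unauthored.
const authoredNotes = new Map<string, string>(
  ["postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "tsql"].map(
    (dialect): [string, string] => [dialect, `placeholder notes for ${dialect}`],
  ),
);

const registryToday = [
  "postgres", "mysql", "snowflake", "bigquery", "sqlite", "clickhouse", "sqlserver",
];

const gapsToday = driversMissingNotes(registryToday, resolveDialect, authoredNotes);
// gapsToday is [] because every configured driver resolves to an authored dialect.

const gapsAfterNewConnector = driversMissingNotes(
  [...registryToday, "duckdb"],
  resolveDialect,
  authoredNotes,
);
// gapsAfterNewConnector is ["duckdb"]: the new connector fails until notes exist.
```

In this sketch a driver the resolver does not map would fall back to `postgres` and pass, so the check only bites once the resolver maps the new driver to its own dialect.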
|
||||
|
||||
### 5. No dialect syntax leaks into the flat skill
|
||||
|
||||
Spec 07's content assertion over `analytics/SKILL.md` stays green: the flat skill
|
||||
(and its worked example) still contain no `QUALIFY`, `strftime`, `julianday`,
|
||||
backtick/`DB.SCHEMA.TABLE` FQTN, or other single-engine construct. This spec adds
|
||||
a tool and a tool-pointer; it does not move dialect syntax into the skill.
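A dialect-clean assertion of this kind might look like the sketch below. The pattern list is an assumption built only from the tokens this section names, and `findDialectLeaks` is a hypothetical helper; spec 07's real content test may check more constructs (the backtick and three-part FQTN forms are left out here because a naive pattern for them would also match ordinary inline code).

```typescript
// Sketch only: the pattern list is an assumption built from the tokens this
// spec names; the real content test may cover more constructs.
const SINGLE_ENGINE_PATTERNS: ReadonlyArray<{ label: string; pattern: RegExp }> = [
  { label: "QUALIFY", pattern: /\bQUALIFY\b/i },
  { label: "strftime", pattern: /\bstrftime\b/i },
  { label: "julianday", pattern: /\bjulianday\b/i },
];

// Returns the labels of every single-engine construct found in the text.
function findDialectLeaks(skillText: string): string[] {
  return SINGLE_ENGINE_PATTERNS
    .filter(({ pattern }) => pattern.test(skillText))
    .map(({ label }) => label);
}

const cleanPointer =
  "Before writing raw sql_execution SQL, call sql_dialect_notes with the connection's id.";
const leakyLine = "Filter the window with QUALIFY and bucket dates with strftime.";

const cleanLeaks = findDialectLeaks(cleanPointer); // []
const leakyLeaks = findDialectLeaks(leakyLine);    // ["QUALIFY", "strftime"]
```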
|
||||
|
||||
### 6. Delivery is unchanged
|
||||
|
||||
`setup-agents.ts` (`readAnalyticsSkillContent`, `installTarget`,
|
||||
`writeClaudeDesktopSkillBundle`, `plannedKtxAgentFiles`) needs **no change**. The
|
||||
skill still installs as a single `SKILL.md` per target. Confirm the channel works
|
||||
on all six targets — Claude Code, Claude Desktop (zip), Codex, universal
|
||||
`.agents`, Cursor (`.mdc`), OpenCode (`.md`) — by virtue of being a tool call,
|
||||
including the single-file targets that multi-file delivery could not cover.
|
||||
|
||||
### 7. Coordination with specs 07 and 03
|
||||
|
||||
- **Spec 07** owns the dialect-agnostic `<sql_craft>` block. This spec must not
|
||||
amend it; it adds the tool, the pointer, and the notes.
|
||||
- **Spec 03** (`03-multi-connection-routing-in-analytics-skill`) threads
|
||||
`connectionId` through the skill's tool calls. The `sql_dialect_notes` pointer
|
||||
is `connectionId`-scoped and fits that routing; keep the pointer consistent with
|
||||
spec 03's `connectionId` rules and do not rewrite the routing it owns.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- An agent querying a **sqlite** connection gets sqlite date idioms and **never**
|
||||
sees Snowflake/BigQuery-only syntax; an agent querying **Snowflake** gets
|
||||
FQTN / identifier / VARIANT guidance.
|
||||
- The dialect shown is **derived from the connection's configured `driver`** via
|
||||
the existing `sqlAnalysisDialectForDriver`, not hardcoded per project and not
|
||||
guessed. No second driver→dialect map is introduced.
|
||||
- **Every configured warehouse driver** (`postgres`, `mysql`, `snowflake`,
|
||||
`bigquery`, `sqlite`, `clickhouse`, `sqlserver`) resolves to a dialect with
|
||||
non-empty notes, and the coverage test derives this from the registry.
|
||||
- A **non-SQL context-source** connection (e.g. `metabase`, `notion`) yields a
|
||||
clear "not a SQL warehouse" response, **not** postgres notes.
|
||||
- `analytics/SKILL.md` remains dialect-agnostic — spec 07's criteria are
|
||||
unaffected. The new pointer references the tool only and adds no dialect syntax.
|
||||
- The channel installs/serves correctly across **all six** agent targets,
|
||||
including the single-file Cursor/OpenCode shape, with **no `setup-agents.ts`
|
||||
change**.
|
||||
- The notes contain **no** benchmark/gold/grader/scoring references and **no**
|
||||
time-sensitive ("as of version X") content.
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
||||
the design.
|
||||
|
||||
- **Dialect resolver (reuse, do not duplicate):**
|
||||
`packages/cli/src/context/sql-analysis/dialect.ts` —
|
||||
`sqlAnalysisDialectForDriver(driver)`, returning `SqlAnalysisDialect`
|
||||
(`./ports.ts`), default `postgres`.
|
||||
- **Connector registry (drives coverage):**
|
||||
`packages/cli/src/connection-drivers.ts` (`KTX_DATABASE_DRIVER_IDS`,
|
||||
`isDatabaseDriver`) and `packages/cli/src/context/project/driver-schemas.ts`
|
||||
(`warehouseDrivers`, the per-driver `connectionConfigSchema`).
|
||||
- **MCP tool registration:** `packages/cli/src/context/mcp/context-tools.ts`
|
||||
(register beside `connection_list`, `entity_details`, `sql_execution`); the
|
||||
`connectionId → driver → dialect` resolution already exists for `sql_execution`
|
||||
in `packages/cli/src/context/mcp/local-project-ports.ts` — route the new tool
|
||||
through the same path.
|
||||
- **The skill (one-line pointer only):**
|
||||
`packages/cli/src/skills/analytics/SKILL.md` — add the tool pointer in step 4/5;
|
||||
leave `<workflow>`/`<rules>`/`<sql_craft>`/`<examples>` otherwise intact.
|
||||
- **Note storage (if files):** under the skill directory, shipped by
|
||||
`packages/cli/scripts/copy-runtime-assets.mjs`'s recursive copy; served by the
|
||||
tool, never installed.
|
||||
- **Delivery (confirm unchanged):** `packages/cli/src/setup-agents.ts`.
|
||||
- **Tests:** unit tests for resolution (including `sqlserver → tsql`, unknown →
|
||||
`postgres`, and non-warehouse rejection); a registry-derived coverage test
|
||||
(requirement 4); a content test that each dialect's notes cover the rubric
|
||||
facets and contain no banned tokens; and an extension of spec 07's
|
||||
`analytics/SKILL.md` content test asserting the new pointer is present and the
|
||||
flat skill is still dialect-clean. Rebuild and re-link the dev binary so the
|
||||
playground picks up the change: `pnpm run build && pnpm run link:dev`.
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
The Spider 2.0-Lite v9 harnesses' only per-dialect content was Snowflake
|
||||
(`DB.SCHEMA.TABLE` FQTNs, double-quoted lower-case columns, VARIANT colon-paths),
|
||||
BigQuery (backtick FQTNs, `_TABLE_SUFFIX` for sharded tables), and sqlite
|
||||
(`strftime`/`julianday`). That content is real and useful but engine-specific;
|
||||
spec 07 kept it out of the flat skill and deferred it here so the dialect-agnostic
|
||||
rules stay clean. Delivering it through a dialect-scoped **ktx** tool generalizes
|
||||
the same correctness benefit to every multi-engine **ktx** project — improving the
|
||||
benchmark score is a side effect, not the goal, and the shipped skill contains no
|
||||
trace of the benchmark.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Implemented on branch `write-feature-spec-wiki`, alongside spec 07. The committed
|
||||
decision (dynamic MCP delivery, not multi-file skill bundling) was implemented as
|
||||
specified — no `setup-agents.ts` change.
|
||||
|
||||
**What was built**
|
||||
- Per-dialect notes are markdown files under
|
||||
`packages/cli/src/context/sql-analysis/dialects/<dialect>.md` (one each for
|
||||
`postgres`, `mysql`, `snowflake`, `bigquery`, `sqlite`, `clickhouse`, `tsql`),
|
||||
served by `sqlDialectNotes(dialect)` in `sql-analysis/dialect-notes.ts` (lazy
|
||||
read + cache, `postgres` fallback floor; the authored set is the
|
||||
`DIALECTS_WITH_NOTES` const). `duckdb`/`databricks` are intentionally unauthored
|
||||
(unreachable from any connector). Each note answers the fixed rubric — FQTN,
|
||||
identifier quoting/case-folding, date/time, top-N/window idiom,
|
||||
JSON/semi-structured, plus a sharded-table line for BigQuery. Engine specifics
|
||||
were verified against current docs via Context7 (Snowflake VARIANT colon-paths
|
||||
and unquoted→UPPER case-folding; BigQuery `_TABLE_SUFFIX`, `QUALIFY`,
|
||||
`JSON_VALUE`; ClickHouse `LIMIT n BY` and `JSONExtract*`, with no `QUALIFY`). The
|
||||
files are package-internal — `copy-runtime-assets.mjs` ships them to `dist`; they
|
||||
are never installed onto an agent target.
|
||||
- New read-only MCP tool `sql_dialect_notes` (`context-tools.ts`): input
|
||||
`{ connectionId }` (required), output `{ connectionId, dialect, notes }`, read-only
|
||||
and idempotent annotations. It resolves through the **existing**
|
||||
`connectionId → connection.driver → sqlAnalysisDialectForDriver` path (no second
|
||||
driver→dialect map), implemented as the unconditional `dialectNotes` port in
|
||||
`local-project-ports.ts` via an extracted `resolveDialectNotesForConnection`. A
|
||||
non-SQL context source (gated by `isDatabaseDriver`) throws `KtxExpectedError`
|
||||
("not a SQL warehouse"), not postgres notes — so the expected agent mistake stays
|
||||
out of Error Tracking.
|
||||
- `connection-drivers.ts`: `KTX_DATABASE_DRIVER_IDS` is now an exported (`@internal`)
|
||||
readonly tuple so the coverage test derives required coverage from the registry;
|
||||
`isDatabaseDriver` behavior is unchanged.
|
||||
- `skills/analytics/SKILL.md`: a single dialect-agnostic pointer in step 5 ("call
|
||||
`sql_dialect_notes` … to get that engine's FQTN, identifier-quoting, date, top-N,
|
||||
and JSON conventions"). It names the tool only; spec 07's `<sql_craft>` block and
|
||||
its dialect-clean content test are untouched.
|
||||
|
||||
**Tests**
|
||||
- `test/context/mcp/dialect-notes.test.ts`: registry-derived coverage (a future
|
||||
connector fails the test until its dialect has notes), the full rubric per dialect,
|
||||
leak isolation (sqlite shows `strftime` and never `VARIANT`/`_TABLE_SUFFIX`;
|
||||
`QUALIFY` only on snowflake/bigquery; engine-exclusive markers stay put), no
|
||||
benchmark/grader or version-dated content, the postgres fallback, and
|
||||
`resolveDialectNotesForConnection` resolving sqlite / snowflake / `sqlserver→tsql`
|
||||
and rejecting a non-SQL source / unknown connection with `KtxExpectedError`; plus a
|
||||
guard that the `DIALECTS_WITH_NOTES` const and the `dialects/*.md` files stay in sync.
|
||||
- `test/context/mcp/server.test.ts`: `sql_dialect_notes` added to the retained tool
|
||||
set + annotations assertion + a handler-routing test, and the regenerated
|
||||
`__snapshots__/mcp-tools-list.json`.
|
||||
- `test/skills/analytics-skill-content.test.ts`: asserts the new pointer is present
|
||||
and the flat skill stays dialect-clean.
|
||||
|
||||
**Verification** — `tsc -p tsconfig.json` (src) clean; full default suite 393 files /
|
||||
3001 passing; slow suite green (incl. `local-project-ports.test.ts`); all three
|
||||
`dead-code` checks clean; the `dialects/*.md` files copy into `dist`. Rebuilt and
|
||||
re-linked `ktx-dev`.
|
||||
|
||||
**Deviations / notes**
|
||||
- Notes are stored as per-dialect markdown files (not a typed map, and not bundled
|
||||
`reference/*.md` skill files) — all sanctioned by the spec; plain markdown is the
|
||||
most maintainable to edit. They are served by the tool and ship via a
|
||||
`copy-runtime-assets.mjs` entry (`src/context/sql-analysis/dialects → dist/…`); no
|
||||
`setup-agents.ts` change.
|
||||
- `pnpm run type-check` still reports one pre-existing, unrelated error in
|
||||
`test/mcp-server-factory.test.ts` (committed in-flight MCP work on this branch);
|
||||
this change adds zero new type errors and does not touch that file.
|
||||
|
|
@ -1,362 +0,0 @@
|
|||
# Strengthen fan-out-join safety for multi-hop aggregation in the analytics skill
|
||||
|
||||
> Refined spec. Intake draft: `todo/09-fan-out-safe-multi-hop-aggregation.md`.
|
||||
> Extends spec 07 (`specs/07-analytics-skill-sql-craft.md`), which shipped the
|
||||
> `<sql_craft>` block. Additive, content-only.
|
||||
|
||||
## Problem
|
||||
|
||||
The shipped `ktx-analytics` skill
|
||||
(`packages/cli/src/skills/analytics/SKILL.md`) already carries a single-hop
|
||||
fan-out rule in `<sql_craft>` → **Composition**:
|
||||
|
||||
> **Avoid fan-out joins.** Add columns only from tables already at the target
|
||||
> grain, or pre-aggregate to that grain before joining. A join that multiplies
|
||||
> rows quietly inflates every downstream `SUM`/`COUNT`.
|
||||
|
||||
In practice the agent honors that on a single join but still **silently
|
||||
fans out on multi-hop join chains**, where the inflation is one or two joins
|
||||
removed from the aggregate and therefore much harder to notice.
|
||||
|
||||
The failure shape: a measure that lives at a *coarse* grain (one row per parent
|
||||
record) is counted/summed *after* the parent has been joined down to a *finer*
|
||||
grain (one row per child line). Every parent-level value is then duplicated by
|
||||
its child fan-out, so `COUNT(*)` / `SUM(amount)` over-counts by a data-dependent
|
||||
amount — runnable SQL, plausible-looking number, quietly wrong.
|
||||
|
||||
The rule today is stated only as a **prohibition** ("Avoid…"). It needs two
|
||||
upgrades: (a) generalize it so the danger is understood as *cumulative across a
|
||||
whole join chain*, not a single join; and (b) pair it with an **affirmative
|
||||
verification habit** the agent runs while composing, so a grain change is
|
||||
detected and fixed rather than merely warned against.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
An analyst on any production warehouse asks a counting/summing question whose
|
||||
path runs through several one-to-many hops — e.g. *"how many orders per region
|
||||
contain a returned item?"* where the path is `region → store → order →
|
||||
order_line`. The honest answer counts each order once. The naïve join chain joins
|
||||
`order_line` (to apply the line-level condition) and then counts orders, so an
|
||||
order with three returned lines is counted three times. The inflation happens
|
||||
**three joins below the `COUNT`**, where it is easy to miss. This is one of the
|
||||
most common silently-wrong analytics mistakes on normalized schemas — not
|
||||
specific to any dataset, dialect, or benchmark.
|
||||
|
||||
## Model (invariants — the implementer owns the prose)
|
||||
|
||||
These constrain the change; the exact wording is the implementer's. Each is
|
||||
grounded in Anthropic's skill-authoring and prompt-engineering guidance so the
|
||||
addition stays consistent with how spec 07 was written.
|
||||
|
||||
### Additive, inline-only, dialect-agnostic (inherited from spec 07)
|
||||
|
||||
The change is **additive content inside `skills/analytics/SKILL.md`** only — no
|
||||
bundled `reference/*.md` file (the delivery path ships a single `SKILL.md` per
|
||||
target; see spec 07 §Model "Inline-only delivery"). No new tool, flag, or config.
|
||||
Every addition must read correctly on any dialect: **no** `QUALIFY`,
|
||||
`strftime`/`julianday`, backtick/`DB.SCHEMA.TABLE` FQTNs, or other single-dialect
|
||||
construct — including in the worked example. The existing `<workflow>`, `<rules>`,
|
||||
`<examples>`, and the other four `<sql_craft>` sub-headings are preserved
|
||||
unchanged.
|
||||
|
||||
### Heuristic-plus-*why*, because SQL authoring is a high-freedom task
|
||||
|
||||
Anthropic's "set appropriate degrees of freedom" guidance classifies tasks with
|
||||
many valid approaches where decisions depend on context as **high freedom →
|
||||
text-based heuristics**, the "open field, many paths" case (versus low-freedom,
|
||||
fragile operations that need an exact script). SQL authoring is squarely
|
||||
high-freedom. So the new content is phrased as **heuristics with a one-line,
|
||||
universal rationale**, never as bare `ALWAYS`/`NEVER` imperatives — matching the
|
||||
existing `<sql_craft>` style and Anthropic's "add context / explain why so Claude
|
||||
generalizes" principle.
|
||||
|
||||
### Affirmative framing for the verification step (do, not don't)
|
||||
|
||||
Anthropic's prompt-engineering guidance is explicit: **"Tell Claude what to do
|
||||
instead of what not to do."** The draft's requirement for "a detect-and-fix
|
||||
*habit*, not just a prohibition" is the same principle. Therefore:
|
||||
|
||||
- The **generalized rule keeps the established `Avoid fan-out joins` lead and the
|
||||
term `fan-out`** — it is spec 07's consistent terminology and the existing
|
||||
content test references that phrase; reframing it would churn shared vocabulary
|
||||
for no gain.
|
||||
- The **new verification step is phrased affirmatively** (e.g. *"Verify the grain
|
||||
holds across each join"*) — an action the agent performs while composing, not a
|
||||
warning. The two together satisfy both principles: a recognized anti-pattern
|
||||
name *and* a positive habit.
|
||||
|
||||
### One default with an escape hatch, not two equal options
|
||||
|
||||
Anthropic: **"Avoid offering too many options… provide a default with an escape
|
||||
hatch."** The fix for an inflated aggregate is presented as exactly that:
|
||||
|
||||
- **Default: pre-aggregate the measure to its own grain in a CTE, then join the
|
||||
already-aggregated result.** This is the single-hop fix generalized, and it is
|
||||
the *only* correct fix for `SUM`/`AVG` — you cannot de-duplicate a summed
|
||||
measure with `DISTINCT` (two legitimately-equal amounts would collapse).
|
||||
- **Escape hatch: `COUNT(DISTINCT key)` — for a pure count only.** It rescues an
|
||||
inflated count in one line, but must be stated as count-only, not as a general
|
||||
remedy.
|
||||
|
||||
This is the deepest correctness point in the spec and the easiest to get wrong; a
|
||||
naïve blanket "just use `COUNT(DISTINCT)`" is silently wrong for sums.
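The arithmetic behind that caveat can be worked on a tiny invented dataset. The sketch below is illustrative only: the rows, the in-memory join, and every variable name are assumptions made for this example, standing in for what a SQL engine would do.

```typescript
// Sketch only: invented rows, with the join and the aggregates done in memory.
interface Order { orderId: string; amount: number }
interface OrderLine { orderId: string; sku: string }

const orders: Order[] = [
  { orderId: "o1", amount: 50 },
  { orderId: "o2", amount: 50 }, // legitimately the same amount as o1
];
const orderLines: OrderLine[] = [
  { orderId: "o1", sku: "a" },
  { orderId: "o1", sku: "b" }, // o1 has two lines, so it fans out
  { orderId: "o2", sku: "c" },
];

// orders JOIN order_lines: one output row per matching (order, line) pair.
const joined: Array<Order & OrderLine> = [];
for (const order of orders) {
  for (const line of orderLines) {
    if (line.orderId === order.orderId) joined.push({ ...order, ...line });
  }
}

const sum = (values: number[]): number => values.reduce((acc, v) => acc + v, 0);

// COUNT(*) over the joined rows: inflated by the fan-out.
const countStar = joined.length; // 3
// COUNT(DISTINCT order_id): the escape hatch recovers the honest count.
const countDistinctOrders = new Set(joined.map((row) => row.orderId)).size; // 2
// SUM(amount) over the joined rows: o1's amount is counted twice.
const naiveSum = sum(joined.map((row) => row.amount)); // 150
// SUM(DISTINCT amount): the two equal amounts collapse into one.
const sumDistinct = sum(Array.from(new Set(joined.map((row) => row.amount)))); // 50
// Pre-aggregate: reduce lines to one row per order, then sum at order grain.
const ordersWithLines = new Set(orderLines.map((line) => line.orderId));
const preAggregatedSum = sum(
  orders.filter((order) => ordersWithLines.has(order.orderId)).map((order) => order.amount),
); // 100
```

The honest total is 100. The fanned-out `SUM` overshoots to 150 and `SUM(DISTINCT ...)` undershoots to 50, while `COUNT(DISTINCT ...)` does land on the honest count of 2, which is why the escape hatch is count-only.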
|
||||
|
||||
### Consistent terminology
|
||||
|
||||
Anthropic: **"Choose one term and use it throughout."** Reuse spec 07's existing
|
||||
vocabulary verbatim — **`grain`**, **`fan-out`**, **`pre-aggregate`** — do not
|
||||
introduce synonyms (e.g. do not rename the concept "row blow-up" or
|
||||
"multiplication factor"). Prose may vary, but the named concepts stay fixed.
|
||||
|
||||
### Concise — the addition must justify its token cost
|
||||
|
||||
Anthropic: **"Concise is key… does this paragraph justify its token cost?"** and
|
||||
"Claude is already very smart." The agent knows what a join and a `GROUP BY` are;
|
||||
the addition explains only the non-obvious trap (cumulative grain inflation) and
|
||||
shows the fix. Net addition is roughly one rewritten bullet, one new bullet, and
|
||||
one worked example — the skill stays comfortably under the 500-line budget
|
||||
(~117 lines today).
|
||||
|
||||
### Examples over descriptions — exactly one
|
||||
|
||||
Anthropic's "examples pattern": **"Examples help Claude understand the desired
|
||||
style and level of detail more clearly than descriptions alone"** and
|
||||
"examples are concrete, not abstract." The multishot guidance favors 3–5 examples
|
||||
in general, but here **conciseness and spec 07's one-example-per-rule economy
|
||||
win**: the skill already carries the window-then-filter example, so this adds
|
||||
**exactly one** compact wrong-vs-right example. The wrong/right contrast inside
|
||||
that single example supplies the diversity multishot calls for, at one example's
|
||||
token cost.
|
||||
|
||||
### Leak-safety (hard constraint)
|
||||
|
||||
The worked example must be a **synthetic, generic schema invented for teaching** —
|
||||
not the tables, column names, query, or numeric results of any Spider 2.0-Lite
|
||||
question. It demonstrates the *pattern* (a coarse-grain measure aggregated after a
|
||||
one-to-many join), which is universal and reconstructable from first principles. A
|
||||
reviewer must find nothing in it that ties it to a specific benchmark instance.
|
||||
See "Leak-safety" below.
|
||||
|
||||
## Requirements
|
||||
|
||||
Requirements 1–4 land in the **Composition** sub-heading of `<sql_craft>` in
|
||||
`packages/cli/src/skills/analytics/SKILL.md`. Structure (chosen design): rewrite
|
||||
the existing fan-out bullet, add one affirmative verification bullet, add one
|
||||
worked example. Do not touch the other four sub-headings or `<workflow>`/`<rules>`/
|
||||
`<examples>`.
|
||||
|
||||
### 1. Generalize the fan-out rule to multi-hop chains
|
||||
|
||||
Rewrite the existing **`Avoid fan-out joins.`** bullet so it makes explicit that
|
||||
the danger is **cumulative**: *any* one-to-many hop on the path between a measure's
|
||||
owning table and the aggregate inflates that measure, **even when the offending
|
||||
join is several hops away from the `SUM`/`COUNT`**. The fix is the same as the
|
||||
single-hop case — **pre-aggregate the measure to its own grain in a CTE, then join
|
||||
the already-aggregated result** — but the agent must apply it **per
|
||||
measure-owning table along the whole chain**, not just at the final join. Keep the
|
||||
`fan-out` term and the one-line *why*.
|
||||
|
||||
### 2. Add an affirmative grain-verification habit
|
||||
|
||||
Add a companion bullet, phrased as an action the agent performs **while
|
||||
composing** (not a prohibition):
|
||||
|
||||
- Confirm that a join intended to be one-to-one / many-to-one **did not change the
|
||||
grain** it aggregates at — e.g. check that the row count (or the count of the
|
||||
aggregate's key) is unchanged across that join.
|
||||
- When a join is genuinely one-to-many, **reach for the default fix
|
||||
(pre-aggregate to grain)**; for a **pure count**, `COUNT(DISTINCT key)` is an
|
||||
acceptable escape hatch.
|
||||
- State the caveat once: **`SUM`/`AVG` of a fanned-out measure must pre-aggregate**
|
||||
— `DISTINCT` cannot de-duplicate a sum.
|
||||
|
||||
This is spec 07's "build incrementally and check each layer" discipline pointed
|
||||
specifically at grain preservation, in affirmative form.
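The row-count check in the first bullet can be illustrated in memory. This is a sketch under assumptions: `innerJoin` is a hypothetical helper standing in for a SQL join, and the rows are invented. In practice the agent would compare `COUNT(*)` (or the count of the aggregate's key) before and after the join in SQL.

```typescript
// Sketch only: a generic in-memory inner join plus a row-count comparison.
function innerJoin<L extends object, R extends object>(
  left: L[],
  right: R[],
  matches: (l: L, r: R) => boolean,
): Array<L & R> {
  const out: Array<L & R> = [];
  for (const l of left) {
    for (const r of right) {
      if (matches(l, r)) out.push({ ...l, ...r });
    }
  }
  return out;
}

const orders = [
  { orderId: "o1", storeId: "s1" },
  { orderId: "o2", storeId: "s1" },
  { orderId: "o3", storeId: "s2" },
];
const stores = [
  { storeId: "s1", regionId: "north" },
  { storeId: "s2", regionId: "south" },
];
const orderLines = [
  { orderId: "o1", lineNo: 1 },
  { orderId: "o1", lineNo: 2 }, // second line for o1
  { orderId: "o2", lineNo: 1 },
  { orderId: "o3", lineNo: 1 },
];

// Many-to-one hop (order -> store): the order grain should survive.
const ordersWithStores = innerJoin(orders, stores, (o, s) => o.storeId === s.storeId);
const grainHoldsAfterStores = ordersWithStores.length === orders.length; // true

// One-to-many hop (order -> line): the row count grows, so the grain changed.
const ordersWithLines = innerJoin(orders, orderLines, (o, l) => o.orderId === l.orderId);
const grainHoldsAfterLines = ordersWithLines.length === orders.length; // false
```

An equal row count also rules out the opposite failure, where an inner join silently drops rows that have no match.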
|
||||
|
||||
### 3. One concrete, generic multi-hop worked example
|
||||
|
||||
Add **exactly one** compact wrong-vs-right `sql` example inside `<sql_craft>`
|
||||
demonstrating the multi-hop inflation and the pre-aggregate fix. It is the
|
||||
**second** `sql` fence in the skill (the first is spec 07's window-then-filter
|
||||
example).
|
||||
|
||||
**Required properties** (these are the constraints; the SQL below is orientation):
|
||||
|
||||
- **Multi-hop chain** where the inflating one-to-many hop is **≥1 join removed**
|
||||
from the aggregate (not the single-hop case spec 07 already covers).
|
||||
- **Unambiguous attribution**: each counted entity maps to **exactly one** group,
|
||||
so the honest answer is well-defined. (This rules out "coarse measure attributed
|
||||
to a fine dimension reached by descending," where one entity spans several
|
||||
groups and the correct number is itself ambiguous — that would teach a murky
|
||||
pattern.)
|
||||
- **Motivated descent**: the finer-grain table is joined for a real reason (a
|
||||
line-level filter or a needed line-level value), so the reader sees *why* the
|
||||
fan-out join is there.
|
||||
- **Plain `COUNT`/`SUM`**, not `AVG` — averaging collides with the existing
|
||||
*Macro vs micro average* bullet and would muddy the fan-out lesson.
|
||||
- The **RIGHT side demonstrates the default fix** (pre-aggregate to grain in a
|
||||
CTE) and is **actually correct**, not merely runnable — its number must equal the
|
||||
honest answer, not just avoid an error.
|
||||
- Generic invented schema, standard dialect-agnostic SQL (no `QUALIFY`, no dialect
|
||||
functions), no benchmark identifiers or values.
|
||||
|
||||
**Recommended sketch** (implementer may adjust within the properties above):
|
||||
|
||||
```sql
-- "How many orders per region contain a returned item?"
-- WRONG: joining order_lines to apply the line-level filter multiplies orders —
-- an order with two returned lines is counted twice, three joins below the COUNT.
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.status = 'returned'
GROUP BY r.region_id;

-- RIGHT: collapse order_lines to one row per qualifying order first, then join up.
WITH returned_orders AS (
  SELECT order_id FROM order_lines WHERE status = 'returned' GROUP BY order_id
)
SELECT r.region_id, COUNT(*) AS n_orders
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
JOIN returned_orders ro ON ro.order_id = o.order_id
GROUP BY r.region_id;
-- A pure count could also use COUNT(DISTINCT o.order_id); a SUM/AVG of an
-- order-level measure fanned out this way must pre-aggregate — DISTINCT can't
-- de-duplicate a sum.
```
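To sanity-check that the RIGHT side is actually correct and not merely runnable, the two query shapes from the sketch can be traced on a handful of invented rows. The code below is an illustrative in-memory simulation, not part of the skill: the rows, `regionOf`, and `bump` are assumptions made for this example.

```typescript
// Sketch only: invented rows mirroring the shape of the recommended sketch.
interface Store { storeId: string; regionId: string }
interface Order { orderId: string; storeId: string }
interface OrderLine { orderId: string; status: "returned" | "kept" }

const stores: Store[] = [
  { storeId: "s1", regionId: "north" },
  { storeId: "s2", regionId: "south" },
];
const orders: Order[] = [
  { orderId: "o1", storeId: "s1" },
  { orderId: "o2", storeId: "s1" },
  { orderId: "o3", storeId: "s2" },
];
const orderLines: OrderLine[] = [
  { orderId: "o1", status: "returned" },
  { orderId: "o1", status: "returned" }, // o1 has two returned lines
  { orderId: "o1", status: "kept" },
  { orderId: "o2", status: "kept" },
  { orderId: "o3", status: "returned" },
];

// Each order belongs to exactly one store and therefore exactly one region.
const regionOf = (order: Order): string => {
  const store = stores.find((s) => s.storeId === order.storeId);
  if (!store) throw new Error(`no store for ${order.orderId}`);
  return store.regionId;
};

function bump(counts: Map<string, number>, key: string): void {
  counts.set(key, (counts.get(key) ?? 0) + 1);
}

// WRONG shape: one counted row per (order, returned line) pair.
const wrongCounts = new Map<string, number>();
for (const order of orders) {
  for (const line of orderLines) {
    if (line.orderId === order.orderId && line.status === "returned") {
      bump(wrongCounts, regionOf(order));
    }
  }
}

// RIGHT shape: collapse lines to distinct qualifying order ids, then count orders.
const returnedOrderIds = new Set(
  orderLines.filter((line) => line.status === "returned").map((line) => line.orderId),
);
const rightCounts = new Map<string, number>();
for (const order of orders) {
  if (returnedOrderIds.has(order.orderId)) bump(rightCounts, regionOf(order));
}
// wrongCounts: north -> 2, south -> 1
// rightCounts: north -> 1, south -> 1
```

Only `north` differs, because only `o1` has more than one returned line. That data dependence is what makes the inflation easy to miss on a spot check.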
|
||||
|
||||
### 4. Placement and structure
|
||||
|
||||
- Both bullets live under the existing **Composition** sub-heading; the example
|
||||
follows them. The five-sub-heading structure spec 07 established is unchanged.
|
||||
- **State each rule once** (Anthropic "consistent terminology / don't repeat"):
|
||||
do not also restate the multi-hop rule in `<workflow>` steps 5/6 — those already
|
||||
carry a one-line pointer into `<sql_craft>`, which is sufficient.
|
||||
|
||||
### 5. Coordination with spec 07 (supersession)
|
||||
|
||||
Spec 07's requirement 3 and acceptance criteria say the skill contains **exactly
|
||||
one** worked example and "Do not add a second example." **This spec supersedes
|
||||
that constraint**: the skill now carries **two** `sql` worked examples
|
||||
(window-then-filter from spec 07, plus this multi-hop fan-out example). Annotate
|
||||
spec 07 at those two spots with a one-line "superseded by spec 09" note so the two
|
||||
permanent specs do not contradict. No other spec 07 content changes.
|
||||
|
||||
## Leak-safety (hard constraint on this spec and its example)
|
||||
|
||||
The benchmark's gold answers must never appear in ktx. The worked example must be
|
||||
a **synthetic, generic schema invented for teaching** — not the tables, column
|
||||
names, query, or numeric results of any Spider 2.0-Lite question. The example
|
||||
demonstrates the *pattern* (a coarse-grain measure counted after a one-to-many
|
||||
join), which is universal; it must be reconstructable from first principles by
|
||||
anyone, with zero reference to benchmark data. A reviewer should be able to read
the example and find nothing that ties it to a specific benchmark instance.

## Acceptance criteria

- The `<sql_craft>` **Composition** section states the **multi-hop generalization**
  of the fan-out rule (cumulative danger across the chain; pre-aggregate per
  measure-owning table) and an **affirmative grain-verification habit**, inline and
  dialect-agnostic.
- The fix is presented as **default (pre-aggregate to grain) + escape hatch
  (`COUNT(DISTINCT key)`, count-only)**, with the explicit caveat that `SUM`/`AVG`
  of a fanned-out measure must pre-aggregate.
- Exactly **one** new, **generic** worked example (wrong vs. pre-aggregated-right)
  using an invented schema, with no benchmark-derived identifiers or values, whose
  RIGHT side is actually correct (unambiguous attribution; honest number).
- The skill now contains **two** `sql` worked examples total; the existing content
  test's fence-count assertion is updated `1 → 2` and new assertions cover the
  multi-hop rule phrase and the grain-verification-habit phrase.
- Terminology is consistent with spec 07 (`grain`, `fan-out`, `pre-aggregate`); no
  synonyms introduced.
- **No new tool, flag, or config.** Skill-content only; additive to spec 07.
- All spec 07 invariants still hold: the skill remains dialect-agnostic (no
  `QUALIFY`/`strftime`/`julianday`, no backtick three-part FQTN, no relative-time
  anchoring to a `MAX(...)` date) and free of any benchmark/grader/gold reference,
  including in the new example; `<workflow>`/`<rules>`/`<examples>` and the other
  four sub-headings are intact; frontmatter still parses through
  `SkillsRegistryService.parseFrontmatter`; the skill stays under 500 lines.
- Spec 07's "exactly one example" constraint is annotated as superseded (no
  contradiction between the two permanent specs).

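The default fix and the count-only escape hatch named in the criteria above can be sketched as follows. This is a minimal illustration, not the shipped worked example: the four-table schema (`regions`, `stores`, `orders`, `order_lines`), the rows, and the `returned` flag are invented here, and Python's bundled `sqlite3` module is used only as a convenient engine to run otherwise portable SQL.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER, order_total REAL);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER, returned INTEGER);
INSERT INTO regions VALUES (1, 'North');
INSERT INTO stores VALUES (10, 1);
INSERT INTO orders VALUES (100, 10, 50.0), (101, 10, 20.0);
INSERT INTO order_lines VALUES (1, 100, 1), (2, 100, 1), (3, 100, 1), (4, 101, 1);
""")

# Shared many-to-one chain: regions <- stores <- orders keeps one row per order.
CHAIN = """
FROM regions r
JOIN stores s ON s.region_id = r.region_id
JOIN orders o ON o.store_id = s.store_id
"""

# WRONG: the one-to-many hop to order_lines repeats each order once per returned line.
wrong = con.execute(f"""
SELECT r.region_name, COUNT(*), SUM(o.order_total)
{CHAIN}
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.returned = 1
GROUP BY r.region_name
""").fetchall()

# RIGHT (default): pre-aggregate order_lines to one row per qualifying order first.
right = con.execute(f"""
WITH returned_orders AS (
    SELECT order_id FROM order_lines WHERE returned = 1 GROUP BY order_id
)
SELECT r.region_name, COUNT(*), SUM(o.order_total)
{CHAIN}
JOIN returned_orders ro ON ro.order_id = o.order_id
GROUP BY r.region_name
""").fetchall()

# ESCAPE HATCH: COUNT(DISTINCT key) repairs the count only; the SUM stays inflated.
escape = con.execute(f"""
SELECT r.region_name, COUNT(DISTINCT o.order_id), SUM(o.order_total)
{CHAIN}
JOIN order_lines l ON l.order_id = o.order_id
WHERE l.returned = 1
GROUP BY r.region_name
""").fetchall()

print(wrong)   # [('North', 4, 170.0)]  two orders counted four times
print(right)   # [('North', 2, 70.0)]
print(escape)  # [('North', 2, 170.0)]  count fixed, sum still wrong
```

The third query is why the caveat matters: `DISTINCT` can de-duplicate keys, but it has no way to de-duplicate a sum, so any `SUM`/`AVG` over a fanned-out measure needs the pre-aggregated shape.
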
## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.

- **The skill file:** `packages/cli/src/skills/analytics/SKILL.md` →
  `<sql_craft>` → **Composition**. Rewrite the `Avoid fan-out joins` bullet, add
  the affirmative grain-verification bullet, add the one worked example after them.
  Leave the other four sub-headings, `<workflow>`, `<rules>`, and `<examples>`
  unchanged.
- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`. Update the
  "ships exactly one … worked example" test: `match(/```sql/g)` length `1 → 2`,
  add an assertion for the new fan-out example's distinctive tokens (e.g.
  `WITH returned_orders AS`), add the multi-hop-rule and grain-verification-habit
  phrases to the behavior-presence list, and keep all banned-construct and
  size-budget guards. This is a content assertion over the source `SKILL.md` — the
  right level for prompt content.
- **Spec 07 annotation:** add a one-line "superseded by spec 09" note at spec 07's
  requirement 3 and at its "Exactly one new worked example" acceptance bullet.
- **Rebuild/re-link** the dev binary so the playground picks up the change:
  `pnpm run build && pnpm run link:dev` (provides `ktx-dev`).

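The shape of the content assertion described in the Tests bullet can be sketched in a few lines. The real test is the vitest file named above; the Python below is only an illustration of the idea (fence count, required phrases, banned constructs), and `skill_text` and `check_skill_content` are hypothetical names with a made-up stand-in for the skill text.

```python
import re

FENCE = "`" * 3  # assembled at runtime so this sketch can itself sit inside a fence

# Hypothetical stand-in for the skill text; the real checks read SKILL.md.
skill_text = "\n".join([
    "- Avoid fan-out joins: the danger is cumulative.",
    "- Verify the grain holds across each join.",
    FENCE + "sql",
    "SELECT 1;",
    FENCE,
    FENCE + "sql",
    "WITH returned_orders AS (SELECT 1 AS order_id) SELECT order_id FROM returned_orders;",
    FENCE,
])


def check_skill_content(text, sql_fences, required, banned):
    """Return a list of problems; an empty list means the text passes."""
    problems = []
    found = text.count(FENCE + "sql")
    if found != sql_fences:
        problems.append(f"expected {sql_fences} sql fences, found {found}")
    for phrase in required:
        if phrase not in text:
            problems.append(f"missing phrase: {phrase}")
    for token in banned:
        # Word boundaries keep prose such as "qualifying" from tripping the QUALIFY guard.
        if re.search(rf"\b{re.escape(token)}\b", text, flags=re.IGNORECASE):
            problems.append(f"banned construct: {token}")
    return problems


problems = check_skill_content(
    skill_text,
    sql_fences=2,
    required=[
        "the danger is cumulative",
        "Verify the grain holds across each join",
        "WITH returned_orders AS",
    ],
    banned=["QUALIFY", "strftime", "julianday"],
)
print(problems)  # [] for the sample text above
```

Asserting on distinctive tokens and phrases, rather than on whole paragraphs, leaves the implementer free to reword the prose without breaking the test.
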
## Benchmark context (motivation only)

Multi-hop aggregation questions (counting/averaging a coarse-grained measure
reached through several one-to-many joins) are a recurring source of
result-mismatch failures in the SQLite subset: the agent produces runnable SQL
with the right tables but a fan-out-inflated number. These are correctness
failures, not knowledge or schema-discovery failures (zero execution errors in the
latest run), so the fix belongs in the product's authoring craft — where it also
helps any real analyst — not in a benchmark-specific prompt. The skill itself must
contain no trace of the benchmark.

## Implementation notes

Shipped as specified — additive, content-only, no new tool/flag/config.

- **`packages/cli/src/skills/analytics/SKILL.md`** → `<sql_craft>` → **Composition**:
  - Rewrote the `Avoid fan-out joins` bullet to `**Avoid fan-out joins — the
    danger is cumulative.**`, generalizing to multi-hop chains: any one-to-many
    hop between a measure's owning table and the aggregate inflates that measure
    even when several hops below the `SUM`/`COUNT`; fix is pre-aggregate per
    measure-owning table along the whole chain. Kept the `fan-out` term and the
    one-line *why*.
  - Added the affirmative `**Verify the grain holds across each join.**` bullet:
    confirm a one-to-one / many-to-one join did not change the grain (row/key
    count unchanged); default fix is pre-aggregate to grain, escape hatch is
    `COUNT(DISTINCT key)` for a pure count only; stated once that `SUM`/`AVG` of a
    fanned-out measure must pre-aggregate because `DISTINCT` cannot de-duplicate a
    sum.
  - Added one generic wrong-vs-right worked example (orders→regions via
    stores/order_lines, `WITH returned_orders AS …`) — the second `sql` fence in
    the skill. The inflating hop is three joins below the `COUNT`; the RIGHT side
    pre-aggregates `order_lines` to one row per qualifying order so each order is
    counted once (honest answer), and the trailing comment names the count-only
    `COUNT(DISTINCT o.order_id)` escape hatch plus the `SUM`/`AVG` caveat. Invented
    schema, dialect-agnostic SQL, no benchmark identifiers/values.
  - The other four sub-headings and `<workflow>`/`<rules>`/`<examples>` are
    untouched. Skill is 147 lines (well under the 500-line budget).
- **`packages/cli/test/skills/analytics-skill-content.test.ts`**: sql-fence count
  `1 → 2`; added the multi-hop phrase (`the danger is cumulative`) and the
  grain-verification phrase (`Verify the grain holds across each join`) to the
  behavior-presence list; added new-example token assertions
  (`WITH returned_orders AS`, `COUNT(DISTINCT o.order_id)`). All banned-construct,
  relative-time, and size-budget guards retained. Test file passes (9/9).
- **Spec 07** annotated as superseded at requirement 3 and at its "exactly one
  worked example" acceptance bullet — no contradiction between the two permanent
  specs.

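The grain-verification habit described in the notes above (row/key count unchanged across a join) can be checked mechanically. The sketch below is one possible way to do it, assuming an invented three-table schema and a hypothetical helper `grain_check`; it runs on Python's bundled `sqlite3` only for convenience, and the helper interpolates trusted, hard-coded SQL fragments, which would not be safe with user input.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO stores VALUES (1, 'North'), (2, 'South');
INSERT INTO orders VALUES (100, 1), (101, 1), (102, 2);
INSERT INTO order_lines VALUES (1, 100), (2, 100), (3, 101), (4, 102), (5, 102);
""")


def grain_check(con, from_clause, key):
    """Return (row_count, distinct_key_count); equal values mean one row per key."""
    sql = f"SELECT COUNT(*), COUNT(DISTINCT {key}) FROM {from_clause}"
    return con.execute(sql).fetchone()


base = grain_check(con, "orders o", "o.order_id")
many_to_one = grain_check(
    con, "orders o JOIN stores s ON s.store_id = o.store_id", "o.order_id"
)
one_to_many = grain_check(
    con, "orders o JOIN order_lines l ON l.order_id = o.order_id", "o.order_id"
)

print(base)         # (3, 3)  one row per order
print(many_to_one)  # (3, 3)  grain holds, safe to aggregate order measures
print(one_to_many)  # (5, 3)  rows exceed keys: fan-out, pre-aggregate first
```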
**Verification:** `vitest run test/skills/analytics-skill-content.test.ts` → 9
passed. `pnpm run build` (src `tsc -p tsconfig.json`) succeeds and the built
`dist/skills/analytics/SKILL.md` carries the new content; `pnpm run link:dev`
re-linked `ktx-dev`. A pre-existing, unrelated type error in
`test/mcp-server-factory.test.ts` (`KtxMcpContextPorts`/`context_tool`, last
touched in commit `2677b3ef`) surfaces during the `tsconfig.test.json` pass of the
full `type-check`; it is outside this change's surface and not introduced here.

@ -1,289 +0,0 @@
# Panel/period completeness — emit the full set of groups, not only the populated ones

> Refined spec. Intake draft: `todo/10-panel-completeness-spine.md`.

## Problem

When a question asks for a result *per period* or *per category* ("orders for
each month of 2023", "revenue by region", "count per status"), a plain `GROUP BY`
only returns groups that actually have rows. Periods or categories with **zero**
activity silently vanish, so a "12 months" answer comes back with 9 rows and the
three that should read `0` are simply absent. The SQL is runnable and the
aggregate is right, but the **panel is incomplete** — and a monthly report with
missing months or a category breakdown missing its empty categories is wrong for
any analyst, on any database.

The existing `<sql_craft>` "Answer completeness / interpretation" group already
carries a *"For each X / per X / by X returns exactly one row per X"* rule, but
that rule only governs **grain** (don't collapse to a single value). It says
nothing about the **domain**: "one row per X" today means one row per *observed*
X, so empty groups still drop. This spec sharpens that rule from grain-only to
grain-and-completeness.

## Generic use case (independent of any benchmark)

"How many orders were placed in each month of 2023?" must return **12 rows** even
if March had no orders (March = 0), not 11. "Sales per region" should include
regions with no sales when the question asks for *each* region. Both are
bread-and-butter reporting for any analyst on any warehouse, with no benchmark in
sight.

## Model

The feature splits across **two surfaces**, each holding the half it is suited
for. This split is the central design decision and exists to satisfy spec 07's
hard dialect-agnostic invariant without weakening it.

### Why two surfaces (the dialect-agnostic reconciliation)

The draft asked for a *"recursive-CTE date spine"* worked example. But a real
date/number series is **inherently dialect-specific** — Postgres `generate_series`,
SQLite recursive `date(d,'+1 month')`, BigQuery `GENERATE_DATE_ARRAY`, Snowflake
`GENERATOR`+`DATEADD` — and spec 07 made `<sql_craft>` strictly dialect-agnostic
(the analytics-skill content test bans single-dialect constructs). Inlining a date
spine would violate that invariant; carving out a test exception would erode it.

ktx already has the canonical home for engine-specific syntax: the per-dialect
notes in `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by
the `sql_dialect_notes` MCP tool (spec 08). Those files answer a fixed rubric
(FQTN / Identifiers / Date-time / Top-N / JSON) — but **series/spine generation is
not in that rubric yet**. So the date-spine syntax belongs *there*, alongside the
other per-dialect idioms, and the dialect-agnostic skill points to it. This
routes the dialect-specific half through the existing channel rather than
standing up a parallel dialect-specific recipe inside the skill.

Surface 1 (skill) carries the **pattern**; surface 2 (dialect notes) carries the
**concrete series syntax**.

### Additive, inline, heuristic-with-a-why

Consistent with spec 07: the skill change is **additive content in one Markdown
file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/` file — the
delivery mechanism in `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic,
and phrased as a **heuristic with a one-line generic rationale**, not a wall of
MUSTs. The dialect-notes change is additive content in the seven existing
`dialects/*.md` files. No new tool, flag, or config on either surface.

## Requirements

### 1. Skill surface — `<sql_craft>` "Answer completeness / interpretation"

Add the panel-completeness rule to the existing group (it extends, and should sit
adjacent to, the *"For each X / per X / by X"* bullet). It must cover:

1. **Recognize the full-panel cue.** *each / every / all / per <period> / for all
   <category> / by month* signals that the answer's row set should be the
   **complete expected domain** of periods or categories in scope, not just those
   present in the filtered fact rows. *Why:* a plain `GROUP BY` over the facts can
   only emit groups that have at least one fact row.

2. **Spine → LEFT JOIN → COALESCE.** Build the full set of expected groups (the
   **spine**), then LEFT JOIN the aggregated facts onto it:
   - **Category/dimension spine:** the distinct values from the **domain-defining
     dimension/entity table** (e.g. all regions from a `regions` table), *not*
     `SELECT DISTINCT region FROM facts` — the latter yields only categories that
     already occur, so a zero-activity category still drops. When no dimension
     table exists, the distinct values from the **unfiltered** fact table are the
     best available domain (with the residual caveat that a category which never
     occurs at all cannot surface).
   - **Period/number spine:** generate the series for the question's stated range
     (e.g. each month of 2023 → Jan..Dec 2023). The series bounds come from the
     question's explicit range; when the range is "all periods present," derive
     bounds from `MIN`/`MAX` over the **unfiltered** facts. The concrete
     series-generation syntax is per-dialect — the rule points the author to
     `sql_dialect_notes` (see requirement 2) and shows no inline series SQL.

3. **COALESCE by measure additivity.** Default missing measures with
   `COALESCE(metric, 0)` for **additive** measures (a `COUNT` or `SUM` of events
   or amounts — "no activity" genuinely reads as 0). Leave **non-additive**
   measures (`AVG`, a running balance, a price, a rate, a ratio) as **NULL** —
   absence is "no data," and 0 would be a wrong reading. *Why:* 0 is a real value
   only for additive measures.

4. **Don't over-apply (the each-vs-which guard).** When the question asks only
   about groups that exist ("*which* months had orders", "regions that made a
   sale"), the spine is unnecessary and wrong — emit only observed groups. The cue
   is *each / all / every* (complete domain) vs *which / that have* (observed
   subset).

5. **One worked example — the category spine, fully portable.** Add **exactly
   one** compact before/after example demonstrating the pattern with a
   **distinct-dimension spine**: the wrong shape (`GROUP BY` over facts, empty
   groups missing) and the right shape (`SELECT DISTINCT` domain from the
   dimension table → LEFT JOIN aggregated facts → `COALESCE(metric, 0)`). Generic
   table/column names, standard SQL only — no series generation, no dialect
   functions, so the example stays dialect-clean. The period-spine variant is
   described in prose (requirement 2) and delegated to `sql_dialect_notes`; it
   gets **no** inline example. This is the **third** worked `sql` example in the
   skill (after spec 07's window-then-filter and spec 09's multi-hop fan-out).

6. **Step pointer, no duplication.** The validate/explain step (and/or the query
   step) already points into `<sql_craft>` for answer-completeness; extend that
   existing pointer's wording if needed, but state the rule **once** inside
   `<sql_craft>`. The step-5 pointer that lists what `sql_dialect_notes` provides
   ("FQTN, identifier-quoting, date, top-N, and JSON conventions") should also
   name the **series/calendar** convention now that it exists.

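Requirements 2 and 3 above can be made concrete with a small category-spine sketch. It assumes an invented `regions` dimension and `orders` fact table (not the shipped example's exact text), uses only standard SQL, and runs on Python's bundled `sqlite3` purely for convenience. It also shows the additivity split: the count defaults to 0 while the average stays `NULL`.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE regions (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region_id INTEGER, amount REAL);
INSERT INTO regions VALUES (1, 'North'), (2, 'South'), (3, 'West');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 40.0);
""")

# WRONG for "orders per region, for each region": West has no rows, so it vanishes.
wrong = con.execute("""
SELECT region_id, COUNT(*) AS n_orders
FROM orders
GROUP BY region_id
ORDER BY region_id
""").fetchall()

# RIGHT: spine from the dimension table, LEFT JOIN the aggregated facts,
# COALESCE only the additive measure.
right = con.execute("""
WITH spine AS (
    SELECT DISTINCT region_id FROM regions
),
region_orders AS (
    SELECT region_id, COUNT(*) AS n_orders, AVG(amount) AS avg_amount
    FROM orders
    GROUP BY region_id
)
SELECT s.region_id,
       COALESCE(ro.n_orders, 0) AS n_orders,  -- additive: no activity reads as 0
       ro.avg_amount                          -- non-additive: no data stays NULL
FROM spine s
LEFT JOIN region_orders ro ON ro.region_id = s.region_id
ORDER BY s.region_id
""").fetchall()

print(wrong)  # [(1, 2), (2, 1)]
print(right)  # [(1, 2, 15.0), (2, 1, 40.0), (3, 0, None)]
```

For a "which regions had orders" question the first query is the correct one, which is the each-vs-which guard in requirement 4.
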
### 2. Dialect-notes surface — `dialects/*.md`

Add a **"Series"** (date/number range) line to **each** of the seven authored
dialect files, giving that engine's idiomatic way to generate a contiguous
date or integer series for use as a spine. Each note is engine-exclusive — a
SQLite analyst gets the SQLite idiom and never another engine's construct, per the
existing dialect-notes leak guards. Orientation (exact syntax is the
implementer's):

- **postgres:** `generate_series('2023-01-01'::date, '2023-12-01'::date, interval '1 month')`.
- **sqlite:** recursive CTE — `WITH RECURSIVE m(d) AS (SELECT '2023-01-01' UNION ALL SELECT date(d,'+1 month') FROM m WHERE d < '2023-12-01')`.
- **bigquery:** `UNNEST(GENERATE_DATE_ARRAY('2023-01-01','2023-12-01', INTERVAL 1 MONTH))` (and `GENERATE_ARRAY` for integers).
- **snowflake:** `TABLE(GENERATOR(ROWCOUNT => n))` with `DATEADD('month', SEQ4(), start)`, or a recursive CTE.
- **mysql:** recursive CTE (8.0+) with `DATE_ADD(d, INTERVAL 1 MONTH)`.
- **clickhouse:** `numbers(n)` / `range(n)` with `addMonths(start, number)` (or `arrayJoin`).
- **tsql:** recursive CTE with `DATEADD(month, …)`, or a numbers/tally table.

This line is what makes the period spine usable from the dialect-agnostic skill,
and it is also consumed by **spec 11** (rolling-window-over-gappy-dates needs the
same date spine) — so it is foundational, not scope creep.

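As an illustration of how one Series idiom feeds the spine pattern, the sketch below runs the SQLite recursive CTE from the list above against an invented `orders` table. It is deliberately engine-specific (it belongs on the dialect-notes surface, not in the skill), and the table, column names, and rows are assumptions made for this example.

```python
import sqlite3

# Hypothetical fact table; March 2023 intentionally has no orders.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
INSERT INTO orders (order_date) VALUES
    ('2023-01-05'), ('2023-01-20'), ('2023-02-10'), ('2023-04-02');
""")

rows = con.execute("""
WITH RECURSIVE m(d) AS (
    -- SQLite Series idiom: first day of each month of 2023
    SELECT '2023-01-01'
    UNION ALL
    SELECT date(d, '+1 month') FROM m WHERE d < '2023-12-01'
),
monthly AS (
    SELECT date(order_date, 'start of month') AS d, COUNT(*) AS n_orders
    FROM orders
    WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
    GROUP BY date(order_date, 'start of month')
)
SELECT m.d, COALESCE(monthly.n_orders, 0) AS n_orders
FROM m
LEFT JOIN monthly ON monthly.d = m.d
ORDER BY m.d
""").fetchall()

print(len(rows))  # 12
print(rows[:4])   # [('2023-01-01', 2), ('2023-02-01', 1), ('2023-03-01', 0), ('2023-04-01', 1)]
```

Only the `m` CTE changes from engine to engine; the `LEFT JOIN` and `COALESCE` half is the portable pattern the skill states.
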
### 3. Coordination with spec 11

Spec 11 (time-series window recipes) explicitly depends on this date spine for the
gappy-rolling case ("build a complete date spine first (see spec 10)"). Spec 10
establishes the spine concept in the Answer-completeness group and the
series syntax in the dialect notes; spec 11 reuses both from the Window-functions
group. Keep the two non-overlapping: spec 10 owns the spine; spec 11 references it.

## Leak-safety (hard constraint)

Any worked example or note must use a **synthetic generic schema** (e.g. an
`orders` table with an `order_date`, a `regions` dimension) and demonstrate only
the *pattern* (spine + LEFT JOIN + COALESCE). **No** benchmark table names, SQL,
or result values on either surface. The dialect-notes additions, like the existing
notes, carry no benchmark/grader/version-dated content. The behavior is
reconstructable from first principles and tied to no specific instance.

## Acceptance criteria

- `<sql_craft>` "Answer completeness / interpretation" states: the full-panel cue,
  the spine → LEFT JOIN → COALESCE recipe, the additive-vs-non-additive COALESCE
  discriminator (0 vs NULL), and the each-vs-which over-application guard —
  inline, dialect-agnostic, each with a generic *why*.
- Exactly **one** new worked `sql` example is present, a portable
  distinct-dimension spine (`SELECT DISTINCT` domain → LEFT JOIN → `COALESCE`),
  with no series generation and no dialect-specific syntax. The skill then carries
  **three** `sql` worked examples total.
- Each of the seven `dialects/*.md` files gains a **Series** (date/number range)
  line in its engine's own idiom; no engine leaks another engine's construct, and
  the additions contain no benchmark/grader/version-dated content.
- The skill remains dialect-clean: no `QUALIFY`, `strftime`, `julianday`,
  `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, or other
  single-dialect construct anywhere in `SKILL.md`, including the new example.
- The existing interactive guidance (`<workflow>`, `<rules>`, the other examples)
  and the existing dialect-note rubric lines are intact and uncontradicted.
- No grader/benchmark reference, no output-shape contract, and no anchoring of
  *relative* time ("recent" / "past N months") to a `MAX(date)` over the data
  appears (period-spine bounds derive from the question's explicit range or, for
  "all periods present," from `MIN`/`MAX` over the unfiltered facts — which is
  range derivation, not relative-time anchoring).
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
  still parses as `ktx-analytics`.

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.

- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the
  panel-completeness bullets to the Answer-completeness group, the single category
  spine example, and extend the existing step pointer / dialect-notes provision
  list to name the series convention. Leave `<workflow>`/`<rules>`/other examples
  intact. Delivery is unchanged (single `SKILL.md` per target via
  `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change required.
- **Dialect notes:** the seven files under
  `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
  `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
  `copy-runtime-assets.mjs` — no plumbing change, content only.
- **Tests:**
  - `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
    representative phrase for the completeness rule; bump the `sql`-fence count
    assertion **2 → 3**; assert the spine + LEFT JOIN + `COALESCE` shape; the
    existing dialect-clean guards already cover the no-inline-series requirement
    (the example is `SELECT DISTINCT`, so they pass unchanged).
  - `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the rubric loop
    (the "answers the full rubric for every dialect" test) so every dialect must
    also answer a **Series** line, e.g. `expect(notes).toMatch(/\*\*Series/)`.
    Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
    all seven without a hand-maintained list.
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
  `pnpm run build && pnpm run link:dev`.

## Benchmark context (motivation only)

Per-period / per-category questions where some periods are empty produce
short-row result mismatches in the SQLite subset, and the related rolling/cumulative
cluster (spec 11) needs a complete date spine to be correct at all. The fix is a
universal reporting habit (complete panels) plus the per-dialect series syntax
that makes it executable — both belong in the product, where they help real
analysts. Improving the benchmark score is a side effect; the skill and the
dialect notes contain no trace of the benchmark.

## Implementation notes

Shipped on branch `write-feature-spec-wiki`. Content-only across two surfaces, no
new tool/flag/config, no plumbing change.

**Surface 1 — skill (`packages/cli/src/skills/analytics/SKILL.md`):**
- Added a **"Complete the panel for 'each / every / all / per <period or
  category>'"** bullet to the `<sql_craft>` "Answer completeness / interpretation"
  group, directly after the *"For each X / per X / by X"* bullet, with three
  sub-bullets carrying the rest of the rule, each with its generic *why*: **Spine
  source** (distinct domain from the dimension/entity table — not `SELECT DISTINCT`
  over the facts; period/number series across the question's stated range, bounds
  from `MIN`/`MAX` over the *unfiltered* facts for "all periods present"; series
  syntax delegated to `sql_dialect_notes`), **Default by additivity**
  (`COALESCE(metric, 0)` for additive measures, `NULL` for non-additive), and
  **Don't over-apply** (the each-vs-which guard).
- Added **one** worked `sql` example at the end of the Answer-completeness group: a
  portable distinct-dimension spine (`SELECT DISTINCT region_id FROM regions` →
  `LEFT JOIN` aggregated facts → `COALESCE(ro.n_orders, 0)`), wrong-vs-right,
  standard SQL only, no series generation, no dialect functions. The skill now
  carries **three** `sql` worked examples.
- Extended the step-5 dialect-notes pointer to name the **series/calendar**
  convention alongside FQTN / identifier-quoting / date / top-N / JSON.
- Delivery unchanged: `readAnalyticsSkillContent` in `setup-agents.ts` ships the
  single `SKILL.md` per target — confirmed, no change.

**Surface 2 — dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`):**
- Added a `- **Series:**` line to all seven authored files (postgres, sqlite,
  bigquery, snowflake, mysql, clickhouse, tsql), each in that engine's own idiom
  (`generate_series`; recursive CTE with `date(d,'+1 month')`;
  `UNNEST(GENERATE_DATE_ARRAY(...))`; `GENERATOR`/`SEQ4`/`DATEADD`; recursive CTE
  with `DATE_ADD`; `numbers(n)`/`addMonths`; recursive CTE with `DATEADD` +
  `MAXRECURSION`), placed right after each file's Date/time line. No cross-engine
  leak, no version-dated/benchmark content. Shipped to `dist` unchanged by
  `copy-runtime-assets.mjs`; coverage stays derived from `DIALECTS_WITH_NOTES`.

**Tests:**
- `test/skills/analytics-skill-content.test.ts`: added the `Complete the panel`
  and `Default by additivity` phrases; renamed the worked-examples test and bumped
  the `sql`-fence count **2 → 3**; asserted the spine + `LEFT JOIN` + `COALESCE`
  shape. Also added `generate_series` and `GENERATE_DATE_ARRAY` to the
  dialect-clean banned list — a deliberate **strengthening** beyond the spec's
  test orientation so the "no inline series" acceptance criterion is *enforced*,
  not merely incidentally true of a `SELECT DISTINCT` example.
- `test/context/mcp/dialect-notes.test.ts`: extended the "answers the full rubric
  for every dialect" loop with `expect(notes).toMatch(/\*\*Series/)`, so all seven
  dialects are required to answer a Series line (coverage derived from
  `DIALECTS_WITH_NOTES`, no hand-maintained list).

**Verification:** both affected test files pass (19 tests). `src` type-check and
`pnpm run build` are clean, and `copy-runtime-assets.mjs` placed the Series line in
all seven `dist` dialect files; `pnpm run link:dev` re-linked `ktx-dev`. Note: an
unrelated, pre-existing `tsconfig.test.json` type error in
`test/mcp-server-factory.test.ts` exists on this branch — untouched by this work
and outside its scope.

**Coordination with spec 11:** the per-dialect Series line is the foundational
date spine that spec 11 (rolling/cumulative windows over gappy dates) references.
Spec 10 owns the spine (Answer-completeness group + dialect Series notes); spec 11
will reference it from the Window-functions group. No overlap introduced.

@ -1,391 +0,0 @@
# Time-series window craft — running totals, rolling-over-time (min-periods), period-over-period

> Refined spec. Intake draft: `todo/11-time-series-window-recipes.md`.

## Problem

A large share of analytics questions are time-series shaped: a **running /
cumulative balance**, a **rolling N-day average**, or **period-over-period
growth**. The agent already knows window functions exist — spec 07 gave the
`<sql_craft>` "Window functions" group its determinism and window-then-filter
rules, and spec 10 added panel/period completeness — but it still gets the
*time-series specifics* wrong:

- a cumulative balance computed **without an explicit unbounded-preceding
  frame**, or with the implicit frame misbehaving when there are **ties on the
  order key**;
- "rolling 30 days" implemented as `ROWS BETWEEN 29 PRECEDING` over **gappy**
  daily data, so the window spans the wrong calendar span when days are missing;
- no **minimum-periods** handling — a rolling average reported before the window
  is actually full;
- "growth vs the previous period" written **without `LAG`** (or against the wrong
  neighbor), with an **unguarded** `(cur - prev) / prev` that breaks on a zero or
  absent prior.

These are runnable-but-wrong: the structure is close, the edge case diverges.
It is the same failure shape spec 07 addressed at the general level; this spec
adds the time-series specifics to the **same Window-functions group**, building
on the rules already there rather than restating them.

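The first failure in the list above (implicit frame with ties on the order key) is easy to reproduce. The sketch below assumes an invented `txns` table with two transactions on the same date and runs on Python's bundled `sqlite3`, which needs a SQLite build with window-function support (3.25 or later); the SQL itself uses only the standard frame syntax.

```python
import sqlite3

# Hypothetical ledger; transactions 2 and 3 share the same date (a tie on the order key).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE txns (txn_id INTEGER PRIMARY KEY, account TEXT, t TEXT, amount INTEGER);
INSERT INTO txns VALUES
    (1, 'A', '2023-01-01', 10),
    (2, 'A', '2023-01-02', 20),
    (3, 'A', '2023-01-02', 30),
    (4, 'A', '2023-01-03', 40);
""")

rows = con.execute("""
SELECT txn_id,
       -- Implicit frame: RANGE up to the current row's peers, so ties are folded together.
       SUM(amount) OVER (PARTITION BY account ORDER BY t) AS implicit_total,
       -- Explicit ROWS frame plus a complete tie-breaker: one step per row.
       SUM(amount) OVER (
           PARTITION BY account
           ORDER BY t, txn_id
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS explicit_total
FROM txns
ORDER BY txn_id
""").fetchall()

for row in rows:
    print(row)
# (1, 10, 10)
# (2, 60, 30)   implicit frame already includes transaction 3
# (3, 60, 60)
# (4, 100, 100)
```

Both columns agree on the last row of each tie group, which is why the implicit-frame version tends to pass a casual check against final balances.
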
## Generic use case (independent of any benchmark)

- "Each account's month-end running balance over 2023" — a cumulative sum of
  monthly net over an ordered window.
- "30-day rolling average of daily revenue, only once 30 days of history exist."
- "Month-over-month revenue growth rate."

All three are bread-and-butter for any analyst on any time-series table, with no
benchmark in sight. The methodology is universal analyst craft, so it belongs in
the shipped skill — it transfers to every ktx user querying a live database.

## Model

The change is **additive content across two surfaces** — the same split spec 10
made, and for the same reason. The split is the central design decision; it
satisfies spec 07's hard dialect-agnostic invariant for `<sql_craft>` without
weakening it.

### Why two surfaces (the dialect-agnostic reconciliation)

Two of the three recipes are **pure standard SQL** and stay entirely in the
dialect-agnostic skill:

- **Cumulative / running total** — `SUM(x) OVER (... ROWS BETWEEN UNBOUNDED
  PRECEDING AND CURRENT ROW)` is standard on every engine.
- **Period-over-period** — `LAG(metric) OVER (...)`, the growth ratio, and a
  `NULLIF`-style divide-by-zero guard are standard on every engine.

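The period-over-period recipe in the second bullet can be sketched with standard SQL only. The `monthly_revenue` table and its rows are invented for illustration, and Python's bundled `sqlite3` (with window-function support, SQLite 3.25 or later) serves only as a convenient runner. February is set to zero on purpose so the `NULLIF` guard has something to catch.

```python
import sqlite3

# Hypothetical monthly series; February revenue is 0 to exercise the divide-by-zero guard.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE monthly_revenue (period TEXT PRIMARY KEY, revenue INTEGER);
INSERT INTO monthly_revenue VALUES
    ('2023-01', 100), ('2023-02', 0), ('2023-03', 50), ('2023-04', 75);
""")

rows = con.execute("""
WITH with_prev AS (
    SELECT period,
           revenue,
           LAG(revenue) OVER (ORDER BY period) AS prev
    FROM monthly_revenue
)
SELECT period,
       -- Full-precision ratio, guarded against a zero prior; round only in the final projection.
       ROUND((revenue - prev) * 1.0 / NULLIF(prev, 0), 4) AS growth
FROM with_prev
ORDER BY period
""").fetchall()

print(rows)
# [('2023-01', None), ('2023-02', -1.0), ('2023-03', None), ('2023-04', 0.5)]
```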
The third recipe — a **rolling window over calendar time** — has one piece that
is genuinely dialect-divergent: the **calendar-range window frame**. A native
range frame such as `RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`
exists on some engines (e.g. postgres, mysql 8) but **not others** — sqlite has
no date-interval range frame, and SQL Server has **no offset `RANGE` frames at
all**; bigquery's `RANGE` frames are numeric-only. So a portable skill cannot
inline a range frame any more than it could inline a date-series generator.

ktx already routes that kind of engine-specific syntax through the per-dialect
notes in `packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by
the `sql_dialect_notes` MCP tool (spec 08). Spec 10 established the precedent
exactly: series/spine generation was not in the dialect rubric, so it was added
there (the **Series** line) and the dialect-agnostic skill points to it.
Rolling-window framing is the next construct in that same position — not in the
rubric yet, dialect-specific — so the **rolling-window idiom belongs in the
dialect notes**, and the skill points to it.

Surface 1 (skill) carries the **pattern** (calendar range, not a row count; the
min-periods guard; the spine-or-range choice). Surface 2 (dialect notes) carries
the **concrete rolling-window frame syntax** per engine.

### Additive, inline, heuristic-with-a-why

Consistent with specs 07 and 10: the skill change is **additive content in one
Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled `reference/`
file — `setup-agents.ts` ships only `SKILL.md`), dialect-agnostic, and phrased as
**heuristics with a one-line generic rationale**, not a wall of MUSTs. The
dialect-notes change is additive content in the seven existing `dialects/*.md`
files. No new tool, flag, or config on either surface.

### Build on the rules already present; do not restate them

The Window-functions group already carries **"Make the ordering deterministic"**
(complete tie-breaker) from spec 07, and the Numeric-precision group carries
**"Round only at the end."** The cumulative and period-over-period recipes
**reference** these rather than repeat them (state each rule once — Anthropic's
"consistent terminology / don't repeat" guidance, already followed in spec 07).
Spec 10's **Series** dialect line is likewise **referenced** by the rolling
recipe's spine fallback, not duplicated.

|
||||
## Requirements
|
||||
|
||||
### 1. Skill surface — `<sql_craft>` "Window functions" group (three recipes)
|
||||
|
||||
Add three recipes to the **existing** "Window functions" group, after its two
|
||||
current bullets (deterministic ordering; filter-after-the-window). Each is a
|
||||
heuristic with a generic *why*, dialect-agnostic.
|
||||
|
||||
1. **Cumulative / running total.** Use an **explicit frame** — `SUM(x) OVER
|
||||
(PARTITION BY k ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` —
|
||||
with a **complete tie-breaker** on the `ORDER BY` (per the group's existing
|
||||
deterministic-ordering rule; reference it, do not restate). *Why:* a bare
|
||||
`ORDER BY` defaults to a `RANGE … CURRENT ROW` frame, which on **ties in the
|
||||
order key** folds every tied peer into the same cumulative value — it runs and
|
||||
looks plausible, but the running total jumps at each tie boundary.
|
||||
|
||||
2. **Rolling window over calendar time, plus minimum periods.** "Rolling N
   days/months" must span a **calendar range**, not a fixed row count: a `ROWS
   BETWEEN n-1 PRECEDING` frame silently measures the wrong span when days are
   missing. Two sanctioned techniques:
   - **Spine + `ROWS` (portable).** Build a gap-free date spine first (spec 10's
     **Series**, via `sql_dialect_notes`) so the data has one row per calendar
     unit; then a `ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW` frame equals the
     intended calendar span. The frame itself is dialect-agnostic; only the
     spine generator comes from the dialect notes.
   - **Native range frame or date-keyed self-join (engine-specific).** Where the
     engine supports it, a calendar **range frame** expresses the window directly;
     otherwise a self-join keyed on the date does. Both use engine-specific
     syntax: get the **rolling-window** idiom from `sql_dialect_notes` (see
     requirement 3); show no inline range frame in the skill.

   **Minimum periods.** When the question says "only after N periods of data" (or
   a rolling metric implies it), emit `NULL` / skip until the window is actually
   full: guard on a window count, e.g. `COUNT(*) OVER (<same frame>) = N`. On a
   gap-free spine, `COUNT(*)` counts calendar slots; count the **non-null
   observations** instead when "N periods" means N data points rather than N
   calendar units. *Why:* a row-count frame over missing dates measures the wrong
   span, and a partial early window is not the requested metric.

3. **Period-over-period.** Use `LAG(metric) OVER (PARTITION BY k ORDER BY period)`
   for the prior-period comparison; compute growth as `(cur - prev) / prev` at
   **full precision**, rounding only in the final projection (per the existing
   "Round only at the end" rule), and **guard divide-by-zero / NULL prev**
   (e.g. divide by `NULLIF(prev, 0)`). *Why:* without `LAG`, or with `LAG`
   ordered against the wrong neighbor, the comparison lands on the wrong period,
   and an unguarded ratio errors or returns garbage when the prior period is
   zero or absent.

**Step pointer (no duplication).** The step-5 `sql_dialect_notes` provision list
(currently "FQTN, identifier-quoting, date, top-N, series/calendar, and JSON
conventions") should also name the **rolling-window** convention now that it
exists. State each rule once inside `<sql_craft>`; the workflow steps only point
to it.

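A minimal sketch of the period-over-period recipe, meant for readers of this spec rather than as skill content. It runs the SQL through Python's built-in `sqlite3` module so it can be executed as-is; the `monthly_revenue(region, period, revenue)` table and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly_revenue (region TEXT, period TEXT, revenue INTEGER);
INSERT INTO monthly_revenue VALUES
  ('east', '2024-01',   0),
  ('east', '2024-02', 200),
  ('east', '2024-03', 250);
""")

# LAG fetches the prior row within each region; NULLIF turns a zero
# denominator into NULL so the ratio is NULL instead of an error.
# Rounding happens only in the final projection.
growth = conn.execute("""
    WITH with_prev AS (
        SELECT region, period, revenue,
               LAG(revenue) OVER (PARTITION BY region ORDER BY period) AS prev
        FROM monthly_revenue
    )
    SELECT period,
           ROUND((revenue - prev) * 1.0 / NULLIF(prev, 0), 4) AS growth
    FROM with_prev
    ORDER BY period
""").fetchall()

for period, value in growth:
    print(period, value)
```

The first period has no predecessor and the second follows a zero, so both come back as `NULL` (`None` in Python) rather than an error or a misleading ratio; only the third period yields a growth figure.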
### 2. One worked example: cumulative running total (dialect-agnostic)

Add **exactly one** new compact before/after `sql` example, demonstrating the
**cumulative running total**: the subtlest of the three (the implicit-frame trap
runs fine and is wrong only at tie boundaries) and the highest-value to show.
Use a synthetic generic schema that includes the tie-breaker column
(e.g. `account_txns(account_id, txn_id, txn_date, net)`):

- **Wrong:** `SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date)`. The
  implicit `RANGE` frame makes two txns on the same date share one inflated
  running balance.
- **Right:** the same with an explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND
  CURRENT ROW` frame and a complete tie-breaker (`ORDER BY txn_date, txn_id`).

Standard SQL only: no `QUALIFY`, no dialect functions, no series generation, no
`RANGE … INTERVAL`. Keep it ~10–14 lines. The **rolling-over-time** recipe gets
**no** inline example (its correct form needs the engine-specific frame/spine,
delegated to `sql_dialect_notes`, exactly as spec 10's period-spine variant was
prose-only); the **period-over-period** recipe is self-evident from its bullet
and also gets no example. This is the **fourth** worked `sql` example in the
skill, after spec 07 (window-then-filter), spec 09 (multi-hop fan-out), and
spec 10 (panel-completeness spine).

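The wrong-vs-right pair described in this section can be checked with a small runnable sketch. It uses Python's built-in `sqlite3` module purely as a harness; the four sample rows are invented, and the table follows the synthetic `account_txns` schema with a `txn_id` tie-breaker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account_txns (account_id INTEGER, txn_id INTEGER, txn_date TEXT, net INTEGER);
INSERT INTO account_txns VALUES
  (1, 1, '2024-01-01', 100),
  (1, 2, '2024-01-02',  50),   -- two transactions share this date
  (1, 3, '2024-01-02', -20),
  (1, 4, '2024-01-03',  10);
""")

# Wrong: a bare ORDER BY uses the implicit RANGE frame, so the two rows
# dated 2024-01-02 are peers and both report the total through the last peer.
implicit = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date) AS running
    FROM account_txns
    ORDER BY txn_id
""").fetchall()

# Right: an explicit ROWS frame plus a complete tie-breaker gives one
# distinct running balance per transaction.
explicit = conn.execute("""
    SELECT txn_id,
           SUM(net) OVER (PARTITION BY account_id ORDER BY txn_date, txn_id
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running
    FROM account_txns
    ORDER BY txn_id
""").fetchall()

print("implicit RANGE:", implicit)
print("explicit ROWS: ", explicit)
```

Both queries run without error and agree on every row except the tied date, which is why the implicit-frame version is easy to miss in review.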
### 3. Dialect-notes surface: `dialects/*.md` (rolling window)

Add a **rolling-window-over-time** idiom line to **each** of the seven authored
dialect files, parallel to spec 10's **Series** line. Each note is
engine-exclusive: a SQLite analyst gets the SQLite idiom and never another
engine's construct, per the existing dialect-notes leak guards. Each note either
gives the engine's native calendar-range frame **or** references its own
**Series** line for the spine + `ROWS` fallback (a cross-reference within the
file, not a duplicate of the Series line).

Orientation only. **`RANGE`-frame support genuinely varies by engine and
version, so the implementer must verify each engine's current support against
authoritative docs (context7 / the engine's manual) rather than assert it from
memory.** Starting points:

- **postgres:** native, `... OVER (ORDER BY day RANGE BETWEEN INTERVAL '29 days'
  PRECEDING AND CURRENT ROW)`.
- **mysql (8.0+):** native, `RANGE BETWEEN INTERVAL 29 DAY PRECEDING AND CURRENT
  ROW` over a temporal order key.
- **bigquery:** `RANGE` frames are **numeric**: range over an integer day key
  (e.g. `UNIX_DATE(day)`) with `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or
  build a spine (see **Series**) and use a `ROWS` frame.
- **sqlite:** **no** date-interval range frame: build a date spine (see
  **Series**) and use a `ROWS` frame.
- **tsql (SQL Server):** **no** offset `RANGE` frames at all: build a spine (see
  **Series**) and use a `ROWS` frame, or a date-keyed self-join.
- **snowflake / clickhouse:** range-frame support over dates is limited (verify);
  default to a spine (see **Series**) + `ROWS` frame where a native calendar range
  frame is unavailable.

This line is what makes the rolling-over-time recipe executable from the
dialect-agnostic skill. It is **distinct** from spec 10's Series line (Series =
how to *generate* a spine; Rolling window = how to compute a *moving
calendar-range aggregate*, natively or via that spine), and it cross-references
the Series line rather than overlapping it.

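To make the spine + `ROWS` fallback concrete, here is a hedged sketch for the SQLite case, run through Python's built-in `sqlite3` module. The recursive-CTE spine is only one plausible way to generate the series and is an assumption, not the wording of the shipped **Series** line; `daily_revenue(day, amount)` and its rows are synthetic, and zero-filling the missing days assumes an additive measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_revenue (day TEXT, amount INTEGER);
INSERT INTO daily_revenue VALUES
  ('2024-01-01', 10),
  ('2024-01-02', 20),
  ('2024-01-05', 40);   -- Jan 3 and Jan 4 are missing
""")

# Naive: a 3-row frame over gappy data spans five calendar days on Jan 5.
naive = conn.execute("""
    SELECT day,
           SUM(amount) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    FROM daily_revenue
    ORDER BY day
""").fetchall()

# Spine + ROWS: one row per calendar day, so 3 rows equal 3 days.
# COUNT(*) over the same frame is the minimum-periods guard.
rolling = conn.execute("""
    WITH RECURSIVE spine(day) AS (
        SELECT MIN(day) FROM daily_revenue
        UNION ALL
        SELECT date(day, '+1 day') FROM spine
        WHERE day < (SELECT MAX(day) FROM daily_revenue)
    ),
    filled AS (
        SELECT s.day, COALESCE(d.amount, 0) AS amount
        FROM spine s LEFT JOIN daily_revenue d ON d.day = s.day
    )
    SELECT day,
           CASE WHEN COUNT(*) OVER w = 3 THEN SUM(amount) OVER w END AS rolling_3d
    FROM filled
    WINDOW w AS (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    ORDER BY day
""").fetchall()

print("naive:  ", naive)
print("rolling:", rolling)
```

The spine bounds come from `MIN`/`MAX` over the unfiltered facts, which is the "all periods present" range derivation and not a redefinition of "recent". The first two days stay `NULL` because the three-day window is not yet full.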
### 4. Explicit constraints / exclusions

None of the following may appear (consistent with specs 07 and 10):

- **No inline dialect-specific range-frame syntax in the skill:** no
  `RANGE … INTERVAL` frame, no series generator, no dialect function. The skill
  stays dialect-clean; the range frame lives only in the dialect notes.
- **No anchoring of relative time to `MAX(date)`.** "Recent" / "past N months"
  means relative to *now* on a live database. A range *bound* may be derived from
  the question's explicit range or, for "all periods present," from `MIN`/`MAX`
  over the **unfiltered** facts (range derivation, per spec 10), but the metric
  must never silently redefine "recent" as the data's maximum date.
- **No grader / gold-answer / benchmark reference**, and no output-shape contract
  (the skill is for interactive analysis).

### 5. Coordination with specs 07 and 10

All three recipes live in the **existing** `<sql_craft>` "Window functions"
group; the two current bullets and the spec-07 window-then-filter example must
stay intact and uncontradicted.

- **Spec 07** owns the deterministic-ordering rule (Window functions) and the
  round-at-the-end rule (Numeric precision). Spec 11 **builds on** both: it
  references them and never restates them.
- **Spec 10** owns the spine concept and the dialect **Series** line. Spec 11
  **references** the spine for the gappy-rolling fallback and adds the **distinct**
  rolling-window dialect line. Keep them non-overlapping: spec 10 = how to make a
  spine; spec 11 = how to compute a moving calendar-range aggregate (native frame
  or spine + `ROWS`).

## Leak-safety (hard constraint)

Every worked example or note uses a **synthetic generic schema** (e.g.
`daily_revenue(day, amount)` or `account_txns(account_id, txn_id, txn_date, net)`)
and shows only the *pattern*. **No** benchmark table names, SQL, or result values
on either surface. The dialect-notes additions, like the existing notes, carry no
benchmark / grader / version-dated content. The behavior is reconstructable from
first principles and tied to no specific instance.

## Acceptance criteria

- The `<sql_craft>` "Window functions" group states the three recipes (inline,
  dialect-agnostic, each with a generic *why*, and each **building on**, not
  restating, the deterministic-ordering and round-at-the-end rules):
  - **cumulative / running total** with an explicit `ROWS BETWEEN UNBOUNDED
    PRECEDING AND CURRENT ROW` frame and a complete tie-breaker;
  - **rolling window over calendar time + minimum periods**: calendar range not
    row count, the spine-or-range choice, the min-periods `COUNT(*) OVER (...)`
    guard, with the engine's range-frame syntax delegated to `sql_dialect_notes`;
  - **period-over-period** via `LAG`, with full-precision growth and a
    divide-by-zero / NULL-prev guard.
- Exactly **one** new worked `sql` example: the cumulative running total,
  wrong-vs-right, with the explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT
  ROW` frame and a complete tie-breaker, in standard dialect-agnostic SQL. The
  skill then carries **four** `sql` worked examples total.
- Each of the seven `dialects/*.md` files gains a **rolling-window-over-time**
  idiom line in its engine's own idiom (native calendar-range frame where
  supported, otherwise a spine + `ROWS` fallback that references its **Series**
  line); no engine leaks another engine's construct, and the additions contain no
  benchmark / grader / version-dated content.
- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`,
  `generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, **and no
  inline `RANGE … INTERVAL` frame**, anywhere in `SKILL.md` including the new
  example.
- The step-5 `sql_dialect_notes` provision list names the **rolling-window**
  convention alongside FQTN / identifier-quoting / date / top-N / series/calendar /
  JSON.
- The existing interactive guidance (`<workflow>`, `<rules>`, the other
  examples), the two existing Window-functions bullets, the window-then-filter
  example, and the existing dialect-note rubric lines (including **Series**) are
  intact and uncontradicted.
- No grader / benchmark reference, no output-shape contract, and no anchoring of
  *relative* time ("recent" / "past N months") to a `MAX(date)` over the data.
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
  still parses as `ktx-analytics`.

## Implementation orientation

Line numbers drift; treat these as anchors, not addresses. The implementer owns
the prose.

- **Skill:** `packages/cli/src/skills/analytics/SKILL.md`. Add the three recipes
  to the "Window functions" group (after its two existing bullets), the single
  cumulative worked example, and extend the step-5 dialect-notes provision list to
  name the rolling-window convention. Leave `<workflow>` / `<rules>` / the other
  examples and the two existing window bullets intact. Delivery is unchanged
  (single `SKILL.md` per target via `readAnalyticsSkillContent` in
  `setup-agents.ts`); confirm, no change required.
- **Dialect notes:** the seven files under
  `packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
  `DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
  `copy-runtime-assets.mjs`, so this is content only, with no plumbing change.
  **Verify each engine's actual `RANGE`-frame support against authoritative docs
  before writing the idiom; do not assert from memory.**
- **Tests:**
  - `packages/cli/test/skills/analytics-skill-content.test.ts`: add a
    representative phrase for each of the three recipes; bump the `sql`-fence count
    assertion **3 → 4**; assert the cumulative example shape (e.g. `ROWS BETWEEN
    UNBOUNDED PRECEDING AND CURRENT ROW`); and **strengthen** the dialect-clean
    guard with a no-inline-`RANGE … INTERVAL` assertion (mirroring spec 10 adding
    `generate_series` / `GENERATE_DATE_ARRAY` to the banned list, so the
    "range frame lives only in the dialect notes" criterion is *enforced*, not
    incidentally true).
  - `packages/cli/test/context/mcp/dialect-notes.test.ts`: extend the "answers the
    full rubric for every dialect" loop with the rolling-window assertion, e.g.
    `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect must answer it.
    Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
    all seven without a hand-maintained list.
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
  `pnpm run build && pnpm run link:dev`.

## Benchmark context (motivation only)

Running-balance / rolling / period-over-period questions are the single largest
result-mismatch cluster in the SQLite subset (financial-transactions-style DBs):
cumulative balances with the wrong frame on ties, rolling windows that mis-span
gappy dates, partial early windows, and unguarded period-over-period ratios. The
methodology is universal analyst craft, so it belongs in the product's skill
(where it helps every real user) plus the per-dialect rolling-window syntax that
makes it executable, not in a benchmark-specific prompt. Depends on spec 10 (the
date spine) for the gappy-rolling fallback. Improving the benchmark score is a
side effect; the skill and the dialect notes contain no trace of the benchmark.

## Implementation notes

Shipped as additive content across the two surfaces the spec specified, with no
new tool, flag, or config.

**Skill (`packages/cli/src/skills/analytics/SKILL.md`).** Added the three recipes
to the existing `<sql_craft>` "Window functions" group, after its two bullets and
the spec-07 window-then-filter example: **Cumulative / running total** (explicit
`ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` + a tie-breaker, referencing
the deterministic-ordering rule), **Rolling window over calendar time, plus
minimum periods** (calendar range not row count; spine-or-native-range choice
delegated to `sql_dialect_notes`; the `COUNT(*) OVER (<same frame>) = N`
min-periods guard), and **Period-over-period** (`LAG` + full-precision growth +
`NULLIF` divide guard, referencing the round-at-the-end rule). Added one worked
`sql` example (the cumulative running total, wrong-vs-right, using
`account_txns(account_id, txn_id, txn_date, net)`), bringing the skill to four
worked examples. Extended the step-5 `sql_dialect_notes` provision list to name
the rolling-window convention. No inline `RANGE … INTERVAL` frame anywhere in the
skill; it stays dialect-clean.

**Dialect notes (`packages/cli/src/context/sql-analysis/dialects/*.md`).** Added a
**Rolling window over time** line to all seven files, parallel to the spec-10
**Series** line and cross-referencing it for the spine fallback.

**Deviation: `RANGE`-frame support verified against authoritative docs (the
spec's hard requirement), which corrected two of its starting points:**

- **postgres:** native interval frame, `RANGE BETWEEN INTERVAL '29 days'
  PRECEDING AND CURRENT ROW` (as the spec guessed).
- **mysql:** native interval frame over a temporal key, `RANGE BETWEEN INTERVAL
  29 DAY PRECEDING AND CURRENT ROW` (as guessed).
- **bigquery:** `RANGE` is numeric-only: range over `UNIX_DATE(day)` with
  `RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, or spine + `ROWS` (as guessed).
- **snowflake:** **corrected.** The spec said "limited; default to a spine," but
  Snowflake *does* support a native interval `RANGE` frame over a date/timestamp
  key and it is gap-tolerant, so the note gives the native frame
  (`RANGE BETWEEN INTERVAL '29 days' PRECEDING AND CURRENT ROW`), no spine needed.
- **clickhouse:** **corrected.** The spec said "limited; default to a spine," but
  ClickHouse supports a numeric `RANGE` offset over a `Date` column (counts in
  days, gap-tolerant); the `INTERVAL` form is unsupported (use seconds for
  `DateTime`). The note gives the numeric `RANGE` frame, with spine + `ROWS` as
  the fallback.
- **sqlite:** no date-interval range frame (no native date type): spine + `ROWS`
  (as guessed).
- **tsql:** `RANGE` supports only `UNBOUNDED`/`CURRENT ROW` (no offset frame):
  spine + `ROWS`, or a date-keyed self-join (as guessed).

**Tests.** `test/skills/analytics-skill-content.test.ts`: added a representative
phrase per recipe (plus `minimum periods`), bumped the `sql`-fence count 3 → 4,
asserted the cumulative example shape (`ROWS BETWEEN UNBOUNDED PRECEDING AND
CURRENT ROW` and the `ORDER BY txn_date, txn_id` tie-breaker), and strengthened
the dialect-clean guard with a no-inline-`RANGE … INTERVAL` regex.
`test/context/mcp/dialect-notes.test.ts`: extended the per-dialect rubric loop
with `expect(notes).toMatch(/\*\*Rolling/)`, so every dialect (derived from
`DIALECTS_WITH_NOTES`) must answer the rolling-window rubric.

**Verification.** Full `@kaelio/ktx` vitest suite green (3001 passed, 1 skipped);
`pnpm run build` mirrors both surfaces into `dist`; `pnpm run link:dev` refreshed
`ktx-dev`. Pre-existing, unrelated note: `tsc -p tsconfig.test.json` reports one
error in `test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts` cast) that is
present in committed branch code and untouched by this work.
|
||||
|
|
@ -1,405 +0,0 @@
|
|||
# Parse text-encoded numeric columns before doing math on them

> Refined spec. Intake draft: `todo/12-parse-text-encoded-numbers.md`.

## Problem

Numeric measures are often stored as **text** with human formatting: unit
suffixes (`"1.2K"`, `"3M"`, `"4B"`), currency symbols and thousands separators
(`"$1,200"`), percent signs (`"12%"`), or non-numeric sentinels for missing/zero
(`"-"`, `"N/A"`, `""`). Aggregating or comparing such a column directly is
**silently wrong**: a string comparison orders `"100" < "9"`, and a naive
`CAST(x AS REAL)` yields `0`/NULL/partial on the formatted values rather than the
intended number. The query runs, the shape looks right, the number is garbage.

The agent already samples schemas before composing: spec 07 gave the
`<sql_craft>` "Schema discovery before writing SQL" group its *"Sample before you
compose"* and *"Cast to the real type before comparing"* rules. But those rules
guard **encoding** (date format, nullability) and **type-mismatch in `WHERE`**;
they say nothing about a column whose declared/affinity type is text yet whose
*meaning* is numeric. When the agent sees a "numeric-looking" column it tends to
assume a real number type and skips the parse, so the arithmetic runs on the raw
strings. This spec adds the detect → parse/scale → verify habit to that same
group, building on the two rules already there rather than restating them.

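The two failure modes named at the top of this section (lexical ordering and the naive cast) are easy to reproduce. The sketch below uses Python's built-in `sqlite3` module, so the cast results shown are SQLite's behavior specifically; other engines may raise an error or return `NULL` instead. The sample strings are the ones listed in the Problem paragraph.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Text comparison is lexical: '1' sorts before '9', so '100' < '9' is true.
lexical = conn.execute("SELECT '100' < '9'").fetchone()[0]
numeric = conn.execute("SELECT 100 < 9").fetchone()[0]

# SQLite's plain CAST keeps the longest numeric prefix of the text and
# falls back to zero when there is none. It never errors and never NULLs.
samples = ["1.2K", "$1,200", "12%", "-", "N/A", ""]
naive = {
    s: conn.execute("SELECT CAST(? AS REAL)", (s,)).fetchone()[0]
    for s in samples
}

print("'100' < '9' as text:", bool(lexical))
print("100 < 9 as numbers: ", bool(numeric))
for text, value in naive.items():
    print(repr(text), "->", value)
```

Every formatted value produces a number, so nothing in the result hints that `"$1,200"` was read as zero or that `"1.2K"` lost its scale.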
## Generic use case (independent of any benchmark)

- A `trade_volume` column stored as `"1.2K" / "3M" / "-"` must become
  `1200 / 3000000 / 0` before you can sum it or compute a daily change.
- A `price` stored as `"$1,299.00"` must become `1299.00` before averaging.
- A `conversion_rate` stored as `"12%"` must become `0.12` before weighting it.

This is routine data hygiene on real, messy production tables: every analyst
hits text-encoded measures on some warehouse, with no benchmark in sight. The
methodology is universal craft, so it belongs in the shipped skill; it transfers
to every ktx user querying a live database.

## Model

The change is **additive content across two surfaces**, the same split specs 10
and 11 made, and for the same reason. The split is the central design decision;
it satisfies spec 07's hard dialect-agnostic invariant for `<sql_craft>` without
weakening it.

### Why two surfaces (the dialect-agnostic reconciliation)

The **detect → parse → scale** half is **pure portable SQL** and stays entirely
in the dialect-agnostic skill:

- Stripping `$` / `,` / `%` is a portable chained `REPLACE` over a small, known
  set of literal characters; no regex is needed.
- Suffix scaling (K=10³, M=10⁶, B=10⁹) is a portable `LIKE`/`CASE` expression.
- Sentinel mapping (`-` / `N/A` / empty → `0` or `NULL`) is a portable `CASE`.
- The final cast to a numeric type is `CAST(... AS DECIMAL(p, s))`, broadly
  portable. State the precision and scale explicitly: a bare `DECIMAL` defaults
  to scale 0 on MySQL and SQL Server and rounds away the fraction.

The **verify** half has one piece that is genuinely dialect-divergent: a
**failure-detecting numeric cast**, meaning a cast that signals (rather than
silently swallows) a value that did not parse. This is exactly what the
confirm-coverage step (requirement 1, item 3) needs, and it cannot be written
portably:

- **bigquery:** `SAFE_CAST(x AS FLOAT64)` → `NULL` on failure.
- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_CAST` → `NULL` on failure.
- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(...))` / `TRY_CONVERT` → `NULL`.
- **clickhouse:** `toFloat64OrNull(x)` / `toDecimal64OrNull(x, s)` → `NULL`.
- **postgres / mysql:** no `TRY_CAST`; guard with a numeric pattern test before
  casting. In Postgres syntax, e.g. `CASE WHEN x ~ '^-?[0-9]+([.][0-9]+)?$' THEN
  x::numeric END`; MySQL uses `REGEXP` in place of `~` and `CAST(... AS DECIMAL)`
  in place of `::numeric`. (A looser `[0-9.]+` class would admit `1.2.3`, which
  then fails the cast.)
- **sqlite (the gotcha):** a plain `CAST('abc' AS REAL)` returns **`0.0`** and
  `CAST('12abc' AS REAL)` returns **`12.0`**: it neither errors nor NULLs, so an
  `IS NULL` coverage check is **silently broken**. Detecting a failed parse needs
  a `GLOB`/`typeof` pattern guard.

So a portable skill cannot inline a safe cast any more than spec 10 could inline a
date-series generator or spec 11 a calendar range frame. ktx already routes that
kind of engine-specific syntax through the per-dialect notes in
`packages/cli/src/context/sql-analysis/dialects/<dialect>.md`, served by the
`sql_dialect_notes` MCP tool (spec 08). Specs 10 and 11 set the exact precedent:
a construct not yet in the dialect rubric, genuinely engine-specific, was added
there (the **Series** line; the **Rolling window** line) and the dialect-agnostic
skill points to it. The failure-detecting cast is the next construct in that same
position, so the **safe-cast idiom belongs in the dialect notes**, and the skill
points to it.

Surface 1 (skill) carries the **pattern** (detect the text encoding; parse/scale
in an early CTE; verify with a failure-detecting cast). Surface 2 (dialect notes)
carries the **concrete safe-cast syntax** per engine, including the sqlite
`CAST`-returns-0 gotcha.

The regex character-*strip* is deliberately **not** promoted to the dialect
notes: a portable chained `REPLACE` over a known character set is the opinionated
default, so there is no need for a per-dialect strip line (derive from need; one
default). The dialect surface gains exactly one thing, the safe cast, because
that is the only piece the portable path genuinely cannot express.

### Additive, inline, heuristic-with-a-why

Consistent with specs 07, 10, and 11: the skill change is **additive content in
one Markdown file** (`skills/analytics/SKILL.md`), inline (no bundled
`reference/` file, since `setup-agents.ts` ships only `SKILL.md`),
dialect-agnostic, and phrased as **heuristics with a one-line generic
rationale**, not a wall of MUSTs. The dialect-notes change is additive content in
the seven existing `dialects/*.md` files. No new tool, flag, or config on either
surface.

### Build on the rules already present; do not restate them

- The Schema-discovery group already carries **"Sample before you compose"** and
  **"Cast to the real type before comparing"** (spec 07). The detect rule
  **extends** the first (distinct-value sampling to learn the encoding) and the
  parse rule **complements** the second (text-meaning-numeric, not just
  text-vs-numeric literal mismatch). Reference them; do not repeat them.
- The sentinel **0-vs-NULL** choice is the **same additive-vs-non-additive
  judgment** spec 10 established in its *"Default by additivity"* rule (0 only
  when "no value" genuinely reads as 0; NULL otherwise). **Reference** that rule
  rather than restating the discriminator (state each rule once).

## Requirements

### 1. Skill surface: `<sql_craft>` "Schema discovery before writing SQL"

Add the text-encoded-numeric guidance to the **existing** group, after its two
current bullets. Phrase as heuristics, each with a generic *why*, dialect-agnostic.
It must cover:

1. **Detect text-encoded numerics during sampling.** When a column the question
   treats as a number is stored as text, sample its **distinct** values **before**
   composing to learn the encodings actually present: unit suffixes
   (`K`/`M`/`B`), currency symbols, thousands separators, percent signs, and
   non-numeric sentinels (`-`, `N/A`, empty). Never infer the format from the
   column name. *Why:* compared/aggregated as-is, the text sorts lexically
   (`'100' < '9'`) and a naive cast collapses formatted values to `0`/NULL,
   producing a silently wrong result instead of an error.

2. **Parse and scale in an early CTE.** Strip currency/separator/percent
   characters, multiply by the suffix scale (K=10³, M=10⁶, B=10⁹), map sentinels
   to `0` **or** `NULL` per the question's intent, then cast to a numeric type,
   all in **one early CTE**, so every downstream layer sees clean numbers. The
   `0`-vs-`NULL` choice for sentinels follows spec 10's **additive-vs-non-additive**
   rule (reference it; do not restate). *Why:* parsing once, up front, keeps the
   raw strings out of every later join, filter, and aggregate, so no downstream
   step can reintroduce the lexical-sort or cast-to-0 error.

3. **Confirm coverage (verify).** After parsing, sanity-check that **no
   intended-numeric value silently failed to parse**: a failed parse should
   surface as `NULL`, which is only visible with a **failure-detecting cast**.
   Note the divergence: a plain `CAST` errors on some engines and, on sqlite,
   returns `0`/partial rather than NULL, so use the engine's safe-cast idiom from
   `sql_dialect_notes` (requirement 3), then count residual NULLs among
   non-sentinel rows. *Why:* an encoding the sample missed would otherwise vanish
   as `0`/NULL instead of being caught.

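As an illustration of the detect step only, the sketch below samples distinct values and then buckets them by shape. It runs on SQLite through Python's built-in `sqlite3` module; the `metrics(label, value_text)` rows, the bucket names, and the specific `LIKE` patterns are assumptions chosen for this example, not wording intended for the skill.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (label TEXT, value_text TEXT);
INSERT INTO metrics VALUES
  ('a', '1.2K'), ('b', '3M'), ('c', '$1,200'), ('d', '12%'),
  ('e', '-'),    ('f', '-'),  ('g', '450'),    ('h', 'N/A');
""")

# Step 1: look at the distinct values actually present, most frequent first.
distinct_values = conn.execute("""
    SELECT value_text, COUNT(*) AS n
    FROM metrics
    GROUP BY value_text
    ORDER BY n DESC, value_text
""").fetchall()

# Step 2: bucket by shape so an unexpected encoding stands out as
# 'plain or unknown' instead of being silently cast.
shapes = conn.execute("""
    SELECT CASE
             WHEN value_text IN ('-', 'N/A', '')           THEN 'sentinel'
             WHEN value_text LIKE '%K' OR value_text LIKE '%M'
               OR value_text LIKE '%B'                     THEN 'unit suffix'
             WHEN value_text LIKE '%!%' ESCAPE '!'         THEN 'percent'
             WHEN value_text LIKE '$%' OR value_text LIKE '%,%'
                                                           THEN 'currency/separator'
             ELSE 'plain or unknown'
           END AS shape,
           COUNT(*) AS n
    FROM metrics
    GROUP BY shape
    ORDER BY shape
""").fetchall()

print(distinct_values)
print(shapes)
```

The shape buckets are only useful after the distinct sample has been read, because the patterns have to come from the values seen and never from the column name.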
### 2. One worked example: parse/scale, fully portable

Add **exactly one** new compact before/after `sql` example demonstrating the
parse-and-scale pattern on a synthetic generic schema
(e.g. `metrics(label, value_text)` with values like `'1.2K'`, `'$1,200'`, `'-'`):

- **Wrong:** `SUM(CAST(value_text AS REAL))` (or summing the raw strings). The
  formatted values collapse to `0`/partial, so the total is silently wrong.
- **Right:** an early CTE that strips symbols with chained `REPLACE`, applies a
  `CASE` for the K/M/B suffix scale, maps `'-'`/`'N/A'`/`''` to `0`, casts to
  `DECIMAL` with an explicit precision and scale, then `SUM`s the parsed column.

**Standard, portable SQL only** (no `REGEXP_REPLACE`, `SAFE_CAST`, `TRY_CAST`,
`TRY_TO_NUMBER`, `toFloat64OrNull`, `GLOB`, or any dialect function), so the
example stays dialect-clean. Keep it ~12–16 lines. The **verify** step gets **no**
inline example (its correct form needs the engine-specific safe cast, delegated to
`sql_dialect_notes`, exactly as spec 10's period-spine and spec 11's
rolling-window variants were prose-only).

This adds **one** worked `sql` example to the skill. Spec 11 independently adds
one as well; **do not hardcode the resulting total**: increment from the current
state. As of this writing the skill carries **three** examples (spec 07
window-then-filter, spec 09 multi-hop fan-out, spec 10 panel spine), so this is
the **fourth**; if spec 11 ships first it is the **fifth**. The fence-count test
assertion is incremented by one from its current value (see Acceptance criteria).

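The wrong-vs-right pair in this section can be exercised with the sketch below. Python's built-in `sqlite3` module is used only as a harness, so the "wrong" total reflects SQLite's prefix-cast behavior; the four rows are invented, and mapping `'-'` to `0` assumes the measure is additive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (label TEXT, value_text TEXT);
INSERT INTO metrics VALUES
  ('a', '1.2K'), ('b', '$1,200'), ('c', '-'), ('d', '3M');
""")

# Wrong: the naive cast keeps only a numeric prefix (or zero).
naive_total = conn.execute(
    "SELECT SUM(CAST(value_text AS REAL)) FROM metrics"
).fetchone()[0]

# Right: strip, scale, and map sentinels once in an early CTE,
# then aggregate the clean numeric column.
parsed_total = conn.execute("""
    WITH parsed AS (
        SELECT label,
               CASE
                 WHEN value_text IN ('-', 'N/A', '') THEN 0
                 ELSE CAST(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
                          value_text, '$', ''), ',', ''), 'K', ''), 'M', ''), 'B', '')
                      AS DECIMAL(18, 4))
                      * CASE
                          WHEN value_text LIKE '%K' THEN 1000
                          WHEN value_text LIKE '%M' THEN 1000000
                          WHEN value_text LIKE '%B' THEN 1000000000
                          ELSE 1
                        END
               END AS value_num
        FROM metrics
    )
    SELECT SUM(value_num) FROM parsed
""").fetchone()[0]

print("naive total: ", naive_total)
print("parsed total:", parsed_total)
```

The naive total is off by several orders of magnitude yet still looks like an ordinary number, which is the silent failure the parse CTE removes.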
### 3. Dialect-notes surface: `dialects/*.md` (safe cast)

Add a **"Safe cast"** idiom line to **each** of the seven authored dialect files,
parallel to spec 10's **Series** line and spec 11's **Rolling window** line. Each
line gives that engine's **failure-detecting numeric cast** (a cast that returns
`NULL`, or is detectably invalid, on a non-numeric input), which is what makes the
verify step correct on that engine. Each note is engine-exclusive (a SQLite
analyst gets the SQLite idiom and never another engine's construct, per the
existing dialect-notes leak guards). Orientation only: exact syntax is the
implementer's; verify against authoritative docs (context7 / the engine manual)
rather than asserting from memory:

- **postgres:** no `TRY_CAST`; guard with a numeric pattern before casting,
  e.g. `CASE WHEN x ~ '^-?[0-9]+([.][0-9]+)?$' THEN x::numeric END` (a looser
  `[0-9.]+` class would admit `1.2.3`, which then fails the cast).
  (`regexp_replace` is available for the strip, but chained `REPLACE` is the
  portable default.)
- **mysql (8.0+):** no `TRY_CAST`; guard with `x REGEXP '^-?[0-9]+([.][0-9]+)?$'`
  before `CAST(... AS DECIMAL(p, s))`; `REGEXP_REPLACE` is available for the strip.
- **bigquery:** `SAFE_CAST(x AS FLOAT64)` (or `SAFE_CAST(... AS NUMERIC)`) →
  `NULL` on failure.
- **snowflake:** `TRY_TO_NUMBER(x)` / `TRY_TO_DECIMAL(x, p, s)` / `TRY_CAST` →
  `NULL` on failure.
- **clickhouse:** `toFloat64OrNull(x)` / `toDecimal64OrNull(x, s)` → `NULL`.
- **tsql (SQL Server):** `TRY_CAST(x AS DECIMAL(18,4))` / `TRY_CONVERT` → `NULL`.
- **sqlite (the gotcha):** a plain `CAST` returns `0`/partial, **not** NULL or an
  error, so a coverage check must use a pattern guard such as
  `CASE WHEN cleaned GLOB '...' THEN CAST(cleaned AS REAL) END` (or a `typeof`
  check) to detect a value that did not parse.

This line is what makes the verify step executable from the dialect-agnostic
skill. It is **distinct** from the Series and Rolling-window lines (those generate
or window over a calendar; this detects a failed numeric parse). Phrase any
version note as `8.0+`-style, **not** "as of version …" (the dialect-notes test
bans version-dated wording).

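For the SQLite gotcha in particular, a small sketch shows why the `IS NULL` coverage check fails with a plain cast and what a pattern guard changes. It uses Python's built-in `sqlite3` module. The three `GLOB` conditions are a hypothetical guard written for this example (non-negative decimals only); they are not the pattern elided in the bullet above and not the wording of any shipped dialect note.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staged (id INTEGER PRIMARY KEY, cleaned TEXT);
INSERT INTO staged (cleaned) VALUES
  ('1200'), ('1.5'), ('12abc'), ('abc'), ('1.2.3');
""")

# Plain CAST never yields NULL here, so this check reports zero failures.
plain_failures = conn.execute(
    "SELECT COUNT(*) FROM staged WHERE CAST(cleaned AS REAL) IS NULL"
).fetchone()[0]

# Hypothetical guard: only digits and dots, at most one dot, at least one
# digit. Anything else becomes NULL and is therefore countable.
GUARDED_CAST = """
    CASE WHEN cleaned NOT GLOB '*[^0-9.]*'
          AND cleaned NOT GLOB '*.*.*'
          AND cleaned GLOB '*[0-9]*'
         THEN CAST(cleaned AS REAL) END
"""

guarded_failures = [
    row[0]
    for row in conn.execute(
        f"SELECT cleaned FROM staged WHERE ({GUARDED_CAST}) IS NULL ORDER BY id"
    )
]

print("plain CAST failures:  ", plain_failures)
print("guarded CAST failures:", guarded_failures)
```

With the guard in place, counting residual NULLs among non-sentinel rows becomes a meaningful coverage check on this engine.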
### 4. Explicit constraints / exclusions

None of the following may appear (consistent with specs 07, 10, and 11):

- **No inline dialect-specific cast/regex syntax in the skill:** no `SAFE_CAST`,
  `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`,
  `replaceRegexpAll`, or `GLOB` anywhere in `SKILL.md`. The portable strip is
  chained `REPLACE`; the failure-detecting cast lives only in the dialect notes.
- **No regex-strip dialect line.** The character strip stays the portable
  chained-`REPLACE` default; the dialect notes gain only the **safe cast**.
- **No grader / gold-answer / benchmark reference**, and no output-shape contract
  (the skill is for interactive analysis).

### 5. Coordination with specs 07, 08, 10, and 11

- **Spec 07** owns the Schema-discovery group and its two existing bullets
  (*"Sample before you compose"*, *"Cast to the real type before comparing"*).
  Spec 12 **extends** that group and **builds on** both bullets: it references
  them and never restates them; they must stay intact and uncontradicted.
- **Spec 08** owns the dialect-notes channel and its leak guards. Spec 12 adds one
  rubric line through that channel; the engine-exclusivity guards apply unchanged.
- **Spec 10** owns the additive-vs-non-additive discriminator (Answer
  completeness) and the dialect **Series** line. Spec 12 **references** the
  additivity rule for the sentinel `0`-vs-`NULL` choice; do not duplicate it.
- **Spec 11** independently adds the dialect **Rolling window** line, one `sql`
  example, and the **rolling-window** entry to the step-5 provision list. Spec 12
  touches the **same** three places (the dialect-notes rubric loop, the example
  count, and the step-5 list). Both are independent and additive, so **add to the
  current state, do not assume an order**: name **safe-cast** in the step-5 list
  without removing rolling-window/series; increment the example count by one from
  whatever it is; add `/\*\*Safe cast/` to the rubric loop alongside any
  `/\*\*Rolling/` assertion.

### 6. Step pointer (no duplication)
|
||||
|
||||
The step-5 `sql_dialect_notes` provision list (currently "FQTN,
|
||||
identifier-quoting, date, top-N, series/calendar, and JSON conventions"; spec 11
|
||||
also names rolling-window) should additionally name the **safe-cast** convention
|
||||
now that it exists. State each rule once inside `<sql_craft>`; the workflow steps
|
||||
only point to it.
|
||||
|
||||
## Leak-safety (hard constraint)
|
||||
|
||||
Every worked example or note uses a **synthetic generic schema** (e.g.
|
||||
`metrics(label, value_text)`) and made-up values (`'1.2K'`, `'$1,200'`, `'-'`),
|
||||
showing only the *pattern*. **No** benchmark table names, SQL, or result values on
|
||||
either surface. The dialect-notes additions, like the existing notes, carry no
|
||||
benchmark / grader / version-dated content. The behavior is reconstructable from
|
||||
first principles and tied to no specific instance.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- The `<sql_craft>` "Schema discovery before writing SQL" group states the three
|
||||
heuristics — inline, dialect-agnostic, each with a generic *why*, and each
|
||||
**building on** (not restating) the existing *"Sample before you compose"* and
|
||||
*"Cast to the real type before comparing"* bullets and spec 10's additivity rule:
|
||||
- **detect** text-encoded numerics by sampling distinct values (suffixes,
|
||||
symbols, separators, sentinels) — never from the column name;
|
||||
- **parse and scale** in an early CTE (strip → suffix-scale → sentinel map →
|
||||
cast), sentinel `0`-vs-`NULL` per spec 10's additivity rule;
|
||||
- **confirm coverage** with a failure-detecting cast, delegating the engine's
|
||||
safe-cast syntax to `sql_dialect_notes`.
|
||||
- Exactly **one** new worked `sql` example: parse-and-scale, wrong-vs-right, using
|
||||
chained `REPLACE` + `CASE` suffix scale + sentinel `CASE` + `CAST(... AS
|
||||
DECIMAL)`, in standard portable SQL. The `sql`-fence count assertion is
|
||||
incremented by **one** from its current value (3 today → 4; or 4 → 5 if spec 11
|
||||
shipped first).
|
||||
- Each of the seven `dialects/*.md` files gains a **"Safe cast"** line stating its
|
||||
engine's own failure-detecting numeric-cast idiom (including the sqlite
|
||||
`CAST`-returns-0 gotcha); no engine leaks another engine's construct, and the
|
||||
additions contain no benchmark / grader / version-dated content.
|
||||
- The skill remains **dialect-clean:** no `QUALIFY`, `strftime`, `julianday`,
|
||||
`generate_series`, `GENERATE_DATE_ARRAY`, backtick three-part FQTN, inline
|
||||
`RANGE … INTERVAL` frame, **and no `SAFE_CAST` / `TRY_CAST` / `TRY_TO_NUMBER` /
|
||||
`REGEXP_REPLACE` / `toFloat64OrNull` / `GLOB`**, anywhere in `SKILL.md`
|
||||
including the new example.
|
||||
- The step-5 `sql_dialect_notes` provision list names the **safe-cast** convention
|
||||
alongside FQTN / identifier-quoting / date / top-N / series-calendar /
|
||||
rolling-window / JSON.
|
||||
- The existing interactive guidance (`<workflow>`, `<rules>`, the other examples),
|
||||
the two existing Schema-discovery bullets, and the existing dialect-note rubric
|
||||
lines (including **Series** and, if present, **Rolling window**) are intact and
|
||||
uncontradicted.
|
||||
- No grader / benchmark reference, and no output-shape contract.
|
||||
- The skill stays scannable and comfortably under the 500-line budget; frontmatter
|
||||
still parses as `ktx-analytics`.
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
||||
the prose.
|
||||
|
||||
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md` — add the three
|
||||
heuristics to the "Schema discovery before writing SQL" group (after its two
|
||||
existing bullets), the single parse-and-scale worked example, and extend the
|
||||
step-5 dialect-notes provision list to name the safe-cast convention. Leave
|
||||
`<workflow>` / `<rules>` / the other examples and the two existing
|
||||
schema-discovery bullets intact. Delivery is unchanged (single `SKILL.md` per
|
||||
target via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no
|
||||
change required.
|
||||
- **Dialect notes:** the seven files under
|
||||
`packages/cli/src/context/sql-analysis/dialects/`. The list is kept in sync with
|
||||
`DIALECTS_WITH_NOTES` (`dialect-notes.ts`) and shipped to `dist` by
|
||||
`copy-runtime-assets.mjs` — no plumbing change, content only. **Verify each
|
||||
engine's actual safe-cast / try-cast support against authoritative docs before
|
||||
writing the idiom; do not assert from memory** (in particular the sqlite
|
||||
`CAST`-returns-0 behavior, which is the motivating gotcha).
|
||||
- **Tests:**
|
||||
- `packages/cli/test/skills/analytics-skill-content.test.ts` — add a
|
||||
representative phrase for each of the three heuristics (e.g. a *detect*, a
|
||||
*parse/scale*, and a *confirm-coverage* phrase) to the `represents every craft
|
||||
behavior` list; bump the `sql`-fence count assertion **by one** from its
|
||||
current value; assert the example shape (e.g. `REPLACE(` and `CAST(` and a
|
||||
suffix-scale multiplier); and **strengthen** the dialect-clean guard by adding
|
||||
`SAFE_CAST`, `TRY_CAST`, `TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`,
|
||||
and `GLOB` to the banned list (mirroring spec 10 adding `generate_series` /
|
||||
`GENERATE_DATE_ARRAY` and spec 11 adding the no-inline-`RANGE … INTERVAL`
|
||||
guard, so the "safe cast lives only in the dialect notes" criterion is
|
||||
*enforced*, not incidentally true).
|
||||
- `packages/cli/test/context/mcp/dialect-notes.test.ts` — extend the "answers
|
||||
the full rubric for every dialect" loop with the safe-cast assertion,
|
||||
`expect(notes).toMatch(/\*\*Safe cast/)`, so every dialect must answer it.
|
||||
Coverage is derived from `DIALECTS_WITH_NOTES`, so the new assertion enforces
|
||||
all seven without a hand-maintained list. Do **not** add a false-exclusivity
|
||||
assertion for `TRY_CAST` (it is shared by snowflake and tsql); requiring the
|
||||
line per dialect is sufficient.
|
||||
- Rebuild and re-link the dev binary so the playground picks up both surfaces:
|
||||
`pnpm run build && pnpm run link:dev`.
|
||||
|
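The two guards described in the test bullets can be sketched as follows. This is a Python stand-in for illustration only: the real tests are the TypeScript files named above, and the function names here are hypothetical. It assumes the banned tokens should match as whole words, so that an unrelated word such as `GLOBAL` is not flagged.

```python
import re

# Tokens the spec bans from SKILL.md (the safe cast lives only in dialect notes).
BANNED_IN_SKILL = [
    "SAFE_CAST",
    "TRY_CAST",
    "TRY_TO_NUMBER",
    "REGEXP_REPLACE",
    "toFloat64OrNull",
    "GLOB",
]


def banned_tokens_in(skill_text):
    """Return the banned dialect tokens that occur as whole words in the text."""
    return [
        token
        for token in BANNED_IN_SKILL
        if re.search(rf"\b{re.escape(token)}\b", skill_text)
    ]


def dialects_missing_safe_cast(notes_by_dialect):
    """Return the dialects whose notes lack a bolded 'Safe cast' rubric line."""
    return sorted(
        dialect
        for dialect, notes in notes_by_dialect.items()
        if not re.search(r"\*\*Safe cast", notes)
    )
```

Deriving the dialect list from one source of truth (here, the keys of the mapping passed in) mirrors the spec's point that coverage should follow `DIALECTS_WITH_NOTES` rather than a hand-maintained list.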
||||
## Benchmark context (motivation only)
|
||||
|
||||
At least one SQLite-subset question stores trading volume as suffix-encoded text
|
||||
(`"K"`/`"M"`, `"-"` for zero) and fails because the agent aggregates the raw
|
||||
strings — runnable, plausible, wrong. The sqlite `CAST`-returns-0 behavior makes
|
||||
the failure especially insidious: there is no error to alert the agent, and a
|
||||
naive `IS NULL` coverage check would not catch it either, which is precisely why
|
||||
the safe-cast idiom belongs in the dialect notes. The fix — parse messy encodings
|
||||
before math, then verify coverage with a failure-detecting cast — is universal
|
||||
data hygiene that helps any analyst on any warehouse, so it belongs in the
|
||||
product's craft (skill) plus the per-dialect safe-cast syntax that makes the
|
||||
verify step executable, not in a benchmark-specific prompt. Improving the
|
||||
benchmark score is a side effect; the skill and the dialect notes contain no trace
|
||||
of the benchmark.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Shipped on branch `write-feature-spec-wiki`, on top of specs 10 and 11 (both already
|
||||
applied in the working tree). Built from the current state per the "do not assume an
|
||||
order" guidance — there were **four** worked examples (specs 07 window-then-filter,
|
||||
09 multi-hop fan-out, 10 panel spine, 11 cumulative running total), so this is the
|
||||
**fifth**, and step 5 already named `series/calendar, rolling-window`.
|
||||
|
||||
**Skill — `packages/cli/src/skills/analytics/SKILL.md`:**
|
||||
- Added the three heuristics to the **"Schema discovery before writing SQL"** group,
|
||||
after the two existing bullets: *Parse text-encoded numerics before doing math on
|
||||
them* (detect by sampling distinct values, extending *Sample before you compose*,
|
||||
never inferring from the column name), *Strip, scale, and cast in one early CTE*
|
||||
(the *meaning-is-numeric* complement to *Cast to the real type before comparing*,
|
||||
with the sentinel `0`-vs-`NULL` choice deferred to spec 10's *Default by
|
||||
additivity* rule), and *Confirm the parse covered every value* (failure-detecting
|
||||
cast from `sql_dialect_notes`). Each carries a one-line generic *why*; the existing
|
||||
bullets and the additivity rule are referenced, not restated.
|
||||
- Added **one** portable worked example (`metrics(label, value_text)` with `'1.2K'`,
|
||||
`'3M'`, `'$1,200'`, `'-'`): wrong = `SUM(CAST(value_text AS REAL))`; right = an
|
||||
early `parsed` CTE that strips with chained `REPLACE`, scales the K/M/B suffix with
|
||||
a `CASE`, maps sentinels to `0`, casts to `DECIMAL(18,4)`, then `SUM`s. Standard
|
||||
portable SQL only — no dialect functions, no inline safe cast.
|
||||
- Step 5 dialect-notes provision list now names **safe-cast** alongside the others.
|
||||
|
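For orientation, the wrong-vs-right pattern described in the skill bullets above might look like the following sketch. It is a reconstruction under assumptions, not the shipped example text: the exact `REPLACE` chain, the sentinel handling, and running it on SQLite through Python's `sqlite3` module are choices made here to keep it executable.

```python
import sqlite3

# Synthetic rows matching the made-up values named in the notes.
ROWS = [("a", "1.2K"), ("b", "3M"), ("c", "$1,200"), ("d", "-")]

# WRONG: does math on the raw text; suffixes and symbols are silently mangled.
WRONG = "SELECT SUM(CAST(value_text AS REAL)) FROM metrics"

# RIGHT: strip, scale, map the sentinel, and cast in one early CTE, then sum.
RIGHT = """
WITH parsed AS (
  SELECT
    label,
    CASE
      WHEN value_text = '-' THEN 0   -- sentinel mapped to 0 (additive measure)
      ELSE CAST(
             REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
               value_text, '$', ''), ',', ''), 'K', ''), 'M', ''), 'B', '')
             AS DECIMAL(18,4))
           * CASE
               WHEN value_text LIKE '%K' THEN 1000
               WHEN value_text LIKE '%M' THEN 1000000
               WHEN value_text LIKE '%B' THEN 1000000000
               ELSE 1
             END
    END AS value_num
  FROM metrics
)
SELECT SUM(value_num) FROM parsed
"""


def run_totals(rows):
    """Return (wrong_total, right_total) for the given (label, value_text) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metrics (label TEXT, value_text TEXT)")
    conn.executemany("INSERT INTO metrics VALUES (?, ?)", rows)
    wrong = conn.execute(WRONG).fetchone()[0]
    right = conn.execute(RIGHT).fetchone()[0]
    conn.close()
    return wrong, right


wrong_total, right_total = run_totals(ROWS)
```

On SQLite the wrong query still returns a number (about 4.2 for these rows) rather than an error, which is what makes the failure easy to miss; the parsed CTE yields the intended total of 3,002,400.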
||||
**Dialect notes — `packages/cli/src/context/sql-analysis/dialects/*.md`:** added a
|
||||
**Safe cast** line to all seven files (after the *Rolling window* line), each giving
|
||||
that engine's failure-detecting numeric cast: postgres/mysql use a numeric pattern
|
||||
guard before casting (no `TRY_CAST`; mysql's bare `CAST` returns `0` with a warning);
|
||||
bigquery `SAFE_CAST`; snowflake `TRY_TO_NUMBER`/`TRY_TO_DECIMAL`/`TRY_CAST`; tsql
|
||||
`TRY_CAST`/`TRY_CONVERT`; clickhouse `toFloat64OrNull`/`toDecimal64OrNull` (the
|
||||
`...OrZero` variants return `0`); sqlite documents the `CAST`-returns-`0.0`/partial
|
||||
gotcha and a `GLOB` pattern guard. ClickHouse function names were verified against
|
||||
the official docs via context7 (the spec's loose `toDecimalOrNull` is not a real
|
||||
name — the `to<Type>OrNull` family requires a bit width, hence `toDecimal64OrNull`).
|
||||
No version-dated wording.
|
||||
|
||||
**Tests:** `analytics-skill-content.test.ts` — added the three representative
|
||||
phrases, bumped the `sql`-fence count 4 → 5 (and the test title), asserted the
|
||||
example shape (`WITH parsed AS`, `REPLACE(`, `AS DECIMAL(`, `LIKE '%K' THEN 1000`),
|
||||
and strengthened the dialect-clean banned list with `SAFE_CAST`, `TRY_CAST`,
|
||||
`TRY_TO_NUMBER`, `REGEXP_REPLACE`, `toFloat64OrNull`, and `GLOB` (mirroring spec 10's
|
||||
`generate_series` / spec 11's inline-`RANGE … INTERVAL` guards). `dialect-notes.test.ts`
|
||||
— added `expect(notes).toMatch(/\*\*Safe cast/)` to the per-dialect rubric loop, so
|
||||
all seven (derived from `DIALECTS_WITH_NOTES`) must answer it; no false-exclusivity
|
||||
assertion for the shared `TRY_CAST`.
|
||||
|
||||
**Verification:** both affected test files pass (19 tests); broader `test/skills` +
|
||||
`test/context/mcp` pass (65 tests); production type-check (`tsc -p tsconfig.json`)
|
||||
is clean; `pnpm run build` copies both surfaces into `dist` (7 dialect files carry
|
||||
*Safe cast*, the built `SKILL.md` carries the parse example) and `pnpm run link:dev`
|
||||
relinks `ktx-dev`. One **pre-existing, unrelated** type error remains in the
|
||||
test-only config (`test/mcp-server-factory.test.ts:152`, byte-identical to HEAD,
|
||||
untouched here) — out of scope for this spec.
|
||||
|
|
@ -1,336 +0,0 @@
|
|||
# Output completeness — answer every requested part, enforced by a final pre-emit check
|
||||
|
||||
> Refined spec. Intake draft: `todo/14-output-completeness-final-check.md`.
|
||||
|
||||
## Problem
|
||||
|
||||
The single largest correctness failure mode for the analytics skill is
|
||||
**incomplete output**: the query runs and the methodology is roughly right, but
|
||||
the projection is missing columns the question asked for. The SQL is runnable and
|
||||
the aggregate is correct — the answer is simply *short by columns*. Three
|
||||
recurring shapes:
|
||||
|
||||
1. **Multi-part questions answered partially.** A question that asks for several
|
||||
things ("report the highest *and* the lowest month, each with its count and
|
||||
average, *and* the difference") comes back with only the first clause — one
|
||||
column where several were requested.
|
||||
2. **Identity dropped.** Grouping by a human-readable name but not projecting the
|
||||
entity's identifier (a product name without its product id, a customer name
|
||||
without its customer id).
|
||||
3. **Inputs to a derived value dropped.** Returning a ratio / percentage /
|
||||
difference but not the underlying counts the question also asked for.
|
||||
|
||||
Shapes 2 and 3 are **already covered** by shipped `<sql_craft>` rules — spec 07's
|
||||
*"Expose identity, not just the label"* and *"Keep the inputs to a derived
|
||||
value"* — yet they are frequently **not applied**. So the gap is not missing
|
||||
knowledge: these rules sit as passive heuristics in a list, and nothing makes the
|
||||
agent reliably check them before finalizing. The fix is twofold: (a) add the
|
||||
missing **multi-part-completeness** rule that generalizes shapes 1–3, and (b)
|
||||
turn output-completeness into an **explicit final verification step** the agent
|
||||
performs before emitting SQL, so the existing identity/inputs rules are actually
|
||||
enforced rather than merely listed.
|
||||
|
||||
The failure is **model-independent**: a markedly stronger model produced the same
|
||||
incomplete-output mistakes on these questions, which means it is a
|
||||
craft/enforcement gap, not a capability gap — exactly the kind of universal
|
||||
analyst craft that belongs in the shipped skill.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
An analyst is asked: *"For each region, report the highest and the lowest monthly
|
||||
order count, and the difference between them."* A complete answer has one column each for
|
||||
the region's id and name, the highest count, the lowest count, and the difference
|
||||
— five columns. Returning just the region and a single number answers only part
|
||||
of the request. This is a universal expectation on any database: answer **every**
|
||||
part of a multi-part request, identify the entities, and show the inputs behind
|
||||
any derived figure — and answer *exactly* that, without padding the result with
|
||||
columns the question never asked for.
|
||||
|
||||
## Model
|
||||
|
||||
The change is **additive content in one Markdown file**
|
||||
(`skills/analytics/SKILL.md`), governed by the same invariants spec 07
|
||||
established. They constrain the implementer; the exact prose is theirs.
|
||||
|
||||
### Additive, inline, heuristic-with-a-why
|
||||
|
||||
Consistent with specs 07 and 10: the change is additive content in
|
||||
`skills/analytics/SKILL.md`, **inline** (no bundled `reference/` file — the
|
||||
`setup-agents.ts` delivery ships only `SKILL.md` per target), dialect-agnostic,
|
||||
and phrased as **heuristics with a one-line generic rationale**, not a wall of
|
||||
MUSTs. The new rule extends the existing `<sql_craft>` "Answer completeness /
|
||||
interpretation" group; the shipped bullets in that group (including the *identity*
|
||||
and *inputs* rules this spec builds on) are preserved unchanged. No new tool,
|
||||
flag, or config.
|
||||
|
||||
### The over-projection guard carries a *universal* why, not a grader reference
|
||||
|
||||
The intake draft frames "don't pad the result with extra columns" as
|
||||
*grader-gaming*. The skill forbids **any** reference to a grader, gold answer, or
|
||||
benchmark (spec 07's hard invariant; the content test bans the words). So the
|
||||
guard must ship with a **universal analytics rationale** instead: columns the
|
||||
question did not ask for add noise, mislead the reader into thinking they matter,
|
||||
and make the result harder to consume — match the request exactly, neither short
|
||||
nor padded. This is the same reconciliation spec 07 applied to the draft's
|
||||
"behavior only, no rationale" instruction: generic *why* is required; only
|
||||
grader/gold/benchmark rationale is banned.
|
||||
|
||||
### Completeness is a closed set — identity and inputs are *inside* it
|
||||
|
||||
"Expose identity" and "keep the inputs" tell the agent to add columns; the
|
||||
over-projection guard tells it not to. These only contradict if the target is
|
||||
left fuzzy, so this spec pins it down. A **complete projection** is exactly:
|
||||
|
||||
> {every requested metric/attribute} ∪ {the identifier of each grouped/named
|
||||
> entity} ∪ {the inputs to each derived value}, at the grain the question
|
||||
> specifies.
|
||||
|
||||
Identity and inputs are **members of that set** — part of completeness, never
|
||||
"padding." **Under-projection** is any member missing (the failure this spec
|
||||
attacks); **over-projection** is any column *outside* the set (what the guard
|
||||
forbids). The implementer must phrase the rule and guard against this single
|
||||
definition so they read as one coherent notion, not two competing instructions.
|
||||
|
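The closed-set definition can be made concrete with a small sketch. The helper names and the column labels are hypothetical, chosen only to illustrate the set arithmetic; the grain facet is not modelled here.

```python
def complete_projection(requested, identifiers, inputs):
    """Union of the three member groups that define a complete projection."""
    return set(requested) | set(identifiers) | set(inputs)


def check_projection(projected, requested, identifiers, inputs):
    """Report under-projection (missing) and over-projection (extra) columns."""
    target = complete_projection(requested, identifiers, inputs)
    projected = set(projected)
    return {
        "missing": sorted(target - projected),  # members of the set not projected
        "extra": sorted(projected - target),    # columns outside the set
    }


# Hypothetical labels for the "highest, lowest, and difference per region" case.
requested = ["region_name", "highest", "lowest", "difference"]
identifiers = ["region_id"]
inputs = ["highest", "lowest"]  # inputs to the derived difference

short = check_projection(["region_name", "highest"], requested, identifiers, inputs)
padded = check_projection(
    ["region_id", "region_name", "highest", "lowest", "difference", "avg_orders"],
    requested,
    identifiers,
    inputs,
)
```

Because identity and inputs are folded into the target set, adding them never counts as padding; only a column outside the union, such as the made-up `avg_orders`, is reported as extra.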
||||
### Dialect-agnostic, additive-only, exclusions intact
|
||||
|
||||
Every addition reads correctly on any dialect — no dialect-specific syntax in the
|
||||
rule text or the worked example. The existing `<workflow>`, `<rules>`, and the
|
||||
other `<sql_craft>` bullets and examples (specs 07/09/10/11/12) are preserved and
|
||||
uncontradicted. Spec 07's exclusions still hold: no output-shape contract, no
|
||||
`MAX(date)` anchoring of relative time, no grader-driven advice, no dialect
|
||||
syntax.
|
||||
|
||||
## Requirements
|
||||
|
||||
### 1. Multi-part / multi-output completeness — a new umbrella rule
|
||||
|
||||
Add a bullet to the `<sql_craft>` "Answer completeness / interpretation" group:
|
||||
when a question requests several outputs — a **list** ("A, B, and C"), **paired
|
||||
extremes** ("the highest *and* the lowest"), or a **value plus its components**
|
||||
("X, Y, and their ratio") — the final projection must contain a column for
|
||||
**each** requested output. *Why:* answering only the first clause is the most
|
||||
common way a runnable query is still wrong; the grain and methodology can be
|
||||
perfect yet the answer is short by columns.
|
||||
|
||||
This rule is the **umbrella** over the two shipped completeness rules: the
|
||||
*inputs* rule (*"Keep the inputs to a derived value"*) is its "value + components"
|
||||
instance, and the *identity* rule (*"Expose identity, not just the label"*) is its
|
||||
"entity identity" instance. The new bullet should **name that relationship**
|
||||
(so the three read as one notion) rather than restating either rule.
|
||||
|
||||
Keep this distinct from the row-selection rules in the same group: *"Top /
|
||||
highest / most / lowest"* and *"For each X / per X / by X"* govern **which rows**
|
||||
appear; multi-part completeness governs **which columns** appear. They compose
|
||||
(e.g. "highest and lowest per region" needs one row per region *and* a column per
|
||||
clause).
|
||||
|
||||
### 2. Final completeness check — the enforcement mechanism
|
||||
|
||||
The rule content lives **once** in `<sql_craft>`; the trigger is promoted to a
|
||||
first-class line in `<workflow>` step 6.
|
||||
|
||||
- **Capstone bullet in `<sql_craft>`** (closing the "Answer completeness /
|
||||
interpretation" group): *before emitting the final SQL, re-read the question and
|
||||
confirm the projection covers* —
|
||||
1. every named **metric / attribute** the question asks for (→ the multi-part
|
||||
rule);
|
||||
2. the **identifier** of every grouped or named entity (→ the *identity* rule);
|
||||
3. every **input** to each derived value (→ the *inputs* rule);
|
||||
4. all at the **grain** the question specifies (→ the *for each X* / panel
|
||||
rules).
|
||||
|
||||
Each facet cross-references the rule it enforces, so the check is what makes
|
||||
those passive rules active. Phrase it as a short, concrete "confirm the
|
||||
projection covers…" checklist, not a wall of MUSTs.
|
||||
|
||||
- **Over-projection guard** (attached to the check): do **not** add columns the
|
||||
question did not ask for "to be safe" — extra columns add noise, mislead, and
|
||||
make the result harder to consume; match the request exactly. Carries the
|
||||
**universal** why from the Model, **never** a grader/gold/benchmark reference.
|
||||
|
||||
- **`<workflow>` step 6 line** (the explicit ritual): step 6 ("Validate and
|
||||
explain") gains a mandatory line directing the agent to **always** run the final
|
||||
completeness check before emitting — re-read the question and verify every
|
||||
requested output, each entity's identity, each derived value's inputs, and the
|
||||
grain are all projected — pointing into the `<sql_craft>` capstone for the
|
||||
detail. This **replaces the current conditional pointer's role** ("If a result
|
||||
is unexpectedly empty or its grain looks wrong, work through the … rules"): the
|
||||
empty/grain diagnostic stays available (it maps to the existing *"Diagnose empty
|
||||
results"* and grain rules), but the completeness check fires **unconditionally**,
|
||||
on every SQL-authoring turn, not only when a result looks off. The workflow line
|
||||
names the ritual and the four facets; the rationale, guard, and example are
|
||||
stated once in `<sql_craft>`, not duplicated into the workflow.
|
||||
|
||||
### 3. One worked example (dialect-agnostic)
|
||||
|
||||
Add **exactly one** compact before/after example to the "Answer completeness /
|
||||
interpretation" group, demonstrating multi-part completeness on a **synthetic**
|
||||
schema (`regions`, `region_monthly`):
|
||||
|
||||
- **WRONG:** answers only the first clause — `SELECT region_name,
|
||||
MAX(monthly_orders) AS highest … GROUP BY region_name` — with no region id, no
|
||||
lowest, no difference.
|
||||
- **RIGHT:** one column per requested output plus the entity's identity, at the
|
||||
region grain — `region_id, region_name`, the highest, the lowest, and the
|
||||
difference, with `regions` joined to `region_monthly` and grouped by the region
|
||||
id and name.
|
||||
|
||||
Standard dialect-clean SQL only (no `QUALIFY`, no dialect functions; `MAX`/`MIN`
|
||||
are portable aggregates). Keep it tight. It teaches multi-clause coverage +
|
||||
identity + derived-value inputs in one capstone, and is **distinct** from the
|
||||
spec-10 `regions` panel example: that one is about missing **rows** (LEFT-JOIN
|
||||
spine + `COALESCE`); this one is about missing **columns**. This is the **sixth**
|
||||
worked `sql` example in the skill (after specs 07/09/10/11/12).
|
||||
|
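A runnable sketch of the before/after pair, assuming a `region_monthly(region_id, month, monthly_orders)` layout and made-up rows; the `month` column, the join condition, and the `ORDER BY` clauses (added only to make the output deterministic) are assumptions, and SQLite via Python's `sqlite3` is used purely to execute the portable SQL.

```python
import sqlite3

# WRONG: answers only the first clause; no region id, no lowest, no difference.
WRONG = """
SELECT r.region_name, MAX(rm.monthly_orders) AS highest
FROM regions r
JOIN region_monthly rm ON rm.region_id = r.region_id
GROUP BY r.region_name
ORDER BY r.region_name
"""

# RIGHT: one column per requested output plus the entity's identity.
RIGHT = """
SELECT
  r.region_id,
  r.region_name,
  MAX(rm.monthly_orders) AS highest,
  MIN(rm.monthly_orders) AS lowest,
  MAX(rm.monthly_orders) - MIN(rm.monthly_orders) AS difference
FROM regions r
JOIN region_monthly rm ON rm.region_id = r.region_id
GROUP BY r.region_id, r.region_name
ORDER BY r.region_id
"""


def run(query):
    """Execute a query on synthetic data; return (column names, rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE regions (region_id INTEGER, region_name TEXT)")
    conn.execute(
        "CREATE TABLE region_monthly "
        "(region_id INTEGER, month TEXT, monthly_orders INTEGER)"
    )
    conn.executemany("INSERT INTO regions VALUES (?, ?)", [(1, "North"), (2, "South")])
    conn.executemany(
        "INSERT INTO region_monthly VALUES (?, ?, ?)",
        [(1, "2024-01", 10), (1, "2024-02", 25), (2, "2024-01", 7), (2, "2024-02", 7)],
    )
    cursor = conn.execute(query)
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    conn.close()
    return columns, rows


wrong_columns, wrong_rows = run(WRONG)
right_columns, right_rows = run(RIGHT)
```

Both queries return one row per region, so the row grain is identical; the difference is entirely in the projection, which is the distinction from the spec-10 panel example.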
||||
### 4. Coordination with specs 03 and 07/09/10/11/12
|
||||
|
||||
- **Spec 03** (multi-connection routing) owns `<workflow>` step 0 and the
|
||||
`connectionId` threading/scoping. Spec 14 touches `<workflow>` only to add the
|
||||
completeness-check line to **step 6** — it must not rewrite the routing or the
|
||||
`<rules>` `connectionId` scoping. If both land, step 6 reads coherently: validate
|
||||
plus the completeness ritual.
|
||||
- **Specs 07/09/10/11/12** own their own bullets and worked examples in
|
||||
`<sql_craft>`. Spec 14 is **additive** to the same "Answer completeness /
|
||||
interpretation" group and adds one example; it must not remove or contradict
|
||||
theirs.
|
||||
|
||||
## Leak-safety (hard constraint)
|
||||
|
||||
The example uses an **invented, generic schema** (`regions`, `region_monthly`) and
|
||||
made-up columns — **no benchmark table names, SQL, or result values.** It teaches
|
||||
the *pattern* (cover every requested output + identity + inputs, at grain, without
|
||||
padding), which is universal and tied to no specific instance. The over-projection
|
||||
guard's rationale is **universal** (noise/clarity/consumability), never
|
||||
"grader-gaming" or any other scoring reference. No part of the addition mentions a
|
||||
benchmark, gold answer, grader, or scoring comparator.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- `<sql_craft>` "Answer completeness / interpretation" states the **multi-part /
|
||||
multi-output completeness** rule (a column per requested output; list / paired
|
||||
extremes / value-plus-components), named as the umbrella over the shipped
|
||||
*identity* and *inputs* rules — inline, dialect-agnostic, with a generic *why*.
|
||||
- `<sql_craft>` states a concrete **final completeness check** (re-read the
|
||||
question → confirm metrics + entity identity + derived-value inputs + grain are
|
||||
projected), cross-referencing the existing identity/inputs/grain rules so they
|
||||
are enforced, not merely listed.
|
||||
- The check carries the **over-projection guard** with a **universal** rationale
|
||||
(don't pad with unrequested columns — noise / misleading / harder to consume),
|
||||
and the skill contains **zero** grader/gold/benchmark references anywhere.
|
||||
- `<workflow>` **step 6** carries a mandatory line that runs the completeness
|
||||
check **unconditionally** before emitting and points into the `<sql_craft>`
|
||||
capstone; the rule content is **stated once** in `<sql_craft>` (no duplicated
|
||||
rationale/guard in the workflow). The empty/grain diagnostic remains available.
|
||||
- Exactly **one** new worked `sql` example is present (synthetic
|
||||
`regions`/`region_monthly`, wrong vs complete), in standard dialect-agnostic SQL;
|
||||
the skill then carries **six** `sql` worked examples total.
|
||||
- The existing interactive guidance (`<workflow>` steps, `<rules>`, the other
|
||||
`<sql_craft>` bullets and the five prior examples) is intact and uncontradicted;
|
||||
the additive-only and dialect-clean invariants from specs 07/10 still hold.
|
||||
- None of spec 07's excluded items appear (output-shape contract, `MAX(date)`
|
||||
anchoring of "recent"/"past N", grader-driven advice, dialect syntax).
|
||||
- The skill stays scannable and comfortably under the 500-line budget; the
|
||||
frontmatter still parses as `ktx-analytics`.
|
||||
- The analytics-skill **content test is updated** to cover the new rule and check
|
||||
(see Implementation orientation).
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
||||
the prose.
|
||||
|
||||
- **Skill:** `packages/cli/src/skills/analytics/SKILL.md`.
|
||||
- Add the multi-part-completeness bullet and the final-completeness-check
|
||||
capstone (with the over-projection guard) to the `<sql_craft>` "Answer
|
||||
completeness / interpretation" group; add the single
|
||||
`regions`/`region_monthly` worked example.
|
||||
- In `<workflow>` step 6, replace the current conditional answer-completeness
|
||||
pointer with the mandatory completeness-check line (unconditional, names the
|
||||
four facets, points into `<sql_craft>`); keep the empty/grain diagnostic.
|
||||
- Leave `<workflow>` steps 0–5, `<rules>`, and the other `<sql_craft>`
|
||||
bullets/examples intact. Delivery is unchanged (single `SKILL.md` per target
|
||||
via `readAnalyticsSkillContent` in `setup-agents.ts`) — confirm, no change
|
||||
required.
|
||||
- **Tests:** `packages/cli/test/skills/analytics-skill-content.test.ts`.
|
||||
- Add representative phrases to the "represents every craft behavior" list for
|
||||
the multi-part rule, the final completeness check, and the over-projection
|
||||
guard.
|
||||
- Bump the worked-example `sql`-fence count assertion **5 → 6** (and update the
|
||||
test name/comment), and assert the new example's shape (e.g. `region_monthly`,
|
||||
`MAX(`, `MIN(`, the difference expression, `region_id`).
|
||||
- The existing dialect-clean, grader/benchmark-clean, and relative-time
|
||||
(`MAX(...)` anchoring) guards must still pass — the new example's `MAX`/`MIN`
|
||||
lines carry no "recent"/"past N" wording, so the phrase-level guard is
|
||||
unaffected. The `SkillsRegistryService` frontmatter test must still pass.
|
||||
- Rebuild and re-link the dev binary so the playground picks up the updated skill:
|
||||
`pnpm run build && pnpm run link:dev`.
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
On the latest SQLite-subset run, **incomplete output was the single largest
|
||||
failure bucket (~13 of 51 voted failures)**: multi-part questions answered
|
||||
partially, plus dropped identity / derived-value inputs — the latter two being
|
||||
spec-07 rules that already exist but weren't applied. A probe with a much stronger
|
||||
model reproduced the *same* incomplete-output failures, confirming this is a
|
||||
craft-enforcement gap rather than a model-capability one. The fix — answer every
|
||||
requested part, identify the entities, keep the inputs, and don't pad — is
|
||||
universal analyst craft, so it belongs in the product skill (and transfers to real
|
||||
users), enforced as a final pre-emit check rather than left as a passive hint.
|
||||
Improving the benchmark score is a side effect; the skill contains no trace of the
|
||||
benchmark.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Implemented as additive content in one Markdown file plus a test update.
|
||||
|
||||
- **Skill — `packages/cli/src/skills/analytics/SKILL.md`** (`<sql_craft>` "Answer
|
||||
completeness / interpretation" group):
|
||||
- Added the **"Answer every requested output"** umbrella bullet (list / paired
|
||||
extremes / value-plus-components → a column per requested output, with a generic
|
||||
*why*). It names *keep the inputs* and *expose identity* as its "value +
|
||||
components" and "entity identity" instances, pins the closed-set definition of a
|
||||
complete projection, and marks itself as governing *which columns* appear —
|
||||
distinct from the *Top …* / *For each X* row-selection rules, with which it
|
||||
composes. The two shipped instance rules are preserved verbatim.
|
||||
- Added the **"Final completeness check"** capstone bullet: a four-facet
|
||||
"before emitting, re-read the question and confirm the projection covers…"
|
||||
checklist (metric/attribute → multi-part rule; identifier → *expose identity*;
|
||||
inputs → *keep the inputs*; grain → *for each X* / *complete the panel*), run on
|
||||
every query. It carries the **over-projection guard** with a universal rationale
|
||||
(unrequested columns add noise, mislead, and are harder to consume — match the
|
||||
request exactly), with **no** grader/gold/benchmark reference.
|
||||
- Added one worked `sql` example (synthetic `regions` / `region_monthly`): WRONG
|
||||
answers only the first clause (`SELECT region_name, MAX(monthly_orders) …`),
|
||||
dropping the region id, the lowest, and the difference; RIGHT projects
|
||||
`r.region_id, r.region_name`, `MAX` highest, `MIN` lowest, and the
|
||||
`MAX − MIN` difference, joining `regions` to `region_monthly` and grouping by id
|
||||
+ name. This is the **sixth** `sql` example, dialect-clean (portable `MAX`/`MIN`).
|
||||
- `<workflow>` **step 6**: replaced the conditional answer-completeness pointer
|
||||
with an unconditional *"Always run the final completeness check before emitting"*
|
||||
line that names the four facets and points into the `<sql_craft>` capstone; the
|
||||
empty/grain diagnostic is retained for diagnosis. Steps 0–5, `<rules>`, and the
|
||||
other `<sql_craft>` bullets/examples are untouched.
|
||||
- Delivery is unchanged: `readAnalyticsSkillContent` in
|
||||
`packages/cli/src/setup-agents.ts` still ships the single `SKILL.md` per target
|
||||
(confirmed, no change required).
|
||||
- **Tests — `packages/cli/test/skills/analytics-skill-content.test.ts`:** added the
|
||||
three representative phrases (`Answer every requested output`, `Final completeness
|
||||
check`, `Don't over-project`); bumped the `sql`-fence count assertion 5 → 6 and
|
||||
renamed that test; asserted the new example's shape (`region_monthly`,
|
||||
`MAX(rm.monthly_orders)`, `MIN(rm.monthly_orders)`, the `MAX − MIN` difference, and
|
||||
`r.region_id, r.region_name`). The dialect-clean, grader/benchmark-clean,
|
||||
relative-time, and frontmatter guards still pass.
|
||||
- **Verification:** `analytics-skill-content` 9/9 and `setup-agents` 46/46 pass;
|
||||
production type-check (`tsconfig.json`, src) is clean; `pnpm run build` copied the
|
||||
updated skill into `dist/skills/analytics/SKILL.md` (6 fences, all new content
|
||||
present) and `pnpm -w run link:dev` re-linked `ktx-dev` so the playground picks it
|
||||
up. The skill is 244 lines (< 500 budget) and the frontmatter still parses as
|
||||
`ktx-analytics`.
|
||||
- **Deviation (cosmetic):** the worked example uses alias `rm` and a difference
|
||||
column named `order_count_range`; the intake draft sketched alias `m` and
|
||||
`AS difference`. The spec leaves prose to the implementer, so the change is purely
|
||||
naming.
|
||||
- **Unrelated pre-existing issue:** `tsconfig.test.json` reports one type error in
|
||||
`packages/cli/test/mcp-server-factory.test.ts` (a `KtxMcpContextPorts`/`contextTools`
|
||||
mismatch introduced by the earlier connection-scoped-wiki commit `2677b3ef`). It is
|
||||
untouched by this work and out of scope here.
|
||||
|
|
@ -1,405 +0,0 @@
|
|||
# Structured, leveled logging for the ktx MCP server
|
||||
|
||||
> Refined spec. Intake draft: `todo/15-mcp-server-structured-logging.md`.
|
||||
>
|
||||
> **Scope: observability only.** This spec is about *seeing* what the MCP server
|
||||
> does (which tool, what params, when, how long, outcome). *Preventing* a runaway
|
||||
> query from blocking the server (off-event-loop / interruptible execution) is a
|
||||
> separate concern — see "Non-goals".
|
||||
|
||||
## Problem
|
||||
|
||||
The ktx MCP server (`mcp-http-server.ts` + `mcp-stdio-server.ts`, both built
|
||||
through `mcp-server-factory.ts` on raw `node:http` + the
|
||||
`@modelcontextprotocol/sdk` transports) emits almost no operational logs. There
|
||||
is no server-side record of **which MCP tool was called, with what parameters,
|
||||
when, how long it took, or whether it succeeded** — nor of session open/close or
|
||||
transport errors. When a tool call is slow, hangs, or a client connection drops
|
||||
("Transport channel closed"), an operator has no trail to diagnose it and must
|
||||
resort to process sampling / `lsof` / guesswork — and the offending input
|
||||
(e.g. the exact SQL) is typically unrecoverable.
|
||||
|
||||
The hook to fix this already exists but is half-built: `instrumentMcpServer`
|
||||
(`context/mcp/context-tools.ts`) wraps every tool handler and already times it,
|
||||
but it emits **only on completion** (a sampled `mcp_request_completed` telemetry
|
||||
event) and **never writes a start line and never writes to the server log**. A
|
||||
call that never returns therefore leaves no trace at all.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
Anyone running a long-lived ktx MCP server — a developer's local instance
|
||||
(stdio, launched by Claude Desktop / Cursor), a foreground HTTP server, or a
|
||||
shared/hosted HTTP daemon — needs observability into tool-call activity to:
|
||||
|
||||
- diagnose slow or hung tool calls (which `sql_execution` ran, against which
|
||||
connection, with what SQL, for how long);
|
||||
- explain client-visible connection failures from the server side (session
|
||||
lifecycle, transport-closed events);
|
||||
- audit what agents asked the server to do;
|
||||
- spot patterns (hot tools, slow connections, error rates).
|
||||
|
||||
This is standard production-server hygiene; the server currently provides none.
|
||||
|
||||
## Design decisions (resolved during refinement)
|
||||
|
||||
These resolve ambiguities the intake draft left open. They constrain the
|
||||
implementer; the exact code is theirs.
|
||||
|
||||
### One `pino` logger, synchronous, written to **stderr**
|
||||
|
||||
Use `pino`, the de facto standard structured-JSON logger for Node servers, as
|
||||
a single shared instance. Two corrections to the draft's sketch:
|
||||
|
||||
- **stderr, not stdout.** The stdio transport reserves **stdout** for the
|
||||
JSON-RPC protocol (`mcp-stdio-server.ts` deliberately no-ops `stdout.write`);
|
||||
writing logs there would corrupt the protocol stream. The HTTP daemon already
|
||||
redirects **both** child fds to `.ktx/logs/mcp.log`
|
||||
(`managed-mcp-daemon.ts`: `stdio: ['ignore', log.fd, log.fd]`), so stderr lands
|
||||
in the same log file (surfaced by `ktx mcp logs`). **stderr is therefore the
|
||||
one universally correct sink** for both transports.
|
||||
- **Synchronous, no worker-thread transport.** `pino` writes through a
|
||||
`DestinationStream` (`{ write(msg) }`) — the server's existing
|
||||
`KtxCliIo.stderr` sink satisfies that interface directly. Configure pino with a
|
||||
**synchronous** destination (`pino.destination({ sync: true })`, or the
|
||||
pino-pretty stream below with `sync: true`). This is load-bearing: the
|
||||
`tool.start` line **must** be flushed to the fd *before* the (possibly
|
||||
blocking) handler runs, so a runaway synchronous `better-sqlite3` query that
|
||||
pegs the event loop still leaves the start line on disk. A worker-thread
|
||||
transport (`transport: { target: ... }`) buffers and can lose that exact line
|
||||
on a hard crash — **do not use transport mode.**
|
||||
|
||||
### Format is derived from `stderr.isTTY`, not a config flag
|
||||
|
||||
One logger, two serializations chosen by the environment (the "behavior follows
|
||||
from inputs" rule — not a user-visible knob):
|
||||
|
||||
- **TTY** (`ktx mcp start --foreground` or `ktx mcp stdio` run in a terminal) →
|
||||
**`pino-pretty` as a synchronous in-process stream** (`pretty({ sync: true,
|
||||
destination: <stderr sink> })`, colorized). A readable live dev view.
|
||||
- **Not a TTY** (the detached daemon, whose stderr is the `.ktx/logs/mcp.log`
|
||||
file fd) → **plain JSON line** via the synchronous pino destination. The log
|
||||
*file* stays structured JSON so the incident workflow ("recover the hung query
|
||||
with a one-line `grep` / `jq`") works — colorized ANSI in a file would defeat
|
||||
it.
|
||||
|
||||
`KtxCliIo.stderr` has no `isTTY` field (`cli-runtime.ts`), so detect the terminal
|
||||
from the underlying stream (`process.stderr.isTTY`) at logger construction, while
|
||||
still writing *through* the `io.stderr` sink so tests can capture emitted lines.
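
A minimal sketch of the two decisions above, under stated assumptions. `StderrSink`, `createSketchLogger`, and the pretty-line layout are hypothetical stand-ins: the spec calls for `pino` and `pino-pretty`, which this dependency-free sketch deliberately does not import. It only illustrates one logger choosing its serialization from an `isTTY` flag while writing synchronously through an injected sink, so a test can capture the lines. The `level` / `time` / `msg` keys mirror pino's default JSON output.

```typescript
// Hypothetical stand-in for the `{ write(msg) }` sink shape described above.
interface StderrSink {
  write(msg: string): void;
}

type Fields = Record<string, unknown>;

interface SketchLogger {
  info(fields: Fields, msg: string): void;
}

// The format follows from `isTTY`; there is no user-facing format flag.
function createSketchLogger(sink: StderrSink, isTTY: boolean): SketchLogger {
  return {
    info(fields, msg) {
      const time = Date.now();
      const line = isTTY
        ? `[${new Date(time).toISOString()}] INFO ${msg} ${JSON.stringify(fields)}`
        : JSON.stringify({ level: 30, time, ...fields, msg });
      // Synchronous write: the line has reached the sink before info() returns.
      sink.write(line + "\n");
    },
  };
}

// A capturing sink, as a test might inject in place of the real stderr.
const captured: string[] = [];
const captureSink: StderrSink = {
  write: (msg) => {
    captured.push(msg);
  },
};

// Same call, two serializations: JSON for a file, a readable line for a terminal.
createSketchLogger(captureSink, false).info({ tool: "sql_execution" }, "tool.start");
createSketchLogger(captureSink, true).info({ tool: "sql_execution" }, "tool.start");
```

In production the `isTTY` argument would come from `process.stderr.isTTY` at construction time; passing it in explicitly here is only to keep the sketch free of Node globals.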
|
||||
|
||||
### Single hook: extend `instrumentMcpServer`, do not fork a second wrapper
|
||||
|
||||
Tool-call logging is added to the existing `instrumentMcpServer`
|
||||
(`context-tools.ts`), which already wraps `registerTool` and measures duration.
|
||||
It receives the **raw** tool input (it wraps the schema-parsing handler from
|
||||
`registerParsedTool`), so the params it logs include `sql` for `sql_execution`.
|
||||
The existing telemetry emission stays unchanged; logging is **additive** beside
|
||||
it. Because both transports build their server through `mcp-server-factory.ts` →
|
||||
`registerKtxContextTools`, this single change gives **both HTTP and stdio**
|
||||
tool-call logging for free.
|
||||
|
||||
### `sessionId` / `callId` provenance
|
||||
|
||||
- **`sessionId`** comes from the SDK's per-call handler context
|
||||
(`RequestHandlerExtra.sessionId`; confirmed present in `@modelcontextprotocol/sdk`
|
||||
`1.29.0`). It is populated for the Streamable HTTP transport and absent for
|
||||
stdio (single session) — log it when present, omit otherwise. Add
|
||||
`sessionId?: string` to `KtxMcpToolHandlerContext` (`context/mcp/types.ts`).
|
||||
- **`callId`** is generated per invocation with `randomUUID()` (already imported
|
||||
in `context-tools.ts`). It correlates a `tool.start` with its `tool.end`.
|
||||
|
||||
### No redaction in v1 (explicit)
|
||||
|
||||
v1 ships **no log redaction**. Rationale recorded here so it is a deliberate
|
||||
choice, not an oversight: these logs are **local** (stderr → `.ktx/logs/mcp.log`),
|
||||
**never transmitted off-box**, and sit at the **same trust boundary** as the
|
||||
`ktx.yaml` / environment that already hold the connection credentials. Concretely:
|
||||
|
||||
- Request **headers are never logged** at all, so the bearer token
|
||||
(`KTX_MCP_TOKEN`) simply isn't collected — this is "not logged," not "redacted."
|
||||
- Errors are logged with their **full message and stack** via pino's standard
|
||||
`err` serializer.
|
||||
- SQL text and tool params are logged **verbatim** (they are not secrets).
|
||||
|
||||
Credential redaction (e.g. a DB URL embedded in a driver error string) is an
|
||||
explicit **v1 non-goal**; revisit only if these logs are ever shipped off-box.
|
||||
This drops the draft's "light redaction" requirement and the
|
||||
`collectTelemetryRedactionSecrets` / scrubber reuse it implied.
|
||||
|
||||
## Requirements
|
||||
|
||||
### 1. One shared pino logger
|
||||
|
||||
- A single `pino` instance per server process, constructed once and threaded to
|
||||
both the transport layer (for lifecycle events) and the tool layer (for
|
||||
tool-call events). Level set from env (Requirement 7), default `info`.
|
||||
- Synchronous destination bound to the server's stderr sink (see Design
|
||||
decisions). Pretty (`pino-pretty`, sync stream) when `process.stderr.isTTY`,
|
||||
otherwise plain JSON. Each line carries pino's standard `time` and `level`.
|
||||
- No new dependency beyond `pino` and `pino-pretty`. No OpenTelemetry / metrics
|
||||
stack, no async/worker transport, no in-app file rotation.
|
||||
|
||||
### 2. Per-session / per-call context via child loggers
|
||||
|
||||
Use pino child loggers so every line carries the relevant correlation fields:
|
||||
a per-call child binds `{ tool, callId }` plus `sessionId` when present, so one
|
||||
session's or one call's activity can be grepped from the log.
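
The binding behaviour described above can be sketched without pino. `makeLogger` and `callBindings` are hypothetical names for illustration only; a real pino child logger is created with `logger.child(bindings)` and merges its bindings into every line in much the same way. The point shown is that `sessionId` is omitted, not set to `null`, when the transport supplies none.

```typescript
type Bindings = Record<string, unknown>;

interface ChildLogger {
  child(bindings: Bindings): ChildLogger;
  info(fields: Bindings, msg: string): void;
}

// Each child carries its parent's bindings plus its own on every line.
function makeLogger(write: (line: string) => void, bound: Bindings = {}): ChildLogger {
  return {
    child: (bindings) => makeLogger(write, { ...bound, ...bindings }),
    info: (fields, msg) => write(JSON.stringify({ ...bound, ...fields, msg })),
  };
}

// Leave `sessionId` out entirely when there is none (the stdio case).
function callBindings(tool: string, callId: string, sessionId?: string): Bindings {
  return sessionId === undefined ? { tool, callId } : { tool, callId, sessionId };
}

const out: string[] = [];
const root = makeLogger((line) => {
  out.push(line);
});

// HTTP-style call: sessionId present.
root
  .child(callBindings("sql_execution", "call-1", "sess-9"))
  .info({ params: { sql: "SELECT 1" } }, "tool.start");

// stdio-style call: no sessionId field at all.
root.child(callBindings("sql_execution", "call-2")).info({}, "tool.start");
```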
|
||||
|
||||
### 3. Tool-call logging — START before execute, END after
|
||||
|
||||
In `instrumentMcpServer`, for **every** MCP tool invocation:
|
||||
|
||||
- **On entry, before invoking the handler**, write `tool.start` with
|
||||
`{ tool, callId, sessionId?, params }` at **`info`**. `params` is the raw tool
|
||||
input; for `sql_execution` this includes the full **SQL text** (the single most
|
||||
useful field). The write is synchronous so the line exists even if the handler
|
||||
never returns.
|
||||
- **On normal completion**, write `tool.end` with
|
||||
`{ tool, callId, sessionId?, durationMs, outcome: "ok", resultSize }` at
|
||||
**`info`** — *unless* it is a slow call (Requirement 4). `resultSize` is a
|
||||
tool-agnostic size measure (byte length of the serialized result text content).
|
||||
- **On error**, write `tool.end` with
|
||||
`{ tool, callId, sessionId?, durationMs, outcome: "error", err }` at **`error`**,
|
||||
where `err` is the serialized error (message + stack) per Requirement 6.
|
||||
|
||||
`tool.start` and `tool.end` share the **same correlation fields and the same
|
||||
`info` level** (for the non-slow, non-error case) so that an **unmatched
|
||||
`tool.start`** — a start with no `tool.end` for the same `callId` — is an
|
||||
unambiguous "this call hung" signal. This is the property that makes a runaway
|
||||
`sql_execution` identifiable from the log alone, with its exact SQL and
|
||||
timestamp, no process sampling.
|
||||
|
||||
> **Deliberate change from the intake draft.** The draft put `tool.start` /
|
||||
> `tool.end` at `debug` (suppressed at the default `info`). That defeats the
|
||||
> motivating incident: a hang is unpredictable, so debug would have to be enabled
|
||||
> *before* it occurs, which never happens. v1 logs start/end at **`info`** — an
|
||||
> always-on access log — so the offending query is recoverable at the default
|
||||
> level. `debug` is reserved for heavier detail (Requirement 7).
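
The start-before-execute ordering in this requirement can be sketched as a wrapper. Everything named here (`instrumentSketch`, `LogLine`, the `emit` callback, the injected `nextCallId`) is hypothetical and is not the actual `instrumentMcpServer` code. The sketch also simplifies `resultSize` to string length and leaves out the slow-call level from Requirement 4.

```typescript
type LogFields = Record<string, unknown>;
type LogLine = { level: "info" | "error"; msg: string } & LogFields;
type ToolHandler = (params: LogFields) => Promise<string>;

// Wraps a handler so `tool.start` is recorded before the handler is invoked.
function instrumentSketch(
  tool: string,
  handler: ToolHandler,
  emit: (line: LogLine) => void,
  nextCallId: () => string,
  now: () => number = Date.now,
): ToolHandler {
  return async (params) => {
    const callId = nextCallId();
    const startedAt = now();
    // Emitted synchronously, before the handler gets a chance to block.
    emit({ level: "info", msg: "tool.start", tool, callId, params });
    try {
      const text = await handler(params);
      emit({
        level: "info",
        msg: "tool.end",
        tool,
        callId,
        durationMs: now() - startedAt,
        outcome: "ok",
        resultSize: text.length, // simplification: string length, not byte length
      });
      return text;
    } catch (err) {
      const serialized =
        err instanceof Error ? { message: err.message, stack: err.stack } : { message: String(err) };
      emit({
        level: "error",
        msg: "tool.end",
        tool,
        callId,
        durationMs: now() - startedAt,
        outcome: "error",
        err: serialized,
      });
      throw err;
    }
  };
}

// A handler that never settles stands in for a runaway query.
const toolLog: LogLine[] = [];
let callCounter = 0;
const hung = instrumentSketch(
  "sql_execution",
  () => new Promise<string>(() => {}),
  (line) => {
    toolLog.push(line);
  },
  () => `call-${++callCounter}`,
);
void hung({ sql: "SELECT COUNT(*) FROM profits" });
```

After the last line, `toolLog` holds exactly one entry: the `tool.start` with the SQL and no matching `tool.end`, which is the "this call hung" signal the requirement describes.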
|
||||
|
||||
### 4. Slow-call warning
|
||||
|
||||
When a call **completes** with `durationMs` greater than the configured slow
|
||||
threshold (Requirement 7), emit its `tool.end` at **`warn`** (carrying the same
|
||||
fields plus the duration) instead of `info`. This makes a completed-but-slow call
|
||||
stand out and keeps it visible even when the level is raised to `warn`.
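
As a small illustration, the level choice for `tool.end` reduces to one function. `toolEndLevel` is a hypothetical name, and letting an errored call stay at `error` even when it was also slow is this sketch's reading of Requirements 3 and 4, not something the spec states in one place.

```typescript
type EndLevel = "info" | "warn" | "error";

// Picks the level for a `tool.end` line from its outcome and duration.
function toolEndLevel(outcome: "ok" | "error", durationMs: number, slowMs: number): EndLevel {
  if (outcome === "error") return "error";
  // Strictly greater than the threshold counts as slow.
  return durationMs > slowMs ? "warn" : "info";
}
```

With the default threshold of `10000`, a call that takes exactly 10000 ms still logs at `info`; one that takes 10001 ms logs at `warn`.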
|
||||
|
||||
### 5. Connection / session lifecycle and transport errors
|
||||
|
||||
- **HTTP** (`mcp-http-server.ts`, in `newTransport`): log `session.open` from
|
||||
`onsessioninitialized` and `session.close` from `onsessionclosed` /
|
||||
`transport.onclose`, each with `sessionId`, at `info`. **Wire the currently
|
||||
unused `transport.onerror`** to log `transport.error` (the SDK's
|
||||
closed-channel / "Transport channel closed" events) at `error`, so a
|
||||
client-visible connection failure has a server-side counterpart.
|
||||
- **stdio** (`mcp-stdio-server.ts`): route the existing raw
|
||||
`transport.onerror` stderr string (it currently writes a plain string) through
|
||||
the logger as a `transport.error` line at `error`. A single `session.open` /
|
||||
`session.close` pair for the one stdio connection MAY be logged at `info`.
|
||||
|
||||
### 6. Structured error logging
|
||||
|
||||
Errors are logged as structured objects via pino's standard `err` serializer
|
||||
(`pino.stdSerializers.err` or equivalent), carrying error class, message, and
|
||||
stack — never a bare interpolated string. The existing telemetry exception
|
||||
reporting in `instrumentMcpServer` / `registerParsedTool` is unchanged.
|
||||
|
||||
### 7. Configuration surface
|
||||
|
||||
- **`KTX_MCP_LOG_LEVEL`** — pino level (`error` | `warn` | `info` | `debug` |
|
||||
…), default **`info`**. MCP-scoped name because the MCP server is the only
|
||||
emitter today; naming it global (`KTX_LOG_LEVEL`) would imply a logging system
|
||||
that does not exist.
|
||||
- **`KTX_MCP_SLOW_TOOL_MS`** — slow-call threshold in milliseconds (Requirement
|
||||
4), default **`10000`**. Justified as a real ops knob: "slow" differs sharply
|
||||
between a local SQLite file and a remote warehouse.
|
||||
- Level ladder that results from Requirements 3–5:
|
||||
- `debug`: everything logged at `info` **plus** heavier detail (e.g. result bodies,
|
||||
progress notifications) — implementer's discretion on what extra to attach.
|
||||
- `info` (default): `tool.start` / `tool.end`, session lifecycle, slow `warn`s,
|
||||
errors.
|
||||
- `warn`: slow-call `tool.end`s, `transport.error`, errored `tool.end`s — but
|
||||
not routine tool traffic.
|
||||
- `error`: errored `tool.end`s and `transport.error` only.
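
A sketch of how the two environment variables and the level ladder could be resolved. The function names, the case-insensitive parsing, and the choice to fall back to the default on an invalid value are assumptions of this sketch; the spec fixes only the variable names and defaults. The numeric values are pino's built-in level numbers.

```typescript
// pino's built-in level numbers.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 } as const;
type LevelName = keyof typeof LEVELS;
type Env = Record<string, string | undefined>;

// Reads KTX_MCP_LOG_LEVEL; unknown or missing values fall back to `info`.
function parseLogLevel(env: Env): LevelName {
  const raw = (env.KTX_MCP_LOG_LEVEL ?? "").trim().toLowerCase();
  // hasOwnProperty avoids matching inherited keys such as "toString".
  return Object.prototype.hasOwnProperty.call(LEVELS, raw) ? (raw as LevelName) : "info";
}

// Reads KTX_MCP_SLOW_TOOL_MS; non-numeric or non-positive values fall back to 10000.
function parseSlowToolMs(env: Env): number {
  const n = Number(env.KTX_MCP_SLOW_TOOL_MS);
  return isFinite(n) && n > 0 ? n : 10000;
}

// A line is written when its level is at or above the configured threshold.
function isEmitted(lineLevel: LevelName, threshold: LevelName): boolean {
  return LEVELS[lineLevel] >= LEVELS[threshold];
}
```

Read against the ladder above: at `warn`, `isEmitted("info", "warn")` is false, so routine `tool.start` / `tool.end` lines drop out, while slow-call `warn` lines and `error` lines remain.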
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- At default level (`info`), invoking any MCP tool produces a `tool.start`
|
||||
(`tool`, `callId`, `sessionId` when HTTP, `params`) and a matching `tool.end`
|
||||
(`durationMs`, `outcome`, `resultSize`) line, as **JSON to stderr** when stderr
|
||||
is not a TTY.
|
||||
- A tool call that never returns (e.g. a runaway `sql_execution`) leaves a
|
||||
`tool.start` line carrying its **exact SQL and timestamp** and **no** matching
|
||||
`tool.end` for that `callId` — so the offending query is recoverable from the
|
||||
log alone, with no process sampling.
|
||||
- A completed call slower than `KTX_MCP_SLOW_TOOL_MS` emits its `tool.end` at
|
||||
`warn` with its `durationMs`.
|
||||
- Session open/close and transport-closed (`transport.error`) events are logged
|
||||
with the `sessionId` (HTTP); the stdio transport error path goes through the
|
||||
logger, not a raw `stderr.write`.
|
||||
- At level `warn`, routine `tool.start` / `tool.end` are suppressed but
|
||||
slow-call warnings, transport errors, and errored calls are present.
|
||||
- When stderr is a TTY (`ktx mcp start --foreground` / `ktx mcp stdio` in a
|
||||
terminal), output is human-readable colorized `pino-pretty`; the daemon log
|
||||
file (`.ktx/logs/mcp.log`) is plain JSON. Both paths are synchronous.
|
||||
- The bearer token never appears in any log line (headers are not logged); SQL
|
||||
and tool params do appear.
|
||||
- No worker-thread / async log transport is introduced; no OpenTelemetry /
|
||||
metrics stack; the only new dependencies are `pino` and `pino-pretty`.
|
||||
- The existing `mcp_request_completed` telemetry and exception reporting still
|
||||
work unchanged.
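
The "recoverable from the log alone" criterion can be checked mechanically. The sketch below scans JSON log lines for a `tool.start` with no `tool.end` for the same `callId`. It assumes the event name sits in pino's default `msg` key, which the spec does not pin down, and `findUnmatchedStarts` and `HungCall` are hypothetical names.

```typescript
interface HungCall {
  callId: string;
  tool: string;
  time: number;
  params: unknown;
}

// Returns every tool.start that has no tool.end with the same callId.
function findUnmatchedStarts(logText: string): HungCall[] {
  const open: Record<string, HungCall> = {};
  for (const raw of logText.split("\n")) {
    let entry: unknown;
    try {
      entry = JSON.parse(raw);
    } catch {
      continue; // blank or non-JSON line
    }
    if (typeof entry !== "object" || entry === null) continue;
    const e = entry as Record<string, unknown>;
    if (typeof e.callId !== "string") continue;
    if (e.msg === "tool.start") {
      open[e.callId] = {
        callId: e.callId,
        tool: String(e.tool),
        time: Number(e.time),
        params: e.params,
      };
    } else if (e.msg === "tool.end") {
      delete open[e.callId];
    }
  }
  return Object.keys(open).map((id) => open[id]);
}

// Synthetic log: call "a" completed, call "b" never logged an end.
const sampleLog = [
  '{"level":30,"time":1000,"msg":"tool.start","tool":"sql_execution","callId":"a","params":{"sql":"SELECT 1"}}',
  '{"level":30,"time":1005,"msg":"tool.end","tool":"sql_execution","callId":"a","durationMs":5,"outcome":"ok"}',
  '{"level":30,"time":2000,"msg":"tool.start","tool":"sql_execution","callId":"b","params":{"sql":"SELECT COUNT(*) FROM profits"}}',
].join("\n");

const hungCalls = findUnmatchedStarts(sampleLog);
```

Here `hungCalls` contains only call `b`, with its SQL and start time intact.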
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Preventing / interrupting runaway queries** (off-event-loop execution, query
|
||||
timeouts, worker-thread isolation). A single synchronous query that fans out
|
||||
into a massive nested-loop join can peg the single-threaded server for hours
|
||||
and break new connections — observability surfaces *which* query, but the fix
|
||||
is execution-model work in a separate spec. (This logging is also the
|
||||
prerequisite for a future watchdog that detects a `tool.start` with no
|
||||
`tool.end` past a threshold and recycles the server.)
|
||||
- **Log redaction** (see Design decisions) — explicit v1 non-goal.
|
||||
- **Pretty output as a worker-thread transport** — the TTY path uses pino-pretty
|
||||
as a synchronous in-process stream only.
|
||||
- Metrics / tracing / OpenTelemetry exporters.
|
||||
- Forwarding logs to the MCP *client* via the protocol logging capability
|
||||
(`notifications/message`, `logging/setLevel`) — a possible later enhancement,
|
||||
distinct from operational stderr logging.
|
||||
- A global `KTX_LOG_LEVEL` spanning non-MCP commands — out of scope until other
|
||||
surfaces emit structured logs.
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
||||
the design.
|
||||
|
||||
- **New module** — a small logger factory, e.g.
|
||||
`packages/cli/src/context/mcp/logger.ts`: builds the shared pino instance from
|
||||
the stderr sink + `KTX_MCP_LOG_LEVEL`, choosing the pino-pretty (sync) stream
|
||||
when `process.stderr.isTTY` else `pino.destination({ sync: true })`, and
|
||||
exposes the slow-call threshold read from `KTX_MCP_SLOW_TOOL_MS`.
|
||||
- **Tool-call logging** — `packages/cli/src/context/mcp/context-tools.ts`:
|
||||
extend `instrumentMcpServer` (~line 585) to write `tool.start` before
|
||||
`handler(...)` and `tool.end` after (ok / slow-`warn` / `error`); generate
|
||||
`callId` via the already-imported `randomUUID`; read `sessionId` from the
|
||||
handler `context`. Thread the logger via `RegisterKtxContextToolsDeps`
|
||||
(~line 26) and `registerKtxContextTools` (~line 650). Leave `registerParsedTool`
|
||||
and the existing telemetry emission intact.
|
||||
- **Context type** — `packages/cli/src/context/mcp/types.ts`: add
|
||||
`sessionId?: string` to `KtxMcpToolHandlerContext`; add the logger to
|
||||
`KtxMcpServerDeps` / the register deps.
|
||||
- **Server wiring** — `packages/cli/src/context/mcp/server.ts`
|
||||
(`createDefaultKtxMcpServer` / `createKtxMcpServer`) and
|
||||
`packages/cli/src/mcp-server-factory.ts` (`createKtxMcpServerFactory`): accept
|
||||
and pass the logger down to `registerKtxContextTools`.
|
||||
- **HTTP lifecycle** — `packages/cli/src/mcp-http-server.ts`: construct (or
|
||||
receive) the logger; in `newTransport` (~line 186) log `session.open` /
|
||||
`session.close` and add `transport.onerror` → `transport.error`.
|
||||
- **stdio lifecycle** — `packages/cli/src/mcp-stdio-server.ts`: construct (or
|
||||
receive) the logger; route the existing `transport.onerror` (~line 54) through
|
||||
it.
|
||||
- **Log destination is already captured** — `packages/cli/src/managed-mcp-daemon.ts`
|
||||
redirects child stdout+stderr to `.ktx/logs/mcp.log`; `ktx mcp logs`
|
||||
(`commands/mcp-commands.ts`) tails it. No change needed there.
|
||||
- **Dependencies** — add `pino` and `pino-pretty` to
|
||||
`packages/cli/package.json`. Verify Knip/Biome dead-code and bundle checks
|
||||
still pass.
|
||||
- **Tests** — extend `packages/cli/test/mcp-http-server.test.ts`,
|
||||
`mcp-server-factory.test.ts`, `context/mcp/server.test.ts`, and
|
||||
`commands/mcp-commands.test.ts`: assert (a) a `tool.start` JSON line is written
|
||||
before a (mock) handler runs and carries `params`/`sql`; (b) a matching
|
||||
`tool.end` with `durationMs`/`outcome`; (c) a hung-handler scenario yields a
|
||||
`tool.start` with no `tool.end` for that `callId`; (d) a slow completion emits
|
||||
`warn`; (e) session lifecycle + `transport.error` lines; (f) the bearer token
|
||||
never appears. Inject a capturing `io.stderr` and parse the JSON lines.
|
||||
*Note:* `mcp-server-factory.test.ts` carries a pre-existing
|
||||
`KtxMcpContextPorts`/`contextTools` type error (from commit `2677b3ef`,
|
||||
unrelated to this work) — do not let it mask new failures.
|
||||
- After implementing, rebuild and re-link so the playground picks it up:
|
||||
`pnpm run build && pnpm run link:dev`.
|
||||
|
||||
## Benchmark context (motivation, not a requirement)
|
||||
|
||||
During a concurrent Spider 2.0-Lite run against the MCP server, an
|
||||
adversarial-reviewer-generated query degenerated into a massive nested-loop join;
|
||||
synchronous `better-sqlite3` executed it on the event loop, pegging a server at
|
||||
~100% CPU for hours and breaking new MCP connections ("Transport channel
|
||||
closed"). We could not determine *which* query, because the server logs nothing
|
||||
about tool calls — diagnosis required `sample` / `lsof` on the live process and
|
||||
the exact SQL was never recovered. Structured tool-call logging — especially
|
||||
`tool.start` written synchronously *before* execution, at the default level —
|
||||
would have turned this into a one-line `grep` of the server log. Improving the
|
||||
benchmark is a side effect; the logging is generic production-server hygiene.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Implemented on branch `write-feature-spec-wiki`. All requirements and acceptance
|
||||
criteria are satisfied.
|
||||
|
||||
**What was built / where**
|
||||
|
||||
- **New module `packages/cli/src/context/mcp/logger.ts`** — `createMcpLogger(io,
|
||||
{ isTTY? })` builds one synchronous `pino` (v10) instance written through the
|
||||
`io.stderr` sink: plain JSON when stderr is not a TTY, a `pino-pretty` (v13)
|
||||
synchronous in-process stream (`{ colorize: true, sync: true }`, wrapping the
|
||||
sink in a `node:stream.Writable`) when it is. Also exports `mcpLogLevel`
|
||||
(`KTX_MCP_LOG_LEVEL`, validated against pino levels, default `info`),
|
||||
`mcpSlowToolMs` (`KTX_MCP_SLOW_TOOL_MS`, default `10000`), and
|
||||
`serializeMcpError`. No worker/async transport; no global `KTX_LOG_LEVEL`.
|
||||
- **Tool-call logging — `instrumentMcpServer` (`context/mcp/context-tools.ts`)** —
|
||||
per invocation: `callId = randomUUID()`, a child logger bound to
|
||||
`{ tool, callId, sessionId? }`, `tool.start { params }` written at `info`
|
||||
**before** awaiting the handler (synchronous, so a runaway query still leaves it
|
||||
on disk), and `tool.end` after: `info { durationMs, outcome:"ok", resultSize }`,
|
||||
`warn` when `durationMs > KTX_MCP_SLOW_TOOL_MS`, or `error { outcome:"error",
|
||||
err }`. `resultSize` is the UTF-8 byte length of the serialized text content.
|
||||
The existing `mcp_request_completed` telemetry + `reportException` are unchanged
|
||||
(`durationMs` is now computed once and shared); `registerParsedTool` is intact.
|
||||
- **`sessionId` / logger plumbing** — `sessionId?: string` added to
|
||||
`KtxMcpToolHandlerContext`; a single per-process logger threads from each
|
||||
transport entrypoint through `createKtxMcpServerFactory` →
|
||||
`createDefaultKtxMcpServer` → `createKtxMcpServer` → `registerKtxContextTools`
|
||||
(`KtxMcpServerDeps.logger`, `RegisterKtxContextToolsDeps.logger`).
|
||||
- **HTTP lifecycle (`mcp-http-server.ts`)** — `session.open` from
|
||||
`onsessioninitialized`, `session.close` from `transport.onclose`, and the
|
||||
previously-unused `transport.onerror` wired to `transport.error` at `error`.
|
||||
- **stdio lifecycle (`mcp-stdio-server.ts`)** — the raw `transport.onerror`
|
||||
string write is replaced by a `transport.error` log line; `session.open` /
|
||||
`session.close` are logged for the single stdio session.
|
||||
- **Deps** — `pino ^10.3.1`, `pino-pretty ^13.1.3` added to
|
||||
`packages/cli/package.json`.
|
||||
- **Tests** — `test/context/mcp/logger.test.ts` (factory, level/threshold env
|
||||
parsing, error serializer, TTY vs JSON), a "MCP tool-call logging" block in
|
||||
`test/context/mcp/server.test.ts` (start-before-handler, matching end with
|
||||
`resultSize`, hung-handler leaves an unmatched start, slow→`warn`, `warn`-level
|
||||
suppression with errored end still present, no-logger no-op), session lifecycle
|
||||
+ bearer-token-never-logged in `test/mcp-http-server.test.ts`, and
|
||||
`test/mcp-stdio-server.test.ts` for `transport.error`.
|
||||
|
||||
**Deviations / decisions**
|
||||
|
||||
- **In-band errors carry no stack (inherent).** `registerParsedTool` converts a
|
||||
thrown handler error into an `{ isError: true }` result (and reports the full
|
||||
error via telemetry) before it reaches `instrumentMcpServer`, so the original
|
||||
stack is already gone. `tool.end` for such a result logs `outcome:"error"` with
|
||||
`err.message` only; a genuine throw that escapes gets the full pino `err`
|
||||
serialization (type + message + stack). The field is always `err` for
|
||||
consistency. This honours "leave `registerParsedTool` intact."
|
||||
- **`session.close` is logged from `transport.onclose`** (the universal close
|
||||
signal for both clean DELETE and dropped connections) rather than
|
||||
`onsessionclosed`, to avoid duplicate lines; `onsessionclosed` keeps its
|
||||
session-map cleanup role.
|
||||
- **The logger is optional throughout.** Production always wires one per process;
|
||||
when absent (programmatic/test callers that inject `createMcpServer`), tool-call
|
||||
logging is simply off — which keeps existing tests unchanged.
|
||||
- `createMcpLogger` accepts an optional `isTTY` purely as a test seam; production
|
||||
derives format from `process.stderr.isTTY`.
|
||||
|
||||
**Verification**
|
||||
|
||||
`pnpm --filter @kaelio/ktx exec vitest run` for the four touched/added MCP test
|
||||
files: 57 passed. Full default `pnpm run test`: 3018 passed, 1 skipped, 2 failed; both
|
||||
failures are in `test/skills/analytics-skill-content.test.ts`, pre-existing and
|
||||
unrelated to this change (in-progress analytics-skill work on this branch).
|
||||
`pnpm run dead-code` (Biome + Knip default + Knip production) is clean. `pnpm run
|
||||
build` and `pnpm run link:dev` succeed. `pnpm run type-check` reports only the
|
||||
one pre-existing, test-only error in `test/mcp-server-factory.test.ts` from commit
|
||||
`2677b3ef` (documented above); all source and the new tests type-check clean.
|
||||
|
|
@ -1,493 +0,0 @@
|
|||
# Bounded query execution (deadline + non-blocking) for read SQL
|
||||
|
||||
> Refined spec. Intake draft: `todo/16-bounded-query-execution-timeout.md`.
|
||||
>
|
||||
> **Scope: bound and cancel a read query that runs too long.** This is the
|
||||
> execution-model companion to spec 15 (MCP structured logging). Spec 15
|
||||
> *surfaces* a runaway query in the log; it explicitly defers *preventing* one —
|
||||
> "off-event-loop execution, query timeouts, worker-thread isolation … is
|
||||
> execution-model work in a separate spec." This is that spec.
|
||||
|
||||
## Problem
|
||||
|
||||
Two compounding gaps on the read-query path (`executeReadOnly`), confirmed in the
|
||||
current code:
|
||||
|
||||
1. **No execution deadline, handled divergently per connector.** A single
|
||||
expensive query runs unbounded, and whether it is bounded at all depends
|
||||
entirely on which driver the caller hit:
|
||||
- **BigQuery** is the only connector with a real statement timeout — it sets
|
||||
`jobTimeoutMs` on the query job from a per-connection config field
|
||||
`job_timeout_ms` (`connectors/bigquery/connector.ts`, `query(...)` ~491–512).
|
||||
- **ClickHouse** sets a hardcoded 30s *HTTP* `request_timeout` at client
|
||||
creation (`connectors/clickhouse/connector.ts:602`) — a client-side give-up,
|
||||
not a server-side `max_execution_time`; the server keeps working.
|
||||
- **Snowflake, Postgres, MySQL, SQL Server** bound only pool/connection
|
||||
*acquisition* (Snowflake `acquireTimeoutMillis: 60_000`; Postgres
|
||||
`connectionTimeoutMillis: 10_000`; SQL Server `idleTimeoutMillis: 30000`;
|
||||
MySQL pool size only) — nothing bounds statement *execution*.
|
||||
- **SQLite** has nothing.
|
||||
|
||||
2. **In-process SQLite blocks the event loop and cannot be cancelled.** The
|
||||
SQLite connector executes on the main thread via synchronous
|
||||
`better-sqlite3 .prepare().all()` (`connectors/sqlite/connector.ts`,
|
||||
`query(...)` lines 311–318, used by `executeReadOnly` lines 247–251). A slow query freezes
|
||||
the whole MCP server — it cannot serve other requests, send progress, or write
|
||||
`tool.end` — and there is no in-thread way to interrupt it: better-sqlite3 (v12)
|
||||
exposes no interrupt/cancel API. Its documented mechanism for slow queries is a
|
||||
**worker thread**, and the only way to stop a runaway synchronous query is to
|
||||
**terminate the thread** executing it (context7 `/wiselibs/better-sqlite3`,
|
||||
`docs/threads.md`).
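
Why an in-thread timer cannot rescue gap 2 can be shown in a few lines. This is a generic illustration of single-threaded event-loop behaviour, not code from the connector: `busyWait` is a hypothetical stand-in for a long synchronous query.

```typescript
// Spins synchronously for roughly `ms` milliseconds, like a long synchronous query.
function busyWait(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // no awaits, no yielding: nothing else on this thread can run
  }
}

const state = { timerFired: false };

// A 0 ms "deadline" timer, scheduled before the blocking work starts.
const deadlineTimer = setTimeout(() => {
  state.timerFired = true;
}, 0);

busyWait(25);

// Sampled right after the blocking call returns: the callback has not run yet,
// because timer callbacks only run once the current synchronous code finishes.
const firedDuringBlock = state.timerFired;
clearTimeout(deadlineTimer);
```

The timer was due long before `busyWait` finished, yet it had no chance to fire. A deadline implemented on the same thread as the query is in the same position, which is why the spec points to a worker thread that can be terminated from outside.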
|
||||
|
||||
The observed failure (Spider 2.0-Lite SQLite run, 2026-06-18): a single
|
||||
`sql_execution` MCP call —
|
||||
`SELECT MIN(time_id), MAX(time_id), COUNT(*) FROM profits` on `complex_oracle`,
|
||||
where `profits` is a VIEW (`costs ⋈ sales`, 918,843 × 82,112 rows, joined on a
|
||||
4-column key with no composite index) — degraded to an O(N×M) nested-loop scan,
|
||||
pegged a worker at 100% CPU for 13+ minutes, never returned, produced a
|
||||
`tool.start` with no matching `tool.end`, and stalled an eval shard until the
|
||||
worker was killed by hand. A row cap (`maxRows`) does not help: it bounds returned
|
||||
rows, not scan work, and the failing query returned a single aggregate row.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
Any data agent that lets an LLM author SQL will eventually issue an
|
||||
accidentally-expensive query — an unindexed or cartesian join, an expensive VIEW,
|
||||
a wide aggregate over a large fact table. A general-purpose context layer must
|
||||
bound that and return a clean, fast "query exceeded Ns" error so the agent can
|
||||
revise (add filters, query base tables, narrow the range) instead of hanging the
|
||||
tool and the server. This matters for embedded/local warehouses (SQLite, and any
|
||||
future DuckDB-style in-process driver) and remote ones alike, and is wholly
|
||||
independent of any benchmark.
|
||||
|
||||
## Design decisions (resolved during refinement)
|
||||
|
||||
These resolve ambiguities the intake draft left open. They constrain the
|
||||
implementer; the exact code is theirs.
|
||||
|
||||
### One canonical deadline, applied uniformly at the contract
|
||||
|
||||
The deadline is enforced for **every** `executeReadOnly` caller, not only the MCP
|
||||
`sql_execution` path. `executeReadOnly` has 13 call sites beyond MCP (ingest query
|
||||
executor, relationship profiling and composite-candidate probes, relationship
|
||||
validation, historic-SQL probes, `ktx sql`); the contract is the single place to
|
||||
bound all of them. A heavy ingest profiling probe over a giant unindexed join is
|
||||
exactly as worth abandoning as an interactive one — those call sites are
|
||||
best-effort and degrade gracefully, so a deadline `KtxQueryError` becomes "skip
|
||||
this probe / mark unprofiled," not "fail the source." (Requirement 8 covers the
|
||||
call sites that must treat the timeout as recoverable.)
|
||||
|
||||
> Rejected alternative: a caller-resolved deadline (short on the interactive path,
|
||||
> longer/none for ingest). That introduces a second value source and the open
|
||||
> question "what is the ingest budget," for no real gain — the 30s default already
|
||||
> clears any normal profiling probe, and a probe that exceeds it is one to drop.
|
||||
|
||||
### Default 30s, configurable per-connection via one shared field
|
||||
|
||||
- **Default `30_000` ms.** Fast enough that an LLM agent gets a clean
|
||||
"exceeded 30s" and revises within the same turn; generous headroom over any
|
||||
indexed aggregate or normal profiling probe; a genuinely pathological nested-loop
|
||||
scan blows past it immediately.
|
||||
- **One shared per-connection override**, honored by every connector:
|
||||
`query_timeout_ms` in `ktx.yaml` (`queryTimeoutMs` in TS), a positive integer
|
||||
in **milliseconds**. Milliseconds matches the BigQuery SDK and the field it
|
||||
replaces; the user-facing error still reads in seconds.
|
||||
- **BigQuery's `job_timeout_ms` config key is removed**, not kept alongside the
|
||||
new field. BigQuery reads the shared `query_timeout_ms` and maps the resolved
|
||||
value onto its SDK's `jobTimeoutMs`. ktx keeps no backward compatibility, so
|
||||
there is exactly one way to set a query timeout — no parallel knob (intake
|
||||
requirement 1).
|
||||
- **Granularity is per-connection only.** No global all-connections override —
|
||||
different warehouses have different performance envelopes, and a second
|
||||
(global) knob would double the configuration surface for no stated need.
|
||||
|
||||
### The shared contract is a value + an error, not a base class
|
||||
|
||||
There is **no shared connector base class or factory** — each connector is
|
||||
constructed independently; the only shared registry is the *dialect* factory
|
||||
(`context/connections/dialects.ts:47–55`). So "defined once" (intake requirement
|
||||
3) means a single shared module that owns:
|
||||
|
||||
- `DEFAULT_QUERY_TIMEOUT_MS = 30_000`;
|
||||
- `resolveQueryDeadlineMs(connectionConfig)` → the validated `query_timeout_ms`
|
||||
override, else the default — so the default and the override precedence live in
|
||||
exactly one place;
|
||||
- `queryDeadlineExceededError(deadlineMs)` → a `KtxQueryError` with the canonical
|
||||
message `query exceeded ${Math.round(deadlineMs / 1000)}s`.
|
||||
|
||||
Each connector calls the resolver once (at construction; connectors already
|
||||
receive their connection config) and stores `this.deadlineMs`. **Enforcement is
|
||||
necessarily per-connector** — different engines cancel differently — but the
|
||||
*value* and the *error message* are shared, so the agent sees one consistent,
|
||||
actionable error regardless of driver.
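
The three exports listed above could look roughly like the following. This is a minimal sketch, not the module ktx ships: `ConnectionConfigLike` is a hypothetical stand-in for the real connection config type, the local `KtxQueryError` class only mimics the one in `errors.ts`, and the wording of the validation message is an assumption.

```typescript
// Stand-in for ktx's KtxQueryError; the real class lives in packages/cli/src/errors.ts.
class KtxQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'KtxQueryError';
  }
}

const DEFAULT_QUERY_TIMEOUT_MS = 30_000;

// Hypothetical minimal config shape; only the shared field matters for this sketch.
interface ConnectionConfigLike {
  query_timeout_ms?: unknown;
}

// Returns the validated per-connection override, else the default.
function resolveQueryDeadlineMs(connection: ConnectionConfigLike): number {
  const override = connection.query_timeout_ms;
  if (override === undefined) return DEFAULT_QUERY_TIMEOUT_MS;
  if (typeof override !== 'number' || !Number.isInteger(override) || override <= 0) {
    // Assumed message text; the spec only requires a clear validation error.
    throw new Error(`query_timeout_ms must be a positive integer (ms), got ${String(override)}`);
  }
  return override;
}

// Builds the canonical error; the message reads in whole seconds.
function queryDeadlineExceededError(deadlineMs: number): KtxQueryError {
  return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`);
}
```

Because the message rounds to whole seconds, a very small override used as a test seam (for example 20 ms) would read as `query exceeded 0s` under this formula.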
|
||||
|
||||
### Real cancellation, not client-side give-up
|
||||
|
||||
Per intake requirement 5, the deadline must *stop the work*, not merely abandon
|
||||
the promise while the query keeps running (which on a pooled driver also risks
|
||||
returning a still-busy connection to the pool). So:
|
||||
|
||||
- **In-process (SQLite, and any future embedded driver):** run the query off the
|
||||
main thread and enforce the deadline by **terminating the worker thread**. There
|
||||
is no generic `Promise.race` outer wrapper — a `Promise.race` against a
|
||||
synchronous in-thread `.all()` can never fire (the loop is blocked), and against
|
||||
a pooled remote query it would poison the pool. Thread termination *is* the
|
||||
cancellation.
|
||||
- **Remote engines:** set the engine's **server-side statement timeout** so the
|
||||
server itself aborts the query and frees the connection cleanly.
|
||||
|
||||
### Logging routes through spec 15's pino path — no second logger
|
||||
|
||||
The deadline cases are logged through the **existing** MCP tool-call logger
|
||||
(spec 15's `instrumentMcpServer`, `context/mcp/context-tools.ts:644–730`), not a
|
||||
new logging path threaded into the connector. Verified flow for a timeout:
|
||||
`executeReadOnly` throws `queryDeadlineExceededError` (a `KtxQueryError`) →
|
||||
`local-project-ports.ts` preserves it → `registerParsedTool` (:552) reports it
|
||||
(`reportException` skips `$exception` for `KtxExpectedError`) and returns an
|
||||
in-band `isError` result → `instrumentMcpServer` writes `tool.end` at **`error`**
|
||||
with `outcome:"error"`, `err.message = "query exceeded {N}s"`, and the **same
|
||||
`callId`** as the `tool.start`.
|
||||
|
||||
This is the central observability win and it requires **no new MCP logging code**:
|
||||
spec 15 made a hang show up as a `tool.start` with *no* matching `tool.end`; this
|
||||
spec turns it into a **matched `tool.start` → `tool.end(error)` pair** whose
|
||||
`tool.end` names the deadline. The worker-termination (SQLite) and server-side
|
||||
abort (remote) are internal enforcement mechanisms; their single observable signal
|
||||
is that `tool.end`, so the connector does **not** get its own logger threaded
|
||||
through `KtxScanContext` — that would fork a second path for one capability. The
|
||||
"worker was actually reaped, not left spinning" guarantee is asserted by the
|
||||
worker's `exit` event in tests (Requirement 3), not by a log line.
|
||||
|
||||
## Requirements
|
||||
|
||||
### 1. Shared deadline contract, defined once
|
||||
|
||||
A single new module (e.g. `packages/cli/src/context/connections/query-deadline.ts`)
|
||||
exports `DEFAULT_QUERY_TIMEOUT_MS` (30_000), `resolveQueryDeadlineMs(connectionConfig)`,
|
||||
and `queryDeadlineExceededError(deadlineMs)`. Every connector resolves its
|
||||
deadline through this resolver; no connector hardcodes its own default or
|
||||
duplicates the override-precedence logic.
|
||||
|
||||
### 2. Shared per-connection config field; BigQuery's removed
|
||||
|
||||
`query_timeout_ms` is added to the **shared** connection config schema (validated
|
||||
as an optional positive integer, milliseconds) so every driver accepts it. The
|
||||
BigQuery-specific `job_timeout_ms` config field and its dedicated reader
|
||||
(`bigQueryJobTimeoutMsFromConnection`) are removed; BigQuery sources its timeout
|
||||
from the shared field and applies it as `jobTimeoutMs`. A bad `query_timeout_ms`
|
||||
(zero, negative, non-integer) is a clear config validation error, consistent with
|
||||
how ktx validates `ktx.yaml`.
|
||||
|
||||
### 3. SQLite executes off the main thread, terminated on deadline
|
||||
|
||||
`executeReadOnly` on the SQLite connector MUST NOT block the MCP server event
|
||||
loop:
|
||||
|
||||
- Read-only validation and the row-limit wrapper (`assertReadOnlySql` +
|
||||
`limitSqlForExecution`) run **on the main thread** before dispatch — invalid SQL
|
||||
fails instantly without spawning a worker, and read-only enforcement stays at
|
||||
the boundary (Requirement 7).
|
||||
- The validated, row-limited SQL (and any params) is dispatched to a **worker
|
||||
thread** that opens the database `{ readonly: true, fileMustExist: true }`, runs
|
||||
the query, and posts back `{ headers, rows, totalRows }` (all values are
|
||||
structured-cloneable — primitives, `Buffer`, `BigInt`).
|
||||
- The main thread arms a timer for `this.deadlineMs`; on expiry it calls
|
||||
`worker.terminate()` and rejects with `queryDeadlineExceededError`. On a normal
|
||||
message it clears the timer and resolves. On a worker error (SQLite rejected the
|
||||
SQL) it rejects with that error, message preserved. A provided
|
||||
`ctx.signal` (`KtxScanContext.signal`, already on the contract) also terminates
|
||||
the worker, for external cancellation.
|
||||
- **One short-lived worker per call**, terminated on completion or deadline — not
|
||||
a persistent worker or pool. Terminate-on-deadline destroys the worker, so a
|
||||
pool would need respawn/job-tracking for no benefit: `executeReadOnly` is
|
||||
low-frequency (LLM-issued, serial per agent turn) and worker spawn cost is
|
||||
negligible against query latency. The other SQLite paths (introspect, sample,
|
||||
stats, distinct-values, row-count) stay on the main thread — they are
|
||||
ktx-authored, bounded, and not on the `executeReadOnly` contract.
|
||||
- The event loop stays responsive throughout, so `tool.end` is always written and
|
||||
concurrent requests on the same port are served.
|
||||
|
||||
### 4. Remote engines set a real server-side statement timeout
|
||||
|
||||
Each remote connector applies `this.deadlineMs` as its engine's server-side
|
||||
statement timeout, so the deadline stops server work rather than abandoning the
|
||||
promise:
|
||||
|
||||
| Connector | Mechanism | Unit |
|------------|-----------|------|
| BigQuery | `jobTimeoutMs` on the query job (replaces `job_timeout_ms`) | ms |
| Postgres | `statement_timeout` | ms |
| MySQL | session `max_execution_time` (applies to read-only SELECT, the only kind on this path) | ms |
| Snowflake | `STATEMENT_TIMEOUT_IN_SECONDS` (ALTER SESSION) | s (ceil) |
| ClickHouse | `max_execution_time` setting, with `request_timeout` aligned to the deadline so the HTTP client does not give up before the server aborts | s (ceil) |
| SQL Server | `mssql` `requestTimeout` (TDS attention cancels server-side) | ms |
|
||||
|
||||
ClickHouse's existing hardcoded 30s `request_timeout` is brought under this
|
||||
contract (derived from the resolved deadline), not left as a parallel mechanism.
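
The unit handling in this requirement can be sketched as one pure function. The function name, the result shape, and the default ClickHouse HTTP margin are hypothetical; the implementation notes in this spec report `deadline + 5s` for `request_timeout`, and the sketch takes that as its default.

```typescript
// Hypothetical result shape: one field per engine knob named in the requirement.
interface EngineTimeoutSettings {
  bigQueryJobTimeoutMs: number;
  postgresStatementTimeoutMs: number;
  mysqlMaxExecutionTimeMs: number;
  snowflakeStatementTimeoutSeconds: number;
  clickhouseMaxExecutionTimeSeconds: number;
  clickhouseRequestTimeoutMs: number;
  sqlServerRequestTimeoutMs: number;
}

function engineTimeoutSettings(
  deadlineMs: number,
  clickhouseHttpMarginMs = 5_000, // assumed margin so the HTTP client outlasts the server abort
): EngineTimeoutSettings {
  // Engines that take seconds round up, so the server never aborts before the deadline.
  const ceilSeconds = Math.ceil(deadlineMs / 1000);
  return {
    bigQueryJobTimeoutMs: deadlineMs,
    postgresStatementTimeoutMs: deadlineMs,
    mysqlMaxExecutionTimeMs: deadlineMs,
    snowflakeStatementTimeoutSeconds: ceilSeconds,
    clickhouseMaxExecutionTimeSeconds: ceilSeconds,
    clickhouseRequestTimeoutMs: deadlineMs + clickhouseHttpMarginMs,
    sqlServerRequestTimeoutMs: deadlineMs,
  };
}
```

Rounding up means a 1,500 ms deadline becomes a 2 s server-side limit on the engines that take seconds, so those engines can run slightly past the configured value.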
|
||||
|
||||
### 5. Timeout resolves as a `KtxQueryError` with the canonical message
|
||||
|
||||
On exceeding the deadline, the call fails with a `KtxQueryError`
|
||||
(`query exceeded {N}s`) — a finite, decision-reaching outcome, never an unbounded
|
||||
hang. For SQLite the worker-termination path throws `queryDeadlineExceededError`
|
||||
directly. For remote engines, each connector recognizes **its own** engine's
|
||||
timeout signal (Postgres `57014`; MySQL errno `3024`; ClickHouse code `159`;
|
||||
SQL Server `ETIMEOUT`; Snowflake and BigQuery timeout errors) and re-wraps it as
|
||||
`queryDeadlineExceededError`, keeping the driver error as `cause`. Each connector
|
||||
owns its driver's signal — there is no central denylist of error codes to
|
||||
maintain.
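
For one engine, the re-wrapping could look like the sketch below, using Postgres and its `57014` code. The helper names are hypothetical, the local `KtxQueryError` is a stand-in for the real class, and reading the SQLSTATE from a `code` property on the thrown error is an assumption about the driver's error shape.

```typescript
// Stand-in for ktx's KtxQueryError that keeps the driver error as `cause`.
class KtxQueryError extends Error {
  declare cause?: unknown;
  constructor(message: string, options?: { cause?: unknown }) {
    super(message);
    this.name = 'KtxQueryError';
    this.cause = options?.cause;
  }
}

// Postgres reports a statement_timeout abort with SQLSTATE 57014 (query_canceled).
function isPostgresQueryCanceled(err: unknown): boolean {
  return typeof err === 'object' && err !== null && (err as { code?: unknown }).code === '57014';
}

// Re-wraps only this connector's own timeout signal; every other error passes through.
function mapPostgresError(err: unknown, deadlineMs: number): unknown {
  if (isPostgresQueryCanceled(err)) {
    return new KtxQueryError(`query exceeded ${Math.round(deadlineMs / 1000)}s`, { cause: err });
  }
  return err;
}
```

A connector would call the mapper in its `catch` block and rethrow the result, so unrelated driver errors keep their original message.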
|
||||
|
||||
### 6. MCP surfacing and logging via the existing pino path
|
||||
|
||||
The MCP `sql_execution` path already (a) maps any non-native driver error to
|
||||
`KtxQueryError` (`context/mcp/local-project-ports.ts:78–88`, guarded by
|
||||
`isNativeProgrammingFault`), (b) reports it through `reportException`, which skips
|
||||
`$exception` Error Tracking for `KtxExpectedError`, and (c) writes `tool.start`
|
||||
synchronously before the handler and `tool.end` in `instrumentMcpServer`
|
||||
(`context/mcp/context-tools.ts:644–730`). The deadline cases MUST surface through
|
||||
this path — the implementer verifies and tests them, but adds **no parallel
|
||||
classification or logging path**:
|
||||
|
||||
- **Query exceeds the deadline (any driver):** a `tool.end` at **`error`** with
|
||||
`outcome:"error"` and `err.message = "query exceeded {N}s"`, carrying the same
|
||||
`callId` as the `tool.start`. Classified as an expected error, so it is absent
|
||||
from `$exception` Error Tracking. The reason `tool.end` was previously missing
|
||||
is solely the blocked event loop (Requirement 3); once the loop stays free and
|
||||
the deadline throws, the existing instrumentation logs the matched pair — closing
|
||||
spec 15's "`tool.start` with no `tool.end` = hang" gap for this case.
|
||||
- **Completed-but-slow query (under the deadline, over `KTX_MCP_SLOW_TOOL_MS`):**
|
||||
unchanged from spec 15 — its `tool.end` is emitted at **`warn`**. The deadline
|
||||
(default 30s) and the slow threshold (default 10s) are independent knobs; a query
|
||||
between 10s and 30s completes with a slow `warn`, one past 30s is killed with the
|
||||
`error` above.
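
The interaction between the two knobs can be written as a small decision function. This is only an illustration of the rule stated above, not code from `instrumentMcpServer`: the function name is hypothetical, and the `info` level for a normal completion is an assumption, since this spec names only the `warn` and `error` cases.

```typescript
type ToolEndLevel = 'info' | 'warn' | 'error';

// Picks the log level for tool.end from the outcome and the elapsed time.
function toolEndLevel(
  outcome: 'ok' | 'error',
  durationMs: number,
  slowToolMs = 10_000, // default of KTX_MCP_SLOW_TOOL_MS as stated in this spec
): ToolEndLevel {
  // A deadline timeout arrives as an error outcome, regardless of duration.
  if (outcome === 'error') return 'error';
  // Completed but slow: under the deadline, over the slow threshold.
  return durationMs > slowToolMs ? 'warn' : 'info';
}
```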
|
||||
|
||||
### 7. Read-only enforcement and `maxRows` unchanged
|
||||
|
||||
`assertReadOnlySql` and the `maxRows` row cap (`limitSqlForExecution`) behave
|
||||
exactly as today. The deadline is additive. `maxRows` is not a substitute for it
|
||||
(it bounds returned rows, not scan work).
|
||||
|
||||
### 8. Best-effort callers treat a deadline timeout as recoverable
|
||||
|
||||
The non-interactive `executeReadOnly` call sites that are best-effort —
|
||||
relationship profiling, composite-candidate probes, relationship validation,
|
||||
historic-SQL probes — MUST treat a deadline `KtxQueryError` as "skip this
|
||||
probe / mark unprofiled" and continue, never as a source-fatal error. The
|
||||
implementer confirms each such site already swallows query errors into a
|
||||
graceful-skip and adds that handling where it does not, so the uniform deadline
|
||||
(Requirement 1, applied to all callers) cannot abort an ingest run. A skipped
|
||||
probe is logged at the skip site through that path's existing scan/ingest logger
|
||||
(`KtxScanContext.logger`, `warn`/`debug`), never silently dropped — these callers
|
||||
are off the MCP tool-call path, so their visibility comes from the logger they
|
||||
already use.
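
A graceful skip at one such call site might be shaped like this. The wrapper, the `ProbeOutcome` type, and the logger interface are hypothetical names for illustration; the real call sites each have their own result types and warning codes.

```typescript
// Stand-in for ktx's KtxQueryError.
class KtxQueryError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'KtxQueryError';
  }
}

// Hypothetical subset of the scan/ingest logger.
interface ProbeLogger {
  warn(message: string): void;
}

type ProbeOutcome<T> = { status: 'ok'; value: T } | { status: 'skipped'; reason: string };

// Runs one best-effort probe; a query error (including a deadline) skips it instead of failing the run.
async function runBestEffortProbe<T>(
  name: string,
  probe: () => Promise<T>,
  logger: ProbeLogger,
): Promise<ProbeOutcome<T>> {
  try {
    return { status: 'ok', value: await probe() };
  } catch (err) {
    if (err instanceof KtxQueryError) {
      // Logged at the skip site, never silently dropped.
      logger.warn(`skipping probe ${name}: ${err.message}`);
      return { status: 'skipped', reason: err.message };
    }
    throw err; // programming faults still propagate
  }
}
```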
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- A read query that exceeds the deadline returns a `KtxQueryError`
|
||||
(`query exceeded {N}s`) within roughly the deadline; the MCP server stays
|
||||
responsive (a concurrent tool call on the same server completes while the slow
|
||||
query is still pending) and writes a matching `tool.end` with a non-ok outcome.
|
||||
- **Logging:** a timed-out `sql_execution` produces a `tool.start` and a matching
|
||||
`tool.end` (same `callId`) at `error` with `outcome:"error"` and
|
||||
`err.message = "query exceeded {N}s"` — no unmatched `tool.start` remains. The
|
||||
timeout does not raise a `$exception` Error Tracking event (it is a
|
||||
`KtxExpectedError`). A completed query slower than `KTX_MCP_SLOW_TOOL_MS` but
|
||||
under the deadline still emits its `tool.end` at `warn`. No new logger is
|
||||
introduced — the lines come from the existing `instrumentMcpServer`.
|
||||
- **SQLite specifically:** executing a deliberately pathological query (an
|
||||
expensive VIEW or an unindexed cross join) on a fixture does not block the event
|
||||
loop, is terminated at the deadline, and the worker exits (the off-main-thread
|
||||
executor is killed, not left spinning) so CPU returns to idle.
|
||||
- **One server-side-timeout driver (Postgres):** the connector applies
|
||||
`statement_timeout` equal to the resolved deadline, and a `57014` cancellation
|
||||
is mapped to the canonical `KtxQueryError`.
|
||||
- `resolveQueryDeadlineMs` returns 30_000 by default, honors a `query_timeout_ms`
|
||||
override, and rejects an invalid value (zero / negative / non-integer).
|
||||
- **No regression:** normal fast queries return identical results; read-only
|
||||
rejection still works; `maxRows` still bounds returned rows.
|
||||
- The shared `query_timeout_ms` field is accepted by every connector; BigQuery's
|
||||
former `job_timeout_ms` key is gone and BigQuery's timeout is driven by the
|
||||
shared field.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **A row/byte/cost budget on returned data.** This spec bounds *time*, not result
|
||||
size — `maxRows` already bounds rows, and BigQuery's `maximumBytesBilled` is a
|
||||
separate, retained concern.
|
||||
- **A global `KTX_QUERY_TIMEOUT_MS` or per-call user flag.** One opinionated
|
||||
default plus a per-connection override; no per-call knob, no global knob.
|
||||
- **A server watchdog that recycles the process on an unmatched `tool.start`.**
|
||||
Spec 15 names this as a possible future mitigation; this spec prevents the hang
|
||||
at the source, so the watchdog is out of scope here.
|
||||
- **Moving SQLite introspection / sampling / stats off the main thread.** Only the
|
||||
`executeReadOnly` (LLM-SQL) path needs worker isolation; the rest are bounded
|
||||
ktx-authored queries.
|
||||
- **Per-connection retry / backoff on timeout.** A timeout returns a clean error
|
||||
for the agent to revise; ktx does not auto-retry.
|
||||
- **A second logger threaded into the connector.** The deadline cases are logged
|
||||
through spec 15's existing MCP tool-call logger; the connector gets no separate
|
||||
pino instance and `KtxScanContext` gets no MCP-logger thread (see "Logging routes
|
||||
through spec 15's pino path").
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
||||
the design.
|
||||
|
||||
- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`:
|
||||
`DEFAULT_QUERY_TIMEOUT_MS`, `resolveQueryDeadlineMs`, `queryDeadlineExceededError`.
|
||||
Error class is `KtxQueryError` (`packages/cli/src/errors.ts:25`).
|
||||
- **Contract anchor** — `KtxScanConnector.executeReadOnly`
|
||||
(`context/scan/types.ts:343`), `KtxReadOnlyQueryInput` (`types.ts:285`),
|
||||
`KtxScanContext.signal` (`types.ts:176`, already present, currently unused on the
|
||||
MCP path).
|
||||
- **Config schema** — add `query_timeout_ms` to the shared connection config
|
||||
(`context/project/config.ts`, `KtxProjectConnectionConfig` and its zod schema);
|
||||
remove BigQuery's `job_timeout_ms` reader.
|
||||
- **SQLite worker** — new `packages/cli/src/connectors/sqlite/read-query-worker.ts`
|
||||
(constructed by path via `new URL('./read-query-worker.js', import.meta.url)`);
|
||||
rework `connectors/sqlite/connector.ts` `executeReadOnly` (247–251) to validate
|
||||
on the main thread then dispatch to the worker with a terminate-on-deadline
|
||||
timer. Reuse `normalizeQueryRows` (`context/connections/query-executor.ts`) in
|
||||
the worker. Register the worker as a dynamic entry in `knip.json` (it is
|
||||
referenced by path, not import) and confirm the build copies it into `dist`.
|
||||
- **Remote connectors** — apply the resolved deadline and recognize the engine's
|
||||
timeout signal in each `executeReadOnly` / `query(...)`:
|
||||
`connectors/bigquery/connector.ts` (~491–512, `jobTimeoutMs`),
|
||||
`connectors/clickhouse/connector.ts` (~602/629–644, `max_execution_time` +
|
||||
`request_timeout`), `connectors/snowflake/connector.ts` (~354–371/510–534,
|
||||
`STATEMENT_TIMEOUT_IN_SECONDS`), `connectors/postgres/connector.ts` (~822–838,
|
||||
`statement_timeout`), `connectors/mysql/connector.ts` (~774–793,
|
||||
`max_execution_time`), `connectors/sqlserver/connector.ts` (~812–832,
|
||||
`requestTimeout`).
|
||||
- **MCP path + logging (verify only)** — `context/mcp/local-project-ports.ts:69–88`
|
||||
(error mapping), the `sql_execution` registration (~915–943), and the logging in
|
||||
`instrumentMcpServer` (`context/mcp/context-tools.ts:644–730`, which writes
|
||||
`tool.start`/`tool.end` via the spec-15 pino logger `context/mcp/logger.ts`). No
|
||||
new classification or logging code; confirm the timeout flows through as an
|
||||
expected error producing a matching `tool.end(error)` with the canonical message.
|
||||
- **Best-effort callers** — `context/scan/relationship-profiling.ts` (~227, 275),
|
||||
`context/scan/relationship-composite-candidates.ts` (~365, 440),
|
||||
`context/scan/relationship-validation.ts` (~259),
|
||||
`context/ingest/historic-sql-probes/bigquery-runner.ts` (~97), and the
|
||||
historic-sql clients: confirm a deadline `KtxQueryError` is swallowed into a
|
||||
graceful skip.
|
||||
- **Tests** — a SQLite fixture with a pathological query (tiny `query_timeout_ms`
|
||||
as the test seam) asserting terminate-on-deadline, event-loop responsiveness
|
||||
(a concurrent promise resolves while the query is pending), and worker exit; a
|
||||
Postgres test asserting `statement_timeout` is set to the resolved deadline and
|
||||
a `57014` error maps to `KtxQueryError`; resolver unit tests (default /
|
||||
override / invalid); regression tests for normal results, read-only rejection,
|
||||
and `maxRows`. Extend the MCP logging tests (alongside spec 15's, e.g.
|
||||
`test/context/mcp/server.test.ts`) to assert a timed-out `sql_execution` yields a
|
||||
matched `tool.start`/`tool.end(error)` pair carrying `query exceeded {N}s`.
|
||||
- After implementing, rebuild and re-link so the playground picks it up:
|
||||
`pnpm run build && pnpm run link:dev`.
|
||||
|
||||
## Benchmark context (motivation, not a requirement)
|
||||
|
||||
The Spider2-lite local set loads several warehouses into SQLite, some with
|
||||
expensive VIEWs over large fact tables — e.g. `complex_oracle.profits` =
|
||||
`costs ⋈ sales` on `(prod_id, time_id, channel_id, promo_id)`, 918,843 × 82,112
|
||||
rows, no composite index, with `promo_id` (the index the optimizer picks) being
|
||||
95.5% a single value. LLM-authored profiling queries (MIN/MAX/COUNT over such a
|
||||
view) trigger O(N×M) nested-loop scans. Without a deadline these hang an eval
|
||||
shard for 10+ minutes; with one, the agent gets a fast error and can scope the
|
||||
query instead. Improving the benchmark is a side effect; the deadline is generic
|
||||
production hygiene for any agent that lets an LLM author SQL.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
|
||||
acceptance criteria are met; tests, type-check, dead-code, and build are green
|
||||
for the changed surface.
|
||||
|
||||
### What was built, and where
|
||||
|
||||
- **Shared contract** — new `packages/cli/src/context/connections/query-deadline.ts`:
|
||||
`DEFAULT_QUERY_TIMEOUT_MS = 30_000`, `resolveQueryDeadlineMs(connection)` (returns
|
||||
the validated `query_timeout_ms` override else the default; throws on
|
||||
zero/negative/non-integer), and `queryDeadlineExceededError(deadlineMs, options?)`
|
||||
(a `KtxQueryError` reading `query exceeded ${round(ms/1000)}s`, carrying the
|
||||
driver error as `cause`). Unit-tested in `test/context/connections/query-deadline.test.ts`.
|
||||
- **Config field** — `query_timeout_ms` (optional positive integer, ms) added to
|
||||
the **shared warehouse** schema. NOTE (spec drift): that schema lives in
|
||||
`context/project/driver-schemas.ts` (`warehouseConnectionSchema`), not
|
||||
`config.ts`. The warehouse schemas use `z.looseObject`, so the field had to be
|
||||
declared explicitly to be *validated* (otherwise it would pass through
|
||||
unvalidated). BigQuery's `job_timeout_ms` field and `bigQueryJobTimeoutMsFromConnection`
|
||||
reader were removed; BigQuery now resolves the shared field. Every connector
|
||||
resolves its deadline once at construction via `resolveQueryDeadlineMs`.
|
||||
|
||||
### Deviation from the spec's SQLite mechanism (worker thread → child process)
|
||||
|
||||
The spec mandated running SQLite read queries on a **worker thread** and enforcing
|
||||
the deadline by `worker.terminate()`. This was **empirically disproven**:
|
||||
`Worker.terminate()` cannot interrupt a CPU-bound synchronous `better-sqlite3`
|
||||
scan — the native `sqlite3_step` loop never yields to V8, so terminate's promise
|
||||
never even resolves (an 8s probe of the exact failing query shape confirmed the
|
||||
thread keeps spinning). better-sqlite3 v12 exposes no `interrupt`/progress-handler
|
||||
API, and `.iterate()` does not help because the failing query is a single
|
||||
aggregate row produced only *after* the full scan.
|
||||
|
||||
The implemented mechanism is therefore **`child_process.fork` + `SIGKILL`**
|
||||
(`packages/cli/src/connectors/sqlite/read-query-child.ts`, spawned from
|
||||
`connector.ts`). SIGKILL lets the OS reclaim the whole process — a probe confirmed
|
||||
the scan is interrupted in ~2 ms and CPU returns to idle. This satisfies *both*
|
||||
SQLite requirements better than a thread (event loop stays free **and** the query
|
||||
is genuinely cancellable). The child is self-contained (imports only
|
||||
`better-sqlite3` + node builtins); validation/row-limiting (`limitSqlForExecution`)
|
||||
and `normalizeQueryRows` stay on the main thread. One short-lived child per call,
|
||||
killed on completion, deadline, or `ctx.signal` abort. Node v24's native
|
||||
TS type-stripping lets the `.ts` child load under vitest; a `.js`-if-exists-else-`.ts`
|
||||
URL resolver picks the compiled child in `dist`. Registered as a dynamic entry in
|
||||
`knip.json`; `tsc` emits it to `dist` (verified, plus a dist-level end-to-end smoke).
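
The timer-and-kill logic around the child can be sketched independently of `child_process`. In the sketch below the forked child is abstracted behind a hypothetical `KillableJob` handle whose `kill()` stands in for sending `SIGKILL`; the `ctx.signal` path is omitted for brevity, and a plain `Error` stands in for `queryDeadlineExceededError`.

```typescript
// Hypothetical handle over the forked child; kill() stands in for child.kill('SIGKILL').
interface KillableJob {
  kill(): void;
}

// Starts a job, arms the deadline, and settles exactly once: result, error, or timeout.
function runWithDeadline<T>(
  start: (onResult: (value: T) => void, onError: (err: Error) => void) => KillableJob,
  deadlineMs: number,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let settled = false;
    let job: KillableJob | undefined;

    const settle = (finish: () => void): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      job?.kill(); // one short-lived child per call: reap it on every exit path
      finish();
    };

    const timer = setTimeout(() => {
      // ktx would reject with queryDeadlineExceededError(deadlineMs) here.
      settle(() => reject(new Error(`query exceeded ${Math.round(deadlineMs / 1000)}s`)));
    }, deadlineMs);

    job = start(
      (value) => settle(() => resolve(value)),
      (err) => settle(() => reject(err)),
    );
    // If the job reported synchronously, the handle was not yet known to settle().
    if (settled) job.kill();
  });
}
```

The single `settle` guard matters because the deadline timer and a late child message can race; whichever fires first wins and the other becomes a no-op.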
|
||||
|
||||
### Remote connectors (server-side timeouts + own-signal mapping)
|
||||
|
||||
Each applies the resolved deadline server-side and re-wraps its own timeout signal
|
||||
as `queryDeadlineExceededError(deadlineMs, { cause })`:
|
||||
|
||||
- **BigQuery** — `jobTimeoutMs` on the query job; maps a "Job timed out" / timeout-reason error.
|
||||
- **Postgres** — `statement_timeout` via pool `options` (`-c statement_timeout=<ms>`); maps `57014`.
|
||||
- **MySQL** — `SET SESSION max_execution_time = <ms>` before the read; maps errno `3024`.
|
||||
- **Snowflake** — `ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = <ceil(s)>` in the pooled connection; maps code `604` / "reached its … timeout".
|
||||
- **ClickHouse** — `max_execution_time` (ceil seconds) setting, with `request_timeout` set to `deadline + 5s` so the HTTP client outlasts the server abort (replaces the old hardcoded 30s); maps code `159`.
|
||||
- **SQL Server** — `requestTimeout` on the `mssql` pool config (TDS attention cancels server-side); maps `ETIMEOUT`.
|
||||
|
||||
Each connector has a focused test asserting the timeout is applied and its signal
|
||||
maps to `KtxQueryError` (Postgres is the spec's required acceptance test).
|
||||
|
||||
### Best-effort callers (Requirement 8)
|
||||
|
||||
Confirmed already graceful: relationship **profiling** (outer try/catch →
|
||||
`profile_failed` warning) and **composite-candidate** detection
|
||||
(`detectCompositeRelationships` → recoverable warning, returns `[]`). Historic-SQL
|
||||
**probes** flow through `runHistoricSqlReadinessProbe`, which catches *any* error
|
||||
into `{ ok: false }`. **Added** handling to relationship **validation**: a
|
||||
`KtxQueryError` on the per-candidate coverage probe now sends that one candidate to
|
||||
`review` (`validation_query_failed`, logged via `ctx.logger.warn`) instead of
|
||||
aborting the whole validation pass. `ingest-query-executor.ts` is a generic
|
||||
executor port whose callers own recoverability — left unchanged.
|
||||
|
||||
### MCP surfacing/logging
|
||||
|
||||
No new MCP classification or logging code. The deadline `KtxQueryError` flows
|
||||
through the existing `local-project-ports` mapping → `reportException` (skips
|
||||
`$exception` for `KtxExpectedError`; existing test `telemetry/exception.test.ts`
|
||||
covers the skip for `KtxQueryError`) → `instrumentMcpServer`, which logs a matched
|
||||
`tool.start` → `tool.end(error, level 50)` pair carrying `err.message = "query
|
||||
exceeded {N}s"`. A test in `test/context/mcp/server.test.ts` asserts the matched
|
||||
pair, closing spec 15's "`tool.start` with no `tool.end` = hang" gap for this case.
|
||||
|
||||
### Pre-existing branch issues encountered (not part of this feature)
|
||||
|
||||
- `test/mcp-server-factory.test.ts` had a type error (an `as` cast to a shape with
|
||||
a fake `context_tool` key, introduced by branch commit `2677b3ef`) that broke
|
||||
`tsc -p tsconfig.test.json`. Fixed with a clean single cast to keep the
|
||||
type-check gate green; behavior unchanged.
|
||||
- `test/skills/analytics-skill-content.test.ts` fails (2 cases: missing
|
||||
`**Window functions**` heading and `Expose identity, not just the label` prose
|
||||
in `src/skills/analytics/SKILL.md`). This is unrelated analytics-skill (spec
|
||||
13/14) content drift committed earlier on the branch; **left untouched** — no
|
||||
skill files were modified by this feature.
|
||||
|
|
@ -1,418 +0,0 @@
|
|||
# BigQuery cross-project dataset introspection (foreign-hosted datasets, billed in own project)
|
||||
|
||||
> Refined spec. Intake draft: `todo/18-bigquery-cross-project-datasets.md`.
|
||||
>
|
||||
> **Scope: let the BigQuery connector introspect a dataset hosted in a *different*
|
||||
> project than the one it bills jobs to.** A `dataset_ids` entry may be written
|
||||
> fully-qualified as `project.dataset`; the connector introspects each entry in
|
||||
> *its own* project while every job still runs in `credentials.project_id`. A
|
||||
> bare `dataset` keeps today's single-project behavior unchanged.
|
||||
>
|
||||
> Out of scope (confirmed during refinement): the interactive `ktx setup` wizard
|
||||
> is **not** expected to *discover* foreign datasets — you cannot enumerate
|
||||
> datasets in a project you don't own, and the wizard doesn't know which foreign
|
||||
> projects to probe. Users hand-write `project.dataset` entries (in `ktx.yaml` or
|
||||
> at the dataset prompt); the connector must accept and introspect them. See
|
||||
> *Non-goals*.
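
The entry format described in the scope note could be parsed along these lines. This is a sketch under assumptions, not the connector's code: `parseDatasetRef` and `DatasetRef` are hypothetical names, and splitting on the last dot is a design guess that relies on dataset IDs not containing dots.

```typescript
// Hypothetical parsed form of one `dataset_ids` entry.
interface DatasetRef {
  projectId: string; // project that hosts the dataset (used for introspection and the table catalog)
  datasetId: string;
}

// A bare `dataset` resolves to the billing project; `project.dataset` keeps its own project.
function parseDatasetRef(entry: string, billingProjectId: string): DatasetRef {
  const trimmed = entry.trim();
  const lastDot = trimmed.lastIndexOf('.');
  if (lastDot === -1) {
    return { projectId: billingProjectId, datasetId: trimmed };
  }
  const projectId = trimmed.slice(0, lastDot);
  const datasetId = trimmed.slice(lastDot + 1);
  if (projectId === '' || datasetId === '') {
    throw new Error(`invalid dataset entry: "${entry}"`);
  }
  return { projectId, datasetId };
}
```

With billing project `ktx-spider2-lite`, the entry `austin_311` stays in the billing project, while `bigquery-public-data.austin_311` resolves to the foreign project for introspection; jobs would still be billed to `ktx-spider2-lite`.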
|
||||
|
||||
## Problem
|
||||
|
||||
**ktx**'s BigQuery connector derives a single `projectId` from
|
||||
`credentials.project_id` and uses it for **both** job billing **and** schema
|
||||
introspection. There is no way to introspect a dataset that lives in another
|
||||
project, even though *querying* such a dataset already works (a cross-project
|
||||
read in a `FROM` clause bills to the caller's project — that path is proven).
|
||||
|
||||
Confirmed in the current connector (`packages/cli/src/connectors/bigquery/connector.ts`):
|
||||
|
||||
- **`:294`** — `projectId` is read only from `credentials.project_id`. There is
|
||||
no separate billing-vs-dataset project. `bigQueryConnectionConfigFromConfig`
|
||||
(`:278`–`:301`) returns `datasetIds: string[]` — raw, unparsed.
|
||||
- **`datasetIds()` (`:163`)** — returns `dataset_ids` / `dataset_id` verbatim;
|
||||
it never parses a `project.` prefix.
|
||||
- **`introspectDataset` (`:544`)** — calls `this.getClient().dataset(datasetId)`,
|
||||
which resolves the dataset in the **client's (billing) project**, and labels
|
||||
every table `catalog: this.resolved.projectId` (`:566`, `:574`) — including the
|
||||
introspection-failure warning metadata (`:566`).
|
||||
- **`primaryKeys` (`:591`)** — builds `INFORMATION_SCHEMA` SQL as
|
||||
`` `<projectId>.<datasetId>.INFORMATION_SCHEMA.TABLE_CONSTRAINTS` `` using the
|
||||
**billing** project.
|
||||
- **`listTables` (`:453`)** — queries
|
||||
`` `<projectId>`.`region-<region>`.INFORMATION_SCHEMA.TABLES `` against the
|
||||
**billing** project and labels each row `catalog: this.resolved.projectId`.
|
||||
- **`testConnection` (`:344`)** — calls `client.dataset(datasetId).get()` in the
|
||||
billing project.
|
||||
|
||||
### Empirical confirmation (from the intake draft)
|
||||
|
||||
With a service account in project `ktx-spider2-lite`:
|
||||
|
||||
- ktx's call pattern `client.dataset("austin_311")` → **`404 NotFound`** (it looks
|
||||
in `projects/ktx-spider2-lite/datasets/austin_311`).
|
||||
- The cross-project form `dataset("austin_311", { projectId: "bigquery-public-data" })`
|
||||
→ **succeeds** (public metadata is readable by any authenticated principal).
|
||||
- There is **no config knob** to separate the introspection project from billing.
|
||||
|
||||
### Why the table `catalog` label is load-bearing, not cosmetic
|
||||
|
||||
The BigQuery dialect generates **three-part `catalog.db.name`** SQL
|
||||
(`connectors/bigquery/dialect.ts:38` → `formatDialectTableName(..., 'three-part')`;
|
||||
`context/connections/dialect-helpers.ts:27`–`32` emits `catalog.db.name`). The
|
||||
`catalog` stored on each scanned table is therefore the project that *every*
|
||||
later query targets — `sampleTable`, `sampleColumn`, `getColumnDistinctValues`,
|
||||
and ref-based `executeReadOnly` all format the ref through the dialect. If a
|
||||
foreign dataset's tables are labeled with the billing project, every one of those
|
||||
queries becomes `` `billing-project`.`austin_311`.`table` `` → `404`. So labeling
|
||||
the table `catalog` with the dataset's own project is a **correctness
|
||||
requirement**, and it is the single lever that makes sampling, dictionary value
|
||||
extraction, and `discover_data` all resolve once the snapshot is right.
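To make the failure concrete, here is a minimal sketch of three-part naming. `formatThreePart` is a simplified stand-in for the dialect helper, not its real signature, and the table name `some_table` is hypothetical.

```typescript
// Illustrative only: a simplified stand-in for the dialect's three-part formatter.
interface TableRef {
  catalog: string; // the project every later query will target
  db: string; // the dataset
  table: string; // the table
}

function formatThreePart(ref: TableRef): string {
  return [ref.catalog, ref.db, ref.table].map((part) => "`" + part + "`").join(".");
}

// Same dataset and table, labeled with the billing project vs. the hosting project.
const mislabeled: TableRef = {
  catalog: "ktx-spider2-lite",
  db: "austin_311",
  table: "some_table", // hypothetical table name
};
const correct: TableRef = { ...mislabeled, catalog: "bigquery-public-data" };

// The first statement targets a dataset that does not exist in the billing project.
const brokenSql = "SELECT * FROM " + formatThreePart(mislabeled) + " LIMIT 10";
const workingSql = "SELECT * FROM " + formatThreePart(correct) + " LIMIT 10";
```

Only the `catalog` field differs between the two refs, which is why fixing the label at scan time is enough for every downstream query builder.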
|
||||
|
||||
### One introspection path, no divergence
|
||||
|
||||
`connectors/bigquery/live-database-introspection.ts` wraps
|
||||
`KtxBigQueryScanConnector.introspect` directly, so the ingest and live-database
|
||||
paths share **one** introspection implementation. The SDK already supports the
|
||||
fix: `client.dataset(id, { projectId })` — `@google-cloud/bigquery@8.3.1`'s
|
||||
`DatasetOptions` exposes `projectId?: string`.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
Analysts routinely introspect datasets they can **read but do not own and do not
|
||||
bill to**: Google's `bigquery-public-data`, a partner's shared project, an
|
||||
organization's central data project that a smaller team queries from its own
|
||||
billing project. To make those connectable in **ktx** — so `discover_data`, the
|
||||
semantic layer, dictionary sampling, and `sql_dialect_notes` all work — the
|
||||
connector must introspect a foreign-hosted dataset while billing jobs in the
|
||||
credentials' own project. This is a standard BigQuery deployment shape and is
|
||||
wholly independent of any benchmark.
|
||||
|
||||
The class to design for is "the dataset's project ≠ the billing project," and it
|
||||
must generalize beyond one example: a single connection may reference datasets in
|
||||
**several** foreign projects at once (e.g. one slice mixing `bigquery-public-data`
|
||||
and `isb-cgc-bq`), and two different projects may host datasets with the **same
|
||||
name**. The design must keep those distinct.
|
||||
|
||||
## Design decisions (resolved during refinement)
|
||||
|
||||
These resolve ambiguities the intake draft left open. They constrain the
|
||||
implementer; the exact code is theirs.
|
||||
|
||||
### Carry the project inline on each dataset entry — no separate knob
|
||||
|
||||
The introspection project is expressed **per dataset**, inline, as the optional
|
||||
`project.` prefix on a `dataset_ids` / `dataset_id` entry. There is no new config
|
||||
field.
|
||||
|
||||
> Rejected alternative: a separate connection-level `dataset_project` (or
|
||||
> `introspection_project`) field. It is a speculative runtime knob (against the
|
||||
> repo's opinionated-defaults rule) and, more decisively, it **cannot express the
|
||||
> requirement**: one connection must span *multiple* foreign projects, which a
|
||||
> single global field cannot represent. The inline form also derives scope from
|
||||
> the user's own declared input rather than adding a parallel setting.
|
||||
|
||||
### Parse to canonical `{ project, dataset }` pairs at the config boundary
|
||||
|
||||
Each entry is parsed **once**, in `bigQueryConnectionConfigFromConfig` /
|
||||
`datasetIds()`, into a canonical pair: the project (when no prefix is present,
|
||||
default it to `credentials.project_id`) and the bare dataset id. Every
|
||||
introspection-side call site reads the resolved pair; nothing downstream re-parses
|
||||
a `project.dataset` string.
|
||||
|
||||
> Rejected alternative: keep `datasetIds: string[]` raw and split the prefix
|
||||
> lazily at each use site (`introspectDataset`, `primaryKeys`, `listTables`,
|
||||
> `testConnection`). That re-implements one rule in four places and is exactly the
|
||||
> drift trap the repo's single-source-of-truth rule warns about — a later fix
|
||||
> lands on one path and not another. Normalize at the boundary; carry the
|
||||
> canonical form downstream.
|
||||
|
||||
The internal resolved-config type (`KtxBigQueryResolvedConnectionConfig.datasetIds`)
|
||||
changes shape from `string[]` to a structured pair list. That is an internal type;
|
||||
the connector internals and the connector test fixtures are the only consumers.
|
||||
|
||||
### Parsing rule (at the boundary)
|
||||
|
||||
- An entry contains **at most one `.`**.
|
||||
- With a dot: the segment **before** the dot is the project, validated by the
|
||||
existing `normalizeBigQueryProjectId` charset
|
||||
(`context/connections/bigquery-identifiers.ts`); the segment **after** is the
|
||||
dataset id (validated as a normal identifier).
|
||||
- Without a dot: a bare dataset; the project defaults to `credentials.project_id`
|
||||
(today's behavior).
|
||||
- **More than one `.`** (e.g. a stray `proj.ds.table`) is a clear config error
|
||||
raised at resolution time, naming the connection — not a silent
|
||||
mis-introspection.
|
||||
- Legacy domain-scoped project ids that contain `:` (e.g. `example.com:proj`) stay
|
||||
**out of scope**, consistent with `normalizeBigQueryProjectId`'s current charset
|
||||
(which already rejects `.` and `:` in a project id).
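The rule above can be sketched as a single boundary function. This is an illustration under stated assumptions, not the connector's code: `parseDatasetEntry`, the `DatasetRef` shape, and the project charset are hypothetical, and the real project validator is `normalizeBigQueryProjectId`.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Assumed charsets for illustration; the real validators live in
// context/connections/bigquery-identifiers.ts and may differ.
const PROJECT_SEGMENT = /^[A-Za-z0-9_-]+$/;
const DATASET_SEGMENT = /^[A-Za-z0-9_]+$/;

function parseDatasetEntry(
  entry: string,
  defaultProject: string,
  connectionId: string,
): DatasetRef {
  const where = "connections." + connectionId + '.dataset_ids entry "' + entry + '"';
  const segments = entry.trim().split(".");
  // More than one dot (e.g. a stray proj.ds.table) is a config error.
  if (segments.length > 2) {
    throw new Error(where + ": expected `dataset` or `project.dataset`");
  }
  const [first = "", second] = segments;
  // No dot: bare dataset in the default (billing) project.
  const project = second === undefined ? defaultProject : first;
  const dataset = second === undefined ? first : second;
  // Empty segments fail these checks too, because both patterns require one character.
  if (!PROJECT_SEGMENT.test(project)) {
    throw new Error(where + ": invalid or empty project segment");
  }
  if (!DATASET_SEGMENT.test(dataset)) {
    throw new Error(where + ": invalid or empty dataset segment");
  }
  return { project, dataset };
}
```

Under these assumptions, `parseDatasetEntry("austin_311", "ktx-spider2-lite", "bq")` resolves to the billing project, while `"bigquery-public-data.austin_311"` resolves to the foreign one, and `"proj.ds.table"`, `"proj."`, and `".ds"` all throw with the connection id in the message.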
|
||||
|
||||
### Billing is never the dataset's project
|
||||
|
||||
The BigQuery client is still constructed with `projectId = credentials.project_id`
|
||||
(`getClient()`, `:487`–`:495`), and `createQueryJob` always bills there. Only the
|
||||
*introspection* surfaces switch to the per-dataset project. Cross-project reads in
|
||||
a `FROM` clause already bill to the caller — unchanged and already proven.
|
||||
|
||||
### Dataset identity downstream is `(catalog, db)`
|
||||
|
||||
Scanned tables are keyed by `(catalog, db, name)` throughout
|
||||
(`context/scan/table-ref.ts`; `context/scan/warehouse-catalog.ts:107`). Because
|
||||
the table `catalog` now holds the dataset's own project, two foreign projects that
|
||||
each host an `austin_311` dataset remain distinct with no extra work, provided the
|
||||
snapshot's `scope` / `metadata` also preserve the project (Requirement 6).
|
||||
|
||||
### Setup-wizard scope: accept, don't discover
|
||||
|
||||
The connector's region-scoped `listTables` (`:453`) is consumed **only** by the
|
||||
`ktx setup` wizard's table-selection step (`setup-databases.ts`); the
|
||||
ingest / `discover_data` path reads persisted snapshot JSON via
|
||||
`WarehouseCatalogService.listTables`, not the connector method. The wizard is not
|
||||
expected to enumerate foreign datasets (you can't list a project you don't own).
|
||||
A `project.dataset` value hand-entered at the dataset prompt, or written into
|
||||
`ktx.yaml`, must be accepted, validated, and introspected. See *Non-goals* for the
|
||||
region caveat that follows from this.
|
||||
|
||||
## Requirements
|
||||
|
||||
### R1 — Accept and parse `project.dataset` at the config boundary
|
||||
|
||||
`datasetIds()` / `bigQueryConnectionConfigFromConfig` resolve each
|
||||
`dataset_ids` and `dataset_id` entry into a canonical `{ project, dataset }` pair
|
||||
per the parsing rule above, defaulting `project` to `credentials.project_id` when
|
||||
unprefixed. A malformed entry (more than one `.`, an empty project or dataset
|
||||
segment, or a project/dataset that fails identifier validation) raises a clear
|
||||
error at resolution time that names the connection id.
|
||||
|
||||
### R2 — Introspect each dataset in its own project
|
||||
|
||||
`introspectDataset` resolves the dataset via the **dataset's** project —
|
||||
`client.dataset(datasetId, { projectId })` — for `getTables()` and each
|
||||
`tableRef.get()`. This requires extending the `KtxBigQueryClient.dataset` port to
|
||||
accept the project (e.g. `dataset(id, projectId)` / `dataset(id, { projectId })`)
|
||||
and forwarding it from `DefaultBigQueryClientFactory`.
|
||||
|
||||
### R3 — Label table `catalog` with the dataset's project
|
||||
|
||||
Every table produced by `introspectDataset` is labeled `catalog: <dataset's
|
||||
project>` (not the billing project), and the introspection-failure warning
|
||||
metadata (`object` / `catalog`) likewise reflects the dataset's project. This is
|
||||
what makes downstream sample/distinct-value/read queries resolve.
|
||||
|
||||
### R4 — Primary-key discovery targets the dataset's project
|
||||
|
||||
The `primaryKeys` `INFORMATION_SCHEMA.TABLE_CONSTRAINTS` /
|
||||
`KEY_COLUMN_USAGE` SQL is built against
|
||||
`` `<dataset's project>.<datasetId>.INFORMATION_SCHEMA…` ``. (This INFORMATION_SCHEMA
|
||||
view is dataset-qualified and therefore region-independent.) Its existing
|
||||
soft-fail-on-denied behavior (`tryConstraintQuery`, scan warning) is preserved.
|
||||
|
||||
### R5 — `listTables` lists each dataset in its own project
|
||||
|
||||
`listTables` returns rows labeled `catalog: <that dataset's project>` and queries
|
||||
each referenced project's region `INFORMATION_SCHEMA.TABLES`. Because a connection
|
||||
can now span projects, it queries per distinct project rather than assuming one.
|
||||
(This is the setup-wizard surface — see the cross-region caveat in *Non-goals*.)
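One way to read "per distinct project" is a group-then-query step, sketched below. The function names and the selected columns are assumptions for illustration; only the `` `<project>`.`region-<region>`.INFORMATION_SCHEMA.TABLES `` shape comes from the current connector.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Group bare dataset ids by the project that hosts them, preserving first-seen order.
function groupByProject(refs: DatasetRef[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const ref of refs) {
    const datasets = groups.get(ref.project) ?? [];
    datasets.push(ref.dataset);
    groups.set(ref.project, datasets);
  }
  return groups;
}

// Build one region-scoped query per distinct project.
// Assumes ids were already validated at the config boundary, so inlining them is safe here.
function regionTableQueries(refs: DatasetRef[], region: string): string[] {
  const queries: string[] = [];
  groupByProject(refs).forEach((datasets, project) => {
    const inList = datasets.map((d) => "'" + d + "'").join(", ");
    queries.push(
      "SELECT table_schema, table_name FROM `" + project + "`.`region-" + region +
        "`.INFORMATION_SCHEMA.TABLES WHERE table_schema IN (" + inList + ")",
    );
  });
  return queries;
}
```

Rows returned by each query would then be labeled with the project that query targeted. A connection with only bare entries collapses to a single group, so it still issues one billing-project query.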
|
||||
|
||||
### R6 — Snapshot scope and metadata reflect multiple projects
|
||||
|
||||
`introspect`'s returned snapshot keeps `metadata.project_id` = the **billing**
|
||||
project, but `scope.catalogs` becomes the **distinct set of dataset projects**
|
||||
actually introspected. `scope.datasets` / `metadata.datasets` must stay
|
||||
unambiguous when two projects share a dataset name (e.g. carry the qualified
|
||||
`project.dataset`, or otherwise preserve the project). The scoped table-name
|
||||
lookup that today passes `catalog: this.resolved.projectId` (`:359`) must pass
|
||||
each dataset's own project so `tableScope` / `enabled_tables` filtering still
|
||||
matches.
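A minimal sketch of one scope shape that satisfies R6, assuming the canonical pairs from R1. `datasetLabel` and `snapshotScope` are hypothetical names, and rendering bare labels inside the billing project is just one of the options the requirement allows.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Hypothetical helper: bare inside the billing project, qualified otherwise,
// so single-project snapshots keep their current labels.
function datasetLabel(ref: DatasetRef, billingProject: string): string {
  return ref.project === billingProject ? ref.dataset : ref.project + "." + ref.dataset;
}

function snapshotScope(
  refs: DatasetRef[],
  billingProject: string,
): { catalogs: string[]; datasets: string[] } {
  return {
    // Distinct dataset projects, in first-seen order.
    catalogs: Array.from(new Set(refs.map((ref) => ref.project))),
    datasets: refs.map((ref) => datasetLabel(ref, billingProject)),
  };
}
```

With `proj-a.shared` and `proj-b.shared` in one connection, `datasets` keeps two distinct labels even though the bare dataset name is identical.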
|
||||
|
||||
### R7 — `testConnection` resolves foreign datasets
|
||||
|
||||
`testConnection` validates each configured dataset via its own project
|
||||
(`client.dataset(datasetId, { projectId }).get()`), so a connection pointing only
|
||||
at foreign datasets reports success rather than a spurious `404`.
|
||||
|
||||
### R8 — Billing unchanged; bare dataset is a strict no-op
|
||||
|
||||
`createQueryJob` continues to bill in `credentials.project_id`. A connection whose
|
||||
`dataset_ids` are all bare (no `project.` prefix) behaves **exactly** as before:
|
||||
same resolved project, same `catalog` labels, same INFORMATION_SCHEMA targets, no
|
||||
behavioral change.
|
||||
|
||||
### R9 — `getTableRowCount` honors the parsed entry
|
||||
|
||||
`getTableRowCount`'s default-dataset handling (`:431`, today
|
||||
`this.resolved.datasetIds[0]`) resolves through the canonical pair so a foreign
|
||||
default dataset is introspected in its own project.
|
||||
|
||||
### R10 — Docs reflect the qualified form
|
||||
|
||||
Document that a BigQuery `dataset_ids` / `dataset_id` entry may be written
|
||||
`project.dataset` to introspect a dataset hosted in another project (billing stays
|
||||
in `credentials.project_id`). Update the BigQuery rows/examples in
|
||||
`docs-site/content/docs/configuration/ktx-yaml.mdx` and
|
||||
`docs-site/content/docs/integrations/primary-sources.mdx` (and the dataset-scope
|
||||
note in `docs-site/content/docs/cli-reference/ktx-setup.mdx`). Keep examples
|
||||
copy-pasteable and follow the `fumadocs-mdx-structure` skill.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
1. **Foreign single-project introspection.** With credentials in project
|
||||
`ktx-spider2-lite` and `dataset_ids: ['bigquery-public-data.austin_311']`,
|
||||
`ktx ingest <conn>` introspects the tables, enriches, and samples values;
|
||||
`discover_data` / `dictionary_search` return them. Tables are labeled
|
||||
`catalog: 'bigquery-public-data'`.
|
||||
2. **Multi-project connection.** `dataset_ids: ['bigquery-public-data.x',
|
||||
'other-project.y']` introspects **both**, each under its own project; the
|
||||
snapshot's `scope.catalogs` contains both projects.
|
||||
3. **Cross-project query still bills locally.** `sql_execution` of a
|
||||
fully-qualified `project.dataset.table` query runs and bills in
|
||||
`credentials.project_id`.
|
||||
4. **Same dataset name, two projects.** `['proj-a.shared', 'proj-b.shared']`
|
||||
yields two distinct dataset groups; tables do not collide.
|
||||
5. **No regression.** `dataset_ids: ['my_dataset']` (or singular `dataset_id`)
|
||||
behaves exactly as before — resolved under `credentials.project_id`, same
|
||||
`catalog` labels and INFORMATION_SCHEMA targets.
|
||||
6. **Malformed entry fails clearly.** `dataset_ids: ['proj.ds.table']` (or an
|
||||
empty segment) raises a config error naming the connection, not a `404` at
|
||||
scan time.
|
||||
7. **Test coverage** (extend `packages/cli/test/connectors/bigquery/connector.test.ts`,
|
||||
using the existing fake `clientFactory` harness):
|
||||
- the fake `dataset()` is called with the dataset's project for a prefixed
|
||||
entry, and with the billing project for a bare entry;
|
||||
- a prefixed entry yields tables with `catalog: '<dataset project>'`;
|
||||
- a mixed two-project `dataset_ids` introspects both;
|
||||
- `bigQueryConnectionConfigFromConfig` rejects a multi-dot / empty-segment
|
||||
entry;
|
||||
- the existing single-project tests still pass unchanged.
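The first two test bullets can be sketched with a recording fake. Everything here is hypothetical scaffolding (`DatasetPort`, `makeRecordingClient`, `introspectAll`); the real harness is the existing fake `clientFactory` in the connector test suite.

```typescript
interface DatasetRef {
  project: string;
  dataset: string;
}

// Hypothetical, minimal version of the dataset port after R2.
interface DatasetPort {
  dataset(datasetId: string, projectId: string): { qualifiedId: string };
}

// A fake that records every (datasetId, projectId) pair it is asked for.
function makeRecordingClient(): { client: DatasetPort; calls: Array<[string, string]> } {
  const calls: Array<[string, string]> = [];
  const client: DatasetPort = {
    dataset(datasetId: string, projectId: string) {
      calls.push([datasetId, projectId]);
      return { qualifiedId: projectId + "." + datasetId };
    },
  };
  return { client, calls };
}

// Stand-in for the introspection loop: each dataset is resolved in its own project.
function introspectAll(client: DatasetPort, refs: DatasetRef[]): string[] {
  return refs.map((ref) => client.dataset(ref.dataset, ref.project).qualifiedId);
}
```

A test would then assert on `calls`: the prefixed entry must arrive with the foreign project and the bare entry with the billing project.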
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Foreign-dataset discovery in the setup wizard.** The wizard does not
|
||||
enumerate datasets in projects the credentials don't own; users supply
|
||||
`project.dataset` explicitly (see *Setup-wizard scope: accept, don't discover*).
|
||||
- **Cross-region `listTables`.** `listTables`' region-scoped
|
||||
`region-<location>.INFORMATION_SCHEMA.TABLES` query uses the connection-level
|
||||
`location`; a foreign dataset in a *different* region than the connection's
|
||||
`location` will not be listed by that wizard-facing query. This does **not**
|
||||
affect ingest/`discover_data`, whose introspection path
|
||||
(`introspectDataset` REST metadata + dataset-qualified PK INFORMATION_SCHEMA) is
|
||||
region-independent. A per-dataset region knob is a separate spec if ever needed.
|
||||
- **Domain-scoped legacy project ids** containing `:` (e.g. `example.com:proj`),
|
||||
already unsupported by `normalizeBigQueryProjectId`.
|
||||
- **A separate billing/introspection config field** — explicitly rejected above.
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Pointers from exploration; line numbers may have drifted, and the implementer owns
|
||||
the design.
|
||||
|
||||
- `packages/cli/src/connectors/bigquery/connector.ts`
|
||||
- `datasetIds()` (`:163`) and `bigQueryConnectionConfigFromConfig` (`:278`) —
|
||||
parse + canonicalize (R1); change `KtxBigQueryResolvedConnectionConfig.datasetIds`
|
||||
shape.
|
||||
- `KtxBigQueryClient.dataset` port (`:100`–`:110`) and
|
||||
`DefaultBigQueryClientFactory.dataset` (`:130`–`:135`) — thread `projectId`
|
||||
(R2). `getClient()` (`:487`) keeps the billing project (R8).
|
||||
- `introspectDataset` (`:544`) — `dataset(id, { projectId })`, table `catalog`
|
||||
+ warning metadata (R2, R3).
|
||||
- `primaryKeys` (`:591`) — dataset-qualified INFORMATION_SCHEMA (R4).
|
||||
- `listTables` (`:453`) — per-project region INFORMATION_SCHEMA + row catalog
|
||||
(R5).
|
||||
- `introspect` (`:352`) — `scope.catalogs`, `scope.datasets`, scoped-name lookup
|
||||
(`:359`) (R6).
|
||||
- `testConnection` (`:339`) (R7); `getTableRowCount` (`:431`) (R9).
|
||||
- `packages/cli/src/connectors/bigquery/live-database-introspection.ts` — wraps
|
||||
`introspect`; no separate change needed (it inherits the fix).
|
||||
- `packages/cli/src/context/connections/bigquery-identifiers.ts` —
|
||||
`normalizeBigQueryProjectId` is the project-segment validator.
|
||||
- `packages/cli/src/context/connections/dialect-helpers.ts` /
|
||||
`connectors/bigquery/dialect.ts` — three-part naming; no change, but this is
|
||||
*why* R3 matters.
|
||||
- After implementing, rebuild and re-link so the playground picks it up:
|
||||
`pnpm run build && pnpm run link:dev`. Run
|
||||
`pnpm --filter @kaelio/ktx run type-check` and the connector test suite.
|
||||
|
||||
## Benchmark context (motivation, not a requirement — do not encode benchmark specifics)
|
||||
|
||||
Spider 2.0-Lite's **BigQuery slice (~205 questions)** cannot otherwise be served
|
||||
faithfully: every one of its ~74 logical databases groups datasets hosted in
|
||||
foreign public projects (`bigquery-public-data`, `isb-cgc-bq`,
|
||||
`data-to-insights`, …), never in a project we own. Query execution already works
|
||||
cross-project; ktx-only *discovery* is the sole blocker, and it is blocked exactly
|
||||
because the connector can't introspect a foreign-hosted dataset. Of 74 BQ
|
||||
databases only **one** spans more than one source project, so "let `dataset_ids`
|
||||
carry `project.dataset` and introspect each in its own project" covers the
|
||||
benchmark and the general case alike. None of these project names belong in the
|
||||
code — they are derived from the user's own `dataset_ids` input.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Implemented on branch `write-feature-spec-wiki`. The whole change is contained in
|
||||
the BigQuery connector, its identifier helpers, the connector test suite, and three
|
||||
docs pages.
|
||||
|
||||
**Config boundary (R1).** Added `normalizeBigQueryDatasetId`
|
||||
(`packages/cli/src/context/connections/bigquery-identifiers.ts`, charset
|
||||
`[A-Za-z0-9_]`) next to the existing project/region validators. In
|
||||
`connectors/bigquery/connector.ts`, a single `parseBigQueryDatasetEntry(entry,
|
||||
defaultProject, connectionId)` parses one entry by splitting on `.`: zero dots →
|
||||
bare dataset in `defaultProject`; one dot → `project.dataset` (each segment
|
||||
validated; empty segment throws); two or more dots → throws. `resolveDatasetRefs`
|
||||
resolves `env:`/`file:` references first, trims/filters empties, then parses each.
|
||||
`bigQueryConnectionConfigFromConfig` calls it with the billing `project_id` as the
|
||||
default, so the canonical pair list is produced once at the boundary.
|
||||
`KtxBigQueryResolvedConnectionConfig.datasetIds` changed from `string[]` to the new
|
||||
`BigQueryDatasetRef[]` (`{ project, dataset }`). All errors name
|
||||
`connections.<id>.dataset_ids entry "<entry>"`.
|
||||
|
||||
**Client port (R2).** `KtxBigQueryClient.dataset` now takes
|
||||
`(datasetId, projectId)`; `DefaultBigQueryClientFactory` forwards
|
||||
`client.dataset(datasetId, { projectId })` (`@google-cloud/bigquery` `DatasetOptions.projectId`).
|
||||
`getClient()` still constructs the client with the **billing** `project_id`, so
|
||||
`createQueryJob` bills locally regardless of the dataset's project (R8, acceptance 3).
|
||||
|
||||
**Per-dataset introspection (R3–R7, R9).** Every introspection site reads the
|
||||
resolved pair: `introspectDataset(ref, …)` resolves `dataset(ref.dataset, ref.project)`
|
||||
and labels tables (and the introspection-failure warning, via `tryIntrospectObject`'s
|
||||
`catalog.db.object`) with `ref.project`; `primaryKeys(ref)` builds dataset-qualified
|
||||
`` `<project>.<dataset>.INFORMATION_SCHEMA…` `` SQL; `testConnection` validates each
|
||||
dataset under its own project; `getTableRowCount`'s default resolves through the first
|
||||
pair. `introspect` sets `scope.catalogs` to the distinct set of dataset projects and
|
||||
keeps `metadata.project_id` = billing. `scope.datasets` / `metadata.datasets` use a
|
||||
`qualifiedDatasetLabel` helper — bare in the billing project (so the single-project
|
||||
snapshot is byte-for-byte unchanged), `project.dataset` otherwise (so two projects with
|
||||
the same dataset name stay distinct, R6/acceptance 4).
|
||||
|
||||
**`listTables` (R5).** Split into `listTables` (parse override entries, group by
|
||||
project) and `listTablesInProject(project, region, datasets?)`. With no override it
|
||||
lists the billing project's region (unchanged); with an override it runs one
|
||||
region-`INFORMATION_SCHEMA.TABLES` query per distinct project, filtered to that
|
||||
project's bare datasets, and labels rows with that project. The existing single-region
|
||||
test is unchanged (bare entries collapse to one billing-project query).
|
||||
|
||||
**Docs (R10).** Added a "Cross-project datasets" subsection to
|
||||
`integrations/primary-sources.mdx` (qualified-entry example + the setup/region caveats),
|
||||
plus pointers from `configuration/ktx-yaml.mdx` and `cli-reference/ktx-setup.mdx`.
|
||||
|
||||
**Tests.** Extended `test/connectors/bigquery/connector.test.ts`: parse-to-pairs and
|
||||
malformed-entry rejection (`proj.ds.table`, `proj.`, `.ds`); a foreign-only connection
|
||||
calls `dataset('austin_311', 'bigquery-public-data')`, labels tables
|
||||
`catalog: 'bigquery-public-data'`, builds the client with the billing project, and keeps
|
||||
`metadata.project_id` local; a mixed `['bigquery-public-data.austin_311', 'analytics']`
|
||||
connection introspects both under their own projects; and `['proj_a.shared',
|
||||
'proj_b.shared']` stays distinct. The internal `datasetIds`-shape assertion was updated
|
||||
to the pair list; all pre-existing behavioral tests pass unchanged.
|
||||
|
||||
**Verification.** `pnpm --filter @kaelio/ktx run type-check`, the connector suite
|
||||
(18 tests), `test/setup-databases.test.ts` + `bigquery-identifiers.test.ts`,
|
||||
`pnpm run build`, `pnpm run dead-code` (Biome + Knip default + production),
|
||||
`pnpm run link:dev` (`ktx-dev` → 0.12.0), and `pre-commit` on the changed files all
|
||||
pass. Acceptance criteria 1, 2, and 4 are exercised by unit tests with the fake client factory;
|
||||
criteria 5–6 by unit tests; criterion 3 (cross-project query bills locally) is
|
||||
structurally guaranteed (single billing client) and asserted via the `createClient`
|
||||
project. End-to-end ingest against live `bigquery-public-data` was not run here (no live
|
||||
credentials in this worktree); the `link:dev` binary is ready for the playground agent to
|
||||
validate.
|
||||
|
||||
**No deviations from the spec design.** The only judgment call: `scope.datasets`
|
||||
renders bare-in-billing / qualified-otherwise rather than always-qualified, chosen to
|
||||
satisfy both the no-regression requirement (R8/acceptance 5) and the disambiguation
|
||||
requirement (R6/acceptance 4) with one unambiguous, dot-delimited form.
|
||||
|
|
@ -1,471 +0,0 @@
|
|||
# Durable, resumable, bounded relationship detection during ingest enrichment
|
||||
|
||||
> Refined spec. Intake draft: `todo/19-durable-bounded-relationship-detection.md`.
|
||||
>
|
||||
> **Scope: make the expensive part of ingest enrichment survive an interrupted
|
||||
> relationship stage.** Today the paid LLM descriptions + embeddings only become
|
||||
> durable and queryable after the slowest, most-killable, least-valuable stage
|
||||
> (relationship detection) also finishes. This spec moves the persistence boundary
|
||||
> to the cost boundary, makes stage resume work across runs, and bounds + observes
|
||||
> the one open-ended stage — the durability companion to spec 16 (bounded query
|
||||
> execution), which this spec composes with rather than replaces.
|
||||
|
||||
## Problem
|
||||
|
||||
Three compounding failure modes, all confirmed in the current code, share one root
|
||||
cause: **the three enrichment stages are treated as a single atomic unit for
|
||||
persistence, identity, and bounding, even though they differ radically in cost,
|
||||
durability value, runtime, and likelihood of being killed.**
|
||||
|
||||
`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:472`) runs three stages
|
||||
in a fixed order through `runEnrichmentStage` (`:413`):
|
||||
|
||||
| stage | order | cost | durability value | runtime on a large schema | likely to be killed |
|-------|-------|------|------------------|---------------------------|---------------------|
| `descriptions` (`:524`) | 1st | high — one paid LLM call per table | high | minutes | low |
| `embeddings` (`:553`) | 2nd | medium | high | seconds–minutes | low |
| `relationships` (`:587`) | 3rd | low — best-effort joins | low | **minutes, silent** | **high** |
|
||||
|
||||
The slowest, most-killable, least-valuable stage runs **last**, and it gates the
|
||||
durability of the two expensive stages held in memory before it.
|
||||
|
||||
### 1. Enrichment is lost if relationship detection is interrupted
|
||||
|
||||
The queryable artifact agents search and execute against is the `_schema` manifest
|
||||
YAML (`semantic-layer/<connectionId>/_schema/*.yaml`). It is written **twice**:
|
||||
|
||||
- bare (native column comments only) early, at `local-scan.ts:473`
|
||||
(`writeLocalScanManifestShards`), before enrichment runs; and
|
||||
- rewritten **with AI descriptions + accepted joins** by
|
||||
`writeLocalScanEnrichmentArtifacts` (`local-enrichment-artifacts.ts:310`), called
|
||||
from `local-scan.ts:510` **after** `runLocalScanEnrichment` returns — i.e. after
|
||||
all three stages.
|
||||
|
||||
So the descriptions and embeddings reach the queryable layer only via that single
|
||||
terminal write. If the process is killed, crashes, or times out **during** the
|
||||
`relationships` stage, `runLocalScanEnrichment` never returns, the terminal write
|
||||
never runs, and the in-memory descriptions + embeddings are discarded — the
|
||||
`_schema` retains only the bare native comments from the `:473` write.
|
||||
|
||||
Empirically (intake draft): ingesting a 95-table BigQuery dataset produced full
|
||||
descriptions + embeddings (progress reached "Building embeddings 17/17"), then the
|
||||
relationship stage ran silently past a supervising deadline and was killed; the
|
||||
persisted `_schema` had **0** AI descriptions. The most expensive work is the most
|
||||
likely to be thrown away.
|
||||
|
||||
> A stage-state store (below) does save each completed stage's output to an
|
||||
> internal SQLite cache as the stage finishes, so the descriptions survive in
|
||||
> the *resume cache*. They are simply never **promoted** to the queryable `_schema`
|
||||
> until the terminal write. The data survives somewhere the agent cannot query, and
|
||||
> (per failure mode 2) cannot be reused on the next run either.
|
||||
|
||||
### 2. Re-running does not resume — it re-spends
|
||||
|
||||
`runEnrichmentStage` resolves a completed stage with
|
||||
`findCompletedStage({ runId, stage, inputHash })` (`local-enrichment.ts:427`), and
|
||||
the store keys on **`runId`**: `SqliteLocalScanEnrichmentStateStore` declares
|
||||
`PRIMARY KEY (run_id, stage)` and filters lookups by `run_id`
|
||||
(`sqlite-local-enrichment-state-store.ts:83,91–115`). `runId` is minted fresh per
|
||||
ingest invocation (`record.runId`). The cache therefore only resolves *within* one
|
||||
run; re-running an interrupted ingest gets a new `runId`, misses every cached
|
||||
stage, and **recomputes descriptions + embeddings from scratch** — re-paying for
|
||||
LLM work that already succeeded.
|
||||
|
||||
The store already computes and persists `inputHash` next to `runId` —
|
||||
a stable `sha256` of `{ snapshot, mode, detectRelationships, providerIdentity,
|
||||
relationshipSettings }` (`enrichment-state.ts:78`). The correct content key is
|
||||
already on the row; the lookup just uses the volatile column. This is a keying
|
||||
defect, not a missing capability.
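The keying defect can be reproduced with an in-memory stand-in for the store. This is a sketch, not the SQLite implementation: `InMemoryStageStore` and both lookup names are hypothetical, and the content-keyed lookup is only the direction the paragraph above implies.

```typescript
interface StageRow {
  runId: string;
  stage: string;
  inputHash: string;
  output: string;
}

// In-memory stand-in for the stage-state store described above.
class InMemoryStageStore {
  private readonly rows: StageRow[] = [];

  save(row: StageRow): void {
    this.rows.push(row);
  }

  // Run-scoped lookup: misses as soon as the ingest is re-invoked with a fresh runId.
  findByRun(runId: string, stage: string, inputHash: string): StageRow | undefined {
    return this.rows.find(
      (row) => row.runId === runId && row.stage === stage && row.inputHash === inputHash,
    );
  }

  // Content-keyed lookup: identical inputs resolve across runs.
  findByContent(stage: string, inputHash: string): StageRow | undefined {
    return this.rows.find((row) => row.stage === stage && row.inputHash === inputHash);
  }
}
```

Saving a `descriptions` row under `run-1` and then looking it up from `run-2` returns `undefined` with the run-scoped lookup but finds the row with the content-keyed one, which is the difference between re-spending and resuming.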
|
||||
|
||||
### 3. Relationship detection is unobservable and unbounded
|
||||
|
||||
`discoverKtxRelationships` (`context/scan/relationship-discovery.ts:218`) profiles a
|
||||
row sample of **every enabled table** (`profileKtxRelationshipSchema`,
|
||||
`relationship-profiling.ts:320` — one sampled query per table at
|
||||
`profileConcurrency`, default 4), validates candidate joins
|
||||
(`relationship-validation.ts:237` — one coverage query per candidate), and detects
|
||||
composite keys (`relationship-composite-candidates.ts:515` — per-table plus
|
||||
cross-table queries). None of the controls the rest of the scan pipeline relies on
|
||||
were ever wired into this stack:
|
||||
|
||||
- **No progress.** `discoverKtxRelationships` does not accept a progress port; the
|
||||
caller can only emit start/end around it (`local-enrichment.ts:600,611` —
|
||||
`update(0, 'Detecting relationships')` … `update(1, 'found N')`). Minutes of
|
||||
silence between.
|
||||
- **No honored cancellation.** `KtxScanContext.signal` exists on the contract
|
||||
(`types.ts`) but **no sub-stage reads it**.
|
||||
- **No time budget.** Validation has a *count* budget (`validationBudget`, default
|
||||
`min(2 × tableCount, 1000)`); profiling and composite detection have none. On a
|
||||
schema with hundreds–thousands of tables, profiling is O(tables) silent queries
|
||||
with no internal stop condition.
|
||||
|
||||
A supervisor watching for liveness cannot tell a slow-but-working profile from a
|
||||
true hang, and nothing inside the stage will voluntarily stop — so on a very large
|
||||
schema it runs far past any reasonable deadline and is killed (which, via failure
|
||||
mode 1, takes the descriptions with it).
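For contrast, a loop with all three missing controls might look like the sketch below. It is synchronous and uses an injected clock purely to stay self-contained; the real profiling is asynchronous and concurrent, and every name here (`profileBounded`, `CancelSignal`, `BoundedOptions`) is hypothetical.

```typescript
// Minimal cancellation shape, structurally compatible with an AbortSignal's `aborted` flag.
interface CancelSignal {
  readonly aborted: boolean;
}

type StopReason = "done" | "cancelled" | "budget";

interface BoundedOptions {
  signal: CancelSignal;
  budgetMs: number;
  now: () => number; // injected clock, e.g. Date.now
  onProgress: (done: number, total: number) => void;
}

function profileBounded<T>(
  tables: string[],
  profileOne: (table: string) => T,
  opts: BoundedOptions,
): { results: T[]; stoppedBecause: StopReason } {
  const startedAt = opts.now();
  const results: T[] = [];
  for (const table of tables) {
    // Honor cancellation before starting more work.
    if (opts.signal.aborted) return { results, stoppedBecause: "cancelled" };
    // Stop voluntarily once the time budget is spent, keeping partial results.
    if (opts.now() - startedAt >= opts.budgetMs) return { results, stoppedBecause: "budget" };
    results.push(profileOne(table));
    // Per-item progress lets a supervisor tell slow from hung.
    opts.onProgress(results.length, tables.length);
  }
  return { results, stoppedBecause: "done" };
}
```

The return value keeps whatever was profiled before the stop, so a truncated run degrades to partial results instead of none.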
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
Any context layer that enriches a real warehouse with paid LLM work must make that
|
||||
work durable the instant it is produced, resume it across process restarts without
|
||||
re-paying, and bound the open-ended profiling stage so a large catalog cannot hang
|
||||
ingest indefinitely. A data team ingesting a 500-table production warehouse over a
|
||||
flaky connection, a rate-limited LLM budget, or a CI step with a wall-clock limit
|
||||
hits all three failure modes regardless of any benchmark. This is general
|
||||
durability and cost hygiene for the ingest pipeline; the benchmark only made it
|
||||
acute at scale.
|
||||
|
||||
## Design decisions (resolved during refinement)
|
||||
|
||||
These resolve ambiguities the intake draft left open. They constrain the
|
||||
implementer; the exact code is theirs (requirement-level, per the specs README).
|
||||
|
||||
### D1 — Checkpoint queryable artifacts at the cost boundary, before relationships
|
||||
|
||||
As soon as the last non-relationship stage completes — `embeddings` when an
|
||||
embedding provider is configured, otherwise `descriptions` — persist the
|
||||
descriptions + embeddings into the **queryable** `_schema` manifest (and the raw
|
||||
`descriptions.json` / `embeddings.json` enrichment artifacts), **before** the
|
||||
`relationships` stage runs. The relationship stage then writes its joins on top: the
|
||||
manifest builder already re-reads and preserves existing descriptions and
|
||||
manual/inferred joins on rewrite (`loadExistingManifestState`,
|
||||
`local-enrichment-artifacts.ts:196`), so the second write is additive, not
|
||||
destructive.
|
||||
|
||||
Net invariant: **the descriptions + embeddings are always durable and queryable the
|
||||
moment they are computed**, even if relationship detection then fails, is
|
||||
interrupted, is budget-truncated, or is skipped. A failed/partial/skipped
|
||||
relationship stage degrades to "no joins" or "partial joins" — **never** to "no
|
||||
descriptions." This is the inverse guarantee the current terminal-write ordering
|
||||
violates.
|
||||
|
||||
The bare `:473` manifest write stays — it is the queryable schema for the
|
||||
no-providers / enrichment-disabled path. The checkpoint is an additional write that
|
||||
runs only when enrichment produced descriptions.
|
||||
|
||||
> Orientation (the implementer owns the seam): the lowest-coupling shape is a
|
||||
> checkpoint hook — `runLocalScanEnrichment` invokes a caller-supplied callback once
|
||||
> the last non-relationship stage completes, and `local-scan.ts` supplies a callback
|
||||
> that calls the existing `writeLocalScanEnrichmentArtifacts` for the
|
||||
> descriptions + embeddings + manifest only (no generated joins yet). The final
|
||||
> write after the relationship stage proceeds as today. Relationship-specific
|
||||
> artifacts (`relationships.json`, `relationship-profile.json`,
|
||||
> `relationship-diagnostics.json`) are written by the final/relationship write, not
|
||||
> the checkpoint, so the checkpoint never emits misleading empty relationship
|
||||
> diagnostics.
|
||||
>
|
||||
> Rejected alternative: move all artifact writing inside `runLocalScanEnrichment`
|
||||
> (inject the file store / project). That couples the enrichment module to
|
||||
> persistence for no gain — the writer already lives in `local-scan.ts` and the
|
||||
> checkpoint needs only a one-line hook, not a relocation.
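
A minimal sketch of the checkpoint hook described above, under stated assumptions: `runEnrichmentSketch`, `scanSketch`, `writeManifest`, and the in-memory `Manifest` are hypothetical stand-ins, not ktx's real signatures. Only the ordering (checkpoint before relationships, additive final write) is taken from this spec.

```typescript
// Hypothetical, simplified stand-ins for the real ktx types.
type Descriptions = Record<string, string>; // table name -> AI description
type Join = { from: string; to: string };

interface Manifest {
  descriptions: Descriptions;
  joins: Join[];
}

interface EnrichmentOptions {
  describe: () => Descriptions; // the paid LLM stage
  detectRelationships: () => Join[]; // the open-ended stage; may throw
  onCheckpoint: (descriptions: Descriptions) => void;
}

// Runs the stages in order and fires the checkpoint once the last
// non-relationship stage completes, before relationship detection starts.
function runEnrichmentSketch(options: EnrichmentOptions): Join[] {
  const descriptions = options.describe();
  options.onCheckpoint(descriptions);
  return options.detectRelationships();
}

// Additive write: keep existing descriptions and joins, then layer new data.
function writeManifest(existing: Manifest, patch: Partial<Manifest>): Manifest {
  return {
    descriptions: { ...existing.descriptions, ...(patch.descriptions ?? {}) },
    joins: [...existing.joins, ...(patch.joins ?? [])],
  };
}

// Driver: the caller owns persistence and supplies the checkpoint callback.
function scanSketch(detectRelationships: () => Join[]): Manifest {
  let manifest: Manifest = { descriptions: {}, joins: [] }; // bare manifest
  try {
    const joins = runEnrichmentSketch({
      describe: () => ({ orders: "One row per customer order." }),
      detectRelationships,
      onCheckpoint: (descriptions) => {
        manifest = writeManifest(manifest, { descriptions });
      },
    });
    manifest = writeManifest(manifest, { joins });
  } catch {
    // Relationship detection failed: the checkpointed manifest stands as-is.
  }
  return manifest;
}

const interrupted = scanSketch(() => {
  throw new Error("killed during profiling");
});
const completed = scanSketch(() => [{ from: "orders", to: "customers" }]);
```

In this sketch the interrupted scan still returns the `orders` description with no joins, while the completed scan carries both, which is the "no joins, never no descriptions" degradation D1 asks for.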
|
||||
|
||||
### D2 — Resume by content identity, not by `runId`
|
||||
|
||||
Re-key completed-stage resolution on **`(connectionId, stage, inputHash)`**,
|
||||
independent of `runId`, so a re-run with an unchanged schema and config resumes the
|
||||
finished `descriptions` / `embeddings` stages from cache and re-runs only what
|
||||
actually failed. `inputHash` is already the content fingerprint; `connectionId`
|
||||
scopes it to the right source. When several rows share a content identity (one per
|
||||
prior run), the most recent `updatedAt` wins.
|
||||
|
||||
`runId` stays on the stored row for diagnostics and for `listRunStages`, but leaves
|
||||
the uniqueness/lookup key.
|
||||
|
||||
The state store is a **disposable local resume cache** (`.ktx` local state,
|
||||
regenerable from a fresh ingest). Re-key it with **no migration bridge** — recreate
|
||||
the table if its on-disk shape differs from the new `(connection_id, stage,
|
||||
input_hash)` key, consistent with ktx's no-backward-compatibility policy. Losing the
|
||||
old cache only means one ingest cannot resume; it never corrupts a queryable
|
||||
artifact.
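
The re-keying can be illustrated with an in-memory sketch. This is an assumption-laden model, not the store itself: the real cache is a SQLite table, and `StageCacheSketch` and `StageRow` are hypothetical names. It shows only that lookup uses `(connectionId, stage, inputHash)` and never `runId`.

```typescript
// Hypothetical row shape; the real store is a SQLite table.
interface StageRow {
  connectionId: string;
  stage: string;
  inputHash: string;
  runId: string; // kept for diagnostics, not part of the key
  updatedAt: number; // epoch milliseconds
  output: string;
}

class StageCacheSketch {
  private readonly rows = new Map<string, StageRow>();

  private static key(connectionId: string, stage: string, inputHash: string): string {
    return JSON.stringify([connectionId, stage, inputHash]);
  }

  // Upsert on the content identity; a newer row replaces an older one.
  saveCompletedStage(row: StageRow): void {
    const key = StageCacheSketch.key(row.connectionId, row.stage, row.inputHash);
    const existing = this.rows.get(key);
    if (existing === undefined || row.updatedAt >= existing.updatedAt) {
      this.rows.set(key, row);
    }
  }

  // Lookup never consults runId, so a fresh run can resume a prior run's work.
  findCompletedStage(connectionId: string, stage: string, inputHash: string): StageRow | undefined {
    return this.rows.get(StageCacheSketch.key(connectionId, stage, inputHash));
  }
}

const cache = new StageCacheSketch();
cache.saveCompletedStage({
  connectionId: "warehouse",
  stage: "descriptions",
  inputHash: "abc",
  runId: "run-1",
  updatedAt: 1000,
  output: "descriptions from run 1",
});

// A later run with a new runId but the same inputs hits the cache.
const resumed = cache.findCompletedStage("warehouse", "descriptions", "abc");
// Changed inputs (a different inputHash) miss and force a recompute.
const missed = cache.findCompletedStage("warehouse", "descriptions", "xyz");
```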
|
||||
|
||||
> Rejected alternative: include `syncId` or `mode` in the key. `mode` and the rest
|
||||
> are already folded into `inputHash`; adding them again would only narrow the key
|
||||
> and re-break cross-run resume when an incidental field differs.
|
||||
|
||||
### D3 — Make the relationship stage observable and bounded
|
||||
|
||||
Thread three things the rest of the pipeline already supports through
|
||||
`discoverKtxRelationships` into profiling, validation, and composite detection:
|
||||
|
||||
- **Progress** through the existing progress port (the relationship phase is
|
||||
already `progress?.startPhase(0.25)` at `local-enrichment.ts:586`): emit per-unit
|
||||
liveness — "Profiling table K/N", "Validating candidate K/M", and the equivalent
|
||||
for composite probing — so a supervisor can distinguish slow-but-working from
|
||||
hung.
|
||||
- **A flat wall-clock budget** for the whole relationship stage: a new
|
||||
`scan.relationships.detectionBudgetMs`, a positive integer of milliseconds,
|
||||
project-level, validated like the other `scan.relationships` fields, **default
|
||||
600_000 (10 min), enforced by default.** Checked at unit boundaries (before each
|
||||
table profile, each candidate validation, each composite probe). It sits **above**
|
||||
spec 16's per-query deadline (default 30s): each individual query is already
|
||||
bounded; this bounds the *sum* of them.
|
||||
- **Honored cancellation:** where `KtxScanContext.signal` is available, the same
|
||||
unit-boundary check honors it, so external cancellation stops the stage too.
|
||||
|
||||
On budget exhaustion or abort: stop scheduling new work, let in-flight queries
|
||||
finish (each already bounded by spec 16), finalize with the relationships found so
|
||||
far, and return a **partial** result — never an unbounded hang and never an
|
||||
exception that would lose the checkpointed descriptions.
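
The unit-boundary check can be sketched as follows. This is a simplified, synchronous illustration under assumptions: `DetectionBudgetSketch` and `runUnitsWithBudget` are hypothetical names, the clock is injected so the example is deterministic, and real units would be concurrent database queries rather than a sequential loop.

```typescript
type StopReason = "budget_exhausted" | "aborted";

// Sticky: once a stop reason is observed, every later check reports it.
class DetectionBudgetSketch {
  private reason: StopReason | null = null;
  private readonly deadline: number;
  private readonly now: () => number;
  private readonly isAborted: () => boolean;

  constructor(budgetMs: number, now: () => number, isAborted: () => boolean = () => false) {
    this.now = now;
    this.isAborted = isAborted;
    this.deadline = now() + budgetMs;
  }

  // Returns true while new work may still be scheduled.
  check(): boolean {
    if (this.reason === null) {
      if (this.isAborted()) this.reason = "aborted";
      else if (this.now() >= this.deadline) this.reason = "budget_exhausted";
    }
    return this.reason === null;
  }

  stopReason(): StopReason | null {
    return this.reason;
  }
}

interface UnitRunResult<T> {
  results: T[];
  partial: { reason: StopReason } | null;
}

function runUnitsWithBudget<T>(
  units: string[],
  label: string,
  budget: DetectionBudgetSketch,
  progress: (message: string) => void,
  runUnit: (unit: string) => T,
): UnitRunResult<T> {
  const results: T[] = [];
  let done = 0;
  for (const unit of units) {
    // Checked only before starting a unit, so work already finished is kept.
    if (!budget.check()) {
      return { results, partial: { reason: budget.stopReason() ?? "budget_exhausted" } };
    }
    done += 1;
    progress(`${label} ${done}/${units.length}`);
    results.push(runUnit(unit));
  }
  return { results, partial: null };
}

let clock = 0; // fake clock in milliseconds
const budget = new DetectionBudgetSketch(250, () => clock);
const messages: string[] = [];
const outcome = runUnitsWithBudget(
  ["a", "b", "c", "d"],
  "Profiling table",
  budget,
  (message) => messages.push(message),
  (table) => {
    clock += 100; // pretend each table profile takes 100 ms
    return table.toUpperCase();
  },
);
```

With a 250 ms budget and 100 ms per unit, the fourth unit is never started: the sketch returns three results plus a `partial` marker instead of throwing.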
|
||||
|
||||
> Rejected alternative — per-table-scaled budget (N seconds × table count). It is a
|
||||
> second formula to reason about and "more tables → more budget" partly re-opens the
|
||||
> unbounded door this requirement closes. One flat, generous, project-level number
|
||||
> matches how the other `scan.relationships` knobs are shaped and is enough for a
|
||||
> best-effort stage whose partial output is durable and improvable (D4).
|
||||
>
|
||||
> Rejected alternative — a global `KTX_RELATIONSHIP_BUDGET_MS` env knob or a
|
||||
> per-call override. One opinionated project-level default with a config override is
|
||||
> the canonical ktx shape; no second runtime path.
|
||||
|
||||
### D4 — A budget-truncated partial is a successful, cached, completed stage
|
||||
|
||||
A graceful budget stop is **not** a failure. The relationship stage saves its
|
||||
partial result like any completed stage (so a plain re-run resumes it for free, no
|
||||
re-querying) and marks it `partial` with a reason in the relationship diagnostics
|
||||
plus a recoverable scan warning. Because `detectionBudgetMs` lives in
|
||||
`relationshipSettings ⊂ inputHash`, **raising the budget changes the content
|
||||
identity and triggers a fresh, fuller run** — that is the only "try harder"
|
||||
mechanism, with no extra flag or runtime path.
|
||||
|
||||
Distinguish the two stop kinds:
|
||||
|
||||
- **Process killed mid-stage** (crash / SIGKILL / supervisor): nothing is saved as
|
||||
completed, so the next run recomputes the relationship stage (after resuming
|
||||
descriptions/embeddings from cache via D2). This is the primary durability path.
|
||||
- **Graceful budget/abort stop**: a partial *is* saved as completed-partial and
|
||||
resumed cheaply on re-run, unless the budget is raised.
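
The two stop kinds can be modelled in a few lines. This is a hedged sketch, not the shipped code: `runRelationshipStage` and `fingerprint` are hypothetical, a thrown error stands in for a killed process, and the canonical settings string stands in for `inputHash` (a real implementation would hash it together with the other inputs).

```typescript
interface RelationshipResult {
  joins: string[];
  partial: { reason: string } | null;
}

// Hypothetical fingerprint: the budget is part of the content identity.
function fingerprint(settings: { detectionBudgetMs: number }): string {
  return JSON.stringify({ detectionBudgetMs: settings.detectionBudgetMs });
}

const completedStages = new Map<string, RelationshipResult>();
let computeCalls = 0;

// A stage is recorded as completed only after compute() returns normally.
function runRelationshipStage(
  settings: { detectionBudgetMs: number },
  compute: () => RelationshipResult,
): RelationshipResult {
  const key = fingerprint(settings);
  const cached = completedStages.get(key);
  if (cached !== undefined) return cached;
  computeCalls += 1;
  const result = compute(); // a crash here skips the save below
  completedStages.set(key, result);
  return result;
}

const truncated: RelationshipResult = {
  joins: ["orders->customers"],
  partial: { reason: "budget_exhausted" },
};
const full: RelationshipResult = {
  joins: ["orders->customers", "orders->items"],
  partial: null,
};

// 1. Process killed mid-stage (modelled as a throw): nothing is saved.
try {
  runRelationshipStage({ detectionBudgetMs: 600_000 }, () => {
    throw new Error("killed");
  });
} catch {
  // the next run recomputes
}
const savedAfterKill = completedStages.size;

// 2. Graceful budget stop: the partial is saved, then resumed from cache.
const first = runRelationshipStage({ detectionBudgetMs: 600_000 }, () => truncated);
const second = runRelationshipStage({ detectionBudgetMs: 600_000 }, () => full);

// 3. Raising the budget changes the fingerprint and forces a fresh run.
const third = runRelationshipStage({ detectionBudgetMs: 1_200_000 }, () => full);
```

Here the second call returns the cached partial without invoking its compute function, and only the raised budget produces the fuller result.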
|
||||
|
||||
## Requirements
|
||||
|
||||
### 1. Checkpoint descriptions + embeddings before relationship detection
|
||||
|
||||
The descriptions and embeddings MUST be persisted into the durable, queryable
|
||||
`_schema` manifest (and the raw enrichment artifacts) as soon as the last
|
||||
non-relationship stage completes, before the `relationships` stage runs.
|
||||
Relationship detection appends/merges its joins on completion. The expensive LLM +
|
||||
embedding enrichment MUST be queryable even if the relationship stage subsequently
|
||||
fails, is interrupted, is budget-truncated, or is skipped. A failed/partial/skipped
|
||||
relationship stage MUST degrade to "no/partial joins," never to "no descriptions."
|
||||
|
||||
### 2. Stage resume resolves by content identity across runs
|
||||
|
||||
Completed-stage resolution MUST key on `(connectionId, stage, inputHash)`,
|
||||
independent of `runId`, so re-running an interrupted ingest resumes the finished
|
||||
`descriptions` / `embeddings` stages from cache and re-runs only what failed.
|
||||
Re-running after an interruption MUST NOT re-issue LLM description or embedding
|
||||
calls for stages that already completed. The resume cache MAY be recreated without a
|
||||
migration bridge if its schema changes (it is disposable local state).
|
||||
|
||||
### 3. Relationship detection emits progress and honors a wall-clock budget
|
||||
|
||||
The relationship stage MUST emit per-unit progress through the existing progress
|
||||
port (at minimum per-table during profiling and per-candidate during validation) so
|
||||
liveness is observable. It MUST enforce a flat wall-clock budget
|
||||
(`scan.relationships.detectionBudgetMs`, default 600_000 ms, project-level,
|
||||
overridable, validated as a positive integer) checked at unit boundaries and layered
|
||||
above spec 16's per-query deadline, and MUST honor `KtxScanContext.signal` where
|
||||
available. On budget exhaustion or abort it MUST stop scheduling new work, finalize
|
||||
with the relationships found so far, and return a partial result rather than running
|
||||
unboundedly or throwing.
|
||||
|
||||
### 4. A budget-truncated relationship result is durable and marked partial
|
||||
|
||||
A graceful budget/abort stop MUST persist the partial relationship result as a
|
||||
completed stage (so a plain re-run resumes it without re-querying) and MUST mark it
|
||||
`partial` — in the relationship diagnostics artifact and as a recoverable scan
|
||||
warning — so downstream consumers can see the joins are incomplete. Raising
|
||||
`detectionBudgetMs` (which changes `inputHash`) MUST cause a fresh, fuller
|
||||
relationship run; no separate flag is introduced for "redo." A process killed
|
||||
mid-stage MUST NOT leave a completed record (so it recomputes on re-run).
|
||||
|
||||
### 5. No regression for small or uninterrupted ingests
|
||||
|
||||
A small or single-run ingest that is never interrupted MUST produce the same
|
||||
artifacts and the same relationship output as today. The checkpoint write MUST be
|
||||
idempotent with the final write (descriptions survive the join rewrite); the budget
|
||||
default MUST be generous enough that normal and large-but-tractable schemas complete
|
||||
relationship detection fully, hitting the budget only on pathological scale.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- **Durability across interruption:** interrupting an ingest **during** relationship
|
||||
detection still leaves a queryable semantic layer carrying the table/column
|
||||
descriptions + embeddings that were generated (verified: re-open the connection;
|
||||
AI descriptions are present in `_schema`, not just native comments).
|
||||
- **Resume does not re-spend:** re-running an interrupted ingest does **not**
|
||||
regenerate descriptions/embeddings whose stage already completed (verified: no LLM
|
||||
description calls and no embedding calls for the cached tables; only the failed
|
||||
stage re-runs). Resolution is by `(connectionId, stage, inputHash)`, so the resume
|
||||
survives a fresh `runId`.
|
||||
- **Observable + bounded relationships:** a connection with hundreds of tables emits
|
||||
relationship-stage progress (per-table profiling, per-candidate validation) and
|
||||
completes within `detectionBudgetMs`; when the budget is hit, the stage stops
|
||||
gracefully and persists the partial relationships found so far — without
|
||||
discarding enrichment — marked `partial` in diagnostics and via a recoverable
|
||||
warning.
|
||||
- **Partial is cached and improvable:** re-running with an unchanged budget resumes
|
||||
the partial relationship result from cache (no re-querying); raising
|
||||
`detectionBudgetMs` triggers a fresh, fuller relationship run.
|
||||
- **Budget validation:** `detectionBudgetMs` defaults to 600_000, honors a project
|
||||
override, and rejects an invalid value (zero / negative / non-integer) as a clear
|
||||
`ktx.yaml` config error.
|
||||
- **No regression:** small/single-run ingests behave exactly as before — identical
|
||||
artifacts and relationship output when nothing is interrupted; the checkpoint +
|
||||
final writes leave descriptions intact alongside the generated joins.
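
The budget-validation criterion can be sketched as a plain resolver. This is illustrative only: the spec places the real validation in the zod schema in `config.ts`, and `resolveDetectionBudgetMs` and the error wording here are assumptions.

```typescript
const DEFAULT_DETECTION_BUDGET_MS = 600_000;

// Hypothetical resolver: default when unset, otherwise a positive integer.
function resolveDetectionBudgetMs(raw: unknown): number {
  if (raw === undefined) return DEFAULT_DETECTION_BUDGET_MS;
  if (typeof raw !== "number" || !Number.isInteger(raw) || raw <= 0) {
    throw new Error(
      `ktx.yaml: scan.relationships.detectionBudgetMs must be a positive integer of milliseconds, got ${JSON.stringify(raw)}`,
    );
  }
  return raw;
}

// Helper for the demonstration: true when the value is rejected.
function rejects(raw: unknown): boolean {
  try {
    resolveDetectionBudgetMs(raw);
    return false;
  } catch {
    return true;
  }
}

const defaulted = resolveDetectionBudgetMs(undefined);
const overridden = resolveDetectionBudgetMs(120_000);
const invalid = [0, -5, 1.5, "600000"].map((value) => rejects(value));
```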
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Bounding the descriptions stage's per-table LLM call.** Whether an individual
|
||||
enrichment LLM call can wedge is a separate concern (already being addressed in the
|
||||
working tree via a per-table enrichment timeout). This spec ensures whatever
|
||||
descriptions *did* complete are durable; it does not own the per-call timeout.
|
||||
- **Changing relationship-detection quality, thresholds, or the candidate/validation
|
||||
algorithm.** The accept/review thresholds, scoring, and the existing
|
||||
`validationBudget` count cap are unchanged; this spec adds durability,
|
||||
cross-run resume, progress, and a time budget around them.
|
||||
- **A per-connection or per-call relationship budget, or a global env override.**
|
||||
One flat project-level `detectionBudgetMs`; no second runtime path (D3).
|
||||
- **A new per-query timeout.** Spec 16 already bounds individual queries; this spec
|
||||
composes above it and does not re-implement query-level deadlines.
|
||||
- **Replacing the per-query deadline with the stage budget, or vice versa.** They
|
||||
are independent and layered: a single query is bounded by spec 16; the stage's sum
|
||||
is bounded by `detectionBudgetMs`.
|
||||
- **A general checkpoint framework for every ingest stage.** The checkpoint is
|
||||
specifically the descriptions+embeddings → queryable-manifest promotion before
|
||||
relationships; it is not a generic per-stage artifact-flush abstraction.
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns the
|
||||
design.
|
||||
|
||||
- **Enrichment orchestration** — `context/scan/local-enrichment.ts`:
|
||||
`runLocalScanEnrichment` (`:472`), the three `runEnrichmentStage` calls
|
||||
(`descriptions` `:524`, `embeddings` `:553`, `relationships` `:587`),
|
||||
`runEnrichmentStage` (`:413`) and its `findCompletedStage` lookup (`:427`). Add the
|
||||
checkpoint hook after the last non-relationship stage; thread the progress port,
|
||||
signal, and budget into the relationship stage.
|
||||
- **Scan driver / write ordering** — `context/scan/local-scan.ts`: bare manifest
|
||||
write (`:473`), enrichment call (`:492`, currently passing only
|
||||
`{ runId, progress }` as `context` — wire `signal` through here too), terminal
|
||||
`writeLocalScanEnrichmentArtifacts` (`:510`), and the enrichment-failure catch
|
||||
(`:530`, which after D1 no longer loses descriptions). Supply the checkpoint
|
||||
callback here.
|
||||
- **Artifact writer** — `context/scan/local-enrichment-artifacts.ts`:
|
||||
`writeLocalScanEnrichmentArtifacts` (`:310`), `writeLocalScanManifestShards`
|
||||
(`:270`), and the description-preserving merge in `loadExistingManifestState`
|
||||
(`:196`) — the basis for the additive checkpoint/final write.
|
||||
- **Resume cache** — `context/scan/sqlite-local-enrichment-state-store.ts`:
|
||||
`PRIMARY KEY (run_id, stage)` (`:83`), `findCompletedStage` (`:91`),
|
||||
`saveCompletedStage` (`:117`). Re-key on `(connection_id, stage, input_hash)`,
|
||||
pick latest `updated_at`, recreate the table if shape differs (disposable cache).
|
||||
Lookup interface `KtxScanEnrichmentStageLookup` and `findCompletedStage`
|
||||
in `context/scan/enrichment-state.ts` (`:10,46`); `computeKtxScanEnrichmentInputHash`
|
||||
(`:78`).
|
||||
- **Relationship stack (progress + budget + signal)** —
|
||||
`context/scan/relationship-discovery.ts` (`discoverKtxRelationships` `:218`, accept
|
||||
a progress port and budget/deadline + signal),
|
||||
`context/scan/relationship-profiling.ts` (`profileKtxRelationshipSchema` `:320` —
|
||||
per-table progress + budget check),
|
||||
`context/scan/relationship-validation.ts` (`validateKtxRelationshipDiscoveryCandidates`
|
||||
`:237` — per-candidate progress + budget check, alongside the existing
|
||||
`validationBudget`),
|
||||
`context/scan/relationship-composite-candidates.ts`
|
||||
(`discoverKtxCompositeRelationships` `:515` — budget check).
|
||||
- **Config** — `context/project/config.ts` `scan.relationships`
|
||||
(`KtxScanRelationshipConfig`, `:171–213`): add `detectionBudgetMs` (positive
|
||||
integer ms, default 600_000) to the zod schema and the default config builder.
|
||||
- **Partial marker** — `context/scan/relationship-diagnostics.ts`
|
||||
(`buildKtxRelationshipDiagnostics`, the profile/diagnostics artifact shape) carries
|
||||
a `partial` flag + reason; add a recoverable warning code to the
|
||||
`KtxScanWarningCode` union in `context/scan/types.ts` (e.g.
|
||||
`relationship_detection_partial`).
|
||||
- **Tests** — durability: a fixture ingest interrupted during the relationship stage
|
||||
leaves AI descriptions in the queryable `_schema`. Resume: a second run with a
|
||||
fresh `runId` and unchanged `inputHash` resolves the cached descriptions/embeddings
|
||||
(assert no LLM/embedding calls) and re-runs only relationships. Budget: a schema
|
||||
large enough (or a tiny `detectionBudgetMs` as the test seam) hits the budget,
|
||||
emits per-unit progress, returns partial, persists it marked `partial`, and a
|
||||
re-run resumes the partial; raising the budget re-runs. Resolver/config unit tests
|
||||
for `detectionBudgetMs` (default / override / invalid). Regression: small
|
||||
uninterrupted ingest yields identical artifacts and relationship output.
|
||||
- After implementing, rebuild and re-link so the playground picks it up:
|
||||
`pnpm run build && pnpm run link:dev`.
|
||||
|
||||
## Benchmark context (motivation, not a requirement)
|
||||
|
||||
The Spider 2.0-Lite BigQuery slice has datasets with hundreds to thousands of tables
|
||||
(`ebi_chembl` 785, `fec` 486, `ga360` 366, …). Enriching them with claude-code
|
||||
costs real, rate-limited LLM budget; losing that enrichment to a relationship-stage
|
||||
interruption — and re-spending it on every retry — makes large-schema ingest
|
||||
impractical, and an unbounded profiling stage runs past any supervising deadline and
|
||||
is killed. This is a general durability/cost property of the ingest pipeline,
|
||||
independent of the benchmark; the benchmark only made it acute at scale. Do not
|
||||
encode any benchmark specifics in the implementation.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Implemented on branch `write-feature-spec-wiki` (ktx worktree `tallinn-v2`). All
|
||||
four design decisions shipped; no deviations from the resolved design.
|
||||
|
||||
**D2 — resume by content identity** (`sqlite-local-enrichment-state-store.ts`,
|
||||
`enrichment-state.ts`, `local-enrichment.ts`): the stage table is re-keyed to
|
||||
`PRIMARY KEY (connection_id, stage, input_hash)`; `findCompletedStage` looks up by
|
||||
`(connectionId, stage, inputHash)` ordered by `updated_at DESC` (the most recent
row for a content identity wins). `KtxScanEnrichmentStageLookup.runId` became `connectionId`;
|
||||
`runId` stays on the row for diagnostics/`listRunStages`. The store drops and
|
||||
recreates the table when the on-disk primary key differs (disposable cache, no
|
||||
migration bridge), detected via `PRAGMA table_info`.
|
||||
|
||||
**D3 — observable + bounded relationship stage** (new
|
||||
`relationship-detection-budget.ts`): a sticky `KtxRelationshipDetectionBudget`
|
||||
(`check()`/`stopReason()`) built from `detectionBudgetMs` + `ctx.signal` + an
|
||||
injectable `now`, plus `mapWithBudget` (a budget-aware concurrent map that
|
||||
generalizes and replaces the old `mapWithConcurrency`). Threaded through
|
||||
`discoverKtxRelationships` → profiling (per-table progress + budget stop),
|
||||
validation (per-candidate progress + budget stop; budget-skipped candidates
|
||||
degrade to the existing `validation_unattempted` review), and composite
|
||||
detection (budget stops at PK-detection and coverage-probe boundaries).
|
||||
`discoverKtxRelationships` now accepts `progress` and `now` and returns
|
||||
`partial: { reason } | null`. The clock check fires only when work remains, so a
|
||||
deadline elapsing after the last unit never marks a fully-processed stage partial.
|
||||
|
||||
**D1 — checkpoint before relationships** (`local-enrichment.ts`,
|
||||
`local-enrichment-artifacts.ts`, `local-scan.ts`): `runLocalScanEnrichment` fires a
|
||||
caller-supplied `onCheckpoint` once descriptions/embeddings complete and before
|
||||
the relationship stage runs, gated on `shouldDetectRelationships` so the
|
||||
no-relationship path keeps a single write. `local-scan.ts` supplies a callback
|
||||
calling the new `writeLocalScanEnrichmentCheckpoint` (descriptions.json +
|
||||
embeddings.json + manifest with descriptions and no generated joins — no
|
||||
relationship artifacts, so no misleading empty diagnostics). The shared
|
||||
description/embedding JSON writer was factored out so checkpoint and final writes
|
||||
stay one implementation. `ctx.signal` is now threaded from `RunLocalScanOptions`
|
||||
into the enrichment context (completing the existing `KtxScanContext.signal`
|
||||
contract already read by the budget and the in-flight description timeout).
|
||||
|
||||
**D4 — partial is durable + marked** (`relationship-diagnostics.ts`,
|
||||
`local-enrichment.ts`, `local-enrichment-artifacts.ts`): the diagnostics artifact
|
||||
carries `partial` + `partialReason`; `runLocalScanEnrichment` pushes a recoverable
|
||||
`relationship_detection_partial` warning (new `KtxScanWarningCode`) when truncated.
|
||||
A graceful budget/abort stop returns normally, so the relationship stage saves as a
|
||||
completed-partial record and resumes cheaply; a process killed mid-stage saves
|
||||
nothing and recomputes. Raising `detectionBudgetMs` changes `inputHash`
|
||||
(it lives in `relationshipSettings`), forcing a fresh, fuller run — the only
|
||||
"try harder" mechanism, no extra flag.
|
||||
|
||||
**Config** (`config.ts`): `scan.relationships.detectionBudgetMs`, positive integer
|
||||
ms, default `600_000`, validated like the other relationship fields. Documented in
|
||||
`docs-site/content/docs/configuration/ktx-yaml.mdx`.
|
||||
|
||||
**Tests** (all green): budget unit tests (`relationship-detection-budget.test.ts`);
|
||||
cross-run resume + table-recreate (`enrichment-state.test.ts`,
|
||||
`local-enrichment.test.ts`); progress/budget/abort partial
|
||||
(`relationship-discovery.test.ts`); partial persisted/resumed/re-run-on-raise +
|
||||
checkpoint ordering + no-checkpoint-when-skipped (`local-enrichment.test.ts`);
|
||||
end-to-end durability — a relationship-stage failure still leaves AI descriptions
|
||||
in the queryable `_schema` (`local-scan.test.ts`); diagnostics partial flag
|
||||
(`relationship-diagnostics.test.ts`); config default/override/invalid
|
||||
(`config.test.ts`). `pnpm --filter @kaelio/ktx type-check`, `pnpm run dead-code`,
|
||||
and `pnpm run build && pnpm run link:dev` all pass. (Pre-existing and unrelated:
|
||||
three `analytics-skill-content.test.ts` markdown-structure assertions fail on this
|
||||
branch from earlier analytics-skill commits — untouched here.)
|
||||
|
|
@ -1,533 +0,0 @@
|
|||
# Resilient enrichment under a slow/hung LLM backend
|
||||
|
||||
> Refined spec. Intake draft: `todo/20-resilient-enrichment-under-slow-llm.md`.
|
||||
>
|
||||
> **Scope: make the descriptions enrichment stage survive a hung LLM backend and
|
||||
> an interrupted run.** Two compounding gaps live *inside* the per-table
|
||||
> description-enrichment path: (1) the per-table LLM timeout fires in JS but does
|
||||
> not terminate a wedged subprocess backend, so a hung table wedges the whole
|
||||
> stage indefinitely; (2) descriptions are persisted only at full-stage
|
||||
> completion, so any interruption discards every already-enriched table. This is
|
||||
> the enrichment-stage analog of spec 16 (enforced query cancellation — a deadline
|
||||
> that *stops the work*, not just abandons the promise) and spec 19 (move the
|
||||
> durability boundary to the cost boundary so expensive LLM work is not lost). It
|
||||
> composes with both rather than replacing them.
|
||||
|
||||
## Problem
|
||||
|
||||
Two compounding failure modes on the per-table description-enrichment path, both
|
||||
confirmed in the current code and observed end-to-end together. Their union turned
|
||||
a single hung table into an indefinite wedge *plus* total loss of an entire
|
||||
stage's LLM work.
|
||||
|
||||
### 1. The per-table LLM timeout does not terminate the work
|
||||
|
||||
`KtxDescriptionGenerator.generateBatchedTableDescriptions`
|
||||
(`context/scan/description-generation.ts`, the bounded call ~760–866) wraps the
|
||||
per-table `this.llmRuntime.generateObject(...)` call in `retryAsync` with a fresh
|
||||
`AbortSignal.timeout(KTX_ENRICH_LLM_TIMEOUT_MS)` per attempt (commit `01f63380`).
|
||||
A fired timeout is surfaced as `KtxAbortedError` so it is **not** retried (one
|
||||
wedge stays one timeout, not 3×). That is the correct policy — but the abort never
|
||||
actually stops a subprocess backend, so the timeout is cosmetic.
|
||||
|
||||
The runtime is selected by the `backend` config field
|
||||
(`context/llm/local-config.ts`, `KTX_LLM_BACKENDS =
|
||||
['none','anthropic','vertex','gateway','claude-code','codex']`). Two backends spawn
|
||||
a **child process the SDK owns** and to which ktx hands only an `AbortSignal`:
|
||||
|
||||
- **`codex`** (`@openai/codex-sdk`, via `context/llm/codex-runtime.ts` →
|
||||
`codex-sdk-runner.ts`): the SDK runs `spawn(executable, args, { signal })`. Node's
|
||||
`spawn` signal-option sends the child **SIGTERM** (not SIGKILL) on abort, and the
|
||||
SDK consumes the child's stdout with `for await (const line of rl)`, re-throwing
|
||||
the abort error **only after that loop ends**. A child wedged on a hung provider
|
||||
socket survives SIGTERM → its stdout never closes → the readline loop never ends
|
||||
→ the SDK never throws → ktx's `await generateObject` **never settles**, past the
|
||||
per-attempt timeout, indefinitely. The child leaks (open provider connections,
|
||||
~0% CPU).
|
||||
- **`claude-code`** (`@anthropic-ai/claude-agent-sdk`, via
|
||||
`context/llm/claude-code-runtime.ts`, `collectResult` ~275–322): on abort it calls
|
||||
best-effort `queryResult.interrupt?.()` (errors swallowed) and only checks
|
||||
`throwIfAborted` **between** streamed messages. A wedged child emits no message, so
|
||||
the `for await (const message of queryResult)` loop blocks and the graceful
|
||||
`interrupt()` may never land — the same hang class.
|
||||
|
||||
By contrast, **HTTP backends** (`anthropic`/`vertex`/`gateway`/`openai`, via
|
||||
`context/llm/ai-sdk-runtime.ts`) pass `abortSignal` straight to the AI SDK's
|
||||
`generateObject`, which cancels the underlying `fetch` natively — the await settles
|
||||
promptly and there is no child to leak.
|
||||
|
||||
So ktx holds **no kill handle** on the subprocess backends, and SIGTERM is too
|
||||
gentle for a wedged child. Spec 16's mechanism (ktx *itself* forks
|
||||
`read-query-child` and `SIGKILL`s it) works precisely because ktx owns the fork —
|
||||
which it does not here.
|
||||
|
||||
Observed (BigQuery ingest, codex backend, 2026-06-23): with
|
||||
`KTX_ENRICH_LLM_TIMEOUT_MS=1800000` (30 min, an operator override), two of
|
||||
`covid19_usa`'s 252-column tables hung; the stage sat at **268/285 for 41+
|
||||
minutes** — well past the 30-min per-attempt timeout — with exactly two codex
|
||||
children, each holding 3 ESTABLISHED connections at ~0% CPU, until killed by hand.
|
||||
|
||||
### 2. Descriptions are persisted only at full-stage completion
|
||||
|
||||
`generateDescriptions` (`context/scan/local-enrichment.ts` ~279–352) fans out
|
||||
per-table work through `pLimit(DESCRIPTION_TABLE_CONCURRENCY)` (default 4) and
|
||||
**accumulates every table's result in an in-memory `updates` array**, returned only
|
||||
when the whole stage finishes. `runEnrichmentStage` (~413, ~421–474) then calls
|
||||
`saveCompletedStage` (writing the whole-stage row to `local_scan_enrichment_stages`)
|
||||
**after** `compute()` returns, and the spec-19 checkpoint write
|
||||
(`writeLocalScanEnrichmentCheckpoint`, `local-enrichment-artifacts.ts` ~351–379,
|
||||
fired by the `onCheckpoint` hook in `local-scan.ts`) also runs **only once the
|
||||
descriptions stage completes**. There is no within-stage persistence: while the
|
||||
stage runs, every enriched table's description lives only in memory.
|
||||
|
||||
So if the stage cannot complete — 2 of 285 tables hang (gap #1), or the process is
|
||||
killed, or a supervising watchdog fires — **all** already-enriched tables are lost,
|
||||
even though their (expensive, paid) LLM descriptions were finished. On the next run,
|
||||
`findCompletedStage` finds no row, so the descriptions stage **recomputes from
|
||||
scratch**.
|
||||
|
||||
Observed (same run): `covid19_usa` had **283/285** tables enriched in memory but
|
||||
**0** rows in `local_scan_enrichment_stages` and **0** `ai:` descriptions on disk;
|
||||
killing the wedged ingest discarded all 283, forcing a from-scratch re-ingest. The
|
||||
cost of 2 pathological tables was 283 tables' worth of redone LLM calls.
|
||||
|
||||
Sharper still, on a re-ingest with a short, *enforced* timeout the stage **ran to
the end**: the 2 hung tables hit their timeout and were skipped, **283/285**
descriptions were generated, and the ingest reported success (`Scan completed` /
`Ingest finished`, embeddings built, exit 0). Yet the descriptions were **still
persisted as 0** (0 `ai:` on disk, 0 stage rows). So the loss is **not** only
"discarded on kill": a stage that completes with *any* skipped/aborted table throws
away **every** successfully generated description. The skip must be **graceful**: a
skipped table costs one missing description, not the entire stage's output. That is
the strongest argument for per-table incremental persistence: the 283 good
descriptions should have been durable the moment each was produced.
|
||||
|
||||
The on-disk artifacts already carry everything needed to fix this *additively*: the
|
||||
`_schema` manifest encodes per-table completion (a table with `descriptions.ai` is
|
||||
AI-enriched), and rewrites preserve existing descriptions
|
||||
(`mergeDescriptionsPreservingExternal`, `manifest.ts` ~96–115;
|
||||
`loadExistingManifestState`, `local-enrichment-artifacts.ts` ~196–253 — the basis
|
||||
spec 19 relies on). The durable record and the resume-skip set can be **derived from
|
||||
the system's own on-disk state**, with no new cache schema.
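
Deriving the resume-skip set from on-disk state could look like the sketch below. The manifest shape is an assumption made for illustration: only the `descriptions.ai` marker comes from this spec, while `TableEntry`, the `native` field, `tablesNeedingEnrichment`, and `persistTableDescription` are hypothetical.

```typescript
// Hypothetical, minimal manifest entry: only the fields this sketch needs.
interface TableEntry {
  name: string;
  descriptions: { native?: string; ai?: string };
}

// Tables that already carry an AI description are skipped on resume.
function tablesNeedingEnrichment(tables: TableEntry[]): string[] {
  return tables.filter((t) => t.descriptions.ai === undefined).map((t) => t.name);
}

// Per-table persistence: merge one finished description, leave the rest alone.
function persistTableDescription(tables: TableEntry[], name: string, ai: string): TableEntry[] {
  return tables.map((t) =>
    t.name === name ? { ...t, descriptions: { ...t.descriptions, ai } } : t,
  );
}

let onDisk: TableEntry[] = [
  { name: "cases", descriptions: { native: "Daily case counts" } },
  { name: "hospitals", descriptions: {} },
  { name: "wide_table", descriptions: {} },
];

// Two tables finish and are persisted at once; the third hangs and is skipped.
onDisk = persistTableDescription(onDisk, "cases", "Daily case counts per county.");
onDisk = persistTableDescription(onDisk, "hospitals", "Hospital capacity per facility.");

// On the next run only the table without an AI description is retried.
const remaining = tablesNeedingEnrichment(onDisk);
```

In this model a hung table costs exactly one missing description, and the next run re-issues LLM work only for `wide_table`.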
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
Anyone ingesting a large or wide schema with an LLM enrichment backend —
|
||||
especially a **subprocess** backend, the common local/desktop setup — will
|
||||
eventually hit a table whose description call hangs: a provider stall, a rate-limit
|
||||
black-hole, a pathologically large prompt. Without an *enforced* timeout, one such
|
||||
table wedges the entire ingest indefinitely and leaks the spawned child; without
|
||||
*incremental* persistence, any interruption throws away all the per-table LLM work
|
||||
already done — the dominant ingest cost. Both fixes make large-schema enrichment
|
||||
**resilient and resumable**: a few bad tables degrade to a few skipped
|
||||
descriptions, not a hung process and a from-scratch redo. This is core robustness
|
||||
for a general-purpose ingestion product, wholly independent of any benchmark.
|
||||
|
||||
## Design decisions (resolved during refinement)
|
||||
|
||||
These resolve ambiguities the intake draft left open. They constrain the
|
||||
implementer; the exact code is theirs (requirement-level, per the specs README).
|
||||
|
||||
### D1 — One bounded-call guarantee; enforcement follows the backend's nature
|
||||
|
||||
The canonical contract is a single guarantee for the per-table enrichment call:
|
||||
**the in-flight work terminates and ktx's await settles within the per-table
|
||||
deadline plus a small grace, on every backend.** How that guarantee is met follows
|
||||
from a structural property of the configured backend — *does it own a subprocess?*
|
||||
— not from a hand-maintained list of provider names:
|
||||
|
||||
- **Subprocess-backed (`codex`, `claude-code`):** the SDK's own abort is
|
||||
insufficient (SIGTERM-only, and ktx has no kill handle), so ktx runs the call
|
||||
behind a **boundary it can hard-kill** — a short-lived ktx-owned child process,
|
||||
made a **process-group leader** (`detached`). The SDK's grandchild (the
|
||||
`codex`/`claude` binary) inherits that group. On deadline (or `ctx.signal`), ktx
|
||||
**tree-kills the whole group with SIGKILL** — reaping the wrapper *and* the
|
||||
grandchild — and rejects promptly. This mirrors spec 16's child-process +
|
||||
SIGKILL mechanism, extended by the critical step that **killing the immediate
|
||||
child is not enough**: the grandchild would otherwise orphan to init and keep its
|
||||
provider connections. Killing the group is the real fix.
|
||||
- **HTTP-backed (`anthropic`/`vertex`/`gateway`/`openai`):** unchanged. The existing
|
||||
in-process `abortSignal` → `fetch` cancellation already satisfies the contract —
|
||||
the await settles promptly and there is no subprocess to leak. Routing these
|
||||
through a subprocess would pay fork + IPC + credential-passing cost for no benefit.
|
||||
|
||||
> The branch on "subprocess-backed?" is behavior following from an input the backend
|
||||
> declares about itself, not vendor enumeration — the same guarantee is reached two
|
||||
> ways because the backends differ structurally. This matches the intake's own split
|
||||
> ("subprocess SIGKILL for process-backed; request abort for HTTP-backed").
|
||||
>
|
||||
> Rejected alternative — a *settle-only race* (reject ktx's promise on the deadline
|
||||
> regardless of the SDK, but leave the SDK's child running). It unwedges the stage
|
||||
> but leaves the orphaned child holding provider connections — the exact leak the
|
||||
> incident showed — so it fails the intake's "actually cancelled" requirement and
|
||||
> compounds over a long ingest that hits several hung tables.
|
||||
>
|
||||
> Rejected alternative — a *persistent ktx subprocess pool* hosting the runtime,
|
||||
> killed and respawned on timeout. Terminate-on-deadline destroys the worker, so a
|
||||
> pool needs respawn + in-flight job-tracking for no benefit: the enrichment call is
|
||||
> low-frequency relative to its own latency and already concurrency-bounded (4), so
|
||||
> one short-lived child per call (spec 16's resolved choice) is simpler and as fast.
|
||||
|
||||
**Portability.** ktx supports Windows, where POSIX process groups and
|
||||
`process.kill(-pgid, …)` do not exist. The tree-kill MUST be portable: a detached
|
||||
process group + `kill(-pgid, 'SIGKILL')` on POSIX, and a tree-terminating
|
||||
equivalent on Windows (e.g. `taskkill /pid <pid> /T /F` or a job object) so the
|
||||
grandchild is reaped on every platform the subprocess backends run on.
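
The portable tree-kill described above can be sketched as follows. This is a minimal illustration, not the shipped implementation: `planTreeKill`, `killTree`, and `spawnWithDeadline` are hypothetical names, and the sketch assumes the wrapper is started with Node's `detached` option so that it leads its own process group on POSIX.

```typescript
import { spawn, spawnSync } from "node:child_process";

// Hypothetical plan type: how to terminate a whole process tree on a platform.
type TreeKillPlan =
  | { kind: "posix-group"; target: number; signal: "SIGKILL" }
  | { kind: "windows-taskkill"; command: "taskkill"; args: string[] };

// Pure planner, so the platform branch can be tested without spawning anything.
function planTreeKill(pid: number, platform: string): TreeKillPlan {
  if (platform === "win32") {
    // /T terminates the child tree, /F forces termination.
    return { kind: "windows-taskkill", command: "taskkill", args: ["/pid", String(pid), "/T", "/F"] };
  }
  // A negative pid addresses the process group whose leader is `pid`.
  return { kind: "posix-group", target: -pid, signal: "SIGKILL" };
}

function killTree(pid: number, platform: string = process.platform): void {
  const plan = planTreeKill(pid, platform);
  try {
    if (plan.kind === "posix-group") process.kill(plan.target, plan.signal);
    else spawnSync(plan.command, plan.args, { stdio: "ignore" });
  } catch {
    // The group may already be gone (ESRCH); there is nothing left to reap.
  }
}

// Spawn a wrapper as a process-group leader and hard-kill its tree at the deadline.
function spawnWithDeadline(command: string, args: string[], deadlineMs: number) {
  const child = spawn(command, args, { detached: true, stdio: "ignore" });
  const timer = setTimeout(() => {
    if (child.pid !== undefined) killTree(child.pid);
  }, deadlineMs);
  child.once("exit", () => clearTimeout(timer));
  child.once("error", () => clearTimeout(timer));
  return child;
}
```

Keeping the platform decision in a pure function is only one possible shape; a Windows job object, which the text also allows, would replace the `taskkill` branch.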
|
||||
|
||||
### D2 — Default stays moderate and the retry/skip policy is unchanged
|
||||
|
||||
The per-table timeout default stays **120s** (`KTX_ENRICH_LLM_TIMEOUT_MS`), with the
|
||||
existing per-attempt retry (`KTX_ENRICH_LLM_ATTEMPTS`, default 3) and the
|
||||
no-retry-on-timeout policy. A hung table costs **at most one timeout**, then the
|
||||
table is skipped with the existing `enrichment_timeout` warning and the stage
|
||||
proceeds. The 30-min value in the incident was an operator stopgap chosen *because*
|
||||
the timeout was cosmetic; once D1 makes the timeout actually terminate the work, a
|
||||
long timeout is strictly worse for a hang (a hang costs the full timeout), so the
|
||||
moderate default is the correct operating point. The retry loop stays in
|
||||
`description-generation.ts`: each attempt runs through the bounded boundary (D1), so
|
||||
a transient backend error retries while a timeout surfaces as `KtxAbortedError` and
|
||||
does not.
|
||||
|
||||
> Not introducing a new `ktx.yaml` config field for the timeout. The existing env
|
||||
> override is the tuning seam; adding a per-connection/per-call/global knob would
|
||||
> multiply the runtime surface for no stated need (one opinionated default + the
|
||||
> existing env override is the canonical ktx shape).
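
The D2 policy can be summarized in a small sketch. Only the two environment variable names, the defaults (120s, 3 attempts), and the warning names come from the text above. The helper names and the handling of malformed values (fall back to the default) are assumptions made for illustration.

```typescript
interface EnrichLimits {
  timeoutMs: number;
  attempts: number;
}

// Reads the two existing env overrides, falling back to the documented defaults.
// Assumption: a missing or non-positive-integer value falls back to the default.
function readEnrichLimits(env: Record<string, string | undefined>): EnrichLimits {
  const positiveInt = (raw: string | undefined, fallback: number): number => {
    const n = Number(raw);
    return raw !== undefined && Number.isInteger(n) && n > 0 ? n : fallback;
  };
  return {
    timeoutMs: positiveInt(env["KTX_ENRICH_LLM_TIMEOUT_MS"], 120000),
    attempts: positiveInt(env["KTX_ENRICH_LLM_ATTEMPTS"], 3),
  };
}

type FailureKind = "timeout" | "error";
type Outcome = "retry" | "skip:enrichment_timeout" | "skip:enrichment_failed";

// Decides what happens after attempt number `attempt` (1-based) failed.
function afterFailure(kind: FailureKind, attempt: number, limits: EnrichLimits): Outcome {
  // A timeout is never retried: one wedge costs one timeout.
  if (kind === "timeout") return "skip:enrichment_timeout";
  // A transient error retries until the attempt limit is exhausted.
  return attempt < limits.attempts ? "retry" : "skip:enrichment_failed";
}
```

With the defaults, a table that times out on its first attempt is skipped after roughly 120s, while a table that keeps failing with transient errors is skipped after its third attempt.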
|
||||
|
||||
### D3 — Persist descriptions incrementally; derive the resume-skip set from on-disk state
|
||||
|
||||
During the descriptions fan-out, flush completed tables **per batch** (every N
|
||||
tables / on a timer, at a cadence that bounds the at-risk window) to the durable
|
||||
on-disk artifacts, reusing spec 19's additive write:
|
||||
|
||||
- the raw descriptions artifact (`descriptions.json`) is the **resume-skip source**;
|
||||
- the `_schema` manifest is updated additively (`mergeDescriptionsPreservingExternal`
|
||||
preserves prior `ai:`/`db:`/external keys) so finished descriptions are also
|
||||
**queryable** the moment they are computed — the spec-19 invariant, one level
|
||||
deeper. The implementer MAY bound manifest-rewrite cost on huge schemas by
|
||||
rewriting only changed shards.
|
||||
|
||||
On resume, `generateDescriptions` reads the existing record, **skips any table
|
||||
already enriched**, computes only the remainder, and returns the merged full set so
|
||||
the embeddings stage, the checkpoint write, and the stage-store row all see a
|
||||
complete result exactly as today.
|
||||
|
||||
**The skip is `inputHash`-gated**, preserving spec 19's recompute semantics. The
|
||||
durable record is tagged with the descriptions stage's `inputHash`
|
||||
(`computeKtxScanEnrichmentInputHash`). Resume reuses it to skip tables **only when
|
||||
the current `inputHash` matches** — a genuine resume-after-interruption of the same
|
||||
content identity. A changed `inputHash` (schema or enrichment settings changed)
|
||||
ignores the prior record for skipping and recomputes the stage as today; the
|
||||
manifest write stays additive regardless. The artifact's on-disk shape may gain the
|
||||
`inputHash` tag with **no migration bridge** (ktx owns the artifact; a stale-shaped
|
||||
record simply forces one non-incremental run), consistent with ktx's
|
||||
no-backward-compatibility policy.
|
||||
|
||||
> The skip set is **derived from the artifacts ktx already writes**, not from a new
|
||||
> per-table cache table. The manifest's `ai:` field already encodes "this table is
|
||||
> enriched"; a parallel per-table SQLite record would be a second source of truth for
|
||||
> the same fact and would drift. The whole-stage `local_scan_enrichment_stages` row is
|
||||
> still written at stage completion (it remains the stage-level resume gate — a clean
|
||||
> re-run skips the descriptions stage as today); the incremental record only matters
|
||||
> when the stage did **not** complete — exactly the case where no row exists and
|
||||
> `compute()` re-runs.
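
A minimal sketch of the `inputHash`-gated skip set, assuming the durable record is a plain object holding the tag and a table-to-description map. `DurableRecord`, `ResumePlan`, and `planResume` are hypothetical names, and the real on-disk shape may differ.

```typescript
interface DurableRecord {
  inputHash: string;
  // Table name -> description; only successfully enriched tables are stored.
  descriptions: Record<string, string>;
}

interface ResumePlan {
  recovered: Record<string, string>;
  remaining: string[];
}

function planResume(record: DurableRecord | null, currentInputHash: string, tables: string[]): ResumePlan {
  // A missing record or a changed inputHash means nothing is reusable for skipping.
  const usable = record !== null && record.inputHash === currentInputHash ? record.descriptions : {};
  const recovered: Record<string, string> = {};
  const remaining: string[] = [];
  for (const table of tables) {
    if (Object.prototype.hasOwnProperty.call(usable, table)) recovered[table] = usable[table];
    else remaining.push(table);
  }
  return { recovered, remaining };
}
```

Only the `remaining` tables would be sent to the LLM; the stage then returns `recovered` merged with the fresh results, so downstream stages see a complete set.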
|
||||
|
||||
### D4 — A killed-mid-stage run is durable; resume is cheap
|
||||
|
||||
A process killed mid-stage (a gap #1 wedge, SIGKILL, a crash, or a supervisor kill) leaves the
|
||||
per-batch-flushed tables durable on disk. The next run resumes the descriptions
|
||||
stage (no completed `local_scan_enrichment_stages` row → `compute()` runs again),
|
||||
but `generateDescriptions` now **re-issues LLM calls only for the unfinished
|
||||
tables**. A failed/skipped table (timeout or exhausted retries) is left for the
|
||||
remainder set and is retried on the next resume — never silently treated as done.
|
||||
|
||||
## Requirements
|
||||
|
||||
### 1. The per-table enrichment timeout is enforced for subprocess backends
|
||||
|
||||
When the per-table deadline fires (or `ctx.signal` aborts) on a subprocess-backed
|
||||
backend (`codex`, `claude-code`), the in-flight LLM work — the spawned child **and
|
||||
its descendants** — MUST be terminated (SIGKILL of the process group / tree), and
|
||||
ktx's `generateObject` await MUST settle within the deadline plus a small bounded
|
||||
grace. A hung table MUST cost at most about one timeout of wall-clock time, never an unbounded wait.
|
||||
The termination MUST be portable across the platforms the subprocess backends run on
|
||||
(POSIX process-group kill and a Windows tree-kill equivalent). HTTP-backed backends
|
||||
keep their existing native `abortSignal` → `fetch` cancellation; the guarantee is one
|
||||
contract met two ways, branching on the backend's structural "owns a subprocess"
|
||||
property, not on a list of provider names.
|
||||
|
||||
### 2. The timeout default and retry/skip policy are unchanged
|
||||
|
||||
The default per-table timeout stays moderate (current 120s, `KTX_ENRICH_LLM_TIMEOUT_MS`),
|
||||
with the existing per-attempt retry (default 3, `KTX_ENRICH_LLM_ATTEMPTS`) and the
|
||||
no-retry-on-timeout policy. On timeout, the table is skipped with the existing
|
||||
`enrichment_timeout` recoverable warning and the stage proceeds. No new
|
||||
per-connection / per-call / global timeout knob is added.
|
||||
|
||||
### 3. Descriptions are persisted incrementally during the stage
|
||||
|
||||
Enriched descriptions MUST be flushed to the durable on-disk artifacts **per batch**
|
||||
(per-table or per-N-tables / on a timer) during the descriptions stage, at a cadence
|
||||
that bounds the at-risk window to a small number of tables. The flush MUST be
|
||||
idempotent and additive (never clobber a prior `ai:` description; preserve `db:` and
|
||||
external keys via the existing merge). Finished tables MUST remain durable even if the
stage never completes because it is wedged, killed, or interrupted. A failed/skipped
relationship/embedding stage or a killed descriptions stage MUST NOT lose the
descriptions already flushed.
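
The "idempotent and additive" property of the flush can be illustrated with a small sketch. It is not `mergeDescriptionsPreservingExternal` itself, whose exact behavior is not shown in this spec. `mergeAdditive` is a hypothetical helper that assumes descriptions are stored as a map from source tag (`ai`, `db`, or an external key) to text.

```typescript
// Key is a source tag such as "ai", "db", or an external name (assumed shape).
type Descriptions = Record<string, string>;

// Adds incoming entries only where no entry exists yet, so prior values survive.
function mergeAdditive(existing: Descriptions, incoming: Descriptions): Descriptions {
  const merged: Descriptions = { ...existing };
  for (const source of Object.keys(incoming)) {
    if (!Object.prototype.hasOwnProperty.call(merged, source)) {
      merged[source] = incoming[source];
    }
  }
  return merged;
}
```

Because an existing key is never overwritten, applying the same batch twice yields the same result as applying it once, which is what lets a flush be repeated safely after an interruption.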
|
||||
|
||||
### 4. Resume re-enriches only the unfinished tables
|
||||
|
||||
On a resumed ingest with an unchanged `inputHash`, the descriptions stage MUST
|
||||
re-issue LLM description calls **only for tables not already enriched**, deriving the
|
||||
already-enriched set from the on-disk artifacts (the `inputHash`-tagged durable
|
||||
record / the manifest's `ai:` descriptions), and MUST return the merged full result
|
||||
so downstream stages behave as on a fresh run. A changed `inputHash` (schema or
|
||||
enrichment settings changed) MUST recompute the stage as today (spec 19's
|
||||
inputHash-gated semantics preserved). The durable record MAY be recreated without a
|
||||
migration bridge if its on-disk shape changes (it is regenerable local/artifact
|
||||
state).
|
||||
|
||||
### 5. No regression for small or uninterrupted ingests
|
||||
|
||||
A small or single-run ingest that is never interrupted MUST produce the same
|
||||
artifacts (descriptions, manifest, embeddings) as today. The incremental flush MUST
|
||||
be idempotent with the spec-19 checkpoint and the terminal write (descriptions
|
||||
survive the embeddings/relationship rewrites). The bounded-call boundary MUST NOT
|
||||
change a normal successful enrichment's output, only how a wedged call is terminated.
|
||||
|
||||
### 6. A skipped table costs one description, never the stage's output
|
||||
|
||||
A descriptions stage that **completes** with one or more skipped/aborted tables MUST
|
||||
persist every successfully-generated description (the durable record and the `ai:`
|
||||
manifest entries) and MUST mark the stage completed (a `local_scan_enrichment_stages`
|
||||
row, embeddings + downstream proceeding) — it MUST NOT discard the whole stage's
|
||||
output because some tables were skipped. No single table's failure may reject the
|
||||
per-table fan-out: a per-table failure degrades to one missing description (left for
|
||||
the resume remainder), not a failed stage. A genuine `ctx.signal` cancellation is the
|
||||
only thing that fails the stage (so it resumes), and even then the already-flushed
|
||||
descriptions remain durable.
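
Requirement 6 can be sketched as a fan-out in which each per-table worker catches its own failure. The names `describeAll`, `describeTable`, and `isCancelled` are hypothetical, the concurrency limit is omitted for brevity, and only the `enrichment_failed` warning name and the `null` placeholder come from this spec.

```typescript
interface TableResult {
  table: string;
  description: string | null;
  warning?: "enrichment_failed";
}

async function describeAll(
  tables: string[],
  describeTable: (table: string) => Promise<string>,
  isCancelled: () => boolean,
): Promise<TableResult[]> {
  return Promise.all(
    tables.map(async (table): Promise<TableResult> => {
      try {
        return { table, description: await describeTable(table) };
      } catch (error) {
        // A genuine cancellation still fails the stage so that it resumes later.
        if (isCancelled()) throw error;
        // Any other failure degrades to one missing description.
        return { table, description: null, warning: "enrichment_failed" };
      }
    }),
  );
}
```

Since no worker promise rejects for an ordinary failure, `Promise.all` cannot discard the other tables' results; the failed table stays `null` and is picked up by the next resume.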
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- **Enforced timeout (subprocess backend):** a subprocess-backed enrichment call
|
||||
that hangs past the deadline is terminated within the deadline plus a small grace;
|
||||
ktx's await settles, the spawned child **and a grandchild it spawned** both exit
|
||||
(verified via the child's `exit`, not left spinning), and the table is skipped with
|
||||
an `enrichment_timeout` warning. The stage advances rather than wedging. A
|
||||
`ctx.signal` abort terminates the same way.
|
||||
- **HTTP backend unaffected:** an HTTP-backed enrichment call still cancels promptly
|
||||
on abort via the existing native path, with no subprocess involved.
|
||||
- **Default + policy:** the default timeout is 120s and a timeout is not retried (one
|
||||
wedge = one timeout); a transient error is still retried up to the attempt limit.
|
||||
- **Graceful skip persists the rest:** a stage that completes with one table failing
|
||||
(timeout, exhausted retries, or an unexpected throw) still writes the other N−1
|
||||
descriptions to the durable record + `ai:` `_schema` and marks the stage completed
|
||||
(a `local_scan_enrichment_stages` row exists); the failed table is a single `null`
|
||||
description left for the resume remainder, not a discarded stage.
|
||||
- **Incremental durability:** interrupting the descriptions stage after K of N tables
|
||||
leaves those K durable on disk (raw artifact + `ai:` descriptions in `_schema`),
|
||||
with no completed `local_scan_enrichment_stages` row.
|
||||
- **Resume does not re-spend:** re-running the interrupted ingest (unchanged
|
||||
`inputHash`, fresh `runId`) issues **no** LLM description calls for the K already-
|
||||
enriched tables and enriches only the remaining N−K; the returned result is the
|
||||
full merged set. A changed `inputHash` recomputes the stage.
|
||||
- **No regression:** a small uninterrupted ingest yields identical artifacts and the
|
||||
same descriptions/embeddings output as today; the incremental flush is idempotent
|
||||
with the checkpoint and terminal writes.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Incremental persistence of embeddings.** Embeddings are fast and already covered
|
||||
by spec 19's stage-level cross-run resume; the dominant loss is descriptions. This
|
||||
spec scopes incremental persistence to the `descriptions` stage.
|
||||
- **Changing the timeout default, retry counts, or adding a timeout config knob.**
|
||||
D2 keeps the moderate default and the single env tuning seam.
|
||||
- **Routing HTTP backends through the subprocess boundary.** Their native abort
|
||||
already meets the contract; a subprocess would add cost and a credential-passing
|
||||
surface for no benefit.
|
||||
- **A persistent subprocess pool.** One short-lived ktx child per subprocess-backed
|
||||
call; no pool, no respawn/job-tracking (D1).
|
||||
- **Re-implementing spec 16 (per-query deadline) or spec 19 (relationship-stage
|
||||
budget, cost-boundary checkpoint, cross-run stage resume).** This spec composes
|
||||
above them: spec 16 bounds individual queries, spec 19 makes whole stages durable
|
||||
and resumable, and this spec hardens the per-table enrichment call's termination
|
||||
and adds within-stage description durability.
|
||||
- **A general per-stage incremental-flush framework.** The incremental flush is
|
||||
specifically the descriptions stage; it is not a generic abstraction over every
|
||||
enrichment stage.
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns the
|
||||
design.
|
||||
|
||||
- **Bounded per-table call (gap #1)** — `context/scan/description-generation.ts`,
|
||||
`KtxDescriptionGenerator.generateBatchedTableDescriptions` (the bounded+retry block
|
||||
~760–866; `enrichTimeoutMs` ~769, `enrichAttempts` ~770, `KtxAbortedError` on
|
||||
timeout ~811, `enrichment_timeout`/`enrichment_failed` warnings ~858). The retry
|
||||
loop stays here; each attempt runs through the kill boundary for subprocess
|
||||
backends.
|
||||
- **LLM runtime + backend selection** — `context/llm/runtime-port.ts`
|
||||
(`KtxLlmRuntimePort.generateObject`, `abortSignal` on the input),
|
||||
`context/llm/local-config.ts` (~127–163, selects `CodexKtxLlmRuntime` /
|
||||
`ClaudeCodeKtxLlmRuntime` / `AiSdkKtxLlmRuntime`), `context/project/config.ts`
|
||||
(`KTX_LLM_BACKENDS`). The "owns a subprocess" property should be declared by the
|
||||
backend/runtime (e.g. on the runtime interface), not inferred from a name list.
|
||||
- **Subprocess backends** — `context/llm/codex-runtime.ts` +
|
||||
`context/llm/codex-sdk-runner.ts` (`CodexSdkCliRunner.runStreamed`, the SDK's
|
||||
`spawn(executable, args, { signal })` is in `@openai/codex-sdk`),
|
||||
`context/llm/claude-code-runtime.ts` (`collectResult` ~275–322, the `interrupt()`
|
||||
abort path). These are what the kill boundary must wrap and tree-kill.
|
||||
- **Reuse spec 16's mechanism (extended to group/tree kill)** —
|
||||
`connectors/sqlite/read-query-child.ts` (the forked child shape) and
|
||||
`connectors/sqlite/connector.ts` `runReadQueryOffProcess` (~292–350: `fork`,
|
||||
deadline timer, `child.kill('SIGKILL')`, `settle()`, the `.js`-if-exists-else-`.ts`
|
||||
child-URL resolver ~25–27, knip dynamic entry). Gap #1 differs by making the child a
|
||||
process-group leader and killing the **group/tree** (the SDK grandchild), portably.
|
||||
Abort helpers: `context/core/abort.ts` (`createAbortError`, `throwIfAborted`,
|
||||
`linkAbortSignal`). Note the new child hosts an LLM runtime, so the implementer owns
|
||||
passing the backend config/credentials to it (env/IPC) and serializing the
|
||||
structured result back.
|
||||
- **Incremental persistence (gap #2)** —
|
||||
`context/scan/local-enrichment.ts` (`generateDescriptions` ~279–352: the per-table
|
||||
`pLimit` fan-out and the in-memory `updates` accumulation; `runEnrichmentStage`
|
||||
~413/~421–474 with `findCompletedStage` ~427 and `saveCompletedStage`; the
|
||||
`onCheckpoint` hook ~598–612). Make `generateDescriptions` resume-aware: read the
|
||||
existing record, skip already-enriched tables, flush per batch, return the merged
|
||||
full set.
|
||||
- **Artifact writer + additive merge** — `context/scan/local-enrichment-artifacts.ts`
|
||||
(`writeLocalScanEnrichmentCheckpoint` ~351–379, `writeEnrichmentDescriptionArtifacts`
|
||||
with `descriptions.json` ~316, `writeLocalScanManifestShards` ~270–308,
|
||||
`loadExistingManifestState` ~196–253, `tableDescription`/`columnDescription`
|
||||
~75–105); `context/scan/manifest.ts` (`mergeDescriptionsPreservingExternal` ~96–115,
|
||||
`SCAN_MANAGED_DESCRIPTION_KEYS`). Factor a per-batch flush that reuses the additive
|
||||
description/manifest write; tag the durable record with `inputHash`.
|
||||
- **Stage store + input hash** —
|
||||
`context/scan/sqlite-local-enrichment-state-store.ts` (`STAGES_TABLE =
|
||||
'local_scan_enrichment_stages'`, PK `(connection_id, stage, input_hash)`,
|
||||
`findCompletedStage`, `saveCompletedStage`),
|
||||
`context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78). The
|
||||
whole-stage row stays; the `inputHash` is the gate for the resume-skip set.
|
||||
- **Scan driver** — `context/scan/local-scan.ts` (the `onCheckpoint` wiring and the
|
||||
terminal `writeLocalScanEnrichmentArtifacts`), and `KtxScanContext.signal`
|
||||
(`context/scan/types.ts`) which the kill boundary must honor.
|
||||
- **Tests** — gap #1: a fake subprocess-backed runtime whose child hangs (ignores
|
||||
SIGTERM) is killed at a tiny test-seam deadline; assert the await settles within
|
||||
deadline+grace, the child and a spawned grandchild both exit, and the table is
|
||||
skipped with `enrichment_timeout`; assert an HTTP-backed abort still settles via the
|
||||
native path. gap #2: interrupt the descriptions stage after K/N tables (a flush
|
||||
seam), assert the K are durable (raw artifact + `ai:` in `_schema`) with no completed
|
||||
stage row; a resume with matching `inputHash` issues no LLM calls for the K and
|
||||
enriches only N−K; a changed `inputHash` recomputes; regression: a small
|
||||
uninterrupted ingest yields identical artifacts.
|
||||
- After implementing, rebuild and re-link so the playground picks it up:
|
||||
`pnpm run build && pnpm run link:dev`.
|
||||
|
||||
## Benchmark context (motivation, not a requirement)
|
||||
|
||||
Surfaced during the Spider 2.0-Lite **BigQuery** ingest (2026-06-23, codex enrichment
|
||||
backend). Re-enriching the giant public datasets, `covid19_usa` wedged at 268/285 for
|
||||
41+ minutes on 2 hung 252-column tables; the 30-min per-table `AbortSignal` timeout
|
||||
never killed the hung codex children, and because descriptions checkpoint only at
|
||||
stage completion, the 283 already-enriched tables were unrecoverable — the operator
|
||||
had to kill, cache-bust, and re-ingest the database from scratch (with a short timeout
|
||||
as a stopgap). The benchmark merely exercised a large/wide multi-dataset ingest at
|
||||
scale; the gaps and the fixes are generic production hygiene for any agent that
|
||||
enriches a real warehouse with a subprocess LLM backend. Do not encode any benchmark
|
||||
specifics in the implementation.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Implemented on branch `write-feature-spec-wiki`. Both gaps shipped; all acceptance
|
||||
criteria are covered by tests. The full ktx test surface for the touched code is
|
||||
green (the only failures in the whole suite are 3 pre-existing assertions in
|
||||
`test/skills/analytics-skill-content.test.ts` about the analytics SKILL.md markdown
|
||||
— an unrelated subsystem this change does not touch).
|
||||
|
||||
### Gap #1 — enforced timeout for subprocess backends
|
||||
|
||||
- **Structural property on the runtime, not a name list.** Added
|
||||
`subprocessForkSpec(): SubprocessRuntimeForkSpec | null` to `KtxLlmRuntimePort`
|
||||
(`context/llm/runtime-port.ts`). `CodexKtxLlmRuntime` / `ClaudeCodeKtxLlmRuntime`
|
||||
return a serializable `{ backend, projectDir, modelSlots }`; `AiSdkKtxLlmRuntime`
|
||||
(and the deterministic stub) return `null`. The per-table call branches on this,
|
||||
never on a vendor list (D1).
|
||||
- **Shared structured core.** Both subprocess runtimes gained
|
||||
`generateStructuredJson(jsonSchema)` (returns the raw object; the caller
|
||||
Zod-validates). Their existing `generateObject` was refactored to delegate to the
|
||||
same streaming core, so structured generation has one implementation.
|
||||
- **Kill boundary.** New `context/llm/subprocess-generate-object.ts`
|
||||
(`runGenerateObjectInSubprocess`, `KtxSubprocessDeadlineError`) forks a ktx-owned
|
||||
child (`subprocess-generate-object-child.ts`) **detached** (process-group leader);
|
||||
the SDK's model binary inherits the group. On the deadline or `ctx.signal`, ktx
|
||||
tree-kills the group with `SIGKILL` (`process.kill(-pid, …)` on POSIX,
|
||||
`taskkill /pid <pid> /T /F` on Windows) and rejects promptly; on success the raw
|
||||
output is Zod-validated. Credentials reach the child via inherited `process.env`
|
||||
(the runtimes re-derive their allowlisted env), never over IPC.
|
||||
- **Wiring.** `KtxDescriptionGenerator.generateBatchedTableDescriptions`
|
||||
(`context/scan/description-generation.ts`) routes each retry attempt through the
|
||||
boundary for subprocess backends and keeps the native `AbortSignal` → `fetch`
|
||||
path for HTTP backends. A fired deadline maps to the existing
|
||||
`KtxAbortedError`/`enrichment_timeout` no-retry policy (one wedge = one timeout);
|
||||
default stays 120s (D2).
|
||||
- **Tests.** `test/context/llm/subprocess-generate-object.test.ts` forks a real
|
||||
fixture child that spawns a grandchild and ignores SIGTERM, and asserts the
|
||||
deadline/abort tree-kills both (the grandchild PID is reaped) and the await
|
||||
settles within deadline+grace; plus success / schema-failure / child-error paths.
|
||||
`test/context/scan/description-generation.test.ts` adds the generator-level
|
||||
timeout-skip and the "HTTP backend spawns no child" cases.
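
The structural branch can be sketched as below. Only the method name `subprocessForkSpec` and the field names `backend`, `projectDir`, and `modelSlots` come from the notes above. The field types, `RuntimeLike`, `chooseEnforcement`, and the example values are assumptions for illustration.

```typescript
interface SubprocessRuntimeForkSpec {
  backend: string;
  projectDir: string;
  modelSlots: Record<string, string>; // assumed shape
}

// Hypothetical slice of the runtime port: only the structural property matters here.
interface RuntimeLike {
  subprocessForkSpec(): SubprocessRuntimeForkSpec | null;
}

type Enforcement =
  | { via: "kill-boundary"; spec: SubprocessRuntimeForkSpec }
  | { via: "native-abort" };

// Branch on what the runtime declares about itself, never on a provider name.
function chooseEnforcement(runtime: RuntimeLike): Enforcement {
  const spec = runtime.subprocessForkSpec();
  return spec === null ? { via: "native-abort" } : { via: "kill-boundary", spec };
}

// Example runtimes with placeholder values.
const subprocessBacked: RuntimeLike = {
  subprocessForkSpec: () => ({ backend: "codex", projectDir: "/tmp/project", modelSlots: { default: "example-model" } }),
};
const httpBacked: RuntimeLike = { subprocessForkSpec: () => null };
```

A new subprocess-backed runtime would opt in by returning a spec, with no change to the call site.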
|
||||
|
||||
### Gap #2 — incremental descriptions persistence + resume
|
||||
|
||||
- **Durable record + resume store.** `createKtxScanDescriptionResumeStore`
|
||||
(`context/scan/local-enrichment-artifacts.ts`) writes the descriptions-so-far to
|
||||
a durable record (inputHash-tagged) and **only the manifest shards that gained a
|
||||
table this batch** (new `onlyChangedTableNames` filter on
|
||||
`writeLocalScanManifestShards`, additive merge preserved). `load(inputHash)`
|
||||
returns the prior enriched set only on a matching inputHash (D3).
|
||||
- **Resume-aware fan-out.** `generateDescriptions` (`context/scan/local-enrichment.ts`)
|
||||
loads the prior record, skips already-enriched tables, enriches only the
|
||||
remainder, flushes every `DESCRIPTION_FLUSH_EVERY` (10) completed tables (a single
|
||||
in-flight flush; the final force-flush drains the tail), and returns the full
|
||||
merged set (recovered + fresh + `null` for still-failed, so failures are retried,
|
||||
D4). Wired through `local-scan.ts` (store constructed when not `--dry-run`).
|
||||
- **Graceful-skip backstop (requirement 6).** The per-table worker wraps the call in
|
||||
a try/catch: any non-cancellation failure degrades to one `null` description + an
|
||||
`enrichment_failed` warning and the fan-out continues, so no single table can
|
||||
reject `Promise.all` / abort the stage. This makes the "one skipped table costs one
|
||||
description, not the stage's output" guarantee live at the stage boundary
|
||||
(`generateBatchedTableDescriptions` already degrades its own failures; this is the
|
||||
explicit backstop). A `ctx.signal` cancellation still propagates (the stage fails
|
||||
and resumes), and the already-flushed descriptions stay durable. This closes the
|
||||
field bug where a completed-with-skips stage persisted 0 descriptions / 0 stage rows.
|
||||
- **Deviation from the spec's literal path (necessary correction).** The durable
|
||||
record lives at a **stable, non-`syncId`** path
|
||||
(`raw-sources/<connectionId>/live-database/enrichment-progress/descriptions.json`),
|
||||
not the `syncId`-scoped `…/<syncId>/enrichment/descriptions.json` the spec named.
|
||||
Reason: a from-scratch interruption (the incident's exact case — no prior
|
||||
*completed* run) gets a **fresh `syncId`** on the next run
|
||||
(`buildSyncId` in `context/ingest/local-stage-ingest.ts`), so a `syncId`-scoped
|
||||
record would be unreachable on resume. The manifest is already at the stable
|
||||
per-connection scope (`semantic-layer/<connectionId>/_schema/`), so this keeps the
|
||||
resume source at the same stable scope. The `syncId`-scoped `enrichment/descriptions.json`
|
||||
debug artifact written by the terminal/checkpoint writers is unchanged.
|
||||
- **Tests.** `test/context/scan/description-resume.test.ts` drives
|
||||
`runLocalScanEnrichment` against a real git-backed project: a fresh run flushes a
|
||||
durable record + `ai:` manifest descriptions; a matching-`inputHash` resume issues
|
||||
zero LLM calls and returns the full merged set; a partial record re-enriches only
|
||||
the missing tables; a changed `inputHash` recomputes; the changed-shard filter
|
||||
rewrites only the affected shard; and (requirement 6) a run where one table fails
|
||||
still persists the other tables (durable record + `ai:`) and **completes the stage**
|
||||
(a completed `local_scan_enrichment_stages` row), with the failed table left `null`
|
||||
for resume.
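
The flush cadence described above (every 10 completed tables, plus a final force-flush for the tail) can be sketched synchronously. `BatchFlusher` is a hypothetical class; the real code is asynchronous and keeps a single in-flight flush, which this simplified sketch does not model.

```typescript
type Entry = [table: string, description: string];

class BatchFlusher {
  private pending: Entry[] = [];
  private readonly flushEvery: number;
  private readonly write: (batch: Entry[]) => void;

  constructor(flushEvery: number, write: (batch: Entry[]) => void) {
    this.flushEvery = flushEvery;
    this.write = write;
  }

  // Record one completed table and flush once a full batch has accumulated.
  add(table: string, description: string): void {
    this.pending.push([table, description]);
    if (this.pending.length >= this.flushEvery) this.flush();
  }

  // Force-flush drains the tail that never reached a full batch.
  flush(): void {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.write(batch);
  }
}
```

With a batch size of 10, an interruption can lose at most the tables completed since the last flush, which bounds the at-risk window.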
|
||||
|
||||
### Incidental
|
||||
|
||||
- Updated a stale assertion in `description-generation.test.ts` (the "does not run
per-column fallback…" case) from 1 expected call to `3`, matching the retry policy
added in commit `01f63380` (D2 / acceptance: a transient error retries up to the
attempt limit). The HTTP path is unchanged; the assertion simply predated the retry.
|
||||
- No new `ktx.yaml` config field or runtime knob was added (D2). The rate-limit
|
||||
governor is not wired into the scan-enrichment path, so the kill-boundary child
|
||||
loses no pacing.
|
||||
- Rebuilt and re-linked (`pnpm run build && pnpm run link:dev`); the child compiles
|
||||
to `dist/context/llm/subprocess-generate-object-child.js`.
|
||||
|
|
@ -1,567 +0,0 @@
|
|||
# Selective enrichment stages (`--stages`) + per-stage cache keys
|
||||
|
||||
> Refined spec. Intake draft: `todo/21-selective-enrichment-stages.md`.
|
||||
>
|
||||
> **Scope: make the three enrichment stages independently invalidatable and
|
||||
> independently re-runnable.** Today one coarse cache key gates all three stages,
|
||||
> so changing any one stage's inputs re-pays for every stage — most painfully the
|
||||
> expensive per-table `descriptions`. And there is no CLI surface to re-run a
|
||||
> chosen subset. This spec splits the key per stage (so a change invalidates only
|
||||
> the stage it touched) and adds a `--stages` flag that force-re-runs a chosen
|
||||
> subset while preserving the others. It is the operability follow-on to spec 19
|
||||
> (durable, cross-run stage resume) and spec 20 (resilient, per-table-resumable
|
||||
> descriptions); it composes with both rather than replacing them.
|
||||
|
||||
## Problem
|
||||
|
||||
Enrichment has three stages — **`descriptions`** (one paid LLM call per table),
|
||||
**`embeddings`** (sentence-transformer vectors over the schema + descriptions),
|
||||
**`relationships`** (FK/join detection, optionally LLM-proposed). After specs 19
|
||||
and 20 these stages are durable and resumable, but they are still **coupled for
|
||||
cache invalidation and unreachable for selective re-run**. Three facts make a
|
||||
targeted re-run impossible without a full, expensive re-enrich.
|
||||
|
||||
### 1. One coarse cache key gates all three stages
|
||||
|
||||
`runLocalScanEnrichment` (`context/scan/local-enrichment.ts:611`) computes a single
|
||||
`inputHash` from `{ snapshot, mode, detectRelationships, providerIdentity,
|
||||
relationshipSettings }` and every stage reuses it — `descriptions` (~`:642`),
|
||||
`embeddings` (~`:673`), `relationships` (~`:729`). `providerIdentity` itself
|
||||
(`localScanProviderIdentity`, `local-scan.ts:241–255`) is one blob conflating the
|
||||
description LLM identity, the embedding model/dimensions/batch size, **and** the
|
||||
whole relationship config — and it redundantly re-encodes `mode` and
|
||||
`relationships`, which the coarse hash already mixes in.
|
||||
|
||||
The consequence: flipping `scan.relationships.llmProposals`, switching the LLM
|
||||
backend, or upgrading the embeddings model changes the **one** hash and so
|
||||
invalidates **all three** stages. ktx then re-runs the expensive per-table
|
||||
`descriptions` even though they did not conceptually change. The headline cost of
|
||||
the system — paid LLM description calls — is thrown away on any unrelated
|
||||
enrichment-config edit.
|
||||
|
||||
### 2. No CLI surface to select stages
|
||||
|
||||
The enrichment internals already support a relationships-only path
|
||||
(`KtxScanMode` `'relationships'`, `types.ts:12` — `descriptions`/`embeddings` are
|
||||
gated on `mode === 'enriched'` at `local-enrichment.ts:632`, while
|
||||
`shouldDetectRelationships` admits `mode === 'relationships'` at `:624–626`). But
|
||||
`ktx ingest` hardcodes `mode: 'enriched'` (`public-ingest.ts:973`) and exposes no
|
||||
flag to select a subset (`ingest-commands.ts:26–49` — only `--no-query-history`
|
||||
and friends). The relationships-only capability is built but unreachable, and there
|
||||
is no way at all to ask for "descriptions only" or "embeddings only."
|
||||
|
||||
### 3. The foundation for "touch one stage, keep the rest" already exists
|
||||
|
||||
The per-stage store `local_scan_enrichment_stages` is keyed
|
||||
`(connection_id, stage, input_hash)` (spec 19) and the descriptions write is
|
||||
additive — `mergeDescriptionsPreservingExternal` (`manifest.ts`) and
|
||||
`loadExistingManifestState` (`local-enrichment-artifacts.ts`) preserve prior `ai:`,
|
||||
`db:`, and external description keys on rewrite; spec 20's per-table resume record
|
||||
(`createKtxScanDescriptionResumeStore`, `local-enrichment-artifacts.ts:286`) already
|
||||
re-issues LLM calls only for the still-failed tables. So "recompute one stage, leave
|
||||
the others byte-for-byte" needs only two missing pieces: **per-stage key
|
||||
granularity** and a **CLI surface** to select stages.
|
||||
|
||||
**Requirement:** let an operator re-run a chosen subset of enrichment stages on an
|
||||
already-ingested connection, recomputing only those stages, preserving the others'
|
||||
artifacts untouched, and **re-paying only for what genuinely changed** — never
|
||||
re-running the costly `descriptions` because an unrelated stage's inputs moved.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
Any team running ktx in production maintains its semantic layer over time: they
|
||||
improve the description prompt or switch the description LLM, upgrade the embeddings
|
||||
model, or turn on LLM-proposed joins. Today each of those forces a **full re-enrich
|
||||
of every connection** — re-running the expensive per-table descriptions even when
|
||||
only embeddings or relationships changed. Two routine operations should be cheap and
|
||||
targeted:
|
||||
|
||||
- **"Re-embed everything on the new model."** Swapping the embeddings model should
|
||||
recompute only embeddings, leaving descriptions and joins on disk.
|
||||
- **"Backfill joins now that `llmProposals` is on."** Enabling LLM-proposed
|
||||
relationships should recompute only relationships.
|
||||
|
||||
And one operation needs an explicit trigger because no input changed:
|
||||
|
||||
- **"These descriptions came out thin — re-run them with a longer timeout."** A
|
||||
connection whose description coverage is poor because tables timed out (same
|
||||
snapshot, same LLM, so the hash is unchanged) should be re-runnable on demand,
|
||||
cheaply retrying only the tables that failed.
|
||||
|
||||
This is core operability for a long-lived ingestion product and is wholly
|
||||
independent of any benchmark.
|
||||
|
||||
## Design decisions (resolved during refinement)
|
||||
|
||||
These resolve ambiguities the intake draft left open. They constrain the
|
||||
implementer; the exact code is theirs (requirement-level, per the specs README).
|
||||
|
||||
### D1 — Split the coarse hash into three per-stage input hashes
|
||||
|
||||
Replace the single `computeKtxScanEnrichmentInputHash` call with **per-stage** hash
|
||||
computation, each keyed on only that stage's own inputs. Decompose the
|
||||
`localScanProviderIdentity` blob into the slices each stage actually depends on:
|
||||
|
||||
- **`descriptions`** → `{ snapshot, llmIdentity }`, where `llmIdentity` is the
|
||||
description-LLM identity (`llm.models.default`, `baseUrlConfigured`). **Not** the
|
||||
embedding model/dimensions/batch size, **not** relationship settings.
|
||||
- **`embeddings`** → `{ snapshot, embeddingIdentity, descriptionDigest }`, where
|
||||
`embeddingIdentity` is `{ model, dimensions, batchSize }` and `descriptionDigest`
|
||||
is a stable digest of the resolved description text the embeddings consume (the
|
||||
same text `buildEmbeddings` → `buildKtxColumnEmbeddingText` feeds the model,
|
||||
`local-enrichment.ts:466–486`, `embedding-text.ts:17–44`). This content-addresses
|
||||
embeddings on their real upstream (D4).
|
||||
- **`relationships`** → `{ snapshot, relationshipSettings (incl. `llmProposals` and
|
||||
`detectionBudgetMs`), llmIdentity }`. **Not** the description content (decision X,
|
||||
D5), **not** the embedding identity.
|
||||
|
||||
`mode` and `detectRelationships` drop out of the per-stage inputs: each stage
|
||||
produces output under exactly one mode, so the stage name already scopes that, and
|
||||
re-mixing `mode` only re-couples the keys. After the split, flipping `llmProposals`
|
||||
invalidates only `relationships`; swapping the embeddings model invalidates only
|
||||
`embeddings`; switching the description LLM invalidates only `descriptions`.
|
||||
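The split described in D1 can be sketched roughly as follows. This is a minimal illustration, not the ktx implementation: the type shapes, the `stableStringify` helper, and the function names are assumptions made for the example, and SHA-256 over sorted-key JSON is just one reasonable way to get a stable key.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shapes; the real ktx types may differ.
type LlmIdentity = { model: string; baseUrlConfigured: boolean };
type EmbeddingIdentity = { model: string; dimensions: number; batchSize: number };
type RelationshipSettings = { llmProposals: boolean; detectionBudgetMs: number };

// Serialize with sorted object keys so equal inputs always produce equal text.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const body = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(record[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Hash one stage's own input slice; the stage name scopes the key.
function hashStage(stage: string, inputs: unknown): string {
  return createHash("sha256")
    .update(stage)
    .update("\0")
    .update(stableStringify(inputs))
    .digest("hex");
}

function descriptionsStageHash(snapshot: unknown, llmIdentity: LlmIdentity): string {
  return hashStage("descriptions", { snapshot, llmIdentity });
}

function embeddingsStageHash(
  snapshot: unknown,
  embeddingIdentity: EmbeddingIdentity,
  descriptionDigest: string,
): string {
  return hashStage("embeddings", { snapshot, embeddingIdentity, descriptionDigest });
}

function relationshipsStageHash(
  snapshot: unknown,
  relationshipSettings: RelationshipSettings,
  llmIdentity: LlmIdentity,
): string {
  return hashStage("relationships", { snapshot, relationshipSettings, llmIdentity });
}
```

Because each function only sees its own slice, flipping `llmProposals` can change the `relationships` key but has no way to reach the `descriptions` or `embeddings` keys.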
|
||||
The per-stage hash becomes the key everywhere a single hash is used today: the
|
||||
`local_scan_enrichment_stages` lookup/save in `runEnrichmentStage`, and the spec-20
|
||||
descriptions resume record (`createKtxScanDescriptionResumeStore`), which is now
|
||||
keyed on the **descriptions** stage's hash — so changing the embedding model no
|
||||
longer busts the descriptions resume record, a strict improvement.
|
||||
|
||||
> **No migration bridge.** The stage store and the descriptions resume record are
|
||||
> disposable local `.ktx` state (regenerable from a fresh ingest). The new per-stage
|
||||
> keys simply miss the old coarse-keyed rows, forcing one full re-enrich on the next
|
||||
> run after upgrade. Recreate/ignore stale-shaped records with no compatibility
|
||||
> shim, consistent with specs 19/20 and ktx's no-backward-compatibility policy.
|
||||
|
||||
### D2 — `--stages <comma-list>` selects a subset; one gate, no new mode
|
||||
|
||||
Add `ktx ingest [connectionId] --stages <comma-list>`, a non-empty subset of
|
||||
`descriptions,embeddings,relationships`. Plural because it takes a **set**:
|
||||
`--stages relationships` and `--stages descriptions,embeddings` both read naturally,
|
||||
and the plural signals "list expected." Flag absent = all three (today's behavior).
|
||||
|
||||
A Commander custom parser validates each name against the canonical stage registry
|
||||
and parses into an ordered, de-duplicated set. **An unknown or empty stage name is a
|
||||
hard `InvalidArgumentError`** — never silently ignored. The set threads CLI →
|
||||
`runKtxPublicIngest` (`KtxScanArgs`) → `runLocalScan` → `runLocalScanEnrichment`.
|
||||
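A parser with the validation behavior described above might look like the sketch below. To keep it dependency-free, `InvalidArgumentError` is a local stand-in here; the spec calls for Commander's error class in the real parser. The function name `parseStagesOption` and the choice to return stages in registry order are assumptions for illustration.

```typescript
// Local stand-in so the sketch runs on its own; the real parser would throw
// Commander's InvalidArgumentError instead.
class InvalidArgumentError extends Error {}

// Canonical stage registry (order matters for the returned set).
const ENRICHMENT_STAGES = ["descriptions", "embeddings", "relationships"] as const;
type EnrichmentStage = (typeof ENRICHMENT_STAGES)[number];

function parseStagesOption(raw: string): EnrichmentStage[] {
  const selected = new Set<EnrichmentStage>();
  for (const name of raw.split(",").map((part) => part.trim())) {
    // An empty segment ("" or a trailing comma) fails the same check as an unknown name.
    if (!(ENRICHMENT_STAGES as readonly string[]).includes(name)) {
      throw new InvalidArgumentError(
        `Unknown enrichment stage "${name}". Expected one of: ${ENRICHMENT_STAGES.join(", ")}.`,
      );
    }
    selected.add(name as EnrichmentStage);
  }
  // De-duplicated, in registry order, independent of how the user typed the list.
  return ENRICHMENT_STAGES.filter((stage) => selected.has(stage));
}
```

Returning registry order rather than input order is a design choice for the sketch: it makes `--stages embeddings,descriptions` and `--stages descriptions,embeddings` indistinguishable downstream.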
|
||||
Inside enrichment the run set is **`(mode/provider-eligible stages) ∩ (selected
|
||||
stages)`** — a single gate. Each existing stage block additionally checks
|
||||
membership in the selected set (`descriptions`/`embeddings` already gate on
|
||||
`mode === 'enriched'` + providers; `relationships` on `shouldDetectRelationships`).
|
||||
This adds **no** new `KtxScanMode` variant and **no** second parallel selection
|
||||
path; `mode` keeps meaning "the connection's enrichment level," and `--stages` means
|
||||
"which of those stages to (re)compute this run." A named stage that cannot run
|
||||
because a prerequisite is absent (e.g. `--stages embeddings` with no embedding
|
||||
provider configured) MUST fail or warn clearly, never silently no-op.
|
||||
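The single gate can be illustrated with a small set intersection. This is a sketch under assumptions: `resolveStageGate` and the `blocked` field are hypothetical names, and how a blocked stage is reported (fail versus warn) is left to the implementer as the text above says.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";

const ALL_STAGES: readonly Stage[] = ["descriptions", "embeddings", "relationships"];

// eligible: stages the mode/provider configuration allows this run.
// selected: the parsed --stages set, or undefined when the flag is absent.
function resolveStageGate(
  eligible: ReadonlySet<Stage>,
  selected?: readonly Stage[],
): { run: Stage[]; blocked: Stage[] } {
  const requested = selected ?? ALL_STAGES; // flag absent = all three
  const run = requested.filter((stage) => eligible.has(stage));
  // Only explicitly named stages are reported as blocked, so a no-flag run
  // keeps today's behavior of quietly skipping ineligible stages.
  const blocked = selected ? selected.filter((stage) => !eligible.has(stage)) : [];
  return { run, blocked };
}
```

A caller would run every stage in `run` and turn each entry in `blocked` into a clear failure or warning, never a silent no-op.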
|
||||
> Rejected alternative — repurpose `mode` (`--stages relationships` →
|
||||
> `mode: 'relationships'`). It only expresses single-stage cases, leaves
|
||||
> `descriptions,embeddings` with no mode, and creates two ways to say "relationships
|
||||
> only." The explicit stage set is the one canonical selector.
|
||||
|
||||
### D3 — A named stage force-re-runs; per-table resume still avoids re-paying
|
||||
|
||||
Naming a stage in `--stages` carries the intent "recompute this," so a named stage
|
||||
**re-enters its `compute()`, bypassing the spec-19 completed-row short-circuit** in
|
||||
`runEnrichmentStage` (`local-enrichment.ts:538–547`). The spec-20 machinery still
|
||||
applies **inside** `compute()`:
|
||||
|
||||
- `--stages descriptions` re-enters `generateDescriptions`, which loads the
|
||||
per-table resume record and re-issues LLM calls **only for the still-null/failed
|
||||
tables** (when the descriptions hash is unchanged) — the "fill thin coverage with
|
||||
a longer `KTX_ENRICH_LLM_TIMEOUT_MS`" case, paying only for the gaps.
|
||||
- A genuine input change (e.g. switching the LLM → a new descriptions hash)
|
||||
invalidates the resume record and rebuilds the stage fully, as today.
|
||||
|
||||
Stages **not** named are skipped entirely — not run, not resumed — and their
|
||||
on-disk artifacts are left exactly as they are (additive write; preserve-others is
|
||||
already the behavior). The **no-flag default is unchanged**: all eligible stages
|
||||
run, the completed-row short-circuit is respected (spec-19 cross-run resume).
|
||||
|
||||
Behavior follows from the input (did you explicitly name the stage?), not the call
|
||||
path. A consequence to state plainly: `--stages descriptions,embeddings,relationships`
|
||||
is **not** identical to passing no flag — naming all three is the explicit "force a
|
||||
full enrichment recompute," whereas no flag is "ingest, resuming whatever is done."
|
||||
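The difference between a named stage and the no-flag default comes down to one branch around the completed-row lookup. The sketch below is illustrative only: `runStage`, the synchronous store interface, and the `reused` field are assumptions (the method names `findCompletedStage` / `saveCompletedStage` come from the spec, their signatures here do not).

```typescript
type StageRow = { stage: string; inputHash: string; output: unknown };

// Hypothetical, synchronous view of the stage store for illustration.
interface StageStore {
  findCompletedStage(stage: string, inputHash: string): StageRow | undefined;
  saveCompletedStage(row: StageRow): void;
}

function runStage<T>(
  store: StageStore,
  args: { stage: string; inputHash: string; forceRecompute: boolean; compute: () => T },
): { output: T; reused: boolean } {
  if (!args.forceRecompute) {
    // No-flag default: respect the completed-row short-circuit (spec-19 resume).
    const done = store.findCompletedStage(args.stage, args.inputHash);
    if (done) return { output: done.output as T, reused: true };
  }
  // Named stage: always re-enter compute(). Any finer-grained resume, such as
  // the per-table description record, lives inside compute() itself.
  const output = args.compute();
  store.saveCompletedStage({ stage: args.stage, inputHash: args.inputHash, output });
  return { output, reused: false };
}
```

With this shape, `forceRecompute` is simply "was this stage named in `--stages`", which keeps the behavior a function of the input rather than of the call path.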
|
||||
### D4 — Downstream staleness: one real edge, content-addressed, surfaced not silent
|
||||
|
||||
The only hard dependency between stages is **`descriptions → embeddings`**
|
||||
(embeddings embed the description text; `relationships` is decoupled, D5). Two
|
||||
mechanisms keep it correct without a hardcoded dependency table:
|
||||
|
||||
- **Self-healing via content-addressing.** Because the embeddings hash includes
|
||||
`descriptionDigest` (D1), re-running `descriptions` changes that digest, so a
|
||||
later embeddings run (or a full ingest) sees a hash miss and recomputes — stale
|
||||
embeddings can never silently persist across a future embeddings run. (Without
|
||||
this, the embeddings hash would be unchanged after a description edit and a later
|
||||
run would wrongly short-circuit on stale vectors.)
|
||||
- **Surfaced immediately.** After a selective run, for each **unselected** stage that
|
||||
has artifacts on disk, recompute its *current* per-stage hash from on-disk state
|
||||
and compare it to the stored completed-row hash; if they differ, emit a
|
||||
**recoverable `enrichment_stage_stale` warning** naming the stale stage and the
|
||||
cascade command (e.g. `--stages descriptions,embeddings`). This is derived from the
|
||||
system's own state — it also catches "you changed the embedding model in `ktx.yaml`
|
||||
but only ran `--stages descriptions`."
|
||||
|
||||
The run **never silently leaves a stale-but-unflagged downstream**, and **never
|
||||
silently auto-cascades** extra work — the operator is told and decides. Re-running
|
||||
`descriptions` does **not** flag `relationships` stale (D5).
|
||||
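The derived staleness check reduces to comparing two hashes per unselected stage. The following sketch assumes the hashes have already been computed elsewhere; `detectStaleStages`, the warning shape, and the message wording (including which cascade command it names) are hypothetical.

```typescript
type Stage = "descriptions" | "embeddings" | "relationships";

const ALL_STAGES: readonly Stage[] = ["descriptions", "embeddings", "relationships"];

interface StaleWarning {
  code: "enrichment_stage_stale";
  stage: Stage;
  recoverable: true;
  message: string;
}

function detectStaleStages(input: {
  selected: readonly Stage[];
  // Latest completed-row hash per stage; absent when the stage has no artifacts.
  storedHashes: Partial<Record<Stage, string>>;
  // Per-stage hashes recomputed from on-disk state after the selective run.
  currentHashes: Record<Stage, string>;
}): StaleWarning[] {
  const warnings: StaleWarning[] = [];
  for (const stage of ALL_STAGES) {
    if (input.selected.includes(stage)) continue; // it just ran, so it is fresh
    const stored = input.storedHashes[stage];
    if (stored === undefined) continue; // nothing on disk to go stale
    if (stored === input.currentHashes[stage]) continue; // still matches its inputs
    warnings.push({
      code: "enrichment_stage_stale",
      stage,
      recoverable: true,
      message: `Stage "${stage}" is stale; re-run with --stages ${stage} to refresh it.`,
    });
  }
  return warnings;
}
```

Since `relationships` does not include description content in its hash, a description change leaves its stored and current hashes equal, so it is never flagged by this check.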
|
||||
### D5 — Relationships are decoupled from description content, but still get it as context
|
||||
|
||||
`relationships` keys on `{ snapshot, relationshipSettings, llmIdentity }` and is
|
||||
**not** invalidated or stale-flagged by a description change (decision X). Rationale:
|
||||
relationships are the low-value, best-effort, expensive-to-probe stage (spec 19's
|
||||
own framing); coupling them to description content would make every routine
|
||||
description re-run also invalidate joins — re-opening the exact over-invalidation
|
||||
this spec exists to close.
|
||||
|
||||
Independently, a `relationships`-only run (descriptions stage not running this
|
||||
invocation) MUST **hydrate its working schema from the persisted on-disk enriched
|
||||
`_schema`** (AI descriptions + embeddings) so `llmProposals` runs with full
|
||||
description context, not raw column names. Today the relationship stage builds its
|
||||
schema from the bare snapshot (db comments only — `local-enrichment.ts:621,688,740`
|
||||
never merge the AI descriptions), so this also closes a latent gap: both the
|
||||
full-run and the relationships-only paths MUST feed `llmProposals` the
|
||||
best-available descriptions (fresh-this-run if `descriptions` ran, else on-disk) —
|
||||
behavior from inputs, not path.
|
||||
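"Behavior from inputs, not path" can be sketched as a single resolver that both paths call. The names below (`resolveDescriptions`, `buildColumnEvidence`, the `table.column` key format) are assumptions for illustration and not the ktx API.

```typescript
// Keyed by "table.column"; a hypothetical shape for the example.
type DescriptionMap = Map<string, string>;

// Prefer descriptions produced this run; otherwise load the persisted ones lazily.
function resolveDescriptions(
  freshThisRun: DescriptionMap | undefined,
  loadPrior: () => DescriptionMap,
): DescriptionMap {
  return freshThisRun ?? loadPrior();
}

interface ColumnEvidence {
  table: string;
  column: string;
  description?: string;
}

// Attach the resolved description text to each column handed to the proposal step.
function buildColumnEvidence(
  columns: ReadonlyArray<{ table: string; column: string }>,
  descriptions: DescriptionMap,
): ColumnEvidence[] {
  return columns.map(({ table, column }) => ({
    table,
    column,
    description: descriptions.get(`${table}.${column}`),
  }));
}
```

Passing `loadPrior` as a thunk means a full run that just generated descriptions never touches the disk for them, while a relationships-only run pays the load once.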
|
||||
### D6 — Scope: enrichment stages only, composable with existing flags
|
||||
|
||||
`--stages` controls only the three enrichment stages. It is **orthogonal to and
|
||||
composable with** the existing `--no-query-history` flag — a pure joins backfill
|
||||
across everything is `ktx ingest --all --stages relationships --no-query-history`.
|
||||
Schema introspection still runs (it is the hash substrate and the enrichment base,
|
||||
and it is cheap — no LLM). The stage-name namespace is built as a **registry** so it
|
||||
can later extend to the broader scan phases (schema / query-history / source /
|
||||
memory) and subsume the inconsistent negative `--no-query-history` flag — but that
|
||||
unification is **out of scope** here.
|
||||
|
||||
## Requirements
|
||||
|
||||
### 1. Per-stage input hashes
|
||||
|
||||
Each enrichment stage MUST key its cache lookup/save and (for `descriptions`) its
|
||||
resume record on a hash of only that stage's own inputs, per D1
|
||||
(`descriptions` ← snapshot + LLM identity; `embeddings` ← snapshot + embedding
|
||||
identity + a digest of the embedded description text; `relationships` ← snapshot +
|
||||
relationship settings + LLM identity). Changing one stage's inputs MUST invalidate
|
||||
**only** that stage. The single coarse `computeKtxScanEnrichmentInputHash` over
|
||||
`{ snapshot, mode, detectRelationships, providerIdentity, relationshipSettings }`
|
||||
MUST be removed in favor of per-stage computation. The stage store and the
|
||||
descriptions resume record MAY be recreated without a migration bridge (disposable
|
||||
local state).
|
||||
|
||||
### 2. `--stages` flag with strict validation
|
||||
|
||||
`ktx ingest` MUST accept `--stages <comma-list>`, a non-empty subset of
|
||||
`descriptions,embeddings,relationships`, defaulting (when absent) to all three. An
|
||||
unknown or empty stage name MUST be a hard parse error (`InvalidArgumentError`),
|
||||
never silently ignored. The selected set MUST thread through to enrichment and gate
|
||||
which stage blocks run as `(mode/provider-eligible) ∩ (selected)` — one gate, no new
|
||||
`KtxScanMode` variant, no second selection path. A selected stage whose prerequisite
|
||||
is missing MUST fail or warn clearly, not silently no-op.
|
||||
|
||||
### 3. Selecting a stage force-re-runs it; unselected stages are preserved
|
||||
|
||||
A stage named in `--stages` MUST re-enter its `compute()`, bypassing the
|
||||
completed-stage short-circuit, while still using the spec-20 per-table resume record
|
||||
so `descriptions` re-issues LLM calls only for still-failed tables (unchanged hash)
|
||||
and rebuilds fully on a changed hash. A stage **not** named MUST NOT run and MUST
|
||||
leave its on-disk artifacts untouched. The no-flag default MUST preserve spec-19
|
||||
cross-run resume (all eligible stages, completed-row short-circuit respected).
|
||||
|
||||
### 4. Downstream staleness is surfaced, never silent
|
||||
|
||||
After a selective run, the run MUST emit a recoverable `enrichment_stage_stale`
|
||||
warning for every **unselected** stage whose current per-stage hash no longer
|
||||
matches its stored completed-row hash (derived from on-disk state, naming the stage
|
||||
and the cascade command). The embeddings hash MUST include a digest of the embedded
|
||||
description text so a later embeddings run self-heals after a description change. The
|
||||
run MUST NOT silently leave a stale-but-unflagged downstream and MUST NOT silently
|
||||
auto-cascade. A description change MUST NOT stale-flag `relationships`.
|
||||
|
||||
### 5. Relationships run with description context
|
||||
|
||||
When the `relationships` stage runs without `descriptions` having run in the same
|
||||
invocation, it MUST hydrate its working schema from the persisted on-disk enriched
|
||||
`_schema` (AI descriptions + embeddings) so `llmProposals` has the same description
|
||||
context as a full enriched run, not bare column names. The full-run and
|
||||
relationships-only paths MUST feed `llmProposals` descriptions consistently.
|
||||
|
||||
### 6. No regression for normal ingests
|
||||
|
||||
A normal `ktx ingest` with no `--stages` flag MUST produce the same artifacts as
|
||||
today (descriptions, embeddings, manifest, relationships) and MUST preserve spec-19
|
||||
cross-run resume and spec-20 per-table description resume. The per-stage hash split
|
||||
MUST NOT change a normal run's output, only which stages a *changed* input
|
||||
invalidates.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- **Per-stage invalidation isolation:** flipping `scan.relationships.llmProposals`
|
||||
re-runs only `relationships` (descriptions + embeddings resolve from cache, no LLM
|
||||
description calls, no re-embedding); swapping the embeddings model re-runs only
|
||||
`embeddings`; switching the description LLM re-runs only `descriptions`. Verified by
|
||||
asserting no LLM description calls / no embed calls for the unaffected stages.
|
||||
- **Flag parse + validation:** `--stages relationships` and
|
||||
`--stages descriptions,embeddings` parse to the right set; `--stages foo`,
|
||||
`--stages` (empty), and `--stages descriptions,foo` each fail with a clear
|
||||
`InvalidArgumentError`.
|
||||
- **Resume-aware force-rerun:** on a connection whose `descriptions` stage completed
|
||||
with K failed/null tables (unchanged hash), `--stages descriptions` re-issues LLM
|
||||
calls for exactly those K tables and leaves the already-good descriptions
|
||||
untouched; the run completes and the K are now enriched. A changed descriptions
|
||||
hash instead rebuilds all tables.
|
||||
- **Preserve others:** after `--stages descriptions`, the on-disk `embeddings` and
|
||||
`relationships` artifacts are byte-stable (unselected stages did not run).
|
||||
- **Derived staleness warning:** after `--stages descriptions` changes the
|
||||
descriptions, the run emits `enrichment_stage_stale` for `embeddings` (its
|
||||
recomputed hash diverged) and does **not** emit it for `relationships` (decision
|
||||
X); a subsequent `--stages embeddings` clears it.
|
||||
- **Relationships context:** a `--stages relationships` run on an already-described
|
||||
connection feeds the on-disk AI descriptions into `llmProposals` (verified: the
|
||||
proposal prompt carries descriptions, not just column names).
|
||||
- **No regression:** a normal uninterrupted `ktx ingest` (no flag) yields identical
|
||||
artifacts and the same descriptions/embeddings/relationship output as today, with
|
||||
spec-19/20 resume intact.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Unifying `--stages` with the broader scan phases or `--no-query-history`.** The
|
||||
namespace is built to extend later; this spec ships only the three enrichment
|
||||
stages, composable with the existing query-history flag (D6).
|
||||
- **A new `KtxScanMode` variant or a second stage-selection path.** One gate,
|
||||
`(eligible) ∩ (selected)` (D2).
|
||||
- **Coupling `relationships` to description content** (decision X, D5). Improving
|
||||
descriptions does not invalidate or stale-flag joins.
|
||||
- **Auto-cascading downstream re-runs.** Staleness is surfaced as a warning; the
|
||||
operator chooses to cascade (D4).
|
||||
- **Capturing prompt/code-level description-prompt changes in the hash.** The
|
||||
descriptions hash keys on snapshot + LLM identity (config/model), not the prompt
|
||||
text; a pure prompt improvement that does not change a hash input will not
|
||||
force-rebuild already-good descriptions. Forcing that is out of scope — the
|
||||
operator changes a real input or selects the stage with a changed config.
|
||||
- **Re-implementing spec 19 (cross-run stage resume, completed-row store) or spec 20
|
||||
(per-table description resume, enforced timeout).** This spec composes above them:
|
||||
it splits the key those stages resume on and adds the CLI surface to select and
|
||||
force-re-run stages.
|
||||
- **A general per-phase incremental-flush framework.** The selection mechanism is the
|
||||
three enrichment stages; it is not a generic abstraction over every ingest phase.
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns the
|
||||
design.
|
||||
|
||||
- **Coarse hash → per-stage hashes** — `context/scan/enrichment-state.ts`
|
||||
(`computeKtxScanEnrichmentInputHash` `:78`, `ComputeKtxScanEnrichmentInputHashInput`
|
||||
`:57`): replace with per-stage hash functions (or one function taking a per-stage
|
||||
input slice). `context/scan/local-enrichment.ts` (`:611` single hash; the three
|
||||
`runEnrichmentStage` calls at `descriptions` ~`:635`, `embeddings` ~`:666`,
|
||||
`relationships` ~`:722`; `runEnrichmentStage` `:524` and its short-circuit
|
||||
`:538–547`). The `descriptions` hash also feeds `generateDescriptions`'
|
||||
`resumeStore.load(inputHash)` (`:345`).
|
||||
- **Provider-identity decomposition** — `context/scan/local-scan.ts`
|
||||
(`localScanProviderIdentity` `:241–255`, the enrichment call site `:498–537`):
|
||||
split into `llmIdentity` / `embeddingIdentity`, drop the redundant `mode` /
|
||||
`relationships` re-encoding, and pass each stage only its slice.
|
||||
- **`descriptionDigest`** — `context/scan/local-enrichment.ts` (`buildEmbeddings`
|
||||
`:457–486`) and `context/scan/embedding-text.ts` (`buildKtxColumnEmbeddingText`
|
||||
`:17–44`): digest the resolved per-column/table description text that the embeddings
|
||||
consume, and fold that digest into the embeddings hash.
|
||||
- **CLI flag** — `commands/ingest-commands.ts` (`:26–49` option declarations,
|
||||
`:51–104` action handler): add `--stages` with a custom parser that validates
|
||||
against the canonical stage registry (`KTX_SCAN_ENRICHMENT_STAGES` in
|
||||
`enrichment-state.ts:4`) and rejects unknown/empty names with `InvalidArgumentError`.
|
||||
Thread through `public-ingest.ts` (`KtxScanArgs` build `:969–978`, `mode: 'enriched'`
|
||||
`:973`) → `scan.ts` (`runKtxScan`) → `local-scan.ts` (`runLocalScan`) →
|
||||
`runLocalScanEnrichment`.
|
||||
- **Stage gating + force-rerun** — `context/scan/local-enrichment.ts`: gate each stage
|
||||
block on membership in the selected set (`descriptions` `:632`, `embeddings`
|
||||
`:663–665`, `relationships` `:720`); make a named stage bypass the completed-row
|
||||
short-circuit in `runEnrichmentStage` while the inner `compute()` keeps the spec-20
|
||||
per-table resume. `KtxLocalScanEnrichmentInput` (`:60–85`) gains the selected-stage
|
||||
set.
|
||||
- **Staleness detection + warning** — `context/scan/local-enrichment.ts` (after the
|
||||
stage blocks): recompute each unselected stage's current hash from on-disk state,
|
||||
compare to the stored completed-row hash, push a recoverable warning on mismatch.
|
||||
Add `enrichment_stage_stale` to the `KtxScanWarningCode` union in
|
||||
`context/scan/types.ts` (alongside `relationship_detection_partial`).
|
||||
- **Relationships description context** — `context/scan/local-enrichment.ts`
|
||||
(`schema` built at `:621`/`:688`, passed to `discoverKtxRelationships` `:736–746`):
|
||||
hydrate `schema` with the best-available descriptions (fresh-this-run or loaded from
|
||||
the on-disk `_schema` via `loadExistingManifestState`,
|
||||
`local-enrichment-artifacts.ts`) before relationship detection.
|
||||
- **Stage store + resume record** —
|
||||
`context/scan/sqlite-local-enrichment-state-store.ts`
|
||||
(`local_scan_enrichment_stages`, PK `(connection_id, stage, input_hash)`,
|
||||
`findCompletedStage`, `saveCompletedStage`); `createKtxScanDescriptionResumeStore`
|
||||
(`local-enrichment-artifacts.ts:286–332`, path `:265–267`, inputHash gate
|
||||
`:305–307`) — both now keyed on the relevant per-stage hash. No migration bridge.
|
||||
- **Config inputs** — `context/project/config.ts` (`scanRelationshipsSchema`
|
||||
`:171–218` incl. `llmProposals` `:174` and `detectionBudgetMs`;
|
||||
`scan.enrichment.embeddings` model/dimensions/batchSize; `llm.models.default`,
|
||||
`llm.provider.gateway.base_url`): the sources of each per-stage identity slice.
|
||||
- **Tests** — per-stage invalidation isolation (flip one input, assert only the
|
||||
matching stage recomputes); `--stages` parse/validate (good subsets + unknown/empty
|
||||
rejected); resume-aware force-rerun (`--stages descriptions` retries only the null
|
||||
tables, leaves good ones, completes); preserve-others (unselected artifacts
|
||||
byte-stable); derived staleness (`enrichment_stage_stale` fires for embeddings after
|
||||
a descriptions change, not for relationships; cleared by a later `--stages
|
||||
embeddings`); relationships-only run feeds on-disk descriptions to `llmProposals`;
|
||||
regression — a normal no-flag ingest yields identical artifacts with spec-19/20
|
||||
resume intact.
|
||||
- After implementing, rebuild and re-link so the playground picks it up:
|
||||
`pnpm run build && pnpm run link:dev`.
|
||||
- **Docs:** add `--stages` to the `ktx ingest` CLI reference
|
||||
(`docs-site/content/docs/cli-reference/`) and note the per-stage cache behavior
|
||||
where enrichment/ingest is described.
|
||||
|
||||
## Benchmark context (motivation, not a requirement)
|
||||
|
||||
Surfaced during the Spider 2.0-Lite multi-backend ingestion (2026-06-24). A
|
||||
level-aware audit found (a) a tail of BigQuery datasets with poor *column*-description
|
||||
coverage (`google_dei` ~1%, `gnomAD`, `usfs_fia`, …) that want a **`descriptions`-only**
|
||||
re-run with a longer timeout, and (b) a desire to **backfill joins** across all
|
||||
already-ingested datasets after enabling `llmProposals` — without re-paying for
|
||||
descriptions. Both were blocked by the coarse single `inputHash` (flipping
|
||||
`llmProposals` or re-describing invalidated the whole enrichment) and the absence of a
|
||||
stage-selective CLI flag. The benchmark merely exercised multi-backend
|
||||
ingestion at scale; the gap and the fix are generic production operability. Do not
|
||||
encode any benchmark specifics in the implementation.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Shipped on branch `write-feature-spec-wiki`. All six requirements implemented;
|
||||
all acceptance criteria covered by tests.
|
||||
|
||||
**What was built / where:**
|
||||
|
||||
- **Per-stage hashes (D1, Req 1).** `context/scan/enrichment-state.ts`: removed the
|
||||
coarse `computeKtxScanEnrichmentInputHash` and added
|
||||
`computeKtxDescriptionsStageHash` (snapshot + `llmIdentity`),
|
||||
`computeKtxEmbeddingsStageHash` (snapshot + `embeddingIdentity` + `descriptionDigest`),
|
||||
`computeKtxRelationshipsStageHash` (snapshot + `relationshipSettings` + `llmIdentity`),
|
||||
plus `computeKtxScanDescriptionDigest` and the `KtxScanLlmIdentity` /
|
||||
`KtxScanEmbeddingIdentity` types. `KTX_SCAN_ENRICHMENT_STAGES` is now exported as the
|
||||
canonical registry. `local-scan.ts` `localScanProviderIdentity` was split into
|
||||
`localScanLlmIdentity` + `localScanEmbeddingIdentity` (dropping the redundant
|
||||
`mode`/`relationships` re-encoding). `mode`/`detectRelationships` dropped out of the
|
||||
keys. No migration bridge — the stage store + descriptions resume record just miss the
|
||||
old coarse-keyed rows.
|
||||
- **`descriptionDigest` (D1/D4).** `local-enrichment.ts`: extracted
|
||||
`buildKtxColumnEmbeddingTexts(snapshot, descriptions)`, shared by the embeddings stage
|
||||
and the digest, so the embeddings hash content-addresses the exact text the model sees.
|
||||
- **`--stages` flag (D2/D6, Req 2).** `commands/ingest-commands.ts`:
|
||||
`parseEnrichmentStagesOption` (Commander parser) validates against the registry,
|
||||
rejects unknown/empty with `InvalidArgumentError`, returns an ordered de-duplicated
|
||||
set; threaded through `KtxPublicIngestArgs` → `context-build-view` → `KtxScanArgs` →
|
||||
`RunLocalScanOptions` → `KtxLocalScanEnrichmentInput`. One gate
|
||||
(`(eligible) ∩ (selected)`); no new `KtxScanMode`. A selected-but-ineligible stage
|
||||
emits a new `enrichment_stage_skipped` warning (never a silent no-op).
|
||||
- **Force-rerun (D3, Req 3).** `runEnrichmentStage` gained `forceRecompute`; a named
|
||||
stage bypasses the spec-19 completed-row short-circuit while `generateDescriptions`
|
||||
still consults the spec-20 per-table resume record (retries only failed tables on an
|
||||
unchanged hash).
|
||||
- **Descriptions hydration + `llmProposals` context (D5, Req 5).** `runLocalScanEnrichment`
|
||||
resolves best-available descriptions (fresh-this-run, else on-disk via a lazy
|
||||
`loadPriorDescriptions` thunk wired from `local-scan.ts` →
|
||||
`loadOnDiskDescriptionUpdates` in `local-enrichment-artifacts.ts`). `snapshotToKtxEnrichedSchema`
|
||||
now merges `ai` descriptions, and `relationship-llm-proposal.ts` `buildEvidencePacket`
|
||||
now carries the resolved description text — closing the latent gap on **both** the
|
||||
full-run and relationships-only paths.
|
||||
- **Derived staleness (D4, Req 4).** `enrichment_stage_stale` warning code +
|
||||
`findLatestCompletedStage` on the state store (interface + sqlite + test store). After a
|
||||
selective run, each unselected stage with a completed row is compared against its
|
||||
freshly recomputed hash; a mismatch warns and names the cascade command. Relationships
|
||||
are never flagged by a description change (decoupled per D5).
|
||||
- **Docs.** `docs-site/content/docs/cli-reference/ktx-ingest.mdx`: `--stages` flag row, a
|
||||
"Selecting enrichment stages" section (per-stage cache, force-rerun, staleness), and
|
||||
examples.
|
||||
|
||||
**Deviation from the spec — embeddings hydration is descriptions-only.** D5 states a
|
||||
relationships-only run should hydrate "AI descriptions **and** embeddings" from the
|
||||
on-disk `_schema`. Investigation found the `_schema` manifest shards store only
|
||||
descriptions; embedding vectors are written to a **syncId-scoped** `enrichment/embeddings.json`
|
||||
that no code reads back, and each run mints a fresh syncId — so there is no durable
|
||||
per-connection embeddings artifact to hydrate from. A relationships-only run therefore
|
||||
hydrates **descriptions** (required for, and verified against, the `llmProposals`
|
||||
acceptance criterion) but **not** embeddings. Consequence: a `--stages relationships`
|
||||
backfill gets deterministic + name-based + LLM-proposed candidates (the point of
|
||||
`llmProposals`), but not the embedding-similarity candidates a full run would add.
|
||||
Durable embeddings hydration (persist vectors at a stable per-connection path, or read
|
||||
them from the vector index) is a clean follow-on and was left out of scope.
|
||||
|
||||
**Tests:** `enrichment-state.test.ts` (per-stage hash stability + isolation),
|
||||
`commands/ingest-commands.test.ts` (parser good/bad subsets, threading, text-capture
|
||||
guard), `local-enrichment.test.ts` (force-rerun bypasses short-circuit + preserves
|
||||
others, naming all three forces a full recompute, per-stage invalidation isolation,
|
||||
prerequisite warning, on-disk descriptions reach `llmProposals`, resume-aware forced
|
||||
descriptions rerun, derived `enrichment_stage_stale` fires for embeddings but not
|
||||
relationships and clears after re-embed). Full `pnpm --filter @kaelio/ktx run test`,
|
||||
`type-check`, `dead-code`, and `build` pass. (One pre-existing unrelated failure in
|
||||
`test/skills/analytics-skill-content.test.ts` — the analytics `SKILL.md` lacks a
|
||||
`**Window functions**` heading the test expects — was present before this work and left
|
||||
untouched.)
|
||||
|
||||
---
|
||||
|
||||
## ⚠️ Defect found in post-implementation validation (2026-06-24)
|
||||
|
||||
**`--stages` subset excluding `descriptions` WIPES existing on-disk descriptions.** Violates the
|
||||
"preserve-others" requirement: a selective run never deletes another stage's artifacts.
|
||||
|
||||
**Reproduction (deterministic):**
|
||||
- `northwind` before: 110 `ai:` column/table descriptions, 0 join edges.
|
||||
- `ktx-dev ingest northwind --stages relationships` → completes in ~35s, adds **22 join edges** ✅
|
||||
but the rewritten `public.yaml` has **0 descriptions** (no `ai:`, no `db:`, columns bare). ❌
|
||||
- A full `ktx-dev ingest northwind` (all stages) restores 110 descriptions + keeps the 22 joins.
|
||||
|
||||
**Likely root cause:** the relationships-only path rewrites the schema from the raw snapshot + only the
|
||||
freshly-run stage. The implementation notes claim `snapshotToKtxEnrichedSchema` merges `ai` descriptions
|
||||
and that descriptions are hydrated "fresh-this-run, else on-disk via `loadPriorDescriptions`" — but on the
|
||||
**write path** of a subset run the prior descriptions are NOT merged into the emitted schema (they reach
|
||||
the `llmProposals` evidence packet only). So the on-disk `_schema` loses them.
|
||||
|
||||
**Impact:** blocks the intended joins-everywhere backfill (`--stages relationships` across all dbs) and the
|
||||
`--stages descriptions`-only re-runs — either would destroy the unselected stage's artifacts across every
|
||||
db. Caught on a 1-db validation before any rollout.
|
||||
|
||||
**Acceptance fix:** after any `--stages` subset, the on-disk `_schema` must **retain all prior `ai:`/`db:`
|
||||
descriptions** (and prior joins when descriptions-only) for stages not named — only the named stages'
|
||||
artifacts change. Add a regression test that ingests a fully-enriched fixture, runs `--stages relationships`,
|
||||
and asserts description count is unchanged while joins increase.
|
||||
|
||||
### ✅ Fixed (2026-06-24)
|
||||
|
||||
**Real root cause (deeper than the first diagnosis):** the wipe happened in **two** places, and the first
|
||||
fix attempt only addressed one. `runLocalScan` (`context/scan/local-scan.ts`) writes the **structural**
|
||||
manifest shard from the bare snapshot *before* enrichment runs; that write merges with the on-disk shard,
|
||||
but the merge (`mergeDescriptionsPreservingExternal`, `live-database/manifest.ts`) treats `ai`/`db` as
|
||||
**scan-managed** and overwrites them with whatever the run emits — and the structural write emits none. So a
|
||||
subset run deleted the descriptions on the structural pre-write, *then* `runLocalScanEnrichment` read the
|
||||
already-wiped shard via `loadPriorDescriptions` and had nothing to restore. (A unit-level enrichment test
|
||||
passed because it never exercised the structural pre-write — a divergent-harness miss; the regression test
|
||||
was rewritten to go through the full `runLocalScan` path.)
|
||||
|
||||
**What changed:**
|
||||
- `runLocalScanEnrichment` (`local-enrichment.ts`) now returns the **best-available** descriptions
|
||||
(`resolveDownstreamDescriptions()` — fresh-this-run if `descriptions` ran, else the on-disk ones) as
|
||||
`descriptionUpdates`, instead of `[]` when the stage is skipped — so the enrichment write re-applies them.
|
||||
- `runLocalScan` (`local-scan.ts`) now, on a subset run, **captures the prior on-disk descriptions before
|
||||
the structural manifest write** and feeds them to both the structural write and enrichment — so the
|
||||
structural pre-write preserves them too (robust even if relationship detection later fails).
|
||||
- Joins were already preserved for `--stages descriptions` via the existing manual/inferred
|
||||
`preservedJoins` path; verified by a symmetric test.
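
The two changes above can be modeled in a few lines. This is a simplified sketch, not the code in `local-scan.ts` or `local-enrichment.ts`: `SchemaShard`, `runSubset`, and `bestAvailableDescriptions` are hypothetical names, the structural pre-write is reduced to an assignment, and a relationships run simply replaces the join list (the real path also preserves manual and inferred joins).

```typescript
interface Description {
  target: string; // e.g. "orders.customer_id"
  text: string;
  origin: "ai" | "db";
}

interface SchemaShard {
  descriptions: Description[];
  joins: string[];
}

type Stage = "descriptions" | "relationships";

// Fresh-this-run descriptions win; otherwise fall back to what was on disk.
function bestAvailableDescriptions(
  fresh: Description[] | null,
  prior: Description[],
): Description[] {
  return fresh !== null ? fresh : prior;
}

function runSubset(
  onDisk: SchemaShard,
  stages: Stage[],
  generateDescriptions: () => Description[],
  detectJoins: () => string[],
): SchemaShard {
  // Capture BEFORE the structural write: that write treats ai/db descriptions
  // as scan-managed and replaces them with whatever it is handed.
  const prior = onDisk.descriptions.slice();

  // Structural pre-write, now fed the prior descriptions instead of none.
  let shard: SchemaShard = { descriptions: prior, joins: onDisk.joins.slice() };

  // Enrichment write: re-apply the best-available descriptions.
  const fresh = stages.indexOf("descriptions") !== -1 ? generateDescriptions() : null;
  shard = { ...shard, descriptions: bestAvailableDescriptions(fresh, prior) };

  if (stages.indexOf("relationships") !== -1) {
    shard = { ...shard, joins: detectJoins() };
  }
  return shard;
}
```

The ordering is the point of the sketch: the prior descriptions are read once, before anything is written, so neither write can observe an already-wiped shard.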
|
||||
|
||||
**Tests:** `local-scan.test.ts` — a full `runLocalScan` `--stages relationships` run preserves on-disk `ai`
|
||||
descriptions while adding a join (RED without the fix, GREEN with it). `local-enrichment.test.ts` — the
|
||||
enrichment-layer contract (`--stages relationships` preserves descriptions / `--stages descriptions`
|
||||
preserves joins).
|
||||
|
||||
**Live validation (northwind, 15 tables):** `--stages relationships` BEFORE `ai:110 joins:22` → AFTER
|
||||
`ai:110 joins:22` (descriptions intact; previously wiped to 0). `--stages descriptions` restored the
|
||||
descriptions from the spec-20 resume record (`ai:0 → ai:110`) with **no** LLM calls while keeping `joins:22`.
|
||||
Full `pnpm --filter @kaelio/ktx run test` (3089 passed), `type-check`, `dead-code`, and `build` pass.
|
||||
|
|
@ -1,463 +0,0 @@
|
|||
# Resumable and fault-tolerant source ingest
|
||||
|
||||
> Refined spec. No intake draft — surfaced by a real user report, not the
|
||||
> playground agent (see Motivation). Lives beside the analogous scan-durability
|
||||
> specs 19/20.
|
||||
>
|
||||
> **Scope: make `ktx ingest` (the source-ingest work-unit pipeline behind dbt /
|
||||
> Metabase / Notion) survive interruption and partial failure on large
|
||||
> projects.** Two compounding gaps live on the source-ingest path: (1) an
|
||||
> interrupted run restarts every work unit from scratch — there is no cross-run
|
||||
> reuse of already-generated work-unit output, so a multi-day dbt ingest loses
|
||||
> *all* progress to a single VPN/network blip; (2) the final integration gate is
|
||||
> all-or-nothing — one artifact that cannot pass it (after LLM repair) discards
|
||||
> the **entire** run with nothing committed. This is the source-ingest analog of
|
||||
> spec 19 (move the durability boundary to the cost boundary so expensive LLM
|
||||
> work is not lost) and spec 20 (a stage survives an interruption with per-item
|
||||
> durability). It **reuses** the same content-keyed durability primitive those
|
||||
> specs established rather than copying it.
|
||||
|
||||
## Problem
|
||||
|
||||
Two independent failure modes on the source-ingest work-unit (WU) pipeline,
|
||||
both confirmed in the current code, both observed by a user on a ~2-day dbt
|
||||
ingest. Together they make large-project ingest brittle: any interruption is
|
||||
total loss, and any single unfixable artifact at the end is total loss.
|
||||
|
||||
### 1. An interrupted run resumes nothing — every work unit re-runs
|
||||
|
||||
`IngestBundleRunner` (`context/ingest/ingest-bundle.runner.ts`) executes a run as
|
||||
a sequence of stages: fetch → parse/extract into **work units** → run each WU as
|
||||
an isolated agent loop in a child worktree (`runIsolatedWorkUnit` →
|
||||
`executeWorkUnit`, `stages/stage-3-work-units.ts`) → integrate the successful WU
|
||||
patches → reconcile → finalize → final gates → one atomic squash commit
|
||||
(`squashMergeIntoMain`, ~2716). The WU stage is where the LLM cost lives: each WU
|
||||
is an agent loop that reads its `rawFiles`/`dependencyPaths` and writes SL/wiki
|
||||
artifacts, producing a git patch (`WorkUnitOutcome.patchPath` /
|
||||
`patchTouchedPaths`, `stage-3-work-units.ts:31-46`).
|
||||
|
||||
The only persisted cross-run state is `SqliteBundleIngestStore`
|
||||
(`context/ingest/sqlite-bundle-ingest-store.ts`): run metadata, the final report,
|
||||
and provenance — all written at or near **run completion**. There is **no
|
||||
checkpoint of completed WU output**. A run that dies mid-flight (the user's
|
||||
VPN/network drop) leaves nothing reusable: the next `ktx ingest` re-fetches,
|
||||
re-parses, and **re-executes every WU from scratch**, re-paying the entire LLM
|
||||
cost. The store even declares `job_id` UNIQUE, so a re-run is a brand-new job with no
|
||||
relationship to the interrupted one.
|
||||
|
||||
> Observed (user report, large dbt project): a run reached deep into its
|
||||
> work-unit progress and was lost to a network blip; the follow-up run started
|
||||
> over from zero. On a ~2-day ingest this is the difference between a 5-minute
|
||||
> resume and a 2-day redo.
|
||||
|
||||
### 2. The final integration gate is all-or-nothing
|
||||
|
||||
After all surviving WUs are integrated, `validateFinalIngestArtifacts`
|
||||
(`context/ingest/artifact-gates.ts:96`) runs the final gate. It checks, across
|
||||
the *integrated* tree:
|
||||
|
||||
- **intrinsic source validity** — `validateTouchedSources` →
|
||||
`validateWuTouchedSources` (`stages/validate-wu-sources.ts:124`) →
|
||||
`validateSingleSource` (`context/sl/tools/sl-warehouse-validation.ts:56`),
|
||||
which runs a **live warehouse dry-run** (`SELECT * FROM (sql) LIMIT 1`);
|
||||
- **cross-artifact references** — dangling join targets
|
||||
(`findJoinTargetErrors`, `validate-wu-sources.ts:89`), dangling `wiki→wiki`
|
||||
refs (`validateWikiRefs` → `findMissingWikiRefs`), broken `wiki→sl_ref`s
|
||||
(`validateWikiSlRefs`, `artifact-gates.ts:39`), and broken wiki body refs
|
||||
(`findInvalidWikiBodyRefs`).
|
||||
|
||||
On any error it **`throw`s a single concatenated string** (`artifact-gates.ts:129`).
|
||||
The runner catches it, runs the LLM repair `repairFinalGateFailure`
|
||||
(`runner.ts:2595`, `maxAttempts: 2`), and if repair still fails, **re-throws**
|
||||
(`runner.ts:2623`) → `markFailed` → the squash never runs → `commitSha: null`
|
||||
(`runner.ts:2729`) → **the whole run is discarded, nothing committed.**
|
||||
|
||||
The crucial asymmetry: a WU that fails *on its own terms* never reaches this gate
|
||||
— `executeWorkUnit` already validates each WU in isolation (`validateWikiRefs`
|
||||
~143, `validateTouchedSources` ~150) and **soft-fails** it (`failWithReset`,
|
||||
~155: the WU resets, is excluded from integration, and the run continues). So by
|
||||
the time the final gate runs, intrinsic single-source failures are rare. The
|
||||
gate fails predominantly on **cross-artifact dangling references**: WU-A's source
|
||||
joins to a source WU-B was meant to create, but WU-B failed or was excluded, so
|
||||
A's join now points at nothing. Each WU passed *alone*; the break only appears
|
||||
once the survivors are integrated — and that break currently nukes the run.
|
||||
|
||||
> Observed (user report): a run completed all task generation and then failed at
|
||||
> the final integration gate on a **single model**; because the gate is
|
||||
> all-or-nothing, that one failure discarded an ~18h run with nothing committed.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
Anyone ingesting a large warehouse/BI/dbt project with an LLM pipeline will hit
|
||||
both failures. Large ingests run long enough that an interruption is a *when*,
|
||||
not an *if* (laptop sleep, VPN reconnect, transient provider error, an operator
|
||||
ctrl-C on an apparently-stuck run), and a large artifact set makes it
|
||||
near-certain that *some* model lands a cross-reference its sibling didn't
|
||||
produce. Without cross-run reuse, every interruption is a from-scratch redo of
|
||||
the dominant (LLM) cost; without partial commit, one unfixable artifact throws
|
||||
away every good one. Both fixes make large-project ingest **resilient and
|
||||
resumable**: an interruption costs only the unfinished work, and a single bad
|
||||
model costs only that model — not the run. This is core robustness for a
|
||||
general-purpose ingestion product.
|
||||
|
||||
## Design decisions (resolved during refinement)
|
||||
|
||||
These resolve the design space explored during refinement. They constrain the
|
||||
implementer; the exact code is theirs (requirement-level, per the specs README).
|
||||
|
||||
### D1 — Resume is automatic and content-keyed at the work-unit level
|
||||
|
||||
A successful WU's output is cached across runs, keyed by a **content hash of its
|
||||
inputs**, with **no `--resume` flag**. Re-running the same `ktx ingest`
|
||||
transparently replays any WU whose inputs are byte-identical to a cached success
|
||||
and re-runs only the changed, failed, or missing WUs. The key is computed over:
|
||||
the contents of the WU's `rawFiles` + `dependencyPaths` (the bytes the WU reads,
|
||||
`types.ts:19-28`), the adapter/source identity, and a **version/prompt
|
||||
fingerprint** (ktx version + the WU system/user prompt + model role). A changed
|
||||
dbt model busts only that model's entry; everything unchanged replays for free.
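
One way the key described above might be computed is sketched below. It is an illustration only: `WorkUnitKeyInput`, `InputFile`, `dependencyFiles`, and `computeWorkUnitInputHash` are hypothetical names, and the choice of SHA-256 with length-prefixed fields is an assumption, not something the spec fixes.

```typescript
import { createHash } from "node:crypto";

interface InputFile {
  path: string;
  contents: string;
}

// Hypothetical shape gathering everything D1 says the key covers.
interface WorkUnitKeyInput {
  sourceKey: string; // adapter/source identity
  rawFiles: InputFile[];
  dependencyFiles: InputFile[]; // contents behind dependencyPaths
  ktxVersion: string;
  prompt: string; // WU system + user prompt
  modelRole: string;
}

function computeWorkUnitInputHash(input: WorkUnitKeyInput): string {
  const hash = createHash("sha256");
  // Length-prefix every field so "ab" + "c" and "a" + "bc" hash differently.
  const feed = (label: string, value: string): void => {
    hash.update(`${label}:${value.length}:${value}`);
  };

  feed("source", input.sourceKey);
  feed("version", input.ktxVersion);
  feed("prompt", input.prompt);
  feed("role", input.modelRole);

  const byPath = (a: InputFile, b: InputFile): number =>
    a.path < b.path ? -1 : a.path > b.path ? 1 : 0;
  // Sort copies so the key does not depend on enumeration order.
  for (const file of [...input.rawFiles].sort(byPath)) {
    feed("raw-path", file.path);
    feed("raw-bytes", file.contents);
  }
  for (const file of [...input.dependencyFiles].sort(byPath)) {
    feed("dep-path", file.path);
    feed("dep-bytes", file.contents);
  }
  return hash.digest("hex");
}
```

Folding the version, prompt, and model role into the same digest biases the key toward recompute: any change to them invalidates every entry, which is the conservative direction.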
|
||||
|
||||
> No flag, no config knob. Content-keying makes resume automatic; a flag would
|
||||
> double the state space for no benefit. This is the same shape scan uses
|
||||
> (`computeKtxScanEnrichmentInputHash`, spec 19), reached here for the WU
|
||||
> pipeline.
|
||||
|
||||
### D2 — The cached unit is the successful WU's patch; replay verifies or recomputes
|
||||
|
||||
The cache stores a successful WU's **output artifacts**: its git patch
|
||||
(`patchPath` content / `patchTouchedPaths`) plus the metadata integration needs
|
||||
(`actions`, `touchedSlSources`, `slDisallowed`). On a cache hit, the runner
|
||||
**replays the patch** into the session worktree — no agent loop, no LLM — exactly
|
||||
where it would have integrated a freshly-run WU. If a cached patch **fails to
|
||||
apply** (the surrounding tree drifted), the entry is discarded and the WU
|
||||
**recomputes**. So a stale hit degrades to "recompute," never to a corrupt tree:
|
||||
the cache can only make a run faster, never wrong.
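
The replay-or-recompute rule, together with the success-only caching of D4, can be sketched as a single decision function. All names here (`resolveWorkUnit`, `WorkUnitOutput`, `patchApplies`, `runAgentLoop`) are hypothetical, and a plain `Map` stands in for the cache; the real patch application happens in the session worktree.

```typescript
interface WorkUnitOutput {
  patch: string;
  touchedPaths: string[];
}

type AgentResult =
  | { ok: true; output: WorkUnitOutput }
  | { ok: false; error: string };

type Resolution =
  | { status: "replayed" | "computed"; output: WorkUnitOutput }
  | { status: "failed"; error: string };

function resolveWorkUnit(
  inputHash: string,
  cache: Map<string, WorkUnitOutput>,
  patchApplies: (patch: string) => boolean, // stand-in for a clean-apply check
  runAgentLoop: () => AgentResult,
): Resolution {
  const hit = cache.get(inputHash);
  if (hit !== undefined) {
    if (patchApplies(hit.patch)) {
      return { status: "replayed", output: hit }; // no agent loop, no LLM
    }
    cache.delete(inputHash); // stale hit: the surrounding tree drifted
  }

  const result = runAgentLoop();
  if (!result.ok) {
    // Failures are never cached, so the next run retries this WU.
    return { status: "failed", error: result.error };
  }
  cache.set(inputHash, result.output);
  return { status: "computed", output: result.output };
}
```

A stale hit falls through to the same path as a miss, so the worst case of a bad cache entry is one extra agent loop.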
|
||||
|
||||
### D3 — One durability primitive, shared by scan and ingest
|
||||
|
||||
Per the "one capability, one implementation" rule, the content-keyed store is
|
||||
**extracted** into a shared primitive and **both** scan and ingest route through
|
||||
it — not copied. Scan's `sqlite-local-enrichment-state-store.ts` (PK
|
||||
`(connection_id, stage, input_hash)`, `findCompletedStage` / `saveCompletedStage`)
|
||||
and its `inputHash` computation (`enrichment-state.ts`) are generalized to a
|
||||
content-keyed result cache; scan is migrated onto the shared primitive **in the
|
||||
same change** so no second copy exists even transiently. The ingest cache is a
|
||||
new logical namespace (e.g. keyed `(connectionId, sourceKey, workUnitInputHash)`)
|
||||
on that one store.
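
To make "one store, two namespaces" concrete, here is a hypothetical in-memory model. `InMemoryResultCache`, `find`, and `save` are illustrative names and do not describe the API of any shipped module; the scope tuples mirror the key columns named above.

```typescript
// Hypothetical in-memory model of one content-keyed store shared by two callers.
class InMemoryResultCache {
  private readonly rows = new Map<string, string>();

  // JSON-encoding the tuple keeps namespaces and scopes from colliding.
  private static rowKey(namespace: string, scope: string[], inputHash: string): string {
    return JSON.stringify([namespace, scope, inputHash]);
  }

  find<T>(namespace: string, scope: string[], inputHash: string): T | undefined {
    const raw = this.rows.get(InMemoryResultCache.rowKey(namespace, scope, inputHash));
    return raw === undefined ? undefined : (JSON.parse(raw) as T);
  }

  save<T>(namespace: string, scope: string[], inputHash: string, result: T): void {
    this.rows.set(
      InMemoryResultCache.rowKey(namespace, scope, inputHash),
      JSON.stringify(result),
    );
  }
}

const sharedCache = new InMemoryResultCache();
// Scan namespace: scope mirrors the existing (connection_id, stage) columns.
sharedCache.save("scan", ["warehouse", "descriptions"], "hash-1", { completed: true });
// Ingest namespace: scope mirrors the suggested (connectionId, sourceKey) key.
sharedCache.save("ingest", ["warehouse", "dbt"], "hash-1", { patch: "diff-A" });
```

Both callers go through the same `find`/`save` pair, so a fix to lookup or serialization lands for scan and ingest at once.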
|
||||
|
||||
> Extract-and-share in one PR, not "build a copy for ingest now, unify later."
|
||||
> A temporary fork is exactly the divergence the rule forbids; the one-time
|
||||
> extraction cost is paid once and both paths benefit from every later fix.
|
||||
|
||||
### D4 — Only successes are cached; failures retry on the next run
|
||||
|
||||
A failed WU is **not** recorded as terminal — the next run retries it. WU
|
||||
failures on this path are predominantly transient (network, provider stall, an LLM
|
||||
slip), and the user's explicit ask is "resume and finish the rest," so a failure
|
||||
must not be sticky. This deliberately differs from scan's stage store (which
|
||||
caches failed stages and re-throws): there the failure is the stage's
|
||||
deterministic verdict; here a WU failure is usually a blip to retry. Caching only
|
||||
successes also keeps the invariant simple — a cache entry always means "this
|
||||
exact input already produced this exact good output."
|
||||
|
||||
### D5 — The final gate becomes non-fatal: deterministic dangling-edge prune
|
||||
|
||||
Replace the gate's fatal `throw`-after-repair with a deterministic reconciliation
|
||||
that always yields a committable, internally-consistent tree:
|
||||
|
||||
1. `validateFinalIngestArtifacts` is refactored to **return structured findings**
|
||||
(the danglers it already computes internally — join targets, `wiki→wiki`,
|
||||
`wiki→sl_ref`, wiki body refs — plus any intrinsic source failure) instead of
|
||||
flattening them into a thrown string.
|
||||
2. **Drop the rare self-invalid source first.** A source that fails its *own*
|
||||
validation at the final gate (intrinsic — rare, since stage 3 already filters
|
||||
these) is removed, establishing the surviving artifact set.
|
||||
3. **Prune the dead edges in a single pass** over that surviving set. For each
|
||||
dangling reference — whether it pointed at an absent sibling or at a
|
||||
just-dropped source — **remove that reference from its owner** (drop the join
|
||||
entry, remove the `wiki ref` / `sl_ref`, remove the broken body link), keeping
|
||||
the owning artifact. Because nodes are dropped first (step 2) and pruning only
|
||||
removes edges, pruning **cannot create a new dangling edge, so one pass
|
||||
suffices; no fixpoint.**
|
||||
4. Re-run the gate to **confirm** the remainder is clean (warehouse dry-runs are
|
||||
cached per D6/D2, ref checks are in-memory, so this is cheap), then squash-commit
|
||||
the remainder. If the confirm pass *still* fails, that is a real bug — fail the
|
||||
run loudly rather than commit a dirty tree.
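
Steps 1 to 4 can be sketched over a toy artifact graph. This is a model under assumptions, not the gate itself: `Artifact`, `selfValid`, `findDanglers`, and `reconcile` are hypothetical, all reference kinds are collapsed into one `refs` list, and a boolean stands in for the warehouse dry-run.

```typescript
type RefKind = "join" | "wiki_ref" | "sl_ref" | "body_ref";

interface Artifact {
  id: string;
  selfValid: boolean; // false = fails its own validation (the drop set)
  refs: Array<{ kind: RefKind; target: string }>;
}

interface PrunedRef {
  owner: string;
  kind: RefKind;
  target: string;
}

// Step 1: structured findings instead of a thrown string.
function findDanglers(tree: Artifact[]): PrunedRef[] {
  const present = new Set(tree.map((a) => a.id));
  const danglers: PrunedRef[] = [];
  for (const artifact of tree) {
    for (const ref of artifact.refs) {
      if (!present.has(ref.target)) {
        danglers.push({ owner: artifact.id, kind: ref.kind, target: ref.target });
      }
    }
  }
  return danglers;
}

function reconcile(integrated: Artifact[]): {
  tree: Artifact[];
  droppedSources: string[];
  prunedRefs: PrunedRef[];
} {
  // Step 2: drop self-invalid nodes first to fix the surviving set.
  const survivors = integrated.filter((a) => a.selfValid);
  const droppedSources = integrated.filter((a) => !a.selfValid).map((a) => a.id);

  // Step 3: one pass; removing edges cannot make another edge dangle.
  const prunedRefs = findDanglers(survivors);
  const present = new Set(survivors.map((a) => a.id));
  const tree = survivors.map((a) => ({
    ...a,
    refs: a.refs.filter((ref) => present.has(ref.target)),
  }));

  // Step 4: confirm before commit; a remaining dangler is a real bug.
  if (findDanglers(tree).length > 0) {
    throw new Error("final gate still failing after prune; refusing to commit");
  }
  return { tree, droppedSources, prunedRefs };
}
```

Because the surviving node set is fixed before any edge is removed, the set of valid targets never shrinks during pruning, which is why the confirm pass in this model can only fail on a bug.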
|
||||
|
||||
`repairFinalGateFailure` (the LLM repair, `runner.ts:2595` / `final-gate-repair.ts`)
|
||||
is **removed**. The deterministic prune supersedes it for the referential class,
|
||||
and the rare intrinsic case is handled by drop.
|
||||
|
||||
> **Prune the edge, do not cascade the node.** The rejected alternative drops the
|
||||
> *referencing artifact* and, transitively, everything that referenced *it* — a
|
||||
> node-quarantine fixpoint that cascades healthy artifacts and needs a closure
|
||||
> search, a confirm loop, and an un-apply step. Pruning the dead edge keeps the
|
||||
> dependent intact (minus one pointer that never resolved anyway), needs no
|
||||
> fixpoint, and acts on findings the gate already produces.
|
||||
>
|
||||
> **Why remove the LLM repair rather than keep it as a pre-prune step.** Repair
|
||||
> can occasionally *fix* a ref (e.g. correct a typo'd source name) where prune
|
||||
> merely deletes it, preserving marginally more content. We drop it anyway:
|
||||
> determinism beats an LLM round-trip with variance on the commit path, prune
|
||||
> guarantees a commit where repair could only `throw`, and deleting it is a net
|
||||
> maintenance reduction. The decision is reversible — repair could later run as a
|
||||
> best-effort pass *before* prune — but the default is prune-only.
|
||||
|
||||
### D6 — Prune runs on the integrated tree, never poisons the cache (resume ∘ prune compose)
|
||||
|
||||
Pruning is applied to the **integrated session worktree** at gate time and is
|
||||
**re-derived from the current survivor set on every run**. It MUST NOT mutate the
|
||||
cached WU patches (D2). This makes resume and prune compose correctly and
|
||||
**self-heal**:
|
||||
|
||||
- Run 1: WU-A (joins to B) succeeds and is cached *with its join intact*; WU-B
|
||||
fails; the gate prunes A's join-to-B from the integrated tree and commits A
|
||||
without it.
|
||||
- Run 2 (after the root cause is fixed): A's input is unchanged → A **replays
|
||||
from cache with its join restored**; B now succeeds and exists; the gate finds
|
||||
no dangler and commits both, fully linked.
|
||||
|
||||
So a ref pruned because of a sibling's failure costs nothing permanent: fixing
|
||||
the sibling and re-running restores the link for free. The cache stores
|
||||
intent (the WU's real output); prune is a per-run consistency projection over
|
||||
whatever survived.
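
The two-run scenario above can be simulated in miniature. The sketch below assumes a toy model: `runIngest`, `WuPatch`, and `WuSpec` are hypothetical, the cache is keyed by WU id rather than by input hash, and a patch is reduced to a list of join targets.

```typescript
interface WuPatch {
  id: string;
  joins: string[]; // ids of the sources this WU's source joins to
}

interface WuSpec extends WuPatch {
  succeeds: boolean; // whether the agent loop would succeed this run
}

function runIngest(
  cache: Map<string, WuPatch>, // keyed by id here; a real key is the input hash
  units: WuSpec[],
): { committed: WuPatch[]; agentLoops: number } {
  let agentLoops = 0;
  const survivors: WuPatch[] = [];
  for (const unit of units) {
    const cached = cache.get(unit.id);
    if (cached !== undefined) {
      survivors.push(cached); // replay, no agent loop
      continue;
    }
    agentLoops += 1;
    if (!unit.succeeds) continue; // failure is not cached
    const patch: WuPatch = { id: unit.id, joins: unit.joins.slice() };
    cache.set(unit.id, patch); // cached with its joins intact
    survivors.push(patch);
  }

  // Prune is a projection over this run's survivors; it builds new objects
  // and never writes back into the cache.
  const present = new Set(survivors.map((p) => p.id));
  const committed = survivors.map((p) => ({
    id: p.id,
    joins: p.joins.filter((target) => present.has(target)),
  }));
  return { committed, agentLoops };
}
```

Running it twice with the same cache, first with B failing and then with B succeeding, commits A without its join in the first run and with it in the second, while the second run spends an agent loop only on B.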
|
||||
|
||||
### D7 — Pruning is faithful and never silent
|
||||
|
||||
A pruned reference was, by definition, non-functional (its target was absent), so
|
||||
removing it loses nothing executable — and removing dangling SL joins is already
|
||||
the established fix for the SL engine's eager orphan-join rejection. Every prune
|
||||
and every drop MUST be **recorded in the run report and a trace event** naming
|
||||
the artifact, the removed reference, and the absent target. The report status
|
||||
MUST reflect partial completion (extend the existing `failedWorkUnits`
|
||||
mechanism, `IngestBundleResult`, `types.ts:204-213`, with the pruned-refs /
|
||||
dropped-sources detail) so a partial run is visibly partial, never a silent
|
||||
"success."
|
||||
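
A possible shape for the extended report is sketched below. Every name is hypothetical: the spec only requires that each prune and drop names the artifact, the removed reference, and the absent target, and that a partial run is reported as partial. The element type of `failedWorkUnits` and the `"partial"` status string are assumptions.

```typescript
interface PrunedRefRecord {
  artifact: string;
  removedRef: string;
  absentTarget: string;
}

interface DroppedSourceRecord {
  source: string;
  reason: string;
}

// Hypothetical extension of the report; field names are illustrative only.
interface IngestReportSketch {
  failedWorkUnits: string[]; // assumed element type
  prunedRefs: PrunedRefRecord[];
  droppedSources: DroppedSourceRecord[];
}

type ReportStatus = "success" | "partial";

// A run that failed a WU, pruned a ref, or dropped a source is never an
// unqualified success.
function reportStatus(report: IngestReportSketch): ReportStatus {
  const degraded =
    report.failedWorkUnits.length > 0 ||
    report.prunedRefs.length > 0 ||
    report.droppedSources.length > 0;
  return degraded ? "partial" : "success";
}

// One trace line per prune and per drop, so nothing is removed silently.
function traceLines(report: IngestReportSketch): string[] {
  const pruned = report.prunedRefs.map(
    (p) => `pruned ${p.removedRef} from ${p.artifact} (absent target: ${p.absentTarget})`,
  );
  const dropped = report.droppedSources.map(
    (d) => `dropped ${d.source} (${d.reason})`,
  );
  return pruned.concat(dropped);
}
```

A clean run leaves both new arrays empty, which keeps the report shape compatible with the "modulo new, empty pruned/dropped fields" wording of requirement 12.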
|
||||
### D8 — Cache state is regenerable; no migration bridge
|
||||
|
||||
The WU cache is regenerable local state under `.ktx/`. Its on-disk/SQLite shape
|
||||
may change with **no migration bridge** — a stale-shaped or absent cache simply
|
||||
forces a full (non-resumed) run, exactly today's behavior. Consistent with ktx's
|
||||
no-backward-compatibility policy; the cache is an optimization, never a source of
|
||||
truth.
|
||||
|
||||
## Requirements
|
||||
|
||||
1. **Cross-run WU resume, automatic and content-keyed.** A successful WU's output
|
||||
MUST be cached keyed by a content hash over its input bytes
|
||||
(`rawFiles` + `dependencyPaths`), the adapter/source identity, and a
|
||||
version/prompt fingerprint (ktx version + WU prompt + model role). Re-running
|
||||
`ktx ingest` MUST replay cached successes without an agent loop / LLM call and
|
||||
re-run only changed, failed, or missing WUs. No `--resume` flag and no config
|
||||
knob is added.
|
||||
2. **Replay verifies or recomputes.** On a cache hit the runner MUST replay the
|
||||
stored patch into the session worktree; if the patch does not apply cleanly the
|
||||
entry MUST be discarded and the WU recomputed. A cache hit MUST NOT be able to
|
||||
produce a tree different from what a fresh run of that WU would have integrated.
|
||||
3. **Only successes are cached.** A failed WU MUST NOT be recorded as terminal; it
|
||||
MUST be retried on the next run.
|
||||
4. **Conservative invalidation.** The input hash MUST change when the ktx version,
|
||||
the WU prompt, or the model role changes (bias toward recompute). Under-keying
|
||||
(stale reuse) is a correctness bug; over-keying (an unnecessary recompute) is
|
||||
acceptable.
|
||||
5. **The final gate is non-fatal.** A final-gate failure MUST NOT discard the run.
|
||||
`validateFinalIngestArtifacts` MUST return structured findings; the runner MUST
|
||||
deterministically **prune** every dangling reference from its owning artifact
|
||||
and **drop** any source that fails its own validation, then commit the
|
||||
remaining internally-consistent tree.
|
||||
6. **Single-pass prune, dependents survive.** Pruning MUST remove dead *edges*, not
|
||||
cascade-drop owning artifacts; it MUST complete in a single pass (no fixpoint)
|
||||
because edge removal cannot create new dangling edges. A dependent that loses
|
||||
one dangling ref MUST otherwise be committed intact.
|
||||
7. **Prune composes with resume.** Pruning MUST operate on the integrated tree and
|
||||
MUST NOT mutate cached WU patches. A reference pruned in one run because its
|
||||
target was absent MUST be restored automatically on a later run once the target
|
||||
exists (resume replays the owner's intact patch).
|
||||
8. **Confirm before commit.** After pruning/dropping, the gate MUST be re-run on
|
||||
the remainder and MUST pass before the squash; if it still fails the run MUST
|
||||
fail loudly rather than commit a dirty tree.
|
||||
9. **`repairFinalGateFailure` is removed.** The LLM final-gate repair path and its
|
||||
obsolete tests/branches MUST be deleted (no dormant compatibility path).
|
||||
10. **Every prune/drop is reported.** Each pruned reference and dropped source MUST
|
||||
be recorded in the run report and a trace event (artifact, removed ref, absent
|
||||
target). A run that pruned or dropped anything MUST report as partial, never as
|
||||
an unqualified success.
|
||||
11. **One shared durability primitive.** The content-keyed store MUST be a single
|
||||
implementation used by both scan and ingest; scan MUST be migrated onto it in
|
||||
the same change. No second copy may exist, even transiently.
|
||||
12. **No regression for clean runs.** A small, uninterrupted run whose every WU
|
||||
passes and whose final gate is clean MUST produce byte-identical artifacts and
|
||||
the same `commitSha`/report shape (modulo new, empty pruned/dropped fields) as
|
||||
today.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- **Resume skips completed work:** interrupt an ingest after K of N WUs have
|
||||
succeeded; re-run the same command (unchanged inputs); the run issues **zero**
|
||||
agent loops / LLM calls for the K cached WUs, runs only the remaining N−K, and
|
||||
produces the same final artifacts as an uninterrupted run.
|
||||
- **Changed model busts only its entry:** edit one dbt model between runs; the
|
||||
re-run re-executes **only** the WU(s) whose input bytes changed and replays the
|
||||
rest from cache.
|
||||
- **Stale patch self-corrects:** a cached patch that no longer applies (forced
|
||||
drift in a test) causes that WU to recompute, not a corrupt tree or a crash.
|
||||
- **Failures retry:** a WU that fails in run 1 (transient error) is **not** cached;
|
||||
run 2 retries it and, on success, integrates it.
|
||||
- **One bad model no longer nukes the run:** a run where WU-B fails so WU-A's join
|
||||
to B dangles **commits** — A is committed with the dangling join **pruned**, the
|
||||
report lists the pruned ref, and `commitSha` is non-null (contrast: today this
|
||||
throws and commits nothing).
|
||||
- **No cascade:** in that scenario A (and any other artifact that only referenced
|
||||
B) is committed intact except for the single pruned reference; nothing healthy
|
||||
is dropped.
|
||||
- **Self-heal:** fix B's root cause and re-run; A replays from cache with its join
|
||||
intact, B succeeds, and the final tree commits both fully linked with no prune.
|
||||
- **Intrinsic drop:** a source that fails its own warehouse dry-run at the final
|
||||
gate (forced) is dropped, refs to it are pruned, and the rest commits; the drop
|
||||
is reported.
|
||||
- **Repair is gone:** `repairFinalGateFailure` and its tests no longer exist; the
|
||||
gate path has no LLM call.
|
||||
- **One store:** scan and ingest both resume through the same content-keyed
|
||||
primitive (one implementation; scan's behavior is unchanged by the migration —
|
||||
spec 19/20 acceptance still passes).
|
||||
- **Clean-run regression:** a small uninterrupted all-passing ingest yields
|
||||
identical artifacts, `commitSha`, and report (empty pruned/dropped fields) to
|
||||
today.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Resuming the cross-WU stages.** Reconciliation, finalization, and the final
|
||||
gate re-run every time; their inputs depend on the full survivor set and their
|
||||
cost is small relative to WU generation. Only WU generation is cached.
|
||||
- **A `--resume` flag or any timeout/cache config knob.** Content-keying makes
|
||||
resume automatic (D1); one opinionated default is the canonical ktx shape.
|
||||
- **Caching failed WUs as terminal.** Failures retry (D4).
|
||||
- **Node-cascade quarantine of the final gate.** Prune edges, do not drop
|
||||
dependents (D5). No closure search, confirm-loop-over-nodes, or un-apply step.
|
||||
- **Tolerating dangling references (warn instead of remove).** Unsafe — the SL
|
||||
engine eagerly rejects orphan joins — so dead edges must be removed, not kept.
|
||||
- **Keeping the LLM final-gate repair.** Removed (D5/req 9).
|
||||
- **A general per-stage resume framework beyond the shared content-keyed store.**
|
||||
The store is the one shared primitive (D3); this spec does not abstract every
|
||||
ingest stage into a resumable framework.
|
||||
- **Re-implementing spec 19/20 (scan durability).** This spec applies the same
|
||||
primitive onto the source-ingest WU pipeline.
|
||||
|
||||
## Implementation orientation
|
||||
|
||||
Line numbers drift; treat these as anchors, not addresses. The implementer owns
|
||||
the design.
|
||||
|
||||
- **Run flow + the all-or-nothing seam** — `context/ingest/ingest-bundle.runner.ts`:
|
||||
WU run + integration of successful patches (~1600–1900), the final-gate block
|
||||
(~2549–2587, `runFinalArtifactGates`), the repair-then-rethrow that must be
|
||||
replaced by prune (~2588–2644; the fatal `throw` ~2623), and the atomic squash
|
||||
(~2701–2729; `commitSha: null` when nothing is touched ~2729). The prune step
|
||||
slots between the gate findings and the squash, operating on `sessionWorktree`.
|
||||
- **Work units & cacheable output** — `context/ingest/types.ts` (`WorkUnit`
|
||||
~19–28: `rawFiles`/`peerFileIndex`/`dependencyPaths`; `IngestBundleResult`
|
||||
~204–213: extend with pruned/dropped detail);
|
||||
`context/ingest/stages/stage-3-work-units.ts` (`executeWorkUnit`; the per-WU
|
||||
validation + `failWithReset` ~134–157 that already soft-fails a WU;
|
||||
`WorkUnitOutcome` ~31–46 with `patchPath`/`patchTouchedPaths`/`actions`/
|
||||
`touchedSlSources` — the cache payload). The cache lookup/replay wraps the
|
||||
per-WU execution; only the agent-loop branch is skipped on a hit.
|
||||
- **The gate (make it return findings)** — `context/ingest/artifact-gates.ts`
|
||||
(`validateFinalIngestArtifacts` ~96; the internal per-artifact danglers from
|
||||
`validateWikiSlRefs` ~39, `validateWikiRefs` ~74, `findInvalidWikiBodyRefs`;
|
||||
the concatenated `throw` ~129 to replace with a structured return);
|
||||
`context/ingest/stages/validate-wu-sources.ts` (`validateWuTouchedSources` ~124;
|
||||
`findJoinTargetErrors` ~89 already returns missing join targets per source —
|
||||
the join-edge danglers to prune); `context/sl/tools/sl-warehouse-validation.ts`
|
||||
(`validateSingleSource` ~56 — the intrinsic warehouse dry-run; its failures are
|
||||
the drop set, not the prune set).
|
||||
- **Per-ref-type pruners (pair 1:1 with the validators)** — join: remove the
|
||||
offending `joins[]` entry from the source YAML; `wiki refs`/`sl_refs`: remove
|
||||
the entry from page frontmatter (`context/wiki/wiki-ref-validation.ts`
|
||||
`findMissingWikiRefs`); wiki body refs: remove the broken link token
|
||||
(`context/ingest/wiki-body-refs.ts` `findInvalidWikiBodyRefs`). Each pruner is
|
||||
deterministic and edits the integrated worktree only.
|
||||
- **Remove the LLM repair** — `context/ingest/final-gate-repair.ts`
|
||||
(`repairFinalGateFailure`) and the `constrained-repair.ts` usage for
|
||||
`final_artifact_gate`; delete the call site (~2595) and its tests.
|
||||
- **Durability primitive to extract & share** —
|
||||
`context/scan/sqlite-local-enrichment-state-store.ts` (`local_scan_enrichment_stages`,
|
||||
PK `(connection_id, stage, input_hash)`, `findCompletedStage`/`saveCompletedStage`),
|
||||
`context/scan/enrichment-state.ts` (`computeKtxScanEnrichmentInputHash` ~78), and
|
||||
the resume wrapper `runEnrichmentStage` (`context/scan/local-enrichment.ts`).
|
||||
Generalize to a content-keyed result cache; migrate scan onto it; add the ingest
|
||||
namespace. The existing ingest store
|
||||
`context/ingest/sqlite-bundle-ingest-store.ts` (`SqliteBundleIngestStore`) is
|
||||
where ingest-side persistence lives — the WU cache sits alongside it under
|
||||
`.ktx/`.
|
||||
- **Tests** — resume: run an ingest against a real git-backed project with a fake
|
||||
agent runner, interrupt after K WUs, assert the re-run issues no agent loops for
|
||||
the K and the same artifacts result; changed-input bust; stale-patch recompute;
|
||||
failed-WU retry. Prune: a fixture where one WU fails so a sibling's join/wiki
|
||||
ref dangles → assert the run commits the sibling with the ref pruned, reports the
|
||||
prune, and `commitSha` is non-null; assert no cascade; assert self-heal on a
|
||||
follow-up run; assert intrinsic drop. Migration: spec 19/20 scan acceptance still
|
||||
green on the shared primitive. Regression: a small uninterrupted all-passing
|
||||
ingest is byte-identical to today.
|
||||
- After implementing, rebuild and re-link so the playground picks it up:
|
||||
`pnpm run build && pnpm run link:dev`.
|
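
The "content-keyed result cache" named in the durability bullet above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the ktx implementation: `ContentResultCache`, `hashInput`, and `runCached` are hypothetical names, an in-memory `Map` stands in for the SQLite table, the FNV-1a hash is a dependency-free stand-in for whatever hash the real code uses, and the compute step is synchronous for brevity.

```typescript
// Hypothetical sketch: none of these names are the actual ktx API.
type CacheKey = { namespace: string; scope: string; inputHash: string };

// FNV-1a (32-bit), a stand-in for a real content hash over the inputs.
function hashInput(parts: string[]): string {
  const text = parts.join("\u0000");
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

class ContentResultCache<T> {
  private readonly rows = new Map<string, T>();

  private static rowId(key: CacheKey): string {
    return [key.namespace, key.scope, key.inputHash].join("|");
  }

  find(key: CacheKey): T | undefined {
    return this.rows.get(ContentResultCache.rowId(key));
  }

  // Only successful results are stored, so a failed unit is retried next run.
  saveCompleted(key: CacheKey, result: T): void {
    this.rows.set(ContentResultCache.rowId(key), result);
  }
}

// Replay a stored result when the same input was already processed;
// otherwise compute and store it.
function runCached<T>(
  cache: ContentResultCache<T>,
  key: CacheKey,
  compute: () => T,
): { result: T; replayed: boolean } {
  const hit = cache.find(key);
  if (hit !== undefined) return { result: hit, replayed: true };
  const result = compute(); // a throw here leaves the cache untouched
  cache.saveCompleted(key, result);
  return { result, replayed: false };
}
```

Because the key includes a `namespace`, scan and ingest could share one store without colliding, and any change to the hashed inputs yields a new key rather than a stale hit.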
||||
|
||||
## Motivation (the real report, not a benchmark)
|
||||
|
||||
A user ingesting a fairly large dbt project (~2-day run) hit both gaps together.
|
||||
First, an interruption — a VPN drop / network blip — lost all progress because
|
||||
ingest cannot resume; they had to restart from scratch. Second, on a later run
|
||||
that completed all task generation, a **single model** failed the final
|
||||
integration gate, and because the gate is all-or-nothing the one failure
|
||||
discarded an ~18h run with nothing committed. Their ask: "some form of resume or
|
||||
checkpoint (or at least reusing the patches that were already generated), and a
|
||||
way to skip or quarantine a single failing model instead of failing the entire
|
||||
run." This spec delivers both — resume via the content-keyed WU cache, and
|
||||
partial commit via deterministic dangling-edge pruning. Unlike specs 19/20 this
|
||||
gap was surfaced by a real user on a real warehouse, not by the benchmark; the
|
||||
fix is generic production hygiene for any large ingest.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
Shipped on branch `write-feature-spec-wiki` (squash-merge target). All 12
|
||||
requirements and every acceptance criterion are covered by committed code and
|
||||
tests; the full `@kaelio/ktx` package suite is green.
|
||||
|
||||
What was built and where:
|
||||
|
||||
- **Shared content-keyed durability primitive** — `context/cache/content-result-cache.ts`
|
||||
+ `sqlite-content-result-cache.ts` (`SqliteContentResultCache`, `local_content_results`).
|
||||
Scan was migrated onto it in the same change (`context/scan/sqlite-local-enrichment-state-store.ts`
|
||||
is now a thin adapter; the old `local_scan_enrichment_stages` table is dropped),
|
||||
so no second copy exists (D3 / req 11).
|
||||
- **Content-keyed WU cache + replay** — `context/ingest/work-unit-cache.ts`
|
||||
(`computeIngestWorkUnitInputHash` over raw/dependency bytes + source identity +
|
||||
CLI version + prompt fingerprint + model role; success-only `saveSuccessfulWorkUnitCache`).
|
||||
Replay/recompute and stale-recompute state refresh wrap the WU loop in
|
||||
`ingest-bundle.runner.ts` (D1/D2/D4 / reqs 1–4).
|
||||
- **Non-fatal final gate** — `artifact-gates.ts` `validateFinalIngestArtifacts`
|
||||
returns structured findings; `context/ingest/final-gate-prune.ts` deterministically
|
||||
drops self-invalid sources and prunes dangling edges in a single pass, then a
|
||||
confirm gate runs before squash (D5/D6 / reqs 5–8). `finalGatePrunedReferences`
|
||||
/ `finalGateDroppedSources` are recorded in the report + trace and surface as a
|
||||
`partial` outcome (D7 / req 10). `repairFinalGateFailure` and its tests are
|
||||
deleted (req 9).
|
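
The non-fatal final gate in the last bullet above (drop self-invalid sources, prune dangling edges in one pass) can be sketched as below. The `Source` and `PruneReport` shapes and the function name are assumptions for illustration; only joins are modelled here, while the real gate also covers wiki refs and `sl_refs`.

```typescript
// Hypothetical shapes; the real ktx source model is not shown in the spec.
interface Source { name: string; joins: { to: string }[] }
interface PruneReport {
  droppedSources: string[];
  prunedReferences: { from: string; to: string }[];
}

// Single pass: drop sources that fail on their own, then prune edges that
// point at anything missing. A source that only loses an edge is kept, so
// one failure does not cascade to its neighbours.
function pruneFinalGate(
  sources: Source[],
  selfInvalid: ReadonlySet<string>,
): { kept: Source[]; report: PruneReport } {
  const report: PruneReport = { droppedSources: [], prunedReferences: [] };

  const survivors = sources.filter((s) => {
    if (!selfInvalid.has(s.name)) return true;
    report.droppedSources.push(s.name);
    return false;
  });

  const present = new Set(survivors.map((s) => s.name));
  const kept = survivors.map((s) => ({
    ...s,
    joins: s.joins.filter((j) => {
      if (present.has(j.to)) return true;
      report.prunedReferences.push({ from: s.name, to: j.to });
      return false;
    }),
  }));

  return { kept, report };
}
```

Removing an edge never invalidates the source that held it, which is why a single pass suffices in this sketch; the confirm gate mentioned above would then re-validate the pruned result before committing.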
||||
|
||||
Deviations / decisions worth noting (all preserve spec intent):
|
||||
|
||||
- **Cache stores artifact content snapshots (payload schema v2), not just a raw
|
||||
git patch.** Replay materializes the owner's artifacts against the *current*
|
||||
base, so a ref pruned in one run because a sibling failed is restored for free
|
||||
on a later run once the sibling exists — without re-running the owner's agent
|
||||
loop (D2/D6 / req 7 self-heal). A drifted/stale snapshot degrades to recompute.
|
||||
- **Final-gate prune/drop resolves sources through the canonical
|
||||
`resolveSlSourceFile` resolver**, not a derived `semantic-layer/<conn>/<name>.yaml`
|
||||
path, so it works for uppercase / hash-derived source filenames (not only
|
||||
lowercase demo names).
|
||||
- **`executeWorkUnit` defers pruneable cross-artifact findings** (missing join
|
||||
target / wiki ref / sl_ref) to the final gate instead of soft-failing the WU;
|
||||
only intrinsic `source_validation` failures remain fatal at the WU level. This
|
||||
is what lets a sibling-failed WU's owner survive to be pruned rather than be
|
||||
excluded upstream (reqs 5–7, "no cascade").
|
||||
- The raw report record keeps `status: 'completed'`; partial completion is derived
|
||||
by `ingestReportOutcome` from the populated prune/drop fields.
|
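
The last deviation above, where the stored record stays `completed` and the partial outcome is derived, might look like this. It is a sketch only: the two `finalGate…` field names come from the notes, while the record shape, the `failed` status, and the name `deriveOutcome` are assumptions rather than the body of `ingestReportOutcome`.

```typescript
// Hypothetical record shape; only the two finalGate* names come from the notes.
interface IngestReport {
  status: "completed" | "failed"; // "failed" is an assumed second status
  finalGatePrunedReferences: unknown[];
  finalGateDroppedSources: unknown[];
}

type Outcome = "completed" | "partial" | "failed";

// Derive the user-facing outcome without changing the stored status.
function deriveOutcome(report: IngestReport): Outcome {
  if (report.status !== "completed") return "failed";
  const gateTouchedOutput =
    report.finalGatePrunedReferences.length > 0 ||
    report.finalGateDroppedSources.length > 0;
  return gateTouchedOutput ? "partial" : "completed";
}
```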
||||
|
|
@ -1,66 +0,0 @@
|
|||
# Multi-connection routing guidance in the ktx-analytics skill
|
||||
|
||||
## Problem
|
||||
|
||||
The agent-facing `ktx-analytics` skill (installed into agent environments via
|
||||
the ktx skills/install mechanism, see `.ktx/agents/install-manifest.json` in
|
||||
projects) describes the query workflow — wiki_search → sl_read_source →
|
||||
sl_query / sql_execution — but assumes the connection is obvious. In a
|
||||
multi-connection project nothing tells the agent to *first decide which
|
||||
connection the question is about*, and several tools silently require it:
|
||||
|
||||
- `sql_execution`, `sl_read_source`, `entity_details`: `connectionId`
|
||||
**required**;
|
||||
- `sl_query`, `discover_data`, `dictionary_search`: optional, but
|
||||
auto-inference only works with exactly one connection
|
||||
(`local-query.ts` `resolveLocalConnectionId` ~29-38 — throws with zero or
|
||||
multiple connections).
|
||||
|
||||
An agent that skips routing either errors out or, worse, queries the wrong
|
||||
database when names overlap.
|
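
The inference rule described above (explicit id wins, auto-inference only with exactly one connection) can be illustrated with a small sketch. This is a hypothetical re-creation of the described behaviour, not the body of `resolveLocalConnectionId`; the function name and error messages are assumptions.

```typescript
// Hypothetical re-creation of the described behaviour, not ktx code.
function resolveConnectionId(available: string[], requested?: string): string {
  if (requested !== undefined) {
    if (available.indexOf(requested) === -1) {
      throw new Error(`Unknown connection "${requested}"`);
    }
    return requested;
  }
  // Auto-inference is only safe when there is exactly one candidate.
  if (available.length === 1) return available[0];
  throw new Error(
    available.length === 0
      ? "No connections configured"
      : `connectionId is required: ${available.length} connections exist (${available.join(", ")})`,
  );
}
```

With one connection the call succeeds without an id, which is why single-connection projects never notice the gap; with two or more, every call that omits the id fails.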
||||
|
||||
## Generic use case
|
||||
|
||||
Any ktx project with more than one connection — the common shape for a data
|
||||
org (warehouse + product DB + events DB). Routing is the first step of every
|
||||
question, and the skill should encode it so individual agents don't have to
|
||||
rediscover it.
|
||||
|
||||
## Requirements
|
||||
|
||||
1. **Add an explicit routing step (step 0) to the skill's workflow:**
|
||||
- Call `connection_list` to see what exists.
|
||||
- Match the question's domain to a connection using connection ids/names,
|
||||
`discover_data` hits, and wiki context — not guesswork.
|
||||
- If genuinely ambiguous after discovery, ask the user rather than pick.
|
||||
2. **Thread the resolved `connectionId` everywhere:** all subsequent
|
||||
`sl_query`, `sql_execution`, `sl_read_source`, `entity_details`,
|
||||
`dictionary_search`, `discover_data` calls, and `wiki_search` once spec 01
|
||||
lands (search scoped to the resolved connection plus unscoped pages).
|
||||
3. **Single-connection projects stay frictionless:** the skill should say
|
||||
routing is trivial when `connection_list` returns one entry — don't add a
|
||||
mandatory ceremony step for the common simple case.
|
||||
4. **Capture routing knowledge:** when the agent learns a non-obvious
|
||||
question-domain → connection mapping, the skill should encourage
|
||||
`memory_ingest` so the mapping becomes wiki knowledge for next time.
|
||||
|
||||
This is a docs/prompt change in the skill content (plus any skill-install
|
||||
plumbing if the skill is versioned); no engine changes required.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- In a fixture project with ≥2 connections, an agent following the skill
|
||||
resolves the correct connection before its first data query, and no tool
|
||||
call fails with "connectionId is required".
|
||||
- In a single-connection project the skill-driven flow is unchanged (no
|
||||
extra mandatory steps).
|
||||
- Skill text nowhere assumes a default/implicit connection.
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
Spider 2.0-Lite local subset = 30 SQLite connections in one project; every
|
||||
one of the 135 questions targets exactly one of them. Connection ids are set
|
||||
to the benchmark's database names, so with this skill guidance routing is
|
||||
mechanical (`connection_list` + name match) and needs no benchmark-specific
|
||||
instructions — which is the point: the harness gives the agent only the
|
||||
question text.
|
||||
|
|
@ -1,51 +0,0 @@
|
|||
# Offline schema-documentation ingest adapter
|
||||
|
||||
> **Priority: LOW / backlog.** Explicitly **not** needed for the Spider
|
||||
> 2.0-Lite benchmark — we verified the benchmark's offline schema files
|
||||
> (DDL dumps + sample-row JSONs) are a strict subset of what the live SQLite
|
||||
> scan already captures (DDL, types, PKs, sample values, cardinality
|
||||
> profiling). Implement specs 01-03 first; pick this up only if a real
|
||||
> use case shows up.
|
||||
|
||||
## Problem
|
||||
|
||||
The ingest pipeline's schema knowledge comes from live database scans
|
||||
(`live-database` adapter) or BI-tool adapters (metabase, looker, dbt…).
|
||||
There is no adapter for **offline schema documentation**: files describing
|
||||
tables/columns that exist outside the database — column-description
|
||||
spreadsheets, data dictionaries, DDL exports with comments, hand-maintained
|
||||
schema docs.
|
||||
|
||||
## Generic use case
|
||||
|
||||
Teams whose richest schema documentation lives outside `information_schema`:
|
||||
a wiki export of column meanings, a governance tool's CSV data dictionary,
|
||||
DDL files with COMMENT clauses the production scan can't see, or
|
||||
environments where ktx has no live access at all and must build the semantic
|
||||
layer from documentation alone.
|
||||
|
||||
## Requirements (sketch — refine when picked up)
|
||||
|
||||
1. A new ingest adapter (peer of `metabase`/`dbt` in
|
||||
`context/ingest/adapters/`) consuming a configured local path of schema
|
||||
docs per connection.
|
||||
2. Input formats to start: DDL files (`.sql`/`.csv` of CREATE statements)
|
||||
and tabular column dictionaries (CSV/JSON: table, column, description,
|
||||
…). Extensible to other formats.
|
||||
3. Output: **enrichment, not duplication** — merge descriptions/metadata
|
||||
into the manifest-backed semantic-layer sources and dictionary for the
|
||||
matching connection. Where a live scan exists, offline docs fill gaps
|
||||
(descriptions, enum meanings, deprecation notes) and flag drift
|
||||
(documented column missing from live schema and vice versa) rather than
|
||||
creating parallel wiki pages that duplicate schema info.
|
||||
4. Works without live database access (documentation-only bootstrap of a
|
||||
connection's semantic layer), clearly marked as unverified-against-live.
|
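
Requirement 3 (fill gaps, flag drift in both directions) could be sketched as below. Everything here is hypothetical, since the adapter does not exist yet: the shapes, the function name, and the case-insensitive `table.column` matching are assumptions chosen for illustration.

```typescript
// Hypothetical shapes for illustration; the adapter does not exist yet.
interface LiveColumn { table: string; column: string; description?: string }
interface DocColumn { table: string; column: string; description: string }
interface MergeResult {
  columns: LiveColumn[];
  documentedButMissingLive: string[];
  liveButUndocumented: string[];
}

// Assumed matching rule: case-insensitive "table.column".
const columnKey = (c: { table: string; column: string }): string =>
  `${c.table}.${c.column}`.toLowerCase();

function mergeOfflineDocs(live: LiveColumn[], docs: DocColumn[]): MergeResult {
  const docByKey = new Map<string, DocColumn>();
  docs.forEach((d) => docByKey.set(columnKey(d), d));
  const liveKeys = new Set(live.map((c) => columnKey(c)));

  const columns = live.map((c) => {
    const doc = docByKey.get(columnKey(c));
    // Fill gaps only: an existing live description is never overwritten.
    return c.description || !doc ? c : { ...c, description: doc.description };
  });

  return {
    columns,
    documentedButMissingLive: docs
      .map((d) => columnKey(d))
      .filter((k) => !liveKeys.has(k)),
    liveButUndocumented: live
      .map((c) => columnKey(c))
      .filter((k) => !docByKey.has(k)),
  };
}
```

The two drift lists are kept separate because they call for different follow-ups: a documented column missing from the live schema suggests stale docs, while an undocumented live column is a documentation gap.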
||||
|
||||
## Acceptance criteria (sketch)
|
||||
|
||||
- Given a connection with a live scan plus an offline column dictionary,
|
||||
semantic-layer sources carry the documented descriptions, and drift
|
||||
between doc and live schema is reported.
|
||||
- Given a connection with docs only (no live access), `sl list`/`sl read`
|
||||
expose manifest sources built from the docs.
|
||||
- No wiki pages are created that merely restate table/column lists.
|
||||
|
|
@ -1,59 +0,0 @@
|
|||
# Composite-key (multi-column) join detection
|
||||
|
||||
> Priority: MEDIUM. Found empirically during the first Spider2-lite sqlite
|
||||
> smoke test (2026-06-13): relationship detection emitted **zero joins** for a
|
||||
> database whose fact tables are linked only by composite keys. Agents still
|
||||
> answered correctly by inferring the join from shared `grain`, so this didn't
|
||||
> cost benchmark points — but it forces inference that explicit joins would
|
||||
> remove, and the gap is generic.
|
||||
|
||||
## Problem
|
||||
|
||||
Relationship detection appears to emit only single-column joins. For the IPL
|
||||
sqlite database, every table came back with `joins=0`, even though its fact
|
||||
tables are connected by a 4-column composite key
|
||||
(`match_id, over_id, ball_id, innings_no`) shared across `ball_by_ball`,
|
||||
`batsman_scored`, `extra_runs`, and `wicket_taken`. The semantic layer did
|
||||
correctly record that shared key as each table's `grain`, which is why agents
|
||||
could recover the relationship — but no `joins:` entries were produced for the
|
||||
fact-to-fact links.
|
||||
|
||||
## Generic use case
|
||||
|
||||
Event/fact tables keyed by composite business keys are common: ledger lines
|
||||
(`account_id, period, line_no`), telemetry (`device_id, ts, metric`), sports
|
||||
ball-by-ball, EAV/log schemas. Whenever there are no single-column FKs but a
|
||||
multi-column key recurs across tables, ktx should detect and document the join
|
||||
so agents (and `sl_query`) don't have to infer it.
|
||||
|
||||
## Requirements
|
||||
|
||||
1. Relationship detection considers **multi-column** join candidates, not just
|
||||
single-column ones. A strong signal already exists in ktx: when two tables
|
||||
share an identical (or subset/superset) declared `grain`, that grain is a
|
||||
prime composite-join candidate.
|
||||
2. Emitted joins carry the full composite condition, e.g.
|
||||
`on: a.match_id = b.match_id AND a.over_id = b.over_id AND a.ball_id = b.ball_id AND a.innings_no = b.innings_no`,
|
||||
with a sensible `relationship` cardinality.
|
||||
3. The existing validation/threshold machinery
|
||||
(`scan.relationships.acceptThreshold` etc.) applies to composite candidates
|
||||
too; profile-based validation should check join selectivity on the full key.
|
||||
4. No regression for single-column joins; don't explode combinatorially —
|
||||
bound candidate generation (e.g. only consider shared-grain keys and
|
||||
declared/!inferred PK overlaps, cap column count).
|
||||
5. `sl_query` can compile a join across a composite-key relationship.
|
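
Requirements 1, 2, and 4 can be illustrated with a shared-grain candidate generator. This is a sketch under assumptions, not the ktx detector: the `TableGrain` shape, the function name, and the column cap of 6 are hypothetical, and the threshold/profile validation from requirement 3 is left out.

```typescript
// Hypothetical shapes; `grain` mirrors the per-table grain the spec mentions.
interface TableGrain { name: string; grain: string[] }
interface CompositeJoin { left: string; right: string; columns: string[]; on: string }

const MAX_KEY_COLUMNS = 6; // assumed cap to bound candidate generation

function detectSharedGrainJoins(tables: TableGrain[]): CompositeJoin[] {
  const joins: CompositeJoin[] = [];
  for (let i = 0; i < tables.length; i++) {
    for (let j = i + 1; j < tables.length; j++) {
      const a = tables[i];
      const b = tables[j];
      const small = a.grain.length <= b.grain.length ? a.grain : b.grain;
      const large = small === a.grain ? b.grain : a.grain;
      // Candidate only when one grain is identical to, or a subset of, the other.
      const contained = small.every((c) => large.indexOf(c) !== -1);
      // Single-column keys are left to the existing single-column detection.
      if (!contained || small.length < 2 || small.length > MAX_KEY_COLUMNS) continue;
      const on = small.map((c) => `${a.name}.${c} = ${b.name}.${c}`).join(" AND ");
      joins.push({ left: a.name, right: b.name, columns: small, on });
    }
  }
  return joins;
}
```

Restricting candidates to grain containment keeps generation at one comparison per table pair instead of enumerating column subsets, which is the combinatorial explosion requirement 4 warns about.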
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- For a fixture with two tables sharing a 3- or 4-column grain and no
|
||||
single-column FK, ingest emits a composite join between them with the full
|
||||
multi-column `on` condition.
|
||||
- `sl read <source>` shows the composite join; `sl_query` can traverse it.
|
||||
- Single-column join detection is unchanged on existing fixtures.
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
IPL (and similar ball-by-ball/event schemas in the Spider2-lite local set)
|
||||
have no single-column FKs; their joins are entirely composite. Explicit
|
||||
composite joins would let the agent rely on documented relationships instead
|
||||
of inferring them from grain.
|
||||
|
|
@ -1,89 +0,0 @@
|
|||
# Canonical / authoritative-source measures in the semantic layer
|
||||
|
||||
## Problem
|
||||
|
||||
Many schemas contain an **authoritative table** that already encodes a metric's
|
||||
business rules — an official standings/leaderboard table, a general-ledger or
|
||||
period-end balance table, a materialized summary/snapshot — alongside the **raw
|
||||
transactional** rows the metric *could* be re-derived from. Re-deriving the metric
|
||||
from the raw rows frequently diverges from the canonical definition, because the
|
||||
authoritative table bakes in rules the raw data doesn't expose (drop-scores,
|
||||
penalties, adjustments, reconciliations, as-of snapshots).
|
||||
|
||||
Today ktx's semantic layer doesn't distinguish "authoritative summary" tables from
|
||||
raw fact tables, so the analytics skill has no signal that one source is canonical
|
||||
for a metric — and the agent often re-derives from raw rows and gets a defensible-
|
||||
but-different number.
|
||||
|
||||
## Generic use case (independent of any benchmark)
|
||||
|
||||
- "Championship points per competitor this season" — a sports schema may hold both
|
||||
raw per-event results AND an official standings table that applies drop-scores
|
||||
and penalties. The standings table is the canonical source; summing raw results
|
||||
is wrong.
|
||||
- "Account balance as of month end" — prefer a ledger/balance-snapshot table over
|
||||
re-summing every transaction (which may miss adjustments).
|
||||
- "Monthly recognized revenue" — prefer a finance summary table over re-deriving
|
||||
from line items.
|
||||
|
||||
In each case a real analyst should be steered to the authoritative source.
|
||||
|
||||
## Requirements
|
||||
|
||||
1. **Detect candidate authoritative tables during ingest.** Heuristics only —
|
||||
e.g. tables whose name/role suggests a summary (`*standings*`, `*balance*`,
|
||||
`*summary*`, `*snapshot*`, `*ledger*`), tables that are a coarser-grained
|
||||
aggregation of another table, or tables documented as authoritative in provided
|
||||
docs/wiki. Surface them as such in the semantic layer.
|
||||
|
||||
2. **Represent the metric as an SL measure backed by the authoritative table.**
|
||||
Where a canonical source exists, define the measure over it so a query for that
|
||||
metric resolves to the authoritative source by default. (The analytics skill
|
||||
already prefers SL measures over raw SQL — spec 07/skill rule — so this plugs
|
||||
into existing behavior.)
|
||||
|
||||
3. **Keep raw re-derivation available** as a non-default alternative; the measure
|
||||
documents which source it uses and why, so the choice is transparent and
|
||||
overridable.
|
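
The name and documentation signals from requirement 1 could be combined roughly as follows. The patterns are taken from the requirement text; the shapes, the function name, and the `documentedAuthoritative` flag are assumptions, the table names in any usage must stay synthetic, and the "coarser-grained aggregation" signal is not modelled in this sketch.

```typescript
// Hypothetical sketch; only the name patterns come from the requirement text.
interface TableInfo { name: string; documentedAuthoritative?: boolean }
interface Candidate { table: string; signals: string[] }

const SUMMARY_NAME = /(standings|balance|summary|snapshot|ledger)/i;

// Flag candidates and record which signals fired, so the choice stays
// transparent and overridable.
function flagAuthoritativeCandidates(tables: TableInfo[]): Candidate[] {
  const out: Candidate[] = [];
  for (const t of tables) {
    const signals: string[] = [];
    if (SUMMARY_NAME.test(t.name)) signals.push("name");
    if (t.documentedAuthoritative) signals.push("docs");
    if (signals.length > 0) out.push({ table: t.name, signals });
  }
  return out;
}
```

Recording the signals rather than a bare boolean gives the measure something to cite when it documents which source it uses and why.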
||||
|
||||
## Fairness boundary (HARD — this spec is fairness-sensitive)
|
||||
|
||||
The choice of authoritative source MUST be driven by **schema/structure or provided
|
||||
documentation** — the table exists, is structured as a summary, or is documented as
|
||||
authoritative. It must **NEVER** be driven by observing which interpretation matches
|
||||
a benchmark gold answer. Concretely:
|
||||
|
||||
- ✅ Fair: "a table named/structured as official standings exists and aggregates the
|
||||
raw results → treat it as the canonical points source."
|
||||
- ❌ Forbidden: "for question X, use table T because that's what reproduces the gold
|
||||
result." That is per-instance gold-tuning (cheating) and must not appear in ktx,
|
||||
the ingest heuristics, or any mapping.
|
||||
|
||||
If a metric is genuinely underspecified and only the gold answer disambiguates the
|
||||
intended source, it is **not fairly fixable** — leave it. Whether this feature helps
|
||||
any specific benchmark instance is therefore *conditional* on a real schema/doc basis
|
||||
existing; do not manufacture one.
|
||||
|
||||
## Leak-safety (hard constraint)
|
||||
|
||||
No benchmark table names, queries, gold values, or instance-specific mappings
|
||||
anywhere in the spec, the heuristics, or tests. Examples must be synthetic/generic.
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
- Ingest can flag candidate authoritative/summary tables via generic heuristics
|
||||
(name/role/aggregation/doc signals), with no benchmark-specific rules.
|
||||
- The semantic layer can express a measure as backed by a designated authoritative
|
||||
source; the skill resolves the metric to it by default; raw re-derivation remains
|
||||
available and the choice is documented.
|
||||
- Tests use synthetic schemas only; no gold-derived mappings exist anywhere.
|
||||
|
||||
## Benchmark context (motivation only)
|
||||
|
||||
Some SQLite-subset metric questions are underspecified between a raw-derivation and
|
||||
an authoritative-table interpretation (e.g. season points from raw results vs an
|
||||
official standings table). This is the roadmap's "canonical semantic-layer measures
|
||||
from schema + provided docs" item. It is fair ONLY where schema/docs support one
|
||||
source; the gold-only cases are explicitly out of scope (fixing them would require
|
||||
tuning to gold). Larger than the spec 09–12 skill-content tweaks: this touches
|
||||
ingest + the semantic-layer model.
|
||||
|
|
@ -1,57 +0,0 @@
|
|||
# 17 — Lifecycle-event metrics in the semantic layer
|
||||
|
||||
**Status:** draft (intake). Requirement-level; the implementer refines into `specs/17-*.md`.
|
||||
|
||||
## Problem / requirement
|
||||
|
||||
Many entities carry **several lifecycle timestamps** for the same record — an order has
|
||||
`placed/purchased`, `approved`, `shipped/carrier-handoff`, `delivered`, and `estimated-delivery`
|
||||
times; a ticket has `opened`, `assigned`, `resolved`, `closed`; a payment has `initiated`,
|
||||
`authorized`, `settled`. When an analyst asks for a count/volume/rate of records **in a named
|
||||
completed state, by period** ("delivered orders by month", "resolved tickets per week", "settled
|
||||
payments by day"), the correct time anchor is the timestamp of *that named event*, not the
|
||||
record-creation timestamp.
|
||||
|
||||
Today ktx ingests these timestamps as **peer date dimensions** with good column descriptions, but it
|
||||
does **not model the lifecycle event itself** — so nothing in the semantic layer tells a solver (or a
|
||||
human) that "delivered orders over time" should be anchored to the delivery timestamp. The choice is
|
||||
left to per-query reasoning, which is exactly where it goes wrong. (A companion analytics-skill rule
|
||||
now nudges the *solver* — ktx commit `226341cf` — but the durable, reusable home for this is the
|
||||
**model**, so any consumer of the semantic layer gets it for free.)
|
||||
|
||||
**Requirement:** during enrichment/ingestion, when a source has a state/status column plus one or more
|
||||
lifecycle timestamps whose names/descriptions map to that state's values, infer **lifecycle-event
|
||||
metrics** — e.g. a `delivered_orders` metric defined as `COUNT(*)` filtered to the delivered state with
|
||||
its **default time dimension** set to the matching event timestamp (`order_delivered_customer_date`),
|
||||
distinct from the creation-anchored `orders` metric. Keep the inference conservative and
|
||||
source-traceable (column names + enriched descriptions only); never invent a state/timestamp pairing
|
||||
that the schema/descriptions don't independently support.
|
||||
|
||||
## Sketch (implementer to refine)
|
||||
|
||||
- Detect (state column, lifecycle-timestamp) pairs from column names + enrichment descriptions
|
||||
(e.g. status value `delivered` ↔ `*_delivered_*_date`; `resolved` ↔ `resolved_at`).
|
||||
- Emit a metric per detected completed state: filter = the state predicate, grain = record,
|
||||
`defaultTimeDimension` = the matching event timestamp.
|
||||
- Surface these via `discover_data` / `entity_details` so "delivered orders over time" retrieves the
|
||||
delivery-anchored metric rather than a bare row count over the creation date.
|
||||
- Gate behind the existing `enrichment.mode: llm` path; respect the conservative-inference bar
|
||||
(precision over recall — a wrong pairing is worse than none).
|
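
The first two bullets of the sketch above (pair detection, then one metric per detected state) might look like this when only column names are used. All shapes and names are hypothetical, matching on enrichment descriptions is omitted, and `defaultTimeDimension` is the only field name taken from the spec.

```typescript
// Hypothetical shapes; only example column/state names come from the spec.
interface SourceColumns { name: string; statusValues: string[]; timestampColumns: string[] }
interface LifecycleMetric { name: string; filter: string; defaultTimeDimension: string }

function inferLifecycleMetrics(source: SourceColumns, statusColumn: string): LifecycleMetric[] {
  const metrics: LifecycleMetric[] = [];
  for (const state of source.statusValues) {
    const token = state.toLowerCase();
    // Match whole underscore-delimited tokens, so "delivered" does not
    // match a column such as "undelivered_at".
    const matches = source.timestampColumns.filter(
      (col) => col.toLowerCase().split("_").indexOf(token) !== -1,
    );
    // Precision over recall: skip states with zero or several candidates.
    if (matches.length !== 1) continue;
    metrics.push({
      name: `${token}_${source.name}`,
      filter: `${statusColumn} = '${state.replace(/'/g, "''")}'`,
      defaultTimeDimension: matches[0],
    });
  }
  return metrics;
}
```

Under name-only matching, a source with both a carrier-delivery and a customer-delivery timestamp yields two candidates for `delivered` and is therefore skipped; resolving that case is where the enriched descriptions would have to come in.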
||||
|
||||
## Generic use case (independent of the benchmark)
|
||||
|
||||
Any operational/transactional schema (e-commerce orders, support tickets, payments, claims, shipments)
|
||||
has this multi-timestamp lifecycle shape. An analyst asking "how many X were <completed-state> last
|
||||
month" almost always means *entered that state* last month. Encoding the event→timestamp mapping in the
|
||||
model makes every downstream question (BI tool, ad-hoc SQL, an LLM agent) pick the right anchor without
|
||||
re-deriving it, and prevents the silent "grouped by when they started" error.
|
||||
|
||||
## Benchmark context (motivation only — not a benchmark-specific rule)
|
||||
|
||||
Surfaced by the `spider2-autofix` loop, round r1: Spider 2.0-Lite `Brazilian_E_Commerce` cases local028
|
||||
("delivered orders for each month") and local031 ("highest monthly delivered orders volume") both failed
|
||||
because the solver bucketed delivered orders by `order_purchase_timestamp` instead of
|
||||
`order_delivered_customer_date`. The trace showed the solver had both columns and even compared both
|
||||
date bases for local031 before choosing purchase. A skill-text rule flipped both cases this round; this
|
||||
spec is the **model-layer** form of the same fix, which would make the right anchor the default for any
|
||||
solver and any lifecycle schema.
|
||||